Journal of Applied Sciences ›› 2023, Vol. 41 ›› Issue (4): 669-681. doi: 10.3969/j.issn.0255-8297.2023.04.011

• Signal and Information Processing •

Environmental Sound Classification Method Based on Color Channel Feature Fusion

DONG Shaojiang1, XIA Zhengfu1, FANG Nengwei2, XING Bin2, HU Xiaolin2

  1. School of Mechatronics and Vehicle Engineering, Chongqing Jiaotong University, Chongqing 400074, China;
  2. Chongqing Industrial Big Data Innovation Center Co. Ltd., Chongqing 400707, China
  • Received: 2021-09-24 Published: 2023-08-02
  • Corresponding author: DONG Shaojiang, professor and doctoral supervisor; research interest: mechatronics. E-mail: dongshaojiang100@163.com
  • Funding:
    Supported by the National Natural Science Foundation of China (No. 51775072); the Civil Aerospace Project "XXXX" (No. JW20*26012); the Chongqing Science and Technology Innovation Leading Talent Support Program (No. CSTCCCXLJRC201920); the Chongqing University Innovation Research Group Project (No. CXQT20019); and the Technology Innovation and Application Demonstration Project of the Beibei District Science and Technology Bureau, Chongqing (No. 2020-5)

Abstract: Traditional neural networks extract only weak features from complex environmental sounds, which leads to low classification accuracy. To address this problem, an environmental sound classification method based on color channel feature fusion is proposed. First, three acoustic features are extracted from the raw audio data: the log-Mel spectrogram (LMS), Mel-scale frequency cepstral coefficients (MFCC), and the energy spectrum (ES). Second, the three features are assigned to the R, G, and B color channels, respectively, and fused into a single spectrogram image that carries more feature information and represents the environmental sound more comprehensively. Third, to avoid the poor generalization that results from limited training data, the pre-trained VGG-16 model is trained by fine-tuning. Finally, the proposed method is validated on two widely used environmental sound classification datasets and on audio collected in real scenarios, and its accuracy is compared with that of other models. The results show that the proposed method reaches accuracies of 88.2% and 65.2% on the ESC-10 and ESC-50 datasets, respectively, and that it also improves the classification of audio collected in real scenarios.

Key words: RGB color channel, feature fusion, fine-tuning training, environmental sound classification, pre-trained model
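
The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the three features have already been computed as 2-D matrices (rows are frequency or coefficient bins, columns are time frames). The min-max scaling to 0-255 and the nearest-neighbour resize to a common shape are assumptions, since the abstract does not state how the features are aligned. All function names are hypothetical.

```python
# Sketch only: the inputs stand in for precomputed LMS, MFCC and ES matrices.
# Scaling and resizing choices below are assumptions, not taken from the paper.

def normalize_to_uint8(feature):
    """Min-max scale a 2-D list of floats to integers in [0, 255]."""
    flat = [v for row in feature for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant feature carries no contrast; map it to 0.
        return [[0 for _ in row] for row in feature]
    return [[round(255 * (v - lo) / span) for v in row] for row in feature]


def resize_nearest(feature, height, width):
    """Nearest-neighbour resize so that all channels share one shape."""
    src_h, src_w = len(feature), len(feature[0])
    return [
        [feature[r * src_h // height][c * src_w // width] for c in range(width)]
        for r in range(height)
    ]


def fuse_as_rgb(lms, mfcc, es, height, width):
    """Use LMS, MFCC and ES as the R, G and B channels of one image."""
    channels = [
        resize_nearest(normalize_to_uint8(f), height, width)
        for f in (lms, mfcc, es)
    ]
    # Each pixel is an (R, G, B) tuple built from the three aligned channels.
    return [
        [tuple(ch[r][c] for ch in channels) for c in range(width)]
        for r in range(height)
    ]


# Toy inputs with deliberately different shapes.
lms = [[-80.0, -60.0], [-20.0, 0.0]]   # 2 x 2
mfcc = [[-5.0, 5.0]]                   # 1 x 2
es = [[0.0, 1.0, 2.0, 3.0]]            # 1 x 4
image = fuse_as_rgb(lms, mfcc, es, height=2, width=2)
```

The resulting `image` is a height-by-width grid of RGB pixels that could be passed, after resizing to the expected input size, to an image classifier such as a pre-trained VGG-16. In practice an array library would replace the nested lists.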

CLC number: