Environmental Sound Classification Method Based on Color Channel Feature Fusion

董绍江, 夏蒸富, 方能炜, 邢镔, 胡小林

doi:10.3969/j.issn.0255-8297.2023.04.011

应用科学学报, 2023, Vol. 41, Issue 4: 669-681

Signal and Information Processing

1. School of Mechatronics and Vehicle Engineering, Chongqing Jiaotong University, Chongqing 400074, China;
2. Chongqing Industrial Big Data Innovation Center Co. Ltd., Chongqing 400707, China

Received date: 2021-09-24

Online published: 2023-08-02

Funding

Supported by the National Natural Science Foundation of China (No. 51775072); the Civil Aerospace Project "XXXX" (No. JW20^*26012); the Chongqing Science and Technology Innovation Leading Talent Support Program (No. CSTCCCXLJRC201920); the Chongqing University Innovation Research Group Project (No. CXQT20019); and the Technology Innovation and Application Demonstration Project of the Science and Technology Bureau of Beibei District, Chongqing (No. 2020-5).

Cite this article

董绍江, 夏蒸富, 方能炜, 邢镔, 胡小林. Environmental sound classification method based on color channel feature fusion[J]. 应用科学学报, 2023, 41(4): 669-681. DOI: 10.3969/j.issn.0255-8297.2023.04.011

Abstract

The features that traditional neural networks extract from complex environmental sounds are weak, which leads to low classification accuracy. To address this problem, an environmental sound classification method based on color channel feature fusion is proposed. First, three acoustic features are extracted from the raw audio data: the log-Mel spectrogram (LMS), Mel-scale frequency cepstral coefficients (MFCC), and the energy spectrum (ES). Second, the three features are used as the RGB color channel components, one feature per channel, and fused into a single spectrogram that carries more feature information and represents the environmental sound more comprehensively. Third, to avoid the poor generalization that results from training on a small dataset, the pre-trained VGG-16 model is trained by fine-tuning. Finally, the effectiveness of the proposed method is verified on two widely used environmental sound classification datasets and on audio collected in real scenarios, and its accuracy is compared with that of other models. The results show that the proposed method reaches accuracies of 88.2% and 65.2% on the ESC-10 and ESC-50 datasets, respectively, and that it improves the classification of audio collected in real scenarios.

Key words: RGB color channel; feature fusion; fine-tuning training; environmental sound classification; pre-trained model
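The channel-fusion step described in the abstract can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which feature goes to which channel, how the features are scaled, or how feature maps of different sizes are aligned. The sketch assumes LMS, MFCC and ES map to R, G and B in that order, uses min-max scaling to 8-bit values, aligns shapes by nearest-neighbour resizing, and targets the 224 x 224 input size commonly used with VGG-16. The helper names `to_uint8`, `resize_nearest` and `fuse_rgb` are hypothetical, and random arrays stand in for features that would normally be computed from audio.

```python
import numpy as np


def to_uint8(feature):
    """Min-max scale a 2-D feature map to the 0..255 range of one image channel."""
    feature = np.asarray(feature, dtype=np.float64)
    lo, hi = feature.min(), feature.max()
    if hi == lo:  # constant map: avoid division by zero
        return np.zeros(feature.shape, dtype=np.uint8)
    return np.round((feature - lo) / (hi - lo) * 255).astype(np.uint8)


def resize_nearest(channel, out_shape):
    """Nearest-neighbour resize so that all three channels share one shape."""
    rows = np.arange(out_shape[0]) * channel.shape[0] // out_shape[0]
    cols = np.arange(out_shape[1]) * channel.shape[1] // out_shape[1]
    return channel[np.ix_(rows, cols)]


def fuse_rgb(lms, mfcc, es, out_shape=(224, 224)):
    """Stack three feature maps as the R, G and B channels of one image.

    The channel order (LMS, MFCC, ES) is an assumption of this sketch.
    """
    channels = [resize_nearest(to_uint8(f), out_shape) for f in (lms, mfcc, es)]
    return np.stack(channels, axis=-1)  # shape: (height, width, 3)


# Placeholder features with typical but arbitrary shapes (bins x frames).
rng = np.random.default_rng(0)
lms = rng.normal(size=(128, 216))   # stand-in for a log-Mel spectrogram
mfcc = rng.normal(size=(40, 216))   # stand-in for MFCCs
es = rng.random(size=(128, 216))    # stand-in for an energy spectrum

image = fuse_rgb(lms, mfcc, es)
print(image.shape, image.dtype)
```

The resulting array has the same layout as an ordinary RGB image, which is what allows a network pre-trained on images, such as VGG-16, to be fine-tuned on it without changing its input layer.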
