Journal of Applied Sciences ›› 2023, Vol. 41 ›› Issue (5): 815-830. doi: 10.3969/j.issn.0255-8297.2023.05.008

• Signal and Information Processing •

  • Corresponding author: ZHANG Tianqi, Ph.D., professor. Research interests: modulation and demodulation of communication signals, blind signal processing, image and speech signal processing, neural network implementation, and FPGA/VLSI implementation. E-mail: zhangtq@cqupt.edu.cn
  • Funding: National Natural Science Foundation of China (No. 61671095, No. 61702065, No. 61701067, No. 61771085); Natural Science Foundation of Chongqing (No. cstc2021jcyj-msxmX0836)

Singing Voice Separation Method of Unet Based on Squeeze-and-Excitation Residual Group Dilated Convolution and Dense Linear Gate

ZHANG Tianqi, XIONG Tian, WU Chao, WEN Bin   

  1. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Received:2021-09-29 Published:2023-09-28

Abstract: To address the difficulty of capturing temporal speech information and the low utilization of low-level features in the Unet frequency-domain singing voice separation model, a convolutional neural network is designed that has fewer parameters and better separation performance than the baseline Unet. First, a residual group dilated convolution module combined with squeeze-and-excitation is introduced into the encoding and decoding stages. While reducing the number of parameters and enlarging the receptive field of the network, the module adaptively learns the importance of different channel features, enhancing useful features and suppressing irrelevant ones. Second, in the transmission layer, gated linear units are connected by dense addition to strengthen the acquisition of temporal features during feature transmission, and dilated convolution replaces ordinary convolution to further expand the receptive field. Finally, an attention gating mechanism replaces the skip connections of the baseline Unet to improve the utilization of low-level features. Experiments on the Ccmixter and MUSDB18 datasets show that, compared with the baseline network, the proposed approach improves singing voice separation performance with only about one-fifth of the parameters.
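The channel reweighting performed by the squeeze-and-excitation module can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `squeeze_excitation`, the fully connected weights `w1`/`w2`, the channel count C=8, and the reduction ratio r=4 are all illustrative assumptions; the paper combines this mechanism with residual group dilated convolutions inside a Unet.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excitation(feature_map, w1, w2):
    """Reweight the channels of a (C, H, W) feature map.

    Squeeze: global average pooling reduces each channel to one scalar.
    Excitation: a two-layer bottleneck (ReLU, then sigmoid) maps those
    scalars to per-channel weights in (0, 1) that rescale the input,
    amplifying informative channels and damping irrelevant ones.
    """
    squeezed = feature_map.mean(axis=(1, 2))      # (C,) global average pool
    hidden = np.maximum(0.0, w1 @ squeezed)       # (C/r,) ReLU bottleneck
    weights = sigmoid(w2 @ hidden)                # (C,) channel weights in (0, 1)
    return feature_map * weights[:, None, None]   # broadcast channel-wise scaling

# Toy usage: C = 8 channels, reduction ratio r = 4 (assumed values)
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
w1 = 0.1 * rng.standard_normal((2, 8))  # C -> C/r
w2 = 0.1 * rng.standard_normal((8, 2))  # C/r -> C
y = squeeze_excitation(x, w1, w2)
```

Because every channel weight lies strictly between 0 and 1, the output is an attenuated copy of the input whose per-channel scale reflects learned channel importance.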

Key words: singing voice separation, group dilated convolution, gated linear units, attention gating
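The receptive-field gain that motivates replacing ordinary convolution with dilated convolution can be checked with a standard back-of-envelope calculation; the kernel size and the exponential dilation schedule below are illustrative assumptions, not taken from the paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of stride-1 dilated convolutions.

    Starting from a single input sample, each layer with kernel k and
    dilation d enlarges the receptive field by (k - 1) * d.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Three stacked 3-tap layers: ordinary convolution vs. exponentially dilated
print(receptive_field(3, [1, 1, 1]))  # ordinary: 7
print(receptive_field(3, [1, 2, 4]))  # dilated: 15
```

With the same number of parameters, the dilated stack covers more than twice the context, which is why dilation widens the receptive field "for free" in the transmission layer.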

CLC number: