To better capture temporal information and exploit low-level features in the U-Net frequency-domain singing voice separation model, this paper proposes a convolutional neural network with fewer parameters and better separation performance. First, a residual group dilated convolution combined with a squeeze-and-excitation (SE) module is incorporated into the encoding and decoding stages; while reducing the number of parameters and enlarging the network's receptive field, it adaptively learns the importance of different channel features, enhancing useful features and suppressing irrelevant ones. Second, in the transmission layer, gated linear units are densely connected by addition to strengthen the capture of temporal features during feature transmission, and dilated convolution replaces ordinary convolution to further enlarge the receptive field. Finally, an attention gating mechanism replaces the skip connections of the baseline U-Net to improve the utilization of low-level features. Experiments on the ccMixter and MUSDB18 datasets show that, compared with the baseline network, the proposed approach improves voice separation performance with only about one fifth of the parameters.
ZHANG Tianqi, XIONG Tian, WU Chao, WEN Bin. Singing Voice Separation Method of Unet Based on Squeeze-and-Excitation Residual Group Dilated Convolution and Dense Linear Gate[J]. Journal of Applied Sciences, 2023, 41(5): 815-830.
DOI: 10.3969/j.issn.0255-8297.2023.05.008
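The paper itself contains no code. The following is a minimal PyTorch sketch of how a squeeze-and-excitation residual group dilated convolution block, as described in the abstract, could be wired; the class names, channel counts, reduction ratio, and group/dilation settings are illustrative assumptions, not the authors' implementation (note that `channels` must be divisible by `groups`):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight channels using globally pooled statistics."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # per-channel gates in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # excitation: rescale each channel

class SEResGroupDilatedBlock(nn.Module):
    """Residual block of grouped dilated 3x3 convolutions (fewer parameters,
    larger receptive field), followed by SE channel reweighting."""
    def __init__(self, channels, groups=4, dilation=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=dilation,
                      dilation=dilation, groups=groups, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=dilation,
                      dilation=dilation, groups=groups, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.se = SEBlock(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.se(self.body(x)))  # residual connection

# Example: a (batch, channels, frequency, time) spectrogram feature map
feat = torch.randn(2, 32, 128, 64)
print(SEResGroupDilatedBlock(32)(feat).shape)  # torch.Size([2, 32, 128, 64])
```

Grouped convolution is what cuts the parameter count here, while the dilation widens the receptive field without extra layers; the SE branch then supplies the adaptive channel weighting the abstract describes.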
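For the transmission layer, one plausible reading of "gated linear units connected by dense addition" is a stack of dilated-convolution GLUs in which each layer consumes the running sum of the block input and all earlier outputs. The sketch below follows that reading; the dilation schedule is an illustrative assumption:

```python
import torch
import torch.nn as nn

class DilatedGLU(nn.Module):
    """Gated linear unit built from a dilated conv: one half of the conv
    output multiplicatively gates the other half."""
    def __init__(self, channels, dilation):
        super().__init__()
        # Produce 2*channels, then split into value and gate halves.
        self.conv = nn.Conv2d(channels, 2 * channels, 3,
                              padding=dilation, dilation=dilation)

    def forward(self, x):
        v, g = self.conv(x).chunk(2, dim=1)
        return v * torch.sigmoid(g)

class DenseGLUTransmission(nn.Module):
    """Stacked dilated GLUs with dense additive connections: layer i sees
    the sum of the block input and every earlier layer's output."""
    def __init__(self, channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.layers = nn.ModuleList(DilatedGLU(channels, d) for d in dilations)

    def forward(self, x):
        acc = x
        for layer in self.layers:
            acc = acc + layer(acc)  # add each GLU output back into the stream
        return acc
```

Additive dense connections keep the channel count constant across the stack (unlike concatenation-style dense links), which fits the paper's emphasis on a small parameter budget.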
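Finally, the attention gate that replaces the plain skip connections could take the form of the additive gate popularized by Attention U-Net; this simplified sketch assumes the decoder gating signal has already been upsampled to the encoder feature's spatial size:

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate (in the spirit of Attention U-Net): the decoder
    signal g decides which locations of the encoder feature x pass the skip path."""
    def __init__(self, x_channels, g_channels, inter_channels):
        super().__init__()
        self.wx = nn.Conv2d(x_channels, inter_channels, 1, bias=False)
        self.wg = nn.Conv2d(g_channels, inter_channels, 1, bias=False)
        self.psi = nn.Conv2d(inter_channels, 1, 1)

    def forward(self, x, g):
        # x: encoder (skip) feature; g: decoder gating signal, assumed here to
        # already match x's spatial size.
        a = torch.sigmoid(self.psi(torch.relu(self.wx(x) + self.wg(g))))
        return x * a  # attenuate irrelevant regions before concatenation
```

The gated feature would then be concatenated with the upsampled decoder feature as in a standard U-Net decoder stage, letting the network learn which low-level details are worth passing forward.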