Singing Voice Separation Method of Unet Based on Squeeze-and-Excitation Residual Group Dilated Convolution and Dense Linear Gate

doi:10.3969/j.issn.0255-8297.2023.05.008

Abstract

To improve the capture of temporal information and make better use of low-level features in the Unet frequency-domain singing voice separation model, this paper proposes a convolutional neural network with fewer parameters and better separation performance. First, a residual group dilated convolution combined with a squeeze-and-excitation module is incorporated into the encoding and decoding stages. This reduces the number of parameters and enlarges the receptive field of the network, while adaptively learning the importance of different channel features so as to enhance useful features and suppress irrelevant ones. Second, in the transmission layer, gated linear units are connected by dense addition to strengthen the acquisition of temporal features during feature transmission, and dilated convolution replaces ordinary convolution to expand the receptive field. Finally, an attention gating mechanism replaces the skip connections of the baseline Unet to improve the use of low-level features. In experiments on the Ccmixter and MUSDB18 datasets, the proposed approach improves singing voice separation performance over the baseline network with only about one-fifth of the parameters.
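The effect of group dilated convolution on parameter count and receptive field can be illustrated with simple arithmetic. The sketch below is not the authors' implementation. The channel count (64), group count (4), kernel size (3) and dilation rates (1, 2, 4) are illustrative assumptions that do not come from the paper, and the helper names are hypothetical.

```python
def conv2d_params(c_in, c_out, kernel, groups=1, bias=True):
    """Number of learnable parameters in a 2-D convolution with a square kernel."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    weights = (c_in // groups) * c_out * kernel * kernel
    return weights + (c_out if bias else 0)


def effective_kernel(kernel, dilation):
    """Span covered by a dilated kernel along one axis."""
    return kernel + (kernel - 1) * (dilation - 1)


def receptive_field(kernel, dilations):
    """Receptive field along one axis of stacked stride-1 convolutions."""
    rf = 1
    for d in dilations:
        rf += effective_kernel(kernel, d) - 1
    return rf


if __name__ == "__main__":
    # Assumed sizes for illustration only (not taken from the paper).
    standard = conv2d_params(64, 64, 3)
    grouped = conv2d_params(64, 64, 3, groups=4)
    print("standard conv parameters:", standard)
    print("grouped conv parameters: ", grouped)
    print("receptive field, dilations 1,1,1:", receptive_field(3, [1, 1, 1]))
    print("receptive field, dilations 1,2,4:", receptive_field(3, [1, 2, 4]))
```

Under these assumed sizes, grouping divides the weight count by the number of groups, and dilation widens the receptive field without adding any weights, which is the trade-off the abstract describes.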

Key words: singing voice separation, group dilated convolution, gated linear units, attention gating

CLC Number: TN912.3

ZHANG Tianqi, XIONG Tian, WU Chao, WEN Bin. Singing Voice Separation Method of Unet Based on Squeeze-and-Excitation Residual Group Dilated Convolution and Dense Linear Gate[J]. Journal of Applied Sciences, 2023, 41(5): 815-830.



