Journal of Applied Sciences ›› 2023, Vol. 41 ›› Issue (5): 815-830. doi: 10.3969/j.issn.0255-8297.2023.05.008

• Signal and Information Processing •

  • Corresponding author: ZHANG Tianqi, Ph.D., professor. Research interests: modulation and demodulation of communication signals, blind signal processing, image and speech signal processing, neural network implementation, and FPGA/VLSI implementation. E-mail: zhangtq@cqupt.edu.cn
  • Funding: National Natural Science Foundation of China (No. 61671095, No. 61702065, No. 61701067, No. 61771085); Natural Science Foundation of Chongqing (No. cstc2021jcyj-msxmX0836)

Singing Voice Separation Method of Unet Based on Squeeze-and-Excitation Residual Group Dilated Convolution and Dense Linear Gate

ZHANG Tianqi, XIONG Tian, WU Chao, WEN Bin   

  1. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Received:2021-09-29 Published:2023-09-28

Abstract: To address the difficulty of capturing temporal speech information and the low utilization of low-level features in the Unet frequency-domain singing voice separation model, a convolutional neural network is designed that has fewer parameters and better separation performance than the baseline Unet. First, a residual group dilated convolution module combined with squeeze-and-excitation is introduced into the encoding and decoding stages. While reducing the number of parameters and enlarging the receptive field of the network, the module adaptively learns the importance of different channel features, enhancing useful features and suppressing irrelevant ones. Second, in the transmission layer, gated linear units are connected by dense addition to strengthen the acquisition of temporal features during feature transmission, and dilated convolution replaces ordinary convolution to further expand the receptive field. Finally, an attention gating mechanism replaces the skip connections of the baseline Unet to improve the utilization of low-level features. Experiments on the Ccmixter and MUSDB18 datasets show that, compared with the baseline network, the proposed approach improves singing voice separation performance with only about one-fifth of the parameters.
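The channel reweighting performed by the squeeze-and-excitation module can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `squeeze_excitation`, the fully connected weights `w1`/`w2`, the channel count C=8, and the reduction ratio r=4 are all illustrative assumptions; the paper combines this mechanism with residual group dilated convolutions inside a Unet.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excitation(feature_map, w1, w2):
    """Reweight the channels of a (C, H, W) feature map.

    Squeeze: global average pooling reduces each channel to one scalar.
    Excitation: a two-layer bottleneck (ReLU, then sigmoid) maps those
    scalars to per-channel weights in (0, 1) that rescale the input,
    amplifying informative channels and damping irrelevant ones.
    """
    squeezed = feature_map.mean(axis=(1, 2))      # (C,) global average pool
    hidden = np.maximum(0.0, w1 @ squeezed)       # (C/r,) ReLU bottleneck
    weights = sigmoid(w2 @ hidden)                # (C,) channel weights in (0, 1)
    return feature_map * weights[:, None, None]   # broadcast channel-wise scaling

# Toy usage: C = 8 channels, reduction ratio r = 4 (assumed values)
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
w1 = 0.1 * rng.standard_normal((2, 8))  # C -> C/r
w2 = 0.1 * rng.standard_normal((8, 2))  # C/r -> C
y = squeeze_excitation(x, w1, w2)
```

Because every channel weight lies strictly between 0 and 1, the output is an attenuated copy of the input whose per-channel scale reflects learned channel importance.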

Key words: singing voice separation, group dilated convolution, gated linear units, attention gating
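The receptive-field gain that motivates replacing ordinary convolution with dilated convolution can be checked with a standard back-of-envelope calculation; the kernel size and the exponential dilation schedule below are illustrative assumptions, not taken from the paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of stride-1 dilated convolutions.

    Starting from a single input sample, each layer with kernel k and
    dilation d enlarges the receptive field by (k - 1) * d.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Three stacked 3-tap layers: ordinary convolution vs. exponentially dilated
print(receptive_field(3, [1, 1, 1]))  # ordinary: 7
print(receptive_field(3, [1, 2, 4]))  # dilated: 15
```

With the same number of parameters, the dilated stack covers more than twice the context, which is why dilation widens the receptive field "for free" in the transmission layer.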

CLC number: