Computer Science and Applications

Intelligent Synthetic Voice Speaker Verification Method Based on Group-Res2Block

1. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, Anhui, China;
2. Anhui Province Key Laboratory of Industry Safety and Emergency Technology, Hefei University of Technology, Hefei 230601, Anhui, China;
3. Institute of Forensic Science, Department of Public Security of Anhui Province, Hefei 230000, Anhui, China

Received date: 2023-02-27

Online published: 2024-08-01

Funding

Supported by the Anhui Provincial Key Research and Development Program (Nos. 202004d07020011, 202104d07020001), the Open Project of Guangdong Provincial Key Laboratory of Brain-Inspired Intelligent Computation (No. GBL202117), and the Fundamental Research Funds for the Central Universities (Nos. PA2021GDSK0073, PA2021GDSK0074, PA2022GDSK0037)

Cite this article

Li F, Su Z P, Wang N S, Yang B, Zhang G F. Intelligent synthetic voice speaker verification method based on Group-Res2Block [J]. Journal of Applied Sciences, 2024, 42(4): 709-722. DOI: 10.3969/j.issn.0255-8297.2024.04.012. (in Chinese)

Abstract

Existing speaker verification methods are built primarily on natural speech and are therefore ill-suited to intelligently synthesized speech. To address this problem, this paper proposes an intelligent synthetic voice speaker verification method based on Group-Res2Block. First, the Group-Res2Block structure is designed: building on Res2Block, the current group is merged with its adjacent front and rear groups to form a new group, strengthening the contextual connection of the speaker's local features. Second, a parallel multi-scale channel attention feature fusion mechanism is designed, which applies convolution kernels of different sizes to perform channel-wise feature selection on features of the same level, extracting more expressive speaker features while avoiding information redundancy. Finally, a serial multi-scale layer attention feature fusion mechanism is designed, in which a layer structure fuses deep and shallow features as a whole and assigns them different weights to obtain the optimal feature representation. To verify the effectiveness of the proposed feature extraction network, two intelligent synthetic speech datasets, one Chinese and one English, are constructed for ablation and comparative experiments. The results show that the proposed method achieves the best accuracy (ACC), equal error rate (EER), and minimum detection cost function (minDCF) on this task. Furthermore, generalization tests verify the method's applicability to unknown intelligent speech synthesis algorithms.
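
The abstract does not give implementation details. As a rough illustration of the Group-Res2Block idea described above, the following minimal PyTorch sketch merges each channel split with its front and rear neighbour splits before convolution; the scale count, kernel size, and zero-padding at the boundary groups are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class GroupRes2Block(nn.Module):
    """Res2Block variant sketch: each channel split is concatenated with
    its front and rear neighbour splits before its convolution, so every
    group sees local context from the adjacent groups (hypothetical layout)."""

    def __init__(self, channels: int, scale: int = 4, kernel_size: int = 3):
        super().__init__()
        assert channels % scale == 0, "channels must divide evenly into groups"
        self.scale = scale
        self.width = channels // scale
        # Each conv receives the current split plus its two neighbours.
        self.convs = nn.ModuleList(
            nn.Conv1d(3 * self.width, self.width, kernel_size,
                      padding=kernel_size // 2)
            for _ in range(scale)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        splits = list(torch.chunk(x, self.scale, dim=1))
        outs = []
        for i in range(self.scale):
            # Boundary groups are padded with zeros where a neighbour is missing.
            left = splits[i - 1] if i > 0 else torch.zeros_like(splits[i])
            right = splits[i + 1] if i < self.scale - 1 else torch.zeros_like(splits[i])
            outs.append(self.convs[i](torch.cat([left, splits[i], right], dim=1)))
        return torch.cat(outs, dim=1) + x  # residual connection
```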
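
Similarly, the parallel multi-scale channel attention fusion can be pictured as several convolution branches with different kernel sizes whose outputs are gated channel-wise before summation. The branch kernel sizes and the squeeze-and-excitation-style gate below are illustrative assumptions standing in for the paper's mechanism.

```python
import torch
import torch.nn as nn

class MultiScaleChannelAttentionFusion(nn.Module):
    """Sketch: parallel branches with different kernel sizes process the same
    input, and an SE-style gate weights each branch's channels before the
    branches are summed (kernel sizes and squeeze ratio are hypothetical)."""

    def __init__(self, channels: int, kernel_sizes=(1, 3, 5), reduction: int = 4):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2)
            for k in kernel_sizes
        )
        self.attn = nn.Sequential(  # squeeze-and-excitation style channel gate
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fused = torch.zeros_like(x)
        for branch in self.branches:
            feat = branch(x)                     # (batch, channels, time)
            gate = self.attn(feat.mean(dim=-1))  # global average pool -> (batch, channels)
            fused = fused + feat * gate.unsqueeze(-1)
        return fused
```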
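
Finally, the serial multi-scale layer attention fusion, which weights deep and shallow features as a whole, might reduce to learned per-layer weights over stacked feature maps; the softmax weighting below is an assumption, not the paper's specified design.

```python
import torch
import torch.nn as nn

class LayerAttentionFusion(nn.Module):
    """Sketch: stack feature maps collected from shallow to deep blocks and
    combine them with learned per-layer attention weights (hypothetical)."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, feats):
        # feats: list of (batch, channels, time) tensors from successive blocks
        stacked = torch.stack(feats, dim=0)          # (layers, batch, channels, time)
        w = torch.softmax(self.layer_logits, dim=0)  # one weight per layer
        return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)
```

In use, the per-block outputs of the backbone would be collected into a list, e.g. `LayerAttentionFusion(num_layers=4)([f1, f2, f3, f4])`, so that shallow and deep features contribute to the final speaker embedding with learned weights.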
