Existing speaker verification methods are built primarily on natural speech conditions, making them unsuitable for intelligently synthesized speech. In response, this paper proposes an intelligent synthetic voice speaker verification method based on Group-Res2Block. Firstly, the Group-Res2Block structure is designed, fusing each channel group with its adjacent front and rear groups to strengthen the contextual connection between the speaker's local characteristics. Secondly, a multi-scale channel attention feature fusion mechanism with a parallel structure is designed; it employs convolution kernels of various sizes to select same-level features in the channel dimension, extracting more expressive speaker features while avoiding information redundancy. Finally, a multi-scale attention feature fusion mechanism with a serial structure is designed, in which a hierarchical structure integrates deep and shallow features as a whole and assigns them different weights to obtain the optimal feature expression. To verify the effectiveness of the proposed feature extraction network, two intelligent synthetic speech datasets, one Chinese and one English, are constructed. Ablation and comparative experiments show that the proposed method outperforms existing methods on evaluation metrics such as accuracy (ACC), equal error rate (EER), and minimum detection cost function (minDCF). Furthermore, generalization tests verify the model's applicability to unknown intelligent speech synthesis algorithms.
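As a reading aid, the following is a minimal PyTorch sketch of two of the fusion ideas summarized above: the neighbour-group fusion of Group-Res2Block and a parallel multi-scale channel-attention gate. This is not the paper's implementation; the class names, kernel sizes, gate design, and the zero-padding used for the first and last groups' missing neighbours are assumptions made here for illustration.

import torch
import torch.nn as nn

class GroupRes2Block(nn.Module):
    """Res2Net-style block in which each channel group is fused with its
    adjacent front and rear groups before convolution, strengthening the
    contextual connection between local speaker features (hypothetical sketch)."""

    def __init__(self, channels: int, scale: int = 4):
        super().__init__()
        assert channels % scale == 0, "channels must split evenly into groups"
        self.scale = scale
        width = channels // scale
        # one 3x3 conv per group; each conv sees the current group plus both neighbours
        self.convs = nn.ModuleList(
            [nn.Conv1d(3 * width, width, kernel_size=3, padding=1) for _ in range(scale)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        groups = list(torch.chunk(x, self.scale, dim=1))
        outs = []
        for i, conv in enumerate(self.convs):
            # boundary groups lack a front or rear neighbour; zeros stand in here
            front = groups[i - 1] if i > 0 else torch.zeros_like(groups[i])
            rear = groups[i + 1] if i < self.scale - 1 else torch.zeros_like(groups[i])
            outs.append(conv(torch.cat([front, groups[i], rear], dim=1)))
        return torch.cat(outs, dim=1) + x  # residual connection

class MultiScaleChannelAttention(nn.Module):
    """Parallel branches with different kernel sizes feed an SE-style gate that
    reweights channels of same-level features (hypothetical sketch)."""

    def __init__(self, channels: int, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(channels, channels, k, padding=k // 2) for k in kernel_sizes]
        )
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(),
            nn.Linear(channels // 4, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        summed = sum(branch(x) for branch in self.branches)  # fuse multi-scale views
        weights = self.gate(summed.mean(dim=-1))             # (batch, channels)
        return x * weights.unsqueeze(-1)                     # channel-wise reweighting

# usage sketch: 256-channel frame-level features over 200 frames
# feats = torch.randn(8, 256, 200)
# feats = GroupRes2Block(256, scale=4)(feats)
# feats = MultiScaleChannelAttention(256)(feats)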
LI Fei, SU Zhaopin, WANG Niansong, YANG Bo, ZHANG Guofu. Intelligent Synthetic Voice Speaker Verification Method Based on Group-Res2Block[J]. Journal of Applied Sciences, 2024, 42(4): 709-722.
DOI: 10.3969/j.issn.0255-8297.2024.04.012