Signal and Information Processing

Attack Towards Speaker Identification Using Deep Conversion Networks for Voiceprint Features

1. School of Computer and Information Technology, Hefei University of Technology, Hefei 230009, Anhui, China
2. Department of Physical Evidence Identification, Anhui Public Security Department, Hefei 230000, Anhui, China
3. Intelligent Interconnected Systems Laboratory of Anhui Province, Hefei University of Technology, Hefei 230009, Anhui, China
4. Joint Laboratory of Intelligent Prevention and Recognition of Audio and Video, Hefei 230009, Anhui, China

Received date: 2023-11-08

Online published: 2024-09-29

Abstract

Attacks on speaker identification (SID) systems typically rely on the fast gradient sign method (FGSM) and projected gradient descent (PGD), which suffer from unstable attack performance and poor auditory quality of the generated attack samples. This paper proposes an attack method against SID systems that uses deep neural networks to generate attack speech carrying the target speaker's voiceprint. Specifically, the attack process on SID systems is first analyzed to determine the approach to generating attack speech. Then, a two-dimensional convolutional neural network is designed as a generator that fuses the speech content of the source speaker with the voiceprint features of the target speaker, and a discriminator is designed on the basis of adversarial learning to improve the quality of the attack speech. Finally, comparative experiments are conducted on two automatic SID systems built on the generalized end-to-end (GE2E) loss and the additive margin softmax (AM-Softmax) loss, respectively. Experimental results demonstrate that the proposed method not only stabilizes attack performance but also improves the auditory quality of the attack speech. Moreover, the method works with short samples, making it suitable for practical attack scenarios.
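
To make the pipeline described above concrete, the following minimal PyTorch sketch shows one way a two-dimensional convolutional generator can fuse a source mel-spectrogram with a target speaker's voiceprint embedding, paired with a discriminator for adversarial training. All layer sizes, the broadcast-add fusion scheme, and the least-squares GAN objective are illustrative assumptions, not the architecture reported in the paper.

    # Illustrative sketch only: layer sizes, names, and the fusion scheme
    # are assumptions for exposition, not the authors' exact design.
    import torch
    import torch.nn as nn

    class Generator(nn.Module):
        """Fuses source mel-spectrogram content with a target speaker embedding."""
        def __init__(self, n_mels: int = 80, spk_dim: int = 256):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
            )
            # Project the speaker embedding so it can be broadcast-added
            # to every time-frequency position of the content feature map.
            self.spk_proj = nn.Linear(spk_dim, 64)
            self.out = nn.Conv2d(64, 1, kernel_size=3, padding=1)

        def forward(self, mel: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
            # mel: (B, 1, n_mels, T), spk: (B, spk_dim)
            h = self.conv(mel)
            h = h + self.spk_proj(spk)[:, :, None, None]  # inject voiceprint
            return self.out(h)  # converted mel-spectrogram, same shape as input

    class Discriminator(nn.Module):
        """Judges whether a mel-spectrogram is genuine or generated."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
            )

        def forward(self, mel: torch.Tensor) -> torch.Tensor:
            return self.net(mel)  # real/fake logit

    # One adversarial training step (least-squares GAN objective as an example)
    if __name__ == "__main__":
        G, D = Generator(), Discriminator()
        mel = torch.randn(4, 1, 80, 128)   # source speech content
        spk = torch.randn(4, 256)          # target speaker embedding
        fake = G(mel, spk)
        d_loss = ((D(mel) - 1) ** 2).mean() + (D(fake.detach()) ** 2).mean()
        g_loss = ((D(fake) - 1) ** 2).mean()

Broadcast-adding the projected embedding at every time-frequency position is only one common conditioning choice; concatenating it along the channel axis or using FiLM-style modulation would play the same role of injecting the target voiceprint into the source content.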

Cite this article

TAO Ziyu, SU Zhaopin, LIAN Chensi, WANG Niansong, ZHANG Guofu. Attack Towards Speaker Identification Using Deep Conversion Networks for Voiceprint Features[J]. Journal of Applied Sciences, 2024, 42(5): 782-794. DOI: 10.3969/j.issn.0255-8297.2024.05.006
