Journal of Applied Sciences ›› 2024, Vol. 42 ›› Issue (5): 782-794. doi: 10.3969/j.issn.0255-8297.2024.05.006

• Signal and Information Processing •

Attack Towards Speaker Identification Using Deep Conversion Networks for Voiceprint Features

TAO Ziyu1, SU Zhaopin1,3,4, LIAN Chensi2,4, WANG Niansong2,4, ZHANG Guofu1,3,4   

    1. School of Computer and Information Technology, Hefei University of Technology, Hefei 230009, Anhui, China;
    2. Department of Physical Evidence Identification, Anhui Public Security Department, Hefei 230000, Anhui, China;
    3. Intelligent Interconnected Systems Laboratory of Anhui Province, Hefei University of Technology, Hefei 230009, Anhui, China;
    4. Joint Laboratory of Intelligent Prevention and Recognition of Audio and Video, Hefei 230009, Anhui, China
  • Received: 2023-11-08 Published: 2024-09-29

Abstract: Existing attacks on speaker identification (SID) systems often rely on fast gradient descent and mapping gradient descent algorithms, which suffer from unstable attack performance and poor auditory quality of the generated attack samples. This paper proposes an attack method against SID systems that uses deep neural networks to generate attack speeches carrying the target speaker’s voiceprint. Specifically, the attack process on an SID system is first analyzed to determine how attack speeches should be generated. Then, a two-dimensional convolutional neural network is designed as a generator to integrate the speech content of the source speaker with the voiceprint features of the target speaker, and a discriminator based on adversarial learning is designed to improve the quality of the attack speeches. Finally, comparative experiments are conducted on two automatic SID systems based on the generalized end-to-end loss and the AMSoftmax loss, respectively. Experimental results demonstrate that the proposed method not only improves the stability of attack performance but also enhances the auditory quality of the attack speeches. Moreover, the proposed method is applicable to short samples, making it suitable for practical attack scenarios.
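
One of the two target SID systems named in the abstract is trained with the AMSoftmax loss. As a reminder of what that objective computes, the following is a minimal sketch of the standard additive-margin softmax formulation for a single sample. The function name `am_softmax_loss`, the scale `s = 30` and the margin `m = 0.35` are assumptions chosen for illustration; they are not values or code reported in this abstract.

```python
import math


def am_softmax_loss(cosines, target, s=30.0, m=0.35):
    """Additive-margin softmax loss for one sample (illustrative sketch).

    cosines: cosine similarities between the utterance embedding and each
             speaker's class weight vector, one value per enrolled speaker.
    target:  index of the true speaker.
    s, m:    scale and additive margin (assumed defaults, not from the paper).
    """
    # Subtract the margin from the true-speaker cosine only, then scale.
    logits = [s * (c - m) if j == target else s * c
              for j, c in enumerate(cosines)]
    # Numerically stable log-sum-exp over all speakers.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    # Negative log-probability of the true speaker.
    return log_sum - logits[target]
```

The margin forces the true-speaker cosine to exceed every other cosine by at least `m` before the loss becomes small, which is why embeddings from such a system form tight per-speaker clusters that an attack speech has to land inside.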

Key words: speaker identification, attack speeches, voiceprint feature conversion, convolutional neural network
