Signal and Information Processing

Hierarchical Structure of Deep Belief Network for Phoneme Recognition

  • 1. Room 404, Electronic Engineering Institute, Hefei 230037, China
    2. Key Laboratory of Electronic Restriction, Anhui Province, Hefei 230037, China
    3. Anhui USTC iFlytek Corporation, Hefei 230037, China
    4. No.52 Sub Unit, No.77108 Unit, Chengdu 611233, China
WANG Yi, Ph.D. candidate; research interests include speech signal analysis and recognition; E-mail: wygggg@126.com. YANG Jun-an, professor and doctoral supervisor; research interests include signal processing and intelligent computing; E-mail: yanjunan@ustc.edu

Received date: 2013-09-08

  Revised date: 2014-03-28

  Online published: 2014-03-28

Funding

Supported by the National Natural Science Foundation of China (No.61272333) and the Natural Science Foundation of Anhui Province (No.1208085MF94, No.1308085QF99)

Cite this article

WANG Yi1,2, YANG Jun-an1,2, LIU Hui1,2, LIU Lin3, LU Gao4. Hierarchical Structure of Deep Belief Network for Phoneme Recognition[J]. Journal of Applied Sciences (应用科学学报), 2014, 32(5): 515-522. DOI: 10.3969/j.issn.0255-8297.2014.05.013

Abstract

Existing phoneme recognition systems suffer from low recognition accuracy, and their modeling methods have limited representational power and are prone to being trapped in local optima. To address these problems, this paper proposes a new phoneme recognition method based on a hierarchical deep belief network (DBN). The method consists of two parts: bottleneck features based on a hierarchical DBN, and a DBN-based phoneme classifier. The bottleneck features exploit the DBN's ability to process long spans of speech and its supervised feature extraction, while the DBN-based classifier offers stronger modeling and representational capacity. Combining the two makes it possible to extract low-dimensional, supervised features and to estimate phoneme posterior probabilities more effectively with the DBN. Experiments on the TIMIT corpus show that the proposed system achieves a phoneme error rate of 18.5%, a considerable improvement in recognition accuracy over previous phoneme recognition systems.
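The two-stage structure described in the abstract (a bottleneck feature extractor followed by a phoneme classifier, both DBN based) can be illustrated with a minimal forward-pass sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the layer sizes, the context window of 5 frames, the 13-dimensional input features, the 39 phoneme classes, and the helper names `stack_context`, `init_layers`, and `forward`. Random weights stand in for the RBM pre-training and supervised fine-tuning that a real DBN would need, and the abstract does not specify how the hierarchy is arranged, so only the simplest two-stage data flow is shown.

```python
import numpy as np

def stack_context(frames, context):
    """Concatenate each frame with `context` neighbours on each side.

    Edge frames are padded by repetition, so the output has one row per
    input frame and (2 * context + 1) * feat_dim columns.
    """
    n = len(frames)
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    windows = [padded[i:i + n] for i in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def init_layers(sizes, rng):
    """Hypothetical stand-in for trained weights: small random matrices."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, layers, softmax_output=False):
    """Sigmoid layers throughout; optionally a softmax on the last layer."""
    for idx, (w, b) in enumerate(layers):
        z = x @ w + b
        is_last = idx == len(layers) - 1
        x = softmax(z) if (softmax_output and is_last) else sigmoid(z)
    return x

rng = np.random.default_rng(0)
n_frames, feat_dim, context = 50, 13, 5   # assumed values
bottleneck_dim, n_phonemes = 40, 39       # assumed values

frames = rng.normal(size=(n_frames, feat_dim))   # placeholder acoustic features
long_span = stack_context(frames, context)       # long temporal context per frame

# Stage 1: network up to a narrow bottleneck layer; its activations are the
# low-dimensional features (layers above the bottleneck are only needed in training).
bottleneck_layers = init_layers([long_span.shape[1], 512, bottleneck_dim], rng)
bn_feats = forward(long_span, bottleneck_layers)

# Stage 2: classifier that maps bottleneck features to phoneme posteriors.
classifier_layers = init_layers([bottleneck_dim, 512, 512, n_phonemes], rng)
posteriors = forward(bn_feats, classifier_layers, softmax_output=True)
```

Each row of `posteriors` is a probability distribution over the assumed phoneme classes for one frame; in a full recognizer these per-frame posteriors would typically be passed to a decoder, which this sketch leaves out.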
