To address the low accuracy and weak generalization ability of human emotion recognition, a multi-modal emotion recognition method based on speech, text, and motion is proposed. In the speech modality, a depth wavefield extrapolation-improved wave physics model (DWE-WPM) is designed to simulate the sequential information mining process of a long short-term memory (LSTM) network; in the text modality, a transformer model with a multi-attention mechanism is used to capture the latent semantic expression of emotion; in the motion modality, sequential features of facial expressions and hand actions are combined using a bidirectional three-layer LSTM model with an attention mechanism. On this basis, a multi-modal fusion scheme is designed to achieve emotion recognition with high accuracy and strong generalization ability. The proposed method is compared with existing emotion recognition algorithms on the widely used IEMOCAP emotion corpus. Experimental results show that it achieves higher recognition accuracy in both single-modality and multi-modality settings, improving average accuracy by 16.4% and 10.5% respectively, and effectively enhancing emotion recognition in human-computer interaction.
JIA Ning, ZHENG Chunjun. Multi-modal Emotion Recognition Using Speech, Text and Motion[J]. Journal of Applied Sciences, 2023, 41(1): 55-70.
DOI: 10.3969/j.issn.0255-8297.2023.01.005
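The abstract describes a multi-modal fusion scheme but does not specify its exact form. As a minimal illustrative sketch, the fragment below assumes late fusion: each modality (speech, text, motion) outputs a probability vector over the four commonly used IEMOCAP emotion classes, and the vectors are combined by a weighted average. The class names and weights are hypothetical, not taken from the paper.

```python
# Hypothetical late-fusion sketch for three-modality emotion recognition.
# The per-modality models (DWE-WPM, transformer, bi-LSTM) are assumed to
# each emit a probability vector over the four emotion classes below.

EMOTIONS = ["angry", "happy", "neutral", "sad"]

def fuse(speech, text, motion, weights=(0.4, 0.3, 0.3)):
    """Weighted average of three per-modality probability vectors,
    renormalised so the fused vector sums to 1."""
    fused = [
        weights[0] * s + weights[1] * t + weights[2] * m
        for s, t, m in zip(speech, text, motion)
    ]
    total = sum(fused)
    return [p / total for p in fused]

def predict(speech, text, motion, weights=(0.4, 0.3, 0.3)):
    """Return the emotion label with the highest fused probability."""
    fused = fuse(speech, text, motion, weights)
    return EMOTIONS[max(range(len(fused)), key=fused.__getitem__)]

# Example: speech strongly favours "angry", the other modalities disagree,
# but the weighted combination still selects "angry".
label = predict(
    [0.7, 0.1, 0.1, 0.1],   # speech model output
    [0.2, 0.5, 0.2, 0.1],   # text model output
    [0.1, 0.1, 0.2, 0.6],   # motion model output
)
print(label)  # -> angry
```

In practice the fusion weights would be tuned on a validation split, or replaced by a learned fusion layer; this sketch only illustrates the combination step.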
[1] Tiwari U, Soni M, Chakraborty R, et al. Multi-conditioning and data augmentation using generative noise model for speech emotion recognition in noisy conditions[C]//2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020:7194-7198.
[2] Jermsittiparsert K, Abdurrahman A, Siriattakul P, et al. Pattern recognition and features selection for speech emotion recognition model using deep learning[J]. International Journal of Speech Technology, 2020, 23(4):1-8.
[3] Zeng R H, Zhang S Q. Speech emotion recognition method based on improved convolutional neural networks[J]. Journal of Applied Sciences, 2018, 36(5):837-844. (in Chinese)
[4] Chu Y, Li T G, Ye S, et al. Research on feature selection method in speech emotion recognition[J]. Journal of Applied Acoustics, 2020, 39(2):223-230.
[5] Wang W, Yang L P, Wei L. Extraction and analysis of speech emotion characteristics[J]. Research and Exploration in Laboratory, 2013, 32(7):91-94.
[6] Yang M H, Tao J H, Li H, et al. Natural multimodal human-computer interaction dialog system[J]. Computer Science, 2014, 41(10):12-18.
[7] Hughes T W, Williamson I A D, Minkov M, et al. Wave physics as an analog recurrent neural network[J]. Science Advances, 2019, 5(12):6946-6958.
[8] Bouazizi M, Ohtsuki T. Multi-class sentiment analysis on Twitter: classification performance and challenges[J]. Big Data Mining and Analytics, 2019, 3:181-194.
[9] Liang Y, Meng F, Zhang J, et al. A dependency syntactic knowledge augmented interactive architecture for end-to-end aspect-based sentiment analysis[J]. Neurocomputing, 2020, 454:291-302.
[10] Si M Y, Yi J Z, Chen A B, et al. Localization and recognition of fully expressive frames in dynamic face image sequences[J]. Journal of Applied Sciences, 2021, 39(3):357-366. (in Chinese)
[11] Jain D K, Shamsolmoali P, Sehdev P. Extended deep neural network for facial emotion recognition[J]. Pattern Recognition Letters, 2019, 120:69-74.
[12] Thomas K, Pranav E, Supriya M H. A generalized deep learning model for denoising image datasets[J]. International Journal of Engineering and Advanced Technology, 2020, 10(1):9-14.
[13] Ly S T, Lee G S, Kim S H, et al. Gesture-based emotion recognition by 3D-CNN and LSTM with keyframes selection[J]. International Journal of Contents, 2019, 15(4):59-64.
[14] Busso C, Bulut M, Lee C C, et al. IEMOCAP: interactive emotional dyadic motion capture database[J]. Language Resources and Evaluation, 2008, 42(4):335-359.
[15] Poria S, Majumder N, Hazarika D, et al. Multimodal sentiment analysis: addressing key issues and setting up the baselines[J]. IEEE Intelligent Systems, 2018, 33(6):17-25.
[16] Sahu G. Multimodal speech emotion recognition and ambiguity resolution[EB/OL]. (2019-04-12)[2021-08-21]. https://arxiv.org/abs/1904.06022v1.
[17] Happy S L, Dantcheva A, Bremond F, et al. Expression recognition with deep features extracted from holistic and part-based models[J]. Image and Vision Computing, 2021, 105(1):104038.1-104038.11.
[18] Tripathi S, Beigi H. Multi-modal emotion recognition on IEMOCAP dataset using deep learning[EB/OL]. (2019-11-06)[2021-09-04]. https://arxiv.org/abs/1804.05788v3.
[19] Ren M, Nie W, Liu A, et al. Multi-modal correlated network for emotion recognition in speech[J]. Visual Informatics, 2019, 3(3):150-155.
[20] Mirsamadi S, Barsoum E, Zhang C. Automatic speech emotion recognition using recurrent neural networks with local attention[C]//2017 IEEE International Conference on Acoustics, Speech and Signal Processing, 2017:2227-2231.
[21] Chen M, Zhao X D. A multi-scale fusion framework for bimodal speech emotion recognition[C]//Interspeech 2020, 2020:374-378.