Journal of Applied Sciences ›› 2023, Vol. 41 ›› Issue (1): 55-70. doi: 10.3969/j.issn.0255-8297.2023.01.005

• Special Issue on Computer Applications •

Multi-modal Emotion Recognition Using Speech, Text and Motion

JIA Ning, ZHENG Chunjun

  1. School of Software, Dalian Neusoft University of Information, Dalian 116023, Liaoning, China
  • Received: 2022-06-18  Online: 2023-01-31  Published: 2023-02-03
  • Corresponding author: JIA Ning, associate professor; research interests: multi-modal affective computing and speech synthesis. E-mail: jianing@neusoft.edu.cn
  • Funding: National Key Research and Development Program of China (No. 2021YFC3320300); projects of the Education Department of Liaoning Province (No. LJKQZ2021188, No. JG20DB032)

Abstract: To address the low accuracy and weak generalization ability of machine recognition of human emotion, a fusion method for multi-modal emotion recognition based on speech, text and motion is proposed. In the speech modality, a depth wavefield extrapolation-improved wave physics model (DWE-WPM) is designed to simulate the sequence information mining process of a long short-term memory (LSTM) network. In the text modality, a Transformer model with a multi-head attention mechanism is used to capture latent semantic expressions of emotion. In the motion modality, sequential features of facial expressions and hand actions are combined with a bidirectional three-layer LSTM model equipped with an attention mechanism. On this basis, a modality fusion scheme guided by multiple performance indicators is designed to achieve emotion recognition with high accuracy and strong generalization ability. On the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus, the proposed method is compared with existing emotion recognition algorithms. Experimental results show that the proposed method achieves higher recognition accuracy in both single-modal and multi-modal settings, improving average accuracy by 16.4% and 10.5% respectively and effectively enhancing emotion recognition in human-computer interaction.
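The paper itself provides no code; as a rough illustration of the motion branch described in the abstract (a bidirectional three-layer LSTM with an attention mechanism over facial-expression and hand-action feature sequences) together with a simple late fusion of per-modality scores, the following PyTorch sketch may help. All feature dimensions, layer sizes, the four-class label set, and the equal fusion weights are illustrative assumptions, not the authors' configuration.

```python
# Hedged sketch (not the authors' code): bidirectional three-layer LSTM with
# additive attention over a motion feature sequence, plus a naive late fusion
# of per-modality emotion logits. All dimensions below are illustrative.
import torch
import torch.nn as nn


class MotionBranch(nn.Module):
    def __init__(self, feat_dim=64, hidden=128, num_emotions=4):
        super().__init__()
        # Bidirectional, three stacked LSTM layers, as the abstract describes.
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=3,
                            bidirectional=True, batch_first=True)
        # Additive attention: score each time step, softmax, weighted sum.
        self.att = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, num_emotions)

    def forward(self, x):                    # x: (batch, time, feat_dim)
        h, _ = self.lstm(x)                  # (batch, time, 2 * hidden)
        w = torch.softmax(self.att(h), dim=1)   # attention weights over time
        context = (w * h).sum(dim=1)         # attention-pooled summary vector
        return self.out(context)             # per-emotion logits


def fuse(speech_logits, text_logits, motion_logits):
    # Equal-weight averaging is a stand-in; the paper's fusion scheme instead
    # weights modalities according to multiple performance indicators.
    return (speech_logits + text_logits + motion_logits) / 3.0


if __name__ == "__main__":
    branch = MotionBranch()
    motion = branch(torch.randn(2, 50, 64))           # 50-frame motion clips
    speech = torch.randn(2, 4)                        # stand-in branch outputs
    text = torch.randn(2, 4)
    print(fuse(speech, text, motion).argmax(dim=1))   # predicted emotion ids
```

In the paper's actual scheme the per-modality contributions are weighted by multiple performance indicators rather than averaged equally; the sketch only shows where such weighting would enter.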

Key words: speech emotion recognition, text emotion recognition, motion emotion recognition, Transformer model, attention mechanism
