应用科学学报 (Journal of Applied Sciences) ›› 2024, Vol. 42 ›› Issue (2): 189-199. doi: 10.3969/j.issn.0255-8297.2024.02.001

• Communication Engineering •

  • Corresponding author: JIN Yanliang, associate professor and doctoral supervisor; research interests include wireless sensor networks and artificial intelligence. E-mail: jinyanliang@staff.shu.edu.cn
  • Funding: Supported by the Natural Science Foundation of Shanghai (No. 22ZR1422200), the Key Fund of the Shanghai Science and Technology Commission (No. 19511102803), and the Shanghai Industrial Project (No. XTCX-KJ-2022-68)

Sign Language Recognition Based on Two-Stream Adaptive Enhanced Spatial Temporal Graph Convolutional Network

JIN Yanliang1,2, WU Xiaowei1,2   

  1. School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China;
    2. Shanghai Institute for Advanced Communication and Data Science, Shanghai University, Shanghai 200444, China
  • Received:2022-05-09 Online:2024-03-31 Published:2024-03-28


Abstract: To address the poor representation ability and incomplete information that arise when extracting sign language features, this paper designs a two-stream adaptive enhanced spatial temporal graph convolutional network (TAEST-GCN) for isolated-word sign language recognition. The network takes human body, hand, and face keypoints as input and constructs a two-stream structure based on human joints and bones. An adaptive spatial temporal graph convolutional module generates connections between different body parts, making full use of the position and direction information they carry. Meanwhile, an adaptive multi-scale spatial temporal attention module is built with residual connections to further strengthen the network's convolution ability in both the spatial and temporal domains. The effective features extracted by the two streams are fused by weighted summation to classify and output sign language words. Finally, experiments on a public Chinese sign language isolated-word dataset achieve accuracy rates of 95.57% and 89.62% on the 100-word and 500-word classification tasks, respectively.
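As a rough illustration only (not the paper's implementation), the two-stream input construction and score-level fusion described in the abstract can be sketched as follows; the skeleton layout, parent list, and fusion weights here are hypothetical:

```python
import numpy as np

def bone_stream(joints: np.ndarray, parents: list) -> np.ndarray:
    """Build the bone stream from joint coordinates of shape (T frames, V joints, C coords).

    Each bone is the vector from a joint's parent to the joint itself, which
    preserves direction information that raw joint positions alone lack.
    """
    bones = np.zeros_like(joints)
    for v, p in enumerate(parents):
        if p >= 0:  # the root joint has no parent; its bone vector stays zero
            bones[:, v, :] = joints[:, v, :] - joints[:, p, :]
    return bones

def fuse_scores(joint_scores, bone_scores, w_joint=0.6, w_bone=0.4):
    """Weighted score-level fusion of the per-class outputs of the two streams."""
    return w_joint * np.asarray(joint_scores) + w_bone * np.asarray(bone_scores)

# Tiny example: 2 frames, 3 joints in a chain 0 -> 1 -> 2, 3D coordinates.
joints = np.arange(2 * 3 * 3, dtype=float).reshape(2, 3, 3)
bones = bone_stream(joints, parents=[-1, 0, 1])
scores = fuse_scores([0.2, 0.8], [0.4, 0.6])  # per-class scores from each stream
```

In this sketch the fusion weights are fixed constants; the actual weighting used by TAEST-GCN is determined by the authors' experiments, not shown here.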

Key words: skeleton data, two-stream structure, adaptive spatial temporal graph convolutional module, adaptive multi-scale spatial temporal attention module, feature fusion

CLC number: