To address the weak representational power and incomplete information of the features extracted in sign language recognition, this paper proposes a two-stream adaptive enhanced spatial temporal graph convolutional network (TAEST-GCN) for isolated-word sign language recognition. The network takes body, hand, and face nodes as input and builds a two-stream structure over human joints and bones. An adaptive spatial temporal graph convolutional module generates the connections between different body parts, so that position and direction information is fully used. In addition, an adaptive multi-scale spatial temporal attention module, attached through residual connections, further strengthens the network's convolutional capacity in both the spatial and temporal domains. The features extracted by the two streams are weighted and fused to classify the sign language vocabulary. Experiments on a public Chinese sign language isolated-word dataset yield accuracies of 95.57% and 89.62% on the 100-word and 500-word category sets, respectively.
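The two-stream idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the skeleton topology, the fusion weights, or the level at which fusion happens, so the toy `parents` list, the equal weights, and the choice to fuse per-class probabilities (rather than intermediate features) are all assumptions made here for illustration. The helper names `bone_vectors` and `fuse_scores` are hypothetical.

```python
import math


def bone_vectors(joints, parents):
    """Derive bone-stream input from joint coordinates.

    joints  : list of (x, y) joint positions for one frame.
    parents : parents[i] is the index of joint i's parent (assumed topology);
              a root joint points to itself and therefore gets a zero bone.
    """
    return [
        (joints[i][0] - joints[p][0], joints[i][1] - joints[p][1])
        for i, p in enumerate(parents)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(joint_logits, bone_logits, w_joint=0.5, w_bone=0.5):
    """Weighted fusion of the two streams' class probabilities.

    The weights are placeholders; the paper's actual values are not
    stated in the abstract.
    """
    p_joint = softmax(joint_logits)
    p_bone = softmax(bone_logits)
    fused = [w_joint * a + w_bone * b for a, b in zip(p_joint, p_bone)]
    label = max(range(len(fused)), key=fused.__getitem__)
    return label, fused


# Toy frame with three joints: joint 0 is the root, 1 hangs off 0, 2 off 1.
joints = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
parents = [0, 0, 1]
bones = bone_vectors(joints, parents)

# Made-up logits standing in for the outputs of the joint and bone streams.
label, fused = fuse_scores([2.0, 1.0, 0.1], [0.5, 2.5, 0.3])
print(bones, label)
```

In this sketch the joint stream carries position and the bone stream carries the offset between connected joints; unequal weights would simply bias the final decision toward one stream.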
JIN Yanliang, WU Xiaowei. Sign Language Recognition Based on Two-Stream Adaptive Enhanced Spatial Temporal Graph Convolutional Network[J]. Journal of Applied Sciences, 2024, 42(2): 189-199.
DOI: 10.3969/j.issn.0255-8297.2024.02.001