Human Action Recognition in Infrared Images Based on Median-Guided Multi-scale Feature Fusion

doi:10.3969/j.issn.0255-8297.2026.02.007

Abstract

Conventional deep learning models achieve limited recognition performance on infrared images because the lack of discriminative features makes it difficult for them to distinguish similar actions. To address this problem, this paper proposes an infrared human action recognition method based on median-guided multi-scale feature fusion. First, an information modeling mechanism is constructed that combines median-enhanced attention with multi-scale feature comparison. By finely modeling the differences between feature levels, the mechanism guides the network to focus on the key feature regions that distinguish action categories, overcoming the reliance of traditional methods on global features for classification. Second, a median-enhanced spatial and channel attention module is designed to address the difficulty that traditional Siamese networks have in accurately focusing on key regions of the human body in infrared action images, which arises because deep features lack explicit positional information. Finally, a multi-scale feature fusion module is proposed. It fuses features at multiple scales to improve the representation of action details and structural information in infrared images, strengthens the capture of subtle action changes, and reduces the misjudgment rate caused by information loss and background interference. Experimental results show that the proposed method achieves higher recognition accuracy than existing mainstream methods on multiple datasets, including infrared splicing, PUB, and VAIS, demonstrating its effectiveness.

Keywords: human action recognition, infrared image, median-enhanced spatial and channel attention, multi-scale feature fusion
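The abstract does not describe the internals of the median-enhanced attention module. As a rough illustration only, the sketch below shows one way a median statistic could be added to a channel-attention gate alongside the usual mean and max pooling. Everything in it is an assumption made for illustration: the function names are hypothetical, the three descriptors are summed without any learned layers, and the spatial branch is omitted. It is not the paper's module.

```python
import math
from statistics import median


def sigmoid(x):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_descriptors(channel):
    """Return (mean, max, median) over all pixels of one 2-D channel."""
    values = [v for row in channel for v in row]
    return sum(values) / len(values), max(values), median(values)


def median_channel_attention(feature_map):
    """Reweight each channel by a gate built from mean, max, and median pooling.

    feature_map: list of channels, each a list of rows of floats.
    Assumption: the descriptors are simply summed and passed through a
    sigmoid; a trained module would normally apply learned layers first.
    """
    weighted, gates = [], []
    for channel in feature_map:
        mean_v, max_v, med_v = channel_descriptors(channel)
        gate = sigmoid(mean_v + max_v + med_v)
        gates.append(gate)
        weighted.append([[v * gate for v in row] for row in channel])
    return weighted, gates


# Toy 2-channel, 2x3 feature map: channel 0 contains a single hot outlier pixel.
features = [
    [[0.0, 0.0, 0.0], [0.0, 0.0, 9.0]],
    [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
]
out, gates = median_channel_attention(features)
print(channel_descriptors(features[0]))
```

In the toy input, the single outlier pixel raises the mean and max of channel 0 while its median stays at the background level. This robustness to isolated extreme values is a general property of the median and one plausible reason to include it as a pooling statistic for noisy infrared features.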

CLC number: TP391.4

YUAN Shuai, YU Lei, YAO Tian, XIONG Bangshu. Human Action Recognition in Infrared Images Based on Median-Guided Multi-scale Feature Fusion[J]. Journal of Applied Sciences, 2026, 44(2): 266-281.
