Sparse Set Object Detection Combined with Transformer Multi-scale Instance Interaction

阚亚亚, 张孙杰, 熊娟, 祖奕

doi:10.3969/j.issn.0255-8297.2023.05.005

应用科学学报 (Journal of Applied Sciences), 2023, Vol. 41, Issue 5: 777-788

Signal and Information Processing

School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China

Received date: 2022-04-21

Online published: 2023-09-28

Funding

Supported by the Shanghai Chenguang Scholar Fund (No.18CG52)

Cite this article

阚亚亚, 张孙杰, 熊娟, 祖奕. Sparse Set Object Detection Combined with Transformer Multi-scale Instance Interaction[J]. 应用科学学报 (Journal of Applied Sciences), 2023, 41(5): 777-788. DOI: 10.3969/j.issn.0255-8297.2023.05.005

Abstract

Sparse set object detection methods suffer from feature maps that lack spatial detail, object features that do not interact with instances across the global context, and insufficient learning of global semantic information. To address these problems, a sparse set object detection algorithm combining adaptive feature augmentation and instance feature interaction is designed. During feature extraction, the adaptive feature augmentation module applies pooling and convolution at different scales to enrich high-level semantic information and to reduce the interference of background noise in low-level semantic information, which lowers the false detection and missed detection rates. In the bounding box regression design, the instance feature interaction module incorporates the multi-layer attention of the transformer and fuses channel attention with a dynamic convolution network to enhance the channel information of the proposal boxes, which improves the edge information of objects and raises the efficiency of instance feature interaction. In experiments against the original network, average precision improves by 4.2% on the COCO2017 dataset, including a 4.6% gain on large objects, and by 2.7% on the PASCAL VOC dataset.

Keywords: sparse set object detection; multi-scale feature; instance feature interaction; transformer
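The abstract names channel attention as one ingredient of the instance feature interaction module but does not spell out its form. As a rough illustration only, the sketch below shows a generic squeeze-and-excitation style channel attention in plain Python. It is not the authors' module: the function name `channel_attention`, the weight matrices `w_reduce` and `w_expand`, and the toy input are all assumptions made for this example.

```python
import math
from typing import List, Tuple

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(
    feat: FeatureMap,
    w_reduce: List[List[float]],  # hypothetical weights, shape [C_hidden][C]
    w_expand: List[List[float]],  # hypothetical weights, shape [C][C_hidden]
) -> Tuple[FeatureMap, List[float]]:
    """Generic SE-style channel attention; returns (rescaled map, per-channel gates)."""
    # Squeeze: global average pooling collapses each channel to one scalar.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feat
    ]
    # Excitation: linear -> ReLU -> linear -> sigmoid gives one gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_expand]
    # Scale: every value in a channel is multiplied by that channel's gate.
    scaled = [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feat, gates)
    ]
    return scaled, gates


# Toy input (assumed for illustration): two 2x2 channels, hand-picked weights.
feat = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
w_reduce = [[1.0, 0.0]]        # reduce 2 channels to 1 hidden unit
w_expand = [[2.0], [-2.0]]     # expand 1 hidden unit back to 2 gates
scaled, gates = channel_attention(feat, w_reduce, w_expand)
print(gates)
```

In a trained detector the two weight matrices would be learned, and the gates would let the network emphasize informative channels of each proposal feature while damping the rest. How the paper combines this with transformer attention and dynamic convolution is not specified in the abstract, so that part is left out here.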

