A Semantic Segmentation Network Based on Lightweight Convolutional Modules

doi:10.3969/j.issn.0255-8297.2025.01.005

Abstract

Abstract: Semantic simultaneous localization and mapping (SLAM) combined with deep learning offers an effective way to handle dynamic scenes, but it still suffers from high computational cost and high model complexity. To address this, a lightweight semantic segmentation network based on an improved BlendMask is proposed. First, a lightweight GDS-ECA convolution module (Ghost-depthwise separable convolution with efficient channel attention) is designed. It replaces a small number of the convolution operations in Ghost convolution with depthwise separable convolution to reduce the parameter count and computation, and adds an attention mechanism to strengthen feature representation. Second, a feature extraction network, BGTNet (bottleneck GDS-ECA attention transformer network), is proposed, in which GDS-ECA convolution is applied to the convolution layers of the neck module to improve feature extraction accuracy. In addition, the conventional convolutions in the feature pyramid network (FPN) are replaced with GDS-ECA convolutions to build a lightweight FPN (L-FPN), which is combined with BGTNet to form the backbone of the segmentation network. Finally, experiments on the COCO dataset show that the improved model reduces per-image processing time by 7.3 ms and improves average precision by 1.5%.
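
The parameter saving that motivates the GDS-ECA module can be illustrated with a rough weight count. The sketch below is not taken from the paper: the layer size (256 input and output channels, 3 x 3 kernels), the Ghost ratio of 2, and above all the assumption that the primary convolution of the Ghost module is the one replaced by a depthwise separable convolution are all hypothetical choices made for illustration. The ECA kernel-size rule follows the adaptive formula published with ECA-Net (gamma = 2, b = 1). Biases and normalization layers are ignored.

```python
import math


def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def dw_separable_params(c_in, c_out, k):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out


def ghost_params(c_in, c_out, k, ratio=2, dw_k=3, separable_primary=False):
    """Weights in a Ghost module.

    A primary convolution produces c_out // ratio intrinsic feature maps;
    cheap depthwise dw_k x dw_k operations generate the remaining maps.
    separable_primary=True models the ASSUMED GDS variant, in which the
    primary convolution is swapped for a depthwise separable one.
    """
    intrinsic = c_out // ratio
    if separable_primary:
        primary = dw_separable_params(c_in, intrinsic, k)
    else:
        primary = conv_params(c_in, intrinsic, k)
    cheap = dw_k * dw_k * intrinsic * (ratio - 1)
    return primary + cheap


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1-D kernel size of ECA: nearest odd value of (log2(C) + b) / gamma."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1


# Hypothetical layer size, chosen only for illustration.
c_in, c_out, k = 256, 256, 3

standard = conv_params(c_in, c_out, k)
ghost = ghost_params(c_in, c_out, k)
gds = ghost_params(c_in, c_out, k, separable_primary=True)
# ECA adds a single 1-D convolution over the channel descriptor.
gds_eca = gds + eca_kernel_size(c_out)

for name, n in [("standard conv", standard), ("Ghost", ghost),
                ("GDS (assumed)", gds), ("GDS-ECA (assumed)", gds_eca)]:
    print(f"{name:>18}: {n:>7} weights ({n / standard:.1%} of standard)")
```

Under these assumptions the attention branch costs only a handful of weights, so nearly all of the saving comes from the convolution replacement. The actual figures for the published network depend on its real layer configuration.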

Key words: semantic segmentation, simultaneous localization and mapping (SLAM), lightweight, attention mechanism, feature pyramid network

CLC number:

TP183

LIAN Xiaofeng, KANG Maomao, TAN Li, WANG Yanli. A Semantic Segmentation Network Based on Lightweight Convolutional Modules[J]. Journal of Applied Sciences, 2025, 43(1): 66-79.
