To address two problems of real-time semantic segmentation in scenes where object sizes vary widely, namely misclassified small objects and holes in the segmentation of large objects, this paper proposes a real-time semantic segmentation algorithm based on a composite three-branch structure and deep feature encoding. The algorithm consists of a composite three-branch module (CTBM), a deep feature encoding module (DFEM), and a dual-branch multi-layer perceptron (DBMLP). The CTBM uses a two-layer multi-scale feature extraction and fusion strategy to extract information from different perspectives, so that the model better perceives the global relationships between features and produces fewer holes in large-object segmentation results. The DFEM uses an encoding method to strengthen the model's representation of deep features, so that it better captures the semantic information of small objects and segments them more accurately. The DBMLP uses both global and local features to fuse multi-scale semantic information, yielding smoother edges and more accurate contours. Evaluations on the Cityscapes and ADE20K datasets show that the algorithm meets real-time speed requirements, achieving an mIoU of 74.2% at 42.6 FPS on Cityscapes and 40.4% at 45.3 FPS on ADE20K, and clearly outperforms other real-time semantic segmentation algorithms.
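For readers unfamiliar with the reported metric, the following is a minimal sketch of how mean intersection over union (mIoU) is commonly computed from a confusion matrix. It is not the authors' evaluation code. The function names `confusion_matrix` and `mean_iou`, the flat label lists, and the ignore label of 255 are assumptions made for illustration only.

```python
from typing import List, Sequence


def confusion_matrix(pred: Sequence[int], target: Sequence[int],
                     num_classes: int, ignore_index: int = 255) -> List[List[int]]:
    """Count pixels per (ground-truth class, predicted class) pair.

    Rows index the ground-truth class, columns the predicted class.
    Pixels whose ground-truth label equals ignore_index are skipped
    (255 is an assumed convention, not taken from the paper).
    """
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        matrix[t][p] += 1
    return matrix


def mean_iou(matrix: List[List[int]]) -> float:
    """Average the per-class IoU = TP / (TP + FP + FN).

    Classes that appear in neither the prediction nor the ground truth
    are left out of the average.
    """
    n = len(matrix)
    ious = []
    for c in range(n):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp                         # class c pixels predicted as another class
        fp = sum(matrix[r][c] for r in range(n)) - tp    # other pixels predicted as class c
        denom = tp + fp + fn
        if denom == 0:
            continue
        ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes, last pixel ignored.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
score = mean_iou(confusion_matrix(pred, target, num_classes=3))
print(f"mIoU = {score:.4f}")
```

In the toy example the per-class IoU values are 1/2, 2/3, and 1, so the mean is 13/18. In practice the confusion matrix is accumulated over every image in the validation set before the per-class IoU is taken, rather than averaging per-image scores.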