This paper proposes a text detection model based on the mask region-based convolutional neural network (Mask R-CNN). First, to enlarge the receptive field of the model while preserving its efficiency as far as possible, the bottleneck structure of the residual network is optimized, yielding a residual network based on structural optimization (ResNetSO). Second, to remove redundant features and improve the quality of the fused features, a spatial attention mechanism is applied to the feature pyramid network, yielding a feature pyramid network based on lower feature guidance (FPNetLFG). Experiments on two public datasets show that adding the ResNetSO and FPNetLFG modules to the cascade region-based convolutional neural network (Cascade R-CNN) and to the detector with recursive feature pyramid and switchable atrous convolution (DetectoRS) improves the F1 value by about 0.8% and 0.3%, respectively, which indicates that the method is both effective and generally applicable.
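For readers unfamiliar with the metric, the F1 value reported above is the harmonic mean of detection precision and recall. The following is a minimal sketch of that computation only; the function name `f1_score` and the counts used are hypothetical illustrations, not the evaluation code or results of this paper.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall for a set of detections."""
    detected = true_positives + false_positives      # all predicted text regions
    ground_truth = true_positives + false_negatives  # all annotated text regions
    if true_positives == 0 or detected == 0 or ground_truth == 0:
        return 0.0  # no correct detections: F1 is defined as 0 here
    precision = true_positives / detected
    recall = true_positives / ground_truth
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not results from the paper).
baseline = f1_score(true_positives=80, false_positives=20, false_negatives=20)
improved = f1_score(true_positives=82, false_positives=19, false_negatives=18)
gain_in_points = (improved - baseline) * 100  # difference in percentage points
```

Under this definition, a reported gain such as 0.8% is usually read as a difference in percentage points between two F1 values, as computed in `gain_in_points`.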
[1] He K, Gkioxari G, Dollár P, et al. Mask R-CNN [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(2): 386-397.
[2] Xie S, Girshick R, Dollár P, et al. Aggregated residual transformations for deep neural networks [C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017: 1492-1500.
[3] Hu J, Shen L, Sun G. Squeeze-and-excitation networks [C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018: 7132-7141.
[4] Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection [C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017: 2117-2125.
[5] Liu S, Qi L, Qin H, et al. Path aggregation network for instance segmentation [C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018: 8759-8768.
[6] Tan M, Pang R, Le Q V. EfficientDet: scalable and efficient object detection [C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 2020: 10781-10790.
[7] Picron C, Tuytelaars T. Trident pyramid networks: the importance of processing at the feature pyramid level for better object detection [J/OL]. (2021-10-08) [2022-05-30]. https://arxiv.org/abs/2110.04004.
[8] Gao S H, Cheng M M, Zhao K, et al. Res2Net: a new multi-scale backbone architecture [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(2): 652-662.
[9] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition [C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016: 770-778.
[10] Cai Z, Vasconcelos N. Cascade R-CNN: delving into high quality object detection [C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018: 6154-6162.
[11] Chng C K, Liu Y, Sun Y, et al. ICDAR2019 robust reading challenge on arbitrary-shaped text - RRC-ArT [C]//Proceedings of the 15th IEEE International Conference on Document Analysis and Recognition, Sydney, Australia, 2019: 1571-1576.
[12] Ch’ng C K, Chan C S. Total-text: a comprehensive dataset for scene text detection and recognition [C]//Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition, Kyoto, Japan, 2017, 1: 935-942.
[13] Liu Y, Jin L, Zhang S, et al. Curved scene text detection via transverse and longitudinal sequence connection [J]. Pattern Recognition, 2019, 90: 337-345.
[14] Chen K, Wang J, Pang J, et al. MMDetection: Open MMLab detection toolbox and benchmark [J/OL]. (2019-06-17) [2022-05-30]. https://arxiv.org/abs/1906.07155.
[15] Zhang H, Wu C, Zhang Z, et al. ResNeSt: split-attention networks [C]//2022 IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 2022: 2735-2745.
[16] Robbins H, Monro S. A stochastic approximation method [J]. The Annals of Mathematical Statistics, 1951, 22(3): 400-407.
[17] Deng J, Dong W, Socher R, et al. ImageNet: a large-scale hierarchical image database [C]//2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 2009: 248-255.