To address the task of fine-grained classification of scene images, this paper proposes a fine-grained image classification method based on an attention-network inference graph that integrates multimodal visual and textual information. First, we extract the global visual feature, local visual features, and textual features of a scene image, then embed position information into the local visual features and textual features respectively to form new spliced features. Each spliced feature serves as a node of the graph structure, yielding a heterogeneous graph. Next, we design two meta-paths to decompose the heterogeneous graph into two homogeneous graphs, and feed them into a two-level attention inference network with node-level attention and semantic-level attention. Finally, a richer fine-grained feature representation is obtained by multimodal fusion of the output node features with the global visual feature. The proposed model effectively combines multimodal fusion with a graph attention network, and is strongly competitive with current advanced mainstream methods on the two scene-text fine-grained image datasets, Con-Text and Drink Bottle.
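The two-level attention described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: node features, the two meta-path adjacency matrices, and all weight vectors (`W`, `a_src`, `a_dst`, `q`) are random placeholders standing in for the spliced local-visual/textual node features and learned parameters. Node-level attention uses a GAT-style additive score over graph neighbours; semantic-level attention then weights the two meta-path embeddings before fusion with the global visual feature.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8  # hypothetical: 6 graph nodes, 8-dimensional spliced features


def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def node_attention(H, adj, W, a_src, a_dst):
    """Node-level attention: each node attends over its neighbours in one
    meta-path graph (GAT-style additive scoring)."""
    Z = H @ W                                  # project node features
    s, t = Z @ a_src, Z @ a_dst                # per-node score components
    scores = np.where(adj, np.tanh(s[:, None] + t[None, :]), -np.inf)
    alpha = softmax(scores, axis=1)            # attention over neighbours
    return alpha @ Z                           # weighted aggregation


def semantic_attention(Zs, q):
    """Semantic-level attention: weight each meta-path embedding by a
    learned semantic score, then combine."""
    w = softmax(np.array([np.tanh(Z @ q).mean() for Z in Zs]))
    return sum(wi * Zi for wi, Zi in zip(w, Zs)), w


def random_graph(p=0.4):
    """Random symmetric adjacency with self-loops (placeholder meta-path)."""
    A = rng.random((n, n)) < p
    A = A | A.T
    np.fill_diagonal(A, True)
    return A


H = rng.normal(size=(n, d))                    # spliced node features
W = rng.normal(size=(d, d))
a_src, a_dst, q = (rng.normal(size=d) for _ in range(3))

# one homogeneous graph per meta-path, then the two attention levels
Z1 = node_attention(H, random_graph(), W, a_src, a_dst)
Z2 = node_attention(H, random_graph(), W, a_src, a_dst)
fused_nodes, w = semantic_attention([Z1, Z2], q)

g = rng.normal(size=d)                         # global visual feature
out = np.concatenate([fused_nodes.mean(axis=0), g])  # multimodal fusion
```

In the actual model the pooling and fusion steps are learned rather than a plain mean/concatenation, but the control flow — per-meta-path node attention, semantic weighting, then fusion with the global feature — follows the pipeline in the abstract.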