To address the task of fine-grained classification of scene images, this paper proposes a fine-grained image classification method based on an attention-network inference graph that integrates multimodal visual and textual information. First, we extract the global visual feature, local visual features, and textual features of a scene image, then embed position information into the local visual features and textual features respectively to form new spliced features. Each spliced feature serves as a node of the graph structure, yielding a heterogeneous graph. Next, we design two meta-paths to decompose the heterogeneous graph into two homogeneous graphs, and feed them into a two-level attention inference network with node-level attention and semantic-level attention. Finally, a richer fine-grained feature representation is obtained by multimodal fusion of the output node features with the global visual feature. The proposed model effectively combines multimodal fusion with a graph attention network, and is strongly competitive with current advanced mainstream methods on the two scene-text fine-grained image datasets, Con-Text and Drink Bottle.
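The two-level attention described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: node features, the two meta-path adjacency matrices, and all weight vectors (`W`, `a_src`, `a_dst`, `q`) are random placeholders standing in for the spliced local-visual/textual node features and learned parameters. Node-level attention uses a GAT-style additive score over graph neighbours; semantic-level attention then weights the two meta-path embeddings before fusion with the global visual feature.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8  # hypothetical: 6 graph nodes, 8-dimensional spliced features


def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def node_attention(H, adj, W, a_src, a_dst):
    """Node-level attention: each node attends over its neighbours in one
    meta-path graph (GAT-style additive scoring)."""
    Z = H @ W                                  # project node features
    s, t = Z @ a_src, Z @ a_dst                # per-node score components
    scores = np.where(adj, np.tanh(s[:, None] + t[None, :]), -np.inf)
    alpha = softmax(scores, axis=1)            # attention over neighbours
    return alpha @ Z                           # weighted aggregation


def semantic_attention(Zs, q):
    """Semantic-level attention: weight each meta-path embedding by a
    learned semantic score, then combine."""
    w = softmax(np.array([np.tanh(Z @ q).mean() for Z in Zs]))
    return sum(wi * Zi for wi, Zi in zip(w, Zs)), w


def random_graph(p=0.4):
    """Random symmetric adjacency with self-loops (placeholder meta-path)."""
    A = rng.random((n, n)) < p
    A = A | A.T
    np.fill_diagonal(A, True)
    return A


H = rng.normal(size=(n, d))                    # spliced node features
W = rng.normal(size=(d, d))
a_src, a_dst, q = (rng.normal(size=d) for _ in range(3))

# one homogeneous graph per meta-path, then the two attention levels
Z1 = node_attention(H, random_graph(), W, a_src, a_dst)
Z2 = node_attention(H, random_graph(), W, a_src, a_dst)
fused_nodes, w = semantic_attention([Z1, Z2], q)

g = rng.normal(size=d)                         # global visual feature
out = np.concatenate([fused_nodes.mean(axis=0), g])  # multimodal fusion
```

In the actual model the pooling and fusion steps are learned rather than a plain mean/concatenation, but the control flow — per-meta-path node attention, semantic weighting, then fusion with the global feature — follows the pipeline in the abstract.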