Journal of Applied Sciences, 2022, Vol. 40, Issue (1): 36-46. doi: 10.3969/j.issn.0255-8297.2022.01.004

• Special Issue on Computer Applications •

Fine-Grained Image Classification Based on Inference Graph of Attention Network

ZHENG Zhiwen1,2, GAN Jianhou1,2, ZHOU Juxiang1,2, OUYANG Zhaoxiang1,3, LU Zeguang4   

  1. Key Laboratory of Education Informatization for Nationalities, Ministry of Education, Yunnan Normal University, Kunming 650500, Yunnan, China;
  2. Yunnan Key Laboratory of Smart Education, Yunnan Normal University, Kunming 650500, Yunnan, China;
  3. School of Information, Dehong Teacher's College, Dehong 678400, Yunnan, China;
  4. National Academy of Guoding Institute of Data Science, Beijing 100010, China
  • Received: 2021-11-15; Published: 2022-01-28

Abstract: To address fine-grained classification of scene images, this paper proposes a method based on an attention network inference graph that integrates the visual and textual information of an image. First, we extract the global visual feature, local visual features, and text features of the scene image, and embed position information into the local visual features and the text features to form new concatenated features. Each concatenated feature serves as a node of a graph structure, yielding a heterogeneous graph. Then, we design two meta-paths that decompose the heterogeneous graph into two homogeneous graphs, which are fed into a two-level attention network inference graph with node-level attention and semantic-level attention. Finally, the output node features are fused with the global visual feature through multimodal fusion operations to obtain a richer fine-grained feature representation. The proposed model effectively combines multimodal fusion with a graph attention network, and is strongly competitive with current mainstream methods on two scene-text fine-grained image datasets, Con-Text and Drink Bottle.
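
The two-level attention described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a GAT-style score for node-level attention and a shared projection followed by a softmax over meta-paths for semantic-level attention, and every name, shape, and random input below (`node_level_attention`, `semantic_level_attention`, `W_a`, `Wq`, the adjacency matrices) is hypothetical. Feature extraction, position embedding, and the final fusion with the global visual feature are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def node_level_attention(h, adj, W, a):
    """Aggregate neighbour features within ONE meta-path graph (assumed GAT-style)."""
    n = h.shape[0]
    z = h @ W                                   # project node features: (n, d_out)
    d = z.shape[1]
    # e[i, j] = LeakyReLU(a^T [z_i || z_j]), split into a source and a target term
    e = (z @ a[:d])[:, None] + (z @ a[d:])[None, :]
    e = np.where(e > 0, e, 0.2 * e)
    mask = (adj + np.eye(n)) > 0                # neighbours plus self-loops
    e = np.where(mask, e, -1e9)                 # non-neighbours get ~zero weight
    alpha = softmax(e, axis=1)                  # each row sums to 1
    return alpha @ z, alpha

def semantic_level_attention(zs, Wq, bq, q):
    """Weight the per-meta-path node embeddings and sum them (assumed form)."""
    # One scalar importance score per meta-path, averaged over nodes.
    scores = np.array([np.mean(np.tanh(z @ Wq + bq) @ q) for z in zs])
    beta = softmax(scores)                      # weights over meta-paths
    fused = sum(w * z for w, z in zip(beta, zs))
    return fused, beta

# Toy inputs: 6 nodes with 8-dim features and two meta-path adjacency matrices.
rng = np.random.default_rng(0)
n, d_in, d_out, d_att = 6, 8, 4, 5
h = rng.normal(size=(n, d_in))
adj_a = (rng.random((n, n)) < 0.4).astype(float)
adj_b = (rng.random((n, n)) < 0.4).astype(float)

W_a, W_b = rng.normal(size=(d_in, d_out)), rng.normal(size=(d_in, d_out))
a_a, a_b = rng.normal(size=2 * d_out), rng.normal(size=2 * d_out)
z_a, alpha_a = node_level_attention(h, adj_a, W_a, a_a)
z_b, alpha_b = node_level_attention(h, adj_b, W_b, a_b)

Wq, bq, q = rng.normal(size=(d_out, d_att)), np.zeros(d_att), rng.normal(size=d_att)
fused, beta = semantic_level_attention([z_a, z_b], Wq, bq, q)
print(fused.shape)  # (6, 4)
```

In this sketch, node-level attention decides how much each neighbour contributes within a single meta-path graph, while semantic-level attention decides how much each meta-path contributes to the final node representation.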

Key words: scene image, multimodal, graph attention network, node-level attention, semantic-level attention
