Journal of Applied Sciences ›› 2022, Vol. 40 ›› Issue (1): 36-46. doi: 10.3969/j.issn.0255-8297.2022.01.004

• Special Issue on Computer Applications •

  • Corresponding author: ZHOU Juxiang, doctoral student, associate researcher; research interests: computer vision and machine learning. E-mail: zjuxiang@ynnu.edu.cn
  • Funding: Supported by the National Natural Science Foundation of China (No. 62166050)

Fine-Grained Image Classification Based on Inference Graph of Attention Network

ZHENG Zhiwen1,2, GAN Jianhou1,2, ZHOU Juxiang1,2, OUYANG Zhaoxiang1,3, LU Zeguang4   

  1. Key Laboratory of Education Informatization for Nationalities, Ministry of Education, Yunnan Normal University, Kunming 650500, Yunnan, China;
    2. Yunnan Key Laboratory of Smart Education, Yunnan Normal University, Kunming 650500, Yunnan, China;
    3. School of Information, Dehong Teacher's College, Dehong 678400, Yunnan, China;
    4. National Academy of Guoding Institute of Data Science, Beijing 100010, China
  • Received:2021-11-15 Published:2022-01-28



Abstract: For the task of fine-grained classification of scene images, this paper proposes a fine-grained image classification method based on an attention-network inference graph that integrates multimodal visual and textual information. First, we extract the global visual feature, local visual features, and text features of the scene image, embed position information into the local visual features and text features respectively, and concatenate them into new features. Each such feature is then used as a node to generate a heterogeneous graph. Next, we design two meta-paths to decompose the heterogeneous graph into two homogeneous graphs and feed them into a two-level attention network inference graph with node-level attention and semantic-level attention. Finally, a richer fine-grained feature representation is obtained by multimodal fusion of the output node features with the global visual feature. The proposed model effectively combines multimodal fusion with graph attention networks and is strongly competitive with current mainstream state-of-the-art methods on the two scene-text fine-grained image datasets Con-Text and Drink Bottle.
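The two-level attention scheme described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the node-level step below follows the standard GAT-style attention formulation, the semantic-level step weights the per-meta-path embeddings with a learned query vector, and all parameter names, dimensions, and the two toy meta-path graphs (a chain and a clique) are illustrative assumptions.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def node_level_attention(H, A, W, a_src, a_dst):
    """GAT-style node-level attention on one meta-path (homogeneous) graph.
    H: (N, F) node features; A: (N, N) 0/1 adjacency with self-loops;
    W: (F, Fp) projection; a_src, a_dst: (Fp,) attention parameters."""
    Z = H @ W                                                    # project nodes
    e = leaky_relu((Z @ a_src)[:, None] + (Z @ a_dst)[None, :])  # (N, N) logits
    e = np.where(A > 0, e, -np.inf)                              # mask non-neighbors
    alpha = softmax(e, axis=1)                                   # per-node weights
    return alpha @ Z                                             # aggregated (N, Fp)

def semantic_level_attention(Zs, q):
    """Fuse per-meta-path embeddings Zs (list of (N, Fp) arrays) using a
    learned query q: (Fp,); returns the fused (N, Fp) embedding and the
    meta-path importance weights beta."""
    w = np.array([np.tanh(Z) @ q for Z in Zs]).mean(axis=1)  # one score per path
    beta = softmax(w)                                        # normalized importance
    fused = sum(b * Z for b, Z in zip(beta, Zs))
    return fused, beta

# Toy run: 5 nodes, 8-dim features, two meta-path graphs.
rng = np.random.default_rng(0)
N, F, Fp = 5, 8, 4
H = rng.normal(size=(N, F))
A1 = np.eye(N) + np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)  # chain
A2 = np.ones((N, N))                                                       # clique
Z1 = node_level_attention(H, A1, rng.normal(size=(F, Fp)),
                          rng.normal(size=Fp), rng.normal(size=Fp))
Z2 = node_level_attention(H, A2, rng.normal(size=(F, Fp)),
                          rng.normal(size=Fp), rng.normal(size=Fp))
fused, beta = semantic_level_attention([Z1, Z2], rng.normal(size=Fp))
```

In the paper's setting, the nodes would carry the position-augmented local visual and textual features, and the fused output would then be combined with the global visual feature in the final multimodal fusion step.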

Key words: scene image, multimodal, graph attention network, node-level attention, semantic-level attention
