Journal of Applied Sciences (应用科学学报) ›› 2019, Vol. 37 ›› Issue (4): 501-509. doi: 10.3969/j.issn.0255-8297.2019.04.007

• Signal and Information Processing •

  • About the author: JIN Huazhong, Associate Professor; research interests include machine learning and Internet of Things applications. E-mail: galaxy0522@163.com
  • Funding:
    Supported by the National Key Research and Development Program of China (No. 2016YFC0702000) and the Hubei Provincial Department of Education Fund (No. 省2014277)

An Image Caption Generation Model Combining Global and Local Features

JIN Huazhong, LIU Xiaolong, HU Zike   

  1. School of Computer Science, Hubei University of Technology, Wuhan 430068, China
  • Received:2019-03-12 Revised:2019-05-05 Online:2019-07-31 Published:2019-10-11



Abstract: To address the limitations of image caption models that rely on local features alone, an image caption generation model with an attention mechanism that combines local and global features is proposed. Within an encoder-decoder framework, the encoder extracts local and global image features using the Inception V3 and VGG16 networks respectively, and fuses these two feature scales to form the encoding. At the decoder, a long short-term memory (LSTM) network translates the extracted image features into natural language. The model is trained and tested on the Microsoft COCO dataset. Experimental results show that, compared with caption models based on local features alone, the proposed method extracts richer and more complete information from an image and generates sentences that describe the image content more accurately.
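The fusion-and-attention step described in the abstract can be sketched in numpy. This is a minimal illustration only: the feature dimensions (an 8×8×2048 InceptionV3 convolutional grid for local features, a 4096-d VGG16 fully-connected vector for the global feature), the projection matrices, and the dot-product attention form are assumptions for illustration, not the paper's exact configuration, and random values stand in for learned weights and real features.

```python
import numpy as np

# Hypothetical inputs: 64 local feature vectors (8x8 spatial positions of an
# InceptionV3 conv grid, 2048 channels each) and one 4096-d VGG16 global vector.
rng = np.random.default_rng(0)
local_feats = rng.standard_normal((64, 2048))   # local features, one per region
global_feat = rng.standard_normal(4096)         # global feature for the whole image

# Project both feature types into a shared embedding space before fusion.
# W_l and W_g stand in for learned projection matrices.
d = 512
W_l = rng.standard_normal((2048, d)) * 0.01
W_g = rng.standard_normal((4096, d)) * 0.01
local_emb = local_feats @ W_l                   # shape (64, d)
global_emb = global_feat @ W_g                  # shape (d,)

# Attention: score each region against the decoder LSTM's hidden state h_t,
# softmax the scores into weights, and form an attended context vector.
h_t = rng.standard_normal(d)                    # decoder hidden state (stand-in)
scores = local_emb @ h_t                        # shape (64,)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                        # softmax over the 64 regions
context = weights @ local_emb                   # shape (d,), attended local context

# Fuse the attended local context with the global embedding; the fused vector
# would feed the next LSTM decoding step.
fused = np.concatenate([context, global_emb])   # shape (2d,) = (1024,)
print(fused.shape)
```

At each decoding step the attention weights are recomputed from the current hidden state, so the decoder can focus on different image regions for different words while the global feature keeps whole-image context available throughout.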

Key words: image caption generation, attention mechanism, image feature, convolutional neural network(CNN), long short-term memory(LSTM)

CLC number: