Journal of Applied Sciences ›› 2019, Vol. 37 ›› Issue (4): 501-509. doi: 10.3969/j.issn.0255-8297.2019.04.007

• Signal and Information Processing •

An Image Caption Generation Model Combining Global and Local Features

JIN Huazhong, LIU Xiaolong, HU Zike   

  1. School of Computer Science, Hubei University of Technology, Wuhan 430068, China
  • Received: 2019-03-12  Revised: 2019-05-05  Online: 2019-07-31  Published: 2019-10-11

Abstract: To address the limitation of image caption models that rely only on local image features, an image caption generation model with an attention mechanism that combines local and global features is proposed. Under the encoder-decoder framework, local and global image features are extracted at the encoder by the Inception V3 and VGG16 network models, respectively, and the features at these two scales are fused to form the encoding result. At the decoder, a long short-term memory (LSTM) network translates the extracted image features into natural language. The proposed model is trained and tested on the Microsoft COCO dataset. Experimental results show that, compared with image caption models based only on local features, the proposed method extracts richer and more complete information from the image and generates more accurate sentences.
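The following PyTorch sketch illustrates the decoder side of the architecture described above: region-level (local) features and a pooled global feature are fused, and an LSTM with additive attention generates the caption token by token. It is not the authors' code; the feature dimensions (a 64-region, 2048-d grid standing in for Inception V3's last mixed block and a 512-d pooled VGG16 vector), the fusion scheme (concatenating the global vector to each region), and all names are illustrative assumptions.

import torch
import torch.nn as nn


class AttentionLSTMDecoder(nn.Module):
    """LSTM decoder with additive attention over fused local+global image features."""

    def __init__(self, vocab_size, local_dim=2048, global_dim=512,
                 embed_dim=256, hidden_dim=512, attn_dim=256):
        super().__init__()
        fused_dim = local_dim + global_dim          # each region also carries the global context
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn_feat = nn.Linear(fused_dim, attn_dim)   # projects region features
        self.attn_hid = nn.Linear(hidden_dim, attn_dim)   # projects decoder state
        self.attn_score = nn.Linear(attn_dim, 1)          # scalar score per region
        self.lstm = nn.LSTMCell(embed_dim + fused_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, local_feats, global_feat, captions):
        # local_feats: (B, R, local_dim), e.g. an 8x8 grid flattened to R=64 regions
        # global_feat: (B, global_dim), e.g. a globally pooled VGG16 conv feature
        # captions:    (B, T) token ids, used here with teacher forcing
        B, R, _ = local_feats.shape
        fused = torch.cat(
            [local_feats, global_feat.unsqueeze(1).expand(B, R, -1)], dim=-1)
        h = local_feats.new_zeros(B, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for t in range(captions.size(1)):
            # attention weights over the R regions, conditioned on the hidden state
            scores = self.attn_score(
                torch.tanh(self.attn_feat(fused) + self.attn_hid(h).unsqueeze(1)))
            context = (scores.softmax(dim=1) * fused).sum(dim=1)   # (B, fused_dim)
            step_in = torch.cat([self.embed(captions[:, t]), context], dim=-1)
            h, c = self.lstm(step_in, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)   # (B, T, vocab_size)


# Usage with random stand-ins for the CNN features:
decoder = AttentionLSTMDecoder(vocab_size=10000)
local = torch.randn(2, 64, 2048)    # hypothetical local (region) features
glob = torch.randn(2, 512)          # hypothetical global feature
caps = torch.randint(0, 10000, (2, 12))
print(decoder(local, glob, caps).shape)   # torch.Size([2, 12, 10000])

Concatenating the global vector to every region before attention is one plausible fusion choice; the paper's exact fusion and attention formulation may differ.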

Key words: image caption generation, attention mechanism, image feature, convolutional neural network (CNN), long short-term memory (LSTM)
