Signal and Information Processing

An Image Caption Generation Model Combining Global and Local Features

  • School of Computer Science, Hubei University of Technology, Wuhan 430068, China

Received date: 2019-03-12

Revised date: 2019-05-05

Online published: 2019-10-11

Abstract

An image caption generation model with an attention mechanism that combines local and global features is proposed to address the weakness of image caption models that rely on local image features alone. Under the encoder-decoder framework, the local and global features of an image are extracted at the encoder by the Inception V3 and VGG16 networks, respectively, and the image features at these two scales are fused to form the encoding result. At the decoder, a long short-term memory (LSTM) network translates the extracted image features into natural language. The proposed model is trained and tested on the Microsoft COCO dataset. Experimental results show that, compared with an image caption model based on local features only, the proposed method extracts richer and more complete information from an image and generates more accurate sentences.
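The abstract does not include code, so the following is only a minimal Keras/TensorFlow sketch of the encoder-side fusion it describes. The branch assignment (Inception V3 for the local feature maps, VGG16's "fc2" layer for the global vector), the 256-dimensional embedding, and the concatenate-then-project fusion are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of the two-branch encoder described in the abstract.
# Assumptions (not from the paper): Inception V3 supplies the local
# feature maps, VGG16's "fc2" layer supplies the global vector, and
# fusion is concatenate-then-project to a 256-d embedding.
import tensorflow as tf
from tensorflow.keras.applications import InceptionV3, VGG16


def build_encoder(embed_dim=256):
    # Local branch: 8x8x2048 spatial feature maps for a 299x299 input.
    local_cnn = InceptionV3(weights="imagenet", include_top=False)
    local_cnn.trainable = False

    # Global branch: 4096-d activations of VGG16's second FC layer.
    vgg = VGG16(weights="imagenet", include_top=True)
    global_cnn = tf.keras.Model(vgg.input, vgg.get_layer("fc2").output)
    global_cnn.trainable = False

    inception_in = tf.keras.Input(shape=(299, 299, 3))
    vgg_in = tf.keras.Input(shape=(224, 224, 3))

    # Flatten the 8x8 grid into 64 region vectors, then embed them.
    local = local_cnn(inception_in)                     # (B, 8, 8, 2048)
    local = tf.keras.layers.Reshape((64, 2048))(local)  # (B, 64, 2048)
    local = tf.keras.layers.Dense(embed_dim)(local)     # (B, 64, 256)

    # Embed the global vector and tile it over the 64 regions so the
    # two scales align for fusion.
    glob = tf.keras.layers.Dense(embed_dim)(global_cnn(vgg_in))  # (B, 256)
    glob = tf.keras.layers.RepeatVector(64)(glob)                # (B, 64, 256)

    # Fuse the two scales into one set of annotation vectors.
    fused = tf.keras.layers.Concatenate()([local, glob])         # (B, 64, 512)
    fused = tf.keras.layers.Dense(embed_dim, activation="relu")(fused)
    return tf.keras.Model([inception_in, vgg_in], fused)
```

In a full model, the attention-equipped LSTM decoder would, at each word step, compute soft attention weights over these 64 fused region vectors and condition the next word on their weighted sum, in the usual show-attend-and-tell fashion.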

Cite this article

JIN Huazhong, LIU Xiaolong, HU Zike. An Image Caption Generation Model Combining Global and Local Features[J]. Journal of Applied Sciences, 2019, 37(4): 501-509. DOI: 10.3969/j.issn.0255-8297.2019.04.007
