Signal and Information Processing

An Image Caption Generation Model Combining Global and Local Features

  • School of Computer Science, Hubei University of Technology, Wuhan 430068, China
JIN Huazhong, Associate Professor; research interests: machine learning, Internet of Things applications; E-mail: galaxy0522@163.com

Received date: 2019-03-12

  Revised date: 2019-05-05

  Online published: 2019-10-11

Funding

Supported by the National Key Research and Development Program of China (No. 2016YFC0702000) and the Hubei Provincial Department of Education Fund (No. 省2014277)



Cite this article

JIN Huazhong, LIU Xiaolong, HU Zike. An image caption generation model combining global and local features[J]. Journal of Applied Sciences, 2019, 37(4): 501-509. DOI: 10.3969/j.issn.0255-8297.2019.04.007

Abstract

To address the limitations of image caption models that rely solely on local features, an image caption generation model with an attention mechanism that combines local and global features is proposed. Under the encoder-decoder framework, the encoder extracts local and global image features with the Inception V3 and VGG16 networks respectively, and fuses these two feature scales to form the encoded representation. On the decoder side, a long short-term memory (LSTM) network translates the extracted image features into natural language. The proposed model is trained and tested on the Microsoft COCO dataset. Experimental results show that, compared with image caption models based only on local features, the proposed method extracts richer and more complete information from an image and generates sentences that describe the image content more accurately.
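The fusion-and-attention step outlined in the abstract can be sketched in a few lines of numpy. This is a minimal illustration only: the feature dimensions, the concatenation-style fusion, and the bilinear scoring matrix `W` are assumptions for the sketch, not the paper's exact configuration.

```python
import numpy as np

def fuse_features(local_feats, global_feat):
    """Tile the image-level (global) feature and concatenate it onto each
    region-level (local) feature, giving one fused multi-scale feature map.

    local_feats: (R, Dl) array of R region features (e.g. from Inception V3)
    global_feat: (Dg,) image-level feature vector (e.g. from VGG16)
    Returns an (R, Dl + Dg) fused feature map.
    """
    tiled = np.tile(global_feat, (local_feats.shape[0], 1))  # (R, Dg)
    return np.concatenate([local_feats, tiled], axis=1)

def soft_attention(feats, query, W):
    """Soft attention: score each region against the decoder state,
    normalize with a softmax, and return the weighted context vector."""
    scores = feats @ W @ query            # (R,) one score per region
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over regions
    context = weights @ feats             # (D,) attended context vector
    return context, weights

# Toy example with made-up sizes
rng = np.random.default_rng(0)
local = rng.normal(size=(64, 512))        # 64 regions, 512-d local features
glob = rng.normal(size=(256,))            # 256-d global feature
fused = fuse_features(local, glob)        # (64, 768)
W = rng.normal(size=(768, 128))           # hypothetical scoring matrix
h = rng.normal(size=(128,))               # hypothetical LSTM hidden state
ctx, w = soft_attention(fused, h, W)      # ctx feeds the next LSTM step
```

At each decoding step, the LSTM hidden state plays the role of `query`, so the attention weights shift across image regions as the sentence is generated.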
