Journal of Applied Sciences (应用科学学报) ›› 2019, Vol. 37 ›› Issue (4): 501-509. doi: 10.3969/j.issn.0255-8297.2019.04.007

• Signal and Information Processing •

  • About the author: JIN Huazhong, Associate Professor; research interests include machine learning and Internet of Things applications. E-mail: galaxy0522@163.com
  • Funding:
    Supported by the National Key Research and Development Program of China (No. 2016YFC0702000) and the Hubei Provincial Department of Education Fund (No. 省2014277)

An Image Caption Generation Model Combining Global and Local Features

JIN Huazhong, LIU Xiaolong, HU Zike   

  1. School of Computer Science, Hubei University of Technology, Wuhan 430068, China
  • Received:2019-03-12 Revised:2019-05-05 Online:2019-07-31 Published:2019-10-11



Abstract: To address the limitations of image caption models that rely on local features alone, an image caption generation model with an attention mechanism that combines local and global features is proposed. Within an encoder-decoder framework, the encoder extracts local and global image features using the Inception V3 and VGG16 networks respectively, and fuses these two feature scales to form the encoding. At the decoder, a long short-term memory (LSTM) network translates the extracted image features into natural language. The model is trained and tested on the Microsoft COCO dataset. Experimental results show that, compared with caption models based on local features alone, the proposed method extracts richer and more complete information from an image and generates sentences that describe the image content more accurately.
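The fusion-and-attention step described in the abstract can be sketched in numpy. This is a minimal illustration only: the feature dimensions (an 8×8×2048 InceptionV3 convolutional grid for local features, a 4096-d VGG16 fully-connected vector for the global feature), the projection matrices, and the dot-product attention form are assumptions for illustration, not the paper's exact configuration, and random values stand in for learned weights and real features.

```python
import numpy as np

# Hypothetical inputs: 64 local feature vectors (8x8 spatial positions of an
# InceptionV3 conv grid, 2048 channels each) and one 4096-d VGG16 global vector.
rng = np.random.default_rng(0)
local_feats = rng.standard_normal((64, 2048))   # local features, one per region
global_feat = rng.standard_normal(4096)         # global feature for the whole image

# Project both feature types into a shared embedding space before fusion.
# W_l and W_g stand in for learned projection matrices.
d = 512
W_l = rng.standard_normal((2048, d)) * 0.01
W_g = rng.standard_normal((4096, d)) * 0.01
local_emb = local_feats @ W_l                   # shape (64, d)
global_emb = global_feat @ W_g                  # shape (d,)

# Attention: score each region against the decoder LSTM's hidden state h_t,
# softmax the scores into weights, and form an attended context vector.
h_t = rng.standard_normal(d)                    # decoder hidden state (stand-in)
scores = local_emb @ h_t                        # shape (64,)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                        # softmax over the 64 regions
context = weights @ local_emb                   # shape (d,), attended local context

# Fuse the attended local context with the global embedding; the fused vector
# would feed the next LSTM decoding step.
fused = np.concatenate([context, global_emb])   # shape (2d,) = (1024,)
print(fused.shape)
```

At each decoding step the attention weights are recomputed from the current hidden state, so the decoder can focus on different image regions for different words while the global feature keeps whole-image context available throughout.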

Key words: image caption generation, attention mechanism, image feature, convolutional neural network(CNN), long short-term memory(LSTM)

CLC number: