[1] Yang X, Zhang T, Xu C. Cross-domain feature learning in multimedia[J]. IEEE Transactions on Multimedia, 2015, 17(1):64-78.
[2] 金志威, 曹娟, 王博, 等. 融合多模态特征的社会多媒体谣言检测及技术研究[J]. 南京信息工程大学学报(自然科学版), 2017, 9(6):583-592.
Jin Z W, Cao J, Wang B, et al. Research on social multimedia rumor detection and technology integrating multi-modal features[J]. Journal of Nanjing University of Information Science & Technology (Natural Science Edition), 2017, 9(6):583-592. (in Chinese)
[3] Wang B, Lin D, Xiong H, et al. Joint inference of objects and scenes with efficient learning of text-object-scene relations[J]. IEEE Transactions on Multimedia, 2016, 18(3):507-520.
[4] Movshovitz-Attias Y, Yu Q, Stumpe M C, et al. Ontological supervision for fine grained classification of street view storefronts[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015:1693-1702.
[5] Bai X, Yang M K, Lyu P Y, et al. Integrating scene text and visual appearance for fine-grained image classification[J]. IEEE Access, 2018, 6:66322-66335.
[6] Mafla A, Dey S, Biten A F, et al. Multi-modal reasoning graph for scene-text based fine-grained image classification and retrieval[C]//Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision, 2021:4023-4033.
[7] 何云飞, 张以文, 吕智慧, 等. 异质信息网络中元路径感知的评分协同过滤[J]. 计算机学报, 2020, 43(12):2385-2397.
He Y F, Zhang Y W, Lü Z H, et al. Meta path-aware rating collaborative filtering in heterogeneous information network[J]. Chinese Journal of Computers, 2020, 43(12):2385-2397. (in Chinese)
[8] 孙鑫, 刘学军, 李斌, 等. 基于图神经网络和时间注意力的会话序列推荐[J]. 计算机工程与设计, 2020, 41(10):2913-2920.
Sun X, Liu X J, Li B, et al. Graph neural networks with time attention mechanism for session-based recommendations[J]. Computer Engineering and Design, 2020, 41(10):2913-2920. (in Chinese)
[9] 郭戈, 平西建, 张涛. 基于概念选择和重要性度量的多模态语义融合[J]. 应用科学学报, 2010, 28(3):266-270.
Guo G, Ping X J, Zhang T. Multimodal fusion based on concept selection and importance measure[J]. Journal of Applied Sciences, 2010, 28(3):266-270. (in Chinese)
[10] 张晓龙, 王庆伟, 李尚滨. 基于强化学习的多模态场景人体危险行为识别方法[J]. 应用科学学报, 2021, 39(4):605-614.
Zhang X L, Wang Q W, Li S B. Recognition method of human dangerous behavior in multimodal scenes using reinforcement learning[J]. Journal of Applied Sciences, 2021, 39(4):605-614. (in Chinese)
[11] Liu C X, Mao Z D, Zhang T Z, et al. Graph structured network for image-text matching[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020:10921-10930.
[12] Singh A K, Mishra A, Shekhar S, et al. From strings to things:knowledge-enabled VQA model that can read and reason[C]//Proceedings of IEEE/CVF International Conference on Computer Vision, 2019:4602-4612.
[13] Yang X, Tang K H, Zhang H W, et al. Auto-encoding scene graphs for image captioning[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019:10685-10694.
[14] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016:770-778.
[15] Ren S Q, He K M, Girshick R, et al. Faster R-CNN:towards real-time object detection with region proposal networks[J]. Advances in Neural Information Processing Systems, 2015:91-99.
[16] Bojanowski P, Grave E, Joulin A, et al. Enriching word vectors with subword information[J]. Transactions of the Association for Computational Linguistics, 2017, 5:135-146.
[17] Deng J, Dong W, Socher R, et al. ImageNet:a large-scale hierarchical image database[C]//IEEE Conference on Computer Vision and Pattern Recognition, 2009:248-255.
[18] Krishna R, Zhu Y K, Groth O, et al. Visual genome:connecting language and vision using crowdsourced dense image annotations[J]. International Journal of Computer Vision, 2017, 123(1):32-73.
[19] Karaoglu S, van Gemert J C, Gevers T. Con-text:text detection using background connectivity for fine-grained object classification[C]//Proceedings of the 21st ACM International Conference on Multimedia, 2013:757-760.
[20] Bai X, Yang M K, Lyu P Y, et al. Integrating scene text and visual appearance for fine-grained image classification[J]. IEEE Access, 2018, 6:66322-66335.
[21] Liu L Y, Jiang H M, He P C, et al. On the variance of the adaptive learning rate and beyond[C/OL]//arXiv preprint arXiv:1908.03265, (2021-10-26)[2021-11-05]. https://arxiv.org/abs/1908.03265.
[22] Karaoglu S, Tao R, Gevers T, et al. Words matter:scene text for image classification and retrieval[J]. IEEE Transactions on Multimedia, 2017, 19(5):1063-1076.
[23] Mafla A, Dey S, Biten A F, et al. Fine-grained image classification and retrieval by combining visual and locally pooled textual features[C]//Proceedings of IEEE Winter Conference on Applications of Computer Vision, 2020:2950-2959.
[24] Kim J H, On K W, Lim W, et al. Hadamard product for low-rank bilinear pooling[C/OL]//arXiv preprint arXiv:1610.04325, (2017-03-26)[2021-05-15]. https://arxiv.org/abs/1610.04325.
[25] Ben-Younes H, Cadene R, Thome N, et al. Block:bilinear superdiagonal fusion for visual question answering and visual relationship detection[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1).