Research on Multimodal Visual Question Answering Based on Fine-grained Feature Enhancement

Authors:

WANG Zhiwei, LU Zhenyu

Affiliation:

Nanjing University of Information Science & Technology

Foundation items:

Joint Fund Project of the Zhejiang Provincial Natural Science Foundation (LZJMD25D050002); Joint Key Project of the National Natural Science Foundation of China (U20B2061)

Abstract:

Existing multimodal Visual Question Answering (VQA) models overlook the fine-grained interaction between locally salient information in images and local key words in the text, so the semantic correlation between images and text remains insufficient. To address this, this paper proposes a multimodal visual question answering method based on fine-grained feature enhancement. First, a fine-grained feature extraction method is added to the visual and textual branches, respectively, so that the semantic features of images and questions can be extracted more comprehensively and accurately. Then, to exploit the alignment information between modalities at different levels, an alignment-guided self-attention module is proposed to align the fine-grained features with the global semantic features within a single modality (visual or textual) and to fuse unimodal information from different levels in a unified manner. Finally, experiments on the VQA v2.0 and VQA-CP v2 datasets show that the proposed method outperforms existing models on all visual question answering evaluation metrics.
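
The abstract does not give the exact formulation of the alignment-guided self-attention module, so the following PyTorch sketch is only an illustration of the general idea: fine-grained tokens of one modality (image regions or question words) are aligned with that modality's global semantic features and the two levels are fused in a unified way. All layer names, tensor shapes, and the concat-based fusion are assumptions made for illustration, not the paper's implementation.

```python
# Hypothetical sketch: alignment-guided fusion of fine-grained and global
# features within a single modality. Design details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlignmentGuidedSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, fine: torch.Tensor, global_: torch.Tensor) -> torch.Tensor:
        # fine:    (B, N, D) fine-grained tokens of one modality (regions or words)
        # global_: (B, M, D) global semantic features of the same modality
        # Alignment matrix between the two feature levels, shape (B, N, M).
        align = F.softmax(fine @ global_.transpose(1, 2) / fine.size(-1) ** 0.5, dim=-1)
        # Project global context onto each fine-grained token via the alignment.
        aligned_global = align @ global_                      # (B, N, D)
        # Self-attention over the fine-grained tokens, then residual + norm.
        attn_out, _ = self.attn(fine, fine, fine)
        fine = self.norm(fine + attn_out)
        # Fuse the two levels (simple concatenation + linear projection here).
        return self.fuse(torch.cat([fine, aligned_global], dim=-1))


# Example: 36 region features and 4 global feature vectors, both 512-dimensional.
x = AlignmentGuidedSelfAttention(512)(torch.randn(2, 36, 512), torch.randn(2, 4, 512))
print(x.shape)  # torch.Size([2, 36, 512])
```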

Cite this article

WANG Zhiwei, LU Zhenyu. Research on Multimodal Visual Question Answering Based on Fine-grained Feature Enhancement[J]. Journal of Nanjing University of Information Science & Technology,,():

Article history
  • Received: 2025-01-07
  • Revised: 2025-03-10
  • Accepted: 2025-03-10
