Research on Multimodal Visual Question Answering Based on Fine-Grained Feature Enhancement

Authors:
Wang Zhiwei, Nanjing University of Information Science & Technology
Lu Zhenyu, Nanjing University of Information Science & Technology

Affiliation: Nanjing University of Information Science & Technology
Funding: Zhejiang Provincial Natural Science Foundation Joint Fund (LZJMD25D050002); National Natural Science Foundation of China Joint Key Program (U20B2061)
Abstract:

Existing multimodal Visual Question Answering (VQA) models overlook the fine-grained interaction between locally salient information in images and key local words in the text, leaving room to improve the semantic relevance between images and text. To address this, this paper proposes a multimodal VQA method based on fine-grained feature enhancement. First, a fine-grained feature extraction method is added for both the visual and the textual modality, so that the semantic features of images and questions are extracted more comprehensively and accurately. Then, to exploit the alignment information between modality features at different levels, an alignment-guided self-attention module is proposed that aligns fine-grained features with global semantic features within a single modality (visual or textual) and fuses the unimodal information from different levels in a unified way. Finally, experiments on the VQA v2.0 and VQA-CP v2 datasets show that the proposed method outperforms existing models across the visual question answering evaluation metrics.

Keywords: visual question answering; multimodality; fine-grained; feature enhancement; entity alignment; feature fusion
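The abstract describes the alignment-guided self-attention module only at a high level, so the following is a minimal sketch of the general idea rather than the authors' module. It assumes that the global feature and the fine-grained features of one modality are treated as a single token sequence, and that plain scaled dot-product self-attention (without the learned query, key, and value projections a trained model would use) both scores how well each fine-grained feature aligns with the global feature and fuses the two levels. The function names `self_attention` and `fuse_levels` and the toy vectors are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention over a list of equal-length vectors.

    Assumption: no learned projections; queries, keys, and values are the
    tokens themselves, which keeps the sketch dependency-free.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w_j * v[i] for w_j, v in zip(w, tokens)) for i in range(d)])
    return outputs, weights


def fuse_levels(fine_feats, global_feat):
    """Fuse fine-grained and global features of ONE modality (hypothetical helper).

    The global feature is placed first in the sequence. Its attention weights
    over the fine-grained features serve as alignment scores, and its attended
    output serves as the fused representation.
    """
    tokens = [global_feat] + fine_feats
    attended, weights = self_attention(tokens)
    fused = attended[0]
    alignment = weights[0][1:]
    return fused, alignment


# Toy example: the first fine-grained feature points in the same direction
# as the global feature, the second is orthogonal to it.
fine = [[1.0, 0.0], [0.0, 1.0]]
glob = [1.0, 0.0]
fused, alignment = fuse_levels(fine, glob)
```

In this toy setup the fine-grained feature that matches the global feature receives the larger alignment weight. A trained model would learn the projections and would typically run the same procedure separately for the visual and the textual modality before cross-modal fusion.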

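Cite this article

Wang Zhiwei, Lu Zhenyu. Research on Multimodal Visual Question Answering Based on Fine-Grained Feature Enhancement[J]. Journal of Nanjing University of Information Science & Technology,,():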

History

Received: 2025-01-07
Last revised: 2025-03-10
Accepted: 2025-03-10

