Abstract: Existing multimodal Visual Question Answering (VQA) models overlook the fine-grained interaction between locally salient regions in images and individual words in question text, and their modeling of the semantic relevance between images and text remains insufficient. To address this, this paper proposes a multimodal visual question answering method based on fine-grained feature enhancement. First, fine-grained feature extraction is applied to the visual and textual branches separately so that the semantic features of images and questions are captured more comprehensively and accurately. Then, to exploit alignment information at different levels, an alignment-guided self-attention module is proposed that aligns fine-grained features with global semantic features within each single modality (visual or textual) and fuses unimodal information from different levels in a unified manner. Finally, experiments are conducted on the VQA v2.0 and VQA-CP v2 datasets. The results show that the proposed method outperforms existing models on a range of visual question answering evaluation metrics.
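
The abstract does not specify how the alignment-guided self-attention module is implemented; the following is a minimal, hypothetical PyTorch sketch of one way fine-grained features could be aligned and fused with a global feature within a single modality. The class name `AlignmentGuidedSelfAttention` and parameters such as `d_model` and `n_heads` are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch (not the authors' implementation): the global feature is
# treated as an extra token so that multi-head self-attention can align it with
# each fine-grained region/word feature of the same modality and fuse the levels.
import torch
import torch.nn as nn


class AlignmentGuidedSelfAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, fine: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        """
        fine:        (B, N, d_model) fine-grained region/word features
        global_feat: (B, d_model)    global image/question feature
        returns:     (B, N + 1, d_model) fused features across the two levels
        """
        # Prepend the global feature as an additional token.
        tokens = torch.cat([global_feat.unsqueeze(1), fine], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)
        # Residual connection and layer normalization.
        return self.norm(tokens + fused)


if __name__ == "__main__":
    # Example: 36 visual region features plus one global image feature.
    block = AlignmentGuidedSelfAttention(d_model=512, n_heads=8)
    regions = torch.randn(2, 36, 512)
    global_img = torch.randn(2, 512)
    print(block(regions, global_img).shape)  # torch.Size([2, 37, 512])
```

The same block could be applied symmetrically to the textual branch (word features plus a global question feature), which matches the abstract's description of handling each modality separately before fusion.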