Multimodal Encoders and Decoders with Gate Attention for Visual Question Answering

Haiyan Li; Dezhi Han

Computer Science and Information Systems, Volume 18 (2021) no. 3

This article was harvested from the Computer Science and Information Systems website.

Abstract

Visual Question Answering (VQA) is a multimodal research task that combines Computer Vision (CV) and Natural Language Processing (NLP). The core of the VQA task is to extract useful information from images and questions and to give an accurate answer to the question. This paper presents a VQA model based on multimodal encoders and decoders with gate attention (MEDGA). Each encoder and decoder block in MEDGA applies not only self-attention and cross-modal attention but also gate attention, so that the model can attend to intra-modal and inter-modal interactions of the visual and language modalities simultaneously. In addition, MEDGA uses gate attention to filter out noise that is irrelevant to the result and outputs attention results that are closely related to the visual and language features, which makes answer prediction more accurate. Experimental evaluations on the VQA 2.0 dataset, together with ablation experiments under different conditions, demonstrate the effectiveness of MEDGA. MEDGA reaches an accuracy of 70.11% on the test-std split, which exceeds many existing methods.
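
The abstract names three ingredients (self-attention, cross-modal attention, and gate attention) without giving their equations. As a rough illustration only, the sketch below shows one common way to gate an attention output: a sigmoid gate computed from the query and the attended features scales the result element-wise. The gate formulation, the function names (`attention`, `gated_attention`), the feature sizes, and the random weights are all assumptions made for this example; they are not taken from the paper and should not be read as the MEDGA definition.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention(q, k, v):
    # Standard scaled dot-product attention.
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))
    return weights @ v

def gated_attention(q, k, v, w_gate, b_gate):
    # Assumed gate (not from the paper): a sigmoid over [query, attended]
    # produces per-feature values in (0, 1) that scale the attended output,
    # so features judged irrelevant are pushed toward zero.
    attended = attention(q, k, v)
    gate = sigmoid(np.concatenate([q, attended], axis=-1) @ w_gate + b_gate)
    return gate * attended

rng = np.random.default_rng(0)
d = 8                                     # hypothetical feature size
question = rng.normal(size=(5, d))        # 5 word features
image = rng.normal(size=(10, d))          # 10 image-region features
w_gate = rng.normal(size=(2 * d, d)) * 0.1
b_gate = np.zeros(d)

# Intra-modal: the question attends to itself (self-attention).
intra = gated_attention(question, question, question, w_gate, b_gate)
# Inter-modal: the question attends to image regions (cross-modal attention).
inter = gated_attention(question, image, image, w_gate, b_gate)
```

Because the gate values lie strictly between 0 and 1, the gated output can only shrink the ungated attention result feature by feature; it never amplifies it. That is the sense in which a gate of this kind can act as a noise filter.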

Keywords: Deep Learning, Artificial Intelligence, Visual Question Answering, Gate Attention, Multimodal Learning

@article{CSIS_2021_18_3_a18,
     author = {Haiyan Li and Dezhi Han},
     title = {Multimodal {Encoders} and {Decoders} with {Gate} {Attention} for {Visual} {Question} {Answering}},
     journal = {Computer Science and Information Systems},
     year = {2021},
     volume = {18},
     number = {3},
     url = {http://geodesic.mathdoc.fr/item/CSIS_2021_18_3_a18/}
}

TY  - JOUR
AU  - Haiyan Li
AU  - Dezhi Han
TI  - Multimodal Encoders and Decoders with Gate Attention for Visual Question Answering
JO  - Computer Science and Information Systems
PY  - 2021
VL  - 18
IS  - 3
UR  - http://geodesic.mathdoc.fr/item/CSIS_2021_18_3_a18/
ID  - CSIS_2021_18_3_a18
ER  -

%0 Journal Article
%A Haiyan Li
%A Dezhi Han
%T Multimodal Encoders and Decoders with Gate Attention for Visual Question Answering
%J Computer Science and Information Systems
%D 2021
%V 18
%N 3
%U http://geodesic.mathdoc.fr/item/CSIS_2021_18_3_a18/
%F CSIS_2021_18_3_a18

Haiyan Li; Dezhi Han. Multimodal Encoders and Decoders with Gate Attention for Visual Question Answering. Computer Science and Information Systems, Volume 18 (2021) no. 3. http://geodesic.mathdoc.fr/item/CSIS_2021_18_3_a18/
