A visual question answering model based on image captioning

Kun Zhou, Qiongjie Liu, Dexin Zhao in Multimedia Systems vol. 30(6), published by Springer Science and Business Media LLC on Nov 25, 2024

DOI: 10.1007/s00530-024-01573-9

ISSNs: 0942-4962 · 1432-1882

Abstract

Image captioning and visual question answering (VQA) are two important tasks in artificial intelligence with wide practical application. The two tasks have much in common: both are cross-modal, combining computer vision and natural language processing, and both draw on largely the same knowledge and techniques. They can therefore be studied within a single model, with the image captioning results used to enhance the visual question answering output. However, current research on the two tasks has largely been conducted independently, and the accuracy of visual question answering still needs to be improved. This paper therefore proposes IC-VQA, a visual question answering model based on image captioning. The model first performs the image captioning part: it obtains rich visual information by constructing object geometric relations and utilizing mesh information, and then generates question-specific image captioning sentences by means of an Attention + Transformer framework. It then performs the visual question answering part: the previously generated captioning sentences are fused to answer the question through an Attention + LSTM framework, which significantly improves the accuracy of the answer. Experiments on the VQA1.0 and VQA2.0 datasets yielded overall accuracies of 70.1 and 70.85, respectively, significantly narrowing the gap with human performance. These results demonstrate the effectiveness of the IC-VQA model and show that fusing question-related captioning sentences can genuinely improve the accuracy of visual question answering.
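
To illustrate the general idea of fusing caption information into the answering step, the following is a minimal sketch of question-guided attention over image-region features and caption-token features. It is not the IC-VQA architecture: the function names (`attend`, `fuse`), the plain dot-product scoring, the toy two-dimensional vectors, and the fusion by concatenation are all assumptions made for illustration, and the Transformer captioner and LSTM answer module described in the abstract are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, items):
    """Dot-product attention (illustrative, hypothetical helper).

    Scores each item vector against the query, normalizes the scores
    with softmax, and returns the weighted sum plus the weights.
    """
    scores = [sum(q * x for q, x in zip(query, item)) for item in items]
    weights = softmax(scores)
    dim = len(items[0])
    pooled = [
        sum(w * item[d] for w, item in zip(weights, items))
        for d in range(dim)
    ]
    return pooled, weights


def fuse(question_vec, region_feats, caption_feats):
    """Concatenate the question vector with question-attended visual
    and caption summaries (assumed fusion scheme, not the paper's)."""
    visual, _ = attend(question_vec, region_feats)
    textual, _ = attend(question_vec, caption_feats)
    return question_vec + visual + textual  # list concatenation


if __name__ == "__main__":
    question = [1.0, 0.0]                 # toy question embedding
    regions = [[1.0, 0.0], [0.0, 1.0]]    # toy image-region features
    captions = [[2.0, 0.0], [0.0, 2.0]]   # toy caption-token features
    print(fuse(question, regions, captions))
```

In a full model, the fused vector would feed an answer classifier or decoder; here it only shows how a question can steer which image regions and which caption words contribute to the answer.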