Visual Question Answering (VQA)

Visual Question Answering (VQA) is a computer vision task in which a system answers natural-language questions about images. The goal is for machines to understand the content of an image and respond accurately and coherently in natural language. VQA has practical value in human-computer interaction, intelligent assistance, and content understanding, and progress on the task strengthens the visual understanding of machines.

VQA v2 test-dev

VQA v2 test-std

Gemini Ultra (pixel only)

VizWiz 2020 VQA

NS-VQA (1K programs)

COCO Visual Question Answering (VQA) real images 1.0 open ended

TextVQA test-standard

BLIP-2 ViT-G FlanT5 XXL (zero-shot)

COCO Visual Question Answering (VQA) real images 1.0 multiple choice

LXR955, No Ensemble

VCR (QA-R) test

VCR (Q-AR) test

VQA v1 test-dev

VizWiz 2020 Answerability

VQA v1 test-std

COCO Visual Question Answering (VQA) real images 2.0 open ended

COCO Visual Question Answering (VQA) abstract images 1.0 open ended

COCO Visual Question Answering (VQA) abstract 1.0 multiple choice

FigureQA - test 1

BERT LARGE Baseline

Visual Genome (subjects)

Visual Genome (pairs)

VizWiz 2018 Answerability

SAN † - hard mask

PrefixLM with CLIP and T5

DVQA test-familiar

PReFIL (Oracle OCR)