Visual Question Answering on OK-VQA
Metric: Accuracy
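OK-VQA is typically scored with the standard VQA soft-accuracy metric: a predicted answer earns min(n/3, 1) credit, where n is the number of human annotators (out of roughly ten per question) who gave that same answer. The sketch below is a simplified, single-pass form of that metric; the function name is illustrative, and the official evaluator additionally normalizes answers (articles, punctuation, number words) and averages the score over annotator subsets.

```python
from collections import Counter

def vqa_soft_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA soft accuracy: full credit if at least 3 of the
    ~10 human annotators gave the same answer, partial credit below that.
    (The official evaluator also applies answer normalization and averages
    over 10-choose-9 annotator subsets; this is the common shorthand form.)"""
    pred = prediction.strip().lower()
    counts = Counter(a.strip().lower() for a in human_answers)
    return min(counts[pred] / 3.0, 1.0)

# Example: 4 of 10 annotators answered "harley davidson"
answers = ["harley davidson"] * 4 + ["motorcycle"] * 3 + ["harley"] * 3
print(vqa_soft_accuracy("Harley Davidson", answers))  # 1.0  (4 matches >= 3)
print(vqa_soft_accuracy("motorcycle", answers))       # 1.0  (3 matches)
print(vqa_soft_accuracy("motorbike", answers))        # 0.0  (no exact match)
```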
Results

Performance of various models on this benchmark (20 of the 37 leaderboard entries are shown on this page).
| Model Name | Accuracy | Paper Title |
| --- | --- | --- |
| PaLM-E-562B | 66.1 | PaLM-E: An Embodied Multimodal Language Model |
| Prophet | 62.5 | Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering |
| RA-VQA-v2 (BLIP 2) | 62.08 | Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering |
| A Simple Baseline for KB-VQA | 61.2 | A Simple Baseline for Knowledge-Based Visual Question Answering |
| REVIVE (Ensemble) | 58.0 | REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering |
| VK-OOD | 52.4 | Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis |
| RA-VQA-FrDPR (T5-large) | 51.22 | Retrieval Augmented Visual Question Answering with Outside Knowledge |
| PICa | 48.0 | An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA |
| LaKo | 47.01 | LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection |
| BLIP-2 ViT-G FlanT5 XXL (zero-shot) | 45.9 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| VLC-BERT | 43.1 | VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge |
| T5 (Tan and Bansal, 2019) + Prefixes | 42.03 | LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection |
| Flamingo3B | 41.2 | Flamingo: a Visual Language Model for Few-Shot Learning |
| BLIP-2 ViT-L FlanT5 XL (zero-shot) | 39.4 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| PNP-VQA | 35.9 | Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training |
| BLIP-2 ViT-G OPT 2.7B (zero-shot) | 31.7 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| FewVLM | 16.5 | A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models |
| MetaLM | 11.4 | Language Models are General-Purpose Interfaces |
| VLKD (ViT-B/16) | 10.5 | Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation |
| Frozen | 5.9 | Multimodal Few-Shot Learning with Frozen Language Models |
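Several of the zero-shot entries above can be reproduced approximately with off-the-shelf checkpoints. A minimal sketch using the Hugging Face `transformers` BLIP-2 integration is shown below; the checkpoint name (`Salesforce/blip2-opt-2.7b`, corresponding to the BLIP-2 ViT-G OPT 2.7B row), the sample image URL, and the "Question: … Answer:" prompt follow the public model-card example, and the exact generation settings are assumptions rather than the leaderboard's evaluation protocol.

```python
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load a public BLIP-2 checkpoint (ViT-G vision encoder + OPT-2.7B LLM).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

# Sample COCO image from the model-card example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# BLIP-2's zero-shot VQA prompt format from the paper / model card.
prompt = "Question: what animals are in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    "cuda", torch.float16
)

out = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```

A prediction produced this way would then be scored against the ten human answers per question with the soft-accuracy metric sketched earlier.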