Visual Question Answering on OK-VQA
Metric: Accuracy
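OK-VQA is typically scored with the standard VQA soft-accuracy metric: a predicted answer earns min(n/3, 1) credit, where n is the number of human annotators (out of roughly ten per question) who gave that same answer. The sketch below is a simplified, single-pass form of that metric; the function name is illustrative, and the official evaluator additionally normalizes answers (articles, punctuation, number words) and averages the score over annotator subsets.

```python
from collections import Counter

def vqa_soft_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA soft accuracy: full credit if at least 3 of the
    ~10 human annotators gave the same answer, partial credit below that.
    (The official evaluator also applies answer normalization and averages
    over 10-choose-9 annotator subsets; this is the common shorthand form.)"""
    pred = prediction.strip().lower()
    counts = Counter(a.strip().lower() for a in human_answers)
    return min(counts[pred] / 3.0, 1.0)

# Example: 4 of 10 annotators answered "harley davidson"
answers = ["harley davidson"] * 4 + ["motorcycle"] * 3 + ["harley"] * 3
print(vqa_soft_accuracy("Harley Davidson", answers))  # 1.0  (4 matches >= 3)
print(vqa_soft_accuracy("motorcycle", answers))       # 1.0  (3 matches)
print(vqa_soft_accuracy("motorbike", answers))        # 0.0  (no exact match)
```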
Results

Performance of various models on this benchmark (20 of the 37 leaderboard entries are shown on this page).
| Model Name | Accuracy | Paper Title |
| --- | --- | --- |
| PaLM-E-562B | 66.1 | PaLM-E: An Embodied Multimodal Language Model |
| Prophet | 62.5 | Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering |
| RA-VQA-v2 (BLIP 2) | 62.08 | Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering |
| A Simple Baseline for KB-VQA | 61.2 | A Simple Baseline for Knowledge-Based Visual Question Answering |
| REVIVE (Ensemble) | 58.0 | REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering |
| VK-OOD | 52.4 | Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis |
| RA-VQA-FrDPR (T5-large) | 51.22 | Retrieval Augmented Visual Question Answering with Outside Knowledge |
| PICa | 48.0 | An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA |
| LaKo | 47.01 | LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection |
| BLIP-2 ViT-G FlanT5 XXL (zero-shot) | 45.9 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| VLC-BERT | 43.1 | VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge |
| T5 (Tan and Bansal, 2019) + Prefixes | 42.03 | LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection |
| Flamingo3B | 41.2 | Flamingo: a Visual Language Model for Few-Shot Learning |
| BLIP-2 ViT-L FlanT5 XL (zero-shot) | 39.4 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| PNP-VQA | 35.9 | Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training |
| BLIP-2 ViT-G OPT 2.7B (zero-shot) | 31.7 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| FewVLM | 16.5 | A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models |
| MetaLM | 11.4 | Language Models are General-Purpose Interfaces |
| VLKD (ViT-B/16) | 10.5 | Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation |
| Frozen | 5.9 | Multimodal Few-Shot Learning with Frozen Language Models |
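Several of the zero-shot entries above can be reproduced approximately with off-the-shelf checkpoints. A minimal sketch using the Hugging Face `transformers` BLIP-2 integration is shown below; the checkpoint name (`Salesforce/blip2-opt-2.7b`, corresponding to the BLIP-2 ViT-G OPT 2.7B row), the sample image URL, and the "Question: … Answer:" prompt follow the public model-card example, and the exact generation settings are assumptions rather than the leaderboard's evaluation protocol.

```python
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load a public BLIP-2 checkpoint (ViT-G vision encoder + OPT-2.7B LLM).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

# Sample COCO image from the model-card example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# BLIP-2's zero-shot VQA prompt format from the paper / model card.
prompt = "Question: what animals are in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    "cuda", torch.float16
)

out = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```

A prediction produced this way would then be scored against the ten human answers per question with the soft-accuracy metric sketched earlier.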