Visual Question Answering on DocVQA Test
Metrics
ANLS
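ANLS stands for Average Normalized Levenshtein Similarity. As a rough illustration of how a score in this range can arise, the sketch below computes it under the commonly used convention: for each question, take the best similarity over all accepted answers, zero out any similarity whose normalized edit distance reaches a threshold of 0.5, and average over questions. The function names `levenshtein` and `anls`, the lowercasing and whitespace stripping, and the default threshold are assumptions of this sketch, not details taken from this page; the official evaluation script may differ in preprocessing.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (two-row dynamic program)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Average per-question score; `references` holds a list of accepted
    answers for each prediction. Threshold 0.5 is an assumed default."""
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()  # assumed normalization
        best = 0.0
        for ans in answers:
            a = ans.strip().lower()
            longest = max(len(a), len(p))
            nl = levenshtein(a, p) / longest if longest else 0.0
            # Similarities at or beyond the threshold distance count as 0.
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

A call such as `anls(["helo"], [["hello"]])` scores one near-miss prediction against a single accepted answer; an exact match contributes 1 and an unrelated string contributes 0.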
Results
Performance results of various models on this benchmark
| Model Name | ANLS | Paper Title | Repository |
| --- | --- | --- | --- |
| MatCha | 0.742 | MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering | |
| GPT-4 | 0.884 | Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | |
| PaLI-3 | 0.876 | PaLI-3 Vision Language Models: Smaller, Faster, Stronger | |
| Qwen-VL | 0.651 | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | |
| ERNIE-Layout large | 0.8486 | ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | |
| DUBLIN | 0.782 | DUBLIN -- Document Understanding By Language-Image Network | - |
| Pix2Struct-base | 0.721 | Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | |
| DUBLIN (variable resolution) | 0.803 | DUBLIN -- Document Understanding By Language-Image Network | - |
| PaLI-3 (w/ OCR) | 0.886 | PaLI-3 Vision Language Models: Smaller, Faster, Stronger | |
| Qwen-VL-Plus | 0.9024 | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | |
| PaLI-X (Single-task FT w/ OCR) | 0.868 | PaLI-X: On Scaling up a Multilingual Vision and Language Model | |
| PaLI-X (Single-task FT) | 0.80 | PaLI-X: On Scaling up a Multilingual Vision and Language Model | |
| Claude + LATIN-Prompt | 0.8336 | Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | |
| TILT-Large | 0.8705 | Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer | |
| BERT_LARGE_SQUAD_DOCVQA_FINETUNED_Baseline | 0.665 | DocVQA: A Dataset for VQA on Document Images | |
| DocFormerv2-large | 0.8784 | DocFormerv2: Local Features for Document Understanding | |
| MLCD-Embodied-7B | 0.916 | Multi-label Cluster Discrimination for Visual Representation Learning | |
| UDOP (aux) | 0.878 | Unifying Vision, Text, and Layout for Universal Document Processing | |
| UDOP | 0.847 | Unifying Vision, Text, and Layout for Universal Document Processing | |
| SMoLA-PaLI-X Generalist | 0.906 | Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | - |