Video Question Answering on NExT-QA

Metrics

Accuracy
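
As a minimal sketch of how an accuracy figure like those listed here is typically computed for multiple-choice video QA, the snippet below scores predicted option indices against ground-truth indices and reports a percentage. The function name `accuracy` and the list-of-indices input format are assumptions made for illustration; they are not taken from the benchmark's official evaluation script.

```python
def accuracy(predictions, answers):
    """Return the percentage of questions answered correctly.

    predictions, answers: equal-length sequences of chosen option
    indices (hypothetical format, one entry per question).
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count exact matches between predicted and ground-truth options.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Usage: three of four questions answered correctly.
score = accuracy([0, 1, 2, 3], [0, 1, 2, 4])
```

Reporting the value as a percentage matches the scale of the numbers in the results table.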

Results

Performance results of various models on this benchmark

Model Name	Accuracy (%)	Paper Title	Repository
LLaVA-Video	83.2	Video Instruction Tuning With Synthetic Data	-
LLaVA-NeXT-Interleave(14B)	79.1	LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
ATM	58.3	ATM: Action Temporality Modeling for Video Question Answering	-
VideoChat2_HD_mistral	79.5	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
ViperGPT(0-shot)	60.0	ViperGPT: Visual Inference via Python Execution for Reasoning
LongVILA(7B)	80.7	LongVILA: Scaling Long-Context Visual Language Models for Long Videos
VGT(PT)	56.9	Video Graph Transformer for Video Question Answering
TCR	73.5	Text-Conditioned Resampler For Long Form Video Understanding	-
ViLA (3B)	75.6	ViLA: Efficient Video-Language Alignment for Video Question Answering
HiTeA	63.1	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	-
HQGA	51.4	Video as Conditional Graph Hierarchy for Multi-Granular Question Answering
RTQ	63.2	RTQ: Rethinking Video-language Understanding Based on Image-text Model
GF	58.83	Glance and Focus: Memory Prompting for Multi-Event Video Question Answering
LSTP	72.1	Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge
LLaMA-VQA (33B)	75.5	Large Language Models are Temporal and Causal Reasoners for Video Question Answering
CoVGT(PT)	60.7	Contrastive Video Question Answering via Video Graph Transformer	-
SeViT	60.6	Semi-Parametric Video-Grounded Text Generation	-
VideoChat2_mistral	78.6	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Vamos	77.3	Vamos: Versatile Action Models for Video Understanding	-
LinVT-Qwen2-VL (7B)	85.5	LinVT: Empower Your Image-level Large Language Model to Understand Videos
