Video Question Answering on MSRVTT-QA
Metrics
Accuracy
Results
Performance results of various models on this benchmark
| Model Name | Accuracy | Paper Title | Repository |
|---|---|---|---|
| FrozenBiLM | 47.0 | Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | |
| mPLUG-2 | 48.0 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | |
| FrozenBiLM (0-shot) | 16.7 | Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | |
| VIOLETv2 | 44.5 | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | |
| Singularity-temporal | 43.9 | Revealing Single Frame Bias for Video-and-Language Learning | |
| HBI | 46.2 | Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | |
| VALOR | 49.2 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | |
| Singularity | 43.5 | Revealing Single Frame Bias for Video-and-Language Learning | |
| VAST | 50.1 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | |
| Mirasol3B | 50.42 | Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | - |
| VindLU | 44.6 | VindLU: A Recipe for Effective Video-and-Language Pretraining | |
| COSA | 49.2 | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | |
| MA-LMM | 48.5 | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | |
| EMCL-Net | 45.8 | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | |
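
The results are listed in the page's original order, not by score. As a minimal sketch, the Python snippet below ranks the 14 entries by accuracy. The `results` list and the `ranked` name are illustrative only, with values copied from the entries above; it is not part of any HyperAI tooling.

```python
# Accuracy values copied from the leaderboard entries above.
# The variable names here are illustrative, not from the source page.
results = [
    ("FrozenBiLM", 47.0),
    ("mPLUG-2", 48.0),
    ("FrozenBiLM (0-shot)", 16.7),
    ("VIOLETv2", 44.5),
    ("Singularity-temporal", 43.9),
    ("HBI", 46.2),
    ("VALOR", 49.2),
    ("Singularity", 43.5),
    ("VAST", 50.1),
    ("Mirasol3B", 50.42),
    ("VindLU", 44.6),
    ("COSA", 49.2),
    ("MA-LMM", 48.5),
    ("EMCL-Net", 45.8),
]

# Sort by accuracy, highest first. Python's sort is stable, so tied
# entries (VALOR and COSA at 49.2) keep their original relative order.
ranked = sorted(results, key=lambda row: row[1], reverse=True)

for rank, (model, accuracy) in enumerate(ranked, start=1):
    print(f"{rank:>2}. {model:<22} {accuracy}")
```

Sorted this way, Mirasol3B (50.42) and VAST (50.1) lead the list, and the zero-shot FrozenBiLM entry (16.7) is last.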