Question Answering on NewsQA
Metrics: EM (exact match) and F1 (token-level overlap between the predicted and reference answers).
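For reference, EM and F1 for extractive QA are typically computed SQuAD-style: both answers are normalized (lower-cased, punctuation and the articles a/an/the stripped), EM checks exact string equality, and F1 measures token overlap. Below is a minimal sketch assuming that standard normalization; the actual NewsQA evaluation script may differ in details such as taking the maximum score over multiple reference answers.

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """Lower-case, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, reference: str) -> float:
    """EM: 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))

def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Per-example scores are averaged over the dataset and reported as percentages.
print(exact_match("the Eiffel Tower", "Eiffel Tower"))      # 1.0 (article stripped)
print(round(f1_score("in Paris, France", "Paris"), 2))      # 0.5
```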
Results
Performance results of various models on this benchmark:

| Model Name | EM | F1 | Paper Title | Repository |
|---|---|---|---|---|
| deepseek-r1 | 80.57 | 86.13 | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | |
| OpenAI/GPT-4o | 70.21 | 81.74 | GPT-4o as the Gold Standard: A Scalable and General Purpose Approach to Filter Language Model Pretraining Data | - |
| DecaProp | 53.1 | 66.3 | Densely Connected Attention Propagation for Reading Comprehension | |
| FastQAExt | 43.7 | 56.1 | Making Neural QA as Simple as Possible but not Simpler | |
| Riple/Saanvi-v0.1 | 72.61 | 85.44 | Time-series Transformer Generative Adversarial Networks | |
| LinkBERT (large) | - | 72.6 | LinkBERT: Pretraining Language Models with Document Links | |
| BERT+ASGen | 54.7 | 64.5 | - | - |
| Anthropic/claude-3-5-sonnet | 74.23 | 82.3 | Claude 3.5 Sonnet Model Card Addendum | - |
| xAI/grok-2-1212 | 70.57 | 88.24 | XAI for Transformers: Better Explanations through Conservative Propagation | |
| OpenAI/o1-2024-12-17-high | 81.44 | 88.7 | 0/1 Deep Neural Networks via Block Coordinate Descent | - |
| Google/Gemini 1.5 Flash | 68.75 | 79.91 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | |
| AMANDA | 48.4 | 63.7 | A Question-Focused Multi-Factor Attention Network for Question Answering | |
| OpenAI/o3-mini-2025-01-31-high | 96.52 | 92.13 | o3-mini vs DeepSeek-R1: Which One is Safer? | |
| DyREX | - | 68.53 | DyREx: Dynamic Query Representation for Extractive Question Answering | |
| MINIMAL(Dyn) | 50.1 | 63.2 | Efficient and Robust Question Answering from Minimal Context over Documents | |
| SpanBERT | - | 73.6 | SpanBERT: Improving Pre-training by Representing and Predicting Spans | |