6 benchmarks · 6 edges · Updated 2026-04-27
Benchmark lineage

Reasoning Benchmarks

How evaluations of language-model reasoning evolved from broad knowledge testing to expert-level problems that frontier models still cannot reliably solve. The lineage runs from MMLU's wide-coverage factual sweep through specialist tracks like GPQA to HLE, a 2,500-question exam written by domain experts on which the best frontier models still score under 40%. Branches include BIG-Bench Hard (multi-step reasoning) and ARC-AGI (fluid abstract reasoning), each probing different failure modes than the main knowledge-testing spine.

Editor's note

MMLU was the benchmark that ended the 'GPT-3 can't do reasoning' era — 57-subject coverage meant every model report cited it, and that ubiquity made its saturation (GPT-4 hitting 86.4% in 2023) feel sudden. MMLU-Pro and GPQA Diamond restored the discriminative ceiling: 10-option questions, expert-written distractors, and subject-matter verifiability. HLE (Humanity's Last Exam) is the current attention path — 2,500 questions by credentialed domain experts, where even o3 sat below 35% as of early 2025; SOTA has since crept to 38.3%. ARC-AGI is the most-watched branch: o3 (high-compute) crossed 75% in December 2024, but at a cost that makes it a separate category from routine inference. BIG-Bench Hard remains the cleanest CoT-skill probe and hasn't saturated at the task level.

§ 01 · Lineage graph

Attention path plus branches.

Solid arrows follow the attention path — the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches.

Legend: attention path · scope shift · branch / fork · active · saturating · saturated / superseded.
[Lineage graph] Attention path: MMLU (Sep 2020 · SOTA 92.9%) →(direct successor)→ MMLU-Pro (Jun 2024 · SOTA 91.0%) →(scope shift)→ GPQA (Nov 2023 · SOTA 91.9%) →(direct successor)→ HLE (Jan 2025 · SOTA 38.3%). Branches off MMLU: BIG-Bench Hard (Oct 2022), ARC-AGI (Nov 2019), and a scope-shift edge direct to HLE.
MMLU → BIG-Bench Hard · scope shift
BIG-Bench Hard was extracted from BIG-bench to focus on the tasks where chain-of-thought reasoning matters most — orthogonal to MMLU's factual breadth.
MMLU → ARC-AGI · scope shift
ARC-AGI probes fluid reasoning and generalisation rather than memorised knowledge — a direct counterpoint to MMLU's broad-knowledge approach. Chollet designed it explicitly as an anti-memorisation benchmark.
MMLU → MMLU-Pro · direct successor · attention
MMLU-Pro was built to restore discriminative signal after MMLU saturated. Harder questions, 10-option format, expert-curated distractors. The standard replacement for MMLU in model reports from 2024 onward.
MMLU-Pro → GPQA · scope shift · attention
GPQA narrows scope to PhD-level expert questions while raising verifiability — each answer is checked by domain-expert validators. MMLU-Pro raised the ceiling; GPQA shifted the task from breadth to verifiable expert depth.
GPQA → HLE · direct successor · attention
HLE is the logical endpoint of the expert-difficulty trend: 2,500 questions where the bar is 'only a genuine domain expert can answer this reliably.' It holds the current frontier attention path because no model has cleared 40%.
MMLUHLE · scope shift
HLE shares MMLU's multi-domain framing but inverts the difficulty objective — where MMLU measured broad knowledge, HLE selects for questions that require expert depth. A conceptual successor to the entire multitask-evaluation paradigm.
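
For readers who want the graph as data, the six edges above reduce to a small adjacency list. A minimal sketch — the tuples and the path-walking helper are illustrative, not part of any published tooling:

```python
# The six edges of this lineage as (source, target, edge_type) tuples.
# Edge-type strings mirror the labels used in this section.
EDGES = [
    ("MMLU", "BIG-Bench Hard", "scope shift"),
    ("MMLU", "ARC-AGI", "scope shift"),
    ("MMLU", "MMLU-Pro", "direct successor · attention"),
    ("MMLU-Pro", "GPQA", "scope shift · attention"),
    ("GPQA", "HLE", "direct successor · attention"),
    ("MMLU", "HLE", "scope shift"),
]

def attention_path(edges, start="MMLU"):
    """Recover the attention path by following 'attention' edges from start."""
    path, node = [start], start
    while True:
        nxt = [t for s, t, kind in edges if s == node and "attention" in kind]
        if not nxt:
            return path
        node = nxt[0]
        path.append(node)

print(" → ".join(attention_path(EDGES)))  # MMLU → MMLU-Pro → GPQA → HLE
```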
§ 02 · Benchmarks in this lineage

Nodes in detail.

Nov 2019 · Active

ARC-AGI

Abstraction and Reasoning Corpus — AGI

400 evaluation tasks of visual pattern completion designed to resist memorisation — each task has a unique rule. Humans average ~85%. GPT-4-class models barely exceeded 5% for years. o3 (high-compute) reached ~75% in December 2024, but at an inference cost that makes it a distinct category.

Chollet (Google) · paper
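
The task format itself is simple even though the tasks are hard: each ARC task is a JSON object with a few train demonstrations and held-out test pairs, scored by exact grid match. The sketch below assumes that public JSON layout; the toy "mirror the rows" task and the solve() helper are invented for illustration.

```python
import json

# An ARC-AGI task: a few "train" demonstrations plus "test" pairs; grids are
# 2-D lists of integers 0-9 (colours). This tiny example task is made up.
task = json.loads("""{
  "train": [
    {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
    {"input": [[3, 4, 0]],      "output": [[0, 4, 3]]}
  ],
  "test": [
    {"input": [[5, 0, 0]], "output": [[0, 0, 5]]}
  ]
}""")

def solve(grid):
    """Hypothetical solver for this one task: mirror each row."""
    return [row[::-1] for row in grid]

# Scoring is exact match: a prediction counts only if every cell agrees.
correct = all(solve(t["input"]) == t["output"] for t in task["test"])
print("solved" if correct else "failed")
```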
Sep 2020 · Saturated

MMLU

Massive Multitask Language Understanding

57-subject multiple-choice exam spanning STEM, law, history, social sciences and more. 15,908 questions. Became the standard breadth-of-knowledge benchmark for two years; every major model report cited it. Saturated by GPT-4 in 2023 (86.4%) and Gemini Ultra shortly after.

Hendrycks et al. (UC Berkeley) · paper
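
Operationally, MMLU is plain multiple-choice accuracy. A minimal sketch of that scoring loop — the model callable and the prompt template are placeholders, and real harnesses add few-shot examples:

```python
# Format each question as a 4-option prompt, ask the model for a letter,
# and report accuracy. `model` stands in for any completion function.
LETTERS = "ABCD"

def format_prompt(question, choices):
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(choices))
    return f"{question}\n{options}\nAnswer:"

def accuracy(items, model):
    hits = 0
    for item in items:  # item: {"question": str, "choices": [str]*4, "answer": int}
        reply = model(format_prompt(item["question"], item["choices"]))
        predicted = reply.strip()[:1].upper()  # first character, e.g. "B"
        hits += predicted == LETTERS[item["answer"]]
    return hits / len(items)

# Usage with a toy item and a hard-coded "model":
items = [{"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": 1}]
print(accuracy(items, lambda prompt: "B"))  # 1.0
```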
Oct 2022 · Active

BIG-Bench Hard

BIG-Bench Hard (BBH)

23 tasks hand-selected from BIG-bench on which prior language models fell short of the average human rater — word sorting, logical deduction, date arithmetic, causal reasoning. The companion paper showed CoT prompting dramatically improves most of them, making BBH a focused probe of multi-step chain-of-thought skill rather than factual breadth.

Suzgun et al. · paper
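
Since BBH is specifically a probe of chain-of-thought skill, evaluation hinges on eliciting reasoning and then parsing a final answer out of the transcript. A hedged sketch — the "Let's think step by step" template and the "the answer is" extraction pattern follow common BBH harness conventions, but real templates vary:

```python
import re

# Elicit step-by-step reasoning, then parse a final answer from the output.
def cot_prompt(question):
    return f"Q: {question}\nA: Let's think step by step."

def extract_answer(transcript):
    match = re.search(r"the answer is[: ]+([^\n.]+)", transcript, re.IGNORECASE)
    return match.group(1).strip() if match else None

print(cot_prompt("Today is March 3. What was the date two days ago (MM/DD)?"))

# A canned transcript standing in for a model's CoT output on that prompt:
transcript = "Two days before March 3 is March 1. So the answer is 03/01."
print(extract_answer(transcript))  # 03/01
```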

Nov 2023

GPQA

Graduate-Level Google-Proof Q&A

448 PhD-level multiple-choice questions in biology, chemistry, and physics written by domain experts and verified so answers are not trivially Googleable. The Diamond subset (198 Qs) is the standard eval slice. Human non-expert accuracy ~34%. GPT-4 class ~39%; o1/o3 ~78%.

Rein et al. (NYU) · paper

Jun 2024

MMLU-Pro

MMLU-Pro (harder variant)

10-option questions (vs. MMLU's 4), expert-curated distractors, ~12,000 questions across 14 domains. Restores the headroom MMLU lost — frontier models that scored 86%+ on MMLU initially clustered around 60–70% on MMLU-Pro.

Wang et al. · paper
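
One concrete effect of the 10-option format: the random-guess floor drops from 25% to 10%, so the same raw accuracy clears more of the guessable range. A quick sketch of the standard chance-adjustment arithmetic (an illustrative convention, not something the MMLU-Pro paper prescribes):

```python
# Chance-adjusted accuracy: adjusted = (acc - chance) / (1 - chance),
# where chance = 1 / n_options is the random-guess floor.
def chance_adjusted(acc, n_options):
    chance = 1 / n_options
    return (acc - chance) / (1 - chance)

# A model at 70% raw accuracy clears more of the guessable range
# on the 10-option format than on the 4-option one:
print(f"{chance_adjusted(0.70, 4):.2f}")   # 0.60 above chance (4-way, MMLU)
print(f"{chance_adjusted(0.70, 10):.2f}")  # 0.67 above chance (10-way, MMLU-Pro)
```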

Jan 2025

HLE

Humanity's Last Exam

2,500 expert-contributed questions spanning mathematics, sciences, humanities, and professional domains — each verified by domain experts to be answerable only with deep specialist knowledge. Released Jan 2025. Top frontier models, o3 included, still score below 40% (SOTA 38.3%).

Phan et al. (Center for AI Safety / Scale AI) · paper