Reasoning Benchmarks
How evaluations of language-model reasoning evolved from broad knowledge testing to expert-level problems that frontier models still cannot reliably solve. The lineage runs from MMLU's wide-coverage factual sweep through specialist tracks like GPQA to HLE, a 2,500-question exam written by domain experts on which top models still score below 35%. Branches include BIG-Bench Hard (multi-step reasoning) and ARC-AGI (fluid abstract reasoning), each probing different failure modes than the main knowledge-testing spine.
MMLU was the benchmark that ended the 'GPT-3 can't do reasoning' era: 57-subject coverage meant every model report cited it, and that ubiquity made its saturation (GPT-4 hitting 86.4% in 2023) feel sudden. MMLU-Pro and GPQA Diamond restored the headroom MMLU had lost: ten answer options instead of four (MMLU-Pro), expert-written distractors, and expert-verified, Google-proof content (GPQA). HLE (Humanity's Last Exam) is the current attention path: 2,500 questions by credentialed domain experts, on which even the strongest reasoning models still sit below 35% as of early 2025. ARC-AGI is the most-watched branch: o3 crossed 75% in December 2024 (roughly 88% in its high-compute configuration), but at a cost that makes it a separate category from routine inference. BIG-Bench Hard remains the cleanest probe of chain-of-thought skill and hasn't saturated at the task level.
Attention path plus branches.
Solid arrows follow the attention path; the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches.
Nodes in detail.
ARC-AGI
400 evaluation tasks of visual pattern completion designed to resist memorisation: each task encodes a unique rule. Humans average ~85%. GPT-4-class models barely exceeded 5% for years. o3 reached 75.7% in December 2024 (87.5% in a high-compute configuration), at an inference cost that makes it a distinct category.
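The task format itself is small and public: each task is a JSON object with a handful of train input/output grid pairs plus held-out test inputs, where grids are 2-D arrays of colour indices 0-9. A minimal sketch of a verification harness, assuming that public JSON layout; the `transpose` rule and `solves_training_pairs` helper are toy stand-ins, since real tasks each encode a unique, much harder rule:

```python
import json
from typing import List

Grid = List[List[int]]  # a 2-D array of colour indices 0-9

def transpose(grid: Grid) -> Grid:
    """Toy candidate rule: reflect the grid along its main diagonal."""
    return [list(row) for row in zip(*grid)]

def solves_training_pairs(rule, task: dict) -> bool:
    """A rule counts only if it reproduces every train output exactly."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])

# A tiny hand-written task in the public ARC JSON layout.
task = json.loads("""
{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": [[1, 0], [0, 0]]},
    {"input": [[0, 2], [0, 0]], "output": [[0, 0], [2, 0]]}
  ],
  "test": [
    {"input": [[0, 0], [3, 0]]}
  ]
}
""")

if solves_training_pairs(transpose, task):
    # Only then is the rule applied to the held-out test input.
    print(transpose(task["test"][0]["input"]))  # [[0, 3], [0, 0]]
```

Because every task has its own rule, nothing transfers from one task to the next; that is what makes memorisation useless here.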
MMLU
57-subject multiple-choice exam spanning STEM, law, history, social sciences and more. 15,908 questions. Became the standard breadth-of-knowledge benchmark for two years; every major model report cited it. Saturated by GPT-4 in 2023 (86.4%) and Gemini Ultra shortly after.
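Scoring is plain multiple-choice accuracy: each question is rendered with lettered options and the model's chosen letter is compared against the gold answer. A minimal sketch of that harness, assuming a hypothetical record layout and an `ask_model` callable standing in for any real API:

```python
from typing import Callable, List

# Hypothetical record layout: question text, option strings, gold answer index.
QUESTIONS = [
    {
        "question": "Which planet is closest to the Sun?",
        "options": ["Venus", "Mercury", "Earth", "Mars"],
        "answer": 1,
    },
]

LETTERS = "ABCDEFGHIJ"  # MMLU uses A-D; MMLU-Pro extends the same scheme to A-J

def format_prompt(record: dict) -> str:
    """Render a question in the usual lettered multiple-choice layout."""
    lines = [record["question"]]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(record["options"])]
    lines.append("Answer:")
    return "\n".join(lines)

def accuracy(ask_model: Callable[[str], str], records: List[dict]) -> float:
    """Fraction of questions where the model's letter matches the gold letter."""
    hits = 0
    for rec in records:
        predicted = ask_model(format_prompt(rec)).strip()[:1].upper()
        hits += predicted == LETTERS[rec["answer"]]
    return hits / len(records)

# Stub model that always answers "B", just to show the call shape.
print(accuracy(lambda prompt: "B", QUESTIONS))  # 1.0 on this one-item set
```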
BIG-Bench Hard
23 tasks hand-selected from BIG-Bench as those where language models had failed to match the average human rater; the BBH paper then showed that chain-of-thought prompting lifts performance on them dramatically. Word sorting, logical deduction, date arithmetic, causal reasoning: a focused probe of multi-step chain-of-thought skill rather than factual breadth.
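Since the BBH result is a prompting effect, it is easy to show in miniature: the same question asked with a direct-answer template and with a chain-of-thought template that prepends a worked example and the "Let's think step by step" cue. A sketch with hypothetical templates; the task text is a made-up date-arithmetic item in the BBH style:

```python
QUESTION = (
    "Today is 28 Feb 2023. What is the date 3 days from today "
    "in MM/DD/YYYY format?"
)

# Direct template: the model must emit the answer in one shot.
direct_prompt = f"Q: {QUESTION}\nA:"

# CoT template: a worked example plus the step-by-step cue, after which the
# model is expected to reason aloud before committing to an answer.
cot_prompt = (
    "Q: Today is 30 Apr 2021. What is the date 2 days from today "
    "in MM/DD/YYYY format?\n"
    "A: Let's think step by step. April has 30 days, so 1 day from today "
    "is 01 May 2021 and 2 days from today is 02 May 2021. "
    "The answer is 05/02/2021.\n"
    f"Q: {QUESTION}\n"
    "A: Let's think step by step."
)

# Answers are typically extracted with a fixed pattern such as
# "The answer is ..." so both conditions can be scored identically.
print(direct_prompt)
print("---")
print(cot_prompt)
```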
GPQA
448 PhD-level multiple-choice questions in biology, chemistry, and physics, written by domain experts and verified so answers are not trivially Googleable. The Diamond subset (198 questions) is the standard eval slice. Skilled non-experts with web access score ~34%; GPT-4-class models ~39%; o1 ~78%, with o3 higher still.
MMLU-Pro
10-option questions (vs. 4), expert-curated distractors, 12,000 questions across 14 domains. Raises the ceiling MMLU lost — frontier models that scored 86%+ on MMLU cluster around 60–70% on MMLU-Pro.
HLE
2,500 expert-contributed questions spanning mathematics, sciences, humanities, and professional domains, each verified by domain experts to be answerable only with deep specialist knowledge. Released January 2025. Even the strongest frontier reasoning models still score below 35%.