Benchmark lineage

Mathematical Reasoning Benchmarks

6 benchmarks · 5 edges · Updated 2026-04-27

How mathematical reasoning evaluation evolved from grade-school word problems through competition mathematics to research-frontier problems that current AI cannot reliably solve. The lineage traces the shift from linguistic arithmetic (GSM8K) to formal mathematical proof and open research problems. The main path runs through the AIME competition track, which became a frontier benchmark after o1 broke it open, and on to FrontierMath, which sources unpublished problems from professional mathematicians; OmniMATH branches off as a parallel olympiad-level track.

Editor's note

GSM8K and MATH together defined the mathematical reasoning evaluation landscape from 2021 through 2023. GPT-4 solved ~92% of GSM8K; that number is now essentially meaningless as a discriminator. MATH (competition-level: AMC, AIME, Olympiad) proved harder and more durable, but o1 cleared 90%+ in late 2024. AIME is the most interesting transition point: it was always a human competition, yet it became a frontier AI benchmark in 2024 when GPT-4o, then o1, started scoring in ranges previously associated with top high-school competitors. FrontierMath (2024) and the HLE math split represent the current open frontier: problems sourced from professional mathematicians working at the boundary of known mathematics, where even the best models solve only a minority of problems.

§ 01 · Lineage graph

Attention path plus branches.

Solid arrows follow the attention path — the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches.

Legend: attention path · scope shift · branch / fork · active · saturating · saturated / superseded
[Lineage graph] Attention path: GSM8K (Oct 2021, SOTA 99.7%) → MATH (Nov 2021, SOTA 98.2%) → AIME 2024 (Mar 2024, SOTA 96.7%) → FrontierMath (Nov 2024) → HLE math split (Jan 2025, SOTA 38.3%). Branch: MATH → OmniMATH (Oct 2024).
GSM8K → MATH · scope shift · attention
GSM8K addressed grade-school arithmetic; MATH jumped directly to AMC/AIME/Olympiad competition problems spanning five difficulty levels, all well above GSM8K. Released one month apart, they together defined the 2021–2023 math evaluation landscape.
MATH → AIME 2024 · scope shift · attention
AIME is not an AI benchmark by origin — it is the human competition that feeds into USAMO/IMO. It became an AI frontier benchmark when o1 started scoring competitively. Because it is updated annually with fresh problems, it provides contamination control that MATH (fixed dataset) cannot.
MATH → OmniMATH · scope shift
OmniMATH aggregates olympiad-level problems from IMO and national competitions, providing harder instances than MATH's Level 5 problems. A parallel ceiling-raising branch alongside AIME.
AIME 2024 → FrontierMath · scope shift · attention
AIME problems are finite and increasingly contaminated as training sets grow. FrontierMath sources unpublished research-frontier problems, making training-set contamination impossible by design. The step change from competition math to research math.
FrontierMath → HLE (math split) · scope shift · attention
HLE's math split covers similar territory to FrontierMath but is embedded in a broader expert-exam framework that includes proofs, graduate coursework, and cross-domain reasoning. The current attention endpoint, where the best model scores 38.3%.
§ 02 · Benchmarks in this lineage

Nodes in detail.

Oct 2021 · Saturated

GSM8K

Grade School Math 8K

8,500 grade-school math word problems requiring multi-step arithmetic. Designed so that solving demands chain-of-thought reasoning; GPT-3 scored ~35%, GPT-4 ~92%. Saturated by 2024: every frontier model exceeds 90%, so the benchmark provides no discriminative signal.

Cobbe et al. (OpenAI) · paper
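GSM8K's grading follows directly from its format: every reference solution ends with a line of the form "#### <answer>", and evaluation is exact match on that final number. A minimal sketch of the grading step in Python; the last-number fallback for free-form model output is a common harness convention assumed here, not part of the official dataset.

```python
import re

# GSM8K reference solutions end with a line of the form "#### <answer>".
GOLD_RE = re.compile(r"####\s*(-?[\d,]+(?:\.\d+)?)")

def extract_final_number(text: str) -> str | None:
    """Return the final numeric answer in a solution string.

    Uses the "#### <answer>" marker when present (the GSM8K reference
    format); otherwise falls back to the last number in the text, a
    common convention for free-form model output.
    """
    m = GOLD_RE.search(text)
    if m:
        return m.group(1).replace(",", "")
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return numbers[-1].replace(",", "") if numbers else None

def is_correct(model_output: str, reference_solution: str) -> bool:
    """GSM8K-style grading: exact match on the normalized final answer."""
    pred = extract_final_number(model_output)
    gold = extract_final_number(reference_solution)
    return pred is not None and pred == gold

# A reference solution ending "#### 72" matches a model answer ending
# "...so she makes $72 every day."
assert is_correct("...so she makes $72 every day.",
                  "She sells 9 eggs at $8 each. #### 72")
```

Exact-match grading on a single number is part of what made GSM8K cheap to run at scale, and quick to saturate.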
Nov 2021 · Saturating

MATH

MATH Competition Mathematics

12,500 AMC/AIME/Olympiad-sourced problems across seven subject areas and five difficulty levels. GPT-4 scored ~52% at launch; o1 cleared 90%+ in 2024. Each problem includes a step-by-step solution with the final answer marked in \boxed{}. The competition-math standard that lasted three years before frontier models surpassed the Level 5 cutoff.

Hendrycks et al. (UC Berkeley) · paper
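Because MATH solutions mark the final answer with \boxed{...}, graders extract the last boxed expression and compare after normalization. A sketch of the extraction step; the depth-counted brace matching is the standard trick, but full graders also normalize LaTeX (stripping \left/\right, whitespace, etc.), which is omitted here.

```python
def extract_last_boxed(solution: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a LaTeX solution.

    MATH solutions mark the final answer with \\boxed{}. A plain regex
    fails on nested braces (e.g. \\boxed{\\frac{1}{2}}), so we scan for
    the matching closing brace by tracking nesting depth.
    """
    start = solution.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth = 1
    for j in range(i, len(solution)):
        if solution[j] == "{":
            depth += 1
        elif solution[j] == "}":
            depth -= 1
            if depth == 0:
                return solution[i:j]
    return None  # unbalanced braces

# Nested braces are handled: \boxed{\frac{1}{2}} -> "\frac{1}{2}"
assert extract_last_boxed(r"the answer is $\boxed{\frac{1}{2}}$.") == r"\frac{1}{2}"
```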

Mar 2024

AIME 2024

American Invitational Mathematics Examination 2024

30 competition problems (15 each on AIME I and AIME II) from the 2024 exams, part of the pipeline that runs AMC → AIME → USAMO → IMO. Every answer is an integer from 0 to 999, so grading is exact. Not designed as an AI benchmark, but adopted as one after GPT-4o scored 9/30 in 2024 and o1 scored 23/30, comparable to strong human competitors. Updated annually, providing contamination control.

Mathematical Association of America
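Because every AIME answer is an integer in [0, 999], grading reduces to integer comparison, with no LaTeX parsing or expression-equivalence checking; this is one reason the exam converts so cleanly into an AI benchmark. A minimal sketch (the dict-based interface is illustrative, not any particular harness's API):

```python
def aime_score(predictions: dict[int, int], answers: dict[int, int]) -> tuple[int, int]:
    """Grade an AIME-style exam: exact integer match, no partial credit.

    `answers` maps problem number to the official answer; `predictions`
    maps problem number to the model's answer (missing = unanswered).
    """
    assert all(0 <= a <= 999 for a in answers.values()), "AIME answers are 0-999"
    correct = sum(1 for q, gold in answers.items() if predictions.get(q) == gold)
    return correct, len(answers)

# A model answering 23 of the 30 problems correctly scores (23, 30).
```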
Oct 2024 · Active

OmniMATH

OmniMATH: Olympiad-Level Math Benchmark

4,428 competition problems from IMO, national olympiads, and other elite contests, all at difficulty levels above standard AIME. Designed to remain discriminative after MATH saturated. Top models score 20–50% depending on difficulty tier.

Gao et al. · paper
Nov 2024 · Active

FrontierMath

FrontierMath: Expert-Level Mathematical Reasoning

500+ unpublished problems contributed by professional mathematicians at the boundary of current research: number theory, algebraic geometry, combinatorics at research depth. Problems are held out to prevent contamination, and answers are constructed to be automatically verifiable. Top models score below 2% on the hardest tier.

Glazer et al. (Epoch AI) · paper
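FrontierMath problems are built so that answers (exact integers, rationals, or symbolic constants rather than prose proofs) can be checked programmatically. A sketch of what symbolic verification can look like, using SymPy; this illustrates the idea only and is not Epoch AI's actual verification harness.

```python
import sympy as sp

def answers_match(submitted: str, reference: str) -> bool:
    """Check exact mathematical equality of two closed-form answers.

    Symbolic comparison accepts any expression denoting the right
    value, rather than demanding a canonical string. The parsing and
    simplification strategy here is illustrative.
    """
    a = sp.sympify(submitted)
    b = sp.sympify(reference)
    return sp.simplify(a - b) == 0

# Different surface forms of the same value are graded as equal:
assert answers_match("2**10 + 24", "1048")
assert answers_match("sqrt(8)", "2*sqrt(2)")
```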

Jan 2025

HLE (math split)

Humanity's Last Exam — Mathematics Subset

The mathematics and formal reasoning subset of HLE's 2,500-question expert exam. Encompasses proof-based, competition, and graduate-level mathematics. The frontier SOTA on this split is 38.3%, above the overall HLE average.

Phan et al. (Center for AI Safety / Scale AI) · paper