Mathematical Reasoning Benchmarks
How mathematical reasoning evaluation evolved from grade-school word problems through competition mathematics to research-frontier problems that current AI cannot reliably solve. The lineage traces the shift from linguistic arithmetic (GSM8K) to formal mathematical proof and open research problems. Branches include the AIME competition track, which became a frontier benchmark after o1 broke it open, and FrontierMath, which sources unpublished problems from professional mathematicians.
GSM8K and MATH together defined the mathematical reasoning evaluation landscape from 2021 through 2023. GPT-4 solved ~92% of GSM8K; that number is now essentially meaningless as a discriminator. MATH (competition-level: AMC, AIME, Olympiad) proved harder and more durable, but o1 cleared 90%+ in late 2024. AIME is the most interesting transition point: it was always a competition benchmark but became a frontier AI benchmark in 2024 when GPT-4o, then o1, started scoring in ranges previously associated with top high-school competitors. FrontierMath (2024) and the HLE math split represent the current open frontier: problems sourced from professional mathematicians working at the boundary of known mathematics, where top models score below 5% on FrontierMath and below 25% on HLE's math split.
Attention path plus branches.
Solid arrows follow the attention path — the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches. Click any node to jump to its detail.
Nodes in detail.
GSM8K
8,500 grade-school math word problems requiring multi-step arithmetic reasoning. Designed to require chain-of-thought reasoning; GPT-3 scored ~35%, GPT-4 ~92%. Saturated by 2024 — every frontier model exceeds 90%, providing no signal.
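GSM8K grading is exact match on the final number: each reference answer in the dataset ends with a "#### <value>" line, and a response counts as correct if the last number in its reasoning matches. Below is a minimal sketch of that loop, assuming the Hugging Face copy at openai/gsm8k with question/answer fields; the extraction regex is illustrative, not the official harness.

```python
import re
from datasets import load_dataset  # assumes the Hugging Face `datasets` library is installed

def gold_answer(answer_field: str) -> str:
    """GSM8K reference answers end with '#### <final value>'; pull out that value."""
    return answer_field.split("####")[-1].strip().replace(",", "")

def predicted_answer(model_output: str) -> str | None:
    """Take the last number in the model's chain of thought as its final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output.replace(",", ""))
    return numbers[-1] if numbers else None

def gsm8k_accuracy(predictions: list[str], split: str = "test") -> float:
    """Exact-match accuracy of model outputs against GSM8K gold answers."""
    data = load_dataset("openai/gsm8k", "main", split=split)
    # predictions[i] is assumed to be the model's full response to data[i]["question"]
    correct = sum(
        predicted_answer(out) == gold_answer(row["answer"])
        for out, row in zip(predictions, data)
    )
    return correct / len(data)
```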
MATH
12,500 AMC/AIME/Olympiad-sourced problems spanning 7 subject areas and 5 difficulty levels. GPT-4 scored ~52% at launch; o1 cleared 90%+ in 2024. Each problem includes a step-by-step solution with the final answer marked in \boxed{}. The competition-math standard that lasted three years before frontier models cleared even its hardest Level 5 problems.
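Scoring MATH usually means comparing the expression inside the solution's final \boxed{...} with the one the model produces. A minimal brace-matching extractor is sketched below; real harnesses layer answer-normalization rules (stripping whitespace, canonicalizing fractions) on top of this.

```python
def extract_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a MATH-style solution."""
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    begin, depth = i, 0
    # Walk forward, tracking brace depth so nested {...} groups stay intact.
    while i < len(text):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            if depth == 0:
                return text[begin:i]
            depth -= 1
        i += 1
    return None  # unbalanced braces: no well-formed boxed answer found

# e.g. extract_boxed(r"... so the answer is $\boxed{\frac{1}{2}}$") -> r"\frac{1}{2}"
```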
AIME 2024
30 competition problems (15 each from AIME I and AIME II) from the 2024 exams, the middle rung of the AMC → AIME → USAMO → IMO pipeline. Not designed as an AI benchmark, but adopted as one after GPT-4o scored 9/30 in 2024 and o1 scored 23/30, comparable to strong human competitors. A fresh exam each year provides natural contamination control.
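Because every AIME answer is an integer from 0 to 999, grading reduces to pulling one integer per problem out of the model's output and exact-matching it against the answer key, which is how the N/30 scores above are reported. A minimal sketch under that assumption; the problem IDs and answer-key structure are placeholders, not the real 2024 key.

```python
import re

def extract_aime_answer(model_output: str) -> int | None:
    """AIME answers are integers in [0, 999]; take the last such integer mentioned."""
    candidates = [int(m) for m in re.findall(r"\b\d{1,3}\b", model_output)]
    return candidates[-1] if candidates else None

def aime_score(outputs: dict[str, str], answer_key: dict[str, int]) -> str:
    """Exact-match score over all 30 problems (AIME I + AIME II combined)."""
    correct = sum(
        extract_aime_answer(outputs.get(pid, "")) == answer
        for pid, answer in answer_key.items()
    )
    return f"{correct}/{len(answer_key)}"
```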
OmniMATH
4,428 competition problems from IMO, national olympiads, and other elite contests, all at difficulty levels above standard AIME. Designed to remain discriminative after MATH saturated. Top models score 20–50% depending on difficulty tier.
FrontierMath
500+ unpublished problems contributed by professional mathematicians at the boundary of current research — number theory, algebraic geometry, combinatorics at research depth. Problems are held out to prevent contamination. Top models score below 2% on the hardest tier.
HLE (math split)
The mathematics and formal reasoning subset of Humanity's Last Exam (HLE), a 2,500-question expert exam. Encompasses proof-based, competition, and graduate-level mathematics. Current frontier models score below 25% on this split, slightly above the overall HLE average.