Mathematical Reasoning Benchmarks
How mathematical reasoning evaluation evolved from grade-school word problems through competition mathematics to research-frontier problems that current AI cannot reliably solve. The lineage traces the shift from linguistic arithmetic (GSM8K) to formal mathematical proof and open research problems. Branches include the AIME competition track, which became a frontier benchmark after o1 broke it open, and FrontierMath, which sources unpublished problems from professional mathematicians.
GSM8K and MATH together defined the mathematical reasoning evaluation landscape from 2021 through 2023. GPT-4 solved ~92% of GSM8K; that number is now essentially meaningless as a discriminator. MATH (competition-level: AMC, AIME, Olympiad) proved harder and more durable, but o1 cleared 90%+ in late 2024. AIME is the most interesting transition point: it was always a competition benchmark but became a frontier AI benchmark in 2024 when GPT-4o, then o1, started scoring in ranges previously associated with top high-school competitors. FrontierMath (2024) and the HLE math split represent the current open frontier: problems sourced from professional mathematicians working at the boundary of known mathematics, where top models score below 5% on FrontierMath and below 25% on HLE's math split.
Attention path plus branches.
Solid arrows follow the attention path — the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches. Click any node to jump to its detail.
Nodes in detail.
GSM8K
8,500 grade-school math word problems requiring multi-step arithmetic reasoning. Designed to require chain-of-thought reasoning; GPT-3 scored ~35%, GPT-4 ~92%. Saturated by 2024 — every frontier model exceeds 90%, providing no signal.
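GSM8K grading is exact match on the final number: each reference answer in the dataset ends with a "#### <value>" line, and a response counts as correct if the last number in its reasoning matches. Below is a minimal sketch of that loop, assuming the Hugging Face copy at openai/gsm8k with question/answer fields; the extraction regex is illustrative, not the official harness.

```python
import re
from datasets import load_dataset  # assumes the Hugging Face `datasets` library is installed

def gold_answer(answer_field: str) -> str:
    """GSM8K reference answers end with '#### <final value>'; pull out that value."""
    return answer_field.split("####")[-1].strip().replace(",", "")

def predicted_answer(model_output: str) -> str | None:
    """Take the last number in the model's chain of thought as its final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output.replace(",", ""))
    return numbers[-1] if numbers else None

def gsm8k_accuracy(predictions: list[str], split: str = "test") -> float:
    """Exact-match accuracy of model outputs against GSM8K gold answers."""
    data = load_dataset("openai/gsm8k", "main", split=split)
    # predictions[i] is assumed to be the model's full response to data[i]["question"]
    correct = sum(
        predicted_answer(out) == gold_answer(row["answer"])
        for out, row in zip(predictions, data)
    )
    return correct / len(data)
```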
MATH
12,500 AMC/AIME/Olympiad-sourced problems spanning 7 subject areas and 5 difficulty levels. GPT-4 scored ~52% at launch; o1 cleared 90%+ in 2024. Each problem includes a step-by-step solution with the final answer marked in \boxed{}. The competition-math standard that lasted three years before frontier models cleared even its hardest Level 5 problems.
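Scoring MATH usually means comparing the expression inside the solution's final \boxed{...} with the one the model produces. A minimal brace-matching extractor is sketched below; real harnesses layer answer-normalization rules (stripping whitespace, canonicalizing fractions) on top of this.

```python
def extract_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a MATH-style solution."""
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    begin, depth = i, 0
    # Walk forward, tracking brace depth so nested {...} groups stay intact.
    while i < len(text):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            if depth == 0:
                return text[begin:i]
            depth -= 1
        i += 1
    return None  # unbalanced braces: no well-formed boxed answer found

# e.g. extract_boxed(r"... so the answer is $\boxed{\frac{1}{2}}$") -> r"\frac{1}{2}"
```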
AIME 2024
30 competition problems (15 each from AIME I and AIME II) from the 2024 exams, the middle rung of the AMC → AIME → USAMO → IMO pipeline. Not designed as an AI benchmark, but adopted as one after GPT-4o scored 9/30 in 2024 and o1 scored 23/30, comparable to strong human competitors. A fresh exam each year provides natural contamination control.
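Because every AIME answer is an integer from 0 to 999, grading reduces to pulling one integer per problem out of the model's output and exact-matching it against the answer key, which is how the N/30 scores above are reported. A minimal sketch under that assumption; the problem IDs and answer-key structure are placeholders, not the real 2024 key.

```python
import re

def extract_aime_answer(model_output: str) -> int | None:
    """AIME answers are integers in [0, 999]; take the last such integer mentioned."""
    candidates = [int(m) for m in re.findall(r"\b\d{1,3}\b", model_output)]
    return candidates[-1] if candidates else None

def aime_score(outputs: dict[str, str], answer_key: dict[str, int]) -> str:
    """Exact-match score over all 30 problems (AIME I + AIME II combined)."""
    correct = sum(
        extract_aime_answer(outputs.get(pid, "")) == answer
        for pid, answer in answer_key.items()
    )
    return f"{correct}/{len(answer_key)}"
```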
OmniMATH
4,428 competition problems from IMO, national olympiads, and other elite contests, all at difficulty levels above standard AIME. Designed to remain discriminative after MATH saturated. Top models score 20–50% depending on difficulty tier.
FrontierMath
500+ unpublished problems contributed by professional mathematicians at the boundary of current research — number theory, algebraic geometry, combinatorics at research depth. Problems are held out to prevent contamination. Top models score below 2% on the hardest tier.
HLE (math split)
The mathematics and formal reasoning subset of Humanity's Last Exam (HLE), a 2,500-question expert exam. Encompasses proof-based, competition, and graduate-level mathematics. Current frontier models score below 25% on this split, slightly above the overall HLE average.