Codesota · Tasks · Mathematical Reasoning

Mathematical Reasoning.

Mathematical reasoning benchmarks (GSM8K, MATH, Minerva, and the competition-level AIME and AMC tests) have become the primary yardstick for frontier model intelligence. OpenAI's o1 and o3 (2024-2025) solved problems that were previously out of reach by scaling inference-time compute with search and verification. Accuracy on the MATH benchmark rose from ~50% (GPT-4, early 2023) to >90% (o1, late 2024) in under two years. Olympiad-level problems (FrontierMath, Putnam) and formal theorem proving (Lean 4) remain far from solved, which keeps mathematical reasoning the clearest ladder for measuring progress.

4 datasets · 127 results · Canonical metric: accuracy
§ 02 · Canonical benchmark

The reference dataset.

MATH

12,500 competition mathematics problems (5,000 test) from AMC, AIME, and other sources covering algebra, geometry, number theory, and more. Harder than GSM8K. Modern evaluations typically use the MATH-500 representative subset.

Primary metric: accuracy
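
As a rough illustration of what an accuracy score measures, the sketch below counts exact matches between predicted and reference final answers after light normalization. The names `normalize_answer` and `accuracy`, the normalization rules, and the sample answers are all assumptions made for this example. This page does not specify a grading procedure, and real MATH graders typically apply more elaborate equivalence checks.

```python
def normalize_answer(ans: str) -> str:
    """Hypothetical normalizer: trim whitespace, drop surrounding $ signs, remove spaces."""
    return ans.strip().strip("$").replace(" ", "")


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Return the percentage of predictions that match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    correct = sum(
        normalize_answer(p) == normalize_answer(r)
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Made-up answers for illustration only; three of the four match.
preds = ["42", " 3/4 ", "$x+1$", "7"]
refs = ["42", "3/4", "x + 1", "8"]
score = accuracy(preds, refs)
print(f"accuracy: {score:.1f}")
```

Exact match is the simplest choice; it would mark mathematically equivalent forms such as 0.75 and 3/4 as different, which is one reason published scores can depend on the grader used.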
§ 03 · Top 10

Leading models.

Leading models on MATH.

#    Model               Accuracy  Year  Source
1    o4-mini (high)      98.2      2026  paper
2    o3 (high)           98.1      2026  paper
3    o3-mini             97.9      2026  paper
4    o3                  97.8      2026  paper
5    o4-mini             97.5      2026  paper
6    Gemini 2.5 Pro      97.3      2026  paper
7    DeepSeek R1         97.3      2026  paper
8    o1                  96.4      2026  paper
9    Kimi k1.5           96.2      2026  paper
10   Claude 3.7 Sonnet   96.2      2026  paper

§ 04 · All datasets

Tracked datasets.

4 datasets tracked for this task.

MATH (canonical) · 46 results · accuracy · Top: o4-mini (high), 98.2
GSM8K · 48 results · accuracy · Top: ERNIE 5.0, 99.7
AIME 2025 · 22 results · accuracy · Top: Step-3.5-Flash PaCoRe, 99.9
AIME 2024 · 11 results · accuracy · Top: o3, 96.7
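
The per-dataset result counts can be checked against the headline figure of 127 results with a few lines of arithmetic. The sketch below copies the counts from the list above; the variable names are chosen only for this example.

```python
# Result counts copied from the tracked-datasets list on this page.
results_per_dataset = {
    "MATH": 46,
    "GSM8K": 48,
    "AIME 2025": 22,
    "AIME 2024": 11,
}

total_results = sum(results_per_dataset.values())
dataset_count = len(results_per_dataset)

# The page header reports 4 datasets and 127 results.
print(f"{dataset_count} datasets, {total_results} results")
```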
§ 05 · Related tasks

Other tasks in Reasoning.

Arithmetic Reasoning · Commonsense Reasoning · Logical Reasoning · Multi-step Reasoning