Codesota · LLM · Math Benchmarks
Math · updated April 2026

LLM Math Benchmarks.

From grade-school arithmetic (GSM8K) to Olympiad-level competition problems (AIME 2024). Reasoning models now dominate: o3 and DeepSeek-R1 surpass typical human competition scores on most of these benchmarks.

§ 01 · GSM8K

Grade-school word problems, 2–8 step arithmetic.

8,500 grade-school word problems requiring 2–8 arithmetic steps, released by OpenAI (2021). Largely saturated at the frontier; still useful for evaluating smaller models.

| # | Model | Provider | Accuracy | Date |
|---|-------|----------|----------|------|
| 1 | ERNIE 5.0 | Baidu | 99.7% | Apr 2026 |
| 2 | GPT-5 | OpenAI | 99.2% | Apr 2026 |
| 3 | Gemini 2.5 Pro | Google | 99% | Mar 2026 |
| 4 | o4-mini | OpenAI | 99% | Mar 2026 |
| 5 | o3 | OpenAI | 99% | Mar 2026 |
| 6 | Claude 4 | Anthropic | 98.9% | Apr 2026 |
| 7 | Llama-4-Maverick | Meta | 98.7% | Mar 2026 |
| 8 | Claude Opus 4.5 | Anthropic | 98.6% | Mar 2026 |
| 9 | Llama 4 Behemoth 2T | Meta | 98.5% | Apr 2026 |
| 10 | GPT-4.5 | OpenAI | 98.2% | Apr 2026 |
| 11 | Claude Opus 4 | Anthropic | 98% | Mar 2026 |
| 12 | o1 | OpenAI | 97.8% | Apr 2026 |
| 13 | o1-preview | OpenAI | 97.8% | Dec 2025 |
| 14 | Claude Sonnet 4 | Anthropic | 97.8% | Mar 2026 |
| 15 | DeepSeek R1 | DeepSeek | 97.3% | Mar 2026 |
| 16 | Claude 3.5 Sonnet | Anthropic | 96.4% | Dec 2025 |
| 17 | Qwen2.5-72B-Instruct | Alibaba | 95.8% | Mar 2026 |
| 18 | DeepSeek-V3 | DeepSeek | 95.8% | Mar 2026 |
| 19 | Claude 3.5 Sonnet | Anthropic | 95% | Apr 2026 |
| 20 | Claude 3 Opus | Anthropic | 95% | Apr 2026 |
| 21 | Gemini Ultra | Google DeepMind | 94.4% | Apr 2026 |
| 22 | Llama 3 70B | Meta | 93% | Dec 2025 |
| 23 | GPT-4 | OpenAI | 92% | Apr 2026 |
| 24 | GPT-4o | OpenAI | 92% | Dec 2025 |
| 25 | Gemini 1.5 Pro | Google | 91.7% | Dec 2025 |
| 26 | Claude 3 Haiku | Anthropic | 88.9% | Apr 2026 |
| 27 | Mixtral-8x22b | Mistral | 88% | Apr 2026 |
| 28 | PaLM 540B (Self-Consistency) | Google | 74% | Apr 2026 |
| 29 | PaLM 540B (CoT) | Google | 58% | Apr 2026 |
| 30 | GPT-3 (base) | OpenAI | 8% | Apr 2026 |

Source: openai/grade-school-math · Chain-of-thought, maj@1.
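
Scoring here is exact match on the final number, extracted from a single chain-of-thought sample (maj@1). A minimal Python sketch under those assumptions, with `generate` standing in for whatever model call you use and reference answers in the repository's "#### <answer>" format:

```python
import re

def gold_answer(reference: str) -> str:
    # GSM8K reference solutions end with "#### <final number>".
    return reference.split("####")[-1].strip().replace(",", "")

def predicted_answer(completion: str) -> str:
    # Treat the last number in the model's chain of thought as its final answer.
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", completion)
    return numbers[-1].replace(",", "") if numbers else ""

def gsm8k_accuracy(problems, generate) -> float:
    # maj@1: one chain-of-thought sample per problem, exact match on the answer.
    correct = 0
    for item in problems:  # each item: {"question": ..., "answer": ...}
        completion = generate(item["question"])
        correct += predicted_answer(completion) == gold_answer(item["answer"])
    return correct / len(problems)
```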

§ 02 · MATH-500

Algebra, geometry, number theory, difficulty 1–5.

500 representative problems from the MATH dataset (Hendrycks et al., 2021), covering algebra, geometry, number theory, and pre-calculus at difficulty levels 1–5. Reasoning models have recently surpassed 90%.

| # | Model | Provider | Accuracy | Date |
|---|-------|----------|----------|------|
| 1 | o4-mini (high) | OpenAI | 98.2% | Mar 2026 |
| 2 | o3 (high) | OpenAI | 98.1% | Mar 2026 |
| 3 | o3-mini | OpenAI | 97.9% | Mar 2026 |
| 4 | o3 | OpenAI | 97.8% | Mar 2026 |
| 5 | o4-mini | OpenAI | 97.5% | Mar 2026 |
| 6 | DeepSeek R1 | DeepSeek | 97.3% | Mar 2026 |
| 7 | Gemini 2.5 Pro | Google | 97.3% | Mar 2026 |
| 8 | o1 | OpenAI | 96.4% | Mar 2026 |
| 9 | Kimi k1.5 | Moonshot AI | 96.2% | Mar 2026 |
| 10 | Claude 3.7 Sonnet | Anthropic | 96.2% | Mar 2026 |
| 11 | DeepSeek-R1-Zero | DeepSeek | 95.9% | Mar 2026 |
| 12 | DeepSeek-R1-Distill-Llama-70B | DeepSeek | 94.5% | Mar 2026 |
| 13 | DeepSeek-R1-Distill-Qwen-32B | DeepSeek | 94.3% | Mar 2026 |
| 14 | DeepSeek-v3-0324 | DeepSeek | 94% | Mar 2026 |
| 15 | Claude Opus 4.5 | Anthropic | 90.7% | Mar 2026 |
| 16 | QwQ-32B | Alibaba/Qwen | 90.6% | Mar 2026 |
| 17 | DeepSeek-V3 | DeepSeek | 90.2% | Mar 2026 |
| 18 | o1-mini | OpenAI | 90% | Mar 2026 |
| 19 | Llama-4-Maverick | Meta | 89.4% | Mar 2026 |
| 20 | Claude Opus 4 | Anthropic | 89.2% | Mar 2026 |
| 21 | Claude Sonnet 4 | Anthropic | 88.9% | Mar 2026 |
| 22 | GPT-4.5 Preview | OpenAI | 87.1% | Mar 2026 |
| 23 | o1-preview | OpenAI | 85.5% | Mar 2026 |
| 24 | Qwen2.5-72B-Instruct | Alibaba | 83.1% | Mar 2026 |
| 25 | GPT-4.1 | OpenAI | 82.1% | Mar 2026 |
| 26 | GPT-4o | OpenAI | 76.6% | Mar 2026 |
| 27 | Grok 2 | xAI | 76.1% | Mar 2026 |
| 28 | Llama 3.1 405B | Meta | 73.8% | Mar 2026 |
| 29 | GPT-4 Turbo | OpenAI | 73.4% | Mar 2026 |
| 30 | Claude 3.5 Sonnet | Anthropic | 71.1% | Mar 2026 |
| 31 | GPT-4o mini | OpenAI | 70.2% | Mar 2026 |
| 32 | Llama 3.1 70B | Meta | 68% | Mar 2026 |
| 33 | Gemini 1.5 Pro | Google | 67.7% | Mar 2026 |
| 34 | Claude 3 Opus | Anthropic | 60.1% | Mar 2026 |

Source: hendrycks/math · MATH-500 representative subset.
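
Grading MATH-500 typically means pulling the final \boxed{...} expression out of both the reference solution and the model output and comparing the two strings after normalization. A minimal sketch of that idea; `last_boxed` and `normalize` are illustrative helpers, and production graders usually add a symbolic-equivalence check (e.g. with sympy) that this omits:

```python
def last_boxed(text: str) -> str | None:
    # Return the contents of the last \boxed{...}, walking braces to handle nesting.
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i, depth = start + len(r"\boxed{"), 1
    for j in range(i, len(text)):
        if text[j] == "{":
            depth += 1
        elif text[j] == "}":
            depth -= 1
            if depth == 0:
                return text[i:j]
    return None  # unbalanced braces

def normalize(answer: str) -> str:
    # Very light string normalization; real graders also check symbolic equivalence.
    return answer.replace(" ", "").replace(r"\left", "").replace(r"\right", "").rstrip(".")

def is_correct(completion: str, reference_solution: str) -> bool:
    pred, gold = last_boxed(completion), last_boxed(reference_solution)
    if pred is None or gold is None:
        return False
    return normalize(pred) == normalize(gold)
```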

§ 03 · AIME 2024

Olympiad-grade competition, integer answers 0–999.

Integer-answer problems from the 2024 American Invitational Mathematics Examination (15 per exam, AIME I and II combined). Human qualifiers average around 3 correct out of 15; the strongest students score 10 or more. This is the sharpest differentiator among frontier models: only reasoning models clear 70%.

| # | Model | Provider | % Correct | Date |
|---|-------|----------|-----------|------|
| 1 | o3 | OpenAI | 96.7% | Mar 2026 |
| 2 | o4-mini | OpenAI | 93.4% | Mar 2026 |
| 3 | Gemini 2.5 Pro | Google | 92% | Mar 2026 |
| 4 | o1-preview | OpenAI | 83.3% | Dec 2025 |
| 5 | Claude 3.7 Sonnet | Anthropic | 80% | Mar 2026 |
| 6 | DeepSeek R1 | DeepSeek | 79.8% | Mar 2026 |
| 7 | Claude 3.5 Opus | Anthropic | 16% | Dec 2025 |
| 8 | GPT-4o | OpenAI | 13.4% | Dec 2025 |

AIME 2024 I & II combined. Human AMC/AIME competitor baseline: ~20-30%.
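
Because every AIME answer is an integer from 0 to 999, scoring reduces to parsing one integer per problem and checking exact match. A minimal sketch under that assumption, with `generate` as a placeholder model call and each problem carrying its reference integer:

```python
import re

def extract_integer(completion: str) -> int | None:
    # AIME answers are integers in [0, 999]; take the last in-range integer mentioned.
    candidates = [int(m) for m in re.findall(r"\d+", completion)]
    in_range = [c for c in candidates if 0 <= c <= 999]
    return in_range[-1] if in_range else None

def aime_score(problems, generate) -> float:
    # problems: [{"problem": <statement>, "answer": <int 0-999>}, ...] for AIME I + II.
    correct = sum(
        extract_integer(generate(p["problem"])) == p["answer"]
        for p in problems
    )
    return correct / len(problems)
```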

§ 04 · Methodology

Frequently asked.

Why do reasoning models dominate math benchmarks?

Models like o3 and DeepSeek-R1 use extended chain-of-thought with self-verification before committing to an answer. This internal search allows them to explore multiple solution paths and backtrack from errors — critical for multi-step proofs.
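
The "explore multiple paths" idea can also be approximated externally via self-consistency, the decoding strategy behind the PaLM 540B (Self-Consistency) row in the GSM8K table: sample several chains of thought at nonzero temperature and majority-vote the parsed answers. A minimal sketch, with `sample_solution` and `extract_answer` as placeholder callables:

```python
from collections import Counter

def self_consistency_answer(question, sample_solution, extract_answer, k=16):
    # Sample k independent chains of thought at nonzero temperature and keep
    # the most common parsed final answer (self-consistency decoding).
    answers = [extract_answer(sample_solution(question)) for _ in range(k)]
    answers = [a for a in answers if a not in (None, "")]
    return Counter(answers).most_common(1)[0][0] if answers else None
```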

Is GSM8K still a useful benchmark?

For frontier models, no: it is saturated, with scores above 94%. GSM8K remains useful for comparing smaller models (7B–13B), where accuracy still varies widely across roughly 60–90%.

What is the hardest math benchmark for LLMs?

AIME 2024 remains meaningful at the frontier since it requires novel combinatorial and algebraic insight that can't be memorized. HLE (Humanity's Last Exam) includes math problems even harder than AIME, with frontier models scoring below 40%.

§ 05 · Related

Continue reading.

- Math · deep dive: GSM8K & MATH Deep Dive · Detailed scores and historical progression
- Reasoning: Reasoning Benchmarks · GPQA Diamond, MMLU-Pro, HLE
- Index: All LLM Benchmarks · Full leaderboard overview