MATH

12,500 competition mathematics problems (7,500 training, 5,000 test) drawn from AMC, AIME, and other competitions, covering algebra, geometry, number theory, and more. The problems are harder than those in GSM8K. Modern evaluations typically report results on MATH-500, a representative 500-problem subset of the test set.

Benchmark Stats

Models: 29
Papers: 29
Metrics: 1

Metric: accuracy (higher is better)
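As a rough illustration of how an accuracy (pass@1) percentage could be computed, the sketch below scores one sampled answer per problem against a reference answer. The `normalize` helper and the sample data are hypothetical and exist only for this example; real MATH graders typically use more careful mathematical equivalence checks than plain string comparison.

```python
def normalize(answer: str) -> str:
    """Crude, hypothetical normalization: trim whitespace, drop a trailing period, remove spaces."""
    return answer.strip().rstrip(".").replace(" ", "")


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Percentage of problems whose single sampled answer matches the reference (pass@1)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one problem")
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Made-up answers for illustration only; not taken from the benchmark.
preds = ["\\frac{1}{2}", "42", "x = 3"]
refs = ["\\frac{1}{2}", "41", "x=3"]
score = accuracy(preds, refs)
print(f"{score:.1f}")  # prints 66.7 (2 of 3 correct)
```

Exact string matching undercounts answers that are mathematically equal but written differently (for example `0.5` versus `\frac{1}{2}`), so scores from different graders are not always directly comparable.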

| Rank | Model | Evaluation notes | Source | Score | Year |
|------|-------|------------------|--------|-------|------|
| 1 | o4-mini (high) | MATH-500, zero-shot CoT, pass@1. High reasoning effort. | Community | 98.2 | 2026 |
| 2 | o3 (high) | MATH-500, zero-shot CoT, pass@1. High reasoning effort. | Community | 98.1 | 2026 |
| 3 | o3-mini | MATH-500, zero-shot CoT, pass@1. High reasoning effort. | Editorial | 97.9 | 2026 |
| 4 | o3 | MATH-500, zero-shot CoT, pass@1. Default reasoning effort. | Editorial | 97.8 | 2026 |
| 5 | o4-mini | MATH-500, zero-shot CoT, pass@1. Default reasoning effort. | Editorial | 97.5 | 2026 |
| 6 | DeepSeek-R1 | MATH-500, pass@1. From official DeepSeek-R1 paper (Jan 2025). | Editorial | 97.3 | 2026 |
| 7 | Gemini 2.5 Pro | MATH-500, pass@1. Gemini 2.5 Pro (Mar 2025). | Community | 97.3 | 2026 |
| 8 | o1 | MATH-500, zero-shot CoT, pass@1. | Editorial | 96.4 | 2026 |
| 9 | Claude 3.7 Sonnet | MATH-500 with extended thinking enabled. | Editorial | 96.2 | 2026 |
| 10 | Kimi k1.5 | MATH-500, long-CoT variant. From official Kimi k1.5 paper (Jan 2025). | Community | 96.2 | 2026 |
| 11 | DeepSeek-R1-Zero | MATH-500, pass@1. DeepSeek-R1-Zero (pure RL, no SFT). From R1 paper (Jan 2025). | Community | 95.9 | 2026 |
| 12 | DeepSeek-R1-Distill-Llama-70B | MATH-500, pass@1. Distilled from DeepSeek-R1 into Llama-3.1-70B. From R1 paper (Jan 2025). | Community | 94.5 | 2026 |
| 13 | DeepSeek-R1-Distill-Qwen-32B | MATH-500, pass@1. Distilled from DeepSeek-R1 into Qwen-2.5-32B. From R1 paper (Jan 2025). | Community | 94.3 | 2026 |
| 14 | DeepSeek-V3-0324 | MATH-500. DeepSeek-V3-0324 updated model (Mar 2025). Non-reasoning base model. | Community | 94.0 | 2026 |
| 15 | QwQ-32B | MATH-500, pass@1. QwQ-32B reasoning model by Alibaba/Qwen (Mar 2025). | Community | 90.6 | 2026 |
| 16 | DeepSeek-V3 | MATH-500. Non-reasoning base model. From DeepSeek-V3 technical report (Dec 2024). | Editorial | 90.2 | 2026 |
| 17 | o1-mini | MATH-500, zero-shot CoT, pass@1. | Editorial | 90.0 | 2026 |
| 18 | GPT-4.5 Preview | Full MATH test set, zero-shot CoT. | Editorial | 87.1 | 2026 |
| 19 | o1-preview | MATH-500, zero-shot CoT, pass@1. | Editorial | 85.5 | 2026 |
| 20 | GPT-4.1 | Full MATH test set, zero-shot CoT. | Editorial | 82.1 | 2026 |
| 21 | GPT-4o | Full MATH test set, zero-shot CoT. gpt-4o-2024-05-13. | Editorial | 76.6 | 2026 |
| 22 | Grok 2 | Full MATH test set. | Editorial | 76.1 | 2026 |
| 23 | Llama 3.1 405B | Full MATH test set. | Editorial | 73.8 | 2026 |
| 24 | GPT-4 Turbo | Full MATH test set, zero-shot CoT. | Editorial | 73.4 | 2026 |
| 25 | Claude 3.5 Sonnet | Full MATH test set. Original Claude 3.5 Sonnet (June 2024). | Editorial | 71.1 | 2026 |
| 26 | GPT-4o mini | Full MATH test set, zero-shot CoT. | Editorial | 70.2 | 2026 |
| 27 | Llama 3.1 70B | Full MATH test set. | Editorial | 68.0 | 2026 |
| 28 | Gemini 1.5 Pro | From Google's official evaluation. | Editorial | 67.7 | 2026 |
| 29 | Claude 3 Opus | Full MATH test set. | Editorial | 60.1 | 2026 |

MATH Leaderboard | CodeSOTA