MATH
12,500 competition mathematics problems (7,500 train, 5,000 test) drawn from AMC, AIME, and other competitions, covering algebra, geometry, number theory, and more. The problems are harder than those in GSM8K. Modern evaluations typically report results on MATH-500, a representative 500-problem subset of the test set.
Benchmark Stats
Metric: accuracy (%). Higher is better. Entries are evaluated on either MATH-500 or the full MATH test set, as stated in each row.
| Rank | Model | Notes | Source | Score (%) | Year | Paper |
|---|---|---|---|---|---|---|
| 1 | o4-mini (high) | MATH-500, zero-shot CoT, pass@1. High reasoning effort. | Community | 98.2 | 2026 | Source |
| 2 | o3 (high) | MATH-500, zero-shot CoT, pass@1. High reasoning effort. | Community | 98.1 | 2026 | Source |
| 3 | o3-mini | MATH-500, zero-shot CoT, pass@1. High reasoning effort. | Editorial | 97.9 | 2026 | Source |
| 4 | o3 | MATH-500, zero-shot CoT, pass@1. Default reasoning effort. | Editorial | 97.8 | 2026 | Source |
| 5 | o4-mini | MATH-500, zero-shot CoT, pass@1. Default reasoning effort. | Editorial | 97.5 | 2026 | Source |
| 6 | DeepSeek-R1 | MATH-500, pass@1. From official DeepSeek-R1 paper (Jan 2025). | Editorial | 97.3 | 2026 | Source |
| 7 | Gemini 2.5 Pro | MATH-500, pass@1. Gemini 2.5 Pro (Mar 2025). | Community | 97.3 | 2026 | Source |
| 8 | o1 | MATH-500, zero-shot CoT, pass@1. | Editorial | 96.4 | 2026 | Source |
| 9 | Claude 3.7 Sonnet | MATH-500 with extended thinking enabled. | Editorial | 96.2 | 2026 | Source |
| 10 | Kimi k1.5 | MATH-500, long-CoT variant. From official Kimi k1.5 paper (Jan 2025). | Community | 96.2 | 2026 | Source |
| 11 | DeepSeek-R1-Zero | MATH-500, pass@1. Pure RL, no SFT. From R1 paper (Jan 2025). | Community | 95.9 | 2026 | Source |
| 12 | DeepSeek-R1-Distill-Llama-70B | MATH-500, pass@1. Distilled from DeepSeek-R1 into Llama-3.1-70B. From R1 paper (Jan 2025). | Community | 94.5 | 2026 | Source |
| 13 | DeepSeek-R1-Distill-Qwen-32B | MATH-500, pass@1. Distilled from DeepSeek-R1 into Qwen-2.5-32B. From R1 paper (Jan 2025). | Community | 94.3 | 2026 | Source |
| 14 | DeepSeek-V3-0324 | MATH-500. Updated model (Mar 2025). Non-reasoning base model. | Community | 94 | 2026 | Source |
| 15 | QwQ-32B | MATH-500, pass@1. Reasoning model by Alibaba/Qwen (Mar 2025). | Community | 90.6 | 2026 | Source |
| 16 | DeepSeek-V3 | MATH-500. Non-reasoning base model. From DeepSeek-V3 technical report (Dec 2024). | Editorial | 90.2 | 2026 | Source |
| 17 | o1-mini | MATH-500, zero-shot CoT, pass@1. | Editorial | 90 | 2026 | Source |
| 18 | GPT-4.5 Preview | Full MATH test set, zero-shot CoT. | Editorial | 87.1 | 2026 | Source |
| 19 | o1-preview | MATH-500, zero-shot CoT, pass@1. | Editorial | 85.5 | 2026 | Source |
| 20 | GPT-4.1 | Full MATH test set, zero-shot CoT. | Editorial | 82.1 | 2026 | Source |
| 21 | GPT-4o | Full MATH test set, zero-shot CoT. gpt-4o-2024-05-13. | Editorial | 76.6 | 2026 | Source |
| 22 | Grok 2 | Full MATH test set. | Editorial | 76.1 | 2026 | Source |
| 23 | Llama 3.1 405B | Full MATH test set. | Editorial | 73.8 | 2026 | Source |
| 24 | GPT-4 Turbo | Full MATH test set, zero-shot CoT. | Editorial | 73.4 | 2026 | Source |
| 25 | Claude 3.5 Sonnet | Full MATH test set. Original Claude 3.5 Sonnet (June 2024). | Editorial | 71.1 | 2026 | Source |
| 26 | GPT-4o mini | Full MATH test set, zero-shot CoT. | Editorial | 70.2 | 2026 | Source |
| 27 | Llama 3.1 70B | Full MATH test set. | Editorial | 68 | 2026 | Source |
| 28 | Gemini 1.5 Pro | From Google's official evaluation. | Editorial | 67.7 | 2026 | Source |
| 29 | Claude 3 Opus | Full MATH test set. | Editorial | 60.1 | 2026 | Source |
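
Most rows above report pass@1 accuracy: one sampled completion per problem, scored correct when its final answer matches the reference. The sketch below illustrates roughly how such a score could be computed. It assumes that final answers are wrapped in `\boxed{...}` and that a light string normalization is enough to compare them; real graders typically use more thorough equivalence checks. The function names `extract_boxed`, `normalize`, and `pass_at_1` are hypothetical and do not come from any of the cited evaluations.

```python
from typing import List, Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    # Walk forward, tracking brace depth so nested LaTeX such as \frac{1}{2} survives.
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces: treat as no answer


def normalize(answer: str) -> str:
    """Assumed normalization: drop spaces and a trailing period only."""
    return answer.replace(" ", "").rstrip(".")


def pass_at_1(completions: List[str], references: List[str]) -> float:
    """Percentage of problems whose single completion matches the reference answer."""
    if not references or len(completions) != len(references):
        raise ValueError("need one completion per reference, and at least one problem")
    correct = 0
    for completion, reference in zip(completions, references):
        predicted = extract_boxed(completion)
        if predicted is not None and normalize(predicted) == normalize(reference):
            correct += 1
    return 100.0 * correct / len(references)


if __name__ == "__main__":
    # Toy data for illustration only; these are not MATH problems or model outputs.
    outputs = ["so the answer is \\boxed{4}", "\\boxed{ 7 }", "no final answer", "\\boxed{3}"]
    answers = ["4", "7", "1", "5"]
    print(pass_at_1(outputs, answers))
```

Exact string matching undercounts answers that are mathematically equal but written differently (for example `0.5` versus `\frac{1}{2}`), which is one reason scores from different evaluation harnesses are not always directly comparable.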