GSM8K: 8,500 grade-school word problems requiring 2-8 arithmetic steps, from OpenAI (2021). Largely saturated at the frontier, but still useful for evaluating smaller models.
| # | Model | Provider | Accuracy | Date |
|---|---|---|---|---|
| ★ | ERNIE 5.0 | Baidu | 99.7% | Apr 2026 |
| 2 | MiMo-V2.5-Pro | | 99.6% | Apr 2026 |
| 3 | GPT-5 | OpenAI | 99.2% | Apr 2026 |
| 4 | o3 | OpenAI | 99% | Mar 2026 |
| 5 | Gemini 2.5 Pro | | 99% | Mar 2026 |
| 6 | o4-mini | OpenAI | 99% | Mar 2026 |
| 7 | Claude 4 | Anthropic | 98.9% | Apr 2026 |
| 8 | Llama 4 Maverick | Meta | 98.7% | Mar 2026 |
| 9 | Claude Opus 4.5 | Anthropic | 98.6% | Mar 2026 |
| 10 | Llama 4 Behemoth 2T | Meta | 98.5% | Apr 2026 |
| 11 | GPT-4.5 | OpenAI | 98.2% | Apr 2026 |
| 12 | Claude Opus 4 | Anthropic | 98% | Mar 2026 |
| 13 | o1-preview | OpenAI | 97.8% | Dec 2025 |
| 14 | o1 | OpenAI | 97.8% | Apr 2026 |
| 15 | Claude Sonnet 4 | Anthropic | 97.8% | Mar 2026 |
| 17 | DeepSeek R1 | DeepSeek | 97.3% | Mar 2026 |
| 18 | Llama 3 (405B, Instruct) | Meta | 96.8% | Jul 2024 |
| 19 | Claude 3.5 Sonnet | Anthropic | 96.4% | Dec 2025 |
| 20 | Qwen2.5-Plus | | 96% | Dec 2024 |
| 21 | DeepSeek-V3 | DeepSeek | 95.8% | Mar 2026 |
| 22 | Qwen2.5-72B-Instruct | Alibaba | 95.8% | Mar 2026 |
| 23 | Qwen2.5-VL-72B | | 95.3% | Feb 2025 |
| 24 | Claude 3 Opus | Anthropic | 95% | Apr 2026 |
| 25 | Claude 3.5 Sonnet | Anthropic | 95% | Apr 2026 |
| 26 | MiniMax-Text-01 | MiniMax | 94.8% | Jan 2025 |
| 27 | MiniCPM-o 4.5-Instruct | | 94.5% | Apr 2026 |
| 28 | Gemini Ultra | Google DeepMind | 94.4% | Apr 2026 |
| 29 | Qwen3-235B-A22B | Alibaba | 94.39% | May 2025 |
| 30 | Llama 3 70B | Meta | 93% | Dec 2025 |
| 31 | GPT-4 | OpenAI | 92% | Apr 2026 |
| 33 | GPT-4o | OpenAI | 92% | Dec 2025 |
| 34 | Gemini 1.5 Pro | | 91.7% | Dec 2025 |
| 35 | Claude 3 Haiku | Anthropic | 88.9% | Apr 2026 |
| 36 | Step-3.5-Flash Base | | 88.2% | Feb 2026 |
| 37 | Mixtral-8x22b | Mistral | 88% | Apr 2026 |
| 38 | HRM-Text-1B | | 84.7% | May 2026 |
| 39 | Apertus-70B-Instruct | | 77.6% | Sep 2025 |
| 40 | PaLM 540B (Self-Consistency) | | 74% | Apr 2026 |
| 41 | LLaMA-65B | | 69.7% | Feb 2023 |
| 42 | Chameleon 34B | | 61.4% | May 2024 |
| 43 | BitNet b1.58 2B4T | | 58.38% | Apr 2025 |
| 44 | PaLM 540B (CoT) | | 58% | Apr 2026 |
| 45 | Llama 2 70B (5-shot) | | 56.8% | Jul 2023 |
| 46 | Code Llama - Python 34B | | 34.42% | Aug 2023 |
| 47 | SmolLM2 (1.7B) | | 31.1% | Feb 2025 |
| 48 | GPT-3 (base) | OpenAI | 8% | Apr 2026 |
Source: openai/grade-school-math · Chain-of-thought, maj@1.
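Exact-match scoring for a setup like this can be sketched as follows. GSM8K reference solutions end with a line of the form `#### <number>`. Everything else here is an assumption for illustration, not the leaderboard's actual harness: the function names, and the heuristic of taking the last number in the model's chain-of-thought output as its answer.

```python
import re
from typing import Optional, Sequence

# Matches integers and decimals, allowing thousands separators ("1,000").
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def _to_float(text: str) -> float:
    return float(text.replace(",", ""))


def reference_answer(solution: str) -> float:
    """Parse the final answer from a GSM8K reference solution ('... #### 72')."""
    return _to_float(solution.split("####")[-1].strip())


def predicted_answer(completion: str) -> Optional[float]:
    """Assumed heuristic: the last number in the completion is the answer."""
    matches = _NUMBER.findall(completion)
    return _to_float(matches[-1]) if matches else None


def accuracy(completions: Sequence[str], solutions: Sequence[str]) -> float:
    """Percentage of completions whose extracted answer equals the reference."""
    correct = sum(
        predicted_answer(c) == reference_answer(s)
        for c, s in zip(completions, solutions)
    )
    return 100.0 * correct / len(solutions)
```

Real harnesses differ mainly in how they extract the answer (for example, requiring a fixed "The answer is" phrase), and that choice alone can shift reported scores by a point or more, which matters when the top of the table is separated by tenths of a percent.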
MATH-500: 500 representative problems from the MATH dataset (Hendrycks et al., 2021), covering algebra, geometry, number theory, and pre-calculus at difficulty levels 1-5. Reasoning models have recently surpassed 90%.
| # | Model | Provider | Accuracy | Date |
|---|---|---|---|---|
| ★ | o4-mini (high) | OpenAI | 98.2% | Mar 2026 |
| 2 | o3 (high) | OpenAI | 98.1% | Mar 2026 |
| 3 | o3-mini | OpenAI | 97.9% | Mar 2026 |
| 4 | o3 | OpenAI | 97.8% | Mar 2026 |
| 5 | o4-mini | OpenAI | 97.5% | Mar 2026 |
| 6 | Gemini 2.5 Pro | | 97.3% | Mar 2026 |
| 7 | DeepSeek R1 | DeepSeek | 97.3% | Mar 2026 |
| 8 | o1 | OpenAI | 96.4% | Mar 2026 |
| 9 | Claude 3.7 Sonnet | Anthropic | 96.2% | Mar 2026 |
| 10 | Kimi k1.5 | Moonshot AI | 96.2% | Mar 2026 |
| 11 | DeepSeek-R1-Zero | DeepSeek | 95.9% | Mar 2026 |
| 12 | DeepSeek-R1-Distill-Llama-70B | DeepSeek | 94.5% | Mar 2026 |
| 13 | DeepSeek-R1-Distill-Qwen-32B | DeepSeek | 94.3% | Mar 2026 |
| 14 | DeepSeek-v3-0324 | DeepSeek | 94% | Mar 2026 |
| 15 | Claude Opus 4.5 | Anthropic | 90.7% | Mar 2026 |
| 16 | QwQ-32B | Alibaba/Qwen | 90.6% | Mar 2026 |
| 17 | DeepSeek-V3 | DeepSeek | 90.2% | Mar 2026 |
| 18 | o1-mini | OpenAI | 90% | Mar 2026 |
| 19 | Llama 4 Maverick | Meta | 89.4% | Mar 2026 |
| 20 | Claude Opus 4 | Anthropic | 89.2% | Mar 2026 |
| 21 | Claude Sonnet 4 | Anthropic | 88.9% | Mar 2026 |
| 22 | GPT-4.5 Preview | OpenAI | 87.1% | Mar 2026 |
| 23 | o1-preview | OpenAI | 85.5% | Mar 2026 |
| 24 | Qwen2.5-Plus | | 84.7% | Dec 2024 |
| 25 | Qwen2.5-72B-Instruct | Alibaba | 83.1% | Mar 2026 |
| 26 | Qwen2.5-VL-72B | | 83% | Feb 2025 |
| 27 | GPT-4.1 | OpenAI | 82.1% | Mar 2026 |
| 28 | MiniMax-Text-01 | MiniMax | 77.4% | Jan 2025 |
| 29 | GPT-4o | OpenAI | 76.6% | Mar 2026 |
| 30 | Grok 2 | xAI | 76.1% | Mar 2026 |
| 31 | Llama 3 (405B, Instruct) | Meta | 73.8% | Jul 2024 |
| 32 | Llama 3.1 405B | Meta | 73.8% | Mar 2026 |
| 33 | GPT-4 Turbo | OpenAI | 73.4% | Mar 2026 |
| 34 | Qwen3-235B-A22B | Alibaba | 71.84% | May 2025 |
| 35 | Claude 3.5 Sonnet | Anthropic | 71.1% | Mar 2026 |
| 36 | GPT-4o mini | OpenAI | 70.2% | Mar 2026 |
| 37 | Llama 3.1 70B | Meta | 68% | Mar 2026 |
| 38 | Gemini 1.5 Pro | | 67.7% | Mar 2026 |
| 39 | Step-3.5-Flash Base | | 66.8% | Feb 2026 |
| 40 | Claude 3 Opus | Anthropic | 60.1% | Mar 2026 |
| 41 | HRM-Text-1B | | 56.5% | May 2026 |
| 42 | Aria | | 50.8% | Oct 2024 |
| 43 | Apertus-70B-Instruct | | 30.8% | Sep 2025 |
| 44 | Chameleon 34B | | 22.5% | May 2024 |
| 45 | LLaMA-65B | | 20.5% | Feb 2023 |
| 46 | SmolLM2 (1.7B) | | 11.6% | Feb 2025 |
Source: hendrycks/math · MATH-500 representative subset.
AIME 2024: integer-answer problems from the American Invitational Mathematics Examination 2024, 15 per exam. Human competitors average ~3/15; top students score 10+. This is the sharpest differentiator among frontier models: only reasoning models breach 70%.
| # | Model | Provider | % Correct | Date |
|---|---|---|---|---|
| ★ | o3 | OpenAI | 96.7% | Mar 2026 |
| 2 | o4-mini | OpenAI | 93.4% | Mar 2026 |
| 3 | Gemini 2.5 Pro | | 92% | Mar 2026 |
| 4 | GLM-4.5-Air | Zhipu AI | 89.4% | Aug 2025 |
| 5 | Qwen3-Coder-Next | Qwen | 89.01% | Feb 2026 |
| 6 | Qwen3-235B-A22B | Alibaba | 85.7% | May 2025 |
| 7 | o1-preview | OpenAI | 83.3% | Dec 2025 |
| 8 | Claude 3.7 Sonnet | Anthropic | 80% | Mar 2026 |
| 9 | DeepSeek R1 | DeepSeek | 79.8% | Mar 2026 |
| 10 | Claude 3.5 Opus | Anthropic | 16% | Dec 2025 |
| 11 | GPT-4o | OpenAI | 13.4% | Dec 2025 |
AIME 2024 I & II combined. Human AMC/AIME competitor baseline: ~20-30%.
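As a rough illustration of how a "% Correct" figure arises, the sketch below scores predictions by exact match. AIME answers are integers from 0 to 999. The function name is hypothetical, the answer key is a placeholder rather than the real 2024 key, and the 30-problem total assumes both 15-problem exams are pooled in a single run.

```python
from typing import Optional, Sequence


def aime_percent_correct(
    predictions: Sequence[Optional[int]], answers: Sequence[int]
) -> float:
    """Exact-match score, as a percentage rounded to one decimal place."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    # A missing prediction (None) counts as wrong.
    correct = sum(p is not None and p == a for p, a in zip(predictions, answers))
    return round(100.0 * correct / len(answers), 1)


# Placeholder key for 30 problems (AIME I + II), NOT the real answers.
answer_key = list(range(100, 130))
predictions = list(answer_key)
predictions[4] = 999  # one wrong answer

print(aime_percent_correct(predictions, answer_key))  # 96.7
```

With only 30 problems, each one is worth about 3.3 percentage points, so small gaps between adjacent rows can come down to a single problem.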
Models like o3 and DeepSeek-R1 use extended chain-of-thought with self-verification before committing to an answer. This internal search lets them explore multiple solution paths and backtrack from errors, which is critical for multi-step problems.
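What happens inside these models is not directly observable, but a loose external analogue can be sketched: sample several independent solution attempts, discard those that fail a check, and majority-vote over the rest. This is self-consistency voting with a verifier, a different and much simpler technique than the models' internal reasoning. `sample_answer` and `verify` are hypothetical callables standing in for a model call and an answer check.

```python
from collections import Counter
from typing import Callable, Optional


def vote_over_paths(
    sample_answer: Callable[[], Optional[int]],
    verify: Callable[[int], bool],
    n_paths: int = 5,
) -> Optional[int]:
    """Sample n_paths answers, keep those that pass verify, return the mode."""
    votes: Counter = Counter()
    for _ in range(n_paths):
        answer = sample_answer()
        # Skip attempts that produced no answer or failed the check.
        if answer is not None and verify(answer):
            votes[answer] += 1
    return votes.most_common(1)[0][0] if votes else None


# Toy problem: solve 3x + 4 = 25. The attempts are canned for illustration.
attempts = iter([7, 6, 7, None, 7])
result = vote_over_paths(lambda: next(attempts), lambda x: 3 * x + 4 == 25)
print(result)  # 7
```

The verifier here is exact because the toy equation can be checked by substitution; for most competition problems no such cheap check exists, which is part of why the capability has to live inside the model.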
For frontier models, GSM8K is no longer informative: scores are saturated above 94%. It remains useful for comparing smaller models (7B-13B), where accuracy still varies widely, from roughly 60% to 90%.
AIME 2024 remains meaningful at the frontier since it requires novel combinatorial and algebraic insight that can't be memorized. HLE (Humanity's Last Exam) includes math problems even harder than AIME, with frontier models scoring below 40%.