Codesota · LLM · GSM8K & MATH
Math · updated April 2026

GSM8K & MATH.

Math reasoning performance from grade-school word problems (GSM8K) to competition-level problems (MATH). Reasoning models like o3 and DeepSeek-R1 now dominate both benchmarks.

§ 01 · GSM8K

Grade-school arithmetic, multi-step.

8,500 grade-school math word problems requiring 2-8 arithmetic steps. Chain-of-thought prompting unlocks near-perfect accuracy in frontier models. Accuracy measured on the 1,319-problem test set.

#    Model                          Provider          Accuracy   Date
1    ERNIE 5.0                      Baidu             99.7%      Apr 2026
2    GPT-5                          OpenAI            99.2%      Apr 2026
3    Gemini 2.5 Pro                 Google            99%        Mar 2026
4    o4-mini                        OpenAI            99%        Mar 2026
5    o3                             OpenAI            99%        Mar 2026
6    Claude 4                       Anthropic         98.9%      Apr 2026
7    Llama-4-Maverick               Meta              98.7%      Mar 2026
8    Claude Opus 4.5                Anthropic         98.6%      Mar 2026
9    Llama 4 Behemoth 2T            Meta              98.5%      Apr 2026
10   GPT-4.5                        OpenAI            98.2%      Apr 2026
11   Claude Opus 4                  Anthropic         98%        Mar 2026
12   o1                             OpenAI            97.8%      Apr 2026
13   o1-preview                     OpenAI            97.8%      Dec 2025
14   Claude Sonnet 4                Anthropic         97.8%      Mar 2026
15   DeepSeek R1                    DeepSeek          97.3%      Mar 2026
16   Claude 3.5 Sonnet              Anthropic         96.4%      Dec 2025
17   Qwen2.5-72B-Instruct           Alibaba           95.8%      Mar 2026
18   DeepSeek-V3                    DeepSeek          95.8%      Mar 2026
19   Claude 3.5 Sonnet              Anthropic         95%        Apr 2026
20   Claude 3 Opus                  Anthropic         95%        Apr 2026
21   Gemini Ultra                   Google DeepMind   94.4%      Apr 2026
22   Llama 3 70B                    Meta              93%        Dec 2025
23   GPT-4                          OpenAI            92%        Apr 2026
24   GPT-4o                         OpenAI            92%        Dec 2025
25   Gemini 1.5 Pro                 Google            91.7%      Dec 2025
26   Claude 3 Haiku                 Anthropic         88.9%      Apr 2026
27   Mixtral-8x22b                  Mistral           88%        Apr 2026
28   PaLM 540B (Self-Consistency)   Google            74%        Apr 2026
29   PaLM 540B (CoT)                Google            58%        Apr 2026
30   GPT-3 (base)                   OpenAI            8%         Apr 2026

Source: openai/grade-school-math · Chain-of-thought, maj@1.

§ 02 · MATH

Competition-level, AMC to AIME.

12,500 competition problems at difficulty 1-5 (AMC/AIME level), covering algebra, counting & probability, geometry, number theory, and precalculus. The MATH-500 subset (500 representative problems) is the standard evaluation split.

#    Model                           Provider       Accuracy   Date
1    o4-mini (high)                  OpenAI         98.2%      Mar 2026
2    o3 (high)                       OpenAI         98.1%      Mar 2026
3    o3-mini                         OpenAI         97.9%      Mar 2026
4    o3                              OpenAI         97.8%      Mar 2026
5    o4-mini                         OpenAI         97.5%      Mar 2026
6    DeepSeek R1                     DeepSeek       97.3%      Mar 2026
7    Gemini 2.5 Pro                  Google         97.3%      Mar 2026
8    o1                              OpenAI         96.4%      Mar 2026
9    Kimi k1.5                       Moonshot AI    96.2%      Mar 2026
10   Claude 3.7 Sonnet               Anthropic      96.2%      Mar 2026
11   DeepSeek-R1-Zero                DeepSeek       95.9%      Mar 2026
12   DeepSeek-R1-Distill-Llama-70B   DeepSeek       94.5%      Mar 2026
13   DeepSeek-R1-Distill-Qwen-32B    DeepSeek       94.3%      Mar 2026
14   DeepSeek-v3-0324                DeepSeek       94%        Mar 2026
15   Claude Opus 4.5                 Anthropic      90.7%      Mar 2026
16   QwQ-32B                         Alibaba/Qwen   90.6%      Mar 2026
17   DeepSeek-V3                     DeepSeek       90.2%      Mar 2026
18   o1-mini                         OpenAI         90%        Mar 2026
19   Llama-4-Maverick                Meta           89.4%      Mar 2026
20   Claude Opus 4                   Anthropic      89.2%      Mar 2026
21   Claude Sonnet 4                 Anthropic      88.9%      Mar 2026
22   GPT-4.5 Preview                 OpenAI         87.1%      Mar 2026
23   o1-preview                      OpenAI         85.5%      Mar 2026
24   Qwen2.5-72B-Instruct            Alibaba        83.1%      Mar 2026
25   GPT-4.1                         OpenAI         82.1%      Mar 2026
26   GPT-4o                          OpenAI         76.6%      Mar 2026
27   Grok 2                          xAI            76.1%      Mar 2026
28   Llama 3.1 405B                  Meta           73.8%      Mar 2026
29   GPT-4 Turbo                     OpenAI         73.4%      Mar 2026
30   Claude 3.5 Sonnet               Anthropic      71.1%      Mar 2026
31   GPT-4o mini                     OpenAI         70.2%      Mar 2026
32   Llama 3.1 70B                   Meta           68%        Mar 2026
33   Gemini 1.5 Pro                  Google         67.7%      Mar 2026
34   Claude 3 Opus                   Anthropic      60.1%      Mar 2026

Source: hendrycks/math · MATH-500 subset, chain-of-thought.

§ 03 · Methodology

Frequently asked.

What is GSM8K?

Grade School Math 8K — 8,500 word problems requiring 2-8 step arithmetic reasoning, created by OpenAI in 2021. Chain-of-thought prompting revealed a step-change in model capability. Now saturated at the frontier.

What makes MATH harder than GSM8K?

MATH problems require domain knowledge (e.g., modular arithmetic, geometric proofs), not just arithmetic. Difficulty 5 problems (AIME-level) stump most people. The benchmark was designed to take years to saturate, yet reasoning models like o3 are now above 95%.

Why do reasoning models dominate math benchmarks?

Reasoning models (o3, DeepSeek-R1) use extended chain-of-thought with self-verification before committing to an answer. This search-like inference process is especially effective on math, where step-by-step verification catches errors.

§ 04 · Related

Continue reading.

Math · all
All Math Benchmarks
GSM8K, MATH, AIME 2024, AMC 2023
Reasoning
Reasoning Benchmarks
GPQA Diamond, MMLU-Pro, HLE
Index
All LLM Benchmarks
Full leaderboard overview