Arithmetic Reasoning

Arithmetic reasoning — solving computation-heavy problems stated in natural language — tests whether models can reliably execute multi-step calculations. GPT-4 and Claude showed dramatic improvement over GPT-3 on benchmarks like GSM8K, but systematic errors on large-number multiplication and multi-digit division persist. Chain-of-thought prompting (Wei et al., 2022) was the breakthrough technique, and tool-augmented approaches (letting models call a calculator) essentially solve the task — making the pure reasoning version a test of memorization vs. genuine computation.
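The tool-augmented setup mentioned above can be sketched minimally: instead of computing in-text, the model emits an arithmetic expression and a calculator tool evaluates it. The `calculator` helper below is a hypothetical example, not any benchmark's official harness; it walks the expression's AST so that only arithmetic (and no arbitrary code in model output) can execute.

```python
import ast
import operator

# Arithmetic operators the tool is willing to apply.
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def calculator(expression: str) -> float:
    """Safely evaluate an arithmetic expression string from model output."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")
    return ev(ast.parse(expression, mode="eval"))

# A chain of thought might end with a tool call like:
print(calculator("1234 * 5678"))  # 7006652
```

Exactly the large-number multiplications that models get wrong in pure text become trivial once routed through such a tool.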

Datasets: 2
Results: 6
Canonical metric: accuracy

Canonical Benchmark

MAWPS

3,320 arithmetic word problems from various sources, testing basic arithmetic reasoning.

Primary metric: accuracy
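Accuracy here is plain exact-match over final answers. A minimal sketch (with a numeric tolerance to absorb float-formatting differences — an assumption on my part, not MAWPS's official scorer):

```python
def accuracy(predictions, references, tol=1e-6):
    """Fraction of predicted answers matching the reference answer.

    predictions, references: parallel lists of numeric final answers.
    tol: tolerance for float comparison (hypothetical choice, not part
    of the benchmark definition).
    """
    correct = sum(abs(p - r) <= tol for p, r in zip(predictions, references))
    return correct / len(references)

# Two of three answers correct -> accuracy 2/3.
print(accuracy([7.0, 12.0, 5.0], [7.0, 12.0, 6.0]))
```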

Top 10

Leading models on MAWPS.

Rank  Model             Accuracy (%)  Year  Source
1     gpt-4o            97.2          2025  paper
2     claude-35-sonnet  95.8          2025  paper
3     llama-3-70b       94.1          2025  paper

All datasets

2 datasets tracked for this task.

Related tasks

Other tasks in Reasoning.