Arithmetic Reasoning
Arithmetic reasoning, the task of solving computation-heavy problems stated in natural language, tests whether models can reliably execute multi-step calculations. GPT-4 and Claude showed dramatic improvement over GPT-3 on arithmetic-heavy benchmarks such as GSM8K, but systematic errors on large-number multiplication and multi-digit division persist. Chain-of-thought prompting (Wei et al., 2022), in which the model writes out intermediate steps before giving its answer, was the breakthrough technique. Tool-augmented approaches, which let the model call a calculator, essentially solve the task. That leaves the tool-free version as a test of whether a model genuinely computes or merely recalls memorized results.
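To make the calculator idea concrete, here is a minimal sketch of the kind of tool a model might call. It is an illustration under stated assumptions, not the interface of any particular system: the function name `calculate` and the set of allowed operators are hypothetical, and the sketch assumes the model emits a plain arithmetic expression as a string.

```python
import ast
import operator

# Hypothetical whitelist: the only operators this illustrative tool accepts.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def calculate(expression: str):
    """Evaluate a plain arithmetic expression exactly, without calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept numeric literals only (bool is a subclass of int, so exclude it).
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, function calls, attribute access, etc. are all rejected.
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval"))


if __name__ == "__main__":
    # A large-number multiplication of the kind the text says models get wrong.
    print(calculate("123456789 * 987654321"))
```

With a tool like this, the model only has to translate the word problem into an expression; the multiplication or division itself is done exactly by the interpreter.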
MAWPS
3,320 arithmetic word problems from various sources, testing basic arithmetic reasoning.
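As a rough illustration of how answers to problems like these could be scored, the sketch below computes numeric-match accuracy. It is an assumption about a plausible scoring scheme, not the official MAWPS evaluation; the names `answers_match` and `accuracy`, the tolerance values, and the comma-stripping rule are all hypothetical.

```python
import math


def answers_match(predicted: str, reference: float, rel_tol: float = 1e-6) -> bool:
    """Return True if the predicted answer parses to the reference number."""
    try:
        # Assumption: thousands separators such as "1,250" should be accepted.
        value = float(predicted.replace(",", "").strip())
    except ValueError:
        return False  # non-numeric output counts as wrong
    return math.isclose(value, reference, rel_tol=rel_tol, abs_tol=1e-9)


def accuracy(predictions, references) -> float:
    """Fraction of problems whose predicted answer matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(answers_match(p, r) for p, r in zip(predictions, references))
    return correct / len(references)
```

Comparing parsed numbers with a small tolerance, rather than comparing strings, avoids penalizing a model for writing "9" where the reference is 9.0.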
Top 10
Leading models on MAWPS.
All datasets
2 datasets tracked for this task.