Reasoning

Testing if your model can think logically? Benchmark math problem solving, commonsense understanding, and multi-step reasoning capabilities.

5 tasks, 19 datasets, 158 results

AI reasoning has transformed in 2025 with test-time compute scaling rivaling traditional training approaches. Leading models now solve graduate-level problems through extended thinking, but cost and instruction-following trade-offs remain critical deployment considerations.

State of the Field (2025)

OpenAI's o3 and o4-mini achieve 98-99% on the AIME math competition with tool use, while DeepSeek-R1 matches o1 performance as an open-weight, MIT-licensed model
Test-time compute scaling now rivals training-time scaling: smaller models with extended inference match or exceed larger models on complex reasoning tasks
Gemini 3 Deep Think leads on scientific and abstract reasoning (93.8% GPQA Diamond, 45.1% ARC-AGI-2), while Claude 3.5 Sonnet excels at qualitative reasoning (59.4% GPQA)
Reasoning models struggle with instruction-following and exact arithmetic despite solving conceptually sophisticated problems, reflecting a fundamental trade-off between reasoning depth and controllability

Quick Recommendations

Math Problem Solving (High Accuracy)

OpenAI o4-mini

99.5% on AIME 2025 with interpreter, best cost-performance ratio for mathematical reasoning

Graduate-Level Scientific Reasoning

Gemini 3 Deep Think

93.8% on GPQA Diamond, 41.0% on Humanity's Last Exam, excels at cross-domain scientific analysis

Coding Challenges & SWE Tasks

OpenAI o3 with thinking mode

74.9% on SWE-bench Verified, 89th percentile Codeforces, superior tool use and agentic capabilities

Qualitative Analysis & Multi-Perspective Reasoning

Claude 3.5 Sonnet

59.4% GPQA vs GPT-4o's 53.6%, 2x faster than Claude 3 Opus, excels at analytical thinking beyond pure math

Open-Weight Reasoning (Production Deployment)

DeepSeek-R1 or R1-Distill variants

MIT license, matches o1 performance; distilled variants offer competitive reasoning at a fraction of frontier costs

Multimodal Reasoning (Vision + Text)

Gemini 3 Pro

87.6% Video-MMMU, 81% MMMU-Pro, 1M token context window for complex multi-document reasoning

Agentic Planning & Decision-Making

Hybrid: o3 for planning + GPT-4o for execution

Use reasoning models for decomposition and decisions, faster models for execution. 3-4x system performance improvement

Cost-Conscious Reasoning at Scale

Qwen3-32B or QwQ-32B

32B parameters with competitive reasoning, 256K context (up to 1M), state-of-the-art among open-weight thinking models

General-Purpose Chat (NOT Complex Reasoning)

GPT-4o or Claude 3.5 Sonnet

Faster, cheaper, better instruction-following. Reasoning models are overkill for information retrieval and simple tasks
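The hybrid recommendation above (a reasoning model for planning, a faster model for execution) can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `run_hybrid`, `stub_planner`, and `stub_executor` are hypothetical names, and the stubs stand in for real model calls so the example runs without any API access.

```python
from typing import Callable, List

# Hypothetical model interface: prompt text in, response text out.
# In a real system these would wrap a reasoning model and a faster model.
Model = Callable[[str], str]


def run_hybrid(task: str, planner: Model, executor: Model) -> List[str]:
    """Ask the planner for one step per line, then run each step on the executor."""
    plan = planner(f"Break this task into short numbered steps:\n{task}")
    steps = [line.strip() for line in plan.splitlines() if line.strip()]
    return [executor(step) for step in steps]


# Stub models (assumptions for illustration only).
def stub_planner(prompt: str) -> str:
    return "1. Parse the input\n2. Compute the answer\n3. Format the result"


def stub_executor(step: str) -> str:
    return f"done: {step}"


results = run_hybrid("Summarize quarterly sales", stub_planner, stub_executor)
print(results)
```

The planner is called once per task and the executor once per step, so the expensive model's token usage stays bounded by the plan length rather than by the full workload.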

Tasks & Benchmarks

Mathematical Reasoning

Mathematical reasoning benchmarks — GSM8K, MATH, Minerva, and the competition-level AIME/AMC tests — have become the primary yardstick for frontier model intelligence. OpenAI's o1 and o3 (2024-2025) cracked problems that were previously out of reach by scaling inference-time compute with search and verification. The MATH benchmark went from ~50% (GPT-4, early 2023) to >90% (o1, late 2024) in under two years, but Olympiad-level problems (FrontierMath, Putnam) and formal theorem proving (Lean 4) remain far from solved, preserving mathematical reasoning as the clearest ladder for measuring progress.

4 datasets, 62 results, SOTA tracked
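The "search and verification" idea mentioned in the mathematical reasoning description can be illustrated with a toy generate-then-verify loop. This is only a sketch of the general pattern, not a description of how any specific model works; `first_verified`, `is_root`, and the sampled answers are hypothetical.

```python
from typing import Callable, Iterable, Optional


def first_verified(candidates: Iterable[int],
                   verify: Callable[[int], bool]) -> Optional[int]:
    """Return the first candidate the verifier accepts, or None if all fail."""
    for candidate in candidates:
        if verify(candidate):
            return candidate
    return None


# Toy problem: find an integer x with x**2 - 5*x + 6 == 0.
def is_root(x: int) -> bool:
    return x * x - 5 * x + 6 == 0


# Hypothetical sampled answers, standing in for several model attempts.
sampled_answers = [1, 4, 3, 2]
print(first_verified(sampled_answers, is_root))
```

Spending more inference compute here simply means sampling more candidates; the verifier is what turns extra samples into extra accuracy.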

Commonsense Reasoning

Commonsense reasoning, meaning answering questions that require everyday knowledge about how the physical and social world works, is measured by benchmarks like CommonsenseQA, PIQA, and HellaSwag. Large language models have largely saturated early benchmarks (HellaSwag went from 95% to near-ceiling by 2023), forcing a shift to harder tests like ARC-Challenge and WinoGrande. The uncomfortable insight is that scale alone buys enormous commonsense performance, but adversarial probing still reveals brittle failures on spatial reasoning, temporal logic, and physical intuition that humans find trivial.

5 datasets45 resultsSOTA tracked

Multi-step Reasoning

Multi-step reasoning, meaning coherent inference chains maintained across 5+ sequential steps, is the meta-capability that determines whether a model can solve complex real-world problems or only handle one-hop questions. Benchmarks like StrategyQA, MuSiQue, and BIG-Bench Hard isolate this ability, and the gap between single-step and multi-step performance remains the most pronounced weakness of current LLMs. Techniques like chain-of-thought, tree-of-thought, and iterative refinement help, but errors accumulate: if each step is independently correct 95% of the time, a 10-step chain is fully correct only about 60% of the time (0.95^10 ≈ 0.60), a fundamental scaling challenge.

4 datasets, 33 results, SOTA tracked
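The error-accumulation figure in the multi-step reasoning description follows from multiplying per-step accuracies. A small sketch, assuming errors at each step are independent (a simplification; `chain_accuracy` is a hypothetical helper name):

```python
def chain_accuracy(per_step: float, steps: int) -> float:
    """Probability that every step is correct, assuming independent errors."""
    return per_step ** steps


# How quickly a 95%-accurate step degrades as chains get longer.
for n in (1, 5, 10, 20):
    print(n, round(chain_accuracy(0.95, n), 3))
```

At 10 steps the result is about 0.599, which matches the roughly 60% figure quoted above; doubling the chain length to 20 steps drops it to about 0.358.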

Logical Reasoning

Logical reasoning — formal deduction, constraint satisfaction, and syllogistic inference — exposes a core weakness in autoregressive language models: they pattern-match rather than prove. Benchmarks like LogiQA, FOLIO, and the ReClor reading comprehension test push models toward deductive rigor, and performance improves substantially with chain-of-thought and self-consistency decoding. But systematic evaluations (2023-2024) show that even frontier models fail on problems requiring more than 3-4 reasoning steps, and neurosymbolic approaches that compile to SAT solvers or proof assistants remain more reliable for true logical correctness.

4 datasets, 12 results, SOTA tracked
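Self-consistency decoding, mentioned in the logical reasoning description, samples several reasoning chains and keeps the most common final answer. The sketch below shows only the voting step, assuming the final answers have already been extracted from each chain; `self_consistency` and the sample answers are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def self_consistency(answers: List[str]) -> Tuple[str, float]:
    """Return the most common final answer and its share of the votes."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Light normalization so "Yes" and "yes " count as the same vote.
    normalized = [a.strip().lower() for a in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)


# Hypothetical final answers extracted from five sampled reasoning chains.
samples = ["Yes", "yes", "no", "Yes ", "no"]
answer, agreement = self_consistency(samples)
print(answer, agreement)
```

The agreement share can double as a rough confidence signal: a low share suggests the chains disagree and the question may need a more reliable method, such as a solver.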

Arithmetic Reasoning

Arithmetic reasoning — solving computation-heavy problems stated in natural language — tests whether models can reliably execute multi-step calculations. GPT-4 and Claude showed dramatic improvement over GPT-3 on benchmarks like GSM8K's arithmetic subset, but systematic errors on large-number multiplication and multi-digit division persist. Chain-of-thought prompting (Wei et al., 2022) was the breakthrough technique, and tool-augmented approaches (letting models call a calculator) essentially solve the task — making the pure reasoning version a test of memorization vs. genuine computation.

2 datasets, 6 results, SOTA tracked
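The tool-augmented approach from the arithmetic reasoning description (letting a model call a calculator) needs a calculator that is safe to expose. One possible sketch, using Python's standard `ast` module instead of `eval()`; the function name `calculate` and the set of supported operators are assumptions chosen for illustration.

```python
import ast
import operator

# Operators the calculator "tool" is willing to evaluate.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def calculate(expression: str):
    """Evaluate a plain arithmetic expression without using eval()."""
    def walk(node):
        # Accept only int and float literals (booleans and strings are rejected).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval").body)


# Large-number multiplication is exact here because Python ints are arbitrary precision.
print(calculate("123456789 * 987654321"))
```

The model only has to produce the expression string; the exact computation, which is where the persistent large-number errors occur, is delegated to the tool.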

All datasets and SOTA results

Mathematical Reasoning

AIME 2024: American Invitational Mathematics Examination 2024 (2024)

96.7 (accuracy), o3

AIME 2025: American Invitational Mathematics Examination 2025 (2025)

92.7 (accuracy), o4-mini

GSM8K: Grade School Math 8K (2021)

99 (accuracy), o4-mini

MATH: Mathematics Aptitude Test of Heuristics (2021)

98.2 (accuracy), o4-mini (high)

Commonsense Reasoning

ARC-Challenge: AI2 Reasoning Challenge (2018)

98.1 (accuracy), o3

CommonsenseQA (2019)

85.4 (accuracy), GPT-4o

HellaSwag (2019)

95.3 (accuracy), GPT-4o

MMLU: Massive Multitask Language Understanding (2021)

92.9 (accuracy), o3

WinoGrande (2019)

87.5 (accuracy), GPT-4o

Multi-step Reasoning

BIG-Bench Hard (BBH) (2022)

93.1 (accuracy), Claude 3.5 Sonnet

GPQA: Graduate-Level Google-Proof Q&A (2024)

84 (accuracy), Gemini 2.5 Pro

HotpotQA (2018)

71.3 (F1), GPT-4o

StrategyQA (2021)

82.1 (accuracy), GPT-4o

Logical Reasoning

ARC-AGI-1: Abstraction and Reasoning Corpus for AGI, v1 (2024)

87.5 (accuracy), o3

ARC-AGI-2: Abstraction and Reasoning Corpus for AGI, v2 (2025)

5 (accuracy), Gemini 2.5 Pro

LogiQA (2020)

56.3 (accuracy), GPT-4o

ReClor: Reading Comprehension Dataset Requiring Logical Reasoning (2020)

72.4 (accuracy), GPT-4o

Arithmetic Reasoning

MAWPS: Math Word Problem Repository (2016)

97.2 (accuracy), GPT-4o

SVAMP: Simple Variations on Arithmetic Math Word Problems (2021)

93.7 (accuracy), GPT-4o

Honest Takes

Don't Default to Reasoning Models

For most tasks - customer service, content generation, classification - standard LLMs like GPT-4o or Claude 3.5 Sonnet remain superior. Reasoning models waste compute on simple tasks and cost 3-10x more due to token consumption. Reserve them for genuinely complex multi-step problems.

Instruction-Following Degrades with Reasoning

Analysis of 23 reasoning models reveals widespread inability to follow user constraints, especially on harder problems. Models trained with extended CoT sacrifice controllability for reasoning depth. If your app requires strict compliance with specifications, standard models may outperform reasoning models.

Open-Weight Can Be More Expensive

DeepSeek-R1 and Qwen3 generate 1.5-4x more tokens than closed models for equivalent reasoning. Lower per-token pricing doesn't always mean lower total cost. Benchmark on your actual workload before assuming open-weight saves money.
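The point that lower per-token pricing does not guarantee lower total cost can be checked with simple arithmetic. In the sketch below every number is a made-up placeholder, not real pricing: `closed_price`, `open_price`, and `closed_tokens` are assumptions, and only the 1.5x and 4x token multipliers come from the range quoted above.

```python
def cost_per_query(output_tokens: int, price_per_million: float) -> float:
    """Cost of one response in dollars, given a price per million output tokens."""
    return output_tokens * price_per_million / 1_000_000


def break_even_multiplier(closed_price: float, open_price: float) -> float:
    """Token multiplier at which the cheaper-per-token model costs the same."""
    return closed_price / open_price


# Hypothetical figures for illustration only.
closed_tokens = 2_000   # tokens a closed model uses per answer
closed_price = 8.00     # dollars per million output tokens
open_price = 3.00       # dollars per million output tokens

closed_cost = cost_per_query(closed_tokens, closed_price)
for multiplier in (1.5, 4.0):
    open_tokens = int(closed_tokens * multiplier)
    open_cost = cost_per_query(open_tokens, open_price)
    cheaper = "open-weight" if open_cost < closed_cost else "closed"
    print(f"{multiplier}x tokens: open ${open_cost:.3f} vs closed ${closed_cost:.3f} -> {cheaper} is cheaper")

print("break-even multiplier:", round(break_even_multiplier(closed_price, open_price), 2))
```

With these placeholder prices the open-weight model wins at 1.5x tokens and loses at 4x, which is why measuring token counts on your own workload matters more than comparing price sheets.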

Benchmark Saturation is Real

GPQA Diamond approaches saturation at 90%+ accuracy. AIME questions show data contamination risk: models perform better on 2024 questions than on 2025 questions. Internal evaluation on private, domain-specific problems matters more than public benchmark scores.

Latent Reasoning is the Next Frontier

Current reasoning models burn tokens generating natural language traces. The future is latent reasoning - internal compressed representations that preserve benefits without token overhead. This could fundamentally alter reasoning model economics in 2025-2026.
