Reasoning
Testing if your model can think logically? Benchmark math problem solving, commonsense understanding, and multi-step reasoning capabilities.
AI reasoning shifted in 2025 as test-time compute scaling began to rival traditional training-time scaling. Leading models now solve graduate-level problems through extended thinking, but cost and instruction-following trade-offs remain critical deployment considerations.
State of the Field (2025)
- OpenAI's o3 and o4-mini achieve 98-99% on the AIME math competition, while DeepSeek-R1 matches o1 performance as an open-source, MIT-licensed model
- Test-time compute scaling now rivals training-time scaling: smaller models with extended inference match or exceed larger models on complex reasoning tasks
- Gemini 3 (Deep Think mode) leads on multimodal reasoning (93.8% GPQA Diamond, 45.1% ARC-AGI-2); Claude 3.5 Sonnet excels at qualitative reasoning (59.4% GPQA)
- Reasoning models struggle with instruction-following and exact arithmetic despite solving conceptually sophisticated problems: a fundamental trade-off between reasoning depth and controllability
Quick Recommendations
Math Problem Solving (High Accuracy)
OpenAI o4-mini
99.5% on AIME 2025 with a Python interpreter, best cost-performance ratio for mathematical reasoning
Graduate-Level Scientific Reasoning
Gemini 3 Deep Think
93.8% on GPQA Diamond, 41.0% on Humanity's Last Exam, excels at cross-domain scientific analysis
Coding Challenges & SWE Tasks
OpenAI o3 with thinking mode
74.9% on SWE-bench Verified, 89th percentile Codeforces, superior tool use and agentic capabilities
Qualitative Analysis & Multi-Perspective Reasoning
Claude 3.5 Sonnet
59.4% GPQA vs GPT-4o's 53.6%, 2x faster than Claude 3 Opus, excels at analytical thinking beyond pure math
Open-Weight Reasoning (Production Deployment)
DeepSeek-R1 or R1-Distill variants
MIT license, matches o1 performance; distilled variants offer competitive reasoning at a fraction of frontier costs
Multimodal Reasoning (Vision + Text)
Gemini 3 Pro
87.6% Video-MMMU, 81% MMMU-Pro, 1M token context window for complex multi-document reasoning
Agentic Planning & Decision-Making
Hybrid: o3 for planning + GPT-4o for execution
Use reasoning models for decomposition and decisions, and faster models for execution. Reported 3-4x improvement in overall system performance
Cost-Conscious Reasoning at Scale
Qwen3-32B or QwQ-32B
32B parameters with competitive reasoning, 256K context (up to 1M), state-of-the-art among open-weight thinking models
General-Purpose Chat (NOT Complex Reasoning)
GPT-4o or Claude 3.5 Sonnet
Faster, cheaper, better instruction-following. Reasoning models are overkill for information retrieval and simple tasks
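The hybrid pattern recommended above for agentic planning (a reasoning model decomposes the task, a faster model carries out each step) can be sketched roughly as follows. This is only an illustration: `run_hybrid`, `stub_planner`, and `stub_executor` are hypothetical names, and the two callables stand in for whatever model clients you actually use; no real vendor API is shown.

```python
from typing import Callable, List


def run_hybrid(
    task: str,
    plan: Callable[[str], List[str]],
    execute: Callable[[str], str],
) -> List[str]:
    """Plan once with the (slow, expensive) reasoning model, then run each
    step with the (fast, cheap) execution model."""
    steps = plan(task)  # one call to the reasoning model
    return [execute(step) for step in steps]  # one fast call per step


# Hypothetical stand-ins so the sketch runs without any model access.
def stub_planner(task: str) -> List[str]:
    return [f"{task}: step {i}" for i in (1, 2, 3)]


def stub_executor(step: str) -> str:
    return f"done({step})"


if __name__ == "__main__":
    for result in run_hybrid("summarize report", stub_planner, stub_executor):
        print(result)
```

Keeping the planner and executor as plain callables makes it easy to swap models and to measure how many tokens each role consumes on your own workload.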
Tasks & Benchmarks
Mathematical Reasoning
Mathematical reasoning benchmarks — GSM8K, MATH, Minerva, and the competition-level AIME/AMC tests — have become the primary yardstick for frontier model intelligence. OpenAI's o1 and o3 (2024-2025) cracked problems that were previously out of reach by scaling inference-time compute with search and verification. The MATH benchmark went from ~50% (GPT-4, early 2023) to >90% (o1, late 2024) in under two years, but Olympiad-level problems (FrontierMath, Putnam) and formal theorem proving (Lean 4) remain far from solved, preserving mathematical reasoning as the clearest ladder for measuring progress.
Commonsense Reasoning
Commonsense reasoning (answering questions that require everyday knowledge about how the physical and social world works) is measured by benchmarks like CommonsenseQA, PIQA, and HellaSwag. Large language models have largely saturated the early benchmarks (HellaSwag scores reached roughly 95%, close to ceiling, by 2023), forcing a shift to harder tests like ARC-Challenge and Winoground. The uncomfortable insight is that scale alone buys enormous commonsense performance, yet adversarial probing still reveals brittle failures on spatial reasoning, temporal logic, and physical intuition that humans find trivial.
Multi-step Reasoning
Multi-step reasoning (maintaining coherent inference chains across 5+ sequential steps) is the meta-capability that determines whether a model can solve complex real-world problems or only handle one-hop questions. Benchmarks like StrategyQA, MuSiQue, and BIG-Bench Hard isolate this ability, and the drop in accuracy from single-step to multi-step tasks remains one of the clearest failure modes of current LLMs. Techniques like chain-of-thought, tree-of-thought, and iterative refinement help, but errors compound across steps: if each step is independently correct 95% of the time, a 10-step chain is fully correct only about 60% of the time (0.95^10 ≈ 0.60), a fundamental scaling challenge.
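The compounding-error figure above is easy to check. The sketch below assumes every step succeeds independently with the same probability, which is a simplification (real reasoning errors are often correlated); the function names are illustrative only.

```python
def chain_accuracy(per_step: float, steps: int) -> float:
    """Probability that all `steps` are correct, assuming independent errors."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step accuracy needed for the whole chain to reach `target`."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return target ** (1.0 / steps)


if __name__ == "__main__":
    for steps in (1, 5, 10, 20):
        acc = chain_accuracy(0.95, steps)
        print(f"{steps:>2} steps at 95% per step: {acc:.1%}")
    # Inverse question: how reliable must each step be for a 90% 10-step chain?
    print(f"needed per step: {required_per_step(0.90, 10):.2%}")
```

At 95% per step, the 10-step chain comes out just under 60%, and doubling the chain to 20 steps drops it to roughly 36%, which is why small per-step gains matter so much for long chains.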
Logical Reasoning
Logical reasoning — formal deduction, constraint satisfaction, and syllogistic inference — exposes a core weakness in autoregressive language models: they pattern-match rather than prove. Benchmarks like LogiQA, FOLIO, and the ReClor reading comprehension test push models toward deductive rigor, and performance improves substantially with chain-of-thought and self-consistency decoding. But systematic evaluations (2023-2024) show that even frontier models fail on problems requiring more than 3-4 reasoning steps, and neurosymbolic approaches that compile to SAT solvers or proof assistants remain more reliable for true logical correctness.
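Self-consistency decoding, mentioned above, amounts to sampling several independent reasoning chains and keeping the most common final answer. A minimal sketch, assuming a hypothetical `sample_answer` callable that returns one final answer per call (in practice it would wrap a sampled model completion plus answer extraction):

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def self_consistent_answer(
    sample_answer: Callable[[], str], n_samples: int = 5
) -> Tuple[str, float]:
    """Return the majority answer and the fraction of samples that agree."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    # Normalize lightly so "Yes" and "yes " count as the same vote.
    answers = [sample_answer().strip().lower() for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


def make_stub_sampler(canned: Iterable[str]) -> Callable[[], str]:
    """Hypothetical stand-in for a model: replays canned answers in order."""
    iterator = iter(canned)
    return lambda: next(iterator)


if __name__ == "__main__":
    sampler = make_stub_sampler(["Yes", "no", "yes ", "yes", "no"])
    print(self_consistent_answer(sampler, n_samples=5))
```

The agreement fraction is a rough confidence signal: a low value suggests the chains disagree and the problem may be better handed to a solver or proof assistant.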
Arithmetic Reasoning
Arithmetic reasoning — solving computation-heavy problems stated in natural language — tests whether models can reliably execute multi-step calculations. GPT-4 and Claude showed dramatic improvement over GPT-3 on benchmarks like GSM8K's arithmetic subset, but systematic errors on large-number multiplication and multi-digit division persist. Chain-of-thought prompting (Wei et al., 2022) was the breakthrough technique, and tool-augmented approaches (letting models call a calculator) essentially solve the task — making the pure reasoning version a test of memorization vs. genuine computation.
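To make the tool-augmented idea concrete, here is one possible shape for a calculator tool: the model emits an arithmetic expression as text and the host evaluates it exactly. This is a sketch, not any vendor's tool API; `calculate` is a hypothetical name, and it uses Python's standard `ast` module to allow only numeric literals and basic operators.

```python
import ast
import operator

# Whitelisted operators; anything else (names, calls, attributes) is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def calculate(expression: str):
    """Evaluate a plain arithmetic expression without using eval()."""

    def _eval(node):
        if isinstance(node, ast.Constant):
            value = node.value
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                return value
        elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](_eval(node.left), _eval(node.right))
        elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


if __name__ == "__main__":
    # Large-number multiplication is exact here because Python ints are
    # arbitrary precision.
    print(calculate("123456789 * 987654321"))
    print(calculate("(1250 - 375) / 7"))
```

Exponentiation is deliberately left out of the whitelist so that an expression like `9**9**9` cannot be used to exhaust memory.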
Honest Takes
Don't Default to Reasoning Models
For most tasks (customer service, content generation, classification), standard LLMs like GPT-4o or Claude 3.5 Sonnet remain the better choice. Reasoning models waste compute on simple tasks and cost 3-10x more because of their higher token consumption. Reserve them for genuinely complex multi-step problems.
Instruction-Following Degrades with Reasoning
An analysis of 23 reasoning models reveals a widespread inability to follow user constraints, especially on harder problems. Models trained with extended CoT sacrifice controllability for reasoning depth. If your app requires strict compliance with specifications, standard models may outperform reasoning models.
Open-Weight Can Be More Expensive
DeepSeek-R1 and Qwen3 generate 1.5-4x more tokens than closed models for equivalent reasoning. Lower per-token pricing doesn't always mean lower total cost. Benchmark on your actual workload before assuming open-weight saves money.
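The point is simple arithmetic: cost per task is tokens generated times price per token, so a cheaper model that writes several times more tokens can still cost more. In the sketch below every number is a made-up placeholder, not a real price or measured token count; substitute figures from your own workload.

```python
def cost_per_task(output_tokens: int, usd_per_million_tokens: float) -> float:
    """Output-token cost of one task in USD (input tokens ignored for brevity)."""
    if output_tokens < 0 or usd_per_million_tokens < 0:
        raise ValueError("tokens and price must be non-negative")
    return output_tokens * usd_per_million_tokens / 1_000_000


def break_even_price(closed_price: float, token_multiplier: float) -> float:
    """Highest open-weight price per million tokens that still matches the
    closed model's cost, given how many times more tokens it generates."""
    if token_multiplier <= 0:
        raise ValueError("token_multiplier must be positive")
    return closed_price / token_multiplier


if __name__ == "__main__":
    # Hypothetical inputs: the closed model uses 2,000 tokens at $8.00/M;
    # the open-weight model uses 4x the tokens at $2.50/M.
    closed = cost_per_task(2_000, 8.00)
    open_weight = cost_per_task(2_000 * 4, 2.50)
    print(f"closed:      ${closed:.4f} per task")
    print(f"open-weight: ${open_weight:.4f} per task")
    print(f"break-even open-weight price: ${break_even_price(8.00, 4):.2f}/M")
```

With these placeholder numbers the open-weight model is the more expensive option per task, despite a per-token price less than a third of the closed model's.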
Benchmark Saturation is Real
GPQA Diamond is approaching saturation at 90%+ accuracy. AIME questions show signs of data contamination: models perform better on the 2024 questions than on the 2025 ones. Internal evaluation on private, domain-specific problems matters more than public benchmark scores.
Latent Reasoning is the Next Frontier
Current reasoning models burn tokens generating natural-language traces. The future is latent reasoning: internal compressed representations that preserve the benefits without the token overhead. This could fundamentally alter reasoning model economics in 2025-2026.