Multi-step reasoning, the ability to maintain a coherent inference chain across five or more sequential steps, is the meta-capability that determines whether a model can solve complex real-world problems or only answer one-hop questions. Benchmarks such as StrategyQA, MuSiQue, and BIG-Bench Hard isolate this ability, and the performance gap between single-step and multi-step tasks remains the most pronounced failure mode of current LLMs. Techniques such as chain-of-thought, tree-of-thought, and iterative refinement help, but errors accumulate across steps: if each step is independently correct 95% of the time, a 10-step chain is correct only about 60% of the time (0.95^10 ≈ 0.60). This is a fundamental scaling challenge.
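The 60% figure can be reproduced with a few lines of arithmetic. The sketch below assumes every step succeeds independently with the same probability, which is a simplification: real reasoning chains can have correlated errors or recover from earlier mistakes. The function names `chain_accuracy` and `required_per_step` are illustrative and do not come from any benchmark toolkit.

```python
def chain_accuracy(per_step: float, steps: int) -> float:
    """Probability that all `steps` are correct, assuming independent steps."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step accuracy needed to reach `target` end-to-end accuracy."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    if steps <= 0:
        raise ValueError("steps must be positive")
    # Invert p ** steps = target, giving p = target ** (1 / steps).
    return target ** (1.0 / steps)


if __name__ == "__main__":
    # How end-to-end accuracy decays as the chain gets longer.
    for n in (1, 5, 10, 20):
        print(f"{n:>2} steps at 95% per step -> {chain_accuracy(0.95, n):.3f}")

    # Per-step accuracy needed to keep a 10-step chain at 90% overall.
    print(f"needed per-step accuracy: {required_per_step(0.90, 10):.4f}")
```

Under this independence assumption, keeping a 10-step chain at 90% end-to-end accuracy requires roughly 99% accuracy at every step, which is why small per-step gains matter so much on long chains.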
GPQA is a graduate-level science QA benchmark designed to be difficult for non-experts and resistant to simple web lookup. GPQA Diamond is the split most commonly reported for frontier models.
Leading models on GPQA Diamond.
| # | Model | Accuracy (%) | Year | Source |
|---|---|---|---|---|
| ★ | Gemini 3 Pro | 91.9 | 2026 | paper ↗ |
| 2 | Claude Opus 4.6 | 91.3 | 2026 | paper ↗ |
| 3 | Kimi K2.6 | 90.5 | 2026 | paper ↗ |
| 4 | Gemini 3 Flash | 90.4 | 2026 | paper ↗ |
| 5 | DeepSeek-V4-Pro Max | 90.1 | 2026 | paper ↗ |
| 6 | Claude Sonnet 4.6 | 89.9 | 2026 | paper ↗ |
| 7 | GPT-5 | 89.0 | 2026 | paper ↗ |
| 8 | Qwen3.5-397B-A17B | 88.4 | 2026 | paper ↗ |
| 9 | DeepSeek-V4-Flash Max | 88.1 | 2026 | paper ↗ |
| 10 | Grok 4 | 88.0 | 2026 | paper ↗ |
4 datasets tracked for this task.