Multi-step Reasoning.

Multi-step reasoning, meaning the ability to maintain a coherent inference chain across five or more sequential steps, is the meta-capability that determines whether a model can solve complex real-world problems or only answer one-hop questions. Benchmarks such as StrategyQA, MuSiQue, and BIG-Bench Hard isolate this ability, and the drop in performance from single-step to multi-step tasks remains the most pronounced weakness of current LLMs. Techniques such as chain-of-thought, tree-of-thought, and iterative refinement help, but errors compound: if each step is independently correct 95% of the time, a 10-step chain is correct only about 60% of the time (0.95^10 ≈ 0.599), which is a fundamental scaling challenge.
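The compounding arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes every step succeeds independently with the same probability, which is a simplification: real reasoning chains can have correlated errors or recover from earlier mistakes. The function name chain_accuracy is hypothetical and chosen only for illustration.

```python
def chain_accuracy(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that all steps of a chain are correct.

    Assumes each step is correct independently with the same
    probability (an idealization, not a property of any real model).
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


if __name__ == "__main__":
    # Show how quickly end-to-end accuracy decays at 95% per step.
    for n in (1, 5, 10, 20):
        print(f"{n:>2} steps: {chain_accuracy(0.95, n):.1%}")
    # 10 steps comes out at about 59.9%, matching the ~60% figure above.
```

Under the same assumption, raising per-step accuracy to 99% gives roughly 90% over 10 steps, which is why small per-step gains matter so much for long chains.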

Datasets

123

Results

Canonical metric: accuracy

§ 02 · Canonical benchmark

The reference dataset.

GPQA Diamond

A graduate-level science QA benchmark designed to be difficult for non-experts and resistant to simple web lookup. GPQA Diamond is the split most commonly reported for frontier models.

Primary metric: accuracy
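As a rough illustration of the metric, accuracy is the share of questions answered correctly, reported here as a percentage. The sketch below assumes answers are compared by exact match on a choice label; the function name accuracy_percent and the sample answers are made up for the example and are not taken from the benchmark or from any model's output.

```python
from typing import Sequence


def accuracy_percent(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Percentage of predictions that exactly match the gold answers."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


if __name__ == "__main__":
    # Hypothetical answers for four questions (illustrative only).
    preds = ["A", "C", "B", "D"]
    gold = ["A", "B", "B", "D"]
    print(accuracy_percent(preds, gold))  # 3 of 4 correct -> 75.0
```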

§ 03 · Top 10

Leading models.

Leading models on GPQA Diamond.

#	Model	Accuracy (%)	Year	Source
1	Gemini 3 Pro	91.9	2026	paper ↗
2	Claude Opus 4.6	91.3	2026	paper ↗
3	Kimi K2.6	90.5	2026	paper ↗
4	Gemini 3 Flash	90.4	2026	paper ↗
5	DeepSeek-V4-Pro Max	90.1	2026	paper ↗
6	Claude Sonnet 4.6	89.9	2026	paper ↗
7	GPT-5	89.0	2026	paper ↗
8	Qwen3.5-397B-A17B	88.4	2026	paper ↗
9	DeepSeek-V4-Flash Max	88.1	2026	paper ↗
10	Grok 4	88.0	2026	paper ↗