Codesota · Tasks · Multi-step Reasoning

Multi-step Reasoning.

Multi-step reasoning (maintaining a coherent inference chain across five or more sequential steps) is the meta-capability that determines whether a model can solve complex real-world problems or only answer one-hop questions. Benchmarks such as StrategyQA, MuSiQue, and BIG-Bench Hard isolate this ability, and the performance gap between single-step and multi-step tasks remains the most prominent failure mode of current LLMs. Techniques such as chain-of-thought, tree-of-thought, and iterative refinement help, but errors compound across steps: if each step is independently correct 95% of the time, a 10-step chain is correct only about 60% of the time (0.95^10 ≈ 0.60). This is a fundamental scaling challenge.
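The compounding effect can be checked with a few lines of Python. This is a minimal sketch that assumes every step succeeds independently with the same probability, which is a simplification (real reasoning errors are often correlated, and models can sometimes recover from a mistake). The function names `chain_accuracy` and `required_per_step` are illustrative and not part of any benchmark tooling.

```python
def chain_accuracy(per_step: float, steps: int) -> float:
    """Probability that all `steps` are correct, assuming independent errors."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step accuracy needed to reach `target` accuracy over the whole chain."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    if steps <= 0:
        raise ValueError("steps must be positive")
    return target ** (1.0 / steps)


if __name__ == "__main__":
    # How end-to-end accuracy decays as the chain gets longer.
    for n in (1, 5, 10, 20):
        print(f"{n:>2} steps at 95% per step: {chain_accuracy(0.95, n):.3f}")

    # Inverse question: what per-step accuracy gives 90% over 10 steps?
    print(f"per-step accuracy for 90% over 10 steps: {required_per_step(0.90, 10):.4f}")
```

Under the same independence assumption, the inverse calculation shows why long chains are demanding: reaching 90% end-to-end accuracy over 10 steps requires each step to be correct roughly 99% of the time.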

4 datasets · 123 results · Canonical metric: accuracy

§ 02 · Canonical benchmark

The reference dataset.

GPQA Diamond

A graduate-level science QA benchmark designed to be difficult for non-experts and resistant to simple web lookup. GPQA Diamond is the split most commonly reported for frontier models.

Primary metric: accuracy
§ 03 · Top 10

Leading models.

Leading models on GPQA Diamond.

#   Model                   Accuracy  Year  Source
1   Gemini 3 Pro            91.9      2026  paper
2   Claude Opus 4.6         91.3      2026  paper
3   Kimi K2.6               90.5      2026  paper
4   Gemini 3 Flash          90.4      2026  paper
5   DeepSeek-V4-Pro Max     90.1      2026  paper
6   Claude Sonnet 4.6       89.9      2026  paper
7   GPT-5                   89.0      2026  paper
8   Qwen3.5-397B-A17B       88.4      2026  paper
9   DeepSeek-V4-Flash Max   88.1      2026  paper
10  Grok 4                  88.0      2026  paper

§ 04 · All datasets

Tracked datasets.

4 datasets tracked for this task.

Dataset                   Results  Metric    Top model          Top score
GPQA Diamond (canonical)  74       accuracy  Gemini 3 Pro       91.9
HLE                       36       accuracy  Kimi K2.6          54.0
BIG-Bench Hard            11       accuracy  Claude 3.5 Sonnet  93.1
StrategyQA                2        accuracy  GPT-4o             82.1

§ 05 · Related tasks

Other tasks in Reasoning.

Arithmetic Reasoning · Commonsense Reasoning · Logical Reasoning · Mathematical Reasoning