Codesota · Tasks · Commonsense Reasoning

Commonsense Reasoning.

Commonsense reasoning, meaning answering questions that require everyday knowledge about how the physical and social world works, is measured by benchmarks like CommonsenseQA, PIQA, and HellaSwag. Large language models have largely saturated the early benchmarks (top HellaSwag scores had reached about 95%, close to ceiling, by 2023), forcing a shift to harder tests like ARC-Challenge and WinoGrande. The uncomfortable insight is that scale alone buys enormous commonsense performance, yet adversarial probing still reveals brittle failures on spatial reasoning, temporal logic, and physical intuition that humans find trivial.

6 datasets · 182 results · Canonical metric: accuracy

§ 02 · Canonical benchmark

The reference dataset.

MMLU

Broad multi-task language-understanding benchmark with 57 subjects spanning STEM, humanities, social sciences, and professional knowledge. Original 4-choice MCQ format; now saturated enough that top-frontier deltas should be read as a cluster rather than a strict ranking.
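
A rough sketch of why small gaps near the top are better read as a cluster than as a ranking. It assumes the MMLU test split holds about 14,042 questions and treats each question as an independent trial; neither assumption comes from this page, and the function names are illustrative only.

```python
import math

# Assumption: approximate size of the MMLU test split (not stated on this page).
N_QUESTIONS = 14_042


def accuracy_std_error(acc_percent: float, n: int = N_QUESTIONS) -> float:
    """Binomial standard error of an accuracy score, in percentage points."""
    p = acc_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


def gap_is_within_noise(acc_a: float, acc_b: float,
                        n: int = N_QUESTIONS, z: float = 1.96) -> bool:
    """True if the gap between two scores is smaller than roughly 95% sampling noise.

    Simplification: the two scores are treated as independent samples,
    even though both models answer the same questions.
    """
    se_gap = math.sqrt(accuracy_std_error(acc_a, n) ** 2 +
                       accuracy_std_error(acc_b, n) ** 2)
    return abs(acc_a - acc_b) < z * se_gap


# Scores taken from the leaderboard on this page.
print(gap_is_within_noise(92.9, 92.4))  # gap of 0.5 points
print(gap_is_within_noise(92.9, 90.8))  # gap of 2.1 points
```

Under these assumptions a half-point gap falls inside the noise band, while a gap of about two points does not. Differences in prompting and evaluation setup between sources are not modeled here and would widen the band further.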

Primary metric: accuracy
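
For reference, a minimal sketch of how accuracy on a 4-choice multiple-choice set can be computed. The answer letters below are hypothetical toy data, and the `accuracy` helper is an illustration rather than the scoring code behind this leaderboard.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Percentage of items where the predicted choice matches the gold answer."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one item")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


# Hypothetical toy data: answer letters for five 4-choice questions.
gold = ["A", "C", "B", "D", "A"]
preds = ["A", "C", "D", "D", "B"]
print(accuracy(preds, gold))  # 3 of 5 correct
```

With four options per question, uniform random guessing has an expected accuracy of 25%, which is the floor these scores should be read against.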
§ 03 · Top 10

Leading models.

Leading models on MMLU.

#    Model             Accuracy   Year   Source
1    o3                92.9       2026   paper
2    GPT-5.2           92.4       2026   paper
3    o1                91.8       2026   paper
4    Claude Opus 4.5   91.8       2026   paper
5    Claude Opus 4.5   91.6       2026   paper
6    Gemini 3 Pro      91.4       2026   paper
7    Claude Opus 4.6   91.2       2026   paper
8    GPT-4.5 Preview   90.8       2026   paper
9    GPT-5             90.8       2026   paper
10   o1-preview        90.8       2026   paper

§ 04 · All datasets

Tracked datasets.

6 datasets tracked for this task.

Dataset            Results   Metric     Top model           Top score
MMLU (canonical)   64        accuracy   o3                  92.9
MMLU-Pro           73        accuracy   Qwen3.5-397B-A17B   87.8
HellaSwag          17        accuracy   GPT-4o              95.3
WinoGrande         13        accuracy   GPT-4o              87.5
ARC-Challenge      10        accuracy   o3                  98.1
CommonsenseQA      5         accuracy   GPT-4o              85.4

§ 05 · Related tasks

Other tasks in Reasoning.

Arithmetic Reasoning · Logical Reasoning · Mathematical Reasoning · Multi-step Reasoning