Codesota · Tasks · Logical Reasoning

Logical Reasoning.

Logical reasoning — formal deduction, constraint satisfaction, and syllogistic inference — exposes a core weakness in autoregressive language models: they pattern-match rather than prove. Benchmarks like LogiQA, FOLIO, and the ReClor reading comprehension test push models toward deductive rigor, and performance improves substantially with chain-of-thought and self-consistency decoding. But systematic evaluations (2023-2024) show that even frontier models fail on problems requiring more than 3-4 reasoning steps, and neurosymbolic approaches that compile to SAT solvers or proof assistants remain more reliable for true logical correctness.
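To make the contrast between pattern-matching and proving concrete, here is a minimal sketch of what "checking logical correctness" means for a small propositional argument. It is a toy stand-in for a SAT solver that enumerates every truth assignment, not the method used by any benchmark or system named above. The helper names (`entails`, `implies`) and the example variables are hypothetical and chosen only for illustration.

```python
from itertools import product
from typing import Callable, Dict, Sequence

Assignment = Dict[str, bool]
Formula = Callable[[Assignment], bool]


def implies(a: bool, b: bool) -> bool:
    """Material implication: a -> b."""
    return (not a) or b


def entails(premises: Sequence[Formula], conclusion: Formula,
            variables: Sequence[str]) -> bool:
    """Return True if every assignment that satisfies all premises
    also satisfies the conclusion (brute-force truth-table check)."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # counterexample: premises hold, conclusion fails
    return True


# Hypothetical three-step argument: rain -> wet, wet -> slippery, rain.
variables = ["rain", "wet", "slippery"]
premises = [
    lambda e: implies(e["rain"], e["wet"]),
    lambda e: implies(e["wet"], e["slippery"]),
    lambda e: e["rain"],
]

# Valid: the chain of implications plus "rain" forces "slippery".
valid = entails(premises, lambda e: e["slippery"], variables)

# Invalid: without the premise "rain", "slippery" no longer follows.
invalid = entails(premises[:2], lambda e: e["slippery"], variables)

print(valid, invalid)  # True False
```

The enumeration is exponential in the number of variables, which is why practical neurosymbolic pipelines hand the compiled formula to a dedicated solver instead. The point of the sketch is only that the verdict comes from exhaustive checking, so it does not degrade as the number of reasoning steps grows.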

4 datasets · 12 results · Canonical metric: accuracy

§ 02 · Canonical benchmark

The reference dataset.

LogiQA

8,678 logical reasoning questions drawn from the National Civil Servants Examination of China.

Primary metric: accuracy

§ 03 · Top 10

Leading models.

Leading models on LogiQA.

#	Model	Accuracy (%)	Year	Source
1	GPT-4o	56.3	2025	paper
2	Claude 3.5 Sonnet	53.8	2025	paper

§ 04 · All datasets

Tracked datasets.

4 datasets tracked for this task.

2 results · accuracy · Top: GPT-4o (56.3)

5 results · accuracy · Top: o3 (87.5)

3 results · accuracy · Top: Gemini 2.5 Pro (5.00)

2 results · accuracy · Top: GPT-4o (72.4)

§ 05 · Related tasks

Other tasks in Reasoning.

Arithmetic Reasoning · Commonsense Reasoning · Mathematical Reasoning · Multi-step Reasoning
