Reasoning

Can your model reason logically? These benchmarks measure mathematical problem solving, commonsense understanding, logical deduction, and multi-step reasoning.

5 tasks, 15 datasets, 51 results

Mathematical Reasoning

Solving math word problems and competition mathematics (GSM8K, MATH, AIME 2024).

3 datasets, 13 results
GSM8K: Grade School Math 8K (2021)
SOTA: 97.8 (accuracy)
o1-preview

8,500 grade-school math word problems requiring multi-step reasoning. One of the most widely used math reasoning benchmarks.
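Accuracy on GSM8K is usually an exact match on the final numeric answer. The sketch below assumes reference solutions end with a `#### <number>` marker, as in the public dataset, and takes the last number in the model output as its prediction. The function names are illustrative, not part of any official harness, and real evaluation scripts may normalize answers differently.

```python
import re
from typing import Optional

def extract_final_number(text: str) -> Optional[str]:
    """Return the last number found in text, with commas and trailing zeros removed."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    number = matches[-1].replace(",", "")
    if "." in number:
        number = number.rstrip("0").rstrip(".")
    return number

def gsm8k_accuracy(predictions, references):
    """Fraction of predictions whose final number equals the reference answer."""
    correct = 0
    for prediction, reference in zip(predictions, references):
        # Assumption: the gold answer follows the "####" marker.
        gold = extract_final_number(reference.split("####")[-1])
        if gold is not None and extract_final_number(prediction) == gold:
            correct += 1
    return correct / len(references)

if __name__ == "__main__":
    preds = ["So she earns $18 per day.", "The total is 1,250.", "I think it is 7"]
    refs = ["16 - 3 - 4 = 9 ... #### 18", "work shown here #### 1250", "#### 8"]
    print(gsm8k_accuracy(preds, refs))
```

Taking the last number is a common heuristic, but it can misfire when a model keeps talking after stating its answer, so stricter harnesses ask for a fixed answer format.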

MATH: Mathematics Aptitude Test of Heuristics (2021)
SOTA: 94.8 (accuracy)
o1-preview

12,500 competition mathematics problems from AMC, AIME, and other sources. Harder than GSM8K.

AIME 2024: American Invitational Mathematics Examination 2024 (2024)
SOTA: 83.3 (accuracy)
o1-preview

30 challenging math problems from the 2024 AIME competition. Tests advanced mathematical reasoning.
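With only 30 problems, each one is worth about 3.3 percentage points, so scores move in coarse steps. A small sketch, assuming the score is a plain fraction of the 30 problems (the listing does not say how it was computed, for example whether several samples were aggregated):

```python
NUM_PROBLEMS = 30  # AIME I and AIME II, 15 problems each

def accuracy_percent(solved: int, total: int = NUM_PROBLEMS) -> float:
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * solved / total, 1)

if __name__ == "__main__":
    for solved in (24, 25, 26):
        print(solved, accuracy_percent(solved))
```

Under that assumption, the listed 83.3 corresponds to 25 of 30 problems, and a single additional problem changes the score by more than 3 points.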

Commonsense Reasoning

Reasoning about everyday situations (CommonsenseQA, HellaSwag).

5 datasets, 20 results
CommonsenseQA (2019)
SOTA: 85.4 (accuracy)
gpt-4o

12,247 multiple choice questions requiring commonsense reasoning about everyday concepts.

HellaSwag (2019)
SOTA: 95.3 (accuracy)
gpt-4o

70K sentence completion problems testing commonsense natural language inference.

WinoGrande (2019)
SOTA: 87.5 (accuracy)
gpt-4o

44K Winograd-style problems requiring commonsense reasoning to resolve pronoun references.

ARC-Challenge: AI2 Reasoning Challenge (2018)
SOTA: 96.7 (accuracy)
claude-3.5-sonnet

7,787 grade-school science questions in the full ARC dataset. The Challenge set contains the 2,590 harder questions that retrieval-based and word co-occurrence baselines answered incorrectly.

MMLU: Massive Multitask Language Understanding (2021)
SOTA: 92.3 (accuracy)
o1-preview

15,908 multiple choice questions across 57 subjects from elementary to professional level.
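Because the questions are spread unevenly across 57 subjects, a single accuracy figure can mean either the overall fraction correct (micro average) or the mean of per-subject accuracies (macro average). The listing does not say which is reported, so the sketch below computes both from hypothetical `(subject, is_correct)` records.

```python
from collections import defaultdict

def micro_macro_accuracy(records):
    """Return (micro, macro) accuracy for an iterable of (subject, is_correct) pairs."""
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, is_correct in records:
        per_subject[subject][0] += int(is_correct)
        per_subject[subject][1] += 1
    total_correct = sum(correct for correct, _ in per_subject.values())
    total = sum(count for _, count in per_subject.values())
    micro = total_correct / total
    macro = sum(correct / count for correct, count in per_subject.values()) / len(per_subject)
    return micro, macro

if __name__ == "__main__":
    # Toy data: 3 of 4 correct in "law", 1 of 2 correct in "math".
    records = [("law", True)] * 3 + [("law", False), ("math", True), ("math", False)]
    print(micro_macro_accuracy(records))
```

The two averages diverge whenever a model is stronger on the larger subjects than on the smaller ones, which is worth checking before comparing numbers from different sources.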

Logical Reasoning

Solving logic puzzles and deductive problems.

2 datasets, 4 results
LogiQA (2020)
SOTA: 56.3 (accuracy)
gpt-4o

8,678 logical reasoning questions from National Civil Servants Examinations of China.

ReClor: Reading Comprehension Dataset Requiring Logical Reasoning (2020)
SOTA: 72.4 (accuracy)
gpt-4o

6,138 reading comprehension questions requiring logical reasoning from GMAT/LSAT exams.

Multi-step Reasoning

Complex reasoning requiring multiple inference steps (HotpotQA).

3 datasets, 8 results
HotpotQA (2018)
SOTA: 71.3 (f1)
gpt-4o

113K question-answer pairs requiring reasoning over multiple Wikipedia documents.
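The F1 metric for extractive question answering is typically a token-level overlap score between the predicted and reference answer strings. The sketch below follows the common SQuAD-style recipe (lowercase, strip punctuation and articles, then compare token multisets). It is an approximation for illustration, not the official HotpotQA script, which also reports exact match plus supporting-fact and joint metrics.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> list:
    """Lowercase, drop punctuation and English articles, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two answer strings."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    print(token_f1("The Eiffel Tower", "Eiffel Tower"))
    print(token_f1("Paris, France", "Paris"))
```

A dataset-level score is then the mean of `token_f1` over all questions, which rewards partially correct answers that exact match would score as zero.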

StrategyQA (2021)
SOTA: 82.1 (accuracy)
gpt-4o

2,780 yes/no questions requiring implicit multi-step reasoning to answer.

GPQA: Graduate-Level Google-Proof Q&A (2024)
SOTA: 78 (accuracy)
o1-preview

448 expert-written questions in biology, physics, and chemistry. Designed to be "Google-proof": hard to answer correctly even with unrestricted web search.
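On a benchmark this small, sampling noise is worth keeping in mind when comparing models. The sketch below uses the normal approximation to a binomial proportion to put a rough 95% interval around an accuracy score. It assumes the score was measured on all 448 questions and that questions are independent, neither of which the listing confirms.

```python
import math

def accuracy_interval(accuracy_percent: float, n_questions: int, z: float = 1.96):
    """Approximate confidence interval (in percent) for an accuracy score."""
    p = accuracy_percent / 100
    standard_error = math.sqrt(p * (1 - p) / n_questions)
    return 100 * (p - z * standard_error), 100 * (p + z * standard_error)

if __name__ == "__main__":
    low, high = accuracy_interval(78, 448)
    print(f"{low:.1f} to {high:.1f}")
```

Under these assumptions, the interval spans several percentage points, so small gaps between models on this benchmark should be read cautiously.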

Arithmetic Reasoning

Performing arithmetic calculations and solving equations.

2 datasets, 6 results
MAWPS: Math Word Problem Repository (2016)
SOTA: 97.2 (accuracy)
gpt-4o

3,320 arithmetic word problems from various sources, testing basic arithmetic reasoning.

SVAMP: Simple Variations on Arithmetic Math Word Problems (2021)
SOTA: 93.7 (accuracy)
gpt-4o

1,000 elementary-level math word problems testing robustness of arithmetic reasoning.