Reasoning
Want to test whether your model can reason logically? These benchmarks measure math problem solving, commonsense understanding, and multi-step reasoning.
Mathematical Reasoning
Solving math word problems (GSM8K, MATH, Minerva).
GSM8K: Roughly 8,500 grade school math word problems requiring multi-step reasoning. One of the most widely used math reasoning benchmarks.
MATH: 12,500 competition mathematics problems from AMC, AIME, and other sources. Harder than GSM8K.
AIME 2024: 30 challenging problems from the 2024 American Invitational Mathematics Examination. Tests advanced mathematical reasoning.
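Math word problem benchmarks such as GSM8K are usually scored by comparing the final number in the model's output with the reference answer. The sketch below illustrates one way to do that. It relies on the fact that GSM8K reference solutions end with a line of the form "#### <answer>"; the fallback of taking the last number in free-form model output, and the function names `extract_final_number` and `exact_match_accuracy`, are assumptions made for illustration rather than part of any official scorer.

```python
import re
from typing import Optional, Sequence

# Matches integers and decimals, optionally negative, with thousands commas.
NUMBER = r"-?[\d,]*\.?\d+"


def extract_final_number(text: str) -> Optional[str]:
    """Return the final numeric answer in a solution string, or None."""
    # GSM8K reference solutions mark the final answer with "#### <answer>".
    marked = re.search(r"####\s*(" + NUMBER + ")", text)
    if marked:
        raw = marked.group(1)
    else:
        # Fallback (an assumption): treat the last number in the text as the answer.
        numbers = re.findall(NUMBER, text)
        if not numbers:
            return None
        raw = numbers[-1]
    return raw.replace(",", "")


def exact_match_accuracy(outputs: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of outputs whose final number equals the reference's final number."""
    if not references:
        return 0.0
    correct = 0
    for output, reference in zip(outputs, references):
        predicted = extract_final_number(output)
        expected = extract_final_number(reference)
        if predicted is not None and expected is not None:
            # Compare numerically so that "18" and "18.0" count as equal.
            correct += float(predicted) == float(expected)
    return correct / len(references)
```

Competition-style benchmarks such as MATH often have non-numeric answers (fractions, expressions, intervals), so a numeric comparison like this one would need to be replaced by symbolic or string-normalized matching there.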
Commonsense Reasoning
Reasoning about everyday situations (CommonsenseQA, HellaSwag).
CommonsenseQA: 12,247 multiple-choice questions requiring commonsense reasoning about everyday concepts.
HellaSwag: 70K sentence completion problems testing commonsense natural language inference.
WinoGrande: 44K Winograd-style problems requiring commonsense reasoning to resolve pronoun references.
ARC: 7,787 grade-school science questions requiring reasoning. The Challenge set contains the harder questions, those that a retrieval-based baseline answers incorrectly.
MMLU: 15,908 multiple-choice questions across 57 subjects, from elementary to professional level.
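Multiple-choice benchmarks such as CommonsenseQA and HellaSwag are commonly scored by assigning each answer option a score (for example, the model's log-likelihood of that option), picking the highest-scoring option, and reporting accuracy. The sketch below shows only the selection and accuracy step. How the scores are produced is left out, and the names `pick_choice` and `multiple_choice_accuracy` are hypothetical.

```python
from typing import Sequence


def pick_choice(option_scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring answer option."""
    return max(range(len(option_scores)), key=lambda i: option_scores[i])


def multiple_choice_accuracy(
    all_scores: Sequence[Sequence[float]], gold_indices: Sequence[int]
) -> float:
    """Fraction of questions where the top-scoring option is the gold answer."""
    if not gold_indices:
        return 0.0
    correct = sum(
        pick_choice(scores) == gold
        for scores, gold in zip(all_scores, gold_indices)
    )
    return correct / len(gold_indices)


# Illustrative scores (made up): one 4-option and one 5-option question.
example_scores = [
    [-1.2, -0.3, -2.0, -4.1],
    [-0.5, -0.9, -3.0, -2.2, -1.1],
]
example_gold = [1, 2]
example_accuracy = multiple_choice_accuracy(example_scores, example_gold)
```

Because options can differ in length, some evaluation setups normalize each option's score by its length before comparing; that choice can change reported accuracy and is worth stating alongside results.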
Logical Reasoning
Solving logic puzzles and deductive problems.
LogiQA: 8,678 logical reasoning questions from the National Civil Servants Examination of China.
ReClor: 6,138 reading comprehension questions requiring logical reasoning, drawn from GMAT and LSAT exams.
Multi-step Reasoning
Complex reasoning requiring multiple inference steps (HotpotQA).
HotpotQA: 113K question-answer pairs requiring reasoning over multiple Wikipedia documents.
StrategyQA: 2,780 yes/no questions requiring implicit multi-step reasoning to answer.
GPQA: 448 expert-level multiple-choice questions in biology, physics, and chemistry. Designed so that the answers are hard to find through web search.
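Short-answer multi-hop benchmarks such as HotpotQA are typically reported with exact match and token-level F1 after light answer normalization. The sketch below follows the widely used SQuAD-style recipe (lowercase, strip punctuation, drop English articles, collapse whitespace). It is an illustrative approximation under that assumption, not a copy of any official evaluation script, and it ignores supporting-fact scoring.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized prediction equals the normalized gold answer."""
    return normalize_answer(prediction) == normalize_answer(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between prediction and gold."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For yes/no benchmarks such as StrategyQA and multiple-choice ones such as GPQA, plain accuracy is the more natural metric; token F1 mainly helps when gold answers are free-form spans.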
Arithmetic Reasoning
Performing arithmetic calculations and solving equations.
MAWPS: 3,320 arithmetic word problems from various sources, testing basic arithmetic reasoning.
SVAMP: 1,000 elementary-level math word problems testing the robustness of arithmetic reasoning.