Multi-step Reasoning · 2024 · en

Graduate-Level Google-Proof Q&A

448 expert-written multiple-choice questions in biology, physics, and chemistry, designed to be "Google-proof": hard to answer correctly even with unrestricted web search.

Metrics: accuracy
Current State of the Art

Gemini 3 Pro

Google

91.9

accuracy

GPQA — accuracy

33 results · 2 SOTA advances · higher is better

[Chart: SOTA frontier of GPQA accuracy over time. Y-axis: accuracy (40 to 90). X-axis: date (2026 to 2027). Labeled points: Gemini 2.5 Pro, Gemini 3 Pro.]

accuracy Progress Over Time

Showing 5 breakthroughs from Dec 2024 to Apr 2026

[Chart: accuracy (y-axis, 44.7 to 96.2) against date (x-axis, Dec 2024 to Apr 2026).]

Key Milestones

Dec 2024 · Qwen2.5-72B-Instruct · 49.0
GPQA Diamond. Source: Qwen2.5 Technical Report, Table 6.

Jan 2025 · DeepSeek-R1 · 71.5 (+45.9%)
GPQA Diamond, 0-shot CoT. Source: DeepSeek-R1 paper, Table 3, arXiv:2501.12948 (Jan 2025).

Mar 2026 · o3 · 82.8 (+15.8%)
Zero-shot CoT, pass@1. Default reasoning effort.

Mar 2026 · Gemini 2.5 Pro · 84.0 (+1.4%)
GPQA Diamond, 0-shot CoT. Source: Gemini 2.5 Pro technical report, Google DeepMind (April 2025).

Apr 2026 · Gemini 3 Pro (current SOTA) · 91.9 (+9.4%)

Total Improvement: 87.6%
Time Span: 1y 5m
Breakthroughs: 5
Current SOTA: 91.9
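
The percentages attached to each milestone look like relative gains over the previous milestone, not differences in accuracy points. The page does not say this; it is inferred from the numbers. A minimal Python sketch under that assumption (the helper name `relative_gain` is illustrative, not something the leaderboard defines):

```python
# Milestone scores as listed on the page (GPQA accuracy, in percent).
milestones = [
    ("Qwen2.5-72B-Instruct", 49.0),
    ("DeepSeek-R1", 71.5),
    ("o3", 82.8),
    ("Gemini 2.5 Pro", 84.0),
    ("Gemini 3 Pro", 91.9),
]


def relative_gain(previous: float, current: float) -> float:
    """Relative improvement of `current` over `previous`, in percent."""
    return (current - previous) / previous * 100.0


# Gain of each milestone over the one before it, rounded to one decimal.
step_gains = [
    round(relative_gain(prev_score, cur_score), 1)
    for (_, prev_score), (_, cur_score) in zip(milestones, milestones[1:])
]

# Gain of the last milestone over the first.
total_gain = round(relative_gain(milestones[0][1], milestones[-1][1]), 1)

for (name, _), gain in zip(milestones[1:], step_gains):
    print(f"{name}: +{gain}%")
print(f"Total improvement: {total_gain}%")
```

Under this reading, the total of 87.6% is the gain of 91.9 relative to 49.0. In accuracy points the same span is 91.9 - 49.0 = 42.9.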

Top Models Performance Comparison

Top 10 models ranked by accuracy

Rank  Model              Accuracy  % of best
1     Gemini 3 Pro       91.9      100.0%
2     Claude Opus 4.6    91.3      99.3%
3     Gemini 3 Flash     90.4      98.4%
4     Claude Sonnet 4.6  89.9      97.8%
5     GPT-5              89.0      96.8%
6     Grok 4             88.0      95.8%
7     Gemini 2.5 Pro     84.0      91.4%
8     o3                 82.8      90.1%
9     Gemini 2.5 Flash   82.8      90.1%
10    o4-mini            77.6      84.4%

Best Score: 91.9
Top Model: Gemini 3 Pro
Models Compared: 10
Score Range: 14.3
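
The "% of best" figures and the score range in the top-10 comparison can be reproduced from the scores alone. The sketch below assumes "% of best" means each score divided by the top score and "Score Range" means best minus worst within the top 10; both readings are inferred from the numbers, not documented on the page.

```python
# Top-10 scores as listed on the page (GPQA accuracy, in percent).
top10 = {
    "Gemini 3 Pro": 91.9,
    "Claude Opus 4.6": 91.3,
    "Gemini 3 Flash": 90.4,
    "Claude Sonnet 4.6": 89.9,
    "GPT-5": 89.0,
    "Grok 4": 88.0,
    "Gemini 2.5 Pro": 84.0,
    "o3": 82.8,
    "Gemini 2.5 Flash": 82.8,
    "o4-mini": 77.6,
}

best = max(top10.values())
worst = min(top10.values())

# Each score as a percentage of the best score, rounded to one decimal.
percent_of_best = {
    model: round(score / best * 100.0, 1) for model, score in top10.items()
}

# Spread between the best and worst score among the compared models.
score_range = round(best - worst, 1)

for model, pct in percent_of_best.items():
    print(f"{model}: {pct}% of best")
print(f"Score range: {score_range}")
```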

accuracy (primary metric)

#   Model                 Organization  Access       Score  Date
1   Gemini 3 Pro          Google        -            91.9   Apr 2026
2   Claude Opus 4.6       Anthropic     API          91.3   Apr 2026
3   Gemini 3 Flash        Google        API          90.4   Apr 2026
4   Claude Sonnet 4.6     Anthropic     API          89.9   Apr 2026
5   GPT-5                 OpenAI        API          89.0   Apr 2026
6   Grok 4                xAI           API          88.0   Apr 2026
7   Gemini 2.5 Pro        Google        API          84.0   Mar 2026
8   o3                    OpenAI        API          82.8   Mar 2026
9   Gemini 2.5 Flash      Google        -            82.8   Apr 2026
10  o4-mini               OpenAI        API          77.6   Mar 2026
11  Claude Opus 4         Anthropic     API          76.7   Mar 2026
12  o1                    OpenAI        API          75.7   Mar 2026
13  Claude Opus 4.5       Anthropic     API          74.9   Mar 2026
14  o3-mini               OpenAI        API          74.9   Mar 2026
15  o1-preview            OpenAI        -            73.3   Mar 2026
16  DeepSeek-R1           DeepSeek      Open Source  71.5   Mar 2026
17  Qwen3-235B-A22B       Alibaba       -            71.1   Apr 2026
18  Claude Sonnet 4       Anthropic     API          70.0   Mar 2026
19  Llama-4-Maverick      Meta          Open Source  69.8   Mar 2026
20  GPT-4.5 Preview       OpenAI        API          69.5   Mar 2026
21  GPT-4.1 mini          OpenAI        API          66.4   Apr 2026
22  GPT-4.1               OpenAI        API          66.3   Mar 2026
23  o1-mini               OpenAI        API          60.0   Mar 2026
24  Claude 3.5 Sonnet     Anthropic     API          59.4   Mar 2026
25  Grok 2                xAI           API          56.0   Mar 2026
26  Llama 3.1 405B        Meta          Open Source  50.7   Mar 2026
27  Claude 3 Opus         Anthropic     API          50.4   Mar 2026
28  GPT-4o                OpenAI        API          49.9   Mar 2026
29  GPT-4 Turbo           OpenAI        API          49.3   Mar 2026
30  Qwen2.5-72B-Instruct  Alibaba       Open Source  49.0   Mar 2026
31  Gemini 1.5 Pro        Google        API          46.2   Mar 2026
32  Llama 3.1 70B         Meta          Open Source  41.7   Mar 2026
33  GPT-4o mini           OpenAI        -            40.2   Mar 2026

Other Multi-step Reasoning Datasets