Codesota · Models · o3 · OpenAI · 18 results · 17 benchmarks

Model card

o3.

OpenAI · API · 5 current SOTA

§ 01 · Benchmarks

Every benchmark o3 has a recorded score for.

#	Benchmark	Area · Task	Metric	Value	Rank	Date	Source
01	MMLU	Reasoning · Commonsense Reasoning	accuracy	92.9%	#1/41	2025-04-16	source ↗
02	RE-Bench	Agentic AI · RE-Bench	normalized-score	0.4%	#1/5	2025-04-01	source ↗
03	AIME 2024	Reasoning · Mathematical Reasoning	accuracy	96.7%	#1/8	—	source ↗
04	ARC-AGI-1	Reasoning · Logical Reasoning	accuracy	87.5%	#1/5	—	source ↗
05	ARC-Challenge	Reasoning · Commonsense Reasoning	accuracy	98.1%	#1/10	—	source ↗
06	HCAST	Agentic AI · HCAST	success-rate	49.0%	#2/6	2025-04-01	source ↗
07	METR Time Horizon	Agentic AI · Time Horizon	task-horizon-minutes	30.0 min	#2/5	2025-04-01	source ↗
08	AIME 2025	Reasoning · Mathematical Reasoning	accuracy	86.7%	#2/5	—	source ↗
09	ARC-AGI-2	Reasoning · Logical Reasoning	accuracy	4.0%	#2/3	—	source ↗
10	GSM8K	Reasoning · Mathematical Reasoning	accuracy	99.0%	#3/32	—	source ↗
11	MATH	Reasoning · Mathematical Reasoning	accuracy	97.8%	#4/34	—	source ↗
12	HumanEval	Computer Code · Code Generation	pass@1	94.8%	#5/42	2025-04-01	source ↗
13	GPQA	Reasoning · Multi-step Reasoning	accuracy	82.8%	#8/33	—	source ↗
14	LiveCodeBench Pro	Computer Code · Code Generation	elo	1010.00	#9/9	—	source ↗
15	LiveCodeBench	Computer Code · Code Generation	pass@1	65.3%	#11/30	2024-03-12	source ↗
16	SWE-Bench Verified	Computer Code · Code Generation	resolve-rate	69.1%	#21/39	—	source ↗
17	HumanEval	Computer Code · Code Generation	pass@1	87.4%	#26/42	—	source ↗
18	SWE-bench Verified	Agentic AI · SWE-bench	resolve-rate	69.1%	#44/81	—	source ↗

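The Rank column shows this model's position among all models scored on the same benchmark and metric; the number after the slash is the size of that field. A rank of #1 marks a current SOTA result (shown in red on the original page). Rows are sorted by rank, then by newest result, with undated results last.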
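The rank notation and sort order described in the note above can be sketched in a few lines of Python. This is only an illustration, not Codesota's code: the helper names `parse_rank` and `sort_key` are hypothetical, the reading of the number after the slash as the total field size is an assumption, and placing undated rows after dated ones mirrors what the table appears to do.

```python
from datetime import date

def parse_rank(cell):
    """Split a rank cell such as '#1/41' into (position, field_size)."""
    position, field_size = cell.lstrip("#").split("/")
    return int(position), int(field_size)

def sort_key(row):
    """Order by rank position, then newest date first; undated rows last."""
    _name, rank_cell, day = row
    position, _ = parse_rank(rank_cell)
    # Negating the ordinal puts newer dates first; undated rows sort to the end.
    recency = -day.toordinal() if day else float("inf")
    return (position, recency)

# A few rows copied from the table: (benchmark, rank cell, date or None).
rows = [
    ("HCAST", "#2/6", date(2025, 4, 1)),
    ("AIME 2024", "#1/8", None),
    ("RE-Bench", "#1/5", date(2025, 4, 1)),
    ("MMLU", "#1/41", date(2025, 4, 16)),
]

ordered = [name for name, _, _ in sorted(rows, key=sort_key)]
is_sota = {name: parse_rank(rank)[0] == 1 for name, rank, _ in rows}
print(ordered)
```

With these four rows the order matches the table: MMLU, RE-Bench, AIME 2024, HCAST.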

§ 02 · Strengths by area

Where o3 actually performs.

Reasoning · avg rank #2.6 · 4 SOTA

Agentic AI · avg rank #12.3 · 1 SOTA

Computer Code · avg rank #14.4
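The per-area averages appear to be plain means of the rank positions in the benchmark table, one entry per table row. The Python sketch below reproduces them under that assumption. The variable names are hypothetical, both HumanEval rows are counted, and half-up rounding is assumed because it matches the #12.3 shown for Agentic AI (the exact mean is 12.25).

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

# (area, rank position) pairs copied from the benchmark table in § 01.
rank_rows = [
    ("Reasoning", 1), ("Reasoning", 1), ("Reasoning", 1), ("Reasoning", 1),
    ("Reasoning", 2), ("Reasoning", 2), ("Reasoning", 3), ("Reasoning", 4),
    ("Reasoning", 8),
    ("Agentic AI", 1), ("Agentic AI", 2), ("Agentic AI", 2), ("Agentic AI", 44),
    ("Computer Code", 5), ("Computer Code", 9), ("Computer Code", 11),
    ("Computer Code", 21), ("Computer Code", 26),
]

by_area = defaultdict(list)
for area, position in rank_rows:
    by_area[area].append(position)

def average_rank(positions):
    """Mean rank position, rounded half-up to one decimal place."""
    mean = Decimal(sum(positions)) / Decimal(len(positions))
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

# Map each area to (average rank, number of #1 results).
summary = {area: (average_rank(p), p.count(1)) for area, p in by_area.items()}
for area, (avg, sota) in summary.items():
    print(f"{area}: avg rank #{avg} · {sota} SOTA")
```

The SOTA counts from this sketch sum to 5, which agrees with the "5 current SOTA" figure in the page header.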

§ 03 · Papers

2 papers with results for o3.

2025-04-01 · Agentic AI · 3 results
METR: Measuring Autonomy in AI Systems (2025 Update)
2024-03-12 · Computer Code · 1 result
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

§ 04 · Related models

Other OpenAI models scored on Codesota.

GPT-4o

Undisclosed params · 35 results · 9 SOTA

Undisclosed params · 8 results

GPT-4.1

7 results

§ 05 · Sources & freshness

Where these numbers come from.

Sources with multiple results: openai-simple-evals, official-leaderboard, openai-system-card, arcprize-leaderboard.

Sources with one result each: arxiv, shadow-page-humaneval, livecodebench-pro-official, openai-blog, editorial.

14 of 18 rows marked verified · first result 2024-03-12, latest 2025-04-16.
