Codesota · Models · o4-mini · OpenAI · 15 results · 15 benchmarks

Model card

o4-mini.

OpenAI · API · 3 current SOTA

§ 01 · Benchmarks

Every benchmark o4-mini has a recorded score for.

#	Benchmark	Area · Task	Metric	Value	Rank	Date	Source
01	AIME 2025	Reasoning · Mathematical Reasoning	accuracy	92.7%	#1/5	—	source ↗
02	HumanEval	Computer Code · Code Generation	pass@1	97.3%	#1/42	—	source ↗
03	MBPP	Computer Code · Code Generation	pass@1	94.9%	#1/19	—	source ↗
04	AIME 2024	Reasoning · Mathematical Reasoning	accuracy	93.4%	#2/8	—	source ↗
05	ARC-AGI-1	Reasoning · Logical Reasoning	accuracy	79.0%	#3/5	—	source ↗
06	ARC-AGI-2	Reasoning · Logical Reasoning	accuracy	3.0%	#3/3	—	source ↗
07	GSM8K	Reasoning · Mathematical Reasoning	accuracy	99.0%	#3/32	—	source ↗
08	LiveCodeBench Pro	Computer Code · Code Generation	elo	2092.00	#3/9	—	source ↗
09	ARC-Challenge	Reasoning · Commonsense Reasoning	accuracy	97.3%	#4/10	—	source ↗
10	MATH	Reasoning · Mathematical Reasoning	accuracy	97.5%	#5/34	—	source ↗
11	LiveCodeBench	Computer Code · Code Generation	pass@1	72.8%	#7/30	2024-03-12	source ↗
12	GPQA	Reasoning · Multi-step Reasoning	accuracy	77.6%	#10/33	—	source ↗
13	MMLU	Reasoning · Commonsense Reasoning	accuracy	90.0%	#15/41	2025-04-16	source ↗
14	SWE-Bench Verified	Computer Code · Code Generation	resolve-rate	68.1%	#22/39	—	source ↗
15	SWE-bench Verified	Agentic AI · SWE-bench	resolve-rate	68.1%	#45/81	—	source ↗

The Rank column shows this model’s position against all other models scored on the same benchmark and metric (competitors after the slash). A rank of #1 marks the current SOTA. Rows are sorted by rank, then by newest result.
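As a minimal sketch of how the Rank column can be read programmatically, the snippet below parses strings such as `#3/9` into a position and the number after the slash, then flags rows at position 1 as SOTA. The names `Rank` and `parse_rank` are hypothetical helpers introduced for illustration; they are not part of any Codesota API, and only three rows from the table are reproduced.

```python
import re
from typing import NamedTuple


class Rank(NamedTuple):
    position: int  # the model's place on this benchmark + metric
    out_of: int    # the number shown after the slash


def parse_rank(text: str) -> Rank:
    """Parse a rank cell such as '#3/9' (hypothetical helper)."""
    match = re.fullmatch(r"#(\d+)/(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised rank cell: {text!r}")
    return Rank(int(match.group(1)), int(match.group(2)))


# A few (benchmark, rank) pairs copied from the table above.
rows = [
    ("AIME 2025", "#1/5"),
    ("GSM8K", "#3/32"),
    ("SWE-bench Verified", "#45/81"),
]

# Position 1 is what the page calls current SOTA.
sota = [name for name, cell in rows if parse_rank(cell).position == 1]
print(sota)
```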

§ 02 · Strengths by area

Where o4-mini actually performs.

Computer Code

5 benchmarks

avg rank #6.8 · 2 SOTA

Reasoning

9 benchmarks

avg rank #5.1 · 1 SOTA

Agentic AI

1 benchmark

avg rank #45.0
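The per-area figures above are consistent with a plain mean of the rank positions in the § 01 table, rounded to one decimal. That reading is inferred from the numbers, not a formula the page documents. The sketch below recomputes them under that assumption; the `ranks_by_area` mapping is transcribed by hand from the table and `summarise` is a hypothetical helper.

```python
from statistics import mean

# Rank positions per area, transcribed from the § 01 table.
ranks_by_area = {
    "Computer Code": [1, 1, 3, 7, 22],               # HumanEval, MBPP, LiveCodeBench Pro, LiveCodeBench, SWE-Bench Verified
    "Reasoning": [1, 2, 3, 3, 3, 4, 5, 10, 15],      # AIME 2025 through MMLU
    "Agentic AI": [45],                              # SWE-bench Verified (agentic listing)
}


def summarise(ranks):
    """Return (benchmark count, mean rank to 1 decimal, number of #1 ranks)."""
    return len(ranks), round(mean(ranks), 1), sum(1 for r in ranks if r == 1)


summary = {area: summarise(ranks) for area, ranks in ranks_by_area.items()}

for area, (count, avg, sota) in summary.items():
    print(f"{area}: {count} benchmarks, avg rank #{avg}, {sota} SOTA")
```

Under this assumption the output reproduces the page's values: #6.8 with 2 SOTA for Computer Code, #5.1 with 1 SOTA for Reasoning, and #45.0 for Agentic AI.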

§ 03 · Papers

1 paper with results for o4-mini.

2024-03-12 · Computer Code · 1 result
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

§ 04 · Related models

Other OpenAI models scored on Codesota.

GPT-4o

Undisclosed params · 35 results · 9 SOTA

Undisclosed params · 8 results

GPT-4.1

7 results

§ 05 · Sources & freshness

Where these numbers come from.

openai-simple-evals

results

openai-system-card

results

arcprize-leaderboard

results

official-model-card

result

livecodebench-pro-official

result

official-leaderboard

result

swebench-leaderboard

result

editorial

result

11 of 15 rows marked verified · first result 2024-03-12, latest 2025-04-16.
