Coding · updated April 2026

LLM Coding Benchmarks.

Leaderboards for LiveCodeBench (contest problems), SWE-bench Verified (real GitHub issues), and HumanEval+ (enhanced unit test coverage). Three benchmarks covering different coding abilities.

§ 01 · LiveCodeBench

Contest problems, contamination-controlled.

Continuously updated contest problems from LeetCode, Codeforces, and AtCoder, with each model evaluated only on problems released after its training cutoff. Tests code generation, self-repair, and test-output prediction on genuinely unseen problems.

| # | Model | Provider | Pass@1 | Date |
|---|-------|----------|--------|------|
| 1 | Gemini 3 Pro Preview | Google | 91.7% | Apr 2026 |
| 2 | Gemini 3 Flash | Google | 90.8% | Apr 2026 |
| 3 | GPT-5 | OpenAI | 85% | Apr 2026 |
| 4 | Grok 4 | xAI | 79% | Apr 2026 |
| 5 | Gemini 2.5 Pro | Google | 75.6% | Apr 2026 |
| 6 | DeepSeek-R1-0528 | DeepSeek | 73.3% | May 2025 |
| 7 | o4-mini | OpenAI | 72.8% | Mar 2026 |
| 8 | Qwen3-235B-A22B | Alibaba | 70.7% | May 2025 |
| 9 | o3-mini | OpenAI | 66.9% | Mar 2026 |
| 10 | DeepSeek R1 | DeepSeek | 65.9% | Jan 2025 |
| 11 | o3 | OpenAI | 65.3% | Mar 2026 |
| 12 | DeepSeek-R1-Distill-Llama-70B | DeepSeek | 65.2% | Jan 2025 |
| 13 | Gemini 2.5 Flash | Google | 63.9% | Apr 2026 |
| 14 | Kimi k1.5 | Moonshot AI | 62.5% | Jan 2025 |
| 15 | DeepSeek-R1-Distill-Qwen-32B | DeepSeek | 62.1% | Jan 2025 |
| 16 | Claude Opus 4 | Anthropic | 57.8% | Mar 2026 |
| 17 | GPT-4.1 | OpenAI | 54.4% | Mar 2026 |
| 18 | Claude Sonnet 4 | Anthropic | 52.8% | Mar 2026 |
| 19 | DeepSeek-v3-0324 | DeepSeek | 49.2% | Mar 2025 |
| 20 | DeepSeek-V3 | DeepSeek | 49.2% | Mar 2026 |
| 21 | GPT-4.1 mini | OpenAI | 48.3% | Apr 2026 |
| 22 | Qwen2.5-Coder 32B | Alibaba | 47.8% | Mar 2026 |
| 23 | Llama-4-Maverick | Meta | 43.4% | Apr 2025 |
| 24 | DeepSeek-Coder-V2-Instruct | DeepSeek | 43.4% | Mar 2026 |
| 25 | GPT-4o | OpenAI | 40.8% | Mar 2026 |
| 26 | Gemma-3-27b | Google | 39% | Mar 2025 |
| 27 | Llama-4-Scout | Meta | 32.8% | Apr 2025 |
| 28 | Gemma 3 12B IT | Google DeepMind | 32% | Mar 2025 |
| 29 | Codestral 22B | Mistral | 29.5% | Mar 2026 |
| 30 | Gemma 3 4B IT | Google DeepMind | 23% | Mar 2025 |

Source: livecodebench.github.io · Problems released after training cutoffs.
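
Pass@1 here is the share of problems solved with a single generated program. When k samples are drawn per problem, benchmarks in this family typically report the unbiased pass@k estimator introduced with HumanEval; a minimal sketch (helper name and example numbers are illustrative):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k for a single problem.

    n: total samples generated, c: samples passing all tests, k: budget.
    pass@k = 1 - C(n - c, k) / C(n, k), computed stably as a product.
    """
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 10 samples for one problem, 3 of them correct.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```

The benchmark score is this quantity averaged over all problems in the evaluation window.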

§ 02 · SWE-bench Verified

Real GitHub issues, human-verified.

500 real GitHub issues from popular Python repos. Human-verified to ensure the issue description is clear and the fix is testable. Measures real-world software engineering — not toy problems.

| # | Model | Provider | % Resolved | Date |
|---|-------|----------|------------|------|
| 1 | Claude Opus 4.7 | Anthropic | 87.6% | Apr 2026 |
| 2 | Claude Opus 4.5 | Anthropic | 80.9% | Mar 2026 |
| 3 | Claude Opus 4.6 | Anthropic | 80.8% | Mar 2026 |
| 4 | Gemini 3.1 Pro | Google | 80.6% | Mar 2026 |
| 5 | MiniMax M2.5 | MiniMax | 80.2% | Mar 2026 |
| 6 | GPT-5.2 Thinking | OpenAI | 80% | Mar 2026 |
| 7 | Claude Sonnet 4.6 | Anthropic | 79.6% | Mar 2026 |
| 8 | Gemini 3 Flash | Google | 78% | Mar 2026 |
| 9 | Claude Sonnet 4.5 | Anthropic | 77.2% | Mar 2026 |
| 10 | Kimi K2.5 | Moonshot AI | 76.8% | Mar 2026 |
| 11 | GPT-5.1 | OpenAI | 76.3% | Mar 2026 |
| 12 | Gemini 3 Pro | Google | 76.2% | Mar 2026 |
| 13 | GPT-5 | OpenAI | 74.9% | Mar 2026 |
| 14 | MiniMax M2.1 | MiniMax | 74% | Mar 2026 |
| 15 | Claude Haiku 4.5 | Anthropic | 73.3% | Mar 2026 |
| 16 | Claude Sonnet 4 | Anthropic | 72.7% | Mar 2026 |
| 17 | Claude Opus 4 | Anthropic | 72.5% | Mar 2026 |
| 18 | Devstral 2 | Mistral | 72.2% | Mar 2026 |
| 19 | Qwen3-Coder 480B A35B | Alibaba Cloud | 69.6% | Mar 2026 |
| 20 | MiniMax M2 | MiniMax | 69.4% | Mar 2026 |
| 21 | o3 | OpenAI | 69.1% | Mar 2026 |
| 22 | o4-mini | OpenAI | 68.1% | Mar 2026 |
| 23 | DeepSeek-V3.1 | DeepSeek | 66% | Mar 2026 |
| 24 | Kimi-K2 | Moonshot AI | 65.8% | Mar 2026 |
| 25 | Grok 3 | xAI | 63.8% | Mar 2026 |
| 26 | Gemini 2.5 Pro | Google | 63.8% | Mar 2026 |
| 27 | Claude 3.7 Sonnet | Anthropic | 63.7% | Mar 2026 |
| 28 | Gemini 2.5 Flash | Google | 60.4% | Mar 2026 |
| 29 | DeepSeek-R1-0528 | DeepSeek | 57.6% | Mar 2026 |
| 30 | o3-mini | OpenAI | 55.8% | Mar 2026 |
| 31 | GPT-4.1 | OpenAI | 54.6% | Mar 2026 |
| 32 | Claude 3.5 Sonnet | Anthropic | 50.8% | Mar 2026 |
| 33 | DeepSeek R1 | DeepSeek | 49.2% | Mar 2026 |
| 34 | o1 | OpenAI | 48.9% | Mar 2026 |
| 35 | Devstral Small 2505 | Mistral | 46.8% | Mar 2026 |
| 36 | DeepSeek-V3 | DeepSeek | 42% | Mar 2026 |
| 37 | GPT-4o | OpenAI | 41.2% | Mar 2026 |
| 38 | Claude 3.5 Haiku | Anthropic | 40.6% | Mar 2026 |
| 39 | DeepSeek-V2.5 | DeepSeek | 37% | Mar 2026 |

Source: swebench.com · Verified subset, agent scaffolding allowed.
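
Scoring boils down to applying the model's patch at the issue's base commit and re-running the repository's tests: the designated failing tests must now pass, and previously passing tests must not regress. A rough sketch of that check, assuming a hypothetical `instance` dict with `patch`, `fail_to_pass`, and `pass_to_pass` fields; the real harness runs inside per-repo containers with pinned dependencies:

```python
import subprocess

def run(cmd: str, cwd: str) -> bool:
    """Run a shell command in the repo checkout; True on exit code 0."""
    return subprocess.run(cmd, shell=True, cwd=cwd).returncode == 0

def resolves_issue(repo_dir: str, instance: dict) -> bool:
    """Apply the model-generated patch and check both test groups.

    instance["patch"]: path to the unified diff the model produced (assumed)
    instance["fail_to_pass"]: tests that must flip from failing to passing
    instance["pass_to_pass"]: tests that must not regress
    """
    if not run(f"git apply {instance['patch']}", cwd=repo_dir):
        return False  # patch does not apply cleanly
    fail_to_pass_ok = all(
        run(f"python -m pytest -x {t}", cwd=repo_dir)
        for t in instance["fail_to_pass"]
    )
    pass_to_pass_ok = all(
        run(f"python -m pytest -x {t}", cwd=repo_dir)
        for t in instance["pass_to_pass"]
    )
    return fail_to_pass_ok and pass_to_pass_ok
```

"% Resolved" is then the fraction of the 500 verified issues for which this check succeeds.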

§ 03 · HumanEval+

HumanEval, 80× the test cases.

EvalPlus extends HumanEval with 80x more test inputs per problem, catching solutions that pass original tests but fail on edge cases. More rigorous than the base HumanEval benchmark.

| # | Model | Provider | Pass@1 | Date |
|---|-------|----------|--------|------|
| 1 | o3-mini (high) | OpenAI | 95.1% | Feb 2025 |
| 2 | Claude 3.7 Sonnet | Anthropic | 94.3% | Feb 2025 |
| 3 | Gemini 2.5 Pro | Google | 93.7% | Mar 2025 |
| 4 | DeepSeek-R1 | DeepSeek | 91.2% | Jan 2025 |
| 5 | Claude 3.5 Sonnet | Anthropic | 88.1% | Jun 2024 |
| 6 | GPT-4o | OpenAI | 87.4% | May 2024 |
| 7 | Gemini 1.5 Pro | Google | 80.3% | May 2024 |
| 8 | Llama 3 70B Instruct | Meta | 75.9% | Apr 2024 |

Source: evalplus/evalplus · 80x augmented test cases vs. original HumanEval.
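
The failure mode the extra tests target: a completion can pass a handful of sparse assertions while breaking on inputs they never exercise. A toy illustration loosely modeled on a HumanEval-style task; the buggy completion and the test inputs below are invented for demonstration:

```python
def below_zero(operations: list[int]) -> bool:
    """Return True if a running balance starting at 0 ever drops below zero."""
    return sum(operations) < 0  # buggy: only looks at the final balance

# Sparse, original-style tests: the buggy completion happens to pass.
print(below_zero([1, 2, 3]))    # False (correct)
print(below_zero([1, -2, -3]))  # True  (correct)

# Augmented-style input: the balance dips below zero mid-sequence and then
# recovers, so the buggy completion answers False when the truth is True.
print(below_zero([1, -2, 3]))   # False (wrong -- caught by the extra input)
```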

§ 04 · Methodology

Frequently asked.

Which coding benchmark should I use to compare LLMs?

LiveCodeBench is the best single benchmark for general coding ability — it avoids contamination and continuously updates. For software engineering tasks (debugging, refactoring real codebases), use SWE-bench Verified. HumanEval is saturated and no longer differentiates frontier models.

How does LiveCodeBench avoid data contamination?

Problems are scraped continuously from competitive programming platforms, and each model is evaluated only on problems released after its training cutoff. This ensures models cannot have seen the exact problems during training.
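
Mechanically this is a date filter over problem metadata; a hedged sketch with illustrative field names (`release_date` and the problem IDs below are not the benchmark's actual schema):

```python
from datetime import date

def eligible_problems(problems: list[dict], cutoff: date) -> list[dict]:
    """Keep only problems published after the model's training cutoff.

    Each problem dict is assumed to carry a 'release_date'; the real
    benchmark tracks release dates per platform scrape.
    """
    return [p for p in problems if p["release_date"] > cutoff]

# Example: a model trained through 2025-10-01 is only scored on
# contest problems that appeared after that date.
problems = [
    {"id": "lc-3501", "release_date": date(2025, 9, 20)},
    {"id": "cf-2034F", "release_date": date(2026, 1, 12)},
]
print(eligible_problems(problems, cutoff=date(2025, 10, 1)))
```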

Why do some models lead SWE-bench but not LiveCodeBench?

LiveCodeBench rewards algorithmic reasoning under tight constraints — reasoning models excel here. SWE-bench rewards understanding codebases, writing clean patches, and following project conventions — instruction-following models with longer context tend to have an edge.

§ 05 · Related

Continue reading.

HumanEval & MBPP (Coding · classic) · Classic Python coding benchmarks, historical scores
Agentic Benchmarks (Agentic) · SWE-bench, BinaryAudit, OTelBench
All LLM Benchmarks (Index) · MMLU-Pro, GPQA Diamond, HLE overview