Coding · updated April 2026

LLM Coding Benchmarks.

Leaderboards for LiveCodeBench (contest problems), SWE-bench Verified (real GitHub issues), and HumanEval+ (enhanced unit test coverage). Three benchmarks covering different coding abilities.

§ 01 · LiveCodeBench

Contest problems, contamination-controlled.

Continuously updated contest problems from LeetCode, Codeforces, and AtCoder, with each model evaluated only on problems released after its training cutoff. Tests code generation, self-repair, and test-output prediction on genuinely unseen problems.

| # | Model | Provider | Pass@1 | Date |
|---|-------|----------|--------|------|
| 1 | Gemini 3 Pro Preview | Google | 91.7% | Apr 2026 |
| 2 | Gemini 3 Flash | Google | 90.8% | Apr 2026 |
| 3 | GPT-5 | OpenAI | 85% | Apr 2026 |
| 4 | Grok 4 | xAI | 79% | Apr 2026 |
| 5 | Gemini 2.5 Pro | Google | 75.6% | Apr 2026 |
| 6 | DeepSeek-R1-0528 | DeepSeek | 73.3% | May 2025 |
| 7 | o4-mini | OpenAI | 72.8% | Mar 2026 |
| 8 | Qwen3-235B-A22B | Alibaba | 70.7% | May 2025 |
| 9 | o3-mini | OpenAI | 66.9% | Mar 2026 |
| 10 | DeepSeek R1 | DeepSeek | 65.9% | Jan 2025 |
| 11 | o3 | OpenAI | 65.3% | Mar 2026 |
| 12 | DeepSeek-R1-Distill-Llama-70B | DeepSeek | 65.2% | Jan 2025 |
| 13 | Gemini 2.5 Flash | Google | 63.9% | Apr 2026 |
| 14 | Kimi k1.5 | Moonshot AI | 62.5% | Jan 2025 |
| 15 | DeepSeek-R1-Distill-Qwen-32B | DeepSeek | 62.1% | Jan 2025 |
| 16 | Claude Opus 4 | Anthropic | 57.8% | Mar 2026 |
| 17 | GPT-4.1 | OpenAI | 54.4% | Mar 2026 |
| 18 | Claude Sonnet 4 | Anthropic | 52.8% | Mar 2026 |
| 19 | DeepSeek-v3-0324 | DeepSeek | 49.2% | Mar 2025 |
| 20 | DeepSeek-V3 | DeepSeek | 49.2% | Mar 2026 |
| 21 | GPT-4.1 mini | OpenAI | 48.3% | Apr 2026 |
| 22 | Qwen2.5-Coder 32B | Alibaba | 47.8% | Mar 2026 |
| 23 | Llama-4-Maverick | Meta | 43.4% | Apr 2025 |
| 24 | DeepSeek-Coder-V2-Instruct | DeepSeek | 43.4% | Mar 2026 |
| 25 | GPT-4o | OpenAI | 40.8% | Mar 2026 |
| 26 | Gemma-3-27b | Google | 39% | Mar 2025 |
| 27 | Llama-4-Scout | Meta | 32.8% | Apr 2025 |
| 28 | Gemma 3 12B IT | Google DeepMind | 32% | Mar 2025 |
| 29 | Codestral 22B | Mistral | 29.5% | Mar 2026 |
| 30 | Gemma 3 4B IT | Google DeepMind | 23% | Mar 2025 |

Source: livecodebench.github.io · Problems released after training cutoffs.
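
Pass@1 here is the share of problems solved with a single generated program. When k samples are drawn per problem, benchmarks in this family typically report the unbiased pass@k estimator introduced with HumanEval; a minimal sketch (helper name and example numbers are illustrative):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k for a single problem.

    n: total samples generated, c: samples passing all tests, k: budget.
    pass@k = 1 - C(n - c, k) / C(n, k), computed stably as a product.
    """
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 10 samples for one problem, 3 of them correct.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```

The benchmark score is this quantity averaged over all problems in the evaluation window.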

§ 02 · SWE-bench Verified

Real GitHub issues, human-verified.

500 real GitHub issues from popular Python repos. Human-verified to ensure the issue description is clear and the fix is testable. Measures real-world software engineering — not toy problems.

| # | Model | Provider | % Resolved | Date |
|---|-------|----------|------------|------|
| 1 | Claude Opus 4.7 | Anthropic | 87.6% | Apr 2026 |
| 2 | Claude Opus 4.5 | Anthropic | 80.9% | Mar 2026 |
| 3 | Claude Opus 4.6 | Anthropic | 80.8% | Mar 2026 |
| 4 | Gemini 3.1 Pro | Google | 80.6% | Mar 2026 |
| 5 | MiniMax M2.5 | MiniMax | 80.2% | Mar 2026 |
| 6 | GPT-5.2 Thinking | OpenAI | 80% | Mar 2026 |
| 7 | Claude Sonnet 4.6 | Anthropic | 79.6% | Mar 2026 |
| 8 | Gemini 3 Flash | Google | 78% | Mar 2026 |
| 9 | Claude Sonnet 4.5 | Anthropic | 77.2% | Mar 2026 |
| 10 | Kimi K2.5 | Moonshot AI | 76.8% | Mar 2026 |
| 11 | GPT-5.1 | OpenAI | 76.3% | Mar 2026 |
| 12 | Gemini 3 Pro | Google | 76.2% | Mar 2026 |
| 13 | GPT-5 | OpenAI | 74.9% | Mar 2026 |
| 14 | MiniMax M2.1 | MiniMax | 74% | Mar 2026 |
| 15 | Claude Haiku 4.5 | Anthropic | 73.3% | Mar 2026 |
| 16 | Claude Sonnet 4 | Anthropic | 72.7% | Mar 2026 |
| 17 | Claude Opus 4 | Anthropic | 72.5% | Mar 2026 |
| 18 | Devstral 2 | Mistral | 72.2% | Mar 2026 |
| 19 | Qwen3-Coder 480B A35B | Alibaba Cloud | 69.6% | Mar 2026 |
| 20 | MiniMax M2 | MiniMax | 69.4% | Mar 2026 |
| 21 | o3 | OpenAI | 69.1% | Mar 2026 |
| 22 | o4-mini | OpenAI | 68.1% | Mar 2026 |
| 23 | DeepSeek-V3.1 | DeepSeek | 66% | Mar 2026 |
| 24 | Kimi-K2 | Moonshot AI | 65.8% | Mar 2026 |
| 25 | Grok 3 | xAI | 63.8% | Mar 2026 |
| 26 | Gemini 2.5 Pro | Google | 63.8% | Mar 2026 |
| 27 | Claude 3.7 Sonnet | Anthropic | 63.7% | Mar 2026 |
| 28 | Gemini 2.5 Flash | Google | 60.4% | Mar 2026 |
| 29 | DeepSeek-R1-0528 | DeepSeek | 57.6% | Mar 2026 |
| 30 | o3-mini | OpenAI | 55.8% | Mar 2026 |
| 31 | GPT-4.1 | OpenAI | 54.6% | Mar 2026 |
| 32 | Claude 3.5 Sonnet | Anthropic | 50.8% | Mar 2026 |
| 33 | DeepSeek R1 | DeepSeek | 49.2% | Mar 2026 |
| 34 | o1 | OpenAI | 48.9% | Mar 2026 |
| 35 | Devstral Small 2505 | Mistral | 46.8% | Mar 2026 |
| 36 | DeepSeek-V3 | DeepSeek | 42% | Mar 2026 |
| 37 | GPT-4o | OpenAI | 41.2% | Mar 2026 |
| 38 | Claude 3.5 Haiku | Anthropic | 40.6% | Mar 2026 |
| 39 | DeepSeek-V2.5 | DeepSeek | 37% | Mar 2026 |

Source: swebench.com · Verified subset, agent scaffolding allowed.
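
Scoring boils down to applying the model's patch at the issue's base commit and re-running the repository's tests: the designated failing tests must now pass, and previously passing tests must not regress. A rough sketch of that check, assuming a hypothetical `instance` dict with `patch`, `fail_to_pass`, and `pass_to_pass` fields; the real harness runs inside per-repo containers with pinned dependencies:

```python
import subprocess

def run(cmd: str, cwd: str) -> bool:
    """Run a shell command in the repo checkout; True on exit code 0."""
    return subprocess.run(cmd, shell=True, cwd=cwd).returncode == 0

def resolves_issue(repo_dir: str, instance: dict) -> bool:
    """Apply the model-generated patch and check both test groups.

    instance["patch"]: path to the unified diff the model produced (assumed)
    instance["fail_to_pass"]: tests that must flip from failing to passing
    instance["pass_to_pass"]: tests that must not regress
    """
    if not run(f"git apply {instance['patch']}", cwd=repo_dir):
        return False  # patch does not apply cleanly
    fail_to_pass_ok = all(
        run(f"python -m pytest -x {t}", cwd=repo_dir)
        for t in instance["fail_to_pass"]
    )
    pass_to_pass_ok = all(
        run(f"python -m pytest -x {t}", cwd=repo_dir)
        for t in instance["pass_to_pass"]
    )
    return fail_to_pass_ok and pass_to_pass_ok
```

"% Resolved" is then the fraction of the 500 verified issues for which this check succeeds.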

§ 03 · HumanEval+

HumanEval, 80× the test cases.

EvalPlus extends HumanEval with 80x more test inputs per problem, catching solutions that pass original tests but fail on edge cases. More rigorous than the base HumanEval benchmark.

| # | Model | Provider | Pass@1 | Date |
|---|-------|----------|--------|------|
| 1 | o3-mini (high) | OpenAI | 95.1% | Feb 2025 |
| 2 | Claude 3.7 Sonnet | Anthropic | 94.3% | Feb 2025 |
| 3 | Gemini 2.5 Pro | Google | 93.7% | Mar 2025 |
| 4 | DeepSeek-R1 | DeepSeek | 91.2% | Jan 2025 |
| 5 | Claude 3.5 Sonnet | Anthropic | 88.1% | Jun 2024 |
| 6 | GPT-4o | OpenAI | 87.4% | May 2024 |
| 7 | Gemini 1.5 Pro | Google | 80.3% | May 2024 |
| 8 | Llama 3 70B Instruct | Meta | 75.9% | Apr 2024 |

Source: evalplus/evalplus · 80x augmented test cases vs. original HumanEval.
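
The failure mode the extra tests target: a completion can pass a handful of sparse assertions while breaking on inputs they never exercise. A toy illustration loosely modeled on a HumanEval-style task; the buggy completion and the test inputs below are invented for demonstration:

```python
def below_zero(operations: list[int]) -> bool:
    """Return True if a running balance starting at 0 ever drops below zero."""
    return sum(operations) < 0  # buggy: only looks at the final balance

# Sparse, original-style tests: the buggy completion happens to pass.
print(below_zero([1, 2, 3]))    # False (correct)
print(below_zero([1, -2, -3]))  # True  (correct)

# Augmented-style input: the balance dips below zero mid-sequence and then
# recovers, so the buggy completion answers False when the truth is True.
print(below_zero([1, -2, 3]))   # False (wrong -- caught by the extra input)
```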

§ 04 · Methodology

Frequently asked.

Which coding benchmark should I use to compare LLMs?

LiveCodeBench is the best single benchmark for general coding ability — it avoids contamination and continuously updates. For software engineering tasks (debugging, refactoring real codebases), use SWE-bench Verified. HumanEval is saturated and no longer differentiates frontier models.

How does LiveCodeBench avoid data contamination?

Problems are scraped continuously from competitive programming platforms, and each model is evaluated only on problems released after its training cutoff. This ensures models cannot have seen the exact problems during training.
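
Mechanically this is a date filter over problem metadata; a hedged sketch with illustrative field names (`release_date` and the problem IDs below are not the benchmark's actual schema):

```python
from datetime import date

def eligible_problems(problems: list[dict], cutoff: date) -> list[dict]:
    """Keep only problems published after the model's training cutoff.

    Each problem dict is assumed to carry a 'release_date'; the real
    benchmark tracks release dates per platform scrape.
    """
    return [p for p in problems if p["release_date"] > cutoff]

# Example: a model trained through 2025-10-01 is only scored on
# contest problems that appeared after that date.
problems = [
    {"id": "lc-3501", "release_date": date(2025, 9, 20)},
    {"id": "cf-2034F", "release_date": date(2026, 1, 12)},
]
print(eligible_problems(problems, cutoff=date(2025, 10, 1)))
```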

Why do some models lead SWE-bench but not LiveCodeBench?

LiveCodeBench rewards algorithmic reasoning under tight constraints — reasoning models excel here. SWE-bench rewards understanding codebases, writing clean patches, and following project conventions — instruction-following models with longer context tend to have an edge.

§ 05 · Related

Continue reading.

HumanEval & MBPP (Coding · classic) · Classic Python coding benchmarks, historical scores
Agentic Benchmarks (Agentic) · SWE-bench, BinaryAudit, OTelBench
All LLM Benchmarks (Index) · MMLU-Pro, GPQA Diamond, HLE overview