Coding · updated April 2026

LLM Coding Benchmarks.

Leaderboards for LiveCodeBench (contest problems), SWE-bench Verified (real GitHub issues), and HumanEval+ (enhanced unit test coverage). Three benchmarks covering different coding abilities.

§ 01 · LiveCodeBench

Contest problems, contamination-controlled.

Continuously updated contest problems from LeetCode, Codeforces, and AtCoder scraped after model training cutoffs. Tests code generation, self-repair, and test-output prediction on truly unseen problems.

| # | Model | Provider | Pass@1 | Date |
|---|---|---|---|---|
| 1 | DeepSeek-V4-Pro Max | DeepSeek | 93.5% | Apr 2026 |
| 2 | Gemini 3 Pro Preview | Google | 91.7% | Apr 2026 |
| 3 | DeepSeek-V4-Flash Max | DeepSeek | 91.6% | Apr 2026 |
| 4 | Gemini 3 Flash | Google | 90.8% | Apr 2026 |
| 5 | Kimi K2.6 | | 89.6% | Apr 2026 |
| 6 | DeepSeek-V3.2-Speciale | DeepSeek | 88.7% | Dec 2025 |
| 7 | Kimi-K2.5 | Moonshot.AI | 85% | Feb 2026 |
| 8 | GPT-5 | OpenAI | 85% | Apr 2026 |
| 9 | Qwen3.6-27B | | 83.9% | Apr 2026 |
| 10 | Qwen3.5-397B-A17B | Alibaba | 83.6% | Feb 2026 |
| 11 | DeepSeek-V3.2 | DeepSeek | 83.3% | Dec 2025 |
| 12 | NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | | 81.19% | Dec 2025 |
| 13 | Qwen3.6-35B-A3B | | 80.4% | Apr 2026 |
| 14 | Gemma 4 31B | Google | 80% | Apr 2026 |
| 15 | Grok 4 | xAI | 79% | Apr 2026 |
| 16 | Gemini 2.5 Pro | Google | 75.6% | Apr 2026 |
| 17 | Intern-S1-Pro | Shanghai AI Lab | 74.3% | Mar 2026 |
| 18 | Gemini 2.5 Pro | | 74.2% | Jul 2025 |
| 19 | DeepSeek-R1-0528 | DeepSeek | 73.3% | May 2025 |
| 20 | GLM-4.5 | Zhipu AI | 72.9% | Aug 2025 |
| 21 | o4-mini | OpenAI | 72.8% | Mar 2026 |
| 22 | Qwen3-235B-A22B | Alibaba | 70.7% | May 2025 |
| 23 | GLM-4.5-Air | Zhipu AI | 70.7% | Aug 2025 |
| 24 | Qwen3-235B-A22B | Alibaba | 70.7% | May 2025 |
| 25 | Qwen3-VL-235B-A22B-Thinking | Qwen | 70.1% | Nov 2025 |
| 26 | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | | 68.3% | Dec 2025 |
| 27 | o3-mini | OpenAI | 66.9% | Mar 2026 |
| 28 | DeepSeek R1 | DeepSeek | 65.9% | Jan 2025 |
| 29 | o3 | OpenAI | 65.3% | Mar 2026 |
| 30 | DeepSeek-R1-Distill-Llama-70B | DeepSeek | 65.2% | Jan 2025 |
| 31 | Gemini 2.5 Flash | Google | 63.9% | Apr 2026 |
| 32 | Kimi k1.5 | Moonshot AI | 62.5% | Jan 2025 |
| 33 | DeepSeek-R1-Distill-Qwen-32B | DeepSeek | 62.1% | Jan 2025 |
| 34 | Gemini 2.5 Flash | | 59.3% | Jul 2025 |
| 35 | Qwen3-Coder-Next | Qwen | 58.93% | Feb 2026 |
| 36 | Claude Opus 4 | Anthropic | 57.8% | Mar 2026 |
| 37 | Qwen2.5-72B-Instruct | | 55.5% | Dec 2024 |
| 38 | GPT-4.1 | OpenAI | 54.4% | Mar 2026 |
| 39 | Qwen3-VL-235B-A22B-Instruct | Qwen | 54.3% | Nov 2025 |
| 40 | Claude Sonnet 4 | Anthropic | 52.8% | Mar 2026 |
| 41 | DeepSeek-v3-0324 | DeepSeek | 49.2% | Mar 2025 |
| 42 | DeepSeek-V3 | DeepSeek | 49.2% | Mar 2026 |
| 43 | GPT-4.1 mini | OpenAI | 48.3% | Apr 2026 |
| 44 | Qwen2.5-Coder 32B | Alibaba | 47.8% | Mar 2026 |
| 45 | DeepSeek-Coder-V2-Instruct | DeepSeek | 43.4% | Mar 2026 |
| 46 | Llama 4 Maverick | Meta | 43.4% | Apr 2025 |
| 47 | GPT-4o | OpenAI | 40.8% | Mar 2026 |
| 48 | Qwen3-VL-8B-Instruct | Qwen | 39.3% | Nov 2025 |
| 49 | Gemma-3-27b | Google | 39% | Mar 2025 |
| 50 | Llama-4-Scout | Meta | 32.8% | Apr 2025 |
| 51 | Gemma 3 12B IT | Google DeepMind | 32% | Mar 2025 |
| 52 | Gemma 3 (27B, IT) | | 29.7% | Mar 2025 |
| 53 | Codestral 22B | Mistral | 29.5% | Mar 2026 |
| 54 | Gemma 3 4B IT | Google DeepMind | 23% | Mar 2025 |

Source: livecodebench.github.io · Problems released after training cutoffs.
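
For readers unfamiliar with the Pass@1 column, the sketch below shows one common way such a score is computed: the unbiased pass@k estimator, averaged over problems. It is an illustration only. The function names and the sample data are hypothetical, and the leaderboard's exact sampling settings (samples per problem, temperature) are not stated on this page.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: dict[str, tuple[int, int]]) -> float:
    """Average pass@1 over problems.

    results maps a (hypothetical) problem id to (samples, passing samples).
    """
    scores = [pass_at_k(n, c, 1) for n, c in results.values()]
    return sum(scores) / len(scores)


# Hypothetical per-problem results, not taken from any leaderboard.
example = {"p1": (10, 10), "p2": (10, 5), "p3": (10, 0)}
score = benchmark_pass_at_1(example)
```

With k = 1 the estimator reduces to the fraction of passing samples per problem, so the benchmark score is simply the mean of those fractions.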

§ 02 · SWE-bench Verified

Real GitHub issues, human-verified.

500 real GitHub issues from popular Python repos. Human-verified to ensure the issue description is clear and the fix is testable. Measures real-world software engineering — not toy problems.

| # | Model | Provider | % Resolved | Date |
|---|---|---|---|---|
| 1 | Claude Opus 4.7 | Anthropic | 87.6% | Apr 2026 |
| 2 | Claude Opus 4.5 | Anthropic | 80.9% | Mar 2026 |
| 3 | Claude Opus 4.6 | Anthropic | 80.8% | Mar 2026 |
| 4 | DeepSeek-V4-Pro Max | DeepSeek | 80.6% | Apr 2026 |
| 5 | Gemini 3.1 Pro | Google | 80.6% | Mar 2026 |
| 6 | Kimi K2.6 | | 80.2% | Apr 2026 |
| 7 | MiniMax-M2.5 | MiniMaxAI | 80.2% | Feb 2026 |
| 8 | MiniMax M2.5 | MiniMax | 80.2% | Mar 2026 |
| 9 | GPT-5.2 Thinking | OpenAI | 80% | Mar 2026 |
| 10 | Claude Sonnet 4.6 | Anthropic | 79.6% | Mar 2026 |
| 11 | DeepSeek-V4-Flash Max | DeepSeek | 79% | Apr 2026 |
| 12 | MiMo-V2.5-Pro | | 78.9% | Apr 2026 |
| 13 | Gemini 3 Flash | Google | 78% | Mar 2026 |
| 14 | GLM-5 | Zhipu AI | 77.8% | Feb 2026 |
| 15 | Qwen3.6-27B | | 77.2% | Apr 2026 |
| 16 | Claude Sonnet 4.5 | Anthropic | 77.2% | Mar 2026 |
| 17 | Kimi K2.5 | Moonshot AI | 76.8% | Mar 2026 |
| 18 | Kimi-K2.5 | Moonshot.AI | 76.8% | Feb 2026 |
| 19 | Qwen3.5-397B-A17B | Alibaba | 76.4% | Feb 2026 |
| 20 | GPT-5.1 | OpenAI | 76.3% | Mar 2026 |
| 21 | Gemini 3 Pro | Google | 76.2% | Mar 2026 |
| 22 | GPT-5 | OpenAI | 74.9% | Mar 2026 |
| 23 | Step-3.5-Flash | | 74.4% | Feb 2026 |
| 24 | MiniMax M2.1 | MiniMax | 74% | Mar 2026 |
| 25 | Qwen3.6-35B-A3B | | 73.4% | Apr 2026 |
| 26 | Claude Haiku 4.5 | Anthropic | 73.3% | Mar 2026 |
| 27 | DeepSeek-V3.2 | DeepSeek | 73.1% | Dec 2025 |
| 28 | Claude Sonnet 4 | Anthropic | 72.7% | Mar 2026 |
| 29 | Claude Opus 4 | Anthropic | 72.5% | Mar 2026 |
| 30 | Qwen3.5-27B | Alibaba | 72.4% | Feb 2026 |
| 31 | Ling-2.6-1T | | 72.2% | Apr 2026 |
| 32 | Devstral 2 | Mistral | 72.2% | Mar 2026 |
| 33 | Qwen3.5-122B-A10B | Alibaba | 72% | Feb 2026 |
| 34 | Qwen3-Coder-Next | Qwen | 70.6% | Feb 2026 |
| 35 | Qwen3-Coder 480B A35B | Alibaba Cloud | 69.6% | Mar 2026 |
| 36 | MiniMax M2 | MiniMax | 69.4% | Mar 2026 |
| 37 | Qwen3.5-35B-A3B | Alibaba | 69.2% | Feb 2026 |
| 38 | o3 | OpenAI | 69.1% | Mar 2026 |
| 39 | o4-mini | OpenAI | 68.1% | Mar 2026 |
| 40 | DeepSeek-V3.1 | DeepSeek | 66% | Mar 2026 |
| 41 | Kimi-K2 | Moonshot.AI | 65.8% | Mar 2026 |
| 42 | GLM-4.5 | Zhipu AI | 64.2% | Aug 2025 |
| 43 | Grok 3 | xAI | 63.8% | Mar 2026 |
| 44 | Gemini 2.5 Pro | Google | 63.8% | Mar 2026 |
| 45 | Claude 3.7 Sonnet | Anthropic | 63.7% | Mar 2026 |
| 46 | NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | | 60.47% | Dec 2025 |
| 47 | Gemini 2.5 Flash | Google | 60.4% | Mar 2026 |
| 48 | Gemini 2.5 Pro | | 59.6% | Jul 2025 |
| 49 | GLM-4.5-Air | Zhipu AI | 57.6% | Aug 2025 |
| 50 | DeepSeek-R1-0528 | DeepSeek | 57.6% | Mar 2026 |
| 51 | o3-mini | OpenAI | 55.8% | Mar 2026 |
| 52 | GPT-4.1 | OpenAI | 54.6% | Mar 2026 |
| 53 | Claude 3.5 Sonnet | Anthropic | 50.8% | Mar 2026 |
| 54 | DeepSeek R1 | DeepSeek | 49.2% | Mar 2026 |
| 55 | o1 | OpenAI | 48.9% | Mar 2026 |
| 56 | Gemini 2.5 Flash | | 48.9% | Jul 2025 |
| 57 | Devstral Small 2505 | Mistral | 46.8% | Mar 2026 |
| 58 | DeepSeek-V3 | DeepSeek | 42% | Mar 2026 |
| 59 | GPT-4o | OpenAI | 41.2% | Mar 2026 |
| 60 | Claude 3.5 Haiku | Anthropic | 40.6% | Mar 2026 |
| 61 | DeepSeek-V2.5 | DeepSeek | 37% | Mar 2026 |
61DeepSeek-V2.5DeepSeek37%Mar 2026

Source: swebench.com · Verified subset, agent scaffolding allowed.

§ 03 · HumanEval+

HumanEval, 80× the test cases.

EvalPlus extends HumanEval with 80x more test inputs per problem, catching solutions that pass original tests but fail on edge cases. More rigorous than the base HumanEval benchmark.

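| # | Model | Provider | Pass@1 | Date |
|---|---|---|---|---|
| 1 | Llama 3 (405B, Instruct) | Meta | 89% | Jul 2024 |
| 2 | Qwen2.5-Plus | | 87.8% | Dec 2024 |
| 3 | Qwen2.5-VL-72B | | 87.8% | Feb 2025 |
| 4 | MiniCPM-o 4.5-Instruct | | 86.6% | Apr 2026 |
| 5 | Step-3.5-Flash Base | | 81.1% | Feb 2026 |
| 6 | Aria | | 73.2% | Oct 2024 |
| 7 | Code Llama - Instruct 70B | | 67.8% | Aug 2023 |
| 8 | BLT-Entropy 8B | | 35.4% | Dec 2024 |
| 9 | Llama 2 70B (5-shot) | | 29.9% | Jul 2023 |
| 10 | LLaMA-65B | | 23.7% | Feb 2023 |
| 11 | SmoLM2 (1.7B) | | 22.6% | Feb 2025 |
| 12 | BLOOM-176B | | 15.52% | Nov 2022 |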

Source: evalplus/evalplus · 80x augmented test cases vs. original HumanEval.
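
To make the "passes original tests but fails on edge cases" point concrete, here is a toy sketch. The task, the buggy candidate, and both input sets are hypothetical and are not taken from HumanEval or EvalPlus; the sketch only illustrates how a larger input set can expose a bug that a small one misses.

```python
def median_buggy(xs: list[float]) -> float:
    """Hypothetical candidate solution: wrong for even-length lists."""
    s = sorted(xs)
    return s[len(s) // 2]


def median_reference(xs: list[float]) -> float:
    """Reference solution used as the oracle."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2


def passes(candidate, reference, inputs) -> bool:
    """True if the candidate matches the reference on every input."""
    return all(candidate(x) == reference(x) for x in inputs)


# A small "original" test set that happens to use only odd-length lists.
base_inputs = [[3, 1, 2], [5]]
# An augmented set that adds even-length lists (the edge case).
extra_inputs = base_inputs + [[1, 2, 3, 4], [1, 2], [7, 1, 5]]

passes_base = passes(median_buggy, median_reference, base_inputs)
passes_extra = passes(median_buggy, median_reference, extra_inputs)
```

Here `passes_base` is true and `passes_extra` is false: the candidate would be counted as correct under the small test set and incorrect under the augmented one.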

§ 04 · Methodology

Frequently asked.

Which coding benchmark should I use to compare LLMs?

LiveCodeBench is the best single benchmark for general coding ability — it avoids contamination and continuously updates. For software engineering tasks (debugging, refactoring real codebases), use SWE-bench Verified. HumanEval is saturated and no longer differentiates frontier models.

How does LiveCodeBench avoid data contamination?

Problems are collected from competitive programming platforms along with their release dates, and each model is scored only on problems published after its training cutoff. This ensures the model cannot have seen those exact problems during training.
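
The date-based filtering described above can be sketched as follows. This is a minimal illustration, not LiveCodeBench's actual code: the `Problem` class, the `uncontaminated` helper, the problem ids, and all dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str   # hypothetical identifier
    released: date    # date the problem was first published


def uncontaminated(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Hypothetical problem pool and cutoff, for illustration only.
pool = [
    Problem("contest-a", date(2024, 11, 3)),
    Problem("contest-b", date(2025, 1, 1)),
    Problem("contest-c", date(2025, 2, 14)),
]
eval_set = uncontaminated(pool, training_cutoff=date(2025, 1, 1))
eval_ids = [p.problem_id for p in eval_set]
```

A strict comparison is used so that a problem released on the cutoff date itself is excluded. Because each model has its own cutoff, the resulting evaluation set can differ from model to model.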

Why do some models lead SWE-bench but not LiveCodeBench?

LiveCodeBench rewards algorithmic reasoning under tight constraints — reasoning models excel here. SWE-bench rewards understanding codebases, writing clean patches, and following project conventions — instruction-following models with longer context tend to have an edge.

§ 05 · Related

Continue reading.

HumanEval & MBPP (Coding · classic): Classic Python coding benchmarks, historical scores
Agentic Benchmarks (Agentic): SWE-bench, BinaryAudit, OTelBench
All LLM Benchmarks (Index): MMLU-Pro, GPQA Diamond, HLE overview