Leaderboards for three benchmarks that cover different coding abilities: LiveCodeBench (contest problems), SWE-bench Verified (real GitHub issues), and HumanEval+ (enhanced unit test coverage).
LiveCodeBench collects continuously updated contest problems from LeetCode, Codeforces, and AtCoder, and evaluates each model on problems released after its training cutoff. It tests code generation, self-repair, and test-output prediction on problems the model should not have seen.
| # | Model | Provider | Pass@1 | Date |
|---|---|---|---|---|
| ★ | DeepSeek-V4-Pro Max | DeepSeek | 93.5% | Apr 2026 |
| 2 | Gemini 3 Pro Preview | | 91.7% | Apr 2026 |
| 3 | DeepSeek-V4-Flash Max | DeepSeek | 91.6% | Apr 2026 |
| 4 | Gemini 3 Flash | | 90.8% | Apr 2026 |
| 5 | Kimi K2.6 | | 89.6% | Apr 2026 |
| 6 | DeepSeek-V3.2-Speciale | DeepSeek | 88.7% | Dec 2025 |
| 7 | Kimi-K2.5 | Moonshot.AI | 85% | Feb 2026 |
| 8 | GPT-5 | OpenAI | 85% | Apr 2026 |
| 9 | Qwen3.6-27B | | 83.9% | Apr 2026 |
| 10 | Qwen3.5-397B-A17B | Alibaba | 83.6% | Feb 2026 |
| 11 | DeepSeek-V3.2 | DeepSeek | 83.3% | Dec 2025 |
| 12 | NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | | 81.19% | Dec 2025 |
| 13 | Qwen3.6-35B-A3B | | 80.4% | Apr 2026 |
| 14 | Gemma 4 31B | | 80% | Apr 2026 |
| 15 | Grok 4 | xAI | 79% | Apr 2026 |
| 16 | Gemini 2.5 Pro | | 75.6% | Apr 2026 |
| 17 | Intern-S1-Pro | Shanghai AI Lab | 74.3% | Mar 2026 |
| 18 | Gemini 2.5 Pro | | 74.2% | Jul 2025 |
| 19 | DeepSeek-R1-0528 | DeepSeek | 73.3% | May 2025 |
| 20 | GLM-4.5 | Zhipu AI | 72.9% | Aug 2025 |
| 21 | o4-mini | OpenAI | 72.8% | Mar 2026 |
| 22 | Qwen3-235B-A22B | Alibaba | 70.7% | May 2025 |
| 23 | GLM-4.5-Air | Zhipu AI | 70.7% | Aug 2025 |
| 24 | Qwen3-235B-A22B | Alibaba | 70.7% | May 2025 |
| 25 | Qwen3-VL-235B-A22B-Thinking | Qwen | 70.1% | Nov 2025 |
| 26 | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | | 68.3% | Dec 2025 |
| 27 | o3-mini | OpenAI | 66.9% | Mar 2026 |
| 28 | DeepSeek R1 | DeepSeek | 65.9% | Jan 2025 |
| 29 | o3 | OpenAI | 65.3% | Mar 2026 |
| 30 | DeepSeek-R1-Distill-Llama-70B | DeepSeek | 65.2% | Jan 2025 |
| 31 | Gemini 2.5 Flash | | 63.9% | Apr 2026 |
| 32 | Kimi k1.5 | Moonshot AI | 62.5% | Jan 2025 |
| 33 | DeepSeek-R1-Distill-Qwen-32B | DeepSeek | 62.1% | Jan 2025 |
| 34 | Gemini 2.5 Flash | | 59.3% | Jul 2025 |
| 35 | Qwen3-Coder-Next | Qwen | 58.93% | Feb 2026 |
| 36 | Claude Opus 4 | Anthropic | 57.8% | Mar 2026 |
| 37 | Qwen2.5-72B-Instruct | | 55.5% | Dec 2024 |
| 38 | GPT-4.1 | OpenAI | 54.4% | Mar 2026 |
| 39 | Qwen3-VL-235B-A22B-Instruct | Qwen | 54.3% | Nov 2025 |
| 40 | Claude Sonnet 4 | Anthropic | 52.8% | Mar 2026 |
| 41 | DeepSeek-v3-0324 | DeepSeek | 49.2% | Mar 2025 |
| 42 | DeepSeek-V3 | DeepSeek | 49.2% | Mar 2026 |
| 43 | GPT-4.1 mini | OpenAI | 48.3% | Apr 2026 |
| 44 | Qwen2.5-Coder 32B | Alibaba | 47.8% | Mar 2026 |
| 45 | DeepSeek-Coder-V2-Instruct | DeepSeek | 43.4% | Mar 2026 |
| 46 | Llama 4 Maverick | Meta | 43.4% | Apr 2025 |
| 47 | GPT-4o | OpenAI | 40.8% | Mar 2026 |
| 48 | Qwen3-VL-8B-Instruct | Qwen | 39.3% | Nov 2025 |
| 49 | Gemma-3-27b | | 39% | Mar 2025 |
| 50 | Llama-4-Scout | Meta | 32.8% | Apr 2025 |
| 51 | Gemma 3 12B IT | Google DeepMind | 32% | Mar 2025 |
| 52 | Gemma 3 (27B, IT) | | 29.7% | Mar 2025 |
| 53 | Codestral 22B | Mistral | 29.5% | Mar 2026 |
| 54 | Gemma 3 4B IT | Google DeepMind | 23% | Mar 2025 |
Source: livecodebench.github.io · Problems released after training cutoffs.
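The Pass@1 column can be read through the standard unbiased pass@k estimator: with n sampled solutions per problem, of which c pass all tests, pass@k = 1 - C(n-c, k) / C(n, k), which reduces to c/n for k = 1. The sketch below is illustrative only; the function names are hypothetical, and the number of samples the leaderboard actually draws per problem is not stated here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled solutions, c: number that pass all tests.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Average pass@1 over problems; each entry is (n_samples, n_correct)."""
    if not results:
        raise ValueError("no results to aggregate")
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


# Hypothetical per-problem results: (samples drawn, samples correct).
print(benchmark_pass_at_1([(10, 10), (10, 5), (10, 0)]))
```

Averaging per-problem estimates, rather than pooling all samples, keeps every problem equally weighted even if sample counts differ.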
SWE-bench Verified: 500 real GitHub issues from popular Python repositories, human-verified so that each issue description is clear and the fix is testable. It measures real-world software engineering rather than toy problems.
| # | Model | Provider | % Resolved | Date |
|---|---|---|---|---|
| ★ | Claude Opus 4.7 | Anthropic | 87.6% | Apr 2026 |
| 2 | Claude Opus 4.5 | Anthropic | 80.9% | Mar 2026 |
| 3 | Claude Opus 4.6 | Anthropic | 80.8% | Mar 2026 |
| 4 | DeepSeek-V4-Pro Max | DeepSeek | 80.6% | Apr 2026 |
| 5 | Gemini 3.1 Pro | | 80.6% | Mar 2026 |
| 6 | Kimi K2.6 | | 80.2% | Apr 2026 |
| 7 | MiniMax-M2.5 | MiniMaxAI | 80.2% | Feb 2026 |
| 8 | MiniMax M2.5 | MiniMax | 80.2% | Mar 2026 |
| 9 | GPT-5.2 Thinking | OpenAI | 80% | Mar 2026 |
| 10 | Claude Sonnet 4.6 | Anthropic | 79.6% | Mar 2026 |
| 11 | DeepSeek-V4-Flash Max | DeepSeek | 79% | Apr 2026 |
| 12 | MiMo-V2.5-Pro | | 78.9% | Apr 2026 |
| 13 | Gemini 3 Flash | | 78% | Mar 2026 |
| 14 | GLM-5 | Zhipu AI | 77.8% | Feb 2026 |
| 15 | Qwen3.6-27B | | 77.2% | Apr 2026 |
| 16 | Claude Sonnet 4.5 | Anthropic | 77.2% | Mar 2026 |
| 17 | Kimi K2.5 | Moonshot AI | 76.8% | Mar 2026 |
| 18 | Kimi-K2.5 | Moonshot.AI | 76.8% | Feb 2026 |
| 19 | Qwen3.5-397B-A17B | Alibaba | 76.4% | Feb 2026 |
| 20 | GPT-5.1 | OpenAI | 76.3% | Mar 2026 |
| 21 | Gemini 3 Pro | | 76.2% | Mar 2026 |
| 22 | GPT-5 | OpenAI | 74.9% | Mar 2026 |
| 23 | Step-3.5-Flash | | 74.4% | Feb 2026 |
| 24 | MiniMax M2.1 | MiniMax | 74% | Mar 2026 |
| 25 | Qwen3.6-35B-A3B | | 73.4% | Apr 2026 |
| 26 | Claude Haiku 4.5 | Anthropic | 73.3% | Mar 2026 |
| 27 | DeepSeek-V3.2 | DeepSeek | 73.1% | Dec 2025 |
| 28 | Claude Sonnet 4 | Anthropic | 72.7% | Mar 2026 |
| 29 | Claude Opus 4 | Anthropic | 72.5% | Mar 2026 |
| 30 | Qwen3.5-27B | Alibaba | 72.4% | Feb 2026 |
| 31 | Ling-2.6-1T | | 72.2% | Apr 2026 |
| 32 | Devstral 2 | Mistral | 72.2% | Mar 2026 |
| 33 | Qwen3.5-122B-A10B | Alibaba | 72% | Feb 2026 |
| 34 | Qwen3-Coder-Next | Qwen | 70.6% | Feb 2026 |
| 35 | Qwen3-Coder 480B A35B | Alibaba Cloud | 69.6% | Mar 2026 |
| 36 | MiniMax M2 | MiniMax | 69.4% | Mar 2026 |
| 37 | Qwen3.5-35B-A3B | Alibaba | 69.2% | Feb 2026 |
| 38 | o3 | OpenAI | 69.1% | Mar 2026 |
| 39 | o4-mini | OpenAI | 68.1% | Mar 2026 |
| 40 | DeepSeek-V3.1 | DeepSeek | 66% | Mar 2026 |
| 41 | Kimi-K2 | Moonshot.AI | 65.8% | Mar 2026 |
| 42 | GLM-4.5 | Zhipu AI | 64.2% | Aug 2025 |
| 43 | Grok 3 | xAI | 63.8% | Mar 2026 |
| 44 | Gemini 2.5 Pro | | 63.8% | Mar 2026 |
| 45 | Claude 3.7 Sonnet | Anthropic | 63.7% | Mar 2026 |
| 46 | NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | | 60.47% | Dec 2025 |
| 47 | Gemini 2.5 Flash | | 60.4% | Mar 2026 |
| 48 | Gemini 2.5 Pro | | 59.6% | Jul 2025 |
| 49 | GLM-4.5-Air | Zhipu AI | 57.6% | Aug 2025 |
| 50 | DeepSeek-R1-0528 | DeepSeek | 57.6% | Mar 2026 |
| 51 | o3-mini | OpenAI | 55.8% | Mar 2026 |
| 52 | GPT-4.1 | OpenAI | 54.6% | Mar 2026 |
| 53 | Claude 3.5 Sonnet | Anthropic | 50.8% | Mar 2026 |
| 54 | DeepSeek R1 | DeepSeek | 49.2% | Mar 2026 |
| 55 | o1 | OpenAI | 48.9% | Mar 2026 |
| 56 | Gemini 2.5 Flash | | 48.9% | Jul 2025 |
| 57 | Devstral Small 2505 | Mistral | 46.8% | Mar 2026 |
| 58 | DeepSeek-V3 | DeepSeek | 42% | Mar 2026 |
| 59 | GPT-4o | OpenAI | 41.2% | Mar 2026 |
| 60 | Claude 3.5 Haiku | Anthropic | 40.6% | Mar 2026 |
| 61 | DeepSeek-V2.5 | DeepSeek | 37% | Mar 2026 |
Source: swebench.com · Verified subset, agent scaffolding allowed.
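As a rough illustration of how a "% Resolved" figure can be computed: SWE-bench counts an issue as resolved when the tests that the fix should repair now pass and the tests that passed before still pass. The record type and field names below are hypothetical, and the sketch assumes one patch attempt per issue.

```python
from dataclasses import dataclass


@dataclass
class IssueResult:
    """Hypothetical per-issue record; field names are illustrative."""
    issue_id: str
    fail_to_pass: list[bool]  # outcomes of tests the fix should make pass
    pass_to_pass: list[bool]  # outcomes of tests that must keep passing


def is_resolved(result: IssueResult) -> bool:
    """Resolved only if the targeted tests pass and nothing regresses."""
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass)
        and all(result.pass_to_pass)
    )


def percent_resolved(results: list[IssueResult]) -> float:
    """Share of issues resolved, as a percentage."""
    if not results:
        raise ValueError("no results to aggregate")
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)


# Made-up outcomes for three issues.
runs = [
    IssueResult("issue-1", [True, True], [True]),
    IssueResult("issue-2", [True, False], [True]),   # fix incomplete
    IssueResult("issue-3", [True], [True, False]),   # fix causes a regression
]
print(f"{percent_resolved(runs):.1f}%")
```

Under that single-run assumption, each of the 500 issues is worth 0.2 percentage points, so small gaps between adjacent rows can come down to one or two issues.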
HumanEval+ comes from EvalPlus, which extends HumanEval with 80x more test inputs per problem to catch solutions that pass the original tests but fail on edge cases. This makes it more rigorous than the base HumanEval benchmark.
| # | Model | Provider | Pass@1 | Date |
|---|---|---|---|---|
| ★ | Llama 3 (405B, Instruct) | Meta | 89% | Jul 2024 |
| 2 | Qwen2.5-Plus | | 87.8% | Dec 2024 |
| 3 | Qwen2.5-VL-72B | | 87.8% | Feb 2025 |
| 4 | MiniCPM-o 4.5-Instruct | | 86.6% | Apr 2026 |
| 5 | Step-3.5-Flash Base | | 81.1% | Feb 2026 |
| 6 | Aria | | 73.2% | Oct 2024 |
| 7 | Code Llama - Instruct 70B | | 67.8% | Aug 2023 |
| 8 | BLT-Entropy 8B | | 35.4% | Dec 2024 |
| 9 | Llama 2 70B (5-shot) | | 29.9% | Jul 2023 |
| 10 | LLaMA-65B | | 23.7% | Feb 2023 |
| 11 | SmoLM2 (1.7B) | | 22.6% | Feb 2025 |
| 12 | BLOOM-176B | | 15.52% | Nov 2022 |
Source: evalplus/evalplus · 80x augmented test cases vs. original HumanEval.
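A toy example of why extra test inputs matter: the candidate below passes a small base test set but fails once edge-case inputs are added. The function and both test lists are invented for illustration and are not taken from HumanEval or EvalPlus.

```python
def buggy_max(xs: list[int]) -> int:
    """Candidate solution with an edge-case bug: starts from 0,
    so it is wrong whenever every element is negative."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best


def passes(solution, tests) -> bool:
    """True if the solution returns the expected value on every test."""
    return all(solution(inp) == expected for inp, expected in tests)


# Small base suite, in the spirit of the original handful of tests.
base_tests = [([1, 2, 3], 3), ([5, 1], 5)]
# Hand-written extra inputs standing in for an augmented suite.
extra_tests = [([-3, -1], -1), ([-7], -7), ([0, -2], 0)]

print(passes(buggy_max, base_tests))                # True
print(passes(buggy_max, base_tests + extra_tests))  # False
```

The same solution counts as correct under the base suite and incorrect under the augmented one, which is the gap the stricter benchmark is meant to expose.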
LiveCodeBench is the best single benchmark for general coding ability because it avoids contamination and is continuously updated. For software engineering tasks (debugging, refactoring real codebases), use SWE-bench Verified. HumanEval is saturated and no longer differentiates frontier models.
Problems are collected from competitive programming platforms along with their release dates, and each model is evaluated only on problems published after its training cutoff. The model therefore cannot have seen those exact problems during training.
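The date-based filtering described above can be sketched as follows. This is a minimal illustration with made-up problem records and a made-up cutoff date; the `Problem` type and `unseen_problems` helper are hypothetical and do not reflect LiveCodeBench's actual code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    """Hypothetical problem record with its public release date."""
    problem_id: str
    platform: str
    released: date


def unseen_problems(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Made-up records and cutoff, for illustration only.
problems = [
    Problem("lc-a", "LeetCode", date(2025, 1, 10)),
    Problem("cf-b", "Codeforces", date(2025, 6, 2)),
    Problem("ac-c", "AtCoder", date(2025, 9, 21)),
]
cutoff = date(2025, 6, 2)
print([p.problem_id for p in unseen_problems(problems, cutoff)])  # ['ac-c']
```

Because the cutoff differs per model, two models may be scored on different problem subsets, which is worth keeping in mind when comparing rows with different dates.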
LiveCodeBench rewards algorithmic reasoning under tight constraints, which is where reasoning models excel. SWE-bench rewards understanding codebases, writing clean patches, and following project conventions, so instruction-following models with longer context tend to have an edge.