AI That Writes Software
From single functions (HumanEval) to resolving GitHub issues (SWE-bench), code generation is the most practically impactful frontier of LLM capability.
Published Mar 28, 2026
Code Benchmark Stats
From Snippets to Agents
Pass@1 (Function Level)
The model gets one try to write a single function (e.g., "sort this list"). If it passes unit tests, it wins. This is what HumanEval measures.
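Pass@1 is the k=1 case of pass@k. As a reference point, this is a minimal sketch of the unbiased pass@k estimator introduced with HumanEval (n samples generated per problem, c of which pass all tests):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k samples, drawn without replacement from
    n generations of which c are correct, passes the tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 generations and 3 passing, pass@1 reduces to c/n:
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
```

For k=1 the estimator collapses to the simple pass rate c/n, which is why single-sample leaderboard scores are just "fraction of problems solved on the first try."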
Repo-Level Resolution
The model is given a real GitHub issue (bug report) and must navigate multiple files, reproduce the bug, and write a patch. This is SWE-bench.
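The grading loop behind repo-level benchmarks can be sketched in a few lines: apply the model's patch, then run the issue's tests. This is a simplified stand-in, not SWE-bench's actual harness (which runs containerized per-repo environments and distinguishes fail-to-pass from pass-to-pass tests); `evaluate_patch` and its arguments are illustrative names:

```python
import subprocess
from pathlib import Path

def evaluate_patch(repo_dir: str, patch_file: str, test_cmd: list) -> bool:
    """Apply a model-generated unified diff to a checked-out repo,
    then run the issue's tests. True means the issue is 'resolved'."""
    repo = Path(repo_dir)
    # A malformed or non-applying diff counts as a failure.
    apply = subprocess.run(
        ["git", "apply", patch_file], cwd=repo, capture_output=True
    )
    if apply.returncode != 0:
        return False
    # Run the tests associated with the issue (e.g. a pytest invocation).
    result = subprocess.run(test_cmd, cwd=repo, capture_output=True)
    return result.returncode == 0
```

The key difference from function-level evaluation is that the unit of grading is the whole repository state after the patch, not a single function's return value.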
```python
def solve_problem(input_list):
    """
    Sorts list and removes duplicates.

    >>> solve_problem([3, 1, 2, 1])
    [1, 2, 3]
    """
    # Model-generated code:
    return sorted(list(set(input_list)))
```

Coding Proficiency Leaderboard
Comparing top models on standard function synthesis (HumanEval) and real-world engineering (SWE-bench Verified).
SWE-bench Verified
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.5 | 80.90 |
| 2 | Claude Opus 4.6 | 80.80 |
| 3 | Gemini 3.1 Pro | 80.60 |
| 4 | MiniMax M2.5 | 80.20 |
| 5 | GPT-5.2 Thinking | 80.00 |
| 6 | Claude Sonnet 4.6 | 79.60 |
| 7 | Gemini 3 Flash | 78.00 |
| 8 | Claude Sonnet 4.5 | 77.20 |
| 9 | Kimi K2.5 | 76.80 |
| 10 | GPT-5.1 | 76.30 |
HumanEval
| # | Model | Score |
|---|---|---|
| 1 | o4-mini | 97.30 |
| 2 | o3-mini | 96.30 |
| 3 | GPT-4.1 | 94.50 |
| 4 | GPT-4.1 mini | 93.80 |
| 5 | Qwen2.5-Coder-32B-Instruct | 92.70 |
| 6 | o1-preview | 92.40 |
| 7 | o1-mini | 92.40 |
| 8 | Claude-Opus-4 | 92.20 |
| 9 | Claude 3.5 Sonnet | 92.00 |
| 10 | GPT-4o | 91.00 |
*Scores may vary by prompt strategy (e.g., zero-shot vs. few-shot). SWE-bench Verified scores also depend on the agent scaffolding used.
The Benchmarks
APPS
2021: 10,000 coding problems from Codewars, AtCoder, Kattis, and CodeForces. Ranges from introductory to competition level.
CodeContests
2022: 13,610 competitive programming problems from CodeForces, with ~200 private test cases per problem and solutions in 12+ programming languages.
HumanEval
2021: 164 hand-crafted Python programming problems with function signatures, docstrings, and unit tests. The standard benchmark for code generation.
HumanEval+
2023: Extends HumanEval with 80x more test cases, testing code robustness and edge-case handling.
LiveCodeBench
2024: Contamination-free coding benchmark that collects new problems from LeetCode, AtCoder, and CodeForces after model knowledge cutoffs, updated continuously with fresh problems. Primary metric is pass@1 on the full test set.
MBPP
2021: 974 crowd-sourced Python programming problems suitable for beginners, covering programming fundamentals and the standard library.
MBPP+
2023: Extends MBPP with additional test cases, using 399 hand-verified problems from MBPP-sanitized.
SWE-bench
2023: 2,294 real GitHub issues from popular Python repositories. Tests the ability to resolve real-world software engineering tasks.
SWE-bench Verified
2024: 500 manually verified GitHub issues confirmed solvable by human engineers; a high-quality subset of SWE-bench.