Key AI Capability

AI That Writes Software

From single functions (HumanEval) to resolving GitHub issues (SWE-bench), code generation is the most practically impactful frontier of LLM capability.

Published Mar 28, 2026

Code Benchmark Stats

SWE-bench Verified
SOTA: Claude Opus 4.5 (80.90 resolve-rate)
HumanEval
SOTA: o4-mini (97.30 pass@1)
LiveCodeBench
SOTA: DeepSeek-R1-0528 (73.30 pass@1)

From Snippets to Agents

1

Pass@1 (Function Level)

The model gets one try to write a single function (e.g., "sort this list"). If it passes unit tests, it wins. This is what HumanEval measures.
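Pass@1 is commonly computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k random draws succeeds. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated, c passed, budget k."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    # 1 minus the probability that a random size-k subset has no passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 3 correct: pass@1 is simply the empirical pass fraction
print(round(pass_at_k(10, 3, 1), 2))  # → 0.3
```

For k=1 this reduces to c/n, the fraction of samples that pass, which is why single-sample benchmarks often just report accuracy.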

2

Repo-Level Resolution

The model is given a real GitHub issue (bug report) and must navigate multiple files, reproduce the bug, and write a patch. This is SWE-bench.
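The resolution check behind SWE-bench can be illustrated in miniature: confirm the failing behavior reproduces before the patch, then confirm the patched code passes the issue's test. In this toy sketch the "repository" is a single module and all names are illustrative, not SWE-bench's actual harness:

```python
# Toy version of the repo-level resolution check. The "repository" is one
# module, the "issue" is a failing behavior, and the model's patch is a new
# version of the module source. All names here are illustrative.

BUGGY_SRC = "def slugify(s):\n    return s.lower()\n"
PATCHED_SRC = "def slugify(s):\n    return s.lower().replace(' ', '-')\n"

def run_issue_test(module_src: str) -> bool:
    """Load the module source and run the issue's reproduction test."""
    ns = {}
    exec(module_src, ns)
    return ns["slugify"]("Hello World") == "hello-world"

assert not run_issue_test(BUGGY_SRC)  # the issue reproduces before the patch
assert run_issue_test(PATCHED_SRC)    # the patch resolves the issue
print("RESOLVED")
```

The real benchmark does this at repository scale: check out the base commit, apply the model's diff, and run the project's own test suite.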

solution.py

    def solve_problem(input_list):
        """
        Sorts list and removes duplicates.
        >>> solve_problem([3, 1, 2, 1])
        [1, 2, 3]
        """
        # Model generated code:
        return sorted(list(set(input_list)))

Test Passed | Execution time: 0.02s
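A card like the one above maps directly onto a runnable script: the `>>>` examples in the docstring are exactly what Python's standard doctest module executes, which is one way such embedded checks produce a pass/fail verdict:

```python
import doctest

def solve_problem(input_list):
    """
    Sorts list and removes duplicates.
    >>> solve_problem([3, 1, 2, 1])
    [1, 2, 3]
    """
    return sorted(set(input_list))

# Execute the >>> examples embedded in the docstring above.
results = doctest.testmod()
print("Test Passed" if results.failed == 0 else "Test Failed")
```

Note that `sorted()` already returns a list, so the explicit `list()` call in the rendered snippet is redundant; deduplication via `set()` assumes the elements are hashable.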

Coding Proficiency Leaderboard

Comparing top models on standard function synthesis (HumanEval) and real-world software engineering (SWE-bench Verified).

SWE-bench Verified

Leaderboard — resolve-rate

#   Model               Score
1   Claude Opus 4.5     80.90
2   Claude Opus 4.6     80.80
3   Gemini 3.1 Pro      80.60
4   MiniMax M2.5        80.20
5   GPT-5.2 Thinking    80.00
6   Claude Sonnet 4.6   79.60
7   Gemini 3 Flash      78.00
8   Claude Sonnet 4.5   77.20
9   Kimi K2.5           76.80
10  GPT-5.1             76.30

HumanEval

Leaderboard — pass@1

#   Model                        Score
1   o4-mini                      97.30
2   o3-mini                      96.30
3   GPT-4.1                      94.50
4   GPT-4.1 mini                 93.80
5   Qwen2.5-Coder-32B-Instruct   92.70
6   o1-preview                   92.40
7   o1-mini                      92.40
8   Claude-Opus-4                92.20
9   Claude 3.5 Sonnet            92.00
10  GPT-4o                       91.00

*Scores may vary by prompt strategy (e.g., 0-shot vs few-shot). SWE-bench Verified scores depend on agent scaffolding.

The Benchmarks

APPS

2021

10,000 coding problems from Codewars, AtCoder, Kattis, and CodeForces. Difficulty ranges from introductory to competition level.

Language
python
Samples
N/A

CodeContests

2022

13,610 competitive programming problems from CodeForces. ~200 private test cases per problem. 12+ programming languages.

Language
multilingual
Samples
N/A

HumanEval

2021

164 hand-crafted Python programming problems with function signatures, docstrings, and unit tests. Standard benchmark for code generation.

Language
python
Samples
N/A

HumanEval+

2023

Extended HumanEval with 80x more test cases. Tests code robustness and edge case handling.

Language
python
Samples
N/A
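The value of those extra tests is easy to demonstrate with a hypothetical submission: a median function that happens to pass an odd-length base test but is wrong on even-length input, exactly the kind of latent bug that a larger test set exposes. The function below is illustrative, not a real benchmark problem:

```python
def median(xs):
    """Intended spec: return the median of a non-empty list of numbers."""
    xs = sorted(xs)
    return xs[len(xs) // 2]  # bug: wrong for even-length lists

# Base-style happy-path test: passes.
assert median([3, 1, 2]) == 2

# Extended edge-case test would expect 2.5, but the buggy version returns 3.
print(median([1, 2, 3, 4]))  # → 3
```

With 80x more tests, near-misses like this fail instead of slipping through, which is why HumanEval+ scores are consistently lower than HumanEval scores for the same model.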

LiveCodeBench

2024

Contamination-free coding benchmark collecting new problems from LeetCode, AtCoder, and CodeForces after model knowledge cutoffs. Updated continuously with fresh problems. Primary metric is pass@1 on the full test set.

Language
en
Samples
400
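At its core, the contamination control is a date filter: only problems released after a model's knowledge cutoff are scored against that model. A sketch with illustrative field names and problem IDs:

```python
from datetime import date

# Hypothetical problem records; "released" is the contest publication date.
problems = [
    {"id": "lc-3191", "released": date(2024, 6, 2)},
    {"id": "lc-2870", "released": date(2023, 9, 30)},
]

# Keep only problems published after this model's knowledge cutoff.
cutoff = date(2024, 4, 1)
fresh = [p["id"] for p in problems if p["released"] > cutoff]
print(fresh)  # → ['lc-3191']
```

Because the eligible window shifts with each model's cutoff, LiveCodeBench scores are only comparable when computed over the same problem window.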

MBPP

2021

974 crowd-sourced Python programming problems suitable for beginners. Covers programming fundamentals and standard library.

Language
python
Samples
N/A

MBPP+

2023

Extended MBPP with additional test cases. Uses 399 hand-verified problems from MBPP-sanitized.

Language
python
Samples
N/A

SWE-bench

2023

2,294 real GitHub issues from popular Python repositories. Tests ability to resolve real-world software engineering tasks.

Language
python
Samples
N/A

SWE-bench Verified

2024

500 manually verified GitHub issues confirmed solvable by human engineers. High-quality subset of SWE-bench.

Language
python
Samples
N/A