Key AI Capability

AI That Writes Software

From single functions (HumanEval) to resolving GitHub issues (SWE-bench), code generation is the most practically impactful frontier of LLM capability.

Code Benchmark Stats

49.0%
SOTA on SWE-bench Verified
92.4%
SOTA on HumanEval
Python
Primary Language

From Snippets to Agents

1

Pass@1 (Function Level)

The model gets one try to write a single function (e.g., "sort this list"). If it passes unit tests, it wins. This is what HumanEval measures.

2

Repo-Level Resolution

The model is given a real GitHub issue (bug report) and must navigate multiple files, reproduce the bug, and write a patch. This is SWE-bench.
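The Pass@1 metric in step 1 generalizes to pass@k: the probability that at least one of k sampled generations passes the tests. A minimal sketch of the standard unbiased estimator (popularized by the Codex/HumanEval paper), assuming n generations of which c passed:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    Probability that at least one of k samples drawn without
    replacement from n generations passes, given that c of the
    n generations passed: pass@k = 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0  # too few failing samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 1 of 4 generations passing, pass@1 is just the raw pass rate:
assert pass_at_k(4, 1, 1) == 0.25
```

Reported Pass@1 numbers (as in the leaderboard below) are typically this estimator with k=1, averaged over all problems.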

solution.py
def solve_problem(input_list):
    """
    Sorts list and removes duplicates.
    >>> solve_problem([3, 1, 2, 1])
    [1, 2, 3]
    """
    # Model generated code:
    return sorted(list(set(input_list)))
✓ Test Passed | Execution time: 0.02s

Coding Proficiency Leaderboard

Comparing top models on standard function synthesis (HumanEval) and real-world engineering (SWE-bench Verified).

Rank  Model              Org        HumanEval (Pass@1)  MBPP (Pass@1)  SWE-bench Verified
#1    Claude 3.5 Sonnet  Anthropic  92.0%               89.2%          49.0%
#2    GPT-4o             OpenAI     90.2%               87.8%          41.2%
#3    o1-preview         OpenAI     92.4%               -              -
#4    DeepSeek V3        DeepSeek   82.6%               -              -
#5    Llama 3 70B        Meta       81.7%               -              -
#6    DeepSeek V2.5      DeepSeek   -                   -              37.0%

*SWE-bench Verified scores shown where available. Scores may vary by prompt strategy (e.g., zero-shot vs. few-shot) and agent scaffolding.

The Benchmarks

HumanEval

2021

164 hand-crafted Python programming problems with function signatures, docstrings, and unit tests. Standard benchmark for code generation.

Language
python
Samples
164

MBPP

2021

974 crowd-sourced Python programming problems suitable for beginners. Covers programming fundamentals and standard library.

Language
python
Samples
974
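Unlike HumanEval's docstring examples, MBPP pairs a short natural-language prompt with three assert-based tests. A hypothetical task in that shape (illustrative, not an actual MBPP item):

```python
# Prompt: "Write a function to remove duplicates from a list,
# preserving the original order."  (hypothetical, MBPP-style)
def remove_duplicates(xs):
    seen = set()
    out = []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

# MBPP-style grading: three assert statements per task
assert remove_duplicates([1, 2, 2, 3, 1]) == [1, 2, 3]
assert remove_duplicates([]) == []
assert remove_duplicates(["a", "a", "b"]) == ["a", "b"]
```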

HumanEval+

2023

Extended HumanEval with 80x more test cases. Tests code robustness and edge case handling.

Language
python
Samples
164
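The value of the extra test cases is easy to demonstrate: a solution can pass a sparse suite while still being wrong. A simplified sketch modeled on a HumanEval-style "did the balance ever go below zero" task:

```python
def below_zero_wrong(ops):
    # Common wrong answer: checks only the FINAL balance
    return sum(ops) < 0

def below_zero_right(ops):
    # Correct: track the running balance
    balance = 0
    for op in ops:
        balance += op
        if balance < 0:
            return True
    return False

# A sparse test suite misses the bug (both agree here):
assert below_zero_wrong([1, 2, 3]) is False
assert below_zero_right([1, 2, 3]) is False
# An added edge case (balance dips, then recovers) exposes it:
assert below_zero_right([1, 2, -4, 5]) is True
assert below_zero_wrong([1, 2, -4, 5]) is False  # wrong answer slips through
```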

MBPP+

2023

Extended MBPP with additional test cases. Uses 399 hand-verified problems from MBPP-sanitized.

Language
python
Samples
399

APPS

2021

10,000 coding problems from Codewars, AtCoder, Kattis, and CodeForces. Ranges from introductory to competition level.

Language
python
Samples
10,000

CodeContests

2022

13,610 competitive programming problems from CodeForces. ~200 private test cases per problem. 12+ programming languages.

Language
multilingual
Samples
13,610

SWE-Bench

2023

2,294 real GitHub issues from popular Python repositories. Tests ability to resolve real-world software engineering tasks.

Language
python
Samples
2,294
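SWE-bench's evaluation flow — reproduce the issue with a failing test, apply the model's patch, re-run the test — can be mocked in miniature. Everything below is illustrative, not the actual harness:

```python
# Miniature of the SWE-bench loop: a "repo" exposes a buggy function,
# the issue is encoded as a test, and the model's patch is a fix.

def buggy_median(xs):
    return sorted(xs)[len(xs) // 2]        # wrong for even-length input

def patched_median(xs):                    # the "model's patch"
    xs = sorted(xs)
    n, mid = len(xs), len(xs) // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def issue_is_fixed(fn):
    # Test derived from the "issue report": median([1, 2, 3, 4]) == 2.5
    return fn([1, 2, 3, 4]) == 2.5

assert not issue_is_fixed(buggy_median)    # the test reproduces the bug
assert issue_is_fixed(patched_median)      # the patch resolves the issue
```

The real harness does this inside a per-repository environment: it checks out the commit the issue was filed against, applies the generated diff, and runs the project's own test suite.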

SWE-Bench Verified

2024

500 manually verified GitHub issues confirmed solvable by human engineers. High-quality subset of SWE-bench.

Language
python
Samples
500