Computer Code

Code Generation

Generating code from natural language descriptions (HumanEval, MBPP).

8 datasets · 10 results

Code Generation is a key task in the Computer Code domain. Below are the standard benchmarks used to evaluate models, along with the current state-of-the-art results.

Benchmarks & SOTA

HumanEval

HumanEval: Hand-Written Evaluation Set

2021 · 5 results

164 hand-crafted Python programming problems with function signatures, docstrings, and unit tests. Standard benchmark for code generation.
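Each task supplies a function signature and docstring as the prompt, and the model must generate a body that passes hidden unit tests. The sketch below is an illustrative task in that format; the function and its tests are made up for this example and are not drawn from the benchmark itself.

```python
# Illustrative HumanEval-style task (hypothetical, not an actual benchmark item).
# The model sees the signature and docstring and must complete the body.

def running_max(numbers: list[int]) -> list[int]:
    """Return a list where element i is the maximum of numbers[:i + 1].

    >>> running_max([3, 1, 4, 1, 5])
    [3, 3, 4, 4, 5]
    """
    # --- model completion starts here ---
    result, current = [], None
    for n in numbers:
        current = n if current is None else max(current, n)
        result.append(current)
    return result


# Hidden unit tests execute the completion and assert expected behaviour.
def check(candidate):
    assert candidate([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
    assert candidate([]) == []
    assert candidate([-2, -5]) == [-2, -2]


check(running_max)
```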

State of the Art

o1-preview

OpenAI

92.4

pass@1
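pass@1 is the k = 1 case of the pass@k metric: generate n samples per problem, count the c samples that pass all unit tests, and estimate the probability that at least one of k drawn samples passes. A minimal sketch of the standard unbiased estimator introduced alongside HumanEval:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k = 1 - C(n - c, k) / C(n, k),
    where n = samples generated and c = samples that passed all tests."""
    if n - c < k:  # every size-k draw contains at least one passing sample
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples for one problem, 120 of which pass the tests.
print(pass_at_k(200, 120, 1))   # ≈ 0.6 (pass@1 reduces to the raw pass rate)
print(pass_at_k(200, 120, 10))  # close to 1.0
```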

SWE-Bench Verified

SWE-bench Verified Subset

2024 · 3 results

500 manually verified GitHub issues confirmed solvable by human engineers. High-quality subset of SWE-bench.

State of the Art

Claude 3.5 Sonnet

Anthropic

49

resolve-rate
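Unlike pass@1, the resolve-rate is simply the percentage of issues whose generated patch makes the issue's failing tests pass without breaking the tests that already passed. A minimal sketch of that aggregation, using hypothetical per-instance records (the field names are illustrative, not the evaluation harness's exact schema):

```python
# Hypothetical per-instance evaluation records; field names are illustrative.
results = [
    {"instance_id": "repo-a__123", "fail_to_pass_ok": True,  "pass_to_pass_ok": True},
    {"instance_id": "repo-b__456", "fail_to_pass_ok": True,  "pass_to_pass_ok": False},
    {"instance_id": "repo-c__789", "fail_to_pass_ok": False, "pass_to_pass_ok": True},
]

def resolve_rate(records: list[dict]) -> float:
    """An instance counts as resolved only if the patch fixes the issue's
    failing tests AND does not break tests that passed before the patch."""
    resolved = sum(r["fail_to_pass_ok"] and r["pass_to_pass_ok"] for r in records)
    return 100.0 * resolved / len(records)

print(f"{resolve_rate(results):.1f}%")  # 33.3% for the toy records above
```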

MBPP

Mostly Basic Python Problems

2021 · 2 results

974 crowd-sourced Python programming problems suitable for beginners. Covers programming fundamentals and standard library usage.

State of the Art

Claude 3.5 Sonnet

Anthropic

89.2

pass@1
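MBPP tasks pair a short natural-language description with assert-based test cases, and the model must produce a function that satisfies them. An illustrative task in that shape (made up for this example, not an actual MBPP item):

```python
# Description: "Write a function to count how many numbers in a list are even."
def count_even(nums):
    return sum(1 for n in nums if n % 2 == 0)

# MBPP-style test cases: three asserts per problem.
assert count_even([1, 2, 3, 4]) == 2
assert count_even([]) == 0
assert count_even([2, 2, 2]) == 3
```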

HumanEval+

HumanEval+ Extended Version

2023 · 0 results

Extended HumanEval with 80x more test cases. Tests code robustness and edge case handling.

No results tracked yet
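The point of the extra tests is that a completion can pass HumanEval's handful of assertions while still being wrong on inputs those tests never exercise. A hypothetical illustration (the function and tests below are made up for this example):

```python
def median(xs):
    """Buggy completion: correct only for odd-length inputs."""
    xs = sorted(xs)
    return xs[len(xs) // 2]

# Base-style tests (odd lengths only) all pass, so the completion looks correct:
assert median([3, 1, 2]) == 2
assert median([7]) == 7

# An added edge case in the HumanEval+ spirit exposes the bug:
print(median([1, 2, 3, 4]))  # prints 3, but the expected median is 2.5
```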

APPS

Automated Programming Progress Standard

2021 · 0 results

10,000 coding problems from Codewars, AtCoder, Kattis, and CodeForces. Ranges from introductory to competition level.

No results tracked yet

MBPP+

MBPP+ Extended Version

2023 · 0 results

Extended MBPP with additional test cases. Uses 399 hand-verified problems from MBPP-sanitized.

No results tracked yet

SWE-Bench

SWE-bench: Software Engineering Benchmark

2023 · 0 results

2,294 real GitHub issues from popular Python repositories. Tests ability to resolve real-world software engineering tasks.

No results tracked yet

CodeContests

CodeContests Competitive Programming

2022 · 0 results

13,610 competitive programming problems from CodeForces. ~200 private test cases per problem. 12+ programming languages.

No results tracked yet
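Competitive-programming benchmarks like this one are judged by running a submitted program against each private test case on stdin and comparing its stdout to the expected output. A minimal sketch of that judging loop (the solution string and test cases are made up for this example):

```python
import subprocess
import sys

# Hypothetical submission: read two integers from stdin, print their sum.
solution = "a, b = map(int, input().split()); print(a + b)"

# Hypothetical private test cases: (stdin, expected stdout) pairs.
tests = [("1 2\n", "3"), ("10 -4\n", "6"), ("0 0\n", "0")]

def judge(source: str, cases) -> bool:
    """Return True only if the program's output matches every test case."""
    for stdin, expected in cases:
        run = subprocess.run(
            [sys.executable, "-c", source],
            input=stdin, capture_output=True, text=True, timeout=5,
        )
        if run.returncode != 0 or run.stdout.strip() != expected:
            return False
    return True

print(judge(solution, tests))  # True
```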
