Code Generation

Generating code from natural language descriptions (HumanEval, MBPP).
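Function-level benchmarks such as HumanEval and MBPP are commonly scored with the unbiased pass@k estimator from the HumanEval paper: given n generated samples of which c pass the unit tests, pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations, passes the tests,
    given that c of the n generations passed."""
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k
        # must include a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 2 of 4 samples pass: pass@1 = 0.5
```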

9 datasets · 122 results · Canonical metric: resolve-rate · Canonical benchmark:

SWE-Bench Verified

500 manually verified GitHub issues confirmed solvable by human engineers. High-quality subset of SWE-bench.

Primary metric: resolve-rate
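Resolve-rate is the percentage of benchmark issues an agent fully resolves, i.e. its generated patch makes the issue's fail-to-pass tests succeed. A minimal sketch, assuming each issue's outcome is recorded as a boolean:

```python
def resolve_rate(resolved_flags: list[bool]) -> float:
    """Percentage of issues whose patches passed evaluation."""
    return 100.0 * sum(resolved_flags) / len(resolved_flags)

# e.g. 400 of the 500 SWE-Bench Verified issues resolved -> 80.0
```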

Top 10

Leading models on SWE-Bench Verified.

| Rank | Model | Resolve-rate (%) | Year | Source |
|------|-------|------------------|------|--------|
| 1 | Claude Opus 4.5 | 80.9 | 2026 | paper |
| 2 | Claude Opus 4.6 | 80.8 | 2026 | paper |
| 3 | Gemini 3.1 Pro | 80.6 | 2026 | paper |
| 4 | MiniMax M2.5 | 80.2 | 2026 | paper |
| 5 | GPT-5.2 Thinking | 80.0 | 2026 | paper |
| 6 | Claude Sonnet 4.6 | 79.6 | 2026 | paper |
| 7 | Gemini 3 Flash | 78.0 | 2026 | paper |
| 8 | Claude Sonnet 4.5 | 77.2 | 2026 | paper |
| 9 | Kimi K2.5 | 76.8 | 2026 | paper |
| 10 | GPT-5.1 | 76.3 | 2026 | paper |

All datasets

9 datasets tracked for this task.

Related tasks

Other tasks in the Computer Code category.