Code Generation
Generating code from natural language descriptions (HumanEval, MBPP).
Datasets: 9 · Results: 122 · Canonical metric: resolve-rate
Canonical Benchmark
SWE-Bench Verified
A high-quality subset of SWE-bench: 500 GitHub issues, each manually verified by human engineers as solvable.
Primary metric: resolve-rate
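Resolve-rate is the percentage of benchmark issues for which the model's generated patch makes the repository's tests pass. A minimal sketch of the metric itself; the evaluation harness that produces the per-issue pass/fail booleans is assumed, not shown:

```python
def resolve_rate(resolved: list[bool]) -> float:
    """Percentage of issues resolved.

    resolved[i] is True iff the patch for issue i passed the
    required tests (on SWE-bench, the FAIL_TO_PASS tests while
    keeping PASS_TO_PASS tests green).
    """
    if not resolved:
        return 0.0
    return 100.0 * sum(resolved) / len(resolved)

# e.g. 405 of the 500 Verified issues resolved:
print(resolve_rate([True] * 405 + [False] * 95))  # 81.0
```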
Top 10
Leading models on SWE-Bench Verified.
| Rank | Model | resolve-rate (%) | Year | Source |
|---|---|---|---|---|
| 1 | Claude Opus 4.5 | 80.9 | 2026 | paper |
| 2 | Claude Opus 4.6 | 80.8 | 2026 | paper |
| 3 | Gemini 3.1 Pro | 80.6 | 2026 | paper |
| 4 | MiniMax M2.5 | 80.2 | 2026 | paper |
| 5 | GPT-5.2 Thinking | 80.0 | 2026 | paper |
| 6 | Claude Sonnet 4.6 | 79.6 | 2026 | paper |
| 7 | Gemini 3 Flash | 78.0 | 2026 | paper |
| 8 | Claude Sonnet 4.5 | 77.2 | 2026 | paper |
| 9 | Kimi K2.5 | 76.8 | 2026 | paper |
| 10 | GPT-5.1 | 76.3 | 2026 | paper |
All datasets
9 datasets tracked for this task.
| Dataset | Results | Metric | Top result |
|---|---|---|---|
| SWE-Bench Verified (canonical) | 38 | resolve-rate | Claude Opus 4.5 — 80.9 |
| HumanEval | 33 | pass@1 | o4-mini (high) — 99.3 |
| LiveCodeBench | 22 | pass@1 | DeepSeek-R1-0528 — 73.3 |
| MBPP | 14 | pass@1 | Claude 3.5 Sonnet (Oct 2024) — 91.0 |
| HumanEval+ | 5 | pass@1 | Qwen2.5-Coder-32B — 87.2 |
| MBPP+ | 4 | pass@1 | Qwen2.5-Coder-32B — 76.4 |
| APPS | 3 | pass@1 | CodeLlama-34B — 32.8 |
| CodeContests | 3 | pass@1 | GPT-4 + AlphaCodium — 44.0 |
| SWE-Bench | 0 | resolve-rate | — |
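The pass@1 scores above come from the unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021): sample n completions per problem, count the c that pass the unit tests, and estimate the probability that at least one of k random draws is correct. A self-contained sketch:

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: 1 - C(n-c, k) / C(n, k).

    n: completions sampled per problem, c: completions that passed,
    k: sampling budget. Computed as a running product to avoid
    overflowing binomial coefficients for large n.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    prod = 1.0
    for i in range(n - c + 1, n + 1):
        prod *= 1.0 - k / i
    return 1.0 - prod

# pass@1 reduces to the raw success rate c/n:
print(pass_at_k(4, 1, 1))  # 0.25
```

Per-problem estimates are averaged over the benchmark, and reported scores are multiplied by 100.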