SWE-bench

SWE-bench — resolving real GitHub issues from popular Python repositories — became the defining benchmark for AI software engineering after its 2023 release by researchers at Princeton. The verified subset (500 curated problems) went from a roughly 4% resolution rate with raw GPT-4 to over 50% with agentic scaffolds like SWE-Agent and Amazon Q Developer by mid-2025. What makes it uniquely challenging is the need to navigate large codebases and produce patches that pass the repository's test suite — skills that require genuine multi-file reasoning, not just code generation.


SWE-bench Verified

500 manually verified GitHub issues confirmed solvable by human engineers. The primary benchmark for software engineering agents. Results are tracked for autonomous scaffolds, not raw model capability alone.

Primary metric: resolve-rate
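The resolve-rate is simply the fraction of benchmark instances whose generated patch passes the evaluation tests. A minimal sketch in Python — the `Result` dataclass and the sample instance IDs are hypothetical illustrations, not the official harness:

```python
# Sketch of the resolve-rate metric: the fraction of issues whose
# generated patch makes the repository's evaluation tests pass.
# The Result type and sample data below are illustrative only.
from dataclasses import dataclass

@dataclass
class Result:
    instance_id: str
    resolved: bool  # True if the patch passed the evaluation tests

def resolve_rate(results: list[Result]) -> float:
    """Fraction of instances resolved; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(r.resolved for r in results) / len(results)

# Hypothetical run over four instances:
runs = [
    Result("astropy__astropy-12907", True),
    Result("django__django-11001", False),
    Result("sympy__sympy-13480", True),
    Result("flask__flask-4045", True),
]
print(f"resolve-rate: {resolve_rate(runs):.1%}")  # resolve-rate: 75.0%
```

Note that a run counts as resolved only if the patch applies cleanly and the tests pass; partial fixes score zero, which is why resolve-rate rewards end-to-end agent reliability.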

Top 10

Leading models on SWE-bench Verified.

No results yet.

All datasets

1 dataset tracked for this task.

Related tasks

Other tasks in Agentic AI.