SWE-bench Verified
Software Engineering Benchmark (Verified subset) · Software Engineering · 2023 (Verified: 2024) · Princeton NLP / OpenAI
What it measures
Whether an agent can autonomously resolve real GitHub issues — reading codebases, localizing bugs, writing patches, and passing existing test suites.
How it works
1. The agent receives a GitHub issue description and the full repository at the relevant commit.
2. It must produce a code patch that resolves the issue.
3. The patch is validated by running the project's own unit/integration tests.
4. An instance is "resolved" only if all relevant tests pass and no existing tests break.
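The resolution rule in steps 3–4 can be sketched in Python. This is a simplified illustration, not the benchmark's actual harness: the function names are hypothetical, and the real evaluation runs each instance in an isolated per-repo environment. The `FAIL_TO_PASS` / `PASS_TO_PASS` split mirrors SWE-bench's convention of tracking both the tests a patch must fix and the pre-existing tests it must not break.

```python
import subprocess

def apply_patch(repo_dir: str, patch_path: str) -> bool:
    """Apply a model-generated patch; False if it does not apply cleanly."""
    proc = subprocess.run(["git", "apply", patch_path], cwd=repo_dir)
    return proc.returncode == 0

def run_tests(repo_dir: str, tests: list[str]) -> dict[str, bool]:
    """Run each named test with pytest, recording pass/fail per test."""
    results = {}
    for test in tests:
        proc = subprocess.run(
            ["pytest", test], cwd=repo_dir, capture_output=True
        )
        results[test] = proc.returncode == 0
    return results

def is_resolved(fail_to_pass: dict[str, bool],
                pass_to_pass: dict[str, bool]) -> bool:
    """An instance counts as resolved only if every FAIL_TO_PASS test
    now passes AND every PASS_TO_PASS test still passes."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())
```

Note that a patch which fixes the issue but breaks even one unrelated existing test scores as unresolved, which is why the metric is stricter than "does the fix work".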
Key findings
- Scaffolding matters as much as model quality: the same model's resolution rate can vary by 15+ points depending on the agent framework.
- Agents still struggle with large codebases (>100k lines) where localization is the bottleneck.
- Most resolved issues are small, localized bug fixes — multi-file architectural changes remain extremely difficult.
Limitations
- Python-only — no coverage of JavaScript, Rust, Go, or other languages.
- Issues are self-contained; real engineering involves cross-repo dependencies and ambiguous requirements.
- Test-based validation can miss subtle regressions not covered by existing tests.
- Verified subset was curated partly with help from OpenAI, raising neutrality questions.
Leaderboard
Metric: % resolved
Dataset: 500 verified instances from 12 Python repos