Autonomous Coding
Autonomous coding, in which AI systems write, debug, and ship software without human guidance, is the most commercially immediate agentic capability. Benchmarks range from function-level synthesis (HumanEval, MBPP) to full-repository tasks (SWE-bench). The field moved from autocomplete toward genuine software engineering when Cognition's Devin (2024) and open alternatives such as SWE-agent and OpenHands demonstrated multi-file, multi-step coding workflows. The frontier is extended autonomy: can an agent maintain a codebase over days, not just resolve a single issue?
SWE-bench Verified
A human-validated subset of 500 GitHub issues drawn from real Python repositories. For each issue, a model must produce a patch that passes hidden tests. It is the standard end-to-end benchmark for autonomous coding agents, covering repository navigation, editing, and testing.
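As a rough illustration of the pass criterion described above, the sketch below scores a set of patched instances. It assumes the convention used by the SWE-bench harness, where each issue carries tests that should go from failing to passing (FAIL_TO_PASS) and tests that must keep passing (PASS_TO_PASS). The `InstanceResult` class, its field names, and both helper functions are hypothetical names chosen for illustration, not part of any official tooling.

```python
from dataclasses import dataclass, field


@dataclass
class InstanceResult:
    """Test outcomes for one issue after the model's patch is applied.

    Hypothetical structure: each dict maps a test name to True (passed)
    or False (failed).
    """
    instance_id: str
    fail_to_pass: dict[str, bool] = field(default_factory=dict)
    pass_to_pass: dict[str, bool] = field(default_factory=dict)


def is_resolved(result: InstanceResult) -> bool:
    """An issue counts as resolved only if the patch fixes every target
    test and breaks none of the previously passing tests."""
    if not result.fail_to_pass:
        # Assumption: with no target tests there is nothing to verify,
        # so the instance is treated as unresolved.
        return False
    fixed = all(result.fail_to_pass.values())
    no_regressions = all(result.pass_to_pass.values())
    return fixed and no_regressions


def resolve_rate(results: list[InstanceResult]) -> float:
    """Fraction of instances resolved, in the range 0.0 to 1.0."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)
```

Under this scheme a patch that fixes the reported bug but breaks an existing test does not count, which is why the headline metric rewards whole-repository correctness and not just the targeted change.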
Top 10
Leading models on SWE-bench Verified.
No results yet.
All datasets
1 dataset tracked for this task.