Codesota · Tasks · Autonomous Coding

Autonomous Coding.

Agent benchmarks where systems complete coding, terminal, repository, or developer-workflow tasks with minimal human intervention.

2 datasets · 23 results · Canonical metric: pct_resolved
§ 02 · Canonical benchmark

The reference dataset.

SWE-bench Verified

A human-validated subset of 500 GitHub issues from real Python repositories. Models must produce a patch that passes hidden tests. It is the standard end-to-end benchmark for autonomous coding agents, covering repository navigation, editing, and testing.

Primary metric: pct_resolved
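
As a rough illustration of the metric, the sketch below assumes pct_resolved is the share of benchmark instances whose submitted patch passes the hidden tests, expressed as a percentage. The `InstanceResult` record, its field names, and the one-decimal rounding are hypothetical choices made for this example, not the benchmark's own harness or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceResult:
    """Outcome of one benchmark instance (hypothetical record shape)."""
    instance_id: str
    tests_passed: bool  # True if the submitted patch passed the hidden tests


def pct_resolved(results: list[InstanceResult]) -> float:
    """Percentage of instances resolved, rounded to one decimal place."""
    if not results:
        raise ValueError("no results to score")
    resolved = sum(1 for r in results if r.tests_passed)
    return round(100.0 * resolved / len(results), 1)


# Illustrative data only: four made-up instances, three of them resolved.
example = [
    InstanceResult("issue-1", True),
    InstanceResult("issue-2", True),
    InstanceResult("issue-3", False),
    InstanceResult("issue-4", True),
]
score = pct_resolved(example)
```

Under this assumption each instance counts equally, so a run is scored by a simple pass count over the total rather than by any per-repository weighting.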
View full leaderboard →
§ 03 · Top 10

Leading models.

Leading models on SWE-bench Verified.

# | Model | pct_resolved | Year | Source
1 | Claude Opus 4.5 | 80.9 | 2026 | paper
2 | Gemini 3 Pro | 78.8 | 2026 | paper
3 | GPT-5 Codex | 74.9 | 2026 | paper


§ 04 · All datasets

Tracked datasets.

2 datasets tracked for this task.

SWE-bench Verified (canonical) · 3 results · pct_resolved · Top: Claude Opus 4.5, 80.9
Terminal-Bench 2.0 · 20 results · accuracy · Top: Codex / GPT-5.5, 82.0
§ 05 · Related tasks

Other tasks in Agentic AI.

Agent Memory · Bioinformatics Agents · HCAST · RE-Bench · SWE-bench · Task agents · Time Horizon · Tool Use