
Agentic coding · last verified 2026-04

SWE-bench / Autonomous Coding Agents.

SWE-bench Verified is the most-watched leaderboard in agentic AI. It contains 500 hand-curated GitHub issues from popular Python repos: an agent gets the repo plus the issue text, and has to ship a patch that passes the hidden tests. SOTA in 2026 hovers at 75–80%. The interesting story isn’t the ranking; it’s that scaffolding matters as much as the underlying LLM. The same model in a different harness can swing 15–20 points. Below: 14 agents, harnesses, and underlying LLMs compared on the axes buyers actually care about.


§ 01 · The matrix

14 agents and models, side by side.

Closed agentic products · open-source harnesses · the underlying LLMs. Cost, published SWE-bench Verified score, sandbox model, and iteration loop.


Provider / Model	Tier	License	Cost	SWE-bench Verified	Time / task	Sandbox	Iterates	OSS
Cognition / Devin	Product	Proprietary	$20/mo or per-task	~65–70%	10–40 min/task	Cloud VM (Devin's environment)	✓	—
OpenAI / Codex (OpenAI agent)	Product	Proprietary	ChatGPT Plus / API	~72% (Pro)	5–20 min/task	Cloud sandbox (per-task container)	✓	—
Anthropic / Claude Code (terminal agent)	Product	Proprietary	~$15–75/M tokens	~77% (Opus 4.7)	3–15 min/task	Local shell (your machine)	✓	—
Augment / Augment Agent	Product	Proprietary	$30/mo (Pro)	~70%	5–20 min/task	Cloud + IDE integration	✓	—
Replit / Replit Agent	Product	Proprietary	$20/mo (Core)	—	5–25 min/task	Replit cloud workspace	✓	—
Cursor / Cursor Agent (Composer)	Product	Proprietary	$20/mo (Pro)	~71% (Composer)	3–15 min/task	Local IDE (your machine)	✓	—
Aider (open) / Aider	Harness	Open source	Free + your model	~64% (Polyglot leader)	2–10 min/task	Local + git	✓	✓
All-Hands AI (open) / OpenHands (formerly OpenDevin)	Harness	Open source	Free + your model	~62–67%	5–25 min/task	Docker (per-task container)	✓	✓
Princeton NLP (open) / SWE-Agent	Harness	Open source	Free + your model	~50–55%	5–20 min/task	Docker (per-task container)	✓	✓
Plandex (open) / Plandex	Harness	Open source	Free + your model	—	Variable (long-task focus)	Local + sandbox branch	✓	✓
Anthropic / Claude Opus 4.7 and Sonnet 4.5	LLM	API only	Sonnet $3/$15 · Opus $15/$75 (per M)	~77–80% (Opus 4.7, w/ harness)	3–10 min/task	n/a (model only)	—	—
OpenAI / GPT-5	LLM	API only	Tokens (varies by tier)	~74% Verified · OpenAI now reports SWE-bench Pro	3–12 min/task	n/a (model only)	—	—
DeepSeek / DeepSeek V3.2	LLM	Open source	~$0.30 / $1.20 per M	~66%	5–15 min/task	n/a (model only)	—	✓
MiniMax / MiniMax M2.1	LLM	API only	Low-cost token tier	~74%	5–15 min/task	n/a (model only)	—	—

SWE-bench Verified scores reflect each vendor’s own reported configuration as of 2026-04. Harness, retries, and reasoning-effort flags all move the number, so treat the column as “ballpark with a generous harness,” not as a head-to-head ranking.

§ 02 · Which should I use?

Three-axis trade-off.

Picking an autonomous coding agent isn’t a leaderboard exercise — it’s a three-axis trade-off between sandbox model, harness ergonomics, and per-task economics. Shortcuts by use case:

Highest published SWE-bench Verified

Claude Code + Opus 4.7 · Cursor Composer · Codex

Top of the leaderboard in 2026 hovers around 75–80%. Same model in a different harness routinely drops 10–15 points — pick the harness, not just the model.

Cheapest agent per task

OpenHands + DeepSeek V3.2 · Aider + DeepSeek

Open harness + open-weights LLM lands tasks at roughly 1/20th the cost of Devin. You eat sandbox setup and observability.

Long-running multi-file work

Devin · Plandex · OpenHands

Tasks above 30 minutes need branch-style sandboxing and explicit plan-state. Most editor-tied agents lose context past that horizon.

Local, no cloud sandbox

Claude Code · Aider · Cursor

Run on your machine against your real working tree. Faster iteration, but no isolation — guard with git worktrees and avoid running unreviewed shell commands.

Repo-aware on a large monorepo

Augment · Cursor · Claude Code

Codebase indexing matters more than raw LLM intelligence past a few hundred files. Augment built explicitly for this; Cursor and Claude Code do well with explicit @-references.

Reproducible research / benchmarking

SWE-Agent · OpenHands

Fixed Docker sandbox, deterministic config, MIT licence. The right baseline if you want to compare LLMs on SWE-bench on your own infra.

Greenfield prototype, not an issue fix

Replit Agent · Cursor · Claude Code

SWE-bench measures patch-an-issue, not build-from-scratch. Replit Agent and Cursor optimise for the latter; Devin and OpenHands are overkill for it.
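The per-task economics axis can be made concrete with a back-of-envelope calculation. The sketch below uses the per-million-token list prices from the matrix; the token counts per task are invented placeholders, and the model labels are just dictionary keys, not API identifiers. It ignores cached-token discounts, retries, and subscription pricing, so treat the output as an order-of-magnitude guide only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


# List prices copied from the matrix above (USD per million tokens).
PRICES = {
    "claude-opus": TokenPrice(15.00, 75.00),
    "claude-sonnet": TokenPrice(3.00, 15.00),
    "deepseek-v3.2": TokenPrice(0.30, 1.20),
}

# ASSUMPTION: hypothetical token usage for one agent task.
# Replace with numbers measured on your own backlog.
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 50_000


def cost_per_task(price: TokenPrice, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one task at list price, with no caching discount."""
    total = input_tokens * price.input_per_m + output_tokens * price.output_per_m
    return total / 1_000_000


def cost_table(input_tokens: int = INPUT_TOKENS, output_tokens: int = OUTPUT_TOKENS) -> dict:
    """Map each model label to its rounded per-task cost in USD."""
    return {
        name: round(cost_per_task(price, input_tokens, output_tokens), 2)
        for name, price in PRICES.items()
    }


if __name__ == "__main__":
    for name, usd in cost_table().items():
        print(f"{name:>14}: ${usd:.2f} per task")
```

Because agent loops re-read context on every turn, input tokens usually dominate the bill, which is why the input price matters more than the output price in this kind of estimate.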

§ 03 · What to actually test

Vendor demos lie.

Vendor demos are stitched from successful runs and pre-warmed contexts. Build your own 6-task evaluation set from your real backlog covering these failure modes — most agents stratify sharply on them.

Score by patch acceptance, not by “did the agent finish.” A confident wrong patch wastes more reviewer time than a clean “I’m stuck.”

Multi-file changes

Most published scores skew toward single-file fixes. Pick an issue that requires editing 3+ files and a config — most agents drop sharply.

Test-failure interpretation

Run the suite, watch a real test fail, then ask the agent to fix it. Does it read the stack trace and iterate, or does it pattern-match the test name and over-fit?

Dependency understanding

Use a repo with a pinned old library version. A weak agent writes code against the latest API; a strong one reads the lockfile first.

Long-running tasks (>30 min)

Give the agent a task that needs 30+ minutes of work. Watch for context drift — forgotten imports, regressions on already-edited files, plan loops.

Repo navigation efficiency

Measure tokens-per-task on a 100K-line repo. A clever harness skims a fraction of the codebase; a naive one re-reads the same files until the budget blows.

Failure modes / asks for help

Set up a task the agent cannot solve (missing credentials, broken upstream). Does it stop and surface the blocker, or hallucinate a fix and silently break things?
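One way to keep such an evaluation honest is to record every run in a fixed shape and score it by patch acceptance per failure mode. The following is a minimal sketch under stated assumptions: the `TaskResult` fields, the failure-mode labels, and the sample data are all hypothetical, and "accepted" means whatever your reviewers decide it means.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    failure_mode: str     # e.g. "multi-file", "dependency", "blocked"
    finished: bool        # the agent claimed it was done
    patch_accepted: bool  # a reviewer accepted the patch
    asked_for_help: bool  # the agent stopped and surfaced a blocker
    tokens_used: int


def summarize(results):
    """Group results by failure mode and score by acceptance, not completion."""
    by_mode = defaultdict(list)
    for result in results:
        by_mode[result.failure_mode].append(result)

    summary = {}
    for mode, runs in by_mode.items():
        summary[mode] = {
            "tasks": len(runs),
            "acceptance_rate": sum(r.patch_accepted for r in runs) / len(runs),
            # Finished but rejected: the expensive case for reviewers.
            "confident_wrong": sum(r.finished and not r.patch_accepted for r in runs),
            # Stopped and said so: cheap to review.
            "clean_stops": sum(r.asked_for_help and not r.finished for r in runs),
            "mean_tokens": sum(r.tokens_used for r in runs) / len(runs),
        }
    return summary


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    sample = [
        TaskResult("BACKLOG-1", "multi-file", True, True, False, 410_000),
        TaskResult("BACKLOG-2", "multi-file", True, False, False, 620_000),
        TaskResult("BACKLOG-3", "blocked", False, False, True, 90_000),
    ]
    for mode, stats in summarize(sample).items():
        print(mode, stats)
```

Tracking `confident_wrong` separately from the acceptance rate makes the cost of a confident wrong patch visible, and `mean_tokens` gives the tokens-per-task figure for the repo-navigation check.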

§ 04 · Why scores lag

Verified isn’t face-value anymore.

Contamination. SWE-bench Verified is built from real, public GitHub issues. Frontier LLMs were almost certainly trained on the repos and the merged fixes. Treat the published 80% as an upper bound inflated by data leakage, not a measurement of generalisable agent quality.

Scaffolding inflation. Vendor scores use vendor-specific harnesses — custom retrieval, retry policies, judge LLMs — that are rarely open-sourced. Same model in a thin loop scores 10–15 points lower. The benchmark increasingly measures the harness, not the LLM.

SWE-bench Pro split. OpenAI shifted its public reporting to SWE-bench Pro (a private, harder, contamination-resistant set) citing exactly these concerns. Anthropic and most of the open-source community still report Verified. The ecosystem is split on which number to trust.

The honest takeaway: use SWE-bench Verified as a coarse filter (“is this in the right league?”), then evaluate finalists on tasks from your own backlog with your own harness. The matrix above gives you the inputs; your eval gives you the answer.

§ 05 · Reference benchmarks

What engineers look at.

The leaderboards engineers actually look at when picking a coding agent. Each measures a different thing — saturate one and the next becomes the reporting standard.

SWE-bench Verified

500 hand-curated issues · Python · multi-repo · 2024

OpenAI- and Princeton-curated subset of SWE-bench: real GitHub issues with hidden test suites, manually filtered for solvability and fairness. The default 2026 leaderboard.


SWE-bench Pro

Private, contamination-resistant · multi-language · 2025

A response to the contamination concerns around Verified, published by Scale AI. Held-out, harder, less likely to overlap with pre-training data. The reporting standard OpenAI now leads with.


Aider Polyglot

225 problems · 6 languages · edit-style · 2024

Tests an agent’s ability to apply correct edits to existing code across Python, JS, Rust, Go, C++, and Java. The reference benchmark for “does the model edit, not just generate.”


LiveCodeBench

Time-stamped competitive-programming problems · 2024

Problems pulled from LeetCode, AtCoder, and Codeforces by date. Filter to post-cutoff problems to get a contamination-resistant view of real coding ability.


BigCodeBench

1,140 function-level tasks · library use · 2024

Function-level coding with real Python library calls (data, web, ML). Harder than HumanEval because tasks need correct API usage, not just algorithmic insight.


HumanEval

164 hand-written Python problems · 2021

The original LLM coding benchmark. Saturated above 95% by every frontier model — kept for historical comparability only. Not a 2026 buying signal.
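The post-cutoff filter described for LiveCodeBench comes down to a date comparison. The sketch below assumes a hypothetical `Problem` record with a release date; it is not LiveCodeBench's actual schema or loader, and the training cutoff is a value you supply from the model vendor's documentation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str
    source: str      # e.g. "LeetCode", "AtCoder", "Codeforces"
    released: date   # date the problem was first published


def post_cutoff(problems, training_cutoff: date):
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


if __name__ == "__main__":
    # Hypothetical problems and cutoff, for illustration only.
    pool = [
        Problem("p-001", "LeetCode", date(2024, 3, 1)),
        Problem("p-002", "AtCoder", date(2025, 9, 14)),
        Problem("p-003", "Codeforces", date(2026, 1, 20)),
    ]
    assumed_cutoff = date(2025, 6, 30)
    for problem in post_cutoff(pool, assumed_cutoff):
        print(problem.problem_id, problem.source, problem.released.isoformat())
```

Scores on the filtered subset will be noisier because fewer problems remain, so compare models on the same date window rather than on each model's own cutoff.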


§ 06 · Practical tips for 2026

Five rules.

Treat 80% as the ceiling, not 100%. SWE-bench Verified scores already bake in a generous helping of contamination. A vendor reporting 85% next quarter isn’t 5 points better; it’s 5 points more leaky.

Pick by harness ergonomics, not just LLM. The harness can swing the score 15–20 points. Choose the one that fits your stack (terminal vs IDE vs cloud sandbox), the LLM you already have a contract for, and the review workflow your team already runs.

Open-weights agents cost ~1/20th per task. OpenHands or Aider on top of Qwen3-Coder or DeepSeek V3.2 lands the same kinds of tasks at roughly 5% of Devin’s per-task cost. You eat sandbox setup, secret management, and observability — usually worth it past a few hundred tasks a month.

Sandbox isolation matters in production. Local-shell agents (Claude Code, Aider, Cursor) are productive but they can rm -rf things. Run them in git worktrees, deny destructive commands by default, and never give an agent your prod credentials.

Real value is maintenance, not greenfield. Where these agents pay back is the boring stuff: dependency upgrades, lint fixes, test scaffolding, deprecated-API migrations. Greenfield architecture decisions still need a senior human in the loop.
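The sandbox rule above (git worktrees plus denying destructive commands) can be sketched as a small wrapper around whatever agent you run locally. This is an illustrative guard, not a security boundary and not any vendor's built-in feature: the deny-lists and function names are hypothetical, and a stricter setup would use an allow-list or a container instead.

```python
import shlex
import subprocess
from pathlib import Path

# ASSUMPTION: illustrative deny-lists; extend them for your environment.
DENIED_PROGRAMS = {"rm", "sudo", "dd", "mkfs", "shutdown", "reboot"}
DENIED_GIT_SUBCOMMANDS = {"push", "clean", "reset"}
SHELL_METACHARACTERS = set(";|&><`$\n")


def is_allowed(command: str) -> bool:
    """Return True only for a single plain command that is not on a deny-list."""
    # Reject chaining, pipes, redirects, and substitutions outright.
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar
        return False
    if not argv:
        return False
    program = Path(argv[0]).name  # "/bin/rm" is treated the same as "rm"
    if program in DENIED_PROGRAMS:
        return False
    if program == "git" and len(argv) > 1 and argv[1] in DENIED_GIT_SUBCOMMANDS:
        return False
    return True


def create_worktree(repo: Path, branch: str, dest: Path) -> Path:
    """Create a separate working tree on a new branch for the agent to edit."""
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(dest.resolve())],
        check=True,
    )
    return dest


def run_guarded(command: str, cwd: Path) -> subprocess.CompletedProcess:
    """Run an agent-proposed command inside the worktree, or refuse it."""
    if not is_allowed(command):
        raise PermissionError(f"refused: {command!r}")
    return subprocess.run(shlex.split(command), cwd=cwd, capture_output=True, text=True)
```

A typical flow would call `create_worktree` once per task, pass every agent-proposed command through `run_guarded`, and review the resulting branch before merging. A worktree isolates the working tree only; the agent still shares your filesystem and credentials, so keep prod secrets out of the environment.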

For vendors

Run a coding agent or harness? Claim your listing.

CodeSOTA’s SWE-bench page is read by engineering leaders picking the agent their team will live in. If you represent a vendor above — or one we missed — claim the listing to submit verified pricing, harness configuration, sandbox details, and a current Verified score. Free; credibility-gated, not pay-to-play.
