SWE-bench Verified is the hottest leaderboard in agentic AI. 500 hand-curated GitHub issues from popular Python repos: an agent gets the repo plus the issue text, and has to ship a patch that passes the hidden tests. SOTA in 2026 hovers at 75–80%. The interesting story isn’t the ranking — it’s that scaffolding matters as much as the underlying LLM. Same model, different harness, swing of 15–20 points. Below: 14 agents, harnesses, and underlying LLMs compared on the axes buyers actually care about.
Closed agentic products · open-source harnesses · the underlying LLMs. Cost, published SWE-bench Verified score, sandbox model, and iteration loop.
| Provider / Model | Tier | License | Cost | SWE-bench Verified | Time / task | Sandbox | Iterates | OSS |
|---|---|---|---|---|---|---|---|---|
|  | Product | Proprietary | $20/mo or per-task | ~65–70% | 10–40 min/task | Cloud VM (Devin's environment) | ✓ | — |
|  | Product | Proprietary | ChatGPT Plus / API | ~72% (Pro) | 5–20 min/task | Cloud sandbox (per-task container) | ✓ | — |
|  | Product | Proprietary | ~$15–75/M tokens | ~77% (Opus 4.7) | 3–15 min/task | Local shell (your machine) | ✓ | — |
| Au | Product | Proprietary | $30/mo (Pro) | ~70% | 5–20 min/task | Cloud + IDE integration | ✓ | — |
| Rp | Product | Proprietary | $20/mo (Core) | — | 5–25 min/task | Replit cloud workspace | ✓ | — |
| Cu | Product | Proprietary | $20/mo (Pro) | ~71% (Composer) | 3–15 min/task | Local IDE (your machine) | ✓ | — |
|  | Harness | Open source | Free + your model | ~64% (Polyglot leader) | 2–10 min/task | Local + git | ✓ | ✓ |
|  | Harness | Open source | Free + your model | ~62–67% | 5–25 min/task | Docker (per-task container) | ✓ | ✓ |
|  | Harness | Open source | Free + your model | ~50–55% | 5–20 min/task | Docker (per-task container) | ✓ | ✓ |
|  | Harness | Open source | Free + your model | — | Variable (long-task focus) | Local + sandbox branch | ✓ | ✓ |
|  | LLM | API only | Sonnet $3/$15 · Opus $15/$75 (per M) | ~77–80% (Opus 4.7, w/ harness) | 3–10 min/task | n/a (model only) | — | — |
|  | LLM | API only | Tokens (varies by tier) | ~74% Verified · OpenAI now reports SWE-bench Pro | 3–12 min/task | n/a (model only) | — | — |
| DS | LLM | Open source | ~$0.30 / $1.20 per M | ~66% | 5–15 min/task | n/a (model only) | — | ✓ |
| MM | LLM | API only | Low-cost token tier | ~74% | 5–15 min/task | n/a (model only) | — | — |
SWE-bench Verified scores reflect each vendor’s own reported configuration as of 2026-04. Harness, retries, and reasoning-effort flags all move the number, so treat the column as “ballpark with a generous harness”, not as a head-to-head ranking.
Picking an autonomous coding agent isn’t a leaderboard exercise — it’s a three-axis trade-off between sandbox model, harness ergonomics, and per-task economics. Shortcuts by use case:
Top of the leaderboard in 2026 hovers around 75–80%. Same model in a different harness routinely drops 10–15 points — pick the harness, not just the model.
Open harness + open-weights LLM lands tasks at roughly 1/20th the cost of Devin. You eat sandbox setup and observability.
Tasks above 30 minutes need branch-style sandboxing and explicit plan-state. Most editor-tied agents lose context past that horizon.
Run on your machine against your real working tree. Faster iteration, but no isolation — guard with git worktrees and avoid running unreviewed shell commands.
Codebase indexing matters more than raw LLM intelligence past a few hundred files. Augment was built explicitly for this; Cursor and Claude Code do well with explicit @-references.
Fixed Docker sandbox, deterministic config, MIT licence. The right baseline if you want to compare LLMs on SWE-bench on your own infra.
SWE-bench measures patch-an-issue, not build-from-scratch. Replit Agent and Cursor optimise for the latter; Devin and OpenHands are overkill for it.
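The per-task economics axis can be made concrete with a few lines of arithmetic. The sketch below converts token usage into dollars using the per-million-token prices from the table above (Sonnet, Opus, and the DS row). The token counts in the usage example are hypothetical placeholders, not measurements, and the names `Price`, `PRICES`, and `cost_per_task` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per million tokens."""
    input_per_m: float
    output_per_m: float


# Prices copied from the comparison table (input / output, per million tokens).
PRICES = {
    "sonnet": Price(3.00, 15.00),
    "opus": Price(15.00, 75.00),
    "ds": Price(0.30, 1.20),  # the open-source "DS" row
}


def cost_per_task(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task given the tokens it consumed."""
    total = input_tokens * price.input_per_m + output_tokens * price.output_per_m
    return total / 1_000_000


if __name__ == "__main__":
    # ASSUMPTION: placeholder usage for a single task; log your own numbers.
    input_tokens, output_tokens = 2_000_000, 50_000
    for name, price in PRICES.items():
        dollars = cost_per_task(price, input_tokens, output_tokens)
        print(f"{name}: ${dollars:.2f} per task")
```

Swap in token counts logged from your own runs rather than relying on the placeholders; subscription-priced products in the table cannot be compared this way without a per-task figure from the vendor.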
Vendor demos are stitched from successful runs and pre-warmed contexts. Build your own 6-task evaluation set from your real backlog covering these failure modes — most agents stratify sharply on them.
Score by patch acceptance, not by “did the agent finish.” A confident wrong patch wastes more reviewer time than a clean “I’m stuck.”
Most published scores skew toward single-file fixes. Pick an issue that requires editing 3+ files and a config — most agents drop sharply.
Run the suite, watch a real test fail, then ask the agent to fix it. Does it read the stack trace and iterate, or does it pattern-match the test name and over-fit?
Use a repo with a pinned old library version. A weak agent writes code against the latest API; a strong one reads the lockfile first.
Give the agent a task that needs 30+ minutes of work. Watch for context drift — forgotten imports, regressions on already-edited files, plan loops.
Measure tokens-per-task on a 100K-line repo. A clever harness skims a fraction of the codebase; a naive one re-reads the same files until the budget blows.
Set up a task the agent cannot solve (missing credentials, broken upstream). Does it stop and surface the blocker, or hallucinate a fix and silently break things?
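One way to apply the "score by patch acceptance" rule is to record a reviewer verdict for every run and report acceptance separately from completion. The sketch below is a minimal tally under that assumption; the `Outcome` categories, the `Run` record, and the sample task names are hypothetical, chosen only to mirror the failure modes listed above.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    ACCEPTED = "accepted"  # agent finished and the reviewer accepted the patch
    WRONG = "wrong"        # agent claimed success but the reviewer rejected it
    STUCK = "stuck"        # agent stopped and surfaced a blocker


@dataclass(frozen=True)
class Run:
    task: str
    outcome: Outcome


def summarize(runs: list[Run]) -> dict[str, float]:
    """Report acceptance apart from 'did the agent finish'."""
    if not runs:
        raise ValueError("need at least one run to summarize")
    counts = Counter(run.outcome for run in runs)
    total = len(runs)
    finished = counts[Outcome.ACCEPTED] + counts[Outcome.WRONG]
    return {
        "finish_rate": finished / total,
        "acceptance_rate": counts[Outcome.ACCEPTED] / total,
        "confident_wrong_rate": counts[Outcome.WRONG] / total,
    }


if __name__ == "__main__":
    # ASSUMPTION: made-up verdicts for a six-task set, for illustration only.
    sample = [
        Run("multi-file-change", Outcome.WRONG),
        Run("failing-test-loop", Outcome.ACCEPTED),
        Run("pinned-dependency", Outcome.WRONG),
        Run("long-horizon", Outcome.ACCEPTED),
        Run("token-budget", Outcome.ACCEPTED),
        Run("unsolvable-blocker", Outcome.STUCK),
    ]
    for metric, value in summarize(sample).items():
        print(f"{metric}: {value:.2f}")
```

Keeping `confident_wrong_rate` as its own metric makes the gap between finishing and being accepted visible, which a single pass rate hides.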
Contamination. SWE-bench Verified is built from real, public GitHub issues. Frontier LLMs were almost certainly trained on the repos and the merged fixes. Treat the published 80% as a ceiling pinned by data leakage, not a measurement of generalisable agent quality.
Scaffolding inflation. Vendor scores use vendor-specific harnesses — custom retrieval, retry policies, judge LLMs — that are rarely open-sourced. Same model in a thin loop scores 10–15 points lower. The benchmark increasingly measures the harness, not the LLM.
SWE-bench Pro split. OpenAI shifted its public reporting to SWE-bench Pro (a private, harder, contamination-resistant set) citing exactly these concerns. Anthropic and most of the open-source community still report Verified. The ecosystem is split on which number to trust.
The honest takeaway: use SWE-bench Verified as a coarse filter (“is this in the right league?”), then evaluate finalists on tasks from your own backlog with your own harness. The matrix above gives you the inputs; your eval gives you the answer.
The leaderboards engineers actually look at when picking a coding agent. Each measures a different thing — saturate one and the next becomes the reporting standard.
OpenAI- and Princeton-curated subset of SWE-bench: real GitHub issues with hidden test suites, manually filtered for solvability and fairness. The default 2026 leaderboard.
OpenAI’s response to contamination concerns. Held-out, harder, less likely to overlap with pre-training data. The reporting standard OpenAI now leads with.
Tests an agent’s ability to apply correct edits to existing code across Python, JS, Rust, Go, C++, and Java. The reference benchmark for “does the model edit, not just generate.”
Problems pulled from LeetCode, AtCoder, and Codeforces by date. Filter to post-cutoff problems to get a contamination-resistant view of real coding ability.
Function-level coding with real Python library calls (data, web, ML). Harder than HumanEval because tasks need correct API usage, not just algorithmic insight.
The original LLM coding benchmark. Saturated above 95% by every frontier model and kept for historical comparability only. Not a 2026 buying signal.
Treat 80% as ceiling, not 100%. SWE-bench Verified scores already assume a generous helping of contamination. A vendor reporting 85% next quarter isn’t 5 points better; it’s 5 points more leaky.
Pick by harness ergonomics, not just LLM. The harness can swing the score 15–20 points. Choose the one that fits your stack (terminal vs IDE vs cloud sandbox), the LLM you already have a contract for, and the review workflow your team already runs.
Open-weights agents cost ~1/20th per task. OpenHands or Aider on top of Qwen3-Coder or DeepSeek V3.2 lands the same kinds of tasks at roughly 5% of Devin’s per-task cost. You eat sandbox setup, secret management, and observability — usually worth it past a few hundred tasks a month.
Sandbox isolation matters in production. Local-shell agents (Claude Code, Aider, Cursor) are productive but they can rm -rf things. Run them in git worktrees, deny destructive commands by default, and never give an agent your prod credentials.
Real value is maintenance, not greenfield. Where these agents pay back is the boring stuff: dependency upgrades, lint fixes, test scaffolding, deprecated-API migrations. Greenfield architecture decisions still need a senior human in the loop.
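The "deny destructive commands by default" advice can be approximated with an allowlist check placed in front of whatever executes the agent's shell requests. The sketch below is illustrative only: the allowed programs, the denied git subcommands, and the function name `is_allowed` are assumptions, not any agent's actual policy. A string filter like this is a seatbelt, not a sandbox; real isolation still needs a container or VM.

```python
import shlex

# ASSUMPTION: a deliberately small allowlist; anything not listed is refused.
ALLOWED_PROGRAMS = {"ls", "cat", "grep", "git", "pytest"}

# ASSUMPTION: git subcommands that rewrite or publish history are refused.
DENIED_GIT_SUBCOMMANDS = {"push", "reset", "clean"}

# Characters that let one approved command chain, redirect, or expand into another.
SHELL_METACHARACTERS = set(";|&`$<>\n")


def is_allowed(command: str) -> bool:
    """Return True only for commands that pass every check (deny by default)."""
    if SHELL_METACHARACTERS & set(command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes: refuse rather than guess
    if not tokens:
        return False
    program, args = tokens[0], tokens[1:]
    if program not in ALLOWED_PROGRAMS:
        return False  # also rejects path-qualified names such as /tmp/ls
    if program == "git" and any(arg in DENIED_GIT_SUBCOMMANDS for arg in args):
        return False  # conservative: matches the word anywhere in the arguments
    return True


if __name__ == "__main__":
    for cmd in ["pytest -q tests/", "git status", "rm -rf .", "ls; rm -rf /"]:
        print(f"{cmd!r}: {'allow' if is_allowed(cmd) else 'deny'}")
```

Approved commands are best executed as an argument list without a shell, and inside a dedicated worktree, for example one created with `git worktree add -b agent/task ../agent-task` (branch and path names here are placeholders).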
CodeSOTA’s SWE-bench page is read by engineering leaders picking the agent their team will live in. If you represent a vendor above — or one we missed — claim the listing to submit verified pricing, harness configuration, sandbox details, and a current Verified score. Free; credibility-gated, not pay-to-play.