# Agentic AI Benchmarks
Real-world evaluations for AI coding agents — software engineering, security analysis, observability instrumentation, and long-horizon task planning.
## Benchmark Overview
Each benchmark tests a distinct dimension of agent capability — from patching real bugs to reverse-engineering malware.
| Benchmark | Category | SOTA | Models |
|---|---|---|---|
| SWE-bench Verified | Software Engineering | 76.8% | 8 |
| BinaryAudit | Security | 49% | 26 |
| OTelBench | Observability | 29% | 14 |
| METR Time Horizon | Autonomy | 160 min | 7 |
| YC-Bench | Long-horizon Planning | $1.27M | 12 |
## SWE-bench Verified
500 hand-verified GitHub issues from 12 popular Python repositories. The gold standard for software engineering agents.
| # | Model | Org | Resolve Rate |
|---|---|---|---|
| ★ | Claude Opus 4.5 (high reasoning) | Anthropic | |
| 2 | Gemini 3 Flash (high reasoning) | Google | |
| 3 | MiniMax M2.5 (high reasoning) | MiniMax | |
| 4 | Claude Sonnet 4.6 | Anthropic | |
| 5 | GPT-5.2-Codex | OpenAI | |
| 6 | Gemini 2.5 Pro | Google | |
| 7 | Claude 3.7 Sonnet | Anthropic | |
| 8 | GPT-4o | OpenAI | |
Source: swebench.com · Results measured with agentic scaffolds on the verified subset.
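The resolve rate is the fraction of issues whose gold tests pass after the agent's patch is applied. A minimal sketch of that aggregation; the field names (`instance_id`, `resolved`) are illustrative assumptions, not the actual SWE-bench harness schema:

```python
def resolve_rate(results: list[dict]) -> float:
    """Fraction of instances whose gold tests pass after patching."""
    if not results:
        return 0.0
    resolved = sum(1 for r in results if r["resolved"])
    return resolved / len(results)

# Hypothetical per-instance outcomes for illustration only.
results = [
    {"instance_id": "django__django-11099", "resolved": True},
    {"instance_id": "sympy__sympy-13480", "resolved": False},
    {"instance_id": "requests__requests-2317", "resolved": True},
    {"instance_id": "flask__flask-4045", "resolved": True},
]
print(f"{resolve_rate(results):.1%}")  # prints 75.0%
```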
## BinaryAudit
33 tasks testing backdoor and time-bomb detection in ~40 MB compiled binaries, using reverse-engineering tools (Ghidra, radare2). Evaluates 26 models on real security analysis.
| # | Model | Org | Detection Rate |
|---|---|---|---|
| ★ | Gemini 3.1 Pro Preview | Google | |
| 2 | Claude Opus 4.6 | Anthropic | |
| 3 | GPT-5.2 Codex XHigh | OpenAI | |
| 4 | Gemini 3 Pro Preview | Google | |
| 5 | GPT-5.3 Codex XHigh | OpenAI | |
| 6 | Claude Sonnet 4.6 | Anthropic | |
| 7 | DeepSeek v3.2 | DeepSeek | |
| 8 | Grok 4.1-Fast | xAI | |
Source: QuesmaOrg/binaryaudit · Full results
## OTelBench
23 tasks across 11 programming languages. Tests AI agents on adding distributed tracing, metrics, and logging to real codebases using OpenTelemetry SDKs.
| # | Model | Org | Pass Rate |
|---|---|---|---|
| ★ | claude-opus-4.5 | Anthropic | |
| 2 | gpt-5.2 | OpenAI | |
| 3 | claude-sonnet-4.5 | Anthropic | |
| 4 | gemini-3-flash-preview | Google | |
| 5 | gemini-3-pro-preview | Google | |
| 6 | gpt-5.2-codex | OpenAI | |
| 7 | gpt-5.1 | OpenAI | |
| 8 | glm-4.7 | Z.ai | |
| 9 | deepseek-v3.2 | DeepSeek | |
| 10 | gpt-5.1-codex-max | OpenAI | |
| 11 | kimi-k2-thinking | Moonshot AI | |
| 12 | claude-haiku-4.5 | Anthropic | |
| 13 | grok-4 | xAI | |
| 14 | grok-4.1-fast | xAI | |
Source: QuesmaOrg/otel-bench · Full results · Overall average: 14% pass rate across all models.
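The instrumentation work these tasks demand amounts to wrapping operations in spans that record names, attributes, and timing. A minimal stdlib sketch of that idea, not the real OpenTelemetry SDK (whose API, exporters, and context propagation are more involved):

```python
import time
from contextlib import contextmanager

SPANS = []  # collected spans; a real exporter would ship these to a backend

@contextmanager
def span(name: str, **attributes):
    """Record the name, attributes, and wall-clock duration of a block."""
    start = time.monotonic()
    try:
        yield
    finally:
        SPANS.append({
            "name": name,
            "attributes": attributes,
            "duration_s": time.monotonic() - start,
        })

# Nested spans: the inner one finishes (and is recorded) first.
with span("checkout", user_id="u-42"):
    with span("charge_card", amount_cents=1999):
        pass  # payment logic would go here

print([s["name"] for s in SPANS])  # prints ['charge_card', 'checkout']
```

The benchmark's harder twist is doing this idiomatically across 11 languages with each language's actual OpenTelemetry SDK, which is where most models' pass rates collapse.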
## METR Time Horizon
Measures how long an AI agent can work autonomously before failing or requiring human intervention. The 50% time horizon is the task length at which the agent succeeds half the time.
| # | Model | Org | 50% Time Horizon |
|---|---|---|---|
| ★ | GPT-5.1-Codex-Max | OpenAI | 160 min |
| 2 | GPT-5 | OpenAI | 137 min |
| 3 | o1-preview | OpenAI | 120 min |
| 4 | GPT-4o | OpenAI | 90 min |
| 5 | Claude 3 Opus | Anthropic | 75 min |
| 6 | Claude 2.1 | Anthropic | 45 min |
| 7 | GPT-4 | OpenAI | 15 min |
Source: evaluations.metr.org · Measured time horizons double approximately every 7 months.
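The 7-month doubling trend turns into a simple exponential projection. A sketch, assuming the current 160-minute frontier and a constant doubling period (both of which are extrapolations, not guarantees):

```python
def projected_horizon(current_min: float, months_ahead: float,
                      doubling_months: float = 7.0) -> float:
    """Extrapolate a 50% time horizon under constant exponential growth."""
    return current_min * 2 ** (months_ahead / doubling_months)

# 160 min today; 14 months out is two doublings.
print(projected_horizon(160, 14))  # prints 640.0
```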
## YC-Bench
Simulates managing a startup over one year. Agents hire employees, select contracts, and must maintain profitability in a partially observable environment with adversarial clients. Each model is run with 3 seeds and $200K starting capital.
| # | Model | Org | Net Worth (avg) |
|---|---|---|---|
| ★ | Claude Opus 4.6 | Anthropic | $1.27M |
| 2 | GLM-5 | Zhipu AI | $1.21M |
| 3 | GPT-5.4 | OpenAI | $1.00M |
| 4 | Kimi-K2.5 | Moonshot AI | $409K |
| 5 | Gemini 3 Flash | Google | $394K |
| 6 | Gemini 3.1 Flash Lite | Google | $203K |
| 7 | GPT-5.4 Mini | OpenAI | $138K |
| 8 | Claude Sonnet 4.6 | Anthropic | $104K |
| 9 | Qwen 3.5-397B | Alibaba | $91K |
| 10 | Gemini 3.1 Pro | Google | $66K |
| 11 | GPT-5.4 Nano | OpenAI | $39K |
| 12 | Grok 4.20 Beta | xAI | $25K |
Source: collinear-ai/yc-bench · Paper · Six of the twelve models finished above the $200K starting capital.
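Each reported figure averages the final net worth across the benchmark's 3 seeded runs. A minimal sketch of that aggregation; the per-seed values below are made up for illustration:

```python
def avg_net_worth(seed_results: list[float]) -> float:
    """Mean final net worth across independent seeded runs."""
    return sum(seed_results) / len(seed_results)

def beat_starting_capital(seed_results: list[float],
                          starting_capital: float = 200_000) -> bool:
    """Did the model, on average, grow the business past its starting capital?"""
    return avg_net_worth(seed_results) > starting_capital

seeds = [150_000, 420_000, 390_000]  # hypothetical per-seed finals
print(avg_net_worth(seeds))          # prints 320000.0
print(beat_starting_capital(seeds))  # prints True
```

Averaging over seeds matters here because a single run in a partially observable, adversarial environment is noisy; one lucky contract can dominate the outcome.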
## Know a benchmark we're missing?
We track agentic benchmarks as they emerge. Submit a benchmark or new results to be listed here.