Coding Agents · Updated April 2026

Agentic AI Benchmarks

Real-world evaluations of AI coding agents: software engineering, security analysis, observability instrumentation, and long-horizon task planning.

5 benchmarks tracked · 40+ models evaluated · 500+ tasks across benchmarks · 76.8% SWE-bench SOTA

Benchmark Overview

Each benchmark tests a distinct dimension of agent capability — from patching real bugs to reverse-engineering malware.

| Benchmark | Category | SOTA | Models |
|---|---|---|---|
| SWE-bench Verified | Software Engineering | 76.8% | 8 |
| BinaryAudit | Security | 49% | 26 |
| OTelBench | Observability | 29% | 14 |
| METR Time Horizon | Autonomy | 160 min | 7 |
| YC-Bench | Long-horizon Planning | $1.27M | 12 |

SWE-bench Verified

500 hand-verified GitHub issues from 12 popular Python repositories. The gold standard for software engineering agents.

Software Engineering
| # | Model | Vendor | Resolve Rate |
|---|---|---|---|
| 1 | Claude Opus 4.5 (high reasoning) | Anthropic | 76.8% |
| 2 | Gemini 3 Flash (high reasoning) | Google | 75.8% |
| 3 | MiniMax M2.5 (high reasoning) | MiniMax | 73% |
| 4 | Claude Sonnet 4.6 | Anthropic | 65.2% |
| 5 | GPT-5.2-Codex | OpenAI | 62.4% |
| 6 | Gemini 2.5 Pro | Google | 55.3% |
| 7 | Claude 3.7 Sonnet | Anthropic | 49% |
| 8 | GPT-4o | OpenAI | 38.8% |

Source: swebench.com · Results measured with agentic scaffolds on the Verified subset.
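
How a "resolved" instance is scored: the agent's patch must make the issue's previously failing tests pass without breaking the tests that already passed. A minimal sketch of that check (the official harness runs it inside per-instance Docker images with pinned dependencies; the paths and test IDs here are illustrative):

```python
import subprocess

def is_resolved(repo_dir: str, patch_file: str,
                fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    """Rough sketch of a SWE-bench-style resolution check.

    Assumes `repo_dir` is a clean checkout at the instance's base commit,
    `patch_file` is the agent's unified diff (absolute path), and the test
    IDs are pytest node IDs.
    """
    # Apply the model-generated patch to the repository.
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)

    def tests_pass(test_ids: list[str]) -> bool:
        result = subprocess.run(["python", "-m", "pytest", *test_ids],
                                cwd=repo_dir, capture_output=True)
        return result.returncode == 0

    # Resolved = previously failing tests now pass AND
    # previously passing tests still pass (no regressions).
    return tests_pass(fail_to_pass) and tests_pass(pass_to_pass)
```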

BinaryAudit

33 tasks testing backdoor and time-bomb detection in ~40 MB compiled binaries using reverse-engineering tools (Ghidra, radare2). Evaluates 26 models on realistic security analysis; the top eight are shown below.

Security
| # | Model | Vendor | Detection Rate |
|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | Google | 49% |
| 2 | Claude Opus 4.6 | Anthropic | 49% |
| 3 | GPT-5.2 Codex XHigh | OpenAI | 46% |
| 4 | Gemini 3 Pro Preview | Google | 44% |
| 5 | GPT-5.3 Codex XHigh | OpenAI | 42% |
| 6 | Claude Sonnet 4.6 | Anthropic | 31% |
| 7 | DeepSeek v3.2 | DeepSeek | 18% |
| 8 | Grok 4.1-Fast | xAI | 12% |

Source: QuesmaOrg/binaryaudit · Full results
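
The tasks are won or lost on how well an agent drives the disassembly tooling. A rough sketch of the kind of triage step an agent might script through radare2's Python bindings (r2pipe); the binary path and the string heuristic are illustrative, not part of the benchmark harness:

```python
import r2pipe

# Open the target binary and run radare2's full analysis pass.
r2 = r2pipe.open("./suspect_binary")  # illustrative path
r2.cmd("aaa")

# List discovered functions as JSON and sort by size: unusually large,
# rarely-called functions are a common starting point when hunting for
# backdoor or time-bomb logic.
functions = r2.cmdj("aflj") or []
functions.sort(key=lambda f: f.get("size", 0), reverse=True)
for fn in functions[:10]:
    print(f"{fn['offset']:#x}  {fn.get('size', 0):6d}  {fn['name']}")

# Flag strings that often accompany time-based triggers.
for s in r2.cmdj("izj") or []:
    if any(k in s.get("string", "").lower() for k in ("date", "time", "sleep")):
        print("suspicious string:", s["string"])
```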

OTelBench

23 tasks across 11 programming languages. Tests AI agents on adding distributed tracing, metrics, and logging to real codebases using OpenTelemetry SDKs.

Observability
| # | Model | Vendor | Pass Rate |
|---|---|---|---|
| 1 | claude-opus-4.5 | Anthropic | 29% |
| 2 | gpt-5.2 | OpenAI | 26% |
| 3 | claude-sonnet-4.5 | Anthropic | 22% |
| 4 | gemini-3-flash-preview | Google | 19% |
| 5 | gemini-3-pro-preview | Google | 16% |
| 6 | gpt-5.2-codex | OpenAI | 16% |
| 7 | gpt-5.1 | OpenAI | 14% |
| 8 | glm-4.7 | Z.ai | 13% |
| 9 | deepseek-v3.2 | DeepSeek | 12% |
| 10 | gpt-5.1-codex-max | OpenAI | 12% |
| 11 | kimi-k2-thinking | Moonshot AI | 7% |
| 12 | claude-haiku-4.5 | Anthropic | 6% |
| 13 | grok-4 | xAI | 4% |
| 14 | grok-4.1-fast | xAI | 3% |

Source: QuesmaOrg/otel-bench · Full results · Overall average: 14% pass rate across all models.
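
A pass requires instrumentation that actually compiles and emits usable telemetry. For the Python-language tasks, the kind of change being asked for looks roughly like the sketch below, using the public opentelemetry-api surface (service, function, and attribute names are illustrative):

```python
from opentelemetry import trace, metrics

# Acquire a tracer and meter from whatever provider the host application
# configures (the benchmark repos wire up exporters themselves; the agent's
# job is to add the instrumentation points).
tracer = trace.get_tracer("checkout-service")  # illustrative name
meter = metrics.get_meter("checkout-service")
order_counter = meter.create_counter(
    "orders_processed", description="Number of processed orders"
)

def process_order(order_id: str, total_cents: int) -> None:
    # Wrap the unit of work in a span and record a metric with attributes.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.total_cents", total_cents)
        order_counter.add(1, {"currency": "USD"})
        # ... existing business logic ...
```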

METR Time Horizon

Measures how long an AI agent can work autonomously before failing or requiring human intervention. The 50% time horizon is the task length, expressed as the time a skilled human would need to complete it, at which the agent succeeds half the time.

Autonomy
| # | Model | Vendor | 50% Time Horizon |
|---|---|---|---|
| 1 | GPT-5.1-Codex-Max | OpenAI | 160 min |
| 2 | GPT-5 | OpenAI | 137 min |
| 3 | o1-preview | OpenAI | 120 min |
| 4 | GPT-4o | OpenAI | 90 min |
| 5 | Claude 3 Opus | Anthropic | 75 min |
| 6 | Claude 2.1 | Anthropic | 45 min |
| 7 | GPT-4 | OpenAI | 15 min |

Source: evaluations.metr.org · The measured 50% time horizon has doubled approximately every 7 months.
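
The 50% time horizon is derived from a curve fit over many tasks, not from any single run. A rough sketch of that computation, assuming per-task records of human completion time and agent success (METR's published methodology uses per-model logistic fits with confidence intervals):

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(log_minutes, a, b):
    # Probability of success as a function of log2(task length in minutes).
    return 1.0 / (1.0 + np.exp(-(a - b * log_minutes)))

def fifty_percent_horizon(task_minutes, successes):
    """Fit P(success | task length) and return the length where it crosses 0.5."""
    x = np.log2(np.asarray(task_minutes, dtype=float))
    y = np.asarray(successes, dtype=float)
    (a, b), _ = curve_fit(sigmoid, x, y, p0=(1.0, 1.0))
    return 2 ** (a / b)  # sigmoid(a - b*x) = 0.5 when x = a / b

# Illustrative data: (human task length in minutes, did the agent succeed?)
lengths = [2, 5, 10, 30, 60, 120, 240, 480]
successes = [1, 1, 1, 1, 0, 1, 0, 0]
print(f"50% time horizon ~ {fifty_percent_horizon(lengths, successes):.0f} min")

# The reported ~7-month doubling trend extrapolates as
# horizon(t_months) ~ horizon_now * 2 ** (t_months / 7).
```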

YC-Bench

Simulates managing a startup over one year. Agents hire employees, select contracts, and try to stay profitable in a partially observable environment with adversarial clients. Scores are net worth averaged over 3 seeds, starting from $200K in capital.

Long-horizon Planning
| # | Model | Vendor | Net Worth (avg) |
|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | $1.27M |
| 2 | GLM-5 | Zhipu AI | $1.21M |
| 3 | GPT-5.4 | OpenAI | $1.00M |
| 4 | Kimi-K2.5 | Moonshot AI | $409K |
| 5 | Gemini 3 Flash | Google | $394K |
| 6 | Gemini 3.1 Flash Lite | Google | $203K |
| 7 | GPT-5.4 Mini | OpenAI | $138K |
| 8 | Claude Sonnet 4.6 | Anthropic | $104K |
| 9 | Qwen 3.5-397B | Alibaba | $91K |
| 10 | Gemini 3.1 Pro | Google | $66K |
| 11 | GPT-5.4 Nano | OpenAI | $39K |
| 12 | Grok 4.20 Beta | xAI | $25K |

Source: collinear-ai/yc-bench · Paper · Only three models exceeded $1M in net worth; half the field finished below the $200K starting capital.
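
Much of the benchmark comes down to repeated trade-offs between payroll and uncertain contract revenue. A toy sketch of the expected-value reasoning a policy needs; every name and number here is hypothetical, and the real environment is only partially observable with a much richer action space:

```python
from dataclasses import dataclass

@dataclass
class Contract:
    name: str
    payout: float            # revenue if delivered and paid
    win_probability: float   # chance the (possibly adversarial) client pays
    engineer_months: float   # effort required to deliver

MONTHLY_SALARY = 12_000.0  # hypothetical fully-loaded cost per engineer

def expected_profit(c: Contract) -> float:
    # Expected revenue minus the payroll spent delivering the contract.
    return c.payout * c.win_probability - c.engineer_months * MONTHLY_SALARY

def pick_contracts(offers: list[Contract], capacity_months: float) -> list[Contract]:
    """Greedy selection by expected profit per engineer-month, within capacity."""
    ranked = sorted(offers, key=lambda c: expected_profit(c) / c.engineer_months,
                    reverse=True)
    chosen, used = [], 0.0
    for c in ranked:
        if expected_profit(c) > 0 and used + c.engineer_months <= capacity_months:
            chosen.append(c)
            used += c.engineer_months
    return chosen

offers = [
    Contract("enterprise pilot", payout=180_000, win_probability=0.6, engineer_months=6),
    Contract("quick integration", payout=40_000, win_probability=0.9, engineer_months=2),
    Contract("too-good-to-be-true", payout=500_000, win_probability=0.05, engineer_months=10),
]
print([c.name for c in pick_contracts(offers, capacity_months=8)])
```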

Know a benchmark we're missing?

We track agentic benchmarks as they emerge. Submit a benchmark or new results to be listed here.
