# Agentic AI Benchmarks
Real-world evaluations for AI coding agents — software engineering, security analysis, observability instrumentation, and long-horizon task planning.
## Benchmark Overview
Each benchmark tests a distinct dimension of agent capability — from patching real bugs to reverse-engineering malware.
| Benchmark | Category | SOTA | Models |
|---|---|---|---|
| SWE-bench Verified | Software Engineering | 76.8% | 8 |
| BinaryAudit | Security | 49% | 26 |
| OTelBench | Observability | 29% | 14 |
| METR Time Horizon | Autonomy | 160 min | 7 |
| YC-Bench | Long-horizon Planning | $1.27M | 12 |
## SWE-bench Verified
500 hand-verified GitHub issues from 12 popular Python repositories. The gold standard for software engineering agents.
| # | Model | Org | Resolve Rate |
|---|---|---|---|
| ★ | Claude Opus 4.5 (high reasoning) | Anthropic | |
| 2 | Gemini 3 Flash (high reasoning) | Google | |
| 3 | MiniMax M2.5 (high reasoning) | MiniMax | |
| 4 | Claude Sonnet 4.6 | Anthropic | |
| 5 | GPT-5.2-Codex | OpenAI | |
| 6 | Gemini 2.5 Pro | Google | |
| 7 | Claude 3.7 Sonnet | Anthropic | |
| 8 | GPT-4o | OpenAI | |
Source: swebench.com · Results measured with agentic scaffolds on the verified subset.
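The resolve rate is the fraction of issues whose gold tests pass after the agent's patch is applied. A minimal sketch of that aggregation; the field names (`instance_id`, `resolved`) are illustrative assumptions, not the actual SWE-bench harness schema:

```python
def resolve_rate(results: list[dict]) -> float:
    """Fraction of instances whose gold tests pass after patching."""
    if not results:
        return 0.0
    resolved = sum(1 for r in results if r["resolved"])
    return resolved / len(results)

# Hypothetical per-instance outcomes for illustration only.
results = [
    {"instance_id": "django__django-11099", "resolved": True},
    {"instance_id": "sympy__sympy-13480", "resolved": False},
    {"instance_id": "requests__requests-2317", "resolved": True},
    {"instance_id": "flask__flask-4045", "resolved": True},
]
print(f"{resolve_rate(results):.1%}")  # prints 75.0%
```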
## BinaryAudit
33 tasks testing backdoor and time-bomb detection in ~40 MB compiled binaries, using reverse-engineering tools (Ghidra, radare2). Evaluates 26 models on real security analysis.
| # | Model | Org | Detection Rate |
|---|---|---|---|
| ★ | Gemini 3.1 Pro Preview | Google | |
| 2 | Claude Opus 4.6 | Anthropic | |
| 3 | GPT-5.2 Codex XHigh | OpenAI | |
| 4 | Gemini 3 Pro Preview | Google | |
| 5 | GPT-5.3 Codex XHigh | OpenAI | |
| 6 | Claude Sonnet 4.6 | Anthropic | |
| 7 | DeepSeek v3.2 | DeepSeek | |
| 8 | Grok 4.1-Fast | xAI | |
Source: QuesmaOrg/binaryaudit · Full results
## OTelBench
23 tasks across 11 programming languages. Tests AI agents on adding distributed tracing, metrics, and logging to real codebases using OpenTelemetry SDKs.
| # | Model | Org | Pass Rate |
|---|---|---|---|
| ★ | claude-opus-4.5 | Anthropic | |
| 2 | gpt-5.2 | OpenAI | |
| 3 | claude-sonnet-4.5 | Anthropic | |
| 4 | gemini-3-flash-preview | Google | |
| 5 | gemini-3-pro-preview | Google | |
| 6 | gpt-5.2-codex | OpenAI | |
| 7 | gpt-5.1 | OpenAI | |
| 8 | glm-4.7 | Z.ai | |
| 9 | deepseek-v3.2 | DeepSeek | |
| 10 | gpt-5.1-codex-max | OpenAI | |
| 11 | kimi-k2-thinking | Moonshot AI | |
| 12 | claude-haiku-4.5 | Anthropic | |
| 13 | grok-4 | xAI | |
| 14 | grok-4.1-fast | xAI | |
Source: QuesmaOrg/otel-bench · Full results · Overall average: 14% pass rate across all models.
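The instrumentation work these tasks demand amounts to wrapping operations in spans that record names, attributes, and timing. A minimal stdlib sketch of that idea, not the real OpenTelemetry SDK (whose API, exporters, and context propagation are more involved):

```python
import time
from contextlib import contextmanager

SPANS = []  # collected spans; a real exporter would ship these to a backend

@contextmanager
def span(name: str, **attributes):
    """Record the name, attributes, and wall-clock duration of a block."""
    start = time.monotonic()
    try:
        yield
    finally:
        SPANS.append({
            "name": name,
            "attributes": attributes,
            "duration_s": time.monotonic() - start,
        })

# Nested spans: the inner one finishes (and is recorded) first.
with span("checkout", user_id="u-42"):
    with span("charge_card", amount_cents=1999):
        pass  # payment logic would go here

print([s["name"] for s in SPANS])  # prints ['charge_card', 'checkout']
```

The benchmark's harder twist is doing this idiomatically across 11 languages with each language's actual OpenTelemetry SDK, which is where most models' pass rates collapse.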
## METR Time Horizon
Measures how long an AI agent can work autonomously before failing or requiring human intervention. The 50% time horizon is the task length at which the agent succeeds half the time.
| # | Model | Org | 50% Time Horizon |
|---|---|---|---|
| ★ | GPT-5.1-Codex-Max | OpenAI | 160 min |
| 2 | GPT-5 | OpenAI | 137 min |
| 3 | o1-preview | OpenAI | 120 min |
| 4 | GPT-4o | OpenAI | 90 min |
| 5 | Claude 3 Opus | Anthropic | 75 min |
| 6 | Claude 2.1 | Anthropic | 45 min |
| 7 | GPT-4 | OpenAI | 15 min |
Source: evaluations.metr.org · Measured time horizons double approximately every 7 months.
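The 7-month doubling trend turns into a simple exponential projection. A sketch, assuming the current 160-minute frontier and a constant doubling period (both of which are extrapolations, not guarantees):

```python
def projected_horizon(current_min: float, months_ahead: float,
                      doubling_months: float = 7.0) -> float:
    """Extrapolate a 50% time horizon under constant exponential growth."""
    return current_min * 2 ** (months_ahead / doubling_months)

# 160 min today; 14 months out is two doublings.
print(projected_horizon(160, 14))  # prints 640.0
```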
## YC-Bench
Simulates managing a startup over one year. Agents hire employees, select contracts, and must maintain profitability in a partially observable environment with adversarial clients. Each model is run with 3 seeds and $200K starting capital.
| # | Model | Org | Net Worth (avg) |
|---|---|---|---|
| ★ | Claude Opus 4.6 | Anthropic | $1.27M |
| 2 | GLM-5 | Zhipu AI | $1.21M |
| 3 | GPT-5.4 | OpenAI | $1.00M |
| 4 | Kimi-K2.5 | Moonshot AI | $409K |
| 5 | Gemini 3 Flash | Google | $394K |
| 6 | Gemini 3.1 Flash Lite | Google | $203K |
| 7 | GPT-5.4 Mini | OpenAI | $138K |
| 8 | Claude Sonnet 4.6 | Anthropic | $104K |
| 9 | Qwen 3.5-397B | Alibaba | $91K |
| 10 | Gemini 3.1 Pro | Google | $66K |
| 11 | GPT-5.4 Nano | OpenAI | $39K |
| 12 | Grok 4.20 Beta | xAI | $25K |
Source: collinear-ai/yc-bench · Paper · Six of the twelve models finished above the $200K starting capital.
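Each reported figure averages the final net worth across the benchmark's 3 seeded runs. A minimal sketch of that aggregation; the per-seed values below are made up for illustration:

```python
def avg_net_worth(seed_results: list[float]) -> float:
    """Mean final net worth across independent seeded runs."""
    return sum(seed_results) / len(seed_results)

def beat_starting_capital(seed_results: list[float],
                          starting_capital: float = 200_000) -> bool:
    """Did the model, on average, grow the business past its starting capital?"""
    return avg_net_worth(seed_results) > starting_capital

seeds = [150_000, 420_000, 390_000]  # hypothetical per-seed finals
print(avg_net_worth(seeds))          # prints 320000.0
print(beat_starting_capital(seeds))  # prints True
```

Averaging over seeds matters here because a single run in a partially observable, adversarial environment is noisy; one lucky contract can dominate the outcome.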
## Know a benchmark we're missing?
We track agentic benchmarks as they emerge. Submit a benchmark or new results to be listed here.