Benchmark lineage

Coding Benchmarks

13 benchmarks · 12 edges · Updated 2026-04-26

How code-generation evaluation moved from short Python functions to repository-scale software engineering. The attention path tracks where frontier focus has migrated; branches show specialised variants and successors that remain active in their own right.

Editor's note

APPS (May 2021) was the first widely-cited coding benchmark of the LLM era; two months later OpenAI shipped HumanEval, purpose-built for Codex evaluation, and attention migrated within a year. HumanEval and MBPP both saturated by 2023: frontier models hit >95% pass@1, leaving no signal. EvalPlus (HumanEval+, MBPP+) reopened the gap with adversarial tests. Attention then jumped to LiveCodeBench (contamination-free by date) and SWE-bench Verified (repo-scale, human-filtered). In September 2025, OpenAI publicly announced it no longer evaluates on SWE-bench Verified: flawed tests reward shortcuts and training-data leakage inflates scores. SWE-bench Pro (Scale AI, arXiv:2509.16941) is the current attention path: 1,865 problems across public/commercial/held-out splits, where GPT-5 and Claude Opus 4.1 land at ~23% vs >70% on Verified.

§ 01 · Lineage graph

Attention path plus branches.

Solid arrows follow the attention path; dashed arrows mark scope shifts, where leaderboard attention jumps between task families. Thin grey arcs drop down to specialised branches.

[Lineage graph. Attention path: APPS (May 2021) → HumanEval (Jul 2021, SOTA 97.3%) → HumanEval+ (May 2023) → LiveCodeBench (Sep 2023, SOTA 91.7%) → SWE-bench (Oct 2023, SOTA 82.1%) → SWE-bench Verified (Aug 2024, SOTA 87.6%) → SWE-bench Pro (Sep 2025). Branches: HumanEval → MBPP (Aug 2021, SOTA 94.9%) → MBPP+ (May 2023); HumanEval → MultiPL-E (Aug 2022); HumanEval → CodeContests (Feb 2022); SWE-bench Verified → Multi-SWE-bench (Apr 2025); SWE-bench Verified → Terminal-Bench (Oct 2025). Legend: attention path · scope shift · branch/fork · active · saturating · saturated/superseded.]
APPS → HumanEval · direct successor · attention
Two months after APPS, OpenAI shipped HumanEval, purpose-built for Codex evaluation. Smaller (164 problems vs 10,000) but with cleaner unit-test scaffolding, and the per-problem signal proved sharper at the frontier; leaderboard attention migrated within a year.
HumanEval → HumanEval+ · direct successor · attention
EvalPlus added 80× more test cases per problem to catch the edge cases the original tests missed. Reopened the gap on saturated leaderboards.
HumanEval → MBPP · variant
Companion benchmark from Google, released a month later. Same Python-function-synthesis task, broader and easier.
HumanEval → MultiPL-E · fork
MultiPL-E translates HumanEval (and MBPP) into 18+ programming languages.
HumanEval → CodeContests · scope shift
CodeContests jumps from function-level to competitive-programming difficulty — same task family, harder reasoning.
MBPP → MBPP+ · direct successor
Same EvalPlus adversarial-test treatment applied to MBPP.
HumanEval+ → LiveCodeBench · scope shift · attention
Where leaderboard attention moved once EvalPlus problems also began saturating. LiveCodeBench's by-date contamination control became the new credibility floor.
LiveCodeBench → SWE-bench · scope shift · attention
From contest-style problems to real-world software engineering — issues, multi-file edits, regression tests. Different task, but the same field's frontier.
SWE-bench → SWE-bench Verified · direct successor · attention
Human-filtered subset of 500 verified-solvable tasks. The original SWE-bench is rarely quoted now; Verified is what agentic-coding evals report.
SWE-bench Verified → Multi-SWE-bench · fork
Multi-language fork (Java, TypeScript, Go, Rust, C/C++). A parallel branch rather than the main attention path.
SWE-bench Verified → SWE-bench Pro · direct successor · attention
OpenAI publicly stopped evaluating on Verified in Sep 2025, citing contamination and tests that reward shortcuts. Pro adds held-out splits, commercial repos, and contamination control; GPT-5 and Claude Opus 4.1 drop from >70% on Verified to ~23% on Pro.
SWE-bench Verified → Terminal-Bench · scope shift
Parallel branch, not a direct successor: SWE-bench Pro fixes Verified's contamination on the same "fix one GitHub issue" task, while Terminal-Bench changes the task to full terminal/devops/data sessions inside a Docker sandbox, with the agent harness scored as part of the system. Same era, different scope. The frontier closed agent (Codex + GPT-5.5) currently scores 82.0%; open-weight harnesses trail by ~30pp.
§ 02 · Benchmarks in this lineage

Nodes in detail.

May 2021 · Saturated

APPS

Automated Programming Progress Standard

10,000 Python problems scraped from coding sites at three difficulty tiers (introductory, interview, competition). The first widely-shared coding benchmark of the LLM era, from the same Hendrycks group that built MMLU. Preceded HumanEval by two months and is the closest direct ancestor of the function-synthesis line.

Hendrycks et al. · paper
Jul 2021 · Saturated

HumanEval

HumanEval Python Function Synthesis

164 hand-written Python problems with unit tests, shipped alongside Codex. Pass@1 became the standard code-quality metric, and HumanEval the default coding leaderboard until it saturated.
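
For reference, a minimal sketch of the unbiased pass@k estimator introduced alongside HumanEval (the original implementation uses numpy; this plain-Python version is equivalent). At k = 1 it reduces to c/n, the plain fraction of passing samples.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from Chen et al. (2021): the probability that at
    least one of k samples, drawn without replacement from n generations
    of which c are correct, passes every unit test."""
    if n - c < k:
        return 1.0  # fewer than k failing samples: success is guaranteed
    # 1 - C(n-c, k) / C(n, k), expanded as a numerically stable product
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

assert abs(pass_at_k(10, 1, 1) - 0.1) < 1e-12   # pass@1 = c/n
assert pass_at_k(10, 1, 10) == 1.0              # all samples drawn: must hit
```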

Chen et al. (OpenAI) · paper
Aug 2021 · Saturated

MBPP

Mostly Basic Python Problems

974 entry-level Python problems crowdsourced from non-experts. Companion to HumanEval — broader coverage, easier on average, similar saturation curve.

Austin et al. (Google) · paper

Feb 2022

CodeContests

CodeContests Competitive Programming

Codeforces-style competitive programming problems. Harder algorithmic reasoning than HumanEval; requires multi-sample generation to score well.

Li et al. (DeepMind, AlphaCode) · paper
Aug 2022Active

MultiPL-E

Multi-Programming-Language Evaluation

HumanEval and MBPP translated into 18+ languages. Tests whether code-LLMs generalise beyond Python or just memorised it.

Cassano et al. · paper

May 2023

HumanEval+

HumanEval+ (EvalPlus)

80× more test cases per problem, automatically generated to catch the edge cases the original tests missed. Reopened the leaderboard gap that HumanEval's saturation had closed.
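
An illustration of the failure mode those extra tests catch. below_zero is a real HumanEval task, but the buggy solution and test inputs here are invented for the example, not drawn from the actual EvalPlus suite:

```python
def below_zero(operations: list[int]) -> bool:
    """HumanEval-style task: does a running balance ever dip below zero?"""
    return sum(operations) < 0  # buggy shortcut: looks only at the final balance

# A sparse, original-style test happens to pass:
print(below_zero([1, 2, -4]))  # True (final balance -1, which also dips below 0)
# An EvalPlus-style adversarial input exposes the bug: the balance dips
# to -1 mid-stream but ends at +2, so the shortcut wrongly returns False.
print(below_zero([1, -2, 3]))  # False, but the correct answer is True
```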

Liu et al. · paper

May 2023

MBPP+

MBPP+ (EvalPlus)

Same EvalPlus treatment for MBPP — adversarial tests, broader coverage, hard mode.

Liu et al. · paper

Sep 2023

LiveCodeBench

LiveCodeBench Contamination-Free Coding

Continuously scrapes new LeetCode/AtCoder/Codeforces problems and dates them — results can be filtered to problems posted after a model's training cutoff, eliminating contamination. Where the leaderboard moved once HumanEval+ also began saturating.
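
A minimal sketch of the windowing idea; the records and field names are hypothetical, not LiveCodeBench's actual schema:

```python
from datetime import date

# Hypothetical problem records: each problem carries the date it was
# posted on LeetCode/AtCoder/Codeforces.
problems = [
    {"id": "lc-weekly-001", "released": date(2024, 2, 10)},
    {"id": "atcoder-abc-x", "released": date(2024, 5, 4)},
    {"id": "cf-round-y", "released": date(2024, 7, 21)},
]

training_cutoff = date(2024, 4, 30)  # the model's stated data cutoff

# Contamination-free slice: problems published after the cutoff cannot
# have appeared in the training data.
eval_set = [p for p in problems if p["released"] > training_cutoff]
print([p["id"] for p in eval_set])  # ['atcoder-abc-x', 'cf-round-y']
```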

Jain et al. (UC Berkeley, MIT, Cornell) · paper
Oct 2023 · Superseded

SWE-bench

SWE-bench (original, unfiltered)

2,294 real GitHub issue→PR pairs across 12 Python repos. The first benchmark to test whether models could function as software engineers, not just function generators. Superseded by Verified after analysis showed many issues were unsolvable as posed.
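
The scoring shape, sketched below. FAIL_TO_PASS and PASS_TO_PASS are the dataset's actual field names; the test runner is a stand-in for the real Docker-based harness:

```python
from typing import Callable

def is_resolved(instance: dict, run_test: Callable[[str], bool]) -> bool:
    """SWE-bench resolution criterion, assuming the model's patch has
    already been applied to the repo checkout: the issue's originally
    failing tests must now pass, and the originally passing tests must
    not regress."""
    fixes_issue = all(run_test(t) for t in instance["FAIL_TO_PASS"])
    no_regressions = all(run_test(t) for t in instance["PASS_TO_PASS"])
    return fixes_issue and no_regressions
```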

Jimenez et al. (Princeton) · paper
Aug 2024 · Saturating

SWE-bench Verified

SWE-bench Verified (human-filtered subset)

500 SWE-bench tasks human-confirmed to be solvable, with sufficient information in the issue and a valid test for the fix. It was the agentic-coding standard until Sep 2025, when OpenAI publicly stopped evaluating on it, citing flawed tests that reward shortcuts and training-data leakage that inflates scores.

OpenAI + SWE-bench team · paper
Apr 2025 · Active

Multi-SWE-bench

Multi-SWE-bench (multi-language fork)

Extends SWE-bench beyond Python to Java, TypeScript, Go, Rust, C, C++. A parallel multi-language branch — useful for cross-language reasoning, but not where leaderboard attention has consolidated.

ByteDance team · paper
Sep 2025 · Active

SWE-bench Pro

SWE-bench Pro (Scale AI, contamination-controlled)

1,865 problems across public/commercial/held-out splits, sourced from 41 actively-maintained business and B2B repos. Designed to fix Verified's contamination and shortcut problems: GPT-5 and Claude Opus 4.1 land at ~23% here vs >70% on Verified. The frontier benchmark OpenAI now reports against.

Scale AI · paper
Oct 2025 · Active

Terminal-Bench

Terminal-Bench 2 (Stanford · Laude Institute)

152 hand-built terminal tasks — devops, data, SWE, scientific computing — each scored by container-internal unit tests inside a Docker sandbox. Agent-coupled: the harness, prompt scaffold and underlying model are measured as one system, unlike SWE-bench where only the model is scored. A scope shift, not a successor — Codex + GPT-5.5 currently leads at 82.0%.
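
A sketch of what container-internal scoring means in practice; the container name and test path are hypothetical, not Terminal-Bench's actual harness API:

```python
import subprocess

def score_task(container: str) -> bool:
    """Run the task's verification tests inside the same Docker container
    the agent worked in: the grader checks the end state of the
    environment, not the agent's transcript."""
    result = subprocess.run(
        ["docker", "exec", container, "pytest", "/tests", "-q"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0  # pass iff every in-container test passes
```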

Stanford · Laude Institute · paper