A continuously growing pool of LeetCode, AtCoder and Codeforces problems, each tagged with its publication date. Filter scoring to problems posted after a model's training cutoff and the contamination that haunted HumanEval and MBPP disappears. Where leaderboard attention moved once the function-synthesis era ended.
Frontier reasoning models clear 80% on the latest window. The benchmark's scope stops at single-file algorithmic problems; for repo-scale evaluation, attention has moved to SWE-bench and now SWE-bench Pro.
LiveCodeBench was the first benchmark to make contamination a measured property rather than a hand-wave. It then handed the frontier-evaluation baton to SWE-bench, which switched from algorithmic contests to repo-scale software engineering — a different task entirely.
APPS (2021-05) was the first widely cited coding benchmark of the post-Codex era; OpenAI shipped the purpose-built HumanEval two months later, and attention migrated within a year. HumanEval and MBPP both saturated by 2023: frontier models hit >95% pass@1, leaving no signal. EvalPlus (HumanEval+, MBPP+) reopened the gap with adversarial tests. Attention then jumped to LiveCodeBench (contamination-free by date) and SWE-bench Verified (repo-scale, human-filtered). As of 2025-09, OpenAI has publicly announced that it no longer evaluates on SWE-bench Verified: flawed tests reward shortcuts and training-data leakage inflates scores. SWE-bench Pro (Scale AI, arXiv 2509.16941) is the current attention path: 1,865 problems across public/commercial/held-out splits, where GPT-5 and Claude Opus 4.1 land at ~23% vs. >70% on Verified.
The cards below are nodes in the curated coding lineage. Edges are typed: a scope shift means leaderboard attention jumped to a different task; a direct successor means the same task with a sharper test set.
80× more test cases per problem, automatically generated to catch the edge cases the original suites missed. Reopened the leaderboard gap that HumanEval's saturation had closed.
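A toy illustration of the idea (the problem, solution and tests here are hypothetical, not actual EvalPlus or HumanEval artifacts): a submission that passes a sparse hand-written suite can still fail once inputs are generated at scale and checked against a reference oracle.

```python
# Minimal sketch of the EvalPlus idea, not its actual generator:
# stress a candidate solution on many auto-generated inputs and
# compare every output against a trusted reference implementation.
import random
import statistics


def reference_median(xs):
    # Trusted oracle for a hypothetical "return the median" problem.
    return statistics.median(xs)


def candidate_median(xs):
    # Plausible-looking submission: correct only for odd-length lists.
    return sorted(xs)[len(xs) // 2]


def original_tests():
    # Sparse, hand-written cases like the early HumanEval suites.
    return [[1], [3, 1, 2], [5, 5, 5]]


def generated_tests(n=1000, seed=0):
    # EvalPlus-style expansion: many random/mutated inputs per problem.
    rng = random.Random(seed)
    return [
        [rng.randint(-100, 100) for _ in range(rng.randint(1, 20))]
        for _ in range(n)
    ]


def passes(tests):
    return all(candidate_median(t) == reference_median(t) for t in tests)


print("original tests:", passes(original_tests()))    # True: the bug slips through
print("generated tests:", passes(generated_tests()))  # False: even-length lists expose it
```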
Where leaderboard attention moved once EvalPlus problems also began saturating. LiveCodeBench's by-date contamination control became the new credibility floor.
Continuously scrapes new LeetCode/AtCoder/Codeforces problems and dates them; results can be filtered to problems posted after a model's training cutoff, eliminating contamination. Where the leaderboard moved once HumanEval+ also began saturating.
2,294 real GitHub issue→PR pairs across 12 Python repos. The first benchmark to test whether models could function as software engineers, not just function generators. Superseded by Verified after analysis showed many issues were unsolvable as posed.
From contest-style problems to real-world software engineering — issues, multi-file edits, regression tests. Different task, but the same field's frontier.
Each dot is a record-setting model on the latest LiveCodeBench window at the time it landed. Tracked separately so closed-source frontier progress is never confused with the open-weight catch-up curve. Reasoning chains drove most of the post-2024 acceleration on both lines.
Pass@1 on LiveCodeBench. The shaded row marks the SOTA. The evaluation window may differ across rows (v3 / v4 / v5 / v6); check the linked source for the exact problem cohort. Scores from vendor model cards and the official Berkeley leaderboard are merged.
Reasoning models from OpenAI and Anthropic dominate the top of the leaderboard. DeepSeek-V3, Qwen3-Coder and DeepSeek-R1 close the gap quickly when reasoning chains are enabled — but no open model has yet topped the chart.
Every week a fresh batch of LeetCode/AtCoder/Codeforces problems is scraped, deduplicated and added with their publish dates attached. The benchmark grows; old problems aren't deleted but can be filtered out.
Filter the dataset to only contain problems posted after a given model's training cutoff. Memorisation collapses; only generalisation remains. Vendor-claimed cutoff dates become testable.
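A minimal sketch of that cutoff filter, assuming illustrative field names rather than the actual LiveCodeBench schema:

```python
from datetime import date

# Illustrative records; real LiveCodeBench entries carry a contest/release date.
problems = [
    {"id": "lc-3200",    "source": "leetcode",   "released": date(2024, 6, 15), "solved": True},
    {"id": "abc-352-e",  "source": "atcoder",    "released": date(2024, 5, 4),  "solved": False},
    {"id": "cf-1985-f",  "source": "codeforces", "released": date(2024, 6, 8),  "solved": True},
]


def pass_rate_after_cutoff(problems, cutoff):
    """Score only on problems published after the model's claimed training cutoff."""
    fresh = [p for p in problems if p["released"] > cutoff]
    if not fresh:
        return None  # no post-cutoff window yet
    return sum(p["solved"] for p in fresh) / len(fresh)


# If the score drops sharply past the claimed cutoff, the pre-cutoff numbers were
# propped up by memorisation; a flat curve supports the vendor's stated cutoff.
print(pass_rate_after_cutoff(problems, cutoff=date(2024, 5, 31)))
```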
Code generation (write a solution), self-repair (fix a buggy solution), code execution (predict output), test output prediction. Most leaderboards show pass@1 on generation.
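Pass@1 here is, as in most code benchmarks, the Codex-paper pass@k estimator evaluated at k=1; a minimal sketch (the sample counts are made up for illustration):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex/HumanEval paper.
    n = samples drawn per problem, c = samples that pass all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# At k=1 the estimator reduces to the plain per-problem pass rate c/n;
# benchmarks then report the mean over all problems.
per_problem = [(10, 7), (10, 0), (10, 10)]  # (n, c) for three toy problems
print(sum(pass_at_k(n, c, 1) for n, c in per_problem) / len(per_problem))  # ~0.567
```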
Single-file, contest-style problems — closer to interview prep than to production work. SWE-bench is the complementary benchmark for repo-scale software engineering.