Codesota · Benchmark · LiveCodeBench
Coding lineage · contamination-controlled by date · Sep 2023

LiveCodeBench.

A continuously-growing pool of LeetCode, AtCoder and Codeforces problems, each tagged with the date it was published. Filter scores to problems posted after a model's training cutoff and the contamination that haunted HumanEval and MBPP simply disappears. Where leaderboard attention moved once the function-synthesis era ended.

Lineage status · Active · still useful, but agentic coding has moved past contests

Frontier reasoning models now clear 80% pass@1 on the latest window. The benchmark's scope stops at single-file algorithmic problems; for repo-scale evaluation, attention moved to SWE-bench and now SWE-bench Pro.

Official leaderboard · Read the paper · Full coding lineage
§ 01 · Lineage

Where LiveCodeBench fits in coding eval.

LiveCodeBench was the first benchmark to make contamination a measured property rather than a hand-wave. It then handed the frontier-evaluation baton to SWE-bench, which switched from algorithmic contests to repo-scale software engineering — a different task entirely.

APPS · May 2021 · Saturated
HumanEval · Jul 2021 · Saturated
HumanEval+ · May 2023 · Active
LiveCodeBench · Sep 2023 · Active · ◆ this page
SWE-bench · Oct 2023 · Superseded
SWE-bench Verified · Aug 2024 · Saturating
SWE-bench Pro · Sep 2025 · Active
Editor's note · 2026-04-26

APPS (2021-05) was the first widely cited coding benchmark of the post-Codex era; OpenAI shipped the purpose-built HumanEval two months later, and attention migrated within a year. HumanEval and MBPP both saturated by 2023: frontier models hit >95% pass@1, leaving no signal. EvalPlus (HumanEval+, MBPP+) reopened the gap with adversarial tests. Attention then jumped to LiveCodeBench (contamination-free by date) and SWE-bench Verified (repo-scale, human-filtered). As of 2025-09, OpenAI publicly announced it no longer evaluates on SWE-bench Verified: flawed tests reward shortcuts, and training-data leakage inflates scores. SWE-bench Pro (Scale AI, arXiv 2509.16941) is the current attention path: 1,865 problems across public/commercial/held-out splits, where GPT-5 and Claude Opus 4.1 land at ~23% vs >70% on Verified.

§ 02 · Context

What changed,
and what changed it.

The cards below are nodes from the curated coding lineage. Edges are typed: scope shift means leaderboard attention jumped tasks; direct successor means same task, sharper test set.

In-edge · scope shift
HumanEval+
LiveCodeBench
May 2023 → Sep 2023

80× more test cases per problem, automatically generated to catch the edge cases the original tests missed. Reopened the leaderboard gap that HumanEval had closed.

Where leaderboard attention moved once EvalPlus problems also began saturating. LiveCodeBench's by-date contamination control became the new credibility floor.

See in lineage graph →
◆ This page
LiveCodeBench
Active · Sep 2023
LiveCodeBench · Contamination-Free Coding

Continuously scrapes new LeetCode/AtCoder/Codeforces problems and dates them — results can be filtered to problems posted after a model's training cutoff, eliminating contamination. Where the leaderboard moved once HumanEval+ also began saturating.

Jain et al. (UC Berkeley, MIT, Cornell) · paper
Out-edge · scope shift · current attention path
LiveCodeBench
SWE-bench
Sep 2023 → Oct 2023

2,294 real GitHub issue→PR pairs across 12 Python repos. The first benchmark to test whether models could function as software engineers, not just function generators. Superseded by Verified after analysis showed many issues were unsolvable as posed.

From contest-style problems to real-world software engineering — issues, multi-file edits, regression tests. Different task, but the same field's frontier.

See in lineage graph →
§ 03 · SOTA

23.6% → 91.7%, two tracks.

Each dot is a record-setting model on the latest LiveCodeBench window at the time it landed. Tracked separately so closed-source frontier progress is never confused with the open-weight catch-up curve. Reasoning chains drove most of the post-2024 acceleration on both lines.


API · latest
Mar 2026 · Gemini 3 Pro Preview · 91.7%
Open · latest
Feb 2026 · Qwen3-Max-Thinking · 81.2%
Frontier gap
10.5pp
Series
8 closed · 5 open
Legend · Closed / API · Open weight
[Chart: record-setting pass@1 by date, 2023–2026. Closed/API line: 23.6 → 26.8 → 72.8 → 73.2 → 78.4 → 82.1 → 85.6 → 91.7 (Gemini 3 Pro Preview). Open-weight line: 49.2 → 65.9 → 72.4 → 75.8 → 81.2 (Qwen3-Max-Thinking).]
Fig 2 · LiveCodeBench pass@1 (overall) by record-setting model, split by license. Solid = closed/API · dashed = open weight. Each dot only appears when the score strictly exceeds every previous record on its own line.
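The record-only filter behind Fig 2 is easy to reproduce. A minimal sketch, assuming a flat list of (model, date, score, license) entries; the `Entry` type and field names are illustrative, not Codesota's internal schema:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    model: str
    date: str          # e.g. "2026-03" (zero-padded, so lexically sortable)
    score: float       # pass@1 on the latest window
    open_weight: bool

def record_setters(entries: list[Entry]) -> list[Entry]:
    """Keep only entries whose score strictly exceeds every earlier score in the series."""
    records, best = [], float("-inf")
    for e in sorted(entries, key=lambda e: e.date):
        if e.score > best:
            records.append(e)
            best = e.score
    return records

# Fig 2 filters each license line independently, so an open-weight record stays
# visible even while it trails the closed/API line.
def split_by_license(entries: list[Entry]):
    closed = record_setters([e for e in entries if not e.open_weight])
    open_ = record_setters([e for e in entries if e.open_weight])
    return closed, open_
```

Filtering per line rather than over the pooled list is what keeps the open-weight catch-up curve from being erased by earlier closed records.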
§ 04 · Leaderboard

Best published scores.

Pass@1 on LiveCodeBench. Shaded row marks SOTA. Window may differ across rows (v3 / v4 / v5 / v6) — check the linked source for the exact problem cohort. Vendor-internal cards and the official Berkeley leaderboard are merged.


Metric
pass@1 · higher is better
Rows
25
Source
live · benchmark_results
# · Model · Vendor · Type · Submitted · Source · pass@1
01 · Gemini 3 Pro Preview · Google · API · Mar 2026 · source · 91.7
02 · Gemini 3 Flash · Google · API · Mar 2026 · source · 90.8
03 · GPT-5 · OpenAI · API · n/a · source · 85.0
04 · Grok 4 · xAI · API · n/a · source · 79.0
05 · Gemini 2.5 Pro · Google · API · n/a · source · 75.6
06 · DeepSeek-R1-0528 · DeepSeek · OSS · n/a · source · 73.3
07 · o4-mini · OpenAI · API · Mar 2024 · LiveCodeBench: Holistic and Contamin… · 72.8
08 · Qwen3-235B-A22B · Alibaba · API · n/a · source · 70.7
09 · o3-mini · OpenAI · API · Mar 2024 · LiveCodeBench: Holistic and Contamin… · 66.9
10 · DeepSeek R1 · DeepSeek · OSS · n/a · source · 65.9
11 · o3 · OpenAI · API · Mar 2024 · LiveCodeBench: Holistic and Contamin… · 65.3
12 · DeepSeek-R1-Distill-Llama-70B · DeepSeek · OSS · n/a · source · 65.2
13 · Gemini 2.5 Flash · Google · API · n/a · source · 63.9
14 · Kimi k1.5 · Moonshot AI · API · n/a · source · 62.5
15 · DeepSeek-R1-Distill-Qwen-32B · DeepSeek · OSS · n/a · source · 62.1
16 · Claude Opus 4 · Anthropic · API · Mar 2024 · LiveCodeBench: Holistic and Contamin… · 57.8
17 · GPT-4.1 · OpenAI · API · Mar 2024 · LiveCodeBench: Holistic and Contamin… · 54.4
18 · Claude Sonnet 4 · Anthropic · API · Mar 2024 · LiveCodeBench: Holistic and Contamin… · 52.8
19 · DeepSeek-v3-0324 · DeepSeek · OSS · n/a · source · 49.2
20 · DeepSeek-V3 · DeepSeek · OSS · Mar 2024 · LiveCodeBench: Holistic and Contamin… · 49.2
21 · GPT-4.1 mini · OpenAI · API · n/a · source · 48.3
22 · Qwen2.5-Coder 32B · Alibaba · OSS · Mar 2024 · LiveCodeBench: Holistic and Contamin… · 47.8
23 · Llama-4-Maverick · Meta · OSS · n/a · source · 43.4
24 · DeepSeek-Coder-V2-Instruct · DeepSeek · OSS · Mar 2024 · LiveCodeBench: Holistic and Contamin… · 43.4
25 · GPT-4o · OpenAI · API · Mar 2024 · LiveCodeBench: Holistic and Contamin… · 40.8
Fig 3 · Pass@1 across reasoning, frontier proprietary, and open-weight code models. The open-vs-closed gap at the top is currently 18.4 points — closing each release cycle.
§ 05 · Open vs closed

The gap is 18.4 points.

Closed reasoning models from Google, OpenAI and xAI dominate the top of the leaderboard. DeepSeek-V3, Qwen3 and DeepSeek-R1 close the gap quickly when reasoning chains are enabled, but no open model has yet topped the chart.

Open-weight avg
55.5%
9 models · top: DeepSeek-R1-0528 · 73.3%
API/closed avg
67.4%
16 models · top: Gemini 3 Pro Preview · 91.7%
Frontier gap
18.4pp
Gemini 3 Pro Preview − DeepSeek-R1-0528
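The three cards above are plain aggregations over the § 04 table. A minimal sketch, assuming rows of (model, type, pass@1) tuples as listed there:

```python
from statistics import mean

def summarize(rows: list[tuple[str, str, float]]) -> dict:
    """rows: (model, type, pass@1) as in the § 04 table; type is "API" or "OSS"."""
    open_scores = [s for _, t, s in rows if t == "OSS"]
    closed_scores = [s for _, t, s in rows if t == "API"]
    return {
        "open_avg": round(mean(open_scores), 1),
        "closed_avg": round(mean(closed_scores), 1),
        "frontier_gap": round(max(closed_scores) - max(open_scores), 1),
    }
```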
§ 06 · Methodology

Why by-date evaluation matters.

Continuous problem stream

Every week a fresh batch of LeetCode/AtCoder/Codeforces problems is scraped, deduplicated and added with its publish date. The benchmark grows; old problems aren't deleted but can be filtered out.
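A rough sketch of that ingest step, assuming a hypothetical scraper output with platform, statement and publish-date fields; this is not the official LiveCodeBench pipeline or schema:

```python
import hashlib

def dedup_and_tag(new_problems: list[dict], existing_keys: set[str]) -> list[dict]:
    """Add freshly scraped problems to the pool, skipping ones already present.

    Each problem dict is assumed to carry "platform", "statement" and a
    "published" datetime.date; these field names are illustrative only.
    """
    added = []
    for p in new_problems:
        key = hashlib.sha256(
            (p["platform"] + "::" + p["statement"]).encode("utf-8")
        ).hexdigest()
        if key in existing_keys:             # duplicate of an earlier scrape
            continue
        existing_keys.add(key)
        p["release_date"] = p["published"]   # the date later used for filtering
        added.append(p)
    return added
```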

Contamination as a knob

Filter the dataset to only contain problems posted after a given model's training cutoff. Memorisation collapses; only generalisation remains. Vendor-claimed cutoff dates become testable.
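Mechanically this is a single date comparison over the tagged pool, plus a before/after split to sanity-check a vendor's claimed cutoff. A minimal sketch reusing the illustrative release_date field from the previous snippet; the id key is likewise hypothetical:

```python
from datetime import date

def after_cutoff(problems: list[dict], model_cutoff: date) -> list[dict]:
    """Keep only problems published after the model's claimed training cutoff."""
    return [p for p in problems if p["release_date"] > model_cutoff]

def contamination_delta(score_by_id: dict, problems: list[dict], model_cutoff: date) -> float:
    """Mean accuracy before the cutoff minus mean accuracy after it.

    A large positive delta suggests memorised problems, or a cutoff date that
    is claimed earlier than it really is.
    """
    pre = [score_by_id[p["id"]] for p in problems if p["release_date"] <= model_cutoff]
    post = [score_by_id[p["id"]] for p in problems if p["release_date"] > model_cutoff]
    avg = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return avg(pre) - avg(post)
```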

Four task scenarios

Code generation (write a solution), self-repair (fix a buggy solution), code execution (predict output), test output prediction. Most leaderboards show pass@1 on generation.
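pass@1 on generation is typically estimated with the unbiased pass@k formula from the original HumanEval paper rather than a single greedy attempt, though individual leaderboard entries may differ in sampling setup. A minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): n samples drawn, c passed all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 10 samples per problem, 3 pass the hidden tests -> pass@1 = 0.3
assert abs(pass_at_k(10, 3, 1) - 0.3) < 1e-9
```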

Algorithmic, not engineering

Single-file, contest-style problems — closer to interview prep than to production work. SWE-bench is the complementary benchmark for repo-scale software engineering.

§ 07 · Resources

Papers and code.

Key papers
Repositories

See the full coding lineage · SWE-bench (next attention path) · HumanEval (predecessor)