A continuously growing pool of LeetCode, AtCoder and Codeforces problems, each tagged with its publication date. Filter scoring to problems posted after a model's training cutoff and the contamination that haunted HumanEval and MBPP disappears. Where leaderboard attention moved once the function-synthesis era ended.
Frontier reasoning models clear 80% on the latest window. The benchmark's scope stops at single-file algorithmic problems; for repo-scale evaluation, attention has moved to SWE-bench and now SWE-bench Pro.
LiveCodeBench was the first benchmark to make contamination a measured property rather than a hand-wave. It then handed the frontier-evaluation baton to SWE-bench, which switched from algorithmic contests to repo-scale software engineering — a different task entirely.
APPS (2021-05) was the first widely cited coding benchmark of the post-Codex era; OpenAI shipped the purpose-built HumanEval two months later, and attention migrated within a year. HumanEval and MBPP both saturated by 2023: frontier models hit >95% pass@1, leaving no signal. EvalPlus (HumanEval+, MBPP+) reopened the gap with adversarial tests. Attention then jumped to LiveCodeBench (contamination-free by date) and SWE-bench Verified (repo-scale, human-filtered). As of 2025-09, OpenAI has publicly announced that it no longer evaluates on SWE-bench Verified: flawed tests reward shortcuts and training-data leakage inflates scores. SWE-bench Pro (Scale AI, arXiv 2509.16941) is the current attention path: 1,865 problems across public/commercial/held-out splits, where GPT-5 and Claude Opus 4.1 land at ~23% vs. >70% on Verified.
The cards below are nodes in the curated coding lineage. Edges are typed: a scope shift means leaderboard attention jumped to a different task; a direct successor means the same task with a sharper test set.
80× more test cases per problem, automatically generated to catch the edge cases the original suites missed. Reopened the leaderboard gap that HumanEval's saturation had closed.
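A toy illustration of the idea (the problem, solution and tests here are hypothetical, not actual EvalPlus or HumanEval artifacts): a submission that passes a sparse hand-written suite can still fail once inputs are generated at scale and checked against a reference oracle.

```python
# Minimal sketch of the EvalPlus idea, not its actual generator:
# stress a candidate solution on many auto-generated inputs and
# compare every output against a trusted reference implementation.
import random
import statistics


def reference_median(xs):
    # Trusted oracle for a hypothetical "return the median" problem.
    return statistics.median(xs)


def candidate_median(xs):
    # Plausible-looking submission: correct only for odd-length lists.
    return sorted(xs)[len(xs) // 2]


def original_tests():
    # Sparse, hand-written cases like the early HumanEval suites.
    return [[1], [3, 1, 2], [5, 5, 5]]


def generated_tests(n=1000, seed=0):
    # EvalPlus-style expansion: many random/mutated inputs per problem.
    rng = random.Random(seed)
    return [
        [rng.randint(-100, 100) for _ in range(rng.randint(1, 20))]
        for _ in range(n)
    ]


def passes(tests):
    return all(candidate_median(t) == reference_median(t) for t in tests)


print("original tests:", passes(original_tests()))    # True: the bug slips through
print("generated tests:", passes(generated_tests()))  # False: even-length lists expose it
```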
Where leaderboard attention moved once EvalPlus problems also began saturating. LiveCodeBench's by-date contamination control became the new credibility floor.
Continuously scrapes new LeetCode/AtCoder/Codeforces problems and dates them; results can be filtered to problems posted after a model's training cutoff, eliminating contamination. Where the leaderboard moved once HumanEval+ also began saturating.
2,294 real GitHub issue→PR pairs across 12 Python repos. The first benchmark to test whether models could function as software engineers, not just function generators. Superseded by Verified after analysis showed many issues were unsolvable as posed.
From contest-style problems to real-world software engineering — issues, multi-file edits, regression tests. Different task, but the same field's frontier.
Each dot is a record-setting model on the latest LiveCodeBench window at the time it landed. Tracked separately so closed-source frontier progress is never confused with the open-weight catch-up curve. Reasoning chains drove most of the post-2024 acceleration on both lines.
Pass@1 on LiveCodeBench. The shaded row marks the SOTA. The evaluation window may differ across rows (v3 / v4 / v5 / v6); check the linked source for the exact problem cohort. Scores from vendor model cards and the official Berkeley leaderboard are merged.
Reasoning models from OpenAI and Anthropic dominate the top of the leaderboard. DeepSeek-V3, Qwen3-Coder and DeepSeek-R1 close the gap quickly when reasoning chains are enabled — but no open model has yet topped the chart.
Every week a fresh batch of LeetCode/AtCoder/Codeforces problems is scraped, deduplicated and added with their publish dates attached. The benchmark grows; old problems aren't deleted but can be filtered out.
Filter the dataset to only contain problems posted after a given model's training cutoff. Memorisation collapses; only generalisation remains. Vendor-claimed cutoff dates become testable.
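A minimal sketch of that cutoff filter, assuming illustrative field names rather than the actual LiveCodeBench schema:

```python
from datetime import date

# Illustrative records; real LiveCodeBench entries carry a contest/release date.
problems = [
    {"id": "lc-3200",    "source": "leetcode",   "released": date(2024, 6, 15), "solved": True},
    {"id": "abc-352-e",  "source": "atcoder",    "released": date(2024, 5, 4),  "solved": False},
    {"id": "cf-1985-f",  "source": "codeforces", "released": date(2024, 6, 8),  "solved": True},
]


def pass_rate_after_cutoff(problems, cutoff):
    """Score only on problems published after the model's claimed training cutoff."""
    fresh = [p for p in problems if p["released"] > cutoff]
    if not fresh:
        return None  # no post-cutoff window yet
    return sum(p["solved"] for p in fresh) / len(fresh)


# If the score drops sharply past the claimed cutoff, the pre-cutoff numbers were
# propped up by memorisation; a flat curve supports the vendor's stated cutoff.
print(pass_rate_after_cutoff(problems, cutoff=date(2024, 5, 31)))
```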
Code generation (write a solution), self-repair (fix a buggy solution), code execution (predict output), test output prediction. Most leaderboards show pass@1 on generation.
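Pass@1 here is, as in most code benchmarks, the Codex-paper pass@k estimator evaluated at k=1; a minimal sketch (the sample counts are made up for illustration):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex/HumanEval paper.
    n = samples drawn per problem, c = samples that pass all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# At k=1 the estimator reduces to the plain per-problem pass rate c/n;
# benchmarks then report the mean over all problems.
per_problem = [(10, 7), (10, 0), (10, 10)]  # (n, c) for three toy problems
print(sum(pass_at_k(n, c, 1) for n, c in per_problem) / len(per_problem))  # ~0.567
```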
Single-file, contest-style problems — closer to interview prep than to production work. SWE-bench is the complementary benchmark for repo-scale software engineering.