HumanEval: From 28.8% to 99% in five years
The complete history of the benchmark that defined AI code generation: 43 models tracked from July 2021 through saturation in 2026.
Problems: 164 · Published: July 2021 · Current SOTA: ~99% · Status: saturated
What is HumanEval?
HumanEval is a benchmark of 164 hand-written Python programming problems created by OpenAI in July 2021. Each problem includes a function signature, a docstring describing what the function should do, and a set of unit tests to verify correctness.
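The format is concrete enough to sketch. The task, completion, and tests below are illustrative stand-ins, not drawn from the actual dataset, but the grading mechanics match: concatenate the prompt, the model's completion, and the unit tests, then run the result.

```python
# Illustrative HumanEval-style task: a signature plus a docstring.
# The model sees everything above the function body and must complete it.
PROMPT = '''
def running_max(numbers):
    """Return a list where element i is the maximum of numbers[:i+1]."""
'''

# A candidate completion (what a model might emit for the function body).
COMPLETION = '''
    result, best = [], float("-inf")
    for x in numbers:
        best = max(best, x)
        result.append(best)
    return result
'''

# Hidden unit tests, in the style HumanEval uses for grading.
TESTS = '''
assert running_max([1, 3, 2, 5]) == [1, 3, 3, 5]
assert running_max([]) == []
assert running_max([-2, -5]) == [-2, -2]
'''

def passes(prompt: str, completion: str, tests: str) -> bool:
    """True iff the completion satisfies every unit test."""
    scope: dict = {}
    try:
        exec(prompt + completion + tests, scope)
        return True
    except Exception:
        return False

print(passes(PROMPT, COMPLETION, TESTS))  # → True
```

A real harness sandboxes the `exec` call, since it runs untrusted model output; this sketch skips that for brevity.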
The metric is pass@1: the percentage of problems where the model's first attempt passes all unit tests. No retries, no cherry-picking.
It was introduced alongside Codex in the paper “Evaluating Large Language Models Trained on Code” (Chen et al., 2021) and quickly became the standard yardstick for code generation ability.
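The paper also generalizes the metric to pass@k, estimated from n sampled completions of which c pass. A small sketch of that unbiased estimator, in the numerically stable product form the paper gives:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021):
    1 - C(n-c, k) / C(n, k), computed as a stable running product.
    n = samples generated, c = samples that pass, k = attempt budget."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a success
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# With a single sample per problem (n = k = 1), pass@1 is just the
# observed pass rate: the first attempt either passes or it doesn't.
print(pass_at_k(1, 1, 1))  # → 1.0
print(pass_at_k(1, 0, 1))  # → 0.0
```

For k = 1 the formula collapses to c/n, which is why reporting pass@1 from one greedy sample per problem is the standard, no-cherry-picking protocol.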
The Progression
[Chart: SOTA progression (pass@1) — best score at each point in time, July 2021 – March 2026]
[Chart: best score by organization — each org's highest verified HumanEval pass@1]
[Chart: every model, plotted — one dot per model; color = organization, size = parameter count]
Complete Timeline
The Starting Line
OpenAI publishes "Evaluating Large Language Models Trained on Code" with 164 hand-written Python problems. Codex, fine-tuned on GitHub, sets the first bar.
Cannot generate code
First SOTA on HumanEval
Specialized Code Models
The field realizes code needs its own models. Salesforce, Google, and OpenAI push toward 50%.
Nearly doubles Codex's original score
The ChatGPT Explosion
Chat-tuned models smash through 70%. GPT-4 reaches 85%. Open-source catches up fast with WizardCoder and Code Llama.
ChatGPT breaks 70%
0-shot; 82.7% with optimized prompting
First model near 90%
Breaking 90%
GPT-4o breaks 90% in May. Claude 3.5 Sonnet matches it. Qwen2.5-Coder hits 92.7% from Alibaba. The ceiling is in sight.
First to break 90%
New SOTA: 93.7%
Saturation
Scores converge above 90% across all major vendors. The benchmark can no longer differentiate top models. The community pivots to harder tests.
8B model near 90%
Highest verified score
Post-Saturation
Multiple models approach or claim 99%. HumanEval is retired as a meaningful differentiator. The community has moved on.
Approaching perfect score
Key Milestones
Codex launches HumanEval
The starting line. A 12B-parameter model sets the first benchmark score.
code-davinci-002 doubles it
Specialized code models prove the approach works.
ChatGPT breaks 70%
Chat-tuned models are surprisingly good at code.
GPT-4 approaches 90%
The 90% barrier is within reach for the first time.
GPT-4o breaks 90%
The psychological barrier falls. Three models follow within weeks.
Claude 3.5 Sonnet v2
Anthropic takes the lead. The gap between vendors shrinks to noise.
Kimi K2 pushes ceiling
Highest verified score. Improvement from 93% to 95% takes a full year.
Effectively solved
Multiple models approach perfect. The benchmark can no longer differentiate.
Why HumanEval is saturated
HumanEval served its purpose brilliantly. In 2021, it was the right benchmark at the right time — simple enough to be reproducible, hard enough to be meaningful. But three structural limitations made saturation inevitable:
Too few problems
With only 164 problems, each is worth about 0.6% of the total score, so statistical noise from a single problem can shift rankings.
Too few tests per problem
Many problems have only 3–5 unit tests. Models can pass with subtly wrong solutions that happen to satisfy weak test suites.
Data contamination
The problems have been on GitHub for five years and appear in every large training dataset. Models may have memorized solutions rather than learned to code.
This doesn't mean HumanEval scores are meaningless — a model scoring 30% is genuinely worse at coding than one scoring 90%. But the difference between 93% and 95% is mostly noise.
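The "mostly noise" claim is easy to quantify: pass@1 over 164 problems is a binomial proportion, and its standard error alone spans the gap between top models. A back-of-the-envelope check:

```python
import math

N = 164  # number of HumanEval problems

def stderr(p: float, n: int = N) -> float:
    """Standard error of a binomial proportion (pass@1 is one such)."""
    return math.sqrt(p * (1 - p) / n)

# Around a 94% score, one standard error is ~1.9 percentage points.
p = 0.94
se = stderr(p)
print(f"se = {se:.3f}")  # se = 0.019

# A rough 95% interval (p ± 1.96·se) spans about 90.4%–97.6%,
# comfortably containing both a "93%" and a "95%" headline score.
lo, hi = p - 1.96 * se, p + 1.96 * se
print(f"95% CI ≈ [{lo:.3f}, {hi:.3f}]")
```

This treats problems as independent coin flips, which understates correlated errors, but it is enough to show that two-point gaps near the ceiling are within sampling noise.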
What comes after HumanEval
The community has moved to harder, more realistic benchmarks:
HumanEval+
Active. 80× more tests per problem (764 on average). Same problems, much stricter: drops most model scores by 10–20%.
SWE-bench Verified
Gold standard. Real GitHub issues from popular repos; models must navigate codebases, understand context, and write patches.
LiveCodeBench
Active. Rolling set of competitive-programming problems from recent contests; contamination-resistant by design.
BigCodeBench
Active. Library-realistic tasks requiring real-world API usage, not just algorithmic puzzles.
Data sources & methodology
Scores compiled from: original papers (arXiv), official model cards, the llm-stats.com leaderboard, and HumanEval Revisited (arXiv:2402.14852).
All scores are pass@1 unless noted. Where multiple evaluations exist for the same model, we prefer the officially reported score. Prompting strategy (0-shot vs. few-shot, system prompt) can shift scores by 5–15 percentage points.
Last updated: March 17, 2026.