HumanEval: From 28.8% to 99% in five years
The complete history of the benchmark that defined AI code generation: 43 models tracked from July 2021 through saturation in 2026.
Problems: 164 · Published: July 2021 · Current SOTA: ~99% · Status: saturated
What is HumanEval?
HumanEval is a benchmark of 164 hand-written Python programming problems created by OpenAI in July 2021. Each problem includes a function signature, a docstring describing what the function should do, and a set of unit tests to verify correctness.
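The format is concrete enough to sketch. The task, completion, and tests below are illustrative stand-ins, not drawn from the actual dataset, but the grading mechanics match: concatenate the prompt, the model's completion, and the unit tests, then run the result.

```python
# Illustrative HumanEval-style task: a signature plus a docstring.
# The model sees everything above the function body and must complete it.
PROMPT = '''
def running_max(numbers):
    """Return a list where element i is the maximum of numbers[:i+1]."""
'''

# A candidate completion (what a model might emit for the function body).
COMPLETION = '''
    result, best = [], float("-inf")
    for x in numbers:
        best = max(best, x)
        result.append(best)
    return result
'''

# Hidden unit tests, in the style HumanEval uses for grading.
TESTS = '''
assert running_max([1, 3, 2, 5]) == [1, 3, 3, 5]
assert running_max([]) == []
assert running_max([-2, -5]) == [-2, -2]
'''

def passes(prompt: str, completion: str, tests: str) -> bool:
    """True iff the completion satisfies every unit test."""
    scope: dict = {}
    try:
        exec(prompt + completion + tests, scope)
        return True
    except Exception:
        return False

print(passes(PROMPT, COMPLETION, TESTS))  # → True
```

A real harness sandboxes the `exec` call, since it runs untrusted model output; this sketch skips that for brevity.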
The metric is pass@1: the percentage of problems where the model's first attempt passes all unit tests. No retries, no cherry-picking.
It was introduced alongside Codex in the paper “Evaluating Large Language Models Trained on Code” (Chen et al., 2021) and quickly became the standard yardstick for code generation ability.
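The paper also generalizes the metric to pass@k, estimated from n sampled completions of which c pass. A small sketch of that unbiased estimator, in the numerically stable product form the paper gives:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021):
    1 - C(n-c, k) / C(n, k), computed as a stable running product.
    n = samples generated, c = samples that pass, k = attempt budget."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a success
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# With a single sample per problem (n = k = 1), pass@1 is just the
# observed pass rate: the first attempt either passes or it doesn't.
print(pass_at_k(1, 1, 1))  # → 1.0
print(pass_at_k(1, 0, 1))  # → 0.0
```

For k = 1 the formula collapses to c/n, which is why reporting pass@1 from one greedy sample per problem is the standard, no-cherry-picking protocol.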
The Progression
[Chart: SOTA progression (pass@1) — best score at each point in time, July 2021 – March 2026]
[Chart: best score by organization — each org's highest verified HumanEval pass@1]
[Chart: every model, plotted — one dot per model; color = organization, size = parameter count]
Complete Timeline
The Starting Line
OpenAI publishes "Evaluating Large Language Models Trained on Code" with 164 hand-written Python problems. Codex, fine-tuned on GitHub, sets the first bar.
Cannot generate code
First SOTA on HumanEval
Specialized Code Models
The field realizes code needs its own models. Salesforce, Google, and OpenAI push toward 50%.
Nearly doubles Codex's original score
The ChatGPT Explosion
Chat-tuned models smash through 70%. GPT-4 reaches 85%. Open-source catches up fast with WizardCoder and Code Llama.
ChatGPT breaks 70%
0-shot; 82.7% with optimized prompting
First model near 90%
Breaking 90%
GPT-4o breaks 90% in May. Claude 3.5 Sonnet matches it. Qwen2.5-Coder hits 92.7% from Alibaba. The ceiling is in sight.
First to break 90%
New SOTA: 93.7%
Saturation
Scores converge above 90% across all major vendors. The benchmark can no longer differentiate top models. The community pivots to harder tests.
8B model near 90%
Highest verified score
Post-Saturation
Multiple models approach or claim 99%. HumanEval is retired as a meaningful differentiator. The community has moved on.
Approaching perfect score
Key Milestones
Codex launches HumanEval
The starting line. A 12B-parameter model sets the first benchmark score.
code-davinci-002 doubles it
Specialized code models prove the approach works.
ChatGPT breaks 70%
Chat-tuned models are surprisingly good at code.
GPT-4 approaches 90%
The 90% barrier is within reach for the first time.
GPT-4o breaks 90%
The psychological barrier falls. Three models follow within weeks.
Claude 3.5 Sonnet v2
Anthropic takes the lead. The gap between vendors shrinks to noise.
Kimi K2 pushes ceiling
Highest verified score. Improvement from 93% to 95% takes a full year.
Effectively solved
Multiple models approach perfect. The benchmark can no longer differentiate.
Why HumanEval is saturated
HumanEval served its purpose brilliantly. In 2021, it was the right benchmark at the right time — simple enough to be reproducible, hard enough to be meaningful. But three structural limitations made saturation inevitable:
Too few problems
With only 164 problems, each is worth about 0.6% of the total score, so statistical noise from a single problem can shift rankings.
Too few tests per problem
Many problems have only 3–5 unit tests. Models can pass with subtly wrong solutions that happen to satisfy weak test suites.
Data contamination
The problems have been on GitHub for five years and appear in every large training dataset. Models may have memorized solutions rather than learned to code.
This doesn't mean HumanEval scores are meaningless — a model scoring 30% is genuinely worse at coding than one scoring 90%. But the difference between 93% and 95% is mostly noise.
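The "mostly noise" claim is easy to quantify: pass@1 over 164 problems is a binomial proportion, and its standard error alone spans the gap between top models. A back-of-the-envelope check:

```python
import math

N = 164  # number of HumanEval problems

def stderr(p: float, n: int = N) -> float:
    """Standard error of a binomial proportion (pass@1 is one such)."""
    return math.sqrt(p * (1 - p) / n)

# Around a 94% score, one standard error is ~1.9 percentage points.
p = 0.94
se = stderr(p)
print(f"se = {se:.3f}")  # se = 0.019

# A rough 95% interval (p ± 1.96·se) spans about 90.4%–97.6%,
# comfortably containing both a "93%" and a "95%" headline score.
lo, hi = p - 1.96 * se, p + 1.96 * se
print(f"95% CI ≈ [{lo:.3f}, {hi:.3f}]")
```

This treats problems as independent coin flips, which understates correlated errors, but it is enough to show that two-point gaps near the ceiling are within sampling noise.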
What comes after HumanEval
The community has moved to harder, more realistic benchmarks:
HumanEval+
Active. 80× more tests per problem (764 on average). Same problems, much stricter: drops most model scores by 10–20%.
SWE-bench Verified
Gold standard. Real GitHub issues from popular repos; models must navigate codebases, understand context, and write patches.
LiveCodeBench
Active. Rolling set of competitive-programming problems from recent contests; contamination-resistant by design.
BigCodeBench
Active. Library-realistic tasks requiring real-world API usage, not just algorithmic puzzles.
Data sources & methodology
Scores compiled from: original papers (arXiv), official model cards, the llm-stats.com leaderboard, and HumanEval Revisited (arXiv:2402.14852).
All scores are pass@1 unless noted. Where multiple evaluations exist for the same model, we prefer the officially reported score. Prompting strategy (0-shot vs. few-shot, system prompt) can shift scores by 5–15 percentage points.
Last updated: March 17, 2026.