
Is SWE-bench Verified Contaminated? OpenAI Shifts to SWE-bench Pro

OpenAI has stopped reporting SWE-bench Verified scores, citing contamination concerns and the outsized impact of agent scaffolding. With top scores diverging by 12 points depending on the test harness, the most-cited coding benchmark may no longer measure what we think it does.

  • 81% -- with agent scaffolding
  • 69% -- standalone (same model)
  • +12pt -- scaffolding gap
  • 54% -- top SWE-bench Pro score

SWE-bench Verified has been the gold standard for evaluating AI coding ability since mid-2024. A curated set of 500 real GitHub issues, human-verified for correctness, it promised a reliable signal of how well models can actually fix bugs and implement features. But cracks have been forming for months, and OpenAI's decision to abandon Verified scores in February 2026 has brought the debate into the open.

The core problem is twofold: the benchmark's fixed dataset has been public long enough for contamination to seep into training data, and the lack of standardized evaluation harnesses means that agent scaffolding -- not model capability -- increasingly determines the score.

Timeline: How We Got Here

Oct 2023 -- SWE-bench launches

Princeton releases SWE-bench with 2,294 real GitHub issues. Top models score under 5%. The benchmark is hailed as a true test of software engineering ability.

Jun 2024 -- SWE-bench Verified introduced

A curated subset of 500 human-verified problems replaces the full benchmark as the standard leaderboard. Scores hover around 30-40% for top models.

Oct 2024 -- First contamination whispers

Researchers note that some models show suspiciously high scores on specific Verified problems that appeared in popular training corpora. GitHub issues are public data.

Dec 2024 -- Agent scaffolding arms race begins

Teams discover that wrapping the same model in better agent scaffolding (retries, file exploration, test-driven loops) can boost scores by 10-15 points. The benchmark starts measuring engineering infrastructure, not just model capability.

Jan 2026 -- 81% achieved with heavy scaffolding

A submission using aggressive agent scaffolding hits 81% on Verified, while the same base model scores 69% standalone. The 12-point gap ignites the contamination debate.

Feb 2026 -- OpenAI drops Verified, backs Pro

OpenAI announces it will no longer report SWE-bench Verified scores, citing contamination and scaffolding concerns, and shifts to SWE-bench Pro as its primary coding eval.

The Contamination Evidence

Contamination in SWE-bench Verified comes from two distinct vectors, and understanding both is critical to evaluating whether the benchmark is still useful.

1. Training Data Contamination

SWE-bench problems are drawn from real GitHub issues and pull requests. These are public data. Any model trained on GitHub data after June 2024 has likely seen some subset of the 500 Verified problems -- including the solutions.

  • GitHub issues and PRs are in Common Crawl and The Stack
  • Deduplication doesn't catch reformatted solutions (see the sketch below)
  • Blog posts discussing SWE-bench problems enter training data
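
To see why deduplication fails here, consider a minimal sketch. The patch text and the hashing scheme below are invented for illustration; real dedup pipelines vary, but document-level exact or near-exact matching shares the same blind spot:

```python
# Hypothetical sketch: why exact-match deduplication misses reformatted
# solutions. The snippets and normalization scheme are illustrative,
# not from any real dedup pipeline.
import hashlib

original_patch = """def fix(items):
    return [i for i in items if i is not None]"""

# Same logic, reposted in a blog with different names and structure.
reformatted = """def remove_nulls(values):
    result = []
    for v in values:
        if v is not None:
            result.append(v)
    return result"""

def content_hash(text: str) -> str:
    # Typical dedup: hash the normalized text (lowercase, collapsed whitespace).
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

# The hashes differ, so document-level dedup keeps both copies --
# and the solution's logic enters the training set twice.
print(content_hash(original_patch) == content_hash(reformatted))  # False
```

Only semantic deduplication could catch this, and running it across trillions of training tokens is far more expensive than hash-based filtering.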

2. Scaffolding Inflation

Even without direct contamination, agent scaffolding has become the dominant factor in scores. The same model can score 69% standalone or 81% with a sophisticated agent harness that retries, explores files, and runs tests iteratively.

  • Retry loops that attempt the same problem multiple times
  • Test-driven feedback loops (run tests, fix, repeat) -- sketched below
  • Specialized file exploration and context gathering
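
The loop below is a minimal sketch of this pattern. The `propose_patch`, `apply_patch`, `revert_patch`, and `run_tests` helpers are hypothetical stand-ins for a harness's model call, patcher, and test runner, not any published API:

```python
# Minimal sketch of a test-driven retry loop. All helpers passed in are
# hypothetical stand-ins, not a real harness's API.
def solve_with_scaffolding(issue, repo, propose_patch, run_tests,
                           apply_patch, revert_patch, max_attempts=5):
    feedback = ""
    for _ in range(max_attempts):
        patch = propose_patch(issue, feedback)   # model sees prior failures
        apply_patch(repo, patch)
        result = run_tests(repo)
        if result["passed"]:
            return patch                         # scored as resolved
        feedback = result["log"]                 # errors go back in the prompt
        revert_patch(repo, patch)
    return None
```

A "standalone" run is the degenerate case: `max_attempts=1` and no feedback. The 69% to 81% jump lives entirely in this loop, not in the model weights.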

The compounding problem: When contamination and scaffolding combine, scores inflate dramatically. An agent that has "seen" a problem during training can leverage scaffolding to systematically converge on the memorized solution, even if the raw completion wouldn't reproduce it. This makes it nearly impossible to separate genuine capability from benchmark gaming.

SWE-bench Verified vs SWE-bench Pro

SWE-bench Pro was introduced to address the contamination and scaffolding problems. Here is how the two benchmarks compare:

Metric | Verified | Pro | Notes
------ | -------- | --- | -----
Total Problems | 500 | ~300 (rotating) | Pro uses a rotating pool to prevent memorization
Problem Source | Fixed GitHub issues | Held-out, post-cutoff issues | Pro issues are unseen during training
Agent Scaffolding | Allowed (varies wildly) | Standardized harness | Pro controls for scaffolding advantage
Top Score (Feb 2026) | ~81% (w/ agents) | ~54% | The gap reveals scaffolding + contamination impact
Contamination Risk | High (fixed dataset since 2024) | Low (rotating, post-cutoff) | Pro was designed specifically to resist contamination
Industry Adoption | Still widely reported | Growing (OpenAI, Google leading) | Transition underway but not complete

The 27-point gap between top Verified scores (81%) and top Pro scores (54%) is the clearest evidence that Verified has become unreliable as a standalone metric.

What This Means for the Field

For Model Developers

The shift to SWE-bench Pro raises the bar significantly. Models can no longer benefit from having seen the test problems during training, and standardized evaluation harnesses mean scaffolding tricks won't inflate scores. Expect reported numbers to drop by 20-30 points as the industry transitions. This is not a regression in capability -- it is a correction in measurement.

For Engineering Teams Choosing Models

SWE-bench Verified scores from the past year should be treated with skepticism, especially for models released after mid-2025. Look for SWE-bench Pro scores, or better yet, run your own evaluations on your actual codebase. A model scoring 54% on Pro is likely more capable than one scoring 75% on Verified with heavy scaffolding.
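
One way to run your own evaluation, sketched under assumptions: mine your repo's history for real fixed bugs, check out the commit just before each fix, and ask the candidate model to reproduce it. `HeldOutBug`, `model_fix`, and the git workflow here are illustrative, not a standard tool:

```python
# Hedged sketch of a private eval on your own codebase. The data class,
# the model_fix callable, and the test commands are all placeholders
# you would supply yourself.
from dataclasses import dataclass
import subprocess

@dataclass
class HeldOutBug:
    issue_text: str       # the original issue or bug report
    parent_commit: str    # commit just before the human fix landed
    test_command: str     # fails before the fix, passes after

def tests_pass(cmd: str) -> bool:
    # Run the project's own test suite; exit code 0 means the fix works.
    return subprocess.run(cmd, shell=True).returncode == 0

def eval_on_own_repo(model_fix, bugs: list[HeldOutBug]) -> float:
    resolved = 0
    for bug in bugs:
        subprocess.run(["git", "checkout", bug.parent_commit], check=True)
        patch = model_fix(bug.issue_text)              # model proposes a diff
        applied = subprocess.run(
            ["git", "apply"], input=patch.encode()
        ).returncode == 0
        if applied and tests_pass(bug.test_command):
            resolved += 1
        subprocess.run(["git", "checkout", "--", "."])  # reset working tree
    return resolved / len(bugs)
```

Bugs from your own private history cannot have leaked into any training set, which makes even a few dozen of them a higher-signal test than a public leaderboard.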

For the Benchmark Ecosystem

This is a healthy correction. Benchmarks have a natural lifecycle: they launch, gain adoption, get optimized against, and eventually need replacement. SWE-bench Verified lasted roughly 18 months as a reliable signal, which is longer than most AI benchmarks. The transition to Pro demonstrates the community's ability to self-correct.

The Scaffolding Question: What Are We Actually Measuring?

The 12-point gap between scaffolded and standalone scores raises a deeper question: should benchmarks measure the model alone or the model-plus-system? In practice, nobody uses a raw model to fix bugs. Real-world coding assistants use file search, test execution, context retrieval, and iterative refinement.

The problem is not that scaffolding exists, but that it is unstandardized. When Company A reports 81% using a proprietary agent loop and Company B reports 72% with a simpler setup, the comparison is meaningless. SWE-bench Pro addresses this by mandating a standardized evaluation harness, so scores reflect model capability within identical infrastructure.
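
What "standardized harness" means in practice can be made concrete with a hypothetical config that pins every degree of freedom scaffolding currently exploits. None of these field names come from SWE-bench Pro; they are invented for this sketch:

```python
# Hypothetical illustration of what a standardized harness might fix.
# Field names are invented, not taken from SWE-bench Pro.
HARNESS_CONFIG = {
    "max_attempts": 1,           # no retry loops
    "test_feedback": False,      # no run-tests-and-fix iteration
    "tools": ["read_file", "search", "apply_patch"],  # fixed tool set
    "context_budget_tokens": 32_000,                  # same for every entrant
    "timeout_seconds": 900,
}

def evaluate(model, problems, run_one, config=HARNESS_CONFIG):
    # Every model runs under the identical config, so score differences
    # reflect the model, not the submitter's infrastructure.
    return sum(run_one(model, p, config) for p in problems) / len(problems)
```

With the harness held constant, an 81% versus 72% comparison would again mean something.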

The Bottom Line

1. SWE-bench Verified is compromised. The combination of training data contamination and unstandardized scaffolding means top scores no longer reliably measure model capability.

2. SWE-bench Pro is the successor. Rotating held-out problems and standardized evaluation harnesses address both contamination vectors.

3. The transition will be messy. Many companies still report Verified scores. Expect 6-12 months of dual-reporting before Pro becomes the standard.

4. Lower scores, better signal. When the industry moves to Pro, headline numbers will drop dramatically. This does not mean models got worse -- it means we are finally measuring them honestly.
