
RE-Bench.

RE-Bench (Research Engineering Benchmark) from METR evaluates AI agents on 7 open-ended ML research engineering tasks that require genuine experimentation: training models, analyzing data, and iterating on approaches over time horizons of up to 8 hours. Unlike pass/fail coding benchmarks, RE-Bench uses continuous scoring that measures the quality of results, so it can distinguish a mediocre solution from an excellent one. Its key finding: frontier models as of late 2024 plateau after roughly 2 hours of autonomous work while human experts keep improving, which exposes a "long-horizon reliability" gap in agentic AI.

Datasets: 1 · Results: 5 · Canonical metric: normalized-score
§ 02 · Canonical benchmark

The reference dataset.

RE-Bench

7 challenging open-ended ML research engineering tasks requiring multi-hour autonomous work. Agents compete against human researchers on real tasks like implementing new architectures or optimizing training pipelines. Score is normalized against human performance.

Primary metric: normalized-score
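
This page does not spell out how the normalized score is computed. As a minimal sketch, assume a linear rescaling in which the task's starting solution maps to 0 and the human reference solution maps to 1. The function name and parameters below are hypothetical and chosen only for illustration.

```python
def normalize_score(raw: float, start: float, reference: float) -> float:
    """Linearly rescale a raw task score.

    Assumption (not stated on this page): the starting solution maps to 0.0
    and the human reference solution maps to 1.0. Values above 1.0 mean the
    agent beat the reference; values below 0.0 mean it made things worse.
    """
    if reference == start:
        raise ValueError("reference and starting scores must differ")
    return (raw - start) / (reference - start)


# Higher-is-better task: start at 1.0, reference at 5.0, agent reaches 3.0.
halfway = normalize_score(3.0, start=1.0, reference=5.0)

# Lower-is-better task (for example a loss): start at 10.0, reference at 2.0.
# The same formula applies because the sign of the denominator flips too.
halfway_loss = normalize_score(6.0, start=10.0, reference=2.0)
```

Under this assumption, a leaderboard value such as 0.380 would mean the agent closed about 38% of the gap between the starting solution and the human reference, averaged across tasks.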
§ 03 · Top 10

Leading models.

Leading models on RE-Bench.

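| # | Model | normalized-score | Year | Source |
|---|-------|------------------|------|--------|
| 1 | o3 | 0.380 | 2025 | paper ↗ |
| 2 | Claude 3.7 Sonnet | 0.290 | 2025 | paper ↗ |
| 3 | o1 | 0.170 | 2024 | paper ↗ |
| 4 | Claude 3.5 Sonnet | 0.120 | 2024 | paper ↗ |
| 5 | GPT-4 Turbo (2024) | 0.070 | 2024 | paper ↗ |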


§ 04 · All datasets

Tracked datasets.

1 dataset tracked for this task.

RE-Bench (canonical) · 5 results · normalized-score · Top: o3 0.380
§ 05 · Related tasks

Other tasks in Agentic AI.

Agent Memory · Autonomous Coding · Bioinformatics Agents · HCAST · SWE-bench · Task agents · Time Horizon · Tool Use