RE-Bench.

RE-Bench (Research Engineering Benchmark) from METR evaluates AI agents on 7 open-ended ML research engineering tasks that require genuine experimentation: training models, analyzing data, and iterating on approaches over time horizons of up to 8 hours. Unlike pass/fail coding benchmarks, RE-Bench uses continuous scoring that measures the quality of results, capturing the difference between a mediocre and an excellent solution. Its key finding: frontier models as of late 2024 plateau after roughly 2 hours of autonomous work while human experts continue improving, exposing the "long-horizon reliability" gap in agentic AI.

Datasets: 1

Results: 5

Canonical metric: normalized-score

§ 02 · Canonical benchmark

The reference dataset.

RE-Bench

7 challenging open-ended ML research engineering tasks requiring multi-hour autonomous work. Agents compete against human researchers on real tasks like implementing new architectures or optimizing training pipelines. Score is normalized against human performance.

Primary metric: normalized-score
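The page says only that the score is normalized against human performance and does not give the formula. As a minimal sketch, assuming a linear rescaling in which a task's starting solution maps to 0 and a human reference solution maps to 1, the per-task and aggregate scores could be computed as below. The function names, the `(raw, starting, reference)` tuple layout, and the sample numbers are all hypothetical and are not taken from the benchmark's code.

```python
from typing import Iterable, Tuple


def normalized_score(raw: float, starting: float, reference: float) -> float:
    """Linearly rescale a raw task score (assumed convention, see lead-in).

    starting  -> 0.0 (score of the provided starting solution)
    reference -> 1.0 (score of a human reference solution)
    Works for lower-is-better metrics too, since the sign flips when
    reference < starting.
    """
    if reference == starting:
        raise ValueError("reference and starting scores must differ")
    return (raw - starting) / (reference - starting)


def mean_normalized_score(runs: Iterable[Tuple[float, float, float]]) -> float:
    """Average the normalized score over (raw, starting, reference) tuples."""
    scores = [normalized_score(raw, start, ref) for raw, start, ref in runs]
    if not scores:
        raise ValueError("at least one run is required")
    return sum(scores) / len(scores)


# Hypothetical example data, not real RE-Bench results.
example_runs = [
    (2.0, 1.0, 3.0),   # halfway between starting and reference
    (1.0, 1.0, 3.0),   # no improvement over the starting solution
    (4.0, 0.0, 4.0),   # matches the reference solution
]

if __name__ == "__main__":
    print(mean_normalized_score(example_runs))
```

Under this assumed convention, a leaderboard value such as 0.380 would read as covering a bit over a third of the distance from the starting solution to the human reference, averaged across tasks.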

§ 03 · Top 10

Leading models.

Leading models on RE-Bench.

#	Model	normalized-score	Year	Source
1	o3 ✓	0.380	2025	paper
2	Claude 3.7 Sonnet ✓	0.290	2025	paper
3	o1 ✓	0.170	2024	paper
4	Claude 3.5 Sonnet ✓	0.120	2024	paper
5	GPT-4 Turbo (2024) ✓	0.070	2024	paper

§ 04 · All datasets

Tracked datasets.

1 dataset tracked for this task.

RE-Bench

CANONICAL

5 results · normalized-score

Top: o3 — 0.380

§ 05 · Related tasks

Other tasks in Agentic AI.

Agent Memory, Autonomous Coding, Bioinformatics Agents, HCAST, SWE-bench, Task agents, Time Horizon, Tool Use
