Time Horizon.

Time horizon, the length of time an AI agent can work autonomously before it needs human correction, is arguably the single most important meta-metric for agentic AI. METR's evaluations suggest that current frontier agents degrade significantly after 30-60 minutes of autonomous operation, whereas human software engineers can sustain productive work for hours. The metric matters because economic value is argued to grow much faster than linearly with reliable autonomy duration: an agent that works reliably for 8 hours is not merely 16x more valuable than one that works for 30 minutes. It is qualitatively different, because it enables entirely new categories of delegatable work.

§ 02 · Canonical benchmark

The reference dataset.

METR Time Horizon

Measures the length of tasks AI agents can reliably complete autonomously. Task horizon is the task length, measured by how long the task takes a human, at which the agent succeeds 50% of the time. Higher is better: the agent can handle longer multi-step tasks without human intervention.
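The 50%-success definition above can be made concrete with a small sketch. It assumes one common way to estimate such a horizon: fit a logistic curve to success against the log of task length, then read off the length where the predicted success rate crosses 50%. METR's actual pipeline may differ in its details (task weighting, regularization, confidence intervals), and the function name `fit_time_horizon` and the data below are hypothetical and synthetic, not leaderboard results.

```python
import math

def fit_time_horizon(results, steps=2000, lr=0.1):
    """Estimate a 50% time horizon from (human_minutes, succeeded) pairs.

    Assumption: p(success) = sigmoid(a + b * (log2(minutes) - mean)),
    fitted by plain gradient ascent on the log-likelihood.
    """
    xs = [math.log2(minutes) for minutes, _ in results]
    ys = [1.0 if ok else 0.0 for _, ok in results]
    n = len(xs)
    mean_x = sum(xs) / n
    centered = [x - mean_x for x in xs]  # centering keeps the fit stable

    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(centered, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n

    if b >= 0:
        # Success does not fall with task length, so there is no 50% crossing.
        raise ValueError("success rate does not decrease with task length")

    # Solve a + b * (log2(t) - mean_x) = 0 for t.
    return 2.0 ** (mean_x - a / b)

# Synthetic data: 10 attempts at each task length, with fewer successes
# as tasks get longer.
lengths = [1, 2, 4, 8, 16, 32, 64, 128, 256]   # human minutes per task
wins = [10, 9, 8, 7, 5, 3, 2, 1, 0]            # successes out of 10
attempts = []
for minutes, k in zip(lengths, wins):
    attempts += [(minutes, True)] * k + [(minutes, False)] * (10 - k)

horizon = fit_time_horizon(attempts)
print(f"estimated 50% horizon: {horizon:.1f} minutes")
```

Because the synthetic success counts are symmetric around the 16-minute tasks, the estimate lands close to 16 minutes. With real evaluation logs the curve is rarely this clean, so the result is usually reported together with an uncertainty estimate.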

Primary metric: task-horizon-minutes

§ 03 · Top 10

Leading models.

Leading models on METR Time Horizon.

Rank	Model	task-horizon-minutes	Year	Source
1	Claude Opus 4 ✓	60.0	2025	paper
2	o3 ✓	30.0	2025	paper
3	Claude 3.7 Sonnet ✓	14.0	2025	paper
4	o1 ✓	4.00	2025	paper
5	GPT-4 Turbo (2024) ✓	2.00	2025	paper

§ 04 · All datasets

Tracked datasets.

1 dataset tracked for this task.

METR Time Horizon

CANONICAL

5 results · task-horizon-minutes

Top: Claude Opus 4 (60.0 minutes)

§ 05 · Related tasks

Other tasks in Agentic AI.

Agent Memory · Autonomous Coding · Bioinformatics Agents · HCAST · RE-Bench · SWE-bench · Task agents · Tool Use
