Codesota · Tasks · HCAST

HCAST.

HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI autonomy against human-calibrated baselines. Every task has known completion times from professional software engineers, which enables direct human-vs-AI comparison. Tasks span realistic software engineering scenarios at varying difficulty levels, from simple bug fixes to complex architectural changes. The human calibration is what makes HCAST distinctive: beyond pass/fail, it shows whether AI agents are 10x slower than humans, equally fast, or approaching superhuman speed on specific task types.
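The human-vs-AI comparison described above can be pictured as a per-task time ratio. The following is a minimal sketch, not METR's methodology: the `TaskRun` record, the task names, and all timings are hypothetical and chosen only to illustrate the "10x slower, equally fast, faster" cases.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    task_id: str          # hypothetical task name, not a real HCAST task
    human_minutes: float  # baseline completion time from a human engineer
    agent_minutes: float  # time the agent needed on the same task


def speed_ratio(run: TaskRun) -> float:
    """Agent time divided by human time; values above 1 mean the agent was slower."""
    return run.agent_minutes / run.human_minutes


# Illustrative numbers only (assumed, not taken from any benchmark run).
runs = [
    TaskRun("fix-off-by-one", human_minutes=15.0, agent_minutes=150.0),
    TaskRun("refactor-module", human_minutes=240.0, agent_minutes=240.0),
    TaskRun("rename-symbol", human_minutes=30.0, agent_minutes=6.0),
]

ratios = {run.task_id: speed_ratio(run) for run in runs}
for task_id, ratio in ratios.items():
    print(f"{task_id}: agent took {ratio:.1f}x the human time")
```

A ratio near 1 means parity with the human baseline; the ratio is only meaningful for tasks the agent actually completed.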

Datasets: 1 · Results: 6 · Canonical metric: success-rate
§ 02 · Canonical benchmark

The reference dataset.

HCAST

90 realistic software engineering tasks calibrated against human performance times. Tests whether agents can complete tasks that take humans 15 minutes to 4 hours. Primary metric: success rate across all tasks.
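As a rough illustration of the primary metric, success rate can be computed as the share of tasks an agent completes. This sketch assumes the score is reported as a percentage over all tasks with equal weight; the `success_rate` helper and the sample outcomes are hypothetical.

```python
def success_rate(outcomes: list[bool]) -> float:
    """Percentage of tasks completed successfully (assumed equal weighting)."""
    if not outcomes:
        raise ValueError("no task outcomes to score")
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical pass/fail outcomes for four tasks, for illustration only.
outcomes = [True, False, True, True]
rate = success_rate(outcomes)
print(f"success-rate: {rate:.1f}")
```

If the benchmark weights tasks by difficulty or human time, the aggregation would differ from this unweighted mean.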

Primary metric: success-rate
§ 03 · Top 10

Leading models.

Leading models on HCAST.

#   Model                 success-rate   Year   Source
1   Claude Opus 4         55.0           2025   paper
2   o3                    49.0           2025   paper
3   Claude 3.7 Sonnet     38.0           2025   paper
4   o1                    28.0           2025   paper
5   Claude 3.5 Sonnet     18.0           2025   paper
6   GPT-4 Turbo (2024)    12.0           2024   paper


§ 04 · All datasets

Tracked datasets.

1 dataset tracked for this task.

HCAST (canonical) · 6 results · success-rate · Top: Claude Opus 4, 55.0
§ 05 · Related tasks

Other tasks in Agentic AI.

Agent Memory · Autonomous Coding · Bioinformatics Agents · RE-Bench · SWE-bench · Task agents · Time Horizon · Tool Use