HCAST.

HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI autonomy against human-calibrated baselines: every task has known completion times from professional software engineers, enabling direct human-vs-AI comparison. Tasks span realistic software engineering scenarios at varying difficulty levels, from simple bug fixes to complex architectural changes. The human calibration is what makes HCAST distinctive: beyond a pass/fail result, it shows whether AI agents are 10x slower than humans, equally fast, or approaching superhuman speed on specific task types.
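The human-vs-AI comparison described above can be sketched as a ratio of agent time to human baseline time. This is a minimal illustration, not METR's scoring code: the `TaskRun` record, its field names, the task ids, the timings, and the thresholds in `describe` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    task_id: str          # hypothetical task identifier
    human_minutes: float  # baseline completion time from a human engineer
    agent_minutes: float  # time the agent needed on the same task
    solved: bool          # whether the agent completed the task


def speed_ratio(run: TaskRun) -> float:
    """Agent time divided by human time; above 1.0 means the agent was slower."""
    if run.human_minutes <= 0:
        raise ValueError("human baseline time must be positive")
    return run.agent_minutes / run.human_minutes


def describe(run: TaskRun) -> str:
    """Label a run using illustrative thresholds (assumptions, not HCAST's)."""
    if not run.solved:
        return "not solved"
    ratio = speed_ratio(run)
    if ratio >= 10.0:
        return "10x or more slower than the human baseline"
    if ratio > 1.1:
        return "slower than the human baseline"
    if ratio >= 0.9:
        return "about as fast as the human baseline"
    return "faster than the human baseline"


# Made-up runs for illustration only; these are not HCAST results.
runs = [
    TaskRun("bug-fix", human_minutes=30, agent_minutes=300, solved=True),
    TaskRun("refactor", human_minutes=60, agent_minutes=60, solved=True),
    TaskRun("migration", human_minutes=120, agent_minutes=40, solved=True),
    TaskRun("redesign", human_minutes=240, agent_minutes=90, solved=False),
]

for run in runs:
    print(run.task_id, "->", describe(run))
```

An unsolved run is labeled before any ratio is computed, since a speed comparison only makes sense for tasks the agent actually completed.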

§ 02 · Canonical benchmark

The reference dataset.

HCAST

90 realistic software engineering tasks calibrated against human performance times. Tests whether agents can complete tasks that take humans 15 minutes to 4 hours. Primary metric: success rate across all tasks.

Primary metric: success-rate
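As a rough sketch of how a success-rate figure like the one named above could be computed, the snippet below scores each task as a binary pass/fail and reports the percentage solved, both overall and per human-time bucket. The binary scoring, the bucket edges (15, 60, and 240 minutes), the function names, and the sample data are assumptions for illustration; the source does not specify how partial credit or repeated runs are handled.

```python
from typing import Dict, List, Sequence, Tuple


def success_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of tasks solved, assuming one binary outcome per task."""
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return 100.0 * sum(outcomes) / len(outcomes)


def success_rate_by_duration(
    results: List[Tuple[float, bool]],
    edges: Sequence[int] = (15, 60, 240),  # assumed bucket edges, in minutes
) -> Dict[str, float]:
    """Success rate per human-time bucket; empty buckets are skipped."""
    rates: Dict[str, float] = {}
    for lo, hi in zip(edges, edges[1:]):
        is_last = hi == edges[-1]
        in_bucket = [
            solved
            for minutes, solved in results
            # The final bucket also includes its upper edge (e.g. exactly 4 hours).
            if lo <= minutes < hi or (is_last and minutes == hi)
        ]
        if in_bucket:
            rates[f"{lo}-{hi} min"] = success_rate(in_bucket)
    return rates


# Made-up (human_minutes, solved) pairs for illustration; not HCAST results.
results = [(20, True), (45, False), (90, True), (240, False)]

print(success_rate([solved for _, solved in results]))
print(success_rate_by_duration(results))
```

Breaking the rate down by human completion time is one way to see whether an agent's failures cluster on the longer tasks, which a single aggregate number hides.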

§ 03 · Top 10

Leading models.

Leading models on HCAST.

#	Model	success-rate (%)	Year	Source
1	Claude Opus 4	55.0	2025	paper
2	o3	49.0	2025	paper
3	Claude 3.7 Sonnet	38.0	2025	paper
4	o1	28.0	2025	paper
5	Claude 3.5 Sonnet	18.0	2025	paper
6	GPT-4 Turbo (2024)	12.0	2024	paper

§ 04 · All datasets

Tracked datasets.

1 dataset tracked for this task.

HCAST · CANONICAL · 6 results · success-rate · Top: Claude Opus 4, 55.0

§ 05 · Related tasks

Other tasks in Agentic AI.

Agent Memory · Autonomous Coding · Bioinformatics Agents · RE-Bench · SWE-bench · Task agents · Time Horizon · Tool Use
