GPT-4 Turbo (2024).

OpenAI · proprietary · Unknown params · GPT-4 Turbo (gpt-4-turbo-2024-04-09)

GPT-4 Turbo evaluated on METR autonomy tasks and HCAST.

§ 01 · Benchmarks

Every benchmark GPT-4 Turbo (2024) has a recorded score for.

#	Benchmark	Area · Task	Metric	Value	Rank	Date	Source
01	METR Time Horizon	Agentic AI · Time Horizon	task-horizon-minutes	2.0%	#5/5	2025-04-01	source ↗
02	RE-Bench	Agentic AI · RE-Bench	normalized-score	0.1%	#5/5	2024-11-22	source ↗
03	HCAST	Agentic AI · HCAST	success-rate	12.0%	#6/6	2023-12-19	source ↗
04	WebArena	Agentic AI · Web & Desktop Agents	success-rate	14.9%	#6/6	2023-07-26	source ↗
05	OSWorld	Agentic AI · Web & Desktop Agents	success-rate	6.5%	#13/13	2024-04-11	source ↗

The Rank column shows this model’s position among all models scored on the same benchmark and metric, with the number of models scored after the slash. #1 in red means current SOTA. Rows are sorted by rank, then by newest result.
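The sort order described above can be sketched in a few lines. This is a minimal illustration, assuming "sorted by rank, then newest result" means rank ascending with ties broken by the most recent date; the tuple layout and variable names are hypothetical, and the rows are copied from the table.

```python
from datetime import date

# Rows copied from the table above, deliberately shuffled:
# (benchmark, rank, result date). The tuple layout is an assumption.
rows = [
    ("OSWorld", 13, date(2024, 4, 11)),
    ("WebArena", 6, date(2023, 7, 26)),
    ("METR Time Horizon", 5, date(2025, 4, 1)),
    ("HCAST", 6, date(2023, 12, 19)),
    ("RE-Bench", 5, date(2024, 11, 22)),
]

# Rank ascending first; within equal ranks, newest date first
# (negating the ordinal day number reverses the date order).
ordered = sorted(rows, key=lambda r: (r[1], -r[2].toordinal()))
names = [name for name, _, _ in ordered]

for name in names:
    print(name)
```

Under that assumption the result reproduces the row order shown in the table.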

§ 02 · Strengths by area

Where GPT-4 Turbo (2024) actually performs.

Agentic AI · 5 benchmarks · avg rank #7.0
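The average rank appears to be the plain arithmetic mean of the five ranks in the benchmark table. The sketch below checks that reading; the mean is an assumption about how the page computes the figure, and the dictionary is a hypothetical container for the ranks listed in § 01.

```python
from statistics import mean

# Ranks copied from the benchmark table in § 01.
ranks = {
    "METR Time Horizon": 5,
    "RE-Bench": 5,
    "HCAST": 6,
    "WebArena": 6,
    "OSWorld": 13,
}

# Assumption: "avg rank" is the unweighted arithmetic mean of the ranks.
avg_rank = mean(ranks.values())
label = f"avg rank #{avg_rank:.1f}"
print(label)
```

The single #13 on OSWorld pulls the average well above the model's typical rank of #5 or #6, so the figure is best read alongside the per-benchmark ranks.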

§ 03 · Papers

5 papers with results for GPT-4 Turbo (2024).

2025-04-01 · Agentic AI · 1 result
METR: Measuring Autonomy in AI Systems (2025 Update)
2024-11-22 · Agentic AI · 1 result
RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents Against Human Experts
2024-04-11 · Agentic AI · 1 result
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
2023-12-19 · Agentic AI · 1 result
HCAST: Human-Calibrated Autonomy Software Tasks
2023-07-26 · Agentic AI · 1 result
WebArena: A Realistic Web Environment for Building Autonomous Agents

§ 04 · Related models

Other OpenAI models scored on Codesota.

GPT-4o

Undisclosed params · 35 results · 9 SOTA

Undisclosed params · 8 results

§ 05 · Sources & freshness

Where these numbers come from.

arxiv

results

official-leaderboard

results

5 of 5 rows marked verified. · first result 2023-07-26, latest 2025-04-01.
