Web & Desktop Agents.

Web and desktop agents are AI systems that operate browsers and GUIs to complete real tasks. They are benchmarked by WebArena, VisualWebArena, Mind2Web, and OSWorld. Current agents (GPT-4V + Playwright, Claude Computer Use) achieve 15-35% success on realistic web tasks, far below human performance. The core difficulty is grounding: mapping high-level instructions ("book a flight under $300") to pixel-level or DOM-level actions across unpredictable, dynamic interfaces. This is where multimodal understanding meets sequential decision-making, and progress here is a direct indicator of when AI assistants will be able to act reliably on a user's behalf.
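To make the grounding problem concrete, here is a minimal sketch of DOM-level grounding: given one natural-language step, pick the page element to act on. Everything in it is an assumption for illustration. `Element`, `Action`, and `ground` are hypothetical names, the page is a hand-written list rather than a real DOM, and the word-overlap heuristic is a toy stand-in for the vision-language models that real agents use.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Element:
    """Simplified stand-in for a DOM node (hypothetical structure)."""
    element_id: str
    role: str
    text: str


@dataclass
class Action:
    """A low-level action the agent would send to the browser."""
    kind: str          # e.g. "click" or "type"
    element_id: str
    value: str = ""


def ground(step: str, elements: list[Element]) -> Element | None:
    """Return the element whose visible text shares the most words with the step.

    Toy heuristic only: returns None when no element shares any word.
    """
    step_words = set(step.lower().split())
    best, best_overlap = None, 0
    for el in elements:
        overlap = len(step_words & set(el.text.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = el, overlap
    return best


# A hand-written page standing in for a flight-search form.
page = [
    Element("e1", "textbox", "Destination city"),
    Element("e2", "button", "Search flights"),
    Element("e3", "link", "Sign in"),
]

target = ground("click the search flights button", page)
action = Action("click", target.element_id) if target else None
```

Even this toy version shows why grounding is brittle: the step "open settings" matches nothing on the page, and a relabeled button would break the match entirely.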

§ 02 · Canonical benchmark

The reference dataset.

OSWorld

369 real computer tasks requiring GUI interaction. The OSWorld environment supports Ubuntu, Windows, and macOS, with the main task set running on Ubuntu. It tests agents operating full desktop apps such as spreadsheets, image editors, and terminals, which makes it considerably harder than web-only benchmarks.

Primary metric: success-rate
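The leaderboard reports success-rate as a percentage. A minimal sketch of the computation, assuming each task produces a single pass/fail outcome (the function name and the example outcomes are hypothetical, and a benchmark's own scoring scripts may differ in detail):

```python
def success_rate(outcomes: list[bool]) -> float:
    """Return the percentage of tasks judged successful."""
    if not outcomes:
        raise ValueError("no task outcomes to score")
    # True counts as 1 and False as 0, so sum() is the number of successes.
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical run: four tasks, three judged successful.
example_outcomes = [True, False, True, True]
rate = success_rate(example_outcomes)
```

Under this definition, a score of 63.5 means roughly 63.5 of every 100 tasks were completed.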

§ 03 · Top 10

Leading models.

Leading models on OSWorld.

#	Model	success-rate (%)	Year	Source
1	Agent S3 w/ bBoN	63.5	2025	paper ↗
2	GLM-5V-Turbo	62.3	2026	paper ↗
3	CoAct-1 ✓	60.8	2026	paper ↗
4	JEDI-7B with o3 planner	51.0	2025	paper ↗
5	UI-TARS-2 ✓	47.5	2026	paper ↗
6	GTA1 (7B) ✓	45.2	2026	paper ↗
7	UI-TARS-1.5 ✓	42.5	2026	paper ↗
8	Agent S2 (Gemini 2.5) ✓	41.4	2026	paper ↗
9	Holo2-8B	39.9	2026	paper ↗
10	Qwen3-VL-235B-A22B-Thinking	38.1	2025	paper ↗