Tool Use.

Benchmarks that measure AI agents' ability to use tools and APIs to complete real-world tasks in domains such as retail and airline customer service.

§ 02 · Canonical benchmark

The reference dataset.

No canonical benchmark has been selected for this task yet.

§ 03 · Top 10

Leading models across all datasets in this task.

#	Model	Accuracy	Year	Source
★	GLM-5	89.7	2026	paper ↗
2	Step-3.5-Flash	88.2	2026	paper ↗
3	Qwen3.5-397B-A17B	86.7	2026	paper ↗
4	Qwen3.5-35B-A3B	81.2	2026	paper ↗
5	Intern-S1-Pro	80.9	2026	paper ↗
6	DeepSeek-V3.2	80.3	2025	paper ↗
7	Qwen3.5-122B-A10B	79.5	2026	paper ↗
8	Claude Opus 4.5✓	79.0	—	paper ↗
9	Qwen3.5-27B	79.0	2026	paper ↗
10	Ling-2.6-1T	78.4	2026	paper ↗

§ 04 · All datasets

1 dataset tracked for this task.

§ 05 · Related tasks
