Benchmarks measuring AI agents' ability to use tools and APIs to complete real-world tasks across domains such as retail and airline customer service.
Seeking canonical benchmark for this task.
Leading models across all datasets in this task.

| # | Model | Accuracy | Year | Source |
|---|---|---|---|---|
| ★ | GLM-5 | 89.7 | 2026 | paper ↗ |
| 2 | Step-3.5-Flash | 88.2 | 2026 | paper ↗ |
| 3 | Qwen3.5-397B-A17B | 86.7 | 2026 | paper ↗ |
| 4 | Qwen3.5-35B-A3B | 81.2 | 2026 | paper ↗ |
| 5 | Intern-S1-Pro | 80.9 | 2026 | paper ↗ |
| 6 | DeepSeek-V3.2 | 80.3 | 2025 | paper ↗ |
| 7 | Qwen3.5-122B-A10B | 79.5 | 2026 | paper ↗ |
| 8 | Claude Opus 4.5✓ | 79.0 | — | paper ↗ |
| 9 | Qwen3.5-27B | 79.0 | 2026 | paper ↗ |
| 10 | Ling-2.6-1T | 78.4 | 2026 | paper ↗ |