Tool Use · 2024 · en
Tau2-Bench: Agentic Tool-Use Benchmark
An agentic benchmark that tests tool-use capabilities in retail and airline customer-service domains. It measures a model's ability to use APIs and tools to resolve real-world tasks. The reported score is the pass rate averaged across domains.
Metrics: pass_rate
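As a rough illustration of how an average pass rate across domains can be computed, the sketch below takes per-task outcomes grouped by domain and averages the per-domain pass rates. The function names (`pass_rate`, `average_pass_rate`) and the sample outcomes are hypothetical, and the choice of an unweighted average over domains is an assumption; the page does not say whether domains are weighted by task count.

```python
from statistics import mean


def pass_rate(outcomes):
    """Percentage of tasks resolved (True) out of all attempted tasks."""
    if not outcomes:
        raise ValueError("no task outcomes given")
    return 100.0 * sum(outcomes) / len(outcomes)


def average_pass_rate(outcomes_by_domain):
    """Unweighted average of per-domain pass rates (assumed aggregation)."""
    return mean(pass_rate(o) for o in outcomes_by_domain.values())


# Made-up outcomes for illustration only; these are not benchmark data.
example = {
    "retail": [True, True, False, True],    # 3 of 4 tasks resolved
    "airline": [True, False, False, True],  # 2 of 4 tasks resolved
}

print(average_pass_rate(example))  # 62.5
```

With an unweighted average, each domain counts equally regardless of how many tasks it contains; a task-weighted average would pool all outcomes before dividing.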
Current State of the Art
Claude Opus 4.5 (Anthropic): 79 pass_rate
Tau2-Bench — pass_rate
8 results · 1 SOTA advances · higher is better
Top Models Performance Comparison
Top 8 models ranked by pass_rate
Best Score: 79.0
Top Model: Claude Opus 4.5
Models Compared: 8
Score Range: 43.0
pass_rate (primary metric)
| # | Model | Organization | Score (pass_rate) | Paper / Code | Date |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.5 | Anthropic | 79 | - | |
| 2 | GPT-5.2 | OpenAI | 73 | - | |
| 3 | Gemini 3 Pro | Google | 69 | - | |
| 4 | Claude Sonnet 4.5 | Anthropic | 63 | - | |
| 5 | GPT-5.1 | OpenAI | 59 | - | |
| 6 | Gemini 2.5 Pro | Google | 54 | - | |
| 7 | Claude 3.7 Sonnet | Anthropic | 47 | - | |
| 8 | GPT-4o (API) | OpenAI | 36 | - | |
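The summary figures reported for this leaderboard (best score, top model, models compared, score range) can be re-derived from the table rows. The short sketch below does that arithmetic; the `scores` dictionary is transcribed from the table, and the variable names are illustrative only.

```python
# pass_rate scores transcribed from the leaderboard table above.
scores = {
    "Claude Opus 4.5": 79,
    "GPT-5.2": 73,
    "Gemini 3 Pro": 69,
    "Claude Sonnet 4.5": 63,
    "GPT-5.1": 59,
    "Gemini 2.5 Pro": 54,
    "Claude 3.7 Sonnet": 47,
    "GPT-4o": 36,
}

# Higher is better, so the top model is the one with the maximum score.
top_model = max(scores, key=scores.get)
best_score = scores[top_model]

# Score range is the gap between the best and worst listed scores.
score_range = max(scores.values()) - min(scores.values())

print(top_model)      # Claude Opus 4.5
print(best_score)     # 79
print(len(scores))    # 8
print(score_range)    # 43
```

The range of 43 points is simply 79 minus 36, the gap between the first and last rows.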