AI Search Arena:
Which AI Searches the Web Best?
Blind pairwise evaluation of 22 search-augmented AI models — from Claude Opus 4.6 Search to Perplexity, Grok, and Diffbot. Ranked by real user preference across hundreds of thousands of web-search battles.
Full Leaderboard
Elo ratings derived from pairwise user preference votes. Wider confidence intervals indicate fewer battles; scores stabilise above ~10K votes. Prices are per million tokens (input / output).
| # | Model | Provider | Elo | ±CI | Votes | Input $ | Output $ | Context | License |
|---|---|---|---|---|---|---|---|---|---|
| 🥇 | claude-opus-4-6-search | Anthropic | 1255 | ±10 | 3,607 | $5 | $25 | 1M | Proprietary |
| 🥈 | grok-4.20-beta1 | xAI | 1225 | ±8 | 4,687 | N/A | N/A | N/A | Proprietary |
| 🥉 | gpt-5.2-search | OpenAI | 1219 | ±6 | 20,150 | $1.75 | $14 | 400K | Proprietary |
| 4 | gemini-3-flash-grounding | Google | 1218 | ±6 | 25,311 | N/A | N/A | N/A | Proprietary |
| 5 | gemini-3-pro-grounding | Google | 1214 | ±5 | 31,966 | $2 | $12 | N/A | Proprietary |
| 6 | gpt-5.1-search | OpenAI | 1210 | ±6 | 23,283 | $1.25 | $10 | 400K | Proprietary |
| 7 | claude-sonnet-4-6-search | Anthropic | 1203 | ±10 | 3,602 | $3 | $15 | 1M | Proprietary |
| 8 | gpt-5.2-search-non-reasoning | OpenAI | 1183 | ±6 | 20,045 | $1.75 | $14 | 400K | Proprietary |
| 9 | grok-4-1-fast-search | xAI | 1181 | ±5 | 26,758 | $0.20 | $0.50 | 2M | Proprietary |
| 10 | grok-4-fast-search | xAI | 1173 | ±4 | 42,193 | $0.20 | $0.50 | 2M | Proprietary |
| 11 | claude-opus-4-5-search | Anthropic | 1170 | ±6 | 15,488 | $5 | $25 | 200K | Proprietary |
| 12 | o3-search | OpenAI | 1143 | ±5 | 20,407 | $2 | $8 | 200K | Proprietary |
| 13 | gemini-2.5-pro-grounding | Google | 1143 | ±4 | 45,483 | $1.25 | $10 | 1M | Proprietary |
| 14 | grok-4-search | xAI | 1142 | ±5 | 19,018 | $3 | $15 | 256K | Proprietary |
| 15 | ppl-sonar-reasoning-pro-high | Perplexity | 1141 | ±5 | 28,673 | $1 | $1 | 127.1K | Proprietary |
| 16 | claude-sonnet-4-5-search | Anthropic | 1138 | ±7 | 14,385 | $3 | $15 | 1M | Proprietary |
| 17 | claude-opus-4-1-search | Anthropic | 1138 | ±4 | 44,888 | $15 | $75 | 200K | Proprietary |
| 18 | gpt-5-search | OpenAI | 1133 | ±5 | 20,519 | $1.25 | $10 | 400K | Proprietary |
| 19 | ppl-sonar-pro-high | Perplexity | 1131 | ±5 | 28,131 | $1 | $1 | 127.1K | Proprietary |
| 20 | claude-opus-4-search | Anthropic | 1129 | ±5 | 30,695 | $15 | $75 | 200K | Proprietary |
| 21 | diffbot-small-xl | Diffbot | 1024 | ±8 | 6,378 | N/A | N/A | N/A | Apache 2.0 |
| 22 | api-gpt-4o-search | OpenAI | 1006 | ±11 | 3,375 | $30 | $60 | 8.2K | Proprietary |
Pareto Frontier: Cost vs Quality
Which models give the best search quality for their price? Models on the Pareto frontier are optimal: no other model is both cheaper and better.
Pareto-Optimal Models
No other model is both cheaper and higher quality than these.
| Model | Provider | Elo | Input $/M | Output $/M | Why it's optimal |
|---|---|---|---|---|---|
| grok-4-1-fast-search | xAI | 1181 | $0.20 | $0.50 | Cheapest with competitive quality; 25× less than Claude on input. |
| gpt-5.1-search | OpenAI | 1210 | $1.25 | $10 | Best mid-range: +29 Elo over Grok Fast for ~6× the input price. |
| gpt-5.2-search | OpenAI | 1219 | $1.75 | $14 | +9 Elo over 5.1 for only 40% more. The sweet spot. |
| claude-opus-4-6-search | Anthropic | 1255 | $5 | $25 | Absolute best quality: pay 2.8× more for +36 Elo. |
Key Takeaway
Perplexity and Gemini are NOT on the Pareto frontier. At $1/M, Perplexity Sonar Reasoning (Elo 1141) is dominated by grok-4-1-fast-search at $0.20/M (Elo 1181), which is both cheaper and better. Gemini 3 Pro Grounding (Elo 1214 at $2/M) is likewise dominated by GPT-5.2-search (Elo 1219 at $1.75/M). That leaves four rational choices, depending on your budget: Grok Fast for cheap, GPT-5.1/5.2 for mid-range, Claude Opus for best quality.
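The frontier can be reproduced directly from the leaderboard. A minimal Python sketch, using a hand-copied subset of the table's (input price, Elo) pairs for illustration:

```python
# Pareto-frontier check over (input $/M, Elo) pairs copied from the leaderboard.
# A model is Pareto-optimal if no other model is both cheaper and higher-rated.
models = [
    ("claude-opus-4-6-search",        5.00, 1255),
    ("gpt-5.2-search",                1.75, 1219),
    ("gemini-3-pro-grounding",        2.00, 1214),
    ("gpt-5.1-search",                1.25, 1210),
    ("grok-4-1-fast-search",          0.20, 1181),
    ("ppl-sonar-reasoning-pro-high",  1.00, 1141),
]

def pareto_frontier(entries):
    """Keep entries that no other entry beats on one axis without losing on the other."""
    frontier = [
        (name, price, elo)
        for name, price, elo in entries
        if not any(
            p <= price and e >= elo and (p < price or e > elo)
            for _, p, e in entries
        )
    ]
    return sorted(frontier, key=lambda t: t[1])  # cheapest first

for name, price, elo in pareto_frontier(models):
    print(f"{name}: ${price}/M input, Elo {elo}")
# Prints exactly the four models in the table above; Gemini 3 Pro and
# Sonar Reasoning Pro fall out as dominated.
```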
Best Value for Search
When search quality per dollar is the priority, the Grok Fast family stands apart from every other provider.
grok-4-1-fast-search
By xAI — 26,758 battles · ±5 CI
grok-4-fast-search
By xAI — 42,193 battles · ±4 CI
Why Grok Fast dominates value
- At $0.20 input and $0.50 output, Grok 4 Fast Search is 25× cheaper than Claude Opus 4.6 Search on input tokens (and 50× cheaper on output) while ranking only 82 Elo points lower; a per-query cost sketch follows below.
- Its 2M-token context window, the largest on the leaderboard, allows full-document retrieval without chunking.
- With 42,193 votes, Grok 4 Fast has the second-highest sample size in the Arena, so its Elo is statistically robust.
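To make the price gap concrete, here is a rough per-query cost sketch. The 2,000-input/500-output token mix is a hypothetical workload chosen for illustration, not Arena data:

```python
# Rough per-query cost: token counts times the leaderboard's $/M rates.
PRICES = {  # model: (input $/M, output $/M), from the leaderboard above
    "grok-4-fast-search":     (0.20, 0.50),
    "claude-opus-4-6-search": (5.00, 25.00),
}

def query_cost(model: str, input_tokens: int = 2_000, output_tokens: int = 500) -> float:
    inp, out = PRICES[model]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

for model in PRICES:
    print(f"{model}: ${query_cost(model):.5f} per query")
# grok-4-fast-search:     $0.00065  (~$0.65 per 1,000 queries)
# claude-opus-4-6-search: $0.02250  (~$22.50 per 1,000 queries, ~35x more on this mix)
```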
Perplexity vs the Field
Perplexity is the only search-native AI company on the leaderboard. Its models are purpose-built for retrieval, yet they occupy the mid-table. Here is what the numbers reveal.
ppl-sonar-reasoning-pro-high
ppl-sonar-pro-high
Key observations
Perplexity charges $1/$1 per million tokens, with input and output at the same rate. This flat pricing is uniquely predictable for high-volume search workloads where output length is variable.

Sonar Reasoning Pro High sits 114 Elo below Claude Opus 4.6 Search. In Arena terms, that means Claude wins roughly 66% of head-to-head comparisons (see the conversion sketch below).

127.1K context is the smallest in the competitive tier: roughly 8× shorter than the 1M-context Claude and Gemini models, and about 16× shorter than Grok Fast's 2M window.
Bottom line: Perplexity remains the go-to for consumer search UX and predictable API billing. But in pure Arena quality, it trails the frontier by a meaningful margin. For enterprise search pipelines where quality per dollar is paramount, Grok Fast Search delivers higher Arena quality at a fraction of Perplexity's price.
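The win-rate figure above follows from the standard logistic Elo model; a minimal sketch of the conversion (the 400-point scale is the chess convention the Arena reuses):

```python
# Expected head-to-head win rate implied by an Elo gap (standard logistic Elo).
def win_probability(elo_a: float, elo_b: float) -> float:
    """P(A beats B) under the standard Elo model."""
    return 1 / (1 + 10 ** ((elo_b - elo_a) / 400))

# Claude Opus 4.6 Search (1255) vs. Sonar Reasoning Pro High (1141):
print(f"{win_probability(1255, 1141):.0%}")  # -> 66%
```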
Provider Breakdown
Six providers compete in the 2026 AI Search Arena. Here is how each approaches web search integration.
- **Anthropic:** Dominates the top spot, with a 1M-token context window across Sonnet and Opus.
- **OpenAI:** Broadest model range; GPT-5.2 Search leads the OpenAI family.
- **Google:** Flash Grounding nearly matches Pro at a fraction of the cost.
- **xAI:** Best price-performance ratio; Fast variants at $0.20/$0.50.
- **Perplexity:** Flat-rate $1/$1 pricing, uniquely predictable for search budgets.
- **Diffbot:** The only open-weight entry (Apache 2.0); a SaaS API built on Diffbot's knowledge graph.
How the Arena Works
Evaluation method
1. A real user submits a search query (news, factual, research).
2. Two models answer the same query simultaneously, identities hidden.
3. The user picks the better response, or votes for a tie.
4. Elo is updated for both models using the standard chess Elo formula; a sketch follows below.
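A minimal sketch of step 4, assuming a K-factor of 32 (the Arena's actual K and any vote weighting are not stated here):

```python
# One Arena vote updates both ratings with the standard Elo rule:
#   R' = R + K * (score - expected), score in {1, 0.5, 0} for win / tie / loss.
K = 32  # assumed K-factor; the Arena's real value is not published here

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the logistic Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float) -> tuple[float, float]:
    """score_a: 1.0 if A wins, 0.5 for a tie, 0.0 if B wins."""
    e_a = expected(r_a, r_b)
    return r_a + K * (score_a - e_a), r_b + K * ((1 - score_a) - (1 - e_a))

# A 1200-rated model beating a 1250-rated one gains ~18 points:
print(update(1200, 1250, 1.0))  # -> (~1218.3, ~1231.7)
```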
What gets measured
- Factual accuracy: does the answer match ground truth?
- Citation quality: are sources credible and relevant?
- Recency: does the model surface up-to-date information?
- Synthesis: does it aggregate multiple sources coherently?