Live Elo Rankings · 22 Models

AI Search Arena:
Which AI Searches the Web Best?

Blind pairwise evaluation of 22 search-augmented AI models — from Claude Opus 4.6 Search to Perplexity, Grok, and Diffbot. Ranked by real user preference across hundreds of thousands of web-search battles.

Leader: Claude Opus 4.6 Search · 1255 Elo
Best Value: Grok 4 Fast Search · $0.20 / $0.50
Open-weight: Diffbot Small XL · Apache 2.0
Total votes: 479K battles

Full Leaderboard

Elo ratings are derived from pairwise user preference votes. Wider confidence intervals indicate fewer battles; scores stabilise above ~10K votes. Prices are per million tokens (input / output).

Rank · Provider · Model · Elo (±CI) · Votes · Input price · Context
#1 · Anthropic · claude-opus-4-6-search · 1255 (±10) · 3,607 · $5/M · 1M
#2 · xAI · grok-4.20-beta1 · 1225 (±8) · 4,687 · N/A · N/A
#3 · OpenAI · gpt-5.2-search · 1219 (±6) · 20,150 · $1.75/M · 400K
#4 · Google · gemini-3-flash-grounding · 1218 (±6) · 25,311 · N/A · N/A
#5 · Google · gemini-3-pro-grounding · 1214 (±5) · 31,966 · $2/M · N/A
#6 · OpenAI · gpt-5.1-search · 1210 (±6) · 23,283 · $1.25/M · 400K
#7 · Anthropic · claude-sonnet-4-6-search · 1203 (±10) · 3,602 · $3/M · 1M
#8 · OpenAI · gpt-5.2-search-non-reasoning · 1183 (±6) · 20,045 · $1.75/M · 400K
#9 · xAI (Value) · grok-4-1-fast-search · 1181 (±5) · 26,758 · $0.20/M · 2M
#10 · xAI (Value) · grok-4-fast-search · 1173 (±4) · 42,193 · $0.20/M · 2M
#11 · Anthropic · claude-opus-4-5-search · 1170 (±6) · 15,488 · $5/M · 200K
#12 · OpenAI · o3-search · 1143 (±5) · 20,407 · $2/M · 200K
#13 · Google · gemini-2.5-pro-grounding · 1143 (±4) · 45,483 · $1.25/M · 1M
#14 · xAI · grok-4-search · 1142 (±5) · 19,018 · $3/M · 256K
#15 · Perplexity · ppl-sonar-reasoning-pro-high · 1141 (±5) · 28,673 · $1/M · 127.1K
#16 · Anthropic · claude-sonnet-4-5-search · 1138 (±7) · 14,385 · $3/M · 1M
#17 · Anthropic · claude-opus-4-1-search · 1138 (±4) · 44,888 · $15/M · 200K
#18 · OpenAI · gpt-5-search · 1133 (±5) · 20,519 · $1.25/M · 400K
#19 · Perplexity · ppl-sonar-pro-high · 1131 (±5) · 28,131 · $1/M · 127.1K
#20 · Anthropic · claude-opus-4-search · 1129 (±5) · 30,695 · $15/M · 200K
#21 · Diffbot · diffbot-small-xl · 1024 (±8) · 6,378 · N/A · N/A
#22 · OpenAI · api-gpt-4o-search · 1006 (±11) · 3,375 · $30/M · 8.2K

Pareto Frontier: Cost vs Quality

Which models give the best search quality for their price? Models on the Pareto frontier (connected by the line) are optimal — no other model is both cheaper and better.

[Scatter chart: Input Price in $/M tokens ($0.20 to $30) on the x-axis vs Elo Score (1000 to 1300) on the y-axis; points colored by provider (Anthropic, OpenAI, Google, xAI, Perplexity), with the Pareto frontier drawn as a connecting line.]

Pareto-Optimal Models

No other model is both cheaper and higher quality than these.

Model · Provider · Elo · Input $/M · Output $/M · Why it's optimal
grok-4-1-fast-search · xAI · 1181 · $0.20 · $0.50 · Cheapest with competitive quality; 25× cheaper input than Claude Opus 4.6.
gpt-5.1-search · OpenAI · 1210 · $1.25 · $10 · Best mid-range: +29 Elo over Grok Fast for ~6× the price.
gpt-5.2-search · OpenAI · 1219 · $1.75 · $14 · +9 Elo over 5.1 for only 40% more; the sweet spot.
claude-opus-4-6-search · Anthropic · 1255 · $5 · $25 · Absolute best quality: pay ~2.9× more than GPT-5.2 for +36 Elo.

Key Takeaway

Perplexity and Gemini are NOT on the Pareto frontier. At $1/M, Perplexity Sonar Reasoning Pro High (Elo 1141) is outperformed by Grok 4.1 Fast at $0.20/M (Elo 1181), which is cheaper AND better. Gemini 3 Pro Grounding (Elo 1214 at $2/M) is dominated by GPT-5.2 Search (Elo 1219 at $1.75/M). Which of the four Pareto-optimal models to pick comes down to budget: Grok for the cheapest tier, GPT-5.1/5.2 for mid-range, Claude for top quality.
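The dominance logic above can be sketched as a simple Pareto filter over (input price, Elo) pairs. The figures below are taken from the leaderboard; the function and dictionary names are our own illustration:

```python
# Pareto filter: a model is dominated if some other model is at least as
# cheap AND at least as good, and strictly better on one of the two axes.
models = {
    "grok-4-1-fast-search": (0.20, 1181),
    "ppl-sonar-reasoning-pro-high": (1.00, 1141),
    "gpt-5.1-search": (1.25, 1210),
    "gpt-5.2-search": (1.75, 1219),
    "gemini-3-pro-grounding": (2.00, 1214),
    "claude-opus-4-6-search": (5.00, 1255),
}

def pareto_frontier(models):
    frontier = []
    for name, (price, elo) in models.items():
        dominated = any(
            p <= price and e >= elo and (p < price or e > elo)
            for other, (p, e) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Prints the four Pareto-optimal models; Perplexity and Gemini drop out.
print(sorted(pareto_frontier(models)))
```

Running this reproduces the four-model frontier: Sonar is dominated by Grok 4.1 Fast, and Gemini 3 Pro by GPT-5.2 Search.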

Best Value for Search

When search quality per dollar is the priority, the Grok Fast family stands apart from every other provider.

Best Value · #9 · 1181 Elo
grok-4-1-fast-search
By xAI · 26,758 battles · ±5 CI
$0.20 input / $0.50 output per M tokens · 2M context window

Best Value · #10 · 1173 Elo
grok-4-fast-search
By xAI · 42,193 battles · ±4 CI
$0.20 input / $0.50 output per M tokens · 2M context window

Why Grok Fast dominates value

  • At $0.20 input and $0.50 output, Grok 4 Fast Search is 25× cheaper than Claude Opus 4.6 Search on input tokens (and 50× cheaper on output) while ranking only 82 Elo points lower.
  • Its 2M token context window, the largest on the leaderboard, allows full-document retrieval without chunking.
  • With 42,193 votes, Grok 4 Fast has the third-highest sample size in the Arena, so its Elo is statistically robust.
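To make the price gap concrete, here is a rough per-query cost sketch. The workload is an assumption for illustration (20K input tokens of retrieved pages plus 1K output tokens per query); the prices come from the leaderboard:

```python
def cost_per_query(in_price, out_price, in_tokens=20_000, out_tokens=1_000):
    """Dollar cost of one query given per-million-token prices.

    The 20K-in / 1K-out token counts are a hypothetical search workload.
    """
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

grok = cost_per_query(0.20, 0.50)     # Grok 4 Fast Search prices
claude = cost_per_query(5.00, 25.00)  # Claude Opus 4.6 Search prices
print(f"Grok: ${grok:.4f}/query, Claude: ${claude:.4f}/query")
print(f"ratio: {claude / grok:.0f}x")
```

Under these assumptions Grok costs $0.0045 per query versus $0.1250 for Claude, roughly a 28× gap on a retrieval-heavy (input-dominated) workload.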

Perplexity vs the Field

Perplexity is the only search-native AI company in the leaderboard. Its models are purpose-built for retrieval, yet occupy the mid-table. Here is what the numbers reveal.

Perplexity · #15 · 1141 Elo
ppl-sonar-reasoning-pro-high
Input $1/M · Output $1/M · Context 127.1K

Perplexity · #19 · 1131 Elo
ppl-sonar-pro-high
Input $1/M · Output $1/M · Context 127.1K

Key observations

Flat-rate pricing

Perplexity charges $1/$1 per million tokens — input and output at the same rate. This is uniquely predictable for high-volume search workloads where output length is variable.
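The predictability claim is simple arithmetic: under flat $1/$1 pricing, cost depends only on total token volume, not on how it splits between input and output. A quick sketch with hypothetical token counts:

```python
def cost(in_tok, out_tok, in_price, out_price):
    """Dollar cost given token counts and per-million-token prices."""
    return in_tok / 1e6 * in_price + out_tok / 1e6 * out_price

# Flat-rate $1/$1: two queries with the same 20K total tokens but very
# different output lengths cost the same.
a = cost(19_000, 1_000, 1.0, 1.0)
b = cost(10_000, 10_000, 1.0, 1.0)

# Split pricing (e.g. $1.25 in / $10 out): same totals, very different cost.
c = cost(19_000, 1_000, 1.25, 10.0)
d = cost(10_000, 10_000, 1.25, 10.0)
print(a, b, c, d)
```

With flat-rate pricing `a` and `b` both come to about $0.02, while the split-priced `c` ($0.034) and `d` ($0.113) diverge by more than 3× on identical total volume.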

Elo gap to top

Sonar Reasoning Pro High sits 114 Elo below Claude Opus 4.6 Search. Under the standard Elo model, that means Claude wins roughly 66% of head-to-head comparisons.
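That win rate follows from the standard Elo expected-score formula, sketched here with the two ratings from the leaderboard:

```python
def expected_win_prob(elo_diff):
    """Standard Elo expected score for the higher-rated player."""
    return 1.0 / (1.0 + 10 ** (-elo_diff / 400))

# Claude Opus 4.6 Search (1255) vs Sonar Reasoning Pro High (1141):
p = expected_win_prob(1255 - 1141)
print(f"{p:.0%}")  # roughly 66%
```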

Context constraint

127.1K context is the smallest in the competitive tier: roughly 16× shorter than Grok Fast's 2M window and 8× shorter than the 1M-context Claude and Gemini models.

Bottom line: Perplexity remains the go-to for consumer search UX and predictable API billing. But in pure Arena quality it trails the frontier by a meaningful margin, and for enterprise search pipelines Grok Fast Search offers both higher Arena quality and lower prices than Perplexity's Sonar models.

Provider Breakdown

Six providers compete in the 2026 AI Search Arena. Here is how each approaches web search integration.

Anthropic · Top Elo 1255 · 6 models in leaderboard
Holds the #1 spot; 1M context windows across its Sonnet/Opus search models.

OpenAI · Top Elo 1219 · 6 models in leaderboard
Broadest model range; GPT-5.2 Search leads the OpenAI family.

Google · Top Elo 1218 · 3 models in leaderboard
Flash Grounding nearly matches Pro at a fraction of the cost.

xAI · Top Elo 1225 · 4 models in leaderboard
Best price-performance ratio; Fast variants at $0.20/$0.50.

Perplexity · Top Elo 1141 · 2 models in leaderboard
Flat-rate $1/$1 pricing, uniquely predictable for search budgets.

Diffbot · Top Elo 1024 · 1 model in leaderboard
Only open-weight entry (Apache 2.0); SaaS API built on Diffbot's knowledge graph.

How the Arena Works

Evaluation method

  1. A real user submits a search query (news, factual, research).
  2. Two models answer the same query simultaneously, identities hidden.
  3. The user picks the better response, or votes tie.
  4. Elo is updated for both models using standard chess Elo formulas.
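The update in step 4 can be sketched with the standard chess Elo formula. The K-factor of 32 is a conventional chess value assumed here for illustration; the Arena's actual K-factor and tie handling are not specified on this page:

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One Elo update after a battle.

    score_a: 1 if model A wins, 0.5 for a tie, 0 if A loses.
    k=32 is an assumed, conventional chess K-factor.
    """
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# A 1200-rated model beats a 1250-rated one: the underdog gains more
# than it would against an equal opponent, and ratings stay zero-sum.
a, b = elo_update(1200, 1250, score_a=1)
print(round(a, 1), round(b, 1))
```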

What gets measured

  • Factual accuracy — does the answer match ground truth?
  • Citation quality — are sources credible and relevant?
  • Recency — does the model surface up-to-date information?
  • Synthesis — does it aggregate multiple sources coherently?

Frequently Asked Questions

Which AI is best at searching the web in 2026?
Claude Opus 4.6 Search leads the AI Search Arena leaderboard with an Elo of 1255, followed closely by Grok 4.20-beta1 (1225) and GPT-5.2 Search (1219). Gemini 3 Flash Grounding (1218) is Google's top entry.
How does Perplexity compare to ChatGPT for search?
In head-to-head Arena evaluations, Perplexity Sonar Reasoning Pro High (Elo 1141) and Sonar Pro High (1131) rank below top OpenAI and Anthropic search models. However, Perplexity offers flat-rate pricing at $1/$1 per million tokens, making it cost-competitive for high-volume search workloads.
What is the best value AI search model?
Grok 4 Fast Search and Grok 4.1 Fast Search offer the best price-to-performance ratio at $0.20 input / $0.50 output per million tokens, with 2M token context windows and Elo scores above 1170.
What is AI Search Arena?
AI Search Arena is a blind pairwise evaluation framework where real users compare two anonymous AI search responses side by side. Elo ratings are computed from thousands of head-to-head battles to produce a ranked leaderboard.
Does Diffbot have an AI search model?
Yes. Diffbot Small XL is an open-weight (Apache 2.0) AI search model trained on Diffbot's web knowledge graph. It scores Elo 1024 in the Arena — lower than proprietary models but notable as the only open-weight entry in the top 22.
