AI Search Arena:
Which AI Searches the Web Best?
Blind pairwise evaluation of 22 search-augmented AI models — from Claude Opus 4.6 Search to Perplexity, Grok, and Diffbot. Ranked by real user preference across hundreds of thousands of web-search battles.
Full Leaderboard
Elo ratings derived from pairwise user preference votes. Wider confidence intervals indicate fewer battles; scores stabilise above ~10K votes. Prices are per million tokens (input / output).
| # | Model | Provider | Elo | ±CI | Votes | Input $ | Output $ | Context | License |
|---|---|---|---|---|---|---|---|---|---|
| 🥇 | claude-opus-4-6-search | Anthropic | 1255 | ±10 | 3,607 | $5 | $25 | 1M | Proprietary |
| 🥈 | grok-4.20-beta1 | xAI | 1225 | ±8 | 4,687 | N/A | N/A | N/A | Proprietary |
| 🥉 | gpt-5.2-search | OpenAI | 1219 | ±6 | 20,150 | $1.75 | $14 | 400K | Proprietary |
| 4 | gemini-3-flash-grounding | Google | 1218 | ±6 | 25,311 | N/A | N/A | N/A | Proprietary |
| 5 | gemini-3-pro-grounding | Google | 1214 | ±5 | 31,966 | $2 | $12 | N/A | Proprietary |
| 6 | gpt-5.1-search | OpenAI | 1210 | ±6 | 23,283 | $1.25 | $10 | 400K | Proprietary |
| 7 | claude-sonnet-4-6-search | Anthropic | 1203 | ±10 | 3,602 | $3 | $15 | 1M | Proprietary |
| 8 | gpt-5.2-search-non-reasoning | OpenAI | 1183 | ±6 | 20,045 | $1.75 | $14 | 400K | Proprietary |
| 9 | grok-4-1-fast-search | xAI | 1181 | ±5 | 26,758 | $0.20 | $0.50 | 2M | Proprietary |
| 10 | grok-4-fast-search | xAI | 1173 | ±4 | 42,193 | $0.20 | $0.50 | 2M | Proprietary |
| 11 | claude-opus-4-5-search | Anthropic | 1170 | ±6 | 15,488 | $5 | $25 | 200K | Proprietary |
| 12 | o3-search | OpenAI | 1143 | ±5 | 20,407 | $2 | $8 | 200K | Proprietary |
| 13 | gemini-2.5-pro-grounding | Google | 1143 | ±4 | 45,483 | $1.25 | $10 | 1M | Proprietary |
| 14 | grok-4-search | xAI | 1142 | ±5 | 19,018 | $3 | $15 | 256K | Proprietary |
| 15 | ppl-sonar-reasoning-pro-high | Perplexity | 1141 | ±5 | 28,673 | $1 | $1 | 127.1K | Proprietary |
| 16 | claude-sonnet-4-5-search | Anthropic | 1138 | ±7 | 14,385 | $3 | $15 | 1M | Proprietary |
| 17 | claude-opus-4-1-search | Anthropic | 1138 | ±4 | 44,888 | $15 | $75 | 200K | Proprietary |
| 18 | gpt-5-search | OpenAI | 1133 | ±5 | 20,519 | $1.25 | $10 | 400K | Proprietary |
| 19 | ppl-sonar-pro-high | Perplexity | 1131 | ±5 | 28,131 | $1 | $1 | 127.1K | Proprietary |
| 20 | claude-opus-4-search | Anthropic | 1129 | ±5 | 30,695 | $15 | $75 | 200K | Proprietary |
| 21 | diffbot-small-xl | Diffbot | 1024 | ±8 | 6,378 | N/A | N/A | N/A | Apache 2.0 |
| 22 | api-gpt-4o-search | OpenAI | 1006 | ±11 | 3,375 | $30 | $60 | 8.2K | Proprietary |
Pareto Frontier: Cost vs Quality
Which models give the best search quality for their price? Models on the Pareto frontier are optimal: no other model is both cheaper and better.
Pareto-Optimal Models
No other model is both cheaper and higher quality than these.
| Model | Provider | Elo | Input $/M | Output $/M | Why it's optimal |
|---|---|---|---|---|---|
| grok-4-1-fast-search | xAI | 1181 | $0.20 | $0.50 | Cheapest with competitive quality; 25× less than Claude on input. |
| gpt-5.1-search | OpenAI | 1210 | $1.25 | $10 | Best mid-range: +29 Elo over Grok Fast for ~6× the input price. |
| gpt-5.2-search | OpenAI | 1219 | $1.75 | $14 | +9 Elo over 5.1 for only 40% more. The sweet spot. |
| claude-opus-4-6-search | Anthropic | 1255 | $5 | $25 | Absolute best quality: pay 2.8× more for +36 Elo. |
Key Takeaway
Perplexity and Gemini are NOT on the Pareto frontier. At $1/M, Perplexity Sonar Reasoning (Elo 1141) is dominated by grok-4-1-fast-search at $0.20/M (Elo 1181), which is both cheaper and better. Gemini 3 Pro Grounding (Elo 1214 at $2/M) is likewise dominated by GPT-5.2-search (Elo 1219 at $1.75/M). That leaves four rational choices, depending on your budget: Grok Fast for cheap, GPT-5.1/5.2 for mid-range, Claude Opus for best quality.
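The frontier can be reproduced directly from the leaderboard. A minimal Python sketch, using a hand-copied subset of the table's (input price, Elo) pairs for illustration:

```python
# Pareto-frontier check over (input $/M, Elo) pairs copied from the leaderboard.
# A model is Pareto-optimal if no other model is both cheaper and higher-rated.
models = [
    ("claude-opus-4-6-search",        5.00, 1255),
    ("gpt-5.2-search",                1.75, 1219),
    ("gemini-3-pro-grounding",        2.00, 1214),
    ("gpt-5.1-search",                1.25, 1210),
    ("grok-4-1-fast-search",          0.20, 1181),
    ("ppl-sonar-reasoning-pro-high",  1.00, 1141),
]

def pareto_frontier(entries):
    """Keep entries that no other entry beats on one axis without losing on the other."""
    frontier = [
        (name, price, elo)
        for name, price, elo in entries
        if not any(
            p <= price and e >= elo and (p < price or e > elo)
            for _, p, e in entries
        )
    ]
    return sorted(frontier, key=lambda t: t[1])  # cheapest first

for name, price, elo in pareto_frontier(models):
    print(f"{name}: ${price}/M input, Elo {elo}")
# Prints exactly the four models in the table above; Gemini 3 Pro and
# Sonar Reasoning Pro fall out as dominated.
```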
Best Value for Search
When search quality per dollar is the priority, the Grok Fast family stands apart from every other provider.
grok-4-1-fast-search
By xAI — 26,758 battles · ±5 CI
grok-4-fast-search
By xAI — 42,193 battles · ±4 CI
Why Grok Fast dominates value
- At $0.20 input and $0.50 output, Grok 4 Fast Search is 25× cheaper than Claude Opus 4.6 Search on input tokens (and 50× cheaper on output) while ranking only 82 Elo points lower; a per-query cost sketch follows below.
- Its 2M-token context window, the largest on the leaderboard, allows full-document retrieval without chunking.
- With 42,193 votes, Grok 4 Fast has the second-highest sample size in the Arena, so its Elo is statistically robust.
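To make the price gap concrete, here is a rough per-query cost sketch. The 2,000-input/500-output token mix is a hypothetical workload chosen for illustration, not Arena data:

```python
# Rough per-query cost: token counts times the leaderboard's $/M rates.
PRICES = {  # model: (input $/M, output $/M), from the leaderboard above
    "grok-4-fast-search":     (0.20, 0.50),
    "claude-opus-4-6-search": (5.00, 25.00),
}

def query_cost(model: str, input_tokens: int = 2_000, output_tokens: int = 500) -> float:
    inp, out = PRICES[model]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

for model in PRICES:
    print(f"{model}: ${query_cost(model):.5f} per query")
# grok-4-fast-search:     $0.00065  (~$0.65 per 1,000 queries)
# claude-opus-4-6-search: $0.02250  (~$22.50 per 1,000 queries, ~35x more on this mix)
```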
Perplexity vs the Field
Perplexity is the only search-native AI company on the leaderboard. Its models are purpose-built for retrieval, yet they occupy the mid-table. Here is what the numbers reveal.
ppl-sonar-reasoning-pro-high
ppl-sonar-pro-high
Key observations
Perplexity charges $1/$1 per million tokens, with input and output at the same rate. This flat pricing is uniquely predictable for high-volume search workloads where output length is variable.

Sonar Reasoning Pro High sits 114 Elo below Claude Opus 4.6 Search. In Arena terms, that means Claude wins roughly 66% of head-to-head comparisons (see the conversion sketch below).

127.1K context is the smallest in the competitive tier: roughly 8× shorter than the 1M-context Claude and Gemini models, and about 16× shorter than Grok Fast's 2M window.
Bottom line: Perplexity remains the go-to for consumer search UX and predictable API billing. But in pure Arena quality, it trails the frontier by a meaningful margin. For enterprise search pipelines where quality per dollar is paramount, Grok Fast Search delivers higher Arena quality at a fraction of Perplexity's price.
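The win-rate figure above follows from the standard logistic Elo model; a minimal sketch of the conversion (the 400-point scale is the chess convention the Arena reuses):

```python
# Expected head-to-head win rate implied by an Elo gap (standard logistic Elo).
def win_probability(elo_a: float, elo_b: float) -> float:
    """P(A beats B) under the standard Elo model."""
    return 1 / (1 + 10 ** ((elo_b - elo_a) / 400))

# Claude Opus 4.6 Search (1255) vs. Sonar Reasoning Pro High (1141):
print(f"{win_probability(1255, 1141):.0%}")  # -> 66%
```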
Provider Breakdown
Six providers compete in the 2026 AI Search Arena. Here is how each approaches web search integration.
- **Anthropic:** Dominates the top spot, with a 1M-token context window across Sonnet and Opus.
- **OpenAI:** Broadest model range; GPT-5.2 Search leads the OpenAI family.
- **Google:** Flash Grounding nearly matches Pro at a fraction of the cost.
- **xAI:** Best price-performance ratio; Fast variants at $0.20/$0.50.
- **Perplexity:** Flat-rate $1/$1 pricing, uniquely predictable for search budgets.
- **Diffbot:** The only open-weight entry (Apache 2.0); a SaaS API built on Diffbot's knowledge graph.
How the Arena Works
Evaluation method
1. A real user submits a search query (news, factual, research).
2. Two models answer the same query simultaneously, identities hidden.
3. The user picks the better response, or votes for a tie.
4. Elo is updated for both models using the standard chess Elo formula; a sketch follows below.
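A minimal sketch of step 4, assuming a K-factor of 32 (the Arena's actual K and any vote weighting are not stated here):

```python
# One Arena vote updates both ratings with the standard Elo rule:
#   R' = R + K * (score - expected), score in {1, 0.5, 0} for win / tie / loss.
K = 32  # assumed K-factor; the Arena's real value is not published here

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the logistic Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float) -> tuple[float, float]:
    """score_a: 1.0 if A wins, 0.5 for a tie, 0.0 if B wins."""
    e_a = expected(r_a, r_b)
    return r_a + K * (score_a - e_a), r_b + K * ((1 - score_a) - (1 - e_a))

# A 1200-rated model beating a 1250-rated one gains ~18 points:
print(update(1200, 1250, 1.0))  # -> (~1218.3, ~1231.7)
```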
What gets measured
- Factual accuracy: does the answer match ground truth?
- Citation quality: are sources credible and relevant?
- Recency: does the model surface up-to-date information?
- Synthesis: does it aggregate multiple sources coherently?