New research

~52% of tracked LLM inference — a year on OpenRouter

Live · 53 weeks tracked · 382 verified results · updated 2026-04-14

State of the art,
verified in the wild.

CodeSOTA tracks the real frontier of machine learning. Independent benchmarks across 17 research areas, every result linked to its paper and code, and original analysis of how the market actually uses these models. No marketing claims — just data.

382
Verified results
164
Models tracked
98
Datasets
17
Research areas

Original research

AI inference market · 12 months

11.1× growth
Chinese labs: 15% → 52% · full analysis →
Weekly OpenRouter rankings, analyzed & priced by CodeSOTA

Pick by workload · Not by leaderboard

The market isn't choosing a winner. It's running three lanes at once.

Benchmark leaders don't lose revenue — they lose the long tail. When you look at what AI apps actually route through OpenRouter, the market splits cleanly into three tiers. Pick the wrong one and you're either burning money or shipping broken answers.

What's different

Everyone else lists models. We tell you which one to use.

HuggingFace has the catalog. OpenRouter has the routing. Model cards have marketing. None of them tell you what wins in the wild. That's the gap we fill — verified benchmarks, priced cost curves, and real usage data joined into one editorial surface.

HuggingFace

800K+ models, no curation, no verification. Leaderboards that anyone can game. Useful as a hosting layer, silent as a decision tool.

Hosts models

OpenRouter

Routes tokens. Shows what apps use. No benchmark context, no quality analysis, no editorial. You see the usage but have to interpret it yourself.

Routes traffic

CodeSOTA

17 research areas. 382 verified results. A year of market data. Original analysis joining benchmarks, cost, and usage into picks you can act on.

Makes the call

Deeper than a leaderboard

Every benchmark links to its paper, code, and methodology. Every result is cross-checked. We publish write-ups (ParseBench, Hermes Agent case study, OpenRouter market trends) that explain what the numbers mean — not just what they are.

Independent by design

No corporate owner, no pay-to-play, no sponsored leaderboards. All data is open JSON. If a vendor doesn't ship, they don't rank — no exceptions. The only way up is by shipping better results.

How we source results

Four sources, in order of trust.

Not every benchmark result can be reproduced by us directly — but every result carries its source, so you can decide how much weight to give it.

  1. Our own testing (open-weight)

    We run open-weight models locally on the same benchmarks under identical conditions. No vendor APIs, no marketing numbers, fully reproducible. This is the highest-trust tier.

  2. Our own testing (vendor API)

    For closed models we hit the vendor's API directly with the same prompts, same scoring, same dataset. Reproducible given the API, and unaffected by whatever the vendor's marketing page claims this week.

  3. Published paper results

    Numbers taken directly from a peer-reviewed or arXiv paper, linked with the paper URL and access date. We don't re-run these but we verify they exist in the source.

  4. Vendor-reported results

    Scores from a model card, blog post, or system card — labeled as such and weighted accordingly. Useful for coverage, treated with skepticism. If an independent source disagrees, we surface both.

Every data row in our JSON carries its source type and URL. See our full methodology.
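In practice, carrying the source type in every row means consumers can weight results themselves. A minimal sketch of that, assuming illustrative field names (`source_type`, `score`; check the published JSON schema for the real keys) and tier labels that mirror the four tiers above:

```python
# Sketch: ordering benchmark results by source trust tier.
# Keys and tier labels here are illustrative, not CodeSOTA's actual schema.

# Lower rank = higher trust, following the four tiers above.
TRUST_RANK = {
    "own_open_weight": 0,   # tier 1: reproduced locally
    "own_vendor_api": 1,    # tier 2: re-run via the vendor's API
    "paper": 2,             # tier 3: taken from a linked paper
    "vendor_reported": 3,   # tier 4: vendor's own numbers
}

def most_trusted(results):
    """Sort results most-trusted first; break ties by higher score."""
    return sorted(
        results,
        key=lambda r: (TRUST_RANK.get(r["source_type"], 99), -r["score"]),
    )

rows = [
    {"model": "a", "score": 71.2, "source_type": "vendor_reported"},
    {"model": "b", "score": 69.8, "source_type": "own_open_weight"},
    {"model": "c", "score": 70.5, "source_type": "paper"},
]
# most_trusted(rows) puts the locally reproduced result first,
# even though its raw score is the lowest of the three.
```

The point of the design: the raw score never changes, only the order in which you believe it.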

Recent research from CodeSOTA

Beyond the leaderboards: what's actually being used

Benchmark scores are one signal. What the market actually runs in production is another. CodeSOTA publishes original research that joins both — starting with a year of inference-market data we analyzed, priced, and indexed ourselves. One finding: Chinese open-weight labs went from ~15% to ~52% of tracked flow in 53 weeks, while the total market grew 11×.
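Those two numbers compound: a share shift inside a market that itself grew ~11× implies a far larger absolute swing. A back-of-envelope check, using the figures above:

```python
# Back-of-envelope: combine the share shift with total-market growth.
total_growth = 11.1                    # total tracked flow grew ~11.1x
share_start, share_end = 0.15, 0.52    # Chinese-lab share, start vs end

# Absolute growth of Chinese-lab traffic = total growth x share ratio.
chinese_growth = total_growth * share_end / share_start
print(round(chinese_growth, 1))  # ~38.5x in absolute token flow
```

So a 3.5× share gain, riding an 11× market, is roughly a 38× increase in actual traffic.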

CodeSOTA analysis · OpenRouter vendor share · 53 weeks

full chart →
[Chart: vendor share, 0–100%, 2025-04 through 2026-04]

Working with

Vendors who submit to our leaderboards and verify against our methodology. Join them →

For model vendors & labs

Ship a model? Get it on the record.

Launching a new LLM, OCR model, or agent tool? We verify results independently against the original benchmark, link to your paper and code, track your numbers over time, and drive qualified buyers from the leaderboards — not marketing decks. No fee, no priority tiers, no pay-to-play.

  • Independent verification — we re-run public benchmarks where we can
  • Cross-linked to arXiv, GitHub, HuggingFace, and pricing pages
  • JSON-indexed, cacheable, citable — machine-readable end to end
  • Read by practitioners picking between models for real workloads

Why this exists

When Meta shut down Papers with Code in July 2025, the ML community lost its reference for what state-of-the-art looks like. 9,327 benchmarks, 79,817 papers, gone overnight.

CodeSOTA rebuilds that infrastructure — independently — and extends it with live market data most benchmark sites don't track. We verify results ourselves where possible, link every claim to its source, and publish everything as open JSON. No corporate owner that might pull the plug.

"Outstanding work. Just yesterday I was searching for good OCR comparisons and found only marketing BS. Good job!"

AI Consultant — Voice-AI at scale

"Super clean, slop-free UI, but most importantly the copy: very precise positioning and project overview."

Senior Architect

Open data

Every benchmark result, every OpenRouter snapshot, every weekly trend — available as JSON. No API key, no rate limits. Build dashboards, cite in papers, integrate into your routing layer.

Stay current

New benchmarks, market shifts, and model comparisons — delivered when it matters.