Chinese open-weight labs captured ~52% of tracked LLM inference — a year on OpenRouter →
State of the art,
verified in the wild.
CodeSOTA tracks the real frontier of machine learning. Independent benchmarks across 17 research areas, every result linked to its paper and code, and original analysis of how the market actually uses these models. No marketing claims — just data.
Original research
AI inference market · 12 months
Pick by workload · Not by leaderboard
The market isn't choosing a winner. It's running three lanes at once.
Benchmark leaders don't lose revenue — they lose the long tail. When you look at what AI apps actually route through OpenRouter, the market splits cleanly into three tiers. Pick the wrong one and you're either burning money or shipping broken answers.
SOTA · Premium
When correctness is the constraint
- Claude Opus 4.6
- Claude Sonnet 4.6
- GPT-5.4
- Gemini 3.1 Pro
Cost-effective
Where the real work runs
- Gemini 3 Flash
- Qwen3.6 Plus
- MiMo-V2-Pro
- DeepSeek V3.2
Commodity · Scale
Billions of tokens, pennies on the dollar
- MiMo-V2-Flash
- Step 3.5 Flash
- Trinity Large
- Free-tier fleet
What's different
Everyone else lists models. We tell you which one to use.
HuggingFace has the catalog. OpenRouter has the routing. Model cards have marketing. None of them tell you what wins in the wild. That's the gap we fill — verified benchmarks, priced cost curves, and real usage data joined into one editorial surface.
HuggingFace
800K+ models, no curation, no verification. Leaderboards that anyone can game. Useful as a hosting layer, silent as a decision tool.
Hosts models
OpenRouter
Routes tokens. Shows what apps use. No benchmark context, no quality analysis, no editorial. You see the usage but have to interpret it yourself.
Routes traffic
CodeSOTA
17 research areas. 382 verified results. A year of market data. Original analysis joining benchmarks, cost, and usage into picks you can act on.
Makes the call
Deeper than a leaderboard
Every benchmark links to its paper, code, and methodology. Every result is cross-checked. We publish write-ups (ParseBench, Hermes Agent case study, OpenRouter market trends) that explain what the numbers mean — not just what they are.
Independent by design
No corporate owner, no pay-to-play, no sponsored leaderboards. All data is open JSON. If a vendor doesn't ship, they don't rank; no exceptions. The only way up is by shipping better results.
How we source results
Four sources, in order of trust.
Not every benchmark result can be reproduced by us directly — but every result carries its source, so you can decide how much weight to give it.
1. Our own testing (open-weight)
We run open-weight models locally on the same benchmarks under identical conditions. No vendor APIs, no marketing numbers, fully reproducible. This is the highest-trust tier.
2. Our own testing (vendor API)
For closed models we hit the vendor's API directly with the same prompts, same scoring, same dataset. Reproducible given the API, and unaffected by whatever the vendor's marketing page claims this week.
3. Published paper results
Numbers taken directly from a peer-reviewed or arXiv paper, linked with the paper URL and access date. We don't re-run these, but we verify they exist in the source.
4. Vendor-reported results
Scores from a model card, blog post, or system card — labeled as such and weighted accordingly. Useful for coverage, treated with skepticism. If an independent source disagrees, we surface both.
Every data row in our JSON carries its source type and URL. See our full methodology.
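To make that concrete, here's a minimal TypeScript sketch of how a source-tagged row could be consumed; the field names, tier labels, and trust weights below are illustrative assumptions, not CodeSOTA's actual schema.

```typescript
// Hypothetical shape of one benchmark-result row; field names are
// assumptions for illustration, not the published schema.
type SourceType =
  | "own-open-weight"   // tier 1: reproduced locally on open weights
  | "own-vendor-api"    // tier 2: reproduced via the vendor's API
  | "published-paper"   // tier 3: taken from a linked paper
  | "vendor-reported";  // tier 4: model card, blog, or system card

interface BenchmarkRow {
  model: string;        // e.g. "DeepSeek V3.2"
  benchmark: string;    // e.g. "SWE-bench"
  score: number;
  sourceType: SourceType;
  sourceUrl: string;    // the paper, repo, or vendor page behind the number
  accessedAt?: string;  // ISO date, relevant for tiers 3 and 4
}

// One reasonable way to act on provenance: discount lower-trust tiers
// when aggregating. The weights here are arbitrary examples.
const trust: Record<SourceType, number> = {
  "own-open-weight": 1.0,
  "own-vendor-api": 0.9,
  "published-paper": 0.7,
  "vendor-reported": 0.4,
};

const weightedScore = (row: BenchmarkRow): number =>
  row.score * trust[row.sourceType];
```

Carrying provenance on every row is what makes this kind of per-tier weighting possible downstream.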
Recent research from CodeSOTA
Beyond the leaderboards: what's actually being used
Benchmark scores are one signal. What the market actually runs in production is another. CodeSOTA publishes original research that joins both, starting with a year of inference-market data we analyzed, priced, and inverted ourselves. One finding: Chinese open-weight labs went from ~15% to ~52% of tracked flow in 53 weeks while the total market grew 11×; in absolute terms, their tracked volume grew roughly 38×.
CodeSOTA analysis · OpenRouter vendor share · 53 weeks
One year of OpenRouter →
Full stacked area, vendor-by-vendor shift table, cost-vs-quality writeup.
Who burns the most →
32 AI apps ranked by monthly spend and token volume. Per-app deep dives.
Which models agents actually use →
Every model in the catalog, every app that uses it, four rankings.
Working with
Vendors who submit to our leaderboards and verify against our methodology. Join them →
Research areas
Browse benchmarks by domain
Computer Vision
10 tasks · Detection, segmentation, classification, OCR
NLP
9 tasks · Language models, QA, translation, NER
Reasoning
MATH, GSM8K · Mathematical, logical, commonsense
Code
6 tasks · Generation, SWE-bench, debugging
Speech
5 tasks · ASR, TTS, speaker verification
Medical
4 tasks · Imaging, diagnosis, clinical NLP
Multimodal
5 tasks · Vision-language, VQA, text-to-image
Agentic AI
5 tasks · Autonomous agents, HCAST, time horizon
For model vendors & labs
Ship a model? Get it on the record.
Launching a new LLM, OCR model, or agent tool? We verify results independently against the original benchmark, link to your paper and code, track your numbers over time, and drive qualified buyers from the leaderboards, not marketing decks. No fee, no priority tiers, no pay-to-play.
- ✓ Independent verification: we re-run public benchmarks where we can
- ✓ Cross-linked to arXiv, GitHub, HuggingFace, and pricing pages
- ✓ JSON-indexed, cacheable, citable: machine-readable end to end
- ✓ Read by practitioners picking between models for real workloads
Why this exists
When Meta shut down Papers with Code in July 2025, the ML community lost its reference for what state-of-the-art looks like. 9,327 benchmarks, 79,817 papers, gone overnight.
CodeSOTA rebuilds that infrastructure — independently — and extends it with live market data most benchmark sites don't track. We verify results ourselves where possible, link every claim to its source, and publish everything as open JSON. No corporate owner that might pull the plug.
"Outstanding work. Just yesterday I was searching for good OCR comparisons and found only marketing BS. Good job!"
AI Consultant — Voice-AI at scale
"Super clean, slop-free UI, but most importantly the copy: very precise positioning and project overview."
Senior Architect
Open data
Every benchmark result, every OpenRouter snapshot, every weekly trend — available as JSON. No API key, no rate limits. Build dashboards, cite in papers, integrate into your routing layer.
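As a sketch of what a no-key integration could look like, here's a minimal TypeScript consumer; the endpoint URL and response shape are assumptions for illustration, not documented paths.

```typescript
// Hypothetical weekly snapshot from the open JSON feed. The URL and
// shape are placeholders; check the site for the real endpoints.
interface WeeklySnapshot {
  week: string;                             // e.g. "2025-06-02"
  tokensByVendor: Record<string, number>;   // vendor -> tokens routed
}

async function printVendorShare(vendor: string): Promise<void> {
  const res = await fetch("https://codesota.example/data/openrouter-weekly.json");
  if (!res.ok) throw new Error(`fetch failed: ${res.status}`);
  const weeks: WeeklySnapshot[] = await res.json();

  for (const w of weeks) {
    const total = Object.values(w.tokensByVendor).reduce((a, b) => a + b, 0);
    const share = (w.tokensByVendor[vendor] ?? 0) / total;
    console.log(`${w.week}: ${(share * 100).toFixed(1)}%`);
  }
}

printVendorShare("deepseek").catch(console.error);
```

Because it's plain JSON with no key or rate limit, the same pattern drops into a dashboard, a notebook, or a CI job unchanged.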
Stay current
New benchmarks, market shifts, and model comparisons — delivered when it matters.
No spam. Unsubscribe anytime.