CodeSOTA · Vision 2026

One endpoint per task.
Ready to use.

You shouldn't have to read a benchmark paper to transcribe a receipt. CodeSOTA is turning every AI task into a hosted endpoint with three tiers — SOTA, balanced, cheap — so you pick the trade-off and we run the rest.

1
Endpoint serving today (/v1/ocr)
8,948
Benchmark results in catalog
119
Tasks ranked, not yet served
167×
Cost delta on OCR (footnoted)

The manifesto

Intelligence
as a commodity.

Oil has grades. Electricity has tariffs. Shipping has class codes. Every mature market commoditizes by standardizing the contract, not the molecule. Intelligence is next.

                 Oil              Intelligence
Grade            Brent vs WTI     sota / balanced / cheap
Contract         barrel spec      POST /v1/<task>
Quality cert     assay report     CodeSOTA benchmark
Spot price       $/barrel         $/1K calls

OpenAI, Anthropic, Google — they're refineries. They output something extraordinary, but a refinery's output is only useful once the market around it standardizes how you buy, price, and substitute it.

CodeSOTA is building the standard behind the contract. Benchmarks are the assay. Task endpoints are the grade — one served today (/v1/ocr), the rest in flight. Everything else is implementation.

§ 01 — The thesis

The models already exist.
Nobody wants to pick.

Intelligence is moving from general and centralized to specific and everywhere. For any given task — OCR, transcription, translation, extraction — there is already an open-source model that comes within a few points of the frontier at a tiny fraction of the cost.

On OmniDocBench, the document-parsing benchmark CodeSOTA tracks, PaddleOCR-VL-1.5 scores 94.50 — higher than GPT-5.4 (85.80) and Gemini 2.5 Pro (84.20), at roughly 1/167th the price per 1,000 pages. The open model is measurably better and two orders of magnitude cheaper.

How the 167× is computed: PaddleOCR-VL-1.5's self-hosted cost of $0.09/1K pages is the amortized inference cost on a single A100 at typical utilization, not a retail price. GPT-5.4's $15/1K is OpenAI's published list price. So this is a COGS-vs-retail comparison, which flatters the delta. For a straight retail-vs-retail comparison, see the /ocr leaderboard, where every row shows the actual hosted price you can buy today.
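The multiple falls straight out of the two prices quoted; a quick arithmetic check, with the figures copied from the paragraph above:

```python
# Reproduce the footnoted cost delta from the two per-1K-page prices.
gpt_retail = 15.00    # GPT-5.4 published list price, $/1K pages
paddle_cogs = 0.09    # PaddleOCR-VL-1.5 amortized self-hosted cost, $/1K pages

delta = gpt_retail / paddle_cogs
print(f"~{delta:.0f}x")   # prints "~167x"
```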

The problem isn't that the models don't exist. The problem is that picking one means reading papers, reconciling benchmarks, renting GPUs, and stitching APIs. On OCR today, CodeSOTA picks for you. We're building toward the rest, in design partnership with early teams.

§ 02 — How a task endpoint works

One request. Three possible answers.
You choose the trade-off.

POST /v1/ocr
{ "file": "invoice.pdf", "tier": "sota" }

        ↓

CodeSOTA Router · benchmark-backed
OmniDocBench · Mar 2026
  1. GLM-OCR        94.62
  2. PaddleOCR-VL   94.50
  3. dots.ocr 3B    88.41
  4. GPT-5.4        85.80

        ↓

TIER 1 · SOTA       GLM-OCR        94.62 · $0.09/1K   self-host GPU
TIER 2 · BALANCED   PaddleOCR-VL   94.50 · $0.09/1K   shared pool
TIER 3 · CHEAP      dots.ocr 3B    88.41 · $0.04/1K   background jobs

Every routing decision is a read on a CodeSOTA benchmark table.
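The routing read can be sketched in a few lines. This is a minimal illustration, not CodeSOTA's actual router: the rows mirror the OmniDocBench figures above, and the field names are assumptions.

```python
# Minimal sketch of benchmark-backed routing: each tier maps to the
# highest-scoring model assigned to it in the benchmark table.
BENCHMARK_TABLE = [
    {"model": "GLM-OCR",          "score": 94.62, "usd_per_1k": 0.09, "tier": "sota"},
    {"model": "PaddleOCR-VL-1.5", "score": 94.50, "usd_per_1k": 0.09, "tier": "balanced"},
    {"model": "dots.ocr 3B",      "score": 88.41, "usd_per_1k": 0.04, "tier": "cheap"},
]

def route(tier: str) -> dict:
    """Return the top-scoring row for the requested tier."""
    rows = [r for r in BENCHMARK_TABLE if r["tier"] == tier]
    if not rows:
        raise ValueError(f"unknown tier: {tier!r}")
    return max(rows, key=lambda r: r["score"])

print(route("sota")["model"])   # prints "GLM-OCR"
```

When a new model tops the benchmark, only the table changes; the caller's request stays the same, which is the whole contract.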

§ 03 — Worked example

Watch it work on document OCR.

This is live data from codesota.com/ocr. The same table that powers the leaderboard also powers the router: whichever row is #1 this week is what the sota tier calls under the hood. When the ranking changes, the endpoint quietly follows.

OmniDocBench · /v1/ocr · Live

#   Model              License       Score   $/1K pages   Tier
1   GLM-OCR            open-source   94.62   $0.09        sota
2   PaddleOCR-VL-1.5   open-source   94.50   $0.09        balanced
3   dots.ocr 3B        open-source   88.41   $0.04        cheap
4   MonkeyOCR-pro      open-source   86.96   $0.03        cheap
5   GPT-5.4            closed API    85.80   $15.00
6   Gemini 2.5 Pro     closed API    84.20   $12.50
7   Mistral OCR 3      closed API    83.40   $1.00
Source: CodeSOTA OmniDocBench composite · Mar 2026 · see full leaderboard →
$0.09 vs $15.00 per 1,000 pages
Self-hosted PaddleOCR-VL: ~1/167th the COGS of GPT-5.4 retail. Retail-to-retail prices →

§ 04 — The menu

Balanced is the default.
Everything else is opt-in.

The whole point of a three-tier menu is that you almost never need the top one. If a 30B open-source checkpoint is within a couple points of the frontier for 1/20th the price, that's the tier you should be calling 99% of the time — and it's the tier /v1/<task> routes to unless you say otherwise.

Default
tier: "balanced"
95% quality.
20× cheaper.

An open-source model that sits within a few points of the frontier at a tiny fraction of the price. A 30–70B open checkpoint on commodity GPUs. This is what you call 99% of the time.

Typical use: production
Quality: ≥ 95% of SOTA
Cost: ~1/20×
Latency: fast

tier: "sota" · opt-in
The best — when you need it.

For the last few points of accuracy that actually matter: compliance runs, eval suites, audit trails. Most workloads don't need this. When they do, one flag flips.

Typical use: compliance
Quality: 100% of SOTA
Cost: 1× baseline
Latency: medium

tier: "cheap" · opt-in
For scale, edge, background.

The smallest model that still clears the quality bar. 3–8B open checkpoints, distilled variants, or classic CNNs where a VLM is overkill. When you're running a million calls, money wins.

Typical use: scale, edge
Quality: ≥ 85% of SOTA
Cost: ~1/200×
Latency: instant

§ 05 — Why CodeSOTA is the right place for this

We're not another router.
We're the benchmark layer, first.

01

Benchmarks are the fuel

Every other router on the market — OpenRouter, Martian, NotDiamond — is pure billing middleware. They guess quality from logs after the call. CodeSOTA starts with 164+ models across ~100 benchmarks. Routing decisions are a read on a table we already own.
02

Task-first, not model-first

Humans shop for inference by task: parse this PDF, transcribe this audio, classify this support ticket. A router indexed on tasks with a three-tier menu is how you buy intelligence — not by browsing 400 model cards.
03

Independence is the product

Rankings on codesota.com stay independent of our own endpoints. If a closed frontier API is genuinely #1 for a task, that's what wins the benchmark — and that's what the sota tier calls.
04

The flywheel compounds

Every call through a hosted task endpoint produces quality signal. Every new model we benchmark sharpens the router. Every new task we cover widens it. Benchmarks and endpoints are the same engine, viewed from two sides.

§ 06 — Built for agents, not just humans

The customer isn't always human.

Autonomous agents — Hermes, Claude, OpenCode, your own home-grown loop — need two different things from an inference layer: a brain to run their reasoning cycle, and tools to solve concrete tasks inside it. CodeSOTA covers both.

Need a brain

Which LLM runs the agent loop best?

Tool-use accuracy, instruction following, tokens/sec, and $/1M tokens are all tracked on CodeSOTA agent benchmarks. Pick the reasoning engine the same way you pick a task tier — by data, not vibes.

Need tools

Which specialized API solves the sub-task?

A multimodal LLM can read a PDF, but PaddleOCR-VL does it 167× cheaper and more accurately. A frontier model can transcribe audio, but Whisper is 50× cheaper. Agents should delegate modality-bound tasks to the right specialist — and the API surface has to make that easy.

Founder's note · why we're building this

“I've been picking the LLM that powers my Hermes agent for months now. The Pareto-optimal choice on quality-vs-cost changes every few weeks — a new release, a price drop, a benchmark update, a latency regression. I don't want to babysit that decision anymore. If I don't, why would anyone else?”

— Kacper Wikiel · CodeSOTA

/v1/ocr · tool schema (what your agent sees) · MCP / OpenAPI
// endpoint
POST /v1/ocr

// agent picks at call time
{
  "file": "invoice.pdf",
  "tier": "balanced",
  "max_cost_usd": 0.01,
  "timeout_s": 30
}
// capabilities the agent can reason about
{
  "task": "document-ocr",
  "tiers": {
    "sota":     { "quality": 0.946, "usd_per_1k": 0.09 },
    "balanced": { "quality": 0.945, "usd_per_1k": 0.09 },
    "cheap":    { "quality": 0.884, "usd_per_1k": 0.04 }
  },
  "benchmark": "omnidocbench",
  "updated": "2026-03-28"
}
Every endpoint ships with machine-readable quality + price metadata. Agents can pick a tier based on a budget constraint at tool-call time, not based on a README someone wrote last year.
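A budget-aware pick over that metadata might look like this. The tiers dict copies the capabilities blob above; the selection logic is a hypothetical sketch of agent-side reasoning, not a CodeSOTA SDK:

```python
# Hypothetical agent-side selection: highest-quality tier whose projected
# spend fits within a per-call budget. Field names mirror the schema above.
CAPS = {
    "sota":     {"quality": 0.946, "usd_per_1k": 0.09},
    "balanced": {"quality": 0.945, "usd_per_1k": 0.09},
    "cheap":    {"quality": 0.884, "usd_per_1k": 0.04},
}

def pick_tier(max_cost_usd: float, calls: int = 1) -> str:
    """Pick the best-quality tier affordable for the projected call volume."""
    affordable = {
        name: t for name, t in CAPS.items()
        if t["usd_per_1k"] * calls / 1000 <= max_cost_usd
    }
    if not affordable:
        raise ValueError("no tier fits the budget")
    return max(affordable, key=lambda name: affordable[name]["quality"])

print(pick_tier(max_cost_usd=0.005, calls=100))   # prints "cheap": sota would cost $0.009
```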

§ 07 — The task catalog

OCR is live. The rest are in flight.

Every row is one task, one stable API contract, three tier choices. Backed by a dedicated CodeSOTA benchmark.

Today · serving requests

LIVE · POST /v1/ocr · Document OCR → Markdown · hardparse.com →

Benchmark

OmniDocBench

Top model

GLM-OCR

Score

94.62

CodeSOTA leaderboard

/ocr

Roadmap · open to design partners

Request priority on a roadmap endpoint →
soon  /v1/tts
soon  /v1/stt
plan  /v1/translate
plan  /v1/summarize
plan  /v1/embed
plan  /v1/extract
plan  /v1/classify
plan  /v1/detect
plan  /v1/code
No SOTA scores are listed for roadmap endpoints; the CodeSOTA leaderboards behind each link carry the live data. The contracts above are the API surfaces we're building toward, not products you can call today.

§ 08 — Commitments

What we won't pretend.

Opinionated means we pick for you.

If you want to A/B twelve models yourself, we're the wrong product. CodeSOTA endpoints are curated: we read the benchmarks, make the call, and keep the contract stable. If a better model ships next week, we swap it behind the endpoint — you don't change your code.

SOTA moves. Endpoints follow — unless you say otherwise.

By default, tier: "sota" is a contract for quality — when a better model tops the leaderboard, the endpoint quietly follows. That's the point of routing the choice away from you.

Pinning is a first-class option.

Curation isn't coercion. If you need a reproducible, audit-friendly contract — for compliance, for regression tests, for a regulated workflow — you can pin an exact model and version and we'll serve it unchanged for as long as you need it. No silent swaps, no surprise upgrades.
POST /v1/ocr
{
  "file": "invoice.pdf",
  "model": "glm-ocr@2026-03-01",
  "pin":   true
}
Use tier when you want the best answer today. Use model + pin when you need the same answer a year from now.

Independence is non-negotiable.

Rankings on CodeSOTA stay independent of our own endpoints. If a closed API from a big lab is genuinely the SOTA for a task, that's what wins the benchmark — and that's what the sota tier calls under the hood.

§ 09 — Traction

And the curve has already turned.

The benchmark side of CodeSOTA started pulling traffic in late 2025, right as the OCR leaderboard and task pages came online. This is the surface the task router will sit on top of — and it's already compounding.

Visitors

18,899

Page views

43,310

MoM growth

+71%

[Chart: monthly visitors, Apr ’25 to Apr ’26, 0–6K scale; annotated “OCR pages live” at the inflection]
12-month trailing · Vercel Analytics · codesota.com

The benchmark pages are the top of the funnel. The task endpoints are the conversion. The vision above isn't a bet — it's a roadmap on a curve that's already moving.

Build with us

Pick a task. Get the best model.
Pay the right price.

We're shipping the task catalog endpoint by endpoint. If a task we haven't covered yet is blocking you, come talk to us.