CodeSOTA · Vision 2026

One endpoint per task.
Ready to use.

You shouldn't have to read a benchmark paper to transcribe a receipt. CodeSOTA is turning every AI task into a hosted endpoint with three tiers — SOTA, balanced, cheap — so you pick the trade-off and we run the rest.

1
Endpoint serving today (/v1/ocr)
8,948
Benchmark results in catalog
119
Tasks ranked, not yet served
167×
Cost delta on OCR (footnoted)

The manifesto

Intelligence
as a commodity.

Oil has grades. Electricity has tariffs. Shipping has class codes. Every mature market commoditizes by standardizing the contract, not the molecule. Intelligence is next.

                 Oil              Intelligence
Grade            Brent vs WTI     sota / balanced / cheap
Contract         barrel spec      POST /v1/<task>
Quality cert     assay report     CodeSOTA benchmark
Spot price       $/barrel         $/1K calls

OpenAI, Anthropic, Google — they're refineries. They output something extraordinary, but a refinery's output is only useful once the market around it standardizes how you buy, price, and substitute it.

CodeSOTA is building the standard behind the contract. Benchmarks are the assay. Task endpoints are the grade — one served today (/v1/ocr), the rest in flight. Everything else is implementation.

§ 01 — The thesis

The models already exist.
Nobody wants to pick.

Intelligence is moving from general and centralized to specific and everywhere. For any given task — OCR, transcription, translation, extraction — there is already an open-source model that comes within a few points of the frontier at a tiny fraction of the cost.

On OmniDocBench, the document-parsing benchmark CodeSOTA tracks, PaddleOCR-VL-1.5 scores 94.50 — higher than GPT-5.4 (85.80) and Gemini 2.5 Pro (84.20), at roughly 1/167th the price per 1,000 pages. The open model is measurably better and two orders of magnitude cheaper.

How the 167× is computed: PaddleOCR-VL-1.5's self-hosted cost of $0.09/1K pages is the amortized inference cost on a single A100 at typical utilization, not a retail price. GPT-5.4's $15/1K is OpenAI's published list price. So this is a COGS-vs-retail comparison, which flatters the delta. For a straight retail-vs-retail comparison, see the /ocr leaderboard, where every row shows the actual hosted price you can buy today.
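The multiple falls straight out of the two prices quoted; a quick arithmetic check, with the figures copied from the paragraph above:

```python
# Reproduce the footnoted cost delta from the two per-1K-page prices.
gpt_retail = 15.00    # GPT-5.4 published list price, $/1K pages
paddle_cogs = 0.09    # PaddleOCR-VL-1.5 amortized self-hosted cost, $/1K pages

delta = gpt_retail / paddle_cogs
print(f"~{delta:.0f}x")   # prints "~167x"
```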

The problem isn't that the models don't exist. The problem is that picking one means reading papers, reconciling benchmarks, renting GPUs, and stitching APIs. On OCR today, CodeSOTA picks for you. We're building toward the rest, in design partnership with early teams.

§ 02 — How a task endpoint works

One request. Three possible answers.
You choose the trade-off.

POST /v1/ocr
{ "file": "invoice.pdf", "tier": "sota" }

        ↓

CodeSOTA Router · benchmark-backed
OmniDocBench · Mar 2026
  1. GLM-OCR        94.62
  2. PaddleOCR-VL   94.50
  3. dots.ocr 3B    88.41
  4. GPT-5.4        85.80

        ↓

TIER 1 · SOTA       GLM-OCR        94.62 · $0.09/1K   self-host GPU
TIER 2 · BALANCED   PaddleOCR-VL   94.50 · $0.09/1K   shared pool
TIER 3 · CHEAP      dots.ocr 3B    88.41 · $0.04/1K   background jobs

Every routing decision is a read on a CodeSOTA benchmark table.
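The routing read can be sketched in a few lines. This is a minimal illustration, not CodeSOTA's actual router: the rows mirror the OmniDocBench figures above, and the field names are assumptions.

```python
# Minimal sketch of benchmark-backed routing: each tier maps to the
# highest-scoring model assigned to it in the benchmark table.
BENCHMARK_TABLE = [
    {"model": "GLM-OCR",          "score": 94.62, "usd_per_1k": 0.09, "tier": "sota"},
    {"model": "PaddleOCR-VL-1.5", "score": 94.50, "usd_per_1k": 0.09, "tier": "balanced"},
    {"model": "dots.ocr 3B",      "score": 88.41, "usd_per_1k": 0.04, "tier": "cheap"},
]

def route(tier: str) -> dict:
    """Return the top-scoring row for the requested tier."""
    rows = [r for r in BENCHMARK_TABLE if r["tier"] == tier]
    if not rows:
        raise ValueError(f"unknown tier: {tier!r}")
    return max(rows, key=lambda r: r["score"])

print(route("sota")["model"])   # prints "GLM-OCR"
```

When a new model tops the benchmark, only the table changes; the caller's request stays the same, which is the whole contract.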

§ 03 — Worked example

Watch it work on document OCR.

This is live data from codesota.com/ocr. The same table that powers the leaderboard also powers the router: whichever row is #1 this week is what the sota tier calls under the hood. When the ranking changes, the endpoint quietly follows.

OmniDocBench · /v1/ocr · Live

#   Model              License       Score   $/1K pages   Tier
1   GLM-OCR            open-source   94.62   $0.09        sota
2   PaddleOCR-VL-1.5   open-source   94.50   $0.09        balanced
3   dots.ocr 3B        open-source   88.41   $0.04        cheap
4   MonkeyOCR-pro      open-source   86.96   $0.03        cheap
5   GPT-5.4            closed API    85.80   $15.00
6   Gemini 2.5 Pro     closed API    84.20   $12.50
7   Mistral OCR 3      closed API    83.40   $1.00
Source: CodeSOTA OmniDocBench composite · Mar 2026 · see full leaderboard →
$0.09 vs $15.00 per 1,000 pages
Self-hosted PaddleOCR-VL: ~1/167th the COGS of GPT-5.4 retail. Retail-to-retail prices →

§ 04 — The menu

Balanced is the default.
Everything else is opt-in.

The whole point of a three-tier menu is that you almost never need the top one. If a 30B open-source checkpoint is within a couple points of the frontier for 1/20th the price, that's the tier you should be calling 99% of the time — and it's the tier /v1/<task> routes to unless you say otherwise.

Default
tier: "balanced"
95% quality.
20× cheaper.

An open-source model that sits within a few points of the frontier at a tiny fraction of the price. A 30–70B open checkpoint on commodity GPUs. This is what you call 99% of the time.

Typical use: production
Quality: ≥ 95% of SOTA
Cost: ~1/20×
Latency: fast

tier: "sota" · opt-in
The best — when you need it.

For the last few points of accuracy that actually matter: compliance runs, eval suites, audit trails. Most workloads don't need this. When they do, one flag flips.

Typical use: compliance
Quality: 100% of SOTA
Cost: 1× baseline
Latency: medium

tier: "cheap" · opt-in
For scale, edge, background.

The smallest model that still clears the quality bar. 3–8B open checkpoints, distilled variants, or classic CNNs where a VLM is overkill. When you're running a million calls, money wins.

Typical use: scale, edge
Quality: ≥ 85% of SOTA
Cost: ~1/200×
Latency: instant

§ 05 — Why CodeSOTA is the right place for this

We're not another router.
We're the benchmark layer, first.

01

Benchmarks are the fuel

Every other router on the market — OpenRouter, Martian, NotDiamond — is pure billing middleware. They guess quality from logs after the call. CodeSOTA starts with 164+ models across ~100 benchmarks. Routing decisions are a read on a table we already own.
02

Task-first, not model-first

Humans shop for inference by task: parse this PDF, transcribe this audio, classify this support ticket. A router indexed on tasks with a three-tier menu is how you buy intelligence — not by browsing 400 model cards.
03

Independence is the product

Rankings on codesota.com stay independent of our own endpoints. If a closed frontier API is genuinely #1 for a task, that's what wins the benchmark — and that's what the sota tier calls.
04

The flywheel compounds

Every call through a hosted task endpoint produces quality signal. Every new model we benchmark sharpens the router. Every new task we cover widens it. Benchmarks and endpoints are the same engine, viewed from two sides.

§ 06 — Built for agents, not just humans

The customer isn't always human.

Autonomous agents — Hermes, Claude, OpenCode, your own home-grown loop — need two different things from an inference layer: a brain to run their reasoning cycle, and tools to solve concrete tasks inside it. CodeSOTA covers both.

Need a brain

Which LLM runs the agent loop best?

Tool-use accuracy, instruction following, tokens/sec, and $/1M tokens are all tracked on CodeSOTA agent benchmarks. Pick the reasoning engine the same way you pick a task tier — by data, not vibes.

Need tools

Which specialized API solves the sub-task?

A multimodal LLM can read a PDF, but PaddleOCR-VL does it 167× cheaper and more accurately. A frontier model can transcribe audio, but Whisper is 50× cheaper. Agents should delegate modality-bound tasks to the right specialist — and the API surface has to make that easy.

Founder's note · why we're building this

“I've been picking the LLM that powers my Hermes agent for months now. The Pareto-optimal choice on quality-vs-cost changes every few weeks — a new release, a price drop, a benchmark update, a latency regression. I don't want to babysit that decision anymore. If I don't, why would anyone else?”

— Kacper Wikiel · CodeSOTA

/v1/ocr · tool schema (what your agent sees) · MCP / OpenAPI
// endpoint
POST /v1/ocr

// agent picks at call time
{
  "file": "invoice.pdf",
  "tier": "balanced",
  "max_cost_usd": 0.01,
  "timeout_s": 30
}
// capabilities the agent can reason about
{
  "task": "document-ocr",
  "tiers": {
    "sota":     { "quality": 0.946, "usd_per_1k": 0.09 },
    "balanced": { "quality": 0.945, "usd_per_1k": 0.09 },
    "cheap":    { "quality": 0.884, "usd_per_1k": 0.04 }
  },
  "benchmark": "omnidocbench",
  "updated": "2026-03-28"
}
Every endpoint ships with machine-readable quality + price metadata. Agents can pick a tier based on a budget constraint at tool-call time, not based on a README someone wrote last year.
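A budget-aware pick over that metadata might look like this. The tiers dict copies the capabilities blob above; the selection logic is a hypothetical sketch of agent-side reasoning, not a CodeSOTA SDK:

```python
# Hypothetical agent-side selection: highest-quality tier whose projected
# spend fits within a per-call budget. Field names mirror the schema above.
CAPS = {
    "sota":     {"quality": 0.946, "usd_per_1k": 0.09},
    "balanced": {"quality": 0.945, "usd_per_1k": 0.09},
    "cheap":    {"quality": 0.884, "usd_per_1k": 0.04},
}

def pick_tier(max_cost_usd: float, calls: int = 1) -> str:
    """Pick the best-quality tier affordable for the projected call volume."""
    affordable = {
        name: t for name, t in CAPS.items()
        if t["usd_per_1k"] * calls / 1000 <= max_cost_usd
    }
    if not affordable:
        raise ValueError("no tier fits the budget")
    return max(affordable, key=lambda name: affordable[name]["quality"])

print(pick_tier(max_cost_usd=0.005, calls=100))   # prints "cheap": sota would cost $0.009
```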

§ 07 — The task catalog

OCR is live. The rest are in flight.

Every row is one task, one stable API contract, three tier choices. Backed by a dedicated CodeSOTA benchmark.

Today · serving requests

LIVE · POST /v1/ocr · Document OCR → Markdown · hardparse.com →

Benchmark

OmniDocBench

Top model

GLM-OCR

Score

94.62

CodeSOTA leaderboard

/ocr

Roadmap · open to design partners

Request priority on a roadmap endpoint →
soon  /v1/tts
soon  /v1/stt
plan  /v1/translate
plan  /v1/summarize
plan  /v1/embed
plan  /v1/extract
plan  /v1/classify
plan  /v1/detect
plan  /v1/code
No SOTA scores are listed for roadmap endpoints; the CodeSOTA leaderboards behind each link carry the live data. The contracts above are the API surfaces we're building toward, not products you can call today.

§ 08 — Commitments

What we won't pretend.

Opinionated means we pick for you.

If you want to A/B twelve models yourself, we're the wrong product. CodeSOTA endpoints are curated: we read the benchmarks, make the call, and keep the contract stable. If a better model ships next week, we swap it behind the endpoint — you don't change your code.

SOTA moves. Endpoints follow — unless you say otherwise.

By default, tier: "sota" is a contract for quality — when a better model tops the leaderboard, the endpoint quietly follows. That's the point of routing the choice away from you.

Pinning is a first-class option.

Curation isn't coercion. If you need a reproducible, audit-friendly contract — for compliance, for regression tests, for a regulated workflow — you can pin an exact model and version and we'll serve it unchanged for as long as you need it. No silent swaps, no surprise upgrades.
POST /v1/ocr
{
  "file": "invoice.pdf",
  "model": "glm-ocr@2026-03-01",
  "pin":   true
}
Use tier when you want the best answer today. Use model + pin when you need the same answer a year from now.

Independence is non-negotiable.

Rankings on CodeSOTA stay independent of our own endpoints. If a closed API from a big lab is genuinely the SOTA for a task, that's what wins the benchmark — and that's what the sota tier calls under the hood.

§ 09 — Traction

And the curve has already turned.

The benchmark side of CodeSOTA started pulling traffic in late 2025, right as the OCR leaderboard and task pages came online. This is the surface the task router will sit on top of — and it's already compounding.

Visitors

18,899

Page views

43,310

MoM growth

+71%

[Chart: monthly visitors, Apr ’25 to Apr ’26, 0–6K scale; annotated “OCR pages live” at the inflection]
12-month trailing · Vercel Analytics · codesota.com

The benchmark pages are the top of the funnel. The task endpoints are the conversion. The vision above isn't a bet — it's a roadmap on a curve that's already moving.

Build with us

Pick a task. Get the best model.
Pay the right price.

We're shipping the task catalog endpoint by endpoint. If a task we haven't covered yet is blocking you, come talk to us.