Codesota · OCR · Task: Read
Task Brief · image / PDF → text

Reading a document is the simplest task to describe and the easiest to get wrong.

“Turn this page into text” hides three different jobs. A born-digital PDF already contains perfect text — running OCR on it adds errors. A scan contains no text at all. And most real files are a mix of both. The first decision is not which model; it is whether to OCR at all.

[Figure: PDF → born-digital page → parse text layer; scanned image → OCR model; both branches → text / Markdown]
The reading pipeline branches on the page, not the file. Route per page, then merge.

The distinction that costs people money

A PDF is not an image. Roughly two-thirds of PDFs in the wild are born-digital — they carry an embedded text layer that is character-perfect. The correct tool there is a parser (PyMuPDF, pdfplumber, pdfminer), not an OCR model. OCR-ing a digital PDF throws away ground-truth text and replaces it with a best-guess transcription. OCR is for the pages that have no text layer.

§ 01 · Route First

Three jobs hiding inside one request.

Born-digital PDF (has a text layer)
Parse, don't OCR. Extraction is lossless, instant, and free. Use OCR only if you need the page as it looks (e.g. to recover reading order the producer mangled).
Scanned page or photo (no text layer)
This is the actual OCR task. A document VLM (dots.ocr, PaddleOCR-VL, GLM-OCR) or a vendor read API turns pixels into text or Markdown. Quality depends on resolution — 200 DPI is the practical floor.
Mixed file (some pages digital, some scanned)
Route per page. Detect a text layer page by page, parse where present, OCR where absent. A whole-file decision either wastes money or silently corrupts the clean pages.

Rule of thumb: never feed a whole PDF to an OCR API without checking for a text layer first.
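
The routing decision itself can be sketched without any PDF library, which makes it easy to test and to estimate OCR volume before calling an API. This is a minimal sketch, not a definitive implementation: `route_pages` and the sample pages are hypothetical, and the 12-character threshold simply mirrors the detection code later in this brief.

```python
from collections import Counter

def route_pages(page_texts: list[str], min_chars: int = 12) -> list[str]:
    """Return "parse" or "ocr" for each page, given its extracted text layer."""
    return [
        "parse" if len(text.strip()) >= min_chars else "ocr"
        for text in page_texts
    ]

# Hypothetical four-page file: two born-digital pages, two scans.
pages = [
    "Invoice 2024-001\nTotal due: 1,250.00 EUR",
    "",        # scan: no text layer at all
    "   \n",   # scan: whitespace only
    "Terms and conditions apply to all orders.",
]

plan = route_pages(pages)
summary = Counter(plan)
print(plan)
print(f"OCR needed on {summary['ocr']} of {len(pages)} pages")
```

Only the pages marked "ocr" need to be rasterised and sent to a model; the rest keep their original text.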

§ 02 · Metrics

What the reading benchmarks actually score.

Character / Word Error Rate

CER / WER are edit distance over characters or words, normalised by reference length. 1% CER ≈ one wrong character per hundred. Lower is better. They reward raw transcription accuracy but say nothing about structure or order.
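
To make the definition concrete, here is a small sketch of CER and WER built on a plain Levenshtein distance. The function names and the sample strings are illustrative assumptions; benchmark implementations may normalise whitespace, case, or punctuation differently.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion from the reference
                curr[j - 1] + 1,          # insertion into the reference
                prev[j - 1] + (r != h),   # substitution, or free match
            ))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

# Illustrative strings with two rn→m confusions.
reference = "the modern burn rate"
hypothesis = "the modem bum rate"
print(f"CER = {cer(reference, hypothesis):.2f}")
print(f"WER = {wer(reference, hypothesis):.2f}")
```

The same four character edits cost 20% CER but 50% WER here, because one wrong glyph is enough to spoil a whole word.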

Normalised edit distance

OmniDocBench reports a text edit distance over the whole page (lower is better), so reading-order mistakes are penalised, not just glyph errors. OCRBench uses a 0–1000 composite over short tasks; OCRBench v2 adds private EN/ZH splits to resist contamination.

A model can win OCRBench (clean crops) and lose OmniDocBench (full pages) if its reading order is weak. Test on your own layout.
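
The effect of reading order on a page-level score can be shown with a toy two-column page. This sketch assumes the distance is normalised by the longer string's length; the benchmark's exact normalisation and text cleanup may differ, and the sample text is invented for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def normalised_edit_distance(reference: str, hypothesis: str) -> float:
    """Edit distance over the whole page, scaled to the range 0..1."""
    longest = max(len(reference), len(hypothesis))
    return levenshtein(reference, hypothesis) / longest if longest else 0.0

# Correct order: left column first, then right column.
reference = "Left one.\nLeft two.\nRight one.\nRight two."
# Every glyph is right, but the columns were read row by row.
interleaved = "Left one.\nRight one.\nLeft two.\nRight two."

page_score = normalised_edit_distance(reference, interleaved)
same_lines = sorted(reference.splitlines()) == sorted(interleaved.splitlines())
print(f"page-level distance: {page_score:.2f}")
print(f"identical set of lines: {same_lines}")
```

A line-by-line comparison that ignores order would call this output perfect, while the page-level distance is clearly non-zero.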

§ 03 · Where The Field Sits

Full-page reading, OmniDocBench.

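| Rank | Model | Type | Composite | Text edit dist. ↓ |
|------|-------|------|-----------|-------------------|
| 1 | GLM-OCR | Expert VLM | 94.62 | |
| 2 | Qianfan-OCR | Expert VLM | 93.12 | |
| 3 | PaddleOCR-VL | Expert VLM (0.9B) | 92.56 | |
| · | dots.ocr | Expert VLM (1.7B) | | 0.125 |
| · | DeepSeek-OCR | Expert VLM | | 0.123 |
| · | PP-StructureV3 | Pipeline | | 0.145 |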

Source: OmniDocBench (live registry). Composite is higher-is-better; text edit distance is lower-is-better. On OCRBench's 0–1000 scale, frontier VLMs (Qwen2.5-VL 72B ≈ 885) now cluster near the top.

The headline: open-weight expert OCR models now lead the closed VLMs on full-page reading. For the closed-API view and the full live tables, see the open-weight leaderboard and the OmniDocBench page.

§ 04 · Implementation

Route per page, then OCR only the scans.

1 · Detect the text layer (PyMuPDF)

import fitz  # PyMuPDF

def has_text_layer(page, min_chars: int = 12) -> bool:
    """A born-digital page returns real characters; a scan returns ~nothing."""
    return len(page.get_text("text").strip()) >= min_chars

def read_pdf(path: str) -> list[str]:
    """Parse where there is a text layer, OCR only where there isn't."""
    out = []
    with fitz.open(path) as doc:  # closes the file even if OCR raises
        for page in doc:
            if has_text_layer(page):
                # Perfect, lossless text: never OCR this.
                out.append(page.get_text("text"))
            else:
                # Scanned page: rasterise at 200 DPI and hand to ocr_image (step 2).
                pix = page.get_pixmap(dpi=200)
                out.append(ocr_image(pix.tobytes("png")))
    return out

2 · OCR the scanned pages to Markdown

import base64

import replicate

def ocr_image(png_bytes: bytes) -> str:
    """Scanned page → Markdown. Reading order + structure preserved."""
    b64 = base64.b64encode(png_bytes).decode()
    output = replicate.run(
        "rednote-hilab/dots.ocr",
        input={
            "image": f"data:image/png;base64,{b64}",
            "prompt": "Convert this page to Markdown. Preserve reading order, "
                      "headings, lists and tables. Do not invent text.",
            "temperature": 0.0,
        },
    )
    return "".join(output)
§ 05 · Failure Modes

Where reading still breaks.

Layout-driven errors

  • Reading order. Multi-column pages get interleaved into nonsense unless the model recovers column order first — a layout problem upstream of reading.
  • Headers, footers, marginalia. Running heads and page numbers get spliced into body text.
  • Tables in prose. A table read left-to-right as a sentence destroys the data — see the tables task.
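
For the running-head problem above, one common post-processing heuristic is to drop first and last lines that repeat across most pages. The sketch below is an assumption-laden illustration, not a method from any benchmark: `strip_running_lines`, the 0.6 share threshold, and the sample pages are all hypothetical, and digits are collapsed so that changing page numbers still match.

```python
import re
from collections import Counter

def _normalise(line: str) -> str:
    """Collapse digit runs so 'Page 3' and 'Page 4' count as the same line."""
    return re.sub(r"\d+", "#", line.strip())

def strip_running_lines(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Drop first/last lines that recur on at least min_share of pages."""
    split_pages = [[ln for ln in p.splitlines() if ln.strip()] for p in pages]

    # Count how many pages start or end with each normalised line.
    edge_counts = Counter()
    for lines in split_pages:
        if lines:
            edge_counts.update({_normalise(lines[0]), _normalise(lines[-1])})

    threshold = min_share * len(pages)
    repeated = {key for key, n in edge_counts.items() if n >= threshold}

    cleaned = []
    for lines in split_pages:
        if lines and _normalise(lines[0]) in repeated:
            lines = lines[1:]
        if lines and _normalise(lines[-1]) in repeated:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned

# Hypothetical OCR output for three pages of one report.
pages = [
    "Annual Report 2023\nRevenue grew in every region.\nPage 1",
    "Annual Report 2023\nCosts fell by a similar margin.\nPage 2",
    "Annual Report 2023\nOutlook remains stable.\nPage 3",
]
body = strip_running_lines(pages)
print(body)
```

The heuristic only inspects page edges, so a body sentence that happens to repeat mid-page is left alone. Short documents give it too little evidence, which is why the share threshold is a tunable assumption.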

Transcription errors

  • VLM hallucination. Generative models can “complete” faded text plausibly but wrongly. Use temperature 0 and an explicit “do not invent” instruction.
  • Low resolution. Below ~150 DPI, glyph confusions spike (rn→m, cl→d, 0→O).
  • Diacritics & scripts. Polish, Vietnamese, Arabic and CJK still trail Latin-script English; check the ZH split, not just EN.
  • Rotation & skew. Phone photos at an angle need deskewing before OCR.
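
The low-resolution failure can be caught before any OCR call by computing the effective DPI of a scan from its pixel width and the page width in points (72 points per inch). This is a rough sketch: `effective_dpi` and `resolution_verdict` are hypothetical helpers, and the cut-offs simply reuse the ~150 and 200 DPI figures quoted in this brief.

```python
def effective_dpi(pixel_width: int, page_width_pt: float) -> float:
    """Pixels across the page divided by page width in inches (72 pt = 1 in)."""
    if page_width_pt <= 0:
        raise ValueError("page width must be positive")
    return pixel_width / (page_width_pt / 72.0)

def resolution_verdict(dpi: float) -> str:
    """Map a DPI value onto the thresholds quoted in the text."""
    if dpi < 150:
        return "rescan"    # glyph confusions become likely
    if dpi < 200:
        return "marginal"  # below the practical floor
    return "ok"

# US Letter is 612 pt (8.5 in) wide; the pixel widths are example values.
LETTER_WIDTH_PT = 612.0
for pixels in (850, 1275, 1700):
    dpi = effective_dpi(pixels, LETTER_WIDTH_PT)
    print(f"{pixels} px wide -> {dpi:.0f} DPI -> {resolution_verdict(dpi)}")
```

A check like this is cheap enough to run on every scanned page and lets a pipeline flag pages for rescanning instead of trusting a low-quality transcription.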
§ 06 · References

Sources.

§ 07 · Adjacent Tasks

Reading rarely stands alone.

Task · Layout
page → regions
Reading order comes from layout. Get it wrong and the text is right but scrambled.
Task · Tables
table image → cells
The part of a page that reading-as-text destroys. Needs structure recovery.