Table extraction is structure recognition, not transcription. The job is to recover the grid — cell boundaries, row and column membership, merged and spanning cells, header association — and serialise it to HTML, CSV or Markdown. OCR gives you the characters; this task gives you the table.
OCR reads in a line. A table is two-dimensional. Read left-to-right, the cells "1,200", "3" and "400" become the meaningless run "1,2003400". Which is the unit price, which is the quantity, which is the total? Recovering that 2-D membership, and surviving merged cells, spanning headers and borderless layouts, is what makes table extraction a distinct problem with its own models and its own metrics.
TEDS (Tree-Edit-Distance Similarity). Represents the table as an HTML tree and measures the edit distance between prediction and ground truth, scoring both structure and cell text. Higher is better.
TEDS-Struct (shown as TEDS-S in the tables). The same tree distance with cell contents ignored, so it scores pure structure. The gap between TEDS and TEDS-Struct tells you whether errors come from layout or from text.
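For reference, the score is usually written as below. This is the formulation introduced alongside PubTabNet; the symbols are notation chosen here, with $T_a$ and $T_b$ the predicted and ground-truth HTML trees and $|T|$ the number of nodes in a tree.

$$
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max(|T_a|, |T_b|)}
$$

TEDS-Struct applies the same formula after cell text is stripped from both trees. The tables on this page report the value multiplied by 100.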
Cell-level precision and recall. Computed on detected cells and their row/column indices. GriTS generalises this to a grid-similarity score that is robust to spanning cells.
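A minimal sketch of cell-level precision and recall, assuming each cell is reduced to a `(row, col, text)` triple and that a predicted cell counts as correct only on an exact match. The `Cell` type and `cell_prf` function are hypothetical names for illustration; this is not GriTS, which handles spans and partial matches.

```python
from typing import Iterable, NamedTuple, Tuple


class Cell(NamedTuple):
    row: int
    col: int
    text: str


def cell_prf(pred: Iterable[Cell], truth: Iterable[Cell]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over (row, col, text) triples."""
    pred_set, truth_set = set(pred), set(truth)
    tp = len(pred_set & truth_set)  # cells with the right text in the right slot
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(truth_set) if truth_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


truth_cells = [Cell(0, 0, "Item"), Cell(0, 1, "Qty"),
               Cell(1, 0, "Widget"), Cell(1, 1, "3")]
# The prediction reads every string correctly but drops "3" into a new row.
pred_cells = [Cell(0, 0, "Item"), Cell(0, 1, "Qty"),
              Cell(1, 0, "Widget"), Cell(2, 0, "3")]

precision, recall, f1 = cell_prf(pred_cells, truth_cells)
```

Here three of the four predicted cells land in the right slot, so precision and recall are both 0.75 even though every character was read correctly.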
None of CER / WER / edit distance — the reading metrics — capture whether the grid is right. That is the whole point.
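A small illustration of that point, under simple assumptions: two grids hold the same strings in the same reading order, so a character error rate on the flattened text is zero, yet most cells sit in the wrong slot. The helper names (`flatten`, `levenshtein`, `positioned`) are made up for this sketch.

```python
from typing import List, Set, Tuple

Grid = List[List[str]]


def flatten(grid: Grid) -> str:
    """Reading-order text, which is all a plain OCR metric sees."""
    return " ".join(cell for row in grid for cell in row)


def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def positioned(grid: Grid) -> Set[Tuple[int, int, str]]:
    """Cells tagged with their (row, col) position."""
    return {(r, c, text) for r, row in enumerate(grid) for c, text in enumerate(row)}


truth = [["Item", "Qty", "Price"],
         ["Widget", "3", "12.00"]]
# Same strings, same reading order, but wrapped into two columns.
pred = [["Item", "Qty"],
        ["Price", "Widget"],
        ["3", "12.00"]]

cer = levenshtein(flatten(pred), flatten(truth)) / len(flatten(truth))
matched = len(positioned(pred) & positioned(truth))
total = len(positioned(truth))
```

With these inputs `cer` is 0.0 while only 2 of the 6 ground-truth cells are matched by position, which is the failure a grid-aware metric is meant to expose.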
On clean, pre-cropped tables (PubTabNet with 568k tables and FinTabNet with 113k financial tables), specialist models reach about 97 TEDS. But that assumes someone already found the table and cut it out. Run the same class of models on a full document page, where the table must first be located and may be borderless or rotated, and scores fall to the high 80s or low 90s. The benchmark you cite should match the regime you ship in.
| Model | TEDS-S | TEDS |
|---|---|---|
| TFLOP | 98.38 | 96.66 |
| UniTable | 97.89 | 96.50 |
| TABLET | 97.67 | 96.79 |
| SEMv3 | 97.50 | 97.30 |
| TableFormer | n/a | 96.80 |

| Model | TEDS | Complex |
|---|---|---|
| PaddleOCR-VL | 93.52 | 91.2 |
| MinerU 2.5 | 91.9 | 89.8 |
| GPT-5.4 | 90.1 | 87.5 |
| Claude Sonnet 4 | 89.5 | 86.9 |
| dots.ocr | 88.9 | 86.8 |
First table: PubTabNet val (TFLOP arXiv:2501.11800; UniTable arXiv:2403.04822; TableFormer). FinTabNet's SOTA (VAST) sits at ≈ 97.1 TEDS. Second table: CodeSOTA document-context table benchmark. “Complex” = spanning/borderless tables.
Full methodology and the complete vendor list are on the table-extraction benchmark page.
```python
import base64

import replicate


def extract_table(png_bytes: bytes) -> str:
    """Ask a document VLM for HTML, not prose. HTML carries the structure."""
    b64 = base64.b64encode(png_bytes).decode()
    out = replicate.run(
        "rednote-hilab/dots.ocr",
        input={
            "image": f"data:image/png;base64,{b64}",
            "prompt": "Extract every table as HTML. Preserve rowspan and "
                      "colspan exactly. Do not flatten merged cells.",
            "temperature": 0.0,
        },
    )
    return "".join(out)
```

```python
from io import StringIO

import pandas as pd


def html_table_to_dataframe(html: str) -> pd.DataFrame:
    """A table model returns HTML with the grid intact, spans and all.
    pandas reconstructs rows/cols; OCR text alone could not."""
    # read_html resolves rowspan / colspan into a rectangular frame
    return pd.read_html(StringIO(html))[0]


# vs. what plain OCR gives you for the same table:
# "Item Qty Price Widget 3 12.00 Gadget 1 4.50 Total 16.50"
# ...no idea which number belongs to which row or column.
```
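To make the rowspan/colspan step concrete, here is a standard-library sketch of how spans can be expanded into a rectangular grid by repeating a spanned cell's text in every slot it covers. It illustrates the idea only and is not how pandas implements it; `expand_spans` and `_CellCollector` are hypothetical names, and the sketch assumes a single, well-formed table.

```python
from html.parser import HTMLParser
from typing import Dict, List, Tuple


class _CellCollector(HTMLParser):
    """Collects rows of (text, rowspan, colspan) from <tr>/<td>/<th> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[Tuple[str, int, int]]] = []
        self._cell = None  # [text parts, rowspan, colspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = [[], int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, rowspan, colspan = self._cell
            self.rows[-1].append(("".join(parts).strip(), rowspan, colspan))
            self._cell = None


def expand_spans(html: str) -> List[List[str]]:
    """Return a rectangular grid with spanned text repeated in each covered slot."""
    parser = _CellCollector()
    parser.feed(html)
    grid: Dict[Tuple[int, int], str] = {}
    for r, row in enumerate(parser.rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:  # slot already taken by a rowspan from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


sample = """
<table>
  <tr><th rowspan="2">Item</th><th colspan="2">Cost</th></tr>
  <tr><th>Qty</th><th>Price</th></tr>
  <tr><td>Widget</td><td>3</td><td>12.00</td></tr>
</table>
"""
grid = expand_spans(sample)
```

For the sample above, the two header rows expand to `["Item", "Cost", "Cost"]` and `["Item", "Qty", "Price"]`, so each data cell in the third row can be tied to a full header path.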