Research tool · OCR coverage matrix

OCR Benchmark Priority Matrix.

Which OCR benchmarks have the best model coverage for fair comparison? Data-driven analysis of 242 results across 10 OCR benchmarks and 42 models.

§ 01 · The fair comparison problem

Different models, different benchmarks.

dots.ocr reports OmniDocBench but not OCRBench v2. Mistral OCR reports neither. GPT-5.4 reports both but with different metrics. For fair “apples-to-apples” comparison, you need benchmarks where multiple models have published results.

10 benchmarks with 5+ models · 27 open-source OCR models · 15 commercial / API models · 12 discriminative benchmarks
§ 02 · Why multiple benchmarks

Why Multiple Benchmarks Matter

A single benchmark score is like a single data point: potentially misleading, easily gamed, and insufficient for real decisions. Understanding why requires seeing the four fundamental problems with single-benchmark evaluation.

1 · The Four Fundamental Problems

Every single-benchmark evaluation suffers from at least one of these issues. The more benchmarks you use, the harder it is for these problems to hide.

  1. Cherry-picking. Model developers naturally report the benchmarks where their model excels. Analogy: judging a chef only by their best dish.
  2. Capability blindspots. One benchmark tests one skill, but OCR has dozens of distinct challenges. Analogy: testing a doctor only on heart conditions.
  3. Benchmark saturation. When everyone scores 95%+, the benchmark stops being useful. Analogy: grading everyone A+, so there is no discrimination.
  4. Coverage gaps. You can only compare models on benchmarks they both report. Analogy: comparing runners who ran different races.

The Core Insight

A model that claims "SOTA on Benchmark X" might be:

  • Terrible on Benchmarks Y and Z (cherry-picking)
  • Only good at one specific capability (blindspots)
  • Tied with 10 other models at 98% (saturation)
  • Incomparable to models that skipped Benchmark X (coverage)
2 · The Restaurant Analogy

Think of OCR models like restaurants, with each benchmark as a dish category.

Scenario: you want to find the best restaurant in town.
The problem: each restaurant shows off its best dish.
The insight: Restaurant A shows their pasta, B shows their steak, C shows their sushi. You never see them cook the same dish, so you cannot rank them against each other.
3 · Benchmark Saturation: When 99% Means Nothing

When everyone scores near-perfect, the benchmark loses its discriminative power. Contrast that with a benchmark where scores still spread out:

Handwriting (IAM), a discriminative benchmark:

Model   | Score
Model A | 72.3%
Model B | 85.1%
Model C | 68.9%
Model D | 91.2%
Model E | 77.4%

Max 91.2%, min 68.9%: a 22.3-point spread, a clear winner (Model D), and a meaningful ranking.
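Spread is simply the best score minus the worst. A minimal Python sketch, using the illustrative IAM numbers above (the model names are placeholders, and the thresholds are assumptions that echo the "10+ point spread" rule of thumb used later on this page):

```python
# Measure a benchmark's discriminative power from its reported scores.
# Scores are the illustrative Handwriting (IAM) numbers from the table above.
scores = {"Model A": 72.3, "Model B": 85.1, "Model C": 68.9, "Model D": 91.2, "Model E": 77.4}

best = max(scores.values())
worst = min(scores.values())
spread = best - worst  # 91.2 - 68.9 = 22.3 points

# Rough saturation check: a wide spread means the ranking is meaningful,
# a narrow one means every model clusters together. Thresholds are assumptions.
if spread >= 10:
    verdict = "discriminative"
elif spread >= 5:
    verdict = "weakly discriminative"
else:
    verdict = "saturated"

print(f"best={best}, worst={worst}, spread={spread:.1f} -> {verdict}")
```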
4 · OCR Is Not One Skill, It Is Many

OCR encompasses at least 8 distinct capabilities. A model can excel at printed text while failing at handwriting. No single benchmark tests all of these.

Capability        | Example                         | Benchmark           | Difficulty
Document Text     | Clean printed documents         | OmniDocBench        | Low
Table Extraction  | Structured tables with cells    | OmniDocBench (TEDS) | Medium
Handwriting       | Cursive and print handwriting   | IAM Database        | High
Scene Text        | Text in photos, signs, products | ICDAR 2015          | High
Math/Formulas     | LaTeX equations, symbols        | olmOCR-Bench        | Very High
Multi-column      | Newspapers, academic papers     | olmOCR-Bench        | Medium
Low-Quality Scans | Faded, blurry, old documents    | olmOCR-Bench        | High
Receipts/Invoices | Structured data extraction      | SROIE               | Medium

The Coverage Problem Visualized

Each row is a model. Y = benchmark reported, - = benchmark missing. Notice how little the benchmarks overlap, which makes comparison nearly impossible.

Model   | OmniDoc | OCRBench v2 | SROIE | IAM | ICDAR | olmOCR
Model A | Y       | Y           | Y     | -   | -     | -
Model B | -       | -           | -     | Y   | Y     | -
Model C | Y       | -           | -     | Y   | -     | -

Model A and Model C overlap only on OmniDoc, Model B and Model C only on IAM, and Model A and Model B share no benchmark at all.
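If this coverage matrix is available in machine-readable form, finding which model pairs can be compared at all is a set-intersection exercise. A minimal sketch, using the toy matrix above (model and benchmark names are illustrative placeholders):

```python
# For every pair of models, list the benchmarks both report.
# An empty intersection means the pair cannot be compared fairly at all.
from itertools import combinations

coverage = {
    "Model A": {"OmniDoc", "OCRBench v2", "SROIE"},
    "Model B": {"IAM", "ICDAR"},
    "Model C": {"OmniDoc", "IAM"},
}

for a, b in combinations(coverage, 2):
    shared = coverage[a] & coverage[b]
    if shared:
        print(f"{a} vs {b}: comparable on {sorted(shared)}")
    else:
        print(f"{a} vs {b}: no shared benchmarks, not comparable")
```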

5 · The Solution: Multi-Benchmark Evaluation

The number of benchmarks used correlates directly with evaluation reliability. Here is why more benchmarks lead to better decisions.

How Benchmark Count Affects Reliability

Benchmarks | Reliability | Characterization
1          | 20%         | Cherry-picked. Hides weaknesses.
2          | 45%         | Better, but still easy to game.
3-5        | 75%         | Good coverage. Hard to fake.
6+         | 95%         | Comprehensive. True picture.

Naive approach: look at each model's self-reported benchmark.

Model A: 97% on BenchX | Model B: 95% on BenchY | Model C: 99% on BenchZ

These are different benchmarks, so the numbers cannot be compared.

Smart approach: find benchmarks where ALL the models have results.

All 3 models tested on OmniDocBench: A = 82%, B = 89%, C = 76%

Now you can actually compare. Model B wins.
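In code, the smart approach boils down to intersecting each model's set of reported benchmarks and ranking on whatever survives. A minimal sketch, using the illustrative numbers above (BenchX/Y/Z and the scores are placeholders from the naive example):

```python
# Restrict comparison to benchmarks every candidate model reports,
# then rank the models on each shared benchmark.
reported = {
    "Model A": {"BenchX": 97.0, "OmniDocBench": 82.0},
    "Model B": {"BenchY": 95.0, "OmniDocBench": 89.0},
    "Model C": {"BenchZ": 99.0, "OmniDocBench": 76.0},
}

# Benchmarks present for every model: intersection of the per-model benchmark sets.
shared = set.intersection(*(set(scores) for scores in reported.values()))

for bench in sorted(shared):
    ranking = sorted(reported, key=lambda m: reported[m][bench], reverse=True)
    print(bench, "->", ", ".join(f"{m}={reported[m][bench]}" for m in ranking))
# OmniDocBench -> Model B=89.0, Model A=82.0, Model C=76.0
```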

Practical Recommendations

For Model Developers
  • Report on 5+ diverse benchmarks
  • Include hard benchmarks (handwriting, tables)
  • Do not just pick benchmarks you win

For Evaluators
  • Compare only on shared benchmarks
  • Weight harder benchmarks more
  • Look for 10+ point spreads

For Decision Makers
  • Match benchmarks to your use case
  • Distrust single-benchmark claims
  • Run your own tests when possible

Key Takeaways

Single benchmarks fail because:

  • Cherry-picking: models report their best results
  • Blindspots: one test, one capability
  • Saturation: 99% vs 98% means nothing
  • Coverage gaps: cannot compare different tests

Multiple benchmarks succeed because:

  • Hard to game: a model cannot excel at everything
  • Coverage: tests multiple capabilities
  • Discrimination: at least some benchmarks will show a spread
  • Fair comparison: more overlap between models

The Bottom Line: A model claiming "best OCR" based on one benchmark is like a student claiming "smartest in class" based on one quiz. Real evaluation requires comprehensive testing across multiple dimensions. That is what this page helps you do.

§ 03 · Model coverage gaps

Where each model doesn't report.

Which key OCR benchmarks is each model missing? Coverage is the share of the tracked benchmarks where the model has published results; the rest are gaps that prevent fair comparison.

Benchmarks tracked: OmniDocBench, olmOCR-Bench, OCRBench v2, OCRBench, SROIE, IAM Database, ICDAR 2015, Total-Text, CC-OCR, KITAB-Bench, ThaiOCRBench.

Model           | Type        | Coverage
GPT-5.4         | api         | 45%
Gemini 2.5 Pro  | api         | 27%
dots.ocr 3B     | open-source | 18%
PaddleOCR-VL    | open-source | 18%
MinerU 2.5      | open-source | 18%
Mistral OCR 3   | api         | 18%
Claude Sonnet 4 | api         | 18%
Chandra v0.1.0  | open-source | 9%
Qwen2.5-VL 72B  | open-source | 9%
Tesseract       | open-source | 9%
EasyOCR         | open-source | 9%
olmOCR v0.4.0   | open-source | 9%
Marker 1.10.1   | open-source | 9%
Gemini 3.1 Pro  | api         | 9%

§ 04 · Fair comparison set

Benchmarks with overlap.

Benchmarks where multiple top models have published results. Use these for apples-to-apples comparison.

Benchmark    | Models | Category
olmOCR-Bench | 8      | Document Parsing
OmniDocBench | 7      | Document Parsing
OCRBench v2  | 4      | OCR Capabilities
KITAB-Bench  | 3      | Arabic
ThaiOCRBench | 3      | Thai
CC-OCR       | 2      | Multi-Scene
§ 05 · Recommended benchmark set

Six benchmarks, full coverage.

For comprehensive OCR model comparison, use these benchmarks that cover different aspects and have good model coverage.

Document parsing · OmniDocBench
981 pages, 9 categories. Tests tables, formulas, layouts. The most comprehensive document parsing benchmark.
Primary metrics: composite score, table-teds

Challenging docs · olmOCR-Bench
7,010 tests across 1,402 PDFs. Old scans, math, multi-column, tiny text. Tests real-world edge cases.
Primary metric: pass-rate

OCR capabilities · OCRBench v2
8 core capabilities, 23 tasks. Tests text recognition, referring, extraction across English and Chinese.
Primary metrics: overall-en-private, overall-zh-private

Handwriting · IAM Database
13,353 lines from 657 writers. The gold standard for handwriting recognition since 1999.
Primary metrics: CER, WER

Receipts / invoices · SROIE
626 receipt images. Key information extraction: company, date, address, total. Standard invoice benchmark.
Primary metric: F1

Scene text · ICDAR 2015
1,500 images from wearable cameras. Industry standard for scene text detection in the wild.
Primary metric: F1
§ 06 · All OCR benchmarks by coverage

Sorted by model count.

Sorted by number of models with published results; the benchmarks at the top give the strongest comparison set.

  1. OCRBench v2 · ocr capabilities · multilingual · 48 models · 2024 · Must include
     Tests 8 core OCR capabilities across 23 tasks. Evaluates LMMs on text recognition, referring, extraction.
  2. OmniDocBench · document parsing · en · 34 models · 2024 · Must include
     981 annotated PDF pages across 9 document categories. Tests end-to-end document parsing including text, tables, and formulas.
  3. olmOCR-Bench · document parsing · en · 22 models · 2024 · Must include
     7,010 unit tests across 1,402 PDF documents. Tests parsing of tables, math, multi-column layouts, old scans, and more.
  4. ParseBench · document parsing · en · 14 models · 2026 · Must include
     LlamaIndex 2026 document parsing benchmark. ~2,078 human-verified pages from ~1,211 enterprise documents (insurance, finance, government) with 169K rule-based tests across five dimensions: tables (GTRM), charts (ChartDataPointMatch), content faithfulness, semantic formatting, and visual grounding. No LLM-as-judge. Overall score = unweighted mean of the five dimensions.
  5. FUNSD · document understanding · en · 13 models · 2019 · Must include
     199 fully annotated forms. Tests semantic entity labeling and linking.
  6. KITAB-Bench · document ocr · ar · 8 models · 2024 · Must include
     8,809 Arabic text samples across 9 domains. Tests Arabic script recognition.
  7. MME-VideoOCR · ocr capabilities · en · 6 models · 2024 · Recommended
     1,464 videos with 2,000 QA pairs across 25 tasks. Tests OCR capabilities in video content.
  8. IAM · handwriting recognition · en · 5 models · 1999 · Recommended
     13,353 handwritten text lines from 657 writers. Standard handwriting benchmark.
  9. CC-OCR · ocr capabilities · multilingual · 5 models · 2024 · Recommended
     Multi-scene text reading, key information extraction, multilingual text, and document parsing benchmark.
  10. ThaiOCRBench · document ocr · th · 5 models · 2024 · Recommended
      2,808 Thai text samples across 13 tasks. Tests Thai script structural understanding.
§ 07 · Discriminative benchmarks

Where scores actually spread.

Benchmarks where model scores are spread out (not saturated). A 30-point spread means models actually differ; a 5-point spread means everyone performs similarly.

Benchmark    | Metric             | Worst | Best  | Spread | Models | Verdict
OmniDocBench | table-teds         | 0.80  | 93.52 | 92.72  | 4      | Highly discriminative
OmniDocBench | composite          | 31.70 | 94.62 | 62.92  | 33     | Highly discriminative
olmOCR-Bench | headers-footers    | 42.00 | 96.10 | 54.10  | 4      | Highly discriminative
OCRBench v2  | overall-zh-public  | 9.10  | 55.70 | 46.60  | 17     | Highly discriminative
ParseBench   | accuracy           | 45.20 | 84.90 | 39.70  | 14     | Highly discriminative
OCRBench v2  | overall-en-private | 23.40 | 62.20 | 38.80  | 31     | Highly discriminative
olmOCR-Bench | old-scans          | 40.70 | 73.10 | 32.40  | 5      | Highly discriminative
OCRBench v2  | overall-en-public  | 23.10 | 52.60 | 29.50  | 17     | Highly discriminative
olmOCR-Bench | pass-rate          | 63.80 | 83.90 | 20.10  | 20     | Highly discriminative
OCRBench v2  | overall-zh-private | 45.70 | 63.70 | 18.00  | 9      | Good separation
FUNSD        | f1                 | 77.89 | 92.08 | 14.19  | 13     | Good separation
MME-VideoOCR | total-accuracy     | 61.00 | 73.70 | 12.70  | 6      | Good separation

Spread = Best score - Worst score. Higher spread = more meaningful for comparison.

§ 08 · By category

Grouped by task type.

Document Parsing: OmniDocBench (34 models), olmOCR-Bench (22 models), ParseBench (14 models)
Text Extraction: FUNSD (13 models), MME-VideoOCR (6 models)
Handwriting: IAM (5 models)
Multilingual: OCRBench v2 (multilingual, 48 models), CC-OCR (multilingual, 5 models)
Specialized: KITAB-Bench (Arabic, 8 models), ThaiOCRBench (Thai, 5 models)
§ 09 · Testing gaps

Needs more data.

Important OCR benchmarks with limited model coverage. These represent blind spots where we cannot make fair comparisons.

§ 10 · Decision guide

Pick by use case.

Comparing document parsers (PDFs, forms)?

Use OmniDocBench (composite) + olmOCR-Bench (pass-rate). Both have 10+ models tested.

Comparing text extraction (raw OCR)?

Use OCRBench v2 (overall-en) for general OCR, SROIE (F1) for receipts/invoices.

Comparing handwriting recognition?

Use IAM Database (CER). Standard since 1999, most models report it.

Comparing scene text (photos, signs)?

Use ICDAR 2015 (F1) + Total-Text for curved text.

Comparing multilingual OCR?

Use KITAB-Bench (Arabic), ThaiOCRBench (Thai), OCRBench v2 (zh-private). Limited model coverage — be careful with claims.

§ 11 · How to use this data

Four rules.

  1. Identify benchmarks with 5+ models. These enable fair comparison; fewer models usually means cherry-picked results.
  2. Check discriminative power. If all models score 95%+, the benchmark is saturated; prefer benchmarks with 10+ point spreads (see the sketch below).
  3. Compare on the same metric. OmniDocBench has composite, table-teds, and layout-map scores; make sure you are comparing the same one.
  4. Be skeptical of single-benchmark claims. "SOTA on X" means nothing if the model skips the harder benchmarks.
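Taken together, the first two rules are a mechanical filter over the data on this page. A minimal sketch, assuming the rows below were pulled from the § 07 table (the 5-model and 10-point thresholds come straight from the rules above):

```python
# Keep only (benchmark, metric) pairs with enough models for a fair comparison
# and enough spread to still be discriminative.
results = [
    # (benchmark, metric, worst, best, models) - sample rows from the § 07 table
    ("OmniDocBench", "composite",          31.70, 94.62, 33),
    ("OmniDocBench", "table-teds",          0.80, 93.52,  4),
    ("olmOCR-Bench", "pass-rate",          63.80, 83.90, 20),
    ("OCRBench v2",  "overall-en-private", 23.40, 62.20, 31),
]

MIN_MODELS = 5     # rule 1: 5+ models with published results
MIN_SPREAD = 10.0  # rule 2: 10+ point spread, i.e. not saturated

for bench, metric, worst, best, models in results:
    spread = best - worst
    usable = models >= MIN_MODELS and spread >= MIN_SPREAD
    status = "use " if usable else "skip"
    # Rule 3 is implicit: comparisons stay within a single (benchmark, metric) pair.
    print(f"{status} {bench} / {metric}: {models} models, spread {spread:.1f} pts")
```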
§ 12 · Compare OCR models

Ready to compare?

Use our benchmark data to make informed decisions about OCR models. Compare dots.ocr, GPT-5.4, Mistral OCR, PaddleOCR and more.

OCR Model Comparison · Best for Invoices · Best for Handwriting