Research tool · OCR coverage matrix

OCR Benchmark Priority Matrix.

Which OCR benchmarks have the best model coverage for fair comparison? Data-driven analysis of 242 results across 10 OCR benchmarks and 42 models.

§ 01 · The fair comparison problem

Different models, different benchmarks.

dots.ocr reports OmniDocBench but not OCRBench v2. Mistral OCR reports neither. GPT-5.4 reports both but with different metrics. For fair “apples-to-apples” comparison, you need benchmarks where multiple models have published results.

10 benchmarks with 5+ models · 27 open-source OCR models · 15 commercial / API models · 12 discriminative benchmarks
§ 02 · Why multiple benchmarks

Why Multiple Benchmarks Matter

A single benchmark score is like a single data point: potentially misleading, easily gamed, and insufficient for real decisions. Understanding why requires seeing the four fundamental problems with single-benchmark evaluation.

1 · The Four Fundamental Problems

Every single-benchmark evaluation suffers from at least one of these issues. The more benchmarks you use, the harder it is for these problems to hide.

  1. Cherry-picking. Model developers naturally report the benchmarks where their model excels. Analogy: judging a chef only by their best dish.
  2. Capability blindspots. One benchmark tests one skill, but OCR has dozens of distinct challenges. Analogy: testing a doctor only on heart conditions.
  3. Benchmark saturation. When everyone scores 95%+, the benchmark stops being useful. Analogy: grading everyone A+, so there is no discrimination.
  4. Coverage gaps. You can only compare models on benchmarks they both report. Analogy: comparing runners who ran different races.

The Core Insight

A model that claims "SOTA on Benchmark X" might be:

  • Terrible on Benchmarks Y and Z (cherry-picking)
  • Only good at one specific capability (blindspots)
  • Tied with 10 other models at 98% (saturation)
  • Incomparable to models that skipped Benchmark X (coverage)
2 · The Restaurant Analogy

Think of OCR models like restaurants, with each benchmark as a dish category.

Scenario: you want to find the best restaurant in town.
The problem: each restaurant shows off its best dish.
The insight: Restaurant A shows their pasta, B shows their steak, C shows their sushi. You never see them cook the same dish, so you cannot rank them against each other.
3 · Benchmark Saturation: When 99% Means Nothing

When everyone scores near-perfect, the benchmark loses its discriminative power. Contrast that with a benchmark where scores still spread out:

Handwriting (IAM), a discriminative benchmark:

Model   | Score
Model A | 72.3%
Model B | 85.1%
Model C | 68.9%
Model D | 91.2%
Model E | 77.4%

Max 91.2%, min 68.9%: a 22.3-point spread, a clear winner (Model D), and a meaningful ranking.
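Spread is simply the best score minus the worst. A minimal Python sketch, using the illustrative IAM numbers above (the model names are placeholders, and the thresholds are assumptions that echo the "10+ point spread" rule of thumb used later on this page):

```python
# Measure a benchmark's discriminative power from its reported scores.
# Scores are the illustrative Handwriting (IAM) numbers from the table above.
scores = {"Model A": 72.3, "Model B": 85.1, "Model C": 68.9, "Model D": 91.2, "Model E": 77.4}

best = max(scores.values())
worst = min(scores.values())
spread = best - worst  # 91.2 - 68.9 = 22.3 points

# Rough saturation check: a wide spread means the ranking is meaningful,
# a narrow one means every model clusters together. Thresholds are assumptions.
if spread >= 10:
    verdict = "discriminative"
elif spread >= 5:
    verdict = "weakly discriminative"
else:
    verdict = "saturated"

print(f"best={best}, worst={worst}, spread={spread:.1f} -> {verdict}")
```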
4 · OCR Is Not One Skill, It Is Many

OCR encompasses at least 8 distinct capabilities. A model can excel at printed text while failing at handwriting. No single benchmark tests all of these.

Capability        | Example                         | Benchmark           | Difficulty
Document Text     | Clean printed documents         | OmniDocBench        | Low
Table Extraction  | Structured tables with cells    | OmniDocBench (TEDS) | Medium
Handwriting       | Cursive and print handwriting   | IAM Database        | High
Scene Text        | Text in photos, signs, products | ICDAR 2015          | High
Math/Formulas     | LaTeX equations, symbols        | olmOCR-Bench        | Very High
Multi-column      | Newspapers, academic papers     | olmOCR-Bench        | Medium
Low-Quality Scans | Faded, blurry, old documents    | olmOCR-Bench        | High
Receipts/Invoices | Structured data extraction      | SROIE               | Medium

The Coverage Problem Visualized

Each row is a model. Y = benchmark reported, - = benchmark missing. Notice how little the benchmarks overlap, which makes comparison nearly impossible.

Model   | OmniDoc | OCRBench v2 | SROIE | IAM | ICDAR | olmOCR
Model A | Y       | Y           | Y     | -   | -     | -
Model B | -       | -           | -     | Y   | Y     | -
Model C | Y       | -           | -     | Y   | -     | -

Model A and Model C overlap only on OmniDoc, Model B and Model C only on IAM, and Model A and Model B share no benchmark at all.
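If this coverage matrix is available in machine-readable form, finding which model pairs can be compared at all is a set-intersection exercise. A minimal sketch, using the toy matrix above (model and benchmark names are illustrative placeholders):

```python
# For every pair of models, list the benchmarks both report.
# An empty intersection means the pair cannot be compared fairly at all.
from itertools import combinations

coverage = {
    "Model A": {"OmniDoc", "OCRBench v2", "SROIE"},
    "Model B": {"IAM", "ICDAR"},
    "Model C": {"OmniDoc", "IAM"},
}

for a, b in combinations(coverage, 2):
    shared = coverage[a] & coverage[b]
    if shared:
        print(f"{a} vs {b}: comparable on {sorted(shared)}")
    else:
        print(f"{a} vs {b}: no shared benchmarks, not comparable")
```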

5 · The Solution: Multi-Benchmark Evaluation

The number of benchmarks used correlates directly with evaluation reliability. Here is why more benchmarks lead to better decisions.

How Benchmark Count Affects Reliability

Benchmarks | Reliability | Characterization
1          | 20%         | Cherry-picked. Hides weaknesses.
2          | 45%         | Better, but still easy to game.
3-5        | 75%         | Good coverage. Hard to fake.
6+         | 95%         | Comprehensive. True picture.

Naive approach: look at each model's self-reported benchmark.

Model A: 97% on BenchX | Model B: 95% on BenchY | Model C: 99% on BenchZ

These are different benchmarks, so the numbers cannot be compared.

Smart approach: find benchmarks where ALL the models have results.

All 3 models tested on OmniDocBench: A = 82%, B = 89%, C = 76%

Now you can actually compare. Model B wins.
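In code, the smart approach boils down to intersecting each model's set of reported benchmarks and ranking on whatever survives. A minimal sketch, using the illustrative numbers above (BenchX/Y/Z and the scores are placeholders from the naive example):

```python
# Restrict comparison to benchmarks every candidate model reports,
# then rank the models on each shared benchmark.
reported = {
    "Model A": {"BenchX": 97.0, "OmniDocBench": 82.0},
    "Model B": {"BenchY": 95.0, "OmniDocBench": 89.0},
    "Model C": {"BenchZ": 99.0, "OmniDocBench": 76.0},
}

# Benchmarks present for every model: intersection of the per-model benchmark sets.
shared = set.intersection(*(set(scores) for scores in reported.values()))

for bench in sorted(shared):
    ranking = sorted(reported, key=lambda m: reported[m][bench], reverse=True)
    print(bench, "->", ", ".join(f"{m}={reported[m][bench]}" for m in ranking))
# OmniDocBench -> Model B=89.0, Model A=82.0, Model C=76.0
```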

Practical Recommendations

For Model Developers
  • Report on 5+ diverse benchmarks
  • Include hard benchmarks (handwriting, tables)
  • Do not just pick benchmarks you win

For Evaluators
  • Compare only on shared benchmarks
  • Weight harder benchmarks more
  • Look for 10+ point spreads

For Decision Makers
  • Match benchmarks to your use case
  • Distrust single-benchmark claims
  • Run your own tests when possible

Key Takeaways

Single benchmarks fail because:

  • Cherry-picking: models report their best results
  • Blindspots: one test, one capability
  • Saturation: 99% vs 98% means nothing
  • Coverage gaps: cannot compare different tests

Multiple benchmarks succeed because:

  • Hard to game: a model cannot excel at everything
  • Coverage: tests multiple capabilities
  • Discrimination: at least some benchmarks will show a spread
  • Fair comparison: more overlap between models

The Bottom Line: A model claiming "best OCR" based on one benchmark is like a student claiming "smartest in class" based on one quiz. Real evaluation requires comprehensive testing across multiple dimensions. That is what this page helps you do.

§ 03 · Model coverage gaps

Where each model doesn't report.

Which key OCR benchmarks is each model missing? Coverage is the share of the tracked benchmarks where the model has published results; the rest are gaps that prevent fair comparison.

Benchmarks tracked: OmniDocBench, olmOCR-Bench, OCRBench v2, OCRBench, SROIE, IAM Database, ICDAR 2015, Total-Text, CC-OCR, KITAB-Bench, ThaiOCRBench.

Model           | Type        | Coverage
GPT-5.4         | api         | 45%
Gemini 2.5 Pro  | api         | 27%
dots.ocr 3B     | open-source | 18%
PaddleOCR-VL    | open-source | 18%
MinerU 2.5      | open-source | 18%
Mistral OCR 3   | api         | 18%
Claude Sonnet 4 | api         | 18%
Chandra v0.1.0  | open-source | 9%
Qwen2.5-VL 72B  | open-source | 9%
Tesseract       | open-source | 9%
EasyOCR         | open-source | 9%
olmOCR v0.4.0   | open-source | 9%
Marker 1.10.1   | open-source | 9%
Gemini 3.1 Pro  | api         | 9%

§ 04 · Fair comparison set

Benchmarks with overlap.

Benchmarks where multiple top models have published results. Use these for apples-to-apples comparison.

Benchmark    | Models | Category
olmOCR-Bench | 8      | Document Parsing
OmniDocBench | 7      | Document Parsing
OCRBench v2  | 4      | OCR Capabilities
KITAB-Bench  | 3      | Arabic
ThaiOCRBench | 3      | Thai
CC-OCR       | 2      | Multi-Scene
§ 05 · Recommended benchmark set

Six benchmarks, full coverage.

For comprehensive OCR model comparison, use these benchmarks that cover different aspects and have good model coverage.

Document parsing · OmniDocBench
981 pages, 9 categories. Tests tables, formulas, layouts. The most comprehensive document parsing benchmark.
Primary metrics: composite score, table-teds

Challenging docs · olmOCR-Bench
7,010 tests across 1,402 PDFs. Old scans, math, multi-column, tiny text. Tests real-world edge cases.
Primary metric: pass-rate

OCR capabilities · OCRBench v2
8 core capabilities, 23 tasks. Tests text recognition, referring, extraction across English and Chinese.
Primary metrics: overall-en-private, overall-zh-private

Handwriting · IAM Database
13,353 lines from 657 writers. The gold standard for handwriting recognition since 1999.
Primary metrics: CER, WER

Receipts / invoices · SROIE
626 receipt images. Key information extraction: company, date, address, total. Standard invoice benchmark.
Primary metric: F1

Scene text · ICDAR 2015
1,500 images from wearable cameras. Industry standard for scene text detection in the wild.
Primary metric: F1
§ 06 · All OCR benchmarks by coverage

Sorted by model count.

Sorted by number of models with published results; the benchmarks at the top give the strongest comparison set.

  1. OCRBench v2 · ocr capabilities · multilingual · 48 models · 2024 · Must include
     Tests 8 core OCR capabilities across 23 tasks. Evaluates LMMs on text recognition, referring, extraction.
  2. OmniDocBench · document parsing · en · 34 models · 2024 · Must include
     981 annotated PDF pages across 9 document categories. Tests end-to-end document parsing including text, tables, and formulas.
  3. olmOCR-Bench · document parsing · en · 22 models · 2024 · Must include
     7,010 unit tests across 1,402 PDF documents. Tests parsing of tables, math, multi-column layouts, old scans, and more.
  4. ParseBench · document parsing · en · 14 models · 2026 · Must include
     LlamaIndex 2026 document parsing benchmark. ~2,078 human-verified pages from ~1,211 enterprise documents (insurance, finance, government) with 169K rule-based tests across five dimensions: tables (GTRM), charts (ChartDataPointMatch), content faithfulness, semantic formatting, and visual grounding. No LLM-as-judge. Overall score = unweighted mean of the five dimensions.
  5. FUNSD · document understanding · en · 13 models · 2019 · Must include
     199 fully annotated forms. Tests semantic entity labeling and linking.
  6. KITAB-Bench · document ocr · ar · 8 models · 2024 · Must include
     8,809 Arabic text samples across 9 domains. Tests Arabic script recognition.
  7. MME-VideoOCR · ocr capabilities · en · 6 models · 2024 · Recommended
     1,464 videos with 2,000 QA pairs across 25 tasks. Tests OCR capabilities in video content.
  8. IAM · handwriting recognition · en · 5 models · 1999 · Recommended
     13,353 handwritten text lines from 657 writers. Standard handwriting benchmark.
  9. CC-OCR · ocr capabilities · multilingual · 5 models · 2024 · Recommended
     Multi-scene text reading, key information extraction, multilingual text, and document parsing benchmark.
  10. ThaiOCRBench · document ocr · th · 5 models · 2024 · Recommended
      2,808 Thai text samples across 13 tasks. Tests Thai script structural understanding.
§ 07 · Discriminative benchmarks

Where scores actually spread.

Benchmarks where model scores are spread out (not saturated). A 30-point spread means models actually differ; a 5-point spread means everyone performs similarly.

Benchmark    | Metric             | Worst | Best  | Spread | Models | Verdict
OmniDocBench | table-teds         | 0.80  | 93.52 | 92.72  | 4      | Highly discriminative
OmniDocBench | composite          | 31.70 | 94.62 | 62.92  | 33     | Highly discriminative
olmOCR-Bench | headers-footers    | 42.00 | 96.10 | 54.10  | 4      | Highly discriminative
OCRBench v2  | overall-zh-public  | 9.10  | 55.70 | 46.60  | 17     | Highly discriminative
ParseBench   | accuracy           | 45.20 | 84.90 | 39.70  | 14     | Highly discriminative
OCRBench v2  | overall-en-private | 23.40 | 62.20 | 38.80  | 31     | Highly discriminative
olmOCR-Bench | old-scans          | 40.70 | 73.10 | 32.40  | 5      | Highly discriminative
OCRBench v2  | overall-en-public  | 23.10 | 52.60 | 29.50  | 17     | Highly discriminative
olmOCR-Bench | pass-rate          | 63.80 | 83.90 | 20.10  | 20     | Highly discriminative
OCRBench v2  | overall-zh-private | 45.70 | 63.70 | 18.00  | 9      | Good separation
FUNSD        | f1                 | 77.89 | 92.08 | 14.19  | 13     | Good separation
MME-VideoOCR | total-accuracy     | 61.00 | 73.70 | 12.70  | 6      | Good separation

Spread = Best score - Worst score. Higher spread = more meaningful for comparison.

§ 08 · By category

Grouped by task type.

Document Parsing: OmniDocBench (34 models), olmOCR-Bench (22 models), ParseBench (14 models)
Text Extraction: FUNSD (13 models), MME-VideoOCR (6 models)
Handwriting: IAM (5 models)
Multilingual: OCRBench v2 (multilingual, 48 models), CC-OCR (multilingual, 5 models)
Specialized: KITAB-Bench (Arabic, 8 models), ThaiOCRBench (Thai, 5 models)
§ 09 · Testing gaps

Needs more data.

Important OCR benchmarks with limited model coverage. These represent blind spots where we cannot make fair comparisons.

§ 10 · Decision guide

Pick by use case.

Comparing document parsers (PDFs, forms)?

Use OmniDocBench (composite) + olmOCR-Bench (pass-rate). Both have 10+ models tested.

Comparing text extraction (raw OCR)?

Use OCRBench v2 (overall-en) for general OCR, SROIE (F1) for receipts/invoices.

Comparing handwriting recognition?

Use IAM Database (CER). Standard since 1999, most models report it.

Comparing scene text (photos, signs)?

Use ICDAR 2015 (F1) + Total-Text for curved text.

Comparing multilingual OCR?

Use KITAB-Bench (Arabic), ThaiOCRBench (Thai), OCRBench v2 (zh-private). Limited model coverage — be careful with claims.

§ 11 · How to use this data

Four rules.

  1. Identify benchmarks with 5+ models. These enable fair comparison; fewer models usually means cherry-picked results.
  2. Check discriminative power. If all models score 95%+, the benchmark is saturated; prefer benchmarks with 10+ point spreads (see the sketch below).
  3. Compare on the same metric. OmniDocBench has composite, table-teds, and layout-map scores; make sure you are comparing the same one.
  4. Be skeptical of single-benchmark claims. "SOTA on X" means nothing if the model skips the harder benchmarks.
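Taken together, the first two rules are a mechanical filter over the data on this page. A minimal sketch, assuming the rows below were pulled from the § 07 table (the 5-model and 10-point thresholds come straight from the rules above):

```python
# Keep only (benchmark, metric) pairs with enough models for a fair comparison
# and enough spread to still be discriminative.
results = [
    # (benchmark, metric, worst, best, models) - sample rows from the § 07 table
    ("OmniDocBench", "composite",          31.70, 94.62, 33),
    ("OmniDocBench", "table-teds",          0.80, 93.52,  4),
    ("olmOCR-Bench", "pass-rate",          63.80, 83.90, 20),
    ("OCRBench v2",  "overall-en-private", 23.40, 62.20, 31),
]

MIN_MODELS = 5     # rule 1: 5+ models with published results
MIN_SPREAD = 10.0  # rule 2: 10+ point spread, i.e. not saturated

for bench, metric, worst, best, models in results:
    spread = best - worst
    usable = models >= MIN_MODELS and spread >= MIN_SPREAD
    status = "use " if usable else "skip"
    # Rule 3 is implicit: comparisons stay within a single (benchmark, metric) pair.
    print(f"{status} {bench} / {metric}: {models} models, spread {spread:.1f} pts")
```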
§ 12 · Compare OCR models

Ready to compare?

Use our benchmark data to make informed decisions about OCR models. Compare dots.ocr, GPT-5.4, Mistral OCR, PaddleOCR and more.

OCR Model Comparison · Best for Invoices · Best for Handwriting