Document OCR

Polish OCR Benchmarks

From PolEval 2021 historical books to IMPACT-PSNC ground truth. Track models handling Polish diacritics and gothic fonts.

Polish OCR Stats

Baseline CER (CodeSOTA Polish): 26.3%
Images in CodeSOTA Dataset: 1,000
Degradation Levels: 5
Benchmark: NEW

Polish OCR Challenges

Diacritics

Polish uses 9 unique characters with diacritical marks that standard OCR engines often confuse:

ą ć ę ł ń ó ś ź ż

Gothic / Fraktur Fonts

Historical Polish documents (1791-1900s) often use gothic (blackletter) typefaces that are challenging for modern OCR trained on contemporary fonts.

OCR Post-Correction

PolEval 2021 focuses on correcting the errors in existing OCR output with NLP methods rather than on raw recognition. This requires understanding Polish morphology and common OCR confusion patterns.
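One way to surface confusion patterns is to align ground truth with OCR output and count one-to-one character substitutions. The sketch below is an illustration only, not the PolEval tooling: the function name `confusion_pairs` and the sample strings are assumptions, and the alignment comes from Python's standard `difflib`, which is a heuristic rather than a minimal edit-distance alignment.

```python
from collections import Counter
from difflib import SequenceMatcher

def confusion_pairs(reference: str, ocr_output: str) -> Counter:
    """Count (expected, recognized) character substitutions.

    Only 'replace' spans of equal length are counted, so every pair is a
    one-to-one substitution; insertions and deletions are skipped.
    """
    pairs = Counter()
    matcher = SequenceMatcher(None, reference, ocr_output, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for expected, got in zip(reference[i1:i2], ocr_output[j1:j2]):
                pairs[(expected, got)] += 1
    return pairs

# Hypothetical example: every diacritic was flattened to its base letter.
pairs = confusion_pairs("żółw łąka", "zolw laka")
for (expected, got), count in pairs.most_common():
    print(f"{expected} -> {got}: {count}")
```

Aggregated over a corpus, the most frequent pairs indicate which substitutions a post-correction model should prioritize.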

Polish Alphabet Unique Characters

Ą: ogonek
Ć: acute
Ę: ogonek
Ł: stroke
Ń: acute
Ó: acute
Ś: acute
Ź: acute
Ż: dot above
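The same inventory can be checked against the Unicode character database. The following is a minimal sketch using only the Python standard library; the helper `strip_polish_diacritics` is a hypothetical name introduced here to show which base letter each character collapses to when an OCR engine drops the mark.

```python
import unicodedata

# The nine Polish letters that carry a diacritical mark (lowercase forms).
POLISH_DIACRITICS = "ąćęłńóśźż"

# Unmarked base letter for each character. An explicit table is used because
# "ł" has no Unicode decomposition, so NFD normalization would not strip it.
BASE_LETTER = dict(zip(POLISH_DIACRITICS, "acelnoszz"))

def strip_polish_diacritics(text: str) -> str:
    """Replace each Polish diacritic letter with its unmarked base letter."""
    table = {}
    for marked, base in BASE_LETTER.items():
        table[ord(marked)] = base
        table[ord(marked.upper())] = base.upper()
    return text.translate(table)

# Print code point and official Unicode name for every letter.
for ch in POLISH_DIACRITICS:
    print(ch.upper(), ch, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Comparing `strip_polish_diacritics(reference)` with `strip_polish_diacritics(ocr_output)` is one way to check whether two strings differ only in their diacritics.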

CodeSOTA Polish Benchmark


Our own contamination-resistant benchmark of 1,000 synthetic and real Polish text images. Its four categories are designed to separate reliance on a language model from pure character recognition.

Tesseract 5.5.1 Baseline by Category

Wikipedia
Polish Wikipedia excerpts (potential contamination baseline)
CER: 5.2%

Real Corpus
Pan Tadeusz, official documents
CER: 7.3%

Synth Random
Random Polish characters (pure OCR)
CER: 40.6%

Synth Words
Markov-generated words (no dictionary)
CER: 52.1%

Overall (1,000 images): 26.3% CER
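The overall figure matches the unweighted mean of the four category scores. The short check below assumes equally sized categories, which the page does not state; a corpus-level CER computed from total edits over total characters could differ.

```python
# Per-category Tesseract 5.5.1 CER values as reported above (percent).
category_cer = {
    "wikipedia": 5.2,
    "real_corpus": 7.3,
    "synth_random": 40.6,
    "synth_words": 52.1,
}

# Assumption: every category contributes the same number of images,
# so the overall score is the plain mean of the category scores.
overall = sum(category_cer.values()) / len(category_cer)
print(f"overall CER: {overall:.1f}%")

# Ratio between the hardest synthetic category and the real corpus.
gap = category_cer["synth_words"] / category_cer["real_corpus"]
print(f"synth_words / real_corpus: {gap:.1f}x")
```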

Key Findings

A roughly 7x gap between real text (7.3% CER) and synthetic words (52.1% CER), rising to 10x against Wikipedia (5.2% CER), reveals heavy language model dependence.

Wikipedia and Real Corpus score similarly (5.2% vs 7.3% CER), which suggests minimal Wikipedia-specific overfitting.

Degradation sensitivity: CER jumps 50x from clean to medium degradation (roller marks).

Synthetic categories are key for evaluating pure OCR, since no dictionary fallback is possible.
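To illustrate how dictionary-free text of the kind used in the Synth Words category might be produced, here is a minimal character-level Markov chain. It is a sketch under stated assumptions, not the benchmark's actual generator: the seed list, the bigram order, the sentinel symbols, and all function names are hypothetical.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # sentinel symbols, assumed absent from the seed words

def build_bigram_model(seed_words):
    """Map each character to the list of characters observed right after it."""
    model = defaultdict(list)
    for word in seed_words:
        padded = START + word + END
        for current, following in zip(padded, padded[1:]):
            model[current].append(following)
    return model

def generate_word(model, rng, max_len=12):
    """Walk the chain from START until END is drawn or max_len is reached."""
    chars, current = [], START
    while len(chars) < max_len:
        current = rng.choice(model[current])
        if current == END:
            break
        chars.append(current)
    return "".join(chars)

def generate_pseudo_words(seed_words, count, seed=0, max_attempts=1000):
    """Generate up to `count` words that do not appear in the seed list."""
    rng = random.Random(seed)
    model = build_bigram_model(seed_words)
    known = set(seed_words)
    result = []
    for _ in range(max_attempts):
        if len(result) == count:
            break
        word = generate_word(model, rng)
        if word not in known:
            result.append(word)
    return result

# Hypothetical seed list; a real generator would train on a large corpus.
SEED_WORDS = ["żółw", "łąka", "gęś", "źródło", "ćma", "słońce", "pięć", "może"]
print(generate_pseudo_words(SEED_WORDS, count=5))
```

The filter only rejects words from the seed list, so a generated string could still coincide with a real Polish word; a stricter setup would check candidates against a full dictionary.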

5 Degradation Levels (Augraphy)

Clean: No artifacts
Light: Subtle noise
Medium: Roller marks
Heavy: Ink bleed
Severe: Bad photocopy

Metrics

CER

Character Error Rate

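Character-level edit distance (substitutions, insertions, deletions) between the OCR output and the ground truth, divided by the number of ground-truth characters. Lower is better. A CER of 2.1% means about 2 errors per 100 characters.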
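A minimal sketch of CER as Levenshtein distance over characters follows. The NFC normalization step is an assumption added here so that precomposed and combining forms of the same Polish letter compare equal; the benchmark's own scoring script may normalize differently.

```python
import unicodedata

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate as a fraction of reference characters."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)

# Three of four characters lost their diacritic: 3 substitutions / 4 characters.
print(f"{cer('żółw', 'zolw'):.1%}")  # → 75.0%
```

Because insertions count as errors, CER computed this way can exceed 100% when the output is much longer than the reference.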

WER

Word Error Rate

Percentage of words with any error. Stricter than CER, since one wrong character fails the whole word.

Word Accuracy

100% - WER

Percentage of words recognized correctly. Higher is better. 97.5% is excellent for historical texts.
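WER and Word Accuracy can be sketched the same way as CER, with whitespace-separated words as the unit of comparison. The tokenization by `str.split()` and the function names are assumptions; punctuation handling in a real scorer may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate as a fraction of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)

def word_accuracy(reference: str, hypothesis: str) -> float:
    """Word Accuracy defined as 1 - WER, following the definition above."""
    return 1.0 - wer(reference, hypothesis)

reference = "zażółć gęślą jaźń"
hypothesis = "zazolc gęślą jaźń"  # hypothetical OCR output, first word wrong
print(f"WER: {wer(reference, hypothesis):.1%}")
print(f"Word Accuracy: {word_accuracy(reference, hypothesis):.1%}")
```

In this example four wrong characters fall inside a single word, so one of three words counts as an error even though most characters are correct.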

Polish OCR Leaderboard

Comparing models on Polish OCR benchmarks. Lower CER is better.

Rank | Model | CodeSOTA CER | PolEval CER | IMPACT CER | Type

Approaches

Traditional OCR

Tesseract with Polish language models. Good baseline but struggles with historical fonts.

Tesseract, EasyOCR

Commercial Engines

ABBYY FineReader leads on Polish historical documents with specialized font recognition.

ABBYY FineReader, Readiris

NLP Post-Correction

Polish transformers (RoBERTa, T5, HerBERT) correct OCR errors using language understanding.

Polish RoBERTa, HerBERT, plT5

The Datasets

PolEval 2021 OCR

2021

979 Polish books (69,000 pages) published between 1791 and 1998. Focuses on OCR post-correction using NLP methods. A major benchmark for Polish historical document processing.

Task: document-ocr
Pages/Samples: 69,000

IMPACT-PSNC

2012

478 pages of ground truth from four Polish digital libraries at 99.95% accuracy. Includes annotations at region, line, word, and glyph levels. Gothic and antiqua fonts.

Task: document-ocr
Pages/Samples: 478

reVISION

2025

Polish benchmark for vision-language models including OCR evaluation on educational exam materials. Covers middle school, high school, and professional exams.

Task: ocr-capabilities
Pages/Samples: N/A

Polish EMNIST Extension

2020

Extension of EMNIST dataset with Polish handwritten characters including diacritics (ą, ć, ę, ł, ń, ó, ś, ź, ż). Tests recognition of Polish-specific characters.

Task: handwriting-recognition
Pages/Samples: N/A

CodeSOTA Polish

2025

1,000 synthetic and real Polish text images with 5 degradation levels (clean to severe). Tests character-level OCR on diacritics with contamination-resistant synthetic categories. Categories: synth_random (pure character recognition), synth_words (Markov-generated words), real_corpus (Pan Tadeusz, official documents), wikipedia (potential contamination baseline).

Task: document-ocr
Pages/Samples: 1,000