Polish
OCR Benchmarks
From PolEval 2021 historical books to IMPACT-PSNC ground truth. Track models handling Polish diacritics and gothic fonts.
Polish OCR Stats
Polish OCR Challenges
Diacritics
Polish uses 9 unique characters with diacritical marks that standard OCR engines often confuse: ą, ć, ę, ł, ń, ó, ś, ź, ż.
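One way to see how much of an engine's error comes from diacritics alone is to fold the nine Polish letters (ą, ć, ę, ł, ń, ó, ś, ź, ż) to their base letters and compare. The sketch below is illustrative and is not part of any benchmark tooling; the function names are hypothetical, and it assumes the reference and OCR strings are already aligned to the same length.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Map Polish diacritic letters to their base ASCII letters."""
    # ł/Ł have no canonical Unicode decomposition, so handle them explicitly.
    text = text.replace("ł", "l").replace("Ł", "L")
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (ogonek, acute, dot above) produced by NFD.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def diacritic_only_mismatches(reference: str, hypothesis: str) -> int:
    """Count positions where the two strings differ only by a diacritic.

    Assumes equal-length, already aligned strings (a simplification).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("strings must be aligned to the same length")
    return sum(
        1
        for r, h in zip(reference, hypothesis)
        if r != h and fold_diacritics(r) == fold_diacritics(h)
    )
```

For example, `diacritic_only_mismatches("łąka", "laka")` counts two diacritic-only errors, while a genuine substitution such as "kot" read as "kat" counts none.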
Gothic / Fraktur Fonts
Historical Polish documents (1791-1900s) often use gothic (blackletter) typefaces that are challenging for modern OCR trained on contemporary fonts.
OCR Post-Correction
PolEval 2021 focuses on correcting OCR output with NLP methods rather than on raw recognition. This requires knowledge of Polish morphology and of common OCR confusion patterns.
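To make the idea of confusion patterns concrete, here is a toy dictionary-lookup corrector. It is a stand-in for illustration only, not the transformer-based approach used in PolEval 2021 systems. The confusion table, the function names, and the tiny lexicon in the usage note are all assumptions.

```python
from itertools import product

# Hypothetical confusion table: each OCR output character maps to the
# characters it may actually stand for (itself plus diacritic variants).
CONFUSIONS = {
    "a": "aą", "c": "cć", "e": "eę", "l": "lł", "n": "nń",
    "o": "oó", "s": "sś", "z": "zźż",
}


def candidates(token: str) -> list[str]:
    """Enumerate every spelling reachable through the confusion table.

    The number of candidates grows exponentially with token length,
    so this is only practical for short tokens.
    """
    options = [CONFUSIONS.get(ch, ch) for ch in token]
    return ["".join(chars) for chars in product(*options)]


def correct(token: str, lexicon: set[str]) -> str:
    """Return the token if it is known, else a unique lexicon match."""
    if token in lexicon:
        return token
    matches = [c for c in candidates(token) if c in lexicon]
    # Only correct when the choice is unambiguous; otherwise leave as is.
    return matches[0] if len(matches) == 1 else token
```

With a lexicon of `{"żółć", "łąka", "kot"}`, `correct("zolc", lexicon)` restores "żółć". A lookup like this cannot resolve ambiguous cases from context, which is where morphology-aware language models become necessary.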
Polish Alphabet Unique Characters
CodeSOTA Polish Benchmark
Our own contamination-resistant benchmark with 1,000 synthetic and real Polish text images. Four categories are designed to separate reliance on a language model from pure character recognition.
Tesseract 5.5.1 Baseline by Category
wikipedia: Polish Wikipedia excerpts (potential contamination baseline)
real_corpus: Pan Tadeusz, official documents
synth_random: Random Polish characters (pure OCR)
synth_words: Markov-generated words (no dictionary)
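The two synthetic categories can be approximated with a few lines of code. The sketch below is not the benchmark's actual generator. It assumes a lowercase 32-letter Polish alphabet and an order-1 character bigram model for the Markov-generated words, and every name in it is hypothetical.

```python
import random

# The 32 letters of the Polish alphabet, lowercase.
POLISH_LOWER = "aąbcćdeęfghijklłmnńoóprsśtuwyzźż"


def random_chars(length: int, rng: random.Random) -> str:
    """Uniformly random Polish letters: no word structure at all."""
    return "".join(rng.choice(POLISH_LOWER) for _ in range(length))


def build_bigram_model(words: list[str]) -> dict[str, list[str]]:
    """Map each character to the list of characters seen after it.

    "^" marks the start of a word and "$" marks its end.
    """
    model: dict[str, list[str]] = {}
    for word in words:
        padded = "^" + word + "$"
        for current, following in zip(padded, padded[1:]):
            model.setdefault(current, []).append(following)
    return model


def markov_word(model: dict[str, list[str]], rng: random.Random,
                max_len: int = 12) -> str:
    """Sample a pronounceable-looking word that need not be in any dictionary."""
    chars: list[str] = []
    current = "^"
    while len(chars) < max_len:
        current = rng.choice(model[current])
        if current == "$":
            break
        chars.append(current)
    return "".join(chars)
```

Seeding `random.Random` keeps the generated text reproducible, so the same ground truth can be rendered at every degradation level.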
Key Findings
A roughly 7x gap between real text (7% CER) and synthetic text (52% CER) reveals heavy language model dependence.
Similar CER on Wikipedia and the real corpus (5.2% vs 7.3%) suggests minimal Wikipedia-specific overfitting.
Degradation sensitivity: CER jumps 50x from clean to medium degradation (roller marks).
Synthetic categories are key for evaluating pure OCR - no dictionary fallback possible.
5 Degradation Levels (Augraphy)
Metrics
CER
Character Error Rate
Percentage of characters that are incorrect. Lower is better. 2.1% means ~2 errors per 100 characters.
WER
Word Error Rate
Percentage of words with any error. Stricter than CER, since one wrong character fails the whole word.
Word Accuracy
100% - WER
Percentage of words recognized correctly. Higher is better. 97.5% is excellent for historical texts.
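These three metrics are commonly computed from the Levenshtein edit distance between the ground truth and the OCR output, divided by the length of the ground truth. The sketch below follows that convention, which is an assumption about how the leaderboard numbers are produced. The NFC normalization step is also an assumption, added so that precomposed and decomposed diacritics compare as equal.

```python
import unicodedata


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution or match
            ))
        previous = current
    return previous[-1]


def _nfc(text: str) -> str:
    # Normalize so "ą" is one code point however the input was encoded.
    return unicodedata.normalize("NFC", text)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits per reference character."""
    reference, hypothesis = _nfc(reference), _nfc(hypothesis)
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits per reference word."""
    ref_words = _nfc(reference).split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, _nfc(hypothesis).split()) / len(ref_words)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Word accuracy, defined as 1 - WER."""
    return 1.0 - wer(reference, hypothesis)
```

Reading "zażółć gęślą jaźń" as "zazolc gęślą jaźń" gives 4 character edits over 17 characters (CER about 23.5%) but 1 wrong word out of 3 (WER about 33.3%), which shows why WER is the stricter metric. Because insertions count as errors, WER can exceed 100%, in which case word accuracy computed this way becomes negative.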
Polish OCR Leaderboard
Comparing models on Polish OCR benchmarks. Lower CER is better.
| Rank | Model | CodeSOTA CER | PolEval CER | IMPACT CER | Type |
|---|---|---|---|---|---|
Approaches
Traditional OCR
Tesseract with Polish language models. Good baseline but struggles with historical fonts.
Commercial Engines
ABBYY FineReader leads on Polish historical documents with specialized font recognition.
NLP Post-Correction
Polish transformers (RoBERTa, T5, HerBERT) correct OCR errors using language understanding.
The Datasets
PolEval 2021 OCR
(2021) 979 Polish books (69,000 pages) from 1791-1998. Focus on OCR post-correction using NLP methods. Major benchmark for Polish historical document processing.
IMPACT-PSNC
(2012) 478 pages of ground truth from four Polish digital libraries at 99.95% accuracy. Includes annotations at region, line, word, and glyph levels. Gothic and antiqua fonts.
reVISION
(2025) Polish benchmark for vision-language models including OCR evaluation on educational exam materials. Covers middle school, high school, and professional exams.
Polish EMNIST Extension
(2020) Extension of the EMNIST dataset with Polish handwritten characters including diacritics (ą, ć, ę, ł, ń, ó, ś, ź, ż). Tests recognition of Polish-specific characters.
CodeSOTA Polish
(2025) 1,000 synthetic and real Polish text images with 5 degradation levels (clean to severe). Tests character-level OCR on diacritics with contamination-resistant synthetic categories. Categories: synth_random (pure character recognition), synth_words (Markov-generated words), real_corpus (Pan Tadeusz, official documents), wikipedia (potential contamination baseline).