
Convert PDFs to Clean Markdown or JSON

Extract text, tables, and formulas from PDFs locally. No cloud APIs, works offline. Open-source Python library from IBM Research.

Open Source | 256M Parameters | Apache 2.0 | IBM Research

Quick Install

pip install docling

Python 3.9-3.14 | macOS, Linux, Windows | Apache 2.0 License

Documentation

Stop picking the wrong OCR model

Monthly OCR benchmark update — new models, price changes, accuracy deltas. Free.

When to Use Docling

1. Processing research papers or technical documents

Extract tables, equations (LaTeX), and structured content while preserving formatting. Works offline, no API costs.

2. Building RAG systems or document search

Convert PDFs to clean Markdown for embeddings. Preserves document structure (headings, lists) better than plain OCR.

3. Handling sensitive documents

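Runs entirely on your machine, so no data is sent to cloud APIs. Local processing makes GDPR/HIPAA requirements easier to meet, though compliance still depends on how you store and handle the documents.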

4. Batch processing thousands of documents

Processes a page in ~0.35s on GPU or ~2-3s on CPU. No rate limits or API quotas to worry about.
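To put the per-page figures above in perspective, here is a rough back-of-the-envelope sketch. It assumes pages are processed one after another at the quoted latencies; the function name `estimate_hours` is made up for illustration, and real throughput will vary with hardware and document complexity.

```python
# Per-page latencies quoted in the text above (approximate).
GPU_SECONDS_PER_PAGE = 0.35
CPU_SECONDS_PER_PAGE_RANGE = (2.0, 3.0)


def estimate_hours(pages: int, seconds_per_page: float) -> float:
    """Estimate wall-clock hours to process `pages` sequentially."""
    if pages < 0 or seconds_per_page < 0:
        raise ValueError("pages and seconds_per_page must be non-negative")
    return pages * seconds_per_page / 3600


if __name__ == "__main__":
    pages = 10_000
    gpu_hours = estimate_hours(pages, GPU_SECONDS_PER_PAGE)
    cpu_low, cpu_high = (estimate_hours(pages, s) for s in CPU_SECONDS_PER_PAGE_RANGE)
    print(f"GPU: ~{gpu_hours:.1f} h for {pages} pages")
    print(f"CPU: ~{cpu_low:.1f}-{cpu_high:.1f} h for {pages} pages")
```

At these rates, 10,000 pages works out to roughly one hour on GPU and somewhere between five and a half and eight and a half hours on CPU.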

Minimal Example

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Export to Markdown
print(result.document.export_to_markdown())

# Or JSON, HTML, plain text
result.document.export_to_dict()
result.document.export_to_html()
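For the RAG use case, the Markdown export can be split into heading-scoped chunks before embedding. The sketch below is plain Python and does not use any Docling API; the helper name `chunk_markdown_by_heading` is hypothetical, and it assumes the exported Markdown uses ATX-style `#` headings.

```python
import re
from typing import List, Tuple

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown_by_heading(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) pairs, one pair per heading.

    Text before the first heading is returned with an empty heading.
    Lines inside fenced code blocks are never treated as headings.
    """
    chunks: List[Tuple[str, str]] = []
    heading = ""
    body: List[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if heading or text:
            chunks.append((heading, text))

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the previous chunk before starting a new one
            heading = match.group(2).strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks
```

With Docling installed, the input would be the string returned by `result.document.export_to_markdown()`. Each `(heading, body)` pair can then be embedded separately, keeping the heading as metadata.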

Choose the Right Tool

| Your Situation | Docling | Tesseract | AWS Textract | GPT-5.4 Vision |
| --- | --- | --- | --- | --- |
| Extract tables to CSV/Excel | Built-in | Manual parsing | Built-in | Via prompt |
| Convert math formulas | LaTeX export | Not supported | Not supported | Via prompt |
| Process 10,000 pages | Free, local | Free, local | $15,000 cost | $100+ cost |
| Sensitive/confidential docs | Offline | Offline | Cloud upload | Cloud upload |
| No internet access | Works offline | Works offline | Requires internet | Requires internet |
| Processing speed needed | Fast (0.35s/page) | Slow (2-5s/page) | Medium + latency | Slow + latency |

Supported Formats

PDF: Native + scanned
DOCX: Word documents
PPTX: PowerPoint
XLSX: Excel
HTML: Web pages
Images: PNG, JPG, TIFF
Audio: WAV, MP3 (ASR)
VTT: Subtitles
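When batch-processing a mixed folder, it can help to filter for the formats listed above before handing files to the converter. This is a minimal sketch: the extension map is derived from the list on this page and is not Docling's own format registry, and `supported_files` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Iterable, List

# Illustrative mapping built from the format list above (an assumption,
# not an official Docling table). Extend it if your files use other suffixes.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".docx": "DOCX",
    ".pptx": "PPTX",
    ".xlsx": "XLSX",
    ".html": "HTML",
    ".png": "Image",
    ".jpg": "Image",
    ".tiff": "Image",
    ".wav": "Audio",
    ".mp3": "Audio",
    ".vtt": "VTT",
}


def supported_files(paths: Iterable[str]) -> List[Path]:
    """Keep only the paths whose extension appears in SUPPORTED_EXTENSIONS."""
    return [
        Path(p) for p in paths
        if Path(p).suffix.lower() in SUPPORTED_EXTENSIONS
    ]
```

The extension check is case-insensitive, so `report.PDF` is kept; files with unlisted suffixes are skipped rather than raising an error.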

Resources



Related Reading