Convert PDFs to Clean Markdown or JSON

Extract text, tables, and formulas from PDFs locally. No cloud APIs, works offline. Open-source Python library from IBM Research.

Open Source | 256M Parameters | Apache 2.0 | IBM Research

Quick Install

pip install docling

Python 3.9-3.14 | macOS, Linux, Windows | Apache 2.0 License

When to Use Docling

1. Processing research papers or technical documents

Extract tables, equations (LaTeX), and structured content while preserving formatting. Works offline, no API costs.

2. Building RAG systems or document search

Convert PDFs to clean Markdown for embeddings. Preserves document structure (headings, lists) better than plain OCR.

3. Handling sensitive documents

Runs entirely on your machine, so no data is sent to cloud APIs. This makes GDPR or HIPAA requirements easier to meet, although compliance still depends on how you store and handle the documents.

4. Batch processing thousands of documents

Processes a page in roughly 0.35 s on a GPU or 2-3 s on a CPU. No rate limits or API quotas to worry about.
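To get a feel for what the per-page figures mean for a large batch, the arithmetic can be sketched as below. The function name `estimate_batch_time` is hypothetical, and the rates are simply the rough numbers quoted above; real throughput will vary with hardware, page complexity, and whether OCR is needed.

```python
from datetime import timedelta

# Rough per-page rates quoted above (seconds per page). These are ballpark
# assumptions taken from the text, not measured guarantees.
GPU_SEC_PER_PAGE = 0.35
CPU_SEC_PER_PAGE_RANGE = (2.0, 3.0)


def estimate_batch_time(pages: int, sec_per_page: float) -> timedelta:
    """Return the estimated wall-clock time for a sequential batch."""
    return timedelta(seconds=pages * sec_per_page)


pages = 10_000
gpu = estimate_batch_time(pages, GPU_SEC_PER_PAGE)
cpu_low = estimate_batch_time(pages, CPU_SEC_PER_PAGE_RANGE[0])
cpu_high = estimate_batch_time(pages, CPU_SEC_PER_PAGE_RANGE[1])

print(f"GPU: about {gpu}")
print(f"CPU: about {cpu_low} to {cpu_high}")
```

Under these assumptions, 10,000 pages take a little under an hour on a GPU and somewhere between about 5.5 and 8.5 hours on a CPU when processed one page at a time.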

Minimal Example

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Export to Markdown
print(result.document.export_to_markdown())

# Or a JSON-serializable dict, or HTML
result.document.export_to_dict()
result.document.export_to_html()
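For the RAG use case, the Markdown export is usually split into chunks before embedding. The following is a minimal sketch, assuming the exported Markdown uses `#`-style headings; `split_markdown_by_heading` is a hypothetical helper written for illustration and is not part of Docling, which ships its own chunkers that may be a better fit.

```python
import re
from typing import List, Tuple

# Matches ATX headings such as "# Title" or "### Subsection".
# Caveat: a "#" comment line inside a fenced code block would also match.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) pairs, one pair per heading."""
    sections: List[Tuple[str, str]] = []
    title, body = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section before starting a new one.
            if title or any(part.strip() for part in body):
                sections.append((title, "\n".join(body).strip()))
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    # Flush whatever is left after the last heading.
    if title or any(part.strip() for part in body):
        sections.append((title, "\n".join(body).strip()))
    return sections


sample = "# Intro\nDocling converts PDFs.\n\n## Tables\nTables are preserved.\n"
for heading, text in split_markdown_by_heading(sample):
    print(f"{heading!r}: {text!r}")
```

In practice the input would be the string returned by `result.document.export_to_markdown()`, and each (heading, body) pair would be passed to the embedding model of your choice.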

Choose the Right Tool

| Your Situation | Docling | Tesseract | AWS Textract | GPT-4o Vision |
|---|---|---|---|---|
| Extract tables to CSV/Excel | Built-in | Manual parsing | Built-in | Via prompt |
| Convert math formulas | LaTeX export | Not supported | Not supported | Via prompt |
| Process 10,000 pages | Free, local | Free, local | $15,000 cost | $100+ cost |
| Sensitive/confidential docs | Offline | Offline | Cloud upload | Cloud upload |
| No internet access | Works offline | Works offline | Requires internet | Requires internet |
| Processing speed needed | Fast (0.35s/page) | Slow (2-5s/page) | Medium + latency | Slow + latency |
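As an illustration of the table-extraction row, the sketch below writes each detected table to its own CSV file. It assumes `pandas` is installed alongside Docling and that `result.document.tables` and `export_to_dataframe()` behave as in recent Docling releases; the helper names `csv_name` and `export_tables_to_csv` and the file-naming scheme are hypothetical.

```python
from pathlib import Path
from typing import List


def csv_name(stem: str, index: int) -> str:
    """Build an output name such as 'report-table-1.csv' (hypothetical scheme)."""
    return f"{stem}-table-{index + 1}.csv"


def export_tables_to_csv(pdf_path: str, out_dir: str = "tables") -> List[Path]:
    """Convert one PDF and write each detected table to a separate CSV file."""
    # Imported lazily so this module still loads where docling is not installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(pdf_path)
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)

    written: List[Path] = []
    for index, table in enumerate(result.document.tables):
        # Assumption: export_to_dataframe() returns a pandas DataFrame. Newer
        # releases may prefer the document to be passed in as an argument.
        frame = table.export_to_dataframe()
        path = target / csv_name(Path(pdf_path).stem, index)
        frame.to_csv(path, index=False)
        written.append(path)
    return written
```

Calling `export_tables_to_csv("document.pdf")` would then return the list of CSV paths it wrote; for Excel output, `frame.to_excel(...)` could be swapped in, given a suitable pandas Excel backend.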

Supported Formats

PDF: Native + scanned
DOCX: Word documents
PPTX: PowerPoint
XLSX: Excel
HTML: Web pages
Images: PNG, JPG, TIFF
Audio: WAV, MP3 (ASR)
VTT: Subtitles
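When pointing a converter at a mixed folder, it can help to filter out files it is unlikely to handle first. A small sketch under stated assumptions: the extension set below is derived only from the format list above, and `filter_supported` is a hypothetical helper, not a Docling function. Docling itself may accept more extensions than are listed here.

```python
from pathlib import Path
from typing import Iterable, List

# Extensions derived from the format list above (an assumption for this sketch;
# the library's actual accepted set may be broader).
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".html",
    ".png", ".jpg", ".tiff",
    ".wav", ".mp3", ".vtt",
}


def filter_supported(paths: Iterable[str]) -> List[str]:
    """Keep only paths whose extension (case-insensitive) is in the set above."""
    return [p for p in paths if Path(p).suffix.lower() in SUPPORTED_EXTENSIONS]


candidates = ["report.PDF", "slides.pptx", "notes.txt", "scan.tiff", "archive.zip"]
print(filter_supported(candidates))
```

The filtered list can then be passed to the converter one file at a time, as in the minimal example.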

Resources