Convert PDFs to Clean Markdown or JSON
Extract text, tables, and formulas from PDFs locally. No cloud APIs, works offline. Open-source Python library from IBM Research.
Quick Install
pip install docling
Python 3.9-3.14 | macOS, Linux, Windows | MIT License
Documentation
Tutorial
Learning-oriented. Take your first steps with Docling by converting a PDF to structured Markdown.
Start here →
How-To Guides
Problem-oriented. Solve specific tasks: extract tables, configure OCR engines, batch process documents.
Solve problems →
Reference
Information-oriented. API documentation, configuration options, export formats, model specifications.
Look it up →
Explanation
Understanding-oriented. How Docling works under the hood, architecture decisions, when to use what.
Understand →
When to Use Docling
Processing research papers or technical documents
Extract tables, equations (LaTeX), and structured content while preserving formatting. Works offline, no API costs.
Building RAG systems or document search
Convert PDFs to clean Markdown for embeddings. Preserves document structure (headings, lists) better than plain OCR.
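Because the exported Markdown keeps headings, one simple way to prepare it for embeddings is to cut it into one chunk per heading and carry the heading path along as context. The sketch below is an illustration only: `split_markdown_by_heading` is a hypothetical helper, not part of Docling, and it takes the Markdown as a plain string (for example, the value returned by `export_to_markdown()`). It does not special-case `#` lines inside fenced code blocks.

```python
import re

# Matches ATX-style Markdown headings such as "## Methods".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(markdown: str) -> list:
    """Return one chunk per heading: {"path": "A > B", "text": "..."}."""
    chunks = []
    path = []  # stack of (level, title) for the headings currently in scope
    body = []  # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": " > ".join(t for _, t in path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Leave any heading at the same or a deeper level before entering this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


# Hypothetical input standing in for converted output.
sample = "# Paper\nAbstract text.\n## Methods\nWe measured X.\n## Results\nX improved."
for chunk in split_markdown_by_heading(sample):
    print(chunk["path"], "|", chunk["text"])
```

Keeping the heading path next to each chunk gives the embedding model some context about where the passage sits in the document.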
Handling sensitive documents
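Runs entirely on your machine, with no data sent to cloud APIs. Local processing helps with GDPR and HIPAA requirements, though compliance still depends on how you deploy Docling and store its output.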
Batch processing thousands of documents
Processes a page in roughly 0.35s on GPU or 2-3s on CPU. No rate limits or API quotas to worry about.
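A batch run can be sketched as a loop over a folder that converts each PDF and writes the Markdown next to it, plus a back-of-the-envelope time estimate from the per-page figures above. This is a minimal sketch under assumptions: `convert_folder` and `estimate_hours` are hypothetical helpers, the folder names are placeholders, and the converter is passed in so that any object with a `convert()` method like the one in the Minimal Example works.

```python
from pathlib import Path


def convert_folder(converter, input_dir, output_dir):
    """Convert every *.pdf in input_dir to Markdown; return (done, failed)."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    done, failed = [], []
    for pdf in sorted(input_dir.glob("*.pdf")):
        try:
            result = converter.convert(str(pdf))
            markdown = result.document.export_to_markdown()
        except Exception as exc:
            # Record the failure and continue; one bad file should not stop the batch.
            failed.append((pdf.name, str(exc)))
            continue
        (output_dir / f"{pdf.stem}.md").write_text(markdown, encoding="utf-8")
        done.append(pdf.name)
    return done, failed


def estimate_hours(pages, seconds_per_page):
    """Rough wall-clock estimate for a sequential run."""
    return pages * seconds_per_page / 3600


# Usage, assuming docling is installed and ./pdfs exists (both are assumptions):
# from docling.document_converter import DocumentConverter
# done, failed = convert_folder(DocumentConverter(), "pdfs", "markdown")

print(f"10,000 pages on GPU: ~{estimate_hours(10_000, 0.35):.1f} h")
print(f"10,000 pages on CPU: ~{estimate_hours(10_000, 2.0):.1f} to {estimate_hours(10_000, 3.0):.1f} h")
```

Collecting failures instead of raising lets a long run finish and leaves a list of files to retry or inspect afterwards.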
Minimal Example
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
# Export to Markdown
print(result.document.export_to_markdown())
# Or JSON, HTML, plain text
result.document.export_to_dict()
result.document.export_to_html()
Choose the Right Tool
| Your Situation | Docling | Tesseract | AWS Textract | GPT-4o Vision |
|---|---|---|---|---|
| Extract tables to CSV/Excel | Built-in | Manual parsing | Built-in | Via prompt |
| Convert math formulas | LaTeX export | Not supported | Not supported | Via prompt |
| Process 10,000 pages | Free, local | Free, local | Per-page API fees | $100+ cost |
| Sensitive/confidential docs | Offline | Offline | Cloud upload | Cloud upload |
| No internet access | Works offline | Works offline | Requires internet | Requires internet |
| Processing speed | Fast (~0.35s/page on GPU, ~2-3s on CPU) | Slow (2-5s/page) | Medium + latency | Slow + latency |
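
The "Extract tables to CSV/Excel" row can be illustrated with a short sketch. It assumes the converted document exposes a `tables` list whose items offer `export_to_dataframe()` returning a pandas DataFrame, as in Docling's published table-export example; verify this against the version you have installed. `tables_to_csv` and the output file names are hypothetical.

```python
from pathlib import Path


def tables_to_csv(document, output_dir):
    """Write each table in a converted document to its own CSV file."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, table in enumerate(document.tables, start=1):
        # Assumption: each table item can export itself as a pandas DataFrame.
        frame = table.export_to_dataframe()
        path = output_dir / f"table-{index}.csv"
        frame.to_csv(path, index=False)
        paths.append(path)
    return paths


# Usage, assuming docling and pandas are installed (both are assumptions):
# from docling.document_converter import DocumentConverter
# result = DocumentConverter().convert("document.pdf")
# print(tables_to_csv(result.document, "tables"))
```

For Excel output, the same loop could call the DataFrame's `to_excel` method instead, which needs an Excel writer package such as openpyxl.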