Docling Tutorial: Convert PDF to Markdown
Learn to extract structured text, tables, and formulas from PDF documents using IBM Docling.
Verified Tutorial
All code in this tutorial was executed on December 17, 2025. Outputs shown are real results from processing the Docling paper (arXiv:2408.09869).
What You'll Learn
1. Install Docling and set up your environment
2. Convert a PDF to Markdown with 3 lines of code
3. Extract tables as structured data (CSV/DataFrame)
4. Use the VLM pipeline for complex documents
5. Batch process multiple documents
1 Installation
Create a new project directory and install Docling:
# Create project with uv (recommended)
uv init docling-tutorial
cd docling-tutorial
uv add docling pandas
# Or use pip
python -m venv .venv
source .venv/bin/activate
pip install docling pandas
Docling works on macOS, Linux, and Windows. Python 3.9-3.12 is supported. The first run downloads roughly 1 GB of model weights.
2 Your First Conversion
Convert a PDF to Markdown with three lines of code:
from docling.document_converter import DocumentConverter
# Create converter instance
converter = DocumentConverter()
# Convert a PDF file
result = converter.convert("document.pdf")
# Export to Markdown
markdown = result.document.export_to_markdown()
print(markdown)
Console output:
2025-12-17 00:10:30 - INFO - Initializing pipeline for StandardPdfPipeline
2025-12-17 00:10:40 - INFO - Auto OCR model selected ocrmac.
2025-12-17 00:10:40 - INFO - Accelerator device: 'mps'
2025-12-17 00:10:57 - INFO - Processing document sample.pdf
2025-12-17 00:11:04 - INFO - Finished converting document sample.pdf in 34.95 sec.
Actual output (truncated):
## Docling Technical Report
## Version 1.0
Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi ...
AI4K Group, IBM Research Ruschlikon, Switzerland
## Abstract
This technical report introduces Docling, an easy to use,
self-contained, MIT-licensed open-source package for PDF
document conversion. It is powered by state-of-the-art
specialized AI models for layout analysis (DocLayNet) and
table structure recognition (TableFormer)...
## 1 Introduction
Converting PDF documents back into a machine-processable
format has been a major challenge for decades...
Full output: 33,201 characters from the 10-page PDF in 34.95 seconds
What happens behind the scenes:
1. Layout analysis identifies text blocks, headers, tables, figures
2. Reading order is determined from spatial relationships
3. Tables are parsed using the TableFormer model
4. OCR is auto-selected (ocrmac on macOS, rapidocr elsewhere)
5. Content is exported in the requested format
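Because the exported Markdown is a plain string, you can sanity-check the structure the pipeline produced with ordinary Python. A minimal sketch; the sample string below is illustrative stand-in text, not real Docling output:

```python
# Count headings and table rows in an exported Markdown string.
# The sample text stands in for result.document.export_to_markdown().
markdown = """## Abstract
Some body text.
## 1 Introduction
More body text.
| CPU | Thread budget |
|-----|---------------|
| Apple M3 Max | 4 |
"""

lines = markdown.splitlines()
headings = [ln for ln in lines if ln.startswith("## ")]
table_rows = [ln for ln in lines if ln.startswith("|")]
print(f"{len(headings)} headings, {len(table_rows)} table lines")
```

The same pattern works on real output, e.g. to confirm that every section of a long report survived conversion.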
3 Export Formats
Docling supports multiple output formats:
# Save conversion result to file
with open("output.md", "w") as f:
    f.write(result.document.export_to_markdown())
# Also export to other formats
html = result.document.export_to_html() # 37,222 chars
text = result.document.export_to_text() # 33,114 chars
data = result.document.export_to_dict()  # JSON-serializable dict
4 Extract Tables
Docling excels at table extraction. Access tables as structured data:
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
# Access tables
tables = list(result.document.tables)
print(f"Found {len(tables)} tables")
for i, table in enumerate(tables):
    # Export as Markdown
    print(f"\n--- Table {i+1} ---")
    print(table.export_to_markdown())

    # Export as pandas DataFrame
    df = table.export_to_dataframe()
    df.to_csv(f"table_{i+1}.csv", index=False)
Actual output:
Found 3 tables
--- Table 1 ---
| CPU | Thread budget | native backend TTS | ... |
|-----|---------------|-------------------|-----|
| Apple M3 Max (16 cores) | 4 16 | 177s 167s | ... |
| Intel Xeon E5-2690 | 4 16 | 375s 244s | ... |
Exported CSV (table_1.csv):
CPU,Thread budget,native backend.TTS,native backend.Pages/s,...
Apple M3 Max (16 cores),4 16,177 s 167 s,1.27 1.34,...
Intel(R) Xeon E5-2690,4 16,375 s 244 s,0.60 0.92,...
Tables found in test document:
- Table 1: Performance benchmarks - 2 rows x 8 columns
- Table 2: DocLayNet metrics - 1 row x 5 columns
- Table 3: References (partial) - structure preserved
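Notice that the exported CSV packs both thread-budget runs into single cells ("4 16", "177 s 167 s"). A hedged pandas sketch that splits them into one row per run; the inline data mirrors the table_1.csv excerpt above, with the header simplified to "TTS":

```python
import io
import pandas as pd

# Inline copy of the exported CSV excerpt (header simplified).
csv_data = """CPU,Thread budget,TTS
Apple M3 Max (16 cores),4 16,177 s 167 s
Intel(R) Xeon E5-2690,4 16,375 s 244 s
"""

df = pd.read_csv(io.StringIO(csv_data))

# Split each "4 16" / "177 s 167 s" cell into one tidy row per run.
rows = []
for _, row in df.iterrows():
    budgets = str(row["Thread budget"]).split()
    times = str(row["TTS"]).replace(" s", "s").split()
    for budget, tts in zip(budgets, times):
        rows.append({"CPU": row["CPU"], "threads": int(budget), "tts": tts})

tidy = pd.DataFrame(rows)
print(tidy)
```

This tidy shape (one measurement per row) is usually easier to filter and plot than the packed cells Docling preserves from the PDF layout.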
5 VLM Pipeline (Advanced)
For complex documents with unusual layouts, formulas, or handwriting, use the Vision Language Model pipeline:
Install VLM support:
uv add "docling[vlm]"
# or: pip install "docling[vlm]"
Use VLM for conversion:
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.pipeline.vlm_pipeline import VlmPipeline
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel import vlm_model_specs
# Configure VLM pipeline
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITE_VISION_TRANSFORMERS  # CPU/CUDA
    # vlm_options=vlm_model_specs.GRANITEDOCLING_MLX         # Apple Silicon
    # vlm_options=vlm_model_specs.GRANITEDOCLING_VLLM        # NVIDIA GPU
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)
result = converter.convert("complex_document.pdf")
print(result.document.export_to_markdown())
Note:
First run downloads the Granite-Docling model (~500MB). This only happens once.
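Which spec to pick depends on your hardware. A small helper that encodes the platform logic from the comments above; the spec names are returned as plain strings here for illustration, and in real code you would look up the matching attribute on docling.datamodel.vlm_model_specs:

```python
import platform

# Map the current machine to one of the VLM spec names listed above.
def pick_vlm_spec_name() -> str:
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "GRANITEDOCLING_MLX"        # Apple Silicon via MLX
    return "GRANITE_VISION_TRANSFORMERS"   # portable CPU/CUDA default

print(pick_vlm_spec_name())
```

Centralizing this choice in one function keeps the converter setup identical across machines; only the spec lookup changes.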
6 Batch Processing
Process multiple PDFs in a directory:
from pathlib import Path
from docling.document_converter import DocumentConverter
from tqdm import tqdm  # progress bar; not installed above, add with: pip install tqdm
# Find all PDFs
pdf_files = list(Path(".").glob("*.pdf"))
print(f"Found {len(pdf_files)} PDF files")
# Create output directory
output_dir = Path("converted")
output_dir.mkdir(exist_ok=True)
# Convert each file
converter = DocumentConverter()
for pdf_path in tqdm(pdf_files, desc="Converting"):
    result = converter.convert(str(pdf_path))

    # Save as Markdown
    output_path = output_dir / f"{pdf_path.stem}.md"
    with open(output_path, "w") as f:
        f.write(result.document.export_to_markdown())
print(f"Done! Files saved to {output_dir}/")
Performance (from our test run)
Test: Docling paper (arXiv:2408.09869) on Apple Silicon with MPS acceleration
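One practical note on batch runs: the loop in section 6 reconverts every PDF each time it runs. A small pathlib helper (hypothetical, not part of Docling) can skip files whose Markdown output is already newer than the source; the demo below uses throwaway files instead of real PDFs:

```python
import tempfile
from pathlib import Path

# True if pdf_path has no Markdown output yet, or if the PDF is newer
# than its existing output file.
def needs_conversion(pdf_path: Path, output_dir: Path) -> bool:
    out = output_dir / f"{pdf_path.stem}.md"
    return not out.exists() or out.stat().st_mtime < pdf_path.stat().st_mtime

# Demonstrate with a throwaway file standing in for a real PDF.
tmp = Path(tempfile.mkdtemp())
pdf = tmp / "a.pdf"
pdf.write_bytes(b"%PDF-1.4")
out_dir = tmp / "converted"
out_dir.mkdir()
print(needs_conversion(pdf, out_dir))  # no output yet, so conversion is needed
```

Guarding the loop with this check makes re-runs cheap: only new or modified PDFs pay the full conversion cost.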
What You Learned
- Installed Docling and set up a Python environment
- Converted a PDF to Markdown using the basic pipeline
- Extracted tables as structured data and CSV files
- Used the VLM pipeline for higher accuracy
- Batch processed multiple documents
Next Steps
Download Tutorial Artifacts
These are the actual files generated during our test run: