Docling Tutorial: Convert PDF to Markdown
Learn to extract structured text, tables, and formulas from PDF documents using IBM Docling.
Verified Tutorial
All code in this tutorial was executed on December 17, 2025. Outputs shown are real results from processing the Docling paper (arXiv:2408.09869).
What You'll Learn
1. Install Docling and set up your environment
2. Convert a PDF to Markdown with 3 lines of code
3. Extract tables as structured data (CSV/DataFrame)
4. Use the VLM pipeline for complex documents
5. Batch process multiple documents
1 Installation
Create a new project directory and install Docling:
# Create project with uv (recommended)
uv init docling-tutorial
cd docling-tutorial
uv add docling pandas
# Or use pip
python -m venv .venv
source .venv/bin/activate
pip install docling pandas
Docling works on macOS, Linux, and Windows. Python 3.9-3.12 is supported. The first run downloads roughly 1 GB of model weights.
2 Your First Conversion
Convert a PDF to Markdown with three lines of code:
from docling.document_converter import DocumentConverter
# Create converter instance
converter = DocumentConverter()
# Convert a PDF file
result = converter.convert("document.pdf")
# Export to Markdown
markdown = result.document.export_to_markdown()
print(markdown)
Console output:
2025-12-17 00:10:30 - INFO - Initializing pipeline for StandardPdfPipeline
2025-12-17 00:10:40 - INFO - Auto OCR model selected ocrmac.
2025-12-17 00:10:40 - INFO - Accelerator device: 'mps'
2025-12-17 00:10:57 - INFO - Processing document sample.pdf
2025-12-17 00:11:04 - INFO - Finished converting document sample.pdf in 34.95 sec.
Actual output (truncated):
## Docling Technical Report
## Version 1.0
Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi ...
AI4K Group, IBM Research Ruschlikon, Switzerland
## Abstract
This technical report introduces Docling, an easy to use,
self-contained, MIT-licensed open-source package for PDF
document conversion. It is powered by state-of-the-art
specialized AI models for layout analysis (DocLayNet) and
table structure recognition (TableFormer)...
## 1 Introduction
Converting PDF documents back into a machine-processable
format has been a major challenge for decades...
Full output: 33,201 characters from the 10-page PDF in 34.95 seconds
What happens behind the scenes:
1. Layout analysis identifies text blocks, headers, tables, figures
2. Reading order is determined from spatial relationships
3. Tables are parsed using the TableFormer model
4. OCR is auto-selected (ocrmac on macOS, rapidocr elsewhere)
5. Content is exported in the requested format
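Because the exported Markdown is a plain string, you can sanity-check the structure the pipeline produced with ordinary Python. A minimal sketch; the sample string below is illustrative stand-in text, not real Docling output:

```python
# Count headings and table rows in an exported Markdown string.
# The sample text stands in for result.document.export_to_markdown().
markdown = """## Abstract
Some body text.
## 1 Introduction
More body text.
| CPU | Thread budget |
|-----|---------------|
| Apple M3 Max | 4 |
"""

lines = markdown.splitlines()
headings = [ln for ln in lines if ln.startswith("## ")]
table_rows = [ln for ln in lines if ln.startswith("|")]
print(f"{len(headings)} headings, {len(table_rows)} table lines")
```

The same pattern works on real output, e.g. to confirm that every section of a long report survived conversion.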
3 Export Formats
Docling supports multiple output formats:
# Save conversion result to file
with open("output.md", "w") as f:
    f.write(result.document.export_to_markdown())
# Also export to other formats
html = result.document.export_to_html() # 37,222 chars
text = result.document.export_to_text() # 33,114 chars
data = result.document.export_to_dict()  # JSON-serializable dict
4 Extract Tables
Docling excels at table extraction. Access tables as structured data:
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
# Access tables
tables = list(result.document.tables)
print(f"Found {len(tables)} tables")
for i, table in enumerate(tables):
    # Export as Markdown
    print(f"\n--- Table {i+1} ---")
    print(table.export_to_markdown())

    # Export as pandas DataFrame
    df = table.export_to_dataframe()
    df.to_csv(f"table_{i+1}.csv", index=False)
Actual output:
Found 3 tables
--- Table 1 ---
| CPU | Thread budget | native backend TTS | ... |
|-----|---------------|-------------------|-----|
| Apple M3 Max (16 cores) | 4 16 | 177s 167s | ... |
| Intel Xeon E5-2690 | 4 16 | 375s 244s | ... |
Exported CSV (table_1.csv):
CPU,Thread budget,native backend.TTS,native backend.Pages/s,...
Apple M3 Max (16 cores),4 16,177 s 167 s,1.27 1.34,...
Intel(R) Xeon E5-2690,4 16,375 s 244 s,0.60 0.92,...
Tables found in test document:
- Table 1: Performance benchmarks - 2 rows x 8 columns
- Table 2: DocLayNet metrics - 1 row x 5 columns
- Table 3: References (partial) - structure preserved
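Notice that the exported CSV packs both thread-budget runs into single cells ("4 16", "177 s 167 s"). A hedged pandas sketch that splits them into one row per run; the inline data mirrors the table_1.csv excerpt above, with the header simplified to "TTS":

```python
import io
import pandas as pd

# Inline copy of the exported CSV excerpt (header simplified).
csv_data = """CPU,Thread budget,TTS
Apple M3 Max (16 cores),4 16,177 s 167 s
Intel(R) Xeon E5-2690,4 16,375 s 244 s
"""

df = pd.read_csv(io.StringIO(csv_data))

# Split each "4 16" / "177 s 167 s" cell into one tidy row per run.
rows = []
for _, row in df.iterrows():
    budgets = str(row["Thread budget"]).split()
    times = str(row["TTS"]).replace(" s", "s").split()
    for budget, tts in zip(budgets, times):
        rows.append({"CPU": row["CPU"], "threads": int(budget), "tts": tts})

tidy = pd.DataFrame(rows)
print(tidy)
```

This tidy shape (one measurement per row) is usually easier to filter and plot than the packed cells Docling preserves from the PDF layout.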
5 VLM Pipeline (Advanced)
For complex documents with unusual layouts, formulas, or handwriting, use the Vision Language Model pipeline:
Install VLM support:
uv add "docling[vlm]"
# or: pip install "docling[vlm]"
Use VLM for conversion:
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.pipeline.vlm_pipeline import VlmPipeline
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel import vlm_model_specs
# Configure VLM pipeline
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITE_VISION_TRANSFORMERS  # CPU/CUDA
    # vlm_options=vlm_model_specs.GRANITEDOCLING_MLX         # Apple Silicon
    # vlm_options=vlm_model_specs.GRANITEDOCLING_VLLM        # NVIDIA GPU
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)
result = converter.convert("complex_document.pdf")
print(result.document.export_to_markdown())
Note:
First run downloads the Granite-Docling model (~500MB). This only happens once.
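Which spec to pick depends on your hardware. A small helper that encodes the platform logic from the comments above; the spec names are returned as plain strings here for illustration, and in real code you would look up the matching attribute on docling.datamodel.vlm_model_specs:

```python
import platform

# Map the current machine to one of the VLM spec names listed above.
def pick_vlm_spec_name() -> str:
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "GRANITEDOCLING_MLX"        # Apple Silicon via MLX
    return "GRANITE_VISION_TRANSFORMERS"   # portable CPU/CUDA default

print(pick_vlm_spec_name())
```

Centralizing this choice in one function keeps the converter setup identical across machines; only the spec lookup changes.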
6 Batch Processing
Process multiple PDFs in a directory:
from pathlib import Path
from docling.document_converter import DocumentConverter
from tqdm import tqdm  # progress bar; not installed above, add with: pip install tqdm
# Find all PDFs
pdf_files = list(Path(".").glob("*.pdf"))
print(f"Found {len(pdf_files)} PDF files")
# Create output directory
output_dir = Path("converted")
output_dir.mkdir(exist_ok=True)
# Convert each file
converter = DocumentConverter()
for pdf_path in tqdm(pdf_files, desc="Converting"):
    result = converter.convert(str(pdf_path))

    # Save as Markdown
    output_path = output_dir / f"{pdf_path.stem}.md"
    with open(output_path, "w") as f:
        f.write(result.document.export_to_markdown())
print(f"Done! Files saved to {output_dir}/")
Performance (from our test run)
Test: Docling paper (arXiv:2408.09869) on Apple Silicon with MPS acceleration
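One practical note on batch runs: the loop in section 6 reconverts every PDF each time it runs. A small pathlib helper (hypothetical, not part of Docling) can skip files whose Markdown output is already newer than the source; the demo below uses throwaway files instead of real PDFs:

```python
import tempfile
from pathlib import Path

# True if pdf_path has no Markdown output yet, or if the PDF is newer
# than its existing output file.
def needs_conversion(pdf_path: Path, output_dir: Path) -> bool:
    out = output_dir / f"{pdf_path.stem}.md"
    return not out.exists() or out.stat().st_mtime < pdf_path.stat().st_mtime

# Demonstrate with a throwaway file standing in for a real PDF.
tmp = Path(tempfile.mkdtemp())
pdf = tmp / "a.pdf"
pdf.write_bytes(b"%PDF-1.4")
out_dir = tmp / "converted"
out_dir.mkdir()
print(needs_conversion(pdf, out_dir))  # no output yet, so conversion is needed
```

Guarding the loop with this check makes re-runs cheap: only new or modified PDFs pay the full conversion cost.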
What You Learned
- Installed Docling and set up a Python environment
- Converted a PDF to Markdown using the basic pipeline
- Extracted tables as structured data and CSV files
- Used the VLM pipeline for higher accuracy
- Batch processed multiple documents
Next Steps
Download Tutorial Artifacts
These are the actual files generated during our test run: