
Docling Tutorial: Convert PDF to Markdown

Learn to extract structured text, tables, and formulas from PDF documents using IBM Docling.

Time: 30 minutes | Level: Beginner | Prerequisites: Python 3.9+

Verified Tutorial

All code in this tutorial was executed on December 17, 2025. Outputs shown are real results from processing the Docling paper (arxiv:2408.09869).

What You'll Learn

  1. Install Docling and set up your environment
  2. Convert a PDF to Markdown with 3 lines of code
  3. Extract tables as structured data (CSV/DataFrame)
  4. Use the VLM pipeline for complex documents
  5. Batch process multiple documents

1 Installation

Create a new project directory and install Docling:

# Create project with uv (recommended)
uv init docling-tutorial
cd docling-tutorial
uv add docling pandas

# Or use pip
python -m venv .venv
source .venv/bin/activate
pip install docling pandas

Docling works on macOS, Linux, and Windows. Python 3.9 through 3.12 is supported. The first run downloads about 1GB of model weights.
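Before moving on, it can help to confirm that the interpreter meets the minimum version and that Docling is importable. The following is a minimal sketch using only the standard library; the function name `check_environment` is hypothetical, and it checks only the lower Python bound stated above.

```python
import sys
from importlib import metadata, util


def check_environment(min_python=(3, 9)):
    """Return a small report on the interpreter and the Docling install."""
    report = {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "python_ok": sys.version_info[:2] >= min_python,
        "docling_installed": util.find_spec("docling") is not None,
        "docling_version": None,
    }
    if report["docling_installed"]:
        try:
            report["docling_version"] = metadata.version("docling")
        except metadata.PackageNotFoundError:
            # Importable but without distribution metadata (e.g. a source checkout).
            pass
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `docling_installed` is False, re-run the install step inside the activated environment.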

2 Your First Conversion

Convert a PDF to Markdown with three lines of code:

from docling.document_converter import DocumentConverter

# Create converter instance
converter = DocumentConverter()

# Convert a PDF file
result = converter.convert("document.pdf")

# Export to Markdown
markdown = result.document.export_to_markdown()
print(markdown)

Console output:

2025-12-17 00:10:30 - INFO - Initializing pipeline for StandardPdfPipeline
2025-12-17 00:10:40 - INFO - Auto OCR model selected ocrmac.
2025-12-17 00:10:40 - INFO - Accelerator device: 'mps'
2025-12-17 00:10:57 - INFO - Processing document sample.pdf
2025-12-17 00:11:04 - INFO - Finished converting document sample.pdf in 34.95 sec.

Actual output (truncated):

## Docling Technical Report

## Version 1.0

Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi ...

AI4K Group, IBM Research Ruschlikon, Switzerland

## Abstract

This technical report introduces Docling, an easy to use,
self-contained, MIT-licensed open-source package for PDF
document conversion. It is powered by state-of-the-art
specialized AI models for layout analysis (DocLayNet) and
table structure recognition (TableFormer)...

## 1 Introduction

Converting PDF documents back into a machine-processable
format has been a major challenge for decades...

Full output: 33,201 characters from a 10-page PDF in 34.95 seconds

What happens behind the scenes:

  1. Layout analysis identifies text blocks, headers, tables, and figures
  2. Reading order is determined from spatial relationships
  3. Tables are parsed using the TableFormer model
  4. OCR is auto-selected (ocrmac on macOS, rapidocr elsewhere)
  5. Content is exported in the requested format
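One way to see what the layout analysis produced is to count the items in the converted document by label. This is a rough sketch: the helper name `count_item_labels` is hypothetical, and it assumes the document object exposes `iterate_items()` yielding `(item, level)` pairs whose items carry a `label` attribute.

```python
from collections import Counter


def count_item_labels(document):
    """Count document items per layout label (for example text, table, picture)."""
    counts = Counter()
    # Assumption: iterate_items() walks the document in reading order and
    # yields (item, nesting_level) pairs.
    for item, _level in document.iterate_items():
        label = getattr(item, "label", None)
        # Enum labels expose .value; plain strings are used as they are.
        counts[str(getattr(label, "value", label))] += 1
    return counts
```

With the conversion from this section, `count_item_labels(result.document).most_common()` would list the labels from most to least frequent.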

3 Export Formats

Docling supports multiple output formats:

# Save conversion result to file
with open("output.md", "w", encoding="utf-8") as f:
    f.write(result.document.export_to_markdown())

# Also export to other formats
html = result.document.export_to_html()      # 37,222 chars
text = result.document.export_to_text()      # 33,114 chars
data = result.document.export_to_dict()      # JSON dict

Output sizes from the test run:

  • Markdown: 33,201 chars
  • HTML: 37,222 chars
  • Text: 33,114 chars
  • JSON: 522,564 chars
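To keep all four exports on disk, the calls above can be wrapped in one helper. This is a sketch under stated assumptions: `save_exports` is a hypothetical name, and it assumes the dict returned by `export_to_dict()` is JSON-serializable.

```python
import json
from pathlib import Path


def save_exports(document, out_dir, stem="output"):
    """Write Markdown, HTML, text, and JSON exports; return {path: char count}."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    exports = {
        f"{stem}.md": document.export_to_markdown(),
        f"{stem}.html": document.export_to_html(),
        f"{stem}.txt": document.export_to_text(),
        # export_to_dict() returns a dict, so serialize it explicitly.
        f"{stem}.json": json.dumps(
            document.export_to_dict(), ensure_ascii=False, indent=2
        ),
    }

    sizes = {}
    for name, content in exports.items():
        path = out_dir / name
        # Explicit UTF-8 avoids platform-dependent default encodings.
        path.write_text(content, encoding="utf-8")
        sizes[str(path)] = len(content)
    return sizes
```

Calling `save_exports(result.document, "exports")` returns the character count per file, which makes it easy to compare format sizes for your own documents.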

4 Extract Tables

Docling excels at table extraction. Access tables as structured data:

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Access tables
tables = list(result.document.tables)
print(f"Found {len(tables)} tables")

for i, table in enumerate(tables):
    # Export as Markdown
    print(f"\n--- Table {i+1} ---")
    print(table.export_to_markdown())

    # Export as pandas DataFrame
    df = table.export_to_dataframe()
    df.to_csv(f"table_{i+1}.csv", index=False)

Actual output:

Found 3 tables

--- Table 1 ---
| CPU | Thread budget | native backend TTS | ... |
|-----|---------------|-------------------|-----|
| Apple M3 Max (16 cores) | 4 16 | 177s 167s | ... |
| Intel Xeon E5-2690 | 4 16 | 375s 244s | ... |

Exported CSV (table_1.csv):

CPU,Thread budget,native backend.TTS,native backend.Pages/s,...
Apple M3 Max (16 cores),4 16,177 s 167 s,1.27 1.34,...
Intel(R) Xeon E5-2690,4 16,375 s 244 s,0.60 0.92,...

Tables found in test document:

  • Table 1: Performance benchmarks - 2 rows x 8 columns
  • Table 2: DocLayNet metrics - 1 row x 5 columns
  • Table 3: References (partial) - structure preserved
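In the exported CSV above, some cells hold two measurements merged into one string (for example "177 s 167 s" for the 4-thread and 16-thread runs). A small post-processing step can pull the numbers apart. This sketch is not part of Docling; `split_merged_numbers` is a hypothetical helper and should be applied only to columns known to be numeric, since it would also pick up digits in labels such as CPU names.

```python
import re

# Matches integers and decimals such as 177 or 1.27.
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def split_merged_numbers(cell):
    """Return every number found in a cell such as '177 s 167 s' as floats."""
    return [float(match) for match in _NUMBER.findall(str(cell))]


print(split_merged_numbers("177 s 167 s"))  # [177.0, 167.0]
print(split_merged_numbers("1.27 1.34"))    # [1.27, 1.34]
```

With pandas, the same function can be applied per column, for example `df["Thread budget"].map(split_merged_numbers)`.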

5 VLM Pipeline (Advanced)

For complex documents with unusual layouts, formulas, or handwriting, use the Vision Language Model (VLM) pipeline.

Install VLM support:

uv add "docling[vlm]"
# or: pip install "docling[vlm]"

Use VLM for conversion:

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.pipeline.vlm_pipeline import VlmPipeline
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel import vlm_model_specs

# Configure VLM pipeline
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITE_VISION_TRANSFORMERS  # CPU/CUDA
    # vlm_options=vlm_model_specs.GRANITEDOCLING_MLX        # Apple Silicon
    # vlm_options=vlm_model_specs.GRANITEDOCLING_VLLM       # NVIDIA GPU
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)

result = converter.convert("complex_document.pdf")
print(result.document.export_to_markdown())

Note:

The first run downloads the weights for the selected model (~500MB for Granite-Docling; other models differ in size). This only happens once per model.
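The three model specs listed in the configuration target different hardware, so the choice can be made at runtime. The sketch below only picks a spec name as a string; `pick_vlm_spec_name` is a hypothetical helper, the mapping follows the comments in the configuration above, and vLLM is left as an explicit opt-in because detecting an NVIDIA GPU would need extra dependencies.

```python
import platform


def pick_vlm_spec_name(system=None, machine=None, use_vllm=False):
    """Pick a vlm_model_specs attribute name for the current platform."""
    system = system or platform.system()
    machine = machine or platform.machine()
    if system == "Darwin" and machine == "arm64":
        return "GRANITEDOCLING_MLX"       # Apple Silicon
    if use_vllm:
        return "GRANITEDOCLING_VLLM"      # NVIDIA GPU, opt-in
    return "GRANITE_VISION_TRANSFORMERS"  # CPU/CUDA default


# Usage with Docling installed:
# from docling.datamodel import vlm_model_specs
# vlm_options = getattr(vlm_model_specs, pick_vlm_spec_name())
```

Keeping the selection separate from the converter setup makes it easy to override the choice on machines where auto-detection picks the wrong backend.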

6 Batch Processing

Process multiple PDFs in a directory:

from pathlib import Path
from docling.document_converter import DocumentConverter
from tqdm import tqdm

# Find all PDFs
pdf_files = list(Path(".").glob("*.pdf"))
print(f"Found {len(pdf_files)} PDF files")

# Create output directory
output_dir = Path("converted")
output_dir.mkdir(exist_ok=True)

# Convert each file
converter = DocumentConverter()

for pdf_path in tqdm(pdf_files, desc="Converting"):
    result = converter.convert(str(pdf_path))

    # Save as Markdown
    output_path = output_dir / f"{pdf_path.stem}.md"
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(result.document.export_to_markdown())

print(f"Done! Files saved to {output_dir}/")
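The loop above stops at the first PDF that fails to convert. For larger batches, one option is to record failures and keep going. This is a minimal sketch, not Docling's own batch API: `convert_batch` and its `to_markdown` callback are hypothetical names, and the callback is where the actual Docling call would go.

```python
from pathlib import Path


def convert_batch(pdf_files, output_dir, to_markdown):
    """Convert each PDF, keep going on errors, and return (written, failures)."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    written, failures = [], {}
    for pdf_path in map(Path, pdf_files):
        try:
            markdown = to_markdown(pdf_path)
        except Exception as exc:  # broad on purpose: one bad file must not stop the batch
            failures[pdf_path.name] = repr(exc)
            continue
        output_path = output_dir / f"{pdf_path.stem}.md"
        output_path.write_text(markdown, encoding="utf-8")
        written.append(output_path)
    return written, failures


# Usage with Docling installed:
# converter = DocumentConverter()
# written, failures = convert_batch(
#     Path(".").glob("*.pdf"),
#     "converted",
#     lambda p: converter.convert(str(p)).document.export_to_markdown(),
# )
```

After the run, `failures` maps each failed file name to its error, so problem documents can be retried or inspected separately.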

Performance (from our test run)

  • Conversion time: 34.95s
  • Pages processed: 10
  • Tables extracted: 3
  • Markdown output: 33KB

Test: Docling paper (arxiv:2408.09869) on Apple Silicon with MPS acceleration
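The figures above translate to roughly 0.29 pages per second, or about 3.5 seconds per page. The short sketch below reproduces that arithmetic so you can plug in numbers from your own runs; the function name `throughput` is hypothetical, and the inputs are the values reported in this tutorial.

```python
def throughput(pages, seconds, chars):
    """Derive simple per-page rates from one conversion run."""
    return {
        "pages_per_second": pages / seconds,
        "seconds_per_page": seconds / pages,
        "chars_per_page": chars / pages,
    }


# Values from the test run: 10 pages, 34.95 s, 33,201 Markdown characters.
stats = throughput(pages=10, seconds=34.95, chars=33_201)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

The timing includes pipeline initialization, so per-page rates for longer documents or later files in a batch may differ.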

What You Learned

  • Installed Docling and set up a Python environment
  • Converted a PDF to Markdown using the basic pipeline
  • Extracted tables as structured data and CSV files
  • Used the VLM pipeline for higher accuracy
  • Batch processed multiple documents

Next Steps

Download Tutorial Artifacts

These are the actual files generated during our test run: