# Docling How-To Guides

Step-by-step solutions for specific tasks: OCR configuration, table extraction, RAG integration, and performance optimization.
## How to Configure OCR Engines

Docling supports multiple OCR backends. Choose one based on your needs:
### EasyOCR (Multi-language)

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, EasyOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = EasyOcrOptions(
    lang=["en", "fr", "de"],  # Languages to detect
    use_gpu=True,
    confidence_threshold=0.5,
)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
```

Install: `pip install "docling[easyocr]"`
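Once configured, conversion works the same as with the default pipeline; a minimal usage sketch (the input file name is illustrative):

```python
# Convert a document with the EasyOCR-backed pipeline configured above
result = converter.convert("multilingual_scan.pdf")  # hypothetical input file
print(result.document.export_to_markdown())
```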
### Tesseract (System OCR)

```python
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = TesseractOcrOptions()

# Requires a system Tesseract installation:
#   macOS:  brew install tesseract
#   Ubuntu: apt-get install tesseract-ocr
```

Install: `pip install "docling[tesserocr]"`
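If the `tesserocr` bindings are difficult to build on your platform, docling also ships a CLI-based variant that shells out to the `tesseract` binary instead; a minimal sketch:

```python
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions

# Uses the tesseract executable on PATH instead of the tesserocr bindings
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = TesseractCliOcrOptions()
```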
### RapidOCR (Default, ONNX-based)

```python
import os

from huggingface_hub import snapshot_download

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Download custom models from Hugging Face
download_path = snapshot_download(repo_id="SWHL/RapidOCR")

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = RapidOcrOptions(
    det_model_path=os.path.join(download_path, "PP-OCRv4", "en_PP-OCRv3_det_infer.onnx"),
    rec_model_path=os.path.join(download_path, "PP-OCRv4", "ch_PP-OCRv4_rec_server_infer.onnx"),
)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
```
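On macOS, docling additionally offers a native engine backed by Apple's Vision framework; a minimal sketch, assuming the `ocrmac` package is available (macOS only):

```python
from docling.datamodel.pipeline_options import PdfPipelineOptions, OcrMacOptions

# Uses Apple's Vision framework via the ocrmac package (macOS only)
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = OcrMacOptions()
```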
## How to Process Scanned PDFs

Force full-page OCR for scanned documents where text extraction fails:
```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, EasyOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True

# Force OCR on every page (for scanned documents)
pipeline_options.ocr_options = EasyOcrOptions(force_full_page_ocr=True)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

result = converter.convert("scanned_document.pdf")
print(result.document.export_to_markdown())
```
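When running many scanned files in bulk, it is worth checking the conversion status before exporting; a small sketch using docling's `ConversionStatus`:

```python
from docling.datamodel.base_models import ConversionStatus

# Avoid raising on bad inputs; inspect the status instead
result = converter.convert("scanned_document.pdf", raises_on_error=False)
if result.status == ConversionStatus.SUCCESS:
    print(result.document.export_to_markdown())
else:
    print(f"Conversion failed with status: {result.status}")
```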
## How to Extract Tables to CSV/Excel

```python
import pandas as pd

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")

# Extract all tables
for i, table in enumerate(result.document.tables):
    # To DataFrame
    df = table.export_to_dataframe()

    # Save as CSV
    df.to_csv(f"table_{i}.csv", index=False)

    # Save as Excel (requires openpyxl)
    df.to_excel(f"table_{i}.xlsx", index=False)

    # Or get as Markdown
    print(table.export_to_markdown())
```
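To collect every table into a single workbook rather than one file each, plain pandas is enough; a sketch (the sheet naming is illustrative):

```python
import pandas as pd

# Write all extracted tables into one Excel workbook, one sheet per table
with pd.ExcelWriter("tables.xlsx") as writer:
    for i, table in enumerate(result.document.tables):
        df = table.export_to_dataframe()
        df.to_excel(writer, sheet_name=f"table_{i}", index=False)
```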
## How to Extract Invoice Data

Convert the invoice to Markdown, then use an LLM to extract structured fields:
```python
import json

from openai import OpenAI

from docling.document_converter import DocumentConverter

# Step 1: Convert the PDF to Markdown
converter = DocumentConverter()
result = converter.convert("invoice.pdf")
markdown = result.document.export_to_markdown()

# Step 2: Use an LLM to extract structured data
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-5",  # or your preferred model
    messages=[{
        "role": "user",
        # Double braces render as literal braces inside the f-string
        "content": f"""Extract invoice data as JSON:
{{"invoice_number": str, "date": str, "vendor": str, "total": float,
 "line_items": [{{"description": str, "qty": int, "price": float}}]}}

Document:
{markdown}""",
    }],
    response_format={"type": "json_object"},
)

invoice_data = json.loads(response.choices[0].message.content)
print(json.dumps(invoice_data, indent=2))
```
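LLM output is not guaranteed to match the requested schema, so validating the parsed JSON is a good habit; a sketch using pydantic (the model names are illustrative, not part of docling):

```python
from pydantic import BaseModel, ValidationError

class LineItem(BaseModel):
    description: str
    qty: int
    price: float

class Invoice(BaseModel):
    invoice_number: str
    date: str
    vendor: str
    total: float
    line_items: list[LineItem]

try:
    invoice = Invoice.model_validate(invoice_data)
    print(f"Invoice {invoice.invoice_number}: total {invoice.total}")
except ValidationError as e:
    print(f"LLM output did not match the expected schema:\n{e}")
```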
## How to Enable GPU Acceleration

### NVIDIA CUDA
```bash
# Install with CUDA support
pip install "docling[cuda]"
```

```python
# Use vLLM for fast inference
from docling.datamodel import vlm_model_specs
from docling.datamodel.pipeline_options import VlmPipelineOptions

pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_VLLM
)
```
### Apple Silicon (M1/M2/M3)

```bash
# Install with MLX support
pip install "docling[vlm]" mlx
```

```python
# Use the MLX-optimized model
from docling.datamodel import vlm_model_specs
from docling.datamodel.pipeline_options import VlmPipelineOptions

pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_MLX
)
```
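In both cases, the options still need to be attached to a converter that runs docling's VLM pipeline; a minimal sketch following the VLM pipeline wiring used in docling's examples:

```python
from docling.datamodel import vlm_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

pipeline_options = VlmPipelineOptions(vlm_options=vlm_model_specs.GRANITEDOCLING_MLX)

# The VLM pipeline class must be selected explicitly alongside its options
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)
result = converter.convert("document.pdf")
```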
## How to Integrate with LangChain

```python
from langchain_docling import DoclingLoader

# Load documents using Docling
loader = DoclingLoader(file_path="document.pdf")
documents = loader.load()

# Use with LangChain
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
chunks = text_splitter.split_documents(documents)

# Create a vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)

# Query
results = vectorstore.similarity_search("What is the total revenue?")
```

Install: `pip install langchain-docling`
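To see why a chunk was retrieved, FAISS can return similarity scores alongside the documents:

```python
# Inspect retrieved chunks together with their similarity scores
for doc, score in vectorstore.similarity_search_with_score("What is the total revenue?", k=3):
    print(f"score={score:.3f}  {doc.page_content[:100]}...")
```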
## How to Integrate with LlamaIndex

```python
from llama_index.core import VectorStoreIndex
from llama_index.readers.docling import DoclingReader

# Load with the Docling reader
reader = DoclingReader()
documents = reader.load_data(file_path="document.pdf")

# Create an index
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the key findings")
print(response)
```

Install: `pip install llama-index-readers-docling`
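The query engine accepts the usual LlamaIndex parameters; for example, retrieving more chunks per query for broader context:

```python
# Retrieve more chunks per query (similarity_top_k is a standard LlamaIndex parameter)
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("Summarize the key findings")
print(response)
```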
## How to Handle Large Documents

For documents with hundreds of pages, configure the pipeline for memory efficiency (and see the page-range batching sketch below):
```python
import gc

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Configure for memory efficiency
pipeline_options = PdfPipelineOptions()
pipeline_options.images_scale = 0.5  # Reduce image resolution

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

# Process the large document
result = converter.convert("large_document.pdf")

# Write the output to disk instead of holding it in memory
with open("output.md", "w") as f:
    f.write(result.document.export_to_markdown())

# Clean up
del result
gc.collect()
```
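For batch processing proper, recent docling releases accept a page range in `convert()`; a sketch, assuming your installed version supports the `page_range` argument (check before relying on it), with an illustrative page count:

```python
# Convert a large PDF in 50-page batches (page_range availability depends on docling version)
batch_size = 50
total_pages = 500  # illustrative; determine this from your document

with open("output.md", "w") as f:
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        result = converter.convert("large_document.pdf", page_range=(start, end))
        f.write(result.document.export_to_markdown())
        del result
        gc.collect()
```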