
Document Extraction and Parsing

Extract structured text from PDFs, scans, and complex layouts. The DOCUMENT modality is critical for enterprise RAG.

Why Document Parsing Matters

Most enterprise knowledge lives in documents: PDFs, scanned invoices, contracts, research papers. Before you can build RAG or search, you need to extract the text accurately.

Simple PDF text extraction often fails on complex layouts, tables, multi-column text, and scanned documents. Modern document parsers use computer vision and layout understanding to preserve structure.

The Challenge

A PDF is not just text. It contains:

  • Tables with rows and columns that need structure preservation
  • Multi-column layouts that naive extractors jumble together
  • Headers, footers, and page numbers to filter out
  • Figures, equations, and captions that need special handling
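
To see why, here is the naive baseline: plain text extraction with PyMuPDF. A minimal sketch (the filename is illustrative); on multi-column pages, get_text() often returns text in physical rather than reading order, interleaving columns and flattening tables:

# Sketch: naive page-by-page text extraction with PyMuPDF.
# On complex layouts this jumbles columns and loses table structure.
import fitz  # PyMuPDF

doc = fitz.open('document.pdf')  # illustrative filename
for page in doc:
    print(page.get_text())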

Enterprise Use Cases

Invoice Processing

Extract vendor names, line items, amounts, and dates from invoices. Automate accounts payable workflows.

Key fields: vendor, invoice number, date, line items, total amount
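
A common pattern is to parse the invoice to text with one of the tools below, then have an LLM fill a fixed schema. A minimal sketch (the filename, model choice, and prompt wording are illustrative assumptions):

# Sketch: pull fixed invoice fields as JSON via an LLM.
# Assumes the invoice was already parsed to plain text.
import json
from openai import OpenAI

client = OpenAI()
invoice_text = open('invoice.txt').read()  # illustrative input

response = client.chat.completions.create(
    model='gpt-4o-mini',
    response_format={'type': 'json_object'},
    messages=[{
        'role': 'user',
        'content': 'Return JSON with keys vendor, invoice_number, date, '
                   'line_items, and total_amount for this invoice:\n'
                   + invoice_text
    }]
)
fields = json.loads(response.choices[0].message.content)
print(fields)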

Resume Parsing

Structure candidate resumes into searchable fields. Power ATS systems and candidate matching.

Key fields: name, experience, skills, education, contact info

Contract Analysis

Extract clauses, parties, dates, and obligations from legal documents. Enable contract review automation.

Key fields: parties, effective date, terms, obligations, signatures

Research Paper Ingestion

Parse academic papers for RAG: preserve sections, citations, tables, and figures. Build searchable research knowledge bases.

Key fields: title, abstract, sections, citations, figures, tables
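
Because converters like Docling (introduced below) emit Markdown with headings intact, section structure can be preserved at chunking time. A minimal sketch using LangChain's MarkdownHeaderTextSplitter, assuming markdown holds an already-converted paper:

# Sketch: split converted Markdown on headers so each chunk keeps
# its section context; 'markdown' is assumed to hold the parsed paper
from langchain.text_splitter import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[('#', 'title'), ('##', 'section')]
)
for doc in splitter.split_text(markdown):
    print(doc.metadata, doc.page_content[:80])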

Document Parsing Tools

Three widely used tools cover most document parsing needs, each with different tradeoffs; a fourth, VLM-based approach follows in the next section:

Docling (IBM) - PDF to Markdown

IBM's open-source document converter. Excellent for converting PDFs to clean Markdown with preserved structure. Handles tables, headers, and layout well.

from docling.document_converter import DocumentConverter

# Convert the PDF, then export the parsed document as Markdown
converter = DocumentConverter()
result = converter.convert('document.pdf')
markdown = result.document.export_to_markdown()
print(markdown)

Best for: Clean PDFs, research papers, structured documents

Unstructured.io - Multi-format

Handles PDFs, Word docs, PowerPoints, images, and more. Partitions documents into semantic elements (titles, paragraphs, tables, images) with category labels.

from unstructured.partition.auto import partition

# Partition into typed elements (Title, NarrativeText, Table, ...)
elements = partition('document.pdf')
for element in elements:
    print(f'{element.category}: {element.text[:100]}...')

Best for: Mixed document types, enterprise pipelines, element categorization

Marker - Best for Books and Papers

Optimized for converting books, papers, and long documents to Markdown. Extracts images and preserves complex layouts including equations.

# Note: this is the pre-1.0 marker API; newer releases expose a
# PdfConverter class instead, so check your installed version
from marker.convert import convert_single_pdf
from marker.models import load_all_models

# Load the layout/OCR models once, then convert; returns Markdown
# text, extracted images, and document metadata
models = load_all_models()
full_text, images, metadata = convert_single_pdf('paper.pdf', models)
print(full_text)

Best for: Academic papers, textbooks, documents with equations and figures

Vision Language Models for Document OCR

The newest approach uses Vision Language Models (VLMs) to "read" documents directly. These models understand layout, tables, and handwriting without specialized training.

Why VLMs for Documents?

Traditional OCR extracts text character by character. VLMs understand the full document visually - they can read tables as tables, understand forms, and even interpret diagrams. Mistral OCR leads the benchmarks with 79.75% composite accuracy.

VLM Document Processing

# Example with GPT-4o vision input
import base64
from openai import OpenAI

client = OpenAI()

# Render the PDF page to an image first (see the sketch below),
# then send it base64-encoded as a data URL
with open('page.png', 'rb') as f:
    image_data = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract all text..."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}}
        ]
    }]
)
print(response.choices[0].message.content)
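
The page image itself can come from any PDF renderer; pdf2image is one option among several (it requires the poppler utilities to be installed):

# Sketch: render the first PDF page to page.png for the VLM call above.
# pdf2image is one renderer option; it needs poppler installed.
from pdf2image import convert_from_path

pages = convert_from_path('document.pdf', dpi=200)
pages[0].save('page.png')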

OmniDocBench Results

OmniDocBench is a comprehensive benchmark for document parsing, testing text extraction, table recognition, and layout understanding.

Tool                   Composite score
Mistral OCR            79.75%
GPT-4o                 73%
Docling (IBM)          68%
Marker                 65%
PyMuPDF + heuristics   55%

OmniDocBench composite score. Higher is better. VLM-based approaches lead traditional parsers.

Key Insight

VLMs like Mistral OCR and GPT-4o achieve the best accuracy but are expensive for high-volume processing. For most use cases, Docling or Marker provide the best balance of quality and cost - they run locally and handle most documents well.

Choosing the Right Tool

High Volume / Cost Sensitive

Use Docling or Marker. Run locally, no API costs, good accuracy.

Processes thousands of documents per hour. Best for batch processing.

Mixed Document Types

Use Unstructured.io. Handles PDFs, Word, PowerPoint, images, and more.

Enterprise-grade solution with hosted API option. Good element categorization.

Maximum Accuracy

Use Mistral OCR or GPT-4o. Best for complex layouts, handwriting, or critical documents.

Higher cost per page. Use for high-value documents or when accuracy is critical.

Research Papers and Books

Use Marker. Optimized for academic content with equations and figures.

Extracts images, preserves section structure, handles LaTeX equations.
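
These rules of thumb are easy to encode. A minimal routing sketch (the input traits and returned tool names are illustrative assumptions, not a standard API):

# Sketch: route a document to a parser per the guidance above.
# Inputs and returned tool names are illustrative assumptions.
def choose_parser(file_type: str, high_volume: bool, critical: bool) -> str:
    if critical:
        return 'vlm'            # Mistral OCR / GPT-4o: accuracy first
    if file_type != 'pdf':
        return 'unstructured'   # mixed formats: Word, PowerPoint, images
    if high_volume:
        return 'docling'        # runs locally, no per-page API cost
    return 'marker'             # papers and books with equations

print(choose_parser('pdf', high_volume=True, critical=False))  # docling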

Document-to-RAG Pipeline

Document parsing is typically the first step in a RAG pipeline. Here's a complete example:

# Complete document-to-RAG pipeline
from docling.document_converter import DocumentConverter
from langchain.text_splitter import RecursiveCharacterTextSplitter
# (newer LangChain releases: from langchain_text_splitters import RecursiveCharacterTextSplitter)
from sentence_transformers import SentenceTransformer

# Step 1: Parse document to markdown
converter = DocumentConverter()
result = converter.convert('contract.pdf')
markdown = result.document.export_to_markdown()

# Step 2: Chunk the text
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = splitter.split_text(markdown)

# Step 3: Generate embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(chunks)

# Step 4: Store in vector database
# ... (see vector database lesson)
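
Before wiring in a vector database, retrieval can be sanity-checked in memory with the same embeddings (a minimal sketch; the query is illustrative):

# Sketch: nearest-chunk lookup via cosine similarity, no vector DB.
# Reuses 'model', 'chunks', and 'embeddings' from the pipeline above.
import numpy as np

query_emb = model.encode(['What is the termination clause?'])[0]
scores = embeddings @ query_emb / (
    np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_emb)
)
print(chunks[int(np.argmax(scores))])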

Key Takeaways

  1. Document parsing is the foundation: Without accurate extraction, RAG and search will fail on enterprise documents.

  2. Three main tools: Docling (IBM), Unstructured.io, and Marker. Each has different strengths for different document types.

  3. VLMs lead the benchmarks: Mistral OCR (79.75%) and GPT-4o (~73%) are the most accurate, but cost more per document.

  4. Match tool to use case: High volume uses local tools. Critical documents use VLMs. Mixed formats use Unstructured.io.

Practice Exercise

Try these exercises to get hands-on with document parsing:

  1. Install Docling (pip install docling) and parse a PDF to Markdown.
  2. Compare the output of Docling vs PyMuPDF on a document with tables.
  3. Try parsing a scanned document - compare traditional OCR vs VLM-based extraction.