
Document Extraction

Extract structured information from documents like PDFs, invoices, forms, and contracts.

How Structured Output Extraction Works

Turn unstructured documents into typed, validated data structures. From Pydantic schemas to LLM extraction with Instructor.

1. Why Structured Output Matters

Raw text is hard to process programmatically. Structured output gives you typed, validated data that integrates directly with your codebase.

Unstructured Text
INVOICE #INV-2024-0892
Date: December 15, 2024
Bill To: Acme Corp
         123 Business Ave
         New York, NY 10001

Items:
- Widget Pro x5 @ $29.99 = $149.95
- Service Fee = $25.00

Subtotal: $174.95
Tax (8%): $13.99
Total: $188.94

Payment Due: January 15, 2025
Structured JSON
{
  "invoice_number": "INV-2024-0892",
  "date": "2024-12-15",
  "customer": {
    "name": "Acme Corp",
    "address": "123 Business Ave, New York, NY 10001"
  },
  "items": [
    {
      "description": "Widget Pro",
      "quantity": 5,
      "unit_price": 29.99,
      "total": 149.95
    },
    {
      "description": "Service Fee",
      "quantity": 1,
      "unit_price": 25,
      "total": 25
    }
  ],
  "subtotal": 174.95,
  "tax": 13.99,
  "total": 188.94,
  "due_date": "2025-01-15"
}
Type Safety

Catch errors at parse time, not runtime. IDE autocomplete works.

Validation

Ensure data meets constraints: required fields, value ranges, formats.

Integration

Directly usable in databases, APIs, analytics pipelines.
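The validation point above can be made concrete with Pydantic's Field constraints; a minimal sketch (the model name and bounds are illustrative):

```python
from pydantic import BaseModel, Field, ValidationError

class InvoiceSummary(BaseModel):
    invoice_number: str = Field(min_length=1)  # required, non-empty
    total: float = Field(ge=0)                 # required, non-negative

try:
    InvoiceSummary(invoice_number="", total=-5)
    errors = []
except ValidationError as e:
    errors = e.errors()  # one entry per violated constraint

print(len(errors))  # 2: empty string and negative total
```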

2. Pydantic: The Schema Language

Pydantic is Python's most popular data validation library. Define schemas using type hints, get automatic validation, serialization, and JSON Schema generation.

model.py
from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int
    email: str
    is_active: bool = True

How It Works

Pydantic uses Python type hints to define the schema. Default values make fields optional.

Key Features

  • Automatic type coercion (str "42" to int 42)
  • Rich error messages on validation failure
  • Generates JSON Schema automatically
  • Serialization to dict/JSON built-in
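A quick sketch of the coercion and error reporting described above, using the User model from earlier (Pydantic v2 API):

```python
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    name: str
    age: int
    email: str
    is_active: bool = True

# Lax mode coerces the string "42" to int 42; is_active falls back to its default
u = User(name="Ada", age="42", email="ada@example.com")
print(u.age, u.is_active)  # 42 True

# A non-coercible value yields a field-level error with location info
try:
    User(name="Ada", age="not a number", email="ada@example.com")
except ValidationError as e:
    print(e.errors()[0]["loc"])  # ('age',)
```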

3. JSON Schema: The Bridge to LLMs

Pydantic models automatically generate JSON Schema. This schema is what LLMs use to understand the expected output format. It's the contract between your code and the model.

Python to JSON Schema Type Mapping

Python Type          JSON Schema                                         Example Value
str                  "type": "string"                                    "hello"
int                  "type": "integer"                                   42
float                "type": "number"                                    3.14
bool                 "type": "boolean"                                   true
List[str]            "type": "array", "items": {"type": "string"}        ["a", "b"]
Optional[str]        "anyOf": [{"type": "string"}, {"type": "null"}]     "text" or null
Literal["a", "b"]    "enum": ["a", "b"]                                  "a"
datetime             "type": "string", "format": "date-time"             "2024-01-01T00:00:00Z"
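The trickier rows of the mapping can be verified directly from a generated schema; a quick sketch (the Demo model is illustrative):

```python
from datetime import datetime
from typing import Literal, Optional
from pydantic import BaseModel

class Demo(BaseModel):
    status: Literal["a", "b"]
    note: Optional[str] = None
    created: datetime

props = Demo.model_json_schema()["properties"]
print(props["status"]["enum"])     # ['a', 'b']
print(props["note"]["anyOf"])      # string-or-null union
print(props["created"]["format"])  # 'date-time'
```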
Pydantic Model
from pydantic import BaseModel
from typing import List, Optional

class Invoice(BaseModel):
    invoice_id: str
    amount: float
    items: List[str]
    paid: bool
    notes: Optional[str] = None
Generated JSON Schema
{
  "type": "object",
  "properties": {
    "invoice_id": {"type": "string"},
    "amount": {"type": "number"},
    "items": {
      "type": "array",
      "items": {"type": "string"}
    },
    "paid": {"type": "boolean"},
    "notes": {
      "anyOf": [
        {"type": "string"},
        {"type": "null"}
      ]
    }
  },
  "required": ["invoice_id", "amount", "items", "paid"]
}
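The schema above is not hand-written: Pydantic generates it from the model via model_json_schema() (Pydantic v2 also adds "title" keys, omitted from the listing above for brevity):

```python
from typing import List, Optional
from pydantic import BaseModel

class Invoice(BaseModel):
    invoice_id: str
    amount: float
    items: List[str]
    paid: bool
    notes: Optional[str] = None

schema = Invoice.model_json_schema()
print(schema["required"])                      # fields without defaults
print(schema["properties"]["amount"]["type"])  # 'number'
```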

How LLMs Use JSON Schema

When you provide a JSON Schema to an LLM via a structured output mode, implementations that use constrained decoding (e.g., OpenAI Structured Outputs, Outlines) enforce it at the logit level: tokens that would break the schema are masked out during sampling, so the output is guaranteed to be valid JSON. Plain function calling relies on fine-tuning rather than masking, and can still occasionally produce invalid output.
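For providers with a native structured output mode, the generated schema travels in the request itself. A sketch of the request payload shape for OpenAI's Structured Outputs (no API call is made here; note that strict mode additionally requires "additionalProperties": false on the schema):

```python
from pydantic import BaseModel

class Invoice(BaseModel):
    invoice_id: str
    amount: float

json_schema = Invoice.model_json_schema()
json_schema["additionalProperties"] = False  # required by strict mode

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "invoice",
        "strict": True,  # enables logit-level constrained decoding
        "schema": json_schema,
    },
}
# Sent as: client.chat.completions.create(..., response_format=response_format)
print(response_format["json_schema"]["schema"]["required"])
```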

4. Instructor: LLM + Pydantic

Instructor is the glue between Pydantic and LLMs. It patches OpenAI/Anthropic clients to accept a response_model argument and handles validation, retries, and streaming.

How Instructor Works

1. Define Schema - Create a Pydantic model describing the desired output
2. Generate JSON Schema - Pydantic auto-generates JSON Schema from the model
3. Inject into Prompt - The schema is sent to the LLM as a function/tool definition
4. LLM Response - The LLM generates JSON matching the schema
5. Validate & Parse - Pydantic validates the JSON and creates a typed object
6. Retry on Error - If validation fails, Instructor retries with the error context
extract.py (pip install instructor)
import instructor
from openai import OpenAI
from pydantic import BaseModel
from typing import List

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float

class Invoice(BaseModel):
    invoice_number: str
    total: float
    items: List[LineItem]

# Patch OpenAI client with Instructor
client = instructor.from_openai(OpenAI())

# Extract structured data from text
invoice = client.chat.completions.create(
    model="gpt-4o",
    response_model=Invoice,
    messages=[{
        "role": "user",
        "content": """
        Invoice #12345
        Widget Pro x5 @ $29.99 = $149.95
        Service Fee = $25.00
        Total: $174.95
        """
    }]
)

print(invoice.invoice_number)  # "12345"
print(invoice.total)           # 174.95
print(invoice.items[0].description)  # "Widget Pro"

Automatic Retries

If the LLM output fails Pydantic validation, Instructor automatically retries with the error message included in the prompt.

client.chat.completions.create(
    model="gpt-4o",
    response_model=Invoice,
    max_retries=3,  # Retry up to 3 times
    messages=[...]
)

Streaming Support

Get partial objects as they stream in. Useful for long extractions where you want to show progress.

for partial in client.chat.completions.create_partial(
    model="gpt-4o",
    response_model=Invoice,
    messages=[...]
):
    print(partial)  # Partial Invoice
5. Extraction Methods Compared

There are multiple approaches to getting structured output from LLMs. Each has tradeoffs in reliability, speed, and flexibility.

When to Use What

Instructor

Best for production apps with OpenAI/Anthropic. Type-safe, battle-tested.

OpenAI Structured Outputs

Best for simple schemas when you only use OpenAI. No extra dependencies.

Outlines

Best for local/open-source models. 100% guaranteed valid output.

LangChain Parsers

Best if already using LangChain. Works with any model.

6. Full Document Extraction Pipeline

Real-world document extraction combines OCR, layout analysis, and LLM extraction. Here's how the pieces fit together.

1. Document Input - PDF, image, or scanned document
2. OCR / Text Extraction - Convert to machine-readable text
3. Layout Analysis - Identify tables, sections, headers
4. Entity Recognition - Find dates, amounts, names
5. Schema Mapping - Map entities to Pydantic fields
6. Validation - Verify types and constraints
7. Structured Output - JSON/dict ready for downstream
Full Pipeline: PDF Invoice to Structured Data
import instructor
from openai import OpenAI
from pydantic import BaseModel
from typing import List
from docling.document_converter import DocumentConverter  # Or PyMuPDF, pdfplumber, etc.

# 1. Define the schema
class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class Invoice(BaseModel):
    invoice_number: str
    vendor_name: str
    date: str
    items: List[LineItem]
    subtotal: float
    tax: float
    total: float

# 2. Extract text from PDF
converter = DocumentConverter()
result = converter.convert("invoice.pdf")
text = result.document.export_to_markdown()

# 3. Use LLM to extract structured data
client = instructor.from_openai(OpenAI())

invoice = client.chat.completions.create(
    model="gpt-4o",
    response_model=Invoice,
    messages=[
        {
            "role": "system",
            "content": "Extract invoice data from the following document."
        },
        {
            "role": "user",
            "content": text
        }
    ]
)

# 4. Use the typed, validated data
print(f"Invoice: {invoice.invoice_number}")
print(f"Total: ${invoice.total:.2f}")
for item in invoice.items:
    print(f"  - {item.description}: {item.quantity} x ${item.unit_price}")
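LLMs occasionally emit numbers that don't add up. A cross-field check can reject such extractions before they reach downstream systems; a sketch using Pydantic's model_validator (the 1-cent tolerance is an illustrative choice, and when used with Instructor a raised error feeds the retry loop):

```python
from typing import List
from pydantic import BaseModel, ValidationError, model_validator

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class CheckedInvoice(BaseModel):
    items: List[LineItem]
    subtotal: float
    tax: float
    total: float

    @model_validator(mode="after")
    def totals_add_up(self):
        # Reject extractions whose arithmetic is internally inconsistent
        if abs(sum(i.total for i in self.items) - self.subtotal) > 0.01:
            raise ValueError("line items do not sum to subtotal")
        if abs(self.subtotal + self.tax - self.total) > 0.01:
            raise ValueError("subtotal + tax does not equal total")
        return self

ok = CheckedInvoice(
    items=[LineItem(description="Widget Pro", quantity=5,
                    unit_price=29.99, total=149.95),
           LineItem(description="Service Fee", quantity=1,
                    unit_price=25.0, total=25.0)],
    subtotal=174.95, tax=13.99, total=188.94,
)

try:
    CheckedInvoice(items=[], subtotal=10.0, tax=1.0, total=99.0)
    caught = False
except ValidationError:
    caught = True
print(ok.total, caught)  # 188.94 True
```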

OCR / Text Extraction

  • Docling - Layout-aware
  • PyMuPDF - Fast, native PDF
  • Tesseract - Scanned docs
  • Azure Doc AI - Pre-built

LLM Extraction

  • GPT-4o - Best accuracy
  • Claude 3.5 - Long docs
  • Gemini 1.5 - 1M context
  • Llama 3.1 - Self-host

Structured Output

  • Instructor - Production
  • Outlines - Local models
  • Marvin - Lightweight
  • BAML - Type-first

The Complete Picture

Document -> OCR/Parse -> Text -> Pydantic Schema -> Instructor + LLM -> Validated Object -> Database / API

Structured output extraction turns messy documents into clean, typed data. Pydantic defines the contract, JSON Schema bridges to LLMs, and Instructor handles the plumbing. The result: reliable, production-ready document processing.

Use Cases

  • Invoice processing
  • Resume parsing
  • Contract analysis
  • Form digitization

Architectural Patterns

Layout-Aware OCR + LLM

Use document OCR (preserving layout) then LLM for extraction.

Pros:
  • Handles complex layouts
  • Flexible schemas
  • Good accuracy
Cons:
  • Multi-step
  • LLM cost for extraction

End-to-End Document VLM

Vision-language models that directly process document images.

Pros:
  • Single model
  • Handles visual elements
Cons:
  • May miss fine text
  • Fixed context window

Template-Based Extraction

Define zones/templates for known document types.

Pros:
  • Very fast
  • High accuracy for known formats
Cons:
  • Breaks on new formats
  • Maintenance overhead

Implementations

API Services

Azure Document Intelligence

Microsoft
API

Pre-built and custom extractors. Good for invoices, receipts.

Google Document AI

Google
API

Strong OCR + extraction. Pre-built processors.

Open Source

Docling (IBM)

MIT
Open Source

PDF to structured output. Layout-aware, handles tables well.

Unstructured.io

Apache 2.0
Open Source

Multi-format document processing. Good for RAG pipelines.

Marker

GPL-3.0
Open Source

PDF to Markdown. Excellent for books and papers.

Quick Facts

  • Input: Document
  • Output: Structured Data
  • Implementations: 3 open source, 2 API
  • Patterns: 3 approaches

Submit Results