
Document Extraction

Extract structured information from documents like PDFs, invoices, forms, and contracts.

How Structured Output Extraction Works

Turn unstructured documents into typed, validated data structures. From Pydantic schemas to LLM extraction with Instructor.

1. Why Structured Output Matters

Raw text is hard to process programmatically. Structured output gives you typed, validated data that integrates directly with your codebase.

Unstructured Text
INVOICE #INV-2024-0892
Date: December 15, 2024
Bill To: Acme Corp
         123 Business Ave
         New York, NY 10001

Items:
- Widget Pro x5 @ $29.99 = $149.95
- Service Fee = $25.00

Subtotal: $174.95
Tax (8%): $13.99
Total: $188.94

Payment Due: January 15, 2025
Structured JSON
{
  "invoice_number": "INV-2024-0892",
  "date": "2024-12-15",
  "customer": {
    "name": "Acme Corp",
    "address": "123 Business Ave, New York, NY 10001"
  },
  "items": [
    {
      "description": "Widget Pro",
      "quantity": 5,
      "unit_price": 29.99,
      "total": 149.95
    },
    {
      "description": "Service Fee",
      "quantity": 1,
      "unit_price": 25,
      "total": 25
    }
  ],
  "subtotal": 174.95,
  "tax": 13.99,
  "total": 188.94,
  "due_date": "2025-01-15"
}
Type Safety

Catch errors at parse time, not runtime. IDE autocomplete works.

Validation

Ensure data meets constraints: required fields, value ranges, formats.

Integration

Directly usable in databases, APIs, analytics pipelines.
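The validation point above can be made concrete with Pydantic's Field constraints; a minimal sketch (the model name and bounds are illustrative):

```python
from pydantic import BaseModel, Field, ValidationError

class InvoiceSummary(BaseModel):
    invoice_number: str = Field(min_length=1)  # required, non-empty
    total: float = Field(ge=0)                 # required, non-negative

try:
    InvoiceSummary(invoice_number="", total=-5)
    errors = []
except ValidationError as e:
    errors = e.errors()  # one entry per violated constraint

print(len(errors))  # 2: empty string and negative total
```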

2. Pydantic: The Schema Language

Pydantic is Python's most popular data validation library. Define schemas using type hints, get automatic validation, serialization, and JSON Schema generation.

model.py
from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int
    email: str
    is_active: bool = True

How It Works

Pydantic uses Python type hints to define the schema. Default values make fields optional.

Key Features

  • Automatic type coercion (str "42" to int 42)
  • Rich error messages on validation failure
  • Generates JSON Schema automatically
  • Serialization to dict/JSON built-in
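A quick sketch of the coercion and error reporting described above, using the User model from earlier (Pydantic v2 API):

```python
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    name: str
    age: int
    email: str
    is_active: bool = True

# Lax mode coerces the string "42" to int 42; is_active falls back to its default
u = User(name="Ada", age="42", email="ada@example.com")
print(u.age, u.is_active)  # 42 True

# A non-coercible value yields a field-level error with location info
try:
    User(name="Ada", age="not a number", email="ada@example.com")
except ValidationError as e:
    print(e.errors()[0]["loc"])  # ('age',)
```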

3. JSON Schema: The Bridge to LLMs

Pydantic models automatically generate JSON Schema. This schema is what LLMs use to understand the expected output format. It's the contract between your code and the model.

Python to JSON Schema Type Mapping

Python Type          JSON Schema                                         Example Value
str                  "type": "string"                                    "hello"
int                  "type": "integer"                                   42
float                "type": "number"                                    3.14
bool                 "type": "boolean"                                   true
List[str]            "type": "array", "items": {"type": "string"}        ["a", "b"]
Optional[str]        "anyOf": [{"type": "string"}, {"type": "null"}]     "text" or null
Literal["a", "b"]    "enum": ["a", "b"]                                  "a"
datetime             "type": "string", "format": "date-time"             "2024-01-01T00:00:00Z"
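The trickier rows of the mapping can be verified directly from a generated schema; a quick sketch (the Demo model is illustrative):

```python
from datetime import datetime
from typing import Literal, Optional
from pydantic import BaseModel

class Demo(BaseModel):
    status: Literal["a", "b"]
    note: Optional[str] = None
    created: datetime

props = Demo.model_json_schema()["properties"]
print(props["status"]["enum"])     # ['a', 'b']
print(props["note"]["anyOf"])      # string-or-null union
print(props["created"]["format"])  # 'date-time'
```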
Pydantic Model
from pydantic import BaseModel
from typing import List, Optional

class Invoice(BaseModel):
    invoice_id: str
    amount: float
    items: List[str]
    paid: bool
    notes: Optional[str] = None
Generated JSON Schema
{
  "type": "object",
  "properties": {
    "invoice_id": {"type": "string"},
    "amount": {"type": "number"},
    "items": {
      "type": "array",
      "items": {"type": "string"}
    },
    "paid": {"type": "boolean"},
    "notes": {
      "anyOf": [
        {"type": "string"},
        {"type": "null"}
      ]
    }
  },
  "required": ["invoice_id", "amount", "items", "paid"]
}
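The schema above is not hand-written: Pydantic generates it from the model via model_json_schema() (Pydantic v2 also adds "title" keys, omitted from the listing above for brevity):

```python
from typing import List, Optional
from pydantic import BaseModel

class Invoice(BaseModel):
    invoice_id: str
    amount: float
    items: List[str]
    paid: bool
    notes: Optional[str] = None

schema = Invoice.model_json_schema()
print(schema["required"])                      # fields without defaults
print(schema["properties"]["amount"]["type"])  # 'number'
```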

How LLMs Use JSON Schema

When you provide a JSON Schema to an LLM via a structured output mode, implementations that use constrained decoding (e.g., OpenAI Structured Outputs, Outlines) enforce it at the logit level: tokens that would break the schema are masked out during sampling, so the output is guaranteed to be valid JSON. Plain function calling relies on fine-tuning rather than masking, and can still occasionally produce invalid output.
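For providers with a native structured output mode, the generated schema travels in the request itself. A sketch of the request payload shape for OpenAI's Structured Outputs (no API call is made here; note that strict mode additionally requires "additionalProperties": false on the schema):

```python
from pydantic import BaseModel

class Invoice(BaseModel):
    invoice_id: str
    amount: float

json_schema = Invoice.model_json_schema()
json_schema["additionalProperties"] = False  # required by strict mode

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "invoice",
        "strict": True,  # enables logit-level constrained decoding
        "schema": json_schema,
    },
}
# Sent as: client.chat.completions.create(..., response_format=response_format)
print(response_format["json_schema"]["schema"]["required"])
```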

4. Instructor: LLM + Pydantic

Instructor is the glue between Pydantic and LLMs. It patches OpenAI/Anthropic clients to accept a response_model argument and handles validation, retries, and streaming.

How Instructor Works

1. Define Schema - Create a Pydantic model describing the desired output
2. Generate JSON Schema - Pydantic auto-generates JSON Schema from the model
3. Inject into Prompt - The schema is sent to the LLM as a function/tool definition
4. LLM Response - The LLM generates JSON matching the schema
5. Validate & Parse - Pydantic validates the JSON and creates a typed object
6. Retry on Error - If validation fails, Instructor retries with the error context
extract.py (pip install instructor)
import instructor
from openai import OpenAI
from pydantic import BaseModel
from typing import List

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float

class Invoice(BaseModel):
    invoice_number: str
    total: float
    items: List[LineItem]

# Patch OpenAI client with Instructor
client = instructor.from_openai(OpenAI())

# Extract structured data from text
invoice = client.chat.completions.create(
    model="gpt-4o",
    response_model=Invoice,
    messages=[{
        "role": "user",
        "content": """
        Invoice #12345
        Widget Pro x5 @ $29.99 = $149.95
        Service Fee = $25.00
        Total: $174.95
        """
    }]
)

print(invoice.invoice_number)  # "12345"
print(invoice.total)           # 174.95
print(invoice.items[0].description)  # "Widget Pro"

Automatic Retries

If the LLM output fails Pydantic validation, Instructor automatically retries with the error message included in the prompt.

client.chat.completions.create(
    model="gpt-4o",
    response_model=Invoice,
    max_retries=3,  # Retry up to 3 times
    messages=[...]
)

Streaming Support

Get partial objects as they stream in. Useful for long extractions where you want to show progress.

for partial in client.chat.completions.create_partial(
    model="gpt-4o",
    response_model=Invoice,
    messages=[...]
):
    print(partial)  # Partial Invoice
5. Extraction Methods Compared

There are multiple approaches to getting structured output from LLMs. Each has tradeoffs in reliability, speed, and flexibility.

When to Use What

Instructor

Best for production apps with OpenAI/Anthropic. Type-safe, battle-tested.

OpenAI Structured Outputs

Best for simple schemas when you only use OpenAI. No extra dependencies.

Outlines

Best for local/open-source models. 100% guaranteed valid output.

LangChain Parsers

Best if already using LangChain. Works with any model.

6. Full Document Extraction Pipeline

Real-world document extraction combines OCR, layout analysis, and LLM extraction. Here's how the pieces fit together.

1. Document Input - PDF, image, or scanned document
2. OCR / Text Extraction - Convert to machine-readable text
3. Layout Analysis - Identify tables, sections, headers
4. Entity Recognition - Find dates, amounts, names
5. Schema Mapping - Map entities to Pydantic fields
6. Validation - Verify types and constraints
7. Structured Output - JSON/dict ready for downstream
Full Pipeline: PDF Invoice to Structured Data
import instructor
from openai import OpenAI
from pydantic import BaseModel
from typing import List
from docling.document_converter import DocumentConverter  # Or PyMuPDF, pdfplumber, etc.

# 1. Define the schema
class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class Invoice(BaseModel):
    invoice_number: str
    vendor_name: str
    date: str
    items: List[LineItem]
    subtotal: float
    tax: float
    total: float

# 2. Extract text from PDF
converter = DocumentConverter()
result = converter.convert("invoice.pdf")
text = result.document.export_to_markdown()

# 3. Use LLM to extract structured data
client = instructor.from_openai(OpenAI())

invoice = client.chat.completions.create(
    model="gpt-4o",
    response_model=Invoice,
    messages=[
        {
            "role": "system",
            "content": "Extract invoice data from the following document."
        },
        {
            "role": "user",
            "content": text
        }
    ]
)

# 4. Use the typed, validated data
print(f"Invoice: {invoice.invoice_number}")
print(f"Total: ${invoice.total:.2f}")
for item in invoice.items:
    print(f"  - {item.description}: {item.quantity} x ${item.unit_price}")
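LLMs occasionally emit numbers that don't add up. A cross-field check can reject such extractions before they reach downstream systems; a sketch using Pydantic's model_validator (the 1-cent tolerance is an illustrative choice, and when used with Instructor a raised error feeds the retry loop):

```python
from typing import List
from pydantic import BaseModel, ValidationError, model_validator

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class CheckedInvoice(BaseModel):
    items: List[LineItem]
    subtotal: float
    tax: float
    total: float

    @model_validator(mode="after")
    def totals_add_up(self):
        # Reject extractions whose arithmetic is internally inconsistent
        if abs(sum(i.total for i in self.items) - self.subtotal) > 0.01:
            raise ValueError("line items do not sum to subtotal")
        if abs(self.subtotal + self.tax - self.total) > 0.01:
            raise ValueError("subtotal + tax does not equal total")
        return self

ok = CheckedInvoice(
    items=[LineItem(description="Widget Pro", quantity=5,
                    unit_price=29.99, total=149.95),
           LineItem(description="Service Fee", quantity=1,
                    unit_price=25.0, total=25.0)],
    subtotal=174.95, tax=13.99, total=188.94,
)

try:
    CheckedInvoice(items=[], subtotal=10.0, tax=1.0, total=99.0)
    caught = False
except ValidationError:
    caught = True
print(ok.total, caught)  # 188.94 True
```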

OCR / Text Extraction

  • Docling - Layout-aware
  • PyMuPDF - Fast, native PDF
  • Tesseract - Scanned docs
  • Azure Doc AI - Pre-built

LLM Extraction

  • GPT-4o - Best accuracy
  • Claude 3.5 - Long docs
  • Gemini 1.5 - 1M context
  • Llama 3.1 - Self-host

Structured Output

  • Instructor - Production
  • Outlines - Local models
  • Marvin - Lightweight
  • BAML - Type-first

The Complete Picture

Document -> OCR/Parse -> Text -> Pydantic Schema -> Instructor + LLM -> Validated Object -> Database / API

Structured output extraction turns messy documents into clean, typed data. Pydantic defines the contract, JSON Schema bridges to LLMs, and Instructor handles the plumbing. The result: reliable, production-ready document processing.

Use Cases

  • Invoice processing
  • Resume parsing
  • Contract analysis
  • Form digitization

Architectural Patterns

Layout-Aware OCR + LLM

Use document OCR (preserving layout) then LLM for extraction.

Pros:
  • Handles complex layouts
  • Flexible schemas
  • Good accuracy
Cons:
  • Multi-step
  • LLM cost for extraction

End-to-End Document VLM

Vision-language models that directly process document images.

Pros:
  • Single model
  • Handles visual elements
Cons:
  • May miss fine text
  • Fixed context window

Template-Based Extraction

Define zones/templates for known document types.

Pros:
  • Very fast
  • High accuracy for known formats
Cons:
  • Breaks on new formats
  • Maintenance overhead

Implementations

API Services

Azure Document Intelligence

Microsoft
API

Pre-built and custom extractors. Good for invoices, receipts.

Google Document AI

Google
API

Strong OCR + extraction. Pre-built processors.

Open Source

Docling (IBM)

MIT
Open Source

PDF to structured output. Layout-aware, handles tables well.

Unstructured.io

Apache 2.0
Open Source

Multi-format document processing. Good for RAG pipelines.

Marker

GPL-3.0
Open Source

PDF to Markdown. Excellent for books and papers.

Quick Facts

  • Input: Document
  • Output: Structured Data
  • Implementations: 3 open source, 2 API
  • Patterns: 3 approaches

Submit Results