Document Extraction and Parsing
Extract structured text from PDFs, scans, and complex layouts. The document modality is critical for enterprise RAG.
Why Document Parsing Matters
Most enterprise knowledge lives in documents: PDFs, scanned invoices, contracts, research papers. Before you can build RAG or search, you need to extract the text accurately.
Simple PDF text extraction often fails on complex layouts, tables, multi-column text, and scanned documents. Modern document parsers use computer vision and layout understanding to preserve structure.
The Challenge
A PDF is not just text. It contains:
- Tables with rows and columns that need structure preservation
- Multi-column layouts that naive extractors jumble together
- Headers, footers, and page numbers to filter out
- Figures, equations, and captions that need special handling
Enterprise Use Cases
Invoice Processing
Extract vendor names, line items, amounts, and dates from invoices. Automate accounts payable workflows.
Key fields: vendor, invoice number, date, line items, total amount
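To make the target concrete, here is a minimal sketch of an invoice schema using Pydantic. The class and field names are illustrative, not from any specific library; the same pattern extends to the resume, contract, and research-paper use cases below, and any parser or VLM with structured output can be pointed at a schema like this.
from pydantic import BaseModel

class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float

class Invoice(BaseModel):
    vendor: str
    invoice_number: str
    date: str  # e.g. ISO 8601: '2024-01-31'
    line_items: list[LineItem]
    total_amount: float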
Resume Parsing
Structure candidate resumes into searchable fields. Power ATS systems and candidate matching.
Key fields: name, experience, skills, education, contact info
Contract Analysis
Extract clauses, parties, dates, and obligations from legal documents. Enable contract review automation.
Key fields: parties, effective date, terms, obligations, signatures
Research Paper Ingestion
Parse academic papers for RAG: preserve sections, citations, tables, and figures. Build searchable research knowledge bases.
Key fields: title, abstract, sections, citations, figures, tables
Document Parsing Tools
There are three main open-source tools for document parsing, each with different tradeoffs:
Docling (IBM) - PDF to Markdown
IBM's open-source document converter. Excellent for converting PDFs to clean Markdown with preserved structure. Handles tables, headers, and layout well.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert('document.pdf')
markdown = result.document.export_to_markdown()
print(markdown)
Best for: Clean PDFs, research papers, structured documents
Unstructured.io - Multi-format
Handles PDFs, Word docs, PowerPoints, images, and more. Partitions documents into semantic elements (titles, paragraphs, tables, images) with category labels.
from unstructured.partition.auto import partition

elements = partition('document.pdf')
for element in elements:
    print(f'{element.category}: {element.text[:100]}...')
Best for: Mixed document types, enterprise pipelines, element categorization
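Because every element carries a category label, downstream filtering is a one-liner. As a sketch (assuming the partition call above, and that table elements are labeled 'Table'), pulling out just the tables looks like this:
tables = [el for el in elements if el.category == 'Table']
for table in tables:
    print(table.text)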
Marker - Best for Books and Papers
Optimized for converting books, papers, and long documents to Markdown. Extracts images and preserves complex layouts including equations.
from marker.convert import convert_single_pdf
from marker.models import load_all_models

models = load_all_models()
full_text, images, metadata = convert_single_pdf('paper.pdf', models)
print(full_text)
Best for: Academic papers, textbooks, documents with equations and figures
Vision Language Models for Document OCR
The newest approach uses Vision Language Models (VLMs) to "read" documents directly. These models understand layout, tables, and handwriting without specialized training.
Why VLMs for Documents?
Traditional OCR extracts text character by character. VLMs understand the full document visually - they can read tables as tables, understand forms, and even interpret diagrams. Mistral OCR leads the benchmarks with 79.75% composite accuracy.
VLM Document Processing
# Example: document OCR with GPT-4o vision input
import base64
from openai import OpenAI

client = OpenAI()

# Convert the PDF page to an image first (e.g., with pdf2image)
with open('page.png', 'rb') as f:
    image_data = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract all text..."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}}
        ]
    }]
)
print(response.choices[0].message.content)
OmniDocBench Results
OmniDocBench is the comprehensive benchmark for document parsing, testing text extraction, table recognition, and layout understanding.
OmniDocBench composite score. Higher is better. VLM-based approaches lead traditional parsers.
Key Insight
VLMs like Mistral OCR and GPT-4o achieve the best accuracy but are expensive for high-volume processing. For most use cases, Docling or Marker provide the best balance of quality and cost - they run locally and handle most documents well.
Choosing the Right Tool
High Volume / Cost Sensitive
Use Docling or Marker. Run locally, no API costs, good accuracy.
Processes thousands of documents per hour. Best for batch processing.
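As an illustration, a local batch loop with Docling takes only a few lines. This is a sketch assuming a folder of PDFs named invoices/; it writes one Markdown file next to each input.
from pathlib import Path
from docling.document_converter import DocumentConverter

converter = DocumentConverter()  # reuse one converter across all files
for pdf_path in Path('invoices').glob('*.pdf'):
    result = converter.convert(pdf_path)
    # Write the extracted Markdown alongside the source PDF
    pdf_path.with_suffix('.md').write_text(result.document.export_to_markdown())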
Mixed Document Types
Use Unstructured.io. Handles PDFs, Word, PowerPoint, images, and more.
Enterprise-grade solution with hosted API option. Good element categorization.
Maximum Accuracy
Use Mistral OCR or GPT-4o. Best for complex layouts, handwriting, or critical documents.
Higher cost per page. Use for high-value documents or when accuracy is critical.
Research Papers and Books
Use Marker. Optimized for academic content with equations and figures.
Extracts images, preserves section structure, handles LaTeX equations.
Document-to-RAG Pipeline
Document parsing is typically the first step in a RAG pipeline. Here's a complete example:
from docling.document_converter import DocumentConverter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
# Step 1: Parse document to markdown
converter = DocumentConverter()
result = converter.convert('contract.pdf')
markdown = result.document.export_to_markdown()
# Step 2: Chunk the text
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = splitter.split_text(markdown)
# Step 3: Generate embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(chunks)
# Step 4: Store in vector database
# ... (see vector database lesson)
Key Takeaways
1. Document parsing is the foundation: without accurate extraction, RAG and search will fail on enterprise documents.
2. Three main tools: Docling (IBM), Unstructured.io, and Marker. Each has different strengths for different document types.
3. VLMs lead the benchmarks: Mistral OCR (79.75%) and GPT-4o (~73%) are the most accurate, but cost more per document.
4. Match the tool to the use case: high-volume pipelines use local tools, critical documents use VLMs, and mixed formats use Unstructured.io.
Practice Exercise
Try these exercises to get hands-on with document parsing:
1. Install Docling (pip install docling) and parse a PDF to Markdown.
2. Compare the output of Docling vs PyMuPDF on a document with tables (a starter sketch follows below).
3. Try parsing a scanned document and compare traditional OCR vs VLM-based extraction.
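A starter sketch for exercise 2, assuming both libraries are installed (pip install docling pymupdf) and a test file named tables.pdf:
import fitz  # PyMuPDF
from docling.document_converter import DocumentConverter

# Naive extraction: concatenates raw text, often jumbling tables and columns
with fitz.open('tables.pdf') as doc:
    pymupdf_text = ''.join(page.get_text() for page in doc)

# Layout-aware extraction: preserves table structure as Markdown
docling_md = DocumentConverter().convert('tables.pdf').document.export_to_markdown()

print('--- PyMuPDF ---')
print(pymupdf_text[:500])
print('--- Docling ---')
print(docling_md[:500])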