
Build a Document Scanner

Detect document edges, correct perspective, and enhance scanned images. Interactive demo below.

Try It

Upload a photo of a document (receipt, page, ID card). The scanner will detect the edges, let you adjust them, and transform the image to a flat, rectangular scan.

(Interactive demo: 1. Upload → 2. Detect → 3. Transform → 4. OCR → 5. Extract. Runs in the browser with OpenCV.js.)

How It Works

Document scanning involves four steps:

  1. Edge detection: Find where the document boundaries are using Canny edge detection
  2. Contour finding: Extract the document outline as a 4-point polygon
  3. Perspective transform: Warp the tilted document into a flat rectangle
  4. Enhancement: Improve contrast and optionally convert to black-and-white

Step 1: Edge Detection

The Canny algorithm finds edges by looking for rapid changes in pixel intensity. We first convert to grayscale and blur to reduce noise:

[Figure: edge detection pipeline: 1. Original, 2. Grayscale, 3. Gaussian Blur (5x5), 4. Canny Edges]

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)
edges = cv2.Canny(blurred, 50, 150)

The values 50 and 150 are the low and high thresholds: pixels with a gradient magnitude above 150 are kept as strong edges, pixels between 50 and 150 are kept only if they connect to a strong edge, and everything below 50 is discarded.
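Hard-coded thresholds don't suit every photo. A common heuristic (often called "auto-Canny"; an optional refinement, not part of the steps above) derives the thresholds from the median brightness of the blurred image. A minimal sketch, reusing blurred from the snippet above:

import numpy as np

# Optional: derive Canny thresholds from the median brightness instead of
# hard-coding 50/150. sigma controls how wide the band around the median is.
sigma = 0.33
v = np.median(blurred)
lower = int(max(0, (1.0 - sigma) * v))
upper = int(min(255, (1.0 + sigma) * v))
edges = cv2.Canny(blurred, lower, upper)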

Step 2: Find the Document Contour

We find all contours (closed shapes) in the edge image, then look for the largest one that approximates to a 4-point polygon:

[Figure: contour selection: all contours (112 found), 4-point polygon candidates, selected document with labeled corners]

contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

for contour in sorted(contours, key=cv2.contourArea, reverse=True):
    peri = cv2.arcLength(contour, True)
    approx = cv2.approxPolyDP(contour, 0.02 * peri, True)

    if len(approx) == 4:
        doc_contour = approx
        break

The approxPolyDP function simplifies the contour. The epsilon value (0.02 * perimeter) controls how aggressively it simplifies: larger values produce shapes with fewer vertices.
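If you're unsure which epsilon to use, printing the vertex count for a few fractions of the perimeter makes the trade-off concrete. An illustrative sketch, assuming contour is one of the contours found above:

# Smaller epsilon fractions keep more vertices; larger ones simplify harder.
peri = cv2.arcLength(contour, True)
for frac in (0.005, 0.02, 0.05):
    approx = cv2.approxPolyDP(contour, frac * peri, True)
    print(f"epsilon = {frac} * perimeter -> {len(approx)} vertices")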

Step 3: Perspective Transform

Now we have 4 corner points of the tilted document. We want to map these to a rectangle. This is a homography transformation:

[Figure: perspective transform: source corner points on the tilted document, and the resulting flat rectangle]

# Source points (corners of tilted document)
src_pts = np.array([[x1,y1], [x2,y2], [x3,y3], [x4,y4]], dtype=np.float32)

# Destination points (rectangle)
dst_pts = np.array([[0,0], [width,0], [width,height], [0,height]], dtype=np.float32)

# Get transform matrix and apply
M = cv2.getPerspectiveTransform(src_pts, dst_pts)
result = cv2.warpPerspective(img, M, (width, height))

The order of corners matters. They must be in the same order (e.g., clockwise starting from top-left) in both arrays.

Step 4: Enhancement

For text documents, adaptive thresholding produces clean black-and-white output:

[Figure: enhancement: grayscale scan vs. adaptive threshold]

gray = cv2.cvtColor(result, cv2.COLOR_BGR2GRAY)
enhanced = cv2.adaptiveThreshold(
    gray, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY,
    11, 2  # block size, constant
)

Unlike global thresholding, adaptive thresholding calculates the threshold for each pixel based on its neighbors. This handles uneven lighting across the document.
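To see the difference, compare against a single global threshold. A sketch using Otsu's method (included only for comparison, not part of the scanner above), reusing gray from the previous snippet:

# Global Otsu threshold: one cutoff for the whole image, so shadowed regions
# tend to go solid black and washed-out regions solid white.
_, global_bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Adaptive threshold: a separate cutoff per 11x11 neighborhood (offset by 2),
# so text survives even when lighting varies across the page.
adaptive_bw = cv2.adaptiveThreshold(
    gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2
)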

Complete Python Code

import cv2
import numpy as np

def scan_document(image_path: str, output_path: str) -> None:
    """
    Scan a document: detect edges, correct perspective, enhance.
    """
    # Load image
    img = cv2.imread(image_path)
    orig = img.copy()

    # Resize for processing (keep aspect ratio)
    height, width = img.shape[:2]
    scale = 500 / max(height, width)
    img = cv2.resize(img, None, fx=scale, fy=scale)

    # Convert to grayscale and blur
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)

    # Edge detection
    edges = cv2.Canny(blurred, 50, 150)
    edges = cv2.dilate(edges, np.ones((3, 3), np.uint8))

    # Find contours
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    contours = sorted(contours, key=cv2.contourArea, reverse=True)

    # Find the document contour (largest 4-sided polygon)
    doc_contour = None
    for contour in contours:
        peri = cv2.arcLength(contour, True)
        approx = cv2.approxPolyDP(contour, 0.02 * peri, True)

        if len(approx) == 4:
            doc_contour = approx
            break

    if doc_contour is None:
        raise ValueError("Could not detect document edges")

    # Scale contour back to original image size
    doc_contour = (doc_contour / scale).astype(np.float32)

    # Order corners: top-left, top-right, bottom-right, bottom-left
    pts = doc_contour.reshape(4, 2)
    rect = order_corners(pts)

    # Calculate output dimensions
    width_top = np.linalg.norm(rect[1] - rect[0])
    width_bottom = np.linalg.norm(rect[2] - rect[3])
    height_left = np.linalg.norm(rect[3] - rect[0])
    height_right = np.linalg.norm(rect[2] - rect[1])

    max_width = int(max(width_top, width_bottom))
    max_height = int(max(height_left, height_right))

    # Perspective transform
    dst = np.array([
        [0, 0],
        [max_width - 1, 0],
        [max_width - 1, max_height - 1],
        [0, max_height - 1]
    ], dtype=np.float32)

    M = cv2.getPerspectiveTransform(rect, dst)
    scanned = cv2.warpPerspective(orig, M, (max_width, max_height))

    # Enhance: convert to grayscale and apply adaptive threshold
    scanned_gray = cv2.cvtColor(scanned, cv2.COLOR_BGR2GRAY)
    scanned_enhanced = cv2.adaptiveThreshold(
        scanned_gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2
    )

    cv2.imwrite(output_path, scanned_enhanced)
    print(f"Saved: {output_path}")


def order_corners(pts: np.ndarray) -> np.ndarray:
    """Order corners: top-left, top-right, bottom-right, bottom-left."""
    rect = np.zeros((4, 2), dtype=np.float32)

    # Top-left has smallest sum, bottom-right has largest
    s = pts.sum(axis=1)
    rect[0] = pts[np.argmin(s)]
    rect[2] = pts[np.argmax(s)]

    # Top-right has smallest diff, bottom-left has largest
    d = np.diff(pts, axis=1)
    rect[1] = pts[np.argmin(d)]
    rect[3] = pts[np.argmax(d)]

    return rect


if __name__ == "__main__":
    scan_document("photo.jpg", "scanned.png")

Install: pip install opencv-python numpy

Minimal Version (15 lines)

If you know the document will be detected correctly, here's the minimal version:

import cv2
import numpy as np

# Load and detect edges
img = cv2.imread("photo.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(cv2.GaussianBlur(gray, (5,5), 0), 50, 150)

# Find largest 4-sided contour
contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
for c in sorted(contours, key=cv2.contourArea, reverse=True):
    approx = cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
    if len(approx) == 4:
        pts = approx.reshape(4, 2).astype(np.float32)
        break

# Transform to rectangle
w, h = 800, 1000
dst = np.array([[0,0], [w,0], [w,h], [0,h]], dtype=np.float32)
M = cv2.getPerspectiveTransform(pts, dst)
result = cv2.warpPerspective(img, M, (w, h))

cv2.imwrite("scanned.png", result)

When Edge Detection Fails

Auto-detection fails when:

  • Document is on a similar-colored background (white paper on white desk)
  • Part of the document is cut off in the photo
  • Strong shadows or reflections break the edge
  • Multiple documents in the frame

For these cases, let users manually select the 4 corners (like in the demo above). Many apps show the auto-detected corners but allow adjustment before transforming.
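A minimal sketch of such a fallback using an OpenCV window and mouse callback (the window name and image path are placeholders; a real app would draw an adjustable quadrilateral like the demo above):

import cv2
import numpy as np

clicked = []

def on_click(event, x, y, flags, param):
    # Record up to four corner clicks (clockwise from top-left)
    if event == cv2.EVENT_LBUTTONDOWN and len(clicked) < 4:
        clicked.append((x, y))

img = cv2.imread("photo.jpg")
cv2.namedWindow("select corners")
cv2.setMouseCallback("select corners", on_click)

while len(clicked) < 4:
    preview = img.copy()
    for pt in clicked:
        cv2.circle(preview, pt, 8, (0, 0, 255), -1)
    cv2.imshow("select corners", preview)
    if cv2.waitKey(20) == 27:  # Esc to abort
        break
cv2.destroyAllWindows()

# The clicked points can stand in for the auto-detected contour:
manual_rect = np.array(clicked, dtype=np.float32)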

Adding OCR

Once you have a clean scan, run OCR to extract text. See Getting Started with OCR for how to use PaddleOCR or GPT-4o on your scanned documents.

from paddleocr import PaddleOCR

ocr = PaddleOCR(lang='en')
result = ocr.predict('scanned.png')

for item in result:
    for text in item.get('rec_texts', []):
        print(text)

Why Benchmarks Don't Tell the Whole Story

OCR benchmarks typically measure Character Error Rate (CER) and Word Error Rate (WER). These metrics count how many characters or words the model got wrong:

# CER = (insertions + deletions + substitutions) / total_characters
# WER = (insertions + deletions + substitutions) / total_words

Ground truth: "Invoice Total: $1,234.56"
OCR output:   "Invoice Total: $1,234.56"
CER = 0%, WER = 0%  # Perfect!
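For reference, both metrics boil down to edit (Levenshtein) distance. A minimal sketch, with WER computed the same way over word lists instead of character strings:

def edit_distance(a, b):
    # Dynamic-programming Levenshtein distance over two sequences
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def cer(ground_truth, ocr_output):
    return edit_distance(ground_truth, ocr_output) / max(len(ground_truth), 1)

def wer(ground_truth, ocr_output):
    gt, out = ground_truth.split(), ocr_output.split()
    return edit_distance(gt, out) / max(len(gt), 1)

print(cer("Invoice Total: $1,234.56", "Invoice Total: $1,234.56"))  # 0.0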

But for invoices and tables, perfect character accuracy doesn't mean correct extraction. Consider this invoice table:

Ground truth (what you want):
┌─────────────────────┬─────┬─────────┬───────────┐
│ Description         │ Qty │ Price   │ Total     │
├─────────────────────┼─────┼─────────┼───────────┤
│ Web Development     │ 40  │ $150.00 │ $6,000.00 │
│ UI/UX Design        │ 20  │ $125.00 │ $2,500.00 │
└─────────────────────┴─────┴─────────┴───────────┘

OCR output (what you get):
Web Development
40
$150.00
$6,000.00
UI/UX Design
20
$125.00
$2,500.00

CER and WER are both 0% - every character is correct. But the table structure is completely lost. You can't programmatically answer "what's the price of UI/UX Design?" without reconstructing which numbers belong to which row.

What Benchmarks Actually Measure

Metric                | Measures                                 | Misses
CER                   | Character-level accuracy                 | Word boundaries, structure, semantics
WER                   | Word-level accuracy                      | Line order, table structure, relationships
TEDS                  | Table structure (edit distance on HTML)  | Cell content accuracy, merged cells
F1 (field extraction) | Correct key-value pairs extracted        | Best for invoices, but schema-dependent
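As a concrete example of the last row, field-extraction F1 compares predicted key-value pairs against labeled ground truth. A sketch under an exact-match assumption (real evaluations usually normalize values first; the example fields below are made up):

def field_f1(predicted: dict, ground_truth: dict) -> float:
    # A (field, value) pair counts as correct only if it matches exactly
    pred_pairs, gold_pairs = set(predicted.items()), set(ground_truth.items())
    if not pred_pairs or not gold_pairs:
        return 0.0
    tp = len(pred_pairs & gold_pairs)
    precision, recall = tp / len(pred_pairs), tp / len(gold_pairs)
    return 2 * precision * recall / (precision + recall) if tp else 0.0

print(field_f1(
    {"invoice_total": "$1,234.56", "invoice_date": "2024-01-05"},
    {"invoice_total": "$1,234.56", "invoice_date": "2024-01-15"},
))  # 0.5: the total matches, the date does not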

The Real Pipeline for Tabular Data

For invoices and forms, text extraction is just step 1. You also need:

# Step 1: Preprocess (what we covered above)
scan_document("invoice_photo.jpg", "invoice_scanned.png")

# Step 2: OCR - extract text with bounding boxes
ocr = PaddleOCR(lang='en')
result = ocr.predict("invoice_scanned.png")
# Returns: [{"text": "Web Development", "bbox": [x1,y1,x2,y2], "confidence": 0.98}, ...]

# Step 3: Table detection - find table regions
# (Requires separate model or heuristics)

# Step 4: Cell assignment - which text belongs to which cell
# (Spatial clustering based on bounding box positions)

# Step 5: Structure reconstruction - row/column relationships
# (Graph-based or rule-based assignment)

# Step 6: Field extraction - map to your schema
# {"line_items": [{"description": "Web Development", "qty": 40, "price": 150.00}]}

Traditional OCR engines (PaddleOCR, Tesseract) only do Step 2. You're responsible for Steps 3-6, which is where most of the complexity lies.
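Cell assignment (Steps 4-5) can start as simple spatial clustering of the OCR bounding boxes. A rough sketch, assuming the item format shown in the Step 2 comment above; real tables need column detection and merged-cell handling on top of this:

def group_into_rows(items, y_tolerance=10):
    # Cluster OCR items into rows by the vertical center of their boxes,
    # then sort each row left to right.
    rows = []
    for item in sorted(items, key=lambda it: it["bbox"][1]):
        y_center = (item["bbox"][1] + item["bbox"][3]) / 2
        for row in rows:
            if abs(row["y"] - y_center) < y_tolerance:
                row["items"].append(item)
                break
        else:
            rows.append({"y": y_center, "items": [item]})
    for row in rows:
        row["items"].sort(key=lambda it: it["bbox"][0])
    return [[it["text"] for it in row["items"]] for row in rows]

# e.g. [['Web Development', '40', '$150.00', '$6,000.00'], ...]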

Why GPT-4o Changes This

Vision-language models like GPT-4o collapse Steps 2-6 into a single prompt:

import base64

from openai import OpenAI

client = OpenAI()

# Encode the scanned image for the API request
with open("scanned.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": """Extract line items from this invoice as JSON:
{"line_items": [{"description": str, "qty": int, "price": float, "total": float}]}"""},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}}
        ]
    }],
    response_format={"type": "json_object"}
)

# Returns structured JSON directly - no table detection needed

GPT-4o understands that "40" in the Qty column relates to "Web Development" in the Description column, even though they're spatially separated. This is document understanding, not just text extraction.
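A minimal sketch of consuming that response, assuming the model followed the schema requested in the prompt:

import json

data = json.loads(response.choices[0].message.content)

for item in data.get("line_items", []):
    print(item["description"], item["qty"], item["price"], item["total"])
# e.g. Web Development 40 150.0 6000.0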

When to Use What

  • PaddleOCR + Custom Logic: high volume, consistent layouts, cost-sensitive. Build your own table parser for your specific document format.
  • GPT-4o: variable layouts, complex tables, need for semantic understanding. ~$0.015/image but handles edge cases automatically.
  • Specialized Document AI: AWS Textract, Google Document AI, Azure Form Recognizer. Middle ground: structured output without LLM costs.

See our OCR benchmarks comparison for how different models perform on various document types, including tables and forms.

Browser vs Server

Approach               | Pros                                                              | Cons
OpenCV.js (browser)    | No server needed, instant preview, privacy (images stay local)    | 5MB download, slower than native, limited to what JS can do
Python/OpenCV (server) | Fast processing, full OpenCV features, can chain with OCR         | Requires backend, upload latency, server costs
GPT-4o Vision          | Can extract text directly without scanning, handles messy images  | ~$0.01-0.02/image, requires API call

For simple scanning (receipts, documents), the browser approach works well. For high-volume processing or integration with OCR pipelines, use server-side Python.
