VideoText

Video OCR

Extract on-screen text from video frames for subtitles, broadcast monitoring, and compliance.

How Video OCR Works

Video OCR extends image-based text recognition to handle the temporal dimension. The challenge is not just reading text in each frame, but understanding which detections represent the same text across time, and efficiently processing hours of footage without redundant work.

1. The Video OCR Pipeline

Picture a video as a stream of images flowing through a series of processing stations. Unlike single-image OCR, we must decide which frames deserve attention, track text entities as they persist or change, and consolidate redundant detections into clean output.

• Frame Selection: choose which frames to process
• Text Detection: find text regions in each frame
• Text Recognition: convert detected regions to text
• Tracking: associate text across frames
• Deduplication: remove redundant detections
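
Before walking through each stage, it helps to pin down the data flowing between them. A minimal sketch, using hypothetical Detection and Track types of our own (not from any particular library):

from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Detection:
    """One OCR hit in one frame."""
    frame_idx: int
    text: str
    bbox: tuple   # (x, y, w, h) in pixels
    conf: float

@dataclass
class Track:
    """One text entity persisting across frames."""
    detections: list = field(default_factory=list)

    def consensus_text(self):
        # Majority vote over per-frame readings (see section 4)
        return Counter(d.text for d in self.detections).most_common(1)[0][0]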

The Core Challenge

A 1-hour video at 30fps contains 108,000 frames. Running OCR on every frame is wasteful (most frames are nearly identical) and produces massive redundancy. The art of video OCR lies in intelligent sampling and temporal aggregation: process fewer frames, but get better results by reasoning about text persistence across time.

2. Frame Selection Strategies

Which frames should we process? The answer depends on your content. A news broadcast with static lower-thirds needs different sampling than a sports game with moving scoreboards.

Every N Frames

Process every Nth frame uniformly

Advantages
  • Simple to implement
  • Predictable processing time
  • Works for any video
Disadvantages
  • May miss short-lived text
  • Wastes compute on static scenes
Implementation
# Sample every 30 frames (one per second at 30 fps).
# Assumes cap is an open cv2.VideoCapture, total_frames its
# frame count, and ocr any OCR engine with a process() method.
frame_indices = range(0, total_frames, 30)
for idx in frame_indices:
    cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
    ret, frame = cap.read()
    if ret:
        results.append(ocr.process(frame))

Sampling Visualization

[Visualization: a timeline of 30 frames (frame 0 through frame 29) with every 5th frame highlighted green as processed, 6 frames processed in total.]
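
Uniform sampling is only the baseline. The quick reference later also lists scene-change detection; here is a minimal sketch of that idea, comparing each frame against the last processed one with a mean-absolute-difference heuristic (the threshold and thumbnail size are assumptions you would tune per content):

import cv2
import numpy as np

def scene_change_frames(video_path, diff_threshold=30.0):
    """Yield (frame_idx, frame) whenever the frame differs
    noticeably from the last frame we yielded."""
    cap = cv2.VideoCapture(video_path)
    prev = None
    frame_idx = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        # A coarse grayscale thumbnail is enough for change detection
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        small = cv2.resize(gray, (64, 36)).astype(np.int16)
        if prev is None or np.abs(small - prev).mean() > diff_threshold:
            yield frame_idx, frame
            prev = small
        frame_idx += 1
    cap.release()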
3. Text Tracking Across Frames

When the same text appears in multiple frames, we want to recognize it as a single entity, not count it multiple times. Text tracking associates detections across time to build coherent text "tracks."

Interactive Demo: Watch Text Tracking

[Interactive demo: two text tracks followed across frames, a static title track "BREAKING NEWS" (conf 98%) and a scrolling ticker track "Markets rally on strong earnings..." (conf 95%).]
Bounding Box IoU
Track text regions by overlap between frames.
Complexity: Simple. Good for static text.

Text Similarity
Match regions by comparing recognized text content.
Complexity: Medium. Robust to position changes.

Embedding Distance
Use visual embeddings to match text regions.
Complexity: High. Best for varying fonts/styles.

Optical Flow
Predict text region motion between frames.
Complexity: High. Best for moving text.
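
A minimal sketch of the simplest of these, IoU matching, where a track is just a list of detection dicts in the same shape the Tesseract example in section 6 produces. The greedy assignment below is a simplification; a production tracker would typically solve the assignment optimally (e.g. the Hungarian algorithm):

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def update_tracks(tracks, detections, iou_threshold=0.5):
    """Attach each detection to the track whose latest box
    overlaps it best, or start a new track."""
    for det in detections:
        best, best_score = None, iou_threshold
        for track in tracks:
            score = iou(track[-1]['bbox'], det['bbox'])
            if score > best_score:
                best, best_score = track, score
        if best is not None:
            best.append(det)
        else:
            tracks.append([det])
    return tracks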
4. Temporal Consistency

OCR is imperfect. The same text might be recognized as "BREAKING NEWS" in one frame and "BREAK1NG NEWS" in another. Temporal consistency uses multiple observations to correct errors and boost confidence.

Without Temporal Consistency

Frame 1: "BREAKING NEWS"
Frame 2: "BREAK1NG NEWS"
Frame 3: "BREAKIN6 NEWS"
Frame 4: "BREAKING NEWS"
Output: 3 distinct strings across 4 separate entries

With Temporal Consistency

Consensus: "BREAKING NEWS"
Seen in 4 frames, 2 with exact match
Voting result: "BREAKING NEWS" x2, "BREAK1NG NEWS" x1, "BREAKIN6 NEWS" x1
Output: 1 corrected string, confidence boosted

Temporal Consistency Techniques

Majority Voting

Count character-level or word-level occurrences across frames. Choose the most frequent spelling.

Confidence Weighting

Weight each observation by its OCR confidence score. Higher confidence detections have more influence.

Language Model Correction

Use spell checkers or language models to prefer valid words over OCR artifacts.
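
These techniques compose naturally. A minimal sketch combining the first two, a string-level majority vote with summed OCR confidence as the tie-breaker (character-level voting and language-model correction are left out for brevity):

from collections import Counter

def consensus_text(observations):
    """observations: list of (text, conf) pairs for one track.
    Returns the most frequent reading, breaking ties by
    summed confidence."""
    counts = Counter(text for text, _ in observations)
    ranked = counts.most_common()
    best_count = ranked[0][1]
    candidates = [t for t, c in ranked if c == best_count]
    if len(candidates) == 1:
        return candidates[0]
    conf_sum = {t: sum(c for s, c in observations if s == t)
                for t in candidates}
    return max(candidates, key=conf_sum.get)

# The example above: voting recovers the correct string
obs = [("BREAKING NEWS", 0.98), ("BREAK1NG NEWS", 0.91),
       ("BREAKIN6 NEWS", 0.89), ("BREAKING NEWS", 0.97)]
assert consensus_text(obs) == "BREAKING NEWS"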

5. Deduplication Strategies

After tracking text across frames, we often have multiple observations of the same content. Deduplication consolidates these into unique entries with temporal metadata.

From Raw Detections to Unique Texts

Raw Detections (6 frames)
0.0s  BREAKING NEWS
0.0s  Markets rally on strong earnings...
1.0s  BREAKING NEWS
1.0s  ...earnings reports exceed expectations
2.0s  BREAKING NEWS
2.0s  ...expectations. Tech sector leads gains
3.0s  INTERVIEW: CEO SPEAKS
3.0s  John Smith, CEO TechCorp
4.0s  INTERVIEW: CEO SPEAKS
4.0s  John Smith, CEO TechCorp
5.0s  INTERVIEW: CEO SPEAKS
5.0s  John Smith, CEO TechCorp
5.0s  "Innovation drives growth"
Total: 13 detections
Unique Texts (after exact-match deduplication)
BREAKING NEWS (0.0s-2.0s, 3 detections)
Markets rally on strong earnings... (0.0s)
...earnings reports exceed expectations (1.0s)
...expectations. Tech sector leads gains (2.0s)
INTERVIEW: CEO SPEAKS (3.0s-5.0s, 3 detections)
John Smith, CEO TechCorp (3.0s-5.0s, 3 detections)
"Innovation drives growth" (5.0s)
Total: 7 unique texts
Exact Match
Remove identical text strings.
Use case: static overlays, watermarks

Edit Distance
Merge texts within N character edits.
Use case: OCR error correction

Time Window
Keep only one instance within N seconds.
Use case: scrolling tickers, captions

Confidence Voting
Keep the highest-confidence version across detections.
Use case: quality optimization
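
The exact-match and time-window strategies appear in the code in section 6. For edit-distance merging, here is a minimal sketch using the standard library's difflib similarity ratio in place of a hard edit count (the 0.9 threshold is an assumption to tune):

from difflib import SequenceMatcher

def merge_similar(texts, min_ratio=0.9):
    """Collapse near-duplicate strings (typical OCR variants)
    into one representative each, with a count."""
    groups = []  # each group: [representative, count]
    for text in texts:
        for group in groups:
            ratio = SequenceMatcher(None, text.lower(),
                                    group[0].lower()).ratio()
            if ratio >= min_ratio:
                group[1] += 1
                break
        else:
            groups.append([text, 1])
    return groups

print(merge_similar(["BREAKING NEWS", "BREAK1NG NEWS", "BREAKIN6 NEWS"]))
# [['BREAKING NEWS', 3]]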
6. Code Examples

Complete implementations for video OCR with tracking, temporal consistency, and deduplication.

OpenCV + Tesseract (classic approach)
pip install opencv-python pytesseract
import cv2
import pytesseract

def extract_text_from_video(video_path, sample_rate=30):
    """Extract text from video sampling every N frames."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)

    results = []
    frame_idx = 0

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        # Process every Nth frame
        if frame_idx % sample_rate == 0:
            # Preprocess for OCR
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            gray = cv2.threshold(gray, 0, 255,
                cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

            # Run Tesseract with bounding boxes
            data = pytesseract.image_to_data(
                gray, output_type=pytesseract.Output.DICT
            )

            # Parse results
            timestamp = frame_idx / fps
            for i, text in enumerate(data['text']):
                # conf may be a string in older pytesseract
                # versions; -1 marks non-text rows
                if float(data['conf'][i]) > 60 and text.strip():
                    results.append({
                        'time': timestamp,
                        'text': text,
                        'bbox': (data['left'][i], data['top'][i],
                                data['width'][i], data['height'][i]),
                        'conf': data['conf'][i]
                    })

        frame_idx += 1

    cap.release()
    return results

# Deduplicate results
def deduplicate_text(results, time_window=2.0):
    """Collapse repeated sightings of the same text. A detection
    joins the current entry if the text was last seen within
    time_window seconds; otherwise it starts a new entry."""
    seen = {}  # text -> (last_time, best_conf, best_data)
    unique = []

    for r in results:
        text = r['text'].lower().strip()
        if text in seen:
            last_time, best_conf, best = seen[text]
            if r['time'] - last_time > time_window:
                # Reappearance after a gap: close the old entry
                unique.append(best)
                seen[text] = (r['time'], r['conf'], r)
            elif r['conf'] > best_conf:
                # Same occurrence, better reading: keep it
                seen[text] = (r['time'], r['conf'], r)
            else:
                # Same occurrence, worse reading: refresh the
                # window so continuously visible text stays one entry
                seen[text] = (r['time'], best_conf, best)
        else:
            seen[text] = (r['time'], r['conf'], r)

    # Close any entries still open
    for _, _, best in seen.values():
        unique.append(best)

    return sorted(unique, key=lambda x: x['time'])
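
A typical invocation of the two functions above (the file name is a placeholder):

detections = extract_text_from_video("broadcast.mp4", sample_rate=30)
unique = deduplicate_text(detections, time_window=2.0)
for d in unique:
    print(f"{d['time']:6.1f}s  {d['text']}  (conf {d['conf']})")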

Quick Reference

Frame Selection
  • Every N frames (simple)
  • Scene change detection
  • Keyframe extraction
  • Motion-based selection
Text Tracking
  • IoU-based matching
  • Text similarity matching
  • Embedding distance
  • Optical flow prediction
Best Practices
  • Sample 1-2 FPS for static text
  • Use scene detection for efficiency
  • Apply temporal voting
  • Deduplicate with time windows

The Bottom Line

Video OCR is not about running image OCR faster; it is about being smarter with temporal information. By sampling strategically, tracking text across frames, enforcing consistency, and deduplicating results, you can process hours of video efficiently while producing cleaner, more accurate transcripts than frame-by-frame processing ever could.

Use Cases

  • Subtitle generation
  • Broadcast compliance
  • Ad intelligence
  • Sports score overlays

Architectural Patterns

Frame Sampling + OCR

Sample frames, run OCR, merge temporally.

Detection + Tracking + OCR

Track text regions across frames for stability.

Implementations

Open Source

PaddleOCR + FFmpeg (Apache 2.0)
High-quality frame OCR pipeline.

EasyOCR (Apache 2.0)
Lightweight OCR for many scripts.

TrOCR + SAM tracking (MIT)
Recognition paired with segmentation tracking.

Benchmarks

Quick Facts

Input: Video
Output: Text
Implementations: 3 open source, 0 API
Patterns: 2 approaches

Have benchmark data?

Help us track the state of the art for video OCR.
Submit Results