VideoText

Video OCR

Extract on-screen text from video frames for subtitles, broadcast monitoring, and compliance.

How Video OCR Works

Video OCR extends image-based text recognition to handle the temporal dimension. The challenge is not just reading text in each frame, but understanding which detections represent the same text across time, and efficiently processing hours of footage without redundant work.

1. The Video OCR Pipeline

Picture a video as a stream of images flowing through a series of processing stations. Unlike single-image OCR, we must decide which frames deserve attention, track text entities as they persist or change, and consolidate redundant detections into clean output.

• Frame Selection: choose which frames to process
• Text Detection: find text regions in each frame
• Text Recognition: convert detected regions to text
• Tracking: associate text across frames
• Deduplication: remove redundant detections
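
Before walking through each stage, it helps to pin down the data flowing between them. A minimal sketch, using hypothetical Detection and Track types of our own (not from any particular library):

from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Detection:
    """One OCR hit in one frame."""
    frame_idx: int
    text: str
    bbox: tuple   # (x, y, w, h) in pixels
    conf: float

@dataclass
class Track:
    """One text entity persisting across frames."""
    detections: list = field(default_factory=list)

    def consensus_text(self):
        # Majority vote over per-frame readings (see section 4)
        return Counter(d.text for d in self.detections).most_common(1)[0][0]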

The Core Challenge

A 1-hour video at 30fps contains 108,000 frames. Running OCR on every frame is wasteful (most frames are nearly identical) and produces massive redundancy. The art of video OCR lies in intelligent sampling and temporal aggregation: process fewer frames, but get better results by reasoning about text persistence across time.

2. Frame Selection Strategies

Which frames should we process? The answer depends on your content. A news broadcast with static lower-thirds needs different sampling than a sports game with moving scoreboards.

Every N Frames

Process every Nth frame uniformly

Advantages
  • Simple to implement
  • Predictable processing time
  • Works for any video
Disadvantages
  • May miss short-lived text
  • Wastes compute on static scenes
Implementation
# Sample every 30 frames (one per second at 30 fps).
# Assumes cap is an open cv2.VideoCapture, total_frames its
# frame count, and ocr any OCR engine with a process() method.
frame_indices = range(0, total_frames, 30)
for idx in frame_indices:
    cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
    ret, frame = cap.read()
    if ret:
        results.append(ocr.process(frame))

Sampling Visualization

[Visualization: a timeline of 30 frames (frame 0 through frame 29) with every 5th frame highlighted green as processed, 6 frames processed in total.]
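
Uniform sampling is only the baseline. The quick reference later also lists scene-change detection; here is a minimal sketch of that idea, comparing each frame against the last processed one with a mean-absolute-difference heuristic (the threshold and thumbnail size are assumptions you would tune per content):

import cv2
import numpy as np

def scene_change_frames(video_path, diff_threshold=30.0):
    """Yield (frame_idx, frame) whenever the frame differs
    noticeably from the last frame we yielded."""
    cap = cv2.VideoCapture(video_path)
    prev = None
    frame_idx = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        # A coarse grayscale thumbnail is enough for change detection
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        small = cv2.resize(gray, (64, 36)).astype(np.int16)
        if prev is None or np.abs(small - prev).mean() > diff_threshold:
            yield frame_idx, frame
            prev = small
        frame_idx += 1
    cap.release()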
3. Text Tracking Across Frames

When the same text appears in multiple frames, we want to recognize it as a single entity, not count it multiple times. Text tracking associates detections across time to build coherent text "tracks."

Interactive Demo: Watch Text Tracking

[Interactive demo: two text tracks followed across frames, a static title track "BREAKING NEWS" (conf 98%) and a scrolling ticker track "Markets rally on strong earnings..." (conf 95%).]
Bounding Box IoU
Track text regions by overlap between frames.
Complexity: Simple. Good for static text.

Text Similarity
Match regions by comparing recognized text content.
Complexity: Medium. Robust to position changes.

Embedding Distance
Use visual embeddings to match text regions.
Complexity: High. Best for varying fonts/styles.

Optical Flow
Predict text region motion between frames.
Complexity: High. Best for moving text.
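
A minimal sketch of the simplest of these, IoU matching, where a track is just a list of detection dicts in the same shape the Tesseract example in section 6 produces. The greedy assignment below is a simplification; a production tracker would typically solve the assignment optimally (e.g. the Hungarian algorithm):

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def update_tracks(tracks, detections, iou_threshold=0.5):
    """Attach each detection to the track whose latest box
    overlaps it best, or start a new track."""
    for det in detections:
        best, best_score = None, iou_threshold
        for track in tracks:
            score = iou(track[-1]['bbox'], det['bbox'])
            if score > best_score:
                best, best_score = track, score
        if best is not None:
            best.append(det)
        else:
            tracks.append([det])
    return tracks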
4. Temporal Consistency

OCR is imperfect. The same text might be recognized as "BREAKING NEWS" in one frame and "BREAK1NG NEWS" in another. Temporal consistency uses multiple observations to correct errors and boost confidence.

Without Temporal Consistency

Frame 1: "BREAKING NEWS"
Frame 2: "BREAK1NG NEWS"
Frame 3: "BREAKIN6 NEWS"
Frame 4: "BREAKING NEWS"
Output: 3 distinct strings across 4 separate entries

With Temporal Consistency

Consensus: "BREAKING NEWS"
Seen in 4 frames, 2 with exact match
Voting result: "BREAKING NEWS" x2, "BREAK1NG NEWS" x1, "BREAKIN6 NEWS" x1
Output: 1 corrected string, confidence boosted

Temporal Consistency Techniques

Majority Voting

Count character-level or word-level occurrences across frames. Choose the most frequent spelling.

Confidence Weighting

Weight each observation by its OCR confidence score. Higher confidence detections have more influence.

Language Model Correction

Use spell checkers or language models to prefer valid words over OCR artifacts.
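
These techniques compose naturally. A minimal sketch combining the first two, a string-level majority vote with summed OCR confidence as the tie-breaker (character-level voting and language-model correction are left out for brevity):

from collections import Counter

def consensus_text(observations):
    """observations: list of (text, conf) pairs for one track.
    Returns the most frequent reading, breaking ties by
    summed confidence."""
    counts = Counter(text for text, _ in observations)
    ranked = counts.most_common()
    best_count = ranked[0][1]
    candidates = [t for t, c in ranked if c == best_count]
    if len(candidates) == 1:
        return candidates[0]
    conf_sum = {t: sum(c for s, c in observations if s == t)
                for t in candidates}
    return max(candidates, key=conf_sum.get)

# The example above: voting recovers the correct string
obs = [("BREAKING NEWS", 0.98), ("BREAK1NG NEWS", 0.91),
       ("BREAKIN6 NEWS", 0.89), ("BREAKING NEWS", 0.97)]
assert consensus_text(obs) == "BREAKING NEWS"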

5. Deduplication Strategies

After tracking text across frames, we often have multiple observations of the same content. Deduplication consolidates these into unique entries with temporal metadata.

From Raw Detections to Unique Texts

Raw Detections (6 frames)
0.0s  BREAKING NEWS
0.0s  Markets rally on strong earnings...
1.0s  BREAKING NEWS
1.0s  ...earnings reports exceed expectations
2.0s  BREAKING NEWS
2.0s  ...expectations. Tech sector leads gains
3.0s  INTERVIEW: CEO SPEAKS
3.0s  John Smith, CEO TechCorp
4.0s  INTERVIEW: CEO SPEAKS
4.0s  John Smith, CEO TechCorp
5.0s  INTERVIEW: CEO SPEAKS
5.0s  John Smith, CEO TechCorp
5.0s  "Innovation drives growth"
Total: 13 detections
Unique Texts (after exact-match deduplication)
BREAKING NEWS (0.0s-2.0s, 3 detections)
Markets rally on strong earnings... (0.0s)
...earnings reports exceed expectations (1.0s)
...expectations. Tech sector leads gains (2.0s)
INTERVIEW: CEO SPEAKS (3.0s-5.0s, 3 detections)
John Smith, CEO TechCorp (3.0s-5.0s, 3 detections)
"Innovation drives growth" (5.0s)
Total: 7 unique texts
Exact Match
Remove identical text strings.
Use case: static overlays, watermarks

Edit Distance
Merge texts within N character edits.
Use case: OCR error correction

Time Window
Keep only one instance within N seconds.
Use case: scrolling tickers, captions

Confidence Voting
Keep the highest-confidence version across detections.
Use case: quality optimization
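
The exact-match and time-window strategies appear in the code in section 6. For edit-distance merging, here is a minimal sketch using the standard library's difflib similarity ratio in place of a hard edit count (the 0.9 threshold is an assumption to tune):

from difflib import SequenceMatcher

def merge_similar(texts, min_ratio=0.9):
    """Collapse near-duplicate strings (typical OCR variants)
    into one representative each, with a count."""
    groups = []  # each group: [representative, count]
    for text in texts:
        for group in groups:
            ratio = SequenceMatcher(None, text.lower(),
                                    group[0].lower()).ratio()
            if ratio >= min_ratio:
                group[1] += 1
                break
        else:
            groups.append([text, 1])
    return groups

print(merge_similar(["BREAKING NEWS", "BREAK1NG NEWS", "BREAKIN6 NEWS"]))
# [['BREAKING NEWS', 3]]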
6. Code Examples

Complete implementations for video OCR with tracking, temporal consistency, and deduplication.

OpenCV + Tesseract (classic approach)
pip install opencv-python pytesseract
import cv2
import pytesseract

def extract_text_from_video(video_path, sample_rate=30):
    """Extract text from video sampling every N frames."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)

    results = []
    frame_idx = 0

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        # Process every Nth frame
        if frame_idx % sample_rate == 0:
            # Preprocess for OCR
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            gray = cv2.threshold(gray, 0, 255,
                cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

            # Run Tesseract with bounding boxes
            data = pytesseract.image_to_data(
                gray, output_type=pytesseract.Output.DICT
            )

            # Parse results
            timestamp = frame_idx / fps
            for i, text in enumerate(data['text']):
                # conf may be a string in older pytesseract
                # versions; -1 marks non-text rows
                if float(data['conf'][i]) > 60 and text.strip():
                    results.append({
                        'time': timestamp,
                        'text': text,
                        'bbox': (data['left'][i], data['top'][i],
                                data['width'][i], data['height'][i]),
                        'conf': data['conf'][i]
                    })

        frame_idx += 1

    cap.release()
    return results

# Deduplicate results
def deduplicate_text(results, time_window=2.0):
    """Collapse repeated sightings of the same text. A detection
    joins the current entry if the text was last seen within
    time_window seconds; otherwise it starts a new entry."""
    seen = {}  # text -> (last_time, best_conf, best_data)
    unique = []

    for r in results:
        text = r['text'].lower().strip()
        if text in seen:
            last_time, best_conf, best = seen[text]
            if r['time'] - last_time > time_window:
                # Reappearance after a gap: close the old entry
                unique.append(best)
                seen[text] = (r['time'], r['conf'], r)
            elif r['conf'] > best_conf:
                # Same occurrence, better reading: keep it
                seen[text] = (r['time'], r['conf'], r)
            else:
                # Same occurrence, worse reading: refresh the
                # window so continuously visible text stays one entry
                seen[text] = (r['time'], best_conf, best)
        else:
            seen[text] = (r['time'], r['conf'], r)

    # Close any entries still open
    for _, _, best in seen.values():
        unique.append(best)

    return sorted(unique, key=lambda x: x['time'])
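
A typical invocation of the two functions above (the file name is a placeholder):

detections = extract_text_from_video("broadcast.mp4", sample_rate=30)
unique = deduplicate_text(detections, time_window=2.0)
for d in unique:
    print(f"{d['time']:6.1f}s  {d['text']}  (conf {d['conf']})")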

Quick Reference

Frame Selection
  • Every N frames (simple)
  • Scene change detection
  • Keyframe extraction
  • Motion-based selection
Text Tracking
  • IoU-based matching
  • Text similarity matching
  • Embedding distance
  • Optical flow prediction
Best Practices
  • Sample 1-2 FPS for static text
  • Use scene detection for efficiency
  • Apply temporal voting
  • Deduplicate with time windows

The Bottom Line

Video OCR is not about running image OCR faster; it is about being smarter with temporal information. By sampling strategically, tracking text across frames, enforcing consistency, and deduplicating results, you can process hours of video efficiently while producing cleaner, more accurate transcripts than frame-by-frame processing ever could.

Use Cases

  • Subtitle generation
  • Broadcast compliance
  • Ad intelligence
  • Sports score overlays

Architectural Patterns

Frame Sampling + OCR

Sample frames, run OCR, merge temporally.

Detection + Tracking + OCR

Track text regions across frames for stability.

Implementations

Open Source

PaddleOCR + FFmpeg (Apache 2.0)
High-quality frame OCR pipeline.

EasyOCR (Apache 2.0)
Lightweight OCR for many scripts.

TrOCR + SAM tracking (MIT)
Recognition paired with segmentation tracking.

Benchmarks

Quick Facts

Input: Video
Output: Text
Implementations: 3 open source, 0 API
Patterns: 2 approaches

Have benchmark data?

Help us track the state of the art for video OCR.
Submit Results