Video OCR
Extract on-screen text from video frames for subtitles, broadcast monitoring, and compliance.
How Video OCR Works
Video OCR extends image-based text recognition to handle the temporal dimension. The challenge is not just reading text in each frame, but understanding which detections represent the same text across time, and efficiently processing hours of footage without redundant work.
The Video OCR Pipeline
Picture a video as a stream of images flowing through a series of processing stations. Unlike single-image OCR, we must decide which frames deserve attention, track text entities as they persist or change, and consolidate redundant detections into clean output.
The Core Challenge
A 1-hour video at 30fps contains 108,000 frames. Running OCR on every frame is wasteful (most frames are nearly identical) and produces massive redundancy. The art of video OCR lies in intelligent sampling and temporal aggregation: process fewer frames, but get better results by reasoning about text persistence across time.
Frame Selection Strategies
Which frames should we process? The answer depends on your content. A news broadcast with static lower-thirds needs different sampling than a sports game with moving scoreboards.
Every N Frames
Process every Nth frame uniformly.
Pros:
- Simple to implement
- Predictable processing time
- Works for any video
Cons:
- May miss short-lived text
- Wastes compute on static scenes
# Sample every 30 frames (1 per second at 30fps)
cap = cv2.VideoCapture(video_path)
total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
frame_indices = range(0, total_frames, 30)
results = []
for idx in frame_indices:
    cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek straight to the sampled frame
    ret, frame = cap.read()
    if ret:
        results.append(ocr.process(frame))  # `ocr`: any OCR engine wrapper
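Uniform sampling wastes compute on static scenes. An alternative listed in the quick reference below is scene-change detection: only process a frame when the picture meaningfully differs from the last one. A minimal sketch using mean absolute frame difference in OpenCV; the function name and the 12.0 threshold are illustrative assumptions to be tuned per content type.

import cv2
import numpy as np

def scene_change_frames(video_path, diff_threshold=12.0):
    """Yield (frame_index, frame) whenever the picture changes noticeably."""
    cap = cv2.VideoCapture(video_path)
    prev_gray, frame_idx = None, 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Fire on the first frame and on any large mean absolute difference
        if prev_gray is None or np.mean(cv2.absdiff(gray, prev_gray)) > diff_threshold:
            yield frame_idx, frame
        prev_gray = gray
        frame_idx += 1
    cap.release()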
Text Tracking Across Frames
When the same text appears in multiple frames, we want to recognize it as a single entity, not count it multiple times. Text tracking associates detections across time to build coherent text "tracks."
Common association strategies:
- Bounding Box IoU: track text regions by overlap between frames (sketched after this list).
- Text Similarity: match regions by comparing recognized text content.
- Embedding Distance: use visual embeddings to match text regions.
- Optical Flow: predict text region motion between frames.
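A minimal sketch of the IoU approach referenced above: each detection joins the existing track whose last bounding box overlaps it most, or starts a new track. The track dictionary layout and the 0.5 threshold are illustrative assumptions, not a fixed API.

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def update_tracks(tracks, detections, iou_threshold=0.5):
    """Attach each detection to the best-overlapping track, else start a new one."""
    for det in detections:  # det: {'bbox': (x, y, w, h), 'text': ..., 'conf': ...}
        best = max(tracks, key=lambda t: iou(t['bbox'], det['bbox']), default=None)
        if best is not None and iou(best['bbox'], det['bbox']) >= iou_threshold:
            best['bbox'] = det['bbox']        # follow the region as it moves
            best['observations'].append(det)  # keep every reading for later voting
        else:
            tracks.append({'bbox': det['bbox'], 'observations': [det]})
    return tracks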
Temporal Consistency
OCR is imperfect. The same text might be recognized as "BREAKING NEWS" in one frame and "BREAK1NG NEWS" in another. Temporal consistency uses multiple observations to correct errors and boost confidence.
Temporal Consistency Techniques
- Majority voting: count character-level or word-level occurrences across frames and choose the most frequent spelling.
- Confidence weighting: weight each observation by its OCR confidence score, so higher-confidence detections have more influence (both voting styles are sketched after this list).
- Dictionary correction: use spell checkers or language models to prefer valid words over OCR artifacts.
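A minimal sketch combining the first two techniques: a track's observations vote on the final string, with each vote weighted by its OCR confidence, so one clean high-confidence reading can outweigh several garbled ones. The observation format follows the tracking sketch above and is an assumption.

from collections import defaultdict

def consolidate_track_text(observations):
    """Pick one string per track by confidence-weighted voting."""
    votes = defaultdict(float)
    for obs in observations:  # obs: {'text': 'BREAKING NEWS', 'conf': 0-100}
        votes[obs['text'].strip()] += obs['conf']
    # The spelling with the highest total confidence wins
    return max(votes, key=votes.get)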
Deduplication Strategies
After tracking text across frames, we often have multiple observations of the same content. Deduplication consolidates these into unique entries with temporal metadata.
From Raw Detections to Unique Texts
- Exact Match: remove identical text strings.
- Edit Distance: merge texts within N character edits (sketched after this list).
- Time Window: keep only one instance within N seconds.
- Confidence Voting: keep the highest-confidence version across detections.
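A minimal sketch of edit-distance merging. It uses difflib.SequenceMatcher from the standard library as a stand-in for a true Levenshtein distance, and the 0.9 similarity threshold is an illustrative assumption.

from difflib import SequenceMatcher

def merge_near_duplicates(texts, min_ratio=0.9):
    """Collapse strings that are near-identical under fuzzy matching."""
    unique = []
    for text in texts:
        for i, kept in enumerate(unique):
            if SequenceMatcher(None, text.lower(), kept.lower()).ratio() >= min_ratio:
                if len(text) > len(kept):
                    unique[i] = text  # keep the longer variant as more complete
                break
        else:
            unique.append(text)
    return unique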
Code Examples
Complete implementations for video OCR with tracking, temporal consistency, and deduplication.
import cv2
import pytesseract

def extract_text_from_video(video_path, sample_rate=30):
    """Extract text from a video, sampling every N frames."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    results = []
    frame_idx = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        # Process every Nth frame
        if frame_idx % sample_rate == 0:
            # Preprocess for OCR
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            gray = cv2.threshold(gray, 0, 255,
                                 cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
            # Run Tesseract with bounding boxes
            data = pytesseract.image_to_data(
                gray, output_type=pytesseract.Output.DICT
            )
            # Parse results; conf is a string in some pytesseract versions
            timestamp = frame_idx / fps
            for i, text in enumerate(data['text']):
                conf = float(data['conf'][i])
                if conf > 60 and text.strip():
                    results.append({
                        'time': timestamp,
                        'text': text,
                        'bbox': (data['left'][i], data['top'][i],
                                 data['width'][i], data['height'][i]),
                        'conf': conf
                    })
        frame_idx += 1
    cap.release()
    return results

# Deduplicate results
def deduplicate_text(results, time_window=2.0):
    """Remove duplicate text within a time window."""
    seen = {}  # text -> (last_time, best_conf, data)
    unique = []
    for r in results:
        text = r['text'].lower().strip()
        if text in seen:
            last_time, best_conf, _ = seen[text]
            if r['time'] - last_time > time_window:
                # New occurrence after the window: emit the previous one
                unique.append(seen[text][2])
                seen[text] = (r['time'], r['conf'], r)
            elif r['conf'] > best_conf:
                # Better confidence within the same window
                seen[text] = (r['time'], r['conf'], r)
        else:
            seen[text] = (r['time'], r['conf'], r)
    # Add whatever is still pending
    for text, (_, _, data) in seen.items():
        unique.append(data)
    return sorted(unique, key=lambda x: x['time'])

Quick Reference
Frame selection:
- Every N frames (simple)
- Scene change detection
- Keyframe extraction
- Motion-based selection

Text tracking:
- IoU-based matching
- Text similarity matching
- Embedding distance
- Optical flow prediction

Best practices:
- Sample 1-2 FPS for static text
- Use scene detection for efficiency
- Apply temporal voting
- Deduplicate with time windows
The Bottom Line
Video OCR is not about running image OCR faster; it is about being smarter with temporal information. By sampling strategically, tracking text across frames, enforcing consistency, and deduplicating results, you can process hours of video efficiently while producing cleaner, more accurate transcripts than frame-by-frame processing ever could.
Use Cases
- ✓ Subtitle generation
- ✓ Broadcast compliance
- ✓ Ad intelligence
- ✓ Sports score overlays
Architectural Patterns
Frame Sampling + OCR
Sample frames, run OCR, merge temporally.
Detection + Tracking + OCR
Track text regions across frames for stability.
Quick Facts
- Input: Video
- Output: Text
- Implementations: 3 open source, 0 API
- Patterns: 2 approaches