Solving the "impeccable OCR" myth with a user-in-the-loop mobile interface.
There really isn't an "impeccable" OCR solution, especially on mobile video feeds.
A classic example: The letter H, when viewed at a slightly skewed angle, will often "flicker" into an N and back again in the OCR results. This creates a jittery, unreliable data stream that frustrates users.
Instead of trying to force the OCR model to be perfect, we accept the uncertainty and build a UI that empowers the user to resolve it quickly. This solution uses Google MLKit to scan and correlate text regions, spawning interactive dropdowns directly on the camera feed.
The system tracks text regions across frames. When it detects a region "flickering" between multiple values (e.g., "HELLO" vs "NELLO"), it aggregates these candidates instead of just showing the latest one.
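The aggregation step can be sketched in a few lines. This is a minimal illustration, not the actual implementation: the `CandidateHistory` class and its method names are hypothetical, and ordering candidates by how often they were seen is an assumption rather than something the system is stated to do.

```python
from collections import Counter


class CandidateHistory:
    """Collects every text value observed for one tracked region (hypothetical helper)."""

    def __init__(self) -> None:
        self.counts = Counter()

    def observe(self, text: str) -> None:
        # Record one per-frame OCR reading for this region.
        self.counts[text] += 1

    def is_flickering(self) -> bool:
        # More than one distinct reading means the OCR result is unstable.
        return len(self.counts) > 1

    def candidates(self) -> list[str]:
        # Assumption: most frequently seen reading first.
        return [text for text, _ in self.counts.most_common()]


history = CandidateHistory()
for reading in ["HELLO", "NELLO", "HELLO", "HELLO", "NELLO"]:
    history.observe(reading)

print(history.is_flickering())
print(history.candidates())
```

Keeping counts rather than a plain set lets the UI put the most likely reading at the top of the list, while every value that was ever detected stays available.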
A dropdown or selection UI is spawned directly on top of the text in the photo. The user can tap to confirm the correct value ("H" or "N") from the detected candidates.
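The state behind one such dropdown might look like the sketch below. `RegionSelection` and its fields are hypothetical names, and two behaviors are assumptions made for illustration: the dropdown only appears when there is more than one option, and a confirmed choice stops later flicker from reopening the question.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RegionSelection:
    """Dropdown state for one tracked text region (hypothetical structure)."""

    options: list[str] = field(default_factory=list)
    confirmed: Optional[str] = None

    def offer(self, text: str) -> None:
        # Assumption: ignore new OCR readings once the user has chosen.
        if self.confirmed is None and text not in self.options:
            self.options.append(text)

    def needs_dropdown(self) -> bool:
        # Only show the dropdown while an ambiguity is unresolved.
        return self.confirmed is None and len(self.options) > 1

    def confirm(self, index: int) -> str:
        # Called when the user taps an entry in the dropdown.
        self.confirmed = self.options[index]
        return self.confirmed


selection = RegionSelection()
for reading in ["HELLO", "NELLO", "HELLO"]:
    selection.offer(reading)

print(selection.needs_dropdown())  # two distinct options are pending
choice = selection.confirm(0)      # user taps the first entry
selection.offer("NELLO")           # later flicker is ignored
print(choice, selection.needs_dropdown())
```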
The biggest challenge in overlaying UI on a live camera feed is movement. If the user moves their hand slightly, the UI must "stick" to the real-world object.
To stay resilient to camera drift, the system anchors each overlay to the center of the text region's bounding box.
This "anchoring into itself" technique allows the overlay to follow the text smoothly even as the camera drifts or the bounding box shape fluctuates slightly.
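A rough sketch of center-based anchoring follows. The `Box` and `Anchor` types, the `find_anchor_near` helper, the 40-pixel matching radius, and the 0.5 smoothing factor are all illustrative assumptions, not values taken from the actual system.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    left: float
    top: float
    right: float
    bottom: float

    def center(self) -> tuple[float, float]:
        # The middle of the text area serves as the anchor point.
        return ((self.left + self.right) / 2, (self.top + self.bottom) / 2)


@dataclass
class Anchor:
    x: float
    y: float

    def distance_to(self, point: tuple[float, float]) -> float:
        return math.hypot(self.x - point[0], self.y - point[1])

    def follow(self, point: tuple[float, float], smoothing: float = 0.5) -> None:
        # Assumption: move part of the way toward the new center to damp jitter.
        self.x += (point[0] - self.x) * smoothing
        self.y += (point[1] - self.y) * smoothing


def find_anchor_near(anchors, point, max_distance=40.0):
    # Return the closest anchor within max_distance, or None if none is close enough.
    best = min(anchors, key=lambda a: a.distance_to(point), default=None)
    if best is not None and best.distance_to(point) <= max_distance:
        return best
    return None


anchors = [Anchor(100.0, 50.0), Anchor(300.0, 200.0)]

# Next frame: the box for the same text has shifted and changed shape slightly.
frame_box = Box(left=58.0, top=42.0, right=150.0, bottom=62.0)
center = frame_box.center()
match = find_anchor_near(anchors, center)
if match is not None:
    match.follow(center)
print(match)
```

Because matching uses only the center, a box that grows or shrinks around the same text still resolves to the same anchor, and the overlay position changes gradually instead of jumping.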
Google MLKit on Android/iOS provides the raw text blocks with bounding boxes. The custom logic sits on top:
// Conceptual Logic
onFrame(image) {
    results = MLKit.detectText(image);
    for (block in results) {
        // Calculate stable center
        center = getCentroid(block.frame);
        // Find matching active tracker
        tracker = findTrackerNear(center);
        if (tracker) {
            // If text changed, add to candidate list.
            // Compare before update() overwrites tracker.text.
            if (tracker.text != block.text) {
                tracker.addCandidate(block.text);
            }
            tracker.update(center, block.text);
        } else {
            createTracker(center, block.text);
        }
    }
    // Update UI overlays based on trackers
    drawOverlays(trackers);
}

By maintaining a history of values for each tracked region, the system turns the "bug" of flickering into a "feature": a list of possible options for the user.