
Zero-Shot Object Detection

Zero-shot object detection finds and localizes objects described by free-form text, without any task-specific fine-tuning: the open-vocabulary goal of detection. Grounding DINO (2023) combined DINO's detection architecture with grounded pre-training to achieve state-of-the-art open-set detection, while OWL-ViT and YOLO-World took different paths to the same goal. The technical challenge is grounding language precisely enough to distinguish similar objects ("the red car" vs. "the blue car" in the same scene). This approach is rapidly replacing traditional closed-set detectors in production because it eliminates the most painful step: collecting and annotating domain-specific training data.
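The core mechanism shared by these models can be sketched without any particular library: a vision-language encoder embeds each candidate box and each text phrase into a shared space, and a box is labeled with the phrase whose embedding it is most similar to. The sketch below uses hypothetical stand-in embeddings (no real encoder is called); `ground_boxes` and the toy vectors are illustrative names, not any model's actual API.

```python
import numpy as np

def ground_boxes(region_embs: np.ndarray, text_embs: np.ndarray, phrases: list[str]):
    """Assign each region the phrase with the highest cosine similarity.

    region_embs: (n_regions, d) image-region embeddings
    text_embs:   (n_phrases, d) text embeddings for the free-form prompts
    """
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = r @ t.T  # (n_regions, n_phrases) cosine similarities
    best = sims.argmax(axis=1)
    return [(phrases[i], float(sims[j, i])) for j, i in enumerate(best)]

# Two toy regions and two prompts; the embeddings are made up for illustration.
regions = np.array([[0.9, 0.1], [0.2, 0.8]])
texts = np.array([[1.0, 0.0], [0.0, 1.0]])
print(ground_boxes(regions, texts, ["the red car", "the blue car"]))
```

Distinguishing "the red car" from "the blue car" reduces to which text embedding each region embedding lands closest to, which is why prompt-sensitive grounding is the hard part of the task.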

Datasets: 2 · Results: 3 · Canonical metric: mAP

Canonical Benchmark

LVIS Zero-Shot

Open-vocabulary object detection on 1,203 LVIS categories.

Primary metric: mAP

Top 10

Leading models on LVIS Zero-Shot.

| Rank | Model | AP | Year | Source |
|------|-------|----|------|--------|
| 1 | Grounding DINO 1.5 Pro | 47.6 | 2024 | paper |
| 2 | OWLv2 (ViT-L) | 44.6 | 2023 | paper |
| 3 | YOLO-World v2-X | 35.4 | 2024 | paper |
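The AP numbers above are computed by matching predicted boxes to ground truth by intersection-over-union (IoU). Below is a minimal single-class, single-image sketch of non-interpolated AP; full COCO/LVIS-style mAP additionally averages over IoU thresholds 0.50 to 0.95 and over all categories, so this is a simplification, not the official evaluator.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def average_precision(dets, gts, thr=0.5):
    """dets: list of (box, score); gts: list of boxes. One class, one image."""
    dets = sorted(dets, key=lambda d: -d[1])  # rank by confidence
    matched, tps = set(), []
    for box, _ in dets:
        # Greedily match each detection to the best unmatched ground truth.
        best_j, best_iou = -1, thr
        for j, g in enumerate(gts):
            if j in matched:
                continue
            o = iou(box, g)
            if o >= best_iou:
                best_j, best_iou = j, o
        if best_j >= 0:
            matched.add(best_j)
            tps.append(1)
        else:
            tps.append(0)
    # Non-interpolated AP: sum precision at each true-positive rank,
    # normalized by the number of ground-truth boxes.
    ap, tp = 0.0, 0
    for k, t in enumerate(tps, start=1):
        if t:
            tp += 1
            ap += tp / k
    return ap / len(gts) if gts else 0.0
```

For example, one perfect detection of a single ground-truth box gives AP = 1.0, while adding a higher-scoring false positive ahead of it drops AP to 0.5, since the true positive now arrives at precision 1/2.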

All datasets

2 datasets tracked for this task.


Run Inference

Looking to run a model? Hugging Face hosts inference for this task type.