
Zero-Shot Object Detection

Zero-shot object detection finds and localizes objects described by free-form text, without any task-specific fine-tuning: the open-vocabulary goal of detection. Grounding DINO (2023) combined DINO's detection architecture with grounded pre-training to achieve state-of-the-art open-set detection, while OWL-ViT and YOLO-World took different paths to the same goal. The technical challenge is grounding language precisely enough to distinguish similar objects ("the red car" vs. "the blue car" in the same scene). This approach is rapidly replacing traditional closed-set detectors in production because it eliminates the most painful step: collecting and annotating domain-specific training data.
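The core mechanism shared by these models can be sketched without any particular library: a vision-language encoder embeds each candidate box and each text phrase into a shared space, and a box is labeled with the phrase whose embedding it is most similar to. The sketch below uses hypothetical stand-in embeddings (no real encoder is called); `ground_boxes` and the toy vectors are illustrative names, not any model's actual API.

```python
import numpy as np

def ground_boxes(region_embs: np.ndarray, text_embs: np.ndarray, phrases: list[str]):
    """Assign each region the phrase with the highest cosine similarity.

    region_embs: (n_regions, d) image-region embeddings
    text_embs:   (n_phrases, d) text embeddings for the free-form prompts
    """
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = r @ t.T  # (n_regions, n_phrases) cosine similarities
    best = sims.argmax(axis=1)
    return [(phrases[i], float(sims[j, i])) for j, i in enumerate(best)]

# Two toy regions and two prompts; the embeddings are made up for illustration.
regions = np.array([[0.9, 0.1], [0.2, 0.8]])
texts = np.array([[1.0, 0.0], [0.0, 1.0]])
print(ground_boxes(regions, texts, ["the red car", "the blue car"]))
```

Distinguishing "the red car" from "the blue car" reduces to which text embedding each region embedding lands closest to, which is why prompt-sensitive grounding is the hard part of the task.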

Datasets: 2 · Results: 3 · Canonical metric: mAP

Canonical Benchmark

LVIS Zero-Shot

Open-vocabulary object detection on 1,203 LVIS categories.

Primary metric: mAP

Top 10

Leading models on LVIS Zero-Shot.

| Rank | Model | AP | Year | Source |
|------|-------|----|------|--------|
| 1 | Grounding DINO 1.5 Pro | 47.6 | 2024 | paper |
| 2 | OWLv2 (ViT-L) | 44.6 | 2023 | paper |
| 3 | YOLO-World v2-X | 35.4 | 2024 | paper |
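The AP numbers above are computed by matching predicted boxes to ground truth by intersection-over-union (IoU). Below is a minimal single-class, single-image sketch of non-interpolated AP; full COCO/LVIS-style mAP additionally averages over IoU thresholds 0.50 to 0.95 and over all categories, so this is a simplification, not the official evaluator.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def average_precision(dets, gts, thr=0.5):
    """dets: list of (box, score); gts: list of boxes. One class, one image."""
    dets = sorted(dets, key=lambda d: -d[1])  # rank by confidence
    matched, tps = set(), []
    for box, _ in dets:
        # Greedily match each detection to the best unmatched ground truth.
        best_j, best_iou = -1, thr
        for j, g in enumerate(gts):
            if j in matched:
                continue
            o = iou(box, g)
            if o >= best_iou:
                best_j, best_iou = j, o
        if best_j >= 0:
            matched.add(best_j)
            tps.append(1)
        else:
            tps.append(0)
    # Non-interpolated AP: sum precision at each true-positive rank,
    # normalized by the number of ground-truth boxes.
    ap, tp = 0.0, 0
    for k, t in enumerate(tps, start=1):
        if t:
            tp += 1
            ap += tp / k
    return ap / len(gts) if gts else 0.0
```

For example, one perfect detection of a single ground-truth box gives AP = 1.0, while adding a higher-scoring false positive ahead of it drops AP to 0.5, since the true positive now arrives at precision 1/2.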

All datasets

2 datasets tracked for this task.


Run Inference

Looking to run a model? Hugging Face hosts inference for this task type.