
Object Detection

Object detection — finding what's in an image and where — is the backbone of autonomous vehicles, surveillance, and robotics. The two-stage R-CNN lineage (2014–2017) gave way to single-shot detectors like YOLO, now in its 11th iteration and still getting faster. DETR (2020) proved transformers could replace hand-designed components like NMS entirely, spawning a family of end-to-end detectors that dominate COCO leaderboards above 60 mAP. The field's current obsession: open-vocabulary detection that works on any object described in natural language, not just fixed categories.


Object detection localizes and classifies multiple objects in an image with bounding boxes. COCO mAP has climbed from 19.7% (R-CNN, 2014) to 65%+ (Co-DETR, 2024), and the field has split between closed-set detectors and open-vocabulary models that find anything described in text.

History

2014

R-CNN (Girshick et al.) combines selective search proposals with CNN features, achieving 31.4% mAP on ILSVRC2013 detection — the first widely adopted deep detector

2015

Faster R-CNN introduces the Region Proposal Network (RPN), making detection end-to-end trainable at 5 FPS

2016

SSD and YOLO (v1-v2) prove single-shot detection is viable for real-time (45+ FPS), trading accuracy for speed

2017

Feature Pyramid Networks (FPN) solve multi-scale detection, and RetinaNet's focal loss fixes class imbalance in one-stage detectors — reaching 40.8% COCO AP

2019

EfficientDet optimizes compound scaling for detection; FCOS proves anchor-free detection works, simplifying pipelines

2020

DETR (Carion et al.) eliminates NMS and anchors entirely by casting detection as set prediction with transformers

2022

DINO-DETR achieves 63.3% COCO AP, making transformer detectors decisively better than CNN-based ones for the first time

2023

YOLOv8 (Ultralytics) and RT-DETR bridge the real-time gap — DETR-quality accuracy at YOLO-like speeds (100+ FPS)

2024

Grounding DINO and OWLv2 enable open-vocabulary detection — find any object described in natural language without retraining

2025

Co-DETR and Group-DETR push COCO AP above 65% with collaborative training; Florence-2 unifies detection with other vision tasks in a single model

How Object Detection Works

Object Detection Pipeline
1

Backbone Feature Extraction

A pretrained backbone (ResNet-50, Swin Transformer, InternViT) processes the input image into multi-scale feature maps at 1/8, 1/16, and 1/32 resolution.
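The stride arithmetic is easy to sketch. This is a toy helper (not any particular library's API) that computes the spatial size of each feature map for the usual 1/8, 1/16, and 1/32 strides:

```python
def feature_map_sizes(height, width, strides=(8, 16, 32)):
    """Spatial size of each backbone feature map at the given strides."""
    return [(height // s, width // s) for s in strides]

# A 640x640 input yields 80x80, 40x40, and 20x20 feature maps
print(feature_map_sizes(640, 640))  # [(80, 80), (40, 40), (20, 20)]
```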

2

Neck / Feature Fusion

FPN or BiFPN merges multi-scale features top-down and bottom-up, ensuring small and large objects are represented at appropriate resolutions.
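The top-down half of this merge can be sketched in a few lines of NumPy. This toy version uses nearest-neighbor upsampling and omits the 1×1 lateral and 3×3 output convolutions a real FPN applies at each level:

```python
import numpy as np

def fpn_top_down(features):
    """Minimal FPN top-down pass: upsample the coarser map 2x (nearest
    neighbor) and add it to the next finer map. `features` is ordered
    fine-to-coarse, e.g. shapes (C, 80, 80), (C, 40, 40), (C, 20, 20)."""
    merged = [features[-1]]  # start from the coarsest level
    for feat in reversed(features[:-1]):
        up = merged[0].repeat(2, axis=1).repeat(2, axis=2)  # 2x upsample
        merged.insert(0, feat + up)
    return merged

levels = [np.ones((8, s, s)) for s in (80, 40, 20)]
out = fpn_top_down(levels)
print([m.shape for m in out])  # [(8, 80, 80), (8, 40, 40), (8, 20, 20)]
```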

3

Proposal Generation or Query Matching

Two-stage detectors (Faster R-CNN) generate ~300 region proposals via RPN. Transformer detectors (DETR) use learned object queries (100-900) that attend to the feature map. Single-shot detectors (YOLO) predict directly on a dense grid.
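For the single-shot case, the dense grid is just every cell center at a given stride. A minimal sketch (a hypothetical helper, with sizes chosen for the coarsest level of a 640×640 input):

```python
def grid_centers(feat_h, feat_w, stride):
    """Image-space (x, y) center of every cell in a detection grid —
    the dense locations a single-shot head scores."""
    return [((c + 0.5) * stride, (r + 0.5) * stride)
            for r in range(feat_h) for c in range(feat_w)]

centers = grid_centers(20, 20, stride=32)  # 20x20 grid at stride 32
print(len(centers), centers[0])            # 400 (16.0, 16.0)
```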

4

Box Regression + Classification

Each proposal/query is refined into a bounding box (x, y, w, h) and classified. DETR uses bipartite matching (Hungarian algorithm) to assign predictions to ground truth; YOLO/SSD use anchor-based assignment with IoU thresholds.
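Bipartite matching can be illustrated with a toy IoU-only cost. DETR's actual matching cost also includes classification and L1 box terms, and it uses the Hungarian algorithm rather than the brute-force permutation search below (which is fine only for tiny N):

```python
from itertools import permutations

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(a) + area(b) - inter)

def match(preds, gts):
    """Assign each ground-truth box a distinct prediction, minimizing
    total (1 - IoU) cost over all one-to-one assignments."""
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(1 - iou(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best)  # best[g] = index of the prediction matched to gt g

preds = [(0, 0, 10, 10), (48, 48, 60, 60), (100, 100, 120, 120)]
gts = [(50, 50, 60, 60), (1, 1, 10, 10)]
print(match(preds, gts))  # [1, 0]: gt 0 -> pred 1, gt 1 -> pred 0
```

Because the assignment is one-to-one, unmatched queries are trained toward a "no object" class — this is what lets DETR skip NMS.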

5

Post-Processing

Non-maximum suppression (NMS) removes duplicate boxes in anchor-based detectors. DETR avoids NMS entirely. Output: list of (box, class, confidence) tuples, evaluated with mAP at IoU thresholds 0.5:0.95.
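Greedy NMS itself is only a few lines. This sketch uses plain Python lists rather than the batched tensor ops a real detector would use:

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop any remaining box overlapping it above iou_thresh, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 is suppressed as a duplicate of box 0
```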

Current Landscape

Object detection in 2025 is dominated by two parallel tracks: DETR-family transformers for maximum accuracy (Co-DETR, DINO-DETR) and the YOLO lineage for real-time deployment. The gap between them has narrowed dramatically — RT-DETR showed that transformer detectors can match YOLO speeds, and YOLOv8/v9 incorporated transformer ideas into the YOLO framework. Meanwhile, open-vocabulary detection (Grounding DINO, OWLv2) is disrupting the entire paradigm: instead of training a detector per domain, you describe what you want to find in text. Foundation models like Florence-2 are further blurring the boundary between detection, segmentation, and captioning.

Key Challenges

Small object detection — objects under 32×32 pixels account for 41% of COCO annotations but drive only ~15% of AP, and most detectors struggle here

Real-time inference constraints for autonomous driving (10-30ms latency budget) force painful accuracy/speed tradeoffs

Domain adaptation — detectors trained on COCO (everyday objects) fail on specialized domains like aerial imagery, medical scans, or manufacturing defects without significant fine-tuning

Crowded scenes with heavy occlusion (e.g., pedestrians in dense urban environments) cause proposal collision and NMS failures

Annotation cost — drawing bounding boxes takes 25-35 seconds per instance, making large-scale labeled datasets expensive to create

Quick Recommendations

Best accuracy (no latency constraint)

Co-DETR with Swin-L backbone

65%+ COCO mAP, best available closed-set detector; uses collaborative hybrid assignments for superior training

Real-time detection

YOLOv8-L or RT-DETR-L

54-56% COCO mAP at 100+ FPS on an A100; YOLOv8 for simpler deployment, RT-DETR for NMS-free inference

Open-vocabulary / zero-shot

Grounding DINO 1.5 or OWLv2

Detect any object described in text without retraining — critical for robotics, content moderation, and novel domains

Edge / mobile deployment

YOLOv8-N or NanoDet-Plus

~37% COCO mAP at 1.5-3M params, runs at 30+ FPS on mobile NPUs

Low-annotation regime

Grounding DINO + SAM

Use text prompts to generate pseudo-labels, then fine-tune a smaller detector — bootstraps detection without manual annotation

What's Next

The field is converging toward unified vision models that handle detection as one of many tasks (Florence-2, PaLI-X). Open-vocabulary detection will likely make closed-set training obsolete for most applications within 2-3 years. Active research frontiers include 3D object detection from monocular images (crucial for autonomous driving without LiDAR), temporal object detection in video (tracking + detection jointly), and detection foundation models that work zero-shot across wildly different domains like satellite imagery, microscopy, and underwater robotics.

