
Object Detection: Image to Bounding Boxes

Locate and classify every object in an image. The foundation of autonomous driving, surveillance, robotics, and visual search.

What is Object Detection?

Object detection answers two questions simultaneously: What objects are in this image? and Where are they located?

Unlike image classification (which outputs a single label for the whole image), object detection outputs a list of bounding boxes — rectangles that localize each detected object, along with its class label and a confidence score. A single image might contain dozens of overlapping detections.

This lesson traces the field from its first practical system in 2001 to today's open-vocabulary detectors. Understanding the history reveals why modern architectures look the way they do — every design choice is a reaction to a limitation of the previous generation.

Output Format

Every object detector, from 2001 to 2026, outputs the same three things per detection:

  • Bounding box: (x1, y1, x2, y2) pixel coordinates of the top-left and bottom-right corners
  • Class label: what the object is (person, car, dog, "damaged car part", etc.)
  • Confidence score: the model's certainty, from 0.0 to 1.0
# Universal output format (every detector)
detections = [
    {"bbox": [120, 45, 380, 290], "class": "person", "conf": 0.94},
    {"bbox": [400, 100, 550, 340], "class": "dog",    "conf": 0.87},
    {"bbox": [10, 200, 640, 450],  "class": "car",    "conf": 0.72},
]

25 Years of Teaching Machines to See

Object detection evolved through four distinct eras, each solving the critical limitation of the last. The field moved from hand-crafted features to learned features, from two-pass architectures to single-shot designs, from fixed vocabularies to open-ended language-guided detection.

Era I: Hand-Crafted Features
2001

Viola-Jones: Real-Time Face Detection

Paul Viola and Michael Jones at Mitsubishi Electric Research Labs built the first object detector that worked in real time. Their system detected faces at 15 frames per second on 2001 hardware — a feat that seemed impossible at the time. The architecture introduced three ideas that influenced everything after it:

  1. Haar-like features — simple rectangular filters that captured edge and contrast patterns at multiple scales
  2. Integral images — a preprocessing trick that made computing any rectangular sum O(1), enabling thousands of feature evaluations per window
  3. Attentional cascade — a chain of increasingly complex classifiers. The first stage rejected 50% of windows with just 2 features; only survivors reached later, more expensive stages

"The key insight is that while each stage has a high detection rate, the combined cascade can achieve an extremely low false positive rate while processing most image locations in a few operations."

Viola, P. & Jones, M. (2001). Rapid Object Detection using a Boosted Cascade. CVPR. Later expanded in IJCV, 2004.

Viola-Jones shipped in every digital camera and webcam for a decade. It proved that real-time detection was possible, but it was limited to a single object class (faces), required hand-designed features, and struggled with pose variation, occlusion, and non-rigid objects. Detecting "any object" remained unsolved.
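The integral-image idea is simple enough to sketch in a few lines of NumPy (an illustration of the trick, not the original implementation):

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns, zero-padded on top/left."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, y1, x1, y2, x2):
    """Sum of img[y1:y2, x1:x2] in O(1) using four table lookups."""
    return ii[y2, x2] - ii[y1, x2] - ii[y2, x1] + ii[y1, x1]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
# A two-rectangle Haar-like feature is just the difference of two such sums,
# e.g. rect_sum(left half) - rect_sum(right half)
```

With the table precomputed once, evaluating thousands of Haar-like features per window costs a handful of additions each, which is what made the cascade fast enough for real time.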

2005

HOG + SVM: The Pedestrian Detector

Navneet Dalal and Bill Triggs at INRIA introduced Histograms of Oriented Gradients (HOG) — a feature descriptor that captured local edge directions in overlapping cells across the image. Paired with a linear SVM classifier and a sliding window, HOG became the dominant approach for pedestrian detection in autonomous driving research.

HOG was more robust than Haar features to lighting and small deformations, but it was still hand-engineered. The features didn't adapt to the data. A human had to decide that edge orientations mattered more than color, that 8x8 pixel cells were the right granularity, that 9 orientation bins were enough. This ceiling — the limited capacity of hand-crafted features — is what deep learning would shatter.

Dalal, N. & Triggs, B. (2005). Histograms of Oriented Gradients for Human Detection. CVPR.
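A stripped-down HOG sketch (per-cell orientation histograms only, omitting the block normalization and gamma correction the real descriptor uses):

```python
import numpy as np

def hog_cells(img, cell=8, bins=9):
    """Per-cell histograms of gradient orientations (unsigned, 0-180 degrees)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180      # unsigned orientation
    ch, cw = img.shape[0] // cell, img.shape[1] // cell
    bin_idx = np.minimum((ang / (180 / bins)).astype(int), bins - 1)
    hist = np.zeros((ch, cw, bins))
    for i in range(ch):
        for j in range(cw):
            ys, xs = slice(i * cell, (i + 1) * cell), slice(j * cell, (j + 1) * cell)
            for b in range(bins):
                # Magnitude-weighted vote into the cell's orientation bin
                hist[i, j, b] = mag[ys, xs][bin_idx[ys, xs] == b].sum()
    return hist

# A vertical step edge produces purely horizontal gradients, so all the
# energy lands in orientation bin 0
edge = np.zeros((32, 32)); edge[:, 16:] = 1.0
h = hog_cells(edge)          # shape (4, 4, 9)
```

Note how every design decision (cell size 8, 9 bins, unsigned gradients) is baked in by hand, exactly the ceiling the text describes.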

2008–2010

Deformable Parts Model (DPM)

Pedro Felzenszwalb, Ross Girshick, David McAllester, and Deva Ramanan extended HOG into a parts-based model: a "root filter" captured the whole object, while deformable "part filters" modeled components (head, torso, legs) that could shift relative to the root. DPM won the PASCAL VOC challenge for three consecutive years (2007–2009) and represented the peak of hand-crafted detection. Girshick would go on to create R-CNN, explicitly transferring DPM's region-based philosophy to deep learning.

Felzenszwalb, P. et al. (2010). Object Detection with Discriminatively Trained Part-Based Models. IEEE TPAMI, 32(9).

Era II: Two-Stage Detectors
2014

R-CNN: Deep Learning Enters Detection

Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik at UC Berkeley combined Selective Search (which proposed ~2,000 candidate regions per image) with a CNN feature extractor (AlexNet) and an SVM classifier. The result blew past every hand-crafted method: R-CNN improved mAP on PASCAL VOC 2012 from 33.4% (DPM) to 53.3% — a 20-point jump.

# R-CNN pipeline (conceptual)
regions = selective_search(image)     # ~2000 region proposals
for region in regions:
    crop = warp(region, 227, 227)     # Resize to CNN input
    features = alexnet(crop)          # 4096-dim feature vector
    class_scores = svm(features)      # Per-class SVM scores
    bbox = regressor(features)        # Refine box coordinates
# Problem: 2000 forward passes per image → ~47 seconds/image on GPU

Girshick, R. et al. (2014). Rich Feature Hierarchies for Accurate Object Detection. CVPR.

R-CNN proved that learned CNN features were dramatically better than hand-crafted ones for detection. But running 2,000 separate CNN forward passes per image was absurdly slow. The next three years were spent making this architecture fast enough to be practical.

2015

Fast R-CNN and Faster R-CNN

Girshick (now at Microsoft Research) solved R-CNN's speed problem in two steps. Fast R-CNN ran the CNN once on the full image and extracted features for each region via ROI pooling — collapsing 2,000 forward passes into one. Then Shaoqing Ren, Kaiming He, Girshick, and Jian Sun replaced the slow Selective Search with a Region Proposal Network (RPN) — a small CNN that predicted region proposals directly from the feature map, sharing computation with the detector.

# The two-stage paradigm (Faster R-CNN)
# Stage 1: Region Proposal Network (RPN)
feature_map = backbone_cnn(image)           # Run CNN once
proposals = rpn(feature_map)                # ~300 region proposals

# Stage 2: Classification + Refinement
for proposal in proposals:
    roi_features = roi_pool(feature_map, proposal)  # Extract features
    class_label = classifier(roi_features)           # What is it?
    refined_box = regressor(roi_features)             # Exact location

# Speed: 47s/image (R-CNN) → 0.2s/image (Fast) → 0.06s/image (Faster)

Girshick, R. (2015). Fast R-CNN. ICCV.
Ren, S. et al. (2015). Faster R-CNN: Towards Real-Time Object Detection. NeurIPS. 50,000+ citations.

Faster R-CNN defined the two-stage paradigm: first propose regions, then classify each one. It remained the accuracy leader for years. But the two-stage design was still inherently sequential — you couldn't classify until you had proposals. A new school of thought asked: what if we skip proposals entirely?

Era III: One-Stage Detectors
2016

YOLO: You Only Look Once

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi at the University of Washington proposed a radical simplification: treat detection as a single regression problem. Divide the image into an S x S grid. Each grid cell directly predicts bounding boxes and class probabilities in one forward pass. No region proposals. No second stage.

"We frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation."

Redmon, J. et al. (2016). You Only Look Once: Unified, Real-Time Object Detection. CVPR.

YOLOv1 ran at 45 FPS — over 100x faster than R-CNN. Accuracy was lower than Faster R-CNN (63.4 vs 73.2 mAP on VOC 2007), but the speed advantage was so enormous that it opened entirely new applications: real-time video processing, robotics, drone navigation. The YOLO lineage would eventually close the accuracy gap while maintaining speed, becoming the most deployed detection architecture in history.
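The grid formulation can be sketched as decoding a single S × S × (B·5 + C) output tensor; here random values stand in for real network output, and the layout follows the YOLOv1 paper (S=7, B=2, C=20 for VOC):

```python
import numpy as np

np.random.seed(0)
S, B, C = 7, 2, 20                        # grid size, boxes per cell, classes (VOC)
pred = np.random.rand(S, S, B * 5 + C)    # stand-in for one forward pass

detections = []
for i in range(S):                        # grid row
    for j in range(S):                    # grid column
        cell = pred[i, j]
        class_probs = cell[B * 5:]        # one class distribution per cell
        for b in range(B):
            x, y, w, h, obj = cell[b * 5 : b * 5 + 5]
            # (x, y) are offsets within the cell; convert to image-relative center
            cx, cy = (j + x) / S, (i + y) / S
            conf = obj * class_probs.max()     # objectness times class probability
            if conf > 0.25:
                detections.append((cx, cy, w, h, conf, int(class_probs.argmax())))
```

Everything comes out of one tensor in one pass, which is the whole point: no proposals, no per-region forward passes.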

2016–2017

SSD & RetinaNet: Closing the Accuracy Gap

Wei Liu et al. introduced SSD (Single Shot MultiBox Detector), which added multi-scale feature maps — detecting large objects from deep layers and small objects from shallow layers. Then Tsung-Yi Lin, Priya Goyal, Girshick, He, and Dollár at Facebook AI Research diagnosed why one-stage detectors lagged in accuracy: class imbalance. A typical image has thousands of "easy negative" background regions and only a few objects. Their solution, Focal Loss, downweighted easy negatives so the model could focus on hard examples. RetinaNet with Focal Loss matched Faster R-CNN accuracy while running 5x faster.

Liu, W. et al. (2016). SSD: Single Shot MultiBox Detector. ECCV.
Lin, T.-Y. et al. (2017). Focal Loss for Dense Object Detection. ICCV.
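Focal Loss itself is a small modification of binary cross-entropy; a NumPy sketch:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), per Lin et al. (2017)."""
    pt = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - pt) ** gamma * np.log(pt)

# The (1 - p_t)^gamma factor crushes the loss on easy, confidently-correct
# negatives while leaving hard examples nearly untouched:
easy = focal_loss(np.array([0.01]), np.array([0]))   # confident background
hard = focal_loss(np.array([0.60]), np.array([0]))   # ambiguous background
```

With gamma = 2, a background region scored at 0.01 contributes orders of magnitude less loss than one scored at 0.6, so the thousands of easy negatives stop drowning out the few objects.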

2018–2025

The YOLO Dynasty: v2 Through v11

YOLO became a living lineage, with each version absorbing the best ideas from the broader detection community:

  • v2: Batch normalization, anchor boxes, multi-scale training (Redmon & Farhadi, 2017)
  • v3: FPN-like multi-scale predictions, Darknet-53 backbone (Redmon & Farhadi, 2018)
  • v4: CSPDarknet, Mish activation, mosaic augmentation (Bochkovskiy et al., 2020)
  • v5–v8: Ultralytics engineering: anchor-free heads, C2f blocks, decoupled head (2020–2023)
  • v9: Programmable Gradient Information (PGI), GELAN architecture (Wang et al., 2024)
  • v11: C3k2 blocks, attention mechanisms, 54.7 mAP on COCO (Ultralytics, 2024)

After Redmon left the field (citing ethical concerns about military applications), the YOLO name was carried forward by multiple groups. Ultralytics became the de facto steward, releasing v5, v8, and v11 with polished engineering, unified APIs, and extensive deployment tooling. The YOLO lineage now spans nine major versions across eight years — arguably the most successful architecture family in computer vision history.

Era IV: Transformers & Open Vocabulary
2020

DETR: Detection Transformer

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko at Facebook AI Research asked: what if we threw away anchor boxes, NMS, and all the hand-designed components? DETR used a transformer encoder-decoder with learned object queries and bipartite matching loss to predict a set of detections directly — no proposals, no anchors, no NMS.

"We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components."

Carion, N. et al. (2020). End-to-End Object Detection with Transformers. ECCV.

DETR matched Faster R-CNN on COCO (42 AP) with a vastly simpler architecture. The clean design inspired a wave of follow-ups: Deformable DETR (faster convergence), DAB-DETR (learned anchor points), DINO (improved queries), and RT-DETR (real-time speed). The transformer had invaded detection.

2023

Grounding DINO: Language Meets Detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren et al. merged the DINO detector with a text encoder (BERT), creating a model that could detect any object described in natural language. Instead of predicting one of 80 COCO classes, you provide a text prompt: "person wearing a red hat. damaged car parts. brand logos." The model's cross-modality fusion lets visual features attend to text features and vice versa.

This was the shift from closed vocabulary (detect only trained classes) to open vocabulary (detect anything describable). No retraining needed for new categories — just change the text prompt. Grounding DINO 1.5 Pro achieves 54.3 AP on COCO while also supporting open-vocabulary detection.

Liu, S. et al. (2023). Grounding DINO: Marrying DINO with Grounded Pre-Training. ECCV 2024.

The throughline: 2001 → 2026

Four eras. One relentless trajectory toward simpler, faster, more general detection:

  • 2001–2010, Hand-crafted: Haar, HOG, DPM. Human-designed features, sliding windows (Viola-Jones, Dalal, Felzenszwalb)
  • 2014–2015, Two-stage: propose regions, then classify. Learned features crush hand-crafted (Girshick, Ren, He)
  • 2016–2025, One-stage: direct regression. Real-time speed meets high accuracy (Redmon, Lin, Ultralytics)
  • 2020–now, Transformers + language: end-to-end set prediction, open vocabulary via text prompts (Carion, Liu)

Every generation removed a hand-designed component: Faster R-CNN removed Selective Search. YOLO removed the second stage. DETR removed anchors and NMS. Grounding DINO removed the fixed class list. The trend is toward fewer assumptions, more learning.

Two-Stage vs One-Stage: The Core Architectural Split

Every modern detector descends from one of two philosophies. Understanding this split is essential for choosing the right architecture for your application.

Two-Stage (Propose + Classify)

First find where objects might be (region proposals), then determine what each one is. The two stages share a backbone but run sequentially.

image → backbone → RPN → proposals
                  → ROI pool → classify + refine

Architectures:

Faster R-CNN, Cascade R-CNN, Mask R-CNN

+ Higher accuracy (especially small objects)
+ Better at precise localization
- Slower (sequential pipeline)
- More complex training

One-Stage (Direct Prediction)

Predict bounding boxes and classes directly from the feature map in a single forward pass. No proposals, no second stage.

image → backbone → neck → head
              → boxes + classes (directly)

Architectures:

YOLO, SSD, RetinaNet, FCOS

+ Much faster (single pass)
+ Simpler architecture
- Historically less accurate (gap now closed)
- Needs tricks for class imbalance (focal loss)
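Beyond the class-imbalance trick, one-stage detectors emit thousands of overlapping candidate boxes per image, which are traditionally deduplicated with greedy non-maximum suppression. A minimal NumPy sketch (not a production kernel):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS on an (N, 4) array of [x1, y1, x2, y2] boxes."""
    order = scores.argsort()[::-1]              # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # IoU of the kept box against every remaining box (vectorized)
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou < iou_thresh]          # suppress near-duplicates
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
# The second box overlaps the first (IoU ~0.68) and gets suppressed
```

This hand-designed post-processing step is exactly what DETR, discussed next, removes.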

The Third Way: Transformer Set Prediction

DETR and its descendants don't fit neatly into either category. They use a transformer encoder-decoder with learned object queries that act like implicit proposals — but there's no explicit region proposal step, no anchors, and no NMS. The bipartite matching loss treats detection as a set prediction problem: find the optimal one-to-one assignment between predictions and ground truth. This is architecturally the simplest approach, but convergence is slower and performance on small objects requires careful engineering (deformable attention, multi-scale features).
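The bipartite matching step can be illustrated with SciPy's Hungarian-algorithm solver. This is a toy cost using only L1 box distance; DETR's actual matching cost also includes class probabilities and generalized IoU, and the sketch assumes SciPy is available:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

preds = np.array([[0.5, 0.5, 0.2, 0.2],    # 3 predicted boxes (cx, cy, w, h)
                  [0.1, 0.1, 0.1, 0.1],
                  [0.9, 0.9, 0.3, 0.3]])
gts = np.array([[0.12, 0.1, 0.1, 0.1],     # 2 ground-truth boxes
                [0.88, 0.9, 0.3, 0.3]])

# Pairwise L1 cost between every prediction and every ground truth
cost = np.abs(preds[:, None, :] - gts[None, :, :]).sum(-1)   # shape (3, 2)
pred_idx, gt_idx = linear_sum_assignment(cost)               # optimal 1:1 matching
# Unmatched predictions are trained to output the "no object" class
```

Because each ground-truth box is matched to exactly one prediction, duplicates are penalized during training rather than filtered afterward, which is why DETR needs no NMS.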

Closed vs Open Vocabulary Detection

Until 2022, every detector was closed vocabulary: it could only detect the exact classes it was trained on. COCO has 80 classes; if you needed to detect "fire hydrant" (class 11) you were fine, but "cracked sidewalk" was impossible without retraining on new data.

Open vocabulary detectors changed this by fusing vision with language. They accept a text description of what to find, making detection a zero-shot capability — no retraining required.

Closed Vocabulary

Fixed set of classes learned during training. Cannot detect novel objects. COCO-trained models know 80 categories (person, car, dog, etc.).

Models:

YOLO v5–v11, RT-DETR, Faster R-CNN

+ Faster inference (no text encoding)
+ Higher AP on trained classes
- Need labeled data for every new class

Open Vocabulary

Detect any object described in natural language. Uses vision-language fusion to match text to image regions at inference time.

Models:

Grounding DINO, OWL-ViT, GLIP, Florence-2

+ Zero-shot detection of any object
+ No retraining for new categories
- Slower (text encoder overhead)

Working Code

Three detectors, three paradigms. All produce the same output format (boxes, classes, scores) but differ in architecture, speed, and vocabulary.

YOLO v11 (One-Stage, Closed Vocabulary)

The most deployed detector in the world. Five model sizes from 2.6M to 56.9M parameters, covering edge devices to servers. Inference via the Ultralytics Python API:

from ultralytics import YOLO

# Load model (downloads ~100MB on first run)
model = YOLO('yolo11x.pt')  # x=max accuracy; n/s/m/l for faster variants

# Run inference on a single image
results = model('image.jpg')

# Parse detections
for result in results:
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        confidence = box.conf[0].item()
        class_id = int(box.cls[0])
        class_name = model.names[class_id]
        print(f'{class_name}: {confidence:.2f} at ({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f})')

# Batch inference on a video
results = model('traffic.mp4', stream=True)
for frame_result in results:
    print(f'{len(frame_result.boxes)} detections in frame')

# Export to ONNX for production deployment
model.export(format='onnx', imgsz=640, half=True)

Install: pip install ultralytics

RT-DETR (Transformer, Closed Vocabulary)

Baidu's real-time DETR variant. End-to-end detection without NMS — the transformer outputs non-duplicate predictions directly. Better than YOLO at overlapping objects and flexible input resolutions. Uses the same Ultralytics API:

from ultralytics import RTDETR

# RT-DETR via Ultralytics (same API as YOLO)
model = RTDETR('rtdetr-x.pt')  # or rtdetr-l.pt for faster variant
results = model('image.jpg')

# Or use HuggingFace Transformers for the original DETR
from transformers import DetrForObjectDetection, DetrImageProcessor
from PIL import Image
import torch

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-101")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-101")

image = Image.open("image.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Post-process: convert to boxes with confidence threshold
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.7
)

for score, label, box in zip(
    results[0]["scores"], results[0]["labels"], results[0]["boxes"]
):
    box = [round(i, 1) for i in box.tolist()]
    print(f'{model.config.id2label[label.item()]}: {score:.2f} at {box}')

Grounding DINO (Transformer, Open Vocabulary)

Describe what you want to detect in plain English. No training data needed for new categories. Period-separated phrases define the detection targets:

from groundingdino.util.inference import load_model, load_image, predict

# Load model and weights
model = load_model(
    "GroundingDINO_SwinT_OGC.py",
    "groundingdino_swint_ogc.pth"
)

# Load and preprocess image
image_source, image = load_image("warehouse.jpg")

# Detect with natural language prompts (period-separated)
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="forklift . safety vest . hard hat . spill on floor",
    box_threshold=0.35,    # Minimum detection confidence
    text_threshold=0.25    # Minimum text-image match score
)

# boxes: (N, 4) tensor of [cx, cy, w, h] normalized coordinates
# logits: (N,) confidence scores
# phrases: list of matched text labels
for box, score, phrase in zip(boxes, logits, phrases):
    cx, cy, w, h = box.tolist()
    print(f'{phrase}: {score:.2f} at center=({cx:.2f},{cy:.2f})')

# Key advantage: change the caption, detect different things
# No retraining. No new dataset. Just words.

Install: pip install groundingdino-py

Weights must be downloaded separately from the GitHub repo.
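Since Grounding DINO returns normalized center-format boxes, a small helper (hypothetical, not part of the library) converts them to the pixel (x1, y1, x2, y2) convention used throughout this lesson:

```python
def cxcywh_to_xyxy(box, img_w, img_h):
    """Normalized [cx, cy, w, h] box to absolute [x1, y1, x2, y2] pixels."""
    cx, cy, w, h = box
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return [round(x1), round(y1), round(x2), round(y2)]

# A box centered in a 640x480 image, a quarter of the width and half the height
print(cxcywh_to_xyxy([0.5, 0.5, 0.25, 0.5], 640, 480))  # → [240, 120, 400, 360]
```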

The COCO Benchmark

COCO (Common Objects in Context) has been the standard benchmark for object detection since 2014. Created by Tsung-Yi Lin et al. at Microsoft Research, it contains 118K training images and 5K validation images with 80 object categories and 886K annotated bounding boxes.

COCO's evaluation protocol is deliberately strict: the primary metric, AP (Average Precision), averages over 10 IoU thresholds from 0.50 to 0.95 in steps of 0.05. This rewards detectors that produce tightly-fitting boxes, not just roughly-correct regions.

Key Metrics Explained

AP (mAP@[0.50:0.95])

The primary metric. Averages precision across 10 IoU thresholds. A detection "counts" only if it overlaps the ground truth by at least the threshold. Range: 0–100.

AP50 / AP75

AP at a single IoU threshold. AP50 (lenient) allows 50% overlap; AP75 (strict) requires 75%. The gap between AP50 and AP75 reveals localization quality.

AP_S / AP_M / AP_L

AP broken down by object size: small (<32² px), medium (32²–96²), large (>96²). Small object detection remains the hardest sub-problem.

IoU (Intersection over Union)

Measures overlap between predicted and ground-truth boxes. IoU = area of intersection / area of union. A detection is "correct" if IoU exceeds the threshold.
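The definition above is a few lines of code:

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes: intersection area / union area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # 0 if boxes are disjoint
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25/175 ≈ 0.143: fails even AP50
print(iou([0, 0, 10, 10], [1, 1, 11, 11]))  # 81/119 ≈ 0.681: passes AP50, fails AP75
```

The two examples show why the strict thresholds matter: a box shifted by half its width looks roughly right to a human but scores well under 0.5 IoU.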

Lin, T.-Y. et al. (2014). Microsoft COCO: Common Objects in Context. ECCV. 30,000+ citations.

COCO val2017 Leaderboard (Selected Models)

Model                      Type         Params   AP     AP50   Speed
Co-DETR (Swin-L)           Transformer  218M     66.0   83.8   ~180ms
DINO (Swin-L)              Transformer  218M     63.3   82.0   ~150ms
RT-DETR-X                  Transformer  67M      54.8   73.1   9.3ms
YOLO11x                    One-Stage    56.9M    54.7   72.0   11.3ms
YOLO11m                    One-Stage    20.1M    51.5   68.5   4.7ms
Grounding DINO-T           Open-Vocab   172M     48.4   68.0   ~45ms
DETR (R-101)               Transformer  60M      44.9   64.7   ~70ms
Faster R-CNN (R-101-FPN)   Two-Stage    60M      42.0   62.5   ~60ms
YOLO11n                    One-Stage    2.6M     39.5   56.1   1.5ms

AP on COCO val2017. Speed measured on NVIDIA T4 GPU (batch=1, FP16 where available). Co-DETR and DINO use test-time augmentation and Swin-L backbones pre-trained on Objects365. Real-time models (YOLO, RT-DETR) measured without TTA.

Speed vs Accuracy

  • Co-DETR (Swin-L), max accuracy: 66.0 AP at ~180ms
  • RT-DETR-X, real-time transformer: 54.8 AP at 9.3ms
  • YOLO11x, real-time CNN: 54.7 AP at 11.3ms
  • YOLO11m, balanced: 51.5 AP at 4.7ms
  • Grounding DINO-T, open vocab: 48.4 AP at ~45ms
  • YOLO11n, edge/mobile: 39.5 AP at 1.5ms

The Pareto frontier: real-time detectors (YOLO, RT-DETR) cluster around 50–55 AP at under 12ms. Pushing beyond 60 AP requires large backbones and 10–20x more compute. Open-vocabulary detectors trade speed for the ability to detect novel categories.

Choosing the Right Detector

Real-Time on Edge (30+ FPS, mobile/drone/embedded)

Use YOLO11n or YOLO11s. Export to ONNX or TensorRT. Accuracy is lower (39–47 AP), but inference is 1.5–2.5ms on a T4 GPU and runs on Jetson Nano, Raspberry Pi 5, or mobile NPUs.

Balanced Production (10–30 FPS)

Use YOLO11m or RT-DETR-L. The sweet spot for general-purpose detection: warehouse automation, retail analytics, traffic monitoring. ~5ms per frame with 51–53 AP.

Maximum Accuracy (Batch Processing OK)

Use Co-DETR or DINO with Swin-L backbone. Medical imaging, satellite analysis, forensic review. When you need the absolute best AP and latency is not a constraint. Pre-train on Objects365 for maximum performance.

Novel Objects / Dynamic Categories

Use Grounding DINO or Florence-2. When you don't have training data for your target classes, or categories change frequently. Quality inspection, anomaly detection, rapid prototyping. ~45ms per frame but zero labeling cost.

Key Takeaways

  1. Object detection evolved through four eras — from hand-crafted features (Viola-Jones, HOG) to two-stage CNNs (R-CNN family) to one-stage speed demons (YOLO) to transformer set predictors (DETR, Grounding DINO). Each removed a hand-designed component.

  2. Two-stage vs one-stage is the fundamental split — two-stage (Faster R-CNN) proposes then classifies for higher accuracy; one-stage (YOLO) predicts directly for real-time speed. Modern one-stage detectors have largely closed the accuracy gap.

  3. COCO AP is the universal benchmark — averaging precision across IoU thresholds 0.50–0.95 over 80 categories. Current SOTA is ~66 AP (Co-DETR); real-time models reach ~55 AP at under 12ms.

  4. Open-vocabulary detection is the frontier — Grounding DINO and its descendants let you detect any object via text prompts, eliminating the need for labeled data when categories change. This is where the field is heading.
