Object Detection: Image to Bounding Boxes
Locate and classify objects in images. The foundation of visual perception systems.
What is Object Detection?
Object detection answers two questions simultaneously: What objects are in this image? and Where are they located?
Unlike image classification (which outputs a single label), object detection outputs a list of bounding boxes - rectangles that localize each detected object along with its class label and confidence score.
Output Format
Each detection contains:
- Bounding box: (x1, y1, x2, y2) coordinates
- Class: what type of object (person, car, dog, etc.)
- Confidence: how certain the model is (0.0 to 1.0)
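Concretely, a single detection can be modeled as a small record; the sketch below is illustrative only (the field names are hypothetical and not tied to any specific library):

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """Hypothetical container for one detected object."""
    x1: float          # top-left corner, pixels
    y1: float
    x2: float          # bottom-right corner, pixels
    y2: float
    label: str         # class name, e.g. "person"
    confidence: float  # model certainty, 0.0 to 1.0

det = Detection(x1=34.0, y1=120.0, x2=210.0, y2=480.0, label="person", confidence=0.91)
print(det)
```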
Closed vs Open Vocabulary Detection
Object detectors fall into two categories based on what they can detect:
Closed Vocabulary
Can only detect classes the model was trained on; the COCO dataset, for example, has 80 classes (person, car, dog, etc.).
Examples:
YOLO, RT-DETR, Faster R-CNN
+ Higher accuracy on known classes
- Cannot detect novel objects
Open Vocabulary
Can detect any object you describe in natural language. Uses vision-language models to match text descriptions to image regions.
Examples:
Grounding DINO, OWL-ViT, GLIP
+ Zero-shot capability
- Slower inference
YOLO v11 (Ultralytics)
YOLO (You Only Look Once) is the most popular object detection architecture. YOLO v11 is the latest version from Ultralytics, achieving ~54.7 mAP on COCO with real-time performance.
Model Variants
| Model | Params | mAP | Speed (T4) | Use Case |
|---|---|---|---|---|
| yolo11n | 2.6M | 39.5 | 1.5ms | Edge/mobile |
| yolo11s | 9.4M | 47.0 | 2.5ms | Balanced |
| yolo11m | 20.1M | 51.5 | 4.7ms | General purpose |
| yolo11l | 25.3M | 53.4 | 6.2ms | High accuracy |
| yolo11x | 56.9M | 54.7 | 11.3ms | Maximum accuracy |
mAP = mean Average Precision on COCO val2017. Speed measured on NVIDIA T4 GPU.
```python
from ultralytics import YOLO

# Load model (downloads automatically on first run)
model = YOLO('yolo11x.pt')  # or yolo11n.pt for speed

# Run inference
results = model('image.jpg')

# Process detections
for result in results:
    boxes = result.boxes
    for box in boxes:
        x1, y1, x2, y2 = box.xyxy[0]
        conf = box.conf[0]
        cls = int(box.cls[0])
        print(f'{model.names[cls]}: {conf:.2f} at ({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f})')
```
Installation: pip install ultralytics
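A common next step is raising the confidence threshold and saving an annotated image. A minimal sketch using the Ultralytics plot() helper plus OpenCV for writing the file (assumes opencv-python is installed):

```python
import cv2
from ultralytics import YOLO

model = YOLO('yolo11n.pt')

# Pass a confidence threshold at inference time to drop weak detections
results = model('image.jpg', conf=0.5)

# plot() returns a BGR numpy array with boxes and labels drawn on it
annotated = results[0].plot()
cv2.imwrite('image_annotated.jpg', annotated)
```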
RT-DETR (Real-Time Detection Transformer)
RT-DETR is a transformer-based detector from Baidu that achieves competitive accuracy with an end-to-end design. Unlike YOLO, which relies on a convolutional backbone, RT-DETR uses attention mechanisms for better global context understanding.
RT-DETR Advantages
- End-to-end detection (no NMS post-processing needed; see the sketch after this list)
- Better at detecting overlapping objects
- More flexible input resolution scaling
- ~54.8 mAP on COCO with RT-DETR-L
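For context, CNN detectors like YOLO emit many overlapping candidate boxes and rely on non-maximum suppression (NMS) to keep one box per object; this is the post-processing step RT-DETR's end-to-end design removes. A minimal greedy-NMS sketch (assumes torchvision is installed; illustrative, not how any particular library implements it):

```python
import torch
from torchvision.ops import box_iou

def greedy_nms(boxes: torch.Tensor, scores: torch.Tensor, iou_threshold: float = 0.5):
    """Keep the highest-scoring box, drop remaining boxes that overlap it too much."""
    order = scores.argsort(descending=True).tolist()
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        if not order:
            break
        ious = box_iou(boxes[best].unsqueeze(0), boxes[order]).squeeze(0)
        order = [i for i, v in zip(order, ious.tolist()) if v < iou_threshold]
    return keep

boxes = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [50., 50., 60., 60.]])
scores = torch.tensor([0.9, 0.8, 0.7])
print(greedy_nms(boxes, scores))  # [0, 2]: the second box overlaps the first and is suppressed
```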
```python
from ultralytics import RTDETR

# Load RT-DETR model
model = RTDETR('rtdetr-l.pt')  # or rtdetr-x.pt for max accuracy

# Run inference - same API as YOLO
results = model('image.jpg')

# Process results (identical interface)
for result in results:
    boxes = result.boxes
    ...  # Same as YOLO
```
RT-DETR uses the same Ultralytics API, making it easy to swap between YOLO and RT-DETR depending on your accuracy vs speed requirements.
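Since the results interface is shared, swapping families can be as small as a config flag; a quick sketch (the helper name and model file choices here are just examples):

```python
from ultralytics import YOLO, RTDETR

def load_detector(prefer_speed: bool = True):
    """Pick a detector family based on a speed/accuracy preference."""
    if prefer_speed:
        return YOLO('yolo11s.pt')    # CNN detector, lowest latency
    return RTDETR('rtdetr-l.pt')     # transformer detector, better with crowded scenes

model = load_detector(prefer_speed=False)
results = model('image.jpg')  # downstream parsing of results is unchanged
```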
Grounding DINO (Open Vocabulary)
Grounding DINO combines a DINO detector with grounded pre-training to enable open-vocabulary detection. You describe what you want to find in natural language, and the model locates it.
The Power of Text Prompts
Unlike YOLO, which can only detect its 80 trained classes, Grounding DINO can detect:
- "person wearing a red hat"
- "damaged car parts"
- "brand logos"
- Any object you can describe in words
```python
from groundingdino.util.inference import load_model, load_image, predict

# Load model (config file + weights from the GroundingDINO GitHub repo)
model = load_model(
    'GroundingDINO_SwinT_OGC.py',
    'groundingdino_swint_ogc.pth'
)

# Load and preprocess the image
image_source, image = load_image('image.jpg')

# Detect with text prompts (period-separated)
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption='person . dog . car',  # Any text prompt!
    box_threshold=0.35,
    text_threshold=0.25
)

# boxes: tensor of [cx, cy, w, h] normalized coordinates
# logits: confidence scores
# phrases: matched text phrases
```
Installation: pip install groundingdino-py
Note: Grounding DINO requires downloading model weights separately from the GitHub repo.
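Grounding DINO returns boxes as normalized (cx, cy, w, h) rather than pixel corners; a small sketch of the conversion, assuming boxes and image_source come from the snippet above:

```python
import torch

def to_pixel_xyxy(boxes: torch.Tensor, height: int, width: int) -> torch.Tensor:
    """Convert normalized [cx, cy, w, h] boxes to pixel [x1, y1, x2, y2]."""
    cx, cy, w, h = boxes.unbind(-1)
    return torch.stack([(cx - w / 2) * width,
                        (cy - h / 2) * height,
                        (cx + w / 2) * width,
                        (cy + h / 2) * height], dim=-1)

h, w = image_source.shape[:2]  # image_source is an (H, W, C) numpy array from load_image
pixel_boxes = to_pixel_xyxy(boxes, height=h, width=w)
```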
Speed vs Accuracy Tradeoffs
Choosing the right detector depends on your constraints. The comparisons below use mAP on COCO val2017 and speed measured on an NVIDIA T4 GPU; open-vocabulary detectors trade speed for flexibility.
When to Use Which Detector
Real-Time Applications (30+ FPS)
Use YOLO11n or YOLO11s. Surveillance, robotics, live video.
1.5-2.5ms per frame. Sacrifice accuracy for speed.
Balanced Production
Use YOLO11m or RT-DETR-L. General purpose, batch processing.
~5ms per frame. Good tradeoff between speed and accuracy.
Maximum Accuracy
Use YOLO11x or RT-DETR-X. Medical imaging, quality inspection.
~10ms per frame. When accuracy is more important than speed.
Custom Object Detection
Use Grounding DINO. Novel objects, dynamic categories, prototyping.
~45ms per frame. No training needed for new categories.
COCO Benchmark
COCO (Common Objects in Context) is the standard benchmark for object detection. It contains 80 object categories with bounding box annotations across 200K+ images.
Key Metrics
mAP (mean Average Precision)
Primary metric. Averages precision across IoU thresholds from 0.5 to 0.95 in steps of 0.05. Higher is better.
AP50 / AP75
AP at IoU threshold 0.5 (lenient) and 0.75 (strict). Useful for understanding localization quality.
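IoU (intersection over union) is the overlap ratio these thresholds refer to. A minimal sketch of a prediction that clears the lenient AP50 threshold but misses the strict AP75 one:

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

ground_truth = (100, 100, 200, 200)
prediction = (120, 110, 220, 205)

overlap = iou(ground_truth, prediction)
print(f'IoU = {overlap:.2f}')           # ~0.59
print('hit at AP50:', overlap >= 0.50)  # True  (lenient localization)
print('hit at AP75:', overlap >= 0.75)  # False (strict localization)
```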
Key Takeaways
1. Object detection outputs bounding boxes - coordinates, class labels, and confidence scores for each detected object.
2. YOLO v11 is the go-to for speed - from 1.5ms (nano) to 11ms (extra-large), covering edge to server deployments.
3. RT-DETR offers transformer architecture - end-to-end detection without NMS, better for overlapping objects.
4. Grounding DINO enables open-vocabulary detection - detect any object via text prompts, no retraining needed.