Computer vision,
measured in pixels.
From classification on ImageNet to detection on COCO to segmentation on ADE20K — the models that see, and what they see well. Every score dated, every metric defined, every dataset linked.
Descriptions in serif; scores in tabular mono; navigation in sans.
Reading the numbers.
Three families of metric cover nearly every computer-vision leaderboard. Each asks a different question of the model — localization, classification, or pixel-level agreement.
Average precision (AP).
The gold standard for object detection. Measures how well the model places bounding boxes and classifies the objects inside them.
- AP50: Easy mode. A detection counts when IoU with the ground-truth box is at least 0.50.
- AP75: Hard mode. IoU must reach 0.75 (tight boxes).
- mAP (COCO): AP averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05.
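A minimal sketch of how these thresholds behave, assuming axis-aligned boxes in (x1, y1, x2, y2) format; the helper name and toy coordinates are illustrative, not taken from any particular detection library:

```python
import numpy as np

def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# COCO-style mAP averages AP over ten IoU thresholds: 0.50, 0.55, ..., 0.95.
coco_thresholds = np.arange(0.50, 1.00, 0.05)
print("COCO averaging thresholds:", np.round(coco_thresholds, 2))

pred = (10, 10, 50, 50)   # toy predicted box
gt = (12, 12, 48, 52)     # toy ground-truth box
iou = box_iou(pred, gt)
print(f"IoU = {iou:.2f}")           # counts toward AP50 if >= 0.50
print(f"hit @0.75: {iou >= 0.75}")  # must also clear 0.75 to count toward AP75
```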
Top-1 / Top-5 accuracy.
For image classification. Top-1 requires the single highest-scoring prediction to be the true class; Top-5 requires the true class to appear among the five highest-scoring predictions.
- Top-1: Percentage of images where the top prediction is correct.
- Top-5: Percentage of images where the correct class is in the top five predictions.
- Higher is better: 90% means nine of every ten images are labeled correctly.
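The same idea in code: a small sketch of Top-k accuracy over a batch of class scores, with random toy logits standing in for real model outputs:

```python
import numpy as np

def topk_accuracy(logits, labels, k=1):
    """Fraction of samples whose true label appears among the k highest-scoring classes."""
    topk = np.argsort(logits, axis=1)[:, -k:]       # indices of the k largest scores per row
    hits = (topk == labels[:, None]).any(axis=1)    # true label found among those indices?
    return hits.mean()

# Toy batch: 4 images, 10 classes (scores are illustrative, not from a real model).
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
labels = np.array([3, 7, 1, 9])
print("top-1:", topk_accuracy(logits, labels, k=1))
print("top-5:", topk_accuracy(logits, labels, k=5))
```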
Mean Intersection over Union (mIoU).
For semantic segmentation. Measures pixel-level overlap between the predicted mask and the ground truth.
- Pixel-level: Evaluates every pixel in the image.
- IoU per class: Calculated for each semantic class.
- Mean: Average IoU across all classes.
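A short sketch of how per-class IoU and its mean are computed from label maps, assuming integer class indices per pixel and an ignore label of 255 (both conventions chosen only for illustration):

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Per-class IoU over every valid pixel, then the unweighted mean across classes."""
    valid = gt != ignore_index
    ious = []
    for c in range(num_classes):
        p, g = (pred == c) & valid, (gt == c) & valid
        union = (p | g).sum()
        if union == 0:       # class absent from both maps: skip it
            continue
        ious.append((p & g).sum() / union)
    return float(np.mean(ious))

# Toy 4x4 label maps with 3 classes (values are illustrative).
gt   = np.array([[0, 0, 1, 1], [0, 0, 1, 1], [2, 2, 2, 2], [2, 2, 2, 2]])
pred = np.array([[0, 0, 1, 1], [0, 1, 1, 1], [2, 2, 2, 2], [2, 2, 0, 2]])
print(f"mIoU = {mean_iou(pred, gt, num_classes=3):.3f}")
```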
Three task families.
Detection, classification, and segmentation — the backbone tasks of modern computer vision. Each card links to its leaderboard below.
Locating and classifying objects with bounding boxes. COCO and Pascal VOC benchmarks.
Categorizing images into predefined classes. ImageNet and CIFAR benchmarks.
Pixel-level classification of images. ADE20K and Cityscapes benchmarks.
Object detection.
Locating and classifying objects with bounding boxes. Higher mAP is better; the top row is the current state of the art on COCO.
| # | Model | Vendor | COCO mAP | Pascal VOC mAP | Architecture |
|---|---|---|---|---|---|
| 01 | Co-DETR (Swin-L) | Research | 66.0 | — | Transformer Detector |
| 02 | InternImage-H | Shanghai AI Lab | 65.4 | — | Deformable Convolution |
| 03 | DINO (Swin-L) | Research | 63.3 | — | Transformer Detector |
| 04 | YOLOv10-X | Tsinghua | 57.4 | — | CNN (Real-time) |
| 05 | EfficientDet-D7x | Google | 55.1 | — | EfficientNet+BiFPN |
Image classification.
Categorizing images into predefined classes. Higher accuracy is better.
ImageNet and CIFAR classification benchmarks will be added soon.
Semantic segmentation.
Pixel-level classification of images. Higher mIoU is better.
| # | Model | Vendor | ADE20K mIoU | Cityscapes mIoU | Architecture |
|---|---|---|---|---|---|
| 01 | InternImage-H | Shanghai AI Lab | 62.9 | — | Deformable Convolution |
| 02 | Mask2Former (Swin-L) | Meta | 57.3 | — | Transformer |
The benchmarks.
Every canonical computer-vision dataset, grouped by task. Click through for the paper or the dataset download.
COCO.
330K images, 1.5 million object instances, 80 object categories. Standard benchmark for object detection and segmentation.
- Task: object-detection
- Images: 330,000
ImageNet-1K.
1.28M training images, 50K validation images across 1,000 object classes. The standard benchmark for image classification since 2012.
- Task: image-classification
- Images: 1,281,167
ImageNet linear probe.
Linear classification on frozen ImageNet-1K features. Used to evaluate the representation quality of self-supervised and contrastive models without fine-tuning the backbone (a sketch of the protocol follows this entry).
- Task: image-classification
- Images: 1,281,167
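As a rough illustration of the linear-probe protocol, the sketch below trains only a logistic-regression classifier on top of frozen features; the arrays are random placeholders, and a real evaluation would use features extracted from the full 1.28M/50K ImageNet-1K splits with 1,000 classes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Random placeholders standing in for frozen-backbone features; in practice these
# come from running the pretrained encoder over ImageNet-1K with no gradient updates.
train_feats, train_labels = rng.normal(size=(2000, 256)), rng.integers(0, 10, 2000)
val_feats, val_labels = rng.normal(size=(500, 256)), rng.integers(0, 10, 500)

# Only this linear classifier is trained; the backbone stays frozen throughout.
probe = LogisticRegression(max_iter=1000)
probe.fit(train_feats, train_labels)
print("linear-probe top-1:", probe.score(val_feats, val_labels))
```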
ImageNet-V2.
10K new test images collected by following the original ImageNet collection process. Tests model generalization beyond the original test set.
- Task: image-classification
- Images: 10,000
Cityscapes.
5,000 images with fine annotations and 20,000 with coarse annotations of urban street scenes.
- Task: semantic-segmentation
- Images: 25,000
Keep exploring.
Beyond detection, classification, and segmentation — adjacent sections of the vision registry.