Computer Vision
Building systems that understand images and video? Find benchmarks for recognition, detection, segmentation, and document analysis tasks.
Computer vision in 2026 looks nothing like 2023. Foundation models (DINOv2, SAM 3) have replaced task-specific training for most pipelines. NMS-free detection (YOLO26, RF-DETR) is the new production standard. Open-source rivals proprietary across every task. The bottleneck has shifted from models to data, deployment, and evaluation on your actual domain.
State of the Field (2026)
- DINOv2 is the default backbone — used by RF-DETR (detection), Depth Anything 3 (depth), and SAM 3 (segmentation). It's the new ImageNet-pretrained ResNet.
- SAM 3 (Meta, Nov 2025) does open-vocabulary detection + segmentation + video tracking from text prompts. The 'GPT moment' for segmentation.
- DINO-X achieves 56.0 AP on COCO zero-shot — no training on COCO at all. 59.8 AP on LVIS-minival. The best open-set detector, period.
- RF-DETR is the first real-time model family to exceed 60 AP on COCO; smaller variants trade accuracy for speed, hitting 54.7% mAP at <5ms latency on a T4 GPU.
- YOLO26 (Sep 2025) removes NMS entirely. 43% faster CPU inference than YOLO11. Purpose-built for edge deployment.
- ImageNet top-1 is 91% (CoCa). COCO AP is 66% (ScyllaNet). Further gains cost orders of magnitude more compute for diminishing returns.
- The line between 'vision model' and 'vision-language model' has dissolved. SAM 3, InternVL3.5, DINO-X all accept text prompts natively.
Architecture Evolution
From hand-crafted features to foundation models in 13 years.
Benchmark Saturation
All major CV benchmarks are flattening. The question is no longer accuracy — it is domain transfer.
Current SOTA
Detection and segmentation scores across leading models.
OBJECT DETECTION — COCO AP
Higher is better. 0-shot = no COCO training.
SEGMENTATION — ADE20K
Semantic segmentation on 150 categories.
Speed vs Accuracy
The real tradeoff.
DETECTION — SPEED VS ACCURACY
COCO AP vs inference latency (T4 GPU). Log scale.
The DINOv2 Ecosystem
One self-supervised backbone powers detection, segmentation, depth, and open-vocabulary models.
Which Model?
Decision tree for detection model selection.
Timeline
Key breakthroughs from AlexNet to SAM 3.
- 2012 (AlexNet): CNNs beat hand-crafted features. 15.3% top-5 error on ImageNet.
- 2014 (GoogLeNet): Deeper networks. 6.7% top-5 error.
- 2015 (ResNet): Skip connections enable 152 layers. 3.6% top-5.
- 2017 (Mask R-CNN): Instance segmentation becomes practical.
- 2020 (ViT): Transformers enter vision. Pure attention, no convolutions.
- 2021 (CLIP): Vision-language pretraining. Zero-shot classification.
- 2022 (Stable Diffusion): Open-source image generation goes mainstream.
- 2023 (SAM, DINOv2): Foundation models for segmentation and features.
- 2024 (SAM 2, Depth Anything): Video segmentation. Monocular depth solved.
- 2025 (SAM 3, RF-DETR, YOLO26): Open-vocab detect+segment+track. Real-time >60 AP. NMS-free.
Current SOTA by Task
| Task | Benchmark | Model | Score | Note |
|---|---|---|---|---|
| Image Classification | ImageNet-1K | CoCa | 91.0% top-1 | Benchmark saturated — focus shifting to robustness variants |
| Object Detection | COCO test-dev | ScyllaNet | 66.0 AP | RF-DETR: 60+ AP real-time (<5ms) |
| Object Detection (open-vocab) | LVIS-minival | DINO-X Pro | 59.8 AP | Zero-shot, no LVIS training |
| Semantic Segmentation | ADE20K | InternImage-H | 62.9 mIoU | 1.08B params |
| Panoptic Segmentation | COCO | SAM 3 | SOTA | Also: open-vocab + video tracking |
| Depth Estimation | Multi-view | Depth Anything 3 | +44% vs VGGT | Single DINOv2 transformer, any number of views |
| Image Generation | ImageNet-256 FID | DiT variant | 1.35 FID | FLUX.2 best open-source for text-to-image |
| Video Understanding | Kinetics-400 | InternVideo 2.5 | ~92% | Multimodal, SOTA across 39 video datasets |
Key Models
- SAM 3: Open-vocab detect + segment + track
- DINO-X: Zero-shot detection (1200+ categories)
- RF-DETR: First real-time >60 AP on COCO
- YOLO26: NMS-free edge detection standard
- DINOv2: Self-supervised visual features backbone
- Depth Anything 3: Unified monocular + multi-view depth
- InternVL3.5: Best open-source VLM (72.2 MMMU)
- FLUX.2: Production-grade open image generation
Quick Recommendations
Detection (production, known classes)
YOLO26 (edge) or RF-DETR (server)
YOLO26: NMS-free, 43% faster CPU. RF-DETR: first >60 AP real-time. Fine-tune on your data. Always.
Detection (open-vocabulary)
DINO-X Pro or Grounding DINO 1.6
Best zero-shot accuracy. Use as a labelling assistant, then train YOLO for production.
Segmentation
SAM 3 (interactive) or Mask2Former (production)
SAM 3 for annotation and prompting. Mask2Former/OneFormer fine-tuned for deployment metrics.
Depth estimation
Depth Anything V2 (single image) or V3 (multi-view)
Production-ready, fast, well-supported. Metric3D v2 if you need absolute scale for robotics.
Vision-language understanding
InternVL3.5 (open-source) or GPT-4o (API)
InternVL3.5: 72.2 MMMU, runs locally. GPT-4o: best reasoning but 100x cost. Gemini 2.0 Flash for high-volume.
Image generation
FLUX.2 (local) or SD3.5 (ecosystem)
FLUX.2 rivals proprietary quality. SD3.5 has the LoRA/ControlNet ecosystem. SDXL still best for low VRAM.
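The recommendations above collapse into a few lines of routing logic. This toy helper just encodes this page's guidance (the model names come from the recommendations, not any official taxonomy):

```python
def pick_detector(open_vocabulary: bool, edge_deployment: bool) -> str:
    """Route a detection use case to this page's recommended model.

    Open-vocabulary work goes to DINO-X / Grounding DINO; known-class
    workloads go to YOLO26 on edge hardware or RF-DETR on servers.
    """
    if open_vocabulary:
        return "DINO-X Pro / Grounding DINO 1.6"
    return "YOLO26" if edge_deployment else "RF-DETR"
```

Whatever the route, the advice above still applies: fine-tune on your own data before shipping.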
Tasks & Benchmarks
Optical Character Recognition
Extracting text from document images
Scene Text Detection
Detecting text regions in natural scene images
Scene Text Recognition
Recognizing text in natural scene images
Document Layout Analysis
Analyzing the layout structure of documents
Document Parsing
Parsing document structure and content
Document Image Classification
Classifying documents by type or category
General OCR Capabilities
Comprehensive benchmarks covering multiple aspects of OCR performance.
Handwriting Recognition
Recognizing handwritten text
Table Recognition
Detecting and parsing tables in documents
Object Detection
Object detection — finding what's in an image and where — is the backbone of autonomous vehicles, surveillance, and robotics. The two-stage R-CNN lineage (2014–2017) gave way to single-shot detectors like YOLO, whose lineage now runs to YOLO26 and is still getting faster. DETR (2020) proved transformers could replace hand-designed components like NMS entirely, spawning a family of end-to-end detectors that dominate COCO leaderboards above 60 mAP. The field's current obsession: open-vocabulary detection that works on any object described in natural language, not just fixed categories.
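The post-processing step that NMS-free detectors design away is worth understanding. A minimal NumPy sketch of IoU and greedy NMS (illustrative only, not any library's implementation):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping rivals."""
    order = np.argsort(scores)[::-1]   # indices, best score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) < iou_threshold]
    return keep
```

End-to-end detectors like DETR and YOLO26 emit one box per object directly, so this greedy de-duplication pass (and its tuning headaches) disappears.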
Image Classification
Image classification is the task that launched modern deep learning — AlexNet's 2012 ImageNet win cut error rates in half overnight and triggered the entire neural network renaissance. The progression from VGGNet to ResNet to Vision Transformers traces the intellectual history of the field itself. Today's frontier models like CoCa and EVA-02 push top-1 accuracy to around 91% on ImageNet, but the real action has shifted to efficiency (MobileNet, EfficientNet) and robustness under distribution shift. Still the default benchmark for new architectures, and the foundation that every other vision task builds on.
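The top-1 and top-5 numbers quoted throughout reduce to a few lines. A NumPy sketch over toy logits (not a real evaluation harness):

```python
import numpy as np

def top_k_accuracy(logits, labels, k=5):
    """Fraction of samples whose true label is among the k highest logits."""
    topk = np.argsort(logits, axis=1)[:, -k:]        # indices of k largest scores
    hits = (topk == labels[:, None]).any(axis=1)     # did any of them match?
    return hits.mean()
```

Top-5 error, the classic ImageNet headline number, is just `1 - top_k_accuracy(logits, labels, k=5)`.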
Document Understanding
Document understanding requires parsing visually rich documents — invoices, forms, scientific papers, tables — where layout and typography carry as much meaning as the text itself. LayoutLMv3 (2022) and Donut pioneered layout-aware pretraining, but the game changed when GPT-4V and Claude 3 demonstrated that general-purpose multimodal LLMs could match or exceed specialist models on DocVQA and InfographicsVQA without fine-tuning. The persistent challenges are multi-page reasoning, handling handwritten text mixed with print, and accurately extracting structured data from complex table layouts. This task sits at the intersection of OCR, layout analysis, and language understanding, making it one of the highest-value enterprise AI applications.
Semantic Segmentation
Semantic segmentation assigns a class label to every pixel — the dense prediction problem that underpins autonomous driving, medical imaging, and satellite analysis. FCN (2015) showed you could repurpose classifiers for pixel labeling, DeepLab introduced atrous convolutions and CRFs, and SegFormer (2021) proved transformers dominate here too. State-of-the-art on Cityscapes exceeds 85 mIoU, but ADE20K with its 150 classes remains brutally challenging. The frontier has moved toward universal segmentation models like Mask2Former that handle semantic, instance, and panoptic segmentation in a single architecture.
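The mIoU metric behind these leaderboards is simple to compute. A minimal NumPy sketch (per-class IoU averaged over classes present in either map, which is one common convention among several):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes, from flattened per-pixel class predictions."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:                    # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```

On ADE20K this average runs over 150 classes, which is why scores there sit far below the Cityscapes numbers.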
Depth Estimation
Depth estimation recovers 3D structure from 2D images — a problem that haunted computer vision for decades before deep learning cracked monocular depth prediction. The field shifted dramatically with MiDaS (2019) showing that mixing diverse training data beats task-specific models, then again with Depth Anything (2024) proving foundation model scale changes everything. Modern systems achieve sub-5% relative error on NYU Depth V2, but real-world robustness — handling reflections, transparency, and extreme lighting — remains the frontier. Critical for autonomous driving, AR/VR, and robotics where accurate 3D perception is non-negotiable.
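The headline depth metrics are straightforward. A sketch of AbsRel (the "sub-5% relative error" figure) and the δ<1.25 accuracy commonly reported alongside it, on toy arrays assumed to be metric-scale depths:

```python
import numpy as np

def abs_rel(pred_depth, gt_depth):
    """Absolute relative error: mean(|pred - gt| / gt). Lower is better."""
    return float(np.mean(np.abs(pred_depth - gt_depth) / gt_depth))

def delta1(pred_depth, gt_depth):
    """delta < 1.25 accuracy: fraction of pixels where the larger of
    pred/gt and gt/pred stays under 1.25. Higher is better."""
    ratio = np.maximum(pred_depth / gt_depth, gt_depth / pred_depth)
    return float(np.mean(ratio < 1.25))
```

Note these assume metric ground truth; relative-depth models like Depth Anything need a scale-and-shift alignment step first.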
Zero-Shot Object Detection
Zero-shot object detection finds and localizes objects described by free-form text, without any task-specific fine-tuning — the open-vocabulary dream of detection. Grounding DINO (2023) married DINO's detection architecture with grounded pre-training to achieve state-of-the-art open-set detection, while OWL-ViT and YOLO-World showed different paths to the same goal. The technical challenge is grounding language precisely enough to distinguish similar objects ("the red car" vs "the blue car" in the same scene). This is rapidly replacing traditional closed-set detectors in production because it eliminates the most painful step: collecting and annotating domain-specific training data.
Image Feature Extraction
Image feature extraction produces dense vector representations that encode visual semantics — the hidden layer outputs that power retrieval, clustering, similarity search, and transfer learning. The field progressed from hand-crafted descriptors (SIFT, SURF) to CNN features (ResNet, EfficientNet) to self-supervised vision transformers like DINOv2 (2023), which produces features so rich they rival task-specific models on segmentation, depth, and classification without any fine-tuning. DINOv2's success proved that visual foundation models can match the "extract and use everywhere" paradigm that BERT established in NLP. The quality of your feature extractor determines the ceiling for virtually every downstream vision task.
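A minimal retrieval loop over such features; the vectors below are placeholders standing in for embeddings from a backbone like DINOv2:

```python
import numpy as np

def retrieve(query_feat, index_feats, top_k=3):
    """Nearest-neighbour retrieval by cosine similarity over L2-normalised features."""
    q = query_feat / np.linalg.norm(query_feat)
    idx = index_feats / np.linalg.norm(index_feats, axis=1, keepdims=True)
    sims = idx @ q                            # cosine similarity to every indexed item
    order = np.argsort(sims)[::-1][:top_k]    # most similar first
    return order, sims[order]
```

At scale you would swap the brute-force matrix product for an approximate index (FAISS or similar), but the similarity itself doesn't change.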
Image-to-3D
Image-to-3D reconstruction infers full 3D geometry from one or a few images — a fundamentally ill-posed problem that recent models solve with learned geometric priors. Traditional multi-view stereo required dozens of calibrated views, but single-image methods like One-2-3-45 (2023) and TripoSR leverage large-scale 3D training data to hallucinate plausible geometry from a single photo. 3D Gaussian Splatting (2023) revolutionized the representation side, enabling real-time rendering of reconstructed scenes. The practical gap is clear: scanned objects still look better than generated ones, but the convenience of snap-and-reconstruct is reshaping e-commerce product visualization and AR content creation.
Image-to-Image
Image-to-image translation covers a vast family of tasks — super-resolution, style transfer, inpainting, colorization, denoising — unified by the idea of learning a mapping between image domains. Pix2Pix (2017) and CycleGAN showed paired and unpaired translation were both learnable, but diffusion models rewrote the playbook entirely. ControlNet (2023) demonstrated that conditioning Stable Diffusion on edges, depth, or poses gives surgical control over generation, while models like SUPIR push restoration quality beyond what was thought possible. The Swiss army knife of visual AI — nearly every creative and restoration workflow runs through some form of image-to-image.
Image-to-Video
Image-to-video generation animates a single still image into a coherent video sequence — one of the hardest generation tasks because it demands both visual fidelity and temporal consistency. Stable Video Diffusion (2023) proved that fine-tuning image diffusion models on video data produces remarkably stable motion, and Runway's Gen-3 and Kling showed commercial viability. The key challenge remains physics-aware motion: objects should move naturally, lighting should evolve consistently, and the camera should behave like a real one. A cornerstone of the emerging AI filmmaking pipeline.
Keypoint Detection
Keypoint detection localizes specific anatomical or structural landmarks — body joints, facial features, hand articulations — enabling pose estimation, gesture recognition, and motion capture. OpenPose (2017) first demonstrated real-time multi-person pose estimation, and the field has since progressed through HRNet, ViTPose, and RTMPose pushing both accuracy and speed. Modern systems detect 133 whole-body keypoints (body + hands + face) in real-time on mobile devices. The applications span from sports biomechanics (analyzing an athlete's form frame-by-frame) to sign language recognition and AR avatar puppeteering.
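A common keypoint metric, PCK, fits in a few lines. A sketch only: the `normalize` reference length is typically head or torso size, and conventions vary by benchmark:

```python
import numpy as np

def pck(pred_kpts, gt_kpts, threshold, normalize):
    """Percentage of Correct Keypoints: a predicted joint counts as correct
    when its distance to ground truth is under threshold * normalize."""
    dists = np.linalg.norm(pred_kpts - gt_kpts, axis=-1)
    return float(np.mean(dists < threshold * normalize))
```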
Mask Generation
Mask generation produces pixel-precise segmentation masks for objects, and Meta's Segment Anything (SAM, 2023) transformed it from a specialized task into a foundational capability. Trained on 11M images with 1B+ masks, SAM demonstrated that a single promptable model — click a point, draw a box, or provide text — could segment virtually anything. SAM 2 (2024) extended this to video with real-time tracking, while EfficientSAM and FastSAM address the original's computational cost. The "foundation model" moment for segmentation, analogous to what GPT-3 was for NLP.
Text-to-3D
Text-to-3D generates 3D assets — meshes, NeRFs, or Gaussian splats — from text descriptions alone, a capability that barely existed before DreamFusion (2022) showed score distillation sampling could lift 2D diffusion priors into 3D. The field moves at breakneck speed: Magic3D added coarse-to-fine generation, Instant3D achieved single-shot inference, and Meshy and Tripo brought commercial quality. Multi-view consistency remains the core challenge — the "Janus problem" where different viewpoints produce contradictory details. The promise of democratizing 3D content creation for games, VR, and e-commerce is driving massive investment.
Text-to-Video
Text-to-video generation is the most ambitious frontier in generative AI — synthesizing temporally coherent, physically plausible video from text prompts alone. The field exploded in 2024 with Sora demonstrating cinematic-quality generation, followed by open models like CogVideoX and Mochi pushing accessibility. The core technical challenge is maintaining consistency across frames: characters shouldn't morph, physics should hold, and camera motion should feel intentional. Quality is improving at a staggering pace, but generation still takes minutes per clip and artifacts remain visible under scrutiny — the gap between demos and reliable production tools is real.
Unconditional Image Generation
Unconditional image generation — producing realistic images from pure noise — is the purest test of a generative model's learned distribution. GANs dominated for years (ProGAN, StyleGAN, StyleGAN3 pushed FID below 2 on FFHQ), but diffusion models dethroned them in both quality and diversity starting with DDPM (2020). The FID metric itself is now questioned as models produce images indistinguishable from real photos. Historically the proving ground for new generative architectures, though the field's energy has largely migrated to conditional generation (text-to-image) where practical applications live.
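The FID computation itself is just the Fréchet distance between two Gaussians fitted to Inception features. A sketch assuming SciPy is available; the inputs here would be toy statistics, not real feature moments:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet Inception Distance between two Gaussians (mean, covariance):
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * (S1 @ S2)^(1/2))."""
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):     # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Note what the formula measures: distance between two fitted Gaussians in feature space, nothing about coherence or aesthetics, which is exactly the criticism raised below.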
Video Classification
Video classification — recognizing actions and events in clips — extends image understanding into the temporal domain, requiring models to reason about motion, context, and temporal ordering. The field evolved from hand-crafted features (HOG, optical flow) through 3D CNNs (C3D, I3D) to video transformers like TimeSformer and VideoMAE that treat frames as spatiotemporal tokens. Kinetics-400 accuracy now exceeds 90%, but the real challenge is long-form video understanding where events unfold over minutes, not seconds. Essential for content moderation, sports analytics, and security applications.
Video-to-Video
Video-to-video translation transforms existing footage — applying style transfer, temporal super-resolution, relighting, or motion retargeting while preserving temporal coherence across frames. The naive approach of processing frames independently produces unwatchable flicker, so the core technical challenge is enforcing cross-frame consistency. Diffusion-based approaches like Rerender-A-Video and TokenFlow (2023) showed that propagating attention features between frames solves this elegantly. The practical frontier is real-time processing for live video — current methods are offline and slow, but the creative potential for film post-production, video editing, and content repurposing is enormous.
Zero-Shot Image Classification
Zero-shot image classification uses vision-language models to categorize images into arbitrary classes never seen during training — you describe categories in text, and the model matches. CLIP (2021) proved this was viable at scale by training on 400M image-text pairs, achieving competitive accuracy on ImageNet without ever seeing a labeled example. SigLIP, EVA-CLIP, and MetaCLIP have since pushed zero-shot ImageNet accuracy above 83%, closing the gap with supervised models. The paradigm shift this represents is profound: instead of collecting labeled datasets for every new domain, you just describe what you're looking for.
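The CLIP recipe reduces to cosine similarity between one image embedding and one text embedding per class name. A toy sketch with placeholder embeddings (the temperature value is illustrative; CLIP learns it during training):

```python
import numpy as np

def zero_shot_classify(image_feat, text_feats, temperature=100.0):
    """CLIP-style zero-shot classification: cosine similarity between an
    image embedding and one text embedding per class, softmaxed."""
    img = image_feat / np.linalg.norm(image_feat)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = temperature * (txt @ img)        # scaled cosine similarities
    probs = np.exp(logits - logits.max())     # numerically stable softmax
    return probs / probs.sum()
```

In practice the text embeddings come from prompts like "a photo of a {class}", and swapping the class list is all it takes to retarget the classifier to a new domain.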
Honest Takes
Classification and clean-doc OCR are solved. Move on.
91% top-1 on ImageNet. Real-time detection at 55+ AP in <5ms. Monocular depth is production-ready. Stop optimising saturated benchmarks and focus on your actual domain gap.
Zero-shot is a starting point, not an endpoint
DINO-X gets 56 AP zero-shot on COCO. A fine-tuned YOLO26 will beat it on your specific domain every time. Use zero-shot for labelling and prototyping, then train a specialist for production.
The bottleneck is data and deployment, not models
Foundation models are good enough. The real work is getting labelled data for your domain (industrial defects, medical images, satellite), then quantising and distilling for your hardware.
3D vision is still 5 years behind 2D
Depth maps look cool in demos. In production, you need multi-view or LiDAR for anything safety-critical. No single foundation model does 3D as well as DINOv2 does 2D features.
FID scores for image generation are meaningless
FID doesn't capture what humans care about — coherence, prompt following, aesthetics. FLUX.1 'feels' better than models with lower FID. Trust human evals, not automated metrics.
In-Depth Guides
Need help choosing?
We benchmark models on your actual data. Same methodology as CodeSOTA, your domain, your hardware constraints.
Book Assessment
Get notified when these results update
New models drop weekly. We track them so you don't have to.