Computer Vision

Building systems that understand images and video? Find benchmarks for recognition, detection, segmentation, and document analysis tasks.

27 tasks · 199 datasets · 1778 results

Computer vision in 2026 looks nothing like 2023. Foundation models (DINOv2, SAM 3) have replaced task-specific training for most pipelines. NMS-free detection (YOLO26, RF-DETR) is the new production standard. Open-source rivals proprietary across every task. The bottleneck has shifted from models to data, deployment, and evaluation on your actual domain.

State of the Field (2026)

  • DINOv2 is the default backbone — used by RF-DETR (detection), Depth Anything 3 (depth), and SAM 3 (segmentation). It's the new ImageNet-pretrained ResNet.
  • SAM 3 (Meta, Nov 2025) does open-vocabulary detection + segmentation + video tracking from text prompts. The 'GPT moment' for segmentation.
  • DINO-X achieves 56.0 AP on COCO zero-shot — no training on COCO at all. 59.8 AP on LVIS-minival. The best open-set detector, period.
  • RF-DETR is the first real-time model to exceed 60 AP on COCO (RF-DETR-L: 60.2 AP); smaller variants reach 54.7% mAP at <5ms latency on a T4 GPU.
  • YOLO26 (Sep 2025) removes NMS entirely. 43% faster CPU inference than YOLO11. Purpose-built for edge deployment.
  • ImageNet top-1 is 91% (CoCa). COCO AP is 66.0 (ScyllaNet). Further gains cost orders of magnitude more compute for diminishing returns.
  • The line between 'vision model' and 'vision-language model' has dissolved. SAM 3, InternVL3.5, DINO-X all accept text prompts natively.
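For context on the "NMS-free" claim: classical detectors emit many overlapping candidate boxes, which a greedy non-maximum suppression pass prunes at inference time; YOLO26 and the DETR family fold that deduplication into the model itself. A minimal sketch of the step being removed (toy boxes and my own helper names, not any library's API):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop every remaining
    box that overlaps it above iou_thresh, then repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # → [0, 2]: the two overlapping boxes collapse to one
```

Removing this step matters on edge hardware because NMS is a sequential, data-dependent loop that resists batching and acceleration.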

Architecture Evolution

From hand-crafted features to foundation models in 13 years.

CNN Era (2012–2020): AlexNet, ResNet, EfficientNet (~89% ImageNet top-1)
Transformer Era (2020–2023): ViT, Swin, DINOv2 (~91%)
Foundation Era (2023–now): SAM 3, DINO-X, RF-DETR (open-vocab)
Key shift: task-specific training → foundation model backbones → open-vocabulary everything

Benchmark Saturation

All major CV benchmarks are flattening. The question is no longer accuracy — it is domain transfer.

[Chart: ImageNet top-1, COCO AP, and ADE20K mIoU, 2012–2024. All three curves are flattening; gains now cost orders of magnitude more compute.]

Current SOTA

Detection and segmentation scores across leading models.

OBJECT DETECTION — COCO AP

Higher is better. 0-shot = no COCO training.

ScyllaNet: 66.0 · RF-DETR-L: 60.2 · DINO-X (0-shot): 56.0 · YOLO26-X: 55.2 · YOLOv11-X: 54.7
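For readers unfamiliar with the metric: AP summarises the precision/recall curve built from detections ranked by confidence. A toy sketch with my own helper name (real COCO AP additionally matches boxes by IoU, averages over IoU thresholds 0.50–0.95, and applies a precision envelope):

```python
def average_precision(tp_flags, num_gt):
    """All-point average precision from detections sorted by descending
    confidence. tp_flags[i] is True if the i-th detection matched an
    unclaimed ground-truth box; num_gt is the total ground-truth count."""
    tp = fp = 0
    prec_at_recall = []
    for flag in tp_flags:
        tp += flag
        fp += not flag
        if flag:  # recall only increases on a true positive
            prec_at_recall.append((tp / num_gt, tp / (tp + fp)))
    ap, prev_r = 0.0, 0.0
    for r, p in prec_at_recall:
        ap += (r - prev_r) * p  # area under the precision/recall curve
        prev_r = r
    return ap

# Three detections, best-first: hit, miss, hit, with 2 ground-truth boxes.
print(average_precision([True, False, True], num_gt=2))  # → 0.8333…
```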

SEGMENTATION — ADE20K

Semantic segmentation on 150 categories.

InternImage-H: 62.9 mIoU · BEiT-3: 62.8 mIoU · OneFormer: 58.3 mIoU · SegFormer-B5: 51.8 mIoU
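mIoU, the metric above, is per-class intersection-over-union averaged across classes. A minimal sketch on flattened per-pixel label lists (pure Python, my own helper name):

```python
from collections import defaultdict

def miou(y_true, y_pred, num_classes):
    """Mean IoU over classes that appear in the ground truth or the
    prediction, computed from flattened per-pixel label sequences."""
    inter = defaultdict(int)
    count_t = defaultdict(int)
    count_p = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            inter[t] += 1
        count_t[t] += 1
        count_p[p] += 1
    ious = []
    for c in range(num_classes):
        union = count_t[c] + count_p[c] - inter[c]
        if union > 0:  # skip classes absent from both maps
            ious.append(inter[c] / union)
    return sum(ious) / len(ious)

print(miou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))  # → 0.5833…
```

Because every class counts equally, rare classes drag the mean down hard, which is why 150-class ADE20K sits near 63 mIoU while 19-class Cityscapes exceeds 85.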

Speed vs Accuracy

The real tradeoff.

DETECTION — SPEED VS ACCURACY

COCO AP vs inference latency (T4 GPU). Log scale.

[Scatter chart: COCO AP vs T4 latency (log scale) for YOLO26-N/S/M/X, RF-DETR-B/L, DINO-X (0-shot), and ScyllaNet, with the real-time zone marked at <10ms.]

The DINOv2 Ecosystem

One self-supervised backbone powers detection, segmentation, depth, and open-vocabulary models.

THE DINOV2 ECOSYSTEM

One backbone, every task. This is the new default.

DINOv2 (self-supervised backbone) → RF-DETR (detection), SAM 3 (segmentation), Depth Anything 3 (depth), DINO-X (open-vocab)

Which Model?

Decision tree for detection model selection.

DECISION TREE

Which detection model for your use case?

Known categories?
  NO → DINO-X / Grounding DINO (zero-shot open-vocabulary), then fine-tune YOLO on labels from DINO-X
  YES → Edge / real-time?
    YES → YOLO26 (NMS-free, 43% faster CPU)
    NO → RF-DETR (first real-time >60 AP)
Always fine-tune on your domain data. Zero-shot is a starting point.

Timeline

Key breakthroughs from AlexNet to SAM 3.

2012: AlexNet

CNNs beat hand-crafted features. 15.3% top-5 error on ImageNet.

2014: VGGNet / GoogLeNet

Deeper networks. 6.7% top-5 error.

2015: ResNet

Skip connections enable 152 layers. 3.6% top-5.

2017: Mask R-CNN

Instance segmentation becomes practical.

2020: ViT

Transformers enter vision. Pure attention, no convolutions.

2021: CLIP / DALL-E

Vision-language pretraining. Zero-shot classification.

2022: Stable Diffusion

Open-source image generation goes mainstream.

2023: SAM / DINOv2

Foundation models for segmentation and features.

2024: SAM 2 / Depth Anything

Video segmentation. Monocular depth solved.

2025: SAM 3 / RF-DETR / YOLO26

Open-vocab detect+segment+track. Real-time >60AP. NMS-free.

Current SOTA by Task

Task | Benchmark | Model | Score | Note
Image Classification | ImageNet-1K | CoCa | 91.0% top-1 | Benchmark saturated — focus shifting to robustness variants
Object Detection | COCO test-dev | ScyllaNet | 66.0 AP | RF-DETR: 60+ AP real-time (<5ms)
Object Detection (open-vocab) | LVIS-minival | DINO-X Pro | 59.8 AP | Zero-shot, no LVIS training
Semantic Segmentation | ADE20K | InternImage-H | 62.9 mIoU | 1.08B params
Panoptic Segmentation | COCO | SAM 3 | SOTA | Also: open-vocab + video tracking
Depth Estimation | Multi-view | Depth Anything 3 | +44% vs VGGT | Single DINOv2 transformer, any number of views
Image Generation | ImageNet-256 FID | DiT variant | 1.35 FID | FLUX.2 best open-source for text-to-image
Video Understanding | Kinetics-400 | InternVideo 2.5 | ~92% | Multimodal, SOTA across 39 video datasets

Key Models

SAM 3 (Meta)

Open-vocab detect + segment + track

DINO-X (IDEA Research)

Zero-shot detection (1200+ categories)

RF-DETR (Roboflow)

First real-time >60 AP on COCO

YOLO26 (Ultralytics)

NMS-free edge detection standard

DINOv2 (Meta)

Self-supervised visual features backbone

Depth Anything 3 (ByteDance)

Unified monocular + multi-view depth

InternVL 3.5 (OpenGVLab)

Best open-source VLM (72.2 MMMU)

FLUX.2 (Black Forest Labs)

Production-grade open image generation

Quick Recommendations

Detection (production, known classes)

YOLO26 (edge) or RF-DETR (server)

YOLO26: NMS-free, 43% faster CPU. RF-DETR: first >60 AP real-time. Fine-tune on your data. Always.

Detection (open-vocabulary)

DINO-X Pro or Grounding DINO 1.6

Best zero-shot accuracy. Use as a labelling assistant, then train YOLO for production.

Segmentation

SAM 3 (interactive) or Mask2Former (production)

SAM 3 for annotation and prompting. Mask2Former/OneFormer fine-tuned for deployment metrics.

Depth estimation

Depth Anything V2 (single image) or V3 (multi-view)

Production-ready, fast, well-supported. Metric3D v2 if you need absolute scale for robotics.

Vision-language understanding

InternVL3.5 (open-source) or GPT-4o (API)

InternVL3.5: 72.2 MMMU, runs locally. GPT-4o: best reasoning but 100x cost. Gemini 2.0 Flash for high-volume.

Image generation

FLUX.2 (local) or SD3.5 (ecosystem)

FLUX.2 rivals proprietary quality. SD3.5 has the LoRA/ControlNet ecosystem. SDXL still best for low VRAM.

Tasks & Benchmarks

Optical Character Recognition

Extracting text from document images

114 datasets · 696 results · SOTA tracked

Scene Text Detection

Detecting text regions in natural scene images

11 datasets · 520 results · SOTA tracked

Scene Text Recognition

Recognizing text in natural scene images

11 datasets · 127 results · SOTA tracked

Document Layout Analysis

Analyzing the layout structure of documents

5 datasets · 126 results · SOTA tracked

Document Parsing

Parsing document structure and content

2 datasets · 56 results · SOTA tracked

Document Image Classification

Classifying documents by type or category

7 datasets · 54 results · SOTA tracked

General OCR Capabilities

Comprehensive benchmarks covering multiple aspects of OCR performance.

4 datasets · 50 results · SOTA tracked

Handwriting Recognition

Recognizing handwritten text

7 datasets · 38 results · SOTA tracked

Table Recognition

Detecting and parsing tables in documents

5 datasets · 38 results · SOTA tracked

Object Detection

Object detection — finding what's in an image and where — is the backbone of autonomous vehicles, surveillance, and robotics. The two-stage R-CNN lineage (2014–2017) gave way to single-shot detectors like YOLO, which has since iterated through YOLO11 to the NMS-free YOLO26 and is still getting faster. DETR (2020) proved transformers could replace hand-designed components like NMS entirely, spawning a family of end-to-end detectors that dominate COCO leaderboards above 60 mAP. The field's current obsession: open-vocabulary detection that works on any object described in natural language, not just fixed categories.

3 datasets · 35 results · SOTA tracked

Image Classification

Image classification is the task that launched modern deep learning — AlexNet's 2012 ImageNet win cut error rates in half overnight and triggered the entire neural network renaissance. The progression from VGGNet to ResNet to Vision Transformers traces the intellectual history of the field itself. Today's frontier models like EVA-02 and SigLIP push top-1 accuracy above 91% on ImageNet, but the real action has shifted to efficiency (MobileNet, EfficientNet) and robustness under distribution shift. Still the default benchmark for new architectures, and the foundation that every other vision task builds on.

4 datasets · 25 results · SOTA tracked
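The top-1 and top-5 numbers quoted throughout this section reduce to a small computation over the model's class scores. A sketch with toy logits and my own helper name:

```python
def topk_accuracy(logits, labels, k=5):
    """Fraction of samples whose true label is among the k classes
    with the highest scores. logits: list of per-class score lists."""
    hits = 0
    for scores, label in zip(logits, labels):
        topk = sorted(range(len(scores)),
                      key=lambda i: scores[i], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)

logits = [[0.1, 0.9, 0.3], [0.8, 0.1, 0.05]]
labels = [1, 2]
print(topk_accuracy(logits, labels, k=2))  # → 0.5
```

Top-5 was the original ImageNet headline metric (AlexNet's 15.3% error above is top-5); top-1 is the stricter number reported today.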

Document Understanding

Document understanding requires parsing visually rich documents — invoices, forms, scientific papers, tables — where layout and typography carry as much meaning as the text itself. LayoutLMv3 (2022) and Donut pioneered layout-aware pretraining, but the game changed when GPT-4V and Claude 3 demonstrated that general-purpose multimodal LLMs could match or exceed specialist models on DocVQA and InfographicsVQA without fine-tuning. The persistent challenges are multi-page reasoning, handling handwritten text mixed with print, and accurately extracting structured data from complex table layouts. This task sits at the intersection of OCR, layout analysis, and language understanding, making it one of the highest-value enterprise AI applications.

2 datasets · 7 results · SOTA tracked

Semantic Segmentation

Semantic segmentation assigns a class label to every pixel — the dense prediction problem that underpins autonomous driving, medical imaging, and satellite analysis. FCN (2015) showed you could repurpose classifiers for pixel labeling, DeepLab introduced atrous convolutions and CRFs, and SegFormer (2021) proved transformers dominate here too. State-of-the-art on Cityscapes exceeds 85 mIoU, but ADE20K with its 150 classes remains brutally challenging. The frontier has moved toward universal segmentation models like Mask2Former that handle semantic, instance, and panoptic segmentation in a single architecture.

2 datasets · 6 results · SOTA tracked

Depth Estimation

Depth estimation recovers 3D structure from 2D images — a problem that haunted computer vision for decades before deep learning cracked monocular depth prediction. The field shifted dramatically with MiDaS (2019) showing that mixing diverse training data beats task-specific models, then again with Depth Anything (2024) proving foundation model scale changes everything. Modern systems achieve sub-5% relative error on NYU Depth V2, but real-world robustness — handling reflections, transparency, and extreme lighting — remains the frontier. Critical for autonomous driving, AR/VR, and robotics where accurate 3D perception is non-negotiable.

2 datasets · 0 results
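"Sub-5% relative error" refers to AbsRel, the mean of |predicted − ground truth| / ground truth over valid pixels. A minimal sketch on flattened depth lists (my own helper name):

```python
def abs_rel(pred, gt, eps=1e-6):
    """Mean absolute relative error, the standard monocular-depth
    metric. Pixels with non-positive ground-truth depth are skipped
    (sensors report 0 where they have no reading)."""
    errs = [abs(p - g) / g for p, g in zip(pred, gt) if g > eps]
    return sum(errs) / len(errs)

# One perfect pixel, one off by 50% of its true depth.
print(abs_rel([1.0, 2.0], [1.0, 4.0]))  # → 0.25
```

Note the metric is scale-sensitive: relative-depth models (most of the Depth Anything line by default) must be aligned to ground truth before AbsRel is meaningful, which is why metric-depth models are quoted separately.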

Zero-Shot Object Detection

Zero-shot object detection finds and localizes objects described by free-form text, without any task-specific fine-tuning — the open-vocabulary dream of detection. Grounding DINO (2023) married DINO's detection architecture with grounded pre-training to achieve state-of-the-art open-set detection, while OWL-ViT and YOLO-World showed different paths to the same goal. The technical challenge is grounding language precisely enough to distinguish similar objects ("the red car" vs "the blue car" in the same scene). This is rapidly replacing traditional closed-set detectors in production because it eliminates the most painful step: collecting and annotating domain-specific training data.

2 datasets · 0 results

Image Feature Extraction

Image feature extraction produces dense vector representations that encode visual semantics — the hidden layer outputs that power retrieval, clustering, similarity search, and transfer learning. The field progressed from hand-crafted descriptors (SIFT, SURF) to CNN features (ResNet, EfficientNet) to self-supervised vision transformers like DINOv2 (2023), which produces features so rich they rival task-specific models on segmentation, depth, and classification without any fine-tuning. DINOv2's success proved that visual foundation models can match the "extract and use everywhere" paradigm that BERT established in NLP. The quality of your feature extractor determines the ceiling for virtually every downstream vision task.

1 dataset · 0 results
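The retrieval use case above boils down to nearest-neighbour search over L2-normalized embeddings. A sketch with toy vectors and my own helper names (a real pipeline would get the vectors from a DINOv2-style encoder and use an approximate index instead of a linear scan):

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot product = cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def nearest(query, gallery, k=2):
    """Indices of the k gallery vectors most similar to the query,
    ranked by cosine similarity."""
    q = normalize(query)
    sims = [sum(a * b for a, b in zip(q, normalize(g))) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: sims[i], reverse=True)[:k]

gallery = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
print(nearest([1.0, 0.0], gallery))  # → [1, 0]
```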

Image-to-3D

Image-to-3D reconstruction infers full 3D geometry from one or a few images — a fundamentally ill-posed problem that recent models solve with learned geometric priors. Traditional multi-view stereo required dozens of calibrated views, but single-image methods like One-2-3-45 (2023) and TripoSR leverage large-scale 3D training data to hallucinate plausible geometry from a single photo. 3D Gaussian Splatting (2023) revolutionized the representation side, enabling real-time rendering of reconstructed scenes. The practical gap is clear: scanned objects still look better than generated ones, but the convenience of snap-and-reconstruct is reshaping e-commerce product visualization and AR content creation.

1 dataset · 0 results

Image-to-Image

Image-to-image translation covers a vast family of tasks — super-resolution, style transfer, inpainting, colorization, denoising — unified by the idea of learning a mapping between image domains. Pix2Pix (2017) and CycleGAN showed paired and unpaired translation were both learnable, but diffusion models rewrote the playbook entirely. ControlNet (2023) demonstrated that conditioning Stable Diffusion on edges, depth, or poses gives surgical control over generation, while models like SUPIR push restoration quality beyond what was thought possible. The Swiss army knife of visual AI — nearly every creative and restoration workflow runs through some form of image-to-image.

2 datasets · 0 results

Image-to-Video

Image-to-video generation animates a single still image into a coherent video sequence — one of the hardest generation tasks because it demands both visual fidelity and temporal consistency. Stable Video Diffusion (2023) proved that fine-tuning image diffusion models on video data produces remarkably stable motion, and Runway's Gen-3 and Kling showed commercial viability. The key challenge remains physics-aware motion: objects should move naturally, lighting should evolve consistently, and the camera should behave like a real one. A cornerstone of the emerging AI filmmaking pipeline.

1 dataset · 0 results

Keypoint Detection

Keypoint detection localizes specific anatomical or structural landmarks — body joints, facial features, hand articulations — enabling pose estimation, gesture recognition, and motion capture. OpenPose (2017) first demonstrated real-time multi-person pose estimation, and the field has since progressed through HRNet, ViTPose, and RTMPose pushing both accuracy and speed. Modern systems detect 133 whole-body keypoints (body + hands + face) in real-time on mobile devices. The applications span from sports biomechanics (analyzing an athlete's form frame-by-frame) to sign language recognition and AR avatar puppeteering.

2 datasets · 0 results
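A common accuracy metric here is PCK (Percentage of Correct Keypoints): a prediction counts as correct if it lands within a fraction of a reference scale, typically the torso or head-segment length, of the ground truth. A minimal sketch (my own helper name):

```python
import math

def pck(pred_kpts, gt_kpts, scale, alpha=0.2):
    """PCK@alpha: fraction of keypoints whose predicted (x, y) lies
    within alpha * scale of the ground-truth (x, y)."""
    correct = 0
    for (px, py), (gx, gy) in zip(pred_kpts, gt_kpts):
        if math.hypot(px - gx, py - gy) <= alpha * scale:
            correct += 1
    return correct / len(gt_kpts)

# Torso length 10: first keypoint is 1px off (correct at alpha=0.2),
# second is ~14px off (wrong).
print(pck([(0, 0), (10, 10)], [(0, 1), (0, 0)], scale=10))  # → 0.5
```

COCO's official keypoint metric, OKS, refines the same idea with a Gaussian falloff and per-keypoint tolerance constants.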

Mask Generation

Mask generation produces pixel-precise segmentation masks for objects, and Meta's Segment Anything (SAM, 2023) transformed it from a specialized task into a foundational capability. Trained on 11M images with 1B+ masks, SAM demonstrated that a single promptable model — click a point, draw a box, or provide text — could segment virtually anything. SAM 2 (2024) extended this to video with real-time tracking, while EfficientSAM and FastSAM address the original's computational cost. The "foundation model" moment for segmentation, analogous to what GPT-3 was for NLP.

1 dataset · 0 results

Text-to-3D

Text-to-3D generates 3D assets — meshes, NeRFs, or Gaussian splats — from text descriptions alone, a capability that barely existed before DreamFusion (2022) showed score distillation sampling could lift 2D diffusion priors into 3D. The field moves at breakneck speed: Magic3D added coarse-to-fine generation, Instant3D achieved single-shot inference, and Meshy and Tripo brought commercial quality. Multi-view consistency remains the core challenge — the "Janus problem" where different viewpoints produce contradictory details. The promise of democratizing 3D content creation for games, VR, and e-commerce is driving massive investment.

1 dataset · 0 results

Text-to-Video

Text-to-video generation is the most ambitious frontier in generative AI — synthesizing temporally coherent, physically plausible video from text prompts alone. The field exploded in 2024 with Sora demonstrating cinematic-quality generation, followed by open models like CogVideoX and Mochi pushing accessibility. The core technical challenge is maintaining consistency across frames: characters shouldn't morph, physics should hold, and camera motion should feel intentional. Quality is improving at a staggering pace, but generation still takes minutes per clip and artifacts remain visible under scrutiny — the gap between demos and reliable production tools is real.

2 datasets · 0 results

Unconditional Image Generation

Unconditional image generation — producing realistic images from pure noise — is the purest test of a generative model's learned distribution. GANs dominated for years (ProGAN, StyleGAN, StyleGAN3 pushed FID below 2 on FFHQ), but diffusion models dethroned them in both quality and diversity starting with DDPM (2020). The FID metric itself is now questioned as models produce images indistinguishable from real photos. Historically the proving ground for new generative architectures, though the field's energy has largely migrated to conditional generation (text-to-image) where practical applications live.

2 datasets · 0 results
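For reference, FID is the Fréchet distance between Gaussians fitted to real and generated feature embeddings. In one dimension the formula collapses to something you can compute by hand (my own helper name; the real metric uses multivariate Inception-feature statistics, with a matrix square root in place of the scalar one):

```python
import math

def frechet_1d(mu1, var1, mu2, var2):
    """Fréchet distance between two 1-D Gaussians:
    (mu1 - mu2)^2 + var1 + var2 - 2*sqrt(var1*var2).
    FID generalizes this to ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^(1/2))."""
    return (mu1 - mu2) ** 2 + var1 + var2 - 2 * math.sqrt(var1 * var2)

print(frechet_1d(0, 1, 0, 1))  # → 0.0 (identical distributions)
print(frechet_1d(0, 1, 3, 1))  # → 9.0 (means 3 apart)
```

The formula also makes the metric's blind spot concrete: it only compares the first two moments of a feature distribution, which is why low FID need not mean better-looking images.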

Video Classification

Video classification — recognizing actions and events in clips — extends image understanding into the temporal domain, requiring models to reason about motion, context, and temporal ordering. The field evolved from hand-crafted features (HOG, optical flow) through 3D CNNs (C3D, I3D) to video transformers like TimeSformer and VideoMAE that treat frames as spatiotemporal tokens. Kinetics-400 accuracy now exceeds 88%, but the real challenge is long-form video understanding where events unfold over minutes, not seconds. Essential for content moderation, sports analytics, and security applications.

3 datasets · 0 results

Video-to-Video

Video-to-video translation transforms existing footage — applying style transfer, temporal super-resolution, relighting, or motion retargeting while preserving temporal coherence across frames. The naive approach of processing frames independently produces unwatchable flicker, so the core technical challenge is enforcing cross-frame consistency. Diffusion-based approaches like Rerender-A-Video and TokenFlow (2023) showed that propagating attention features between frames solves this elegantly. The practical frontier is real-time processing for live video — current methods are offline and slow, but the creative potential for film post-production, video editing, and content repurposing is enormous.

1 dataset · 0 results

Zero-Shot Image Classification

Zero-shot image classification uses vision-language models to categorize images into arbitrary classes never seen during training — you describe categories in text, and the model matches. CLIP (2021) proved this was viable at scale by training on 400M image-text pairs, achieving competitive accuracy on ImageNet without ever seeing a labeled example. SigLIP, EVA-CLIP, and MetaCLIP have since pushed zero-shot ImageNet accuracy above 83%, closing the gap with supervised models. The paradigm shift this represents is profound: instead of collecting labeled datasets for every new domain, you just describe what you're looking for.

1 dataset · 0 results
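The CLIP recipe described above reduces to a cosine-similarity argmax between one image embedding and one text embedding per class. A sketch with toy embeddings and my own helper names (a real system would produce the vectors with the paired image and text encoders):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def zero_shot_classify(image_emb, class_prompts):
    """CLIP-style zero-shot classification: pick the class whose text
    embedding is most similar to the image embedding.
    class_prompts maps class name -> text embedding."""
    return max(class_prompts,
               key=lambda name: cosine(image_emb, class_prompts[name]))

prompts = {"cat": [0.9, 0.1], "dog": [0.1, 0.9]}
print(zero_shot_classify([1.0, 0.2], prompts))  # → "cat"
```

Changing the label set is just changing the dictionary keys, which is exactly why no retraining is needed for a new domain.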

Optical Character Recognition

CodeSOTA Polish (2025)
  25.3 (bleu-4) · GPT-4o
IMPACT-PSNC (2012)
KITAB-Bench (2024)
  0.79 (cer) · PaddleOCR
PolEval 2021 OCR (2021)
SROIE (2019)
ThaiOCRBench (2024)
  0.84 (ted-score) · Claude-Sonnet-4
aapd (2020)
  72.9 (f1) · KD-LSTMreg
amazon (2020)
  94.31 (accuracy) · ApproxRepSet
  0.81 (average-f1) · Siamese_MHCA_SA
  47.15 (rouge-1) · DeepPyramidion
  19.99 (rouge-2) · DeepPyramidion
  70.9 (accuracy) · ELSC
ba (2020)
  51.8 (accuracy) · ELSC
  47.12 (rouge-1) · BigBird-Pegasus
  99.59 (accuracy) · MPAD-path
bc8 (2020)
  56.06 (evaluation-macro-f1) · BioRex+Directionality
  28.11 (wer) · PyLaia (human transcriptions + random split)
  1.73 (cer) · StackMix+Blots
  5.7 (far) · Siamese_MultiHeadCrossAttention_SoftAttention (Siamese_MHCA_SA)
  33.88 (rouge-2) · GCN Hybrid
  96.85 (accuracy) · REL-RWMD k-NN
  31.1 (ndcg-20) · XLNet
  48.18 (rouge-1) · Scrambled code + broken (alter)
  26.79 (smoothed-bleu-4) · CodeBERT (MLM)
  21.87 (smoothed-bleu-4) · CodeTrans-MT-Large
  25.61 (smoothed-bleu-4) · Transformer
  26.23 (smoothed-bleu-4) · CodeTrans-MT-Base
  20.39 (smoothed-bleu-4) · CodeTrans-MT-Base
  15.26 (smoothed-bleu-4) · CodeTrans-MT-Base
  85.9 (top-1-accuracy) · Q-SENN
  46.73 (p-10) · Query-doc RobeCzech (Roberta-base)
dart (2020)
  97.6 (factspotter) · FactT5B
  2.5 (cer) · StackMix+Blots
  0.86 (percentage-correct) · JDeskew
  60.1 (relation-f1) · REXEL
dwie (2020)
  0.73 (f1) · VaeDiff-DocRE
e2e (2020)
  70.8 (rouge-l) · HTLM (fine-tuning)
ephoie (2020)
  99.21 (average-f1) · LayoutLMv3
  84.41 (accuracy) · Bert
  27.54 (sequence-error) · STREET
hkr (2020)
  3.49 (cer) · StackMix+Blots
hoc (2020)
  88.1 (f1) · BioLinkBERT (large)
  53.5 (rouge-1) · LexRank (query: method + article + steps titles)
  39.6 (rouge-1) · LexRank (query: step title)
  95.38 (accuracy) · ChuLo
  89.09 (bleu) · I2L-NOPOOL
  28.6 (test-wer) · GFCN
iam-b (2020)
  3.77 (cer) · StackMix+Blots
iam-d (2020)
  3.01 (cer) · StackMix+Blots
  96.55 (weighted-average-f1-score) · DiT-L (Cascade)
  99.4 (accuracy) · DTrOCR 105M
  93.5 (accuracy) · DTrOCR 105M
  88.86 (bleu) · I2L-STRIPS
imdb-m (2020)
  54.8 (accuracy) · Document Classification Using Importance of Sentences
  75.8 (f-measure-full-lexicon) · DeepSolo (ViTAEv2-S, TextOCR)
iris (2020)
  97.7 (accuracy) · ELSC
jaffe (2020)
  98.6 (accuracy) · ELSC
  18.5 (test-wer) · GFCN
lun (2020)
  64.4 (accuracy) · ChuLo
  93.32 (accuracy) · XLMft UDA
  96.05 (accuracy) · XLMft UDA
  96.95 (accuracy) · XLMft UDA
  76.02 (accuracy) · MultiFiT, pseudo
  69.57 (accuracy) · MultiFiT, pseudo
  89.7 (accuracy) · XLMft UDA
  96.8 (accuracy) · XLMft UDA
  75.45 (accuracy) · BiLSTM (Europarl)
mpqa (2020)
  89.81 (accuracy) · MPAD-path
  82.86 (nmi) · DnC-SC
  96 (accuracy) · ELSC
  0.79 (f1) · VaeDiff-DocRE
  16.5 (wer) · HTR-VT (line-level)
  21.1 (test-wer) · Span
recipe (2020)
  59.06 (accuracy) · ApproxRepSet
  97.17 (accuracy) · ApproxRepSet
  75 (accuracy) · BilBOWA
  86.5 (accuracy) · BilBOWA
  92.7 (accuracy) · Biinclusion (Euro500kReuters)
  84.4 (accuracy) · Biinclusion (Euro500kReuters)
  55.88 (content-selection-f1) · HierarchicalEncoder + NR + IR
  3.65 (cer) · StackMix+Blots
  84.9 (accuracy) · CCD-ViT-Small
  82 (f1-micro) · SPECTER
  88.7 (f1-micro) · SciNCL
  129.1 (fps) · FAST-T-512
simara (2020)
  14.79 (wer) · DAN
stdw (2020)
  0.78 (ap) · RetinaNet
  64.4 (iou) · IM3D
sut (2020)
  86 (accuracy) · CNN
  93.1 (test) · ARTEMIS-DA
  84.8 (iou) · CCD-ViT-Small
  21.84 (average-psnr-db) · CCD-ViT-Small
  84 (accuracy) · Optimized Text CNN
  72.6 (accuracy) · ApproxRepSet
  88.68 (recall) · ContourNet [69]
  53.4 (accuracy) · ELSC
  55.6 (bleu) · HTLM (fine-tuning)
  65.4 (bleu) · HTLM (fine-tuning)
  48.4 (bleu) · HTLM (fine-tuning)
  56.16 (parent) · MBD
  31.37 (rouge-l) · DOCmT5
wine (2020)
  75.8 (accuracy) · ELSC
  86.07 (accuracy) · HDLTex
  76.58 (accuracy) · HDLTex
  91.28 (accuracy) · ConvTextTM
  69.4 (accuracy) · KD-LSTMreg

Scene Text Detection

CTW1500 (2019)
  88.5 (precision) · DBNet++ (ResNet-50) (1024)
ICDAR 2015 (2015)
  93.96 (precision) · TextFuseNet (ResNeXt-101)
ICDAR 2019 ArT (2019)
  82.65 (f-measure) · pil_maskrcnn
Total-Text (2017)
  152.8 (fps) · FAST-T-448
Union14M (2023)
  70.8 (accuracy) · CLIP4STR-B
  81.9 (1-1-accuracy) · CLIP4STR-L
  86.4 (accuracy) · CLIP4STR-L (DataComp-1B)
  93.36 (f-measure) · BDN
  98.4 (accuracy) · TrOCR-base 334M
  84.42 (precision) · PMTD*
  137.2 (fps) · FAST-T-512

Scene Text Recognition

cute80 (2020)
  99.7 (accuracy) · CLIP4STR-L (DataComp-1B)
host (2020)
  82.7 (1-1-accuracy) · CLIP4STR-L
ic13 (2020)
  97.8 (accuracy) · ABINet-LV+TPS++
  97.1 (accuracy) · Yet Another Text Recognizer
iiit5k (2020)
  99.6 (accuracy) · DTrOCR 105M
msda (2020)
  42 (accuracy) · MetaSelf-Learning
svt (2020)
  99.1 (accuracy) · CLIP4STR-H (DFN-5B)
svt-p (2020)
  89.6 (accuracy) · ABINet-LV+TPS++
svtp (2020)
  98.6 (accuracy) · DTrOCR 105M
  92.2 (accuracy) · CLIP4STR-L (DataComp-1B)
wost (2020)
  90.9 (1-1-accuracy) · CLIP4STR-H (DFN-5B)

Document Layout Analysis

d4la (2020)
  70.72 (map) · DoPTA
  0.98 (table) · DETR
  83.4 (class-average-iou) · CV-Group

Document Parsing

OmniDocBench (2024)
  97.5 (layout-map) · MinerU 2.5
olmOCR-Bench (2024)
  99.9 (base) · Chandra v0.1.0

Document Image Classification

aip (2020)
  83.4 (top-1-accuracy-verb) · ResNet-RS (ResNet-200 + RS training tricks)
  97.62 (accuracy) · Pixel-level RC
  89.54 (accuracy) · PCGAN-CHAR
  96.68 (accuracy) · PCGAN-CHAR
  98.43 (accuracy) · PCGAN-CHAR
  97.7 (accuracy) · EAML
  95.57 (accuracy) · DocXClassifier-L

General OCR Capabilities

CC-OCR (2024)
  83.25 (multi-scene-f1) · Gemini 1.5 Pro
MME-VideoOCR (2024)
  73.7 (total-accuracy) · Gemini 2.5 Pro
OCRBench v2 (2024)
  62.2 (overall-zh-private) · Gemini 2.5 Pro
reVISION (2025)

Handwriting Recognition

CHURRO-DS (2024)
  82.3 (printed-levenshtein) · CHURRO (3B)
IAM (1999)
  23.2 (wer) · Start, Follow, Read
Polish EMNIST Extension (2020)
RIMES (2011)
  96.8 (accuracy) · AKHCRNet
kohtd (2020)
  8.36 (cer) · Bluche

Table Recognition

95.46 (f-measure) · Proposed System (with post-processing)
97.88 (teds-struct) · Multi-Task Learning Model
98.35 (teds-simple-samples) · Re0
91.87 (teds-simple-samples) · EDD
wtw (2020)
  78.9 (f1) · StrucTexTv2 (small)

Object Detection

COCO (2014)
  66 (mAP) · Co-DETR (Swin-L)
LVIS v1.0 (2019)
  71.4 (box-ap) · DINO-X
Pascal VOC 2012 (2012)
  80 (mAP-coco-pretrain) · SSD512 (VGG-16)

Image Classification

CIFAR-10 (2009)
  99.1 (accuracy) · DeiT-B Distilled
CIFAR-100 (2009)
  94.55 (accuracy) · ViT-H/14
ImageNet-1K (2012)
  91 (top-1-accuracy) · CoCa (finetuned)
ImageNet-V2 (2019)
  84 (top-1-accuracy) · Swin Transformer V2 Large

Document Understanding

DocLayNet (2022)
  84.1 (mAP) · DocFormerv2-Large
FUNSD (2019)

Semantic Segmentation

ADE20K (2016)
  62.9 (mIoU) · InternImage-H
Cityscapes (2016)

Depth Estimation

Zero-Shot Object Detection

Image Feature Extraction

ImageNet kNN (2021)

Image-to-Image

Set5 (2012)

Image-to-Video

Keypoint Detection

Mask Generation

SA-1B (2023)

Text-to-3D

Text-to-Video

Unconditional Image Generation

Video Classification

Video-to-Video

DAVIS (2016)

Zero-Shot Image Classification

Honest Takes

Classification and clean-doc OCR are solved. Move on.

91% top-1 on ImageNet. Real-time detection at 55+ AP in <5ms. Monocular depth is production-ready. Stop optimising saturated benchmarks and focus on your actual domain gap.

Zero-shot is a starting point, not an endpoint

DINO-X gets 56 AP zero-shot on COCO. A fine-tuned YOLO26 will beat it on your specific domain every time. Use zero-shot for labelling and prototyping, then train a specialist for production.

The bottleneck is data and deployment, not models

Foundation models are good enough. The real work is getting labelled data for your domain (industrial defects, medical images, satellite), then quantising and distilling for your hardware.

3D vision is still 5 years behind 2D

Depth maps look cool in demos. In production, you need multi-view or LiDAR for anything safety-critical. No single foundation model does 3D as well as DINOv2 does 2D features.

FID scores for image generation are meaningless

FID doesn't capture what humans care about — coherence, prompt following, aesthetics. FLUX.1 'feels' better than models with lower FID. Trust human evals, not automated metrics.


Need help choosing?

We benchmark models on your actual data. Same methodology as CodeSOTA, your domain, your hardware constraints.


