Video Classification

Video classification, the task of recognizing actions and events in clips, extends image understanding into the temporal domain: models must reason about motion, context, and temporal ordering. The field evolved from hand-crafted features (HOG, optical flow) through 3D CNNs (C3D, I3D) to video transformers such as TimeSformer and VideoMAE, which split a clip into spatiotemporal tokens. Top-1 accuracy on Kinetics-400 now exceeds 88%, but the harder open problem is long-form video understanding, where events unfold over minutes rather than seconds. Video classification is essential for content moderation, sports analytics, and security applications.
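To make the "spatiotemporal tokens" idea concrete, here is a minimal sketch of how many tokens a clip yields when it is cut into non-overlapping tubelets (a patch spanning a few frames). The function name, the clip sizes, and the tubelet shapes are illustrative assumptions, not the configuration of any model listed on this page.

```python
def count_spatiotemporal_tokens(frames, height, width, tubelet=(2, 16, 16)):
    """Count the tokens produced by cutting a clip into non-overlapping tubelets.

    `tubelet` is (frames, pixels high, pixels wide) per token. The default
    is an assumed example value, not taken from a specific model.
    """
    t, h, w = tubelet
    if frames % t or height % h or width % w:
        raise ValueError("clip dimensions must be divisible by the tubelet size")
    return (frames // t) * (height // h) * (width // w)


# Hypothetical clip: 16 frames at 224x224, tubelets of 2 frames x 16 x 16 pixels.
tokens = count_spatiotemporal_tokens(16, 224, 224)
print(tokens)       # 8 * 14 * 14 = 1568
# Full self-attention compares every token with every other token.
print(tokens ** 2)  # 2458624 pairwise interactions per attention layer
```

Because the token count grows linearly with clip length and full attention grows quadratically with the token count, this arithmetic also hints at why minute-long videos are harder to handle than short clips.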

Datasets: 3
Results: 4
Canonical metric: top-1-accuracy

Canonical Benchmark

Kinetics-400

Human action recognition across 400 action classes

Primary metric: top-1-accuracy
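For reference, top-1 accuracy is the fraction of clips whose highest-scoring predicted class matches the ground-truth label. The sketch below is a plain-Python illustration; the function name and the toy scores are assumptions for demonstration and are not part of any official Kinetics-400 evaluation script.

```python
def top1_accuracy(scores, labels):
    """Return the fraction of clips whose top-scoring class equals the label.

    scores: one list of per-class scores per clip.
    labels: the integer ground-truth class index for each clip.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one clip is required")
    correct = 0
    for clip_scores, label in zip(scores, labels):
        # Index of the highest score is the model's top-1 prediction.
        predicted = max(range(len(clip_scores)), key=clip_scores.__getitem__)
        correct += predicted == label
    return correct / len(labels)


# Toy example with 3 classes and 4 clips (made-up numbers).
scores = [
    [0.1, 0.7, 0.2],  # predicts class 1
    [0.6, 0.3, 0.1],  # predicts class 0
    [0.2, 0.2, 0.6],  # predicts class 2
    [0.5, 0.4, 0.1],  # predicts class 0
]
labels = [1, 0, 1, 0]
print(top1_accuracy(scores, labels))  # 0.75
```

Leaderboard values such as those on this page report this fraction as a percentage.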
Top 10

Leading models on Kinetics-400.

Rank  Model                top-1  Year  Source
1     InternVideo2         92.1   2024  paper
2     VideoMAE V2 (ViT-g)  90.0   2023  paper
3     ViViT-H              84.9   2021  paper
4     TimeSformer-L        80.7   2021  paper

All datasets

3 datasets tracked for this task.

Run Inference

Looking to run a model? HuggingFace hosts inference for this task type.
