Video understanding asks models to reason over temporal sequences: answering questions, generating summaries, or detecting events across minutes or hours of footage. Early approaches such as VideoBERT and TimeSformer processed short clips, but Gemini 1.5 Pro's million-token context (2024) enabled reasoning over hour-long videos natively, and GPT-4o brought real-time video comprehension. The core bottleneck remains temporal reasoning at scale: models can describe individual frames well but struggle to track causal chains, count repetitions, or understand temporal ordering across long sequences. Benchmarks such as Video-MME and EgoSchema are pushing evaluation beyond simple recognition toward genuine temporal understanding.
MVBench: a multi-task video understanding benchmark with 20 temporal reasoning tasks.
Leading models on MVBench:
| # | Model | Accuracy (%) | Year | Source |
|---|---|---|---|---|
| ★ | Qwen3.5-Omni-Plus | 79.0 | 2026 | paper ↗ |
| 2 | Qwen3.5-397B-A17B | 77.6 | 2026 | paper ↗ |
| 3 | Qwen3.5-122B-A10B | 76.6 | 2026 | paper ↗ |
| 4 | Qwen3-VL-235B-A22B-Instruct | 76.5 | 2025 | paper ↗ |
| 5 | Qwen3.6-27B | 75.5 | 2026 | paper ↗ |
| 6 | LongCat-Flash-Omni | 75.2 | 2025 | paper ↗ |
| 7 | Qwen3-VL-235B-A22B-Thinking | 75.2 | 2025 | paper ↗ |
| 8 | Qwen3.5-35B-A3B | 74.8 | 2026 | paper ↗ |
| 9 | Qwen3.6-35B-A3B | 74.6 | 2026 | paper ↗ |
| 10 | Qwen3.5-27B | 74.6 | 2026 | paper ↗ |
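
The page does not say how a single accuracy figure is derived from the 20 tasks. As a minimal sketch, assuming the reported number is an unweighted mean of per-task accuracies (an assumption, not something stated here), the aggregation could look like the following. The function name `macro_accuracy` and the task labels in the sample data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def macro_accuracy(records):
    """Return (per-task accuracy, unweighted mean across tasks), both in percent.

    `records` is an iterable of (task, predicted_answer, gold_answer) tuples.
    Assumption: every task counts equally, regardless of its question count.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        correct[task] += int(predicted == gold)
    if not total:
        raise ValueError("no records to score")
    per_task = {task: 100.0 * correct[task] / total[task] for task in total}
    overall = sum(per_task.values()) / len(per_task)
    return per_task, overall


# Hypothetical multiple-choice results for two illustrative tasks.
records = [
    ("action_sequence", "A", "A"),
    ("action_sequence", "B", "C"),
    ("object_count", "D", "D"),
    ("object_count", "B", "B"),
]
per_task, overall = macro_accuracy(records)
print(per_task)
print(overall)
```

If tasks differ in size, a micro-average (total correct over total questions) gives a different number, so it is worth checking which convention a given paper uses before comparing scores across rows.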
2 datasets tracked for this task.