Codesota · Tasks · Video Understanding
Multimodal · video-text-to-text

Video Understanding.

Video understanding asks models to reason over temporal sequences: answering questions, generating summaries, or detecting events across minutes or hours of footage. Early approaches such as VideoBERT and TimeSformer processed short clips. Gemini 1.5 Pro's million-token context (2024) enabled reasoning over hour-long videos natively, and GPT-4o brought real-time video comprehension. The core bottleneck remains temporal reasoning at scale: models can describe individual frames well but struggle to track causal chains, count repetitions, or recover the order of events across long sequences. Benchmarks such as Video-MME and EgoSchema are pushing evaluation beyond simple recognition toward genuine temporal understanding.

Datasets: 2 · Results: 44 · Canonical metric: accuracy
§ 02 · Canonical benchmark

The reference dataset.

MVBench

Multi-task video understanding with 20 temporal reasoning tasks

Primary metric: accuracy
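
As a rough illustration of how an accuracy figure over a multi-task benchmark can be aggregated, the sketch below scores each task separately and then takes the unweighted mean across tasks. This is a minimal sketch under assumptions, not the benchmark's official scoring script: the record layout, the function names, and the task names in the toy data are all hypothetical, and averaging per task (rather than pooling all questions) is an assumption.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task: accuracy in percent}.

    `records` is an iterable of (task, predicted, gold) tuples.
    This layout is an assumption made for illustration only.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        correct[task] += int(predicted == gold)
    return {task: 100.0 * correct[task] / total[task] for task in total}


def overall_accuracy(records):
    """Unweighted mean of per-task accuracies (assumed aggregation)."""
    scores = per_task_accuracy(records)
    return sum(scores.values()) / len(scores)


# Hypothetical toy data: two made-up tasks, two questions each.
records = [
    ("action_sequence", "A", "A"),
    ("action_sequence", "B", "C"),
    ("object_count", "D", "D"),
    ("object_count", "A", "A"),
]

print(per_task_accuracy(records))
print(overall_accuracy(records))
```

When every task has the same number of questions, the per-task mean and the pooled accuracy coincide. They diverge only when task sizes differ, which is why the aggregation choice is worth stating alongside a reported score.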
§ 03 · Top 10

Leading models.

Leading models on MVBench.

#    Model                          accuracy   Year   Source
1    Qwen3.5-Omni-Plus              79.0       2026   paper
2    Qwen3.5-397B-A17B              77.6       2026   paper
3    Qwen3.5-122B-A10B              76.6       2026   paper
4    Qwen3-VL-235B-A22B-Instruct    76.5       2025   paper
5    Qwen3.6-27B                    75.5       2026   paper
6    LongCat-Flash-Omni             75.2       2025   paper
7    Qwen3-VL-235B-A22B-Thinking    75.2       2025   paper
8    Qwen3.5-35B-A3B                74.8       2026   paper
9    Qwen3.6-35B-A3B                74.6       2026   paper
10   Qwen3.5-27B                    74.6       2026   paper


§ 04 · All datasets

Tracked datasets.

2 datasets tracked for this task.

MVBench (canonical) · 20 results · accuracy · Top: Qwen3.5-Omni-Plus 79.0
Video-MME · 24 results · accuracy · Top: Qwen3.6-27B 87.7
§ 05 · Related tasks

Other tasks in Multimodal.

Any-to-Any · Audio-Text-to-Text · Cross-Modal Retrieval · Image Captioning · Image-Text-to-Image · Image-Text-to-Text · Image-Text-to-Video · Text-to-Image Generation