Multimodalimage-text-to-video

Image-Text-to-Video

Image-text-to-video is generative AI's hardest unsolved frontier — animating a still image according to a text prompt while maintaining temporal coherence and physical plausibility. Stable Video Diffusion (2023) and Runway Gen-2 showed early promise, Sora (2024) raised the bar dramatically with minute-long physically consistent clips, and Kling and Veo 2 pushed quality further. The fundamental challenge is that video generation requires implicit world models: objects must persist, lighting must evolve consistently, and motion must obey approximate physics across dozens of frames. Evaluation is still largely human-judged, with FVD and CLIP-temporal scores poorly correlating with perceived quality.

1
Datasets
0
Results
composite
Canonical metric
Canonical Benchmark

VideoBench

Evaluates instruction-guided video generation from image+text

Primary metric: composite
View full leaderboard

Top 10

Leading models on VideoBench.

No results yet. Be the first to contribute.

All datasets

1 dataset tracked for this task.

Related tasks

Other tasks in Multimodal.

Run Inference

Looking to run a model? HuggingFace hosts inference for this task type.

HuggingFace