Codesota · Tasks · Image-Text-to-Video

Multimodal · image-text-to-video

Image-Text-to-Video.

Image-text-to-video is one of the hardest open problems in generative AI: the model must animate a still image according to a text prompt while maintaining temporal coherence and physical plausibility. Stable Video Diffusion (2023) and Runway Gen-2 showed early promise, Sora (2024) raised the bar with clips up to a minute long that stay largely physically consistent, and Kling and Veo 2 pushed quality further. The fundamental challenge is that video generation requires an implicit world model: objects must persist, lighting must evolve consistently, and motion must obey approximate physics across dozens of frames. Evaluation is still largely human-judged, because automated metrics such as FVD (Fréchet Video Distance) and CLIP-based temporal-consistency scores correlate poorly with perceived quality.
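To make the CLIP-based temporal score concrete, here is a minimal sketch of one common formulation: the mean cosine similarity between embeddings of consecutive frames. There is no single standardized definition, so treat this as an illustration rather than the metric any particular benchmark uses. The function names are hypothetical, and the per-frame embeddings are assumed to have been computed beforehand by an image encoder such as CLIP.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def temporal_consistency(frame_embeddings: Sequence[Sequence[float]]) -> float:
    """Mean cosine similarity between each frame and the next one.

    Assumption: frame_embeddings[i] is the image-encoder embedding of
    frame i (for example from CLIP), computed outside this function.
    """
    if len(frame_embeddings) < 2:
        raise ValueError("need at least two frames")
    sims = [
        cosine_similarity(prev, curr)
        for prev, curr in zip(frame_embeddings, frame_embeddings[1:])
    ]
    return sum(sims) / len(sims)
```

A score near 1 only says that adjacent frames look alike in embedding space. A completely static clip scores perfectly, which is one reason scores of this kind can diverge from human judgments of motion quality.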

Datasets: 1 · Results: 0 · Canonical metric: composite

§ 02 · Canonical benchmark

The reference dataset.

VideoBench

Evaluates instruction-guided video generation from image+text

Primary metric: composite

§ 03 · Top 10

Leading models.

Leading models on VideoBench.

No results yet. Be the first to contribute.

§ 04 · All datasets

Tracked datasets.

1 dataset tracked for this task.

0 results · composite

§ 05 · Related tasks

Other tasks in Multimodal.

Any-to-Any · Audio-Text-to-Text · Cross-Modal Retrieval · Image Captioning · Image-Text-to-Image · Image-Text-to-Text · Text-to-Image Generation · Video Understanding
