
Text to Video

Generate videos from text descriptions. The frontier of generative AI for content creation.

How Text to Video Works

A technical deep-dive into video generation. From diffusion models to Sora and beyond.

1. Generation Approaches

Three main paradigms for generating video from text.

Temporal Diffusion
Extend image diffusion to video with added temporal layers.
  • Pros: leverages image priors; easier to train
  • Cons: temporal consistency issues

Diffusion Transformer (DiT)
Transformer-based diffusion over spacetime tokens.
  • Pros: scales better; better motion
  • Cons: very expensive to train

Autoregressive
Generate video frame by frame.
  • Pros: coherent long videos
  • Cons: slow; errors accumulate over time
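The difference between the first two paradigms is largely where attention is applied. Below is a minimal NumPy sketch of factorized spatial-then-temporal attention, the standard trick for extending image diffusion to video; the identity projections and all shapes are illustrative simplifications, not any specific model's layers.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x):
    # Toy single-head self-attention with identity Q/K/V projections,
    # just to show where attention is applied.
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def factorized_st_attention(video):
    # video: (T frames, N spatial tokens per frame, D channels)
    T, N, D = video.shape
    # 1. Spatial attention: tokens attend within each frame,
    #    exactly as in image diffusion.
    spatial = attention(video)                        # (T, N, D)
    # 2. Temporal attention: each spatial location attends
    #    across frames, enforcing frame-to-frame coherence.
    temporal = attention(spatial.transpose(1, 0, 2))  # (N, T, D)
    return temporal.transpose(1, 0, 2)                # back to (T, N, D)

out = factorized_st_attention(np.random.randn(8, 16, 32))
print(out.shape)  # (8, 16, 32)
```

Factorizing the two axes keeps cost at O(T·N²) + O(N·T²) instead of O((T·N)²) for full spacetime attention, which is why early temporal-diffusion models adopted it.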

Diffusion Transformer (Sora-style)

Pipeline: Text → text encoder (CLIP/T5) + noise in latent space → DiT (transformer denoising) → VAE decoder → video frames

Sora treats video as spacetime patches, enabling long, coherent generation.
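A NumPy sketch of what spacetime patchification might look like: a VAE-encoded clip is cut into small temporal-spatial blocks, each flattened into one token of the sequence a DiT would denoise. The patch sizes and latent shape here are illustrative assumptions, not Sora's actual values.

```python
import numpy as np

def spacetime_patches(latent, pt=2, ph=2, pw=2):
    """Split a latent video (T, H, W, C) into spacetime patches of
    size (pt, ph, pw), each flattened into one transformer token."""
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Group the three patch-index axes together, then the patch contents
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, pt * ph * pw * C)  # (num_tokens, token_dim)

latent = np.random.randn(16, 32, 32, 4)  # e.g. a VAE-encoded clip
tokens = spacetime_patches(latent)
print(tokens.shape)  # (2048, 32): 8*16*16 tokens of dim 2*2*2*4
```

Because the token count is just (T/pt)·(H/ph)·(W/pw), the same transformer can handle videos of varying duration, resolution, and aspect ratio, which is what enables long, coherent generation.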

2. Model Evolution

The rapid evolution of video generation models.

Model | Year, Company | Architecture | Notes
Make-A-Video | 2022, Meta | Diffusion | Text-to-video built on an image model
Imagen Video | 2022, Google | Diffusion | Cascaded generation
Gen-1 | 2023, Runway | Diffusion | Video-to-video
Pika 1.0 | 2023, Pika | Diffusion | Consumer-friendly
Stable Video | 2023, Stability | Diffusion | Open-source base model
Gen-2 | 2023, Runway | Diffusion | Text-to-video
Sora | 2024, OpenAI | DiT | 60s videos, physics understanding
Kling | 2024, Kuaishou | DiT | Long videos, Chinese market
Veo 2 | 2024, Google | Diffusion | 4K, 2-minute clips
Sora Turbo | 2024, OpenAI | DiT | Faster, more accessible
Sora (OpenAI): best physics understanding; 60s clips with realistic motion
Veo 2 (Google): best resolution (4K); 2-minute photorealistic clips
Runway Gen-3 (Runway): most accessible; fast, good quality, public API
3. Key Challenges

What makes video generation hard.

  • Temporal consistency: objects should look the same across frames
  • Motion quality: natural, physics-aware movement
  • Prompt following: accurately represent the text prompt
  • Resolution and length: high-resolution, long-duration videos
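Temporal consistency can at least be approximated numerically. The sketch below uses a crude pixel-space metric, mean cosine similarity between consecutive frames; real evaluations compare frames in a feature space (e.g. CLIP embeddings), but the idea is the same. The function name and thresholds are illustrative.

```python
import numpy as np

def temporal_consistency(frames):
    """Mean cosine similarity between consecutive frames.
    frames: (T, H, W, C) array. 1.0 means identical frames;
    lower values mean more frame-to-frame change (flicker)."""
    flat = frames.reshape(len(frames), -1).astype(float)
    flat /= np.linalg.norm(flat, axis=1, keepdims=True)
    sims = (flat[:-1] * flat[1:]).sum(axis=1)
    return sims.mean()

static = np.ones((8, 16, 16, 3))      # identical frames -> score 1.0
noise = np.random.rand(8, 16, 16, 3)  # uncorrelated frames -> lower score
print(temporal_consistency(static))
print(temporal_consistency(noise))
```

Note that a perfectly static video also scores 1.0, so consistency metrics are always paired with motion-quality metrics in practice.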
4. API Comparison

Available video generation APIs.

Model | Company | Duration | Resolution | Price | Access
Sora | OpenAI | 60s | 1080p | $$$ | Limited
Runway Gen-3 | Runway | 10s | 1080p | $$ | Open
Pika 2.0 | Pika | 4s | 1080p | $ | Open
Kling | Kuaishou | 120s | 1080p | $ | Open
Luma Dream Machine | Luma | 5s | 720p | $ | Open
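The comparison above can be queried programmatically when choosing a provider. A hypothetical helper: the `pick` function and the numeric price-tier encoding are illustrative, not part of any vendor's API.

```python
# Data from the comparison table; price tiers encode $=1, $$=2, $$$=3.
MODELS = [
    # (name, company, max_seconds, resolution, price_tier, access)
    ("Sora", "OpenAI", 60, "1080p", 3, "Limited"),
    ("Runway Gen-3", "Runway", 10, "1080p", 2, "Open"),
    ("Pika 2.0", "Pika", 4, "1080p", 1, "Open"),
    ("Kling", "Kuaishou", 120, "1080p", 1, "Open"),
    ("Luma Dream Machine", "Luma", 5, "720p", 1, "Open"),
]

def pick(min_seconds=0, max_price=3, open_only=True):
    """Return names of models meeting the duration, price, and
    access constraints."""
    return [name for (name, _co, secs, _res, price, access) in MODELS
            if secs >= min_seconds and price <= max_price
            and (not open_only or access == "Open")]

print(pick(min_seconds=10, max_price=2))  # ['Runway Gen-3', 'Kling']
```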
5. Code Examples

Get started with video generation.

Runway Gen-3 (popular API)
Install: pip install runwayml
import time

import runwayml

# Initialize the Runway client (reads the API key from the environment)
client = runwayml.RunwayML()

# Start a generation task; Gen-3 Alpha Turbo animates a source image,
# guided by the text prompt
task = client.image_to_video.create(
    model='gen3a_turbo',
    prompt_image='input.jpg',  # source image for image-to-video
    prompt_text='A serene lake at sunset with gentle ripples',
    duration=10,  # seconds
    ratio='16:9',
)

# Poll until the task finishes
while task.status not in ['SUCCEEDED', 'FAILED']:
    time.sleep(10)
    task = client.tasks.retrieve(task.id)

# Fetch the result
if task.status == 'SUCCEEDED':
    video_url = task.output[0]
    # Download video from URL
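The final download step can be completed with the standard library alone. A sketch: the `download_video` helper is illustrative, and the URL is simply whatever the task's output field returns.

```python
import urllib.request

def download_video(url, path="output.mp4"):
    """Stream the generated clip from its result URL to a local file."""
    with urllib.request.urlopen(url) as resp, open(path, "wb") as f:
        f.write(resp.read())
    return path
```

Result URLs from hosted APIs are typically short-lived signed links, so download promptly after the task succeeds.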

Quick Reference

For Best Quality
  • Sora (limited access)
  • Veo 2 (Google)
For Public Access
  • Runway Gen-3
  • Pika 2.0
  • Luma Dream Machine
For Open Source
  • CogVideoX
  • Stable Video Diffusion
  • Open-Sora

Use Cases

  • Marketing content
  • Storyboarding
  • Synthetic training data
  • Creative exploration

Architectural Patterns

Latent Video Diffusion

Extend image diffusion to video with temporal layers.

Pros:
  • High quality
  • Leverages image diffusion advances
Cons:
  • Slow generation
  • VRAM-intensive

Autoregressive Video

Generate frames sequentially as tokens.

Pros:
  • Long videos possible
  • Controllable
Cons:
  • Quality still maturing
  • Slow

Image-to-Video

Animate a generated or given image.

Pros:
  • More controllable
  • Can use existing images
Cons:
  • Limited motion
  • Dependence on the first frame
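One common way to manage the first-frame dependency is to keep the conditioning frame visible at every denoising step. Below is a NumPy sketch of channel-wise concatenation conditioning, the scheme used by Stable Video Diffusion; the shapes and helper name are illustrative.

```python
import numpy as np

def condition_on_first_frame(noisy_latents, image_latent):
    """Image-to-video conditioning sketch: the encoded input image is
    concatenated channel-wise to every noisy frame latent, so the
    denoiser always sees the frame it must stay consistent with."""
    T = noisy_latents.shape[0]
    cond = np.repeat(image_latent[None], T, axis=0)  # copy to all frames
    return np.concatenate([noisy_latents, cond], axis=-1)

noisy = np.random.randn(14, 32, 32, 4)  # 14 noisy frame latents (T,H,W,C)
image = np.random.randn(32, 32, 4)      # VAE-encoded first frame
print(condition_on_first_frame(noisy, image).shape)  # (14, 32, 32, 8)
```

The denoiser's input channel count doubles, so the first convolution or patch-embedding layer must be widened accordingly when adapting a text-to-video backbone.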

Implementations

API Services

Sora (OpenAI, API): state-of-the-art quality; up to 1-minute videos.

Runway Gen-3 (Runway, API): production-ready; good motion, 10-second clips.

Kling (Kuaishou, API): strong motion coherence; up to 2 minutes.

Open Source

CogVideoX (Apache 2.0, open source): best open-source model; 5B and 2B variants.

Mochi 1 (Apache 2.0, open source): high-quality open model with good motion.

Benchmarks

Quick Facts

  • Input: text
  • Output: video
  • Implementations: 2 open source, 3 API
  • Patterns: 3 approaches

Have benchmark data?

Help us track the state of the art for text to video.
