Image Search with CLIP
The breakthrough that unified vision and language. Search images with text, classify without training.
What is CLIP?
CLIP (Contrastive Language-Image Pre-training) is a neural network trained by OpenAI that learns to connect text and images in a shared embedding space.
Before CLIP, image classifiers could only recognize categories they were specifically trained on. Want to classify dogs? Train on dog images. Want to add cats? Retrain the whole model.
The Revolution
CLIP changed everything. It learns a shared space where text and images with the same meaning are close together. This means you can:
- Search images using natural language queries
- Classify images into any category without retraining
- Find the best caption for an image
How CLIP Learns: Contrastive Training
CLIP was trained on 400 million image-text pairs scraped from the internet. The training objective is elegant:
Contrastive Learning
- Positive pair: an image of a cat + "a photo of a cat" → push the embeddings close together
- Negative pair: an image of a cat + "a photo of a car" → push the embeddings far apart
In simplified PyTorch, one training step looks like this:

```python
import torch
import torch.nn.functional as F

for images, texts in dataloader:   # each batch holds matched image-text pairs
    image_embeds = F.normalize(image_encoder(images), dim=-1)
    text_embeds = F.normalize(text_encoder(texts), dim=-1)

    # Similarity matrix: diagonal = matching pairs (should be high),
    # off-diagonal = non-matching pairs (should be low)
    similarities = image_embeds @ text_embeds.T / temperature

    # Symmetric cross-entropy: each image should pick out its own text, and vice versa
    targets = torch.arange(len(images))
    loss = (F.cross_entropy(similarities, targets)
            + F.cross_entropy(similarities.T, targets)) / 2
```
After training, the image encoder and text encoder produce embeddings in the same shared space (512-dimensional for the ViT-B models). Similar concepts end up near each other, regardless of whether they came from text or images.
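As a quick sanity check of this property, you can compare a single image embedding against a matching and a non-matching caption. This is only a minimal sketch, assuming the OpenAI clip package and a hypothetical cat.jpg:

```python
# Minimal sketch: a matching caption should be closer to the image than a
# non-matching one. Assumes the OpenAI clip package and a hypothetical cat.jpg.
import clip, torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32")
image = preprocess(Image.open("cat.jpg")).unsqueeze(0)
texts = clip.tokenize(["a photo of a cat", "a photo of a car"])
with torch.no_grad():
    image_embed = model.encode_image(image)   # shape (1, 512)
    text_embeds = model.encode_text(texts)    # shape (2, 512)
sims = torch.nn.functional.cosine_similarity(image_embed, text_embeds)
print(sims)  # the matching caption ("a photo of a cat") should score noticeably higher
```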
Try It: Cross-Modal Search
This visualization shows CLIP's shared embedding space (projected to 2D). Circles are image embeddings, triangles are text embeddings. Type a query to see how text embeddings align with matching images.
Interactive demo: CLIP Shared Embedding Space. Try queries like "cat", "a photo of a dog", "vehicle", or "nature".
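A static version of this plot can be produced by projecting the joint embeddings down to two dimensions. A minimal sketch, assuming image_embeds and text_embeds are already-computed CLIP embeddings stored as NumPy arrays, and using scikit-learn's PCA (t-SNE or UMAP would work just as well):

```python
# Minimal sketch: 2D projection of CLIP's shared space for plotting.
# Assumes image_embeds and text_embeds are NumPy arrays of shape (n, d).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

points = PCA(n_components=2).fit_transform(np.vstack([image_embeds, text_embeds]))
n = len(image_embeds)

plt.scatter(points[:n, 0], points[:n, 1], marker="o", label="image embeddings")
plt.scatter(points[n:, 0], points[n:, 1], marker="^", label="text embeddings")
plt.legend()
plt.title("CLIP shared embedding space (PCA projection)")
plt.show()
```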
Zero-Shot Classification
Zero-shot means classifying images into categories the model has never seen during training. CLIP makes this possible through a clever trick:
How Zero-Shot Classification Works
1. Define your classes as text prompts: "a photo of a dog", "a photo of a cat", "a photo of a bird"
2. Encode all prompts to get text embeddings: each class becomes a vector in the shared space
3. Encode the image: the image becomes a vector in the same space
4. Find the closest text embedding: the class with the highest cosine similarity wins
In code, using the OpenAI clip package:

```python
import clip, torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32")
classes = ["dog", "cat", "bird"]
prompts = [f"a photo of a {c}" for c in classes]
with torch.no_grad():
    text_embeds = model.encode_text(clip.tokenize(prompts))
    image_embed = model.encode_image(preprocess(Image.open("photo.jpg")).unsqueeze(0))
# Normalize so dot products equal cosine similarity
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
image_embed = image_embed / image_embed.norm(dim=-1, keepdim=True)
similarities = image_embed @ text_embeds.T
predicted_class = classes[similarities.argmax().item()]
```
Cross-Modal Search
Text to Image Search
Find images that match a text description.
```python
query = "sunset over mountains"
query_embed = encode_text(query)               # text embedding in the shared space
scores = image_embeds @ query_embed            # one similarity score per image
top_images = [images[i] for i in scores.argsort()[::-1]]   # highest score first
```
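For more than a handful of images, it helps to precompute and normalize the image embeddings once, so that every query reduces to a single matrix multiply. A minimal sketch, assuming the OpenAI clip package and a hypothetical list of image file paths:

```python
import clip, torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32")

def embed_images(paths):
    """Precompute normalized image embeddings for a list of file paths."""
    batch = torch.stack([preprocess(Image.open(p)) for p in paths])
    with torch.no_grad():
        embeds = model.encode_image(batch)
    return embeds / embeds.norm(dim=-1, keepdim=True)

def search(query, image_embeds, paths, k=5):
    """Return the k best-matching (path, score) pairs for a text query."""
    with torch.no_grad():
        q = model.encode_text(clip.tokenize([query]))
    q = q / q.norm(dim=-1, keepdim=True)
    scores = (image_embeds @ q.T).squeeze(1)
    best = scores.argsort(descending=True)[:k]
    return [(paths[i], scores[i].item()) for i in best]
```

For very large collections, these normalized embeddings can also be dropped into an approximate nearest-neighbor index, but a plain matrix multiply is fine for thousands of images.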
Image to Text Search
Find text that best describes an image.
```python
image_embed = encode_image(img)                # image embedding in the shared space
scores = text_embeds @ image_embed             # similarity to each candidate caption
best_caption = captions[scores.argmax()]       # caption closest to the image
```
CLIP Variants
Since OpenAI released CLIP in 2021, the community has created improved variants with better performance, larger training data, or more efficient architectures.
| Model | Key Features | Use Case |
|---|---|---|
| OpenAI CLIP (original) | 400M pairs, ViT-B/32 to ViT-L/14 | Baseline, widely compatible |
| OpenCLIP (LAION) | 2B+ pairs, open weights, many sizes | Best open-source option |
| SigLIP (Google) | Sigmoid loss, better calibration | When you need probability scores |
| EVA-CLIP (BAAI) | Better initialization, 18B version | Maximum accuracy |
| MetaCLIP (Meta) | Curated data, balanced distribution | Reproducible research |
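Most variants expose a CLIP-compatible interface, so swapping one in is usually a matter of changing the model name and weights tag. A minimal sketch using the open_clip package (the laion2b_s34b_b79k tag is one of the LAION-2B checkpoints; check the OpenCLIP repository for the current list):

```python
import open_clip
import torch

# Load an OpenCLIP ViT-B/32 trained on LAION-2B (weights tag is an assumption;
# see the OpenCLIP repository for available pretrained tags)
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

with torch.no_grad():
    text_embeds = model.encode_text(tokenizer(["a photo of a dog", "a photo of a cat"]))
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
```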
Benchmark: ImageNet Zero-Shot
ImageNet zero-shot accuracy measures how well a model can classify 1000 ImageNet categories without any training on ImageNet. This is the standard CLIP benchmark.
ImageNet zero-shot top-1 accuracy. Higher is better. For comparison, a supervised ResNet-50 achieves ~76%.
Key Takeaways
1. CLIP creates a shared embedding space - text and images with the same meaning are close together.
2. Contrastive learning - push matching pairs close, push non-matching pairs apart.
3. Zero-shot classification - classify images into any categories using text prompts.
4. OpenCLIP and SigLIP - the best open-source alternatives to the original CLIP.