Image Search with CLIP
The breakthrough that unified vision and language. Search images with text, classify without training.
What is CLIP?
CLIP (Contrastive Language-Image Pre-training) is a neural network trained by OpenAI that learns to connect text and images in a shared embedding space.
Before CLIP, image classifiers could only recognize categories they were specifically trained on. Want to classify dogs? Train on dog images. Want to add cats? Retrain the whole model.
The Revolution
CLIP changed everything. It learns a shared space where text and images with the same meaning are close together. This means you can:
- Search images using natural language queries
- Classify images into any category without retraining
- Find the best caption for an image
How CLIP Learns: Contrastive Training
CLIP was trained on 400 million image-text pairs scraped from the internet. The training objective is elegant:
Contrastive Learning
- Positive pair: an image of a cat + "a photo of a cat" → push the embeddings close together
- Negative pair: an image of a cat + "a photo of a car" → push the embeddings far apart
In simplified PyTorch, one training step looks like this:

```python
import torch
import torch.nn.functional as F

for images, texts in dataloader:   # each batch holds matched image-text pairs
    image_embeds = F.normalize(image_encoder(images), dim=-1)
    text_embeds = F.normalize(text_encoder(texts), dim=-1)

    # Similarity matrix: diagonal = matching pairs (should be high),
    # off-diagonal = non-matching pairs (should be low)
    similarities = image_embeds @ text_embeds.T / temperature

    # Symmetric cross-entropy: each image should pick out its own text, and vice versa
    targets = torch.arange(len(images))
    loss = (F.cross_entropy(similarities, targets)
            + F.cross_entropy(similarities.T, targets)) / 2
```
After training, the image encoder and text encoder produce embeddings in the same shared space (512-dimensional for the ViT-B models). Similar concepts end up near each other, regardless of whether they came from text or images.
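As a quick sanity check of this property, you can compare a single image embedding against a matching and a non-matching caption. This is only a minimal sketch, assuming the OpenAI clip package and a hypothetical cat.jpg:

```python
# Minimal sketch: a matching caption should be closer to the image than a
# non-matching one. Assumes the OpenAI clip package and a hypothetical cat.jpg.
import clip, torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32")
image = preprocess(Image.open("cat.jpg")).unsqueeze(0)
texts = clip.tokenize(["a photo of a cat", "a photo of a car"])
with torch.no_grad():
    image_embed = model.encode_image(image)   # shape (1, 512)
    text_embeds = model.encode_text(texts)    # shape (2, 512)
sims = torch.nn.functional.cosine_similarity(image_embed, text_embeds)
print(sims)  # the matching caption ("a photo of a cat") should score noticeably higher
```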
Try It: Cross-Modal Search
This visualization shows CLIP's shared embedding space (projected to 2D). Circles are image embeddings, triangles are text embeddings. Type a query to see how text embeddings align with matching images.
Interactive demo: CLIP Shared Embedding Space. Try queries like "cat", "a photo of a dog", "vehicle", or "nature".
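A static version of this plot can be produced by projecting the joint embeddings down to two dimensions. A minimal sketch, assuming image_embeds and text_embeds are already-computed CLIP embeddings stored as NumPy arrays, and using scikit-learn's PCA (t-SNE or UMAP would work just as well):

```python
# Minimal sketch: 2D projection of CLIP's shared space for plotting.
# Assumes image_embeds and text_embeds are NumPy arrays of shape (n, d).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

points = PCA(n_components=2).fit_transform(np.vstack([image_embeds, text_embeds]))
n = len(image_embeds)

plt.scatter(points[:n, 0], points[:n, 1], marker="o", label="image embeddings")
plt.scatter(points[n:, 0], points[n:, 1], marker="^", label="text embeddings")
plt.legend()
plt.title("CLIP shared embedding space (PCA projection)")
plt.show()
```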
Zero-Shot Classification
Zero-shot means classifying images into categories the model has never seen during training. CLIP makes this possible through a clever trick:
How Zero-Shot Classification Works
1. Define your classes as text prompts: "a photo of a dog", "a photo of a cat", "a photo of a bird"
2. Encode all prompts to get text embeddings: each class becomes a vector in the shared space
3. Encode the image: the image becomes a vector in the same space
4. Find the closest text embedding: the class with the highest cosine similarity wins
In code, using the OpenAI clip package:

```python
import clip, torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32")
classes = ["dog", "cat", "bird"]
prompts = [f"a photo of a {c}" for c in classes]
with torch.no_grad():
    text_embeds = model.encode_text(clip.tokenize(prompts))
    image_embed = model.encode_image(preprocess(Image.open("photo.jpg")).unsqueeze(0))
# Normalize so dot products equal cosine similarity
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
image_embed = image_embed / image_embed.norm(dim=-1, keepdim=True)
similarities = image_embed @ text_embeds.T
predicted_class = classes[similarities.argmax().item()]
```
Cross-Modal Search
Text to Image Search
Find images that match a text description.
```python
query = "sunset over mountains"
query_embed = encode_text(query)               # text embedding in the shared space
scores = image_embeds @ query_embed            # one similarity score per image
top_images = [images[i] for i in scores.argsort()[::-1]]   # highest score first
```
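For more than a handful of images, it helps to precompute and normalize the image embeddings once, so that every query reduces to a single matrix multiply. A minimal sketch, assuming the OpenAI clip package and a hypothetical list of image file paths:

```python
import clip, torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32")

def embed_images(paths):
    """Precompute normalized image embeddings for a list of file paths."""
    batch = torch.stack([preprocess(Image.open(p)) for p in paths])
    with torch.no_grad():
        embeds = model.encode_image(batch)
    return embeds / embeds.norm(dim=-1, keepdim=True)

def search(query, image_embeds, paths, k=5):
    """Return the k best-matching (path, score) pairs for a text query."""
    with torch.no_grad():
        q = model.encode_text(clip.tokenize([query]))
    q = q / q.norm(dim=-1, keepdim=True)
    scores = (image_embeds @ q.T).squeeze(1)
    best = scores.argsort(descending=True)[:k]
    return [(paths[i], scores[i].item()) for i in best]
```

For very large collections, these normalized embeddings can also be dropped into an approximate nearest-neighbor index, but a plain matrix multiply is fine for thousands of images.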
Image to Text Search
Find text that best describes an image.
```python
image_embed = encode_image(img)                # image embedding in the shared space
scores = text_embeds @ image_embed             # similarity to each candidate caption
best_caption = captions[scores.argmax()]       # caption closest to the image
```
CLIP Variants
Since OpenAI released CLIP in 2021, the community has created improved variants with better performance, larger training data, or more efficient architectures.
| Model | Key Features | Use Case |
|---|---|---|
| OpenAI CLIP (original) | 400M pairs, ViT-B/32 to ViT-L/14 | Baseline, widely compatible |
| OpenCLIP (LAION) | 2B+ pairs, open weights, many sizes | Best open-source option |
| SigLIP (Google) | Sigmoid loss, better calibration | When you need probability scores |
| EVA-CLIP (BAAI) | Better initialization, 18B version | Maximum accuracy |
| MetaCLIP (Meta) | Curated data, balanced distribution | Reproducible research |
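Most variants expose a CLIP-compatible interface, so swapping one in is usually a matter of changing the model name and weights tag. A minimal sketch using the open_clip package (the laion2b_s34b_b79k tag is one of the LAION-2B checkpoints; check the OpenCLIP repository for the current list):

```python
import open_clip
import torch

# Load an OpenCLIP ViT-B/32 trained on LAION-2B (weights tag is an assumption;
# see the OpenCLIP repository for available pretrained tags)
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

with torch.no_grad():
    text_embeds = model.encode_text(tokenizer(["a photo of a dog", "a photo of a cat"]))
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
```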
Benchmark: ImageNet Zero-Shot
ImageNet zero-shot accuracy measures how well a model can classify 1000 ImageNet categories without any training on ImageNet. This is the standard CLIP benchmark.
ImageNet zero-shot top-1 accuracy. Higher is better. For comparison, a supervised ResNet-50 achieves ~76%.
Key Takeaways
1. CLIP creates a shared embedding space - text and images with the same meaning are close together.
2. Contrastive learning - push matching pairs close, push non-matching pairs apart.
3. Zero-shot classification - classify images into any categories using text prompts.
4. OpenCLIP and SigLIP - the best open-source alternatives to the original CLIP.