Caption + Search: Image Search Pipeline
Chain image captioning with text embeddings to build searchable image databases. Find "dogs playing in the park" without manual tagging.
The Image Search Problem
You have 10,000 product photos, vacation pictures, or surveillance footage. A user asks: "Find images of dogs playing". How do you search?
Traditional approaches require manual tagging - someone labels each image with keywords. This doesn't scale. AI gives us two alternatives: Caption + Search and CLIP direct matching.
// Manual tagging (doesn't scale)
image_001.jpg: ["dog", "park", "playing", "golden retriever"]
image_002.jpg: ["cat", "sleeping", "couch"]
// AI-powered (this lesson)
image_001.jpg: "a golden retriever playing fetch in a sunny park"
image_002.jpg: "an orange cat sleeping on a gray couch"
The Pipeline: Image to Caption to Search
This pipeline chains three building blocks you already know:
Caption (Image to Text)
Use a vision-language model like BLIP or LLaVA to generate natural language descriptions of each image. This runs once at indexing time.
Embed (Text to Vector)
Convert each caption into a vector embedding using a text embedding model like BGE. Store these vectors in an index (FAISS, Pinecone, etc.).
Search (Query to Results)
When a user searches, embed their query with the same text model and find the most similar caption embeddings. Return the corresponding images.
Key Insight
By converting images to text first, you can leverage mature text embedding models (MTEB score 64+) that are better at semantic understanding than image-only approaches. The caption acts as a semantic bridge between visual content and natural language queries.
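To see the "semantic bridge" in action, here is a small, self-contained demo: a caption and a query that share almost no keywords still land close together in text-embedding space. The two sentences are made up for illustration, and the similarity value will depend on the model.
Semantic Bridge Demo (sketch)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-small-en-v1.5')

caption = "a golden retriever playing fetch in a sunny park"  # what a captioner might produce
query = "dogs having fun outdoors"                            # what a user might type

# Normalized embeddings, so the dot product is cosine similarity
vecs = model.encode([caption, query], normalize_embeddings=True)
print(f"cosine similarity: {float(vecs[0] @ vecs[1]):.3f}")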
Step 1: Image Captioning with BLIP
BLIP (Bootstrapping Language-Image Pre-training) is an efficient image captioning model from Salesforce. It generates accurate captions and runs on consumer GPUs.
Install
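A typical setup needs Hugging Face Transformers, PyTorch, and Pillow; the exact command may differ in your environment:
pip install transformers torch pillow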
Image Captioning with BLIP
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
# Load model (downloads ~1GB on first run)
processor = BlipProcessor.from_pretrained(
"Salesforce/blip-image-captioning-base"
)
model = BlipForConditionalGeneration.from_pretrained(
"Salesforce/blip-image-captioning-base"
)
# Caption a single image
image = Image.open("photo.jpg")
inputs = processor(image, return_tensors="pt")
caption_ids = model.generate(**inputs, max_length=50)
caption = processor.decode(caption_ids[0], skip_special_tokens=True)
print(caption)
# "a dog playing in the park"Captioning Model Comparison
| Model | VRAM | Speed | Quality |
|---|---|---|---|
| BLIP-base | ~2 GB | Fast | Good |
| BLIP-large | ~4 GB | Medium | Better |
| BLIP-2 | ~8 GB | Slower | Best |
| LLaVA | ~16 GB | Slowest | Best (detailed) |
BLIP-base is usually sufficient for search. LLaVA excels when you need detailed, contextual descriptions.
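Trading speed for quality is mostly a one-line change: point the same code at the BLIP-large checkpoint and, if a GPU is available, move the model and inputs onto it. A sketch, assuming the Hugging Face checkpoint Salesforce/blip-image-captioning-large:
Upgrading to BLIP-large (optional)
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Same API as BLIP-base; only the checkpoint name and device change
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-large"
).to(device)

image = Image.open("photo.jpg")
inputs = processor(image, return_tensors="pt").to(device)
caption_ids = model.generate(**inputs, max_length=50)
print(processor.decode(caption_ids[0], skip_special_tokens=True))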
Steps 2-3: Complete Pipeline
Here's the complete pipeline that captions images, embeds them, and enables semantic search:
Additional Dependencies
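On top of the BLIP requirements above, the pipeline needs a text embedding model and a vector index; a likely install command, assuming CPU-only FAISS:
pip install sentence-transformers faiss-cpu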
Full Caption + Embed + Search Pipeline
from transformers import BlipProcessor, BlipForConditionalGeneration
from sentence_transformers import SentenceTransformer
from PIL import Image
import faiss
import numpy as np
import os
# Initialize models
blip_processor = BlipProcessor.from_pretrained(
"Salesforce/blip-image-captioning-base"
)
blip_model = BlipForConditionalGeneration.from_pretrained(
"Salesforce/blip-image-captioning-base"
)
embed_model = SentenceTransformer('BAAI/bge-small-en-v1.5')
def caption_image(image_path):
    """Generate caption for a single image."""
    image = Image.open(image_path).convert('RGB')
    inputs = blip_processor(image, return_tensors="pt")
    caption_ids = blip_model.generate(**inputs, max_length=50)
    return blip_processor.decode(caption_ids[0], skip_special_tokens=True)
# 1. Caption all images (run once at indexing time)
image_folder = "images/"
image_paths = [
os.path.join(image_folder, f)
for f in os.listdir(image_folder)
if f.endswith(('.jpg', '.png'))
]
captions = []
for path in image_paths:
    caption = caption_image(path)
    captions.append(caption)
    print(f"{path}: {caption}")
# 2. Embed all captions
embeddings = embed_model.encode(captions, normalize_embeddings=True)
embeddings = embeddings.astype('float32')
# 3. Build FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension) # Inner product for cosine similarity
index.add(embeddings)
# 4. Search function
def search_images(query, k=5):
    """Search for images matching a text query."""
    query_embedding = embed_model.encode(
        [query], normalize_embeddings=True
    ).astype('float32')
    distances, indices = index.search(query_embedding, k)
    results = []
    for score, idx in zip(distances[0], indices[0]):
        results.append({
            'image': image_paths[idx],
            'caption': captions[idx],
            'score': float(score)
        })
    return results
# Example search
results = search_images("dog playing")
for r in results:
    print(f"{r['score']:.3f}: {r['image']}")
    print(f"  {r['caption']}")

Example output:
0.847: images/dog_park.jpg
  a golden retriever playing with a ball in a park
0.723: images/puppy_garden.jpg
  a puppy running through a garden
0.612: images/cat_sofa.jpg
  a cat sitting on a sofa

Alternative: CLIP Direct Matching
CLIP embeds images and text into the same vector space, allowing direct comparison without captioning. This is simpler but has different tradeoffs.
CLIP Direct Image-Text Matching
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch
import faiss
import numpy as np
# Load CLIP
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
def embed_image(image_path):
    """Embed an image using CLIP."""
    image = Image.open(image_path).convert('RGB')
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embedding = model.get_image_features(**inputs)
    return embedding.numpy().flatten()
def embed_text(text):
    """Embed text using CLIP."""
    inputs = processor(text=text, return_tensors="pt", padding=True)
    with torch.no_grad():
        embedding = model.get_text_features(**inputs)
    return embedding.numpy().flatten()
# Index images directly (no captioning step)
image_embeddings = np.array([embed_image(p) for p in image_paths])
image_embeddings = image_embeddings / np.linalg.norm(
image_embeddings, axis=1, keepdims=True
)
# Build index
index = faiss.IndexFlatIP(image_embeddings.shape[1])
index.add(image_embeddings.astype('float32'))
# Search with text query
query = "dog playing"
query_embedding = embed_text(query)
query_embedding = query_embedding / np.linalg.norm(query_embedding)
D, I = index.search(query_embedding.reshape(1, -1).astype('float32'), k=5)
for score, idx in zip(D[0], I[0]):
    print(f"{score:.3f}: {image_paths[idx]}")

Caption+Search vs CLIP: When to Use Which
Caption + Search
Pros:
- Better semantic understanding (uses SOTA text embeddings)
- Captions are human-readable (debugging, auditing)
- Can use advanced text search (filtering, hybrid search)
Cons:
- Slower indexing (caption generation takes time)
- Caption quality limits search quality
Best for: Large catalogs, e-commerce, content management, when you need explainability.
CLIP Direct
Pros:
- Simpler pipeline (one model for everything)
- Faster indexing (no caption generation)
- Works for abstract/hard-to-describe images
Cons:
- 512-dim embeddings may miss nuance
- No intermediate caption for debugging
Best for: Quick prototypes, real-time search, art/abstract content, zero-shot classification.
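Zero-shot classification, mentioned above as a CLIP strength, is the same idea turned around: instead of ranking images for one query, you rank a handful of candidate text labels for one image. A minimal sketch; the labels are illustrative and you would substitute your own categories.
CLIP Zero-Shot Classification (sketch)
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg").convert('RGB')
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]  # illustrative labels

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image scores the image against each text prompt
probs = outputs.logits_per_image.softmax(dim=1)[0]
for label, p in zip(labels, probs):
    print(f"{p.item():.3f}  {label}")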
Decision Matrix
| Requirement | Caption+Search | CLIP Direct |
|---|---|---|
| Complex queries | Better | Good |
| Indexing speed | Slower | Faster |
| Explainability | Yes (captions) | No |
| Storage | More (captions + embeddings) | Less |
| Abstract images | Caption may miss nuance | Better |
Production: Building a Searchable Database
For production use with thousands of images, add persistence and batch processing:
Production Pipeline with Persistence
import sqlite3
import json
# Create database for metadata
conn = sqlite3.connect('image_search.db')
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE IF NOT EXISTS images (
id INTEGER PRIMARY KEY,
path TEXT UNIQUE,
caption TEXT,
embedding_id INTEGER
)
''')
def index_image(image_path):
    """Index a single image: caption, embed, store."""
    # Generate caption
    caption = caption_image(image_path)
    # Generate embedding
    embedding = embed_model.encode(
        [caption], normalize_embeddings=True
    ).astype('float32')
    # Add to FAISS index
    embedding_id = index.ntotal
    index.add(embedding)
    # Store metadata
    cursor.execute(
        'INSERT OR REPLACE INTO images (path, caption, embedding_id) VALUES (?, ?, ?)',
        (image_path, caption, embedding_id)
    )
    conn.commit()
    return caption, embedding_id
def search_with_metadata(query, k=5):
    """Search and return full metadata."""
    query_embedding = embed_model.encode(
        [query], normalize_embeddings=True
    ).astype('float32')
    distances, indices = index.search(query_embedding, k)
    results = []
    for score, idx in zip(distances[0], indices[0]):
        cursor.execute(
            'SELECT path, caption FROM images WHERE embedding_id = ?',
            (int(idx),)
        )
        row = cursor.fetchone()
        if row:
            results.append({
                'path': row[0],
                'caption': row[1],
                'score': float(score)
            })
    return results
# Save FAISS index to disk
faiss.write_index(index, 'image_search.index')
# Load later
loaded_index = faiss.read_index('image_search.index')

Benchmark three numbers on your own hardware: captioning time per image (BLIP-base), embedding time per image, and search latency over a 10K-image index.
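One more production detail: the indexing loop above captions images one at a time, but BLIP can process several images per forward pass, which is usually much faster, especially on a GPU. A sketch that reuses blip_processor and blip_model from the pipeline above; the batch size of 16 is an arbitrary starting point.
Batch Captioning (sketch)
from PIL import Image  # already imported in the pipeline above

def caption_images_batched(image_paths, batch_size=16):
    """Caption images in batches instead of one at a time."""
    captions = []
    for start in range(0, len(image_paths), batch_size):
        batch_paths = image_paths[start:start + batch_size]
        images = [Image.open(p).convert('RGB') for p in batch_paths]
        inputs = blip_processor(images=images, return_tensors="pt")
        caption_ids = blip_model.generate(**inputs, max_length=50)
        captions.extend(
            blip_processor.batch_decode(caption_ids, skip_special_tokens=True)
        )
    return captions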
Key Takeaways
1. Caption + Search chains two blocks - image captioning (BLIP) followed by text embedding search (BGE + FAISS).
2. Captions enable text-based retrieval - leverage mature text embeddings (MTEB 64+) for better semantic matching.
3. CLIP is a simpler alternative - direct image-text matching without intermediate captions.
4. Choose based on your needs - Caption+Search for explainability and complex queries; CLIP for speed and simplicity.
Practice Exercise
Build your own image search system:
1. Collect 20-50 images in a folder (photos from your phone work great).
2. Run the complete pipeline code to caption and index all images.
3. Test different search queries and see which results are returned.
4. Compare results between Caption+Search and CLIP direct. Which works better for your images?