Caption + Search: Image Search Pipeline
Chain image captioning with text embeddings to build searchable image databases. Find "dogs playing in the park" without manual tagging.
The Image Search Problem
You have 10,000 product photos, vacation pictures, or surveillance footage. A user asks: "Find images of dogs playing". How do you search?
Traditional approaches require manual tagging - someone labels each image with keywords. This doesn't scale. AI gives us two alternatives: Caption + Search and CLIP direct matching.
// Manual tagging (doesn't scale)
image_001.jpg: ["dog", "park", "playing", "golden retriever"]
image_002.jpg: ["cat", "sleeping", "couch"]
// AI-powered (this lesson)
image_001.jpg: "a golden retriever playing fetch in a sunny park"
image_002.jpg: "an orange cat sleeping on a gray couch"
The Pipeline: Image to Caption to Search
This pipeline chains three building blocks you already know:
Caption (Image to Text)
Use a vision-language model like BLIP or LLaVA to generate natural language descriptions of each image. This runs once at indexing time.
Embed (Text to Vector)
Convert each caption into a vector embedding using a text embedding model like BGE. Store these vectors in an index (FAISS, Pinecone, etc.).
Search (Query to Results)
When a user searches, embed their query with the same text model and find the most similar caption embeddings. Return the corresponding images.
Key Insight
By converting images to text first, you can leverage mature text embedding models (MTEB score 64+) that are better at semantic understanding than image-only approaches. The caption acts as a semantic bridge between visual content and natural language queries.
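To see the "semantic bridge" in action, here is a small, self-contained demo: a caption and a query that share almost no keywords still land close together in text-embedding space. The two sentences are made up for illustration, and the similarity value will depend on the model.
Semantic Bridge Demo (sketch)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-small-en-v1.5')

caption = "a golden retriever playing fetch in a sunny park"  # what a captioner might produce
query = "dogs having fun outdoors"                            # what a user might type

# Normalized embeddings, so the dot product is cosine similarity
vecs = model.encode([caption, query], normalize_embeddings=True)
print(f"cosine similarity: {float(vecs[0] @ vecs[1]):.3f}")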
Step 1: Image Captioning with BLIP
BLIP (Bootstrapping Language-Image Pre-training) is an efficient image captioning model from Salesforce. It generates accurate captions and runs on consumer GPUs.
Install
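A typical setup needs Hugging Face Transformers, PyTorch, and Pillow; the exact command may differ in your environment:
pip install transformers torch pillow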
Image Captioning with BLIP
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
# Load model (downloads ~1GB on first run)
processor = BlipProcessor.from_pretrained(
"Salesforce/blip-image-captioning-base"
)
model = BlipForConditionalGeneration.from_pretrained(
"Salesforce/blip-image-captioning-base"
)
# Caption a single image
image = Image.open("photo.jpg")
inputs = processor(image, return_tensors="pt")
caption_ids = model.generate(**inputs, max_length=50)
caption = processor.decode(caption_ids[0], skip_special_tokens=True)
print(caption)
# "a dog playing in the park"Captioning Model Comparison
| Model | VRAM | Speed | Quality |
|---|---|---|---|
| BLIP-base | ~2 GB | Fast | Good |
| BLIP-large | ~4 GB | Medium | Better |
| BLIP-2 | ~8 GB | Slower | Best |
| LLaVA | ~16 GB | Slowest | Best (detailed) |
BLIP-base is usually sufficient for search. LLaVA excels when you need detailed, contextual descriptions.
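Trading speed for quality is mostly a one-line change: point the same code at the BLIP-large checkpoint and, if a GPU is available, move the model and inputs onto it. A sketch, assuming the Hugging Face checkpoint Salesforce/blip-image-captioning-large:
Upgrading to BLIP-large (optional)
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Same API as BLIP-base; only the checkpoint name and device change
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-large"
).to(device)

image = Image.open("photo.jpg")
inputs = processor(image, return_tensors="pt").to(device)
caption_ids = model.generate(**inputs, max_length=50)
print(processor.decode(caption_ids[0], skip_special_tokens=True))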
Steps 2-3: Complete Pipeline
Here's the complete pipeline that captions images, embeds them, and enables semantic search:
Additional Dependencies
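On top of the BLIP requirements above, the pipeline needs a text embedding model and a vector index; a likely install command, assuming CPU-only FAISS:
pip install sentence-transformers faiss-cpu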
Full Caption + Embed + Search Pipeline
from transformers import BlipProcessor, BlipForConditionalGeneration
from sentence_transformers import SentenceTransformer
from PIL import Image
import faiss
import numpy as np
import os
# Initialize models
blip_processor = BlipProcessor.from_pretrained(
"Salesforce/blip-image-captioning-base"
)
blip_model = BlipForConditionalGeneration.from_pretrained(
"Salesforce/blip-image-captioning-base"
)
embed_model = SentenceTransformer('BAAI/bge-small-en-v1.5')
def caption_image(image_path):
    """Generate caption for a single image."""
    image = Image.open(image_path).convert('RGB')
    inputs = blip_processor(image, return_tensors="pt")
    caption_ids = blip_model.generate(**inputs, max_length=50)
    return blip_processor.decode(caption_ids[0], skip_special_tokens=True)
# 1. Caption all images (run once at indexing time)
image_folder = "images/"
image_paths = [
os.path.join(image_folder, f)
for f in os.listdir(image_folder)
if f.endswith(('.jpg', '.png'))
]
captions = []
for path in image_paths:
    caption = caption_image(path)
    captions.append(caption)
    print(f"{path}: {caption}")
# 2. Embed all captions
embeddings = embed_model.encode(captions, normalize_embeddings=True)
embeddings = embeddings.astype('float32')
# 3. Build FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension) # Inner product for cosine similarity
index.add(embeddings)
# 4. Search function
def search_images(query, k=5):
    """Search for images matching a text query."""
    query_embedding = embed_model.encode(
        [query], normalize_embeddings=True
    ).astype('float32')
    distances, indices = index.search(query_embedding, k)
    results = []
    for score, idx in zip(distances[0], indices[0]):
        results.append({
            'image': image_paths[idx],
            'caption': captions[idx],
            'score': float(score)
        })
    return results
# Example search
results = search_images("dog playing")
for r in results:
    print(f"{r['score']:.3f}: {r['image']}")
    print(f"  {r['caption']}")

Example output:
0.847: images/dog_park.jpg
  a golden retriever playing with a ball in a park
0.723: images/puppy_garden.jpg
  a puppy running through a garden
0.612: images/cat_sofa.jpg
  a cat sitting on a sofa

Alternative: CLIP Direct Matching
CLIP embeds images and text into the same vector space, allowing direct comparison without captioning. This is simpler but has different tradeoffs.
CLIP Direct Image-Text Matching
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch
import faiss
import numpy as np
# Load CLIP
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
def embed_image(image_path):
    """Embed an image using CLIP."""
    image = Image.open(image_path).convert('RGB')
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embedding = model.get_image_features(**inputs)
    return embedding.numpy().flatten()
def embed_text(text):
    """Embed text using CLIP."""
    inputs = processor(text=text, return_tensors="pt", padding=True)
    with torch.no_grad():
        embedding = model.get_text_features(**inputs)
    return embedding.numpy().flatten()
# Index images directly (no captioning step)
image_embeddings = np.array([embed_image(p) for p in image_paths])
image_embeddings = image_embeddings / np.linalg.norm(
image_embeddings, axis=1, keepdims=True
)
# Build index
index = faiss.IndexFlatIP(image_embeddings.shape[1])
index.add(image_embeddings.astype('float32'))
# Search with text query
query = "dog playing"
query_embedding = embed_text(query)
query_embedding = query_embedding / np.linalg.norm(query_embedding)
D, I = index.search(query_embedding.reshape(1, -1).astype('float32'), k=5)
for score, idx in zip(D[0], I[0]):
    print(f"{score:.3f}: {image_paths[idx]}")

Caption+Search vs CLIP: When to Use Which
Caption + Search
Pros:
- Better semantic understanding (uses SOTA text embeddings)
- Captions are human-readable (debugging, auditing)
- Can use advanced text search (filtering, hybrid search)
Cons:
- Slower indexing (caption generation takes time)
- Caption quality limits search quality
Best for: Large catalogs, e-commerce, content management, when you need explainability.
CLIP Direct
Pros:
- Simpler pipeline (one model for everything)
- Faster indexing (no caption generation)
- Works for abstract/hard-to-describe images
Cons:
- 512-dim embeddings may miss nuance
- No intermediate caption for debugging
Best for: Quick prototypes, real-time search, art/abstract content, zero-shot classification.
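Zero-shot classification, mentioned above as a CLIP strength, is the same idea turned around: instead of ranking images for one query, you rank a handful of candidate text labels for one image. A minimal sketch; the labels are illustrative and you would substitute your own categories.
CLIP Zero-Shot Classification (sketch)
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg").convert('RGB')
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]  # illustrative labels

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image scores the image against each text prompt
probs = outputs.logits_per_image.softmax(dim=1)[0]
for label, p in zip(labels, probs):
    print(f"{p.item():.3f}  {label}")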
Decision Matrix
| Requirement | Caption+Search | CLIP Direct |
|---|---|---|
| Complex queries | Better | Good |
| Indexing speed | Slower | Faster |
| Explainability | Yes (captions) | No |
| Storage | More (captions + embeddings) | Less |
| Abstract images | Caption may miss nuance | Better |
Production: Building a Searchable Database
For production use with thousands of images, add persistence and batch processing:
Production Pipeline with Persistence
import sqlite3
import json
# Create database for metadata
conn = sqlite3.connect('image_search.db')
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE IF NOT EXISTS images (
id INTEGER PRIMARY KEY,
path TEXT UNIQUE,
caption TEXT,
embedding_id INTEGER
)
''')
def index_image(image_path):
    """Index a single image: caption, embed, store."""
    # Generate caption
    caption = caption_image(image_path)
    # Generate embedding
    embedding = embed_model.encode(
        [caption], normalize_embeddings=True
    ).astype('float32')
    # Add to FAISS index
    embedding_id = index.ntotal
    index.add(embedding)
    # Store metadata
    cursor.execute(
        'INSERT OR REPLACE INTO images (path, caption, embedding_id) VALUES (?, ?, ?)',
        (image_path, caption, embedding_id)
    )
    conn.commit()
    return caption, embedding_id
def search_with_metadata(query, k=5):
    """Search and return full metadata."""
    query_embedding = embed_model.encode(
        [query], normalize_embeddings=True
    ).astype('float32')
    distances, indices = index.search(query_embedding, k)
    results = []
    for score, idx in zip(distances[0], indices[0]):
        cursor.execute(
            'SELECT path, caption FROM images WHERE embedding_id = ?',
            (int(idx),)
        )
        row = cursor.fetchone()
        if row:
            results.append({
                'path': row[0],
                'caption': row[1],
                'score': float(score)
            })
    return results
# Save FAISS index to disk
faiss.write_index(index, 'image_search.index')
# Load later
loaded_index = faiss.read_index('image_search.index')

Benchmark three numbers on your own hardware: captioning time per image (BLIP-base), embedding time per image, and search latency over a 10K-image index.
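One more production detail: the indexing loop above captions images one at a time, but BLIP can process several images per forward pass, which is usually much faster, especially on a GPU. A sketch that reuses blip_processor and blip_model from the pipeline above; the batch size of 16 is an arbitrary starting point.
Batch Captioning (sketch)
from PIL import Image  # already imported in the pipeline above

def caption_images_batched(image_paths, batch_size=16):
    """Caption images in batches instead of one at a time."""
    captions = []
    for start in range(0, len(image_paths), batch_size):
        batch_paths = image_paths[start:start + batch_size]
        images = [Image.open(p).convert('RGB') for p in batch_paths]
        inputs = blip_processor(images=images, return_tensors="pt")
        caption_ids = blip_model.generate(**inputs, max_length=50)
        captions.extend(
            blip_processor.batch_decode(caption_ids, skip_special_tokens=True)
        )
    return captions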
Key Takeaways
1. Caption + Search chains two blocks - image captioning (BLIP) followed by text embedding search (BGE + FAISS).
2. Captions enable text-based retrieval - leverage mature text embeddings (MTEB 64+) for better semantic matching.
3. CLIP is a simpler alternative - direct image-text matching without intermediate captions.
4. Choose based on your needs - Caption+Search for explainability and complex queries; CLIP for speed and simplicity.
Practice Exercise
Build your own image search system:
1. Collect 20-50 images in a folder (photos from your phone work great).
2. Run the complete pipeline code to caption and index all images.
3. Test different search queries and see which results are returned.
4. Compare results between Caption+Search and CLIP direct. Which works better for your images?