
Image Captioning

Generate natural language descriptions of image content. Enables text-based search over visual content.

How Vision Language Models Work

A technical deep-dive into vision-language models. From image captioning to multimodal reasoning with LLaVA, GPT-4V, and beyond.

1. What VLMs Can Do

Vision-Language Models understand images and generate text. One model covers many tasks; only the prompt changes, as the sketch after the list below shows.

Image Captioning

Describe image content

"A dog playing with a ball in the park"

VQA

Answer questions about images

"Q: What color is the car? A: Red"

OCR/Document

Extract text from images

"Receipt parsing, form extraction"

Visual Reasoning

Complex inference about images

"Count objects, spatial relationships"

How VLMs Process Images

Image Input -> ViT Encoder (patch tokens) -> Projector (alignment, visual tokens) -> LLM (language model, + text tokens) -> Text Output
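
To make the token budget concrete: CLIP ViT-L/14 at 336x336 resolution yields (336 / 14)^2 = 576 patch tokens, and each becomes one visual token after the projector. A quick sanity check of those numbers (the text-token count is an illustrative assumption; 4096 is the Vicuna-7B context length):

# Token budget for a LLaVA-1.5-style pipeline (illustrative numbers).
image_size = 336                                # CLIP ViT-L/14-336 input resolution
patch_size = 14                                 # ViT patch size
patches_per_side = image_size // patch_size     # 24
visual_tokens = patches_per_side ** 2           # 576 visual tokens per image

text_tokens = 40                                # e.g. a short user question (assumed)
context_window = 4096                           # Vicuna-7B context length

total = visual_tokens + text_tokens
print(f'{visual_tokens} visual + {text_tokens} text = {total} tokens '
      f'({total / context_window:.0%} of the context window)')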

2. VLM Evolution

From CLIP to GPT-4V to open-source alternatives like Qwen2-VL.

Model | Year | Type | Notes
CLIP | 2021 | Contrastive | Image-text alignment, zero-shot
BLIP | 2022 | Encoder-Decoder | Bootstraps captions from web data
Flamingo | 2022 | Few-shot | Interleaved image-text, in-context learning
BLIP-2 | 2023 | Bridged | Q-Former, efficient bridging
LLaVA | 2023 | Instruction | Visual instruction tuning
GPT-4V | 2023 | Proprietary | Best general performance at release
LLaVA 1.5 | 2023 | Instruction | Higher resolution, better data
Qwen2-VL | 2024 | Open | 72B, video understanding
InternVL2 | 2024 | Open | 108B, GPT-4V competitor
Gemini 2.0 | 2024 | Proprietary | Multimodal-native, real-time

Recommended picks:
  • GPT-4o / Gemini 2.0: best overall performance (proprietary, API access only)
  • Qwen2-VL-72B: best open source (video + image, Apache 2.0)
  • LLaVA 1.5/1.6: best for fine-tuning (simple architecture, easy to train)

3. VLM Architectures

How to connect vision and language.

CLIP-style (Contrastive)

Separate encoders, contrastive loss

Pros: Fast retrieval, Zero-shot transfer
Cons: No generation, Fixed outputs
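
A minimal sketch of the contrastive approach, assuming the openai/clip-vit-base-patch32 checkpoint (the candidate captions are illustrative): image and text are embedded by separate encoders and compared by scaled cosine similarity, which supports retrieval and zero-shot classification but not free-form generation.

from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import torch

model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

image = Image.open('image.jpg')
captions = ['a dog playing with a ball', 'a cat sleeping', 'a city street at night']

# Encode image and candidate captions with separate encoders.
inputs = processor(text=captions, images=image, return_tensors='pt', padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds temperature-scaled cosine similarities.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for caption, p in zip(captions, probs):
    print(f'{p.item():.2%}  {caption}')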

LLaVA-style (Projector)

Vision encoder + linear projection + LLM

Pros: Simple, Leverages pretrained LLM
Cons: May lose visual detail

Qwen-VL (Native)

Vision tokens processed natively inside the transformer

Pros: Deep integration, Better grounding
Cons: Expensive training

LLaVA Architecture (Most Popular)

CLIP ViT (frozen, 336x336 input, patch features) -> MLP projector (2 layers, 576 visual tokens) -> Vicuna/Llama (7B/13B language model)

Simple but effective: freeze vision encoder, train projector + LLM.
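
The recipe fits in a few lines of PyTorch. The sketch below is schematic, not the reference implementation: the module name is made up, and the dimensions assume CLIP ViT-L/14 features (1024-d) feeding a 7B LLM with 4096-d embeddings.

import torch
import torch.nn as nn

class LlavaStyleProjector(nn.Module):
    """Two-layer MLP that maps frozen vision features into the LLM embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):            # (batch, 576, vision_dim)
        return self.mlp(patch_features)            # (batch, 576, llm_dim)

# The frozen CLIP ViT produces patch features; only the projector (and later the LLM) are trained.
patch_features = torch.randn(1, 576, 1024)         # stand-in for CLIP ViT-L/14-336 output
projector = LlavaStyleProjector()
visual_tokens = projector(patch_features)

# Visual tokens are concatenated with the embedded text tokens before entering the LLM.
text_embeds = torch.randn(1, 40, 4096)              # stand-in for embedded prompt tokens
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)
print(llm_input.shape)                              # torch.Size([1, 616, 4096])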

4. VLM Benchmarks

How to evaluate multimodal understanding.

Model | MMBench | SEED-Bench | MME | Type
GPT-4o | 83.4 | 77.1 | 2070 | Proprietary
Gemini 1.5 Pro | 80.6 | 75.8 | 2015 | Proprietary
Qwen2-VL-72B | 82.0 | 76.5 | 2055 | Open
InternVL2-76B | 81.2 | 75.4 | 2000 | Open
LLaVA 1.5-13B | 68.2 | 63.0 | 1570 | Open

5. Code Examples

Get started with VLMs in Python.

LLaVA (Open Source)

Install: pip install transformers torch pillow accelerate
from transformers import LlavaForConditionalGeneration, AutoProcessor
import torch
from PIL import Image

# Load LLaVA model
model = LlavaForConditionalGeneration.from_pretrained(
    'llava-hf/llava-1.5-7b-hf',
    torch_dtype=torch.float16,
    device_map='auto'
)
processor = AutoProcessor.from_pretrained('llava-hf/llava-1.5-7b-hf')

# Load image
image = Image.open('image.jpg')

# Create conversation
conversation = [
    {
        'role': 'user',
        'content': [
            {'type': 'image'},
            {'type': 'text', 'text': 'What is happening in this image?'}
        ]
    }
]

# Process and generate
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors='pt').to(model.device, torch.float16)  # match the model's float16 dtype

output = model.generate(**inputs, max_new_tokens=200)
response = processor.decode(output[0], skip_special_tokens=True)
print(response)

Quick Reference

For Best Quality
  • GPT-4o / Claude 3.5
  • Gemini 2.0 Flash
For Open Source
  • Qwen2-VL-72B
  • InternVL2
  • LLaVA 1.6
For Fast/Lightweight
  • BLIP-2
  • LLaVA 1.5-7B
  • Qwen2-VL-2B

Use Cases

  • Accessibility (alt text generation)
  • Photo library organization
  • Content moderation descriptions
  • RAG pipeline input for image search

Architectural Patterns

VLM Captioning

Use a vision-language model (GPT-4V, Claude, LLaVA) to generate detailed captions.

Pros:
  • Rich, detailed descriptions
  • Can follow specific prompts
  • Handles complex scenes
Cons:
  • Slower and more expensive
  • May hallucinate details

Specialized Captioning Models

Use dedicated captioning models like BLIP-2 or CoCa.

Pros:
  • Fast inference
  • Optimized for the task
  • Lower cost
Cons:
  • Less flexible prompting
  • May miss nuances

Caption + Text RAG Pipeline

Generate captions, embed them, and use standard text retrieval. A two-stage approach; see the sketch after this list.

Pros:
  • Leverages mature text RAG
  • Captions are human-readable
  • Easy debugging
Cons:
  • Information loss in captioning
  • Slower indexing
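
A minimal sketch of the two-stage pattern, reusing the BLIP-2 captioner from the code examples below and sentence-transformers for the text side (the file names and in-memory index are illustrative; a real deployment would store captions in a vector database):

from transformers import Blip2Processor, Blip2ForConditionalGeneration
from sentence_transformers import SentenceTransformer, util
from PIL import Image
import torch

# Stage 1: caption each image with BLIP-2.
processor = Blip2Processor.from_pretrained('Salesforce/blip2-opt-2.7b')
captioner = Blip2ForConditionalGeneration.from_pretrained(
    'Salesforce/blip2-opt-2.7b', torch_dtype=torch.float16, device_map='auto'
)

def caption(path):
    image = Image.open(path).convert('RGB')
    inputs = processor(image, return_tensors='pt').to(captioner.device, torch.float16)
    ids = captioner.generate(**inputs, max_new_tokens=50)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]

image_paths = ['dog.jpg', 'beach.jpg', 'kitchen.jpg']   # illustrative file names
captions = [caption(p) for p in image_paths]

# Stage 2: embed the captions and run ordinary text retrieval.
embedder = SentenceTransformer('all-MiniLM-L6-v2')
caption_embeddings = embedder.encode(captions, convert_to_tensor=True)

query = 'a dog playing outside'
query_embedding = embedder.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, caption_embeddings)[0]

best = int(scores.argmax())
print(f'Best match: {image_paths[best]} ("{captions[best]}")')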

Implementations

API Services

GPT-4 Vision

OpenAI
API

State-of-the-art for detailed, accurate captions. Best for complex scenes.

Claude 3.5 Sonnet

Anthropic
API

Excellent vision capabilities with nuanced descriptions.

Open Source

LLaVA

Apache 2.0
Open Source

Strong open-source VLM; LLaVA-1.6 brings significant improvements over 1.5.

BLIP-2

BSD 3-Clause
Open Source

Efficient captioning with Q-Former architecture.

CogVLM

Apache 2.0
Open Source

Strong Chinese and English captioning.


Code Examples

Image Captioning with BLIP-2

Generate captions using Salesforce BLIP-2

Install: pip install transformers torch pillow accelerate
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image
import torch

# Load BLIP-2
processor = Blip2Processor.from_pretrained('Salesforce/blip2-opt-2.7b')
model = Blip2ForConditionalGeneration.from_pretrained(
    'Salesforce/blip2-opt-2.7b',
    torch_dtype=torch.float16,
    device_map='auto'
)

# Caption an image
image = Image.open('photo.jpg').convert('RGB')
inputs = processor(image, return_tensors='pt').to('cuda', torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f'Caption: {caption}')

Detailed Captioning with GPT-4o

Get rich descriptions using OpenAI's vision API

Install: pip install openai
from openai import OpenAI
import base64

client = OpenAI()

# Read and encode image
with open('photo.jpg', 'rb') as f:
    image_data = base64.b64encode(f.read()).decode('utf-8')

response = client.chat.completions.create(
    model='gpt-4o',
    messages=[
        {
            'role': 'user',
            'content': [
                {
                    'type': 'text',
                    'text': 'Describe this image in detail for search indexing. '
                            'Include objects, actions, setting, colors, and mood.'
                },
                {
                    'type': 'image_url',
                    'image_url': {
                        'url': f'data:image/jpeg;base64,{image_data}'
                    }
                }
            ]
        }
    ],
    max_tokens=300
)

caption = response.choices[0].message.content
print(f'Caption: {caption}')

Local Captioning with LLaVA

Run vision-language model locally with Ollama

Install: pip install ollama
import ollama
import base64

# Read image
with open('photo.jpg', 'rb') as f:
    image_data = base64.b64encode(f.read()).decode('utf-8')

# Generate caption with LLaVA via Ollama
# (requires a running Ollama server with the model pulled: ollama pull llava:13b)
response = ollama.chat(
    model='llava:13b',
    messages=[
        {
            'role': 'user',
            'content': 'Describe this image in detail.',
            'images': [image_data]
        }
    ]
)

caption = response['message']['content']
print(f'Caption: {caption}')

Quick Facts

  • Input: Image
  • Output: Text
  • Implementations: 3 open source, 2 API
  • Patterns: 3 approaches
