
Image Captioning

Generate natural language descriptions of image content. Enables text-based search over visual content.

How Vision Language Models Work

A technical deep-dive into vision-language models. From image captioning to multimodal reasoning with LLaVA, GPT-4V, and beyond.

1. What VLMs Can Do

Vision-Language Models understand images and generate text. One model covers many tasks; only the prompt changes, as the sketch after the list below shows.

Image Captioning

Describe image content

"A dog playing with a ball in the park"

VQA

Answer questions about images

"Q: What color is the car? A: Red"

OCR/Document

Extract text from images

"Receipt parsing, form extraction"

Visual Reasoning

Complex inference about images

"Count objects, spatial relationships"

How VLMs Process Images

Image Input -> ViT Encoder (patch tokens) -> Projector (alignment, visual tokens) -> LLM (language model, + text tokens) -> Text Output
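
To make the token budget concrete: CLIP ViT-L/14 at 336x336 resolution yields (336 / 14)^2 = 576 patch tokens, and each becomes one visual token after the projector. A quick sanity check of those numbers (the text-token count is an illustrative assumption; 4096 is the Vicuna-7B context length):

# Token budget for a LLaVA-1.5-style pipeline (illustrative numbers).
image_size = 336                                # CLIP ViT-L/14-336 input resolution
patch_size = 14                                 # ViT patch size
patches_per_side = image_size // patch_size     # 24
visual_tokens = patches_per_side ** 2           # 576 visual tokens per image

text_tokens = 40                                # e.g. a short user question (assumed)
context_window = 4096                           # Vicuna-7B context length

total = visual_tokens + text_tokens
print(f'{visual_tokens} visual + {text_tokens} text = {total} tokens '
      f'({total / context_window:.0%} of the context window)')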

2. VLM Evolution

From CLIP to GPT-4V to open-source alternatives like Qwen2-VL.

Model | Year | Type | Notes
CLIP | 2021 | Contrastive | Image-text alignment, zero-shot
BLIP | 2022 | Encoder-Decoder | Bootstraps captions from web data
Flamingo | 2022 | Few-shot | Interleaved image-text, in-context learning
BLIP-2 | 2023 | Bridged | Q-Former, efficient bridging
LLaVA | 2023 | Instruction | Visual instruction tuning
GPT-4V | 2023 | Proprietary | Best general performance at release
LLaVA 1.5 | 2023 | Instruction | Higher resolution, better data
Qwen2-VL | 2024 | Open | 72B, video understanding
InternVL2 | 2024 | Open | 108B, GPT-4V competitor
Gemini 2.0 | 2024 | Proprietary | Multimodal-native, real-time

Recommended picks:
  • GPT-4o / Gemini 2.0: best overall performance (proprietary, API access only)
  • Qwen2-VL-72B: best open source (video + image, Apache 2.0)
  • LLaVA 1.5/1.6: best for fine-tuning (simple architecture, easy to train)

3. VLM Architectures

How to connect vision and language.

CLIP-style (Contrastive)

Separate encoders, contrastive loss

Pros: Fast retrieval, Zero-shot transfer
Cons: No generation, Fixed outputs
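
A minimal sketch of the contrastive approach, assuming the openai/clip-vit-base-patch32 checkpoint (the candidate captions are illustrative): image and text are embedded by separate encoders and compared by scaled cosine similarity, which supports retrieval and zero-shot classification but not free-form generation.

from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import torch

model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

image = Image.open('image.jpg')
captions = ['a dog playing with a ball', 'a cat sleeping', 'a city street at night']

# Encode image and candidate captions with separate encoders.
inputs = processor(text=captions, images=image, return_tensors='pt', padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds temperature-scaled cosine similarities.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for caption, p in zip(captions, probs):
    print(f'{p.item():.2%}  {caption}')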

LLaVA-style (Projector)

Vision encoder + linear projection + LLM

Pros: Simple, Leverages pretrained LLM
Cons: May lose visual detail

Qwen-VL (Native)

Vision tokens processed natively inside the transformer

Pros: Deep integration, Better grounding
Cons: Expensive training

LLaVA Architecture (Most Popular)

CLIP ViT (frozen, 336x336 input, patch features) -> MLP projector (2 layers, 576 visual tokens) -> Vicuna/Llama (7B/13B language model)

Simple but effective: freeze vision encoder, train projector + LLM.
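
The recipe fits in a few lines of PyTorch. The sketch below is schematic, not the reference implementation: the module name is made up, and the dimensions assume CLIP ViT-L/14 features (1024-d) feeding a 7B LLM with 4096-d embeddings.

import torch
import torch.nn as nn

class LlavaStyleProjector(nn.Module):
    """Two-layer MLP that maps frozen vision features into the LLM embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):            # (batch, 576, vision_dim)
        return self.mlp(patch_features)            # (batch, 576, llm_dim)

# The frozen CLIP ViT produces patch features; only the projector (and later the LLM) are trained.
patch_features = torch.randn(1, 576, 1024)         # stand-in for CLIP ViT-L/14-336 output
projector = LlavaStyleProjector()
visual_tokens = projector(patch_features)

# Visual tokens are concatenated with the embedded text tokens before entering the LLM.
text_embeds = torch.randn(1, 40, 4096)              # stand-in for embedded prompt tokens
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)
print(llm_input.shape)                              # torch.Size([1, 616, 4096])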

4. VLM Benchmarks

How to evaluate multimodal understanding.

Model | MMBench | SEED-Bench | MME | Type
GPT-4o | 83.4 | 77.1 | 2070 | Proprietary
Gemini 1.5 Pro | 80.6 | 75.8 | 2015 | Proprietary
Qwen2-VL-72B | 82.0 | 76.5 | 2055 | Open
InternVL2-76B | 81.2 | 75.4 | 2000 | Open
LLaVA 1.5-13B | 68.2 | 63.0 | 1570 | Open

5. Code Examples

Get started with VLMs in Python.

LLaVA (Open Source)

Install: pip install transformers torch pillow accelerate
from transformers import LlavaForConditionalGeneration, AutoProcessor
import torch
from PIL import Image

# Load LLaVA model
model = LlavaForConditionalGeneration.from_pretrained(
    'llava-hf/llava-1.5-7b-hf',
    torch_dtype=torch.float16,
    device_map='auto'
)
processor = AutoProcessor.from_pretrained('llava-hf/llava-1.5-7b-hf')

# Load image
image = Image.open('image.jpg')

# Create conversation
conversation = [
    {
        'role': 'user',
        'content': [
            {'type': 'image'},
            {'type': 'text', 'text': 'What is happening in this image?'}
        ]
    }
]

# Process and generate
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors='pt').to(model.device, torch.float16)  # match the model's float16 dtype

output = model.generate(**inputs, max_new_tokens=200)
response = processor.decode(output[0], skip_special_tokens=True)
print(response)

Quick Reference

For Best Quality
  • GPT-4o / Claude 3.5
  • Gemini 2.0 Flash
For Open Source
  • Qwen2-VL-72B
  • InternVL2
  • LLaVA 1.6
For Fast/Lightweight
  • BLIP-2
  • LLaVA 1.5-7B
  • Qwen2-VL-2B

Use Cases

  • Accessibility (alt text generation)
  • Photo library organization
  • Content moderation descriptions
  • RAG pipeline input for image search

Architectural Patterns

VLM Captioning

Use a vision-language model (GPT-4V, Claude, LLaVA) to generate detailed captions.

Pros:
  • Rich, detailed descriptions
  • Can follow specific prompts
  • Handles complex scenes
Cons:
  • Slower and more expensive
  • May hallucinate details

Specialized Captioning Models

Use dedicated captioning models like BLIP-2 or CoCa.

Pros:
  • Fast inference
  • Optimized for the task
  • Lower cost
Cons:
  • Less flexible prompting
  • May miss nuances

Caption + Text RAG Pipeline

Generate captions, embed them, and use standard text retrieval. A two-stage approach; see the sketch after this list.

Pros:
  • Leverages mature text RAG
  • Captions are human-readable
  • Easy debugging
Cons:
  • Information loss in captioning
  • Slower indexing
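
A minimal sketch of the two-stage pattern, reusing the BLIP-2 captioner from the code examples below and sentence-transformers for the text side (the file names and in-memory index are illustrative; a real deployment would store captions in a vector database):

from transformers import Blip2Processor, Blip2ForConditionalGeneration
from sentence_transformers import SentenceTransformer, util
from PIL import Image
import torch

# Stage 1: caption each image with BLIP-2.
processor = Blip2Processor.from_pretrained('Salesforce/blip2-opt-2.7b')
captioner = Blip2ForConditionalGeneration.from_pretrained(
    'Salesforce/blip2-opt-2.7b', torch_dtype=torch.float16, device_map='auto'
)

def caption(path):
    image = Image.open(path).convert('RGB')
    inputs = processor(image, return_tensors='pt').to(captioner.device, torch.float16)
    ids = captioner.generate(**inputs, max_new_tokens=50)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]

image_paths = ['dog.jpg', 'beach.jpg', 'kitchen.jpg']   # illustrative file names
captions = [caption(p) for p in image_paths]

# Stage 2: embed the captions and run ordinary text retrieval.
embedder = SentenceTransformer('all-MiniLM-L6-v2')
caption_embeddings = embedder.encode(captions, convert_to_tensor=True)

query = 'a dog playing outside'
query_embedding = embedder.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, caption_embeddings)[0]

best = int(scores.argmax())
print(f'Best match: {image_paths[best]} ("{captions[best]}")')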

Implementations

API Services

GPT-4 Vision

OpenAI
API

State-of-the-art for detailed, accurate captions. Best for complex scenes.

Claude 3.5 Sonnet

Anthropic
API

Excellent vision capabilities with nuanced descriptions.

Open Source

LLaVA

Apache 2.0
Open Source

Strong open-source VLM; LLaVA-1.6 brings significant improvements over 1.5.

BLIP-2

BSD 3-Clause
Open Source

Efficient captioning with Q-Former architecture.

CogVLM

Apache 2.0
Open Source

Strong Chinese and English captioning.


Code Examples

Image Captioning with BLIP-2

Generate captions using Salesforce BLIP-2

Install: pip install transformers torch pillow accelerate
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image
import torch

# Load BLIP-2
processor = Blip2Processor.from_pretrained('Salesforce/blip2-opt-2.7b')
model = Blip2ForConditionalGeneration.from_pretrained(
    'Salesforce/blip2-opt-2.7b',
    torch_dtype=torch.float16,
    device_map='auto'
)

# Caption an image
image = Image.open('photo.jpg').convert('RGB')
inputs = processor(image, return_tensors='pt').to('cuda', torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f'Caption: {caption}')

Detailed Captioning with GPT-4o

Get rich descriptions using OpenAI's vision API

Install: pip install openai
from openai import OpenAI
import base64

client = OpenAI()

# Read and encode image
with open('photo.jpg', 'rb') as f:
    image_data = base64.b64encode(f.read()).decode('utf-8')

response = client.chat.completions.create(
    model='gpt-4o',
    messages=[
        {
            'role': 'user',
            'content': [
                {
                    'type': 'text',
                    'text': 'Describe this image in detail for search indexing. '
                            'Include objects, actions, setting, colors, and mood.'
                },
                {
                    'type': 'image_url',
                    'image_url': {
                        'url': f'data:image/jpeg;base64,{image_data}'
                    }
                }
            ]
        }
    ],
    max_tokens=300
)

caption = response.choices[0].message.content
print(f'Caption: {caption}')

Local Captioning with LLaVA

Run vision-language model locally with Ollama

Install: pip install ollama
import ollama
import base64

# Read image
with open('photo.jpg', 'rb') as f:
    image_data = base64.b64encode(f.read()).decode('utf-8')

# Generate caption with LLaVA via Ollama
# (requires a running Ollama server with the model pulled: ollama pull llava:13b)
response = ollama.chat(
    model='llava:13b',
    messages=[
        {
            'role': 'user',
            'content': 'Describe this image in detail.',
            'images': [image_data]
        }
    ]
)

caption = response['message']['content']
print(f'Caption: {caption}')

Quick Facts

  • Input: Image
  • Output: Text
  • Implementations: 3 open source, 2 API
  • Patterns: 3 approaches
