
Visual Question Answering

Answer natural language questions about images. Combines vision and language understanding.

How Visual Question Answering Works

A technical deep-dive into Visual Question Answering. From attention mechanisms to modern vision-language models that can reason about images.

1. The Problem

Why is answering questions about images hard for machines?

Picture a photograph of a family picnic. A human glances at it and can instantly answer "How many people are eating?" or "Is the weather nice?" without conscious effort. For a machine, this requires solving multiple hard problems simultaneously:

Visual Understanding

The system must detect objects, understand their relationships, recognize actions, and infer scene context. A "picnic" is not just objects, but their arrangement and context.

Required: Object detection, scene classification, spatial reasoning
Language Understanding

Questions come in infinite variety. "How many?" needs counting. "Is it raining?" needs visual inference. "What might happen next?" needs reasoning.

Required: Question parsing, intent classification, answer generation
Cross-Modal Alignment

The hardest part: connecting words to visual concepts. What does "eating" look like? Where in the image is the answer to "What color is the blanket?"

Required: Grounding, attention, joint embedding spaces (a short CLIP sketch follows these four challenges)
World Knowledge

Many questions require knowledge beyond the image. "What city is this?" needs to recognize landmarks. "Is this food healthy?" needs nutrition knowledge.

Required: External knowledge, common sense reasoning
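
To make "joint embedding spaces" concrete, here is a minimal sketch that scores one image against a few candidate captions with CLIP via Hugging Face transformers; the file name and captions are placeholders.

from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import torch

# CLIP maps images and text into the same embedding space,
# so a similarity score grounds words in visual content
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("picnic.jpg").convert("RGB")  # placeholder path
candidates = [
    "a family eating at a picnic",
    "a person riding a bicycle",
    "an empty parking lot",
]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_captions)
probs = logits.softmax(dim=-1)[0]

for caption, p in zip(candidates, probs):
    print(f"{p.item():.2f}  {caption}")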

Types of VQA Questions

Simple VQA

Questions with direct visual answers.

Examples:
  • "What color is the car?"
  • "How many people are there?"

Reasoning VQA

Requires logical inference from visual cues.

Examples:
  • "Is it going to rain?"
  • "What might happen next?"

Knowledge VQA

Needs external world knowledge beyond the image.

Examples:
  • "What city is this?"
  • "Who painted this?"

Text VQA

Reading and understanding text in images.

Examples:
  • "What does the sign say?"
  • "What is the price?"

2. How Vision-Language Models Work

The architecture evolved from simple feature concatenation to sophisticated multimodal transformers.

Modern VLM Architecture (Simplified)

Image Input + Question Input -> Vision Encoder (ViT / SigLIP) -> Projection Layer (MLP / Q-Former) -> LLM Decoder (Llama / Mistral) -> Answer Output

The Core Insight

Modern VLMs treat images as a special kind of "text." The vision encoder converts image patches into tokens that look like word embeddings to the LLM. The LLM then processes image tokens and text tokens together, allowing it to reason about both modalities using the same attention mechanisms.
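
A minimal sketch of that idea in PyTorch; the dimensions and the projection module are illustrative rather than taken from any specific model.

import torch
import torch.nn as nn

d_vision, d_llm = 1024, 4096            # illustrative widths
num_patches, num_text_tokens = 256, 32

# Per-patch features from a (typically frozen) vision encoder
patch_features = torch.randn(1, num_patches, d_vision)

# Projection layer: maps patch features into the LLM's embedding space
projector = nn.Sequential(
    nn.Linear(d_vision, d_llm),
    nn.GELU(),
    nn.Linear(d_llm, d_llm),
)
image_tokens = projector(patch_features)               # (1, 256, 4096)

# Text token embeddings would come from the LLM's own embedding table
text_tokens = torch.randn(1, num_text_tokens, d_llm)   # (1, 32, 4096)

# The LLM sees one combined sequence and attends over both modalities
llm_input = torch.cat([image_tokens, text_tokens], dim=1)
print(llm_input.shape)  # (1, 288, 4096)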

A brief timeline of that evolution:
  • CNN + LSTM (2015). Early Fusion: concatenate image and question features.
  • Attention Networks (2016). Attention: focus on relevant image regions.
  • Bottom-Up Attention (2018). Object-Based: use detected object features.
  • ViLBERT (2019). Dual-Stream: separate vision and language encoders.
  • LXMERT (2019). Cross-Modal: cross-attention between modalities.
  • CLIP + GPT (2021). Zero-Shot: pretrained vision-language alignment.
  • BLIP-2 (2023). Q-Former: learnable query tokens bridge vision to the LLM.
  • LLaVA (2023). End-to-End: visual instruction tuning.
  • Qwen2-VL (2024). Native VLM: dynamic resolution, video understanding.
  • GPT-4V (2024). Multimodal LLM: frontier reasoning capabilities.

Early Fusion

Concatenate image and text features early, then process them together. Simple, but loses modality-specific patterns.

Late Fusion

Process each modality separately and combine at the end. Misses cross-modal interactions during encoding.

Cross-Attention Fusion

Text attends to image regions and image tokens attend to words, repeated across layers (a sketch follows below). Generally the strongest of the three fusion strategies; it is used by models such as LXMERT, Flamingo, and BLIP-2's Q-Former, while newer VLMs like LLaVA instead concatenate projected image tokens into the LLM input and rely on self-attention, as described above.
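
A minimal sketch of a single cross-attention step in PyTorch, with illustrative shapes: text hidden states act as queries over image patch features (the image-to-text direction is symmetric).

import torch
import torch.nn as nn

d_model = 768
text = torch.randn(1, 32, d_model)    # 32 text token states
image = torch.randn(1, 196, d_model)  # 196 image patch features

# Text attends to image regions: queries from text, keys/values from the image
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)
fused_text, attn_weights = cross_attn(query=text, key=image, value=image)

print(fused_text.shape)    # (1, 32, 768)
print(attn_weights.shape)  # (1, 32, 196): which patches each word attended to
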
3. Key Models

The models you should know for VQA in 2024-2025.

BLIP-2
Salesforce | 3B-13B params | Open Source (BSD-3)
Strengths:
  • Efficient Q-Former bridge
  • Works with any LLM
  • Good zero-shot performance
Weaknesses:
  • Less strong on complex reasoning
  • Frozen vision encoder

LLaVA
Microsoft/UW | 7B-34B params | Open Source (Apache 2.0)
Strengths:
  • Visual instruction following
  • Open weights
  • Active community
Weaknesses:
  • Requires fine-tuning for best results
  • Single image only

Qwen2-VL
Alibaba | 2B-72B params | Open Source (Apache 2.0)
Strengths:
  • Dynamic resolution
  • Video support
  • Strong OCR
Weaknesses:
  • Requires significant VRAM
  • Complex inference setup

GPT-4V
OpenAI | Unknown params | API (Proprietary)
Strengths:
  • Best reasoning
  • Handles complex questions
  • Multi-image support
Weaknesses:
  • Expensive
  • Rate limited
  • No local deployment

Gemini Pro Vision
Google | Unknown params | API (Proprietary)
Strengths:
  • Long context
  • Video understanding
  • Fast inference
Weaknesses:
  • API only
  • Variable availability

Claude 3.5 Sonnet
Anthropic | Unknown params | API (Proprietary)
Strengths:
  • Strong reasoning
  • Good at charts/diagrams
  • Reliable
Weaknesses:
  • API only
  • Image limits per request

Best Open Source: Qwen2-VL-72B (state-of-the-art open weights)
Best API: GPT-4o (best reasoning, most reliable)
Best for Local: LLaVA 1.6 7B (runs on a consumer GPU)

4. Benchmarks

Standard datasets for evaluating VQA models.

Dataset | Focus | Size | Metric | SOTA
VQAv2 | General VQA | 1.1M QA pairs | Accuracy | 86.1% (Gemini)
OK-VQA | Knowledge VQA | 14K questions | Accuracy | 66.1% (PaLI-X)
TextVQA | Scene Text | 45K questions | Accuracy | 77.6% (GPT-4V)
GQA | Compositional | 22M questions | Accuracy | 72.1% (PaLI)
VizWiz | Accessibility | 31K questions | Accuracy | 73.2% (GPT-4V)
DocVQA | Document | 50K questions | ANLS | 93.4% (GPT-4V)
VQAv2

The standard benchmark. 1.1 million questions about COCO images. Balanced so guessing from question alone performs poorly. Accuracy metric with consensus from 10 human annotators.
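
The headline numbers above come from VQA's soft accuracy rule rather than exact matching. A minimal sketch of the commonly used simplified form (the official evaluator also normalizes answers and averages over every subset of 9 of the 10 annotators):

def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQAv2 accuracy: fully correct if at least 3 of the
    10 annotators gave the predicted answer, partial credit otherwise."""
    pred = prediction.strip().lower()
    matches = sum(ans.strip().lower() == pred for ans in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators answered "red", so "red" scores 1.0
print(vqa_accuracy("red", ["red"] * 4 + ["dark red"] * 3 + ["maroon"] * 3))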

OK-VQA

"Outside Knowledge" VQA. Questions require knowledge not in the image. Example: "What vitamin is this fruit high in?" Tests knowledge retrieval.

TextVQA

Reading text in images. Street signs, product labels, documents. Requires OCR capability integrated with reasoning.

GQA

22M compositional questions from scene graphs. Tests reasoning: "Is the cat on the table to the left of the lamp?"

5. VQA in Action

See how VQA works on real images with different question types.

[Image: VQA example — a plate of pasta]
Q: What type of food is this?
A: This appears to be a plate of pasta with meat sauce and fresh basil leaves.
Q: Is this a healthy meal?
A: It contains carbohydrates from the pasta and protein from the meat; moderately healthy depending on portion size.
Q: What cuisine is this?
A: Italian cuisine; it looks like spaghetti bolognese or a similar pasta dish.

6. Code Examples

Get started with VQA using different frameworks and models.

BLIP-2 (Open Source)
pip install transformers accelerate
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image
import torch

# Load BLIP-2 model
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load image (convert to RGB to handle PNG or grayscale inputs)
image = Image.open("photo.jpg").convert("RGB")

# Ask a question; BLIP-2 OPT checkpoints respond best to this prompt format
question = "What is happening in this image?"
prompt = f"Question: {question} Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=100)
answer = processor.decode(output[0], skip_special_tokens=True).strip()

print(f"Q: {question}")
print(f"A: {answer}")

Quick Reference

For Production
  • GPT-4o for best quality
  • Qwen2-VL for self-hosted
  • LLaVA for local deployment

Key Benchmarks
  • VQAv2 (general)
  • TextVQA (OCR + QA)
  • OK-VQA (knowledge)

Common Pitfalls
  • Image resolution matters
  • Question phrasing affects accuracy
  • Hallucinations on OCR tasks

Use Cases

  • Accessibility for blind users
  • Image-based search
  • Visual reasoning
  • Educational tools
  • Customer support with images

Architectural Patterns

Vision-Language Models

End-to-end models trained on image-question-answer triplets.

Pros:
  • State-of-the-art accuracy
  • Handles complex reasoning
Cons:
  • Large models
  • Expensive inference

Vision Encoder + LLM

Encode image, feed features to LLM decoder.

Pros:
  • Leverages LLM capabilities
  • Flexible
Cons:
  • Two-stage pipeline
  • May lose visual details

Object Detection + QA

Detect objects first, then reason over the detections (a sketch follows this list).

Pros:
  • Interpretable
  • Good for counting
Cons:
  • Limited by detector
  • Complex pipeline
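
A minimal sketch of this detect-then-reason pattern, using DETR from Hugging Face transformers to answer a counting question; the file path, confidence threshold, and question are placeholders.

from transformers import DetrImageProcessor, DetrForObjectDetection
from PIL import Image
import torch

# Off-the-shelf COCO detector
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep confident detections and map label ids back to class names
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9
)[0]
labels = [model.config.id2label[i.item()] for i in detections["labels"]]

# "How many people are there?" reduces to counting one detected class
print("Q: How many people are there?")
print(f"A: {labels.count('person')}")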

Implementations

API Services

GPT-4V

OpenAI
API

Best overall VQA. Handles complex reasoning well.
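
A sketch of the API route using the official openai Python client (pip install openai). The model name (gpt-4o, per the quick reference above), file path, and question are placeholders, and OPENAI_API_KEY must be set in the environment.

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Send a local image as a base64 data URL
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
    max_tokens=200,
)
print(response.choices[0].message.content)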

Claude 3.5 Sonnet

Anthropic
API

Excellent for detailed image analysis.

Open Source

LLaVA

Apache 2.0
Open Source

Best open-source VLM. LLaVA-1.6 is strong.

Qwen-VL

Apache 2.0
Open Source

Alibaba's VLM. Excellent multilingual support.

InternVL2

MIT
Open Source

Top open-source on many VQA benchmarks.

Quick Facts

Input: Image
Output: Text
Implementations: 3 open source, 2 API
Patterns: 3 approaches
