
Visual Question Answering

Answer natural language questions about images. Combines vision and language understanding.

How Visual Question Answering Works

A technical deep-dive into Visual Question Answering. From attention mechanisms to modern vision-language models that can reason about images.

1. The Problem

Why is answering questions about images hard for machines?

Picture a photograph of a family picnic. A human glances at it and can instantly answer "How many people are eating?" or "Is the weather nice?" without conscious effort. For a machine, this requires solving multiple hard problems simultaneously:

Visual Understanding

The system must detect objects, understand their relationships, recognize actions, and infer scene context. A "picnic" is not just objects, but their arrangement and context.

Required: Object detection, scene classification, spatial reasoning
Language Understanding

Questions come in infinite variety. "How many?" needs counting. "Is it raining?" needs visual inference. "What might happen next?" needs reasoning.

Required: Question parsing, intent classification, answer generation
Cross-Modal Alignment

The hardest part: connecting words to visual concepts. What does "eating" look like? Where in the image is the answer to "What color is the blanket?"

Required: Grounding, attention, joint embedding spaces (a short CLIP sketch follows these four challenges)
World Knowledge

Many questions require knowledge beyond the image. "What city is this?" needs to recognize landmarks. "Is this food healthy?" needs nutrition knowledge.

Required: External knowledge, common sense reasoning
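
To make "joint embedding spaces" concrete, here is a minimal sketch that scores one image against a few candidate captions with CLIP via Hugging Face transformers; the file name and captions are placeholders.

from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import torch

# CLIP maps images and text into the same embedding space,
# so a similarity score grounds words in visual content
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("picnic.jpg").convert("RGB")  # placeholder path
candidates = [
    "a family eating at a picnic",
    "a person riding a bicycle",
    "an empty parking lot",
]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_captions)
probs = logits.softmax(dim=-1)[0]

for caption, p in zip(candidates, probs):
    print(f"{p.item():.2f}  {caption}")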

Types of VQA Questions

Simple VQA

Questions with direct visual answers.

Examples:
  • "What color is the car?"
  • "How many people are there?"

Reasoning VQA

Requires logical inference from visual cues.

Examples:
  • "Is it going to rain?"
  • "What might happen next?"

Knowledge VQA

Needs external world knowledge beyond the image.

Examples:
  • "What city is this?"
  • "Who painted this?"

Text VQA

Reading and understanding text in images.

Examples:
  • "What does the sign say?"
  • "What is the price?"

2. How Vision-Language Models Work

The architecture evolved from simple feature concatenation to sophisticated multimodal transformers.

Modern VLM Architecture (Simplified)

Image Input + Question Input -> Vision Encoder (ViT / SigLIP) -> Projection Layer (MLP / Q-Former) -> LLM Decoder (Llama / Mistral) -> Answer Output

The Core Insight

Modern VLMs treat images as a special kind of "text." The vision encoder converts image patches into tokens that look like word embeddings to the LLM. The LLM then processes image tokens and text tokens together, allowing it to reason about both modalities using the same attention mechanisms.
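
A minimal sketch of that idea in PyTorch; the dimensions and the projection module are illustrative rather than taken from any specific model.

import torch
import torch.nn as nn

d_vision, d_llm = 1024, 4096            # illustrative widths
num_patches, num_text_tokens = 256, 32

# Per-patch features from a (typically frozen) vision encoder
patch_features = torch.randn(1, num_patches, d_vision)

# Projection layer: maps patch features into the LLM's embedding space
projector = nn.Sequential(
    nn.Linear(d_vision, d_llm),
    nn.GELU(),
    nn.Linear(d_llm, d_llm),
)
image_tokens = projector(patch_features)               # (1, 256, 4096)

# Text token embeddings would come from the LLM's own embedding table
text_tokens = torch.randn(1, num_text_tokens, d_llm)   # (1, 32, 4096)

# The LLM sees one combined sequence and attends over both modalities
llm_input = torch.cat([image_tokens, text_tokens], dim=1)
print(llm_input.shape)  # (1, 288, 4096)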

A brief timeline of that evolution:
  • CNN + LSTM (2015). Early Fusion: concatenate image and question features.
  • Attention Networks (2016). Attention: focus on relevant image regions.
  • Bottom-Up Attention (2018). Object-Based: use detected object features.
  • ViLBERT (2019). Dual-Stream: separate vision and language encoders.
  • LXMERT (2019). Cross-Modal: cross-attention between modalities.
  • CLIP + GPT (2021). Zero-Shot: pretrained vision-language alignment.
  • BLIP-2 (2023). Q-Former: learnable query tokens bridge vision to the LLM.
  • LLaVA (2023). End-to-End: visual instruction tuning.
  • Qwen2-VL (2024). Native VLM: dynamic resolution, video understanding.
  • GPT-4V (2024). Multimodal LLM: frontier reasoning capabilities.

Early Fusion

Concatenate image and text features early, then process them together. Simple, but loses modality-specific patterns.

Late Fusion

Process each modality separately and combine at the end. Misses cross-modal interactions during encoding.

Cross-Attention Fusion

Text attends to image regions and image tokens attend to words, repeated across layers (a sketch follows below). Generally the strongest of the three fusion strategies; it is used by models such as LXMERT, Flamingo, and BLIP-2's Q-Former, while newer VLMs like LLaVA instead concatenate projected image tokens into the LLM input and rely on self-attention, as described above.
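
A minimal sketch of a single cross-attention step in PyTorch, with illustrative shapes: text hidden states act as queries over image patch features (the image-to-text direction is symmetric).

import torch
import torch.nn as nn

d_model = 768
text = torch.randn(1, 32, d_model)    # 32 text token states
image = torch.randn(1, 196, d_model)  # 196 image patch features

# Text attends to image regions: queries from text, keys/values from the image
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)
fused_text, attn_weights = cross_attn(query=text, key=image, value=image)

print(fused_text.shape)    # (1, 32, 768)
print(attn_weights.shape)  # (1, 32, 196): which patches each word attended to
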
3. Key Models

The models you should know for VQA in 2024-2025.

BLIP-2
Salesforce | 3B-13B params | Open Source (BSD-3)
Strengths:
  • Efficient Q-Former bridge
  • Works with any LLM
  • Good zero-shot performance
Weaknesses:
  • Less strong on complex reasoning
  • Frozen vision encoder

LLaVA
Microsoft/UW | 7B-34B params | Open Source (Apache 2.0)
Strengths:
  • Visual instruction following
  • Open weights
  • Active community
Weaknesses:
  • Requires fine-tuning for best results
  • Single image only

Qwen2-VL
Alibaba | 2B-72B params | Open Source (Apache 2.0)
Strengths:
  • Dynamic resolution
  • Video support
  • Strong OCR
Weaknesses:
  • Requires significant VRAM
  • Complex inference setup

GPT-4V
OpenAI | Unknown params | API (Proprietary)
Strengths:
  • Best reasoning
  • Handles complex questions
  • Multi-image support
Weaknesses:
  • Expensive
  • Rate limited
  • No local deployment

Gemini Pro Vision
Google | Unknown params | API (Proprietary)
Strengths:
  • Long context
  • Video understanding
  • Fast inference
Weaknesses:
  • API only
  • Variable availability

Claude 3.5 Sonnet
Anthropic | Unknown params | API (Proprietary)
Strengths:
  • Strong reasoning
  • Good at charts/diagrams
  • Reliable
Weaknesses:
  • API only
  • Image limits per request

Best Open Source: Qwen2-VL-72B (state-of-the-art open weights)
Best API: GPT-4o (best reasoning, most reliable)
Best for Local: LLaVA 1.6 7B (runs on a consumer GPU)

4. Benchmarks

Standard datasets for evaluating VQA models.

Dataset | Focus | Size | Metric | SOTA
VQAv2 | General VQA | 1.1M QA pairs | Accuracy | 86.1% (Gemini)
OK-VQA | Knowledge VQA | 14K questions | Accuracy | 66.1% (PaLI-X)
TextVQA | Scene Text | 45K questions | Accuracy | 77.6% (GPT-4V)
GQA | Compositional | 22M questions | Accuracy | 72.1% (PaLI)
VizWiz | Accessibility | 31K questions | Accuracy | 73.2% (GPT-4V)
DocVQA | Document | 50K questions | ANLS | 93.4% (GPT-4V)
VQAv2

The standard benchmark. 1.1 million questions about COCO images. Balanced so guessing from question alone performs poorly. Accuracy metric with consensus from 10 human annotators.
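
The headline numbers above come from VQA's soft accuracy rule rather than exact matching. A minimal sketch of the commonly used simplified form (the official evaluator also normalizes answers and averages over every subset of 9 of the 10 annotators):

def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQAv2 accuracy: fully correct if at least 3 of the
    10 annotators gave the predicted answer, partial credit otherwise."""
    pred = prediction.strip().lower()
    matches = sum(ans.strip().lower() == pred for ans in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators answered "red", so "red" scores 1.0
print(vqa_accuracy("red", ["red"] * 4 + ["dark red"] * 3 + ["maroon"] * 3))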

OK-VQA

"Outside Knowledge" VQA. Questions require knowledge not in the image. Example: "What vitamin is this fruit high in?" Tests knowledge retrieval.

TextVQA

Reading text in images. Street signs, product labels, documents. Requires OCR capability integrated with reasoning.

GQA

22M compositional questions from scene graphs. Tests reasoning: "Is the cat on the table to the left of the lamp?"

5. VQA in Action

See how VQA works on real images with different question types.

[Image: VQA example — a plate of pasta]
Q: What type of food is this?
A: This appears to be a plate of pasta with meat sauce and fresh basil leaves.
Q: Is this a healthy meal?
A: It contains carbohydrates from the pasta and protein from the meat; moderately healthy depending on portion size.
Q: What cuisine is this?
A: Italian cuisine; it looks like spaghetti bolognese or a similar pasta dish.

6. Code Examples

Get started with VQA using different frameworks and models.

BLIP-2 (Open Source)
pip install transformers accelerate
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image
import torch

# Load BLIP-2 model
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load image (convert to RGB to handle PNG or grayscale inputs)
image = Image.open("photo.jpg").convert("RGB")

# Ask a question; BLIP-2 OPT checkpoints respond best to this prompt format
question = "What is happening in this image?"
prompt = f"Question: {question} Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=100)
answer = processor.decode(output[0], skip_special_tokens=True).strip()

print(f"Q: {question}")
print(f"A: {answer}")

Quick Reference

For Production
  • GPT-4o for best quality
  • Qwen2-VL for self-hosted
  • LLaVA for local deployment

Key Benchmarks
  • VQAv2 (general)
  • TextVQA (OCR + QA)
  • OK-VQA (knowledge)

Common Pitfalls
  • Image resolution matters
  • Question phrasing affects accuracy
  • Hallucinations on OCR tasks

Use Cases

  • Accessibility for blind users
  • Image-based search
  • Visual reasoning
  • Educational tools
  • Customer support with images

Architectural Patterns

Vision-Language Models

End-to-end models trained on image-question-answer triplets.

Pros:
  • State-of-the-art accuracy
  • Handles complex reasoning
Cons:
  • Large models
  • Expensive inference

Vision Encoder + LLM

Encode image, feed features to LLM decoder.

Pros:
  • Leverages LLM capabilities
  • Flexible
Cons:
  • Two-stage pipeline
  • May lose visual details

Object Detection + QA

Detect objects first, then reason over the detections (a sketch follows this list).

Pros:
  • Interpretable
  • Good for counting
Cons:
  • Limited by detector
  • Complex pipeline
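
A minimal sketch of this detect-then-reason pattern, using DETR from Hugging Face transformers to answer a counting question; the file path, confidence threshold, and question are placeholders.

from transformers import DetrImageProcessor, DetrForObjectDetection
from PIL import Image
import torch

# Off-the-shelf COCO detector
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep confident detections and map label ids back to class names
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9
)[0]
labels = [model.config.id2label[i.item()] for i in detections["labels"]]

# "How many people are there?" reduces to counting one detected class
print("Q: How many people are there?")
print(f"A: {labels.count('person')}")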

Implementations

API Services

GPT-4V

OpenAI
API

Best overall VQA. Handles complex reasoning well.
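
A sketch of the API route using the official openai Python client (pip install openai). The model name (gpt-4o, per the quick reference above), file path, and question are placeholders, and OPENAI_API_KEY must be set in the environment.

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Send a local image as a base64 data URL
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
    max_tokens=200,
)
print(response.choices[0].message.content)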

Claude 3.5 Sonnet

Anthropic
API

Excellent for detailed image analysis.

Open Source

LLaVA

Apache 2.0
Open Source

Best open-source VLM. LLaVA-1.6 is strong.

Qwen-VL

Apache 2.0
Open Source

Alibaba's VLM. Excellent multilingual support.

InternVL2

MIT
Open Source

Top open-source on many VQA benchmarks.

Quick Facts

Input: Image
Output: Text
Implementations: 3 open source, 2 API
Patterns: 3 approaches
