The State of Multimodal AI
What Vision Language Models Can Actually Do in 2026
VLMs went from research curiosity to production infrastructure in under two years. This guide cuts through the hype: real benchmarks, honest failure modes, and what matters for choosing a model today.
Last updated March 2026. Benchmark scores from published papers and official leaderboards.
Multimodal is not just image + text
The term “multimodal” has expanded far beyond feeding an image to a language model. In 2026, the frontier looks like this:
Vision + Language
Mature. Single image understanding, OCR, chart reading, visual QA. The most developed modality; near-human accuracy on documents.
Multi-Image Reasoning
Maturing. Comparing multiple images, finding differences, understanding image sequences. Models handle 5-10 images well, degrade beyond that.
Video Understanding
Active research. Temporal reasoning, event detection, long-form video comprehension. Gemini leads. Most models still sample frames rather than process motion.
Audio + Vision
Emerging. Joint audio-visual reasoning. Gemini 2.5 Pro processes video with audio natively. Others require separate ASR pipelines.
Interleaved Documents
Maturing. Processing PDFs with mixed text, tables, figures, and charts. Critical for enterprise. Qwen2.5-VL and Claude Opus 4 lead here.
Spatial / 3D
Early. Understanding depth, 3D layouts, and physical spatial relationships from 2D images. All models struggle. Active research frontier.
The benchmark landscape
Six benchmarks define how we measure VLM capability today. Each tests something different. No single number tells the whole story.
MMMU (Massive Multi-discipline Multimodal Understanding)
College-level reasoning across 30 subjects requiring both image understanding and domain knowledge
Examples: Art history analysis, circuit diagram solving, medical image interpretation
Status: Far from saturated. Human expert: ~88.6%. Best model: ~74.8%.
MathVista (Mathematical reasoning in Visual contexts)
Math problem solving from charts, geometry diagrams, scientific figures, and word problems with visual elements
Examples: Reading bar chart values and computing percentages, solving geometry from diagrams
Status: Active. Human: ~60% (surprisingly low). Best models already exceed this reported human baseline.
RealWorldQA (Real World Question Answering)
Practical visual understanding from real-world photos — spatial reasoning, navigation, everyday comprehension
Examples: Reading street signs, estimating distances, understanding physical layouts
Status: Active. Tests practical intelligence that benchmarks often miss.
ChartQA (Chart Question Answering)
Extracting data and answering questions about charts and plots
Examples: Finding max values in bar charts, computing trends from line graphs, reading pie chart segments
Status: Approaching saturation. Best models at ~88-89%. Human: ~92%.
DocVQA (Document Visual Question Answering)
Extracting information from scanned documents, forms, receipts, and reports
Examples: Reading values from invoices, finding dates in contracts, parsing table entries
Status: Near saturation. Open-source models (Qwen2.5-VL) hit 96.4%. Human: ~98%.
Video-MME (Video Multi-Modal Evaluation)
Understanding video content across short (< 2min), medium (4-15min), and long (30-60min) clips
Examples: Summarizing events, temporal ordering, cause-effect reasoning across scenes
Status: Very active. Best model: ~75.2% (Gemini). Huge room for improvement.
Model comparison: the numbers
Scores are accuracy percentages on standard evaluation splits. Higher is better. Scores sourced from published papers, official model cards, and the OpenCompass leaderboard.
| Model | MMMU | MathVista | RealWorldQA | ChartQA | DocVQA | Video-MME | Type |
|---|---|---|---|---|---|---|---|
| GPT-5 Vision (OpenAI, Jan 2026) | 74.8 | 67.2 | 72.4 | 88.6 | 95.1 | 68.3 | API |
| Claude Opus 4 (Anthropic, Mar 2026) | 72.1 | 65.8 | 74.6 | 86.9 | 94.8 | 64.7 | API |
| Gemini 2.5 Pro (Google, Mar 2025) | 72.7 | 63.9 | 70.8 | 88.2 | 93.4 | 75.2 | API |
| Qwen2.5-VL-72B (Alibaba, Jan 2025) | 70.2 | 61.4 | 68.7 | 86.1 | 96.4 | 61.8 | Open Source |
| InternVL2.5-78B (Shanghai AI Lab, Dec 2024) | 70.1 | 62.8 | 67.5 | 85.4 | 94.9 | 60.2 | Open Source |
| LLaVA-OneVision-72B (LLaVA Team / ByteDance, Aug 2024) | 62.4 | 57.6 | 64.2 | 80.0 | 91.3 | 58.4 | Open Source |
Strengths and weaknesses
GPT-5 Vision
API

Strengths
- Most consistent performer across all modalities
- Strong spatial reasoning and counting
- Excellent chart and diagram interpretation
- Native tool use with vision input
Weaknesses
- Expensive at scale ($2.50/M input tokens for images)
- Occasional hallucination on fine-grained text in images
- Video limited to sampled frames, not true temporal modeling
Claude Opus 4
API

Strengths
- Best-in-class instruction following with visual context
- Precise bounding box and region understanding
- Strongest refusal of misleading visual prompts
- Multi-page document reasoning with 200K context
Weaknesses
- Highest cost per query ($15/M input tokens)
- Slower inference than competitors
- Video understanding behind GPT-5 and Gemini
Gemini 2.5 Pro
API

Strengths
- Best video understanding — processes up to 1 hour natively
- Interleaved audio + video + text in a single query
- 1M token context window for long documents
- Competitive pricing ($1.25/M input)
Weaknesses
- Spatial reasoning slightly behind GPT-5
- Inconsistent on complex table extraction
- Occasional refusal on benign medical/scientific images
Qwen2.5-VL-72B
Open Source

Strengths
- Highest DocVQA score of any model (96.4%)
- Open source (Apache 2.0) — full data privacy
- HTML-based document parsing with bounding boxes
- Efficient 7B variant rivals GPT-4o-mini
Weaknesses
- Requires A100/H100 GPUs for 72B inference
- Lower MMMU than API models
- Video understanding significantly behind Gemini
InternVL2.5-78B
Open Source

Strengths
- Strong all-around open-source VLM
- Dynamic resolution — adapts tile count to image complexity
- Multilingual vision-language support (8+ languages)
- Active community and rapid iteration
Weaknesses
- Large model footprint (78B parameters)
- Slightly behind Qwen2.5-VL on documents
- Training data transparency concerns
LLaVA-OneVision-72B
Open Source

Strengths
- Unified architecture: single-image, multi-image, and video from one checkpoint
- Efficient training recipe — strong results from academic-scale compute
- Well-documented and easy to fine-tune
Weaknesses
- Benchmark scores trail frontier models by 8-12 points
- Older architecture showing its age in 2026
- Less competitive on document understanding
Surprising capabilities most people miss
Beyond the standard benchmarks, VLMs have developed capabilities that are genuinely useful but under-discussed.
UI understanding and test generation
Feed a screenshot of your app, get accessibility issues, layout bugs, and Playwright test code. Claude Opus 4 and GPT-5 can identify interactive elements with >90% accuracy.
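As a concrete sketch, a screenshot review request can be assembled as an OpenAI-style chat payload with the image inlined as a base64 data URL. The model name, prompt wording, and helper function here are illustrative assumptions, not an official API contract:

```python
import base64


def build_screenshot_review_request(png_bytes: bytes,
                                    model: str = "gpt-5-vision") -> dict:
    """Assemble an OpenAI-style chat payload asking a VLM to review a UI
    screenshot. Model name and prompt are illustrative, not official."""
    image_b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": ("Review this UI screenshot. List accessibility "
                              "issues and layout bugs, and generate Playwright "
                              "tests for the interactive elements you find.")},
                    # Inline the screenshot as a base64 data URL.
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }
        ],
    }
```

The same payload shape works for the before/after and degraded-document use cases below: add a second `image_url` part for comparisons, or swap the prompt for an extraction instruction.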
Handwriting recognition across languages
Qwen2.5-VL reads handwritten Chinese, Japanese, Arabic, and Latin scripts with accuracy approaching specialized HTR systems. No fine-tuning needed.
Scientific figure interpretation
Models can extract data points from scatter plots, read error bars, and compare trends across subplots. ChartQA scores understate this — real-world performance on clean scientific figures is higher.
Visual code understanding
Screenshot of code with syntax highlighting? Models parse it accurately. Whiteboard architecture diagrams? They can generate working code stubs from photos.
Before/after comparison
Multi-image models excel at spotting differences between similar images — useful for quality control, medical imaging comparison, and construction progress tracking.
Reading degraded documents
Faded receipts, water-damaged records, partially torn pages. VLMs often outperform traditional OCR on damaged documents because they use contextual reasoning to fill gaps.
Known failure modes
Every VLM fails in predictable ways. Knowing these patterns saves you from shipping broken features. Severity reflects how often the failure causes real production issues.
Counting objects
High severity. Models consistently miscount objects in cluttered scenes. Ask "how many red cars?" in a parking lot photo and expect errors once the count exceeds ~7.
Spatial relationships
High severity. Left/right, above/below, and relative position questions remain unreliable. Models may say object A is left of B when it is clearly to the right.
Fine-grained text in images
Medium severity. Small text, watermarks, and angled text get misread or hallucinated. License plates, serial numbers, and distant signage are particularly problematic.
Temporal reasoning in video
High severity. Most VLMs sample frames rather than processing true video. They miss motion, speed, and cause-effect that happens between sampled frames.
Multi-step visual reasoning
Medium severity. Chains of visual inference (A implies B, B implies C from the image) degrade rapidly. Accuracy drops ~15% per reasoning step.
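A rough way to see why this compounds: if each visual inference step is modeled as independently succeeding with probability p (an idealizing assumption on our part), end-to-end accuracy decays geometrically:

```python
def chained_accuracy(per_step_accuracy: float, steps: int) -> float:
    """Expected end-to-end accuracy when each reasoning step independently
    succeeds with the given probability (idealized model of the
    ~15%-per-step degradation described above)."""
    return per_step_accuracy ** steps


# With 85% per-step accuracy, a three-step chain lands around 61%.
```

So even a model that is reliable on single-hop questions becomes a coin flip after four or five chained visual inferences.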
Negation and absence
Medium severity. Asking "is there NOT a dog in this image?" or "what is missing from this scene?" triggers frequent errors. Models are biased toward confirming presence.
Hallucinated OCR
High severity. When text is partially visible or blurry, models confidently fabricate plausible-looking text rather than admitting uncertainty.
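One practical mitigation is to prompt for an explicit uncertainty marker and then reject transcriptions that contain it or fail a format check. The `[UNREADABLE]` convention and the serial-number pattern below are our own illustrative choices, not a model feature:

```python
import re

# Illustrative prompt: asks the model to flag uncertainty instead of guessing.
OCR_PROMPT = (
    "Transcribe all text visible in this image. If any character is blurry, "
    "cut off, or otherwise uncertain, write [UNREADABLE] in its place rather "
    "than guessing."
)


def validate_serial(transcribed: str,
                    pattern: str = r"[A-Z]{2}\d{6}") -> bool:
    """Reject transcriptions that contain the uncertainty marker or do not
    match the expected format (hypothetical serial-number pattern)."""
    if "[UNREADABLE]" in transcribed:
        return False
    return re.fullmatch(pattern, transcribed) is not None
```

Downstream validation like this catches most fabricated serials and license plates, because hallucinated text rarely survives a strict format check.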
Video understanding: where things stand
Video is the hardest modality. The gap between “answering questions about a video” and “truly understanding temporal dynamics” remains wide.
Gemini 2.5 Pro (Video-MME)
Best-in-class. Processes up to 1 hour of video natively with audio. The only model that does not rely purely on frame sampling.
GPT-5 Vision (Video-MME)
Solid but samples frames. Works well for surveillance review and content moderation. Struggles with fast-action sports and temporal ordering.
Open-source models (Video-MME)
LLaVA-OneVision, InternVL2.5, and Qwen2.5-VL all cluster in the 58-62 range. Usable for short clips. Long-video comprehension remains a significant gap vs. Gemini.
What works vs. what doesn't in video
Works reliably
- Scene description and summarization
- Object identification across frames
- Text/subtitle extraction from video
- Action recognition (walking, running, cooking)
- Content moderation and safety screening
Still unreliable
- Precise temporal ordering of events
- Counting occurrences of repeated actions
- Understanding cause-and-effect chains
- Speed and motion estimation
- Multi-person interaction tracking
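The frame-sampling strategy behind most of these failures can be sketched in a few lines. Pipelines typically pick a fixed budget of uniformly spaced frames (segment midpoints here, one common variant); anything that happens between samples is simply invisible to the model:

```python
def sample_timestamps(duration_s: float, n_frames: int) -> list[float]:
    """Pick n_frames evenly spaced timestamps (segment midpoints) from a
    video of the given duration — a sketch of the uniform frame sampling
    most VLMs use in place of true temporal modeling."""
    segment = duration_s / n_frames
    return [segment * (i + 0.5) for i in range(n_frames)]
```

For a one-hour video sampled at 16 frames, consecutive samples are 3.75 minutes apart, which is why repeated-action counting and fast motion are unreliable.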
Generation vs. understanding: two different worlds
Text-to-image generation (DALL-E 3, Midjourney v7, Stable Diffusion 3.5, Flux) and vision-language understanding (the models on this page) are fundamentally different capabilities, despite both being called “multimodal.”
Image Generation
- Converts text descriptions into pixel data
- Diffusion-based architectures dominate
- Evaluated by FID, aesthetic scores, human preference
- Rapid commoditization in 2025-2026
- Text rendering in images still imperfect
Vision Understanding (VLMs)
- Converts visual data into language and reasoning
- Transformer-based with vision encoders
- Evaluated by MMMU, DocVQA, ChartQA, etc.
- Still improving rapidly on hard reasoning tasks
- The models covered in this guide
The convergence trend: Some models are merging both capabilities. GPT-5 can generate and understand images. Gemini 2.5 Pro generates images natively. But the underlying architectures remain different, and specialization still wins on benchmarks. Do not assume a great image generator is a great image understander, or vice versa.
How to choose
No single model is best at everything. Match your use case to model strengths.
Document extraction at scale
Qwen2.5-VL-72B (self-hosted) or Gemini 2.5 Pro (API)
Highest DocVQA scores. Qwen for privacy/cost. Gemini for ease.
Complex reasoning over images
GPT-5 Vision or Claude Opus 4
Best MMMU scores. Claude for instruction adherence. GPT-5 for breadth.
Video analysis and surveillance
Gemini 2.5 Pro
Only model with native long-video processing. Roughly a 7-point Video-MME lead.
Budget-friendly prototyping
Gemini 2.5 Pro or Qwen2.5-VL-7B
Gemini at $1.25/M input. Qwen 7B runs on a single consumer GPU.
Privacy-critical / air-gapped
Qwen2.5-VL or InternVL2.5
Open source. Deploy anywhere. No data leaves your infrastructure.
Multi-page document comprehension
Claude Opus 4 or Gemini 2.5 Pro
200K and 1M context respectively. Both handle interleaved images well.
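The recommendations above condense into a small routing table. A minimal sketch (first-choice models as of March 2026; revisit as leaderboards move):

```python
# Routing table distilled from the use-case recommendations in this guide.
ROUTES = {
    "document_extraction": ["Qwen2.5-VL-72B", "Gemini 2.5 Pro"],
    "complex_reasoning":   ["GPT-5 Vision", "Claude Opus 4"],
    "video_analysis":      ["Gemini 2.5 Pro"],
    "budget_prototyping":  ["Gemini 2.5 Pro", "Qwen2.5-VL-7B"],
    "air_gapped":          ["Qwen2.5-VL-72B", "InternVL2.5-78B"],
    "long_documents":      ["Claude Opus 4", "Gemini 2.5 Pro"],
}


def pick_model(use_case: str) -> str:
    """Return the first-choice model for a use case; raises KeyError for
    use cases this guide does not cover."""
    return ROUTES[use_case][0]
```

In production you would layer fallbacks, cost caps, and privacy constraints on top, but a table like this is a reasonable starting point.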
Track these benchmarks live
This guide is a snapshot. Models improve monthly. CodeSOTA tracks benchmark results in real time as new papers and evaluations are published.
Benchmark scores are sourced from published papers, official model cards, and the OpenCompass multimodal leaderboard. Scores may vary slightly depending on evaluation settings (prompt format, sampling temperature, image resolution). Where multiple evaluation variants exist, we report the most commonly cited configuration. Last verified March 2026.