The State of Multimodal AI
What Vision Language Models Can Actually Do in 2026
VLMs went from research curiosity to production infrastructure in under two years. This guide cuts through the hype: real benchmarks, honest failure modes, and what matters for choosing a model today.
Last updated March 2026. Benchmark scores from published papers and official leaderboards.
Multimodal is not just image + text
The term “multimodal” has expanded far beyond feeding an image to a language model. In 2026, the frontier looks like this:
Vision + Language
Mature. Single image understanding, OCR, chart reading, visual QA. The most developed modality; near-human accuracy on documents.
Multi-Image Reasoning
Maturing. Comparing multiple images, finding differences, understanding image sequences. Models handle 5-10 images well, degrade beyond that.
Video Understanding
Active research. Temporal reasoning, event detection, long-form video comprehension. Gemini leads. Most models still sample frames rather than process motion.
Audio + Vision
Emerging. Joint audio-visual reasoning. Gemini 2.5 Pro processes video with audio natively. Others require separate ASR pipelines.
Interleaved Documents
Maturing. Processing PDFs with mixed text, tables, figures, and charts. Critical for enterprise. Qwen2.5-VL and Claude Opus 4 lead here.
Spatial / 3D
Early. Understanding depth, 3D layouts, and physical spatial relationships from 2D images. All models struggle. Active research frontier.
The benchmark landscape
Six benchmarks define how we measure VLM capability today. Each tests something different. No single number tells the whole story.
MMMU (Massive Multi-discipline Multimodal Understanding)
College-level reasoning across 30 subjects requiring both image understanding and domain knowledge
Examples: Art history analysis, circuit diagram solving, medical image interpretation
Status: Far from saturated. Human expert: ~88.6%. Best model: ~74.8%.
MathVista (Mathematical reasoning in Visual contexts)
Math problem solving from charts, geometry diagrams, scientific figures, and word problems with visual elements
Examples: Reading bar chart values and computing percentages, solving geometry from diagrams
Status: Active. Human: ~60% (surprisingly low). Best models already exceed this reported human baseline.
RealWorldQA (Real World Question Answering)
Practical visual understanding from real-world photos — spatial reasoning, navigation, everyday comprehension
Examples: Reading street signs, estimating distances, understanding physical layouts
Status: Active. Tests practical intelligence that benchmarks often miss.
ChartQA (Chart Question Answering)
Extracting data and answering questions about charts and plots
Examples: Finding max values in bar charts, computing trends from line graphs, reading pie chart segments
Status: Approaching saturation. Best models at ~88-89%. Human: ~92%.
DocVQA (Document Visual Question Answering)
Extracting information from scanned documents, forms, receipts, and reports
Examples: Reading values from invoices, finding dates in contracts, parsing table entries
Status: Near saturation. Open-source models (Qwen2.5-VL) hit 96.4%. Human: ~98%.
Video-MME (Video Multi-Modal Evaluation)
Understanding video content across short (< 2min), medium (4-15min), and long (30-60min) clips
Examples: Summarizing events, temporal ordering, cause-effect reasoning across scenes
Status: Very active. Best model: ~75.2% (Gemini). Huge room for improvement.
Model comparison: the numbers
Scores are accuracy percentages on standard evaluation splits. Higher is better. Scores sourced from published papers, official model cards, and the OpenCompass leaderboard.
| Model | MMMU | MathVista | RealWorldQA | ChartQA | DocVQA | Video-MME | Type |
|---|---|---|---|---|---|---|---|
| GPT-5 Vision (OpenAI, Jan 2026) | 74.8 | 67.2 | 72.4 | 88.6 | 95.1 | 68.3 | API |
| Claude Opus 4 (Anthropic, Mar 2026) | 72.1 | 65.8 | 74.6 | 86.9 | 94.8 | 64.7 | API |
| Gemini 2.5 Pro (Google, Mar 2025) | 72.7 | 63.9 | 70.8 | 88.2 | 93.4 | 75.2 | API |
| Qwen2.5-VL-72B (Alibaba, Jan 2025) | 70.2 | 61.4 | 68.7 | 86.1 | 96.4 | 61.8 | Open Source |
| InternVL2.5-78B (Shanghai AI Lab, Dec 2024) | 70.1 | 62.8 | 67.5 | 85.4 | 94.9 | 60.2 | Open Source |
| LLaVA-OneVision-72B (LLaVA Team / ByteDance, Aug 2024) | 62.4 | 57.6 | 64.2 | 80.0 | 91.3 | 58.4 | Open Source |
Strengths and weaknesses
GPT-5 Vision
API

Strengths
- Most consistent performer across all modalities
- Strong spatial reasoning and counting
- Excellent chart and diagram interpretation
- Native tool use with vision input
Weaknesses
- Expensive at scale ($2.50/M input tokens for images)
- Occasional hallucination on fine-grained text in images
- Video limited to sampled frames, not true temporal modeling
Claude Opus 4
API

Strengths
- Best-in-class instruction following with visual context
- Precise bounding box and region understanding
- Strongest refusal of misleading visual prompts
- Multi-page document reasoning with 200K context
Weaknesses
- Highest cost per query ($15/M input tokens)
- Slower inference than competitors
- Video understanding behind GPT-5 and Gemini
Gemini 2.5 Pro
API

Strengths
- Best video understanding — processes up to 1 hour natively
- Interleaved audio + video + text in a single query
- 1M token context window for long documents
- Competitive pricing ($1.25/M input)
Weaknesses
- Spatial reasoning slightly behind GPT-5
- Inconsistent on complex table extraction
- Occasional refusal on benign medical/scientific images
Qwen2.5-VL-72B
Open Source

Strengths
- Highest DocVQA score of any model (96.4%)
- Open source (Apache 2.0) — full data privacy
- HTML-based document parsing with bounding boxes
- Efficient 7B variant rivals GPT-4o-mini
Weaknesses
- Requires A100/H100 GPUs for 72B inference
- Lower MMMU than API models
- Video understanding significantly behind Gemini
InternVL2.5-78B
Open Source

Strengths
- Strong all-around open-source VLM
- Dynamic resolution — adapts tile count to image complexity
- Multilingual vision-language support (8+ languages)
- Active community and rapid iteration
Weaknesses
- Large model footprint (78B parameters)
- Slightly behind Qwen2.5-VL on documents
- Training data transparency concerns
LLaVA-OneVision-72B
Open Source

Strengths
- Unified architecture: single-image, multi-image, and video from one checkpoint
- Efficient training recipe — strong results from academic-scale compute
- Well-documented and easy to fine-tune
Weaknesses
- Benchmark scores trail frontier models by 8-12 points
- Older architecture showing its age in 2026
- Less competitive on document understanding
Surprising capabilities most people miss
Beyond the standard benchmarks, VLMs have developed capabilities that are genuinely useful but under-discussed.
UI understanding and test generation
Feed a screenshot of your app, get accessibility issues, layout bugs, and Playwright test code. Claude Opus 4 and GPT-5 can identify interactive elements with >90% accuracy.
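As a concrete sketch, a screenshot review request can be assembled as an OpenAI-style chat payload with the image inlined as a base64 data URL. The model name, prompt wording, and helper function here are illustrative assumptions, not an official API contract:

```python
import base64


def build_screenshot_review_request(png_bytes: bytes,
                                    model: str = "gpt-5-vision") -> dict:
    """Assemble an OpenAI-style chat payload asking a VLM to review a UI
    screenshot. Model name and prompt are illustrative, not official."""
    image_b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": ("Review this UI screenshot. List accessibility "
                              "issues and layout bugs, and generate Playwright "
                              "tests for the interactive elements you find.")},
                    # Inline the screenshot as a base64 data URL.
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }
        ],
    }
```

The same payload shape works for the before/after and degraded-document use cases below: add a second `image_url` part for comparisons, or swap the prompt for an extraction instruction.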
Handwriting recognition across languages
Qwen2.5-VL reads handwritten Chinese, Japanese, Arabic, and Latin scripts with accuracy approaching specialized HTR systems. No fine-tuning needed.
Scientific figure interpretation
Models can extract data points from scatter plots, read error bars, and compare trends across subplots. ChartQA scores understate this — real-world performance on clean scientific figures is higher.
Visual code understanding
Screenshot of code with syntax highlighting? Models parse it accurately. Whiteboard architecture diagrams? They can generate working code stubs from photos.
Before/after comparison
Multi-image models excel at spotting differences between similar images — useful for quality control, medical imaging comparison, and construction progress tracking.
Reading degraded documents
Faded receipts, water-damaged records, partially torn pages. VLMs often outperform traditional OCR on damaged documents because they use contextual reasoning to fill gaps.
Known failure modes
Every VLM fails in predictable ways. Knowing these patterns saves you from shipping broken features. Severity reflects how often the failure causes real production issues.
Counting objects
High severity. Models consistently miscount objects in cluttered scenes. Ask "how many red cars?" in a parking lot photo and expect errors once the count exceeds ~7.
Spatial relationships
High severity. Left/right, above/below, and relative position questions remain unreliable. Models may say object A is left of B when it is clearly to the right.
Fine-grained text in images
Medium severity. Small text, watermarks, and angled text get misread or hallucinated. License plates, serial numbers, and distant signage are particularly problematic.
Temporal reasoning in video
High severity. Most VLMs sample frames rather than processing true video. They miss motion, speed, and cause-effect that happens between sampled frames.
Multi-step visual reasoning
Medium severity. Chains of visual inference (A implies B, B implies C from the image) degrade rapidly. Accuracy drops ~15% per reasoning step.
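A rough way to see why this compounds: if each visual inference step is modeled as independently succeeding with probability p (an idealizing assumption on our part), end-to-end accuracy decays geometrically:

```python
def chained_accuracy(per_step_accuracy: float, steps: int) -> float:
    """Expected end-to-end accuracy when each reasoning step independently
    succeeds with the given probability (idealized model of the
    ~15%-per-step degradation described above)."""
    return per_step_accuracy ** steps


# With 85% per-step accuracy, a three-step chain lands around 61%.
```

So even a model that is reliable on single-hop questions becomes a coin flip after four or five chained visual inferences.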
Negation and absence
Medium severity. Asking "is there NOT a dog in this image?" or "what is missing from this scene?" triggers frequent errors. Models are biased toward confirming presence.
Hallucinated OCR
High severity. When text is partially visible or blurry, models confidently fabricate plausible-looking text rather than admitting uncertainty.
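One practical mitigation is to prompt for an explicit uncertainty marker and then reject transcriptions that contain it or fail a format check. The `[UNREADABLE]` convention and the serial-number pattern below are our own illustrative choices, not a model feature:

```python
import re

# Illustrative prompt: asks the model to flag uncertainty instead of guessing.
OCR_PROMPT = (
    "Transcribe all text visible in this image. If any character is blurry, "
    "cut off, or otherwise uncertain, write [UNREADABLE] in its place rather "
    "than guessing."
)


def validate_serial(transcribed: str,
                    pattern: str = r"[A-Z]{2}\d{6}") -> bool:
    """Reject transcriptions that contain the uncertainty marker or do not
    match the expected format (hypothetical serial-number pattern)."""
    if "[UNREADABLE]" in transcribed:
        return False
    return re.fullmatch(pattern, transcribed) is not None
```

Downstream validation like this catches most fabricated serials and license plates, because hallucinated text rarely survives a strict format check.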
Video understanding: where things stand
Video is the hardest modality. The gap between “answering questions about a video” and “truly understanding temporal dynamics” remains wide.
Gemini 2.5 Pro (Video-MME)
Best-in-class. Processes up to 1 hour of video natively with audio. The only model that does not rely purely on frame sampling.
GPT-5 Vision (Video-MME)
Solid but samples frames. Works well for surveillance review and content moderation. Struggles with fast-action sports and temporal ordering.
Open-source models (Video-MME)
LLaVA-OneVision, InternVL2.5, and Qwen2.5-VL all cluster in the 58-62 range. Usable for short clips. Long-video comprehension remains a significant gap vs. Gemini.
What works vs. what doesn't in video
Works reliably
- Scene description and summarization
- Object identification across frames
- Text/subtitle extraction from video
- Action recognition (walking, running, cooking)
- Content moderation and safety screening
Still unreliable
- Precise temporal ordering of events
- Counting occurrences of repeated actions
- Understanding cause-and-effect chains
- Speed and motion estimation
- Multi-person interaction tracking
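The frame-sampling strategy behind most of these failures can be sketched in a few lines. Pipelines typically pick a fixed budget of uniformly spaced frames (segment midpoints here, one common variant); anything that happens between samples is simply invisible to the model:

```python
def sample_timestamps(duration_s: float, n_frames: int) -> list[float]:
    """Pick n_frames evenly spaced timestamps (segment midpoints) from a
    video of the given duration — a sketch of the uniform frame sampling
    most VLMs use in place of true temporal modeling."""
    segment = duration_s / n_frames
    return [segment * (i + 0.5) for i in range(n_frames)]
```

For a one-hour video sampled at 16 frames, consecutive samples are 3.75 minutes apart, which is why repeated-action counting and fast motion are unreliable.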
Generation vs. understanding: two different worlds
Text-to-image generation (DALL-E 3, Midjourney v7, Stable Diffusion 3.5, Flux) and vision-language understanding (the models on this page) are fundamentally different capabilities, despite both being called “multimodal.”
Image Generation
- Converts text descriptions into pixel data
- Diffusion-based architectures dominate
- Evaluated by FID, aesthetic scores, human preference
- Rapid commoditization in 2025-2026
- Text rendering in images still imperfect
Vision Understanding (VLMs)
- Converts visual data into language and reasoning
- Transformer-based with vision encoders
- Evaluated by MMMU, DocVQA, ChartQA, etc.
- Still improving rapidly on hard reasoning tasks
- The models covered in this guide
The convergence trend: Some models are merging both capabilities. GPT-5 can generate and understand images. Gemini 2.5 Pro generates images natively. But the underlying architectures remain different, and specialization still wins on benchmarks. Do not assume a great image generator is a great image understander, or vice versa.
How to choose
No single model is best at everything. Match your use case to model strengths.
Document extraction at scale
Qwen2.5-VL-72B (self-hosted) or Gemini 2.5 Pro (API)
Highest DocVQA scores. Qwen for privacy/cost. Gemini for ease.
Complex reasoning over images
GPT-5 Vision or Claude Opus 4
Best MMMU scores. Claude for instruction adherence. GPT-5 for breadth.
Video analysis and surveillance
Gemini 2.5 Pro
Only model with native long-video processing. Roughly a 7-point Video-MME lead.
Budget-friendly prototyping
Gemini 2.5 Pro or Qwen2.5-VL-7B
Gemini at $1.25/M input. Qwen 7B runs on a single consumer GPU.
Privacy-critical / air-gapped
Qwen2.5-VL or InternVL2.5
Open source. Deploy anywhere. No data leaves your infrastructure.
Multi-page document comprehension
Claude Opus 4 or Gemini 2.5 Pro
200K and 1M context respectively. Both handle interleaved images well.
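The recommendations above condense into a small routing table. A minimal sketch (first-choice models as of March 2026; revisit as leaderboards move):

```python
# Routing table distilled from the use-case recommendations in this guide.
ROUTES = {
    "document_extraction": ["Qwen2.5-VL-72B", "Gemini 2.5 Pro"],
    "complex_reasoning":   ["GPT-5 Vision", "Claude Opus 4"],
    "video_analysis":      ["Gemini 2.5 Pro"],
    "budget_prototyping":  ["Gemini 2.5 Pro", "Qwen2.5-VL-7B"],
    "air_gapped":          ["Qwen2.5-VL-72B", "InternVL2.5-78B"],
    "long_documents":      ["Claude Opus 4", "Gemini 2.5 Pro"],
}


def pick_model(use_case: str) -> str:
    """Return the first-choice model for a use case; raises KeyError for
    use cases this guide does not cover."""
    return ROUTES[use_case][0]
```

In production you would layer fallbacks, cost caps, and privacy constraints on top, but a table like this is a reasonable starting point.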
Track these benchmarks live
This guide is a snapshot. Models improve monthly. CodeSOTA tracks benchmark results in real time as new papers and evaluations are published.
Benchmark scores are sourced from published papers, official model cards, and the OpenCompass multimodal leaderboard. Scores may vary slightly depending on evaluation settings (prompt format, sampling temperature, image resolution). Where multiple evaluation variants exist, we report the most commonly cited configuration. Last verified March 2026.