Multimodal

Combining vision and language? Evaluate image captioning, visual QA, text-to-image generation, and cross-modal retrieval models.

10 tasks · 23 datasets · 37 results

Multimodal AI in 2025 has moved from research demos to production-ready systems. The gap between proprietary and open-source models has narrowed dramatically, with practical choices now spanning from edge devices to frontier reasoning.

State of the Field (2025)

  • Gemini 3 Pro leads proprietary models with breakthrough reasoning scores on Humanity's Last Exam, while Gemini 3 Flash matches previous-gen Pro performance at lower cost and latency
  • Open-source models have achieved near-parity: InternVL3-78B hits 72.2% on MMMU, Molmo 2 leads in video understanding and grounding tasks, Qwen 2.5 VL handles 29 languages and 1-hour videos
  • Hallucination remains the critical deployment blocker. Models confidently describe non-existent objects, and grounding objectives surprisingly don't fix this in open-ended generation
  • Spatial reasoning and 3D understanding lag behind: even frontier models struggle with orientation tasks (56% vs 95.7% human), limiting robotics and embodied AI applications

Quick Recommendations

General-purpose multimodal reasoning (production API)

Gemini 3 Flash

Matches Gemini 2.5 Pro performance at lower cost and latency. Best efficiency-capability tradeoff for API usage.

Open-source general multimodal

InternVL3-78B

72.2% MMMU, state-of-the-art among open models. Reasonable compute requirements for on-prem deployment.

Video understanding and tracking

Molmo 2

Leading open-weight model for video QA, dense captioning, and multi-object tracking. Its 9M video training examples show in the results.

Document understanding and OCR

Llama 3.2 Vision 90B

73.6% VQAv2, 70.7% DocVQA. Meta's focus on document tasks delivers practical results for enterprise.

Edge deployment (resource-constrained)

Qwen 2.5 VL-7B

Strong performance in 7B parameters. Handles variable resolution, 29 languages, deployable on modest hardware.

Scientific and technical diagrams

DeepSeek-VL

MoE architecture optimized for technical reasoning. Better than generalist models on specialized scientific content.

Multi-image reasoning

Pixtral (Mistral AI)

Native multi-image processing, strong instruction-following. Architectural modularity aids practical deployment.

Long-context document reasoning

MACT framework on top of base model

Multi-agent collaboration outperforms monolithic scaling. Decomposes the task into planning, execution, and judgment agents.
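The plan/execute/judge decomposition can be sketched as a simple loop. This is a hedged sketch, not MACT's actual interfaces: the agent callables below are toy deterministic stand-ins, and in a real deployment each would wrap a call to the base multimodal model.

```python
from typing import Callable, List, Optional

def mact_answer(question: str,
                planner: Callable[[str], List[str]],
                executor: Callable[[str], str],
                judge: Callable[[str, str], bool],
                max_rounds: int = 3) -> Optional[str]:
    """Plan -> execute -> judge loop in the spirit of multi-agent
    decomposition. Each agent is an injected callable."""
    for round_idx in range(max_rounds):
        steps = planner(question)                    # decompose into sub-steps
        trace = [executor(step) for step in steps]   # run each step in order
        candidate = trace[-1]                        # last step yields the answer
        if judge(question, candidate):               # judgment agent accepts/rejects
            return candidate
        question = f"{question} (retry {round_idx + 1})"  # replan with feedback
    return None

# Toy deterministic agents for illustration (hypothetical, not MACT's):
planner = lambda q: ["find table", f"read cell for {q}"]
executor = lambda step: "42" if step.startswith("read") else "table found"
judge = lambda q, a: a.isdigit()
```

The point of the design is that the judge can reject a bad execution and trigger a replan, which a single monolithic forward pass cannot do.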

Hallucination-critical applications

Base model + MARINE framework

Training-free hallucination reduction via open-source vision model guidance. Works across diverse LVLMs.

Frontier reasoning (cost no object)

Gemini 3 Pro

Tops LMArena for vision tasks, breakthrough scores on reasoning benchmarks. Vendor support and reliability.

Tasks & Benchmarks

Visual Question Answering

Visual question answering (VQA) is the original multimodal reasoning task — given an image and a natural language question, produce the correct answer. VQAv2 (2017) defined the field, but modern benchmarks like GQA, OK-VQA, and TextVQA have pushed toward compositional reasoning, external knowledge, and OCR-dependent understanding. The task was largely "solved" in its classic form once multimodal LLMs arrived, with GPT-4V and Gemini saturating standard benchmarks, but adversarial and compositional variants still expose systematic failures in spatial reasoning and counting. VQA's legacy is establishing that vision-language models need more than pattern matching — they need genuine visual understanding.

6 datasets · 35 results · SOTA tracked
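The VQAv2 metric itself is worth knowing when reading these leaderboards: answers are scored softly against ten human annotations. Below is the commonly used simplified form; the official evaluation script additionally normalizes answers and averages over annotator subsets.

```python
def vqa_accuracy(predicted: str, human_answers: list) -> float:
    """Soft VQA accuracy: a prediction counts as fully correct if at least
    3 of the (typically 10) human annotators gave the same answer.
    Simplified form of the VQAv2 metric."""
    pred = predicted.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == pred)
    return min(matches / 3.0, 1.0)
```

So "2" with eight annotator matches scores 1.0, while "two" with a single match scores only 1/3, which is why answer normalization matters so much in practice.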

Image Captioning

Image captioning — generating natural language descriptions of images — was the task that launched the modern vision-language era when Show and Tell (2015) paired CNNs with RNNs. The field progressed through BLIP, BLIP-2, and CoCa, each improving grounding and descriptive richness, until multimodal LLMs effectively subsumed it as a special case of image-text-to-text. COCO Captions and NoCaps remain standard benchmarks, but CIDEr and SPICE scores have largely saturated — the real frontier is dense captioning, generating paragraph-level descriptions that capture spatial relationships, attributes, and background context that brief captions miss. Captioning's importance now lies more in its role as training signal for other vision-language tasks than as a standalone evaluation.

2 datasets · 2 results · SOTA tracked

Cross-Modal Retrieval

Cross-modal retrieval finds the best match between items in different modalities — given text, find the right image; given an image, find the right caption. CLIP (2021) revolutionized the field by learning a shared embedding space from 400M image-text pairs, spawning an entire ecosystem of models like SigLIP, EVA-CLIP, and OpenCLIP that power everything from search engines to generative model guidance. The challenge has shifted from coarse retrieval to fine-grained discrimination: telling apart nearly identical images based on subtle textual differences, or retrieving across underrepresented domains and languages. Recall@K on Flickr30K and COCO may look saturated, but real-world deployment exposes failures on long-tail queries and compositional descriptions.

1 dataset · 0 results
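The retrieval mechanics reduce to nearest-neighbor search in the shared space, and the Recall@K numbers above are computed exactly that way. A minimal sketch with CLIP-style L2-normalized embeddings; the synthetic arrays in the test stand in for real encoder outputs.

```python
import numpy as np

def recall_at_k(image_emb: np.ndarray, text_emb: np.ndarray, k: int) -> float:
    """Text-to-image Recall@K in a shared embedding space. Assumes row i of
    image_emb and text_emb form a matched pair; after L2 normalization the
    dot product equals cosine similarity."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = text_emb @ image_emb.T             # (n_texts, n_images) similarities
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of the k nearest images
    hits = (topk == np.arange(len(text_emb))[:, None]).any(axis=1)
    return float(hits.mean())
```

Image-to-text retrieval is the same computation with the arguments swapped.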

Image-Text-to-Image

Image-text-to-image covers instruction-guided image editing — taking a source image plus a text command and producing a modified result. InstructPix2Pix (2023) demonstrated this could work zero-shot, and subsequent models like DALL-E 3's inpainting, Ideogram, and Stable Diffusion's img2img pipelines made it practical. The core difficulty is surgical precision: users want "change the dress to red" without altering the face, background, or lighting, which requires disentangling content from style at a level current architectures still fumble. Benchmarks are fragmented across editing fidelity, instruction-following, and identity preservation, making unified comparison difficult.

2 datasets · 0 results

Any-to-Any

Any-to-any models are the endgame of multimodal AI — a single architecture that can accept and generate any combination of text, images, audio, and video. GPT-4o (2024) was the first production model to natively process and generate across modalities in real time, and Gemini 2.0 pushed this further with interleaved multimodal outputs. The technical challenge is enormous: unifying tokenization across modalities, preventing mode collapse where the model favors text over other outputs, and maintaining quality competitive with specialist models in each domain. Meta's Chameleon and open efforts like NExT-GPT explored this space, but true any-to-any generation at frontier quality remains the province of the largest labs.

1 dataset · 0 results

Image-Text-to-Video

Image-text-to-video is generative AI's hardest unsolved frontier — animating a still image according to a text prompt while maintaining temporal coherence and physical plausibility. Stable Video Diffusion (2023) and Runway Gen-2 showed early promise, Sora (2024) raised the bar dramatically with minute-long physically consistent clips, and Kling and Veo 2 pushed quality further. The fundamental challenge is that video generation requires implicit world models: objects must persist, lighting must evolve consistently, and motion must obey approximate physics across dozens of frames. Evaluation is still largely human-judged, with FVD and CLIP-temporal scores poorly correlating with perceived quality.

1 dataset · 0 results

Text-to-Image Generation

Text-to-image generation went from "interesting research" to cultural phenomenon in 18 months. DALL-E 2 (2022) proved diffusion models could produce photorealistic images from text, Stable Diffusion democratized it as open source, and Midjourney v5/v6 set the aesthetic bar that even non-technical users now expect. DALL-E 3 (2023) solved the prompt-following problem by training on highly descriptive captions, Flux pushed open-source quality to near-commercial levels, and Ideogram cracked reliable text rendering in images. The remaining frontiers are compositional generation (multiple objects with specified spatial relationships), consistent character identity across images, and the still-unsolved challenge of reliable hand and finger anatomy.

3 datasets · 0 results

Video Understanding

Video understanding asks models to reason over temporal sequences — answering questions, generating summaries, or detecting events across minutes or hours of footage. Early approaches like VideoBERT and TimeSformer processed short clips, but Gemini 1.5 Pro's million-token context (2024) enabled reasoning over hour-long videos natively, and GPT-4o brought real-time video comprehension. The core bottleneck remains temporal reasoning at scale: models can describe individual frames well but struggle to track causal chains, count repetitions, or understand temporal ordering across long sequences. Video-MME and EgoSchema are pushing evaluation beyond simple recognition toward genuine temporal understanding.

2 datasets · 0 results

Image-Text-to-Text

Image-text-to-text exploded from a research curiosity to the dominant AI interface in under two years. GPT-4V (2023) proved multimodal LLMs could reason over images, Gemini 1.5 scaled to million-token contexts mixing text and vision, and Claude 3 showed that careful RLHF produces models far less prone to hallucinating about image content. MMMU and MMBench have become the standard evaluation gauntlet, but the real challenge is grounding: models still confabulate spatial relationships and struggle with fine-grained visual reasoning. This is the task that turned chatbots into visual assistants.

3 datasets · 0 results

Audio-Text-to-Text

Audio-text-to-text is the backbone of voice assistants that actually understand context — models that jointly process speech and text to generate grounded responses. Whisper (2022) cracked robust transcription, but the real leap came when Gemini 1.5 and GPT-4o (2024) began reasoning natively over audio tokens alongside text, eliminating the lossy ASR-then-LLM pipeline. The key challenges are handling overlapping speakers, noisy environments, and preserving prosodic cues like sarcasm or hesitation that pure transcription destroys. Benchmarks like SUPERB and Dynamic-SUPERB are expanding, but real-world spoken dialogue understanding remains far ahead of what leaderboards capture.

2 datasets · 0 results

Visual Question Answering

  • GQA (2019)
  • MMBench (2023): 90.5 accuracy, Qwen2.5-VL 72B
  • MMMU (2024): 73.3 accuracy, InternVL3-78B
  • OK-VQA (2019)
  • TextVQA (2019): 85.5 accuracy, Qwen2.5-VL 72B
  • VQA v2.0 (2017): 87.6 accuracy, Qwen2-VL 72B

Image Captioning

  • COCO Captions (2015): 145.8 CIDEr, BLIP-2
  • NoCaps (2019)

Cross-Modal Retrieval

  • ViDoRe (2024)

Image-Text-to-Image

Any-to-Any

Image-Text-to-Video

Text-to-Image Generation

Video Understanding

Image-Text-to-Text

  • MMMU (2023)
  • MMStar (2024)

Audio-Text-to-Text

  • VoiceBench (2024)

Honest Takes

Open-source has caught up for most use cases

Unless you need absolute frontier reasoning, InternVL3-78B or Molmo 2 will serve you better than paying per-token for proprietary APIs. The performance gap has collapsed while deployment flexibility remains massive.

Video understanding is still the wild west

Despite claims, most models fail hard on videos over 15 minutes. If your use case involves long-form video, budget for custom fine-tuning. The benchmarks don't reflect real-world complexity.

Grounding doesn't fix hallucination

Research shows spatial grounding training has little to no effect on object hallucination in captions. You'll need explicit verification pipelines, not architectural fixes.
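In practice a verification pipeline can start as simply as cross-checking caption mentions against detector output. A toy sketch, assuming an external open-vocabulary detector supplies `detector_labels`; the noun vocabulary and word matching here are deliberately crude, and a real pipeline would use proper noun extraction and synonym matching.

```python
def verify_caption(caption: str, detector_labels: set) -> dict:
    """Flag caption-mentioned objects that the object detector did not find
    in the image. `detector_labels` is assumed to come from an external
    open-vocabulary detector."""
    words = {w.strip(".,").lower() for w in caption.split()}
    # Hypothetical closed vocabulary of checkable nouns for this sketch:
    vocabulary = {"dog", "cat", "frisbee", "car", "person", "tree"}
    mentioned = words & vocabulary
    unsupported = mentioned - detector_labels
    return {"mentioned": mentioned,
            "unsupported": unsupported,
            "flagged": bool(unsupported)}
```

Flagged captions can then be regenerated, trimmed, or routed to human review rather than trusted as-is.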

Scientific domains are still underserved

Gemini 2.5 Pro and o3 struggle on chemistry Olympiad problems. If you work with specialized diagrams (molecular structures, technical schematics), expect to build domain-specific solutions.

MoE is the new scaling paradigm

Mixture-of-experts architectures like DeepSeek-VL deliver comparable performance at a fraction of the compute. Dense models are increasingly a poor cost-performance choice.
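The compute saving comes from sparse routing: each token activates only its top-k experts, so with top_k=2 of 8 experts roughly a quarter of the expert parameters run per token. A minimal NumPy sketch of the idea (not DeepSeek-VL's actual layer):

```python
import numpy as np

def moe_forward(x, gate_w, expert_ws, top_k=2):
    """Sparse mixture-of-experts layer: route each token to its top-k
    experts and mix their outputs by renormalized gate weights.
    x: (n_tokens, d), gate_w: (d, n_experts), expert_ws: list of (d, d)."""
    logits = x @ gate_w                                   # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)             # softmax gate
    topk = np.argsort(-probs, axis=1)[:, :top_k]          # chosen experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        weights = probs[t, topk[t]]
        weights /= weights.sum()                          # renormalize over top-k
        for w, e in zip(weights, topk[t]):
            out[t] += w * (x[t] @ expert_ws[e])           # only top-k experts run
    return out
```

Dense models pay for every parameter on every token; here the unchosen experts contribute no FLOPs at all, which is the cost-performance argument in a nutshell.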
