Multimodal

Combining vision and language? Evaluate image captioning, visual QA, text-to-image generation, and cross-modal retrieval models.

10 tasks · 23 datasets · 37 results

Multimodal AI in 2025 has moved from research demos to production-ready systems. The gap between proprietary and open-source models has narrowed dramatically, with practical choices now spanning from edge devices to frontier reasoning.

State of the Field (2025)

  • Gemini 3 Pro leads proprietary models with breakthrough reasoning scores on Humanity's Last Exam, while Gemini 3 Flash matches previous-gen Pro performance at lower cost and latency
  • Open-source models have achieved near-parity: InternVL3-78B hits 72.2% on MMMU, Molmo 2 leads in video understanding and grounding tasks, Qwen 2.5 VL handles 29 languages and 1-hour videos
  • Hallucination remains the critical deployment blocker. Models confidently describe non-existent objects, and grounding objectives surprisingly don't fix this in open-ended generation
  • Spatial reasoning and 3D understanding lag behind: even frontier models struggle with orientation tasks (56% vs 95.7% human), limiting robotics and embodied AI applications

Quick Recommendations

General-purpose multimodal reasoning (production API)

Gemini 3 Flash

Matches Gemini 2.5 Pro performance at lower cost and latency. Best efficiency-capability tradeoff for API usage.

Open-source general multimodal

InternVL3-78B

72.2% MMMU, state-of-the-art among open models. Reasonable compute requirements for on-prem deployment.

Video understanding and tracking

Molmo 2

Leading open-weight model for video QA, dense captioning, and multi-object tracking. Its 9M video training examples show in the results.

Document understanding and OCR

Llama 3.2 Vision 90B

73.6% VQAv2, 70.7% DocVQA. Meta's focus on document tasks delivers practical results for enterprise.

Edge deployment (resource-constrained)

Qwen 2.5 VL-7B

Strong performance in 7B parameters. Handles variable resolution, 29 languages, deployable on modest hardware.

Scientific and technical diagrams

DeepSeek-VL

MoE architecture optimized for technical reasoning. Better than generalist models on specialized scientific content.

Multi-image reasoning

Pixtral (Mistral AI)

Native multi-image processing, strong instruction-following. Architectural modularity aids practical deployment.

Long-context document reasoning

MACT framework on top of base model

Multi-agent collaboration outperforms monolithic scaling. Decomposes the task into planning, execution, and judgment agents.
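The plan/execute/judge decomposition can be sketched as a simple loop. This is a hedged sketch, not MACT's actual interfaces: the agent callables below are toy deterministic stand-ins, and in a real deployment each would wrap a call to the base multimodal model.

```python
from typing import Callable, List, Optional

def mact_answer(question: str,
                planner: Callable[[str], List[str]],
                executor: Callable[[str], str],
                judge: Callable[[str, str], bool],
                max_rounds: int = 3) -> Optional[str]:
    """Plan -> execute -> judge loop in the spirit of multi-agent
    decomposition. Each agent is an injected callable."""
    for round_idx in range(max_rounds):
        steps = planner(question)                    # decompose into sub-steps
        trace = [executor(step) for step in steps]   # run each step in order
        candidate = trace[-1]                        # last step yields the answer
        if judge(question, candidate):               # judgment agent accepts/rejects
            return candidate
        question = f"{question} (retry {round_idx + 1})"  # replan with feedback
    return None

# Toy deterministic agents for illustration (hypothetical, not MACT's):
planner = lambda q: ["find table", f"read cell for {q}"]
executor = lambda step: "42" if step.startswith("read") else "table found"
judge = lambda q, a: a.isdigit()
```

The point of the design is that the judge can reject a bad execution and trigger a replan, which a single monolithic forward pass cannot do.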

Hallucination-critical applications

Base model + MARINE framework

Training-free hallucination reduction via open-source vision model guidance. Works across diverse LVLMs.

Frontier reasoning (cost no object)

Gemini 3 Pro

Tops LMArena for vision tasks, breakthrough scores on reasoning benchmarks. Vendor support and reliability.

Tasks & Benchmarks

Visual Question Answering

Visual question answering (VQA) is the original multimodal reasoning task — given an image and a natural language question, produce the correct answer. VQAv2 (2017) defined the field, but modern benchmarks like GQA, OK-VQA, and TextVQA have pushed toward compositional reasoning, external knowledge, and OCR-dependent understanding. The task was largely "solved" in its classic form once multimodal LLMs arrived, with GPT-4V and Gemini saturating standard benchmarks, but adversarial and compositional variants still expose systematic failures in spatial reasoning and counting. VQA's legacy is establishing that vision-language models need more than pattern matching — they need genuine visual understanding.

6 datasets · 35 results · SOTA tracked
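The VQAv2 metric itself is worth knowing when reading these leaderboards: answers are scored softly against ten human annotations. Below is the commonly used simplified form; the official evaluation script additionally normalizes answers and averages over annotator subsets.

```python
def vqa_accuracy(predicted: str, human_answers: list) -> float:
    """Soft VQA accuracy: a prediction counts as fully correct if at least
    3 of the (typically 10) human annotators gave the same answer.
    Simplified form of the VQAv2 metric."""
    pred = predicted.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == pred)
    return min(matches / 3.0, 1.0)
```

So "2" with eight annotator matches scores 1.0, while "two" with a single match scores only 1/3, which is why answer normalization matters so much in practice.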

Image Captioning

Image captioning — generating natural language descriptions of images — was the task that launched the modern vision-language era when Show and Tell (2015) paired CNNs with RNNs. The field progressed through BLIP, BLIP-2, and CoCa, each improving grounding and descriptive richness, until multimodal LLMs effectively subsumed it as a special case of image-text-to-text. COCO Captions and NoCaps remain standard benchmarks, but CIDEr and SPICE scores have largely saturated — the real frontier is dense captioning, generating paragraph-level descriptions that capture spatial relationships, attributes, and background context that brief captions miss. Captioning's importance now lies more in its role as training signal for other vision-language tasks than as a standalone evaluation.

2 datasets · 2 results · SOTA tracked

Cross-Modal Retrieval

Cross-modal retrieval finds the best match between items in different modalities — given text, find the right image; given an image, find the right caption. CLIP (2021) revolutionized the field by learning a shared embedding space from 400M image-text pairs, spawning an entire ecosystem of models like SigLIP, EVA-CLIP, and OpenCLIP that power everything from search engines to generative model guidance. The challenge has shifted from coarse retrieval to fine-grained discrimination: telling apart nearly identical images based on subtle textual differences, or retrieving across underrepresented domains and languages. Recall@K on Flickr30K and COCO may look saturated, but real-world deployment exposes failures on long-tail queries and compositional descriptions.

1 dataset · 0 results
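The retrieval mechanics reduce to nearest-neighbor search in the shared space, and the Recall@K numbers above are computed exactly that way. A minimal sketch with CLIP-style L2-normalized embeddings; the synthetic arrays in the test stand in for real encoder outputs.

```python
import numpy as np

def recall_at_k(image_emb: np.ndarray, text_emb: np.ndarray, k: int) -> float:
    """Text-to-image Recall@K in a shared embedding space. Assumes row i of
    image_emb and text_emb form a matched pair; after L2 normalization the
    dot product equals cosine similarity."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = text_emb @ image_emb.T             # (n_texts, n_images) similarities
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of the k nearest images
    hits = (topk == np.arange(len(text_emb))[:, None]).any(axis=1)
    return float(hits.mean())
```

Image-to-text retrieval is the same computation with the arguments swapped.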

Image-Text-to-Image

Image-text-to-image covers instruction-guided image editing — taking a source image plus a text command and producing a modified result. InstructPix2Pix (2023) demonstrated this could work zero-shot, and subsequent models like DALL-E 3's inpainting, Ideogram, and Stable Diffusion's img2img pipelines made it practical. The core difficulty is surgical precision: users want "change the dress to red" without altering the face, background, or lighting, which requires disentangling content from style at a level current architectures still fumble. Benchmarks are fragmented across editing fidelity, instruction-following, and identity preservation, making unified comparison difficult.

2 datasets · 0 results

Any-to-Any

Any-to-any models are the endgame of multimodal AI — a single architecture that can accept and generate any combination of text, images, audio, and video. GPT-4o (2024) was the first production model to natively process and generate across modalities in real time, and Gemini 2.0 pushed this further with interleaved multimodal outputs. The technical challenge is enormous: unifying tokenization across modalities, preventing mode collapse where the model favors text over other outputs, and maintaining quality competitive with specialist models in each domain. Meta's Chameleon and open efforts like NExT-GPT explored this space, but true any-to-any generation at frontier quality remains the province of the largest labs.

1 dataset · 0 results

Image-Text-to-Video

Image-text-to-video is generative AI's hardest unsolved frontier — animating a still image according to a text prompt while maintaining temporal coherence and physical plausibility. Stable Video Diffusion (2023) and Runway Gen-2 showed early promise, Sora (2024) raised the bar dramatically with minute-long physically consistent clips, and Kling and Veo 2 pushed quality further. The fundamental challenge is that video generation requires implicit world models: objects must persist, lighting must evolve consistently, and motion must obey approximate physics across dozens of frames. Evaluation is still largely human-judged, with FVD and CLIP-temporal scores poorly correlating with perceived quality.

1 dataset · 0 results

Text-to-Image Generation

Text-to-image generation went from "interesting research" to cultural phenomenon in 18 months. DALL-E 2 (2022) proved diffusion models could produce photorealistic images from text, Stable Diffusion democratized it as open source, and Midjourney v5/v6 set the aesthetic bar that even non-technical users now expect. DALL-E 3 (2023) solved the prompt-following problem by training on highly descriptive captions, Flux pushed open-source quality to near-commercial levels, and Ideogram cracked reliable text rendering in images. The remaining frontiers are compositional generation (multiple objects with specified spatial relationships), consistent character identity across images, and the still-unsolved challenge of reliable hand and finger anatomy.

3 datasets · 0 results

Video Understanding

Video understanding asks models to reason over temporal sequences — answering questions, generating summaries, or detecting events across minutes or hours of footage. Early approaches like VideoBERT and TimeSformer processed short clips, but Gemini 1.5 Pro's million-token context (2024) enabled reasoning over hour-long videos natively, and GPT-4o brought real-time video comprehension. The core bottleneck remains temporal reasoning at scale: models can describe individual frames well but struggle to track causal chains, count repetitions, or understand temporal ordering across long sequences. Video-MME and EgoSchema are pushing evaluation beyond simple recognition toward genuine temporal understanding.

2 datasets · 0 results

Image-Text-to-Text

Image-text-to-text exploded from a research curiosity to the dominant AI interface in under two years. GPT-4V (2023) proved multimodal LLMs could reason over images, Gemini 1.5 scaled to million-token contexts mixing text and vision, and Claude 3 showed that careful RLHF produces models far less prone to hallucinating about image content. MMMU and MMBench have become the standard evaluation gauntlet, but the real challenge is grounding: models still confabulate spatial relationships and struggle with fine-grained visual reasoning. This is the task that turned chatbots into visual assistants.

3 datasets · 0 results

Audio-Text-to-Text

Audio-text-to-text is the backbone of voice assistants that actually understand context — models that jointly process speech and text to generate grounded responses. Whisper (2022) cracked robust transcription, but the real leap came when Gemini 1.5 and GPT-4o (2024) began reasoning natively over audio tokens alongside text, eliminating the lossy ASR-then-LLM pipeline. The key challenges are handling overlapping speakers, noisy environments, and preserving prosodic cues like sarcasm or hesitation that pure transcription destroys. Benchmarks like SUPERB and Dynamic-SUPERB are expanding, but real-world spoken dialogue understanding remains far ahead of what leaderboards capture.

2 datasets · 0 results

Visual Question Answering

  • GQA (2019)
  • MMBench (2023): 90.5 accuracy, Qwen2.5-VL 72B
  • MMMU (2024): 73.3 accuracy, InternVL3-78B
  • OK-VQA (2019)
  • TextVQA (2019): 85.5 accuracy, Qwen2.5-VL 72B
  • VQA v2.0 (2017): 87.6 accuracy, Qwen2-VL 72B

Image Captioning

  • COCO Captions (2015): 145.8 CIDEr, BLIP-2
  • NoCaps (2019)

Cross-Modal Retrieval

  • ViDoRe (2024)

Image-Text-to-Image

Any-to-Any

Image-Text-to-Video

Text-to-Image Generation

Video Understanding

Image-Text-to-Text

  • MMMU (2023)
  • MMStar (2024)

Audio-Text-to-Text

  • VoiceBench (2024)

Honest Takes

Open-source has caught up for most use cases

Unless you need absolute frontier reasoning, InternVL3-78B or Molmo 2 will serve you better than paying per-token for proprietary APIs. The performance gap has collapsed while deployment flexibility remains massive.

Video understanding is still the wild west

Despite claims, most models fail hard on videos over 15 minutes. If your use case involves long-form video, budget for custom fine-tuning. The benchmarks don't reflect real-world complexity.

Grounding doesn't fix hallucination

Research shows spatial grounding training has little to no effect on object hallucination in captions. You'll need explicit verification pipelines, not architectural fixes.
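In practice a verification pipeline can start as simply as cross-checking caption mentions against detector output. A toy sketch, assuming an external open-vocabulary detector supplies `detector_labels`; the noun vocabulary and word matching here are deliberately crude, and a real pipeline would use proper noun extraction and synonym matching.

```python
def verify_caption(caption: str, detector_labels: set) -> dict:
    """Flag caption-mentioned objects that the object detector did not find
    in the image. `detector_labels` is assumed to come from an external
    open-vocabulary detector."""
    words = {w.strip(".,").lower() for w in caption.split()}
    # Hypothetical closed vocabulary of checkable nouns for this sketch:
    vocabulary = {"dog", "cat", "frisbee", "car", "person", "tree"}
    mentioned = words & vocabulary
    unsupported = mentioned - detector_labels
    return {"mentioned": mentioned,
            "unsupported": unsupported,
            "flagged": bool(unsupported)}
```

Flagged captions can then be regenerated, trimmed, or routed to human review rather than trusted as-is.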

Scientific domains are still underserved

Gemini 2.5 Pro and o3 struggle on chemistry Olympiad problems. If you work with specialized diagrams (molecular structures, technical schematics), expect to build domain-specific solutions.

MoE is the new scaling paradigm

Mixture-of-experts architectures like DeepSeek-VL deliver comparable performance at a fraction of the compute. Dense models are increasingly a poor cost-performance choice.
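The compute saving comes from sparse routing: each token activates only its top-k experts, so with top_k=2 of 8 experts roughly a quarter of the expert parameters run per token. A minimal NumPy sketch of the idea (not DeepSeek-VL's actual layer):

```python
import numpy as np

def moe_forward(x, gate_w, expert_ws, top_k=2):
    """Sparse mixture-of-experts layer: route each token to its top-k
    experts and mix their outputs by renormalized gate weights.
    x: (n_tokens, d), gate_w: (d, n_experts), expert_ws: list of (d, d)."""
    logits = x @ gate_w                                   # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)             # softmax gate
    topk = np.argsort(-probs, axis=1)[:, :top_k]          # chosen experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        weights = probs[t, topk[t]]
        weights /= weights.sum()                          # renormalize over top-k
        for w, e in zip(weights, topk[t]):
            out[t] += w * (x[t] @ expert_ws[e])           # only top-k experts run
    return out
```

Dense models pay for every parameter on every token; here the unchosen experts contribute no FLOPs at all, which is the cost-performance argument in a nutshell.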
