Codesota · Tasks · Image-Text-to-Text

Multimodal · image-text-to-text

Image-Text-to-Text.

Image-text-to-text went from a research curiosity to a mainstream AI interface in under two years. GPT-4V (2023) demonstrated that multimodal LLMs can reason over images, Gemini 1.5 scaled to million-token contexts mixing text and vision, and Claude 3 suggested that careful RLHF can make models less prone to hallucinating image content. MMMU and MMBench have become standard evaluation benchmarks, but the real challenge is grounding: models still confabulate spatial relationships and struggle with fine-grained visual reasoning. This is the task that turned chatbots into visual assistants.

Datasets: 3

Results: 57

Canonical metric: accuracy

§ 02 · Canonical benchmark

The reference dataset.

MMBench

Comprehensive multimodal LLM evaluation across 20 ability dimensions

Primary metric: accuracy
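
As a rough illustration of the metric, accuracy on a multiple-choice benchmark can be sketched as the percentage of questions where the model's chosen option matches the reference answer. The function name and the sample data below are hypothetical, and the official MMBench scoring protocol may differ from this plain exact-match version.

```python
def accuracy(predictions, answers):
    """Percentage of questions where the predicted option matches the reference.

    Assumption: each item is a single option letter such as "A" or "B".
    Comparison ignores case and surrounding whitespace.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Hypothetical model outputs and reference answers, for illustration only.
preds = ["A", "c", "B", "D"]
gold = ["A", "C", "D", "D"]
print(f"{accuracy(preds, gold):.1f}")
```

Scores computed this way are on a 0 to 100 scale, the same scale as the leaderboard figures on this page.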

§ 03 · Top 10

Leading models.

Leading models on MMBench.

No results yet. Be the first to contribute.

§ 04 · All datasets

Tracked datasets.

3 datasets tracked for this task.

0 results · accuracy

36 results · accuracy

Top: Qwen3.5-397B-A17B — 85.0

21 results · accuracy

Top: Qwen3.5-397B-A17B — 83.8

§ 05 · Related tasks

Other tasks in Multimodal.

Any-to-Any · Audio-Text-to-Text · Cross-Modal Retrieval · Image Captioning · Image-Text-to-Image · Image-Text-to-Video · Text-to-Image Generation · Video Understanding
