Codesota · Tasks · Image-Text-to-Text
Multimodal · image-text-to-text

Image-Text-to-Text.

Image-text-to-text went from a research curiosity to a mainstream AI interface in under two years. GPT-4V (2023) showed that multimodal LLMs could reason over images, Gemini 1.5 scaled to million-token contexts mixing text and vision, and Claude 3 suggested that careful RLHF can make models less prone to hallucinating about image content. MMMU and MMBench have become standard evaluation benchmarks, but the real challenge is grounding: models still confabulate spatial relationships and struggle with fine-grained visual reasoning. This is the task that turned chatbots into visual assistants.

3 datasets · 57 results · Canonical metric: accuracy
§ 02 · Canonical benchmark

The reference dataset.

MMBench

Comprehensive multimodal LLM evaluation across 20 ability dimensions

Primary metric: accuracy
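
As a rough illustration of what an accuracy score means on a multiple-choice benchmark, the sketch below computes the percentage of questions where the predicted option letter matches the reference. It assumes one answer letter per question and simple exact matching; the benchmark's actual scoring protocol may differ, and the function name and sample answers are hypothetical.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the percentage of predictions that match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty set of questions")
    # Compare option letters case-insensitively, ignoring stray whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Hypothetical option letters for four multiple-choice questions.
preds = ["A", "c", "B", "D"]
gold = ["A", "C", "D", "D"]
print(f"{accuracy(preds, gold):.1f}")  # 3 of 4 correct -> 75.0
```

Reporting the score as a percentage keeps it on the same scale as the leaderboard figures listed for the other datasets.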
§ 03 · Top 10

Leading models.

Leading models on MMBench.

No results yet.

§ 04 · All datasets

Tracked datasets.

3 datasets tracked for this task.

MMBench (canonical) · 0 results · accuracy
MMMU · 36 results · accuracy · Top: Qwen3.5-397B-A17B 85.0
MMStar · 21 results · accuracy · Top: Qwen3.5-397B-A17B 83.8
§ 05 · Related tasks

Other tasks in Multimodal.

Any-to-Any · Audio-Text-to-Text · Cross-Modal Retrieval · Image Captioning · Image-Text-to-Image · Image-Text-to-Video · Text-to-Image Generation · Video Understanding