Image-text-to-text went from a research curiosity to a mainstream AI interface in under two years. GPT-4V (2023) showed that multimodal LLMs could reason over images, Gemini 1.5 scaled to million-token contexts mixing text and vision, and Claude 3 suggested that careful RLHF can make models less prone to hallucinating image content. MMMU and MMBench have become standard evaluation benchmarks, but the real challenge is grounding: models still confabulate spatial relationships and struggle with fine-grained visual reasoning. This is the task that turned chatbots into visual assistants.
Comprehensive multimodal LLM evaluation across 20 ability dimensions
Leading models on MMBench.
No results yet.
3 datasets tracked for this task.