Multimodal

Tasks that combine vision and language. Evaluate image captioning, visual question answering, text-to-image generation, video understanding, and cross-modal retrieval models.

5 tasks · 2 datasets · 0 results

Image Captioning

Generating text descriptions of images (COCO Captions).

1 dataset · 0 results
COCO Captions (2015)

330K images with 5 captions each. Standard benchmark for image captioning.
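As a quick orientation, here is a minimal sketch of loading the reference captions with the pycocotools COCO API; the annotation file path is a placeholder and assumes the official COCO annotations have been downloaded locally.

```python
from pycocotools.coco import COCO

coco = COCO("annotations/captions_val2017.json")  # placeholder path to the official annotations

img_id = coco.getImgIds()[0]                      # first image id in the split
ann_ids = coco.getAnnIds(imgIds=[img_id])         # caption annotation ids for that image
for ann in coco.loadAnns(ann_ids):                # typically five reference captions per image
    print(ann["caption"])
```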

Visual Question Answering

Answering questions about images (VQA, GQA).

1 dataset · 0 results
VQA v2.0 (Visual Question Answering v2.0, 2017)

265K images with 1.1M questions. A balanced dataset designed to reduce the language biases found in v1.
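For reference, the standard VQA accuracy metric scores a predicted answer as min(#matching human answers / 3, 1), averaged over the ten leave-one-out subsets of the ten human answers. A minimal sketch follows; the answer normalization applied by the official evaluation code is omitted here.

```python
from typing import List

def vqa_accuracy(prediction: str, human_answers: List[str]) -> float:
    """Average min(matches/3, 1) over all leave-one-out subsets of the human answers."""
    scores = []
    for i in range(len(human_answers)):
        others = human_answers[:i] + human_answers[i + 1:]
        matches = sum(1 for a in others if a == prediction)
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / len(scores)

# Example: 7 of 10 annotators answered "yes", so "yes" scores full credit
print(vqa_accuracy("yes", ["yes"] * 7 + ["no"] * 3))  # -> 1.0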

Text-to-Image Generation

Generating images from text descriptions (Stable Diffusion, DALL-E).

0 datasets · 0 results
No datasets indexed yet. Contribute on GitHub
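Although no datasets are indexed yet, the task itself is easy to try locally. Below is a minimal sketch of text-to-image generation with the Hugging Face diffusers library; the checkpoint name and prompt are illustrative, and a CUDA-capable GPU is assumed.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example checkpoint; any SD 1.x weights work
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")                 # assumes a CUDA-capable GPU

# Generate one image from a text prompt and save it to disk
image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```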

Video Understanding

Understanding and reasoning about video content.

0 datasets · 0 results
No datasets indexed yet. Contribute on GitHub
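A common preprocessing step shared by most video-understanding pipelines is uniform frame sampling. The sketch below uses torchvision; the file path and frame count are illustrative.

```python
import torch
from torchvision.io import read_video

frames, _, _ = read_video("clip.mp4", pts_unit="sec")   # (T, H, W, C) uint8 frames
num_samples = 8                                          # illustrative number of frames
indices = torch.linspace(0, frames.shape[0] - 1, num_samples).long()
sampled = frames[indices]                                # frames a video model would consume
print(sampled.shape)
```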

Cross-Modal Retrieval

Retrieving items across different modalities (image-text).

0 datasets · 0 results
No datasets indexed yet. Contribute on GitHub
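To illustrate the task while datasets are pending, here is a minimal sketch of image-text retrieval with a CLIP model via Hugging Face transformers; the checkpoint name, candidate texts, and image path are illustrative.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["a dog playing fetch", "a plate of sushi"]      # candidate captions to rank
image = Image.open("query.jpg")                          # placeholder query image

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-to-text similarities; higher means a better match
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```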