Multimodal
Video Understanding
Understanding and reasoning about video content.
0 datasets, 0 results
Video Understanding is a key task in multimodal learning. Below you will find the standard benchmarks used to evaluate models, along with current state-of-the-art results.
Benchmarks & SOTA
No datasets indexed for this task yet.
Contribute on GitHub
Related Tasks
Image Captioning
Generating text descriptions of images (COCO Captions).
Visual Question Answering
Answering questions about images (VQA, GQA).
Text-to-Image Generation
Generating images from text descriptions (Stable Diffusion, DALL-E).
Cross-Modal Retrieval
Retrieving items across different modalities (image-text).