Multimodal Media
Cross-modal tasks over image, text, audio, video, and 3D data, where the input and output together span more than one media type.
Tasks in Multimodal Media
Visual Question Answering
Answering questions about images.
Image Captioning
Generating textual descriptions of images.
Image-Text Retrieval
Retrieving the images that match a text query, or the text that matches an image query.
Video Question Answering
Answering questions about video content.
Video Captioning
Generating textual descriptions of videos.
Audio + Text to Text
Using audio and text prompts to produce text responses.
Document VQA
Answering questions over document images.
Text-to-Image
Generating images from text prompts.
Image Editing
Editing images according to text or visual instructions.
Text-to-Video
Generating video from text prompts.
Image-to-Video
Animating still images into video.
Text-to-3D
Generating 3D assets from text.
Image-to-3D
Generating 3D assets from images.
Any-to-Any Omni Models
Models that accept multiple modalities as input and generate multiple modalities as output.