Audio & Speech
ASR, TTS, speaker intelligence, music, sound events, audio-language understanding, and audio safety.
Tasks in Audio & Speech
Automatic Speech Recognition
Converting spoken audio to text.
Multilingual ASR
Recognizing speech across languages and accents.
Streaming ASR
Low-latency speech recognition on live audio.
Speech Translation
Translating spoken audio directly into another language.
Text-to-Speech
Generating natural-sounding speech from text.
Expressive TTS
Generating speech with controllable prosody and emotion.
Voice Cloning
Replicating a speaker's voice characteristics from reference recordings.
Speaker Verification
Verifying speaker identity from voice samples.
Speaker Diarization
Determining who spoke when in multi-speaker audio.
Speech Emotion Recognition
Classifying emotion or affect from speech.
Audio Classification
Classifying audio clips by event or category.
Sound Event Detection
Detecting sound events and locating them in time within a recording.
Audio Captioning
Generating text descriptions of audio clips.
Audio Question Answering
Answering questions about audio content.
Music Understanding
Analyzing musical structure, genre, or content.
Music Generation
Generating music from text, prompts, or examples.
Audio Deepfake Detection
Detecting synthetic or manipulated speech.