Text-to-Speech
Text-to-speech has undergone a striking transformation from robotic concatenative synthesis to near-human expressiveness in under five years. Commercial systems such as ElevenLabs and OpenAI's TTS, along with open models like XTTS-v2, produce speech that many listeners struggle to distinguish from recordings, while Bark and F5-TTS — building on Microsoft's VALL-E demonstration (which was never openly released) — have made voice cloning from roughly three seconds of reference audio a commodity capability. The frontier has moved past intelligibility, which is largely solved, to prosody, emotion control, and real-time streaming at under 200 ms latency for conversational AI. Evaluation remains messy: MOS (Mean Opinion Score) is subjective and expensive to collect, and automated proxies such as UTMOS correlate only loosely with human preference, making benchmark comparisons unreliable.
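As a concrete illustration of why MOS comparisons are shaky, here is a minimal sketch of how a MOS is typically aggregated from listener ratings on the standard 1–5 scale, with a normal-approximation confidence interval. The function name and the example scores are hypothetical; the point is that with small listener panels the confidence interval is wide, so small MOS gaps between systems are rarely meaningful.

```python
import statistics


def mos_with_ci(ratings: list[int], z: float = 1.96) -> tuple[float, float]:
    """Mean Opinion Score with a normal-approximation 95% confidence interval.

    ratings: listener scores on the standard 1-5 absolute category
    rating scale (5 = excellent, 1 = bad).
    Returns (mean score, confidence-interval half-width).
    """
    mean = statistics.mean(ratings)
    # Sample standard deviation; the half-width shrinks only with sqrt(n),
    # which is why reliable MOS studies need many raters per system.
    half_width = z * statistics.stdev(ratings) / len(ratings) ** 0.5
    return mean, half_width


# Hypothetical scores from ten listeners for one synthesized utterance.
scores = [4, 5, 4, 4, 3, 5, 4, 4, 5, 4]
mos, ci = mos_with_ci(scores)
print(f"MOS = {mos:.2f} ± {ci:.2f}")  # MOS = 4.20 ± 0.39
```

With only ten raters the interval spans nearly ±0.4 MOS points, larger than the gap separating many leaderboard entries.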
VCTK
Speech recordings from 110 English speakers with various accents. A standard corpus for multi-speaker TTS and voice cloning.
Top 10
Leading models on VCTK.
All datasets
2 datasets tracked for this task.
Related tasks
Other tasks in Speech.