
Text-to-Speech

Text-to-speech has undergone a striking transformation from robotic concatenative synthesis to near-human expressiveness in under five years. Commercial systems such as ElevenLabs and OpenAI's TTS, alongside open models like XTTS-v2, produce speech that most listeners cannot reliably distinguish from recordings, while Bark, Microsoft's VALL-E, and F5-TTS have demonstrated that voice cloning from roughly 3-second reference samples is now a commodity capability. The frontier has moved past intelligibility, which is effectively solved, to prosody, emotion control, and real-time streaming at under 200 ms latency for conversational AI. Evaluation remains messy: MOS (Mean Opinion Score) is subjective and expensive to collect, and automated proxies like UTMOS only loosely correlate with human preference, making benchmark comparisons unreliable.
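As a concrete illustration of the metric used throughout this page, the sketch below shows how a MOS is computed: listeners rate each utterance on a 1-5 scale, and the reported score is the mean, typically accompanied by a 95% confidence interval. The ratings here are hypothetical, not drawn from any leaderboard entry.

```python
# Minimal MOS computation: mean of 1-5 listener ratings plus a
# normal-approximation 95% confidence interval half-width.
from statistics import mean, stdev
from math import sqrt

def mos_with_ci(ratings: list[int], z: float = 1.96) -> tuple[float, float]:
    """Return (MOS, half-width of the ~95% CI) for a list of 1-5 ratings."""
    m = mean(ratings)
    half_width = z * stdev(ratings) / sqrt(len(ratings))
    return m, half_width

ratings = [5, 4, 4, 5, 3, 4, 5, 4, 4, 5]  # illustrative listener scores
score, ci = mos_with_ci(ratings)
print(f"MOS = {score:.2f} ± {ci:.2f}")  # prints "MOS = 4.30 ± 0.42"
```

In practice the gaps between top systems (e.g. 4.36 vs. 4.26 below) are smaller than typical confidence intervals from small listening panels, which is one reason MOS-based rankings are unstable.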

2 datasets · 24 results · Canonical metric: MOS
Canonical Benchmark

VCTK

Speech data from 110 English speakers with various accents. Used for multi-speaker TTS.

Primary metric: MOS

Top 10

Leading models on VCTK.

Rank  Model                 MOS   Year  Source
1     NaturalSpeech 3       4.36  2026  paper
2     Ground Truth (VCTK)   4.26  2022  paper
3     VITS                  4.21  2026  paper
4     StyleTTS2             4.19  2023  paper
5     Ground Truth (VCTK)   4.19  2022  paper
6     VALL-E 2              4.18  2026  paper
7     YourTTS               4.16  2022  paper
8     XTTS v2               4.14  2026  paper
9     YourTTS               4.07  2022  paper
10    VITS2                 3.99  2023  paper

All datasets

2 datasets tracked for this task.

Related tasks

Other tasks in Speech.

Run Inference

Looking to run a model? HuggingFace hosts inference for this task type.

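For hosted inference, a request can be sketched as below. The model id (`suno/bark-small`) and the `api-inference.huggingface.co` endpoint shape follow HuggingFace's public Inference API convention, but treat both as assumptions to verify against the current docs; substitute your own model and token.

```python
# Sketch of a HuggingFace Inference API call for text-to-speech.
# Only the request is assembled here; sending it requires a valid token.
import json

API_URL = "https://api-inference.huggingface.co/models/{model_id}"

def build_tts_request(model_id: str, text: str, token: str) -> dict:
    """Assemble URL, auth header, and JSON payload for a TTS inference call."""
    return {
        "url": API_URL.format(model_id=model_id),
        "headers": {"Authorization": f"Bearer {token}"},
        "payload": json.dumps({"inputs": text}),
    }

if __name__ == "__main__":
    # To actually send it (requires the `requests` package and an HF token):
    #   import requests
    #   req = build_tts_request("suno/bark-small", "Hello world", "hf_...")
    #   audio_bytes = requests.post(req["url"], headers=req["headers"],
    #                               data=req["payload"]).content
    req = build_tts_request("suno/bark-small", "Hello world", "hf_xxx")
    print(req["url"])
```

The response body is raw audio bytes, so it can be written directly to a file such as `out.wav` or `out.flac` depending on the model's output format.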