
Text to Speech

Convert text to natural-sounding speech. Powers voice assistants, audiobooks, and accessibility features.

How Text-to-Speech Works

A deep-dive into modern speech synthesis. From text preprocessing to neural vocoders, understand how systems like ElevenLabs achieve human-quality speech.

1. Text Normalization

Raw text contains ambiguities that must be resolved before synthesis. Numbers, abbreviations, and symbols need explicit pronunciation rules.

INPUT (Raw Text)
Dr. Smith lives at 123 Main St.
OUTPUT (Normalized)
Doctor Smith lives at one twenty three Main Street.

Common Normalization Rules

• Numbers: 123 -> "one twenty three"
• Currency: $5.99 -> "five dollars..."
• Dates: 01/15 -> "January fifteenth"
• Abbreviations: Dr. -> "Doctor"
• Acronyms: NASA -> "N A S A" or "nasa"
• Units: 5kg -> "five kilograms"
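
A minimal Python sketch of two of these rules (digit expansion plus an abbreviation table). Production normalizers use much larger, context-dependent rule sets (often weighted finite-state transducers) to pick readings like "one twenty three" for addresses; this toy version just spells digits out one by one.

```python
import re

# Toy normalizer: abbreviation expansion + digit-by-digit number reading.
# Real systems choose context-dependent readings (addresses, years, money).
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "Mr.": "Mister"}
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def spell_digits(match: re.Match) -> str:
    """Read a number digit by digit, e.g. '123' -> 'one two three'."""
    return " ".join(ONES[int(d)] for d in match.group())

def normalize(text: str) -> str:
    for abbr, expansion in ABBREVIATIONS.items():
        text = text.replace(abbr, expansion)
    return re.sub(r"\d+", spell_digits, text)

print(normalize("Dr. Smith lives at 123 Main St."))
# -> Doctor Smith lives at one two three Main Street.
```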
2. Grapheme-to-Phoneme (G2P)

Convert written text (graphemes) to pronunciation symbols (phonemes). English is notoriously irregular - "ough" has 9+ different pronunciations.

hello -> /həˈloʊ/

• /h/: consonant (voiceless glottal fricative)
• /ə/: vowel (schwa, unstressed)
• /l/: consonant (alveolar lateral)
• /oʊ/: vowel (diphthong)

Dictionary Lookup

CMU Pronouncing Dictionary: 134K words with phonetic transcriptions

Fast and accurate for known words. Fails on out-of-vocabulary (OOV) words.

Rule-Based

Letter-to-sound rules: "ph" -> /f/, "tion" -> /ʃən/

Handles unknown words. Many exceptions.

Neural G2P

Seq2seq transformer: learns patterns from data

Best accuracy. Used in modern TTS.
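
A toy sketch combining the first two approaches: dictionary lookup with a crude letter-to-sound fallback for OOV words. The lexicon and rules here are tiny stand-ins (real systems use the full CMU dictionary plus a trained neural model); symbols are ARPAbet.

```python
# Dictionary-first G2P with a naive rule fallback (ARPAbet symbols).
LEXICON = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
}

# Longest-match-first letter-to-sound rules. Wildly incomplete on purpose.
RULES = [("tion", "SH AH0 N"), ("ph", "F"), ("th", "TH"),
         ("a", "AE1"), ("e", "EH1"), ("i", "IH1"),
         ("o", "AA1"), ("u", "AH1")]

def g2p(word: str) -> list[str]:
    word = word.lower()
    if word in LEXICON:                 # 1) fast path: dictionary lookup
        return LEXICON[word]
    phones, i = [], 0                   # 2) fallback: letter-to-sound rules
    while i < len(word):
        for graph, phone in RULES:
            if word.startswith(graph, i):
                phones.extend(phone.split())
                i += len(graph)
                break
        else:                           # no rule fired: keep the letter
            phones.append(word[i].upper())
            i += 1
    return phones

print(g2p("hello"))  # ['HH', 'AH0', 'L', 'OW1'] (dictionary hit)
print(g2p("phony"))  # ['F', 'AA1', 'N', 'Y'] (rule fallback)
```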

The IPA (International Phonetic Alphabet)

Vowels: iː eɪ ɑː ɔː uː aɪ aʊ
Plosives: p b t d k ɡ
Fricatives: f v θ ð s z ʃ ʒ h
Other (nasals and approximants): m n ŋ l r w j
3. Prosody: The Music of Speech

Prosody controls how we say something, not just what we say. The difference between "You're going?" (question) and "You're going." (statement) is prosody.

"I never said she stole my money"
Click a word to emphasize it and see how meaning changes
• Pitch (F0): fundamental frequency, higher for questions and emphasis
• Duration: how long each phoneme is held
• Energy: volume/intensity variations
• Pauses: silence between phrases for naturalness

Pitch Contour (F0) Visualization

Statements typically fall in pitch at the end. Questions rise. This is why text-only TTS often sounds unnatural - it struggles with contextual prosody.
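
A sketch of extracting an F0 contour with librosa's pYIN pitch tracker ("speech.wav" is a placeholder path):

```python
import librosa

# Load an utterance and track its fundamental frequency with pYIN.
y, sr = librosa.load("speech.wav", sr=22050)
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),  # ~65 Hz, below typical speech
    fmax=librosa.note_to_hz("C7"),  # ~2093 Hz, well above speech
    sr=sr)

# f0 is NaN on unvoiced frames (many consonants, silence). A rising
# contour near the end suggests a question; a falling one, a statement.
print(f0[-20:])  # roughly the last half second of the pitch track
```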

4. Acoustic Model: The Heart of TTS

The acoustic model converts linguistic features (phonemes, prosody) into audio representations. This is where the "magic" happens - neural networks learn the complex mapping from text to sound.

The Evolution of TTS

1960s-1990s: Concatenative
Splice recorded speech units.
+ Natural sound from real recordings
- Limited vocabulary, robotic joins
Examples: DECtalk, AT&T Natural Voices

1990s-2016: Statistical Parametric
HMM/DNN models predict acoustic features.
+ Flexible, compact models
- Muffled, buzzy quality
Examples: Festival, HTS

2016-2020: Neural Autoregressive
RNN/attention models generate mel frames.
+ Human-like quality
- Slow generation, attention failures
Examples: Tacotron 2, Transformer-TTS

2020-Present: Non-Autoregressive
Parallel generation with a duration predictor.
+ Real-time, stable
- Slightly less expressive
Examples: FastSpeech 2, VITS

2023-Present: Neural Codec LM
An LLM predicts audio tokens.
+ Zero-shot cloning, emotion
- Compute heavy, artifacts
Examples: VALL-E, Bark, ElevenLabs

Tacotron 2 (Autoregressive)

Text Encoder -> character/phoneme embeddings
Attention -> aligns text to mel frames
Decoder -> generates the mel spectrogram

Generates one mel frame at a time, attending back to the encoder. Slow but high quality.

FastSpeech 2 (Non-Autoregressive)

Text Encoder -> phoneme representations
Duration Predictor -> predicts per-phoneme durations
Mel Decoder -> generates all frames in parallel

Generates all mel frames in parallel, 100x+ faster than autoregressive models.
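
The key non-autoregressive trick is the length regulator: each phoneme's encoder state is repeated for its predicted duration, so the decoder sees exactly one vector per mel frame and can emit them all at once. A minimal NumPy sketch:

```python
import numpy as np

def length_regulate(phoneme_states: np.ndarray,
                    durations: np.ndarray) -> np.ndarray:
    """FastSpeech-style length regulator (illustrative).

    phoneme_states: (num_phonemes, hidden_dim) encoder outputs
    durations:      (num_phonemes,) predicted mel-frame counts

    Returns (sum(durations), hidden_dim): each state repeated for its
    duration, ready for a parallel mel decoder.
    """
    return np.repeat(phoneme_states, durations, axis=0)

# 3 phonemes, hidden size 4; durations of 2, 5, and 3 mel frames.
states = np.random.randn(3, 4)
frames = length_regulate(states, np.array([2, 5, 3]))
print(frames.shape)  # (10, 4) -> the decoder emits 10 frames in parallel
```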
5. Mel Spectrograms: The Audio Image

A mel spectrogram is a 2D representation of audio: time on x-axis, frequency (mel-scaled) on y-axis, intensity as color. It's how neural networks "see" sound.

In the visualization, low frequencies (bass) sit at the bottom of the frequency axis and high frequencies (treble) at the top, with intensity shown as color.

Why Mel Scale?

Human hearing is logarithmic - we perceive the difference between 100Hz and 200Hz as similar to 1000Hz and 2000Hz. The mel scale matches this perception.

mel(f) = 2595 * log10(1 + f/700)
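
A quick numeric check: the scale is anchored so that 1000 Hz lands near 1000 mels, and it compresses high frequencies heavily.

```python
import math

def hz_to_mel(f: float) -> float:
    """mel(f) = 2595 * log10(1 + f / 700)"""
    return 2595 * math.log10(1 + f / 700)

# The first 1000 Hz span ~1000 mels, but the 1000 Hz between
# 7 kHz and 8 kHz span only ~138 mels.
print(hz_to_mel(1000))                    # ~1000.0
print(hz_to_mel(8000) - hz_to_mel(7000))  # ~137.7
```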

Typical Parameters

• Sample rate: 22,050 Hz
• FFT size: 1,024
• Hop length: 256
• Mel bins: 80
• Frequency range: 0-8,000 Hz
• Frame rate: ~86 fps
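
These parameters map directly onto librosa's mel spectrogram API; a sketch ("speech.wav" is a placeholder path):

```python
import librosa

y, sr = librosa.load("speech.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=1024,      # FFT size
    hop_length=256,  # 22050 / 256 ≈ 86 frames per second
    n_mels=80,       # mel bins
    fmax=8000)       # upper frequency bound

# Log-compress the power spectrogram, as TTS models expect.
log_mel = librosa.power_to_db(mel, ref=mel.max())
print(log_mel.shape)  # (80, num_frames)
```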

Reading a Mel Spectrogram

• Vowels: horizontal bands (formants) at consistent frequencies
• Consonants: noise bursts, gaps, or rapid transitions
• Pitch: harmonic structure, evenly spaced horizontal lines
• Silence: dark vertical bands between words
6. Vocoder: Mel to Waveform

The vocoder converts mel spectrograms back into audio waveforms. This "neural audio synthesis" step is crucial for quality - bad vocoders make everything sound robotic.


Vocoder Comparison

• Griffin-Lim (classical): RTF 0.001x, poor quality. Iterative phase reconstruction.
• WaveNet (autoregressive): RTF 0.01x, excellent quality. Generates sample by sample, ~16,000 sequential steps per second of audio.
• WaveGlow (flow-based): RTF 0.5x, very good quality. Parallel, but a large model.
• HiFi-GAN (GAN): RTF 50x+, excellent quality. Fast and high quality; the industry standard.
• Vocos (GAN): RTF 100x+, excellent quality. Fourier-based, very efficient.

RTF = Real-Time Factor (>1x = faster than real-time)
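
For reference, the Griffin-Lim baseline from the table is a single librosa call. Here `mel` is assumed to be the power mel spectrogram from the earlier example:

```python
import librosa
import soundfile as sf

# Classical phase-reconstruction vocoding; expect the "poor" quality
# noted above. Neural vocoders replace exactly this step.
audio = librosa.feature.inverse.mel_to_audio(
    mel, sr=22050, n_fft=1024, hop_length=256)
sf.write("reconstructed.wav", audio, 22050)
```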

HiFi-GAN: The Industry Standard

Mel Spectrogram -> Generator (transposed convolutions + MRF) -> Audio Waveform

Generator: upsamples the mel spectrogram 256x through transposed convolutions with Multi-Receptive Field (MRF) fusion.
Discriminator: Multi-Period and Multi-Scale discriminators ensure both fine and coarse audio quality.
7. Voice Cloning: Zero-Shot TTS

Modern systems like ElevenLabs can clone a voice from just seconds of audio. This is the cutting edge of TTS - combining speaker characteristics with arbitrary text.

Speaker Embedding

Data needed: 5-30 seconds. Similarity: good.

Extract fixed-size vector from reference audio

A speaker encoder (e.g., d-vector, x-vector) compresses voice characteristics into a ~256-dim vector that conditions the TTS model.

Used by: Coqui XTTS, Meta Voicebox

In-Context Learning

Data needed: 3-10 seconds. Similarity: excellent.

Feed reference audio as prompt to LLM

Audio is tokenized into discrete codes. The model learns to continue generating in the same voice, like GPT continuing text.

Used by: VALL-E, ElevenLabs

Fine-Tuning

Data needed: 30 minutes to 2 hours. Quality: best.

Adapt model weights on target voice

Low-rank adaptation (LoRA) or full fine-tuning on target speaker data. Most accurate but requires more data.

Used by: Tortoise-TTS, Custom models

Speaker Embedding Space

Each dot represents a speaker's voice characteristics in embedding space. Similar voices cluster together. The model learns to map reference audio to this space, then conditions generation on the resulting vector.
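
Similarity in this space is typically scored with cosine similarity; a small illustration with random stand-in embeddings:

```python
import numpy as np

def speaker_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings; values near 1.0
    suggest the same speaker. Verification systems threshold this score."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ref = np.random.randn(256)               # reference speaker embedding
gen = ref + 0.1 * np.random.randn(256)   # "cloned" voice near the reference
print(speaker_similarity(ref, gen))      # close to 1.0
```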

Neural Codec Language Models (VALL-E / ElevenLabs)

How It Works
  1. Audio is tokenized into discrete codes via a neural codec (EnCodec)
  2. The reference audio becomes a "prompt" of audio tokens
  3. The LLM predicts new audio tokens conditioned on text + prompt
  4. The neural codec decodes the tokens back to audio
Key Innovations
  • + Zero-shot cloning from 3 seconds of audio
  • + Preserves emotion, accent, and style
  • + Can generate non-speech (laughter, sighs)
  • ! Requires massive training data (60K+ hours)
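
A conceptual sketch of that data flow. `codec` and `lm` are hypothetical stand-ins, not a real API; only the shape of the pipeline is meant to be accurate.

```python
def clone_and_speak(codec, lm, reference_audio, text: str):
    # 1) Tokenize the short reference clip into discrete codec tokens.
    prompt_tokens = codec.encode(reference_audio)

    # 2) The LM continues the token stream conditioned on the target text
    #    plus the acoustic prompt -- like GPT continuing a text prompt --
    #    so voice, accent, and style carry over.
    new_tokens = lm.generate(text=text, audio_prompt=prompt_tokens)

    # 3) Decode the predicted tokens back into a waveform.
    return codec.decode(new_tokens)
```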

The Complete TTS Pipeline

Raw Text -> Normalized Text -> Phonemes -> + Prosody -> Acoustic Model -> Mel Spectrogram -> Vocoder -> Audio
Evaluation Metrics

• MOS (Mean Opinion Score): human naturalness rating on a 1-5 scale
• RTF (Real-Time Factor): generation speed vs. playback time
• SV (Speaker Verification): voice similarity score
• WER (Word Error Rate): intelligibility measure
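
A runnable sketch of measuring RTF under the convention used here (audio seconds produced per wall-clock second), with a stubbed synthesizer standing in for a real model:

```python
import time
import numpy as np

def synthesize(text: str):
    """Stub for a real TTS call; returns 2 s of silence at 22.05 kHz."""
    time.sleep(0.1)  # pretend synthesis takes 100 ms
    return np.zeros(2 * 22050), 22050

start = time.perf_counter()
audio, sr = synthesize("The quick brown fox jumps over the lazy dog.")
elapsed = time.perf_counter() - start

rtf = (len(audio) / sr) / elapsed  # >1x means faster than real time
print(f"RTF: {rtf:.1f}x")          # ~20x for this stub
```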

Use Cases

  • Voice assistants
  • Audiobook generation
  • Accessibility
  • Video narration

Architectural Patterns

Neural TTS

End-to-end neural models for speech synthesis.

Pros:
  • + Natural sounding
  • + Emotion control
  • + Many voices
Cons:
  • - Compute intensive
  • - Voice cloning concerns

Zero-Shot Voice Cloning

Clone any voice from a short sample.

Pros:
  • + Any voice
  • + Minimal samples needed
Cons:
  • - Ethical concerns
  • - Quality varies

Implementations

API Services

ElevenLabs (API)

Industry-leading quality. Great voice cloning.

OpenAI TTS (API)

Simple API, good quality. Limited voice options.

Open Source

Coqui XTTS (MPL-2.0, open source)

Zero-shot voice cloning. Runs locally.

Bark (MIT, open source)

Supports music, laughter, and nonverbal sounds.
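
As an example of running one of these locally, a sketch using Coqui's TTS package for zero-shot cloning (the model name was current at the time of writing; check the Coqui docs for your version, and note the reference clip path is a placeholder):

```python
from TTS.api import TTS

# Load the XTTS v2 multilingual model and clone a voice from a short clip.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Hello! This is a cloned voice.",
    speaker_wav="reference_speaker.wav",  # ~6+ seconds of the target voice
    language="en",
    file_path="output.wav")
```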


Quick Facts

• Input: Text
• Output: Audio
• Implementations: 2 open source, 2 API
• Patterns: 2 approaches
