Text to Speech
Convert text to natural-sounding speech. Powers voice assistants, audiobooks, and accessibility features.
How Text-to-Speech Works
A deep-dive into modern speech synthesis. From text preprocessing to neural vocoders, understand how systems like ElevenLabs achieve human-quality speech.
Text Normalization
Raw text contains ambiguities that must be resolved before synthesis. Numbers, abbreviations, and symbols need explicit pronunciation rules.
Common Normalization Rules
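A minimal sketch of the kind of rules involved, assuming the third-party num2words package for number expansion; the abbreviation table and regexes below are illustrative, not a production normalizer.

```python
import re

from num2words import num2words  # assumed third-party package for number expansion

# Illustrative subset of an abbreviation table; real front-ends have thousands of entries.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Saint", "etc.": "et cetera"}

def normalize(text: str) -> str:
    # Expand abbreviations first, so their periods are not mistaken for sentence ends.
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Currency: "$5" -> "five dollars".
    text = re.sub(r"\$(\d+)", lambda m: f"{num2words(int(m.group(1)))} dollars", text)
    # Remaining integers: "3" -> "three". (Dates, ordinals, phone numbers, and years
    # need context-dependent rules that this toy version skips.)
    text = re.sub(r"\d+", lambda m: num2words(int(m.group(0))), text)
    return text

print(normalize("Dr. Smith paid $5 on Jan 3"))
# -> "Doctor Smith paid five dollars on Jan three"
```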
Grapheme-to-Phoneme (G2P)
Convert written text (graphemes) to pronunciation symbols (phonemes). English is notoriously irregular - "ough" has 9+ different pronunciations.
Dictionary Lookup
CMU Pronouncing Dictionary: 134K words with phonetic transcriptions
Rule-Based
Letter-to-sound rules: "ph" -> /f/, "tion" -> /ʃən/
Neural G2P
Seq2seq transformer: learns patterns from data
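A toy sketch of how the first two approaches combine: look the word up in a lexicon, and fall back to letter-to-sound rules when it is missing. The mini-lexicon and rule set are illustrative; real systems use the full CMU dictionary plus a trained neural fallback.

```python
# Toy G2P: dictionary lookup with a naive rule-based fallback.
# The lexicon entries mimic CMUdict-style ARPAbet transcriptions.
LEXICON = {
    "speech": ["S", "P", "IY1", "CH"],
    "through": ["TH", "R", "UW1"],
    "tough": ["T", "AH1", "F"],
}

# A few letter-to-sound rules, applied longest-match-first (highly simplified).
RULES = [("ough", ["AH1", "F"]), ("ph", ["F"]), ("th", ["TH"]), ("ch", ["CH"]),
         ("a", ["AE1"]), ("e", ["EH1"]), ("i", ["IH1"]), ("o", ["AA1"]), ("u", ["AH1"])]

def g2p(word: str) -> list[str]:
    word = word.lower()
    if word in LEXICON:                      # 1) dictionary lookup
        return LEXICON[word]
    phones, i = [], 0
    while i < len(word):                     # 2) rule-based fallback
        for pattern, ph in RULES:
            if word.startswith(pattern, i):
                phones += ph
                i += len(pattern)
                break
        else:
            phones.append(word[i].upper())   # unknown letter: pass through
            i += 1
    return phones

print(g2p("through"), g2p("photo"))
```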
The IPA (International Phonetic Alphabet)
Prosody: The Music of Speech
Prosody controls how we say something, not just what we say. The difference between "You're going?" (question) and "You're going." (statement) is prosody.
Pitch Contour (F0) Visualization
Statements typically fall in pitch at the end. Questions rise. This is why TTS that relies on the text alone often sounds unnatural - plain text underdetermines the intended prosody.
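A small illustration of the idea using stylized F0 curves; the numbers are invented for the sake of the example, while real models predict F0 per frame or per phoneme.

```python
import numpy as np

# Sketch: stylized F0 (pitch) contours for the same words said as a
# statement vs. a question.
t = np.linspace(0.0, 1.0, 100)            # normalized utterance time
base = 120.0                              # speaker's average pitch in Hz (illustrative)

statement_f0 = base + 30 * np.exp(-3 * t) - 25 * t           # declination + final fall
question_f0 = base + 10 * np.sin(2 * np.pi * t) + 60 * t**3  # final rise

print(f"statement ends near {statement_f0[-1]:.0f} Hz, question near {question_f0[-1]:.0f} Hz")
```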
Acoustic Model: The Heart of TTS
The acoustic model converts linguistic features (phonemes, prosody) into audio representations. This is where the "magic" happens - neural networks learn the complex mapping from text to sound.
The Evolution of TTS
- Concatenative synthesis: splice recorded speech units
- Statistical parametric: HMM/DNN models predict acoustic features
- Neural autoregressive: RNN + attention models generate mel frames (Tacotron 2)
- Non-autoregressive: parallel generation with a duration predictor (FastSpeech 2)
- Codec language models: an LLM predicts discrete audio tokens (VALL-E)
Tacotron 2 (Autoregressive)
FastSpeech 2 (Non-Autoregressive)
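A minimal sketch of the length regulator at the core of non-autoregressive models like FastSpeech 2: predicted per-phoneme durations expand the phoneme encodings onto the mel-frame timeline, so all frames can be decoded in parallel. PyTorch, unbatched for clarity.

```python
import torch

def length_regulate(phoneme_hidden: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """FastSpeech-style length regulator (simplified, unbatched).

    phoneme_hidden: [num_phonemes, hidden_dim] encoder outputs
    durations:      [num_phonemes] predicted mel-frame counts per phoneme
    returns:        [total_frames, hidden_dim], ready for the mel decoder
    """
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)

hidden = torch.randn(4, 256)            # 4 phonemes, 256-dim encodings
durs = torch.tensor([3, 7, 2, 5])       # frames each phoneme should span
frames = length_regulate(hidden, durs)
print(frames.shape)                     # torch.Size([17, 256])
```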
Mel Spectrograms: The Audio Image
A mel spectrogram is a 2D representation of audio: time on x-axis, frequency (mel-scaled) on y-axis, intensity as color. It's how neural networks "see" sound.
Interactive Mel Spectrogram
Why Mel Scale?
Human hearing is logarithmic - we perceive the difference between 100Hz and 200Hz as similar to 1000Hz and 2000Hz. The mel scale matches this perception.
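A small sketch of the HTK-style mel formula, showing how evenly spaced points on the mel axis map to increasingly wide frequency bands at the high end.

```python
import numpy as np

def hz_to_mel(f_hz):
    # HTK-style mel formula: roughly linear below ~1 kHz, logarithmic above.
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Equally spaced points on the mel axis map to increasingly wide Hz bands,
# mirroring how hearing compresses high frequencies.
mel_points = np.linspace(hz_to_mel(0), hz_to_mel(8000), 11)
print(np.round(mel_to_hz(mel_points)))  # band edges bunch at low Hz, spread at high Hz
```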
Typical Parameters
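A sketch of computing a mel spectrogram with librosa, using parameter values that are common in neural TTS front-ends (22.05 kHz audio, 1024-point FFT, 256-sample hop, 80 mel bands); the input filename is hypothetical.

```python
import librosa
import numpy as np

y, sr = librosa.load("reference.wav", sr=22050)   # hypothetical input file

mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80, fmin=0, fmax=8000)
mel_db = librosa.power_to_db(mel, ref=np.max)      # log-compress, as models expect

print(mel_db.shape)   # (80 mel bands, num_frames): the "image" the acoustic model predicts
```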
Reading a Mel Spectrogram
Vocoder: Mel to Waveform
The vocoder converts mel spectrograms back into audio waveforms. This "neural audio synthesis" step is crucial for quality - bad vocoders make everything sound robotic.
Waveform Reconstruction
Vocoder Comparison
| Vocoder | Type | Speed (RTF) | Quality | Notes |
|---|---|---|---|---|
| Griffin-Lim | Classical | 0.001x | Poor | Iterative phase reconstruction |
| WaveNet | Autoregressive | 0.01x | Excellent | Sample-by-sample; 16,000 sequential steps per second of audio |
| WaveGlow | Flow-based | 0.5x | Very Good | Parallel, large model |
| HiFi-GAN | GAN | 50x+ | Excellent | Fast, high quality - industry standard |
| Vocos | GAN | 100x+ | Excellent | Fourier-based, very efficient |
RTF = Real-Time Factor (>1x = faster than real-time)
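For a concrete baseline, here is a sketch of classical Griffin-Lim reconstruction via librosa's mel inversion (filenames are hypothetical). Neural vocoders such as HiFi-GAN replace this iterative phase estimation with a learned generator.

```python
import librosa
import soundfile as sf

y, sr = librosa.load("reference.wav", sr=22050)    # hypothetical input
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# Invert the mel spectrogram; Griffin-Lim estimates the missing phase iteratively.
recovered = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=1024, hop_length=256, n_iter=32)

sf.write("reconstructed.wav", recovered, sr)       # audibly "phasey" vs. neural vocoders
```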
HiFi-GAN: The Industry Standard
Voice Cloning: Zero-Shot TTS
Modern systems like ElevenLabs can clone a voice from just seconds of audio. This is the cutting edge of TTS - combining speaker characteristics with arbitrary text.
Speaker Embedding
Data needed: 5-30 seconds. Similarity: good. Approach: extract a fixed-size vector from reference audio.
A speaker encoder (e.g., d-vector, x-vector) compresses voice characteristics into a ~256-dim vector that conditions the TTS model.
In-Context Learning
Data needed: 3-10 seconds. Similarity: excellent. Approach: feed reference audio as a prompt to an LLM.
Audio is tokenized into discrete codes. The model learns to continue generating in the same voice, like GPT continuing text.
Fine-Tuning
Data needed: 30 minutes to 2 hours. Quality: best. Approach: adapt model weights on the target voice.
Low-rank adaptation (LoRA) or full fine-tuning on target speaker data. Most accurate but requires more data.
Speaker Embedding Space
Each dot represents a speaker's voice characteristics in embedding space. Similar voices cluster together. The model learns to map reference audio to this space, then conditions generation on the resulting vector.
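A sketch of measuring voice similarity in this embedding space, assuming the resemblyzer package's d-vector speaker encoder; the audio filenames are hypothetical.

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav  # assumed d-vector speaker encoder

encoder = VoiceEncoder()

# Embed two reference clips; each becomes a fixed-size (~256-dim) voice vector.
emb_a = encoder.embed_utterance(preprocess_wav("speaker_a.wav"))  # hypothetical files
emb_b = encoder.embed_utterance(preprocess_wav("speaker_b.wav"))

# Cosine similarity: close to 1.0 for the same voice, lower for different speakers.
similarity = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
print(f"speaker similarity: {similarity:.3f}")
```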
Neural Codec Language Models (VALL-E / ElevenLabs)
How It Works
1. Audio is tokenized into discrete codes via a neural codec (EnCodec)
2. Reference audio becomes a "prompt" of audio tokens
3. An LLM predicts new audio tokens conditioned on text + prompt
4. The neural codec decodes the tokens back to audio
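A high-level sketch of that flow; every object here (codec, audio_lm, text_frontend) is a hypothetical stand-in, not a real API.

```python
# High-level sketch of a VALL-E-style codec language model (hypothetical APIs;
# real systems use a neural codec such as EnCodec plus a decoder-only transformer).

def clone_and_speak(reference_audio, text, codec, audio_lm, text_frontend):
    # 1. Tokenize the reference audio into discrete codec codes.
    prompt_tokens = codec.encode(reference_audio)
    # 2. Convert the target text into phoneme/text tokens.
    text_tokens = text_frontend(text)
    # 3. The LM continues the audio-token sequence, conditioned on text + prompt,
    #    so the continuation keeps the reference speaker's voice and style.
    new_tokens = audio_lm.generate(text_tokens, audio_prompt=prompt_tokens)
    # 4. Decode the predicted tokens back into a waveform.
    return codec.decode(new_tokens)
```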
Key Innovations
- + Zero-shot cloning from 3 seconds of audio
- + Preserves emotion, accent, and style
- + Can generate non-speech (laughter, sighs)
- ! Requires massive training data (60K+ hours)
The Complete TTS Pipeline
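A sketch of how the stages fit together; all function names are hypothetical placeholders for the components described above.

```python
# End-to-end sketch of the pipeline (hypothetical function names).

def synthesize(text: str, reference_audio=None):
    normalized = normalize_text(text)            # "Dr. Smith" -> "Doctor Smith"
    phonemes = g2p(normalized)                   # graphemes -> phoneme sequence
    speaker = embed_speaker(reference_audio)     # optional voice-cloning condition
    mel = acoustic_model(phonemes, speaker)      # phonemes (+ prosody) -> mel spectrogram
    waveform = vocoder(mel)                      # mel spectrogram -> audio samples
    return waveform
```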
Use Cases
- ✓ Voice assistants
- ✓ Audiobook generation
- ✓ Accessibility
- ✓ Video narration
Architectural Patterns
Neural TTS
End-to-end neural models for speech synthesis.
- + Natural sounding
- + Emotion control
- + Many voices
- - Compute intensive
- - Voice cloning concerns
Zero-Shot Voice Cloning
Clone any voice from a short sample.
- + Any voice
- + Minimal samples needed
- - Ethical concerns
- - Quality varies
Implementations
API Services
ElevenLabs
Industry-leading quality. Great voice cloning.
OpenAI TTS
Simple API, good quality. Limited voice options.
Benchmarks
Quick Facts
- Input: Text
- Output: Audio
- Implementations: 2 open source, 2 API
- Patterns: 2 approaches