Text to Speech
Convert text to natural-sounding speech. Powers voice assistants, audiobooks, and accessibility features.
How Text-to-Speech Works
A deep-dive into modern speech synthesis. From text preprocessing to neural vocoders, understand how systems like ElevenLabs achieve human-quality speech.
Text Normalization
Raw text contains ambiguities that must be resolved before synthesis. Numbers, abbreviations, and symbols need explicit pronunciation rules.
Common Normalization Rules
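A minimal sketch of the kind of rules involved, assuming the third-party num2words package for number expansion; the abbreviation table and regexes below are illustrative, not a production normalizer.

```python
import re

from num2words import num2words  # assumed third-party package for number expansion

# Illustrative subset of an abbreviation table; real front-ends have thousands of entries.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Saint", "etc.": "et cetera"}

def normalize(text: str) -> str:
    # Expand abbreviations first, so their periods are not mistaken for sentence ends.
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Currency: "$5" -> "five dollars".
    text = re.sub(r"\$(\d+)", lambda m: f"{num2words(int(m.group(1)))} dollars", text)
    # Remaining integers: "3" -> "three". (Dates, ordinals, phone numbers, and years
    # need context-dependent rules that this toy version skips.)
    text = re.sub(r"\d+", lambda m: num2words(int(m.group(0))), text)
    return text

print(normalize("Dr. Smith paid $5 on Jan 3"))
# -> "Doctor Smith paid five dollars on Jan three"
```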
Grapheme-to-Phoneme (G2P)
Convert written text (graphemes) to pronunciation symbols (phonemes). English is notoriously irregular - "ough" has 9+ different pronunciations.
Dictionary Lookup
CMU Pronouncing Dictionary: 134K words with phonetic transcriptions
Rule-Based
Letter-to-sound rules: "ph" -> /f/, "tion" -> /ʃən/
Neural G2P
Seq2seq transformer: learns patterns from data
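A toy sketch of how the first two approaches combine: look the word up in a lexicon, and fall back to letter-to-sound rules when it is missing. The mini-lexicon and rule set are illustrative; real systems use the full CMU dictionary plus a trained neural fallback.

```python
# Toy G2P: dictionary lookup with a naive rule-based fallback.
# The lexicon entries mimic CMUdict-style ARPAbet transcriptions.
LEXICON = {
    "speech": ["S", "P", "IY1", "CH"],
    "through": ["TH", "R", "UW1"],
    "tough": ["T", "AH1", "F"],
}

# A few letter-to-sound rules, applied longest-match-first (highly simplified).
RULES = [("ough", ["AH1", "F"]), ("ph", ["F"]), ("th", ["TH"]), ("ch", ["CH"]),
         ("a", ["AE1"]), ("e", ["EH1"]), ("i", ["IH1"]), ("o", ["AA1"]), ("u", ["AH1"])]

def g2p(word: str) -> list[str]:
    word = word.lower()
    if word in LEXICON:                      # 1) dictionary lookup
        return LEXICON[word]
    phones, i = [], 0
    while i < len(word):                     # 2) rule-based fallback
        for pattern, ph in RULES:
            if word.startswith(pattern, i):
                phones += ph
                i += len(pattern)
                break
        else:
            phones.append(word[i].upper())   # unknown letter: pass through
            i += 1
    return phones

print(g2p("through"), g2p("photo"))
```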
The IPA (International Phonetic Alphabet)
Prosody: The Music of Speech
Prosody controls how we say something, not just what we say. The difference between "You're going?" (question) and "You're going." (statement) is prosody.
Pitch Contour (F0) Visualization
Statements typically fall in pitch at the end. Questions rise. This is why TTS that relies on the text alone often sounds unnatural - plain text underdetermines the intended prosody.
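A small illustration of the idea using stylized F0 curves; the numbers are invented for the sake of the example, while real models predict F0 per frame or per phoneme.

```python
import numpy as np

# Sketch: stylized F0 (pitch) contours for the same words said as a
# statement vs. a question.
t = np.linspace(0.0, 1.0, 100)            # normalized utterance time
base = 120.0                              # speaker's average pitch in Hz (illustrative)

statement_f0 = base + 30 * np.exp(-3 * t) - 25 * t           # declination + final fall
question_f0 = base + 10 * np.sin(2 * np.pi * t) + 60 * t**3  # final rise

print(f"statement ends near {statement_f0[-1]:.0f} Hz, question near {question_f0[-1]:.0f} Hz")
```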
Acoustic Model: The Heart of TTS
The acoustic model converts linguistic features (phonemes, prosody) into audio representations. This is where the "magic" happens - neural networks learn the complex mapping from text to sound.
The Evolution of TTS
- Concatenative synthesis: splice recorded speech units
- Statistical parametric: HMM/DNN models predict acoustic features
- Neural autoregressive: RNN + attention models generate mel frames (Tacotron 2)
- Non-autoregressive: parallel generation with a duration predictor (FastSpeech 2)
- Codec language models: an LLM predicts discrete audio tokens (VALL-E)
Tacotron 2 (Autoregressive)
FastSpeech 2 (Non-Autoregressive)
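A minimal sketch of the length regulator at the core of non-autoregressive models like FastSpeech 2: predicted per-phoneme durations expand the phoneme encodings onto the mel-frame timeline, so all frames can be decoded in parallel. PyTorch, unbatched for clarity.

```python
import torch

def length_regulate(phoneme_hidden: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """FastSpeech-style length regulator (simplified, unbatched).

    phoneme_hidden: [num_phonemes, hidden_dim] encoder outputs
    durations:      [num_phonemes] predicted mel-frame counts per phoneme
    returns:        [total_frames, hidden_dim], ready for the mel decoder
    """
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)

hidden = torch.randn(4, 256)            # 4 phonemes, 256-dim encodings
durs = torch.tensor([3, 7, 2, 5])       # frames each phoneme should span
frames = length_regulate(hidden, durs)
print(frames.shape)                     # torch.Size([17, 256])
```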
Mel Spectrograms: The Audio Image
A mel spectrogram is a 2D representation of audio: time on x-axis, frequency (mel-scaled) on y-axis, intensity as color. It's how neural networks "see" sound.
Interactive Mel Spectrogram
Why Mel Scale?
Human hearing is logarithmic - we perceive the difference between 100Hz and 200Hz as similar to 1000Hz and 2000Hz. The mel scale matches this perception.
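A small sketch of the HTK-style mel formula, showing how evenly spaced points on the mel axis map to increasingly wide frequency bands at the high end.

```python
import numpy as np

def hz_to_mel(f_hz):
    # HTK-style mel formula: roughly linear below ~1 kHz, logarithmic above.
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Equally spaced points on the mel axis map to increasingly wide Hz bands,
# mirroring how hearing compresses high frequencies.
mel_points = np.linspace(hz_to_mel(0), hz_to_mel(8000), 11)
print(np.round(mel_to_hz(mel_points)))  # band edges bunch at low Hz, spread at high Hz
```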
Typical Parameters
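A sketch of computing a mel spectrogram with librosa, using parameter values that are common in neural TTS front-ends (22.05 kHz audio, 1024-point FFT, 256-sample hop, 80 mel bands); the input filename is hypothetical.

```python
import librosa
import numpy as np

y, sr = librosa.load("reference.wav", sr=22050)   # hypothetical input file

mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80, fmin=0, fmax=8000)
mel_db = librosa.power_to_db(mel, ref=np.max)      # log-compress, as models expect

print(mel_db.shape)   # (80 mel bands, num_frames): the "image" the acoustic model predicts
```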
Reading a Mel Spectrogram
Vocoder: Mel to Waveform
The vocoder converts mel spectrograms back into audio waveforms. This "neural audio synthesis" step is crucial for quality - bad vocoders make everything sound robotic.
Waveform Reconstruction
Vocoder Comparison
| Vocoder | Type | Speed (RTF) | Quality | Notes |
|---|---|---|---|---|
| Griffin-Lim | Classical | 0.001x | Poor | Iterative phase reconstruction |
| WaveNet | Autoregressive | 0.01x | Excellent | Sample-by-sample; 16,000 sequential steps per second of audio |
| WaveGlow | Flow-based | 0.5x | Very Good | Parallel, large model |
| HiFi-GAN | GAN | 50x+ | Excellent | Fast, high quality - industry standard |
| Vocos | GAN | 100x+ | Excellent | Fourier-based, very efficient |
RTF = Real-Time Factor (>1x = faster than real-time)
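For a concrete baseline, here is a sketch of classical Griffin-Lim reconstruction via librosa's mel inversion (filenames are hypothetical). Neural vocoders such as HiFi-GAN replace this iterative phase estimation with a learned generator.

```python
import librosa
import soundfile as sf

y, sr = librosa.load("reference.wav", sr=22050)    # hypothetical input
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# Invert the mel spectrogram; Griffin-Lim estimates the missing phase iteratively.
recovered = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=1024, hop_length=256, n_iter=32)

sf.write("reconstructed.wav", recovered, sr)       # audibly "phasey" vs. neural vocoders
```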
HiFi-GAN: The Industry Standard
Voice Cloning: Zero-Shot TTS
Modern systems like ElevenLabs can clone a voice from just seconds of audio. This is the cutting edge of TTS - combining speaker characteristics with arbitrary text.
Speaker Embedding
Data needed: 5-30 seconds. Similarity: good. Approach: extract a fixed-size vector from reference audio.
A speaker encoder (e.g., d-vector, x-vector) compresses voice characteristics into a ~256-dim vector that conditions the TTS model.
In-Context Learning
Data needed: 3-10 seconds. Similarity: excellent. Approach: feed reference audio as a prompt to an LLM.
Audio is tokenized into discrete codes. The model learns to continue generating in the same voice, like GPT continuing text.
Fine-Tuning
Data needed: 30 minutes to 2 hours. Quality: best. Approach: adapt model weights on the target voice.
Low-rank adaptation (LoRA) or full fine-tuning on target speaker data. Most accurate but requires more data.
Speaker Embedding Space
Each dot represents a speaker's voice characteristics in embedding space. Similar voices cluster together. The model learns to map reference audio to this space, then conditions generation on the resulting vector.
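A sketch of measuring voice similarity in this embedding space, assuming the resemblyzer package's d-vector speaker encoder; the audio filenames are hypothetical.

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav  # assumed d-vector speaker encoder

encoder = VoiceEncoder()

# Embed two reference clips; each becomes a fixed-size (~256-dim) voice vector.
emb_a = encoder.embed_utterance(preprocess_wav("speaker_a.wav"))  # hypothetical files
emb_b = encoder.embed_utterance(preprocess_wav("speaker_b.wav"))

# Cosine similarity: close to 1.0 for the same voice, lower for different speakers.
similarity = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
print(f"speaker similarity: {similarity:.3f}")
```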
Neural Codec Language Models (VALL-E / ElevenLabs)
How It Works
1. Audio is tokenized into discrete codes via a neural codec (EnCodec)
2. Reference audio becomes a "prompt" of audio tokens
3. An LLM predicts new audio tokens conditioned on text + prompt
4. The neural codec decodes the tokens back to audio
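A high-level sketch of that flow; every object here (codec, audio_lm, text_frontend) is a hypothetical stand-in, not a real API.

```python
# High-level sketch of a VALL-E-style codec language model (hypothetical APIs;
# real systems use a neural codec such as EnCodec plus a decoder-only transformer).

def clone_and_speak(reference_audio, text, codec, audio_lm, text_frontend):
    # 1. Tokenize the reference audio into discrete codec codes.
    prompt_tokens = codec.encode(reference_audio)
    # 2. Convert the target text into phoneme/text tokens.
    text_tokens = text_frontend(text)
    # 3. The LM continues the audio-token sequence, conditioned on text + prompt,
    #    so the continuation keeps the reference speaker's voice and style.
    new_tokens = audio_lm.generate(text_tokens, audio_prompt=prompt_tokens)
    # 4. Decode the predicted tokens back into a waveform.
    return codec.decode(new_tokens)
```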
Key Innovations
- + Zero-shot cloning from 3 seconds of audio
- + Preserves emotion, accent, and style
- + Can generate non-speech (laughter, sighs)
- ! Requires massive training data (60K+ hours)
The Complete TTS Pipeline
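A sketch of how the stages fit together; all function names are hypothetical placeholders for the components described above.

```python
# End-to-end sketch of the pipeline (hypothetical function names).

def synthesize(text: str, reference_audio=None):
    normalized = normalize_text(text)            # "Dr. Smith" -> "Doctor Smith"
    phonemes = g2p(normalized)                   # graphemes -> phoneme sequence
    speaker = embed_speaker(reference_audio)     # optional voice-cloning condition
    mel = acoustic_model(phonemes, speaker)      # phonemes (+ prosody) -> mel spectrogram
    waveform = vocoder(mel)                      # mel spectrogram -> audio samples
    return waveform
```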
Use Cases
- ✓ Voice assistants
- ✓ Audiobook generation
- ✓ Accessibility
- ✓ Video narration
Architectural Patterns
Neural TTS
End-to-end neural models for speech synthesis.
- + Natural sounding
- + Emotion control
- + Many voices
- - Compute intensive
- - Voice cloning concerns
Zero-Shot Voice Cloning
Clone any voice from a short sample.
- + Any voice
- + Minimal samples needed
- - Ethical concerns
- - Quality varies
Implementations
API Services
ElevenLabs
Industry-leading quality. Great voice cloning.
OpenAI TTS
Simple API, good quality. Limited voice options.
Benchmarks
Quick Facts
- Input: Text
- Output: Audio
- Implementations: 2 open source, 2 API
- Patterns: 2 approaches