Best TTS for podcasts
Podcasts punish TTS in a way short-form audio doesn't. Twenty-minute stamina, natural co-host banter, unusual proper nouns, and the pacing of long pauses separate ElevenLabs v3 and Google NotebookLM from the rest of the pack.
TL;DR
- Most natural long-form solo: ElevenLabs v3 with audio tags, 4.8 MOS on 20-minute passages.
- Best two-voice show: Google NotebookLM's Audio Overview — generates a full back-and-forth from a doc. Free.
- Best production pipeline: PlayHT 3.0 with voice cloning for a branded host voice.
- Best self-host: Sesame CSM for dialogue, F5-TTS for cloned-host narration.
Prosody: why podcasts sound robotic
The reason bad TTS feels robotic on long passages is flat pitch. Natural speakers drop F0 on declaratives, rise on questions, and pause 150-400ms at semantic boundaries. The top-tier models get this right; commodity models flatten everything into a monotone.
[Figure: prosody curves (F0 in Hz plus energy envelope) for ElevenLabs v3 vs. commodity TTS (tts-1) rendering the same line: "State-space models are Transformers with selective memory — let me explain why that matters."]
F0 pitch range is roughly 50 Hz for expressive TTS and under 20 Hz for flat TTS. Prosodic breaks (||) mark where a listener expects a pause; commodity models rarely insert them.
Long-form capability radar
[Figure: capability radar comparing podcast-grade TTS models.]
Each axis scored 0-10. Higher is better. Overlay shows trade-offs.
Voice fingerprints: solo narrator
- [Audio: "Expressive long-form — wide dynamic range, formant richness"]
- [Audio: "Cloned branded voice — consistent timbre across episodes"]
Two-voice dialogue

A sketch of a two-voice episode with ElevenLabs v3 (audio tags + voice switching). The IDs in `VOICES` are placeholders — substitute your own from the ElevenLabs voice library.

```python
# Two-voice podcast with ElevenLabs v3 (audio tags + voice switching).
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="sk_...")

# Map script names to ElevenLabs voice IDs (placeholders — use your own).
VOICES = {"rachel": "your_rachel_voice_id", "adam": "your_adam_voice_id"}

script = [
    {"voice": "rachel", "text": "[warm] Welcome back. Today we're talking about state-space models."},
    {"voice": "adam", "text": "[curious] Linear attention, but make it recurrent, right?"},
    {"voice": "rachel", "text": "[laughs] Roughly. Let's actually define what 'selective' means here."},
]

# Render each turn and append to one file; convert() streams audio chunks.
with open("episode.mp3", "wb") as f:
    for turn in script:
        for chunk in client.text_to_speech.convert(
            voice_id=VOICES[turn["voice"]],
            model_id="eleven_v3",
            text=turn["text"],
            output_format="mp3_44100_128",
        ):
            f.write(chunk)
```

For a no-code approach: drop any article into Google NotebookLM and it generates a ~10-minute two-host podcast, using Gemini 2.5 Flash TTS in multi-speaker mode. Remarkably natural; limited editing knobs.
Listen: 30-second long-form clips
- [Audio: "Intro monologue to a tech podcast"]
- [Audio: "Auto-generated two-host banter"]
- [Audio: "Long-form solo narration"]
Where the quality ceiling is
Long-form TTS quality has plateaued near 4.7-4.8 MOS since late 2024. The remaining gap to human narration is in disfluencies, micro-pauses, and context-aware intonation — not timbre.
[Figure: TTS quality (MOS) per model release over time.]
MOS per release. Quality has plateaued near 4.7-4.8; the action is now in latency and steerability.
Practical long-form tactics
Scripting
- Chunk scripts into 2-4 sentence blocks — most models drift past ~500 chars.
- Spell unusual names phonetically: "Karpathy" → "kar-PATH-ee".
- Insert explicit commas for natural pauses; em-dashes for dramatic ones.
- Use audio tags (ElevenLabs v3) sparingly — overuse sounds forced.
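The chunking rule above can be sketched as a small splitter. This is a minimal sketch: `chunk_script` is a hypothetical helper, and the sentence regex is deliberately naive (it won't handle abbreviations like "Dr.").

```python
import re

def chunk_script(text, max_sents=4, max_chars=500):
    """Split a script into blocks of at most max_sents sentences / max_chars chars."""
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sents = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, cur = [], []
    for s in sents:
        candidate = " ".join(cur + [s])
        # Flush the current block when it hits either limit.
        if cur and (len(cur) >= max_sents or len(candidate) > max_chars):
            chunks.append(" ".join(cur))
            cur = [s]
        else:
            cur.append(s)
    if cur:
        chunks.append(" ".join(cur))
    return chunks
```

Feed each returned block to the API as a separate call; the model stays on-prosody because no single request exceeds the drift threshold.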
Post-production
- Normalize to -16 LUFS integrated — the podcast standard.
- High-pass at 80Hz to remove vocoder rumble.
- Render per-turn and concatenate with 350ms gaps, not a single long call.
- Re-render any turn that trips on a name; keep a pronunciation dictionary.
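The render-per-turn-and-concatenate step can be done with the stdlib `wave` module — a sketch assuming you render each turn as a WAV with identical format; `concat_turns` is a hypothetical helper (for MP3 turns, use your audio toolchain instead).

```python
import wave

def concat_turns(turn_paths, out_path, gap_ms=350):
    """Join per-turn WAV files with gap_ms of silence between turns."""
    # Take the audio format from the first turn; all turns must match it.
    with wave.open(turn_paths[0], "rb") as w:
        params = w.getparams()
    gap_frames = int(params.framerate * gap_ms / 1000)
    silence = b"\x00" * (gap_frames * params.sampwidth * params.nchannels)
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        for i, path in enumerate(turn_paths):
            if i:
                out.writeframes(silence)  # pause between turns
            with wave.open(path, "rb") as w:
                out.writeframes(w.readframes(w.getnframes()))
```

Keeping turns as separate files also makes the re-render rule cheap: regenerate only the turn that tripped on a name, then re-run the concatenation.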