Best TTS for podcasts
Podcasts punish TTS in a way short-form audio doesn't. Twenty-minute stamina, natural co-host banter, unusual proper nouns, and the pacing of long pauses separate ElevenLabs v3 and Google NotebookLM from the rest of the pack.
TL;DR
- Most natural long-form solo: ElevenLabs v3 with audio tags, 4.8 MOS on 20-minute passages.
- Best two-voice show: Google NotebookLM's Audio Overview — generates a full back-and-forth from a doc. Free.
- Best production pipeline: PlayHT 3.0 with voice cloning for a branded host voice.
- Best self-host: Sesame CSM for dialogue, F5-TTS for cloned-host narration.
Prosody: why podcasts sound robotic
The reason bad TTS feels robotic on long passages is flat pitch. Natural speakers drop F0 on declaratives, rise on questions, and pause 150-400ms at semantic boundaries. The top-tier models get this right; commodity models flatten everything into a monotone.
[Figure: prosody curves (F0 in Hz plus energy envelope) for ElevenLabs v3 vs. commodity TTS (tts-1) rendering the same line: "State-space models are Transformers with selective memory — let me explain why that matters."]
F0 pitch range is roughly 50 Hz for expressive TTS and under 20 Hz for flat TTS. Prosodic breaks (||) mark where a listener expects a pause; commodity models rarely insert them.
Long-form capability radar
[Figure: capability radar comparing podcast-grade TTS models.]
Each axis scored 0-10. Higher is better. Overlay shows trade-offs.
Voice fingerprints: solo narrator
- [Audio: "Expressive long-form — wide dynamic range, formant richness"]
- [Audio: "Cloned branded voice — consistent timbre across episodes"]
Two-voice dialogue

A sketch of a two-voice episode with ElevenLabs v3 (audio tags + voice switching). The IDs in `VOICES` are placeholders — substitute your own from the ElevenLabs voice library.

```python
# Two-voice podcast with ElevenLabs v3 (audio tags + voice switching).
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="sk_...")

# Map script names to ElevenLabs voice IDs (placeholders — use your own).
VOICES = {"rachel": "your_rachel_voice_id", "adam": "your_adam_voice_id"}

script = [
    {"voice": "rachel", "text": "[warm] Welcome back. Today we're talking about state-space models."},
    {"voice": "adam", "text": "[curious] Linear attention, but make it recurrent, right?"},
    {"voice": "rachel", "text": "[laughs] Roughly. Let's actually define what 'selective' means here."},
]

# Render each turn and append to one file; convert() streams audio chunks.
with open("episode.mp3", "wb") as f:
    for turn in script:
        for chunk in client.text_to_speech.convert(
            voice_id=VOICES[turn["voice"]],
            model_id="eleven_v3",
            text=turn["text"],
            output_format="mp3_44100_128",
        ):
            f.write(chunk)
```

For a no-code approach: drop any article into Google NotebookLM and it generates a ~10-minute two-host podcast, using Gemini 2.5 Flash TTS in multi-speaker mode. Remarkably natural; limited editing knobs.
Listen: 30-second long-form clips
- [Audio: "Intro monologue to a tech podcast"]
- [Audio: "Auto-generated two-host banter"]
- [Audio: "Long-form solo narration"]
Where the quality ceiling is
Long-form TTS quality has plateaued near 4.7-4.8 MOS since late 2024. The remaining gap to human narration is in disfluencies, micro-pauses, and context-aware intonation — not timbre.
[Figure: TTS quality (MOS) per model release over time.]
MOS per release. Quality has plateaued near 4.7-4.8; the action is now in latency and steerability.
Practical long-form tactics
Scripting
- Chunk scripts into 2-4 sentence blocks — most models drift past ~500 chars.
- Spell unusual names phonetically: "Karpathy" → "kar-PATH-ee".
- Insert explicit commas for natural pauses; em-dashes for dramatic ones.
- Use audio tags (ElevenLabs v3) sparingly — overuse sounds forced.
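The chunking rule above can be sketched as a small splitter. This is a minimal sketch: `chunk_script` is a hypothetical helper, and the sentence regex is deliberately naive (it won't handle abbreviations like "Dr.").

```python
import re

def chunk_script(text, max_sents=4, max_chars=500):
    """Split a script into blocks of at most max_sents sentences / max_chars chars."""
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sents = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, cur = [], []
    for s in sents:
        candidate = " ".join(cur + [s])
        # Flush the current block when it hits either limit.
        if cur and (len(cur) >= max_sents or len(candidate) > max_chars):
            chunks.append(" ".join(cur))
            cur = [s]
        else:
            cur.append(s)
    if cur:
        chunks.append(" ".join(cur))
    return chunks
```

Feed each returned block to the API as a separate call; the model stays on-prosody because no single request exceeds the drift threshold.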
Post-production
- Normalize to -16 LUFS integrated — the podcast standard.
- High-pass at 80Hz to remove vocoder rumble.
- Render per-turn and concatenate with 350ms gaps, not a single long call.
- Re-render any turn that trips on a name; keep a pronunciation dictionary.
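The render-per-turn-and-concatenate step can be done with the stdlib `wave` module — a sketch assuming you render each turn as a WAV with identical format; `concat_turns` is a hypothetical helper (for MP3 turns, use your audio toolchain instead).

```python
import wave

def concat_turns(turn_paths, out_path, gap_ms=350):
    """Join per-turn WAV files with gap_ms of silence between turns."""
    # Take the audio format from the first turn; all turns must match it.
    with wave.open(turn_paths[0], "rb") as w:
        params = w.getparams()
    gap_frames = int(params.framerate * gap_ms / 1000)
    silence = b"\x00" * (gap_frames * params.sampwidth * params.nchannels)
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        for i, path in enumerate(turn_paths):
            if i:
                out.writeframes(silence)  # pause between turns
            with wave.open(path, "rb") as w:
                out.writeframes(w.readframes(w.getnframes()))
```

Keeping turns as separate files also makes the re-render rule cheap: regenerate only the turn that tripped on a name, then re-run the concatenation.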