Best TTS for audiobooks
An audiobook is ten hours of the same voice. Tiny artifacts that a podcast listener forgives become intolerable at this length. The tools that work are the ones with SSML, custom lexicons, and character-voice switching.
TL;DR
- Most natural: ElevenLabs v3 for narration, with audio tags for emotion beats.
- Best SSML + lexicon control: Azure Neural HD or Google Chirp 3 HD, with full SSML 1.1 + PLS.
- Best branded narrator: PlayHT 3.0 with a cloned voice, consistent across a book series.
- ACX / Audible: fully AI-narrated submissions are now accepted in the ACX AI Narration pilot; labeling is required.
The audiobook pipeline
A manuscript does not go directly into a TTS endpoint. Every production-grade audiobook workflow has a splitter, an SSML authoring step, and a mastering pass.
[Figure: architecture of a production-grade audiobook pipeline.]
Skip any stage and you get a long, consistent monotone.
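As a rough illustration of the splitter stage, the sketch below cuts a manuscript into chapters and then packs whole sentences into request-sized chunks. The `Chapter ...` heading convention, the `MAX_CHARS` limit, and both function names are assumptions made for this example, not part of any provider's API; check your own manuscript format and endpoint limits.

```python
import re

# Hypothetical request-size limit; real limits vary by TTS provider.
MAX_CHARS = 3000


def split_chapters(manuscript: str) -> list[str]:
    """Split before every line that starts with 'Chapter' (assumed convention)."""
    parts = re.split(r"(?m)^(?=Chapter\s+\S+)", manuscript)
    return [p.strip() for p in parts if p.strip()]


def chunk_chapter(chapter: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", chapter.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Splitting only at sentence boundaries keeps each request self-contained, so the engine never has to guess the intonation of half a sentence.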
Audiobook capability radar
[Figure: capability radar covering SSML, lexicon, and character control. Each axis is scored 0-10; higher is better, and the overlay shows trade-offs.]
Character voices, same scene
Compare how two character voices differ on the same dialogue. The heroine's pitch (F0) sits at 140-200 Hz with a wide range; the villain's sits at 70-115 Hz and is compressed. Good audiobook TTS lets you set voice presets at this granularity.
[Figure: prosody curves (F0 in Hz plus energy envelope) for the two lines below.]
Narrator (female protagonist): “I have traveled farther than you know, and I am not afraid.”
Narrator (gravelly antagonist): “Then you have not traveled far enough.”
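Per-character presets like the two voices compared here can be kept as data and rendered to SSML at build time. The following is a minimal sketch using only the Python standard library; the `VoicePreset` class, the `PRESETS` table, and the specific rate and pitch values are hypothetical and would need tuning by ear for a real book.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


@dataclass(frozen=True)
class VoicePreset:
    voice: str  # provider voice name
    rate: str   # SSML prosody rate, e.g. "-10%"
    pitch: str  # SSML prosody pitch, e.g. "-2st"


# Hypothetical presets; lock these before production starts.
PRESETS = {
    "heroine": VoicePreset("en-US-AriaNeural", "-5%", "+1st"),
    "villain": VoicePreset("en-US-GuyNeural", "-10%", "-3st"),
}


def scene_to_ssml(lines: list[tuple[str, str]], lang: str = "en-US") -> str:
    """Render (character, text) pairs as one SSML document."""
    # Serialize SSML elements without a prefix (global ElementTree setting).
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang},
    )
    for character, text in lines:
        preset = PRESETS[character]  # KeyError flags an unknown character
        voice = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"name": preset.voice})
        prosody = ET.SubElement(
            voice,
            f"{{{SSML_NS}}}prosody",
            {"rate": preset.rate, "pitch": preset.pitch},
        )
        prosody.text = text  # ElementTree escapes &, <, > on output
    return ET.tostring(speak, encoding="unicode")
```

Looking presets up by character name means a typo in the script fails loudly at build time instead of silently falling back to the narrator voice.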
SSML: the power tool you can't skip
Full SSML 1.1 (breaks, prosody rate/pitch, emphasis, say-as) is table stakes for audiobooks. Azure and Google support it fully; ElevenLabs supports a subset; OpenAI supports none.
<!-- Azure Neural HD SSML: two-character scene with prosody control -->
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="narration-relaxed" xmlns:mstts="http://www.w3.org/2001/mstts">
      The house was dark when she arrived.
      <break time="400ms"/>
      <prosody rate="-10%" pitch="-2st">
        Something was wrong.
      </prosody>
    </mstts:express-as>
  </voice>
  <voice name="en-US-GuyNeural">
    <mstts:express-as style="shouting" xmlns:mstts="http://www.w3.org/2001/mstts">
      "Who's there?"
    </mstts:express-as>
  </voice>
</speak>
Pronunciation lexicons
A single mispronounced character name ruins a chapter. Both Azure and Google let you attach a PLS (Pronunciation Lexicon Specification) file. Build one per book and version it with the manuscript.
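One way to keep the lexicon versioned alongside the manuscript is to generate the PLS file from a plain mapping checked into the same repository. The sketch below assumes a simple grapheme-to-IPA dictionary; `build_lexicon` is a hypothetical helper written for this example, not a provider API.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_lexicon(entries: dict[str, str], lang: str = "en-US") -> str:
    """Render a grapheme -> IPA mapping as a PLS 1.0 document."""
    # Serialize PLS elements without a prefix (global ElementTree setting).
    ET.register_namespace("", PLS_NS)
    lexicon = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": "ipa", f"{{{XML_NS}}}lang": lang},
    )
    # Sort entries so regenerated files produce small, stable diffs.
    for grapheme, ipa in sorted(entries.items()):
        lexeme = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
        ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme").text = ipa
    ET.indent(lexicon)  # pretty-print; requires Python 3.9+
    return ET.tostring(lexicon, encoding="unicode")
```

Calling `build_lexicon({"Daenerys": "dəˈnɛɹɪs"})` returns the XML as a string, which can be written to a file such as `lexicon.xml` and committed with the chapter text.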
<!-- Google Cloud TTS with custom pronunciation lexicon (PLS). -->
<!-- lexicon.xml pins the pronunciation of rare character/location names. -->
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>Daenerys</grapheme>
    <phoneme>dəˈnɛɹɪs</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>Yr Wyddfa</grapheme>
    <phoneme>ər ˈwɪðva</phoneme>
  </lexeme>
</lexicon>
Listen: 60-second scene
[Audio samples: a two-character scene (narrator, heroine, antagonist), followed by the same scene rendered with Azure Neural HD, Google Chirp 3 HD, and a cloned branded narrator.]
ACX / Audible readiness checklist
Technical
- -23 to -18 dB RMS, -3 dB peak ceiling.
- 192 kbps (or higher) CBR MP3, 44.1 kHz, mono or stereo.
- 0.5 to 1 s of room tone at the head and 1 to 5 s at the tail of each file.
- Max 120 minutes per file. Chapter breaks on scene transitions.
- Room tone floor < -60 dB — not pure digital silence.
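The level and length limits in this list are easy to check automatically before upload. The sketch below is a minimal version that assumes the audio has already been decoded to floating-point samples in the range -1.0 to 1.0 (decoding is out of scope here); `dbfs` and `check_levels` are hypothetical helper names, and the noise-floor check is omitted because it requires locating the silent passages first.

```python
import math


def dbfs(value: float) -> float:
    """Convert a linear amplitude (1.0 = full scale) to dBFS."""
    return 20 * math.log10(value) if value > 0 else float("-inf")


def check_levels(samples: list[float], sample_rate: int = 44100) -> dict[str, bool]:
    """Check RMS, peak, and duration against the limits listed above.

    Assumes a non-empty list of mono samples in the range -1.0..1.0.
    """
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    peak = max(abs(s) for s in samples)
    minutes = len(samples) / sample_rate / 60
    return {
        "rms_ok": -23.0 <= dbfs(rms) <= -18.0,  # -23 to -18 dB RMS
        "peak_ok": dbfs(peak) <= -3.0,          # -3 dB peak ceiling
        "length_ok": minutes <= 120,            # max 120 minutes per file
    }
```

Running this on every rendered chapter catches loudness drift between re-renders, which is otherwise easy to miss until the mastering pass.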
Editorial
- Per-character voice presets locked before production.
- Pronunciation lexicon reviewed by author.
- Disclose AI narration (ACX AI Pilot requirement).
- Human listen pass required — catches ~5-10 re-renders per chapter.
- Add emotion / pace tags at dramatic beats.