
Text-to-Speech

Text-to-speech has undergone a striking transformation from robotic concatenative synthesis to near-human expressiveness in under five years. Commercial systems such as ElevenLabs and OpenAI's TTS, alongside open models like XTTS-v2, produce speech that most listeners cannot reliably distinguish from recordings, while Bark, Microsoft's VALL-E, and F5-TTS have demonstrated that voice cloning from roughly 3-second reference samples is now a commodity capability. The frontier has moved past intelligibility, which is effectively solved, to prosody, emotion control, and real-time streaming at under 200 ms latency for conversational AI. Evaluation remains messy: MOS (Mean Opinion Score) is subjective and expensive to collect, and automated proxies like UTMOS only loosely correlate with human preference, making benchmark comparisons unreliable.
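As a concrete illustration of the metric used throughout this page, the sketch below shows how a MOS is computed: listeners rate each utterance on a 1-5 scale, and the reported score is the mean, typically accompanied by a 95% confidence interval. The ratings here are hypothetical, not drawn from any leaderboard entry.

```python
# Minimal MOS computation: mean of 1-5 listener ratings plus a
# normal-approximation 95% confidence interval half-width.
from statistics import mean, stdev
from math import sqrt

def mos_with_ci(ratings: list[int], z: float = 1.96) -> tuple[float, float]:
    """Return (MOS, half-width of the ~95% CI) for a list of 1-5 ratings."""
    m = mean(ratings)
    half_width = z * stdev(ratings) / sqrt(len(ratings))
    return m, half_width

ratings = [5, 4, 4, 5, 3, 4, 5, 4, 4, 5]  # illustrative listener scores
score, ci = mos_with_ci(ratings)
print(f"MOS = {score:.2f} ± {ci:.2f}")  # prints "MOS = 4.30 ± 0.42"
```

In practice the gaps between top systems (e.g. 4.36 vs. 4.26 below) are smaller than typical confidence intervals from small listening panels, which is one reason MOS-based rankings are unstable.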

2 datasets · 24 results · Canonical metric: MOS
Canonical Benchmark

VCTK

Speech data from 110 English speakers with various accents. Used for multi-speaker TTS.

Primary metric: MOS

Top 10

Leading models on VCTK.

Rank  Model                 MOS   Year  Source
1     NaturalSpeech 3       4.36  2026  paper
2     Ground Truth (VCTK)   4.26  2022  paper
3     VITS                  4.21  2026  paper
4     StyleTTS2             4.19  2023  paper
5     Ground Truth (VCTK)   4.19  2022  paper
6     VALL-E 2              4.18  2026  paper
7     YourTTS               4.16  2022  paper
8     XTTS v2               4.14  2026  paper
9     YourTTS               4.07  2022  paper
10    VITS2                 3.99  2023  paper

All datasets

2 datasets tracked for this task.

Related tasks

Other tasks in Speech.

Run Inference

Looking to run a model? HuggingFace hosts inference for this task type.

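For hosted inference, a request can be sketched as below. The model id (`suno/bark-small`) and the `api-inference.huggingface.co` endpoint shape follow HuggingFace's public Inference API convention, but treat both as assumptions to verify against the current docs; substitute your own model and token.

```python
# Sketch of a HuggingFace Inference API call for text-to-speech.
# Only the request is assembled here; sending it requires a valid token.
import json

API_URL = "https://api-inference.huggingface.co/models/{model_id}"

def build_tts_request(model_id: str, text: str, token: str) -> dict:
    """Assemble URL, auth header, and JSON payload for a TTS inference call."""
    return {
        "url": API_URL.format(model_id=model_id),
        "headers": {"Authorization": f"Bearer {token}"},
        "payload": json.dumps({"inputs": text}),
    }

if __name__ == "__main__":
    # To actually send it (requires the `requests` package and an HF token):
    #   import requests
    #   req = build_tts_request("suno/bark-small", "Hello world", "hf_...")
    #   audio_bytes = requests.post(req["url"], headers=req["headers"],
    #                               data=req["payload"]).content
    req = build_tts_request("suno/bark-small", "Hello world", "hf_xxx")
    print(req["url"])
```

The response body is raw audio bytes, so it can be written directly to a file such as `out.wav` or `out.flac` depending on the model's output format.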