Audio · the rare task where the frontier is API-only

Text-to-Speech.

Turn text into natural-sounding speech. The rare ML task where the frontier is entirely API-only and the open academic benchmarks (LJSpeech, VCTK) lag production by two years.

Below: a side-by-side comparison of 12 providers on the axes buyers actually care about — cost, latency, languages, voice cloning, license.

§ 01 · The matrix

12 providers, side by side.

Frontier API · hyperscaler cloud · open weights. Pricing shown per million characters of synthesized output.

| Provider / Model | Tier | License | Cost / 1M chars | First-byte | Langs | Cloning |
| --- | --- | --- | --- | --- | --- | --- |
| ElevenLabs: Multilingual v2 / Turbo v2.5 / v3 | Frontier | Proprietary API | $150–330/M | ~250–400 ms | 32+ | Professional |
| OpenAI: tts-1 / tts-1-hd / gpt-4o voice | Frontier | Proprietary API | $15 / $30 per M | ~500–700 ms | 50+ | |
| Cartesia: Sonic 2 / Sonic Turbo | Frontier | Proprietary API | ~$19/M | ~90–150 ms | 15+ | Instant |
| Deepgram: Aura 2 | Frontier | Proprietary API | ~$30/M | ~150–250 ms | 30+ | Limited |
| Hume: EVI 2 / Octave | Frontier | Proprietary API | Per-minute billing | Realtime | 40+ | Instant |
| Sesame: CSM-1B · Maya / Miles | Frontier | Hybrid | Demo / research | ~400 ms | English (primary) | Limited |
| Google Cloud: Studio / Neural2 / Wavenet voices | Cloud | Proprietary API | $4–$160/M | ~300–500 ms | 50+ (380+ voices) | Limited |
| Microsoft Azure: Neural TTS · HD voices | Cloud | Proprietary API | $16–$30/M | ~300–500 ms | 140+ locales | Professional |
| Amazon Web Services: Polly Neural · Long-form · Generative | Cloud | Proprietary API | $4–$30/M | ~300–600 ms | 30+ | |
| F5-TTS | Open | Open weights | Self-host | GPU-dependent | English, Chinese (+ finetune) | Instant |
| Fish Audio (open): Fish Speech 1.5 | Open | Open weights | Self-host | GPU-dependent | 8 (en, zh, ja, de, fr, es, ko, ar) | Instant |
| Coqui XTTS v2 | Open | Open weights | Self-host | GPU-dependent | 17 | Instant |

Pricing is list price per million characters as of 2026-04, rounded to the nearest meaningful tier. Most vendors negotiate at scale.
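To turn a per-million-character price into a budget line, multiply it out against your own volume. The sketch below is a minimal illustration: the workload (200,000 requests a month at 300 characters each) is a hypothetical, and the three prices are simply the low, middle, and high ends of the list prices in the matrix.

```python
def monthly_cost_usd(chars_per_request: int,
                     requests_per_month: int,
                     price_per_million_usd: float) -> float:
    """List-price cost of one month of synthesis, in US dollars."""
    total_chars = chars_per_request * requests_per_month
    return total_chars / 1_000_000 * price_per_million_usd


# Hypothetical workload: 200,000 requests/month, 300 characters each.
CHARS_PER_REQUEST = 300
REQUESTS_PER_MONTH = 200_000

# Price points taken from the range of list prices in the matrix.
for label, price in [("$4/M", 4.0), ("$30/M", 30.0), ("$330/M", 330.0)]:
    cost = monthly_cost_usd(CHARS_PER_REQUEST, REQUESTS_PER_MONTH, price)
    print(f"{label}: ${cost:,.0f} per month")
```

At 60M characters a month that works out to $240, $1,800, and $19,800 respectively, before any negotiated discount.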

§ 02 · Decision shortcuts

Which should I use?

Picking a TTS provider is a budget-shaped decision on four axes: quality, latency, voice control, and license. Shortcuts by use-case:

Best API quality

ElevenLabs v3 · OpenAI tts-1-hd

ElevenLabs leads on expressive range and voice library; OpenAI leads on consistency and safety.

Lowest latency (real-time agents)

Cartesia Sonic · Deepgram Aura 2

90–250 ms first-byte beats everything else. Built for conversational voice pipelines.

Voice cloning

ElevenLabs · Cartesia · F5-TTS

ElevenLabs professional cloning (minutes of audio, consent-verified). Cartesia instant cloning. F5-TTS for zero-shot open weights.

On-prem / compliance

Fish Speech 1.5 · F5-TTS · XTTS v2

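Run on your GPUs. Watch licenses: the released weights for F5-TTS and Fish Speech are non-commercial by default, and XTTS v2 ships under the Coqui Public Model License, which also restricts commercial use. Check the model card terms before you deploy.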

Enterprise with an MSA

Azure Neural TTS · Google Studio · AWS Polly

Already covered by your hyperscaler MSA. Azure leads on locale breadth, Google on top-tier voice quality, AWS on cost at scale.

Empathetic voice / agentic

Hume EVI 2 · Sesame CSM · gpt-4o voice

Modelled for interruption timing and emotional tone, not just naturalness.

Cheapest at scale

AWS Polly Standard · GCP Standard

$4/M for standard-tier voices. Sounds worse, but fine for IVR, accessibility, and bulk notifications.

§ 03 · Methodology

What to listen for.

MOS scores collapse a 30-second listen into a single number. If you’re evaluating providers, A/B test your own text, not marketing demos, and listen for these six things that separate polished TTS from uncanny:

Prosody

Does the stress land on the right word? A good TTS emphasizes new information; a weak one monotones everything.

Breath & pauses

Real speakers pause mid-sentence to breathe. Synthetic speech that rushes through commas sounds robotic.

Sibilance

Listen to s, sh, z sounds. Cheap TTS hisses; good TTS renders sibilants without distortion.

Disfluencies

Um, uh, and self-corrections matter for conversational AI. Most TTS scrubs them — the frontier ones model them.

Emotional range

Play the same sentence as a question, a statement, and in excitement. Most providers produce identical audio.

Long-form consistency

Run a 5-minute script. Does the voice drift in pitch or pace? Attention-based TTS famously loses the thread past 30 seconds.

§ 04 · Metrics

Why MOS scores are misleading in 2026.

MOS (Mean Opinion Score) was standardized in 1996, in ITU-T P.800, for telephony quality testing. It asks human raters to score a speech clip from 1 (bad) to 5 (excellent). For decades it was the only metric in town.

In 2026 it’s breaking because: (a) top systems saturate at 4.3–4.6 and human raters lose discrimination; (b) ratings are crowd-sourced on short clips that miss long-form failures; (c) published MOS typically uses the author’s own test set, which no two papers share.
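Point (a) is easy to see with numbers. The sketch below computes a MOS and an approximate 95% confidence interval for two systems. The ratings are invented for illustration, not measured from any provider, and the normal approximation is rough at this sample size.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, lower, upper): MOS with a normal-approximation 95% CI."""
    mean = statistics.mean(ratings)
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, mean - z * sem, mean + z * sem


def intervals_overlap(a, b):
    """True if the confidence intervals of two (mean, lower, upper) tuples overlap."""
    return a[1] <= b[2] and b[1] <= a[2]


# Invented 1-5 ratings from ten raters per system (illustration only).
system_a = [5, 4, 5, 4, 4, 5, 4, 5, 4, 4]
system_b = [4, 5, 4, 4, 5, 4, 4, 5, 4, 4]

a = mos_with_ci(system_a)
b = mos_with_ci(system_b)
print(f"A: {a[0]:.2f} [{a[1]:.2f}, {a[2]:.2f}]")
print(f"B: {b[0]:.2f} [{b[1]:.2f}, {b[2]:.2f}]")
print("Distinguishable:", not intervals_overlap(a, b))
```

When every rating is a 4 or a 5, a 0.1 gap in MOS sits well inside the interval, so the ranking it implies is mostly noise.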

The metrics buyers should trust are WER (intelligibility — pass the synthesized audio through ASR, compare to ground truth), SECS (speaker similarity for cloning), first-byte latency measured from your own network, and blind AB preference against a real human baseline.
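Two of these metrics are a few lines of code once you have the inputs. The sketch below computes WER as word-level edit distance and SECS as cosine similarity between speaker embeddings. It assumes you already have an ASR transcript of the synthesized audio and embeddings from a speaker encoder of your choice; neither step is shown, and the function names are illustrative.

```python
import math


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length.

    Normalization here is only lowercasing and whitespace splitting;
    punctuation is not stripped.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def secs(embedding_a, embedding_b) -> float:
    """Speaker similarity as cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(embedding_a, embedding_b))
    norm_a = math.sqrt(sum(x * x for x in embedding_a))
    norm_b = math.sqrt(sum(y * y for y in embedding_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# The script you sent to the TTS vs. what ASR heard back (hypothetical strings).
print(wer("the quick brown fox", "the quick brown box"))
```

Run both on the same scripts for every provider, with the same ASR model and the same speaker encoder, so differences reflect the TTS and not the measuring tools.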

Vendor-published MOS is not reliable enough to build a ranking on. That’s why the comparison matrix above uses operational axes — cost, latency, features — not MOS.

§ 05 · Academic datasets

The training corpora that built the field.

Useful for training open-weights models and reproducible research. Frontier API providers don’t train on these — they use proprietary voice-actor corpora orders of magnitude larger.

LJSpeech

24 hours · 13K utterances · 1 speaker · 2017

Single female English speaker reading public-domain books. The canonical TTS training set for a generation — small, clean, copyright-safe.


VCTK

44 hours · 110 speakers · English · 2017

Multi-speaker corpus designed for voice-cloning research. Regional English accents. Canonical benchmark for zero-shot speaker conditioning.


LibriTTS

585 hours · 2,456 speakers · English · 2019

Cleaned subset of LibriSpeech with original punctuation and casing preserved. The scale-up training set for modern open-weights TTS.


Common Voice

30,000+ hours · 100+ languages · crowdsourced · 2019

Mozilla's ongoing multilingual speech corpus. The go-to for multilingual open-weights TTS — though audio quality varies widely.

§ 06 · Practical tips

Five rules for shipping TTS in 2026.

Don’t train from scratch. Frontier quality requires tens of thousands of hours of studio-grade voice acting. Unless you have a niche (a specific language, a specific voice type), start from XTTS v2 or Fish Speech 1.5 and finetune.

Latency is priced in. Cartesia and Deepgram lead on first-byte; ElevenLabs Turbo is close. For conversational agents the 150 ms line is the difference between natural and awkward — worth paying for.
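Published first-byte numbers are measured from the vendor's network, not yours. A minimal sketch for timing it yourself follows. It assumes your vendor client can be wrapped as a function that returns an iterator of audio chunks; `fake_stream` is a stand-in for that call, not a real API.

```python
import time
from typing import Callable, Iterable, Tuple


def time_to_first_byte(start_stream: Callable[[], Iterable[bytes]]) -> Tuple[float, float]:
    """Return (first_byte_seconds, total_seconds) for one streamed synthesis."""
    t0 = time.perf_counter()
    first = None
    for chunk in start_stream():
        # Record the moment the first non-empty audio chunk arrives.
        if first is None and chunk:
            first = time.perf_counter() - t0
    total = time.perf_counter() - t0
    if first is None:
        raise RuntimeError("stream produced no audio")
    return first, total


def fake_stream():
    """Stand-in for a vendor streaming call (hypothetical): two delayed chunks."""
    time.sleep(0.05)
    yield b"\x00" * 320
    time.sleep(0.05)
    yield b"\x00" * 320


first, total = time_to_first_byte(fake_stream)
print(f"first byte: {first * 1000:.0f} ms, full audio: {total * 1000:.0f} ms")
```

Start the clock before the request is sent, run it many times from the region your users are in, and look at the slow tail as well as the median.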

Voice cloning is a consent problem, not a tech problem. The tech works. The risk is legal: impersonation, deepfake audio, brand hijacking. ElevenLabs Professional Cloning and Azure Custom Neural Voice both require consent-verified onboarding. If your vendor doesn’t, that’s a red flag.

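Stream and cache. Streaming cuts perceived latency by about half, because playback starts at the first audio chunk instead of after full synthesis. Cache by hash(text + voice_id + params): 20–40% of requests in a production app are repeats, and output for identical inputs is consistent enough to serve from cache.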
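The cache key can be sketched as below. It serializes the inputs to canonical JSON before hashing rather than concatenating strings, so that ("ab", "c") and ("a", "bc") cannot collide and parameter order does not matter. The `TTSCache` class and the `synthesize` callable are hypothetical; in production the store would be something like object storage or Redis rather than a dict.

```python
import hashlib
import json


def tts_cache_key(text: str, voice_id: str, params: dict) -> str:
    """Stable SHA-256 key over everything that affects the audio."""
    payload = json.dumps(
        {"text": text, "voice_id": voice_id, "params": params},
        sort_keys=True,           # parameter order must not change the key
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTSCache:
    """In-memory cache in front of a synthesize(text, voice_id, params) callable."""

    def __init__(self, synthesize):
        self._synthesize = synthesize
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str, voice_id: str, params: dict) -> bytes:
        key = tts_cache_key(text, voice_id, params)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._synthesize(text, voice_id, params)
        return self._store[key]


# Hypothetical synthesize function standing in for a vendor call.
cache = TTSCache(lambda text, voice_id, params: f"audio:{text}".encode("utf-8"))
cache.get("Your order has shipped.", "voice-1", {"speed": 1.0})
cache.get("Your order has shipped.", "voice-1", {"speed": 1.0})
print(cache.hits, cache.misses)
```

Include the model version in `params` if your vendor exposes one, so a model upgrade does not keep serving old audio.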

Evaluate on your own text. Vendor demos are hand-picked. Write 10 scripts from your actual domain — technical terms, names, long sentences, emotional beats — and AB them blind.
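A blind A/B needs two things: a randomized playback order so listeners cannot learn which provider comes first, and a check on whether the resulting split could be chance. The sketch below is one way to do both, using an exact two-sided sign test; the function names and the "A"/"B" labels are illustrative.

```python
import math
import random


def blind_pairs(script_ids, seed=0):
    """For each script, randomly choose which provider is played first."""
    rng = random.Random(seed)
    return [
        (script_id, ("A", "B") if rng.random() < 0.5 else ("B", "A"))
        for script_id in script_ids
    ]


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Exact two-sided binomial test against a 50/50 null. Ties are excluded."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Playback order for ten scripts; keep the mapping hidden from listeners.
for script_id, order in blind_pairs(range(1, 11), seed=42):
    print(script_id, order)

print(sign_test_p_value(8, 2))
```

An 8 to 2 split over ten scripts gives p ≈ 0.11, which is suggestive but not conclusive. Adding listeners per script gets you more votes without writing more scripts.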

For vendors

Run a TTS product? Claim your listing.

CodeSOTA’s TTS comparison is read by engineers evaluating providers. If you represent one of the vendors above — or a provider we missed — claim the listing to submit verified pricing, latency benchmarks, voice samples, and a demo link. Free; credibility-gated, not pay-to-play.
