Measured by CodeSOTA · v1 · 2026-04-14

TTS Vendor Evaluation: Kokoro v1.0 vs Gradium TTS (default)

First-party evaluation of production TTS systems, measured in this repo with the same code, test set, and metrics for every vendor. Quality is scored with UTMOS22-strong and intelligibility with a Whisper large-v3-turbo WER round-trip, on 50 Harvard sentences. Reproducible with scripts/tts_synth_*.py and scripts/tts_score.py.

Methodology

Test Set: First 50 Harvard Sentences (IEEE Rec. Pub. No. 297); Public domain, speech-intelligibility standard. Phonetically balanced.
Sample Count: 50 sentences per vendor
Quality Metric (MOS): UTMOS22-strong (Sarulab, VoiceMOS Challenge 2022 winner); Automatic MOS predictor trained specifically on TTS naturalness ratings. Loaded via torch.hub from tarepan/SpeechMOS. 1-5 scale, higher is better.
Intelligibility Metric (WER): Whisper large-v3-turbo round-trip; Each synthesized clip is transcribed by Whisper, and WER is computed against the reference text after lowercasing and punctuation stripping (jiwer). Catches dropped words, hallucinations, and phoneme errors.
RTF (Real-Time Factor): audio_s / synth_s, per-item mean, so values above 1× are faster than real time; Gradium: HTTPS POST from EU region (network-bound). Kokoro: local torch CPU inference on Apple Silicon.
Audio Handling: Downsampled to 16 kHz mono before scoring; UTMOS and Whisper both operate natively at 16 kHz. Resampling is the intended input pipeline, not a lossy workaround.
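
The WER round-trip described above can be sketched in a few lines. This is a minimal illustration, not the code in scripts/tts_score.py: the names `normalize` and `word_error_rate` are hypothetical, and a plain word-level edit distance stands in for jiwer so the block has no dependencies. The hypothesis string is assumed to be the Whisper transcription of one synthesized clip.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, and split into words.

    Note: stripping punctuation also removes apostrophes ("it's" -> "its").
    """
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "The birch canoe slid on the smooth planks."
print(word_error_rate(reference, "the birch canoe slid on the smooth planks"))
print(word_error_rate(reference, "The birch canoe slid on smooth planks."))
```

The first call differs from the reference only in casing and punctuation, so normalization removes the difference. The second drops one of eight reference words. The source does not say whether the reported WER is a mean of per-sentence values or a single corpus-level ratio, so this sketch scores one sentence at a time and leaves aggregation open.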

Quality & Intelligibility

Measured on 50 sentences. Lower WER is better, higher UTMOS is better.

#	Model / Voice	UTMOS ↑	Range	WER ↓	Type
1	Kokoro v1.0 Hexgrad (open source) · voice: af_heart	4.48	4.43 – 4.52	1.63%	Open Source
2	Gradium TTS (default) Gradium · voice: Audrey (flagship en)	4.41	4.18 – 4.53	2.23%	Cloud API

Tight UTMOS gap (0.07), within typical prediction noise. Kokoro's narrower range (4.43–4.52) indicates more consistent sentence-to-sentence quality; Gradium's wider range (4.18–4.53) has a marginally higher peak and a much deeper trough.
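
As a rough sketch of how the UTMOS and Range columns relate to per-sentence scores, the snippet below summarizes a list of predictions. It assumes Range means the minimum and maximum per-sentence score, which the source does not state explicitly. The function name `summarize_scores` and the input values are hypothetical, not measured data.

```python
from statistics import mean


def summarize_scores(scores: list[float]) -> dict[str, float]:
    """Mean, min, max, and spread (max - min) of per-sentence MOS predictions."""
    if not scores:
        raise ValueError("need at least one score")
    low, high = min(scores), max(scores)
    return {"mean": mean(scores), "min": low, "max": high, "spread": high - low}


# Hypothetical per-sentence scores for two vendors (illustration only).
vendor_a = [4.25, 4.5, 4.75]
vendor_b = [4.0, 4.5, 4.75]

summary_a = summarize_scores(vendor_a)
summary_b = summarize_scores(vendor_b)
print(summary_a)
print(summary_b)
print("mean gap:", summary_a["mean"] - summary_b["mean"])
```

A smaller spread at a similar mean is what the text above reads as more consistent sentence-to-sentence quality.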

Deployment & Features

This is where the two systems differ most, and it forms the second axis of the comparison.

Model	RTF ↑	TTFB	Sample Rate	Languages	Voices	Cloning	Streaming
Kokoro v1.0 Local / self-hosted	9.75×	—	24 kHz	8	54	—	—
Gradium TTS (default) Cloud API · EU + US regions	2.71×	200 ms	48 kHz	6	14	✓	✓
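
The RTF column follows the definition in the methodology: audio duration divided by synthesis time, averaged per item. A minimal sketch of that arithmetic is below. The function names and the timing pairs are hypothetical and do not come from the measured runs. Some sources define RTF as the inverse ratio (synthesis time over audio duration), where lower is better, so the direction is worth checking when comparing against other benchmarks.

```python
from statistics import mean


def real_time_factor(audio_s: float, synth_s: float) -> float:
    """Seconds of audio produced per second of synthesis (higher is faster)."""
    if synth_s <= 0:
        raise ValueError("synth_s must be positive")
    return audio_s / synth_s


def mean_rtf(timings: list[tuple[float, float]]) -> float:
    """Per-item mean RTF over (audio_s, synth_s) pairs."""
    if not timings:
        raise ValueError("need at least one timing pair")
    return mean(real_time_factor(audio_s, synth_s) for audio_s, synth_s in timings)


# Hypothetical (audio_s, synth_s) pairs, for illustration only.
timings = [(3.0, 0.5), (2.0, 0.25)]

print(mean_rtf(timings))  # per-item mean of 6.0 and 8.0

# A pooled ratio (total audio / total synth time) generally gives a different number.
pooled = sum(a for a, _ in timings) / sum(s for _, s in timings)
print(pooled)
```

The per-item mean weights every sentence equally, while the pooled ratio weights longer clips more heavily. With few measured items the two can diverge noticeably.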

Caveats

Easy test set. Harvard sentences are short, phonetically balanced, clean English. They do not stress-test numbers, abbreviations, named entities, code-switching, or multi-sentence paragraphs — where production TTS systems actually diverge.
One voice per vendor. Different voices from the same vendor can score materially differently. We picked the flagship English voice in each catalog.
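Whisper-v3 as oracle. WER is computed from Whisper large-v3-turbo transcriptions scored against the reference text. Systematic differences in punctuation and casing are normalized away, but homophone substitutions are not covered by the normalization described above, and Whisper can still favor acoustically typical TTS output.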
No voice-cloning test yet. Voice cloning and Gradium's on-device model are advertised strengths this v1 does not exercise. Planned for v2.
RTF sample size differs. Gradium has fewer synth-time measurements (n=7) because earlier cached runs skipped synthesis for the other sentences, so no timing was recorded for them; Kokoro was fully measured (n=40). Gradium RTF was cross-checked against the latency benchmark in data/gradium-tts-bench.json.
Production hard cases moved to a separate benchmark. Numbers, dates, emails, addresses, acronyms, and long-form entity preservation are tracked by the English TTS intelligibility harness.

Our Take

On this clean English intelligibility test, Kokoro v1.0 and Gradium are essentially tied on naturalness (4.48 vs 4.41 UTMOS; a 0.07 gap is within typical UTMOS noise). Kokoro edges ahead on intelligibility (1.63% vs 2.23% WER) and produces more consistent sentence-to-sentence quality.

This is a notable result for open-source TTS: an 82M-parameter Apache-2.0 model matching a commercial cloud API on naturalness for clean English text. It is consistent with the broader 2024–2026 trend of small, focused open-source models closing the gap with paid APIs.

Gradium's case for production deployment lives in the second table. The commercial API buys streaming with ~200 ms TTFB, 48 kHz output, and voice cloning, none of which the table lists for Kokoro; Gradium also advertises an on-device model. Its catalog is smaller, though: 14 voices across 6 languages against Kokoro's 54 voices across 8. For voice-agent pipelines where time-to-first-audio, streaming, or cloning matter more than 0.07 MOS on clean English, Gradium's tradeoffs are reasonable.

Roadmap for v2: SEED-TTS hardcase prompts (numbers, abbreviations, long paragraphs), voice-cloning similarity (SECS via WavLM), multilingual eval, and ElevenLabs + Cartesia + OpenAI TTS rows. If you're a TTS vendor and want to be in v2, email k.wikiel@gmail.com.
