First-party evaluation of production TTS systems measured in this repo — same code, same test set, same metrics across every vendor. Quality via UTMOS22-strong, intelligibility via Whisper-v3 WER round-trip, on 50 Harvard sentences. Reproducible with scripts/tts_synth_*.py and scripts/tts_score.py.
UTMOS22-strong: torch.hub from tarepan/SpeechMOS. 1-5 scale, higher is better.

WER: Whisper-v3 round-trip (jiwer). Catches dropped words, hallucinations, phoneme errors.

Measured on 50 sentences. Lower WER is better, higher UTMOS is better.

| # | Model / Voice | UTMOS ↑ | Range | WER ↓ | Type |
|---|---|---|---|---|---|
| 1 | Kokoro v1.0 Hexgrad (open source) · voice: af_heart | 4.48 | 4.43 – 4.52 | 1.63% | Open Source |
| 2 | Gradium TTS (default) Gradium · voice: Audrey (flagship en) | 4.41 | 4.18 – 4.53 | 2.23% | Cloud API |
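
To make the WER column concrete, here is a minimal sketch of a word-level error rate: the edit distance between the reference sentence and the ASR transcript, divided by the reference word count. It is a pure-Python stand-in for what jiwer computes, not the repo's scoring code. The `normalize` rule and the example transcript are assumptions made for illustration, and how the repo aggregates over the 50 sentences is not stated here.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation (except apostrophes) with spaces, split into words.

    Assumed normalization for illustration; the repo's own rule may differ.
    """
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# A Harvard sentence as reference; the transcript is a made-up example
# in which the ASR output is missing the second "the".
reference = "The birch canoe slid on the smooth planks."
hypothesis = "the birch canoe slid on smooth planks"
print(f"{word_error_rate(reference, hypothesis):.2%}")  # 12.50%
```

One dropped word in an eight-word sentence already costs 12.5%, so corpus-level figures around 2% correspond to only a handful of word errors across the whole test set.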
The UTMOS gap is tight (0.07), within typical prediction noise. Kokoro's narrower range (4.43–4.52) indicates more consistent sentence-to-sentence quality; Gradium's wider range (4.18–4.53 in the table above) has higher peaks and deeper troughs.
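
The gap and spread figures can be rechecked directly from the table. The snippet below is a small sketch that uses only the values printed in the quality table; the dictionary layout and key names are illustrative, not the format of the repo's data files.

```python
# Mean UTMOS and per-sentence range, copied from the quality table above.
results = {
    "Kokoro v1.0": {"utmos": 4.48, "low": 4.43, "high": 4.52},
    "Gradium TTS": {"utmos": 4.41, "low": 4.18, "high": 4.53},
}

# Difference between the two mean scores.
gap = round(results["Kokoro v1.0"]["utmos"] - results["Gradium TTS"]["utmos"], 2)

# Width of each system's range (best sentence minus worst sentence).
spreads = {name: round(r["high"] - r["low"], 2) for name, r in results.items()}

print(gap)      # 0.07
print(spreads)  # {'Kokoro v1.0': 0.09, 'Gradium TTS': 0.35}
```

The mean gap (0.07) is smaller than either system's own sentence-to-sentence spread, which is why the two are best read as tied on naturalness.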
Where the two systems differ most — the second axis of the comparison.
| Model | RTF ↑ | TTFB | Sample Rate | Languages | Voices | Cloning | Streaming |
|---|---|---|---|---|---|---|---|
| Kokoro v1.0 · Local / self-hosted | 9.75× | — | 24 kHz | 8 | 54 | — | — |
| Gradium TTS (default) · Cloud API · EU + US regions | 2.71× | 200 ms | 48 kHz | 6 | 14 | ✓ | ✓ |
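
A note on reading the RTF column: the "×" suffix and the "higher is better" arrow suggest the figure is a speed factor, that is, seconds of audio produced per second of synthesis time. This is an assumption, since real-time factor is often defined the other way round (synthesis time divided by audio duration, lower is better). Under that assumption, a minimal sketch of the arithmetic, with hypothetical function names:

```python
def speed_factor(audio_seconds: float, synthesis_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock synthesis time.

    Assumed meaning of the "RTF" column above (higher is better).
    """
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return audio_seconds / synthesis_seconds


def synthesis_time(audio_seconds: float, factor: float) -> float:
    """Wall-clock time needed to produce `audio_seconds` of audio at a given speed factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return audio_seconds / factor


# Speed factors copied from the table above.
for name, factor in {"Kokoro v1.0": 9.75, "Gradium TTS": 2.71}.items():
    print(f"{name}: {synthesis_time(10.0, factor):.2f} s to synthesize 10 s of audio")
```

Both systems run faster than real time under this reading. For interactive use, the TTFB column matters more than total throughput, because streaming playback can start before synthesis finishes.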
data/gradium-tts-bench.json.

On this clean English intelligibility test, Kokoro v1.0 and Gradium are essentially tied on naturalness (4.48 vs 4.41 UTMOS; the 0.07 gap is within typical UTMOS noise). Kokoro edges ahead on intelligibility (1.63% vs 2.23% WER) and produces more consistent sentence-to-sentence quality.
This is a notable result for open-source TTS: an 82M-parameter Apache-2.0 model matching a commercial cloud API on naturalness for clean English text. Consistent with the broader 2024–2026 trend of small focused open-source models closing the gap with paid APIs.
Gradium's case for production deployment lives in the second table. A commercial API buys ~200 ms TTFB streaming, 48 kHz output, 14 flagship voices across 6 languages, voice cloning, and an on-device model. Streaming, cloning, and 48 kHz output are all absent from Kokoro's row in that table. For voice-agent pipelines where time-to-first-audio, multilingual coverage, or cloning matter more than 0.07 UTMOS on clean English, Gradium's tradeoffs are reasonable.
Roadmap for v2: SEED-TTS hardcase prompts (numbers, abbreviations, long paragraphs), voice-cloning similarity (SECS via WavLM), multilingual eval, and ElevenLabs + Cartesia + OpenAI TTS rows. If you're a TTS vendor and want to be in v2, email k.wikiel@gmail.com.