TTS Vendor Evaluation: Kokoro v1.0 vs Gradium TTS (default)
First-party evaluation of production TTS systems measured in this repo — same code, same test set, same metrics across every vendor. Quality via UTMOS22-strong, intelligibility via Whisper-v3 WER round-trip, on 50 Harvard sentences. Reproducible with scripts/tts_synth_*.py and scripts/tts_score.py.
Methodology
- Test Set
- First 50 Harvard Sentences (IEEE Rec. Pub. No. 297)
- Public domain, speech-intelligibility standard. Phonetically balanced.
- Sample Count
- 50 sentences per vendor
- Quality Metric (MOS)
- UTMOS22-strong (Sarulab, VoiceMOS Challenge 2022 winner)
- Automatic MOS predictor trained specifically on TTS naturalness ratings. Winner of VoiceMOS Challenge 2022. Loaded via torch.hub from tarepan/SpeechMOS. 1-5 scale, higher is better.
- Intelligibility Metric (WER)
- Whisper large-v3-turbo round-trip
- Each synthesized clip transcribed by Whisper, WER computed against reference text after lowercasing and punctuation stripping (jiwer). Catches dropped words, hallucinations, phoneme errors.
- RTF (Real-Time Factor)
- audio_s / synth_s, per-item mean
- Gradium: HTTPS POST from EU region (network-bound). Kokoro: local torch CPU inference on Apple Silicon.
- Audio Handling
- Downsampled to 16 kHz mono before scoring
- UTMOS and Whisper both operate natively at 16 kHz. Resampling is the intended input pipeline, not a lossy workaround.
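The WER round-trip described above can be sketched in plain Python. This is a minimal stand-in for illustration, not the code in scripts/tts_score.py: the repo uses jiwer, whereas the helper names here (normalize, word_edit_distance, corpus_wer) are hypothetical, and pooling errors over all reference words (rather than averaging per-sentence WER) is an assumption. The hypothesis strings stand in for Whisper transcriptions.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(references: list[str], hypotheses: list[str]) -> float:
    """Total word errors divided by total reference words (assumed pooling)."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = normalize(ref)
        errors += word_edit_distance(ref_words, normalize(hyp))
        words += len(ref_words)
    if words == 0:
        raise ValueError("reference text contains no words")
    return errors / words


# Two Harvard sentences; the second hypothesis drops one word ("blue").
refs = ["The birch canoe slid on the smooth planks.",
        "Glue the sheet to the dark blue background."]
hyps = ["the birch canoe slid on the smooth planks",
        "Glue the sheet to the dark background"]
print(f"WER: {corpus_wer(refs, hyps):.2%}")
```

With one dropped word out of 16 reference words, the example reports a WER of 6.25%.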
Quality & Intelligibility
Measured on 50 sentences. Lower WER is better, higher UTMOS is better.
| # | Model / Voice | UTMOS ↑ | Range | WER ↓ | Type |
|---|---|---|---|---|---|
| 1 | Kokoro v1.0 Hexgrad (open source) · voice: af_heart | 4.48 | 4.43 – 4.52 | 1.63% | Open Source |
| 2 | Gradium TTS (default) Gradium · voice: Audrey (flagship en) | 4.41 | 4.18 – 4.53 | 2.23% | Cloud API |
Tight UTMOS gap (0.07), within typical prediction noise. Kokoro's narrower range (4.43–4.52) indicates more consistent sentence-to-sentence quality; Gradium's wider range (4.18–4.53) has higher peaks and deeper troughs.
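As a rough illustration of how the UTMOS and Range columns relate to per-sentence predictions, the sketch below reduces a list of scores to a mean, minimum, and maximum. It assumes Range means the min and max across sentences, which the table does not state explicitly. The function name summarize_utmos is hypothetical and the scores are made-up placeholders, not the measured data.

```python
from statistics import mean


def summarize_utmos(scores: list[float]) -> dict[str, float]:
    """Return the mean, min, and max of per-sentence UTMOS predictions."""
    if not scores:
        raise ValueError("need at least one score")
    return {"mean": mean(scores), "min": min(scores), "max": max(scores)}


# Hypothetical per-sentence scores for two vendors (placeholders only).
vendor_a = summarize_utmos([4.5, 4.25, 4.75])
vendor_b = summarize_utmos([4.0, 4.5])

gap = vendor_a["mean"] - vendor_b["mean"]       # difference of mean scores
spread_a = vendor_a["max"] - vendor_a["min"]    # width of the range

print(f"A: {vendor_a['mean']:.2f} ({vendor_a['min']:.2f} - {vendor_a['max']:.2f})")
print(f"B: {vendor_b['mean']:.2f} ({vendor_b['min']:.2f} - {vendor_b['max']:.2f})")
print(f"gap: {gap:.2f}, spread A: {spread_a:.2f}")
```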
Deployment & Features
Where the two systems differ most — the second axis of the comparison.
| Model | RTF ↑ | TTFB | Sample Rate | Languages | Voices | Cloning | Streaming |
|---|---|---|---|---|---|---|---|
| Kokoro v1.0 Local / self-hosted | 9.75× | — | 24 kHz | 8 | 54 | — | — |
| Gradium TTS (default) Cloud API · EU + US regions | 2.71× | 200 ms | 48 kHz | 6 | 14 | ✓ | ✓ |
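The RTF column follows the definition in the methodology: audio_s / synth_s, averaged per item. A minimal sketch of that calculation is below. The function name mean_rtf and the tuple layout are assumptions for illustration, as is the choice to skip items with no recorded synthesis time (for example, clips served from a cache) and to report how many items were actually measured.

```python
from statistics import mean
from typing import Iterable, Optional, Tuple


def mean_rtf(items: Iterable[Tuple[float, Optional[float]]]) -> Tuple[float, int]:
    """Per-item mean of audio_s / synth_s, plus the number of measured items.

    Each item is (audio_seconds, synth_seconds). Items whose synth time is
    missing or non-positive are skipped rather than counted as zero.
    """
    ratios = [audio_s / synth_s
              for audio_s, synth_s in items
              if synth_s is not None and synth_s > 0]
    if not ratios:
        raise ValueError("no items with a measured synthesis time")
    return mean(ratios), len(ratios)


# Hypothetical timings: the third clip has no synth time and is skipped.
timings = [(3.0, 0.5), (2.0, 0.25), (2.5, None)]
rtf, n = mean_rtf(timings)
print(f"RTF: {rtf:.2f}x (n={n})")
```

Reporting n alongside the mean makes it visible when two vendors' RTF figures rest on different sample sizes.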
Caveats
- Easy test set. Harvard sentences are short, phonetically balanced, clean English. They do not stress-test numbers, abbreviations, named entities, code-switching, or multi-sentence paragraphs — where production TTS systems actually diverge.
- One voice per vendor. Different voices from the same vendor can score materially differently. We picked the flagship English voice in each catalog.
- Whisper-v3 as oracle. WER is computed on Whisper large-v3-turbo transcriptions scored against the reference text. Systematic Whisper errors (e.g. punctuation, casing, homophones) are normalized away, but the recognizer can still favor acoustically typical TTS output.
- No voice-cloning test yet. Voice cloning and Gradium's on-device model are advertised strengths this v1 does not exercise. Planned for v2.
- RTF sample size differs. Gradium has fewer synth-time measurements (n=7) because sentences cached by earlier runs were skipped; Kokoro was fully measured (n=40). Gradium RTF was cross-checked against the latency benchmark in data/gradium-tts-bench.json.
Our Take
On this clean English intelligibility test, Kokoro v1.0 and Gradium are essentially tied on naturalness (4.48 vs 4.41 UTMOS; a 0.07 gap is within typical UTMOS noise). Kokoro edges ahead on intelligibility (1.63% vs 2.23% WER) and produces more consistent sentence-to-sentence quality.
This is a notable result for open-source TTS: an 82M-parameter Apache-2.0 model matching a commercial cloud API on naturalness for clean English text. Consistent with the broader 2024–2026 trend of small focused open-source models closing the gap with paid APIs.
Gradium's case for production deployment lives in the second table. A commercial API buys ~200 ms TTFB streaming, 48 kHz output, voice cloning, and an on-device model, none of which Kokoro offers in this comparison, alongside 14 flagship voices across 6 languages. For voice-agent pipelines where time-to-first-audio, multilingual coverage, or cloning matter more than 0.07 MOS on clean English, Gradium's tradeoffs are reasonable.
Roadmap for v2: SEED-TTS hardcase prompts (numbers, abbreviations, long paragraphs), voice-cloning similarity (SECS via WavLM), multilingual eval, and ElevenLabs + Cartesia + OpenAI TTS rows. If you're a TTS vendor and want to be in v2, email k.wikiel@gmail.com.