Choose open-source when
You need local inference, private audio handling, predictable marginal cost, model-level control, or custom fine-tuning.
Start with Sesame CSM when naturalness matters, Kokoro v1.0 when you need a small local model, XTTS v2 for voice cloning, and Piper when CPU latency matters more than polish.
The ranking follows the shared CodeSOTA TTS catalog. MOS (mean opinion score) is a starting signal; the deployment choice depends on license, footprint, languages, inference speed, and whether voice cloning is central.
| Rank | Model | Vendor | MOS | Best fit | Architecture | Params | Source |
|---|---|---|---|---|---|---|---|
| 1 | Sesame CSM | Sesame | 4.7 | dialogue and agents | Conversational Speech Model | 1B+ | source |
| 2 | Fish Audio S2 Pro | Fish Audio | 4.6 | multilingual apps | Dual-autoregressive transformer + RVQ audio codec | 5B | source |
| 3 | Orpheus TTS | Canopy Labs | 4.6 | style control | LLM-based (Llama backbone) | 3B | source |
| 4 | Kokoro v1.0 | Hexgrad | 4.5 | edge and CPU | StyleTTS 2 + ISTFTNet (lightweight, decoder-only) | 82M | source |
| 5 | XTTS v2 | Coqui | 4.5 | voice cloning | GPT-like + VITS decoder | 467M | source |
| 6 | Fish Speech 1.5 | Fish Audio | 4.4 | multilingual apps | VQGAN + Transformer | 500M | source |
| 7 | F5-TTS | Shanghai Jiao Tong University (SWivid) | 4.4 | voice cloning | Flow-matching (non-autoregressive) | 335M | source |
| 8 | Dia 1.6B | Nari Labs | 4.3 | dialogue and agents | Transformer + non-verbal tokens | 1.6B | source |
| 9 | Spark-TTS | SparkAudio | 4.3 | multilingual apps | Controllable Transformer | 500M | source |
| 10 | Supertonic 3 | Supertone | 4.2 | local TTS | ONNX Runtime local inference | 99M | source |
| 11 | Parler-TTS | Hugging Face | 4.1 | local TTS | Prompt-controlled Transformer | 880M | source |
| 12 | Piper | Rhasspy | 3.6 | edge and CPU | VITS (lightweight) | ~20M | source |
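Because MOS is only a starting signal, it can help to filter the table by your own constraints before comparing scores. The following is a minimal sketch, not part of the CodeSOTA catalog: the `Model` class, the `shortlist` function, and the numeric parameter counts are illustrative assumptions (in particular, "1B+" is stored as 1000 and "~20M" as 20, both in millions).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    mos: float
    best_fit: str
    params_m: float  # parameters in millions; "1B+" and "~20M" are approximated


# Rows copied from the table above (name, MOS, best fit, approximate size).
CATALOG = [
    Model("Sesame CSM", 4.7, "dialogue and agents", 1000),
    Model("Fish Audio S2 Pro", 4.6, "multilingual apps", 5000),
    Model("Orpheus TTS", 4.6, "style control", 3000),
    Model("Kokoro v1.0", 4.5, "edge and CPU", 82),
    Model("XTTS v2", 4.5, "voice cloning", 467),
    Model("Fish Speech 1.5", 4.4, "multilingual apps", 500),
    Model("F5-TTS", 4.4, "voice cloning", 335),
    Model("Dia 1.6B", 4.3, "dialogue and agents", 1600),
    Model("Spark-TTS", 4.3, "multilingual apps", 500),
    Model("Supertonic 3", 4.2, "local TTS", 99),
    Model("Parler-TTS", 4.1, "local TTS", 880),
    Model("Piper", 3.6, "edge and CPU", 20),
]


def shortlist(catalog, best_fit=None, max_params_m=None):
    """Filter by use case and size budget, then order by MOS (highest first)."""
    picks = [
        m for m in catalog
        if (best_fit is None or m.best_fit == best_fit)
        and (max_params_m is None or m.params_m <= max_params_m)
    ]
    # sorted() is stable, so ties keep their table order.
    return sorted(picks, key=lambda m: m.mos, reverse=True)
```

For example, `shortlist(CATALOG, max_params_m=100)` keeps only the models at or under roughly 100M parameters, and `shortlist(CATALOG, best_fit="voice cloning")` keeps the two voice-cloning entries. License, language coverage, and inference speed are not in the table, so they would still need to be checked separately.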
Choose a hosted API when
You need managed voices, streaming infrastructure, uptime, commercial voice-cloning flows, or the fastest path to a production voice agent.
Run names, numbers, dates, URLs, acronyms, and domain terms through your exact prompts. Naturalness does not guarantee information fidelity.
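One way to make this check repeatable is to keep a small set of hard cases and render each one through the same prompt template you use in production. The sketch below is an illustration under stated assumptions: `HARD_CASES`, `build_prompts`, and `render_all` are hypothetical names, the sample strings are placeholders, and `synthesize` stands in for whatever function wraps your chosen model or API and returns audio bytes.

```python
from pathlib import Path
from typing import Callable

# Hypothetical hard cases; replace with strings from your own domain.
HARD_CASES = {
    "names": ["Siobhan Nguyen"],
    "numbers": ["1,204.50"],
    "dates": ["03/04/2025"],
    "urls": ["example.com/docs"],
    "acronyms": ["SQL"],
    "domain_terms": ["acetaminophen"],
}


def build_prompts(template: str, cases: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Insert each hard case into the prompt template, keeping its category."""
    return [
        (category, template.format(item=item))
        for category, items in cases.items()
        for item in items
    ]


def render_all(
    prompts: list[tuple[str, str]],
    synthesize: Callable[[str], bytes],  # assumed wrapper around your TTS engine
    out_dir: Path,
) -> list[Path]:
    """Synthesize every prompt and save one file per case for listening review."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, (category, text) in enumerate(prompts):
        path = out_dir / f"{i:03d}_{category}.wav"
        path.write_bytes(synthesize(text))
        paths.append(path)
    return paths
```

The files are named by category so a reviewer can listen for misread digits, dates, or spelled-out acronyms one group at a time. The sketch only organizes the audio for human review; it does not score fidelity automatically.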