Best Open-Source TTS Models Compared (2026 Edition)
Eight models, one goal: human-quality speech from open-source code. We compare naturalness, speed, voice cloning, hardware needs, and licensing so you can pick the right TTS for your project.
TL;DR - Pick Your Model
- Best overall quality: Kokoro (MOS 4.2, 82M params, Apache 2.0)
- Best voice cloning: XTTS v2 (6 s reference, 17 languages)
- Best for edge/embedded: Piper (runs on a Raspberry Pi, 30+ languages)
- Best for dialogue: Dia (multi-speaker turns, 1.6B params)
- Best multilingual cloning: Fish Speech (8 languages, Apache 2.0)
- Best non-speech audio: Bark (laughter, music, MIT license)
- Best flow-matching TTS: F5-TTS (zero-shot cloning, 336M params)
- Most controllable: Parler-TTS (describe the voice in text)
Naturalness (MOS Scores)
Mean Opinion Score on a 1-5 scale. Human speech typically scores 4.5-4.8. Scores below are from published evaluations and community benchmarks.
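For concreteness, MOS is simply the arithmetic mean of listener ratings on that 1-5 scale. A minimal sketch (the helper name is illustrative, not from any library):

```python
def mean_opinion_score(ratings: list[int]) -> float:
    """MOS is the arithmetic mean of listener ratings on a 1-5 scale."""
    if not all(1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    return sum(ratings) / len(ratings)

# Ten listeners rating one synthesized clip
print(mean_opinion_score([4, 5, 4, 4, 5, 4, 3, 4, 5, 4]))  # 4.2
```

Published MOS numbers average many listeners over many clips, so treat small differences (e.g. 4.0 vs 4.1) as noise unless the evaluations used the same test set.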
Full Comparison Table
| Model | MOS | RTF | VRAM | Params | Voice Clone | Languages | License |
|---|---|---|---|---|---|---|---|
| Kokoro (Hexgrad) | 4.2 | 0.03 | < 1 GB | 82M | No (style presets) | 9 (EN, JA, KO, ZH, FR, ES, IT, PT, HI) | Apache 2.0 |
| XTTS v2 (Coqui) | 4.0 | 0.18 | ~4 GB | 467M | Yes (6 s reference) | 17 (EN, ES, FR, DE, IT, PT, PL, TR, RU, NL, CS, AR, ZH, JA, HU, KO, HI) | CPML (non-commercial) |
| Bark (Suno) | 3.7 | 0.85 | ~6 GB | 900M | Limited (speaker prompts) | 13 | MIT |
| Piper (Rhasspy) | 3.5 | 0.008 | < 100 MB (CPU) | 6-60M | No (pre-trained voices) | 30+ | MIT |
| Fish Speech (Fish Audio) | 4.1 | 0.12 | ~4 GB | 500M | Yes (10-30 s reference) | 8 (EN, ZH, JA, KO, ES, FR, DE, AR) | Apache 2.0 |
| Dia (Nari Labs) | 4.0 | 0.15 | ~5 GB | 1.6B | Yes (audio prompt) | English | Apache 2.0 |
| F5-TTS (SWivid) | 4.1 | 0.14 | ~4 GB | 336M | Yes (5-15 s reference) | English, Chinese | CC-BY-NC 4.0 |
| Parler-TTS (Hugging Face) | 3.8 | 0.22 | ~4 GB | 880M | No (text-described voices) | English | Apache 2.0 |
RTF = Real-Time Factor (lower is faster; <1.0 means faster than real-time). Measured on NVIDIA A100 unless noted. MOS scores from published papers and community evaluations. VRAM at fp16, single utterance.
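The RTF definition above is easy to reproduce for your own hardware. A minimal timing harness (function names are illustrative; `synthesize` stands in for any TTS call that returns audio samples):

```python
import time

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of generated audio.
    RTF < 1.0 means the model runs faster than real time."""
    return synthesis_seconds / audio_seconds

def measure_rtf(synthesize, text: str, sample_rate: int) -> float:
    """Time one synthesis call and derive RTF from the returned samples.
    `synthesize` is any function mapping text to a 1-D array of samples."""
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(audio) / sample_rate)

# Example: Kokoro's RTF 0.03 means a 10 s clip takes ~0.3 s to generate
print(real_time_factor(0.3, 10.0))  # 0.03
```

For fair comparisons, warm the model up first (the initial call often includes weight loading and CUDA kernel compilation) and average over several utterances.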
Model Deep Dives
Kokoro
Highest MOS · 82M params

Kokoro is the efficiency champion. Built on the StyleTTS 2 architecture, it achieves the highest MOS (4.2) among the open-source models compared here while using just 82M parameters, an order of magnitude smaller than most competitors. It runs comfortably on CPU and reaches RTF 0.03 on GPU, meaning a 10-second clip is synthesized in about 0.3 seconds. The model ships with curated style presets for different voices but does not support arbitrary voice cloning. As of early 2026 it supports 9 languages, including English, Japanese, Korean, and the major European languages.
```python
# pip install kokoro>=0.8 soundfile
from kokoro import KPipeline
import soundfile as sf

pipe = KPipeline(lang_code="a")  # 'a' = American English
# Available voices include af_heart, af_bella, am_adam, am_michael, etc.
generator = pipe("Hello from Kokoro, the most efficient open-source TTS.", voice="af_heart", speed=1.0)
for i, (gs, ps, audio) in enumerate(generator):  # (graphemes, phonemes, audio)
    sf.write(f"output_{i}.wav", audio, 24000)  # Kokoro outputs 24 kHz audio
```

XTTS v2
Best Voice Cloning · CPML License

XTTS v2 remains the gold standard for zero-shot voice cloning. With just 6 seconds of reference audio, it produces remarkably faithful voice reproductions across 17 languages. The architecture combines a GPT-style autoregressive model with a DVAE and a HiFi-GAN vocoder. The main caveat is its CPML license, which restricts commercial use without a separate agreement. For commercial projects, consider Fish Speech or Dia as alternatives (note that F5-TTS is also non-commercial).
```python
# pip install TTS
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
# Zero-shot voice cloning from a ~6 s reference clip
tts.tts_to_file(
    text="This is a cloned voice speaking naturally.",
    speaker_wav="reference.wav",
    language="en",
    file_path="output.wav",
)
```

Bark
Non-Speech Audio · MIT

Bark by Suno is unique in its ability to generate non-speech audio alongside speech. It can produce laughter, music snippets, sighs, and other paralinguistic sounds using inline tags. The GPT-style autoregressive architecture makes it slower (RTF 0.85) and hungrier for VRAM (~6 GB), but for creative applications that need expressive, varied audio, Bark remains unmatched. The MIT license makes it suitable for any commercial project.
```python
# Install from the Suno repo (the PyPI package named "bark" is unrelated):
# pip install git+https://github.com/suno-ai/bark.git
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()  # downloads model weights on first run
# Bark supports non-speech: laughter, music, hesitations
text = """Hello! [laughs] This is Bark speaking.
It can generate ♪ musical notes ♪ and even [sighs] emotions."""
audio = generate_audio(text, history_prompt="v2/en_speaker_6")
write_wav("output.wav", SAMPLE_RATE, audio)
```

Piper
Edge / Embedded · MIT

Piper is the go-to TTS for edge devices. Built on the VITS/VITS2 architecture and exported to ONNX, it achieves RTF 0.008, meaning a 10-second clip generates in 80 milliseconds. It runs entirely on CPU with less than 100 MB of RAM. With 30+ pre-trained language models, it is the most broadly multilingual option. The trade-off is lower naturalness (MOS 3.5) and no voice cloning; you pick from pre-trained voices. Ideal for home assistants, kiosks, and offline applications.
```python
# Install: pip install piper-tts
# Download a voice: piper --download-dir ./voices --model en_US-lessac-high
import subprocess

text = "Piper runs on a Raspberry Pi in real-time."
subprocess.run(
    ["piper", "--model", "./voices/en_US-lessac-high.onnx", "--output_file", "output.wav"],
    input=text.encode(),
    check=True,
)

# Or use the Python API directly. Note: synthesize() writes into a
# wave.Wave_write object; details vary across piper-tts versions.
import wave
from piper import PiperVoice

voice = PiperVoice.load("./voices/en_US-lessac-high.onnx")
with wave.open("output.wav", "wb") as wav_file:
    voice.synthesize(text, wav_file)
```

Fish Speech
Multilingual Cloning · Apache 2.0

Fish Speech combines a VQGAN tokenizer with a Llama-based decoder to achieve strong voice cloning across 8 languages. It requires 10-30 seconds of reference audio for cloning, slightly more than XTTS v2, but comes with an Apache 2.0 license, making it the best commercially friendly voice cloning option. A MOS of 4.1 puts it near the top for naturalness. The architecture allows fine-tuning on custom voices with relatively small datasets.
```python
# pip install fish-speech
# Note: the high-level API below is illustrative; the inference interface
# changes between releases, so check the fish-speech repo for your version.
from fish_speech.api import FishSpeechTTS

tts = FishSpeechTTS(device="cuda")
# Zero-shot cloning with a 10-30 s reference
tts.synthesize(
    text="Fish Speech excels at multilingual voice cloning.",
    reference_audio="speaker_ref.wav",
    output_path="output.wav",
)
```

Dia (Nari Labs)
Multi-Speaker Dialogue · Apache 2.0

Dia is purpose-built for dialogue. You pass in a script with speaker tags ([S1], [S2]) and it generates a natural multi-speaker conversation with appropriate prosody, pacing, and turn-taking. At 1.6B parameters it is the largest model in this comparison, requiring ~5 GB VRAM. It also supports non-verbal cues like laughter and hesitations. Currently English-only, but the dialogue capability is unmatched.
```python
# pip install git+https://github.com/nari-labs/dia.git
# (the PyPI package "diarizationlm" is an unrelated project)
import soundfile as sf
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")
# Multi-speaker dialogue generation with [S1]/[S2] speaker tags
dialogue = """[S1] Hey, have you tried the new open-source TTS models?
[S2] Yeah, Dia is amazing for dialogue. It handles turn-taking naturally.
[S1] The prosody between speakers is surprisingly good."""
audio = model.generate(dialogue)
sf.write("dialogue.wav", audio, 44100)  # Dia outputs 44.1 kHz audio
```

F5-TTS
Flow Matching · CC-BY-NC 4.0

F5-TTS uses a flow matching approach with a Diffusion Transformer (DiT) backbone. It achieves MOS 4.1 with only 336M parameters and provides strong zero-shot voice cloning from 5-15 seconds of reference audio. The flow matching architecture produces more consistent output than autoregressive approaches, avoiding the occasional artifacts common in GPT-style TTS. The CC-BY-NC license limits commercial use.
```python
# pip install f5-tts
from f5_tts.api import F5TTS

tts = F5TTS(device="cuda")
# Zero-shot voice cloning via flow matching; infer() also returns
# the waveform, sample rate, and spectrogram
wav, sr, spect = tts.infer(
    ref_file="reference.wav",
    ref_text="This is the reference transcript.",
    gen_text="F5-TTS uses flow matching for natural-sounding speech synthesis.",
    file_wave="output.wav",
)
```

Parler-TTS
Text-Described Voices · Apache 2.0

Parler-TTS from Hugging Face takes a unique approach: instead of providing reference audio for cloning, you describe the voice you want in natural language. "A warm female voice with a slight British accent, speaking clearly and calmly" -- and the model generates speech matching that description. This makes it highly controllable without needing any reference recordings. A MOS of 3.8 is decent but not top-tier; the value is in the controllability and the Apache 2.0 license.
```python
# pip install parler-tts
import soundfile as sf
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-large-v1")
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-large-v1")

# Describe the voice you want in natural language
description = "A warm female voice with a slight British accent, speaking clearly and calmly."
prompt = "Parler TTS lets you describe the exact voice characteristics you want."

input_ids = tokenizer(description, return_tensors="pt").input_ids
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
gen = model.generate(input_ids=input_ids, prompt_input_ids=prompt_ids)
sf.write("output.wav", gen.cpu().numpy().squeeze(), model.config.sampling_rate)
```

Hardware Requirements
- Edge / Embedded: real-time on ARM. Perfect for home assistants and offline kiosks.
- Consumer Laptop: Kokoro runs on CPU at near real-time; Piper is instant.
- Mid-range GPU: the sweet spot for most use cases. All mainstream models run comfortably.
- High-end GPU: runs Dia and Bark with large batch sizes; suits batch TTS for audiobook production.
Decision Matrix
Start from your primary requirement and follow it to the right model.
| Your Priority | Best Pick | Runner-Up | Why |
|---|---|---|---|
| Maximum naturalness | Kokoro | Fish Speech | MOS 4.2 with only 82M params. Apache 2.0. |
| Voice cloning (any license) | XTTS v2 | F5-TTS | Best speaker similarity from 6s reference. |
| Voice cloning (commercial) | Fish Speech | Kokoro presets | Apache 2.0 with strong multilingual cloning. |
| Fastest inference | Piper | Kokoro | RTF 0.008 on CPU. Sub-100ms latency. |
| Minimal VRAM / edge | Piper | Kokoro | <100 MB on CPU. Runs on Raspberry Pi. |
| Most languages | Piper | XTTS v2 | 30+ vs 17 languages. Pre-trained voices. |
| Multi-speaker dialogue | Dia | Bark | Native [S1]/[S2] tags with natural turn-taking. |
| Expressive / non-speech | Bark | Dia | Laughter, music, emotions inline. |
| Voice control via text | Parler-TTS | Kokoro presets | Describe voice in natural language. |
| Research / novel architecture | F5-TTS | Parler-TTS | Flow matching + DiT. Cutting-edge approach. |
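The matrix above can be encoded as a small lookup helper, which is handy when the choice needs to live in a config or script. This is an illustrative sketch; the priority keys are made up here, and the picks simply mirror the table:

```python
# Illustrative encoding of the decision matrix above: priority -> (best, runner-up)
DECISION_MATRIX = {
    "naturalness": ("Kokoro", "Fish Speech"),
    "cloning_any_license": ("XTTS v2", "F5-TTS"),
    "cloning_commercial": ("Fish Speech", "Kokoro presets"),
    "fastest": ("Piper", "Kokoro"),
    "edge": ("Piper", "Kokoro"),
    "most_languages": ("Piper", "XTTS v2"),
    "dialogue": ("Dia", "Bark"),
    "expressive": ("Bark", "Dia"),
    "text_voice_control": ("Parler-TTS", "Kokoro presets"),
    "research": ("F5-TTS", "Parler-TTS"),
}

def pick_model(priority: str) -> str:
    """Return the recommended model for a given priority key."""
    best, runner_up = DECISION_MATRIX[priority]
    return f"Best: {best} (runner-up: {runner_up})"

print(pick_model("dialogue"))  # Best: Dia (runner-up: Bark)
```

If two priorities conflict (e.g. maximum naturalness on an edge device), the fallback column is usually the compromise: Kokoro appears as runner-up for both speed and footprint.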
Licensing Quick Reference
Fully Commercial (Apache 2.0 / MIT)
- Kokoro (Apache 2.0)
- Fish Speech (Apache 2.0)
- Dia (Apache 2.0)
- Parler-TTS (Apache 2.0)
- Bark (MIT)
- Piper (MIT)
Non-Commercial / Restricted
- XTTS v2 (CPML; contact Coqui for commercial licensing)
- F5-TTS (CC-BY-NC 4.0)
Key Considerations
- Training data licenses may add constraints
- Voice cloning raises consent/legal issues
- Check model card for dataset-specific terms
- Some jurisdictions restrict synthetic speech