
Voice cloning · Updated April 2026

Best TTS for voice cloning

Voice cloning has moved from novelty to infrastructure. Ten seconds of reference audio is enough for most models to produce convincing output. This page compares vendors on fidelity, sample requirements, and consent tooling — because the ethical defaults matter as much as the speaker-similarity score.

TL;DR

> Best fidelity: ElevenLabs Professional Voice Clone — 30+ min of studio audio, near-indistinguishable.
> Best instant clone: PlayHT 3.0 or ElevenLabs IVC — 60 seconds of input.
> Best self-host: F5-TTS (MIT code) or XTTS-v2 (CPML, non-commercial). Zero-shot, 6-10s reference.
> Best streaming clone: Cartesia — the only sub-100ms TTFB vendor with cloning.

How voice cloning actually works

Most modern cloning is zero-shot rather than fine-tuned. A pretrained speaker encoder reads the reference audio and emits a fixed-length embedding that conditions the acoustic model at inference. The acoustic model was already trained on thousands of speakers, so it knows how voices differ; the embedding just tells it which voice to render. Fine-tuned products such as ElevenLabs Professional are the exception: they trade far more reference audio for higher fidelity.

[Figure: How voice cloning works. Reference audio is encoded into a speaker embedding that conditions the acoustic model at inference.]
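The conditioning idea can be sketched in a few lines. This is a toy illustration, not any vendor's architecture: the encoder here is just mean pooling plus a random projection, and the names `encode_speaker`, `condition`, and `proj` are hypothetical. It shows only the two properties described above: references of any length map to an embedding of the same size, and that embedding is attached to every frame the acoustic model consumes.

```python
import math
import random


def encode_speaker(frames, proj):
    """Toy speaker encoder: mean-pool over time, project, L2-normalize."""
    n_feats = len(frames[0])
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(n_feats)]
    emb = [sum(w * x for w, x in zip(row, mean)) for row in proj]
    norm = math.sqrt(sum(e * e for e in emb)) or 1.0
    return [e / norm for e in emb]


def condition(text_frames, speaker_emb):
    """Append the same speaker embedding to every text frame."""
    return [frame + speaker_emb for frame in text_frames]


rng = random.Random(0)
N_FEATS, EMB_DIM = 8, 4

# Hypothetical projection weights; a real encoder learns these.
proj = [[rng.gauss(0, 1) for _ in range(N_FEATS)] for _ in range(EMB_DIM)]

# Two references of different lengths (50 and 300 frames of fake features).
short_ref = [[rng.gauss(0, 1) for _ in range(N_FEATS)] for _ in range(50)]
long_ref = [[rng.gauss(0, 1) for _ in range(N_FEATS)] for _ in range(300)]

emb_short = encode_speaker(short_ref, proj)
emb_long = encode_speaker(long_ref, proj)

# 12 frames of fake text features, each 6-dimensional.
text_frames = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(12)]
conditioned = condition(text_frames, emb_short)

print(len(emb_short), len(emb_long))          # same embedding size for both
print(len(conditioned), len(conditioned[0]))  # frames kept, width grows by EMB_DIM
```

Production systems replace the mean pooling with a trained network and often inject the embedding through attention or normalization layers rather than concatenation, but the data flow is the same.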

Vendor capability radar

[Figure: Capability radar. Cloning vendors across six axes, each scored 0-10; higher is better. The overlay shows trade-offs.]

Side-by-side

Vendor	Mode	Min sample	Fidelity	Streaming	License / hosting
ElevenLabs Professional	Fine-tuned	30+ min	9.5/10	Yes (Flash)	Hosted, $99+/mo
ElevenLabs Instant (IVC)	Zero-shot	60s	8.5/10	Yes	Hosted, $22+/mo
PlayHT 3.0	Zero-shot or fine-tune	30-60s	9/10	Yes	Hosted, $39+/mo
Cartesia	Zero-shot	15s	8/10	Yes (<100ms)	Hosted, usage-based
Google Chirp 3 HD	Zero-shot (Custom Voice)	10s	7.5/10	Yes	GCP, usage-based
F5-TTS	Zero-shot	10s	8/10	Limited	MIT code (pretrained weights CC-BY-NC), self-host
XTTS-v2 (Coqui)	Zero-shot	6s	7.5/10	No	CPML (non-commercial), self-host
Fish Speech (OpenAudio-S1)	Zero-shot	10s	8/10	Yes	CC-BY-NC, self-host
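For readers who want to filter the table programmatically, here is a minimal sketch. The rows are copied from the table above; the `Vendor` class and `shortlist` function are hypothetical helpers, and the encoding of ranges ("30+ min" as 1800 seconds, "30-60s" as 30) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Vendor:
    name: str
    min_sample_s: int   # minimum reference audio, in seconds
    fidelity: float     # this page's 0-10 score
    self_host: bool


# Rows from the table above ("30+ min" entered as 1800 s, "30-60s" as 30 s).
VENDORS = [
    Vendor("ElevenLabs Professional", 1800, 9.5, False),
    Vendor("ElevenLabs Instant (IVC)", 60, 8.5, False),
    Vendor("PlayHT 3.0", 30, 9.0, False),
    Vendor("Cartesia", 15, 8.0, False),
    Vendor("Google Chirp 3 HD", 10, 7.5, False),
    Vendor("F5-TTS", 10, 8.0, True),
    Vendor("XTTS-v2 (Coqui)", 6, 7.5, True),
    Vendor("Fish Speech (OpenAudio-S1)", 10, 8.0, True),
]


def shortlist(available_s: int, self_host: Optional[bool] = None) -> List[Vendor]:
    """Vendors usable with `available_s` seconds of audio, best fidelity first."""
    fits = [
        v for v in VENDORS
        if v.min_sample_s <= available_s
        and (self_host is None or v.self_host == self_host)
    ]
    # sorted() is stable, so ties keep their table order.
    return sorted(fits, key=lambda v: v.fidelity, reverse=True)


hosted_15s = [v.name for v in shortlist(15, self_host=False)]
self_hosted_10s = [v.name for v in shortlist(10, self_host=True)]
print(hosted_15s)
print(self_hosted_10s)
```

Licensing is deliberately left out of the filter: whether a self-hosted model is usable commercially depends on the license column, which needs a human read.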

Reference vs cloned fingerprint

A well-cloned voice matches the reference's formant positions and harmonic density. Cheap clones get the timbre wrong but fake the pitch — audible as an uncanny-valley effect.
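The gap between matching pitch and matching timbre can be shown with synthetic tones. The sketch below is a toy, not a speaker-similarity metric: it builds two signals with the same fundamental but different harmonic levels, then compares them with a naive DFT. All function names are hypothetical, and the 10% threshold in `pitch_hz` is an arbitrary choice that works for these clean tones.

```python
import cmath
import math

SR = 8000   # sample rate in Hz
N = 400     # analysis window, so each DFT bin is SR / N = 20 Hz wide


def harmonic_tone(f0, amps):
    """Sum of harmonics of f0; `amps` sets the relative level of each one."""
    return [
        sum(a * math.sin(2 * math.pi * f0 * (k + 1) * n / SR)
            for k, a in enumerate(amps))
        for n in range(N)
    ]


def magnitude_spectrum(signal):
    """Naive DFT magnitudes for bins 0 .. N/2 - 1 (fine for a short window)."""
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * n / N)
                for n, x in enumerate(signal)))
        for k in range(N // 2)
    ]


def pitch_hz(spec):
    """Crude pitch estimate: the lowest bin holding at least 10% of the peak."""
    floor = 0.1 * max(spec)
    return next(k for k, m in enumerate(spec) if m >= floor) * SR / N


def log_spectral_distance(spec_a, spec_b, eps=1e-6):
    """RMS difference of the two log-magnitude spectra, in dB."""
    diffs = [20 * math.log10((a + eps) / (b + eps))
             for a, b in zip(spec_a, spec_b)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Same 200 Hz fundamental, different harmonic balance (a stand-in for timbre).
ref_spec = magnitude_spectrum(harmonic_tone(200, [1.0, 0.6, 0.4, 0.2]))
clone_spec = magnitude_spectrum(harmonic_tone(200, [1.0, 0.1, 0.05, 0.5]))

print(pitch_hz(ref_spec), pitch_hz(clone_spec))      # pitches agree
print(log_spectral_distance(ref_spec, clone_spec))   # spectra do not
```

A pitch tracker alone would call these two signals a match; the spectral distance is what exposes the difference. Real evaluations use mel spectrograms or speaker-verification embeddings on actual speech rather than a raw DFT on tones.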

[Figure: mel spectrograms of the same line, “This voice is a clone trained on a short reference sample. Can you tell it apart?” Panel 1: Reference · Original speaker (10s sample). Panel 2: F5-TTS clone · Generated from 10s reference.]

Listen: one reference, four clones

All four clones read the same line: “This voice is a clone trained on a short reference sample. Can you tell it apart?”

ElevenLabs IVC · Consented clone · eleven_multilingual_v2 (sample TBD)
PlayHT 3.0 · Consented clone · Play 3.0 Mini (sample TBD)
Cartesia · Consented clone · sonic-2 (sample TBD)
F5-TTS · Self-hosted clone · F5-TTS (sample TBD)

Minimal cloning code

Hosted: ElevenLabs IVC

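# ElevenLabs Instant Voice Clone (IVC): 60s of consented audio.
# pip install elevenlabs
import os

from elevenlabs.client import ElevenLabs

# Read the API key from the environment rather than hardcoding it.
client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

# Upload the reference recording; the context manager closes the file.
with open("alice_reading_script.wav", "rb") as ref:
    voice = client.voices.ivc.create(
        name="Alice (consented)",
        files=[ref],
        description="Alice gave written consent on 2026-02-14. See consent-ledger.md.",
    )

# convert() yields the audio as an iterator of byte chunks.
audio = client.text_to_speech.convert(
    voice_id=voice.voice_id,
    model_id="eleven_multilingual_v2",
    text="This voice is a clone trained on Alice's consented sample.",
)

# Write the chunks to disk (the default output format is MP3).
with open("alice_clone.mp3", "wb") as out:
    for chunk in audio:
        out.write(chunk)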

Self-host: F5-TTS

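# F5-TTS: flow-matching, MIT-licensed code. Zero-shot cloning from a 10s reference.
# pip install f5-tts
from f5_tts.api import F5TTS

# With no arguments the default checkpoint is downloaded on first use.
# To load a local one, pass ckpt_file="F5-TTS/ckpts/model.pt"; the model-selection
# keyword has changed between releases, so check your installed version.
model = F5TTS()

# infer() returns the waveform, its sample rate, and the spectrogram.
wav, sr, spec = model.infer(
    ref_file="alice_reference.wav",
    ref_text="This is a reference sample from Alice.",
    gen_text="And this is new text in Alice's voice.",
    file_wave="alice_clone.wav",  # also save the result to disk
)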

Consent, ethics, and law

Voice cloning without consent is at minimum a civil wrong in most jurisdictions and a crime in some. The 2024 FCC ruling declaring AI-cloned voice robocalls illegal under the TCPA was a warning shot. The EU AI Act classifies voice cloning as limited-risk with mandatory disclosure.

Production minimums

Written, timestamped consent tied to the specific reference audio.
Per-clone audit log of prompts generated.
Watermark output audio, or make it detectable after the fact (Resemble AI, ElevenLabs AI Speech Classifier, or SileroWM).
Rate limiting and prompt moderation to block impersonation of public figures.
“AI-generated voice” disclosure on output, per EU AI Act.
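The first two items in the list above can be sketched as a small ledger. This is one possible shape under stated assumptions, not a compliance-ready implementation: consent is bound to the reference audio by its SHA-256 hash, and every generation is refused unless that hash has a consent record. `ConsentRecord`, `CloneAuditLog`, and the in-memory storage are all hypothetical; a real system would persist to append-only storage and keep the signed consent document alongside.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone


def audio_fingerprint(audio_bytes: bytes) -> str:
    """Hash of the exact reference audio that the consent covers."""
    return hashlib.sha256(audio_bytes).hexdigest()


def utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class ConsentRecord:
    speaker: str
    audio_sha256: str
    consented_at: str  # ISO 8601 timestamp, UTC


class CloneAuditLog:
    """In-memory stand-in for a consent ledger plus per-clone audit log."""

    def __init__(self):
        self.consents = {}  # audio_sha256 -> ConsentRecord
        self.entries = []   # one JSON string per generated prompt

    def record_consent(self, speaker: str, audio_bytes: bytes) -> ConsentRecord:
        record = ConsentRecord(speaker, audio_fingerprint(audio_bytes), utc_now())
        self.consents[record.audio_sha256] = record
        return record

    def log_generation(self, audio_bytes: bytes, prompt: str) -> None:
        digest = audio_fingerprint(audio_bytes)
        record = self.consents.get(digest)
        if record is None:
            # Refuse before any synthesis call is made.
            raise PermissionError("no consent on file for this reference audio")
        self.entries.append(json.dumps({
            "speaker": record.speaker,
            "audio_sha256": digest,
            "prompt": prompt,
            "generated_at": utc_now(),
        }))


log = CloneAuditLog()
# Placeholder bytes standing in for the contents of alice_reference.wav.
reference = b"placeholder bytes for alice_reference.wav"
log.record_consent("Alice", reference)
log.log_generation(reference, "And this is new text in Alice's voice.")
print(log.entries[0])
```

Calling `log_generation` before the synthesis request, rather than after, means a missing consent record blocks the clone instead of merely being noticed later. Hashing the audio also means that swapping in a different recording of the same speaker requires fresh consent.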

Vendor consent features

ElevenLabs: mandatory voice verification via spoken phrase for Pro clones.
Google Chirp 3 HD Custom Voice: requires consent statement embedded in reference audio.
PlayHT: identity verification + consent form on clone creation.
Open-source: no gates. Build your own — or you personally own the liability.
