F5-TTS Is Zero-Shot Voice Cloning You Can Run

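Behold: F5-TTS, a flow-matching TTS model that clones a voice from a 5-second clip and runs on consumer hardware. No fine-tuning, no enrollment, no waiting. Plot twist: the code is MIT licensed. The pretrained checkpoints are CC-BY-NC, so read the model license before you put them in anything commercial.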

The Setup

I tested it on a recording of my own voice for a Pipedrive call-back IVR experiment. The setup is genuinely just one pip install. On an M4 Mac it generates audio at roughly 8x realtime.

pip install f5-tts

# inference — one command, one reference clip
f5-tts_infer-cli \
    --model F5TTS_v1_Base \
    --ref_audio aiden_ref.wav \
    --ref_text "Hi, this is Aiden from Rebuild Relief." \
    --gen_text "Your hail damage assessment is scheduled for Tuesday."
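
To check a throughput figure like "8x realtime" on your own hardware, one option is to time a synthesis call and divide the seconds of audio produced by the wall-clock seconds spent. The sketch below assumes that reading of the figure. `measure_speedup` and the `synthesize` callable are hypothetical names, and the callable is expected to return a sequence of samples plus a sample rate.

```python
import time
from typing import Callable, Sequence, Tuple

# Hypothetical signature: any function that returns (samples, sample_rate).
Synth = Callable[[], Tuple[Sequence[float], int]]


def measure_speedup(synthesize: Synth) -> float:
    """Return seconds of audio generated per second of wall-clock time."""
    start = time.perf_counter()
    samples, sample_rate = synthesize()
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    return audio_seconds / elapsed


def fake_synthesize() -> Tuple[Sequence[float], int]:
    """Stand-in for a real TTS call: 2 seconds of silence at 24 kHz."""
    time.sleep(0.05)
    return [0.0] * 48_000, 24_000


if __name__ == "__main__":
    print(f"{measure_speedup(fake_synthesize):.1f}x realtime")
```

With F5-TTS, `synthesize` could be a small wrapper that calls the model and returns the waveform and sample rate. Run it once before timing so model loading is not counted.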

The Money Pattern

The Python API is the move if you want to batch this — say, generating personalised voicemail drops from a Pipedrive deal list. Wrap it in a FastAPI endpoint and call it from a Netlify function.

from f5_tts.api import F5TTS
from pathlib import Path

tts = F5TTS(model="F5TTS_v1_Base")

def clone_voice(ref_wav: Path, ref_text: str, line: str, out: Path):
    wav, sr, _ = tts.infer(
        ref_file=str(ref_wav),
        ref_text=ref_text,
        gen_text=line,
        nfe_step=32,        # quality vs speed dial
        cfg_strength=2.0,
        speed=1.0,
    )
    # export_wav takes no sample_rate argument; it writes at the model's own rate
    tts.export_wav(wav, str(out))

clone_voice(
    Path("aiden_ref.wav"),
    "Hi, this is Aiden from Rebuild Relief.",
    "Just confirming your roof inspection for Tuesday at 10am.",
    Path("out.wav"),
)
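
For the batch case, a loop over deal records might look like the sketch below. The record fields (`id`, `first_name`, `appointment`), the script template, and the `render_drops` helper are all hypothetical, not Pipedrive's schema. The synthesis step is passed in as a callable so the loop can be dry-run with a stub before the model is involved.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# Hypothetical script template; the placeholders must match the record keys.
TEMPLATE = "Hi {first_name}, just confirming your roof inspection for {appointment}."

# Any function that takes (line, out_path) and writes a file there.
Synth = Callable[[str, Path], None]


def render_drops(deals: Iterable[Dict[str, str]], out_dir: Path, synth: Synth) -> List[Path]:
    """Generate one voicemail file per deal and return the paths written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for deal in deals:
        line = TEMPLATE.format(**deal)  # extra keys such as "id" are ignored
        out = out_dir / f"drop_{deal['id']}.wav"
        synth(line, out)
        written.append(out)
    return written


def stub_synth(line: str, out: Path) -> None:
    """Stand-in that writes the script text instead of audio, for dry runs."""
    out.write_text(line, encoding="utf-8")


if __name__ == "__main__":
    deals = [
        {"id": "101", "first_name": "Sam", "appointment": "Tuesday at 10am"},
        {"id": "102", "first_name": "Rae", "appointment": "Wednesday at 2pm"},
    ]
    for path in render_drops(deals, Path("drops"), stub_synth):
        print(path)
```

To produce real audio, swap `stub_synth` for a function that forwards `line` and `out` to `clone_voice` with a fixed reference clip and transcript.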

The Catch

Full quality is English and Chinese only. Community fine-tunes for other languages exist, but you can hear the seams. If you're shipping Spanish voiceover, this isn't ready.

Also: zero-shot cloning from 5 seconds is a deepfake-shaped footgun. Consent in writing, watermark your outputs, and don't be the protagonist of a news article.
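
One lightweight way to keep the consent trail attached to each output is a sidecar record written next to the audio. The sketch below is a provenance log, not an audio watermark: it hashes the generated file and stores a pointer to the signed consent document. The field names and the `write_provenance` helper are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def write_provenance(audio: Path, speaker: str, consent_doc: str) -> Path:
    """Write <audio>.json recording who consented and a hash of the output."""
    record = {
        "file": audio.name,
        "sha256": hashlib.sha256(audio.read_bytes()).hexdigest(),
        "speaker": speaker,
        "consent_doc": consent_doc,  # e.g. path or ID of the signed release
        "synthetic": True,
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = audio.with_name(audio.name + ".json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

Calling `write_provenance(Path("out.wav"), "Aiden", "consent-001.pdf")` would leave an `out.wav.json` beside the clip. The consent file name here is a placeholder.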

The Verdict

For English voice work where you control the reference audio, F5-TTS is the best free option in the game right now. The fact it runs on my laptop without a CUDA dance is the part that makes it actually useful. Clone tonight, ship tomorrow.
