Five seconds of reference audio. That's it.
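Behold: F5-TTS, a flow-matching TTS model that clones a voice from a 5-second clip and runs on consumer hardware. No fine-tuning, no enrollment, no waiting. Plot twist: the code is MIT licensed. Fine print: the pretrained checkpoints ship under CC-BY-NC, so read that before you put it anywhere near a commercial workflow.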
The Setup
I tested it on a recording of my own voice for a Pipedrive call-back IVR experiment. The setup is genuinely just one pip install. On an M4 Mac it generates audio at roughly 8x realtime.
pip install f5-tts
# inference — one command, one reference clip
f5-tts_infer-cli \
  --model F5TTS_v1_Base \
  --ref_audio aiden_ref.wav \
  --ref_text "Hi, this is Aiden from Rebuild Relief." \
  --gen_text "Your hail damage assessment is scheduled for Tuesday."
The Money Pattern
The Python API is the move if you want to batch this — say, generating personalised voicemail drops from a Pipedrive deal list. Wrap it in a FastAPI endpoint and call it from a Netlify function.
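The batching half of that idea doesn't depend on the model at all, so here's a minimal sketch of it on its own. Everything in it is my own invention for illustration: `Deal` is a hypothetical stand-in for a row from your deal list (not Pipedrive's actual schema), `render_line` is a made-up template, and `synth` is whatever function turns text into a WAV file.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable


@dataclass(frozen=True)
class Deal:
    """Hypothetical minimal view of a deal row; not Pipedrive's real schema."""
    deal_id: int
    contact_name: str
    appointment: str


# Anything that turns (text, output path) into an audio file on disk.
Synth = Callable[[str, Path], None]


def render_line(deal: Deal) -> str:
    """Build the sentence to speak for one deal (illustrative template)."""
    return (
        f"Hi {deal.contact_name}, just confirming your roof inspection "
        f"for {deal.appointment}."
    )


def batch_voicemails(deals: Iterable[Deal], synth: Synth, out_dir: Path) -> list[Path]:
    """Synthesize one file per deal and return the paths that were requested."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for deal in deals:
        out = out_dir / f"deal_{deal.deal_id}.wav"
        synth(render_line(deal), out)
        written.append(out)
    return written
```

To plug in the real model, pass something like `lambda text, out: clone_voice(ref_wav, ref_text, text, out)` as `synth`. Keeping the synthesis function as a parameter also means you can test the loop with a fake that writes empty files, no GPU required.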
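from pathlib import Path

from f5_tts.api import F5TTS

# Load the model once and reuse it across calls.
tts = F5TTS(model="F5TTS_v1_Base")


def clone_voice(ref_wav: Path, ref_text: str, line: str, out: Path) -> None:
    """Speak `line` in the voice of `ref_wav` and write the result to `out`."""
    wav, sr, _ = tts.infer(
        ref_file=str(ref_wav),
        ref_text=ref_text,  # transcript of the reference clip
        gen_text=line,
        nfe_step=32,  # quality vs speed dial
        cfg_strength=2.0,
        speed=1.0,
    )
    # export_wav takes no sample-rate argument; it writes at the model's own rate (same value as sr).
    tts.export_wav(wav, str(out))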
clone_voice(
    Path("aiden_ref.wav"),
    "Hi, this is Aiden from Rebuild Relief.",
    "Just confirming your roof inspection for Tuesday at 10am.",
    Path("out.wav"),
)
The Catch
Full quality is English and Chinese only. Community fine-tunes for other languages exist, but you can hear the seams. If you're shipping Spanish voiceover, this isn't ready.
Also: zero-shot cloning from 5 seconds is a deepfake-shaped footgun. Get consent in writing, watermark your outputs, and don't be the protagonist of a news article.
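One way to make the consent part enforceable rather than aspirational is to gate synthesis on a record tied to the exact reference clip. This is a rough sketch under my own assumptions: the registry is a hypothetical JSON file mapping a clip's SHA-256 to whatever consent details you keep, and `require_consent` is a name I made up. It is not an audio watermark, just a check that runs before any cloning does.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of a file's bytes, used to identify one exact clip."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def require_consent(ref_wav: Path, registry: Path) -> dict:
    """Return the consent record for this clip, or raise PermissionError.

    `registry` is a hypothetical JSON file shaped like
    {"<sha256 of clip>": {"speaker": "...", ...}}.
    """
    records = json.loads(registry.read_text())
    record = records.get(sha256_of(ref_wav))
    if record is None:
        raise PermissionError(f"no consent record for {ref_wav.name}")
    return record
```

Call it at the top of whatever wraps your synthesis call. Keying on the file hash means a re-recorded or swapped clip fails the check even if the filename stays the same.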
The Verdict
For English voice work where you control the reference audio, F5-TTS is the best free option in the game right now. The fact that it runs on my laptop without a CUDA dance is what makes it actually useful. Clone tonight, ship tomorrow.