The gap is closing — but it hasn't closed
In case you've been living under a rock: the open-source voice scene has been on a tear. XTTS-V2, F5-TTS, and OpenVoice are all free and all runnable on an M4 Mac. So is ElevenLabs still worth the bill? Spoiler: yes, for one specific thing.
The Setup
I A/B'd these on the same 12-second voice sample for a Rebuild Relief explainer narration. Same script, same reference audio, blind test with the team.
{`# XTTS-V2 — the open contender
pip install TTS
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \\
--text "Hi, this is Aiden from Rebuild Relief." \\
--speaker_wav ref.wav \\
--language_idx en \\
--out_path out.wav`}
The Money Pattern
For app-side TTS where the voice doesn't need to be a specific person, XTTS-V2 is free and good enough. For voice cloning of a real person whose voice listeners already know, ElevenLabs still wins: the open models' prosody and breath pacing aren't close yet.
{`import { ElevenLabsClient } from "elevenlabs";

const eleven = new ElevenLabsClient({ apiKey: process.env.ELEVEN_KEY });

const audio = await eleven.generate({
  voice: "aiden-clone-v2",
  model_id: "eleven_turbo_v2_5",
  text: "Your claim has been approved.",
  voice_settings: {
    stability: 0.45,
    similarity_boost: 0.85,
    style: 0.2,
  },
});

// generate() resolves to an audio stream, not a buffer, so collect the chunks first
const chunks = [];
for await (const chunk of audio) chunks.push(chunk);

// write the MP3 to disk; FFmpeg picks it up from there for the final mix
await Bun.write("voiceover.mp3", Buffer.concat(chunks));`}
The Catch
ElevenLabs is still expensive at scale, and the rate limits will bite you the moment you hit production traffic. XTTS-V2 is free, but inference latency on CPU is brutal: you need a GPU or an Apple Silicon Mac to get real-time output.
F5-TTS is the dark horse: a flow-matching architecture that runs faster than XTTS, but quality is a coin flip on reference clips shorter than 5 seconds.
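On the rate-limit side, the usual mitigation is to retry with exponential backoff rather than fail the request outright. Here is a minimal sketch of that idea. The helper names (`backoffDelayMs`, `withBackoff`) are made up for illustration and are not part of the ElevenLabs SDK, and the check for a 429 status on the thrown error is an assumption about how your client surfaces rate-limit errors.

```typescript
/** Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at capMs. */
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

/**
 * Assumption: a rate-limited call throws an error carrying the HTTP status
 * as `statusCode` or `status`. Adjust this to whatever your client throws.
 */
function isRateLimited(err: unknown): boolean {
  const e = err as { statusCode?: number; status?: number } | null;
  return e?.statusCode === 429 || e?.status === 429;
}

/** Runs `call`, retrying only on rate-limit errors, up to `maxRetries` times. */
async function withBackoff<T>(call: () => Promise<T>, maxRetries = 4): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Anything that is not a rate limit, or the final attempt, is rethrown as-is.
      if (!isRateLimited(err) || attempt >= maxRetries) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}
```

You would wrap the TTS call in it, for example `withBackoff(() => eleven.generate(request))`. With the defaults the waits are 500 ms, 1 s, 2 s, and 4 s before giving up, which is fine for batch jobs but probably too slow for anything a user is actively waiting on.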
The Verdict
Hybrid stack wins. ElevenLabs for the hero asset where the voice has to be perfect. XTTS-V2 or F5-TTS for everything else — push notifications, in-app TTS, batch narration over CSVs of insurance copy. Your spend drops 90% and nobody hears the difference where it matters.
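To make the hybrid rule concrete, here is a rough sketch of a router that picks an engine per job. Everything in it is hypothetical: the `TtsJob` fields, the engine labels, and the `pickEngine` name are illustrative, and the 5-second cutoff simply encodes the F5-TTS short-reference caveat from above rather than any documented limit.

```typescript
type Engine = "elevenlabs" | "xtts_v2" | "f5_tts";

interface TtsJob {
  /** True when listeners know the real person's voice and it has to be perfect. */
  heroAsset: boolean;
  /** Length of the reference clip in seconds. */
  referenceSeconds: number;
  /** Optional hint that speed matters more than consistency for this job. */
  preferFast?: boolean;
}

// Assumption: below this reference length, F5-TTS output is too hit-or-miss to rely on.
const F5_MIN_REFERENCE_SECONDS = 5;

function pickEngine(job: TtsJob): Engine {
  // Hero assets always go to the paid API.
  if (job.heroAsset) return "elevenlabs";
  // F5-TTS only when speed is requested and the reference clip is long enough.
  if (job.preferFast && job.referenceSeconds >= F5_MIN_REFERENCE_SECONDS) return "f5_tts";
  // Everything else (push notifications, in-app TTS, batch narration) stays local.
  return "xtts_v2";
}
```

For example, `pickEngine({ heroAsset: false, referenceSeconds: 3, preferFast: true })` falls back to `"xtts_v2"` because the reference clip is under the cutoff. Keeping the decision in one pure function makes it easy to adjust the rule later without touching the call sites.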