6x faster, 49% smaller, same WER on English
Plot twist: Hugging Face's Distil-Whisper is one of those rare distillations that actually keeps the quality. It's 6x faster and 49% smaller than Whisper Large V2, and the WER stays within 1% on long-form English. Do not @ me, the paper is public.
The Setup
I dropped this into the same FFmpeg pipeline I was using AssemblyAI for. On the M4 Mac it runs at roughly 30x realtime — a 60-minute interview transcribes while I'm boiling the kettle.
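To put a number on "30x realtime" for your own machine, the arithmetic is audio duration divided by wall-clock time. The sketch below is illustrative only: `realtime_factor` and `time_transcription` are hypothetical helper names, and `transcribe` stands for whatever callable you end up wrapping around the model.

```python
import time


def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio processed per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def time_transcription(transcribe, path: str, audio_seconds: float) -> float:
    """Time one call to transcribe(path) and return the realtime factor."""
    start = time.perf_counter()
    transcribe(path)
    return realtime_factor(audio_seconds, time.perf_counter() - start)


# A 60-minute interview finished in 120 s of wall time is 3600 / 120 = 30x.
print(realtime_factor(audio_seconds=3600, wall_seconds=120))  # 30.0
```

At 30x, that 60-minute file costs about two minutes of compute, which is kettle territory.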
# quotes keep zsh (the macOS default shell) from globbing the brackets
pip install --upgrade transformers accelerate "datasets[audio]"
# pull the weights; distil-large-v3 is the move
huggingface-cli download distil-whisper/distil-large-v3
The Money Pattern
The transformers pipeline wrapper makes this stupidly easy. Stand up a Netlify function or a queue worker, point it at the chunked audio, done.
from transformers import pipeline
import torch
asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v3",
    torch_dtype=torch.float16,
    device="mps",  # M4 Mac (Apple Silicon GPU)
    chunk_length_s=30,  # the model card suggests 25 for distil-large-v3
    batch_size=8,
)
out = asr(
    "call-recording.wav",
    return_timestamps=True,
)
print(out["text"])
for chunk in out["chunks"]:
    print(chunk["timestamp"], chunk["text"])
The Catch
It's English-only at the distil-large-v3 quality tier. The multilingual distils exist but they're noticeably behind the full Whisper V3 on anything that isn't English. If you're transcribing Spanish or French calls, this isn't your model.
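If your pipeline sees mixed-language audio, one way to handle it is to route by language tag before loading a model. This is a sketch under assumptions: `pick_model` is a hypothetical helper, the language code is assumed to come from your own metadata, and the fallback to `openai/whisper-large-v3` is just one reasonable choice.

```python
DISTIL_EN = "distil-whisper/distil-large-v3"
FULL_MULTILINGUAL = "openai/whisper-large-v3"  # assumed fallback for non-English


def pick_model(language_code: str) -> str:
    """Route English audio to the distilled model, everything else to full Whisper."""
    lang = language_code.strip().lower().replace("_", "-")
    is_english = lang == "en" or lang.startswith("en-")
    return DISTIL_EN if is_english else FULL_MULTILINGUAL


for code in ("en", "en-GB", "es", "fr"):
    print(code, "->", pick_model(code))
```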
Also worth knowing: it inherits Whisper's silence-hallucination quirk. Run voice activity detection (VAD) first and only feed it the speech, or you'll get "thank you for watching" sprinkled through your transcript.
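To make the idea concrete, here is a crude energy gate that finds the non-silent spans in a mono signal. It is a stand-in for illustration, not a real VAD: a trained detector such as Silero VAD or WebRTC VAD will do far better on noisy calls. The function name `speech_regions`, the -40 dB threshold, and the 300 ms gap tolerance are all assumptions to tune, and samples are assumed to be floats in [-1, 1].

```python
import math
from typing import List, Sequence, Tuple


def speech_regions(
    samples: Sequence[float],
    sample_rate: int,
    frame_ms: int = 30,
    threshold_db: float = -40.0,
    max_gap_ms: int = 300,
) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) spans whose per-frame RMS level exceeds threshold_db."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    max_gap_frames = max_gap_ms // frame_ms
    regions: List[List[int]] = []  # [start_frame, end_frame_exclusive]
    n_frames = math.ceil(len(samples) / frame_len)
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        level_db = 20 * math.log10(max(rms, 1e-10))  # floor avoids log10(0)
        if level_db < threshold_db:
            continue  # silent frame
        if regions and i - regions[-1][1] <= max_gap_frames:
            regions[-1][1] = i + 1  # bridge a short pause into the open region
        else:
            regions.append([i, i + 1])
    return [
        (start * frame_len / sample_rate,
         min(end * frame_len, len(samples)) / sample_rate)
        for start, end in regions
    ]


# Synthetic check: 1 s of silence, 1 s of a 440 Hz tone, 1 s of silence.
sr = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
signal = [0.0] * sr + tone + [0.0] * sr
regions = speech_regions(signal, sr)
print(regions)  # one region, roughly 1 s to 2 s
```

In practice you would slice the audio to those spans and hand only the slices to the pipeline, so the model never sees the long silences that trigger the hallucinations.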
The Verdict
For English ASR on owned hardware, distil-large-v3 is the cheapest watts-per-minute you'll find. I've already migrated two internal Rebuild Relief tools off paid APIs and onto this. If your audio is English, your bill just got smaller.