Distil-Whisper Is Smaller And Somehow Faster

Plot twist: Hugging Face's Distil-Whisper is one of those rare distillations that actually keeps the quality. It's 6x faster and 49% smaller than Whisper Large V2, and it lands within 1% WER of the teacher on long-form English. Do not @ me, the paper is public.

The Setup

I dropped this into the same FFmpeg pipeline that used to feed AssemblyAI. On the M4 Mac it runs at roughly 30x realtime: a 60-minute interview transcribes while I'm boiling the kettle.
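The kettle math, as a tiny sketch. The 30x figure is my rough observation on one machine, not a benchmark, and `wall_clock_minutes` is just an illustrative helper name.

```python
def wall_clock_minutes(audio_minutes: float, realtime_factor: float) -> float:
    """Minutes of compute needed to transcribe `audio_minutes` of audio."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_minutes / realtime_factor


# 60 minutes of audio at roughly 30x realtime
print(wall_clock_minutes(60, 30))  # 2.0
```

Plug in your own measured factor before you promise anyone a turnaround time.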

# quote the extras so zsh (the macOS default shell) doesn't try to glob the brackets
pip install --upgrade transformers accelerate "datasets[audio]"

# pull the weights — distil-large-v3 is the move
huggingface-cli download distil-whisper/distil-large-v3

The Money Pattern

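The transformers pipeline wrapper makes this stupidly easy. Stand up a queue worker (a Netlify function is fine for enqueueing jobs, but too small to host the model itself), point it at the audio, and let the pipeline handle the chunking. Done.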

from transformers import pipeline
import torch

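asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v3",
    torch_dtype=torch.float16,  # half precision: less memory, faster
    device="mps",  # M4 Mac; use "cuda:0" on an NVIDIA box
    chunk_length_s=25,  # chunked long-form; the model card suggests 25 s for distil-large-v3
    batch_size=8,  # chunks transcribed per forward pass
)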

# return_timestamps=True adds segment-level (start, end) times in seconds
out = asr(
    "call-recording.wav",
    return_timestamps=True,
)

print(out["text"])  # full transcript
for chunk in out["chunks"]:
    print(chunk["timestamp"], chunk["text"])
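If you want subtitles rather than a wall of text, the chunks convert to SRT with a few lines. This is a minimal sketch, assuming each chunk looks like `{"timestamp": (start, end), "text": ...}` as in the loop above. `srt_time`, `chunks_to_srt` and the 2-second `fallback_len` are my own illustrative choices, and the sample data is hand-written, not real model output.

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks: list[dict], fallback_len: float = 2.0) -> str:
    """Turn pipeline-style chunks into SRT text.

    The end time can come back as None on the final chunk, so it is padded
    with `fallback_len` seconds (an arbitrary choice for this sketch).
    """
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            end = start + fallback_len
        text = chunk["text"].strip()
        blocks.append(f"{i}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(blocks)


# Hand-written sample in the shape the pipeline returns; not real output.
sample = [
    {"timestamp": (0.0, 2.5), "text": " Hello there."},
    {"timestamp": (2.5, None), "text": " Thanks for calling."},
]
print(chunks_to_srt(sample))
```

In the real pipeline you would pass `out["chunks"]` instead of `sample` and write the result to a `.srt` file.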

The Catch

It's English-only at the distil-large-v3 quality tier. The multilingual distils exist but they're noticeably behind the full Whisper V3 on anything that isn't English. If you're transcribing Spanish or French calls, this isn't your model.
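If your queue sees mixed languages, one option is to route by language code before you load anything. A sketch only: sending every non-English job to full Whisper large-v3 is my assumption rather than a measured result, and `pick_model` is a made-up helper.

```python
# Model IDs on the Hugging Face Hub. Routing every non-English language to
# full Whisper large-v3 is this sketch's assumption, not a benchmark result.
ENGLISH_MODEL = "distil-whisper/distil-large-v3"
MULTILINGUAL_MODEL = "openai/whisper-large-v3"


def pick_model(language_code: str) -> str:
    """Return a model ID for a language code such as "en", "en-US" or "es"."""
    code = language_code.strip().lower().split("-")[0]  # "en-US" -> "en"
    return ENGLISH_MODEL if code == "en" else MULTILINGUAL_MODEL


for lang in ("en", "en-GB", "es", "fr"):
    print(lang, "->", pick_model(lang))
```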

Also worth knowing: it inherits Whisper's silence-hallucination quirk. Run voice activity detection (VAD) first and drop the silent stretches, or you'll get "thank you for watching" sprinkled through your transcript.
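To show the idea, here is a crude energy gate in plain Python. It is a stand-in for a real VAD, not a replacement: a trained model handles background noise far better. The function name, the 0.01 RMS threshold and the gap length are all assumptions picked for the demo.

```python
import math


def speech_regions(samples, sample_rate, frame_ms=30, threshold=0.01, min_gap_frames=10):
    """Return (start_s, end_s) spans whose RMS energy reaches `threshold`.

    `samples` is a sequence of floats in [-1.0, 1.0]. Loud frames separated by
    fewer than `min_gap_frames` quiet frames are merged into one span.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    regions = []
    start = None      # index of the first loud frame in the open span
    last_loud = None  # index of the most recent loud frame

    def close_span(first, last):
        end_sample = min((last + 1) * frame_len, len(samples))
        regions.append((first * frame_len / sample_rate, end_sample / sample_rate))

    n_frames = math.ceil(len(samples) / frame_len)
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= threshold:
            if start is None:
                start = i
            last_loud = i
        elif start is not None and i - last_loud >= min_gap_frames:
            close_span(start, last_loud)
            start = None
    if start is not None:
        close_span(start, last_loud)
    return regions


# Toy signal: 1 s silence, 1 s of a 50 Hz tone, 2 s silence.
rate = 1000  # unrealistically low sample rate to keep the demo small
tone = [0.5 * math.sin(2 * math.pi * 50 * n / rate) for n in range(rate)]
audio = [0.0] * rate + tone + [0.0] * (2 * rate)
print(speech_regions(audio, rate, frame_ms=100))  # [(1.0, 2.0)]
```

Cut the audio down to the returned spans before it reaches the model, then shift the chunk timestamps back by each span's start so they still line up with the original recording.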

The Verdict

For English ASR on owned hardware, distil-large-v3 is the cheapest watts-per-minute you'll find. I've already migrated two internal Rebuild Relief tools off paid APIs and onto this. If your audio is English, your bill just got smaller.
