How a 1B student model ends up acting like a 70B teacher
In case you've been living under a rock: every smart model shipping in 2026 is secretly a distilled version of something twice its size. Plot twist: the technique has been sitting in plain sight since 2015.
The Setup
Distillation is dead simple. You take a big teacher model, generate a fat pile of outputs, then train a tiny student to mimic the teacher's logits. Not the labels — the full probability distribution. That soft signal is where the magic lives.
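To see why the full distribution carries more signal than a label, here is a minimal sketch in plain Python. The three-token vocabulary and the logit values are made up for illustration; only the softmax-with-temperature formula is standard.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits after dividing by `temperature`."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits over a toy vocabulary: ["cat", "dog", "car"]
teacher_logits = [4.0, 3.0, -2.0]

hard_label = [1.0, 0.0, 0.0]                        # all a label tells the student
soft_t1 = softmax(teacher_logits, temperature=1.0)  # teacher's distribution as-is
soft_t2 = softmax(teacher_logits, temperature=2.0)  # flatter: runner-up tokens get more weight

for name, dist in [("hard", hard_label), ("T=1", soft_t1), ("T=2", soft_t2)]:
    print(name, [round(p, 3) for p in dist])
```

The hard label only says "cat". The soft targets also say that "dog" is a near miss and "car" is not, and that is the extra information the student trains on.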
# step 1: generate teacher outputs for your dataset
python collect_teacher_logits.py \
    --teacher meta-llama/Meta-Llama-3-70B-Instruct \
    --dataset my_prompts.jsonl \
    --out teacher_logits.parquet
The Money Pattern
The training loss is the KL divergence between the temperature-softened teacher and student distributions. Mix in a little hard-label cross-entropy and you're done. Prototype on an M4 Mac, then rent an H100 for the real run.
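Written out, the objective looks roughly like this. The notation is illustrative rather than taken from any library: z_s and z_t are the student and teacher logits, T is the temperature, y is the hard label, and α is the mixing weight (0.7 in the code that follows).

```latex
p_t = \mathrm{softmax}(z_t / T), \qquad p_s = \mathrm{softmax}(z_s / T)

\mathcal{L} = \alpha \, T^2 \, \mathrm{KL}\!\left(p_t \,\|\, p_s\right)
            + (1 - \alpha) \, \mathrm{CE}\!\left(y, \mathrm{softmax}(z_s)\right)
```

The T² factor is there because the soft-target gradients shrink by roughly 1/T², so multiplying it back keeps the two terms on a comparable scale when you change the temperature.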
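import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# The student must share the teacher's tokenizer and vocab, or the two logit
# tensors cannot be compared token by token. Llama 3.2 1B reuses the Llama 3 vocab.
student = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
opt = torch.optim.AdamW(student.parameters(), lr=1e-5)  # placeholder learning rate
T = 2.0  # temperature, soft targets

student.train()
# `loader` is assumed to yield dicts with "inputs", "teacher_logits" and
# "labels" (unshifted token ids, -100 on positions to ignore).
for batch in loader:
    s_logits = student(**batch["inputs"]).logits      # [batch, seq, vocab]
    t_logits = batch["teacher_logits"].to(s_logits)   # match device and dtype
    vocab = s_logits.size(-1)
    # Flatten to [batch * seq, vocab] so "batchmean" averages per token.
    kd = F.kl_div(
        F.log_softmax(s_logits.reshape(-1, vocab) / T, dim=-1),
        F.softmax(t_logits.reshape(-1, vocab) / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Causal LM: logits at position i predict token i + 1, so shift for the CE.
    ce = F.cross_entropy(
        s_logits[:, :-1].reshape(-1, vocab),
        batch["labels"][:, 1:].reshape(-1),
    )
    loss = 0.7 * kd + 0.3 * ce
    loss.backward()
    opt.step()
    opt.zero_grad()
The Catch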
The upfront cost is brutal. Running 70B inference across millions of prompts to collect logits is not free: you're basically prepaying your future inference savings. Plus you need disk. Lots of disk. Full logits at fp16 across a 128k vocab come to roughly 256 KB per token position, and that adds up fast.
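A quick back-of-the-envelope sketch of that disk bill. The prompt count and sequence length are made-up numbers for illustration; 128,256 is the Llama 3 vocabulary size, and 2 bytes per value is fp16.

```python
VOCAB_SIZE = 128_256      # Llama 3 vocabulary size
BYTES_PER_VALUE = 2       # fp16
NUM_PROMPTS = 1_000_000   # assumption: low end of "millions of prompts"
TOKENS_PER_PROMPT = 512   # assumption: average sequence length

# One full logit vector is stored for every token position.
bytes_per_token = VOCAB_SIZE * BYTES_PER_VALUE
bytes_per_prompt = bytes_per_token * TOKENS_PER_PROMPT
total_bytes = bytes_per_prompt * NUM_PROMPTS

print(f"per token:  {bytes_per_token / 1e3:.1f} KB")
print(f"per prompt: {bytes_per_prompt / 1e6:.1f} MB")
print(f"total:      {total_bytes / 1e12:.1f} TB")
```

With these assumed numbers the raw, uncompressed logits land well over a hundred terabytes, so price the storage before you kick off the teacher run.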
The Verdict
Every team shipping a "small fast model" in 2026 is doing this, even if the press release calls it a "novel architecture." If you've got a teacher model you trust and a domain you care about, distill. The student will be roughly 40x faster and about 90% as good. That's the whole pitch.