Distillation Is The Most Underrated Technique

How a 1B student model ends up acting like a 70B teacher

In case you've been living under a rock: plenty of the small, smart models shipping in 2026 are distilled versions of something many times their size. Plot twist: the technique has been sitting in plain sight since 2015.

The Setup

Distillation is dead simple. You take a big teacher model, generate a fat pile of outputs, then train a tiny student to mimic the teacher's logits. Not the labels — the full probability distribution. That soft signal is where the magic lives.
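To make "soft signal" concrete, here is a minimal sketch in plain Python. The four-word vocabulary and the teacher logits are made up for illustration; the point is only to show how a hard label differs from the teacher's distribution, and how a temperature above 1 moves probability onto the runner-up tokens.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / T for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits over a toy 4-token vocabulary.
vocab = ["cat", "kitten", "dog", "carburetor"]
teacher_logits = [6.0, 4.0, 2.0, -3.0]

hard_label = [1.0, 0.0, 0.0, 0.0]          # what plain fine-tuning sees
soft_t1 = softmax(teacher_logits, T=1.0)   # teacher distribution as-is
soft_t2 = softmax(teacher_logits, T=2.0)   # softened: more mass on runners-up

for name, p1, p2 in zip(vocab, soft_t1, soft_t2):
    print(f"{name:>10}  T=1: {p1:.3f}  T=2: {p2:.3f}")
```

The hard label says only "cat". The soft targets also say that "kitten" is a near miss and "carburetor" is not, which is the extra information the student gets to learn from.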

# step 1: generate teacher outputs for your dataset
python collect_teacher_logits.py \
  --teacher meta-llama/Meta-Llama-3-70B-Instruct \
  --dataset my_prompts.jsonl \
  --out teacher_logits.parquet

The Money Pattern

The training loss is a KL divergence between the temperature-softened student and teacher distributions. Mix in a little hard-label cross-entropy and you're done. Prototype on an M4 Mac, then rent an H100 for the real run.

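import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Logit-level KD needs the student to share the teacher's vocabulary.
# Llama-3.2-1B uses the Llama 3 tokenizer; a Qwen student would not line up.
student = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
opt = torch.optim.AdamW(student.parameters(), lr=1e-5)  # lr is a placeholder

T = 2.0  # temperature, soft targets
# loader is assumed to yield inputs, labels, and the stored teacher logits
for batch in loader:
    s_logits = student(**batch["inputs"]).logits
    t_logits = batch["teacher_logits"].float()
    V = s_logits.size(-1)

    # flatten to (tokens, vocab) so "batchmean" averages per token
    # (padding positions should also be masked out; omitted here)
    kd = F.kl_div(
        F.log_softmax(s_logits.reshape(-1, V) / T, dim=-1),
        F.softmax(t_logits.reshape(-1, V) / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # position i predicts token i+1, so shift before the hard-label loss
    ce = F.cross_entropy(s_logits[:, :-1].reshape(-1, V),
                         batch["labels"][:, 1:].reshape(-1))
    loss = 0.7 * kd + 0.3 * ce
    loss.backward()
    opt.step()
    opt.zero_grad()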
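The loss in the loop above can be checked by hand for a single token position. This is a dependency-free sketch, not the training code itself: `distill_loss` and the toy logits are hypothetical names and values, while the 0.7/0.3 mix and T = 2.0 are taken from the loop.

```python
import math

def log_softmax(logits, T=1.0):
    """Log of the temperature-scaled softmax, computed stably."""
    scaled = [x / T for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def distill_loss(student_logits, teacher_logits, label, T=2.0, alpha=0.7):
    """One position: alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE."""
    s_logp = log_softmax(student_logits, T)
    t_logp = log_softmax(teacher_logits, T)
    # KL(teacher || student) on the softened distributions
    kd = sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    kd *= T * T
    # hard-label cross-entropy uses the unsoftened student distribution
    ce = -log_softmax(student_logits, 1.0)[label]
    return alpha * kd + (1 - alpha) * ce, kd, ce

teacher = [6.0, 4.0, 2.0, -3.0]            # toy teacher logits
matched = distill_loss(teacher, teacher, label=0)
off = distill_loss([2.0, 5.0, 1.0, 0.0], teacher, label=0)
print("student == teacher:", matched)
print("student off-target:", off)
```

When the student reproduces the teacher exactly, the KD term is zero and only the cross-entropy remains. The T * T factor is the usual correction: gradients of the softened term shrink roughly as 1/T², so scaling it back keeps the two terms in balance when you change the temperature.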

The Catch

The upfront cost is brutal. Running 70B inference across millions of prompts to collect logits is not free — you're basically prepaying your future inference savings. Plus you need disk. Lots of disk. Logits at fp16 across a 128k vocab add up fast.
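A back-of-envelope calculation shows how fast "lots of disk" arrives. The prompt count and tokens per prompt below are assumptions picked for illustration; 128,256 is the Llama 3 vocabulary size. The top-k variant is a common workaround and not something the pipeline above does: keep only the k largest logits per position, stored as a value plus an int32 token index.

```python
def logit_storage_bytes(num_prompts, tokens_per_prompt, vocab_size, bytes_per_value=2):
    """Bytes needed to store full logits for every token position (fp16 by default)."""
    return num_prompts * tokens_per_prompt * vocab_size * bytes_per_value

def topk_storage_bytes(num_prompts, tokens_per_prompt, k, bytes_per_value=2, bytes_per_index=4):
    """Bytes needed if only the top-k logits and their token indices are kept."""
    return num_prompts * tokens_per_prompt * k * (bytes_per_value + bytes_per_index)

VOCAB = 128_256        # Llama 3 vocabulary size
PROMPTS = 1_000_000    # assumed dataset size
TOKENS = 256           # assumed average tokens per prompt

full = logit_storage_bytes(PROMPTS, TOKENS, VOCAB)
topk = topk_storage_bytes(PROMPTS, TOKENS, k=64)
print(f"full fp16 logits: {full / 1e12:.1f} TB")   # 65.7 TB
print(f"top-64 logits:    {topk / 1e9:.1f} GB")    # 98.3 GB
```

Under these assumptions the full logits are far too big to store casually, while a top-64 cut fits on a single drive. The trade-off is that the student no longer sees the tail of the distribution, so whether truncation is acceptable depends on your task.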

The Verdict

Plenty of teams shipping a "small fast model" in 2026 are doing this, even if the press release calls it a "novel architecture". If you've got a teacher model you trust and a domain you care about, distill. The pitch is a student that runs something like 40x faster and lands around 90% as good; treat those as ballpark figures and measure on your own evals.
