Hand-written Triton kernels and 70% less VRAM, for free
Spoiler: the Unsloth team rewrote the autograd path in Triton again and your fine-tunes just got cheaper. Same dataset, same LoRA, half the wall time, a third of the VRAM. Free.
The Setup
Unsloth is a drop-in replacement for the model-loading half of a Hugging Face fine-tune. You import its FastLanguageModel loader instead of AutoModelForCausalLM, attach LoRA, hand the model to TRL's SFTTrainer, and the rest of your code is identical. The kernels do the heavy lifting.
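For a sense of how little "attach LoRA" actually trains, here's a back-of-envelope count. It's a sketch, not anything Unsloth or PEFT runs: the `Linear` helper is made up for illustration, and the shapes assume a Llama-3-8B-style attention block (hidden size 4096, KV width 1024, 32 blocks) with rank 16 on the four attention projections.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Linear:
    """Shape of one frozen linear layer (hypothetical helper, not a library type)."""
    name: str
    d_in: int
    d_out: int


def lora_params(layer: Linear, r: int) -> int:
    # LoRA adds two small matrices, A (r x d_in) and B (d_out x r).
    # The original d_out x d_in weight stays frozen.
    return r * (layer.d_in + layer.d_out)


# Assumed shapes for a Llama-3-8B-style attention block.
LAYERS = [
    Linear("q_proj", 4096, 4096),
    Linear("k_proj", 4096, 1024),
    Linear("v_proj", 4096, 1024),
    Linear("o_proj", 4096, 4096),
]
NUM_BLOCKS = 32  # assumed number of transformer blocks
RANK = 16        # assumed LoRA rank

per_block = sum(lora_params(layer, RANK) for layer in LAYERS)
total = per_block * NUM_BLOCKS
print(f"{per_block:,} per block, {total:,} total")
# prints "425,984 per block, 13,631,488 total"
```

Under those assumptions that's about 13.6M trainable parameters sitting next to roughly 8B frozen ones, which is why the optimizer state stays small and the adapter is cheap to ship.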
```bash
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
```
The Money Pattern
Here's the whole script. An 8B QLoRA fine-tune on an A100 used to take about six hours; this finishes in about half that. On a 4090 at home it's the difference between "leave it overnight" and "have a coffee".
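```python
from unsloth import FastLanguageModel  # import first so Unsloth can patch transformers/TRL
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Placeholder dataset: any Dataset with a "text" column works.
# "train.jsonl" is a stand-in path, not a file this post ships.
ds = load_dataset("json", data_files="train.jsonl", split="train")

# Pre-quantized 4-bit base model (QLoRA-style).
model, tok = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach rank-16 LoRA adapters to the attention projections.
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Older TRL signature; newer TRL releases move dataset_text_field into SFTConfig.
trainer = SFTTrainer(
    model=model, tokenizer=tok,
    train_dataset=ds, dataset_text_field="text",
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch per device: 2 * 4 = 8 sequences
        warmup_steps=10, max_steps=200,
        learning_rate=2e-4, bf16=True,  # bf16 needs an Ampere-or-newer GPU
        output_dir="out",
    ),
)
trainer.train()
```
The Catch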
NVIDIA-only. The Triton kernels assume CUDA, so AMD ROCm and Apple silicon are out. If you're training on an M-series Mac, you're still stuck with MLX (which has its own LoRA story, but that's a different post). For NVIDIA cloud GPUs, Unsloth is a no-brainer.
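If you're not sure which bucket a machine falls into, a quick probe settles it before you install anything. This is a minimal sketch: `pick_stack`, `probe`, and the `"unsloth"` / `"mlx"` / `"other"` labels are names invented here, and the only real APIs it leans on are PyTorch's `torch.cuda.is_available()`, `torch.version.hip`, and `torch.backends.mps.is_available()`.

```python
def pick_stack(cuda: bool, rocm: bool, mps: bool) -> str:
    """Map detected hardware to the stack this post suggests (labels are illustrative)."""
    if cuda and not rocm:
        return "unsloth"  # NVIDIA CUDA: the supported path described above
    if mps:
        return "mlx"      # Apple silicon: fall back to MLX
    return "other"        # ROCm or CPU-only: out of scope here


def probe() -> dict:
    """Best-effort hardware probe; reports all-False if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return {"cuda": False, "rocm": False, "mps": False}
    mps_backend = getattr(torch.backends, "mps", None)
    return {
        "cuda": torch.cuda.is_available(),
        # ROCm builds of PyTorch also answer True to cuda.is_available(),
        # so torch.version.hip is what tells the two apart.
        "rocm": getattr(torch.version, "hip", None) is not None,
        "mps": bool(mps_backend and mps_backend.is_available()),
    }


if __name__ == "__main__":
    hw = probe()
    print(hw, "->", pick_stack(**hw))
```

Keeping the decision in `pick_stack` separate from the probe means you can sanity-check the logic on a laptop with no GPU and no PyTorch at all.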
The Verdict
If you're fine-tuning anything on NVIDIA and not using Unsloth, you're lighting cloud credits on fire. Swap your loader, keep your dataset, watch the wall clock collapse. Then ship the adapter and pretend it was hard.