Unsloth Is 2x Faster Than Last Year

Spoiler: the Unsloth team rewrote the autograd path in Triton again and your fine-tunes just got cheaper. Same dataset, same LoRA, half the wall time, a third of the VRAM. Free.

The Setup

Unsloth is a drop-in replacement for the model-loading half of a Hugging Face fine-tune, not for the training loop itself. You import its FastLanguageModel loader instead of AutoModelForCausalLM, attach LoRA, hand the model to TRL's SFTTrainer, and the rest of your code stays the same. The kernels do the heavy lifting.

{`pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes`}
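Because the second command uses --no-deps, pip won't complain if something is missing. A quick sanity check, as a minimal sketch using only the standard library (the helper name missing_packages is made up for this post, not part of Unsloth):

```python
from importlib import metadata

# Distributions named in the install commands above.
REQUIRED = ["unsloth", "trl", "peft", "accelerate", "bitsandbytes"]


def missing_packages(names):
    """Return the distribution names that are not installed in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)  # raises if the distribution is absent
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    gaps = missing_packages(REQUIRED)
    print("missing:", ", ".join(gaps) if gaps else "nothing")
```

This only confirms the packages are present. It says nothing about whether their versions play well together.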

The Money Pattern

Here's the whole script. An 8B QLoRA fine-tune on an A100 used to take six hours; this finishes in two. On a 4090 at home it's the difference between "leave it overnight" and "have a coffee".

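{`from unsloth import FastLanguageModel  # import unsloth before trl/transformers
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

MAX_SEQ_LEN = 4096

# Pre-quantized 4-bit checkpoint, so nothing gets quantized at load time.
model, tok = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=MAX_SEQ_LEN,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention projections only.
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=32,
    target_modules=["q_proj","k_proj","v_proj","o_proj"],
)

# Placeholder dataset: "train.jsonl" is an assumed local file where every
# record has a "text" field. Swap in your own data.
ds = load_dataset("json", data_files="train.jsonl", split="train")

# This argument layout matches older TRL releases; newer ones move
# dataset_text_field and the sequence length into SFTConfig.
trainer = SFTTrainer(
    model=model, tokenizer=tok,
    train_dataset=ds, dataset_text_field="text",
    max_seq_length=MAX_SEQ_LEN,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10, max_steps=200,
        learning_rate=2e-4, bf16=True,
        output_dir="out",
    ),
)
trainer.train()`}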
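It helps to know what those hyperparameters actually buy you. The sketch below does the arithmetic for the script above, assuming a single GPU and the public Llama-3-8B shapes (hidden size 4096, 32 layers, 8 key/value heads of dimension 128). Those shapes are hard-coded assumptions here, not something read from the checkpoint.

```python
# Assumed Llama-3-8B shapes; adjust if you load a different model.
HIDDEN = 4096
KV_DIM = 8 * 128  # grouped-query attention: 8 KV heads x 128 head dim
LAYERS = 32

# (in_features, out_features) of each projection targeted by LoRA.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_param_count(r, shapes, layers):
    """Each adapted Linear(in, out) gains A (r x in) and B (out x r)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return layers * per_layer


trainable = lora_param_count(r=16, shapes=SHAPES, layers=LAYERS)
lora_scale = 32 / 16                       # lora_alpha / r
effective_batch = 2 * 4                    # per-device batch x accumulation, 1 GPU
sequences_seen = effective_batch * 200     # over max_steps=200
max_tokens_per_step = effective_batch * 4096

print(f"trainable LoRA params: {trainable:,}")
print(f"LoRA scale: {lora_scale}")
print(f"effective batch: {effective_batch} sequences")
print(f"sequences seen: {sequences_seen}")
print(f"max tokens per optimizer step: {max_tokens_per_step:,}")
```

So the run updates roughly 13.6 million adapter weights and sees at most 1,600 sequences. With a bigger dataset, 200 steps is a smoke test rather than a full epoch.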

The Catch

NVIDIA-only. The Triton kernels assume CUDA, so AMD ROCm and Apple silicon are out. If you're on an M-series Mac for training, you're still stuck with MLX (which has its own LoRA story — different post). For cloud GPUs, Unsloth is a no-brainer.
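If your script might land on mixed hardware, a preflight check saves a confusing stack trace. This is a minimal sketch, assuming PyTorch is installed. The function names classify_backend and detect_backend are invented for this post. The one subtlety: torch.cuda.is_available() also returns True on ROCm builds, so the check looks at torch.version.hip as well.

```python
def classify_backend(cuda_available, hip_version, mps_available):
    """Map PyTorch's availability flags to a backend label."""
    if cuda_available and hip_version is None:
        return "nvidia-cuda"
    if cuda_available:
        # ROCm builds of PyTorch report through the torch.cuda API too.
        return "amd-rocm"
    if mps_available:
        return "apple-mps"
    return "cpu"


def detect_backend():
    """Query the local PyTorch install (imported lazily on purpose)."""
    import torch

    return classify_backend(
        torch.cuda.is_available(),
        torch.version.hip,  # None unless this is a ROCm build
        torch.backends.mps.is_available(),
    )


if __name__ == "__main__":
    try:
        backend = detect_backend()
    except ImportError:
        backend = "unknown (PyTorch not installed)"
    print("backend:", backend)
```

Anything other than nvidia-cuda means the fast path described here doesn't apply, so bail out early or fall back to a plain PEFT setup.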

The Verdict

If you're fine-tuning anything on NVIDIA and not using Unsloth, you're lighting cloud credits on fire. Swap your loader, keep your dataset, watch the wall clock collapse. Then ship the adapter and pretend it was hard.

