One stage. No reference model. Less yak-shaving.
Plot twist: the entire SFT-then-DPO ritual we've been performing for two years was a long way of saying "do them at once". ORPO does them at once. And it's in TRL.
The Setup
ORPO — Odds Ratio Preference Optimization — folds the preference signal into the SFT loss via an odds-ratio penalty. No reference model means no doubled memory at train time. No separate stages means no "did the SFT model regress after DPO" panic. One trainer, one dataset of (prompt, chosen, rejected), one go.
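To make "odds-ratio penalty" concrete, here is a minimal sketch of the per-example loss as I read the ORPO formulation: the usual SFT negative log-likelihood on the chosen answer, plus beta times a penalty that shrinks as the chosen answer's odds pull ahead of the rejected one's. The function names `log_odds` and `orpo_loss` are made up for illustration, and the inputs are assumed to be length-averaged token log-probabilities (strictly negative). TRL's real implementation works on batched tensors and will differ in detail.

```python
import math


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) with p = exp(avg_logp); avg_logp must be < 0."""
    # log1p(-exp(x)) computes log(1 - p) without building p / (1 - p) directly.
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(chosen_avg_logp: float, rejected_avg_logp: float, beta: float = 0.1) -> float:
    """SFT loss on the chosen answer plus beta times the odds-ratio penalty."""
    # SFT term: negative log-likelihood of the chosen response.
    sft = -chosen_avg_logp
    # Log odds ratio: positive when the model favors chosen over rejected.
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    # Penalty is -log(sigmoid(z)), written as log(1 + exp(-z)).
    # Fine for a sketch; a real implementation would use a stable softplus.
    penalty = math.log1p(math.exp(-log_odds_ratio))
    return sft + beta * penalty


# Same chosen likelihood, three different rejected likelihoods.
for rejected in (-2.0, -0.5, -0.1):
    print(rejected, orpo_loss(-0.5, rejected))
```

The thing to notice is that the penalty only adds to the SFT term. There is no second model in the formula, which is where the "no reference model" claim comes from.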
pip install --upgrade trl peft transformers datasets
# dataset format: each row has prompt / chosen / rejected
# same shape as DPO datasets, so they drop straight in
The Money Pattern
The ORPO trainer in TRL is a near-copy of the DPO one, minus the reference model. I ran it on a 7B base with a 12k preference set for an internal Pipedrive-aware support agent and it landed in one pass where the SFT-then-DPO version took two days of fiddling.
from trl import ORPOTrainer, ORPOConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tok.pad_token = tok.eos_token  # the Llama base tokenizer ships without a pad token
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype="bfloat16",
)
ds = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")
# this dataset names its prompt column "input"; the trainer expects "prompt"
ds = ds.rename_column("input", "prompt")
ds = ds.select_columns(["prompt", "chosen", "rejected"])
args = ORPOConfig(
    output_dir="out-orpo",
    beta=0.1,  # odds-ratio weight
    learning_rate=8e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    bf16=True,
)
trainer = ORPOTrainer(
    model=model, args=args,
    train_dataset=ds,
    processing_class=tok,  # older TRL releases take tokenizer=tok instead
)
trainer.train()
The Catch
ORPO is still newer than DPO. The eval landscape is thinner, the hyperparam folklore is shallower, and on some preference sets it underperforms a well-tuned DPO. Beta is the knob that matters — 0.05 to 0.2 is the useful range, anything higher and the SFT signal gets crushed.
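Since beta is the knob, the cheapest experiment is a small sweep across that range. The sketch below only builds the keyword arguments for each run. The helper name `orpo_sweep` and the output directory naming are my own invention, and the shared values are copied from the config above rather than tuned.

```python
def orpo_sweep(betas=(0.05, 0.1, 0.2), base_dir="out-orpo"):
    """Return one dict of ORPOConfig keyword arguments per beta value."""
    # Everything except beta and the output directory is shared across runs.
    shared = dict(
        learning_rate=8e-6,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        bf16=True,
    )
    return [
        {**shared, "beta": beta, "output_dir": f"{base_dir}-beta{beta}"}
        for beta in betas
    ]


for kwargs in orpo_sweep():
    print(kwargs["output_dir"], kwargs["beta"])
```

Each dict should unpack into `ORPOConfig(**kwargs)`. Load a fresh copy of the base model for every run so the sweeps don't contaminate each other, and compare on a held-out preference set rather than training loss.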
The Verdict
If you're starting a new alignment run today, default to ORPO. One stage, no second copy of the model sitting in VRAM, fewer footguns. If you've already got a battle-tested DPO pipeline, don't rip it out, but the next fine-tune should be ORPO.