ORPO Just Replaced DPO And PPO

Plot twist: the entire SFT-then-DPO ritual we've been performing for two years was a long way of saying "do them at once". ORPO does them at once. And it's in TRL.

The Setup

ORPO (Odds Ratio Preference Optimization) folds the preference signal into the SFT loss via an odds-ratio penalty. No reference model means no second frozen copy of the weights sitting in memory at train time. No separate stages means no "did the SFT model regress after DPO" panic. One trainer, one dataset of (prompt, chosen, rejected), one go.
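If "odds-ratio penalty" sounds hand-wavy, here is a minimal pure-Python sketch of the per-example loss, following the formulation in the ORPO paper as I understand it. The function names and toy log-probabilities are mine, made up for illustration; a real trainer does this on batched tensors.

```python
import math

def avg_logprob(token_logprobs):
    """Length-normalized log-likelihood of a response (mean over its tokens)."""
    return sum(token_logprobs) / len(token_logprobs)

def log_odds(avg_lp):
    """log(p / (1 - p)) with p = exp(avg_lp); needs avg_lp < 0 so that p < 1."""
    return avg_lp - math.log1p(-math.exp(avg_lp))

def orpo_loss(chosen_token_logprobs, rejected_token_logprobs, beta=0.1):
    """Return (total, sft_loss, or_penalty) for one preference pair."""
    lp_chosen = avg_logprob(chosen_token_logprobs)
    lp_rejected = avg_logprob(rejected_token_logprobs)
    # SFT part: plain negative log-likelihood on the chosen response
    sft_loss = -lp_chosen
    # Preference part: -log(sigmoid(log odds ratio)), written as log(1 + exp(-x))
    log_odds_ratio = log_odds(lp_chosen) - log_odds(lp_rejected)
    or_penalty = math.log1p(math.exp(-log_odds_ratio))
    return sft_loss + beta * or_penalty, sft_loss, or_penalty

# Toy per-token log-probs (hypothetical numbers): the model already prefers "chosen"
total, sft, penalty = orpo_loss([-0.2, -0.4], [-1.0, -1.4], beta=0.1)
print(f"sft={sft:.3f} penalty={penalty:.3f} total={total:.3f}")
```

The penalty shrinks as the model puts higher odds on the chosen response than on the rejected one, and it sits at log 2 when the two are tied. Nothing in there needs a second model.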

pip install --upgrade trl peft transformers datasets

# dataset format: each row has prompt / chosen / rejected
# same shape as DPO datasets: if the columns already use those names, drop them straight in
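To make the row shape concrete, here is a small sanity check you could run over a dataset before training. It is a sketch that assumes the plain-string format (one string each for prompt, chosen, and rejected); `check_preference_row` and the example row are hypothetical, not part of TRL.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def check_preference_row(row):
    """Return a list of problems with one preference row (empty list means OK)."""
    problems = []
    for key in REQUIRED_KEYS:
        value = row.get(key)
        # Assumption: plain-string format, so every field must be a non-empty string
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")
    if not problems and row["chosen"] == row["rejected"]:
        problems.append("chosen and rejected are identical")
    return problems

# Made-up example row, only to show the shape
example_row = {
    "prompt": "How do I reset my password?",
    "chosen": "Go to Settings, open Security, and pick Reset password.",
    "rejected": "I don't know.",
}
print(check_preference_row(example_row))
```

Rows where chosen and rejected are identical carry no preference signal, so it is worth flagging them before they reach the trainer.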

The Money Pattern

The ORPO trainer in TRL is a near-copy of the DPO one, minus the reference model. I ran it on a 7B base with a 12k preference set for an internal Pipedrive-aware support agent and it landed in one pass where the SFT-then-DPO version took two days of fiddling. The example below swaps in Llama-3.1-8B and a public preference set, but the shape is the same.

from trl import ORPOTrainer, ORPOConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset

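tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
if tok.pad_token is None:
    # Llama base tokenizers ship without a pad token; batching needs one
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype="bfloat16",
)
ds = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")
# this set appears to keep the prompt under "input" (check ds.column_names);
# the trainer wants a "prompt" column
if "prompt" not in ds.column_names and "input" in ds.column_names:
    ds = ds.rename_column("input", "prompt")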

args = ORPOConfig(
    output_dir="out-orpo",
    beta=0.1,                 # odds-ratio weight
    learning_rate=8e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    bf16=True,
)

trainer = ORPOTrainer(
    model=model, args=args,
    train_dataset=ds,
    processing_class=tok,  # older TRL releases call this argument `tokenizer`
)
trainer.train()
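To put a rough number on what dropping the reference model saves: a DPO reference model is frozen, so it costs one extra copy of the weights (plus forward activations) but no gradients or optimizer state. A back-of-the-envelope sketch, assuming a round 8 billion parameters held in bf16; the helper name is mine.

```python
def frozen_copy_gib(n_params, bytes_per_param=2):
    """Memory for one frozen, inference-only copy of the weights, in GiB."""
    return n_params * bytes_per_param / 2**30

# Assumption: ~8e9 parameters in bf16 (2 bytes each)
n_params = 8e9
print(f"reference copy: {frozen_copy_gib(n_params):.1f} GiB")  # -> reference copy: 14.9 GiB
```

That is a real chunk of a GPU, but it is one copy of the weights, not the gradients and optimizer state that usually dominate a full fine-tune.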

The Catch

ORPO is newer than DPO. The eval landscape is thinner, the hyperparameter folklore is shallower, and on some preference sets it underperforms a well-tuned DPO. Beta is the knob that matters: 0.05 to 0.2 is the useful range, and much above that the SFT signal gets crushed.
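For a feel of why high beta crowds out the SFT term, here is a toy calculation of how much of the total loss comes from the weighted odds-ratio penalty. The inputs (an SFT loss of 1.0 and a model with no preference yet, so a log odds ratio of 0) are illustrative assumptions, and loss share is only a loose proxy for gradient share.

```python
import math

def penalty_share(beta, sft_loss, log_odds_ratio):
    """Fraction of the total loss contributed by the weighted odds-ratio term."""
    penalty = math.log1p(math.exp(-log_odds_ratio))  # -log(sigmoid(x))
    weighted = beta * penalty
    return weighted / (sft_loss + weighted)

# Illustrative inputs only: SFT loss of 1.0, chosen and rejected tied
for beta in (0.05, 0.1, 0.2, 1.0):
    share = penalty_share(beta, sft_loss=1.0, log_odds_ratio=0.0)
    print(f"beta={beta}: penalty share {share:.1%}")
```

In this toy setup the penalty stays a small slice of the loss across the 0.05 to 0.2 range and grows to a large fraction by beta=1.0.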

The Verdict

If you're starting a new alignment run today, default to ORPO. One stage, no reference model taking up VRAM, fewer footguns. If you've already got a battle-tested DPO pipeline, don't rip it out, but the next fine-tune should be ORPO.

