Abacus AI's DPO trick shot Smaug 2 72B to #1 — for now
Plot twist: a 72B model nobody was watching jumped to the top of the Hugging Face Open LLM Leaderboard overnight. Abacus AI dropped Smaug 2 with a custom DPO recipe and the benchmark gods smiled.
The Setup
Smaug 2 72B is built on the Qwen 72B base, then aggressively fine-tuned with DPOP (DPO-Positive), a tweaked Direct Preference Optimization that adds a penalty term to keep the log-likelihood of preferred responses from sinking. The result: massive MMLU gains in days, not weeks.
{`huggingface-cli download abacusai/Smaug-2-72B --local-dir ./smaug-2
# or yank the GGUF for llama.cpp
huggingface-cli download \
  TheBloke/Smaug-2-72B-GGUF \
  smaug-2-72b.Q4_K_M.gguf --local-dir ./models`}
The Money Pattern
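The DPOP trick is the actual contribution. Vanilla DPO only rewards the gap between the preferred and rejected responses, so it can lower the loss while pushing the preferred response's log-prob down, as long as the rejected one falls faster. DPOP adds a clamped penalty that kicks in whenever the preferred response's log-prob drops below the reference model's. Steal this for your own fine-tunes: nothing in the loss is tied to a particular base model.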
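To see the failure mode in numbers, here is a toy sketch in plain Python. The log-prob values are made up for illustration, and the helper names (`dpo_loss`, `chosen_penalty`) are hypothetical, not from any library. It only computes the vanilla DPO loss and the size of the hinge penalty for one preference pair; it is not a training loop.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Vanilla DPO loss for a single preference pair of sequence log-probs."""
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)

def chosen_penalty(pol_chosen, ref_chosen, lam=50.0):
    """Hinge penalty: nonzero only when the chosen log-prob is below the reference."""
    return lam * max(0.0, ref_chosen - pol_chosen)

# Made-up reference log-probs for the chosen and rejected responses.
REF_CHOSEN, REF_REJECTED = -10.0, -12.0

# Start: policy equals the reference.
# Drift: both log-probs fell, but the rejected one fell further.
start = dpo_loss(-10.0, -12.0, REF_CHOSEN, REF_REJECTED)
drift = dpo_loss(-14.0, -20.0, REF_CHOSEN, REF_REJECTED)

print(f"DPO loss at start:    {start:.3f}")
print(f"DPO loss after drift: {drift:.3f}")
print(f"penalty at start:     {chosen_penalty(-10.0, REF_CHOSEN):.1f}")
print(f"penalty after drift:  {chosen_penalty(-14.0, REF_CHOSEN):.1f}")
```

The vanilla loss goes down even though the preferred response became four nats less likely. The penalty is zero at the start and large after the drift, which is exactly the signal plain DPO is missing.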
{`# DPOP loss sketch; adapt it into your DPO training loop (e.g. trl's DPOTrainer)
import torch
import torch.nn.functional as F

def dpop_loss(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps, beta=0.1, lam=50.0):
    # inputs: per-example summed sequence log-probs, shape (batch,)
    pi_ratio = policy_chosen_logps - policy_rejected_logps
    ref_ratio = ref_chosen_logps - ref_rejected_logps
    # the DPOP penalty: nonzero only when chosen logp drops below the reference
    penalty = lam * torch.clamp(ref_chosen_logps - policy_chosen_logps, min=0)
    # the penalty sits inside the sigmoid, following the DPO-Positive objective
    logits = beta * (pi_ratio - ref_ratio - penalty)
    return -F.logsigmoid(logits).mean()`}
The Catch
Real talk: leaderboard gains and chat vibes don't always match. Anecdotal reports say Smaug 2 is slightly stiffer than Qwen base on creative writing, and there's a smell of benchmark optimization. Run your own eval before you bet a product on it.
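One low-tech way to run that eval: collect blind side-by-side judgments on your own prompts and tally them. The sketch below assumes you already have a list of verdicts labeled "A", "B", or "tie"; those labels and the function names are hypothetical. It reports the win rate and an exact two-sided sign test, so a handful of lucky wins doesn't get mistaken for a real difference.

```python
from math import comb

def sign_test(wins_a: int, wins_b: int) -> float:
    """Exact two-sided binomial p-value for 'both models are equally preferred'."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    # Probability of k or more wins out of n under a fair coin, doubled for two sides.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def summarize(judgments):
    """Tally blind A/B verdicts; ties are counted but excluded from the test."""
    wins_a = sum(1 for j in judgments if j == "A")
    wins_b = sum(1 for j in judgments if j == "B")
    ties = len(judgments) - wins_a - wins_b
    decided = wins_a + wins_b
    return {
        "wins_a": wins_a,
        "wins_b": wins_b,
        "ties": ties,
        "win_rate_a": wins_a / decided if decided else None,
        "p_value": sign_test(wins_a, wins_b),
    }

# Hypothetical verdicts from a human or judge model, one per prompt.
verdicts = ["A", "B", "A", "tie", "A", "A", "B", "A"]
print(summarize(verdicts))
```

Randomize which model is shown as "A" per prompt before judging, otherwise position bias leaks into the tally.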
The Verdict
Smaug 2 won the week, but the actual gift is the DPOP recipe. If you're already preference-tuning with DPO, swap your loss for this and see what happens. The model itself is a strong Qwen variant. Just don't read too much into the #1 spot.