MMLU Pro Is Already Saturated. Now What?

If you've been living under a rock: MMLU Pro was billed as the "this will last us" replacement for plain old MMLU. Spoiler: it didn't last. Look at the leaderboard and every frontier model is now stacked between 81 and 86 percent.

The Setup

MMLU Pro was meant to be the hard version — ten answer choices instead of four, harder reasoning chains, less memorisation leakage. It worked for about eighteen months.
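
To make the "ten choices instead of four" point concrete, here is a minimal sketch of how the option count moves the random-guessing floor. It assumes every item carries the full ten options, and the function names `chance_accuracy` and `chance_corrected` are made up for illustration; they are not part of any benchmark tooling.

```python
def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing on a multiple-choice item."""
    return 1.0 / num_choices


def chance_corrected(accuracy: float, num_choices: int) -> float:
    """Rescale accuracy so random guessing maps to 0.0 and a perfect score to 1.0."""
    floor = chance_accuracy(num_choices)
    return (accuracy - floor) / (1.0 - floor)


if __name__ == "__main__":
    for name, k in [("MMLU (4 choices)", 4), ("MMLU Pro (10 choices)", 10)]:
        print(f"{name}: guessing floor = {chance_accuracy(k):.0%}")
    # Illustrative only: the same raw 85% on each format, with the floor removed.
    print(f"85% on 4 choices  -> {chance_corrected(0.85, 4):.3f}")
    print(f"85% on 10 choices -> {chance_corrected(0.85, 10):.3f}")
```

The floor drops from 25% to 10%, so a given raw score on the ten-option format owes less to lucky guesses.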

# run MMLU Pro locally via lm-evaluation-harness
lm_eval \
  --model hf \
  --model_args pretrained=meta-llama/Llama-3.1-70B-Instruct \
  --tasks mmlu_pro \
  --batch_size 4 \
  --output_path ./results

The Money Pattern

Saturation actually matters for budget shops. When the top ten models sit within a few points of each other on the headline benchmark, vendor selection comes down to cost, latency, and tool-use reliability. Last sprint I picked the model behind a Rebuild Relief lead-routing flow purely on tokens-per-dollar, because the candidates were effectively tied on MMLU Pro.
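
The selection rule described above could be sketched roughly like this: treat everything within a few points of the best benchmark score as tied, then rank the tied group by price. The `Candidate` fields, the 3-point `tie_margin`, and every number below are hypothetical placeholders, not real model scores or prices.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    benchmark_score: float          # headline benchmark accuracy, in percent
    usd_per_million_tokens: float   # blended price (hypothetical)


def shortlist_by_cost(candidates, tie_margin=3.0):
    """Keep models within tie_margin points of the best score, cheapest first."""
    best = max(c.benchmark_score for c in candidates)
    tied = [c for c in candidates if best - c.benchmark_score <= tie_margin]
    return sorted(tied, key=lambda c: c.usd_per_million_tokens)


# Made-up numbers for illustration only.
candidates = [
    Candidate("model-a", 85.5, 12.00),
    Candidate("model-b", 84.0, 3.50),
    Candidate("model-c", 83.1, 0.90),
    Candidate("model-d", 74.0, 0.40),
]

ranked = shortlist_by_cost(candidates)
print([c.name for c in ranked])
```

Sorting ascending by dollars per million tokens is the same as sorting descending by tokens-per-dollar. Latency and tool-use reliability would need their own measurements; this sketch covers only the cost axis.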

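# run the task-specific evals where models still separate, then compare per task
# task names vary by lm-evaluation-harness version; confirm with `lm_eval --tasks list`
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-72B-Instruct",
    # assumed names: recent releases ship gpqa_diamond_zeroshot and leaderboard_math_hard
    tasks=["gpqa_diamond_zeroshot", "leaderboard_math_hard", "ifeval"],
    num_fewshot=0,
)
print(results["results"])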
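
To turn two such runs into an actual delta, something like the following may help. It is a sketch that assumes each input is shaped like the `results["results"]` dictionary (task name mapped to a dictionary of metrics). The metric name `score` and the numbers are hypothetical, so check the keys your harness version actually emits.

```python
def metric_deltas(results_a, results_b):
    """Per-task, per-metric difference (b minus a) for numeric metrics both runs share."""
    deltas = {}
    for task in sorted(results_a.keys() & results_b.keys()):
        a, b = results_a[task], results_b[task]
        for metric in sorted(a.keys() & b.keys()):
            if "stderr" in metric:
                continue  # skip uncertainty entries; compare only the scores
            va, vb = a[metric], b[metric]
            # Non-numeric entries (for example a task alias string) are ignored.
            if isinstance(va, (int, float)) and isinstance(vb, (int, float)):
                deltas[(task, metric)] = round(vb - va, 4)
    return deltas


# Hypothetical results for two models; not real scores.
run_a = {
    "ifeval": {"alias": "ifeval", "score": 0.80, "score_stderr": 0.02},
    "my_task": {"score": 0.55},
}
run_b = {
    "ifeval": {"alias": "ifeval", "score": 0.86, "score_stderr": 0.02},
    "my_task": {"score": 0.50},
}

for (task, metric), delta in metric_deltas(run_a, run_b).items():
    print(f"{task:10s} {metric:8s} {delta:+.4f}")
```

A positive value means the second model scored higher on that task and metric.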

The Catch

The next round of "harder" benchmarks is going to be 80% synthetic, generated by frontier models to stump frontier models. That arms race optimises for adversarial edge cases that nobody's product actually hits. Build your own private eval against your real traffic instead.
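
As a starting point for a private eval built from real traffic, here is a minimal sketch that deduplicates logged prompts and draws a reproducible sample. The function name `sample_eval_prompts`, the default of 200, and the synthetic `traffic` list are assumptions for illustration; loading your real logs is left out.

```python
import random


def sample_eval_prompts(logged_prompts, k=200, seed=0):
    """Deduplicate logged prompts, then draw a reproducible random sample of up to k."""
    unique = sorted({p.strip() for p in logged_prompts if p.strip()})
    rng = random.Random(seed)  # fixed seed keeps the eval set stable across runs
    return rng.sample(unique, min(k, len(unique)))


# Hypothetical stand-in for real production traffic.
traffic = [f"customer question {i % 500}" for i in range(2000)]
eval_set = sample_eval_prompts(traffic, k=200)
print(len(eval_set), "prompts sampled")
```

Freezing the sample with a seed means every candidate model sees the same prompts, so score differences come from the models and not from the draw.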

The Verdict

Public benchmarks are a coarse filter now, not a verdict. Use MMLU Pro to drop the bottom half, then run your own 200-prompt eval against the survivors. Saturation isn't a crisis — it's just a reminder that you need domain-specific tests.
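
The two-stage filter above might look like this in practice. It is a sketch with made-up scores: `drop_bottom_half` and `margin_of_error` are hypothetical helper names, and the margin uses the usual normal approximation for a proportion, which is only a rough guide.

```python
import math


def drop_bottom_half(scores):
    """Return the model names in the top half by benchmark score (ties broken by name)."""
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[: math.ceil(len(ranked) / 2)]


def margin_of_error(pass_rate, n, z=1.96):
    """Approximate 95% margin of error for a pass rate measured on n prompts."""
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n)


# Made-up benchmark scores for illustration.
scores = {"model-a": 85.5, "model-b": 84.0, "model-c": 76.2, "model-d": 71.9}
survivors = drop_bottom_half(scores)
print(survivors)
print(f"+/- {margin_of_error(0.80, 200):.3f} at an 80% pass rate on 200 prompts")
```

At an 80% pass rate, 200 prompts give a margin of roughly five and a half points, so a gap of a point or two between survivors would sit inside the noise; larger gaps are the ones worth acting on.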

