Open weights are quietly storming the top 10
If you've been living under a rock, here's the short version: the Chatbot Arena leaderboard is about the only LLM ranking that doesn't feel completely cooked. Plot twist: this week's reshuffle pushed three open-weight models into the top 10.
The Setup
Arena does blind pairwise voting: humans pick A or B without knowing which model wrote which answer, and Elo-style ratings shake out from there. No multiple choice, no synthetic eval, just vibes at scale.
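To make "ratings shake out from there" concrete, here is a minimal sketch of a classic online Elo update over pairwise votes. This is an illustration, not Arena's actual pipeline: as far as I know the live leaderboard fits a Bradley-Terry model over all votes at once rather than updating one vote at a time. The model names, the votes, the K-factor of 4, and the 1000-point starting rating are all made up for the example.

```python
from collections import defaultdict


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_from_votes(votes, k: float = 4.0, base: float = 1000.0) -> dict:
    """Replay votes in order and return final ratings.

    votes: iterable of (model_a, model_b, winner), winner in {"a", "b", "tie"}.
    k and base are arbitrary illustrative defaults.
    """
    ratings = defaultdict(lambda: base)
    outcome = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, winner in votes:
        e_a = expected_score(ratings[model_a], ratings[model_b])
        delta = k * (outcome[winner] - e_a)
        # Zero-sum update: whatever A gains, B loses.
        ratings[model_a] += delta
        ratings[model_b] -= delta
    return dict(ratings)


# Hypothetical votes, purely for illustration.
votes = [
    ("open-70b", "closed-x", "a"),
    ("closed-x", "open-70b", "a"),
    ("open-70b", "closed-x", "tie"),
]
ratings = elo_from_votes(votes)
for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

One caveat with the online version: the result depends on vote order, which is one reason to prefer a batch fit when you have the full vote log.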
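# run the Arena-Hard-Auto benchmark locally (an LLM-judged proxy, not the human-voted Arena itself)
git clone https://github.com/lmarena/arena-hard-auto
cd arena-hard-auto
# the model under test and the judge are set in the repo's YAML config files, not via CLI flags; check the README for the current layout
python gen_answer.py
python gen_judgment.py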
python show_result.py
The Money Pattern
The interesting movement isn't at the top; it's in the middle. A 70B open model now sits within the Elo error bars of a frontier closed model, which means the leaderboard can't statistically tell them apart. For anyone shipping on a budget, that's the only chart that matters. I'm rebuilding a Pipedrive enrichment script on a local 70B specifically because that gap has closed.
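"Within error bars" is easy to check mechanically: two models are indistinguishable at the leaderboard's confidence level if their score intervals overlap. Here is a rough sketch, assuming your leaderboard export has `ci_lower` and `ci_upper` columns (those column names are my assumption, and every number below is made up for illustration, not real Arena data).

```python
import pandas as pd


def within_error_bars(df: pd.DataFrame, model_x: str, model_y: str) -> bool:
    """True if the two models' confidence intervals overlap.

    Assumes columns "model", "ci_lower", "ci_upper" (hypothetical names).
    """
    rows = df.set_index("model")
    x, y = rows.loc[model_x], rows.loc[model_y]
    return bool(x["ci_lower"] <= y["ci_upper"] and y["ci_lower"] <= x["ci_upper"])


# Made-up scores and intervals, purely for illustration.
demo = pd.DataFrame(
    {
        "model": ["closed-frontier", "open-70b", "open-8b"],
        "arena_score": [1290.0, 1282.0, 1190.0],
        "ci_lower": [1284.0, 1275.0, 1183.0],
        "ci_upper": [1296.0, 1289.0, 1197.0],
    }
)

print(within_error_bars(demo, "closed-frontier", "open-70b"))
print(within_error_bars(demo, "closed-frontier", "open-8b"))
```

Overlapping intervals are a quick screen, not a significance test, so treat a "True" as "worth evaluating both" rather than "proven equal".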
# quick Elo delta check from the public CSV
import pandas as pd
df = pd.read_csv("arena_leaderboard.csv")
# sort first, then slice: head(15) before sorting only reorders the first 15 rows of the file
top = df.sort_values("arena_score", ascending=False).head(15)
print(top[["model", "arena_score", "license"]])
The Catch
Blind voting has a known bias toward verbose, bullet-pointed answers. Models tuned to dump walls of markdown punch above their weight in Arena and then underperform in production. Read the style-controlled scores, not the raw ones.
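If you want to see how much "style" two answers are carrying before you compare them, a crude feature counter goes a long way. The sketch below counts roughly the kind of signals style control adjusts for (length and markdown elements). It is my own approximation with regex heuristics, not Arena's implementation, and the feature names are hypothetical.

```python
import re


def style_features(text: str) -> dict:
    """Count rough style signals in a response: length and markdown elements."""
    lines = text.splitlines()
    return {
        # Whitespace-separated tokens, as a cheap stand-in for token count.
        "words": len(text.split()),
        # ATX-style headers: up to three spaces, then 1-6 '#', then whitespace.
        "headers": sum(bool(re.match(r"\s{0,3}#{1,6}\s", ln)) for ln in lines),
        # Bulleted or numbered list items.
        "list_items": sum(bool(re.match(r"\s*([-*+]|\d+\.)\s", ln)) for ln in lines),
        # **bold** spans on a single line.
        "bold_spans": len(re.findall(r"\*\*[^*\n]+\*\*", text)),
    }


plain = "Use a dict."
dressed_up = "## Answer\n- **Use** a dict\n- It is fast\n"

print(style_features(plain))
print(style_features(dressed_up))
```

When two answers say the same thing and one scores far higher on these counts, that is the gap between raw and style-controlled rankings in miniature.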
The Verdict
Arena is still the most honest leaderboard we have, but it's not gospel. Use it to spot trends, then run your own eval set on the top three contenders for your actual use case. The shake-up is real — closed-source moats are getting narrower every month.
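For the "run your own eval" step, you don't need a framework to get started. Here is a minimal sketch that turns a list of head-to-head judgments on your own prompts into per-model win rates, with ties counted as half a win. The model names and judgments are placeholders, and how you collect the judgments (human review, an LLM judge, unit checks) is up to you.

```python
from collections import Counter


def win_rates(judgments) -> dict:
    """Per-model win rate from pairwise judgments.

    judgments: iterable of (model_a, model_b, winner), where winner is one of
    the two model names, or None for a tie (counted as half a win each).
    """
    wins, games = Counter(), Counter()
    for model_a, model_b, winner in judgments:
        games[model_a] += 1
        games[model_b] += 1
        if winner is None:
            wins[model_a] += 0.5
            wins[model_b] += 0.5
        else:
            wins[winner] += 1
    return {model: wins[model] / games[model] for model in games}


# Placeholder judgments on a hypothetical in-house prompt set.
judgments = [
    ("contender-1", "contender-2", "contender-1"),
    ("contender-1", "contender-3", None),
    ("contender-2", "contender-3", "contender-3"),
]

for model, rate in sorted(win_rates(judgments).items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rate:.2f}")
```

A few dozen prompts pulled from your real workload will usually tell you more than another 20 Elo points on the public board.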