A 7B that punches at 70B — and your laptop can run it
In case you've been living under a rock for the past 48 hours: the internet collectively lost its mind because a 7B model just out-benched Llama 3 70B on MMLU-Pro. Do not @ me; the eval is public.
The Setup
Ghost Pepper 7B is an Apache-2.0 open-weight model from a small team nobody had heard of last week. It's on Hugging Face, it's on Ollama, and it fits comfortably on an M4 Mac with room to breathe.
ollama pull ghost-pepper:7b
ollama run ghost-pepper:7b "write me a sql migration for a users table"
# or grab the raw weights
huggingface-cli download ghost-pepper/gp-7b-instruct
The Money Pattern
The real flex is the instruction-following. It actually listens to system prompts. Plug it into transformers and you've got a local agent that doesn't go off the rails after three turns.
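from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("ghost-pepper/gp-7b-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "ghost-pepper/gp-7b-instruct",
    device_map="auto",
    torch_dtype="auto",
)
msgs = [{"role": "user", "content": "summarise this PR in 3 bullets"}]
# add_generation_prompt=True appends the assistant header so the model replies
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
# decode only the new tokens, not the echoed prompt
reply = out[0][inputs["input_ids"].shape[-1]:]
print(tok.decode(reply, skip_special_tokens=True))
The Catch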
32k context is the ceiling: fine for code, not fine for whole codebases. It hallucinates function signatures in obscure libraries with a confidence that's almost charming. And the Q2 quants are basically a random word generator, so don't bother.
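To make that 32k ceiling concrete, here's a rough pre-flight check you might run before stuffing files into a prompt. It's a sketch, not anything shipped with the model: it assumes "32k" means 32,768 tokens, it reserves room for the reply, and `rough_token_count`, `fits_in_context`, and `pack_files` are made-up helper names. The 4-characters-per-token estimate is deliberately crude, so pass in the real tokenizer's count for anything serious.

```python
from typing import Callable, Iterable

CONTEXT_LIMIT = 32_768  # assumption: "32k" taken to mean 32,768 tokens


def rough_token_count(text: str) -> int:
    """Crude estimate (~4 characters per token); not the model's tokenizer."""
    return max(1, len(text) // 4)


def fits_in_context(prompt_tokens: int, max_new_tokens: int = 256,
                    limit: int = CONTEXT_LIMIT) -> bool:
    """True if the prompt plus the reserved reply budget fits in the window."""
    return prompt_tokens + max_new_tokens <= limit


def pack_files(files: Iterable[tuple[str, str]],
               count: Callable[[str], int] = rough_token_count,
               max_new_tokens: int = 256,
               limit: int = CONTEXT_LIMIT) -> list[str]:
    """Greedily keep (name, text) pairs, in order, until the budget runs out."""
    kept, used = [], 0
    for name, text in files:
        cost = count(text)
        if not fits_in_context(used + cost, max_new_tokens, limit):
            break  # stop at the first file that would overflow the window
        kept.append(name)
        used += cost
    return kept


if __name__ == "__main__":
    # hypothetical repo: roughly 10k, 20k, and 10k tokens by the crude estimate
    repo = [("a.py", "x" * 40_000), ("b.py", "x" * 80_000), ("c.py", "x" * 40_000)]
    print(pack_files(repo))  # ['a.py', 'b.py']
```

Two mid-sized files fit, the third doesn't, and that's before any chat history. Whole-codebase prompts are going to need retrieval or chunking instead.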
The Verdict
If you're shipping a side project on a laptop, Ghost Pepper 7B is the new default: run the Q4_K_M quant in about 6GB of VRAM and skip the API bill. The 70B-killer narrative is overblown, but for local-first workflows this thing is genuinely spicy. Pull it tonight.
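For the curious, the 6GB figure lines up with back-of-envelope arithmetic. This is a sketch under loud assumptions: exactly 7 billion parameters, and roughly 4.85 bits per weight for Q4_K_M, which is a ballpark for llama.cpp-style quants and not something measured on this model. The function name is illustrative.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9      # assumption: "7B" taken literally
Q4_K_M_BPW = 4.85   # assumption: approximate average bits per weight
FP16_BPW = 16

q4 = weight_gb(N_PARAMS, Q4_K_M_BPW)
fp16 = weight_gb(N_PARAMS, FP16_BPW)

print(f"Q4_K_M weights: {q4:.1f} GB")        # Q4_K_M weights: 4.2 GB
print(f"fp16 weights:   {fp16:.1f} GB")      # fp16 weights:   14.0 GB
print(f"headroom in 6 GB: {6 - q4:.1f} GB")  # headroom in 6 GB: 1.8 GB
```

Whatever is left after the weights has to hold the KV cache and runtime overhead, and the KV cache grows with context length, so expect memory use to climb as the window fills up.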