Ghost Pepper 7B Just Dropped and It's Spicy

In case you've been living under a rock for the past 48 hours: the internet collectively lost its mind because a 7B model just out-scored Llama 3 70B on MMLU-Pro. Do not @ me; the eval is public.

The Setup

Ghost Pepper 7B is an Apache-2.0 open-weight model from a small team nobody had heard of last week. It's on Hugging Face, it's on Ollama, and it fits comfortably on an M4 Mac with room to breathe.

ollama pull ghost-pepper:7b
ollama run ghost-pepper:7b "write me a sql migration for a users table"

# or grab the raw weights
huggingface-cli download ghost-pepper/gp-7b-instruct
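
Once the model is pulled, you can also talk to it from a script instead of the terminal. Here is a minimal sketch that assumes Ollama is serving on its default local port (11434) and uses its /api/generate endpoint with streaming turned off. The helper names build_request and generate are mine, not part of any library.

```python
import json
import urllib.request

# Ollama's default local endpoint; change it if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "ghost-pepper:7b") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "ghost-pepper:7b", timeout: float = 120.0) -> str:
    """Send the prompt and return the text of the model's reply."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, generate("write me a sql migration for a users table") should hand back the reply as a plain string. Only the standard library is used, so there is nothing extra to install.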

The Money Pattern

The real flex is the instruction-following. It actually listens to system prompts. Plug it into transformers and you've got a local agent that doesn't go off the rails after three turns.

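from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "ghost-pepper/gp-7b-instruct"
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",   # needs the accelerate package installed
    torch_dtype="auto",  # use the dtype stored in the checkpoint
)

msgs = [{"role": "user", "content": "summarise this PR in 3 bullets"}]
# add_generation_prompt=True ends the prompt with the assistant header,
# so the model answers instead of continuing the user turn.
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
# Decode only the new tokens, not the echoed prompt.
new_tokens = out[0][inputs["input_ids"].shape[-1]:]
print(tok.decode(new_tokens, skip_special_tokens=True))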
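
If you want the "doesn't go off the rails after three turns" part, you need to carry the system prompt and the history yourself. Below is one rough way to do that: a small history buffer that pins the system message and drops the oldest user/assistant pair once a cap is hit. The Conversation class and its max_turns cap are hypothetical helpers for illustration, and the sketch assumes the model's chat template accepts a "system" role and that turns strictly alternate user, assistant.

```python
from typing import Dict, List

Message = Dict[str, str]


class Conversation:
    """Chat history that always keeps the system prompt as the first message."""

    def __init__(self, system_prompt: str, max_turns: int = 8) -> None:
        self.system: Message = {"role": "system", "content": system_prompt}
        self.max_turns = max_turns  # hypothetical cap on user/assistant pairs kept
        self.turns: List[Message] = []

    def add(self, role: str, content: str) -> None:
        """Append a user or assistant message, trimming old pairs if needed."""
        if role not in ("user", "assistant"):
            raise ValueError(f"unexpected role: {role}")
        self.turns.append({"role": role, "content": content})
        # Drop whole user/assistant pairs from the front so the history
        # still starts with a user message after trimming.
        while len(self.turns) > 2 * self.max_turns:
            del self.turns[:2]

    def messages(self) -> List[Message]:
        """Return the list to pass to tok.apply_chat_template(...)."""
        return [self.system] + self.turns
```

Each turn, call add("user", ...), pass messages() to the chat template, then add("assistant", ...) with whatever came back. Trimming by pairs is the simplest policy; summarising old turns is another option if you need longer memory.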

The Catch

32k context is the ceiling: fine for a few files of code, not fine for whole codebases. It hallucinates function signatures in obscure libraries with a confidence that's almost charming. And the Q2 quants are basically a random word generator; don't bother.
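
To live with the context ceiling, you can batch files so each prompt stays under budget. This is a rough sketch, not a tuned implementation: it treats "32k" as 32,000 tokens (an assumption, the real limit may be 32,768), estimates tokens at about 4 characters each (also an assumption), and lets you pass in a real tokenizer-based counter instead. The names pack_files and rough_token_count are made up for this example.

```python
from typing import Callable, List

CONTEXT_LIMIT = 32_000  # the post's "32k"; the exact token count is an assumption


def rough_token_count(text: str) -> int:
    """Crude estimate: about 4 characters per token, rounded up."""
    return (len(text) + 3) // 4


def pack_files(
    files: List[str],
    reserve_for_output: int = 256,
    count_tokens: Callable[[str], int] = rough_token_count,
    limit: int = CONTEXT_LIMIT,
) -> List[List[str]]:
    """Greedily group file contents into batches that each fit the budget."""
    budget = limit - reserve_for_output
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for text in files:
        cost = count_tokens(text)
        if cost > budget:
            raise ValueError("a single file exceeds the context budget; split it first")
        # Start a new batch when this file would overflow the current one.
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        batches.append(current)
    return batches
```

For real use, pass count_tokens=lambda s: len(tok.encode(s)) so the budget reflects the model's own tokenizer, and leave extra room for the chat template and your instructions.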

The Verdict

If you're shipping a side project on a laptop, Ghost Pepper 7B is the new default. Run the Q4_K_M quant in about 6GB of VRAM and leave your API budget alone. The 70B-killer narrative is overblown, but for local-first workflows this thing is genuinely spicy. Pull it tonight.
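
For the curious, here is the back-of-the-envelope math behind the 6GB figure. It is a sketch with assumed numbers: roughly 4.85 bits per weight is a commonly quoted average for Q4_K_M, and 7e9 is a round parameter count, not a measurement of Ghost Pepper itself.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# Approximate figures, not measurements of Ghost Pepper itself.
PARAMS = 7e9
BITS = {"Q4_K_M": 4.85, "FP16": 16.0}  # rough average bits per weight

for name, bits in BITS.items():
    print(f"{name}: ~{weight_memory_gb(PARAMS, bits):.1f} GB of weights")
```

Under those assumptions the Q4_K_M weights come to a little over 4GB, which leaves the rest of a 6GB budget for the KV cache and activations. The FP16 weights alone would need about 14GB.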
