14B params, MIT license, trained on data that doesn't exist
Microsoft has been quietly cooking the Phi series for two years and Phi-4 is where it finally clicks. 14B parameters, MIT licensed, and trained largely on synthetic data. The result is a model that's freakishly smart at reasoning and freakishly weird everywhere else.
The Setup
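Phi-4 lives in the "absurdly capable for its size" tier. MMLU around 84, MATH around 80, GSM8K basically saturated. Quantized to 4 bits, which is what the Ollama tag below pulls, it runs comfortably on a 16GB M4 Mac; the unquantized weights need considerably more memory than that. This is the model you put behind an internal agent and forget about.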
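To see why quantization decides whether this fits in 16GB, here is a back-of-the-envelope sketch. It assumes a round 14B parameter count and counts only the weights; the KV cache and runtime overhead come on top, and real 4-bit formats store a bit more than 4 bits per weight, so treat the numbers as a floor rather than a measurement.

```python
# Rough weight-memory estimate for a 14B-parameter model.
# Assumption: weights dominate; KV cache and runtime overhead are ignored.

def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 14e9  # rounded parameter count from the headline

for label, bits in [("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: ~{weight_gb(N_PARAMS, bits):.0f} GB")
```

That works out to roughly 28 GB, 14 GB, and 7 GB, which is why the 4-bit build is the one that leaves headroom on a 16GB machine.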
{`# the boring way
huggingface-cli download microsoft/phi-4 --local-dir ./phi-4
# the lazy way
ollama pull phi4:14b
ollama run phi4:14b`}
The Money Pattern
Wire it into an agent loop with structured outputs. It's small enough to run multiple parallel instances, smart enough to actually solve the task.
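{`import json
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("microsoft/phi-4")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-4", device_map="auto", torch_dtype="auto",
)

def agent_step(state):
    msgs = [
        {"role": "system", "content": "Return JSON: {action, args, reasoning}"},
        {"role": "user", "content": json.dumps(state)},
    ]
    # add_generation_prompt opens the assistant turn; return_dict also returns the attention mask
    inputs = tok.apply_chat_template(
        msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt",
    ).to(model.device)
    # temperature only takes effect when sampling is enabled
    out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.2)
    # decode only the newly generated tokens, not the echoed prompt
    reply = tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
    return json.loads(reply)  # raises json.JSONDecodeError if the model adds extra text`}
The Catch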
The synthetic-data training shows up in odd places. Ask it about a recent news event and you get confident, beautifully-structured fiction. Ask about niche libraries and it invents API surfaces. The reasoning is sharp; the world knowledge is patchy and overconfident.
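In an agent loop, the same habit can show up as invented actions or arguments. One cheap guard is to check every proposed step against a registry of tools that actually exist before executing anything. This is a minimal sketch, not a full schema validator: `KNOWN_TOOLS`, the tool names in it, and `validate_action` are all hypothetical, and the `{action, args, reasoning}` shape is borrowed from the system prompt above.

```python
import json

# Hypothetical tool registry: action name -> set of allowed argument names.
KNOWN_TOOLS = {
    "search_docs": {"query"},
    "open_ticket": {"title", "body"},
}

def validate_action(raw: str, tools=KNOWN_TOOLS):
    """Parse model output and reject actions or args that are not registered.

    Returns (action_dict, None) on success or (None, error_message) on failure.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "expected a JSON object"

    action = data.get("action")
    if not isinstance(action, str) or action not in tools:
        return None, f"unknown action: {action!r}"

    args = data.get("args", {})
    if not isinstance(args, dict):
        return None, "args must be an object"

    # Any argument name the tool does not declare counts as invented.
    extra = set(args) - tools[action]
    if extra:
        return None, f"unexpected args for {action}: {sorted(extra)}"
    return data, None
```

Feeding the error message back to the model as the next user turn gives it a chance to correct itself instead of crashing the loop.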
The Verdict
For pure reasoning, code review, math, and structured agent workflows, Phi-4 is the small-model king. For anything requiring real-world facts, pair it with RAG or just use a bigger model. Run it on a Mac mini at the edge — it's the most cost-efficient brain per gigabyte we've ever had.
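If you go the RAG route, the core move is small: fetch a few relevant snippets and tell the model to answer only from them. The sketch below assumes a toy keyword-overlap retriever standing in for a real vector store; `top_k` and `build_grounded_messages` are hypothetical helper names, and the returned messages are in the same chat format used in the agent loop above.

```python
def top_k(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank docs by how many lowercase words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_grounded_messages(question: str, snippets: list[str]) -> list[dict]:
    """Build chat messages that instruct the model to answer only from the snippets."""
    context = "\n".join(f"[{i}] {s}" for i, s in enumerate(snippets, start=1))
    system = (
        "Answer using only the numbered context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

The explicit "say you do not know" instruction matters here, since the failure mode being guarded against is confident invention rather than refusal.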