The M4 Max Is Quietly Eating Nvidia's Lunch
Behold: my M4 Max MacBook just served Llama 3 8B faster than a friend's 4090 desktop running vLLM. Single-batch, same prompt, same quantization. The room got quiet.
The Setup
MLX is Apple's open-source array framework, built from scratch for Apple silicon. It treats unified memory as a first-class citizen — no host-to-device copies, no PCIe bottleneck, no driver politics. For batch-1 inference, this architectural choice turns out to be a massive cheat code.
pip install mlx-lm
# Run a pre-converted 4-bit model from the mlx-community hub
mlx_lm.generate --model mlx-community/Llama-3-8B-Instruct-4bit --prompt "Why does the M4 Max slap?" --max-tokens 256
The Money Pattern
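The numbers from the Gold Coast office: M4 Max with 64GB unified memory, MLX 4-bit Llama 3 8B, single-batch streaming at roughly 95 tokens/sec. The 4090 with vLLM on the same model came in at around 70 tokens/sec for batch-1. Yes, really. One caveat on the "why": the Mac doesn't win by dodging PCIe traffic during decoding, because on the 4090 the weights and KV cache sit in VRAM once the model is loaded. What unified memory removes is the host-to-device copy step and the hard VRAM ceiling. The batch-1 gap here is one measurement on one setup, so benchmark your own.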
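If you want to sanity-check numbers like these on your own hardware, time the token stream yourself rather than taking anyone's word for it. Here is a minimal sketch, not a rigorous benchmark harness: `measure_stream` and `fake_stream` are names made up for illustration, and the fake generator stands in for whatever streaming API your backend exposes. It separates time-to-first-token from steady-state decode speed, since the two tell different stories.

```python
import time
from typing import Callable, Dict, Iterable


def measure_stream(make_stream: Callable[[], Iterable[str]]) -> Dict[str, float]:
    """Consume a token stream and report basic latency/throughput stats."""
    start = time.perf_counter()
    first_token_at = None
    tokens = 0
    for _ in make_stream():
        if first_token_at is None:
            first_token_at = time.perf_counter()
        tokens += 1
    end = time.perf_counter()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    decode_s = end - first_token_at
    # Decode rate excludes the first token, whose latency includes prompt processing.
    decode_tps = (tokens - 1) / decode_s if tokens > 1 and decode_s > 0 else 0.0
    return {
        "tokens": float(tokens),
        "ttft_s": first_token_at - start,
        "decode_tps": decode_tps,
        "total_s": end - start,
    }


def fake_stream(n_tokens: int = 20, delay_s: float = 0.005) -> Iterable[str]:
    """Hypothetical stand-in for a real backend: yields one token every delay_s."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"


if __name__ == "__main__":
    stats = measure_stream(lambda: fake_stream())
    print(stats)
```

Swap the lambda for a function that yields tokens from your real backend, and run the same prompt a few times so caches and clocks settle before comparing machines.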
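from mlx_lm import load, generate
# Downloads the weights on first run, then loads them into unified memory
model, tokenizer = load("mlx-community/Llama-3-8B-Instruct-4bit")
# verbose=True prints the generated text plus timing stats
response = generate(
    model,
    tokenizer,
    prompt="Explain why unified memory wins for batch-1 inference",
    max_tokens=512,
    verbose=True,
)
The Catch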
Scale to batch-8 or batch-32 and the 4090 destroys the Mac. vLLM's continuous batching is built for throughput, not latency: it keeps the GPU busy by slotting new requests into the running batch as old ones finish. MLX, as of this writing, still doesn't have a serious multi-request scheduler. For a busy API server, Nvidia still wins. For a laptop running one prompt at a time, that advantage never comes into play.
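Why does batching flip the result? A back-of-the-envelope model helps. During decoding, every step has to read the model weights once no matter how many sequences are in flight, plus a smaller per-sequence cost for attention over each KV cache. The sketch below is a toy model with made-up numbers (`weight_read_ms` and `per_seq_ms` are hypothetical parameters, not measurements from either machine), but it shows the shape of the trade: aggregate throughput climbs with batch size while each individual request gets a bit slower.

```python
def step_time_ms(batch_size: int, weight_read_ms: float, per_seq_ms: float) -> float:
    """Toy cost of one decode step: a fixed weight read plus per-sequence work."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return weight_read_ms + batch_size * per_seq_ms


def aggregate_tps(batch_size: int, weight_read_ms: float, per_seq_ms: float) -> float:
    """Tokens per second summed over every sequence in the batch."""
    return batch_size * 1000.0 / step_time_ms(batch_size, weight_read_ms, per_seq_ms)


def per_request_tps(batch_size: int, weight_read_ms: float, per_seq_ms: float) -> float:
    """Tokens per second seen by one request inside the batch."""
    return 1000.0 / step_time_ms(batch_size, weight_read_ms, per_seq_ms)


if __name__ == "__main__":
    # Hypothetical costs, chosen only to illustrate the trend.
    for batch in (1, 8, 32):
        total = aggregate_tps(batch, weight_read_ms=12.0, per_seq_ms=0.5)
        each = per_request_tps(batch, weight_read_ms=12.0, per_seq_ms=0.5)
        print(f"batch={batch:>2}  total={total:7.1f} tok/s  per-request={each:5.1f} tok/s")
```

A scheduler that keeps the batch full, which is what continuous batching does, lives on the right side of that curve. A laptop answering one prompt at a time stays at batch size 1, where only the per-step latency matters.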
The Verdict
If you bought an M-series Mac for "real work" and felt guilty about leaving the AI to the Nvidia crowd, stop. MLX has caught up. For solo developers, agent workflows, and any single-batch use case, an M4 Max is now a credible inference machine. The unified memory bet aged extremely well.