MLX On Mac Is Faster Than vLLM (Yes Really)

The M4 Max Is Quietly Eating Nvidia's Lunch

Behold: my M4 Max MacBook just served Llama 3 8B faster than a friend's 4090 desktop running vLLM. Single-batch, same prompt, same quantization. The room got quiet.

The Setup

MLX is Apple's open-source array framework, built from scratch for Apple silicon. It treats unified memory as a first-class citizen — no host-to-device copies, no PCIe bottleneck, no driver politics. For batch-1 inference, this architectural choice turns out to be a massive cheat code.

pip install mlx-lm

# Run a pre-converted 4-bit model from the mlx-community hub (downloaded on first use)
mlx_lm.generate --model mlx-community/Llama-3-8B-Instruct-4bit \
  --prompt "Why does the M4 Max slap?" \
  --max-tokens 256

The Money Pattern

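The numbers from the Gold Coast office: M4 Max with 64GB unified memory, MLX 4-bit Llama 3 8B, single-batch streaming at roughly 95 tokens/sec. The 4090 with vLLM on the same model managed around 70 tokens/sec at batch-1. Yes really. One caveat on the why: PCIe is not the culprit, because vLLM keeps the weights and the KV cache in VRAM during decoding, so nothing large crosses the bus per step. Batch-1 decoding is mostly limited by memory bandwidth and per-token overhead, so a gap like this more plausibly comes from the serving stack and its quantization kernels than from the bus. Treat it as one measurement on one setup.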
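A quick way to sanity-check numbers like these is a bandwidth ceiling. At batch-1, each generated token has to read roughly all of the model weights once, so tokens/sec cannot exceed memory bandwidth divided by model size. The sketch below is back-of-envelope arithmetic, not a benchmark. The bandwidth figures (546 GB/s for the 40-core-GPU M4 Max, 1008 GB/s for the RTX 4090) are spec-sheet values, the 4.5 bits per weight (4-bit values plus quantization scales) is an assumption, and KV-cache reads are ignored.

```python
def decode_ceiling_tok_s(params: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    """Upper bound on batch-1 decode speed if every weight is read once per token."""
    model_bytes = params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / model_bytes


PARAMS = 8.0e9         # Llama 3 8B, rounded
BITS_PER_WEIGHT = 4.5  # assumption: 4-bit weights plus per-group scales

# Spec-sheet memory bandwidth in GB/s (assumed, not measured)
m4_max = decode_ceiling_tok_s(PARAMS, BITS_PER_WEIGHT, 546)
rtx_4090 = decode_ceiling_tok_s(PARAMS, BITS_PER_WEIGHT, 1008)

print(f"M4 Max ceiling:   {m4_max:.0f} tok/s")
print(f"RTX 4090 ceiling: {rtx_4090:.0f} tok/s")
```

Under these assumptions the ceilings come out near 121 and 224 tokens/sec. The Mac's 95 tokens/sec is about 78% of its ceiling, while the 4090's 70 is about 31% of its own, which suggests the 4090 was held back by something other than memory bandwidth in this run.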

from mlx_lm import load, generate

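# Download (on first use) and load the 4-bit weights plus the tokenizer
model, tokenizer = load("mlx-community/Llama-3-8B-Instruct-4bit")

# verbose=True prints the text as it is generated, followed by speed stats
response = generate(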
    model,
    tokenizer,
    prompt="Explain why unified memory wins for batch-1 inference",
    max_tokens=512,
    verbose=True,
)
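
To reproduce a tokens/sec figure rather than eyeball it, it helps to time the decode loop yourself and leave out the first token, which includes prompt processing. The helper below is a hypothetical sketch (`measure_decode_rate` is not part of mlx-lm). It accepts any iterable that yields one item per generated token, so the same code can time an MLX stream or a vLLM one.

```python
import time
from typing import Callable, Iterable


def measure_decode_rate(
    stream: Iterable,
    warmup_tokens: int = 1,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return decode speed in tokens/sec, excluding the first `warmup_tokens`.

    The clock starts when the last warmup token arrives, so prompt
    processing and time-to-first-token are not counted.
    """
    if warmup_tokens < 1:
        raise ValueError("warmup_tokens must be at least 1")

    count = 0
    start = end = None
    for _ in stream:
        now = clock()
        count += 1
        if count == warmup_tokens:
            start = now  # timing begins after the warmup tokens
        end = now

    timed_tokens = count - warmup_tokens
    if start is None or timed_tokens <= 0 or end == start:
        raise ValueError("stream too short to measure a decode rate")
    return timed_tokens / (end - start)
```

With mlx-lm, something like `measure_decode_rate(stream_generate(model, tokenizer, prompt, max_tokens=512))` should work, assuming the installed version's `stream_generate` yields once per generated token. Run it a few times and compare medians, since the first run also pays for loading and compilation.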

The Catch

Scale to batch-8 or batch-32 and the 4090 destroys the Mac. vLLM's continuous batching is built for throughput, not latency. MLX still doesn't have a serious multi-request scheduler. For a busy API server, Nvidia still wins. For your laptop running one prompt at a time, it doesn't.
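
To make "continuous batching" concrete, here is a toy model that counts decode steps for a queue of requests. It is a deliberately simplified sketch, not vLLM's scheduler: every request needs a fixed number of decode steps, a step costs the same regardless of how many slots are full, and both function names are made up for illustration.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Process fixed batches; each batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Refill a free slot from the queue as soon as a request finishes.

    Assumes every length is at least 1.
    """
    waiting = deque(lengths)
    running: List[int] = []  # remaining decode steps per active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the next step
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Each active request emits one token; finished ones leave the batch
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [8, 2, 2, 2]  # tokens to generate per request (illustrative)
print(static_batching_steps(lengths, batch_size=2))
print(continuous_batching_steps(lengths, batch_size=2))
```

For this queue the static version needs 10 steps and the continuous one needs 8, because the short requests ride along in the slot next to the long one instead of waiting for it. None of this helps a single user sending one prompt at a time, which is why the comparison flips between batch-1 and a busy server.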

The Verdict

If you bought an M-series Mac for "real work" and felt guilty about leaving the AI to the Nvidia crowd, stop. MLX has caught up where it matters for a single user. For solo developers, agent workflows, and any single-batch use case, an M4 Max is now a credible inference machine. The unified memory bet aged extremely well.
