MLX On Mac Is Faster Than vLLM (Yes Really)

Behold: my M4 Max MacBook just served Llama 3 8B faster than a friend's 4090 desktop running vLLM. Single-batch, same prompt, same quantization. The room got quiet.

The Setup

MLX is Apple's open-source array framework, built from scratch for Apple silicon. It treats unified memory as a first-class citizen — no host-to-device copies, no PCIe bottleneck, no driver politics. For batch-1 inference, this architectural choice turns out to be a massive cheat code.

pip install mlx-lm

# Convert and run any Hugging Face model
mlx_lm.generate --model mlx-community/Llama-3-8B-Instruct-4bit   --prompt "Why does the M4 Max slap?"   --max-tokens 256

The Money Pattern

The numbers from the Gold Coast office: M4 Max with 64GB unified memory, MLX 4-bit Llama 3 8B, single-batch streaming at roughly 95 tokens/sec. The 4090 with vLLM on the same model around 70 tokens/sec for batch-1. Yes really. The Mac wins because it doesn't have to ship the KV cache across PCIe every step.

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3-8B-Instruct-4bit")

response = generate(
    model,
    tokenizer,
    prompt="Explain why unified memory wins for batch-1 inference",
    max_tokens=512,
    verbose=True,
)

The Catch

Scale to batch-8 or batch-32 and the 4090 destroys the Mac. vLLM's continuous batching is built for throughput, not latency. MLX still doesn't have a serious multi-request scheduler. For a busy API server, Nvidia still wins. For your laptop running one prompt at a time, it doesn't.

The Verdict

If you bought an M-series Mac for "real work" and felt guilty about leaving the AI to the Nvidia crowd, stop. MLX has caught up. For solo developers, agent workflows, and any single-batch use case, an M4 Max is now a credible inference machine. The unified memory bet aged extremely well.

Local AI

The M4 Max Is Quietly Eating Nvidia's Lunch

The Setup

The Money Pattern

The Catch

The Verdict

Let us make some quick suggestions?