Three quant formats, three runtimes, one fragmented mess — here's the cheat sheet
In case you've been living under a rock: picking a quantization format is the single most confusing part of running local LLMs. Spoiler: there's no single winner, just three winners for three different jobs.
The Setup
GGUF is the format for llama.cpp and anything running on CPU or Apple Silicon. AWQ is the usual pick for vLLM, Aphrodite, and TGI on Nvidia hardware. GPTQ is the format for ExLlama and a long tail of older GPU setups. Behold: one model, three downloads, three different runtimes.
{`# GGUF: CPU, Mac, llama.cpp, Ollama
huggingface-cli download \\
  bartowski/Llama-3-70B-Instruct-GGUF \\
  Llama-3-70B-Instruct-Q4_K_M.gguf

# AWQ: vLLM / Aphrodite / TGI on Nvidia
huggingface-cli download \\
  casperhansen/llama-3-70b-instruct-awq

# GPTQ: ExLlama, older serving stacks
huggingface-cli download \\
  TheBloke/Llama-3-70B-Instruct-GPTQ`}
The Money Pattern
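If you can run GGUF, you can run a 70B on an M4 Mac, provided it has the unified memory for it: a 70B at Q4_K_M is over 40 GB of weights before you add any context. If you have a 4090 and want production-grade serving, AWQ is the pick for models that fit in its 24 GB. If you're squeezing every token out of a single consumer GPU, GPTQ (or its EXL2 successor) via ExLlamaV2 still has the crown. You can't convert directly from one quantized format to another, but you can produce any of them from the original Hugging Face weights. For GGUF, that's two llama.cpp commands: one to convert, one to quantize.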
{`# Convert HF weights to GGUF for llama.cpp
python llama.cpp/convert_hf_to_gguf.py \\
  ./Llama-3-70B-Instruct \\
  --outfile llama3-70b-f16.gguf \\
  --outtype f16

# Quantize down to Q4_K_M
./llama.cpp/llama-quantize \\
  llama3-70b-f16.gguf \\
  llama3-70b-Q4_K_M.gguf Q4_K_M`}
The Catch
Ecosystem fragmentation is the actual villain. Every runtime supports a different subset of formats, calibration datasets vary, and the same model at "Q4" can differ wildly in quality depending on who quantized it and how. Test before you ship. Don't trust the file name. Trust the perplexity.
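Perplexity is just the exponential of the average negative log-likelihood per token, so it's easy to sanity-check by hand. Here's a minimal sketch, assuming you've already dumped per-token natural-log probabilities to a stream, one per line. The `perplexity` function name and that input layout are made up for illustration; no runtime emits them by default.

```shell
# Hypothetical helper: reads one natural-log token probability per line
# on stdin and prints perplexity = exp(-mean(logprob)).
perplexity() {
  awk '
    NF { sum += $1; n++ }            # skip blank lines, accumulate log-probs
    END {
      if (n == 0) {                  # nothing to average over
        print "no token log-probs on stdin" > "/dev/stderr"
        exit 1
      }
      printf "%.4f\n", exp(-sum / n)
    }'
}

# Example: two tokens, each predicted with probability 0.5
printf '%s\n' -0.6931471805599453 -0.6931471805599453 | perplexity
```

Run two quants of the same model over the same held-out text and compare the numbers; lower is better. For GGUF files, llama.cpp ships a `llama-perplexity` tool that reports this figure directly.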
The Verdict
Rule of thumb: GGUF on a Mac, AWQ on a GPU server, EXL2 (the ExLlamaV2 evolution of GPTQ) on a single 4090. If you're running Ollama or LM Studio, GGUF is already chosen for you. If you're running vLLM, AWQ is the default pick, even though vLLM can load other formats too. Pick the runtime first; the format follows.
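If you want that rule of thumb in script form, here's a rough sketch. The function name and the runtime labels are hypothetical, and the mapping only encodes this post's recommendations, not everything each runtime is technically able to load.

```shell
# Hypothetical helper: map a runtime name to the format this post suggests.
format_for_runtime() {
  # Normalize case so "vLLM" and "vllm" are treated the same
  runtime=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$runtime" in
    llama.cpp|ollama|lmstudio) echo "GGUF" ;;
    vllm|aphrodite|tgi)        echo "AWQ"  ;;
    exllama)                   echo "GPTQ" ;;
    exllamav2)                 echo "EXL2" ;;
    *)
      echo "unknown runtime: $1" >&2
      return 1
      ;;
  esac
}

format_for_runtime ollama   # prints GGUF
format_for_runtime vLLM     # prints AWQ
```

Keying the lookup on the runtime rather than the hardware mirrors the advice above: once you know what will serve the model, the download choice is mostly made.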