AWQ vs GPTQ vs GGUF Explained In 4 Minutes

Three quant formats, three runtimes, one fragmented mess — here's the cheat sheet

In case you've been living under a rock: picking a quantization format is one of the most confusing parts of running local LLMs. Spoiler: there's no single winner, just three winners for three different jobs.

The Setup

GGUF is the format for llama.cpp and everything built on it, which makes it the default for CPU and Apple Silicon. AWQ is the usual pick for vLLM, Aphrodite, and TGI on NVIDIA hardware. GPTQ is the format for ExLlama and a long tail of older GPU setups. Behold: one model, three downloads, three different runtimes.

{`# GGUF — CPU, Mac, llama.cpp, Ollama
huggingface-cli download \\
  bartowski/Llama-3-70B-Instruct-GGUF \\
  Llama-3-70B-Instruct-Q4_K_M.gguf

# AWQ — vLLM / Aphrodite / TGI on Nvidia
huggingface-cli download \\
  casperhansen/llama-3-70b-instruct-awq

# GPTQ — ExLlama, older serving stacks
huggingface-cli download \\
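  TheBloke/Llama-3-70B-Instruct-GPTQ
# Repo and file names here are illustrative; confirm them on the Hugging Face Hub before downloading`}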

The Money Pattern

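If you can run GGUF, you can run a 4-bit 70B on an M4 Mac, provided it has enough unified memory to hold roughly 40 GB of weights. If you have a 4090 and want production-grade serving, AWQ is the pick, though a 70B will not fit in its 24 GB, so size the model to the card. If you're squeezing every token out of a single consumer GPU, GPTQ via ExLlamaV2 still has the crown. You can't convert directly between quantized formats; each one is produced from the original Hugging Face weights. For GGUF that takes two commands: convert, then quantize.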

{`# Convert HF weights to GGUF for llama.cpp
python llama.cpp/convert_hf_to_gguf.py \\
  ./Llama-3-70B-Instruct \\
  --outfile llama3-70b-f16.gguf \\
  --outtype f16

# Quantize down to Q4_K_M
./llama.cpp/llama-quantize \\
  llama3-70b-f16.gguf \\
  llama3-70b-Q4_K_M.gguf Q4_K_M`}
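Whether a given quant actually fits on a Mac or a 4090 comes down to simple arithmetic: parameter count times bits per weight. The sketch below is a rough estimate only. The function names, the ~4.8 bits-per-weight figure for a Q4_K_M-style mix, and the 4 GB overhead allowance for KV cache and runtime buffers are all assumptions for illustration, not measured values.

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (10^9 bytes).

    Ignores KV cache, activations, and runtime overhead.
    """
    return n_params * bits_per_weight / 8 / 1e9


def fits(n_params: float, bits_per_weight: float, memory_gb: float,
         overhead_gb: float = 4.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in memory_gb."""
    return quantized_size_gb(n_params, bits_per_weight) + overhead_gb <= memory_gb


if __name__ == "__main__":
    n = 70e9  # a 70B model, rounded
    for label, bpw in [("fp16", 16.0), ("~Q4_K_M (assumed)", 4.8), ("flat 4-bit", 4.0)]:
        print(f"{label:>18}: {quantized_size_gb(n, bpw):6.1f} GB")
    print("fits in 24 GB (one 4090)?", fits(n, 4.8, 24))
    print("fits in 64 GB unified memory?", fits(n, 4.8, 64))
```

The takeaway is that a 4-bit 70B lands well above a single 24 GB card but inside a high-memory Mac, which is why the format choice tends to follow the hardware.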

The Catch

Ecosystem fragmentation is the actual villain. Every runtime supports a different subset, calibration datasets vary, and the same model at "Q4" can mean wildly different quality depending on who quantized it. Test before you ship. Don't trust the file name — trust the perplexity.
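For reference, perplexity is just the exponential of the average negative log-likelihood per token, so lower is better and two quants of the same model can be compared on the same text. A minimal sketch, assuming you already have per-token natural-log probabilities from your runtime; the function names and the sample numbers are hypothetical:

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp(-mean(log p)) over the tokens of an evaluation text.

    token_logprobs are natural-log probabilities the model assigned
    to each actual next token.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_increase(baseline_ppl: float, quant_ppl: float) -> float:
    """Fractional perplexity increase of a quant over its baseline."""
    return (quant_ppl - baseline_ppl) / baseline_ppl


if __name__ == "__main__":
    # Hypothetical log-probs for the same text under two builds of one model.
    fp16_logprobs = [-1.2, -0.4, -2.1, -0.9]
    q4_logprobs = [-1.3, -0.5, -2.2, -1.0]
    base = perplexity(fp16_logprobs)
    quant = perplexity(q4_logprobs)
    print(f"fp16 ppl={base:.3f}  q4 ppl={quant:.3f}  "
          f"increase={relative_increase(base, quant):.1%}")
```

The comparison only means something when both runs use the same evaluation text and tokenizer. llama.cpp ships a `llama-perplexity` tool that reports this number for GGUF files.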

The Verdict

Rule of thumb: GGUF on a Mac, AWQ on a GPU server, EXL2 (the ExLlamaV2 evolution of GPTQ) on a single 4090. If you're running Ollama or LM Studio, GGUF is already chosen for you. If you're running vLLM, AWQ is the usual choice, though vLLM loads GPTQ too. Pick the runtime first; the format follows.
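The "runtime first" rule can be written down as a lookup table. This is only a sketch of the article's rule of thumb, not an exhaustive compatibility matrix: the table and the `pick_format` helper are illustrative names, and several of these runtimes load more than the one format listed.

```python
# Default format per runtime, following the rule of thumb in this article.
RUNTIME_TO_FORMAT = {
    "llama.cpp": "GGUF",
    "ollama": "GGUF",
    "lm studio": "GGUF",
    "vllm": "AWQ",
    "aphrodite": "AWQ",
    "tgi": "AWQ",
    "exllama": "GPTQ",
    "exllamav2": "EXL2",
}


def pick_format(runtime: str) -> str:
    """Return the default quant format for a runtime name (case-insensitive)."""
    key = runtime.strip().lower()
    if key not in RUNTIME_TO_FORMAT:
        raise ValueError(f"no rule of thumb for runtime: {runtime!r}")
    return RUNTIME_TO_FORMAT[key]


if __name__ == "__main__":
    for name in ("Ollama", "vLLM", "ExLlamaV2"):
        print(f"{name} -> {pick_format(name)}")
```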
