Local AI

ExLlamaV2 Is Quantization Wizardry

All articles
🧙🔬🎛️

EXL2 quantizes layer-by-layer at variable precision — and it runs Llama 3 70B on a single 4090

If you've been living under a rock, ExLlamaV2 is the inference engine that quietly does the impossible: running Llama 3 70B at usable speed on a single consumer GPU. Plot twist: the secret sauce is its custom EXL2 quant format.

The Setup

EXL2 doesn't pick one bit-width and stop. It quantizes every layer at a different precision based on calibration loss, then packs the whole thing into a single file. The result is a 70B at ~2.4 bits per weight that fits in 24GB of VRAM and still gives coherent output.

git clone https://github.com/turboderp/exllamav2
cd exllamav2 && pip install -e .

# Quantize a model to ~2.4 bpw
python convert.py \
  -i ~/models/Llama-3-70B-Instruct \
  -o ~/quants/llama3-70b-exl2 \
  -cf ~/quants/llama3-70b-exl2 \
  -b 2.4 \
  -hb 6

The Money Pattern

Once it's quantized, the generator API is two imports and you're streaming. I've had this running on a Gold Coast workstation for an Aidxn Design experiment and it hits 35 tokens/sec on a single 4090.

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("~/quants/llama3-70b-exl2")
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tok = ExLlamaV2Tokenizer(config)

gen = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tok)
print(gen.generate(prompt="Write a Tailwind grid system", max_new_tokens=400))

The Catch

GPU only. There is no CPU fallback, no Mac support, no Apple Silicon path. The quantization step itself takes hours and a beefy GPU just to produce the file. Documentation assumes you already speak fluent transformer. And dynamic batching, while real, takes some squinting at examples to get right.

The Verdict

If you have a single 4090 and you want the absolute best local inference on the planet, ExLlamaV2 is the answer. The EXL2 format is genuinely magic and the throughput numbers embarrass everything else in the consumer tier. Quantize a model tonight, you'll never go back to plain GPTQ.

Let us make some quick suggestions?
Please provide your full name.
Please provide your phone number.
Please provide a valid phone number.
Please provide your email address.
Please provide a valid email address.
Please provide your brand name or website.
Please provide your brand name or website.