EXL2 quantizes layer-by-layer at variable precision — and it runs Llama 3 70B on a single 4090
In case you've been living under a rock: ExLlamaV2 is the inference engine that quietly does what looks impossible, running Llama 3 70B at usable speed on a single consumer GPU. Plot twist: the secret sauce is its custom EXL2 quant format.
The Setup
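EXL2 doesn't pick one bit-width and stop. It measures how much error each layer picks up at different precisions on a calibration set, gives each layer its own mix of bit-widths so the average lands on the target you asked for, then packs the whole thing into a single model directory. The result is a 70B at ~2.4 bits per weight: roughly 21GB of weights, which fits in 24GB of VRAM and still gives coherent output.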
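To make the bit budget concrete, here is a rough sketch of the arithmetic plus a toy allocator. It is not ExLlamaV2's actual optimizer: `Layer`, `allocate_bits`, `weight_gb`, and the error numbers are all made up for illustration, and the greedy rule is just one plausible way to spend a fixed budget.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    params: int              # number of weights in the layer
    error: dict[int, float]  # hypothetical calibration error at each bit-width


def weight_gb(n_params: float, bpw: float) -> float:
    """Size of the quantized weights alone, in GB (KV cache and activations come on top)."""
    return n_params * bpw / 8 / 1e9


def allocate_bits(layers, target_bpw):
    """Toy greedy allocator: start every layer at its lowest bit-width, then keep
    buying the upgrade with the best error reduction per extra bit spent,
    as long as the total stays within the budget."""
    choice = {layer.name: min(layer.error) for layer in layers}
    total_params = sum(layer.params for layer in layers)
    budget = target_bpw * total_params  # total bits we may spend
    used = sum(choice[layer.name] * layer.params for layer in layers)

    while True:
        best = None
        for layer in layers:
            cur = choice[layer.name]
            higher = [b for b in layer.error if b > cur]
            if not higher:
                continue  # already at the widest option
            nxt = min(higher)
            cost = (nxt - cur) * layer.params
            if used + cost > budget:
                continue  # this upgrade would blow the budget
            gain = (layer.error[cur] - layer.error[nxt]) / cost
            if best is None or gain > best[0]:
                best = (gain, layer, nxt, cost)
        if best is None:
            break  # nothing affordable is left
        _, layer, nxt, cost = best
        choice[layer.name] = nxt
        used += cost

    return choice, used / total_params


# Made-up numbers: the small attention layer is far more sensitive than the big MLP
layers = [
    Layer("attn", 1_000, {2: 0.30, 3: 0.10, 4: 0.04}),
    Layer("mlp", 3_000, {2: 0.20, 3: 0.12, 4: 0.08}),
]
choice, avg_bpw = allocate_bits(layers, target_bpw=2.5)
print(choice, avg_bpw)          # per-layer bit-widths and the resulting average
print(weight_gb(70e9, 2.4))     # about 21 GB of weights for a 70B model at 2.4 bpw
```

In this toy run the sensitive layer soaks up the spare bits while the large, tolerant one stays at the floor, which is the intuition behind a variable-precision format.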
git clone https://github.com/turboderp/exllamav2
cd exllamav2 && pip install -e .
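# Quantize a model to ~2.4 bpw
# -o is the working directory for intermediate files, -cf is where the finished model lands.
# Keep them separate; the scratch path below is only an example.
python convert.py \
  -i ~/models/Llama-3-70B-Instruct \
  -o ~/quants/work/llama3-70b \
  -cf ~/quants/llama3-70b-exl2 \
  -b 2.4 \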
  -hb 6
The Money Pattern
Once it's quantized, the generator API is two imports and a handful of lines before you're generating. I've had this running on a Gold Coast workstation for an Aidxn Design experiment and it hits 35 tokens/sec on a single 4090.
import os
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator
# Python does not expand "~" on its own, so resolve the path first
config = ExLlamaV2Config(os.path.expanduser("~/quants/llama3-70b-exl2"))
model = ExLlamaV2(config)
# Lazy cache: allocated while the weights load, so autosplit can account for it
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tok = ExLlamaV2Tokenizer(config)
gen = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tok)
print(gen.generate(prompt="Write a Tailwind grid system", max_new_tokens=400))
The Catch
GPU only. There is no CPU fallback and no Apple Silicon path, so Macs are out. The quantization step itself takes hours and needs a beefy GPU just to produce the file. The documentation assumes you already speak fluent transformer. And dynamic batching, while real, takes some squinting at the examples to get right.
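For the batching part, this is the shape I'd start from: a minimal sketch, assuming a model directory quantized as described earlier and a CUDA machine to run it on. `run_batch` and its parameters are names I made up; the ExLlamaV2 calls are the same ones used in the single-prompt example, and passing a list of prompts to `generate` is, as far as I understand the API, what lets the dynamic generator batch them.

```python
import os


def run_batch(model_dir, prompts, max_new_tokens=200):
    """Load an EXL2 model and complete several prompts in one batched call.

    Hypothetical helper: returns a dict mapping each prompt to its completion.
    """
    # Imported lazily so this file can still be loaded on a machine without a GPU
    from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
    from exllamav2.generator import ExLlamaV2DynamicGenerator

    # Resolve "~" ourselves; Python does not expand it in plain strings
    config = ExLlamaV2Config(os.path.expanduser(model_dir))
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)
    model.load_autosplit(cache)
    tokenizer = ExLlamaV2Tokenizer(config)
    generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

    # A list of prompts goes in, one completion per prompt is expected back
    prompts = list(prompts)
    outputs = generator.generate(prompt=prompts, max_new_tokens=max_new_tokens)
    return dict(zip(prompts, outputs))
```

Calling `run_batch("~/quants/llama3-70b-exl2", ["Write a Tailwind grid system", "Explain CSS subgrid"])` would then hand both prompts to the generator at once instead of looping over them one by one.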
The Verdict
If you have a single 4090 and you want the absolute best local inference on the planet, ExLlamaV2 is the answer. The EXL2 format is genuinely magic and the throughput numbers embarrass everything else in the consumer tier. Quantize a model tonight and you'll never go back to plain GPTQ.