TGI v3 powers HF Inference Endpoints — and most of the production LLMs you've talked to this year
If you've been living under a rock, Text Generation Inference is the server that's quietly running half the production LLM endpoints on the internet. Spoiler: it's not vLLM, it's not TensorRT-LLM, it's TGI v3 from Hugging Face.
The Setup
TGI ships tensor parallel, speculative decoding, paged attention, FlashAttention 3, and continuous batching out of the box. It's the engine behind Hugging Face Inference Endpoints, which means it's battle-tested at scale you do not want to think about.
{`docker run --gpus all --shm-size 1g -p 8080:80 \\
-v $PWD/models:/data \\
ghcr.io/huggingface/text-generation-inference:3.0 \\
--model-id meta-llama/Llama-3-70B-Instruct \\
--num-shard 4 \\
--quantize awq`}The Money Pattern
The `/generate` endpoint is dead simple, and there's an OpenAI-compatible router on the side if you want it. I've been running it behind a Netlify Function for an Aidxn Design experiment and it hasn't blinked.
{`curl http://localhost:8080/generate \\
-X POST \\
-H 'Content-Type: application/json' \\
-d '{
"inputs": "Write a Zod schema for a hail claim",
"parameters": {
"max_new_tokens": 512,
"temperature": 0.3,
"top_p": 0.9
}
}'`}The Catch
It's Hugging Face-flavored, which means it's opinionated about how you load weights, what tokenizer config it expects, and where models live on disk. Custom architectures take work to land. Older quant formats are second-class citizens compared to vLLM's grab-bag approach. And the Rust-based router is fast but cryptic when things break.
The Verdict
For production deployments where stability matters more than the latest community quant format, TGI is the safe bet. vLLM has more buzz, Aphrodite has more knobs, but TGI is what I'd reach for if I were serving a customer-facing endpoint at Rebuild Relief tomorrow. Run the container. It just works.