Inference

TGI From Hugging Face Quietly Won The Serving War

All articles
🤗📡🏆

TGI v3 powers HF Inference Endpoints — and most of the production LLMs you've talked to this year

If you've been living under a rock, Text Generation Inference is the server that's quietly running half the production LLM endpoints on the internet. Spoiler: it's not vLLM, it's not TensorRT-LLM, it's TGI v3 from Hugging Face.

The Setup

TGI ships tensor parallel, speculative decoding, paged attention, FlashAttention 3, and continuous batching out of the box. It's the engine behind Hugging Face Inference Endpoints, which means it's battle-tested at scale you do not want to think about.

{`docker run --gpus all --shm-size 1g -p 8080:80 \\
  -v $PWD/models:/data \\
  ghcr.io/huggingface/text-generation-inference:3.0 \\
  --model-id meta-llama/Llama-3-70B-Instruct \\
  --num-shard 4 \\
  --quantize awq`}

The Money Pattern

The `/generate` endpoint is dead simple, and there's an OpenAI-compatible router on the side if you want it. I've been running it behind a Netlify Function for an Aidxn Design experiment and it hasn't blinked.

{`curl http://localhost:8080/generate \\
  -X POST \\
  -H 'Content-Type: application/json' \\
  -d '{
    "inputs": "Write a Zod schema for a hail claim",
    "parameters": {
      "max_new_tokens": 512,
      "temperature": 0.3,
      "top_p": 0.9
    }
  }'`}

The Catch

It's Hugging Face-flavored, which means it's opinionated about how you load weights, what tokenizer config it expects, and where models live on disk. Custom architectures take work to land. Older quant formats are second-class citizens compared to vLLM's grab-bag approach. And the Rust-based router is fast but cryptic when things break.

The Verdict

For production deployments where stability matters more than the latest community quant format, TGI is the safe bet. vLLM has more buzz, Aphrodite has more knobs, but TGI is what I'd reach for if I were serving a customer-facing endpoint at Rebuild Relief tomorrow. Run the container. It just works.

Let us make some quick suggestions?
Please provide your full name.
Please provide your phone number.
Please provide a valid phone number.
Please provide your email address.
Please provide a valid email address.
Please provide your brand name or website.
Please provide your brand name or website.