Compile your model to GPU kernels and watch vLLM eat dust
Spoiler: TensorRT-LLM is still the fastest way to serve a transformer on NVIDIA hardware. vLLM is friendlier, SGLang is cleverer, but at H100 scale TensorRT-LLM wins on raw throughput per dollar.
The Setup
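TensorRT-LLM doesn't just load weights: it compiles your model into fused CUDA kernels tuned for your GPU architecture and for the maximum batch and sequence shapes you give it at build time. The output is an engine directory (for a single-GPU build, typically a `rank0.engine` file plus a `config.json`) that pushes the hardware about as hard as it can be pushed.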
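To get a feel for what those build-time shape limits commit you to, here is a rough back-of-the-envelope sketch of the worst-case KV cache size. It assumes the published Llama 3 8B attention layout (32 layers, 8 KV heads, head dimension 128) and a bf16 cache, and it uses the limits from the build in this post (batch 32, 4096 input tokens, 1024 output tokens). `KVShape` and `worst_case_kv_bytes` are hypothetical helper names, not part of TensorRT-LLM, and the real runtime pages its KV cache, so treat the result as an upper bound rather than what actually gets allocated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVShape:
    """Attention layout needed to size a KV cache (hypothetical helper)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_elem: int

    def bytes_per_token(self) -> int:
        # Factor of 2: every layer stores both a K and a V tensor per token.
        return (2 * self.num_layers * self.num_kv_heads
                * self.head_dim * self.bytes_per_elem)


def worst_case_kv_bytes(shape: KVShape, max_batch_size: int,
                        max_input_len: int, max_output_len: int) -> int:
    """Upper bound: every batch slot holds a maximum-length sequence."""
    tokens = max_batch_size * (max_input_len + max_output_len)
    return tokens * shape.bytes_per_token()


# Assumed values: Llama 3 8B layout with a 2-byte (bf16) cache.
LLAMA3_8B_BF16 = KVShape(num_layers=32, num_kv_heads=8,
                         head_dim=128, bytes_per_elem=2)

if __name__ == "__main__":
    total = worst_case_kv_bytes(LLAMA3_8B_BF16, 32, 4096, 1024)
    print(f"{LLAMA3_8B_BF16.bytes_per_token() // 1024} KiB of KV cache per token")
    print(f"{total / 2**30:.1f} GiB if every slot fills to the limit")
```

Add the weights on top (roughly 2 bytes per parameter in bf16) and you can sanity-check a set of limits against your GPU's memory before spending build time on them.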
    # convert HF checkpoint -> TRT-LLM format
    # (convert_checkpoint.py ships with the TensorRT-LLM repo examples)
    python convert_checkpoint.py \
        --model_dir ./llama-3-8b-instruct \
        --output_dir ./trt_ckpt/llama-3-8b/1-gpu \
        --dtype bfloat16

    # compile to engine
    # (recent releases replace --max_output_len with --max_seq_len; check trtllm-build --help)
    trtllm-build \
        --checkpoint_dir ./trt_ckpt/llama-3-8b/1-gpu \
        --output_dir ./trt_engines/llama-3-8b \
        --gemm_plugin auto \
        --max_batch_size 32 \
        --max_input_len 4096 \
        --max_output_len 1024

The Money Pattern
The runtime API is dead simple once the engine is built: point it at the engine folder and stream. Put this behind Triton Inference Server and you've got a production-grade serving rig.
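    from transformers import AutoTokenizer
    from tensorrt_llm.runtime import ModelRunner

    tokenizer = AutoTokenizer.from_pretrained("./llama-3-8b-instruct")
    runner = ModelRunner.from_dir(engine_dir="./trt_engines/llama-3-8b", rank=0)

    # generate() expects a list of 1-D int32 token tensors, one per request
    input_ids = tokenizer("write a haiku about latency", return_tensors="pt").input_ids[0].int()
    prompt_len = input_ids.shape[0]

    outputs = runner.generate(
        [input_ids],
        max_new_tokens=64,
        temperature=0.7,
        end_id=tokenizer.eos_token_id,
        pad_id=tokenizer.eos_token_id,  # Llama has no pad token, so reuse EOS
        streaming=True,
    )

    # Each streamed item holds every id so far, shaped [batch, beams, seq_len],
    # so decode the generated tail and print only what is new.
    printed = ""
    for output_ids in outputs:
        text = tokenizer.decode(output_ids[0][0][prompt_len:], skip_special_tokens=True)
        print(text[len(printed):], end="", flush=True)
        printed = text

The Catch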
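Compile times are brutal: 10 to 40 minutes per engine. An engine serves any request that fits inside the limits baked in at build time, but raising the max batch size or sequence length means recompiling. The build is vendor-locked to NVIDIA, tied to the GPU architecture it was compiled for, and version-locked to the TensorRT-LLM, TensorRT, and CUDA versions that produced it. Container hygiene is mandatory.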
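One way to keep the recompiles under control is to key every engine directory on everything the build depends on, so a stale engine never gets loaded by accident. The sketch below is a minimal illustration of that idea, not anything TensorRT-LLM provides: `BuildSpec`, its fields, the `"sm90"` label, and the directory naming scheme are all assumptions made for this example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BuildSpec:
    """Everything that should force a fresh engine build (hypothetical helper)."""
    model: str
    dtype: str
    gpu_arch: str          # e.g. "sm90" for H100; the label format is an assumption
    trtllm_version: str    # whatever version string your container reports
    max_batch_size: int
    max_input_len: int
    max_output_len: int

    def fingerprint(self) -> str:
        # Stable hash over all fields; sort_keys keeps the JSON deterministic.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]

    def engine_dir(self, root: str = "./trt_engines") -> str:
        # One directory per distinct build configuration.
        return f"{root}/{self.model}-{self.fingerprint()}"


def needs_rebuild(spec: BuildSpec, built_fingerprints: set) -> bool:
    """True when no existing engine matches this exact configuration."""
    return spec.fingerprint() not in built_fingerprints
```

With this in place, a deploy script can compute `spec.engine_dir()`, check whether it exists, and only pay the compile cost when the model, limits, GPU, or toolchain version actually changed.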
The Verdict
For a side project on an M4 Mac, this isn't even on the menu (there's no NVIDIA GPU to compile for), so stick with llama.cpp. For a serious deployment serving thousands of req/s on H100s, TensorRT-LLM is the answer and there isn't a close second. Hardcore only. Worth it.