TensorRT-LLM Is For The Hardcore

Compile your model to GPU kernels and watch vLLM eat dust

Spoiler: TensorRT-LLM is still the fastest way to serve a transformer on Nvidia hardware. vLLM is friendlier, SGLang is cleverer, but at H100 scale TensorRT-LLM wins on raw throughput per dollar.

The Setup

TensorRT-LLM doesn't just load weights: it compiles your model into fused CUDA kernels tuned for your GPU architecture and for the batch and sequence limits you set at build time. The output is an `.engine` file that runs at silicon limits.

# convert HF checkpoint -> TRT-LLM format
python convert_checkpoint.py \
  --model_dir ./llama-3-8b-instruct \
  --output_dir ./trt_ckpt/llama-3-8b/1-gpu \
  --dtype bfloat16

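# compile to engine
# (newer TensorRT-LLM releases may expect --max_seq_len instead of --max_output_len; check trtllm-build --help)
trtllm-build \
  --checkpoint_dir ./trt_ckpt/llama-3-8b/1-gpu \
  --output_dir ./trt_engines/llama-3-8b \
  --gemm_plugin auto \
  --max_batch_size 32 \
  --max_input_len 4096 \
  --max_output_len 1024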
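
Since every engine is tied to the limits it was built with, it can help to keep those limits in one place and generate the build command from them. Here's a minimal sketch using only the flags from the command above; `BuildConfig` and `build_command` are hypothetical names for illustration, not part of TensorRT-LLM:

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildConfig:
    """Hypothetical container for the trtllm-build settings used in this article."""
    checkpoint_dir: str
    output_dir: str
    max_batch_size: int = 32
    max_input_len: int = 4096
    max_output_len: int = 1024
    gemm_plugin: str = "auto"


def build_command(cfg: BuildConfig) -> list[str]:
    """Return the trtllm-build argv for cfg, using only the flags shown above."""
    return [
        "trtllm-build",
        "--checkpoint_dir", cfg.checkpoint_dir,
        "--output_dir", cfg.output_dir,
        "--gemm_plugin", cfg.gemm_plugin,
        "--max_batch_size", str(cfg.max_batch_size),
        "--max_input_len", str(cfg.max_input_len),
        "--max_output_len", str(cfg.max_output_len),
    ]


if __name__ == "__main__":
    cfg = BuildConfig("./trt_ckpt/llama-3-8b/1-gpu", "./trt_engines/llama-3-8b")
    # Print a copy-pasteable command; pass the list to subprocess.run to execute it
    print(shlex.join(build_command(cfg)))
```

Checking a config like this into version control next to the container definition gives you a record of exactly which shapes each engine was built for.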

The Money Pattern

The runtime API is dead simple once the engine is built. You point at the engine folder and stream. Put this behind Triton Inference Server, which ships a TensorRT-LLM backend, and you've got a production-grade serving rig.

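from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

# The tokenizer comes from the original HF checkpoint, not the engine folder
tokenizer = AutoTokenizer.from_pretrained("./llama-3-8b-instruct")

runner = ModelRunner.from_dir(
    engine_dir="./trt_engines/llama-3-8b",
    rank=0,
)

# generate() takes a list of 1-D int32 token tensors, one per request
input_ids = tokenizer("write a haiku about latency", return_tensors="pt").input_ids[0].int()
outputs = runner.generate(
    [input_ids],
    max_new_tokens=64,
    temperature=0.7,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,  # Llama 3 has no pad token, so reuse EOS
    streaming=True,
)
# Assumption: each streamed step yields the whole sequence so far, shaped
# [batch, beams, seq_len] with the prompt included; verify against your release
printed = ""
for output_ids in outputs:
    text = tokenizer.decode(output_ids[0][0][len(input_ids):], skip_special_tokens=True)
    print(text[len(printed):], end="", flush=True)
    printed = text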
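
One wrinkle with streaming: runtimes often hand back the whole sequence generated so far on every step rather than one new token, and decoding token by token can split multi-byte characters. Below is a minimal sketch of printing only the new text at each step, assuming cumulative token lists. `stream_text` and the toy decode function are hypothetical and stand in for the real runner and tokenizer:

```python
from typing import Callable, Iterable, Iterator, Sequence


def stream_text(
    steps: Iterable[Sequence[int]],
    decode: Callable[[Sequence[int]], str],
) -> Iterator[str]:
    """Yield only the newly decoded text for each cumulative token sequence."""
    emitted = ""
    for token_ids in steps:
        text = decode(token_ids)
        # Skip steps where the decoded text has not grown yet
        if len(text) > len(emitted):
            yield text[len(emitted):]
            emitted = text


if __name__ == "__main__":
    # Toy vocabulary standing in for a real tokenizer (assumption for illustration)
    vocab = {1: "low", 2: " latency", 3: " wins"}

    def toy_decode(ids):
        return "".join(vocab[i] for i in ids)

    for piece in stream_text([[1], [1, 2], [1, 2, 3]], toy_decode):
        print(piece, end="", flush=True)
    print()
```

Decoding the full sequence each step costs a little extra CPU, but it keeps the printed text consistent with what the tokenizer would produce for the finished output.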

The Catch

Compile times are brutal: 10 to 40 minutes per engine. The batch size and sequence lengths you pass at build time are hard ceilings, so raising any of them means recompiling. The build is vendor-locked to Nvidia, and an engine is tied to the GPU architecture and the TensorRT/CUDA versions it was built with. Container hygiene is mandatory.
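
The flip side of baked-in shapes is that a request outside the built limits can't be served by that engine. This is a minimal sketch of a pre-flight check, assuming the limits from the build command earlier in this article. `EngineLimits` and `fits` are hypothetical helpers, not TensorRT-LLM APIs:

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EngineLimits:
    """Shape limits fixed at build time (defaults mirror the article's build command)."""
    max_batch_size: int = 32
    max_input_len: int = 4096
    max_output_len: int = 1024


def fits(limits: EngineLimits, prompt_lens: Sequence[int], max_new_tokens: int) -> bool:
    """Return True if a batch of prompts stays within the engine's built limits."""
    if not prompt_lens or len(prompt_lens) > limits.max_batch_size:
        return False
    if max(prompt_lens) > limits.max_input_len:
        return False
    return max_new_tokens <= limits.max_output_len


if __name__ == "__main__":
    limits = EngineLimits()
    print(fits(limits, [512, 2048], max_new_tokens=64))  # batch within the built limits
    print(fits(limits, [8192], max_new_tokens=64))       # prompt longer than max_input_len
```

Anything that fails the check needs a different engine, which is where the rebuild cost bites.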

The Verdict

For a side project on an M4 Mac, this is comical overkill, and it won't run on Apple silicon anyway, so stick with llama.cpp. For a serious deployment serving thousands of req/s on H100s, TensorRT-LLM is the answer and there isn't a close second. Hardcore only. Worth it.
