One endpoint, any framework, batching for free
In case you've been living under a rock: Triton Inference Server is how a lot of big teams serve models in 2026. PyTorch, TensorRT, ONNX, plain Python: Triton hosts them all behind one HTTP/gRPC endpoint, with dynamic batching built in.
The Setup
Triton is config-driven. You drop a model into a folder, write a tiny `config.pbtxt`, and Triton picks it up on boot. Dynamic batching, version management, and ensemble pipelines are all declared in that config rather than written as code.
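The folder convention is easier to see laid out. A minimal sketch, assuming the default layout where each model gets its own directory holding `config.pbtxt` plus a numbered version subdirectory with the model file (`model.onnx` is the default file name for the ONNX Runtime backend). `scaffold_model_repo` is a hypothetical helper for illustration, not part of Triton:

```python
from pathlib import Path


def scaffold_model_repo(root: Path, model_name: str, version: int = 1) -> dict:
    """Create the directory skeleton for one model and return the key paths.

    Hypothetical helper: Triton only reads this layout, it does not create it.
    """
    model_dir = root / model_name
    version_dir = model_dir / str(version)  # versions are numeric directory names
    version_dir.mkdir(parents=True, exist_ok=True)
    return {
        "config": model_dir / "config.pbtxt",   # the model config belongs here
        "weights": version_dir / "model.onnx",  # the exported ONNX file belongs here
    }
```

Calling `scaffold_model_repo(Path("model_repository"), "embedder")` creates `model_repository/embedder/1/` and tells you where the config and the exported model file should go.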
```
# model_repository/embedder/config.pbtxt
name: "embedder"
platform: "onnxruntime_onnx"
max_batch_size: 32
input [
  { name: "input_ids" data_type: TYPE_INT64 dims: [ -1 ] }
]
output [
  { name: "embedding" data_type: TYPE_FP32 dims: [ 768 ] }
]
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 1000
}
```

The Money Pattern
Once it's up, every model in your stack speaks the same wire format: the KServe-compatible v2 protocol, the same client SDK, the same metrics. Your Astro 5 backend doesn't care if the model is on a CPU or an H100.
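Because only the model name in the URL and the tensor metadata change between models, one small helper can cover the whole stack. A minimal sketch using only the Python standard library rather than the official client SDK. `build_infer_request` and `infer` are hypothetical helper names, and the base URL is assumed to point at Triton's HTTP port:

```python
import json
import urllib.request


def build_infer_request(base_url: str, model: str, input_name: str,
                        datatype: str, rows: list) -> urllib.request.Request:
    """Build a v2 inference request for one 2-D input tensor.

    `rows` is a list of equal-length lists, one per batch item.
    """
    payload = {
        "inputs": [
            {
                "name": input_name,
                "shape": [len(rows), len(rows[0])],  # [batch, sequence_length]
                "datatype": datatype,
                "data": rows,
            }
        ]
    }
    return urllib.request.Request(
        url=f"{base_url}/v2/models/{model}/infer",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def infer(request: urllib.request.Request, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())
```

With a server running, `infer(build_infer_request("http://localhost:8000", "embedder", "input_ids", "INT64", [[1, 2, 3, 4, 5, 6, 7, 8]]))` sends the same body as the curl call in this section. Pointing it at a different model means changing the arguments, not the code.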
```bash
# Ports: 8000 = HTTP, 8001 = gRPC, 8002 = Prometheus metrics
docker run --gpus=1 --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v "$PWD/model_repository:/models" \
  nvcr.io/nvidia/tritonserver:24.10-py3 \
  tritonserver --model-repository=/models
```
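The container takes a while to load models, so a request fired immediately after `docker run` may be refused. The v2 protocol exposes `GET /v2/health/ready` for the server and `GET /v2/models/<name>/ready` for a single model. A minimal polling sketch built on those endpoints, where `http_ready` and `wait_until_ready` are hypothetical helper names and the retry counts are arbitrary defaults:

```python
import time
import urllib.request
from typing import Callable


def http_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError both derive from OSError).
        return False


def wait_until_ready(check: Callable[[], bool], attempts: int = 30,
                     delay_s: float = 1.0) -> bool:
    """Call check() up to `attempts` times, sleeping between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

For the embedder, that would be `wait_until_ready(lambda: http_ready("http://localhost:8000/v2/models/embedder/ready"))`. Passing the check in as a callable keeps the retry loop testable without a server.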
```bash
curl -s -X POST localhost:8000/v2/models/embedder/infer \
  -H 'Content-Type: application/json' \
  -d '{"inputs":[{"name":"input_ids","shape":[1,8],"datatype":"INT64","data":[[1,2,3,4,5,6,7,8]]}]}'
```

The Catch
The learning curve is steep. `config.pbtxt` is protobuf text format with footguns. Error messages are cryptic. The Python backend has its own ABI quirks. Plan a full week of "why does it return shape mismatch" before things click.
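A common source of that error is the batch dimension: when `max_batch_size` is greater than zero, the `dims` in `config.pbtxt` leave out the batch dimension, so a request has to send one extra leading dimension. A simplified pre-flight check, sketched under that assumption. `shape_problem` is a hypothetical helper and does not reproduce every rule Triton applies (reshape settings and ragged inputs are ignored, and a batch size of at least 1 is assumed):

```python
from typing import List, Optional


def shape_problem(request_shape: List[int], config_dims: List[int],
                  max_batch_size: int) -> Optional[str]:
    """Return a description of the mismatch, or None if the shape looks fine."""
    if max_batch_size > 0:
        # Batching enabled: config dims omit the batch dimension,
        # so the request carries one extra leading dimension.
        if len(request_shape) != len(config_dims) + 1:
            return (f"expected {len(config_dims) + 1} dims "
                    f"(batch + config dims), got {len(request_shape)}")
        batch, tensor_dims = request_shape[0], request_shape[1:]
        if not 1 <= batch <= max_batch_size:
            return f"batch size {batch} is outside 1..{max_batch_size}"
    else:
        # Batching disabled: the request shape is compared to dims directly.
        if len(request_shape) != len(config_dims):
            return f"expected {len(config_dims)} dims, got {len(request_shape)}"
        tensor_dims = request_shape
    for index, (actual, declared) in enumerate(zip(tensor_dims, config_dims)):
        # -1 in the config marks a variable-size dimension.
        if declared != -1 and actual != declared:
            return f"dim {index} is {actual}, config declares {declared}"
    return None
```

Against the embedder config, `shape_problem([1, 8], [-1], 32)` returns `None`, while `shape_problem([8], [-1], 32)` reports the missing batch dimension, which is the mistake most people make first.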
The Verdict
For a team running more than two models, Triton can pay for itself within a quarter. It's the boring serving-infrastructure layer that lets you swap PyTorch for TensorRT without touching the client. If you're past the prototype stage, get on Triton; your future self will write a thank-you note.