SGLang Is Quietly Outperforming vLLM

The Inference Engine Nobody's Talking About (Yet)

Plot twist: the king of open-source inference might not be vLLM anymore. SGLang, out of LMSYS (the same crew behind Chatbot Arena), can be 20-40% faster on agent workloads that share a lot of prompt prefix. People are quietly switching.

The Setup

SGLang's secret sauce is RadixAttention. Where vLLM's core idea, PagedAttention, manages the KV cache in fixed-size blocks, SGLang also indexes cached prefixes in a radix tree shared across requests and reuses them automatically. Every time two prompts start with the same system prompt, few-shot examples, or tool definitions, SGLang reuses the KV cache for that shared prefix instead of recomputing it. Agents do this constantly.
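To make the prefix-sharing idea concrete, here is a minimal sketch of a token-level prefix tree. It is a toy, not SGLang's implementation: it uses a plain trie rather than a compressed radix tree, whitespace-split words stand in for tokenizer IDs, and `PrefixCache` and `Node` are hypothetical names. The point is only to show how a second request can discover how much of its prompt is already cached.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One cached token; children are keyed by the token that follows it."""
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Toy prefix tree over token sequences (hypothetical, for illustration)."""

    def __init__(self):
        self.root = Node()

    def insert(self, tokens):
        """Add a request and return how many leading tokens were already cached."""
        node = self.root
        reused = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                # First miss: everything from here on is new to the cache.
                child = Node()
                node.children[tok] = child
            else:
                reused += 1
            node = child
        return reused


# Two requests that share a system prompt (words stand in for tokens).
system = "You are a helpful agent .".split()
req_a = system + "What is a radix tree ?".split()
req_b = system + "What is paged attention ?".split()

cache = PrefixCache()
reused_a = cache.insert(req_a)  # cold cache, nothing to reuse
reused_b = cache.insert(req_b)  # shares the system prompt plus "What is"
print(reused_a, reused_b)
```

In a real engine the reused tokens correspond to KV cache entries that skip prefill; a production radix tree also needs eviction, which this sketch leaves out.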

{`pip install "sglang[all]"

# Launch the server
# (gated model: accept the license on Hugging Face and log in first)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --port 30000 \
  --host 0.0.0.0`}
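Once the server is up, a quick smoke test helps confirm it is answering. The sketch below uses only the standard library and targets SGLang's native `/generate` endpoint with a `text` field and a `sampling_params` object. The base URL matches the port from the launch command; the helper names `build_generate_payload` and `generate` are hypothetical, and the request shape is worth checking against the version you have installed.

```python
import json
import urllib.request

# Assumption: the server from the launch command is reachable on this host.
BASE_URL = "http://localhost:30000"


def build_generate_payload(prompt, max_new_tokens=64, temperature=0.0):
    """Build the JSON body for a native /generate request."""
    return {
        "text": prompt,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }


def generate(prompt, base_url=BASE_URL, timeout=60, **sampling):
    """POST the prompt to the server and return the decoded JSON response."""
    payload = build_generate_payload(prompt, **sampling)
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `generate("The capital of France is", max_new_tokens=16)` should return a JSON object containing the completion. Nothing here contacts the server until `generate` is called.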

The Money Pattern

The frontend DSL is the killer feature. You describe multi-step reasoning as a Python function, and the runtime reuses the KV cache for whatever prefix the steps have in common, so each new generation call only pays for the tokens it has not seen yet. Tool-calling loops, branching agents, structured generation: all noticeably faster than the equivalent vLLM setup.

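{`import sglang as sgl

# Point the frontend at the server launched above (port 30000);
# without a default backend, .run() has nothing to talk to.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def multi_turn_agent(s, question):
    s += sgl.system("You are a helpful agent on the Gold Coast.")
    s += sgl.user(question)
    # First call: draft a plan
    s += sgl.assistant(sgl.gen("plan", max_tokens=128))
    s += sgl.user("Now execute the plan step by step.")
    # Second call: shares the whole conversation so far as its prefix
    s += sgl.assistant(sgl.gen("answer", max_tokens=512))

state = multi_turn_agent.run(question="How do I deploy Astro to Netlify?")
print(state["answer"])`}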
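To see why the two-step flow above benefits from prefix reuse, it helps to count how many prompt tokens have to be prefilled with and without a cache. This is back-of-the-envelope arithmetic, not a benchmark: `prefill_cost` is a hypothetical helper, the 40-token and 10-token prompt sizes are made-up illustrative numbers, and the model assumes an ideal cache that never evicts and keeps the KV entries for generated tokens.

```python
def prefill_cost(calls, reuse_prefix):
    """Count prompt tokens that must be prefilled across a multi-step flow.

    calls: list of (new_prompt_tokens, generated_tokens), one per generation call.
    reuse_prefix: True models an ideal prefix cache, False models none.
    """
    context = 0  # tokens already in the conversation before this call
    total = 0
    for new_prompt, generated in calls:
        if reuse_prefix:
            total += new_prompt            # only the unseen suffix is prefilled
        else:
            total += context + new_prompt  # the whole history is prefilled again
        context += new_prompt + generated
    return total


# Assumed sizes: 40 tokens of system prompt + question, a 128-token plan,
# a 10-token follow-up, then a 512-token answer.
calls = [(40, 128), (10, 512)]

without_cache = prefill_cost(calls, reuse_prefix=False)
with_cache = prefill_cost(calls, reuse_prefix=True)
print(without_cache, with_cache)
```

The gap widens with every extra turn, because the uncached cost re-pays for the full history each time while the cached cost only grows by the new tokens.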

The Catch

The ecosystem is smaller. Fewer Hugging Face integrations, fewer tutorials, fewer Stack Overflow answers. If you hit a weird bug at 2am, vLLM has more shoulders to stand on. The DSL is also opinionated: if all you want is a plain OpenAI-compatible endpoint and nothing else, vLLM is simpler.

The Verdict

If your workload is agent-heavy (lots of prefix sharing, tool calls, multi-turn flows), switch to SGLang and pocket that 20-40% speedup. If you're doing simple stateless completions, vLLM is still fine. The inference engine wars are now a real competition, which means everyone wins. Do not @ me.
