vLLM Beats Every Hosted API On Speed

If you've been living under a rock: vLLM is the open-source inference engine that's quietly eating into the hosted LLM market. PagedAttention plus continuous batching means a single A100 you control can push more aggregate tokens per second than you will typically get out of a rate-limited hosted API on high-throughput workloads. Yes, really.

The Setup

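vLLM came out of UC Berkeley in 2023. The trick, PagedAttention, manages the KV cache the way an operating system manages virtual memory: the cache is split into fixed-size blocks that don't have to be contiguous, so memory is handed out on demand instead of being reserved up front for every request's maximum length. Less wasted memory means way more concurrent requests on the same hardware. Continuous batching then slots new requests into the running batch as soon as others finish, so the GPU stays busy instead of waiting on the slowest request in a batch.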
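To make the paging idea concrete, here is a toy sketch of a block-based KV cache allocator. It is not vLLM's actual implementation: the class name `PagedKVCache`, the 16-token block size, and the string request IDs are all made up for illustration. It only shows the bookkeeping, namely that each request grabs a new block when its current one fills up and returns all of its blocks when it finishes.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative value, not read from vLLM)


class OutOfBlocks(Exception):
    """Raised when no free block is left for a request."""


class PagedKVCache:
    """Toy allocator: maps each request to a list of physical block ids."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))  # pool of unused physical blocks
        self.tables = {}   # request id -> list of block ids (the "page table")
        self.lengths = {}  # request id -> number of tokens stored so far

    def append_token(self, req_id: str) -> None:
        n = self.lengths.get(req_id, 0)
        # A new block is needed for the first token and whenever the
        # current block is exactly full.
        if n % BLOCK_SIZE == 0:
            if not self.free:
                raise OutOfBlocks(req_id)
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id: str) -> None:
        # A finished request hands its blocks straight back to the pool.
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


cache = PagedKVCache(num_blocks=4)
for _ in range(20):
    cache.append_token("a")  # 20 tokens need 2 blocks
for _ in range(5):
    cache.append_token("b")  # 5 tokens need 1 block
print(len(cache.free))       # 1 block still free
cache.release("a")
print(len(cache.free))       # 3 blocks free again
```

The point of the toy: a request that has produced 20 tokens holds 2 blocks, not a slab sized for the longest sequence it might ever generate, and the blocks it frees are immediately usable by any other request.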

{`pip install vllm

# Llama 3 is a gated model: accept the license on Hugging Face and log in first
# Spin up an OpenAI-compatible server with one command
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --host 0.0.0.0 --port 8000`}

The Money Pattern

The server speaks the OpenAI API, so existing client code works once you point its base URL at your server. One number from a production deployment: a single A100 80GB running Llama 3 8B at around 5,000 tokens/sec aggregate throughput with 64 concurrent users. Hosted APIs meter you with rate limits and charge by the token, while a rented GPU costs the same per hour whether it sits idle or runs saturated. At high volume, the math gets uncomfortable fast.

{`from openai import OpenAI

# vLLM does not check the API key unless the server was started with one
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    # Must match the model name the server was launched with
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
    max_tokens=300,
)
print(resp.choices[0].message.content)`}
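To see how the math plays out, here is a back-of-the-envelope sketch. The 5,000 tokens/sec figure is the one quoted above; the $2.00/hour GPU price and the helper names are hypothetical placeholders, so plug in your own provider's rate before drawing conclusions.

```python
def cost_per_million_tokens(gpu_usd_per_hour: float,
                            tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Effective USD per 1M tokens for a GPU billed by the hour.

    utilization is the fraction of the hour the GPU actually spends
    serving traffic at the given throughput (1.0 means fully saturated).
    """
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


def monthly_gpu_cost(gpu_usd_per_hour: float, hours: float = 730) -> float:
    """Cost of keeping one GPU up all month (about 730 hours)."""
    return gpu_usd_per_hour * hours


HYPOTHETICAL_GPU_PRICE = 2.00  # USD per hour, an assumption, not a quote

print(round(cost_per_million_tokens(HYPOTHETICAL_GPU_PRICE, 5000), 3))
print(round(cost_per_million_tokens(HYPOTHETICAL_GPU_PRICE, 5000, utilization=0.25), 3))
print(monthly_gpu_cost(HYPOTHETICAL_GPU_PRICE))
```

Under those assumptions a saturated GPU lands at roughly $0.11 per million tokens, and about four times that if it only sees traffic a quarter of the time. Utilization is the number that makes or breaks the comparison, so it is worth measuring rather than guessing.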

The Catch

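You need a serious GPU. A 24 GB RTX 4090 will do 8B models, but anything 70B-and-up wants multiple A100s or H100s: 70B parameters at 16-bit precision is about 140 GB of weights alone, more than any single 80 GB card holds without quantization. Spinning a GPU instance up for one-off prompts is also a worse experience than just hitting an API. vLLM only wins when you're doing thousands of requests per minute.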
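A rough sizing sketch makes the GPU requirement easier to reason about. It only counts model weights (parameters times bytes per parameter) and ignores KV cache and activations, so treat the result as a lower bound. The function names and the 0.9 usable-memory fraction are assumptions for illustration.

```python
import math


def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Memory for the weights alone, in GB (2 bytes per param = 16-bit)."""
    # 1e9 params * bytes_per_param / 1e9 bytes per GB
    return params_billions * bytes_per_param


def min_gpus_for_weights(params_billions: float,
                         gpu_memory_gb: float,
                         bytes_per_param: float = 2.0,
                         usable_fraction: float = 0.9) -> int:
    """Lower bound on GPU count: weights must fit in the usable memory.

    KV cache and activations need room too, so real deployments
    usually want more headroom than this returns.
    """
    usable = gpu_memory_gb * usable_fraction
    return math.ceil(weight_memory_gb(params_billions, bytes_per_param) / usable)


print(weight_memory_gb(8))                                # 8B at 16-bit
print(min_gpus_for_weights(8, gpu_memory_gb=24))          # 24 GB consumer card
print(weight_memory_gb(70))                               # 70B at 16-bit
print(min_gpus_for_weights(70, gpu_memory_gb=80))         # 80 GB datacenter card
print(min_gpus_for_weights(70, 80, bytes_per_param=0.5))  # 4-bit quantized
```

By this estimate an 8B model at 16-bit needs about 16 GB and fits on one 24 GB card, a 70B model at 16-bit needs at least two 80 GB cards, and 4-bit quantization brings the 70B weights down to about 35 GB.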

The dependency tree is also chunky. CUDA versions matter. Driver versions matter. This is fine.
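Since version mismatches are the usual source of pain, a quick environment report before debugging anything else can save time. This is a minimal sketch using only the standard library: the function name `environment_report` is made up, and it assumes `nvidia-smi` is on the PATH when an NVIDIA driver is installed. It does not tell you which combinations are compatible; for that, check the vLLM installation docs for your version.

```python
import importlib.metadata
import shutil
import subprocess


def environment_report() -> dict:
    """Collect installed package versions and the GPU driver version."""
    report = {}

    # Installed Python package versions (None if the package is missing).
    for pkg in ("vllm", "torch"):
        try:
            report[pkg] = importlib.metadata.version(pkg)
        except importlib.metadata.PackageNotFoundError:
            report[pkg] = None

    # GPU name and driver version as reported by nvidia-smi, one line per GPU.
    report["gpus"] = None
    if shutil.which("nvidia-smi"):
        try:
            out = subprocess.run(
                ["nvidia-smi", "--query-gpu=name,driver_version",
                 "--format=csv,noheader"],
                capture_output=True, text=True, timeout=10,
            )
            if out.returncode == 0:
                report["gpus"] = out.stdout.strip().splitlines()
        except (OSError, subprocess.TimeoutExpired):
            pass  # leave "gpus" as None if the tool cannot be run

    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Pasting this output into a bug report or a team chat is usually the fastest way to spot a driver that is too old for the installed CUDA build.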

The Verdict

For batch inference, RAG pipelines, agent fleets, and anything with serious throughput, self-hosting on vLLM is now the obvious move. The hosted API tax stops making sense once your monthly API bill is bigger than what a dedicated GPU instance would cost you, and you have enough traffic to keep that GPU busy. The economics flipped. Most teams just haven't run the numbers yet.
