Inference Just Crossed The Speed Of Thought
The first time you watch Groq stream a 500-token response in under a second, your brain refuses to believe it. 800 tokens per second on Llama 3 70B. The internet collectively lost its mind, then developers quietly started rewriting their agents to use it.
The Setup
Groq isn't a GPU company. The LPU — Language Processing Unit — is a custom chip designed for one thing: sequential token generation with deterministic latency. No memory bandwidth bottleneck, no batching gymnastics, just brutally fast inference. The benchmarks are not a typo.
pip install groq
# Set your API key
export GROQ_API_KEY="gsk_..."
The Money Pattern
The API is OpenAI-shaped, so existing clients port over in five minutes. Pricing on small models actually undercuts OpenAI's equivalents. For voice agents, autocomplete, and any UX where latency is the product, Groq is now the default choice.
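If you're already on the openai SDK, porting can be as small as swapping the base URL and key. A minimal sketch, assuming Groq's documented OpenAI-compatible endpoint at api.groq.com/openai/v1:

import os
from openai import OpenAI

# Point the stock OpenAI client at Groq's compatibility endpoint.
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

The native SDK is just as simple, and streaming works the way you'd expect: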
from groq import Groq

client = Groq()

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "You're a concise assistant."},
        {"role": "user", "content": "Plan a Gold Coast surf trip for next weekend."}
    ],
    stream=True,
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
The Catch
The model selection is limited. Llama 3 variants, a handful of Mistrals, some Whisper for audio — that's basically it. No fine-tuning hosting. No image models. No long-context monsters. If you need GPT-5-Opus-Pro-Max-Turbo, Groq isn't it.
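The lineup also shifts as models get deprecated, so don't hardcode assumptions about what's available. A quick way to check, assuming the SDK mirrors the OpenAI-style models endpoint:

from groq import Groq

client = Groq()

# Print every model ID currently served, e.g. llama-3.3-70b-versatile.
for model in client.models.list().data:
    print(model.id)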
Rate limits also get aggressive on the free tier. The good news is that paid tiers are reasonable. The bad news is you'll find out the hard way, as a burst of 429s mid-demo.
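It's worth wrapping calls in basic exponential backoff before you hit that wall. A minimal sketch, assuming the SDK raises groq.RateLimitError on 429s; the function name, retry count, and delays here are illustrative:

import time

import groq

client = groq.Groq()

def create_with_backoff(messages, retries=5):
    # Retry rate-limited calls with exponential backoff: 1s, 2s, 4s, ...
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="llama-3.3-70b-versatile",
                messages=messages,
            )
        except groq.RateLimitError:
            time.sleep(2 ** attempt)
    raise RuntimeError(f"still rate-limited after {retries} retries")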
The Verdict
For anything where streaming latency is the actual product — voice apps, agent step-throughs, live editing UIs — Groq is the obvious move. For everything else, it's a great backup provider when you need speed over flexibility. The LPU bet aged better than anyone expected.