2000+ tokens/sec on Llama 3.1 70B, and yes, the chip really is the size of a dinner plate
If you've been living under a rock, Cerebras is the company building chips the size of an entire silicon wafer. Plot twist: they're serving Llama 3.1 70B at 2000+ tokens/sec, which is roughly 10x the throughput you'd see from a typical Nvidia GPU deployment.
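To put those numbers in perspective, here's a quick back-of-envelope sketch. It uses only the figures quoted above (2000 tokens/sec, plus a baseline at one tenth of that, which is what the 10x comparison implies) and assumes a steady decode rate with no network or queueing time. `decode_seconds` is just a helper name made up for this example.

```python
# Back-of-envelope decode time, using only the figures quoted in this post.
CEREBRAS_TPS = 2000                # tokens/sec, the headline number
BASELINE_TPS = CEREBRAS_TPS / 10   # implied by the "roughly 10x" comparison


def decode_seconds(num_tokens: int, tokens_per_sec: float) -> float:
    """Seconds to generate num_tokens at a steady rate (ignores TTFT and network)."""
    return num_tokens / tokens_per_sec


for n in (200, 1000):
    fast = decode_seconds(n, CEREBRAS_TPS)
    slow = decode_seconds(n, BASELINE_TPS)
    print(f"{n} tokens: {fast:.2f}s vs {slow:.2f}s")
```

Under those assumptions a 200-token reply takes about 0.1 s instead of 1 s, which is the difference between "instant" and "noticeably waiting" in a chat UI.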
The Setup
The WSE-3 has 900,000 cores on a single chip. No NVLink, no PCIe bottlenecks: the weights sit in on-chip SRAM instead of off-chip memory. The result is inference that feels like instant text. Open the playground, type a prompt, and the response renders before your eyes refocus.
pip install cerebras-cloud-sdk
export CEREBRAS_API_KEY=csk-...
The Money Pattern
Their SDK is OpenAI-shaped, which means you swap clients and ship. I ran a streaming agent loop through it for a Rebuild Relief demo last week — the user-perceived latency was basically zero.
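Since the demo loop was streaming, here's a rough sketch of what a streaming call could look like. It assumes the SDK accepts `stream=True` and yields OpenAI-style chunks with the text in `choices[0].delta.content`; `iter_text` and `stream_reply` are names invented for this example, not part of the SDK.

```python
import os
from typing import Iterable, Iterator


def iter_text(chunks: Iterable) -> Iterator[str]:
    """Yield the text pieces from OpenAI-style streaming chunks."""
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        piece = chunk.choices[0].delta.content
        if piece:  # skip chunks with empty or missing text
            yield piece


def stream_reply(prompt: str, model: str = "llama3.1-70b") -> str:
    """Stream one reply to stdout and return the full text."""
    # Imported lazily so iter_text can be used without the SDK installed.
    from cerebras.cloud.sdk import Cerebras

    client = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # assumption: same flag as the OpenAI client
    )
    parts = []
    for piece in iter_text(stream):
        print(piece, end="", flush=True)  # render tokens as they arrive
        parts.append(piece)
    print()
    return "".join(parts)
```

Keeping the chunk parsing in its own function means you can test it with fake chunks and never touch the network. The plain, non-streaming call looks like this: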
import os

from cerebras.cloud.sdk import Cerebras

# Reads the key exported above; raises KeyError if it is not set.
client = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])

resp = client.chat.completions.create(
    model="llama3.1-70b",
    messages=[
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": "summarise this insurance claim in 2 lines"},
    ],
    max_tokens=200,  # cap the completion length
)
print(resp.choices[0].message.content)
The Catch
Paid only, no free tier worth mentioning. No fine-tuning — you take whatever models they decide to host. Model selection is narrow: Llama 3.1, a handful of others, no DeepSeek, no Qwen. And the API rate limits on the lower tiers will throttle a serious agent loop within minutes.
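If you do hit those limits in an agent loop, the usual workaround is retrying with exponential backoff and jitter. The sketch below is generic Python, not anything from the Cerebras SDK: `RateLimited` is a hypothetical stand-in for whatever 429-style exception your client raises, and `with_backoff` is a helper name made up here.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for the 429-style error your client raises."""


def with_backoff(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (RateLimited,),
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying with exponential backoff plus jitter on retry_on errors."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps parallel agent loops from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

To use it, wrap each model call in a zero-argument function, for example `with_backoff(lambda: client.chat.completions.create(...), retry_on=(YourRateLimitError,))`, where `YourRateLimitError` is whatever exception class your client actually raises on a 429.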
The Verdict
For latency-sensitive workloads like voice agents, real-time copilots, or anything where TTFT (time to first token) matters more than cost, Cerebras is the move. For batch jobs, it's overkill. I'd happily route an Aidxn Design voice prototype through it tomorrow. Try whatever free credits you can get before they're rate-limited into oblivion.
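If TTFT is your deciding metric, measure it on your own prompts instead of trusting anyone's headline number, mine included. Here's a minimal sketch that times any stream of text pieces. `measure_ttft` is a name invented for this example, and it takes a function that opens the stream so the clock starts before the request goes out.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_ttft(
    start_stream: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, str]:
    """Return (seconds to first text, total seconds, full text) for one stream."""
    start = clock()  # start timing before the request is issued
    first_at = None
    parts = []
    for piece in start_stream():
        if first_at is None and piece:
            first_at = clock()  # first non-empty piece marks TTFT
        parts.append(piece)
    end = clock()
    # If nothing arrived, report TTFT as the total time.
    ttft = (first_at if first_at is not None else end) - start
    return ttft, end - start, "".join(parts)
```

Pass it a zero-argument function that sends the request and yields text as it arrives, run it a few times per provider, and compare the first number. That figure is the one your users actually feel.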