Together AI Is Cheaper Than OpenAI And Faster

$0.20 per million tokens for Llama 3 70B and it's faster than your self-hosted box

If you've been paying OpenAI prices for an LLM that mostly does string formatting, behold: Together AI hosts open-weight models for pennies on the dollar and serves them at 200+ tokens/sec. Do not @ me about lock-in: the API is OpenAI-compatible.

The Setup

Together is a managed inference cloud running Llama 3, Mixtral, Qwen, DeepSeek, and every other open model worth caring about. $0.20 per million input tokens for a 70B, no GPU to rent, no Kubernetes to babysit. Plot twist: the curl call is exactly what you think it is.

curl -X POST https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3-70b-chat-hf",
    "messages": [{"role": "user", "content": "explain RLS in one paragraph"}],
    "max_tokens": 300
  }'
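
To put the $0.20 per million input tokens figure in perspective, here's a back-of-the-envelope sketch. The monthly volume is a made-up number for illustration, and it only counts input tokens, so treat the result as a floor and check the current pricing page before budgeting.

```python
def cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Cost in USD for a token count at a given per-million-token price."""
    return tokens / 1_000_000 * price_per_million_usd


# Rate quoted in this article for a 70B model (input tokens only).
PRICE_PER_MILLION = 0.20

# Hypothetical workload: 50 million input tokens per month.
monthly_input_tokens = 50_000_000

monthly_cost = cost_usd(monthly_input_tokens, PRICE_PER_MILLION)
print(f"${monthly_cost:.2f} per month")  # prints "$10.00 per month"
```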

The Money Pattern

It's a drop-in for the OpenAI SDK. Change the base URL, API key, and model name and you're billed by Together instead of Sam Altman. I swapped a Rebuild Relief internal classifier over in 90 seconds and the bill dropped 80%.

import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3-70b-chat-hf",
    messages=[{"role": "user", "content": "classify this hail damage report"}],
    temperature=0.2,
)
print(resp.choices[0].message.content)

The Catch

Rate limits on the free tier are real and you will hit them in dev within an afternoon. Model selection lags new HuggingFace releases by a few days — don't expect day-zero support for whatever model just trended on X. And the streaming endpoint occasionally hiccups under load, so wrap your client in retries.
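
For the retry advice, here's a minimal sketch of exponential backoff with jitter. The helper name `with_retries` and its defaults are my own, not anything Together ships, and catching bare `Exception` is a placeholder: in real code you'd narrow `retry_on` to rate-limit and connection errors.

```python
import random
import time


def with_retries(call, max_attempts=4, base_delay=0.5,
                 retry_on=(Exception,), sleep=time.sleep):
    """Run call() and retry on failure with exponential backoff plus jitter.

    Hypothetical helper for illustration; tune the defaults for your workload.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Waits roughly 0.5s, 1s, 2s, ... plus up to 100 ms of jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            sleep(delay)
```

Usage is a one-line wrap, e.g. `with_retries(lambda: client.chat.completions.create(...))`. The OpenAI Python SDK also accepts a `max_retries` argument on the client, which may be enough for simple cases; whether it covers the mid-stream hiccups described above is something to test rather than assume.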

The Verdict

For bursty workloads, side projects, and anything that doesn't justify a dedicated GPU, Together is genuinely the best deal on the open web. I default to it for prototyping anything that doesn't ship with Claude in the loop. Move your non-critical inference today.
