A million tokens, Apache 2.0, and a needle-in-haystack score that's actually believable
Alibaba just shipped Qwen 3.5 with a 1,000,000-token context window. For reference, that's the entire Lord of the Rings trilogy plus the appendices, plus your code review backlog, with room to spare.
The Setup
Qwen 3.5 ships in 7B, 32B, and 110B flavours, all Apache 2.0. The headline is the long context: 1M tokens with a needle-in-haystack score above 95% across the full window. Gemini 1.5 Pro scores about 88% at 1M. Serving the 32B with vLLM looks like this:
vllm serve Qwen/Qwen3.5-32B-Instruct \
  --max-model-len 1048576 \
  --enable-chunked-prefill \
  --port 8000
The Money Pattern
OpenAI-compatible endpoint means existing clients just work. Stuff your entire repo into the system prompt and ask it to find the bug. Behold:
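from openai import OpenAI
from pathlib import Path

# Point the stock OpenAI client at the local vLLM server.
# The key is a placeholder; the server only checks it if started with an API key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

# yeet the whole repo into context, one "# path" header per file
repo = "\n\n".join(
    f"# {p}\n{p.read_text()}"
    for p in Path("src").rglob("*.ts")
)

resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-32B-Instruct",
    messages=[
        {"role": "system", "content": f"Codebase:\n{repo}"},
        {"role": "user", "content": "Where is auth state hydrated and why is it racing?"},
    ],
)

# Print the model's answer
print(resp.choices[0].message.content)
The Catch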
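One catch before the hardware one: the repo has to fit in the window at all. Here is a minimal sketch of a pre-flight check, assuming a rough 4 characters per token (a common rule of thumb, not a property of the Qwen tokenizer) and a hypothetical 8,192-token reserve for the reply. The helper names are made up for illustration; for a real count you would run the model's own tokenizer.

```python
import math

# Assumption: ~4 characters per token on average. Real counts depend on the
# tokenizer and on how much of the repo is code vs. prose.
CHARS_PER_TOKEN = 4
# Matches --max-model-len in the serve command above.
CONTEXT_LIMIT = 1_048_576


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Rough token estimate from character count (rounded up)."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(text: str, limit: int = CONTEXT_LIMIT, reserve: int = 8_192) -> bool:
    """True if the estimated prompt plus a reserve for the reply fits in `limit`.

    `reserve` is a hypothetical allowance for the question and the answer.
    """
    return estimate_tokens(text) + reserve <= limit
```

With the `repo` string from the snippet above, `fits_in_context(repo)` is a cheap sanity check before you send a request that the server would reject anyway. Treat a near-miss as a miss: the estimate is crude by design.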
1M tokens at full bore needs serious VRAM. The 32B at 1M context is roughly 8x H100s. You can rent it, you can't laptop it. 128k context is fine on a single H100, which is what most of us actually need anyway.
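To see where the VRAM goes, here is a back-of-the-envelope sketch of KV-cache size. The model shape is an assumption: 64 layers, 8 KV heads, and a head dimension of 128 are borrowed from earlier 32B Qwen releases, not confirmed for Qwen 3.5, and everything is held in 16-bit precision. Plug in the real values from the model's config before trusting the numbers.

```python
GIB = 1024 ** 3


def kv_cache_bytes(
    tokens: int,
    layers: int = 64,         # assumed, not confirmed for Qwen 3.5
    kv_heads: int = 8,        # assumed (grouped-query attention)
    head_dim: int = 128,      # assumed
    bytes_per_elem: int = 2,  # fp16 / bf16
) -> int:
    """Bytes needed to cache keys and values for `tokens` tokens of one sequence."""
    # Factor of 2: one tensor for keys, one for values, per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


def weights_bytes(params_billions: float = 32, bytes_per_param: int = 2) -> int:
    """Bytes needed for the model weights at the given precision."""
    return int(params_billions * 1e9 * bytes_per_param)


for ctx in (131_072, 1_048_576):
    print(f"{ctx:>9} tokens: {kv_cache_bytes(ctx) / GIB:.1f} GiB of KV cache")
# 131072 tokens -> 32.0 GiB, 1048576 tokens -> 256.0 GiB under these assumptions

print(f"weights: {weights_bytes() / GIB:.1f} GiB")
```

Under these assumptions a single 1M-token sequence needs about 256 GiB of cache on top of roughly 60 GiB of weights, which is why this lands on a multi-GPU node. The same arithmetic puts 128k at about 32 GiB of cache plus the weights, so fitting that on one 80 GB card would likely mean quantized weights or a quantized KV cache.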
The Verdict
The long-context game just got a free, open-weight competitor that's better than the paid leader. Most workloads don't need 1M tokens, but when yours does, this is the model. Spin it up on RunPod, point your tooling at it, and never pay per-token for a giant context window again.