Qwen 3.5 Ships 1M Context And It's Free

Alibaba just shipped Qwen 3.5 with a 1,000,000 token context window. For reference, that's the entire Lord of the Rings trilogy plus the appendices, plus your code review backlog, with room to spare.

The Setup

Qwen 3.5 ships in 7B, 32B, and 110B flavours, all Apache 2.0. The headline is the long context — 1M tokens with a needle-in-haystack score above 95% across the full window. Gemini 1.5 Pro scores about 88% at 1M.

vllm serve Qwen/Qwen3.5-32B-Instruct \
  --max-model-len 1048576 \
  --enable-chunked-prefill \
  --port 8000

The Money Pattern

vLLM exposes an OpenAI-compatible endpoint, so existing clients just work. Stuff your entire repo into the system prompt and ask it to find the bug. Behold:

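from openai import OpenAI
from pathlib import Path

# The client insists on a non-empty key; vLLM only checks it
# if the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

# yeet the whole repo into context, one "# path" header per file
repo = "\n\n".join(
    f"# {p}\n{p.read_text(encoding='utf-8')}"
    for p in sorted(Path("src").rglob("*.ts"))
)

resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-32B-Instruct",  # must match the name passed to vllm serve
    messages=[
        {"role": "system", "content": f"Codebase:\n{repo}"},
        {"role": "user", "content": "Where is auth state hydrated and why is it racing?"},
    ],
)

# The answer lives on the first choice
print(resp.choices[0].message.content)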
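
A million tokens is a lot, but it is not infinite, so a sanity check before firing off the request can save a rejected call. Here is a rough sketch, assuming about 4 characters per token. That ratio is a common rule of thumb, not Qwen's real tokenizer ratio, and the helper names `estimate_tokens` and `fits_in_context` are made up for illustration.

```python
MAX_MODEL_LEN = 1_048_576  # same value as --max-model-len in the serve command
CHARS_PER_TOKEN = 4        # assumption: rule of thumb, not measured on Qwen's tokenizer


def estimate_tokens(text: str, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Rough token estimate from character count."""
    # Ceiling division, so any non-empty string counts as at least one token.
    return -(-len(text) // chars_per_token)


def fits_in_context(
    text: str,
    reserve_for_output: int = 4096,   # hypothetical budget for the reply
    max_model_len: int = MAX_MODEL_LEN,
) -> bool:
    """True if the estimated prompt still leaves room for the reply."""
    return estimate_tokens(text) + reserve_for_output <= max_model_len


if __name__ == "__main__":
    prompt = "x" * 4_000_000  # stand-in for the joined repo string
    print(estimate_tokens(prompt), fits_in_context(prompt))
```

For anything close to the limit, swap the heuristic for a real count from the model's own tokenizer.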

The Catch

1M tokens at full bore needs serious VRAM. The 32B at 1M context is roughly 8x H100s. You can rent it, you can't laptop it. 128k context, which is what most of us actually need anyway, is fine on a single H100.
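
Where does all that VRAM go? Mostly the KV cache, which grows linearly with context length. The back-of-envelope sketch below uses assumed architecture numbers (64 layers, 8 KV heads, head dimension 128, 16-bit values). These are illustrative placeholders, not published Qwen 3.5 specs, and `kv_cache_gib` is a hypothetical helper.

```python
def kv_cache_gib(tokens: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_value: int = 2) -> float:
    """KV cache size in GiB for a single sequence of `tokens` tokens."""
    # Factor of 2 covers keys and values, stored per layer and per KV head.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * bytes_per_token / 2**30


# Assumed architecture for illustration only, NOT confirmed Qwen 3.5 numbers.
HYPOTHETICAL_32B = dict(layers=64, kv_heads=8, head_dim=128)

full = kv_cache_gib(1_048_576, **HYPOTHETICAL_32B)
weights_gib = 32e9 * 2 / 2**30  # 32B parameters at 2 bytes each (bf16)

print(f"KV cache at 1M tokens: {full:.0f} GiB")
print(f"Weights in bf16:       {weights_gib:.0f} GiB")
```

Under these assumptions the cache alone is 256 GiB and the weights add about 60 GiB, which is more than four 80 GB H100s can hold before activations and overhead. Rounding up to eight cards is plausible on that basis, though quantized weights or a quantized KV cache would change the numbers.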

The Verdict

The long-context game just got a free, open-weight competitor that beats the paid leader on needle-in-haystack recall. Most workloads don't need 1M tokens, but when yours does, this is the model. Spin it up on RunPod, point your tooling at it, and never pay per-token for a giant context window again.
