Llama 4 Finally Has Vision (For Real)

👁️🦙📸

Native multimodal, three sizes, and it actually reads screenshots

In case you've been living under a rock: Meta just shipped Llama 4 with vision baked in from pre-training. Not bolted on with a CLIP adapter, but jointly trained on text and images from the start.

The Setup

Three sizes: 8B (laptop), 70B (workstation), 405B (data centre flex). All ship under the Llama Community License, and all take images and text in and produce text out. The 70B beats GPT-4o on screenshot understanding benchmarks. The internet collectively lost its mind.

# the laptop-friendly one
ollama pull llama4:8b-vision
# ollama run has no --image flag; put the image path inside the prompt
ollama run llama4:8b-vision "what's broken in this screenshot? ./bug-report.png"
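
The same request can be made from code. This is a minimal sketch, assuming a local Ollama server on its default port (11434) and the `llama4:8b-vision` tag used in the commands above. `build_generate_payload` and `ask_ollama` are hypothetical helper names for illustration, not part of any Ollama library. Ollama's `/api/generate` endpoint takes images as base64 strings in an `images` list:

```python
import base64
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build the JSON body for /api/generate with one attached image."""
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects each image as a base64-encoded string.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        # Ask for one JSON response instead of a stream of chunks.
        "stream": False,
    }


def ask_ollama(model: str, prompt: str, image_path: str) -> str:
    """Send one image plus a prompt to the server and return the reply text."""
    with open(image_path, "rb") as f:
        payload = build_generate_payload(model, prompt, f.read())
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `ask_ollama("llama4:8b-vision", "what's broken in this screenshot?", "./bug-report.png")` should return the model's answer as a plain string.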

The Money Pattern

Running the 70B through Hugging Face transformers is the sweet spot for production. Feed it screenshots from a support pipeline and it will classify, transcribe, and route them in one call.

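from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

MODEL_ID = "meta-llama/Llama-4-70B-Instruct"

proc = AutoProcessor.from_pretrained(MODEL_ID)
# Vision-language checkpoints generally load through the image-text-to-text
# auto class rather than AutoModelForCausalLM.
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)

img = Image.open("./claim-photo.jpg")
msgs = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe the storm damage visible in this photo."},
]}]
# add_generation_prompt=True appends the assistant header so the model answers
prompt = proc.apply_chat_template(msgs, add_generation_prompt=True)
# With device_map="auto", follow the model's device instead of hard-coding "cuda"
inputs = proc(images=img, text=prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = out[0][inputs["input_ids"].shape[-1]:]
print(proc.decode(new_tokens, skip_special_tokens=True))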
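
To get the classify, transcribe, and route behaviour out of a single call, one option is to ask for JSON and parse the reply defensively. The sketch below is an illustration under stated assumptions, not the author's pipeline: the category names, the queue names in `ROUTES`, the prompt wording, and the `parse_triage` helper are all hypothetical.

```python
import json
import re

# Hypothetical mapping from ticket category to support queue.
ROUTES = {
    "billing": "finance-queue",
    "bug": "engineering-queue",
    "other": "general-queue",
}

# Hypothetical prompt to send alongside the screenshot.
TRIAGE_PROMPT = (
    "Look at the screenshot and reply with JSON only, using the keys "
    '"category" (one of: billing, bug, other) and "transcript" '
    "(the visible text, verbatim)."
)


def parse_triage(raw: str) -> dict:
    """Pull the JSON object out of the model's reply and attach a queue."""
    fallback = {"category": "other", "transcript": "", "queue": ROUTES["other"]}
    # Models often wrap JSON in chatter, so grab the outermost {...} span.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return fallback
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return fallback
    category = data.get("category", "other")
    # Unknown or malformed categories go to the general queue.
    if not isinstance(category, str) or category not in ROUTES:
        category = "other"
    return {
        "category": category,
        "transcript": str(data.get("transcript", "")),
        "queue": ROUTES[category],
    }
```

Sending `TRIAGE_PROMPT` as the text part of the message and passing the decoded output to `parse_triage` yields a dict the rest of the pipeline can route on. Falling back to a general queue keeps a malformed reply from dropping a ticket.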

The Catch

The 405B is not for mortals. It needs 8x H100s at minimum, and at that point you're better off paying for Anthropic's API. The 8B vision model is fine but hallucinates UI elements that aren't there. The 70B is the only one you actually want.
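
The hardware requirements come down to simple arithmetic on weight size. Here is a rough sketch, assuming dense models, 80 GB cards, and counting weights only. The KV cache, activations, and the vision encoder add more on top, so treat these numbers as lower bounds. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 10^9 bytes)."""
    # params_billion * 1e9 params * (bits / 8) bytes each, divided by 1e9.
    return params_billion * bits_per_param / 8


for size in (8, 70, 405):
    row = ", ".join(
        f"{bits}-bit: {weight_memory_gb(size, bits):.1f} GB" for bits in (16, 8, 4)
    )
    print(f"{size}B -> {row}")
```

Under these assumptions, eight 80 GB H100s give 640 GB in total. The 405B's 810 GB of 16-bit weights do not fit, while 405 GB at 8-bit does. By the same arithmetic, the 70B at 16-bit (140 GB) exceeds a single 80 GB card, so a single-GPU deployment implies quantized weights.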

The Verdict

For any pipeline involving screenshots, photos, or document OCR, Llama 4 70B Vision is now the default. I'm running it on a single A100 for hail-damage photo triage, and it slots straight into the place where GPT-4o used to live, at 10% of the cost. This is the open-source vision moment.
