Native multimodal, three sizes, and it actually reads screenshots
If you've been living under a rock, Meta just shipped Llama 4 with native vision baked in from pre-training. Not bolted on with a CLIP adapter. Real, native, joint-trained multimodal.
The Setup
Three sizes: 8B (laptop), 70B (workstation), 405B (data centre flex). All Llama Community License, all multimodal in and text out. The 70B beats GPT-4o on screenshot understanding benchmarks. The internet collectively lost its mind.
# the laptop-friendly one
ollama pull llama4:8b-vision
# ollama run has no --image flag; multimodal models read an image path included in the prompt
ollama run llama4:8b-vision "what's broken in this screenshot? ./bug-report.png"
The Money Pattern
The 70B in transformers is the sweet spot for production. Feed it screenshots from a support pipeline and it'll classify, transcribe, and route in one call.
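One way to get classification, transcription, and routing out of a single call is to ask the model for a JSON object and parse it defensively. The sketch below is an assumption about how that could look, not anything from Meta's docs: the category names, queue names, `build_triage_prompt`, and `parse_triage_reply` are all hypothetical, and the model call itself is left out.

```python
import json

# Hypothetical categories and queues; swap in whatever your support desk uses.
ROUTES = {
    "bug": "engineering",
    "billing": "finance",
    "how_to": "support",
}
FALLBACK_QUEUE = "human_review"


def build_triage_prompt() -> str:
    """Build one instruction that asks for category and transcript as JSON."""
    categories = ", ".join(sorted(ROUTES))
    return (
        "Look at the screenshot and reply with a single JSON object with two keys: "
        f'"category" (one of: {categories}) and '
        '"transcript" (all visible text, verbatim).'
    )


def _fallback() -> dict:
    """Result used whenever the reply cannot be trusted."""
    return {"category": None, "transcript": "", "queue": FALLBACK_QUEUE}


def parse_triage_reply(reply: str) -> dict:
    """Extract the JSON object from a model reply and attach a queue.

    Models often wrap JSON in prose or code fences, so this takes the span
    from the first "{" to the last "}" and falls back to human review if
    that span is missing or does not parse.
    """
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return _fallback()
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return _fallback()
    if not isinstance(data, dict):
        return _fallback()

    category = data.get("category")
    if not isinstance(category, str) or category not in ROUTES:
        category = None
    return {
        "category": category,
        "transcript": str(data.get("transcript", "")),
        # Routing is derived from the category in code, not trusted from the model.
        "queue": ROUTES.get(category, FALLBACK_QUEUE),
    }
```

The string from `build_triage_prompt()` would go in as the text part of the chat message, and whatever the model generates would be passed through `parse_triage_reply`. Deriving the queue in code rather than asking the model for it keeps a hallucinated queue name from ever reaching the router.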
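from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

model_id = "meta-llama/Llama-4-70B-Instruct"
proc = AutoProcessor.from_pretrained(model_id)
# Vision-language checkpoints load through the image-text-to-text auto class;
# AutoModelForCausalLM is for text-only models.
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

img = Image.open("./claim-photo.jpg")
msgs = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe the storm damage visible in this photo."},
]}]
# add_generation_prompt=True appends the assistant turn so the model answers
# rather than continuing the user message.
prompt = proc.apply_chat_template(msgs, add_generation_prompt=True)
# Send inputs to wherever device_map placed the model instead of hard-coding "cuda".
inputs = proc(images=img, text=prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = out[0][inputs["input_ids"].shape[-1]:]
print(proc.decode(new_tokens, skip_special_tokens=True))
The Catch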
The 405B is not for mortals. It needs 8x H100s minimum, and at that point you're better off paying for Anthropic's API. The 8B vision is fine but hallucinates UI elements that aren't there. The 70B is the only one you actually want.
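For a rough sense of where those hardware numbers come from, here is a back-of-the-envelope sketch. It counts weights only (no KV cache, activations, or vision encoder overhead) and assumes 80 GB cards, so treat it as a lower bound rather than a sizing guide. The helper names are made up for illustration.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    params_billions * 1e9 params * (bits / 8) bytes, divided by 1e9 bytes per GB.
    """
    return params_billions * bits_per_param / 8


def fits(params_billions: float, bits_per_param: int,
         gpus: int, gb_per_gpu: float = 80.0) -> bool:
    """True if the weights alone fit in the combined GPU memory.

    gb_per_gpu=80.0 is an assumption (80 GB cards); real deployments need
    extra headroom on top of the weights.
    """
    return weight_memory_gb(params_billions, bits_per_param) <= gpus * gb_per_gpu


if __name__ == "__main__":
    for size in (8, 70, 405):
        for bits in (16, 8, 4):
            gb = weight_memory_gb(size, bits)
            print(f"{size}B @ {bits}-bit: {gb:.0f} GB of weights")
```

By this estimate the 405B needs 810 GB for 16-bit weights, which is more than eight 80 GB cards hold (640 GB), so an 8-GPU node implies 8-bit or lower precision. The 70B needs 140 GB at 16-bit and 35 GB at 4-bit, so running it on one 80 GB card implies quantized weights.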
The Verdict
For any pipeline involving screenshots, photos, or document OCR, Llama 4 70B Vision is now the default. I'm running it on a single A100 for hail-damage photo triage and it's slotting straight into where GPT-4o used to live, at 10% of the cost. This is the open-source vision moment.