Three sizes, native vision, and an Apple Silicon glow-up
Google has been losing the open-weights vibe war for a year. Gemma 3 is the comeback. Three sizes, native multimodal, and the 27B variant runs at usable speeds on an M4 Mac. Yes really.
The Setup
Gemma 3 technically ships in four sizes, but the 1B is text-only; the three that matter here are 4B (phone), 12B (laptop), and 27B (workstation). All three speak image-in, text-out. The 12B is the sweet spot — fast on a 32GB Mac, great at screenshot triage and document QA.
ollama pull gemma3:12b
# multimodal prompt: put the image path in the prompt and ollama attaches it
ollama run gemma3:12b "What is in this image? ./photo.jpg"
# or grab the full weights for fine-tuning
huggingface-cli download google/gemma-3-12b-it

The Money Pattern
Image understanding through the transformers pipeline. I'm using this locally for a tagging step on Aidxn portfolio assets — runs on the Mac, no API key, no rate limit.
from transformers import pipeline
from PIL import Image
pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-12b-it",
    device_map="auto",   # needs accelerate; picks MPS on Apple Silicon
    torch_dtype="auto",
)
img = Image.open("./design-mockup.png")
result = pipe(
    images=img,
    text="List every UI component visible in this screenshot as JSON.",
    max_new_tokens=512,
)
print(result[0]["generated_text"])
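
One gotcha before you wire this into anything: you asked for JSON, but chat models love to wrap it in markdown fences or lead with a sentence of preamble. Here's a defensive parse to put between the pipeline and everything downstream; the extract_json helper is my sketch, not part of transformers:

import json
import re

def extract_json(raw: str):
    # Strip a ```json ... ``` fence if the model added one
    fenced = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    candidate = (fenced.group(1) if fenced else raw).strip()
    # Jump to the first { or [ so leading prose doesn't break the parse
    starts = [i for i in (candidate.find("{"), candidate.find("[")) if i != -1]
    start = min(starts) if starts else 0
    # raw_decode tolerates trailing prose after the JSON value
    obj, _ = json.JSONDecoder().raw_decode(candidate[start:])
    return obj

components = extract_json(result[0]["generated_text"])

The Catch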
The Gemma license is not Apache. It's mostly permissive, but there's a Prohibited Use Policy you have to read, and Google reserves the right to update it. For most commercial use it's fine — but read it before you ship a product on top. This is not the license you skim.
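
A practical consequence: the weights on Hugging Face are gated behind that license. You accept the terms on the model page once, then authenticate before transformers or huggingface-cli will download anything. Minimal sketch; the token value is a placeholder:

from huggingface_hub import login

# Accept the Gemma terms on the model page first, then paste a
# read-scoped token from huggingface.co/settings/tokens
login(token="hf_xxx")  # placeholder; or just run `huggingface-cli login` once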
The Verdict
For local multimodal on Apple Silicon, Gemma 3 12B is the new pick. Beats LLaVA, beats MiniCPM-V, runs faster than both. If you need truly permissive licensing, look at an Apache-2.0 vision model like Qwen2.5-VL-7B instead. If you just want the best thing your laptop can run tonight, this is it.
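
Postscript: if you'd rather skip the transformers stack entirely, the same tagging call works through Ollama's Python client (pip install ollama) against the model pulled earlier. A sketch, assuming the Ollama server is running locally:

import ollama

# Same screenshot-tagging prompt, routed through the local Ollama server
response = ollama.chat(
    model="gemma3:12b",
    messages=[{
        "role": "user",
        "content": "List every UI component visible in this screenshot as JSON.",
        "images": ["./design-mockup.png"],  # local file path; the client handles encoding
    }],
)
print(response["message"]["content"])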