Cold-start an A100, swap weights, kill it
Spoiler: Fly.io shipped GPU-attached volumes for Fly Machines and the inference crowd lost it. You can now mount a persistent NVMe volume to an A100 machine, hot-swap weights, and pay per second. Do not @ me — this is genuinely good.
The Setup
Fly Machines are Firecracker microVMs that boot in milliseconds. Pair one with a GPU and a persistent volume and you've got the cheapest serverless inference rig I've seen in 2026. I ran it from my M4 Mac with the fly CLI and a tiny Astro 5 dashboard on top.
fly volumes create model_weights --region ord --size 100 --gpu
fly machine run \
  --region ord \
  --vm-gpu-kind a100-40gb \
  --vm-memory 32gb \
  --volume model_weights:/weights \
  --env MODEL_PATH=/weights/llama-3-70b-q4 \
  ghcr.io/aidxn/llm-server:latest
The Money Pattern
Here's the flex: preload the weights onto the volume once, and every later cold boot skips the 40GB download and comes up in seconds. The volume outlives the machine, so killing a machine keeps the weights on disk. That makes per-tenant inference for multi-tenant SaaS economically reasonable for the first time.
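To put a number on what skipping that 40GB download buys you, here's a back-of-envelope sketch. The link speeds and the per-second rate are placeholders I made up for illustration, not Fly.io figures, and the function names are hypothetical; plug in your own measurements.

```typescript
// Rough transfer-time arithmetic. Every input here is a placeholder, not a measurement.

/** Seconds needed to pull `sizeGB` (decimal gigabytes) over a `linkGbps` gigabit-per-second link. */
function downloadSeconds(sizeGB: number, linkGbps: number): number {
  if (sizeGB < 0 || linkGbps <= 0) {
    throw new RangeError("sizeGB must be >= 0 and linkGbps must be > 0");
  }
  return (sizeGB * 8) / linkGbps; // 8 bits per byte
}

/** GPU time billed while the machine does nothing but download, at a per-second rate. */
function downloadCost(sizeGB: number, linkGbps: number, ratePerSecond: number): number {
  return downloadSeconds(sizeGB, linkGbps) * ratePerSecond;
}

console.log(downloadSeconds(40, 1));  // 320
console.log(downloadSeconds(40, 10)); // 32
```

Even on an assumed 10 Gbps link that's half a minute of billed GPU time per cold start before the first token. With the weights already on the volume, that term drops out entirely.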
// API route (Astro 5, served from Netlify)
import { createMachine } from "@fly/machines";
export const POST = async ({ request }) => {
  const { tenantId, prompt } = await request.json();
  const m = await createMachine({
    image: "ghcr.io/aidxn/llm-server",
    gpu: "a100-40gb",
    volume: `weights_${tenantId}`,
    autoStop: true,
    idleTimeoutSec: 30,
  });
  return Response.json(await m.invoke({ prompt }));
};
The Catch
It's still expensive. A100 minutes add up fast and the scaling limits are real: you can't horizontally fan out a GPU machine the way you can a CPU machine. Volume snapshots between regions are flaky. And if your traffic is steady, you'll save money on a dedicated RunPod box instead.
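Where "spiky" ends and "steady" begins is just arithmetic. Here's a minimal sketch of the break-even, assuming a 30-day month and per-second billing quoted as an hourly rate. The prices in the example are invented placeholders, not real Fly.io or RunPod pricing, and the function names are hypothetical.

```typescript
// Break-even between per-second GPU billing and a flat-rate dedicated box.
// Prices passed in are placeholders; substitute the current rates you are quoted.

const HOURS_PER_MONTH = 30 * 24; // assumes a 30-day month (720 hours)

/** Fraction of the month (0..1) the on-demand GPU can be busy before a dedicated box is cheaper. */
function breakEvenUtilization(hourlyRate: number, dedicatedMonthly: number): number {
  if (hourlyRate <= 0 || dedicatedMonthly < 0) {
    throw new RangeError("hourlyRate must be > 0 and dedicatedMonthly must be >= 0");
  }
  return Math.min(1, dedicatedMonthly / (hourlyRate * HOURS_PER_MONTH));
}

/** Which option costs less for a given number of busy GPU hours per month. */
function cheaperOption(
  busyHours: number,
  hourlyRate: number,
  dedicatedMonthly: number,
): "per-second" | "dedicated" {
  return busyHours * hourlyRate <= dedicatedMonthly ? "per-second" : "dedicated";
}

// Placeholder prices: 2 per hour on demand versus 720 per month dedicated.
console.log(breakEvenUtilization(2, 720)); // 0.5
console.log(cheaperOption(100, 2, 720));   // "per-second"
console.log(cheaperOption(500, 2, 720));   // "dedicated"
```

The sketch ignores volume storage fees and the seconds billed while a machine idles toward its timeout, so treat the real break-even as somewhat lower than what it prints.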
The Verdict
For spiky, per-tenant inference workloads, Fly.io GPU volumes are the new default. Pair with Supabase auth, an Astro 5 dashboard, and per-second billing and you've got a real serverless GPU story. Just don't run it 24/7 unless you like surprise invoices.