Serverless GPUs Just Got A Glow-Up
If you've been living under a rock: Modal cut its serverless GPU pricing again, cold starts now sit under two seconds, and Python developers can deploy an A100 by adding a decorator. Replicate is going to need a really good Q1.
The Setup
Modal is what you wished AWS Lambda was. You write a Python function, slap a decorator on it, run a deploy command, and a GPU function with autoscaling is in production. No Dockerfiles, no Kubernetes, no YAML at all. The container image layer is built and cached automatically.
pip install modal
modal token new
# Deploy with one command — that's it
modal deploy app.py
The Money Pattern
The decorator-driven deploy story is wildly productive. You declare GPU type, memory, timeout, secrets, image — all in Python. The same code runs locally for debugging and in the cloud for production. Idle scales to zero. You pay per second the GPU is actually running.
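To see why scale-to-zero matters, here's a back-of-envelope sketch of per-second billing. The rate is a made-up placeholder, not Modal's actual price, and the function names are mine. It also assumes the worst case: every request arrives on its own and pays for the full idle window before the container scales down (the scaledown_window=60 in the app code).

```python
SECONDS_PER_DAY = 24 * 60 * 60


def serverless_daily_cost(requests_per_day, seconds_per_request,
                          rate_per_second, idle_tail_seconds=0.0):
    """Daily cost when billing is per second of container uptime.

    Assumption: requests never overlap, so each one pays for its own
    run time plus the idle tail before the container scales to zero.
    """
    billed_seconds = requests_per_day * (seconds_per_request + idle_tail_seconds)
    return billed_seconds * rate_per_second


def always_on_daily_cost(rate_per_second):
    """Daily cost of keeping one GPU up around the clock."""
    return SECONDS_PER_DAY * rate_per_second


if __name__ == "__main__":
    rate = 0.001  # HYPOTHETICAL dollars per GPU-second, not a quoted price
    bursty = serverless_daily_cost(500, 4.0, rate, idle_tail_seconds=60.0)
    pinned = always_on_daily_cost(rate)
    print(f"serverless: ${bursty:.2f}/day, always-on: ${pinned:.2f}/day")
```

Plug in the real per-second rate from the pricing page before trusting any of these numbers. The shape is the point: bursty traffic is where scale-to-zero pays off, and a long idle window eats into the savings. With that in mind, the app itself: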
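import modal
app = modal.App("llama-server")
image = modal.Image.debian_slim().pip_install("vllm", "fastapi")
@app.function(gpu="A100", image=image, scaledown_window=60)
@modal.web_endpoint(method="POST")  # newer Modal releases rename this to modal.fastapi_endpoint
def generate(prompt: str):
    # Imported inside the function so vllm only has to exist in the container image
    from vllm import LLM
    # Demo only: this reloads the model on every request.
    # The Llama 3 repo on Hugging Face is gated, so the container needs an access token.
    llm = LLM("meta-llama/Meta-Llama-3-8B-Instruct")
    # generate() returns one result per prompt; take the first completion's text
    return llm.generate(prompt)[0].outputs[0].text
The Catch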
Vendor lock-in is real. Modal-specific decorators, Modal-specific image builder, Modal-specific volumes. Move off and you're rewriting a lot of glue code. Debugging cold starts is also annoying — the logs are good but you can't ssh in, and reproducing prod-only bugs locally is a vibes-based exercise.
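One way to shrink the rewrite bill is to keep the actual logic in plain Python and let the Modal-decorated function be a thin wrapper around it. Here's a minimal sketch of that split. Every name in it (generate_text, run_model, MAX_PROMPT_CHARS, fake_model) is hypothetical and nothing here is a Modal API. It's one pattern that tends to help, not a cure for lock-in.

```python
from typing import Callable

MAX_PROMPT_CHARS = 4000  # hypothetical limit, chosen for illustration


def generate_text(prompt: str, run_model: Callable[[str], str]) -> str:
    """Platform-neutral core: validate the prompt, call the model, tidy the output.

    run_model is whatever actually produces text. On Modal it would wrap
    the vLLM call; in a local test it can be a stub.
    """
    cleaned = prompt.strip()
    if not cleaned:
        raise ValueError("prompt must not be empty")
    if len(cleaned) > MAX_PROMPT_CHARS:
        cleaned = cleaned[:MAX_PROMPT_CHARS]
    return run_model(cleaned).strip()


def fake_model(prompt: str) -> str:
    """Stand-in model for local runs, no GPU or cloud account needed."""
    return f" echo: {prompt} "


if __name__ == "__main__":
    print(generate_text("  hello  ", fake_model))
```

The decorated endpoint then shrinks to a one-liner that passes the real model call into generate_text. Moving to another platform means rewriting that wrapper and the image setup, not the logic, and the stub gives you something to run locally when a bug won't reproduce.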
The Verdict
For solo developers and small teams who want to ship GPU workloads without becoming part-time DevOps engineers, Modal is the move. Replicate still wins for the "model marketplace" use case where you want someone else to host the popular models. But for custom code? Modal is now the obvious answer. The ergonomics gap closed and pricing did the rest.