Treat prompts like code — write tests, fail builds
Plot twist: you've been treating prompts as text files in a repo with zero tests, but they behave like code and they break like code. Promptfoo is the open-source CLI that fixes that: YAML in, eval results out, CI fails on regression.
The Setup
I wired this into GitHub Actions for a Pipedrive call-summary tool. Every prompt change runs the suite, and the PR shows the old and new outputs side by side. It's Playwright energy, but for LLM behaviour.
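For reference, here is a minimal sketch of what that wiring can look like. The workflow file name, the trigger paths, and the secret names are assumptions for illustration, not the actual setup from that project. It relies on `promptfoo eval` exiting non-zero when an assertion fails, which is what turns a regression into a failed build.

```yaml
# .github/workflows/prompt-tests.yml (hypothetical file name)
name: prompt-tests
on:
  pull_request:
    paths:
      - "prompts/**"
      - "promptfooconfig.yaml"
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # Runs the whole suite; a failed assertion makes the command
      # exit non-zero, which fails the job.
      - name: Run Promptfoo suite
        run: npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}        # assumed secret name
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}  # assumed secret name
```

This sketch only gates the build. Getting the side-by-side comparison onto the PR itself takes an extra step; Promptfoo publishes its own GitHub Action for that, so check its README for the current inputs.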
{`npm install -g promptfoo
# scaffold a config
promptfoo init
# run the suite
promptfoo eval
# open the side-by-side web UI
promptfoo view`}
The Money Pattern
The YAML config is the whole pattern. Define your prompts, your providers, your test cases, and your assertions, including LLM-as-judge graders for fuzzy outputs. Promptfoo runs every prompt against every provider and every test case, then shows you the results as a matrix.
{`# promptfooconfig.yaml
prompts:
  - file://prompts/summarise_v1.txt
  - file://prompts/summarise_v2.txt
providers:
  - openai:gpt-4o-mini
  - anthropic:claude-haiku-4
tests:
  - vars:
      transcript: "Roof tiles cracked, leaking into kitchen, urgent."
    assert:
      # icontains is case-insensitive, so "Roof" at the start of a sentence still passes
      - type: icontains
        value: "roof"
      - type: llm-rubric
        value: "Output is a single sentence under 20 words."
      - type: cost
        threshold: 0.001
  - vars:
      transcript: "Just calling to check on my claim status."
    assert:
      - type: not-contains
        value: "URGENT"`}
The Catch
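The YAML gets gnarly fast once you have more than a handful of prompts or providers. There's also a learning curve around assertion types: llm-rubric (a grader model judges the output against a rule), similar (embedding similarity to a reference answer), and javascript (your own check in code). You'll write a few duds before it clicks.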
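To make the difference concrete, here is a small sketch that puts the three assertion types side by side on one test case. The rubric text, the reference sentence, and the 0.8 threshold are made-up values for illustration; tune them against your own outputs.

```yaml
tests:
  - vars:
      transcript: "Roof tiles cracked, leaking into kitchen, urgent."
    assert:
      # llm-rubric: a grader model judges the output against a plain-English rule
      - type: llm-rubric
        value: "Mentions the leak and flags the call as urgent."
      # similar: embedding similarity to a reference answer; passes at or above the threshold
      - type: similar
        value: "Urgent: cracked roof tiles are leaking into the kitchen."
        threshold: 0.8
      # javascript: an expression over `output`; a truthy result means pass
      - type: javascript
        value: 'output.trim().split(/\s+/).length < 20'
```

A rough rule of thumb: if a requirement is countable (length, word count, presence of a field), a javascript check is deterministic and free, while llm-rubric and similar each cost an extra model or embedding call per test.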
Also: running eval suites costs real API money. Promptfoo caches provider responses by default, so keep that cache intact between runs (CI runners start clean unless you persist it) or you'll burn through credits on identical re-runs.
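In CI, one way to keep the cache alive is to persist its directory between workflow runs. The fragment below is a sketch under assumptions: it assumes your Promptfoo version honours the `PROMPTFOO_CACHE_PATH` environment variable, and the directory name, cache key, and secret name are hypothetical.

```yaml
# Fragment of a GitHub Actions job: restore and save Promptfoo's cache directory.
steps:
  - uses: actions/cache@v4
    with:
      path: .promptfoo-cache
      # New key whenever prompts or config change; fall back to the latest older cache.
      key: promptfoo-${{ runner.os }}-${{ hashFiles('prompts/**', 'promptfooconfig.yaml') }}
      restore-keys: |
        promptfoo-${{ runner.os }}-
  - name: Run Promptfoo suite
    run: npx promptfoo@latest eval -c promptfooconfig.yaml
    env:
      # Point Promptfoo at the restored directory (assumed variable, see above)
      PROMPTFOO_CACHE_PATH: ${{ github.workspace }}/.promptfoo-cache
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}  # assumed secret name
```

With this in place, only the prompt and test combinations that actually changed should hit the API. When you do want fresh responses, `promptfoo eval --no-cache` bypasses the cache for that run.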
The Verdict
If your team ships LLM features more than once a week, Promptfoo pays for itself in the first prompt regression it catches. Pair it with Ragas for RAG-specific metrics and you've got real CI for AI. Add it to your repo tonight, write your first 10 test cases over coffee tomorrow.