Promptfoo Is CI For Prompts

Treat prompts like code — write tests, fail builds

Plot twist: you've been treating prompts as text files in a repo with zero tests, but they behave like code and they break like code. Promptfoo gives them a test suite: YAML config in, eval results out, and a failed CI run when an assertion regresses.

The Setup

I wired this into GitHub Actions for a Pipedrive call-summary tool. New prompt? Run the suite. The PR shows the old and new outputs side by side. It's Playwright energy, but for LLM behaviour.

```shell
npm install -g promptfoo

# scaffold a config
promptfoo init

# run the suite
promptfoo eval

# open the side-by-side web UI
promptfoo view
```
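To make the GitHub Actions wiring concrete, here is a minimal workflow sketch, not the exact setup described above. The workflow file name, the `paths` filter, the Node version and the secret names are all assumptions. It relies on `promptfoo eval` exiting with a non-zero status when an assertion fails, which is what turns a prompt regression into a failed check.

```yaml
# .github/workflows/prompt-eval.yml (hypothetical file name)
name: prompt-eval

on:
  pull_request:
    # assumption: only re-run the suite when prompts or the config change
    paths:
      - "prompts/**"
      - "promptfooconfig.yaml"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: 22

      # a failing assertion makes this step exit non-zero, failing the job
      - name: Run promptfoo suite
        run: npx promptfoo@latest eval -c promptfooconfig.yaml
        env:
          # assumption: these secrets exist in the repository settings
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

This sketch only gates the build. Posting the side-by-side comparison onto the PR needs an extra step, for example the project's own GitHub Action.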

The Money Pattern

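The YAML config is the whole pattern. Define your prompts, your test cases, and your assertions, including LLM-as-judge graders for fuzzy outputs. Promptfoo runs every prompt against every provider and every test, then shows you the matrix. The config below has 2 prompts, 2 providers and 2 tests, so a single run produces 8 outputs to grade.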

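```yaml
# promptfooconfig.yaml
prompts:
  - file://prompts/summarise_v1.txt
  - file://prompts/summarise_v2.txt

providers:
  - openai:gpt-4o-mini
  # swap in whichever Anthropic model ID your account exposes
  - anthropic:messages:claude-3-5-haiku-20241022

tests:
  - vars:
      transcript: "Roof tiles cracked, leaking into kitchen, urgent."
    assert:
      # case-insensitive, so "Roof" at the start of a sentence still passes
      - type: icontains
        value: "roof"
      # graded by an LLM judge
      - type: llm-rubric
        value: "Output is a single sentence under 20 words."
      # fail if a single call costs more than $0.001
      - type: cost
        threshold: 0.001

  - vars:
      transcript: "Just calling to check on my claim status."
    assert:
      # a routine call must not be flagged as urgent, in any casing
      - type: not-icontains
        value: "urgent"
```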

The Catch

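The YAML gets gnarly fast once you have more than a handful of prompts or providers. There's also a learning curve around assertion types: llm-rubric asks a grader model to judge the output against a rubric, similar compares embeddings against a reference text, and javascript runs a check you write yourself. You'll write a few duds before it clicks.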
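To see how the three assertion types differ, here is one test case that applies all of them to the same output. It is a sketch: the reference sentence, the 0.8 threshold and the word-count check are illustrative assumptions, not tuned values. As far as I know, `similar` needs an embeddings provider (an OpenAI key by default), so it costs a little per run too.

```yaml
tests:
  - vars:
      transcript: "Roof tiles cracked, leaking into kitchen, urgent."
    assert:
      # fuzzy and judged by a model: good for tone, format, "did it follow the brief"
      - type: llm-rubric
        value: "Output is a single sentence under 20 words."

      # embedding similarity to a reference answer; 0.8 is an assumed cut-off
      - type: similar
        value: "Urgent roof leak into the kitchen."
        threshold: 0.8

      # deterministic and free: `output` is the model's response string
      - type: javascript
        value: "output.trim().split(' ').length < 20"
```

A rough rule of thumb: reach for `javascript` when the check is mechanical, `similar` when you have a known-good answer, and `llm-rubric` only when neither fits, since it is the slowest and least repeatable of the three.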

Also: running eval suites costs real API money. Promptfoo caches provider responses by default, so leave that on and make sure the cache survives between CI runs, or you'll burn through credits on identical re-runs.
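CI runners start from a clean disk, so the default cache does not carry over between runs unless you persist it. The fragment below is one way to do that in GitHub Actions, under two assumptions: the cache directory name `.promptfoo-cache` is made up for this example, and `PROMPTFOO_CACHE_PATH` is the environment variable Promptfoo reads for its cache location (check the current docs before relying on it).

```yaml
# Fragment of a GitHub Actions job; it slots under jobs.<job_id>
steps:
  - uses: actions/checkout@v4

  # restore the response cache saved by an earlier run, if there is one
  - name: Restore promptfoo cache
    uses: actions/cache@v4
    with:
      path: .promptfoo-cache
      key: ${{ runner.os }}-promptfoo-${{ hashFiles('promptfooconfig.yaml', 'prompts/**') }}
      restore-keys: |
        ${{ runner.os }}-promptfoo-

  - name: Run promptfoo suite
    run: npx promptfoo@latest eval -c promptfooconfig.yaml
    env:
      # assumption: points promptfoo's cache at the directory restored above
      PROMPTFOO_CACHE_PATH: ${{ github.workspace }}/.promptfoo-cache
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

The `restore-keys` fallback matters here: when a prompt changes, the exact key misses, but the older cache is still restored, so only the prompt and test pairs that actually changed should hit the API.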

The Verdict

If your team ships LLM features more than once a week, Promptfoo pays for itself with the first prompt regression it catches. Pair it with Ragas for RAG-specific metrics and you've got real CI for AI. Add it to your repo tonight, then write your first 10 test cases over coffee tomorrow.
