Finally, prompt regressions you can actually catch in CI
In case you've been living under a rock: every LLM team has spent the last two years duct-taping together its own eval harness. Spoiler: DeepEval just made most of that cottage industry obsolete.
The Setup
DeepEval is a Python library that wraps your LLM evals in plain old pytest. Drop a file in your test suite, run pytest, get a red or green bar. That's it.
# test_chatbot.py
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, HallucinationMetric
from deepeval.test_case import LLMTestCase

# my_chatbot is your application's entry point; import it from wherever it lives.

def test_support_bot():
    question = "how do I reset my password?"
    case = LLMTestCase(
        input=question,
        actual_output=my_chatbot(question),
        # HallucinationMetric checks the output against `context`,
        # not `retrieval_context`.
        context=["Users reset via /account/security"],
    )
    assert_test(case, [
        AnswerRelevancyMetric(threshold=0.8),  # minimum acceptable score
        HallucinationMetric(threshold=0.3),    # maximum acceptable score
    ])

The Money Pattern
The killer feature is the same one pytest has: parametrize. Loop a hundred prompts through one test with @pytest.mark.parametrize and you get a per-case breakdown when something regresses. I wired this into a GitHub Action for a Pipedrive support bot at Rebuild Relief last week, and it caught a prompt regression nobody noticed in QA.
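Here's roughly what that looks like. Treat it as a sketch, not gospel: the three cases and the my_chatbot stub are made up for illustration, and the importorskip line is my own addition so that a CI job without deepeval installed skips these tests instead of erroring. Note that HallucinationMetric reads the test case's context field.

```python
import pytest

# Hypothetical eval set: (prompt, facts the answer must not contradict).
CASES = [
    ("how do I reset my password?", ["Users reset via /account/security"]),
    ("where do I change my email?", ["Email is changed under /account/profile"]),
    ("how do I delete my account?", ["Accounts are deleted via /account/close"]),
]

# Use the prompt itself as the test ID so a failure names the offending case.
CASE_IDS = [prompt for prompt, _ in CASES]


def my_chatbot(prompt: str) -> str:
    """Stand-in for the real bot; replace with a call into your application."""
    return "You can do that from your account settings."


@pytest.mark.parametrize("prompt, context", CASES, ids=CASE_IDS)
def test_support_bot(prompt, context):
    # Skip (rather than error) when deepeval is not installed in this job.
    pytest.importorskip("deepeval")
    from deepeval import assert_test
    from deepeval.metrics import AnswerRelevancyMetric, HallucinationMetric
    from deepeval.test_case import LLMTestCase

    case = LLMTestCase(
        input=prompt,
        actual_output=my_chatbot(prompt),
        context=context,
    )
    assert_test(case, [
        AnswerRelevancyMetric(threshold=0.8),
        HallucinationMetric(threshold=0.3),
    ])
```

Each tuple in CASES becomes its own test, so pytest reports a regression against the exact prompt that broke instead of one opaque failure.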
# run the whole eval suite like any other test
deepeval test run test_chatbot.py

# or just plain pytest if you want CI logs
pytest test_chatbot.py -v --maxfail=3

The Catch
Most of the built-in metrics use GPT-4-class models as the judge. Run a thousand test cases on every PR and your OpenAI bill starts looking like a mortgage payment. Swap in a local judge model if you can, or sample your eval set ruthlessly.
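If you go the sampling route, it helps to make the sample deterministic so the same PR always gets judged on the same cases. Here's a minimal sketch, assuming every case has a stable string ID. The sample_cases helper and its salt parameter are my own invention, not part of DeepEval:

```python
import hashlib


def sample_cases(case_ids: list[str], fraction: float, salt: str = "") -> list[str]:
    """Return a stable subset containing roughly `fraction` of the case IDs."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    cutoff = fraction * 2**64
    kept = []
    for case_id in case_ids:
        digest = hashlib.sha256((salt + case_id).encode("utf-8")).digest()
        # Treat the first 8 bytes of the hash as a 64-bit integer and keep
        # the case when that integer falls below the cutoff.
        if int.from_bytes(digest[:8], "big") < cutoff:
            kept.append(case_id)
    return kept
```

Because selection depends only on the hash of each ID, raising the fraction only ever adds cases. You could run 10% on every PR and 100% nightly, and the PR subset stays inside the nightly set. Change the salt when you want to rotate which cases land in the small sample.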
The Verdict
If you're shipping anything with an LLM in the loop and you don't have evals in CI, you are flying blind. DeepEval is the lowest-friction way to fix that today. Pip install, write five tests, sleep better tonight.