DeepEval Is Pytest For LLMs

In case you've been living under a rock: most LLM teams have spent the last two years duct-taping together their own eval harness. Spoiler: DeepEval makes a lot of that cottage industry unnecessary.

The Setup

DeepEval is a Python library that wraps your LLM evals in plain old pytest. Drop a file in your test suite, run pytest, get a red or green bar. That's it.

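# test_chatbot.py
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, HallucinationMetric
from deepeval.test_case import LLMTestCase

# my_chatbot is your own function: it takes the user message and returns
# the model's reply as a string. Import it from wherever it lives in your app.

def test_support_bot():
    question = "how do I reset my password?"
    case = LLMTestCase(
        input=question,
        actual_output=my_chatbot(question),
        # HallucinationMetric reads `context`, not `retrieval_context`
        context=["Users reset via /account/security"],
    )
    assert_test(case, [
        AnswerRelevancyMetric(threshold=0.8),  # passes when score >= 0.8
        HallucinationMetric(threshold=0.3),    # passes when score <= 0.3
    ])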

The Money Pattern

The killer feature is the one pytest already has: parametrize. Loop a hundred prompts through one test and get a per-case breakdown when something regresses. I wired this into a GitHub Action for a Pipedrive support bot at Rebuild Relief last week, and it caught a prompt regression nobody had noticed in QA.

# run the whole eval suite like any other test
deepeval test run test_chatbot.py

# or just plain pytest if you want CI logs
pytest test_chatbot.py -v --maxfail=3

The Catch

Most of the built-in metrics use a GPT-4-class model as the judge, so every metric on every test case costs at least one API call. Run a thousand test cases on every PR and your OpenAI bill starts looking like a mortgage payment. Swap in a local judge model if you can, or sample your eval set ruthlessly.
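
If you go the sampling route, pin the random seed so every run on a PR sees the same subset; otherwise a test can flip between red and green just because the sample changed. A minimal sketch, where sample_cases and the placeholder prompts are illustrative and not part of DeepEval:

```python
import random


def sample_cases(cases, k, seed=0):
    """Pick up to k cases; the same seed always yields the same subset."""
    cases = list(cases)
    if k >= len(cases):
        return cases
    # A private Random instance avoids touching the global random state.
    return random.Random(seed).sample(cases, k)


# Illustrative data: 1,000 placeholder prompts standing in for a real eval set.
all_prompts = [f"prompt {i}" for i in range(1000)]
pr_subset = sample_cases(all_prompts, k=50)
```

Feed pr_subset into parametrize on PRs and save the full set for a nightly run.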

The Verdict

If you're shipping anything with an LLM in the loop and you don't have evals in CI, you are flying blind. DeepEval is the lowest-friction way I've found to fix that today. Run pip install deepeval, write five tests, sleep better tonight.
