Ragas Made RAG Evaluation Actually Possible

Stop shipping RAG on vibes

Chances are your RAG pipeline is being evaluated by you scrolling through 10 outputs and going "yeah, looks fine". Spoiler: that doesn't scale. Ragas is the framework that gives you real numbers.

The Setup

I bolted this onto a Supabase pgvector retriever I'm using for a compliance-doc Q&A tool. Pip install, build a small eval dataset, run it before every prompt change.

pip install ragas datasets

export OPENAI_API_KEY="sk-..."   # ragas uses an LLM as judge

The Money Pattern

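The four core metrics (faithfulness, answer relevancy, context precision, context recall) cover the obvious failure modes. Faithfulness catches hallucinations by checking whether the answer is supported by the retrieved context. Context precision tells you if your retriever is dumping junk into the prompt, and context recall tells you if it missed what the ground truth needs.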

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

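# One row per eval question. "contexts" is a list of retrieved chunks per row.
# Column names follow the schema this example was written against; other
# Ragas releases may expect different names, so check your pinned version.
samples = Dataset.from_dict({
    "question": ["What's the cooling-off period for a roofing contract in QLD?"],
    "answer": ["10 business days under the BCIPA."],
    "contexts": [["BCIPA s17 establishes a 10-business-day cooling-off period..."]],
    "ground_truth": ["10 business days."],
})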

scores = evaluate(
    samples,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

print(scores.to_pandas())
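
In practice the "answer" and "contexts" columns come from running your own pipeline over the golden questions rather than being typed by hand. A minimal sketch of that step, assuming two hypothetical callables, `retrieve` and `generate`, that stand in for your retriever and your LLM call (neither is part of Ragas):

```python
from typing import Callable, Dict, List


def build_eval_columns(
    golden: List[Dict[str, str]],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
) -> Dict[str, list]:
    """Run the pipeline over golden questions and collect column-oriented data."""
    columns: Dict[str, list] = {
        "question": [],
        "answer": [],
        "contexts": [],
        "ground_truth": [],
    }
    for item in golden:
        contexts = retrieve(item["question"])
        columns["question"].append(item["question"])
        columns["answer"].append(generate(item["question"], contexts))
        columns["contexts"].append(contexts)
        columns["ground_truth"].append(item["ground_truth"])
    return columns


# Stand-in pipeline (hypothetical) so the sketch runs without a vector store or API key.
def fake_retrieve(question: str) -> List[str]:
    return ["BCIPA s17 establishes a 10-business-day cooling-off period..."]


def fake_generate(question: str, contexts: List[str]) -> str:
    return "10 business days under the BCIPA."


golden = [
    {
        "question": "What's the cooling-off period for a roofing contract in QLD?",
        "ground_truth": "10 business days.",
    }
]
columns = build_eval_columns(golden, fake_retrieve, fake_generate)
print(columns["answer"])
```

The resulting dict has the same shape as the one passed to `Dataset.from_dict` above, so swapping the fakes for your real retriever and generator is the only change needed.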

The Catch

Ragas uses an LLM as judge, usually GPT-4o-mini or better, so your eval bill is real. A 200-sample run with all four metrics is at least 800 model calls (200 samples × 4 metrics), and often more because some metrics make several judge calls per sample. Budget for it, or self-host with a local model and accept lower judge quality.
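
The arithmetic behind that estimate is easy to script before you commit to a run. A rough sketch follows; the `calls_per_metric` multiplier, the tokens-per-call figure, and the price are all placeholder assumptions, not real Ragas internals or real pricing, so plug in numbers from your own logs and your provider's rate card:

```python
def estimate_judge_calls(n_samples: int, n_metrics: int, calls_per_metric: float = 1.0) -> int:
    """Rough count of LLM judge calls for one evaluation run."""
    return round(n_samples * n_metrics * calls_per_metric)


def estimate_cost_usd(n_calls: int, tokens_per_call: int, usd_per_million_tokens: float) -> float:
    """Rough spend for a run, given an average token count per judge call."""
    return n_calls * tokens_per_call * usd_per_million_tokens / 1_000_000


calls = estimate_judge_calls(n_samples=200, n_metrics=4)
print(calls)  # 800

# Placeholder figures: 1,500 tokens per call at 0.50 USD per million tokens.
cost = estimate_cost_usd(calls, tokens_per_call=1500, usd_per_million_tokens=0.50)
print(f"~${cost:.2f} per run")
```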

The metrics aren't gospel either. Faithfulness scores can drift between Ragas releases as the underlying prompts change. Pin the version in CI or you'll be debugging a score regression that was actually a Ragas update.
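
Beyond pinning in your requirements file, a guard at the top of the eval script can refuse to run on an unexpected version. A sketch using the standard library's `importlib.metadata`; the `check_pin` helper is hypothetical and the version string is a placeholder, so substitute whatever release your baseline scores were recorded on:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_pin(package: str, expected: str) -> Tuple[bool, str]:
    """Compare the installed version against the one the baseline was scored on."""
    found = installed_version(package)
    if found is None:
        return False, f"{package} is not installed"
    if found != expected:
        return False, f"{package}=={found} installed, but the baseline used {expected}"
    return True, f"{package}=={found} matches the pin"


# "0.0.0" is a placeholder, not a recommended version.
ok, message = check_pin("ragas", "0.0.0")
print(message)
```

In CI you would fail the job when `ok` is false, so a silent upgrade shows up as a clear error instead of a mysterious score change.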

The Verdict

If you're shipping RAG to production, Ragas is the closest thing to grown-up testing in this space. Build a 50-question golden set tomorrow, run it on every prompt or retriever change, and you'll catch regressions before customers do. Vibes-based shipping is over.
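
Catching regressions automatically means comparing each run against a stored baseline. A minimal sketch of such a gate, assuming you have already reduced each run to one mean score per metric (for example by averaging the columns of `scores.to_pandas()`); the `find_regressions` helper, the tolerance, and the numbers are all illustrative:

```python
from typing import Dict, Optional, Tuple


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> Dict[str, Tuple[float, Optional[float]]]:
    """Return metrics that dropped more than `tolerance` below baseline, or went missing."""
    regressions: Dict[str, Tuple[float, Optional[float]]] = {}
    for metric, base in baseline.items():
        now = current.get(metric)
        if now is None or now < base - tolerance:
            regressions[metric] = (base, now)
    return regressions


# Made-up scores for illustration only.
baseline = {"faithfulness": 0.90, "context_precision": 0.80}
current = {"faithfulness": 0.85, "context_precision": 0.79}

regressions = find_regressions(baseline, current)
for metric, (base, now) in regressions.items():
    print(f"REGRESSION {metric}: {base} -> {now}")

# In CI, follow this with: raise SystemExit(1 if regressions else 0)
```

The tolerance exists because LLM-judged scores are noisy between runs; set it from the spread you see when you evaluate the same pipeline twice.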
