Stop shipping RAG on vibes
Be honest: your RAG pipeline is probably being evaluated by you scrolling through 10 outputs and going "yeah, looks fine". Spoiler: that doesn't scale. Ragas is a framework that swaps the eyeballing for real numbers.
The Setup
I bolted this onto a Supabase pgvector retriever I'm using for a compliance-doc Q&A tool. Pip install, build a small eval dataset, run it before every prompt change.
pip install ragas datasets
export OPENAI_API_KEY="sk-..." # ragas uses an LLM as judge
The Money Pattern
The four core metrics (faithfulness, answer relevancy, context precision, context recall) cover the obvious failure modes. Faithfulness catches hallucinations; context precision tells you if your retriever is dumping junk into the prompt.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
samples = Dataset.from_dict({
    "question": ["What's the cooling-off period for a roofing contract in QLD?"],
    "answer": ["10 business days under the BCIPA."],
    "contexts": [["BCIPA s17 establishes a 10-business-day cooling-off period..."]],
    "ground_truth": ["10 business days."],
})
scores = evaluate(
    samples,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores.to_pandas())
The Catch
Ragas uses an LLM as judge (usually GPT-4o-mini or better), so your eval bill is real. A 200-sample run with all four metrics is at least 800 judge calls (200 × 4), and often more, since some metrics can make several calls per sample. Budget for it, or self-host with a local model and accept lower judge quality.
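To sanity-check the bill before kicking off a run, the arithmetic can be sketched in a few lines. This is a back-of-envelope estimate, not anything Ragas provides: the function name and the calls-per-metric numbers are assumptions for illustration, and the real count depends on your Ragas version and metric settings.

```python
def estimate_judge_calls(n_samples: int, calls_per_metric: dict[str, int]) -> int:
    """Rough count of judge calls for one eval run."""
    return n_samples * sum(calls_per_metric.values())


# Assumption: exactly one judge call per metric per sample. This reproduces
# the 200 x 4 = 800 figure; bump the numbers if your metrics call more often.
one_call_each = {
    "faithfulness": 1,
    "answer_relevancy": 1,
    "context_precision": 1,
    "context_recall": 1,
}

print(estimate_judge_calls(200, one_call_each))
```

Multiply the result by your judge model's per-call cost and you have a rough budget line for each CI run.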
The metrics aren't gospel either. Faithfulness scores can drift between Ragas releases as the underlying prompts change. Pin the version in CI or you'll be debugging a score regression that was actually a Ragas update.
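Besides pinning in your requirements file, one way to make a silent upgrade loud is a guard at the top of the eval script. A minimal sketch using only the standard library; `PINNED_RAGAS` and `installed_matches_pin` are hypothetical names, and the version string is a placeholder, not a recommendation.

```python
from importlib.metadata import PackageNotFoundError, version

# Hypothetical pin: replace with the exact Ragas release your baseline was scored on.
PINNED_RAGAS = "0.0.0"


def installed_matches_pin(package: str, pinned: str) -> bool:
    """True only if `package` is installed at exactly the `pinned` version."""
    try:
        return version(package) == pinned
    except PackageNotFoundError:
        # Not installed at all counts as a mismatch.
        return False


if not installed_matches_pin("ragas", PINNED_RAGAS):
    print(f"ragas is not at the pinned version {PINNED_RAGAS}; scores may not be comparable")
```

In CI you would probably fail the job instead of printing, so a version bump has to be a deliberate commit that also refreshes the baseline scores.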
The Verdict
If you're shipping RAG to production, Ragas is the closest thing to grown-up testing in this space. Build a 50-question golden set tomorrow, run it on every prompt or retriever change, and you'll catch regressions before customers do. Vibes-based shipping is over.
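Catching regressions means comparing each run against a stored baseline, which could be sketched like this. The helper name, the tolerance, and the scores are all made up for illustration; in practice the `current` numbers might come from averaging each metric column of `scores.to_pandas()`.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return {metric: drop} for metrics that fell more than `tolerance` below baseline."""
    drops = {}
    for metric, base_score in baseline.items():
        drop = base_score - current[metric]
        if drop > tolerance:
            drops[metric] = drop
    return drops


# Invented scores, purely to show the shape of the check.
baseline = {"faithfulness": 0.90, "context_precision": 0.80}
current = {"faithfulness": 0.84, "context_precision": 0.81}

regressions = find_regressions(baseline, current)
for metric, drop in regressions.items():
    print(f"{metric} dropped by {drop:.2f}")
```

A small tolerance is worth keeping: LLM judges are noisy, so a gate with zero slack will fail on runs where nothing actually changed.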