AI/LLM

RAG Is Dead. Long Live RAG.

⚰️🪦🚀

Hybrid retrieval, rerank, structured filters — the new RAG pipeline

Every six months someone tweets "RAG is dead" because context windows got bigger. Plot twist: stuffing 2M tokens into the prompt is slower, more expensive, and less accurate than just retrieving the right 4k.

The Setup

Naive RAG (chunk, embed, top-k, stuff) is genuinely dead. What killed it wasn't long context; it was the realisation that single-vector retrieval misses obvious matches. The new shape is hybrid: BM25 for keyword precision, dense embeddings for semantics, and a cross-encoder to rerank the union of the two.

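from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

docs = load_my_docs()  # placeholder: bring your own loader, returns list[str]

# Sparse side: BM25 over whitespace tokens.
bm25 = BM25Okapi([d.split() for d in docs])

# Dense side: unit-length embeddings, so a dot product is cosine similarity.
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)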
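
Why bother with normalize_embeddings=True? The search function later takes a plain dot product against these vectors, and a dot product only equals cosine similarity when both sides are unit length. A dependency-free sketch of that identity, with hypothetical helper names (`normalize`, `dot`, `cosine`) that are not part of any library:

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed the long way, without normalising first."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]

# On raw vectors the dot product is magnitude-dependent and differs from cosine.
raw = dot(a, b)

# On unit vectors the dot product and the cosine are the same number.
unit = dot(normalize(a), normalize(b))
assert math.isclose(unit, cosine(a, b))
```

Skip the normalisation and longer vectors can win on magnitude alone, which is not what you want from a similarity ranking.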

The Money Pattern

Pull the top 50 from BM25, top 50 from dense, union the lot, then let a cross-encoder pick the real winners. I run this in front of a Supabase pgvector table for the Rebuild Relief knowledge base and the gap between "retrieved" and "actually relevant" basically vanished.

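reranker = CrossEncoder("BAAI/bge-reranker-large")

def hybrid_search(q, k=8, n=50):
    # Top-n by BM25 (argsort is ascending, so the best sit at the end).
    bm_hits = np.argsort(bm25.get_scores(q.split()))[-n:]
    # Top-n by cosine similarity; doc_vecs are already unit length.
    qv = embedder.encode(q, normalize_embeddings=True)
    dn_hits = np.argsort(doc_vecs @ qv)[-n:]
    # Union the candidate ids; sorting keeps the order deterministic.
    cands = sorted(set(bm_hits.tolist()) | set(dn_hits.tolist()))
    # The cross-encoder reads each (query, doc) pair jointly.
    pairs = [[q, docs[i]] for i in cands]
    scores = reranker.predict(pairs)
    top = np.argsort(scores)[-k:][::-1]  # best k, highest score first
    return [docs[cands[i]] for i in top]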
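
If you want to poke at the union-then-rerank flow without downloading two models, the same shape can be sketched with pluggable scorers. Everything here is hypothetical scaffolding (`top_n`, `union_then_rerank`, and the `Scorer` / `PairScorer` aliases are names made up for this sketch, not from rank_bm25 or sentence-transformers); it only mirrors the control flow of the function above:

```python
from typing import Callable, Sequence

# Hypothetical aliases for illustration only.
Scorer = Callable[[str, Sequence[str]], Sequence[float]]  # query, docs -> one score per doc
PairScorer = Callable[[str, str], float]                  # query, doc -> relevance score

def top_n(scores: Sequence[float], n: int) -> list[int]:
    """Indices of the n highest scores, best first (ties keep index order)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]

def union_then_rerank(
    query: str,
    docs: Sequence[str],
    retrievers: Sequence[Scorer],
    rerank: PairScorer,
    n: int = 50,
    k: int = 8,
) -> list[str]:
    """Take the top n from every retriever, union them, rerank, return the best k."""
    cands: set[int] = set()
    for score_all in retrievers:
        cands.update(top_n(score_all(query, docs), n))
    # Sort the ids first so equal rerank scores resolve deterministically.
    ranked = sorted(sorted(cands), key=lambda i: rerank(query, docs[i]), reverse=True)
    return [docs[i] for i in ranked[:k]]
```

Plug in a keyword-overlap scorer and any stand-in for the dense side and you can watch a document that only one retriever found survive into the candidate set. Swapping the stubs for BM25, an embedding lookup, and a cross-encoder gets you back to the real thing.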

The Catch

You still need to chunk smartly. Sentence-window, semantic splitting, parent-child: pick one and own it. And the reranker adds latency on every query, so cache aggressively or use a smaller distilled cross-encoder for prod. Hybrid isn't free.
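
To make one of those chunking options concrete, here is a rough sentence-window sketch: each sentence is indexed on its own, but carries its neighbours along as the context you would actually hand to the model. The function name, the `match` / `context` keys, and the regex sentence splitter are all assumptions for illustration; a real pipeline would likely want a proper sentence tokenizer.

```python
import re

def sentence_window_chunks(text: str, window: int = 1) -> list[dict]:
    """Split text into sentences and pair each one with its surrounding window.

    'match' is the single sentence to embed and index;
    'context' is that sentence plus up to `window` neighbours on each side.
    """
    # Naive splitter: break on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for i, sent in enumerate(sentences):
        lo = max(0, i - window)
        hi = min(len(sentences), i + window + 1)
        chunks.append({"match": sent, "context": " ".join(sentences[lo:hi])})
    return chunks
```

Retrieval and reranking run against `match`, so scoring stays sharp on a single sentence, while the prompt gets `context`, so the model is not left guessing what "it" refers to.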

The Verdict

RAG isn't dead. The 2022 version of RAG is dead. Hybrid retrieval plus reranking plus structured filters is the actual job now, and it beats long-context prompt stuffing on cost, latency, and accuracy. Build the pipeline, stop arguing on Twitter.
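
For the structured-filter part, the idea is to narrow candidates on exact metadata before anything gets scored. A minimal sketch, assuming each chunk carries hypothetical `source` and `updated` fields (invented here for illustration); with pgvector the same thing would normally live in a WHERE clause on the table rather than in Python:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence

@dataclass
class Chunk:
    text: str
    source: str    # hypothetical metadata field, e.g. "faq" or "policy"
    updated: date  # hypothetical metadata field: last revision date

def filter_chunks(
    chunks: Sequence[Chunk],
    *,
    source: Optional[str] = None,
    updated_after: Optional[date] = None,
) -> list[int]:
    """Return the indices of chunks that pass every filter that was supplied."""
    keep = []
    for i, c in enumerate(chunks):
        if source is not None and c.source != source:
            continue  # wrong source: drop before any scoring happens
        if updated_after is not None and c.updated <= updated_after:
            continue  # not newer than the cutoff: drop
        keep.append(i)
    return keep
```

The surviving indices are what you would hand to BM25 and the dense search, so the reranker never spends time on chunks that could not be valid answers.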
