Late interaction quietly destroys your dense retriever on weird queries
In case you've been living under a rock: everyone's been bolting OpenAI embeddings into pgvector and calling it RAG. Spoiler: single-vector retrieval tends to break the second a user asks something the embedding model's training data never covered.
The Setup
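ColBERTv2 doesn't squash a document into a single vector. It keeps a vector per token and scores queries with late interaction: for each query token, MaxSim takes the similarity of the best-matching document token, and the document's score is the sum of those maxima. Because rare terms keep their own vectors instead of being averaged away, out-of-domain queries tend to get noticeably better recall.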
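To make the scoring rule concrete, here is a minimal pure-Python sketch of MaxSim. The 2-d "token embeddings" are made up for illustration, and the helper names (`dot`, `maxsim_score`) are mine, not part of ColBERT or RAGatouille; the real model uses learned, normalized vectors and a far more optimized implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Late interaction: for each query token take its best-matching
    document token, then sum those maxima over the query tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy unit-length token vectors (illustrative values only)
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # strong match for both query tokens
doc_b = [[1.0, 0.0], [0.8, 0.6]]              # strong match for the first token only

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Here `doc_a` outscores `doc_b` because every query token finds a close match somewhere in it, which is the property that helps on oddly worded queries.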
pip install ragatouille
# index your docs once
python -c "from ragatouille import RAGPretrainedModel; m = RAGPretrainedModel.from_pretrained('colbert-ir/colbertv2.0'); m.index(collection=['doc 1 text', 'doc 2 text'], index_name='aidxn')"
The Money Pattern
RAGatouille wraps the awful Stanford research code into something a normal human can ship. Index once, search forever. I dropped it into a Supabase-backed knowledge base and recall went up without touching the chunker.
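from ragatouille import RAGPretrainedModel

# Load the index built in the setup step
rag = RAGPretrainedModel.from_index(".ragatouille/colbert/indexes/aidxn")

# Fetch the top 5 passages for a long-tail query
hits = rag.search(
    query="how do I reroute a field rep around a hail cell",
    k=5,
)

# Each hit is a dict carrying the passage text and its score
for h in hits:
    print(h["score"], h["content"][:120])
The Catch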
The index is fat: roughly 5-10x the size of a flat dense index, because you're storing a vector per token instead of one per document. Search is slower than a hot pgvector call too, with milliseconds becoming tens of milliseconds. For a 50k-doc corpus that's fine. For 50M, you need PLAID and a real budget.
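For a rough feel of where the size gap comes from, here is a back-of-envelope sketch. Every number in it is an assumption rather than a measurement: one 1536-dim float32 vector per chunk on the dense side, and about 150 token vectors per chunk at 128 dims in float16 on the per-token side. It counts raw vector bytes only and ignores metadata and compression (ColBERTv2 compresses its stored vectors), so measure your own index before budgeting.

```python
def index_bytes(n_docs: int, vectors_per_doc: int, dim: int, bytes_per_value: int) -> int:
    """Raw vector storage only; ignores IDs, metadata, and any compression."""
    return n_docs * vectors_per_doc * dim * bytes_per_value


N_DOCS = 50_000  # the small-corpus case from the text

# Assumed single-vector setup: one 1536-dim float32 vector per chunk
dense = index_bytes(N_DOCS, vectors_per_doc=1, dim=1536, bytes_per_value=4)

# Assumed per-token setup: ~150 vectors per chunk, 128-dim, float16
per_token = index_bytes(N_DOCS, vectors_per_doc=150, dim=128, bytes_per_value=2)

ratio = per_token / dense
print(f"dense: {dense / 1e6:.0f} MB, per-token: {per_token / 1e6:.0f} MB, ratio: {ratio:.2f}x")
```

Under these assumptions the per-token index is 6.25x the dense one; longer chunks or a smaller dense embedding push the ratio up, and compression pulls it down.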
The Verdict
If your RAG quietly fails on long-tail queries, swap your dense retriever for ColBERTv2 before you spend another week tuning chunk sizes. Use RAGatouille, ship it tonight, watch your "no relevant context found" rate fall off a cliff. Then layer a cross-encoder rerank on top and call it done.
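The rerank step can be sketched as a thin wrapper over the search results. This is a minimal illustration, not a recommended implementation: `rerank` and `overlap_score` are hypothetical helpers, the hit dicts and their scores are made up, and the word-overlap scorer is only a stand-in so the snippet runs without a model. In practice you would pass a function that calls a real cross-encoder on each (query, passage) pair.

```python
from typing import Callable, Dict, List


def rerank(
    query: str,
    hits: List[Dict],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> List[Dict]:
    """Re-score each retrieved passage against the query and keep the best top_n."""
    rescored = [{**h, "rerank_score": score_fn(query, h["content"])} for h in hits]
    return sorted(rescored, key=lambda h: h["rerank_score"], reverse=True)[:top_n]


def overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer (hypothetical): fraction of query words found in the passage."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q), 1)


# Made-up hits shaped like the search results used earlier ("content" plus "score")
hits = [
    {"content": "Office holiday schedule", "score": 21.3},
    {"content": "How to reroute a field rep around severe weather", "score": 20.9},
]

reranked = rerank("reroute a field rep around a hail cell", hits, overlap_score, top_n=2)
for h in reranked:
    print(round(h["rerank_score"], 2), h["content"])
```

Keeping the scorer as a plain callable means the retriever and the reranker stay swappable independently.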