Advanced RAG and evals

10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.

In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 20, note type = Basic.

Front	Back
What does BM25 catch that dense embedding retrieval can miss?	BM25 scores exact token matches using term frequency, inverse document frequency, and document-length normalization, so it catches proper nouns, ticker symbols, and entity names that dense embeddings may blur through semantic generalization.
What is Reciprocal Rank Fusion (RRF) used for in hybrid retrieval?	RRF merges the ranked lists from BM25 and dense retrieval by scoring each document as the sum of 1/(k + rank) over the lists it appears in (k is a smoothing constant, commonly 60), producing a single ranking that benefits from both sparse and dense signals.
What does a cross-encoder reranker do differently from a bi-encoder?	A bi-encoder embeds the query and each document separately and compares them by cosine similarity. A cross-encoder reads the query and document concatenated together and outputs a relevance score directly. It is more accurate but requires O(N) forward passes at query time, one per candidate document.
What is contextual retrieval and what improvement does it provide?	It prepends LLM-generated context describing each chunk's role in the full document before the chunk is embedded and BM25-indexed. Anthropic reported a 49% reduction in retrieval failures when contextual embeddings are combined with contextual BM25 (Sept 2024).
What is HyDE and when should you avoid it?	HyDE generates a hypothetical answer to the query and embeds that instead of the raw query. Avoid it on domain-specific entity questions, where the LLM may confabulate wrong facts and pull the embedding away from the truly relevant documents.
Name the four metrics RAGAS measures.	Context precision, context recall, faithfulness (answer claims grounded in context), and answer relevancy (answer addresses the question).
Why is it wrong to measure only end-to-end accuracy for a RAG system?	The four RAGAS metrics fail independently. A system can be faithful but miss relevant chunks (high faithfulness, low context recall), or retrieve the right chunks but still hallucinate in its answer (high context recall, low faithfulness). Collapsing to one metric erases that diagnostic signal.
What query type does GraphRAG specifically outperform dense retrieval on?	Global sensemaking questions requiring cross-document entity resolution and aggregation — e.g., 'Which portfolio founders have prior exits?' — where dense retrieval cannot aggregate across all documents.
What is the practical cost of building a GraphRAG index over 500 VC memos?	Roughly 1,000 LLM calls for entity extraction (~$5 at GPT-4o mini pricing) plus offline graph construction (Leiden community detection) and storage in a graph database or parquet format.
What is the parent-document retrieval pattern and why is it useful?	Index small child chunks for retrieval precision, but return the full parent section (e.g., the full financial section of a memo) when a child chunk matches. This preserves retrieval precision while giving the LLM sufficient context to answer.
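
Worked example for the RRF card above. This is a minimal sketch, not a reference implementation: the function name `reciprocal_rank_fusion`, the memo ids, and the choice of k = 60 (a commonly used default) are all assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from each retriever.
bm25_ranking = ["memo_17", "memo_03", "memo_42"]
dense_ranking = ["memo_03", "memo_42", "memo_08"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents that appear in both lists (memo_03, memo_42) rise above memo_17, which is ranked first by BM25 alone. This is the behavior the card describes: agreement between sparse and dense signals is rewarded.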