Chapter 20 · Advanced RAG and evals — From Tokens to Embodied Minds

Your RAG system is broken in ways your demo does not expose. The demo uses synthetic questions generated from the same documents it retrieves from, a closed loop that all but guarantees high scores. Production queries cross chunk boundaries, require entity resolution across documents, and ask about entities not directly named in any single chunk. The failure modes are predictable: (1) retrieval recall is low, because the right chunk exists but the query phrasing doesn't match its embedding; (2) retrieved context is correct but the answer fabricates additional detail, a faithfulness failure; (3) retrieved context is irrelevant, a precision failure. Each failure requires a different fix, none is visible without a labeled eval set, and no RAG demo survives production without an offline regression suite. The 2024–2025 advances moved well past the naive hybrid-retrieval-plus-reranker stack most teams deployed in 2023. Anthropic's Contextual Retrieval (September 19, 2024) demonstrated a 49% reduction in retrieval failures by prepending chunk-specific context at indexing time (contextual embeddings combined with contextual BM25). Microsoft's GraphRAG (Edge et al., arXiv:2404.16130, April 24, 2024) showed that cross-document synthesis, such as 'Which of our portfolio companies have a founder who has exited before?', cannot be answered reliably by top-k dense retrieval alone and calls for a knowledge graph over the corpus. For DealLens, this chapter is not infrastructure background; it is the core product engineering.

Hybrid retrieval and cross-encoder reranking

BM25 (Robertson and Zaragoza, 2009) is a sparse, term-based ranking function: it matches exact tokens, weights rare terms more heavily (IDF), and rewards query-term frequency with saturation and document-length normalization. Dense retrieval (bi-encoder: embed query and document independently, retrieve by cosine similarity) captures semantic meaning but misses exact string matches. Hybrid retrieval combines both: retrieve top-K candidates from BM25 and top-K from dense, merge with Reciprocal Rank Fusion (RRF), then pass the merged list to a cross-encoder reranker. In practice, hybrid outperforms either alone by 5–15% recall on diverse query types, especially queries with named entities (company names, founder names, ticker symbols, model names) where BM25 exact matching is decisive.
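The RRF merge step is small enough to sketch directly. The version below is a minimal illustration, not a library API: it assumes each retriever returns a ranked list of document IDs, uses the commonly cited constant k = 60, and the memo IDs are made up for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs into one list, best first.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a common default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical top-4 results from each retriever for one query.
bm25_hits = ["memo_acme", "memo_zeta", "memo_orion", "memo_kite"]
dense_hits = ["memo_orion", "memo_acme", "memo_lumen", "memo_zeta"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Because RRF uses only ranks, it needs no score normalization between BM25 and cosine similarity. Documents that both retrievers rank highly rise to the top of the fused list.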

The cross-encoder reranker (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2) differs from a bi-encoder: it reads the query and each candidate document concatenated in a single forward pass and produces a relevance score that ranks substantially more accurately than cosine similarity. The cost is O(N) forward passes at reranking time: reranking the top 50 with a 12-layer cross-encoder at 30ms per pair takes 1.5 seconds, which is acceptable for DealLens batch-screening mode but too slow for sub-second interactive use. For interactive DealLens queries, cap the reranker candidate pool at 20 and use a smaller reranker (6-layer, ~10ms per pair), about 200ms in total. Weaviate and Elasticsearch both support hybrid BM25 + vector search natively; Qdrant supports it by combining sparse and dense vectors in a single query.
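The latency arithmetic is worth keeping as a helper so the candidate cap comes from a budget instead of a guess. This sketch assumes pairs are scored one after another; GPU batching would lower the per-pair figure. The function names are illustrative.

```python
def rerank_latency_ms(num_candidates, ms_per_pair):
    """Sequential reranking cost: one forward pass per (query, document) pair."""
    return num_candidates * ms_per_pair

def max_candidates(budget_ms, ms_per_pair):
    """Largest candidate pool that fits inside a latency budget."""
    return int(budget_ms // ms_per_pair)

batch_ms = rerank_latency_ms(50, 30)        # 12-layer model, top-50 pool
interactive_ms = rerank_latency_ms(20, 10)  # 6-layer model, top-20 pool

print(batch_ms, interactive_ms)             # 1500 200
print(max_candidates(1000, 30))             # pool size that fits a 1-second budget
```

Measure ms_per_pair on your own hardware with your typical memo-section length before fixing the cap, since cross-encoder cost grows with input length.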

Chunking strategy interacts with retrieval in ways most implementations ignore. Fixed-size chunks (512 tokens, 50-token overlap) are the default and the worst option for VC memos, because they cut across section boundaries. Sentence-level chunking preserves semantic units but leaves each chunk with too little surrounding context. Semantic chunking, which splits on topic shifts using an embedding-similarity threshold, produces variable-length chunks that match the document's logical structure. For DealLens, the right strategy is document-section chunking (executive summary, team, market, financials, risks as separate chunks) plus a parent-document retriever that returns the full memo section when any sub-chunk matches.
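A parent-document retriever can be sketched in a few dozen lines. The version below assumes memos use the five section headings named above on lines of their own, and it uses keyword overlap as a stand-in for the real retriever; all names are illustrative.

```python
import re

SECTION_HEADINGS = ["Executive Summary", "Team", "Market", "Financials", "Risks"]

def split_into_sections(memo_text):
    """Split a memo into {heading: body} on lines that exactly match a known heading."""
    sections, current = {}, None
    for line in memo_text.splitlines():
        stripped = line.strip()
        if stripped in SECTION_HEADINGS:
            current = stripped
            sections[current] = []
        elif current is not None and stripped:
            sections[current].append(stripped)
    return {name: " ".join(lines) for name, lines in sections.items()}

def build_index(memo_id, memo_text):
    """Return (sub_chunks, parents): sentence sub-chunks, each pointing at its section."""
    parents, sub_chunks = {}, []
    for name, body in split_into_sections(memo_text).items():
        parent_id = f"{memo_id}:{name}"
        parents[parent_id] = body
        for sentence in re.split(r"(?<=[.!?])\s+", body):
            if sentence:
                sub_chunks.append({"text": sentence, "parent_id": parent_id})
    return sub_chunks, parents

def retrieve_parents(query, sub_chunks, parents):
    """Match on small sub-chunks, return the full parent section (deduplicated)."""
    terms = set(query.lower().split())
    hits = []
    for chunk in sub_chunks:
        words = set(re.findall(r"\w+", chunk["text"].lower()))
        if terms & words and chunk["parent_id"] not in hits:
            hits.append(chunk["parent_id"])
    return [parents[pid] for pid in hits]

memo = """
Team
Jane Doe founded Acme in 2021. She previously sold Beta Labs.
Financials
Revenue grew 40% YoY. Burn is 200k per month.
"""
sub_chunks, parents = build_index("acme", memo)
results = retrieve_parents("burn", sub_chunks, parents)
print(results)
```

The sub-chunk is what gets matched; the parent section is what reaches the LLM. Small chunks keep matching precise, and returning the section restores the context the sentence lost.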

Contextual retrieval and HyDE

Contextual Retrieval (Anthropic, September 19, 2024): before embedding each chunk, prompt an LLM to generate 2–3 sentences describing what the chunk is about in the context of the full document. Prepend that context to the chunk text before embedding. The embedding then reflects the chunk's role in the document, not just its local tokens. A VC memo chunk reading 'Revenue grew 40% YoY' without context embeds near any revenue-growth document; with context ('This excerpt from Acme Corp's Series B memo discusses 2023 financial performance and sets up the valuation rationale') it is specific to the document and query intent. Anthropic reported 49% fewer retrieval failures when contextual embeddings were combined with contextual BM25. The cost is one LLM call per chunk at indexing time, paid once per chunk and not per query.
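The indexing-time step reduces to a loop over chunks. In this minimal sketch the prompt wording is a paraphrase written for illustration, not Anthropic's published prompt, and `generate` is a placeholder for whatever LLM client you use.

```python
CONTEXT_PROMPT = (
    "<document>\n{document}\n</document>\n"
    "Here is a chunk from that document:\n<chunk>\n{chunk}\n</chunk>\n"
    "Write 2-3 sentences situating this chunk within the overall document "
    "to improve search retrieval. Answer with the context only."
)

def contextualize_chunks(document, chunks, generate):
    """Return chunk texts with LLM-written context prepended, ready to embed.

    `generate` is any callable prompt -> str. It stands in for a real LLM
    client and is the only place an API call would happen.
    """
    contextualized = []
    for chunk in chunks:
        prompt = CONTEXT_PROMPT.format(document=document, chunk=chunk)
        context = generate(prompt).strip()
        contextualized.append(f"{context}\n\n{chunk}")
    return contextualized

def fake_generate(prompt):
    # Canned response so the sketch runs offline; a real call goes here.
    return "This excerpt from Acme Corp's Series B memo discusses 2023 financial performance."

document = "Acme Corp Series B memo. Revenue grew 40% YoY."
contextualized = contextualize_chunks(document, ["Revenue grew 40% YoY."], fake_generate)
print(contextualized[0])
```

Embed and BM25-index the contextualized text, but keep the original chunk alongside it so the generator sees the author's words and not the LLM's summary of them.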

HyDE (Hypothetical Document Embeddings, Gao et al., arXiv:2212.10496, 2022): instead of embedding the raw query, prompt the LLM to generate a plausible answer, then embed that hypothetical answer. The intuition is that a hallucinated-but-plausible passage about Acme Corp's revenue will embed closer to the actual VC memo than the bare question. HyDE often improves dense retrieval by 5–15% on complex factual questions at zero indexing cost; the cost is paid at query time, one extra LLM call per query. The failure mode: when the LLM confabulates a wrong entity or wrong number in the hypothetical, the embedding drifts toward the wrong documents. Safe usage: apply HyDE on general-knowledge questions; avoid it on domain-specific entity questions where the LLM cannot reliably hypothesize correct facts.
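A toy example makes the mechanism concrete. The sketch below uses a bag-of-words vector in place of a dense encoder and a canned function in place of the LLM, so the documents, query, and hypothetical answer are all invented for illustration.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call a dense encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(text, docs, top_k=1):
    """Rank documents by cosine similarity to the embedded text."""
    vec = embed(text)
    ranked = sorted(docs, key=lambda d: cosine(vec, embed(docs[d])), reverse=True)
    return ranked[:top_k]

def hyde_search(query, docs, generate, top_k=1):
    """HyDE: embed an LLM-written hypothetical answer instead of the raw query."""
    return search(generate(query), docs, top_k)

docs = {
    "memo_acme": "Acme annual recurring revenue grew 40 percent year over year in 2023.",
    "memo_faq": "How fast is the company growing? See the appendix for definitions.",
}

def fake_generate(query):
    # Stand-in for the LLM: plausible answer phrasing, specific numbers not required.
    return "Annual recurring revenue grew strongly year over year."

query = "How fast is Acme growing?"
print(search(query, docs))                      # raw query matches question-like text
print(hyde_search(query, docs, fake_generate))  # hypothetical answer matches the memo
```

The raw question resembles other questions; the hypothetical answer resembles answers. The same property causes the failure mode: if the generated text names the wrong company, the search follows it.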

GraphRAG for cross-document synthesis

Dense retrieval retrieves relevant passages; it cannot synthesize across them. The question 'Which portfolio companies have a founder who has exited before?' requires visiting all memos, extracting founder entities, cross-referencing exit history, and aggregating. GraphRAG (Edge et al., arXiv:2404.16130, April 24, 2024) addresses this by building a knowledge graph from the corpus: entities (companies, founders, investors) and relationships (invested in, co-founded, acquired) are extracted via LLM, organized into a hierarchical community structure, and each community is summarized at multiple resolutions. At query time, a map-reduce pass over community summaries enables global reasoning the retrieval layer cannot perform. In the paper's evaluation, an LLM judge comparing answers head to head preferred GraphRAG over naive dense RAG on comprehensiveness for roughly 72–83% of 'global sensemaking' questions; these are pairwise win rates, not accuracy scores.
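Once entities and relationships are explicit, the prior-exit question becomes a short traversal. The sketch below is a toy illustration of that point only: the triples are hypothetical extractor output, and it omits the community detection and summary map-reduce that the GraphRAG pipeline adds on top.

```python
# Hypothetical (subject, relation, object) triples, as an LLM extractor
# might emit them from separate memos.
triples = [
    ("Jane Doe", "founded", "Acme"),
    ("Jane Doe", "founded", "Beta Labs"),
    ("Beta Labs", "acquired_by", "MegaCorp"),
    ("Raj Patel", "founded", "Orion"),
    ("Fund I", "invested_in", "Acme"),
    ("Fund I", "invested_in", "Orion"),
]

def build_graph(triples):
    """Index triples as graph[subject][relation] -> set of objects."""
    graph = {}
    for subj, rel, obj in triples:
        graph.setdefault(subj, {}).setdefault(rel, set()).add(obj)
    return graph

def neighbors(graph, node, relation):
    return graph.get(node, {}).get(relation, set())

def companies_with_exited_founders(graph, fund):
    """Portfolio companies whose founder also founded a different, acquired company."""
    portfolio = neighbors(graph, fund, "invested_in")
    result = {}
    for person in graph:
        founded = neighbors(graph, person, "founded")
        exits = {c for c in founded if neighbors(graph, c, "acquired_by")}
        for company in founded & portfolio:
            if exits - {company}:
                result[company] = person
    return result

graph = build_graph(triples)
print(companies_with_exited_founders(graph, "Fund I"))
```

No single memo chunk states that Acme's founder has a prior exit. The answer exists only after facts from different documents are joined on the shared entity, and top-k passage retrieval never performs that join.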

The infrastructure cost of GraphRAG is real but modest at this scale: entity extraction across 500 VC memos at ~2 pages each requires ~1,000 LLM calls, and with GPT-4o mini at $0.15 per 1M input tokens the full corpus costs on the order of a few dollars, with $5 a generous ceiling once extraction prompts, output tokens, and community summaries are included. The resulting graph must be stored (Neo4j, or the file-based storage that the open-source implementations ship with), and community detection (Leiden algorithm) adds offline compute. The payoff: DealLens gains the ability to answer portfolio-level synthesis questions that are currently manual analyst work. Microsoft's open-source GraphRAG library and the lighter-weight LightRAG (Guo et al., arXiv:2410.05779, October 2024) are the two implementations worth evaluating.
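The cost estimate is easy to rerun with your own numbers. In this back-of-envelope sketch the per-call token counts and the $0.60 per 1M output-token price are assumptions for illustration; only the call count and the $0.15 input price come from the text above.

```python
def extraction_cost_usd(num_calls, input_tokens_per_call, output_tokens_per_call,
                        input_price_per_m, output_price_per_m):
    """Total LLM cost for an extraction pass, with prices given per 1M tokens."""
    input_cost = num_calls * input_tokens_per_call / 1_000_000 * input_price_per_m
    output_cost = num_calls * output_tokens_per_call / 1_000_000 * output_price_per_m
    return input_cost + output_cost

num_calls = 500 * 2  # 500 memos, ~2 pages each, one call per page

cost = extraction_cost_usd(
    num_calls,
    input_tokens_per_call=2_000,   # assumed: page text plus extraction prompt
    output_tokens_per_call=500,    # assumed: extracted entities and relations
    input_price_per_m=0.15,
    output_price_per_m=0.60,       # assumed output price
)
print(f"${cost:.2f}")
```

Under these assumptions a single extraction pass costs well under a dollar. Repeated extraction passes, community summaries, and retries multiply that figure, so budget for a few dollars and check current pricing before committing.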

The eval discipline that separates demos from production

RAGAS (Es et al., arXiv:2309.15217, September 2023) and its open-source library decompose RAG quality into four metrics: context precision (retrieved chunks are relevant), context recall (relevant chunks were retrieved), faithfulness (answer claims are grounded in retrieved context), and answer relevancy (answer addresses the question). You need all four because they fail independently. A system with high faithfulness but low context recall is honest but incomplete. A system with high context precision but low faithfulness hallucinates from correct inputs. Measuring only end-to-end accuracy collapses the diagnostic signal.
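Simplified set-based versions of three of these metrics show why they move independently. These are stand-ins for intuition, not RAGAS's own definitions, which rely on LLM judgments and rank-aware scoring. The chunk IDs and verdicts are invented.

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunk IDs that are labeled relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved, relevant):
    """Fraction of labeled-relevant chunk IDs that were retrieved."""
    if not relevant:
        return 1.0
    return sum(1 for c in relevant if c in retrieved) / len(relevant)

def faithfulness(claim_verdicts):
    """Fraction of answer claims a judge marked as supported by the context."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# One labeled example: the answer needs chunks c1 and c5, retrieval returned c1..c4.
retrieved = ["c1", "c2", "c3", "c4"]
relevant = {"c1", "c5"}
claim_verdicts = [True, True, True]  # every claim in the answer is grounded

precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
faith = faithfulness(claim_verdicts)
print(precision, recall, faith)
```

This example is the honest-but-incomplete case: faithfulness is perfect, but half the evidence never arrived, so the answer is grounded and still wrong by omission. An end-to-end accuracy score would not tell you the fix belongs in retrieval.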

The production eval discipline: (1) label 100–200 question-answer pairs from real production queries, not synthetic generation; (2) run RAGAS on every retrieval pipeline change, not just model changes, because chunking and reranker updates break recall silently; (3) build a regression gate in CI that fails the deployment if any RAGAS metric drops more than 2% from baseline; (4) run LLM-as-judge on a random 5% sample of production traffic weekly to detect distribution shift. TruLens (TruEra, 2024) and Arize Phoenix both integrate RAGAS and support production trace logging. For DealLens, the labeled eval set should be built from real analyst questions about real memos, annotated by the same analysts who will use the system.
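The regression gate in step (3) can be a single function. This sketch reads "2%" as a relative drop from baseline, which is an assumption; switch to an absolute difference if your team means percentage points. The metric values are invented.

```python
def regression_failures(baseline, current, max_drop=0.02):
    """Return {metric: relative_drop} for metrics that fell more than max_drop."""
    failures = {}
    for name, base in baseline.items():
        cur = current.get(name, 0.0)  # a missing metric counts as a full drop
        drop = (base - cur) / base if base > 0 else 0.0
        if drop > max_drop:
            failures[name] = round(drop, 4)
    return failures

def gate_exit_code(failures):
    """Non-zero exit code fails the CI job."""
    return 1 if failures else 0

baseline = {"context_precision": 0.80, "context_recall": 0.75,
            "faithfulness": 0.90, "answer_relevancy": 0.85}
current = {"context_precision": 0.81, "context_recall": 0.70,
           "faithfulness": 0.89, "answer_relevancy": 0.85}

failures = regression_failures(baseline, current)
print(failures, gate_exit_code(failures))
```

In CI, pass the result to sys.exit so the pipeline stops on a regression. Here the faithfulness dip stays inside tolerance while the recall drop trips the gate, the kind of silent break that a chunking or reranker change produces.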

The eval you skip is the bug you ship

Every RAG demo that never went to production died the same way: the developer was the only evaluator, and the developer knew where the answers were. Build a labeled eval set from real user queries before you build another feature.

Figure 20.1 Advanced RAG pipeline: optional HyDE query expansion, parallel BM25 and dense retrieval merged via RRF, cross-encoder reranking to top-10, and offline RAGAS evaluation across context precision, recall, and faithfulness.

Primary source · Build · Capstone ladder

Primary source. Introducing Contextual Retrieval, Anthropic (September 19, 2024)

Build. Take 50 VC memos. Build three retrievers — naive dense (all-MiniLM-L6-v2), hybrid BM25 + dense with cross-encoder reranker, and contextual retrieval. Run RAGAS evaluation across 20 labeled Q&A pairs annotated by a domain expert. Report context precision, context recall, and faithfulness for each retriever. Bonus: add GraphRAG indexing and compare on 5 cross-document synthesis questions.

Capstone ladder. This is DealLens. The retrieval architecture, eval suite, and contextual indexing pipeline you build here are the DealLens core product. Every subsequent feature (scoring, memo generation, portfolio synthesis) depends on this foundation being correct and measured.

Retrieve before you continue

Three questions on what you just read

Q1 · Factual · What does Anthropic's Contextual Retrieval technique do and what improvement did it demonstrate?

Q2 · Conceptual · Why does GraphRAG outperform dense retrieval on global sensemaking questions?

Q3 · Synthetic · DealLens needs to answer 'Which founders in our portfolio have prior exits?' Why is this question unsolvable with naive dense retrieval, and what architecture would you use?