From Tokens to Embodied Minds · Drill cards · Chapter 23
Drills
Observability — traces, evals, regression
10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.
In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 23, note type = Basic.
| Front | Back |
|---|---|
| What is a trace in the context of LLM observability? | A span tree rooted at a top-level request. Each span represents one unit of work (LLM call, tool call, retrieval) and captures inputs, outputs, latency, and cost. |
| What is the regression gate in a CI/CD pipeline for LLM systems? | An automated test that runs the full labeled eval set after every change and blocks deployment if any metric drops more than a threshold (typically 2%) from the baseline. |
| Why must eval sets be built from production traces, not synthetic data? | Synthetic labels from the same LLM you are evaluating are self-referential — the model grades its own homework. Production traces contain real user queries and real failure modes that synthetic generation does not cover. |
| What does LLM-as-judge calibration mean and how do you measure it? | Measuring the judge's agreement rate with human labels on a held-out set (50 examples minimum). A judge with less than 70% agreement with human labels should not be used for production evals. |
| What is distribution shift in the context of production RAG systems and how do online evals detect it? | Distribution shift: real user query distribution drifts from the labeled eval set, so offline evals stay green while production quality degrades. Online eval samples 5% of production traffic, runs LLM-as-judge, and alerts when scores drift more than 10% from the rolling baseline. |
| Name four LLM observability tools and the primary differentiator of each. | Langfuse (open-source, self-hostable, best LangGraph integration), LangSmith (LangChain hosted, tightest LangChain ecosystem integration), Arize Phoenix (strong for both LLM and traditional ML monitoring), W&B Weave (best if you use W&B Experiments for fine-tuning runs). |
| What does faithfulness drift in a DealLens eval suite signal? | Answers are becoming less grounded in the retrieved context, usually because the RAG pipeline is retrieving less relevant context. The likely cause is that the deal memo distribution has shifted (new sectors, structures) and the retrieval index or embeddings no longer represent current traffic. It should trigger a re-labeling sprint and a retrieval pipeline review. |
| What is the robotics equivalent of LangSmith trace viewing? | Rerun.io, a visual trajectory replay interface for ROS 2 bags. ROS 2 bags are the equivalent of LLM traces: they capture every recorded sensor reading, actuator command, and policy output. |
| How many production traces should a labeled eval set contain? | 200–500 for a production system. Fewer than 100 gives insufficient statistical power to detect 2% regressions reliably. More than 1,000 is rarely needed for a focused system like DealLens. |
| What is the primary use of Arize Phoenix for an engineer building both DealLens and a humanoid policy? | Phoenix supports LLM tracing and traditional ML model monitoring in the same UI — relevant if DealLens trains custom scoring heads or if the humanoid policy includes a vision model with performance metrics to track alongside LLM traces. |
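
The regression gate card can be made concrete with a minimal sketch. It assumes metrics are scores on a 0 to 1 scale and that the 2% threshold means an absolute drop of 0.02; the cards do not say whether the threshold is absolute or relative. The names `regression_gate`, `GateResult`, and `max_drop` are hypothetical and do not come from any particular eval framework.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    failures: dict  # metric name -> absolute drop from baseline


def regression_gate(baseline, current, max_drop=0.02):
    """Compare current eval metrics against the baseline.

    Assumption: scores are on a 0-1 scale and max_drop is an absolute
    difference. A metric missing from `current` is scored as 0.0, so a
    missing metric fails the gate instead of silently passing.
    """
    failures = {}
    for name, base_score in baseline.items():
        drop = base_score - current.get(name, 0.0)
        if drop > max_drop:
            failures[name] = round(drop, 4)
    return GateResult(passed=not failures, failures=failures)


# Hypothetical scores from one run of the labeled eval set.
baseline = {"faithfulness": 0.90, "relevance": 0.80}
current = {"faithfulness": 0.85, "relevance": 0.79}
result = regression_gate(baseline, current)
```

In a CI job, the calling script would exit with a non-zero status when `result.passed` is false, which is what blocks the deployment. Improvements (negative drops) never fail the gate.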