From Tokens to Embodied Minds  ·  Drill cards · Chapter 23
Observability — traces, evals, regression

10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.

In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 23, note type = Basic.

Front	Back
What is a trace in the context of LLM observability?	A span tree rooted at a top-level request. Each span represents one unit of work (LLM call, tool call, retrieval) and captures inputs, outputs, latency, and cost.
What is the regression gate in a CI/CD pipeline for LLM systems?	An automated test that runs the full labeled eval set after every change and blocks deployment if any metric drops by more than a threshold (typically 2%) from the baseline.
Why must eval sets be built from production traces, not synthetic data?	Synthetic labels from the same LLM you are evaluating are self-referential: the model grades its own homework. Production traces contain real user queries and real failure modes that synthetic generation does not cover.
What does LLM-as-judge calibration mean and how do you measure it?	It means measuring the judge's agreement rate with human labels on a held-out set (50 examples minimum). A judge that agrees with human labels less than 70% of the time should not be used for production evals.
What is distribution shift in the context of production RAG systems and how do online evals detect it?	Distribution shift: the real user query distribution drifts away from the labeled eval set, so offline evals stay green while production quality degrades. Online evals sample 5% of production traffic, run an LLM-as-judge over the sample, and alert when scores drift by more than 10% from the rolling baseline.
Name four LLM observability tools and the primary differentiator of each.	Langfuse (open-source, self-hostable, best LangGraph integration), LangSmith (LangChain hosted, tightest LangChain ecosystem integration), Arize Phoenix (strong for both LLM and traditional ML monitoring), W&B Weave (best if you use W&B Experiments for fine-tuning runs).
What does faithfulness drift in a DealLens eval suite signal?	The RAG pipeline is retrieving less relevant context, likely because the deal memo distribution has shifted (new sectors, structures) and the retrieval index or embeddings no longer represent current traffic. It should trigger a re-labeling sprint and a retrieval pipeline review.
What is the robotics equivalent of LangSmith trace viewing?	Rerun.io, a visual trajectory replay interface for ROS 2 bags. ROS 2 bags are the equivalent of LLM traces: they capture every sensor reading, actuator command, and policy output.
How many production traces should a labeled eval set contain?	200–500 for a production system. Fewer than 100 gives insufficient statistical power to detect 2% regressions reliably. More than 1,000 is rarely needed for a focused system like DealLens.
What is the primary use of Arize Phoenix for an engineer building both DealLens and a humanoid policy?	Phoenix supports LLM tracing and traditional ML model monitoring in the same UI. This is relevant if DealLens trains custom scoring heads, or if the humanoid policy includes a vision model whose performance metrics need tracking alongside LLM traces.
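
Worked sketch for the regression gate and judge calibration cards. This is a minimal illustration, not the chapter's implementation. It assumes metric scores lie on a 0 to 1 scale and that the 2% threshold means an absolute drop of 0.02; the cards do not say whether the drop is absolute or relative. The function names and the "relevance" metric are hypothetical, chosen only for the example.

```python
from typing import Dict, List, Sequence


def regression_gate(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_drop: float = 0.02,  # assumption: absolute drop on a 0-1 scale
) -> List[str]:
    """Return the metrics that fell by more than max_drop versus the baseline."""
    failures = []
    for name, base_score in baseline.items():
        new_score = candidate.get(name)
        # A metric missing from the candidate run is treated as a failure.
        if new_score is None or base_score - new_score > max_drop:
            failures.append(name)
    return failures


def judge_agreement(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of held-out examples where the judge matches the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("judge and human label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one labeled example")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def judge_is_usable(
    judge_labels: Sequence[str],
    human_labels: Sequence[str],
    min_examples: int = 50,       # held-out set size from the calibration card
    min_agreement: float = 0.70,  # agreement floor from the calibration card
) -> bool:
    """Apply both calibration rules: enough examples and enough agreement."""
    if len(human_labels) < min_examples:
        return False
    return judge_agreement(judge_labels, human_labels) >= min_agreement


if __name__ == "__main__":
    baseline = {"faithfulness": 0.90, "relevance": 0.85}
    candidate = {"faithfulness": 0.87, "relevance": 0.84}
    failed = regression_gate(baseline, candidate)
    # In CI, a non-empty list would block the deployment.
    print("blocked by:", failed)
```

In a CI job, a non-empty result from regression_gate would translate into a non-zero exit code so that the pipeline stops before deployment.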