Observability — traces, evals, regression

10 atomic recall cards. Import them into Anki and let spaced repetition do its slow work.

In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 23, note type = Basic.

Front	Back
What is a trace in the context of LLM observability?	A span tree rooted at a top-level request. Each span represents one unit of work (LLM call, tool call, retrieval) and captures inputs, outputs, latency, and cost.
What is the regression gate in a CI/CD pipeline for LLM systems?	An automated test that runs the full labeled eval set after every change and blocks deployment if any metric drops by more than a set threshold (typically 2%) relative to the baseline.
Why must eval sets be built from production traces, not synthetic data?	Synthetic labels from the same LLM you are evaluating are self-referential — the model grades its own homework. Production traces contain real user queries and real failure modes that synthetic generation does not cover.
What does LLM-as-judge calibration mean and how do you measure it?	Measuring the judge's agreement rate with human labels on a held-out set (50 examples minimum). A judge with less than 70% agreement with human labels should not be used for production evals.
What is distribution shift in the context of production RAG systems and how do online evals detect it?	Distribution shift: real user query distribution drifts from the labeled eval set, so offline evals stay green while production quality degrades. Online eval samples 5% of production traffic, runs LLM-as-judge, and alerts when scores drift more than 10% from the rolling baseline.
Name four LLM observability tools and the primary differentiator of each.	Langfuse (open-source, self-hostable, best LangGraph integration), LangSmith (LangChain hosted, tightest LangChain ecosystem integration), Arize Phoenix (strong for both LLM and traditional ML monitoring), W&B Weave (best if you use W&B Experiments for fine-tuning runs).
What does faithfulness drift in a DealLens eval suite signal?	The RAG pipeline is retrieving less relevant context — likely because the deal memo distribution has shifted (new sectors, structures) and the retrieval index or embeddings no longer represent current traffic. Triggers a re-labeling sprint and retrieval pipeline review.
What is the robotics equivalent of LangSmith trace viewing?	Rerun (rerun.io), a visualization tool for replaying recorded robot data such as trajectories and sensor streams. ROS 2 bags are the equivalent of LLM traces: they capture every sensor reading, actuator command, and policy output.
How many production traces should a labeled eval set contain?	200–500 for a production system. Fewer than 100 gives insufficient statistical power to detect 2% regressions reliably. More than 1,000 is rarely needed for a focused system like DealLens.
What is the primary use of Arize Phoenix for an engineer building both DealLens and a humanoid policy?	Phoenix supports LLM tracing and traditional ML model monitoring in the same UI — relevant if DealLens trains custom scoring heads or if the humanoid policy includes a vision model with performance metrics to track alongside LLM traces.