Production LLM systems degrade silently. A prompt change that improves output style can also shift the distribution of retrieved context. A model version upgrade that reduces hallucination can also change the token distribution in ways that break downstream parsers. A retrieval pipeline update that improves recall also changes the exact chunks that reach the LLM, which can turn previously correct answers into incorrect ones. None of these failures produce exceptions. They produce bad outputs that users silently stop trusting, until the system is abandoned.
Observability is the discipline that catches these failures before users do. It has three layers: (1) trace-level logging, where every LLM call, tool call, latency, and token cost is captured as a span tree; (2) an offline eval suite, a labeled dataset of inputs and expected outputs that is run against every prompt or model change in CI; (3) online eval, where an LLM-as-judge runs on a random sample of production traffic to detect distribution shift between deploys. The tools (LangSmith, Langfuse, Arize Phoenix, Weights & Biases Weave) differ mainly in UI and pricing, not in core function. The discipline is the same across all of them.
Trace-level logging: what to capture
A trace is a span tree rooted at the top-level request. Each span represents one unit of work: one LLM call, one tool invocation, one retrieval, one reranking. Each span captures: inputs (the exact prompt or query), outputs (the exact response or result), latency (wall-clock milliseconds), and cost (tokens in + tokens out at the current model's per-token pricing). The trace tree is the only artifact that tells you exactly what the LLM received and exactly what it returned — it is the ground truth for every downstream eval and every debugging session.
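The span tree described above can be sketched as a small recursive data structure. This is a minimal illustration, not the schema of any particular tracing tool: the `Span` class, its field names, and the per-token prices are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical per-token prices in USD; real prices depend on model and provider.
PRICE_PER_INPUT_TOKEN = 3.0 / 1_000_000
PRICE_PER_OUTPUT_TOKEN = 15.0 / 1_000_000


@dataclass
class Span:
    """One unit of work: an LLM call, tool invocation, retrieval, or reranking."""
    name: str
    inputs: str           # exact prompt or query
    outputs: str          # exact response or result
    latency_ms: float     # wall-clock time (assumed here to include children)
    tokens_in: int = 0    # zero for spans that do not call an LLM
    tokens_out: int = 0
    children: list["Span"] = field(default_factory=list)

    def own_cost(self) -> float:
        """Cost of this span alone, pricing input and output tokens separately."""
        return (self.tokens_in * PRICE_PER_INPUT_TOKEN
                + self.tokens_out * PRICE_PER_OUTPUT_TOKEN)

    def total_cost(self) -> float:
        """Cost of this span plus all of its descendants."""
        return self.own_cost() + sum(c.total_cost() for c in self.children)

    def walk(self, depth: int = 0):
        """Yield (depth, span) pairs in depth-first order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# A toy trace: one request with a retrieval span and an LLM span.
trace = Span("request", "Score this memo", "Score: 7/10", 4200.0, children=[
    Span("retrieval", "memo query", "5 chunks", 300.0),
    Span("draft_memo_llm", "full prompt text", "draft memo text", 3800.0,
         tokens_in=2000, tokens_out=500),
])

for depth, span in trace.walk():
    print("  " * depth + f"{span.name}: {span.latency_ms:.0f} ms, ${span.own_cost():.4f}")
```

Storing exact inputs and outputs on every span is what makes the trace usable later as eval data and as a debugging record.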
Langfuse (open-source, self-hostable) and LangSmith (LangChain's hosted service) both support LangGraph tracing with a very short tracer setup. Every LangGraph node automatically becomes a span. Every LLM call within a node is a child span. The latency breakdown shows you whether your bottleneck is retrieval, the LLM, or the reranker. The cost breakdown shows you which node accounts for most of your per-run API spend. For DealLens, this tells you whether the expensive node is the draft-memo LLM call (likely) or the retrieval cross-encoder (less likely but worth measuring).
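The tracing UIs compute the latency and cost breakdown for you, but the aggregation itself is simple. The sketch below assumes spans have been exported as flat records with hypothetical `node`, `latency_ms`, and `cost_usd` keys; the node names and numbers are invented for illustration.

```python
from collections import defaultdict


def breakdown_by_node(spans):
    """Sum latency and cost per node and return rows sorted by cost, highest first.

    Each row is (node, total_latency_ms, total_cost_usd, share_of_total_cost).
    """
    totals = defaultdict(lambda: {"latency_ms": 0.0, "cost_usd": 0.0})
    for s in spans:
        totals[s["node"]]["latency_ms"] += s["latency_ms"]
        totals[s["node"]]["cost_usd"] += s["cost_usd"]
    # Guard against division by zero when no span has a cost.
    total_cost = sum(t["cost_usd"] for t in totals.values()) or 1.0
    rows = [
        (node, t["latency_ms"], t["cost_usd"], t["cost_usd"] / total_cost)
        for node, t in totals.items()
    ]
    return sorted(rows, key=lambda r: r[2], reverse=True)


# Hypothetical exported spans from a single run.
spans = [
    {"node": "retrieve", "latency_ms": 250.0, "cost_usd": 0.0},
    {"node": "rerank", "latency_ms": 400.0, "cost_usd": 0.001},
    {"node": "draft_memo", "latency_ms": 3000.0, "cost_usd": 0.02},
]

for node, latency, cost, share in breakdown_by_node(spans):
    print(f"{node:12s} {latency:7.0f} ms  ${cost:.4f}  {share:.0%} of spend")
```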
Arize Phoenix and W&B Weave are the alternatives. Phoenix has strong eval integration and supports both LLM tracing and traditional ML model monitoring in the same UI, which is relevant if DealLens eventually trains custom scoring heads. Weave is tightly integrated with W&B Experiments, which is valuable if you are also running fine-tuning runs and want all evaluation metrics in one dashboard. For most teams starting out, Langfuse is the practical choice: it is open-source, self-hostable, and integrates well with LangGraph.
Offline eval suite and regression gate
The offline eval suite is a fixed dataset of (input, expected_output) pairs, run against the system after every change, with a pass/fail gate that blocks deployment on regression. Building the dataset: export 200–500 production traces, label a subset by hand (correct / incorrect / partial, with a note explaining why), and store them in a versioned file (JSON or CSV). The labeling must be done by domain experts — for DealLens, by the analysts who actually use the scoring output. Synthetic labels from the same LLM you are evaluating are a self-referential trap.
The regression gate: run RAGAS (or your custom LLM-as-judge) on the full labeled eval set in CI after every commit that touches a prompt, retrieval configuration, or model version. Set a threshold and state what it is measured against: for example, a 2% drop in any metric relative to the last accepted baseline blocks the deploy. This prevents the common failure mode of 'we changed the prompt to improve conciseness and somehow broke deal scoring on biotech memos.' Without the regression gate, that failure would reach production. With it, it is a CI failure caught in 5 minutes.
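The gate logic itself is a comparison of two metric dictionaries. The sketch below assumes the eval run has already produced per-metric scores (from RAGAS or a custom judge) and interprets the 2% threshold as a relative drop against a stored baseline; that interpretation, the function name, and the metric values are assumptions for illustration.

```python
import sys


def regression_failures(baseline, candidate, max_relative_drop=0.02):
    """Return the metrics whose score dropped by more than max_relative_drop."""
    failures = {}
    for metric, base in baseline.items():
        new = candidate.get(metric)
        if new is None:
            # A metric that disappears from the run is treated as a failure.
            failures[metric] = "missing from candidate run"
            continue
        if base > 0 and (base - new) / base > max_relative_drop:
            failures[metric] = f"{base:.3f} -> {new:.3f}"
    return failures


def gate(baseline, candidate):
    """Exit non-zero so that CI blocks the deploy when any metric regresses."""
    failures = regression_failures(baseline, candidate)
    for metric, detail in failures.items():
        print(f"REGRESSION {metric}: {detail}")
    if failures:
        sys.exit(1)
    print("eval gate passed")


# Hypothetical scores: the baseline from the last accepted run, the candidate from this commit.
baseline = {"faithfulness": 0.90, "answer_relevancy": 0.85}
candidate = {"faithfulness": 0.87, "answer_relevancy": 0.85}
print(regression_failures(baseline, candidate))
```

In a CI job, `gate(baseline, candidate)` would be the last step; improvements never fail the gate, only drops beyond the threshold do.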
LLM-as-judge calibration: do not trust an LLM judge that has not been calibrated against human labels. Measure the judge's agreement rate with human labels on 50 examples before using it for evals. A judge with 70% agreement with humans is useful; on a binary label with balanced classes, a judge with 50% agreement is no better than a coin flip. The judge's system prompt, model choice, and output format all affect agreement rate, so treat calibration as an engineering problem, not a prompting problem.
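Measuring agreement takes only a few lines. The sketch below computes the raw agreement rate described above and, as an addition the text does not name, Cohen's kappa, which corrects agreement for what two raters would reach by chance given their label frequencies. The label lists are made-up data for illustration.

```python
from collections import Counter


def agreement_rate(human, judge):
    """Fraction of examples where the judge's label matches the human label."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Chance-corrected agreement between two raters over the same examples."""
    n = len(human)
    p_observed = agreement_rate(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    # Expected agreement if both raters labeled independently at their own rates.
    p_expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if p_expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is 1
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical calibration set of 10 examples (a real one would use about 50).
human = ["correct"] * 6 + ["incorrect"] * 4
judge = ["correct"] * 5 + ["incorrect"] + ["incorrect"] * 3 + ["correct"]

print(f"agreement: {agreement_rate(human, judge):.2f}")
print(f"kappa:     {cohens_kappa(human, judge):.2f}")
```

Raw agreement can look high when one label dominates the dataset, which is why a chance-corrected number is a useful companion to it.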
Online eval: catching distribution shift in production
The offline eval suite catches regressions on a fixed dataset. It cannot catch distribution shift — when the distribution of real user queries drifts away from your labeled eval set, the offline suite stays green while production quality degrades. Online eval closes this gap: sample 5% of production traffic, run LLM-as-judge on the sampled traces, track the judge scores over time. Alert when any metric drifts more than 10% from the 4-week rolling baseline.
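The sampling and alerting steps can be sketched as two small functions. This is one possible design, not a prescribed one: hashing the trace ID gives a deterministic pseudo-random 5% sample (so a given trace is always in or out), and the alert compares the latest weekly mean judge score with the mean of the four preceding weeks. The function names, the weekly aggregation, and the scores are assumptions for illustration.

```python
import hashlib


def in_sample(trace_id: str, rate: float = 0.05) -> bool:
    """Select roughly `rate` of traces by hashing the trace ID into [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def drift_alert(weekly_scores, baseline_weeks=4, max_relative_drift=0.10):
    """Return True when the latest week deviates too far from the rolling baseline."""
    if len(weekly_scores) < baseline_weeks + 1:
        return False  # not enough history to form a baseline
    # Baseline is the mean of the weeks immediately before the latest one.
    baseline = sum(weekly_scores[-baseline_weeks - 1:-1]) / baseline_weeks
    latest = weekly_scores[-1]
    return baseline > 0 and abs(latest - baseline) / baseline > max_relative_drift


# Hypothetical weekly mean faithfulness scores from the judge, oldest first.
weekly_faithfulness = [0.90, 0.91, 0.89, 0.90, 0.78]
print(drift_alert(weekly_faithfulness))
```

The latest week is excluded from its own baseline so that a sudden drop is not partly absorbed into the number it is compared against.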
For DealLens, the signal to watch is faithfulness drift. If the judge's faithfulness score drops over two consecutive weeks, the generated output is less well supported by the retrieved context. A likely cause is that your RAG pipeline is retrieving less relevant context because the deal memo distribution has shifted (new sectors, new deal structures), in which case your eval labels no longer represent the current traffic. The alert triggers a re-labeling sprint and a retrieval pipeline review. This feedback loop (online eval detects drift, offline labels are refreshed, regression gate is updated) is the production LLM ops discipline. It is tedious and it is the actual work.
The robotics analog: trajectory logging and replay
The JHU humanoid capstone requires the same observability discipline, with different tools. ROS 2 bags capture every sensor reading, every actuator command, and every policy output — the robotics equivalent of LLM traces. Rerun.io provides a visual trajectory replay interface — the robotics equivalent of LangSmith's trace viewer. The offline eval is a fixed set of task scenarios run in Isaac Lab simulation after every policy update — the robotics regression gate. The online eval is periodic replay of recorded real-robot trajectories against a success classifier.
The structural lesson is the same across both domains: you need (1) comprehensive logging of every action and its context, (2) a fixed evaluation benchmark that does not change between experiments, and (3) automated regression gates that block deployment of worse policies. The only difference is that a bad LLM prompt costs a failed deal evaluation; a bad robot policy costs a broken object or a safety incident.
Synthetic eval sets generated by the same LLM you are evaluating produce self-referential metrics — the model grades its own homework. Label production traces by hand with domain experts. There is no shortcut.