Chapter 19 · Evals and Observability — The Agentic Enterprise

You cannot govern what you cannot see. Agentic systems that run without comprehensive observability are, from a risk management perspective, essentially black boxes: they produce outputs, they take actions, they consume resources, and the organization has no systematic way to know whether they are doing so correctly, safely, or within the bounds of their governance charter. The observability stack for an agentic program — spanning traces, metrics, replays, and structured evaluations — is not a nice-to-have for mature deployments; it is the prerequisite for responsible deployment at any scale. Without it, the governance charter is a fiction and the policy stack is unenforceable.

Traces: The Atomic Unit of Observability

A trace is a structured record of a single agent run from start to finish. It captures every significant event in the run's execution: the initial prompt, each model call and its response, each tool invocation and its result, the intermediate reasoning produced at each planning step, the final output, the total token consumption, the wall-clock latency, and any errors or exceptions that occurred. Traces are the atomic unit of agentic observability because they contain all the information needed to reconstruct what happened during a run and to evaluate whether it went well.
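As a rough illustration of the record described above, a trace can be modeled as a run header plus an ordered list of events. This is a minimal sketch, not the schema of any particular tracing product; the class names, field names, and event kinds are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class TraceEvent:
    """One step in a run: a model call, a tool call, or a planning step."""
    kind: str                      # hypothetical labels: "model_call", "tool_call", "plan"
    name: str                      # model or tool identifier
    input_data: Any
    output_data: Any
    tokens: int = 0
    latency_ms: float = 0.0
    error: Optional[str] = None


@dataclass
class Trace:
    """A structured record of a single agent run."""
    run_id: str
    prompt: str
    events: list[TraceEvent] = field(default_factory=list)
    final_output: Optional[str] = None

    @property
    def total_tokens(self) -> int:
        return sum(e.tokens for e in self.events)

    @property
    def total_latency_ms(self) -> float:
        # Sum of step latencies; true wall-clock time differs if steps overlap.
        return sum(e.latency_ms for e in self.events)

    @property
    def errors(self) -> list[str]:
        return [e.error for e in self.events if e.error is not None]


# Illustrative run with one model call and one failed tool call.
trace = Trace(run_id="run-001", prompt="Summarize the open invoices.")
trace.events.append(
    TraceEvent("model_call", "planner", "prompt text", "call invoice_lookup",
               tokens=120, latency_ms=800.0)
)
trace.events.append(
    TraceEvent("tool_call", "invoice_lookup", {"status": "open"}, None,
               latency_ms=50.0, error="timeout")
)
trace.final_output = "Could not retrieve invoices."
```

Keeping the aggregates as derived properties rather than stored fields means they cannot drift out of sync with the event list.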

The challenge with traces is their volume and complexity. A production agent handling hundreds of tasks per day generates traces that collectively contain millions of data points, far too many for humans to review manually. The observability infrastructure must therefore provide efficient storage, indexing, and querying of trace data, along with automated analysis that surfaces the traces most likely to be informative: the ones that failed, the ones that were unusually expensive, the ones that triggered escalation, and the ones that deviated significantly from the historical baseline.
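The triage step can be sketched in a few lines. The example below assumes each trace has already been summarized into a dictionary with the hypothetical keys run_id, failed, escalated, and cost_usd, and it uses a deliberately crude baseline (a multiple of the batch median cost) in place of a real anomaly detector.

```python
from statistics import median


def flag_traces(traces, cost_ratio=3.0):
    """Return (run_id, reasons) pairs for traces worth a human look.

    A trace is flagged if it failed, escalated, or cost more than
    cost_ratio times the median cost of the batch.
    """
    baseline = median(t["cost_usd"] for t in traces)
    flagged = []
    for t in traces:
        reasons = []
        if t["failed"]:
            reasons.append("failed")
        if t["escalated"]:
            reasons.append("escalated")
        if t["cost_usd"] > cost_ratio * baseline:
            reasons.append("expensive")
        if reasons:
            flagged.append((t["run_id"], reasons))
    return flagged


traces = [
    {"run_id": "a", "failed": False, "escalated": False, "cost_usd": 0.10},
    {"run_id": "b", "failed": True,  "escalated": False, "cost_usd": 0.12},
    {"run_id": "c", "failed": False, "escalated": False, "cost_usd": 0.11},
    {"run_id": "d", "failed": False, "escalated": True,  "cost_usd": 0.90},
]
flagged = flag_traces(traces)
# flagged == [("b", ["failed"]), ("d", ["escalated", "expensive"])]
```

In practice the baseline would come from a longer historical window than a single batch, and the cost rule would be one of several deviation checks.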

Langfuse is currently one of the most widely deployed open-source tracing solutions for LLM-based agents. It provides an OpenTelemetry-compatible trace format, a rich web interface for trace exploration, and a scoring system that allows human reviewers to annotate traces with quality judgments that feed back into automated evaluation pipelines. Arize Phoenix offers similar capabilities with a stronger emphasis on statistical analysis of trace populations — identifying systematic patterns in how the agent's behavior varies across different input types, user groups, or time periods.

Structured Evals: Measuring What Matters

Observability tells you what happened; evaluations tell you whether it was good. A structured evaluation is a test of the agent's behavior on a defined set of scenarios, scored against defined criteria. The criteria for agentic evaluations differ from those for conventional model evaluations. In addition to output quality (is the agent's answer correct?), they must cover behavioral dimensions that are specific to autonomous systems: tool use appropriateness (did the agent call the right tools in the right order?), instruction-following fidelity (did the agent respect the constraints in its system prompt?), escalation correctness (did the agent escalate when it should have, and only then?), and safety compliance (did the agent avoid prohibited actions even when the task seemed to require them?).
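Several of these dimensions can be checked mechanically once a scenario states its expectations. The sketch below is illustrative only: the Scenario and RunResult types are hypothetical, output quality is reduced to exact match, and instruction-following fidelity is left out because it usually needs a rubric-based judge rather than a deterministic check.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    expected_answer: str
    expected_tools: list[str]          # tools in the order they should be called
    should_escalate: bool
    prohibited_tools: frozenset[str] = frozenset()


@dataclass
class RunResult:
    answer: str
    tools_called: list[str]
    escalated: bool


def score_run(scenario: Scenario, run: RunResult) -> dict[str, bool]:
    """Score one run on four pass/fail dimensions."""
    return {
        "output_quality": run.answer.strip() == scenario.expected_answer.strip(),
        "tool_use": run.tools_called == scenario.expected_tools,
        "escalation": run.escalated == scenario.should_escalate,
        "safety": not (set(run.tools_called) & scenario.prohibited_tools),
    }


scenario = Scenario(
    expected_answer="Refund issued.",
    expected_tools=["lookup_order", "issue_refund"],
    should_escalate=False,
    prohibited_tools=frozenset({"delete_account"}),
)
run = RunResult(
    answer="Refund issued.",
    tools_called=["lookup_order", "delete_account"],
    escalated=False,
)
scores = score_run(scenario, run)
```

Here the run produces the right answer yet fails on tool use and safety, which is exactly the kind of result an output-only evaluation would miss.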

Building effective evaluation suites for agents requires a combination of automated and human evaluation. Automated evaluation using LLM judges, in which a second model scores the primary agent's responses against a defined rubric, is scalable and cheap, but it inherits the biases of the judge model and can be unreliable for adversarial scenarios. Human evaluation is generally more reliable but expensive and slow. A common practice is a hybrid: automated evaluation for routine quality metrics, and human evaluation for a statistically sampled subset of runs and for all edge cases identified through anomaly detection on the trace data.
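The routing rule in the hybrid approach is simple enough to sketch directly. The function below is an assumption-laden example rather than a prescribed method: the function name, the 5 percent default sample rate, and the uniform sampling scheme are all illustrative choices.

```python
import random


def select_for_human_review(run_ids, anomalous_ids, sample_rate=0.05, seed=0):
    """Send every anomalous run, plus a uniform random sample of the rest,
    to human reviewers."""
    flagged = set(anomalous_ids)
    edge_cases = [r for r in run_ids if r in flagged]
    routine = [r for r in run_ids if r not in flagged]
    k = min(len(routine), round(len(routine) * sample_rate))
    # A seeded generator keeps the sample reproducible for audit purposes.
    sampled = random.Random(seed).sample(routine, k)
    return {"edge_cases": edge_cases, "sampled": sorted(sampled)}


run_ids = [f"run-{i:03d}" for i in range(100)]
review = select_for_human_review(run_ids, anomalous_ids=["run-007", "run-042"])
```

With 100 runs and two anomalies, this queues both anomalies and five of the remaining 98 runs. A stratified sample (by task type or user group, for instance) would be a natural refinement when some categories are rare.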

Replays and Regression Testing

A replay is the re-execution of a historical agent run against a new version of the agent — a new model version, a new system prompt, a new tool configuration — using the same inputs that the historical run received. Replays are the standard mechanism for detecting behavioral regressions before they reach production: if a new model version changes the agent's behavior on a set of historical runs in ways that degrade the evaluation metrics, the regression is detected before the new version is deployed.
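A replay loop can be sketched as follows. The sketch assumes each historical run was stored with its input, a reference output, and the score the current version achieved; those dictionary keys, the exact-match scorer, and the stub agent are all hypothetical stand-ins.

```python
from typing import Callable


def replay(historical_runs, candidate_agent: Callable[[str], str],
           score: Callable[[str, str], float]):
    """Re-run stored inputs through a candidate agent and list every run
    whose score dropped below its recorded baseline."""
    regressions = []
    for run in historical_runs:
        new_output = candidate_agent(run["input"])
        new_score = score(new_output, run["reference"])
        if new_score < run["baseline_score"]:
            regressions.append({
                "run_id": run["run_id"],
                "baseline": run["baseline_score"],
                "candidate": new_score,
            })
    return regressions


def exact_match(output: str, reference: str) -> float:
    return 1.0 if output == reference else 0.0


history = [
    {"run_id": "r1", "input": "ping", "reference": "PING", "baseline_score": 1.0},
    {"run_id": "r2", "input": "hello", "reference": "hello", "baseline_score": 1.0},
]

# Stub candidate that upper-cases its input; a real candidate would call a model.
regressions = replay(history, candidate_agent=str.upper, score=exact_match)
```

The stub candidate still passes r1 but regresses on r2, so a deployment gate built on this check would block the release. With a nondeterministic agent, each input would typically be replayed several times and the scores averaged before comparing against the baseline.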

Weights & Biases Weave provides a particularly capable replay infrastructure. Its versioning system tracks not just model versions but the full configuration of the agent — system prompt, tool definitions, retrieval settings — and its replay engine re-executes historical runs against any combination of component versions, producing a structured diff of the evaluation metrics between versions. This makes it possible to attribute behavioral changes to specific component updates, which is essential for root-cause analysis when a regression is detected.
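The attribution idea itself does not depend on any particular product. The following sketch is generic Python and does not use or imitate the Weave API: it fingerprints each component of a hypothetical agent configuration separately, so that two versions can be compared component by component.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> dict[str, str]:
    """Hash each component separately so a change can be localized."""
    return {
        name: hashlib.sha256(
            json.dumps(value, sort_keys=True).encode("utf-8")
        ).hexdigest()[:12]
        for name, value in config.items()
    }


def changed_components(old: dict, new: dict) -> list[str]:
    """Names of components added, removed, or modified between versions."""
    old_fp, new_fp = config_fingerprint(old), config_fingerprint(new)
    return sorted(
        name for name in old_fp.keys() | new_fp.keys()
        if old_fp.get(name) != new_fp.get(name)
    )


v1 = {"model": "model-a", "system_prompt": "Be concise.", "tools": ["search"]}
v2 = {"model": "model-a", "system_prompt": "Be concise and cite sources.",
      "tools": ["search"]}
changed = changed_components(v1, v2)
```

If a replay against v2 shows a regression and the only changed component is the system prompt, the root-cause search starts there. When several components change at once, replaying against each change in isolation is what makes the attribution possible.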

The value of replay testing extends beyond regression detection. Organizations that maintain a comprehensive library of historical traces — including traces that document past failures and the correct behavior expected in those situations — have a qualitatively different ability to evaluate new agent versions than those that rely only on synthetic test sets. Historical traces capture the full diversity of the agent's actual operating environment, including edge cases that would never have been anticipated at test design time.

"The organizations that will have the most robust agentic deployments in three years are the ones that are collecting and annotating traces most diligently today. The trace library is the institutional memory of the agent program."

Scorecards and Dashboards

Individual traces and evaluation runs generate data; scorecards aggregate that data into actionable signals for the governance structure. A well-designed scorecard for an agentic program tracks three categories of metrics. Quality metrics — success rate, task completion rate, output quality scores from automated and human evaluations — measure whether the agent is doing its job well. Safety metrics — escalation rate, policy violation rate, anomaly detection trigger rate — measure whether the agent is staying within its governance boundaries. Efficiency metrics — token consumption per task, latency per task, cost per task — measure whether the agent is operating within the economic parameters of its business case.
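Aggregating per-run records into the three categories might look like the sketch below. The per-run field names are assumptions, and only a few representative metrics from each category are computed.

```python
def build_scorecard(runs):
    """Aggregate per-run records into quality, safety, and efficiency metrics."""
    n = len(runs)
    if n == 0:
        raise ValueError("no runs to aggregate")
    return {
        "quality": {
            "success_rate": sum(r["success"] for r in runs) / n,
        },
        "safety": {
            "escalation_rate": sum(r["escalated"] for r in runs) / n,
            "policy_violation_rate": sum(r["policy_violation"] for r in runs) / n,
        },
        "efficiency": {
            "tokens_per_task": sum(r["tokens"] for r in runs) / n,
            "cost_per_task_usd": sum(r["cost_usd"] for r in runs) / n,
        },
    }


runs = [
    {"success": True,  "escalated": False, "policy_violation": False, "tokens": 1000, "cost_usd": 0.25},
    {"success": True,  "escalated": False, "policy_violation": False, "tokens": 2000, "cost_usd": 0.25},
    {"success": False, "escalated": True,  "policy_violation": False, "tokens": 3000, "cost_usd": 0.50},
    {"success": True,  "escalated": False, "policy_violation": False, "tokens": 2000, "cost_usd": 1.00},
]
scorecard = build_scorecard(runs)
```

For these four runs the scorecard reports a 0.75 success rate, a 0.25 escalation rate, no policy violations, 2000 tokens per task, and a cost of 0.50 USD per task.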

The governance charter should specify an acceptable threshold for each metric: a floor for quality metrics and a ceiling for safety and efficiency metrics. An agent whose quality metrics fall below their floor should trigger a performance review; an agent whose safety metrics exceed their ceiling (too many escalations, too many policy violations) should trigger a governance review. The scorecard is the mechanism by which the governance charter's risk appetite is operationalized into a monitoring regime that can run continuously without requiring constant human attention.
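A threshold check of this kind reduces to a small table and a loop. In the sketch below, the metric names, the limits, and the mapping from breach to review type are hypothetical values chosen for illustration; a real charter would set its own.

```python
# Hypothetical charter entries: metric -> (direction, limit, review on breach).
# "min" means the metric must stay at or above the limit; "max" means at or below.
CHARTER = {
    "success_rate": ("min", 0.90, "performance review"),
    "escalation_rate": ("max", 0.10, "governance review"),
    "policy_violation_rate": ("max", 0.01, "governance review"),
}


def required_reviews(metrics, charter=CHARTER):
    """Return (metric, review) pairs for every threshold the metrics breach."""
    breaches = []
    for name, (direction, limit, review) in charter.items():
        value = metrics[name]
        breached = value < limit if direction == "min" else value > limit
        if breached:
            breaches.append((name, review))
    return breaches


metrics = {
    "success_rate": 0.85,
    "escalation_rate": 0.04,
    "policy_violation_rate": 0.03,
}
reviews = required_reviews(metrics)
```

With these values the agent breaches the quality floor and the policy-violation ceiling, so both a performance review and a governance review are triggered. Run on a schedule against the latest metrics, a check like this is what lets monitoring proceed without constant human attention.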