Chapter 10 · Measure: Evals That Actually Catch Things

If Map produces the artefact, Measure produces the trust. The hardest thing about agentic evals is that they look like classical software tests until they don't. The places where the resemblance ends are where production failure lives.

The two traps

The first trap is the vibes eval: a small number of hand-picked prompts run by the engineer who built the agent, judged by that same engineer. It catches the obvious failures and misses the failures that matter. It is also flattering, which is part of why it survives — the agent always seems to be working when its author tests it.

The second trap is the academic benchmark: borrowing a public eval that was designed for a different system and treating its score as a proxy for real-world quality. A model that scores well on MMLU may still confabulate your customer's policy number under load. A score is not a guarantee.

Real evals are neither of these. They are a small, growing collection of cases drawn from production traffic, scored by a process the team trusts, run on every change to the agent or its tools, and watched as a time series.

The golden set

The single most useful artefact a Measure function produces is the golden set: a versioned collection of inputs, expected behaviours, and acceptance criteria that any candidate version of the agent must pass before going to production. A serious golden set has at least one hundred cases for a non-trivial agent, drawn from real production traffic, including a deliberate proportion of edge cases, ambiguous inputs, and adversarial prompts.

Building the golden set is mostly a process of curation. Production traffic produces the candidates; humans pick the ones worth keeping; humans write the expected behaviours, which are not always single correct answers but ranges of acceptable responses. Specialised platforms like Braintrust, Galileo, Arize, and LangSmith make the running and tracking of these suites much easier — but the curation is the work, and the work is human.

Pass criteria for the golden set should be set at deployment, not after. NIST's Govern 1.3 calls these "minimum performance thresholds as deployment go/no-go criteria" — and the discipline is to write them down before you know what your model can do, not after.

Online metrics that matter

Offline evals tell you whether the agent passed the golden set. Online metrics tell you whether the production agent is behaving like the version that passed. The two are not the same. Real production traffic looks different from any golden set, and the agent meets edge cases the curators never imagined.

Two kinds of online metric matter. Outcome metrics measure whether the work is being done: containment rate (agent resolved without escalation), first-contact resolution, customer satisfaction signal, downstream rework rate, complaint rate. Process metrics measure how the agent is reaching its outcomes: tool-call distribution, average tokens per task, retries per task, escalation reasons, and agent latency at each step. The first set tells you if the agent is helping. The second set tells you why, when it stops.

Questions to ask

For your most production-critical agent: when did its last full eval run finish? Who saw the results? What was the threshold that, if missed, would have triggered a rollback? If the answer to any of those is "not sure", you do not yet have a Measure function — you have a dashboard.

Drift, and the silent failure

Models drift. Tools change. The data the agent reads changes. Customers change. Any of those four can quietly break an agent that was working last week, and only one of them — the model — is in the team's direct control.

Drift detection sits on top of the online metrics described above. The mechanism is simple: establish a baseline at deployment for every metric you care about; alert on statistically significant deviation; investigate every alert. The discipline that makes this work is investigating false alarms with the same seriousness as real ones, because your noise floor is itself a thing that drifts.

The silent failure to watch for is the agent that is still passing the golden set but is producing worse outcomes in production. This means the golden set has aged out of relevance, or production has moved past it. Either way, the response is the same: pull recent production traffic into a curation queue, find the cases that surprised the agent, and grow the golden set. A Measure function that never adds to its golden set is a Measure function that has stopped working.

Figure 10.1An eval suite for production. Golden set, offline runs, online metrics, drift detection, and the curation feedback loop that grows the golden set from real traffic.