Chapter 03 · The Five Honest Questions — Building an Agentic Enterprise

Most agent ideas die in production. A few of them deserve to die earlier — at the whiteboard, before anyone has burned a quarter of engineering time on them. This chapter is a thirty-minute test for separating the survivors from the dead-on-arrival.

The questions are deliberately blunt. They are designed to make people uncomfortable, because the projects that survive the discomfort tend to ship, and the ones that talk their way around it tend not to. Run them in order. The first no is where you stop.

Q1 — Is the win measurable?

Write down, on one line, the metric that will move if this agent works, and by how much.

"Resolves 30% of tier-1 support tickets without escalation, with a 90% customer-reported satisfaction score" is a good answer. "Improves customer experience" is not an answer; it is a hope dressed as one. "Saves engineering hours" without a specific baseline is a wish.

The discipline of a single, numeric, before-and-after metric is what makes the rest of the work tractable. It tells the eval team what to measure. It tells the product team when to ship. It tells the finance team how to value the project. Most importantly, it lets you decide, eight weeks in, whether the agent is working — without that anchor, every project drifts toward "looks promising," which is the longest-running deception in software.
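One way to keep the metric honest is to write it down as data rather than as a slide. The sketch below is a minimal illustration, not a prescribed format: the `SuccessMetric` class, its field names, the zero baseline, and the week-eight reading are all hypothetical. Only the 30% target comes from the example answer above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessMetric:
    """A single numeric, before-and-after metric (hypothetical structure)."""
    name: str
    baseline: float                 # measured before the agent exists
    target: float                   # value that counts as "the agent works"
    higher_is_better: bool = True

    def is_met(self, observed: float) -> bool:
        """True once the observed value reaches the committed target."""
        if self.higher_is_better:
            return observed >= self.target
        return observed <= self.target

    def progress(self, observed: float) -> float:
        """Fraction of the baseline-to-target gap closed so far."""
        gap = self.target - self.baseline
        if gap == 0:
            raise ValueError("target must differ from baseline")
        return (observed - self.baseline) / gap


metric = SuccessMetric(
    name="tier-1 tickets resolved without escalation",
    baseline=0.0,   # assumption: nothing is auto-resolved today
    target=0.30,    # the 30% figure from the example answer
)
week8 = 0.18        # assumption: an illustrative reading eight weeks in
print(metric.is_met(week8))              # False
print(round(metric.progress(week8), 2))  # 0.6
```

With a record like this, "looks promising" has to be restated as a number: in this illustration, 60% of the way to the target and not yet met.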

If your team cannot agree on a metric, and the meeting feels like wading through mud, the project is not yet ready. The discomfort does not come from the project; it comes from the absence of a real customer.

Q2 — Does it need judgement?

Many "agent" projects are, on inspection, deterministic workflows that someone has draped a model over because models are fashionable. This is wasteful and risky. Models are slow, expensive, sometimes wrong, and hard to debug. If a process can be done with code, it should be done with code.

The honest test: write down each step of the process. Ask, for each, whether a junior employee given a clear rulebook could do it correctly 99% of the time. If yes, that step is deterministic. It belongs in code, possibly called as a tool by the agent, but not done by the agent.

What's left — the steps where the rulebook isn't enough, where the right answer depends on context, on tone, on weighing competing things — is where an agent earns its keep. If everything in the process can be done by junior-with-a-rulebook, you don't have an agent project. You have a workflow project. Ship the workflow. Save the agent budget for the next problem.
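The junior-with-a-rulebook test can be run as a simple sort over the step list. The following sketch assumes each step carries an estimated rulebook accuracy, which in practice is a judgement call by the team rather than a measured number. The step names and estimates are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

RULEBOOK_THRESHOLD = 0.99  # the "99% of the time" bar from the test above


@dataclass(frozen=True)
class Step:
    name: str
    # Assumption: the team's estimate of how often a junior employee
    # with a clear rulebook would get this step right.
    rulebook_accuracy: float


def split_steps(steps: List[Step]) -> Tuple[List[str], List[str]]:
    """Return (steps that belong in code, steps that need judgement)."""
    code = [s.name for s in steps if s.rulebook_accuracy >= RULEBOOK_THRESHOLD]
    agent = [s.name for s in steps if s.rulebook_accuracy < RULEBOOK_THRESHOLD]
    return code, agent


def project_kind(steps: List[Step]) -> str:
    """'workflow' if every step passes the rulebook test, else 'agent'."""
    _, agent = split_steps(steps)
    return "agent" if agent else "workflow"


# Hypothetical refund process, for illustration only.
refund_process = [
    Step("look up order", 1.00),
    Step("check refund window", 0.995),
    Step("judge whether the complaint is credible", 0.80),
    Step("issue refund via payments API", 1.00),
]
code_steps, agent_steps = split_steps(refund_process)
print(agent_steps)                   # ['judge whether the complaint is credible']
print(project_kind(refund_process))  # agent
```

Here only one step needs judgement; the other three would be plain code, possibly exposed to the agent as tools.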

Caution

The temptation to "agentify" everything is partly cultural — agents are the visible badge of an AI strategy — and partly economic, since vendors price agentic platforms higher than RPA tools. Both are bad reasons. Use the right tool for the job and let the architecture diagram be honest, not fashionable.

Q3 — Can a bad action be undone?

Reversibility is the hinge that the whole risk discussion swings on.

If an agent's worst possible action can be reversed — a draft email that is never sent until reviewed, a refund within a small limit, a calendar event that can be deleted — then you can ship faster, iterate in production, and let the agent learn from real cases. If the worst action is irreversible — a contract sent, a public statement made, a payment wired, a clinical recommendation followed — the bar rises hard, and the appropriate architecture is one where the agent prepares but does not commit, leaving a human or a deterministic gate at the irreversible step.

The mistake to avoid is binary thinking. Most processes have a mix of reversible and irreversible steps. The job is to keep the agent in the reversible territory and put guardrails (or humans) at every crossing into the irreversible. The architecture pattern is sometimes called "propose-then-commit": the agent does all the work up to the irreversible action and then hands off.
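A minimal sketch of propose-then-commit follows. It assumes each action is tagged as reversible or not at design time. The `Action` and `CommitGate` names are hypothetical, and the approver here stands in for either a human or a deterministic check.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    description: str
    reversible: bool               # assumption: classified by people at design time
    execute: Callable[[], str]     # performs the action, returns a log line


@dataclass
class CommitGate:
    """Runs reversible actions at once; holds irreversible ones for approval."""
    pending: List[Action] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def submit(self, action: Action) -> str:
        if action.reversible:
            self.log.append(action.execute())
            return "executed"
        # The agent has prepared the action; it does not commit it.
        self.pending.append(action)
        return "awaiting approval"

    def approve(self, index: int, approver: str) -> None:
        """Called by the human or deterministic gate, never by the agent."""
        action = self.pending.pop(index)
        self.log.append(f"{action.execute()} (approved by {approver})")


gate = CommitGate()
draft = Action("draft reply", True, lambda: "draft saved")
wire = Action("wire payment", False, lambda: "payment wired")

print(gate.submit(draft))  # executed
print(gate.submit(wire))   # awaiting approval
gate.approve(0, approver="finance reviewer")
print(gate.log)
```

The design choice worth noting is that the reversibility flag lives on the action, not in the agent's prompt, so the gate does not depend on the model behaving well.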

Q4 — Do we have data we can evaluate on?

You cannot ship an agent you cannot evaluate. You cannot evaluate an agent you do not have data for. Therefore: before any code, ask whether you have a credible golden set of inputs and expected outputs.

For a refund-classification agent, the golden set is a few hundred real refund requests, hand-labelled by an experienced human, covering the full distribution of cases including the long tail. For a code-review agent, it is a corpus of past PRs with their actual outcomes. For an underwriting agent, it is historical decisions with known correctness.

Two warnings. First, the golden set is not a one-time artefact — it has to grow as the world changes, which means somebody owns it as a job. Second, building the golden set is often the most expensive piece of the project, and is often the piece that exposes whether the company actually knows what "correct" looks like for the process. A team that cannot agree on labels for a hundred examples cannot ship an agent that decides the same thing at scale.
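Both ideas in this section, scoring against a golden set and checking that labellers agree, reduce to a few lines. The sketch below is illustrative only: the four refund examples, the label names, and the keyword-based `toy_agent` are invented, and a real golden set would hold a few hundred hand-labelled cases.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple


def evaluate(agent: Callable[[str], str],
             golden_set: Sequence[Tuple[str, str]]) -> Tuple[float, Dict[str, int]]:
    """Return (accuracy, misses per expected label) over the golden set."""
    if not golden_set:
        raise ValueError("golden set is empty")
    misses: Counter = Counter()
    correct = 0
    for text, expected in golden_set:
        if agent(text) == expected:
            correct += 1
        else:
            misses[expected] += 1
    return correct / len(golden_set), dict(misses)


def label_agreement(labels_a: List[str], labels_b: List[str]) -> float:
    """Share of examples on which two human labellers gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)


# Hypothetical golden set and a stand-in "agent", for illustration only.
golden = [
    ("item arrived broken", "approve"),
    ("changed my mind after 90 days", "deny"),
    ("charged twice", "approve"),
    ("gift card balance dispute", "escalate"),
]


def toy_agent(text: str) -> str:
    return "approve" if ("broken" in text or "twice" in text) else "deny"


accuracy, misses = evaluate(toy_agent, golden)
print(accuracy)  # 0.75
print(misses)    # {'escalate': 1}
print(label_agreement(["approve", "deny", "approve", "escalate"],
                      ["approve", "deny", "deny", "escalate"]))  # 0.75
```

The per-label miss count matters as much as the headline accuracy, since the long tail is where it tends to show up. A low agreement figure between labellers is a sign that "correct" has not yet been defined.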

If you have no data and no plan to gather it, the project is not ready. If you have data but it sits in a place no one can access without a six-week procurement, the project is not ready. The data question is rarely the showstopper people expect; it is the showstopper people pretend not to see.

Q5 — Who owns this when it's wrong?

This is the question most projects fail.

An agent will be wrong. Sometimes spectacularly. When that happens, somebody has to pick up the phone, talk to the customer, decide what to do, escalate to legal if needed, write the postmortem, and decide whether to roll back. That somebody must be named, must have authority, and must have signed up for it before the agent shipped.

"The platform team" is not an answer. "The AI council" is not an answer. The answer is a name, with a job title and a phone number, and a backup name in case the first is on holiday. There is also a second answer underneath that one: which executive is on the hook in the quarterly review when something goes wrong? If those two names are not written down, the project is not ready.

The reason this question kills so many projects is that it forces an organisation to choose between two uncomfortable options: appoint someone, with the implication that they have a real career risk if the agent misbehaves; or admit that no one wants the responsibility, in which case the project should not move forward, no matter how exciting the demo. Most organisations would rather build the agent than have the conversation.
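The ownership requirement (a named owner, a named backup, and an accountable executive) can be written as a record that is either complete or visibly not. This is a sketch under stated assumptions: the `OwnershipRecord` structure is hypothetical, the group-word check is a crude heuristic, and the names and phone numbers are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: a crude heuristic for spotting a group standing in for a person.
GROUP_WORDS = ("team", "council", "committee")


@dataclass(frozen=True)
class Person:
    name: str
    job_title: str
    phone: str


@dataclass(frozen=True)
class OwnershipRecord:
    owner: Optional[Person]
    backup: Optional[Person]
    accountable_executive: Optional[str]


def ownership_gaps(record: OwnershipRecord) -> List[str]:
    """List what is missing; an empty list means the record is complete."""
    gaps = []
    for role, person in (("owner", record.owner), ("backup", record.backup)):
        if person is None:
            gaps.append(f"{role}: nobody named")
            continue
        if any(word in person.name.lower() for word in GROUP_WORDS):
            gaps.append(f"{role}: '{person.name}' is a group, not a person")
        if not person.job_title.strip() or not person.phone.strip():
            gaps.append(f"{role}: job title and phone number required")
    if not (record.accountable_executive or "").strip():
        gaps.append("accountable executive: nobody named")
    return gaps


vague = OwnershipRecord(Person("The platform team", "", ""), None, None)
for gap in ownership_gaps(vague):
    print(gap)

complete = OwnershipRecord(
    owner=Person("A. Example", "Support Operations Lead", "555-0100"),
    backup=Person("B. Example", "Support Manager", "555-0101"),
    accountable_executive="C. Example",
)
print(ownership_gaps(complete))  # []
```

A check like this cannot tell whether the named person has real authority or has agreed to the role; that still takes the conversation.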

What 'pass' looks like

Five yeses, in plain language, written down, agreed by everyone in the room. That's it. There is no scoring rubric and no committee approval; there is just the small humility of admitting whether each answer is real or rhetorical.
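The rule itself (five yeses, run in order, stop at the first no) is small enough to state as code. In the sketch below, treating an unanswered question as a no is an assumption, and for the fifth question "yes" is taken to mean that the names are written down.

```python
from typing import Dict, Optional

QUESTIONS = (
    "Is the win measurable?",
    "Does it need judgement?",
    "Can a bad action be undone?",
    "Do we have data we can evaluate on?",
    "Who owns this when it's wrong?",  # "yes" here means: names are written down
)


def first_no(answers: Dict[str, bool]) -> Optional[str]:
    """Return the first question without a yes, or None if all five pass.

    Assumption: a question nobody answered counts as a no.
    """
    for question in QUESTIONS:
        if not answers.get(question, False):
            return question
    return None


answers = {
    "Is the win measurable?": True,
    "Does it need judgement?": True,
    "Can a bad action be undone?": True,
    "Do we have data we can evaluate on?": False,
    "Who owns this when it's wrong?": True,
}
print(first_no(answers))  # Do we have data we can evaluate on?
```

There is deliberately no score: a yes on question five does not offset a no on question four.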

If the test passes, you've done something rarer than the demo suggests: you have a project that can be shipped, measured, owned, and recovered when it goes wrong. Now the work in the next chapters — anatomy, stack, governance — is worth doing. If the test fails, you've saved between three and twelve months of budget. Either way, the test paid for itself.

One last note. The five questions don't go away after kickoff. Revisit them at each major milestone (at the end of design, before pilot, before scale, at every quarterly review), because the answers can change. The metric you committed to at kickoff might not be the right one a quarter in. The owner might leave. The data might rot. Treat the five questions as a recurring rite, not a one-off ceremony.

Figure 3.1. The five honest questions. Run them in order; the first 'no' is where the project stops or, better, pivots into something that can ship.