Building an Agentic Enterprise  ·  Chapter 01 of 21

The Quiet Test

What 'agentic' actually means, and the failure rate nobody puts on the slide

95%
of GenAI pilots cited as not reaching production (MIT NANDA, 2025)
1 in 4
AI projects delivering promised ROI (IBM CEO survey, 2025)
853
FTE-equivalent work Klarna's agent did by Nov 2025

Somewhere in the building you work in, there is a slide deck with the word agentic on the title page. The deck has been opened in a meeting where everyone smiled and nodded, and then it was closed, and three weeks later nothing in the building did anything new. This report is about why that keeps happening, and what to do about it.

It is not a sceptic's pamphlet. Agentic AI is real, and a small number of teams are doing it well. But the gap between what is on stage at vendor conferences and what is running on Tuesday morning at a real company is wider than most leaders realise, and it widens every quarter that the marketing keeps moving and the production work doesn't.

The room you've been in

You know the room. A consultant or a vendor demos an agent that books a flight, fills a form, files a ticket, escalates a case. The demo always works. The questions afterwards are about pricing and rollout. Nobody asks the only question that matters: what happens the first time it is wrong?

The honest answer is usually one of three. Either no one is sure, or there's a vague gesture toward "human in the loop", or the fallback is the very process the agent was supposed to replace. None of those are answers. They are placeholders for an answer.

This is why pilots stall, and why the most-cited "failures" in the press are usually more nuanced than the headline. Klarna's customer-service agent, deployed in early 2024, was first held up as a poster child for replacing 700 humans, then in May 2025 reported as a walkback when CEO Sebastian Siemiatkowski admitted that cost had been weighed too heavily against quality. The fuller picture, six months later, is more interesting than either headline: by November 2025 the AI was doing the work of 853 agents, saving roughly $60M annually, and Klarna had simply added a small human tier for emotionally complex interactions — a calibration, not a reversal. Air Canada's chatbot, by contrast, promised a bereavement fare it had no authority to grant, and a tribunal ruled the airline owed the customer the difference: the agent's mistake became the company's legal liability. The pattern is consistent across the wins and the losses. The places where agentic AI fails are not the places where it does the wrong thing — they are the places where nobody had decided what the right thing was.

The word, used carefully

Before going further, the vocabulary needs sweeping. The industry uses "AI agent", "agentic AI", "copilot", "automation", and "workflow" interchangeably, and the resulting confusion is not innocent — it sells software. In this report, the word agent means something specific:

An agent is a model placed in a closed loop with tools, memory, and a goal — capable of taking multi-step action toward that goal without per-step human prompting.

That definition does work. It excludes a one-shot LLM call that writes an email (no loop). It excludes a copilot that suggests text and waits for you to press accept (no autonomy). It excludes a workflow with a hard-coded if-this-then-that (no goal-directed reasoning). It includes the underwriter that drafts a credit memo, queries five systems, asks for a missing document, and files the case. It includes the support agent that classifies a ticket, looks up the customer's contract, drafts a response, and either sends it or escalates depending on its own confidence.
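The last example, a support agent that sends or escalates depending on its own confidence, can be sketched as a single routing decision. This is a minimal illustration, not a description of any particular product: the `Draft` type, the `route` function, and the 0.8 threshold are all hypothetical, and the sketch assumes the confidence score is a calibrated number between 0 and 1, which real systems have to verify rather than assume.

```python
from dataclasses import dataclass

# Hypothetical threshold: drafts scoring below it go to a human.
ESCALATION_THRESHOLD = 0.8


@dataclass
class Draft:
    ticket_id: str
    category: str
    reply: str
    confidence: float  # assumed to be a calibrated score in [0, 1]


def route(draft: Draft, threshold: float = ESCALATION_THRESHOLD) -> str:
    """Return 'send' or 'escalate' for a drafted reply."""
    if not 0.0 <= draft.confidence <= 1.0:
        # A score outside the expected range (including NaN) is treated
        # as a failure and escalated, never sent.
        return "escalate"
    return "send" if draft.confidence >= threshold else "escalate"


if __name__ == "__main__":
    drafts = [
        Draft("T-1", "billing", "Your invoice was corrected.", 0.93),
        Draft("T-2", "contract", "Your plan includes priority support.", 0.55),
    ]
    for draft in drafts:
        print(draft.ticket_id, route(draft))
```

The design choice worth noticing is the default: anything the code cannot interpret is escalated. The threshold itself is a business decision, which is why it sits in one named constant rather than inside the model's prompt.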

The line between "automation with a model in it" and "agent" is exactly this loop. Cross it and you have new powers and new problems. Most of the rest of this report is about both.
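The loop itself is small enough to sketch. The code below is an illustration under stated assumptions, not a reference implementation: `Agent`, `plan`, and the three tool names are hypothetical, and `plan` is a deterministic stand-in for what would be a model call in a real system. What it does show is the structure the definition names: a goal, a set of tools, a memory that feeds the next decision, and a step limit so the loop cannot run forever.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Agent:
    goal: str
    tools: Dict[str, Callable[[str], str]]
    memory: List[str] = field(default_factory=list)
    max_steps: int = 5  # hard cap on the number of actions per run

    def plan(self) -> Optional[Tuple[str, str]]:
        """Stand-in for the model: pick the next (tool, argument), or None when done.

        This version simply picks the first tool not yet recorded in memory.
        A real agent would ask a model to choose, given the goal and memory.
        """
        used = {entry.split(":", 1)[0] for entry in self.memory}
        for name in self.tools:
            if name not in used:
                return name, self.goal
        return None

    def run(self) -> List[str]:
        for _ in range(self.max_steps):
            step = self.plan()                        # plan from goal + memory
            if step is None:                          # nothing left to do: finish
                break
            name, argument = step
            result = self.tools[name](argument)       # act: call the chosen tool
            self.memory.append(f"{name}:{result}")    # record the outcome
        return self.memory


if __name__ == "__main__":
    # Hypothetical tools with canned results, for illustration only.
    tools = {
        "classify": lambda text: "billing",
        "lookup_contract": lambda text: "plan=pro",
        "draft_reply": lambda text: "drafted",
    }
    agent = Agent(goal="Customer asks why the invoice doubled", tools=tools)
    for entry in agent.run():
        print(entry)
```

No human prompts the agent between steps; each pass through the loop reads memory, chooses, acts, and records. Replacing `plan` with a model call changes the quality of the choices, not the shape of the loop.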

The quiet test

Here is a small test you can run on your own current agent project. Get the most senior person you can into a room, ask them three questions in order, and listen to how long the silence lasts after each one.

  1. What does this agent do that the previous workflow could not? The right answer is concrete and small. "It reads unstructured email and decides which of seven categories it falls into" is a good answer. "It transforms our operations" is not.
  2. What is the worst thing it can do, and who pays? Worst-thing analysis is a habit borrowed from safety engineering. If the answer is "it can issue a refund up to fifty dollars" you have a small problem. If the answer is "it can email the wrong customer a competitor's contract" you have a different problem. If the answer is silence, you have the worst problem.
  3. How will we know it has got worse? Models drift. Tools change. Data changes. The agent that worked on Monday can quietly stop working on Friday in a way that nobody notices until a customer complains. If there is no eval suite, no production monitoring, no golden set, then the answer is "we won't, until it embarrasses us."
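The third question has a mechanical answer, sketched below under stated assumptions. A golden set is taken here to mean a fixed list of inputs with known correct outputs. The names `accuracy`, `has_regressed`, `GOLDEN`, and the 0.02 tolerance are hypothetical, and the keyword classifier stands in for the agent under test.

```python
from typing import Callable, List, Tuple

GoldenSet = List[Tuple[str, str]]  # (input text, expected category)

# Hypothetical golden set: a handful of inputs with agreed answers.
GOLDEN: GoldenSet = [
    ("Please refund my last invoice", "billing"),
    ("I cannot log in to my account", "access"),
    ("How do I export my data?", "how_to"),
    ("Your app charged me twice", "billing"),
]


def accuracy(classify: Callable[[str], str], golden: GoldenSet) -> float:
    """Fraction of golden examples the classifier gets right."""
    if not golden:
        raise ValueError("golden set is empty")
    correct = sum(1 for text, expected in golden if classify(text) == expected)
    return correct / len(golden)


def has_regressed(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True when the current score falls below the baseline by more than the tolerance."""
    return current < baseline - tolerance


def keyword_classifier(text: str) -> str:
    """Stand-in for the agent: a trivial keyword rule, for illustration only."""
    lowered = text.lower()
    if any(word in lowered for word in ("refund", "invoice", "charged")):
        return "billing"
    if "log in" in lowered or "password" in lowered:
        return "access"
    return "how_to"


if __name__ == "__main__":
    baseline = accuracy(keyword_classifier, GOLDEN)
    degraded = accuracy(lambda text: "billing", GOLDEN)  # simulates silent drift
    print("baseline:", baseline, "regressed:", has_regressed(baseline, baseline))
    print("degraded:", degraded, "regressed:", has_regressed(degraded, baseline))
```

Run on a schedule against the live agent, a check like this turns "a customer complained" into "the Friday run failed". Four examples are far too few for a real golden set; the point is only that the comparison has to exist before drift can be seen.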

If those three questions get crisp answers, you are in a small minority and you should keep going. If they don't, you are not yet ready to deploy, no matter what the timeline says. The work between today and ready is the work this report is about.
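The second question becomes easier to answer when the worst case is bounded in code rather than in a prompt. The sketch below uses the fifty-dollar refund from the example above. `authorise_refund`, `RefundDecision`, and the constant are hypothetical names, and a production system would more likely use a decimal type and an audited policy store than a float and a module constant.

```python
from dataclasses import dataclass

# The cap from the example above; set by the business, not by the model.
REFUND_LIMIT_USD = 50.00


@dataclass(frozen=True)
class RefundDecision:
    approved: bool
    reason: str


def authorise_refund(amount_usd: float, limit_usd: float = REFUND_LIMIT_USD) -> RefundDecision:
    """Check a refund the agent proposes before any money moves."""
    if not amount_usd > 0:
        # Catches zero, negatives, and NaN (NaN fails every comparison).
        return RefundDecision(False, "invalid amount")
    if amount_usd > limit_usd:
        return RefundDecision(False, "over limit: escalate to a human")
    return RefundDecision(True, "within limit")


if __name__ == "__main__":
    for amount in (20.00, 50.00, 75.50):
        print(amount, authorise_refund(amount))
```

With a guard like this outside the model, "what is the worst thing it can do?" has a number for an answer, and "who pays?" has an owner: whoever sets the constant.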

The shape of this report

The report has three parts, written for somebody who has to ship something on Tuesday.

Part I — Foundations. What an agent is in detail; how to map a business process to one; the five honest questions you should pass before you build; the reference stack of orchestration, registry, memory, tools, and evals; and the protocols that bind them.

Part II — Governance and architecture. The NIST AI Risk Management Framework — Govern, Map, Measure, Manage — turned into something a real team can actually run; the agent registry as a piece of infrastructure, not a slide; the risks that bite (prompt injection, autonomy creep, identity sprawl, OWASP and MITRE's catalogues); and how human-in-the-loop is more than a comforting phrase.

Part III — Deployment, ROI, the flywheel. Three real case studies, including the cautionary ones; an honest tour of no-code and low-code platforms; an ROI template that will sometimes tell you not to build; and the flywheel — data, evals, telemetry — that decides whether your agent program compounds or just costs.

Throughout, you'll see boxes labelled questions to ask, caution, and flywheel note. The questions boxes are the ones to read out loud in your next meeting. The caution boxes are where money or trust is most often lost. The flywheel boxes are where the compounding lives. They are written from the floor of a server room, not the back of a stage.

A practitioner's note

Two industries currently set the bar for agentic AI in production: financial services, where the cost of being wrong is concretely priced, and customer support, where the volume justifies investment. Both are useful templates. Both have also produced the most expensive failures. Read their case studies (Part III) before reading anyone's marketing.

One more thing before we begin. This is a report for people building inside organisations, not selling to them. There are no vendor logos on the cover. The platforms named in these pages are named because they exist and matter, not because they paid to be here. Where one is genuinely better at something than another, that's stated. Where the market is too young to know, that's stated too.

We start with the smallest part of the work, which is also the most often skipped: deciding what an agent should actually do.

Figure 1.1. The agent loop. A model in a closed loop with tools and memory, working toward a goal: the smallest definition that does any real work. Diagram labels: MODEL (reasoner); PERCEIVE (user goal, context); PLAN (decompose, pick tool); ACT (tool call, side effect); REFLECT (eval, retry, finish); memory (vector + episodic).