The point of a plan is to put the artefacts of Parts I and II into a sequence a team can actually run. This chapter is that sequence, written for a team that has roughly one quarter to land its first production agent and to build the governance that lets it land a second one the following quarter.
The spirit of the plan
Before the calendar, three principles. Spine before agents. The governance, registry, and telemetry stack come before the first deployment, not after the third. This is the most counter-intuitive recommendation in the report and the most reliable predictor of which programs compound.
Internal before external. The first production agent in any organisation should serve employees, not customers. The blast radius is smaller, the feedback is faster, and the political cost of an early failure is bounded. Klarna shipped customer-facing first and recovered well; most organisations are not Klarna and will not recover as well.
Narrow before broad. One use case, well-scoped, fully instrumented, with a measurable before/after, beats five half-built ones. The temptation to land three pilots in the first quarter is the most reliable failure pattern we have observed. Resist it.
Days 1–30: build the spine
The first thirty days produce no working agent. They produce four artefacts and one decision. Skip any one and the rest of the quarter struggles.
Artefact one: an Agent Council. A small, named cross-functional group with formal authority to approve and block agent deployments. CISO, CTO/CDO, Legal/DPO, CRO, and a CFO representative for material investments. Meets weekly in the first quarter; biweekly thereafter. Has a written charter. Has a defined autonomy threshold above which its approval is required (Tier 2+ in our taxonomy).
Artefact two: an agent registry skeleton. Even before the first agent enters it. The schema from Chapter 12: identity, owner, purpose, autonomy tier, tools, data scope, model and prompt versions, delegation, authority review date, kill-switch procedure, incident history. A spreadsheet is acceptable for week one; a database with access controls is required by week four.
Artefact three: telemetry stack and observability platform decision. The six-layer stack from Chapter 20, with a vendor decision (LangSmith, Arize, Langfuse, or in-house) made and budgeted. Telemetry deploys before the first agent does, into a small reference workflow, so that the first agent has a working baseline on its first day.
Artefact four: an incident and kill-switch playbook. Drafted, reviewed, and rehearsed in tabletop form against a hypothetical agent that does not yet exist. The rehearsal is the point.
The decision: which use case ships first. The selection criteria from Chapter 4 (the five honest questions) are the input. The Agent Council approves the selection. The use case should be internal, narrow, reversible, and have a measurable before/after metric.
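The registry schema listed under Artefact two can be sketched as a typed record. This is a minimal illustration, not the Chapter 12 schema itself: the field names, the types, and the two helper methods are assumptions, and the reading of "delegation" as the party the agent acts for is a guess. The Tier 2 default mirrors the Agent Council threshold described above.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class RegistryEntry:
    """One agent's row in the registry. Field names are illustrative."""
    agent_id: str                    # identity
    owner: str                       # accountable person or team
    purpose: str
    autonomy_tier: int
    tools: List[str]
    data_scope: List[str]
    model_version: str
    prompt_version: str
    delegation: str                  # assumed meaning: on whose behalf the agent acts
    authority_review_date: date
    kill_switch_procedure: str       # e.g. a link to the playbook section
    incident_history: List[str] = field(default_factory=list)

    def needs_council_approval(self, threshold_tier: int = 2) -> bool:
        # Tier 2 and above requires Agent Council approval.
        return self.autonomy_tier >= threshold_tier

    def review_overdue(self, today: date) -> bool:
        # True once the authority review date has passed.
        return today > self.authority_review_date
```

A spreadsheet with these columns covers week one. The typed record becomes useful once entries move into a database and checks such as `review_overdue` can run on a schedule.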
Days 31–60: build the first agent
The middle thirty days produce a working agent in a controlled environment, not in production. Two artefacts and one rehearsal.
The Map artefact for the chosen use case. The one-page Map from Chapter 9: context paragraph, system summary, action-consequence table, autonomy tier, kill-switch procedure, authority review date. Reviewed and approved by the Agent Council.
The first golden set. A hundred curated cases drawn from the existing process (tickets, documents, queries — whatever the agent will replace or augment), with expected behaviours written and acceptance criteria set. The team running the offline eval pipeline against the golden set should be different from the team writing the agent's prompt — independence at this layer pays back.
The rehearsal: a tabletop incident response with the new agent. Trigger a simulated confabulation, a simulated indirect prompt injection, and a simulated tool-cascade error. Walk the playbook end-to-end. Refine. The first time you run the kill-switch should be in this window, not during a real incident.
By Day 60, the agent has passed the offline eval suite, the Map is approved, the kill-switch has been rehearsed, the registry has its first real entry, and the Agent Council has formally approved canary deployment.
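The offline eval gate behind the Day 60 checkpoint can be sketched as a small harness that runs the agent over the golden set and compares the pass rate to an acceptance criterion. Everything here is an assumption for illustration: the `GoldenCase` shape, the `passes` comparison callback, and the 0.95 default acceptance rate are hypothetical, and a real suite would score behaviours more richly than a single boolean.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    input_text: str
    expected: str  # the expected behaviour, written by the eval team


def run_offline_eval(
    agent: Callable[[str], str],
    cases: List[GoldenCase],
    passes: Callable[[str, str], bool],
    acceptance_rate: float = 0.95,  # hypothetical acceptance criterion
) -> Dict[str, object]:
    """Run every golden case through the agent and summarise the result."""
    if not cases:
        raise ValueError("golden set is empty")
    failures = [
        c.case_id for c in cases
        if not passes(agent(c.input_text), c.expected)
    ]
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return {
        "pass_rate": pass_rate,
        "failures": failures,  # case ids to triage
        "accepted": pass_rate >= acceptance_rate,
    }
```

Keeping `agent` and `passes` as injected callables is one way to preserve the independence recommended above: the eval team owns the cases and the comparison, and the prompt team only supplies the callable under test.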
Days 61–90: production, controlled
The final thirty days take the agent into production carefully and start the wheel turning.
Canary deployment at 5% of relevant traffic, with the online metrics from Chapter 10 collected from minute one. Drift detection alerts wired to the team that owns the agent, not just the platform team.
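One common way to hold a canary at a fixed share of traffic is deterministic hash bucketing, so that the same user or ticket always lands on the same side of the split. The sketch below assumes a stable `request_key` (for example a user id) is available; the function name and the salt value are hypothetical, and the chapter does not prescribe this particular mechanism.

```python
import hashlib


def in_canary(request_key: str, percent: float = 5.0,
              salt: str = "agent-canary-v1") -> bool:
    """Return True if this key falls in the canary share of traffic.

    The key is hashed together with a salt and mapped to one of 10,000
    buckets; keys in the lowest `percent` of buckets go to the agent.
    """
    digest = hashlib.sha256(f"{salt}:{request_key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100
```

Changing the salt reshuffles which keys are in the canary. Raising `percent` only adds keys and never removes any, which keeps the before/after comparison stable as the rollout widens.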
Tier-2 post-action sampling from day one. A defined percentage of the agent's actions reviewed by humans after the fact, with the findings logged into the curation queue that grows the golden set.
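Post-action sampling can be sketched as a random draw over completed actions, with each human review appended to the curation queue. The names `ReviewFinding`, `sample_for_review`, and `record_finding` are hypothetical, and the in-memory list stands in for whatever store actually backs the queue.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReviewFinding:
    action_id: str
    verdict: str   # e.g. "ok" or "wrong_tool"; the label set is an assumption
    note: str = ""


def sample_for_review(action_ids: List[str], rate: float,
                      rng: Optional[random.Random] = None) -> List[str]:
    """Pick roughly `rate` of the actions for after-the-fact human review."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = rng or random.Random()
    return [a for a in action_ids if rng.random() < rate]


def record_finding(queue: List[ReviewFinding], finding: ReviewFinding) -> None:
    """Log a review into the curation queue that feeds the golden set."""
    queue.append(finding)
```

Passing a seeded `random.Random` makes a given day's sample reproducible, which helps when a reviewer's findings are later questioned.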
The first improvement cycle. Two weeks into production, the team meets to review the curation queue, categorise failure modes, and ship the first prompt or tool change through the deployment pipeline. The cadence — every two weeks, religiously — is the wheel.
Public scaling. Once the agent has thirty days of production data above its threshold, scale to 100% of the chosen use case. Not faster. Not to a new use case until the second cycle of the wheel has produced at least one improvement.
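The two scaling gates can be written down as explicit checks so that the decision is not made from a slide. This sketch assumes one reading of the rule: "thirty days above threshold" is taken to mean that each of the most recent thirty daily metric values is at or above the threshold, and the wheel's progress is tracked as two simple counters. The function names and that interpretation are assumptions.

```python
from typing import List


def ready_for_full_rollout(daily_scores: List[float], threshold: float,
                           required_days: int = 30) -> bool:
    """True when the last `required_days` daily scores all meet the threshold."""
    if len(daily_scores) < required_days:
        return False  # not enough production data yet
    return all(s >= threshold for s in daily_scores[-required_days:])


def ready_for_new_use_case(cycles_completed: int,
                           improvements_shipped: int) -> bool:
    """True once the second improvement cycle is done and at least one
    improvement has shipped through the deployment pipeline."""
    return cycles_completed >= 2 and improvements_shipped >= 1
```

A team that prefers a rolling average to a strict every-day rule could swap the `all(...)` check for a mean over the window. The point is that the gate is code the Agent Council can read.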
By Day 90, a successful program has one production agent, one well-instrumented telemetry stack, one live registry, one Agent Council with a real meeting cadence, one tested kill-switch, one growing golden set, and one fortnightly improvement cycle. That is enough. The second agent is much cheaper than the first. The third is cheaper than the second. The compounding starts at the second deployment, not the first.
What you avoid by Day 90 is more important than what you ship: an agent in customer-facing high-stakes production without governance; an agent whose registry entry was created retroactively; an agent whose first incident is also its first kill-switch test; an agent without a measurable before/after; an agent that shipped because the launch date was on a slide. Each of these is a recurring failure pattern in the field. Each is avoidable.
A closing note
This report has been about a small set of practices that make agentic AI in enterprises survivable, accountable, and — sometimes — actually transformative. None of them are exotic. The five honest questions, the Map artefact, the eval suite, the registry, the kill-switch, the HITL design, the cost stack, and the flywheel are not technically difficult. They are organisationally difficult. They require teams to do unsexy work in a field that rewards sexy slides.
The teams that do the unsexy work are the teams whose agents are still working a year later, still improving, and still inside the company's risk tolerance. The teams that skip it are the teams whose pilot is in a slide deck somewhere, opened in a meeting where everybody smiled and nodded, and then closed, and three weeks later nothing in the building did anything new.
This is the gap between agentic AI on stage and agentic AI in production. It is bridgeable. The bridge is in this report.