Chapter 33 · The Twelve-Month Roadmap

A roadmap is not a plan; a plan describes activities. A roadmap describes the state the organisation will be in at each milestone — what it will be capable of that it cannot do today, what risks will be managed that are currently open, what evidence will exist that does not exist now. The twelve-month roadmap for an agentic AI programme is, in this sense, a maturity commitment: a promise to the board, by quarter, of where each of the four pillars will land on the Scorecard. The pilot programme that cannot make this commitment credibly is not a programme; it is an experiment.

Q1: Charter, Scorecard, three pilots picked, NIST AI RMF baseline

Q1 is the ninety-day plan from Chapter 32, executed well. By the end of the first quarter, the programme has a signed charter, a governance owner, a Scorecard baseline with a four-pillar radar, and three use cases that have passed the scoping filter — named owner, success metric, eval set, data classification, kill criterion. The NIST AI RMF gap assessment has been completed and the resulting action list is in the governance owner's backlog, prioritised by risk.
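The scoping filter can be made mechanical. The sketch below is one hedged way to express it in Python: a use case passes only when all five scoping items are filled in. The class name, field names, and helper functions are illustrative assumptions, not part of the Scorecard or of Chapter 32.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class UseCase:
    """A candidate pilot. Field names are hypothetical, chosen for illustration."""
    name: str
    owner: Optional[str] = None
    success_metric: Optional[str] = None
    eval_set: Optional[str] = None             # e.g. a dataset identifier
    data_classification: Optional[str] = None
    kill_criterion: Optional[str] = None


def missing_scoping_fields(use_case: UseCase) -> List[str]:
    """Return the scoping items that are still empty or blank."""
    missing = []
    for f in fields(use_case):
        if f.name == "name":
            continue  # the name identifies the use case; it is not a scoping item
        value = getattr(use_case, f.name)
        if value is None or not str(value).strip():
            missing.append(f.name)
    return missing


def passes_scoping_filter(use_case: UseCase) -> bool:
    """A use case passes only when every scoping item is present."""
    return not missing_scoping_fields(use_case)
```

A use case with no kill criterion would be reported as missing exactly that item, which gives the governance owner a concrete reason for rejecting it.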

The NIST AI RMF baseline is particularly important for Q1 because it sets the regulatory posture before any production deployment. Organisations operating under the EU AI Act need to know, by the end of Q1, whether any of their planned agent deployments fall under the high-risk categories listed in Annex III. Discovering this in Q3, after a deployment is live, is significantly more expensive than discovering it in Q1, when the design can still be changed.

"Q1 is for learning. Q2 is for proving. Q3 is for scaling. Q4 is for owning." — an internal framing used by the AI governance team at a large European financial institution, paraphrased.

Q2: Pilot 1 to production, orchestration choice, evals platform

Q2 is the first real test of the programme's governance model. Pilot 1 reaches production — not a demo environment, not a limited beta, but a live system handling real work with real consequences. It is accompanied by guardrails: a circuit breaker that stops the agent if it exceeds its permission scope, a structured eval set with a minimum pass rate that must be maintained for the pilot to stay live, and a runbook for the incident response procedure if the agent behaves unexpectedly.
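The circuit breaker described above can be sketched in a few lines. This is a minimal illustration, assuming that permission scope is modelled as a set of allowed tool names; the class and exception names are hypothetical, and a real deployment would implement the check inside whatever orchestration framework the programme has chosen.

```python
from typing import Callable, FrozenSet


class ScopeViolation(Exception):
    """Raised when the agent attempts something outside its permission scope."""


class CircuitBreaker:
    """Wraps tool calls and halts the agent on the first out-of-scope attempt."""

    def __init__(self, allowed_tools: FrozenSet[str]):
        self.allowed_tools = allowed_tools
        self.tripped = False

    def call(self, tool_name: str, tool: Callable[..., object], *args, **kwargs):
        # Once tripped, every further call is refused until a human resets it.
        if self.tripped:
            raise ScopeViolation("breaker is open; agent halted pending review")
        if tool_name not in self.allowed_tools:
            self.tripped = True
            raise ScopeViolation(f"tool '{tool_name}' is outside the permitted scope")
        return tool(*args, **kwargs)

    def reset(self) -> None:
        """Intended to be called only after the incident runbook has been followed."""
        self.tripped = False
```

The design choice worth noting is that the breaker stays open after a violation: even calls that would normally be permitted are refused until someone resets it, which is what forces the incident runbook to be used.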

The orchestration choice should be made in Q2, not Q1. The Q1 pilots were small enough to run on whatever framework the engineering team preferred; by Q2, the programme needs a single orchestration standard that all future agents will use. This decision — LangGraph, AutoGen, a proprietary platform, or a bespoke internal framework — is one of the most consequential technical decisions in the programme, and it should be made with the benefit of one quarter of operational experience, not in the abstract. Key criteria: does the framework produce structured audit logs? Does it support human-in-the-loop interruption? Does it integrate with the organisation's existing identity and secrets management infrastructure?

The evals platform is the instrumentation that will govern every future agent. At minimum: a test runner that executes the eval set against the production agent on each deployment, a dashboard that makes pass rates visible to the governance owner, and a gating mechanism that blocks deployment if the pass rate drops below the agreed threshold. The evals platform is not an optional quality-assurance measure; it is the operational foundation of the governance model.
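The gating mechanism reduces to a small calculation: run the eval set, compute the pass rate, and compare it with the agreed threshold. The sketch below assumes that an eval case is an input paired with an expected output and that a case passes on exact match; both are simplifying assumptions, and the function names are hypothetical rather than taken from any particular evals product.

```python
from typing import Callable, Iterable, Tuple

EvalCase = Tuple[str, str]  # (input, expected output); an assumed, simplified shape


def pass_rate(agent: Callable[[str], str], cases: Iterable[EvalCase]) -> float:
    """Fraction of eval cases for which the agent's output matches exactly."""
    cases = list(cases)
    if not cases:
        # An empty eval set must never be read as a perfect score.
        raise ValueError("eval set is empty")
    passed = sum(1 for prompt, expected in cases if agent(prompt) == expected)
    return passed / len(cases)


def deployment_allowed(
    agent: Callable[[str], str], cases: Iterable[EvalCase], threshold: float
) -> bool:
    """Gate: allow deployment only if the pass rate meets the agreed threshold."""
    return pass_rate(agent, cases) >= threshold
```

In a deployment pipeline, a False result from deployment_allowed would fail the build step, and the pass rate itself would be the figure published to the governance owner's dashboard.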

Q3: ISO 42001 readiness, Pilots 2–3, agent identity model

Q3 is the scaling phase. With one production agent running and a governance model proven, the programme can now expand to the second and third pilots while simultaneously investing in the institutional infrastructure that will make scaling sustainable. The two activities are not independent: each new pilot exercises the governance model in a new domain, producing evidence for the ISO/IEC 42001 readiness review that should be in progress or complete by the end of the quarter.

ISO 42001 is the natural target standard for an enterprise that wants a third-party–validated AI management system. The readiness review — conducted against the standard's requirements for AI policy, risk assessment, performance evaluation, and continual improvement — will identify the remaining gaps between the programme's current state and a certifiable AI management system. For most organisations that have followed the Q1 and Q2 roadmap faithfully, the gaps at this point are administrative rather than structural: documentation of decisions already made, formalisation of practices already running.

The agent identity model is the Q3 infrastructure investment that most programmes underestimate. Every production agent needs a machine identity: a credential that is tied to the agent's specific role, scoped to the data and tools the agent is authorised to use, rotated on a schedule, and auditable. Without a formal identity model, agents accumulate permissions informally — developers use personal credentials, or a single service account is shared across multiple agents — and the resulting access graph becomes impossible to audit. Building the identity model in Q3, before the agent estate grows large, costs a fraction of what it costs to retrofit it in Q4 or year two.
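The four properties of a machine identity (tied to a role, scoped, rotated, auditable) can be illustrated with a small data structure. This is a sketch under stated assumptions: the field names, the string form of scopes, and the in-memory audit log are all hypothetical, and a production system would rely on the organisation's existing identity and secrets management infrastructure rather than code like this.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet, List, Tuple

AuditEntry = Tuple[str, str, str, bool]  # (timestamp, agent_id, scope, allowed)


@dataclass(frozen=True)
class AgentIdentity:
    """One credential per agent; nothing is shared across agents."""
    agent_id: str
    role: str
    scopes: FrozenSet[str]   # the data and tools this agent may use
    issued_at: datetime
    max_age: timedelta       # the rotation schedule

    def is_expired(self, now: datetime) -> bool:
        return now - self.issued_at >= self.max_age


def authorise(
    identity: AgentIdentity, scope: str, now: datetime, audit_log: List[AuditEntry]
) -> bool:
    """Allow a request only for an unexpired credential and an in-scope action.

    Every decision, allowed or denied, is appended to the audit log.
    """
    allowed = (not identity.is_expired(now)) and scope in identity.scopes
    audit_log.append((now.isoformat(), identity.agent_id, scope, allowed))
    return allowed
```

Because denials are logged alongside approvals, the audit log shows not only what each agent did but also what it attempted, which is the information an access review needs.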

Q4: Operating model decision, procurement playbook, first annual review

By Q4, the programme has enough operational experience to make the operating model decision that will govern how the organisation scales agent deployments beyond the initial programme. The three models (Centre of Excellence, hub-and-spoke, and federated) are described in detail in Chapter 34. The Q4 decision is not which model to adopt eventually; it is which model the organisation will commit to for the next two years, with the governance infrastructure to support it.

The procurement playbook codifies everything the programme has learned about buying agent-adjacent technology: model provider contracts, orchestration platforms, evals vendors, security tooling. It includes the contract clauses the programme has learned it needs — data residency, incident notification windows, model version pinning, exit rights — and the vendor risk assessment process for agent-specific supply chain risk. Chapter 36 covers this in depth; the Q4 roadmap milestone is to have the playbook written and reviewed by legal before the programme's next major procurement decision.

The first annual review is the most important governance event of the year. It is not a retrospective on activities; it is a re-run of the Scorecard, producing a new four-pillar radar that can be compared directly to the Q1 baseline. The comparison is the evidence the board needs to assess the programme's progress: which pillars moved, by how much, and what evidence supports the scores. Any pillar that has not moved since Q1 is a question the board should ask, and the governance owner must have an answer.
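The comparison between the Q1 baseline and the annual review is simple arithmetic, and a short sketch shows what the board is being asked to read. The code assumes that each pillar receives a single numeric score; the pillar names in the usage note are placeholders, since the scoring scale and pillar definitions belong to the Scorecard itself.

```python
from typing import Dict, List


def pillar_deltas(baseline: Dict[str, float], review: Dict[str, float]) -> Dict[str, float]:
    """Per-pillar change between the Q1 baseline and the annual review."""
    if baseline.keys() != review.keys():
        # Comparing different sets of pillars would make the radar meaningless.
        raise ValueError("baseline and review must score the same pillars")
    return {pillar: review[pillar] - baseline[pillar] for pillar in baseline}


def unmoved_pillars(baseline: Dict[str, float], review: Dict[str, float]) -> List[str]:
    """Pillars whose score has not changed: the questions the board should ask."""
    return [pillar for pillar, delta in pillar_deltas(baseline, review).items() if delta == 0]
```

With placeholder scores such as {"pillar_1": 1, "pillar_2": 2} at baseline and {"pillar_1": 3, "pillar_2": 2} at review, pillar_deltas reports a change of 2 for pillar_1 and 0 for pillar_2, and unmoved_pillars returns only pillar_2.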