Every enterprise that has attempted to introduce a genuinely novel capability — from the first relational database to the first public cloud footprint — has discovered the same uncomfortable truth: the technology arrives years before the organisation does. Agentic AI is no exception. The gap between a working proof-of-concept and a governed, auditable, production-grade agent program is not a gap of model capability; it is a gap of institutional maturity. Naming that gap, measuring it honestly, and charting a path across it is what a maturity model is for.
Why maturity models work — and when they don't
The original insight behind the Capability Maturity Model and its successor, CMMI, is that processes improve in predictable stages, not in random leaps. That insight has survived more than three decades of software engineering because it reflects something true about organisations: they do not skip levels. A team that has never written a deployment runbook cannot jump directly to automated rollbacks. A governance committee that has never classified a dataset cannot leap to a real-time data-lineage graph. Levels exist because the knowledge required for level three is encoded in the failures of level two.
The maturity model presented in this chapter adapts that logic for agentic AI across four dimensions (Governance, Orchestration, Use Cases, and Integration), matching the four pillars introduced in Part I. Each level, from L1 Initial to L5 Optimized, is defined by observable evidence: documents that exist, controls that run, metrics that are tracked. Aspiration does not count. The NIST AI Risk Management Framework takes a similar stance: its four core functions (Govern, Map, Measure, Manage) are framed as outcomes an organisation can demonstrate, not as intentions it sincerely holds.
Where maturity models fail is when they become checklists for box-ticking rather than instruments for honest diagnosis. An organisation that reports L4 because it has a policy document, not because it operates the policy, is deceiving its board and imperilling its customers. The antidote is evidence-based scoring: every level claim must be accompanied by the artefact that proves it.
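Evidence-based scoring can be made concrete with a minimal sketch. It assumes a hypothetical mapping from level number to the artefacts offered as proof; the function name `score_pillar` and the file names are illustrative and not part of the model. It also assumes L1 is the floor for any organisation running at least one agent, so only L2 to L5 require artefacts, and a level counts only if every level beneath it is evidenced too.

```python
from typing import Dict, List


def score_pillar(evidence: Dict[int, List[str]]) -> int:
    """Return the highest level backed by an unbroken chain of artefacts.

    Assumption: L1 needs no artefact, so the score never drops below 1.
    """
    achieved = 1
    for level in range(2, 6):  # L2 Repeatable .. L5 Optimized
        if not evidence.get(level):
            break  # a level with no artefact caps the score
        achieved = level
    return achieved


# Hypothetical Governance evidence: L3 is claimed but no artefact was produced.
governance_evidence = {
    2: ["team-agent-policy.md"],
    3: [],
    4: ["evals-dashboard-export.csv"],
}
print(score_pillar(governance_evidence))  # 2
```

The dashboard export at L4 does not rescue the score: without an artefact for L3, the claim stops at L2.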
L1 — Initial: the heroic phase
At L1, agentic AI exists because someone cared enough to make it work, not because the organisation made it easy. A team has stood up an agent — perhaps using an API from a frontier model lab and a thin orchestration wrapper — and it is producing results. There is no formal policy governing it. There is no evals harness. The agent's permissions are whatever the developer's credentials allow. If the person who built it leaves, the agent either breaks or becomes ungoverned dark matter in the IT estate.
L1 is not a failure state; it is a beginning. The most important thing an organisation at L1 can do is write down what it has. An inventory of pilots, a rough taxonomy of the risks each presents, and a named owner for each — these are the first deliverables that open the door to L2. Organisations that skip this step discover it later, usually during an incident.
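One possible shape for that first inventory is sketched below. The record fields, the helper `inventory_gaps`, and the example pilots are all hypothetical; a spreadsheet with the same columns would serve equally well.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PilotRecord:
    name: str
    description: str
    risks: List[str] = field(default_factory=list)  # rough risk taxonomy
    owner: Optional[str] = None  # named accountable person


def inventory_gaps(inventory: List[PilotRecord]) -> List[str]:
    """Return names of pilots that lack a named owner or any recorded risk."""
    return [p.name for p in inventory if not p.owner or not p.risks]


# Illustrative entries only.
inventory = [
    PilotRecord("invoice-triage-agent", "Routes supplier invoices",
                ["reads financial data"], "a.rivera"),
    PilotRecord("hr-faq-agent", "Answers HR policy questions",
                ["may expose personal data"]),
    PilotRecord("log-summariser", "Summarises build logs"),
]
print(inventory_gaps(inventory))  # ['hr-faq-agent', 'log-summariser']
```

The two flagged pilots are the ones most likely to surface later as ungoverned systems.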
"The difference between an experiment and a liability is a changelog and an owner." — observation common to every enterprise AI governance review, regardless of industry.
L2 — Repeatable: patterns emerge at team level
L2 organisations have done one thing L1 organisations have not: they have learned from their first agent and encoded that learning in a repeatable pattern. A team has adopted a shared orchestration framework. A data-handling practice is in place for the agent's memory store. Someone has run a red-team exercise, even informally. The hallmark of L2 is that a second agent can be deployed faster than the first, because the team has retained and shared what the first deployment taught them.
Governance at L2 is still team-level. There may be a policy document, but it was written by the team for the team, and other teams in the organisation do not know it exists. Gartner's AI Trust, Risk and Security Management (AI TRiSM) framework identifies this as the characteristic failure mode at this level: local competence coexists with enterprise-level blindness. A risk that looks managed from inside the team looks unmanaged from the audit committee's seat.
The transition from L2 to L3 is the hardest in the model. It requires the team to give something up (autonomy, speed, the satisfaction of owning the whole stack) in exchange for an enterprise standard they did not write and may not prefer. This is why many agent programs stall here.
L3 — Defined: enterprise standards arrive
At L3, the organisation has published — and enforces — an enterprise AI policy that applies to all agentic systems, not just the ones built by the team that wrote it. There is a model risk management process adapted for generative and agentic AI. There is an evals standard: minimum test coverage, required red-team scenarios, documented acceptance criteria. There is an identity model for agents: each agent has a machine identity, scoped permissions, and an audit trail tied to a human accountable owner. The ISO/IEC 42001 AI management system standard provides a useful template for what L3 documentation looks like in practice.
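The agent identity model described above can be sketched in a few lines. The class names, scope strings, and owner name are hypothetical, and a production system would delegate this to the organisation's identity provider. The point is only that every authorisation decision, allowed or denied, is written to a log tied to the accountable human.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import FrozenSet, List


@dataclass(frozen=True)
class AuditEntry:
    timestamp: str
    agent_id: str
    accountable_owner: str
    action: str
    allowed: bool


@dataclass
class AgentIdentity:
    agent_id: str               # machine identity
    accountable_owner: str      # human accountable for the agent
    scopes: FrozenSet[str]      # explicitly granted permissions
    audit_log: List[AuditEntry] = field(default_factory=list)

    def authorize(self, action: str) -> bool:
        """Allow the action only if it is in scope; record the decision."""
        allowed = action in self.scopes
        self.audit_log.append(AuditEntry(
            timestamp=datetime.now(timezone.utc).isoformat(),
            agent_id=self.agent_id,
            accountable_owner=self.accountable_owner,
            action=action,
            allowed=allowed,
        ))
        return allowed


agent = AgentIdentity("agent-invoice-triage", "a.rivera",
                      frozenset({"invoices:read", "tickets:create"}))
print(agent.authorize("invoices:read"))     # True
print(agent.authorize("payments:execute"))  # False
print(len(agent.audit_log))                 # 2
```

Contrast this with L1, where the agent's permissions are whatever the developer's credentials allow and no denial is ever recorded.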
L3 organisations typically have a Center of Excellence — even a small one — responsible for maintaining the standard and reviewing new agent deployments before they reach production. The CoE is not a gate designed to slow progress; it is a consistency function designed to prevent twenty teams from making twenty different, incompatible architectural choices that will take three years to unpick.
The measurable outcome of L3 is that an auditor — internal or external — can walk into the organisation, ask for evidence of AI governance, and receive it within a business day. Not a presentation about intentions. Evidence.
L4 — Managed: measurement and control
L4 organisations do not just have standards; they track compliance with those standards in real time. An evals dashboard shows pass rates across production agents. A model risk register is live, with open items assigned to owners and tracked to closure. The organisation can answer, at any moment, how many agents are running, what permissions each holds, what their last eval score was, and who is accountable for each. This is not bureaucracy; it is the minimum information a board needs to discharge its oversight duties and to demonstrate compliance with regimes like the EU AI Act.
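The four questions above amount to a query over an agent registry. The following is a sketch under stated assumptions: the `AgentStatus` fields, the `fleet_report` helper, and the 0.9 pass-rate threshold are all hypothetical, and a real registry would be a live service, not an in-memory list.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AgentStatus:
    agent_id: str
    owner: str
    permissions: Tuple[str, ...]
    last_eval_pass_rate: float  # fraction of eval cases passed, 0.0 to 1.0
    running: bool


def fleet_report(fleet: List[AgentStatus], min_pass_rate: float = 0.9) -> Dict:
    """Summarise running agents: count, owners, permissions, eval laggards."""
    running = [a for a in fleet if a.running]
    return {
        "running_count": len(running),
        "owners": {a.agent_id: a.owner for a in running},
        "permissions": {a.agent_id: a.permissions for a in running},
        "below_threshold": sorted(
            a.agent_id for a in running if a.last_eval_pass_rate < min_pass_rate
        ),
    }


# Illustrative fleet only.
fleet = [
    AgentStatus("invoice-triage", "a.rivera", ("invoices:read",), 0.97, True),
    AgentStatus("hr-faq", "j.chen", ("hr-docs:read",), 0.82, True),
    AgentStatus("log-summariser", "m.okafor", ("logs:read",), 0.95, False),
]
report = fleet_report(fleet)
print(report["running_count"])    # 2
print(report["below_threshold"])  # ['hr-faq']
```

If producing this report requires a week of emails, the organisation is not yet at L4, whatever its dashboards look like.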
At L4, the organisation has also developed feedback loops between production performance and the policy stack. When an agent behaves unexpectedly in production, the incident feeds back into the red-team library, the evals harness, and the policy. The organisation gets smarter from failures rather than merely recovering from them.
L5 — Optimized: continuous improvement as culture
L5 is not a destination; it is a disposition. Organisations at this level treat the maturity model itself as a managed asset: they run annual self-assessments, benchmark against peers, and revise their standards when the technology or the regulatory landscape shifts. Their evals library is continuously expanded. Their identity and access model evolves with new agent architectures. Their procurement playbook is updated every time a new class of vendor risk emerges.
Critically, L5 organisations export knowledge. They contribute to industry working groups, publish incident post-mortems (anonymised where necessary), and help raise the floor for the sector. This is not altruism; it is risk management. A sector in which the weakest players are at L1 creates systemic risk that eventually lands on the strongest players too.
Reading the model across all four pillars
The full model produces a 4×5 matrix — four pillars, five levels each. Most organisations are not at the same level across all four. A financial services firm may be at L4 on Governance (long accustomed to model risk management) and L1 on Orchestration (agents are new, frameworks are immature). A SaaS company may be at L3 on Orchestration and L1 on Governance (engineers lead, lawyers follow). The profile is as diagnostic as any single level score.
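A profile of this kind is simple to represent and summarise. In the sketch below, the pillar names come from the model, while the `summarize_profile` helper and the Use Cases and Integration scores in the example are hypothetical, added only to complete the financial services illustration.

```python
from typing import Dict, List, Union

PILLARS = ("Governance", "Orchestration", "Use Cases", "Integration")


def summarize_profile(profile: Dict[str, int]) -> Dict[str, Union[int, List[str]]]:
    """Return the weakest pillar(s) and the spread between highest and lowest."""
    for pillar in PILLARS:
        if pillar not in profile:
            raise ValueError(f"missing pillar: {pillar}")
        if not 1 <= profile[pillar] <= 5:
            raise ValueError(f"{pillar} must be scored from 1 to 5")
    lowest = min(profile[p] for p in PILLARS)
    highest = max(profile[p] for p in PILLARS)
    return {
        "weakest": [p for p in PILLARS if profile[p] == lowest],
        "spread": highest - lowest,
    }


# Governance and Orchestration follow the example in the text;
# the other two scores are assumed for illustration.
financial_services = {
    "Governance": 4,
    "Orchestration": 1,
    "Use Cases": 2,
    "Integration": 2,
}
print(summarize_profile(financial_services))
# {'weakest': ['Orchestration'], 'spread': 3}
```

A large spread is itself a finding: it suggests that strong controls in one pillar are resting on weak foundations in another.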
The next chapter introduces the Scorecard Method: a structured half-day exercise that produces this profile from evidence, not self-report. The chapter after that explains how to read the resulting radar and decide where to invest first. Together, these three chapters form the diagnostic engine at the heart of Part III.