The Agentic Enterprise  ·  Chapter 39

The Failure Postmortem

Three case studies of agentic deployments that broke — and what they teach

The most instructive documents in any engineering discipline are not the success stories; they are the post-mortems. What failed, why it failed, and what was done about it — these are the sentences that contain the lessons. This chapter presents three composite case studies, drawn from real failure patterns in agentic AI deployments. The organisations are fictional aggregates; the failure modes are not. Each case study names what failed across the four pillars — Governance, Orchestration, Use Case selection, and Integration — because that is the structure that makes a failure legible, remediable, and most importantly, preventable in the next deployment.

Case study 1: A regional retailer whose customer-service agent was prompt-injected into issuing refunds

A regional retailer with significant e-commerce volume deployed a customer-service agent in late 2024. The agent was integrated with the order management system and authorised to issue refunds for orders below a defined threshold. The deployment was considered a success: deflection rates were high, customer satisfaction scores were positive, and the engineering team had instrumented the agent with basic logging. The incident occurred four months after launch.

A small number of customers had discovered, and shared on a public forum, that the agent could be prompted to issue refunds beyond its authorised threshold if the user included specific phrasing in their message. The phrasing exploited the agent's instruction-following behaviour: by framing the refund request as an instruction from a "store manager override," customers caused the agent to treat their message as an authoritative command rather than a user request. By the time the fraud team identified the pattern, it had been active for eleven days and had produced a material volume of unauthorised refunds.

What failed, by pillar:

Governance: the agent's authorisation scope was defined in the deployment specification but was not operationalised as a runtime constraint. The governance policy said the agent could not issue refunds above a threshold; the orchestration layer did not enforce that threshold independently of the agent's own reasoning. A policy that exists only in a document and is enforced only by the model's judgment is not a policy; it is a hope.

Orchestration: the agent had no prompt-injection detection layer. OWASP's guidance for LLM and agentic applications ranks prompt injection as a top threat, in both its direct form (adversarial instructions typed by the user, as in this incident) and its indirect form (instructions embedded in content the agent processes). The retailer's orchestration framework passed user messages to the agent without sanitisation and without a secondary validation step for messages that attempted to change the agent's authorised behaviour.

Use Case: the kill criterion for the deployment was defined as "customer satisfaction score drops below threshold" — a lagging indicator that measures experience quality, not financial integrity. A kill criterion for a financially authorised agent should include real-time anomaly detection on the distribution of refund amounts and frequencies, not only customer experience metrics.

Integration: the agent's integration with the order management system did not include a secondary confirmation step for refunds that approached the authorisation limit. A simple check ("refund amount is within 20% of the authorisation ceiling, escalate to human") would have interrupted the majority of the fraudulent transactions.
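The governance and integration points above can be combined into a single runtime guard that sits between the agent and the order management system. The following is a minimal sketch, not the retailer's implementation: `RefundPolicy`, `Decision`, and `check_refund` are hypothetical names, the ceiling value is illustrative, and the choice to refuse amounts above the ceiling outright (rather than escalate them) is an assumption.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"
    ESCALATE = "escalate_to_human"
    REJECT = "reject"


@dataclass(frozen=True)
class RefundPolicy:
    # Hard limit on agent-issued refunds; the value is set by governance.
    ceiling: Decimal
    # Refunds in the top 20% of the ceiling are routed to a human.
    escalation_band: Decimal = Decimal("0.20")


def check_refund(amount: Decimal, policy: RefundPolicy) -> Decision:
    """Decide on a refund using only the amount and the policy.

    The agent's reasoning and the customer's message are deliberately not
    inputs, so no phrasing in the conversation can change the outcome.
    """
    if amount <= 0 or amount > policy.ceiling:
        return Decision.REJECT
    if amount >= policy.ceiling * (1 - policy.escalation_band):
        return Decision.ESCALATE
    return Decision.APPROVE


if __name__ == "__main__":
    policy = RefundPolicy(ceiling=Decimal("100.00"))  # illustrative ceiling
    for amount in ("40.00", "85.00", "250.00"):
        print(amount, check_refund(Decimal(amount), policy).value)
```

The design point is that the function signature has no place for model output or user text: the threshold is enforced by code the model cannot argue with.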

The retailer's remediation included: a runtime refund ceiling enforced by the integration layer, independent of the model's reasoning; a prompt-injection detection wrapper on all user inputs; revised kill criteria that included financial anomaly detection; and a quarterly red-team exercise targeting the agent's authorisation model.
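A financial kill criterion of the kind described in the remediation can be approximated with a sliding window over recent refunds. This is a rough sketch under stated assumptions: `RefundKillSwitch` is a hypothetical name, and every threshold (window length, maximum refund count, share of refunds near the ceiling, minimum sample size) is illustrative and would need tuning against real refund history.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RefundKillSwitch:
    # All thresholds below are illustrative assumptions.
    window_seconds: float = 3600.0
    max_refunds_in_window: int = 50
    ceiling: float = 100.0
    near_ceiling_fraction: float = 0.8
    max_near_ceiling_share: float = 0.25
    min_sample: int = 10
    _events: deque = field(default_factory=deque)

    def record(self, timestamp: float, amount: float) -> bool:
        """Record one refund; return True if the agent should be suspended."""
        self._events.append((timestamp, amount))
        # Drop refunds that have fallen out of the sliding window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

        count = len(self._events)
        # Frequency check: too many refunds in the window.
        if count > self.max_refunds_in_window:
            return True
        # Distribution check needs a minimum sample to be meaningful.
        if count < self.min_sample:
            return False
        near = sum(
            1 for _, a in self._events
            if a >= self.ceiling * self.near_ceiling_fraction
        )
        return near / count > self.max_near_ceiling_share
```

Unlike a customer satisfaction score, this signal moves within minutes of an exploit being shared, because it watches the money rather than the sentiment.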

Case study 2: A mid-sized commercial bank whose internal research agent distributed a flawed analyst note via auto-send email

A mid-sized commercial bank deployed an internal research agent to assist the equity research team. The agent was designed to aggregate data from multiple internal and external sources, draft an analyst note based on a structured template, and — once approved by the analyst — send the note to a distribution list of institutional clients. The "once approved" step was implemented as a confirmation dialogue in the orchestration interface.

The incident occurred when an analyst, under deadline pressure, approved a note that the agent had generated based on a stale data source. The stale source contained a material error in the revenue projection for a covered company. The agent sent the note to 340 institutional clients before the error was identified. The bank faced regulatory inquiry and client relationship damage.

What failed, by pillar:

Governance: the agent was classified as a research-assistance tool, not as a communications tool. This classification determined its risk tier and its review process: a research tool needs analyst review; a communications tool needs compliance review. Because the agent both drafted and sent the note, it straddled two risk categories, but the governance process treated it as only one. The Map function of the NIST AI Risk Management Framework (AI RMF) calls for the risk profile of an AI system to be assessed across all of its functions, not just its primary one.

Orchestration: the confirmation dialogue was a single click — a binary yes/no that provided no friction proportional to the action's consequence. Sending a research note to 340 institutional clients is a materially consequential action. The confirmation step should have included a summary of the data sources used, the date of each source, and a flag for any source that had not been updated within a defined freshness window. The orchestration layer had the information to generate this summary; it did not surface it.

Use Case: the use case had been scoped as "draft assistance," but the agent had been extended by the engineering team to include the "send" action because the analysts found the workflow more efficient. This scope extension was not reviewed against the original risk assessment. Use case scope must be version-controlled and any extension must trigger a re-assessment.

Integration: the agent's integration with the external data provider did not include a data-freshness check. A staleness flag on the data source would have been trivial to implement and would have surfaced the error before the analyst's confirmation step.
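The freshness check and the richer confirmation step described above could share one piece of logic. The sketch below is illustrative only: `SourceRecord`, `ConfirmationSummary`, and `build_confirmation_summary` are hypothetical names, the source names are invented, and the one-day freshness window is an assumed default rather than anything the bank specified.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SourceRecord:
    name: str
    last_updated: datetime


@dataclass(frozen=True)
class ConfirmationSummary:
    lines: list          # one human-readable line per data source
    stale_sources: list  # names of sources outside the freshness window

    @property
    def requires_extra_review(self) -> bool:
        return bool(self.stale_sources)


def build_confirmation_summary(sources, now, freshness_window=timedelta(days=1)):
    """List every source with its date and flag any that are stale."""
    lines, stale = [], []
    for source in sources:
        is_stale = (now - source.last_updated) > freshness_window
        if is_stale:
            stale.append(source.name)
        flag = "STALE" if is_stale else "ok"
        date = source.last_updated.date().isoformat()
        lines.append(f"{source.name}: last updated {date} [{flag}]")
    return ConfirmationSummary(lines=lines, stale_sources=stale)


if __name__ == "__main__":
    now = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)
    summary = build_confirmation_summary(
        [
            SourceRecord("internal_filings", now - timedelta(hours=6)),
            SourceRecord("vendor_estimates", now - timedelta(days=7)),
        ],
        now=now,
    )
    print("\n".join(summary.lines))
    print("extra review required:", summary.requires_extra_review)
```

An orchestration layer could show `summary.lines` in the approval dialogue and refuse a one-click send whenever `requires_extra_review` is true, which gives the confirmation step friction proportional to the risk.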

Case study 3: A SaaS company whose code-fixing agent merged a regression that brought down billing for four hours

A SaaS company with a developer-productivity culture deployed a code-fixing agent integrated with its CI/CD pipeline. The agent was authorised to open pull requests, address review comments automatically, and — for PRs that passed all automated tests — merge directly to the main branch without human review. The engineering team had high confidence in their test suite; the agent had resolved several hundred PRs without incident over six weeks.

The incident occurred when a billing-service dependency was updated upstream. The agent, responding to a failing test caused by the dependency change, generated a fix that changed an API call signature. The fix passed all automated tests because the billing-service integration tests had a coverage gap: they did not test the specific code path affected by the signature change. The agent merged. Billing stopped processing for four hours until a human engineer identified and reverted the change.

What failed, by pillar:

Governance: the agent had merge authority, a high-consequence action whose production effects cannot be undone by a later revert, without a tiered authorisation model. The policy should have distinguished between low-risk merge categories (documentation, test additions, dependency pinning) and high-risk categories (changes to payment processing, billing, authentication). The agent should have had direct merge authority only for the former.

Orchestration: the agent had no blast-radius estimate before merging. A well-instrumented orchestration layer for a code-change agent should include a dependency graph analysis that identifies which downstream systems are affected by a proposed change, and should escalate to human review for changes that affect systems above a defined criticality threshold.

Use Case: the kill criterion for the agent was "merge error rate above threshold" — but a merge error that causes a production incident is not captured by an error rate metric measured at merge time. The kill criterion should have included post-merge production incident correlation: if a production incident is identified within a defined window after an agent-merged PR, the agent's merge authority is automatically suspended pending review.

Integration: the CI/CD integration did not include a coverage-gap analysis before granting the agent merge authority. If the test suite coverage for the affected code path was below a defined threshold, the agent should have been required to escalate to human review rather than merging autonomously. Test coverage is an integration-layer signal; routing merge decisions through a coverage gate is an integration-layer control.
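The tiered authority and the coverage gate described above can be expressed as one merge decision function. This is a sketch under assumptions, not the company's pipeline: `ChangedFile`, `MergeDecision`, and `decide_merge` are hypothetical names, the path prefixes stand in for whatever defines "low-risk" in a given repository, and per-file coverage is assumed to be supplied by the CI system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MergeDecision(Enum):
    AUTO_MERGE = "auto_merge"
    HUMAN_REVIEW = "human_review"


# Hypothetical allowlist: only these paths qualify for direct agent merges.
LOW_RISK_PREFIXES = ("docs/", "tests/")


@dataclass(frozen=True)
class ChangedFile:
    path: str
    # Fraction of lines covered (0.0 to 1.0); None for non-code files.
    coverage: Optional[float] = None


def decide_merge(files, tests_passed, min_coverage=0.80):
    """Return (decision, reasons); any reason forces human review."""
    reasons = []
    if not tests_passed:
        reasons.append("automated tests did not pass")
    if not files:
        reasons.append("no changed files reported")
    for f in files:
        if not f.path.startswith(LOW_RISK_PREFIXES):
            reasons.append(f"{f.path}: outside low-risk categories")
        if f.coverage is not None and f.coverage < min_coverage:
            reasons.append(f"{f.path}: coverage below {min_coverage:.0%}")
    decision = MergeDecision.HUMAN_REVIEW if reasons else MergeDecision.AUTO_MERGE
    return decision, reasons


if __name__ == "__main__":
    decision, reasons = decide_merge(
        [ChangedFile("services/billing/client.py", coverage=0.62)],
        tests_passed=True,
    )
    print(decision.value, reasons)
```

Using an allowlist rather than a list of forbidden paths means a newly added critical service defaults to human review instead of defaulting to autonomous merges.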

The SaaS company's remediation included: tiered merge authority based on affected system criticality; automated dependency-graph analysis before merge; post-merge incident correlation as an automated kill criterion; and a coverage-gate integration that required human review for any PR touching code with less than 80% test coverage.
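Post-merge incident correlation, the third remediation item, reduces to a time-window join between incidents and agent merges. The sketch below makes several assumptions: `Merge`, `suspect_merges`, and `should_suspend_agent` are hypothetical names, the 24-hour window is illustrative, and a real system would likely also match on the affected service rather than on time alone.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Merge:
    pr_id: str
    merged_at: datetime
    merged_by_agent: bool


def suspect_merges(incident_at, merges, window=timedelta(hours=24)):
    """Return agent merges that landed within `window` before the incident."""
    return [
        m for m in merges
        if m.merged_by_agent
        and timedelta(0) <= incident_at - m.merged_at <= window
    ]


def should_suspend_agent(incident_at, merges, window=timedelta(hours=24)):
    """Suspend merge authority pending review if any agent merge correlates."""
    return bool(suspect_merges(incident_at, merges, window))


if __name__ == "__main__":
    incident = datetime(2025, 3, 4, 15, 0)
    history = [
        Merge("PR-A", datetime(2025, 3, 4, 11, 0), merged_by_agent=True),
        Merge("PR-B", datetime(2025, 3, 1, 9, 0), merged_by_agent=True),
        Merge("PR-C", datetime(2025, 3, 4, 14, 0), merged_by_agent=False),
    ]
    print([m.pr_id for m in suspect_merges(incident, history)])
    print("suspend:", should_suspend_agent(incident, history))
```

The check is deliberately conservative: it suspends on correlation, not proven causation, and leaves the causal analysis to the human review that follows.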