The Agentic Enterprise  ·  Chapter 08

Failure Modes

Hallucinated calls, prompt injection, runaway loops, and the cost of being wrong at scale

Every technology has its characteristic failure patterns. The failure patterns of agentic AI are not random — they follow from the architecture, the trust model, and the autonomy level of the system. Understanding them in advance is not pessimism; it is engineering discipline. The organizations that have deployed agents most successfully are those that anticipated the specific failure modes most likely to affect their systems and built explicit mitigations before the failures occurred. The ones that have had the most costly incidents are those that treated failure as an unlikely edge case rather than a predictable engineering challenge.

Hallucinated Tool Calls

The most operationally significant failure mode in production agentic systems is the hallucinated or malformed tool call: the agent generates a tool call with incorrect parameters, a non-existent function name, or a parameter value that is inconsistent with the task context. In a well-designed system, such calls are caught by the tool-call validation layer and returned to the agent as errors. In a poorly designed system, or when validation is incomplete, they may be silently dropped, produce unexpected side effects, or propagate errors into downstream processing.

Hallucinated tool calls are distinct from hallucinated content (the fabrication of facts in text output) but they arise from the same underlying mechanism: the model generating plausible-sounding output that is not grounded in verified information. The mitigation strategies are also analogous: strict schema validation for all tool calls, explicit error handling for malformed calls that feeds back into the model's context rather than silently failing, and evaluation harnesses that specifically test for tool-call reliability across the range of inputs the agent is expected to encounter.
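As a minimal sketch of the first two mitigations, the Python below validates a proposed tool call against a catalog and turns any rejection into text that can be fed back into the model's context. The catalog contents, tool names, and function names are illustrative assumptions, not any vendor's API; a production system would more likely validate against a full schema language.

```python
from dataclasses import dataclass, field

# Hypothetical tool catalog: tool name -> {parameter name: required type}.
TOOL_SCHEMAS = {
    "search_web": {"query": str},
    "create_event": {"title": str, "start_iso": str, "duration_minutes": int},
}


@dataclass
class ValidationResult:
    ok: bool
    errors: list = field(default_factory=list)


def validate_tool_call(name, arguments):
    """Check a proposed tool call against the catalog before executing it."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        known = ", ".join(sorted(TOOL_SCHEMAS))
        return ValidationResult(False, [f"unknown tool '{name}'; known tools: {known}"])

    errors = []
    for param, expected in schema.items():
        if param not in arguments:
            errors.append(f"missing required parameter '{param}'")
        elif type(arguments[param]) is not expected:
            # Exact type match, so True/False is not accepted where an int is required.
            errors.append(
                f"parameter '{param}' must be {expected.__name__}, "
                f"got {type(arguments[param]).__name__}"
            )
    for param in arguments:
        if param not in schema:
            errors.append(f"unexpected parameter '{param}'")
    return ValidationResult(not errors, errors)


def feedback_for_model(name, result):
    """Render a rejection as text to append to the model's context."""
    return f"Tool call '{name}' rejected: " + "; ".join(result.errors)
```

The design point is that a rejected call produces a specific, readable error rather than a silent drop, so the model has something concrete to correct on its next turn.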

The frequency of hallucinated tool calls varies substantially between models and between task types. Tasks that involve calling well-known, frequently used APIs (web search, calendar access) tend to have lower hallucination rates than tasks that involve novel or complex API schemas. Organizations deploying agents with large tool catalogs should expect higher hallucination rates than those with small, well-defined catalogs, and should design their evaluation frameworks accordingly.

Prompt Injection

Prompt injection is the agentic equivalent of SQL injection: adversarial content embedded in the agent's input stream that hijacks its behavior. In the simplest form, direct prompt injection, a user provides input that contains instructions designed to override the agent's system prompt. In the more dangerous form, indirect prompt injection, the adversarial content arrives in the result of a tool call: in a document the agent retrieves, a web page it visits, a database record it queries, or an email it reads.

Indirect prompt injection is more dangerous than direct injection for two reasons. First, the content arrives from a source that the agent is designed to trust — a retrieved document, a tool result — and may not be treated with the same skepticism that an agent (or its operators) would apply to user-provided input. Second, because the agent is autonomous, the adversarial instruction can direct it to take actions — send a message, exfiltrate data, modify a record — without any further human interaction. The attack requires no social engineering of the human; it only needs to reach the agent's context window.

Indirect prompt injection attacks have been demonstrated against multiple production agentic systems, including browser-use agents and email-processing agents. The MITRE ATLAS knowledge base catalogs the techniques involved; the OWASP LLM Top 10 ranks prompt injection as the top risk for LLM applications (LLM01). Mitigations include sandboxing of inputs and outputs, instruction hierarchy enforcement (instructions from the system prompt take precedence over instructions from retrieved content), and explicit validation of agent actions against the original task goal before execution.
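The last of these mitigations, validating an action against the original task goal, can be sketched as a gate that runs outside the model. Everything here is a hypothetical illustration: the `TaskGoal` and `ProposedAction` types, the `OUTBOUND_ACTIONS` set, and the idea that the operator fixes the allowed actions and targets when the task is created. A gate like this limits what an injected instruction can accomplish; it does not detect the injection itself.

```python
from dataclasses import dataclass

# Hypothetical set of actions that send data outside the organization.
OUTBOUND_ACTIONS = frozenset({"send_email", "post_webhook"})


@dataclass(frozen=True)
class TaskGoal:
    description: str
    allowed_actions: frozenset   # fixed by the operator when the task is created
    allowed_targets: frozenset   # recipients or endpoints the task may contact


@dataclass(frozen=True)
class ProposedAction:
    name: str
    target: str


def authorize(goal, action):
    """Return (allowed, reason). Runs outside the model, before execution."""
    if action.name not in goal.allowed_actions:
        return False, f"'{action.name}' is not part of the task: {goal.description}"
    if action.name in OUTBOUND_ACTIONS and action.target not in goal.allowed_targets:
        return False, f"'{action.target}' is not an approved target for '{action.name}'"
    return True, "ok"


# Example: an email-summarizing task whose only approved recipient is the owner.
goal = TaskGoal(
    description="summarize today's inbox for the owner",
    allowed_actions=frozenset({"read_email", "send_email"}),
    allowed_targets=frozenset({"owner@example.com"}),
)
# An instruction hidden in a retrieved email asks the agent to forward data elsewhere.
injected = ProposedAction(name="send_email", target="attacker@example.net")
allowed, reason = authorize(goal, injected)  # the gate denies this action
```

Because the allowed actions and targets are set before any retrieved content enters the context window, text in a document or email cannot widen them.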

Runaway Loops

A runaway loop is an agent that continues executing indefinitely — consuming tokens, tool calls, and time — without making progress toward its goal or reaching a completion condition. Loops arise from several distinct mechanisms: a goal that cannot be met with the available tools (the agent retries indefinitely), a planning error that places the agent in a circular sequence of steps, or a tool failure that the agent attempts to recover from but cannot. The result in all cases is consumption of resources and, often, an accumulating set of side effects from the failed recovery attempts.

Runaway loops are the failure mode with the clearest mitigation: explicit resource budgets. Every production agent should have a hard limit on the number of tool calls it can make in a single session, the total number of tokens it can consume, and the elapsed wall-clock time before it is halted and the situation flagged for human review. These limits should be tuned to the expected execution profile of the agent's tasks — tight enough to catch runaway behavior early, loose enough not to interrupt normal operation — and they should be monitored to detect drift in the agent's execution patterns over time.
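A minimal sketch of such a budget follows, assuming a hypothetical `step` callable that performs one model turn with one tool call and reports the tokens it used. The class and function names are illustrative, and the limits shown in any usage would need tuning to the agent's real execution profile.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a session crosses one of its hard limits."""


@dataclass
class ResourceBudget:
    max_tool_calls: int
    max_tokens: int
    max_seconds: float
    tool_calls: int = 0
    tokens: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, tool_calls=0, tokens=0):
        """Record usage, then raise if any limit has been exceeded."""
        self.tool_calls += tool_calls
        self.tokens += tokens
        elapsed = time.monotonic() - self.started_at
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(f"tool calls: {self.tool_calls} > {self.max_tool_calls}")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"tokens: {self.tokens} > {self.max_tokens}")
        if elapsed > self.max_seconds:
            raise BudgetExceeded(f"elapsed: {elapsed:.1f}s > {self.max_seconds}s")


def run_session(step, budget):
    """Call step() until it reports completion or the budget halts the session.

    `step` is a hypothetical callable returning (done, tokens_used).
    """
    try:
        while True:
            budget.charge(tool_calls=1)        # reserve the call before making it
            done, tokens_used = step()
            budget.charge(tokens=tokens_used)  # tokens are only known afterwards
            if done:
                return "completed"
    except BudgetExceeded as exc:
        return f"halted for human review ({exc})"
```

Reserving the tool call before executing it means the call that would exceed the limit is never made; token usage can only be charged after the fact, so that limit may be overshot by one turn.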

Beyond resource limits, graceful degradation matters. An agent that hits its resource limit should not simply stop; it should attempt to preserve whatever partial work it has completed, communicate the state of the task to the human operator, and explain, to the extent it can, why it was unable to complete the task. The quality of an agent's failure communication is a meaningful differentiator between well-engineered and poorly engineered systems, and it is underspecified in almost all current vendor documentation.
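One way to make this concrete is a structured handoff report that the agent emits whether or not it finishes. The sketch below is an assumption about what such a report might contain; the `HandoffReport` fields and the `run_with_handoff` helper are hypothetical, and a simple step count stands in for a fuller resource budget.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class HandoffReport:
    task: str
    halt_reason: str
    completed_steps: list = field(default_factory=list)
    pending_steps: list = field(default_factory=list)
    partial_results: dict = field(default_factory=dict)
    explanation: str = ""

    def to_json(self):
        """Serialize the report for the operator's queue or ticketing system."""
        return json.dumps(asdict(self), indent=2)


def run_with_handoff(task, steps, max_steps):
    """Run (name, callable) steps in order; on hitting the limit, report state.

    A report is returned either way, so the operator always sees what was done.
    """
    report = HandoffReport(task=task, halt_reason="completed")
    for index, (name, action) in enumerate(steps):
        if index >= max_steps:
            report.halt_reason = "step limit reached"
            report.pending_steps = [n for n, _ in steps[index:]]
            report.explanation = (
                f"Stopped after {max_steps} of {len(steps)} steps; "
                "results of completed steps are preserved in partial_results."
            )
            break
        report.partial_results[name] = action()
        report.completed_steps.append(name)
    return report
```

The report separates what was done from what remains, which lets a human operator resume the task rather than restart it.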

Scope Creep and Overreach

Scope creep is the gradual expansion of an agent's operational footprint beyond what was intended. It happens for several reasons: agents instructed to "do whatever it takes" to complete a task will, if given sufficient tool access, eventually access resources and take actions that were not explicitly sanctioned; agents operating in multi-step tasks may acquire context that makes additional actions seem locally justified even when they are globally outside the intended scope; and agents that accumulate permissions through on-behalf-of delegation chains may end up with effective access rights far beyond those of the immediate delegating user.

The OWASP LLM Top 10 designates this failure mode "Excessive Agency" (LLM06 in the 2025 edition) and identifies over-permissioning, meaning granting agents more tool access, permissions, or capabilities than the minimum required, as the primary structural cause. The mitigation is principled: start with minimal permissions, grant additional access only when there is a demonstrated requirement, and review the permission envelope periodically as the agent's operational scope is better understood. This principle is easy to state and persistently difficult to implement in organizations where the path of least resistance is to give the agent broad permissions and rely on its judgment to use them appropriately.
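The sketch below illustrates two of these ideas under stated assumptions: permissions are modeled as plain strings, the effective rights of an agent acting through a delegation chain are taken to be the intersection of its own envelope with what every delegating principal holds, and new grants require a recorded justification. The function names and the intersection rule are illustrative choices, not something prescribed by OWASP.

```python
def effective_permissions(agent_envelope, delegation_chain):
    """Intersect the agent's own envelope with every principal in the chain.

    A permission survives only if the agent was granted it AND every
    delegating principal holds it, so a chain can narrow access but
    never widen it.
    """
    effective = set(agent_envelope)
    for principal_permissions in delegation_chain:
        effective &= set(principal_permissions)
    return frozenset(effective)


def grant(agent_envelope, permission, justification):
    """Add one permission, refusing requests with no recorded requirement."""
    if not justification.strip():
        raise ValueError(f"refusing to grant '{permission}' without a justification")
    return frozenset(agent_envelope) | {permission}


# Example: the agent may read and write tickets, but is acting for an analyst
# who can only read them, delegated by a manager with broader rights.
agent = frozenset({"tickets:read", "tickets:write"})
manager = {"tickets:read", "tickets:write", "billing:read"}
analyst = {"tickets:read"}
rights = effective_permissions(agent, [manager, analyst])  # only "tickets:read" survives
```

Under this rule the manager's `billing:read` never reaches the agent, because the agent's own envelope does not include it, and `tickets:write` is dropped because the analyst does not hold it.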

The Failure Feedback Loop

Perhaps the most important meta-level failure mode is the absence of a failure feedback loop: the organizational condition in which agent failures occur, produce costs or user harm, and are never routed back to the teams responsible for the agent in a form that enables learning and remediation. This failure mode does not have a technical name in the OWASP taxonomy, but it is arguably more consequential than any specific technical failure mode, because it allows all the others to persist.

Healthy failure feedback loops require incident reporting that reaches the agent development team, retrospective analysis that distinguishes failure types and root causes, evaluation updates that add new test cases for each failure mode discovered, and governance processes that track remediation of known failure modes. Organizations with mature DevOps practices will find these requirements familiar; the challenge is applying them to agents, where the failure modes are novel, the evidence (a reasoning trace) is less familiar than a stack trace, and the operational responsibility may not yet be clearly assigned.
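As a rough illustration of the evaluation-update and remediation-tracking steps, the sketch below converts an incident record into a regression case for the evaluation suite and lists incidents that are still open. The record fields, the failure-type labels (taken from this chapter's section headings), and the function names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class FailureType(Enum):
    HALLUCINATED_TOOL_CALL = "hallucinated_tool_call"
    PROMPT_INJECTION = "prompt_injection"
    RUNAWAY_LOOP = "runaway_loop"
    EXCESSIVE_AGENCY = "excessive_agency"


@dataclass
class Incident:
    incident_id: str
    failure_type: FailureType
    triggering_input: str
    expected_behavior: str
    remediated: bool = False


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input_text: str
    expected_behavior: str
    tags: tuple


def incident_to_eval_case(incident):
    """Turn a production incident into a permanent regression test case."""
    return EvalCase(
        case_id=f"regression-{incident.incident_id}",
        input_text=incident.triggering_input,
        expected_behavior=incident.expected_behavior,
        tags=("regression", incident.failure_type.value),
    )


def open_remediations(incidents):
    """List incident IDs that governance still needs to track to closure."""
    return [i.incident_id for i in incidents if not i.remediated]
```

Tagging each case with its failure type keeps the distinction drawn in retrospective analysis visible in the evaluation suite, so reliability can be tracked per failure mode.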

"The three most expensive agent failures I have encountered in enterprise settings had one thing in common: they were not surprising to anyone who had done the pre-deployment risk analysis. They were failures that had been identified as risks and classified as unlikely edge cases. The edge cases turned out to be more frequent than estimated."