Chapter 20 · Guardrails and Policy Engines

Governance charters and policy stacks tell the organization what agents are permitted to do; guardrails and policy engines enforce those permissions at runtime. The distinction is not academic. A governance charter that says "this agent shall not send email to external parties without human approval" is worthless unless a technical control actually prevents the agent from doing so. The agent does not read the charter, and the model that powers it has no inherent understanding of organizational policy. Guardrails are the runtime instantiation of governance; without them, governance is aspirational. This chapter maps the full defensive surface, from input filters at the conversation boundary to policy engines that enforce constraints mid-workflow.

The Defense-in-Depth Model

Security practitioners have long applied the principle of defense in depth to conventional software systems: no single control is assumed to be perfect, so multiple overlapping controls are layered such that an attacker must defeat several independently to succeed. The same principle applies to agentic guardrails, and it matters more for agents than for conventional systems because the attack surface is wider and more dynamic. An agentic system faces attacks at multiple points: the initial user input may contain a prompt injection; the tool responses retrieved from external services may contain adversarial content; the model's reasoning may be manipulated by carefully crafted context; and the agent's outputs may be intercepted and modified before they reach their destination. A single guardrail that operates at only one of these points provides inadequate coverage.

The layered guardrail model has four principal layers. The input layer screens incoming requests before they reach the model, filtering for known attack patterns, PII that should not be included in model prompts, and requests that fall outside the agent's approved use cases. The reasoning layer monitors the model's intermediate outputs — its plans, its reasoning traces, its tool call parameters — for indications that it is about to take an action that violates policy. The output layer screens the agent's final outputs before they are delivered, checking for policy violations, data leakage, and content that fails defined quality thresholds. The action layer enforces policy at the point of tool invocation — the last line of defense before an action has real-world consequences.
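One way to picture the four layers is as named checkpoints, each holding its own list of checks. The sketch below is illustrative only: `GuardrailPipeline`, `Verdict`, and the two toy checks are hypothetical names invented for this example, and the string matching is far cruder than anything a production system would rely on.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# The four layers named in the text.
LAYERS = ("input", "reasoning", "output", "action")


@dataclass
class Verdict:
    allowed: bool
    layer: str
    reason: str = ""


# A check returns a reason string when it objects, or None when it does not.
Check = Callable[[str], Optional[str]]


@dataclass
class GuardrailPipeline:
    checks: Dict[str, List[Check]] = field(
        default_factory=lambda: {name: [] for name in LAYERS}
    )

    def register(self, layer: str, check: Check) -> None:
        if layer not in self.checks:
            raise ValueError(f"unknown layer: {layer}")
        self.checks[layer].append(check)

    def evaluate(self, layer: str, payload: str) -> Verdict:
        """Run every check registered for one layer; the first objection blocks."""
        if layer not in self.checks:
            raise ValueError(f"unknown layer: {layer}")
        for check in self.checks[layer]:
            reason = check(payload)
            if reason is not None:
                return Verdict(False, layer, reason)
        return Verdict(True, layer)


# Toy checks (assumptions for illustration only).
def no_instruction_override(text: str) -> Optional[str]:
    if "ignore previous instructions" in text.lower():
        return "possible prompt injection"
    return None


def no_external_email(text: str) -> Optional[str]:
    if text.startswith("send_email:") and not text.endswith("@example.com"):
        return "recipient outside approved domain"
    return None


pipeline = GuardrailPipeline()
pipeline.register("input", no_instruction_override)
pipeline.register("action", no_external_email)
```

Each layer is evaluated independently, so a request that slips past the input check can still be stopped at the action check. That independence is the point of layering.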

Input Filters and Prompt Injection Defense

Prompt injection is the agentic equivalent of SQL injection: an attacker supplies input that the model interprets as instructions rather than data, causing it to take actions that override its intended behavior. The OWASP Agentic AI Security Initiative has identified prompt injection as one of the top vulnerabilities in agentic systems — and the defense is correspondingly important. Input filters at the conversation boundary should screen for the linguistic patterns associated with known injection attacks: instructions to ignore the system prompt, requests to reveal the system prompt's contents, persona-switching commands, and instructions to perform actions outside the agent's authorized scope.
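A pattern-based input filter for the categories just listed might look like the following minimal sketch. The pattern names and regular expressions are assumptions chosen for illustration; they are not drawn from any published rule set, and a real filter would need much broader and regularly updated coverage.

```python
import re
from typing import List

# Illustrative patterns only (assumptions). Each maps a category from the
# text to one crude regular expression.
INJECTION_PATTERNS = {
    "ignore_system_prompt": re.compile(
        r"\b(ignore|disregard|forget)\b.{0,40}\b(instructions|system prompt|rules)\b",
        re.IGNORECASE | re.DOTALL,
    ),
    "reveal_system_prompt": re.compile(
        r"\b(reveal|show|print|repeat)\b.{0,40}\bsystem prompt\b",
        re.IGNORECASE | re.DOTALL,
    ),
    "persona_switch": re.compile(
        r"\b(you are now|act as|pretend to be)\b", re.IGNORECASE
    ),
}


def screen_input(user_text: str) -> List[str]:
    """Return the name of every pattern the input matches (empty if none)."""
    return [
        name
        for name, pattern in INJECTION_PATTERNS.items()
        if pattern.search(user_text)
    ]
```

A pattern such as `act as` will also match benign requests, so a filter like this is better used to flag input for closer inspection than to reject it outright.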

Input filters are necessary but not sufficient. Indirect prompt injection — where the malicious instruction is not in the user's input but in a document or tool response that the agent retrieves and incorporates into its context — is harder to defend against because it occurs after the input filter has already processed the initial request. Defense against indirect injection requires monitoring at the reasoning layer: the agent's reasoning traces should be analyzed for the pattern of a sudden, unexplained change in goal or behavior that is characteristic of a successful injection attack. This is an active research area with no perfect solution, but several commercial guardrail products have begun offering injection detection at the reasoning layer as a production feature.
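One simple proxy for "a sudden, unexplained change in goal" is a tool call that falls outside the approved task plan shortly after untrusted content enters the context. The sketch below implements only that heuristic. `TraceMonitor` and its fields are hypothetical, and the approach assumes the orchestrator can say which tools a task is expected to use and which retrieved sources are untrusted.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class TraceMonitor:
    # Tools the approved task plan is expected to need (an assumption:
    # the orchestrator must be able to supply this set).
    expected_tools: Set[str]
    untrusted_sources: List[str] = field(default_factory=list)
    alerts: List[str] = field(default_factory=list)

    def observe_retrieval(self, source: str, trusted: bool) -> None:
        """Record that content from `source` was added to the context."""
        if not trusted:
            self.untrusted_sources.append(source)

    def observe_tool_call(self, tool: str) -> bool:
        """Return True if the call may proceed, False if it should be held."""
        if self.untrusted_sources and tool not in self.expected_tools:
            self.alerts.append(
                f"unexpected tool {tool!r} after reading "
                f"{self.untrusted_sources[-1]!r}"
            )
            return False
        return True
```

This catches only injections that steer the agent toward a tool it was not expected to use. An injection that abuses an expected tool passes unnoticed, which is one reason the action layer still has to check parameters.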

Runtime Policy Engines

A runtime policy engine is a component that sits between the orchestration framework and the tools, intercepting every tool invocation and evaluating it against a defined policy before allowing it to execute. The policy can be expressed in various forms: a set of allow/deny rules based on the tool's parameters (block any email send to a domain not on the approved list), a statistical anomaly detector (flag any tool call sequence that differs significantly from the historical baseline), or a secondary model call (ask a smaller, cheaper model whether this action is consistent with the agent's approved use case). The runtime policy engine is the action layer of the defense-in-depth model — the control that matters most when everything else has failed.
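The interception pattern can be sketched in a few lines. Everything named here is hypothetical: `PolicyEngine`, `PolicyViolation`, the `send_email` stand-in tool, and the `example.com` allowlist exist only for this example, which implements the allow/deny rule from the text (block any email to a domain not on the approved list).

```python
from typing import Any, Callable, Dict, List

APPROVED_EMAIL_DOMAINS = {"example.com"}  # hypothetical allowlist


class PolicyViolation(Exception):
    """Raised when a rule blocks a tool invocation."""


# A rule inspects one invocation and raises PolicyViolation to block it.
Rule = Callable[[str, Dict[str, Any]], None]


def email_domain_rule(tool: str, params: Dict[str, Any]) -> None:
    if tool != "send_email":
        return
    domain = str(params.get("to", "")).rsplit("@", 1)[-1].lower()
    if domain not in APPROVED_EMAIL_DOMAINS:
        raise PolicyViolation(f"email to unapproved domain {domain!r}")


class PolicyEngine:
    """Sits between the orchestrator and the tools; all calls go through invoke()."""

    def __init__(self, tools: Dict[str, Callable[..., Any]], rules: List[Rule]):
        self._tools = tools
        self._rules = rules

    def invoke(self, tool: str, **params: Any) -> Any:
        if tool not in self._tools:
            raise PolicyViolation(f"unknown tool {tool!r}")
        for rule in self._rules:
            rule(tool, params)  # raises PolicyViolation to block
        return self._tools[tool](**params)


# Stand-in tool that records what it would have sent.
outbox: List[str] = []


def send_email(to: str, body: str) -> str:
    outbox.append(to)
    return f"sent to {to}"


engine = PolicyEngine({"send_email": send_email}, [email_domain_rule])
```

The design choice that matters is that the orchestrator holds a reference to the engine, never to the tools themselves, so there is no code path that reaches `send_email` without passing the rules.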

Policy engine design involves tradeoffs between coverage, latency, and cost. A policy that calls a secondary model for every tool invocation is comprehensive but adds latency and cost to every agent run. A rule-based policy is fast and cheap but brittle — it can only cover scenarios that have been explicitly anticipated. The practical approach for most production deployments is a tiered policy engine: fast, rule-based checks for the most common and most clearly prohibited actions, with secondary model invocation reserved for ambiguous cases that the rules cannot resolve. The threshold for ambiguity should be calibrated to the risk profile of the action: a tool invocation that could have irreversible consequences should get secondary review even at the cost of additional latency.
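The tiered approach can be sketched as a fast rule pass that returns allow, deny, or review, with only the review cases escalated to a slower second opinion. The tool names and the `stub_reviewer` function are assumptions; in a real deployment the reviewer would be the secondary model call described above.

```python
from enum import Enum
from typing import Any, Callable, Dict, List


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REVIEW = "review"


# Hypothetical tool categories for illustration.
READ_ONLY_TOOLS = {"search_docs", "get_calendar"}
PROHIBITED_TOOLS = {"drop_database"}
IRREVERSIBLE_TOOLS = {"wire_transfer"}


def rule_tier(tool: str) -> Decision:
    """Fast, cheap first tier: set lookups only."""
    if tool in PROHIBITED_TOOLS:
        return Decision.DENY
    if tool in IRREVERSIBLE_TOOLS:
        return Decision.REVIEW  # irreversible actions always get a second look
    if tool in READ_ONLY_TOOLS:
        return Decision.ALLOW
    return Decision.REVIEW  # the rules cannot resolve this case


Reviewer = Callable[[str, Dict[str, Any]], bool]


def evaluate(tool: str, params: Dict[str, Any], reviewer: Reviewer) -> Decision:
    """Run the rule tier first; call the reviewer only when it says REVIEW."""
    first = rule_tier(tool)
    if first is not Decision.REVIEW:
        return first
    return Decision.ALLOW if reviewer(tool, params) else Decision.DENY


# Stand-in for a secondary model call; it logs each invocation so the
# cost of the second tier is visible.
review_log: List[str] = []


def stub_reviewer(tool: str, params: Dict[str, Any]) -> bool:
    review_log.append(tool)
    return params.get("amount", 0) <= 1_000
```

Read-only and prohibited tools never reach the reviewer, so the latency and cost of the second tier are paid only where the rules are silent or the action is irreversible.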

"A guardrail that adds 200 milliseconds of latency to an agent action that could authorize a $50,000 payment is not a performance problem. It is a bargain."

Output Screening and Data Leakage Prevention

The output layer is where data leakage prevention controls are most naturally implemented. An agent that has retrieved confidential documents, synthesized their contents, and is about to deliver a summary to a user may be about to deliver information that the user is not authorized to receive — either because the agent made a mistake in its access control logic, or because the user manipulated the agent into retrieving documents that exceeded their clearance level. Output screening checks the agent's response against the classification labels that the data governance framework has assigned to the retrieved content and blocks or redacts the response if it contains material at a classification level above the user's authorization.
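A classification check of this kind reduces to comparing two ordered levels. The sketch below assumes the response can be broken into passages that each carry the classification label of their source document; the four level names and the `Passage` type are hypothetical, since the text does not specify a labeling scheme.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Level(IntEnum):
    # Hypothetical classification scheme, ordered from least to most sensitive.
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


@dataclass
class Passage:
    text: str
    level: Level  # label inherited from the source document


def screen_output(passages: List[Passage], user_clearance: Level) -> str:
    """Redact every passage classified above the user's clearance."""
    parts = [
        p.text if p.level <= user_clearance else "[REDACTED]"
        for p in passages
    ]
    return " ".join(parts)
```

For example, a user cleared to `INTERNAL` who receives one internal passage and one restricted passage sees the first in full and `[REDACTED]` in place of the second. Blocking the whole response instead of redacting is a one-line change and may be the safer default when redacted fragments could still leak context.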

Beyond data classification, output screening should also enforce content policies: the agent's responses should not contain hate speech, discriminatory content, or other material that violates the organization's acceptable use policy, even if the model generates it in a context where it might appear to follow naturally from the task. Screening a single response for content policy violations is a comparatively mature capability for simple question-answering systems. It is harder for agents for two reasons: the relevant context is the full sequence of actions that led to the output, not just the output itself, and a violation may be embedded in a tool call parameter rather than in the agent's natural language output.
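Covering tool call parameters means walking every string in each call's arguments, however deeply nested, and applying the same content check used on the final output. In this sketch the `violates_content_policy` function is a placeholder that matches a literal marker; a real system would call a content classifier there. The shape of the `tool_calls` records is also an assumption.

```python
from typing import Any, Dict, Iterator, List


def iter_strings(value: Any) -> Iterator[str]:
    """Yield every string found anywhere inside a nested structure."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for key, item in value.items():
            yield from iter_strings(key)
            yield from iter_strings(item)
    elif isinstance(value, (list, tuple, set)):
        for item in value:
            yield from iter_strings(item)


def violates_content_policy(text: str) -> bool:
    # Placeholder (assumption): stands in for a real content classifier.
    return "BLOCKED_TERM" in text


def screen_action_sequence(
    final_output: str, tool_calls: List[Dict[str, Any]]
) -> List[str]:
    """Return the location of every violation in the output or the tool calls."""
    findings: List[str] = []
    if violates_content_policy(final_output):
        findings.append("final_output")
    for index, call in enumerate(tool_calls):
        strings = iter_strings(call.get("params", {}))
        if any(violates_content_policy(s) for s in strings):
            findings.append(f"tool_call[{index}]:{call.get('tool')}")
    return findings
```

Screening only `final_output` would report nothing for an agent whose reply is clean but whose `post_message` call carried the offending text, which is exactly the gap this walk is meant to close.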

The Policy-as-Code Paradigm

The most maintainable approach to agentic guardrails treats policy as code: the rules and constraints that govern the agent's behavior are expressed in a machine-readable format, version-controlled alongside the agent's other code, and deployed through the same CI/CD pipeline that deploys changes to the agent's prompts and tools. This policy-as-code paradigm has several advantages. It makes policy changes auditable — every modification is tracked in the version control system with a timestamp, an author, and an explanation. It makes policy testing automated — regression tests can verify that a policy change has the intended effect and no unintended side effects. And it makes policy portable — the same policy definitions can be applied consistently across different deployment environments and different agent implementations.
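Automated policy testing can be as simple as a table of invocations paired with the decision each one must produce, run against every proposed policy change. The policy shape, the `is_allowed` evaluator, and the cases below are all hypothetical; the point is only that a loosened rule shows up as a failing case before it is deployed.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical policy, as it might be stored in the repository.
POLICY: Dict[str, Dict[str, Any]] = {
    "send_email": {"allowed_domains": ["example.com"]},
    "wire_transfer": {"max_amount": 10_000},
}


def is_allowed(policy: Dict[str, Dict[str, Any]], tool: str, params: Dict[str, Any]) -> bool:
    rule = policy.get(tool)
    if rule is None:
        return False  # default deny for tools the policy does not mention
    if "allowed_domains" in rule:
        domain = str(params.get("to", "")).rsplit("@", 1)[-1].lower()
        if domain not in rule["allowed_domains"]:
            return False
    if "max_amount" in rule and params.get("amount", 0) > rule["max_amount"]:
        return False
    return True


# Each case: (tool, params, expected decision).
REGRESSION_CASES: List[Tuple[str, Dict[str, Any], bool]] = [
    ("send_email", {"to": "amy@example.com"}, True),
    ("send_email", {"to": "bob@evil.test"}, False),
    ("wire_transfer", {"amount": 500}, True),
    ("wire_transfer", {"amount": 50_000}, False),
    ("delete_records", {}, False),
]


def run_regression(policy: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a description of every case whose decision changed."""
    failures = []
    for tool, params, expected in REGRESSION_CASES:
        actual = is_allowed(policy, tool, params)
        if actual != expected:
            failures.append(f"{tool} {params}: expected {expected}, got {actual}")
    return failures
```

Run in the CI/CD pipeline, `run_regression` returns an empty list for the current policy, and a change that raises `max_amount` to 100,000 produces one failure: the 50,000 transfer that used to be denied.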

The practical implication is that policy definitions should be stored in a structured format (YAML, JSON, or a domain-specific language) rather than embedded in natural language in the agent's system prompt. Natural language policies in system prompts are expensive (they consume context window tokens), brittle (the model may not follow them consistently), and hard to audit (a rule buried in prompt prose cannot be tested in isolation, and a diff of the prompt does not show which behavior changed or why). Structured, programmatic policy definitions outside the model context are cheaper, more reliable, and more governable.
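As a concrete illustration, the charter sentence quoted at the start of the chapter ("shall not send email to external parties without human approval") can be written as a structured rule and evaluated outside the model. The JSON schema below, including the `effect`, `when`, and `ends_with` fields, is invented for this sketch and does not follow any particular policy language.

```python
import json
from typing import Any, Dict

# Hypothetical policy document; in practice this would live in its own
# version-controlled file rather than in a string.
POLICY_JSON = """
{
  "default": "deny",
  "rules": [
    {"tool": "search_docs", "effect": "allow"},
    {"tool": "send_email", "effect": "allow",
     "when": {"param": "to", "ends_with": "@example.com"}},
    {"tool": "send_email", "effect": "require_approval"}
  ]
}
"""

VALID_EFFECTS = {"allow", "deny", "require_approval"}


def load_policy(text: str) -> Dict[str, Any]:
    """Parse the policy and reject any rule with an unknown effect."""
    policy = json.loads(text)
    for rule in policy["rules"]:
        if rule["effect"] not in VALID_EFFECTS:
            raise ValueError(f"unknown effect: {rule['effect']}")
    return policy


def decide(policy: Dict[str, Any], tool: str, params: Dict[str, Any]) -> str:
    """Return the effect of the first matching rule, or the default."""
    for rule in policy["rules"]:
        if rule["tool"] != tool:
            continue
        condition = rule.get("when")
        if condition is None:
            return rule["effect"]
        value = str(params.get(condition["param"], ""))
        if value.endswith(condition["ends_with"]):
            return rule["effect"]
    return policy["default"]


policy = load_policy(POLICY_JSON)
```

Because the rules are data, the internal-versus-external distinction is decided by `decide` deterministically and costs no context window tokens; the model never has to be trusted to remember it.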