Building an Agentic Enterprise  ·  Chapter 04 of 21
Chapter 04

Anatomy of an Agent

Six parts you cannot leave out without the system pretending to work

6
parts, all required
1
the part teams skip first: observability
0
agents in production without all six (that survive a year)
Save PDF

An agent is a small system. Like all small systems, it has parts. Skip a part and the system runs anyway — for a while — and then fails in a way that's hard to debug because the missing piece is exactly the piece that would have told you what went wrong. This chapter walks the six parts and what each of them is actually for.

Model and prompt

The model is the reasoner. In 2026 the choice is mostly between a frontier closed model (Anthropic Claude, OpenAI GPT-class, Google Gemini), a strong open model run on your own infrastructure (Llama, Qwen, Mistral, DeepSeek), or a domain-specific small model. The choice depends on three things: the latency budget (how fast must each step be?), the cost budget per task (model price × tokens × steps × volume), and the data residency constraints (can the data leave your tenant?). Frontier-closed wins on raw capability. Open wins on cost and control. Small wins on latency and predictability for narrow tasks.

The prompt is where most teams under-invest. The system prompt is not a polite request; it is the agent's job description, escalation procedure, scope of authority, and code of conduct, all in one document. A good system prompt explicitly says what the agent does and does not do, what tools it has, what it should do when uncertain (ask, don't guess), and what it must never do regardless of how the user phrases the request. Versions of the system prompt are managed like code: in source control, with diffs, with an owner, and with regression tests. A "prompt update" is a deploy. Treat it accordingly.

Tools, scoped

Tools are the levers the agent can pull on the world. In the worst design, the agent has a tool called do_anything wired to a service account with admin rights. That's not an exaggeration; it's an early-2024 anti-pattern that produced several embarrassing incidents.

The right shape: each tool has a single, narrow purpose; takes a small, typed input; returns a small, typed output; runs as the user, not as the agent's service account, where possible (OAuth on-behalf-of); and is rate-limited and logged on every call. The agent's tool catalogue is the union of these — not a free-for-all, but a curated set of capabilities, each with an owner.

This is the layer where the Model Context Protocol (MCP) earns its keep. MCP standardises how tools describe themselves to models, which means a tool written once is callable by any compliant model, and a tool can be revoked or updated without re-prompting. We'll spend more time on protocols in Chapter 7. For now, the rule: every tool the agent can call should be one you'd be comfortable handing to a contractor with the same scope.

Memory, in three flavours

Memory is the most miscast part of agents. People reach for "vector database" the way they used to reach for "Hadoop" — as a generic answer that doesn't always fit the question.

The taxonomy in plain language: short-term memory is the working scratchpad — what's in the context window for this run. Long-term memory is the agent's own remembered facts about a user, project, or domain — usually a mix of vectors (for fuzzy lookup) and structured records (for clean lookup). Episodic memory is the trace of past runs — useful for "have we done this before?" and for few-shot prompting from history. Semantic memory is the organisation's knowledge — policies, products, SOPs — owned by humans, versioned, refreshed on a schedule.

Most agent failures that look like model failures are actually memory failures: the agent reached into the wrong store, retrieved the wrong thing, and stitched it confidently into an answer. Chapter 6 walks through each of the four kinds in more detail. For now: pick the store on purpose, not by default.

Guardrails, plural

"We have guardrails" is one of the most overloaded phrases in agentic AI. In practice, guardrails are at least four separate things, and most teams confuse them.

Input guardrails screen what enters the agent: prompt-injection detection, PII detection, jailbreak attempts, scope checks ("is this question one we even answer?"). Output guardrails screen what leaves: PII redaction, profanity, regulated language ("guarantees", "free", in financial-services contexts), policy compliance. Tool-call guardrails intercept proposed tool calls and block ones that are out of scope, suspicious, or above the agent's authority limit. Behaviour guardrails watch the agent's own actions: did it loop more than N times? Has its cost exceeded a ceiling? Is it about to take an irreversible action? — and either intervene or hand off.

Tools like NeMo Guardrails, Guardrails AI, and Lakera implement parts of this. None of them is a substitute for designing the four layers explicitly. The cheap, high-yield baseline: a regex-and-policy filter on inputs, a PII scrubber on outputs, a deny-list on tool calls, and a per-run budget cap. That alone catches most production incidents.

Evals and observability

This is the part teams skip. They skip it because it produces no demo and consumes engineering time. They are punished for it later, and the punishment compounds.

Evals are the structured tests of your agent against a curated set of inputs with known good outputs. Run them on every prompt change, every model swap, every tool change, every memory update — the way you'd run unit tests on every code change. The eval suite must include: golden positive cases, golden negative cases, edge cases that have failed in production, and adversarial cases (prompt injection, jailbreaks, tool misuse). Vendors here include LangSmith, Braintrust, Arize, Patronus, Galileo. The platform matters less than the discipline.

Observability is what runs in production: traces of every run (prompt, tool calls, outputs, latency, cost), aggregate metrics (success rate, escalation rate, cost per task, time to resolution), and alerts on regressions. The trace is the single most useful artefact when something goes wrong — without it, debugging an agent is somewhere between guesswork and seance.

The bring-up checklist

Before letting an agent see a real user, walk this list:

  • System prompt versioned in source control, with an owner.
  • Tools registered, each with a typed schema, an owner, and a rate limit.
  • Memory stores chosen on purpose, with retention and refresh policies.
  • Input/output/tool/behaviour guardrails configured and tested.
  • Eval suite of at least a hundred labelled cases, covering happy path, edge, and adversarial.
  • Tracing on, going to a place a human can read and search.
  • A per-run cost ceiling and a global cost ceiling, both alerting.
  • An escalation path: when the agent can't or shouldn't act, what happens?
  • A rollback plan: if a regression appears, how do we revert in under an hour?
  • Named owner, named exec sponsor, both on the hook.

Ten items. None novel. Most missed. Each one missed is a place where the agent will pretend to work and then quietly fail — which is the failure mode that costs the most, because it doesn't show up until the customer is already on the phone.

Anatomy of an agent Six parts. Pull any one out and the agent stops working — or worse, pretends to. — the agent — MODEL + system prompt + role TOOLS scoped capabilities MEMORY short / long / episodic GUARDRAILS policy · PII · injection · jailbreak · scope EVALS + OBSERVABILITY traces · cost · golden sets · drift
Figure 4.1The six parts of an agent. The arrangement varies, the parts don't. Pull any one out and the system runs for a while, and then doesn't.