The difference between a language model and an agent is, in one narrow but important sense, a list of functions. Give a model the ability to call a search API and it can retrieve current information. Give it a code interpreter and it can run computations. Give it write access to a CRM and it can update records. Each tool added to an agent's catalog extends its reach into the world, and each extension creates both new capability and new exposure. Understanding how tool use works mechanically, what grounding means in practice, and where the limits of the context window create irreducible failure modes is essential for anyone responsible for designing or governing an agentic system.
How Tool Calls Work
Modern language models support tool use through a mechanism commonly called function calling. The model is provided, at inference time, with a catalog of available tools described in a structured schema — typically a JSON specification of the function name, its parameters, and their types and descriptions. When the model determines that a tool call is appropriate, it generates a structured output specifying the tool name and parameter values. The host application executes the call, captures the result, and feeds it back into the model's context for the next inference step.
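The exchange described above can be sketched in a few lines. This is a minimal illustration, not any vendor's API: the catalog shape, the get_exchange_rate tool, its hard-coded rate, and the execute_tool_call helper are all hypothetical, and the model's output is simulated as a plain string.

```python
import json

# Hypothetical tool catalog. The shape mirrors common function-calling
# formats but is not tied to any specific vendor API.
TOOL_CATALOG = [
    {
        "name": "get_exchange_rate",
        "description": "Return the exchange rate between two currencies.",
        "parameters": {
            "type": "object",
            "properties": {
                "base": {"type": "string", "description": "ISO 4217 code, e.g. USD"},
                "quote": {"type": "string", "description": "ISO 4217 code, e.g. EUR"},
            },
            "required": ["base", "quote"],
        },
    }
]


def get_exchange_rate(base: str, quote: str) -> dict:
    """Stand-in implementation; a real tool would call an external service."""
    rates = {("USD", "EUR"): 0.9}  # made-up value for illustration
    return {"base": base, "quote": quote, "rate": rates[(base, quote)]}


# The host application, not the model, maps tool names to executable code.
TOOL_IMPLEMENTATIONS = {"get_exchange_rate": get_exchange_rate}


def execute_tool_call(raw_call: str) -> dict:
    """Parse the model's structured output, run the tool, and build the
    message that is appended to the context for the next inference step."""
    call = json.loads(raw_call)
    name = call["name"]
    if name not in TOOL_IMPLEMENTATIONS:
        return {"role": "tool", "name": name,
                "content": json.dumps({"error": "unknown tool"})}
    result = TOOL_IMPLEMENTATIONS[name](**call["arguments"])
    return {"role": "tool", "name": name, "content": json.dumps(result)}


# Simulated model output: the tool call is just a structured string.
model_output = '{"name": "get_exchange_rate", "arguments": {"base": "USD", "quote": "EUR"}}'
tool_message = execute_tool_call(model_output)
```

Note that the model only ever sees TOOL_CATALOG and tool_message; everything between the two happens in application code.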
This mechanism is deceptively simple. The tool call itself is just a structured string; execution happens in application code. But that simplicity also marks a significant security boundary: the model cannot verify that the tool it is calling will do what its schema claims, cannot verify that the result returned to it is authentic, and cannot detect whether the environment in which it is operating has been modified to manipulate its behavior. Trust flows one way: the model has to take the application's tool descriptions and results on trust, and it has no independent verification capability.
The Model Context Protocol standardizes this exchange, providing a consistent interface between model hosts and tool servers. An MCP server exposes a set of tools through a defined protocol; an MCP client (the agent's host) discovers and invokes those tools. The protocol handles the transport and serialization details, but the trust model — which MCP servers an agent is allowed to contact, what authentication is required, what audit logging is performed — remains the application developer's responsibility. The emergence of MCP has standardized the interoperability layer; it has not standardized the governance layer.
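One way to picture the governance layer that the protocol leaves open is a host-side gate in front of every tool invocation. The sketch below does not use any MCP library; ServerPolicy, ToolGateway, the server names, and the URL are hypothetical, and a production system would check real credentials rather than a boolean.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ServerPolicy:
    url: str                    # hypothetical server address
    requires_auth: bool = True  # whether calls need credentials


@dataclass
class ToolGateway:
    """Host-side policy check applied before a tool server is contacted."""
    allowed_servers: dict  # server name -> ServerPolicy
    audit_log: list = field(default_factory=list)

    def authorize(self, server: str, tool: str, has_credentials: bool) -> bool:
        policy = self.allowed_servers.get(server)
        permitted = policy is not None and (
            has_credentials or not policy.requires_auth
        )
        # Every decision is logged, including refusals.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "server": server,
            "tool": tool,
            "permitted": permitted,
        })
        return permitted


gateway = ToolGateway(allowed_servers={
    "hr-docs": ServerPolicy(url="https://mcp.example.internal/hr"),
})
ok = gateway.authorize("hr-docs", "search_policies", has_credentials=True)
blocked = gateway.authorize("unknown-server", "run_shell", has_credentials=True)
```

The allowlist, the authentication requirement, and the audit log correspond to the three responsibilities named above; none of them comes from the protocol itself.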
The Grounding Problem
Grounding is the process of anchoring a model's outputs in verifiable, current, specific information rather than in parametric knowledge baked in during training. It matters because training data has a cutoff date, because it may not include proprietary or domain-specific information, and because even information that was accurate at training time may be stale at inference time. An agent that answers from parametric memory alone — without retrieving current information — is confident but potentially wrong, in ways that are difficult to detect from the quality of its prose.
The dominant grounding architecture in enterprise settings is retrieval-augmented generation (RAG): at inference time, the agent retrieves relevant documents from a knowledge base, inserts them into the context window, and generates its response against that retrieved context. RAG systems work well when retrieval quality is high: when the right documents are found, are returned with high fidelity, and accurately represent current organizational knowledge. They fail in characteristic ways when any of these conditions is not met.
Retrieval failure is more common than it should be, for reasons that are primarily organizational rather than technical. Knowledge bases are often maintained with less discipline than databases: documents are added without version control, updated without invalidating the old version, or archived in formats that retrieval systems handle poorly. The agent that answers a question about the current benefits enrollment process from a document that was superseded fourteen months ago is not malfunctioning; it is functioning correctly with stale inputs. The failure is in the knowledge management system, not the model — but the failure is attributed to AI, and the trust damage accrues accordingly.
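A retrieval system can only exclude superseded material if the knowledge base records that it has been superseded. The following sketch assumes each document carries two pieces of metadata, a last_reviewed date and a superseded_by pointer; both fields, the KnowledgeDoc type, and the 365-day threshold are illustrative assumptions rather than features of any particular product.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class KnowledgeDoc:
    doc_id: str
    title: str
    last_reviewed: date
    superseded_by: Optional[str] = None  # id of the replacing document, if any


def eligible_for_retrieval(doc: KnowledgeDoc, today: date,
                           max_age_days: int = 365) -> bool:
    """Exclude documents that were replaced or not reviewed recently."""
    if doc.superseded_by is not None:
        return False
    return (today - doc.last_reviewed).days <= max_age_days


today = date(2025, 6, 1)
docs = [
    KnowledgeDoc("b1", "Benefits enrollment 2024", date(2024, 4, 1), superseded_by="b2"),
    KnowledgeDoc("b2", "Benefits enrollment 2025", date(2025, 1, 15)),
    KnowledgeDoc("t1", "Travel policy", date(2023, 1, 1)),
]
retrievable = [d.doc_id for d in docs if eligible_for_retrieval(d, today)]
```

The filter is trivial; the hard part is the organizational discipline of keeping superseded_by and last_reviewed accurate.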
Retrieval Architectures
The engineering of RAG systems has matured substantially since the pattern was formalized in 2020. Early implementations used single-stage vector similarity search: embed the query, retrieve the most similar document chunks, and insert them into context. This works for many use cases but has well-documented failure modes: semantic similarity does not always align with relevance, documents that are important but not lexically or semantically close to the query may be missed, and retrieved chunks that lack surrounding context may be misinterpreted.
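The single-stage pattern is easy to sketch. In the example below the embed function is a toy word-count vector standing in for a learned embedding model, so it captures word overlap rather than meaning; the function names and the sample chunks are assumptions made for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list, k: int = 2) -> list:
    """Embed the query, rank chunks by similarity, return the top k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "Open enrollment for benefits runs from November 1 to November 15.",
    "The cafeteria menu changes every Monday.",
    "Expense reports are due within 30 days of travel.",
]
top = retrieve("when is benefits enrollment", chunks, k=1)
```

The returned chunk carries no surrounding context, which is exactly the condition under which the misinterpretation described above can occur.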
More sophisticated architectures use hybrid retrieval (combining vector search with lexical BM25 search), multi-stage retrieval (coarse retrieval followed by a reranking step), and structured knowledge sources (knowledge graphs, databases) alongside unstructured document stores. The choice of architecture should be driven by the information structure and the failure modes most important to avoid in the specific use case — there is no universally superior retrieval architecture, and the overhead of more complex systems is only justified when the simpler approach demonstrably fails.
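The text does not say how the vector and lexical result lists should be combined. One widely used option is reciprocal rank fusion, shown here as an assumed choice rather than a recommendation; the document ids are placeholders, and k = 60 is a conventional default for the smoothing constant.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Merge several ranked lists of document ids into one.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked well by more than one retriever rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id for a stable result.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_ranking = ["doc_a", "doc_b", "doc_c"]   # from embedding similarity
lexical_ranking = ["doc_b", "doc_d", "doc_a"]  # from a BM25-style search
fused = reciprocal_rank_fusion([vector_ranking, lexical_ranking])
# fused == ["doc_b", "doc_a", "doc_d", "doc_c"]
```

Because the fusion uses only ranks, it does not need the two retrievers' scores to be on comparable scales. A reranking stage, where used, would operate on the fused list.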
Beyond retrieval, grounding can be achieved through direct tool access: a code interpreter can compute exact figures, a database tool can return exact current records, a web search tool can retrieve current public information. Direct tool access has higher fidelity than RAG for specific factual queries, but it also carries higher cost (in latency and token spend) and a larger security surface. The combination of RAG for background context and direct tool access for specific factual queries represents the current best-practice pattern for enterprise agents that need both breadth and precision.
Context Window Limits
The context window is the agent's working memory — everything it can attend to in a single inference step. Large language models in 2025 support context windows of 128,000 to over one million tokens, depending on the model. This sounds generous, and for many use cases it is. But context windows impose costs and constraints that are easy to underestimate in design but consequential in production.
The first constraint is attention quality. Research has consistently shown that model performance degrades for information positioned in the middle of a very long context (the "lost in the middle" phenomenon), and that the most reliable positions for critical information are the beginning and end of the context. For agents that are operating with long context windows filled with retrieved documents, this means that the information the agent most needs may not be the information it attends to most reliably — a non-intuitive failure mode that requires careful prompt engineering and retrieval ordering to mitigate.
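One simple form of retrieval ordering is to interleave the ranked results so that the strongest land at the two ends of the context and the weakest in the middle. The helper below is a hypothetical sketch of that idea; it assumes the input list is already sorted from most to least relevant.

```python
def order_for_long_context(docs_by_relevance: list) -> list:
    """Place the most relevant documents at the start and end of the
    context, pushing the least relevant toward the middle."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        # Alternate: rank 1 to the front, rank 2 to the back, and so on.
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]


ranked = ["rank1", "rank2", "rank3", "rank4", "rank5"]
ordered = order_for_long_context(ranked)
# ordered == ["rank1", "rank3", "rank5", "rank4", "rank2"]
```

Whether this ordering helps for a given model is an empirical question that should be tested against the agent's own workload.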
The second constraint is cost. Large context windows are expensive to process. Because input is typically billed per token, a single inference call with a context window of 100,000 tokens costs on the order of a hundred times as much as one with 1,000 tokens. For agents running multi-step loops over many iterations, the token cost of maintaining a large context through each step can overwhelm the economics of the use case. Effective agentic system design includes explicit strategies for context management: summarization of prior steps, selective retention of critical information, and periodic "context compaction" that preserves the essential state without preserving every intermediate step in full detail.
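The arithmetic and the compaction idea can both be made concrete. The price below is a hypothetical figure chosen for round numbers, not a quoted rate, and the default summarizer is a placeholder; a real system would typically call a model to produce the summary.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # hypothetical price in dollars


def input_cost(tokens: int, price: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Cost of sending `tokens` input tokens in one call."""
    return tokens / 1_000_000 * price


def loop_cost(base_tokens: int, tokens_per_step: int, steps: int) -> float:
    """Total input cost when every step re-sends the full, growing context."""
    total = sum(base_tokens + i * tokens_per_step for i in range(steps))
    return input_cost(total)


def compact(history: list, keep_last: int = 2,
            summarize=lambda steps: f"[summary of {len(steps)} earlier steps]") -> list:
    """Replace all but the most recent steps with a single summary entry."""
    if len(history) <= keep_last:
        return list(history)
    split = len(history) - keep_last
    return [summarize(history[:split])] + history[split:]


ratio = input_cost(100_000) / input_cost(1_000)     # about 100
twenty_steps = loop_cost(10_000, 5_000, steps=20)   # 1,150,000 tokens in total
compacted = compact(["s1", "s2", "s3", "s4", "s5"])
# compacted == ["[summary of 3 earlier steps]", "s4", "s5"]
```

Under these assumed numbers, a 20-step loop that adds 5,000 tokens per step sends 1.15 million input tokens in total, more than five times what 20 calls at the 10,000-token starting size would send, which is why compaction matters more as loops get longer.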
The Limits of Tool Trust
An agent is only as trustworthy as its tools. A model that reasons impeccably but is connected to a tool that returns incorrect data — whether through malfunction, misconfiguration, or deliberate manipulation — will reason correctly to an incorrect conclusion. This is the fundamental tension in agentic system design: the agent's value depends on its ability to act on tool outputs, but its safety depends on skepticism about those same outputs.
Current models do not independently verify tool outputs. They process them as authoritative by default. This is operationally convenient — verified tool calls would require a separate verification infrastructure that does not currently exist — but it creates an irreducible trust gap. Mitigations include output validation layers (post-processing tool results through rule-based checks before feeding them to the model), tool provenance logging (recording not just what a tool returned but from which server and at what time), and explicit uncertainty signals in the tool result format (so the model can distinguish between high-confidence and low-confidence responses from the same tool). None of these mitigations eliminates the trust gap; they manage it.
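The three mitigations can sit together in one thin wrapper between the tool and the model. The sketch below is illustrative only: ToolResult, validate_rate, wrap_result, the plausibility range, and the two-level confidence label are all assumptions, and the rule shown is specific to a hypothetical exchange-rate tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ToolResult:
    tool: str
    server: str
    payload: dict
    confidence: str    # "high" or "low": the uncertainty signal shown to the model
    retrieved_at: str  # ISO 8601 timestamp, part of the provenance record


def validate_rate(payload: dict) -> list:
    """Rule-based checks; returns a list of problems (empty means pass)."""
    rate = payload.get("rate")
    if isinstance(rate, bool) or not isinstance(rate, (int, float)):
        return ["rate is missing or not numeric"]
    if not 0 < rate < 1000:  # assumed plausibility bounds
        return ["rate outside plausible range"]
    return []


def wrap_result(tool: str, server: str, payload: dict, provenance_log: list) -> ToolResult:
    """Validate a raw tool result, log its provenance, and attach a
    confidence label before it is fed back into the model's context."""
    problems = validate_rate(payload)
    result = ToolResult(
        tool=tool,
        server=server,
        payload=payload,
        confidence="low" if problems else "high",
        retrieved_at=datetime.now(timezone.utc).isoformat(),
    )
    provenance_log.append({
        "tool": tool, "server": server,
        "time": result.retrieved_at, "problems": problems,
    })
    return result


log = []
good = wrap_result("get_exchange_rate", "fx-server", {"rate": 0.9}, log)
bad = wrap_result("get_exchange_rate", "fx-server", {"rate": -4}, log)
```

A passing check shows only that the value is plausible, not that it is true, which is the sense in which these layers manage the trust gap rather than close it.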
"Grounding is not a feature you add to an agent after the fact. It is an architecture you design from the start. The agents that have surprised their operators most unpleasantly are almost always the ones where someone assumed the model would 'figure out' where to get current information."