Chapter 27 · Data Readiness — The Agentic Enterprise

An agent is only as reliable as the data it operates on. This is true of all data-dependent systems, but it is more acutely true of agents because agents act on their data; they do not merely analyze it or report on it. An agent that consults stale inventory data may place an order for parts that are no longer available; an agent that reads a customer record with incorrect contact information may communicate with the wrong person; an agent that retrieves a document without knowing its classification may include restricted content in a response that reaches an unauthorized recipient. Data readiness (the state of data being accurate, current, accessible, and properly classified for the agent's use) is therefore not merely a precondition for data science work; it is a precondition for autonomous action, and the consequences of inadequate data readiness are correspondingly more serious.

The Four Dimensions of Data Readiness

Data readiness for agentic deployment has four dimensions, all of which must be addressed: accuracy, freshness, accessibility, and classification.

Accuracy: the data must correctly represent the entities and states it purports to describe. For agentic use cases, accuracy requirements are often more stringent than for analytical use cases, because an analyst can identify suspicious data and exclude it from the analysis, while an agent will generally act on whatever data it retrieves.

Freshness: the data must be current enough for the decisions the agent is making. An agent performing risk assessments needs data that reflects the current risk state; one performing customer outreach needs data that reflects the customer's current relationship status; one performing compliance monitoring needs data that reflects the current regulatory environment. The freshness requirement varies dramatically by use case and must be assessed explicitly for each deployment.

Accessibility: the data must be reachable by the agent through the integration spine, in a format and at a speed appropriate for the agent's operational requirements. An agent that needs to query a database in real time cannot work with data that is only available in a batch export updated nightly; an agent that needs to retrieve documents from multiple systems cannot work with data that requires manual access requests. Accessibility therefore has two parts: technical connectivity (the integration exists and works) and latency (the data can be retrieved quickly enough for the task at hand).

Classification: every piece of data the agent can access must carry accurate metadata about its sensitivity level, its permissible uses, and its distribution restrictions. Without this classification metadata, the agent's output-handling logic cannot determine whether a given response is appropriate to deliver to a given recipient, and the data governance framework's controls become unenforceable. Classification is the most frequently underdeveloped dimension of data readiness, because it requires sustained human effort to assign and maintain metadata across large, heterogeneous data estates.
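
The dependence of output handling on classification metadata can be made concrete with a small sketch. Everything named here (the Sensitivity scale, the ClassifiedItem record, the may_deliver function, and the use labels) is a hypothetical illustration rather than part of any particular governance product. The one design choice worth noting is that unclassified data fails closed: if the metadata is missing, the check cannot approve delivery.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import FrozenSet, Optional


class Sensitivity(IntEnum):
    """Hypothetical sensitivity scale; higher values are more restricted."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


@dataclass(frozen=True)
class ClassifiedItem:
    """A piece of data plus the classification metadata attached to it."""
    item_id: str
    sensitivity: Optional[Sensitivity]  # None means the item was never classified
    permitted_uses: FrozenSet[str] = frozenset()


def may_deliver(item: ClassifiedItem, recipient_clearance: Sensitivity, use: str) -> bool:
    """Return True only if the metadata positively permits this delivery."""
    if item.sensitivity is None:
        return False  # fail closed: without metadata the control is unenforceable
    if use not in item.permitted_uses:
        return False  # the requested use is not among the permitted uses
    return item.sensitivity <= recipient_clearance
```

In this sketch a record with no sensitivity label is rejected regardless of who the recipient is, which is the behavior that makes gaps in classification visible as blocked actions rather than silent leaks.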

Lineage: Knowing Where Data Came From

Data lineage (the provenance record that tracks where a piece of data came from, how it has been transformed, and what it has been used for) is valuable for conventional analytics but essential for agentic AI. When an agent acts on data, responsibility for the quality of the action does not stop at the agent; it runs upstream to the data itself. If the agent made a wrong decision because it consulted incorrect data, determining who is responsible for the error requires a lineage record that can be followed from the agent's decision back to the data source that supplied the incorrect value.
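
A minimal sketch of that upstream trace follows. It assumes a deliberately simplified model in which each lineage record has at most one parent; real lineage graphs usually have several inputs per transformation. The LineageRecord fields, the trace_to_source function, and the record identifiers in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LineageRecord:
    record_id: str
    description: str
    owner: str
    derived_from: Optional[str] = None  # upstream record id; None marks an original source


def trace_to_source(records: Dict[str, LineageRecord], start_id: str) -> List[LineageRecord]:
    """Walk upstream from start_id and return the chain, ending at the original source."""
    chain: List[LineageRecord] = []
    seen = set()
    current: Optional[str] = start_id
    while current is not None:
        if current in seen:
            raise ValueError(f"lineage cycle detected at {current!r}")
        if current not in records:
            # A missing link means responsibility cannot be traced any further.
            raise KeyError(f"lineage gap: no record for {current!r}")
        seen.add(current)
        record = records[current]
        chain.append(record)
        current = record.derived_from
    return chain
```

Given records for an agent decision, the transformed feature it read, and the system of record behind that feature, trace_to_source(records, "decision-1") returns the three records in upstream order, and the owner of the last one identifies who is accountable for the source value. A gap in the chain raises an error instead of returning a partial answer, since a partial trace can look complete to whoever reads it.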

Lineage also matters for regulatory compliance. The EU AI Act's data governance requirements for high-risk AI systems (Article 10) include obligations to document the data's characteristics, its provenance, and any known data quality issues. Meeting these obligations requires a lineage system that can produce this documentation on demand: not a manual survey assembled after the fact, but a continuously maintained record that reflects the actual provenance of the data the agent is using in production. Several data catalog tools, such as Alation, Collibra, and DataHub, now offer lineage tracking that is compatible with agentic data access patterns, though all require integration effort to capture the agent's retrieval activity in the lineage record.

Freshness and the Staleness Risk

Staleness is the data quality failure mode that is most specific to agentic systems, because agents act on data in real time and the consequences of acting on stale data are immediate. An analytical system that uses stale data produces an incorrect analysis that is eventually corrected; an agent that uses stale data takes an incorrect action that may have already caused harm by the time the staleness is detected. The staleness risk varies enormously by data type: financial prices can become stale in milliseconds; customer contact information can become stale in weeks; regulatory guidance can become stale in months. The agent's data architecture must match the data refresh cadence to the staleness tolerance of the use case.

Implementing freshness controls requires instrumenting the data layer: every data retrieval should return not just the data value but the timestamp of its last verified update, and the agent's policy engine should be configured to flag or block actions that depend on data that is older than the staleness threshold for that use case. This sounds straightforward, but many enterprise data systems do not expose freshness metadata through their APIs, which means the integration work for freshness control often involves modifying or replacing data access patterns that were designed for human analytical use rather than for real-time agentic action.
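
The control described above can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the Retrieved wrapper, the freshness_decision function, the use-case names, and the threshold values are all hypothetical, and timestamps are assumed to be timezone-aware UTC values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class Retrieved:
    """A retrieved value together with the time it was last verified."""
    value: Any
    last_verified: Optional[datetime]  # None if the source exposes no freshness metadata


# Hypothetical staleness thresholds per use case; real values need explicit assessment.
STALENESS_THRESHOLDS = {
    "price_quote": timedelta(milliseconds=50),
    "customer_outreach": timedelta(days=14),
    "regulatory_guidance": timedelta(days=90),
}


def freshness_decision(item: Retrieved, use_case: str, now: Optional[datetime] = None) -> str:
    """Return "allow" or "block" for an action that depends on item."""
    now = now or datetime.now(timezone.utc)
    if item.last_verified is None:
        return "block"  # age is unknown, so it cannot be shown to be within tolerance
    age = now - item.last_verified
    return "allow" if age <= STALENESS_THRESHOLDS[use_case] else "block"
```

A deployment that prefers to flag rather than block could return a third outcome that routes the action to human review; the sketch keeps two outcomes for brevity. Note that a source with no freshness metadata is treated the same as stale data, which is what turns the missing-metadata problem into visible integration work.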

"The data quality that was acceptable for quarterly business reviews is not the data quality required for continuous autonomous action. The upgrade from 'good enough for dashboards' to 'good enough for agents' is frequently the most expensive part of an agentic readiness program."

RAG and the Retrieval Quality Problem

Retrieval-augmented generation (RAG) — the pattern of augmenting an agent's context with documents retrieved from an external knowledge base — is the standard approach for giving agents access to proprietary organizational knowledge without incorporating that knowledge into model weights. The quality of a RAG system is determined primarily by the quality of its retrieval: a model that would answer correctly given the right context will answer incorrectly if it receives irrelevant or misleading retrieval results. The retrieval quality problem is, in practice, a data readiness problem: it is caused by documents that are outdated, inconsistently structured, poorly indexed, or not classified in a way that makes their content discoverable by semantic search.

Improving retrieval quality for agentic RAG systems requires investment in the document corpus: ensuring that documents are current, that outdated versions are deprecated and removed from the retrieval index, and that the indexing pipeline captures the semantic structure of the documents accurately. It also requires investment in retrieval evaluation: measuring retrieval quality with metrics such as precision at k and mean reciprocal rank, and using those metrics to tune the retrieval configuration rather than relying on end-to-end agent evaluation as a proxy for retrieval performance.
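
Both metrics are simple enough to compute directly from a labeled evaluation set. The sketch below assumes each evaluation query supplies a ranked list of retrieved document ids and a set of ids judged relevant; the function names and the document ids in the usage note are illustrative.

```python
from typing import Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved ids that are relevant (the denominator is k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if no relevant result was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average the reciprocal rank over (retrieved, relevant) pairs, one pair per query."""
    if not runs:
        raise ValueError("at least one query is required")
    return sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs) / len(runs)
```

For example, a query that returns ["d3", "d1", "d9"] when only "d1" is relevant has a precision at 3 of one third and a reciprocal rank of one half. Tracking these numbers per query, rather than only the agent's final answer quality, shows whether a regression came from retrieval or from the model's use of what was retrieved.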