Chapter 20 · The Flywheel — Building an Agentic Enterprise

The single thing that separates agent programs that compound from programs that plateau is whether the deployment has a working flywheel. Most do not. This chapter describes what a working flywheel looks like, why most teams fail to build one, and what the production evidence says about the gain when the wheel actually spins.

Five components

A working data flywheel for agentic systems has five components. They are not optional. A program with four of the five does not compound; it stalls at whatever quality its build team produced on day one.

Structured logging. Every agent interaction logged with input, output, tool calls, metadata, and outcome — as structured, queryable data, not as text in a file. Without structure, no pipeline downstream can reason about what happened. Most teams instrument for debugging, which is a different artefact.
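As a rough illustration of what "structured, queryable" can mean in practice, the sketch below records one interaction as a typed record and serialises it to a single JSON line. The field names (`agent_input`, `prompt_version`, and so on), the tool name, and the model identifier are assumptions made for the example, not a schema prescribed by this chapter.

```python
import json
import uuid
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]
    result: Any
    latency_ms: float
    error: Optional[str] = None


@dataclass
class InteractionRecord:
    agent_input: str
    agent_output: str
    tool_calls: list[ToolCall]
    metadata: dict[str, Any]           # e.g. model version, prompt version
    outcome: Optional[str] = None      # filled in later, e.g. "resolved"
    interaction_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json_line(self) -> str:
        # One JSON object per line, so downstream jobs can parse and query it.
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical interaction, for illustration only.
record = InteractionRecord(
    agent_input="Where is my order?",
    agent_output="It shipped yesterday.",
    tool_calls=[
        ToolCall("lookup_order", {"order_id": "A-1"}, {"status": "shipped"}, 42.0)
    ],
    metadata={"model_version": "example-model-v1", "prompt_version": "p3"},
)
line = record.to_json_line()
```

The `interaction_id` is the important design choice here: it is the key that later feedback and outcome data can be joined against.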

Feedback signal capture. Both explicit feedback (thumbs up/down, ratings, escalation events) and implicit feedback (was the agent's recommendation followed? did the customer call back? was the ticket reopened?) captured and linked back to the originating interaction. The link matters: feedback that cannot be joined to the interaction it refers to is unusable.
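The join described above can be sketched in a few lines, assuming each interaction and each feedback event carries an `interaction_id` field (a hypothetical name). The sketch also surfaces the events that cannot be joined, since those are the ones the paragraph calls unusable.

```python
def join_feedback(interactions, feedback_events):
    """Attach feedback events to their originating interactions.

    Returns (joined, orphans): `joined` is a list of interactions, each with
    a "feedback" list; `orphans` holds events whose interaction_id is
    missing or unknown.
    """
    by_id = {i["interaction_id"]: {**i, "feedback": []} for i in interactions}
    orphans = []
    for event in feedback_events:
        target = by_id.get(event.get("interaction_id"))
        if target is None:
            orphans.append(event)
        else:
            target["feedback"].append(event)
    return list(by_id.values()), orphans


# Hypothetical data: one explicit and one implicit signal for i-1,
# plus a survey response that points at an interaction we never logged.
interactions = [
    {"interaction_id": "i-1", "output": "Reset link sent."},
    {"interaction_id": "i-2", "output": "Order shipped yesterday."},
]
feedback_events = [
    {"interaction_id": "i-1", "kind": "thumbs_down"},
    {"interaction_id": "i-1", "kind": "ticket_reopened"},
    {"interaction_id": "i-9", "kind": "csat", "score": 2},
]
joined, orphans = join_feedback(interactions, feedback_events)
```

Tracking the orphan count over time is one inexpensive way to notice that a feedback source has lost its link to the interaction log.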

Eval pipeline. Automated tests run against the golden set on every meaningful change, and on a schedule even when nothing changes (because external data sources change). Tools such as Arize Phoenix, Braintrust, and LangSmith are widely used for this work.
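Independent of any particular tool, the core of an eval run is small. The following is a minimal sketch, assuming a golden set of input and expected-substring pairs and a simple containment check as the grader; real graders are usually richer, and the `stub_agent` function stands in for a call to the actual agent.

```python
from typing import Callable

# Hypothetical golden set: each case names a substring the answer must contain.
GOLDEN_SET = [
    {"id": "g1", "input": "What is the refund window?", "must_contain": "30 days"},
    {"id": "g2", "input": "Do you ship to Canada?", "must_contain": "yes"},
]


def run_golden_set(agent: Callable[[str], str], golden_set) -> dict:
    """Run every golden case through the agent and summarise the results."""
    failures = []
    for case in golden_set:
        output = agent(case["input"])
        if case["must_contain"].lower() not in output.lower():
            failures.append(case["id"])
    total = len(golden_set)
    passed = total - len(failures)
    return {
        "total": total,
        "passed": passed,
        "failures": failures,
        "pass_rate": passed / total if total else 0.0,
    }


def stub_agent(prompt: str) -> str:
    # Stand-in for the real agent, so the sketch runs on its own.
    answers = {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "Do you ship to Canada?": "Sorry, I am not sure.",
    }
    return answers.get(prompt, "")


report = run_golden_set(stub_agent, GOLDEN_SET)
```

The same function can be called from a CI job on every change and from a scheduler when nothing has changed; only the trigger differs.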

Improvement cycle. Production data reviewed on a regular cadence; failure modes categorised; prompts improved; fine-tuning considered where warranted; knowledge base updated. The cadence is what makes this work — quarterly is too slow for most agents; weekly is realistic for a serious deployment.
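One way to make a weekly review concrete is to tally the failure modes that reviewers assigned and rank them, so the largest category gets attention first. The sketch below assumes each reviewed interaction carries a `failure_mode` label (or `None` when nothing went wrong); the category names are invented for the example.

```python
from collections import Counter


def rank_failure_modes(reviewed, top_n=3):
    """Return (mode, count, share_of_failures) for the most common modes."""
    counts = Counter(r["failure_mode"] for r in reviewed if r.get("failure_mode"))
    total = sum(counts.values())
    return [(mode, n, n / total) for mode, n in counts.most_common(top_n)]


# Hypothetical output of one week's review session.
reviewed = [
    {"id": 1, "failure_mode": "stale_knowledge"},
    {"id": 2, "failure_mode": "wrong_tool"},
    {"id": 3, "failure_mode": "stale_knowledge"},
    {"id": 4, "failure_mode": None},   # reviewed, no failure found
    {"id": 5, "failure_mode": "stale_knowledge"},
]
ranked = rank_failure_modes(reviewed)
```

In this invented sample, stale knowledge accounts for three of the four failures, which would point the week's work at the knowledge base rather than at the prompts.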

Deployment pipeline. Changes to agent configuration, prompts, or tools deployed through a tested CI/CD pipeline with canary rollouts. Without this, the improvement cycle stops at "we identified the problem" — which is a research function, not a flywheel.
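A canary rollout needs two pieces: a stable way to send a small share of traffic to the new configuration, and a gate that decides whether to promote it. The sketch below shows one possible shape for both. The thresholds (`min_samples`, `max_drop`) and the use of a session identifier as the routing key are assumptions for illustration, not recommendations from this chapter.

```python
import hashlib


def in_canary(session_id: str, canary_percent: int) -> bool:
    """Deterministically assign a session to the canary by hashing its id."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # 0..99
    return bucket < canary_percent


def should_promote(
    baseline_success: float,
    canary_success: float,
    canary_samples: int,
    min_samples: int = 200,     # hypothetical threshold
    max_drop: float = 0.02,     # hypothetical tolerated drop in success rate
) -> bool:
    """Promote only with enough canary traffic and no meaningful regression."""
    if canary_samples < min_samples:
        return False
    return canary_success >= baseline_success - max_drop
```

Hashing the session id rather than picking at random keeps a given session on the same variant for its whole lifetime, which makes the logged outcomes attributable to one configuration.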

Why most flywheels don't spin

The MIT NANDA report's diagnosis of the "learning gap" is precisely the absence of these five components. The recurring failure modes are predictable:

Logging is an afterthought. Teams instrument for debugging, not learning. The logs capture errors. They miss the positive signal needed to know what "good" looks like. Without that, drift is invisible.

No eval baseline at launch. Without a pre-deployment baseline, the team cannot tell whether the agent is improving or degrading. They can only tell whether it broke loudly.
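A baseline only helps if something compares against it. As a minimal sketch, assuming metrics where higher is better and a small tolerance for noise (both assumptions), a regression check can be as simple as:

```python
def compare_to_baseline(baseline: dict, current: dict, tolerance: float = 0.01) -> dict:
    """Return metrics that fell below the baseline by more than `tolerance`.

    Assumes every metric is "higher is better". A metric that is missing
    from `current` is reported as a regression with a value of None.
    """
    regressions = {}
    for metric, base_value in baseline.items():
        value = current.get(metric)
        if value is None or value < base_value - tolerance:
            regressions[metric] = (base_value, value)
    return regressions


# Hypothetical numbers: the baseline captured before launch, and this week's run.
baseline = {"pass_rate": 0.90, "helpfulness": 0.80}
current = {"pass_rate": 0.91, "helpfulness": 0.70}
regressions = compare_to_baseline(baseline, current)
```

With these invented numbers the pass rate has improved while helpfulness has quietly degraded, which is exactly the case a team without a baseline cannot see.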

Feedback loops are not designed. Users have no mechanism to tell the system when it is wrong. Escalation data exists somewhere — in CSAT surveys, in re-opened tickets, in support handoffs — but is not joined back to the agent interactions that caused them.

Static knowledge bases. The RAG knowledge base is loaded once at deployment and never updated. The agent's world knowledge becomes stale at the rate the business changes. After eighteen months, the agent is confidently wrong about things that have changed underneath it.
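Staleness can be made visible with very little machinery. The sketch below assumes each knowledge-base document records a `last_reviewed` date (a hypothetical field) and flags anything older than a chosen review window; the 90-day default is an arbitrary example value.

```python
from datetime import date, timedelta


def stale_documents(documents, today: date, max_age_days: int = 90):
    """Return ids of documents not reviewed within the last `max_age_days`."""
    cutoff = today - timedelta(days=max_age_days)
    return [d["doc_id"] for d in documents if d["last_reviewed"] < cutoff]


# Hypothetical knowledge-base index.
documents = [
    {"doc_id": "returns-policy", "last_reviewed": date(2024, 12, 1)},
    {"doc_id": "pricing-tiers", "last_reviewed": date(2023, 6, 1)},
]
stale = stale_documents(documents, today=date(2025, 1, 1))
```

The resulting list is a natural input to the curation queue: documents that are both stale and frequently retrieved are the ones most likely to be producing confident wrong answers.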

Provider model updates break things. Provider model version changes can silently degrade agent behaviour. Without continuous evals in CI/CD, model drift is detected by user complaints, weeks after the change.
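If every model call is traced with the model version the provider reports, a version change can be detected from the logs on the day it happens rather than from complaints weeks later. A minimal sketch, assuming trace records with `timestamp` (ISO 8601 strings in a single consistent format) and `model_version` fields; the version strings are invented:

```python
def detect_version_changes(traces):
    """Scan traces in time order and report each change of model version."""
    changes = []
    previous = None
    for trace in sorted(traces, key=lambda t: t["timestamp"]):
        version = trace["model_version"]
        if previous is not None and version != previous:
            changes.append(
                {"at": trace["timestamp"], "from": previous, "to": version}
            )
        previous = version
    return changes


# Hypothetical traces, deliberately out of order.
traces = [
    {"timestamp": "2025-03-02T09:00:00Z", "model_version": "example-model-2025-02"},
    {"timestamp": "2025-03-01T10:00:00Z", "model_version": "example-model-2025-01"},
    {"timestamp": "2025-03-01T18:00:00Z", "model_version": "example-model-2025-01"},
]
changes = detect_version_changes(traces)
```

A detected change is a reasonable trigger for an immediate run of the eval suite, in addition to the scheduled runs.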

Organisational friction. The team that built the agent is different from the team that runs the business process. Production feedback never reaches the development team, so the wheel does not turn even when all the parts are in place.

The telemetry stack

A minimum viable telemetry stack for a production agent has six layers.

LLM call tracing. Every model call with prompt, response, tokens, latency, and model version (LangSmith, Arize, Langfuse).

Tool invocation. Tool called, arguments, return values, latency, and errors (custom plus the observability platform).

Agent-level metrics. Task completion rate, escalation rate, time-to-completion, and confidence scores (the BI stack).

User satisfaction. CSAT, NPS, repeat-contact rate, and explicit ratings (survey + CRM integration).

Business outcome. Was the action correct, audited on a sample (human review pipeline).

Cost tracking. Tokens per task, total cost per session, and cost per outcome.
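The cost-tracking layer is the easiest of the six to sketch, because it is arithmetic over data the tracing layer already holds. The example below assumes hypothetical per-1,000-token prices and a per-session `success` flag supplied by the outcome audit; neither the prices nor the field names come from a real provider.

```python
# Hypothetical prices in currency units per 1,000 tokens.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}


def call_cost(input_tokens: int, output_tokens: int) -> float:
    return (
        input_tokens / 1000 * PRICE_PER_1K["input"]
        + output_tokens / 1000 * PRICE_PER_1K["output"]
    )


def cost_summary(sessions) -> dict:
    """Compute cost per session, total cost, and cost per successful outcome."""
    per_session = {
        s["session_id"]: sum(
            call_cost(c["input_tokens"], c["output_tokens"]) for c in s["calls"]
        )
        for s in sessions
    }
    total = sum(per_session.values())
    successes = sum(1 for s in sessions if s["success"])
    return {
        "per_session": per_session,
        "total": total,
        # Failed sessions still cost money, so they are included in the total.
        "cost_per_success": total / successes if successes else None,
    }


sessions = [
    {"session_id": "A", "success": True,
     "calls": [{"input_tokens": 2000, "output_tokens": 1000}]},
    {"session_id": "B", "success": False,
     "calls": [{"input_tokens": 1000, "output_tokens": 0}]},
]
summary = cost_summary(sessions)
```

Dividing total spend by successful outcomes, rather than by sessions, is the design choice that matters: it makes the cost of failed attempts visible in the number the business actually cares about.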

The telemetry stack is infrastructure. It is also the artefact most often shorted in the build budget because it does not visibly produce features. The teams that ship telemetry as a phase-zero deliverable, before the first agent, are the teams whose agents compound.

Evidence from production

Recent research on Agent-in-the-Loop (AITL) demonstrates the mechanism quantitatively. Integrating four types of in-production annotations — pairwise response preferences, agent adoption signals, knowledge relevance checks, and missing-knowledge identification — directly into live operations reduced retraining cycles from months to weeks. Production results: +11.7% recall, +14.8% precision, +8.4% helpfulness, +4.5% adoption rate. Not from a new model. Not from a new framework. Just from closing the feedback loop faster.

Flywheel note

The flywheel is the answer to the question every idea in Parts I, II, and III quietly returns to: how does this get better next quarter? The Map artefact gets better because the eval suite produces a curation queue. The eval suite gets better because the production telemetry produces edge cases. The HITL design gets better because human refusals become golden-set entries. None of those compounding effects exists without structured logging, feedback capture, eval pipeline, improvement cycle, and deployment CI/CD. Skip any one of them and the program plateaus exactly where the build team's instincts run out.

The final chapter is the operational version of all of this — a 30-60-90 plan that puts the work into a sequence a real team can run.

Figure 20.1. The data flywheel for an agent program: log → capture → evaluate → improve → deploy → log. Skip any node and the wheel does not turn.