Chapter 19 · ROI: The Cost Stack and the Honest Arithmetic

The most common form of bad ROI for agentic AI is not over-optimism about value. It is under-counting of cost. Most analyses we see omit between forty and seventy percent of the actual stack. This chapter supplies the missing pieces.

The honest cost stack

A complete TCO for a production agent has ten layers, in rough order of share. Model inference is the visible cost: input plus output tokens at API pricing, typically 30–50% of total cost of ownership. Tool-call overhead is the cost few people model: each tool call carries both a token cost (packaging the request and parsing the result) and an external API cost. Vector database for retrieval: $50–300/month for a 10K-document knowledge base, more at scale. Orchestration platform: LangSmith, Arize, or managed orchestration, typically $500–5,000/month. Eval and observability: $500–3,000/month for non-trivial deployments. Guardrails (NeMo, Lakera, custom): $500–5,000/month and roughly 0.5s of added latency. Engineering build: the one-time cost, typically 3–12 FTE-months for a serious agent. Engineering run: 0.5–2 FTE on an ongoing basis to maintain, evaluate, and iterate. Change management: training, process redesign, workflow integration — almost always under-estimated. Governance and compliance: $0.5–5M total for a high-risk deployment in EU AI Act scope.
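The stack above can be turned into a rough monthly range. The following is a minimal sketch, not a pricing model: the vendor ranges come from the paragraph above, while the inference spend, the fully-loaded engineer-month cost (FTE_MONTH), and the amortization period (AMORT_MONTHS) are hypothetical placeholders to be replaced with your own figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostLayer:
    name: str
    low: float   # USD per month, optimistic end of the range
    high: float  # USD per month, pessimistic end of the range


# Hypothetical assumptions, not taken from the chapter:
FTE_MONTH = 15_000   # fully-loaded cost of one engineer-month, USD
AMORT_MONTHS = 36    # period over which the one-time build is spread

LAYERS = [
    CostLayer("model inference", 8_000, 20_000),  # hypothetical spend
    CostLayer("vector database", 50, 300),
    CostLayer("orchestration platform", 500, 5_000),
    CostLayer("eval and observability", 500, 3_000),
    CostLayer("guardrails", 500, 5_000),
    # One-time build of 3-12 FTE-months, spread over AMORT_MONTHS.
    CostLayer("engineering build (amortized)",
              3 * FTE_MONTH / AMORT_MONTHS,
              12 * FTE_MONTH / AMORT_MONTHS),
    # Ongoing 0.5-2 FTE.
    CostLayer("engineering run", 0.5 * FTE_MONTH, 2 * FTE_MONTH),
]


def monthly_total(layers, bound):
    """Sum one end of the range ('low' or 'high') across all layers."""
    return sum(getattr(layer, bound) for layer in layers)


def inference_share(layers, bound):
    """Fraction of the monthly total attributable to model inference."""
    inference = next(l for l in layers if l.name == "model inference")
    return getattr(inference, bound) / monthly_total(layers, bound)


for bound in ("low", "high"):
    print(f"{bound}: ${monthly_total(LAYERS, bound):,.0f}/month, "
          f"inference share {inference_share(LAYERS, bound):.0%}")
```

Tool-call overhead, change management, and governance are left out because the text gives no monthly figure for them, so both totals are a floor rather than a full TCO. The point of the exercise is the inference share: if it comes out far above half of the total, layers are probably missing from the model.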

AgentMelt's 2026 TCO analysis puts it bluntly: "by the time an agent is reliably handling real work, inference is typically 30–50% of total cost of ownership and the rest is everything around it." Token pricing is what gets quoted; the layers around it are where the surprise lives.

A concrete benchmark, useful as a sanity-check: a large-scale agentic system processing 5 million monthly interactions costs roughly $1.2M/year all-in — substantially less than a $6M offshore team performing equivalent work, with full ROI achievable within twelve months at 20% workload reduction. The shape of those numbers is portable; the absolute values depend on geography, regulatory regime, and the company's existing infrastructure.
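One way to read the benchmark's arithmetic is sketched below. It rests on an interpretation the text does not spell out: that annual savings equal the share of the offshore team's cost that the agent displaces. The function names are illustrative only.

```python
def breakeven_workload_reduction(agent_annual_cost, team_annual_cost):
    """Share of the team's workload the agent must absorb to cover its own cost."""
    return agent_annual_cost / team_annual_cost


def first_year_net(agent_annual_cost, team_annual_cost, workload_reduction):
    """Savings from displaced team cost minus the agent's all-in annual cost."""
    return team_annual_cost * workload_reduction - agent_annual_cost


AGENT_COST = 1_200_000   # all-in annual cost from the benchmark
TEAM_COST = 6_000_000    # offshore team cost from the benchmark

threshold = breakeven_workload_reduction(AGENT_COST, TEAM_COST)
print(f"break-even workload reduction: {threshold:.0%}")
print(f"first-year net at 30% reduction: "
      f"${first_year_net(AGENT_COST, TEAM_COST, 0.30):,.0f}")
```

Under that reading, the 20% figure is the point at which the agent pays for itself within the year; any workload reduction beyond it is net saving.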

The value stack that holds up

Four value metrics survive scrutiny. Time saved × hourly cost is the most defensible. If the agent handles 1,000 tickets/day at 15 minutes each at a $40/hour fully-loaded rate, the gross value is $10,000/day, or $3.65M/year. Multiply that by a realistic automation rate (40–80%, depending on the use case) and present the result with a confidence interval, not as a point estimate.
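The time-saved arithmetic can be written out as a short sketch. The ticket volume, handling time, hourly rate, and automation-rate bounds are the ones in the paragraph above; the 365-day year is implied by the $3.65M figure. Note that the low and high values here are simply the two ends of the automation-rate range, not a statistical confidence interval.

```python
def annual_time_value(tasks_per_day, minutes_per_task, hourly_rate,
                      automation_rate, days_per_year=365):
    """Annual value of human time saved, scaled by the share of tasks automated."""
    hours_per_day = tasks_per_day * minutes_per_task / 60
    return hours_per_day * hourly_rate * days_per_year * automation_rate


# Gross value if every ticket were automated.
gross = annual_time_value(1_000, 15, 40.0, automation_rate=1.0)

# The same value at each end of the 40-80% automation range.
low = annual_time_value(1_000, 15, 40.0, automation_rate=0.40)
high = annual_time_value(1_000, 15, 40.0, automation_rate=0.80)

print(f"gross: ${gross:,.0f}/year")
print(f"range: ${low:,.0f} to ${high:,.0f}/year")
```

Reporting the low and high figures side by side keeps the headline number from hardening into a point estimate.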

Containment / deflection rate is the customer-service-specific version: the percentage of interactions fully resolved by the agent without escalation. Klarna's two-thirds containment is an aspirational benchmark; most well-scoped enterprise deployments achieve 40–60%. Exceeding that range requires either a narrow use case or a mature flywheel. Falling below it suggests the use case is wrong, not the agent.

Revenue lift is harder to attribute but real in sales and CX contexts. Salesforce reports Agentforce customers seeing 290% first-year ROI driven primarily by speed-to-lead — response time dropping from four hours to forty-five seconds accounted for more than half of the conversion-rate improvement. Treat vendor-published numbers as upper bounds and discount aggressively.

Error reduction is the under-appreciated category. In high-error workflows — manual data entry, document processing, intake — agents can halve error rates, and the downstream cost of errors (rework, customer impact, compliance risk) is often more than the cost of the original work. Quantifying this requires a pre-agent baseline you may not have, which is itself a reason to instrument the human process before the agent goes live.

Why most claims are wrong

BCG's 2025 research shows a widening gap between AI leaders (future-built companies achieving 5x revenue increases and 3x cost reductions) and laggards. Even leaders, however, frequently overclaim. Five recurring failure modes:

Pilot ROI ≠ production ROI. Pilots run on curated datasets and favourable conditions. Production has messy edge cases, user resistance, and integration friction. Discount pilot results by 40–60% for initial production projections; recover the difference only after a quarter of measured production performance.

Gross savings ≠ net savings. Subtract engineering costs, governance costs, observability costs, and ongoing eval and maintenance burden from gross cost savings. The net is what matters.

Correlation vs. causation. Many AI deployments coincide with process redesign that would have generated savings independently. Attribution is hard. Wherever possible, isolate the agent's contribution by running parallel control groups.

Ignoring transition costs. Retraining staff, redesigning workflows, and managing change all have real costs that almost never appear in AI ROI models.

Ignoring liability tail risk. One serious agent error — Air Canada-scale liability, a data breach, a regulatory fine — can dwarf years of operational savings. Expected-value calculations must include tail risk weighted by probability.
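Three of the corrections above are plain arithmetic and can be chained: discount the pilot result, subtract the run costs, then subtract the probability-weighted tail loss. The sketch below does that under stated assumptions. Every dollar amount and probability in it is a hypothetical placeholder, and the function name is illustrative.

```python
def risk_adjusted_net(pilot_gross, pilot_discount, run_costs, incidents):
    """Chain the corrections: pilot discount, cost netting, expected tail loss.

    run_costs: mapping of cost name to annual USD amount.
    incidents: iterable of (annual_probability, loss_in_usd) pairs.
    """
    projected_gross = pilot_gross * (1 - pilot_discount)
    net = projected_gross - sum(run_costs.values())
    expected_tail_loss = sum(p * loss for p, loss in incidents)
    return {
        "projected_gross": projected_gross,
        "net": net,
        "expected_tail_loss": expected_tail_loss,
        "risk_adjusted_net": net - expected_tail_loss,
    }


# Hypothetical inputs for illustration only.
result = risk_adjusted_net(
    pilot_gross=2_000_000,
    pilot_discount=0.50,           # midpoint of the 40-60% discount
    run_costs={
        "engineering": 400_000,
        "governance": 150_000,
        "observability": 36_000,
        "eval_and_maintenance": 100_000,
    },
    incidents=[
        (0.02, 5_000_000),         # e.g. a serious liability event
        (0.005, 20_000_000),       # e.g. a breach or regulatory fine
    ],
)

for key, value in result.items():
    print(f"{key}: ${value:,.0f}")
```

The expected-value line understates what a single bad year looks like, so it is worth showing the worst-plausible loss next to it rather than only the weighted figure.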

IBM's 2025 survey of 2,000 CEOs found that only 1 in 4 AI projects delivers promised ROI and only 16% scale across the enterprise — consistent with MIT NANDA's finding that only 5% reach production at scale with P&L impact.

The variables that move the answer

For most agent ROI models, six variables are high-sensitivity: a 25% change in any one of them shifts projected ROI by 20% or more. Automation rate (40–80% in most use cases). Agent error rate versus human error rate (agents typically 2–15%, humans typically 3–8%, with the comparison highly use-case-specific). Fully-loaded agent cost per task ($0.01–$2.00 depending on token intensity and tool-call count). Human cost per equivalent task ($15–$150, depending on role level and geography). Adoption rate (40–90%, set by change-management quality). Time to production value (3–18 months).
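A one-at-a-time sensitivity table over these six variables can be sketched as follows. The ROI formula, the baseline values, and the constants (ANNUAL_TASKS, HUMAN_ERROR_RATE, FIXED_COST, HORIZON_MONTHS) are all hypothetical: this is a toy model chosen so that each variable has a visible effect, not the model behind the chapter's figures. It treats the human error rate as a fixed baseline and assumes every extra agent error is redone by a human at full cost.

```python
from dataclasses import dataclass, fields, replace

# Hypothetical model constants.
ANNUAL_TASKS = 250_000
HUMAN_ERROR_RATE = 0.05
FIXED_COST = 1_200_000   # build plus platform cost over the horizon, USD
HORIZON_MONTHS = 24


@dataclass(frozen=True)
class Inputs:
    automation_rate: float = 0.60
    agent_error_rate: float = 0.08
    agent_cost_per_task: float = 0.50
    human_cost_per_task: float = 20.0
    adoption_rate: float = 0.70
    months_to_value: float = 9.0


def roi(x: Inputs) -> float:
    """(benefit - cost) / cost over the horizon, under the toy model."""
    eligible = ANNUAL_TASKS * (HORIZON_MONTHS / 12)
    live_fraction = max(0.0, 1 - x.months_to_value / HORIZON_MONTHS)
    handled = eligible * x.automation_rate * x.adoption_rate * live_fraction
    # Errors beyond the human baseline are redone by a human at full cost.
    rework = max(0.0, x.agent_error_rate - HUMAN_ERROR_RATE) * x.human_cost_per_task
    benefit = handled * (x.human_cost_per_task - rework)
    cost = FIXED_COST + handled * x.agent_cost_per_task
    return (benefit - cost) / cost


def tornado(base: Inputs, delta: float = 0.25):
    """ROI at -delta and +delta for each input, sorted by the size of the swing."""
    rows = []
    for f in fields(base):
        value = getattr(base, f.name)
        at_minus = roi(replace(base, **{f.name: value * (1 - delta)}))
        at_plus = roi(replace(base, **{f.name: value * (1 + delta)}))
        rows.append((f.name, at_minus, at_plus, abs(at_plus - at_minus)))
    return sorted(rows, key=lambda row: row[3], reverse=True)


print(f"baseline ROI: {roi(Inputs()):.0%}")
for name, at_minus, at_plus, swing in tornado(Inputs()):
    print(f"{name:22s} -25%: {at_minus:7.0%}  +25%: {at_plus:7.0%}  swing: {swing:.0%}")
```

The sorted rows are the data behind a tornado chart. Which variable ends up at the top depends entirely on the baseline and the model structure, so the ranking from this toy model should not be read as a general result.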

A practitioner's note

If your ROI model is a single point estimate, it is wrong. The honest version is a range, with a sensitivity table on the six variables above, and a tail-risk line that explicitly prices the worst-plausible incident. The teams that present this version are easier to take seriously than the teams that present a single number with two decimal places.

The most reliable predictor of positive ROI, across the deployments we have observed: narrow use case + tight workflow integration + measurable before/after metric + a system that learns from feedback + managed transition of affected staff. If any of those five is missing, the ROI model probably does not survive contact with reality.

The next chapter is the system that produces the "learns from feedback" piece — the flywheel.

Figure 19.1 ROI sensitivity. Tornado chart of the six variables that most often change the sign or magnitude of agent ROI. Address them in the model before you write the executive summary.