The Enterprise AI ROI Reckoning: Why Your Spend Isn't Returning, and What an Engineering-Led Build Does Differently

Most failed AI ROI is a data and plumbing problem, not a model problem. Here's the instrumentation — evals, observability, per-workflow cost — that makes return measurable.

For two years the operating instruction inside a lot of enterprises was, roughly, use more. Push tokens through everything. Wire a model into every form, every ticket, every meeting summary. By mid-2026 the internal name for it had become “tokenmaxxing”, and then the bill arrived. TechCrunch’s reporting on the token bill coming due catalogs the hangover: companies blowing through annual AI budgets in a quarter, internal usage leaderboards getting killed, seats getting cut. On TechCrunch’s Equity podcast, NEA partner Tiffany Luck framed the ROI reckoning plainly: the shift is away from maximizing usage and toward measuring return on AI spend, with a wave of startups now selling the instrumentation to track it. Her blunt summary elsewhere: enterprises are still figuring out their AI ROI.

That’s the demand side. The public-trust side rhymes: a June 2026 Pew survey, reported by TechCrunch, found just 16% of Americans expect AI to have a positive impact on society. When the people footing the bill and the people using the output are both skeptical, “we shipped an AI feature” stops being a credible answer. The question becomes the one engineers should have been asked first: where, exactly, did the return show up?

We’ve sat on both sides of this. Here’s the uncomfortable diagnosis.

Most failed AI ROI is not a model problem

The instinct when an AI initiative underdelivers is to reach for a better model. Swap the provider. Bump to the larger context window. Add a reasoning tier. It rarely moves the number, because the failure almost never lived in the model.

It lived in the plumbing. The model produced a plausible answer from incomplete, stale, or unjoinable data — and there was no loop to catch that it was wrong, no record of what it cost, and no place for its output to actually do anything. The demo worked because the demo ran on a hand-picked happy path. Production runs on your real data, which is messier than the demo and changes underneath you.

AI implementation, done honestly, is mostly data engineering with a model on top. The model is the last 10% and the most visible 10%. The 90% nobody screenshots is ingestion, joins, freshness, lineage, retrieval, and the operational hooks that turn a generated answer into a committed action. Teams that skip the 90% to ship the 10% are the ones now staring at a usage bill and an empty ROI column.

The foil: heavy vendors selling the demo

The legacy and mega-platform playbook is recognizable. A polished pilot, a per-seat licensing model that rewards usage rather than outcomes, and an architecture you can’t see into. The pitch optimizes for adoption metrics — seats activated, prompts run, “engagement” — precisely the tokenmaxxing vanity numbers that the reckoning is now retiring.

The tell is that these systems are unfalsifiable by design. You cannot ask them what a given workflow costs, you cannot diff this week’s answer quality against last week’s, and you cannot trace a bad output back to the row of data that caused it. When you can’t measure, you can’t disprove value — and you also can’t prove it. That’s a comfortable place for a vendor and an expensive place for a buyer.

The engineering-led alternative is the opposite posture: assume nothing works until it’s instrumented, and treat every claim of ROI as something you should be able to query.

What instrumentation actually means

“Instrument it” is easy to say and specific to build. Four layers, none optional.

Evals before scale

Before a workflow touches production volume, it needs an eval set — a fixed bank of real inputs with known-good outputs, scored on every change. Not vibes, not a spot check. When you swap a model or edit a prompt, the eval tells you whether quality moved and in which direction. Without it, every change is a coin flip you can’t see. Evals are the regression suite for non-deterministic software, and the discipline is the same as any test suite: cheap to run, run constantly, block the merge when they drop.
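A minimal sketch of that gate, assuming a ticket-classification workflow and an exact-match scorer. The case bank, the two stand-in workflows, and every name here (`EvalCase`, `run_evals`, `gate`) are hypothetical illustrations, not a specific framework's API; a real workflow function would call the model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    input_text: str
    expected: str  # known-good output for this input


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the normalized output equals the known-good answer."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_evals(workflow: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through the workflow and return the mean score."""
    if not cases:
        raise ValueError("eval set is empty")
    scores = [exact_match(workflow(c.input_text), c.expected) for c in cases]
    return sum(scores) / len(scores)


def gate(candidate_score: float, baseline_score: float, tolerance: float = 0.0) -> bool:
    """Return True when the change may merge: quality did not drop past tolerance."""
    return candidate_score >= baseline_score - tolerance


# Hypothetical fixed bank of inputs with known-good labels.
CASES = [
    EvalCase("Invoice overdue 45 days", "billing"),
    EvalCase("Cannot log in after password reset", "access"),
    EvalCase("Request copy of signed contract", "legal"),
]


def baseline_workflow(text: str) -> str:
    """Stand-in for the current production prompt and model."""
    lowered = text.lower()
    if "invoice" in lowered:
        return "billing"
    if "log in" in lowered:
        return "access"
    return "legal"


def candidate_workflow(text: str) -> str:
    """Stand-in for a prompt edit that silently lost the 'legal' category."""
    return "billing" if "invoice" in text.lower() else "access"


baseline = run_evals(baseline_workflow, CASES)
candidate = run_evals(candidate_workflow, CASES)
merge_allowed = gate(candidate, baseline)  # False here: the candidate regressed
```

Exact match only suits workflows with a closed answer set; free-text outputs would need a different scorer, but the gate stays the same shape.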

Observability on every call

You need traces. Every model call logged with its inputs, its retrieved context, its output, its latency, and its token cost, all of it queryable rather than buried in a vendor dashboard. When an answer is wrong in production, you want to open the trace and see exactly which retrieved chunk poisoned it. OpenTelemetry-based LLM tracing tools make this standard practice now; the point isn’t the brand, it’s that an opaque pipeline is an unfixable one.
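To make the shape of one trace record concrete, here is a sketch using only the Python standard library. It is not an OpenTelemetry integration: the `CallTrace` fields, the per-token prices, the `traced_call` wrapper, and the fake model are all assumptions for illustration, and a real sink would be a table or a tracing backend rather than a list.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable

# Hypothetical prices in USD per 1,000 tokens; substitute your provider's rates.
PRICE_PER_1K_INPUT = 0.001
PRICE_PER_1K_OUTPUT = 0.002


@dataclass
class CallTrace:
    workflow: str
    prompt: str
    retrieved_context: list[str]
    output: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float


# A model function returns (output_text, input_tokens, output_tokens).
ModelFn = Callable[[str, list[str]], tuple[str, int, int]]


def traced_call(workflow: str, prompt: str, retrieved_context: list[str],
                model_fn: ModelFn, sink: list[str]) -> CallTrace:
    """Call the model, then append one JSON trace record to `sink`."""
    started = time.perf_counter()
    output, input_tokens, output_tokens = model_fn(prompt, retrieved_context)
    latency_ms = (time.perf_counter() - started) * 1000.0
    cost_usd = ((input_tokens / 1000) * PRICE_PER_1K_INPUT
                + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT)
    trace = CallTrace(workflow, prompt, retrieved_context, output,
                      latency_ms, input_tokens, output_tokens, cost_usd)
    sink.append(json.dumps(asdict(trace)))  # one JSON line per call
    return trace


def fake_model(prompt: str, context: list[str]) -> tuple[str, int, int]:
    """Stand-in for a real provider call; token counts are made up."""
    return ("Summary: renewal notice is due in 30 days.", 2000, 500)


trace_sink: list[str] = []
trace = traced_call(
    "contract_summarization",
    "Summarize the contract.",
    ["Clause 4.2: renewal notice period is 30 days."],
    fake_model,
    trace_sink,
)
```

Because the retrieved context is stored next to the output, a bad answer can be traced back to the chunk that fed it by reading a single record.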

Cost tracking per workflow, not per company

The single most common gap. Most enterprises know their total AI bill and almost nothing below it. They cannot tell you that the contract-summarization flow costs 4 cents a run and saves twenty minutes, while the meeting-notes flow costs 30 cents a run and saves nothing anyone misses. ROI is a per-workflow question. You attribute cost and value at the workflow level or you don’t have ROI — you have a number and a hope. This is the layer the new crop of startups Luck described is racing to sell; it’s also the layer you can build yourself if your data is in one place. The cost half is the easy half — token counts come straight off the API. The value half is the work: you have to define the metric each workflow moves and capture it on the same timeline as the spend, so the two sit in one table you can actually divide.
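Here is one way the "one table you can actually divide" could look, sketched with an in-memory SQLite database standing in for the warehouse. The table names, the run counts, and the 60 USD hourly rate used to price saved minutes are assumptions; only the 4-cent and 30-cent per-run costs and the minutes saved come from the example above.

```python
import sqlite3

HOURLY_RATE_USD = 60.0  # hypothetical loaded cost of the person whose time is saved

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE runs (workflow TEXT, run_id INTEGER, cost_usd REAL);
CREATE TABLE outcomes (workflow TEXT, run_id INTEGER, minutes_saved REAL);
""")

# Spend side: one row per run, cost taken from the API's token counts.
conn.executemany("INSERT INTO runs VALUES (?, ?, ?)", [
    ("contract_summarization", 1, 0.04),
    ("contract_summarization", 2, 0.04),
    ("contract_summarization", 3, 0.04),
    ("meeting_notes", 1, 0.30),
    ("meeting_notes", 2, 0.30),
])

# Value side: the metric each workflow moves, keyed to the same run.
conn.executemany("INSERT INTO outcomes VALUES (?, ?, ?)", [
    ("contract_summarization", 1, 20.0),
    ("contract_summarization", 2, 20.0),
    ("contract_summarization", 3, 20.0),
    ("meeting_notes", 1, 0.0),
    ("meeting_notes", 2, 0.0),
])

rows = conn.execute("""
SELECT r.workflow,
       COUNT(*) AS runs,
       SUM(r.cost_usd) AS cost_usd,
       COALESCE(SUM(o.minutes_saved), 0.0) AS minutes_saved
FROM runs AS r
LEFT JOIN outcomes AS o
       ON o.workflow = r.workflow AND o.run_id = r.run_id
GROUP BY r.workflow
ORDER BY r.workflow
""").fetchall()

# Price the saved minutes and divide value by spend, per workflow.
ledger = {}
for workflow, runs, cost_usd, minutes_saved in rows:
    value_usd = minutes_saved / 60.0 * HOURLY_RATE_USD
    ledger[workflow] = {
        "runs": runs,
        "cost_usd": cost_usd,
        "value_usd": value_usd,
        "value_per_dollar": value_usd / cost_usd if cost_usd else 0.0,
    }
```

The `LEFT JOIN` is deliberate: a run with no recorded outcome still counts as spend, so a workflow nobody measured shows up as cost with zero value instead of disappearing from the ledger.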

A real operational loop

The output has to land somewhere that changes an outcome — a ticket gets routed, a claim gets flagged, a record gets updated, a human gets a ranked queue instead of a blank page. A generated summary that a person reads and forgets is a cost with no return. The loop is what converts inference into operational automation. If you can’t name the downstream action and the metric it moves, you haven’t built a workflow; you’ve built a feature.
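As a rough illustration of naming the downstream action, the sketch below commits a model's routing suggestion to a ticket and records what happened. The confidence threshold, the `human_review` fallback queue, and the `Ticket` and `apply_routing` names are assumptions for the example, not part of any particular ticketing system.

```python
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.8  # hypothetical cutoff below which a human decides


@dataclass
class Ticket:
    ticket_id: str
    text: str
    queue: str = "unassigned"


def apply_routing(ticket: Ticket, predicted_queue: str, confidence: float,
                  audit_log: list[dict]) -> str:
    """Commit the model's suggestion as an action, or hand the ticket to a human."""
    if confidence >= CONFIDENCE_FLOOR:
        ticket.queue = predicted_queue
        action = "auto_routed"
    else:
        ticket.queue = "human_review"
        action = "escalated"
    # Record the action so the outcome metric can be joined to cost later.
    audit_log.append({
        "ticket_id": ticket.ticket_id,
        "action": action,
        "queue": ticket.queue,
        "confidence": confidence,
    })
    return action


audit_log: list[dict] = []
confident = Ticket("T-1", "Invoice overdue 45 days")
unsure = Ticket("T-2", "Something is wrong with my account")

first_action = apply_routing(confident, "billing", 0.93, audit_log)
second_action = apply_routing(unsure, "access", 0.41, audit_log)
```

The audit log is what makes this a loop and not a feature: every inference ends in a recorded state change that a metric such as time-to-first-response can be computed from.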

The data layer is where ROI is won or lost

All four layers assume one thing: your data is queryable, fresh, and joined. That’s the precondition almost nobody budgets for.

Our default operational engine is deliberately minimal — ClickHouse for storage and sub-second analytical queries, Airflow for orchestration, dbt for transformation and tests. Minimalism in architecture, maximum impact in operations. The reason this stack keeps earning its place isn’t fashion; it’s that a warehouse answering queries in under 200 ms can sit inside a live workflow, not just behind a nightly dashboard. The warehouse stops being a reporting destination and becomes the thing that drives automation. When retrieval, cost attribution, and evals all read from the same fast, governed layer, ROI measurement is a query instead of a quarterly archaeology project.

This is the unglamorous truth the reckoning is forcing back into view: you cannot instrument AI on top of data you can’t trust or reach. Fix the data platform and the model becomes the easy part. Skip it and no model will save you.

Where measurable ROI actually shows up: vertical operations

The clearest returns we see aren’t in horizontal “assistant for everything” deployments. They’re in specific operational domains where the data is structured, the workflow is well-defined, and the saved minutes are countable.

A Hospital Management System is a strong example. The data is already relational and high-volume — admissions, scheduling, billing, clinical notes. Put a governed data platform underneath and the AI work becomes tractable: triage-queue prioritization, prior-authorization drafting, discharge-summary assembly, coding assistance. Each is a bounded workflow with a measurable before-and-after. You can prove the prior-auth flow cut turnaround from days to hours because you logged both, per run, with cost attached.

A School ERP is the same shape in a different vertical. Attendance, grading, fee reconciliation, and parent communication are structured, repetitive, and rule-bound — the conditions under which AI-driven operational automation pays for itself. The wins are modest per event and large in aggregate, and crucially they’re attributable: you can point at the reconciliation workflow and show the runs, the cost, and the hours returned.

In both cases the pattern holds. ROI is achievable not because the model is special but because the data layer is right and every workflow is instrumented end to end.

The reckoning is a good thing

The end of tokenmaxxing isn’t a retreat from AI. It’s AI implementation finally being held to the standard every other piece of enterprise software already meets: prove the return, or don’t ship it. That standard rewards exactly the teams that built the boring 90% — the data platforms, the evals, the traces, the per-workflow ledger — and it exposes the ones who shipped a demo and called it a strategy.

Build the data layer. Instrument every workflow. Put each one into a real operational loop and measure it. Do that, and the ROI column fills itself in. Skip it, and you’ll be back here next quarter explaining another bill.


Stuck between an impressive demo and a returning system? That gap is data engineering, not model selection — and it’s the work we do. Let’s talk.