Cheaper Inference Is Here: Token-Cost Engineering for LLM Teams
Flash-Lite tiers and a price war made tokens cheap, but bills are still blowing up. The fix is routing, caching, and measuring cost-per-task, not cost-per-token.
Token prices fell off a cliff this year, and somehow the bills got worse. The average cost per million tokens across major providers dropped from roughly $10 to about $2.50 in twelve months, yet Uber blew through its entire 2026 AI coding budget by April and Microsoft revoked developers’ Claude Code licenses months after handing them out. Cheaper tokens did not lower spend; they raised consumption. For any team shipping LLM features, the discipline that matters now is not finding the cheapest model — it is engineering the system so cost tracks value.
The cheap tier is real, and it is good#
The price pressure is concrete. Google’s new Gemini 3.1 Flash-Lite lands at $0.25 per million input tokens and $1.50 per million output, with roughly 2.5× faster time to first answer token and a 45% bump in output speed over 2.5 Flash. It is not a toy: Google reports an 86.9% GPQA Diamond reasoning score and 76.8% on MMMU Pro multimodal, and positions it for high-volume translation, content moderation, UI generation, and instruction-following. That is squarely “production workhorse” territory, not “demo only.”
This is the warning shot in what the press is calling an AI subscription price war. Microsoft has rolled out lower-cost models to cut its reliance on OpenAI. Google has pitched aggressive savings from routing internal workloads onto Flash-class models. The strategic message to buyers is the same from every vendor: most of your traffic does not need a frontier model.
They are right. But the lesson teams keep mislearning is “switch everything to the cheap model.” That is how quality quietly tanks. The right lesson is that you now have a price ladder to allocate against demand — and allocation is an engineering problem.
Why cheaper tokens made bills bigger#
Before tactics, understand the dynamic, because it explains why “just use Flash-Lite” is not a strategy. Per the same TechCrunch reporting, engineers who used the most tokens were about 2× more productive but consumed roughly 10× more tokens, and per-developer token use rose 18.6× in nine months, driven by agentic features that loop, retry, and fan out. The FinOps Foundation’s J.R. Storment described the shift bluntly: “the whole conversation shifted from tokenmaxxing and ‘go fast’ to ‘we need guardrails, how do we control this?’”
This is Jevons paradox in plain sight: make a resource cheaper and efficient, and total consumption rises rather than falls. An agent that costs a tenth as much per step will happily take a hundred steps. So the cost problem does not go away when prices drop — it moves from per-token price to per-task token volume, which is the number you actually control. And the subsidized “all-you-can-eat” subscription pricing that masked this is ending: providers are shifting toward usage-based billing, which means the consumption you ignored yesterday becomes a line item you own tomorrow.
Tactic 1: model tiering and routing#
The highest-leverage move is to stop sending all traffic to one model. In most real workloads the distribution is lopsided: a large majority of requests are easy — classification, extraction, short rewrites, routine Q&A — and a minority are genuinely hard, needing multi-step reasoning or long context.
A workable tiering policy looks like this:
- Cheap tier (the 80%). A Flash-Lite-class model handles the bulk: structured extraction, routing, summarization of short inputs, moderation, simple tool calls. At $0.25 per million input tokens, this is where your volume should live.
- Frontier tier (the hard 20%). Reserve premium models for tasks that demonstrably need them: long-horizon reasoning, ambiguous instructions, high-stakes generation where an error is expensive.
- An explicit router between them. Route on signals you can compute cheaply — input length, task type, a fast classifier, or a confidence check — not on vibes.
Google’s own framing is that shifting a large share of workloads onto a cheap-plus-frontier mix yields large savings; the migration guidance from practitioners puts the bill reduction in the range of 80% on workloads that move down a tier (UsageBox). The catch is that routing has to be measured, not assumed. A naive router that misclassifies hard requests as easy degrades quality invisibly. Build the router with a fallback: if the cheap model’s output fails a validation check — unparseable structure, low confidence, a refused tool call — escalate that single request to the frontier tier rather than shipping a bad answer.
A cheap pattern that works: cascade#
A practical version of tiering is the cascade. Send the request to the cheap model first. Score the result with a fast, deterministic check — does the JSON parse, did the required tool get called, does a lightweight verifier pass. If it clears, you are done at a tenth of the cost. If it fails, escalate. On a workload where the cheap model handles most cases correctly, the blended cost lands near the cheap tier while quality stays near the frontier tier. The whole game is making the escalation check cheap and honest.
Tactic 2: caching, prompt and response#
Routing decides which model. Caching decides whether you call a model at all. Two layers matter:
- Prompt / prefix caching. Most providers now bill cached input tokens at a steep discount. If your requests share a long, stable prefix — a system prompt, a tool schema, a retrieved document set — structure the prompt so the stable part comes first and is cacheable, and the variable part comes last. For agentic loops that resend the same context every step, this is one of the largest single line-item reductions available.
- Response caching. For deterministic or near-deterministic tasks — canonical FAQ answers, repeated classifications, embeddings of unchanged documents — cache the output keyed on a normalized input. A cache hit is a request that costs nothing and returns instantly. Teams routinely underuse this because LLM output feels dynamic, but a large fraction of production traffic is the same handful of questions asked again.
Neither is exotic. Both are ordinary systems engineering applied to a new cost center, and both compound with tiering: a cache hit is free, and a cache miss can still be routed to the cheap tier.
Tactic 3: measure cost per task, not cost per token#
This is the mindset shift that makes the rest coherent. Cost per token is a vanity metric. A model at half the per-token price that needs three round-trips, a longer prompt, and a retry to get a usable answer is more expensive than the “pricier” model that nails it in one shot. Optimizing the headline token price while ignoring tokens-per-task is how you end up with a cheaper model and a bigger bill.
Instrument the unit that maps to business value — a resolved support ticket, an enriched record, a merged code change — and track total token cost to complete it across tiers. The industry is building exactly this muscle: the Linux Foundation launched a Tokenomics Foundation to standardize AI cost tracking, and observability vendors have added token-level monitoring. You do not need to wait for a standard. You need three numbers per feature: tokens in, tokens out, and tasks completed. With those you can see whether a model swap or a routing change actually helped, and you can set per-team or per-feature budgets with caps before a runaway agent discovers a $500-million bill the way one company in the TechCrunch report did.
When Flash-Lite is the wrong call#
Cheap tiers are not free quality. A Flash-Lite-class model is the right default for high-volume, well-specified, low-ambiguity tasks. It is the wrong call when:
- The task needs long, multi-step reasoning where a small error compounds across steps.
- Output feeds an irreversible or high-stakes action with no human in the loop.
- The prompt is genuinely ambiguous and the model has to infer intent rather than execute a clear instruction.
- You have no eval distinguishing good output from plausible-but-wrong output — in which case a cheaper model quietly tanks quality and you will not notice until users do.
The dividing line is verifiability. Where you can cheaply check the answer, push down the tier aggressively and let the cascade catch failures. Where you cannot check it, the cost of a wrong answer dominates the cost of tokens, and the frontier model earns its price.
The takeaway#
The price war is real and it favors teams that treat inference as a system to be engineered rather than a model to be chosen. Cheaper tokens did not solve the cost problem; they relocated it to consumption, where tiering, caching, and honest per-task accounting are the levers. Build a router with a cascade and a real escalation check, cache the stable parts of your prompts, and measure cost against completed tasks instead of tokens. Do that, and a model like Flash-Lite becomes a genuine efficiency win across the easy 80% of your traffic. Skip it, and you will keep paying frontier prices for work a cheap model could have done — or worse, ship a cheap model into work it was never going to get right.