AI Token Pricing in 2026: Why Bills Keep Rising Even as Per-Token Costs Fall
AI cost per token fell 67% year over year going into 2026, yet 73% of enterprises overran their AI budgets. The economics, the GitHub Copilot pricing change, and an AI inference cost playbook.
Blended AI token prices fell roughly 67% year over year, from $18.40 to $6.07 per million tokens between Q1 2025 and Q1 2026. By every classical reading of a market, AI should be getting cheaper. It isn’t. 73% of enterprises exceeded their original AI cost projections last fiscal year. Uber’s CTO burned the company’s full 2026 AI coding budget in four months. The contradiction is only apparent: it is the predictable consequence of how workloads are changing underneath the prices. This article covers the reality of AI cost per token in 2026 and what an enterprise AI pricing playbook actually looks like.
The price collapse is real
The per-token numbers are not in dispute. A short tour:
- Blended cost across major frontier models fell 67% YoY, from $18.40 to $6.07 per million tokens.
- GPT-4o input dropped from $5.00 to $2.50 per million tokens.
- o4 Mini sits at $0.55 per million tokens — a price that would have been unthinkable in 2023.
- Gemini Flash and GPT-4o Mini both sit under $0.50 per million tokens.
For a buyer reading a price sheet in isolation, this looks like a margin gift. Run the same workload as last year and your bill should be a third of what it was. The reason it isn’t is that nobody is running the same workload as last year.

Why bills go up anyway: the agentic multiplier
The shape of AI workloads in 2026 is fundamentally different from 2024. The pattern then was “user types prompt, model writes one response.” The pattern now is agentic — a single user instruction spawns a multi-step plan, the model reads files, calls tools, evaluates intermediate results, sometimes spawns sub-agents, and retries when things break.
The cost consequence is direct: agentic workflows consume between 5 and 30 times more tokens per user task than the equivalent 2024 prompt-and-response interaction. A coding agent fixing a bug might burn 200,000 tokens reading the codebase, exploring failure modes, and iterating on a patch. A research agent answering one question might consume a million tokens across web reads and synthesis steps.
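The arithmetic behind that claim is short: a bill is price times tokens, so a price cut only helps if token volume grows by less than the inverse of the cut. The sketch below uses the blended prices quoted earlier and assumes, for illustration only, that the baseline task was priced at the Q1 2025 rate; the function names are ours.

```python
# Blended prices per million tokens, as quoted in this article.
PRICE_2025 = 18.40
PRICE_2026 = 6.07

def relative_bill(token_multiplier: float) -> float:
    """2026 bill for one task as a multiple of the 2025 bill."""
    return (PRICE_2026 / PRICE_2025) * token_multiplier

def break_even_multiplier() -> float:
    """Token growth at which the price cut is fully cancelled out."""
    return PRICE_2025 / PRICE_2026

for multiplier in (1, 5, 30):
    print(f"{multiplier:>2}x tokens -> {relative_bill(multiplier):.2f}x the 2025 bill")
print(f"break-even at {break_even_multiplier():.2f}x tokens")
```

On these figures, anything past roughly 3x token growth per task erases the price decline, and the low end of the agentic range already sits above that.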
Goldman Sachs has projected 24x growth in token consumption by 2030. That is not a wild outlier forecast. It is roughly what you get if you assume agents become the default interaction shape and only a fraction of the per-token price decline continues. On those assumptions, demand grows faster than prices fall.
Why 73% of enterprises blow their budget
The 73% number is the one that should pin enterprise CFOs to their chairs. Nearly three quarters of organizations that budgeted for AI exceeded their original cost projections last fiscal year. We have seen the pattern repeatedly:
- The pilot underestimates the production shape. A pilot built on prompt-and-response gets approved on a low cost model. Production rebuilds it as an agent, and the per-task token count is 10x higher.
- The success penalty. Internal adoption grows faster than forecast. The model is good, people use it more, the bill scales with adoption.
- Tool calls compound. Every retrieval, every web search, every code execution adds tokens. The visible “prompt” is a fraction of the bill.
- No central observability. Teams pick models independently, route through different gateways, and no one sees the aggregate until finance does.
The mismatch is not that AI is expensive. It is that actual AI usage at the org level no longer resembles the assumptions in the original budget.
The GitHub Copilot pricing change: a leading indicator
On June 1, 2026, GitHub Copilot moved to usage-based billing. The old “premium request” system is gone; the new model is GitHub AI Credits tied directly to token consumption. The shift is significant for two reasons.
First, it is an admission from the largest AI coding product on the market that flat per-seat pricing is no longer sustainable as agentic features become dominant. Copilot agents read files, plan, retry, and call tools — the cost per “active seat” is no longer fixed.
Second, it pushes the variability onto the buyer. A team that used $40-per-seat-per-month Copilot last year now has variable bills that depend on how their engineers use agent mode. Finance teams who liked Copilot precisely because it was predictable are now negotiating credit packs and watching dashboards.
This is not a Copilot-specific story. Cursor moved earlier. Most enterprise AI coding tools will be on usage-based billing by end of year. The GitHub Copilot pricing change is the canary — the last predictable-bill product caved.
The Uber data point
Uber’s CTO reported burning the company’s entire 2026 AI coding budget in four months. Same shape as the 73% figure, but at a scale most enterprises can recognize. The reasons line up: engineers liked the tools, agent modes consumed more tokens per task than the budget assumed, and procurement had no easy lever to throttle usage without breaking workflows.
We expect to see the same pattern in three places in the next 12 months: AI coding tools (already happening), customer-support AI (next quarter), and internal research / knowledge-base assistants (year-end).
The cost-control playbook
The right response is neither “ban the tools” nor “swallow the bill.” It is a deliberate AI budget management practice that treats inference cost the way mature ops teams treat cloud cost. The components:
Prompt caching
The single highest-leverage tactic. Anthropic, OpenAI, and Google all offer cached input pricing, in the best cases around 10x cheaper than fresh input, though the discount varies by provider and model. Long system prompts, tool definitions, and stable retrieval chunks should be cached aggressively. We covered the mechanics in LLM prompt caching in 2026; for many agent workloads cache hit rates above 70% are achievable and can cut input spend by half or more.
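To see how a hit rate turns into savings, here is a minimal sketch of the blended input price. It assumes cached tokens cost one tenth of fresh input and ignores any surcharge a provider may apply when writing to the cache; the function name and default are illustrative, not any vendor's API.

```python
def effective_input_price(base_price: float, hit_rate: float,
                          cached_fraction: float = 0.1) -> float:
    """Blended input price per million tokens for a given cache hit rate.

    cached_fraction is the cached price as a share of the base price
    (0.1 is an assumption; check your provider's price sheet).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return base_price * (hit_rate * cached_fraction + (1.0 - hit_rate))

base = 2.50  # USD per million input tokens, the GPT-4o figure quoted earlier
blended = effective_input_price(base, hit_rate=0.70)
print(f"blended input price: ${blended:.3f} per million tokens")
print(f"input savings: {1 - blended / base:.0%}")
```

At a 70% hit rate the blended input price falls by a little over 60% under these assumptions, which is consistent with "half or more" on input-heavy agent workloads.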
Model routing
Not every task needs Opus 4.7. A well-tuned router sends easy tasks to Haiku, Flash, or o4 Mini and reserves frontier models for the hard ones. The LLM router pattern is now standard for any production deployment of meaningful scale. Done right, routing can cut blended cost per task by 60-80% without measurable quality loss.
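A router can be as simple as "cheapest tier that is trusted with this difficulty." The sketch below is one hedged way to express that idea. The tier labels, prices, and difficulty ceilings are hypothetical, and the difficulty score is assumed to come from a classifier or heuristic that is not shown.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    price_per_mtok: float  # hypothetical USD per million tokens
    max_difficulty: float  # hardest task this tier is trusted with, 0..1

# Hypothetical tiers; replace with your own models and measured ceilings.
TIERS = [
    Tier("small", 0.50, 0.3),
    Tier("mid", 3.00, 0.7),
    Tier("frontier", 15.00, 1.0),
]

def route(difficulty: float) -> Tier:
    """Return the cheapest tier whose ceiling covers the task difficulty."""
    for tier in sorted(TIERS, key=lambda t: t.price_per_mtok):
        if difficulty <= tier.max_difficulty:
            return tier
    # Nothing covers it: fall back to the most expensive tier.
    return max(TIERS, key=lambda t: t.price_per_mtok)

def blended_price(difficulties: list[float]) -> float:
    """Average price per million tokens, assuming equal tokens per task."""
    return sum(route(d).price_per_mtok for d in difficulties) / len(difficulties)

tasks = [0.1, 0.1, 0.5, 0.9]
print(f"routed: ${blended_price(tasks):.2f} vs all-frontier: $15.00")
```

The savings depend entirely on the task mix and on how well the difficulty score predicts quality, which is why routing pairs naturally with an evaluation suite.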
Batching and async
Workloads that don’t need synchronous responses should run on batch tiers. Anthropic and OpenAI both offer 50% discounts for batch processing. Move every offline job — embedding generation, summarization, evaluation runs, document classification — onto the batch tier.
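A quick way to size the opportunity is to tag each workload by whether it needs a synchronous response and apply the 50% discount to the rest. The job names and spend figures below are hypothetical.

```python
from dataclasses import dataclass

BATCH_DISCOUNT = 0.50  # the batch discount quoted above

@dataclass
class Job:
    name: str
    monthly_spend: float  # USD at on-demand prices (hypothetical)
    needs_realtime: bool

def plan(jobs: list[Job]) -> tuple[float, float]:
    """Return (new monthly total, monthly savings) after moving offline jobs to batch."""
    old_total = sum(j.monthly_spend for j in jobs)
    new_total = sum(
        j.monthly_spend if j.needs_realtime
        else j.monthly_spend * (1.0 - BATCH_DISCOUNT)
        for j in jobs
    )
    return new_total, old_total - new_total

jobs = [
    Job("chat assistant", 4000.0, needs_realtime=True),
    Job("embedding generation", 1500.0, needs_realtime=False),
    Job("evaluation runs", 500.0, needs_realtime=False),
]
new_total, savings = plan(jobs)
print(f"new total: ${new_total:,.0f}, savings: ${savings:,.0f}")
```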
Eval-driven cheaper-model migration
The discipline that matters most over time. Build evaluation suites for your workloads. On a quarterly cadence, re-test on cheaper models. Frontier capabilities trickle down fast in 2026 — last quarter’s Opus job is often this quarter’s Sonnet job, and next quarter’s Haiku job. Without evals you can’t tell; with evals you ratchet your bill down on every cycle.
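The quarterly ratchet reduces to one decision: among the models that clear your quality bar on your own eval suite, pick the cheapest. A minimal sketch, with hypothetical model labels, scores, and prices:

```python
from typing import Optional

# Hypothetical eval results: model label -> (eval score 0..1, USD per million tokens).
RESULTS = {
    "large": (0.94, 15.00),
    "medium": (0.92, 3.00),
    "small": (0.81, 0.50),
}

def cheapest_passing(results: dict, min_score: float) -> Optional[str]:
    """Return the cheapest model whose eval score meets the bar, or None."""
    passing = [
        (price, name)
        for name, (score, price) in results.items()
        if score >= min_score
    ]
    return min(passing)[1] if passing else None

print(cheapest_passing(RESULTS, min_score=0.90))
```

Re-running this each quarter with fresh scores is what turns falling price sheets into a falling bill; if no model clears the bar, the function returns None and the workload stays where it is.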
Centralized gateway and observability
Route everything through an AI gateway so finance, security, and platform engineering see one cost picture. Per-team budgets, per-app quotas, alerting on anomalies. This is the single biggest gap in most enterprise AI deployments today — usage scattered across vendors, no aggregated view until the credit-card statements land.
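The core of that cost picture is small: attribute every call to a team, accumulate spend, and alert before a budget is exhausted. The sketch below is an in-memory illustration only; the class, team names, and 80% alert threshold are assumptions, and a real gateway would persist this data and enforce quotas at request time.

```python
from collections import defaultdict

class SpendTracker:
    """Accumulates per-team spend and flags teams nearing their budget."""

    def __init__(self, budgets: dict[str, float]):
        self.budgets = budgets            # team -> monthly budget in USD
        self.spend = defaultdict(float)   # team -> spend so far in USD

    def record(self, team: str, tokens: int, price_per_mtok: float) -> None:
        """Attribute one call's cost to a team."""
        self.spend[team] += tokens / 1_000_000 * price_per_mtok

    def over_threshold(self, fraction: float = 0.8) -> list[str]:
        """Teams whose spend has reached the given share of their budget."""
        return sorted(
            team for team, budget in self.budgets.items()
            if self.spend[team] >= fraction * budget
        )

tracker = SpendTracker({"search": 100.0, "support": 100.0})
tracker.record("search", tokens=40_000_000, price_per_mtok=2.50)
tracker.record("support", tokens=10_000_000, price_per_mtok=2.50)
print(tracker.over_threshold())
```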

Context pruning and retrieval discipline
Large contexts feel free because they don’t show up on the invoice line by line — but every token in the context window is billed. Teams that audit their prompts often find they are paying to re-send the same boilerplate, tool definitions, or document chunks across thousands of calls per day. Trim aggressively. Use targeted retrieval rather than dumping entire documents into context. The combination of prompt caching for what stays stable and tight retrieval for what changes typically halves the per-call input cost on top of the routing and batching savings.
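One way to enforce that discipline is a hard token budget on retrieved context: drop duplicates, then keep the highest-scoring chunks that fit. The sketch below assumes retrieval scores already exist and approximates token counts by whitespace splitting, which a real tokenizer should replace.

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate (assumption: one token per whitespace-separated word)."""
    return len(text.split())

def prune_context(chunks: list[tuple[float, str]], budget: int) -> list[str]:
    """Keep the highest-scoring unique chunks that fit within the token budget."""
    kept: list[str] = []
    seen: set[str] = set()
    used = 0
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        key = " ".join(text.split()).lower()  # normalise case and whitespace
        if key in seen:
            continue  # duplicate of a chunk already kept
        cost = approx_tokens(text)
        if used + cost > budget:
            continue  # does not fit; a smaller chunk further down still might
        seen.add(key)
        kept.append(text)
        used += cost
    return kept

chunks = [
    (0.9, "refund policy allows returns within 30 days"),
    (0.8, "Refund policy allows returns  within 30 days"),
    (0.5, "shipping takes five business days"),
    (0.2, "our company was founded in a garage long ago"),
]
pruned = prune_context(chunks, budget=12)
print(pruned)
```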
What to put in next year’s budget
For organizations setting fiscal 2027 AI budgets in the next two quarters, three planning assumptions:
- Assume the agentic multiplier holds. Budget 5-10x the per-task token consumption of equivalent 2024 workflows. If your pilot ran on prompt-and-response, double the figure again.
- Assume usage-based billing is universal. Per-seat pricing for the products that matter is going away. Build the FinOps muscle to manage variable bills now.
- Assume per-token prices keep falling — but not fast enough. Plan for 30-50% YoY price declines on frontier models. Then plan for your consumption to grow faster than that.
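These assumptions can be turned into a rough planning range. The sketch below multiplies current spend by the surviving price fraction and by an assumed growth in token consumption; the current-spend figure and growth factors are placeholders to replace with your own numbers, not forecasts.

```python
def projected_spend(current_spend: float, price_decline: float,
                    consumption_growth: float) -> float:
    """Next-year spend given a fractional price decline and a token growth multiple."""
    return current_spend * (1.0 - price_decline) * consumption_growth

current = 1_000_000.0  # hypothetical current annual spend in USD

# Optimistic case: prices fall 50%, consumption doubles.
low = projected_spend(current, price_decline=0.50, consumption_growth=2.0)
# Pessimistic case: prices fall only 30%, consumption grows 5x.
high = projected_spend(current, price_decline=0.30, consumption_growth=5.0)

print(f"planning range: ${low:,.0f} to ${high:,.0f}")
```

Even the optimistic case only holds spend flat, which is the point of the third assumption: price declines alone do not shrink the line item.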
The teams that will be fine in 2027 are the ones who built routing, caching, batching, and evaluation discipline this year. The teams that will be in board meetings explaining variance are the ones who treated AI cost as somebody else’s problem.
Where pdpspectra fits
We build the inference cost layer — gateway, router, cache strategy, batch pipelines, and the eval suite that lets you migrate to cheaper models without losing quality. Most engagements start with a usage audit and end with a 40-60% reduction in per-task cost. See AI / LLM integration for what that engagement looks like.
Related reading
- LLM cost optimization in 2026
- Cost control for agentic workflows
- LLM routing to the cheapest model that works
If your finance team is asking why the AI line item keeps growing while the price sheets keep falling, we can run the audit and ship the controls. Get in touch — AI budget management is a solvable problem, just not by accident.