Cost Control for Agentic Workflows: Caps, Caches, and Routing
Agent costs are heavy-tailed: a few worst-case tasks dominate the bill. Three controls (per-task caps, response caches, and model routing) hold costs flat without sacrificing quality.
Agent costs follow a power law. The median task is cheap; the worst-case task is 200x the median. Without controls, the worst-case dominates the bill. We’ve seen agentic features that cost $0.01 per task in eval and $4 per task in production. Same code, same model, different inputs.
Three controls keep this in check.
Per-task caps
Every task gets a budget: max tokens, max iterations, max dollars. Cross the cap → abort cleanly with a recoverable error. Cap is enforced inside the agent loop, not by a billing alert that arrives next month.
A typical config:
budget:
  max_input_tokens: 100000
  max_output_tokens: 8000
  max_iterations: 20
  max_dollars: 0.50
Set the cap to ~3x your eval median. Tasks that need more usually have a different shape — escalate to a human, not to a larger budget.
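To make "enforced inside the agent loop" concrete, here is a minimal sketch. The limit names mirror the config above; `Budget`, `BudgetExceeded`, `run_task`, and the shape of the `step` callable are hypothetical names chosen for illustration, not part of any particular framework.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


class BudgetExceeded(Exception):
    """Recoverable error: the task crossed one of its caps."""


@dataclass
class Budget:
    max_input_tokens: int = 100_000
    max_output_tokens: int = 8_000
    max_iterations: int = 20
    max_dollars: float = 0.50
    input_tokens: int = 0
    output_tokens: int = 0
    iterations: int = 0
    dollars: float = 0.0

    def charge(self, input_tokens: int, output_tokens: int, dollars: float) -> None:
        """Record one model call, then raise if a token or dollar cap is crossed."""
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        self.dollars += dollars
        self.iterations += 1
        for name, used, cap in (
            ("max_input_tokens", self.input_tokens, self.max_input_tokens),
            ("max_output_tokens", self.output_tokens, self.max_output_tokens),
            ("max_dollars", self.dollars, self.max_dollars),
        ):
            if used > cap:
                raise BudgetExceeded(name)


# Assumed step contract: returns (done, input_tokens, output_tokens, dollars).
Step = Callable[[], Tuple[bool, int, int, float]]


def run_task(step: Step, budget: Budget) -> Budget:
    """Run agent steps until done; the budget is checked on every iteration."""
    while True:
        # Check the iteration cap before paying for another call.
        if budget.iterations >= budget.max_iterations:
            raise BudgetExceeded("max_iterations")
        done, in_tok, out_tok, cost = step()
        budget.charge(in_tok, out_tok, cost)
        if done:
            return budget
```

Because token and dollar usage is only known after a call returns, this sketch can overshoot a cap by at most one call. The caller catches `BudgetExceeded` and decides what "recoverable" means for the task, for example handing it to a human.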
Response caches
Many agent calls are deterministic given inputs: classify this email, extract entities from this PDF, summarize this transcript. Cache by content hash; TTL by use case. A well-placed cache pays for itself in days.
Two patterns:
- Exact-match cache. Hash the prompt+inputs, store the response. Works for deterministic tasks (classification, extraction). 30–60% hit rate is normal.
- Semantic cache. Embed the request, look up similar prior requests with high similarity, return the cached response. Risky for tasks where small input changes matter, useful for FAQ-style customer support. Always include a “novelty floor” so genuinely new queries bypass the cache.
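A rough sketch of the semantic pattern, reading the "novelty floor" as a minimum cosine similarity below which a request counts as new. That reading, the `SemanticCache` class, the 0.95 default, and the `embed` callable are all assumptions for illustration; a production version would likely use a vector index rather than a linear scan.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], novelty_floor: float = 0.95):
        self.embed = embed                  # hypothetical embedding function
        self.novelty_floor = novelty_floor  # assumed threshold; tune per task
        self.entries: List[Tuple[Vector, str]] = []

    def get(self, request: str) -> Optional[str]:
        """Return the closest cached response, or None if the request is novel."""
        query = self.embed(request)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        # Below the floor, treat the request as new and bypass the cache.
        return best_response if best_score >= self.novelty_floor else None

    def put(self, request: str, response: str) -> None:
        self.entries.append((self.embed(request), response))
```

A `None` result means the request goes to the model as usual, and the fresh response can then be stored with `put`.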
Don’t cache anything with user-specific output unless you partition by user.
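An exact-match cache with a TTL and an optional per-user partition could look like the sketch below. `cache_key` and `ExactMatchCache` are hypothetical names, the in-memory dict stands in for whatever store you actually use, and the inputs are assumed to be JSON-serializable.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Optional, Tuple


def cache_key(prompt: str, inputs: Dict[str, Any], user_id: Optional[str] = None) -> str:
    """Content hash of prompt + inputs; user_id partitions user-specific output."""
    payload = json.dumps(
        {"prompt": prompt, "inputs": inputs, "user": user_id},
        sort_keys=True,  # stable key regardless of dict ordering
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ExactMatchCache:
    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, response)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return response

    def put(self, key: str, response: Any) -> None:
        self._store[key] = (self.clock() + self.ttl_seconds, response)
```

Passing `user_id` into the key means two users with identical inputs never share an entry. Leaving it as `None` gives a shared cache, which is only appropriate when the output does not depend on the user.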
Model routing
The biggest model isn’t always the right model. A typical agent task has three subtasks:
- Classification / routing — small model (Haiku, GPT-4o-mini, Gemma) handles this fine
- Extraction / structured output — mid model
- Open-ended reasoning — large model (Opus, GPT-5, Gemini 2.5)
A router agent (cheap model) inspects the request and picks the worker model. For most enterprise workloads, ~70% of tasks route to the cheap model. Cost drops by 5–10x with no measurable quality regression — measured against an eval set, not vibes.
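As a sketch of the routing step and the arithmetic behind the savings: the tier names, the label set, and the relative costs below are placeholders chosen for illustration, not real prices, and `classify` stands in for the cheap router model.

```python
from typing import Callable, Dict

# Assumed mapping from subtask label to worker tier (mirrors the list above).
SUBTASK_TIER: Dict[str, str] = {
    "classification": "small",
    "extraction": "mid",
    "reasoning": "large",
}

# Hypothetical relative cost per task for each tier; not real pricing.
RELATIVE_COST: Dict[str, float] = {"small": 1.0, "mid": 5.0, "large": 25.0}


def route(request: str, classify: Callable[[str], str], fallback: str = "large") -> str:
    """Ask the router model for a subtask label and map it to a worker tier."""
    label = classify(request)
    # Unknown labels use the fallback tier; track how often this happens.
    return SUBTASK_TIER.get(label, fallback)


def blended_cost(mix: Dict[str, float], cost: Dict[str, float] = RELATIVE_COST) -> float:
    """Average cost per task given the share of traffic sent to each tier."""
    return sum(share * cost[tier] for tier, share in mix.items())


# Example mix: 70% of tasks go to the cheap tier, as in the text.
mix = {"small": 0.70, "mid": 0.20, "large": 0.10}
savings = RELATIVE_COST["large"] / blended_cost(mix)
```

Under these assumed costs the blend works out to roughly 6x cheaper than sending everything to the large tier. The sketch ignores the router's own cost, and the actual ratio depends entirely on your prices and mix.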
What we monitor
For agent deployments via our AI & LLM integration service, the dashboard shows:
- Cost per task (p50, p95, p99)
- Cache hit rate by task type
- Model mix (% routed to each tier)
- Cap-abort rate (tasks that hit the budget cap)
- Cost per business outcome (per resolved ticket, per processed invoice)
The last one is what the CFO cares about. The others are how you control it.
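These numbers are straightforward to compute from per-task logs. The sketch below assumes a hypothetical `TaskRecord` with one row per task; the field names are illustrative, and grouping the cache hit rate by task type is left out for brevity.

```python
import statistics
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskRecord:
    cost: float             # dollars spent on this task
    tier: str               # model tier the task was routed to
    cache_hit: bool
    cap_aborted: bool
    outcome_achieved: bool  # e.g. ticket resolved, invoice processed


def summarize(records: List[TaskRecord]) -> Dict[str, object]:
    """Dashboard metrics from per-task records (needs at least two records)."""
    n = len(records)
    costs = [r.cost for r in records]
    # n=100 yields 99 cut points; index i holds the (i+1)th percentile.
    cuts = statistics.quantiles(costs, n=100, method="inclusive")
    outcomes = sum(r.outcome_achieved for r in records)
    tier_counts = Counter(r.tier for r in records)
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "cache_hit_rate": sum(r.cache_hit for r in records) / n,
        "cap_abort_rate": sum(r.cap_aborted for r in records) / n,
        "model_mix": {tier: count / n for tier, count in tier_counts.items()},
        # Total spend divided by successful outcomes, not by tasks attempted.
        "cost_per_outcome": sum(costs) / outcomes if outcomes else float("inf"),
    }
```

Dividing by outcomes rather than tasks means aborted and failed tasks still count toward the cost of each success.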
The compounding problem
A single misconfigured prompt that doubles average tokens, multiplied by a 10x traffic ramp, multiplied by a routing config that defaults to the largest model — that’s how a $2k/month AI feature becomes a $200k/month problem. The fix isn’t one heroic optimization; it’s the three controls above, applied early.
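The arithmetic is worth spelling out, because the factors multiply rather than add. In the sketch below, the 2x token and 10x traffic factors come from the paragraph above; the 5x model factor is an assumption, picked because it is what the $2k to $200k figures imply and it sits at the low end of the 5–10x routing range quoted earlier.

```python
def compounded_monthly_cost(
    baseline: int,
    token_multiplier: int,
    traffic_multiplier: int,
    model_multiplier: int,
) -> int:
    """Monthly cost after independent multipliers stack on the baseline."""
    return baseline * token_multiplier * traffic_multiplier * model_multiplier


projected = compounded_monthly_cost(
    baseline=2_000,         # dollars per month, from the text
    token_multiplier=2,     # misconfigured prompt doubles average tokens
    traffic_multiplier=10,  # traffic ramp
    model_multiplier=5,     # ASSUMPTION: everything defaults to the largest model
)
# projected == 200_000
```

Each factor looks survivable alone (2x, 10x, 5x); together they are 100x.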
What goes in week one
Before any agent feature ships:
- Per-task cap configured, alert on cap-abort rate
- Exact-match cache for deterministic stages
- Routing implemented (router model + worker tiers)
- Cost dashboard with per-task breakdown
- Eval suite that scores quality across model tiers
Cost control isn’t a Phase 2 cleanup. It’s part of how you build the agent in the first place.
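For the last checklist item, one simple shape is to score each tier on the same eval set and pick the cheapest tier whose score stays within a tolerance of the largest model. The sketch below uses exact-match scoring and a 0.02 tolerance purely as assumptions; `score_tiers` and `cheapest_acceptable` are hypothetical helpers, and a real suite may need graders rather than string equality.

```python
from typing import Callable, Dict, List, Tuple

EvalCase = Tuple[str, str]  # (input, expected output)


def score_tiers(
    cases: List[EvalCase], models: Dict[str, Callable[[str], str]]
) -> Dict[str, float]:
    """Fraction of eval cases each tier answers exactly right."""
    return {
        tier: sum(model(inp) == expected for inp, expected in cases) / len(cases)
        for tier, model in models.items()
    }


def cheapest_acceptable(
    scores: Dict[str, float],
    tiers_by_cost: List[str],  # cheapest first
    baseline: str,
    tolerance: float = 0.02,   # assumed acceptable quality drop
) -> str:
    """Cheapest tier whose score is within `tolerance` of the baseline tier."""
    floor = scores[baseline] - tolerance
    for tier in tiers_by_cost:
        if scores[tier] >= floor:
            return tier
    return baseline
```

Running this per task type, rather than once globally, is what lets the routing table send each kind of work to the cheapest tier that still passes.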
Most agent cost overruns aren’t bugs. They come from three controls that were missing or applied too late. Our team installs cost-controlled agent stacks across enterprise rollouts. Tell us about the workflow.