Cost Control for AI Agents

Agent costs follow a power law. The median task is cheap; the worst-case task costs 200x the median. Without controls, the worst case dominates the bill. We’ve seen agentic features that cost $0.01 per task in eval and $4 per task in production. Same code, same model, different inputs.

Three controls keep this in check.

Per-task caps

Every task gets a budget: max tokens, max iterations, max dollars. When a task crosses a cap, it aborts cleanly with a recoverable error. The cap is enforced inside the agent loop, not by a billing alert that arrives next month.

A typical config:

budget:
  max_input_tokens: 100000
  max_output_tokens: 8000
  max_iterations: 20
  max_dollars: 0.50

Set the cap to ~3x your eval median. Tasks that need more usually have a different shape — escalate to a human, not to a larger budget.
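A minimal sketch of in-loop enforcement, assuming the agent loop can report token counts and dollar cost after each model call. The names `Budget`, `Usage`, `BudgetExceeded`, `check`, `run_agent`, and the `step` callable are hypothetical, not part of any particular framework; the default values mirror the config above.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Recoverable error: the caller can retry, queue, or escalate to a human."""


@dataclass
class Budget:
    # Defaults mirror the example config.
    max_input_tokens: int = 100_000
    max_output_tokens: int = 8_000
    max_iterations: int = 20
    max_dollars: float = 0.50


@dataclass
class Usage:
    input_tokens: int = 0
    output_tokens: int = 0
    iterations: int = 0
    dollars: float = 0.0


def check(budget: Budget, usage: Usage) -> None:
    """Raise BudgetExceeded naming the first cap that was crossed."""
    limits = [
        ("max_input_tokens", usage.input_tokens, budget.max_input_tokens),
        ("max_output_tokens", usage.output_tokens, budget.max_output_tokens),
        ("max_iterations", usage.iterations, budget.max_iterations),
        ("max_dollars", usage.dollars, budget.max_dollars),
    ]
    for name, used, cap in limits:
        if used > cap:
            raise BudgetExceeded(f"{name}: used {used}, cap {cap}")


def run_agent(step, budget: Budget) -> Usage:
    """Run `step` until it reports done, checking the budget after every call.

    `step` is a hypothetical callable that performs one loop iteration and
    returns (input_tokens, output_tokens, dollars, done).
    """
    usage = Usage()
    while True:
        in_tok, out_tok, cost, done = step()
        usage.input_tokens += in_tok
        usage.output_tokens += out_tok
        usage.dollars += cost
        usage.iterations += 1
        check(budget, usage)  # enforced inside the loop, not after the fact
        if done:
            return usage
```

Because the check runs after each call, a single call can overshoot a cap by at most its own cost. If that matters, a stricter variant could estimate the next call's cost and check before making it.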

Response caches

Many agent calls are deterministic given inputs: classify this email, extract entities from this PDF, summarize this transcript. Cache by content hash; TTL by use case. A well-placed cache pays for itself in days.

Two patterns:

Exact-match cache. Hash the prompt+inputs, store the response. Works for deterministic tasks (classification, extraction). 30–60% hit rate is normal.

Semantic cache. Embed the request, look up similar prior requests with high similarity, return the cached response. Risky for tasks where small input changes matter, useful for FAQ-style customer support. Always include a “novelty floor” so genuinely new queries bypass the cache.
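One way to sketch the semantic cache, reading the “novelty floor” as a minimum cosine similarity below which a query counts as new and bypasses the cache. That reading, the `SemanticCache` class, the 0.95 default, and the `embed` callable are all assumptions for illustration; a real deployment would plug in an embedding model and a vector index rather than a linear scan.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], novelty_floor: float = 0.95):
        self.embed = embed                  # hypothetical embedding function
        self.novelty_floor = novelty_floor  # assumed threshold, tune per task
        self.entries: list[tuple[Vector, str]] = []

    def get(self, request: str) -> Optional[str]:
        """Return the closest cached response, or None for a novel query."""
        query = self.embed(request)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        # Below the floor the query is treated as new and bypasses the cache.
        if best_score >= self.novelty_floor:
            return best_response
        return None

    def put(self, request: str, response: str) -> None:
        self.entries.append((self.embed(request), response))
```

The floor is the knob that trades hit rate against the risk of returning a stale or mismatched answer, so it belongs in the eval suite alongside the model tiers.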

Don’t cache anything with user-specific output unless you partition by user.
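The exact-match pattern, with the per-user partitioning rule folded into the key, could look roughly like this. `ExactMatchCache` and its in-memory dict are illustrative assumptions; a production version would more likely sit on a shared store. Inputs are assumed to be JSON-serializable.

```python
import hashlib
import json
import time
from typing import Callable, Optional


class ExactMatchCache:
    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds  # chosen per use case
        self.clock = clock              # injectable so expiry can be tested
        self.store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def key(prompt: str, inputs: dict, user_id: Optional[str] = None) -> str:
        """Content hash of prompt + inputs, partitioned by user when given."""
        payload = json.dumps(
            {"prompt": prompt, "inputs": inputs, "user": user_id},
            sort_keys=True,  # stable key regardless of dict ordering
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key: str) -> Optional[str]:
        entry = self.store.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self.clock() >= expires_at:
            del self.store[key]  # expired: drop it and report a miss
            return None
        return response

    def put(self, key: str, response: str) -> None:
        self.store[key] = (self.clock() + self.ttl_seconds, response)
```

Passing `user_id` for any task with user-specific output means two users with identical inputs get different keys, so one user's response is never served to another.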

Model routing

The biggest model isn’t always the right model. A typical agent task has three subtasks:

Classification / routing — small model (Haiku, GPT-4o-mini, Gemma) handles this fine
Extraction / structured output — mid model
Open-ended reasoning — large model (Opus, GPT-5, Gemini 2.5)

A router agent (cheap model) inspects the request and picks the worker model. For most enterprise workloads, ~70% of tasks route to the cheap model. Cost drops by 5–10x with no measurable quality regression — measured against an eval set, not vibes.
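A rough sketch of the routing step. Here `keyword_router` is only a stand-in for the cheap router model, and the tier names `small`, `mid`, and `large` are placeholders rather than specific model identifiers; all names are hypothetical.

```python
from typing import Callable

# Subtask label -> worker tier, following the three subtasks listed above.
TIERS = {
    "classification": "small",
    "extraction": "mid",
    "reasoning": "large",
}


def keyword_router(request: str) -> str:
    """Stand-in for the router model: label the request by keyword."""
    text = request.lower()
    if any(word in text for word in ("classify", "categorize", "route")):
        return "classification"
    if any(word in text for word in ("extract", "parse")):
        return "extraction"
    if any(word in text for word in ("plan", "analyze", "why")):
        return "reasoning"
    return "unknown"


def pick_tier(
    request: str,
    fallback: str,
    router: Callable[[str], str] = keyword_router,
) -> str:
    """Return the worker tier for a request.

    `fallback` is a required argument so the tier used for unrecognized
    requests is an explicit decision rather than a silent default.
    """
    label = router(request)
    return TIERS.get(label, fallback)
```

For example, `pick_tier("Classify this email", fallback="mid")` returns `"small"`. Making the fallback explicit is a design choice: a config that quietly sends everything unrecognized to the largest tier is easy to miss until the bill arrives.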

What we monitor

For agent deployments via our AI & LLM integration service, the dashboard shows:

Cost per task (p50, p95, p99)
Cache hit rate by task type
Model mix (% routed to each tier)
Cap-abort rate (tasks that hit the budget cap)
Cost per business outcome (per resolved ticket, per processed invoice)

The last one is what the CFO cares about. The others are how you control it.
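As an illustration, three of these metrics can be computed from per-task records with the standard library. The `TaskRecord` fields and the `summarize` function are assumptions about what gets logged, not a description of the actual dashboard.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class TaskRecord:
    cost: float         # dollars spent on the task
    cap_aborted: bool   # True if the task hit its budget cap
    resolved: bool      # True if the task produced the business outcome


def summarize(records: list[TaskRecord]) -> dict:
    """Cost percentiles, cap-abort rate, and cost per business outcome.

    Needs at least two records for the percentile calculation.
    """
    costs = [r.cost for r in records]
    # quantiles(n=100) returns 99 cut points: index 49 is p50, 94 is p95.
    cuts = quantiles(costs, n=100, method="inclusive")
    outcomes = sum(r.resolved for r in records)
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "cap_abort_rate": sum(r.cap_aborted for r in records) / len(records),
        # Total spend (including failed tasks) divided by successful outcomes.
        "cost_per_outcome": sum(costs) / outcomes if outcomes else None,
    }
```

Cost per outcome divides total spend, failed and aborted tasks included, by the number of successes, which is why it can rise even when cost per task stays flat.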

The compounding problem

A single misconfigured prompt that doubles average tokens, multiplied by a 10x traffic ramp, multiplied by a routing config that defaults to the largest model — that’s how a $2k/month AI feature becomes a $200k/month problem. The fix isn’t one heroic optimization; it’s the three controls above, applied early.
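The multiplication is worth writing out. The token and traffic factors come from the paragraph above; the routing factor of 5x is an assumption, taken from the low end of the 5–10x routing range, and it happens to be the value that makes the figures line up.

```python
baseline_monthly = 2_000   # dollars per month, from the text
token_multiplier = 2       # misconfigured prompt doubles average tokens
traffic_multiplier = 10    # 10x traffic ramp
routing_multiplier = 5     # ASSUMPTION: low end of the 5-10x routing range

compounded = (
    baseline_monthly * token_multiplier * traffic_multiplier * routing_multiplier
)
print(compounded)  # 200000
```

No single factor looks alarming on its own; the 100x comes from the three multiplying together.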

What goes in week one

Before any agent feature ships:

Per-task cap configured, alert on cap-abort rate
Exact-match cache for deterministic stages
Routing implemented (router model + worker tiers)
Cost dashboard with per-task breakdown
Eval suite that scores quality across model tiers

Cost control isn’t a Phase 2 cleanup. It’s part of how you build the agent in the first place.

Most agent cost overruns aren’t bugs; they’re what happens when the three controls are missing or applied late. Our team installs cost-controlled agent stacks across enterprise rollouts. Tell us about the workflow.
