LLM Routing: How to Pick the Cheapest Model That Works
Routing each request to the smallest model that handles it well can cut costs 5-10x. This post covers the architecture and the decision logic.
LLM costs matter at scale. A naive deployment sends every request to the flagship model, which is expensive when most requests could be handled well by much cheaper models. LLM routing addresses this: send each request to the smallest model that handles it well, and fall back to bigger models for hard cases. Typical cost reduction is 5-10x at similar quality. This post walks through what is actually deployed in production.
Why routing matters
Price differences between model tiers in 2026:
Frontier models (Claude Opus 4.7, GPT-4.5 class, Gemini Ultra 2): the highest input and output token cost.
Mid-tier models (Claude Sonnet 4.6, GPT-4o, Gemini Pro 2): roughly 10-20% of frontier cost.
Small models (Claude Haiku 4.5, GPT-4o-mini, Gemini Flash, Llama 3.3 8B): cheaper still, frequently 50-100x cheaper than frontier per token.
The implication: for workloads where most requests are simple, sending everything to the frontier model overpays by a wide margin.
The routing patterns
Several routing patterns exist:
Rule-based routing. Classify requests by input characteristics such as length, language, and detected complexity. Simple to implement, but limited, because surface features are only a rough proxy for difficulty.
Classifier routing. A small ML classifier (sometimes itself a small LLM) sorts requests into difficulty buckets. Usually more accurate than hand-written rules.
Cascade routing. Try the small model first; if response confidence is low or the response fails a quality check, escalate to a larger model. The savings grow with the share of requests the small model handles.
Ensemble routing. Send the request to multiple models in parallel and pick the best response. High cost, high quality.
Skill-based routing. Use specific models for specific tasks: an embedding model for embeddings, a code model for code, a vision model for vision. Useful when you have specialized models.
Confidence-based routing. Use models that report uncertainty, and route low-confidence outputs to larger models.
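To make the simplest of these patterns concrete, here is a minimal sketch of a rule-based router. The tier labels, the keyword list, and the word-count thresholds are all hypothetical placeholders, not values from any real deployment; a production router would tune them against its own traffic.

```python
import re

# Hypothetical tier labels; map them to real model IDs in your own config.
SMALL, MID, FRONTIER = "small", "mid", "frontier"

# Illustrative keywords that hint at multi-step reasoning (an assumption, not a standard list).
REASONING_HINTS = re.compile(
    r"\b(prove|derive|step by step|trade-?offs?|refactor)\b", re.IGNORECASE
)


def route_by_rules(prompt: str, has_code: bool = False) -> str:
    """Pick a tier from cheap-to-compute input features."""
    word_count = len(prompt.split())
    # Reasoning keywords or very long inputs go straight to the largest model.
    if REASONING_HINTS.search(prompt) or word_count > 1500:
        return FRONTIER
    # Code or moderately long inputs go to the middle tier.
    if has_code or word_count > 200:
        return MID
    return SMALL
```

The sketch also shows the limitation of the pattern: a short prompt can still be hard, and no rule on length or keywords will catch that.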
The decision logic
For most production deployments, cascade routing is a strong default:
Tier 1 (cheap): A small, fast model makes the first attempt and handles a large share of requests.
Quality check. Evaluate the Tier 1 output with a confidence score, output structure validation, a simple LLM-as-judge check, or similar.
Tier 2 (mid): A mid-tier model for cases where Tier 1 was inadequate.
Tier 3 (frontier): The frontier model for cases where Tier 2 was inadequate.
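The tiers above can be sketched as a short loop. This is a minimal illustration under stated assumptions: a "model" is any callable from prompt to text, the quality check is any callable returning True or False, and the names `run_cascade` and `CascadeResult` are hypothetical, not part of any library.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed interfaces: wrap your real client calls to match these shapes.
Model = Callable[[str], str]
QualityCheck = Callable[[str, str], bool]  # (prompt, response) -> accept?


@dataclass
class CascadeResult:
    response: str
    tier: str      # name of the tier whose response was returned
    attempts: int  # number of models called


def run_cascade(
    prompt: str, tiers: List[Tuple[str, Model]], check: QualityCheck
) -> CascadeResult:
    """Try tiers cheapest-first and return the first response that passes the check."""
    if not tiers:
        raise ValueError("at least one tier is required")
    for attempts, (name, model) in enumerate(tiers, start=1):
        response = model(prompt)
        # The last tier is returned without a check: there is nothing left to escalate to.
        if attempts == len(tiers) or check(prompt, response):
            break
    return CascadeResult(response=response, tier=name, attempts=attempts)
```

With stub models such as `[("small", lambda p: ""), ("mid", lambda p: "42")]` and a check that rejects empty strings, the call escalates once and returns the mid-tier response. Real deployments would also bound total latency and log which tier answered.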
The key design choice is the accuracy of the quality check. False positives (sending good responses to higher tiers) waste money; false negatives (accepting bad responses) hurt quality.
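One way to put numbers on this trade-off is to run the quality check over a hand-labeled sample and count both kinds of error. The sketch below assumes you already have such a sample; the function name and the tuple layout are illustrative choices.

```python
from typing import Iterable, Tuple


def check_error_rates(samples: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Measure a quality check against human labels.

    Each sample is (check_accepted, actually_good).
    Returns (false_positive_rate, false_negative_rate), where a false positive
    is a good response that was escalated and a false negative is a bad
    response that was accepted.
    """
    good = bad = good_escalated = bad_accepted = 0
    for accepted, is_good in samples:
        if is_good:
            good += 1
            good_escalated += not accepted
        else:
            bad += 1
            bad_accepted += accepted
    fp_rate = good_escalated / good if good else 0.0
    fn_rate = bad_accepted / bad if bad else 0.0
    return fp_rate, fn_rate
```

Tracking both rates while adjusting the check's threshold shows how much extra spend buys how much quality, which is usually a more useful view than a single accuracy number.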
The tooling
Several tools support LLM routing:
Open-source:
- RouteLLM: an open-source routing framework
- Mixture-of-Agents patterns
- DSPy for compound systems
- LlamaIndex routing components
Commercial:
- Martian: a routing platform
- Portkey: an AI gateway with routing
- OpenRouter: multi-provider access with routing
- NotDiamond: routing-focused
- Vellum AI, Braintrust, Helicone: platforms whose features include routing
LLM gateways such as LiteLLM also support routing logic.
The quality dimensions
Routing decisions affect quality along several dimensions:
Accuracy. Did the routed model produce correct output?
Completeness. Did the routed model produce all required output?
Tone and style consistency. Different models have different styles, so routing can produce inconsistent style across responses.
Latency. Smaller models are typically faster, which improves UX.
Reliability. Different models have different failure modes.
Specific capabilities. Some models are markedly better at specific tasks (coding, math, reasoning, multilingual work, and so on).
A good routing system optimizes across these dimensions, not just cost.
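One simple way to optimize across dimensions rather than cost alone is to treat quality and latency as constraints and minimize cost among the models that satisfy them. The sketch below assumes you have measured per-model profiles on your own evaluation set; the `ModelProfile` fields and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelProfile:
    name: str
    accuracy: float          # measured on your own eval set, 0..1
    p95_latency_ms: float    # measured in your environment
    cost_per_request: float  # in whatever currency unit you track


def cheapest_meeting_constraints(
    profiles: List[ModelProfile], min_accuracy: float, max_latency_ms: float
) -> Optional[ModelProfile]:
    """Return the cheapest model that meets both constraints, or None if none does."""
    eligible = [
        p
        for p in profiles
        if p.accuracy >= min_accuracy and p.p95_latency_ms <= max_latency_ms
    ]
    return min(eligible, key=lambda p: p.cost_per_request, default=None)
```

Dimensions such as tone consistency or task-specific capability are harder to reduce to one number; in practice they tend to show up as separate profiles per task type rather than extra fields here.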
The production patterns
Several patterns recur in production deployments:
Up-front (offline) routing decisions. The router analyzes the request and selects a model before any call is made, rather than trying multiple models in real time.
Routing caches. Cache routing decisions and reuse them for similar inputs.
Fallback handling. When the routed model fails or times out, the request needs a defined fallback path.
Cost monitoring. Track cost in real time by model and route, so cost-versus-quality trade-offs stay visible.
A/B testing. Continuously test routing decisions: does sending more requests to a smaller model meaningfully affect quality?
Drift monitoring. Models evolve; routing decisions that worked yesterday may not work tomorrow.
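Two of these patterns, fallback handling and cost monitoring, fit naturally in the same wrapper. The following is a minimal sketch, assuming each model is a callable that raises an exception on failure or timeout and that a flat per-call cost is a good enough estimate; `RouteStats` and `call_with_fallback` are hypothetical names.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]  # assumed interface: raises on error or timeout


class RouteStats:
    """Accumulates request counts and spend per (route, model) pair."""

    def __init__(self) -> None:
        self.requests: Dict[Tuple[str, str], int] = defaultdict(int)
        self.cost: Dict[Tuple[str, str], float] = defaultdict(float)

    def record(self, route: str, model: str, cost: float) -> None:
        self.requests[(route, model)] += 1
        self.cost[(route, model)] += cost


def call_with_fallback(
    prompt: str,
    chain: List[Tuple[str, Model, float]],  # (model name, callable, est. cost per call)
    route: str,
    stats: RouteStats,
) -> str:
    """Call models in order, moving to the next one when a call raises."""
    last_error = None
    for name, model, cost_per_call in chain:
        try:
            response = model(prompt)
        except Exception as exc:  # includes TimeoutError raised by the client
            last_error = exc
            continue
        # Only successful calls are recorded here; some providers also bill failed calls.
        stats.record(route, name, cost_per_call)
        return response
    raise RuntimeError("all models in the fallback chain failed") from last_error
```

Keying the stats by route as well as by model is what makes the cost-versus-quality view possible: the same model can be cheap on one route and expensive on another.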
The cost calculations
Typical cost reduction from routing:
Naive deployment: All requests go to the frontier model. This is the cost baseline.
Basic routing (rule-based with 2 tiers): ~50-70% cost reduction at similar quality.
Advanced routing (cascade with 3 tiers, confidence-based): 80-90% cost reduction at similar quality.
Advanced routing with caching and prompt optimization: 95%+ cost reduction at similar quality for some workloads.
At scale, these savings justify the investment in routing.
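The arithmetic behind a cascade figure can be sketched directly. In a cascade, every request pays for Tier 1, and only the requests rejected there pay for the next tier as well. The relative costs below follow the rough ratios given earlier in this post (mid-tier at about 15% of frontier, small at about 1/50); the 70% acceptance rates are assumptions chosen purely for illustration, and the cost of the quality check itself is ignored.

```python
from typing import Sequence


def expected_cascade_cost(
    tier_costs: Sequence[float], accept_rates: Sequence[float]
) -> float:
    """Expected cost per request when every request starts at the first tier.

    tier_costs[i]   : cost of one call to tier i
    accept_rates[i] : fraction of requests reaching tier i that are accepted there
                      (the last tier always answers, so it has no rate)
    """
    if len(accept_rates) != len(tier_costs) - 1:
        raise ValueError("need one accept rate per tier except the last")
    total, reach = 0.0, 1.0
    for i, cost in enumerate(tier_costs):
        total += reach * cost  # everyone who reaches this tier pays for it
        if i < len(accept_rates):
            reach *= 1.0 - accept_rates[i]  # only rejected requests move on
    return total


# Costs relative to one frontier call (frontier = 1.0). Illustrative values only.
relative_costs = [0.02, 0.15, 1.0]
cost = expected_cascade_cost(relative_costs, accept_rates=[0.7, 0.7])
savings = 1.0 - cost / relative_costs[-1]
```

Under these assumptions the expected cost is about 0.155 of a frontier call, a reduction of roughly 84%. The result is sensitive to the Tier 1 acceptance rate: escalated requests pay for every tier they pass through, so a weak first tier can erase most of the savings.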
What we typically see at clients
Common patterns:
No routing. Most enterprises send everything to frontier models and overspend as a result.
Rule-based routing. Some sophistication, with considerably more value still available.
Cascade routing. Increasingly common at cost-conscious deployments.
Advanced routing systems. Found at large-scale deployments where cost is a major concern.
The trade-offs
Routing has costs of its own:
System complexity. More components, more failure modes.
Latency. Cascade routing adds latency whenever a request escalates.
Monitoring requirements. Routing needs its own dashboards and alerting.
Quality risk. Bad routing decisions hurt user experience.
The trade-offs are real, but the cost savings at scale typically justify them.
Where pdpspectra fits
Our AI integration practice builds production LLM systems with routing and cost optimization built in.
Related reading: the sub-100ms inference post, the LLM cost optimization post, and the AI red teaming post.
LLM routing is a major cost lever. Talk to our team about your AI cost optimization.