Model routing is the practice of dynamically selecting which language model handles each request based on task characteristics, cost constraints, and quality requirements. Instead of sending every request to a single model, a router matches each request to the cheapest model that can handle it effectively, often cutting costs by 40-70%.
Why Model Routing Matters
Most AI applications default to a single model — usually a frontier model like GPT-4o or Claude Sonnet — for all requests. This is like using a Ferrari for both race day and grocery shopping. Many requests are simple enough for a much cheaper model to handle perfectly.
In a typical AI application, the majority of requests are simple: short factual questions, reformatting, summarization, and similar tasks that a small model handles well. If you route that simple 60% to a model that costs 15x less, you cut your total spend by 50%+ with minimal quality impact.
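A quick back-of-the-envelope check of that claim, using the 15x price gap and 60/40 split from the text (per-request cost normalized so the frontier model costs 1.0):

```python
# Sanity-check the savings claim. The 15x gap and 60% simple share
# come from the text; absolute prices are normalized away.
frontier_cost = 1.0
small_cost = frontier_cost / 15      # "15x less"
simple_share = 0.60

baseline = frontier_cost             # everything on the frontier model
routed = (1 - simple_share) * frontier_cost + simple_share * small_cost
savings = 1 - routed / baseline

print(f"routed cost: {routed:.2f}x baseline, savings: {savings:.0%}")
```

Routing drops the blended cost to 0.44x the baseline, a 56% saving, consistent with the "50%+" figure above.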
How Model Routing Works
Rule-Based Routing
The simplest approach: define rules that match request characteristics to models.
Examples:
- Prompts under a token threshold with no code go to a small model
- Requests containing code or SQL go to a stronger model
- Premium-tier users always get the frontier model
Pros: Simple, predictable, easy to debug
Cons: Requires manual rule creation, doesn't adapt to new patterns
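A minimal rule-based router might look like the sketch below. The model names, keywords, and length threshold are illustrative assumptions, not recommendations:

```python
# Rule-based routing sketch: static rules map request traits to models.
CODE_KEYWORDS = ("def ", "class ", "function", "SELECT ")

def route(prompt: str, user_tier: str = "free") -> str:
    if user_tier == "premium":
        return "gpt-4o"                # premium users get the frontier model
    if any(k in prompt for k in CODE_KEYWORDS):
        return "gpt-4o"                # code-like requests need a stronger model
    if len(prompt) > 4000:             # assume long inputs are harder tasks
        return "gpt-4o"
    return "gpt-4o-mini"               # cheap default for everything else

print(route("Summarize this paragraph in one sentence."))  # gpt-4o-mini
```

Every branch is inspectable, which is what makes this approach easy to debug — and also why it needs manual upkeep as traffic patterns shift.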
Classifier-Based Routing
Use a fast, cheap classifier model to analyze each request and determine complexity before routing:
Pros: Adapts to diverse request types automatically
Cons: Adds latency (one extra LLM call), classifier can misroute
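One way to sketch this: a short prompt asks a cheap model to label the request, and the label picks the target model. `call_llm`, the label set, and the route table are assumptions standing in for your actual client and configuration:

```python
# Classifier-based routing sketch: a cheap model labels complexity first.
ROUTES = {"simple": "gpt-4o-mini", "moderate": "gpt-4o", "complex": "claude-sonnet"}

CLASSIFIER_PROMPT = (
    "Label the user request as simple, moderate, or complex. "
    "Reply with one word.\n\nRequest: {request}"
)

def classify_complexity(request: str, call_llm) -> str:
    label = call_llm(CLASSIFIER_PROMPT.format(request=request)).strip().lower()
    return label if label in ROUTES else "moderate"  # misroutes fall back safely

def route(request: str, call_llm) -> str:
    return ROUTES[classify_complexity(request, call_llm)]

# Usage with a stubbed classifier call:
print(route("What is 2+2?", lambda prompt: "simple"))  # gpt-4o-mini
```

Note the guard on unexpected labels: because the classifier is itself an LLM, the router must tolerate malformed output rather than crash.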
Cascading/Fallback Routing
Start with the cheapest model and escalate only if quality is insufficient:
Pros: Only pays premium prices when necessary
Cons: Higher latency for escalated requests, evaluation adds cost
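A cascade can be sketched as a loop over models ordered by price. `generate` and `score_quality` are stand-ins for a model call and an evaluator (an LLM judge or a heuristic check); the model list and threshold are assumptions:

```python
# Cascading/fallback routing sketch: cheapest model first, escalate on
# low quality. Ordered from cheapest to most expensive.
CASCADE = ["gpt-4o-mini", "gpt-4o", "claude-sonnet"]

def cascade_route(request, generate, score_quality, threshold=0.8):
    answer = None
    for model in CASCADE:
        answer = generate(model, request)
        if score_quality(request, answer) >= threshold:
            return model, answer       # good enough: stop escalating
    return CASCADE[-1], answer         # best effort from the top model
```

The cost trade-off is visible in the loop: an escalated request pays for two or three generations plus evaluations, which is why cascades suit traffic where most requests pass at the first tier.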
Embedding-Based Routing
Use embeddings to match incoming requests against a library of known request types:
Pros: Fast, no LLM call needed for routing
Cons: Requires building and maintaining the example library
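The approach can be sketched as nearest-neighbor lookup by cosine similarity. `embed` stands in for any embedding model, and the two-entry library with its labels is an illustrative assumption; in practice the library embeddings are computed once and cached:

```python
import numpy as np

# Embedding-based routing sketch: match the request to the most similar
# known example and reuse that example's model assignment.
LIBRARY = [
    ("reset my password", "gpt-4o-mini"),       # known-simple request type
    ("write a python script that", "gpt-4o"),   # known-complex request type
]

def route(request: str, embed) -> str:
    q = embed(request)
    best_model, best_sim = LIBRARY[0][1], -1.0
    for text, model in LIBRARY:
        v = embed(text)
        sim = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        if sim > best_sim:
            best_model, best_sim = model, sim
    return best_model                            # model of the nearest example
```

Since embedding lookup takes milliseconds, this keeps routing off the critical path — at the cost of curating a library that actually covers your traffic.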
Model Routing in Practice
Setting Up a Model Router
A basic model router consists of: a request analyzer (rules, a classifier, or embeddings), a model registry listing prices and capabilities, a routing policy that maps analysis results to a model, and a fallback path for failures or low-quality responses.
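Wired together, those pieces form a skeleton like the one below. The model names, prices, and the length-based analyzer are illustrative assumptions; the analyzer is where any of the four routing strategies would plug in:

```python
# Minimal router skeleton: registry + analyzer + policy + fallback.
REGISTRY = {
    "gpt-4o-mini": {"input_per_1m": 0.15, "output_per_1m": 0.60},
    "gpt-4o":      {"input_per_1m": 2.50, "output_per_1m": 10.00},
}

def analyze(request: str) -> str:
    # Stand-in for rules, a classifier, or embedding lookup.
    return "complex" if len(request) > 2000 else "simple"

POLICY = {"simple": "gpt-4o-mini", "complex": "gpt-4o"}
FALLBACK = "gpt-4o"

def route(request: str) -> str:
    model = POLICY.get(analyze(request), FALLBACK)
    return model if model in REGISTRY else FALLBACK  # never route to an unknown model
```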
Example Routing Configuration
Default: gpt-4o-mini ($0.15/$0.60 per 1M tokens)
Escalate to gpt-4o ($2.50/$10.00) when:
- Task is code generation
- Task requires multi-step reasoning
- User is on premium tier
- Quality score of mini response < 0.8
Escalate to claude-sonnet ($3/$15) when:
- Input exceeds 128K tokens
- Task requires creative writing
- Tool use with >5 functions
Use batch API when:
- Response not needed in real-time
- Task is content generation or evaluation

Measuring Routing Effectiveness
Track these metrics to validate your routing: