Last updated: May 2026
An LLM router is the routing-decision layer that picks which model handles each request — cheap for simple, expensive for hard. Published benchmarks (RouteLLM from LMSYS) hit 95% of GPT-4 quality at 26% of GPT-4 cost, translating to 74% production cost savings. Real-world deployments report 30-85% savings depending on traffic mix. This guide covers static rule-based routing (the right starting point), learned routers like RouteLLM / NotDiamond / Martian, open-source self-host options (LiteLLM, vLLM Semantic Router, OctoRouter), and how to actually deploy routing in May 2026.
For the broader proxy layer (auth, observability, caching), see LLM Gateway in 2026: OpenRouter vs LiteLLM vs Portkey. For per-model prices, see Groq Pricing and Best Open Source LLM 2026.
Router vs gateway — clearing the terms
The terms get used interchangeably. They are not the same thing:
- LLM router: decides which model handles a query (the routing logic).
- LLM gateway: the broader proxy layer that includes routing, observability, failover, caching, auth, virtual keys, and unified API translation.
In 2026, every production gateway (LiteLLM, Portkey, OpenRouter, Helicone) ships with built-in routing. Standalone routers (RouteLLM, NotDiamond, Martian) only do the routing decision and assume you have a gateway downstream. Most teams pick a gateway and use its routing; standalone routers matter when you need a sophisticated learned model that the gateway doesn't ship.
Why routing saves money
A 2026 production audit of a typical chat application shows the distribution of query difficulty:
- 60-75% of queries are simple: short questions, factual lookups, classification, summarization of small inputs, FAQ-style answers.
- 15-25% of queries are medium: longer reasoning chains, light tool use, mid-length code generation.
- 5-15% of queries are hard: complex reasoning, long-context analysis, multi-step agentic workflows, novel code generation.
Sending every query to GPT-5 ($2.50/M input) when 70% of them could land on Llama 3.1 8B ($0.05/M on Groq) is a 50x overpay on the cheap-query bucket. Static routing alone captures most of that savings.
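To put numbers on it, here is the blended input-token cost under the split above (a back-of-envelope sketch: the 70/30 split is illustrative, and output-token costs are ignored for simplicity):

```python
# Blended input-token cost under a 70/30 easy/hard split.
# Prices are $/1M input tokens, from the figures quoted above.
GPT5_PRICE = 2.50        # frontier model
LLAMA_8B_PRICE = 0.05    # Llama 3.1 8B on Groq

easy_share = 0.70        # fraction of queries a small model can handle

all_frontier = GPT5_PRICE                                   # send everything to GPT-5
routed = easy_share * LLAMA_8B_PRICE + (1 - easy_share) * GPT5_PRICE

print(f"all-frontier: ${all_frontier:.2f}/M tokens")
print(f"routed:       ${routed:.3f}/M tokens")         # $0.785/M
print(f"savings:      {1 - routed / all_frontier:.0%}")  # ~69% on input tokens alone
```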
Static (rule-based) routing — start here
Static routing uses if-then rules: query characteristics → model choice. Examples (a combined sketch follows this list):
- Token count: `if input_tokens < 200: model = "llama-3.1-8b-instant"`
- Keywords: `if "code" in query or "explain" in query: model = "claude-sonnet-4"`
- User tier: `model = "gpt-5" if user.plan == "pro" else "llama-3.3-70b"`
- Task type: route embedding requests to a small embedding model, chat to a fast LLM, and code generation to a code-specialist model.
- Fallback chain: try Groq Llama 3.3 70B first; on rate limit, fall back to Together Llama 3.3 70B; on timeout, fall back to OpenAI GPT-4o.
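A minimal sketch combining these rules into one function. The model names are the ones used in the examples above; the threshold and keyword list are illustrative, not tuned:

```python
# Minimal static router: first matching rule wins, ordered by priority.
# Thresholds and keywords are illustrative; tune them against your own traffic.

CODE_KEYWORDS = ("code", "explain", "refactor", "debug")

def route(query: str, input_tokens: int, user_plan: str = "free") -> str:
    if user_plan == "pro":                        # paid users always get the frontier model
        return "gpt-5"
    if any(kw in query.lower() for kw in CODE_KEYWORDS):
        return "claude-sonnet-4"                  # code/explanation work to a stronger model
    if input_tokens < 200:                        # short queries to the cheap fast model
        return "llama-3.1-8b-instant"
    return "llama-3.3-70b"                        # default mid-tier

# Fallback chain: ordered (provider, model) pairs to try on rate limits/timeouts.
FALLBACKS = [
    ("groq", "llama-3.3-70b"),
    ("together", "llama-3.3-70b"),
    ("openai", "gpt-4o"),
]
```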
Static rules typically capture 60-70% of available savings with zero training cost and full debuggability. You can read your routing config and know exactly why a request went to a particular model.
LiteLLM, Portkey, and OpenRouter all support rule-based routing natively via config files. For most teams under $50K/month LLM spend, static is the right answer.
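For example, LiteLLM's Python Router handles load balancing and fallbacks directly. A sketch, assuming the usual LiteLLM provider prefixes; check LiteLLM's provider docs for the exact model slugs:

```python
# LiteLLM Router: two providers behind one alias, plus a cross-provider fallback.
# pip install litellm; set GROQ_API_KEY / TOGETHER_API_KEY / OPENAI_API_KEY.
from litellm import Router

router = Router(
    model_list=[
        # Two deployments share the alias "fast-chat"; LiteLLM load-balances across them.
        {"model_name": "fast-chat",
         "litellm_params": {"model": "groq/llama-3.3-70b-versatile"}},
        {"model_name": "fast-chat",
         "litellm_params": {"model": "together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo"}},
        {"model_name": "gpt-4o",
         "litellm_params": {"model": "openai/gpt-4o"}},
    ],
    fallbacks=[{"fast-chat": ["gpt-4o"]}],   # on failure, retry against GPT-4o
)

resp = router.completion(
    model="fast-chat",
    messages=[{"role": "user", "content": "Summarize this ticket in one line: ..."}],
)
print(resp.choices[0].message.content)
```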
Learned routing — when static caps out
Learned routers train a classifier on labeled query data to predict which model will produce a good answer. The classifier sees each incoming query and outputs a routing decision.
The trade-off: learned routers add 15-25% savings on top of static rules, at the cost of training data, evaluation infrastructure, and retraining cadence.
Approaches:
- Preference-based: Train a classifier on human-rated comparisons between strong and weak models on real queries (RouteLLM's matrix factorization approach).
- Embedding similarity: Embed the query, look up the best model from a learned routing table indexed by query embedding (vLLM Semantic Router, Semantic Router from Aurelio AI); a minimal version is sketched after this list.
- Causal LLM classifier: Use a small LLM to classify the query into a routing category.
- Multi-armed bandit: Online learning in production: explore-exploit across models, preferring the cheapest model that still meets the quality threshold.
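A bare-bones version of the embedding-similarity approach: embed a few labeled example queries per tier, then route each incoming query to the tier whose centroid it sits closest to. The seed queries and model names are illustrative, and sentence-transformers stands in for whatever embedding model you run:

```python
# Embedding-similarity routing via nearest labeled centroid.
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # small local embedding model

SEEDS = {
    "llama-3.1-8b-instant": [            # easy tier
        "what time zone is tokyo in",
        "summarize this paragraph",
        "is this email spam or not",
    ],
    "gpt-5": [                           # hard tier
        "design a database schema for multi-tenant billing",
        "prove this algorithm terminates",
        "refactor this module to remove the circular dependency",
    ],
}

# Precompute one normalized centroid per tier.
centroids = {
    model: np.mean(encoder.encode(examples, normalize_embeddings=True), axis=0)
    for model, examples in SEEDS.items()
}

def route(query: str) -> str:
    q = encoder.encode([query], normalize_embeddings=True)[0]
    # Cosine similarity reduces to a dot product on normalized vectors.
    return max(centroids, key=lambda m: float(q @ centroids[m]))

print(route("what's the capital of peru"))        # -> cheap model (likely)
print(route("derive the gradient of this loss"))  # -> frontier model (likely)
```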
Published benchmarks (RouteLLM from LMSYS):
- MT-Bench: 85% cost reduction at 95% GPT-4 quality.
- MMLU: 45% cost reduction at 95% quality.
- GSM8K (math): 35% cost reduction at 95% quality.
Cost reduction varies by workload — simple chat (MT-Bench) routes far more queries to the cheap model than math-heavy work (GSM8K). The 35-85% range is real and consistent across reports.
The notable learned routers (May 2026)
RouteLLM (LMSYS, MIT licensed, open source)
The reference implementation. Four router types (similarity-weighted, matrix factorization, causal LLM classifier, BERT classifier). Trained on Chatbot Arena preference data. Drops in as an OpenAI-compatible endpoint. The matrix-factorization router's 74% savings at 95% quality is the published benchmark. Generalizes to model pairs other than its training pair (GPT-4 / Mixtral) without retraining.
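Usage follows the pattern in the RouteLLM README: the Controller exposes an OpenAI-style interface, and the "model" string encodes the router type plus a cost threshold. A sketch; the strong/weak model pair here is an assumption (the README's examples use GPT-4 / Mixtral), and the exact threshold should come from RouteLLM's calibration tooling:

```python
# RouteLLM matrix-factorization (mf) router behind an OpenAI-style client.
# pip install "routellm[serve,eval]"
from routellm.controller import Controller

client = Controller(
    routers=["mf"],                           # matrix-factorization router
    strong_model="gpt-4o",                    # expensive model (assumption)
    weak_model="groq/llama-3.1-8b-instant",   # cheap model (assumption)
)

response = client.chat.completions.create(
    # "router-mf-<threshold>": the threshold sets the cost/quality trade-off;
    # calibrate it on your own traffic rather than reusing this value.
    model="router-mf-0.11593",
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)
print(response.choices[0].message.content)
```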
NotDiamond (commercial SaaS)
Closed-source smart router with the strongest commercial offering. Trains custom classifiers per customer based on production traffic. Strong UX, good evaluation tooling. Pricing scales with request volume.
Martian (commercial)
Similar positioning to NotDiamond. Focuses on real-time routing with low overhead. Closed-source.
LLMRouter (ulab-uiuc, open source, December 2025)
Cost-aware framework with 16+ router implementations, unified CLI, Gradio UI. Newer than RouteLLM with broader benchmark coverage.
Inworld AI Router
Commercial router targeted at agentic / character AI workloads. Includes quality-aware routing for tool-use reliability.
OpenRouter's auto routing
OpenRouter (the gateway) has an auto model that routes among 300+ catalog models using OpenRouter's internal classifier. Convenient but opaque — you don't see the routing logic, and savings depend on OpenRouter's choice of model pair per query.
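Trying it is a one-line change on any OpenAI-compatible client; `openrouter/auto` is OpenRouter's published auto-router slug:

```python
# OpenRouter auto-routing via the standard OpenAI SDK: point the client at
# OpenRouter's API and request the "openrouter/auto" model slug.
# pip install openai
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="openrouter/auto",   # OpenRouter picks the underlying model per request
    messages=[{"role": "user", "content": "Classify this ticket: 'refund not received'"}],
)
print(resp.choices[0].message.content)
print(resp.model)  # the model that actually served the request
```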
Open-source self-host options
LiteLLM router (MIT, easiest)
Rule-based routing via config. Fallback chains, load balancing across keys, budget routing (route to cheapest non-rate-limited provider). The default choice for teams that want routing without operational overhead.
vLLM Semantic Router (open source)
Routes by embedding similarity, no external API calls (ONNX embeddings run locally). Fits well in vLLM-served self-hosted inference stacks. Apache-style license.
OctoRouter (open source)
LLM gateway focused on cost optimization with semantic routing. Per-provider budget controls, Redis-backed state sharing, granular policies.
RouteLLM (MIT, learned router)
The framework if you have preference data and want the published 74% savings result.
Semantic Router (Aurelio AI, MIT)
Lightweight Python library for routing decisions via embedding similarity. Simpler than vLLM Semantic Router; runs anywhere.
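A sketch of the library's classic pattern: define Routes from example utterances, then classify queries by embedding similarity. Note the API has shifted across versions (pre-1.0 releases expose RouteLayer; newer ones rename it SemanticRouter), so treat this as the shape of the API, not the exact current signatures:

```python
# semantic-router (Aurelio AI): routes defined by example utterances.
# pip install semantic-router  (pre-1.0 RouteLayer-style API shown)
from semantic_router import Route
from semantic_router.encoders import OpenAIEncoder
from semantic_router.layer import RouteLayer

code = Route(name="claude-sonnet-4", utterances=[
    "write a python function that",
    "debug this stack trace",
])
chat = Route(name="llama-3.1-8b-instant", utterances=[
    "what's the weather like in",
    "give me a quick summary of",
])

layer = RouteLayer(encoder=OpenAIEncoder(), routes=[code, chat])
decision = layer("refactor this loop into a comprehension")
print(decision.name)  # route names double as model ids in this sketch
```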
How to deploy routing — the practical sequence
- Audit your current traffic. Sample 500-1000 production queries. Classify by hand into easy / medium / hard. Compute current spend per bucket.
- Start with static rules. Add a router config to LiteLLM / Portkey / OpenRouter: token-count threshold, keyword routing, user-tier routing, fallback chain. Deploy.
- Measure. Track cost per bucket and quality metrics (user satisfaction, downstream task success, eval set scores). Compare to baseline.
- Iterate static rules. Most of the savings show up here. Common refinements: route by request system prompt (specific feature → specific model), by user history (paid vs free), or by detected language.
- Add learned routing only if static caps out. When static rules show diminishing returns and you have labeled data, deploy RouteLLM or NotDiamond alongside the gateway. Treat learned routing as an enhancement to static rules, not a replacement.
- Add semantic caching. Routing and caching stack: a semantic cache hit skips the model entirely and costs nothing. This often delivers more savings than learned routing (a minimal cache is sketched below).
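An illustrative in-memory semantic cache; production setups usually back this with Redis or a vector database, and the 0.92 threshold needs tuning against your traffic:

```python
# Semantic cache: before routing at all, check whether a sufficiently
# similar query was already answered and reuse that answer.
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
THRESHOLD = 0.92                 # cosine similarity required for a hit; tune this

_keys: list[np.ndarray] = []     # normalized query embeddings
_values: list[str] = []          # cached responses

def cache_get(query: str) -> str | None:
    if not _keys:
        return None
    q = encoder.encode([query], normalize_embeddings=True)[0]
    sims = np.stack(_keys) @ q   # dot product = cosine on normalized vectors
    best = int(np.argmax(sims))
    return _values[best] if sims[best] >= THRESHOLD else None

def cache_put(query: str, response: str) -> None:
    _keys.append(encoder.encode([query], normalize_embeddings=True)[0])
    _values.append(response)

cache_put("what is the capital of france", "Paris.")
print(cache_get("capital of france?"))   # likely a hit: zero model cost
```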
Quality monitoring is the missing piece
Routers without quality monitoring eventually break. The cheap model drifts in capability (model deprecations, training updates), and the router happily keeps sending it queries it can no longer answer well.
Mitigations:
- Eval set in CI. Hold out 100-500 labeled production queries. Re-run after every routing change. Track win rate against the all-frontier-model baseline (a skeleton harness follows this list).
- User feedback loop. Thumbs-up / thumbs-down on routed responses. Sample low-rated responses and inspect routing decisions.
- Periodic re-routing audit. Monthly check: would the current router still pick the same model given today's model lineup and pricing? Re-train or update static rules quarterly.
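A skeleton for the CI eval. Everything here is a stand-in: `route` and `complete` represent your router and gateway client, `judge_wins` represents whatever scoring you use (an LLM judge, exact match, task success), and the JSONL schema is hypothetical:

```python
# Holdout eval: run each labeled query through the router and through the
# all-frontier baseline, then report cheap-route share and quality win rate.
import json

def run_eval(holdout_path: str, route, complete, judge_wins) -> dict:
    routed_wins, cheap_hits, total = 0, 0, 0
    with open(holdout_path) as f:
        for line in f:
            case = json.loads(line)                      # {"query": ..., "reference": ...}
            model = route(case["query"])                 # your routing function
            routed = complete(model, case["query"])
            baseline = complete("gpt-5", case["query"])  # all-frontier baseline
            # judge_wins returns True if the routed answer is at least as good.
            routed_wins += judge_wins(routed, baseline, case.get("reference"))
            cheap_hits += model != "gpt-5"
            total += 1
    return {
        "win_rate_vs_baseline": routed_wins / total,
        "share_routed_cheap": cheap_hits / total,
    }

# Gate the deploy: fail CI if win_rate_vs_baseline drops below your floor (e.g. 0.95).
```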
Common mistakes when deploying a router
- Routing without measuring. "We added a router, costs are down" without quality comparison is how you ship a worse product. Always evaluate against a holdout set.
- Trying learned routing first. Static rules plus caching capture 70-80% of available savings with zero ML overhead.
- Picking models too far apart. Routing between GPT-5 ($2.50/M) and a 1B model ($0.005/M) almost always hurts quality. Pick neighboring quality tiers (GPT-5 / Llama 4 Maverick, or Claude Sonnet 4 / Llama 3.3 70B).
- Ignoring latency. Cheap models often run faster (Groq Llama 8B at 840 TPS), but some "cheap" model paths add latency (cold start, slow provider). Measure end-to-end TTFT (see the sketch after this list).
- No fallback chain. A router that picks Groq with no fallback breaks every time Groq has an incident. Always configure a fallback to a different provider (Together, Fireworks, OpenAI).
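Measuring TTFT takes a few lines with any streaming OpenAI-compatible client; this sketch works against whichever base_url and model your gateway exposes:

```python
# End-to-end time-to-first-token: start a streaming request and time the
# arrival of the first content chunk.
# pip install openai
import time
from openai import OpenAI

def measure_ttft(client: OpenAI, model: str, prompt: str) -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start   # seconds to first visible token
    return float("nan")

client = OpenAI()  # point base_url at your gateway to time the routed path
print(f"TTFT: {measure_ttft(client, 'gpt-4o', 'Say hi'):.3f}s")
```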
When routing is not worth it
- All your queries genuinely need GPT-5 / Claude / Opus capability. Some workloads (legal analysis, frontier-research synthesis) really do need the top model. Router adds no value.
- You're under $200/month total LLM spend. Operational overhead exceeds savings.
- Single-model dependency. You picked Claude for specific tool-use reliability. Routing to anything else degrades the feature.
- Strict consistency requirement. A/B testing models within a session can break UX consistency. Lock the model per session if needed.