Last updated: May 2026
An LLM gateway is a control plane that sits between your application and LLM providers, exposing a unified API while handling routing, failover, observability, cost tracking, and caching. It is the layer most production AI apps eventually grow into — and the layer most prototypes can skip. This guide explains what an LLM gateway actually does, the three signals that tell you it's time to add one, and how the top six options — OpenRouter, LiteLLM, Portkey, Helicone, Cloudflare AI Gateway, and Vercel AI Gateway — compare in May 2026.
If you're still picking which model to call in the first place, see Best Open Source LLM 2026 — Model Comparison. For free tiers across providers, see Free LLM API Credits.
LLM gateways at a glance (May 2026)
| Gateway | License | Self-host | Models | Pricing at $1K spend |
|---|---|---|---|---|
| OpenRouter | Hosted SaaS | No | 300+ | $1,055 (~5.5% markup) |
| LiteLLM | MIT | Yes | 100+ providers | $0 + server (~$20-50/mo) |
| Portkey | Apache 2.0 (open since Mar 2026) | Yes | 250+ | $0 self-host / $49 hosted |
| Helicone AI Gateway | MIT | Yes | 100+ | Free tier 100K req/mo |
| Cloudflare AI Gateway | Free tier | No | All major providers | Free + $5/$1M requests |
| Vercel AI Gateway | Hosted | No | All major providers | Bundled with Vercel |
"Models" counts mean different things — OpenRouter and Portkey count individual model variants (Llama-3.3-70B, Llama-3.3-405B as separate entries), LiteLLM counts upstream provider integrations.
What an LLM gateway actually does
Strip away the marketing and a gateway does six things:
- Unified API. One OpenAI-compatible endpoint, regardless of upstream provider. Your app calls gateway.com/v1/chat/completions with model anthropic/claude-3.7-sonnet or openai/gpt-5 or qwen/qwen-3.5-397b — the gateway translates.
- Routing. Rule-based or quality-aware routing. Send simple queries to cheap models, hard queries to expensive ones. Some gateways (Inworld, Notdiamond) ship learned routers that pick the cheapest model that will produce a good enough answer.
- Failover and load balancing. If OpenAI is down or rate-limiting you, retry on Azure OpenAI, then Anthropic, then a self-hosted Qwen. Spread load across multiple keys to stay under per-key quotas.
- Cost tracking and budget enforcement. Per-team, per-user, per-feature spend attribution. Virtual API keys with monthly budget caps. Block requests when a team hits its limit.
- Observability. Every request logged with model, tokens, latency, cost, prompt, completion. Trace IDs, prompt versions, A/B test cohorts. The data you need to debug "why is the AI feature suddenly slow / wrong / expensive."
- Caching. Exact-match (same prompt, same params) and semantic (similar prompt). Semantic caching alone can cut spend 20-40% on repetitive workloads (FAQ chat, content moderation, classification).
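Caching is the easiest of the six to picture in code. Below is a minimal semantic-cache sketch, assuming the OpenAI Python SDK, its embeddings endpoint, and an in-memory cache; the 0.95 similarity threshold and the model names (borrowed from the examples above) are illustrative, and a real gateway adds eviction, TTLs, and a proper vector index.

```python
# Minimal semantic-cache sketch: embed the prompt, reuse a cached completion if a
# previously seen prompt is similar enough, otherwise call the model and store it.
import numpy as np
from openai import OpenAI

client = OpenAI()
cache: list[tuple[np.ndarray, str]] = []  # (prompt embedding, cached completion)

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cached_completion(prompt: str, threshold: float = 0.95) -> str:
    vec = embed(prompt)
    for cached_vec, cached_answer in cache:
        sim = float(vec @ cached_vec / (np.linalg.norm(vec) * np.linalg.norm(cached_vec)))
        if sim >= threshold:
            return cached_answer  # semantic hit: a similar prompt was already answered
    resp = client.chat.completions.create(
        model="gpt-5",  # illustrative model name from the article's examples
        messages=[{"role": "user", "content": prompt}],
    )
    answer = resp.choices[0].message.content
    cache.append((vec, answer))
    return answer
```

Exact-match caching is the same loop with a hash lookup instead of a similarity scan.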
Production deployments that combine routing + caching report 30-85% cost reductions.
When you need one (and when you don't)
Three signals that say it's time:
- You're calling more than one provider. OpenAI + Anthropic + an open model. The SDK boilerplate is duplicated, the auth is duplicated, the retry logic is duplicated. A gateway collapses all of that.
- Cost or latency is unpredictable in production. You can't answer "why did our LLM bill jump 40% last week?" or "which feature is slow?". You need per-request observability.
- You need failover. A single provider outage breaks your product. You want to route around incidents automatically.
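That failover logic is exactly what teams otherwise hand-roll. A rough sketch of the DIY version, assuming the OpenAI Python SDK and a second OpenAI-compatible endpoint (Azure OpenAI, a vLLM deployment, etc.); the environment variable names are placeholders:

```python
# Hand-rolled failover across two OpenAI-compatible endpoints; a gateway's
# retry/fallback policy replaces this (and handles per-key load balancing too).
import os
from openai import OpenAI, APIError, RateLimitError

PROVIDERS = [
    OpenAI(api_key=os.environ["OPENAI_API_KEY"]),  # primary
    OpenAI(                                        # fallback: any OpenAI-compatible endpoint
        base_url=os.environ["FALLBACK_BASE_URL"],
        api_key=os.environ["FALLBACK_API_KEY"],
    ),
]

def complete_with_failover(messages: list[dict], model: str = "gpt-5") -> str:
    last_error = None
    for provider in PROVIDERS:
        try:
            resp = provider.chat.completions.create(model=model, messages=messages)
            return resp.choices[0].message.content
        except (RateLimitError, APIError) as exc:
            last_error = exc  # provider down or rate-limiting: try the next one
    raise last_error  # every provider failed
```

In practice the fallback usually needs its own model name and timeout budget, which is why this logic belongs in a gateway rather than scattered through application code.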
Three signals to skip the gateway for now:
- Single-provider prototype. You're calling only OpenAI or only Anthropic. The OpenAI SDK plus a logging line is enough (the sketch after this list shows how little that is).
- Pure batch workloads. Embedding generation, classification, batch summarization. Provider-native SDKs work fine; a gateway adds latency without much upside.
- You're under $200/month total spend. The gateway operational cost (or hosted markup) is comparable to your provider spend. Wait until volume justifies it.
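For reference, here is roughly what "the OpenAI SDK plus a logging line" means in practice: a sketch with the OpenAI Python SDK, where the log fields mirror what a gateway would capture for you.

```python
# Single-provider setup: one SDK call plus a log line with model, tokens, latency.
import logging
import time
from openai import OpenAI

client = OpenAI()
log = logging.getLogger("llm")

def ask(prompt: str) -> str:
    start = time.monotonic()
    resp = client.chat.completions.create(
        model="gpt-5",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    log.info("model=%s total_tokens=%s latency=%.2fs",
             resp.model, resp.usage.total_tokens, time.monotonic() - start)
    return resp.choices[0].message.content
```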
OpenRouter — the easiest start
OpenRouter is the hosted SaaS market leader. 300+ models from 60+ providers, single API key, OpenAI-compatible. You add a credit balance, point your SDK at https://openrouter.ai/api/v1, and you can call Claude / GPT / Gemini / Llama / DeepSeek / Qwen / Mistral with a model-string change.
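A minimal call looks like this, assuming the official OpenAI Python SDK and an OPENROUTER_API_KEY in your environment (the model slug is the Claude example used earlier; check the live catalog for current slugs):

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
resp = client.chat.completions.create(
    model="anthropic/claude-3.7-sonnet",  # swap this slug to change provider/model
    messages=[{"role": "user", "content": "Summarize this ticket in one sentence."}],
)
print(resp.choices[0].message.content)
```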
Pricing: pay-per-request at provider cost plus a small markup (roughly 5.5% on hosted models; a free tier covers some open models). At $1,000 of provider-cost spend, you pay roughly $1,055.
Strengths: the largest catalog, the easiest onboarding, generous free tier on open-weight models (Llama 4, Qwen 3.5, DeepSeek V4 Flash, GLM-5). Excellent for prototyping and testing models.
Weaknesses: hosted-only, no self-host option. The markup adds up at scale (~$5K/year markup at $100K/year provider spend). No production guardrails (PII, jailbreak detection). Limited routing rules — quality-based routing is opt-in via a separate model variant.
Pick OpenRouter if: you want to try many models, you're a small team, you don't want to run infra, your monthly spend is under $5K.
LiteLLM — the open-source default
LiteLLM is the most widely adopted open-source LLM gateway. MIT licensed, 100+ provider integrations, OpenAI-compatible Python proxy. Run it as a Docker container, point your SDK at http://litellm:4000, done.
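Because the proxy is OpenAI-compatible, client code barely changes; only the base URL and the key do. A sketch, assuming a proxy reachable at http://litellm:4000 and a virtual key issued by it (model names are whatever aliases your proxy config defines):

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="http://litellm:4000",              # your LiteLLM proxy
    api_key=os.environ["LITELLM_VIRTUAL_KEY"],   # a per-team virtual key, not a provider key
)
resp = client.chat.completions.create(
    model="gpt-5",  # must match a model alias defined in the proxy's config
    messages=[{"role": "user", "content": "Classify this support ticket."}],
)
print(resp.choices[0].message.content)
```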
Pricing: free. You pay the LLM providers directly (zero markup) plus your server hosting (~$20-50/month for a small VM). At $1,000/month provider spend, total cost is ~$1,025.
Strengths: zero markup, virtual key system (each team gets a virtual API key with a configurable monthly budget), audit logging, Prometheus metrics, no vendor lock-in, runs anywhere Docker runs. Latency overhead ~8ms P95 at 1,000 RPS.
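Issuing those virtual keys is an API call against the proxy itself. The sketch below follows LiteLLM's key-management endpoint as documented at the time of writing; treat the path and field names (/key/generate, max_budget, team_id) as assumptions to verify against your proxy version.

```python
import os
import requests

# Issue a budget-capped virtual key for a team. Endpoint path and field names are
# assumptions drawn from LiteLLM's key-management docs; verify before relying on them.
resp = requests.post(
    "http://litellm:4000/key/generate",
    headers={"Authorization": f"Bearer {os.environ['LITELLM_MASTER_KEY']}"},
    json={"team_id": "search-team", "max_budget": 200.0},  # cap this key at $200
    timeout=10,
)
print(resp.json())  # response includes the new virtual key to hand to the team
```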
Weaknesses: requires engineering ownership — you run it, you monitor it, you upgrade it. Observability is bare-bones out of the box (integrates well with Langfuse, Helicone, Datadog for richer dashboards). UI is functional, not polished.
Pick LiteLLM if: you have engineering capacity to run it, you want zero markup, you need virtual keys for team budgets, you want to keep all prompt data inside your infrastructure.
Portkey — open-source production gateway
Portkey open-sourced its entire gateway under Apache 2.0 in March 2026, making it the most production-feature-complete open-source option in May 2026. 250+ models, guardrails (PII redaction, jailbreak detection, prompt injection filters), semantic caching (fuzzy matching), prompt management with versioning, fine-grained routing rules.
Pricing: free self-host (open source); managed version starts at ~$49/month at $1K provider spend, scales with usage.
Strengths: the best out-of-the-box production safety. PII redaction is hard to build correctly; Portkey ships it. Audit trails for regulated industries (healthcare HIPAA, finance SOC 2). Strong semantic caching. Good observability dashboards.
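To see why "ships it" matters, here is the DIY baseline most teams start from: a naive regex pass (purely illustrative, not how Portkey implements it). It catches obvious emails and phone numbers and misses names, addresses, and anything unstructured, which is exactly the hard part.

```python
import re

# Naive PII redaction: obvious patterns only. Free-text identifiers (names,
# addresses, account numbers in prose) are what make the real problem hard.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}_REDACTED]", text)
    return text

print(redact("Reach me at jane.doe@example.com or +1 (555) 010-2333"))
```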
Weaknesses: more complex setup and a smaller community than LiteLLM (faster-moving, but less battle-tested with uncommon provider combinations).
Pick Portkey if: you're in a regulated industry, you need guardrails, you want a single tool that does routing + safety + observability without integrating three things.
Helicone — observability-first
Helicone is purpose-built for LLM observability. MIT licensed, free tier covers 100K requests/month. It can act as a proxy gateway (Helicone AI Gateway) or sit beside an existing router as a logging layer.
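The proxy integration is a base-URL and header change. A sketch, assuming Helicone's OpenAI passthrough endpoint and header names as documented at the time of writing (treat the exact header names as assumptions to verify):

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",  # requests flow through Helicone, then to OpenAI
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
        "Helicone-User-Id": "user-1234",    # powers per-user cost attribution
    },
)
resp = client.chat.completions.create(
    model="gpt-5",  # illustrative model name
    messages=[{"role": "user", "content": "Hello"}],
)
```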
Strengths: the best observability UI in the open-source space. Per-request logging, cost attribution by user/feature, latency breakdowns, prompt versioning and A/B testing, structured search across millions of requests. Rust-based gateway runtime is the lowest-overhead in the field.
Weaknesses: routing rules are less rich than Portkey or LiteLLM. Helicone leans into being the observability layer; for complex routing logic, pair it with LiteLLM.
Pick Helicone if: you already have a router (or use a single provider) and just need deep visibility. Or if observability is the highest-priority feature.
Cloudflare AI Gateway — the platform-bundled option
If your app already runs on Cloudflare Workers / Pages / R2, Cloudflare AI Gateway is the no-friction option. It sits in front of any major LLM provider (OpenAI, Anthropic, Google, HuggingFace, Replicate, Workers AI) and adds global edge caching (geographic, per-region cache hits). Free tier; $5 per million requests after that.
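Integration is again a base-URL change. The URL shape below follows Cloudflare's documented pattern at the time of writing; the account and gateway IDs are placeholders, so verify the exact path in your dashboard:

```python
import os
from openai import OpenAI

ACCOUNT_ID = os.environ["CF_ACCOUNT_ID"]   # placeholder: your Cloudflare account ID
GATEWAY_ID = os.environ["CF_GATEWAY_ID"]   # placeholder: the gateway name you created

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=f"https://gateway.ai.cloudflare.com/v1/{ACCOUNT_ID}/{GATEWAY_ID}/openai",
)
resp = client.chat.completions.create(
    model="gpt-5",  # illustrative model name
    messages=[{"role": "user", "content": "Hello"}],
)
```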
Strengths: zero-config integration with Cloudflare Workers. Global edge caching for free. Generous free tier. Built-in analytics dashboard.
Weaknesses: only routes to upstream providers — no self-hosted model support. Routing rules are simpler than LiteLLM / Portkey. Lock-in to Cloudflare.
Pick Cloudflare AI Gateway if: your app is already on Cloudflare and you want one less thing to manage.
Vercel AI Gateway — Next.js / Vercel native
Bundled with Vercel. If you ship a Next.js app via Vercel, the AI Gateway integrates with the AI SDK and adds routing + observability with minimal setup. Same strengths and weaknesses pattern as Cloudflare: zero friction inside the platform, lock-in outside it.
Pick Vercel AI Gateway if: your frontend is on Vercel and you use the Vercel AI SDK.
How to choose — by team profile
- Solo dev / hobby project: OpenRouter. The free tier on open models covers most prototyping; pay-per-request scales without surprise.
- Small startup (1-10 engineers, $1K-$5K monthly LLM spend): OpenRouter for prototyping, then LiteLLM in production once spend justifies the engineering cost.
- Growing company (10-50 engineers, $5K-$50K monthly LLM spend): LiteLLM with Helicone for observability. Or Portkey if you want a single stack.
- Regulated industry (any size): Portkey self-hosted. Non-negotiable PII guardrails, audit trails, on-prem deployment.
- Enterprise with platform commitments: use the gateway that lives where your app does. Cloudflare for Cloudflare-first apps, Vercel for Vercel-first, otherwise LiteLLM or Portkey.
Common mistakes when picking an LLM gateway
- Adding a gateway too early. A single-provider prototype doesn't need one. Wait until the cost or duplicate-SDK pain is real.
- Picking based on model count alone. "300+ models" is impressive but you'll call 3-5 in production. Pick on production features (observability, guardrails, routing rules).
- Trying to do learned routing first. Static routing rules ("simple queries -> cheap model, hard ones -> expensive model") cover 80% of cost savings; a sketch follows this list. Learned routers add complexity and only pay off at very high volume.
- Skipping virtual keys. If multiple teams share LLM infra, virtual keys + budget caps prevent one team from burning the shared budget. LiteLLM and Portkey both ship this; OpenRouter does not.
- No fallback strategy. Every gateway supports failover; few teams configure it. When OpenAI has an outage, your gateway is only useful if you've set up Anthropic / open-weight fallback.
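A static router really can be this small. The sketch below uses placeholder model slugs and an arbitrary length-plus-keyword heuristic; the point is that a rule this simple, sent through whichever gateway you chose, captures most of the savings.

```python
# Static routing heuristic: route "hard" prompts to a frontier model, everything
# else to a cheap one. Thresholds, keywords, and slugs are placeholders.
CHEAP_MODEL = "provider/cheap-model"        # substitute your actual cheap/frontier pair
FRONTIER_MODEL = "provider/frontier-model"

HARD_KEYWORDS = ("prove", "refactor", "multi-step", "legal", "architecture")

def pick_model(prompt: str) -> str:
    hard = len(prompt) > 2000 or any(kw in prompt.lower() for kw in HARD_KEYWORDS)
    return FRONTIER_MODEL if hard else CHEAP_MODEL

# The chosen slug is then passed as the model parameter on the gateway call.
print(pick_model("Summarize this paragraph."))               # -> provider/cheap-model
print(pick_model("Refactor this module for testability."))   # -> provider/frontier-model
```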