Last updated: May 2026
HuggingFace Inference API in 2026 is three products under one umbrella: the Serverless Inference API (free tier with rate limits, best for prototyping), Inference Endpoints (dedicated GPU you spin up per model, $0.50/hr+, scale-to-zero), and Inference Providers (a unified OpenAI-compatible gateway routing to Groq, Together AI, Fireworks, Replicate, Cerebras, and 10+ others). This guide unpacks each product, the real free-tier limits, where HuggingFace PRO ($9/month) is worth it, and how the stack compares to OpenRouter, Together AI direct, and Groq direct.
If you're still picking a model, see Best Open Source LLM 2026. For free tiers across all providers (not just HuggingFace), see Free LLM API Credits.
The three HuggingFace Inference products
| Product | What it is | Pricing | Free tier |
|---|---|---|---|
| Serverless Inference API | Shared inference on HF infrastructure | Free / PRO $9/mo | A few hundred req/hour, <10B params |
| Inference Endpoints | Dedicated GPU per model, scale-to-zero | $0.03/CPU/hr, $0.50+/GPU/hr | None (paid only) |
| Inference Providers | Unified API to 15+ third-party providers | Pass-through provider rates | 2M credits/mo on PRO |
The naming is unfortunate — all three are "HuggingFace Inference" but solve different problems. Pick by workload: prototyping (Serverless), production with single model and steady load (Endpoints), production with multiple models or provider choice (Providers).
Serverless Inference API — the free tier
The Serverless Inference API is shared infrastructure that runs inference on a wide catalog of HuggingFace Hub models. Free-tier users get a few hundred requests per hour, limited to models under approximately 10 billion parameters. Cold starts on less popular models can take 10-30 seconds.
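To make this concrete, here is a minimal Serverless call using the `huggingface_hub` client. The embedding model ID is just an example of a small, free-tier-friendly model, and the output shape depends on the model you pick.

```python
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_...")  # your HF access token

# Embeddings are one of the tasks the free tier handles comfortably.
vec = client.feature_extraction(
    "Serverless is for prototyping, not production traffic.",
    model="sentence-transformers/all-MiniLM-L6-v2",
)
print(vec.shape)  # output shape depends on the model and its pooling
```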
What works on the free tier:
- Text classification, NER, summarization, embeddings — fast, generous limits.
- Small-to-medium LLMs (Llama 3.1 8B, Qwen 2.5 7B, Mistral 7B) — fine for prototyping.
- Image classification, object detection — straightforward.
What does not work well on the free tier:
- 70B+ LLMs — typically gated or rate-limited heavily, often unavailable on Serverless.
- High-volume production traffic — rate limits hit fast.
- Latency-critical workloads — cold starts will bite.
HuggingFace PRO ($9/month) raises Serverless rate limits substantially and adds three perks: 25 minutes of daily H200 ZeroGPU compute (vs ~3-5 minutes on free), 1TB private storage + 10TB public storage, and 2M monthly Inference Provider credits. For a developer running multiple demos, prototyping with larger models, or building ZeroGPU Spaces, PRO is the cheapest H200 access available.
Inference Endpoints — dedicated GPU
Inference Endpoints lets you spin up a dedicated server (CPU or GPU) for a single model from the Hub. You pick the model and the hardware tier; HuggingFace handles provisioning, autoscaling, and the OpenAI-compatible API in front of it.
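If you prefer code over the web console, recent versions of `huggingface_hub` expose a `create_inference_endpoint` helper. The sketch below is illustrative: the repository, vendor, region, and instance identifiers are assumptions, and the exact hardware names available depend on your account and the Endpoints catalog.

```python
from huggingface_hub import create_inference_endpoint

# Hardware identifiers (vendor, region, instance_type/size) are placeholders;
# check the Endpoints catalog for the values available to your account.
endpoint = create_inference_endpoint(
    "my-llama-endpoint",
    repository="meta-llama/Llama-3.1-8B-Instruct",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_type="nvidia-t4",
    instance_size="x1",
    min_replica=0,  # scale-to-zero: no compute charges while idle
    max_replica=1,
)
endpoint.wait()      # block until the endpoint is running
print(endpoint.url)  # OpenAI-compatible URL to point clients at
```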
Pricing (May 2026, billed by the minute, not the hour):
| Tier | Hardware | Hourly rate | Typical use |
|---|---|---|---|
| CPU small | 2 vCPU | $0.03 | Embeddings, small classification |
| CPU large | 8 vCPU | $0.12 | Heavier CPU workloads |
| GPU T4 | 16 GB VRAM | $0.50 | 7B-14B models, image classification |
| GPU L4 | 24 GB VRAM | $0.80 | 13B models, Stable Diffusion |
| GPU A10G | 24 GB VRAM | $1.30 | 30B at 4-bit, faster Stable Diffusion |
| GPU A100 | 80 GB VRAM | $4.50 | 70B at 4-bit, large vision models |
| GPU H100 | 80 GB VRAM | $6.00 | Flagship inference, low latency |
Scale-to-zero is the killer feature: an endpoint that has scaled down to zero replicas incurs no compute charges. For a model serving 100-1000 requests/day in bursty traffic, scale-to-zero with autoscaling typically lands at $20-60/month — competitive with hosted APIs at this volume.
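The $20-60/month figure is easy to sanity-check. A back-of-the-envelope sketch, assuming a T4 at $0.50/hr and that bursty traffic keeps the endpoint warm for only a few hours a day (the warm-time figures are assumptions, not measurements):

```python
# Monthly cost of a scale-to-zero T4 endpoint ($0.50/hr, billed by the minute).
# "Active hours" means time the endpoint is scaled up and serving or warm.
T4_HOURLY = 0.50

def monthly_cost(active_hours_per_day: float, days: int = 30) -> float:
    return T4_HOURLY * active_hours_per_day * days

for hours in (1.5, 2.5, 4.0):
    print(f"{hours} active h/day -> ${monthly_cost(hours):.0f}/month")
# 1.5 h/day ~= $22, 2.5 ~= $38, 4.0 = $60: the $20-60 range quoted above
```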
When Endpoints make sense: you have a custom model (fine-tuned variant, in-house model) or a specific pinned version that's not exposed via Inference Providers, and your traffic is steady enough to justify dedicated hardware.
When Endpoints don't make sense: bursty workloads with very long idle periods (use Serverless or Providers), or extremely steady workloads at high QPS (renting a GPU directly on RunPod or Lambda is often cheaper for sustained load).
Inference Providers — the unified gateway
Launched in late 2024 and matured through 2025-2026, Inference Providers is HuggingFace's answer to the "which inference partner should I use" question. It exposes a single OpenAI-compatible API that routes to 15+ third-party providers:
- Groq — LPU hardware, sub-100ms time-to-first-token, fastest on Llama / Qwen variants.
- Together AI — 200+ open-source LLMs, automated tuning, broad catalog.
- Fireworks — high-performance, low-latency open-weight inference focus.
- Replicate — per-prediction billing, easy custom-model deployment from Hub.
- Cerebras — CS-3 wafer-scale chips, ultra-fast Llama inference.
- Cohere — Cohere's Command models, strong RAG and tool use.
- Nebius, SambaNova, Novita AI, Hyperbolic, Featherless — and others.
Pricing is pass-through — you pay the underlying provider's published rate. The 2M monthly Inference Provider credits included with PRO ($9) cover prototyping. Above the included credits, charges go to your HF account at provider rates.
Switching providers is a string change: the model `meta-llama/Llama-3.3-70B-Instruct` routes to whichever provider you select (or the cheapest available). One auth token, one billing relationship, no per-provider account setup.
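In code, the switch looks like this. A minimal sketch with `huggingface_hub`'s `InferenceClient`, which accepts a `provider` argument in recent versions; the provider IDs and prompt are examples.

```python
from huggingface_hub import InferenceClient

# Same model, different backends: only the provider string changes.
for provider in ("groq", "together", "fireworks-ai"):
    client = InferenceClient(provider=provider, api_key="hf_...")
    out = client.chat_completion(
        messages=[{"role": "user", "content": "One sentence on LPU hardware."}],
        model="meta-llama/Llama-3.3-70B-Instruct",
        max_tokens=64,
    )
    print(provider, "->", out.choices[0].message.content)
```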
When to use which HuggingFace Inference product
- Prototyping a new model on the Hub: Serverless Inference API (free or PRO). Cold start is fine; rate limits are generous for development.
- Production, single fine-tuned model, steady traffic: Inference Endpoints with scale-to-zero on the right GPU tier.
- Production, multiple models or provider choice: Inference Providers. Lower operational overhead than running Endpoints; pay per-token instead of per-hour.
- High-volume single-model production (>100K req/day): skip HF for the hot path. Direct provider integration (Groq, Together, Fireworks) or self-hosted GPU on RunPod / Lambda usually wins on per-token cost. Use HF Inference Providers for failover (see the sketch after this list).
- ZeroGPU Spaces / interactive demos: HF PRO. 25 minutes of daily H200 for $9/month is the cheapest H200 access for short interactive workloads.
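One way to wire up the failover mentioned in the high-volume bullet above: call the direct provider first and fall back to HF Inference Providers on error. The sketch assumes Groq as the primary (its endpoint is OpenAI-compatible); keys and model IDs are placeholders.

```python
from openai import OpenAI
from huggingface_hub import InferenceClient

MESSAGES = [{"role": "user", "content": "ping"}]

def chat_with_failover() -> str:
    try:
        # Hot path: direct Groq integration (OpenAI-compatible endpoint).
        groq = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="gsk_...")
        resp = groq.chat.completions.create(
            model="llama-3.3-70b-versatile", messages=MESSAGES, max_tokens=64,
        )
        return resp.choices[0].message.content
    except Exception:
        # Failover: route the same request through HF Inference Providers.
        hf = InferenceClient(provider="auto", api_key="hf_...")
        out = hf.chat_completion(
            messages=MESSAGES, model="meta-llama/Llama-3.3-70B-Instruct", max_tokens=64,
        )
        return out.choices[0].message.content
```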
HuggingFace vs OpenRouter
Both Inference Providers and OpenRouter expose multiple LLM backends through one OpenAI-compatible API. The differences:
- Catalog: OpenRouter has the broadest catalog (300+ models from 60+ providers). HF Inference Providers covers 15+ partners with a focus on open-weight models. OpenRouter aggregates Anthropic Claude, Google Gemini, and OpenAI GPT-5; HF focuses on open models served by its inference partners.
- Hub integration: HF wins by a lot if you use the Hub for models, datasets, and Spaces. Models on the Hub are one click away from inference. OpenRouter is inference-only.
- Pricing: OpenRouter adds a roughly 5% fee on credit purchases. HF Inference Providers is pass-through, with no markup on provider rates (and PRO's 2M monthly credits applied first). For pure cost minimization on open models, HF Providers wins; for proprietary frontier models (GPT-5, Claude), OpenRouter is competitive.
- UX: OpenRouter has the more polished playground and analytics. HF is improving but still feels like an extension of the Hub UX.
Pick HuggingFace Inference Providers if Hub integration matters (most ML teams). Pick OpenRouter for breadth and frontier-model coverage.
HuggingFace vs direct provider APIs
If you only call one or two providers (e.g., always Groq or always Together), going direct to the provider is usually cheapest and lowest-latency. The HF Inference Providers layer adds a few milliseconds of routing overhead, and your billing relationship runs through HF rather than directly with the provider.
The trade-off: with HF you switch providers with a string change. Direct, you maintain N integrations.
For 90% of teams: start with Inference Providers, migrate to direct when one provider becomes 70%+ of your spend.
Common mistakes with HuggingFace Inference
- Hitting Serverless rate limits in production. Serverless is for prototyping — graduate to Endpoints or Providers before you ship.
- Running Endpoints 24/7 for bursty traffic. Enable scale-to-zero. A model used 2 hours/day costs 1/12 of a 24/7 endpoint.
- Relying on ZeroGPU from the free tier. ZeroGPU has very tight daily time limits on free accounts; PRO is the practical minimum for ZeroGPU work.
- Using Inference Providers for one stable provider. If 95% of your traffic goes to Groq, integrate Groq directly. The HF layer is for choice, not for single-provider deployments.
- Ignoring cold starts on Serverless. For latency-sensitive prototyping, pre-warm the model with periodic requests or move to Endpoints.
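For the last point, the cheapest pre-warm is a periodic ping. A rough sketch; the interval and model are guesses, and every ping counts against your rate limit, so only run it during active demos.

```python
import time
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_...")

# Ping the model every few minutes so the next real request skips the cold start.
while True:
    client.chat_completion(
        messages=[{"role": "user", "content": "ping"}],
        model="mistralai/Mistral-7B-Instruct-v0.3",
        max_tokens=1,
    )
    time.sleep(300)  # interval is a guess; eviction timing isn't documented
```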
How to start
- Get a free HuggingFace account. Generate an Inference API token in account settings.
- Test Serverless first. Point the OpenAI SDK at `https://api-inference.huggingface.co/v1` (or use the `huggingface_hub` Python library directly) and call a small model; a minimal sketch follows this list.
- Upgrade to PRO ($9) if rate-limited or if you use Spaces / ZeroGPU.
- Move to Inference Providers when you need provider choice or higher reliability.
- Move to Inference Endpoints when you have a custom model or specific GPU tier requirements.
- Direct provider integration when one provider dominates your spend.
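A minimal version of the "Test Serverless first" step, pointing the OpenAI SDK at the base URL from the list above (the model ID is just an example of a small instruct model):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api-inference.huggingface.co/v1",
    api_key="hf_...",  # your HF token doubles as the API key here
)

resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```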