Last updated: May 2026
Groq pricing in May 2026 ranges from $0.05 per million input tokens (Llama 3.1 8B) to $1.00 per million (Kimi K2), making it among the cheapest fast-inference options for open-source LLMs. The free tier requires no credit card and gives every developer access to all supported models at 30 requests per minute. Groq's LPU hardware delivers 5-9x more tokens per second than GPUs, and the Batch API + prompt caching can stack to ~25% of on-demand pricing. This guide breaks down every supported model, the real Free-tier and Developer-tier limits, the batch discount math, and where Groq fits in the broader inference landscape against OpenAI, Anthropic, OpenRouter, and HuggingFace Inference Providers.
For free LLM API tiers across all providers, see Free LLM API Credits. For the wider model landscape, see Best Open Source LLM 2026.
Groq pricing by model (May 2026)
| Model | Input / 1M tokens | Output / 1M tokens | Speed |
|---|---|---|---|
| Llama 3.1 8B Instant | $0.05 | $0.08 | 840 TPS |
| Llama 4 Scout 17B | $0.50 | $0.80 | ~600 TPS |
| Llama 3.3 70B Versatile | $0.59 | $0.79 | 394 TPS |
| GPT-OSS 120B | $0.15 | $0.60 | 500 TPS |
| Qwen 3 32B | $0.29 | $0.59 | ~500 TPS |
| DeepSeek R1 Distill 70B | $0.75 | $0.99 | ~280 TPS |
| Kimi K2 (Moonshot) | $1.00 | $3.00 | ~250 TPS |
| Mistral Saba 24B | $0.79 | $0.79 | ~350 TPS |
| Gemma 2 9B | $0.20 | $0.20 | 500 TPS |
| Whisper v3 Turbo | $0.04 per audio-hour | n/a | ~40x real-time |
Prices reflect Groq's published on-demand rates as of May 2026. TPS (tokens per second) numbers are Groq's benchmark figures; real production traffic typically sees 60-90% of peak.
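To turn the table into a monthly estimate, here is a minimal sketch of the cost math in Python (prices hard-coded from the table above; the dictionary keys are informal labels, not Groq API model IDs):

```python
# Rough monthly cost estimator using the on-demand rates from the table above.
# Batch (-50%) and prompt caching (-50% on cached input tokens) are covered
# in the discounts section below.
PRICES = {  # USD per million tokens
    "llama-3.1-8b":  {"in": 0.05, "out": 0.08},
    "llama-3.3-70b": {"in": 0.59, "out": 0.79},
    "kimi-k2":       {"in": 1.00, "out": 3.00},
}

def monthly_cost(model, in_tokens_per_day, out_tokens_per_day,
                 batch=False, cache_hit_rate=0.0):
    """Estimate USD/month for a steady daily token volume."""
    p = PRICES[model]
    in_rate = p["in"] * (1 - 0.5 * cache_hit_rate)  # caching discounts cached input only
    daily = (in_tokens_per_day * in_rate + out_tokens_per_day * p["out"]) / 1e6
    if batch:
        daily *= 0.5  # Batch API halves the whole bill
    return daily * 30

# 100M input + 10M output tokens/day on Llama 3.1 8B, batched, 80% cache hits:
print(f"${monthly_cost('llama-3.1-8b', 100e6, 10e6, batch=True, cache_hit_rate=0.8):.2f}/month")
```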
Free tier — what you actually get
- 30 requests per minute (RPM)
- 6,000 tokens per minute (TPM)
- 14,400 requests per day (RPD), per organization
- All supported models (no model carve-outs on free)
- No credit card required to sign up
Free tier is sufficient for: prototyping, side projects, hobbyist tools, internal demos, and low-volume customer-facing features. The 14,400/day request cap typically translates to ~10 concurrent users on a chat app, or roughly 50,000 input tokens per month on light workloads.
Free tier breaks down at: bursty traffic spikes (30 RPM caps at one request per 2 seconds steady state), customer-facing production at any meaningful scale, batch processing of large datasets.
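If you prototype against the Free tier, build the 30 RPM ceiling into the client rather than letting 429s surface as errors. A minimal backoff sketch using `requests` (the endpoint and `llama-3.1-8b-instant` model ID are from the quickstart below; verify current model IDs in the console):

```python
import os
import time
import requests

API_URL = "https://api.groq.com/openai/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"}

def chat(messages, model="llama-3.1-8b-instant", max_retries=5):
    """Call Groq, backing off exponentially on 429 rate-limit responses."""
    delay = 2.0  # 30 RPM is one request per 2 seconds at steady state
    for _ in range(max_retries):
        resp = requests.post(API_URL, headers=HEADERS,
                             json={"model": model, "messages": messages})
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]
        time.sleep(delay)  # rate-limited: wait and retry
        delay *= 2
    raise RuntimeError("still rate-limited after retries")

print(chat([{"role": "user", "content": "One sentence on LPUs."}]))
```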
Developer tier — when to upgrade
Adding a credit card upgrades you to Developer tier:
- Roughly 10x rate limits across the board, and more on some models (Llama 3.1 8B jumps from 14,400 to 500,000 RPD)
- 25% discount on on-demand pricing
- Pay-as-you-go beyond free quota
Developer tier is the right move when:
- You're hitting Free tier rate limits in prototyping (>30 RPM bursts).
- You're shipping to production users.
- You want the 25% on-demand discount.
The on-ramp is gradual — you only pay for usage above the included quota at the discounted rate. For a small project that occasionally bursts past free limits, Developer tier might cost $5-20/month total.
Batch API and prompt caching — the 75% discount path
Two stackable discounts cut Groq's effective pricing significantly:
- Batch API: -50%. Submit requests as batches with up to 24-hour completion windows. Groq runs them when there's spare LPU capacity. For non-realtime workloads (overnight processing, content moderation queues, embedding generation), Batch is the default.
- Prompt caching: -50% on cached input tokens. Identical prompt prefixes are cached. Useful for system prompts that repeat across requests, RAG with stable contexts, agentic workflows with shared scaffolding.
Stacked, you pay ~25% of on-demand pricing. Example: Llama 3.1 8B at $0.05/M input drops to $0.0125/M with both discounts. For a workload doing 100M tokens/day of input with a high cache hit rate, that's $1.25/day on Groq Batch vs $5/day on Groq on-demand vs ~$250/day on OpenAI GPT-5 Standard at its ~$2.50/M input rate.
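As a sketch of the batch flow, assuming Groq's Batch API follows the OpenAI JSONL-upload shape on the same OpenAI-compatible endpoint (check Groq's batch docs for the exact supported fields):

```python
import json
import os
from openai import OpenAI

client = OpenAI(base_url="https://api.groq.com/openai/v1",
                api_key=os.environ["GROQ_API_KEY"])

# One JSONL line per request; custom_id lets you match results back up later.
with open("batch.jsonl", "w") as f:
    for i, doc in enumerate(["first document...", "second document..."]):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "llama-3.1-8b-instant",
                     "messages": [{"role": "user",
                                   "content": f"Summarize: {doc}"}]},
        }) + "\n")

batch_file = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")
job = client.batches.create(input_file_id=batch_file.id,
                            endpoint="/v1/chat/completions",
                            completion_window="24h")
print(job.id, job.status)  # poll client.batches.retrieve(job.id) until completed
```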
Groq vs OpenAI — concrete cost comparison
For comparable model capability (Groq runs open-source equivalents, not GPT-5 / Claude / Gemini themselves), Groq is roughly 2.5-4x cheaper on-demand, and more once discounts stack:
- Small/fast tier: Groq Llama 3.1 8B ($0.05/M) vs OpenAI GPT-4o-mini (~$0.15/M) — 3x cheaper.
- Mid tier: Groq Llama 3.3 70B ($0.59/M) vs OpenAI GPT-4o (~$2.50/M) — 4x cheaper.
- Large tier: Groq Kimi K2 ($1.00/M) vs OpenAI GPT-5 Standard (~$2.50/M) — 2.5x cheaper, and capability is comparable: Kimi K2 leads HumanEval globally.
- With Batch + caching stacked: Groq is 8-19x cheaper than OpenAI on equivalent workloads.
When Groq does not win: workloads that genuinely need GPT-5's specific safety tuning, Claude's tool-use reliability, or Gemini's specific multimodal modes. For everything else (chat, summarization, classification, code generation, RAG, agentic workflows), Groq's open-source models are competitive.
Groq vs Anthropic Claude
Anthropic doesn't license Claude to third parties for inference — Groq doesn't run Claude. The comparison is open-source vs Claude on capability:
- Code generation: Claude Sonnet 4 runs ~$3 input / $15 output per M. Groq Kimi K2 at $1/$3 hits 99% on HumanEval (current open-weight leader). For code, Kimi K2 is competitive at a third of Claude's input price and a fifth of its output price.
- Reasoning: Anthropic's Claude Opus 4.7 leads on agentic reasoning benchmarks. Groq's DeepSeek R1 Distill 70B is close on math/logic at 5-10x lower cost.
- Long context: Claude offers 200K context. Groq Llama 4 Scout offers a 10M-token context window, far larger, but raw quality trails Claude Sonnet 4.
If your workload tolerates open-source quality (most do), Groq wins on cost. If you need Claude specifically, you go direct to Anthropic.
Groq speed — why LPUs matter
Groq's LPU (Language Processing Unit) is custom silicon designed for sequential token generation. The architectural difference vs GPUs:
- Weights live in on-chip SRAM rather than external HBM, so tokens stream out without stalling on memory fetches.
- Deterministic latency — no batching variability. Each request gets predictable time-to-first-token.
- Single-stream optimization — Groq is fastest for one user at a time, not for massive batch concurrency.
Benchmark results (May 2026):
- Llama 3 70B: Groq ~2,100 TPS vs H100 280-450 TPS — 5-7x faster.
- Llama 3.1 8B: Groq 840 TPS.
- Mixtral 8x7B: Groq 727 TPS vs GPU 75-120 TPS — 6-9x faster.
- Gemma 7B: Groq 2,800 TPS.
- Time to first token: Groq <100ms vs H100 200-500ms.
For chat UIs, the 80ms vs 280ms TTFT difference is perceptible — users feel the speed. For agentic workflows with multi-turn tool use, Groq's TTFT compounds across turns.
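Rather than trusting benchmark tables, you can measure TTFT and throughput on your own prompts. A quick sketch over the streaming API (chunk count is a rough proxy for tokens; model ID assumed as elsewhere in this guide):

```python
import os
import time
from openai import OpenAI

client = OpenAI(base_url="https://api.groq.com/openai/v1",
                api_key=os.environ["GROQ_API_KEY"])

start = time.perf_counter()
first_token_at = None
chunks = 0
stream = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Explain LPUs in 200 words."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time to first token
        chunks += 1

assert first_token_at is not None, "no tokens streamed"
elapsed = time.perf_counter() - first_token_at
print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms")
print(f"~{chunks / elapsed:.0f} chunks/s (rough TPS proxy)")
```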
Where Groq doesn't win
- Massive batch throughput at lowest cost per million tokens: dedicated GPU rental on RunPod / Lambda Labs for sustained high QPS can beat Groq's per-token rate. Cross-over is around 30-50M tokens/day of steady-state load with cache-miss workloads.
- Proprietary models: GPT-5, Claude, Gemini are not on Groq.
- Custom fine-tunes: Groq runs published checkpoints, not your fine-tuned models. For fine-tuned inference, use HuggingFace Inference Endpoints, Together AI, or self-host.
- Specific safety / RLHF tuning: closed-model providers ship their own alignment and safety tuning. Open models on Groq carry whatever tuning the model author shipped (Meta, Alibaba, OpenAI for GPT-OSS).
How to start with Groq
- Sign up at console.groq.com. No credit card required.
- Generate an API key. Stored in your console.
- Point your code at Groq. OpenAI-compatible endpoint: https://api.groq.com/openai/v1. Set model to one of the supported model IDs (a minimal first call is sketched after this list).
- Start with Llama 3.1 8B Instant — cheapest and fastest. Move up to 70B / 120B / Kimi K2 when 8B is the bottleneck.
- Upgrade to Developer tier when free limits constrain you.
- Enable Batch API + prompt caching when you can tolerate latency for the discount.
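A minimal first call, using the OpenAI Python SDK pointed at Groq (the model ID is Groq's current name for Llama 3.1 8B Instant; check the console list if it has rotated):

```python
import os
from openai import OpenAI

# Same SDK you would use against OpenAI; only base_url and api_key change.
client = OpenAI(base_url="https://api.groq.com/openai/v1",
                api_key=os.environ["GROQ_API_KEY"])

resp = client.chat.completions.create(
    model="llama-3.1-8b-instant",  # cheapest/fastest tier in the table above
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(resp.choices[0].message.content)
```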
Common mistakes with Groq
- Picking the largest model by default. Llama 3.1 8B at $0.05/M solves most chat workloads. Don't pay for Kimi K2 unless capability requires it.
- Running production on the Free tier. 30 RPM is not production-safe for any meaningful customer base. Upgrade to Developer the moment you ship.
- Ignoring Batch + caching for non-realtime workloads. 75% discount is meaningful at scale.
- Treating Groq as a drop-in for GPT-5 / Claude. Test capability first — open-source models match closed models on most tasks, lag on others.
- Single-provider dependency. Groq has had outages. Set up failover via an LLM gateway — OpenRouter or LiteLLM with Groq as primary, Together AI or Fireworks as backup.
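A gateway gives you failover out of the box; for illustration, here is a minimal hand-rolled version of the same idea (the backup base URL and both model IDs are placeholders to adapt):

```python
import os
from openai import OpenAI

# Ordered list of OpenAI-compatible providers: Groq primary, backup second.
PROVIDERS = [
    ("https://api.groq.com/openai/v1", os.environ["GROQ_API_KEY"],
     "llama-3.1-8b-instant"),
    ("https://api.together.xyz/v1", os.environ["TOGETHER_API_KEY"],
     "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"),  # illustrative backup
]

def chat_with_failover(messages):
    """Try providers in order; fall through to the next on any error."""
    last_err = None
    for base_url, key, model in PROVIDERS:
        try:
            client = OpenAI(base_url=base_url, api_key=key)
            resp = client.chat.completions.create(model=model, messages=messages)
            return resp.choices[0].message.content
        except Exception as err:  # outage, 5xx, or rate limit: move on
            last_err = err
    raise last_err

print(chat_with_failover([{"role": "user", "content": "ping"}]))
```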