Last updated: June 2026
Groq pricing in June 2026 ranges from $0.05 per million input tokens (Llama 3.1 8B) to $1.00 per million (Kimi K2), making it the cheapest fast inference option for open-source LLMs. The free tier requires no credit card and gives every developer access to all supported models at 30 requests per minute. Groq's LPU hardware delivers 5-10x more tokens per second than single GPUs, and the Batch API + prompt caching can stack to ~25% of on-demand pricing. This guide breaks down every supported model, the real free-tier and Developer-tier limits, batch discount math, and where Groq fits in the broader inference landscape against OpenAI, Anthropic, OpenRouter, and HuggingFace Inference Providers.
For free LLM API tiers across all providers, see Free LLM API Credits. For the wider model landscape, see Best Open Source LLM 2026.
Groq pricing by model (June 2026)
| Model | Input / 1M tokens | Output / 1M tokens | Speed |
|---|---|---|---|
| Llama 3.1 8B Instant | $0.05 | $0.08 | 840 TPS |
| Llama 4 Scout 17B | $0.11 | $0.34 | ~600 TPS |
| Llama 3.3 70B Versatile | $0.59 | $0.79 | 394 TPS |
| GPT-OSS 120B | $0.15 | $0.60 | 500 TPS |
| Qwen 3 32B | $0.29 | $0.59 | ~500 TPS |
| Kimi K2 (Moonshot) | $1.00 | $3.00 | ~250 TPS |
| Mistral Saba 24B | $0.79 | $0.79 | ~350 TPS |
| Gemma 2 9B | $0.20 | $0.20 | 500 TPS |
| Whisper v3 Turbo | $0.04 / hour audio | n/a | real-time x40 |
Prices reflect Groq's published on-demand rates as of June 2026. TPS (tokens per second) numbers are Groq's benchmark figures; real production traffic typically sees 60-90% of peak.
Free tier - what you actually get
- 30 requests per minute (RPM)
- 6,000 tokens per minute (TPM)
- 14,400 requests per day (RPD), per organization
- All supported models (no model carve-outs on free)
- No credit card required to sign up
Free tier is sufficient for: prototyping, side projects, hobbyist tools, internal demos, low-volume customer-facing features. The 14,400/day request cap typically translates to ~10 concurrent users on a chat app or ~50,000 monthly tokens-of-input on light workloads.
Free tier breaks down at: bursty traffic spikes (30 RPM caps at one request per 2 seconds steady state), customer-facing production at any meaningful scale, batch processing of large datasets.
Developer tier - when to upgrade
Adding a credit card upgrades you to Developer tier:
- 10x rate limits across the board (Llama 3.1 8B goes from 14,400 to 500,000 RPD)
- 25% discount on on-demand pricing
- Pay-as-you-go beyond free quota
Developer tier is the right move when:
- You're hitting Free tier rate limits in prototyping (>30 RPM bursts).
- You're shipping to production users.
- You want the 25% on-demand discount.
The on-ramp is gradual - you only pay for usage above the included quota at the discounted rate. For a small project that occasionally bursts past free limits, Developer tier might cost $5-20/month total.
Batch API and prompt caching - the 75% discount path
Two stackable discounts cut Groq's effective pricing significantly:
- Batch API: -50%. Submit requests as batches with up to 24-hour completion windows. Groq runs them when there's spare LPU capacity. For non-realtime workloads (overnight processing, content moderation queues, embedding generation), Batch is the default.
- Prompt caching: -50% on cached input tokens. Identical prompt prefixes are cached. Useful for system prompts that repeat across requests, RAG with stable contexts, agentic workflows with shared scaffolding.
Stacked, you pay ~25% of on-demand pricing. Example: Llama 3.1 8B at $0.05/M input drops to $0.0125/M with both discounts. For a workload doing 100M tokens/day of input with high cache hit rate, that's $1.25/day on Groq Batch vs $5/day on Groq on-demand vs $25/day on OpenAI GPT-5 Standard.
Groq vs OpenAI - concrete cost comparison
For comparable model capability (Groq runs open-source equivalents, not GPT-5 / Claude / Gemini themselves), Groq is 3-19x cheaper depending on the model:
- Small/fast tier: Groq Llama 3.1 8B ($0.05/M) vs OpenAI's cheapest GPT-5-tier model (~$0.15/M) - 3x cheaper.
- Mid tier: Groq Llama 3.3 70B ($0.59/M) vs OpenAI's mid GPT-5 tier (~$2.50/M) - 4x cheaper.
- Large tier: Groq Kimi K2 ($1.00/M) vs OpenAI GPT-5 Standard (~$2.50/M) - 2.5x cheaper, and Kimi K2 is a strong open-weight model so capability is competitive.
- With Batch + caching stacked: Groq is 8-19x cheaper than OpenAI on equivalent workloads.
When Groq does not win: workloads that genuinely need GPT-5's specific safety tuning, Claude's tool-use reliability, or Gemini's specific multimodal modes. For everything else (chat, summarization, classification, code generation, RAG, agentic workflows), Groq's open-source models are competitive.
Groq vs Anthropic Claude
Anthropic doesn't license Claude to third parties for inference - Groq doesn't run Claude. The comparison is open-source vs Claude on capability:
- Code generation: Claude Sonnet 4.6 ~$3 input / $15 output per M. Groq Kimi K2 at $1/$3 is among the strongest open-weight coding models. For code, Groq Kimi K2 is competitive at a fraction of the cost.
- Reasoning: Anthropic's Claude Opus 4.8 leads on agentic reasoning benchmarks. Groq's Qwen 3 32B is close on math/logic at much lower cost.
- Long context: Claude offers up to 1M context. Groq Llama 4 Scout offers 10M context (much larger) but raw quality is below Claude Sonnet 4.6.
If your workload tolerates open-source quality (most do), Groq wins on cost. If you need Claude specifically, you go direct to Anthropic.
Groq speed - why LPUs matter
Groq's LPU (Language Processing Unit) is custom silicon designed for sequential token generation. The architectural difference vs GPUs:
- Memory bandwidth on-chip, not via HBM. Tokens stream out without waiting on memory.
- Deterministic latency - no batching variability. Each request gets predictable time-to-first-token.
- Single-stream optimization - Groq is fastest for one user at a time, not for massive batch concurrency.
Benchmark results (June 2026):
- Llama 3 70B: Groq ~2,100 TPS vs H100 280-450 TPS - 5-7x faster.
- Llama 3.1 8B: Groq 840 TPS.
- Mixtral 8x7B: Groq 727 TPS vs GPU 75-120 TPS - 6-9x faster.
- Gemma 7B: Groq 2,800 TPS.
- Time to first token: Groq <100ms vs H100 200-500ms.
For chat UIs, the 80ms vs 280ms TTFT difference is perceptible - users feel the speed. For agentic workflows with multi-turn tool use, Groq's TTFT compounds across turns.
Where Groq doesn't win
- Massive batch throughput at lowest cost per million tokens: dedicated GPU rental on RunPod / Lambda Labs for sustained high QPS can beat Groq's per-token rate. Cross-over is around 30-50M tokens/day of steady-state load with cache-miss workloads.
- Proprietary models: GPT-5, Claude, Gemini are not on Groq.
- Custom fine-tunes: Groq runs published checkpoints, not your fine-tuned models. For fine-tuned inference, use HuggingFace Inference Endpoints, Together AI, or self-host.
- Specific safety / RLHF tuning: Closed-model providers ship specific safety tuning. Open models on Groq are baseline-tuned by the model author (Meta, Alibaba, OpenAI for GPT-OSS).
How to start with Groq
- Sign up at console.groq.com. No credit card required.
- Generate an API key. Stored in your console.
- Point your code at Groq. OpenAI-compatible endpoint:
https://api.groq.com/openai/v1. Set model to one of the supported list. - Start with Llama 3.1 8B Instant - cheapest and fastest. Move up to 70B / 120B / Kimi K2 when 8B is the bottleneck.
- Upgrade to Developer tier when free limits constrain you.
- Enable Batch API + prompt caching when you can tolerate latency for the discount.
Common mistakes with Groq
- Picking the largest model by default. Llama 3.1 8B at $0.05/M solves most chat workloads. Don't pay for Kimi K2 unless capability requires it.
- Running production on the Free tier. 30 RPM is not production-safe for any meaningful customer base. Upgrade to Developer the moment you ship.
- Ignoring Batch + caching for non-realtime workloads. 75% discount is meaningful at scale.
- Treating Groq as a drop-in for GPT-5 / Claude. Test capability first - open-source models match closed models on most tasks, lag on others.
- Single-provider dependency. Groq has had outages. Set up failover via an LLM gateway - OpenRouter or LiteLLM with Groq as primary, Together AI or Fireworks as backup.
Frequently asked questions
How much does Groq cost per million tokens? Groq pricing scales with model size and is among the cheapest in the industry for open-source models. Examples (May 2026): Llama 3.1 8B at $0.05 input / $0.08 output per million tokens, GPT-OSS 120B at $0.15/$0.60, Llama 3.3 70B at $0.59/$0.79, Kimi K2 at $1.00/$3.00, Llama 4 Scout at $0.11/$0.34. Whisper v3 Turbo costs $0.04/hour of audio. Batch API and prompt caching each cut rates 50%; stacked, you pay ~25% of on-demand.
What are the Groq API free tier rate limits in 2026? Yes. Groq has a free tier that requires no credit card. Free-tier limits: 30 requests per minute, 6,000 tokens per minute, and 14,400 requests per day per organization. All supported models (Llama 3.1 8B, Llama 3.3 70B, Llama 4 Scout, Mistral Saba, Qwen 3, Kimi K2, GPT-OSS 120B) share these limits. Sign up at console.groq.com and you get an API key immediately.
What's the difference between Groq Free and Developer tier? The Developer tier raises rate limits ~10x and discounts on-demand pricing by 25%. Llama 3.1 8B daily request limit jumps from 14,400 to 500,000 on Developer. Developer tier requires a payment method on file; you only pay for usage beyond the included quota. For prototyping or hobby projects, Free is enough. For production traffic >50 req/min, Developer is the upgrade trigger.
How fast is Groq vs running on a GPU? Groq's LPU hardware delivers 5-10x faster token generation than single-GPU setups. Llama 3 70B hits ~2,100 tokens/sec on Groq vs 280-450 tok/s on a comparable GPU. Llama 3.1 8B runs at 840 tok/s. Time-to-first-token is typically <100ms (vs 200-500ms on GPU-based inference). For chat UIs and agentic workflows where TTFT matters, the difference is immediately perceptible.
Groq vs OpenAI - when does Groq make sense? Groq makes sense when (1) your workload can run on an open-source model (Llama, Qwen, DeepSeek, GPT-OSS, Kimi K2), (2) you want low latency, and (3) you want to pay 3-19x less than OpenAI for comparable model capability. Groq does not offer GPT-5, Claude, or Gemini - it runs open-weight models only. If your product depends on a specific proprietary model, you need OpenAI / Anthropic / Google. For open-source workloads, Groq is the price/speed leader.
Does Groq have batch pricing or prompt caching discounts? Yes, both. Batch API discounts on-demand rates by 50% - useful for non-realtime workloads (overnight processing, embeddings batches, content classification). Prompt caching discounts another 50% on cached input tokens. Stacked, you can hit ~25% of on-demand pricing for high-cache-hit batch workloads. For high-volume Llama 3.1 8B usage with heavy caching, effective rate lands around $0.01-0.02 per million input tokens.
Related guides on this site
- $500K in Free Cloud Credits 2026: 15 Programs Compared (AWS, Google, Microsoft, Cloudflare)
- Free Startup Credits 2026: Complete Guide ($1M+ Across 53 Programs)
- Best Open Source LLM 2026 - Model Comparison
- LLM Gateway in 2026: OpenRouter vs LiteLLM vs Portkey
- HuggingFace Inference API 2026
- Free LLM API Credits - Every Route from $0 to $10K
- Free AI API Credits - Provider Comparison
- OpenAI Free Credits 2026
- Claude Free Credits 2026
- OpenRouter Free Tier 2026 - 28+ free models, limits, BYOK