Groq
Groq · Ranked #6 of 7 in LLM APIs
LPU-based inference host delivering 300-1000+ tokens/sec on open models, with a no-credit-card free dev tier.
Ultra-low-latency open-model inference

Overview
Groq is a US chip-and-cloud company whose core differentiator is the LPU (Language Processing Unit), a deterministic, SRAM-based inference accelerator purpose-built for low-latency token generation rather than training. GroqCloud exposes this hardware through an OpenAI-compatible REST API, so the value proposition is narrow but sharp: dramatically faster output throughput and lower time-to-first-token than GPU-based providers, at competitive per-token prices, for a curated catalog of open-weight models (Llama 3.x/4, GPT-OSS 20B/120B, Qwen3, Whisper for speech-to-text). The target user is a developer or product team that already wants an open model and is bottlenecked on latency, real-time agents, voice pipelines, streaming chat, and high-volume batch jobs. Independent third-party benchmarking (Artificial Analysis) repeatedly places Groq at or near the top of provider rankings for speed on shared models like Llama 3.3 70B, where it has measured ~322 tokens/s output and sub-1s time-to-first-token.
Where Groq wins is unambiguous: raw speed and a frictionless migration path. Because the endpoint mirrors the OpenAI SDK, adoption is often a one-line base-URL change, and there are first-party Python and TypeScript SDKs plus integrations across LangChain, the Vercel AI SDK, LiteLLM, and similar frameworks. Pricing is linear and predictable, no idle infrastructure fees, with a 50% Batch API discount and 50% prompt-caching discount further lowering effective cost. Where it loses is in breadth and capacity. Groq hosts only open-weight models, so teams needing Claude, GPT-4-class, or Gemini quality must look elsewhere. Historically its most persistent criticism has been rate limits and over-capacity (429) errors: the free tier is tight (low RPM), limits are pooled per-organization rather than per-key, and the "flex" service tier is explicitly best-effort and can return over-capacity errors under load. The SRAM-only LPU design (each chip carries only a few hundred MB) also draws skepticism on Hacker News about how economically it scales to very large models.
Net: Groq is the strongest choice when latency on an open model is the dominant constraint and you can architect around its rate limits, and a poor fit when you need frontier closed models, very high guaranteed concurrency without an enterprise contract, or a single vendor covering every model family. Reliability has been solid in practice (status page reports 99.9% SLA target with recent periods at 100% actual), but production teams should plan for retry/back-off handling around 429s.
How this score is derived
The APIbenchmarks Index is a weighted sum of four dimensions, each scored on an absolute 0–100 reference scale. See the methodology for every mapping.
| Dimension | Score | Weight | Contribution |
|---|---|---|---|
| Documentation & DXGroqDocs (console.groq.com/docs) is thorough and code-first, with explicit OpenAI-compatibility guides, rate-limit/service-tier pages, API reference, and quickstarts in Python and JS. | 78 | 30% | 23.4 |
| ReliabilityPublic status page (groqstatus.com) reports a 99.9% SLA target with recent 30-day uptime at ~100%, though developers report intermittent 429/over-capacity errors under load on lower tiers. | 72 | 25% | 18.0 |
| Ecosystem & SDKsStrong third-party integration footprint (OpenAI SDK drop-in, LangChain, Vercel AI SDK, LiteLLM, OpenRouter) plus first-party Python and TypeScript SDKs. | 72 | 25% | 18.0 |
| AccessibilityFree GroqCloud tier with no upfront cost and a one-line OpenAI-SDK swap makes onboarding trivial, but low free-tier RPM and per-org pooled limits constrain real production use without upgrading. | 92 | 20% | 18.4 |
| APIbenchmarks Index (ABI) | 77.8 | ||
Table 1. Derivation of the ABI for Groq. Contribution = score × weight; the index is their sum.
At a glance
- Vendor
- Groq
- Pricing model
- Per token
- Free tier
- Free dev tier, no card: 30 RPM / 6K TPM / 14,400 req/day
- Official SDKs
- 8 languages
Pricing
| Free (GroqCloud) | $0 | Get started free with low rate limits (e.g. ~30 RPM, capped requests-per-day); shared per-organization limits. |
| Developer / Pay-as-you-go (On-Demand) | Per-token, usage-based | Linear per-token pricing with substantially higher rate limits; no idle infrastructure fees. |
| Batch API | 50% off standard per-token pricing | Asynchronous processing at half the on-demand token cost. |
| Enterprise | Custom (contact sales) | Private/co-cloud instances, SSO/SCIM/MFA, enterprise-only models (e.g. Minimax M2.5, Qwen3-VL 32B), higher capacity. |
Key features
- •LPU (Language Processing Unit) deterministic inference hardware for low-latency token generation
- •OpenAI-compatible chat completions endpoint
- •Batch API with 50% discount for async workloads
- •Prompt caching with 50% discount on cached input tokens
- •Speech-to-text via Whisper-large-v3 and whisper-large-v3-turbo
- •Service tiers (on-demand, flex/best-effort) with configurable rate limits
- •Streaming responses for real-time applications
- •Private and co-cloud deployment options for enterprise
- •Enterprise auth: SSO, SCIM provisioning, MFA
- •Tool/function calling support on compatible models
Official SDKs
Strengths & trade-offs
- +Top-ranked output speed on shared models (~322 t/s on Llama 3.3 70B per Artificial Analysis), far above typical GPU providers
- +Very low time-to-first-token (sub-1s), ideal for real-time agents and voice pipelines
- +OpenAI-compatible API, migrate with a one-line base-URL change, reuse existing OpenAI SDK code
- +Cheap, linear, predictable per-token pricing with no idle infrastructure fees; Llama 3.1 8B at $0.05/$0.08 per 1M tokens
- +Built-in cost levers: 50% Batch API discount and 50% prompt-caching discount
- +Fast-moving model catalog (Llama 4, GPT-OSS 20B/120B, Qwen3, Whisper STT) plus first-party Python and TypeScript SDKs
- –Open-weight models only, no Claude, GPT-4-class, or Gemini
- –Persistent rate-limit and over-capacity (429) complaints; free tier is tight (~30 RPM)
- –Rate limits are pooled per-organization, not per-key, so adding keys does not raise capacity
- –Flex/best-effort service tier can return over-capacity errors under load
- –SRAM-only LPU architecture (few hundred MB per chip) raises questions about cost-efficiency at very large model sizes
- –Thin public review footprint (e.g. only 1 G2 review), making independent aggregate validation hard
What developers say
G2 5/5 (1 review)
Developers are enthusiastic about Groq's inference speed and OpenAI-compatible drop-in experience, but consistently frustrated by rate limits, over-capacity errors, and the open-models-only catalog.
“Groq is 4-7x faster on output throughput and 3-4x faster on time-to-first-token compared to the fastest GPU-based inference providers, and the endpoint is OpenAI-compatible, so you can point the OpenAI SDK at Groq's base URL with a one-line change.”
Key figures
| Output speed (Llama 3.3 70B) | 322.0 tokens/s (ranked #1 fastest provider) | Artificial Analysis ↗ |
| Time to first token (Llama 3.3 70B) | 0.93 s (ranked #2 lowest latency) | Artificial Analysis ↗ |
| Throughput (Llama 3 8B) | Surpasses 1,200 tokens/s | Groq blog / Hacker News ↗ |
| Price (Llama 3.1 8B Instant) | $0.05 input / $0.08 output per 1M tokens | Groq pricing page ↗ |
| Price (Llama 3.3 70B Versatile) | $0.59 input / $0.79 output per 1M tokens | Groq pricing page ↗ |
| SLA target / recent uptime | 99.9% target; recent 30-day at ~100% actual | Groq status page ↗ |
| Batch API discount | 50% off standard per-token pricing | Groq pricing page ↗ |
Compare Groq head to head
Sources
- https://groq.com/pricing
- https://artificialanalysis.ai/models/llama-3-3-instruct-70b/providers
- https://console.groq.com/docs/openai
- https://console.groq.com/docs/rate-limits
- https://groqstatus.com/
- https://www.g2.com/products/groqcloud/reviews
- https://news.ycombinator.com/item?id=40999229
- https://awesomeagents.ai/reviews/review-groq/
- https://console.groq.com/docs/models
Figures last verified 2026-06-27. Spotted an error? corrections@apibenchmarks.com
