Groq

Groq · Ranked #6 of 7 in LLM APIs

77.8/ 100

BStrong

LPU-based inference host delivering 300-1000+ tokens/sec on open models, with a no-credit-card free dev tier.

Best for

Ultra-low-latency open-model inference

Visit website Documentation

Overview

Groq is a US chip-and-cloud company whose core differentiator is the LPU (Language Processing Unit), a deterministic, SRAM-based inference accelerator purpose-built for low-latency token generation rather than training. GroqCloud exposes this hardware through an OpenAI-compatible REST API, so the value proposition is narrow but sharp: dramatically faster output throughput and lower time-to-first-token than GPU-based providers, at competitive per-token prices, for a curated catalog of open-weight models (Llama 3.x/4, GPT-OSS 20B/120B, Qwen3, Whisper for speech-to-text). The target user is a developer or product team that already wants an open model and is bottlenecked on latency, real-time agents, voice pipelines, streaming chat, and high-volume batch jobs. Independent third-party benchmarking (Artificial Analysis) repeatedly places Groq at or near the top of provider rankings for speed on shared models like Llama 3.3 70B, where it has measured ~322 tokens/s output and sub-1s time-to-first-token.

Where Groq wins is unambiguous: raw speed and a frictionless migration path. Because the endpoint mirrors the OpenAI SDK, adoption is often a one-line base-URL change, and there are first-party Python and TypeScript SDKs plus integrations across LangChain, the Vercel AI SDK, LiteLLM, and similar frameworks. Pricing is linear and predictable, no idle infrastructure fees, with a 50% Batch API discount and 50% prompt-caching discount further lowering effective cost. Where it loses is in breadth and capacity. Groq hosts only open-weight models, so teams needing Claude, GPT-4-class, or Gemini quality must look elsewhere. Historically its most persistent criticism has been rate limits and over-capacity (429) errors: the free tier is tight (low RPM), limits are pooled per-organization rather than per-key, and the "flex" service tier is explicitly best-effort and can return over-capacity errors under load. The SRAM-only LPU design (each chip carries only a few hundred MB) also draws skepticism on Hacker News about how economically it scales to very large models.

Net: Groq is the strongest choice when latency on an open model is the dominant constraint and you can architect around its rate limits, and a poor fit when you need frontier closed models, very high guaranteed concurrency without an enterprise contract, or a single vendor covering every model family. Reliability has been solid in practice (status page reports 99.9% SLA target with recent periods at 100% actual), but production teams should plan for retry/back-off handling around 429s.

How this score is derived

The APIbenchmarks Index is a weighted sum of four dimensions, each scored on an absolute 0–100 reference scale. See the methodology for every mapping.

Dimension	Score	Weight	Contribution
Documentation & DXGroqDocs (console.groq.com/docs) is thorough and code-first, with explicit OpenAI-compatibility guides, rate-limit/service-tier pages, API reference, and quickstarts in Python and JS.	78	30%	23.4
ReliabilityPublic status page (groqstatus.com) reports a 99.9% SLA target with recent 30-day uptime at ~100%, though developers report intermittent 429/over-capacity errors under load on lower tiers.	72	25%	18.0
Ecosystem & SDKsStrong third-party integration footprint (OpenAI SDK drop-in, LangChain, Vercel AI SDK, LiteLLM, OpenRouter) plus first-party Python and TypeScript SDKs.	72	25%	18.0
AccessibilityFree GroqCloud tier with no upfront cost and a one-line OpenAI-SDK swap makes onboarding trivial, but low free-tier RPM and per-org pooled limits constrain real production use without upgrading.	92	20%	18.4
APIbenchmarks Index (ABI)			77.8

Table 1. Derivation of the ABI for Groq. Contribution = score × weight; the index is their sum.

At a glance

Vendor: Groq
Pricing model: Per token
Free tier: Free dev tier, no card: 30 RPM / 6K TPM / 14,400 req/day
Official SDKs: 8 languages

Pricing

Free (GroqCloud)	$0	Get started free with low rate limits (e.g. ~30 RPM, capped requests-per-day); shared per-organization limits.
Developer / Pay-as-you-go (On-Demand)	Per-token, usage-based	Linear per-token pricing with substantially higher rate limits; no idle infrastructure fees.
Batch API	50% off standard per-token pricing	Asynchronous processing at half the on-demand token cost.
Enterprise	Custom (contact sales)	Private/co-cloud instances, SSO/SCIM/MFA, enterprise-only models (e.g. Minimax M2.5, Qwen3-VL 32B), higher capacity.

Key features

•LPU (Language Processing Unit) deterministic inference hardware for low-latency token generation
•OpenAI-compatible chat completions endpoint
•Batch API with 50% discount for async workloads
•Prompt caching with 50% discount on cached input tokens
•Speech-to-text via Whisper-large-v3 and whisper-large-v3-turbo
•Service tiers (on-demand, flex/best-effort) with configurable rate limits
•Streaming responses for real-time applications
•Private and co-cloud deployment options for enterprise
•Enterprise auth: SSO, SCIM provisioning, MFA
•Tool/function calling support on compatible models

Official SDKs

Python (groq)TypeScript / Node.js (groq-sdk)OpenAI SDK (drop-in via base URL)LangChainVercel AI SDKLiteLLMOpenRouter (third-party routing)REST / HTTP API

Strengths & trade-offs

Strengths

+Top-ranked output speed on shared models (~322 t/s on Llama 3.3 70B per Artificial Analysis), far above typical GPU providers
+Very low time-to-first-token (sub-1s), ideal for real-time agents and voice pipelines
+OpenAI-compatible API, migrate with a one-line base-URL change, reuse existing OpenAI SDK code
+Cheap, linear, predictable per-token pricing with no idle infrastructure fees; Llama 3.1 8B at $0.05/$0.08 per 1M tokens
+Built-in cost levers: 50% Batch API discount and 50% prompt-caching discount
+Fast-moving model catalog (Llama 4, GPT-OSS 20B/120B, Qwen3, Whisper STT) plus first-party Python and TypeScript SDKs

Trade-offs

–Open-weight models only, no Claude, GPT-4-class, or Gemini
–Persistent rate-limit and over-capacity (429) complaints; free tier is tight (~30 RPM)
–Rate limits are pooled per-organization, not per-key, so adding keys does not raise capacity
–Flex/best-effort service tier can return over-capacity errors under load
–SRAM-only LPU architecture (few hundred MB per chip) raises questions about cost-efficiency at very large model sizes
–Thin public review footprint (e.g. only 1 G2 review), making independent aggregate validation hard

What developers say

G2 5/5 (1 review)

Developers are enthusiastic about Groq's inference speed and OpenAI-compatible drop-in experience, but consistently frustrated by rate limits, over-capacity errors, and the open-models-only catalog.

“Groq is 4-7x faster on output throughput and 3-4x faster on time-to-first-token compared to the fastest GPU-based inference providers, and the endpoint is OpenAI-compatible, so you can point the OpenAI SDK at Groq's base URL with a one-line change.”

Key figures

Output speed (Llama 3.3 70B)	322.0 tokens/s (ranked #1 fastest provider)	Artificial Analysis ↗
Time to first token (Llama 3.3 70B)	0.93 s (ranked #2 lowest latency)	Artificial Analysis ↗
Throughput (Llama 3 8B)	Surpasses 1,200 tokens/s	Groq blog / Hacker News ↗
Price (Llama 3.1 8B Instant)	$0.05 input / $0.08 output per 1M tokens	Groq pricing page ↗
Price (Llama 3.3 70B Versatile)	$0.59 input / $0.79 output per 1M tokens	Groq pricing page ↗
SLA target / recent uptime	99.9% target; recent 30-day at ~100% actual	Groq status page ↗
Batch API discount	50% off standard per-token pricing	Groq pricing page ↗

Compare Groq head to head

Groq vs OpenAI API Groq vs Anthropic Claude API Groq vs Google Gemini API Groq vs Mistral La Plateforme Groq vs xAI Grok API Groq vs DeepSeek API

Sources

Figures last verified 2026-06-27. Spotted an error? corrections@apibenchmarks.com