AssemblyAI
AssemblyAI · Ranked #2 of 8 in Speech-to-Text APIs
Research-driven STT with the Universal and Slam-1 models and a deep audio-intelligence add-on stack (sentiment, topics, LeMUR LLM).
Accurate STT plus audio intelligence

Overview
AssemblyAI is a developer-first speech-to-text (ASR) API company whose core product is a hosted REST/WebSocket API for transcribing audio, layered with "Audio Intelligence" features (speaker diarization, sentiment, topic detection, entity detection, PII redaction, auto chapters) and LeMUR/LLM Gateway for running language models over transcripts. Its current models are the Universal family for async transcription (Universal-2 at $0.15/hr, Universal-3 Pro at $0.21/hr) and Universal-Streaming for real-time. The company targets software teams building voice agents, call analytics, meeting/notetaking tools, and media-transcription products, positioning itself against Deepgram, OpenAI Whisper, Google Cloud Speech-to-Text and Amazon Transcribe. It is API-only (there is no end-user transcription app) and competes on accuracy, clean formatting (punctuation, casing, proper nouns, alphanumerics), and a polished developer experience.
On accuracy, AssemblyAI benchmarks well: its own and third-party comparisons put Universal-2 around 6.68% WER on English versus ~7.88% for Whisper large-v3, and Universal-3 Pro (Feb 2026) claims ~5.6% mean WER with natural-language keyterm prompting. For streaming, AssemblyAI reports ~307 ms P50 word-emission latency (vs ~516 ms for Deepgram Nova-3) and ~91% accuracy on real-world audio. Pricing is aggressive on the base rate ($0.15/hr is well below Amazon's ~$1.44/hr), but the practical caveat repeated by reviewers is that Audio Intelligence add-ons (diarization, entity detection, redaction, translation, etc.) stack on top, so real-world cost per hour can climb substantially above the headline number. Independent write-ups cite an effective ~$0.35/hr once common features are enabled.
Sentiment is strongly positive overall (G2 ~4.6/5 across ~114 reviews, with very high ease-of-use and support scores), with praise concentrated on transcription accuracy on hard audio, clean output requiring little cleanup, and excellent docs/SDKs. The recurring criticisms are cost at high volume once add-ons are stacked, inconsistent real-time latency under load for some users, historically weaker support for certain non-English languages, limited deep customization/fine-tuning for domain vocabulary, and some billing-UX friction (e.g. removing a card or disabling autopay). The company backs an enterprise offering with a 99.9% uptime SLA and a public status page, and cites scale figures of 600M+ monthly inference calls.
How this score is derived
The APIbenchmarks Index is a weighted sum of four dimensions, each scored on an absolute 0–100 reference scale. See the methodology for every mapping.
| Dimension | Score | Weight | Contribution |
|---|---|---|---|
| Documentation & DX: Widely praised, developer-first docs with quickstarts, an OpenAPI reference, official Python/JS SDKs and copy-paste tutorials that reviewers say make integration fast. | 92 | 30% | 27.6 |
| Reliability: Backs a published 99.9% annual uptime SLA with a public status/uptime page and cites 600M+ monthly inference calls, though some users report inconsistent real-time latency under load. | 82 | 25% | 20.5 |
| Ecosystem & SDKs: Solid official SDK coverage (Python, JavaScript first-class; Java/Go in maintenance), plus integrations like a Microsoft Power Platform connector and LangChain/LLM Gateway tooling. | 83 | 25% | 20.8 |
| Accessibility: Self-serve signup with $50 in free credits and no credit card required lowers the barrier, but advanced Audio Intelligence add-ons and enterprise SLOs sit behind stacked/usage pricing. | 90 | 20% | 18.0 |
| APIbenchmarks Index (ABI) | 86.9 | | |
Table 1. Derivation of the ABI for AssemblyAI. Contribution = score × weight; the index is their sum.
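The arithmetic behind Table 1 can be reproduced with a minimal sketch. The scores and weights are copied from the table; the dictionary layout, the function name `abi`, and the half-up rounding to one decimal place are assumptions made for illustration, since the page does not state its rounding rule. Weights are kept as whole percentages and summed with `Decimal` so the result is exact.

```python
from decimal import Decimal, ROUND_HALF_UP

# (score, weight in percent) per dimension, copied from Table 1.
DIMENSIONS = {
    "Documentation & DX": (92, 30),
    "Reliability": (82, 25),
    "Ecosystem & SDKs": (83, 25),
    "Accessibility": (90, 20),
}

def abi(dimensions):
    """Return the unrounded weighted sum of the dimension scores."""
    if sum(pct for _, pct in dimensions.values()) != 100:
        raise ValueError("weights must sum to 100%")
    return sum(Decimal(score) * pct for score, pct in dimensions.values()) / 100

raw = abi(DIMENSIONS)
# Assumed rounding rule: half-up to one decimal place.
rounded = raw.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
print(raw, rounded)
```

Under these assumptions the unrounded sum is 86.85, which matches the published 86.9 after rounding. The Ecosystem & SDKs contribution is 20.75 before rounding, which the table shows as 20.8.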
At a glance
- Vendor: AssemblyAI
- Pricing model: Per minute (usage-based)
- Free tier: $50 one-time credit
- Official SDKs: 7 languages
Pricing
| Plan | Price | Notes |
|---|---|---|
| Free credits | $50 free | No credit card required; 5 new streams/min concurrency limit. Self-serve signup. |
| Universal-2 (async STT) | $0.15/hr | Pre-recorded transcription, billed per second; lowest base rate tier. |
| Universal-3 Pro (async STT) | $0.21/hr | Higher-accuracy async model (~5.6% mean WER, keyterm prompting). |
| Universal-Streaming | $0.15/hr | Real-time English/multilingual; billed per WebSocket session duration. |
| Voice Agent API | $4.50/hr ($0.075/min) | All-inclusive STT + LLM + TTS + turn detection + tool calling. |
| Enterprise | Custom | 99.9% uptime SLA, custom SLOs, volume pricing, committed-use; 600M+ monthly inference calls capacity. |
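As a rough illustration of how the listed base rates translate into a bill, the sketch below prorates the hourly price to the second. The rates come from the pricing rows above; the dictionary keys are hypothetical labels rather than official model identifiers, and add-on charges are deliberately left out because this page does not list their prices.

```python
from decimal import Decimal

# Base rates in USD per hour of audio, from the pricing rows above.
# The keys are illustrative labels, not API model names.
RATES_PER_HOUR = {
    "universal-2": Decimal("0.15"),
    "universal-3-pro": Decimal("0.21"),
    "universal-streaming": Decimal("0.15"),
    "voice-agent": Decimal("4.50"),
}

def base_cost_usd(model, audio_seconds):
    """Base cost for a given duration, prorated per second (add-ons excluded)."""
    return RATES_PER_HOUR[model] * audio_seconds / 3600

clip = base_cost_usd("universal-2", 90)               # one 90-second recording
monthly = base_cost_usd("universal-2", 1000 * 3600)   # 1,000 hours of audio
agent_minute = base_cost_usd("voice-agent", 60)       # one minute of Voice Agent API
print(clip, monthly, agent_minute)
```

The last value reproduces the $0.075/min figure quoted for the Voice Agent API. Real invoices will be higher whenever Audio Intelligence add-ons are enabled.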
Key features
- Universal async speech-to-text (Universal-2, Universal-3 Pro)
- Universal-Streaming real-time transcription over WebSocket
- Speaker diarization across 95 languages (~2.9% speaker-count error rate)
- 99-language transcription with automatic language detection
- Audio Intelligence: sentiment analysis, topic detection, auto chapters, entity detection
- PII text redaction and medical mode
- LeMUR: apply LLMs over transcripts (up to ~10 hrs / 150k tokens per call)
- LLM Gateway with token-based access to GPT, Claude, Gemini models
- Voice Agent API bundling STT + LLM + TTS + turn detection
- Keyterm/natural-language prompting (up to ~1,500 words) for domain accuracy
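To make the "base transcription plus opt-in add-ons" model concrete, here is a minimal sketch of how an async request might be assembled. The endpoint URL, the `authorization` header, and the field names (`audio_url`, `speaker_labels`, `sentiment_analysis`, `entity_detection`, `auto_chapters`) are assumptions recalled from the public REST API and are not confirmed by this page, so check them against the official API reference. The helper names and the short feature labels are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint for submitting an async transcription job.
TRANSCRIPT_ENDPOINT = "https://api.assemblyai.com/v2/transcript"

# Hypothetical short labels mapped to assumed request fields.
FEATURE_FLAGS = {
    "diarization": "speaker_labels",
    "sentiment": "sentiment_analysis",
    "entities": "entity_detection",
    "chapters": "auto_chapters",
}

def build_transcript_request(audio_url, features=()):
    """Build the JSON body: one audio URL plus a boolean flag per add-on."""
    payload = {"audio_url": audio_url}
    for name in features:
        payload[FEATURE_FLAGS[name]] = True  # raises KeyError on unknown labels
    return payload

def submit(payload, api_key):
    """POST the job; needs a real API key and network access (not called here)."""
    request = urllib.request.Request(
        TRANSCRIPT_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"authorization": api_key, "content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

payload = build_transcript_request(
    "https://example.com/call.mp3", ["diarization", "sentiment"]
)
print(json.dumps(payload, indent=2))
```

Each flag switched on here corresponds to an add-on that, according to the pricing notes on this page, is billed on top of the base rate, so it is worth enabling only the features a product actually uses.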
Strengths & trade-offs
- + Strong English accuracy: Universal-2 at ~6.68% WER beats Whisper large-v3 (~7.88%) in published comparisons
- + Very low base price ($0.15/hr async), far cheaper than Amazon Transcribe (~$1.44/hr) for comparable English STT
- + Clean, low-cleanup output: reliable punctuation, casing, proper nouns and alphanumerics (order numbers, etc.)
- + Developer-first experience: well-regarded docs and first-class Python/JavaScript SDKs make integration fast
- + Low streaming latency: ~307 ms P50 word emission, roughly 41% lower than Deepgram Nova-3 (~516 ms)
- + Rich Audio Intelligence layer (diarization in 95 languages, sentiment, PII redaction, topic detection) plus LeMUR/LLM Gateway over transcripts
- – Add-on features (diarization, entity detection, redaction, translation) stack on the base rate, pushing effective cost well above the $0.15/hr headline (~$0.35/hr in independent estimates)
- – Costs add up at high audio volume for usage-heavy apps
- – Some users report inconsistent real-time latency under load, hurting predictability for live use cases
- – Limited deep customization/fine-tuning for domain-specific vocabulary or acoustic quirks
- – Historically weaker accuracy/support for some non-English languages (e.g. requests for better Portuguese/Spanish)
- – Billing-UX friction reported (e.g. difficulty removing a card or disabling autopay)
What developers say
G2 4.6/5 · 114 reviews
Sentiment is strongly positive, centered on transcription accuracy, clean output and developer experience, with cost-at-scale and add-on pricing as the main complaints.
“Reviewers mostly praise AssemblyAI for accurate transcripts, especially on names, numbers, punctuation and formatting, saying outputs need less cleanup and the API is easy to test and integrate thanks to strong documentation.”
Key figures
| Metric | Value | Source |
|---|---|---|
| Word error rate (Universal-2, English) | 6.68% WER | AssemblyAI vs Whisper comparison (third-party cited) |
| Word error rate (Universal-3 Pro, mean) | 5.6% WER | Independent ASR benchmark roundup |
| Streaming latency (P50 word emission) | 307 ms (vs 516 ms Deepgram Nova-3) | AssemblyAI Universal-Streaming announcement |
| Streaming latency (P99) | 1,012 ms | AssemblyAI Universal-Streaming announcement |
| Uptime SLA | 99.9% annual (~8h46m max downtime/yr) | AssemblyAI docs (uptime SLA) |
| Async price (Universal-2) | $0.15/hr (billed per second) | AssemblyAI pricing page |
| Speaker-count error rate (diarization) | 2.9% | AssemblyAI Speaker Diarization feature page |
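Two of the figures above are derived rather than measured: the downtime allowance implied by a 99.9% SLA, and the relative gap between the two P50 latencies. The sketch below shows one way to reproduce them. It assumes a 365-day year, and both function names are illustrative.

```python
def max_downtime_seconds(uptime_permille, days=365):
    """Seconds of allowed downtime per year for an uptime given in tenths of a percent."""
    return days * 24 * 3600 * (1000 - uptime_permille) // 1000

def relative_reduction_pct(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100 * (baseline - value) / baseline

# 99.9% uptime expressed as 999 per mille.
seconds_down = max_downtime_seconds(999)
hours, remainder = divmod(seconds_down, 3600)
minutes, seconds = divmod(remainder, 60)

# P50 word-emission latency: 307 ms against a 516 ms baseline.
latency_gap = relative_reduction_pct(516, 307)

print(f"{hours}h {minutes}m {seconds}s", f"{latency_gap:.1f}%")
```

The downtime works out to 8 h 45 min 36 s, which the table rounds to ~8h46m, and the latency gap is about 40.5%, consistent with the "roughly 41%" quoted in the strengths list.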
Sources
- https://www.assemblyai.com/pricing
- https://www.assemblyai.com/blog/introducing-universal-streaming
- https://www.assemblyai.com/blog/comparing-universal-2-and-openai-whisper
- https://www.assemblyai.com/docs/faq/what-is-your-api-uptime-sla
- https://status.assemblyai.com/
- https://www.g2.com/products/assemblyai-speech-to-text-api/reviews
- https://www.coval.ai/blog/best-speech-to-text-providers-in-2026-independent-benchmarks-and-how-to-choose/
- https://brasstranscripts.com/blog/assemblyai-pricing-per-minute-2025-real-costs
- https://www.assemblyai.com/features/speaker-diarization
Figures last verified 2026-06-27. Spotted an error? corrections@apibenchmarks.com
