Google Cloud Speech-to-Text
Google · Ranked #4 of 8 in Speech-to-Text APIs
Hyperscaler STT (Chirp models) with 125+ languages, contractual enterprise SLAs and GCP-wide infrastructure, but heavier console onboarding.
Enterprise multilingual STT on GCP

Overview
Google Cloud Speech-to-Text is Google's managed automatic speech recognition (ASR) API, part of Google Cloud's AI/ML portfolio. It exists in two generations: the legacy v1 API and the modern v2 API. The v2 API is built around Google's Chirp family of foundation models (Chirp, Chirp 2, and the current Chirp 3), large self-supervised multilingual models trained on millions of hours of audio across 100+ languages. The product targets developers and enterprises already on Google Cloud who need transcription embedded in IAM, VPC-SC, CMEK and audit-logging governance; the dominant use cases are call-center analytics, media captioning, voice agents, and meeting transcription. It supports synchronous, streaming (real-time), and asynchronous batch recognition, plus speaker diarization, automatic punctuation, word-level timestamps, model adaptation/custom vocabulary, and a built-in denoiser.
Where it wins is breadth and platform integration: 85–125+ languages and locales depending on model, deep GCP-native security/compliance, a lower v2 price point ($0.016/min standard, dropping to ~$0.004/min at volume, plus a Dynamic Batch tier ~75% cheaper for non-urgent jobs), and a mature multi-language SDK surface. Where it loses is accuracy on hard real-world audio and developer experience. Independent benchmarking (Voicewriter.io, cited by Deepgram) put Google at a 13.1% word error rate on noisy/accented/technical audio, materially worse than specialist competitors like Deepgram Nova (~7.6%) and also behind OpenAI Whisper (~10.6%), so its strength on clean conversational English does not always carry into production conditions.
The other recurring friction is operational: the v1/v2 split with differing features, pricing and regional availability confuses integrators; speaker diarization and some add-ons carry extra cost that is not transparently surfaced on the public pricing page; and billing can "double" versus the headline rate once GCP overhead and add-ons are counted. For teams already standardized on Google Cloud it is a safe, well-governed default with a strong SLA and free monthly tier; teams optimizing purely for transcription accuracy, simpler pricing, or DX increasingly evaluate Deepgram, AssemblyAI, or Whisper-based stacks alongside it.
How this score is derived
The APIbenchmarks Index is a weighted sum of four dimensions, each scored on an absolute 0–100 reference scale. See the methodology for every mapping.
| Dimension | Score | Weight | Contribution |
|---|---|---|---|
| Documentation & DX: Extensive official docs covering v1 and v2 APIs, per-model guides (Chirp/Chirp 2/Chirp 3), Colab notebooks and per-language client-library references, though the v1-vs-v2 fragmentation makes it easy to land on the wrong version. | 78 | 30% | 23.4 |
| Reliability: Backed by a published Google Cloud SLA committing to 99.9% monthly uptime with tiered service credits, running on Google's global infrastructure. | 92 | 25% | 23.0 |
| Ecosystem & SDKs: Deeply integrated into the broader Google Cloud platform (IAM, VPC-SC, CMEK, audit logging, Vertex AI) with official SDKs across nine languages, but less of a standalone speech-specialist community than Deepgram or Whisper. | 85 | 25% | 21.3 |
| Accessibility: Self-serve via Google Cloud Console with a $300 new-customer credit and 60 free minutes/month, but requires a GCP account, billing setup and ADC auth, raising the barrier versus single-key API competitors. | 68 | 20% | 13.6 |
| APIbenchmarks Index (ABI) | 81.3 | | |
Table 1. Derivation of the ABI for Google Cloud Speech-to-Text. Contribution = score × weight; the index is their sum.
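The arithmetic behind Table 1 can be reproduced in a few lines. This is a minimal sketch that copies the scores and weights from the table; the names `DIMENSIONS` and `abi` are illustrative and not part of any published APIbenchmarks tooling.

```python
# Scores (0-100) and weights copied from Table 1.
DIMENSIONS = {
    "Documentation & DX": (78, 0.30),
    "Reliability": (92, 0.25),
    "Ecosystem & SDKs": (85, 0.25),
    "Accessibility": (68, 0.20),
}


def abi(dimensions):
    """Return the weighted sum of score * weight over all dimensions."""
    return sum(score * weight for score, weight in dimensions.values())


if __name__ == "__main__":
    for name, (score, weight) in DIMENSIONS.items():
        print(f"{name}: contribution = {score * weight:.2f}")
    print(f"ABI (unrounded) = {abi(DIMENSIONS):.2f}")
```

The unrounded sum is 81.25, which the table reports to one decimal place as 81.3; likewise the Ecosystem & SDKs contribution of 21.25 appears as 21.3.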
At a glance
- Vendor: Google
- Pricing model: Per 15 seconds
- Free tier: 60 min/mo + $300 credit
- Official SDKs: 9 languages
Pricing
| Tier | Price | Details |
|---|---|---|
| Standard (v2 / Chirp) | $0.016 / minute | Real-time and batch transcription on the v2 API; Chirp models included at no surcharge. Down from $0.024/min on v1. |
| Dynamic Batch | ~$0.004 / minute (≈75% off Standard) | Discounted async tier for non-urgent jobs with results delivered within up to 24 hours. |
| Volume / committed-use tiers | as low as $0.004 / minute | Per-minute rate drops at high monthly transcription volumes. |
| Legacy v1 API | $0.024 / minute | Older API generation; standard and enhanced/phone-call models. |
| Free tier | $0 for first 60 min/month | 60 minutes of audio free each month; new Google Cloud customers also get $300 in credits. |
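To see how the headline rates translate into a monthly bill, a rough estimator can help. The sketch below uses the per-minute figures from the table and makes simplifying assumptions that are not confirmed by the source: a single flat rate per tier, the 60 free minutes applied before any billing, and no add-on charges, volume discounts, or other GCP overhead. The function and constant names are hypothetical.

```python
# Per-minute rates in USD, taken from the pricing table above.
STANDARD_RATE = 0.016       # v2 Standard
DYNAMIC_BATCH_RATE = 0.004  # approximate Dynamic Batch rate
LEGACY_V1_RATE = 0.024      # legacy v1 API
FREE_MINUTES_PER_MONTH = 60


def monthly_cost(minutes, rate_per_minute, free_minutes=FREE_MINUTES_PER_MONTH):
    """Estimate a monthly bill in USD.

    Assumption: free minutes are deducted first and the remainder is
    billed at one flat rate. Add-ons and volume tiers are ignored.
    """
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    billable = max(0.0, minutes - free_minutes)
    return billable * rate_per_minute


if __name__ == "__main__":
    usage = 10_000  # minutes of audio in a month (example value)
    for label, rate in [
        ("Standard v2", STANDARD_RATE),
        ("Dynamic Batch", DYNAMIC_BATCH_RATE),
        ("Legacy v1", LEGACY_V1_RATE),
    ]:
        print(f"{label}: ${monthly_cost(usage, rate):.2f}")
```

Because add-ons and platform overhead are left out, the result is best read as a lower bound on the real invoice.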
Key features
- Chirp 3 generative multilingual ASR model (state-of-the-art accuracy, v2-exclusive)
- Real-time streaming recognition (StreamingRecognize)
- Synchronous and asynchronous batch recognition (Recognize, BatchRecognize)
- Speaker diarization (multi-speaker identification on single-channel audio)
- Automatic punctuation and word-level timestamps
- Automatic language detection for multilingual audio
- Model adaptation / custom vocabulary (speech adaptation)
- Built-in denoiser for noisy audio
- 85–125+ languages and locales supported
- CMEK, audit logging, and VPC-SC enterprise security controls
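As an illustration of the synchronous Recognize path listed above, here is a minimal sketch using the `google-cloud-speech` Python package's v2 client. It assumes the package is installed, Application Default Credentials are configured, and a short local audio clip is available. The helper names `recognizer_path` and `transcribe_file`, the `global` location, the implicit `_` recognizer, and the `long` model are choices made for this example, not recommendations from the source.

```python
def recognizer_path(project_id, location="global", recognizer_id="_"):
    """Build the v2 recognizer resource name ("_" selects the implicit default)."""
    return f"projects/{project_id}/locations/{location}/recognizers/{recognizer_id}"


def transcribe_file(project_id, audio_path, language_code="en-US", model="long"):
    """Transcribe a short local audio file with the v2 synchronous API."""
    # Imported lazily so this module loads even without the dependency installed.
    from google.cloud.speech_v2 import SpeechClient
    from google.cloud.speech_v2.types import cloud_speech

    with open(audio_path, "rb") as f:
        content = f.read()

    client = SpeechClient()  # uses Application Default Credentials
    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=[language_code],
        model=model,
        features=cloud_speech.RecognitionFeatures(
            enable_automatic_punctuation=True,
            enable_word_time_offsets=True,
        ),
    )
    request = cloud_speech.RecognizeRequest(
        recognizer=recognizer_path(project_id),
        config=config,
        content=content,
    )
    response = client.recognize(request=request)
    # Keep the top alternative of each result segment.
    return [r.alternatives[0].transcript for r in response.results if r.alternatives]


if __name__ == "__main__":
    # "my-project" and "sample.wav" are placeholders.
    for line in transcribe_file("my-project", "sample.wav"):
        print(line)
```

Chirp models are generally served from regional locations rather than `global`, which would likely require a different `location` and a matching client endpoint; that setup is deliberately left out of this sketch.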
Official SDKs
C++, C#, Go, Java, Node.js, PHP, Python, Ruby, Rust
Strengths & trade-offs
- + Included Chirp/Chirp 2/Chirp 3 foundation models cover 85–125+ languages and locales, among the broadest language coverage of any STT API
- + Lower v2 pricing ($0.016/min) with Dynamic Batch and volume tiers dropping cost toward $0.004/min
- + Deep Google Cloud integration: IAM, VPC Service Controls, CMEK, audit logging and Vertex AI for enterprise governance
- + Full recognition surface (streaming/real-time, sync, and batch) plus diarization, auto-punctuation, word timestamps, model adaptation and a built-in denoiser
- + Strong published SLA (99.9% monthly uptime with service credits) on Google's global infrastructure
- + Official client libraries across nine languages (C++, C#, Go, Java, Node.js, PHP, Python, Ruby, Rust)
- – Higher real-world word error rate (~13.1% on noisy/accented/technical audio) than specialist rivals like Deepgram (~7.6%)
- – Confusing v1-vs-v2 split with different features, prices and regional availability
- – Add-ons such as speaker diarization cost extra and are not transparently documented on the public pricing page
- – Effective bill often exceeds the headline $0.016/min once GCP overhead and add-ons are counted
- – Diarization can misattribute speakers, requiring manual correction
- – Requires a full Google Cloud account, billing and ADC auth setup rather than a single API key
What developers say
G2 4.5/5 (~240 reviews)
Users praise ease of integration, broad language coverage and clean-audio accuracy, but criticize accuracy in noisy/accented conditions, escalating cost at volume, and extra charges for features like diarization.
“Very easy to use and handle, send audio and get it back as text, very easy implementation; GCP's model handles background noise way better than I expected.”
Key figures
| Metric | Value | Source |
|---|---|---|
| Word error rate (noisy/accented/technical real-world audio) | 13.1% WER | Voicewriter.io independent benchmark (cited by Deepgram) |
| Standard transcription price (v2) | $0.016 / minute | Google Cloud / Cloud Ace pricing announcement |
| Legacy v1 transcription price | $0.024 / minute | Google Cloud Speech-to-Text v2 pricing comparison |
| Dynamic Batch discount vs Standard | ~75% lower per minute (≈$0.004/min) | Google Cloud pricing |
| Monthly uptime SLA | 99.9% | Google Cloud Speech-to-Text SLA |
| Free tier | 60 minutes/month free | Google Cloud pricing |
| Language coverage (Chirp 3) | 85+ languages and locales | Google Cloud Chirp 3 documentation |
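For readers unfamiliar with the word error rate figure, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer's output, divided by the number of reference words. The sketch below implements that textbook definition. It is not the Voicewriter.io methodology, and its text normalization (lowercasing and whitespace splitting only) is an assumption of this example.

```python
def word_error_rate(reference, hypothesis):
    """Return WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sat on mat"
    print(f"WER: {word_error_rate(ref, hyp):.3f}")
```

Published benchmarks often normalize punctuation, numerals and casing differently, so figures computed this way are only comparable when the same normalization is applied to every system.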
Sources
- https://cloud.google.com/speech-to-text/pricing
- https://cloud.google.com/speech-to-text/sla
- https://docs.cloud.google.com/speech-to-text/docs/models/chirp-3
- https://cloud.google.com/blog/products/ai-machine-learning/google-cloud-speech-to-text-v2-api
- https://cloud.google.com/speech-to-text/v2/docs/libraries
- https://deepgram.com/learn/deepgram-vs-google-speech-to-text-comparison
- https://www.g2.com/products/google-cloud-speech-to-text/reviews
- https://id.cloud-ace.com/resources/cloud-speech-to-text-v2-api-and-chirp-are-now-generally-available-with-new-lower-pricing-tier
- https://brasstranscripts.com/blog/google-cloud-speech-to-text-pricing-2025-gcp-integration-costs
Figures last verified 2026-06-27. Spotted an error? corrections@apibenchmarks.com
