OpenAI Whisper / GPT-4o Transcribe
OpenAI · Ranked #3 of 8 in Speech-to-Text APIs
Transcription endpoints (whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe) bundled into the broader OpenAI API; simple flat per-minute pricing, no STT-specific free tier.
Transcription inside the OpenAI model API

Overview
OpenAI's speech-to-text lineup spans three generations under one API: the original open-weight Whisper (released Sept 2022, MIT-licensed, 99 languages, served as the hosted whisper-1 endpoint); the 2025 GPT-4o-based transcription models, gpt-4o-transcribe and the cheaper gpt-4o-mini-transcribe; and the late-2025 gpt-4o-transcribe-diarize variant, which adds built-in speaker attribution. All are reachable through the single /v1/audio/transcriptions endpoint (and several over the Realtime API). That makes OpenAI the default low-friction choice for any team already on the OpenAI platform: one API key, one SDK, near-zero integration cost. Whisper also remains downloadable and self-hostable, which is rare among the major commercial STT vendors and a major reason it dominates the open-source ASR ecosystem.
On accuracy, the GPT-4o transcribe models are a genuine step up from Whisper for multilingual and noisy audio. OpenAI reports lower word error rate (WER) than Whisper v2/v3 across the 100+-language FLEURS benchmark, and independent reviewers cluster gpt-4o-transcribe with Deepgram Nova-3, AssemblyAI and ElevenLabs Scribe inside a 2-5% WER band on standard benchmarks. Notably, on clean read speech (LibriSpeech test-clean) the older Whisper Large-v3 (~2.7% WER) still edges out OpenAI's own reported ~4.1% for gpt-4o-transcribe, so the GPT-4o models win on robustness and language coverage rather than on every academic metric. The headline weakness is architectural: because the GPT-4o models are LLM-based, they can "follow" instructions embedded in the audio rather than transcribing them verbatim. Simon Willison and OpenAI's own engineers have flagged this; for sensitive verbatim use cases, plain Whisper is arguably safer.
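The WER figures quoted above are word-level edit distances divided by the reference length. The sketch below shows that calculation in its simplest form. It assumes lowercasing and whitespace splitting are sufficient normalization, which published benchmarks generally do not assume (they also normalize punctuation, numerals and spelling variants), so its output is not directly comparable to the numbers in this review. The function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words (Levenshtein distance over words).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

One substituted word in a six-word reference gives a WER of 1/6, or about 16.7%. A dropped sentence counts as a run of deletions, so comparing against a known reference is one rough way to spot the instruction-following failure described above.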
The practical gaps matter more than the WER deltas for many buyers. The hosted API caps uploads at 25 MB (roughly 25-30 minutes of typical compressed audio), forcing client-side chunking for long files; word-level timestamps are weaker than those from purpose-built tools (WhisperX, Deepgram); and true speaker diarization only arrived with gpt-4o-transcribe-diarize, which as of mid-2026 is Transcription-API-only with documented teething issues on Azure and in maintaining speaker identity across chunks. Pricing is simple and competitive at the low end ($0.006/min for whisper-1 and gpt-4o-transcribe, $0.003/min for mini), but token-based GPT-4o billing can surprise on long audio, and there is no published STT-specific SLA. Net: an excellent default for general-purpose, multilingual transcription embedded in an OpenAI-centric stack; less ideal where you need guaranteed verbatim output, long-file handling without chunking, or contractual uptime guarantees.
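To make the 25 MB cap concrete, the following sketch plans chunk boundaries for a long recording. It rests on several assumptions that are not from OpenAI's documentation: that "25 MB" means 25 × 1024 × 1024 bytes, that the audio is constant-bitrate, and that a small overlap between chunks is acceptable. `plan_chunks`, `overlap_s` and `safety` are hypothetical names, and the actual cutting (for example with ffmpeg) is not shown.

```python
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # assumed interpretation of the 25 MB cap


def plan_chunks(duration_s: float, bitrate_kbps: float,
                overlap_s: float = 2.0, safety: float = 0.95):
    """Return (start, end) offsets in seconds so each chunk stays under the cap.

    Assumes constant-bitrate audio; `safety` leaves headroom for container overhead.
    """
    bytes_per_second = bitrate_kbps * 1000 / 8
    max_chunk_s = MAX_UPLOAD_BYTES * safety / bytes_per_second
    if overlap_s >= max_chunk_s:
        raise ValueError("overlap must be shorter than one chunk")

    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_chunk_s, duration_s)
        chunks.append((start, end))
        if end >= duration_s:
            break
        # Step back slightly so a word cut at the boundary appears whole in the next chunk.
        start = end - overlap_s
    return chunks
```

At 128 kbps a single chunk comes out at roughly 26 minutes, in line with the 25-30 minute estimate above. Overlapping chunks means the caller has to de-duplicate text at the seams, and it does nothing for the cross-chunk speaker-identity problem noted for the diarization model.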
How this score is derived
The APIbenchmarks Index is a weighted sum of four dimensions, each scored on an absolute 0–100 reference scale. See the methodology for every mapping.
| Dimension | Score | Weight | Contribution |
|---|---|---|---|
| Documentation & DX: OpenAI's Speech-to-text guide and per-model API reference pages are clear, example-rich, and cover streaming, prompting, timestamps and the diarized_json format, though edge cases like chunking and diarization-across-chunks are under-documented. | 85 | 30% | 25.5 |
| Reliability: Backed by OpenAI's production platform and status page, but there is no published STT-specific SLA, and the LLM-based GPT-4o models carry a known risk of following spoken instructions instead of transcribing them verbatim. | 80 | 25% | 20.0 |
| Ecosystem & SDKs: Whisper's MIT-licensed open weights spawned a huge ecosystem (whisper.cpp, WhisperX, faster-whisper, Azure hosting) and the hosted models plug into OpenAI's first-party SDKs, OpenRouter, and Azure OpenAI. | 88 | 25% | 22.0 |
| Accessibility: A single API key and audio file get you transcription in a few lines via official SDKs; the 25 MB upload cap and required client-side chunking for long audio are the main accessibility friction points. | 82 | 20% | 16.4 |
| APIbenchmarks Index (ABI) | 83.9 | | |
Table 1. Derivation of the ABI for OpenAI Whisper / GPT-4o Transcribe. Contribution = score × weight; the index is their sum.
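As a quick check of Table 1, the arithmetic can be reproduced in a few lines. The scores and weights are copied from the table; the dictionary layout and the function name `abi` are illustrative only and are not part of the published methodology.

```python
# (score on the 0-100 scale, weight in percent) for each dimension, from Table 1.
DIMENSIONS = {
    "Documentation & DX": (85, 30),
    "Reliability": (80, 25),
    "Ecosystem & SDKs": (88, 25),
    "Accessibility": (82, 20),
}


def abi(dimensions: dict) -> float:
    """Weighted sum of scores; the percentage weights must total 100."""
    if sum(weight for _, weight in dimensions.values()) != 100:
        raise ValueError("weights must sum to 100%")
    return sum(score * weight for score, weight in dimensions.values()) / 100


# Per-dimension contributions: 25.5, 20.0, 22.0 and 16.4, matching the table.
contributions = {name: score * weight / 100
                 for name, (score, weight) in DIMENSIONS.items()}

print(round(abi(DIMENSIONS), 1))  # 83.9
```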
At a glance
- Vendor: OpenAI
- Pricing model: Per minute (per-second billed)
- Free tier: No
- Official SDKs: 10 languages
Pricing
| Model | Price | Notes |
|---|---|---|
| whisper-1 | $0.006 / min | Original hosted Whisper model; billed per second of audio duration, ~$0.36/hour. |
| gpt-4o-transcribe | $0.006 / min (~$6.00 / 1M audio input tokens) | Flagship GPT-4o speech-to-text; lower WER and better multilingual accuracy than Whisper. |
| gpt-4o-mini-transcribe | $0.003 / min (~$3.00 / 1M audio input tokens) | Cost-efficient smaller model, half the price of the flagship. |
| gpt-4o-transcribe-diarize | Usage-based (audio token pricing) | Adds built-in speaker diarization; Transcription API only, returns diarized_json with A:/B: speaker labels. |
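For budgeting, the flat rates in the table translate directly into a per-recording estimate. The sketch below assumes per-second billing at the listed per-minute rate. For the GPT-4o models, which are metered in audio tokens, treat the result as an approximation rather than an invoice figure. `estimate_cost` is a hypothetical helper, and gpt-4o-transcribe-diarize is deliberately excluded because no flat rate is listed for it.

```python
# Flat per-minute list prices in USD, copied from the pricing table above.
PRICE_PER_MINUTE = {
    "whisper-1": 0.006,
    "gpt-4o-transcribe": 0.006,
    "gpt-4o-mini-transcribe": 0.003,
}


def estimate_cost(model: str, audio_seconds: float) -> float:
    """Estimated cost in USD, assuming per-second billing at the per-minute rate."""
    if model not in PRICE_PER_MINUTE:
        raise KeyError(f"no flat per-minute price listed for {model!r}")
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return PRICE_PER_MINUTE[model] * audio_seconds / 60
```

One hour of audio on whisper-1 works out to about $0.36, matching the table, and about $0.18 on gpt-4o-mini-transcribe.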
Key features
- Automatic language detection across 99 (Whisper) / 100+ (FLEURS-tested GPT-4o) languages
- Speech-to-English translation endpoint
- Streaming transcription via stream=True and the Realtime API (gpt-4o-mini-transcribe, gpt-4o-transcribe, diarize)
- Built-in speaker diarization with optional reference clips (gpt-4o-transcribe-diarize, diarized_json output)
- Prompt parameter to bias spelling/vocabulary and context
- Utterance-level timestamps; verbose_json with segment timing
- Self-hostable open-weight Whisper models (tiny → large-v3) under MIT license
- Supports mp3, mp4, mpeg, mpga, m4a, wav, webm inputs (25 MB cap)
- Available via Azure OpenAI in addition to OpenAI's own API
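A minimal sketch of how the format list, the size cap and the prompt parameter above might combine in client code. It assumes the official `openai` Python SDK, whose client exposes `client.audio.transcriptions.create(...)`, and it again assumes "25 MB" means 25 × 1024 × 1024 bytes. The helper names `check_upload` and `transcribe` are hypothetical, and the client is passed in so the helpers carry no hard dependency on the SDK.

```python
from pathlib import Path
from typing import Optional

# Input formats and size cap as listed in the feature list above.
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # assumed interpretation of the 25 MB cap


def check_upload(path: str) -> Path:
    """Raise ValueError locally if the file would exceed the documented limits."""
    p = Path(path)
    if p.suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported format: {p.suffix or '(none)'}")
    if p.stat().st_size > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the 25 MB cap; split it before uploading")
    return p


def transcribe(client, path: str, model: str = "gpt-4o-mini-transcribe",
               prompt: Optional[str] = None) -> str:
    """Send one audio file to the transcription endpoint and return its text."""
    p = check_upload(path)
    # The prompt is optional; it biases spelling and vocabulary rather than giving orders.
    extra = {"prompt": prompt} if prompt else {}
    with p.open("rb") as audio:
        result = client.audio.transcriptions.create(model=model, file=audio, **extra)
    return result.text
```

With the SDK installed and `OPENAI_API_KEY` set, a call could look like `transcribe(OpenAI(), "meeting.mp3", prompt="Deepgram, WhisperX")`, where the file name and prompt are placeholders. Checking size and format locally avoids a round trip for files the API would reject anyway.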
Strengths & trade-offs
- +Single endpoint and API key for transcription, translation, streaming and diarization; trivial to integrate for teams already on OpenAI
- +GPT-4o transcribe models post lower WER than Whisper v2/v3 across 100+ languages on FLEURS and are strong on noisy/accented audio
- +Very competitive entry pricing: $0.006/min flagship and $0.003/min mini, billed per second
- +Whisper weights are MIT-licensed and fully self-hostable (rare among commercial STT vendors), enabling on-prem/GDPR-compliant deployments
- +Automatic language identification across 99 languages plus speech-to-English translation with no config
- +Huge open-source ecosystem (whisper.cpp, WhisperX, faster-whisper) for optimization and word-level alignment
- –LLM-based GPT-4o models can follow instructions spoken in the audio instead of transcribing verbatim, a real risk for sensitive/legal use
- –Hosted API caps uploads at 25 MB, forcing client-side chunking for long recordings
- –Native diarization only via the newer gpt-4o-transcribe-diarize, which is Transcription-API-only and has documented bugs (e.g. Azure, cross-chunk speaker identity)
- –Whisper's timestamps are utterance-level and can be off by seconds; word-level accuracy needs external tools like WhisperX
- –No published STT-specific uptime SLA
- –Token-based GPT-4o billing can produce less predictable costs on long audio than flat per-minute pricing
What developers say
Developers praise the easy integration, low price, and strong multilingual accuracy, but raise real concerns about LLM-style hallucination/instruction-following, the 25 MB limit, and rough edges in the new diarization model.
“Any time an LLM-based model is used for audio transcription I worry about accidental instruction following, is there a risk that content that looks like an instruction in the spoken text might not be included in the transcript? For some sensitive applications it may make sense to stick with whisper. I remain skeptical.”
Key figures
| Metric | Value | Source |
|---|---|---|
| FLEURS multilingual WER (vs Whisper v2/v3) | Lower WER than Whisper v2 and v3 across 100+ languages | OpenAI (vendor announcement) |
| LibriSpeech test-clean WER (gpt-4o-transcribe) | ~4.1% (OpenAI-reported); Whisper Large-v3 ~2.7% | Promptt.dev / TokenMix comparison |
| Top-tier STT WER band | 2-5% WER, within ~1-2 pts of Deepgram Nova-3, AssemblyAI, ElevenLabs Scribe | Coval independent benchmark |
| Price (whisper-1 / gpt-4o-transcribe) | $0.006 / minute | OpenAI API pricing |
| Price (gpt-4o-mini-transcribe) | $0.003 / minute (~$3.00 / 1M audio input tokens) | OpenAI API pricing |
| Training data (Whisper) | 680,000 hours of multilingual/multitask supervised audio | OpenAI Whisper |
| Max upload size (hosted API) | 25 MB per file | OpenAI speech-to-text guide |
Sources
- https://openai.com/index/introducing-our-next-generation-audio-models/
- https://developers.openai.com/api/docs/guides/speech-to-text
- https://developers.openai.com/api/docs/pricing
- https://platform.openai.com/docs/models/gpt-4o-transcribe-diarize
- https://openai.com/index/whisper/
- https://www.coval.ai/blog/best-speech-to-text-providers-in-2026-independent-benchmarks-and-how-to-choose/
- https://www.promptt.dev/blog/whisper-1-vs-gpt-4o-transcribe-full-comparison-2025
- https://simonw.substack.com/p/new-audio-models-from-openai-but
- https://community.openai.com/t/introducing-gpt-4o-transcribe-diarize-now-available-in-the-audio-api/1362933
Figures last verified 2026-06-27. Spotted an error? corrections@apibenchmarks.com
