APIbenchmarks

OpenAI Whisper / GPT-4o Transcribe

OpenAI · Ranked #3 of 8 in Speech-to-Text APIs

83.9 / 100
Grade B (Strong)

Transcription models (whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe) bundled into the broader OpenAI API, with simple flat per-minute pricing and no STT-specific free tier.

Best for

Transcription inside the OpenAI model API


Overview

OpenAI's speech-to-text lineup spans three generations under one API: the original open-weight Whisper (released September 2022, MIT-licensed, 99 languages, served as the hosted whisper-1 endpoint); the 2025 GPT-4o-based transcription models, gpt-4o-transcribe and the cheaper gpt-4o-mini-transcribe; and the late-2025 gpt-4o-transcribe-diarize variant, which adds built-in speaker attribution. All are reachable through the single /v1/audio/transcriptions endpoint (and several over the Realtime API), which makes OpenAI the default low-friction choice for any team already on the OpenAI platform: one API key, one SDK, near-zero integration cost. Whisper also remains downloadable and self-hostable, which is rare among the major commercial STT vendors and a major reason it dominates the open-source ASR ecosystem.
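The "one API key, one SDK" integration path can be sketched in a few lines. This is a minimal sketch, not a reference implementation: it assumes the official `openai` Python package, whose client exposes `client.audio.transcriptions.create(...)`. The helper name `transcribe_file` and its default model are illustrative choices, and the client is passed in as an argument so the function can be tried with a stub and no network access.

```python
from typing import Any


def transcribe_file(client: Any, path: str, model: str = "gpt-4o-mini-transcribe") -> str:
    """Send one audio file to the transcription endpoint and return its text.

    `client` is assumed to behave like `openai.OpenAI()`; `transcribe_file`
    itself is a hypothetical helper, not part of the SDK.
    """
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(model=model, file=audio)
    # The default JSON response exposes the transcript on `.text`.
    return result.text
```

With the real SDK installed and an `OPENAI_API_KEY` environment variable set, a call would look like `transcribe_file(OpenAI(), "meeting.mp3")`, where `meeting.mp3` is a placeholder file name. Switching between whisper-1 and the GPT-4o models is then only a change to the `model` argument.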

On accuracy, the GPT-4o transcribe models are a genuine step up from Whisper for multilingual and noisy audio. OpenAI reports lower word error rate (WER) than Whisper v2/v3 across the 100+-language FLEURS benchmark, and independent reviewers cluster gpt-4o-transcribe with Deepgram Nova-3, AssemblyAI and ElevenLabs Scribe inside a 2-5% WER band on standard benchmarks. Notably, on clean read speech (LibriSpeech test-clean) the older Whisper Large-v3 (~2.7% WER) still edges out OpenAI's own reported ~4.1% for gpt-4o-transcribe, so the GPT-4o models win on robustness and language coverage rather than on every academic metric. The headline weakness is architectural: because the GPT-4o models are LLM-based, they can "follow" instructions embedded in the audio rather than transcribing them verbatim. Simon Willison and OpenAI's own engineers have flagged this; for sensitive verbatim use cases, plain Whisper is arguably safer.
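For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is illustrative only: it normalizes by lowercasing and splitting on whitespace, whereas published benchmarks typically apply heavier text normalization (punctuation, numerals), so it will not reproduce the figures quoted above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

A transcript that silently drops a spoken instruction shows up here as deletions, which is one way to spot-check the instruction-following risk against a known reference recording.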

The practical gaps matter more than the WER deltas for many buyers. The hosted API caps uploads at 25 MB (roughly 25-30 minutes of audio at a 128 kbps MP3 bitrate), forcing client-side chunking for long files; word-level timestamps are weaker than those of purpose-built tools (WhisperX, Deepgram); and true speaker diarization only arrived with gpt-4o-transcribe-diarize, which as of mid-2026 is Transcription-API-only and has documented teething issues on Azure and in maintaining speaker identity across chunks. Pricing is simple and competitive at the low end ($0.006/min for whisper-1 and gpt-4o-transcribe, $0.003/min for gpt-4o-mini-transcribe), but token-based GPT-4o billing can surprise on long audio, and there is no published STT-specific SLA. Net: an excellent default for general-purpose, multilingual transcription embedded in an OpenAI-centric stack; less ideal where you need guaranteed verbatim output, long-file handling without chunking, or contractual uptime guarantees.
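The client-side chunking forced by the 25 MB cap can be planned with simple arithmetic. The following is a rough sketch under stated assumptions: constant-bitrate audio, the cap interpreted as 25 × 1024 × 1024 bytes, and a 5% safety margin for container overhead. The function names and the margin are illustrative and do not come from OpenAI's documentation.

```python
import math

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # 25 MB cap; binary megabytes assumed


def max_chunk_seconds(bitrate_kbps: float, safety_margin: float = 0.95) -> int:
    """Longest chunk, in whole seconds, that stays under the upload cap."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return int(MAX_UPLOAD_BYTES * safety_margin / bytes_per_second)


def plan_chunks(total_seconds: float, bitrate_kbps: float) -> list[tuple[float, float]]:
    """Return (start, end) offsets in seconds covering the whole recording."""
    if total_seconds <= 0:
        return []
    limit = max_chunk_seconds(bitrate_kbps)
    count = math.ceil(total_seconds / limit)
    size = total_seconds / count  # equal-length chunks avoid a tiny final chunk
    return [(i * size, min((i + 1) * size, total_seconds)) for i in range(count)]
```

For a one-hour recording at 128 kbps this yields three 20-minute chunks. Cutting at fixed offsets can split a word in half, so a production pipeline would more likely cut on silence or overlap adjacent chunks; for diarized output, speaker labels would also need to be reconciled across chunks.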

How this score is derived

The APIbenchmarks Index is a weighted sum of four dimensions, each scored on an absolute 0–100 reference scale. See the methodology for every mapping.

Dimension | Score | Weight | Contribution | Rationale
Documentation & DX | 85 | 30% | 25.5 | OpenAI's Speech-to-text guide and per-model API reference pages are clear, example-rich, and cover streaming, prompting, timestamps and the diarized_json format, though edge cases like chunking and diarization-across-chunks are under-documented.
Reliability | 80 | 25% | 20.0 | Backed by OpenAI's production platform and status page, but there is no published STT-specific SLA, and the LLM-based GPT-4o models carry a known risk of following spoken instructions instead of transcribing them verbatim.
Ecosystem & SDKs | 88 | 25% | 22.0 | Whisper's MIT-licensed open weights spawned a huge ecosystem (whisper.cpp, WhisperX, faster-whisper, Azure hosting) and the hosted models plug into OpenAI's first-party SDKs, OpenRouter, and Azure OpenAI.
Accessibility | 82 | 20% | 16.4 | A single API key and audio file get you transcription in a few lines via official SDKs; the 25 MB upload cap and required client-side chunking for long audio are the main accessibility friction points.
APIbenchmarks Index (ABI) | | | 83.9 |

Table 1. Derivation of the ABI for OpenAI Whisper / GPT-4o Transcribe. Contribution = score × weight; the index is their sum.
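The derivation in Table 1 can be reproduced directly. The snippet below is a small check of the arithmetic using the scores and weights from the table; the function name `abi` and the rounding to one decimal place are assumptions for illustration, not part of the published methodology.

```python
# (score, weight) pairs copied from Table 1.
DIMENSIONS = {
    "Documentation & DX": (85, 0.30),
    "Reliability": (80, 0.25),
    "Ecosystem & SDKs": (88, 0.25),
    "Accessibility": (82, 0.20),
}


def abi(dimensions: dict[str, tuple[float, float]]) -> float:
    """Weighted sum of dimension scores, rounded to one decimal place."""
    total_weight = sum(weight for _, weight in dimensions.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1, got {total_weight}")
    return round(sum(score * weight for score, weight in dimensions.values()), 1)


print(abi(DIMENSIONS))  # 83.9
```

Written out, the sum is 25.5 + 20.0 + 22.0 + 16.4 = 83.9.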

At a glance

Vendor: OpenAI
Pricing model: Per minute (per-second billed)
Free tier: No
Official SDKs: 10 languages

Pricing

Model | Price | Notes
whisper-1 | $0.006 / min | Original hosted Whisper model; billed per second of audio duration, ~$0.36/hour.
gpt-4o-transcribe | $0.006 / min (~$6.00 / 1M audio input tokens) | Flagship GPT-4o speech-to-text; lower WER and better multilingual accuracy than Whisper.
gpt-4o-mini-transcribe | $0.003 / min (~$3.00 / 1M audio input tokens) | Cost-efficient smaller model, half the price of the flagship.
gpt-4o-transcribe-diarize | Usage-based (audio token pricing) | Adds built-in speaker diarization; Transcription API only, returns diarized_json with A:/B: speaker labels.
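To make the per-minute figures concrete, here is a rough cost estimator built from the list prices in the pricing table. It assumes billing is prorated per second, as stated for whisper-1. For the GPT-4o models the per-minute rate is an approximation of token-based billing, so actual invoices may differ. gpt-4o-transcribe-diarize is left out because no per-minute rate is listed for it.

```python
# Per-minute list prices quoted in the pricing table (USD).
PRICE_PER_MINUTE = {
    "whisper-1": 0.006,
    "gpt-4o-transcribe": 0.006,
    "gpt-4o-mini-transcribe": 0.003,
}


def estimate_cost(model: str, audio_seconds: float) -> float:
    """Estimated USD cost, assuming billing is prorated per second."""
    return PRICE_PER_MINUTE[model] * audio_seconds / 60


for name in PRICE_PER_MINUTE:
    print(f"{name}: ${estimate_cost(name, 3600):.2f} per hour")
```

At these rates an hour of audio costs about $0.36 on whisper-1 or gpt-4o-transcribe and about $0.18 on gpt-4o-mini-transcribe.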

Key features

  • Automatic language detection across 99 (Whisper) / 100+ (FLEURS-tested GPT-4o) languages
  • Speech-to-English translation endpoint
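  • Streaming transcription via stream=True on the transcription endpoint (gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-transcribe-diarize) and over the Realtime API (not available for the diarize model, which is Transcription-API-only)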
  • Built-in speaker diarization with optional reference clips (gpt-4o-transcribe-diarize, diarized_json output)
  • Prompt parameter to bias spelling/vocabulary and context
  • Utterance-level timestamps; verbose_json with segment timing
  • Self-hostable open-weight Whisper models (tiny → large-v3) under MIT license
  • Supports mp3, mp4, mpeg, mpga, m4a, wav, webm inputs (25 MB cap)
  • Available via Azure OpenAI in addition to OpenAI's own API

Official SDKs

  • Python (openai)
  • Node.js / TypeScript (openai)
  • Java
  • .NET / C#
  • Go
  • Ruby (community + RubyLLM)
  • REST / HTTP
  • Azure OpenAI SDKs
  • whisper.cpp (C/C++ self-host port)
  • faster-whisper (Python self-host)

Strengths & trade-offs

Strengths
  • One API key and SDK cover transcription, translation, streaming and diarization; trivial to integrate for teams already on OpenAI
  • GPT-4o transcribe models post lower WER than Whisper v2/v3 across 100+ languages on FLEURS and are strong on noisy or accented audio
  • Very competitive entry pricing: $0.006/min flagship and $0.003/min mini, billed per second
  • Whisper weights are MIT-licensed and fully self-hostable (rare among commercial STT vendors), enabling on-prem, GDPR-compliant deployments
  • Automatic language identification across 99 languages plus speech-to-English translation with no config
  • Huge open-source ecosystem (whisper.cpp, WhisperX, faster-whisper) for optimization and word-level alignment
Trade-offs
  • LLM-based GPT-4o models can follow instructions spoken in the audio instead of transcribing them verbatim, which is a real risk for sensitive or legal use
  • Hosted API caps uploads at 25 MB, forcing client-side chunking for long recordings
  • Native diarization only via the newer gpt-4o-transcribe-diarize, which is Transcription-API-only and has documented bugs (e.g. Azure, cross-chunk speaker identity)
  • Whisper's timestamps are utterance-level and can be off by seconds; word-level accuracy needs external tools like WhisperX
  • No published STT-specific uptime SLA
  • Token-based GPT-4o billing can produce less predictable costs on long audio than flat per-minute pricing

What developers say

Developers praise the easy integration, low price, and strong multilingual accuracy, but raise real concerns about LLM-style hallucination/instruction-following, the 25 MB limit, and rough edges in the new diarization model.

"Any time an LLM-based model is used for audio transcription I worry about accidental instruction following: is there a risk that content that looks like an instruction in the spoken text might not be included in the transcript? For some sensitive applications it may make sense to stick with whisper. I remain skeptical."

Key figures

Metric | Value | Source
FLEURS multilingual WER (vs Whisper v2/v3) | Lower WER than Whisper v2 and v3 across 100+ languages | OpenAI (vendor announcement)
LibriSpeech test-clean WER (gpt-4o-transcribe) | ~4.1% (OpenAI-reported); Whisper Large-v3 ~2.7% | Promptt.dev / TokenMix comparison
Top-tier STT WER band | 2-5% WER, within ~1-2 pts of Deepgram Nova-3, AssemblyAI, ElevenLabs Scribe | Coval independent benchmark
Price (whisper-1 / gpt-4o-transcribe) | $0.006 / minute | OpenAI API pricing
Price (gpt-4o-mini-transcribe) | $0.003 / minute (~$3.00 / 1M audio input tokens) | OpenAI API pricing
Training data (Whisper) | 680,000 hours of multilingual/multitask supervised audio | OpenAI Whisper
Max upload size (hosted API) | 25 MB per file | OpenAI speech-to-text guide

Sources

  1. https://openai.com/index/introducing-our-next-generation-audio-models/
  2. https://developers.openai.com/api/docs/guides/speech-to-text
  3. https://developers.openai.com/api/docs/pricing
  4. https://platform.openai.com/docs/models/gpt-4o-transcribe-diarize
  5. https://openai.com/index/whisper/
  6. https://www.coval.ai/blog/best-speech-to-text-providers-in-2026-independent-benchmarks-and-how-to-choose/
  7. https://www.promptt.dev/blog/whisper-1-vs-gpt-4o-transcribe-full-comparison-2025
  8. https://simonw.substack.com/p/new-audio-models-from-openai-but
  9. https://community.openai.com/t/introducing-gpt-4o-transcribe-diarize-now-available-in-the-audio-api/1362933

Figures last verified 2026-06-27. Spotted an error? corrections@apibenchmarks.com