Cartesia
Cartesia · Ranked #6 of 7 in Text-to-Speech APIs
Low-latency (sub-100ms) real-time TTS challenger built for voice agents, with clean Fern-generated SDKs and a public status page, but a young track record.
Real-time TTS for voice agents

Overview
Cartesia is a San Francisco-based voice AI company building real-time text-to-speech (and increasingly speech-to-text and full voice-agent) infrastructure around its Sonic model family. Its core technical differentiator is architecture: Sonic is built on state-space models (SSMs) rather than the transformer/diffusion stacks most competitors use, which is what lets Cartesia advertise time-to-first-audio in the 40-90ms range. The company is squarely a developer-API play, not a no-code product. You get a low-latency TTS endpoint with WebSocket streaming, instant and professional voice cloning, 40+ languages, custom pronunciation dictionaries, and emotion/laughter controls. You do not get a packaged business tool that wires itself into your helpdesk. As one review put it, Cartesia is "an excellent engine, but you still need to build the car."
Where Cartesia wins is raw speed and the real-time conversational use case: voice agents, IVR/telephony, and live dubbing, where every 100ms of latency degrades the experience. Independent third-party numbers back the speed claim directionally: the Coval benchmark (May 2026) clocked Sonic-3 at 188ms P50 TTFA, faster than ElevenLabs Turbo/Flash v2.5 and Deepgram Aura-2. Two caveats apply. First, real measured latency is meaningfully higher than the ~40ms marketing figure, which is time-to-first-byte under ideal conditions. Second, Cartesia's latency variance (a 100ms inter-quartile range in that test) is wider than that of some rivals, which matters for production consistency. On pure quality, Cartesia is competitive but no longer dominant: on Artificial Analysis's Elo leaderboard Sonic 3.5 briefly held #1 before being overtaken, and it now trails newer entrants like Inworld on naturalness Elo.
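To make the P50 and inter-quartile-range figures concrete, here is a minimal sketch of how such a summary could be computed from a set of time-to-first-audio measurements. The sample values and the `summarize_ttfa` helper are hypothetical and chosen for illustration; they are not data or code from the Coval benchmark, which may use a different quantile method.

```python
import statistics


def summarize_ttfa(samples_ms):
    """Return (p50, iqr) in milliseconds for a list of TTFA samples."""
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, q2, q3 = statistics.quantiles(samples_ms, n=4, method="inclusive")
    return q2, q3 - q1


# Hypothetical measurements (ms), NOT taken from any published benchmark.
samples = [150, 160, 170, 180, 190, 200, 220, 250, 300]
p50, iqr = summarize_ttfa(samples)
print(f"P50 = {p50:.0f} ms, IQR = {iqr:.0f} ms")
```

A wide IQR means the middle half of requests is spread over a large range, so a good median alone does not guarantee consistent response times.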
Commercially, Cartesia is priced aggressively for developers with a credit model (1 credit/character for TTS) and a tiered ladder from a free 20k-credit plan up through Pro ($5), Startup ($49), Scale ($299) and custom Enterprise, plus per-minute voice-agent and telephony pricing. It carries enterprise compliance (SOC 2, HIPAA, GDPR, PCI) and supports on-prem/on-device deployment, which broadens its appeal beyond startups. The main caveats: there's little independent review-site presence (no real G2/Trustpilot/Capterra footprint), self-reported logging/observability is a known weak spot, and because it's an infrastructure primitive rather than an agent platform, teams must build and maintain the surrounding orchestration themselves.
How this score is derived
The APIbenchmarks Index is a weighted sum of four dimensions, each scored on an absolute 0–100 reference scale. See the methodology for every mapping.
| Dimension | Score | Weight | Contribution |
|---|---|---|---|
| Documentation & DX: Strong, dedicated developer docs at docs.cartesia.ai with quickstarts, WebSocket streaming guides, a changelog, and first-party Python/JS SDK references, widely praised as clear in third-party reviews. | 80 | 30% | 24.0 |
| Reliability: Sonic delivers class-leading low latency but third-party benchmarks flag wide latency variance (a 100ms IQR in the Coval test) and there is no clearly published public status page or SLA outside Enterprise terms. | 68 | 25% | 17.0 |
| Ecosystem & SDKs: Solid integration footprint via official Python and JS/TS SDKs plus partnerships and tutorials with voice-agent platforms (Vapi, GetStream, LiveKit-style stacks), though smaller than ElevenLabs' broader tooling ecosystem. | 70 | 25% | 17.5 |
| Accessibility: Generous free tier (20k credits) and a free playground make it easy to start, but it is a code-first developer API with no no-code/business-user product, so non-engineers cannot self-serve. | 86 | 20% | 17.2 |
| APIbenchmarks Index (ABI) | | | 75.7 |
Table 1. Derivation of the ABI for Cartesia. Contribution = score × weight; the index is their sum.
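The arithmetic behind Table 1 can be reproduced in a few lines. This is a sketch that simply restates the table's scores and weights; the function and variable names are illustrative and not part of any APIbenchmarks tooling.

```python
# (score, weight in percent) per dimension, copied from Table 1.
DIMENSIONS = {
    "Documentation & DX": (80, 30),
    "Reliability": (68, 25),
    "Ecosystem & SDKs": (70, 25),
    "Accessibility": (86, 20),
}


def contributions(dimensions):
    """Contribution of each dimension: score x weight."""
    return {name: score * weight / 100 for name, (score, weight) in dimensions.items()}


def abi(dimensions):
    """Index = sum of contributions, rounded to one decimal place."""
    return round(sum(contributions(dimensions).values()), 1)


parts = contributions(DIMENSIONS)
index = abi(DIMENSIONS)
print(parts)
print("ABI:", index)
```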
At a glance
- Vendor: Cartesia
- Pricing model: Per character (credits)
- Free tier: 20k credits (~27 min audio)
- Official SDKs: 4 languages
Pricing
| Plan | Price | What you get |
|---|---|---|
| Free | $0/mo | ~20,000 model credits (~27 min TTS), 1 agent slot, basic TTS + STT, no time limit. |
| Pro | $5/mo | ~133 min TTS, 3 agent slots, adds commercial-use license and instant voice cloning. |
| Startup | $49/mo | ~1,667 min TTS, 5 agent slots, adds professional voice cloning and org support. |
| Scale | $299/mo | ~10,667 min TTS, 10 agent slots, priority support and high concurrency limits. |
| Enterprise | Custom | Volume pricing, custom terms, DPAs/BAAs, SSO, compliance and security review; on-prem/on-device options. |
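For comparing tiers, it can help to turn the plan prices into an effective price per minute of TTS. The sketch below uses only the monthly prices and approximate minute allowances listed above, and assumes the full allowance is used each month; overage rates and unused credits are not modeled.

```python
# (monthly price in USD, approximate TTS minutes) from the pricing table.
PLANS = {
    "Pro": (5, 133),
    "Startup": (49, 1667),
    "Scale": (299, 10667),
}


def price_per_minute(monthly_usd, minutes):
    """Effective USD per TTS minute if the whole allowance is consumed."""
    return monthly_usd / minutes


for name, (usd, minutes) in PLANS.items():
    print(f"{name}: ${price_per_minute(usd, minutes):.4f} per TTS minute")
```

On these figures the per-minute rate falls as the tier rises, from roughly 3.8 cents on Pro to roughly 2.8 cents on Scale.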
Key features
- Sonic real-time streaming TTS with sub-90ms / ~40ms time-to-first-byte (state-space model architecture)
- WebSocket streaming with multiple concurrent TTS streams over a single connection
- Instant voice cloning from ~3-10 seconds of audio
- Professional / Pro voice cloning (one-time training fee, higher fidelity)
- 40+ languages with native-speaker quality and accent localization (localize into 42 languages)
- Emotion, speed, pitch, and volume controls plus AI laughter tags
- Custom pronunciation dictionaries for proper nouns and domain terms
- Voice changer / speech-to-speech transformation
- Speech-to-text (STT) and end-to-end voice-agent product (per-minute pricing)
- On-prem and on-device deployment for data privacy; SOC 2 / HIPAA / GDPR / PCI compliance
Strengths & trade-offs
- +Class-leading latency for real-time voice: ~40ms advertised TTFB, and a measured P50 TTFA that beat ElevenLabs and Deepgram in independent benchmarks
- +State-space-model architecture purpose-built for streaming, low-latency conversational audio rather than batch generation
- +Instant voice cloning from as little as 3-10 seconds of audio, plus higher-fidelity professional cloning
- +Broad multilingual coverage (40+ languages) with localization that preserves emotion, tone, and speaker identity
- +Developer-friendly: clear docs, official Python and JS/TS SDKs, single-WebSocket multiplexed streaming, and a usable free tier
- +Enterprise-grade compliance (SOC 2, HIPAA, GDPR, PCI) plus on-prem and on-device deployment options
- –A developer API, not a business tool: there is no no-code workflow, helpdesk wiring, or agent-testing layer, so you must build the surrounding system
- –Self-reported and reviewer-noted weakness in logging, observability, and troubleshooting tooling
- –Latency consistency is a concern: third-party tests show a wide inter-quartile range, so tail latency can approach the 300ms conversational threshold
- –Quality Elo has slipped from briefly #1 to trailing newer entrants like Inworld on the Artificial Analysis leaderboard
- –Marketing latency figures (~40ms) are best-case TTFB; real measured end-to-end TTFA is meaningfully higher (~188ms P50)
- –Minimal independent review-site presence (no substantive G2/Trustpilot/Capterra footprint), so social proof is thin
What developers say
Developers consistently praise Cartesia for best-in-class latency and natural voice quality, while the recurring critiques are weak observability/logging and that it is raw infrastructure rather than a turnkey product.
“Cartesia is amazing! They have enabled us to reduce system latency by hundreds of milliseconds.”
Key figures
| Metric | Value | Source |
|---|---|---|
| P50 time-to-first-audio (Sonic-3) | 188 ms (100 ms IQR) | Coval benchmark via Gradium TTS Latency Benchmark 2026 |
| Time-to-first-byte (Sonic 3 / 3.5 Turbo, vendor claim) | ~40 ms | Cartesia Sonic product page |
| Latency vs ElevenLabs Turbo v2.5 | 76 ms faster (188 ms vs 264 ms P50) | Gradium TTS Latency Benchmark 2026 |
| Latency vs Deepgram Aura-2 | 125 ms faster (188 ms vs 313 ms P50) | Gradium TTS Latency Benchmark 2026 |
| Quality Elo (Sonic 3.5) | ~1,054 (briefly #1, since overtaken) | Artificial Analysis Text-to-Speech leaderboard (via search summary) |
| TTS price | 1 credit per character (~$35 per 1M characters effective) | Cartesia pricing / eesel AI |
| Voice-agent call price | $0.06 / minute ($0.014/min telephony) | Cartesia pricing page |
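As a rough budgeting aid, the per-character pricing can be turned into a cost estimate. This is a sketch under stated assumptions: it uses the approximate effective rate of $35 per 1M characters quoted above, and it assumes a "character" corresponds to one Python string element (a Unicode code point), which may not match how Cartesia meters usage. The helper names are illustrative.

```python
USD_PER_MILLION_CHARS = 35.0  # approximate effective rate from the key figures
CREDITS_PER_CHAR = 1          # 1 credit per character for TTS


def tts_credits(text):
    """Credits consumed by a piece of text (assumes len() matches billing)."""
    return len(text) * CREDITS_PER_CHAR


def tts_cost_usd(num_chars, usd_per_million=USD_PER_MILLION_CHARS):
    """Approximate USD cost for a given number of characters."""
    return num_chars / 1_000_000 * usd_per_million


print(tts_credits("Hello, world!"))
print(f"${tts_cost_usd(250_000):.2f} for 250,000 characters")
```

The actual effective rate depends on the plan tier, so the constant is best treated as a placeholder to be replaced with the rate for the plan in use.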
Sources
- https://www.cartesia.ai/pricing
- https://www.cartesia.ai/sonic/
- https://docs.cartesia.ai/changelog/2026
- https://github.com/cartesia-ai/cartesia-python
- https://pypi.org/project/cartesia/
- https://gradium.ai/content/tts-latency-benchmark-2026
- https://artificialanalysis.ai/text-to-speech/model-families/cartesia
- https://www.eesel.ai/blog/cartesia-sonic-3-review
- https://www.eesel.ai/blog/cartesia-sonic-3-pricing
Figures last verified 2026-06-27. Spotted an error? corrections@apibenchmarks.com
