By James Aspinwall
The goal is clear: automated, narrated demo recordings. A script goes in, a natural-sounding narrated video comes out. No microphone, no recording studio, no retakes. The missing piece is a text-to-speech API that sounds human enough to represent your product professionally.
ElevenLabs and Google Cloud TTS are the two most discussed providers, but they are not alone. Here is a thorough comparison of both, plus the competitors worth considering.
The Contenders at a Glance
| Feature | ElevenLabs | Google Cloud TTS | OpenAI TTS | Amazon Polly | Azure Speech | Cartesia |
|---|---|---|---|---|---|---|
| Best voice tier | Multilingual v2 | Chirp 3: HD | tts-1-hd | Generative | Neural HD V2 | Sonic-3 |
| Voice naturalness | Gold standard | Very good | Excellent | Good | Very good | Good |
| Voice count | 10,000+ (community) | ~180 built-in | 13 | ~60 | 500+ | 40+ |
| Languages | 70+ | 40+ | ~20 | 30+ | 140+ | 40+ |
| Voice cloning | Yes (instant + pro) | Yes (Chirp 3) | No | No | Yes | Yes |
| SSML support | No | Yes (full) | No | Yes | Yes (extensive) | No |
| Streaming | Yes (75ms TTFB) | No (full response) | Yes (~500ms) | Yes | Yes | Yes (40ms TTFB) |
| Steerability | No | No | Yes (prompting) | No | No | No |
| REST API simplicity | Simple | Simple | Simplest | Moderate | Complex | Simple |
| Free tier | 10K chars/mo | 1M WaveNet chars/mo | None | 5M standard chars/mo | 500K chars/mo | Pay-as-you-go |
Pricing Comparison
For our use case (automated demos, 10-50 per month at 2-5 minutes each), the monthly volume is roughly 50,000-200,000 characters. Here is what that costs:
| Provider | Tier | Cost at 100K chars/mo | Cost at 200K chars/mo |
|---|---|---|---|
| ElevenLabs | Creator plan | $22 (100K included) | $22 + $6 overage = $28 |
| Google Cloud | WaveNet | $0 (1M free) | $0 (1M free) |
| Google Cloud | Chirp 3: HD | $3 | $6 |
| OpenAI | tts-1-hd | $3 | $6 |
| Amazon Polly | Neural | $0 (1M free, 12 mo) | $0 (1M free, 12 mo) |
| Azure Speech | Neural TTS | $0 (500K free) | $0 (500K free) |
| Cartesia | Sonic-3 | $1.10 | $2.20 |
ElevenLabs is the most expensive option at this volume. Google, Azure, and Polly offer free tiers that cover demo-scale usage entirely. OpenAI and Cartesia are cheap at pay-per-use rates.
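To sanity-check the volume estimate and the pay-per-use figures, here is a rough sketch. The speaking rate (about 150 words per minute) and the average of 6 characters per word are assumptions, not measurements, and the module name is hypothetical. The per-million rates are simply back-calculated from the pricing table above.

```elixir
defmodule DemoCost do
  # Assumptions (not from any provider): ~150 spoken words per minute,
  # ~6 characters per word including the trailing space.
  @words_per_minute 150
  @chars_per_word 6

  # USD per 1M characters, back-calculated from the pricing table above.
  @rates %{
    "Google Chirp 3: HD" => 30.0,
    "OpenAI tts-1-hd" => 30.0,
    "Cartesia Sonic-3" => 11.0
  }

  # Estimated characters of narration per month.
  def monthly_chars(demos, minutes_each) do
    demos * minutes_each * @words_per_minute * @chars_per_word
  end

  # Pay-as-you-go cost in USD, rounded to cents.
  def monthly_cost(provider, chars) do
    Float.round(Map.fetch!(@rates, provider) * chars / 1_000_000, 2)
  end
end

chars = DemoCost.monthly_chars(50, 4)
IO.inspect({chars, DemoCost.monthly_cost("OpenAI tts-1-hd", chars)})
```

Under these assumptions, 50 four-minute demos land near the top of the 50,000-200,000 character range, which is why the 200K column is the safer one to budget against.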
ElevenLabs: The Voice Quality Leader
ElevenLabs built its reputation on one thing: voices that sound human. Their Multilingual v2 model produces speech with natural prosody, breathing pauses, and emotional variation that no other provider consistently matches.
Strengths:
- Unmatched naturalness and expressiveness
- Instant voice cloning from under 1 minute of audio
- Professional voice cloning (near-indistinguishable) on Pro+ plans
- Voice Design: create new voices from text descriptions
- 10,000+ community voices to choose from
- Sound effects, audio isolation, speech-to-speech conversion
Weaknesses:
- Most expensive option for moderate volume ($22/mo minimum for reasonable limits)
- No SSML support for fine-grained control over pauses, emphasis, pronunciation
- Free tier is tiny (10K chars, roughly 1-2 demos)
- Rate limits are tight on lower plans (2-5 concurrent requests)
Best for: When voice quality is the absolute top priority and budget is secondary.
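For reference, a minimal sketch of what an ElevenLabs request might look like from Elixir, together with one way to stay under a low concurrency limit. The module and function names are hypothetical, the voice ID is a placeholder you would take from your own voice library, and the default of 2 concurrent calls is only an assumption based on the lower-plan limits mentioned above.

```elixir
defmodule ElevenLabsSketch do
  @base "https://api.elevenlabs.io/v1/text-to-speech"

  # Builds the pieces of one text-to-speech request.
  # voice_id is a placeholder; real IDs come from your ElevenLabs account.
  def request(text, voice_id, api_key) do
    %{
      url: "#{@base}/#{voice_id}",
      headers: [{"xi-api-key", api_key}],
      json: %{text: text, model_id: "eleven_multilingual_v2"}
    }
  end

  # Applies `synthesize` to every script with at most `max_concurrency`
  # calls in flight. Results come back in input order.
  # A crashed or timed-out task will raise here; add error handling as needed.
  def synthesize_all(scripts, synthesize, max_concurrency \\ 2) do
    scripts
    |> Task.async_stream(synthesize, max_concurrency: max_concurrency, timeout: 60_000)
    |> Enum.map(fn {:ok, result} -> result end)
  end
end
```

With an HTTP client such as Req, the map returned by request/3 could be sent as Req.post(req.url, headers: req.headers, json: req.json), and the response body written to disk as audio.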
Google Cloud TTS: The Value Play
Google’s TTS has evolved significantly. Their latest Chirp 3: HD voices are a major step up from WaveNet, with emotional resonance and natural intonation. The free tier is absurdly generous for demo-scale usage.
Strengths:
- WaveNet free tier (1M chars/mo) covers demo volume 5-10x over
- Chirp 3: HD quality is competitive with ElevenLabs for narration
- Full SSML support for controlling pauses, pitch, rate, emphasis, pronunciation
- Audio profiles (optimize for headphones, speakers, phone)
- Multi-speaker markup for multi-voice content
- Deep Google Cloud ecosystem integration
Weaknesses:
- No streaming — returns complete audio in one response (fine for pre-recorded demos)
- Smaller voice selection than ElevenLabs
- Chirp 3: HD supports only limited SSML tags
- Google Cloud authentication adds setup complexity vs simple API keys
- Voice cloning (Chirp 3 Instant Custom Voice) is newer and less proven
Best for: Cost-sensitive projects that still need good quality. The free tier alone handles most demo workloads.
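To make the SSML point concrete, here is a small sketch of a request body for Google's text:synthesize REST endpoint, plus the base64 decoding step the response requires. The module name is hypothetical and the WaveNet voice name is just an example choice; authentication (an OAuth bearer token or API key) is left out and would need to be supplied by the caller.

```elixir
defmodule GoogleTtsSketch do
  @endpoint "https://texttospeech.googleapis.com/v1/text:synthesize"

  def endpoint, do: @endpoint

  # Builds the JSON body for text:synthesize using SSML input.
  # The default voice name is an example, not a recommendation.
  def request_body(ssml, voice_name \\ "en-US-Wavenet-D") do
    %{
      input: %{ssml: ssml},
      voice: %{languageCode: "en-US", name: voice_name},
      audioConfig: %{audioEncoding: "MP3"}
    }
  end

  # The API responds with JSON whose "audioContent" field is base64 audio.
  def decode_audio(%{"audioContent" => b64}), do: Base.decode64!(b64)
end

# SSML gives explicit control over pauses and emphasis.
ssml = """
<speak>
  Welcome to the dashboard.
  <break time="500ms"/>
  <emphasis level="strong">Live metrics</emphasis> update every few seconds.
</speak>
"""

IO.inspect(GoogleTtsSketch.request_body(ssml))
```

The body can be posted as JSON to the endpoint, and decode_audio/1 turns the decoded response into bytes ready for File.write!/2. Since Chirp 3: HD accepts only a subset of SSML, markup like this is best tested against the specific voice you plan to use.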
The Dark Horse: OpenAI TTS
OpenAI’s TTS offering deserves serious attention, especially for the demo narration use case.
The standout feature is steerability. With the gpt-4o-mini-tts model, you can prompt the voice with instructions like:
- “Speak clearly and professionally, emphasizing feature names”
- “Sound enthusiastic when describing the new capability”
- “Slow down for technical terms, speak conversationally otherwise”
None of the other providers compared here offers an equivalent. For demos where tone matters, and it always does, this is a meaningful advantage.
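As a sketch of how such a prompt travels with the request: the speech endpoint takes the style prompt in an instructions field alongside the text. The module name below is hypothetical. Note that instructions applies to gpt-4o-mini-tts; the older tts-1 and tts-1-hd models do not support it.

```elixir
defmodule SteerableSpeech do
  # Builds the JSON body for POST https://api.openai.com/v1/audio/speech.
  # `instructions` steers delivery; it is supported by gpt-4o-mini-tts,
  # not by tts-1 or tts-1-hd.
  def request_body(text, instructions, voice \\ "nova") do
    %{
      model: "gpt-4o-mini-tts",
      input: text,
      voice: voice,
      instructions: instructions,
      response_format: "mp3"
    }
  end
end

body =
  SteerableSpeech.request_body(
    "The new dashboard shows live metrics for every deployment.",
    "Speak clearly and professionally, emphasizing feature names."
  )

IO.inspect(body)
```

Keeping the style prompt separate from the script text means the same narration can be re-rendered in a different tone without touching the script.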
Strengths:
- Steerability via text prompts (unique, powerful for demos)
- Excellent naturalness on tts-1-hd
- Cheapest high-quality option at $3/mo for 100K chars
- Simplest API (single POST, returns audio bytes)
- Streaming support
- If you already use OpenAI for other things, one API key covers everything
Weaknesses:
- Only 13 voices (no custom cloning)
- Limited language coverage vs ElevenLabs or Azure
- No SSML support
- 4,096 character limit per request (need to chunk longer scripts)
- Relatively new TTS offering, less battle-tested at scale
Best for: Demo narration where you want to control speaking style without manual audio editing.
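The 4,096 character limit means longer scripts have to be split before synthesis. One possible approach, sketched below with a hypothetical module name, is to split on sentence boundaries and greedily pack sentences into chunks that stay under the limit, so that no chunk ends mid-sentence. The sentence regex is a simplification and will mis-split on abbreviations such as "e.g.".

```elixir
defmodule ScriptChunker do
  @default_limit 4096

  # Splits `text` into chunks of at most `limit` characters,
  # breaking on sentence boundaries where possible.
  def chunk(text, limit \\ @default_limit) when is_integer(limit) and limit > 0 do
    text
    |> String.trim()
    |> String.split(~r/(?<=[.!?])\s+/, trim: true)
    |> Enum.flat_map(&hard_split(&1, limit))
    |> Enum.reduce([], fn piece, acc -> pack(piece, acc, limit) end)
    |> Enum.reverse()
  end

  # Last resort: a single sentence longer than the limit is cut at the limit.
  defp hard_split(sentence, limit) do
    if String.length(sentence) <= limit do
      [sentence]
    else
      sentence
      |> String.graphemes()
      |> Enum.chunk_every(limit)
      |> Enum.map(&Enum.join/1)
    end
  end

  # Appends the sentence to the current chunk if it still fits,
  # otherwise starts a new chunk.
  defp pack(piece, [], _limit), do: [piece]

  defp pack(piece, [current | rest], limit) do
    candidate = current <> " " <> piece

    if String.length(candidate) <= limit do
      [candidate | rest]
    else
      [piece, current | rest]
    end
  end
end
```

Each chunk can then be sent as its own request and the resulting audio segments joined in order. Splitting at sentence boundaries keeps the joins at natural pauses, which tends to make them less audible.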
Other Competitors
Amazon Polly: Reliable AWS infrastructure, generous free tier (5M standard chars/mo for the first 12 months). Neural voices are decent but sound noticeably synthetic next to ElevenLabs or OpenAI. Best if you are already deep in the AWS ecosystem.
Microsoft Azure Speech: Largest voice catalog (500+ voices, 140+ languages). Neural HD V2 has context-aware emotion detection. The Azure SDK adds integration complexity. Best for multilingual or enterprise deployments.
Cartesia Sonic-3: Fastest time-to-first-audio (40ms), cheapest per-character ($11/1M). Optimized for conversational AI agents rather than studio narration. Worth watching but not the top pick for polished demo recordings.
Recommendation
For automated recorded talking demos, start with OpenAI tts-1-hd.
The reasoning:
- Steerability is uniquely valuable for demos. You can describe the speaking style in plain English: "Professional but friendly, pause before each feature name, sound excited about the dashboard." None of the other APIs compared here offers this. Note that the style prompt goes through the instructions parameter of gpt-4o-mini-tts; tts-1-hd does not accept it, so switch models when you need to steer delivery. For narrating software walkthroughs where you want the voice to match the moment, this is a significant advantage.
- Cost is near zero. At $30 per million characters for tts-1-hd, a month of demos (100K-200K characters) costs about $3-6. No subscription plans to manage.
- Integration is trivial. One HTTP POST, one API key, audio bytes in the response. A few lines of Elixir:
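```elixir
# Assumes the Req HTTP client; api_key and text are defined by the caller.
{:ok, %{body: audio}} =
  Req.post("https://api.openai.com/v1/audio/speech",
    headers: [{"Authorization", "Bearer #{api_key}"}],
    json: %{model: "tts-1-hd", input: text, voice: "nova", response_format: "mp3"}
  )

File.write!("demo.mp3", audio)  # the response body is the raw MP3 bytes
```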
- Quality is excellent. The tts-1-hd model produces natural, professional-sounding narration. It is not quite ElevenLabs-level, but the gap is small and the steerability compensates.
If OpenAI voices are not natural enough, upgrade to ElevenLabs Creator ($22/mo) for the best voice quality available. The voice cloning feature also lets you create a consistent brand voice across all demos.
If budget is the primary constraint, use Google Cloud WaveNet (free for 1M chars/mo) or Chirp 3: HD ($3/mo) with SSML markup for pacing control.
All three integrate into Elixir via simple HTTP POST requests with no special SDKs required.
Decision Matrix
| Priority | Choose | Why |
|---|---|---|
| Best overall for demos | OpenAI tts-1-hd | Steerability + quality + cost |
| Best voice quality | ElevenLabs Multilingual v2 | Most natural-sounding voices |
| Lowest cost | Google Cloud WaveNet | 1M chars/mo free |
| Most voice variety | Azure Speech | 500+ voices, 140+ languages |
| Fastest latency | Cartesia Sonic-3 | 40ms time-to-first-audio |
| Voice cloning | ElevenLabs | Instant + professional cloning |
| Fine speech control | Google Cloud TTS | Full SSML support |
| AWS ecosystem | Amazon Polly | Native integration |