The blog files in asset/blogs/ are Markdown text. To listen to them on a walk, in the car, or while kitesurfing, they need to become audio. This article walks through the realistic options, from free and local to paid and premium, and the integration shape for wiring them into the existing blog pipeline.
The verified facts on each option below are anchored to mid-2026 pricing and capabilities. Treat per-character costs as approximate; vendors change them.
The pipeline shape
Every option fits the same four-stage pipeline:
- Source: a Markdown file in `asset/blogs/`.
- Clean: strip Markdown to clean speakable text. This is the step most people skip and the one that determines whether the result is listenable.
- Synthesize: run the text through a TTS engine. This is the part everyone focuses on.
- Publish: drop the resulting MP3 next to the source file, embed it in the blog page, optionally add an enclosure to an RSS feed for podcast clients.
The synthesize step is where vendors compete. The cleaning step is where quality is actually won or lost.
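Those four stages are small enough to sketch end to end. A skeleton, in Python for illustration (the function bodies are placeholders, not this project's Elixir modules):

```python
from pathlib import Path

def clean(markdown: str) -> str:
    """Step 2 placeholder: Markdown to speakable text."""
    return markdown  # real cleaning rules go here

def synthesize(text: str) -> bytes:
    """Step 3 placeholder: hand text to a TTS backend."""
    return b""  # e.g. an OpenAI, Piper, or `say` invocation

def publish(source: Path, audio: bytes) -> Path:
    """Step 4: drop the MP3 next to the source file."""
    mp3_path = source.with_suffix(".mp3")
    mp3_path.write_bytes(audio)
    return mp3_path

def pipeline(source: Path) -> Path:
    return publish(source, synthesize(clean(source.read_text())))
```

The point of this shape is that `synthesize` is the only stage that changes when you switch vendors.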
Step 2: clean Markdown to speakable text
If you feed raw Markdown to a TTS engine it will say “hash hash hash heading text” or “open square bracket link text close square bracket open paren u r l close paren.” Unlistenable.
A useful cleaning pass for blog text:
- Strip frontmatter (the `---` block at the top).
- Convert headings to plain text plus a sentence boundary.
- Drop URL parts of Markdown links, keep the visible text.
- Convert code blocks to a short note (“Code example follows. Skipping.”) or skip them entirely.
- Collapse multiple newlines.
- Expand abbreviations the TTS engine pronounces poorly (“MCP” -> “M C P”, “WA” -> “Working Agents”, “TTS” -> “T T S”).
- Replace `--` with a comma (since memory rules ban em dashes; `--` is the substitute, and unprocessed it reads as “dash dash”).
In Elixir, Earmark (already a dep in this project) parses Markdown to an AST; from there a small walker converts each node to speakable text. Twenty lines.
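For a language-agnostic picture of what that walker does, here is the same pass sketched in Python with plain regexes (the substitution entries are examples, and a regex pass is cruder than an AST walk):

```python
import re

# Engine-specific pronunciation fixes; these entries are examples, not a full map.
SUBSTITUTIONS = {"MCP": "M C P", "TTS": "T T S", "WA": "Working Agents"}

def clean_markdown(md: str) -> str:
    text = re.sub(r"\A---\n.*?\n---\n", "", md, flags=re.S)       # strip frontmatter
    text = re.sub(r"```.*?```", "Code example follows. Skipping.", text, flags=re.S)
    text = re.sub(r"^#+\s*(.+)$", r"\1.", text, flags=re.M)       # headings -> sentences
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)          # keep link text, drop URL
    text = re.sub(r"\s*--\s*", ", ", text)                        # double hyphen -> comma
    for short, spoken in SUBSTITUTIONS.items():
        text = re.sub(rf"\b{re.escape(short)}\b", spoken, text)
    text = re.sub(r"\n{3,}", "\n\n", text)                        # collapse blank runs
    return text.strip()
```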
Without this step, even the best TTS engine produces audio you’ll turn off after 30 seconds.
Option A: macOS say (free, local, Mac-only)
The Mac ships with `say`. Two commands:
```shell
say -v Samantha -o blog.aiff -f cleaned.txt
ffmpeg -i blog.aiff blog.mp3
```
Voices available out of the box include Samantha, Alex, Karen, Daniel, and dozens more across languages (`say -v ?` lists them). The newer “Premium” and “Enhanced” voices (downloadable via System Settings -> Accessibility -> Spoken Content) are noticeably better than the defaults.
Pros: free, fast, offline, scripts in 5 lines. Cons: macOS only. Sounds like macOS. Voice quality is below current cloud offerings but a long way above the robotic TTS of 5 years ago.
Use when: you’re on a Mac, you don’t care if it sounds slightly synthetic, you don’t want to send your text to a vendor.
Option B: Piper (free, local, multi-platform)
Piper is an open-source local TTS system built on ONNX/VITS by the Rhasspy team. Runs on Linux, Mac, Windows, Raspberry Pi. 30+ languages, voices ranging from x_low (16kHz) to high (22.05kHz) quality.
```shell
echo "Your blog text here" | piper --model en_US-lessac-medium.onnx --output_file blog.wav
ffmpeg -i blog.wav blog.mp3
```
Pros: free, offline, no per-character billing, runs anywhere. Quality on the high voices is genuinely good as of 2026; a 2026 review notes the gap between local and cloud has nearly closed for most use cases.
Cons: voice cloning is not its strength (XTTS v2 or Coqui beat it for that). One-time setup is more work than say.
Use when: you want a self-hosted pipeline that scales without per-character cost, or you’re shipping this to a customer on Linux who needs offline TTS.
Option C: OpenAI TTS (cheapest cloud, good quality)
OpenAI has three TTS models in mid-2026:
| Model | Pricing | Notes |
|---|---|---|
| `gpt-4o-mini-tts` | $0.60 / 1M input tokens + $12 / 1M audio tokens, ~$0.015/minute of audio | Newest, cheapest, multimodal. 13 voices including Marin and Cedar. |
| `tts-1` | $15 / 1M characters | Standard quality, lowest latency. |
| `tts-1-hd` | $30 / 1M characters | Premium quality, higher latency. |
13 voices: Alloy, Ash, Ballad, Coral, Echo, Fable, Nova, Onyx, Sage, Shimmer, Verse (plus Marin, Cedar on the newest model).
Output formats: MP3 (default), Opus, AAC, FLAC, WAV, PCM. MP3 for downloads, Opus for low-latency streaming.
Math: a typical 2,000-word blog post is roughly 12,000 characters. At tts-1 rates that’s $0.18 per blog post. At gpt-4o-mini-tts rates it’s around $0.04. Trivial.
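As a sanity check on those numbers, the per-post arithmetic for character-billed APIs (the ~6 characters per English word figure is an assumption):

```python
def post_cost(chars: int, usd_per_million_chars: float) -> float:
    """Per-post cost for a character-billed TTS API."""
    return chars / 1_000_000 * usd_per_million_chars

# A 2,000-word post at roughly 6 characters per word is ~12,000 characters.
# post_cost(12_000, 15) gives the tts-1 figure; post_cost(12_000, 30) the hd one.
```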
A small Elixir HTTP client against the OpenAI API:
```elixir
def synthesize(text, voice \\ "nova") do
  Req.post("https://api.openai.com/v1/audio/speech",
    auth: {:bearer, System.get_env("OPENAI_API_KEY")},
    json: %{model: "gpt-4o-mini-tts", voice: voice, input: text, response_format: "mp3"}
  )
end
```
Pros: cheap, high quality, easy API, multiple voice options, well-documented. Cons: cloud round-trip per generation. Your text goes to OpenAI.
Use when: you want a “set and forget” pipeline that produces good audio at low cost without operating a model.
Option D: ElevenLabs (premium, voice cloning)
The premium tier of TTS. Available via API, with plans built around character-credit budgets:
| Plan | Cost | Credits / month | Voice cloning |
|---|---|---|---|
| Free | $0 | 10k (~10 min) | Instant voice cloning not in API tier |
| Starter | $5/mo | 30k | Instant voice cloning |
| Creator | $22/mo (after intro) | 100k | Professional voice cloning |
| Pro | higher | 500k (~500 min) | 44.1 kHz PCM API |
| Scale / Business | enterprise | 1.8M+ | Multi-seat, multiple pro clones |
Voice cloning is the differentiator. With ~30 minutes of clean recordings of your own voice, ElevenLabs builds a model that synthesizes new text in your voice. For a personal blog, that means listeners hear you reading your own posts, even on posts you never read aloud.
Pros: best-in-class voice quality, voice cloning, expressive output (laughter, hesitation, emotion). Cons: most expensive option per character. Voice cloning carries identity-misuse risk – treat your cloned-voice API key like a password.
Use when: the audio version is itself a brand asset (a podcast, a paid newsletter), or you want listeners to hear your own voice without recording every post.
Option E: AWS Polly and Google Cloud TTS
The two big-cloud incumbents are still in the running for enterprise deployments.
- AWS Polly: standard voices ~$4 / 1M characters, neural voices ~$16 / 1M characters, generative voices higher. Tight IAM integration if you’re already on AWS. Voices like Joanna and Matthew are widely used in audiobook production.
- Google Cloud Text-to-Speech: WaveNet and Neural2 voices. Comparable per-character pricing. Native integration with Google Cloud Storage for hosting the generated files.
Pros: enterprise-grade SLAs, region selection for data residency, IAM-level access control. Cons: more boilerplate than the OpenAI API. Voice quality is competitive but not ahead of the field.
Use when: you’re already on AWS or GCP and want everything under one IAM boundary.
Practical integration for this project
Given that this project already has:
- `BlogStore` and `BlogFileWatcher` auto-importing `asset/blogs/*.md`
- A Sqler-per-module pattern
- An MCP tool surface with per-tool permission keys
The natural shape:
- A new `BlogAudio` module. `lib/blog_audio.ex` wraps the TTS API call with a configurable backend (`:openai`, `:piper`, `:say`). One function: `generate(file_path) -> {:ok, mp3_path}`.
- A `Permissions.BlogAudio` wrapper with two keys: `blog_audio.read` (list which posts have audio) and `blog_audio.generate` (create audio for a post). Standard `AccessControlled` pattern.
- An MCP tool `blog_audio_generate` so an agent can be told “generate audio for the last 3 blog posts” and it goes off and does it.
- A hook in `BlogFileWatcher`: when a new `.md` file is imported, kick off audio generation in the background (so the UI isn’t blocked) and write the resulting MP3 to `asset/blogs/audio/<same-slug>.mp3`.
- Render the audio on the blog page: an HTML5 `<audio>` tag at the top of each post, plus the MP3 link for download. The RSS feed gains an `<enclosure>` element so podcast clients can pick it up.
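The RSS piece is small enough to show. A sketch of attaching an `<enclosure>` to a feed item with Python's stdlib (element layout per RSS 2.0; the URL and byte size below are made up):

```python
import xml.etree.ElementTree as ET

def add_enclosure(item: ET.Element, mp3_url: str, size_bytes: int) -> ET.Element:
    """Attach an RSS 2.0 <enclosure> so podcast clients pick up the audio."""
    enc = ET.SubElement(item, "enclosure")
    enc.set("url", mp3_url)
    enc.set("length", str(size_bytes))  # RSS requires the length in bytes
    enc.set("type", "audio/mpeg")
    return enc
```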
The cost calculation: an average 2,000-word post is about 12,000 characters. At gpt-4o-mini-tts rates that’s roughly $0.04 per post. At one post a day that’s $1.20 a month. At ElevenLabs Creator rates with voice cloning it’s closer to $2.60 a post (12,000 of the plan’s 100k monthly credits at $22/month). Pick the budget you want.
Where it goes wrong
Things that turn a working pipeline into one you turn off after a week:
- Pronouncing technical terms wrong. “Elixir” should be “ee-LIX-er” but some engines say “EH-licks-er.” “MCP” should be “M C P.” Build a per-engine substitution map: replace specific strings before sending to the API. ElevenLabs supports phoneme markup; OpenAI does not.
- Run-on sentences. Long technical paragraphs without commas read as a single breathless wall of sound. The cleaning pass should add sentence breaks where there are em-dash-style separators (the `--` substitution we use).
- Code blocks read aloud. Unlistenable. Strip them, or replace with “Code example. Continued.”
- Tables read aloud. Same. Strip or skip with a “table summary follows” note.
- URL pronunciation. “h-t-t-p-s-colon-slash-slash-w-w-w” is brutal. Strip URLs from the audio version; the listener will look up the post for the links.
- Voice fatigue. The same voice for every post gets monotonous over time. Rotate among 2-3 voices, or assign different voices to different post categories.
- Mood mismatch. A bouncy voice reading a sober analysis sounds wrong. Match voice to subject: Onyx and Ash for serious topics, Nova and Coral for upbeat ones. Pick once, document the choice.
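The per-engine substitution map from the first item is a few lines to build. A sketch (the engine names and map entries here are illustrative, not a complete table):

```python
import re

# Hypothetical maps; each engine gets its own table because mispronunciations differ.
PRONUNCIATIONS = {
    "openai": {"Elixir": "ee-LIX-er", "MCP": "M C P"},
    "elevenlabs": {"MCP": "M C P"},  # ElevenLabs can use phoneme markup instead
}

def fix_pronunciations(text: str, engine: str) -> str:
    for term, spoken in PRONUNCIATIONS.get(engine, {}).items():
        # \b word boundaries keep "MCP" from rewriting the "MCP" inside a longer token
        text = re.sub(rf"\b{re.escape(term)}\b", spoken, text)
    return text
```

Run this after the Markdown cleaning pass, just before the text goes to the API.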
Decision shortcut
- Personal use, Mac only, fast: macOS `say` with a Premium voice. Free, five lines of shell.
- Personal use, multi-platform, free: Piper. Local, no vendor lock-in, quality is genuinely close to cloud.
- Production, low cost, “just works”: OpenAI `gpt-4o-mini-tts`. $0.04 per post, multiple voices, one API call.
- Brand voice, your-actual-voice listeners: ElevenLabs with voice cloning. A few dollars per post, sounds like you.
- Enterprise, on AWS or GCP: Polly or Google TTS.
The right starting point for the WorkingAgents blog: OpenAI `gpt-4o-mini-tts` with the `nova` voice, integrated into `BlogFileWatcher` so audio generates automatically on each new post. If you decide later that you want listeners to hear your own voice, swap the backend to ElevenLabs with a cloned voice and regenerate. The MP3 path and the `<audio>` tag stay the same; only the synthesize step changes.
Listening to your own writing is also a great editing tool. Posts that read fine on paper often sound clunky when spoken. Generating audio for a draft, listening to it once, and tweaking the prose is a faster edit cycle than a third read-through.
Verify before deploying
Pricing on TTS APIs changes. The numbers in this article were checked against the vendor pricing pages in May 2026; assume they’re approximate. Before committing to a vendor, run a real post through their API at the production volume you expect for a month and look at the actual bill. The math above is correct in shape but not in long-term cost prediction.