Turning Blog Markdown into Audio You Can Listen To

The blog files in asset/blogs/ are Markdown text. To be listenable on a walk, in the car, or while kitesurfing, they need to become audio. This article walks through the realistic options, from free and local to paid and premium, and the integration shape for wiring them into the existing blog pipeline.

The facts on each option below are anchored to mid-2026 pricing and capabilities. Treat per-character costs as approximate; vendors change them.

The pipeline shape

Every option fits the same four-stage pipeline:

  1. Source: Markdown file in asset/blogs/.
  2. Clean: strip Markdown to clean speakable text. This is the step most people skip and the one that determines whether the result is listenable.
  3. Synthesize: run the text through a TTS engine. This is the part everyone focuses on.
  4. Publish: drop the resulting MP3 next to the source file, embed it in the blog page, optionally add an enclosure to an RSS feed for podcast clients.
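In Elixir terms, the four stages compose into one function. A sketch, not this project's actual API: the cleaning and synthesis steps are passed in as functions because the sections below offer several interchangeable choices for each.

```elixir
defmodule AudioPipeline do
  # Hypothetical sketch of the four stages. `clean` and `synthesize`
  # are function arguments standing in for whichever backend you pick.
  def publish(md_path, clean, synthesize) do
    mp3_path = String.replace_suffix(md_path, ".md", ".mp3")

    audio =
      md_path
      |> File.read!()    # 1. source: the Markdown file
      |> clean.()        # 2. clean: strip to speakable text
      |> synthesize.()   # 3. synthesize: returns audio bytes

    File.write!(mp3_path, audio)   # 4. publish: MP3 next to the source
    {:ok, mp3_path}
  end
end
```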

The synthesize step is where vendors compete. The cleaning step is where quality is actually won or lost.

Step 2: clean Markdown to speakable text

If you feed raw Markdown to a TTS engine it will say “hash hash hash heading text” or “open square bracket link text close square bracket open paren u r l close paren.” Unlistenable.

A useful cleaning pass for blog text: read headings as plain sentences, replace links with their anchor text, expand images to their alt text, speak list items in order, and drop code blocks entirely (or announce them as "code example omitted").

In Elixir, Earmark (already a dep in this project) parses Markdown to an AST; from there a small walker converts each node to speakable text. Twenty lines.
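Roughly those twenty lines might look like this. A sketch, assuming Earmark ~> 1.4, where `Earmark.as_ast/1` returns `{tag, attributes, children, meta}` tuples; the module name is hypothetical.

```elixir
defmodule SpeakableText do
  # Walk the Earmark AST and keep only what is worth hearing.
  def from_markdown(markdown) do
    {:ok, ast, _messages} = Earmark.as_ast(markdown)

    ast
    |> Enum.map(&speak/1)
    |> Enum.reject(&(&1 == ""))
    |> Enum.join(". ")
  end

  # Plain text nodes pass through.
  defp speak(text) when is_binary(text), do: String.trim(text)

  # Code is unlistenable; drop it.
  defp speak({"pre", _, _, _}), do: ""
  defp speak({"code", _, _, _}), do: ""

  # Images become their alt text.
  defp speak({"img", attrs, _, _}), do: :proplists.get_value("alt", attrs, "")

  # Everything else (headings, paragraphs, links, list items)
  # flattens to its children -- a link is read as its anchor text.
  defp speak({_tag, _attrs, children, _meta}) do
    children |> Enum.map(&speak/1) |> Enum.join(" ") |> String.trim()
  end
end
```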

Without this step, even the best TTS engine produces audio you’ll turn off after 30 seconds.

Option A: macOS say (free, local, Mac-only)

The Mac ships with say. One command:

say -v Samantha -o blog.aiff -f cleaned.txt
ffmpeg -i blog.aiff blog.mp3

Voices available out of the box include Samantha, Alex, Karen, Daniel, and dozens more across languages (say -v ? lists them). The newer “Premium” and “Enhanced” voices (downloadable via System Settings -> Accessibility -> Spoken Content) are noticeably better than the defaults.

Pros: free, fast, offline, scripts in 5 lines. Cons: macOS only. Sounds like macOS. Voice quality is below current cloud offerings but a long way above the robotic TTS of 5 years ago.
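The five-line script might look like this (paths and voice are illustrative; assumes a cleaned .txt already sits next to each post, and runs on macOS only):

```shell
#!/bin/sh
# Hypothetical batch loop: one MP3 per cleaned post.
for txt in asset/blogs/*.txt; do
  [ -e "$txt" ] || continue   # nothing to do if no posts are present
  base="${txt%.txt}"
  say -v Samantha -o "$base.aiff" -f "$txt"
  ffmpeg -y -i "$base.aiff" "$base.mp3" && rm "$base.aiff"
done
```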

Use when: you’re on a Mac, you don’t care if it sounds slightly synthetic, you don’t want to send your text to a vendor.

Option B: Piper (free, local, multi-platform)

Piper is an open-source local TTS system built on ONNX/VITS by the Rhasspy team. Runs on Linux, Mac, Windows, Raspberry Pi. 30+ languages, voices ranging from x_low (16kHz) to high (22.05kHz) quality.

echo "Your blog text here" | piper --model en_US-lessac-medium.onnx --output_file blog.wav
ffmpeg -i blog.wav blog.mp3

Pros: free, offline, no per-character billing, runs anywhere. Quality on the high voices is genuinely good as of 2026 – a 2026 review notes the gap between local and cloud has nearly closed for most use cases. Cons: voice cloning is not its strength (XTTS v2 or Coqui beat it for that). One-time setup is more work than say.

Use when: you want a self-hosted pipeline that scales without per-character cost, or you’re shipping this to a customer on Linux who needs offline TTS.

Option C: OpenAI TTS (cheapest cloud, good quality)

OpenAI has three TTS models in mid-2026:

| Model | Pricing | Notes |
|---|---|---|
| gpt-4o-mini-tts | $0.60 / 1M input tokens + $12 / 1M audio tokens (~$0.015 per minute of audio) | Newest, cheapest, multimodal. 13 voices including Marin and Cedar. |
| tts-1 | $15 / 1M characters | Standard quality, lowest latency. |
| tts-1-hd | $30 / 1M characters | Premium quality, higher latency. |

Thirteen voices in total: Alloy, Ash, Ballad, Coral, Echo, Fable, Nova, Onyx, Sage, Shimmer, and Verse, plus Marin and Cedar on the newest model.

Output formats: MP3 (default), Opus, AAC, FLAC, WAV, PCM. MP3 for downloads, Opus for low-latency streaming.

Math: a typical 2,000-word blog post is roughly 12,000 characters, or about 13 minutes of audio at a natural reading pace. At tts-1 rates that’s $0.18 per blog post. At gpt-4o-mini-tts’s quoted ~$0.015 per minute it comes to about $0.20 — the vendor’s per-token and per-minute estimates don’t always agree, so treat both as ballpark. Either way, trivial.

A small Elixir HTTP client against the OpenAI API:

def synthesize(text, voice \\ "nova") do
  Req.post("https://api.openai.com/v1/audio/speech",
    auth: {:bearer, System.get_env("OPENAI_API_KEY")},
    json: %{model: "gpt-4o-mini-tts", voice: voice, input: text, response_format: "mp3"}
  )
end
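Wiring it up might look like the following sketch. It assumes the synthesize/2 function above, a valid OPENAI_API_KEY, and an illustrative output path:

```elixir
# Hypothetical caller: write the returned MP3 next to the source post.
cleaned_text = "Hello from the blog."          # output of the cleaning step
mp3_path = "asset/blogs/audio/my-post.mp3"     # illustrative destination

case synthesize(cleaned_text, "nova") do
  {:ok, %Req.Response{status: 200, body: audio}} ->
    File.write!(mp3_path, audio)
    {:ok, mp3_path}

  {:ok, %Req.Response{status: status}} ->
    {:error, {:openai_http, status}}   # e.g. 401 bad key, 429 rate limit

  {:error, exception} ->
    {:error, exception}                # transport-level failure
end
```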

Pros: cheap, high quality, easy API, multiple voice options, well-documented. Cons: cloud round-trip per generation. Your text goes to OpenAI.

Use when: you want a “set and forget” pipeline that produces good audio at low cost without operating a model.

Option D: ElevenLabs (premium, voice cloning)

The premium tier of TTS. Available via API, with plans built around character-credit budgets:

| Plan | Cost | Credits / month | Voice cloning |
|---|---|---|---|
| Free | $0 | 10k (~10 min) | Instant voice cloning not in API tier |
| Starter | $5/mo | 30k | Instant voice cloning |
| Creator | $22/mo (after intro) | 100k | Professional voice cloning |
| Pro | higher | 500k (~500 min) | 44.1 kHz PCM over the API |
| Scale / Business | enterprise | 1.8M+ | Multi-seat, multiple professional clones |

Voice cloning is the differentiator. With ~30 minutes of clean recordings of your own voice, ElevenLabs builds a model that synthesizes new text in your voice. For a personal blog, that means listeners hear you reading your own posts, even on posts you never read aloud.

Pros: best-in-class voice quality, voice cloning, expressive output (laughter, hesitation, emotion). Cons: most expensive option per character. Voice cloning carries identity-misuse risk – treat your cloned-voice API key like a password.

Use when: the audio version is itself a brand asset (a podcast, a paid newsletter), or you want listeners to hear your own voice without recording every post.

Option E: AWS Polly and Google Cloud TTS

The two big-cloud incumbents are still in the running for enterprise deployments.

Pros: enterprise-grade SLAs, region selection for data residency, IAM-level access control. Cons: more boilerplate than the OpenAI API. Voice quality is competitive but not ahead of the field.

Use when: you’re already on AWS or GCP and want everything under one IAM boundary.

Practical integration for this project

Given that this project already has Markdown posts in asset/blogs/, Earmark as a dependency, a BlogFileWatcher that imports new posts, an AccessControlled permissions pattern, and an MCP tool layer, the natural shape is:

  1. A new BlogAudio module. lib/blog_audio.ex wraps the TTS API call with a configurable backend (:openai, :piper, :say). One function: generate(file_path) -> {:ok, mp3_path}.

  2. A Permissions.BlogAudio wrapper with two keys: blog_audio.read (list which posts have audio) and blog_audio.generate (create audio for a post). Standard AccessControlled pattern.

  3. An MCP tool blog_audio_generate so an agent can be told “generate audio for the last 3 blog posts” and it goes off and does it.

  4. A hook in BlogFileWatcher: when a new .md file is imported, kick off audio generation in the background (so the UI isn’t blocked) and write the resulting MP3 to asset/blogs/audio/<same-slug>.mp3.

  5. Render the audio on the blog page: an HTML5 <audio> tag at the top of each post, plus the MP3 link for download. RSS feed gains an <enclosure> element so podcast clients can pick it up.
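The background kick-off in step 4 might look like this. The module and function names (BlogAudio.generate/1, BlogAudio.TaskSupervisor) are the assumed names from the list above, not existing code:

```elixir
defmodule BlogFileWatcherHook do
  require Logger

  # Hypothetical hook: generate audio off the import path so the UI
  # is never blocked while the TTS call runs.
  def handle_imported(md_path) do
    Task.Supervisor.start_child(BlogAudio.TaskSupervisor, fn ->
      slug = Path.basename(md_path, ".md")
      dest = Path.join("asset/blogs/audio", slug <> ".mp3")

      case BlogAudio.generate(md_path) do
        {:ok, mp3_path} -> File.rename!(mp3_path, dest)
        {:error, reason} -> Logger.warning("audio failed: #{inspect(reason)}")
      end
    end)
  end
end
```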

The cost calculation: an average 2,000-word post is about 12,000 characters, roughly 13 minutes of audio. At gpt-4o-mini-tts’s ~$0.015 per minute that’s about $0.20 per post, or $6 a month at one post a day. At ElevenLabs Creator rates ($22 for 100k credits, with a credit roughly a character) the same post consumes about 12% of the monthly credit budget — call it $2.60 worth. Pick the budget you want.

Where it goes wrong

Things that turn a working pipeline into one you turn off after a week:

  1. Skipping the cleaning step, so headings, URLs, and code blocks get read aloud verbatim.
  2. Generating audio synchronously on import, blocking the UI while the TTS call runs.
  3. Regenerating audio for unchanged posts and paying the per-character cost again each time.
  4. Leaving stale audio in place after editing a post, so the page says one thing and the MP3 another.

Decision shortcut

The right starting point for the WorkingAgents blog: OpenAI gpt-4o-mini-tts with the nova voice, integrated into BlogFileWatcher so audio generates automatically on each new post. If you decide later that you want listeners to hear your own voice, swap the backend to ElevenLabs with a cloned voice and regenerate. The MP3 path and the <audio> tag stay the same; only the synthesize step changes.
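Concretely, the swap could be a one-line config change. A hypothetical config/config.exs fragment; the application and key names are illustrative:

```elixir
import Config

# The backend atom is the only thing that changes when you swap vendors;
# the MP3 path and the <audio> tag downstream stay the same.
config :working_agents, BlogAudio,
  backend: :openai,            # :openai | :piper | :say | :elevenlabs
  voice: "nova",
  output_dir: "asset/blogs/audio"
```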

Listening to your own writing is also a great editing tool. Posts that read fine on paper often sound clunky when spoken. Generating audio for a draft, listening to it once, and tweaking the prose is a faster edit cycle than a third read-through.

Verify before deploying

Pricing on TTS APIs changes. The numbers in this article were checked against the vendor pricing pages in May 2026; assume they’re approximate. Before committing to a vendor, run a real post through their API at the production volume you expect for a month and look at the actual bill. The math above is correct in shape but not in long-term cost prediction.