By James Aspinwall, co-written by Alfred Pennyworth (my trusted AI) — March 3, 2026, 06:38
The Orchestrator’s Alarm module started life as a simple fire-and-forget scheduler: store an alarm in SQLite, hand it to Timer via Process.send_after, cast to the target GenServer, move on. It worked, but it was blind. Did the target process exist? Did the Pushover notification actually send? How long did delivery take? Nobody knew.
This article walks through the upgrade that gave Alarm eyes, ears, and a second chance.
The Problem
Three blind spots:
- No completion tracking — An alarm fired, Timer cast the message, Alarm marked started_at. That was the end of the story. Whether Pushover returned 200 or 500, the alarm record looked the same.
- No process safety — Timer would GenServer.cast to a registered name without checking whether anything was listening. If the target had crashed, the message vanished silently.
- No retry — One shot. If delivery failed for a transient reason (process restarting, network blip), the alarm was gone forever.
The Schema Evolution
Eight new columns, added idempotently so existing databases upgrade on restart:
for col <- [
  "completed_at INTEGER",
  "duration_ms INTEGER",
  "result TEXT",
  "created_by TEXT",
  "created_via TEXT",
  "max_retries INTEGER DEFAULT 0",
  "attempts INTEGER DEFAULT 0",
  "backoff_ms INTEGER DEFAULT 5000"
] do
  try do
    Sqler.sql(state.alarmdb, "ALTER TABLE alarm ADD COLUMN #{col}")
  rescue
    _ -> :ok
  end
end
The try/rescue pattern is deliberate — SQLite raises on duplicate column names, so this is a no-op on subsequent restarts. No migration table needed.
Completion Tracking: Closing the Loop
The key insight: the target process knows whether it succeeded, not Timer, not Alarm. So we need the target to report back.
Timer enriches the message before casting. If the payload is a tuple with a map (the common {:send, %{message: "..."}} pattern), Timer injects the alarm ID:
defp inject_alarm_id({action, %{} = payload}, id) do
  {action, Map.put(payload, :alarm_id, id)}
end

defp inject_alarm_id(message, _id), do: message
The key is named :alarm_id (not :id) to avoid colliding with existing payload fields. Non-map payloads pass through unchanged — backward compatible.
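Seen in isolation, the enrichment is a small pure transformation. This standalone sketch repeats the two clauses above so the behavior can be checked directly:

```elixir
defmodule InjectDemo do
  # Same shape as Timer's inject_alarm_id/2: add :alarm_id to map payloads,
  # pass every other message shape through untouched.
  def inject_alarm_id({action, %{} = payload}, id) do
    {action, Map.put(payload, :alarm_id, id)}
  end

  def inject_alarm_id(message, _id), do: message
end

InjectDemo.inject_alarm_id({:send, %{message: "Wake up"}}, 42)
# => {:send, %{message: "Wake up", alarm_id: 42}}

InjectDemo.inject_alarm_id(:ping, 42)
# => :ping
```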
Pushover reads it back and reports completion:
def handle_cast({:send, payload}, state) do
  result = send_message(payload, state)

  if alarm_id = Map.get(payload, :alarm_id) do
    result_str =
      case result do
        {:ok, _} -> "ok"
        {:error, reason} -> "FAILED: #{inspect(reason)}"
      end

    Alarm.completed(alarm_id, result_str)
  end

  {:noreply, state}
end
Alarm records the outcome with a timestamp and computed duration:
def handle_cast({:completed, id, result}, state) do
  now = System.system_time(:second)
  # ... fetch started_at from DB ...
  duration_ms = if started_at, do: (now - started_at) * 1000, else: nil

  Sqler.update(state.alarmdb, "alarm", %{
    id: id, updated_at: updated_at,
    completed_at: now, duration_ms: duration_ms, result: result
  })

  {:noreply, state}
end
Now every alarm has a measurable lifecycle: created → fired → completed (or failed).
Execution Safety: Check Before You Cast
Before this upgrade, Timer blindly cast to the target. If the process didn’t exist, GenServer.cast silently dropped the message (cast is fire-and-forget by design).
Now Timer checks first:
defp resolve_target(name) when is_atom(name) do
  case Process.whereis(name) do
    pid when is_pid(pid) -> {:ok, pid}
    nil -> {:error, :noproc}
  end
end
On failure, instead of Alarm.fired(id), Timer calls Alarm.delivery_failed(id, "noproc"). The alarm knows it wasn’t delivered.
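Put together, the dispatch step becomes check-then-cast. This is a minimal standalone sketch, not the real Timer module — the real one records the outcome in the database, while here it is simply returned so the two paths are visible:

```elixir
defmodule DispatchDemo do
  # Same shape as Timer's resolve_target/1 above.
  def resolve_target(name) when is_atom(name) do
    case Process.whereis(name) do
      pid when is_pid(pid) -> {:ok, pid}
      nil -> {:error, :noproc}
    end
  end

  # Check before you cast: only deliver if the target is alive.
  def dispatch(name, message) do
    case resolve_target(name) do
      {:ok, pid} ->
        GenServer.cast(pid, message)
        :fired

      {:error, :noproc} ->
        :delivery_failed
    end
  end
end

DispatchDemo.dispatch(:nobody_home, {:send, %{}})
# => :delivery_failed
```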
Retry with Exponential Backoff
delivery_failed/2 doesn’t just record the failure — it decides whether to try again:
if attempts <= max_retries do
  delay_ms = backoff_ms * Integer.pow(2, attempts - 1)
  retry_at = now + div(delay_ms, 1000)
  # ... update attempts, schedule retry via Timer with same alarm ID ...
else
  # Retries exhausted — mark as permanently failed
end
The retry reuses the same alarm ID. The attempts counter tracks retries, and started_at is preserved from the first fire (so duration_ms reflects total time from first attempt to final completion).
Default: 0 retries (backward compatible). Callers opt in:
Alarm.set_timer(time, Pushover, {:send, payload},
max_retries: 2, backoff_ms: 5000)
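With the defaults shown above (backoff_ms: 5000), the delay doubles on each attempt: 5 s, then 10 s, then 20 s. A quick standalone check of the formula:

```elixir
backoff_ms = 5_000

# Same formula as delivery_failed/2: backoff_ms * 2^(attempt - 1)
delays =
  for attempt <- 1..3 do
    backoff_ms * Integer.pow(2, attempt - 1)
  end

# delays == [5000, 10000, 20000]
```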
The Timeout Sweep
What about alarms that fire but never complete and never fail? The target might hang, or the completion callback might have a bug. A 60-second sweep catches these:
def handle_info(:sweep_timeouts, state) do
  now = System.system_time(:second)

  # Find alarms fired > 5 minutes ago with no completion
  stale =
    Sqler.sql(state.alarmdb, """
    SELECT id, updated_at FROM alarm
    WHERE started_at IS NOT NULL
      AND started_at < ?
      AND completed_at IS NULL AND cancelled_at IS NULL
      AND (result IS NULL OR ...)
    """, [now - 300])

  for [id, updated_at] <- stale do
    Sqler.update(state.alarmdb, "alarm", %{
      id: id, updated_at: updated_at, result: "timeout_unknown"
    })
  end

  Process.send_after(self(), :sweep_timeouts, 60_000)
  {:noreply, state}
end
Simpler than per-alarm timeout timers. One sweep catches everything.
Avoiding the Self-Call Deadlock
The new query functions (history/1, stats/0, failed/1) are GenServer calls. But WebSocket requests are handled inside the same GenServer process (Alarm registers its handlers with WsRegistry). If a WS handler called Alarm.history(), it would deadlock — the process would be waiting on itself.
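The failure mode is easy to reproduce in isolation. This standalone sketch gives the nested call a 100 ms timeout as a safety valve — without it, the process would wait on itself forever:

```elixir
defmodule SelfCallDemo do
  use GenServer

  def start, do: GenServer.start(__MODULE__, nil, name: __MODULE__)
  def init(nil), do: {:ok, nil}

  def handle_call(:outer, _from, state) do
    # We are *inside* this process's handle_call, so the nested call below
    # can never be served — the mailbox isn't processed until we return.
    reply =
      try do
        GenServer.call(__MODULE__, :inner, 100)
      catch
        :exit, _ -> :deadlock_detected
      end

    {:reply, reply, state}
  end

  def handle_call(:inner, _from, state), do: {:reply, :ok, state}
end

{:ok, _pid} = SelfCallDemo.start()
GenServer.call(SelfCallDemo, :outer)
# => :deadlock_detected
```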
The solution: private do_* functions that operate directly on state:
# Public API (for external callers)
def history(opts), do: GenServer.call(__MODULE__, {:history, opts})

# GenServer handler
def handle_call({:history, opts}, _from, state) do
  {:reply, do_history(state, opts), state}
end

# WS handler (same process, no GenServer.call)
def handle_call({:ws, "history", args}, _from, state) do
  opts = ws_args_to_opts(args)
  {:reply, do_history(state, opts), state}
end

# Shared logic
defp do_history(state, opts) do
  # ... build query, execute, map results ...
end
Both paths use do_history/2. No deadlock.
Provenance: Who Scheduled This?
Every alarm can now carry created_by (user ID) and created_via (transport: “mcp”, “rest”, “iex”). The MCP server passes these automatically:
# In my_mcp_server.ex
def handle_tool_call("pushover_schedule", args, state) do
user = get_user(state)
opts = [created_by: to_string(user.id), created_via: "mcp"]
Permissions.Platform.pushover_schedule(perms, args, opts)
end
When debugging why an alarm fired at 3 AM, you can now see who scheduled it and through which interface.
Dashboard Visibility
Two new MCP tools expose alarm data:
- alarm_history — query with filters: status, process, since, limit
- alarm_stats — aggregate counts: total, pending, fired, completed, failed, cancelled, avg duration
Monitor’s services section now includes stats and the 5 most recent failures, visible in the system dashboard without needing to query directly.
The Retired Scheduler
While upgrading Alarm, we also deleted lib/scheduler.ex — a module that was never started in the supervision tree, had bugs, and had zero callers. Its Sqler instance ({Sqler, name: "scheduler", register: :scheduler}) was removed from the application and Monitor’s database list. Dead code, gone.
What’s Next
- Per-module completion reporting — currently only Pushover reports back. Any GenServer that receives alarm-scheduled messages could do the same by checking for :alarm_id in the payload.
- Alerting on failure rates — the stats infrastructure is there; a threshold check in the sweep could trigger a Pushover notification when failure rates spike.
- Retry for specific error types — currently retries on any delivery failure. Could be smarter: retry on :noproc (process restarting), don’t retry on serialization errors.
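That last idea could hang on a small predicate. A hypothetical sketch — the function name and error shapes here are illustrative, not the module's actual API:

```elixir
defmodule RetryPolicy do
  # Transient: target process is restarting — worth another attempt.
  def retryable?(:noproc), do: true
  # Transient: server-side errors from the push provider.
  def retryable?({:http_error, status}) when status >= 500, do: true
  # Permanent: a payload that can't be serialized won't fix itself.
  def retryable?({:serialization_error, _}), do: false
  def retryable?(_other), do: false
end

RetryPolicy.retryable?(:noproc)
# => true
RetryPolicy.retryable?({:serialization_error, :bad_term})
# => false
```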
The Alarm module went from “I hope that worked” to “I can tell you exactly what happened, when, how long it took, and what went wrong.” That’s the difference between a scheduler and an observable system.