By James Aspinwall, co-written by Alfred Pennyworth (my trusted AI) — March 3, 2026, 06:38
The Orchestrator’s Alarm module started life as a simple fire-and-forget scheduler: store an alarm in SQLite, hand it to Timer via Process.send_after, cast to the target GenServer, move on. It worked, but it was blind. Did the target process exist? Did the Pushover notification actually send? How long did delivery take? Nobody knew.
This article walks through the upgrade that gave Alarm eyes, ears, and a second chance.
The Problem
Three blind spots:
- No completion tracking — An alarm fired, Timer cast the message, Alarm marked started_at. That was the end of the story. Whether Pushover returned 200 or 500, the alarm record looked the same.
- No process safety — Timer would GenServer.cast to a registered name without checking whether anything was listening. If the target had crashed, the message vanished silently.
- No retry — One shot. If delivery failed for a transient reason (process restarting, network blip), the alarm was gone forever.
The Schema Evolution
Eight new columns, added idempotently so existing databases upgrade on restart:
for col <- [
  "completed_at INTEGER",
  "duration_ms INTEGER",
  "result TEXT",
  "created_by TEXT",
  "created_via TEXT",
  "max_retries INTEGER DEFAULT 0",
  "attempts INTEGER DEFAULT 0",
  "backoff_ms INTEGER DEFAULT 5000"
] do
  try do
    Sqler.sql(state.alarmdb, "ALTER TABLE alarm ADD COLUMN #{col}")
  rescue
    _ -> :ok
  end
end
The try/rescue pattern is deliberate — SQLite raises on duplicate column names, so this is a no-op on subsequent restarts. No migration table needed.
Completion Tracking: Closing the Loop
The key insight: the target process knows whether it succeeded, not Timer, not Alarm. So we need the target to report back.
Timer enriches the message before casting. If the payload is a tuple with a map (the common {:send, %{message: "..."}} pattern), Timer injects the alarm ID:
defp inject_alarm_id({action, %{} = payload}, id) do
  {action, Map.put(payload, :alarm_id, id)}
end

defp inject_alarm_id(message, _id), do: message
The key is named :alarm_id (not :id) to avoid colliding with existing payload fields. Non-map payloads pass through unchanged — backward compatible.
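Seen in isolation, the enrichment is a small pure transformation. This standalone sketch repeats the two clauses above so the behavior can be checked directly:

```elixir
defmodule InjectDemo do
  # Same shape as Timer's inject_alarm_id/2: add :alarm_id to map payloads,
  # pass every other message shape through untouched.
  def inject_alarm_id({action, %{} = payload}, id) do
    {action, Map.put(payload, :alarm_id, id)}
  end

  def inject_alarm_id(message, _id), do: message
end

InjectDemo.inject_alarm_id({:send, %{message: "Wake up"}}, 42)
# => {:send, %{message: "Wake up", alarm_id: 42}}

InjectDemo.inject_alarm_id(:ping, 42)
# => :ping
```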
Pushover reads it back and reports completion:
def handle_cast({:send, payload}, state) do
  result = send_message(payload, state)

  if alarm_id = Map.get(payload, :alarm_id) do
    result_str =
      case result do
        {:ok, _} -> "ok"
        {:error, reason} -> "FAILED: #{inspect(reason)}"
      end

    Alarm.completed(alarm_id, result_str)
  end

  {:noreply, state}
end
Alarm records the outcome with a timestamp and computed duration:
def handle_cast({:completed, id, result}, state) do
  now = System.system_time(:second)
  # ... fetch started_at from DB ...
  duration_ms = if started_at, do: (now - started_at) * 1000, else: nil

  Sqler.update(state.alarmdb, "alarm", %{
    id: id, updated_at: updated_at,
    completed_at: now, duration_ms: duration_ms, result: result
  })

  {:noreply, state}
end
Now every alarm has a measurable lifecycle: created → fired → completed (or failed).
Execution Safety: Check Before You Cast
Before this upgrade, Timer blindly cast to the target. If the process didn’t exist, GenServer.cast silently dropped the message (cast is fire-and-forget by design).
Now Timer checks first:
defp resolve_target(name) when is_atom(name) do
  case Process.whereis(name) do
    pid when is_pid(pid) -> {:ok, pid}
    nil -> {:error, :noproc}
  end
end
On failure, instead of Alarm.fired(id), Timer calls Alarm.delivery_failed(id, "noproc"). The alarm knows it wasn’t delivered.
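Put together, the dispatch step becomes check-then-cast. This is a minimal standalone sketch, not the real Timer module — the real one records the outcome in the database, while here it is simply returned so the two paths are visible:

```elixir
defmodule DispatchDemo do
  # Same shape as Timer's resolve_target/1 above.
  def resolve_target(name) when is_atom(name) do
    case Process.whereis(name) do
      pid when is_pid(pid) -> {:ok, pid}
      nil -> {:error, :noproc}
    end
  end

  # Check before you cast: only deliver if the target is alive.
  def dispatch(name, message) do
    case resolve_target(name) do
      {:ok, pid} ->
        GenServer.cast(pid, message)
        :fired

      {:error, :noproc} ->
        :delivery_failed
    end
  end
end

DispatchDemo.dispatch(:nobody_home, {:send, %{}})
# => :delivery_failed
```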
Retry with Exponential Backoff
delivery_failed/2 doesn’t just record the failure — it decides whether to try again:
if attempts <= max_retries do
  delay_ms = backoff_ms * Integer.pow(2, attempts - 1)
  retry_at = now + div(delay_ms, 1000)
  # ... update attempts, schedule retry via Timer with same alarm ID ...
else
  # Retries exhausted — mark as permanently failed
end
The retry reuses the same alarm ID. The attempts counter tracks retries, and started_at is preserved from the first fire (so duration_ms reflects total time from first attempt to final completion).
Default: 0 retries (backward compatible). Callers opt in:
Alarm.set_timer(time, Pushover, {:send, payload},
max_retries: 2, backoff_ms: 5000)
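With the defaults shown above (backoff_ms: 5000), the delay doubles on each attempt: 5 s, then 10 s, then 20 s. A quick standalone check of the formula:

```elixir
backoff_ms = 5_000

# Same formula as delivery_failed/2: backoff_ms * 2^(attempt - 1)
delays =
  for attempt <- 1..3 do
    backoff_ms * Integer.pow(2, attempt - 1)
  end

# delays == [5000, 10000, 20000]
```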
The Timeout Sweep
What about alarms that fire but never complete and never fail? The target might hang, or the completion callback might have a bug. A 60-second sweep catches these:
def handle_info(:sweep_timeouts, state) do
  now = System.system_time(:second)

  # Find alarms fired > 5 minutes ago with no completion
  stale =
    Sqler.sql(state.alarmdb, """
    SELECT id, updated_at FROM alarm
    WHERE started_at IS NOT NULL
      AND started_at < ?
      AND completed_at IS NULL AND cancelled_at IS NULL
      AND (result IS NULL OR ...)
    """, [now - 300])

  for [id, updated_at] <- stale do
    Sqler.update(state.alarmdb, "alarm", %{
      id: id, updated_at: updated_at, result: "timeout_unknown"
    })
  end

  Process.send_after(self(), :sweep_timeouts, 60_000)
  {:noreply, state}
end
Simpler than per-alarm timeout timers. One sweep catches everything.
Avoiding the Self-Call Deadlock
The new query functions (history/1, stats/0, failed/1) are GenServer calls. But WebSocket requests are handled inside the same GenServer process (Alarm registers its handlers with WsRegistry). If a WS handler called Alarm.history(), it would deadlock — the process would be waiting on itself.
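The failure mode is easy to reproduce in isolation. This standalone sketch gives the nested call a 100 ms timeout as a safety valve — without it, the process would wait on itself forever:

```elixir
defmodule SelfCallDemo do
  use GenServer

  def start, do: GenServer.start(__MODULE__, nil, name: __MODULE__)
  def init(nil), do: {:ok, nil}

  def handle_call(:outer, _from, state) do
    # We are *inside* this process's handle_call, so the nested call below
    # can never be served — the mailbox isn't processed until we return.
    reply =
      try do
        GenServer.call(__MODULE__, :inner, 100)
      catch
        :exit, _ -> :deadlock_detected
      end

    {:reply, reply, state}
  end

  def handle_call(:inner, _from, state), do: {:reply, :ok, state}
end

{:ok, _pid} = SelfCallDemo.start()
GenServer.call(SelfCallDemo, :outer)
# => :deadlock_detected
```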
The solution: private do_* functions that operate directly on state:
# Public API (for external callers)
def history(opts), do: GenServer.call(__MODULE__, {:history, opts})

# GenServer handler
def handle_call({:history, opts}, _from, state) do
  {:reply, do_history(state, opts), state}
end

# WS handler (same process, no GenServer.call)
def handle_call({:ws, "history", args}, _from, state) do
  opts = ws_args_to_opts(args)
  {:reply, do_history(state, opts), state}
end

# Shared logic
defp do_history(state, opts) do
  # ... build query, execute, map results ...
end
Both paths use do_history/2. No deadlock.
Provenance: Who Scheduled This?
Every alarm can now carry created_by (user ID) and created_via (transport: “mcp”, “rest”, “iex”). The MCP server passes these automatically:
# In my_mcp_server.ex
def handle_tool_call("pushover_schedule", args, state) do
user = get_user(state)
opts = [created_by: to_string(user.id), created_via: "mcp"]
Permissions.Platform.pushover_schedule(perms, args, opts)
end
When debugging why an alarm fired at 3 AM, you can now see who scheduled it and through which interface.
Dashboard Visibility
Two new MCP tools expose alarm data:
- alarm_history — query with filters: status, process, since, limit
- alarm_stats — aggregate counts: total, pending, fired, completed, failed, cancelled, avg duration
Monitor’s services section now includes stats and the 5 most recent failures, visible in the system dashboard without needing to query directly.
The Retired Scheduler
While upgrading Alarm, we also deleted lib/scheduler.ex — a module that was never started in the supervision tree, had bugs, and had zero callers. Its Sqler instance ({Sqler, name: "scheduler", register: :scheduler}) was removed from the application and Monitor’s database list. Dead code, gone.
What’s Next
- Per-module completion reporting — currently only Pushover reports back. Any GenServer that receives alarm-scheduled messages could do the same by checking for :alarm_id in the payload.
- Alerting on failure rates — the stats infrastructure is there; a threshold check in the sweep could trigger a Pushover notification when failure rates spike.
- Retry for specific error types — currently retries on any delivery failure. Could be smarter: retry on :noproc (process restarting), don’t retry on serialization errors.
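That last idea could hang on a small predicate. A hypothetical sketch — the function name and error shapes here are illustrative, not the module's actual API:

```elixir
defmodule RetryPolicy do
  # Transient: target process is restarting — worth another attempt.
  def retryable?(:noproc), do: true
  # Transient: server-side errors from the push provider.
  def retryable?({:http_error, status}) when status >= 500, do: true
  # Permanent: a payload that can't be serialized won't fix itself.
  def retryable?({:serialization_error, _}), do: false
  def retryable?(_other), do: false
end

RetryPolicy.retryable?(:noproc)
# => true
RetryPolicy.retryable?({:serialization_error, :bad_term})
# => false
```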
The Alarm module went from “I hope that worked” to “I can tell you exactly what happened, when, how long it took, and what went wrong.” That’s the difference between a scheduler and an observable system.