The current AI Agent Gateway codebase has grown to 122 modules in lib/, 23 permission wrappers, and 18 MCP tool handlers. A lot of that is genuinely load-bearing; a lot of it is features that accreted around the load-bearing parts and could live as separate projects. This article walks through what is what, names the minimum viable gateway you would actually want to ship, and lists the design principles and library choices worth carrying forward into a clean rewrite.
The goal is not to deprecate the current code. It is to identify the spine.
The spine
If you stripped the current project to the smallest version that still does the job a gateway exists to do, this is what survives.
Identity and tokens
- User – registered users, password hashing (Argon2), email, Google OAuth association.
- SubToken – bearer tokens stored as database rows, each independently revocable, with a scope (:full, :partial) and an optional set of permission keys. Token strings are prefixed (st_...) so the auth path can recognize them.
- Auth.Authorization + Auth.AuthorizationPlug + Auth.AuthContext – the request-side plumbing that turns a Bearer header into a user identity and a permission set.
That is the entirety of the authentication surface. The gateway does not invent its own crypto; it uses Plug’s primitives plus Argon2 for password storage.
Capability-based permission system
- AccessControl – the registry where capability keys live, with TTL support, attenuation, and the “keys never leave the server process” invariant.
- AccessControlled – a use macro that any protected module mixes in. Sets @permission, @module_name, @module_description, generates allowed?/1, and registers itself.
- Permissions.Registry – the lookup of per-tool permission keys, hashed deterministically so they survive restart.
- Permissions.Bootstrap – the boot-time step that registers all the keys.
- Permissions.Keys – the canonical list of well-known keys.
Every protected module follows the same shape: business logic in Foo, permission wrapper in Permissions.Foo, MCP handler in MCPServer.Tools.Foo. That structure is the actual product. Without it there is no governance, just an LLM hitting endpoints.
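The registry behaviour listed above (deterministic keys, TTLs, attenuation) can be sketched in a few lines. This is a Python illustration of the idea rather than the gateway's Elixir code: the `Registry` class, its method names, the SHA-256 derivation, and the `instance-secret` value are all assumptions made for the example.

```python
import hashlib
import time


def permission_key(secret, tool):
    """Derive a stable key for a tool so it survives restarts (assumed scheme)."""
    return hashlib.sha256(f"{secret}:{tool}".encode()).hexdigest()[:16]


class Registry:
    """Hypothetical in-memory capability registry with TTL and attenuation."""

    def __init__(self):
        self._grants = {}  # token -> {key: expiry timestamp, or None for no TTL}

    def grant(self, token, key, ttl=None, now=None):
        now = time.time() if now is None else now
        expiry = None if ttl is None else now + ttl
        self._grants.setdefault(token, {})[key] = expiry

    def attenuate(self, parent, child, keys):
        """Give child a subset of parent's keys; it can never gain new ones."""
        held = self._grants.get(parent, {})
        self._grants[child] = {k: held[k] for k in keys if k in held}

    def allowed(self, token, key, now=None):
        now = time.time() if now is None else now
        grants = self._grants.get(token, {})
        if key not in grants:
            return False
        expiry = grants[key]
        return expiry is None or now < expiry


registry = Registry()
wf_key = permission_key("instance-secret", "workflow")
alarm_key = permission_key("instance-secret", "alarm")
registry.grant("st_parent", wf_key, ttl=60, now=1000.0)
registry.grant("st_parent", alarm_key, now=1000.0)
# The child asks for one held key and one it was never given.
registry.attenuate("st_parent", "st_child", [wf_key, "not-held"])
```

The child token inherits the parent's expiry for the key it receives, so attenuation can narrow a grant but never widen or extend it.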
Persistence
- Sqler – a thin wrapper around exqlite that gives every module its own SQLite database, one file per subsystem. Each module owns its instance. Sqler generates IDs as millisecond timestamps (so id / 1000 recovers the creation time in seconds), enforces optimistic locking via updated_at, and centralizes the slow-query, error, and migration logic.

  Known constraint: millisecond timestamps collide under burst-insert workloads where two rows are written in the same millisecond. The current project tolerates this because per-module concurrent-insert rates are low; the rewrite ships with Sqler.insert/3 returning :retry on collision, and the caller (always the same GenServer per module under WAL mode) waits a single millisecond and retries. If a module’s insert rate grows past that threshold, the upgrade path is ULID: still lexically sortable, still timestamp-prefixed for id / 1000-style queries, but with an 80-bit random suffix that makes collisions effectively impossible. Migrating an existing table is a column-add and a one-time backfill; not free, but bounded.
There is no Ecto. There is no Postgres. There is one SQLite file per concern, on local disk. That choice is load-bearing: single-customer instances, file-system backups, no external database to operate.
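The ID and locking conventions are easy to picture with Python's built-in sqlite3 module. The sketch below is an illustration under stated assumptions, not Sqler itself: the `items` table, both function names, and the choice to move to the next millisecond after the 1 ms wait are hypothetical.

```python
import sqlite3
import time


def insert_with_ms_id(conn, payload, now_ms=None, max_retries=5):
    """Insert a row keyed by a millisecond timestamp, retrying on collision."""
    row_id = int(time.time() * 1000) if now_ms is None else now_ms
    for _ in range(max_retries):
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO items (id, payload, updated_at) VALUES (?, ?, ?)",
                    (row_id, payload, row_id),
                )
            return row_id
        except sqlite3.IntegrityError:
            # That millisecond is taken: wait 1 ms, then try the next one.
            time.sleep(0.001)
            row_id += 1
    raise RuntimeError("could not allocate a unique id")


def update_if_unchanged(conn, row_id, payload, seen_updated_at, now_ms):
    """Optimistic lock: write only if updated_at still matches what was read."""
    with conn:
        cur = conn.execute(
            "UPDATE items SET payload = ?, updated_at = ? "
            "WHERE id = ? AND updated_at = ?",
            (payload, now_ms, row_id, seen_updated_at),
        )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT, updated_at INTEGER)"
)
first = insert_with_ms_id(conn, "a", now_ms=1_700_000_000_000)
second = insert_with_ms_id(conn, "b", now_ms=1_700_000_000_000)  # same millisecond
created_at_seconds = first // 1000
```

A stale writer that still holds the old updated_at value gets `False` back from `update_if_unchanged` and has to re-read the row before trying again.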
MCP transport (server side)
- MCPServer – the inbound MCP HTTP transport. Receives JSON-RPC requests, dispatches to tool handlers via a prefix table ({"workflow_", MCPServer.Tools.Workflow}, {"access_control_", MCPServer.Tools.AccessControl}, etc.), and returns results.
- MCPServerRouter – the Plug pipeline that mounts the MCP endpoints alongside the web UI and REST API on the same port.
- MCPServer.Helpers – shared response helpers (reply_json/2, unauthorized/2, error_reply/2, get_permissions/1).
- MCPServerManager – session state per connected MCP client.
- MCPClientTracker – which clients are currently connected, what tokens they hold, what their session id is.
This is the half-day-to-understand piece. Once you grasp the dispatch table, the entire MCP surface is composed of handler modules following the same pattern.
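A prefix dispatch table of this kind is small enough to sketch in full. The Python below is an illustration only: the handler functions are stand-ins for the tool modules, and the error code used for an unknown tool (JSON-RPC's "method not found", -32601) is an assumption, since the article does not say which code the gateway returns.

```python
def workflow_tool(name, args):
    return {"handler": "workflow", "tool": name, "args": args}


def access_control_tool(name, args):
    return {"handler": "access_control", "tool": name, "args": args}


# Ordered prefix table, mirroring {"workflow_", MCPServer.Tools.Workflow} etc.
PREFIX_TABLE = [
    ("workflow_", workflow_tool),
    ("access_control_", access_control_tool),
]


def dispatch(request):
    """Route a tool call to the first handler whose prefix matches its name."""
    params = request.get("params", {})
    name = params.get("name", "")
    for prefix, handler in PREFIX_TABLE:
        if name.startswith(prefix):
            result = handler(name, params.get("arguments", {}))
            return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}
    return {
        "jsonrpc": "2.0",
        "id": request.get("id"),
        "error": {"code": -32601, "message": f"unknown tool: {name}"},
    }


ok = dispatch({
    "jsonrpc": "2.0",
    "id": 7,
    "method": "tools/call",
    "params": {"name": "workflow_list", "arguments": {"limit": 5}},
})
missing = dispatch({"jsonrpc": "2.0", "id": 8, "params": {"name": "nope_tool"}})
```

Adding a new tool family is one more row in the table plus one handler module, which is why the whole MCP surface reads the same once the table is understood.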
MCP transport (client side)
- MCPClient – outbound MCP. The gateway can act as an MCP client itself, consuming external MCP servers (Box, GitHub, Google, customer-supplied tooling) and surfacing them under the same permission model.
- MCPConnection + MCPConnectionRegistry – the registry of configured outbound connections, each with its own auth, its own permission key, its own enabled/disabled state.
The dual-role design (server and client) is the gateway’s value proposition. An agent talks to one endpoint, the gateway talks to ten. The audit trail and permission gating happen in between.
OpenAPI bridge (REST -> MCP synthesis)
- OpenApiBridge – the functional module that owns the registered API specs and the MCP tools synthesized from them.
- OpenApiBridge.Parser – reads an OpenAPI 3.x document and produces an internal description of each path, its parameters, its authentication, and its response shape.
- OpenApiBridge.Dispatcher – at request time, takes an MCP tool call against a synthesized tool, builds the HTTP request, signs it, executes it, and shapes the response back into MCP’s reply format.
- OpenApiBridge.Assertion – the signed assertion auth pattern used to call external services on behalf of the gateway with the caller’s identity attached. The same pattern the project uses for feature servers.
- OpenApiBridge.Db – persistence for registered API specs and the synthesized tool definitions, in its own SQLite database.
- MCP handlers api_register, api_unregister, api_list, api_describe, and api_call – the operator surface for adding a new REST API to the gateway’s tool list.
Why this belongs in the core: it is the lever that turns “ten REST APIs the customer already has” into “ten MCP tool surfaces the agent can use” without writing one Elixir adapter per API. Drop an OpenAPI spec into the bridge, set the auth credential, grant the permission key, and the agent now has typed access to that API through the same permission-gated, audit-logged pipeline as every other tool. The work that used to be a Permissions wrapper plus an MCP handler plus a Plug router for each integration collapses into a one-time registration call.
The bridge does not replace hand-written adapters when behavior matters more than the spec. The Box, Google, and WhatsApp integrations in the current codebase are hand-written because their workflows have semantics (multi-step OAuth flows, file streaming, polling) that an OpenAPI dispatcher cannot synthesize. But for the long tail of “I have a REST API, I want an MCP tool for it,” the bridge eliminates the per-API engineering cost.
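To make the synthesis step concrete, here is a minimal Python sketch of turning the paths of an OpenAPI 3.x document into tool definitions. It illustrates the idea, not OpenApiBridge.Parser: the `synthesize_tools` function, the `billing` prefix, the fallback naming rule, and the shape of the tool dictionary are all assumptions.

```python
HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def synthesize_tools(spec, prefix):
    """Build one tool definition per (path, method) operation in the spec."""
    tools = []
    for path, item in spec.get("paths", {}).items():
        for method, op in item.items():
            if method not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters"
            fallback = f"{method}_{path.strip('/').replace('/', '_')}"
            name = op.get("operationId") or fallback
            properties, required = {}, []
            for param in op.get("parameters", []):
                properties[param["name"]] = param.get("schema", {"type": "string"})
                if param.get("required"):
                    required.append(param["name"])
            tools.append({
                "name": f"{prefix}_{name}",
                "description": op.get("summary", ""),
                "method": method.upper(),
                "path": path,
                "inputSchema": {
                    "type": "object",
                    "properties": properties,
                    "required": required,
                },
            })
    return tools


spec = {
    "openapi": "3.0.3",
    "paths": {
        "/invoices/{id}": {
            "parameters": [],
            "get": {
                "operationId": "getInvoice",
                "summary": "Fetch one invoice",
                "parameters": [
                    {"name": "id", "in": "path", "required": True,
                     "schema": {"type": "string"}},
                ],
            },
        },
        "/invoices": {"post": {"summary": "Create an invoice"}},
    },
}
tools = synthesize_tools(spec, "billing")
```

A dispatcher would later use the stored `method` and `path` to build the HTTP request; request bodies, authentication, and response shaping are left out of this sketch.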
REST API and web UI (the three-transport rule for core modules)
Every core protected module in the gateway – meaning every module that ships in the spine, not optional plugins – is reachable on three transports: MCP (for agents), REST (for programmatic clients that don’t speak MCP), and Web (for human operators). Picking one or two for a core module is a mistake. The discipline is to expose all three from day one because each transport serves a constituency the other two cannot.
Optional plugin modules follow a weaker rule: MCP is mandatory, REST and Web are recommended but not required. The split is intentional and is restated in the “Optional plugin modules” paragraph below.
A note on what “web” means here. The *_web.ex views are the operator console: server-rendered HTML pages for the human who runs the gateway. They are not the end-user product UI – a customer who wants a polished end-user experience builds that as a separate front-end project consuming the REST API. The gateway is both a headless API and a self-contained operator console; it is not also a marketing site or a customer-facing SaaS UI.
For each module Foo, the standard fan-out looks like this:
| Transport | File | Calls into |
|---|---|---|
| MCP | lib/mcp_server/tools/foo.ex | Permissions.Foo |
| REST | lib/router/foo_api.ex | Permissions.Foo |
| Web | lib/foo_web.ex | Permissions.Foo (renders HTML) |
All three call into the same Permissions.Foo wrapper. Permission checks live in the logic layer, never in the transport. The three transport modules are intentionally thin: parse arguments, call the wrapper, format the response in the transport’s native shape (MCP envelope, JSON, HTML).
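The thin-transport rule looks like this in miniature. The Python sketch is illustrative only: `permissions_alarm_list` stands in for a Permissions.Foo wrapper, the `alarm.read` key name is hypothetical, and the envelope shapes are simplified.

```python
import html
import json


class NotAllowed(Exception):
    pass


def permissions_alarm_list(permissions):
    """Stand-in for a Permissions.Foo wrapper: the only place the check lives."""
    if "alarm.read" not in permissions:  # hypothetical key name
        raise NotAllowed("alarm.read")
    return [{"id": 1, "name": "nightly <backup>"}]


def mcp_handler(permissions):
    """MCP transport: wrap the result in a tool-result envelope."""
    try:
        rows = permissions_alarm_list(permissions)
    except NotAllowed as exc:
        text = f"not allowed: {exc}"
        return {"isError": True, "content": [{"type": "text", "text": text}]}
    return {"isError": False, "content": [{"type": "text", "text": json.dumps(rows)}]}


def rest_handler(permissions):
    """REST transport: status code plus JSON body."""
    try:
        return 200, json.dumps(permissions_alarm_list(permissions))
    except NotAllowed:
        return 403, json.dumps({"error": "not_allowed"})


def web_handler(permissions):
    """Web transport: server-rendered HTML for the operator console."""
    try:
        rows = permissions_alarm_list(permissions)
    except NotAllowed:
        return "<p>Not allowed</p>"
    items = "".join(f"<li>{html.escape(row['name'])}</li>" for row in rows)
    return f"<ul>{items}</ul>"
```

None of the three handlers inspects the permission set itself; each only translates the wrapper's result or refusal into its own native shape.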
Core REST/Web modules in the spine:
- Router.AccessControlAPI + AccessControlWeb – grant, revoke, list permissions over HTTP and HTML.
- Router.Helpers – shared JSON response helpers.
- Router.AccessLogger – the Plug that drops every request into http_access_log.
- LoginWeb, RegistrationWeb – the human-facing auth flows.
- SubTokenWeb – managing personal sub-tokens (/settings/token).
- WorkflowWeb – creating, listing, monitoring workflows; approving paused steps.
- AlarmWeb – viewing scheduled alarms, cancelling, manually triggering.
- MonitorWeb – live health-check dashboard.
- McpConnectionsWeb – view, enable, disable outbound MCP connections.
- HttpAccessLogWeb – the request audit dashboard.
- OpenApiBridgeWeb – register a new OpenAPI spec, view synthesized tools, set credentials.
- NavMenu – the shared navigation surface across all *Web modules.
The rule is uniform: if a module appears in lib/permissions/, it must have a *_web.ex and a router/*_api.ex alongside its mcp_server/tools/ handler. Three files per module. The operator can use any transport. The agent can use any transport (but usually picks MCP). The programmatic integrator can use any transport (but usually picks REST).
Synthesized surfaces (OpenAPI bridge): the MCP tools and REST passthroughs are auto-generated from the registered specs; the operator surface is a single generic page (OpenApiBridgeWeb) for listing registered specs, inspecting synthesized tools, setting credentials, and toggling enable/disable. The operator does not get a hand-crafted page per registered API – the bridge does not have hundreds of operator pages, it has one.
Optional plugin modules (Box, Google, WhatsApp, platform-specific drivers, etc.): each ships at minimum the MCP transport. REST and Web are strongly recommended but optional, because some plugins (notably narrow platform-specific drivers) have no useful operator surface beyond what their parent module already exposes. The contract on the plugin’s permission wrapper stays identical – agents and integrators get the same gated path regardless of how many transports the plugin chose to surface.
Why this matters in practice:
- Modify user access and permissions – the operator does it in the web UI; an automation script does it via REST; an agent (with the admin permission) does it via MCP. Same wrapper underneath.
- Set up workflows – the operator authors a template in the web UI; a CI pipeline registers a new one via REST; an agent instantiates a template via MCP.
- Add alarms – a person sets a one-off reminder in the web UI; a scheduled-job runner posts to REST; an agent schedules a follow-up via MCP.
- View MCP and API connections – the operator inspects them in the web UI; an inventory script lists them via REST; an agent enumerates them via MCP before deciding which tools to use.
- Set sub-tokens – the operator generates a token in the web UI; a provisioning script creates per-user tokens via REST; an agent rotates its own token via MCP.
The constituent surfaces (different audiences, different ergonomics) are not negotiable. Skipping the web UI to “ship MCP faster” produces a gateway no operator wants to operate. Skipping REST to “stay pure MCP” produces a gateway no DevOps team can automate. Each transport pays for itself in a different way; together they make the gateway tractable for the three audiences that actually use it.
Scheduling
- Alarm – one-shot or recurring scheduled events. Each alarm has a set_at timestamp, a callback module/function, optional cancellation, and persists in its own SQLite database.
- Timer – the low-level GenServer that Alarm uses to schedule the actual BEAM messages. Helper processes like Timer are started inside their parent’s init/1 and owned by the parent.
This is the scheduling primitive that everything else uses. Background jobs, retries, scheduled tasks for agents, periodic health checks – they all dispatch through Alarm so the database row is the source of truth and a restart does not lose them.
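The "database row is the source of truth" idea can be sketched with Python's sqlite3 module. This is an illustration, not the Alarm module: the table layout, the `status` column, the function names, and the use of autoincrement ids (instead of the project's millisecond ids) are assumptions made to keep the example short.

```python
import sqlite3


def open_db(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS alarms ("
        "id INTEGER PRIMARY KEY, set_at INTEGER NOT NULL, "
        "callback TEXT NOT NULL, status TEXT NOT NULL DEFAULT 'pending')"
    )


def schedule(conn, set_at, callback):
    with conn:
        cur = conn.execute(
            "INSERT INTO alarms (set_at, callback) VALUES (?, ?)", (set_at, callback)
        )
    return cur.lastrowid


def cancel(conn, alarm_id):
    with conn:
        conn.execute(
            "UPDATE alarms SET status = 'cancelled' "
            "WHERE id = ? AND status = 'pending'",
            (alarm_id,),
        )


def fire_due(conn, now, callbacks):
    """Run every pending alarm whose set_at has passed; safe to call after a restart."""
    rows = conn.execute(
        "SELECT id, callback FROM alarms "
        "WHERE status = 'pending' AND set_at <= ? ORDER BY set_at",
        (now,),
    ).fetchall()
    for alarm_id, callback in rows:
        callbacks[callback](alarm_id)
        with conn:
            conn.execute("UPDATE alarms SET status = 'fired' WHERE id = ?", (alarm_id,))
    return [alarm_id for alarm_id, _ in rows]


conn = sqlite3.connect(":memory:")
open_db(conn)
a = schedule(conn, 100, "health_check")
b = schedule(conn, 200, "retry_job")
c = schedule(conn, 150, "retry_job")
cancel(conn, c)
seen = []
callbacks = {"health_check": seen.append, "retry_job": seen.append}
fired_early = fire_due(conn, 120, callbacks)
fired_late = fire_due(conn, 500, callbacks)  # what a restarted process would run
```

Because pending work is discovered by querying the table, a process that restarts only needs to call `fire_due` again; no in-memory timer state has to survive.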
Workflow orchestration
- Workflow – the functional module. Owns the workflow row, the step list, the current state, the per-step inputs and outputs. Each workflow run is persisted in its own SQLite database with the same millisecond-ID convention as the rest of the project.
- WorkflowExecutor – the runtime that walks a workflow through its steps, calls into the target tool for each step, captures the result, advances or pauses the run. Long-running steps yield back to the supervisor rather than blocking.
- Permissions.Workflow + MCP handlers for workflow_create, workflow_list, workflow_get, workflow_cancel, workflow_step_ready, workflow_step_approve, and the template surface (workflow_template_save, workflow_template_list, workflow_template_delete).
- Templates – saved patterns that an operator authored once and an agent can instantiate many times. Template management is a separate permission key from execution, so a tightly-scoped agent can run templates without being able to edit them.
- Human-in-the-loop steps – a step can require explicit approval before it advances. The workflow row carries the state; the operator approves through the web UI or via workflow_step_approve; the executor resumes.
- Per-user isolation – every workflow row carries an owner (the username under which it was created). All queries – workflow_list, workflow_get, workflow_cancel, step approvals – filter by owner = current_user. A user cannot list, read, modify, or cancel another user’s workflows; the runtime returns :not_found rather than :not_allowed to avoid leaking the fact that another user has a workflow at all. The only exception is the admin role, which carries an explicit workflow.cross_user permission key. Templates are subject to a separate ownership rule: an operator can mark a template as shared, after which it appears in other users’ workflow_template_list results but the resulting instantiated workflow is still owned by the user who ran it.
Workflows are the orchestration primitive that turns single-tool calls into multi-step business processes. An agent doesn’t write a 500-line script to “process this invoice, route to AP, post a notification, and wait for finance approval.” It instantiates a workflow template with three steps and lets the executor handle the state machine, including the human-approval pause. Workflows depend on Alarm for scheduled steps, Notifier for step-ready signals, FunctionNode for deterministic transformations between steps, the LlmGateway for the limited steps that need natural-language parsing, and the permission system for who can do what at each step.
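The approval pause and the owner filter are the two behaviours most worth seeing in code. The Python below is a minimal in-memory sketch under stated assumptions: the `Workflow` class, the `run`, `get`, and `approve` functions, and the three example steps are hypothetical and do not mirror WorkflowExecutor's actual API.

```python
class Workflow:
    def __init__(self, owner, steps, value=None):
        self.owner = owner
        self.steps = steps        # list of (name, function, needs_approval)
        self.index = 0
        self.value = value
        self.state = "running"    # running | paused | done
        self.approved = set()


def run(wf):
    """Advance until the run finishes or hits an unapproved approval step."""
    while wf.index < len(wf.steps):
        name, fn, needs_approval = wf.steps[wf.index]
        if needs_approval and name not in wf.approved:
            wf.state = "paused"
            return wf.state
        wf.value = fn(wf.value)
        wf.index += 1
    wf.state = "done"
    return wf.state


def get(store, workflow_id, user):
    """Owner-filtered lookup: another user's run looks like no run at all."""
    wf = store.get(workflow_id)
    return wf if wf is not None and wf.owner == user else None


def approve(store, workflow_id, user):
    wf = get(store, workflow_id, user)
    if wf is None:
        return "not_found"
    if wf.state != "paused":
        return wf.state
    wf.approved.add(wf.steps[wf.index][0])
    return run(wf)


steps = [
    ("extract_total", lambda invoice: invoice["total"], False),
    ("finance_approval", lambda total: total, True),
    ("post_notification", lambda total: f"approved {total}", False),
]
store = {1: Workflow("alice", steps, value={"total": 120})}
first_state = run(store[1])
```

Here `approve(store, 1, "bob")` returns `"not_found"`, the same answer as for a workflow id that does not exist, so the reply reveals nothing about other users' runs.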
Function nodes (sandboxed compute for workflow glue)
- FunctionNode – the functional module. A registered piece of customer-authored code with a typed input, a typed output, a runtime selection (Python, JavaScript, Elixir-on-BEAM, etc.), and a permission key. Each function node has its own row in the function-node Sqler db, its own version history, and its own audit trail for invocations.
- FunctionNode.Provisioner – creates and tears down the sandboxed execution environment. The rewrite’s default backend is an in-process WASM runtime; a local Docker backend handles function nodes that need richer language runtimes; Fly.io machines and Kubernetes Jobs are optional plugin backends for deployments that genuinely need managed multi-machine isolation. The current codebase ships only the Fly.io backend; the rewrite inverts that priority.
- FunctionNode.Registry – which function nodes exist, which versions, which are deployed, which are draft.
- FunctionNode.Runtime (the in-machine code that lives in function_node_runtime/) – the agent that runs inside the sandbox, fetches the function body on boot, executes it on request, and reports back. Authoritative for the sandbox-side contract; replaceable per backend.
- Permissions.FunctionNodes + three transports (MCP function_node_* tools, REST router, FunctionNodeWeb UI for authoring and deployment).
Why this belongs in the core: real workflows are mostly deterministic plumbing between domain-specific MCP calls. Pull a contract PDF from Box. Extract a date. Compare it to today. If within 30 days, post a Slack message via the company’s MCP server. The MCP calls are the “Box pull” and “Slack post” boxes; everything between them is data shaping that has no business going through an LLM. Function nodes let workflow authors write that glue in a real programming language, deploy it as a versioned, permissioned artifact, and have the workflow executor invoke it just like any other step.
The split of work in a typical workflow:
- Most steps: MCP tool calls. Domain-specific, written once by whoever owns the domain (Box, Salesforce, your customer’s REST API).
- Most transitions: function-node calls. Deterministic data shaping. Twenty lines of code, runs in milliseconds, has unit tests.
- A few steps: LLM Gateway calls. Reserved for the parts that actually need natural-language understanding (parse this email, summarize this PDF, classify this support ticket).
- A few steps: human-in-the-loop approvals.
Without function nodes, the “Most transitions” line collapses into one of two bad options: write an LLM step for every trivial transformation (expensive, slow, non-deterministic) or write Elixir in the gateway codebase for every new shape (rigid, redeploy required, no per-customer isolation). Both are losing strategies for an instance-per-customer product. Function nodes are the third option that makes workflows actually deployable.
Backend choice is a deployment decision, not an architectural one. The current codebase uses Fly.io for production isolation, but the rewrite changes the default to in-process WASM and demotes Fly.io to an optional plugin. The reasoning matters because it materially changes the deployment story:
- Most workflow function nodes are pure transformations on JSON: pull a field, reshape a structure, derive a value. These do not need a separate machine, a container, or a network round-trip. An in-process WASM call runs in microseconds with strict isolation guarantees and zero external dependencies.
- The vast majority of WASM-incompatible function nodes (those needing Python, Node, or filesystem access) are served by a local Docker backend on the gateway host. It is universal, carries no vendor lock-in, and the operator already has Docker available.
- Fly.io as a backend remains useful for deployments that genuinely need managed multi-machine isolation – but that is a minority of deployments, and the dependency on a specific cloud provider should not be in the spine.
The contract across backends is the same: invoke a function node, pass typed input, get typed output, log the invocation. The default is in-process; the escape hatches scale up when the workload demands them.
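That four-part contract (invoke, typed input, typed output, logged invocation) can be sketched independently of any backend. The Python below is an illustration only: `InProcessBackend` calls the function directly and is not a sandbox, and the node layout, the `check` helper, and the expiry example are hypothetical.

```python
from datetime import date


class InProcessBackend:
    """Hypothetical backend: calls the body directly. Not a real sandbox."""

    name = "in_process"

    def invoke(self, body, payload):
        return body(dict(payload))  # pass a copy so caller state is untouched


def check(schema, data, label):
    """Reject data whose fields do not match the declared types."""
    for field, expected in schema.items():
        if not isinstance(data.get(field), expected):
            raise TypeError(f"{label} field {field!r} must be {expected.__name__}")


def invoke_node(backend, node, payload, audit_log):
    """Validate input, run on the chosen backend, validate output, log the call."""
    check(node["input"], payload, "input")
    result = backend.invoke(node["body"], payload)
    check(node["output"], result, "output")
    audit_log.append({"node": node["name"], "backend": backend.name})
    return result


def expires_soon(payload):
    expires = date.fromisoformat(payload["expires"])
    today = date.fromisoformat(payload["today"])
    days = (expires - today).days
    return {"days_left": days, "notify": 0 <= days <= 30}


node = {
    "name": "contract_expiry_check",
    "input": {"expires": str, "today": str},
    "output": {"days_left": int, "notify": bool},
    "body": expires_soon,
}
audit_log = []
result = invoke_node(
    InProcessBackend(), node, {"expires": "2025-03-20", "today": "2025-03-01"}, audit_log
)
```

Swapping in a Docker or remote backend would mean providing another object with the same `invoke` method; `invoke_node` and the workflow step that calls it would not change.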
LLM Gateway (multi-provider routing)
- LlmGateway – the functional module. Public API: complete/2, chat/2, embed/2. Takes a request, picks a provider, executes the call, normalizes the response shape across providers.
- LlmGateway.Handler – the dispatch layer. Provider selection by request (model: "opus" -> Anthropic; model: "gpt-4o" -> OpenAI; model: "gemini" -> Google), with optional fallback on rate-limit or provider failure.
- LlmGateway.Database – per-call audit and usage logs in its own SQLite instance. Token counts, latency, cost attribution by token id, and the full request/response when retention is enabled.
- Provider adapters – one module per upstream (Anthropic, OpenAI, Google Vertex, Gemini, OpenRouter, local providers). Each is a small client that translates the canonical request into the provider’s wire format and back.
Why this belongs in the core: workflows that involve LLM steps are most useful when the choice of model is a configuration variable, not a code dependency. A workflow template can declare “summarize step uses sonnet, extraction step uses gpt-4o-mini, embedding step uses text-embedding-3-small.” Swapping providers becomes a template edit instead of a code change. Customers who already have a contract with OpenAI but want to try Anthropic for one workflow get to do that without a redeploy.
The gateway also gives the audit layer something to hang on. Every model call goes through one path, so cost-per-token, latency, and failure rates are queryable as one table rather than scattered across N integration modules. Permissions are per-provider: llm.anthropic, llm.openai, llm.google – a customer can grant an agent access only to the providers they have a contract with.
Without the gateway, every workflow step that calls an LLM becomes its own integration. With it, the workflow knows nothing about provider APIs; it knows that one of its steps is “call the LLM with these inputs and these constraints.”
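Model-based routing with per-provider permissions and a fallback fits in a short sketch. The Python below is illustrative and makes several assumptions: the `complete` signature differs from the article's complete/2, the adapters are fakes that make no network calls, and the `fallback` argument is a hypothetical way of expressing the optional fallback.

```python
class ProviderError(Exception):
    pass


# Model name -> provider, following the examples given for the dispatch layer.
ROUTES = {"opus": "anthropic", "gpt-4o": "openai", "gemini": "google"}


def complete(model, prompt, adapters, permissions, fallback=None):
    """Try the requested model, then the fallback, honoring llm.* permission keys."""
    candidates = [model] + ([fallback] if fallback else [])
    for candidate in candidates:
        provider = ROUTES[candidate]
        if f"llm.{provider}" not in permissions:
            continue  # caller holds no key for this provider
        try:
            text = adapters[provider](candidate, prompt)
        except ProviderError:
            continue  # rate limit or outage: try the next candidate
        return {"provider": provider, "model": candidate, "text": text}
    raise ProviderError("no permitted provider succeeded")


def failing_anthropic(model, prompt):
    raise ProviderError("rate limited")


def fake_openai(model, prompt):
    return f"[{model}] {prompt.upper()}"


adapters = {"anthropic": failing_anthropic, "openai": fake_openai}
reply = complete(
    "opus", "summarize", adapters, {"llm.anthropic", "llm.openai"}, fallback="gpt-4o"
)
```

The caller sees one normalized reply shape regardless of which provider answered, which is what lets a workflow template treat the model as a configuration value.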
Notifications
- Notifier – a generic dispatcher that takes a notification and routes it to one or more backends.
- Pushover – one backend, owns its own Sqler instance for audit logs, delivers via the Pushover REST API.
- LoggerNotifier – the bridge that lets a Logger.error call optionally fan out to the notifier when configured.
The shape is generic. Adding email, Slack, or SMS is a new backend, not a new pipeline.
Logging and audit
- ServerLog + ServerLogWriter – the gateway’s own runtime log, written to a SQLite file alongside the on-disk text log. Queryable from the web UI.
- HttpAccessLog – one row per HTTP request, including path, status, user, IP, duration. The data behind any “who hit what” investigation.
- CompileLog + CompileAnalyzer + AutoCompile – developer-facing: which files compiled, which warnings, which errors. Less essential for a customer deployment, more essential for a dev environment.
- LogSubscriber – the GenServer that subscribes to Elixir’s Logger events and routes them where they need to go.
Audit logging is not a feature; it is a precondition. Strip it and the gateway loses its ability to answer “what did the agent do with this customer’s data.”
Health and monitoring
- Monitor + MonitorServer + MonitorWeb – periodic health checks, each producing a pass/fail record. Disk usage, process counts, port liveness, dependency health, custom checks per deployment.
- Mcp.Telemetry – the BEAM-level metrics surface (VM stats, scheduler utilization, message queue lengths).
Server-sent events
- SsePush – generic outbound SSE for MCP clients that require server-to-client notifications.
- PageSse – the same primitive applied to the web UI for live updates.
Documentation (searchable corpus, agent-readable, no built-in Q&A)
- Doc – functional module. Public surface: Doc.search/2 (text + semantic search returning ranked chunks), Doc.get/1 (fetch by path, with optional section anchors), Doc.list/1 (browse the corpus).
- Doc.Index – a SQLite-backed search index over asset/docs/ and asset/manuals/. Full-text via SQLite’s FTS5 and semantic via sqlite_vec for vector search. Both indexes are queried in Doc.search and the results merged with a hybrid ranking.
- Doc.Watcher – a FileSystem-backed process that re-indexes any asset/docs/**/*.md or asset/manuals/**/*.md that changes on disk. Drop a new manual in, it’s searchable within seconds.
- Permissions.Doc + three transports (doc_search, doc_get, doc_list, plus DocWeb and Router.DocAPI). No doc_ask – answering questions is the agent’s job, not the gateway’s.
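The article does not say which hybrid ranking Doc.search uses to merge the full-text and vector results. One common choice is reciprocal rank fusion, sketched here in Python purely as an assumption; the chunk identifiers and the constant k = 60 are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of chunk ids; a higher fused score ranks first."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by id for a stable order.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical result lists from the full-text index and the vector index.
fts_hits = ["workflow-mcp.md#create", "alarm-web.md#cancel", "workflow-rest.md#list"]
vector_hits = ["workflow-rest.md#list", "workflow-mcp.md#create", "doc-mcp.md#search"]
merged = reciprocal_rank_fusion([fts_hits, vector_hits])
```

Chunks that appear in both lists rise to the top because their per-list scores add up, while a chunk found by only one index still makes the merged list further down.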
The deliberate design: the gateway exposes the corpus, the index, and the retrieval; the user’s own agent calls doc_search and doc_get over MCP, reads the chunks, and synthesizes the answer using whatever model it already has loaded. The gateway does not run an LLM to answer questions about its own documentation. Three reasons:
- No double billing. The user’s agent is already running an LLM. If the gateway also runs one to “answer” the same question, the customer pays twice for the same reasoning.
- No model choice imposed on the customer. The user picks the model they want their agent to use. The gateway should not pick a different one for documentation Q&A.
- Cleaner permission boundary. Reading documentation is a separate capability from calling an LLM. A token can have doc.read without having any llm.* keys.
Why this belongs in the core anyway: an instance-per-customer deployment is operated by people who did not write the gateway. They need to find “how do I set up a workflow,” “what does workflow_create accept as arguments,” and “why is permission attenuation behaving this way” without trawling the source. Agents have the same problem – before invoking a tool, a well-behaved agent calls doc_search to find out what the tool does and what arguments it expects.
Doc is the corpus in two shapes at once: a browseable manual surface for humans (DocWeb) and a search/retrieval surface for agents (MCP and REST). The content is asset/docs/ (architecture, internal design) plus asset/manuals/ (per-module user guides, split by access method as the project convention requires: *-web.md, *-rest.md, *-mcp.md, *-iex.md). Onboarding a new operator becomes “give them the URL.” Onboarding a new agent becomes “let it call doc_search before it starts.”
Documentation does not depend on the LlmGateway. It can ship independently, earlier in the plan.
Runtime introspection (Tidewave, permission-gated)
- Permissions.Tidewave – gates Tidewave behind three permission keys: tidewave.eval (arbitrary Elixir evaluation), tidewave.docs (read-only docs/source lookup), tidewave.logs (read-only log access). The first is the most dangerous and is granted only to operators.
- Production-mode mount – the Tidewave Plug is mounted on the same HTTPS port as the rest of the gateway, behind the same bearer-token auth. No more port 4001 daemon. The dep stays optional via MIX_ENV=prod config and the TIDEWAVE_ENABLED=true env var.
- Full audit retention – every Tidewave invocation logs to HttpAccessLog and to a dedicated tidewave_invocations table that retains the request payload, the caller, and the response shape. No silent power.
Why this belongs in the core: production debugging for a single-customer Elixir instance has two paths. Build a bespoke “inspect this process, inspect this Sqler database, replay this MCP request” UI (months of engineering, never matches what you actually need at 2am). Or gate the existing runtime-introspection tool behind one of the most powerful permission keys in the system, with full audit, and hand it only to operators.
The latter is what works in practice. The discipline:
- tidewave.eval is never granted to a sub-token an agent holds.
- The token is rotated whenever an operator role changes.
- A leaked tidewave.eval token is treated as a production incident, same as a leaked DB password.
The “fail safely at boundaries” principle discussed elsewhere in this article still applies inside Tidewave: a query that errors raises in Tidewave’s scope, is logged with the caller’s username, and does not leave half-state on the system.
External agent invocation
- ExternalAgent – functional module. Public surface: dispatch/3 (synchronous request to an external agent with a prompt and arguments, capture and return the reply).
- Core backend adapters (cross-platform, ship in the spine):
  - ExternalAgent.AnthropicApi – direct Claude API call with a structured prompt.
  - ExternalAgent.OpenAIApi – direct GPT API call.
  - ExternalAgent.LocalSubprocess – spawn claude -p "..." or codex -p "..." (or any other configured CLI) and capture stdout. Works on any Unix.
- Optional backend adapters (separate libraries, loaded only on deployments that need them): platform-specific dispatchers such as tmux drivers, screen drivers, terminal-emulator drivers, or native iOS/Android automation. None of these is in the spine; each ships as its own optional library and follows the same adapter contract.
- Permissions.ExternalAgent – per-backend permission keys: external_agent.anthropic, external_agent.openai, external_agent.subprocess. Optional backends register their own permission keys when loaded. A token granted one key cannot invoke another backend.
- Three transports as usual, plus a per-invocation audit table that retains the prompt, the response, the backend, the latency, and the caller.
Why this belongs in the core: the gateway’s value proposition is “one place to govern all the AI activity in an organization.” If the gateway cannot dispatch to external agents at all, the customer either builds their own dispatch outside the audit boundary (defeating the point) or forces every workflow through one model (losing the right-tool-for-the-job benefit). The feature has to exist. The work is the audit/permission boundary around it.
This is the feature in the gateway that carries the most responsibility per call. An authorized user with external_agent.anthropic can effectively ask a remote model to do anything the gateway’s other tools allow. The mitigations:
- Default-deny. No token has any external_agent.* key unless an operator explicitly granted it.
- Per-backend isolation. Granting one backend does not grant the others. A workflow that needs Claude for one step doesn’t also get the Anthropic API key for free.
- Rate-limited per token. A runaway loop calling dispatch/3 1,000 times in a minute trips a circuit breaker tied to the calling token.
- Full audit. Every invocation has its full prompt and response in the audit table for 90 days minimum (operator-configurable).
- No prompt smuggling. The dispatch layer refuses to pass through fields that could exfiltrate other state (system prompts, user-context injection patterns) unless an explicit external_agent.advanced key is held.
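The per-token circuit breaker in the mitigation list can be sketched as a sliding-window counter that latches open once tripped. The Python below is an assumption-laden illustration: the `TokenBreaker` class, its limits, and the rule that a tripped token stays blocked until an explicit reset are all hypothetical.

```python
from collections import defaultdict, deque


class TokenBreaker:
    """Hypothetical per-token breaker: trips when a token exceeds the limit."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.calls = defaultdict(deque)  # token -> timestamps of recent calls
        self.tripped = set()

    def allow(self, token, now):
        if token in self.tripped:
            return False  # stays open until an operator resets it
        calls = self.calls[token]
        while calls and now - calls[0] >= self.window:
            calls.popleft()  # forget calls that fell out of the window
        if len(calls) >= self.limit:
            self.tripped.add(token)
            return False
        calls.append(now)
        return True

    def reset(self, token):
        self.tripped.discard(token)
        self.calls[token].clear()


breaker = TokenBreaker(limit=3, window_seconds=60)
results = [breaker.allow("st_agent", now) for now in (0, 1, 2, 3)]
```

Because the state is keyed by token, one runaway agent trips only its own breaker; other tokens keep dispatching normally.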
The reason this is in the spine and not in the optional features is that workflows depend on it for any step that needs an outside agent’s judgment – which, for many real customers, is most of the workflow.
Supervision and startup
- Mcp.Application – the OTP application’s start/2. Brings up Sqler instances, the registries, the web/REST/MCP transports, the alarm and notifier processes, the monitor. One supervision tree, no auxiliary scripts.
- Mcp.Startup – the ordered list of boot steps (database schemas, permission key registration, monitor wiring). Each step is logged and any failure aborts startup loudly.
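An ordered boot list that logs each step and aborts on the first failure is a simple pattern. The Python sketch below illustrates it under stated assumptions: `run_startup`, `StartupError`, and the step names are hypothetical and do not reflect Mcp.Startup's real interface.

```python
import logging

log = logging.getLogger("startup")


class StartupError(RuntimeError):
    pass


def run_startup(steps):
    """Run (name, callable) boot steps in order; stop at the first failure."""
    completed = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            log.error("boot step %s failed: %s", name, exc)
            raise StartupError(f"startup aborted at {name}") from exc
        log.info("boot step %s ok", name)
        completed.append(name)
    return completed


ran = []
completed = run_startup([
    ("database_schemas", lambda: ran.append("schemas")),
    ("permission_keys", lambda: ran.append("keys")),
    ("monitor_wiring", lambda: ran.append("monitor")),
])
```

Raising instead of continuing means later steps never run against a half-initialized system, for example transports starting before permission keys are registered.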
That is the spine. It is roughly 30 modules covering the foundational concerns above. The full day-one + day-two list (the spine plus workflow, function nodes, OpenAPI bridge, LLM gateway, documentation, external agent, and the *_web.ex views the three-transport rule requires for each core module) brings the total to roughly 62 – the larger number you’ll see in the scope estimate later. Both counts describe the same architecture; the difference is whether you’re counting the foundational concerns alone or the full core-module surface with all transports.
The features
Everything in the current lib/ that is not in the list above falls into one of three buckets.
Useful but optional integrations
These are real product capabilities, but they are not the gateway. They are tools the gateway happens to expose, the way it could expose any external service through the same MCP-tool pattern:
- Google Workspace: Drive, Docs, Sheets, Gmail, Vertex, Gemini, Settings.
- WhatsApp Web bridge (via the Node service).
- Box (when added).
- Browser automation.
- iTerm2 driver (Mac-only).
- Draw.io diagram generation.
- Page scraping.
- NotebookLM file watcher.
- OpenClaw (Claude in an OrbStack VM).
- A2A demo client/server.
Each is interesting. None is the gateway. In a clean rewrite they would live as separate Elixir libraries, depended on optionally, surfaced via the same Permissions.Foo + MCPServer.Tools.Foo pattern.
Optional content UI features
- Blog management (blog store, file watcher, RSS feed, web pages, demos).
- Contact form, public marketing pages, customer-facing site.
- Manuals indexed beyond the documentation corpus.
These are content-management features that happen to be built into the current codebase but are not the gateway. Server-rendered *_web.ex views for each protected core module (workflows, alarms, permissions, audit logs, etc.) stay in the spine – the three-transport rule applies. The pieces above are the editorial/CMS surface, which is a separate concern that can live as its own project consuming the gateway’s REST API.
Demo and developer harnesses
- Weather chat CLIs (one per LLM provider).
- MCP probe scripts.
- Page scraper demos.
- A2A demo and example modules.
These belong in a separate examples/ repo, not in the production gateway. Keeping them in lib/ makes the codebase look bigger than it is.
What a clean rewrite should ship
If you started fresh today, the minimum viable gateway would have these features in this priority order:
Day-one (the spine, no shortcuts)
- Sqler – per-module SQLite, millisecond IDs, optimistic locking. The persistence primitive.
- User and SubToken – accounts, Argon2 password storage, sub-tokens with scope and per-key permissions, independent revocation.
- AccessControl + AccessControlled – capability-based permission registry with the standard mixin macro.
- Permissions.Bootstrap and Registry – deterministic permission key registration on boot.
- Auth pipeline – bearer token plug, ws ticket plug, auth context.
- REST router skeleton + web UI scaffold – the access-control API and AccessControlWeb, the access-log Plug, the shared helpers, the NavMenu, the login/registration/sub-token web flows. Every protected module ships with all three transports (MCP, REST, Web) from day one.
- MCP server – inbound JSON-RPC transport, prefix dispatch table, helpers, session manager.
- MCP client – outbound connections, connection registry, per-connection permission keys.
- OpenAPI bridge – parser, dispatcher, signed-assertion auth, persistence. Turns registered OpenAPI specs into MCP tools without per-API engineering.
- ServerLog + HttpAccessLog – audit and access logs to SQLite.
- Mcp.Application + Mcp.Startup – supervision tree, ordered boot steps, fail-loud startup.
Day-two (operational essentials)
- Alarm + Timer – scheduled events that survive restart.
- Workflow + WorkflowExecutor – multi-step orchestration with templates, step approvals, and persisted state. The primitive that turns single tool calls into business processes.
- FunctionNode + Provisioner + Registry – sandboxed code execution for deterministic workflow transitions. The “glue between MCP calls” substrate that workflows depend on.
- LlmGateway – multi-provider model routing with per-provider permission keys, normalized request/response, and per-call audit. The substrate workflow LLM steps call into for the limited cases that need natural-language parsing.
- Doc + Doc.Index + Doc.Watcher – the searchable documentation surface. FTS5 + sqlite_vec hybrid search over asset/docs/ and asset/manuals/. No built-in Q&A: the agent retrieves chunks via doc_search/doc_get and synthesizes answers with its own model.
- ExternalAgent + cross-platform backends – dispatch to Anthropic API, OpenAI API, or a local subprocess (claude -p, codex -p) as a workflow step type or as a permissioned MCP tool. Per-backend permission keys, rate-limited per token, full audit retention. Platform-specific backends ship as separate optional libraries.
- Tidewave permission gating – mount Tidewave on the main HTTPS port behind bearer auth, with tidewave.eval, tidewave.docs, tidewave.logs keys. Production introspection without a separate port or a separate auth model.
- Notifier + Pushover – generic notification dispatch with at least one backend.
- Monitor + MonitorServer – periodic health checks with persisted history.
- LoggerNotifier – error logs that can fan out to the notifier.
- SsePush – server-to-client notifications for MCP clients that need them.
Day-three (operator UX)
- Settings management for backend secrets (Pushover keys, OAuth credentials, TLS paths).
- Compile-time observability (CompileLog + AutoCompile) if you want dev ergonomics.
The web UI itself is not on this list because it is not a day-three concern – every protected module above already ships with its own web view as part of the three-transport rule. By the time day-two is done, the operator already has dashboards for permissions, audit logs, health, sub-tokens, workflows, alarms, MCP connections, and API bridge registrations.
Everything else is a feature, not the gateway. Box, Google, WhatsApp, platform-specific terminal drivers, page scrapers, NotebookLM watchers, A2A demos, content/CMS surfaces – those live as plugins.
Design principles to carry forward
Strip the bloat but keep these. They are what makes the gateway maintainable.
Functional core, imperative shell
Business logic lives in pure functional modules. Process management (GenServer state, supervision, lifecycle) lives in dedicated Server modules. A module called TaskManager has its functions; TaskManagerServer has its init/1, handle_call/3, etc. The two are not the same module.
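A minimal sketch of that split, using the TaskManager and TaskManagerServer names from the paragraph above. The task map, the add/3 function, and its return values are assumptions made for illustration; only the separation of roles is the point.

```elixir
defmodule TaskManager do
  @moduledoc "Functional core (hypothetical API): pure functions, no processes."

  def new, do: %{}

  # Multi-head: the duplicate case is its own clause.
  def add(tasks, id, _title) when is_map_key(tasks, id), do: {:error, :already_exists}
  def add(tasks, id, title), do: {:ok, Map.put(tasks, id, title)}
end

defmodule TaskManagerServer do
  @moduledoc "Imperative shell: owns the state, delegates every decision to TaskManager."
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  def add(id, title), do: GenServer.call(__MODULE__, {:add, id, title})

  @impl true
  def init(_opts), do: {:ok, TaskManager.new()}

  @impl true
  def handle_call({:add, id, title}, _from, tasks) do
    case TaskManager.add(tasks, id, title) do
      {:ok, new_tasks} -> {:reply, :ok, new_tasks}
      {:error, _reason} = error -> {:reply, error, tasks}
    end
  end
end
```

Because TaskManager never touches a process, its rules can be tested without starting anything; the server only needs tests for lifecycle and message handling.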
Permissions in the logic layer, not the transport
Web, REST, and MCP all pass user permissions down. The business module checks permissions and returns {:not_allowed, reason}. Each transport translates that into its own format (403 JSON, redirect, MCP error response). Never check permissions in a Plug. Never check permissions in a tool handler.
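To make the layering concrete, here is a small sketch in which the business module decides and two transports only translate. The PermSketch modules, the notes.create key, and both response shapes are hypothetical; in particular the MCP error map is illustrative and not the real JSON-RPC envelope.

```elixir
defmodule PermSketch.Notes do
  @moduledoc "Hypothetical business module: the permission check lives here."

  # `perms` is the caller's permission set, passed down by the transport.
  def create(perms, text) do
    if MapSet.member?(perms, "notes.create") do
      {:ok, %{text: text}}
    else
      {:not_allowed, "notes.create"}
    end
  end
end

defmodule PermSketch.Rest do
  @moduledoc "Thin REST adapter (hypothetical): translates results, checks nothing."
  def respond({:ok, value}), do: {200, value}
  def respond({:not_allowed, key}), do: {403, %{error: "not_allowed", key: key}}
end

defmodule PermSketch.Mcp do
  @moduledoc "Thin MCP adapter (hypothetical): same result, different wire shape."
  def respond({:ok, value}), do: %{result: value}
  def respond({:not_allowed, key}), do: %{error: %{message: "not allowed: " <> key}}
end
```

Adding a third transport in this shape means writing one more respond/1 pair; the permission rule itself is never duplicated.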
Three transports per core protected module
Every core module in lib/permissions/ (the spine, not plugins) ships with three transport adapters: an MCP tool handler in lib/mcp_server/tools/, a REST router in lib/router/, and a web view in lib/<module>_web.ex. All three are thin and all three call the same Permissions.* wrapper. A core module that has only one or two transports is incomplete; ship the third before declaring the module done. The discipline ensures that whatever operation an agent can do via MCP, an operator can also do in the browser and a script can do via REST. Plugin modules follow a weaker version (MCP mandatory, REST/Web recommended but optional) – the trade-off documented in the three-transport-rule section above.
Modules own their data
Each protected module owns its Sqler instance, its @permission key, its process lifecycle. No shared schema across modules. No cross-module SQL joins. If two modules need to share data, one of them queries the other’s API.
Tagged tuples at API boundaries
Public functions return {:ok, value} or {:error, reason}. Internal code can raise. Rescue at the boundary, convert to tagged tuples. This is the “always parsable by an LLM” rule: agentic modules in particular should never return surprise shapes.
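A short sketch of rescuing at the boundary. BoundarySketch and parse_port/1 are hypothetical names chosen for the example; the internal helper is free to raise, and the public function always returns a tagged tuple.

```elixir
defmodule BoundarySketch do
  @moduledoc "Hypothetical public boundary around internal code that may raise."

  # Public API: always {:ok, value} or {:error, reason}.
  def parse_port(text) do
    {:ok, do_parse!(text)}
  rescue
    e in ArgumentError -> {:error, Exception.message(e)}
  end

  # Internal helper: raises on bad input instead of returning tuples.
  defp do_parse!(text) do
    port = String.to_integer(text)
    if port in 1..65_535, do: port, else: raise(ArgumentError, "port out of range: #{port}")
  end
end
```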
Multi-head functions and with over if/else
For any non-trivial branching, write multi-head functions or a with chain. if/else is reserved for actual boolean gates inside a single function.
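As an illustration, the sketch below uses one clause per input shape and a `with` chain for the happy path. The st_ prefix and the :full / :partial scopes come from the token model described earlier; BranchSketch, the function names, and the notes.list key are assumptions.

```elixir
defmodule BranchSketch do
  @moduledoc "Hypothetical authorization path using multi-head functions and `with`."

  # One clause per token shape instead of nested if/else.
  def token_kind("st_" <> _rest), do: {:ok, :sub_token}
  def token_kind(_other), do: {:error, :unknown_token}

  # One clause per scope; the :partial allow-list here is made up.
  def check_scope(:full, _key), do: :ok
  def check_scope(:partial, key) when key in ["notes.list"], do: :ok
  def check_scope(:partial, _key), do: {:error, :scope_too_narrow}

  # `with` chains the happy path; the first non-matching result is returned as-is.
  def authorize(token, scope, key) do
    with {:ok, :sub_token} <- token_kind(token),
         :ok <- check_scope(scope, key) do
      {:ok, key}
    end
  end
end
```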
Explicit OTP
@impl true on every GenServer callback. Every process in a supervision tree. No spawned processes that are not children of something.
Singleton GenServers accessed by registered name
Long-lived processes (Alarm, Pushover, Timer, Sqler) are registered by module name. Callers do GenServer.cast(Alarm, ...), not GenServer.cast(pid, ...). Internal helper processes (a per-parent Timer or Sqler) are started inside the parent’s init/1 and owned by the parent.
Soft deletes for audit trails
cancelled_at, started_at, completed_at columns. No DELETE FROM. The history is the audit trail.
Verify and refuse loudly
Boot fails with a clear message if SECRET_KEY_BASE, COOKIE_SALT, ACCESS_CONTROL_KEY are missing. Half-configured production is worse than no production. The same rule applies to enabled-but-unkeyed subsystems: if LlmGateway is enabled in config but no provider has a working API key, boot fails. If Pushover is enabled with no app token, boot fails. If OpenApiBridge has registered specs but no signing key for its signed assertions, boot fails. A gateway that boots but cannot complete the operations its config promises is the silent-failure mode that produces 2am support calls.
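The required-variable part of that rule might look like the sketch below. The three variable names are the ones listed above; ConfigCheckSketch, its function names, and the decision to treat an empty string as missing are assumptions.

```elixir
defmodule ConfigCheckSketch do
  @moduledoc "Hypothetical boot check for required environment variables."
  @required ~w(SECRET_KEY_BASE COOKIE_SALT ACCESS_CONTROL_KEY)

  # Takes the environment as a map so the check is testable;
  # a boot step could pass System.get_env/0.
  def verify(env) do
    case Enum.filter(@required, &(Map.get(env, &1, "") == "")) do
      [] -> :ok
      missing -> {:error, {:missing_env, missing}}
    end
  end

  # Raising variant for use at startup: refuse loudly, name every gap at once.
  def verify!(env) do
    case verify(env) do
      :ok -> :ok
      {:error, {:missing_env, missing}} ->
        raise "refusing to boot, missing: " <> Enum.join(missing, ", ")
    end
  end
end
```

Reporting all missing variables in one message, rather than failing on the first, saves the operator a restart per typo.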
Library choices to carry forward
These are the dependencies that earned their place. Use them in the rewrite without re-evaluating:
- Hermes (via :hermes_mcp) – the MCP server protocol implementation. The dispatch table, JSON-RPC envelope handling, and SSE transport are not worth rewriting.
- Bandit – HTTP/HTTPS server. Light, well-supported, current with HTTP/2.
- Exqlite + Sqlite_vec (where vector search is needed) – SQLite driver. The choice of SQLite over Postgres is not negotiable; it is what makes single-customer instances tractable.
- Argon2_elixir – password hashing.
- Plug.Crypto – session token encryption.
- Req – HTTP client for outbound calls. Replaces HTTPoison/Tesla.
- Jason – JSON encoding/decoding.
- gproc – registry for named processes when the default Registry isn’t sufficient.
- Tzdata – timezone data. The codebase uses millisecond IDs, but human-readable timestamps need real TZ data.
Optional but proven:
- Floki – HTML parsing when you need it (page scraping, content extraction).
- Earmark – Markdown parsing if you keep any blog/doc surface.
- Eqrcode – QR code rendering for the WhatsApp QR re-link flow if you keep WhatsApp.
- File_system – inotify/fsevents wrapper for file watchers if you keep anything that auto-imports from disk.
Avoid in the rewrite:
- Ecto – not because Ecto is bad, but because the project’s choice to live without it has paid off. Sqler is 500 lines of code that does what the gateway needs. Ecto’s schema-and-changeset model encourages mixing concerns that the current design keeps separate.
- Phoenix – not needed. The operator UI stays in-tree as server-rendered *_web.ex views built on Plug + Bandit + EEx templates. Phoenix’s LiveView is genuinely useful for richer interactive UIs, but the rewrite’s operator console is intentionally simple: form posts, full page renders, no client-side state machine. If a customer later wants a richer SPA, they build it as a separate front-end project consuming the REST API; the in-tree operator console stays regardless.
Operational and lifecycle concerns
Architecture and design principles get the most attention because they shape the codebase. The concerns below shape the deployment, and a rewrite that ignores them produces a clean architecture that fails its first production incident. Each is part of the spine.
Authentication surface boundaries
Not every endpoint requires a bearer token. The boundaries:
- Unauthenticated. GET /healthz and GET /healthz?deep=1 (load-balancer probes), GET / (a static landing page), GET /login and POST /login (the form and its submit), GET /register and POST /register (if registration is enabled at all on the deployment), the favicon and any static assets. Nothing else.
- Session-authenticated. Every operator web view (*_web.ex). Backed by a server-side session row keyed by an encrypted cookie. The session row links to a user id and a permission set derived from the user’s roles. Logout invalidates the row.
- Bearer-authenticated. Every REST and MCP endpoint. Bearer tokens are sub-tokens (st_... prefix), validated via SubToken.verify/1, mapping to a user id and a scoped permission set. The same token model serves REST and MCP – one token, two protocols, identical authorization. The bearer plug runs before any handler.
- Per-tool permission gate. After authentication, the handler calls into the module’s Permissions.* wrapper, which checks the specific permission key for that operation against the caller’s permission map. A token with subtoken.list but not subtoken.create can list but not create – and the token never sees a 401 (it’s authenticated), only a :not_allowed from the wrapper that translates into the transport’s deny shape (403, MCP error, etc.).
This separation matters: a leaked session cookie has different blast radius than a leaked bearer token; the gateway audits and rate-limits both, but the operator should be able to look at any access-log row and know which auth path the request took.
Deployment topology: in-process, beside-the-process, remote
Three concentric rings of execution:
- In-process (inside the BEAM). Everything in the supervision tree: the HTTP listener, the MCP server, all Permissions.* modules, all Sqler instances, the alarm executor, the workflow executor, the in-process WASM function-node backend, the SsePush registry, the LLM gateway client (the dispatch logic; the actual model lives at the provider). This is the gateway’s own footprint.
- Beside the process (same host, different process). Local Docker daemon for function-node containers, optional tidewave invocation targets, local subprocesses spawned by ExternalAgent.LocalSubprocess (claude -p ..., codex -p ...). The gateway’s in-tree supervisors own the lifecycle of these processes: they start them, monitor them, kill them when the parent supervisor goes down. Filesystem and IPC, not network.
- Remote (network-accessible). Anthropic API, OpenAI API, Google Vertex (called via the LLM gateway). Fly.io machines for the optional Fly backend. External OpenAPI-described services routed through the bridge. Customer MCP servers consumed via MCPClient. The gateway is a client to all of these; it owns the connection state but not the lifecycle of the remote system.
The deployment story follows the rings: a minimal instance runs the in-process ring on a single host with a local SQLite directory. Adding a “beside” ring requires Docker. Adding a “remote” ring requires network egress and the relevant credentials in Secrets. Each ring is opt-in per deployment.
Workflow template migration and compatibility
Workflow templates embed concrete references to tool names, tool versions, function-node ids, function-node versions, LLM model aliases, and permission keys. A gateway upgrade that changes any of those is a template-breaking change unless the rewrite handles it:
- Templates are versioned the same way tools are. Each template row carries the gateway version at which it was authored. A template authored against 1.4.0 keeps the tool resolution semantics of 1.4.0 even after the gateway is on 1.5.0. The operator runs workflow_template_upgrade <id> to re-resolve references against the current version; the upgraded template is a new row, the old one remains for in-flight runs.
- Function-node references are by id + version. A template that pins node_id: 42, version: 3 will keep using version 3 until the operator deliberately bumps the reference. Deleting a function-node version that any template still pins is refused (:in_use) with a list of the pinning templates so the operator can decide.
- Model aliases are resolved at run time, not at template-write time. A template that says model: sonnet always resolves to whatever the current sonnet alias points at. This is the intended behavior so customers absorb provider deprecations automatically – but the audit log captures the resolved model id at run time so post-hoc analysis can tell exactly which model executed each step.
- Permission key renames are forbidden. Once a permission key is registered (e.g., workflow.create), its identity is stable across the gateway’s lifetime. If a feature needs to evolve, it adds a new key alongside the old one; the old key is retired only later, after a deprecation window in which both keys are accepted.
Observability for executor pressure and queue saturation
Rate limits and SQLite backpressure protect against blast radius, but the operator also needs to see “the system is healthy but falling behind.” The rewrite ships these signals as first-class metrics, exposed on MonitorWeb and emitted as telemetry events that can fan out to Prometheus/Grafana through a thin adapter:
- Workflow executor queue depth. Number of workflow runs in pending or step_running state, broken down by step type (MCP, function_node, llm, external_agent, approval). Sudden growth in any one bucket is the canary – usually it means an external dependency is degraded.
- Workflow run latency, p50/p95/p99. End-to-end time from instantiation to terminal state. Step-type breakdown of where the time is going.
- Alarm execution lag. For each alarm row, the difference between set_at and the actual fire time. Steady-state should be sub-second; growth means the alarm executor is overloaded or blocked.
- Function-node invocation queue depth and per-backend latency. WASM, Docker, Fly.io tracked separately. Docker-backed nodes are the usual bottleneck (container cold-start cost).
- MCP session count and per-session activity. A connected MCP client that goes quiet is normal; many simultaneously-active sessions can saturate the SSE push side. Both numbers are graphed.
- Sqler write queue depth, per database. Surfaces SQLite contention before it becomes user-visible timeouts. Threshold alerts on any database whose write queue grows past a configurable depth.
- LLM and external-agent per-token spend. Real-time view of monthly budget burn per token; alerts at 75% and 95%. The operator notices runaway agents before they finish overshooting.
These are dashboard items, not afterthoughts. The web UI’s monitor page shows them on a single screen; the telemetry feed makes them ingestible by whatever the customer’s ops platform is.
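For the p50/p95/p99 latency signal, a minimal sketch of the computation over a batch of run durations. It uses the nearest-rank definition of a percentile, which is one of several in common use and is an assumption here, as are the LatencySketch name and the millisecond unit.

```elixir
defmodule LatencySketch do
  @moduledoc "Hypothetical nearest-rank percentiles over run durations in ms."

  # Nearest rank: the smallest sample such that at least p% of samples
  # are less than or equal to it. Integer arithmetic avoids float rounding.
  def percentile(samples, p)
      when is_list(samples) and samples != [] and is_integer(p) and p in 1..100 do
    sorted = Enum.sort(samples)
    rank = div(p * length(sorted) + 99, 100)
    Enum.at(sorted, rank - 1)
  end

  def summary(samples) do
    %{
      p50: percentile(samples, 50),
      p95: percentile(samples, 95),
      p99: percentile(samples, 99)
    }
  end
end
```

Sorting the full sample list is fine for a dashboard refresh over a bounded window; a long-running collector would more likely keep a histogram.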
Secret management and rotation
The gateway holds and uses a wide range of secrets: bearer tokens (sub-tokens, including the operator’s tidewave.eval token), signed-assertion private keys for the OpenAPI bridge, OAuth refresh tokens for Google/Box/Microsoft integrations, per-provider LLM API keys, TLS certificates and private keys, Pushover application tokens, and customer-specific webhook secrets. Treating them as EnvironmentFile= lines in a systemd unit works for the first three secrets a deployment acquires; it does not scale to dozens.
The rewrite ships with one explicit secret store:
- Secrets module – a registered module backed by its own Sqler instance, all secrets stored AES-256 encrypted at rest with a master key derived from ACCESS_CONTROL_KEY and a per-secret salt. Public surface: Secrets.get/1, Secrets.put/2, Secrets.rotate/2, Secrets.list/0 (returns metadata only, never values).
- No secret value ever appears in a Logger line, an audit log payload, or an MCP tool response. Audit logs capture which secret was accessed by id, not its value. A grep across the gateway’s data directory for the string sk- or Bearer produces zero matches.
- Rotation is a first-class operation. Every secret has a created_at, rotated_at, and optional expires_at. The operator UI shows secrets approaching expiry. Rotation generates a new value, marks the old one as rotating, and the system supports a brief overlap window where both values are accepted before the old one is fully retired. For OAuth refresh tokens, rotation is automatic on a schedule.
- TLS material is the exception: kept on disk (paths in TLS_CERTFILE, TLS_KEYFILE) because the BEAM and Bandit need filesystem-level access for cert reloading. A cert-reload MCP tool re-reads the files and rolls the HTTPS listener with zero downtime.
The Secrets module is day-one. Not day-three. Without it, the gateway’s “good security posture” claim is false.
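The at-rest encryption could be sketched as follows, using OTP's :crypto module. Several choices here are assumptions the text does not make: GCM as the AES-256 mode, HMAC-SHA256 as a stand-in for a proper key-derivation function, the fixed associated-data string, and the SecretsSketch name and row shape. Treat it as an illustration of "master key plus per-secret salt", not as a vetted design.

```elixir
defmodule SecretsSketch do
  @moduledoc "Hypothetical sketch: per-secret key derivation plus AES-256-GCM."
  @aad "secrets-sketch-v1"

  # Derives a 32-byte per-secret key from the master key and the salt.
  # HMAC-SHA256 is a simplification; a real system would likely use a dedicated KDF.
  defp derive_key(master, salt), do: :crypto.mac(:hmac, :sha256, master, salt)

  def encrypt(master, plaintext) do
    salt = :crypto.strong_rand_bytes(16)
    iv = :crypto.strong_rand_bytes(12)
    key = derive_key(master, salt)

    {ciphertext, tag} =
      :crypto.crypto_one_time_aead(:aes_256_gcm, key, iv, plaintext, @aad, true)

    %{salt: salt, iv: iv, tag: tag, ciphertext: ciphertext}
  end

  # Returns {:error, _} when the key is wrong or the row was modified.
  def decrypt(master, %{salt: salt, iv: iv, tag: tag, ciphertext: ciphertext}) do
    key = derive_key(master, salt)

    case :crypto.crypto_one_time_aead(:aes_256_gcm, key, iv, ciphertext, @aad, tag, false) do
      :error -> {:error, :bad_key_or_tampered}
      plaintext -> {:ok, plaintext}
    end
  end
end
```

An authenticated mode gives a clean failure signal when the master key is wrong, which is the property a boot-time check against a known row would rely on.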
Master key recovery and emergency rotation. The master key derived from ACCESS_CONTROL_KEY decrypts the entire Secrets store and the row-level encrypted audit tables. Losing it without a recovery path bricks the deployment. The rewrite ships three explicit mitigations:
- Shamir-split escrow. At deployment, ACCESS_CONTROL_KEY is split via Shamir secret sharing into N=5 shares with threshold k=3. Shares are distributed to the operator team and a sealed third-party escrow. Any three shares reconstruct the key. Single-share loss is recoverable; single-share compromise is not catastrophic.
- Re-encryption rotation. Secrets.rotate_master/1 accepts a new master key, decrypts each row with the old key, re-encrypts with the new, swaps in a single transaction. Run as a maintenance window with read-only mode enabled. The same routine handles compromise scenarios.
- No silent re-encryption. Boot fails if the configured ACCESS_CONTROL_KEY does not decrypt the first sentinel row in Secrets. No fallback, no “try the old key.” The operator gets a clear error and the runbook entry pointing at Secrets.rotate_master/1 or the Shamir reconstruction path.
This is the most important runbook entry in the deployment. It is documented, tested, and rehearsed before any customer goes live.
Failure semantics for workflows and alarms
Workflows orchestrate side effects. Side effects fail. The rewrite specifies the failure model up front so customers can reason about it rather than discover it in production.
- Per-step retry policy. Each workflow step declares retries: 0..N and backoff: :linear | :exponential | :fixed. The default is retries: 0 – explicit opt-in. Retries are persisted in the workflow row, so a gateway restart mid-retry does not lose the attempt count.
- Idempotency. Every workflow run has an idempotency_key derived from the template id + caller-supplied dedup hint. Re-invoking the same workflow with the same key returns the existing run, not a new one. Steps that call external systems pass the run-scoped idempotency key downstream when the protocol supports it (Box, Stripe, etc.).
- Side-effect boundaries. A step is marked as effectful: true if it mutates state outside the gateway. Effectful steps capture the response (or response hash) in the audit log before advancing. If the gateway crashes after the side effect but before the advance, the recovery code on next boot sees the captured response and treats the step as completed.
- Dead-letter queue. A step that exhausts retries lands in workflow_dead_letters, gated by workflow.dead_letter permission key. The operator can inspect the failure, edit inputs, and either retry or cancel. The web UI surfaces dead-letter counts on the dashboard.
- Manual recovery. The operator can mark a step as completed (with a reason), skip it, or restart from any earlier step. Every recovery action writes an audit row capturing who, when, and why.
- Alarm failure model. Alarms have a simpler model: if the registered callback raises, the alarm row is marked failed with the error preserved, and Notifier is invoked (this is one of the few error fan-out cases that ships by default). Failed alarms do not auto-retry; the operator chooses to retry, reschedule, or cancel.
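The per-step retry policy above can be read as a small pure function over the persisted retry count. The sketch below is one such reading: the retries and backoff fields and the three policy atoms come from the text, while the one-second base delay, the doubling factor, the RetrySketch name, and the return shapes are assumptions.

```elixir
defmodule RetrySketch do
  @moduledoc "Hypothetical retry decision for a failed workflow step."

  # Assumed base delay; the text does not specify one.
  @base_ms 1_000

  # Delay before retry number `attempt` (1-based).
  def delay_ms(:fixed, _attempt), do: @base_ms
  def delay_ms(:linear, attempt), do: @base_ms * attempt
  def delay_ms(:exponential, attempt), do: @base_ms * Integer.pow(2, attempt - 1)

  # `retries_used` is the count persisted in the workflow row, so the
  # decision is the same before and after a gateway restart.
  def next_action(%{retries: max, backoff: policy}, retries_used) when retries_used < max do
    {:retry_after, delay_ms(policy, retries_used + 1)}
  end

  def next_action(_step, _retries_used), do: :dead_letter
end
```

With the default of retries: 0 the first failure goes straight to the dead-letter path, which matches the explicit opt-in described above.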
Audit retention and privacy
“Full audit retention” sounds responsible until it becomes the GDPR liability the gateway introduced. The rewrite ships a retention policy that is configurable per-customer and per-table:
- Default retention windows. HTTP access log: 90 days. Server log: 30 days. Workflow run logs (step inputs/outputs): 365 days. LLM gateway invocations (full payload): 90 days. External agent dispatches (full prompt + response): 90 days. Tidewave invocations (eval payload): 30 days. All windows configurable per deployment.
- Payload redaction. Each protected module declares which fields of its audit rows are PII-class. A nightly job redacts those fields on records older than the redaction window (which can be shorter than the retention window). The structural row stays for auditability; the sensitive content is replaced with [redacted:<sha256>].
- Encryption at rest for sensitive tables. LLM payloads, external-agent prompts, and Tidewave evaluations are stored AES-256-encrypted at the row level, with the key managed by the same Secrets module. A leaked SQLite file does not expose payload contents.
- Audit-off toggles. A workflow template can declare audit_payload: false for steps that handle regulated content; the audit row still captures the step happened, the caller, and the timestamp, but the payload is not retained. This is opt-in per step, not per workflow, so the operator must consciously elect to lose the audit detail.
- Data export and deletion. The gateway ships a data_export admin tool that produces a per-user dump (everything the user owns, everything that mentions them by user id) and a data_delete admin tool that satisfies a right-to-erasure request without breaking referential integrity (rows are anonymized, not hard-deleted; the audit trail of the anonymization itself is preserved).
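The payload-redaction step can be illustrated with a small pure function that swaps declared PII fields for the [redacted:<sha256>] marker. The marker format is from the text; RedactSketch, the map-shaped row, and hashing the raw value without a salt are assumptions.

```elixir
defmodule RedactSketch do
  @moduledoc "Hypothetical redaction of PII-class fields in an audit row."

  # Replaces each listed field that holds a string with "[redacted:<sha256>]".
  # Fields that are absent or not strings are left untouched.
  def redact(row, pii_fields) do
    Enum.reduce(pii_fields, row, fn field, acc ->
      case acc do
        %{^field => value} when is_binary(value) -> Map.put(acc, field, marker(value))
        _ -> acc
      end
    end)
  end

  defp marker(value) do
    digest = :crypto.hash(:sha256, value) |> Base.encode16(case: :lower)
    "[redacted:" <> digest <> "]"
  end
end
```

An unsalted hash lets equal payloads be correlated after redaction; whether that is desirable for low-entropy values is a decision the text leaves open.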
Backup and disaster recovery
SQLite-per-module is a load-bearing choice, so backup strategy is part of the architecture, not an operator’s afterthought.
- Continuous SQLite backups via litestream. Each Sqler instance ships with a litestream sidecar configuration pointing at the customer’s chosen S3-compatible store. Replication lag is measured in seconds. Point-in-time restore is supported within the retention window of the upstream bucket.
- Snapshot manifests. A daily job writes a manifest of all SQLite files and their checksums to the backup store. Restore tooling validates the manifest before importing, so a partial or corrupted backup is caught before it overwrites a healthy instance.
- Restore drill at deployment. Standing up a new instance (developer’s own host, future customer host, disaster-recovery host) requires a successful test restore from backup before any traffic is switched to it. Documented in the deployment playbook; not negotiable.
- What the litestream backup does not cover. Function-node sandbox state (in-flight WASM instances, running Docker containers, or Fly.io machines depending on backend), in-process alarm timers waiting to fire, and the in-memory MCP session manager. Those are reconstructed from the persisted database rows on boot, which is why every alarm, workflow step, and MCP session must have a DB-row source of truth from the start.
- Privacy interaction with redaction. Litestream replicates row changes as they happen. A row that contained PII at write time and was redacted later still exists unredacted in older backup segments. The backup retention policy on the upstream bucket therefore caps the effective redaction window – a 90-day audit retention with a 365-day backup retention means the upstream still has the unredacted row for 365 days. The rewrite ships a quarterly job that compacts backups, drops old segments past the redaction window, and re-baselines. This is part of the GDPR-compliance posture, not an operations afterthought.
Resource control and blast radius
Permission keys answer “who can do what?” Resource limits answer “how much, how fast, how concurrently?” The gateway ships both.
- Per-token rate limits. Each sub-token has configurable per-second and per-day caps. Defaults are conservative: 10 req/sec, 50k req/day. Exceeding the cap returns HTTP 429 with Retry-After. Cap settings are stored in the token row; rate-limit state lives in :ets.
- Per-token cost budgets. Sub-tokens that call LlmGateway or ExternalAgent carry a monthly token-spend budget. Exceeding the budget refuses further LLM/external calls until the operator raises the cap. The budget is per-token, not per-user, so a runaway agent does not consume the operator’s own budget.
- Concurrent workflow caps. Per-user concurrency limit (default: 5 running workflows) prevents one user from saturating the executor. Per-deployment cap (default: 50) prevents one customer from monopolizing the BEAM.
- Function-node execution limits. Wall-clock timeout per invocation (default: 30 seconds, configurable per node), memory ceiling (default: 256 MiB, configurable), and a per-function-node concurrent-invocation cap. The in-process WASM backend enforces these at the runtime level (fuel-based instruction limits + memory ceiling); the local Docker backend enforces via --memory and --ulimit; the Fly.io backend enforces them at the machine level.
- SQLite write contention. Each Sqler instance uses WAL mode and a single writer process per database; concurrent writes serialize through the GenServer. Long-running writes are tagged with a slow-query log entry (threshold: 100ms). If a Sqler write queue grows past a configurable depth, the queue’s GenServer applies backpressure (refuses new writes with :busy) rather than allowing unbounded RAM growth.
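A minimal sketch of the per-second cap with state in :ets, as the first bullet describes. The fixed one-second window, the table name, the RateLimitSketch module, and the return shapes are assumptions; a real limiter would also need the per-day cap and a sweep of expired windows, both omitted here.

```elixir
defmodule RateLimitSketch do
  @moduledoc "Hypothetical fixed-window per-token rate limiter backed by :ets."
  @table :rate_limit_sketch

  def init do
    :ets.new(@table, [:named_table, :public, :set])
    :ok
  end

  # One counter per {token_id, second}. `now_s` is injectable so the
  # behaviour is deterministic under test.
  def check(token_id, limit_per_sec, now_s \\ System.system_time(:second)) do
    key = {token_id, now_s}
    # Atomically increments position 2, inserting {key, 0} first if absent.
    count = :ets.update_counter(@table, key, {2, 1}, {key, 0})

    if count <= limit_per_sec do
      :ok
    else
      # Third element is the suggested Retry-After value in seconds.
      {:error, :rate_limited, 1}
    end
  end
end
```

A transport would map the error tuple to HTTP 429 with a Retry-After header. Because update_counter is atomic, concurrent requests for the same token cannot both slip under the cap.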
Versioning and compatibility
The gateway is updated. Tools (especially synthesized ones) and plugins must keep working across updates, or they fail customers who never asked for a breaking change.
- Semantic versioning for the gateway itself. Major versions can break MCP tool schemas; minor versions can add tools and fields but never remove or rename; patch versions are pure fixes. The customer pin is on major.minor.
- Per-tool schema versioning. Each MCP tool declaration includes version: "1.0". Schema changes bump the version; the old version stays available for a deprecation window (default: two minor releases) before retirement. The MCP listing surface reports both latest and supported versions per tool.
- OpenAPI bridge re-parse on upgrade. A spec registered under gateway v1.4 keeps the tool schema it had at registration time. The operator runs api_re_parse <spec_id> to pick up the new gateway’s parser improvements; the re-parse produces a new tool version and the old version stays available for the deprecation window.
- Plugin compatibility contract. Optional plugin libraries declare gateway_compat: ">= 1.4, < 2.0" in their mix project. The gateway refuses to load plugins outside the compatible range and logs a clear error pointing the operator at the upgrade path. Plugins ship their own version, their own permission keys, and their own audit table; they cannot rewrite gateway internals.
- Migration scripts in releases. Every Sqler instance’s schema migrations live in the module that owns it. Booting a release N+1 against a release-N data directory runs the pending migrations transactionally, with a backup-first policy: the data directory is snapshotted before any migration runs, and rollback is supported if the migration fails.
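The plugin compatibility check could lean on Elixir's built-in Version module, as in the sketch below. CompatSketch and its return values are hypothetical. The requirement string in the usage is written in the form Version parses (joined with `and`, three-part versions); how the gateway_compat notation shown above would map onto that form is an assumption.

```elixir
defmodule CompatSketch do
  @moduledoc "Hypothetical plugin compatibility gate built on Version."

  # `requirement` must use Elixir's Version requirement syntax,
  # e.g. ">= 1.4.0 and < 2.0.0".
  def check(gateway_version, requirement) do
    case Version.parse_requirement(requirement) do
      {:ok, req} ->
        if Version.match?(gateway_version, req), do: :ok, else: {:error, :incompatible}

      :error ->
        {:error, :invalid_requirement}
    end
  end
end
```

Distinguishing an unparseable requirement from an incompatible one lets the load-time error point the operator at the right fix: the plugin's declaration in one case, the gateway or plugin version in the other.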
Approximate scope of the rewrite
The day-one and day-two lists above are roughly:
- 62 modules in lib/ (the spine + workflow + function nodes + OpenAPI bridge + LLM gateway + documentation + external agent + the *_web.ex views for every protected module)
  The 62-module count is roughly 50% of the current 122-file project. That ratio looks unaggressive at first glance, but the three-transport rule multiplies the per-protected-module file count by 3-4 (Permissions.Foo, MCPServer.Tools.Foo, Router.FooAPI, FooWeb). The actual count of distinct architectural concerns in the spine is closer to 20, not 62. If that count can be cut further by promoting more concerns to optional libraries (the LLM gateway or the OpenAPI bridge are the obvious candidates), the spine shrinks accordingly – but doing so would force every customer who needs LLM access or REST integration to install a separate library, and the three-transport rule would still require three to four files per protected module. The trade-off is real; 20 concerns × ~3 files each is the floor.
- 11 modules in lib/permissions/ (Bootstrap, Registry, Keys, plus core wrappers for admin, workflow, function_nodes, open_api_bridge, llm_gateway, doc, tidewave, external_agent)
- 10 modules in lib/mcp_server/tools/ (admin, utility, platform, workflows, function_nodes, open_api_bridge, llm, doc, tidewave, external_agent)
- 2 modules in lib/auth/
- 10 modules in lib/router/ (access control, workflow, function nodes, alarm, monitor, mcp connections, open_api_bridge, doc, tidewave, external_agent)
You can ship a useful gateway in around 6,000 lines of Elixir, with one SQLite database per subsystem, one supervision tree, and one HTTPS port. Everything beyond that is a feature, not the gateway.
Requirements and plan
Two sections to turn the architecture above into a concrete project: the requirements the rewrite must satisfy, and a phased plan to ship it.
Requirements
Functional
F1. Identity and tokens. Username + Argon2 password storage. Sub-tokens issued as database rows with scope (:full, :partial) and an optional per-key permission set. Independent revocation. Token strings prefixed (st_...).
F2. Capability-based permission registry. Deterministic permission keys hashed at boot. Per-user permission maps held in the access control registry. Keys never leave the server process. TTL support with lazy expiry. Attenuation support for scoped delegation.
F3. Three transports per core protected module. Every core entry in lib/permissions/ ships with an MCP handler, a REST router, and a web view. All three thin, all three call the same Permissions.* wrapper. Optional plugin modules ship MCP at minimum; REST and Web are recommended but optional.
F4. MCP transport. Inbound JSON-RPC over HTTP and SSE. Outbound client for consuming external MCP servers, with a connection registry, per-connection auth, and per-connection permission keys.
F5. OpenAPI bridge. Register a spec, get synthesized MCP tools and REST passthroughs without writing an adapter. Signed-assertion auth for the gateway-to-upstream call.
F6. LLM gateway. One canonical request shape. Provider adapters for at least Anthropic, OpenAI, and Google. Per-provider permission keys. Per-call audit (token counts, latency, cost, optional payload retention).
F7. Workflow orchestration. Multi-step state machine with templates, human-in-the-loop step approvals, and per-user isolation. Workflow rows are owned by their creator; cross-user access returns :not_found. The admin role can hold workflow.cross_user for support cases.
F8. Function nodes. Sandboxed code execution as workflow step type. Pluggable backend with priority order: in-process WASM (default, zero external deps), local Docker (universal, for richer language runtimes), and Fly.io machines as an optional vendor plugin. Versioned, permissioned, audited per-invocation. The deterministic-transformation substrate that workflows rely on between MCP calls.
F9. Alarm scheduling. One-shot and recurring scheduled events that survive restart. Backed by Alarm’s own Sqler instance; the database row is the source of truth.
F10. Notifications. Generic Notifier dispatcher with at least one backend (Pushover). LoggerNotifier bridge for error fan-out.
F11. Health monitoring. Periodic checks with persisted history. Exposed on all three transports per the rule: monitor_* MCP tools for agents, /api/monitor REST endpoints for scripts, MonitorWeb dashboard for operators.
F12. Audit logging. Server log (BEAM Logger -> SQLite), HTTP access log (one row per request), and per-module action logs in each module’s own Sqler instance. Each is queryable via its own protected module on all three transports (HttpAccessLog, ServerLog, etc.).
F13. Server-sent events. SsePush for outbound notifications to MCP clients that subscribe to changes.
F14. Documentation surface. Searchable (FTS5) and semantic (sqlite_vec) over asset/docs/ and asset/manuals/. Agent-readable: doc_search returns ranked chunks, doc_get fetches by path. The gateway does not synthesize answers; the calling agent does. Doc.Watcher re-indexes on filesystem changes.
F15. Runtime introspection. Tidewave mounted in production behind the same HTTPS port as the rest, gated by three permission keys (tidewave.eval, tidewave.docs, tidewave.logs), full payload audit retention.
F16. External agent invocation. ExternalAgent.dispatch/3 with cross-platform core backends (Anthropic API, OpenAI API, LocalSubprocess) and per-backend permission keys. Default-deny, per-token rate-limited, full audit retention. Platform-specific adapters live in separate optional libraries.
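To make F2 concrete, here is a minimal sketch of lazy expiry and attenuation over a plain map. The module name CapabilitySketch and every function in it are hypothetical and are not the project's AccessControl API; time is passed in explicitly so the behavior is easy to test.

```elixir
defmodule CapabilitySketch do
  # Hypothetical illustration of F2, not the project's AccessControl module.
  # A grant is %{keys: MapSet of permission keys, expires_at: ms | :never}.

  def grant(keys, ttl_ms, now_ms) do
    expires_at = if ttl_ms == :never, do: :never, else: now_ms + ttl_ms
    %{keys: MapSet.new(keys), expires_at: expires_at}
  end

  # Lazy expiry: nothing sweeps the registry; expiry is checked at lookup time.
  def allowed?(grant, key, now_ms) do
    not expired?(grant, now_ms) and MapSet.member?(grant.keys, key)
  end

  # Attenuation: a derived grant can only narrow the key set and
  # shorten the lifetime of its parent, never widen either.
  def attenuate(grant, subset, ttl_ms, now_ms) do
    keys = MapSet.intersection(grant.keys, MapSet.new(subset))
    %{keys: keys, expires_at: earliest(grant.expires_at, now_ms + ttl_ms)}
  end

  defp expired?(%{expires_at: :never}, _now), do: false
  defp expired?(%{expires_at: t}, now), do: now >= t

  defp earliest(:never, t), do: t
  defp earliest(a, b), do: min(a, b)
end
```

Asking for a key the parent never held, or for a longer TTL than the parent has left, silently yields the narrower result; that is the property scoped delegation depends on.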
Non-functional
N1. Single instance per customer. No multi-tenancy. Each customer’s deployment owns its data, its keys, its tokens.
N2. SQLite per module. No shared database. No Ecto. Sqler’s millisecond IDs and optimistic locking convention applied uniformly.
N3. Functional core, imperative shell. Business logic in pure modules. Process management in dedicated *Server modules. Tagged tuples at public boundaries.
N4. Permissions in the logic layer. Transports stay thin. A permission check never lives in a Plug or an MCP handler.
N5. Fail-loud boot. Boot aborts with a clear message if any of SECRET_KEY_BASE, COOKIE_SALT, or ACCESS_CONTROL_KEY is missing.
N6. Single supervision tree. The BEAM process itself is single-tree: no auxiliary daemons, no out-of-tree workers, no scripts that start their own GenServers. Everything that runs inside the gateway process starts via Mcp.Application.start/2. The gateway may orchestrate external execution environments – the in-process WASM runtime for most function nodes, Docker containers for richer function-node runtimes, Fly.io machines for the optional multi-machine backend, subprocesses for LocalSubprocess ExternalAgent calls, the Tidewave Plug for runtime introspection – but those are dispatched by in-tree supervisors that own the lifecycle (provisioning, health checks, cleanup), not by side-channel scripts. One supervisor owns each external runtime’s connection from the gateway’s side.
N7. Audit by default. Every action that mutates state writes a row. Soft deletes only (cancelled_at, started_at, completed_at); no DELETE FROM.
N8. Tests run without external services. CI does not need Pushover, Anthropic, OpenAI, Box, or Google. Provider adapters and tool integrations must be mockable at the boundary.
N9. Deployable as a Mix release. Build once, ship a tarball, run on any Linux with matching libc. A Dockerfile is built from the release, not from mix run.
N10. macOS-only modules clearly tagged. Anything that shells out to osascript or similar returns a clean error on Linux and does not break the test suite.
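N5 is the easiest of these to pin down in code. A minimal sketch, assuming a hypothetical BootCheckSketch helper called early in application start; the environment is injectable so the check can be tested without touching the real process environment.

```elixir
defmodule BootCheckSketch do
  # Hypothetical helper illustrating N5; the real boot path may differ.
  @required ~w(SECRET_KEY_BASE COOKIE_SALT ACCESS_CONTROL_KEY)

  # Returns :ok, or {:error, names} listing every missing or empty variable.
  def check(env \\ System.get_env()) do
    missing = Enum.filter(@required, fn name -> Map.get(env, name, "") == "" end)
    if missing == [], do: :ok, else: {:error, missing}
  end

  # Raises with a message naming all missing variables, so boot aborts loudly
  # instead of failing later with an obscure crypto error.
  def check!(env \\ System.get_env()) do
    case check(env) do
      :ok ->
        :ok

      {:error, missing} ->
        raise "refusing to boot, missing env vars: " <> Enum.join(missing, ", ")
    end
  end
end
```

Reporting every missing variable at once, rather than stopping at the first, saves an operator a restart cycle per variable.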
Plan
A phased plan. Each phase produces a runnable gateway that does strictly more than the previous one. Estimated effort assumes one engineer working steadily; double the calendar time for one engineer doing this part-time around other work.
Phase 0 – Foundation (week 1)
- Fresh mix new project. Pin Elixir, Erlang, and asdf versions.
- Lock dependencies: Bandit, Plug.Crypto, Jason, Req, Argon2_elixir, Exqlite, hermes_mcp, gproc, Tzdata.
- Single supervision tree, empty for now.
- Sqler module ported in. Smoke test: one module owns one SQLite file.
- HTTPS listener on Bandit serving two health endpoints: GET /healthz (shallow – “the process is up”) and GET /healthz?deep=1 (deep – exercises a write to a sentinel Sqler instance, reads it back, returns 200 only if the persistence path is intact). Load balancers point at the shallow check; on-call dashboards point at the deep one.
- CI pipeline that runs mix test, mix format --check-formatted, mix credo --strict, mix dialyzer. The bar for “phase 0 done.”
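The shallow/deep split in the health endpoints can be sketched as follows. This is an illustration under stated assumptions: a plain file stands in for the sentinel Sqler instance, the module name HealthSketch is hypothetical, and the 503 status for a failed deep check is an assumption (the plan only says 200 is returned when the persistence path is intact).

```elixir
defmodule HealthSketch do
  # Hypothetical sketch; a plain file stands in for the sentinel Sqler instance.

  # Shallow check: answering at all proves the process is up.
  def shallow, do: {200, "ok"}

  # Deep check: write a fresh value, read it back, and compare.
  def deep(sentinel_path) do
    token = Integer.to_string(System.unique_integer([:positive]))

    with :ok <- File.write(sentinel_path, token),
         {:ok, ^token} <- File.read(sentinel_path) do
      {200, "ok"}
    else
      # Any write error, read error, or mismatch means persistence is broken.
      _ -> {503, "persistence check failed"}
    end
  end
end
```

Writing a fresh value on every call matters: reading back a stale sentinel would pass even if the storage had gone read-only.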
Phase 1 – Identity (week 2)
- User module + Sqler-backed users_db. Create, find by username, find by email, Argon2 verification.
- SubToken module with scope, permission keys, revocation, prefix-based token recognition.
- Auth.AuthorizationPlug, Auth.AuthContext, bearer header parsing.
- Boot-time creation of an initial admin user from env vars (ADMIN_USERNAME, ADMIN_PASSWORD).
- Auth surface defined per the “Authentication surface boundaries” section: /healthz, /healthz?deep=1, the login/register forms and statics stay unauthenticated; operator web views are session-authenticated; REST and MCP endpoints are bearer-authenticated; every authenticated request is then permission-checked in the logic layer.
- Phase 1 done when curl /api/admin/users (no auth) returns 401, curl -H "Authorization: Bearer <token>" /api/admin/users returns 200 with a valid admin token, and curl /healthz returns 200 without any auth.
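Prefix-based token recognition and bearer header parsing fit in a few lines. The sketch below is an assumption about shape, not the project's Auth.AuthorizationPlug: BearerSketch and its return tags are hypothetical, and the 32-byte random body is an illustrative choice.

```elixir
defmodule BearerSketch do
  # Hypothetical helper; not the project's Auth.AuthorizationPlug.

  # Mint a prefixed token string: "st_" plus 32 random bytes, URL-safe base64.
  def mint do
    "st_" <> Base.url_encode64(:crypto.strong_rand_bytes(32), padding: false)
  end

  # Classify the raw Authorization header value. Only sub-tokens (st_...)
  # would go on to a database lookup; everything else is rejected early.
  def parse("Bearer st_" <> rest) when rest != "", do: {:sub_token, "st_" <> rest}
  def parse("Bearer " <> other) when other != "", do: {:other_bearer, other}
  def parse(_), do: :unauthenticated
end
```

The prefix lets the auth path discard obviously foreign credentials before any database work, which is what keeps the unauthenticated 401 path cheap.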
Phase 2 – Permissions (week 3)
- AccessControl registry with deterministic key hashing, TTL, attenuation.
- AccessControlled macro for protected modules.
- Permissions.Registry, Permissions.Keys, Permissions.Bootstrap.
- Permissions.Admin as the first protected module (grant/revoke/list permissions).
- Phase 2 done when the registry boots cleanly, the admin role auto-syncs its per-tool keys, and a non-admin user gets :not_allowed when trying to grant permissions.
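Deterministic key hashing is what lets permission keys survive a restart without being stored. One way to get that property, sketched here as an assumption rather than a description of the current AccessControl code, is an HMAC over the tool name keyed by a server-side secret such as ACCESS_CONTROL_KEY. KeyHashSketch is a hypothetical name.

```elixir
defmodule KeyHashSketch do
  # Hypothetical derivation: HMAC-SHA256 over the tool name, keyed by a
  # server secret. Same secret + same name always yields the same key.
  def derive(secret, tool_name) when is_binary(secret) and is_binary(tool_name) do
    :crypto.mac(:hmac, :sha256, secret, tool_name)
    |> Base.encode16(case: :lower)
  end
end
```

Because the derivation is keyed, the key for a tool cannot be computed from its name alone, which keeps the "keys never leave the server process" invariant meaningful.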
Phase 3 – First three-transport module (week 4)
- lib/mcp_server/tools/access_control.ex – MCP handler for permission ops.
- lib/router/access_control_api.ex – REST router.
- lib/access_control_web.ex – HTML view.
- NavMenu + login/registration/sub-token web flows.
- All three transports exercised by integration tests.
- Phase 3 done when an operator can log into the web UI, generate a sub-token, and use it to call the same permission API over MCP and REST.
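The three-transport shape is easier to see in miniature. In the sketch below every name is hypothetical (ThreeTransportSketch, the access_control.list key, the response shapes); the point is only that the permission check lives in one wrapper and each transport merely translates the tagged tuple it gets back.

```elixir
defmodule ThreeTransportSketch do
  # Hypothetical names throughout; illustrates "thin transports, one wrapper".

  defmodule Wrapper do
    # The only place the permission check lives.
    def list_permissions(%{keys: keys}, user) do
      if "access_control.list" in keys do
        {:ok, ["perm:" <> user]}
      else
        {:error, :not_allowed}
      end
    end
  end

  # MCP handler: tagged tuple -> JSON-RPC-style result or error map.
  def mcp(ctx, %{"user" => user}) do
    case Wrapper.list_permissions(ctx, user) do
      {:ok, perms} -> %{"result" => perms}
      {:error, reason} -> %{"error" => Atom.to_string(reason)}
    end
  end

  # REST router: tagged tuple -> status code and body.
  def rest(ctx, user) do
    case Wrapper.list_permissions(ctx, user) do
      {:ok, perms} -> {200, perms}
      {:error, :not_allowed} -> {403, []}
    end
  end

  # Web view: tagged tuple -> HTML fragment.
  def web(ctx, user) do
    case Wrapper.list_permissions(ctx, user) do
      {:ok, perms} ->
        "<ul>" <> Enum.map_join(perms, "", fn p -> "<li>" <> p <> "</li>" end) <> "</ul>"

      {:error, :not_allowed} ->
        "<p>Not allowed</p>"
    end
  end
end
```

None of the three transport functions inspects the permission keys, so adding a fourth transport cannot introduce a new way to bypass the check.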
Phase 4 – MCP transport (inbound + outbound) (week 5)
- Full MCP server with hermes_mcp: prefix dispatch table, session manager, SSE transport.
- MCP client + MCPConnection + MCPConnectionRegistry. One real outbound connection (a public test server) to validate the round-trip.
- McpConnectionsWeb + REST + MCP for managing outbound connections.
- Phase 4 done when an external Claude Code client successfully runs mcp__dev__access_control_* tools end to end.
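A prefix dispatch table can be as simple as an ordered list of prefix and handler pairs. The sketch below is an assumption about its shape; the prefixes are taken from tool names used elsewhere in this plan, and the handlers are placeholder atoms rather than real modules.

```elixir
defmodule DispatchSketch do
  # Hypothetical table; handlers are placeholder atoms, not real modules.
  @table [
    {"access_control_", :access_control_tools},
    {"function_node_", :function_node_tools},
    {"doc_", :doc_tools}
  ]

  # Route a tool name to the first handler whose prefix matches.
  def route(tool_name) when is_binary(tool_name) do
    case Enum.find(@table, fn {prefix, _} -> String.starts_with?(tool_name, prefix) end) do
      {_prefix, handler} -> {:ok, handler}
      nil -> {:error, :unknown_tool}
    end
  end
end
```

Because the first match wins, a table with overlapping prefixes would need the longer prefix listed first.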
Phase 5 – OpenAPI bridge (week 6)
- OpenApiBridge.Parser for OpenAPI 3.x.
- OpenApiBridge.Db for persisted spec storage.
- OpenApiBridge.Dispatcher for run-time MCP-to-HTTP translation.
- OpenApiBridge.Assertion for signed-assertion auth.
- All three transports (api_register, api_list, etc.) plus an OpenApiBridgeWeb page.
- Phase 5 done when registering a public OpenAPI spec (any well-formed sample) produces working MCP tools you can call.
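The core of "register a spec, get tools" is a walk over the spec's paths object. The following is a minimal sketch, assuming the spec has already been decoded from JSON into a map; SpecToToolsSketch, the descriptor fields, and the fallback naming scheme are all hypothetical, and a real parser would also handle parameters, request bodies, and $ref resolution.

```elixir
defmodule SpecToToolsSketch do
  # Hypothetical; takes an already-decoded OpenAPI 3.x document as a map.
  @methods ~w(get post put patch delete)

  # One tool descriptor per (path, method) operation, sorted by name.
  def tools(%{"paths" => paths}) do
    tools =
      for {path, item} <- paths,
          {method, op} <- item,
          # Path items also carry non-operation keys such as "parameters".
          method in @methods do
        %{
          name: op["operationId"] || fallback_name(method, path),
          method: String.upcase(method),
          path: path,
          description: op["summary"] || ""
        }
      end

    Enum.sort_by(tools, & &1.name)
  end

  # operationId is optional in OpenAPI, so derive a name when it is absent.
  defp fallback_name(method, path) do
    slug = path |> String.replace(~r/[^a-zA-Z0-9]+/, "_") |> String.trim("_")
    method <> "_" <> slug
  end
end
```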
MVP candidate. At the end of phase 5 the gateway is usable for deployments whose only requirements are governance, MCP/REST/Web access, and OpenAPI-fed integrations. The developer can switch to the new instance here for everyday use; phases 6 onward add operational depth.
Phase 6 – Scheduling and workflows (week 7)
- Alarm + Timer + AlarmWeb + REST + MCP.
- Workflow + WorkflowExecutor + per-user isolation, templates, human-in-the-loop approvals.
- All three transports for workflow.
- One end-to-end test workflow built from MCP calls only (no function nodes yet): instantiate a template, pause on a step_ready, approve via web UI, see it complete.
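The end-to-end test above exercises a small state machine: automatic steps run through, approval steps pause on step_ready, and only the owner can see or approve the workflow. A pure-function sketch of that behavior follows; WorkflowSketch, its struct fields, and the {:auto, name} / {:approval, name} step encoding are hypothetical, and real steps would of course do work rather than just advance an index.

```elixir
defmodule WorkflowSketch do
  # Hypothetical pure state machine; the real Workflow module may differ.
  defstruct owner: nil, steps: [], index: 0, status: :running

  def new(owner, steps), do: settle(%__MODULE__{owner: owner, steps: steps})

  # Owner approves the step that is currently waiting.
  def approve(%__MODULE__{owner: owner, status: :step_ready} = wf, owner) do
    {:ok, settle(%{wf | index: wf.index + 1, status: :running})}
  end

  # Owner, but nothing is waiting for approval.
  def approve(%__MODULE__{owner: owner}, owner), do: {:error, :not_waiting}

  # Anyone else: indistinguishable from a workflow that does not exist.
  def approve(%__MODULE__{}, _other_user), do: {:error, :not_found}

  # Advance through automatic steps until an approval step or the end.
  defp settle(%__MODULE__{status: :running} = wf) do
    case Enum.at(wf.steps, wf.index) do
      nil -> %{wf | status: :completed}
      {:auto, _name} -> settle(%{wf | index: wf.index + 1})
      {:approval, _name} -> %{wf | status: :step_ready}
    end
  end
end
```

Returning :not_found rather than :not_allowed to other users avoids confirming that a workflow with that identity exists at all.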
Phase 7 – Function nodes (week 8)
- FunctionNode + Sqler-backed registry with version history.
- FunctionNode.Provisioner with two backends in this phase, in priority order:
  - In-process WASM (Wasmex or equivalent) – the default backend, ships first. Strongest isolation for pure transformations, zero external dependencies, runs on any deployment. Good for “take this JSON, extract these fields, return that JSON” – which is most workflow glue.
  - Local Docker – the second backend, for function nodes that need a richer language runtime (Python, Node) or filesystem access. Universal: works on any host with a Docker daemon. No vendor lock-in.
  - Fly.io machines – a third, optional backend that ships as a separate plugin library, not in the spine. Used by deployments that want managed isolation and accept the vendor dependency. The core gateway has no hard dependency on Fly.io.
- FunctionNode.Runtime – the in-sandbox agent that boots, fetches the function body, executes on demand. One implementation per backend.
- All three transports (function_node_create, function_node_invoke, function_node_list, etc., plus FunctionNodeWeb for authoring and deployment).
- WorkflowExecutor extended to support a function_node step type.
- Phase 7 done when a workflow can chain two MCP calls with a function-node transformation between them, end to end, with the function node deployed via the web UI and running on the in-process WASM backend (no Docker or Fly.io required for the smoke test).
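The backend priority order reduces to "first backend, in priority order, that is installed and can satisfy what the function needs." A minimal sketch of that selection, with hypothetical names (ProvisionerSketch, the :wasm / :docker / :fly atoms, and a free-form needs list such as [:filesystem] or [:python]):

```elixir
defmodule ProvisionerSketch do
  # Hypothetical selection logic; not the project's FunctionNode.Provisioner.
  @priority [:wasm, :docker, :fly]

  # `available` lists the backends this deployment has configured.
  # `needs` lists capabilities beyond pure transformation.
  def pick(available, needs \\ []) do
    case Enum.find(@priority, fn b -> b in available and supports?(b, needs) end) do
      nil -> {:error, :no_backend}
      backend -> {:ok, backend}
    end
  end

  # Assumption: the WASM backend only handles pure transformations;
  # the container and machine backends accept any listed need.
  defp supports?(:wasm, needs), do: needs == []
  defp supports?(_backend, _needs), do: true
end
```

With this ordering a deployment that has only the default backend still fails cleanly, with {:error, :no_backend}, when a function asks for something WASM cannot provide.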
Phase 8 – Operational essentials (week 9)
- Notifier + Pushover. Hook into LoggerNotifier for error fan-out.
- Monitor + MonitorServer + MonitorWeb. At least three baseline checks: disk usage, MCP listener liveness, alarm queue depth.
- ServerLog, HttpAccessLog web dashboards.
- SsePush wired to the MCP session manager.
- Permissions.Tidewave – mount Tidewave on the main HTTPS port, gate behind the three permission keys, wire the dedicated tidewave_invocations audit table.
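The baseline monitor checks share one shape: run a function, record the outcome as a row. A sketch of the pure part, under the assumption that each check is a zero-arity function returning :ok or {:error, reason}; MonitorSketch and the row fields are hypothetical, and persisting the rows is left out.

```elixir
defmodule MonitorSketch do
  # Hypothetical runner; persistence of the resulting rows is out of scope.
  # Each check is {name, fun} where fun returns :ok or {:error, reason}.
  def run(checks, now_ms \\ System.system_time(:millisecond)) do
    Enum.map(checks, fn {name, fun} ->
      status =
        try do
          case fun.() do
            :ok -> :ok
            {:error, reason} -> {:failed, reason}
          end
        rescue
          # A crashing check is itself a failed check, not a crashed monitor.
          e -> {:failed, Exception.message(e)}
        end

      %{check: name, status: status, checked_at: now_ms}
    end)
  end
end
```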
Phase 9 – Documentation (week 10)
- Doc + Doc.Index (FTS5 over asset/docs/ and asset/manuals/ corpus, plus sqlite_vec for semantic search). Hybrid ranking in Doc.search.
- Doc.Watcher for live re-indexing on filesystem changes.
- All three transports: DocWeb for the operator browse/search UI, Router.DocAPI for programmatic search, MCP tools doc_search/doc_get/doc_list for agents.
- No Q&A endpoint, no LLM call inside the gateway. The agent does the synthesis.
- Phase 9 done when (a) an operator can search the corpus from the web UI and get ranked results with file/line citations, and (b) an external agent connected over MCP can call doc_search, receive ranked chunks, and incorporate them into its own response to a user question.
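The plan does not say how the hybrid ranking combines the full-text and semantic result lists. One common option, offered here purely as an assumption and not as what Doc.search does, is reciprocal rank fusion: each document scores the sum of 1 / (k + rank) over the lists it appears in. HybridRankSketch is a hypothetical name, and k = 60 is the conventional default for this technique.

```elixir
defmodule HybridRankSketch do
  # Reciprocal rank fusion, shown as one possible hybrid ranking.
  # Each input list is document ids ordered best-first.
  def fuse(ranked_lists, k \\ 60) do
    ranked_lists
    |> Enum.flat_map(fn list ->
      list
      |> Enum.with_index(1)
      |> Enum.map(fn {id, rank} -> {id, 1 / (k + rank)} end)
    end)
    # Sum the per-list contributions for each document id.
    |> Enum.reduce(%{}, fn {id, score}, acc ->
      Map.update(acc, id, score, &(&1 + score))
    end)
    # Highest fused score first; ties broken by id for a stable order.
    |> Enum.sort_by(fn {id, score} -> {-score, id} end)
    |> Enum.map(fn {id, _score} -> id end)
  end
end
```

Rank fusion has the practical advantage of needing no score normalization between the two indexes, since it looks only at positions.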
Phase 10 – LLM gateway (week 11)
- LlmGateway + LlmGateway.Handler + LlmGateway.Database.
- Provider adapters: Anthropic, OpenAI, Google. Each behind a per-provider permission key.
- LlmGatewayWeb for cost dashboards. REST endpoint for programmatic use. MCP tool for agent use.
- Wire WorkflowExecutor to call LlmGateway for llm step types – reserved for the limited cases where natural-language parsing is needed, not for deterministic transformations (those go to function nodes).
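"One canonical request shape, one adapter per provider" can be sketched as a pure translation function. Everything here is illustrative: LlmAdapterSketch, the canonical map fields, and the llm_gateway.<provider> key names are hypothetical, and the two payload shapes reflect the Anthropic Messages and OpenAI Chat Completions request bodies only in outline, so they should be checked against the providers' current documentation before use.

```elixir
defmodule LlmAdapterSketch do
  # Hypothetical canonical request:
  #   %{model: String, system: String, max_tokens: integer,
  #     messages: [%{role: :user | :assistant, content: String}]}

  # Anthropic-style body: the system prompt is a top-level field.
  def to_provider(:anthropic, req) do
    %{
      "model" => req.model,
      "max_tokens" => req.max_tokens,
      "system" => req.system,
      "messages" => Enum.map(req.messages, &message/1)
    }
  end

  # OpenAI-style body: the system prompt is the first message.
  def to_provider(:openai, req) do
    system = %{"role" => "system", "content" => req.system}

    %{
      "model" => req.model,
      "max_tokens" => req.max_tokens,
      "messages" => [system | Enum.map(req.messages, &message/1)]
    }
  end

  # Hypothetical per-provider permission key naming.
  def permission_key(provider), do: "llm_gateway." <> Atom.to_string(provider)

  defp message(%{role: role, content: content}) do
    %{"role" => Atom.to_string(role), "content" => content}
  end
end
```

Keeping the adapters pure means the audit row (token counts, latency, cost) can be written by the caller around the HTTP request, identically for every provider.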
Phase 11 – External agent invocation (week 12)
Depends on LlmGateway for the API-based backends.
- ExternalAgent + the three cross-platform backend adapters (AnthropicApi, OpenAIApi, LocalSubprocess). Platform-specific backends are optional libraries, not part of this phase.
- Per-backend permission keys (external_agent.anthropic, external_agent.openai, external_agent.subprocess), per-token rate limiting, full audit retention.
- WorkflowExecutor extended with an external_agent step type so workflow templates can declare “have Claude write this” as a step.
- Phase 11 done when a workflow can include a step that dispatches to the Anthropic API and captures the response, and the audit table shows the full prompt/response with the caller’s username.
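Per-token rate limiting can start as a fixed-window counter kept as pure state. The sketch below assumes that approach; RateLimitSketch, the default limit of 5, and the 60-second window are hypothetical, and a production version would hold the state in a process or table and prune old windows.

```elixir
defmodule RateLimitSketch do
  # Fixed-window limiter as a pure function over a map of counters.
  # State keys are {token_id, window_number}. Old windows are never
  # pruned here; a real implementation would need to drop them.
  def hit(state, token_id, now_ms, limit \\ 5, window_ms \\ 60_000) do
    key = {token_id, div(now_ms, window_ms)}
    count = Map.get(state, key, 0)

    if count < limit do
      {:ok, Map.put(state, key, count + 1)}
    else
      {:error, :rate_limited, state}
    end
  end
end
```

Keying the counter by token rather than by user means revoking or rotating a sub-token also resets its budget, which matches tokens being independently managed rows.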
Phase 12 – Polish and hardening (week 13)
- Settings UI for backend secrets.
- CompileLog + AutoCompile if dev ergonomics matter.
- End-to-end smoke test suite: every core MCP tool exercised against the new gateway with a fresh data directory and a fresh admin token, asserting each tool’s documented behavior.
- Documentation pass: every lib/permissions/<module>.ex has a matching asset/manuals/<module>-*.md set (web/rest/mcp/iex split as the project convention requires) so Doc.Index ingests the corpus the agent will actually search.
There is no migration phase. The current codebase has no production users; the developer is the only operator and is doing the rewrite. Phase 12 produces a ready-to-deploy gateway that boots cleanly with the developer’s own admin credentials; the developer then configures permissions, tokens, OAuth credentials, and workflow templates fresh on the new instance. The old codebase remains on disk as a reference for porting feature plugins later if needed, but is not running in parallel and is not the source of any data transferred over.
What the plan does not include
Deliberately out of scope for the rewrite, deferred or dropped entirely:
- Multi-tenancy or multi-server federation. The instance-per-customer model is preserved.
- Hot upgrades. Every release is a full restart with planned downtime.
- A separate front-end SPA for the operator UI. The gateway ships server-rendered *_web.ex views as part of the spine; the three-transport rule applies to every core protected module. If a richer SPA is wanted later, build it as a separate project consuming the gateway’s REST API. The server-rendered views stay regardless.
- Any of the domain feature integrations currently in the codebase (Google, Box, WhatsApp, iTerm, page scraper, NotebookLM watcher, OpenClaw, A2A demo, etc.). Each is a separate library that can be added back after the spine is solid. The rewrite ships without them; whichever ones the developer actually needs day-to-day get ported from the old codebase into the plugin shape after Phase 12. (Function nodes, by contrast, are in the spine because workflows depend on them.)
- Performance optimization beyond “don’t be obviously stupid.” No caching tier, no read replicas, no message bus. If a single-customer instance needs more than one BEAM node to handle its load, the customer has grown out of the single-instance model and a separate architecture conversation is warranted.
Success criteria for the rewrite
The rewrite is done when:
- Day-one and day-two priority lists are fully shipped, with three transports per protected module.
- mix credo --strict and mix dialyzer exit clean.
- The total module count in lib/ is under 75. If it grows past that during the rewrite, something is in the codebase that should be a separate library.
- A new engineer can read every module in the lib/ spine in one working day.
- A single customer instance starts cleanly with three env vars (SECRET_KEY_BASE, COOKIE_SALT, ACCESS_CONTROL_KEY) and one TLS cert pair.
- The MCP tool surface advertised to an authenticated agent matches the permission keys on its token, no more, no less.
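The last criterion is directly testable. A minimal sketch of the filter it implies, assuming each tool declares the single permission key it requires; ToolSurfaceSketch and the tool descriptor shape are hypothetical.

```elixir
defmodule ToolSurfaceSketch do
  # Hypothetical: each tool is %{name: String, key: String}, where `key`
  # is the permission key required to call it.
  def advertised(tools, token_keys) do
    allowed = MapSet.new(token_keys)
    for %{name: name, key: key} <- tools, MapSet.member?(allowed, key), do: name
  end
end
```

A smoke test can then assert both directions: every advertised tool is callable with the token, and every tool the token can call is advertised.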
The discipline
The reason the current codebase ballooned past 100 modules is not bad engineering; it is that every interesting new capability got added to the same lib/ directory. The result is a single project that ships the gateway plus a dozen tools plus a blog CMS plus a contact form plus a documentation site plus the developer’s iTerm helper. Each piece is fine on its own. Together they are 122 files.
The rewrite’s discipline is: anything that is not in the day-one or day-two list becomes a separate library, depended on optionally by deployments that need it. The gateway’s mix.exs should list six or seven mandatory dependencies and a long optional list. Customer deployments pick the feature plugins they want. The core stays small enough that two engineers can hold all of it in their heads.
That, more than any specific feature choice, is what separates a product from a junk drawer.