The WorkingAgents AI Agent Gateway runs as a long-lived BEAM process. The default deployment model is one instance per customer – each company runs its own copy on its own server, with its own data, its own auth, its own tools. No shared tenancy, no cross-customer routing. That single architectural choice drives every deployment option below.
This article walks through the deployment paths that work for that model: bare VM with a Mix release, Docker, Kubernetes, and PaaS targets. It also describes what every option needs (TLS, environment variables, SQLite persistence, time sync) regardless of how you ship it.
What runs
The gateway is an Elixir application built with Mix. The release name in mix.exs is orchestrator:
defp releases do
  [
    orchestrator: [
      applications: [mcp: :permanent],
      include_executables_for: [:unix]
    ]
  ]
end
mix release orchestrator assembles a self-contained release under _build/prod/rel/orchestrator/ – the BEAM, the compiled Erlang/Elixir code, the application’s static assets, and shell scripts to start, stop, and attach to the running node. The release does not require Elixir or Mix on the target host.
The release listens on a single HTTPS port (default 8443, override with the PORT env var) and serves:
- REST/JSON endpoints for the web app, the REST API, and MCP HTTP transport.
- WebSocket endpoints for live UI sessions.
- Per-user MCP sessions over Server-Sent Events.
Persistent state lives in SQLite files on local disk – one file per subsystem (users, access control registry, audit log, contact forms, blog store, etc.). Sqler – the project’s SQLite wrapper – owns each database.
That description matters because it constrains the deployment shape: stateful, single-node, file-backed. The gateway is not built to be replicated horizontally. The customer-per-instance model means you don’t need horizontal replication; you scale by adding more instances for more customers, not more nodes per customer.
What every deployment needs
Independent of the runtime choice, every WorkingAgents deployment requires:
Environment variables at startup
The release reads these at boot via config/runtime.exs:
- SECRET_KEY_BASE (required) – generate with openssl rand -base64 64.
- COOKIE_SALT (required) – session cookie salt.
- ACCESS_CONTROL_KEY (required) – AES key for the access control registry. Generate with openssl rand -base64 32.
- PORT (optional) – listening port, default 8443.
- TLS_CERTFILE and TLS_KEYFILE (optional) – paths to certs. If unset, the bundled self-signed cert is used (fine behind a reverse proxy, not for direct exposure).
- Tool-specific tokens as needed: PUSHOVER_TOKEN, PUSHOVER_USER, GOOGLE_APPLICATION_CREDENTIALS, etc.
Boot fails fast and loud if SECRET_KEY_BASE, COOKIE_SALT, or ACCESS_CONTROL_KEY are missing. That is intentional – a half-configured production instance would be a security incident.
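A sketch for bootstrapping those secrets into an env file (the systemd option below reads it from /etc/workingagents/env). The 32-byte COOKIE_SALT is an assumption – the other two sizes come from the commands above:
# Run as root. tr strips the line wrap openssl adds to base64 output past 64 chars.
umask 077
mkdir -p /etc/workingagents
cat > /etc/workingagents/env <<EOF
SECRET_KEY_BASE=$(openssl rand -base64 64 | tr -d '\n')
COOKIE_SALT=$(openssl rand -base64 32)
ACCESS_CONTROL_KEY=$(openssl rand -base64 32)
EOF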
TLS
Modern MCP clients require HTTPS for the HTTP transport. Three patterns work:
- Reverse proxy terminates TLS – Caddy, nginx, Traefik, or a cloud load balancer holds the cert and forwards plaintext over loopback to the BEAM. Easiest, most flexible.
- BEAM terminates TLS directly – point TLS_CERTFILE and TLS_KEYFILE at the cert and key. Works, but every cert renewal needs a graceful restart to pick up new files (or wire in :public_key-level reload).
- Self-signed for testing – the bundled cert. Never use this in production exposed to the public internet.
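For the reverse-proxy pattern, a minimal Caddyfile sketch. The hostname is a placeholder; Caddy provisions and renews the public cert itself. This variant keeps the upstream connection on the gateway’s bundled self-signed HTTPS, which matches the default cert behavior described above:
gateway.example.com {
    reverse_proxy https://127.0.0.1:8443 {
        transport http {
            tls
            tls_insecure_skip_verify
        }
    }
}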
Persistent disk
SQLite files live in the data/ subdirectory of the BEAM’s working directory. Plan for:
- Volume that survives container restarts and host reboots. A docker run with no mounted volume will lose the access control registry, the users, and the audit log on every restart. That’s a deal-breaker.
- Backups. SQLite is one file per database; rsync or restic at a regular interval is plenty for a single-customer instance (a sketch follows this list). Snapshots if your hypervisor supports it.
- Disk speed. SQLite is fast on local SSD; very slow on networked filesystems (NFS, CephFS). Don’t put data/ on a network mount.
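A nightly backup sketch, assuming the Option A layout described later. It uses sqlite3’s .backup for a consistent copy of each live database instead of copying raw files mid-write, then ships the snapshots off-host (paths and the remote host are placeholders):
#!/bin/sh
# Consistent snapshots of each live database, then an off-host copy.
set -eu
mkdir -p /var/backups/workingagents
for db in /opt/workingagents/data/*.sqlite; do
  sqlite3 "$db" ".backup '/var/backups/workingagents/$(basename "$db")'"
done
rsync -a /var/backups/workingagents/ backup-host:/backups/workingagents/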
Time sync
The gateway uses millisecond timestamps for record IDs (Sqler’s convention). NTP must be running on the host or in the container. Without it, IDs go backward when the clock drifts, audit logs lie about ordering, and TTL-based expiry breaks.
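A quick check on a systemd host:
timedatectl show -p NTPSynchronized --value    # prints "yes" when the clock is synced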
Option A: Bare VM with Mix release + systemd
The current production deployment of this project runs this way. Simplest path, fewest moving parts, most control.
Build
On a build host with the right Elixir / Erlang versions (asdf, mise, or matching system packages):
MIX_ENV=prod mix deps.get --only prod
MIX_ENV=prod mix release orchestrator
The release directory at _build/prod/rel/orchestrator/ is portable to any Linux host with the same libc generation as the build host. Built on Debian Bookworm, it runs cleanly on Debian Bookworm or Ubuntu 24.04; mixing glibc versions across major distros breaks.
Ship
Tar the release, rsync or scp it to the target host, and extract it to /opt/workingagents/.
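A minimal ship sequence, assuming SSH access as a deploy user (hostnames are placeholders):
# On the build host
tar -czf orchestrator.tar.gz -C _build/prod/rel orchestrator
scp orchestrator.tar.gz deploy@customer-host:/tmp/

# On the target host
sudo mkdir -p /opt/workingagents
sudo tar -xzf /tmp/orchestrator.tar.gz -C /opt/workingagents --strip-components=1
sudo chown -R workingagents:workingagents /opt/workingagents
With the release in place, drop a systemd unit: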
[Unit]
Description=WorkingAgents AI Agent Gateway
After=network.target
[Service]
Type=simple
User=workingagents
WorkingDirectory=/opt/workingagents
EnvironmentFile=/etc/workingagents/env
ExecStart=/opt/workingagents/bin/orchestrator start
ExecStop=/opt/workingagents/bin/orchestrator stop
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
/etc/workingagents/env holds the secrets. systemctl enable --now workingagents and you’re running. journalctl -u workingagents -f tails logs.
When it fits
- Single dedicated VM or bare-metal box per customer.
- Customer wants full visibility into where their data lives.
- You don’t want to operate a container runtime per deployment.
- Cert renewal via certbot’s standard hook system is sufficient.
When it doesn’t fit
- You’re deploying to many customers and need image-based reproducibility.
- The customer’s ops team mandates containers.
Option B: Docker container
The project does not ship a top-level Dockerfile today; the existing deploy/function_node/Dockerfile is for the sandboxed Function Node runtime, a separate component. Building one for the gateway is straightforward. Shape:
# Stage 1: build the release
FROM hexpm/elixir:1.18.4-erlang-28.4-debian-bookworm-slim AS build
ENV MIX_ENV=prod
WORKDIR /app
COPY mix.exs mix.lock ./
RUN mix local.hex --force && mix local.rebar --force && mix deps.get --only prod
COPY config config
COPY lib lib
COPY asset asset
RUN mix release orchestrator
# Stage 2: runtime
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y openssl libstdc++6 libsqlite3-0 ca-certificates tzdata && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY --from=build /app/_build/prod/rel/orchestrator ./
EXPOSE 8443
ENTRYPOINT ["/app/bin/orchestrator"]
CMD ["start"]
Notes:
- Pin the Elixir/Erlang base image to match your CI to avoid surprises.
- Mount /app/data as a persistent volume. Without this, you lose state on every container restart. This is the most common operator mistake.
- Pass secrets via Docker Compose env_file, Kubernetes Secrets, Docker Swarm secrets, or AWS Secrets Manager – never bake them into the image.
- Run with --init or use Tini so the BEAM gets clean SIGTERM forwarding for graceful shutdown.
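Those notes combined into a single docker run invocation (image name follows the Compose example below):
docker run -d --init \
  --name workingagents-gateway \
  --restart unless-stopped \
  -p 8443:8443 \
  -v gateway-data:/app/data \
  --env-file ./env.prod \
  workingagents/gateway:latest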
Compose example
services:
  gateway:
    image: workingagents/gateway:latest
    init: true                  # clean SIGTERM forwarding, per the notes above
    restart: unless-stopped
    ports: ["8443:8443"]
    volumes:
      - gateway-data:/app/data
    env_file: ./env.prod

volumes:
  gateway-data:
That is the smallest viable production setup. Put a Caddy or nginx container in front for TLS termination and you have a complete deployment.
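One way to wire that in, extending the Compose file above with the Caddyfile pattern from the TLS section (image tag and mount paths are choices, not project requirements):
services:
  # ...gateway service as above, with its ports: mapping removed...
  caddy:
    image: caddy:2
    restart: unless-stopped
    ports: ["443:443"]
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
      - caddy-data:/data        # cert storage must survive restarts

volumes:
  caddy-data:
Inside the Compose network the Caddyfile upstream becomes https://gateway:8443 instead of 127.0.0.1, and only Caddy needs a published port.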
When it fits
- Customer’s ops standardizes on containers.
- You want repeatable, image-versioned deploys (rollback by retagging).
- You operate multiple deployments and want one image flowing through CI.
When it doesn’t fit
- The customer needs the data to live on a network-mounted volume (don’t do it; SQLite on NFS is slow and corrupts under contention).
- You need zero-downtime rolling restarts; the gateway’s stateful single-node design doesn’t support a rolling deploy without coordination.
Option C: Kubernetes
Possible, often overkill. The gateway’s single-node stateful shape means a Kubernetes deployment is structurally a StatefulSet with replicas=1 and a PersistentVolumeClaim. You are using Kubernetes to manage one pod that mounts one volume – a workload its scheduler is not built to make better.
Where Kubernetes does earn its cost:
- The customer’s entire ops platform is already Kubernetes and your gateway is one of dozens of services.
- You want declarative config in Git, GitOps deploy flows, and the existing Helm chart conventions.
- Cert management via cert-manager and ingress via an existing ingress controller is genuinely simpler than configuring Caddy on a VM.
Where it doesn’t:
- The customer doesn’t already run Kubernetes.
- The deployment is one-off and small.
- The data volume backend is anything other than fast local SSD (avoid Ceph, GlusterFS, or anything with network latency in the SQLite write path).
If you do go this route, the StatefulSet pattern looks like:
apiVersion: apps/v1
kind: StatefulSet
metadata: { name: workingagents-gateway }
spec:
  serviceName: workingagents-gateway
  replicas: 1
  selector: { matchLabels: { app: gateway } }
  template:
    metadata: { labels: { app: gateway } }   # must match the selector above
    spec:
      containers:
        - name: gateway
          image: workingagents/gateway:1.x
          ports: [{ containerPort: 8443 }]
          envFrom: [{ secretRef: { name: gateway-env } }]
          volumeMounts: [{ name: data, mountPath: /app/data }]
  volumeClaimTemplates:
    - metadata: { name: data }
      spec:
        accessModes: ["ReadWriteOnce"]
        resources: { requests: { storage: 50Gi } }
        storageClassName: local-ssd
Plus a Service, an Ingress with TLS, and a Secret holding the env vars. The deployment story is then kubectl apply per release.
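The Secret can be created straight from the same env file the other options use:
kubectl create secret generic gateway-env --from-env-file=env.prod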
Option D: PaaS (Fly.io, Render, Railway, etc.)
For a single-customer deploy where the customer is OK with a managed platform, a PaaS removes a lot of operational work. The shape:
- Fly.io: deploy with fly deploy from a Dockerfile, attach a Fly Volume for /app/data, set secrets via fly secrets set. Per-region scaling is easy but irrelevant given the single-node model. The function node runtime in this project already targets Fly.io for its sandbox; using Fly for the gateway too keeps everything on one platform.
- Render: similar shape, Docker-based, attached disks for state. Slightly more click-heavy than Fly.
- Railway: container-based, simple env management, less mature for stateful workloads.
The Fly Volumes trade-off is worth knowing: they are local SSD on the Fly machine, which is what SQLite wants – but they are bound to a specific Fly region. The gateway cannot fail over to another region without rsync-ing the volume first. For a single-instance deployment that’s acceptable.
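A minimal Fly setup sketch – region, volume size, and secret values are placeholders:
fly launch --no-deploy            # generates fly.toml from the Dockerfile
fly volumes create gateway_data --size 50 --region fra
fly secrets set SECRET_KEY_BASE="$SECRET_KEY_BASE" \
  COOKIE_SALT="$COOKIE_SALT" \
  ACCESS_CONTROL_KEY="$ACCESS_CONTROL_KEY"
fly deploy
fly.toml also needs a [mounts] section pairing gateway_data with /app/data; without it the volume exists but nothing mounts it.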
When PaaS fits
- Small customer, no in-house ops team.
- You want one bill (compute + storage + bandwidth) instead of stitching together VPS + DNS + cert + monitoring.
- The customer accepts being on Fly’s / Render’s / Railway’s platform.
When it doesn’t
- The customer’s compliance posture requires on-prem or self-managed cloud (HIPAA BAA, FedRAMP, data residency in a country not served).
- Egress costs are prohibitive at the data volumes you expect.
Things to plan for, every option
Independent of the chosen runtime:
- Backups. Document where state lives (data/*.sqlite) and how it’s backed up. Test restore at least once before production.
- Log retention. Either ship logs off-host (Loki, Datadog, CloudWatch) or rotate them locally. The gateway logs to stdout and to disk; both grow.
- Monitoring. At minimum: process up/down, HTTPS 200 from /healthz (auth-gated, use a token), disk usage on the data volume, and clock drift.
- Cert renewal. Whichever path you choose, the renewal workflow must be tested. A cert that expires on a customer’s instance with no auto-renewal is a guaranteed weekend support call.
- Upgrade procedure. Document how you go from version N to N+1 (build new release, stop old, swap binary, start new – sketch below) including a rollback path. The release format supports hot upgrades (appup), but the project does not currently produce them – treat every upgrade as restart-with-downtime.
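For Option A, the N-to-N+1 swap sketched concretely. The tarball name is a placeholder, and note that data/ lives inside the working directory, so it has to be carried across:
sudo systemctl stop workingagents
sudo mv /opt/workingagents /opt/workingagents.prev        # keep N around for rollback
sudo mkdir /opt/workingagents
sudo tar -xzf /tmp/orchestrator-next.tar.gz -C /opt/workingagents --strip-components=1
sudo cp -a /opt/workingagents.prev/data /opt/workingagents/data   # carry state across the swap
sudo chown -R workingagents:workingagents /opt/workingagents
sudo systemctl start workingagents
Rollback is the mirror image: stop, remove the new directory, move .prev back, start.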
Decision shortcut
A rough heuristic:
- One customer, you control the VM: Bare VM + systemd + mix release. Smallest moving-parts surface.
- One customer, they control the platform and want containers: Docker on their VM, or Docker Compose if they’re not on Kubernetes.
- Many customers, you operate the fleet: Containerized + a thin orchestration layer (Kubernetes if you already run it, plain docker compose per host otherwise). Image versioning becomes the deployment unit.
- Small customer, no ops team: Fly.io with a Fly Volume.
- Customer is regulated and on-prem: Bare VM + systemd, or air-gapped Docker if they have a registry. Avoid PaaS.
The deployment shape that matters most isn’t the runtime – it’s the operational surface around it: backups, monitoring, cert renewal, and a tested upgrade path. Pick the runtime your team can operate. The gateway runs fine on any of them.