# WorkingAgents Architecture Review (Current Application)

Source baseline: `asset/blogs/2026-02-24-workingagents-architecture.md`
Review scope: current codebase under `lib/`, `config/`, and `test/`
Date: 2026-02-24

## Executive Summary

The application already has a strong OTP-first architecture with clear domain coverage (MCP, chat, CRM/NIS, tasks, WhatsApp, summaries, A2A, access control). The biggest risks are not missing features, but **architectural drift and concentration points**:

- Authorization is split across layers, despite the architecture goal of centralized control.
- Core transport/tool modules have grown into large monoliths, slowing safe change.
- A few legacy/parallel implementations duplicate logic and increase maintenance cost.
- Production-grade controls (observability, idempotency, CSRF, policy/budget enforcement) are only partially implemented.

## What Is Strong Today

- OTP supervision and process isolation are well established (`lib/mcp/application.ex`).
- Per-domain persistence with SQLite instances is consistent and pragmatic.
- MCP tool surface is broad and permission-aware.
- AccessControl is feature-rich (roles, TTL keys, audit trail).
- Chat provider abstraction is in place (`ServerChat.Provider`), enabling runtime provider/model switching.

## High-Priority Findings

## 1) Authorization Source Of Truth Is Not Fully Centralized

### Evidence

- Router resolves permissions from `User.get_user/1` + `permission_keys`, not `AccessControl`:
  - `lib/my_mcp_server_router.ex:3004`
- MCP manager also loads permissions directly from user rows:
  - `lib/my_mcp_server_manager.ex:136`
- AccessControl is documented as the single authority and supports temporary keys with TTLs, but both paths above bypass that model.

### Impact

- Request paths that read permissions from user rows can miss temporary grants and in-memory revocations held by AccessControl.
- Policy behavior may differ between MCP tools and other transports.
- Security model is harder to reason about and validate.

### Recommendation

- Introduce one `AuthContext.for_user(user_id)` API that always returns:
  - identity
  - `AccessControl.get_permissions(user_id)`
  - timezone/profile metadata
- Replace direct `User.get_user(...).permission_keys` permission assembly in router and manager.
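
A minimal sketch of what such an API could look like. `StubUser` and `StubAccessControl` are stand-ins for the real `User` and `AccessControl` modules; the struct fields, the return shapes, and the `allowed?/2` helper are assumptions made for illustration, not the codebase's actual API.

```elixir
defmodule StubUser do
  # Stand-in for the real User module; the row shape is an assumption.
  def get_user(1) do
    {:ok, %{id: 1, name: "alice", timezone: "UTC", permission_keys: [:stale_key]}}
  end

  def get_user(_), do: {:error, :not_found}
end

defmodule StubAccessControl do
  # Stand-in for AccessControl.get_permissions/1; the MapSet return is an assumption.
  def get_permissions(1), do: MapSet.new([:tasks_read, :chat_use])
  def get_permissions(_), do: MapSet.new()
end

defmodule AuthContext do
  @enforce_keys [:user_id, :identity, :permissions]
  defstruct [:user_id, :identity, :permissions, :timezone]

  # Builds the context for one user. Permissions always come from the
  # access-control module, never from the user row's permission_keys.
  def for_user(user_id, opts \\ []) do
    users = Keyword.get(opts, :users, StubUser)
    access = Keyword.get(opts, :access, StubAccessControl)

    with {:ok, user} <- users.get_user(user_id) do
      {:ok,
       %__MODULE__{
         user_id: user_id,
         identity: Map.take(user, [:id, :name]),
         permissions: access.get_permissions(user_id),
         timezone: Map.get(user, :timezone, "UTC")
       }}
    end
  end

  # Single check used by every transport.
  def allowed?(%__MODULE__{permissions: perms}, key), do: MapSet.member?(perms, key)
end
```

With this shape, the router and the manager would both call `AuthContext.for_user/1` and never touch `permission_keys` directly, so a key that exists only on the user row (`:stale_key` in the stub) is not honored.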

## 2) Singleton MCP Manager Is A Throughput And Coupling Bottleneck

### Evidence

- `MyMCPServer.Manager` is a globally named GenServer and handles all `call_tool/list/read` serially:
  - `lib/my_mcp_server_manager.ex:12`
  - `lib/my_mcp_server_manager.ex:57`
- It carries mutable shared `mcp_state` across users.

### Impact

- Head-of-line blocking under tool-heavy concurrent workloads.
- Increased blast radius for slow tool handlers.
- Horizontal scaling is harder, because all tool traffic funnels through one globally named process.

### Recommendation

- Move to per-session or per-user MCP execution processes via `DynamicSupervisor`.
- Keep manager as a lightweight dispatcher/registry.
- Add concurrency/load tests around tool call latency percentiles.
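
The dispatcher-plus-workers layout could be sketched as follows. All module and process names (`SessionWorker`, `Dispatcher`, `SessionRegistry`, `SessionSupervisor`) are hypothetical, and the tool call is a placeholder that echoes its input; only `Registry`, `DynamicSupervisor`, and `GenServer` are real OTP/Elixir building blocks.

```elixir
defmodule SessionWorker do
  use GenServer

  def start_link(session_id) do
    GenServer.start_link(__MODULE__, session_id, name: via(session_id))
  end

  # Each session gets its own process, looked up by session id.
  def via(session_id), do: {:via, Registry, {SessionRegistry, session_id}}

  @impl true
  def init(session_id), do: {:ok, %{session_id: session_id, calls: 0}}

  @impl true
  def handle_call({:call_tool, name, args}, _from, state) do
    # Placeholder for real tool execution; a slow tool only blocks this session.
    result = {:ok, %{tool: name, args: args, session: state.session_id}}
    {:reply, result, %{state | calls: state.calls + 1}}
  end
end

defmodule Dispatcher do
  # Stateless entry point: finds or starts the session worker, then forwards.
  def call_tool(session_id, name, args) do
    {:ok, _pid} = ensure_started(session_id)
    GenServer.call(SessionWorker.via(session_id), {:call_tool, name, args})
  end

  defp ensure_started(session_id) do
    case DynamicSupervisor.start_child(SessionSupervisor, {SessionWorker, session_id}) do
      {:ok, pid} -> {:ok, pid}
      {:error, {:already_started, pid}} -> {:ok, pid}
    end
  end
end

# In an application these two would be children of the top-level supervisor.
{:ok, _} = Registry.start_link(keys: :unique, name: SessionRegistry)
{:ok, _} = DynamicSupervisor.start_link(strategy: :one_for_one, name: SessionSupervisor)
```

Calls for different sessions now run in different processes, so head-of-line blocking is limited to a single session. Idle-worker shutdown and per-session state migration are left out of the sketch.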

## 3) Architectural Drift Between Documentation And Behavior

### Evidence

- The blog post says a provider switch carries chat history across providers.
- The code clears history on provider switch:
  - `lib/server_chat.ex:309`

### Impact

- Operational expectations and real behavior diverge.
- User-facing behavior may appear inconsistent.

### Recommendation

- Pick one policy and align both code and docs.
- If preserving history is desired, add provider-specific history transformers.
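
If history is preserved, one option is to store it in a provider-neutral form and convert it on each request. The sketch below assumes a neutral history of `%{role, text}` maps and two hypothetical request shapes, `:inline_system` (system prompt as the first message) and `:separate_system` (system prompt as a top-level field). The review does not list the actual providers or their formats, so these shapes are illustrative only.

```elixir
defmodule HistoryTransformer do
  # Neutral history: a list of %{role: :system | :user | :assistant, text: binary}.

  # Shape 1: system prompt stays inline as an ordinary message.
  def to_provider(history, :inline_system) do
    %{messages: Enum.map(history, &encode/1)}
  end

  # Shape 2: system prompts are lifted out into a separate top-level field.
  def to_provider(history, :separate_system) do
    {system, rest} = Enum.split_with(history, &(&1.role == :system))

    %{
      system: Enum.map_join(system, "\n", & &1.text),
      messages: Enum.map(rest, &encode/1)
    }
  end

  defp encode(%{role: role, text: text}) do
    %{"role" => Atom.to_string(role), "content" => text}
  end
end
```

Because the stored history never takes a provider-specific shape, switching providers only changes which `to_provider/2` clause runs; nothing needs to be cleared.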

## 4) Security Hardening Gaps In Web Transport

### Evidence

- Query param token auth path remains enabled:
  - `lib/my_mcp_server_router.ex:2591`
- No explicit CSRF protection plug in router pipeline.
- Cookies are configured with `SameSite=Lax`, but the many state-changing browser POST endpoints have no token-based protection on top of it.

### Impact

- Token leakage risk through URLs/logs/referrers.
- Higher CSRF exposure for authenticated browser sessions.

### Recommendation

- Remove query-string token auth; keep header/cookie-based flows only.
- Add CSRF tokens for browser form and JS POST flows.
- Add security tests for auth bypass and request forgery scenarios.
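
A rough sketch of the two checks, written as plain functions so it runs without dependencies. `RequestGuard` and its function names are hypothetical. The CSRF part uses an HMAC of the session id, which is only one possible token scheme; in a Plug application the existing `Plug.CSRFProtection` plug would normally be the first thing to evaluate. Headers are assumed to arrive as a list of `{lowercase_name, value}` tuples.

```elixir
defmodule RequestGuard do
  import Bitwise, only: [bor: 2, bxor: 2]

  # Accepts a token only from the Authorization header; query params are never read.
  def bearer_token(headers) do
    case List.keyfind(headers, "authorization", 0) do
      {_, "Bearer " <> token} when token != "" -> {:ok, token}
      _ -> :error
    end
  end

  # Derives a per-session CSRF token from a server-side secret (assumed scheme).
  def csrf_token(secret, session_id) do
    :hmac
    |> :crypto.mac(:sha256, secret, session_id)
    |> Base.url_encode64(padding: false)
  end

  def valid_csrf?(secret, session_id, submitted) when is_binary(submitted) do
    secure_equal?(csrf_token(secret, session_id), submitted)
  end

  def valid_csrf?(_secret, _session_id, _submitted), do: false

  # Compares every byte regardless of where the first mismatch occurs.
  defp secure_equal?(a, b) when byte_size(a) == byte_size(b) do
    diff =
      a
      |> :binary.bin_to_list()
      |> Enum.zip(:binary.bin_to_list(b))
      |> Enum.reduce(0, fn {x, y}, acc -> bor(acc, bxor(x, y)) end)

    diff == 0
  end

  defp secure_equal?(_a, _b), do: false
end
```

The same functions double as targets for the security tests: a request carrying the token only in the query string must fail `bearer_token/1`, and a POST with a missing or foreign CSRF token must fail `valid_csrf?/3`.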

## Medium-Priority Findings

## 5) Monolithic Modules Increase Change Risk

### Evidence

- `lib/my_mcp_server_router.ex` is ~3021 lines long.
- `lib/my_mcp_server.ex` is ~2038 lines long.

### Impact

- Cross-domain coupling and regression risk.
- Lower testability and review velocity.

### Recommendation

- Split router by bounded contexts (`ChatRouter`, `TaskRouter`, `NisRouter`, `AdminRouter`, etc.).
- Move MCP tool registration into declarative per-domain tool modules (registry pattern).
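
The registry pattern could look roughly like this. The `ToolProvider` behaviour, the `TaskTools` and `ChatTools` modules, and the tool names and permission atoms are all hypothetical; the point is only that each domain declares its tools as data and a small registry assembles and dispatches them.

```elixir
defmodule ToolProvider do
  # Each domain module returns its tools as plain data.
  @callback tools() :: [map()]
end

defmodule TaskTools do
  @behaviour ToolProvider

  @impl true
  def tools do
    [%{name: "task_create", permission: :tasks_write, handler: &__MODULE__.create/1}]
  end

  def create(%{"title" => title}), do: {:ok, %{title: title}}
  def create(_args), do: {:error, :invalid_args}
end

defmodule ChatTools do
  @behaviour ToolProvider

  @impl true
  def tools do
    [%{name: "chat_echo", permission: :chat_use, handler: &__MODULE__.echo/1}]
  end

  def echo(args), do: {:ok, args}
end

defmodule ToolRegistry do
  # Adding a domain means adding one module to this list.
  @providers [TaskTools, ChatTools]

  def all do
    for mod <- @providers, tool <- mod.tools(), into: %{} do
      {tool.name, tool}
    end
  end

  # Permission check and dispatch live in one place for every tool.
  def call(name, args, permissions) do
    case Map.fetch(all(), name) do
      {:ok, tool} ->
        if tool.permission in permissions do
          tool.handler.(args)
        else
          {:error, :forbidden}
        end

      :error ->
        {:error, :unknown_tool}
    end
  end
end
```

Each domain module can then be tested on its own, and the central MCP server module shrinks to registry lookup plus transport concerns.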

## 6) Legacy/Duplicate Chat And MCP Demo Paths

### Evidence

- Multiple parallel chat implementations:
  - `lib/server_chat.ex` (provider architecture)
  - `lib/server_chat_openai.ex` (standalone legacy flow)
  - `lib/weather_chat*.ex` variants (duplicated weather tool loops)

### Impact

- Behavior diverges between implementations, and bug fixes are applied unevenly.
- Extra cognitive load for contributors.

### Recommendation

- Mark legacy modules deprecated and migrate to provider-based implementations.
- Keep one canonical path for tool-calling chat.
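
One low-cost way to mark a legacy entry point is Elixir's `@deprecated` attribute combined with delegation to the canonical path. Both module names below are placeholders, not modules from the codebase, and the canonical function simply echoes its input.

```elixir
defmodule ProviderChatExample do
  # Placeholder for the canonical provider-based chat path.
  def send_message(provider, text), do: {:ok, %{provider: provider, echo: text}}
end

defmodule LegacyChatExample do
  # Callers compiled with Mix get a deprecation warning pointing at the new path.
  @deprecated "Use ProviderChatExample.send_message/2 instead"
  def send_message(text) do
    # Delegating keeps old callers working while behavior stays in one place.
    ProviderChatExample.send_message(:default, text)
  end
end
```

Delegation means a bug fixed in the canonical path is fixed for legacy callers too, which addresses the uneven-fix problem before the legacy modules are actually removed.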

## 7) Reliability Controls For Long-Running Work Are Partial

### Evidence

- `A2A` executes tool calls inline in the request path (`handle_send/2`) and defaults to the first tool if none is specified:
  - `lib/a2a_server.ex:97`
- No explicit queue/DLQ/outbox orchestration layer.

### Impact

- Retry semantics and failure replay are limited.
- Hard to support durable multi-step workflows at scale.

### Recommendation

- Add a durable job/workflow engine (queued execution + retries + dead-letter).
- Use idempotency keys for externally-triggered write operations.
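
The idempotency-key idea can be illustrated with a small in-memory store. `IdempotencyStore` is a hypothetical name, and an `Agent` is used only to keep the sketch self-contained: it serializes all keys through one process and loses its contents on restart, so a real implementation would more likely persist keys in one of the existing SQLite stores.

```elixir
defmodule IdempotencyStore do
  use Agent

  def start_link(_opts \\ []) do
    Agent.start_link(fn -> %{} end, name: __MODULE__)
  end

  # Runs `fun` at most once per key; repeated calls replay the stored result.
  def run(key, fun) when is_function(fun, 0) do
    Agent.get_and_update(__MODULE__, fn seen ->
      case seen do
        %{^key => result} ->
          {{:replayed, result}, seen}

        _ ->
          # Executed inside the Agent so two concurrent requests with the
          # same key cannot both perform the write.
          result = fun.()
          {{:executed, result}, Map.put(seen, key, result)}
      end
    end)
  end
end

{:ok, _} = IdempotencyStore.start_link()
```

A client retry that reuses the same key then gets `{:replayed, result}` instead of triggering a second write.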

## Missing Features (Relative To Production-Grade Agent Platform)

- End-to-end tracing and metrics (OpenTelemetry + request/tool correlation IDs).
- Budget/quota controls per user/tenant/tool/model.
- Explicit policy engine for tool constraints (time windows, allowlists, spend caps).
- Centralized event log for tool invocations and decisions (not only access control changes).
- Formal eval harness for agent/tool quality and regression testing.
- Stronger test coverage for critical modules (`router`, `mcp server`, `access control`, `a2a`, `chat providers`).
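
As an illustration of the policy and budget items, a tool allowlist with a spend cap fits in a few pure functions. `ToolPolicy`, the tool names, and the cent-based cost unit are assumptions made for the sketch; time windows and per-tenant scoping are omitted.

```elixir
defmodule ToolPolicy do
  # A policy is plain data: which tools are allowed and how much may be spent.
  def new(allowed_tools, spend_cap_cents) do
    %{
      allowlist: MapSet.new(allowed_tools),
      spend_cap_cents: spend_cap_cents,
      spent_cents: 0
    }
  end

  # Returns the updated policy on success so the caller can persist the spend.
  def authorize(policy, tool, cost_cents) do
    cond do
      not MapSet.member?(policy.allowlist, tool) ->
        {:deny, :tool_not_allowed}

      policy.spent_cents + cost_cents > policy.spend_cap_cents ->
        {:deny, :spend_cap_exceeded}

      true ->
        {:allow, %{policy | spent_cents: policy.spent_cents + cost_cents}}
    end
  end
end
```

Logging every `{:allow, _}` and `{:deny, reason}` result would also give the centralized decision log mentioned in the list above.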

## Redundancy And Consolidation Opportunities

- Consolidate all chat tool-loop logic into `ServerChat.*` providers.
- Remove or archive standalone demo modules once equivalent provider paths exist.
- Standardize DB initialization/migration pattern (module-owned is good, but add schema versioning/backfill framework).
- Introduce shared response/error helpers for transport layers to reduce repeated patterns.

## Suggested Roadmap (Pragmatic)

## Phase 1 (1-2 weeks)

- Centralize auth context through AccessControl for all request paths.
- Disable query param token auth.
- Add CSRF protection for browser POST endpoints.
- Declare legacy chat/weather modules deprecated.

## Phase 2 (2-4 weeks)

- Break router and MCP server into domain modules with a registry-based tool map.
- Add idempotency keys on high-impact POST endpoints.
- Add baseline telemetry: latency, error rate, queue depth, tool-call count.

## Phase 3 (4-8 weeks)

- Introduce durable workflow execution (retry/backoff/DLQ).
- Add policy/budget enforcement layer for model/tool usage.
- Build regression/eval suite for agent behavior and permission boundaries.

## Quick Wins

- Update architecture article claim about provider-switch history or change code to match.
- Add architecture decision records (ADRs) for auth source-of-truth and MCP execution model.
- Add a test matrix that maps each public API/tool to auth + permission + error-path coverage.

## Final Assessment

The platform architecture is directionally strong and already capable. The next step is not feature sprawl; it is **tightening control planes**: centralized authorization, execution isolation, security hardening, and observable/replayable workflows. Fixing these will materially improve safety, scalability, and velocity.