# WorkingAgents Architecture Review (Current Application)

Source baseline: `asset/blogs/2026-02-24-workingagents-architecture.md`

Review scope: current codebase under `lib/`, `config/`, and `test/`

Date: 2026-02-24

## Executive Summary

The application already has a strong OTP-first architecture with clear domain coverage (MCP, chat, CRM/NIS, tasks, WhatsApp, summaries, A2A, access control).

The biggest risks are not missing features, but **architectural drift and concentration points**:

- Authorization is split across layers, despite the architecture goal of centralized control.
- Core transport/tool modules have grown into large monoliths, slowing safe change.
- A few legacy/parallel implementations duplicate logic and increase maintenance cost.
- Production-grade controls (observability, idempotency, CSRF, policy/budget enforcement) are partial.

## What Is Strong Today

- OTP supervision and process isolation are well established (`lib/mcp/application.ex`).
- Per-domain persistence with SQLite instances is consistent and pragmatic.
- MCP tool surface is broad and permission-aware.
- AccessControl is feature-rich (roles, TTL keys, audit trail).
- Chat provider abstraction is in place (`ServerChat.Provider`), enabling runtime provider/model switching.

## High-Priority Findings

## 1) Authorization Source Of Truth Is Not Fully Centralized

### Evidence

- Router resolves permissions from `User.get_user/1` + `permission_keys`, not `AccessControl`:
  - `lib/my_mcp_server_router.ex:3004`
- MCP manager also loads permissions directly from user rows:
  - `lib/my_mcp_server_manager.ex:136`
- AccessControl is documented as single authority and supports temporary keys/TTL, but these paths bypass that model.

### Impact

- Temporary grants and in-memory revocations can be missed by request paths.
- Policy behavior may differ between MCP tools and other transports.
- Security model is harder to reason about and validate.
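
The first impact above can be made concrete with a minimal sketch. It assumes a hypothetical in-memory grant store (`GrantStore`, which is not the application's `AccessControl` module) that holds temporary keys with expiry times plus a set of revoked keys, and it compares permissions read from a stored user row with permissions resolved at request time. Every module, function, and permission key name below is an illustrative assumption.

```elixir
defmodule GrantStore do
  @moduledoc """
  Hypothetical stand-in for an access-control authority (not the real
  AccessControl module). State is a plain map held by the caller.
  """

  # Empty store: no temporary grants, nothing revoked.
  def new, do: %{grants: %{}, revoked: MapSet.new()}

  # Record a temporary key that expires at `expires_at` (unix seconds).
  def grant(store, user_id, key, expires_at) do
    %{store | grants: Map.put(store.grants, {user_id, key}, expires_at)}
  end

  # Mark a key as revoked for a user, even if the stored row still lists it.
  def revoke(store, user_id, key) do
    %{store | revoked: MapSet.put(store.revoked, {user_id, key})}
  end

  # Stored keys plus unexpired temporary grants, minus revoked keys.
  def effective_permissions(store, user_id, stored_keys, now) do
    temporary =
      for {{^user_id, key}, expires_at} <- store.grants, expires_at > now, do: key

    (stored_keys ++ temporary)
    |> Enum.reject(&MapSet.member?(store.revoked, {user_id, &1}))
    |> MapSet.new()
  end
end

# Assumed user row shape; only `permission_keys` mirrors the review's wording.
user = %{id: 7, permission_keys: ["tasks:read", "crm:write"]}
now = 1_000

store =
  GrantStore.new()
  |> GrantStore.grant(user.id, "summaries:read", now + 60)
  |> GrantStore.revoke(user.id, "crm:write")

# What a request path sees when it reads the user row directly.
row_view = MapSet.new(user.permission_keys)

# What it would see if it asked the authority at request time.
resolved = GrantStore.effective_permissions(store, user.id, user.permission_keys, now)

missed_grants = resolved |> MapSet.difference(row_view) |> MapSet.to_list()
missed_revocations = row_view |> MapSet.difference(resolved) |> MapSet.to_list()

IO.inspect(missed_grants, label: "missed grants")
IO.inspect(missed_revocations, label: "missed revocations")
```

In this sketch the row-based view both lacks the temporary `summaries:read` key and still includes the revoked `crm:write` key, which is the kind of divergence described in the first impact bullet.
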
### Recommendation

- Introduce one `AuthContext.for_user(user_id)` API that always returns:
  - identity
  - `AccessControl.get_permissions(user_id)`
  - timezone/profile metadata
- Replace direct `User.get_user(...).permission_keys` permission assembly in router and manager.

## 2) Singleton MCP Manager Is A Throughput And Coupling Bottleneck

### Evidence

- `MyMCPServer.Manager` is a globally named GenServer and handles all `call_tool/list/read` serially:
  - `lib/my_mcp_server_manager.ex:12`
  - `lib/my_mcp_server_manager.ex:57`
- It carries mutable shared `mcp_state` across users.

### Impact

- Head-of-line blocking under tool-heavy concurrent workloads.
- Increased blast radius for slow tool handlers.
- Harder horizontal scaling story.

### Recommendation

- Move to per-session or per-user MCP execution processes via `DynamicSupervisor`.
- Keep manager as a lightweight dispatcher/registry.
- Add concurrency/load tests around tool call latency percentiles.

## 3) Architectural Drift Between Documentation And Behavior

### Evidence

- Blog says provider switch can carry history across providers.
- Code clears history on provider switch:
  - `lib/server_chat.ex:309`

### Impact

- Operational expectations and real behavior diverge.
- User-facing behavior may appear inconsistent.

### Recommendation

- Pick one policy and align both code and docs.
- If preserving history is desired, add provider-specific history transformers.

## 4) Security Hardening Gaps In Web Transport

### Evidence

- Query param token auth path remains enabled:
  - `lib/my_mcp_server_router.ex:2591`
- No explicit CSRF protection plug in router pipeline.
- Cookies are configured with `SameSite=Lax`, but state-changing browser POST endpoints are numerous.

### Impact

- Token leakage risk through URLs/logs/referrers.
- Higher CSRF exposure for authenticated browser sessions.

### Recommendation

- Remove query-string token auth; keep header/cookie-based flows only.
- Add CSRF tokens for browser form and JS POST flows.
- Add security tests for auth bypass and request forgery scenarios.

## Medium-Priority Findings

## 5) Monolithic Modules Increase Change Risk

### Evidence

- `lib/my_mcp_server_router.ex` ~3021 lines.
- `lib/my_mcp_server.ex` ~2038 lines.

### Impact

- Cross-domain coupling and regression risk.
- Lower testability and review velocity.

### Recommendation

- Split router by bounded contexts (`ChatRouter`, `TaskRouter`, `NisRouter`, `AdminRouter`, etc.).
- Move MCP tool registration into declarative per-domain tool modules (registry pattern).

## 6) Legacy/Duplicate Chat And MCP Demo Paths

### Evidence

- Multiple parallel chat implementations:
  - `lib/server_chat.ex` (provider architecture)
  - `lib/server_chat_openai.ex` (standalone legacy flow)
  - `lib/weather_chat*.ex` variants (duplicated weather tool loops)

### Impact

- Divergent behavior and bug fixes applied unevenly.
- Extra cognitive load for contributors.

### Recommendation

- Mark legacy modules deprecated and migrate to provider-based implementations.
- Keep one canonical path for tool-calling chat.

## 7) Reliability Controls For Long-Running Work Are Partial

### Evidence

- `A2A` executes tool calls inline in request path (`handle_send/2`) and defaults to first tool if unspecified:
  - `lib/a2a_server.ex:97`
- No explicit queue/DLQ/outbox orchestration layer.

### Impact

- Retry semantics and failure replay are limited.
- Hard to support durable multi-step workflows at scale.

### Recommendation

- Add durable job/workflow engine (queued execution + retries + dead-letter).
- Use idempotency keys for externally-triggered write operations.

## Missing Features (Relative To Production-Grade Agent Platform)

- End-to-end tracing and metrics (OpenTelemetry + request/tool correlation IDs).
- Budget/quota controls per user/tenant/tool/model.
- Explicit policy engine for tool constraints (time windows, allowlists, spend caps).
- Centralized event log for tool invocations and decisions (not only access control changes).
- Formal eval harness for agent/tool quality and regression testing.
- Stronger test coverage for critical modules (`router`, `mcp server`, `access control`, `a2a`, `chat providers`).

## Redundancy And Consolidation Opportunities

- Consolidate all chat tool-loop logic into `ServerChat.*` providers.
- Remove or archive standalone demo modules once equivalent provider paths exist.
- Standardize DB initialization/migration pattern (module-owned is good, but add schema versioning/backfill framework).
- Introduce shared response/error helpers for transport layers to reduce repeated patterns.

## Suggested Roadmap (Pragmatic)

## Phase 1 (1-2 weeks)

- Centralize auth context through AccessControl for all request paths.
- Disable query param token auth.
- Add CSRF protection for browser POST endpoints.
- Declare legacy chat/weather modules deprecated.

## Phase 2 (2-4 weeks)

- Break router and MCP server into domain modules with a registry-based tool map.
- Add idempotency keys on high-impact POST endpoints.
- Add baseline telemetry: latency, error rate, queue depth, tool-call count.

## Phase 3 (4-8 weeks)

- Introduce durable workflow execution (retry/backoff/DLQ).
- Add policy/budget enforcement layer for model/tool usage.
- Build regression/eval suite for agent behavior and permission boundaries.

## Quick Wins

- Update architecture article claim about provider-switch history or change code to match.
- Add architecture decision records (ADRs) for auth source-of-truth and MCP execution model.
- Add test matrix that maps each public API/tool to auth + permission + error-path coverage.

## Final Assessment

The platform architecture is directionally strong and already capable. The next step is not feature sprawl; it is **tightening control planes**: centralized authorization, execution isolation, security hardening, and observable/replayable workflows. Fixing these will materially improve safety, scalability, and velocity.