By James Aspinwall, co-written by Alfred Pennyworth (my trusted AI) — March 7, 2026, 17:25
The Thesis
WorkingAgents builds the agent orchestration layer — the brain that coordinates AI agents, tools, CRM, task management, and human communication. Arize AI builds the observability and evaluation layer — the eyes that watch what agents actually do, measure whether they’re doing it well, and surface where they fail.
These two platforms occupy adjacent, non-overlapping layers of the AI agent stack. Together, they form a complete build-observe-improve loop that neither can deliver alone.
What WorkingAgents Brings
WorkingAgents (“The Orchestrator”) is an MCP-powered agent orchestration platform built on Elixir OTP. It gives AI agents real tools to do real work:
- 50+ MCP tools spanning CRM, task management, content authoring, alarms, and system monitoring
- Multi-provider LLM integration — Anthropic Claude, OpenRouter, Perplexity, switchable at runtime
- Permission-aware tool execution — every tool call gated by a capability-based access control system
- Google A2A protocol support — agent-to-agent task delegation and skill discovery
- WhatsApp bridge — natural language tool invocation via messaging
- Per-user data isolation — separate SQLite databases per domain, per user
WorkingAgents solves the “what agents can do” problem. It’s the runtime where agents act.
What Arize AI Brings
Arize AI is an agent and AI engineering platform for observing, evaluating, and improving AI agents and LLM applications. Their stack includes:
- Arize AX — enterprise-grade observability with tracing at session, trace, and span levels
- Phoenix — open-source AI observability built on OpenTelemetry, with out-of-the-box support for Anthropic, OpenAI, LangChain, CrewAI, and dozens more
- OpenInference — an open standard for AI telemetry that extends OpenTelemetry with LLM-specific semantics
- Evaluator Hub — versioned, reusable evaluators with LLM-as-a-Judge templates for tool calling, hallucination, relevance, and correctness
- Prompt Playground — no-code environment for iterating on prompts and testing agent behaviors side by side
- Online Evaluations — continuous production monitoring that automatically tags spans with quality labels
Arize solves the “how well agents perform” problem. It’s the feedback loop that makes agents better.
Where the Synergy Lives
1. Tool Call Observability — The Immediate Win
WorkingAgents dispatches 50+ tool calls through its MCP server. Every call is a decision point: Did the agent pick the right tool? Did it pass the right parameters? Did the result make sense?
Today, WorkingAgents has basic monitoring (health endpoints, process counts, memory usage) but no distributed tracing of tool invocations. There’s no correlation ID linking a user’s chat message to the chain of tool calls it triggered.
Arize’s OpenInference tracing would give WorkingAgents exactly this. Each MCP tool call becomes a span. Each chat session becomes a trace. Each user becomes a session. Suddenly you can see:
- Which tools get called most (and least)
- Which tool sequences lead to successful outcomes
- Where agents get stuck in retry loops
- How long each tool takes and where latency hides
Integration path: Emit OpenTelemetry spans from MyMCPServer.Manager when dispatching tool calls. Arize Phoenix (self-hosted) or Arize AX (cloud) ingests the traces. Zero changes to business logic.
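To make the message-to-tool-call correlation concrete, here is a minimal stdlib-only Python sketch of the shape of that data. The production dispatcher is Elixir, and real spans would come from an OpenTelemetry SDK exporting to Phoenix or Arize AX; the `Span` dataclass and `dispatch_with_tracing` wrapper below are hypothetical names used purely for illustration.

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Span:
    """One MCP tool call, correlated to its parent chat message via trace_id."""
    trace_id: str          # shared by every tool call one chat message triggers
    name: str              # e.g. "mcp.tool.crm_lookup"
    attributes: dict = field(default_factory=dict)
    start_ns: int = 0
    end_ns: int = 0

def dispatch_with_tracing(trace_id: str, tool_name: str, params: dict, tool_fn):
    """Wrap a tool call in a span; in production this would be an
    OpenTelemetry span exported to a Phoenix or Arize AX collector."""
    span = Span(trace_id=trace_id, name=f"mcp.tool.{tool_name}",
                attributes={"tool.params": str(params)})
    span.start_ns = time.time_ns()
    try:
        result = tool_fn(params)
        span.attributes["status"] = "ok"
        return result, span
    except Exception as exc:
        span.attributes["status"] = "error"
        span.attributes["error.message"] = str(exc)
        raise
    finally:
        span.end_ns = time.time_ns()

# One chat message -> one trace_id -> one span per tool call it triggers
trace_id = uuid.uuid4().hex
result, span = dispatch_with_tracing(trace_id, "crm_lookup",
                                     {"contact": "Acme"}, lambda p: {"found": True})
```

The key point is the shared `trace_id`: it is the correlation ID the article notes is missing today, linking a user's chat message to every tool call it fans out to.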
2. Agent Quality Evaluation — The Strategic Win
WorkingAgents runs multi-turn conversations where agents use tools to accomplish tasks. But there’s no automated way to answer: “Was this interaction good?”
Arize’s Evaluator Hub provides exactly this capability:
- Tool Calling Evaluation — Did the agent select the right tool? Were the parameters correct? This maps directly to WorkingAgents’ 50+ MCP tools where misrouted calls waste time or produce wrong results.
- Path Convergence — Do different agents reach the same answer for the same question? Critical for WorkingAgents’ multi-provider architecture where users can switch between Claude, OpenRouter, and Perplexity mid-conversation.
- Hallucination Detection — When agents summarize articles or report CRM data, are they fabricating details? WorkingAgents’ Summary and NIS modules produce outputs that can be grounded against source data.
- Relevance Scoring — When a user asks for their sales pipeline and gets back task management data instead, that’s a relevance failure Arize can catch.
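To illustrate the input/output shape of a tool-calling evaluator, here is a rule-based stand-in in Python. Arize's Evaluator Hub uses LLM-as-a-Judge templates rather than hand-written rules; the `evaluate_tool_call` function and its record format are hypothetical, sketched only to show the kind of label-plus-explanation verdict such an evaluator attaches to a traced call.

```python
def evaluate_tool_call(record: dict, expected_tool: str, required_params: set) -> dict:
    """Rule-based stand-in for an LLM-as-a-Judge tool-calling evaluator:
    label a traced tool call as correct / incorrect with an explanation."""
    problems = []
    if record["tool"] != expected_tool:
        problems.append(f"selected {record['tool']!r}, expected {expected_tool!r}")
    missing = required_params - record["params"].keys()
    if missing:
        problems.append(f"missing params: {sorted(missing)}")
    return {"label": "correct" if not problems else "incorrect",
            "explanation": "; ".join(problems) or "tool and params match"}

# The relevance failure from above: a pipeline query routed to task tools
verdict = evaluate_tool_call(
    {"tool": "task_list", "params": {"user": "u1"}},
    expected_tool="crm_pipeline",
    required_params={"user"},
)
```

In production, the verdict would be written back onto the span as a quality label, which is exactly what Online Evaluations automate.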
3. Multi-Provider Model Comparison
WorkingAgents is provider-agnostic — users switch between Claude, OpenRouter models, and Perplexity at runtime. This creates a natural experiment: Which provider handles which tool-use patterns best?
Arize’s side-by-side evaluation in the Prompt Playground would let WorkingAgents benchmark providers against the same real-world prompts, measuring:
- Tool selection accuracy per provider
- Latency per provider per tool type
- Cost per successful task completion
- Hallucination rates across providers
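The four metrics above reduce to a simple aggregation over traced runs. Here is a stdlib Python sketch of that roll-up; the run-record fields (`correct_tool`, `latency_ms`, `cost_usd`, `success`) are assumed names, standing in for whatever the trace attributes actually carry.

```python
from collections import defaultdict

def compare_providers(runs):
    """Aggregate traced runs into per-provider benchmark metrics.
    Each run: {"provider", "correct_tool": bool, "latency_ms",
               "cost_usd", "success": bool}."""
    stats = defaultdict(lambda: {"n": 0, "correct": 0, "latency": 0.0,
                                 "cost": 0.0, "successes": 0})
    for r in runs:
        s = stats[r["provider"]]
        s["n"] += 1
        s["correct"] += r["correct_tool"]
        s["latency"] += r["latency_ms"]
        s["cost"] += r["cost_usd"]
        s["successes"] += r["success"]
    report = {}
    for provider, s in stats.items():
        report[provider] = {
            "tool_accuracy": s["correct"] / s["n"],
            "mean_latency_ms": s["latency"] / s["n"],
            "cost_per_success": (s["cost"] / s["successes"]
                                 if s["successes"] else float("inf")),
        }
    return report

runs = [
    {"provider": "claude", "correct_tool": True, "latency_ms": 900,
     "cost_usd": 0.02, "success": True},
    {"provider": "claude", "correct_tool": True, "latency_ms": 1100,
     "cost_usd": 0.02, "success": True},
    {"provider": "perplexity", "correct_tool": False, "latency_ms": 700,
     "cost_usd": 0.01, "success": False},
]
report = compare_providers(runs)
```

Note that cost-per-success, not cost-per-call, is the metric that matters: a cheap provider that never completes the task has infinite cost per outcome.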
This data directly informs which provider to default to for different task types — a competitive advantage for WorkingAgents’ consulting clients.
4. Permission and Security Auditing
WorkingAgents has a sophisticated access control system where every tool call is gated by capability-based permissions. Arize’s tracing would add a security dimension:
- Trace all permission-denied attempts alongside successful calls
- Detect patterns of permission escalation attempts
- Monitor whether temporary access keys are being used within their intended scope
- Audit tool usage patterns per user role
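As a sketch of the first two bullets, the following stdlib Python scans tool-call spans for bursts of permission denials per user, a crude stand-in for real escalation-pattern detection. The span fields and the `flag_escalation_attempts` helper are hypothetical names for illustration.

```python
from collections import Counter

def flag_escalation_attempts(spans, threshold=3):
    """Scan traced tool-call spans for permission-denied bursts.
    Each span: {"user", "tool", "outcome"}; outcome "denied" means the
    capability check rejected the call. Flags users whose denial count
    meets the threshold."""
    denials = Counter(s["user"] for s in spans if s["outcome"] == "denied")
    return {user: n for user, n in denials.items() if n >= threshold}

spans = (
    [{"user": "alice", "tool": "crm_export", "outcome": "denied"}] * 3
    + [{"user": "bob", "tool": "task_list", "outcome": "ok"}]
    + [{"user": "bob", "tool": "crm_export", "outcome": "denied"}]
)
flagged = flag_escalation_attempts(spans)
```

The precondition is simply that denied calls are traced at all; once denials land in the same span stream as successes, this kind of audit query is trivial.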
This transforms Arize from an observability tool into a security monitoring layer for WorkingAgents’ access control system.
5. A2A Protocol Observability
WorkingAgents implements Google’s Agent-to-Agent (A2A) protocol, allowing external agents to discover and invoke its tools as “skills.” As the A2A ecosystem grows, observability becomes critical:
- Trace cross-agent task delegation chains
- Measure external agent request patterns and load
- Evaluate whether incoming A2A tasks are being handled correctly
- Monitor A2A skill discovery and execution quality
Arize’s session-level tracing maps naturally to A2A task lifecycles — each A2A task becomes a trace, each skill invocation becomes a span.
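That task-to-trace, skill-to-span mapping can be sketched in a few lines of Python. The structures below are illustrative only, not the A2A wire format or an OpenTelemetry API; the point is that parent span IDs are what make a cross-agent delegation chain reconstructable.

```python
import uuid

def new_a2a_trace(task_id: str):
    """Each A2A task becomes one trace; each skill invocation becomes
    a span inside it."""
    return {"trace_id": uuid.uuid4().hex, "task_id": task_id, "spans": []}

def record_skill_span(trace, skill: str, caller_agent: str, parent_span_id=None):
    """Append a skill-invocation span; parent_span_id links delegation chains."""
    span = {"span_id": uuid.uuid4().hex, "parent_span_id": parent_span_id,
            "skill": skill, "caller_agent": caller_agent}
    trace["spans"].append(span)
    return span["span_id"]

# An external agent delegates a task; that skill in turn calls a second skill.
trace = new_a2a_trace(task_id="task-42")
root = record_skill_span(trace, "summarize_article", caller_agent="external-agent")
record_skill_span(trace, "fetch_source", caller_agent="workingagents",
                  parent_span_id=root)
```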
Partnership Models
Technology Integration Partner
The most natural first step. WorkingAgents integrates Arize Phoenix (open-source) or Arize AX (enterprise) as its observability backend:
- WorkingAgents gains: Production-grade tracing, evaluation, and debugging without building it in-house
- Arize gains: A reference implementation showing their platform with MCP-native agent orchestration on Elixir/OTP — a stack currently underrepresented in Arize’s ecosystem
Consulting Channel Partner
James is building an AI consulting firm focused on AI integration for medium-size companies. Arize has an active GSI/consulting partner program (they recently partnered with Infogain for exactly this model):
- WorkingAgents consulting delivers: Agent orchestration, tool design, MCP integration, custom workflows
- Arize delivers: Observability, evaluation, production monitoring, quality assurance
- Joint value proposition: “We build your AI agents AND give you visibility into how they perform” — a complete package that neither offers alone
Co-Marketing / Case Study
WorkingAgents’ architecture — Elixir OTP, MCP protocol, multi-provider LLM, A2A interop — is technically distinctive. A joint case study showing Arize observability on a non-Python, non-TypeScript agent platform would differentiate both companies:
- Demonstrates Arize’s language-agnostic claims (OpenTelemetry works everywhere)
- Showcases WorkingAgents as a serious orchestration platform
- Creates content for both companies’ marketing pipelines
The Gap Analysis — What Each Needs From the Other
| WorkingAgents Gap | Arize Solution |
|---|---|
| No distributed tracing | OpenInference spans + Phoenix/AX collector |
| No tool call quality metrics | Evaluator Hub with tool-calling templates |
| No provider comparison framework | Prompt Playground side-by-side evaluation |
| No regression testing for agent behavior | Online Evaluations with continuous monitoring |
| No cost-per-outcome tracking | Span-level token usage and latency metrics |
| Arize Gap | WorkingAgents Solution |
|---|---|
| Limited MCP-native examples | Full 50+ tool MCP server implementation |
| Few Elixir/BEAM ecosystem references | Production Elixir OTP agent orchestration |
| Need consulting channel partners | AI consulting firm with Florida presence |
| Need A2A protocol observability stories | Working A2A implementation with skill discovery |
Recommended Next Steps
- Prototype integration — Add OpenTelemetry span emission to WorkingAgents’ MCP dispatcher. Point at a self-hosted Phoenix instance. Prove the tracing works end-to-end in a week.
- Reach out to Arize partnerships — Arize is actively hiring a GSI/Consulting Partnerships Manager and recently launched partnerships with Infogain and Google Cloud. The timing is right for new consulting partners.
- Build a demo — Record a session showing: user sends WhatsApp message → agent selects tools → tools execute → results returned — all visible in Arize’s trace view. This becomes the pitch deck for joint consulting engagements.
- Propose a case study — “MCP Agent Observability on Elixir OTP” is a story nobody else is telling. Arize’s content team would likely be interested.
Conclusion
WorkingAgents and Arize AI are two sides of the same coin. One builds the agent runtime, the other builds the agent feedback loop. Neither competes with the other. Both are stronger together.
For James’s consulting firm, the combination is particularly powerful: walk into a client meeting offering both “we’ll build your AI agents” and “we’ll show you exactly how they perform.” That’s a hard pitch to say no to.
The integration is technically straightforward (OpenTelemetry is protocol-level, not language-level), the partnership timing is right (Arize is actively expanding their consulting partner network), and the market positioning is complementary (orchestration + observability = complete agent engineering).
Time to make the call.