By James Aspinwall, co-written by Alfred Pennyworth (my trusted AI) — March 7, 2026, 17:25
The Thesis
WorkingAgents builds the agent orchestration layer — the brain that coordinates AI agents, tools, CRM, task management, and human communication. Arize AI builds the observability and evaluation layer — the eyes that watch what agents actually do, measure whether they’re doing it well, and surface where they fail.
These two platforms occupy adjacent, non-overlapping layers of the AI agent stack. Together, they form a complete build-observe-improve loop that neither can deliver alone.
What WorkingAgents Brings
WorkingAgents (“The Orchestrator”) is an MCP-powered agent orchestration platform built on Elixir OTP. It gives AI agents real tools to do real work:
- 50+ MCP tools spanning CRM, task management, content authoring, alarms, and system monitoring
- Multi-provider LLM integration — Anthropic Claude, OpenRouter, Perplexity, switchable at runtime
- Permission-aware tool execution — every tool call gated by a capability-based access control system
- Google A2A protocol support — agent-to-agent task delegation and skill discovery
- WhatsApp bridge — natural language tool invocation via messaging
- Per-user data isolation — separate SQLite databases per domain, per user
WorkingAgents solves the “what agents can do” problem. It’s the runtime where agents act.
What Arize AI Brings
Arize AI is an agent and AI engineering platform for observing, evaluating, and improving AI agents and LLM applications. Their stack includes:
- Arize AX — enterprise-grade observability with tracing at session, trace, and span levels
- Phoenix — open-source AI observability built on OpenTelemetry, with out-of-the-box support for Anthropic, OpenAI, LangChain, CrewAI, and dozens more
- OpenInference — an open standard for AI telemetry that extends OpenTelemetry with LLM-specific semantics
- Evaluator Hub — versioned, reusable evaluators with LLM-as-a-Judge templates for tool calling, hallucination, relevance, and correctness
- Prompt Playground — no-code environment for iterating on prompts and testing agent behaviors side by side
- Online Evaluations — continuous production monitoring that automatically tags spans with quality labels
Arize solves the “how well agents perform” problem. It’s the feedback loop that makes agents better.
Where the Synergy Lives
1. Tool Call Observability — The Immediate Win
WorkingAgents dispatches 50+ tool calls through its MCP server. Every call is a decision point: Did the agent pick the right tool? Did it pass the right parameters? Did the result make sense?
Today, WorkingAgents has basic monitoring (health endpoints, process counts, memory usage) but no distributed tracing of tool invocations. There’s no correlation ID linking a user’s chat message to the chain of tool calls it triggered.
Arize’s OpenInference tracing would give WorkingAgents exactly this. Each MCP tool call becomes a span. Each chat session becomes a trace. Each user becomes a session. Suddenly you can see:
- Which tools get called most (and least)
- Which tool sequences lead to successful outcomes
- Where agents get stuck in retry loops
- How long each tool takes and where latency hides
Integration path: Emit OpenTelemetry spans from MyMCPServer.Manager when dispatching tool calls. Arize Phoenix (self-hosted) or Arize AX (cloud) ingests the traces. Zero changes to business logic.
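To make the message-to-tool-call correlation concrete, here is a minimal stdlib-only Python sketch of the shape of that data. The production dispatcher is Elixir, and real spans would come from an OpenTelemetry SDK exporting to Phoenix or Arize AX; the `Span` dataclass and `dispatch_with_tracing` wrapper below are hypothetical names used purely for illustration.

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Span:
    """One MCP tool call, correlated to its parent chat message via trace_id."""
    trace_id: str          # shared by every tool call one chat message triggers
    name: str              # e.g. "mcp.tool.crm_lookup"
    attributes: dict = field(default_factory=dict)
    start_ns: int = 0
    end_ns: int = 0

def dispatch_with_tracing(trace_id: str, tool_name: str, params: dict, tool_fn):
    """Wrap a tool call in a span; in production this would be an
    OpenTelemetry span exported to a Phoenix or Arize AX collector."""
    span = Span(trace_id=trace_id, name=f"mcp.tool.{tool_name}",
                attributes={"tool.params": str(params)})
    span.start_ns = time.time_ns()
    try:
        result = tool_fn(params)
        span.attributes["status"] = "ok"
        return result, span
    except Exception as exc:
        span.attributes["status"] = "error"
        span.attributes["error.message"] = str(exc)
        raise
    finally:
        span.end_ns = time.time_ns()

# One chat message -> one trace_id -> one span per tool call it triggers
trace_id = uuid.uuid4().hex
result, span = dispatch_with_tracing(trace_id, "crm_lookup",
                                     {"contact": "Acme"}, lambda p: {"found": True})
```

The key point is the shared `trace_id`: it is the correlation ID the article notes is missing today, linking a user's chat message to every tool call it fans out to.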
2. Agent Quality Evaluation — The Strategic Win
WorkingAgents runs multi-turn conversations where agents use tools to accomplish tasks. But there’s no automated way to answer: “Was this interaction good?”
Arize’s Evaluator Hub provides exactly this capability:
- Tool Calling Evaluation — Did the agent select the right tool? Were the parameters correct? This maps directly to WorkingAgents’ 50+ MCP tools where misrouted calls waste time or produce wrong results.
- Path Convergence — Do different agents reach the same answer for the same question? Critical for WorkingAgents’ multi-provider architecture where users can switch between Claude, OpenRouter, and Perplexity mid-conversation.
- Hallucination Detection — When agents summarize articles or report CRM data, are they fabricating details? WorkingAgents’ Summary and NIS modules produce outputs that can be grounded against source data.
- Relevance Scoring — When a user asks for their sales pipeline and gets back task management data instead, that’s a relevance failure Arize can catch.
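To illustrate the input/output shape of a tool-calling evaluator, here is a rule-based stand-in in Python. Arize's Evaluator Hub uses LLM-as-a-Judge templates rather than hand-written rules; the `evaluate_tool_call` function and its record format are hypothetical, sketched only to show the kind of label-plus-explanation verdict such an evaluator attaches to a traced call.

```python
def evaluate_tool_call(record: dict, expected_tool: str, required_params: set) -> dict:
    """Rule-based stand-in for an LLM-as-a-Judge tool-calling evaluator:
    label a traced tool call as correct / incorrect with an explanation."""
    problems = []
    if record["tool"] != expected_tool:
        problems.append(f"selected {record['tool']!r}, expected {expected_tool!r}")
    missing = required_params - record["params"].keys()
    if missing:
        problems.append(f"missing params: {sorted(missing)}")
    return {"label": "correct" if not problems else "incorrect",
            "explanation": "; ".join(problems) or "tool and params match"}

# The relevance failure from above: a pipeline query routed to task tools
verdict = evaluate_tool_call(
    {"tool": "task_list", "params": {"user": "u1"}},
    expected_tool="crm_pipeline",
    required_params={"user"},
)
```

In production, the verdict would be written back onto the span as a quality label, which is exactly what Online Evaluations automate.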
3. Multi-Provider Model Comparison
WorkingAgents is provider-agnostic — users switch between Claude, OpenRouter models, and Perplexity at runtime. This creates a natural experiment: Which provider handles which tool-use patterns best?
Arize’s side-by-side evaluation in the Prompt Playground would let WorkingAgents benchmark providers against the same real-world prompts, measuring:
- Tool selection accuracy per provider
- Latency per provider per tool type
- Cost per successful task completion
- Hallucination rates across providers
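The four metrics above reduce to a simple aggregation over traced runs. Here is a stdlib Python sketch of that roll-up; the run-record fields (`correct_tool`, `latency_ms`, `cost_usd`, `success`) are assumed names, standing in for whatever the trace attributes actually carry.

```python
from collections import defaultdict

def compare_providers(runs):
    """Aggregate traced runs into per-provider benchmark metrics.
    Each run: {"provider", "correct_tool": bool, "latency_ms",
               "cost_usd", "success": bool}."""
    stats = defaultdict(lambda: {"n": 0, "correct": 0, "latency": 0.0,
                                 "cost": 0.0, "successes": 0})
    for r in runs:
        s = stats[r["provider"]]
        s["n"] += 1
        s["correct"] += r["correct_tool"]
        s["latency"] += r["latency_ms"]
        s["cost"] += r["cost_usd"]
        s["successes"] += r["success"]
    report = {}
    for provider, s in stats.items():
        report[provider] = {
            "tool_accuracy": s["correct"] / s["n"],
            "mean_latency_ms": s["latency"] / s["n"],
            "cost_per_success": (s["cost"] / s["successes"]
                                 if s["successes"] else float("inf")),
        }
    return report

runs = [
    {"provider": "claude", "correct_tool": True, "latency_ms": 900,
     "cost_usd": 0.02, "success": True},
    {"provider": "claude", "correct_tool": True, "latency_ms": 1100,
     "cost_usd": 0.02, "success": True},
    {"provider": "perplexity", "correct_tool": False, "latency_ms": 700,
     "cost_usd": 0.01, "success": False},
]
report = compare_providers(runs)
```

Note that cost-per-success, not cost-per-call, is the metric that matters: a cheap provider that never completes the task has infinite cost per outcome.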
This data directly informs which provider to default to for different task types — a competitive advantage for WorkingAgents’ consulting clients.
4. Permission and Security Auditing
WorkingAgents has a sophisticated access control system where every tool call is gated by capability-based permissions. Arize’s tracing would add a security dimension:
- Trace all permission-denied attempts alongside successful calls
- Detect patterns of permission escalation attempts
- Monitor whether temporary access keys are being used within their intended scope
- Audit tool usage patterns per user role
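As a sketch of the first two bullets, the following stdlib Python scans tool-call spans for bursts of permission denials per user, a crude stand-in for real escalation-pattern detection. The span fields and the `flag_escalation_attempts` helper are hypothetical names for illustration.

```python
from collections import Counter

def flag_escalation_attempts(spans, threshold=3):
    """Scan traced tool-call spans for permission-denied bursts.
    Each span: {"user", "tool", "outcome"}; outcome "denied" means the
    capability check rejected the call. Flags users whose denial count
    meets the threshold."""
    denials = Counter(s["user"] for s in spans if s["outcome"] == "denied")
    return {user: n for user, n in denials.items() if n >= threshold}

spans = (
    [{"user": "alice", "tool": "crm_export", "outcome": "denied"}] * 3
    + [{"user": "bob", "tool": "task_list", "outcome": "ok"}]
    + [{"user": "bob", "tool": "crm_export", "outcome": "denied"}]
)
flagged = flag_escalation_attempts(spans)
```

The precondition is simply that denied calls are traced at all; once denials land in the same span stream as successes, this kind of audit query is trivial.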
This transforms Arize from an observability tool into a security monitoring layer for WorkingAgents’ access control system.
5. A2A Protocol Observability
WorkingAgents implements Google’s Agent-to-Agent (A2A) protocol, allowing external agents to discover and invoke its tools as “skills.” As the A2A ecosystem grows, observability becomes critical:
- Trace cross-agent task delegation chains
- Measure external agent request patterns and load
- Evaluate whether incoming A2A tasks are being handled correctly
- Monitor A2A skill discovery and execution quality
Arize’s session-level tracing maps naturally to A2A task lifecycles — each A2A task becomes a trace, each skill invocation becomes a span.
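That task-to-trace, skill-to-span mapping can be sketched in a few lines of Python. The structures below are illustrative only, not the A2A wire format or an OpenTelemetry API; the point is that parent span IDs are what make a cross-agent delegation chain reconstructable.

```python
import uuid

def new_a2a_trace(task_id: str):
    """Each A2A task becomes one trace; each skill invocation becomes
    a span inside it."""
    return {"trace_id": uuid.uuid4().hex, "task_id": task_id, "spans": []}

def record_skill_span(trace, skill: str, caller_agent: str, parent_span_id=None):
    """Append a skill-invocation span; parent_span_id links delegation chains."""
    span = {"span_id": uuid.uuid4().hex, "parent_span_id": parent_span_id,
            "skill": skill, "caller_agent": caller_agent}
    trace["spans"].append(span)
    return span["span_id"]

# An external agent delegates a task; that skill in turn calls a second skill.
trace = new_a2a_trace(task_id="task-42")
root = record_skill_span(trace, "summarize_article", caller_agent="external-agent")
record_skill_span(trace, "fetch_source", caller_agent="workingagents",
                  parent_span_id=root)
```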
Partnership Models
Technology Integration Partner
The most natural first step. WorkingAgents integrates Arize Phoenix (open-source) or Arize AX (enterprise) as its observability backend:
- WorkingAgents gains: Production-grade tracing, evaluation, and debugging without building it in-house
- Arize gains: A reference implementation showing their platform with MCP-native agent orchestration on Elixir/OTP — a stack currently underrepresented in Arize’s ecosystem
Consulting Channel Partner
James is building an AI consulting firm focused on AI integration for medium-size companies. Arize has an active GSI/consulting partner program (they recently partnered with Infogain for exactly this model):
- WorkingAgents consulting delivers: Agent orchestration, tool design, MCP integration, custom workflows
- Arize delivers: Observability, evaluation, production monitoring, quality assurance
- Joint value proposition: “We build your AI agents AND give you visibility into how they perform” — a complete package that neither offers alone
Co-Marketing / Case Study
WorkingAgents’ architecture — Elixir OTP, MCP protocol, multi-provider LLM, A2A interop — is technically distinctive. A joint case study showing Arize observability on a non-Python, non-TypeScript agent platform would differentiate both companies:
- Demonstrates Arize’s language-agnostic claims (OpenTelemetry works everywhere)
- Showcases WorkingAgents as a serious orchestration platform
- Creates content for both companies’ marketing pipelines
The Gap Analysis — What Each Needs From the Other
| WorkingAgents Gap | Arize Solution |
|---|---|
| No distributed tracing | OpenInference spans + Phoenix/AX collector |
| No tool call quality metrics | Evaluator Hub with tool-calling templates |
| No provider comparison framework | Prompt Playground side-by-side evaluation |
| No regression testing for agent behavior | Online Evaluations with continuous monitoring |
| No cost-per-outcome tracking | Span-level token usage and latency metrics |
| Arize Gap | WorkingAgents Solution |
|---|---|
| Limited MCP-native examples | Full 50+ tool MCP server implementation |
| Few Elixir/BEAM ecosystem references | Production Elixir OTP agent orchestration |
| Need consulting channel partners | AI consulting firm with Florida presence |
| Need A2A protocol observability stories | Working A2A implementation with skill discovery |
Recommended Next Steps
- Prototype integration — Add OpenTelemetry span emission to WorkingAgents’ MCP dispatcher. Point at a self-hosted Phoenix instance. Prove the tracing works end-to-end in a week.
- Reach out to Arize partnerships — Arize is actively hiring a GSI/Consulting Partnerships Manager and recently launched partnerships with Infogain and Google Cloud. The timing is right for new consulting partners.
- Build a demo — Record a session showing: user sends WhatsApp message → agent selects tools → tools execute → results returned — all visible in Arize’s trace view. This becomes the pitch deck for joint consulting engagements.
- Propose a case study — “MCP Agent Observability on Elixir OTP” is a story nobody else is telling. Arize’s content team would likely be interested.
Conclusion
WorkingAgents and Arize AI are two sides of the same coin. One builds the agent runtime, the other builds the agent feedback loop. Neither competes with the other. Both are stronger together.
For James’s consulting firm, the combination is particularly powerful: walk into a client meeting offering both “we’ll build your AI agents” and “we’ll show you exactly how they perform.” That’s a hard pitch to say no to.
The integration is technically straightforward (OpenTelemetry is protocol-level, not language-level), the partnership timing is right (Arize is actively expanding their consulting partner network), and the market positioning is complementary (orchestration + observability = complete agent engineering).
Time to make the call.