WorkingAgents + Deepchecks: Evaluating the Agents That Run Your Business

By James Aspinwall, co-written by Alfred Pennyworth (my trusted AI) — March 7, 2026, 17:40


The Problem Neither Solves Alone

WorkingAgents orchestrates AI agents that manage real business operations — CRM, task management, content, communications. These agents make decisions that affect relationships, deadlines, and revenue. But there’s no systematic way to answer: “Are my agents making good decisions?”

Deepchecks answers that question. Their platform evaluates agent workflows interaction by interaction, scoring planning quality, tool selection accuracy, and response groundedness. They find the failure patterns you can’t see by reading logs.

WorkingAgents builds agents that act. Deepchecks builds systems that judge whether those actions were right. The combination closes the loop.


What Each Company Does

WorkingAgents — The Agent Runtime

WorkingAgents (“The Orchestrator”) is an Elixir OTP platform that gives AI agents real tools to operate a business:

Deepchecks — The Evaluation Engine

Deepchecks is an AI evaluation platform purpose-built for agents and LLM applications:


Where the Synergy Lives

1. Tool Call Quality Scoring — The Core Fit

WorkingAgents dispatches 50+ MCP tools. Every dispatch is a three-part decision: Did the agent choose the right tool? Did it pass correct parameters? Did the tool response actually help?

Deepchecks evaluates exactly these three dimensions with built-in agent span properties:

Each property is scored 0–5. Over time, patterns emerge. Maybe the agent consistently ignores the nis_pipeline tool when users ask about sales status, defaulting to nis_list_contacts instead. Maybe task_capture parses natural language correctly 92% of the time but fails on relative dates like “next Thursday.” These are the systematic issues that individual log inspection will never surface.
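Surfacing those patterns is, at bottom, an aggregation over scored spans. The sketch below is a minimal illustration in Python (the actual dispatcher is Elixir, and the record shape and field names here are invented, not Deepchecks' schema): group tool-coverage scores by tool for a given user intent, so a consistently low-scoring tool choice stands out.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical scored spans, shaped loosely like evaluation output:
# each record carries the user intent, the tool invoked, and a 0-5
# tool-coverage score. Field names are illustrative.
scored_spans = [
    {"intent": "sales status", "tool": "nis_list_contacts", "tool_coverage": 1.0},
    {"intent": "sales status", "tool": "nis_pipeline", "tool_coverage": 5.0},
    {"intent": "sales status", "tool": "nis_list_contacts", "tool_coverage": 2.0},
    {"intent": "create task", "tool": "task_capture", "tool_coverage": 4.5},
]

def coverage_by_tool(spans, intent):
    """Average tool-coverage score per tool for one user intent."""
    buckets = defaultdict(list)
    for s in spans:
        if s["intent"] == intent:
            buckets[s["tool"]].append(s["tool_coverage"])
    return {tool: mean(scores) for tool, scores in buckets.items()}

# A low average for nis_list_contacts on "sales status" intents flags
# exactly the wrong-tool-selection pattern described above.
report = coverage_by_tool(scored_spans, "sales status")
print(report)
```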

Integration path: Deepchecks uses OpenTelemetry and OpenInference for trace capture. WorkingAgents’ MCP dispatcher emits spans per tool call. Deepchecks ingests, scores, and surfaces patterns — no changes to business logic required.
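To make the integration path concrete: each tool dispatch becomes one trace span. The snippet below sketches the shape of such a span as a plain Python dict (the real dispatcher is Elixir, and the span name and attribute keys here are illustrative approximations of the OpenInference "tool" span kind, not the exact schema):

```python
import json
import time
import uuid

def tool_call_span(tool_name, params, parent_id=None):
    """Minimal OpenInference-flavored span for one MCP tool dispatch.

    Attribute keys loosely follow the OpenInference 'tool' span kind;
    treat them as an assumption, not the published schema.
    """
    return {
        "span_id": uuid.uuid4().hex[:16],
        "parent_span_id": parent_id,
        "name": f"mcp.dispatch/{tool_name}",
        "start_time_unix_nano": time.time_ns(),
        "attributes": {
            "openinference.span.kind": "TOOL",
            "tool.name": tool_name,
            "tool.parameters": json.dumps(params),
        },
    }

# A dispatch of a hypothetical CRM lookup tool:
span = tool_call_span("nis_get_contact", {"id": 42})
print(span["name"])
```

Once spans like this reach the collector, the evaluation side needs no knowledge of the Elixir internals: scoring happens entirely on the trace.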

2. Hallucination Detection on Grounded Data

WorkingAgents agents frequently work with grounded data — real contacts, real companies, real task lists. When an agent says “You have 3 overdue tasks with priority above 7,” that’s either true or it isn’t. When it summarizes an article, every claim is verifiable against the source.

This is where Deepchecks’ ORION hallucination detection excels. Their “Grounded in Context” framework was built for exactly this scenario — production-scale validation that each factual statement in an output is entailed by the provided context. It doesn’t just flag “this might be wrong.” It pinpoints the specific claim where the hallucination occurred.
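A toy illustration of why grounded data makes claims mechanically checkable (the task records, field names, and check are all invented for this sketch; real claim extraction and entailment is what ORION does):

```python
from datetime import date

# Hypothetical task records the agent had in context.
tasks = [
    {"title": "Invoice ACME", "due": date(2026, 2, 20), "priority": 9},
    {"title": "Renew SSL cert", "due": date(2026, 3, 1), "priority": 8},
    {"title": "Draft proposal", "due": date(2026, 3, 10), "priority": 8},
    {"title": "Team 1:1s", "due": date(2026, 2, 28), "priority": 5},
]

def check_overdue_claim(tasks, claimed_count, min_priority, today):
    """Verify a claim like '3 overdue tasks with priority above 7'
    against the actual records, pinpointing the mismatch if any."""
    actual = sum(
        1 for t in tasks if t["due"] < today and t["priority"] > min_priority
    )
    return {"claimed": claimed_count, "actual": actual,
            "grounded": actual == claimed_count}

# The agent claimed 3; the data supports only 2 -- a caught hallucination.
verdict = check_overdue_claim(
    tasks, claimed_count=3, min_priority=7, today=date(2026, 3, 7)
)
print(verdict)
```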

For WorkingAgents, this means:

3. Know Your Agent — The Executive Dashboard

Deepchecks’ KYA feature generates a comprehensive strengths-and-weaknesses report for any agent. For WorkingAgents, this translates directly to business intelligence:

Tool Usage Analysis:

Failure Mode Analysis:

LLM Behavior Insights:

For James’s consulting firm, KYA reports become client deliverables. Deploy agents for a client, run KYA analysis, present a clear report: “Your agent handles appointment scheduling well (4.2/5 planning, 4.5/5 tool coverage) but struggles with multi-step data lookups (2.8/5 planning, 3.1/5 tool completeness). Here’s our improvement plan.”

4. Multi-Provider Model Comparison

WorkingAgents supports Claude, OpenRouter (dozens of models), and Perplexity. Users switch providers at runtime. Deepchecks’ version comparison feature was built for exactly this scenario.

Run identical task sequences against different providers. Deepchecks scores each version on the same rubric:

| Metric             | Claude Sonnet | GPT-4o | Llama 3.3 |
|--------------------|---------------|--------|-----------|
| Plan Efficiency    | 4.5           | 4.2    | 3.8       |
| Tool Coverage      | 4.8           | 4.0    | 3.5       |
| Tool Completeness  | 4.3           | 4.4    | 3.2       |
| Hallucination Rate | 2%            | 5%     | 11%       |
| Avg Latency        | 1.2s          | 0.9s   | 0.6s      |
| Cost per Session   | $0.08         | $0.06  | $0.01     |
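Numbers like these only become a recommendation once you weight them. A minimal sketch of that step, using the table's figures (the weighting scheme itself is an assumption a real engagement would tune to the client's quality/latency/cost priorities):

```python
# Scores and rates transcribed from the comparison table above.
providers = {
    "claude-sonnet": {"plan": 4.5, "coverage": 4.8, "completeness": 4.3,
                      "halluc_rate": 0.02, "cost": 0.08},
    "gpt-4o":        {"plan": 4.2, "coverage": 4.0, "completeness": 4.4,
                      "halluc_rate": 0.05, "cost": 0.06},
    "llama-3.3":     {"plan": 3.8, "coverage": 3.5, "completeness": 3.2,
                      "halluc_rate": 0.11, "cost": 0.01},
}

def score(m):
    # Illustrative weighting: average the three quality metrics,
    # penalize hallucinations heavily, penalize cost lightly.
    quality = (m["plan"] + m["coverage"] + m["completeness"]) / 3
    return quality - 20 * m["halluc_rate"] - 10 * m["cost"]

best = max(providers, key=lambda name: score(providers[name]))
print(best)
```

Shift the hallucination penalty down and the cost penalty up and a cheaper model can win, which is precisely the conversation the scored evidence enables.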

This data answers the question every AI consulting client asks: “Which model should we use?” Not with opinion — with scored evidence on their actual workflows.

5. Production Monitoring for Consulting Clients

WorkingAgents is the foundation of James’s AI consulting firm. Each client deployment is an agent instance with custom tools and configurations. Deepchecks provides the production monitoring layer:

This means WorkingAgents consulting can offer “managed AI operations” — deploy agents, monitor quality continuously, fix degradation proactively — as a recurring revenue service rather than one-time project work.

6. Session-Level Debugging

When a WorkingAgents user reports “the agent gave me wrong information about my contacts,” today’s debugging path is manual: check logs, read the chat history, trace what happened. Deepchecks transforms this into structured forensics.

Their session-level view shows the complete execution tree — how the agent’s reasoning flowed from prompt to planning to tool call to response. Each span gets individual scoring with step-by-step reasoning: what happened, why the evaluation scored it that way, and where things went off track.

For a platform like WorkingAgents where agents handle real business data across CRM, tasks, and communications, this isn’t a nice-to-have. It’s the difference between “something went wrong” and “the agent called nis_get_contact with ID 42, got the right data, then hallucinated the company name in its response — here’s the exact claim that failed grounding.”
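The forensic value comes from the tree structure itself. A minimal sketch of that walk, assuming a hypothetical span shape with a per-span groundedness score (not Deepchecks' actual data model): find the first span whose score fell below threshold, with the path that led there.

```python
# Hypothetical session tree: the agent fetched correct contact data,
# then hallucinated in the final response -- the scenario described above.
session = {
    "name": "user_turn", "groundedness": 5.0, "children": [
        {"name": "plan", "groundedness": 4.8, "children": []},
        {"name": "tool:nis_get_contact(id=42)", "groundedness": 5.0, "children": []},
        {"name": "respond", "groundedness": 1.2, "children": []},
    ],
}

def first_failure(span, threshold=3.0, path=()):
    """Depth-first walk; returns the path to the first span scoring
    below threshold, or None if the whole session passes."""
    path = path + (span["name"],)
    if span["groundedness"] < threshold:
        return path
    for child in span["children"]:
        hit = first_failure(child, threshold, path)
        if hit:
            return hit
    return None

print(first_failure(session))
```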


The Gap Analysis

| WorkingAgents Gap | Deepchecks Solution |
|---|---|
| No systematic quality scoring | Built-in Plan Efficiency, Tool Coverage, Tool Completeness metrics |
| No hallucination detection on CRM/task data | ORION groundedness scoring with claim-level pinpointing |
| No agent strengths/weaknesses reporting | KYA automated analysis with failure mode surfacing |
| No provider comparison framework | Version comparison with identical inputs across models |
| No production quality monitoring | Continuous evaluation with drift detection and alerting |
| No compliance certification for client deployments | SOC 2 Type 2, GDPR, HIPAA pre-certified |

| Deepchecks Gap | WorkingAgents Solution |
|---|---|
| Need real-world agent orchestration references | 50+ tool MCP server with production business workflows |
| Need non-Python/non-LangChain ecosystem examples | Elixir OTP agent orchestration — unique in the market |
| Need consulting channel partners | AI consulting firm deploying agents for medium-size companies |
| Need complex tool-calling evaluation scenarios | CRM + task + content + communication tool chains |
| Need multi-provider evaluation stories | Runtime-switchable Claude/OpenRouter/Perplexity architecture |

Partnership Models

Technology Integration Partner

The natural first step. WorkingAgents integrates Deepchecks as its evaluation backend:

  1. Emit OpenTelemetry spans from the MCP dispatcher
  2. Deepchecks ingests traces, runs evaluation agents, scores interactions
  3. KYA reports surface in the WorkingAgents admin dashboard
  4. Hallucination detection runs on all grounded-data responses

Deepchecks gains: A reference customer on Elixir/OTP with a unique multi-provider MCP architecture — expanding their ecosystem beyond Python-centric frameworks.

WorkingAgents gains: Enterprise-grade evaluation without building it in-house. Compliance certifications by association. A concrete quality story for consulting clients.

Consulting Reseller / Referral Partner

Deepchecks partners with NVIDIA (Inception program credits) and is available on AWS Marketplace. A similar model with WorkingAgents consulting:

Co-Development: MCP Evaluation Toolkit

Deepchecks currently integrates with CrewAI, LangChain, LlamaIndex, and LangGraph. There’s no MCP-native evaluation integration. WorkingAgents could collaborate on:

This positions both companies at the forefront of MCP evaluation — a market segment that barely exists today but will matter more as MCP adoption grows.


Why Deepchecks Over Alternatives

Deepchecks’ key differentiator for WorkingAgents is the swarm evaluation architecture. Rather than using a single LLM-as-judge (which introduces its own biases), Deepchecks deploys multiple small language models as specialists — one for hallucination detection, another for planning accuracy, another for rule-based checks. This Mixture of Experts approach produces more reliable scores than any single evaluator.
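The structural idea behind a specialist swarm can be shown with rule-stub judges (Deepchecks' actual specialists are small language models; the judge names, thresholds, and rules below are invented for illustration): each specialist scores only its own dimension, and the session passes only if all of them pass.

```python
# Stub specialists, one per evaluation dimension. Real specialists
# would be model-based; these are hand-written rules for illustration.
def hallucination_judge(interaction):
    return interaction["unsupported_claims"] == 0

def planning_judge(interaction):
    return interaction["plan_efficiency"] >= 3.0

def rules_judge(interaction):
    # Example hard rule: no sensitive identifiers in responses.
    return "ssn" not in interaction["response"].lower()

SPECIALISTS = [hallucination_judge, planning_judge, rules_judge]

def evaluate(interaction):
    """Run every specialist; the session passes only unanimously."""
    verdicts = {j.__name__: j(interaction) for j in SPECIALISTS}
    return {"verdicts": verdicts, "passed": all(verdicts.values())}

# One unsupported claim is enough to fail, even with strong planning.
result = evaluate({
    "unsupported_claims": 1,
    "plan_efficiency": 4.1,
    "response": "Contact updated.",
})
print(result["passed"])
```

The design point is separability: a failure names which dimension broke, rather than collapsing everything into one opaque judge score.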

For a platform like WorkingAgents where agents handle real business operations — not toy demos — evaluation reliability matters. A false positive on hallucination detection in CRM data erodes trust. A missed failure in task management causes real deadlines to slip. Deepchecks’ multi-model evaluation swarm reduces these risks.

The compliance certifications (SOC 2, GDPR, HIPAA) are the other differentiator. WorkingAgents’ consulting clients — medium-size companies integrating AI — will ask about compliance. Having a pre-certified evaluation partner removes a sales objection before it’s raised.


Recommended Next Steps

  1. Prototype — Add OpenTelemetry span emission to WorkingAgents’ MCP dispatcher. Point at Deepchecks’ cloud platform. Score a week of real interactions. See what the KYA report reveals.

  2. Contact Deepchecks partnerships — Their partnerships page is a demo-booking form, not a formal program page. This suggests they’re still building their partner ecosystem — early partners get more attention and better terms.

  3. Build evaluation benchmarks — Create a reference set of 100 representative WorkingAgents sessions across CRM, tasks, and content. Run Deepchecks evaluation. Use the results as the baseline for improvement.

  4. Package for consulting — Design a “Managed AI Operations” service tier that bundles WorkingAgents orchestration with Deepchecks evaluation. Recurring monthly monitoring becomes the revenue model, not one-time deployment.


Conclusion

WorkingAgents and Deepchecks address different halves of the same problem. WorkingAgents makes agents useful — giving them real tools to manage real business operations. Deepchecks makes agents trustworthy — evaluating whether they’re actually using those tools well.

For an AI consulting firm, the combination is the complete offering: “We deploy AI agents for your business, and we prove they work.” The evaluation data isn’t just quality assurance — it’s the sales collateral for the next client. “Here’s what our agents scored on your competitor’s workflow. Here’s where we improved. Here’s the compliance report.”

Deepchecks’ swarm evaluation, ORION hallucination detection, and KYA reporting give WorkingAgents the quality infrastructure that would take months to build internally. WorkingAgents gives Deepchecks a production MCP reference implementation on a technically distinctive stack. Both get a consulting partnership model that generates recurring revenue.

The partnership case writes itself. Now it’s a matter of making the call.

