WorkingAgents + Deepchecks: Evaluating the Agents That Run Your Business

By James Aspinwall, co-written by Alfred Pennyworth (my trusted AI) — March 7, 2026, 17:40


The Problem Neither Solves Alone

WorkingAgents orchestrates AI agents that manage real business operations — CRM, task management, content, communications. These agents make decisions that affect relationships, deadlines, and revenue. But there’s no systematic way to answer: “Are my agents making good decisions?”

Deepchecks answers that question. Their platform evaluates agent workflows interaction by interaction, scoring planning quality, tool selection accuracy, and response groundedness. They find the failure patterns you can’t see by reading logs.

WorkingAgents builds agents that act. Deepchecks builds systems that judge whether those actions were right. The combination closes the loop.


What Each Company Does

WorkingAgents — The Agent Runtime

WorkingAgents (“The Orchestrator”) is an Elixir OTP platform that gives AI agents real tools to operate a business:

Deepchecks — The Evaluation Engine

Deepchecks is an AI evaluation platform purpose-built for agents and LLM applications:


Where the Synergy Lives

1. Tool Call Quality Scoring — The Core Fit

WorkingAgents dispatches 50+ MCP tools. Every dispatch is a three-part decision: Did the agent choose the right tool? Did it pass correct parameters? Did the tool response actually help?

Deepchecks evaluates exactly these three dimensions with built-in agent span properties:

Each property is scored 0–5. Over time, patterns emerge. Maybe the agent consistently ignores the nis_pipeline tool when users ask about sales status, defaulting to nis_list_contacts instead. Maybe task_capture parses natural language correctly 92% of the time but fails on relative dates like “next Thursday.” These are the systematic issues that individual log inspection will never surface.
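Surfacing those patterns is, at bottom, an aggregation over scored spans. The sketch below is a minimal illustration in Python (the actual dispatcher is Elixir, and the record shape and field names here are invented, not Deepchecks' schema): group tool-coverage scores by tool for a given user intent, so a consistently low-scoring tool choice stands out.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical scored spans, shaped loosely like evaluation output:
# each record carries the user intent, the tool invoked, and a 0-5
# tool-coverage score. Field names are illustrative.
scored_spans = [
    {"intent": "sales status", "tool": "nis_list_contacts", "tool_coverage": 1.0},
    {"intent": "sales status", "tool": "nis_pipeline", "tool_coverage": 5.0},
    {"intent": "sales status", "tool": "nis_list_contacts", "tool_coverage": 2.0},
    {"intent": "create task", "tool": "task_capture", "tool_coverage": 4.5},
]

def coverage_by_tool(spans, intent):
    """Average tool-coverage score per tool for one user intent."""
    buckets = defaultdict(list)
    for s in spans:
        if s["intent"] == intent:
            buckets[s["tool"]].append(s["tool_coverage"])
    return {tool: mean(scores) for tool, scores in buckets.items()}

# A low average for nis_list_contacts on "sales status" intents flags
# exactly the wrong-tool-selection pattern described above.
report = coverage_by_tool(scored_spans, "sales status")
print(report)
```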

Integration path: Deepchecks uses OpenTelemetry and OpenInference for trace capture. WorkingAgents’ MCP dispatcher emits spans per tool call. Deepchecks ingests, scores, and surfaces patterns — no changes to business logic required.
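To make the integration path concrete: each tool dispatch becomes one trace span. The snippet below sketches the shape of such a span as a plain Python dict (the real dispatcher is Elixir, and the span name and attribute keys here are illustrative approximations of the OpenInference "tool" span kind, not the exact schema):

```python
import json
import time
import uuid

def tool_call_span(tool_name, params, parent_id=None):
    """Minimal OpenInference-flavored span for one MCP tool dispatch.

    Attribute keys loosely follow the OpenInference 'tool' span kind;
    treat them as an assumption, not the published schema.
    """
    return {
        "span_id": uuid.uuid4().hex[:16],
        "parent_span_id": parent_id,
        "name": f"mcp.dispatch/{tool_name}",
        "start_time_unix_nano": time.time_ns(),
        "attributes": {
            "openinference.span.kind": "TOOL",
            "tool.name": tool_name,
            "tool.parameters": json.dumps(params),
        },
    }

# A dispatch of a hypothetical CRM lookup tool:
span = tool_call_span("nis_get_contact", {"id": 42})
print(span["name"])
```

Once spans like this reach the collector, the evaluation side needs no knowledge of the Elixir internals: scoring happens entirely on the trace.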

2. Hallucination Detection on Grounded Data

WorkingAgents agents frequently work with grounded data — real contacts, real companies, real task lists. When an agent says “You have 3 overdue tasks with priority above 7,” that’s either true or it isn’t. When it summarizes an article, every claim is verifiable against the source.

This is where Deepchecks’ ORION hallucination detection excels. Their “Grounded in Context” framework was built for exactly this scenario — production-scale validation that each factual statement in an output is entailed by the provided context. It doesn’t just flag “this might be wrong.” It pinpoints the specific claim where the hallucination occurred.
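A toy illustration of why grounded data makes claims mechanically checkable (the task records, field names, and check are all invented for this sketch; real claim extraction and entailment is what ORION does):

```python
from datetime import date

# Hypothetical task records the agent had in context.
tasks = [
    {"title": "Invoice ACME", "due": date(2026, 2, 20), "priority": 9},
    {"title": "Renew SSL cert", "due": date(2026, 3, 1), "priority": 8},
    {"title": "Draft proposal", "due": date(2026, 3, 10), "priority": 8},
    {"title": "Team 1:1s", "due": date(2026, 2, 28), "priority": 5},
]

def check_overdue_claim(tasks, claimed_count, min_priority, today):
    """Verify a claim like '3 overdue tasks with priority above 7'
    against the actual records, pinpointing the mismatch if any."""
    actual = sum(
        1 for t in tasks if t["due"] < today and t["priority"] > min_priority
    )
    return {"claimed": claimed_count, "actual": actual,
            "grounded": actual == claimed_count}

# The agent claimed 3; the data supports only 2 -- a caught hallucination.
verdict = check_overdue_claim(
    tasks, claimed_count=3, min_priority=7, today=date(2026, 3, 7)
)
print(verdict)
```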

For WorkingAgents, this means:

3. Know Your Agent — The Executive Dashboard

Deepchecks’ KYA feature generates a comprehensive strengths-and-weaknesses report for any agent. For WorkingAgents, this translates directly to business intelligence:

Tool Usage Analysis:

Failure Mode Analysis:

LLM Behavior Insights:

For James’s consulting firm, KYA reports become client deliverables. Deploy agents for a client, run KYA analysis, present a clear report: “Your agent handles appointment scheduling well (4.2/5 planning, 4.5/5 tool coverage) but struggles with multi-step data lookups (2.8/5 planning, 3.1/5 tool completeness). Here’s our improvement plan.”

4. Multi-Provider Model Comparison

WorkingAgents supports Claude, OpenRouter (dozens of models), and Perplexity. Users switch providers at runtime. Deepchecks’ version comparison feature was built for exactly this scenario.

Run identical task sequences against different providers. Deepchecks scores each version on the same rubric:

| Metric             | Claude Sonnet | GPT-4o | Llama 3.3 |
|--------------------|---------------|--------|-----------|
| Plan Efficiency    | 4.5           | 4.2    | 3.8       |
| Tool Coverage      | 4.8           | 4.0    | 3.5       |
| Tool Completeness  | 4.3           | 4.4    | 3.2       |
| Hallucination Rate | 2%            | 5%     | 11%       |
| Avg Latency        | 1.2s          | 0.9s   | 0.6s      |
| Cost per Session   | $0.08         | $0.06  | $0.01     |
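Numbers like these only become a recommendation once you weight them. A minimal sketch of that step, using the table's figures (the weighting scheme itself is an assumption a real engagement would tune to the client's quality/latency/cost priorities):

```python
# Scores and rates transcribed from the comparison table above.
providers = {
    "claude-sonnet": {"plan": 4.5, "coverage": 4.8, "completeness": 4.3,
                      "halluc_rate": 0.02, "cost": 0.08},
    "gpt-4o":        {"plan": 4.2, "coverage": 4.0, "completeness": 4.4,
                      "halluc_rate": 0.05, "cost": 0.06},
    "llama-3.3":     {"plan": 3.8, "coverage": 3.5, "completeness": 3.2,
                      "halluc_rate": 0.11, "cost": 0.01},
}

def score(m):
    # Illustrative weighting: average the three quality metrics,
    # penalize hallucinations heavily, penalize cost lightly.
    quality = (m["plan"] + m["coverage"] + m["completeness"]) / 3
    return quality - 20 * m["halluc_rate"] - 10 * m["cost"]

best = max(providers, key=lambda name: score(providers[name]))
print(best)
```

Shift the hallucination penalty down and the cost penalty up and a cheaper model can win, which is precisely the conversation the scored evidence enables.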

This data answers the question every AI consulting client asks: “Which model should we use?” Not with opinion — with scored evidence on their actual workflows.

5. Production Monitoring for Consulting Clients

WorkingAgents is the foundation of James’s AI consulting firm. Each client deployment is an agent instance with custom tools and configurations. Deepchecks provides the production monitoring layer:

This means WorkingAgents consulting can offer “managed AI operations” — deploy agents, monitor quality continuously, fix degradation proactively — as a recurring revenue service rather than one-time project work.

6. Session-Level Debugging

When a WorkingAgents user reports “the agent gave me wrong information about my contacts,” today’s debugging path is manual: check logs, read the chat history, trace what happened. Deepchecks transforms this into structured forensics.

Their session-level view shows the complete execution tree — how the agent’s reasoning flowed from prompt to planning to tool call to response. Each span gets individual scoring with step-by-step reasoning: what happened, why the evaluation scored it that way, and where things went off track.

For a platform like WorkingAgents where agents handle real business data across CRM, tasks, and communications, this isn’t a nice-to-have. It’s the difference between “something went wrong” and “the agent called nis_get_contact with ID 42, got the right data, then hallucinated the company name in its response — here’s the exact claim that failed grounding.”
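The forensic value comes from the tree structure itself. A minimal sketch of that walk, assuming a hypothetical span shape with a per-span groundedness score (not Deepchecks' actual data model): find the first span whose score fell below threshold, with the path that led there.

```python
# Hypothetical session tree: the agent fetched correct contact data,
# then hallucinated in the final response -- the scenario described above.
session = {
    "name": "user_turn", "groundedness": 5.0, "children": [
        {"name": "plan", "groundedness": 4.8, "children": []},
        {"name": "tool:nis_get_contact(id=42)", "groundedness": 5.0, "children": []},
        {"name": "respond", "groundedness": 1.2, "children": []},
    ],
}

def first_failure(span, threshold=3.0, path=()):
    """Depth-first walk; returns the path to the first span scoring
    below threshold, or None if the whole session passes."""
    path = path + (span["name"],)
    if span["groundedness"] < threshold:
        return path
    for child in span["children"]:
        hit = first_failure(child, threshold, path)
        if hit:
            return hit
    return None

print(first_failure(session))
```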


The Gap Analysis

| WorkingAgents Gap | Deepchecks Solution |
|---|---|
| No systematic quality scoring | Built-in Plan Efficiency, Tool Coverage, Tool Completeness metrics |
| No hallucination detection on CRM/task data | ORION groundedness scoring with claim-level pinpointing |
| No agent strengths/weaknesses reporting | KYA automated analysis with failure mode surfacing |
| No provider comparison framework | Version comparison with identical inputs across models |
| No production quality monitoring | Continuous evaluation with drift detection and alerting |
| No compliance certification for client deployments | SOC 2 Type 2, GDPR, HIPAA pre-certified |

| Deepchecks Gap | WorkingAgents Solution |
|---|---|
| Need real-world agent orchestration references | 50+ tool MCP server with production business workflows |
| Need non-Python/non-LangChain ecosystem examples | Elixir OTP agent orchestration — unique in the market |
| Need consulting channel partners | AI consulting firm deploying agents for medium-size companies |
| Need complex tool-calling evaluation scenarios | CRM + task + content + communication tool chains |
| Need multi-provider evaluation stories | Runtime-switchable Claude/OpenRouter/Perplexity architecture |

Partnership Models

Technology Integration Partner

The natural first step. WorkingAgents integrates Deepchecks as its evaluation backend:

  1. Emit OpenTelemetry spans from the MCP dispatcher
  2. Deepchecks ingests traces, runs evaluation agents, scores interactions
  3. KYA reports surface in the WorkingAgents admin dashboard
  4. Hallucination detection runs on all grounded-data responses

Deepchecks gains: A reference customer on Elixir/OTP with a unique multi-provider MCP architecture — expanding their ecosystem beyond Python-centric frameworks.

WorkingAgents gains: Enterprise-grade evaluation without building it in-house. Compliance certifications by association. A concrete quality story for consulting clients.

Consulting Reseller / Referral Partner

Deepchecks partners with NVIDIA (Inception program credits) and is available on AWS Marketplace. A similar model with WorkingAgents consulting:

Co-Development: MCP Evaluation Toolkit

Deepchecks currently integrates with CrewAI, LangChain, LlamaIndex, and LangGraph. There’s no MCP-native evaluation integration. WorkingAgents could collaborate on:

This positions both companies at the forefront of MCP evaluation — a market segment that barely exists today but will matter more as MCP adoption grows.


Why Deepchecks Over Alternatives

Deepchecks’ key differentiator for WorkingAgents is the swarm evaluation architecture. Rather than using a single LLM-as-judge (which introduces its own biases), Deepchecks deploys multiple small language models as specialists — one for hallucination detection, another for planning accuracy, another for rule-based checks. This Mixture of Experts approach produces more reliable scores than any single evaluator.
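The structural idea behind a specialist swarm can be shown with rule-stub judges (Deepchecks' actual specialists are small language models; the judge names, thresholds, and rules below are invented for illustration): each specialist scores only its own dimension, and the session passes only if all of them pass.

```python
# Stub specialists, one per evaluation dimension. Real specialists
# would be model-based; these are hand-written rules for illustration.
def hallucination_judge(interaction):
    return interaction["unsupported_claims"] == 0

def planning_judge(interaction):
    return interaction["plan_efficiency"] >= 3.0

def rules_judge(interaction):
    # Example hard rule: no sensitive identifiers in responses.
    return "ssn" not in interaction["response"].lower()

SPECIALISTS = [hallucination_judge, planning_judge, rules_judge]

def evaluate(interaction):
    """Run every specialist; the session passes only unanimously."""
    verdicts = {j.__name__: j(interaction) for j in SPECIALISTS}
    return {"verdicts": verdicts, "passed": all(verdicts.values())}

# One unsupported claim is enough to fail, even with strong planning.
result = evaluate({
    "unsupported_claims": 1,
    "plan_efficiency": 4.1,
    "response": "Contact updated.",
})
print(result["passed"])
```

The design point is separability: a failure names which dimension broke, rather than collapsing everything into one opaque judge score.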

For a platform like WorkingAgents where agents handle real business operations — not toy demos — evaluation reliability matters. A false positive on hallucination detection in CRM data erodes trust. A missed failure in task management causes real deadlines to slip. Deepchecks’ multi-model evaluation swarm reduces these risks.

The compliance certifications (SOC 2, GDPR, HIPAA) are the other differentiator. WorkingAgents’ consulting clients — medium-size companies integrating AI — will ask about compliance. Having a pre-certified evaluation partner removes a sales objection before it’s raised.


Recommended Next Steps

  1. Prototype — Add OpenTelemetry span emission to WorkingAgents’ MCP dispatcher. Point at Deepchecks’ cloud platform. Score a week of real interactions. See what the KYA report reveals.

  2. Contact Deepchecks partnerships — Their partnerships page is a demo-booking form, not a formal program page. This suggests they’re still building their partner ecosystem — early partners get more attention and better terms.

  3. Build evaluation benchmarks — Create a reference set of 100 representative WorkingAgents sessions across CRM, tasks, and content. Run Deepchecks evaluation. Use the results as the baseline for improvement.

  4. Package for consulting — Design a “Managed AI Operations” service tier that bundles WorkingAgents orchestration with Deepchecks evaluation. Recurring monthly monitoring becomes the revenue model, not one-time deployment.


Conclusion

WorkingAgents and Deepchecks address different halves of the same problem. WorkingAgents makes agents useful — giving them real tools to manage real business operations. Deepchecks makes agents trustworthy — evaluating whether they’re actually using those tools well.

For an AI consulting firm, the combination is the complete offering: “We deploy AI agents for your business, and we prove they work.” The evaluation data isn’t just quality assurance — it’s the sales collateral for the next client. “Here’s what our agents scored on your competitor’s workflow. Here’s where we improved. Here’s the compliance report.”

Deepchecks’ swarm evaluation, ORION hallucination detection, and KYA reporting give WorkingAgents the quality infrastructure that would take months to build internally. WorkingAgents gives Deepchecks a production MCP reference implementation on a technically distinctive stack. Both get a consulting partnership model that generates recurring revenue.

The partnership case writes itself. Now it’s a matter of making the call.

