By James Aspinwall, co-written by Alfred Pennyworth (my trusted AI) — March 7, 2026, 17:40
The Problem Neither Solves Alone
WorkingAgents orchestrates AI agents that manage real business operations — CRM, task management, content, communications. These agents make decisions that affect relationships, deadlines, and revenue. But there’s no systematic way to answer: “Are my agents making good decisions?”
Deepchecks answers that question. Their platform evaluates agent workflows interaction by interaction, scoring planning quality, tool selection accuracy, and response groundedness. They find the failure patterns you can’t see by reading logs.
WorkingAgents builds agents that act. Deepchecks builds systems that judge whether those actions were right. The combination closes the loop.
What Each Company Does
WorkingAgents — The Agent Runtime
WorkingAgents (“The Orchestrator”) is an Elixir OTP platform that gives AI agents real tools to operate a business:
- 50+ MCP tools — CRM contacts, companies, sales pipeline, task management, content authoring, alarms, system monitoring
- Multi-provider LLM — Claude, OpenRouter, Perplexity, switchable at runtime per user
- Permission-gated execution — capability-based access control on every tool call
- A2A protocol — Google Agent-to-Agent interop for cross-agent task delegation
- WhatsApp bridge — natural language tool invocation via messaging
- Per-user isolation — separate databases per domain, per user
Deepchecks — The Evaluation Engine
Deepchecks is an AI evaluation platform purpose-built for agents and LLM applications:
- Agentic Workflow Evaluation — breaks agent sessions into individual interactions, scores each one across planning, tool calling, and response quality
- Swarm of Evaluation Agents — small language models, combined via Mixture-of-Experts techniques, that together simulate an intelligent human annotator, with specialists for hallucination detection, planning accuracy, and rule-based criteria
- ORION — a family of lightweight models for hallucination detection that outperform both open-source and proprietary solutions on groundedness scoring
- Know Your Agent (KYA) — generates a full strengths-and-weaknesses report in minutes, surfacing systematic failure patterns across tools, agents, and properties
- Version Comparison — side-by-side evaluation of different prompts, models, and agent configurations against identical inputs
- Compliance — SOC 2 Type 2 certified, GDPR- and HIPAA-compliant, with on-prem deployment options
Where the Synergy Lives
1. Tool Call Quality Scoring — The Core Fit
WorkingAgents dispatches 50+ MCP tools. Every dispatch is a three-part decision: Did the agent choose the right tool? Did it pass correct parameters? Did the tool response actually help?
Deepchecks evaluates exactly these three dimensions with built-in agent span properties:
- Plan Efficiency — Is the agent building a smart execution plan, or thrashing between tools?
- Tool Coverage — Is the agent using the right tools for the task, or ignoring available capabilities?
- Tool Completeness — Are tool responses rich enough, or is the agent getting shallow data back?
Each property is scored 0–5. Over time, patterns emerge. Maybe the agent consistently ignores the nis_pipeline tool when users ask about sales status, defaulting to nis_list_contacts instead. Maybe task_capture parses natural language correctly 92% of the time but fails on relative dates like “next Thursday.” These are the systematic issues that individual log inspection will never surface.
Integration path: Deepchecks uses OpenTelemetry and OpenInference for trace capture. WorkingAgents’ MCP dispatcher emits spans per tool call. Deepchecks ingests, scores, and surfaces patterns — no changes to business logic required.
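As a rough sketch of what such a span might carry, here is a plain-Python model of one tool-call record. This is purely illustrative: WorkingAgents itself runs on Elixir, a real exporter would use the OpenTelemetry SDK rather than raw dicts, and the attribute names below are assumptions in the style of OTel conventions, not the actual spec.

```python
import json
import time
import uuid

def tool_call_span(tool_name, params, status, duration_ms):
    """Build a span-like record for one MCP tool dispatch.

    Captures the three dimensions the evaluator scores: which tool was
    chosen, what parameters were passed, and whether the call succeeded.
    Field names are illustrative, not official OTel semantic conventions.
    """
    return {
        "trace_id": uuid.uuid4().hex,
        "name": f"mcp.tool.{tool_name}",
        "start_time": time.time() - duration_ms / 1000,
        "attributes": {
            "tool.name": tool_name,                 # tool selection
            "tool.parameters": json.dumps(params),  # parameter correctness
            "tool.status": status,                  # outcome of the call
            "duration_ms": duration_ms,
        },
    }

# One dispatch of the (hypothetical) sales-pipeline tool:
span = tool_call_span("nis_pipeline", {"stage": "negotiation"}, "ok", 142)
print(span["name"])  # mcp.tool.nis_pipeline
```

Because the span carries the tool name, serialized parameters, and status together, an evaluation backend can score each dispatch without any access to the dispatcher's business logic.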
2. Hallucination Detection on Grounded Data
WorkingAgents' agents frequently work with grounded data — real contacts, real companies, real task lists. When an agent says “You have 3 overdue tasks with priority above 7,” that’s either true or it isn’t. When it summarizes an article, every claim is verifiable against the source.
This is where Deepchecks’ ORION hallucination detection excels. Their “Grounded in Context” framework was built for exactly this scenario — production-scale validation that each factual statement in an output is entailed by the provided context. It doesn’t just flag “this might be wrong.” It pinpoints the specific claim where the hallucination occurred.
For WorkingAgents, this means:
- CRM data accuracy — When an agent reports on a contact’s follow-up status, pipeline stage, or company details, verify every claim against the actual database records
- Task reporting — Validate that task counts, due dates, and priority levels match real data
- Article summarization — Score how faithfully the Summary module captures source articles without fabrication
- Search result grounding — When blog_search or summary_search returns results and the agent synthesizes them, verify the synthesis against the actual chunks
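As a toy illustration of claim-level grounding on task data (the function name and record shape are hypothetical, not Deepchecks' API), a check that the agent's stated overdue-task count actually matches the records it was handed:

```python
def verify_task_claim(tasks, claimed_count, min_priority):
    """Ground the claim 'you have N overdue tasks with priority above P'
    against the actual task records provided to the agent."""
    matching = [t for t in tasks
                if t["overdue"] and t["priority"] > min_priority]
    grounded = len(matching) == claimed_count
    return grounded, len(matching)

tasks = [
    {"id": 1, "overdue": True,  "priority": 8},
    {"id": 2, "overdue": True,  "priority": 9},
    {"id": 3, "overdue": False, "priority": 8},
    {"id": 4, "overdue": True,  "priority": 5},
]

# The agent claimed "3 overdue tasks with priority above 7" -- only 2 qualify,
# so the claim fails grounding and the specific discrepancy is pinpointed.
grounded, actual = verify_task_claim(tasks, claimed_count=3, min_priority=7)
print(grounded, actual)  # False 2
```

The point of claim-level detection is exactly this shape of result: not "something might be wrong," but "the agent said 3, the data says 2."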
3. Know Your Agent — The Executive Dashboard
Deepchecks’ KYA feature generates a comprehensive strengths-and-weaknesses report for any agent. For WorkingAgents, this translates directly to business intelligence:
Tool Usage Analysis:
- Which of the 50+ tools get called most frequently?
- Which tools have the highest failure rates?
- Are there tools that never get invoked despite being available?
Failure Mode Analysis:
- Across the full pipeline: Where do sessions break down most often?
- Per tool: Which tools consistently produce low-quality responses?
- Per property: Is the agent’s planning strong but tool selection weak? Or vice versa?
LLM Behavior Insights:
- Token usage and latency distributions across different task types
- Behavioral patterns: Does the agent handle CRM queries better than task management queries?
- Topic-level performance variation
For James’s consulting firm, KYA reports become client deliverables. Deploy agents for a client, run KYA analysis, present a clear report: “Your agent handles appointment scheduling well (4.2/5 planning, 4.5/5 tool coverage) but struggles with multi-step data lookups (2.8/5 planning, 3.1/5 tool completeness). Here’s our improvement plan.”
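The per-tool roll-up behind a report like that can be sketched in a few lines. The score records below are invented for illustration; a real KYA report is generated by Deepchecks, not computed client-side like this.

```python
from collections import defaultdict

def tool_report(interactions):
    """Aggregate per-interaction scores (0-5 scale) into a per-tool
    average and a low-score rate (share of interactions scoring below 3)."""
    by_tool = defaultdict(list)
    for it in interactions:
        by_tool[it["tool"]].append(it["score"])
    return {
        tool: {
            "avg": round(sum(scores) / len(scores), 2),
            "low_rate": round(sum(1 for s in scores if s < 3) / len(scores), 2),
        }
        for tool, scores in by_tool.items()
    }

interactions = [
    {"tool": "task_capture", "score": 4.5},
    {"tool": "task_capture", "score": 2.0},  # e.g. a relative-date miss
    {"tool": "nis_pipeline", "score": 4.0},
]
report = tool_report(interactions)
print(report["task_capture"])  # {'avg': 3.25, 'low_rate': 0.5}
```

Even this trivial aggregation surfaces the pattern the section describes: a tool that looks fine on average can still fail a large fraction of the time.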
4. Multi-Provider Model Comparison
WorkingAgents supports Claude, OpenRouter (dozens of models), and Perplexity. Users switch providers at runtime. Deepchecks’ version comparison feature was built for exactly this scenario.
Run identical task sequences against different providers. Deepchecks scores each version on the same rubric:
| Metric | Claude Sonnet | GPT-4o | Llama 3.3 |
|---|---|---|---|
| Plan Efficiency | 4.5 | 4.2 | 3.8 |
| Tool Coverage | 4.8 | 4.0 | 3.5 |
| Tool Completeness | 4.3 | 4.4 | 3.2 |
| Hallucination Rate | 2% | 5% | 11% |
| Avg Latency | 1.2s | 0.9s | 0.6s |
| Cost per Session | $0.08 | $0.06 | $0.01 |
This data answers the question every AI consulting client asks: “Which model should we use?” Not with opinion — with scored evidence on their actual workflows.
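One simple way to turn a comparison table like the one above into a recommendation is a weighted score in which quality metrics count positively and hallucination rate, latency, and cost count against. The weights and the scoring formula here are assumptions for illustration, not a Deepchecks feature:

```python
def weighted_score(m, weights):
    """Combine rubric metrics into a single number; higher is better.
    Quality properties add to the score; hallucination rate, latency,
    and cost subtract from it."""
    quality = (m["plan"] + m["coverage"] + m["completeness"]) / 3
    return (weights["quality"] * quality
            - weights["halluc"] * m["halluc_rate"]
            - weights["latency"] * m["latency_s"]
            - weights["cost"] * m["cost_usd"])

# Two rows from the comparison table, transcribed as records:
models = {
    "claude": {"plan": 4.5, "coverage": 4.8, "completeness": 4.3,
               "halluc_rate": 0.02, "latency_s": 1.2, "cost_usd": 0.08},
    "llama":  {"plan": 3.8, "coverage": 3.5, "completeness": 3.2,
               "halluc_rate": 0.11, "latency_s": 0.6, "cost_usd": 0.01},
}
# Weights reflect a client that penalizes hallucination heavily.
w = {"quality": 1.0, "halluc": 10.0, "latency": 0.1, "cost": 1.0}

best = max(models, key=lambda name: weighted_score(models[name], w))
print(best)  # claude
```

The useful part is not the arithmetic but the framing: a client who weights cost over quality gets a different, equally defensible answer from the same scored evidence.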
5. Production Monitoring for Consulting Clients
WorkingAgents is the foundation of James’s AI consulting firm. Each client deployment is an agent instance with custom tools and configurations. Deepchecks provides the production monitoring layer:
- Continuous evaluation — Score every production interaction, not just test runs
- Drift detection — Alert when agent performance degrades (model updates, data changes, prompt drift)
- Compliance documentation — SOC2, GDPR, HIPAA compliance matters for enterprise clients. Deepchecks is already certified.
- On-prem deployment — For clients with data residency requirements, Deepchecks offers single-tenant SaaS and custom on-prem options
This means WorkingAgents consulting can offer “managed AI operations” — deploy agents, monitor quality continuously, fix degradation proactively — as a recurring revenue service rather than one-time project work.
6. Session-Level Debugging
When a WorkingAgents user reports “the agent gave me wrong information about my contacts,” today’s debugging path is manual: check logs, read the chat history, trace what happened. Deepchecks transforms this into structured forensics.
Their session-level view shows the complete execution tree — how the agent’s reasoning flowed from prompt to planning to tool call to response. Each span gets individual scoring with step-by-step reasoning: what happened, why the evaluation scored it that way, and where things went off track.
For a platform like WorkingAgents where agents handle real business data across CRM, tasks, and communications, this isn’t a nice-to-have. It’s the difference between “something went wrong” and “the agent called nis_get_contact with ID 42, got the right data, then hallucinated the company name in its response — here’s the exact claim that failed grounding.”
The Gap Analysis
| WorkingAgents Gap | Deepchecks Solution |
|---|---|
| No systematic quality scoring | Built-in Plan Efficiency, Tool Coverage, Tool Completeness metrics |
| No hallucination detection on CRM/task data | ORION groundedness scoring with claim-level pinpointing |
| No agent strengths/weaknesses reporting | KYA automated analysis with failure mode surfacing |
| No provider comparison framework | Version comparison with identical inputs across models |
| No production quality monitoring | Continuous evaluation with drift detection and alerting |
| No compliance certification for client deployments | SOC2 Type 2, GDPR, HIPAA pre-certified |
| Deepchecks Gap | WorkingAgents Solution |
|---|---|
| Need real-world agent orchestration references | 50+ tool MCP server with production business workflows |
| Need non-Python/non-LangChain ecosystem examples | Elixir OTP agent orchestration — unique in the market |
| Need consulting channel partners | AI consulting firm deploying agents for medium-size companies |
| Need complex tool-calling evaluation scenarios | CRM + task + content + communication tool chains |
| Need multi-provider evaluation stories | Runtime-switchable Claude/OpenRouter/Perplexity architecture |
Partnership Models
Technology Integration Partner
The natural first step. WorkingAgents integrates Deepchecks as its evaluation backend:
- Emit OpenTelemetry spans from the MCP dispatcher
- Deepchecks ingests traces, runs evaluation agents, scores interactions
- KYA reports surface in the WorkingAgents admin dashboard
- Hallucination detection runs on all grounded-data responses
Deepchecks gains: A reference customer on Elixir/OTP with a unique multi-provider MCP architecture — expanding their ecosystem beyond Python-centric frameworks.
WorkingAgents gains: Enterprise-grade evaluation without building it in-house. Compliance certifications by association. A concrete quality story for consulting clients.
Consulting Reseller / Referral Partner
Deepchecks partners with NVIDIA (Inception program credits) and is available on AWS Marketplace. A similar model with WorkingAgents consulting:
- WorkingAgents deploys agents for clients using the Orchestrator platform
- Deepchecks evaluates and monitors those agents in production
- Joint pricing: orchestration + evaluation as a bundled service
- Deepchecks gets distribution through WorkingAgents’ consulting engagements
- WorkingAgents gets evaluation capability without engineering investment
Co-Development: MCP Evaluation Toolkit
Deepchecks currently integrates with CrewAI, LangChain, LlamaIndex, and LangGraph. There’s no MCP-native evaluation integration. WorkingAgents could collaborate on:
- An MCP-specific evaluation property set (beyond generic tool calling)
- Evaluation templates for common MCP tool patterns (CRM operations, task management, search)
- Reference traces and benchmarks from WorkingAgents’ 50+ tool catalog
This positions both companies at the front of MCP evaluation — a market segment that barely exists yet but will matter as MCP adoption grows.
Why Deepchecks Over Alternatives
Deepchecks’ key differentiator for WorkingAgents is the swarm evaluation architecture. Rather than using a single LLM-as-judge (which introduces its own biases), Deepchecks deploys multiple small language models as specialists — one for hallucination detection, another for planning accuracy, another for rule-based checks. This Mixture of Experts approach produces more reliable scores than any single evaluator.
For a platform like WorkingAgents where agents handle real business operations — not toy demos — evaluation reliability matters. A false positive on hallucination detection in CRM data erodes trust. A missed failure in task management causes real deadlines to slip. Deepchecks’ multi-model evaluation swarm reduces these risks.
The compliance certifications (SOC2, GDPR, HIPAA) are the other differentiator. WorkingAgents’ consulting clients — medium-size companies integrating AI — will ask about compliance. Having a pre-certified evaluation partner removes a sales objection before it’s raised.
Recommended Next Steps
1. Prototype — Add OpenTelemetry span emission to WorkingAgents’ MCP dispatcher. Point it at Deepchecks’ cloud platform. Score a week of real interactions. See what the KYA report reveals.
2. Contact Deepchecks partnerships — Their partnerships page is a demo-booking form, not a formal program page. This suggests they’re still building their partner ecosystem — early partners get more attention and better terms.
3. Build evaluation benchmarks — Create a reference set of 100 representative WorkingAgents sessions across CRM, tasks, and content. Run Deepchecks evaluation on it. Use the results as the baseline for improvement.
4. Package for consulting — Design a “Managed AI Operations” service tier that bundles WorkingAgents orchestration with Deepchecks evaluation. Recurring monthly monitoring becomes the revenue model, not one-time deployment.
Conclusion
WorkingAgents and Deepchecks address different halves of the same problem. WorkingAgents makes agents useful — giving them real tools to manage real business operations. Deepchecks makes agents trustworthy — evaluating whether they’re actually using those tools well.
For an AI consulting firm, the combination is the complete offering: “We deploy AI agents for your business, and we prove they work.” The evaluation data isn’t just quality assurance — it’s the sales collateral for the next client. “Here’s what our agents scored on your competitor’s workflow. Here’s where we improved. Here’s the compliance report.”
Deepchecks’ swarm evaluation, ORION hallucination detection, and KYA reporting give WorkingAgents the quality infrastructure that would take months to build internally. WorkingAgents gives Deepchecks a production MCP reference implementation on a technically distinctive stack. Both get a consulting partnership model that generates recurring revenue.
The partnership case writes itself. Now it’s a matter of making the call.