By James Aspinwall, co-written by Alfred Pennyworth (my trusted AI) — March 7, 2026, 12:32
Rafay builds the platform that turns raw GPU and Kubernetes infrastructure into self-service, governed compute for enterprises, sovereign AI clouds, and cloud service providers. WorkingAgents builds the governance layer that turns autonomous AI agents into trustworthy participants in business operations. One manages the infrastructure agents run on. The other manages what agents are allowed to do once they’re running.
What Rafay Does
Rafay’s platform orchestrates the full lifecycle of compute infrastructure — GPU-accelerated, CPU-based, or containerized — across public clouds, private data centers, and sovereign environments. Their v4.0 release (November 2025) added enhanced managed Kubernetes, VMware vSphere and Nutanix support, and node-level debugging.
Core capabilities:
- GPU orchestration — unified GPU/CPU pools across environments, fractional GPU allocation, GPU matchmaking and time-slicing
- Multi-tenant self-service — developers access compute and AI tooling while platform teams enforce policies, guardrails, and cost controls
- Chargeback and billing — granular usage data, customizable cost attribution aligned to organizational structures
- Deployment flexibility — SaaS, on-premises, or air-gapped configurations with Terraform and GitOps support
Their customers include sovereign AI clouds, cloud service providers (EdgeNet in LATAM), and enterprises managing GPU infrastructure at scale. Gartner named them a Cool Vendor in Container Management; GigaOm named them a Leader and Outperformer in Managed Kubernetes.
What WorkingAgents Does
WorkingAgents is the governance and control layer between AI agents and the systems they interact with. Three gateways, one control plane:
- Unified LLM Routing — control which models agents use and how they access them
- Agentic Workflow Control — define, supervise, and enforce how agents take actions
- Enterprise MCP and A2A Tools Access — connect agents to internal tools with least-privilege permissions
Agents inherit the user’s access control: every agent has defined permissions, and every action is logged to an auditable trail. The platform includes 86+ MCP tools covering task management, CRM, alarm scheduling, push notifications, file management, and system monitoring — all backed by per-user SQLite databases with encrypted access control.
The Gap Between Them
Rafay solves: “How do I give data scientists and developers self-service access to GPU compute with guardrails?”
WorkingAgents solves: “How do I give AI agents self-service access to enterprise tools with guardrails?”
Same architectural pattern. Different layer of the stack. Rafay governs infrastructure consumption. WorkingAgents governs agent behavior. An enterprise running AI workloads needs both — ungoverned infrastructure is wasteful, ungoverned agents are dangerous.
Synergy Areas
1. Agent-Governed Infrastructure Operations
Rafay provides self-service compute. But “self-service” today means dashboards and CLI tools operated by humans. The next step is agents requesting, scaling, and releasing compute resources autonomously.
WorkingAgents provides the governance layer for that transition:
- An AI agent needs GPU resources for a training job → it calls a WorkingAgents MCP tool → WorkingAgents checks the agent’s permissions → if allowed, it triggers a Rafay API call to provision compute → WorkingAgents logs the action, schedules a cost review alarm, and tracks the resource lifecycle in its task system
- When the job completes → WorkingAgents receives the callback → releases the Rafay resources → notifies the team via Pushover → logs the chargeback data
The agent never touches Rafay directly. WorkingAgents mediates every interaction with per-user permissions, audit trails, and automated follow-up. Rafay’s multi-tenancy maps directly to WorkingAgents’ per-user access control — same tenant boundaries, enforced at both the infrastructure and agent layers.
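The mediation pattern above can be sketched in a few lines of Python. This is an illustrative sketch only: `PERMISSIONS`, `call_rafay_api`, and the tool names are hypothetical stand-ins, not real WorkingAgents or Rafay APIs.

```python
import datetime

# Hypothetical permission table: agent id -> allowed tool names.
# In a real deployment this would come from WorkingAgents' per-user,
# encrypted access-control store, not an in-memory dict.
PERMISSIONS = {
    "training-agent": {"rafay.provision_gpu", "rafay.release"},
}

AUDIT_LOG = []  # every mediated call is appended here, allowed or not


def call_rafay_api(action, params):
    """Stub standing in for a real Rafay provisioning API request."""
    return {"status": "accepted", "action": action, "params": params}


def mediate(agent_id, tool, params):
    """Check the agent's permissions, forward the call if allowed, log either way."""
    allowed = tool in PERMISSIONS.get(agent_id, set())
    AUDIT_LOG.append({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "agent": agent_id,
        "tool": tool,
        "allowed": allowed,
    })
    if not allowed:
        return {"status": "denied"}
    return call_rafay_api(tool, params)


# An agent requests GPUs for a training job; mediation allows it.
print(mediate("training-agent", "rafay.provision_gpu", {"gpus": 4})["status"])  # accepted

# An unknown agent is denied, but the attempt is still audited.
print(mediate("rogue-agent", "rafay.provision_gpu", {"gpus": 8})["status"])  # denied
print(len(AUDIT_LOG))  # 2
```

The key design point is that the audit entry is written before the permission check branches, so denied attempts leave the same trail as approved ones.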
2. AI Platform Orchestration Beyond Inference
Rafay already orchestrates AI platforms — Kubeflow workbenches, KubeRay clusters, Models-as-a-Service, Accenture’s AI Refinery. These platforms generate operational needs that Rafay doesn’t address:
- A model finishes training → who gets notified? WorkingAgents’ alarm system.
- An inference endpoint exceeds its error budget → who escalates? WorkingAgents’ push notifications with escalation chains.
- A data scientist requests a new GPU cluster → who approves, tracks, and follows up? WorkingAgents’ task manager with NIS contact tracking.
- A sovereign cloud customer needs audit evidence of all agent actions on their infrastructure → WorkingAgents’ per-user databases with encrypted access-control logs.
Rafay handles the infrastructure lifecycle. WorkingAgents handles the human and agent workflow lifecycle that wraps around it.
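The event-to-workflow handoff described above amounts to a routing table. A minimal Python sketch, assuming hypothetical event and action names (none of these identifiers are real Rafay or WorkingAgents APIs):

```python
# Hypothetical mapping from infrastructure events (emitted by Rafay-managed
# platforms) to workflow actions (handled by WorkingAgents).
ROUTES = {
    "training.completed":     "alarm.notify_owner",
    "inference.error_budget": "push.escalate",
    "gpu.request":            "task.open_approval",
}


def route(event):
    """Return the workflow action for an infrastructure event.

    Unknown events fall through to a triage task rather than being dropped,
    so nothing escapes the workflow layer silently.
    """
    return ROUTES.get(event["type"], "task.open_triage")


print(route({"type": "training.completed"}))  # alarm.notify_owner
print(route({"type": "node.flapping"}))       # task.open_triage
```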
3. Sovereign AI Cloud — Compliance Stack
Rafay explicitly targets sovereign AI clouds. Sovereign clouds have strict requirements: data residency, access control, audit trails, compliance reporting. Rafay delivers this at the infrastructure layer. WorkingAgents delivers it at the application and agent layer.
Together:
| Requirement | Rafay Layer | WorkingAgents Layer |
|---|---|---|
| Data residency | Infrastructure deployed in-country | Per-user databases, on-premise deployment |
| Access control | Multi-tenant compute isolation | Per-user, per-tool agent permissions |
| Audit trails | Infrastructure provisioning logs | Agent action logs, task provenance |
| Compliance | Policy-driven guardrails on compute | Permission keys, encrypted access control |
| Air-gapped support | On-premise / air-gapped deployment | Self-hosted Elixir, no cloud dependencies |
A sovereign AI cloud built on Rafay + WorkingAgents gives customers governed infrastructure AND governed agents — the full compliance stack from GPU allocation to agent behavior.
4. CSP Value-Add: Agent-as-a-Service
Rafay’s cloud service provider customers (like EdgeNet in LATAM) use Rafay to build GPU cloud platforms. These CSPs compete on value-add services beyond raw compute.
WorkingAgents enables a new tier: Agent-as-a-Service. CSPs deploy WorkingAgents alongside Rafay to offer customers not just GPU compute, but managed agent infrastructure — task automation, CRM, scheduling, notifications — all governed by the same tenant boundaries Rafay already enforces.
The CSP’s pitch changes from “We rent you GPUs” to “We rent you governed AI agents that run on governed GPUs.” Higher margins, stickier customers.
5. MCP as the Integration Protocol
Rafay’s platform exposes APIs for infrastructure operations. WorkingAgents is an MCP server. The integration path:
- Wrap Rafay APIs as MCP tools — cluster provisioning, GPU allocation, scaling, teardown become tools in WorkingAgents’ 86+ tool catalog
- Agents interact with infrastructure through natural language — “Provision a 4xA100 cluster for the training job” → WorkingAgents validates permissions → calls Rafay API → tracks the resource
- Chargeback integration — Rafay’s granular usage data flows into WorkingAgents’ per-user databases for cost tracking and reporting
No custom integration code. MCP is the standard. Any AI agent connected to WorkingAgents gains governed access to Rafay’s infrastructure.
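Wrapping an API operation as a catalog tool can be sketched with a simple decorator-based registry. This is an assumed shape, not WorkingAgents’ actual tool mechanism: `mcp_tool`, the tool names, and the parameter schemas are all illustrative.

```python
# Hypothetical tool registry: each Rafay operation becomes a named tool
# with a JSON-schema-style parameter description an agent can discover.
TOOLS = {}


def mcp_tool(name, params):
    """Register a function in the tool catalog under `name`."""
    def wrap(fn):
        TOOLS[name] = {"fn": fn, "params": params}
        return fn
    return wrap


@mcp_tool("rafay_provision_cluster", {"gpu_type": "string", "count": "integer"})
def provision_cluster(gpu_type, count):
    # Would POST to Rafay's provisioning API; returns a stub response here.
    return {"cluster_id": f"{gpu_type}-x{count}", "state": "provisioning"}


@mcp_tool("rafay_cluster_status", {"cluster_id": "string"})
def cluster_status(cluster_id):
    # Would GET the cluster state from Rafay; stubbed as always ready.
    return {"cluster_id": cluster_id, "state": "ready"}


print(sorted(TOOLS))  # ['rafay_cluster_status', 'rafay_provision_cluster']
print(provision_cluster("A100", 4)["cluster_id"])  # A100-x4
```

An agent that can enumerate `TOOLS` sees provisioning, scaling, and status checks as ordinary callable tools, with WorkingAgents’ permission layer deciding which entries each agent may invoke.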
The Partnership Opportunity
For Rafay: WorkingAgents solves the “last mile” problem — what happens after infrastructure is provisioned. Rafay’s platform stops at compute orchestration. WorkingAgents extends governance to the agent and workflow layer, making Rafay’s platform more valuable to enterprises adopting agentic AI.
For WorkingAgents: Rafay solves the infrastructure problem. WorkingAgents needs compute for AI workloads — model inference, training jobs, batch processing. Rafay provides multi-cloud, governed compute at scale with cost optimization built in. Instead of managing infrastructure directly, WorkingAgents delegates to Rafay and focuses on agent governance.
For the joint customer: One governance model from GPU to agent. Same tenant boundaries. Same compliance posture. Infrastructure that’s self-service for humans today and self-service for agents tomorrow.
Concrete Next Steps
- Technical integration — Wrap Rafay’s cluster and GPU APIs as WorkingAgents MCP tools. Estimate: 2-3 days for a proof-of-concept with 5-6 tools (provision, scale, status, teardown, usage).
- Joint demo — Sovereign cloud scenario: an AI agent provisions GPU resources through WorkingAgents, runs a training job on Rafay infrastructure, and reports results — all with full audit trail and access control.
- CSP pilot — Partner with one of Rafay’s CSP customers to deploy WorkingAgents as an agent governance add-on, validating the Agent-as-a-Service model.
Rafay turns infrastructure into a platform. WorkingAgents turns agents into governed employees. Together, they deliver the full stack enterprises need to run AI autonomously — from silicon to agent behavior — with guardrails at every layer.