Rafay: Infrastructure Orchestration Meets Agent Governance

By James Aspinwall, co-written by Alfred Pennyworth (my trusted AI) — March 7, 2026, 12:32


Rafay builds the platform that turns raw GPU and Kubernetes infrastructure into self-service, governed compute for enterprises, sovereign AI clouds, and cloud service providers. WorkingAgents builds the governance layer that turns autonomous AI agents into trustworthy participants in business operations. One manages the infrastructure agents run on. The other manages what agents are allowed to do once they’re running.

What Rafay Does

Rafay’s platform orchestrates the full lifecycle of compute infrastructure — GPU-accelerated, CPU-based, or containerized — across public clouds, private data centers, and sovereign environments. Their v4.0 release (November 2025) added enhanced managed Kubernetes, VMware vSphere and Nutanix support, and node-level debugging.

Core capabilities:

Their customers include sovereign AI clouds, cloud service providers (such as EdgeNet in LATAM), and enterprises managing GPU infrastructure at scale. Gartner named them a Cool Vendor in Container Management. GigaOm named them a Leader and Outperformer in Managed Kubernetes.

What WorkingAgents Does

WorkingAgents is the governance and control layer between AI agents and the systems they interact with. Three gateways, one control plane:

Agents inherit the user’s access control. Every agent has defined permissions, and every action and decision leaves an audit trail. The platform includes 86+ MCP tools covering task management, CRM, alarm scheduling, push notifications, file management, and system monitoring, all with per-user SQLite databases and encrypted access control.

The Gap Between Them

Rafay solves: “How do I give data scientists and developers self-service access to GPU compute with guardrails?”

WorkingAgents solves: “How do I give AI agents self-service access to enterprise tools with guardrails?”

Same architectural pattern. Different layer of the stack. Rafay governs infrastructure consumption. WorkingAgents governs agent behavior. An enterprise running AI workloads needs both: ungoverned infrastructure is wasteful, and ungoverned agents are dangerous.

Synergy Areas

1. Agent-Governed Infrastructure Operations

Rafay provides self-service compute. But “self-service” today means dashboards and CLI tools operated by humans. The next step is agents requesting, scaling, and releasing compute resources autonomously.

WorkingAgents provides the governance layer for that transition:

The agent never touches Rafay directly. WorkingAgents mediates every interaction with per-user permissions, audit trails, and automated follow-up. Rafay’s multi-tenancy maps directly to WorkingAgents’ per-user access control — same tenant boundaries, enforced at both the infrastructure and agent layers.
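
The mediation pattern described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual WorkingAgents or Rafay API: the `Gateway` class, the `provision_cluster` stub, the user names, and the permission model (a set of allowed tool names per user) are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable


@dataclass
class Gateway:
    """Hypothetical mediator: agents reach tools only through this object."""

    # user -> names of the tools that user's agents may call (assumed model)
    permissions: dict[str, set[str]]
    # tool name -> backend callable (stand-in for a real infrastructure API)
    tools: dict[str, Callable[..., Any]]
    audit_log: list[dict] = field(default_factory=list)

    def call(self, user: str, tool: str, **args: Any) -> Any:
        allowed = tool in self.permissions.get(user, set())
        # Record every attempt, allowed or denied, before doing any work.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "tool": tool,
            "args": args,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{user} may not call {tool}")
        return self.tools[tool](**args)


def provision_cluster(gpu_type: str, gpu_count: int) -> dict:
    """Stub backend; a real one would call the infrastructure platform."""
    return {"status": "requested", "gpu_type": gpu_type, "gpu_count": gpu_count}


gateway = Gateway(
    permissions={"alice": {"provision_cluster"}, "bob": set()},
    tools={"provision_cluster": provision_cluster},
)
result = gateway.call("alice", "provision_cluster", gpu_type="A100", gpu_count=4)
```

A call made on behalf of `bob` would raise `PermissionError` and still leave an entry in `audit_log`, which is the property the paragraph above relies on: denied attempts are as visible as successful ones.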

2. AI Platform Orchestration Beyond Inference

Rafay already orchestrates AI platforms — Kubeflow workbenches, KubeRay clusters, Models-as-a-Service, Accenture’s AI Refinery. These platforms generate operational needs that Rafay doesn’t address:

Rafay handles the infrastructure lifecycle. WorkingAgents handles the human and agent workflow lifecycle that wraps around it.

3. Sovereign AI Cloud — Compliance Stack

Rafay explicitly targets sovereign AI clouds. Sovereign clouds have strict requirements: data residency, access control, audit trails, compliance reporting. Rafay delivers this at the infrastructure layer. WorkingAgents delivers it at the application and agent layer.

Together:

| Requirement | Rafay Layer | WorkingAgents Layer |
| --- | --- | --- |
| Data residency | Infrastructure deployed in-country | Per-user databases, on-premise deployment |
| Access control | Multi-tenant compute isolation | Per-user, per-tool agent permissions |
| Audit trails | Infrastructure provisioning logs | Agent action logs, task provenance |
| Compliance | Policy-driven guardrails on compute | Permission keys, encrypted access control |
| Air-gapped support | On-premise / air-gapped deployment | Self-hosted Elixir, no cloud dependencies |

A sovereign AI cloud built on Rafay + WorkingAgents gives customers governed infrastructure AND governed agents — the full compliance stack from GPU allocation to agent behavior.

4. CSP Value-Add: Agent-as-a-Service

Rafay’s cloud service provider customers (like EdgeNet in LATAM) use Rafay to build GPU cloud platforms. These CSPs compete on value-add services beyond raw compute.

WorkingAgents enables a new tier: Agent-as-a-Service. CSPs deploy WorkingAgents alongside Rafay to offer customers not just GPU compute, but managed agent infrastructure — task automation, CRM, scheduling, notifications — all governed by the same tenant boundaries Rafay already enforces.

The CSP’s pitch changes from “We rent you GPUs” to “We rent you governed AI agents that run on governed GPUs.” Higher margins, stickier customers.

5. MCP as the Integration Protocol

Rafay’s platform exposes APIs for infrastructure operations. WorkingAgents is an MCP server. The integration path:

  1. Wrap Rafay APIs as MCP tools — cluster provisioning, GPU allocation, scaling, teardown become tools in WorkingAgents’ 86+ tool catalog
  2. Agents interact with infrastructure through natural language — “Provision a 4xA100 cluster for the training job” → WorkingAgents validates permissions → calls Rafay API → tracks the resource
  3. Chargeback integration — Rafay’s granular usage data flows into WorkingAgents’ per-user databases for cost tracking and reporting
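
Step 3 above could look roughly like the following sketch, which stores usage records in a per-user SQLite database and totals the cost. The table layout, the function names, the resource name, and the hourly rate are assumptions made for illustration; the source does not describe the format of Rafay's usage data or WorkingAgents' schema.

```python
import sqlite3


def open_user_db() -> sqlite3.Connection:
    """Open one database per user (in-memory here; a file per user in practice)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE usage ("
        " resource TEXT NOT NULL,"
        " gpu_hours REAL NOT NULL,"
        " usd_per_gpu_hour REAL NOT NULL)"
    )
    return db


def record_usage(db: sqlite3.Connection, resource: str,
                 gpu_hours: float, usd_per_gpu_hour: float) -> None:
    """Store one usage record as reported by the infrastructure layer."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO usage VALUES (?, ?, ?)",
            (resource, gpu_hours, usd_per_gpu_hour),
        )


def total_cost(db: sqlite3.Connection) -> float:
    """Sum cost across all records; 0.0 when nothing has been recorded."""
    row = db.execute(
        "SELECT COALESCE(SUM(gpu_hours * usd_per_gpu_hour), 0) FROM usage"
    ).fetchone()
    return float(row[0])


# Illustrative numbers only: two records for one hypothetical cluster.
db = open_user_db()
record_usage(db, "train-cluster-1", 8.0, 2.5)
record_usage(db, "train-cluster-1", 2.0, 2.5)
cost = total_cost(db)  # 8.0 * 2.5 + 2.0 * 2.5
```

Keeping one database per user means a cost report never has to filter by tenant, which fits the per-user isolation described earlier.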

Beyond the tool wrappers in step 1, no custom integration code is needed. MCP is the standard, so any AI agent connected to WorkingAgents gains governed access to Rafay’s infrastructure.

The Partnership Opportunity

For Rafay: WorkingAgents solves the “last mile” problem of what happens after infrastructure is provisioned. Rafay’s platform stops at compute orchestration. WorkingAgents extends governance to the agent and workflow layer, making Rafay’s platform more valuable to enterprises adopting agentic AI.

For WorkingAgents: Rafay solves the infrastructure problem. WorkingAgents needs compute for AI workloads — model inference, training jobs, batch processing. Rafay provides multi-cloud, governed compute at scale with cost optimization built in. Instead of managing infrastructure directly, WorkingAgents delegates to Rafay and focuses on agent governance.

For the joint customer: One governance model from GPU to agent. Same tenant boundaries. Same compliance posture. Infrastructure that’s self-service for humans today and self-service for agents tomorrow.

Concrete Next Steps

  1. Technical integration — Wrap Rafay’s cluster and GPU APIs as WorkingAgents MCP tools. Estimate: 2-3 days for a proof-of-concept with 5-6 tools (provision, scale, status, teardown, usage).
  2. Joint demo — Sovereign cloud scenario: an AI agent provisions GPU resources through WorkingAgents, runs a training job on Rafay infrastructure, and reports results — all with full audit trail and access control.
  3. CSP pilot — Partner with one of Rafay’s CSP customers to deploy WorkingAgents as an agent governance add-on, validating the Agent-as-a-Service model.
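
As a starting point for the proof-of-concept in step 1, the five tools could be declared as MCP-style descriptors. The `name` / `description` / `inputSchema` shape follows MCP tool definitions; the tool names, parameters, and the `missing_arguments` helper are hypothetical and do not reflect Rafay's actual API.

```python
# Hypothetical descriptors for the five proof-of-concept tools.
# Tool names and parameters are illustrative assumptions.


def _tool(name: str, description: str, properties: dict, required: list) -> dict:
    """Build one descriptor with a JSON Schema for its arguments."""
    return {
        "name": name,
        "description": description,
        "inputSchema": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


CLUSTER = {"cluster": {"type": "string"}}
GPU_COUNT = {"gpu_count": {"type": "integer"}}

TOOLS = [
    _tool("infra_provision", "Request a GPU cluster",
          {"gpu_type": {"type": "string"}, **GPU_COUNT},
          ["gpu_type", "gpu_count"]),
    _tool("infra_scale", "Change a cluster's GPU count",
          {**CLUSTER, **GPU_COUNT}, ["cluster", "gpu_count"]),
    _tool("infra_status", "Report a cluster's state", CLUSTER, ["cluster"]),
    _tool("infra_teardown", "Release a cluster", CLUSTER, ["cluster"]),
    _tool("infra_usage", "Fetch usage data for a cluster", CLUSTER, ["cluster"]),
]


def missing_arguments(tool_name: str, arguments: dict) -> list:
    """Return required arguments absent from a call, before any backend is touched.

    Raises StopIteration if the tool name is unknown.
    """
    schema = next(t for t in TOOLS if t["name"] == tool_name)["inputSchema"]
    return [key for key in schema["required"] if key not in arguments]


gaps = missing_arguments("infra_provision", {"gpu_type": "A100"})
```

Checking required arguments against the declared schema before dispatch keeps malformed agent requests from ever reaching the infrastructure layer.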

Rafay turns infrastructure into a platform. WorkingAgents turns agents into governed employees. Together, they deliver the full stack enterprises need to run AI autonomously — from silicon to agent behavior — with guardrails at every layer.