By James Aspinwall, co-written by Alfred Pennyworth (my trusted AI) — March 7, 2026, 12:32
Rafay builds the platform that turns raw GPU and Kubernetes infrastructure into self-service, governed compute for enterprises, sovereign AI clouds, and cloud service providers. WorkingAgents builds the governance layer that turns autonomous AI agents into trustworthy participants in business operations. One manages the infrastructure agents run on. The other manages what agents are allowed to do once they’re running.
What Rafay Does
Rafay’s platform orchestrates the full lifecycle of compute infrastructure — GPU-accelerated, CPU-based, or containerized — across public clouds, private data centers, and sovereign environments. Their v4.0 release (November 2025) added enhanced managed Kubernetes, VMware vSphere and Nutanix support, and node-level debugging.
Core capabilities:
- GPU orchestration — unified GPU/CPU pools across environments, fractional GPU allocation, GPU matchmaking and time-slicing
- Multi-tenant self-service — developers access compute and AI tooling while platform teams enforce policies, guardrails, and cost controls
- Chargeback and billing — granular usage data, customizable cost attribution aligned to organizational structures
- Deployment flexibility — SaaS, on-premises, or air-gapped configurations with Terraform and GitOps support
Their customers include sovereign AI clouds, cloud service providers (EdgeNet in LATAM), and enterprises managing GPU infrastructure at scale. Gartner named them a Cool Vendor in Container Management; GigaOm named them a Leader and Outperformer in Managed Kubernetes.
What WorkingAgents Does
WorkingAgents is the governance and control layer between AI agents and the systems they interact with. Three gateways, one control plane:
- Unified LLM Routing — control which models agents use and how they access them
- Agentic Workflow Control — define, supervise, and enforce how agents take actions
- Enterprise MCP and A2A Tools Access — connect agents to internal tools with least-privilege permissions
Agents inherit the user’s access control: every agent has defined permissions, and every action is logged to an auditable trail. The platform includes 86+ MCP tools covering task management, CRM, alarm scheduling, push notifications, file management, and system monitoring — all backed by per-user SQLite databases with encrypted access control.
The Gap Between Them
Rafay solves: “How do I give data scientists and developers self-service access to GPU compute with guardrails?”
WorkingAgents solves: “How do I give AI agents self-service access to enterprise tools with guardrails?”
Same architectural pattern. Different layer of the stack. Rafay governs infrastructure consumption. WorkingAgents governs agent behavior. An enterprise running AI workloads needs both — ungoverned infrastructure is wasteful, ungoverned agents are dangerous.
Synergy Areas
1. Agent-Governed Infrastructure Operations
Rafay provides self-service compute. But “self-service” today means dashboards and CLI tools operated by humans. The next step is agents requesting, scaling, and releasing compute resources autonomously.
WorkingAgents provides the governance layer for that transition:
- An AI agent needs GPU resources for a training job → it calls a WorkingAgents MCP tool → WorkingAgents checks the agent’s permissions → if allowed, it triggers a Rafay API call to provision compute → WorkingAgents logs the action, schedules a cost review alarm, and tracks the resource lifecycle in its task system
- When the job completes → WorkingAgents receives the callback → releases the Rafay resources → notifies the team via Pushover → logs the chargeback data
The agent never touches Rafay directly. WorkingAgents mediates every interaction with per-user permissions, audit trails, and automated follow-up. Rafay’s multi-tenancy maps directly to WorkingAgents’ per-user access control — same tenant boundaries, enforced at both the infrastructure and agent layers.
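The mediation pattern above can be sketched in a few lines of Python. This is an illustrative sketch only: `PERMISSIONS`, `call_rafay_api`, and the tool names are hypothetical stand-ins, not real WorkingAgents or Rafay APIs.

```python
import datetime

# Hypothetical permission table: agent id -> allowed tool names.
# In a real deployment this would come from WorkingAgents' per-user,
# encrypted access-control store, not an in-memory dict.
PERMISSIONS = {
    "training-agent": {"rafay.provision_gpu", "rafay.release"},
}

AUDIT_LOG = []  # every mediated call is appended here, allowed or not


def call_rafay_api(action, params):
    """Stub standing in for a real Rafay provisioning API request."""
    return {"status": "accepted", "action": action, "params": params}


def mediate(agent_id, tool, params):
    """Check the agent's permissions, forward the call if allowed, log either way."""
    allowed = tool in PERMISSIONS.get(agent_id, set())
    AUDIT_LOG.append({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "agent": agent_id,
        "tool": tool,
        "allowed": allowed,
    })
    if not allowed:
        return {"status": "denied"}
    return call_rafay_api(tool, params)


# An agent requests GPUs for a training job; mediation allows it.
print(mediate("training-agent", "rafay.provision_gpu", {"gpus": 4})["status"])  # accepted

# An unknown agent is denied, but the attempt is still audited.
print(mediate("rogue-agent", "rafay.provision_gpu", {"gpus": 8})["status"])  # denied
print(len(AUDIT_LOG))  # 2
```

The key design point is that the audit entry is written before the permission check branches, so denied attempts leave the same trail as approved ones.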
2. AI Platform Orchestration Beyond Inference
Rafay already orchestrates AI platforms — Kubeflow workbenches, KubeRay clusters, Models-as-a-Service, Accenture’s AI Refinery. These platforms generate operational needs that Rafay doesn’t address:
- A model finishes training → who gets notified? WorkingAgents’ alarm system.
- An inference endpoint exceeds its error budget → who escalates? WorkingAgents’ push notifications with escalation chains.
- A data scientist requests a new GPU cluster → who approves, tracks, and follows up? WorkingAgents’ task manager with NIS contact tracking.
- A sovereign cloud customer needs audit evidence of all agent actions on their infrastructure → WorkingAgents’ per-user databases with encrypted access-control logs.
Rafay handles the infrastructure lifecycle. WorkingAgents handles the human and agent workflow lifecycle that wraps around it.
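The event-to-workflow handoff described above amounts to a routing table. A minimal Python sketch, assuming hypothetical event and action names (none of these identifiers are real Rafay or WorkingAgents APIs):

```python
# Hypothetical mapping from infrastructure events (emitted by Rafay-managed
# platforms) to workflow actions (handled by WorkingAgents).
ROUTES = {
    "training.completed":     "alarm.notify_owner",
    "inference.error_budget": "push.escalate",
    "gpu.request":            "task.open_approval",
}


def route(event):
    """Return the workflow action for an infrastructure event.

    Unknown events fall through to a triage task rather than being dropped,
    so nothing escapes the workflow layer silently.
    """
    return ROUTES.get(event["type"], "task.open_triage")


print(route({"type": "training.completed"}))  # alarm.notify_owner
print(route({"type": "node.flapping"}))       # task.open_triage
```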
3. Sovereign AI Cloud — Compliance Stack
Rafay explicitly targets sovereign AI clouds. Sovereign clouds have strict requirements: data residency, access control, audit trails, compliance reporting. Rafay delivers this at the infrastructure layer. WorkingAgents delivers it at the application and agent layer.
Together:
| Requirement | Rafay Layer | WorkingAgents Layer |
|---|---|---|
| Data residency | Infrastructure deployed in-country | Per-user databases, on-premise deployment |
| Access control | Multi-tenant compute isolation | Per-user, per-tool agent permissions |
| Audit trails | Infrastructure provisioning logs | Agent action logs, task provenance |
| Compliance | Policy-driven guardrails on compute | Permission keys, encrypted access control |
| Air-gapped support | On-premise / air-gapped deployment | Self-hosted Elixir, no cloud dependencies |
A sovereign AI cloud built on Rafay + WorkingAgents gives customers governed infrastructure AND governed agents — the full compliance stack from GPU allocation to agent behavior.
4. CSP Value-Add: Agent-as-a-Service
Rafay’s cloud service provider customers (like EdgeNet in LATAM) use Rafay to build GPU cloud platforms. These CSPs compete on value-add services beyond raw compute.
WorkingAgents enables a new tier: Agent-as-a-Service. CSPs deploy WorkingAgents alongside Rafay to offer customers not just GPU compute, but managed agent infrastructure — task automation, CRM, scheduling, notifications — all governed by the same tenant boundaries Rafay already enforces.
The CSP’s pitch changes from “We rent you GPUs” to “We rent you governed AI agents that run on governed GPUs.” Higher margins, stickier customers.
5. MCP as the Integration Protocol
Rafay’s platform exposes APIs for infrastructure operations. WorkingAgents is an MCP server. The integration path:
- Wrap Rafay APIs as MCP tools — cluster provisioning, GPU allocation, scaling, teardown become tools in WorkingAgents’ 86+ tool catalog
- Agents interact with infrastructure through natural language — “Provision a 4xA100 cluster for the training job” → WorkingAgents validates permissions → calls Rafay API → tracks the resource
- Chargeback integration — Rafay’s granular usage data flows into WorkingAgents’ per-user databases for cost tracking and reporting
No custom integration code. MCP is the standard. Any AI agent connected to WorkingAgents gains governed access to Rafay’s infrastructure.
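Wrapping an API operation as a catalog tool can be sketched with a simple decorator-based registry. This is an assumed shape, not WorkingAgents’ actual tool mechanism: `mcp_tool`, the tool names, and the parameter schemas are all illustrative.

```python
# Hypothetical tool registry: each Rafay operation becomes a named tool
# with a JSON-schema-style parameter description an agent can discover.
TOOLS = {}


def mcp_tool(name, params):
    """Register a function in the tool catalog under `name`."""
    def wrap(fn):
        TOOLS[name] = {"fn": fn, "params": params}
        return fn
    return wrap


@mcp_tool("rafay_provision_cluster", {"gpu_type": "string", "count": "integer"})
def provision_cluster(gpu_type, count):
    # Would POST to Rafay's provisioning API; returns a stub response here.
    return {"cluster_id": f"{gpu_type}-x{count}", "state": "provisioning"}


@mcp_tool("rafay_cluster_status", {"cluster_id": "string"})
def cluster_status(cluster_id):
    # Would GET the cluster state from Rafay; stubbed as always ready.
    return {"cluster_id": cluster_id, "state": "ready"}


print(sorted(TOOLS))  # ['rafay_cluster_status', 'rafay_provision_cluster']
print(provision_cluster("A100", 4)["cluster_id"])  # A100-x4
```

An agent that can enumerate `TOOLS` sees provisioning, scaling, and status checks as ordinary callable tools, with WorkingAgents’ permission layer deciding which entries each agent may invoke.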
The Partnership Opportunity
For Rafay: WorkingAgents solves the “last mile” problem — what happens after infrastructure is provisioned. Rafay’s platform stops at compute orchestration. WorkingAgents extends governance to the agent and workflow layer, making Rafay’s platform more valuable to enterprises adopting agentic AI.
For WorkingAgents: Rafay solves the infrastructure problem. WorkingAgents needs compute for AI workloads — model inference, training jobs, batch processing. Rafay provides multi-cloud, governed compute at scale with cost optimization built in. Instead of managing infrastructure directly, WorkingAgents delegates to Rafay and focuses on agent governance.
For the joint customer: One governance model from GPU to agent. Same tenant boundaries. Same compliance posture. Infrastructure that’s self-service for humans today and self-service for agents tomorrow.
Concrete Next Steps
- Technical integration — Wrap Rafay’s cluster and GPU APIs as WorkingAgents MCP tools. Estimate: 2-3 days for a proof-of-concept with 5-6 tools (provision, scale, status, teardown, usage).
- Joint demo — Sovereign cloud scenario: an AI agent provisions GPU resources through WorkingAgents, runs a training job on Rafay infrastructure, and reports results — all with full audit trail and access control.
- CSP pilot — Partner with one of Rafay’s CSP customers to deploy WorkingAgents as an agent governance add-on, validating the Agent-as-a-Service model.
Rafay turns infrastructure into a platform. WorkingAgents turns agents into governed employees. Together, they deliver the full stack enterprises need to run AI autonomously — from silicon to agent behavior — with guardrails at every layer.