Baseten: The Inference Infrastructure Layer and Where WorkingAgents Fits

By James Aspinwall, co-written by Alfred Pennyworth (my trusted AI) — March 7, 2026, 07:14


Baseten just raised $300 million at a $5 billion valuation — their third fundraise in twelve months. NVIDIA put in $150 million of that. IVP and CapitalG (Google’s growth fund) co-led. The company has raised $585 million total, and industry analysts estimate inference will account for two-thirds of all AI compute by the end of 2026, up from one-third in 2023. Baseten is building the infrastructure to capture that shift.

This is not another wrapper around someone else’s API. Baseten operates its own inference stack — custom CUDA kernels, on-chip model optimization, multi-cloud GPU clusters — and runs production workloads for Cursor, Notion, Sourcegraph, Uber, Abridge, Clay, Writer, and Decagon. Their customers generate billions of dollars in revenue on top of Baseten’s infrastructure, and inference volume on the platform grew 100x in 2025 alone.

For WorkingAgents, Baseten represents the most natural infrastructure partnership in the inference market. Here is why.

What Baseten Does

Baseten is a purpose-built inference platform. You bring a model — open-source, fine-tuned, or custom — and Baseten runs it in production with optimized performance, autoscaling, and enterprise compliance. They do three things exceptionally well:

1. Raw Inference Speed

Custom CUDA kernels and advanced decoding techniques built into the Baseten inference stack. The numbers are real:

| Customer | Result |
| --- | --- |
| Notion | Latency cut from 2 seconds to 350 ms |
| Superhuman | 80% faster embedding inference |
| Zed Industries | 2x faster code completions |
| Sully.ai | 90% cost reduction, 65% lower latency |
| Patreon | ~$600K/year savings |
| Latent Health | 99.999% uptime |

Baseten markets its transcription service as the fastest and most cost-efficient on the market, claims 2x higher throughput and 10% lower latency for its embeddings engine (BEI) than competing solutions, and cites the industry’s lowest time-to-first-byte for real-time voice AI audio streaming.

2. Compound AI via Chains

Single-model inference is table stakes. The market is moving to compound AI — multi-model systems where specialized models collaborate on a single task. Baseten’s answer is Chains: an SDK for building multi-step, multi-model inference pipelines.

A Chain is a workflow. Each step is a Chainlet — an individual model, data processing step, or business logic function. Chainlets run on independent hardware with independent autoscaling. They call each other directly via type-safe Python functions, eliminating centralized orchestration overhead.

The result: 6x better GPU utilization and latency cut in half compared to monolithic deployments.
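The Chainlet pattern is easy to picture in plain Python. Below is an illustrative mock, not the actual truss_chains SDK: class and method names are hypothetical, and the real framework additionally deploys each Chainlet to independent hardware with its own autoscaling. What it shows is the core idea — steps calling each other as type-safe Python functions, with no central orchestrator in between.

```python
from dataclasses import dataclass

# Illustrative mock of the Chainlet pattern (names are hypothetical,
# not the real truss_chains API).

@dataclass
class Transcriber:
    """Chainlet 1: speech-to-text (would run on its own GPU pool)."""
    def run(self, audio: bytes) -> str:
        return f"transcript of {len(audio)} bytes"

@dataclass
class Summarizer:
    """Chainlet 2: text post-processing (independent hardware and scaling)."""
    def run(self, transcript: str) -> str:
        return transcript.upper()

@dataclass
class Entrypoint:
    """The chain: Chainlets call each other as type-safe Python functions."""
    transcriber: Transcriber
    summarizer: Summarizer

    def run(self, audio: bytes) -> str:
        text = self.transcriber.run(audio)
        return self.summarizer.run(text)

chain = Entrypoint(Transcriber(), Summarizer())
print(chain.run(b"\x00" * 4))  # -> TRANSCRIPT OF 4 BYTES
```

Because each step is an ordinary typed function call, there is no queue or message broker between stages, which is where the latency and utilization gains come from.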

3. Deployment Flexibility

Three deployment options, same developer experience:

| Mode | Description |
| --- | --- |
| Baseten Cloud | Fully managed, global distribution, optional single-tenant clusters |
| Self-hosted | Deploy in your own VPCs with managed-service developer experience |
| Hybrid | Self-hosted infrastructure with on-demand Baseten Cloud burst capacity |

SOC 2 Type II and HIPAA compliant across all options. Enterprise tier adds full data residency control and advanced security features.

The Technical Stack

Baseten’s inference stack is built on Truss — their open-source model packaging framework (6,000+ GitHub stars). Truss handles model containerization, dependency management, and GPU allocation. You package a model with Truss, deploy it to Baseten, and the platform handles everything from fast cold starts to horizontal scaling under a 99.99% uptime SLA.

On top of Truss, the stack layers the custom CUDA kernels, the BEI embeddings engine, and the real-time audio streaming runtime described above.

GPU Pricing

| GPU | VRAM | Per Minute | Per Hour |
| --- | --- | --- | --- |
| T4 | 16 GB | $0.0105 | $0.63 |
| L4 | 24 GB | $0.0141 | $0.85 |
| A100 | 80 GB | $0.0667 | $4.00 |
| H100 | 80 GB | $0.1083 | $6.50 |
| B200 | 180 GB | $0.1663 | $9.98 |

Model API pricing (per million tokens):

| Model | Input | Output |
| --- | --- | --- |
| GPT OSS 120B | $0.10 | $0.50 |
| MiniMax M2.5 | $0.30 | $1.20 |
| DeepSeek V3.1 | $0.50 | $1.50 |
| GLM 5 | $0.95 | $3.15 |

The pricing model is consumption-based. No idle charges. You pay for the time your model is using compute, not the time it is sitting waiting.
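As a sanity check, the per-hour column in the GPU table is just the per-minute rate times 60, and consumption billing means only active minutes count. A minimal sketch:

```python
# Per-minute GPU rates from the table in this post ($/minute).
RATES = {"T4": 0.0105, "L4": 0.0141, "A100": 0.0667, "H100": 0.1083, "B200": 0.1663}

def cost(gpu: str, active_minutes: float) -> float:
    """Consumption-based billing: only active compute minutes are charged."""
    return round(RATES[gpu] * active_minutes, 4)

print(cost("H100", 60))  # one fully active hour -> 6.498, i.e. the ~$6.50/hr column
print(cost("H100", 5))   # a 5-minute burst costs only 5 minutes
```

A workload that runs five minutes per hour pays for five minutes, which is the whole economic argument for bursty agent workloads.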

The Customer Base

Baseten’s customer list reads like a directory of companies building serious AI products:

- Developer tools: Cursor, Sourcegraph, Zed Industries, Retool
- Productivity: Notion, ClickUp, Superhuman
- Healthcare: Abridge (1M+ clinical notes weekly), Sully.ai, OpenEvidence, Ambience Healthcare, Picnic Health
- Creative: Gamma, HeyGen, Descript, Speechify
- Business AI: Writer, Clay, EliseAI, Mercor
- Consumer: Patreon, Wispr Flow, Praktika AI
- Voice AI: Bland AI, Rime AI
- AI agents: Decagon, Scaled Cognition

These are not experiments. Abridge generates over a million clinical notes per week. OpenEvidence serves billions of custom LLM calls weekly to healthcare providers at every major facility in the country. Clay uses Baseten to power their AI go-to-market platform. These are production systems generating real revenue.

The Multi-Model Thesis

Baseten’s bet — and the thesis behind their $5 billion valuation — is that the future of AI is not one model to rule them all. It is a multi-model ecosystem where organizations run many custom, domain-specific models rather than relying on generalized systems.

This matters for WorkingAgents because it aligns with our architecture. WorkingAgents already supports multiple LLM providers (Anthropic, OpenRouter, Perplexity, Gemini). Adding Baseten extends this from “which API do you call” to “which specialized model runs this specific task.” A CRM query might use a small, fast model. A document analysis might use a fine-tuned domain model. A customer-facing response might use a large, high-quality model. All orchestrated through WorkingAgents, all served by Baseten.

The Synergy Map

WorkingAgents and Baseten operate at different layers of the AI stack with zero overlap and maximum complementarity.

1. Baseten as the Inference Backbone for WorkingAgents

WorkingAgents needs fast, reliable inference for its chat module (ServerChat). Baseten’s OpenAI-compatible model APIs slot directly into our provider-switching architecture.
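A hypothetical sketch of what that provider switch looks like — WorkingAgents’ actual configuration format is not shown in this document, and the Baseten base URL and model identifiers below are placeholders, not real endpoints. The point is that an OpenAI-compatible API reduces "add a provider" to a base URL plus a model name:

```python
# Hypothetical provider registry; URLs and model IDs are illustrative only.
PROVIDERS = {
    "anthropic": {"base_url": "https://api.anthropic.com/v1", "model": "claude-latest"},
    "baseten":   {"base_url": "https://example.baseten.co/v1", "model": "deepseek-v3.1"},
}

def pick_provider(recurring: bool) -> str:
    """Route cost-sensitive recurring jobs to open models served on Baseten."""
    return "baseten" if recurring else "anthropic"

cfg = PROVIDERS[pick_provider(recurring=True)]
print(cfg["base_url"], cfg["model"])
```

Any OpenAI-compatible client can then be pointed at `cfg["base_url"]` without changing the rest of the call path.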

The cost advantage over closed APIs is dramatic. Running an always-on agent workflow through Anthropic’s API at $3/M input tokens versus Baseten’s open models at $0.10–$0.95/M is a 3–30x cost reduction. For scheduled, recurring agent tasks — the kind WorkingAgents’ alarm system enables — this makes the difference between economically viable and prohibitively expensive.
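The 3–30x figure follows directly from the prices quoted in this post:

```python
# Anthropic's quoted $3/M input-token price versus Baseten's open-model
# range ($0.10-$0.95/M), both taken from this post.
anthropic_per_m = 3.00
baseten_low, baseten_high = 0.10, 0.95

print(round(anthropic_per_m / baseten_high, 1))  # ~3.2x at the priciest open model
print(round(anthropic_per_m / baseten_low, 1))   # 30.0x at the cheapest
```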

2. WorkingAgents as the Orchestration Layer for Baseten Chains

Baseten Chains handles model-to-model orchestration within an inference pipeline. WorkingAgents handles everything outside the pipeline: scheduling, state persistence, user permissions, notifications, and escalation.

Consider a healthcare documentation workflow:

  1. Baseten Chain: Audio → Whisper transcription → Medical NER extraction → Clinical note generation (multi-model, sub-second)
  2. WorkingAgents: Receives the generated note → stores in per-user database → schedules review reminder for physician → if not reviewed in 24 hours, escalates via push notification → logs audit trail

Baseten processes the AI inference in milliseconds. WorkingAgents manages the business logic over hours and days. Chains handles the burst. WorkingAgents handles the persistence.
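The WorkingAgents half of that workflow can be sketched with the standard-library pieces the document names (per-user SQLite, deadline-driven escalation). This is a hypothetical mock, not the real WorkingAgents API: a note is stored with a 24-hour review deadline, and unreviewed notes past the deadline are the ones that would trigger a push-notification escalation.

```python
import sqlite3
import time

# Hypothetical sketch of deadline-driven escalation over SQLite state.

def store_note(db: sqlite3.Connection, body: str, now: float) -> int:
    """Persist a generated note with a 24-hour review deadline."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS notes ("
        "id INTEGER PRIMARY KEY, body TEXT, reviewed INTEGER DEFAULT 0, due REAL)"
    )
    cur = db.execute(
        "INSERT INTO notes (body, due) VALUES (?, ?)", (body, now + 24 * 3600)
    )
    db.commit()
    return cur.lastrowid

def escalations(db: sqlite3.Connection, now: float) -> list:
    """IDs of unreviewed notes past deadline: these would trigger a push notification."""
    rows = db.execute(
        "SELECT id FROM notes WHERE reviewed = 0 AND due < ?", (now,)
    )
    return [r[0] for r in rows]

db = sqlite3.connect(":memory:")
t0 = time.time()
note_id = store_note(db, "Generated clinical note", t0)
print(escalations(db, t0 + 1))            # within the window: nothing escalates
print(escalations(db, t0 + 25 * 3600))    # 24h lapsed: this note escalates
```

Because the state lives in a database rather than in process memory, the escalation survives restarts — the crash-recovery property the inference layer has no reason to provide.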

3. Compound AI + Persistent State

Baseten’s compound AI vision requires state management that survives beyond a single inference call. Their Chains execute and return results — but what happens next? Who acts on the output? When does the follow-up run? What escalates if nothing happens?

These are WorkingAgents questions. Our alarm system schedules future actions. Our access control gates who can trigger what. Our per-user SQLite databases persist workflow state. Our Pushover integration delivers notifications. Our task manager tracks completion.

Baseten is the inference engine. WorkingAgents is the operational nervous system.

4. Voice AI and Real-Time Agent Workflows

Baseten’s investment in ultra-low-latency voice AI (Bland AI partnership, real-time audio streaming, fastest TTFB) creates a natural integration point. Voice agents need sub-second inference during the call and durable orchestration after it: contact lookup, follow-up scheduling, and escalation when no one responds.

A voice AI agent powered by Baseten inference and orchestrated by WorkingAgents can take a sales call, look up the contact, schedule a follow-up, and escalate if no response — all automatically, with full audit trail and crash recovery.

5. Enterprise Alignment

Both products serve the same enterprise buyer with complementary compliance stories:

| Requirement | Baseten | WorkingAgents |
| --- | --- | --- |
| SOC 2 | Type II certified | Access control + audit trails |
| HIPAA | Compliant | Per-user data isolation |
| Data residency | Full control (self-hosted/hybrid) | Per-user SQLite (on-premise) |
| Zero data retention | Available | Not applicable (state is the product) |
| Audit trails | API logs, observability | Alarm history, task provenance |
| Access control | API keys, model-level | Per-user, per-tool granular permissions |

An enterprise deploying both gets compliant AI from model to operation. Baseten guarantees the inference is secure. WorkingAgents guarantees the workflow is auditable.

6. Shared Customer Base

Look at the overlap in Baseten’s customer list: Decagon and Scaled Cognition build AI agents; Clay, Writer, and EliseAI run business workflows; Abridge and Sully.ai operate recurring clinical pipelines.

Every Baseten customer running AI agents or workflows is a potential WorkingAgents customer. The pitch is simple: “You have the inference. Here is the orchestration.”

The Partnership Path

Phase 1: Integration

Add Baseten as an LLM provider in WorkingAgents. Their model APIs are OpenAI-compatible — this is a configuration-level change. Deploy and test with DeepSeek V3.1 and GPT OSS 120B for cost-optimized agent workflows.

Phase 2: Reference Architecture

Build a reference compound AI workflow: Baseten Chain for inference + WorkingAgents for orchestration. Healthcare documentation or sales follow-up are the strongest demos. Publish as a joint case study.

Phase 3: Forward-Deployed Engineering

Baseten offers “embedded engineering” — forward-deployed engineers who optimize customer deployments from prototype to production. WorkingAgents could be part of that optimization: when a Baseten customer needs scheduling, persistence, or access control, the forward-deployed engineer recommends WorkingAgents as the operational layer.

Phase 4: Marketplace

Baseten’s enterprise customers already use their self-hosted and hybrid deployment options. WorkingAgents could be offered as a companion deployment — the orchestration layer that ships alongside the inference layer.

The Numbers

| Baseten | Value |
| --- | --- |
| Valuation | $5B |
| Total funding | $585M |
| Series E | $300M (Jan 2026) |
| NVIDIA investment | $150M |
| Inference volume growth (2025) | 100x |
| Customer revenue powered | Billions annually |
| GPU options | T4 through B200 |
| Uptime SLA | 99.99% |
| Compliance | SOC 2 Type II, HIPAA |
| Key investors | IVP, CapitalG, NVIDIA, Greylock, Spark Capital, Conviction |
| Open source | Truss (6,000+ GitHub stars) |

The Bottom Line

Baseten is building the foundation layer for production AI — the infrastructure that runs models fast, reliably, and cost-efficiently at scale. WorkingAgents is building the operational layer — the infrastructure that schedules, persists, controls, and chains the actions those models produce.

Baseten’s multi-model thesis is our thesis too. The future is not one model doing everything. It is specialized models doing specialized tasks, orchestrated by systems that know when to run what, who can trigger what, and what happens when things go wrong. Baseten handles the “specialized models doing specialized tasks.” WorkingAgents handles the “orchestrated by systems that know when, who, and what.”

Their Chains SDK orchestrates models within an inference pipeline. Our alarm system orchestrates actions across hours and days. Their autoscaling handles burst inference load. Our crash-recoverable scheduling handles persistent operational load. Their forward-deployed engineers optimize inference performance. Our access control system ensures the right people have the right permissions.

Two layers. Zero overlap. One stack.
