By James Aspinwall, co-written by Alfred Pennyworth (my trusted AI) — March 7, 2026, 07:14
Baseten just raised $300 million at a $5 billion valuation — their third fundraise in twelve months. NVIDIA put in $150 million of that. IVP and CapitalG (Google’s growth fund) co-led. The company has raised $585 million total, and industry analysts estimate inference will account for two-thirds of all AI compute by the end of 2026, up from one-third in 2023. Baseten is building the infrastructure to capture that shift.
This is not another wrapper around someone else’s API. Baseten operates its own inference stack — custom CUDA kernels, on-chip model optimization, multi-cloud GPU clusters — and runs production workloads for Cursor, Notion, Sourcegraph, Uber, Abridge, Clay, Writer, and Decagon. Their customers generate billions of dollars in revenue on top of Baseten’s infrastructure. Inference volume grew one hundredfold in 2025 alone.
For WorkingAgents, Baseten represents the most natural infrastructure partnership in the inference market. Here is why.
## What Baseten Does
Baseten is a purpose-built inference platform. You bring a model — open-source, fine-tuned, or custom — and Baseten runs it in production with optimized performance, autoscaling, and enterprise compliance. They do three things exceptionally well:
### 1. Raw Inference Speed
Custom CUDA kernels and advanced decoding techniques built into the Baseten inference stack. The numbers are real:
| Customer | Result |
|---|---|
| Notion | Latency from 2 seconds to 350ms |
| Superhuman | 80% faster embedding inference |
| Zed Industries | 2x faster code completions |
| Sully.ai | 90% cost reduction, 65% lower latency |
| Patreon | ~$600K/year savings |
| Latent Health | 99.999% uptime |
Baseten’s transcription service is marketed as the fastest and most cost-efficient on the market. Their embeddings engine (BEI) delivers 2x higher throughput and 10% lower latency than any competing solution. Real-time audio streaming for voice AI achieves the lowest time-to-first-byte in the industry.
### 2. Compound AI via Chains
Single-model inference is table stakes. The market is moving to compound AI — multi-model systems where specialized models collaborate on a single task. Baseten’s answer is Chains: an SDK for building multi-step, multi-model inference pipelines.
A Chain is a workflow. Each step is a Chainlet — an individual model, data processing step, or business logic function. Chainlets run on independent hardware with independent autoscaling. They call each other directly via type-safe Python functions, eliminating centralized orchestration overhead.
The result: 6x better GPU utilization and latency cut in half compared to monolithic deployments.
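The Chainlet pattern is easy to picture in plain Python. The sketch below is illustrative only: it mimics the shape of a Chain (typed steps that call each other directly, with no central orchestrator) without using the real truss_chains SDK, and every class and method name here is invented.

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    text: str

class TranscribeChainlet:
    """Stand-in for a Chainlet that would run a Whisper-style model
    on its own GPU pool with independent autoscaling."""
    def run(self, audio_chunk: bytes) -> Transcript:
        # Placeholder for the actual model call.
        return Transcript(text=f"<{len(audio_chunk)} bytes transcribed>")

class SummarizeChainlet:
    """Depends on TranscribeChainlet and calls it through a typed
    method, the way Chainlets call each other directly."""
    def __init__(self, transcriber: TranscribeChainlet):
        self.transcriber = transcriber

    def run(self, audio_chunk: bytes) -> str:
        transcript = self.transcriber.run(audio_chunk)
        # Placeholder for an LLM summarization call.
        return f"summary of: {transcript.text}"

entrypoint = SummarizeChainlet(TranscribeChainlet())
print(entrypoint.run(b"\x00" * 1024))  # summary of: <1024 bytes transcribed>
```

In the real SDK each step would also declare its own hardware requirements and scale independently; the plain-class version above only captures the call structure.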
Use cases in production:
- Voice AI — Bland AI runs ultra-low-latency phone calls at infinite scale
- RAG pipelines — LLMs combined with live data retrieval
- Audio transcription — Hours of audio transcribed in seconds (multi-step Whisper optimization)
- AI agents — Custom agents handling varied requests across geographies
- Content generation — Image, text, and video pipelines
### 3. Deployment Flexibility
Three deployment options, same developer experience:
| Mode | Description |
|---|---|
| Baseten Cloud | Fully managed, global distribution, optional single-tenant clusters |
| Self-hosted | Deploy in your own VPCs with managed-service developer experience |
| Hybrid | Self-hosted infrastructure with on-demand Baseten Cloud burst capacity |
SOC 2 Type II and HIPAA compliant across all options. Enterprise tier adds full data residency control and advanced security features.
## The Technical Stack
Baseten’s inference stack is built on Truss — their open-source model packaging framework (6,000+ GitHub stars). Truss handles model containerization, dependency management, and GPU allocation. You package a model with Truss, deploy it to Baseten, and the platform handles everything from fast cold starts to horizontal scaling backed by a 99.99% uptime SLA.
The stack includes:
- Custom kernels — CUDA-level optimization for each model architecture
- TensorRT / TensorRT-LLM — NVIDIA’s inference optimization, deeply integrated
- FireAttention-class decoding — Advanced caching and speculative decoding
- Multi-region, multi-cloud — A single model served by replicas across regions and cloud providers
- Autoscaling — Pay only for compute your model actively uses, zero idle charges
## GPU Pricing
| GPU | VRAM | Per Minute | Per Hour |
|---|---|---|---|
| T4 | 16 GB | $0.0105 | $0.63 |
| L4 | 24 GB | $0.0141 | $0.85 |
| A100 | 80 GB | $0.0667 | $4.00 |
| H100 | 80 GB | $0.1083 | $6.50 |
| B200 | 180 GB | $0.1663 | $9.98 |
Model API pricing (per million tokens):
| Model | Input | Output |
|---|---|---|
| GPT OSS 120B | $0.10 | $0.50 |
| MiniMax M2.5 | $0.30 | $1.20 |
| DeepSeek V3.1 | $0.50 | $1.50 |
| GLM 5 | $0.95 | $3.15 |
The pricing model is consumption-based. No idle charges. You pay for the time your model is using compute, not the time it is sitting waiting.
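To make the consumption-based model concrete, here is a back-of-the-envelope cost sketch using the per-minute rates from the table above. Billing granularity and cold-start accounting are simplified assumptions here.

```python
# Per-minute GPU rates from the pricing table above (USD).
GPU_RATE_PER_MIN = {
    "T4": 0.0105, "L4": 0.0141, "A100": 0.0667, "H100": 0.1083, "B200": 0.1663,
}

def monthly_cost(gpu: str, active_minutes_per_day: float, days: int = 30) -> float:
    """Cost for the minutes a model actively serves traffic; idle time is free."""
    return GPU_RATE_PER_MIN[gpu] * active_minutes_per_day * days

# An H100 that is busy 4 hours a day vs. one billed around the clock:
burst = monthly_cost("H100", active_minutes_per_day=240)       # roughly $780/month
always_on = monthly_cost("H100", active_minutes_per_day=1440)  # roughly $4,680/month
print(f"burst: ${burst:,.2f}  always-on: ${always_on:,.2f}")
```

The gap between those two numbers is the whole argument for zero idle charges: a bursty workload pays for the burst, not the box.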
## The Customer Base
Baseten’s customer list reads like a directory of companies building serious AI products:
- Developer tools: Cursor, Sourcegraph, Zed Industries, Retool, Hex
- Productivity: Notion, ClickUp, Superhuman
- Healthcare: Abridge (1M+ clinical notes weekly), Sully.ai, OpenEvidence, Ambience Healthcare, Picnic Health
- Creative: Gamma, HeyGen, Descript, Speechify
- Business AI: Writer, Clay, EliseAI, Mercor
- Consumer: Patreon, Wispr Flow, Praktika AI
- Voice AI: Bland AI, Rime AI
- AI agents: Decagon, Scaled Cognition
These are not experiments. Abridge generates over a million clinical notes per week. OpenEvidence serves billions of custom LLM calls weekly to healthcare providers at every major facility in the country. Clay uses Baseten to power their AI go-to-market platform. These are production systems generating real revenue.
## The Multi-Model Thesis
Baseten’s bet — and the thesis behind their $5 billion valuation — is that the future of AI is not one model to rule them all. It is a multi-model ecosystem where organizations run many custom, domain-specific models rather than relying on generalized systems.
This matters for WorkingAgents because it aligns with our architecture. WorkingAgents already supports multiple LLM providers (Anthropic, OpenRouter, Perplexity, Gemini). Adding Baseten extends this from “which API do you call” to “which specialized model runs this specific task.” A CRM query might use a small, fast model. A document analysis might use a fine-tuned domain model. A customer-facing response might use a large, high-quality model. All orchestrated through WorkingAgents, all served by Baseten.
## The Synergy Map
WorkingAgents and Baseten operate at different layers of the AI stack with zero overlap and maximum complementarity.
### 1. Baseten as the Inference Backbone for WorkingAgents
WorkingAgents needs fast, reliable inference for its chat module (ServerChat). Baseten’s OpenAI-compatible model APIs slot directly into our provider-switching architecture:
- DeepSeek V3.1 at $0.50/M input — for high-volume, cost-sensitive agent workflows
- GPT OSS 120B at $0.10/M input — for simple routing and classification tasks
- GLM 5 at $0.95/M input — for complex reasoning
- Custom fine-tuned models — clients train domain-specific models on Baseten and serve them through WorkingAgents
The cost advantage over closed APIs is dramatic. Running an always-on agent workflow through Anthropic’s API at $3/M input tokens versus Baseten’s open models at $0.10–$0.95/M is a 3–30x cost reduction. For scheduled, recurring agent tasks — the kind WorkingAgents’ alarm system enables — this makes the difference between economically viable and prohibitively expensive.
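The arithmetic behind the 3–30x claim, using the input-token rates quoted above (output-token rates, caching discounts, and batch pricing are ignored for simplicity):

```python
# Input cost per million tokens, from the figures above (USD).
ANTHROPIC_INPUT = 3.00
BASETEN_INPUT = {"GPT OSS 120B": 0.10, "DeepSeek V3.1": 0.50, "GLM 5": 0.95}

def reduction_factor(model: str) -> float:
    """How many times cheaper an open model on Baseten is per input token."""
    return ANTHROPIC_INPUT / BASETEN_INPUT[model]

for model in BASETEN_INPUT:
    print(f"{model}: {reduction_factor(model):.1f}x cheaper")
# GPT OSS 120B is ~30x cheaper, DeepSeek V3.1 ~6x, GLM 5 ~3.2x
```

For an always-on agent issuing recurring scheduled calls, that multiplier compounds every day the workflow runs.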
### 2. WorkingAgents as the Orchestration Layer for Baseten Chains
Baseten Chains handles model-to-model orchestration within an inference pipeline. WorkingAgents handles everything outside the pipeline: scheduling, state persistence, user permissions, notifications, and escalation.
Consider a healthcare documentation workflow:
- Baseten Chain: Audio → Whisper transcription → Medical NER extraction → Clinical note generation (multi-model, sub-second)
- WorkingAgents: Receives the generated note → stores in per-user database → schedules review reminder for physician → if not reviewed in 24 hours, escalates via push notification → logs audit trail
Baseten processes the AI inference in milliseconds. WorkingAgents manages the business logic over hours and days. Chains handles the burst. WorkingAgents handles the persistence.
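The hand-off between the two layers could look something like the sketch below. This is a hypothetical illustration, not the actual WorkingAgents API; the function and field names are invented.

```python
import datetime

REVIEW_WINDOW = datetime.timedelta(hours=24)

def handle_generated_note(note_id: str, received_at: datetime.datetime) -> dict:
    """Hypothetical orchestration step: the Baseten Chain has already
    returned a clinical note; this plans everything that happens after."""
    return {
        "store": ("clinical_notes", note_id),                    # per-user database write
        "remind_at": received_at + datetime.timedelta(hours=4),  # physician review reminder
        "escalate_at": received_at + REVIEW_WINDOW,              # push notification if unreviewed
        "audit": f"note {note_id} received {received_at.isoformat()}",
    }

plan = handle_generated_note("note-42", datetime.datetime(2026, 3, 7, 7, 14))
print(plan["escalate_at"])  # 2026-03-08 07:14:00
```

The inference side finishes in milliseconds; everything in the returned plan plays out over hours, which is exactly the split the two products make.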
### 3. Compound AI + Persistent State
Baseten’s compound AI vision requires state management that survives beyond a single inference call. Their Chains execute and return results — but what happens next?
- Who gets notified?
- When should this run again?
- What if the downstream system is unavailable?
- Who has permission to trigger this workflow?
- Where is the audit trail?
These are WorkingAgents questions. Our alarm system schedules future actions. Our access control gates who can trigger what. Our per-user SQLite databases persist workflow state. Our Pushover integration delivers notifications. Our task manager tracks completion.
Baseten is the inference engine. WorkingAgents is the operational nervous system.
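The permission question in particular has a simple shape. Below is a minimal sketch of per-user, per-tool gating; the names and data layout are purely illustrative, not the real WorkingAgents schema.

```python
# Illustrative grants table: both humans and agents get scoped permissions.
PERMISSIONS = {
    "alice": {"contacts": {"read", "write"}, "notes": {"read", "write"}},
    "agent-7": {"contacts": {"read"}},  # can look up contacts, nothing else
}

def can_trigger(principal: str, tool: str, action: str) -> bool:
    """Gate a workflow trigger on the caller's per-tool grants."""
    return action in PERMISSIONS.get(principal, {}).get(tool, set())

assert can_trigger("agent-7", "contacts", "read")
assert not can_trigger("agent-7", "contacts", "delete")  # read-only agent
assert not can_trigger("mallory", "notes", "read")       # unknown principal
```

Every denied call would also land in the audit trail, which is what makes the gate useful to an enterprise buyer rather than just a convenience.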
### 4. Voice AI and Real-Time Agent Workflows
Baseten’s investment in ultra-low-latency voice AI (Bland AI partnership, real-time audio streaming, fastest TTFB) creates a natural integration point. Voice agents need:
- Real-time inference — Baseten handles this
- CRM lookup during calls — WorkingAgents’ NIS module
- Post-call scheduling — WorkingAgents’ alarm system (“follow up in 3 days”)
- Call outcome tracking — WorkingAgents’ task manager
- Notification if no follow-up — WorkingAgents’ Pushover integration
- Access control — WorkingAgents’ per-user permissions (“this agent can read contacts but not delete them”)
A voice AI agent powered by Baseten inference and orchestrated by WorkingAgents can take a sales call, look up the contact, schedule a follow-up, and escalate if no response — all automatically, with full audit trail and crash recovery.
### 5. Enterprise Alignment
Both products serve the same enterprise buyer with complementary compliance stories:
| Requirement | Baseten | WorkingAgents |
|---|---|---|
| SOC 2 | Type II certified | Access control + audit trails |
| HIPAA | Compliant | Per-user data isolation |
| Data residency | Full control (self-hosted/hybrid) | Per-user SQLite (on-premise) |
| Zero data retention | Available | Not applicable (state is the product) |
| Audit trails | API logs, observability | Alarm history, task provenance |
| Access control | API keys, model-level | Per-user, per-tool granular permissions |
An enterprise deploying both gets compliant AI from model to operation. Baseten guarantees the inference is secure. WorkingAgents guarantees the workflow is auditable.
### 6. Shared Customer Base
Look at the overlap in Baseten’s customer list:
- Decagon — AI agent platform. Already on Baseten for inference. Needs orchestration (scheduling, escalation, persistence) — WorkingAgents’ core offering.
- Cursor — AI code editor. Uses MCP for tool integration. WorkingAgents is an MCP server. Direct compatibility.
- Sourcegraph — AI code intelligence. Automated code review workflows need scheduling — WorkingAgents’ alarm system.
- Retool — Internal tool builder. Their users build workflows that need persistent scheduling and notifications.
- Clay — AI go-to-market platform. Sales workflows need follow-up scheduling, CRM integration, escalation — all WorkingAgents features.
Every Baseten customer running AI agents or workflows is a potential WorkingAgents customer. The pitch is simple: “You have the inference. Here is the orchestration.”
## The Partnership Path
### Phase 1: Integration
Add Baseten as an LLM provider in WorkingAgents. Their model APIs are OpenAI-compatible — this is a configuration-level change. Deploy and test with DeepSeek V3.1 and GPT OSS 120B for cost-optimized agent workflows.
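Because the model APIs follow the OpenAI convention, the integration is mostly a matter of adding one provider entry. A sketch, with the caveat that the base URL, env-var name, and model identifier below are assumptions rather than confirmed values:

```python
import json
import os

# Assumed provider registry entry; the Baseten base URL here is a guess
# at an OpenAI-compatible endpoint, not a documented value.
PROVIDERS = {
    "baseten": {"base_url": "https://inference.baseten.co/v1", "api_key_env": "BASETEN_API_KEY"},
    # ...existing providers (Anthropic, OpenRouter, Perplexity, Gemini) elided
}

def chat_request(provider: str, model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat-completions request for a provider."""
    cfg = PROVIDERS[provider]
    return {
        "url": cfg["base_url"].rstrip("/") + "/chat/completions",
        "headers": {"Authorization": "Bearer " + os.environ.get(cfg["api_key_env"], "")},
        "body": json.dumps({
            "model": model,  # assumed model id format
            "messages": [{"role": "user", "content": prompt}],
        }),
    }

req = chat_request("baseten", "deepseek-ai/DeepSeek-V3.1", "Classify this ticket: refund request")
print(req["url"])  # https://inference.baseten.co/v1/chat/completions
```

Since provider switching already exists in the architecture, the only genuinely new code is the registry entry; the request shape is unchanged.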
### Phase 2: Reference Architecture
Build a reference compound AI workflow: Baseten Chain for inference + WorkingAgents for orchestration. Healthcare documentation or sales follow-up are the strongest demos. Publish as a joint case study.
### Phase 3: Forward-Deployed Engineering
Baseten offers “embedded engineering” — forward-deployed engineers who optimize customer deployments from prototype to production. WorkingAgents could be part of that optimization: when a Baseten customer needs scheduling, persistence, or access control, the forward-deployed engineer recommends WorkingAgents as the operational layer.
### Phase 4: Marketplace
Baseten’s enterprise customers already use their self-hosted and hybrid deployment options. WorkingAgents could be offered as a companion deployment — the orchestration layer that ships alongside the inference layer.
## The Numbers
| Metric | Value |
|---|---|
| Valuation | $5B |
| Total funding | $585M |
| Series E | $300M (Jan 2026) |
| NVIDIA investment | $150M |
| Inference volume growth (2025) | 100x |
| Customer revenue powered | Billions annually |
| GPU options | T4 through B200 |
| Uptime SLA | 99.99% |
| Compliance | SOC 2 Type II, HIPAA |
| Key investors | IVP, CapitalG, NVIDIA, Greylock, Spark Capital, Conviction |
| Open source | Truss (6,000+ GitHub stars) |
## The Bottom Line
Baseten is building the foundation layer for production AI — the infrastructure that runs models fast, reliably, and cost-efficiently at scale. WorkingAgents is building the operational layer — the infrastructure that schedules, persists, controls, and chains the actions those models produce.
Baseten’s multi-model thesis is our thesis too. The future is not one model doing everything. It is specialized models doing specialized tasks, orchestrated by systems that know when to run what, who can trigger what, and what happens when things go wrong. Baseten handles the “specialized models doing specialized tasks.” WorkingAgents handles the “orchestrated by systems that know when, who, and what.”
Their Chains SDK orchestrates models within an inference pipeline. Our alarm system orchestrates actions across hours and days. Their autoscaling handles burst inference load. Our crash-recoverable scheduling handles persistent operational load. Their forward-deployed engineers optimize inference performance. Our access control system ensures the right people have the right permissions.
Two layers. Zero overlap. One stack.