By James Aspinwall, co-written by Alfred Pennyworth (my trusted AI) — March 7, 2026, 07:14
Baseten just raised $300 million at a $5 billion valuation — their third fundraise in twelve months. NVIDIA put in $150 million of that. IVP and CapitalG (Google’s growth fund) co-led. The company has raised $585 million total, and industry analysts estimate inference will account for two-thirds of all AI compute by the end of 2026, up from one-third in 2023. Baseten is building the infrastructure to capture that shift.
This is not another wrapper around someone else’s API. Baseten operates its own inference stack — custom CUDA kernels, on-chip model optimization, multi-cloud GPU clusters — and runs production workloads for Cursor, Notion, Sourcegraph, Uber, Abridge, Clay, Writer, and Decagon. Their customers generate billions of dollars in revenue on top of Baseten’s infrastructure. Inference volume grew one hundredfold in 2025 alone.
For WorkingAgents, Baseten represents the most natural infrastructure partnership in the inference market. Here is why.
## What Baseten Does
Baseten is a purpose-built inference platform. You bring a model — open-source, fine-tuned, or custom — and Baseten runs it in production with optimized performance, autoscaling, and enterprise compliance. They do three things exceptionally well:
### 1. Raw Inference Speed
Custom CUDA kernels and advanced decoding techniques built into the Baseten inference stack. The numbers are real:
| Customer | Result |
|---|---|
| Notion | Latency from 2 seconds to 350ms |
| Superhuman | 80% faster embedding inference |
| Zed Industries | 2x faster code completions |
| Sully.ai | 90% cost reduction, 65% lower latency |
| Patreon | ~$600K/year savings |
| Latent Health | 99.999% uptime |
Baseten’s transcription service is marketed as the fastest and most cost-efficient on the market. Their embeddings engine (BEI) delivers 2x higher throughput and 10% lower latency than any competing solution. Real-time audio streaming for voice AI achieves the lowest time-to-first-byte in the industry.
### 2. Compound AI via Chains
Single-model inference is table stakes. The market is moving to compound AI — multi-model systems where specialized models collaborate on a single task. Baseten’s answer is Chains: an SDK for building multi-step, multi-model inference pipelines.
A Chain is a workflow. Each step is a Chainlet — an individual model, data processing step, or business logic function. Chainlets run on independent hardware with independent autoscaling. They call each other directly via type-safe Python functions, eliminating centralized orchestration overhead.
The result: 6x better GPU utilization and latency cut in half compared to monolithic deployments.
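The Chainlet pattern is easy to picture in plain Python. The sketch below is illustrative only: it mimics the shape of a Chain (typed steps that call each other directly, with no central orchestrator) without using the real truss_chains SDK, and every class and method name here is invented.

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    text: str

class TranscribeChainlet:
    """Stand-in for a Chainlet that would run a Whisper-style model
    on its own GPU pool with independent autoscaling."""
    def run(self, audio_chunk: bytes) -> Transcript:
        # Placeholder for the actual model call.
        return Transcript(text=f"<{len(audio_chunk)} bytes transcribed>")

class SummarizeChainlet:
    """Depends on TranscribeChainlet and calls it through a typed
    method, the way Chainlets call each other directly."""
    def __init__(self, transcriber: TranscribeChainlet):
        self.transcriber = transcriber

    def run(self, audio_chunk: bytes) -> str:
        transcript = self.transcriber.run(audio_chunk)
        # Placeholder for an LLM summarization call.
        return f"summary of: {transcript.text}"

entrypoint = SummarizeChainlet(TranscribeChainlet())
print(entrypoint.run(b"\x00" * 1024))  # summary of: <1024 bytes transcribed>
```

In the real SDK each step would also declare its own hardware requirements and scale independently; the plain-class version above only captures the call structure.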
Use cases in production:
- Voice AI — Bland AI runs ultra-low-latency phone calls at infinite scale
- RAG pipelines — LLMs combined with live data retrieval
- Audio transcription — Hours of audio transcribed in seconds (multi-step Whisper optimization)
- AI agents — Custom agents handling varied requests across geographies
- Content generation — Image, text, and video pipelines
### 3. Deployment Flexibility
Three deployment options, same developer experience:
| Mode | Description |
|---|---|
| Baseten Cloud | Fully managed, global distribution, optional single-tenant clusters |
| Self-hosted | Deploy in your own VPCs with managed-service developer experience |
| Hybrid | Self-hosted infrastructure with on-demand Baseten Cloud burst capacity |
SOC 2 Type II and HIPAA compliant across all options. Enterprise tier adds full data residency control and advanced security features.
## The Technical Stack
Baseten’s inference stack is built on Truss — their open-source model packaging framework (6,000+ GitHub stars). Truss handles model containerization, dependency management, and GPU allocation. You package a model with Truss, deploy it to Baseten, and the platform handles everything from fast cold starts to horizontal scaling backed by a 99.99% uptime SLA.
The stack includes:
- Custom kernels — CUDA-level optimization for each model architecture
- TensorRT / TensorRT-LLM — NVIDIA’s inference optimization, deeply integrated
- FireAttention-class decoding — Advanced caching and speculative decoding
- Multi-region, multi-cloud — A single model served by replicas across regions and cloud providers
- Autoscaling — Pay only for compute your model actively uses, zero idle charges
## GPU Pricing
| GPU | VRAM | Per Minute | Per Hour |
|---|---|---|---|
| T4 | 16 GB | $0.0105 | $0.63 |
| L4 | 24 GB | $0.0141 | $0.85 |
| A100 | 80 GB | $0.0667 | $4.00 |
| H100 | 80 GB | $0.1083 | $6.50 |
| B200 | 180 GB | $0.1663 | $9.98 |
Model API pricing (per million tokens):
| Model | Input | Output |
|---|---|---|
| GPT OSS 120B | $0.10 | $0.50 |
| MiniMax M2.5 | $0.30 | $1.20 |
| DeepSeek V3.1 | $0.50 | $1.50 |
| GLM 5 | $0.95 | $3.15 |
The pricing model is consumption-based. No idle charges. You pay for the time your model is using compute, not the time it is sitting waiting.
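To make the consumption-based model concrete, here is a back-of-the-envelope cost sketch using the per-minute rates from the table above. Billing granularity and cold-start accounting are simplified assumptions here.

```python
# Per-minute GPU rates from the pricing table above (USD).
GPU_RATE_PER_MIN = {
    "T4": 0.0105, "L4": 0.0141, "A100": 0.0667, "H100": 0.1083, "B200": 0.1663,
}

def monthly_cost(gpu: str, active_minutes_per_day: float, days: int = 30) -> float:
    """Cost for the minutes a model actively serves traffic; idle time is free."""
    return GPU_RATE_PER_MIN[gpu] * active_minutes_per_day * days

# An H100 that is busy 4 hours a day vs. one billed around the clock:
burst = monthly_cost("H100", active_minutes_per_day=240)       # roughly $780/month
always_on = monthly_cost("H100", active_minutes_per_day=1440)  # roughly $4,680/month
print(f"burst: ${burst:,.2f}  always-on: ${always_on:,.2f}")
```

The gap between those two numbers is the whole argument for zero idle charges: a bursty workload pays for the burst, not the box.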
## The Customer Base
Baseten’s customer list reads like a directory of companies building serious AI products:
- Developer tools: Cursor, Sourcegraph, Zed Industries, Retool, Hex
- Productivity: Notion, ClickUp, Superhuman
- Healthcare: Abridge (1M+ clinical notes weekly), Sully.ai, OpenEvidence, Ambience Healthcare, Picnic Health
- Creative: Gamma, HeyGen, Descript, Speechify
- Business AI: Writer, Clay, EliseAI, Mercor
- Consumer: Patreon, Wispr Flow, Praktika AI
- Voice AI: Bland AI, Rime AI
- AI agents: Decagon, Scaled Cognition
These are not experiments. Abridge generates over a million clinical notes per week. OpenEvidence serves billions of custom LLM calls weekly to healthcare providers at every major facility in the country. Clay uses Baseten to power their AI go-to-market platform. These are production systems generating real revenue.
## The Multi-Model Thesis
Baseten’s bet — and the thesis behind their $5 billion valuation — is that the future of AI is not one model to rule them all. It is a multi-model ecosystem where organizations run many custom, domain-specific models rather than relying on generalized systems.
This matters for WorkingAgents because it aligns with our architecture. WorkingAgents already supports multiple LLM providers (Anthropic, OpenRouter, Perplexity, Gemini). Adding Baseten extends this from “which API do you call” to “which specialized model runs this specific task.” A CRM query might use a small, fast model. A document analysis might use a fine-tuned domain model. A customer-facing response might use a large, high-quality model. All orchestrated through WorkingAgents, all served by Baseten.
## The Synergy Map
WorkingAgents and Baseten operate at different layers of the AI stack with zero overlap and maximum complementarity.
### 1. Baseten as the Inference Backbone for WorkingAgents
WorkingAgents needs fast, reliable inference for its chat module (ServerChat). Baseten’s OpenAI-compatible model APIs slot directly into our provider-switching architecture:
- DeepSeek V3.1 at $0.50/M input — for high-volume, cost-sensitive agent workflows
- GPT OSS 120B at $0.10/M input — for simple routing and classification tasks
- GLM 5 at $0.95/M input — for complex reasoning
- Custom fine-tuned models — clients train domain-specific models on Baseten and serve them through WorkingAgents
The cost advantage over closed APIs is dramatic. Running an always-on agent workflow through Anthropic’s API at $3/M input tokens versus Baseten’s open models at $0.10–$0.95/M is a 3–30x cost reduction. For scheduled, recurring agent tasks — the kind WorkingAgents’ alarm system enables — this makes the difference between economically viable and prohibitively expensive.
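The arithmetic behind the 3–30x claim, using the input-token rates quoted above (output-token rates, caching discounts, and batch pricing are ignored for simplicity):

```python
# Input cost per million tokens, from the figures above (USD).
ANTHROPIC_INPUT = 3.00
BASETEN_INPUT = {"GPT OSS 120B": 0.10, "DeepSeek V3.1": 0.50, "GLM 5": 0.95}

def reduction_factor(model: str) -> float:
    """How many times cheaper an open model on Baseten is per input token."""
    return ANTHROPIC_INPUT / BASETEN_INPUT[model]

for model in BASETEN_INPUT:
    print(f"{model}: {reduction_factor(model):.1f}x cheaper")
# GPT OSS 120B is ~30x cheaper, DeepSeek V3.1 ~6x, GLM 5 ~3.2x
```

For an always-on agent issuing recurring scheduled calls, that multiplier compounds every day the workflow runs.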
### 2. WorkingAgents as the Orchestration Layer for Baseten Chains
Baseten Chains handles model-to-model orchestration within an inference pipeline. WorkingAgents handles everything outside the pipeline: scheduling, state persistence, user permissions, notifications, and escalation.
Consider a healthcare documentation workflow:
- Baseten Chain: Audio → Whisper transcription → Medical NER extraction → Clinical note generation (multi-model, sub-second)
- WorkingAgents: Receives the generated note → stores in per-user database → schedules review reminder for physician → if not reviewed in 24 hours, escalates via push notification → logs audit trail
Baseten processes the AI inference in milliseconds. WorkingAgents manages the business logic over hours and days. Chains handles the burst. WorkingAgents handles the persistence.
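The hand-off between the two layers could look something like the sketch below. This is a hypothetical illustration, not the actual WorkingAgents API; the function and field names are invented.

```python
import datetime

REVIEW_WINDOW = datetime.timedelta(hours=24)

def handle_generated_note(note_id: str, received_at: datetime.datetime) -> dict:
    """Hypothetical orchestration step: the Baseten Chain has already
    returned a clinical note; this plans everything that happens after."""
    return {
        "store": ("clinical_notes", note_id),                    # per-user database write
        "remind_at": received_at + datetime.timedelta(hours=4),  # physician review reminder
        "escalate_at": received_at + REVIEW_WINDOW,              # push notification if unreviewed
        "audit": f"note {note_id} received {received_at.isoformat()}",
    }

plan = handle_generated_note("note-42", datetime.datetime(2026, 3, 7, 7, 14))
print(plan["escalate_at"])  # 2026-03-08 07:14:00
```

The inference side finishes in milliseconds; everything in the returned plan plays out over hours, which is exactly the split the two products make.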
### 3. Compound AI + Persistent State
Baseten’s compound AI vision requires state management that survives beyond a single inference call. Their Chains execute and return results — but what happens next?
- Who gets notified?
- When should this run again?
- What if the downstream system is unavailable?
- Who has permission to trigger this workflow?
- Where is the audit trail?
These are WorkingAgents questions. Our alarm system schedules future actions. Our access control gates who can trigger what. Our per-user SQLite databases persist workflow state. Our Pushover integration delivers notifications. Our task manager tracks completion.
Baseten is the inference engine. WorkingAgents is the operational nervous system.
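The permission question in particular has a simple shape. Below is a minimal sketch of per-user, per-tool gating; the names and data layout are purely illustrative, not the real WorkingAgents schema.

```python
# Illustrative grants table: both humans and agents get scoped permissions.
PERMISSIONS = {
    "alice": {"contacts": {"read", "write"}, "notes": {"read", "write"}},
    "agent-7": {"contacts": {"read"}},  # can look up contacts, nothing else
}

def can_trigger(principal: str, tool: str, action: str) -> bool:
    """Gate a workflow trigger on the caller's per-tool grants."""
    return action in PERMISSIONS.get(principal, {}).get(tool, set())

assert can_trigger("agent-7", "contacts", "read")
assert not can_trigger("agent-7", "contacts", "delete")  # read-only agent
assert not can_trigger("mallory", "notes", "read")       # unknown principal
```

Every denied call would also land in the audit trail, which is what makes the gate useful to an enterprise buyer rather than just a convenience.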
### 4. Voice AI and Real-Time Agent Workflows
Baseten’s investment in ultra-low-latency voice AI (Bland AI partnership, real-time audio streaming, fastest TTFB) creates a natural integration point. Voice agents need:
- Real-time inference — Baseten handles this
- CRM lookup during calls — WorkingAgents’ NIS module
- Post-call scheduling — WorkingAgents’ alarm system (“follow up in 3 days”)
- Call outcome tracking — WorkingAgents’ task manager
- Notification if no follow-up — WorkingAgents’ Pushover integration
- Access control — WorkingAgents’ per-user permissions (“this agent can read contacts but not delete them”)
A voice AI agent powered by Baseten inference and orchestrated by WorkingAgents can take a sales call, look up the contact, schedule a follow-up, and escalate if no response — all automatically, with full audit trail and crash recovery.
### 5. Enterprise Alignment
Both products serve the same enterprise buyer with complementary compliance stories:
| Requirement | Baseten | WorkingAgents |
|---|---|---|
| SOC 2 | Type II certified | Access control + audit trails |
| HIPAA | Compliant | Per-user data isolation |
| Data residency | Full control (self-hosted/hybrid) | Per-user SQLite (on-premise) |
| Zero data retention | Available | Not applicable (state is the product) |
| Audit trails | API logs, observability | Alarm history, task provenance |
| Access control | API keys, model-level | Per-user, per-tool granular permissions |
An enterprise deploying both gets compliant AI from model to operation. Baseten guarantees the inference is secure. WorkingAgents guarantees the workflow is auditable.
### 6. Shared Customer Base
Look at the overlap in Baseten’s customer list:
- Decagon — AI agent platform. Already on Baseten for inference. Needs orchestration (scheduling, escalation, persistence) — WorkingAgents’ core offering.
- Cursor — AI code editor. Uses MCP for tool integration. WorkingAgents is an MCP server. Direct compatibility.
- Sourcegraph — AI code intelligence. Automated code review workflows need scheduling — WorkingAgents’ alarm system.
- Retool — Internal tool builder. Their users build workflows that need persistent scheduling and notifications.
- Clay — AI go-to-market platform. Sales workflows need follow-up scheduling, CRM integration, escalation — all WorkingAgents features.
Every Baseten customer running AI agents or workflows is a potential WorkingAgents customer. The pitch is simple: “You have the inference. Here is the orchestration.”
## The Partnership Path
### Phase 1: Integration
Add Baseten as an LLM provider in WorkingAgents. Their model APIs are OpenAI-compatible — this is a configuration-level change. Deploy and test with DeepSeek V3.1 and GPT OSS 120B for cost-optimized agent workflows.
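Because the model APIs follow the OpenAI convention, the integration is mostly a matter of adding one provider entry. A sketch, with the caveat that the base URL, env-var name, and model identifier below are assumptions rather than confirmed values:

```python
import json
import os

# Assumed provider registry entry; the Baseten base URL here is a guess
# at an OpenAI-compatible endpoint, not a documented value.
PROVIDERS = {
    "baseten": {"base_url": "https://inference.baseten.co/v1", "api_key_env": "BASETEN_API_KEY"},
    # ...existing providers (Anthropic, OpenRouter, Perplexity, Gemini) elided
}

def chat_request(provider: str, model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat-completions request for a provider."""
    cfg = PROVIDERS[provider]
    return {
        "url": cfg["base_url"].rstrip("/") + "/chat/completions",
        "headers": {"Authorization": "Bearer " + os.environ.get(cfg["api_key_env"], "")},
        "body": json.dumps({
            "model": model,  # assumed model id format
            "messages": [{"role": "user", "content": prompt}],
        }),
    }

req = chat_request("baseten", "deepseek-ai/DeepSeek-V3.1", "Classify this ticket: refund request")
print(req["url"])  # https://inference.baseten.co/v1/chat/completions
```

Since provider switching already exists in the architecture, the only genuinely new code is the registry entry; the request shape is unchanged.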
### Phase 2: Reference Architecture
Build a reference compound AI workflow: Baseten Chain for inference + WorkingAgents for orchestration. Healthcare documentation or sales follow-up are the strongest demos. Publish as a joint case study.
### Phase 3: Forward-Deployed Engineering
Baseten offers “embedded engineering” — forward-deployed engineers who optimize customer deployments from prototype to production. WorkingAgents could be part of that optimization: when a Baseten customer needs scheduling, persistence, or access control, the forward-deployed engineer recommends WorkingAgents as the operational layer.
### Phase 4: Marketplace
Baseten’s enterprise customers already use their self-hosted and hybrid deployment options. WorkingAgents could be offered as a companion deployment — the orchestration layer that ships alongside the inference layer.
## The Numbers
| Metric | Value |
|---|---|
| Valuation | $5B |
| Total funding | $585M |
| Series E | $300M (Jan 2026) |
| NVIDIA investment | $150M |
| Inference volume growth (2025) | 100x |
| Customer revenue powered | Billions annually |
| GPU options | T4 through B200 |
| Uptime SLA | 99.99% |
| Compliance | SOC 2 Type II, HIPAA |
| Key investors | IVP, CapitalG, NVIDIA, Greylock, Spark Capital, Conviction |
| Open source | Truss (6,000+ GitHub stars) |
## The Bottom Line
Baseten is building the foundation layer for production AI — the infrastructure that runs models fast, reliably, and cost-efficiently at scale. WorkingAgents is building the operational layer — the infrastructure that schedules, persists, controls, and chains the actions those models produce.
Baseten’s multi-model thesis is our thesis too. The future is not one model doing everything. It is specialized models doing specialized tasks, orchestrated by systems that know when to run what, who can trigger what, and what happens when things go wrong. Baseten handles the “specialized models doing specialized tasks.” WorkingAgents handles the “orchestrated by systems that know when, who, and what.”
Their Chains SDK orchestrates models within an inference pipeline. Our alarm system orchestrates actions across hours and days. Their autoscaling handles burst inference load. Our crash-recoverable scheduling handles persistent operational load. Their forward-deployed engineers optimize inference performance. Our access control system ensures the right people have the right permissions.
Two layers. Zero overlap. One stack.