By James Aspinwall, co-written by Alfred Pennyworth (my trusted AI) — March 7, 2026, 07:05
Fireworks AI processes 50 trillion tokens per day. That is not a typo. 1.5 quadrillion tokens per month flow through their inference infrastructure — powering AI features inside Cursor, Notion, Sourcegraph, Uber, DoorDash, Quora, and Upwork. They went from Series B to a $4 billion valuation in under two years, with $300M+ in anticipated annual revenue and $327M in total funding from Sequoia, NVIDIA, AMD, Lightspeed, and Index Ventures.
Fireworks is not building models. They are building the fastest way to run everyone else’s models — and increasingly, the infrastructure for compound AI systems where multiple models, tools, and data sources collaborate on a single task. This is where WorkingAgents fits.
What Fireworks AI Does
Fireworks AI is an inference platform. You send tokens in, you get tokens back — faster and cheaper than running the models yourself or using the original providers.
Their core advantage is speed. Custom CUDA kernels (FireAttention v2) deliver up to 8x faster inference for long-context workloads. Notion cut their latency from 2 seconds to 350 milliseconds by switching to Fireworks. Quora got a 3x speedup. Cursor’s Fast Apply feature demands sub-second responsiveness under peak developer load — Fireworks delivers it.
The Model Library
400+ models available through a single API:
| Category | Examples |
|---|---|
| LLMs | Qwen 3 (480B), DeepSeek, Llama 4, Gemma 3, GLM-5, Kimi K2.5 |
| Image | FLUX.1, Stable Diffusion |
| Audio | Whisper V3 |
| Embeddings | Various |
| Function calling | FireFunction v2 (GPT-4o parity at 2.5x speed, 10% cost) |
All models run on Fireworks’ optimized serving stack. No GPU setup. No cold starts. Pay per token.
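A serverless call is just an OpenAI-style chat completion pointed at Fireworks' endpoint. Here is a minimal sketch using only the standard library; the endpoint path and model name follow Fireworks' published conventions, but verify both against their docs before relying on them:

```python
import json
import os
import urllib.request

FIREWORKS_URL = "https://api.fireworks.ai/inference/v1/chat/completions"

def build_request(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-compatible chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def ask(prompt: str,
        model: str = "accounts/fireworks/models/llama-v3p1-8b-instruct") -> str:
    """POST the payload to Fireworks and return the completion text."""
    req = urllib.request.Request(
        FIREWORKS_URL,
        data=json.dumps(build_request(model, prompt)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Swapping models means changing one string. That is the whole pitch of a single API over 400+ models.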
Pricing Model
| Tier | Description |
|---|---|
| Serverless | Pay-per-token, starting at $0.07/M input tokens |
| On-demand | Reserved capacity with autoscaling |
| Fine-tuning | Custom model training with LoRA, RFT |
| Enterprise | Custom contracts, dedicated infrastructure |
Enterprise Grade
SOC2, HIPAA, and GDPR compliant. Zero data retention guarantee. Bring-your-own-cloud or managed deployment. 99.99% API uptime.
Compound AI: Why It Matters
Fireworks coined the term “compound AI” to describe systems where multiple models, retrievers, tools, and data sources interact to solve a single task. This is not “chat with a model.” This is:
- User asks a question
- System routes to the right model based on task type
- Model calls external tools (database, API, search engine)
- Results feed back into the model
- Model generates the final response
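One pass of that loop can be sketched in a few lines. The model and tool calls below are stubs standing in for Fireworks inference and real external systems, so treat the shapes as illustrative, not as Fireworks' API:

```python
def call_model(model: str, prompt: str) -> dict:
    """Stub model call: a real system would hit Fireworks here.
    Returns either a tool request or a final answer."""
    if "weather" in prompt and "Observation" not in prompt:
        return {"tool": "search", "args": "weather"}  # model asks for a tool
    return {"answer": f"({model}) {prompt}"}

def run(prompt: str, model: str = "firefunction-v2") -> str:
    """Route, call tools, feed results back, generate the final response."""
    tools = {"search": lambda q: f"search results for {q}"}  # stand-in tools
    step = call_model(model, prompt)
    while "tool" in step:  # tool loop: execute, then re-prompt with the result
        observation = tools[step["tool"]](step["args"])
        step = call_model(model, f"{prompt}\nObservation: {observation}")
    return step["answer"]
```

The while loop is the entire trick: results feed back into the model until it stops asking for tools.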
FireFunction v2 is their open-weight function-calling model — it orchestrates across models, data sources, and APIs. It matches GPT-4o on function calling benchmarks at 2.5x the speed and 10% of the cost.
MCP Support: The Bridge to WorkingAgents
In 2026, Fireworks launched MCP support through their OpenAI-compatible Responses API. This is the direct integration point with WorkingAgents.
Here is how it works: you point a Fireworks model at an MCP server, and the model discovers and calls the tools that server exposes. The entire agentic loop — reasoning, tool selection, execution, response — runs server-side in a single API call.
```python
import os
from openai import OpenAI

# Fireworks' API is OpenAI-compatible: same client, different base URL.
client = OpenAI(base_url="https://api.fireworks.ai/inference/v1",
                api_key=os.environ["FIREWORKS_API_KEY"])

client.responses.create(
    model="accounts/fireworks/models/qwen3-235b-a22b",
    input="Schedule a reminder for tomorrow at 9am",
    tools=[{"type": "sse", "server_url": "https://your-workingagents-server/mcp"}],
)
```
The model identifies intent, discovers WorkingAgents’ 86+ tools via MCP, calls the appropriate one (in this case, pushover_schedule), and formulates the response. No glue code. No manual conversation loop management.
This is currently in beta, but the architecture is clear: Fireworks handles the inference, WorkingAgents handles the operational logic.
The Synergy Map
WorkingAgents and Fireworks AI are complementary products with zero overlap. Here is where they connect:
1. Fireworks as an LLM Provider for WorkingAgents
WorkingAgents already supports multiple LLM providers — Anthropic, OpenRouter, Perplexity, Gemini. Adding Fireworks as a provider gives our clients:
- 400+ open-source models through a single integration
- Sub-second inference for real-time agent interactions
- 80-90% cost reduction vs. closed-model APIs (FireFunction v2 at 10% of GPT-4o cost)
- Fine-tuned models — clients could train custom models on their data and serve them through Fireworks, orchestrated by WorkingAgents
The integration is straightforward. Fireworks’ API is OpenAI-compatible. Our ServerChat module already supports provider switching. Adding a :fireworks provider is a configuration change, not an architecture change.
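To make "configuration change" concrete, here is a hypothetical provider registry in the spirit of that switch (ServerChat itself is not shown; the entries and env-var names are assumptions). Because Fireworks is OpenAI-compatible, adding it is one more dictionary entry:

```python
# Hypothetical provider registry: each provider is a base URL plus a key.
PROVIDERS = {
    "anthropic":  {"base_url": "https://api.anthropic.com/v1",
                   "key_env": "ANTHROPIC_API_KEY"},
    "openrouter": {"base_url": "https://openrouter.ai/api/v1",
                   "key_env": "OPENROUTER_API_KEY"},
    "fireworks":  {"base_url": "https://api.fireworks.ai/inference/v1",
                   "key_env": "FIREWORKS_API_KEY"},
}

def endpoint(provider: str, path: str = "/chat/completions") -> str:
    """Resolve a provider name to a full endpoint URL."""
    return PROVIDERS[provider]["base_url"] + path
```

No new request code, no new parsing code. That is what OpenAI compatibility buys.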
2. WorkingAgents as an MCP Server for Fireworks
Fireworks’ new MCP support means their models can call WorkingAgents tools directly. A Fireworks-powered agent could:
- Schedule push notifications via pushover_schedule
- Create and manage CRM contacts via NIS tools
- Query task dashboards via task_dashboard
- Monitor system health via monitor_health
- Read and write documents via read_file/write_file
WorkingAgents becomes the “action layer” for Fireworks-powered agents — the bridge between model reasoning and real-world operations. Fireworks handles thinking fast. WorkingAgents handles doing things.
3. Compound AI + Persistent Orchestration
Fireworks’ compound AI vision — multiple models collaborating on complex tasks — needs an orchestration layer that persists state across interactions. This is WorkingAgents’ core strength.
Consider a compound AI workflow for a sales team:
- Fireworks model analyzes an incoming email (fast inference, low cost)
- WorkingAgents NIS looks up the contact in the CRM
- Fireworks function call generates a response draft
- WorkingAgents alarm schedules a follow-up if no reply in 3 days
- WorkingAgents pushover notifies the sales rep on their phone
- If no response by day 3, WorkingAgents alarm fires and triggers step 1 again with escalation context
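The steps above can be sketched as one orchestration function. The WorkingAgents calls (CRM lookup, alarm, push) and draft_reply are hypothetical stand-ins for the real tools and for a Fireworks inference call:

```python
def draft_reply(email: str, contact: dict) -> str:
    """Stub for step 3: a Fireworks function call drafting a response."""
    return f"Hi {contact['name']}, replying to: {email[:40]}"

def handle_email(email: str, crm: dict, alarms: list, pushes: list) -> str:
    """One pass of the sales workflow: steps 2-5 from the list above."""
    contact = crm.get("sender", {"name": "there"})       # step 2: CRM lookup
    draft = draft_reply(email, contact)                  # step 3: model draft
    alarms.append({"in_days": 3, "action": "escalate"})  # step 4: follow-up alarm
    pushes.append("push: draft ready")                   # step 5: notify the rep
    return draft
```

Step 6 is the part no inference API gives you: the alarm fires three days later, after every in-memory process here is long gone, because WorkingAgents persists it.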
Fireworks handles the model inference (steps 1, 3). WorkingAgents handles the operational logic (steps 2, 4, 5, 6). Neither product can do this alone. Together, they create a self-driving sales workflow with persistent scheduling, crash recovery, and audit trails.
4. Cost Optimization for Multi-Model Routing
WorkingAgents’ provider-switching capability combined with Fireworks’ model library enables intelligent routing:
- Simple queries → small, cheap model (Llama 3 8B at fractions of a cent)
- Complex reasoning → large model (Qwen 3 480B)
- Function calling → FireFunction v2 (optimized for tool use)
- Code generation → Qwen 3 Coder 480B
- Sensitive data → self-hosted model via Fireworks on-demand (zero data retention)
WorkingAgents could route based on task type, cost budget, or latency requirements — using Fireworks as the inference backbone across all tiers.
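A router over those tiers fits in one function. This is an illustrative sketch, not a WorkingAgents API; the tier names mirror the list above and the model identifiers are assumptions to check against Fireworks' catalog:

```python
DEFAULT = "accounts/fireworks/models/llama-v3p1-8b-instruct"

def pick_model(task: str, sensitive: bool = False) -> str:
    """Route a task to a Fireworks-hosted model by type and sensitivity."""
    if sensitive:
        return "on-demand:self-hosted"  # zero-retention dedicated deployment
    return {
        "simple":    DEFAULT,
        "reasoning": "accounts/fireworks/models/qwen3-235b-a22b",
        "tools":     "accounts/fireworks/models/firefunction-v2",
        "code":      "accounts/fireworks/models/qwen3-coder-480b",
    }.get(task, DEFAULT)
```

Extending this to route on cost budget or latency targets is a matter of adding fields to the table, not rearchitecting anything.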
5. Enterprise Deployment Alignment
Both products target the same enterprise buyer:
| Requirement | Fireworks | WorkingAgents |
|---|---|---|
| SOC2/HIPAA/GDPR | Yes | Access control + encrypted keys |
| Data isolation | Zero retention, BYOC | Per-user SQLite databases |
| Audit trails | API logs | Alarm history, task provenance |
| Access control | API keys per model | Per-user, per-tool permissions |
| Self-hosted option | Bring-your-own-cloud | On-premise Elixir deployment |
An enterprise deploying both gets compliant AI from inference to operation — models that forget your data (Fireworks) orchestrated by a system that remembers your workflows (WorkingAgents).
The Partnership Opportunity
For Fireworks
WorkingAgents solves a problem Fireworks explicitly identifies but does not address: what happens after inference. Their “State of Agent Environments” report notes that successful AI systems require “persistent state management, secure external system access, error handling and observability, schema validation and metadata integration.” That is a description of WorkingAgents.
Fireworks’ MCP support is beta. They need reference implementations — real MCP servers doing real work — to validate the feature. WorkingAgents with 86+ tools, persistent scheduling, and per-user access control is a compelling demo partner.
For WorkingAgents
Fireworks solves our inference cost problem. Running complex agent workflows through Anthropic’s API is expensive at scale. Fireworks’ open-model inference at 10% of the cost of closed APIs makes it economically viable to run high-volume agent workflows — the kind of always-on, scheduled, multi-step operations our alarm system enables.
Fireworks also solves model diversity. Instead of integrating each model provider separately, one Fireworks integration gives us 400+ models. Our clients choose the model. We provide the orchestration. Fireworks provides the inference.
The Integration Path
- Phase 1: Add Fireworks as an LLM provider in WorkingAgents (OpenAI-compatible API — minimal work)
- Phase 2: Publish WorkingAgents as a reference MCP server for Fireworks’ Responses API
- Phase 3: Joint case study — compound AI workflow using Fireworks inference + WorkingAgents orchestration
- Phase 4: Co-marketing at AI conferences — “from inference to action in one stack”
The Competitive Landscape
Fireworks competes with inference providers (Together AI, Groq, Cerebras, Replicate). WorkingAgents competes with orchestration platforms (LangChain, CrewAI, custom solutions). Neither competes with the other.
This is the cleanest type of partnership: two products that a client would use simultaneously, solving different layers of the same problem. The client who uses Fireworks for inference and WorkingAgents for orchestration does not need to choose between them — they need both.
The Numbers That Matter
| Fireworks AI | Value |
|---|---|
| Valuation | $4B |
| Total funding | $327M |
| Annual revenue | $300M+ (anticipated) |
| Daily tokens | 50 trillion |
| API uptime | 99.99% |
| Model library | 400+ |
| Key investors | Sequoia, NVIDIA, AMD, Lightspeed, Index |
| Key customers | Cursor, Notion, Sourcegraph, Uber, DoorDash, Quora |
Fireworks’ customer list is a who’s who of companies building AI-powered products. Each of those companies needs operational orchestration behind their AI features — scheduling, task management, escalation, notifications. That is the WorkingAgents pitch to Fireworks’ existing customer base.
The Bottom Line
Fireworks AI is the fastest inference engine in the market. WorkingAgents is the operational orchestration layer that turns inference into action. Fireworks processes the tokens. WorkingAgents schedules the tasks, manages the state, and ensures things get done — even when the model is not thinking.
The compound AI future Fireworks describes — multiple models, tools, and data sources collaborating on complex tasks — requires exactly the kind of persistent, crash-recoverable, permission-gated orchestration that WorkingAgents provides. They built the engine. We built the transmission and the steering wheel.
The integration is technically straightforward (OpenAI-compatible API + MCP support), commercially aligned (same enterprise buyers), and strategically complementary (inference + orchestration). This is not a hypothetical partnership. This is two products that already fit together — they just have not been introduced yet.
Sources: