Why the AI Agent Gateway Needs Its Own LLM Gateway

The first question any reviewer asks when they see an LLM gateway listed as a core feature of an AI agent gateway is: why does the gateway need an LLM? The calling agent already has one. The user’s Claude Code, Codex, OpenAI assistant, or custom MCP client already speaks to a model. Adding another LLM layer inside the agent gateway looks like duplication.

It is not. The two LLMs do different jobs.

The agent’s LLM does the user’s reasoning. The gateway’s LLM is the substrate that other gateway primitives rest on: the thing that workflow steps, external-agent dispatchers, and embedding generators all reach for when the deterministic path can’t do the job. Without it, every primitive that needs a single LLM call has to either import its own provider client (bypassing audit, bypassing per-token budgeting, scattering API keys across modules) or fall back to deterministic code (which doesn’t work for natural-language inputs).

This article lays out the cases where the gateway’s own LlmGateway earns its place in the core, organized by category.

The principle

A workflow in this gateway is mostly deterministic. The bulk of any real workflow is:

MCP tool calls (domain-specific, deterministic).
Function-node transformations (custom Elixir or sandboxed code, deterministic).
Alarms and notifications (deterministic).
Human-in-the-loop approvals (deterministic in shape; the wait is the only non-deterministic part).

A small number of steps inside a real workflow actually need a model. That is what LlmGateway is for. Not for the agent’s overall reasoning – the agent does that on its own model. For the specific, scoped, structured-output calls inside the gateway’s machinery.

The cases below are the catalog of where that small number of LLM calls actually pays off.

A. Workflow step types that genuinely need an LLM

These are workflow steps where the input is natural-language and the output is structured. The pattern is uniform: take some text, call the model once with a tightly-scoped prompt, parse the result into a typed value, hand it to the next step.

Document field extraction. A workflow pulls a PDF from Box. Box Extract gets most of the standard fields, but the format is non-standard and one field is missing. An LLM step reads the body, returns {vendor: ..., amount: ..., due_date: ...}, and the next step (an MCP call to the accounts-payable system) consumes the structured output directly. The LLM is doing extraction, not generation; the prompt is short and the output schema is strict.

Free-text classification. Incoming support tickets get tagged with category and severity. A Slack message gets classified as question | request | complaint | unrelated. The agent could do this on its own model, but classifying thousands of inputs is a place where consistency matters more than capability: one model, one prompt, one audit table, predictable cost.

Summarization for downstream steps. A workflow receives a 50-page deposition and the next step is “post a summary to the case management system.” That step needs 200 words, not 50 pages. An LLM step in between does the compression.

Translation. A workflow receives an email in Vietnamese and the next step expects English. One call, deterministic in shape (text in, text out), no creative reinterpretation – the prompt says “translate,” not “improve.”

Fuzzy data mapping between schemas. Box has a Status field with values like “Open / In Review / Closed”; Salesforce uses “New / Working / Closed-Won / Closed-Lost”. The mapping rules are too irregular for a function node and too low-stakes to justify sending to a human. An LLM step is the right tool: it takes the source value and emits the target enum.

Webhook payload understanding. A generic webhook handler receives JSON from N different upstream systems with N different shapes. The handler can’t be rewritten for every new vendor. An LLM step normalizes the incoming payload into the canonical workflow input.

The common shape: each of these is a single LLM call with a narrow prompt and a structured output. No multi-turn reasoning, no agent-style chain-of-thought. The gateway’s LlmGateway.complete/2 does exactly this and logs every call.
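A minimal sketch of that shape, using the ticket-classification case. The article does not give the signature or return value of LlmGateway.complete/2, so the step below takes the completion function as an argument and assumes it returns {:ok, text} or {:error, reason}. The module name ClassifyStep, the prompt wording, and the model: option are illustrative, not the gateway’s actual code.

```elixir
defmodule ClassifyStep do
  @moduledoc "Hypothetical workflow step: one model call in, one typed value out."

  # Allowed labels, mapped from the model's text reply to a typed value.
  @labels %{
    "question" => :question,
    "request" => :request,
    "complaint" => :complaint,
    "unrelated" => :unrelated
  }

  # `complete` stands in for LlmGateway.complete/2. Assumption: it takes a
  # prompt and options and returns {:ok, text} or {:error, reason}.
  def run(text, complete) when is_binary(text) and is_function(complete, 2) do
    allowed = Enum.join(Map.keys(@labels), ", ")
    prompt = "Classify the message as exactly one of: #{allowed}. Reply with the label only.\n\n" <> text

    with {:ok, raw} <- complete.(prompt, model: "sonnet"),
         {:ok, label} <- Map.fetch(@labels, normalize(raw)) do
      {:ok, label}
    else
      # The model answered, but not with one of the allowed labels.
      :error -> {:error, :unexpected_label}
      # The call itself failed; pass the reason through unchanged.
      {:error, reason} -> {:error, reason}
    end
  end

  defp normalize(raw), do: raw |> String.trim() |> String.downcase()
end
```

The point of the design is that the next workflow step never sees free text: it receives either one of four atoms or an explicit error, so a malformed model reply fails the step instead of leaking downstream.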

B. System-level uses where centralization matters

These are gateway primitives that internally need an LLM and route through LlmGateway for unified auth, audit, and billing.

ExternalAgent API backends. ExternalAgent.AnthropicApi and ExternalAgent.OpenAIApi are not separate API clients. They are thin wrappers around LlmGateway.complete/2 with different model: arguments. One audit trail, one cost ledger, one rate-limiting policy for every model call in the gateway. Without this, each backend would hold its own API keys and write its own audit table, and the operator would have to reconcile three or four bills to know what the gateway cost last month.

A customer-facing llm_complete MCP tool. When an agent on the instance wants to call an LLM but use the customer’s contract and quotas (rather than the agent’s own), it calls llm_complete through MCP. The gateway answers, the call goes through the customer’s permission keys, the cost lands on the customer’s budget. Useful when an agent is running on a customer’s behalf and should bill against the customer’s account, not its own.

Embedding generation for Doc.Index. The documentation surface uses sqlite_vec for semantic search. Vectors have to come from somewhere. LlmGateway.embed/2 produces them, routes to whichever embedding provider the customer has configured, and budgets the cost just like a completion call.

Embeddings for any other module’s semantic search. Workflow templates may eventually want “find me similar past runs”; alarm-history may want “find prior alarms similar to this one.” These are embedding-and-search problems. The gateway’s primitive handles them with one client, one cost model, one provider-portable interface.
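To make the embedding-and-search idea concrete, here is a small sketch of “find similar past runs” as a brute-force cosine-similarity ranking. It is a stand-in only: the article says Doc.Index uses sqlite_vec for the actual search, and it does not specify the return value of LlmGateway.embed/2, so the embed function is passed in and assumed to return {:ok, vector}. The module name SimilarSearch and the item ids are hypothetical.

```elixir
defmodule SimilarSearch do
  @moduledoc "Hypothetical brute-force similarity search over stored vectors."

  # Cosine similarity between two equal-length lists of floats.
  # Assumes neither vector is all zeros (that would divide by zero).
  def cosine(a, b) do
    dot = Enum.zip(a, b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
    dot / (norm(a) * norm(b))
  end

  defp norm(v), do: :math.sqrt(Enum.reduce(v, 0.0, fn x, acc -> acc + x * x end))

  # `embed` stands in for LlmGateway.embed/2 (assumed to return {:ok, vector}).
  # `items` is a list of {id, vector} pairs that were embedded earlier.
  def top_k(query, items, embed, k) when is_function(embed, 1) do
    {:ok, query_vec} = embed.(query)

    items
    |> Enum.map(fn {id, vec} -> {id, cosine(query_vec, vec)} end)
    |> Enum.sort_by(fn {_id, score} -> score end, :desc)
    |> Enum.take(k)
  end
end
```

Because the embedding call is the only provider-dependent part, the same ranking code would serve workflow runs, alarm history, or docs; only the stored vectors differ.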

C. Governance and observability

This is the category that matters most to the customer’s CIO or finance team.

Consolidated billing per token. Every LLM call carries the calling sub-token’s id. The cost ledger answers “how much did agent X cost this month” without manual reconciliation across providers.

Per-provider permission keys. llm.anthropic, llm.openai, llm.google. A customer can grant an agent access only to the providers they have a contract with. Without the gateway, the granting decision is made implicitly by whoever holds the API key – fragile, hard to audit, easy to leak.

Budget enforcement. Caps on LLM token spend per user, per sub-token, per workflow template. The gateway refuses the call when the budget is exceeded; no surprise bills at the end of the month.

Compliance audit retention. Some industries require every LLM input and output to be retained for years. One audit table, one retention policy, regardless of which provider was hit on a given call. The operator configures the retention window; the gateway enforces it.

Rate limiting. A runaway agent loop calling the Claude API 10,000 times in five minutes trips a per-token circuit breaker before it bills $500. Hard to enforce when each agent talks to providers directly; trivial when every call funnels through one gateway.
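The budget and rate-limit checks described in this section could be sketched as a single pre-flight function that runs before any provider is contacted. Everything here is an assumption made for illustration: the module name Preflight, the shape of the per-sub-token state, the five-minute window, and the 1,000-call threshold are not taken from the gateway’s implementation.

```elixir
defmodule Preflight do
  @moduledoc "Hypothetical pre-flight check run before every model call."

  @window_ms 5 * 60 * 1_000   # five-minute window, as in the runaway-loop example
  @max_calls 1_000            # illustrative threshold, not a value from the article

  # State kept per sub-token: LLM tokens spent so far and the
  # timestamps (in milliseconds) of recent calls.
  def new, do: %{spent: 0, calls: []}

  # Returns {:ok, new_state} when the call may proceed,
  # or {:error, reason} when the gateway should refuse it.
  def check(state, estimated_tokens, cap, now_ms) do
    recent = Enum.filter(state.calls, fn t -> now_ms - t < @window_ms end)

    cond do
      state.spent + estimated_tokens > cap ->
        {:error, :budget_exceeded}

      length(recent) >= @max_calls ->
        {:error, :circuit_open}

      true ->
        {:ok, %{state | spent: state.spent + estimated_tokens, calls: [now_ms | recent]}}
    end
  end
end
```

Keeping the check pure (state in, state out) makes it easy to test and leaves the storage question open; a real gateway would need to persist this state and update it atomically across concurrent calls.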

D. Provider portability under the instance-per-customer model

This category exists because every customer ends up wanting one of these things, often a year into the deployment.

Swap providers without code changes. A workflow template declares “summarize step uses sonnet.” Change it to gpt-4o-mini and the workflow uses the new model on the next invocation. No code change. No redeploy.

A/B testing models for the same step. Run the same workflow step on two providers, log both outputs, let the operator compare. Useful when the customer wants to evaluate whether a cheaper model can replace a more expensive one for a specific task.

Fallback chains. Primary provider rate-limits or fails; the gateway transparently retries on the fallback provider configured for that step. The workflow doesn’t know the failover happened. The audit log does.

Absorbing provider deprecations. When claude-3-5-sonnet is retired, the model alias sonnet resolves to the new version. Existing workflow templates do not need editing. The gateway is the place where the alias-to-version mapping lives.
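Alias resolution and fallback chains fit together naturally, and a rough sketch shows why the workflow never has to know which concrete model answered. The module name ModelRouting, the alias table layout, and the concrete model ids below are placeholders invented for this example; the provider call is passed in as a function assumed to return {:ok, text} or {:error, reason}.

```elixir
defmodule ModelRouting do
  @moduledoc "Hypothetical alias table and fallback chain; all names are illustrative."

  # Alias -> ordered list of concrete model ids, primary first.
  # These ids are placeholders, not real provider model names.
  @aliases %{
    "sonnet" => ["provider-a/sonnet-current", "provider-b/fallback-model"]
  }

  def resolve(alias_name), do: Map.fetch(@aliases, alias_name)

  # Tries each model behind the alias in order. On success, returns the text
  # together with the model that answered and the attempts that failed first,
  # so the audit log can record the failover.
  def complete(alias_name, prompt, call) when is_function(call, 2) do
    case resolve(alias_name) do
      {:ok, models} -> try_each(models, prompt, call, [])
      :error -> {:error, :unknown_alias}
    end
  end

  defp try_each([], _prompt, _call, failed),
    do: {:error, {:all_failed, Enum.reverse(failed)}}

  defp try_each([model | rest], prompt, call, failed) do
    case call.(model, prompt) do
      {:ok, text} -> {:ok, %{text: text, model: model, failed: Enum.reverse(failed)}}
      {:error, reason} -> try_each(rest, prompt, call, [{model, reason} | failed])
    end
  end
end
```

Retiring a model then amounts to editing the list behind one alias; templates that say “sonnet” keep working unchanged.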

E. “System intelligence” – limited but real

These are uses inside the gateway’s own logic, not for the calling agent’s benefit. They are deliberately few because the gateway is not in the chat-with-the-user business.

Natural-language command parsing. A WhatsApp message “tomorrow at 3pm remind me to call John about the contract” arrives in the gateway’s WhatsApp handler. An LLM step parses it into structured intent: Alarm.create(set_at: ~U[2026-05-19 15:00:00Z], action: ...). The gateway maps natural-language input from user-facing channels into its own tool surface. The agent could do this if the channel sat behind an agent, but for direct user channels (WhatsApp, email, SMS) the gateway has to do it itself.

Audit summary generation. Compress a thousand-row audit window into a human-readable “what happened in the last hour” summary for the operator dashboard. Runs on demand or on schedule.

Description generation for new docs. When a new file lands in asset/docs/, generate a one-line summary that powers preview cards and improves search ranking. Background batch job, not a per-request call.

Error explanation. When a tool call fails with an opaque upstream error (an HTTP 500 from a third-party API with a cryptic message), the gateway can route the error through an LLM step that explains it in plain language and posts the result to the audit log. Strictly opt-in per token, because routing every error through an LLM at scale is expensive.
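Taking the command-parsing case from this section as an example: once the model has turned the WhatsApp message into structured fields, the gateway still has to validate them before calling its own tool surface. The sketch below assumes the model’s reply has already been JSON-decoded into a map with "intent", "set_at", and "action" keys; that field layout and the module name IntentParser are assumptions, and only the keyword list at the end mirrors the Alarm.create(set_at: ..., action: ...) call mentioned above.

```elixir
defmodule IntentParser do
  @moduledoc "Hypothetical validator for a model-produced alarm intent."

  # `fields` is the model's reply after JSON decoding (decoder not shown).
  # Only a well-formed create_alarm intent with an ISO 8601 timestamp passes.
  def to_alarm_params(%{"intent" => "create_alarm", "set_at" => iso, "action" => action})
      when is_binary(iso) and is_binary(action) do
    case DateTime.from_iso8601(iso) do
      {:ok, set_at, _utc_offset} -> {:ok, [set_at: set_at, action: action]}
      {:error, _reason} -> {:error, :invalid_timestamp}
    end
  end

  # Anything else is rejected rather than guessed at.
  def to_alarm_params(_other), do: {:error, :unrecognized_intent}
end
```

The model proposes and deterministic code disposes: an unparseable timestamp or an unknown intent is an explicit error that the channel handler can report back to the user.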

What is NOT a good use case

Worth naming because the temptation will come up:

The agent’s own reasoning about the docs. The gateway exposes doc_search and doc_get over MCP. The agent retrieves chunks and synthesizes its own answer with its own model. The gateway does not run an LLM to answer questions about its own documentation. The customer would pay twice for the same reasoning and would lose control over which model produced the answer.

Replacing function nodes for transformations. “Parse this JSON, extract customer_id, multiply by 100” is a function node. Twenty lines of code, runs in milliseconds, free, deterministic. Routing it through an LLM is slow, non-deterministic, and expensive. The gateway uses LLMs for things deterministic code can’t do, not for things deterministic code can.

Replacing workflow templates with “describe what you want and the LLM figures out the steps.” That is what an agent does. The gateway runs the template the agent (or operator) committed to. Templates are deterministic by design so they audit cleanly and so failures can be diagnosed.

Open-ended chat with the operator. No “ask the AI anything” chatbox in the gateway’s own web UI. The operator already has their agent (Claude Code, Codex, whatever) for that. The gateway is operated, not chatted with.

The pattern

Across all five categories, the rule is the same: the gateway’s LLM is the substrate that other gateway primitives rest on. Workflow steps that need a model. External-agent dispatchers that wrap a provider API. Embedding generation for semantic indexes. Natural-language command parsing from direct user channels. Audit summary, doc description, error explanation. Each is a contained, structured call with a tight prompt and a typed output.

The calling agent’s LLM is for the user’s reasoning. The gateway’s LLM is the building block its own machinery needs.

That separation is what justifies having both. The agent stays the agent. The gateway stays the gateway. Each pays for the model use that serves its own purpose. Audit logs capture both without confusion about which mind did which thinking.

Bottom line

A “minimal” gateway without an LLM gateway is missing a load-bearing primitive. The first time a workflow needs to extract three fields from a non-standard PDF, the first time an external-agent dispatcher needs unified billing across two providers, the first time Doc.Index needs vectors, the gap shows up. Adding an LLM gateway after the fact means redoing the audit table, the rate limiting, the budget tracking, and the per-token cost ledger – all of which work best when they are designed from day one as the universal path for any LLM call inside the gateway.

LlmGateway is in the core because the alternatives – scattered API clients, scattered audit, scattered billing, scattered rate limits – are not viable for an instance-per-customer governance product. The gateway exists to centralize control. The LLM gateway is one of the surfaces that has to be centralized for the gateway to do its job.