Agents That Write Their Own Tools -- Why AI Should Compile Logic, Not Execute It Token by Token

By James Aspinwall, co-written by Alfred (your trusted AI agent) – February 26, 2026, 12:30

There’s a fundamental inefficiency in how AI agents work today. When an agent needs to send 500 promotional emails, it calls a “send email” tool 500 times – one LLM round-trip per email, each consuming tokens for reasoning, parameter formatting, result processing, and deciding what to do next. The agent is doing the same thing 500 times, burning tokens every time, and each execution is a fresh probabilistic event that might fail differently than the last one.

What if instead, the agent wrote a program to do it? Compiled code. Stored on an MCP server. Callable with arguments. Deterministic, repeatable, fast, cheap.

This article explores that idea: agents that translate their reasoning into compiled tools, verify them once, and then invoke them like any other MCP tool – with arguments in, results out, and zero token cost per execution.


The Current Model: Expensive Repetition

Here’s how an agent sends a promotional email campaign today:

  1. Agent receives task: “Send promotion X to 500 clients”
  2. Agent calls get_client_list() – 1 tool call, returns 500 records
  3. For each client, the agent:
    • Reasons about the client (tokens)
    • Formats the email parameters (tokens)
    • Calls send_email(to, subject, body) (1 tool call)
    • Reads the result (tokens)
    • Decides whether to continue (tokens)
  4. Total: 500+ LLM round-trips, thousands of tokens per iteration

Even with programmatic tool calling (where the agent writes a loop in code), the LLM is still in the execution path. Every iteration flows through the model’s context window. The agent is the runtime.

Now consider: did the agent learn anything new on email #247 that it didn’t know on email #1? No. The logic is identical. The only variables are the recipient and maybe a personalization field. Everything else – the template, the deduplication check, the sending, the logging – is pure deterministic computation.

The agent is being used as an extremely expensive for-loop.

The Proposal: Agents That Compile Tools

The alternative: the agent writes the logic once as actual code, the code gets compiled and stored on an MCP server, and future invocations are tool calls with arguments – no LLM in the loop.

Here’s the email campaign flow reimagined:

  1. Agent receives task: “Send promotion X to 500 clients”
  2. Agent writes a function:
def send_campaign(recipients, template, campaign_id):
    results = []
    for recipient in recipients:
        # Skip anyone who already received this campaign
        if was_already_sent(recipient, campaign_id):
            results.append({"email": recipient["email"], "status": "skipped", "reason": "already_sent"})
            continue
        # Personalize and send
        body = template["body"].format(name=recipient["name"], company=recipient["company"])
        send_result = send_email(recipient["email"], template["subject"], body)
        # Log for future reference
        log_campaign_send(recipient, campaign_id, send_result)
        results.append({"email": recipient["email"], "status": send_result})
    return {"campaign_id": campaign_id, "total": len(recipients), "results": results}
  3. Code is tested against a small sample (5 clients)
  4. Agent reviews results, confirms correctness
  5. Code is compiled, stored on the MCP server as a new tool: run_email_campaign
  6. Agent calls run_email_campaign(recipients, template, campaign_id) – one tool call
  7. Server executes compiled code against all 500 recipients
  8. Agent receives summary result – one response

Token cost: writing, testing, and reviewing the code might use 2,000-3,000 tokens total; the final tool call and its summary response use maybe 200 more. Compare that to 500 individual round-trips at 500+ tokens each – 250,000+ tokens for the naive approach.

The compiled version is roughly 100x cheaper.
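That ratio can be checked with back-of-envelope arithmetic. All counts below are the article's assumptions, not measurements:

```python
# Rough token-cost comparison for a 500-recipient campaign.
# Every constant here is an assumption from the article, not a benchmark.
RECIPIENTS = 500
TOKENS_PER_LOOP_ITERATION = 500      # reasoning + formatting + result handling per email
TOKENS_TO_AUTHOR_TOOL = 3_000        # write, test, and review the function once
TOKENS_PER_TOOL_CALL = 200           # invoke the compiled tool + read the summary

naive = RECIPIENTS * TOKENS_PER_LOOP_ITERATION           # LLM in the loop every time
compiled = TOKENS_TO_AUTHOR_TOOL + TOKENS_PER_TOOL_CALL  # pay the authoring cost once

print(f"naive:    {naive:,} tokens")
print(f"compiled: {compiled:,} tokens")
print(f"ratio:    ~{naive // compiled}x cheaper")
```

With these assumptions the ratio lands a bit under 100x; shrink the authoring cost or grow the recipient list and it climbs past it.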

More Than Just Email: Examples Across Domains

Data Pipeline Processing

Naive agent approach: Agent queries a database, reads each row, applies transformation logic via LLM reasoning, writes results back. 10,000 rows = 10,000 LLM round-trips.

Compiled tool approach: Agent writes a SQL query with transformation logic, or a Python script that reads, transforms, and writes in batch. Stored as run_data_transform(source_table, dest_table, transform_rules). One tool call processes 10,000 rows. The agent’s reasoning produced the transform logic once. The execution is pure computation.
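One way the stored transform tool might look, sketched with sqlite3. The rule format (output column name mapped to a SQL expression) is an illustrative choice, not a fixed API:

```python
import sqlite3

def run_data_transform(conn, source_table, dest_table, transform_rules):
    """Apply column -> SQL-expression rules to every row in one batch.

    `transform_rules` maps output column names to SQL expressions over the
    source table, e.g. {"total_cents": "CAST(total * 100 AS INTEGER)"}.
    This is a sketch: table and column names are assumed to come from a
    trusted registry, since they are interpolated into the statement.
    """
    select_list = ", ".join(f"{expr} AS {col}" for col, expr in transform_rules.items())
    conn.execute(f"CREATE TABLE {dest_table} AS SELECT {select_list} FROM {source_table}")
    conn.commit()
    # Return a row count so the agent gets a one-line summary, not 10,000 rows
    return conn.execute(f"SELECT COUNT(*) FROM {dest_table}").fetchone()[0]
```

The database does the batch work in a single statement; the LLM's only contribution was producing `transform_rules` once.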

Report Generation

Naive: Agent queries 12 months of sales data, reasons about trends, formats charts, writes narrative – all in the LLM context window with massive data payloads.

Compiled: Agent writes a report generator that takes date ranges and metrics as arguments, queries the database, computes statistics, generates visualizations, and outputs a formatted report. Stored as generate_sales_report(start_date, end_date, metrics, format). The agent designed the report once. Every monthly regeneration is a single tool call.

CRM Workflow Automation

Naive: Agent checks overdue follow-ups, drafts messages per contact, sends via WhatsApp, logs interactions, schedules next follow-up. Each contact requires multiple LLM round-trips for personalization, reasoning, and action.

Compiled: Agent writes a follow-up workflow that takes a list of overdue contacts, a message template with personalization slots, and a follow-up interval. The code personalizes, sends (with rate limiting and deduplication built in), logs, and reschedules. Stored as run_followup_campaign(contacts, template, interval). One call. All contacts processed.
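The built-in rate limiting and deduplication might be sketched like this. The contact fields, the injected `send_fn`, and the one-send-per-second default are all illustrative assumptions:

```python
import time

def run_followup_campaign(contacts, template, interval_days, send_fn,
                          already_sent, min_gap_s=1.0):
    """Personalize and send follow-ups with dedup and a simple rate limit.

    `send_fn` delivers one message; `already_sent` is a set of contact ids
    used for deduplication. Both are injected so the logic stays testable
    without a real messaging backend. Field names are illustrative.
    """
    scheduled = []
    last_send = 0.0
    for contact in contacts:
        if contact["id"] in already_sent:
            continue  # deduplication: never message the same contact twice
        # Simple rate limit: at most one send per min_gap_s seconds
        wait = min_gap_s - (time.monotonic() - last_send)
        if wait > 0:
            time.sleep(wait)
        send_fn(contact["phone"], template.format(name=contact["name"]))
        last_send = time.monotonic()
        already_sent.add(contact["id"])
        scheduled.append({"id": contact["id"], "next_followup_days": interval_days})
    return scheduled
```

Injecting the send function and the dedup set keeps side effects at the edges, which is what makes the step-2 "agent writes tests" stage possible.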

Inventory Reconciliation

Naive: Agent compares two data sources row by row, reasoning about discrepancies, flagging issues, suggesting corrections. Thousands of tokens per row.

Compiled: Agent writes a reconciliation function that takes two datasets, a matching key, and comparison rules. Outputs a discrepancy report with categorized issues. reconcile_inventory(source_a, source_b, match_key, tolerance). The matching logic is deterministic. The agent only gets involved when the report surfaces anomalies that need judgment.
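A minimal version of that reconciliation function, assuming datasets arrive as lists of dicts and that the compared field is a numeric `quantity` (an assumed schema, chosen for illustration):

```python
def reconcile_inventory(source_a, source_b, match_key, tolerance=0):
    """Compare two datasets (lists of dicts) on a matching key.

    Returns a discrepancy report with categorized issues. `tolerance` is
    the allowed absolute difference in the 'quantity' field.
    """
    index_b = {row[match_key]: row for row in source_b}
    report = {"missing_in_b": [], "missing_in_a": [], "quantity_mismatch": []}
    for row in source_a:
        other = index_b.pop(row[match_key], None)
        if other is None:
            report["missing_in_b"].append(row[match_key])
        elif abs(row["quantity"] - other["quantity"]) > tolerance:
            report["quantity_mismatch"].append(
                {"key": row[match_key], "a": row["quantity"], "b": other["quantity"]}
            )
    report["missing_in_a"] = list(index_b)  # rows of B that never matched
    return report
```

Everything above is deterministic dictionary work; only the report's anomalies ever need to reach the agent's context window.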

The Advantages

1. Cost – Orders of Magnitude Cheaper

The math is simple. An LLM round-trip costs tokens. A compiled function call costs CPU cycles. For repetitive operations, the compiled version comes out 100x-1000x cheaper. The agent spends tokens once (writing and verifying the code) and then executes for free.

On Nvidia’s earnings call, Jensen Huang talked about “profitable tokens.” Compiled tools make tokens even more profitable by eliminating the ones that didn’t need to be tokens in the first place.

2. Speed – Parallel Execution Without LLM Bottleneck

An LLM processes sequentially. Even with parallel tool calling, the model is still in the loop for every batch. Compiled code runs at native speed – parallel threads, batch database operations, concurrent API calls – without waiting for the model to generate its next thought.

A 500-email campaign that takes 45 minutes through an agent (rate-limited by LLM latency) can finish in about 30 seconds as compiled code.

3. Determinism – Same Input, Same Output

This is the biggest advantage. LLMs are probabilistic. The same prompt can produce different outputs. Temperature, sampling, context window position – all introduce variation. For business operations, variation is risk.

Compiled code is deterministic. send_campaign(recipients, template, campaign_id) produces the same result every time for the same inputs. No hallucinated email addresses. No randomly skipped recipients. No creative reinterpretation of the template.

Once the code is verified, it stays verified. You don’t re-test your email sending function every time you run it. But you would need to watch an LLM agent carefully every time it runs a campaign, because it might do something unexpected.

4. Auditability – Code Is Reviewable

When an agent sends 500 emails through 500 tool calls, the audit trail is 500 individual decisions. Understanding what happened requires reconstructing the agent’s reasoning across all of them.

When compiled code sends 500 emails, the audit trail is: here’s the code, here are the inputs, here are the outputs. A developer can read the function, understand exactly what it does, and verify it handles edge cases. The logic is visible, not buried in an LLM’s context window.

5. Reusability – Tools Accumulate

Every compiled tool the agent creates becomes available for future use. The agent builds its own toolkit over time. After six months of operation, the MCP server has tools for email campaigns, report generation, data reconciliation, follow-up workflows, and dozens of other operations – all written by the agent, verified by humans, and executable without LLM involvement.

The agent gets more capable over time not by learning (it doesn’t retain state between sessions) but by accumulating compiled tools. Each new session starts with a richer toolkit than the last one.

6. Safety – Bounded Execution

An LLM reasoning in a loop can go off the rails. It might decide to modify the template mid-campaign. It might hallucinate a recipient. It might interpret an error as a reason to retry aggressively and DDoS the mail server.

Compiled code does what it was written to do. The deduplication check either passes or fails. The rate limiter either throttles or doesn’t. There’s no creative interpretation. The boundaries are in the code, not in the prompt.

Code Generation Quality: Where Are We?

This entire approach depends on AI being good enough at writing code. If the generated code is buggy, the benefits evaporate. So how good are frontier models at code generation?

Very good, with caveats.

What Models Do Well

  • Self-contained functions with a clear specification and well-defined inputs and outputs
  • Common patterns: CRUD operations, API calls, data transformations, templating
  • Generating tests alongside the code they write
  • Using popular libraries idiomatically

What Models Struggle With

  • Edge cases the prompt never stated: empty lists, malformed records, timezone boundaries
  • Concurrency, race conditions, and subtle resource-management bugs
  • Large, multi-file codebases where the relevant context exceeds the window
  • Security implications: injection, secrets handling, overly broad permissions

The Verification Step Is Non-Negotiable

The workflow must include:

  1. Agent writes code
  2. Agent writes tests
  3. Tests run automatically
  4. Human reviews the code and test results
  5. Only then: code is compiled and stored as a tool

Step 4 is critical. The agent should not be able to deploy tools to the MCP server without human approval. This is the same principle as code review in software development – the author (agent) writes it, a reviewer (human) approves it.

For low-risk operations (formatting data, generating reports), the review can be lightweight. For operations with side effects (sending emails, modifying databases, calling external APIs), the review must be thorough.
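Step 2 above, the agent writing its own tests, might look like this for the campaign function. Everything here is a self-contained stand-in for the earlier sketch: the stubs record calls instead of touching a real mail server, so the test exercises the dedup and personalization logic with zero side effects.

```python
# Agent-written test harness for send_campaign (stand-in for the earlier sketch).
sent = []  # records what the stub "mailed"

def was_already_sent(recipient, campaign_id):
    # Stub: pretend one address already got this campaign
    return recipient["email"] == "dupe@example.com"

def send_email(to, subject, body):
    sent.append((to, subject))
    return "sent"

def send_campaign(recipients, template, campaign_id):
    results = []
    for recipient in recipients:
        if was_already_sent(recipient, campaign_id):
            results.append({"email": recipient["email"], "status": "skipped"})
            continue
        body = template["body"].format(name=recipient["name"])
        results.append({"email": recipient["email"],
                        "status": send_email(recipient["email"], template["subject"], body)})
    return results

recipients = [
    {"email": "a@example.com", "name": "Ada"},
    {"email": "dupe@example.com", "name": "Dup"},
]
template = {"subject": "Promo", "body": "Hi {name}!"}
results = send_campaign(recipients, template, "promo_q1")

assert [r["status"] for r in results] == ["sent", "skipped"]
assert sent == [("a@example.com", "Promo")]
```

These are exactly the artifacts the human reviewer in step 4 reads: the function, the stubs, and the assertions.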

Architecture: How This Works on an MCP Server

The implementation on an MCP server like WorkingAgents would look like:

Tool Registry: A new module that stores compiled tools with metadata – who created it, when, what it does, what arguments it takes, which tests passed, who approved it.

Sandbox Execution: Compiled tools run in an isolated environment with defined resource limits (CPU time, memory, network access). A tool that sends emails has network access to the mail server. A tool that generates reports has database read access. Permissions are explicit.
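A minimal sandbox along these lines can be sketched with POSIX resource limits from the standard library. The limits and the Unix-only `preexec_fn` approach are illustrative; a production server would add network isolation, a separate user, or containers:

```python
import resource
import subprocess
import sys

def run_tool_sandboxed(code_path, args, cpu_seconds=10, memory_mb=512):
    """Run a stored tool in a child process with CPU and memory caps.

    Unix-only sketch: limits are applied in the child just before exec.
    """
    def apply_limits():
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        mem = memory_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (mem, mem))

    return subprocess.run(
        [sys.executable, code_path, *args],
        preexec_fn=apply_limits,   # runs in the child before the tool starts
        capture_output=True,
        text=True,
        timeout=cpu_seconds * 2,   # wall-clock backstop on top of the CPU cap
    )
```

A runaway tool hits the CPU or memory ceiling and dies in its own process; the server and the agent see a clean error, not a hung runtime.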

Versioning: Tools are versioned. When the agent writes an improved version of an existing tool, both versions are available. The old version isn’t deleted until the new one is verified.

Argument Schema: Each compiled tool has a JSON schema for its arguments, just like any MCP tool. The agent calls it the same way it calls any other tool – run_email_campaign({recipients: [...], template: {...}, campaign_id: "promo_q1"}).
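The schema for the hypothetical run_email_campaign tool might be declared in standard JSON Schema, the format MCP tool definitions use; the field names follow the earlier sketch:

```python
import json

# Argument schema for the hypothetical run_email_campaign tool.
RUN_EMAIL_CAMPAIGN_SCHEMA = {
    "type": "object",
    "properties": {
        "recipients": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "email": {"type": "string"},
                    "name": {"type": "string"},
                    "company": {"type": "string"},
                },
                "required": ["email", "name"],
            },
        },
        "template": {
            "type": "object",
            "properties": {
                "subject": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["subject", "body"],
        },
        "campaign_id": {"type": "string"},
    },
    "required": ["recipients", "template", "campaign_id"],
}

print(json.dumps(RUN_EMAIL_CAMPAIGN_SCHEMA, indent=2))
```

Because the schema is ordinary MCP metadata, the model needs no special handling to call an agent-authored tool versus a human-authored one.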

Execution Logging: Every invocation is logged with arguments, results, duration, and any errors. The audit trail is automatic.
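That automatic audit trail can be a thin wrapper around every invocation. A sketch, with illustrative record fields:

```python
import time

def logged_invoke(tool_name, func, arguments, log_sink):
    """Call a compiled tool and append a structured log record.

    `log_sink` is any list-like sink; a real server would append
    JSON lines to durable storage instead.
    """
    record = {"tool": tool_name, "arguments": arguments, "started": time.time()}
    try:
        record["result"] = func(**arguments)
        record["status"] = "ok"
    except Exception as exc:
        record["status"] = "error"
        record["error"] = repr(exc)
    record["duration_s"] = round(time.time() - record["started"], 3)
    log_sink.append(record)
    return record

# Usage: wrap any stored tool at invocation time.
log = []
rec = logged_invoke("add", lambda a, b: a + b, {"a": 2, "b": 3}, log)
```

Because the wrapper sits at the server's single choke point, no tool author has to remember to log anything.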

The Hybrid Model: Agent Judgment + Compiled Execution

The ideal isn’t “replace agents with compiled code.” It’s “use agents for judgment and compiled code for execution.”

The agent decides what to do. The compiled tool handles how it gets done.

The agent stays in the loop for decisions that require reasoning, context, and adaptation. Everything else is compiled code. Tokens are spent on thinking, not on mechanical repetition.

What This Means

The current generation of AI agents is bottlenecked by a design assumption: the LLM must be in the execution loop for every action. This is like having a senior engineer personally execute every line of code instead of writing a program and running it.

Agents that write, verify, and deploy their own tools flip this model. The LLM becomes a programmer, not a runtime. It designs solutions rather than executing steps. And the solutions it designs are deterministic, fast, cheap, auditable, and reusable.

The infrastructure for this exists today. MCP servers can host tools. Agents can write code. Sandboxes can execute it safely. The missing piece is the workflow that connects them: write, test, review, deploy, invoke.

Build that workflow, and you have an agent that gets more capable with every task it completes – not because it learns, but because it builds.