By James Aspinwall – February 26, 2026, 12:00
The current paradigm of AI agents relies heavily on “skills” or predefined tools. When an agent needs to perform a complex task, it either follows a long system prompt instruction or calls a sequence of granular tools. This works for simple interactions, but as we push toward autonomous agents managing enterprise-scale workflows, this “one-call-at-a-time” approach becomes a massive bottleneck.
It’s time for a shift in how we think about agentic capability: Agents shouldn’t just use tools; they should write them. Specifically, agents should translate high-level logic into computer code, compile it, store it in an MCP (Model Context Protocol) server, and then execute it as a first-class service.
The Efficiency Gap: The Bulk Email Problem
Consider a common task: sending a promotional email to 5,000 clients.
In the traditional “skill-based” approach, the agent might have access to a send_email(recipient, body) tool. To complete the task, the agent (or its orchestrator) must:
- Retrieve the list of 5,000 clients.
- Iterate through the list in its own reasoning loop.
- Call send_email 5,000 times.
This is catastrophically inefficient. It consumes a massive amount of tokens for the repeated tool calls and repetitive reasoning. It’s slow, as each call involves a round-trip to the LLM or a long-running sequential loop in a high-latency environment. Worst of all, it’s prone to mid-stream failures—if the agent loses context or hits a rate limit at email 2,500, resuming reliably requires complex state management.
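The naive pattern can be sketched in a few lines. This is a hypothetical illustration: call_tool is a stub that only counts invocations, standing in for a real agent-mediated tool call with its full token and latency cost per round-trip.

```python
# Stub standing in for an agent-mediated tool call; each invocation
# would normally cost a full round-trip of tokens and transport latency.
calls = []

def call_tool(name, args):
    calls.append((name, args))

def send_promo_naively(clients, template):
    # The loop itself runs inside the agent's reasoning context.
    for client in clients:
        body = template.format(name=client["name"])
        call_tool("send_email", {"recipient": client["email"], "body": body})

clients = [{"name": f"Client {i}", "email": f"c{i}@example.com"} for i in range(5000)]
send_promo_naively(clients, "Hi {name}, check out our promo!")
print(len(calls))  # 5000 separate tool calls
```

Five thousand round-trips for one logical operation, with the agent holding the loop state the entire time.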
The Compiled Approach
Instead, the agent identifies the pattern. It writes a specialized piece of code—in Elixir, Python, or Rust—that handles the entire batch operation natively.
The agent-generated code includes:
- Deduplication logic: Ensuring no client receives the same template twice within a specific window.
- Batching and Rate Limiting: Dispatching emails in chunks to respect provider limits without LLM intervention.
- Direct Persistence: Logging the results directly to the database (e.g., via Sqler) as they happen.
- Atomic Verification: Returning a single “OK” or a detailed summary report to the agent upon completion.
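A minimal sketch of such an agent-generated module, in Python for illustration (the function name, schema, and send_fn callback are assumptions, not a real API). It folds dedup, batching, per-send persistence, and a single summary result into one entry point:

```python
import sqlite3
import time

BATCH_SIZE = 100
BATCH_DELAY_S = 0.0  # set > 0 to respect provider rate limits

def bulk_promo_sender(recipients, template_id, send_fn, db):
    # Dedup table: one row per (recipient, template) ever sent.
    db.execute(
        "CREATE TABLE IF NOT EXISTS sent_log "
        "(email TEXT, template TEXT, PRIMARY KEY (email, template))"
    )
    sent, skipped = 0, 0
    for i in range(0, len(recipients), BATCH_SIZE):
        for email in recipients[i:i + BATCH_SIZE]:
            # Deduplication: skip anyone already logged for this template.
            if db.execute(
                "SELECT 1 FROM sent_log WHERE email=? AND template=?",
                (email, template_id),
            ).fetchone():
                skipped += 1
                continue
            send_fn(email, template_id)
            # Direct persistence: log as we go, so a crash is resumable.
            db.execute("INSERT INTO sent_log VALUES (?, ?)", (email, template_id))
            sent += 1
        db.commit()
        time.sleep(BATCH_DELAY_S)  # batching / rate limiting
    # Atomic verification: one summary back to the agent.
    return {"status": "OK", "sent": sent, "skipped": skipped}

db = sqlite3.connect(":memory:")
outbox = []
result = bulk_promo_sender(
    ["a@x.com", "b@x.com", "a@x.com"], "promo_v1",
    lambda e, t: outbox.append(e), db,
)
print(result)  # {'status': 'OK', 'sent': 2, 'skipped': 1}
```

Because every send is logged before moving on, a crash at email 2,500 resumes cleanly: the dedup check skips everyone already in the log.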
The agent sends this code to an MCP server that supports dynamic compilation (a “Logic Host”). Once compiled, the agent makes a single call: call_tool("bulk_promo_sender", %{recipients: list, template_id: "promo_v1"}).
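The “Logic Host” side can be approximated in a few lines. This is a speculative sketch, not part of the MCP specification: it compiles submitted source, pulls out a named entry point, and registers it in an in-process tool registry (a real host would add sandboxing and protocol plumbing).

```python
# Hypothetical Logic Host: compile agent-submitted source, register the
# result as a named tool, and dispatch later calls to it.
registry = {}

def register_tool(name, source, entry_point):
    namespace = {}
    code = compile(source, f"<tool:{name}>", "exec")  # syntax check + compile
    exec(code, namespace)
    registry[name] = namespace[entry_point]

def call_tool(name, args):
    return registry[name](**args)

register_tool(
    "adder",
    "def run(a, b):\n    return a + b\n",
    "run",
)
print(call_tool("adder", {"a": 2, "b": 3}))  # 5
```

Once registered, the tool is indistinguishable from a hand-written one: the agent addresses it by name with a single call.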
Why Compiled Logic Wins
1. Repeatability and Determinism
LLMs are probabilistic; code is deterministic. When an agent writes a script to handle a complex logic flow, and that script is verified (either by a human or a secondary “reviewer” agent), it becomes a reliable, reusable building block. We move from “I hope the agent remembers to check the database before sending” to “The compiled code will check the database because that is its defined logic.”
2. Radical Token Savings
By moving the loop from the LLM context into the execution environment, we save thousands of tokens. The agent no longer needs to process every intermediate step or see the output of every granular tool call. It only needs the final result. This makes complex operations feasible for models with smaller context windows or tighter budget constraints.
3. Speed and Performance
Execution in a compiled environment (like the Erlang VM/BEAM or a fast Python runtime) happens at native speeds. An agent-orchestrated loop is limited by the latency of the LLM and the transport layer. A compiled loop is limited only by the CPU, memory, and I/O.
4. Safety and Auditing
It is significantly easier to audit a 50-line script generated by an agent than it is to audit a 5,000-step execution trace. We can run static analysis on the generated code, check it for security vulnerabilities (e.g., SQL injection, unauthorized file access), and sandbox it before it ever touches production data.
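A first audit pass over generated code can be mechanical. The sketch below, with an illustrative banned-call list (a real pipeline would layer a proper linter and sandbox on top), walks the script’s AST and flags calls that spawn processes or evaluate arbitrary code:

```python
import ast

# Illustrative deny-list; a production audit would be far more thorough.
BANNED_CALLS = {"eval", "exec", "open", "system", "popen"}

def audit(source):
    """Return (line, call_name) for every banned call in the source."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            fn = node.func
            # Handles both bare names (eval) and attributes (os.system).
            name = fn.id if isinstance(fn, ast.Name) else getattr(fn, "attr", None)
            if name in BANNED_CALLS:
                findings.append((node.lineno, name))
    return findings

safe = "def run(xs):\n    return sum(xs)\n"
risky = "import os\ndef run():\n    os.system('rm -rf /tmp/x')\n"
print(audit(safe))   # []
print(audit(risky))  # [(3, 'system')]
```

The point is that the artifact under review is a short, static script, which is a tractable audit target in a way a 5,000-step trace never is.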
The Frontier Model Advantage
The feasibility of this approach rests entirely on the quality of code generation. Two years ago, this would have been a recipe for broken systems and “hallucinated” syntax. Today, frontier models like Claude 3.5 Sonnet and GPT-4o have reached a “phase change” in coding capability.
These models don’t just write snippets; they understand modularity, error handling, and language-specific idioms. In our testing with the WorkingAgents ecosystem, we’ve seen agents successfully generate Elixir modules that interface with existing Sqler databases and Pushover notification systems with near-zero syntax errors on the first pass.
Beyond Email: Other Use Cases
- Data Transformation: Instead of an agent reading 1,000 JSON objects and reformatting them one-by-one, it writes a transformation script and runs it over the entire dataset in-memory.
- Complex Financial Calculations: An agent writes a specialized calculator for a specific tax law or portfolio rebalancing strategy, ensuring mathematical precision that LLMs often struggle with during long-form reasoning.
- Log Analysis: Instead of grep-ing through files manually, the agent writes a custom parser that aggregates errors and identifies patterns across gigabytes of logs in seconds.
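The log-analysis case is a natural fit because the parser streams line by line. A hedged sketch (the log format and error-class regex are assumptions about a hypothetical format):

```python
import re
from collections import Counter

# Assumed log format: "TIMESTAMP LEVEL [component] message".
ERROR_RE = re.compile(r"ERROR\s+\[(?P<component>\w+)\]")

def aggregate_errors(lines):
    # Streams line by line, so it scales to files far larger than
    # any context window.
    counts = Counter()
    for line in lines:
        m = ERROR_RE.search(line)
        if m:
            counts[m.group("component")] += 1
    return counts.most_common()

log = [
    "2026-02-26 12:00:01 INFO  [web] request ok",
    "2026-02-26 12:00:02 ERROR [db] connection refused",
    "2026-02-26 12:00:03 ERROR [db] connection refused",
    "2026-02-26 12:00:04 ERROR [auth] token expired",
]
print(aggregate_errors(log))  # [('db', 2), ('auth', 1)]
```

The agent sees only the aggregated counts, not the gigabytes of raw lines that produced them.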
Conclusion
The future of agency isn’t just “chatting with tools.” It’s “engineering on the fly.” By treating MCP servers as dynamic execution environments for agent-generated code, we unlock a level of scale, reliability, and cost-efficiency that was previously impossible. We aren’t just building assistants; we’re building autonomous DevOps engineers that expand their own capabilities in real-time.