By James Aspinwall, co-written by Alfred (your trusted AI agent) – February 24, 2026, 12:00
In the previous article we covered the four advanced tool use features Anthropic shipped: programmatic tool calling, dynamic filtering, tool search with deferred loading, and tool use examples. This article gets specific – how we implement each one in WorkingAgents, what we gain, what can go wrong, and a phased plan to get there.
Where We Stand Today
WorkingAgents currently exposes 94 MCP tools across 9 categories:
| Category | Tools | Permission Key |
|---|---|---|
| Tasks | 20 | 80_001 |
| CRM (NIS) | 17 | 80_001 |
| Access Control | 11 | 60_001 |
| | 7 | 70_001 |
| Summaries | 7 | 30_001 |
| Platform | 5 | 123_456 |
| Blogs | 4 | 50_001 |
| Utility | 4 | 60_001 |
| Other | 19 | various |
The tool-use loop in both ServerChat.Anthropic and WhatsAppClaude follows the standard pattern:
- Send all permitted tools + conversation history to Claude
- Claude returns a `tool_use` response with one or more tool calls
- Execute each tool sequentially via `Enum.map`
- Append the full result of every tool call to the conversation context
- Send the updated context back to Claude
- Repeat until Claude returns `end_turn`
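Stripped to its essentials, that loop can be sketched with the model call and the tool executor injected as functions. The module and message shapes below are illustrative, not our actual API:

```elixir
defmodule ToolLoop do
  # model_fun: messages -> %{stop_reason: "end_turn" | "tool_use", ...}
  # exec_fun: tool_use -> result
  def run(messages, model_fun, exec_fun) do
    case model_fun.(messages) do
      %{stop_reason: "end_turn"} = response ->
        response

      %{stop_reason: "tool_use", tool_uses: uses} ->
        # Execute every requested tool, append the results, and go around again.
        results = Enum.map(uses, exec_fun)
        run(messages ++ [%{role: "user", tool_results: results}], model_fun, exec_fun)
    end
  end
end
```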
This works. But with 94 tools, sequential execution, and full results in context, we’re leaving significant performance and cost on the table.
Feature 1: Tool Search and Deferred Loading
Why this comes first
With 94 tool definitions, every LLM call ships ~15,000-20,000 tokens just for the tool schemas. Most requests need 3-5 tools. That’s 80%+ wasted context on tool definitions alone.
How to implement it
The change is primarily in MyMCPServer and MyMCPServer.Manager.
Step 1: Categorize tools by visibility
Add a :deferred flag to each tool definition:
```elixir
%{
  name: "task_create",
  description: "Create a new task...",
  required_permission: 80_001,
  deferred: true, # <-- new field
  inputSchema: %{...}
}
```
Always-visible tools (never deferred): `task_capture`, `task_next`, `task_plan`, `whatsapp_send`, `whatsapp_recent`, `nis_search`, `summary_request`, `current_time`. These cover the most common entry points. Everything else gets `deferred: true`.
Step 2: Add a tool_search tool
Define a new tool that takes a natural language query and returns matching tool definitions:
```elixir
%{
  name: "tool_search",
  description: "Search for available tools by keyword or capability...",
  deferred: false,
  inputSchema: %{
    type: "object",
    properties: %{
      query: %{type: "string", description: "What capability you need"}
    },
    required: ["query"]
  }
}
```
The handler searches tool names and descriptions using simple string matching or FTS. Returns the full schema for matching tools so Claude can call them in subsequent turns.
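A minimal version of that handler could be plain keyword matching over names and descriptions, with the tool list passed in. `ToolSearch` is an illustrative name, not an existing module:

```elixir
defmodule ToolSearch do
  # Return full definitions for tools whose name or description
  # contains any word of the query (case-insensitive OR-match).
  def search(query, tool_defs) do
    terms =
      query
      |> String.downcase()
      |> String.split(~r/\s+/, trim: true)

    Enum.filter(tool_defs, fn tool ->
      haystack = String.downcase(tool.name <> " " <> tool.description)
      Enum.any?(terms, &String.contains?(haystack, &1))
    end)
  end
end
```

FTS or semantic matching can replace the `Enum.filter` body later without changing the handler's contract.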
Step 3: Update tools_for_permissions/1
Currently this function filters by permission and returns every permitted tool. Change it to also filter by `:deferred` – by default, return only non-deferred tools plus `tool_search`. When `tool_search` is called, return the full schemas of matching deferred tools.
Step 4: Update the Manager
MyMCPServer.Manager.list_tools/1 needs to track which deferred tools have been “discovered” in the current session. After tool_search returns results, those tools become callable for subsequent turns.
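One way to sketch that session state, assuming the Manager keeps a map from session id to the set of discovered tool names (module and function names here are hypothetical):

```elixir
defmodule DiscoveredTools do
  def new, do: %{}

  # Record the tools that tool_search returned for this session.
  def discover(state, session_id, tool_names) do
    Map.update(
      state,
      session_id,
      MapSet.new(tool_names),
      &MapSet.union(&1, MapSet.new(tool_names))
    )
  end

  # A deferred tool is callable only once discovered in this session.
  def callable?(state, session_id, %{deferred: true, name: name}) do
    state |> Map.get(session_id, MapSet.new()) |> MapSet.member?(name)
  end

  # Non-deferred tools are always callable.
  def callable?(_state, _session_id, _tool), do: true
end
```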
What we gain
- ~80% reduction in tool schema tokens – from ~15,000 down to ~3,000 per call
- Faster response times from reduced prompt size
- Room to add more tools without linear context growth
Potential issues
- Extra round-trip: If Claude needs a deferred tool, it must call `tool_search` first, then call the actual tool. That's one extra LLM call for the first use of an unfamiliar tool.
- Search quality: Simple string matching might miss relevant tools. "Send a message to my contact" should find `whatsapp_send` even though the query doesn't mention "whatsapp." We may need semantic matching or keyword aliases.
- Session state: The Manager currently doesn't track per-session discovered tools. We'd need to either pass discovered tools through the conversation or maintain a per-session tool registry.
- Provider compatibility: This is Anthropic-specific. OpenRouter and Perplexity providers would need their own handling or a provider-agnostic abstraction.
Feature 2: Programmatic Tool Calling
Why this matters for us
Consider this real workflow: “Find my overdue tasks, check which contacts are associated, and send them follow-up WhatsApp messages.”
Current flow (5+ LLM round-trips):
- Claude calls `task_query(name: "overdue")` – returns a list of tasks
- Claude calls `nis_search(q: "contact name")` – for each task
- Claude calls `whatsapp_send(jid, text)` – for each contact
- Each round-trip includes the full conversation history + all previous results
With programmatic tool calling (1-2 round-trips):
- Claude writes code that loops through tasks, searches contacts, and sends messages
- All tool executions happen in one batch
- Only the final summary enters the conversation context
How to implement it
Step 1: Add the code execution tool
Add a new tool to the Anthropic provider that signals code execution capability:
```elixir
%{
  type: "code_execution",
  name: "code_execution_tool"
  # Anthropic-specific version identifier
}
```
Step 2: Set allowed_caller on existing tools
Each tool definition gets an allowed_caller field pointing to the code execution tool. This tells Claude it can call these tools from within generated code rather than emitting individual JSON tool calls.
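Sketched against our existing definition shape, that might look like the following – the exact field name and the code-execution version string should be taken from Anthropic's documentation, not from this sketch:

```elixir
%{
  name: "task_create",
  description: "Create a new task...",
  inputSchema: %{...},
  # callable from generated code, not only via direct JSON tool calls
  allowed_caller: "code_execution_tool"
}
```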
Step 3: Update the tool-use loop in ServerChat.Anthropic
The current loop in handle_response/5 (lines 79-138 of anthropic.ex) does:
```elixir
tool_results = Enum.map(tool_uses, fn tool_use ->
  result = ServerChat.call_mcp_tool(user_id, tool_use["name"], tool_use["input"])
  %{type: "tool_result", tool_use_id: tool_use["id"], content: result}
end)
```
The new flow needs to:
- Detect when the response contains a code execution block
- Parse the code to extract all tool calls
- Execute them (potentially in parallel – see below)
- Return results to the code execution sandbox
- Let the sandbox complete execution and return the final result
Step 4: Parallel execution
Replace Enum.map with Task.async_stream for independent tool calls:
```elixir
tool_results =
  tool_uses
  |> Task.async_stream(
    fn tool_use ->
      ServerChat.call_mcp_tool(user_id, tool_use["name"], tool_use["input"])
    end,
    max_concurrency: 5,
    timeout: 30_000,
    # yield {:exit, :timeout} for a slow tool instead of crashing the caller
    on_timeout: :kill_task
  )
  # async_stream preserves order, so zipping recovers each tool_use id
  |> Enum.zip(tool_uses)
  |> Enum.map(fn
    {{:ok, result}, tool_use} ->
      %{type: "tool_result", tool_use_id: tool_use["id"], content: result}

    {{:exit, reason}, tool_use} ->
      %{type: "tool_result", tool_use_id: tool_use["id"], content: "Tool call failed: #{inspect(reason)}"}
  end)
```
This alone – even without code execution – would improve performance for multi-tool responses.
What we gain
- 30-50% token savings on complex workflows
- Fewer LLM round-trips – code batches what would be 5+ individual calls
- Deterministic orchestration – loops, conditionals, and data piping happen in code, not in probabilistic JSON generation
- Parallel execution – independent tool calls run concurrently
Potential issues
- Code sandbox security: We need a sandboxed execution environment. Running model-generated code directly is dangerous. Options:
  - Use Anthropic's hosted sandbox (if available)
  - Run in a restricted BEAM process with timeout and memory limits
  - Use a separate Node.js sandbox via our existing MuonTrap infrastructure
- Error handling in code: If one tool call in a code block fails, does the whole block fail? We need graceful error propagation within the code execution.
- Debugging: When a code block calls 8 tools in a loop, tracing failures is harder than single-call debugging. We'll need structured logging of code execution steps.
- Provider lock-in: This is deeply Anthropic-specific. OpenRouter/Perplexity providers can't use this. The provider abstraction in `ServerChat.Provider` needs to handle this gracefully – providers that don't support code execution fall back to standard tool calling.
- WhatsAppClaude: The tool-use loop in `whatsapp_claude.ex` is a separate implementation from `server_chat/anthropic.ex`. Both need updating. Consider extracting a shared tool execution module.
Feature 3: Tool Use Examples
Why this helps us now
Several of our tools have complex schemas where Claude makes mistakes:
- `task_create`: priority (0-10), tags (comma-separated), due_at (Unix seconds), status (enum), parent_id (optional). Claude sometimes uses wrong date formats or forgets to convert to Unix seconds.
- `nis_create_contact`: 15+ fields, many optional. Claude often skips the `contact_every` format ("2w", "1m") or misformats `birthday` (should be "MM-DD" or "YYYY-MM-DD").
- `task_query`: The `name` parameter must match one of 60+ query function names exactly. Claude sometimes invents names that don't exist.
- `nis_log_interaction`: `channel` must be one of email/phone/meeting/whatsapp. Claude sometimes uses "call" instead of "phone".
How to implement it
Step 1: Add input_examples to tool definitions
Extend the tool definition structure:
```elixir
%{
  name: "task_create",
  description: "Create a new task...",
  inputSchema: %{...},
  input_examples: [
    %{
      title: "Review Q1 budget",
      priority: 7,
      due_at: 1771904400,
      tags: "#work, #finance",
      status: "todo"
    },
    %{
      title: "Buy groceries",
      priority: 3,
      due_at: 1771818000,
      tags: "#home",
      parent_id: 42
    }
  ]
}
```
Step 2: Include examples in tool schemas sent to LLM
The Anthropic API supports input_examples as a top-level field on tool definitions. We already format tools in ServerChat.Anthropic.format_tools/1 – add the examples there.
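A standalone sketch of that merge – the internal field names mirror the definition maps above, and `ToolFormatter` is an illustrative stand-in for the existing `format_tools/1`:

```elixir
defmodule ToolFormatter do
  # Build the wire-format tool list, attaching input_examples only
  # for tools that define them.
  def format_tools(tools) do
    Enum.map(tools, fn tool ->
      base = %{
        "name" => tool.name,
        "description" => tool.description,
        "input_schema" => tool.inputSchema
      }

      case Map.get(tool, :input_examples) do
        nil -> base
        examples -> Map.put(base, "input_examples", examples)
      end
    end)
  end
end
```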
Step 3: Write examples for the most error-prone tools
Priority targets based on current error rates:
- `task_create` / `task_update` – date formats, priority range
- `nis_create_contact` / `nis_update_contact` – field formats, `contact_every` syntax
- `task_query` – valid query names
- `nis_log_interaction` – valid channel values
- `task_capture` – natural language syntax with tags and priority markers
What we gain
- Accuracy improvement from ~72% to ~90% on complex parameter handling (based on AI Jason's experiments)
- Fewer failed tool calls that waste round-trips
- Less need for verbose descriptions – examples communicate format better than prose
Potential issues
- Token cost of examples: Each example adds tokens to the tool schema. With 94 tools, adding 2 examples each could add ~5,000 tokens. This partially offsets the gains from tool search. Prioritize examples for the most error-prone tools only.
- Maintenance burden: Examples need updating when tool schemas change. If `task_create` gets a new field, the examples should show it.
- Provider compatibility: Check that OpenRouter's API passes `input_examples` through to underlying models. Perplexity may ignore them entirely.
Feature 4: Dynamic Filtering for Web Fetch
How it applies to us
We have two tools that fetch external content:
- `fetch_url` (Utility) – fetches a URL and returns content
- `summary_request` (Summaries) – fetches a URL and summarizes it via LLM
Currently fetch_url returns the raw content, which can be enormous for HTML pages. The summary pipeline already does its own content extraction, but the raw fetch path doesn’t.
How to implement it
Step 1: Point to Anthropic’s filtered web_fetch version
If using Anthropic’s built-in web_fetch tool (version web_fetch-2026209), the filtering happens automatically through the code execution infrastructure. We’d replace our custom fetch_url with Anthropic’s version for the Anthropic provider.
Step 2: For our custom fetch, add content extraction
For providers that don’t support dynamic filtering, add a content extraction step in the fetch_url handler:
```elixir
def handle_tool_call("fetch_url", %{"url" => url}, state) do
  case Req.get(url) do
    {:ok, %{body: body}} when is_binary(body) ->
      # Strip scripts, styles, nav, footer -- keep article content
      clean =
        body
        |> Floki.parse_document!(attributes_as_maps: true)
        |> Floki.filter_out("script,style,nav,footer,header,aside")
        |> Floki.raw_html()

      {:reply, %{content: [%{type: "text", text: clean}]}, state}

    {:error, reason} ->
      {:reply, %{content: [%{type: "text", text: "Fetch failed: #{inspect(reason)}"}]}, state}
  end
end
```
What we gain
- ~24% token reduction on web-fetch-heavy workflows
- Cleaner content for summarization pipeline
- Less noise in conversation context
Potential issues
- Over-filtering: Aggressive HTML stripping might remove relevant content (e.g., code in `<script>` tags on a tutorial page, data in `<aside>` elements).
- Non-HTML content: PDFs, JSON APIs, and plain text don't need filtering. Need content-type detection before applying filters.
The Implementation Plan
Phase 1: Quick Wins (1-2 days)
1a. Parallel tool execution
Replace `Enum.map` with `Task.async_stream` in both server_chat/anthropic.ex and whatsapp_claude.ex. This is a small, self-contained change per file with immediate performance gains for multi-tool responses.
1b. Tool use examples for top 5 error-prone tools
Add input_examples to task_create, task_update, nis_create_contact, task_query, and nis_log_interaction. Immediate accuracy improvement, no architectural changes.
Phase 2: Tool Search (3-5 days)
2a. Add deferred flag to tool definitions
Mark 70+ tools as deferred. Keep ~15 high-frequency tools always visible.
2b. Implement tool_search handler
Build the search logic – start with keyword matching on tool names and descriptions. Add aliases for common queries (“send message” -> whatsapp_send, “find contact” -> nis_search).
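The alias layer can be as simple as a keyword-to-tool map consulted before the normal search. The alias entries below are examples, not a finalized list:

```elixir
defmodule ToolAliases do
  # Keywords that should surface a specific tool even when the
  # query never mentions the tool's name.
  @aliases %{
    "message" => "whatsapp_send",
    "contact" => "nis_search"
  }

  # Return the original query plus any alias-matched tool names,
  # all of which then feed into the keyword search.
  def expand(query) do
    q = String.downcase(query)
    extra = for {keyword, tool} <- @aliases, String.contains?(q, keyword), do: tool
    [query | extra]
  end
end
```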
2c. Update tools_for_permissions and Manager
Filter deferred tools from default tool list. Track discovered tools per session.
2d. Test token savings
Measure prompt size before and after. Target: 70%+ reduction in tool schema tokens.
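For a quick before/after comparison, a characters-divided-by-four heuristic over the serialized tool list is usually close enough. This is an approximation, not a real tokenizer:

```elixir
defmodule SchemaTokens do
  # Rough token estimate: serialize the tool definitions and
  # divide the byte size by 4 (a common chars-per-token heuristic).
  def estimate(tools) do
    tools
    |> inspect(limit: :infinity, printable_limit: :infinity)
    |> byte_size()
    |> div(4)
  end
end
```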
Phase 3: Content Filtering (2-3 days)
3a. Add HTML content extraction to fetch_url
Use Floki to strip non-content elements before returning results.
3b. Evaluate Anthropic’s web_fetch tool
Test the built-in filtered version. If it works well, use it for Anthropic provider and keep our custom version for others.
3c. Truncate large tool results
Add a configurable max length for tool results before they enter conversation context. Log the full result separately for debugging.
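A sketch of that truncation step – the 4,000-character default is an assumed starting point, to be tuned against real tool output sizes:

```elixir
defmodule ResultTruncator do
  require Logger

  @max_chars 4_000

  # Cap a tool result before it enters the conversation context;
  # the full size is logged so debugging isn't blinded by the cut.
  def truncate(result, max \\ @max_chars) when is_binary(result) do
    if String.length(result) > max do
      Logger.debug("truncated tool result: #{String.length(result)} chars")
      String.slice(result, 0, max) <> "\n[... truncated ...]"
    else
      result
    end
  end
end
```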
Phase 4: Programmatic Tool Calling (1-2 weeks)
4a. Research sandbox options
Evaluate: Anthropic’s hosted sandbox, restricted BEAM process, MuonTrap + Node.js sandbox. Choose based on security, latency, and complexity.
4b. Update Anthropic provider
Add code execution tool to tool list. Set allowed_caller on existing tools. Update handle_response to detect and handle code execution blocks.
4c. Extract shared tool execution module
Both server_chat/anthropic.ex and whatsapp_claude.ex implement the tool-use loop independently. Extract a shared ToolExecutor module that both can use. This simplifies adding code execution support in one place.
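A possible shape for that shared module, with the actual tool call injected as a function so both loops (and tests) can supply their own executor – in production that would be `ServerChat.call_mcp_tool` wrapped with the user id:

```elixir
defmodule ToolExecutor do
  @doc """
  Execute a list of tool_use blocks in parallel and return
  tool_result maps in the same order.
  """
  def run(tool_uses, exec_fun, opts \\ []) do
    max = Keyword.get(opts, :max_concurrency, 5)
    timeout = Keyword.get(opts, :timeout, 30_000)

    tool_uses
    |> Task.async_stream(
      fn tu -> exec_fun.(tu["name"], tu["input"]) end,
      max_concurrency: max,
      timeout: timeout,
      on_timeout: :kill_task
    )
    # async_stream keeps order, so zipping recovers each tool_use id
    |> Enum.zip(tool_uses)
    |> Enum.map(fn
      {{:ok, result}, tu} ->
        %{type: "tool_result", tool_use_id: tu["id"], content: result}

      {{:exit, reason}, tu} ->
        %{type: "tool_result", tool_use_id: tu["id"], content: "Tool call failed: #{inspect(reason)}"}
    end)
  end
end
```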
4d. Implement code execution flow
Parse code blocks, extract tool calls, execute (with parallel support), return results to sandbox, collect final output.
4e. Provider fallback
Ensure OpenRouter and Perplexity providers gracefully fall back to standard tool calling when code execution isn’t available.
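A simple capability map per provider keeps that fallback explicit. The feature lists below are assumptions to be verified against each provider's actual API:

```elixir
defmodule ProviderFeatures do
  # Which advanced tool-use features each provider supports
  # (illustrative; verify against real provider behavior).
  @features %{
    anthropic: [:tool_search, :code_execution, :input_examples],
    openrouter: [:input_examples],
    perplexity: []
  }

  def supports?(provider, feature) do
    feature in Map.get(@features, provider, [])
  end
end
```

Call sites then branch once, e.g. fall back to the standard tool loop whenever `ProviderFeatures.supports?(provider, :code_execution)` is false.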
Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Tool search returns wrong tools | Medium | Low | Always include tool_search so Claude can retry; add keyword aliases |
| Code execution sandbox escape | Low | High | Use Anthropic’s hosted sandbox; add timeout + memory limits |
| Examples increase token cost | Certain | Low | Only add examples to top 10 error-prone tools |
| Provider incompatibility | Medium | Medium | Feature-detect per provider; graceful fallback to standard calling |
| Breaking WhatsAppClaude loop | Medium | Medium | Extract shared module first; test both paths |
| Deferred tool not discovered | Medium | Low | Keep most-used tools always visible; good search aliases |
Expected Impact
| Metric | Current | After Phase 1-2 | After Phase 3-4 |
|---|---|---|---|
| Tool schema tokens per call | ~15,000 | ~3,000 | ~3,000 |
| Avg round-trips for complex tasks | 4-6 | 4-6 | 1-2 |
| Tool execution (parallel) | Sequential | Parallel (5x) | Parallel + batched |
| Parameter accuracy | ~75% | ~90% | ~90% |
| Web fetch token waste | ~24% overhead | ~24% | ~5% |
The phases are ordered by effort-to-impact ratio. Phase 1 gives immediate gains with minimal risk. Phase 2 delivers the biggest token savings. Phase 3 cleans up content handling. Phase 4 is the architectural leap – highest reward but most complex.
Start with Phase 1. Measure. Ship Phase 2. Measure again. By then we’ll know exactly how much Phase 3 and 4 are worth.