By James Aspinwall, co-written by Alfred (your trusted AI agent) – February 24, 2026, 12:00
In the previous article we covered the four advanced tool use features Anthropic shipped: programmatic tool calling, dynamic filtering, tool search with deferred loading, and tool use examples. This article gets specific – how we implement each one in WorkingAgents, what we gain, what can go wrong, and a phased plan to get there.
Where We Stand Today
WorkingAgents currently exposes 94 MCP tools across 9 categories:
| Category | Tools | Permission Key |
|---|---|---|
| Tasks | 20 | 80_001 |
| CRM (NIS) | 17 | 80_001 |
| Access Control | 11 | 60_001 |
| | 7 | 70_001 |
| Summaries | 7 | 30_001 |
| Platform | 5 | 123_456 |
| Blogs | 4 | 50_001 |
| Utility | 4 | 60_001 |
| Other | 19 | various |
The tool-use loop in both ServerChat.Anthropic and WhatsAppClaude follows the standard pattern:
- Send all permitted tools + conversation history to Claude
- Claude returns a `tool_use` response with one or more tool calls
- Execute each tool sequentially via `Enum.map`
- Append the full result of every tool call to the conversation context
- Send the updated context back to Claude
- Repeat until Claude returns `end_turn`
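Stripped to its essentials, that loop can be sketched with the model call and the tool executor injected as functions. The module and message shapes below are illustrative, not our actual API:

```elixir
defmodule ToolLoop do
  # model_fun: messages -> %{stop_reason: "end_turn" | "tool_use", ...}
  # exec_fun: tool_use -> result
  def run(messages, model_fun, exec_fun) do
    case model_fun.(messages) do
      %{stop_reason: "end_turn"} = response ->
        response

      %{stop_reason: "tool_use", tool_uses: uses} ->
        # Execute every requested tool, append the results, and go around again.
        results = Enum.map(uses, exec_fun)
        run(messages ++ [%{role: "user", tool_results: results}], model_fun, exec_fun)
    end
  end
end
```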
This works. But with 94 tools, sequential execution, and full results in context, we’re leaving significant performance and cost on the table.
Feature 1: Tool Search and Deferred Loading
Why this comes first
With 94 tool definitions, every LLM call ships ~15,000-20,000 tokens just for the tool schemas. Most requests need 3-5 tools. That’s 80%+ wasted context on tool definitions alone.
How to implement it
The change is primarily in MyMCPServer and MyMCPServer.Manager.
Step 1: Categorize tools by visibility
Add a :deferred flag to each tool definition:
```elixir
%{
  name: "task_create",
  description: "Create a new task...",
  required_permission: 80_001,
  deferred: true, # <-- new field
  inputSchema: %{...}
}
```
Always-visible tools (never deferred): `task_capture`, `task_next`, `task_plan`, `whatsapp_send`, `whatsapp_recent`, `nis_search`, `summary_request`, `current_time`. These cover the most common entry points. Everything else gets `deferred: true`.
Step 2: Add a tool_search tool
Define a new tool that takes a natural language query and returns matching tool definitions:
```elixir
%{
  name: "tool_search",
  description: "Search for available tools by keyword or capability...",
  deferred: false,
  inputSchema: %{
    type: "object",
    properties: %{
      query: %{type: "string", description: "What capability you need"}
    },
    required: ["query"]
  }
}
```
The handler searches tool names and descriptions using simple string matching or FTS. Returns the full schema for matching tools so Claude can call them in subsequent turns.
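A minimal version of that handler could be plain keyword matching over names and descriptions, with the tool list passed in. `ToolSearch` is an illustrative name, not an existing module:

```elixir
defmodule ToolSearch do
  # Return full definitions for tools whose name or description
  # contains any word of the query (case-insensitive OR-match).
  def search(query, tool_defs) do
    terms =
      query
      |> String.downcase()
      |> String.split(~r/\s+/, trim: true)

    Enum.filter(tool_defs, fn tool ->
      haystack = String.downcase(tool.name <> " " <> tool.description)
      Enum.any?(terms, &String.contains?(haystack, &1))
    end)
  end
end
```

FTS or semantic matching can replace the `Enum.filter` body later without changing the handler's contract.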
Step 3: Update tools_for_permissions/1
Currently this function filters by permission and returns every permitted tool. Change it to also filter by `:deferred` – by default, return only non-deferred tools plus `tool_search`. When `tool_search` is called, return the full schemas of matching deferred tools.
Step 4: Update the Manager
MyMCPServer.Manager.list_tools/1 needs to track which deferred tools have been “discovered” in the current session. After tool_search returns results, those tools become callable for subsequent turns.
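One way to sketch that session state, assuming the Manager keeps a map from session id to the set of discovered tool names (module and function names here are hypothetical):

```elixir
defmodule DiscoveredTools do
  def new, do: %{}

  # Record the tools that tool_search returned for this session.
  def discover(state, session_id, tool_names) do
    Map.update(
      state,
      session_id,
      MapSet.new(tool_names),
      &MapSet.union(&1, MapSet.new(tool_names))
    )
  end

  # A deferred tool is callable only once discovered in this session.
  def callable?(state, session_id, %{deferred: true, name: name}) do
    state |> Map.get(session_id, MapSet.new()) |> MapSet.member?(name)
  end

  # Non-deferred tools are always callable.
  def callable?(_state, _session_id, _tool), do: true
end
```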
What we gain
- ~80% reduction in tool schema tokens – from ~15,000 down to ~3,000 per call
- Faster response times from reduced prompt size
- Room to add more tools without linear context growth
Potential issues
- Extra round-trip: If Claude needs a deferred tool, it must call `tool_search` first, then call the actual tool. That's one extra LLM call for the first use of an unfamiliar tool.
- Search quality: Simple string matching might miss relevant tools. "Send a message to my contact" should find `whatsapp_send` even though the query doesn't mention "whatsapp." We may need semantic matching or keyword aliases.
- Session state: The Manager currently doesn't track per-session discovered tools. We'd need to either pass discovered tools through the conversation or maintain a per-session tool registry.
- Provider compatibility: This is Anthropic-specific. OpenRouter and Perplexity providers would need their own handling or a provider-agnostic abstraction.
Feature 2: Programmatic Tool Calling
Why this matters for us
Consider this real workflow: “Find my overdue tasks, check which contacts are associated, and send them follow-up WhatsApp messages.”
Current flow (5+ LLM round-trips):
- Claude calls `task_query(name: "overdue")` – returns a list of tasks
- Claude calls `nis_search(q: "contact name")` – for each task
- Claude calls `whatsapp_send(jid, text)` – for each contact
- Each round-trip includes the full conversation history + all previous results
With programmatic tool calling (1-2 round-trips):
- Claude writes code that loops through tasks, searches contacts, and sends messages
- All tool executions happen in one batch
- Only the final summary enters the conversation context
How to implement it
Step 1: Add the code execution tool
Add a new tool to the Anthropic provider that signals code execution capability:
```elixir
%{
  type: "code_execution",
  name: "code_execution_tool"
  # Anthropic-specific version identifier
}
```
Step 2: Set allowed_caller on existing tools
Each tool definition gets an allowed_caller field pointing to the code execution tool. This tells Claude it can call these tools from within generated code rather than emitting individual JSON tool calls.
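Sketched against our existing definition shape, that might look like the following – the exact field name and the code-execution version string should be taken from Anthropic's documentation, not from this sketch:

```elixir
%{
  name: "task_create",
  description: "Create a new task...",
  inputSchema: %{...},
  # callable from generated code, not only via direct JSON tool calls
  allowed_caller: "code_execution_tool"
}
```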
Step 3: Update the tool-use loop in ServerChat.Anthropic
The current loop in handle_response/5 (lines 79-138 of anthropic.ex) does:
```elixir
tool_results = Enum.map(tool_uses, fn tool_use ->
  result = ServerChat.call_mcp_tool(user_id, tool_use["name"], tool_use["input"])
  %{type: "tool_result", tool_use_id: tool_use["id"], content: result}
end)
```
The new flow needs to:
- Detect when the response contains a code execution block
- Parse the code to extract all tool calls
- Execute them (potentially in parallel – see below)
- Return results to the code execution sandbox
- Let the sandbox complete execution and return the final result
Step 4: Parallel execution
Replace Enum.map with Task.async_stream for independent tool calls:
```elixir
tool_results =
  tool_uses
  |> Task.async_stream(
    fn tool_use ->
      ServerChat.call_mcp_tool(user_id, tool_use["name"], tool_use["input"])
    end,
    max_concurrency: 5,
    timeout: 30_000,
    # yield {:exit, :timeout} for a slow tool instead of crashing the caller
    on_timeout: :kill_task
  )
  # async_stream preserves order, so zipping recovers each tool_use id
  |> Enum.zip(tool_uses)
  |> Enum.map(fn
    {{:ok, result}, tool_use} ->
      %{type: "tool_result", tool_use_id: tool_use["id"], content: result}

    {{:exit, reason}, tool_use} ->
      %{type: "tool_result", tool_use_id: tool_use["id"], content: "Tool call failed: #{inspect(reason)}"}
  end)
```
This alone – even without code execution – would improve performance for multi-tool responses.
What we gain
- 30-50% token savings on complex workflows
- Fewer LLM round-trips – code batches what would be 5+ individual calls
- Deterministic orchestration – loops, conditionals, and data piping happen in code, not in probabilistic JSON generation
- Parallel execution – independent tool calls run concurrently
Potential issues
- Code sandbox security: We need a sandboxed execution environment. Running model-generated code directly is dangerous. Options:
  - Use Anthropic's hosted sandbox (if available)
  - Run in a restricted BEAM process with timeout and memory limits
  - Use a separate Node.js sandbox via our existing MuonTrap infrastructure
- Error handling in code: If one tool call in a code block fails, does the whole block fail? We need graceful error propagation within the code execution.
- Debugging: When a code block calls 8 tools in a loop, tracing failures is harder than single-call debugging. We'll need structured logging of code execution steps.
- Provider lock-in: This is deeply Anthropic-specific. OpenRouter/Perplexity providers can't use this. The provider abstraction in `ServerChat.Provider` needs to handle this gracefully – providers that don't support code execution fall back to standard tool calling.
- WhatsAppClaude: The tool-use loop in `whatsapp_claude.ex` is a separate implementation from `server_chat/anthropic.ex`. Both need updating. Consider extracting a shared tool execution module.
Feature 3: Tool Use Examples
Why this helps us now
Several of our tools have complex schemas where Claude makes mistakes:
- `task_create`: priority (0-10), tags (comma-separated), due_at (Unix seconds), status (enum), parent_id (optional). Claude sometimes uses wrong date formats or forgets to convert to Unix seconds.
- `nis_create_contact`: 15+ fields, many optional. Claude often skips the `contact_every` format ("2w", "1m") or misformats `birthday` (should be "MM-DD" or "YYYY-MM-DD").
- `task_query`: The `name` parameter must match one of 60+ query function names exactly. Claude sometimes invents names that don't exist.
- `nis_log_interaction`: `channel` must be one of email/phone/meeting/whatsapp. Claude sometimes uses "call" instead of "phone".
How to implement it
Step 1: Add input_examples to tool definitions
Extend the tool definition structure:
```elixir
%{
  name: "task_create",
  description: "Create a new task...",
  inputSchema: %{...},
  input_examples: [
    %{
      title: "Review Q1 budget",
      priority: 7,
      due_at: 1771904400,
      tags: "#work, #finance",
      status: "todo"
    },
    %{
      title: "Buy groceries",
      priority: 3,
      due_at: 1771818000,
      tags: "#home",
      parent_id: 42
    }
  ]
}
```
Step 2: Include examples in tool schemas sent to LLM
The Anthropic API supports input_examples as a top-level field on tool definitions. We already format tools in ServerChat.Anthropic.format_tools/1 – add the examples there.
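A standalone sketch of that merge – the internal field names mirror the definition maps above, and `ToolFormatter` is an illustrative stand-in for the existing `format_tools/1`:

```elixir
defmodule ToolFormatter do
  # Build the wire-format tool list, attaching input_examples only
  # for tools that define them.
  def format_tools(tools) do
    Enum.map(tools, fn tool ->
      base = %{
        "name" => tool.name,
        "description" => tool.description,
        "input_schema" => tool.inputSchema
      }

      case Map.get(tool, :input_examples) do
        nil -> base
        examples -> Map.put(base, "input_examples", examples)
      end
    end)
  end
end
```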
Step 3: Write examples for the most error-prone tools
Priority targets based on current error rates:
- `task_create` / `task_update` – date formats, priority range
- `nis_create_contact` / `nis_update_contact` – field formats, `contact_every` syntax
- `task_query` – valid query names
- `nis_log_interaction` – valid channel values
- `task_capture` – natural language syntax with tags and priority markers
What we gain
- Accuracy improvement from ~72% to ~90% on complex parameter handling (based on AI Jason's experiments)
- Fewer failed tool calls that waste round-trips
- Less need for verbose descriptions – examples communicate format better than prose
Potential issues
- Token cost of examples: Each example adds tokens to the tool schema. With 94 tools, adding 2 examples each could add ~5,000 tokens. This partially offsets the gains from tool search. Prioritize examples for the most error-prone tools only.
- Maintenance burden: Examples need updating when tool schemas change. If `task_create` gets a new field, the examples should show it.
- Provider compatibility: Check that OpenRouter's API passes `input_examples` through to underlying models. Perplexity may ignore them entirely.
Feature 4: Dynamic Filtering for Web Fetch
How it applies to us
We have two tools that fetch external content:
- `fetch_url` (Utility) – fetches a URL and returns content
- `summary_request` (Summaries) – fetches a URL and summarizes it via LLM
Currently fetch_url returns the raw content, which can be enormous for HTML pages. The summary pipeline already does its own content extraction, but the raw fetch path doesn’t.
How to implement it
Step 1: Point to Anthropic’s filtered web_fetch version
If using Anthropic’s built-in web_fetch tool (version web_fetch-2026209), the filtering happens automatically through the code execution infrastructure. We’d replace our custom fetch_url with Anthropic’s version for the Anthropic provider.
Step 2: For our custom fetch, add content extraction
For providers that don’t support dynamic filtering, add a content extraction step in the fetch_url handler:
```elixir
def handle_tool_call("fetch_url", %{"url" => url}, state) do
  case Req.get(url) do
    {:ok, %{body: body}} when is_binary(body) ->
      # Strip scripts, styles, nav, footer -- keep article content
      clean =
        body
        |> Floki.parse_document!(attributes_as_maps: true)
        |> Floki.filter_out("script,style,nav,footer,header,aside")
        |> Floki.raw_html()

      {:reply, %{content: [%{type: "text", text: clean}]}, state}

    {:error, reason} ->
      {:reply, %{content: [%{type: "text", text: "Fetch failed: #{inspect(reason)}"}]}, state}
  end
end
```
What we gain
- ~24% token reduction on web-fetch-heavy workflows
- Cleaner content for summarization pipeline
- Less noise in conversation context
Potential issues
- Over-filtering: Aggressive HTML stripping might remove relevant content (e.g., code in `<script>` tags on a tutorial page, data in `<aside>` elements).
- Non-HTML content: PDFs, JSON APIs, and plain text don't need filtering. Need content-type detection before applying filters.
The Implementation Plan
Phase 1: Quick Wins (1-2 days)
1a. Parallel tool execution
Replace `Enum.map` with `Task.async_stream` in both server_chat/anthropic.ex and whatsapp_claude.ex. This is a small, self-contained change per file with immediate performance gains for multi-tool responses.
1b. Tool use examples for top 5 error-prone tools
Add input_examples to task_create, task_update, nis_create_contact, task_query, and nis_log_interaction. Immediate accuracy improvement, no architectural changes.
Phase 2: Tool Search (3-5 days)
2a. Add deferred flag to tool definitions
Mark 70+ tools as deferred. Keep ~15 high-frequency tools always visible.
2b. Implement tool_search handler
Build the search logic – start with keyword matching on tool names and descriptions. Add aliases for common queries (“send message” -> whatsapp_send, “find contact” -> nis_search).
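The alias layer can be as simple as a keyword-to-tool map consulted before the normal search. The alias entries below are examples, not a finalized list:

```elixir
defmodule ToolAliases do
  # Keywords that should surface a specific tool even when the
  # query never mentions the tool's name.
  @aliases %{
    "message" => "whatsapp_send",
    "contact" => "nis_search"
  }

  # Return the original query plus any alias-matched tool names,
  # all of which then feed into the keyword search.
  def expand(query) do
    q = String.downcase(query)
    extra = for {keyword, tool} <- @aliases, String.contains?(q, keyword), do: tool
    [query | extra]
  end
end
```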
2c. Update tools_for_permissions and Manager
Filter deferred tools from default tool list. Track discovered tools per session.
2d. Test token savings
Measure prompt size before and after. Target: 70%+ reduction in tool schema tokens.
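For a quick before/after comparison, a characters-divided-by-four heuristic over the serialized tool list is usually close enough. This is an approximation, not a real tokenizer:

```elixir
defmodule SchemaTokens do
  # Rough token estimate: serialize the tool definitions and
  # divide the byte size by 4 (a common chars-per-token heuristic).
  def estimate(tools) do
    tools
    |> inspect(limit: :infinity, printable_limit: :infinity)
    |> byte_size()
    |> div(4)
  end
end
```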
Phase 3: Content Filtering (2-3 days)
3a. Add HTML content extraction to fetch_url
Use Floki to strip non-content elements before returning results.
3b. Evaluate Anthropic’s web_fetch tool
Test the built-in filtered version. If it works well, use it for Anthropic provider and keep our custom version for others.
3c. Truncate large tool results
Add a configurable max length for tool results before they enter conversation context. Log the full result separately for debugging.
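A sketch of that truncation step – the 4,000-character default is an assumed starting point, to be tuned against real tool output sizes:

```elixir
defmodule ResultTruncator do
  require Logger

  @max_chars 4_000

  # Cap a tool result before it enters the conversation context;
  # the full size is logged so debugging isn't blinded by the cut.
  def truncate(result, max \\ @max_chars) when is_binary(result) do
    if String.length(result) > max do
      Logger.debug("truncated tool result: #{String.length(result)} chars")
      String.slice(result, 0, max) <> "\n[... truncated ...]"
    else
      result
    end
  end
end
```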
Phase 4: Programmatic Tool Calling (1-2 weeks)
4a. Research sandbox options
Evaluate: Anthropic’s hosted sandbox, restricted BEAM process, MuonTrap + Node.js sandbox. Choose based on security, latency, and complexity.
4b. Update Anthropic provider
Add code execution tool to tool list. Set allowed_caller on existing tools. Update handle_response to detect and handle code execution blocks.
4c. Extract shared tool execution module
Both server_chat/anthropic.ex and whatsapp_claude.ex implement the tool-use loop independently. Extract a shared ToolExecutor module that both can use. This simplifies adding code execution support in one place.
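A possible shape for that shared module, with the actual tool call injected as a function so both loops (and tests) can supply their own executor – in production that would be `ServerChat.call_mcp_tool` wrapped with the user id:

```elixir
defmodule ToolExecutor do
  @doc """
  Execute a list of tool_use blocks in parallel and return
  tool_result maps in the same order.
  """
  def run(tool_uses, exec_fun, opts \\ []) do
    max = Keyword.get(opts, :max_concurrency, 5)
    timeout = Keyword.get(opts, :timeout, 30_000)

    tool_uses
    |> Task.async_stream(
      fn tu -> exec_fun.(tu["name"], tu["input"]) end,
      max_concurrency: max,
      timeout: timeout,
      on_timeout: :kill_task
    )
    # async_stream keeps order, so zipping recovers each tool_use id
    |> Enum.zip(tool_uses)
    |> Enum.map(fn
      {{:ok, result}, tu} ->
        %{type: "tool_result", tool_use_id: tu["id"], content: result}

      {{:exit, reason}, tu} ->
        %{type: "tool_result", tool_use_id: tu["id"], content: "Tool call failed: #{inspect(reason)}"}
    end)
  end
end
```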
4d. Implement code execution flow
Parse code blocks, extract tool calls, execute (with parallel support), return results to sandbox, collect final output.
4e. Provider fallback
Ensure OpenRouter and Perplexity providers gracefully fall back to standard tool calling when code execution isn’t available.
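A simple capability map per provider keeps that fallback explicit. The feature lists below are assumptions to be verified against each provider's actual API:

```elixir
defmodule ProviderFeatures do
  # Which advanced tool-use features each provider supports
  # (illustrative; verify against real provider behavior).
  @features %{
    anthropic: [:tool_search, :code_execution, :input_examples],
    openrouter: [:input_examples],
    perplexity: []
  }

  def supports?(provider, feature) do
    feature in Map.get(@features, provider, [])
  end
end
```

Call sites then branch once, e.g. fall back to the standard tool loop whenever `ProviderFeatures.supports?(provider, :code_execution)` is false.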
Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Tool search returns wrong tools | Medium | Low | Always include tool_search so Claude can retry; add keyword aliases |
| Code execution sandbox escape | Low | High | Use Anthropic’s hosted sandbox; add timeout + memory limits |
| Examples increase token cost | Certain | Low | Only add examples to top 10 error-prone tools |
| Provider incompatibility | Medium | Medium | Feature-detect per provider; graceful fallback to standard calling |
| Breaking WhatsAppClaude loop | Medium | Medium | Extract shared module first; test both paths |
| Deferred tool not discovered | Medium | Low | Keep most-used tools always visible; good search aliases |
Expected Impact
| Metric | Current | After Phase 1-2 | After Phase 3-4 |
|---|---|---|---|
| Tool schema tokens per call | ~15,000 | ~3,000 | ~3,000 |
| Avg round-trips for complex tasks | 4-6 | 4-6 | 1-2 |
| Tool execution (parallel) | Sequential | Parallel (5x) | Parallel + batched |
| Parameter accuracy | ~75% | ~90% | ~90% |
| Web fetch token waste | ~24% overhead | ~24% | ~5% |
The phases are ordered by effort-to-impact ratio. Phase 1 gives immediate gains with minimal risk. Phase 2 delivers the biggest token savings. Phase 3 cleans up content handling. Phase 4 is the architectural leap – highest reward but most complex.
Start with Phase 1. Measure. Ship Phase 2. Measure again. By then we’ll know exactly how much Phase 3 and 4 are worth.