Implementing Advanced Tool Use in WorkingAgents -- A Practical Plan

By James Aspinwall, co-written by Alfred (your trusted AI agent) – February 24, 2026, 12:00

In the previous article we covered the four advanced tool use features Anthropic shipped: programmatic tool calling, dynamic filtering, tool search with deferred loading, and tool use examples. This article gets specific – how we implement each one in WorkingAgents, what we gain, what can go wrong, and a phased plan to get there.


Where We Stand Today

WorkingAgents currently exposes 94 MCP tools across 9 categories:

Category        Tools  Permission Key
Tasks              20  80_001
CRM (NIS)          17  80_001
Access Control     11  60_001
WhatsApp            7  70_001
Summaries           7  30_001
Platform            5  123_456
Blogs               4  50_001
Utility             4  60_001
Other              19  various

The tool-use loop in both ServerChat.Anthropic and WhatsAppClaude follows the standard pattern:

  1. Send all permitted tools + conversation history to Claude
  2. Claude returns a tool_use response with one or more tool calls
  3. Execute each tool sequentially via Enum.map
  4. Append the full result of every tool call to the conversation context
  5. Send the updated context back to Claude
  6. Repeat until Claude returns end_turn
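In code, the steps above reduce to a recursive loop roughly like this (a simplified sketch – call_claude/2 and the message shapes are placeholders, not the actual WorkingAgents function names):

```elixir
# Sketch of the standard tool-use loop: call, execute, append, repeat.
def run(messages, tools, user_id) do
  case call_claude(messages, tools) do
    # No more tool calls: Claude is done
    %{"stop_reason" => "end_turn"} = response ->
      response

    # One or more tool calls: execute each, feed results back
    %{"stop_reason" => "tool_use", "content" => blocks} ->
      results =
        blocks
        |> Enum.filter(&(&1["type"] == "tool_use"))
        |> Enum.map(fn tu ->
          result = ServerChat.call_mcp_tool(user_id, tu["name"], tu["input"])
          %{type: "tool_result", tool_use_id: tu["id"], content: result}
        end)

      new_messages =
        messages ++
          [%{role: "assistant", content: blocks},
           %{role: "user", content: results}]

      run(new_messages, tools, user_id)
  end
end
```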

This works. But with 94 tools, sequential execution, and full results in context, we’re leaving significant performance and cost on the table.


Feature 1: Tool Search and Deferred Loading

Why this comes first

With 94 tool definitions, every LLM call ships ~15,000-20,000 tokens just for the tool schemas. Most requests need 3-5 tools. That’s 80%+ wasted context on tool definitions alone.

How to implement it

The change is primarily in MyMCPServer and MyMCPServer.Manager.

Step 1: Categorize tools by visibility

Add a :deferred flag to each tool definition:

%{
  name: "task_create",
  description: "Create a new task...",
  required_permission: 80_001,
  deferred: true,            # <-- new field
  inputSchema: %{...}
}

Always-visible tools (never deferred): task_capture, task_next, task_plan, whatsapp_send, whatsapp_recent, nis_search, summary_request, current_time. These cover the most common entry points. Everything else gets deferred: true.

Step 2: Add a tool_search tool

Define a new tool that takes a natural language query and returns matching tool definitions:

%{
  name: "tool_search",
  description: "Search for available tools by keyword or capability...",
  deferred: false,
  inputSchema: %{
    type: "object",
    properties: %{
      query: %{type: "string", description: "What capability you need"}
    },
    required: ["query"]
  }
}

The handler searches tool names and descriptions using simple string matching or full-text search. It returns the full schema for each matching tool, so Claude can call them in subsequent turns.
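A minimal handler sketch, assuming each tool definition is a map with :name and :description as shown above (all_tools/0 is a hypothetical accessor for the full definition list):

```elixir
# Sketch of the tool_search handler: plain substring matching over names
# and descriptions. Real full-text search could replace this later.
defmodule MyMCPServer.ToolSearch do
  def search(query, all_tools) do
    terms =
      query
      |> String.downcase()
      |> String.split(~r/\s+/, trim: true)

    Enum.filter(all_tools, fn tool ->
      haystack = String.downcase(tool.name <> " " <> tool.description)
      Enum.any?(terms, &String.contains?(haystack, &1))
    end)
  end
end
```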

Step 3: Update tools_for_permissions/1

Currently this function filters by permission and returns all tools. Change it to also filter by :deferred – only return non-deferred tools plus tool_search by default. When tool_search is called, return the full schemas of matching deferred tools.
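A sketch of the updated filter, assuming definitions carry :required_permission and :deferred as shown above (tool_search is defined with deferred: false and no permission gate, so it always survives):

```elixir
# Sketch: filter by permission, then hide deferred tools by default.
def tools_for_permissions(permissions) do
  Enum.filter(all_tools(), fn tool ->
    permitted? =
      case Map.get(tool, :required_permission) do
        nil -> true
        perm -> perm in permissions
      end

    permitted? and not Map.get(tool, :deferred, false)
  end)
end
```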

Step 4: Update the Manager

MyMCPServer.Manager.list_tools/1 needs to track which deferred tools have been “discovered” in the current session. After tool_search returns results, those tools become callable for subsequent turns.
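One way to sketch the discovery tracking, assuming the Manager keeps per-session state as a map with a :discovered_tools MapSet (the real state shape may differ):

```elixir
# Sketch: deferred tools become visible once tool_search has surfaced them.
def list_tools(session) do
  Enum.filter(all_tools(), fn tool ->
    not Map.get(tool, :deferred, false) or
      tool.name in session.discovered_tools
  end)
end

# Called after tool_search returns results, to unlock those tools.
def record_discovery(session, tools) do
  names = MapSet.new(tools, & &1.name)
  %{session | discovered_tools: MapSet.union(session.discovered_tools, names)}
end
```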

What we gain

Tool schema tokens per call drop from roughly 15,000 to around 3,000 – a 70%+ reduction – because only the always-visible tools plus tool_search ship by default.

Potential issues

A deferred tool may never be discovered if a query doesn't match its name or description. Keeping the most-used tools always visible and adding keyword aliases for common queries mitigates this.


Feature 2: Programmatic Tool Calling

Why this matters for us

Consider this real workflow: “Find my overdue tasks, check which contacts are associated, and send them follow-up WhatsApp messages.”

Current flow (5+ LLM round-trips):

  1. Claude calls task_query(name: "overdue") – returns list of tasks
  2. Claude calls nis_search(q: "contact name") – for each task
  3. Claude calls whatsapp_send(jid, text) – for each contact
  4. Each round-trip includes the full conversation history + all previous results

With programmatic tool calling (1-2 round-trips):

  1. Claude writes code that loops through tasks, searches contacts, and sends messages
  2. All tool executions happen in one batch
  3. Only the final summary enters the conversation context

How to implement it

Step 1: Add the code execution tool

Add a new tool to the Anthropic provider that signals code execution capability:

%{
  # Anthropic-specific tool type; the exact versioned identifier
  # depends on the API release in use
  type: "code_execution",
  name: "code_execution_tool"
}

Step 2: Set allowed_caller on existing tools

Each tool definition gets an allowed_caller field pointing to the code execution tool. This tells Claude it can call these tools from within generated code rather than emitting individual JSON tool calls.

Step 3: Update the tool-use loop in ServerChat.Anthropic

The current loop in handle_response/5 (lines 79-138 of anthropic.ex) does:

tool_results = Enum.map(tool_uses, fn tool_use ->
  result = ServerChat.call_mcp_tool(user_id, tool_use["name"], tool_use["input"])
  %{type: "tool_result", tool_use_id: tool_use["id"], content: result}
end)

The new flow needs to:

  1. Detect when the response contains a code execution block
  2. Parse the code to extract all tool calls
  3. Execute them (potentially in parallel – see below)
  4. Return results to the code execution sandbox
  5. Let the sandbox complete execution and return the final result

Step 4: Parallel execution

Replace Enum.map with Task.async_stream for independent tool calls:

tool_results =
  tool_uses
  |> Task.async_stream(
    fn tool_use ->
      result = ServerChat.call_mcp_tool(user_id, tool_use["name"], tool_use["input"])
      {tool_use["id"], result}
    end,
    # on_timeout: :kill_task keeps one slow tool from crashing the batch
    max_concurrency: 5, timeout: 30_000, on_timeout: :kill_task
  )
  |> Enum.zip(tool_uses)
  |> Enum.map(fn
    {{:ok, {id, result}}, _tool_use} ->
      %{type: "tool_result", tool_use_id: id, content: result}

    {{:exit, :timeout}, tool_use} ->
      %{type: "tool_result", tool_use_id: tool_use["id"],
        content: "Tool call timed out", is_error: true}
  end)

This alone – even without code execution – would improve performance for multi-tool responses.

What we gain

Complex multi-step tasks drop from 4-6 LLM round-trips to 1-2, tool calls run in parallel, and intermediate results stay out of the conversation context – only the final summary enters it.

Potential issues

Code execution requires a sandbox, which raises security questions (see the risk assessment below), and not every provider supports it – OpenRouter and Perplexity need a fallback to standard tool calling.


Feature 3: Tool Use Examples

Why this helps us now

Several of our tools have complex schemas where Claude makes mistakes – date formats, priority ranges, enumerated query and channel values, and tag syntax among them.

How to implement it

Step 1: Add input_examples to tool definitions

Extend the tool definition structure:

%{
  name: "task_create",
  description: "Create a new task...",
  inputSchema: %{...},
  input_examples: [
    %{
      title: "Review Q1 budget",
      priority: 7,
      due_at: 1771904400,
      tags: "#work, #finance",
      status: "todo"
    },
    %{
      title: "Buy groceries",
      priority: 3,
      due_at: 1771818000,
      tags: "#home",
      parent_id: 42
    }
  ]
}

Step 2: Include examples in tool schemas sent to LLM

The Anthropic API supports input_examples as a top-level field on tool definitions. We already format tools in ServerChat.Anthropic.format_tools/1 – add the examples there.
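A sketch of the change in format_tools/1, assuming our internal definitions use the :input_examples key shown above (the internal key name is our choice; the outgoing field name follows the Anthropic tool definition):

```elixir
# Sketch: pass input_examples through to the API payload when present.
def format_tools(tools) do
  Enum.map(tools, fn tool ->
    base = %{
      name: tool.name,
      description: tool.description,
      input_schema: tool.inputSchema
    }

    case Map.get(tool, :input_examples) do
      nil -> base
      examples -> Map.put(base, :input_examples, examples)
    end
  end)
end
```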

Step 3: Write examples for the most error-prone tools

Priority targets based on current error rates:

  1. task_create / task_update – date formats, priority range
  2. nis_create_contact / nis_update_contact – field formats, contact_every syntax
  3. task_query – valid query names
  4. nis_log_interaction – valid channel values
  5. task_capture – natural language syntax with tags and priority markers

What we gain

Parameter accuracy on the targeted tools should rise from roughly 75% to 90%, meaning fewer failed calls and fewer retry round-trips.

Potential issues

Examples add tokens to every call that includes those tools – which is why we limit them to the top error-prone tools rather than all 94.


Feature 4: Dynamic Filtering for Web Fetch

How it applies to us

We have two paths that fetch external content: the fetch_url tool and the summary pipeline.

Currently fetch_url returns the raw content, which can be enormous for HTML pages. The summary pipeline already does its own content extraction, but the raw fetch path doesn’t.

How to implement it

Step 1: Point to Anthropic’s filtered web_fetch version

If using Anthropic’s built-in web_fetch tool (version web_fetch-2026209), the filtering happens automatically through the code execution infrastructure. We’d replace our custom fetch_url with Anthropic’s version for the Anthropic provider.

Step 2: For our custom fetch, add content extraction

For providers that don’t support dynamic filtering, add a content extraction step in the fetch_url handler:

def handle_tool_call("fetch_url", %{"url" => url}, state) do
  case Req.get(url) do
    {:ok, %{body: body}} ->
      # Strip scripts, styles, nav, footer -- keep article content
      clean =
        body
        |> Floki.parse_document!(attributes_as_maps: true)
        |> Floki.filter_out("script,style,nav,footer,header,aside")
        |> Floki.raw_html()

      {:reply, %{content: [%{type: "text", text: clean}]}, state}

    {:error, reason} ->
      # Without this clause an unreachable URL crashes the handler
      {:reply,
       %{content: [%{type: "text", text: "Fetch failed: #{inspect(reason)}"}], isError: true},
       state}
  end
end

What we gain

Web-fetch token waste drops from roughly 24% overhead to around 5%, since scripts, styles, and page chrome never enter the conversation context.

Potential issues

Aggressive stripping can discard relevant content on unusually structured pages, and Anthropic's filtered web_fetch is tied to its code execution infrastructure, so other providers need the custom extraction path.


The Implementation Plan

Phase 1: Quick Wins (1-2 days)

1a. Parallel tool execution

Replace Enum.map with Task.async_stream in both server_chat/anthropic.ex and whatsapp_claude.ex. This is a 5-line change per file with immediate performance gains for multi-tool responses.

1b. Tool use examples for top 5 error-prone tools

Add input_examples to task_create, task_update, nis_create_contact, task_query, and nis_log_interaction. Immediate accuracy improvement, no architectural changes.

Phase 2: Tool Search (3-5 days)

2a. Add deferred flag to tool definitions

Mark 70+ tools as deferred. Keep ~15 high-frequency tools always visible.

2b. Implement tool_search handler

Build the search logic – start with keyword matching on tool names and descriptions. Add aliases for common queries (“send message” -> whatsapp_send, “find contact” -> nis_search).
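The alias table can be a simple module attribute consulted before the keyword match (the alias entries below are illustrative):

```elixir
# Sketch: exact-phrase aliases for common queries, checked first.
@aliases %{
  "send message" => "whatsapp_send",
  "find contact" => "nis_search"
}

def resolve_alias(query) do
  # Returns a tool name, or nil to fall through to keyword matching
  Map.get(@aliases, query |> String.trim() |> String.downcase())
end
```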

2c. Update tools_for_permissions and Manager

Filter deferred tools from default tool list. Track discovered tools per session.

2d. Test token savings

Measure prompt size before and after. Target: 70%+ reduction in tool schema tokens.

Phase 3: Content Filtering (2-3 days)

3a. Add HTML content extraction to fetch_url

Use Floki to strip non-content elements before returning results.

3b. Evaluate Anthropic’s web_fetch tool

Test the built-in filtered version. If it works well, use it for Anthropic provider and keep our custom version for others.

3c. Truncate large tool results

Add a configurable max length for tool results before they enter conversation context. Log the full result separately for debugging.
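A sketch of the truncation helper – the 8,000-character cap is an illustrative default, and the module is assumed to have `require Logger`:

```elixir
@max_result_chars 8_000

# Sketch: cap the result entering conversation context; log the full
# result separately so debugging still sees everything.
def truncate_result(text) when is_binary(text) do
  if String.length(text) > @max_result_chars do
    Logger.debug("full tool result: #{text}")
    String.slice(text, 0, @max_result_chars) <> "\n[truncated]"
  else
    text
  end
end
```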

Phase 4: Programmatic Tool Calling (1-2 weeks)

4a. Research sandbox options

Evaluate: Anthropic’s hosted sandbox, restricted BEAM process, MuonTrap + Node.js sandbox. Choose based on security, latency, and complexity.

4b. Update Anthropic provider

Add code execution tool to tool list. Set allowed_caller on existing tools. Update handle_response to detect and handle code execution blocks.

4c. Extract shared tool execution module

Both server_chat/anthropic.ex and whatsapp_claude.ex implement the tool-use loop independently. Extract a shared ToolExecutor module that both can use. This simplifies adding code execution support in one place.

4d. Implement code execution flow

Parse code blocks, extract tool calls, execute (with parallel support), return results to sandbox, collect final output.

4e. Provider fallback

Ensure OpenRouter and Perplexity providers gracefully fall back to standard tool calling when code execution isn’t available.
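Feature detection can be a per-provider function head, so unsupported providers simply never see the code execution tool (provider atoms and helper names here are assumptions):

```elixir
# Sketch: only the Anthropic provider gets the code execution tool;
# everyone else falls back to the standard tool list unchanged.
defp tools_for_provider(:anthropic, tools), do: [code_execution_tool() | tools]
defp tools_for_provider(_other, tools), do: tools
```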


Risk Assessment

Risk                             Likelihood  Impact  Mitigation
Tool search returns wrong tools  Medium      Low     Always include tool_search so Claude can retry; add keyword aliases
Code execution sandbox escape    Low         High    Use Anthropic's hosted sandbox; add timeout + memory limits
Examples increase token cost     Certain     Low     Only add examples to top 10 error-prone tools
Provider incompatibility         Medium      Medium  Feature-detect per provider; graceful fallback to standard calling
Breaking WhatsAppClaude loop     Medium      Medium  Extract shared module first; test both paths
Deferred tool not discovered     Medium      Low     Keep most-used tools always visible; good search aliases

Expected Impact

Metric                             Current        After Phase 1-2  After Phase 3-4
Tool schema tokens per call        ~15,000        ~3,000           ~3,000
Avg round-trips for complex tasks  4-6            4-6              1-2
Tool execution                     Sequential     Parallel (5x)    Parallel + batched
Parameter accuracy                 ~75%           ~90%             ~90%
Web fetch token waste              ~24% overhead  ~24%             ~5%

The phases are ordered by effort-to-impact ratio. Phase 1 gives immediate gains with minimal risk. Phase 2 delivers the biggest token savings. Phase 3 cleans up content handling. Phase 4 is the architectural leap – highest reward but most complex.

Start with Phase 1. Measure. Ship Phase 2. Measure again. By then we’ll know exactly how much Phase 3 and 4 are worth.