Engineering notesView as Markdown ↗

Tool discoverability for agents

A catalogue of ~230 tools is unusable by name-scanning. The path is: orient with a categorised map, search by intent, then measure the one honest signal — the rate at which a search returns nothing. A miss is a catalogue gap, and it should be turned into a request for the missing tool.

A Biloh tenant exposes on the order of 230 MCP tools. No agent should find the right one by reading names. The working pattern is three moves — orient, search, measure — and the measurement is the part people skip.

How does an agent orient?

The first call on a fresh session is get_session_context. Beyond confirming the tenant and persona, it returns a tool_map: every tool the persona can use, grouped by category (clients, contractors, invoices, jobs, proposals, …), each with a one-line summary. That is the floor-plan — enough to know where a capability lives before narrowing.

The authoritative flat list stays available as mcp_health.tool_names. The map is for orientation; the flat list is the source of truth.

How does an agent find a specific tool?

search_tools("onboard a client for a recurring service") returns a ranked, persona-scoped shortlist — name, summary, category. The agent searches by what it wants to do, not by guessing a name.

One non-obvious discipline kept the ranker honest: it carries no tool-name literals. Ranking is pure lexical overlap (weighted name > category > description, with a small domain synonym map). A structural test asserts the ranker source contains no tool names, so it can't be quietly "tuned" by hard-coding the answers to the test queries — it has to generalise. The spec proves this on a held-out intent set the ranker never sees.

What metric proves discoverability actually improved?

The honest signal is the zero-result rate. If search_tools returns nothing, the catalogue or the ranker failed to serve the agent's intent — that is the gap to close, and it is the one thing worth counting.

So a miss does three things: it sets an explicit zero_results flag on the response, it records a structured tool_search metric for aggregation, and it points the agent at the request-a-tool flow. That last move closes the loop:

search miss → tool request → catalogue grows → fewer misses.

A high zero-result rate is not a failure to hide; it is a map of exactly what agents wanted that didn't exist yet.

The shape to copy

  1. A categorised map for orientation (get_session_context), a flat catalogue for truth (mcp_health).
  2. Intent search with a ranker that has no knowledge of the test answers.
  3. A zero-result signal that is both measured and turned into an actionable request.

Next steps

Last updated 2026-06-25