More tools didn't make the agent smarter. Past a point, they made it worse.
I ask an agent for the Bank of Canada policy rate over the last two years, joined against CPI, and a few seconds later I have the table. No scraping, no Excel downloads, no cleanup. The next question is IRCC’s recent permanent-resident numbers, and that comes back the same way — real figures, pulled straight from the source, in one conversation. It looks like the agent simply knows things. It does not. Behind it is a server I built that exposes 266 tools over Canadian government data, and getting to the point where that demo just works taught me the opposite of what I expected going in.
The obvious thing was to hand the agent all 266 tools and let it choose. I tried that first. It broke.
Here is why it breaks, and it is not subtle. Before the agent can reason about your question at all, every tool it might call has to be described to it in the system prompt: name, purpose, parameters. With 266 tools, the definitions alone ran to about forty thousand tokens at the time, spent before a single word of the actual question was read. That is the cheap part of the damage. The expensive part is selection. When an agent is staring at dozens of near-duplicate tools (one for each province’s data, a handful of ways to ask about weather, several overlapping economic series), it picks wrong more often. More tools did not make the agent smarter. Past a point, they made it worse.
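The token cost can be put in rough numbers. The sketch below takes the two figures quoted above (266 tools, about forty thousand tokens) and assumes, purely for illustration, that the cost is spread evenly across tools; real definitions vary in size, and the names here are hypothetical.

```python
# Figures quoted in the text; everything derived from them is an estimate.
TOTAL_TOOLS = 266
TOTAL_DEFINITION_TOKENS = 40_000  # "about forty thousand at the time"

# Assumption: every tool definition costs the same number of tokens.
TOKENS_PER_TOOL = TOTAL_DEFINITION_TOKENS / TOTAL_TOOLS


def estimated_tokens(tools_shown: int) -> int:
    """Estimate the prompt tokens spent describing `tools_shown` tools."""
    return round(tools_shown * TOKENS_PER_TOOL)


if __name__ == "__main__":
    for count in (5, 50, TOTAL_TOOLS):
        print(f"{count:>3} tools shown: ~{estimated_tokens(count)} tokens")
```

On that assumption each definition averages roughly 150 tokens, so the overhead grows in step with the number of tools the agent is shown, whether or not it ever calls them.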
So the fix was not a bigger model or a cleaner prompt. It was to stop showing the agent everything.
The agent now holds a handful of meta-tools instead of hundreds. The important one is discover_tools. It calls discover_tools("exchange rates"), gets back the five tools that actually matter for that question, and invokes them directly. The 266 tools still exist; the agent just finds the right few on demand instead of carrying all of them at once. Discovery, not enumeration.
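To make the pattern concrete, here is a minimal sketch of a discovery meta-tool. It is not the server's actual code: the registry, the three tool names, the stub handlers, and call_tool are all hypothetical, and plain keyword overlap stands in for whatever ranking the real search uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    description: str
    handler: Callable[..., dict]


def _stub(tool_name: str) -> Callable[..., dict]:
    """Build a placeholder handler that echoes its arguments (no real API call)."""
    return lambda **arguments: {"tool": tool_name, "arguments": arguments}


# Hypothetical registry: three entries stand in for the full catalogue.
REGISTRY: dict[str, Tool] = {
    tool.name: tool
    for tool in (
        Tool("boc_exchange_rates", "Bank of Canada daily foreign exchange rates",
             _stub("boc_exchange_rates")),
        Tool("boc_policy_rate", "Bank of Canada policy interest rate history",
             _stub("boc_policy_rate")),
        Tool("statcan_cpi", "Statistics Canada consumer price index",
             _stub("statcan_cpi")),
    )
}


def discover_tools(query: str, limit: int = 5) -> list[dict]:
    """Meta-tool: return the few tool descriptions that match the query."""
    terms = set(query.lower().split())
    scored = []
    for tool in REGISTRY.values():
        text = f"{tool.name} {tool.description}".lower().replace("_", " ")
        overlap = len(terms & set(text.split()))
        if overlap:
            scored.append((overlap, tool.name))
    # Highest overlap first; ties broken by name so results are stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [
        {"name": name, "description": REGISTRY[name].description}
        for _, name in scored[:limit]
    ]


def call_tool(name: str, **arguments) -> dict:
    """Meta-tool: invoke a discovered tool by name."""
    return REGISTRY[name].handler(**arguments)


if __name__ == "__main__":
    matches = discover_tools("exchange rates")
    print(matches)
    print(call_tool(matches[0]["name"], currency="USD"))
```

The agent's prompt only ever needs the definitions of discover_tools and call_tool; the rest of the catalogue stays on the server until a query asks for it.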
The interesting decision was how that search works. The modern reflex is vector embeddings: turn every tool description into a vector, turn the query into a vector, return the nearest matches. I chose BM25 instead. BM25 is old, boring keyword ranking, the kind of thing that has sat inside search engines for decades. I picked it for three reasons, and they are the reasons a senior engineer reaches for the dull option. It is deterministic: the same query returns the same tools every time, with no model drift and no embedding-version churn to babysit. It has no dependencies: the index builds in-process at startup, with no vector database to run and no embedding API to call. And it is fast: sub-millisecond, against the hundred to five hundred milliseconds an embedding lookup would add to every query. For an MCP (Model Context Protocol) server with this many tools, BM25 is the boring choice, and in most cases the right one. That is the whole discipline in one decision: the interesting-sounding option was the wrong trade.
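For readers who have not seen BM25 up close, the sketch below shows how little machinery it needs. It is an illustrative implementation of the standard Okapi BM25 formula (with the always-positive IDF variant), not the server's own index; the tool names, descriptions, and the k1 and b defaults are assumptions chosen for the example.

```python
import math
from collections import Counter

# Hypothetical tool descriptions standing in for the real catalogue.
TOOLS = {
    "boc_exchange_rates": "Bank of Canada daily foreign exchange rates by currency",
    "boc_policy_rate": "Bank of Canada overnight policy interest rate history",
    "statcan_cpi": "Statistics Canada consumer price index monthly series",
    "ircc_permanent_residents": "IRCC permanent resident admissions by month",
}


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace; no stemming or stop words."""
    return text.lower().split()


class BM25Index:
    """In-process BM25 index, built once at startup from tool descriptions."""

    def __init__(self, docs: dict[str, str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.term_counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
        self.lengths = {name: sum(c.values()) for name, c in self.term_counts.items()}
        self.avg_len = sum(self.lengths.values()) / len(docs)
        # Document frequency: in how many descriptions each term appears.
        doc_freq = Counter(term for counts in self.term_counts.values() for term in counts)
        n = len(docs)
        self.idf = {
            term: math.log(1 + (n - df + 0.5) / (df + 0.5))
            for term, df in doc_freq.items()
        }

    def search(self, query: str, limit: int = 5) -> list[str]:
        """Return up to `limit` tool names, best match first."""
        query_terms = tokenize(query)
        scores: dict[str, float] = {}
        for name, counts in self.term_counts.items():
            # Length normalisation: longer descriptions need more hits to score as well.
            norm = self.k1 * (1 - self.b + self.b * self.lengths[name] / self.avg_len)
            score = 0.0
            for term in query_terms:
                tf = counts.get(term, 0)
                if tf:
                    score += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
            if score > 0:
                scores[name] = score
        # Ties broken by name, so identical queries always return identical lists.
        return sorted(scores, key=lambda name: (-scores[name], name))[:limit]


index = BM25Index(TOOLS)

if __name__ == "__main__":
    print(index.search("exchange rates"))
```

Everything here is arithmetic over counts held in memory, which is where the determinism and the absence of external dependencies come from. The cost is that matching is purely lexical: a query for "rate" will not find a description that only says "rates" unless stemming or synonyms are added.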
Discovery gets the agent to the right tool. It does not make the answer trustworthy. That is a separate body of unglamorous work, and it is most of the engineering. Every tool returns the same shape of response — a consistent envelope that carries not just the data but where the data came from and whether it was served fresh or from cache. There is TTL caching so a popular series is not re-fetched a hundred times an hour, and token-bucket rate limiting so the server stays a polite client of the government APIs it depends on. All of it is bilingual, English and French, because the source data is. None of this is clever. It is the plumbing that decides whether an agent working on real data can be relied on or not.
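The three pieces named above (a consistent envelope, TTL caching, token-bucket rate limiting) fit together roughly as follows. This is a sketch under stated assumptions rather than the server's implementation: the Envelope fields, the class names, and the choice to raise an error when the bucket is empty are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Envelope:
    data: Any
    source: str        # which upstream API or series the data came from
    fetched_at: float  # Unix timestamp of the upstream fetch
    cached: bool       # True when served from cache instead of fetched fresh


class TokenBucket:
    """Allow `capacity` calls in a burst, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class CachedFetcher:
    """Wrap an upstream fetch with a TTL cache and a rate limit."""

    def __init__(self, fetch: Callable[[str], Any], source: str, ttl_seconds: float,
                 bucket: TokenBucket, clock: Callable[[], float] = time.monotonic):
        self.fetch, self.source, self.ttl = fetch, source, ttl_seconds
        self.bucket, self.clock = bucket, clock
        self._cache: dict[str, tuple[float, Envelope]] = {}

    def get(self, key: str) -> Envelope:
        now = self.clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            stored = hit[1]
            # Same data and provenance, flagged as a cache hit.
            return Envelope(stored.data, stored.source, stored.fetched_at, cached=True)
        # Only real upstream calls spend a token; cache hits are free.
        if not self.bucket.try_acquire():
            raise RuntimeError("rate limit reached; retry later")
        envelope = Envelope(self.fetch(key), self.source, time.time(), cached=False)
        self._cache[key] = (now, envelope)
        return envelope
```

Checking the cache before the bucket is the design choice that matters: a popular series served from cache never counts against the upstream rate limit, so the limiter only constrains traffic the government API would actually see.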
It matters because the data is real. The tools reach the source APIs directly — the Bank of Canada, StatCan, IRCC, Health Canada — not a web-search wrapper guessing from whatever it scraped. That is the difference between a number you can defend and a number that sounds right. And the source attribution riding in every envelope is the paper trail: when the agent gives you the policy rate, you can see exactly which series it read and when. Grounding is not a nice-to-have here. It is the reason to build this instead of asking a model to improvise.
The whole build sits in the open if you want to read it — it’s on GitHub.
What I took away is narrower than the demo makes it look. Scaling an agent’s reach is not an intelligence problem you solve by reaching for a larger model, and it is not a coverage problem you solve by adding tools until it can do everything. It is findability — can the agent locate the right capability at the moment it needs it. It is context discipline — is the agent spending its attention on the question or on a catalogue it will never use. And it is boring reliability engineering — caching, rate limits, source attribution, the same response shape every time. When teams tell me their agent isn’t good enough, this is usually where the real work is, and it is the work I spend most of my time on with the companies I help through Reyem Tech.