I think, from my experience, what they mean is that tool use is only as good as your model's ability to stick to a given answer template/grammar. For example, if it does tool calling using a JSON format, it needs to stick to that format: no hallucinated extra fields, and the existing fields used properly. This has worked for a few years and LLMs keep getting better, but the more tools you have, and the more parameters your callable functions can take, the higher the risk of errors. You also have systems that constrain the inference itself, for example the outlines package, by changing the way tokens are sampled (this way you can force a model to stick to a template/grammar, but it can also degrade results in other ways).
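
To make the "stick to the format" part concrete, here's a minimal sketch of the validation side: checking a model's tool-call JSON against a strict schema so hallucinated extra fields get rejected. The tool name and fields are made-up examples, not any real API; it uses the jsonschema package, which is post-hoc checking, whereas outlines constrains sampling so invalid output can't be generated in the first place.

    # Validate a model's tool-call JSON against a strict schema.
    # The "get_weather" tool and its fields are invented for illustration.
    import json
    from jsonschema import validate, ValidationError  # pip install jsonschema

    # additionalProperties: False rejects hallucinated fields;
    # "required" catches missing ones.
    GET_WEATHER_SCHEMA = {
        "type": "object",
        "properties": {
            "name": {"const": "get_weather"},
            "arguments": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "unit": {"enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
                "additionalProperties": False,
            },
        },
        "required": ["name", "arguments"],
        "additionalProperties": False,
    }

    def parse_tool_call(raw: str):
        """Return the parsed call if valid, else None so the caller can retry."""
        try:
            call = json.loads(raw)               # model may emit broken JSON...
            validate(call, GET_WEATHER_SCHEMA)   # ...or invent/misuse fields
            return call
        except (json.JSONDecodeError, ValidationError):
            return None

    # A hallucinated "confidence" field fails; the clean call passes:
    print(parse_tool_call('{"name": "get_weather", "arguments": {"city": "Paris", "confidence": 0.9}}'))
    print(parse_tool_call('{"name": "get_weather", "arguments": {"city": "Paris"}}'))

The trade-off is the one noted above: post-hoc validation needs a retry loop for when the model slips, while sampler-level constraints never slip but can degrade the model's output in other ways.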
I see, thanks for channeling the GP! Yeah, like you say, I just don't think getting the tool-call template right is really a problem anymore, at least with the big-labs SotA models that most of us use for coding agents. Claude Sonnet, Gemini, GPT-5 and friends have been heavily, heavily RL'ed into being really good at tool calls, and it's all built into the providers' APIs now, so you never even see the magic where the tool call is parsed out of the raw response. To be honest, when I first read about tool calls with LLMs I thought, "that'll never work reliably, it'll mess up the syntax sometimes." But in practice, it does work. (Or, to be more precise, if the LLM ever does mess up the grammar, you never know, because it can seamlessly retry and correct without it ever being visible at the user-facing API layer.) Claude Code plugged into Sonnet (or even Haiku) might do hundreds of tool calls in an hour of work without missing a beat. One of the many surprises of the last few years.
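
For anyone who hasn't seen what "built into the providers' APIs" looks like, here's a sketch using the Anthropic Python SDK's Messages API. The weather tool is a made-up example and you'd substitute a current model ID; the point is that you declare a JSON schema up front and get back an already-parsed tool_use block, never the raw response text.

    # pip install anthropic; reads ANTHROPIC_API_KEY from the environment.
    # The get_weather tool is invented for illustration.
    import anthropic

    client = anthropic.Anthropic()

    tools = [{
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }]

    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # any tool-capable model ID
        max_tokens=1024,
        tools=tools,
        messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    )

    # The SDK hands back parsed content blocks; if the model chose to call
    # the tool, you get structured name + input, not raw JSON text.
    for block in response.content:
        if block.type == "tool_use":
            print(block.name, block.input)  # e.g. get_weather {'city': 'Paris'}

Whether the provider ever has to retry a malformed call under the hood is invisible at this layer, which is exactly the point above.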