
The logic of all of those things is really, really simple.

An LLM emits a "tool call" token, then emits the actual tool call as normal text, and then ends its token stream. The scaffolding sees that a "tool call" token was emitted, parses the call text, runs the corresponding tool, flings the tool output back into the LLM as text, and resumes inference.

It's very simple. You can write basic tool call scaffolding for an LLM in, like, 200 lines. But, of course, you need to train the LLM itself to actually use tools well, which is the hard part. The AI is what does all the heavy lifting.
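
A minimal sketch of that loop, to make the shape concrete. Everything here is an assumption rather than any particular framework's API: `generate()` is a stub standing in for your inference call, and `<tool_call>` stands in for whatever special token the model was trained to emit.

```python
import json

# Placeholder for the actual inference API: given the transcript so far,
# return the raw text the model produced for its next turn.
def generate(transcript: str) -> str:
    raise NotImplementedError("plug in your model / inference API here")

# The tools the scaffolding knows how to run, keyed by name.
TOOLS = {
    "add": lambda args: str(args["a"] + args["b"]),
}

TOOL_CALL_MARKER = "<tool_call>"  # stand-in for the real special token

def run_agent(user_prompt: str, max_steps: int = 10) -> str:
    transcript = f"User: {user_prompt}\nAssistant: "
    for _ in range(max_steps):
        output = generate(transcript)
        transcript += output
        if TOOL_CALL_MARKER not in output:
            return output  # plain text answer, we're done
        # Assume the model ends its turn right after the call, e.g.
        # <tool_call>{"name": "add", "arguments": {"a": 1, "b": 2}}
        call = json.loads(output.split(TOOL_CALL_MARKER, 1)[1])
        result = TOOLS[call["name"]](call["arguments"])
        # Feed the tool output back in as text and resume inference.
        transcript += f"\nTool result: {result}\nAssistant: "
    return transcript
```

The scaffolding itself is dumb string plumbing; all the judgement about when to call a tool and what to do with the result lives in the model's weights.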

Image generation, at the low end, is just another tool call that the LLM prompts with text. At the high end, it's a form of multimodal output: the LLM itself is trained to emit non-text tokens that are then converted into image or audio data. In that system, it's AI doing the heavy lifting once again.
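
For the low-end case, "just another tool call" literally means one more entry in the same registry the loop above dispatches on. A hedged sketch, where `text_to_image()` is a hypothetical stand-in for whatever image backend you'd actually call:

```python
# Placeholder image backend; name and signature are assumptions, not a real API.
def text_to_image(prompt: str) -> bytes:
    raise NotImplementedError("call your diffusion model or image API here")

def generate_image_tool(args: dict) -> str:
    path = "output.png"
    with open(path, "wb") as f:
        f.write(text_to_image(args["prompt"]))
    # The LLM never sees the pixels; it only gets this text back.
    return f"image saved to {path}"

TOOLS["generate_image"] = generate_image_tool
```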


