"A well crafted layer of business logic" just doesn't exist. The amount of "business logic" involved in frontier LLMs is surprisingly low, and mostly comes down to prompting and how tools like search or memory are implemented.
Things like RAG never quite took off in frontier labs, and the agentic scaffolding they use is quite barebones. They bet on improving the model's own capabilities instead, and they're winning on that bet.
So how would you explain how a stream of output tokens can call a function, or even generate an image, since that requires a whole different kind of compute? There's still a layer sitting between the model and those capabilities, one that acts as a parser to enable them.
Maybe “business” is a bad term for it, but the actual output of the model still needs to be interpreted.
Maybe I'm way out of line here since this is not my field, and I'm doing my best to understand these layers. But in your terms, are you perhaps speaking of the model as the application?
The logic of all of those things is really, really simple.
An LLM emits a "tool call" token, then it emits the actual tool call as normal text, and then it ends the token stream. The scaffolding sees that a "tool call" token was emitted, parses the call text, runs the tool accordingly, flings the tool output back into the LLM as text, and resumes inference.
It's very simple. You can write basic tool call scaffolding for an LLM in, like, 200 lines. But, of course, you need to train the LLM itself to actually use tools well. Which is the hard part. The AI is what does all the heavy lifting.
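To make that concrete, here's a minimal sketch of what such scaffolding looks like. The `generate(messages)` interface, the tool names, and the JSON call format are all hypothetical placeholders, not any particular lab's or library's API; the point is just the shape of the loop.

```python
import json

def web_search(query: str) -> str:
    # Placeholder tool; a real implementation would hit a search API.
    return f"results for: {query}"

TOOLS = {"web_search": web_search}

def run_agent(messages, generate, max_steps=10):
    """Hypothetical loop: `generate` returns (text, is_tool_call)."""
    for _ in range(max_steps):
        text, is_tool_call = generate(messages)   # assumed model interface
        messages.append({"role": "assistant", "content": text})
        if not is_tool_call:
            return text                           # normal answer, we're done
        # The model emitted a tool call as text, e.g.
        # {"name": "web_search", "args": {"query": "..."}}
        call = json.loads(text)
        result = TOOLS[call["name"]](**call["args"])
        # Fling the tool output back in as text and resume inference.
        messages.append({"role": "tool", "content": result})
    return None
```

All the scaffolding does is parse, dispatch, and append; whether the model emits sensible calls at the right moments is entirely down to training.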
Image generation, at the low end, is just another tool call that's prompted by the LLM with text. At the high end, it's a type of multimodal output - the LLM itself is trained to be able to emit non-text tokens that are then converted into image or audio data. In this system, it's AI doing the heavy lifting once again.
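In the low-end case, image generation slots into the exact same loop above as one more entry in the tool table. The stub below is purely illustrative; a real tool would call out to a separate image model with its own compute.

```python
def generate_image(prompt: str) -> str:
    # In a real system this would call a separate image/diffusion model
    # and return a URL or asset id; here it's just a stub.
    return f"<image generated from prompt: {prompt}>"

TOOLS["generate_image"] = generate_image
```

The high-end multimodal case has no equivalent sketch, because there the conversion from non-text tokens to pixels happens inside the model stack itself rather than in scaffolding.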
"A well crafted layer of business logic" just doesn't exist. The amount of "business logic" involved in frontier LLMs is surprisingly low, and mostly comes down to prompting and how tools like search or memory are implemented.
Things like RAG never quite took off in frontier labs, and the agentic scaffolding they use is quite barebones. They bet on improving the model's own capabilities instead, and they're winning on that bet.