I think you're looking at it too abstractly. An LLM isn't representing anything; it has a bag of numbers that some other algorithm produced for it. When you give it some numbers, it runs them through matrix operations to randomly select tokens from a softmax distribution, one at a time, until the EOS token is generated.
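That token-by-token loop can be sketched in a few lines. This is a toy illustration, not any real model's implementation: `logits_fn` is a hypothetical stand-in for the model's actual matrix math, and the tiny four-token vocabulary is made up.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def sample_tokens(logits_fn, prompt_ids, eos_id, max_len=20, rng=None):
    """Autoregressive sampling: each step, the model's matrix math
    yields logits, softmax turns them into a probability distribution,
    and one token is drawn at random -- repeat until EOS appears."""
    rng = rng or np.random.default_rng(0)
    ids = list(prompt_ids)
    for _ in range(max_len):
        probs = softmax(logits_fn(ids))          # "bag of numbers" -> distribution
        tok = int(rng.choice(len(probs), p=probs))
        ids.append(tok)
        if tok == eos_id:
            break
    return ids

# Toy stand-in for a trained model: it strongly favors token 3,
# which we treat as the EOS token in this fake vocabulary.
toy_logits = lambda ids: np.array([0.1, 0.2, 0.3, 5.0])
print(sample_tokens(toy_logits, [0], eos_id=3))
```

The point is that nothing in the loop "knows" anything; it just draws from whatever distribution the weights produce.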
If they don't have any training data that covers a particular concept, they can't map it onto a world model and make predictions about that concept based on an understanding of the world and how it works. [This video](https://www.youtube.com/watch?v=160F8F8mXlo) illustrates it pretty well. These things may or may not end up being fixed in the models, but that's only because they've been further trained with the specific examples. Brains have world models. Cats see a cup of water, and they know exactly what will happen when you tip it over (and you can bet they're gonna do it).
That video is a poor and misunderstood analysis of an old version of ChatGPT.
Analyzing image-generation failure modes from the DALL-E family of models isn't really helpful for understanding whether the invoking LLM has a robust world model.
The point of sharing the video was to use the full glass of wine as an example of how generative AI models lack a true world model at inference time. The example is just as relevant now as it was then, and it applies to inference by LMs and SD models in the same way. Nothing has fundamentally changed in how these models work; getting better at edge cases doesn't give them a world model.
That's the point, though. Look at any end-to-end image model. Currently I think nano banana (Gemini 2.5 Flash) is probably the best in production. (It looks like ChatGPT has regressed its image pipeline with GPT-5 right now, but I'm not sure.)
SD models have a much higher propensity to fixate on proximal, in-distribution solutions because of the way they denoise.
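That pull toward the nearest in-distribution solution can be illustrated with a toy 1-D analogue. This is only a sketch of the intuition, not how any real diffusion model works: we follow the score (gradient of log-density) of a two-mode Gaussian mixture, which is loosely what iterative denoising does, and watch a sample collapse onto the closest "training data" mode.

```python
import numpy as np

def grad_log_mixture(x, modes, sigma=1.0):
    """Score (gradient of log-density) of a 1-D Gaussian mixture.
    Stepping along the score pulls a sample toward nearby modes --
    a toy analogue of denoising drifting toward the closest
    in-distribution solution."""
    diffs = modes - x                               # distance to each mode
    w = np.exp(-(diffs ** 2) / (2 * sigma ** 2))
    w /= w.sum()                                    # responsibility of each mode
    return float((w * diffs).sum()) / sigma ** 2

modes = np.array([-4.0, 4.0])   # two "training data" modes
x = 1.0                         # start slightly nearer the +4 mode
for _ in range(200):            # crude fixed-step ascent on log-density
    x += 0.1 * grad_log_mixture(x, modes)
print(round(x, 2))              # → 4.0 (the nearest mode wins)
```

Even though the starting point sits between the modes, the iterative updates snap it to the proximal one rather than producing anything in between.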
For example, you can ask nano banana for a "completely full wine glass in zero g", which I'm pretty sure is far more out of distribution, and the model does a reasonable job of approximating what that might look like.
That's a fairly bad example. These models don't have any trouble taking unrelated things and sticking them together, and a world model isn't required for that. If I ask one to put a frog on the moon, it can know what frogs look like and what the moon looks like, and put the frog on the moon.
But what it won't be able to do, which does require a world model, is put a frog on the moon and imagine what that frog's body would look like in the vacuum of space as it dies a horrible death.
Your example is a good one. The frog won't work because the model's safety training makes it reluctant to show a dead frog, BUT you can ask nano-banana for:
"Create an image of what a watermelon would look like after being teleported to the surface of the moon for 30 seconds."