I think you're looking at it too abstractly. An LLM isn't representing anything; it has a bag of numbers that some other algorithm produced for it. When you give it some numbers, it runs them through matrix operations to randomly select tokens from a softmax distribution, one at a time, until the EOS token is generated.
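That token-by-token loop can be sketched in a few lines. This is a toy illustration, not any real model's implementation: `logits_fn` is a hypothetical stand-in for the model's actual matrix math, and the tiny four-token vocabulary is made up.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def sample_tokens(logits_fn, prompt_ids, eos_id, max_len=20, rng=None):
    """Autoregressive sampling: each step, the model's matrix math
    yields logits, softmax turns them into a probability distribution,
    and one token is drawn at random -- repeat until EOS appears."""
    rng = rng or np.random.default_rng(0)
    ids = list(prompt_ids)
    for _ in range(max_len):
        probs = softmax(logits_fn(ids))          # "bag of numbers" -> distribution
        tok = int(rng.choice(len(probs), p=probs))
        ids.append(tok)
        if tok == eos_id:
            break
    return ids

# Toy stand-in for a trained model: it strongly favors token 3,
# which we treat as the EOS token in this fake vocabulary.
toy_logits = lambda ids: np.array([0.1, 0.2, 0.3, 5.0])
print(sample_tokens(toy_logits, [0], eos_id=3))
```

The point is that nothing in the loop "knows" anything; it just draws from whatever distribution the weights produce.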
If they don't have any training data that covers a particular concept, they can't map it onto a world model and make predictions about that concept based on an understanding of the world and how it works. [This video](https://www.youtube.com/watch?v=160F8F8mXlo) illustrates it pretty well. These things may or may not end up being fixed in the models, but that's only because they've been further trained with the specific examples. Brains have world models. Cats see a cup of water, and they know exactly what will happen when you tip it over (and you can bet they're gonna do it).
That video is a poor and misunderstood analysis of an old version of ChatGPT.
Analyzing image-generation failure modes from the DALL-E family of models isn't really helpful for understanding whether the invoking LLM has a robust world model.
The point of sharing the video was to use the full glass of wine as an example of how generative AI models lack a true world model at inference time. The example is just as relevant now as it was then, and it applies to inference by LMs and SD models in the same way. Nothing has fundamentally changed in how these models work; getting better at edge cases doesn't give them a world model.
That's the point, though. Look at any end-to-end image model. Currently I think nano banana (Gemini 2.5 Flash) is probably the best in production. (It looks like ChatGPT has regressed its image pipeline with GPT-5 right now, but I'm not sure.)
SD models have a much higher propensity to fixate on proximal, in-distribution solutions because of the way they denoise.
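That pull toward the nearest in-distribution solution can be illustrated with a toy 1-D analogue. This is only a sketch of the intuition, not how any real diffusion model works: we follow the score (gradient of log-density) of a two-mode Gaussian mixture, which is loosely what iterative denoising does, and watch a sample collapse onto the closest "training data" mode.

```python
import numpy as np

def grad_log_mixture(x, modes, sigma=1.0):
    """Score (gradient of log-density) of a 1-D Gaussian mixture.
    Stepping along the score pulls a sample toward nearby modes --
    a toy analogue of denoising drifting toward the closest
    in-distribution solution."""
    diffs = modes - x                               # distance to each mode
    w = np.exp(-(diffs ** 2) / (2 * sigma ** 2))
    w /= w.sum()                                    # responsibility of each mode
    return float((w * diffs).sum()) / sigma ** 2

modes = np.array([-4.0, 4.0])   # two "training data" modes
x = 1.0                         # start slightly nearer the +4 mode
for _ in range(200):            # crude fixed-step ascent on log-density
    x += 0.1 * grad_log_mixture(x, modes)
print(round(x, 2))              # → 4.0 (the nearest mode wins)
```

Even though the starting point sits between the modes, the iterative updates snap it to the proximal one rather than producing anything in between.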
For example, you can ask nano banana for a "completely full wine glass in zero g", which I'm pretty sure is far more out of distribution, and the model does a reasonable job of approximating what that might look like.
That's a fairly bad example. These models don't have any trouble taking unrelated things and sticking them together, and a world model isn't required for that. If I ask one to put a frog on the moon, it can know what frogs look like and what the moon looks like, and put the frog on the moon.
But what it won't be able to do, which does require a world model, is put a frog on the moon and imagine what that frog's body would look like in the vacuum of space as it dies a horrible death.
Your example is a good one. The frog won't work because the model's safety training makes it reluctant to show a dead frog, BUT you can ask nano-banana for:
"Create an image of what a watermelon would look like after being teleported to the surface of the moon for 30 seconds."