That's the thing about this. Calling things "world models" only serves to confuse people, because "world" is such a loose word. In this case it means "3D scene". When others use it, they may mean "screen-space physics model". In the context of LLMs it means something like "reasoning about real-world processes outside of text".