
Fill 3D takes a different approach than diffusion, in that it tries to build an actual 3D scene (kinda like a clone) of what's in the image you upload. In some sense, that's actually the most fundamental representation of your image's content (or, said another way, your image is just one rendering of that original scene).

So it works by trying to estimate a 3D 'room' that matches your image: everything from the geometry, to the light fixtures, to the windows. It's heavily inspired by how humans (weird to contrast 'human' with AI work) do image/video compositing.

TL;DR: Image in, 3D scene out.
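
To make "image in, 3D scene out" concrete, here's a minimal sketch of the geometry half only. This is not Fill 3D's actual pipeline (which isn't described here); it just assumes a pinhole camera and a per-pixel depth map from some monocular depth estimator, then back-projects pixels into a camera-space point cloud. The intrinsics (fx, fy, cx, cy) are made-up example values.

    # Minimal sketch: lift a depth map to a 3D point cloud with a
    # pinhole camera model. A real system would get `depth` from a
    # monocular depth network and fit planes/meshes/lights on top.
    import numpy as np

    def backproject(depth, fx, fy, cx, cy):
        """Lift an HxW depth map to an (H*W, 3) point cloud in camera space."""
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
        z = depth
        x = (u - cx) * z / fx  # pinhole model: x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        return np.stack([x, y, z], axis=-1).reshape(-1, 3)

    # Toy example: a flat "wall" 3 m away, 640x480 image, made-up intrinsics.
    depth = np.full((480, 640), 3.0)
    points = backproject(depth, fx=554.0, fy=554.0, cx=320.0, cy=240.0)
    print(points.shape)  # (307200, 3)

From a point cloud like this, layout estimation is typically a matter of fitting planes (walls, floor, ceiling) and segmenting objects; lighting estimation is a separate problem on top.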



Could you elaborate on how that's done technically? I'm curious how you estimate the 3D room. Are you using ML-based estimation like LayoutNet? How about the lighting?



