I’m not sure why “denoise in latent space at 64x64 and decode to pixel space at the target resolution” is fundamentally better than “denoise in pixel space at 64x64, then upscale to the target resolution in pixel space and denoise some more”.
The former is likely cheaper in compute per unit of resolution, but compute isn’t the only consideration for “better”...
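A rough back-of-envelope for the compute side: in an SD-style setup (assuming an 8x downsampling VAE with 4 latent channels, which is the Stable Diffusion convention, and a 512x512 RGB target), the denoiser touches far fewer elements per step in latent space than it would at full pixel resolution. A sketch:

```python
# Per-step tensor sizes: latent-space denoising vs full-resolution
# pixel-space denoising. Assumes an SD-style 8x VAE with 4 latent
# channels and a 512x512 RGB target -- illustrative, not measured.

latent_elems = 64 * 64 * 4    # what the denoiser sees per latent step
pixel_elems = 512 * 512 * 3   # per step when denoising at full resolution

ratio = pixel_elems / latent_elems
print(latent_elems, pixel_elems, ratio)  # pixel step handles 48x more data
```

And since self-attention cost grows quadratically with the number of spatial tokens, the gap per step is even larger than this raw element count suggests.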
Denoising in latent space certainly seems like the “correct” path. My (amateur) thinking is: the more you can do in latent space, the better.