There's a note which suggests you might be able to get by with less. My 3060 struggles with SD on the defaults, but works fine with float16.
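For reference, float16 in diffusers is just the dtype you pass at load time. A minimal sketch, assuming the usual SD 1.5 checkpoint (nothing specific to this thread):

    import torch
    from diffusers import StableDiffusionPipeline

    # Load the weights in half precision; roughly halves VRAM vs. the fp32 default.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    ).to("cuda")

    image = pipe("a photo of an astronaut riding a horse").images[0]
    image.save("astronaut.png")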
There are multiple ways to speed up the inference time and lower the memory consumption even more with diffusers. To do so, please have a look at the Diffusers docs:
Optimizing for inference time [1]
Optimizing for low memory during inference [2]
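A couple of the knobs those docs cover, sketched out; which ones exist and which help most depends on your diffusers version and GPU:

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    )

    # Compute attention in slices: a bit slower, noticeably lower peak VRAM.
    pipe.enable_attention_slicing()

    # Keep idle submodules in system RAM and move each to the GPU only while it
    # runs (needs accelerate installed; don't also call pipe.to("cuda")).
    pipe.enable_model_cpu_offload()

    image = pipe("a watercolor painting of a lighthouse").images[0]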
If you don't mind the power consumption, I noticed that older Nvidia P6000s (24GB) are pretty cheap on eBay! My 16GB P5000 is pretty handy for this stuff.
An M40 24GB is less than $200, if you don't mind the trouble of getting its drivers installed, cooling it, etc. It's also important to note that your motherboard must support larger VRAM addressing; many older chipsets won't be able to boot with it (e.g. some, perhaps almost all, boards built for Zen 1).
IIRC SD’s first-party code had significantly higher requirements than the downstream optimizations most people are using, which trade lower speed and higher system RAM use for reduced peak VRAM.
The architecture here looks different, but the code is licensed in a way which still makes downstream optimization and redistribution possible, so maybe there will be something there.
They took down the blogpost, but from what I remember the model is composite and consists of a text encoder as well as 3 "stages":
1. (11B) T5-XXL text encoder [1]
2. (4.3B) Stage 1 UNet
3. (1.3B) Stage 2 upscaler (64x64 -> 256x256)
4. (?B) Stage 3 upscaler (256x256 -> 1024x1024)
Resolution numbers could be off, though. Also, the third stage can apparently use the existing Stable Diffusion x4 upscaler, or a new upscaler that they aren't releasing yet (ever?).
> Once these are quantized (I assume they can be)
Based on the success of LLaMA 4bit quantization, I believe the text encoder could be. As for the other modules, I'm not sure.
edit: the text encoder is 11B, not 4.5B as I initially wrote.
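If quantizing the encoder pans out, the easiest path right now is the existing 8-bit support in transformers/bitsandbytes rather than 4-bit. A hedged sketch, assuming the encoder is interchangeable with the stock t5-v1_1-xxl checkpoint (the released weights may live in a different repo):

    from transformers import T5Tokenizer, T5EncoderModel

    tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")

    # Load only the encoder stack, quantized to 8-bit on the fly
    # (needs bitsandbytes + accelerate installed).
    encoder = T5EncoderModel.from_pretrained(
        "google/t5-v1_1-xxl",
        load_in_8bit=True,
        device_map="auto",
    )

    tokens = tokenizer("a castle on a hill at sunset", return_tensors="pt")
    embedding = encoder(**tokens).last_hidden_state  # all the UNet stages need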
You'll be able to optimize it a lot to make it fit on small systems if you're willing to modify your workflow a bit: instead of 1 prompt -> 1 image, _n_ times, do 1 prompt -> _n_ candidate images once, then upscale only the _m_ you pick. For a given prompt, run it through the T5 model and store the embedding; you can do that in CPU RAM if you have to, because you only need the embedding once, so you don't need a GPU that can run T5-XXL natively. Then you can get a large batch of samples from #2; 64px is enough to preview. Only once you pick some do you run them through #3, and then from those through #4. Your peak VRAM should be 1 image in #2 or #4, and those can be quantized or pruned down to something that will fit on many GPUs.
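Roughly that flow, as a hedged sketch against the diffusers API — the repo names, the encode_prompt call, and the fp16 variants are all assumptions on my part, since nothing is published yet:

    import torch
    from diffusers import DiffusionPipeline

    prompt = "a red panda astronaut, detailed illustration"

    # Run the T5-XXL encoder once and keep only the embeddings; this can stay
    # on CPU, since it's a single forward pass per prompt.
    text_pipe = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0",  # assumed repo name
        unet=None,                 # load the text encoder only
    )
    prompt_embeds, negative_embeds = text_pipe.encode_prompt(prompt)
    del text_pipe                  # free the T5 weights before touching the GPU

    # #2 in the list above (the Stage 1 UNet): a batch of 64px previews
    # generated from the stored embeddings, no text encoder loaded.
    stage1 = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0", text_encoder=None,
        variant="fp16", torch_dtype=torch.float16,
    ).to("cuda")
    previews = stage1(
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_embeds,
        num_images_per_prompt=8,   # the n candidates
        output_type="pt",
    ).images
    del stage1

    # #3 (the Stage 2 upscaler): upscale only a preview you picked, one at a
    # time, so peak VRAM stays at a single image. #4 would follow the same
    # pattern from here.
    stage2 = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-II-L-v1.0",  # assumed repo name
        text_encoder=None, variant="fp16", torch_dtype=torch.float16,
    ).to("cuda")
    upscaled = stage2(
        image=previews[0:1],
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_embeds,
    ).images[0]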
>Can anyone explain why it needs so much ram in the first place though?
The T5-XXL text encoder is really large. Also, we do not quantize the UNets: the UNet outputs 8-bit pixels, so quantizing the UNet to that precision would create pretty bad outputs.