There's a note which suggests you might be able to get by with less. My 3060 struggles with SD on the defaults, but works fine with float16.
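For reference, float16 in diffusers is just the dtype you pass at load time. A minimal sketch, assuming the usual SD 1.5 checkpoint (nothing specific to this thread):

    import torch
    from diffusers import StableDiffusionPipeline

    # Load the weights in half precision; roughly halves VRAM vs. the fp32 default.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    ).to("cuda")

    image = pipe("a photo of an astronaut riding a horse").images[0]
    image.save("astronaut.png")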
There are multiple ways to speed up the inference time and lower the memory consumption even more with diffusers. To do so, please have a look at the Diffusers docs:
Optimizing for inference time [1]
Optimizing for low memory during inference [2]
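A couple of the knobs those docs cover, sketched out; which ones exist and which help most depends on your diffusers version and GPU:

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    )

    # Compute attention in slices: a bit slower, noticeably lower peak VRAM.
    pipe.enable_attention_slicing()

    # Keep idle submodules in system RAM and move each to the GPU only while it
    # runs (needs accelerate installed; don't also call pipe.to("cuda")).
    pipe.enable_model_cpu_offload()

    image = pipe("a watercolor painting of a lighthouse").images[0]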
If you don't mind the power consumption, I noticed that older Nvidia P6000s (24GB) are pretty cheap on eBay! My 16GB P5000 is pretty handy for this stuff.
An M40 24GB is less than $200, if you don't mind the trouble of getting its drivers installed, cooling it, etc. It's also important to note that your motherboard must support larger VRAM addressing; many older chipsets won't be able to boot with it (e.g. some, perhaps almost all, boards built for Zen 1).
IIRC SD’s first-party code had significantly higher requirements than the downstream optimizations most people are using, which trade lower speed and higher system RAM use for reduced peak VRAM.
The architecture here looks different, but the code is licensed in a way which still makes downstream optimization and redistribution possible, so maybe there will be something there.
They took down the blogpost, but from what I remember the model is composite and consists of a text encoder as well as 3 "stages":
1. (11B) T5-XXL text encoder [1]
2. (4.3B) Stage 1 UNet
3. (1.3B) Stage 2 upscaler (64x64 -> 256x256)
4. (?B) Stage 3 upscaler (256x256 -> 1024x1024)
Resolution numbers could be off, though. Also, the third stage can apparently use the existing Stable Diffusion x4 upscaler, or a new upscaler that they aren't releasing yet (ever?).
> Once these are quantized (I assume they can be)
Based on the success of LLaMA 4bit quantization, I believe the text encoder could be. As for the other modules, I'm not sure.
edit: the text encoder is 11B, not 4.5B as I initially wrote.
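If quantizing the encoder pans out, the easiest path right now is the existing 8-bit support in transformers/bitsandbytes rather than 4-bit. A hedged sketch, assuming the encoder is interchangeable with the stock t5-v1_1-xxl checkpoint (the released weights may live in a different repo):

    from transformers import T5Tokenizer, T5EncoderModel

    tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")

    # Load only the encoder stack, quantized to 8-bit on the fly
    # (needs bitsandbytes + accelerate installed).
    encoder = T5EncoderModel.from_pretrained(
        "google/t5-v1_1-xxl",
        load_in_8bit=True,
        device_map="auto",
    )

    tokens = tokenizer("a castle on a hill at sunset", return_tensors="pt")
    embedding = encoder(**tokens).last_hidden_state  # all the UNet stages need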
You'll be able to optimize it a lot to make it fit on small systems if you're willing to modify your workflow a bit: instead of 1 prompt -> 1 image, _n_ times, do 1 prompt -> _n_ candidate images once, then upscale only the _m_ you pick. For a given prompt, run it through the T5 model and store the embedding; you can do that in CPU RAM if you have to, because you only need the embedding once, so you don't need a GPU that can run T5-XXL natively. Then you can get a large batch of samples from #2; 64px is enough to preview. Only once you pick some do you run them through #3, and then from those through #4. Your peak VRAM should be 1 image in #2 or #4, and those can be quantized or pruned down to something that will fit on many GPUs.
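Roughly that flow, as a hedged sketch against the diffusers API — the repo names, the encode_prompt call, and the fp16 variants are all assumptions on my part, since nothing is published yet:

    import torch
    from diffusers import DiffusionPipeline

    prompt = "a red panda astronaut, detailed illustration"

    # Run the T5-XXL encoder once and keep only the embeddings; this can stay
    # on CPU, since it's a single forward pass per prompt.
    text_pipe = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0",  # assumed repo name
        unet=None,                 # load the text encoder only
    )
    prompt_embeds, negative_embeds = text_pipe.encode_prompt(prompt)
    del text_pipe                  # free the T5 weights before touching the GPU

    # #2 in the list above (the Stage 1 UNet): a batch of 64px previews
    # generated from the stored embeddings, no text encoder loaded.
    stage1 = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0", text_encoder=None,
        variant="fp16", torch_dtype=torch.float16,
    ).to("cuda")
    previews = stage1(
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_embeds,
        num_images_per_prompt=8,   # the n candidates
        output_type="pt",
    ).images
    del stage1

    # #3 (the Stage 2 upscaler): upscale only a preview you picked, one at a
    # time, so peak VRAM stays at a single image. #4 would follow the same
    # pattern from here.
    stage2 = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-II-L-v1.0",  # assumed repo name
        text_encoder=None, variant="fp16", torch_dtype=torch.float16,
    ).to("cuda")
    upscaled = stage2(
        image=previews[0:1],
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_embeds,
    ).images[0]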
>Can anyone explain why it needs so much ram in the first place though?
The T5-XXL text encoder is really large. Also, we do not quantize the UNets: the UNet outputs 8-bit pixels, so quantizing the UNet to that precision would create pretty bad outputs.