Yeah, I just asked Gemini, and apparently some older estimates put a relatively filtered dataset of GitHub source code at around 21TB in 2018, while more recent estimates put it in the low hundreds of TB.
Considering, as you said, that LLMs are doing a form of compression, and assuming generously that you add extra compression on top, yeah, now I understand a bit more. Even if you focus on dissimilar code to get the most coverage, I wouldn't be shocked if a modern, representative GitHub source-code training set, once compressed, still weighed in at around 1TB, which is obviously far more than consumer-grade hardware can bear.
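For a rough sense of that scale (my own ballpark figures, nothing official): the memory a model's weights need is just parameter count times bits per parameter, and even aggressively quantized large models overflow a single consumer GPU.

    # Back-of-envelope: weight memory = parameter count * bits per parameter.
    # Parameter counts and precisions are hypothetical, just to show the scale.
    def weight_gb(n_params: float, bits_per_param: int) -> float:
        return n_params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB

    for n_params, bits in [(8e9, 16), (70e9, 16), (70e9, 4), (400e9, 4)]:
        print(f"{n_params/1e9:.0f}B params @ {bits}-bit -> ~{weight_gb(n_params, bits):.0f} GB")
    # 8B   @ 16-bit -> ~16 GB   (edge of a single consumer GPU)
    # 70B  @ 16-bit -> ~140 GB
    # 70B  @ 4-bit  -> ~35 GB
    # 400B @ 4-bit  -> ~200 GB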
I guess we need to ramp up RAM production a bunch more :-(
Speaking of which, what's the next bottleneck besides storing the damned things? Training needs a ton of resources, but that part can be pooled; even for OSS models, it "just" needs to be done "once", and then the entire community can use the resulting model. So I guess inference is the scaling cost. What's the most-used resource there? RAM bandwidth?
Yes, for inference the main bottleneck is GPU VRAM capacity and the bandwidth between VRAM and the GPU compute cores. Ideally you want enough VRAM to hold the entire model, plus headroom for caching the already-produced output (the KV cache) while you generate tokens. And you want the VRAM bandwidth high enough that the weights can be streamed to the compute cores as fast as possible for each token's calculations; that largely determines the tokens/sec you get on output. So yes, more and faster VRAM is essential.
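To put a rough number on that (a sketch with made-up hardware figures, not any specific GPU): during generation you essentially stream all the weights from VRAM once per output token, so VRAM bandwidth divided by model size gives an approximate ceiling on tokens/sec.

    # Back-of-envelope decode-speed ceiling, assuming the whole model is read
    # from VRAM once per generated token (ignores KV-cache traffic and compute time).
    # The 70B size, 4-bit quantization and ~1 TB/s bandwidth are hypothetical examples.
    def tps_ceiling(n_params: float, bytes_per_param: float, vram_bw_gb_per_s: float) -> float:
        model_bytes = n_params * bytes_per_param
        return vram_bw_gb_per_s * 1e9 / model_bytes

    print(tps_ceiling(70e9, 0.5, 1000))  # 4-bit 70B on ~1 TB/s: ~29 tokens/sec max
    print(tps_ceiling(70e9, 2.0, 1000))  # fp16 70B on ~1 TB/s: ~7 tokens/sec max

Which is why quantization and faster memory both move the needle on output speed directly.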