I thought GPUs with a lot of extremely fast memory were required for inference. A...

ryan_glass · 2025-06-02T13:43:43 1748871823

It basically comes down to server CPUs having decent memory bandwidth. This is a bit of an oversimplification, but the model and context have to be pulled through RAM (or VRAM) every time a new token is generated. Server CPUs with lots of cores have decent bandwidth: up to about 480GB/s with the EPYC 9 series, which spreads memory access across 12 channels per socket. So, in theory, they can pull 480GB through the system every second. GPUs are faster, but you also have to fit the entire model and context into RAM (or VRAM), so larger models get extremely expensive on GPUs: a decent consumer GPU only has 24GB of VRAM, and that costs silly money if you need 20 of them. Meanwhile you can get a lot of RDIMM RAM for a couple thousand bucks, so you can run bigger models, and 480GB/s gives output faster than most people can read.
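To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes token generation is purely memory-bandwidth bound and that every token needs one full pass over the model in memory, which ignores compute, caches, and mixture-of-experts models that only read part of their weights. The function names and the 40GB model size are made up for illustration; the 480GB/s and 24GB figures are the ones from the comment above.

```python
import math


def tokens_per_second(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed if each token streams
    `gb_read_per_token` gigabytes from memory exactly once."""
    return bandwidth_gb_s / gb_read_per_token


def gpus_needed(model_gb: float, vram_per_gpu_gb: float = 24.0) -> int:
    """Number of GPUs required just to hold the model in VRAM."""
    return math.ceil(model_gb / vram_per_gpu_gb)


# Hypothetical 40GB model on a 480GB/s server CPU.
print(tokens_per_second(480.0, 40.0))  # 12.0 tokens/s at best

# A 480GB model would need this many 24GB consumer GPUs.
print(gpus_needed(480.0))  # 20
```

Real throughput will land below the bound, since sustained bandwidth is lower than the theoretical peak.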

adastra22 · 2025-06-02T00:02:36 1748822556

I’m confused as to why you think a GPU is necessary? It’s just linear algebra.

oreoftw · 2025-06-02T00:13:19 1748823199

Most likely he was referring to the fact that you need plenty of GPU-fast memory to hold the model, and GPU cards have it.

adastra22 · 2025-06-02T14:13:52 1748873632

There is nothing magical about GPU memory though. It’s just faster. But people have been doing CPU inference since the first llama code came out.