Disclaimer: probably dumb questions so, the 20b model. Can someone explain to me ...

mlyle · 2025-08-05T17:25:47 1754414747

An A100 is probably 2-4k tokens/second on a 20B model with batched inference.

Multiply the number of A100s you need as necessary.

Here, you don't really need the RAM. If you could accept fewer tokens/second, you could do it much cheaper with consumer graphics cards.

Even with an A100, the sweet spot in batching is not going to give you 1k/process/second. Of course, you could go up to an H100...
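A rough sketch of the "multiply as necessary" arithmetic, taking the 2-4k tokens/second per A100 guess above at face value. The function name `gpus_needed` is made up for illustration, and this only covers aggregate throughput: it says nothing about whether any single stream reaches 1k tokens/second.

```python
import math

def gpus_needed(streams, tokens_per_stream, gpu_tokens_per_sec):
    """GPUs required to cover the aggregate token rate (hypothetical helper)."""
    total_tokens_per_sec = streams * tokens_per_stream
    return math.ceil(total_tokens_per_sec / gpu_tokens_per_sec)

# 20 streams x 1k tokens/second, against the 2k and 4k per-A100 guesses
for per_gpu_rate in (2000, 4000):
    print(per_gpu_rate, gpus_needed(20, 1000, per_gpu_rate))
```

So the batched estimate lands somewhere between 5 and 10 A100s for the total, with the per-process rate still being the open question.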

d3m0t3p · 2025-08-05T20:38:31 1754426311

You can batch only if you have distinct chats running in parallel.

mlyle · 2025-08-05T22:22:45 1754432565

> > if I want to run 20 concurrent processes, assuming I need 1k tokens/second throughput (on each)

petuman · 2025-08-05T17:25:13 1754414713

> assuming I need 1k tokens/second throughput (on each, so 20 x 1k)

3.6B activated parameters at Q8 (1 byte each) x 1000 t/s = 3.6TB/s just for activated model weights (there's also context). So pretty much straight to a B200 or the like. 1000 t/s per user/agent is way too fast; make it 300 t/s and you could get away with a 5090/RTX PRO 6000.
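The same back-of-the-envelope math as a small sketch. It assumes every generated token reads all activated weights once from VRAM for a single stream, and it ignores context (KV cache) traffic; the function name is made up for illustration.

```python
def weight_bandwidth_tb_per_s(active_params_billion, bytes_per_param, tokens_per_sec):
    """Memory bandwidth needed just to stream the activated weights, in TB/s."""
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return bytes_per_token * tokens_per_sec / 1e12

# 3.6B activated parameters at Q8 (1 byte per parameter)
print(weight_bandwidth_tb_per_s(3.6, 1, 1000))  # the 1000 t/s case
print(weight_bandwidth_tb_per_s(3.6, 1, 300))   # the 300 t/s case
```

At 300 t/s the same formula gives roughly 1.08 TB/s, which is why the requirement drops into the range of a single high-end card.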

mythz · 2025-08-05T17:14:01 1754414041

gpt-oss:20b is ~14GB on disk [1] so fits nicely within a 16GB VRAM card.

[1] https://ollama.com/library/gpt-oss

dragonwriter · 2025-08-05T17:38:54 1754415534

You also need space in VRAM for what is required to support the context window; you might be able to do a model that is 14GB in parameters with a small (~8k maybe?) context window on a 16GB card.
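A minimal sketch of how one might estimate that headroom, assuming a plain KV cache with full attention in every layer and 2-byte values. The layer count, KV head count, head dimension, and the 1 GB runtime overhead below are placeholder numbers for illustration, not the real gpt-oss configuration, so plug in the actual values from the model config before trusting the result.

```python
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of keys plus values stored per token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

def max_context_tokens(vram_gb, weights_gb, per_token_bytes, overhead_gb=0.0):
    """Tokens of context that fit in the VRAM left after weights and overhead."""
    free_bytes = (vram_gb - weights_gb - overhead_gb) * 1e9
    return max(0, int(free_bytes // per_token_bytes))

# HYPOTHETICAL architecture numbers, for illustration only
per_token = kv_cache_bytes_per_token(layers=24, kv_heads=8, head_dim=64)
print(per_token)
# 16GB card, 14GB of weights, assumed 1GB of runtime overhead
print(max_context_tokens(16, 14, per_token, overhead_gb=1.0))
```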

artembugara · 2025-08-05T17:18:06 1754414286

thanks, this part is clear to me.

but I need to understand 20 x 1k token throughput

I assume it just might be too early to know the answer

Tostino · 2025-08-05T17:30:21 1754415021

I legitimately cannot think of any hardware I know of that will get you that throughput over that many streams (I don't work in the server space, so there may be some new stuff I am unaware of).

artembugara · 2025-08-05T17:39:10 1754415550

oh, I totally understand that I'd need multiple GPUs. I'd just want to know what GPU specifically and how many

Tostino · 2025-08-05T17:48:36 1754416116

I don't think you can get 1k tokens/sec on a single stream using any consumer-grade GPU with a 20b model. Maybe you could with an H100 or better, but I somewhat doubt that.

My 2x 3090 setup will get me ~6-10 streams of ~20-40 tokens/sec (generation) and ~700-1000 tokens/sec (input) with a 32b dense model.
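For scale, a quick sketch comparing those generation numbers with the 20 x 1k target. It naively multiplies the low and high ends of each range, which is a loose bound (more streams usually means fewer tokens/sec per stream), and it compares a 32b dense model against a 20b MoE one, so treat it as order-of-magnitude only. The helper name is made up.

```python
def aggregate_range(streams, per_stream_rate):
    """(low, high) aggregate tokens/sec from (low, high) stream counts and rates."""
    return streams[0] * per_stream_rate[0], streams[1] * per_stream_rate[1]

low, high = aggregate_range((6, 10), (20, 40))
target = 20 * 1000  # 20 streams at 1k tokens/sec each

print(low, high)      # aggregate generation tokens/sec on the 2x 3090 setup
print(target / high)  # how many times short the best case falls
```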

PeterStuer · 2025-08-05T18:09:44 1754417384

(answer for 1 inference) It all depends on the context length you want to support, as the activation memory will dominate the requirements. For 4096 tokens you will get away with 24GB (or even 16GB), but if you want to go for the full 131072 tokens you are not going to get there with a 32GB consumer GPU like the 5090. You'll need to spring for at minimum an A6000 (48GB) or preferably an RTX PRO 6000 (96GB).

Also keep in mind this model does use 4-bit layers for the MoE parts. Unfortunately, native accelerated 4-bit support only started with Blackwell on NVIDIA, so your 3090/4090/A6000/A100s are not going to be fast. An RTX 5090 will be your best starting point in the traditional card space. Maybe the unified-memory mini PCs like the Spark systems or the Mac mini could be an alternative, but I do not know them well enough.

vl · 2025-08-05T18:45:40 1754419540

How do Macs compare to RTX cards for this? I.e., what numbers can be expected from a Mac mini/Mac Studio with 64/128/256/512GB of unified memory?

spott · 2025-08-05T17:44:12 1754415852

Groq is offering 1k tokens per second for the 20B model.

You are unlikely to match Groq on off-the-shelf hardware, as far as I'm aware.

coolspot · 2025-08-05T19:04:59 1754420699

https://apxml.com/tools/vram-calculator