I don't understand why we're paying for caching at all (except: model providers ...

simonw · 2025-10-15T18:00:16 1760551216

It's not about storing data on disk, it's about keeping data resident in memory.

jbellis · 2025-10-16T00:48:37 1760575717

DeepSeek pioneered automatic prefix caching and caches on SSD. SSD reads are so fast compared to LLM inference that I can't think of a reason to waste RAM on it.

jychang · 2025-10-16T05:30:14 1760592614

It’s not instantly fast though. Context is probably ~20 GB of VRAM at max context size. That’s going to take some time to load from SSD no matter what.

TTFT (time to first token) will get slower if you export the KV cache to SSD.
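
A rough back-of-the-envelope sketch of that claim, in Python. The model shape below is hypothetical (chosen only so the cache lands near the ~20 GB figure), and the bandwidth numbers are assumed round values, not measurements of any particular SSD or provider setup.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of a KV cache: one key and one value vector per layer, head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(num_bytes, bandwidth_gb_per_s):
    """Time to move num_bytes at a sustained bandwidth given in GB/s (1 GB = 1e9 bytes)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


# Hypothetical model shape (an assumption, not any specific model):
# 80 layers, 8 KV heads, head dimension 128, 65,536 tokens of context, fp16 values.
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=65536)
print(f"KV cache size: {size / 1e9:.1f} GB")

# Assumed sustained read bandwidths in GB/s (illustrative round numbers only).
for label, gb_per_s in [("single NVMe SSD", 7), ("striped NVMe array", 25)]:
    print(f"{label}: {transfer_seconds(size, gb_per_s):.2f} s")
```

Under these assumptions the reload costs on the order of seconds from a single drive, which is the extra latency that would show up in TTFT, though it can still be cheaper than recomputing the prefill.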

criemen · 2025-10-15T18:08:38 1760551718

Fascinating, so I have to think more "pay for RAM/Redis" than "pay for SSD"?

nthypes · 2025-10-15T18:43:24 1760553804

"pay for data on VRAM" RAM of GPU

criemen · 2025-10-15T18:57:31 1760554651

But that doesn't make sense? Why would they keep the cache persistent in the VRAM of the GPU nodes, which is needed for model weights? Shouldn't they be able to swap the KV cache of your prompt in and out when you actually use it?

tazjin · 2025-10-15T19:44:26 1760557466

Your intuition is correct and the sibling comments are wrong. Modern LLM inference servers support hierarchical caches (where data moves to slower storage tiers), often with pluggable backends. A popular open-source backend for the "slow" tier is Mooncake: https://github.com/kvcache-ai/Mooncake
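
A minimal sketch of the hierarchical idea in Python, assuming a two-tier design where least-recently-used entries are demoted from a small fast tier to a larger slow tier and promoted back on access. The class and method names are hypothetical; this is not Mooncake's API or any inference server's actual implementation.

```python
from collections import OrderedDict


class TieredPrefixCache:
    """Toy two-tier cache: a small 'fast' tier (think VRAM) backed by a 'slow' tier (think SSD)."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # ordered from least to most recently used
        self.slow = {}             # unbounded here, purely for illustration

    def put(self, prefix_key, kv_blob):
        # Store in the fast tier, then demote LRU entries until it fits.
        self.slow.pop(prefix_key, None)
        self.fast[prefix_key] = kv_blob
        self.fast.move_to_end(prefix_key)
        while len(self.fast) > self.fast_capacity:
            old_key, old_blob = self.fast.popitem(last=False)
            self.slow[old_key] = old_blob

    def get(self, prefix_key):
        # Returns (blob, tier), where tier says where the entry was found.
        if prefix_key in self.fast:
            self.fast.move_to_end(prefix_key)
            return self.fast[prefix_key], "fast"
        if prefix_key in self.slow:
            blob = self.slow.pop(prefix_key)
            self.put(prefix_key, blob)  # promote back to the fast tier
            return blob, "slow"
        return None, "miss"


cache = TieredPrefixCache(fast_capacity=1)
cache.put("prompt-a", b"kv-a")
cache.put("prompt-b", b"kv-b")   # demotes prompt-a to the slow tier
print(cache.get("prompt-a"))     # found in the slow tier, then promoted
print(cache.get("prompt-a"))     # now served from the fast tier
```

A "slow" hit still avoids recomputing the prefill for that prefix; it just pays the transfer cost of moving the data back up a tier.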

simonw · 2025-10-15T23:31:57 1760571117

OK that's pretty fascinating, turns out Mooncake includes a trick that can populate GPU VRAM directly from NVMe SSD without it having to go through the host's regular CPU and RAM first!

https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/tran...

> Transfer Engine also leverages the NVMeof protocol to support direct data transfer from files on NVMe to DRAM/VRAM via PCIe, without going through the CPU and achieving zero-copy.

dotancohen · 2025-10-15T19:23:34 1760556214

They are not caching to save network bandwidth. They are caching to increase inference speed and reduce (their own) costs.

minimaxir · 2025-10-15T19:00:08 1760554808

That is slow.