Yeah, I just asked Gemini, and apparently some older estimates put a relatively filtered dataset of GitHub source code at around 21TB in 2018, while more recent estimates put it in the low hundreds of TB.
Considering, as you said, that LLMs are doing a form of compression, and assuming generously that you add extra compression on top, yeah, now I understand a bit more. Even if you focus on dissimilar code to get the most coverage, I wouldn't be shocked if a modern, representative GitHub source-code training set, once compressed, still weighed in at around 1TB, which is obviously far more than consumer-grade hardware can bear.
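For a rough sense of that scale (my own ballpark figures, nothing official): the memory a model's weights need is just parameter count times bits per parameter, and even aggressively quantized large models overflow a single consumer GPU.

    # Back-of-envelope: weight memory = parameter count * bits per parameter.
    # Parameter counts and precisions are hypothetical, just to show the scale.
    def weight_gb(n_params: float, bits_per_param: int) -> float:
        return n_params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB

    for n_params, bits in [(8e9, 16), (70e9, 16), (70e9, 4), (400e9, 4)]:
        print(f"{n_params/1e9:.0f}B params @ {bits}-bit -> ~{weight_gb(n_params, bits):.0f} GB")
    # 8B   @ 16-bit -> ~16 GB   (edge of a single consumer GPU)
    # 70B  @ 16-bit -> ~140 GB
    # 70B  @ 4-bit  -> ~35 GB
    # 400B @ 4-bit  -> ~200 GB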
I guess we need to ramp up RAM production a bunch more :-(
Speaking of which, what's the next bottleneck besides storing the damned things? Training needs a ton of resources, but that part can be pooled; even for OSS models, it "just" needs to be done "once", and then the entire community can use the resulting model. So I guess inference is the scaling cost. What's the most-used resource there? RAM bandwidth?
Yes, for inference the main bottleneck is GPU VRAM capacity and the bandwidth between VRAM and the GPU compute cores. Ideally you want enough VRAM to hold the entire model, plus headroom for caching the already-produced output (the KV cache) while you generate tokens. And you want the VRAM bandwidth high enough that the weights can be streamed to the compute cores as fast as possible for each token's calculations; that largely determines the tokens/sec you get on output. So yes, more and faster VRAM is essential.
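To put a rough number on that (a sketch with made-up hardware figures, not any specific GPU): during generation you essentially stream all the weights from VRAM once per output token, so VRAM bandwidth divided by model size gives an approximate ceiling on tokens/sec.

    # Back-of-envelope decode-speed ceiling, assuming the whole model is read
    # from VRAM once per generated token (ignores KV-cache traffic and compute time).
    # The 70B size, 4-bit quantization and ~1 TB/s bandwidth are hypothetical examples.
    def tps_ceiling(n_params: float, bytes_per_param: float, vram_bw_gb_per_s: float) -> float:
        model_bytes = n_params * bytes_per_param
        return vram_bw_gb_per_s * 1e9 / model_bytes

    print(tps_ceiling(70e9, 0.5, 1000))  # 4-bit 70B on ~1 TB/s: ~29 tokens/sec max
    print(tps_ceiling(70e9, 2.0, 1000))  # fp16 70B on ~1 TB/s: ~7 tokens/sec max

Which is why quantization and faster memory both move the needle on output speed directly.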