
The current generation of models all support pretty long context now: the Gemini family has had 1M tokens for over a year, GPT-4.1 is 1M, and interestingly GPT-5 is back down to 400K. Claude 4 is 200K, but there's a mode of Claude Sonnet 4 that can do 1M as well.

The bigger question is how well they actually perform across that context. There are needle-in-haystack benchmarks that test exactly this, and the current models are mostly scoring quite highly on them now (a toy version of the setup is sketched after the leaderboard links below).

https://cloud.google.com/blog/products/ai-machine-learning/t... talks about that for Gemini 1.5.

Here are a couple of relevant leaderboards: https://huggingface.co/spaces/RMT-team/babilong and https://longbench2.github.io/
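
A toy version of the needle-in-haystack setup, if it helps picture what these benchmarks measure (the needle fact and filler text here are made up):

    # Toy needle-in-a-haystack check: bury one fact in a pile of filler
    # text and see whether the model can retrieve it from deep context.
    import random

    needle = "The magic number for project Falcon is 7341."
    filler = "The quick brown fox jumps over the lazy dog. " * 20_000
    pos = int(len(filler) * random.random())  # hide the needle at a random depth
    haystack = filler[:pos] + needle + " " + filler[pos:]

    prompt = haystack + "\n\nWhat is the magic number for project Falcon?"
    # Feed `prompt` to the model under test and check the reply contains "7341",
    # then repeat across depths and context lengths to build the score grid.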



Sorry, I should have been clearer: I meant open source LLMs. And I guess the question is, how are the closed source LLMs doing it so well? And if OS OpenNote is the best we have...


Mainly I think it's that you need a LOT of VRAM to handle long context - the KV cache grows with every token you keep in context, so server-class hardware is pretty much a requirement once you go much beyond ~10,000 tokens.
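
Back-of-envelope for why (purely illustrative numbers, not any particular model):

    # Rough KV-cache size estimate. Assumed: 80 layers, 8 KV heads (GQA),
    # head dim 128, fp16 -> 2 bytes per value. Not the config of any real model.
    layers, kv_heads, head_dim, bytes_per_val = 80, 8, 128, 2
    tokens = 200_000
    kv_bytes = 2 * layers * kv_heads * head_dim * tokens * bytes_per_val  # K and V
    print(f"{kv_bytes / 2**30:.1f} GiB of KV cache")  # ~61 GiB, on top of the weights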


On my i9 desktop with 128GB RAM and only 8GB VRAM, using llama.cpp I can split the work between CPU and GPU and run Qwen3 at the maximum 200k context at a decent (human-reading) speed.
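
For anyone curious, roughly what that looks like via the llama-cpp-python bindings (a sketch, not my exact setup; the GGUF filename, layer split, and input file are placeholders you'd tune to your own hardware):

    from llama_cpp import Llama

    # Offload only as many layers as fit in 8GB VRAM; the rest stay on the CPU,
    # where system RAM holds the remaining weights and the KV cache.
    llm = Llama(
        model_path="qwen3-32b-q4_k_m.gguf",  # placeholder filename
        n_ctx=200_000,       # ask for the full context window
        n_gpu_layers=20,     # tune to whatever your VRAM actually holds
    )

    long_text = open("big_document.txt").read()  # hypothetical long input
    out = llm("Summarize this document:\n\n" + long_text, max_tokens=256)
    print(out["choices"][0]["text"])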



