
The current generation of models all support pretty long context now: the Gemini family has had 1M tokens for over a year, GPT-4.1 is 1M, and interestingly GPT-5 is back down to 400K. Claude 4 is 200K, but there's a mode of Claude Sonnet 4 that can do 1M as well.

The bigger question is how well they actually perform across that context. There are needle-in-haystack benchmarks that test exactly this, and the current models are mostly scoring quite highly on them now (a toy version of the setup is sketched after the leaderboard links below).

https://cloud.google.com/blog/products/ai-machine-learning/t... talks about that for Gemini 1.5.

Here are a couple of relevant leaderboards: https://huggingface.co/spaces/RMT-team/babilong and https://longbench2.github.io/
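
A toy version of the needle-in-haystack setup, if it helps picture what these benchmarks measure (the needle fact and filler text here are made up):

    # Toy needle-in-a-haystack check: bury one fact in a pile of filler
    # text and see whether the model can retrieve it from deep context.
    import random

    needle = "The magic number for project Falcon is 7341."
    filler = "The quick brown fox jumps over the lazy dog. " * 20_000
    pos = int(len(filler) * random.random())  # hide the needle at a random depth
    haystack = filler[:pos] + needle + " " + filler[pos:]

    prompt = haystack + "\n\nWhat is the magic number for project Falcon?"
    # Feed `prompt` to the model under test and check the reply contains "7341",
    # then repeat across depths and context lengths to build the score grid.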



Sorry, I should have been clearer: I meant open source LLMs. And I guess the question is, how are the closed source LLMs doing it so well? And if OS OpenNote is the best we have...


Mainly I think it's that you need a LOT of VRAM to handle long context - the KV cache grows with every token you keep in context, so server-class hardware is pretty much a requirement once you go much beyond ~10,000 tokens.
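
Back-of-envelope for why (purely illustrative numbers, not any particular model):

    # Rough KV-cache size estimate. Assumed: 80 layers, 8 KV heads (GQA),
    # head dim 128, fp16 -> 2 bytes per value. Not the config of any real model.
    layers, kv_heads, head_dim, bytes_per_val = 80, 8, 128, 2
    tokens = 200_000
    kv_bytes = 2 * layers * kv_heads * head_dim * tokens * bytes_per_val  # K and V
    print(f"{kv_bytes / 2**30:.1f} GiB of KV cache")  # ~61 GiB, on top of the weights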


On my i9 desktop with 128GB RAM and only 8GB VRAM, using llama.cpp I can split the work between CPU and GPU and run Qwen3 at the maximum 200k context at a decent (human-reading) speed.
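
For anyone curious, roughly what that looks like via the llama-cpp-python bindings (a sketch, not my exact setup; the GGUF filename, layer split, and input file are placeholders you'd tune to your own hardware):

    from llama_cpp import Llama

    # Offload only as many layers as fit in 8GB VRAM; the rest stay on the CPU,
    # where system RAM holds the remaining weights and the KV cache.
    llm = Llama(
        model_path="qwen3-32b-q4_k_m.gguf",  # placeholder filename
        n_ctx=200_000,       # ask for the full context window
        n_gpu_layers=20,     # tune to whatever your VRAM actually holds
    )

    long_text = open("big_document.txt").read()  # hypothetical long input
    out = llm("Summarize this document:\n\n" + long_text, max_tokens=256)
    print(out["choices"][0]["text"])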



