Hacker News

There's a huge disconnect between what the benchmarks show and the day-to-day experience of those of us actually using LLMs. According to SWE-bench, I should be able to outsource a lot of tasks to LLMs by now. But practically speaking, I can't get them to reliably do even the most basic tasks. Benchmaxxing is a real phenomenon. Internal private assessments are the most accurate source of information we have, and those seem to be quite mixed for the most recent models.


How ironic that these LLMs appear to be overfitting to the benchmark scores. Presumably these researchers deal with overfitting every day, yet they can't recognize it right in front of them.
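The analogy is the classic train/test gap. A minimal sketch (purely illustrative, not anyone's actual eval code): a model that just memorizes its training set scores perfectly on that set while doing worse on held-out data, which is the same failure mode as tuning a model against a public benchmark. The `target` function and all names here are hypothetical.

```python
import random

random.seed(0)

def target(x):
    # Hypothetical "true" task: predict x squared.
    return x * x

# Training set plays the role of the public benchmark; test set is held out.
train = [(x, target(x)) for x in (random.uniform(-1, 1) for _ in range(20))]
test = [(x, target(x)) for x in (random.uniform(-1, 1) for _ in range(20))]

def memorizer(x, table):
    # 1-nearest-neighbor lookup: return the label of the closest known x.
    # This "model" has effectively memorized its training data.
    nearest = min(table, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def mse(data, table):
    # Mean squared error of the memorizer over a dataset.
    return sum((memorizer(x, table) - y) ** 2 for x, y in data) / len(data)

print(mse(train, train))  # 0.0 -- perfect score when "benchmark" = training set
print(mse(test, train))   # nonzero error on held-out data
```

A benchmark only measures generalization while models aren't being optimized against it; once they are, the score stops tracking real-world capability.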


I'm sure they all know it's happening. But the incentives are all misaligned: they get promotions and raises for pushing the frontier, which means showing SOTA performance on benchmarks.



