
If all your problems with attention are actually just problems with softmax, then that's an easy fix. Delete softmax lmao.

No but seriously, just fix the fucking softmax. Add a dedicated "parking spot" like GPT-OSS does and eat the gradient flow tax on that, or replace softmax with any of the almost-softmax-but-not-really candidates. Plenty of options there.
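Rough sketch of the "parking spot" idea in plain numpy, since it's easier to see in code. The function shape and the fixed sink_logit are my own assumptions for illustration - as I understand it, GPT-OSS's actual sink is a learned per-head parameter baked into the attention kernel, not a fixed scalar:

    import numpy as np

    def softmax(x):
        x = x - x.max()                              # numerical stability
        e = np.exp(x)
        return e / e.sum()

    def attention_with_sink(q, K, V, sink_logit=0.0):
        # Scaled dot-product attention for one query, with one extra
        # "parking spot" logit that competes in the softmax but maps to
        # a zero value vector, so attention can effectively go nowhere.
        d = q.shape[-1]
        logits = K @ q / np.sqrt(d)                  # (T,)
        logits = np.append(logits, sink_logit)       # add the parking spot
        weights = softmax(logits)                    # sums to 1, sink included
        out = weights[:-1] @ V                       # the sink contributes nothing
        return out, weights

    # Toy usage: when no key matches the query, most mass parks on the sink.
    rng = np.random.default_rng(0)
    q = rng.normal(size=8)
    K = rng.normal(size=(4, 8)) * 0.01               # nothing worth attending to
    V = rng.normal(size=(4, 8))
    out, w = attention_with_sink(q, K, V, sink_logit=2.0)
    print(w)                                         # last entry (the sink) soaks up most of the attention

The weights still sum to 1, but one slot points at nothing, which is the whole trick.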

The reason we're "benchmaxxing" is that benchmarks are the metrics we have, and the only way to sift through this gajillion of "revolutionary new architecture ideas" and find the ones that show any promise at all. Of which there are very few, and fewer still that are worth their gains once you account for the fact that compute is not unlimited. Especially not when it comes to frontier training runs.

Memorization vs generalization is a well known idiot trap, and we are all stupid dumb fucks in the face of applied ML. Still, some benchmarks are harder to game than others (guess how we found that out), and there's power in that.





The reason we're benchmaxxing is that there's now a huge monetary incentive to have the best-performing model on these synthetic benchmarks - that status is worth a lot of money.

Literally every release of a something-point-X model from every major player includes some benchmark graphs to show off.


Benchmaxxing has also been identified as one of the causes of hallucination.

Hallucination is just built in - what am I missing?

That LLMs have some basic metaknowledge and metacognitive skills that they can use to reduce the hallucination rate.

Which is what humans do too - it's not magic. Humans just get more metacognitive juice for free. Resulting in a hallucination rate significantly lower than that of LLMs, but significantly higher than zero.

Now, having the skills you need to avoid hallucinations is good, even if they're weak and basic skills. But is an LLM willing to actually put them to use?

OpenAI cooked o3 with reckless RL using hallucination-unaware reward calculation - which punished reluctance to answer and rewarded overconfident guesses. And their benchmark suite didn't catch it, because the benchmarks were hallucination-unaware too.
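Toy illustration of that incentive (reward values and numbers made up, not OpenAI's actual grader): under plain binary grading, a model that always guesses beats one that abstains when unsure, even if most of its guesses are wrong; penalize wrong answers and the ordering flips.

    import random

    # Hypothetical grading schemes; the specific reward values are made up.
    def binary_reward(answer, correct):
        # hallucination-unaware: a wrong guess scores the same as abstaining
        return 1.0 if answer == correct else 0.0

    def calibrated_reward(answer, correct):
        # hallucination-aware: abstaining beats a confident wrong answer
        if answer == "IDK":
            return 0.0
        return 1.0 if answer == correct else -1.0

    # A model that only guesses right 30% of the time:
    random.seed(0)
    trials = [("A" if random.random() < 0.3 else "B", "A") for _ in range(1000)]

    guess_bin   = sum(binary_reward(a, c) for a, c in trials) / len(trials)
    abstain_bin = sum(binary_reward("IDK", c) for _, c in trials) / len(trials)
    print(guess_bin, abstain_bin)    # guessing wins under binary grading

    guess_cal   = sum(calibrated_reward(a, c) for a, c in trials) / len(trials)
    abstain_cal = sum(calibrated_reward("IDK", c) for _, c in trials) / len(trials)
    print(guess_cal, abstain_cal)    # abstaining wins once wrong answers cost something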


> Add a dedicated "parking spot" like GPT-OSS does and eat the gradient flow tax on that

Not familiar with this topic, but intrigued - anywhere I can read more about it?


Looked for it briefly; I think the best I found is this older discussion:

https://news.ycombinator.com/item?id=44834918


OpenAI have talked about it. The architecture needs to let the model handle the case where there's nothing worth attending to: softmax forces the attention weights to sum to 1 across the tokens, so some attention gets allocated even when nothing in the sequence deserves it.
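You can see the constraint in a couple of lines of numpy. The second variant is the "off-by-one" softmax (an implicit extra logit of 0 in the denominator) that usually gets floated as the fix - that's the community's candidate, not something OpenAI has said they ship:

    import numpy as np

    def softmax(logits):
        e = np.exp(logits - logits.max())
        return e / e.sum()

    def softmax_plus_one(logits):
        # Same thing, but with an implicit extra logit of 0 in the
        # denominator, so the weights no longer have to sum to 1.
        e = np.exp(logits - logits.max())
        return e / (e.sum() + np.exp(-logits.max()))

    # All keys are equally irrelevant: plain softmax still hands out the
    # full attention budget; the +1 variant can decline to attend.
    logits = np.array([-6.0, -6.0, -6.0, -6.0])
    print(softmax(logits).sum())            # 1.0 - must attend to something
    print(softmax_plus_one(logits).sum())   # ~0.01 - attention can go nowhere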


