I'm skeptical that we'll see a big breakthrough in the architecture itself. As sick as we all are of transformers, they are really good universal approximators. You can get some marginal gains, but how much more _universal_ are you realistically going to get? I could be wrong, and I'm glad there are researchers out there looking at alternatives like graphical models, but for my money we need to look further afield. Reconsider the auto-regressive task, the cross-entropy loss, even gradient descent optimization itself.
The softmax has issues with attention sinks [1], and it also causes sharpness problems [2]. More generally, a decision boundary built from Euclidean dot products isn't actually optimal for everything; there are whole classes of problems where you want polyhedral cones [3]. Positional embeddings are also janky af and so is RoPE tbh, I think Cannon layers are a more promising alternative for horizontal alignment [4].
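To make the sink issue concrete, here's a toy numpy illustration (my own numbers, not from [1]) of why softmax can never "attend to nothing":

```python
# Minimal illustration of why softmax forces attention to go *somewhere*:
# even when no key is relevant, the weights still sum to 1.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Query-key logits where nothing is relevant (all strongly negative).
logits = np.array([-9.0, -9.5, -10.0, -9.2])
w = softmax(logits)
print(w, w.sum())  # weights are still substantial and sum to 1.0
# The head is forced to attend to *something*, so in practice models learn
# to dump this mass onto a "sink" token (often the first token).
```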
I still think there is plenty of room to improve these things. But a lot of focus right now is unfortunately being spent on benchmaxxing with flawed benchmarks that can be hacked through memorization. A really promising and underappreciated direction is constructing synthetic tasks that mathematically shouldn't work well with current architectures, and then proving that the architectures do in fact struggle. A great example of this is the ViTs-need-glasses paper [5], or belief state transformers with their star task [6]. The Google one on the limits of embedding dimensions is also great, showing that the dimension of the QK part actually matters for getting good retrieval [7].
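On the QK-dimension point, a quick toy sketch (mine, not from [7]) of why low dimensions hurt retrieval: random keys interfere with each other more when d is small, so picking out exactly one key among many gets noisier.

```python
# Illustrative only: with random unit-norm keys, worst-case "crosstalk"
# (max dot product between distinct keys) shrinks roughly like 1/sqrt(d),
# so small QK dimensions make exact retrieval increasingly noisy.
import numpy as np

rng = np.random.default_rng(0)
n_keys = 512

for d in (16, 64, 256, 1024):
    keys = rng.standard_normal((n_keys, d))
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)
    gram = keys @ keys.T
    np.fill_diagonal(gram, 0.0)
    print(f"d={d:4d}  max off-diagonal similarity = {gram.max():.3f}")
# Smaller d -> more interference between unrelated keys -> retrieval errors.
```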
If all your problems with attention are actually just problems with softmax, then that's an easy fix. Delete softmax lmao.
No but seriously, just fix the fucking softmax. Add a dedicated "parking spot" like GPT-OSS does and eat the gradient flow tax on that, or replace softmax with any of the almost-softmax-but-not-really candidates. Plenty of options there.
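For the "parking spot" flavor of fix, a rough PyTorch sketch of the general idea: append one learnable sink logit per head before the softmax and drop it afterwards, so each row of attention weights can sum to less than 1. This is just the generic attention-sink trick; the names and shapes are mine, not GPT-OSS's actual implementation.

```python
import torch
import torch.nn.functional as F

def attention_with_sink(q, k, v, sink_logit):
    # q: (batch, heads, q_len, d); k, v: (batch, heads, kv_len, d)
    # sink_logit: (heads,) learnable parameter
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5        # (b, h, q, kv)
    sink = sink_logit.view(1, -1, 1, 1).expand(*scores.shape[:-1], 1)
    weights = F.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    weights = weights[..., :-1]   # drop the sink column: rows may sum to < 1
    return weights @ v            # a head can now effectively attend to nothing
```

The sink column soaks up probability mass when nothing is relevant but contributes no value vector, which I'd guess is the "gradient flow tax" being referred to.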
The reason we're "benchmaxxing" is that benchmarks are the metrics we have, and the only way we can sift through this gajillion of "revolutionary new architecture ideas" and get at the ones that show any promise at all. Of which there are very few, and fewer still that are worth their gains once you account for the fact that compute is not unlimited. Especially not when it comes to frontier training runs.
Memorization vs generalization is a well-known idiot trap, and we are all stupid dumb fucks in the face of applied ML. Still, some benchmarks are harder to game than others (guess how we found that out), and there's power in that.
reason we're benchmaxxing is that there's a huge monetary incentive now to have the best-performing model on these synthetic benchmarks, and that status is worth a lot of money
literally every new point-X release from every major player includes benchmark graphs to show off
That LLMs have some basic metaknowledge and metacognitive skills that they can use to reduce the hallucination rate.
Which is what humans do too - it's not magic. Humans just get more metacognitive juice for free, resulting in a hallucination rate significantly lower than that of LLMs, but still significantly higher than zero.
Now, having the skills you need to avoid hallucinations is good, even if they're weak and basic skills. But is an LLM willing to actually put them to use?
OpenAI cooked o3 with reckless RL using a hallucination-unaware reward calculation - one that punished reluctance to answer and rewarded overconfident guesses. And their benchmark suite didn't catch it, because the benchmarks were hallucination-unaware too.
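To see why that reward shape pushes toward guessing, here's a toy expected-value calculation (illustrative numbers, not OpenAI's actual reward):

```python
# Under accuracy-only scoring (right=1, wrong=0, abstain=0), guessing always
# weakly dominates abstaining. Add a penalty for wrong answers and abstaining
# becomes optimal below a confidence threshold.
def expected_reward(p_correct, wrong_penalty):
    guess = p_correct * 1.0 + (1 - p_correct) * (-wrong_penalty)
    abstain = 0.0
    return guess, abstain

for p in (0.9, 0.5, 0.2):
    for penalty in (0.0, 1.0):   # 0.0 = hallucination-unaware scoring
        g, a = expected_reward(p, penalty)
        best = "guess" if g > a else "abstain"
        print(f"p={p:.1f} penalty={penalty:.1f} -> guess EV {g:+.2f}, best: {best}")
```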
OpenAI have talked about it. The neural architecture needs to let the model handle the case where there's nothing worth attending to; softmax forces attention mass to be spread over the tokens even when none of them are worth it.
I agree, gradient descent implicitly assumes things have a meaningful gradient, which they don't always. And even if we say anything can be approximated by a continuous function, we're learning that we don't like approximations in our AI. Some discrete alternative to SGD would be nice.
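For flavor, here's one minimal example of the kind of discrete, gradient-free alternative being wished for: a (1+1)-style random bit-flip search over binary weights on a toy task. Purely a sketch, not a claim that this scales.

```python
# Discrete "training" with no gradients: propose a single bit flip, keep it
# only if the loss doesn't get worse.
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic task: recover a hidden binary weight vector from noisy scores.
d, n = 32, 256
w_true = rng.integers(0, 2, d)
X = rng.standard_normal((n, d))
y = X @ w_true + 0.1 * rng.standard_normal(n)

def loss(w):
    return np.mean((X @ w - y) ** 2)

w = rng.integers(0, 2, d)           # random discrete starting point
best = loss(w)
for step in range(5000):
    candidate = w.copy()
    candidate[rng.integers(d)] ^= 1  # flip one bit: a discrete step, no gradient
    c = loss(candidate)
    if c <= best:
        w, best = candidate, c

print("final loss:", best, "bits recovered:", int((w == w_true).sum()), "/", d)
```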
I think something with more uniform training and inference setups could replace transformers, provided it's otherwise equally hardware-friendly, just as easy to train, and equally expressive.