I'm skeptical that we'll see a big breakthrough in the architecture itself. As sick as we all are of transformers, they are really good universal approximators. You can get some marginal gains, but how much more _universal_ are you realistically going to get? I could be wrong, and I'm glad there are researchers out there looking at alternatives like graphical models, but for my money we need to look further afield. Reconsider the auto-regressive task, the cross-entropy loss, even gradient descent optimization itself.
The softmax has issues with attention sinks [1], and it also causes sharpness problems [2]. More generally, a decision boundary built from Euclidean dot products isn't actually optimal for everything; there are whole classes of problems where you want polyhedral cones [3]. Positional embeddings are also janky af and so is RoPE tbh, I think Cannon layers are a more promising alternative for horizontal alignment [4].
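To make the sink issue concrete, here's a toy numpy illustration (my own numbers, not from [1]) of why softmax can never "attend to nothing":

```python
# Minimal illustration of why softmax forces attention to go *somewhere*:
# even when no key is relevant, the weights still sum to 1.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Query-key logits where nothing is relevant (all strongly negative).
logits = np.array([-9.0, -9.5, -10.0, -9.2])
w = softmax(logits)
print(w, w.sum())  # weights are still substantial and sum to 1.0
# The head is forced to attend to *something*, so in practice models learn
# to dump this mass onto a "sink" token (often the first token).
```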
I still think there is plenty of room to improve these things. But a lot of focus right now is unfortunately being spent on benchmaxxing with flawed benchmarks that can be hacked through memorization. A really promising and underappreciated direction is constructing synthetic tasks that mathematically shouldn't work well with current architectures, and then proving that the architectures do in fact struggle. A great example of this is the ViTs-need-glasses paper [5], or belief state transformers with their star task [6]. The Google one on the limits of embedding dimensions is also great, showing that the dimension of the QK part actually matters for getting good retrieval [7].
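On the QK-dimension point, a quick toy sketch (mine, not from [7]) of why low dimensions hurt retrieval: random keys interfere with each other more when d is small, so picking out exactly one key among many gets noisier.

```python
# Illustrative only: with random unit-norm keys, worst-case "crosstalk"
# (max dot product between distinct keys) shrinks roughly like 1/sqrt(d),
# so small QK dimensions make exact retrieval increasingly noisy.
import numpy as np

rng = np.random.default_rng(0)
n_keys = 512

for d in (16, 64, 256, 1024):
    keys = rng.standard_normal((n_keys, d))
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)
    gram = keys @ keys.T
    np.fill_diagonal(gram, 0.0)
    print(f"d={d:4d}  max off-diagonal similarity = {gram.max():.3f}")
# Smaller d -> more interference between unrelated keys -> retrieval errors.
```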
If all your problems with attention are actually just problems with softmax, then that's an easy fix. Delete softmax lmao.
No but seriously, just fix the fucking softmax. Add a dedicated "parking spot" like GPT-OSS does and eat the gradient flow tax on that, or replace softmax with any of the almost-softmax-but-not-really candidates. Plenty of options there.
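For the "parking spot" flavor of fix, a rough PyTorch sketch of the general idea: append one learnable sink logit per head before the softmax and drop it afterwards, so each row of attention weights can sum to less than 1. This is just the generic attention-sink trick; the names and shapes are mine, not GPT-OSS's actual implementation.

```python
import torch
import torch.nn.functional as F

def attention_with_sink(q, k, v, sink_logit):
    # q: (batch, heads, q_len, d); k, v: (batch, heads, kv_len, d)
    # sink_logit: (heads,) learnable parameter
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5        # (b, h, q, kv)
    sink = sink_logit.view(1, -1, 1, 1).expand(*scores.shape[:-1], 1)
    weights = F.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    weights = weights[..., :-1]   # drop the sink column: rows may sum to < 1
    return weights @ v            # a head can now effectively attend to nothing
```

The sink column soaks up probability mass when nothing is relevant but contributes no value vector, which I'd guess is the "gradient flow tax" being referred to.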
The reason we're "benchmaxxing" is that benchmarks are the metrics we have, and the only way we can sift through this gajillion of "revolutionary new architecture ideas" and get at the ones that show any promise at all. Of which there are very few, and fewer still that are worth their gains once you account for the fact that compute is not unlimited. Especially not when it comes to frontier training runs.
Memorization vs generalization is a well-known idiot trap, and we are all stupid dumb fucks in the face of applied ML. Still, some benchmarks are harder to game than others (guess how we found that out), and there's power in that.
reason we're benchmaxxing is that there's a huge monetary incentive now to have the best-performing model on these synthetic benchmarks, and that status is worth a lot of money
literally every new point-X release from every major player includes benchmark graphs to show off
That LLMs have some basic metaknowledge and metacognitive skills that they can use to reduce the hallucination rate.
Which is what humans do too - it's not magic. Humans just get more metacognitive juice for free, resulting in a hallucination rate significantly lower than that of LLMs, but still significantly higher than zero.
Now, having the skills you need to avoid hallucinations is good, even if they're weak and basic skills. But is an LLM willing to actually put them to use?
OpenAI cooked o3 with reckless RL using a hallucination-unaware reward calculation - one that punished reluctance to answer and rewarded overconfident guesses. And their benchmark suite didn't catch it, because the benchmarks were hallucination-unaware too.
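To see why that reward shape pushes toward guessing, here's a toy expected-value calculation (illustrative numbers, not OpenAI's actual reward):

```python
# Under accuracy-only scoring (right=1, wrong=0, abstain=0), guessing always
# weakly dominates abstaining. Add a penalty for wrong answers and abstaining
# becomes optimal below a confidence threshold.
def expected_reward(p_correct, wrong_penalty):
    guess = p_correct * 1.0 + (1 - p_correct) * (-wrong_penalty)
    abstain = 0.0
    return guess, abstain

for p in (0.9, 0.5, 0.2):
    for penalty in (0.0, 1.0):   # 0.0 = hallucination-unaware scoring
        g, a = expected_reward(p, penalty)
        best = "guess" if g > a else "abstain"
        print(f"p={p:.1f} penalty={penalty:.1f} -> guess EV {g:+.2f}, best: {best}")
```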
OpenAI have talked about it. The neural architecture needs to let the model handle the case where there's nothing worth attending to; softmax forces attention mass to be spread over the tokens even when none of them are worth it.
I agree, gradient descent implicitly assumes things have a meaningful gradient, which they don't always. And even if we say anything can be approximated by a continuous function, we're learning that we don't like approximations in our AI. Some discrete alternative to SGD would be nice.
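For flavor, here's one minimal example of the kind of discrete, gradient-free alternative being wished for: a (1+1)-style random bit-flip search over binary weights on a toy task. Purely a sketch, not a claim that this scales.

```python
# Discrete "training" with no gradients: propose a single bit flip, keep it
# only if the loss doesn't get worse.
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic task: recover a hidden binary weight vector from noisy scores.
d, n = 32, 256
w_true = rng.integers(0, 2, d)
X = rng.standard_normal((n, d))
y = X @ w_true + 0.1 * rng.standard_normal(n)

def loss(w):
    return np.mean((X @ w - y) ** 2)

w = rng.integers(0, 2, d)           # random discrete starting point
best = loss(w)
for step in range(5000):
    candidate = w.copy()
    candidate[rng.integers(d)] ^= 1  # flip one bit: a discrete step, no gradient
    c = loss(candidate)
    if c <= best:
        w, best = candidate, c

print("final loss:", best, "bits recovered:", int((w == w_true).sum()), "/", d)
```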
I think something with more uniform training and inference setups could replace transformers, provided it's otherwise equally hardware-friendly, just as easy to train, and equally expressive.