I really, really hate the tone of this article. What I could have understood from one sentence took me five or more paragraphs of reading.
On top of that, the sneering tone comes across as unprofessional and disrespectful, in my opinion.
I'm also pretty sure he's wrong, or at least that he would have to change LayerNorm to make this work. Attention simply computes a weighted average of the value vectors. His change breaks that, and I think it will push the output closer to 0 as you stack layers (especially considering LayerNorm). He really should run some small experiments to validate his idea first!
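To make the weighted-average point concrete, here is a rough sketch. It assumes the change under discussion adds an extra 1 to the softmax denominator; that is my reading, not something spelled out above, and the function name `softmax_plus_one` plus the toy scores and values are made up for illustration. It only shows what happens in a single attention step. Whether the effect compounds across stacked layers with LayerNorm is the open question, and this does not settle it.

```python
import math

def softmax(scores):
    # Standard softmax: weights are positive and sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_plus_one(scores):
    # Hypothetical variant (assumption): exp(s_i) / (1 + sum_j exp(s_j)).
    # Shifting by m keeps it numerically stable; the "1" becomes exp(-m).
    m = max(max(scores), 0.0)
    exps = [math.exp(s - m) for s in scores]
    total = math.exp(-m) + sum(exps)
    return [e / total for e in exps]

def attend(weights, values):
    # Weighted sum of value vectors, one output entry per dimension.
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Toy inputs: all scores are low, i.e. no key matches the query well.
scores = [-2.0, -1.0, -3.0]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

w_std = softmax(scores)
w_alt = softmax_plus_one(scores)
out_std = attend(w_std, values)
out_alt = attend(w_alt, values)

print("weight sums:", sum(w_std), sum(w_alt))
print("outputs:", out_std, out_alt)
```

With the standard softmax the weights sum to 1, so the output is a true weighted average of the value vectors. With the extra 1 in the denominator the weights sum to less than 1, so the output is the same average scaled toward 0. A small experiment on real stacked layers would still be needed to see what LayerNorm does with that.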