they acknowledged even in the original blog post that any well-funded group of N...

Cybiote · on Feb 19, 2019

Can we even say for certain that it's an improvement over, say, something like TransformerXL? As far as I could see, the changes over GPT were a couple of extra and tweaked layer normalizations, a small change to initialization, and a change to text pre-processing. Other than for pre-processing, I didn't catch any theoretical motivation for these choices, nor any ablation studies. The only thing that can be said for certain is that it used lots of data and a very large number of parameters, was trained on powerful hardware, and achieved unmatched results in natural language generation.