Can’t provide a reference, but I can confirm that this is common knowledge. It’s why e.g. GPT-3 outperforms GPT-2.
Though as stable diffusion shows, network architecture still matters a lot!
Note that the article points out you’ll get more overfitting as your number of parameters approaches the size of the training set, which is what I suspect you’ve seen. The trend does reverse later on, but only once the parameter count is orders of magnitude beyond that point, and I don’t know if that regime is ever reached outside of ML. It’s a lot of parameters.