
Well, HF (Hessian-free) training was a pretty big deal IMO. It definitely saved my bacon when training some recurrent nets: much easier to get working and to recover from bad optimization, though pretty slow.

The SGD we use today actually has some strong ties back to that second-order optimization work - see the paper by Ilya Sutskever et al. relating a special form of momentum (Nesterov's) back to second-order methods like HF: http://www.cs.utoronto.ca/~ilya/pubs/2013/1051_2.pdf His dissertation covers this at some length as well.
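
For anyone curious, here's a rough NumPy sketch of the distinction that paper draws between classical and Nesterov momentum - the toy quadratic loss, learning rate, and momentum coefficient are just placeholder values I picked for illustration, not anything from the paper:

    import numpy as np

    def grad(theta):
        # gradient of a toy quadratic loss 0.5 * theta^T A theta
        A = np.diag([1.0, 100.0])  # badly conditioned on purpose
        return A @ theta

    lr, mu, steps = 0.001, 0.9, 500
    theta_cm = np.array([1.0, 1.0]); v_cm = np.zeros(2)    # classical momentum
    theta_nag = np.array([1.0, 1.0]); v_nag = np.zeros(2)  # Nesterov momentum

    for _ in range(steps):
        # classical: take the gradient at the current point
        v_cm = mu * v_cm - lr * grad(theta_cm)
        theta_cm += v_cm
        # Nesterov: take the gradient at the "looked-ahead" point theta + mu*v
        v_nag = mu * v_nag - lr * grad(theta_nag + mu * v_nag)
        theta_nag += v_nag

    print(theta_cm, theta_nag)

The only difference is where the gradient is evaluated, which is what lets the Nesterov form be analyzed in the same frame as second-order methods.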

Using Adagrad, Adadelta, etc. isn't really SGD as it was back in 2012, and this year's entrant "GoogLeNet" basically halved the error again using these and other tricks (we think) - which is even more impressive considering that going from 11% down to 6.7% is a much bigger jump in difficulty than the earlier drops. Just my 2 cents.
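
As a rough illustration of what Adagrad adds over plain SGD (per-parameter learning rates scaled by accumulated squared gradients), something like this sketch - the learning rate, epsilon, and the made-up gradient vector are arbitrary values of mine, not from any of these systems:

    import numpy as np

    def adagrad_step(theta, g, cache, lr=0.01, eps=1e-8):
        # accumulate squared gradients per parameter
        cache += g * g
        # parameters with large accumulated gradients get smaller effective steps
        theta -= lr * g / (np.sqrt(cache) + eps)
        return theta, cache

    theta = np.zeros(3)
    cache = np.zeros(3)
    g = np.array([0.5, -0.1, 2.0])  # a made-up gradient
    theta, cache = adagrad_step(theta, g, cache)
    print(theta)

Adadelta follows the same idea but replaces the ever-growing sum with a decaying average, so the effective learning rate doesn't shrink to zero over long runs.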

However, there is a good reason the colloquial name for these things is "AlexNets"... that work was truly incredible, and I don't think it has stopped behind the doors of Google either.


