
Just to add a couple others:

rmsprop is a great technique that I don't hear talked about as much; example implementation here: https://github.com/BRML/climin/blob/master/climin/rmsprop.py
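For anyone who hasn't seen it, the core update is tiny. A minimal NumPy sketch (the hyperparameter values here are just common defaults, not taken from the climin code):

    import numpy as np

    def rmsprop_update(params, grads, cache, lr=0.001, decay=0.9, eps=1e-8):
        # Keep an exponential moving average of the squared gradients.
        cache = decay * cache + (1 - decay) * grads ** 2
        # Scale each gradient component by the running RMS before stepping.
        params = params - lr * grads / (np.sqrt(cache) + eps)
        return params, cache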

Using Nesterov momentum and a "sparse" weight initialization scheme rather than uniform: https://www.cs.toronto.edu/~hinton/absps/momentum.pdf
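Roughly, the two ideas look like this as a sketch (not the paper's exact constants; `n_connections`, `scale`, and `grad_fn` are placeholders I picked for illustration):

    import numpy as np

    def sparse_init(n_in, n_out, n_connections=15, scale=0.01, rng=np.random):
        # Each output unit gets a handful of nonzero incoming weights
        # instead of a dense uniform initialization.
        W = np.zeros((n_in, n_out))
        for j in range(n_out):
            idx = rng.choice(n_in, size=min(n_connections, n_in), replace=False)
            W[idx, j] = rng.randn(len(idx)) * scale
        return W

    def nesterov_step(w, velocity, grad_fn, lr=0.01, mu=0.9):
        # Nesterov momentum: evaluate the gradient at the look-ahead
        # point w + mu * velocity, then update the velocity and weights.
        grad = grad_fn(w + mu * velocity)
        velocity = mu * velocity - lr * grad
        return w + velocity, velocity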

Reducing the learning rate exponentially and increasing the momentum rate linearly over the course of training: learning rate from .5 to .0001, momentum from .7 to .995. I've seen variations on this, like adjusting along a sigmoid curve instead.
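Something like the following sketch, using the endpoints I mentioned (the exact schedule shapes vary; this just interpolates geometrically for the learning rate and linearly for momentum):

    def schedules(epoch, n_epochs, lr_start=0.5, lr_end=0.0001,
                  mu_start=0.7, mu_end=0.995):
        # t goes from 0 at the first epoch to 1 at the last.
        t = epoch / max(n_epochs - 1, 1)
        # Geometric interpolation of the learning rate = exponential decay.
        lr = lr_start * (lr_end / lr_start) ** t
        # Linear ramp of the momentum coefficient.
        mu = mu_start + (mu_end - mu_start) * t
        return lr, mu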

Dropout may or may not help, and adjusting the dropout rate (the percentage of activations that are discarded) may or may not help either.
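In case the rate part is unclear, a sketch of (inverted) dropout applied to a layer's activations, which is one common way to implement it:

    import numpy as np

    def dropout(activations, rate=0.5, train=True, rng=np.random):
        # At test time (or with rate 0) the layer is left untouched.
        if not train or rate == 0.0:
            return activations
        # Zero out a `rate` fraction of activations and rescale the rest
        # so the expected activation stays the same.
        mask = (rng.rand(*activations.shape) >= rate) / (1.0 - rate)
        return activations * mask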

Mini-batch size can make a difference. Somewhere between 2 and 200?

You can use Bayesian optimization to intelligently search the hyperparameter space: https://github.com/JasperSnoek/spearmint

Try rmsprop though, I've heard good things.



I haven't had any luck so far with rmsprop, adagrad and adadelta. SGD + Nesterov momentum has served me best.


Great, thanks for the pointers! I've tried the momentum trick before and it has helped. I'll try rmsprop.



