Reduce the learning rate exponentially and increase the momentum linearly over the course of training, e.g. learning rate from 0.5 to 0.0001 and momentum from 0.7 to 0.995. I've seen variations on this, like adjusting along a sigmoid curve instead.
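A minimal sketch of those two schedules (the endpoint values are just the ones mentioned above; the function name and epoch count are my own):

    def lr_momentum_schedule(epoch, n_epochs, lr_start=0.5, lr_end=0.0001,
                             mom_start=0.7, mom_end=0.995):
        frac = epoch / float(n_epochs - 1)
        # exponential decay: geometric interpolation from lr_start down to lr_end
        lr = lr_start * (lr_end / lr_start) ** frac
        # linear ramp from mom_start up to mom_end
        momentum = mom_start + frac * (mom_end - mom_start)
        return lr, momentum

    # e.g. over 100 epochs:
    for epoch in range(100):
        lr, momentum = lr_momentum_schedule(epoch, 100)
        # ... run one epoch of SGD with this lr and momentum ...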
Dropout may or may not help, and tuning the dropout rate (the fraction of activations that are randomly discarded during training) may or may not help either.
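For reference, dropout itself is just a random mask over the activations; here's a minimal numpy sketch of the "inverted dropout" variant, where the surviving activations are rescaled at training time so nothing needs to change at test time (the function name and defaults are my own):

    import numpy as np

    def dropout(activations, rate=0.5, rng=np.random):
        # keep each activation with probability (1 - rate), zero out the rest,
        # and rescale so the expected value of each activation is unchanged
        mask = rng.rand(*activations.shape) >= rate
        return activations * mask / (1.0 - rate)

    h = np.random.randn(10, 100)        # some hidden-layer activations
    h_dropped = dropout(h, rate=0.5)    # apply only during training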
Mini-batch size can make a difference; somewhere between 2 and 200 seems to be the range worth searching.
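In case it's useful, a minimal sketch of the mini-batch loop where that batch size is the knob being tuned (the function name and data are my own toy placeholders):

    import numpy as np

    def iterate_minibatches(X, y, batch_size, rng=np.random):
        # shuffle once per epoch, then yield consecutive slices of size batch_size
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            yield X[batch], y[batch]

    X, y = np.random.randn(1000, 20), np.random.randint(0, 2, 1000)
    for X_batch, y_batch in iterate_minibatches(X, y, batch_size=50):
        pass  # one gradient step per batch goes here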
rmsprop is a great technique that I don't hear talked about as much; example implementation here: https://github.com/BRML/climin/blob/master/climin/rmsprop.py
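The core update is only a few lines. Here's a sketch of the textbook version of the rule, not necessarily identical to the climin implementation linked above:

    import numpy as np

    def rmsprop_update(params, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
        # keep a running average of the squared gradient per parameter, and
        # divide the step by its square root so each parameter gets its own
        # effective learning rate
        cache = decay * cache + (1.0 - decay) * grad ** 2
        params = params - lr * grad / (np.sqrt(cache) + eps)
        return params, cache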
Using Nesterov momentum and a "sparse" weight initialization scheme rather than uniform initialization: https://www.cs.toronto.edu/~hinton/absps/momentum.pdf
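From my reading of that paper, the sparse initialization gives each unit a small fixed number of nonzero incoming weights (15, if I recall correctly) drawn from a Gaussian, and Nesterov momentum evaluates the gradient at the "looked-ahead" point. A rough sketch of both, with function names and defaults of my own choosing:

    import numpy as np

    def sparse_init(n_in, n_out, n_nonzero=15, scale=1.0, rng=np.random):
        # each output unit gets n_nonzero Gaussian incoming weights; the rest are zero
        W = np.zeros((n_in, n_out))
        for j in range(n_out):
            idx = rng.choice(n_in, size=n_nonzero, replace=False)
            W[idx, j] = rng.randn(n_nonzero) * scale
        return W

    def nesterov_step(params, velocity, grad_fn, lr, momentum):
        # evaluate the gradient at the looked-ahead point, then update
        grad = grad_fn(params + momentum * velocity)
        velocity = momentum * velocity - lr * grad
        return params + velocity, velocity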
You can use Bayesian optimization to intelligently search the hyperparameter space: https://github.com/JasperSnoek/spearmint
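Spearmint is linked above; just to illustrate the pattern, here's a toy sketch using scikit-optimize's gp_minimize as a stand-in Bayesian optimizer. The objective here is a dummy function; in practice it would train a network with the proposed hyperparameters and return the validation error:

    from skopt import gp_minimize
    from skopt.space import Real, Integer

    # toy stand-in for "train with these hyperparameters, return validation error"
    def objective(params):
        lr, momentum, batch_size = params
        return (lr - 0.01) ** 2 + (momentum - 0.9) ** 2 + abs(batch_size - 64) / 1000.0

    space = [Real(1e-4, 0.5, prior="log-uniform"),  # learning rate
             Real(0.5, 0.999),                      # momentum
             Integer(2, 200)]                       # mini-batch size

    result = gp_minimize(objective, space, n_calls=30, random_state=0)
    print(result.x, result.fun)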