Simple examples of optimizer implementations in NumPy.
I based my implementations on examples from the Deep Learning Specialization (Coursera) and this tutorial.
A great overview of optimization algorithms can be found here.
Results:
From this simple example, we can conclude:
- Non-adaptive algorithms (SGD, Momentum, NAG) need a high learning rate for this task. With small learning rates, progress is slow.
- Adaptive algorithms (Adam, Adagrad, RMSProp) fail (diverge) with high learning rates.
- The best results are achieved by Adam, RMSProp, and Adagrad (depending on the learning rate); minimal update rules for each optimizer are sketched below.
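For reference, here is a minimal sketch of the per-step update rules in NumPy. This is not the exact code from this repo; the function names, signatures, and hyperparameter defaults are illustrative assumptions.

```python
import numpy as np

def sgd(w, grad, lr):
    return w - lr * grad

def momentum(w, grad, v, lr, beta=0.9):
    v = beta * v + lr * grad                        # exponentially decaying velocity
    return w - v, v

def nag(w, grad_fn, v, lr, beta=0.9):
    g = grad_fn(w - beta * v)                       # gradient at the look-ahead point
    v = beta * v + lr * g
    return w - v, v

def adagrad(w, grad, cache, lr, eps=1e-8):
    cache = cache + grad ** 2                       # per-parameter sum of squared gradients
    return w - lr * grad / (np.sqrt(cache) + eps), cache

def rmsprop(w, grad, cache, lr, beta=0.9, eps=1e-8):
    cache = beta * cache + (1 - beta) * grad ** 2   # decaying average instead of a sum
    return w - lr * grad / (np.sqrt(cache) + eps), cache

def adam(w, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad              # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2         # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                    # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy check (hypothetical setup): minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([5.0, -3.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 201):
    w, m, v = adam(w, 2 * w, m, v, t, lr=0.1)
print(w)  # should be near [0, 0]
```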