This answer will give you a brief explanation:
1. Momentum: It helps SGD accelerate along the relevant direction and dampens oscillations in the irrelevant ones. It adds a fraction of the previous update to the current one, which builds up speed in directions where the gradient keeps pointing the same way. The fraction (the momentum coefficient) lies between 0 and 1, and 0.9 is a common choice. The major disadvantage is that near the minimum the accumulated momentum is usually still high and does not slow down on its own, so the optimizer can overshoot or oscillate around the minimum.
2. Nesterov accelerated gradient (NAG): It addresses that disadvantage of momentum by starting to slow down early. NAG does the same thing as momentum but in a different order: first it makes a big jump in the direction of the accumulated previous updates, then it computes the gradient at that look-ahead point and makes a small correction. This correction often gives a significant practical speedup.
3. AdaGrad: It adapts the learning rate to each parameter individually. It performs small updates for parameters tied to frequently occurring features and large updates for parameters tied to infrequent ones. It also reduces the need to tune the learning rate by hand. Each parameter has its own effective learning rate, and it decreases monotonically because the algorithm divides by the square root of a running sum of all past squared gradients, which can only grow.
4. AdaDelta: It resolves AdaGrad's monotonically decreasing learning rate. Instead of summing all past squared gradients, AdaDelta restricts the sum to a sliding window (implemented in practice as an exponentially decaying average), so the accumulated value can decrease as well as grow. RMSprop is very similar to AdaDelta.
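5. Adam: Its algorithm is similar to AdaDelta's. Along with a per-parameter learning rate based on a decaying average of past squared gradients, it stores a decaying average of the past gradients themselves, which acts like momentum.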
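
To make the update rules above more concrete, here is a minimal pure-Python sketch of each one on a toy problem. Everything in it is an assumption chosen for illustration, not taken from any library: the toy loss f(x, y) = 0.5 * (x^2 + 10 * y^2), the function names, and the hyperparameter values. For item 4 it shows only the decaying average that AdaDelta and RMSprop share, written in its RMSprop form; full AdaDelta additionally replaces the global learning rate with a decaying average of past squared updates, which is left out to keep the sketch short.

```python
import math


def grad(theta):
    """Gradient of the toy loss f(x, y) = 0.5 * (x**2 + 10 * y**2)."""
    x, y = theta
    return [x, 10.0 * y]


def momentum(theta, steps=2000, lr=0.01, mu=0.9):
    v = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        # Keep a fraction mu of the previous step, then add the new gradient step
        v = [mu * vi - lr * gi for vi, gi in zip(v, g)]
        theta = [ti + vi for ti, vi in zip(theta, v)]
    return theta


def nesterov(theta, steps=2000, lr=0.01, mu=0.9):
    v = [0.0] * len(theta)
    for _ in range(steps):
        # Big jump first: look ahead along the previous velocity ...
        lookahead = [ti + mu * vi for ti, vi in zip(theta, v)]
        # ... then correct using the gradient measured at that point
        g = grad(lookahead)
        v = [mu * vi - lr * gi for vi, gi in zip(v, g)]
        theta = [ti + vi for ti, vi in zip(theta, v)]
    return theta


def adagrad(theta, steps=2000, lr=0.5, eps=1e-8):
    total = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        # The sum of squared gradients only grows, so the step size only shrinks
        total = [si + gi * gi for si, gi in zip(total, g)]
        theta = [ti - lr * gi / (math.sqrt(si) + eps)
                 for ti, gi, si in zip(theta, g, total)]
    return theta


def rmsprop(theta, steps=2000, lr=0.01, rho=0.9, eps=1e-8):
    avg = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        # A decaying average replaces AdaGrad's ever-growing sum
        avg = [rho * ai + (1 - rho) * gi * gi for ai, gi in zip(avg, g)]
        theta = [ti - lr * gi / (math.sqrt(ai) + eps)
                 for ti, gi, ai in zip(theta, g, avg)]
    return theta


def adam(theta, steps=2000, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = [0.0] * len(theta)  # decaying average of gradients (momentum-like)
    s = [0.0] * len(theta)  # decaying average of squared gradients
    for t in range(1, steps + 1):
        g = grad(theta)
        m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
        s = [b2 * si + (1 - b2) * gi * gi for si, gi in zip(s, g)]
        # Bias correction: both averages start at zero and are too small early on
        m_hat = [mi / (1 - b1 ** t) for mi in m]
        s_hat = [si / (1 - b2 ** t) for si in s]
        theta = [ti - lr * mh / (math.sqrt(sh) + eps)
                 for ti, mh, sh in zip(theta, m_hat, s_hat)]
    return theta


if __name__ == "__main__":
    for optimizer in (momentum, nesterov, adagrad, rmsprop, adam):
        print(optimizer.__name__, optimizer([2.0, 1.0]))
```

All five should end up close to the minimum at (0, 0) from the starting point (2, 1). Note that with a fixed learning rate the RMSprop variant tends to hover within a small distance of the minimum rather than settle on it exactly.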