Optimization Algorithms
Gradient descent is a way of minimizing an objective function J(θ) by updating the model's parameters in the opposite direction of the gradient of the objective function.
Batch Gradient Descent: Batch gradient descent computes the gradient of the cost function over the entire training set to perform a single update, which makes it very slow and intractable for very large datasets. The parameters are updated as follows:
θ = θ - η ∇ J(θ)
Batch gradient descent is guaranteed to converge to the global minimum for convex problems.
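To make this concrete, here is a minimal NumPy sketch of the full-batch update, assuming a linear-regression model with a mean-squared-error cost; the function name, data shapes, and learning rate are illustrative choices rather than anything prescribed above.

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.01, n_epochs=100):
    """Full-batch gradient descent for linear regression with an MSE cost (illustrative)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        # Gradient of J(theta) computed over the ENTIRE training set.
        grad = X.T @ (X @ theta - y) / len(y)
        theta = theta - eta * grad  # theta = theta - eta * grad J(theta)
    return theta
```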
Stochastic Gradient Descent: SGD performs a parameter update for each training example. It is much faster and can be used to learn online, but because each update is based on a single point it is very noisy and causes the objective function to oscillate. It can keep oscillating and is not guaranteed to converge; however, with a slowly decreasing learning rate it is known to converge almost surely.
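A rough sketch of the per-example update, again on a linear-regression cost; the particular learning-rate decay schedule is just one common choice, not the only one.

```python
import numpy as np

def sgd(X, y, eta0=0.1, n_epochs=10):
    """SGD: one parameter update per training example, with a decreasing learning rate."""
    theta = np.zeros(X.shape[1])
    t = 0
    for _ in range(n_epochs):
        for i in np.random.permutation(len(y)):   # shuffle the examples each epoch
            t += 1
            eta = eta0 / (1 + 0.01 * t)           # simple decay schedule (illustrative)
            grad = (X[i] @ theta - y[i]) * X[i]   # gradient on a single example
            theta = theta - eta * grad
    return theta
```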
Mini-batch gradient descent: This is a compromise between the two algorithms above and operates on a mini-batch of n training examples in every iteration. Typical mini-batch sizes are between 64 and 512. This reduces the variance of the parameter updates, leads to more stable convergence, and can make use of the vectorized implementations in state-of-the-art libraries. One of the biggest challenges of using mini-batch gradient descent is choosing the learning rate. Another problem is that the same learning rate applies to all parameters; if the features have different frequencies or are sparse, this needs to be taken care of. (Note: Other optimization algorithms such as Newton's method are infeasible to compute for high-dimensional datasets.)
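The same idea written as a sketch with mini-batches; the batch size of 64 is one of the typical values mentioned above.

```python
import numpy as np

def minibatch_gradient_descent(X, y, eta=0.05, batch_size=64, n_epochs=10):
    """Mini-batch gradient descent: each update uses `batch_size` examples."""
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(n_epochs):
        idx = np.random.permutation(m)
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            # Vectorized gradient over the mini-batch only.
            grad = Xb.T @ (Xb @ theta - yb) / len(yb)
            theta = theta - eta * grad
    return theta
```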
Momentum: The basic idea is to compute an exponentially weighted average of the gradients and then use that average, rather than the raw gradient, to update the weights. The update is as follows:
Compute the moving average of the gradient:
Vt = β Vt-1 + (1 - β) ∇ J(θ)
and update the weights with this average instead of the raw derivative:
θ = θ - α Vt
This smooths out the steps of gradient descent because it averages out the gradients, resulting in much smaller oscillations and faster convergence. There are two hyperparameters: the learning rate α and β, which controls the exponentially weighted average. In practice, β = 0.9 works very well. (Note: In much of the literature the (1 - β) term is dropped.)
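As a sketch, a single momentum step could look like this; v should be initialized to zeros with the same shape as θ, and the gradient computation itself is assumed to happen elsewhere.

```python
import numpy as np

def momentum_step(theta, grad, v, alpha=0.01, beta=0.9):
    """One momentum update: step with the moving average of the gradients, not the raw gradient."""
    v = beta * v + (1 - beta) * grad   # V_t = beta * V_{t-1} + (1 - beta) * grad
    theta = theta - alpha * v          # theta = theta - alpha * V_t
    return theta, v
```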
RMSprop: Root Mean Square propagation (RMSprop) is an unpublished method by Geoff Hinton. It proceeds as follows: on iteration t, compute the gradient on the current mini-batch and maintain a moving average of the square of the gradient. Note that the square is computed element-wise.
St = β St-1 + (1 - β) ∇ J(θ)^2
θt = θt-1 - α ∇ J(θ) / √St
Dividing by √St damps out the oscillations, which allows a larger learning rate and therefore faster learning. (In practice a small ε is added to the denominator for numerical stability.)
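A minimal sketch of one RMSprop step, with a small eps added to avoid division by zero; s starts as zeros with the same shape as θ.

```python
import numpy as np

def rmsprop_step(theta, grad, s, alpha=0.001, beta=0.9, eps=1e-8):
    """One RMSprop update: divide the step by the root of a moving average of squared gradients."""
    s = beta * s + (1 - beta) * grad**2                # element-wise square
    theta = theta - alpha * grad / (np.sqrt(s) + eps)  # damped, per-parameter step
    return theta, s
```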
Adagrad: Adagrad adapts the learning rate to the parameters, performing larger updates for infrequently updated parameters and smaller updates for frequently updated ones. For this reason it is particularly useful for sparse datasets. Adagrad modifies the learning rate η based on the past gradients of each parameter as follows:
gt,i = ∇ J(θt,i)
θt,i = θt-1,i - η gt,i / √(Gt,ii + ε)
Here Gt is a diagonal matrix whose element Gt,ii is the sum of the squares of the past gradients with respect to θi, and ε is a small smoothing term that avoids division by zero.
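A sketch of one Adagrad step; G accumulates the element-wise squares of all past gradients and starts as zeros.

```python
import numpy as np

def adagrad_step(theta, grad, G, eta=0.01, eps=1e-8):
    """One Adagrad update: frequently updated parameters get smaller effective learning rates."""
    G = G + grad**2                                   # accumulate squared past gradients
    theta = theta - eta * grad / (np.sqrt(G) + eps)   # per-parameter adaptive step
    return theta, G
```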
Adam: Adaptive Moment Estimation (Adam) computes an adaptive learning rate for each parameter. It combines ideas from RMSprop and Momentum. On iteration t, compute the gradient dw on the current mini-batch and update the two moving averages:
Vt = β1 Vt-1 + (1 - β1) dw
St = β2 St-1 + (1 - β2) dw^2
Because the moving averages are biased towards zero in the initial iterations, they are usually bias-corrected:
Vt_corrected = Vt / (1 - β1^t)
St_corrected = St / (1 - β2^t)
Finally, the update is:
θt = θt-1 - α Vt_corrected / √St_corrected
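Putting the two moving averages and the bias correction together, one Adam step might be sketched as follows; t is the 1-based iteration counter, and v and s start as zeros.

```python
import numpy as np

def adam_step(theta, grad, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update combining the Momentum and RMSprop ideas with bias correction."""
    v = beta1 * v + (1 - beta1) * grad           # first moment (Momentum term)
    s = beta2 * s + (1 - beta2) * grad**2        # second moment (RMSprop term)
    v_hat = v / (1 - beta1**t)                   # bias-corrected moving averages
    s_hat = s / (1 - beta2**t)
    theta = theta - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return theta, v, s
```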
Further reading: a totally awesome blog post, Overview of Optimization Algorithms for Gradient Descent, and the corresponding paper of the blog post.