Least Squares, Regularization - LASSO and more
Let's now move on to another widely used tool: regression. The main goal of regression analysis is to find a function f(x) that fits the data well and predicts y from x. We also want to minimize the risk associated with choosing the hypothesis function; since the true risk is difficult to measure and minimize directly, we work with the empirical risk and minimize that instead.
Remp = 1/N ∑_i Loss(y_i, f(x_i))

Least squares regression uses a loss function that is simply the squared error between the target variable y_i and the predicted value f(x_i), defined as follows.
Loss(y_i, f(x_i)) = 1/2 (y_i - f(x_i))^2

Using the above two equations and a vector representation for the inputs and targets, we want to minimize the following Remp, which we solve as follows.
Remp = 1/2 |y - Xw|^2 = 1/2 (y - Xw)'(y - Xw) = 1/2 (y'y - 2w'X'y + w'X'Xw)

Taking the first derivative with respect to w and setting it to zero,

dR/dw = 1/2 (-2X'y + 2X'Xw) = 0  =>  w = inv(X'X)X'y

The least squares formulation thus has a closed-form solution for the weight vector w and can be implemented in MATLAB with a single command that computes the Moore-Penrose pseudo-inverse.
w = pinv(x)*y;   % or, equivalently: w = x\y;
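As a quick illustration, here is a minimal sketch of the closed-form solution on synthetic data (the data, variable names, and the bias column are assumptions made purely for illustration):

% Minimal sketch: closed-form least squares on synthetic data (illustrative only)
n  = 100;
x1 = randn(n, 1);                 % a single input feature
y  = 3*x1 + 2 + 0.1*randn(n, 1);  % targets from a known linear model plus noise
x  = [x1 ones(n, 1)];             % design matrix with a bias column
w  = pinv(x)*y;                   % Moore-Penrose pseudo-inverse solution
w2 = x\y;                         % equivalent solution via backslash
yhat = x*w;                       % predictions

Both forms recover weights close to [3; 2]; the backslash operator is the more common choice in practice.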
Big Data and Least Squares:
If the dataset is very large, or arrives in the form of real-time updates, computing X'X becomes problematic. In such cases the least squares problem can be solved using steepest descent (gradient descent), or with an online algorithm that considers one data point at a time, known as stochastic gradient descent or the least mean squares (LMS) algorithm. The idea behind gradient descent is that if f(x) is defined and differentiable, then it decreases fastest in the direction of the negative gradient. In general, gradient descent can get stuck in a local minimum, and convergence is guaranteed only for certain choices of f(x); if f(x) is convex, however, a local minimum is also a global minimum, which is the case for least squares. The weight update rule for stochastic gradient descent is defined as follows. (Note that the stochastic gradient descent update leads to an update rule like the perceptron algorithm.)
w(k+1) = w(k) - η ∇E(w)

Recall that, for a single data point,

E(w) = 1/2 (y_i - w'x_i)^2
∇E(w) = -(y_i - w'x_i) x_i

Hence,

w(k+1) = w(k) + η (y_i - w(k)'x_i) x_i

The MATLAB code for the algorithm is as follows:
% Stochastic gradient descent (LMS) weight updates
while ~converged
    for i = 1:n
        w = w + eta * (y(i) - x(i,:)*w) * x(i,:)';
    end
    % ... test for convergence here ...
end

[Figure: convergence of the algorithm on the iris dataset.]
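For completeness, here is a minimal self-contained driver for the loop above, a sketch assuming synthetic data, a small fixed learning rate, and a convergence test on the change in w (all of these choices are illustrative, not prescriptive):

% Minimal driver sketch for the LMS / stochastic gradient descent loop (illustrative)
n = 150;
x = [randn(n, 1) ones(n, 1)];      % one feature plus a bias column
y = x*[2; -1] + 0.1*randn(n, 1);   % targets from a known weight vector plus noise
w = zeros(2, 1); eta = 0.01; tol = 1e-6;
converged = false;
while ~converged
    w_old = w;
    for i = 1:n
        w = w + eta * (y(i) - x(i,:)*w) * x(i,:)';
    end
    converged = norm(w - w_old) < tol;   % stop once the weights stop changing
end
disp([w pinv(x)*y])   % sanity check: SGD weights vs. the closed-form solution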
Regularization and Least Squares:
In order to avoid overfitting, a regularization term can be added to the error function to control the weight vector. Regularization also helps when the matrix X'X is singular, which happens when two or more columns of X are correlated with each other. The simplest regularizer adds the squared norm of the weight vector and is known as weight decay or parameter shrinkage. The advantage of this error function is that it is still differentiable and a closed-form solution exists for it; in the literature this is known as ridge regression. The error function is as follows:
E(w) = 1/2 |y - Xw|^2 + λ w'w/2
     = 1/2 (y - Xw)'(y - Xw) + λ w'w/2

∇E(w) = -X'y + X'Xw + λw = 0  =>  w = inv(X'X + λI)X'y

The parameter λ ≥ 0 controls the amount of shrinkage.
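A minimal MATLAB sketch of the ridge solution, assuming a design matrix x with a bias column as before (in practice the bias term is often left unpenalized, which this sketch ignores for simplicity, and the value of lambda here is arbitrary):

lambda  = 0.1;                                % shrinkage parameter (illustrative)
d       = size(x, 2);
w_ridge = (x'*x + lambda*eye(d)) \ (x'*y);    % closed-form ridge regression solution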
Another popular regularization technique is LASSO, which uses the L1 norm of the weight vector instead of the L2 norm used by ridge regression to control the model parameters. The error function is defined as follows:

E(w) = 1/2 |y - Xw|^2 + λ ∑_j |w_j|

Note that the resulting error function is not differentiable, but it can be minimized using quadratic programming or convex optimization methods. One of the main advantages of this regularizer is that it induces sparsity in the weight vector and thus aids in dimensionality reduction. Given the importance of this regularizer in ML today, perhaps we should have another detailed discussion on it later.
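As a sketch of one such convex optimization method, here is a minimal iterative soft-thresholding (ISTA) loop for the LASSO objective above; the step size comes from the Lipschitz constant of the squared-error gradient, and the value of lambda and the iteration count are illustrative assumptions:

% LASSO via iterative soft-thresholding (ISTA) -- a sketch, not a tuned solver
lambda = 0.1;
t = 1 / norm(x'*x);                 % step size: inverse of the largest eigenvalue of x'x
w_lasso = zeros(size(x, 2), 1);
for k = 1:1000
    g = x' * (x*w_lasso - y);                        % gradient of the squared-error term
    z = w_lasso - t*g;                               % gradient step
    w_lasso = sign(z) .* max(abs(z) - t*lambda, 0);  % soft-thresholding (proximal step)
end

If the Statistics and Machine Learning Toolbox is available, MATLAB's built-in lasso function offers a ready-made alternative.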