Logistic Regression

Logistic regression, despite its name, is a classification method: it outputs the posterior probability that a data point belongs to a particular class and can therefore serve directly as a discriminative classifier. The idea is simple: apply the logistic sigmoid function to a linear function of the feature vector.
p(y = 1 | x) = σ(w'x + b)
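As a quick illustration, here is a minimal MATLAB sketch of this posterior computation; the values of x, w and b below are made up purely for illustration.

% Minimal sketch: posterior probability p(y = 1 | x) = sigmoid(w'x + b)
x = [1.2; -0.7; 3.0];             % feature vector (hypothetical values)
w = [0.5; -0.2; 0.1];             % weight vector (hypothetical values)
b = -0.3;                         % bias term (hypothetical value)
p = 1 / (1 + exp(-(w'*x + b)));   % posterior probability of class 1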
The likelihood function, with targets t_i ∈ {0, 1} and y_i = σ(w'x_i), can be written as
p(t|w) = Π y_i^(t_i) * (1 - y_i)^(1 - t_i)
ln p(t|w) = Σ [t_i*ln(y_i) + (1 - t_i)*ln(1 - y_i)]
The error function is the negative of the log likelihood (the cross-entropy error):
E(w) = -ln p(t|w)
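As a small sketch, this error can be computed in MATLAB as follows; X, t and w are assumed, hypothetical variables (one training example per row of X, 0/1 targets in t).

% Minimal sketch: cross-entropy error E(w) for the whole training set
% X is numTrain x dim, t is numTrain x 1 with 0/1 targets, w is dim x 1 (all assumed)
y = 1 ./ (1 + exp(-X*w));                       % predicted probabilities y_i
E = -sum(t.*log(y) + (1 - t).*log(1 - y));      % negative log likelihood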
Now, ∇E(w) is computed as follows:
y = 1 / (1 + exp(-w'x))
Taking the derivative with respect to w on both sides,
∇y/∇w = x * exp(-w'x) / (1 + exp(-w'x))^2
∇y/∇w = x*y*(1 - y)
Substituting this into ∇E(w) = -Σ [t_i/y_i - (1 - t_i)/(1 - y_i)] * ∇y_i/∇w, the y_i(1 - y_i) factors cancel and we get
∇E(w) = Σ (y_i - t_i)*x_i
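In MATLAB this gradient has a compact vectorized form; the sketch below reuses the same hypothetical X, t and w as above.

% Minimal sketch: gradient of the cross-entropy error, vectorized
y = 1 ./ (1 + exp(-X*w));    % predictions for all training points
grad = X' * (y - t);         % grad E(w) = sum_i (y_i - t_i)*x_i, a dim x 1 vector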
The gradient of the error function has the same form as in the least squares problem, but because of the sigmoid we cannot obtain a closed form solution by setting it to zero. Instead we use the Newton-Raphson optimization method, and the resulting procedure is known as the Iterative Reweighted Least Squares (IRLS) method. Like gradient descent, Newton-Raphson provides an update rule for the weight vector, but it differs in that it also uses the second derivative (the Hessian) for the update. The advantage of Newton-Raphson is that it converges much faster towards the solution. However, as you may have guessed, computing the Hessian for a large dataset gets expensive, which is one of the shortcomings of the method. The Newton-Raphson update rule is as follows:
x_(n+1) = x_n - f'(x_n)/f''(x_n)
For our error function this translates to the following:
w_new = w_old - inv(H)∇E(w)
H = ∇∇E(w)
  = Σ sigmoid(w'x_i)*(1 - sigmoid(w'x_i))*x_i*x_i' + λI
Here x_i*x_i' is the outer product of the i-th feature vector with itself, and λI is a regularization term added to keep the Hessian well conditioned. Writing R for the diagonal matrix with R_ii = sigmoid(w'x_i)*(1 - sigmoid(w'x_i)), this is H = X'RX + λI and the update becomes
w_new = w_old - inv(X'RX + λI)∇E(w)
The MATLAB code is as follows. Here x_train is a numTrain x dim matrix with one training example per row, y holds the corresponding 0/1 targets, and dim, numTrain, lambda and iter are assumed to be defined.
sigmoid = @(z) 1./(1 + exp(-z));   % logistic sigmoid helper

w = rand(dim, 1);                  % random initialization of the weights

for it = 1:iter
    % Step 1: compute the weighting terms y_i*(1 - y_i) and the gradient
    A = zeros(numTrain, 1);        % A(i) = y_i*(1 - y_i), the diagonal of R
    ytsum = zeros(1, dim);         % running sum for grad E(w) = sum_i (y_i - t_i)*x_i
    for i = 1:numTrain
        s = sigmoid(x_train(i,:)*w);               % predicted probability for example i
        A(i) = s*(1 - s);
        ytsum = ytsum + (s - y(i))*x_train(i,:);   % y(i) is the 0/1 target
    end

    % Step 2: compute the Hessian H = X'RX + lambda*I
    H = zeros(dim, dim);
    for i = 1:dim
        for j = i:dim
            H(i,j) = (A.*x_train(:,i))'*x_train(:,j);
            H(j,i) = H(i,j);       % the Hessian is symmetric
        end
    end
    H = H + lambda*eye(dim);       % regularization keeps H well conditioned

    % Step 3: Newton-Raphson weight update, w_new = w_old - inv(H)*grad
    w = w - H\ytsum';              % backslash solves H*delta = grad without forming inv(H)
end
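To make the snippet self-contained, here is one hypothetical way to set up the assumed variables and use the learned weights; the data is random and purely illustrative.

% Hypothetical setup for the variables assumed above (illustrative only)
numTrain = 200; dim = 3;
x_train = randn(numTrain, dim);          % random feature matrix
y = double(x_train*[1; -2; 0.5] > 0);    % synthetic 0/1 labels
lambda = 0.1; iter = 20;

% ... run the IRLS loop above ...

% Predict with the learned weights
p = 1 ./ (1 + exp(-x_train*w));          % predicted probabilities
labels = p > 0.5;                        % predicted class labels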
Logistic Regression and Big Data
One of the problems with the IRLS method is that it is not scalable: for very large datasets, computing and inverting the Hessian becomes prohibitively expensive. Further, IRLS is known to have convergence issues and, depending on the initialization, it can run into trouble. In such cases we can go back to stochastic gradient descent and update the weights iteratively, one training example at a time, using the following update rule:
w_new = w_old - η∇ E(w)
      = w_old - η(y_i - t_i)x_i
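A minimal MATLAB sketch of this stochastic update is given below; it reuses the hypothetical x_train, y, dim and numTrain from before, and eta and the number of epochs are made-up values.

% Minimal sketch: stochastic gradient descent for logistic regression
eta = 0.01;                        % learning rate (hypothetical value)
w = zeros(dim, 1);
for epoch = 1:50
    for i = randperm(numTrain)     % visit the training examples in random order
        yi = 1/(1 + exp(-x_train(i,:)*w));        % prediction for example i
        w = w - eta*(yi - y(i))*x_train(i,:)';    % w_new = w_old - eta*(y_i - t_i)*x_i
    end
end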
Another problem that commonly arises with maximum likelihood solutions is overfitting. It is especially severe when the data is linearly separable: the logistic sigmoid degenerates into a Heaviside step function, with the magnitude of w growing without bound and the predicted probabilities of the two classes being pushed to the extremes. To overcome this, we can place a prior on w or, equivalently, add a regularization term. The weight update rule with the L2 regularizer then looks as follows:
w_new = w_old - η[(y_i - t_i)x_i + 2*λ*w_old]
Note that we can also use an L1 regularizer with logistic regression. The advantage of the L1 penalty is that it induces sparsity and drives the weights of unimportant features to zero. The L1 update rule is as follows:
w_new = w_old - η * ((y_i - t_i)x_i + λ*sign(w_old));
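Both regularized variants drop into the same stochastic loop sketched above; the snippet below shows the inner update for each penalty, with lambda a hypothetical value.

% Inside the SGD loop above: regularized stochastic updates (illustrative)
lambda = 0.01;                     % regularization strength (hypothetical value)
yi = 1/(1 + exp(-x_train(i,:)*w));
% L2-regularized update
w = w - eta*((yi - y(i))*x_train(i,:)' + 2*lambda*w);
% L1-regularized update (use one or the other, not both)
% w = w - eta*((yi - y(i))*x_train(i,:)' + lambda*sign(w));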
Multiclass Logistic Regression
So far we have seen the weight update for logistic regression with two classes. Let us now extend logistic regression to the problem of multi-class classification. Such a classifier is also known as multinomial logistic regression, or in NLP as a maximum entropy classifier.
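In the multi-class case the logistic sigmoid is replaced by the softmax function, so that p(y = k | x) = exp(w_k'x) / Σ_j exp(w_j'x). A minimal MATLAB sketch of this posterior is given below; W is a hypothetical dim x K matrix holding one weight vector per class.

% Minimal sketch: softmax posterior for multinomial logistic regression
% W is dim x K (one column of weights per class), x is dim x 1 (both assumed)
a = W'*x;                    % class activations a_k = w_k'x
a = a - max(a);              % subtract the max for numerical stability
p = exp(a) / sum(exp(a));    % p(k) = p(y = k | x)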
