SVM

Let's begin with SVMs, which are perhaps among the most widely used classifiers. They are based on the idea of maximizing the margin around a linear decision boundary between two classes. Consider a hyperplane H that separates the positive and the negative classes; the points on H satisfy w'x + b = 0. We want to maximize the separation between the two classes, so let H+ and H- be the two hyperplanes parallel to H that have the maximum separation between them with no points from either class in between. Points from each class do, however, lie on H+ and H- themselves:
H+ ≡ w'x + b = +1 (points of the positive class satisfy w'x + b >= 1)
H- ≡ w'x + b = -1 (points of the negative class satisfy w'x + b <= -1)
The perpendicular distance of H+ from the origin is |1 - b|/||w|| and that of H- is |-1 - b|/||w||. Hence the distance between the two hyperplanes is 2/||w||. We want to maximize this distance, which is the same as minimizing ||w||. More formally, we can set up an optimization problem that maximizes the margin directly in an elegant manner:
minimize (1/2) * ||w||^2
subject to
y_t * (w'x_t + b) >= 1 for all t = 1, 2, ..., n
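As a quick check on the geometry: if the solver returned, say, w = (3, 4)' (a purely illustrative value), then ||w|| = 5 and the margin between H+ and H- would be 2/||w|| = 0.4.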
Note that the class labels in this formulation are +1 and -1. Hence y_t * f(x_t; w) >= 1 whenever example t is classified correctly with sufficient margin, and these inequalities form the constraints that need to be satisfied. The SVM solution is unique, and it can be found via quadratic programming in MATLAB using the following code. Note that this is a convex optimization problem, so a global optimum is guaranteed, unlike neural networks and decision trees, which can get stuck in local minima.
load iris.dat
% Use only the first two classes (rows 1-100) of the iris dataset;
% columns 1-4 are the features and column 5 is the class label.
x = iris(1:100,1:4);
labels = iris(1:100,5);

d = size(x,2); %dimension
n = size(x,1); %number of points

y = zeros(n,1); %set the labels to +1 or -1
y(labels == 1) = 1;
y(labels == 2) = -1; 

H = eye(d+1);
H(end,end) = 0; % penalize only w, not the bias b
f = zeros(d+1,1);

% quadprog solves min 0.5*z'*H*z + f'*z subject to A*z <= b, with z = [w; b].
% Rewriting y_t*(w'x_t + b) >= 1 as -y_t*[x_t 1]*z <= -1 gives:
A = -1 * repmat(y,1,d+1) .* [x ones(n,1)];
b = -1 * ones(n, 1);

% ('active-set' may not be available for quadprog in some MATLAB releases;
% the default algorithm works as well.)
opts = optimset('Algorithm','active-set','Display','final');
w = quadprog(H, f, A, b, [], [], [], [], [], opts) % w(1:d) = weights, w(d+1) = bias
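We can also plot the decision boundary in MATLAB by evaluating f(x; w) = 0 at the two ends of the x-axis of the figure. A minimal sketch, assuming the quadratic program above is re-run on just the first two features (x = iris(1:100,1:2)), so that the solution w = [w1; w2; b] defines a line in 2-D:

x1 = [min(x(:,1)) max(x(:,1))]; % the two ends of the x-axis
x2 = -(w(1)*x1 + w(3)) / w(2);  % solve w1*x1 + w2*x2 + b = 0 for x2
plot(x(y==1,1), x(y==1,2), 'r+'); hold on;
plot(x(y==-1,1), x(y==-1,2), 'bo');
plot(x1, x2, 'k-');             % decision boundary f(x; w) = 0
plot(x1, x2 + 1/w(2), 'k--');   % H+ : w'x + b = +1
plot(x1, x2 - 1/w(2), 'k--');   % H- : w'x + b = -1
hold off;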
The SVM is guaranteed to find such a boundary whenever the two classes are linearly separable. Let us now flip the labels of a few points so that the data is no longer separable. If we try to run the above code on this new data, quadprog returns an error that it is unable to find a feasible solution, since the margin constraints can no longer all be satisfied. For such cases, when the points are not linearly separable, we can introduce slack into the optimization problem and allow some misclassification. This is achieved by modifying the optimization problem as follows:
minimize (1/2) * ||w||^2 + C * sum_t E_t
subject to
y_t * (w'x_t + b) >= 1 - E_t and E_t >= 0 for all t = 1, 2, ..., n
The parameter C controls how many violations are allowed. For a small value of C, many margin constraints can be violated; as the penalty C increases, the slacks E_t are driven to 0 and the problem reduces to the original hard-margin one. The MATLAB code for the slack SVM is as follows:
H = eye(d+1+n); % variables are z = [w; b; E] with n slack variables E
H(d+1:end,d+1:end) = 0; % penalize only w, not b or the slacks
C = 0.001;
f = C*ones(d+1+n,1); % linear cost C on each slack E_t ...
f(1:d+1) = 0;        % ... and none on w and b

A = repmat(y,1,d+1) .* [x ones(n,1)];
A = -1*[A eye(n)]; % y_t*(w'x_t + b) + E_t >= 1, rewritten as A*z <= b
b = -1 * ones(n, 1);

lb = [-inf(d+1,1); zeros(n,1)]; % lower bound so the slacks stay non-negative
opts = optimset('Algorithm','active-set','Display','final');
w = quadprog(H, f, A, b, [], [], lb, [], [], opts)
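The solution vector returned by quadprog stacks the weights, the bias, and the n slacks, so it has d + 1 + n entries. A small snippet to unpack it (the variable names are illustrative):

w_opt  = w(1:d);      % weight vector
b_opt  = w(d+1);      % bias
slacks = w(d+2:end);  % one E_t per point; positive entries mark margin violations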
Plotting the result as before gives the final classification boundary, now with a few points allowed to violate the margin.
Multi-class SVM
SVMs are inherently two-class classifiers. To use them with N classes, one of the most popular techniques is to train N one-vs-rest classifiers, one per class, and assign a point to the class whose classifier gives the highest score; a sketch of this prediction step is shown below. However, this simplistic approach has well-known problems, such as regions of the input space ("holes") that are claimed by more than one classifier, or by none.
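A rough sketch of one-vs-rest prediction, assuming W (d-by-N) and bvec (1-by-N) hold the weights and biases of the N trained "class k vs rest" models (these variable names are illustrative):

scores = X*W + repmat(bvec, size(X,1), 1); % n-by-N decision values f_k(x)
[~, pred] = max(scores, [], 2);            % predict the class with the largest score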
Helpful Links
A Tutorial on Support Vector Machines for Pattern Recognition - Christopher Burges
