Boosting

Boosting is perhaps one of the most powerful ideas in Machine Learning. The idea is to combine several weak classifiers into a strong classifier. A weak classifier is a classifier that does only slightly better than random guessing. Boosting has proven to be a very powerful approach and is used in various real-world applications. The basic idea behind boosting is to concentrate on the examples that were incorrectly classified and to combine the individual classifiers by a weighted majority vote. One of the most popularly used boosting algorithms is known as AdaBoost.

The basic algorithm of AdaBoost is as follows. Suppose we have T weak classifiers such that the error of each of them is slightly less than 1/2, i.e., εt = 1/2 - γt for some γt > 0.
Input: Training points (x1, y1), (x2, y2), ..., (xN, yN) where xi ∈ X and yi ∈ {-1,+1}
Let wt(i) be the weight associated with the ith training point ∀i ∈ 1..N in the t-th iteration
Let αt be the weight associated with the t-th classifier ∀t ∈ 1..T
Let the output of the t-th classifier for the ith point be Gt(xi)
  1. Initialize the weight on all training points to w1(i) = 1/N
  2. for t = 1 ... T
  3. Train the t-th classifier Gt using the distribution wt (note that we need to modify the base classifier so that it can handle weighted versions of the training set, for example by minimizing the weighted loss ∑_i wt(i)·loss(Gt(xi), yi); see the short snippet after this list)
  4. Compute the weighted error of the t-th classifier, εt = Pr_{i~wt}[Gt(xi) ≠ yi] = ∑_i wt(i)·I(Gt(xi) ≠ yi)
  5. Set the weight of the classifier to αt = (1/2) ln((1 - εt)/εt)
  6. ∀ i ∈ 1..N, update the weight on the training samples using the multiplicative rule wt+1(i) = wt(i)·exp(-αt·yi·Gt(xi))/Zt, where Zt is the normalization factor chosen so that the weights wt+1 sum to 1
  7. end for
  8. The final output of the algorithm is G(x) = sign[∑_{t=1}^{T} αt·Gt(x)]
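To make the note in step 3 concrete, here is a tiny illustration of scoring a classifier by its weighted error instead of the plain error rate (a throwaway MATLAB snippet; the numbers are made up):

y    = [ 1; -1;  1;  1; -1];           % true labels
pred = [ 1;  1;  1;  1; -1];           % a candidate classifier's predictions (2nd point wrong)
w    = [0.1; 0.4; 0.2; 0.2; 0.1];      % current weight distribution wt, sums to 1
weighted_err = sum(w .* (pred ~= y))   % εt = ∑_i wt(i)·I(Gt(xi) ≠ yi) = 0.4
plain_err    = mean(pred ~= y)         % ordinary unweighted error = 0.2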

Let us try to understand this. Initially we set the weights of all the training points to be equal; this is the input to the 1st classifier. In subsequent iterations, the weight of a training point that has been misclassified is increased and the weight of a point that has been classified correctly is decreased. This follows from step 6: when yi·Gt(xi) is negative (the point is misclassified), the exponential is taken of a positive quantity and is therefore large, whereas when yi·Gt(xi) is positive (the point is classified correctly), the exponential is taken of a negative quantity and is small, so the weight shrinks (recall that αt is positive). The quantity αt is the weight of the t-th classifier in the final output and is larger when the error εt of the classifier is small: from step 5, if εt is low then αt is high. Remember that εt should be < 1/2, otherwise αt becomes negative. Note that instead of the exponential update used in AdaBoost, other updates can be tried, as in LogitBoost.
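As a quick numerical illustration of steps 5 and 6 (the value of εt is picked arbitrarily):

eps_t = 0.3;                              % weighted error of the current classifier
alpha = 0.5 * log((1 - eps_t) / eps_t)    % ≈ 0.42, positive because eps_t < 1/2
exp(-alpha)                               % ≈ 0.65, the factor applied to correctly classified points
exp( alpha)                               % ≈ 1.53, the factor applied to misclassified points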
Let us write some code in MATLAB that does boosted decision stumps. A decision stump is just a one-level decision tree.

function [] = boosted_stump()
% Placeholder: train an AdaBoost ensemble of decision stumps.
% One possible implementation is sketched below.
end
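Here is one way the function might be fleshed out. This is a minimal sketch, assuming the training data is an N×d matrix X with labels y in {-1,+1} and that each helper function is saved in its own .m file; the interface and the helper names (train_stump, stump_predict, boosted_predict) are just illustrative choices.

function model = boosted_stump(X, y, T)
% AdaBoost with decision stumps (one-level decision trees).
% X: N x d feature matrix, y: N x 1 labels in {-1,+1}, T: number of boosting rounds.
[N, ~] = size(X);
w = ones(N, 1) / N;                                    % step 1: uniform weights
for t = 1:T
    [feat, thresh, pol, err] = train_stump(X, y, w);   % step 3: weighted fit
    err = max(err, eps);                               % guard against division by zero
    alpha = 0.5 * log((1 - err) / err);                % step 5
    pred = stump_predict(X, feat, thresh, pol);
    w = w .* exp(-alpha * y .* pred);                  % step 6: multiplicative update
    w = w / sum(w);                                    % normalize (divide by Zt)
    model(t).feat = feat;  model(t).thresh = thresh;   % store the t-th stump
    model(t).pol = pol;    model(t).alpha = alpha;
end
end

function [best_f, best_th, best_pol, best_err] = train_stump(X, y, w)
% Exhaustive search for the axis-aligned stump with the lowest weighted error.
d = size(X, 2);
best_err = inf; best_f = 1; best_th = 0; best_pol = 1;
for f = 1:d
    for th = unique(X(:, f))'
        for pol = [-1, 1]
            pred = stump_predict(X, f, th, pol);
            err = sum(w .* (pred ~= y));               % weighted error, as in step 4
            if err < best_err
                best_err = err; best_f = f; best_th = th; best_pol = pol;
            end
        end
    end
end
end

function pred = stump_predict(X, feat, thresh, pol)
% Predict +/-1 by thresholding a single feature.
pred = pol * sign(X(:, feat) - thresh);
pred(pred == 0) = pol;                                 % points exactly on the threshold go to one side
end

function yhat = boosted_predict(model, X)
% Step 8: sign of the alpha-weighted vote of all stumps.
score = zeros(size(X, 1), 1);
for t = 1:numel(model)
    score = score + model(t).alpha * ...
        stump_predict(X, model(t).feat, model(t).thresh, model(t).pol);
end
yhat = sign(score);
end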

The impact on the test error as we increase the number of decision stumps is shown in the figure below:
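A curve like that can be generated along the following lines. This is only a sketch: it assumes the boosted_stump and stump_predict functions from the sketch above are on the MATLAB path, and that Xtr, ytr, Xte, yte hold a train/test split of whatever dataset you are using.

T = 200;                                   % number of boosting rounds
model = boosted_stump(Xtr, ytr, T);        % Xtr/ytr: training split (assumed to exist)
score = zeros(size(Xte, 1), 1);
test_err = zeros(T, 1);
for t = 1:T
    % Accumulate the weighted vote one stump at a time (step 8, truncated after t rounds).
    score = score + model(t).alpha * ...
        stump_predict(Xte, model(t).feat, model(t).thresh, model(t).pol);
    test_err(t) = mean(sign(score) ~= yte);
end
plot(1:T, test_err);
xlabel('number of decision stumps'); ylabel('test error');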
In order to show why boosting works, we need to show that the training error is bounded. Let us examine the total training error, i.e., how often the final output of the algorithm disagrees with the given class labels. Write f(x) = ∑_{t=1}^{T} αt·Gt(x), so that the final output is G(x) = sign(f(x)). Since the 0/1 loss is upper-bounded by the exponential loss, I(G(xi) ≠ yi) ≤ exp(-yi·f(xi)), and thus
(1/N) ∑_{i=1}^{N} I(G(xi) ≠ yi) ≤ (1/N) ∑_{i=1}^{N} exp(-yi·f(xi)) = ∏_{t=1}^{T} Zt
(the last equality follows by unrolling the weight update of step 6: wT+1(i) = exp(-yi·f(xi)) / (N·∏_t Zt), and the weights wT+1 sum to 1). So for the training error to be driven to zero, it suffices that each Zt is strictly less than 1. Writing the normalization factor out, Zt = ∑_i wt(i)·exp(-αt·yi·Gt(xi)) = (1 - εt)·exp(-αt) + εt·exp(αt), and substituting αt from step 5 gives Zt = 2·sqrt(εt(1 - εt)) = sqrt(1 - 4γt^2), which is strictly less than 1 whenever γt > 0.
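A quick numerical sanity check of that identity (εt chosen arbitrarily):

eps_t = 0.35;                                                % any weighted error below 1/2
alpha = 0.5 * log((1 - eps_t) / eps_t);                      % classifier weight from step 5
Z_direct  = (1 - eps_t) * exp(-alpha) + eps_t * exp(alpha)   % normalizer computed directly
Z_formula = 2 * sqrt(eps_t * (1 - eps_t))                    % closed form; both print ~0.9539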
Alternative view of Boosting: AdaBoost can also be viewed as forward stagewise additive modeling. At each round it greedily adds the classifier Gt and weight αt that most reduce the exponential loss ∑_i exp(-yi·f(xi)), which is exactly the quantity that appears in the bound above; variants such as LogitBoost replace the exponential loss with the logistic loss.
Useful Links
Rob Schapire's notes on boosting
