AdaBoost. Jiri Matas and Jan Šochman. Centre for Machine Perception, Czech Technical University, Prague. http://cmp.felk.cvut.cz


1 AdaBoost Jiri Matas and Jan Šochman Centre for Machine Perception Czech Technical University, Prague

2 Presentation Outline
- The AdaBoost algorithm: Why is it of interest? How does it work? Why does it work?
- AdaBoost variants
- AdaBoost with a Totally Corrective Step (TCA)
- Experiments with the Totally Corrective Step

3 Introduction
- 1990 Boost-by-majority algorithm (Freund)
- 1995 AdaBoost (Freund & Schapire)
- 1997 Generalized version of AdaBoost (Schapire & Singer)
- 2001 AdaBoost in face detection (Viola & Jones)

Interesting properties:
- AB is a linear classifier with all its desirable properties.
- AB output converges to the logarithm of the likelihood ratio.
- AB has good generalization properties.
- AB is a feature selector with a principled strategy (minimisation of an upper bound on the empirical error).
- AB is close to sequential decision making (it produces a sequence of gradually more complex classifiers).

4-6 What is AdaBoost?

AdaBoost is an algorithm for constructing a strong classifier as a linear combination

    $f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$

of simple weak classifiers $h_t(x)$.

Terminology
- $h_t(x)$ ... weak or basis classifier, hypothesis, feature
- $H(x) = \mathrm{sign}(f(x))$ ... strong or final classifier/hypothesis

Comments
- The $h_t(x)$'s can be thought of as features.
- Often (typically) the set $H = \{h(x)\}$ is infinite.

7-8 (Discrete) AdaBoost Algorithm (Schapire & Singer, 1997)

Given: $(x_1, y_1), \dots, (x_m, y_m)$; $x_i \in X$, $y_i \in \{-1, +1\}$
Initialize weights $D_1(i) = 1/m$

For $t = 1, \dots, T$:
1. Call WeakLearn, which returns the weak classifier $h_t : X \to \{-1, +1\}$ with minimum error w.r.t. distribution $D_t$
2. Choose $\alpha_t \in \mathbb{R}$
3. Update
   $D_{t+1}(i) = \dfrac{D_t(i)\,\exp(-\alpha_t y_i h_t(x_i))}{Z_t}$
   where $Z_t$ is a normalization factor chosen so that $D_{t+1}$ is a distribution

Output the strong classifier:
   $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$

Comments
- The computational complexity of selecting $h_t$ is independent of $t$.
- All information about previously selected features is captured in $D_t$!
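The loop above maps almost directly to code. Below is a minimal sketch (not from the slides) for the finite-H case, assuming numpy and that the candidate weak classifiers are supplied as a matrix of precomputed predictions $h_j(x_i) \in \{-1,+1\}$; all names are ours.

```python
import numpy as np

def adaboost(preds, y, T):
    """Discrete AdaBoost over a finite hypothesis set.

    preds[i, j] = h_j(x_i) in {-1, +1}, y[i] in {-1, +1}."""
    m, _ = preds.shape
    D = np.full(m, 1.0 / m)                        # D_1(i) = 1/m
    alphas, chosen = [], []
    for _ in range(T):
        eps = D @ (preds != y[:, None])            # weighted error of every h_j
        j = int(np.argmin(eps))                    # WeakLearn: minimum-error classifier
        if eps[j] >= 0.5:                          # prerequisite eps_t < 1/2
            break
        r = np.clip(np.sum(D * y * preds[:, j]), -1 + 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1.0 + r) / (1.0 - r))
        D = D * np.exp(-alpha * y * preds[:, j])
        D /= D.sum()                               # divide by Z_t (normalisation)
        alphas.append(alpha)
        chosen.append(j)
    return np.array(alphas), np.array(chosen)

def strong_classifier(preds, alphas, chosen):
    """H(x) = sign(sum_t alpha_t h_t(x))."""
    return np.sign(preds[:, chosen] @ alphas)
```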

9-12 WeakLearn

Loop step: call WeakLearn, given distribution $D_t$; it returns a weak classifier $h_t : X \to \{-1, +1\}$ from $H = \{h(x)\}$.

Select the weak classifier with the smallest weighted error:
   $h_t = \arg\min_{h_j \in H} \epsilon_j$, where $\epsilon_j = \sum_{i=1}^{m} D_t(i)\,[y_i \neq h_j(x_i)]$

Prerequisite: $\epsilon_t < 1/2$ (otherwise stop).

WeakLearn examples:
- Decision tree builder, perceptron learning rule ($H$ infinite)
- Selecting the best one from a given finite set $H$

Demonstration example (figure): a two-class training set, one class drawn from $N(0, 1)$, the other from a ring-shaped density concentrated around $r = 4$; weak classifier = perceptron.
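For the finite-H case, a common concrete choice of H is the set of axis-aligned decision stumps. A sketch of one WeakLearn call under that assumption (numpy assumed, helper names ours):

```python
import numpy as np

def weak_learn(X, y, D):
    """Return (eps_t, h_t) with the smallest weighted error w.r.t. D,
    where h_t is an axis-aligned decision stump (feature, threshold, polarity)."""
    best_eps, best_h = np.inf, None
    for j in range(X.shape[1]):                    # feature index
        for theta in np.unique(X[:, j]):           # candidate thresholds
            for s in (+1.0, -1.0):                 # polarity
                pred = np.where(s * (X[:, j] - theta) > 0, 1.0, -1.0)
                eps = np.sum(D[pred != y])         # sum_i D_t(i) [y_i != h(x_i)]
                if eps < best_eps:
                    best_eps, best_h = eps, (j, theta, s)
    return best_eps, best_h
```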

13-15 AdaBoost as a Minimiser of an Upper Bound on the Empirical Error

The main objective is to minimize the training error
   $\epsilon_{tr} = \frac{1}{m}\,|\{ i : H(x_i) \neq y_i \}|$

It can be upper bounded by
   $\epsilon_{tr}(H) \leq \prod_{t=1}^{T} Z_t$

How to set $\alpha_t$?
- Select $\alpha_t$ to greedily minimize $Z_t(\alpha)$ in each step.
- $Z_t(\alpha)$ is a convex differentiable function with one extremum.
- For $h_t(x) \in \{-1, +1\}$ the optimal value is
     $\alpha_t = \frac{1}{2} \log \frac{1 + r_t}{1 - r_t}$, where $r_t = \sum_{i=1}^{m} D_t(i)\,h_t(x_i)\,y_i$
- $Z_t = 2\sqrt{\epsilon_t (1 - \epsilon_t)} \leq 1$ for the optimal $\alpha_t$
- Justification of the selection of $h_t$ according to $\epsilon_t$

Comments
- The process of selecting $\alpha_t$ and $h_t(x)$ can be interpreted as a single optimization step minimising the upper bound on the empirical error. Improvement of the bound is guaranteed, provided that $\epsilon_t < 1/2$.
- The process can be interpreted as a component-wise local optimization (Gauss-Southwell iteration) in the (possibly infinite-dimensional!) space of $\bar{\alpha} = (\alpha_1, \alpha_2, \dots)$, starting from $\bar{\alpha}_0 = (0, 0, \dots)$.
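A quick numeric check of the formulas above (numpy assumed; the value eps_t = 0.3 is just an illustration):

```python
import numpy as np

eps_t = 0.3                                       # weighted error of the chosen h_t
r_t = 1.0 - 2.0 * eps_t                           # for binary h_t, r_t = 1 - 2*eps_t
alpha_t = 0.5 * np.log((1.0 + r_t) / (1.0 - r_t)) # = 0.5 * log((1 - eps_t) / eps_t)

# Z_t(alpha) evaluated at the optimum equals 2*sqrt(eps_t*(1 - eps_t)) < 1
Z_t = (1.0 - eps_t) * np.exp(-alpha_t) + eps_t * np.exp(alpha_t)
print(alpha_t)                                    # ~0.4236
print(Z_t, 2.0 * np.sqrt(eps_t * (1.0 - eps_t)))  # both ~0.9165
```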

16-21 Reweighting

Effect on the training set. Reweighting formula:

   $D_{t+1}(i) = \dfrac{D_t(i)\,\exp(-\alpha_t y_i h_t(x_i))}{Z_t} = \dfrac{\exp\!\left(-y_i \sum_{q=1}^{t} \alpha_q h_q(x_i)\right)}{m \prod_{q=1}^{t} Z_q}$

   $\exp(-\alpha_t y_i h_t(x_i)) \begin{cases} < 1, & y_i = h_t(x_i) \\ > 1, & y_i \neq h_t(x_i) \end{cases}$

- Increases (decreases) the weight of wrongly (correctly) classified examples.
- The weight is the upper bound on the error of a given example!

Effect on $h_t$: since $\alpha_t$ minimizes $Z_t$,

   $\sum_{i: h_t(x_i) = y_i} D_{t+1}(i) = \sum_{i: h_t(x_i) \neq y_i} D_{t+1}(i)$,

so the error of $h_t$ on $D_{t+1}$ is $1/2$. The next weak classifier is therefore the most independent one.
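A short check (numpy assumed, toy data of our own) that after the reweighting the chosen $h_t$ has weighted error exactly 1/2 on $D_{t+1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20
y = rng.choice([-1.0, 1.0], size=m)
h = y.copy()
h[:6] = -h[:6]                                    # h_t misclassifies 6 of the 20 examples
D = np.full(m, 1.0 / m)                           # uniform D_t

eps = np.sum(D[h != y])                           # 0.3
alpha = 0.5 * np.log((1.0 - eps) / eps)
D_next = D * np.exp(-alpha * y * h)
D_next /= D_next.sum()                            # normalise by Z_t

print(np.sum(D_next[h != y]))                     # 0.5: h_t is "used up" on D_{t+1}
```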

22-36 Summary of the Algorithm

Initialization: $D_1(i) = 1/m$

For $t = 1, \dots, T$:
- Find $h_t = \arg\min_{h_j \in H} \epsilon_j$, where $\epsilon_j = \sum_{i=1}^{m} D_t(i)\,[y_i \neq h_j(x_i)]$
- If $\epsilon_t \geq 1/2$ then stop
- Set $\alpha_t = \frac{1}{2} \log \frac{1 + r_t}{1 - r_t}$
- Update $D_{t+1}(i) = \dfrac{D_t(i)\,\exp(-\alpha_t y_i h_t(x_i))}{Z_t}$

Output the final classifier:
   $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$

(The original slides step through this loop on the demonstration example for increasing $t$.)

37 Does AdaBoost generalize?

Margins in SVM:
   $\max \min_{(x,y) \in S} \dfrac{y\,(\bar{\alpha} \cdot \bar{h}(x))}{\|\bar{\alpha}\|_2}$

Margins in AdaBoost:
   $\max \min_{(x,y) \in S} \dfrac{y\,(\bar{\alpha} \cdot \bar{h}(x))}{\|\bar{\alpha}\|_1}$

Maximizing margins in AdaBoost:
   $P_S[y f(x) \leq \theta] \leq 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{\,1-\theta} (1 - \epsilon_t)^{\,1+\theta}}$, where $f(x) = \dfrac{\bar{\alpha} \cdot \bar{h}(x)}{\|\bar{\alpha}\|_1}$

Upper bound based on the margin:
   $P_D[y f(x) \leq 0] \leq P_S[y f(x) \leq \theta] + O\!\left( \dfrac{1}{\sqrt{m}} \left( \dfrac{d \log^2(m/d)}{\theta^2} + \log(1/\delta) \right)^{1/2} \right)$
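A sketch (numpy assumed, names ours) of the quantity on the left-hand side of the margin bound: the L1-normalised margins and the empirical margin distribution $P_S[y f(x) \leq \theta]$.

```python
import numpy as np

def margin_distribution(preds, alphas, y, theta):
    """preds[i, t] = h_t(x_i) in {-1, +1}; alphas[t] are the AdaBoost coefficients."""
    f = (preds @ alphas) / np.sum(np.abs(alphas))  # f(x) = (alpha . h(x)) / ||alpha||_1, in [-1, 1]
    margins = y * f                                # normalised margin of each training example
    return np.mean(margins <= theta)               # empirical P_S[y f(x) <= theta]
```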

38 AdaBoost variants

Freund & Schapire 1995:
- Discrete AdaBoost ($h : X \to \{0, 1\}$)
- Multiclass AdaBoost.M1 ($h : X \to \{0, 1, \dots, k\}$)
- Multiclass AdaBoost.M2 ($h : X \to [0, 1]^k$)
- Real-valued AdaBoost.R ($Y = [0, 1]$, $h : X \to [0, 1]$)

Schapire & Singer 1997:
- Confidence-rated prediction ($h : X \to \mathbb{R}$, two-class)
- Multilabel AdaBoost.MR, AdaBoost.MH (different formulation of the minimized loss)

... and many other modifications since then (Totally Corrective AdaBoost, Cascaded AdaBoost).
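To make the confidence-rated variant concrete, here is a sketch of our own (numpy assumed) of a domain-partitioning weak classifier in the Schapire & Singer style: a stump that outputs a real-valued confidence $\frac{1}{2}\log(W_+/W_-)$ on each side of the threshold, so that $\alpha_t$ is folded into $h_t$.

```python
import numpy as np

def real_stump(x, y, D, theta, smooth=1e-8):
    """Confidence-rated stump on one feature: h(x) = c[0] if x <= theta else c[1]."""
    c = np.empty(2)
    for cell, mask in enumerate((x <= theta, x > theta)):
        w_plus = np.sum(D[mask & (y > 0)]) + smooth   # weight of positives in the cell
        w_minus = np.sum(D[mask & (y < 0)]) + smooth  # weight of negatives in the cell
        c[cell] = 0.5 * np.log(w_plus / w_minus)
    return c

def real_stump_predict(x, theta, c):
    return np.where(x <= theta, c[0], c[1])
```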

39 Pros and cons of AdaBoost

Advantages
- Very simple to implement
- Feature selection on very large sets of features
- Fairly good generalization

Disadvantages
- Suboptimal solution for $\bar{\alpha}$
- Can overfit in the presence of noise

40 AdaBoost with a Totally Corrective Step (TCA)

Given: $(x_1, y_1), \dots, (x_m, y_m)$; $x_i \in X$, $y_i \in \{-1, +1\}$
Initialize weights $D_1(i) = 1/m$

For $t = 1, \dots, T$:
1. Call WeakLearn, which returns the weak classifier $h_t : X \to \{-1, +1\}$ with minimum error w.r.t. distribution $D_t$
2. Choose $\alpha_t \in \mathbb{R}$
3. Update $D_{t+1}$
4. Totally corrective step: call WeakLearn on the set of $h$'s with non-zero $\alpha$'s, update $\alpha$, update $D_{t+1}$; repeat until $|\epsilon_q - 1/2| < \delta$ for all selected classifiers.

Comments
- After the step all selected weak classifiers have $\epsilon_q \approx 1/2$, therefore the classifier selected at $t+1$ is independent of all classifiers selected so far.
- It can easily be shown that the totally corrective step reduces the upper bound on the empirical error without increasing classifier complexity.
- The TCA was first proposed by Kivinen and Warmuth, but their $\alpha_t$ is set as in standard AdaBoost.
- Generalization of TCA is an open question.
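The slide does not spell out the exact update in step 4; one reasonable reading, sketched below under our own assumptions (numpy assumed), is a coordinate-descent pass over the already selected weak classifiers: each $\alpha_q$ receives the greedy correction for its current weighted error until every selected $h_q$ has error within $\delta$ of 1/2.

```python
import numpy as np

def totally_corrective_step(preds, alphas, y, delta=1e-3, max_iters=1000):
    """preds[i, q] = h_q(x_i) in {-1, +1} for the already selected classifiers."""
    alphas = alphas.copy()
    for _ in range(max_iters):
        D = np.exp(-y * (preds @ alphas))          # distribution implied by the ensemble
        D /= D.sum()
        eps = D @ (preds != y[:, None])            # weighted error of each selected h_q
        if np.all(np.abs(eps - 0.5) < delta):
            break                                  # every selected h_q is "used up"
        q = int(np.argmax(np.abs(eps - 0.5)))      # Gauss-Southwell: worst coordinate first
        e = np.clip(eps[q], 1e-12, 1.0 - 1e-12)
        alphas[q] += 0.5 * np.log((1.0 - e) / e)   # greedy correction for coordinate q
    return alphas
```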

41 Experiments with TCA on the IDA Database

- Discrete AdaBoost, Real AdaBoost, and Discrete and Real TCA were evaluated.
- Weak learner: stumps.
- Data from the IDA repository (Rätsch, 2000); for each data set the benchmark specifies the input dimension, the numbers of training and testing patterns, and the number of realizations: Banana, Breast cancer, Diabetes, German, Heart, Image segment, Ringnorm, Flare solar, Splice, Thyroid, Titanic, Twonorm, Waveform.
- Note that the training sets are fairly small.

42-53 Results with TCA on the IDA Database

- Training error (dashed line), test error (solid line).
- Discrete AdaBoost (blue), Real AdaBoost (green), Discrete AdaBoost with TCA (red), Real AdaBoost with TCA (cyan).
- Black horizontal line: the error of AdaBoost with RBF-network weak classifiers from (Rätsch, ML 2000).

(One plot per data set, error versus length of the strong classifier: IMAGE, FLARE, GERMAN, RINGNORM, SPLICE, THYROID, TITANIC, BANANA, BREAST, DIABETIS, HEART.)

54 Conclusions

- The AdaBoost algorithm was presented and analysed.
- A modification, the Totally Corrective AdaBoost (TCA), was introduced.
- Initial tests show that TCA outperforms AdaBoost on some standard data sets.
