A Tutorial on Boosting. Yoav Freund, Rob Schapire.

1 A Tutorial on Boosting
Yoav Freund, Rob Schapire

2 Example: How May I Help You?
goal: automatically categorize type of call requested by phone customer (Collect, CallingCard, PersonToPerson, etc.) [Gorin et al.]
- "yes I'd like to place a collect call long distance please" (Collect)
- "operator I need to make a call but I need to bill it to my office" (ThirdNumber)
- "yes I'd like to place a call on my master card please" (CallingCard)
- "I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill" (BillingCredit)
observation: easy to find "rules of thumb" that are often correct, e.g.: IF "card" occurs in utterance THEN predict CallingCard. hard to find a single highly accurate prediction rule.

3 The Boosting Approach
select small subset of examples; derive rough rule of thumb. examine 2nd set of examples; derive 2nd rule of thumb. repeat T times.
questions: how to choose subsets of examples to examine on each round? how to combine all the rules of thumb into a single prediction rule?
boosting = general method of converting rough rules of thumb into a highly accurate prediction rule.

4 Tutorial outline
first half (Rob): behavior on the training set: background; AdaBoost; analyzing training error; experiments; connection to game theory; confidence-rated predictions; multiclass problems; boosting for text categorization.
second half (Yoav): understanding AdaBoost's generalization performance.

5 The Boosting Problem
strong PAC algorithm: for any distribution, $\forall \varepsilon > 0$, $\forall \delta > 0$, given polynomially many random examples, finds hypothesis with error $\le \varepsilon$ with probability $\ge 1 - \delta$.
weak PAC algorithm: same, but only for $\varepsilon \ge \frac{1}{2} - \gamma$.
[Kearns & Valiant 88]: does weak learnability imply strong learnability?

6 Early Boosting Algorithms
[Schapire 89]: first provable boosting algorithm; call weak learner three times on three modified distributions; get slight boost in accuracy; apply recursively.
[Freund 90]: "optimal" algorithm that "boosts by majority".
[Drucker, Schapire & Simard 92]: first experiments using boosting; limited by practical drawbacks.

7 AdaBoost
[Freund & Schapire 95]: introduced "AdaBoost" algorithm; strong practical advantages over previous boosting algorithms.
experiments using AdaBoost: [Drucker & Cortes 95], [Jackson & Craven 96], [Freund & Schapire 96], [Quinlan 96], [Breiman 96], [Schapire & Singer 98], [Maclin & Opitz 97], [Bauer & Kohavi 97], [Schwenk & Bengio 98], [Dietterich 98].
continuing development of theory and algorithms: [Schapire, Freund, Bartlett & Lee 97], [Breiman 97], [Grove & Schuurmans 98], [Schapire & Singer 98], [Mason, Bartlett & Baxter 98], [Friedman, Hastie & Tibshirani 98].

8 A Formal View of Boosting
given training set $(x_1, y_1), \ldots, (x_m, y_m)$ where $y_i \in \{-1, +1\}$ is the correct label of instance $x_i \in X$.
for $t = 1, \ldots, T$:
- construct distribution $D_t$ on $\{1, \ldots, m\}$
- find weak hypothesis ("rule of thumb") $h_t : X \to \{-1, +1\}$ with small error $\varepsilon_t$ on $D_t$: $\varepsilon_t = \Pr_{D_t}[h_t(x_i) \ne y_i]$
output final hypothesis $H_{\text{final}}$.

9 AdaBoost [Freund & Schapire]
constructing $D_t$: $D_1(i) = 1/m$; given $D_t$ and $h_t$:
$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \ne h_t(x_i) \end{cases} = \frac{D_t(i)}{Z_t} \exp(-\alpha_t\, y_i\, h_t(x_i))$$
where $Z_t$ = normalization constant and $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right) > 0$.
final hypothesis: $H_{\text{final}}(x) = \mathrm{sign}\left(\sum_t \alpha_t h_t(x)\right)$
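
To make the update concrete, here is a minimal sketch of AdaBoost in Python with one-dimensional threshold stumps as the weak learner. The stump search, function names, and data layout are illustrative assumptions, not part of the tutorial; the reweighting and the final vote follow the formulas on the slide above (and assume $0 < \varepsilon_t < 1/2$ on every round).

```python
import numpy as np

def best_stump(X, y, D):
    """Exhaustively search threshold stumps on every feature; return
    (weighted error, feature, threshold, sign) minimizing the error
    under the current distribution D."""
    best = None
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] > thresh, sign, -sign)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thresh, sign)
    return best

def adaboost(X, y, T):
    """AdaBoost as on the slide; labels y are in {-1, +1}."""
    m = len(y)
    D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
    ensemble = []
    for _ in range(T):
        eps, j, thresh, sign = best_stump(X, y, D)
        alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t
        pred = np.where(X[:, j] > thresh, sign, -sign)
        D = D * np.exp(-alpha * y * pred)      # up-weight mistakes
        D = D / D.sum()                        # normalize by Z_t
        ensemble.append((alpha, j, thresh, sign))
    return ensemble

def predict(ensemble, X):
    """H_final(x) = sign(sum_t alpha_t h_t(x))."""
    f = sum(a * np.where(X[:, j] > th, s, -s) for a, j, th, s in ensemble)
    return np.sign(f)
```

On each round the weight of every misclassified example grows by a factor $e^{\alpha_t}$, so the next stump is forced to concentrate on the examples the current ensemble gets wrong.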

10 Toy Example (figure: toy data set under the initial uniform distribution $D_1$)

11 Round 1 (figure: weak hypothesis $h_1$ and updated distribution $D_2$; $\varepsilon_1 = 0.30$, $\alpha_1 = 0.42$)

12 Round 2 (figure: weak hypothesis $h_2$ and updated distribution $D_3$; $\varepsilon_2 = 0.21$, $\alpha_2 = 0.65$)

13 Round 3 (figure: weak hypothesis $h_3$; $\varepsilon_3 = 0.14$, $\alpha_3 = 0.92$)

14 Final Hypothesis (figure: $H_{\text{final}} = \mathrm{sign}(0.42\, h_1 + 0.65\, h_2 + 0.92\, h_3)$)
* See demo at yoav/adaboost

15 Analyzing the training error
Theorem: run AdaBoost; let $\gamma_t = 1/2 - \varepsilon_t$; then
$$\text{training error}(H_{\text{final}}) \le \prod_t \left[ 2\sqrt{\varepsilon_t (1 - \varepsilon_t)} \right] = \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\left( -2 \sum_t \gamma_t^2 \right)$$
so: if $\forall t: \gamma_t \ge \gamma > 0$ then $\text{training error}(H_{\text{final}}) \le e^{-2\gamma^2 T}$.
adaptive: does not need to know $\gamma$ or $T$ a priori; can exploit $\gamma_t \gg \gamma$.
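
As a quick worked consequence of the last bound (the edge $\gamma = 0.1$ and sample size $m = 1000$ below are made-up illustrative values): once $e^{-2\gamma^2 T} < 1/m$, the training error, which is always a multiple of $1/m$, must be exactly zero.

```python
import math

gamma, m = 0.1, 1000   # illustrative edge and training-set size
# bound: training error <= exp(-2 * gamma^2 * T); once this falls
# below 1/m, the training error must be zero
T = math.ceil(math.log(m) / (2 * gamma ** 2))
print(T, math.exp(-2 * gamma ** 2 * T))   # 346 rounds, bound ~ 9.9e-4
```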

16 Proof
let $f(x) = \sum_t \alpha_t h_t(x)$, so that $H_{\text{final}}(x) = \mathrm{sign}(f(x))$.
Step 1: unwrapping the recursion:
$$D_{\text{final}}(i) = \frac{1}{m} \cdot \frac{\exp\left( -y_i \sum_t \alpha_t h_t(x_i) \right)}{\prod_t Z_t} = \frac{1}{m} \cdot \frac{e^{-y_i f(x_i)}}{\prod_t Z_t}$$
Step 2: $\text{training error}(H_{\text{final}}) \le \prod_t Z_t$.
Proof: $H_{\text{final}}(x) \ne y \Rightarrow y f(x) \le 0 \Rightarrow e^{-y f(x)} \ge 1$, so:
$$\text{training error}(H_{\text{final}}) = \frac{1}{m} \sum_i \begin{cases} 1 & \text{if } y_i \ne H_{\text{final}}(x_i) \\ 0 & \text{else} \end{cases} \le \frac{1}{m} \sum_i e^{-y_i f(x_i)} = \sum_i D_{\text{final}}(i) \prod_t Z_t = \prod_t Z_t$$

17 Proof (cont.)
Step 3: $Z_t = 2\sqrt{\varepsilon_t (1 - \varepsilon_t)}$.
Proof:
$$Z_t = \sum_i D_t(i) \exp(-\alpha_t\, y_i\, h_t(x_i)) = \sum_{i: y_i \ne h_t(x_i)} D_t(i)\, e^{\alpha_t} + \sum_{i: y_i = h_t(x_i)} D_t(i)\, e^{-\alpha_t} = \varepsilon_t e^{\alpha_t} + (1 - \varepsilon_t) e^{-\alpha_t} = 2\sqrt{\varepsilon_t (1 - \varepsilon_t)}$$
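
A small numeric check of Step 3, using the toy example's round-1 error $\varepsilon = 0.3$: scanning over $\alpha$ confirms that $Z(\alpha) = \varepsilon e^{\alpha} + (1-\varepsilon) e^{-\alpha}$ is minimized at $\alpha = \frac{1}{2}\ln\frac{1-\varepsilon}{\varepsilon} \approx 0.42$, with minimum value $2\sqrt{\varepsilon(1-\varepsilon)} \approx 0.917$.

```python
import numpy as np

eps = 0.3                                # round-1 error from the toy example
alphas = np.linspace(0.0, 1.0, 10001)
Z = eps * np.exp(alphas) + (1 - eps) * np.exp(-alphas)

print(alphas[Z.argmin()])                # ~0.4236 = 0.5 * ln(0.7 / 0.3)
print(Z.min(), 2 * np.sqrt(eps * (1 - eps)))   # both ~0.9165
```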

18 UCI Experiments [Freund & Schapire]
tested AdaBoost on UCI benchmarks.
used: C4.5 (Quinlan's decision tree algorithm); "decision stumps": very simple rules of thumb that test on single attributes, e.g. "eye color = brown?" or "height > 5 feet?", with each yes/no branch predicting +1 or -1.
(figure: two scatter plots of test error, C4.5 vs. boosting stumps and C4.5 vs. boosting C4.5)

19 Game Theory
game defined by matrix M:

            Rock   Paper   Scissors
  Rock      1/2    1       0
  Paper     0      1/2     1
  Scissors  1      0       1/2

row player chooses row i; column player chooses column j (simultaneously).
row player's goal: minimize loss $M(i, j)$.
usually allow randomized play: players choose distributions P and Q over rows and columns.
learner's (expected) loss $= \sum_{i,j} P(i)\, M(i, j)\, Q(j) = P^{\top} M Q \equiv M(P, Q)$
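
A one-line check of the expected-loss formula on the matrix above; the mixed strategies P and Q below are arbitrary illustrative choices.

```python
import numpy as np

M = np.array([[0.5, 1.0, 0.0],    # loss matrix: Rock, Paper, Scissors
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])
P = np.array([1/3, 1/3, 1/3])     # row player's mixed strategy
Q = np.array([0.5, 0.25, 0.25])   # column player's mixed strategy (arbitrary)

print(P @ M @ Q)                  # expected loss M(P, Q) = 0.5
```

Note that with uniform P the expected loss is exactly 1/2 no matter what Q is; that is the value of this game, as the next slide makes precise.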

20 The Minmax Theorem
von Neumann's minmax theorem:
$$\min_P \max_Q M(P, Q) = \max_Q \min_P M(P, Q) = v = \text{"value" of game } M$$
in words:
$v = \min \max$ means: row player has a strategy $P^*$ such that $\forall$ column strategies $Q$, loss $M(P^*, Q) \le v$.
$v = \max \min$ means: this is optimal, in the sense that the column player has a strategy $Q^*$ such that $\forall$ row strategies $P$, loss $M(P, Q^*) \ge v$.
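
The minmax strategy $P^*$ and the value $v$ can be computed by linear programming; below is a sketch using scipy.optimize.linprog on the Rock-Paper-Scissors matrix. The LP encoding is the standard one, not from the tutorial; for this game it recovers the uniform strategy and $v = 1/2$.

```python
import numpy as np
from scipy.optimize import linprog

M = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])
n = M.shape[0]

# variables (P_1..P_n, v); minimize v subject to
#   M^T P <= v   (loss at most v against every pure column)
#   sum(P) = 1, P >= 0
c = np.r_[np.zeros(n), 1.0]
A_ub = np.c_[M.T, -np.ones(n)]
b_ub = np.zeros(n)
A_eq = np.r_[np.ones(n), 0.0].reshape(1, -1)
b_eq = [1.0]
bounds = [(0, None)] * n + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.x[:n], res.x[n])   # P* = [1/3, 1/3, 1/3], v = 0.5
```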

21 The Boosting Game
row player ↔ booster; column player ↔ weak learner.
matrix M: row ↔ example $(x_i, y_i)$; column ↔ weak hypothesis $h$;
$$M(i, h) = \begin{cases} 1 & \text{if } y_i = h(x_i) \\ 0 & \text{else} \end{cases}$$

22 Boosting and the Minmax Theorem
if: $\forall$ distributions over examples, $\exists h$ with accuracy $\ge \frac{1}{2} + \gamma$
then: $\min_P \max_h M(P, h) \ge \frac{1}{2} + \gamma$
by minmax theorem: $\max_Q \min_i M(i, Q) \ge \frac{1}{2} + \gamma > \frac{1}{2}$
which means: $\exists$ weighted majority of hypotheses which correctly classifies all examples.

23 AdaBoost and Game Theory [Freund & Schapire]
AdaBoost is a special case of a general algorithm for solving games through repeated play.
can show: the distribution over examples converges to an (approximate) minmax strategy for the boosting game; the weights on weak hypotheses converge to an (approximate) maxmin strategy.
different instantiations of the game-playing algorithm give on-line learning algorithms (such as the weighted majority algorithm).

24 Confidence-rated Predictions
useful to allow weak hypotheses to assign confidences to predictions.
formally, allow $h_t : X \to \mathbb{R}$: $\mathrm{sign}(h_t(x))$ = prediction; $|h_t(x)|$ = confidence.
[Schapire & Singer] use the identical update: $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \exp(-\alpha_t\, y_i\, h_t(x_i))$ and the identical rule for combining weak hypotheses.
questions: how to choose $h_t$'s (specifically, how to assign confidences to predictions)? how to choose $\alpha_t$'s?

25 Confidence-rated Predictions (cont.)
Theorem: $\text{training error}(H_{\text{final}}) \le \prod_t Z_t$. Proof: same as before.
therefore, on each round $t$, should choose $h_t$ and $\alpha_t$ to minimize:
$$Z_t = \sum_i D_t(i) \exp(-\alpha_t\, y_i\, h_t(x_i))$$
given $h_t$, can find the $\alpha_t$ which minimizes $Z_t$: analytically (sometimes); numerically (in general).
should design the weak learner to minimize $Z_t$; e.g., for decision trees, this criterion gives: the splitting rule; the assignment of confidences at leaves.
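
A sketch of the "numerically (in general)" case: since $Z_t$ is convex in $\alpha_t$ (a sum of exponentials), a one-dimensional minimizer suffices. The distribution and confidence-rated margins below are made-up values.

```python
import numpy as np
from scipy.optimize import minimize_scalar

D = np.array([0.2, 0.2, 0.3, 0.3])          # current distribution (illustrative)
yh = np.array([0.8, 0.5, -0.3, 0.9])        # y_i * h_t(x_i), confidence-rated

Z = lambda a: np.sum(D * np.exp(-a * yh))   # Z_t as a function of alpha
res = minimize_scalar(Z)                    # convex, so any local min is global
print(res.x, res.fun)                       # minimizing alpha_t and resulting Z_t
```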

26 Minimizing Exponential Loss
AdaBoost attempts to minimize:
$$\prod_{t=1}^{T} Z_t = \frac{1}{m} \sum_i \exp(-y_i f(x_i)) = \frac{1}{m} \sum_i \exp\left( -y_i \sum_t \alpha_t h_t(x_i) \right) \quad (*)$$
really a steepest descent procedure: each round, add a term $\alpha_t h_t$ to the sum to minimize (*).
why this loss function? upper bound on the training (classification) error; easy to work with; connection to logistic regression [Friedman, Hastie & Tibshirani].

27 Multiclass Problems
say $y \in Y = \{1, \ldots, k\}$.
direct approach (AdaBoost.M1): $h_t : X \to Y$,
$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \ne h_t(x_i) \end{cases}$$
$$H_{\text{final}}(x) = \arg\max_{y \in Y} \sum_{t: h_t(x) = y} \alpha_t$$
can prove the same bound on error if $\forall t: \varepsilon_t \le 1/2$.
in practice, not usually a problem for "strong" weak learners (e.g., C4.5); significant problem for "weak" weak learners (e.g., decision stumps).
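
A sketch of the AdaBoost.M1 combination rule $H_{\text{final}}(x) = \arg\max_{y} \sum_{t: h_t(x) = y} \alpha_t$; the three-round ensemble below is a made-up illustration.

```python
from collections import defaultdict

def m1_predict(ensemble, x):
    """AdaBoost.M1 vote: each round adds alpha_t to the label its
    hypothesis h_t predicts; return the label with the most weight."""
    votes = defaultdict(float)
    for alpha, h in ensemble:
        votes[h(x)] += alpha
    return max(votes, key=votes.get)

# illustrative: three rounds voting over labels {1, 2}
ensemble = [(0.9, lambda x: 1), (0.4, lambda x: 2), (0.6, lambda x: 2)]
print(m1_predict(ensemble, x=None))   # label 2 wins with weight 1.0
```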

28 Reducing to Binary Problems [Schapire & Singer]
e.g.: say the possible labels are $\{a, b, c, d, e\}$.
each training example is replaced by five $\{-1, +1\}$-labeled examples:
$$(x,\, c) \rightarrow \begin{cases} ((x, a),\, -1) \\ ((x, b),\, -1) \\ ((x, c),\, +1) \\ ((x, d),\, -1) \\ ((x, e),\, -1) \end{cases}$$
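
The same mapping as a few lines of code; the helper name is my own.

```python
def expand(x, label, labels=("a", "b", "c", "d", "e")):
    """Replace a multiclass example by one {-1,+1}-labeled
    example per possible label, as on the slide."""
    return [((x, y), +1 if y == label else -1) for y in labels]

print(expand("x", "c"))
# [(('x','a'), -1), (('x','b'), -1), (('x','c'), +1),
#  (('x','d'), -1), (('x','e'), -1)]
```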

29 AdaBoost.MH
formally: $h_t : X \times Y \to \{-1, +1\}$ (or $\mathbb{R}$),
$$D_{t+1}(i, y) = \frac{D_t(i, y)}{Z_t} \exp(-\alpha_t\, v_i(y)\, h_t(x_i, y)) \quad \text{where } v_i(y) = \begin{cases} +1 & \text{if } y_i = y \\ -1 & \text{if } y_i \ne y \end{cases}$$
$$H_{\text{final}}(x) = \arg\max_{y \in Y} \sum_t \alpha_t h_t(x, y)$$
can prove: $\text{training error}(H_{\text{final}}) \le \frac{k}{2} \prod_t Z_t$
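
A vectorized sketch of the AdaBoost.MH reweighting step over the $(i, y)$ pairs; the shapes, stand-in weak-hypothesis values, and function name are illustrative assumptions.

```python
import numpy as np

def mh_update(D, alpha, h_vals, y_true, labels):
    """One AdaBoost.MH reweighting step. D has shape (m, k): one
    weight per (example, label) pair; h_vals[i, l] = h_t(x_i, y_l)."""
    v = np.where(labels[None, :] == y_true[:, None], 1.0, -1.0)  # v_i(y)
    D = D * np.exp(-alpha * v * h_vals)
    return D / D.sum()                       # divide by Z_t

m, k = 4, 3
labels = np.arange(k)
y_true = np.array([0, 2, 1, 0])
D = np.full((m, k), 1.0 / (m * k))           # uniform initial weights
h_vals = np.sign(np.random.randn(m, k))      # stand-in weak hypothesis values
D = mh_update(D, alpha=0.5, h_vals=h_vals, y_true=y_true, labels=labels)
```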

30 Using Output Codes [Schapire & Singer]
alternative: reduce to "random" binary problems.
choose a code word for each label:

       1   2   3   4
  a    -   +   -   +
  b    -   +   +   -
  c    +   -   -   +
  d    +   -   +   +
  e    -   +   -   -

each training example mapped to one example per column:
$$(x,\, c) \rightarrow ((x, 1),\, +1),\ ((x, 2),\, -1),\ ((x, 3),\, -1),\ ((x, 4),\, +1)$$
to classify a new example $x$: evaluate the hypothesis on $(x, 1), \ldots, (x, 4)$; choose the label most consistent with the results.
training error bounds independent of # of classes; may be more efficient for very large # of classes.
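
A sketch of the decoding step, "choose label most consistent with results", as Hamming decoding over the code words in the table above; the binary predictions for $x$ are hypothetical.

```python
import numpy as np

# code words from the slide's table: one row per label,
# one column per binary problem (+1 / -1)
codes = {"a": [-1, +1, -1, +1],
         "b": [-1, +1, +1, -1],
         "c": [+1, -1, -1, +1],
         "d": [+1, -1, +1, +1],
         "e": [-1, +1, -1, -1]}

def classify(bits):
    """Pick the label whose code word agrees with the most of the
    four binary predictions (Hamming decoding)."""
    return max(codes, key=lambda lbl: np.sum(np.array(codes[lbl]) == bits))

preds = np.array([+1, -1, +1, +1])   # hypothetical binary outputs for some x
print(classify(preds))               # 'd': its code word matches all four
```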

31 Example: Boosting for Text Categorization [Schapire & Singer]
weak hypotheses: very simple weak hypotheses that test on simple patterns, namely (sparse) n-grams; find the parameter $\alpha_t$ and rule $h_t$ of the given form which minimize $Z_t$; use an efficiently implemented exhaustive search.
"How may I help you" data: 7844 training examples (hand-transcribed); 1000 test examples (both hand-transcribed and from speech recognizer).
categories: AreaCode, AttService, BillingCredit, CallingCard, Collect, Competitor, DialForMe, Directory, HowToDial, PersonToPerson, Rate, ThirdNumber, Time, TimeCharge, Other.

32 Weak Hypotheses
first weak hypotheses found, one per round; each tests for a term:
round 1: collect; 2: card; 3: my home; 4: person ? person; 5: code; 6: I; 7: time; 8: wrong number; 9: how.
(figure: for each round, the predicted sign and confidence for each category: AC, AS, BC, CC, CO, CM, DM, DI, HO, PP, RA, 3N, TI, TC, OT)

33 More Weak Hypotheses
round 10: call; 11: seven; 12: trying to; 13: and; 14: third; 15: to; 16: for; 17: charges; 18: dial; 19: just.
(figure: predicted sign and confidence per category, as before)

34 Learning Curves
(figure: train and test error (%) vs. # rounds of boosting, with and without confidence ratings)
test error reaches 20% for the first time on round: 1,932 without confidence ratings; 191 with confidence ratings.
test error reaches 18% for the first time on round: 10,909 without confidence ratings; 303 with confidence ratings.

35 Finding Outliers
examples with most weight are often outliers (mislabeled and/or ambiguous):
- "I'm trying to make a credit card call" (Collect)
- "hello" (Rate)
- "yes I'd like to make a long distance collect call please" (CallingCard)
- "calling card please" (Collect)
- "yeah I'd like to use my calling card number" (Collect)
- "can I get a collect call" (CallingCard)
- "yes I would like to make a long distant telephone call and have the charges billed to another number" (CallingCard DialForMe)
- "yeah I can not stand it this morning I did oversea call is so bad" (BillingCredit)
- "yeah special offers going on for long distance" (AttService Rate)
- "mister allen please william allen" (PersonToPerson)
- "yes ma'am I I'm trying to make a long distance call to a non dialable point in san miguel philippines" (AttService Other)
- "yes I like to make a long distance call and charge it to my home phone that's where I'm calling at my home" (DialForMe)
- "I like to make a call and charge it to my ameritech" (Competitor)
