CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes of the book Introduction to Data Mining (Chap. 5) and from On the Power of Ensemble: Supervised and Unsupervised Methods Reconciled (a tutorial at SDM 2010 by Jing Gao et al.).
Ensemble Methods Objective: to improve model performance in terms of accuracy by aggregating the predictions of multiple models. How to do it: construct a set of base models from the training data, then make predictions by combining the predictions made by each base model.
General Idea
Stories of Success Million-dollar prize: improve the baseline movie recommendation approach of Netflix by 10% in accuracy; the top submissions all combine several algorithms as an ensemble. Data mining competitions on Kaggle: winning teams employ ensembles of classifiers.
Netflix Prize Supervised learning task: the training data is a set of users and movies, and a set of ratings (1, 2, 3, 4, 5 stars) on movies given by users. Construct a classifier that, given a user and an unrated movie, correctly predicts the user's rating on the movie: 1, 2, 3, 4, or 5 stars. $1 million prize for a 10% improvement over Netflix's current movie recommender. Competition: at first, single-model methods were developed and performance improved; however, improvements slowed down. Later, individuals and teams merged their results, and significant improvements were observed.
Leaderboard Our final solution consists of blending 107 individual results. Predictive accuracy is substantially improved when blending multiple predictors. Our experience is that most efforts should be concentrated in deriving substantially different approaches, rather than refining a single technique.
Why Do Ensembles Work? Suppose there are 3 base classifiers C1, C2, C3, each with error rate ε = 0.35 (accuracy acc = 0.65). Given a test instance x, if we choose any single one of these classifiers to make the prediction, the probability that it makes a wrong prediction is 35%.
Why Do Ensembles Work? Combine the 3 base classifiers to predict the class label of a test instance using a majority vote on the predictions made by the base classifiers. Assuming the classifiers are independent, the ensemble makes a wrong prediction only if at least 2 of the 3 base classifiers predict incorrectly.
Why Do Ensembles Work? Example: for a test instance x with true label -1, enumerate the possible predictions of C1, C2, C3 (each with error rate 35%, accuracy 65%) and the resulting majority vote:
C1  C2  C3  -> majority vote
+1  +1  +1  -> +1 (wrong prediction)
+1  +1  -1  -> +1 (wrong prediction)
+1  -1  +1  -> +1 (wrong prediction)
-1  +1  +1  -> +1 (wrong prediction)
+1  -1  -1  -> -1 (correct prediction)
-1  +1  -1  -> -1 (correct prediction)
-1  -1  +1  -> -1 (correct prediction)
-1  -1  -1  -> -1 (correct prediction)
Why Do Ensembles Work? Therefore, the probability that the ensemble classifier makes a wrong prediction is $\varepsilon_{\text{ensemble}} = \sum_{i=2}^{3} \binom{3}{i} \varepsilon^{i} (1-\varepsilon)^{3-i} = 3 \times 0.35^{2} \times 0.65 + 0.35^{3} \approx 0.2817$. That is, the accuracy of the ensemble classifier is 71.83%.
Why Do Ensembles Work? Suppose there are 25 independent base classifiers, each with error rate ε = 0.35. The probability that the ensemble classifier makes a wrong prediction is $\varepsilon_{\text{ensemble}} = \sum_{i=13}^{25} \binom{25}{i} \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06$. That is, the accuracy of the ensemble classifier is about 94%.
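The binomial calculations above are easy to reproduce. Below is a minimal sketch (plain Python standard library; the function name is illustrative) that evaluates the majority-vote error for any number of independent base classifiers:

```python
from math import comb

def ensemble_error(n_classifiers, eps):
    """Probability that a majority vote of n independent classifiers,
    each with error rate eps, is wrong."""
    k = n_classifiers // 2 + 1   # minimum number of wrong votes for a wrong majority
    return sum(comb(n_classifiers, i) * eps**i * (1 - eps)**(n_classifiers - i)
               for i in range(k, n_classifiers + 1))

print(ensemble_error(3, 0.35))    # ~0.2817  -> ensemble accuracy ~71.8%
print(ensemble_error(25, 0.35))   # ~0.06    -> ensemble accuracy ~94%
print(ensemble_error(25, 0.55))   # > 0.55   -> worse than the base classifiers
```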
Why Do Ensembles Work? [Figure: some unknown distribution together with six individual models, Model 1 through Model 6.] Ensemble gives the global picture!
Why Is Independence Necessary? [Figure: the +/- predictions of two base classifiers C1 and C2 on the same set of examples, shown for two different classifier pairs.]
Necessary Conditions [Figure: error rate of an ensemble of 25 binary classifiers as a function of the base classifier error rate, for two cases: identical (perfectly correlated) base classifiers and independent base classifiers.] Observation: the ensemble classifier performs worse than the base classifiers when the base classifier error rate is larger than 0.5.
Necessary Conditions Two necessary conditions for an ensemble classifier to perform better than a single classifier: The base classifiers should be independent of each other; in practice, this condition can be relaxed so that the base classifiers may be slightly correlated. The base classifiers should do better than a classifier that performs random guessing (e.g., for binary classification, accuracy should be better than 0.5).
Ensemble Learning Methods Supervised ensemble learning methods: classification, regression. Unsupervised ensemble learning methods: clustering.
Supervised Ensemble Methods How to generate an ensemble of classifiers? By manipulating the training set: multiple training sets are created by resampling the original data according to some sampling distribution, and a classifier is then built from each training set (e.g., bagging, boosting). By manipulating the input features: a subset of input attributes is chosen to form each training set; the subset can be chosen either randomly or using domain knowledge (e.g., random forest). By manipulating the learning algorithm: the algorithm is applied several times to the same training data using different parameters.
Supervised Ensemble Method: General Procedure
1. Let D denote the original training data, k the number of base classifiers, and T the test data.
2. for i = 1 to k do
3.   Create a training set D_i from D.
4.   Build a base classifier C_i from D_i.
5. end for
6. for each test record x ∈ T do
7.   C*(x) = Vote(C_1(x), C_2(x), …, C_k(x))
8. end for
(Step 7 uses majority voting; other combination schemes are possible.)
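A minimal sketch of this procedure (assumptions: scikit-learn-style base learners, a hypothetical create_training_set hook standing in for step 3, and plurality voting in step 7):

```python
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def train_ensemble(create_training_set, X, y, k):
    # create_training_set(X, y, i) -> (X_i, y_i) is a hypothetical hook implementing
    # step 3 (resampling, feature subsets, ...); decision trees are just one
    # possible choice of base classifier for step 4.
    classifiers = []
    for i in range(k):
        X_i, y_i = create_training_set(X, y, i)
        classifiers.append(DecisionTreeClassifier().fit(X_i, y_i))
    return classifiers

def ensemble_predict(classifiers, x):
    votes = [clf.predict([x])[0] for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]   # step 7: plurality/majority vote
```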
Bagging Also known as bootstrap aggregating: repeatedly sample from the training data with replacement according to a uniform probability distribution. Build a classifier on each bootstrap sample, which has the same size as the original data. Use majority voting to determine the class label predicted by the ensemble classifier.
Bagging Example: each round lists the indices of the instances drawn (with replacement) into one bootstrap sample, from which one base classifier is built.
Original data: 1 2 3 4 5 6 7 8 9 10
Round 1 (builds C1): 7 8 10 8 2 5 10 10 5 9
Round 2 (builds C2): 1 4 9 1 2 3 2 7 3 2
Round 3 (builds C3): 1 8 5 10 5 5 9 6 3 7
Bagging In each draw, a training example has a probability of 1 − 1/N of not being selected. Its probability of not ending up in a bootstrap training set D_i is therefore (1 − 1/N)^N ≈ 1/e ≈ 0.368. Hence a bootstrap sample D_i contains approximately 63.2% of the original training examples.
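A bagging sketch along these lines (assumptions: NumPy arrays X and y, integer class labels, scikit-learn decision trees as the base learner); the last lines also check the ~63.2% figure empirically:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)   # bootstrap: n draws with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.array([m.predict(X) for m in models])   # shape (k, n_test)
    # majority vote per test instance (assumes labels are 0, 1, ..., C-1)
    return np.array([np.bincount(col).argmax() for col in votes.T])

# each bootstrap sample contains roughly 1 - 1/e = 63.2% distinct original examples
n = 100_000
idx = np.random.default_rng(0).integers(0, n, size=n)
print(len(np.unique(idx)) / n)   # ~0.632
```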
Boosting Principles: boost a set of weak learners into a strong learner by making the records that are currently misclassified more important. In general, it is an iterative procedure that adaptively changes the distribution of the training data so that the base classifiers focus more on previously misclassified records.
Boosting Specifically: initially, all N records are assigned equal weights. Unlike bagging, the weights may change at the end of each boosting round. In each boosting round, after the weights are assigned to the training examples, we can either draw a bootstrap sample from the original data using the weights as a sampling distribution and build a model on it, or learn a model that is biased toward the higher-weighted examples.
Boosting: Procedure (Resampling Based on Instance Weights)
1. Initially, the examples are assigned equal weights 1/N, so that they are equally likely to be chosen for training; a sample is drawn uniformly (with replacement) to obtain a new training set.
2. A classifier is induced from the training set and used to classify all the examples in the original training data.
3. The weights of the training examples are updated at the end of each boosting round: records that are wrongly classified have their weights increased, and records that are classified correctly have their weights decreased.
4. Steps 2 and 3 are repeated until the stopping condition is met.
5. Finally, the ensemble is obtained by aggregating the base classifiers from all boosting rounds.
Boosting: Example (Round 1) Initially, all the examples are assigned the same weight.
Weights: 1/10 1/10 1/10 1/10 1/10 1/10 1/10 1/10 1/10 1/10
Original data: 1 2 3 4 5 6 7 8 9 10
Uniformly randomly sample using bootstrapping:
Round 1: 7 3 2 8 7 9 4 10 6 3
A classifier C1 is built from this sample and then applied to all the original instances (1 2 3 4 5 6 7 8 9 10).
Boosting: Example (Round 2) Adjust the weights based on whether the instances were chosen in the previous round, or misclassified by the classifier trained in the previous round. E.g., instance 4 was misclassified in Round 1, so its weight is increased; instance 5 was not chosen in Round 1, so its weight is increased as well; etc.
Weights: 1/10 1/5 1/20 1/5 1/5 1/20 1/20 1/20 1/20 1/20
Original data: 1 2 3 4 5 6 7 8 9 10
Randomly sample based on the weight of each instance:
Round 2: 5 4 9 4 2 5 1 7 4 2
A classifier C2 is built from this sample and then applied to all the original instances.
Boosting: Example (Round 3) Again adjust the weights based on whether the instances were chosen in the previous round, or misclassified by the classifier trained in the previous round, and randomly sample based on the weight of each instance:
Round 3: 4 4 8 10 4 5 4 6 3 4
As the boosting rounds proceed, examples that are the hardest to classify tend to become even more prevalent, e.g., instance 4.
Alternative: Weighted Classifier To learn a model that is biased toward higher-weighted examples, minimize the weighted error $E = \frac{1}{N} \sum_{j=1}^{N} w_j \, \delta\big(f(x_j) \neq y_j\big)$, where w_j is the weight of instance x_j, and δ(p) = 1 if the predicate p is true and 0 otherwise.
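A small sketch of this weighted error (illustrative names; it follows the slide's formula, including the 1/N factor):

```python
import numpy as np

def weighted_error(w, y_pred, y_true):
    # E = (1/N) * sum_j w_j * delta(f(x_j) != y_j)
    return np.sum(w * (y_pred != y_true)) / len(y_true)

# three examples; only the second (weight 0.5) is misclassified
print(weighted_error(np.array([0.25, 0.5, 0.25]),
                     np.array([1, -1, 1]),
                     np.array([1, 1, 1])))   # 0.5 / 3 ~= 0.167
```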
Boosting: AdaBoost Let D = {(x_i, y_i) | i = 1, 2, …, N} be the set of training examples. In AdaBoost, let f_1, f_2, …, f_T be the base classifiers obtained in the T boosting rounds. Error rate of classifier f_i: $\varepsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, \delta\big(f_i(x_j) \neq y_j\big)$. Importance of a classifier: $\alpha_i = \frac{1}{2} \ln\!\left(\frac{1 - \varepsilon_i}{\varepsilon_i}\right)$.
Boosting: AdaBoost Weight update: $w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} e^{-\alpha_j} & \text{if } f_j(x_i) = y_i \\ e^{\alpha_j} & \text{if } f_j(x_i) \neq y_i \end{cases}$ where $w_i^{(j)}$ denotes the weight assigned to example (x_i, y_i) during the j-th boosting round, and Z_j is the normalization factor ensuring $\sum_i w_i^{(j+1)} = 1$.
Boosting: AdaBoost If any intermediate round produces an error rate higher than 50%, the weights are reverted to 1/N and the resampling procedure is repeated. Classification: $f^*(x) = \arg\max_{y} \sum_{j=1}^{T} \alpha_j \, \delta\big(f_j(x) = y\big)$.
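Putting the pieces together, here is a compact AdaBoost sketch mirroring the rules above (assumptions: labels in {-1, +1}, scikit-learn decision stumps as weak learners, resampling by instance weights; the weighted error is computed as Σ_j w_j δ(f_i(x_j) ≠ y_j) with the weights kept normalized, which absorbs the 1/N scaling in the slide's formula):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    w = np.full(N, 1.0 / N)
    models, alphas = [], []
    t = 0
    while t < T:
        idx = rng.choice(N, size=N, p=w)                 # resample using the weights
        stump = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        wrong = stump.predict(X) != y                    # evaluate on the original data
        eps = np.sum(w * wrong)                          # weighted error (weights sum to 1)
        if eps > 0.5:                                    # bad round: reset weights and retry
            w = np.full(N, 1.0 / N)
            continue
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))
        w = w * np.exp(np.where(wrong, alpha, -alpha))   # increase weights of mistakes
        w = w / w.sum()                                  # normalization (the Z_j factor)
        models.append(stump)
        alphas.append(alpha)
        t += 1
    return models, np.array(alphas)

def adaboost_predict(models, alphas, X):
    # weighted vote: sign of the alpha-weighted sum of {-1, +1} predictions
    scores = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(scores)
```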
Illustrating AdaBoost Example: one-dimensional examples (1 attribute) with binary classes. [Figure: the original data points (+/−) with initial weight 0.1 each, and the Boosting Round 1 training sample for classifier B1, with its decision boundary, the misclassified examples, the updated weights (0.0094, 0.0094, 0.4623) and the classifier importance α = 1.9459.]
Illustrating AdaBoost [Figure: training samples, decision boundaries and weights over the three boosting rounds, followed by the combined result; α is the importance of the corresponding classifier.]
Boosting Round 1 (B1): + + + − − − − − − −, weights 0.0094 / 0.0094 / 0.4623, α = 1.9459
Boosting Round 2 (B2): − − − − − − − − + +, weights 0.3037 / 0.0009 / 0.0422, α = 2.9323
Boosting Round 3 (B3): + + + + + + + + + +, weights 0.0276 / 0.1819 / 0.0038, α = 3.8744
Overall ensemble result: + + + − − − − − + +
Random Forests A class of ensemble methods specifically designed for decision tree classifiers. A random forest grows many trees; each tree is generated based on the values of an independent set of random vectors, generated from a fixed probability distribution. Final result when classifying a new instance: voting. The forest chooses the classification result that receives the most votes (over all the trees in the forest).
Random Forests [Figure: illustration of random forests.]
Random Forests: Algorithm
Choose T: the number of trees to grow.
Choose m < M (M is the total number of features): the number of features used to calculate the best split at each node (typically about 20% of M).
For each tree:
  Choose a training set by bootstrapping.
  For each node, randomly choose m features and calculate the best split.
  The tree is fully grown and not pruned.
Use a majority vote among all the trees.
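For practical use, scikit-learn's RandomForestClassifier implements essentially this recipe; a hedged usage sketch (the dataset and parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
forest = RandomForestClassifier(
    n_estimators=100,    # T: number of trees to grow
    max_features=0.2,    # m: fraction of the M features tried at each split (~20%)
    bootstrap=True,      # each tree is trained on a bootstrap sample
    random_state=0,
).fit(X, y)
print(forest.predict(X[:5]))   # final prediction = majority vote over all trees
```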
Random Forests: Discussion Bagging + random features. Improved accuracy: the random feature subsets incorporate more diversity among the trees. Improved efficiency: searching among a subset of features is much faster than searching among the complete set.
Combination Methods Average: simple average, weighted average. Voting: majority voting, plurality voting, weighted voting. Combining by learning.
Average Simple average: $f^*(x) = \frac{1}{T} \sum_{i=1}^{T} f_i(x)$. Weighted average: $f^*(x) = \sum_{i=1}^{T} w_i f_i(x)$, where $w_i \geq 0$ and $\sum_{i=1}^{T} w_i = 1$.
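A tiny sketch of the two averaging rules (illustrative function names; preds holds the outputs f_i(x) of the T base models on one instance):

```python
import numpy as np

def simple_average(preds):                 # preds: shape (T,)
    return np.mean(preds)

def weighted_average(preds, weights):      # weights >= 0, sum to 1
    return np.dot(weights, preds)
```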
Voting Majority voting: every classifier votes for one class label, and the final output is the class label that receives more than half of the votes; if no class label receives more than half of the votes, a rejection option is given and the combined classifier makes no prediction. Plurality voting: takes the class label that receives the largest number of votes as the final winner. Weighted voting: a generalized version of plurality voting that introduces a weight for each classifier.
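The three voting rules can be sketched as follows (illustrative names; votes is the list of class labels predicted by the individual classifiers):

```python
from collections import Counter

def plurality_vote(votes):
    return Counter(votes).most_common(1)[0][0]

def majority_vote(votes):
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None   # None = rejection option

def weighted_vote(votes, weights):
    scores = Counter()
    for label, w in zip(votes, weights):
        scores[label] += w
    return scores.most_common(1)[0][0]
```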
Combining by Learning Stacking: A general procedure where a learner is trained to combine the individual learners Individual learners: first-level learners Combiner: second-level learner, or meta-learner
Combining by Learning: Illustration Suppose we are given a binary classification problem. The table below lists the predicted values of the T base classifiers C1, …, CT on the N training instances (e.g., 0.6 is the predicted value of classifier C1 on instance x1), together with the true labels Y; each row forms a new feature vector for instance x_i.
       C1    C2    ...  CT   | Y
x1     0.6   0.9   ...  0.3  | 1
x2     0.3   0.45  ...  0.6  | -1
...
xN     0.1   0.5   ...  0.2  | 1
Combining by Learning: Illustration Given D = {(x_i, y_i) | i = 1, 2, …, N}, where x_i = [C_1(x_i), …, C_T(x_i)] is the new feature vector from the previous slide, learn a model in terms of w = [w_1, …, w_T] such that the difference between y_i and t_i = w · x_i is as small as possible. The entries of w = [w_1, …, w_T] are the weights of the respective base classifiers.
Combining by Learning: Avoiding Overfitting [Figure: the whole training dataset is split into two parts: one part is used for training the first-level learners, the other part is used for evaluation, i.e., for producing the predictions on which the meta-learner is trained.]
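A stacking sketch that follows this split (assumptions: scikit-learn base learners and a logistic-regression meta-learner; as in the illustration, each meta-feature column holds one base classifier's predicted value for a binary problem):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def stacking_fit(X, y):
    # hold out half of the training data so the meta-learner never sees predictions
    # that the base learners made on their own training examples
    X_base, X_meta, y_base, y_meta = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=0)
    base = [clf.fit(X_base, y_base)
            for clf in (DecisionTreeClassifier(random_state=0), GaussianNB())]
    Z = np.column_stack([clf.predict_proba(X_meta)[:, 1] for clf in base])  # new features
    meta = LogisticRegression().fit(Z, y_meta)                              # second level
    return base, meta

def stacking_predict(base, meta, X):
    Z = np.column_stack([clf.predict_proba(X)[:, 1] for clf in base])
    return meta.predict(Z)
```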
Unsupervised Ensemble Methods Clustering ensembles: given an unlabeled data set D = {x_1, x_2, …, x_N}, an ensemble approach computes a set of clustering solutions {C_1, C_2, …, C_T}, each of which maps a data point to a cluster, C_j(x) = m, and a unified clustering solution C* which combines the base clustering solutions by their consensus.
Clustering Ensembles Example: 7 instances, 4 base clusterings (C1 to C4) and the ensemble clustering C*; each entry is the index of the cluster the instance is assigned to.
      C1  C2  C3  C4 | C*
x1    1   2   3   1  | 1
x2    1   3   3   3  | 1
x3    2   3   1   3  | 1
x4    2   2   1   4  | 2
x5    2   2   1   4  | 2
x6    3   1   2   2  | 3
x7    3   1   2   2  | 3
Clustering Ensembles: Challenges Unsupervised: the correspondence between the clusters in different clustering solutions is unknown. The resulting combinatorial optimization problem is NP-complete.
Clustering Ensembles: Challenges In the table on the previous slide, C1 and C3 produce identical clustering results, {{x1, x2}, {x3, x4, x5}, {x6, x7}}, even though their cluster indices differ: the same index in different base clusterings may not represent the same cluster! In addition, the numbers of clusters in different base clusterings can be different (e.g., C4 has 4 clusters).
Clustering Ensembles: Similarity-based Methods
Input: data set D = {x_1, x_2, …, x_N}; base clustering algorithms {C_1, C_2, …, C_T}; a clustering algorithm C for generating the final result.
Process:
1. For i = 1, …, T
2.   Form a base clustering of D with k^(i) clusters.
3.   Derive an N × N similarity matrix M^(i) based on the clustering result.
4. End
5. Form the consensus similarity matrix $M = \frac{1}{T} \sum_{i=1}^{T} M^{(i)}$.
6. Perform C on M to generate k clusters.
Output: the ensemble clustering result obtained by C.
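A sketch of this similarity-based procedure (assumptions: k-means is used both as the base clusterer and as the consensus algorithm C, and the consensus step simply clusters the rows of M; spectral or hierarchical clustering on M would be equally reasonable choices):

```python
import numpy as np
from sklearn.cluster import KMeans

def clustering_ensemble(X, T=10, k_base=3, k_final=3, seed=0):
    N = len(X)
    M = np.zeros((N, N))
    for t in range(T):
        labels = KMeans(n_clusters=k_base, n_init=10,
                        random_state=seed + t).fit_predict(X)
        M += (labels[:, None] == labels[None, :]).astype(float)  # M^(t): 1 if same cluster
    M /= T                                                       # consensus similarity matrix
    # consensus step: cluster the rows of M (each row is an instance's similarity profile)
    return KMeans(n_clusters=k_final, n_init=10, random_state=seed).fit_predict(M)
```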
Constructing the Similarity Matrix (crisp clustering) M^(1)(i, j) = 1 if x_i and x_j belong to the same cluster, and 0 otherwise. Example: base clustering C1 assigns x1, x2 to cluster 1, x3, x4, x5 to cluster 2, and x6, x7 to cluster 3, giving
M^(1) =
     x1 x2 x3 x4 x5 x6 x7
x1    1  1  0  0  0  0  0
x2    1  1  0  0  0  0  0
x3    0  0  1  1  1  0  0
x4    0  0  1  1  1  0  0
x5    0  0  1  1  1  0  0
x6    0  0  0  0  0  1  1
x7    0  0  0  0  0  1  1
Constructing the Similarity Matrix (soft clustering) For a soft base clustering C1 with membership probabilities P(l | x_i):
       l=1   l=2   l=3
x1     1     0     0
x2     1/2   1/2   0
x3     1/3   1/3   1/3
x4     1/4   1/2   1/4
x5     3/5   1/5   1/5
x6     2/5   2/5   1/5
x7     0     1     0
the similarity matrix is $M^{(1)}(i, j) = \sum_{l=1}^{3} P(l \mid x_i)\, P(l \mid x_j)$, e.g., M^(1)(x1, x1) = 1 and M^(1)(x1, x2) = 1/2.
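In the soft case, the similarity matrix is simply the product of the membership matrix with its own transpose; a small check using values from the table above (NumPy assumed):

```python
import numpy as np

# P[i, l] = P(l | x_i) for the first three instances in the table
P = np.array([[1,   0,   0  ],
              [1/2, 1/2, 0  ],
              [1/3, 1/3, 1/3]])
M1 = P @ P.T                    # M1[i, j] = sum_l P(l | x_i) * P(l | x_j)
print(M1[0, 0], M1[0, 1])       # 1.0 and 0.5, matching the entries shown above
```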