CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Sinno Jialin PAN, Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore. Acknowledgements: slides are adapted from the lecture notes of the book Introduction to Data Mining (Chap. 5) and from the SDM 2010 tutorial On the Power of Ensemble: Supervised and Unsupervised Methods Reconciled by Jing Gao et al.

Ensemble Methods. Objective: to improve model performance (accuracy) by aggregating the predictions of multiple models. How to do it: construct a set of base models from the training data, then make predictions by combining the results produced by the individual base models.

General Idea

Stories of Success. Million-dollar prize: improve the accuracy of Netflix's baseline movie-recommendation approach by 10%; the top submissions all combined several algorithms into an ensemble. Data mining competitions on Kaggle: winning teams routinely employ ensembles of classifiers.

Netflix Prize. A supervised learning task: the training data is a set of users, a set of movies, and the ratings (1, 2, 3, 4, or 5 stars) given to movies by users. The goal is to construct a classifier that, given a user and an unrated movie, correctly predicts the user's rating of that movie (1, 2, 3, 4, or 5 stars). A $1 million prize was offered for a 10% improvement over Netflix's own movie recommender. During the competition, single-model methods were developed first and performance improved, but the improvements soon slowed down. Later, individuals and teams merged their results into ensembles, and significant further improvements were observed.

Leaderboard: "Our final solution consists of blending 107 individual results. Predictive accuracy is substantially improved when blending multiple predictors. Our experience is that most efforts should be concentrated in deriving substantially different approaches, rather than refining a single technique."

Why Do Ensembles Work? Suppose there are 3 base classifiers, C_1, C_2, and C_3, and each classifier has error rate ε = 0.35 (accuracy acc = 0.65). Given a test instance x, if we choose any one of these classifiers to make the prediction, the probability that it makes a wrong prediction is 35%.

Why Do Ensembles Work? Combine the 3 base classifiers to predict the class label of a test instance by taking a majority vote over their predictions. If we assume the classifiers are independent, the ensemble makes a wrong prediction only if at least 2 of the 3 base classifiers predict incorrectly.

Why Do Ensembles Work? Example: a test instance x whose true label is -1. (Table: all possible combinations of predictions by C_1, C_2, C_3, each with error rate 35% and accuracy 65%; the majority vote is a wrong prediction exactly when two or more of the base classifiers output +1, and a correct prediction otherwise.)

Why Do Ensembles Work? Therefore, the probability that the ensemble classifier makes a wrong prediction is
$$P(\text{wrong}) = \sum_{i=2}^{3} \binom{3}{i} \varepsilon^{i} (1-\varepsilon)^{3-i} = 3 \times 0.35^{2} \times 0.65 + 0.35^{3} \approx 0.2817,$$
that is, the accuracy of the ensemble classifier is about 71.83%.

Why Do Ensembles Work? Suppose there are 25 independent base classifiers, each with error rate ε = 0.35. The probability that the ensemble classifier makes a wrong prediction is
$$P(\text{wrong}) = \sum_{i=13}^{25} \binom{25}{i} \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06,$$
that is, the accuracy of the ensemble classifier is about 94%.
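
As a quick check of these numbers, here is a minimal Python sketch of the binomial sum above; it assumes only the standard library, and the helper name ensemble_error is made up for illustration:

```python
from math import comb

def ensemble_error(n_classifiers, eps):
    """Probability that a majority vote of independent classifiers,
    each with error rate eps, makes a wrong prediction."""
    k = n_classifiers // 2 + 1  # smallest number of wrong base classifiers that flips the vote
    return sum(comb(n_classifiers, i) * eps**i * (1 - eps)**(n_classifiers - i)
               for i in range(k, n_classifiers + 1))

print(ensemble_error(3, 0.35))   # ~0.2818, i.e. accuracy ~71.8%
print(ensemble_error(25, 0.35))  # ~0.06,   i.e. accuracy ~94%
```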

Why Do Ensembles Work? (Figure: several models, each fitting a different part of some unknown distribution; individually each captures only a local view, but the ensemble gives the global picture!)

Why Is Independence Necessary? (Figure: predictions of two base classifiers C_1 and C_2 on the same examples, contrasting the case where their errors coincide with the case where their errors are independent.)

Necessary Conditions. (Figure: error rate of an ensemble of 25 binary classifiers as a function of the base-classifier error rate, for two cases: base classifiers that are identical (perfectly correlated) and base classifiers that are independent.) Observation: the ensemble classifier performs worse than the base classifiers when the base-classifier error rate is larger than 0.5.

Necessary Conditions. Two necessary conditions for an ensemble classifier to perform better than a single classifier: (1) the base classifiers should be independent of each other; in practice this condition can be relaxed, so that slightly correlated base classifiers still help. (2) The base classifiers should do better than a classifier that performs random guessing (e.g., for binary classification, accuracy should be better than 0.5).

Ensemble Learning Methods. Supervised ensemble learning methods: classification, regression. Unsupervised ensemble learning methods: clustering.

Supervised Ensemble Methods. How to generate an ensemble of classifiers? By manipulating the training set: multiple training sets are created by resampling the original data according to some sampling distribution, and a classifier is then built from each training set (Bagging, Boosting). By manipulating the input features: a subset of input attributes is chosen to form each training set; the subset can be chosen either randomly or using domain knowledge (Random Forest). By manipulating the learning algorithm: applying the algorithm several times to the same training data with different parameters.

Supervised Ensemble Method: General Procedure
1. Let D denote the original training data, k the number of base classifiers, and T the test data.
2. for i = 1 to k do
3.   Create a training set D_i from D.
4.   Build a base classifier C_i from D_i.
5. end for
6. for each test record x in T do
7.   C*(x) = Vote(C_1(x), C_2(x), ..., C_k(x))   // majority voting (other combination schemes are possible)
8. end for
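
A minimal Python sketch of this generic build-and-vote loop, assuming scikit-learn-style base learners with fit/predict; the create_training_set callable is a placeholder for whatever resampling scheme is used:

```python
import numpy as np
from collections import Counter

def build_ensemble(X, y, base_learner_factory, k, create_training_set):
    """Build k base classifiers, each trained on a training set derived from (X, y)."""
    classifiers = []
    for _ in range(k):
        X_i, y_i = create_training_set(X, y)   # e.g. a bootstrap sample
        clf = base_learner_factory()
        clf.fit(X_i, y_i)
        classifiers.append(clf)
    return classifiers

def ensemble_predict(classifiers, X_test):
    """Majority vote over the base classifiers' predictions."""
    all_preds = np.array([clf.predict(X_test) for clf in classifiers])  # shape (k, n_test)
    return np.array([Counter(col).most_common(1)[0][0] for col in all_preds.T])
```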

Bagging. Also known as bootstrap aggregating: repeatedly sample from the training data with replacement according to a uniform probability distribution. Build a classifier on each bootstrap sample, which has the same size as the original data. Use majority voting to determine the class label predicted by the ensemble classifier.

Bagging. Example of bootstrap samples (each entry is the index of an instance in the original data):
Original data: 1 2 3 4 5 6 7 8 9 10
Round 1 (trains C_1): 7 8 10 8 2 5 10 10 5 9
Round 2 (trains C_2): 1 4 9 1 2 3 2 7 3 2
Round 3 (trains C_3): 1 8 5 10 5 5 9 6 3 7

Bagging. In each draw, a training example has probability 1 − 1/N of not being selected, so its probability of ending up not in a bootstrap training set D_i is $(1 - 1/N)^N \approx 1/e \approx 0.368$ for large N. A bootstrap sample D_i therefore contains approximately 63.2% of the original training examples.
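
A small bagging sketch in Python, assuming scikit-learn decision trees as base learners and NumPy arrays for the data (the function names and parameter values are illustrative only):

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=10, seed=0):
    """Train one decision tree per bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)               # n indices drawn with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X_test):
    """Majority vote over the trees."""
    preds = np.array([t.predict(X_test) for t in trees])
    return np.array([Counter(col).most_common(1)[0][0] for col in preds.T])
```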

Boosting. Principle: boost a set of weak learners into a strong learner by making the records that are currently misclassified more important. In general, it is an iterative procedure that adaptively changes the distribution of the training data so that the base classifiers focus more on previously misclassified records.

Boosting. Specifically: initially, all N records are assigned equal weights. Unlike bagging, the weights may change at the end of each boosting round. In each boosting round, after the weights have been assigned to the training examples, we can either draw a bootstrap sample from the original data using the weights as a sampling distribution and build a model on it, or learn a model that is directly biased toward the higher-weighted examples.

Boosting: Procedure (Resampling Based on Instance Weights)
1. Initially, the examples are assigned equal weights 1/N, so that they are equally likely to be chosen for training; a sample is drawn uniformly to obtain a new training set.
2. A classifier is induced from the training set and used to classify all the examples in the original training set.
3. The weights of the training examples are updated at the end of the boosting round: records that are wrongly classified have their weights increased, and records that are classified correctly have their weights decreased.
4. Repeat Steps 2 and 3 until the stopping condition is met.
5. Finally, the ensemble is obtained by aggregating the base classifiers from each boosting round.

Boosting: Example. Initially, all the examples are assigned the same weight 1/10:
Weights: 1/10 1/10 1/10 1/10 1/10 1/10 1/10 1/10 1/10 1/10
Original data: 1 2 3 4 5 6 7 8 9 10
A sample is drawn uniformly at random by bootstrapping, e.g. Round 1: 7 3 2 8 7 9 4 10 6 3. A classifier C_1 is built from this sample and then applied to all the original instances.

Boosting: Example. The weights are adjusted according to whether each instance was chosen in the previous round and whether it was misclassified by the classifier trained in the previous round. E.g., instance 4 was misclassified in Round 1, so its weight is increased; instance 5 was not chosen in Round 1, so its weight is increased as well; and so on:
Weights: 1/10 1/5 1/20 1/5 1/5 1/20 1/20 1/20 1/20 1/20
Original data: 1 2 3 4 5 6 7 8 9 10
The next sample is drawn according to the weight of each instance, e.g. Round 2: 5 4 9 4 2 5 1 7 4 2. A classifier C_2 is built from this sample and applied to all the original instances.

Boosting: Example. The weights are again adjusted according to whether the instances were chosen in the previous round and whether they were misclassified by the classifier trained in the previous round, and a new sample is drawn based on the weight of each instance, e.g. Round 3: 4 4 8 10 4 5 4 6 3 4. As the boosting rounds proceed, the examples that are hardest to classify tend to become even more prevalent in the samples, e.g. instance 4.

Alternative: Weighted Classifier. Instead of resampling, learn a model that is biased toward the higher-weighted examples by minimizing the weighted error
$$E = \frac{1}{N} \sum_{j=1}^{N} w_j \, \delta\big(f(x_j) \neq y_j\big),$$
where w_j is the weight of instance x_j, and δ(p) = 1 if the predicate p is true and 0 otherwise.

Boosting: AdaBoost. Let D = {(x_i, y_i) | i = 1, 2, ..., N} be the set of training examples. In AdaBoost, let f_1, f_2, ..., f_T be the base classifiers obtained in the boosting rounds. The error rate of classifier f_i is
$$\varepsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, \delta\big(f_i(x_j) \neq y_j\big),$$
and the importance of the classifier is
$$\alpha_i = \frac{1}{2} \ln \frac{1-\varepsilon_i}{\varepsilon_i}.$$

Boosting: AdaBoost. Weight update:
$$w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} e^{-\alpha_j} & \text{if } f_j(x_i) = y_i \\ e^{\alpha_j} & \text{if } f_j(x_i) \neq y_i \end{cases}$$
where $w_i^{(j)}$ denotes the weight assigned to example (x_i, y_i) during the j-th boosting round, and Z_j is the normalization factor ensuring $\sum_i w_i^{(j+1)} = 1$.

Boosting: AdaBoost. If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/N and the resampling procedure is repeated. Classification:
$$f^*(x) = \arg\max_{y} \sum_{j=1}^{T} \alpha_j \, \delta\big(f_j(x) = y\big).$$
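
A compact AdaBoost sketch in Python for binary labels in {-1, +1}, using weighted decision stumps from scikit-learn as base learners; this follows the update rules above but is only an illustrative sketch, not a reference implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=20):
    """y must take values in {-1, +1}. Returns (classifiers, alphas)."""
    N = len(y)
    w = np.full(N, 1.0 / N)
    classifiers, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)        # model biased toward high-weight examples
        pred = stump.predict(X)
        eps = np.sum(w * (pred != y))           # weighted error rate
        if eps >= 0.5:                          # worse than random: reset the weights
            w = np.full(N, 1.0 / N)
            continue
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))
        w = w * np.exp(-alpha * y * pred)       # decrease if correct, increase if wrong
        w /= w.sum()                            # normalization factor Z_j
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    scores = sum(a * clf.predict(X) for a, clf in zip(alphas, classifiers))
    return np.sign(scores)
```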

Illustrating AdaBoost. Example: 1-dimensional examples (a single attribute) with binary class labels. Initially each data point has weight 0.1. (Figure: Boosting Round 1 trains classifier B1 on a sample of the data; its decision boundary misclassifies some examples, whose weights are increased to 0.4623 while the others drop to 0.0094; the importance of the classifier is α = 1.9459.)

Illustrating AdaBoost. (Figure: three boosting rounds. Round 1: classifier B1, weights 0.0094 / 0.0094 / 0.4623, predictions + + + - - - - - - -, importance α = 1.9459. Round 2: classifier B2, weights 0.3037 / 0.0009 / 0.0422, predictions - - - - - - - - + +, α = 2.9323. Round 3: classifier B3, weights 0.0276 / 0.1819 / 0.0038, predictions + + + + + + + + + +, α = 3.8744. Overall ensemble result: + + + - - - - - + +.)

Random Forests. A class of ensemble methods specifically designed for decision tree classifiers. A random forest grows many trees; each tree is generated based on the values of an independent random vector, sampled from a fixed probability distribution. The final result when classifying a new instance is determined by voting: the forest chooses the classification result that receives the most votes over all the trees in the forest.

Random Forests. (Figure: illustration of a random forest.)

Random Forests: Algorithm. Choose T, the number of trees to grow, and m < M (M is the total number of features), the number of features used to compute the best split at each node (typically around 20% of M). For each tree: choose a training set by bootstrapping; at each node, randomly choose m features and compute the best split over those features only; the tree is fully grown and not pruned. Classify new instances by majority vote among all the trees.
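
In practice random forests are available off the shelf; a short scikit-learn usage sketch along these lines (the dataset and parameter values are illustrative assumptions, not prescribed by the slides):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# T = n_estimators trees; m = max_features features considered at each split;
# bootstrap=True draws a bootstrap sample per tree; trees are left unpruned by default.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X_tr, y_tr)
print("test accuracy:", forest.score(X_te, y_te))
```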

Random Forests: Discussion. Random forests = bagging + random feature selection. Improved accuracy: the random feature subsets incorporate more diversity among the trees. Improved efficiency: searching among a subset of the features is much faster than searching among the complete set.

Combination Methods. Average: simple average, weighted average. Voting: majority voting, plurality voting, weighted voting. Combining by learning.

Average.
Simple average: $f^*(x) = \frac{1}{T} \sum_{i=1}^{T} f_i(x)$.
Weighted average: $f^*(x) = \sum_{i=1}^{T} w_i f_i(x)$, where $w_i \ge 0$ and $\sum_{i=1}^{T} w_i = 1$.
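
A tiny NumPy sketch of these two combiners over per-classifier predicted scores; the array shapes and numbers are made up for illustration:

```python
import numpy as np

# scores[i, j] = predicted score of base classifier i on instance j (shape T x N)
scores = np.array([[0.6, 0.3, 0.1],
                   [0.9, 0.45, 0.5],
                   [0.3, 0.6, 0.2]])

simple_avg = scores.mean(axis=0)      # (1/T) * sum_i f_i(x)

w = np.array([0.5, 0.3, 0.2])         # non-negative weights summing to 1
weighted_avg = w @ scores             # sum_i w_i * f_i(x)
print(simple_avg, weighted_avg)
```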

Voting. Majority voting: every classifier votes for one class label, and the final output is the class label that receives more than half of the votes; if no class label receives more than half of the votes, a rejection option is given and the combined classifier makes no prediction. Plurality voting: take the class label that receives the largest number of votes as the final winner. Weighted voting: a generalized version of plurality voting that introduces a weight for each classifier.
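
A short Python sketch of the three voting schemes over a list of predicted labels; the function names are made up for illustration:

```python
from collections import Counter

def plurality_vote(labels):
    """Return the label with the most votes."""
    return Counter(labels).most_common(1)[0][0]

def majority_vote(labels):
    """Return the label with more than half of the votes, or None (rejection)."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

def weighted_vote(labels, weights):
    """Sum the weights of the classifiers voting for each label, return the heaviest."""
    tally = {}
    for label, w in zip(labels, weights):
        tally[label] = tally.get(label, 0.0) + w
    return max(tally, key=tally.get)

votes = [+1, -1, +1, +1, -1]
print(plurality_vote(votes), majority_vote(votes),
      weighted_vote(votes, [0.1, 0.4, 0.2, 0.1, 0.2]))
```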

Combining by Learning. Stacking: a general procedure in which a learner is trained to combine the individual learners. The individual learners are called first-level learners; the combiner is the second-level learner, or meta-learner.

Combining by Learning: Illustration. Suppose we are given a binary classification problem with N instances and T base classifiers. Each entry below is the predicted value of a classifier on an instance (e.g., 0.6 is the prediction of C_1 on x_1); together, the predictions of the T base classifiers form a new feature vector for each instance x_i:

Instance | C_1 | C_2 | ... | C_T | Label Y
x_1      | 0.6 | 0.9 | ... | 0.3 |  1
x_2      | 0.3 | 0.45| ... | 0.6 | -1
...      |     |     |     |     |
x_N      | 0.1 | 0.5 | ... | 0.2 |  1

Combining by Learning: Illustration. Given D' = {(x_i', y_i) | i = 1, 2, ..., N}, where x_i' = [C_1(x_i), ..., C_T(x_i)], learn a model in terms of w = [w_1, ..., w_T] such that the difference between y_i and t_i = w · x_i' is as small as possible. The entries w_1, ..., w_T are the weights of the corresponding base classifiers.

Combining by Learning: Avoid Overfitting. (Figure: the whole training dataset is split into a part used for training and a part used for evaluation; the first part trains the first-level learners, and their predictions on the held-out part are used to train the meta-learner.)
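
A minimal stacking sketch along these lines in Python with scikit-learn learners; the dataset, split sizes, and choice of meta-learner are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# Hold out part of the training data so the meta-learner is not trained
# on predictions the first-level learners have already memorized.
X_first, X_meta, y_first, y_meta = train_test_split(X, y, test_size=0.5, random_state=0)

first_level = [DecisionTreeClassifier(random_state=0), KNeighborsClassifier()]
for clf in first_level:
    clf.fit(X_first, y_first)

# Meta-features: each column is one base classifier's prediction on the held-out part.
Z_meta = np.column_stack([clf.predict(X_meta) for clf in first_level])
meta_learner = LogisticRegression().fit(Z_meta, y_meta)  # learns the combination weights w

# Predicting: first-level predictions, then the meta-learner (reusing a few rows as "new" data).
Z_new = np.column_stack([clf.predict(X_meta[:5]) for clf in first_level])
print(meta_learner.predict(Z_new))
```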

Unsupervised Ensemble Methods. Clustering ensembles: given an unlabeled data set D = {x_1, x_2, ..., x_N}, an ensemble approach computes a set of clustering solutions {C_1, C_2, ..., C_T}, each of which maps a data point to a cluster index, C_j(x) = m, and a unified clustering solution C* that combines the base clustering solutions by their consensus.

Clustering Ensembles. Example with 7 instances, 4 base clusterings C_1 ... C_4, and the ensemble clustering C* (each entry is the index of a cluster):

Instance | C_1 | C_2 | C_3 | C_4 | C*
x_1      |  1  |  2  |  3  |  1  | 1
x_2      |  1  |  3  |  3  |  3  | 1
x_3      |  2  |  3  |  1  |  3  | 1
x_4      |  2  |  2  |  1  |  4  | 2
x_5      |  2  |  2  |  1  |  4  | 2
x_6      |  3  |  1  |  2  |  2  | 3
x_7      |  3  |  1  |  2  |  2  | 3

Clustering Ensembles: Challenges. The setting is unsupervised: the correspondence between the clusters in different clustering solutions is unknown, and the resulting combinatorial optimization problem is NP-complete.

Clustering Ensembles: Challenges. In the table above, C_1 and C_3 produce identical clustering results, the partition {{x_1, x_2}, {x_3, x_4, x_5}, {x_6, x_7}}, yet they assign different index numbers to the clusters, so equal cluster indices across base clusterings may not represent the same cluster. Moreover, the numbers of clusters in different base clusterings can be different.

Clustering Ensembles: Similarity-based Methods.
Input: data set D = {x_1, x_2, ..., x_N}; base clustering algorithms {C_1, C_2, ..., C_T}; a clustering algorithm C for generating the final result.
Process:
1. For i = 1, ..., T:
2.   Form a base clustering of D with k^(i) clusters.
3.   Derive an N × N similarity matrix M^(i) based on the clustering result.
4. End
5. Form the consensus similarity matrix $M = \frac{1}{T} \sum_{i=1}^{T} M^{(i)}$.
6. Run C on M to generate k clusters.
Output: the ensemble clustering result obtained by C.

Constructing the Similarity Matrix (crisp clustering). M^(1)(i, j) = 1 if x_i and x_j belong to the same cluster under C_1, and 0 otherwise. For example, with C_1 assigning x_1, x_2 to cluster 1, x_3, x_4, x_5 to cluster 2, and x_6, x_7 to cluster 3:

M^(1):
      x_1 x_2 x_3 x_4 x_5 x_6 x_7
x_1 |  1   1   0   0   0   0   0
x_2 |  1   1   0   0   0   0   0
x_3 |  0   0   1   1   1   0   0
x_4 |  0   0   1   1   1   0   0
x_5 |  0   0   1   1   1   0   0
x_6 |  0   0   0   0   0   1   1
x_7 |  0   0   0   0   0   1   1
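
A one-line NumPy sketch of this construction; the label vector is simply the C_1 cluster assignment from the example above:

```python
import numpy as np

labels = np.array([1, 1, 2, 2, 2, 3, 3])                 # C_1's cluster index for each instance
M1 = (labels[:, None] == labels[None, :]).astype(float)  # 1 if same cluster, else 0
print(M1)
```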

Constructing the Similarity Matrix (soft clustering). When the base clustering C_1 is soft, each instance x_i has a membership probability P(l | x_i) for each cluster l, and the similarity matrix is
$$M^{(1)}(i, j) = \sum_{l=1}^{3} P(l \mid x_i)\, P(l \mid x_j).$$
Membership probabilities P(l | x_i) for clusters l = 1, 2, 3:

Instance | l = 1 | l = 2 | l = 3
x_1      |   1   |   0   |   0
x_2      |  1/2  |  1/2  |   0
x_3      |  1/3  |  1/3  |  1/3
x_4      |  1/4  |  1/2  |  1/4
x_5      |  3/5  |  1/5  |  1/5
x_6      |  2/5  |  2/5  |  1/5
x_7      |   0   |   1   |   0

For example, M^(1)(x_1, x_1) = 1 and M^(1)(x_2, x_1) = M^(1)(x_2, x_2) = 1/2; the remaining entries are computed in the same way.
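
A closing NumPy sketch of the soft-clustering similarity matrix and the consensus step; using SpectralClustering as the final algorithm C is an illustrative assumption, not something prescribed by the slides:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Membership probabilities P(l | x_i) from the example above (rows: x_1..x_7, cols: l = 1..3).
P = np.array([[1,   0,   0],
              [1/2, 1/2, 0],
              [1/3, 1/3, 1/3],
              [1/4, 1/2, 1/4],
              [3/5, 1/5, 1/5],
              [2/5, 2/5, 1/5],
              [0,   1,   0]])

M1 = P @ P.T                       # M^(1)(i, j) = sum_l P(l|x_i) * P(l|x_j)

# With several base clusterings, average their similarity matrices ...
M = np.mean([M1], axis=0)          # only one matrix here, for brevity
# ... and run the final clustering algorithm C on the consensus matrix.
labels = SpectralClustering(n_clusters=3, affinity="precomputed",
                            random_state=0).fit_predict(M)
print(labels)
```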