Model Combination. 24 November 2009




Plan
1 Principles of model combination
2 Resampling methods: Bagging, Random Forests, Boosting
3 Hybrid methods: Stacking, a generic algorithm for multistrategy learning

Model selection versus model combination

For a given learning task, we typically train several models.
- Model selection: choose the model that maximizes some performance criterion on a test set.
- Model combination: combine several models into a single aggregate model (also called an ensemble or committee).

Why combine models?

The error of an ensemble is lower than the error of an individual model, provided that:
1 the models are diverse or independent;
2 each model performs at least slightly better than chance: err < 0.5.

Justification: the error probability of a majority vote over a set of J models
1 follows from a binomial distribution (assuming independent models);
2 equals the probability that at least J/2 models are wrong.

Example: an ensemble of 21 models errs, by majority vote, when at least 11 models make an error.

[Figure: two histograms of P(k classifiers in error) against k]
- Individual error = 0.30: P(majority in error) = 0.026
- Individual error = 0.51: P(majority in error) = 0.537

p < 0.5 => P(ensemble error) < P(individual error)
p > 0.5 => P(ensemble error) > P(individual error)
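These two probabilities are just binomial tail sums; a minimal sketch in plain Python (the helper name is ours) reproduces the slide's numbers:

```python
from math import comb

def p_majority_error(J, p):
    """Probability that a strict majority of J independent
    classifiers, each with individual error rate p, is wrong."""
    return sum(comb(J, k) * p**k * (1 - p)**(J - k)
               for k in range(J // 2 + 1, J + 1))

print(p_majority_error(21, 0.30))  # ~0.026, as on the slide
print(p_majority_error(21, 0.51))  # ~0.537, as on the slide
```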

Phases of model combination
1 Diversification: choose diverse models that cover different regions of the instance space.
2 Integration: combine these models so as to maximize the performance of the ensemble.

Model diversity

The diversity of two models $h_1$ and $h_2$ is quantified in terms of their error (non-)correlation. Different definitions exist; for example, the error correlation of two models is the probability that both make the same error, given that at least one of them makes an error:

$$C_{err}(h_1, h_2) = P\big(h_1(x_i) = h_2(x_i) \mid h_1(x_i) \neq y_i \,\lor\, h_2(x_i) \neq y_i\big)$$

Many other measures of diversity have been proposed [Kuncheva, 2003].
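The slide only defines the measure; as an illustration (NumPy, helper name ours), the error correlation can be estimated empirically from the two models' predictions on a labelled sample:

```python
import numpy as np

def error_correlation(pred1, pred2, y):
    """Empirical C_err: P(same prediction | at least one model is wrong).
    pred1, pred2, y: 1-D arrays of class labels."""
    pred1, pred2, y = map(np.asarray, (pred1, pred2, y))
    either_wrong = (pred1 != y) | (pred2 != y)
    agree = pred1 == pred2   # agreeing while one is wrong => same error
    return (agree & either_wrong).sum() / either_wrong.sum()
```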

Diversification techniques

Resampling: vary the data used to train a given algorithm.
- Bagging: "bootstrap aggregation"
- Random Forests: bagging + "variable resampling"
- Boosting: resampling through adaptive weighting
Hybrid learning: vary the algorithms trained on a given dataset.
- Stacking: "stacked generalization"
- So-called multistrategy models

Integration techniques
- Static: the integration procedure is fixed, e.g., take the majority vote or the mean/median of the individual predictions.
- Dynamic: base predictions are combined by an adaptive procedure, e.g., meta-learning.


Bagging

Bagging = "bootstrap aggregation".
Diversification via resampling:
- create T bootstrap replicates of the dataset D (samples of size |D| drawn at random with replacement);
- apply a given learning algorithm to the T replicates to produce T models (hypotheses).
Static integration: the ensemble prediction is the mean (regression) or the uniform vote (classification) of the T models.
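As a minimal from-scratch sketch (NumPy arrays, scikit-learn trees as the base learner, integer class labels assumed for the vote; function names are ours):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=50, Learner=DecisionTreeClassifier):
    """Train T models, each on a bootstrap replicate of (X, y)."""
    rng = np.random.default_rng(0)
    n = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)   # |D| draws with replacement
        models.append(Learner().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Static integration: uniform vote of the T models."""
    votes = np.stack([m.predict(X) for m in models])   # shape (T, n)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```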

Bagging schema

Diversification: create bootstrap replicates of D:
  D -> D_1, D_2, ..., D_T
  each replicate D_t -> method L -> hypothesis h_t
Integration: final hypothesis H = f(h_1, h_2, ..., h_T)

[Figure: error of C4.5 with and without bagging]

Bagging: summary
- Bootstrap resampling yields highly overlapping training sets.
- Bagging reduces the error due to variance: it improves unstable, high-variance algorithms, i.e., those for which small variations in the data lead to very different models (e.g., decision trees, neural networks).
- "Reasonable" values of T: about 25 for regression, about 50 for classification [Breiman, 1996].


Random Forests (RF)
- RF combines random instance selection (bagging) with random variable selection.
- Base learning algorithm: decision trees (CART).
- A forest is an ensemble of T trees. [Breiman, 2001]

RF algorithm

TRN contains n examples described by p variables.
Hyperparameters: T = number of trees to build; m = number of candidate variables to draw at each test node, m <= p.
To build each tree:
1 create a bootstrap replicate TRN_i of TRN;
2 grow a CART tree with the following modifications: at each tree node, randomly choose m candidate variables; do not prune the tree.
Integration method: uniform vote of the T trees.
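These hyperparameters map directly onto scikit-learn's RandomForestClassifier (an illustration, not part of the slides): n_estimators plays the role of T and max_features the role of m:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,   # T trees
    max_features=2,     # m <= p (p = 4 for Iris)
    bootstrap=True,     # bagging on the instances
    oob_score=True,     # out-of-bag error estimate (see next slide)
    random_state=0,
).fit(X, y)

print(1 - forest.oob_score_)   # OOB error estimate of the forest
```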

Optimizing RF

The ensemble error ε_F depends on two factors:
- the correlation co among the trees of the forest: if co increases, ε_F increases;
- the strength s_i of the individual trees (strength ≈ prediction accuracy): if s_i increases, ε_F decreases.
Impact of m, the number of randomly drawn variables: increasing m increases both co (raising ε_F) and s (lowering ε_F), hence a co-versus-s trade-off. Choose m by estimating the error of each tree i on its out-of-bag examples VAL_i = {x_j ∈ TRN : x_j ∉ TRN_i}.
Impact of T, the number of trees: T can be increased without risk of overfitting. [Breiman, 2001]
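A sketch of this selection of m via the out-of-bag error, reusing X and y from the previous snippet (the candidate grid and tree count are arbitrary choices of ours):

```python
from sklearn.ensemble import RandomForestClassifier

best_m, best_err = None, 1.0
for m in range(1, X.shape[1] + 1):      # candidate values of m, 1..p
    rf = RandomForestClassifier(n_estimators=200, max_features=m,
                                oob_score=True, random_state=0).fit(X, y)
    err = 1 - rf.oob_score_             # OOB estimate of the forest error
    if err < best_err:
        best_m, best_err = m, err
print(best_m, best_err)
```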


Boosting: the basic idea

Diversification: sequential adaptive resampling by instance weighting.
- Initially, all instances have equal weight 1/|TRN|.
- At each of the T iterations: apply the algorithm and estimate the resubstitution error; then increase the weights of misclassified cases and decrease the weights of correctly classified cases (focusing learning on the difficult cases).
Integration by weighted voting:
- apply the T base models to the current instance;
- return the weighted majority prediction (base classifiers are weighted by their predictive power).

Boosting schema

Diversification through instance weighting:
  TRN_1 -> method L -> hypothesis h_1 -> error -> reweight
  TRN_2 -> method L -> hypothesis h_2 -> error -> reweight
  ...
  TRN_T -> method L -> hypothesis h_T
Integration: final hypothesis H = f(h_1, h_2, ..., h_T)

Bagging versus boosting

                    Bagging       Boosting
Diversification:
  resampling        in parallel   sequential
  draws             random        adaptive weighting
Integration:
  vote              uniform       weighted by 1 - ε(h_j)

AdaBoost algorithm

Input: TRN = {(x_i, y_i)}, y_i ∈ {-1, +1}; base algorithm L
Initialization: D_1(i) = 1/N (uniform distribution)
For t = 1 to T:
1 $h_t \leftarrow L(TRN, D_t)$
2 $\epsilon_t \leftarrow \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$  (weighted error)
3 if $\epsilon_t > 0.5$ then $T \leftarrow t - 1$; exit
4 compute $D_{t+1}$ by adjusting the weights of the training cases
Result: $H_{final}(x)$ = a weighted combination of the $h_t$
[Freund & Schapire, 1996-1998]

Algorithmic details

L is a weak learner ($\epsilon_t < 0.5$).

Computing $D_{t+1}$:
$$\alpha_t = \tfrac{1}{2} \ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right) \qquad (\text{since } \epsilon_t \le 0.5,\ \alpha_t \ge 0 \text{ and } e^{-\alpha_t} \le 1)$$
$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{\alpha_t} & \text{if } y_i \neq h_t(x_i) \quad (\text{weight increases})\\ e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \quad (\text{weight decreases}) \end{cases}$$
or, equivalently, $D_{t+1}(i) = \frac{D_t(i)}{Z_t}\, \exp(-\alpha_t\, y_i\, h_t(x_i))$, where $Z_t$ is a normalization factor ensuring $\sum_i D_{t+1}(i) = 1$.

$$H_{final}(x) = \operatorname{sgn}\left(\sum_{t=1}^{T} \alpha_t\, h_t(x)\right)$$
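These update rules translate almost line for line into code; a minimal sketch (NumPy, with scikit-learn decision stumps as the weak learner L; labels must be NumPy arrays in {-1, +1}, and function names are ours):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """AdaBoost as on the slides; y is a NumPy array in {-1, +1}."""
    n = len(X)
    D = np.full(n, 1.0 / n)                 # D_1(i) = 1/N
    hyps, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()            # weighted error eps_t
        if eps >= 0.5 or eps == 0:          # weak-learning condition violated
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        D *= np.exp(-alpha * y * pred)      # up-weight mistakes, down-weight hits
        D /= D.sum()                        # Z_t normalization
        hyps.append(h); alphas.append(alpha)
    return hyps, alphas

def adaboost_predict(hyps, alphas, X):
    """H_final(x) = sgn(sum_t alpha_t h_t(x))."""
    return np.sign(sum(a * h.predict(X) for h, a in zip(hyps, alphas)))
```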

[Figure: error of C4.5 with and without boosting]

[Figure: performance of bagging versus boosting]

Boosting: summary
- The power of boosting comes from adaptive resampling.
- Like bagging, boosting reduces variance.
- It also reduces bias, by obliging the learner to focus on hard cases: the combined hypothesis is more flexible.
- Fast convergence.
- Sensitive to noise: when base learners misclassify noisy examples, the weights of those examples keep increasing, which leads to overfitting the noise.


Stacking

"Stacked generalization" [Wolpert, 1992]
- Level 0: diversification through the use of different learning algorithms.
- Level 1: integration through meta-learning.

Stacking schema

Diversification: methods L_1, L_2, ..., L_T are each trained on data D_0, yielding hypotheses h_1, h_2, ..., h_T.
Meta-data: data D_1 = target values of D_0 + the predictions of the h_i.
Integration: a method L is trained on D_1; final hypothesis = L(D_1).

Level 0: Diversification

Input: D_0 = {(x_i, y_i)}, i = 1..N; J learning algorithms L_j.
Diversification by K-fold cross-validation:
  for k = 1 to K:
    for j = 1 to J:
      h_kj <- L_j(D_0 \ TST_k)
      for each x_i ∈ TST_k: pred_ij <- h_kj(x_i)
Result: J predictions for each instance of D_0.
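scikit-learn's cross_val_predict implements exactly this leave-one-fold-out prediction scheme; a sketch in which the three base algorithms mirror the J48/IB1/NB example of the next slides via their closest scikit-learn analogues:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier      # ~ J48
from sklearn.neighbors import KNeighborsClassifier   # ~ IB1 (k = 1)
from sklearn.naive_bayes import GaussianNB           # ~ NB

level0 = [DecisionTreeClassifier(), KNeighborsClassifier(1), GaussianNB()]

def level0_metadata(X, y, learners, K=10):
    """Build D_1: one column of out-of-fold predictions per base learner."""
    cols = [cross_val_predict(L, X, y, cv=K) for L in learners]
    return np.column_stack(cols)   # shape (N, J)
```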

Level 1 meta-data

Notation: i = 1..N examples; j = 1..J models.
Input: D_1 = {(z_i, y_i)}, where z_i ∈ R^J collects the base predictions and y_i is the target value.
- Regression: z_i = [z_i1, ..., z_iJ], where z_ij is the approximation of y_i by model h_j.
- Classification:
  - discrete classifier: z_i = [z_i1, ..., z_iJ], with z_ij ∈ C the class predicted by model h_j for example i;
  - probabilistic classifier: z_i = [z_i1, ..., z_iJ], with z_ij = (z_ij1, z_ij2, ..., z_ij|C|) the class probability distribution predicted by h_j.

Examples of meta-data

Level 0 (base level): Iris; L_1 = J48, L_2 = IB1, L_3 = NB.

Level 1, discrete classifiers:
  #     J48        IB1     NB          Target
  1     setosa     setosa  versicolor  setosa
  ...
  150   virginica  setosa  virginica   virginica

Level 1, probabilistic classifiers (p1 p2 p3 per model):
  #     J48             IB1             NB              Target
  1     0.58 0.22 0.20  0.74 0.20 0.06  0.37 0.59 0.04  setosa
  ...
  150   0.02 0.16 0.82  0.45 0.39 0.16  0.36 0.23 0.41  virginica

Level 1: Integration through meta-learning

The choice of level-1 algorithm depends on the variables of D_1:
- real-valued predictions (regression or probabilistic classification): L_M ∈ {trees, IBk, MLP, SVM, linear regressors, ...};
- discrete predictions (discrete classification): L_M ∈ {trees, IBk, Naive Bayes, ...}.
The combined model: H <- L_M(D_1).

Using stacked models

After creating the combined model H, create the final base models h_j by training the J algorithms on all of D_0.
In production mode, prediction is a 2-step process given a new case x ∈ R^d:
- Level 0: generate a meta-example z ∈ R^J: for j = 1 to J, z_j <- h_j(x).
- Level 1: combined prediction H(z).
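Continuing the level-0 sketch above (it reuses level0 and level0_metadata), an end-to-end illustration; the tree meta-learner is one of the slide's suggested choices for discrete predictions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

Z = level0_metadata(X, y, level0, K=10)    # level-1 training data D_1
H = DecisionTreeClassifier().fit(Z, y)     # combined model H = L_M(D_1)
base = [L.fit(X, y) for L in level0]       # final base models, trained on all of D_0

def stacked_predict(X_new):
    """2-step prediction: level 0 builds the meta-example z, level 1 applies H."""
    z = np.column_stack([h.predict(X_new) for h in base])
    return H.predict(z)
```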

Stacking: summary
- Stacking improves considerably over cross-validated model selection.
- On average, about 3 hypotheses play a significant role in the combined hypothesis.
- Stacking is just one example of hybrid model combination; many other schemes have been proposed.

Generic multistrategy learning algorithm

Diversification:
- train M learning algorithms on a base-level dataset and measure their performance on a test set;
- select the J models with minimal error correlation.
Integration:
- static: for continuous predictions, compute the mean, median, a linear combination, etc.; for discrete predictions, take a uniform or weighted vote;
- dynamic: use meta-learning.
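A compact sketch of the diversification step (it reuses error_correlation from the model-diversity sketch above; exhaustive search over J-subsets, feasible only for small M):

```python
import numpy as np
from itertools import combinations

def select_diverse(models, X_test, y_test, J=3):
    """Among M fitted models, pick the J-subset with the lowest
    mean pairwise error correlation on a held-out test set."""
    preds = [m.predict(X_test) for m in models]
    def mean_corr(subset):
        return np.mean([error_correlation(preds[a], preds[b], y_test)
                        for a, b in combinations(subset, 2)])
    best = min(combinations(range(len(models)), J), key=mean_corr)
    return [models[i] for i in best]
```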

References

L. Breiman (1996). Bagging Predictors. Machine Learning 24(2): 123-140.
L. Breiman (2001). Random Forests. Machine Learning 45: 5-32.
T. G. Dietterich (1998). Machine Learning Research: Four Current Directions. AI Magazine 18(4): 97-136.
Y. Freund, R. E. Schapire (1996). Experiments with a New Boosting Algorithm. ICML-96.
L. Kuncheva, C. J. Whitaker (2003). Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy. Machine Learning 51: 181-207.
D. Wolpert (1992). Stacked Generalization. Neural Networks 5: 241-259.