Chapter 11: Boosting. Xiaogang Su, Department of Statistics, University of Central Florida




Perturb and Combine (P&C)
Perturb and combine (P&C) methods have been devised to take advantage of the instability of trees to create models that are more powerful. They generate multiple models by manipulating the distribution of the data or altering the construction method, and then averaging the results (Breiman, 1998). Any unstable modeling method can be used, but trees are chosen most often because of their speed and flexibility.
Commonly used methods:
o Boosting
o Bagging
o Random Forests

Variance Reduction by P&C
The attractiveness of P&C methods is their improved performance over single models. Bauer and Kohavi (1999) demonstrated the superiority of P&C methods with extensive experimentation. One reason simple P&C methods give improved performance is variance reduction: if the base models have low bias and high variance, averaging decreases the variance. In contrast, combining stable models can negatively affect performance.

Perturb Methods
o Resample
o Subsample
o Add noise
o Adaptively reweight
o Randomly choose from among the competitor splits

Combine
An ensemble model is the combination of multiple models. The combinations can be formed by:
o Voting on the classifications
o Weighted voting, where some models have more weight
o Averaging (weighted or unweighted) of the predicted values
Ensemble methods are a very active area of research in machine learning and statistics, and many other P&C methods have been devised.

Boosting: A Short History
o In 1996, Freund and Schapire proposed the well-known AdaBoost.M1 algorithm. Its variants include:
  - Discrete AdaBoost
  - Real AdaBoost
  - Gentle AdaBoost
o Arcing (adaptive resampling and combining) by Breiman (1998), e.g., Arc-x4
o Stochastic Gradient Boosting (SGB) by Friedman (2001, Annals of Statistics)

AdaBoost: Basic Idea
AdaBoost generates a sequentially weighted set of weak base classifiers that are combined to form an overall strong classifier. At each step of the sequence, AdaBoost attempts to find an optimal classifier according to the current distribution of weights on the observations. If an observation is incorrectly classified under the current distribution of weights, it receives more weight in the next iteration; correctly classified observations receive less weight in the next iteration. In the final overall model, classifiers that are accurate predictors of the training data receive more weight, whereas classifiers that are poor predictors receive less weight.

AdaBoost: The Schematic (diagram not reproduced in this transcription)

AdaBoost: The Algorithm (algorithm display not reproduced in this transcription)
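Since the algorithm display did not survive the transcription, here is a minimal R sketch of discrete AdaBoost using depth-1 rpart trees (stumps) as the weak learners; the function names, the -1/+1 coding of the response, and the stopping rule are illustrative choices of my own, not the exact algorithm from the slide.

```r
# Minimal sketch of discrete AdaBoost with depth-1 rpart trees (stumps).
# Assumes y is coded as -1/+1 and x is a data frame of predictors.
library(rpart)

ada_sketch <- function(x, y, iter = 50) {
  n <- length(y)
  w <- rep(1 / n, n)                      # start with equal weights
  trees <- vector("list", iter)
  alpha <- numeric(iter)
  dat <- data.frame(x, .y = factor(y))
  for (m in seq_len(iter)) {
    fit <- rpart(.y ~ ., data = dat, weights = w,
                 control = rpart.control(maxdepth = 1, cp = -1))
    pred <- ifelse(predict(fit, dat, type = "class") == "1", 1, -1)
    err <- sum(w * (pred != y))           # weighted training error
    if (err <= 0 || err >= 0.5) break     # weak learner perfect or useless
    alpha[m] <- 0.5 * log((1 - err) / err)
    w <- w * exp(-alpha[m] * y * pred)    # misclassified cases get more weight
    w <- w / sum(w)
    trees[[m]] <- fit
  }
  structure(list(trees = trees, alpha = alpha), class = "ada_sketch")
}

# Final classifier: sign of the alpha-weighted vote of the stumps.
predict.ada_sketch <- function(object, newdata, ...) {
  keep <- !vapply(object$trees, is.null, logical(1))
  votes <- sapply(which(keep), function(m) {
    p <- predict(object$trees[[m]], newdata, type = "class")
    object$alpha[m] * ifelse(p == "1", 1, -1)
  })
  sign(rowSums(as.matrix(votes)))
}
```

The final prediction is the sign of the weighted vote of the stumps, mirroring the weighting scheme described above: accurate classifiers get larger alpha and thus more say in the vote.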

Arcing
Arcing (adaptive resampling and combining; Arc-x4, Breiman, 1998) is a simplified version of the AdaBoost (adaptive boosting) algorithm of Freund and Schapire (1996). It gives performance similar to AdaBoost (Breiman, 1998; Bauer and Kohavi, 1999). Unlike in bagging, pruning the individual trees and selecting the optimally sized tree improves performance in boosting (Bauer and Kohavi, 1999).

Arcing: The Weights
At the k-th step, a model (decision tree) is fitted using weights for each case. For the i-th case, the Arc-x4 weight is

p(i) = \frac{1 + m(i)^4}{\sum_{j=1}^{n} \left\{ 1 + m(j)^4 \right\}},

where 0 \le m(i) \le k is the number of times that the i-th case has been misclassified in the preceding steps.

Arcing: Two Ways of Using the Weights
The weights are incorporated either through a weighted analysis or by resampling the data so that the probability that the i-th case is selected is p(i). For convenience, the weights can be normalized to frequencies by multiplying by the sample size n (as shown in the following table). Bauer and Kohavi (1999) found that resampling performed better than reweighting for Arc-x4 but did not change the performance of AdaBoost. AdaBoost uses a different (more complicated) formula for p(i); both formulas put greater weight on cases that are frequently misclassified.

Arcing: A Simple Example by Hand

Table 1: An illustration of Arc-x4. Freq is the normalized frequency n * p(i) used at step k; m is the cumulative number of times the case has been misclassified after the preceding step.

Case    Freq(k=1)   m   Freq(k=2)   m   Freq(k=3)   m   Freq(k=4)
1          1        1      1.50     1      0.50     2      0.97
2          1        0      0.75     0      0.25     0      0.06
3          1        1      1.50     2      4.25     3      4.69
4          1        0      0.75     1      0.50     1      0.11
5          1        0      0.75     0      0.25     0      0.06
6          1        0      0.75     0      0.25     1      0.11
Total    n = 6             n = 6            n = 6            n = 6
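As a check on the arithmetic in Table 1, the frequencies can be reproduced with a few lines of R. The misclassification pattern below (which cases are misclassified after each step) is an assumption chosen to be consistent with the table, since the slide does not state it explicitly.

```r
# Arc-x4 frequencies n * p(i) from cumulative misclassification counts m(i).
arcx4_freq <- function(m) length(m) * (1 + m^4) / sum(1 + m^4)

# Assumed cumulative misclassification counts for the 6 cases of Table 1:
m1 <- c(1, 0, 1, 0, 0, 0)   # after step k = 1
m2 <- c(1, 0, 2, 1, 0, 0)   # after step k = 2
m3 <- c(2, 0, 3, 1, 0, 1)   # after step k = 3

round(arcx4_freq(m1), 2)    # frequencies used at k = 2: 1.50 0.75 1.50 0.75 0.75 0.75
round(arcx4_freq(m2), 2)    # frequencies used at k = 3: 0.50 0.25 4.25 0.50 0.25 0.25
round(arcx4_freq(m3), 2)    # frequencies used at k = 4: 0.97 0.06 4.69 0.11 0.06 0.11
```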

Arcing (continued)
The process is repeated K times, and the K models are combined by voting or by averaging the posterior probabilities. AdaBoost uses weighted voting, in which models with fewer misclassifications, particularly of the hard-to-classify cases, are given more weight. Breiman (1998) used K = 50; Bauer and Kohavi (1999) used K = 25. Arcing improves performance to a greater degree than bagging, but the improvement is less consistent (Breiman, 1998; Bauer and Kohavi, 1999). SAS Enterprise Miner implements the Arc-x4 idea.

Stochastic Gradient Boosting
Boosting inherently relies on a gradient descent search for optimizing the underlying loss function to determine both the weights and the learner at each iteration. In Stochastic Gradient Boosting (SGB), a random permutation sampling strategy is employed at each iteration to obtain a refined training set. The full SGB algorithm with the gradient boosting modification relies on a regularization parameter ν ∈ [0, 1], the so-called learning rate.

SGB: The Algorithm (algorithm display not reproduced in this transcription). Source: taken from ada: an R Package for Stochastic Boosting, by Mark Culp, Kjell Johnson, and George Michailidis.
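Because the algorithm display is not reproduced here, the following is a rough R sketch of the stochastic gradient boosting idea for squared-error loss: at each iteration a small tree is fitted to the current residuals on a random subsample, and the ensemble fit is updated by a shrunken (learning-rate ν) version of that tree. The function and argument names are illustrative assumptions, not the exact algorithm given by Culp, Johnson, and Michailidis.

```r
# Rough sketch of stochastic gradient boosting for squared-error loss.
# nu = learning rate, bag.frac = fraction of cases subsampled per iteration.
library(rpart)

sgb_sketch <- function(x, y, iter = 200, nu = 0.1, bag.frac = 0.5, depth = 2) {
  n <- length(y)
  f0 <- mean(y)
  f <- rep(f0, n)                                     # initial constant fit
  trees <- vector("list", iter)
  xdf <- as.data.frame(x)
  for (m in seq_len(iter)) {
    r <- y - f                                        # negative gradient = residuals
    idx <- sample.int(n, size = floor(bag.frac * n))  # random subsample at this step
    dat <- cbind(xdf[idx, , drop = FALSE], .r = r[idx])
    fit <- rpart(.r ~ ., data = dat,
                 control = rpart.control(maxdepth = depth, cp = 0, xval = 0))
    f <- f + nu * predict(fit, xdf)                   # shrunken update of the fit
    trees[[m]] <- fit
  }
  list(init = f0, trees = trees, nu = nu)
}
```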

SGB
The algorithm in its general form can operate under an arbitrary loss function, for example the exponential loss L(y, f) = e^{-yf} and the logistic loss L(y, f) = \log(1 + e^{-yf}). The \eta(\cdot) function specifies the type of boosting: discrete, \eta(x) = \operatorname{sign}(x); real, \eta(x) = 0.5 \log\{x/(1 - x)\}; and gentle, \eta(x) = x.

Several Packages Available: R Implementations
1. gbm - The gbm package offers two versions of boosting for classification (gentle boost under logistic and exponential loss). In addition, it includes squared-error, absolute-error, Poisson, and Cox-type loss functions.
2. mboost - The mboost package has functionality largely similar to that of the gbm package and, in addition, implements the general gradient boosting framework using regression-based learners.
3. ada - the one we are going to use.
4. boost - an implementation of boosting for gene data.
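As a point of reference, a gbm call for a binary (0/1) response might look roughly as follows; the data frames train and test, the response name y, and the tuning values are assumptions made for illustration only.

```r
# Illustrative gbm call for a 0/1 response; 'train', 'test', and 'y' are assumed.
library(gbm)
set.seed(1)
fit <- gbm(y ~ ., data = train,
           distribution = "bernoulli",   # logistic loss
           n.trees = 1000,
           shrinkage = 0.05,             # learning rate
           interaction.depth = 2,        # depth of the base trees
           bag.fraction = 0.5,           # stochastic subsampling
           cv.folds = 5)
best.iter <- gbm.perf(fit, method = "cv")   # CV-selected number of trees
phat <- predict(fit, newdata = test, n.trees = best.iter, type = "response")
```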

R ada Package
The ada package implements the original AdaBoost algorithm, along with the Gentle and Real AdaBoost variants, using both exponential and logistic loss functions for classification problems. In addition, it allows the user to fit regularized versions of these methods by using the learning rate as a tuning parameter, which can lead to improved performance. The base classifiers employed are classification/regression trees, and therefore the underlying engine is the rpart package.
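A typical call might look like the following sketch; the train/test data frames, the factor response y, and the tuning values are assumptions for illustration and are not taken from the chapter.

```r
# Illustrative use of ada; 'train' and 'test' data frames with a factor
# response 'y' are assumed for the sake of the example.
library(rpart)
library(ada)
set.seed(1)
fit <- ada(y ~ ., data = train,
           loss = "exponential",         # or "logistic"
           type = "discrete",            # "discrete", "real", or "gentle"
           iter = 100,
           nu = 0.1,                     # learning rate (regularization)
           bag.frac = 0.5,               # subsampling fraction
           control = rpart.control(maxdepth = 4, cp = -1))
summary(fit)                             # training summary (accuracy and kappa)
pred <- predict(fit, newdata = test)     # predicted class labels
```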

ada: The Function Flow (diagram not reproduced in this transcription). The top section of the diagram consists of the functions used to create an ada object, while the bottom section consists of the functions invoked on an initialized ada object.

Appendix: Kappa Statistic
The kappa statistic is an index that compares the observed agreement with the agreement that would be expected by chance. Kappa can be thought of as the chance-corrected proportional agreement; possible values range from +1 (perfect agreement) through 0 (no agreement beyond that expected by chance) to -1 (complete disagreement).

Hypothetical example: 29 patients are examined by two independent doctors (see the table below). 'Yes' denotes that the doctor diagnoses the patient with disease X; 'No' denotes that the doctor classifies the patient as free of disease X.

                  Doctor A: No    Doctor A: Yes    Total
Doctor B: No      10 (34.5%)       7 (24.1%)       17 (58.6%)
Doctor B: Yes      0 (0.0%)       12 (41.4%)       12 (41.4%)
Total             10 (34.5%)      19 (65.5%)       29

Kappa = (Observed agreement - Chance agreement) / (1 - Chance agreement)
Observed agreement = (10 + 12)/29 = 0.76
Chance agreement = 0.586 * 0.345 + 0.655 * 0.414 = 0.474
Kappa = (0.76 - 0.474)/(1 - 0.474) = 0.54
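The same calculation can be carried out directly in R; the small sketch below simply encodes the 2x2 table above and applies the kappa formula.

```r
# Kappa for the 2x2 agreement table above, computed directly.
tab <- matrix(c(10,  7,
                 0, 12), nrow = 2, byrow = TRUE,
              dimnames = list(DoctorB = c("No", "Yes"),
                              DoctorA = c("No", "Yes")))
n  <- sum(tab)
po <- sum(diag(tab)) / n                       # observed agreement, about 0.76
pe <- sum(rowSums(tab) * colSums(tab)) / n^2   # chance agreement, about 0.47
kap <- (po - pe) / (1 - pe)                    # approximately 0.54
kap
```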