Ensemble Methods and Boosting. Copyright 2005 by David Helmbold.

Ensemble Methods: use a set of hypotheses.

Ensemble Methods. Use an ensemble, or group, of hypotheses. Diversity is important: an ensemble of yes-men is useless. Get diverse hypotheses by using different data, using different algorithms, or using different hyper-parameters.

Averaging reduces Variance. An ensemble is more stable than an individual predictor. Consider a biased coin with P(heads) = 1/3, P(tails) = 2/3, and let H be the number of heads. Variance = E[(H - E[H])^2] = E[H^2] - E[H]^2. For 1 flip: 1/3 - 1/9 = 2/9. For the sum of 2 flips: 8/9 - 4/9 = 4/9. For the average of 2 flips: 2/9 - 1/9 = 1/9. Caveat: the members of an ensemble are often dependent, so the variance reduction is smaller in practice.
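As a quick sanity check (not from the slides), a few lines of Python simulate the biased coin and reproduce the 2/9, 4/9, and 1/9 variances; the number of trials is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 1_000_000

# Biased coin: heads (1) with probability 1/3, tails (0) with probability 2/3.
flip1 = (rng.random(n_trials) < 1/3).astype(float)
flip2 = (rng.random(n_trials) < 1/3).astype(float)

print(np.var(flip1))               # ~2/9 ~= 0.222 (one flip)
print(np.var(flip1 + flip2))       # ~4/9 ~= 0.444 (sum of two flips)
print(np.var((flip1 + flip2) / 2)) # ~1/9 ~= 0.111 (average of two flips)
```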

How to get different data? Use different training sets: a bootstrap sample picks m examples from the sample with replacement; cross-validation sampling; re-weighting the data. Or use different features: random forests hide some features from each tree.
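A minimal sketch of bootstrap sampling with NumPy (the toy arrays are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(10).reshape(-1, 1)   # toy feature matrix (10 examples)
y = np.array([1, -1] * 5)          # toy +/-1 labels

m = len(X)
idx = rng.integers(0, m, size=m)   # draw m indices with replacement
X_boot, y_boot = X[idx], y[idx]    # bootstrap replicate of the training set
```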

How does the Ensemble predict? A dictator (but then why have an ensemble?); an unweighted vote; a weighted vote, paying more attention to the better predictors; a cascade; or other structures. A small weighted-vote sketch follows.
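A hedged sketch of a weighted majority vote over +/-1 base hypotheses (the hypotheses and weights below are invented purely for illustration):

```python
def weighted_vote(hypotheses, alphas, x):
    """Predict the sign of the alpha-weighted sum of +/-1 base predictions."""
    score = sum(a * h(x) for h, a in zip(hypotheses, alphas))
    return 1 if score >= 0 else -1

# Example: three stump-like hypotheses on a scalar input.
hs = [lambda x: 1 if x > 0 else -1,
      lambda x: 1 if x > 2 else -1,
      lambda x: -1]
alphas = [0.9, 0.5, 0.2]
print(weighted_vote(hs, alphas, 1.0))   # -> 1
```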

Mixture of Experts: voting where the weights are input-dependent (gating) (Jacobs et al., 1991). Lecture Notes for E. Alpaydin 2010, Introduction to Machine Learning 2e, The MIT Press (V1.0).

Stacking: the combiner f() is another learner (Wolpert, 1992). Lecture Notes for E. Alpaydin 2010, Introduction to Machine Learning 2e, The MIT Press (V1.0).

Boosting is: a meta-learning technique, in that it boosts the performance of another learning algorithm; and an ensemble technique, in that it creates a set of classifiers and predicts with a weighted vote. Ensembles had been used before to reduce variance (e.g. bagging).

Generic Boosting algorithm. Uses a training sample S (+/-1 labels) and a learning algorithm L.
1. For t from 1 to T:
   a. Create a distribution D_t (p_t in the book) on S.
   b. Call L with D_t on S to get hypothesis h_t (d_t in the book).
   c. Calculate a weight alpha_t for h_t.
2. The final hypothesis is H(x) = sum_t alpha_t h_t(x) (predict with its sign), or H(x) = argmax_y sum_{t: h_t(x)=y} alpha_t.
The trick is picking D_t and alpha_t. A skeleton of this loop is sketched below.
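A minimal Python skeleton of the generic loop, under the assumption that the base learner accepts per-example weights; the function and variable names are illustrative, not from the slides. How make_distribution and hypothesis_weight are chosen is exactly what distinguishes AdaBoost and its variants:

```python
import numpy as np

def generic_boost(X, y, base_learner, T, make_distribution, hypothesis_weight):
    """Generic boosting loop: y has +/-1 labels; the details live in the two callbacks."""
    hypotheses, alphas = [], []
    for t in range(T):
        D_t = make_distribution(X, y, hypotheses, alphas)   # distribution on the sample
        h_t = base_learner(X, y, D_t)                       # weighted call to the learner
        a_t = hypothesis_weight(h_t, X, y, D_t)             # weight for this hypothesis
        hypotheses.append(h_t)
        alphas.append(a_t)
    def H(x):
        return np.sign(sum(a * h(x) for h, a in zip(hypotheses, alphas)))
    return H
```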

Example with axis-parallel half-spaces. D_1 is uniform, giving h_1. Incorrectly classified examples get more weight in D_2 and D_3, giving h_2 and h_3. The majority vote of the three hypotheses is perfect. [Figure: the weighted data sets D_1, D_2, D_3 and the learned half-spaces h_1, h_2, h_3.]

AdaBoost (Freund & Schapire 97). A very powerful algorithm coming out of machine learning theory. The (un-normalized) margin m_i of an example (x_i, y_i) in S is y_i H(x_i) = y_i sum_t alpha_t h_t(x_i). AdaBoost uses the exponential potential: potential = sum_{i in S} e^{-m_i}. This potential is a smooth overestimate of the number of training mistakes. Margins are also used at each iteration.
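One line of reasoning (standard, though not spelled out on the slide) shows why the potential overestimates the number of mistakes: whenever H misclassifies x_i, the margin is non-positive, so that example already contributes at least 1 to the potential.

```latex
% An example is misclassified exactly when its margin m_i = y_i H(x_i) \le 0.
% Then e^{-m_i} \ge e^{0} = 1, while correctly classified examples contribute e^{-m_i} > 0, so
\#\{\text{mistakes}\} \;=\; \sum_{i \in S} \mathbf{1}[m_i \le 0]
\;\le\; \sum_{i \in S} e^{-m_i} \;=\; \text{potential}.
```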

AdaBoost: generate a sequence of base learners, each focusing on the previous one's errors (Freund and Schapire, 1996).

Analysis of Training Error (use overhead).

AdaBoost training error. Let gamma_t = 1/2 - epsilon_t; gamma_t is the edge of h_t over random guessing. Let N be the number of examples in the sample. The number of mistakes of the final hypothesis on the sample is at most N e^{-2 sum_t gamma_t^2}. If every gamma_t exceeds a constant, this decreases exponentially, and O(log N) iterations suffice for perfection on the training set. Generalization error bound (approximately): training error + O((Td/N)^{1/2}), where T = number of boosting rounds and d = VC-dimension of the h_t's.
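The O(log N) claim follows directly from the bound; a short derivation, assuming a uniform edge gamma_t >= c > 0:

```latex
\#\{\text{training mistakes}\} \;\le\; N\, e^{-2\sum_t \gamma_t^2}
\;\le\; N\, e^{-2 c^2 T} \;<\; 1
\quad\text{once}\quad T \;>\; \frac{\ln N}{2c^2} \;=\; O(\log N),
```

and an integer number of mistakes that is below 1 means zero training error.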

AdaBoost does constrained gradient descent on the potential. At each iteration, D_t is chosen to be (proportional to) the negative gradient of the potential with respect to the current margins: D_t(x_i) measures how much an increase in x_i's margin helps the potential decrease, so D_t(x_i) is set proportional to e^{-m_i} (using the current margins). The weight alpha_t for h_t is how much of h_t to add in to minimize the potential: with weighted error epsilon_t = P_{i~D_t}[h_t(x_i) != y_i], alpha_t = (1/2) ln((1-epsilon_t)/epsilon_t). A concrete sketch of the resulting algorithm follows.
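Putting the pieces together, here is a hedged sketch of AdaBoost with decision stumps as the base learner. It relies on scikit-learn's DecisionTreeClassifier and its sample_weight support; the reweighting variant shown is one standard formulation, not necessarily the exact one on these slides:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """AdaBoost with decision stumps; y must contain +/-1 labels."""
    n = len(y)
    D = np.full(n, 1.0 / n)                 # D_1 is uniform
    stumps, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()            # weighted error under D_t
        if eps <= 0 or eps >= 0.5:          # perfect stump or no edge: stop
            if eps <= 0:
                stumps.append(h); alphas.append(1.0)
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        stumps.append(h); alphas.append(alpha)
        D *= np.exp(-alpha * y * pred)      # keeps D proportional to e^{-m_i}
        D /= D.sum()
    def H(X_new):
        score = sum(a * h.predict(X_new) for h, a in zip(stumps, alphas))
        return np.sign(score)
    return H
```

Usage is the obvious one: H = adaboost(X_train, y_train) returns a function whose sign gives +/-1 predictions on new data.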

[Figure illustrating the choice of alpha_t: the contours are where sum_i e^{-m_i} is constant.]

Boosting overviews at: http://www.cs.princeton.edu/~schapire/boost.html. [Figure: boosting performance (vs. a decision tree algorithm) on UCI benchmarks; each dot is one benchmark.]

AdaBoost and Overfitting. AdaBoost somewhat resists overfitting: the test error even keeps improving after the classifier is perfect on the sample - a big surprise! [Figure: boosting C4.5 on the letter dataset; see Schapire's 2002 overview on his web page.]

Generalization error. The normalized margin is m'_i = m_i / sum_t alpha_t. For all theta > 0, the generalization error rate is (approximately) at most the fraction of the sample with m'_i < theta, plus O((d/N)^{1/2} / theta). This is independent of T (no over-fitting!), but the bounds are loose and over-fitting can happen.
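Stated a bit more formally (this is the approximate form of the margin bound of Schapire, Freund, Bartlett, and Lee, with constants and log factors suppressed):

```latex
\Pr_{\text{new }(x,y)}\!\big[\hat{m}(x,y) \le 0\big]
\;\lesssim\;
\Pr_{(x_i,y_i)\in S}\!\big[\hat{m}_i < \theta\big]
\;+\; O\!\left(\frac{1}{\theta}\sqrt{\frac{d}{N}}\right)
\qquad \text{for all } \theta > 0,
```

where the hat denotes the normalized margin and d is the VC-dimension of the base hypothesis class.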

Margin explanation. [Figure: margin distributions after 5, 100, and 1000 (solid line) iterations.]

Boosting and noise. The exponential weighting e^{-m_i} can put exponential emphasis on noisy examples. One solution: LogitBoost bounds the weights, with D_t(x_i) proportional to 1/(1 + e^{m_i}).
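A tiny, hedged comparison of the two weighting schemes on a badly misclassified (possibly noisy) example, i.e. one with a large negative margin; the margin values are arbitrary:

```python
import numpy as np

margins = np.array([3.0, 0.0, -5.0])     # well classified, borderline, badly misclassified

ada_w = np.exp(-margins)                  # AdaBoost-style weights: e^{-m_i}
logit_w = 1.0 / (1.0 + np.exp(margins))   # LogitBoost-style weights: 1 / (1 + e^{m_i})

print(ada_w)    # [  0.0498   1.      148.41  ]  -> the noisy example dominates
print(logit_w)  # [  0.0474   0.5      0.9933]  -> its weight stays bounded by 1
```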

Boosting Successes. Many: OCR (boosted nets), image retrieval, natural language processing (see Schapire 02), and text classification on the Reuters newswire.

Boosting Variants. Boosting by reweighting, resampling, or filtering; LogitBoost (Friedman, Hastie, Tibshirani); totally corrective boosting (adjust all votes); LPBoost and dual methods (e.g. Manfred's work); InfoBoost (Aslam); and many, many others.

Other Points: boosting's connection to game theory; boosting by filtering; boosting for outlier detection? Boosting for multiple classes - three ways: AdaBoost.M1 requires a really good weak learner; AdaBoost.M2 is more reasonable; ECOC techniques transform multi-class into binary.

Boosting and Regression. You can't just re-weight examples. Options: convert to classification as in the AdaBoost paper; re-label points with the residual of the error; or re-label points with the signs of the residuals and weight with the gradient of the two-sided potential exp(-r) + exp(r) - 2 (see Duffy and Helmbold). A rough sketch of the last idea follows.
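A rough, hedged sketch of the sign-of-residual idea in Python. This illustrates the general recipe (pseudo-labels from sign(r_i), example weights from the gradient magnitude of exp(-r) + exp(r) - 2); it is not the exact Duffy-Helmbold algorithm, and the step size is a made-up constant:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost_regression(X, y, T=50, step=0.1):
    """Gradient-style boosting for regression via sign-of-residual pseudo-labels."""
    F = np.zeros(len(y))                      # current real-valued predictions
    learners = []
    for t in range(T):
        r = y - F                             # residuals
        labels = np.where(r >= 0, 1, -1)      # pseudo-labels: signs of the residuals
        w = np.abs(np.exp(r) - np.exp(-r))    # |gradient| of exp(-r) + exp(r) - 2
        if w.sum() == 0:
            break
        h = DecisionTreeClassifier(max_depth=1).fit(X, labels, sample_weight=w / w.sum())
        F += step * h.predict(X)              # move the predictions toward the residuals
        learners.append(h)
    def predict(X_new):
        return step * sum(h.predict(X_new) for h in learners)
    return predict
```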

Boosting Summary. Model: a linear threshold of base hypotheses - depends on the underlying learning algorithm. Data: depends on the underlying learning algorithm. Interpretable? Yes with decision stumps, but usually not. Missing values? Depends on the underlying learning algorithm. Noise/outliers? AdaBoost is bad, LogitBoost a bit better. Irrelevant features? Depends on the underlying algorithm. Computational efficiency? Good, if the underlying algorithm is good.

[Figure: interpreting AdaBoost hypotheses (decision stumps) on a LIDAR classification problem, showing 2 of 5 features. Blue = buildings, green = trees, brown = road, yellow = grass.]