
Machine Learning: Boosting
Prof. Dr. Martin Riedmiller, AG Maschinelles Lernen und Natürlichsprachliche Systeme, Institut für Informatik, Technische Fakultät, Albert-Ludwigs-Universität Freiburg. riedmiller@informatik.uni-freiburg.de
Acknowledgment: the slides are based on tutorials by Robert Schapire and Gunnar Raetsch.

Overview of Today's Lecture: Boosting
Motivation, AdaBoost, examples.

Motivation
Often, simple rules already yield reasonable classification performance. In an email spam filter, for example, the phrase "buy now" alone would already be a good indicator for spam, doing better than random. Idea: use several of these simple rules ("decision stumps") and combine them to do the classification ("committee approach").

Boosting
Boosting is a general method to combine simple, rough rules into a highly accurate decision model. It assumes base learners (or "weak learners") that do at least slightly better than random, e.g. 53% accuracy for a two-class classification; in other words, the error is strictly smaller than 0.5. Given sufficient data, a boosting algorithm can provably construct a single classifier with very high accuracy (say, 99%). Algorithm: AdaBoost (Freund and Schapire, 1995).

General Framework for Boosting
Training set (x_1, y_1), ..., (x_p, y_p), where y_i ∈ {−1, +1} and x_i ∈ X.
For t = 1 ... T:
- construct a distribution D_t = (d_1^t, ..., d_p^t) on the training set, where d_i^t is the probability of occurrence of training pattern i (also called the weight of i)
- train a base learner h_t : X → {−1, +1} according to the training-set distribution D_t with a small error ε_t, where ε_t = Pr_{D_t}[h_t(x_i) ≠ y_i]
Output the final classifier H_final as a combination of the T hypotheses h_1, ..., h_T.
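The slide leaves the base learner open. As one concrete possibility, a minimal weighted decision stump (an axis-parallel split, as used in the toy example later) could look like the following Python sketch; the function name and interface are my own choice, not part of the lecture:

```python
import numpy as np

def train_stump(X, y, d):
    """Weighted decision stump: pick the (feature, threshold, polarity)
    with the smallest error under the pattern distribution d.
    X: (p, n_features) array, y: labels in {-1, +1}, d: weights summing to 1."""
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for polarity in (+1, -1):
                pred = polarity * np.where(X[:, j] <= thr, 1, -1)
                err = np.sum(d[pred != y])          # weighted error under D_t
                if err < best_err:
                    best_err, best = err, (j, thr, polarity)
    j, thr, polarity = best
    # return the trained hypothesis h_t as a function of new inputs
    return lambda Z: polarity * np.where(Z[:, j] <= thr, 1, -1)
```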

AdaBoost - Details (Freund and Schapire, 1995)
The final hypothesis is a weighted sum of the individual hypotheses: H_final(x) = sign(Σ_t α_t h_t(x)).
Hypotheses are weighted depending on the error they produce: α_t := ½ ln((1 − ε_t)/ε_t). E.g. ε_t = 0.1 ⇒ α_t ≈ 1.10; ε_t = 0.4 ⇒ α_t ≈ 0.20; ε_t = 0.5 ⇒ α_t = 0.
The above choice of α can be shown to minimize an upper bound on the error of the final hypothesis (see e.g. (Schapire, 2003) for details).
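The error-to-weight mapping can be evaluated directly. The small helper below (my own naming, not from the slides) prints the values quoted above together with the ε = 0.3 case used in the worked example:

```python
import numpy as np

def hypothesis_weight(eps):
    """alpha_t = 1/2 * ln((1 - eps_t) / eps_t)."""
    return 0.5 * np.log((1.0 - eps) / eps)

for eps in (0.1, 0.3, 0.4, 0.5):
    print(eps, round(hypothesis_weight(eps), 2))   # 1.1, 0.42, 0.2, 0.0
```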

AdaBoost - Details (cont'd) (Freund and Schapire, 1995)
The training-pattern probabilities are recomputed such that wrongly classified patterns get a higher probability in the next round:
d_i^{t+1} := (d_i^t / Z_t) · exp(α_t) if y_i ≠ h_t(x_i), and (d_i^t / Z_t) · exp(−α_t) otherwise; compactly, d_i^{t+1} = (d_i^t / Z_t) · exp(−α_t y_i h_t(x_i)).
Z_t is a normalisation factor chosen such that Σ_i d_i^{t+1} = 1 (probability distribution), i.e. Z_t = Σ_i d_i^t exp(−α_t y_i h_t(x_i)).

AdaBoost (Freund and Schapire, 1995)
Initialize D_1 = (1/p, ..., 1/p), where p is the number of training patterns.
For t = 1 ... T:
- train base learner h_t according to distribution D_t
- determine the expected error: ε_t := Σ_{i=1}^p d_i^t I(h_t(x_i) ≠ y_i), where I(A) = 1 if A is true and 0 otherwise
- compute the hypothesis weight and new pattern distribution: α_t := ½ ln((1 − ε_t)/ε_t) and d_i^{t+1} := (d_i^t / Z_t) exp(−α_t y_i h_t(x_i)), with Z_t a normalisation factor
Final hypothesis: H_final(x) = sign(Σ_t α_t h_t(x)).
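A minimal Python sketch of this pseudocode could look as follows (assuming 0 < ε_t < 0.5; the early-exit guard is my addition, not on the slide). Together with the stump trainer above it can be called as H = adaboost(X, y, train_stump, T=3):

```python
import numpy as np

def adaboost(X, y, train_weak, T):
    """AdaBoost following the slide's pseudocode. train_weak(X, y, d) must
    return a classifier h with h(X) in {-1, +1}^p; y is in {-1, +1}."""
    p = len(y)
    d = np.full(p, 1.0 / p)                        # D_1 = (1/p, ..., 1/p)
    hypotheses, alphas = [], []
    for _ in range(T):
        h = train_weak(X, y, d)
        pred = h(X)
        eps = np.sum(d * (pred != y))              # eps_t, expected error under D_t
        if eps >= 0.5:                             # weak-learner assumption violated (my addition)
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)    # alpha_t
        d = d * np.exp(-alpha * y * pred)          # reweight patterns
        d /= d.sum()                               # divide by normalisation factor Z_t
        hypotheses.append(h)
        alphas.append(alpha)
    # final hypothesis: sign of the weighted sum of weak hypotheses
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, hypotheses)))
```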

Example (1) (taken from R. Schapire)
Hypotheses: simple splits parallel to one of the axes.

Example (2) (taken from R. Schapire)
ε_1 = 0.3, α_1 = 0.42; wrongly classified patterns: d_i^2 = 0.166; correctly classified: d_i^2 = 0.07 (computation on the next slide).

ε_1 = Σ_{i=1}^p d_i^1 I(h_1(x_i) ≠ y_i) = (1/10) · 3 = 0.3
α_1 = ½ ln((1 − ε_1)/ε_1) = ½ ln(0.7/0.3) = 0.42
wrong patterns: d_i^1 · exp(α_1) = 0.1 · exp(0.42) = 0.152
correct patterns: d_i^1 · exp(−α_1) = 0.1 · exp(−0.42) = 0.066
Z_1 = 7 · 0.066 + 3 · 0.152 = 0.918
wrong patterns: d_i^2 = 0.152/0.918 = 0.166
correct patterns: d_i^2 = 0.066/0.918 = 0.07
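These round-1 numbers can be checked with a few lines of Python (the third decimal differs slightly from the slide only because the slide rounds intermediate results):

```python
import numpy as np

# Toy example, round 1: 10 patterns with weight 0.1 each, 3 of them misclassified.
eps1 = 3 * 0.1                                     # 0.3
alpha1 = 0.5 * np.log((1 - eps1) / eps1)           # ~0.42
w_wrong = 0.1 * np.exp(alpha1)                     # ~0.152 (misclassified patterns)
w_right = 0.1 * np.exp(-alpha1)                    # ~0.066 (correct patterns)
Z1 = 7 * w_right + 3 * w_wrong                     # ~0.92
print(round(w_wrong / Z1, 3), round(w_right / Z1, 3))   # ~0.167 ~0.071
```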

Example (3) (taken from R. Schapire)
ε_2 = 0.21, α_2 = 0.65 (computation on the next slide).

ε_2 = Σ_{i=1}^p d_i^2 I(h_2(x_i) ≠ y_i) = 0.07 + 0.07 + 0.07 = 0.21
α_2 = ½ ln((1 − ε_2)/ε_2) = ½ ln(0.79/0.21) = 0.65

Example (4) (taken from R. Schapire)
ε_3 = 0.14, α_3 = 0.92.

Example (5) (taken from R. Schapire)
Final classifier: H_final(x) = sign(0.42 · h_1(x) + 0.65 · h_2(x) + 0.92 · h_3(x)).

Further examples (1)
Taken from (Meir and Raetsch, 2003): AdaBoost on a 2D toy data set. Color indicates the label and the diameter of a point is proportional to the weight of the example. Dashed lines show the decision boundaries of the single classifiers (up to the 5th iteration); the solid line shows the decision boundary of the combined classifier. In the last two plots the decision line of Bagging is plotted for comparison.

Performance of AdaBoost
If the noise in the training data is low, AdaBoost can reasonably learn even complex decision boundaries without overfitting. (Figure taken from G. Raetsch, Tutorial MLSS 03.)

Practical Application: Face Detection (Viola and Jones, 2004)
Problem: find faces in photographs and movies. Weak classifiers: detect light/dark rectangle patterns in the images. Many tricks are needed to make the detector fast and accurate.

Further aspects
Boosting can be applied to any ML classifier: MLPs, decision trees, ...
A related approach is Bagging (Breiman): the training-sample distribution remains unchanged; instead, for each hypothesis a new set of training samples is drawn with replacement (see the sketch below).
Boosting usually performs very well with respect to not overfitting the data if the noise in the data is low. While this is astonishing at first glance, it can be theoretically explained by analyzing the margins of the classifiers.
If the data has a higher noise level, boosting runs into the problem of overfitting. Regularisation methods exist that try to avoid this (e.g. by restricting the weight of notorious outliers).
Extensions to multi-class problems and regression problems exist.
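For comparison with AdaBoost, a minimal bagging sketch (my own illustration, not from the slides): every base model sees a bootstrap sample drawn with replacement and uniform weights, and the ensemble predicts by majority vote. train_base is an assumed helper that fits any classifier on unweighted data and returns it as a callable:

```python
import numpy as np

def bagging(X, y, train_base, n_models, seed=0):
    """Bagging sketch: each base model is trained on a bootstrap sample
    (drawn with replacement); the ensemble predicts by majority vote."""
    rng = np.random.default_rng(seed)
    p = len(y)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, p, size=p)           # p patterns drawn with replacement
        models.append(train_base(X[idx], y[idx]))
    return lambda Z: np.sign(sum(m(Z) for m in models))
```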

References
R. Schapire: A boosting tutorial. www.cs.princeton.edu/~schapire
R. Meir and G. Raetsch, 2003: An Introduction to Boosting and Leveraging.
G. Raetsch: Tutorial at MLSS 2003.
P. Viola and M. Jones, 2004: Robust Real-Time Face Detection.