AdaBoost. Jiri Matas and Jan Šochman. Centre for Machine Perception Czech Technical University, Prague http://cmp.felk.cvut.cz




Presentation Outline
- The AdaBoost algorithm: Why is it of interest? How does it work? Why does it work?
- AdaBoost variants
- AdaBoost with a Totally Corrective Step (TCS)
- Experiments with a Totally Corrective Step

Introduction
1990 - Boost-by-majority algorithm (Freund)
1995 - AdaBoost (Freund & Schapire)
1997 - Generalized version of AdaBoost (Schapire & Singer)
2001 - AdaBoost in face detection (Viola & Jones)

Interesting properties:
- AB is a linear classifier with all its desirable properties.
- AB output converges to the logarithm of the likelihood ratio.
- AB has good generalization properties.
- AB is a feature selector with a principled strategy (minimisation of an upper bound on the empirical error).
- AB is close to sequential decision making (it produces a sequence of gradually more complex classifiers).

What is AdaBoost?

AdaBoost is an algorithm for constructing a strong classifier as a linear combination

    f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)

of simple weak classifiers h_t(x).

Terminology
- h_t(x) ... weak or basis classifier, hypothesis, feature
- H(x) = sign(f(x)) ... strong or final classifier/hypothesis

Comments
- The h_t(x)'s can be thought of as features.
- Often (typically) the set H = {h(x)} is infinite.

(Discrete) AdaBoost Algorithm (Schapire & Singer, 1997)

Given: (x_1, y_1), ..., (x_m, y_m); x_i \in X, y_i \in \{-1, +1\}.
Initialize weights D_1(i) = 1/m.

For t = 1, ..., T:
1. Call WeakLearn, which returns the weak classifier h_t : X \to \{-1, +1\} with minimum error w.r.t. distribution D_t.
2. Choose \alpha_t \in R.
3. Update
       D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t},
   where Z_t is a normalization factor chosen so that D_{t+1} is a distribution.

Output the strong classifier:
    H(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right)

Comments
- The computational complexity of selecting h_t is independent of t.
- All information about previously selected features is captured in D_t!
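The loop above maps directly onto a few lines of code. Below is a minimal sketch, assuming axis-aligned decision stumps as the weak learner and the choice of \alpha_t derived later in the talk; all names (fit_stump, adaboost, strong_classify) are illustrative and not part of the original slides.

import numpy as np

def fit_stump(X, y, D):
    # Return the stump (feature, threshold, polarity) with minimum weighted error w.r.t. D.
    best, best_err = (0, 0.0, 1), np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (+1, -1):
                pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
                err = D[pred != y].sum()              # weighted error eps_j
                if err < best_err:
                    best, best_err = (j, thr, pol), err
    return best, best_err

def stump_predict(stump, X):
    j, thr, pol = stump
    return np.where(pol * (X[:, j] - thr) >= 0, 1, -1)

def adaboost(X, y, T=40):
    m = len(y)
    D = np.full(m, 1.0 / m)                           # D_1(i) = 1/m
    ensemble = []                                     # list of (alpha_t, h_t)
    for t in range(T):
        h, eps = fit_stump(X, y, D)                   # WeakLearn on distribution D_t
        if eps >= 0.5:                                # prerequisite eps_t < 1/2
            break
        r = np.sum(D * y * stump_predict(h, X))       # r_t = sum_i D_t(i) h_t(x_i) y_i
        r = np.clip(r, -1 + 1e-12, 1 - 1e-12)         # guard against a perfect stump
        alpha = 0.5 * np.log((1 + r) / (1 - r))       # alpha_t = 1/2 log((1 + r_t)/(1 - r_t))
        D *= np.exp(-alpha * y * stump_predict(h, X))
        D /= D.sum()                                  # division by Z_t
        ensemble.append((alpha, h))
    return ensemble

def strong_classify(ensemble, X):
    f = sum(a * stump_predict(h, X) for a, h in ensemble)
    return np.sign(f)                                 # H(x) = sign(sum_t alpha_t h_t(x))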

WeakLearn

Loop step: Call WeakLearn, given distribution D_t; it returns a weak classifier h_t : X \to \{-1, +1\} from H = {h(x)}.

Select the weak classifier with the smallest weighted error:

    h_t = \arg\min_{h_j \in H} \epsilon_j = \sum_{i=1}^{m} D_t(i)\, [y_i \neq h_j(x_i)]

Prerequisite: \epsilon_t < 1/2 (otherwise stop).

WeakLearn examples:
- Decision tree builder, perceptron learning rule (H infinite)
- Selecting the best one from a given finite set H

Demonstration example
Training set: samples from N(0, 1) and from the density \frac{1}{\sqrt{8\pi^3}}\, r\, e^{-\frac{1}{2}(r-4)^2}; weak classifier = perceptron.
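For the case of selecting the best weak classifier from a given finite set H, the weighted-error rule above reads as follows in a hedged sketch (H is assumed to be a Python list of prediction functions; the names are illustrative):

import numpy as np

def select_weak(H, X, y, D):
    # eps_j = sum_i D_t(i) [y_i != h_j(x_i)] for each candidate h_j in the finite set H
    errors = [np.sum(D * (h(X) != y)) for h in H]
    j = int(np.argmin(errors))
    return H[j], errors[j]        # h_t and eps_t; the caller checks eps_t < 1/2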

AdaBoost as a Minimiser of an Upper Bound on the Empirical Error

The main objective is to minimize the training error
    \epsilon_{tr} = \frac{1}{m} |\{ i : H(x_i) \neq y_i \}|.
It can be upper bounded by
    \epsilon_{tr}(H) \leq \prod_{t=1}^{T} Z_t.

How to set \alpha_t?
- Select \alpha_t to greedily minimize Z_t(\alpha) in each step.
- Z_t(\alpha) is a convex differentiable function with one extremum.
- For h_t(x) \in \{-1, +1\}, the optimal choice is
      \alpha_t = \frac{1}{2} \log\frac{1 + r_t}{1 - r_t}, where r_t = \sum_{i=1}^{m} D_t(i) h_t(x_i) y_i.
- Z_t = 2\sqrt{\epsilon_t (1 - \epsilon_t)} \leq 1 for the optimal \alpha_t
  (this justifies the selection of h_t according to \epsilon_t).

Comments
- The process of selecting \alpha_t and h_t(x) can be interpreted as a single optimization step minimising the upper bound on the empirical error. Improvement of the bound is guaranteed, provided that \epsilon_t < 1/2.
- The process can be interpreted as a component-wise local optimization (Gauss-Southwell iteration) in the (possibly infinite dimensional!) space of \bar\alpha = (\alpha_1, \alpha_2, ...), starting from \bar\alpha_0 = (0, 0, ...).
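For completeness, a short derivation of the optimal \alpha_t stated above (a standard argument, not spelled out on the slide), obtained by splitting Z_t(\alpha) over correctly and incorrectly classified examples:

    Z_t(\alpha) = \sum_{i=1}^{m} D_t(i)\, e^{-\alpha y_i h_t(x_i)} = (1 - \epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha}

    \frac{dZ_t}{d\alpha} = -(1 - \epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha} = 0
        \;\Rightarrow\; \alpha_t = \frac{1}{2} \log\frac{1 - \epsilon_t}{\epsilon_t}
                                 = \frac{1}{2} \log\frac{1 + r_t}{1 - r_t}  \quad (\text{using } r_t = 1 - 2\epsilon_t)

    Z_t(\alpha_t) = 2\sqrt{\epsilon_t (1 - \epsilon_t)} \leq 1, with equality iff \epsilon_t = 1/2.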

Reweighting

Effect on the training set. Reweighting formula:

    D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}
               = \frac{\exp\left(-y_i \sum_{q=1}^{t} \alpha_q h_q(x_i)\right)}{m \prod_{q=1}^{t} Z_q}

    \exp(-\alpha_t y_i h_t(x_i)) \begin{cases} < 1, & y_i = h_t(x_i) \\ > 1, & y_i \neq h_t(x_i) \end{cases}

- Increase (decrease) the weight of wrongly (correctly) classified examples.
- The weight is the upper bound on the error of a given example!
  [Plot: the exponential function exp(-y f(x)) upper-bounding the 0/1 training error, shown as a function of y f(x).]

Effect on h_t. Since \alpha_t minimizes Z_t,

    \sum_{i: h_t(x_i) = y_i} D_{t+1}(i) = \sum_{i: h_t(x_i) \neq y_i} D_{t+1}(i),

so the error of h_t on D_{t+1} is exactly 1/2, and the next weak classifier is the most "independent" one.
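A quick numeric check of the last claim (toy values, purely illustrative): after reweighting with the optimal \alpha_t, the weighted error of h_t under D_{t+1} comes out as exactly 1/2.

import numpy as np

y = np.array([+1, +1, -1, -1, +1])        # labels
h = np.array([+1, -1, -1, +1, +1])        # predictions of the current weak classifier h_t
D = np.full(5, 1/5)                       # D_t

eps   = D[h != y].sum()                   # eps_t = 0.4
alpha = 0.5 * np.log((1 - eps) / eps)     # optimal alpha_t
D_new = D * np.exp(-alpha * y * h)
D_new /= D_new.sum()                      # division by Z_t

print(D_new[h != y].sum())                # -> 0.5, as claimed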

Summary of the Algorithm

Initialization: D_1(i) = 1/m.

For t = 1, ..., T:
- Find h_t = \arg\min_{h_j \in H} \epsilon_j = \sum_{i=1}^{m} D_t(i)\, [y_i \neq h_j(x_i)]
- If \epsilon_t \geq 1/2 then stop
- Set \alpha_t = \frac{1}{2} \log\frac{1 + r_t}{1 - r_t}
- Update D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}

Output the final classifier:
    H(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right)

[The slides step through the demonstration example for t = 1, 2, ..., 7 and t = 40, showing the training-set weights and the decision boundary of the strong classifier after each round.]
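This consolidated loop is exactly what the earlier Python sketch implements. A toy run on data loosely resembling the two-class demonstration example (a Gaussian blob versus a ring), with illustrative names and parameters, might look like:

import numpy as np

rng = np.random.default_rng(0)
n = 100
inner = rng.normal(size=(n, 2))                      # class +1: Gaussian blob at the origin
angles = rng.uniform(0.0, 2 * np.pi, n)
outer = 4 * np.c_[np.cos(angles), np.sin(angles)] + 0.3 * rng.normal(size=(n, 2))   # class -1: noisy ring
X = np.vstack([inner, outer])
y = np.r_[np.ones(n), -np.ones(n)]

ensemble = adaboost(X, y, T=40)                      # sketch defined after the algorithm slide
accuracy = np.mean(strong_classify(ensemble, X) == y)
print(len(ensemble), "weak classifiers, training accuracy", accuracy)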

Does AdaBoost generalize?

Margins in SVM:       \max \min_{(x,y) \in S} \frac{y\, (\bar\alpha \cdot \bar h(x))}{\|\bar\alpha\|_2}

Margins in AdaBoost:  \max \min_{(x,y) \in S} \frac{y\, (\bar\alpha \cdot \bar h(x))}{\|\bar\alpha\|_1}

Maximizing margins in AdaBoost:
    P_S[y f(x) \leq \theta] \leq 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{1-\theta} (1 - \epsilon_t)^{1+\theta}},
    where f(x) = \frac{\bar\alpha \cdot \bar h(x)}{\|\bar\alpha\|_1}

Upper bound based on the margin:
    P_D[y f(x) \leq 0] \leq P_S[y f(x) \leq \theta]
        + O\left( \frac{1}{\sqrt{m}} \left( \frac{d \log^2(m/d)}{\theta^2} + \log(1/\delta) \right)^{1/2} \right)

AdaBoost variants

Freund & Schapire 1995:
- Discrete (h : X \to \{0, 1\})
- Multiclass AdaBoost.M1 (h : X \to \{0, 1, ..., k\})
- Multiclass AdaBoost.M2 (h : X \to [0, 1]^k)
- Real-valued AdaBoost.R (Y = [0, 1], h : X \to [0, 1])

Schapire & Singer 1997:
- Confidence-rated prediction (h : X \to R, two-class)
- Multilabel AdaBoost.MR, AdaBoost.MH (different formulation of the minimized loss)

... and many other modifications since then (Totally Corrective AdaBoost, Cascaded AdaBoost).

Pros and cons of AdaBoost

Advantages
- Very simple to implement
- Feature selection on very large sets of features
- Fairly good generalization

Disadvantages
- Suboptimal solution for \bar\alpha
- Can overfit in the presence of noise

AdaBoost with a Totally Corrective Step (TCA)

Given: (x_1, y_1), ..., (x_m, y_m); x_i \in X, y_i \in \{-1, +1\}.
Initialize weights D_1(i) = 1/m.

For t = 1, ..., T:
1. Call WeakLearn, which returns the weak classifier h_t : X \to \{-1, +1\} with minimum error w.r.t. distribution D_t.
2. Choose \alpha_t \in R.
3. Update D_{t+1}.
4. Totally corrective step: call WeakLearn on the set of weak classifiers with non-zero \alpha's, update \bar\alpha, and update D_{t+1}; repeat until |\epsilon_q - 1/2| < \delta for all q.

Comments
- After the corrective step all selected weak classifiers have error close to 1/2, therefore the classifier selected at t + 1 is (approximately) independent of all classifiers selected so far.
- It can easily be shown that the totally corrective step reduces the upper bound on the empirical error without increasing classifier complexity.
- The TCA was first proposed by Kivinen and Warmuth, but their \alpha_t is set as in standard AdaBoost.
- Generalization of the TCA is an open question.
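One possible reading of the corrective step, given only as a hedged sketch: cycle over the already-selected weak classifiers and re-tune their coefficients until every one of them has weighted error within \delta of 1/2 under the current distribution. The function and variable names are illustrative, and this is an interpretation, not the exact procedure of the talk.

import numpy as np

def totally_corrective_step(preds, alphas, y, delta=1e-3, max_sweeps=1000):
    # preds: (T, m) array with preds[q, i] = h_q(x_i) in {-1, +1}
    # alphas: length-T array of the current (non-zero) coefficients
    for _ in range(max_sweeps):
        D = np.exp(-y * (alphas @ preds))             # unnormalized weights exp(-y_i f(x_i))
        D /= D.sum()                                  # current distribution over examples
        eps = np.array([D[preds[q] != y].sum() for q in range(len(alphas))])
        eps = np.clip(eps, 1e-12, 1 - 1e-12)          # avoid log(0) in the update below
        worst = int(np.argmax(np.abs(eps - 0.5)))
        if abs(eps[worst] - 0.5) < delta:             # every eps_q is within delta of 1/2
            break
        # AdaBoost-style coordinate correction: drives eps_worst back to 1/2
        alphas[worst] += 0.5 * np.log((1 - eps[worst]) / eps[worst])
    return alphas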

Experiments with TCA on the IDA Database

Discrete AdaBoost, Real AdaBoost, and Discrete and Real TCA were evaluated. Weak learner: stumps. Data from the IDA repository (Ratsch:2000):

                    Input       Training    Testing     Number of
                    dimension   patterns    patterns    realizations
    Banana           2            400        4900         100
    Breast cancer    9            200          77         100
    Diabetes         8            468         300         100
    German          20            700         300         100
    Heart           13            170         100         100
    Image segment   18           1300        1010          20
    Ringnorm        20            400        7000         100
    Flare solar      9            666         400         100
    Splice          60           1000        2175          20
    Thyroid          5            140          75         100
    Titanic          3            150        2051         100
    Twonorm         20            400        7000         100
    Waveform        21            400        4600         100

Note that the training sets are fairly small.

Results with TCA on the IDA Database

Training error (dashed line), test error (solid line). Discrete AdaBoost (blue), Real AdaBoost (green), Discrete AdaBoost with TCA (red), Real AdaBoost with TCA (cyan). The black horizontal line marks the error of AdaBoost with RBF-network weak classifiers from (Ratsch-ML:2000).

[One plot per data set (IMAGE, FLARE, GERMAN, RINGNORM, SPLICE, THYROID, TITANIC, BANANA, BREAST, DIABETIS, HEART), showing error versus the length of the strong classifier on a logarithmic scale from 10^0 to 10^3.]

Conclusions
- The AdaBoost algorithm was presented and analysed.
- A modification of the Totally Corrective AdaBoost was introduced.
- Initial tests show that the TCA outperforms AB on some standard data sets.
