Consistency of Random Forests and Other Averaging Classifiers

Journal of Machine Learning Research 9 (2008) Submitted 1/08; Revised 5/08; Published 9/08

Gérard Biau
LSTA & LPMA, Université Pierre et Marie Curie Paris VI, Boîte 158, 175 rue du Chevaleret, Paris, France

Luc Devroye
School of Computer Science, McGill University, Montreal, Canada H3A 2K6

Gábor Lugosi
ICREA and Department of Economics, Pompeu Fabra University, Ramon Trias Fargas, Barcelona, Spain

Editor: Peter Bartlett

© 2008 Gérard Biau, Luc Devroye and Gábor Lugosi.

Abstract

In the last years of his life, Leo Breiman promoted random forests for use in classification. He suggested using averaging as a means of obtaining good discrimination rules. The base classifiers used for averaging are simple and randomized, often based on random samples from the data. He left a few questions unanswered regarding the consistency of such rules. In this paper, we give a number of theorems that establish the universal consistency of averaging rules. We also show that some popular classifiers, including one suggested by Breiman, are not universally consistent.

Keywords: random forests, classification trees, consistency, bagging

1. Introduction

This paper is dedicated to the memory of Leo Breiman.

Ensemble methods, popular in machine learning, are learning algorithms that construct a set of many individual classifiers (called base learners) and combine them to classify new data points by taking a weighted or unweighted vote of their predictions. It is now well-known that ensembles are often much more accurate than the individual classifiers that make them up. The success of ensemble algorithms on many benchmark data sets has raised considerable interest in understanding why such methods succeed and identifying circumstances in which they can be expected to produce good results. These methods differ in the way the base learner is fit and combined. For example, bagging (Breiman, 1996) proceeds by generating bootstrap samples from the original data set, constructing a classifier from each bootstrap sample, and voting to combine. In boosting (Freund and Schapire, 1996) and arcing algorithms (Breiman, 1998) the successive classifiers are constructed by giving increased weight to those points that have been frequently misclassified, and the classifiers are combined using weighted voting. On the other hand, random split selection (Dietterich, 2000) grows trees on the original data set. For a fixed number $S$, at each node, the $S$ best splits (in terms of minimizing deviance) are found and the actual split is randomly and uniformly selected from them. For a comprehensive review of ensemble methods, we refer the reader to Dietterich (2000a) and the references therein.

Breiman (2001) provides a general framework for tree ensembles called "random forests". Each tree depends on the values of a random vector sampled independently and with the same distribution for all trees. Thus, a random forest is a classifier that consists of many decision trees and outputs the class that is the mode of the classes output by individual trees. Algorithms for inducing a random forest were first developed by Breiman and Cutler, and "Random Forests" is their trademark. Their web page provides a collection of downloadable technical reports, and gives an overview of random forests as well as comments on the features of the method.

Random forests have been shown to give excellent performance on a number of practical problems. They work fast, generally exhibit a substantial performance improvement over single tree classifiers such as CART, and yield generalization error rates that compare favorably to the best statistical and machine learning methods. In fact, random forests are among the most accurate general-purpose classifiers available (see, for example, Breiman, 2001).

Different random forests differ in how randomness is introduced in the tree building process, ranging from extreme random splitting strategies (Breiman, 2000; Cutler and Zhao, 2001) to more involved data-dependent strategies (Amit and Geman, 1997; Breiman, 2001; Dietterich, 2000). As a matter of fact, the statistical mechanism of random forests is not yet fully understood and is still under active investigation. Unlike single trees, where consistency is proved letting the number of observations in each terminal node become large (Devroye, Györfi, and Lugosi, 1996, Chapter 20), random forests are generally built to have a small number of cases in each terminal node.
Although the mechanism of random forest algorithms appears simple, it is difficult to analyze and remains largely unknown. Some attempts to investigate the driving force behind consistency of random forests are by Breiman (2000, 2004) and Lin and Jeon (2006), who establish a connection between random forests and adaptive nearest neighbor methods. Meinshausen (2006) proved consistency of certain random forests in the context of so-called quantile regression.

In this paper we offer consistency theorems for various versions of random forests and other randomized ensemble classifiers. In Section 2 we introduce a general framework for studying classifiers based on averaging randomized base classifiers. We prove a simple but useful proposition showing that averaged classifiers are consistent whenever the base classifiers are. In Section 3 we prove consistency of two simple random forest classifiers, the purely random forest (suggested by Breiman as a starting point for study) and the scale-invariant random forest classifiers. In Section 4 it is shown that averaging may convert inconsistent rules into consistent ones. In Section 5 we briefly investigate consistency of bagging rules. We show that, in general, bagging preserves consistency of the base rule and it may even create consistent rules from inconsistent ones. In particular, we show that if the bootstrap samples are sufficiently small, the bagged version of the 1-nearest neighbor classifier is consistent.

Finally, in Section 6 we consider random forest classifiers based on randomized, greedily grown tree classifiers. We argue that some greedy random forest classifiers, including Breiman's random forest classifier, are inconsistent and suggest a consistent greedy random forest classifier.

2. Voting and Averaged Classifiers

Let $(X,Y),(X_1,Y_1),\ldots,(X_n,Y_n)$ be i.i.d. pairs of random variables such that $X$ (the so-called feature vector) takes its values in $\mathbb{R}^d$ while $Y$ (the label) is a binary $\{0,1\}$-valued random variable. The joint distribution of $(X,Y)$ is determined by the marginal distribution $\mu$ of $X$ (i.e., $\mathbf{P}\{X \in A\} = \mu(A)$ for all Borel sets $A \subset \mathbb{R}^d$) and the a posteriori probability $\eta : \mathbb{R}^d \to [0,1]$ defined by

$$\eta(x) = \mathbf{P}\{Y = 1 \mid X = x\}.$$

The collection $(X_1,Y_1),\ldots,(X_n,Y_n)$ is called the training data, and is denoted by $D_n$. A classifier $g_n$ is a binary-valued function of $X$ and $D_n$ whose probability of error is defined by

$$L(g_n) = \mathbf{P}_{(X,Y)}\{g_n(X,D_n) \neq Y\}$$

where $\mathbf{P}_{(X,Y)}$ denotes probability with respect to the pair $(X,Y)$ (i.e., conditional probability, given $D_n$). For brevity, we write $g_n(X) = g_n(X,D_n)$. It is well-known (see, for example, Devroye, Györfi, and Lugosi, 1996) that the classifier that minimizes the probability of error, the so-called Bayes classifier, is $g^*(x) = \mathbf{1}_{\{\eta(x) \ge 1/2\}}$. The risk of $g^*$ is called the Bayes risk: $L^* = L(g^*)$. A sequence $\{g_n\}$ of classifiers is consistent for a certain distribution of $(X,Y)$ if $L(g_n) \to L^*$ in probability.

In this paper we investigate classifiers that calculate their decisions by taking a majority vote over randomized classifiers. A randomized classifier may use a random variable $Z$ to calculate its decision. More precisely, let $\mathcal{Z}$ be some measurable space and let $Z$ take its values in $\mathcal{Z}$. A randomized classifier is an arbitrary function of the form $g_n(X,Z,D_n)$, which we abbreviate by $g_n(X,Z)$. The probability of error of $g_n$ becomes

$$L(g_n) = \mathbf{P}_{(X,Y),Z}\{g_n(X,Z,D_n) \neq Y\} = \mathbf{P}\{g_n(X,Z,D_n) \neq Y \mid D_n\}.$$
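To make the definitions of $\eta$, the Bayes classifier $g^*$ and the Bayes risk $L^*$ concrete, here is a minimal sketch on a finite feature set. The three feature values and the numbers in `mu` and `eta` are hypothetical, chosen only for illustration; they do not come from the paper.

```python
# Toy distribution (hypothetical numbers): X takes the values 0, 1, 2 with
# probabilities mu[x], and eta[x] = P{Y = 1 | X = x}.
mu = {0: 0.5, 1: 0.3, 2: 0.2}
eta = {0: 0.9, 1: 0.5, 2: 0.2}


def bayes_classifier(x):
    """g*(x) = 1 if eta(x) >= 1/2, and 0 otherwise."""
    return 1 if eta[x] >= 0.5 else 0


def risk(classifier):
    """L(g) = sum over x of mu(x) * P{g(x) != Y | X = x}."""
    total = 0.0
    for x, p in mu.items():
        # If g(x) = 1 the rule errs when Y = 0; if g(x) = 0 it errs when Y = 1.
        error_at_x = 1.0 - eta[x] if classifier(x) == 1 else eta[x]
        total += p * error_at_x
    return total


bayes_risk = risk(bayes_classifier)    # sum of mu(x) * min(eta(x), 1 - eta(x))
always_one_risk = risk(lambda x: 1)    # a competing, non-optimal rule
```

In this toy setting the Bayes risk is the $\mu$-weighted sum of $\min(\eta(x), 1-\eta(x))$, and any other rule, such as the constant rule above, has a risk at least as large.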
The definition of consistency remains the same by augmenting the probability space appropriately to include the randomization.

Given any randomized classifier, one may calculate the classifier for various draws of the randomizing variable $Z$. It is then a natural idea to define an averaged classifier by taking a majority vote among the obtained random classifiers. Assume that $Z_1,\ldots,Z_m$ are identically distributed draws of the randomizing variable, having the same distribution as $Z$. Throughout the paper, we assume that $Z_1,\ldots,Z_m$ are independent, conditionally on $X$, $Y$, and $D_n$. Letting $Z^m = (Z_1,\ldots,Z_m)$, one may define the corresponding voting classifier by

$$g_n^{(m)}(x,Z^m,D_n) = \begin{cases} 1 & \text{if } \frac{1}{m}\sum_{j=1}^m g_n(x,Z_j,D_n) \ge \frac{1}{2}, \\ 0 & \text{otherwise.} \end{cases}$$

By the strong law of large numbers, for any fixed $x$ and $D_n$ for which $\mathbf{P}_Z\{g_n(x,Z,D_n) = 1\} \neq 1/2$, we have almost surely

$$\lim_{m\to\infty} g_n^{(m)}(x,Z^m,D_n) = \bar{g}_n(x,D_n),$$

where

$$\bar{g}_n(x,D_n) = \bar{g}_n(x) = \mathbf{1}_{\{\mathbf{E}_Z g_n(x,Z) \ge 1/2\}}$$

is a (non-randomized) classifier that we call the averaged classifier. (Here $\mathbf{P}_Z$ and $\mathbf{E}_Z$ denote probability and expectation with respect to the randomizing variable $Z$, that is, conditionally on $X$, $Y$, and $D_n$.)

$\bar{g}_n$ may be interpreted as an idealized version of the classifier $g_n^{(m)}$ that draws many independent copies of the randomizing variable $Z$ and takes a majority vote over the resulting classifiers.

Our first result states that consistency of a randomized classifier is preserved by averaging.

Proposition 1 Assume that the sequence $\{g_n\}$ of randomized classifiers is consistent for a certain distribution of $(X,Y)$. Then the voting classifier $g_n^{(m)}$ (for any value of $m$) and the averaged classifier $\bar{g}_n$ are also consistent.

Proof Consistency of $\{g_n\}$ is equivalent to saying that $\mathbf{E}L(g_n) = \mathbf{P}\{g_n(X,Z) \neq Y\} \to L^*$. In fact, since $\mathbf{P}\{g_n(X,Z) \neq Y \mid X = x\} \ge \mathbf{P}\{g^*(X) \neq Y \mid X = x\}$ for all $x \in \mathbb{R}^d$, consistency of $\{g_n\}$ means that for $\mu$-almost all $x$,

$$\mathbf{P}\{g_n(X,Z) \neq Y \mid X = x\} \to \mathbf{P}\{g^*(X) \neq Y \mid X = x\} = \min(\eta(x), 1-\eta(x)).$$

Without loss of generality, assume that $\eta(x) > 1/2$. (In the case of $\eta(x) = 1/2$ any classifier has a conditional probability of error $1/2$ and there is nothing to prove.) Then

$$\mathbf{P}\{g_n(X,Z) \neq Y \mid X = x\} = (2\eta(x) - 1)\,\mathbf{P}\{g_n(x,Z) = 0\} + 1 - \eta(x),$$

and by consistency we have $\mathbf{P}\{g_n(x,Z) = 0\} \to 0$. To prove consistency of the voting classifier $g_n^{(m)}$, it suffices to show that $\mathbf{P}\{g_n^{(m)}(x,Z^m) = 0\} \to 0$ for $\mu$-almost all $x$ for which $\eta(x) > 1/2$. However,

$$\begin{aligned}
\mathbf{P}\{g_n^{(m)}(x,Z^m) = 0\} &= \mathbf{P}\Big\{\frac{1}{m}\sum_{j=1}^m \mathbf{1}_{\{g_n(x,Z_j)=0\}} > \frac{1}{2}\Big\} \\
&\le 2\,\mathbf{E}\Big[\frac{1}{m}\sum_{j=1}^m \mathbf{1}_{\{g_n(x,Z_j)=0\}}\Big] \quad \text{(by Markov's inequality)} \\
&= 2\,\mathbf{P}\{g_n(x,Z) = 0\} \to 0.
\end{aligned}$$

Consistency of the averaged classifier is proved by a similar argument.

3. Random Forests

Random forests, introduced by Breiman, are averaged classifiers in the sense defined in Section 2. Formally, a random forest with $m$ trees is a classifier consisting of a collection of randomized base tree classifiers $g_n(x,Z_1),\ldots,g_n(x,Z_m)$ where $Z_1,\ldots,Z_m$ are identically distributed random vectors, independent conditionally on $X$, $Y$, and $D_n$.
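As a small numerical illustration of the Markov step in the proof of Proposition 1, the sketch below computes exactly the probability that a vote over $m$ base classifiers outputs 0 and compares it with the bound $2p$. It assumes, as in Section 2, that the $m$ base decisions are independent given the data; the symbol `p`, standing for $\mathbf{P}\{g_n(x,Z)=0\}$, and the grid of values tried are hypothetical.

```python
from math import comb


def prob_vote_is_zero(p, m):
    """Exact P{more than half of m independent base classifiers output 0},
    when each one outputs 0 with probability p.

    The voting classifier outputs 1 when the average of the base decisions
    is at least 1/2, so it outputs 0 only when zeros are a strict majority.
    """
    return sum(
        comb(m, j) * p**j * (1 - p) ** (m - j)
        for j in range(m + 1)
        if 2 * j > m
    )


# Markov's inequality gives P{vote = 0} <= 2p for every m.
for m in (1, 5, 25, 101):
    for p in (0.01, 0.1, 0.3):
        assert prob_vote_is_zero(p, m) <= 2 * p + 1e-12
```

The bound is crude for large $m$, but it is all the proof needs: it holds for every $m$, so the vote inherits consistency from the base classifier regardless of the number of voters.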
The randomizing variable is typically used to determine how the successive cuts are performed when building the tree, such as the selection of the node and the coordinate to split, as well as the position of the split. The random forest classifier takes a majority vote among the random tree classifiers. If $m$ is large, the random forest classifier is well approximated by the averaged classifier $\bar{g}_n(x) = \mathbf{1}_{\{\mathbf{E}_Z g_n(x,Z) \ge 1/2\}}$. For brevity, we state most results of this paper for the averaged classifier $\bar{g}_n$ only, though by Proposition 1 various results remain true for the voting classifier $g_n^{(m)}$ as well.

In this section we analyze a simple random forest already considered by Breiman (2000), which we call the purely random forest. The random tree classifier $g_n(x,Z)$ is constructed as follows. Assume, for simplicity, that $\mu$ is supported on $[0,1]^d$. All nodes of the tree are associated with rectangular cells such that at each step of the construction of the tree, the collection of cells associated with the leaves of the tree (i.e., external nodes) forms a partition of $[0,1]^d$. The root of the random tree is $[0,1]^d$ itself. At each step of the construction of the tree, a leaf is chosen uniformly at random. The split variable $J$ is then selected uniformly at random from the $d$ candidates $x^{(1)},\ldots,x^{(d)}$. Finally, the selected cell is split along the randomly chosen variable at a random location, chosen according to a uniform random variable on the length of the chosen side of the selected cell. The procedure is repeated $k$ times where $k \ge 1$ is a deterministic parameter, fixed beforehand by the user, and possibly depending on $n$.

The randomized classifier $g_n(x,Z)$ takes a majority vote among all $Y_i$ for which the corresponding feature vector $X_i$ falls in the same cell of the random partition as $x$. (For concreteness, break ties in favor of the label 1.)

The purely random forest classifier is a radically simplified version of random forest classifiers used in practice. The main simplification lies in the fact that recursive cell splits do not depend on the labels $Y_1,\ldots,Y_n$. The next theorem mainly serves as an illustration of how the consistency problem of random forest classifiers may be attacked. More involved versions of random forest classifiers are discussed in subsequent sections.

Theorem 2 Assume that the distribution of $X$ is supported on $[0,1]^d$. Then the purely random forest classifier $\bar{g}_n$ is consistent whenever $k \to \infty$ and $k/n \to 0$ as $k \to \infty$.
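The construction of the purely random tree and the vote over $m$ such trees can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names, the representation of cells as pairs of corner lists, and the toy data in the usage lines are all hypothetical. Cells are treated as closed boxes, so a point lying exactly on a cut would belong to two cells; for continuous data this has probability zero.

```python
import random


def grow_purely_random_partition(d, k, rng):
    """Return the k + 1 cells (lo, hi) of [0,1]^d obtained after k random cuts."""
    leaves = [([0.0] * d, [1.0] * d)]
    for _ in range(k):
        lo, hi = leaves.pop(rng.randrange(len(leaves)))  # leaf chosen uniformly
        j = rng.randrange(d)                             # split coordinate J
        cut = rng.uniform(lo[j], hi[j])                  # uniform cut position
        left_hi = hi[:]
        left_hi[j] = cut
        right_lo = lo[:]
        right_lo[j] = cut
        leaves.append((lo, left_hi))
        leaves.append((right_lo, hi))
    return leaves


def in_cell(x, cell):
    lo, hi = cell
    return all(l <= xi <= h for xi, l, h in zip(x, lo, hi))


def tree_classify(x, data, leaves):
    """Majority vote among labels of data points in x's cell (ties -> 1)."""
    cell = next(c for c in leaves if in_cell(x, c))
    votes = [y for (xi, y) in data if in_cell(xi, cell)]
    return 1 if 2 * sum(votes) >= len(votes) else 0


def forest_classify(x, data, k, m, rng):
    """Voting classifier over m independently grown purely random trees."""
    d = len(x)
    ones = sum(
        tree_classify(x, data, grow_purely_random_partition(d, k, rng))
        for _ in range(m)
    )
    return 1 if 2 * ones >= m else 0


# Usage on hypothetical data: the label is 1 exactly when the first coordinate
# exceeds 1/2.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(400)]
data = [(p, int(p[0] > 0.5)) for p in points]
prediction = forest_classify((0.9, 0.5), data, k=20, m=25, rng=rng)
```

Note that the partition is grown without looking at the labels, which is the simplification the text emphasizes; only the final vote inside each cell uses the $Y_i$.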
Proof By Proposition 1 it suffices to prove consistency of the randomized base tree classifier $g_n$. To this end, we recall a general consistency theorem for partitioning classifiers proved in (Devroye, Györfi, and Lugosi, 1996, Theorem 6.1). According to this theorem, $g_n$ is consistent if both $\mathrm{diam}(A_n(X,Z)) \to 0$ in probability and $N_n(X,Z) \to \infty$ in probability, where $A_n(x,Z)$ is the rectangular cell of the random partition containing $x$ and

$$N_n(x,Z) = \sum_{i=1}^n \mathbf{1}_{\{X_i \in A_n(x,Z)\}}$$

is the number of data points falling in the same cell as $x$.

First we show that $N_n(X,Z) \to \infty$ in probability. Consider the random tree partition defined by $Z$. Observe that the partition has $k+1$ rectangular cells, say $A_1,\ldots,A_{k+1}$. Let $N_1,\ldots,N_{k+1}$ denote the number of points of $X,X_1,\ldots,X_n$ falling in these $k+1$ cells. Let $S = \{X,X_1,\ldots,X_n\}$ denote the set of positions of these $n+1$ points. Since these points are independent and identically distributed, fixing the set $S$ (but not the order of the points) and $Z$, the conditional probability that $X$ falls in the $i$-th cell equals $N_i/(n+1)$. Thus, for every fixed $t > 0$,

$$\mathbf{P}\{N_n(X,Z) < t\} = \mathbf{E}\big[\mathbf{P}\{N_n(X,Z) < t \mid S,Z\}\big] = \mathbf{E}\Big[\sum_{i: N_i < t} \frac{N_i}{n+1}\Big] \le (t-1)\,\frac{k+1}{n+1},$$

which converges to zero by our assumption on $k$.

It remains to show that $\mathrm{diam}(A_n(X,Z)) \to 0$ in probability. To this aim, let $V_n = V_n(x,Z)$ be the size of the first dimension of the rectangle containing $x$. Let $T_n = T_n(x,Z)$ be the number of times that the box containing $x$ is split when we construct the random tree partition. Let $K_n$ be binomial $(T_n, 1/d)$, representing the number of times the box containing $x$ is split along the first coordinate.

Clearly, it suffices to show that $V_n(x,Z) \to 0$ in probability for $\mu$-almost all $x$, so it is enough to show that for all $x$, $\mathbf{E}[V_n(x,Z)] \to 0$. Observe that if $U_1,U_2,\ldots$ are independent uniform $[0,1]$, then

$$\begin{aligned}
\mathbf{E}[V_n(x,Z)] &\le \mathbf{E}\Big[\mathbf{E}\Big[\prod_{i=1}^{K_n} \max(U_i, 1-U_i) \,\Big|\, K_n\Big]\Big] \\
&= \mathbf{E}\Big[\big(\mathbf{E}[\max(U_1, 1-U_1)]\big)^{K_n}\Big] \\
&= \mathbf{E}\big[(3/4)^{K_n}\big] \\
&= \mathbf{E}\Big[\Big(1 - \frac{1}{d} + \frac{3}{4d}\Big)^{T_n}\Big] \\
&= \mathbf{E}\Big[\Big(1 - \frac{1}{4d}\Big)^{T_n}\Big].
\end{aligned}$$

Thus, it suffices to show that $T_n \to \infty$ in probability. To this end, note that the partition tree is statistically related to a random binary search tree with $k+1$ external nodes (and thus $k$ internal nodes). Such a tree is obtained as follows. Initially, the root is the sole external node, and there are no internal nodes. Select an external node uniformly at random, make it an internal node and give it two children, both external. Repeat until we have precisely $k$ internal nodes and $k+1$ external nodes. The resulting tree is the random binary search tree on $k$ internal nodes (see Devroye 1988 and Mahmoud 1992 for more equivalent constructions of random binary search trees). It is known that all levels up to $\ell = \lfloor 0.37 \log k \rfloor$ are full with probability tending to one as $k \to \infty$ (Devroye, 1986). The last full level $F_n$ is called the fill-up level. Clearly, the partition tree has this property. Therefore, we know that all final cells have been cut at least $\ell$ times and therefore $T_n \ge \ell$ with probability converging to 1. This concludes the proof of Theorem 2.

Remark 3 We observe that the largest first dimension among external nodes does not tend to zero in probability except for $d = 1$. For $d \ge 2$, it tends to a limit random variable that is not atomic at zero (this can be shown using the theory of branching processes).
Thus the proof above could not have used the uniform smallness of all cells. Despite the fact that the random partition contains some cells of huge diameter of non-shrinking size, the rule based on it is consistent.

Next we consider a scale-invariant version of the purely random forest classifier. In this variant the root cell is the entire feature space and the random tree is grown up to $k$ cuts. The leaf cell to cut and the direction $J$ in which the cell is cut are chosen uniformly at random, exactly as in the purely random forest classifier. The only difference is that the position of the cut is now chosen in a data-based manner: if the cell to be cut contains $N$ of the data points $X,X_1,\ldots,X_n$, then a random index $I$ is chosen uniformly from the set $\{0,1,\ldots,N\}$ and the cell is cut so that, when ordered by their $J$-th components, the points with the $I$ smallest values fall in one of the subcells and the rest in the other. To avoid ties, we assume that the distribution of $X$ has non-atomic marginals. In this case the random tree is well-defined with probability one. Just like before, the associated classifier takes a majority vote over the labels of the data points falling in the same cell as $X$. The scale-invariant random forest classifier is defined as the corresponding averaged classifier.

Theorem 4 Assume that the distribution of $X$ has non-atomic marginals in $\mathbb{R}^d$. Then the scale-invariant random forest classifier $\bar{g}_n$ is consistent whenever $k \to \infty$ and $k/n \to 0$ as $k \to \infty$.

Proof Once again, we may use Proposition 1 and (Devroye, Györfi, and Lugosi, 1996, Theorem 6.1) to prove consistency of the randomized base tree classifier $g_n$. The proof of the fact that $N_n(X,Z) \to \infty$ in probability is the same as in Theorem 2.

To show that $\mathrm{diam}(A_n(X,Z)) \to 0$ in probability, we begin by noting that, just as in the case of the purely random forest classifier, the partition tree is equivalent to a binary search tree, and therefore with probability converging to one, all final cells have been cut at least $\ell = \lfloor 0.37 \log k \rfloor$ times.

Since the classification rule is scale-invariant, we may assume, without loss of generality, that the distribution of $X$ is concentrated on the unit cube $[0,1]^d$.

Let $n_i$ denote the cardinality of the $i$-th cell in the partition, $1 \le i \le k+1$, where the cardinality of a cell $C$ is $|C \cap \{X,X_1,\ldots,X_n\}|$. Thus, $\sum_{i=1}^{k+1} n_i = n+1$. Let $V_i$ be the first dimension of the $i$-th cell. Let $V_n(X)$ be the first dimension of the cell that contains $X$. Clearly, given the $n_i$'s, $V_n(X) = V_i$ with probability $n_i/(n+1)$. We need to show that $\mathbf{E}[V_n(X)] \to 0$. But we have

$$\mathbf{E}[V_n(X)] = \mathbf{E}\Big[\frac{\sum_{i=1}^{k+1} n_i V_i}{n+1}\Big].$$

So, it suffices to show that $\mathbf{E}\big[\sum_i n_i V_i\big] = o(n)$.

It is worth mentioning that the random split of a box can be imagined as follows. Given that we split along the $s$-th coordinate axis, and that a box has $m$ points, we select one of the $m+1$ spacings defined by these $m$ points uniformly at random, still for that $s$-th coordinate. We cut that spacing properly but are free to do so anywhere.
We can cut in proportions $\lambda, 1-\lambda$ with $\lambda \in (0,1)$, and the value of $\lambda$ may vary from cut to cut and even be data-dependent. In fact, then, each internal and external node of our partition tree has associated with it two important quantities, a cardinality, and its first dimension. If we keep using $i$ to index cells, then we can use $n_i$ and $V_i$ for the $i$-th cell, even if it is an internal cell.

Let $A$ be the collection of external nodes in the subtree of the $i$-th cell. Then trivially,

$$\sum_{j \in A} n_j V_j \le n_i V_i.$$

Thus, if $E$ is the collection of all external nodes of a partition tree, $\ell$ is at most the minimum path distance from any cell in $E$ to the root, and $L$ is the collection of all nodes at distance $\ell$ from the root, then, by the last inequality,

$$\sum_{i \in E} n_i V_i \le \sum_{i \in L} n_i V_i.$$

Thus, using the notion of fill-up level $F_n$ of the binary search tree, and setting $\ell = \lfloor 0.37 \log k \rfloor$, we have

$$\mathbf{E}\Big[\sum_{i \in E} n_i V_i\Big] \le (n+1)\,\mathbf{P}\{F_n < \ell\} + \mathbf{E}\Big[\sum_{i \in L} n_i V_i\Big].$$

We have seen that the first term is $o(n)$. We argue that the second term is not more than $(n+1)\big(1 - 1/(8d)\big)^{\ell}$, which is $o(n)$ since $k \to \infty$. That will conclude the proof.

It suffices now to argue recursively and fix one cell of cardinality $n$ and first dimension $V$ (in this part of the argument $n$ stands for the cardinality of the fixed cell). Let $C$ be the collection of its children. We will show that

$$\mathbf{E}\Big[\sum_{i \in C} n_i V_i\Big] \le \Big(1 - \frac{1}{8d}\Big) nV.$$

Repeating this recursively $\ell$ times shows that

$$\mathbf{E}\Big[\sum_{i \in L} n_i V_i\Big] \le (n+1)\Big(1 - \frac{1}{8d}\Big)^{\ell}$$

because $V = 1$ at the root, whose cardinality is $n+1$.

Fix that cell of cardinality $n$, and assume without loss of generality that $V = 1$. Let the spacings along the first coordinate be $a_1,\ldots,a_{n+1}$, their sum being one. With probability $1 - 1/d$, the first axis is not cut, and thus, $\sum_{i \in C} n_i V_i = n$. With probability $1/d$, the first axis is cut in two parts. We will show that conditional on the event that the first direction is cut,

$$\mathbf{E}\Big[\sum_{i \in C} n_i V_i\Big] \le \frac{7}{8}\,n.$$

Unconditionally, we have

$$\mathbf{E}\Big[\sum_{i \in C} n_i V_i\Big] \le \Big(1 - \frac{1}{d}\Big) n + \frac{1}{d}\cdot\frac{7}{8}\,n = \Big(1 - \frac{1}{8d}\Big) n,$$

as required. So, let us prove the conditional result. Using $\delta_j$ to denote numbers drawn from $(0,1)$, possibly random, we have

$$\begin{aligned}
\mathbf{E}\Big[\sum_{i \in C} n_i V_i\Big]
&= \frac{1}{n+1}\,\mathbf{E}\Big[\sum_{j=1}^{n+1}\Big((j-1)\big(a_1 + \cdots + a_{j-1} + a_j\delta_j\big) + (n+1-j)\big(a_j(1-\delta_j) + a_{j+1} + \cdots + a_{n+1}\big)\Big)\Big] \\
&= \frac{1}{n+1}\,\mathbf{E}\Big[\sum_{k=1}^{n+1} a_k\Big(\sum_{k<j\le n+1}(j-1) + \sum_{1\le j<k}(n+1-j) + \delta_k(k-1) + (1-\delta_k)(n+1-k)\Big)\Big] \\
&\le \frac{1}{n+1}\sum_{k=1}^{n+1} a_k\Big(n(n+1) - \frac{k(k-1)}{2} - \frac{(n-k+1)(n-k+2)}{2} + \max(k-1,\,n+1-k)\Big) \\
&= \frac{1}{n+1}\sum_{k=1}^{n+1} a_k\Big(\frac{n(n+1)}{2} + (k-1)(n+1-k) + \max(k-1,\,n+1-k)\Big) \\
&\le \frac{1}{n+1}\Big(\frac{n(n+1)}{2} + \frac{n^2}{4} + n\Big)\sum_{k=1}^{n+1} a_k \\
&= n\,\frac{(3/4)\,n + 3/2}{n+1} \\
&\le \frac{7}{8}\,n \quad \text{if } n > 4.
\end{aligned}$$

Our definition of the scale-invariant random forest classifier permits cells to be cut such that one of the created cells becomes empty. One may easily prevent this by artificially forcing a minimum number of points in each cell. This may be done by restricting the random position of each cut so that both created subcells contain at least, say, $m$ points. By a minor modification of the proof above it is easy to see that as long as $m$ is bounded by a constant, the resulting random forest classifier remains consistent under the same conditions as in Theorem 4.

4. Creating Consistent Rules by Randomization

Proposition 1 shows that if a randomized classifier is consistent, then the corresponding averaged classifier remains consistent. The converse is not true. There exist inconsistent randomized classifiers such that their averaged version becomes consistent. Indeed, Breiman's (2001) original random forest classifier builds tree classifiers by successive randomized cuts until the cell of the point $X$ to be classified contains only one data point, and classifies $X$ as the label of this data point. Breiman's random forest classifier is just the averaged version of such randomized tree classifiers. The randomized base classifier $g_n(x,Z)$ is obviously not consistent for all distributions.

This does not imply that the averaged random forest classifier is not consistent. In fact, in this section we will see that averaging may boost inconsistent base classifiers into consistent ones. We point out in Section 6 that there are distributions of $(X,Y)$ for which Breiman's random forest classifier is not consistent. The counterexample shown in Proposition 8 is such that the distribution of $X$ does not have a density. It is possible, however, that Breiman's random forest classifier is consistent whenever the distribution of $X$ has a density.
Breiman's rule is difficult to analyze as each cut of the random tree is determined by a complicated function of the entire data set $D_n$ (i.e., both feature vectors and labels). However, in Section 6 below we provide arguments suggesting that Breiman's random forest is not consistent when a density exists. Instead of Breiman's rule, next we analyze a stylized version by showing that inconsistent randomized rules that take the label of only one neighbor into account can be made consistent by averaging.

For simplicity, we consider the case $d = 1$, though the whole argument extends, in a straightforward way, to the multivariate case. To avoid complications introduced by ties, assume that $X$ has a non-atomic distribution. Define a randomized nearest neighbor rule as follows: for a fixed $x \in \mathbb{R}$, let $X_{(1)}(x),X_{(2)}(x),\ldots,X_{(n)}(x)$ be the ordering of the data points $X_1,\ldots,X_n$ according to increasing distances from $x$. Let $U_1,\ldots,U_n$ be i.i.d. random variables, uniformly distributed over $[0,1]$. The vector of these random variables constitutes the randomization $Z$ of the classifier. We define $g_n(x,Z)$ to be equal to the label $Y_{(i)}(x)$ of the data point $X_{(i)}(x)$ for which

$$\max(i, mU_i) \le \max(j, mU_j) \quad \text{for all } j = 1,\ldots,n$$

where $m$ is a parameter of the rule. We call $X_{(i)}(x)$ the perturbed nearest neighbor of $x$. Note that $X_{(1)}(x)$ is the (unperturbed) nearest neighbor of $x$. To obtain the perturbed version, we artificially add a random uniform coordinate and select a data point with the randomized rule defined above. Since ties occur with probability zero, the perturbed nearest neighbor classifier is well defined almost surely. It is clearly not, in general, a consistent classifier.

Call the corresponding averaged classifier $\bar{g}_n(x) = \mathbf{1}_{\{\mathbf{E}_Z g_n(x,Z) \ge 1/2\}}$ the averaged perturbed nearest neighbor classifier.

In the proof of the consistency result below, we use Stone's (1977) general consistency theorem for locally weighted average classifiers, see also (Devroye, Györfi, and Lugosi, 1996, Theorem 6.3). Stone's theorem concerns classifiers that take the form

$$g_n(x) = \mathbf{1}_{\{\sum_{i=1}^n Y_i W_{ni}(x) \,\ge\, \sum_{i=1}^n (1-Y_i) W_{ni}(x)\}}$$

where the weights $W_{ni}(x) = W_{ni}(x,X_1,\ldots,X_n)$ are non-negative and sum to one. According to Stone's theorem, consistency holds if the following three conditions are satisfied:

(i) $\displaystyle \lim_{n\to\infty} \mathbf{E}\Big[\max_{1 \le i \le n} W_{ni}(X)\Big] = 0.$

(ii) For all $a > 0$, $\displaystyle \lim_{n\to\infty} \mathbf{E}\Big[\sum_{i=1}^n W_{ni}(X)\,\mathbf{1}_{\{\|X_i - X\| > a\}}\Big] = 0.$

(iii) There is a constant $c$ such that, for every non-negative measurable function $f$ satisfying $\mathbf{E}f(X) < \infty$, $\displaystyle \mathbf{E}\Big[\sum_{i=1}^n W_{ni}(X) f(X_i)\Big] \le c\,\mathbf{E}f(X).$

Theorem 5 The averaged perturbed nearest neighbor classifier $\bar{g}_n$ is consistent whenever the parameter $m$ is such that $m \to \infty$ and $m/n \to 0$.

Proof If we define

$$W_{ni}(x) = \mathbf{P}_Z\{X_i \text{ is the perturbed nearest neighbor of } x\}$$

then it is clear that the averaged perturbed nearest neighbor classifier is a locally weighted average classifier and Stone's theorem may be applied. It is convenient to introduce the notation

$$p_i(x) = \mathbf{P}_Z\{X_{(i)}(x) \text{ is the perturbed nearest neighbor of } x\}$$

and write $W_{ni}(x) = \sum_{j=1}^n \mathbf{1}_{\{X_i = X_{(j)}(x)\}}\, p_j(x)$.
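The perturbed nearest neighbor rule and its averaged version can be sketched in a few lines for $d = 1$. The sketch below estimates the probabilities $p_i$ by Monte Carlo rather than computing them in closed form; the function names, the number of draws, and the use of a Monte Carlo estimate are assumptions made for illustration only.

```python
import random


def perturbed_nn_rank(n, m, rng):
    """Return the rank i (1-based) that minimizes max(i, m * U_i),
    where U_1, ..., U_n are i.i.d. uniform on [0, 1)."""
    best_rank, best_val = None, float("inf")
    for i in range(1, n + 1):
        val = max(i, m * rng.random())
        if val < best_val:
            best_rank, best_val = i, val
    return best_rank


def estimate_weights(n, m, draws, rng):
    """Monte Carlo estimate of p_i = P_Z{the i-th nearest neighbor is selected}."""
    counts = [0] * n
    for _ in range(draws):
        counts[perturbed_nn_rank(n, m, rng) - 1] += 1
    return [c / draws for c in counts]


def averaged_perturbed_nn(x, data, m, draws, rng):
    """Averaged perturbed nearest neighbor decision at x for d = 1.

    data is a list of (x_i, y_i) pairs with y_i in {0, 1}.
    """
    ordered = sorted(data, key=lambda pair: abs(pair[0] - x))
    p = estimate_weights(len(ordered), m, draws, rng)
    weight_on_ones = sum(w for w, (_, y) in zip(p, ordered) if y == 1)
    return 1 if weight_on_ones >= sum(p) - weight_on_ones else 0
```

With an integer $m \le n$, the rule never selects a neighbor of rank larger than $m$, since the first neighbor always achieves a value below $m$. The estimated weights therefore vanish beyond rank $m$, which is the locality that condition (ii) of Stone's theorem relies on.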

To check the conditions of Stone's theorem, first note that

$$\begin{aligned}
p_i(x) &= \mathbf{P}\Big\{mU_i \le i \le \min_{j<i} mU_j\Big\} + \mathbf{P}\Big\{i < mU_i \le \min_{j \le n} \max(j, mU_j)\Big\} \\
&= \mathbf{1}_{\{i \le m\}}\,\frac{i}{m}\Big(1 - \frac{i}{m}\Big)^{i-1} + \mathbf{P}\Big\{i < mU_i \le \min_{j \le n} \max(j, mU_j)\Big\}.
\end{aligned}$$

Now we are prepared to check the conditions of Stone's theorem. To prove that (i) holds, note that by monotonicity of $p_i(x)$ in $i$, it suffices to show that $p_1(x) \to 0$. But clearly, for $m \ge 2$,

$$\begin{aligned}
p_1(x) &\le \frac{1}{m} + \mathbf{P}\Big\{U_1 \le \min_{j \le m} \max\Big(\frac{j}{m}, U_j\Big)\Big\} \\
&= \frac{1}{m} + \mathbf{E}\Big[\prod_{j=2}^m \mathbf{P}\Big\{U_1 \le \max\Big(\frac{j}{m}, U_j\Big) \,\Big|\, U_1\Big\}\Big] \\
&= \frac{1}{m} + \mathbf{E}\Big[\prod_{j=2}^m \big(1 - \mathbf{1}_{\{U_1 > j/m\}}\,U_1\big)\Big] \\
&\le \frac{1}{m} + \mathbf{E}\Big[(1 - U_1)^{\lfloor mU_1 \rfloor - 2}\,\mathbf{1}_{\{\lfloor mU_1 \rfloor \ge 3\}}\Big] + \mathbf{P}\{\lfloor mU_1 \rfloor < 3\}
\end{aligned}$$

which converges to zero by monotone convergence as $m \to \infty$.

(ii) follows by the condition $m/n \to 0$ since $\sum_{i=1}^n W_{ni}(X)\,\mathbf{1}_{\{\|X_i - X\| > a\}} = 0$ whenever the distance of the $m$-th nearest neighbor of $X$ to $X$ is at most $a$. But this happens eventually, almost surely, see (Devroye, Györfi, and Lugosi, 1996, Lemma 5.1).

Finally, to check (iii), we use again the monotonicity of $p_i(x)$ in $i$. We may write $p_i(x) = a_i + a_{i+1} + \cdots + a_n$ for some non-negative numbers $a_j$, $1 \le j \le n$, depending upon $m$ and $n$ but not $x$. Observe that $\sum_{j=1}^n j a_j = \sum_{i=1}^n p_i(x) = 1$. But then

$$\begin{aligned}
\mathbf{E}\Big[\sum_{i=1}^n W_{ni}(X) f(X_i)\Big] &= \mathbf{E}\Big[\sum_{i=1}^n p_i(X) f(X_{(i)})\Big] \\
&= \mathbf{E}\Big[\sum_{i=1}^n \sum_{j=i}^n a_j f(X_{(i)})\Big] \\
&= \mathbf{E}\Big[\sum_{j=1}^n a_j \sum_{i=1}^j f(X_{(i)})\Big] \\
&= \sum_{j=1}^n a_j \sum_{i=1}^j \mathbf{E}\big[f(X_{(i)})\big] \\
&\le c \sum_{j=1}^n a_j\, j\, \mathbf{E}f(X) \\
&= c\,\mathbf{E}f(X) \sum_{j=1}^n a_j\, j \\
&= c\,\mathbf{E}f(X)
\end{aligned}$$

as desired. The inequality follows from Stone's (1977) lemma, see (Devroye, Györfi, and Lugosi, 1996, Lemma 5.3), where $c$ is a constant.

5. Bagging

One of the first and simplest ways of randomizing and averaging classifiers in order to improve their performance is bagging, suggested by Breiman (1996). In bagging, randomization is achieved by generating many bootstrap samples from the original data set. Breiman suggests selecting $n$ training pairs $(X_i,Y_i)$ at random, with replacement from the bag of all training pairs $\{(X_1,Y_1),\ldots,(X_n,Y_n)\}$. Denoting the random selection process by $Z$, this way one obtains new training data $D_n(Z)$ with possible repetitions and given a classifier $g_n(X,D_n)$, one can calculate the randomized classifier $g_n(X,Z,D_n) = g_n(X,D_n(Z))$. Breiman suggests repeating this procedure for many independent draws of the bootstrap sample, say $m$ of them, and calculating the voting classifier $g_n^{(m)}(X,Z^m,D_n)$ as defined in Section 2.

In this section we consider a generalized version of bagging predictors in which the size of the bootstrap samples is not necessarily the same as that of the original sample. Also, to avoid complications and ambiguities due to replicated data points, we exclude repetitions in the bootstrapped data. This is assumed for convenience but sampling with replacement can be treated by minor modifications of the arguments below.

To describe the model we consider, introduce a parameter $q_n \in [0,1]$. In the bootstrap sample $D_n(Z)$ each data pair $(X_i,Y_i)$ is present with probability $q_n$, independently of each other. Thus, the size of the bootstrapped data is a binomial random variable $N$ with parameters $n$ and $q_n$. Given a sequence of (non-randomized) classifiers $\{g_n\}$, we may thus define the randomized classifier

$$g_n(X,Z,D_n) = g_N(X,D_n(Z)),$$

that is, the classifier is defined based on the randomly re-sampled data.
By drawing $m$ independent bootstrap samples $D_n(Z_1),\ldots,D_n(Z_m)$ (with sizes $N_1,\ldots,N_m$), we may define the bagging classifier $g_n^{(m)}(X,Z^m,D_n)$ as the voting classifier based on the randomized classifiers $g_{N_1}(X,D_n(Z_1)),\ldots,g_{N_m}(X,D_n(Z_m))$ as in Section 2.

For the theoretical analysis it is more convenient to consider the averaged classifier $\bar{g}_n(x,D_n) = \mathbf{1}_{\{\mathbf{E}_Z g_N(x,D_n(Z)) \ge 1/2\}}$, which is the limiting classifier one obtains as the number $m$ of the bootstrap replicates grows to infinity.

The following result establishes consistency of bagging classifiers under the assumption that the original classifier is consistent. It suffices that the expected size of the bootstrap sample goes to infinity. The result is an immediate consequence of Proposition 1. Note that the choice of $m$ does not matter in Theorem 6. It can be one, constant, or a function of $n$.

Theorem 6 Let $\{g_n\}$ be a sequence of classifiers that is consistent for the distribution of $(X,Y)$. Consider the bagging classifiers $g_n^{(m)}(x,Z^m,D_n)$ and $\bar{g}_n(x,D_n)$ defined above, using parameter $q_n$. If $nq_n \to \infty$ as $n \to \infty$ then both classifiers are consistent.
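The generalized bagging scheme of this section, in which each training pair is kept independently with probability $q_n$, can be sketched as follows. The function names are hypothetical, the base rule shown is a one-dimensional 1-nearest neighbor rule chosen only as an example, and returning 0 on an empty subsample is an assumption of this sketch (the text does not specify that case).

```python
import random


def subsample(data, q, rng):
    """Keep each training pair independently with probability q (no repetitions)."""
    return [pair for pair in data if rng.random() < q]


def bagged_vote(x, data, base_rule, q, m, rng):
    """Voting classifier over m independent subsamples, ties broken in favor of 1."""
    ones = sum(base_rule(x, subsample(data, q, rng)) for _ in range(m))
    return 1 if 2 * ones >= m else 0


def one_nn(x, sample):
    """1-nearest neighbor rule for d = 1.

    Returns 0 when the subsample is empty (an assumption of this sketch).
    Ties in distance are broken in favor of the smallest index.
    """
    if not sample:
        return 0
    return min(sample, key=lambda pair: abs(pair[0] - x))[1]
```

The size of each subsample is then binomial with parameters $n$ and $q$, matching the random size $N$ used in the definition of $g_N(X, D_n(Z))$.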

If a classifier is insensitive to duplicates in the data, Breiman's original suggestion is roughly equivalent to taking $q_n \approx 1 - 1/e$. However, it may be advantageous to choose much smaller values of $q_n$. In fact, small values of $q_n$ may turn inconsistent classifiers into consistent ones via the bagging procedure. We illustrate this phenomenon on the simple example of the 1-nearest neighbor rule.

Recall that the 1-nearest neighbor rule sets $g_n(x,D_n) = Y_{(1)}(x)$ where $Y_{(1)}(x)$ is the label of the feature vector $X_{(1)}(x)$ whose Euclidean distance to $x$ is minimal among all $X_1,\ldots,X_n$. Ties are broken in favor of smallest indices. It is well-known that $g_n$ is consistent only if either $L^* = 0$ or $L^* = 1/2$; otherwise its asymptotic probability of error is strictly greater than $L^*$. However, by bagging one may turn the 1-nearest neighbor classifier into a consistent one, provided that the size of the bootstrap sample is sufficiently small. The next result characterizes consistency of the bagging version of the 1-nearest neighbor classifier in terms of the parameter $q_n$.

Theorem 7 The bagging averaged 1-nearest neighbor classifier $\bar{g}_n(x,D_n)$ is consistent for all distributions of $(X,Y)$ if and only if $q_n \to 0$ and $nq_n \to \infty$.

Proof It is obvious that both $q_n \to 0$ and $nq_n \to \infty$ are necessary for consistency for all distributions.

Assume now that $q_n \to 0$ and $nq_n \to \infty$. The key observation is that $\bar{g}_n(x,D_n)$ is a locally weighted average classifier for which Stone's consistency theorem, recalled in Section 4, applies. Recall that for a fixed $x \in \mathbb{R}$, $X_{(1)}(x),X_{(2)}(x),\ldots,X_{(n)}(x)$ denotes the ordering of the data points $X_1,\ldots,X_n$ according to increasing distances from $x$. (Points with equal distances to $x$ are ordered according to their indices.) Observe that $\bar{g}_n$ may be written as

$$\bar{g}_n(x,D_n) = \mathbf{1}_{\{\sum_{i=1}^n Y_i W_{ni}(x) \,\ge\, \sum_{i=1}^n (1-Y_i) W_{ni}(x)\}}$$

where $W_{ni}(x) = \sum_{j=1}^n \mathbf{1}_{\{X_i = X_{(j)}(x)\}}\, p_j(x)$ and $p_i(x) = (1-q_n)^{i-1} q_n$ is defined as the probability (with respect to the random selection $Z$ of the bootstrap sample) that $X_{(i)}(x)$ is the nearest neighbor of $x$ in the sample $D_n(Z)$.
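Because the weights have the closed form $p_i(x) = (1-q_n)^{i-1} q_n$, the averaged bagged 1-nearest neighbor decision can be computed without any resampling. The following sketch does so for $d = 1$; the function name and the small data set in the usage lines are hypothetical.

```python
def bagged_1nn_averaged(x, data, q):
    """Averaged bagged 1-NN decision at x for d = 1, using the geometric
    weights p_i = (1 - q)^(i - 1) * q on the i-th nearest neighbor.

    data is a list of (x_i, y_i) pairs with y_i in {0, 1}.
    """
    # sorted() is stable, so points at equal distance keep their index order.
    ordered = sorted(data, key=lambda pair: abs(pair[0] - x))
    weight_ones = 0.0
    weight_zeros = 0.0
    for i, (_, y) in enumerate(ordered, start=1):
        p_i = (1 - q) ** (i - 1) * q
        if y == 1:
            weight_ones += p_i
        else:
            weight_zeros += p_i
    return 1 if weight_ones >= weight_zeros else 0


# Usage on hypothetical data: the nearest neighbor of x = 0 has label 0,
# while the next three neighbors have label 1.
toy = [(0.1, 0), (0.2, 1), (0.3, 1), (0.4, 1)]
large_q = bagged_1nn_averaged(0.0, toy, q=0.9)   # dominated by the nearest neighbor
small_q = bagged_1nn_averaged(0.0, toy, q=0.1)   # weight spread over more neighbors
```

A large $q$ concentrates almost all weight on the nearest neighbor, reproducing the plain 1-nearest neighbor rule, whereas a small $q$ spreads the weight over many neighbors. This is the mechanism behind the conditions $q_n \to 0$ and $nq_n \to \infty$.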
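It suffices to prove that the weights $W_{ni}(X)$ satisfy the three conditions of Stone's theorem. Condition (i) obviously holds because $\max_{1 \le i \le n} W_{ni}(X) = p_1(X) = q_n \to 0$.

To check condition (ii), define $k_n = \lceil \sqrt{n/q_n}\, \rceil$. Since $nq_n \to \infty$ implies that $k_n/n \to 0$, it follows from (Devroye, Györfi, and Lugosi, 1996, Lemma 5.1) that eventually, almost surely, $\|X - X_{(k_n)}(X)\| \le a$ and therefore

$$\begin{aligned}
\sum_{i=1}^n W_{ni}(X)\,\mathbf{1}_{\{\|X_i - X\| > a\}} &\le \sum_{i=k_n+1}^n p_i(X) \\
&= q_n \sum_{i=k_n+1}^n (1-q_n)^{i-1} \\
&\le (1-q_n)^{k_n} \\
&\le (1-q_n)^{\sqrt{n/q_n}} \\
&\le e^{-\sqrt{nq_n}}
\end{aligned}$$

where we used $1 - q_n \le e^{-q_n}$. Therefore, $\sum_{i=1}^n W_{ni}(X)\,\mathbf{1}_{\{\|X_i - X\| > a\}} \to 0$ almost surely and Stone's second condition is satisfied by dominated convergence.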
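As a worked instance of the bound used for condition (ii), take the hypothetical values $n = 10^6$ and $q_n = 10^{-2}$ (chosen only for illustration, so that $q_n$ is small while $nq_n = 10^4$ is large). Then

$$k_n = \Big\lceil \sqrt{n/q_n}\, \Big\rceil = \Big\lceil \sqrt{10^{8}}\, \Big\rceil = 10^{4}, \qquad \frac{k_n}{n} = 10^{-2}, \qquad e^{-\sqrt{nq_n}} = e^{-\sqrt{10^{4}}} = e^{-100}.$$

In this example only the $10^4$ nearest neighbors, that is, one percent of the sample, carry non-negligible weight, and the total weight placed on all farther points is at most $e^{-100}$.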

Finally, condition (iii) follows from the fact that $p_i(x)$ is monotone decreasing in $i$, after using an argument as in the proof of Theorem

Random Forests Based on Greedily Grown Trees

In this section we study random forest classifiers that are based on randomized tree classifiers that are constructed in a greedy manner, by recursively splitting cells to minimize an empirical error criterion. Such greedy forests were introduced by Breiman (2001, 2004) and have shown excellent performance in many applications. One of his most popular classifiers is an averaging classifier, $\bar g_n$, based on a randomized tree classifier $g_n(x, Z)$ defined as follows. The algorithm has a parameter $1 \le v < d$ which is a positive integer. The feature space $\mathbb{R}^d$ is partitioned recursively to form a tree partition. The root of the random tree is $\mathbb{R}^d$. At each step of the construction of the tree, a leaf is chosen uniformly at random. $v$ variables are selected uniformly at random from the $d$ candidates $x^{(1)}, \ldots, x^{(d)}$. A split is selected along one of these $v$ variables to minimize the number of misclassified training points if a majority vote is used in each cell. The procedure is repeated until every cell contains exactly one training point $X_i$. (This is always possible if the distribution of $X$ has non-atomic marginals.) In some versions of Breiman's algorithm, a bootstrap subsample of the training data is selected before the construction of each tree to increase the effect of randomization.

As observed by Lin and Jeon (2006), Breiman's classifier is a weighted layered nearest neighbor classifier, that is, a classifier that takes a (weighted) majority vote among the layered nearest neighbors of the observation $x$. $X_i$ is called a layered nearest neighbor of $x$ if the rectangle defined by $x$ and $X_i$ as its opposing vertices does not contain any other data point $X_j$ ($j \ne i$). This property of Breiman's random forest classifier is a simple consequence of the fact that each tree is grown until every cell contains just one data point.
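The definition of a layered nearest neighbor can be checked directly on a small sample. The sketch below is a brute-force illustration, assuming closed axis-aligned rectangles (ties have probability zero when the marginals are non-atomic); the function name `layered_nearest_neighbors` is hypothetical.

```python
import numpy as np

def layered_nearest_neighbors(X, x):
    """Indices i such that the axis-aligned rectangle with opposite
    vertices x and X[i] contains no other data point X[j], j != i."""
    X = np.asarray(X, dtype=float)
    x = np.asarray(x, dtype=float)
    lo = np.minimum(X, x)  # lower corner of each rectangle, shape (n, d)
    hi = np.maximum(X, x)  # upper corner of each rectangle, shape (n, d)
    result = []
    for i in range(len(X)):
        # Which data points fall inside the rectangle spanned by x and X[i]?
        inside = np.all((X >= lo[i]) & (X <= hi[i]), axis=1)
        inside[i] = False  # X[i] itself does not count
        if not inside.any():
            result.append(i)
    return result
```

For example, with `x = (0, 0)` and data points `(1, 1)` and `(2, 2)`, only `(1, 1)` is a layered nearest neighbor, because the rectangle spanned by `x` and `(2, 2)` contains `(1, 1)`.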
Unfortunately, this simple property prevents the random tree classifier from being consistent for all distributions:

Proposition 8 There exists a distribution of $(X, Y)$ such that $X$ has non-atomic marginals for which Breiman's random forest classifier is not consistent.

Proof The proof works for any weighted layered nearest neighbor classifier. Let the distribution of $X$ be uniform on the segment $\{x = (x^{(1)}, \ldots, x^{(d)}) : x^{(1)} = \cdots = x^{(d)},\ x^{(1)} \in [0,1]\}$ and let the distribution of $Y$ be such that $L^* \notin \{0, 1/2\}$. Then with probability one, $X$ has only two layered nearest neighbors and the classification rule is not consistent. (Note that Problem 11.6 in Devroye, Györfi, and Lugosi 1996 erroneously asks the reader to prove consistency of the (unweighted) layered nearest neighbor rule for any distribution with non-atomic marginals. As the example in this proof shows, the statement of the exercise is incorrect. Consistency of the layered nearest neighbor rule is true, however, if the distribution of $X$ has a density.)

One may also wonder whether Breiman's random forest classifier is consistent if, instead of growing the tree down to cells with a single data point, one uses a different stopping rule, for example if one fixes the total number of cuts at $k$ and lets $k$ grow slowly as in the examples of Section 3. The next two-dimensional example provides an indication that this is not necessarily the case. Consider the joint distribution of $(X, Y)$ sketched in Figure 1. $X$ has a uniform distribution on $[0,1] \times [0,1] \,\cup\, [1,2] \times [1,2] \,\cup\, [2,3] \times [2,3]$. $Y$ is a function of $X$, that is, $\eta(x) \in \{0, 1\}$ and $L^* = 0$. The lower left square $[0,1] \times [0,1]$ is divided into countably infinitely many vertical stripes in

which the stripes with $\eta(x) = 0$ and $\eta(x) = 1$ alternate. The upper right square $[2,3] \times [2,3]$ is divided similarly into horizontal stripes. The middle rectangle $[1,2] \times [1,2]$ is a $2 \times 2$ checkerboard.

Figure 1: An example of a distribution for which greedy random forests are inconsistent. The distribution of $X$ is uniform on the union of the three large squares. White areas represent the set where $\eta(x) = 0$ and on the grey regions $\eta(x) = 1$.

Consider Breiman's random forest classifier with $v = 1$ (the only possible choice when $d = 2$). For simplicity, consider the case when, instead of minimizing the empirical error, each tree is grown by minimizing the true probability of error at each split in each random tree. Then it is easy to see that no matter what the sequence of random selection of split directions is and no matter for how long each tree is grown, no tree will ever cut the middle rectangle and therefore the probability of error of the corresponding random forest classifier is at least $1/6$.

It is not so clear what happens in this example if the successive cuts are made by minimizing the empirical error. Whether the middle square is ever cut will depend on the precise form of the stopping rule and the exact parameters of the distribution. The example is here to illustrate that consistency of greedily grown random forests is a delicate issue. Note however that if Breiman's original algorithm is used in this example (i.e., when all cells with more than one data point in it are split) then one obtains a consistent classification rule. If, on the other hand, horizontal or vertical cuts are selected to minimize the probability of error, and $k \to \infty$ in such a way that $k = O(n^{1/2 - \varepsilon})$ for some $\varepsilon > 0$, then, as errors on the middle square are never more than about $O(1/\sqrt{n})$ (by the limit law for the Kolmogorov-Smirnov statistic), we see that thin strips of probability mass more than $1/\sqrt{n}$ are preferentially cut. By choosing the probability weights of the strips, one can easily see that we can construct more than $2k$ such strips.
Thus, when $k = O(n^{1/2 - \varepsilon})$, no consistency is possible on that example.

We note here that many versions of random forest classifiers build on random tree classifiers based on bootstrap subsampling. This is the case of Breiman's principal random forest classifier.
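The lower bound of $1/6$ in the Figure 1 example can be checked by a short computation. The sketch below assumes, as the description of the figure suggests, that each colour of the $2 \times 2$ checkerboard covers half of the middle square. It writes $g$ for any classifier that is constant on that square, which is the case when no tree ever cuts it:

\[
\mathbb{P}\{g(X) \neq Y\} \;\ge\; \mathbb{P}\{X \in [1,2] \times [1,2]\} \cdot \tfrac{1}{2} \;=\; \tfrac{1}{3} \cdot \tfrac{1}{2} \;=\; \tfrac{1}{6}.
\]

Here $1/3$ is the mass of the middle square under the uniform distribution on the three unit squares, and $1/2$ is the fraction of that square on which a constant decision disagrees with $\eta$.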

Breiman suggests to take a random sample of size $n$ drawn with replacement from the original data. While this may result in an improved behavior in some practical instances, it is easy to see that such a subsampling procedure does not change the consistency property of any of the classifiers studied in this paper. For example, non-consistency of Breiman's random forest classifier with bootstrap resampling for the distribution considered in the proof of Proposition 8 follows from the fact that the two layered nearest neighbors on both sides are included in the bootstrap sample with a probability bounded away from zero and therefore the weight of these two points is too large, making consistency impossible.

In order to remedy the inconsistency of greedily grown tree classifiers, Devroye, Györfi, and Lugosi (1996, Section 20.14) introduce a greedy tree classifier which, instead of cutting every cell along just one direction, cuts out a whole hyper-rectangle from a cell in a way to optimize the empirical error. The disadvantage of this method is that in each step, $d$ parameters need to be optimized jointly and this may be computationally prohibitive if $d$ is not very small. (The computational complexity of the method is $O(n^d)$.) However, we may use the methodology of random forests to define a computationally feasible consistent greedily grown random forest classifier.

Figure 2: A tree based on partitioning the plane into rectangles. The right subtree of each internal node belongs to the inside of a rectangle, and the left subtree belongs to the complement of the same rectangle ($i^c$ denotes the complement of $i$). Rectangles are not allowed to overlap.

In order to define the consistent greedy random forest, we first recall the tree classifier of Devroye, Györfi, and Lugosi (1996, Section 20.14). The space is partitioned into rectangles as shown in Figure 2. A hyper-rectangle defines a split in a natural way. A partition is denoted by $\mathcal{P}$, and a decision on a set $A \in \mathcal{P}$ is by majority vote.
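We write $g_{\mathcal{P}}$ for such a rule:

\[
g_{\mathcal{P}}(x) = \mathbb{1}_{\left\{ \sum_{i : X_i \in A(x)} Y_i \;>\; \sum_{i : X_i \in A(x)} (1 - Y_i) \right\}}
\]

where $A(x)$ denotes the cell of the partition containing $x$. Given a partition $\mathcal{P}$, a legal hyper-rectangle $T$ is one for which $T \cap A = \emptyset$ or $T \subseteq A$ for all sets $A \in \mathcal{P}$. If we refine $\mathcal{P}$ by adding a legal rectangle $T$ somewhere, then we obtain the partition $\mathcal{T}$. The decision $g_{\mathcal{T}}$ agrees with $g_{\mathcal{P}}$ except on the set $A \in \mathcal{P}$ that contains $T$.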
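The majority-vote rule $g_{\mathcal{P}}$ can be sketched in a few lines once the partition is available as a function that maps a point to its cell. In the sketch below, `cell_of` is a hypothetical user-supplied function returning a hashable cell identifier, and `fit_partition_rule` is an illustrative name; neither comes from the paper.

```python
from collections import defaultdict

def fit_partition_rule(X, y, cell_of):
    """Return g_P: predicts 1 on a cell if and only if the cell holds
    strictly more training labels equal to 1 than labels equal to 0."""
    counts = defaultdict(lambda: [0, 0])  # cell id -> [#zeros, #ones]
    for xi, yi in zip(X, y):
        counts[cell_of(xi)][int(yi)] += 1

    def g(x):
        # Cells without training points default to label 0.
        zeros, ones = counts.get(cell_of(x), (0, 0))
        return int(ones > zeros)

    return g
```

Ties and empty cells are decided in favor of label 0, which matches the strict inequality in the definition of $g_{\mathcal{P}}$.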

Introduce the convenient notation

\[
\nu_j(A) = \mathbb{P}\{X \in A, Y = j\}, \qquad
\nu_{j,n}(A) = \frac{1}{n} \sum_{i=1}^n I_{\{X_i \in A,\, Y_i = j\}}, \qquad j \in \{0, 1\}.
\]

The empirical error of $g_{\mathcal{P}}$ is

\[
\widehat{L}_n(\mathcal{P}) \stackrel{\mathrm{def}}{=} \sum_{R \in \mathcal{P}} \widehat{L}_n(R),
\]

where

\[
\widehat{L}_n(R) = \frac{1}{n} \sum_{i=1}^n I_{\{X_i \in R,\, g_{\mathcal{P}}(X_i) \ne Y_i\}} = \min\big(\nu_{0,n}(R), \nu_{1,n}(R)\big).
\]

We may similarly define $\widehat{L}_n(\mathcal{T})$. Given a partition $\mathcal{P}$, the greedy classifier selects that legal rectangle $T$ for which $\widehat{L}_n(\mathcal{T})$ is minimal (with any appropriate policy for breaking ties). Let $R$ be the set of $\mathcal{P}$ containing $T$. Then the greedy classifier picks that $T$ for which

\[
\widehat{L}_n(T) + \widehat{L}_n(R - T) - \widehat{L}_n(R)
\]

is minimal. Starting with the trivial partition $\mathcal{P}_0 = \{\mathbb{R}^d\}$, we repeat the previous step $k$ times, leading thus to $k + 1$ regions. The sequence of partitions is denoted by $\mathcal{P}_0, \mathcal{P}_1, \ldots, \mathcal{P}_k$. Devroye, Györfi, and Lugosi (1996, Theorem 20.9) establish consistency of this classifier. More precisely, it is shown that if $X$ has non-atomic marginals, then the greedy classifier with $k \to \infty$ and $k = o\big(\sqrt{n/\log n}\big)$ is consistent.

Based on the greedy tree classifier, we may define a random forest classifier by considering its bagging version. More precisely, let $q_n \in [0,1]$ be a parameter and let $Z = Z(D_n)$ denote a random subsample of size binomial$(n, q_n)$ of the training data (i.e., each pair $(X_i, Y_i)$ is selected at random, without replacement, from $D_n$, with probability $q_n$) and let $g_n(x, Z)$ be the greedy tree classifier (as defined above) based on the training data $Z(D_n)$. Define the corresponding averaged classifier $\bar g_n$. We call $\bar g_n$ the greedy random forest classifier. Note that $\bar g_n$ is just the bagging version of the greedy tree classifier and therefore Theorem 6 applies:

Theorem 9 The greedy random forest classifier is consistent whenever $X$ has non-atomic marginals in $\mathbb{R}^d$, $n q_n \to \infty$, $k \to \infty$ and $k = o\big(\sqrt{n q_n / \log(n q_n)}\big)$ as $n \to \infty$.

Proof This follows from Theorem 6 and the fact that the greedy tree classifier is consistent (see Theorem 20.9 of Devroye, Györfi, and Lugosi (1996)).

Observe that the computational complexity of building the randomized tree classifier $g_n(x, Z)$ is $O((n q_n)^d)$. Thus, the complexity of computing the voting classifier $g_n^{(m)}$ is $m (n q_n)^d$.
If $q_n \ll 1$, this may be a significant speed-up compared to the complexity $O(n^d)$ of computing a single tree classifier using the full sample. Repeated subsampling and averaging may make up for the effect of decreased sample size.
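The ingredients of the greedy random forest described in this section can be sketched as follows: the random subsample of size binomial$(n, q_n)$, the empirical error of a cell under majority vote, and the quantity minimized by the greedy step. This is only an illustration under simplifying assumptions: the candidate rectangle is cut out of the root cell $\mathbb{R}^d$, no search over rectangles is performed, and all function names are hypothetical.

```python
import numpy as np

def subsample_mask(n, q, rng):
    """Keep each of the n training pairs independently with probability q,
    so the subsample size is binomial(n, q)."""
    return rng.random(n) < q

def empirical_error(labels, n):
    """min(nu_{0,n}(R), nu_{1,n}(R)): error of the majority vote on a cell R,
    where `labels` are the Y_i of the sample points falling in R."""
    labels = np.asarray(labels)
    ones = int(np.sum(labels == 1))
    zeros = labels.size - ones
    return min(zeros, ones) / n

def greedy_criterion(X, y, lo, hi):
    """Change in empirical error when the rectangle T = [lo, hi] is cut out
    of the root cell R = R^d: err(T) + err(R - T) - err(R)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(y)
    in_T = np.all((X >= lo) & (X <= hi), axis=1)
    return (empirical_error(y[in_T], n)
            + empirical_error(y[~in_T], n)
            - empirical_error(y, n))
```

The criterion is never positive, since cutting out a rectangle cannot increase the empirical error of the majority vote. The greedy step looks for the legal rectangle that makes it most negative.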

Acknowledgments

We thank James Malley for stimulating discussions. We also thank three referees for valuable comments and insightful suggestions. The second author's research was sponsored by NSERC Grant A3456 and FQRNT Grant 90-ER. The third author acknowledges support by the Spanish Ministry of Science and Technology grant MTM and by the PASCAL Network of Excellence under EC grant no.

References

Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Computation, 9.

L. Breiman. Bagging predictors. Machine Learning, 24, 1996.

L. Breiman. Arcing classifiers. The Annals of Statistics, 24.

L. Breiman. Some infinite theory for predictor ensembles. Technical Report 577, Statistics Department, UC Berkeley, breiman.

L. Breiman. Random forests. Machine Learning, 45:5–32.

L. Breiman. Consistency for a simple model of random forests. Technical Report 670, Statistics Department, UC Berkeley.

A. Cutler and G. Zhao. Pert - Perfect random tree ensembles. Computing Science and Statistics, 33.

L. Devroye. Applications of the theory of records in the study of random trees. Acta Informatica, 26.

L. Devroye. A note on the height of binary search trees. Journal of the ACM, 33.

L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, 1996.

T.G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40.

T.G. Dietterich. Ensemble methods in machine learning. In J. Kittler and F. Roli (Eds.), First International Workshop on Multiple Classifier Systems, Lecture Notes in Computer Science, pp. 1–15, Springer-Verlag, New York.

Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In L. Saitta (Ed.), Machine Learning: Proceedings of the 13th International Conference, Morgan Kaufmann, San Francisco.

Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101, 2006.

N. Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7.

H.M. Mahmoud. Evolution of Random Search Trees. John Wiley, New York.

C. Stone. Consistent nonparametric regression. The Annals of Statistics, 5.