Consistency of Random Forests and Other Averaging Classifiers


Journal of Machine Learning Research 9 (2008)
Submitted 1/08; Revised 5/08; Published 9/08

Gérard Biau
LSTA & LPMA, Université Pierre et Marie Curie Paris VI, Boîte 158, 175 rue du Chevaleret, Paris, France

Luc Devroye
School of Computer Science, McGill University, Montreal, Canada H3A 2K6

Gábor Lugosi
ICREA and Department of Economics, Pompeu Fabra University, Ramon Trias Fargas, Barcelona, Spain

Editor: Peter Bartlett

© 2008 Gérard Biau, Luc Devroye and Gábor Lugosi.

Abstract

In the last years of his life, Leo Breiman promoted random forests for use in classification. He suggested using averaging as a means of obtaining good discrimination rules. The base classifiers used for averaging are simple and randomized, often based on random samples from the data. He left a few questions unanswered regarding the consistency of such rules. In this paper, we give a number of theorems that establish the universal consistency of averaging rules. We also show that some popular classifiers, including one suggested by Breiman, are not universally consistent.

Keywords: random forests, classification trees, consistency, bagging

1. Introduction

This paper is dedicated to the memory of Leo Breiman.

Ensemble methods, popular in machine learning, are learning algorithms that construct a set of many individual classifiers (called base learners) and combine them to classify new data points by taking a weighted or unweighted vote of their predictions. It is now well-known that ensembles are often much more accurate than the individual classifiers that make them up. The success of ensemble algorithms on many benchmark data sets has raised considerable interest in understanding why such methods succeed and identifying circumstances in which they can be expected to produce good results. These methods differ in the way the base learner is fit and combined. For example, bagging (Breiman, 1996) proceeds by generating bootstrap samples from the original data set, constructing a classifier from each bootstrap sample, and voting to combine. In boosting (Freund and Schapire, 1996) and arcing algorithms (Breiman, 1998) the successive classifiers are constructed by giving increased weight to those points that have been frequently misclassified, and the classifiers are combined using weighted voting. On the other hand, random split selection (Dietterich, 2000) grows trees on the original data set. For a fixed number $S$, at each node, $S$ best splits (in terms of minimizing deviance) are found and the actual split is randomly and uniformly selected from them. For a comprehensive review of ensemble methods, we refer the reader to Dietterich (2000a) and the references therein.

Breiman (2001) provides a general framework for tree ensembles called random forests. Each tree depends on the values of a random vector sampled independently and with the same distribution for all trees. Thus, a random forest is a classifier that consists of many decision trees and outputs the class that is the mode of the classes output by individual trees. Algorithms for inducing a random forest were first developed by Breiman and Cutler, and Random Forests is their trademark. The web page provides a collection of downloadable technical reports, and gives an overview of random forests as well as comments on the features of the method.

Random forests have been shown to give excellent performance on a number of practical problems. They work fast, generally exhibit a substantial performance improvement over single tree classifiers such as CART, and yield generalization error rates that compare favorably to the best statistical and machine learning methods. In fact, random forests are among the most accurate general-purpose classifiers available (see, for example, Breiman, 2001).

Different random forests differ in how randomness is introduced in the tree building process, ranging from extreme random splitting strategies (Breiman, 2000; Cutler and Zhao, 2001) to more involved data-dependent strategies (Amit and Geman, 1997; Breiman, 2001; Dietterich, 2000). As a matter of fact, the statistical mechanism of random forests is not yet fully understood and is still under active investigation. Unlike single trees, where consistency is proved letting the number of observations in each terminal node become large (Devroye, Györfi, and Lugosi, 1996, Chapter 20), random forests are generally built to have a small number of cases in each terminal node.
Although the mechanism of random forest algorithms appears simple, it is difficult to analyze and remains largely unknown. Some attempts to investigate the driving force behind consistency of random forests are by Breiman (2000, 2004) and Lin and Jeon (2006), who establish a connection between random forests and adaptive nearest neighbor methods. Meinshausen (2006) proved consistency of certain random forests in the context of so-called quantile regression.

In this paper we offer consistency theorems for various versions of random forests and other randomized ensemble classifiers. In Section 2 we introduce a general framework for studying classifiers based on averaging randomized base classifiers. We prove a simple but useful proposition showing that averaged classifiers are consistent whenever the base classifiers are. In Section 3 we prove consistency of two simple random forest classifiers, the purely random forest (suggested by Breiman as a starting point for study) and the scale-invariant random forest classifiers. In Section 4 it is shown that averaging may convert inconsistent rules into consistent ones. In Section 5 we briefly investigate consistency of bagging rules. We show that, in general, bagging preserves consistency of the base rule and it may even create consistent rules from inconsistent ones. In particular, we show that if the bootstrap samples are sufficiently small, the bagged version of the 1-nearest neighbor classifier is consistent.
Finally, in Section 6 we consider random forest classifiers based on randomized, greedily grown tree classifiers. We argue that some greedy random forest classifiers, including Breiman's random forest classifier, are inconsistent and suggest a consistent greedy random forest classifier.

2. Voting and Averaged Classifiers

Let $(X,Y),(X_1,Y_1),\ldots,(X_n,Y_n)$ be i.i.d. pairs of random variables such that $X$ (the so-called feature vector) takes its values in $\mathbb{R}^d$ while $Y$ (the label) is a binary $\{0,1\}$-valued random variable. The joint distribution of $(X,Y)$ is determined by the marginal distribution $\mu$ of $X$ (i.e., $P\{X \in A\} = \mu(A)$ for all Borel sets $A \subset \mathbb{R}^d$) and the a posteriori probability $\eta : \mathbb{R}^d \to [0,1]$ defined by

$$\eta(x) = P\{Y = 1 \mid X = x\}.$$

The collection $(X_1,Y_1),\ldots,(X_n,Y_n)$ is called the training data, and is denoted by $D_n$. A classifier $g_n$ is a binary-valued function of $X$ and $D_n$ whose probability of error is defined by

$$L(g_n) = P_{(X,Y)}\{g_n(X,D_n) \ne Y\},$$

where $P_{(X,Y)}$ denotes probability with respect to the pair $(X,Y)$ (i.e., conditional probability, given $D_n$). For brevity, we write $g_n(X) = g_n(X,D_n)$. It is well-known (see, for example, Devroye, Györfi, and Lugosi, 1996) that the classifier that minimizes the probability of error, the so-called Bayes classifier, is $g^*(x) = \mathbf{1}_{\{\eta(x) \ge 1/2\}}$. The risk of $g^*$ is called the Bayes risk: $L^* = L(g^*)$. A sequence $\{g_n\}$ of classifiers is consistent for a certain distribution of $(X,Y)$ if $L(g_n) \to L^*$ in probability.

In this paper we investigate classifiers that calculate their decisions by taking a majority vote over randomized classifiers. A randomized classifier may use a random variable $Z$ to calculate its decision. More precisely, let $\mathcal{Z}$ be some measurable space and let $Z$ take its values in $\mathcal{Z}$. A randomized classifier is an arbitrary function of the form $g_n(X,Z,D_n)$, which we abbreviate by $g_n(X,Z)$. The probability of error of $g_n$ becomes

$$L(g_n) = P_{(X,Y),Z}\{g_n(X,Z,D_n) \ne Y\} = P\{g_n(X,Z,D_n) \ne Y \mid D_n\}.$$

The definition of consistency remains the same by augmenting the probability space appropriately to include the randomization.
Given any randomized classifier, one may calculate the classifier for various draws of the randomizing variable $Z$. It is then a natural idea to define an averaged classifier by taking a majority vote among the obtained random classifiers. Assume that $Z_1,\ldots,Z_m$ are identically distributed draws of the randomizing variable, having the same distribution as $Z$. Throughout the paper, we assume that $Z_1,\ldots,Z_m$ are independent, conditionally on $X$, $Y$, and $D_n$. Letting $Z^m = (Z_1,\ldots,Z_m)$, one may define the corresponding voting classifier by

$$g_n^{(m)}(x,Z^m,D_n) = \begin{cases} 1 & \text{if } \frac{1}{m}\sum_{j=1}^m g_n(x,Z_j,D_n) \ge \frac{1}{2}, \\ 0 & \text{otherwise.} \end{cases}$$

By the strong law of large numbers, for any fixed $x$ and $D_n$ for which $P_Z\{g_n(x,Z,D_n) = 1\} \ne 1/2$, we have almost surely $\lim_{m \to \infty} g_n^{(m)}(x,Z^m,D_n) = \bar{g}_n(x,D_n)$, where

$$\bar{g}_n(x,D_n) = \bar{g}_n(x) = \mathbf{1}_{\{E_Z g_n(x,Z) \ge 1/2\}}$$
is a (nonrandomized) classifier that we call the averaged classifier. (Here $P_Z$ and $E_Z$ denote probability and expectation with respect to the randomizing variable $Z$, that is, conditionally on $X$, $Y$, and $D_n$.) $\bar{g}_n$ may be interpreted as an idealized version of the classifier $g_n^{(m)}$ that draws many independent copies of the randomizing variable $Z$ and takes a majority vote over the resulting classifiers.

Our first result states that consistency of a randomized classifier is preserved by averaging.

Proposition 1. Assume that the sequence $\{g_n\}$ of randomized classifiers is consistent for a certain distribution of $(X,Y)$. Then the voting classifier $g_n^{(m)}$ (for any value of $m$) and the averaged classifier $\bar{g}_n$ are also consistent.

Proof. Consistency of $\{g_n\}$ is equivalent to saying that $E L(g_n) = P\{g_n(X,Z) \ne Y\} \to L^*$. In fact, since $P\{g_n(X,Z) \ne Y \mid X = x\} \ge P\{g^*(X) \ne Y \mid X = x\}$ for all $x \in \mathbb{R}^d$, consistency of $\{g_n\}$ means that for $\mu$-almost all $x$,

$$P\{g_n(X,Z) \ne Y \mid X = x\} \to P\{g^*(X) \ne Y \mid X = x\} = \min(\eta(x), 1 - \eta(x)).$$

Without loss of generality, assume that $\eta(x) > 1/2$. (In the case of $\eta(x) = 1/2$ any classifier has a conditional probability of error $1/2$ and there is nothing to prove.) Then

$$P\{g_n(X,Z) \ne Y \mid X = x\} = (2\eta(x) - 1)\,P\{g_n(x,Z) = 0\} + 1 - \eta(x),$$

and by consistency we have $P\{g_n(x,Z) = 0\} \to 0$. To prove consistency of the voting classifier $g_n^{(m)}$, it suffices to show that $P\{g_n^{(m)}(x,Z^m) = 0\} \to 0$ for $\mu$-almost all $x$ for which $\eta(x) > 1/2$. However,

$$\begin{aligned}
P\{g_n^{(m)}(x,Z^m) = 0\} &= P\Big\{ \frac{1}{m}\sum_{j=1}^m \mathbf{1}_{\{g_n(x,Z_j) = 0\}} > \frac{1}{2} \Big\} \\
&\le 2\,E\Big[ \frac{1}{m}\sum_{j=1}^m \mathbf{1}_{\{g_n(x,Z_j) = 0\}} \Big] \quad \text{(by Markov's inequality)} \\
&= 2\,P\{g_n(x,Z) = 0\} \to 0.
\end{aligned}$$

Consistency of the averaged classifier is proved by a similar argument.

3. Random Forests

Random forests, introduced by Breiman, are averaged classifiers in the sense defined in Section 2. Formally, a random forest with $m$ trees is a classifier consisting of a collection of randomized base tree classifiers $g_n(x,Z_1),\ldots,g_n(x,Z_m)$ where $Z_1,\ldots,Z_m$ are identically distributed random vectors, independent conditionally on $X$, $Y$, and $D_n$.
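The voting classifier $g_n^{(m)}$ of Section 2, of which a random forest with $m$ trees is an instance, can be illustrated with a minimal Python sketch. The base rule `noisy_threshold`, its flip probability 0.3, and all function names are hypothetical choices made only for this illustration; they do not come from the paper.

```python
import random
from typing import Callable, Sequence


def voting_classifier(base: Callable[[float, float], int],
                      x: float,
                      draws: Sequence[float]) -> int:
    """Return 1 if the average of base(x, z_j) over the draws is at least 1/2, else 0."""
    votes = sum(base(x, z) for z in draws)
    return 1 if 2 * votes >= len(draws) else 0


def noisy_threshold(x: float, z: float) -> int:
    """Hypothetical randomized base rule: threshold x at 0.5, flipped when z < 0.3."""
    label = 1 if x >= 0.5 else 0
    return 1 - label if z < 0.3 else label


rng = random.Random(0)
draws = [rng.random() for _ in range(501)]  # m = 501 independent copies of Z
prediction = voting_classifier(noisy_threshold, 0.8, draws)
```

Each single base decision in this toy rule is wrong with probability 0.3, while the vote over many independent draws concentrates on the majority decision, which is the role the averaged classifier $\bar{g}_n$ plays in the limit $m \to \infty$.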
The randomizing variable is typically used to determine how the successive cuts are performed when building the tree, such as the selection of the node and the coordinate to split, as well as the position of the split. The random forest classifier takes a majority vote among the random tree classifiers. If $m$ is large, the random forest classifier is well approximated by the averaged classifier
$\bar{g}_n(x) = \mathbf{1}_{\{E_Z g_n(x,Z) \ge 1/2\}}$. For brevity, we state most results of this paper for the averaged classifier $\bar{g}_n$ only, though by Proposition 1 various results remain true for the voting classifier $g_n^{(m)}$ as well.

In this section we analyze a simple random forest already considered by Breiman (2000), which we call the purely random forest. The random tree classifier $g_n(x,Z)$ is constructed as follows. Assume, for simplicity, that $\mu$ is supported on $[0,1]^d$. All nodes of the tree are associated with rectangular cells such that at each step of the construction of the tree, the collection of cells associated with the leaves of the tree (i.e., external nodes) forms a partition of $[0,1]^d$. The root of the random tree is $[0,1]^d$ itself. At each step of the construction of the tree, a leaf is chosen uniformly at random. The split variable $J$ is then selected uniformly at random from the $d$ candidates $x^{(1)},\ldots,x^{(d)}$. Finally, the selected cell is split along the randomly chosen variable at a random location, chosen according to a uniform random variable on the length of the chosen side of the selected cell. The procedure is repeated $k$ times where $k \ge 1$ is a deterministic parameter, fixed beforehand by the user, and possibly depending on $n$.

The randomized classifier $g_n(x,Z)$ takes a majority vote among all $Y_i$ for which the corresponding feature vector $X_i$ falls in the same cell of the random partition as $x$. (For concreteness, break ties in favor of the label 1.)

The purely random forest classifier is a radically simplified version of random forest classifiers used in practice. The main simplification lies in the fact that recursive cell splits do not depend on the labels $Y_1,\ldots,Y_n$. The next theorem mainly serves as an illustration of how the consistency problem of random forest classifiers may be attacked. More involved versions of random forest classifiers are discussed in subsequent sections.

Theorem 2. Assume that the distribution of $X$ is supported on $[0,1]^d$. Then the purely random forest classifier $\bar{g}_n$ is consistent whenever $k \to \infty$ and $k/n \to 0$ as $k \to \infty$.
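The construction of the purely random tree described above, together with a finite vote over $m$ such trees, can be sketched as follows. This is only an illustrative sketch: the function names and the representation of a cell as a list of `(low, high)` intervals are assumptions of the example, and the forest uses a finite number `m` of trees rather than the averaged classifier $\bar{g}_n$.

```python
import random


def purely_random_partition(d, k, rng):
    """Return the k + 1 cells of [0,1]^d obtained after k purely random cuts."""
    cells = [[(0.0, 1.0)] * d]               # the root cell is [0,1]^d itself
    for _ in range(k):
        idx = rng.randrange(len(cells))      # leaf chosen uniformly at random
        j = rng.randrange(d)                 # split coordinate J chosen uniformly
        cell = cells.pop(idx)
        low, high = cell[j]
        cut = rng.uniform(low, high)         # uniform position along the chosen side
        left, right = list(cell), list(cell)
        left[j], right[j] = (low, cut), (cut, high)
        cells.extend([left, right])
    return cells


def contains(cell, x):
    """True if the point x lies in the (closed) rectangular cell."""
    return all(low <= xi <= high for (low, high), xi in zip(cell, x))


def tree_classify(x, data, cells):
    """Majority vote among the labels of training points in the cell of x; ties go to 1."""
    cell = next(c for c in cells if contains(c, x))
    labels = [y for xi, y in data if contains(cell, xi)]
    return 1 if 2 * sum(labels) >= len(labels) else 0


def forest_classify(x, data, d, k, m, rng):
    """Voting classifier over m independently drawn purely random trees."""
    votes = sum(tree_classify(x, data, purely_random_partition(d, k, rng))
                for _ in range(m))
    return 1 if 2 * votes >= m else 0
```

Note that the cuts never look at the training data, which mirrors the simplification stressed in the text: the labels enter only through the final vote inside the cell of $x$.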
Proof. By Proposition 1 it suffices to prove consistency of the randomized base tree classifier $g_n$. To this end, we recall a general consistency theorem for partitioning classifiers proved in (Devroye, Györfi, and Lugosi, 1996, Theorem 6.1). According to this theorem, $g_n$ is consistent if both $\mathrm{diam}(A_n(X,Z)) \to 0$ in probability and $N_n(X,Z) \to \infty$ in probability, where $A_n(x,Z)$ is the rectangular cell of the random partition containing $x$ and

$$N_n(x,Z) = \sum_{i=1}^n \mathbf{1}_{\{X_i \in A_n(x,Z)\}}$$

is the number of data points falling in the same cell as $x$.

First we show that $N_n(X,Z) \to \infty$ in probability. Consider the random tree partition defined by $Z$. Observe that the partition has $k+1$ rectangular cells, say $A_1,\ldots,A_{k+1}$. Let $N_1,\ldots,N_{k+1}$ denote the number of points of $X,X_1,\ldots,X_n$ falling in these $k+1$ cells. Let $S = \{X,X_1,\ldots,X_n\}$ denote the set of positions of these $n+1$ points. Since these points are independent and identically distributed, fixing the set $S$ (but not the order of the points) and $Z$, the conditional probability that $X$ falls in the $i$th cell equals $N_i/(n+1)$. Thus, for every fixed $t > 0$,

$$P\{N_n(X,Z) < t\} = E\big[ P\{N_n(X,Z) < t \mid S,Z\} \big] = E\Big[ \sum_{i : N_i < t} \frac{N_i}{n+1} \Big] \le (t-1)\,\frac{k+1}{n+1},$$
which converges to zero by our assumption on $k$.

It remains to show that $\mathrm{diam}(A_n(X,Z)) \to 0$ in probability. To this aim, let $V_n = V_n(x,Z)$ be the size of the first dimension of the rectangle containing $x$. Let $T_n = T_n(x,Z)$ be the number of times that the box containing $x$ is split when we construct the random tree partition. Let $K_n$ be binomial $(T_n, 1/d)$, representing the number of times the box containing $x$ is split along the first coordinate.

Clearly, it suffices to show that $V_n(x,Z) \to 0$ in probability for $\mu$-almost all $x$, so it is enough to show that for all $x$, $E[V_n(x,Z)] \to 0$. Observe that if $U_1,U_2,\ldots$ are independent uniform $[0,1]$, then

$$\begin{aligned}
E[V_n(x,Z)] &\le E\Big[ E\Big[ \prod_{i=1}^{K_n} \max(U_i, 1-U_i) \,\Big|\, K_n \Big] \Big] \\
&= E\Big[ \big( E[\max(U_1, 1-U_1)] \big)^{K_n} \Big] \\
&= E\big[ (3/4)^{K_n} \big] \\
&= E\Big[ \Big( 1 - \frac{1}{d} + \frac{3}{4d} \Big)^{T_n} \Big] \\
&= E\Big[ \Big( 1 - \frac{1}{4d} \Big)^{T_n} \Big].
\end{aligned}$$

Thus, it suffices to show that $T_n \to \infty$ in probability. To this end, note that the partition tree is statistically related to a random binary search tree with $k+1$ external nodes (and thus $k$ internal nodes). Such a tree is obtained as follows. Initially, the root is the sole external node, and there are no internal nodes. Select an external node uniformly at random, make it an internal node and give it two children, both external. Repeat until we have precisely $k$ internal nodes and $k+1$ external nodes. The resulting tree is the random binary search tree on $k$ internal nodes (see Devroye 1988 and Mahmoud 1992 for more equivalent constructions of random binary search trees). It is known that all levels up to $\ell = \lfloor 0.37 \log k \rfloor$ are full with probability tending to one as $k \to \infty$ (Devroye, 1986). The last full level $F_n$ is called the fill-up level. Clearly, the partition tree has this property. Therefore, we know that all final cells have been cut at least $\ell$ times and therefore $T_n \ge \ell$ with probability converging to 1. This concludes the proof of Theorem 2.

Remark 3. We observe that the largest first dimension among external nodes does not tend to zero in probability except for $d = 1$. For $d \ge 2$, it tends to a limit random variable that is not atomic at zero (this can be shown using the theory of branching processes).
Thus the proof above could not have used the uniform smallness of all cells. Despite the fact that the random partition contains some cells of huge diameter of nonshrinking size, the rule based on it is consistent.

Next we consider a scale-invariant version of the purely random forest classifier. In this variant the root cell is the entire feature space and the random tree is grown up to $k$ cuts. The leaf cell to cut and the direction $J$ in which the cell is cut are chosen uniformly at random, exactly as in the purely random forest classifier. The only difference is that the position of the cut is now chosen in a data-based manner: if the cell to be cut contains $N$ of the data points $X,X_1,\ldots,X_n$, then a random index $I$ is chosen uniformly from the set $\{0,1,\ldots,N\}$ and the cell is cut so that, when ordered by their $J$th components, the points with the $I$ smallest values fall in one of the subcells and the rest in
the other. To avoid ties, we assume that the distribution of $X$ has nonatomic marginals. In this case the random tree is well-defined with probability one. Just like before, the associated classifier takes a majority vote over the labels of the data points falling in the same cell as $X$. The scale-invariant random forest classifier is defined as the corresponding averaged classifier.

Theorem 4. Assume that the distribution of $X$ has nonatomic marginals in $\mathbb{R}^d$. Then the scale-invariant random forest classifier $\bar{g}_n$ is consistent whenever $k \to \infty$ and $k/n \to 0$ as $k \to \infty$.

Proof. Once again, we may use Proposition 1 and (Devroye, Györfi, and Lugosi, 1996, Theorem 6.1) to prove consistency of the randomized base tree classifier $g_n$. The proof of the fact that $N_n(X,Z) \to \infty$ in probability is the same as in Theorem 2.

To show that $\mathrm{diam}(A_n(X,Z)) \to 0$ in probability, we begin by noting that, just as in the case of the purely random forest classifier, the partition tree is equivalent to a binary search tree, and therefore with probability converging to one, all final cells have been cut at least $\ell = \lfloor 0.37 \log k \rfloor$ times.

Since the classification rule is scale-invariant, we may assume, without loss of generality, that the distribution of $X$ is concentrated on the unit cube $[0,1]^d$.

Let $n_i$ denote the cardinality of the $i$th cell in the partition, $1 \le i \le k+1$, where the cardinality of a cell $C$ is $|C \cap \{X,X_1,\ldots,X_n\}|$. Thus, $\sum_{i=1}^{k+1} n_i = n+1$. Let $V_i$ be the first dimension of the $i$th cell. Let $V_n(X)$ be the first dimension of the cell that contains $X$. Clearly, given the $n_i$'s, $V_n(X) = V_i$ with probability $n_i/(n+1)$. We need to show that $E[V_n(X)] \to 0$. But we have

$$E[V_n(X)] = E\Big[ \frac{\sum_{i=1}^{k+1} n_i V_i}{n+1} \Big].$$

So, it suffices to show that $E[\sum_i n_i V_i] = o(n)$.

It is worth mentioning that the random split of a box can be imagined as follows. Given that we split along the $s$th coordinate axis, and that a box has $m$ points, we select one of the $m+1$ spacings defined by these $m$ points uniformly at random, still for that $s$th coordinate. We cut that spacing properly but are free to do so anywhere.
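The data-based choice of the cut position can be sketched along a single coordinate as follows. The sketch assumes a bounded cell side `[low, high]` with distinct point values strictly inside it, and it cuts the selected spacing at its midpoint; the midpoint is an arbitrary choice made for the example, since the text leaves the position inside the spacing free. The function name and signature are hypothetical.

```python
import random


def scale_invariant_cut(values, low, high, rng):
    """Pick a cut of [low, high] that leaves exactly I points to its left.

    values: coordinates (along the split axis) of the points in the cell.
    I is drawn uniformly from {0, 1, ..., N}, with N = len(values).
    """
    pts = sorted(values)
    n_pts = len(pts)
    i = rng.randrange(n_pts + 1)             # random index I in {0, 1, ..., N}
    left_end = low if i == 0 else pts[i - 1]
    right_end = high if i == n_pts else pts[i]
    # Assumption of this sketch: cut the chosen spacing at its midpoint.
    cut = 0.5 * (left_end + right_end)
    return i, cut
```

Because only the ranks of the points determine which spacing is cut, applying a monotone rescaling to the coordinate changes the cut location but not the way the points are divided, which is what makes the rule scale-invariant.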
We can cut in proportions $\lambda, 1-\lambda$ with $\lambda \in (0,1)$, and the value of $\lambda$ may vary from cut to cut and even be data-dependent. In fact, then, each internal and external node of our partition tree has associated with it two important quantities, a cardinality, and its first dimension. If we keep using $i$ to index cells, then we can use $n_i$ and $V_i$ for the $i$th cell, even if it is an internal cell. Let $A$ be the collection of external nodes in the subtree of the $i$th cell. Then trivially,

$$\sum_{j \in A} n_j V_j \le n_i V_i.$$

Thus, if $E$ is the collection of all external nodes of a partition tree, $\ell$ is at most the minimum path distance from any cell in $E$ to the root, and $L$ is the collection of all nodes at distance $\ell$ from the root, then, by the last inequality,

$$\sum_{i \in E} n_i V_i \le \sum_{i \in L} n_i V_i.$$

Thus, using the notion of fill-up level $F_n$ of the binary search tree, and setting $\ell = \lfloor 0.37 \log k \rfloor$, we have

$$E\Big[ \sum_{i \in E} n_i V_i \Big] \le n\,P\{F_n < \ell\} + E\Big[ \sum_{i \in L} n_i V_i \Big].$$
We have seen that the first term is $o(n)$. We argue that the second term is not more than $n(1 - 1/(8d))^{\ell}$, which is $o(n)$ since $k \to \infty$. That will conclude the proof.

It suffices now to argue recursively and fix one cell of cardinality $n$ and first dimension $V$. Let $C$ be the collection of its children. We will show that

$$E\Big[ \sum_{i \in C} n_i V_i \Big] \le \Big( 1 - \frac{1}{8d} \Big) n V.$$

Repeating this recursively $\ell$ times shows that

$$E\Big[ \sum_{i \in L} n_i V_i \Big] \le n \Big( 1 - \frac{1}{8d} \Big)^{\ell}$$

because $V = 1$ at the root.

Fix that cell of cardinality $n$, and assume without loss of generality that $V = 1$. Let the spacings along the first coordinate be $a_1,\ldots,a_{n+1}$, their sum being one. With probability $1 - 1/d$, the first axis is not cut, and thus $\sum_{i \in C} n_i V_i = n$. With probability $1/d$, the first axis is cut in two parts. We will show that conditional on the event that the first direction is cut,

$$E\Big[ \sum_{i \in C} n_i V_i \Big] \le \frac{7n}{8}.$$

Unconditionally, we have

$$E\Big[ \sum_{i \in C} n_i V_i \Big] \le \Big( 1 - \frac{1}{d} \Big) n + \frac{1}{d} \cdot \frac{7n}{8} = \Big( 1 - \frac{1}{8d} \Big) n,$$

as required. So, let us prove the conditional result. Using $\delta_j$ to denote numbers drawn from $(0,1)$, possibly random, we have

$$\begin{aligned}
E\Big[ \sum_{i \in C} n_i V_i \Big]
&= \frac{1}{n+1}\, E\Big[ \sum_{j=1}^{n+1} \Big( (j-1)(a_1 + \cdots + a_{j-1} + a_j \delta_j) + (n+1-j)(a_j(1-\delta_j) + a_{j+1} + \cdots + a_{n+1}) \Big) \Big] \\
&= \frac{1}{n+1}\, E\Big[ \sum_{k=1}^{n+1} a_k \Big( \sum_{k < j \le n+1} (j-1) + \sum_{1 \le j < k} (n+1-j) + \delta_k (k-1) + (1-\delta_k)(n+1-k) \Big) \Big] \\
&\le \frac{1}{n+1} \sum_{k=1}^{n+1} a_k \Big( n(n+1) - \frac{k(k-1)}{2} - \frac{(n-k+1)(n-k+2)}{2} + \max(k-1,\, n+1-k) \Big) \\
&= \frac{1}{n+1} \sum_{k=1}^{n+1} a_k \Big( \frac{n(n+1)}{2} + (k-1)(n+1-k) + \max(k-1,\, n+1-k) \Big) \\
&\le \frac{1}{n+1} \Big( \frac{n(n+1)}{2} + \frac{n^2}{4} + n \Big) \sum_{k=1}^{n+1} a_k \\
&= \frac{n \big( (3/4)n + (3/2) \big)}{n+1} \\
&\le \frac{7n}{8}
\end{aligned}$$

if $n > 4$.

Our definition of the scale-invariant random forest classifier permits cells to be cut such that one of the created cells becomes empty. One may easily prevent this by artificially forcing a minimum number of points in each cell. This may be done by restricting the random position of each cut so that both created subcells contain at least, say, $m$ points. By a minor modification of the proof above it is easy to see that as long as $m$ is bounded by a constant, the resulting random forest classifier remains consistent under the same conditions as in Theorem 4.

4. Creating Consistent Rules by Randomization

Proposition 1 shows that if a randomized classifier is consistent, then the corresponding averaged classifier remains consistent. The converse is not true. There exist inconsistent randomized classifiers such that their averaged version becomes consistent. Indeed, Breiman's (2001) original random forest classifier builds tree classifiers by successive randomized cuts until the cell of the point $X$ to be classified contains only one data point, and classifies $X$ as the label of this data point. Breiman's random forest classifier is just the averaged version of such randomized tree classifiers. The randomized base classifier $g_n(x,Z)$ is obviously not consistent for all distributions.

This does not imply that the averaged random forest classifier is not consistent. In fact, in this section we will see that averaging may boost inconsistent base classifiers into consistent ones. We point out in Section 6 that there are distributions of $(X,Y)$ for which Breiman's random forest classifier is not consistent. The counterexample shown in Proposition 8 is such that the distribution of $X$ does not have a density. It is possible, however, that Breiman's random forest classifier is consistent whenever the distribution of $X$ has a density.
Breiman's rule is difficult to analyze as each cut of the random tree is determined by a complicated function of the entire data set $D_n$ (i.e., both feature vectors and labels). However, in Section 6 below we provide arguments suggesting that Breiman's random forest is not consistent when a density exists. Instead of Breiman's rule, next we analyze a stylized version by showing that inconsistent randomized rules that take the label of only one neighbor into account can be made consistent by averaging.

For simplicity, we consider the case $d = 1$, though the whole argument extends, in a straightforward way, to the multivariate case. To avoid complications introduced by ties, assume that $X$ has a nonatomic distribution. Define a randomized nearest neighbor rule as follows: for a fixed $x \in \mathbb{R}$, let $X_{(1)}(x),X_{(2)}(x),\ldots,X_{(n)}(x)$ be the ordering of the data points $X_1,\ldots,X_n$ according to increasing distances from $x$. Let $U_1,\ldots,U_n$ be i.i.d. random variables, uniformly distributed over $[0,1]$. The vector of these random variables constitutes the randomization $Z$ of the classifier. We define $g_n(x,Z)$
to be equal to the label $Y_{(i)}(x)$ of the data point $X_{(i)}(x)$ for which

$$\max(i, mU_i) \le \max(j, mU_j) \quad \text{for all } j = 1,\ldots,n,$$

where $m$ is a parameter of the rule. We call $X_{(i)}(x)$ the perturbed nearest neighbor of $x$. Note that $X_{(1)}(x)$ is the (unperturbed) nearest neighbor of $x$. To obtain the perturbed version, we artificially add a random uniform coordinate and select a data point with the randomized rule defined above. Since ties occur with probability zero, the perturbed nearest neighbor classifier is well defined almost surely. It is clearly not, in general, a consistent classifier.

Call the corresponding averaged classifier $\bar{g}_n(x) = \mathbf{1}_{\{E_Z g_n(x,Z) \ge 1/2\}}$ the averaged perturbed nearest neighbor classifier.

In the proof of the consistency result below, we use Stone's (1977) general consistency theorem for locally weighted average classifiers, see also (Devroye, Györfi, and Lugosi, 1996, Theorem 6.3). Stone's theorem concerns classifiers that take the form

$$g_n(x) = \mathbf{1}_{\left\{ \sum_{i=1}^n Y_i W_{ni}(x) \ge \sum_{i=1}^n (1 - Y_i) W_{ni}(x) \right\}}$$

where the weights $W_{ni}(x) = W_{ni}(x,X_1,\ldots,X_n)$ are nonnegative and sum to one. According to Stone's theorem, consistency holds if the following three conditions are satisfied:

(i) $\lim_{n \to \infty} E\big[ \max_{1 \le i \le n} W_{ni}(X) \big] = 0$.

(ii) For all $a > 0$, $\lim_{n \to \infty} E\big[ \sum_{i=1}^n W_{ni}(X)\, \mathbf{1}_{\{\|X_i - X\| > a\}} \big] = 0$.

(iii) There is a constant $c$ such that, for every nonnegative measurable function $f$ satisfying $E f(X) < \infty$, $E\big[ \sum_{i=1}^n W_{ni}(X) f(X_i) \big] \le c\, E f(X)$.

Theorem 5. The averaged perturbed nearest neighbor classifier $\bar{g}_n$ is consistent whenever the parameter $m$ is such that $m \to \infty$ and $m/n \to 0$.

Proof. If we define

$$W_{ni}(x) = P_Z\{X_i \text{ is the perturbed nearest neighbor of } x\}$$

then it is clear that the averaged perturbed nearest neighbor classifier is a locally weighted average classifier and Stone's theorem may be applied. It is convenient to introduce the notation

$$p_i(x) = P_Z\{X_{(i)}(x) \text{ is the perturbed nearest neighbor of } x\}$$

and write $W_{ni}(x) = \sum_{j=1}^n \mathbf{1}_{\{X_i = X_{(j)}(x)\}}\, p_j(x)$.
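The perturbed nearest neighbor rule and the weights $p_i$ can be explored numerically with a short Monte Carlo sketch. The selection depends only on the ranks $i$ and the uniforms $U_i$, so the sketch works with ranks alone. The function names, the number of trials, and the use of simulation in place of a closed-form expression are assumptions made for illustration only.

```python
import random


def perturbed_nn_rank(n, m, rng):
    """Return the rank i (1-based) that minimizes max(i, m * U_i) for i.i.d. uniform U_i."""
    u = [rng.random() for _ in range(n)]
    return min(range(1, n + 1), key=lambda i: max(i, m * u[i - 1]))


def estimate_weights(n, m, trials, rng):
    """Monte Carlo estimate of p_i, the probability that rank i is selected."""
    counts = [0] * n
    for _ in range(trials):
        counts[perturbed_nn_rank(n, m, rng) - 1] += 1
    return [c / trials for c in counts]


def averaged_perturbed_nn(sorted_labels, weights):
    """Weighted vote over labels listed by increasing distance from x; ties go to 1."""
    ones = sum(w for y, w in zip(sorted_labels, weights) if y == 1)
    zeros = sum(w for y, w in zip(sorted_labels, weights) if y == 0)
    return 1 if ones >= zeros else 0
```

In this sketch a rank larger than $m$ is never selected, since its score exceeds $m$ while the first rank always scores at most $m$; the estimated weights are therefore spread over roughly the first $m$ neighbors.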
To check the conditions of Stone's theorem, first note that

$$\begin{aligned}
p_i(x) &= P\Big\{ mU_i \le i \le \min_{j < i} mU_j \Big\} + P\Big\{ i < mU_i \le \min_{j \le n} \max(j, mU_j) \Big\} \\
&= \mathbf{1}_{\{i \le m\}}\, \frac{i}{m} \Big( 1 - \frac{i}{m} \Big)^{i-1} + P\Big\{ i < mU_i \le \min_{j \le n} \max(j, mU_j) \Big\}.
\end{aligned}$$

Now we are prepared to check the conditions of Stone's theorem. To prove that (i) holds, note that by monotonicity of $p_i(x)$ in $i$, it suffices to show that $p_1(x) \to 0$. But clearly, for $m \ge 2$,

$$\begin{aligned}
p_1(x) &\le \frac{1}{m} + P\Big\{ U_1 \le \min_{j \le m} \max\Big( \frac{j}{m}, U_j \Big) \Big\} \\
&= \frac{1}{m} + E\Big[ \prod_{j=2}^m P\Big\{ U_1 \le \max\Big( \frac{j}{m}, U_j \Big) \,\Big|\, U_1 \Big\} \Big] \\
&= \frac{1}{m} + E\Big[ \prod_{j=2}^m \big( 1 - \mathbf{1}_{\{U_1 > j/m\}} U_1 \big) \Big] \\
&\le \frac{1}{m} + E\Big[ (1 - U_1)^{\lceil mU_1 \rceil - 2}\, \mathbf{1}_{\{\lceil mU_1 \rceil \ge 3\}} \Big] + P\{ \lceil mU_1 \rceil < 3 \},
\end{aligned}$$

which converges to zero by monotone convergence as $m \to \infty$.

(ii) follows by the condition $m/n \to 0$ since $\sum_{i=1}^n W_{ni}(X)\, \mathbf{1}_{\{\|X_i - X\| > a\}} = 0$ whenever the distance of the $m$th nearest neighbor of $X$ to $X$ is at most $a$. But this happens eventually, almost surely, see (Devroye, Györfi, and Lugosi, 1996, Lemma 5.1).

Finally, to check (iii), we use again the monotonicity of $p_i(x)$ in $i$. We may write $p_i(x) = a_i + a_{i+1} + \cdots + a_n$ for some nonnegative numbers $a_j$, $1 \le j \le n$, depending upon $m$ and $n$ but not $x$. Observe that $\sum_{j=1}^n j a_j = \sum_{i=1}^n p_i(x) = 1$. But then

$$\begin{aligned}
E\Big[ \sum_{i=1}^n W_{ni}(X) f(X_i) \Big] &= E\Big[ \sum_{i=1}^n p_i(X) f(X_{(i)}) \Big] \\
&= E\Big[ \sum_{i=1}^n \sum_{j=i}^n a_j f(X_{(i)}) \Big] \\
&= E\Big[ \sum_{j=1}^n a_j \sum_{i=1}^j f(X_{(i)}) \Big] \\
&= \sum_{j=1}^n a_j\, E\Big[ \sum_{i=1}^j f(X_{(i)}) \Big] \\
&\le c \sum_{j=1}^n a_j\, j\, E f(X) \\
&= c\, E f(X) \sum_{j=1}^n a_j\, j \\
&= c\, E f(X)
\end{aligned}$$

as desired. The inequality follows from Stone's (1977) lemma, see (Devroye, Györfi, and Lugosi, 1996, Lemma 5.3), where $c$ is a constant.

5. Bagging

One of the first and simplest ways of randomizing and averaging classifiers in order to improve their performance is bagging, suggested by Breiman (1996). In bagging, randomization is achieved by generating many bootstrap samples from the original data set. Breiman suggests selecting $n$ training pairs $(X_i,Y_i)$ at random, with replacement, from the bag of all training pairs $\{(X_1,Y_1),\ldots,(X_n,Y_n)\}$. Denoting the random selection process by $Z$, this way one obtains new training data $D_n(Z)$ with possible repetitions, and given a classifier $g_n(X,D_n)$, one can calculate the randomized classifier $g_n(X,Z,D_n) = g_n(X,D_n(Z))$. Breiman suggests repeating this procedure for many independent draws of the bootstrap sample, say $m$ of them, and calculating the voting classifier $g_n^{(m)}(X,Z^m,D_n)$ as defined in Section 2.

In this section we consider a generalized version of bagging predictors in which the size of the bootstrap samples is not necessarily the same as that of the original sample. Also, to avoid complications and ambiguities due to replicated data points, we exclude repetitions in the bootstrapped data. This is assumed for convenience, but sampling with replacement can be treated by minor modifications of the arguments below.

To describe the model we consider, introduce a parameter $q_n \in [0,1]$. In the bootstrap sample $D_n(Z)$ each data pair $(X_i,Y_i)$ is present with probability $q_n$, independently of each other. Thus, the size of the bootstrapped data is a binomial random variable $N$ with parameters $n$ and $q_n$. Given a sequence of (nonrandomized) classifiers $\{g_n\}$, we may thus define the randomized classifier

$$g_n(X,Z,D_n) = g_N(X,D_n(Z)),$$

that is, the classifier is defined based on the randomly resampled data.
By drawing $m$ independent bootstrap samples $D_n(Z_1),\ldots,D_n(Z_m)$ (with sizes $N_1,\ldots,N_m$), we may define the bagging classifier $g_n^{(m)}(X,Z^m,D_n)$ as the voting classifier based on the randomized classifiers $g_{N_1}(X,D_n(Z_1)),\ldots,g_{N_m}(X,D_n(Z_m))$ as in Section 2. For the theoretical analysis it is more convenient to consider the averaged classifier $\bar{g}_n(x,D_n) = \mathbf{1}_{\{E_Z g_N(x,D_n(Z)) \ge 1/2\}}$, which is the limiting classifier one obtains as the number $m$ of the bootstrap replicates grows to infinity.

The following result establishes consistency of bagging classifiers under the assumption that the original classifier is consistent. It suffices that the expected size of the bootstrap sample goes to infinity. The result is an immediate consequence of Proposition 1. Note that the choice of $m$ does not matter in Theorem 6. It can be one, constant, or a function of $n$.

Theorem 6. Let $\{g_n\}$ be a sequence of classifiers that is consistent for the distribution of $(X,Y)$. Consider the bagging classifiers $g_n^{(m)}(x,Z^m,D_n)$ and $\bar{g}_n(x,D_n)$ defined above, using parameter $q_n$. If $nq_n \to \infty$ as $n \to \infty$ then both classifiers are consistent.
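The generalized bagging model described above, in which every training pair enters a resample independently with probability $q_n$ and without repetitions, can be sketched as follows. The base rule `majority_rule` is a hypothetical placeholder that ignores $x$; any classifier taking a query point and a sample could be passed instead. All names are illustrative assumptions.

```python
import random


def subsample(data, q, rng):
    """Keep each training pair independently with probability q (no repetitions)."""
    return [pair for pair in data if rng.random() < q]


def bagging_classifier(base_rule, x, data, q, m, rng):
    """Vote over m classifiers, each computed by base_rule on an independent subsample."""
    votes = sum(base_rule(x, subsample(data, q, rng)) for _ in range(m))
    return 1 if 2 * votes >= m else 0


def majority_rule(x, sample):
    """Hypothetical base rule: majority label of the sample; ties and empty samples give 1."""
    labels = [y for _, y in sample]
    return 1 if 2 * sum(labels) >= len(labels) else 0
```

The size of each subsample is binomial with parameters $n$ and $q$, so `q = 1` reproduces the original data and small values of `q` give small resamples.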
If a classifier is insensitive to duplicates in the data, Breiman's original suggestion is roughly equivalent to taking $q_n \approx 1 - 1/e$. However, it may be advantageous to choose much smaller values of $q_n$. In fact, small values of $q_n$ may turn inconsistent classifiers into consistent ones via the bagging procedure. We illustrate this phenomenon on the simple example of the 1-nearest neighbor rule.

Recall that the 1-nearest neighbor rule sets $g_n(x,D_n) = Y_{(1)}(x)$ where $Y_{(1)}(x)$ is the label of the feature vector $X_{(1)}(x)$ whose Euclidean distance to $x$ is minimal among all $X_1,\ldots,X_n$. Ties are broken in favor of smallest indices. It is well-known that $g_n$ is consistent only if either $L^* = 0$ or $L^* = 1/2$, otherwise its asymptotic probability of error is strictly greater than $L^*$. However, by bagging one may turn the 1-nearest neighbor classifier into a consistent one, provided that the size of the bootstrap sample is sufficiently small. The next result characterizes consistency of the bagging version of the 1-nearest neighbor classifier in terms of the parameter $q_n$.

Theorem 7. The bagging averaged 1-nearest neighbor classifier $\bar{g}_n(x,D_n)$ is consistent for all distributions of $(X,Y)$ if and only if $q_n \to 0$ and $nq_n \to \infty$.

Proof. It is obvious that both $q_n \to 0$ and $nq_n \to \infty$ are necessary for consistency for all distributions.

Assume now that $q_n \to 0$ and $nq_n \to \infty$. The key observation is that $\bar{g}_n(x,D_n)$ is a locally weighted average classifier for which Stone's consistency theorem, recalled in Section 4, applies. Recall that for a fixed $x \in \mathbb{R}$, $X_{(1)}(x),X_{(2)}(x),\ldots,X_{(n)}(x)$ denotes the ordering of the data points $X_1,\ldots,X_n$ according to increasing distances from $x$. (Points with equal distances to $x$ are ordered according to their indices.) Observe that $\bar{g}_n$ may be written as

$$\bar{g}_n(x,D_n) = \mathbf{1}_{\left\{ \sum_{i=1}^n Y_i W_{ni}(x) \ge \sum_{i=1}^n (1 - Y_i) W_{ni}(x) \right\}}$$

where $W_{ni}(x) = \sum_{j=1}^n \mathbf{1}_{\{X_i = X_{(j)}(x)\}}\, p_j(x)$ and $p_i(x) = (1 - q_n)^{i-1} q_n$ is defined as the probability (with respect to the random selection $Z$ of the bootstrap sample) that $X_{(i)}(x)$ is the nearest neighbor of $x$ in the sample $D_n(Z)$.
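Since the weights $p_i(x) = (1 - q_n)^{i-1} q_n$ are available in closed form, the averaged bagged 1-nearest neighbor classifier can be evaluated directly, without drawing any resamples. The following sketch does this; the function name and the data layout (a list of `(point, label)` pairs with points given as tuples) are assumptions of the example.

```python
import math


def bagged_1nn(x, data, q):
    """Averaged bagged 1-NN: weight (1 - q)**(i - 1) * q on the i-th nearest neighbor of x."""
    # Sort by Euclidean distance to x; the stable sort breaks ties by smallest index.
    ordered = sorted(data, key=lambda pair: math.dist(pair[0], x))
    ones = zeros = 0.0
    for rank, (_, y) in enumerate(ordered, start=1):
        weight = (1 - q) ** (rank - 1) * q
        if y == 1:
            ones += weight
        else:
            zeros += weight
    return 1 if ones >= zeros else 0
```

With `q = 1` all the weight sits on the nearest neighbor and the plain 1-nearest neighbor rule is recovered, while smaller values of `q` spread the weight geometrically over more distant neighbors.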
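It suffices to prove that the weights $W_{ni}(X)$ satisfy the three conditions of Stone's theorem. Condition (i) obviously holds because $\max_{1 \le i \le n} W_{ni}(X) = p_1(X) = q_n \to 0$.

To check condition (ii), define $k_n = \lceil \sqrt{n/q_n} \rceil$. Since $nq_n \to \infty$ implies that $k_n/n \to 0$, it follows from (Devroye, Györfi, and Lugosi, 1996, Lemma 5.1) that eventually, almost surely, $\|X - X_{(k_n)}(X)\| \le a$ and therefore

$$\begin{aligned}
\sum_{i=1}^n W_{ni}(X)\, \mathbf{1}_{\{\|X_i - X\| > a\}} &\le \sum_{i=k_n+1}^n p_i(X) \\
&= q_n \sum_{i=k_n+1}^n (1 - q_n)^{i-1} \\
&\le (1 - q_n)^{k_n} \\
&\le (1 - q_n)^{\sqrt{n/q_n}} \\
&\le e^{-\sqrt{nq_n}}
\end{aligned}$$

where we used $1 - q_n \le e^{-q_n}$. Therefore, $\sum_{i=1}^n W_{ni}(X)\, \mathbf{1}_{\{\|X_i - X\| > a\}} \to 0$ almost surely and Stone's second condition is satisfied by dominated convergence.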
Finally, condition (iii) follows from the fact that $p_i(x)$ is monotone decreasing in $i$, after using an argument as in the proof of Theorem [number missing].

Random Forests Based on Greedily Grown Trees

In this section we study random forest classifiers that are based on randomized tree classifiers that are constructed in a greedy manner, by recursively splitting cells to minimize an empirical error criterion. Such greedy forests were introduced by Breiman (2001, 2004) and have shown excellent performance in many applications. One of his most popular classifiers is an averaging classifier, $\bar{g}_n$, based on a randomized tree classifier $g_n(x, Z)$ defined as follows. The algorithm has a parameter $1 \le v < d$ which is a positive integer. The feature space $\mathbb{R}^d$ is partitioned recursively to form a tree partition. The root of the random tree is $\mathbb{R}^d$. At each step of the construction of the tree, a leaf is chosen uniformly at random. $v$ variables are selected uniformly at random from the $d$ candidates $x^{(1)}, \ldots, x^{(d)}$. A split is selected along one of these $v$ variables to minimize the number of misclassified training points if a majority vote is used in each cell. The procedure is repeated until every cell contains exactly one training point $X_i$. (This is always possible if the distribution of $X$ has nonatomic marginals.) In some versions of Breiman's algorithm, a bootstrap subsample of the training data is selected before the construction of each tree to increase the effect of randomization.

As observed by Lin and Jeon (2006), Breiman's classifier is a weighted layered nearest neighbor classifier, that is, a classifier that takes a (weighted) majority vote among the layered nearest neighbors of the observation $x$. $X_i$ is called a layered nearest neighbor of $x$ if the rectangle defined by $x$ and $X_i$ as their opposing vertices does not contain any other data point $X_j$ ($j \ne i$). This property of Breiman's random forest classifier is a simple consequence of the fact that each tree is grown until every cell contains just one data point.
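The definition of a layered nearest neighbor can be checked directly on a sample. The sketch below is a brute-force illustration in Python; the function name `layered_nearest_neighbors` is hypothetical, and the rectangles are treated as closed, which is an assumption that does not matter when the marginals are nonatomic.

```python
import numpy as np

def layered_nearest_neighbors(x, X):
    """Indices i such that the axis-parallel rectangle with opposing
    vertices x and X[i] contains no other data point X[j], j != i."""
    x = np.asarray(x, dtype=float)
    X = np.asarray(X, dtype=float)
    lo = np.minimum(X, x)  # lower corner of each rectangle, shape (n, d)
    hi = np.maximum(X, x)  # upper corner of each rectangle, shape (n, d)
    result = []
    for i in range(len(X)):
        # Boolean mask of data points lying inside rectangle i (closed).
        inside = np.all((X >= lo[i]) & (X <= hi[i]), axis=1)
        inside[i] = False  # X[i] itself does not count
        if not inside.any():
            result.append(i)
    return result

# Points in general position: all three are layered nearest neighbours.
generic = layered_nearest_neighbors(
    [0.5, 0.5], [[0.2, 0.8], [0.8, 0.2], [0.9, 0.9]])

# Points on the main diagonal: only the closest point on each side remains.
diagonal = layered_nearest_neighbors(
    [0.5, 0.5], [[0.1, 0.1], [0.3, 0.3], [0.6, 0.6], [0.9, 0.9]])
```

The loop costs $O(n^2 d)$ operations, which is adequate for illustration. When all points lie on the main diagonal, the function returns at most two indices, one on each side of $x$.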
Unfortunately, this simple property prevents the random tree classifier from being consistent for all distributions:

Proposition 8 There exists a distribution of $(X, Y)$ such that $X$ has nonatomic marginals for which Breiman's random forest classifier is not consistent.

Proof The proof works for any weighted layered nearest neighbor classifier. Let the distribution of $X$ be uniform on the segment $\{x = (x^{(1)}, \ldots, x^{(d)}) : x^{(1)} = \cdots = x^{(d)}, \; x^{(1)} \in [0,1]\}$ and let the distribution of $Y$ be such that $L^* \notin \{0, 1/2\}$. Then with probability one, $X$ has only two layered nearest neighbors and the classification rule is not consistent. (Note that Problem 11.6 in Devroye, Györfi, and Lugosi (1996) erroneously asks the reader to prove consistency of the (unweighted) layered nearest neighbor rule for any distribution with nonatomic marginals. As the example in this proof shows, the statement of the exercise is incorrect. Consistency of the layered nearest neighbor rule is true, however, if the distribution of $X$ has a density.)

One may also wonder whether Breiman's random forest classifier is consistent if, instead of growing the tree down to cells with a single data point, one uses a different stopping rule, for example if one fixes the total number of cuts at $k$ and lets $k$ grow slowly as in the examples of Section 3. The next two-dimensional example provides an indication that this is not necessarily the case. Consider the joint distribution of $(X, Y)$ sketched in Figure 1. $X$ has a uniform distribution on $[0,1] \times [0,1] \cup [1,2] \times [1,2] \cup [2,3] \times [2,3]$. $Y$ is a function of $X$, that is, $\eta(x) \in \{0, 1\}$ and $L^* = 0$. The lower left square $[0,1] \times [0,1]$ is divided into countably infinitely many vertical stripes in which the stripes with $\eta(x) = 0$ and $\eta(x) = 1$ alternate. The upper right square $[2,3] \times [2,3]$ is divided similarly into horizontal stripes. The middle rectangle $[1,2] \times [1,2]$ is a $2 \times 2$ checkerboard.

Figure 1: An example of a distribution for which greedy random forests are inconsistent. The distribution of $X$ is uniform on the union of the three large squares. White areas represent the set where $\eta(x) = 0$ and on the grey regions $\eta(x) = 1$.

Consider Breiman's random forest classifier with $v = 1$ (the only possible choice when $d = 2$). For simplicity, consider the case when, instead of minimizing the empirical error, each tree is grown by minimizing the true probability of error at each split in each random tree. Then it is easy to see that no matter what the sequence of random selection of split directions is and no matter for how long each tree is grown, no tree will ever cut the middle rectangle and therefore the probability of error of the corresponding random forest classifier is at least 1/6.

It is not so clear what happens in this example if the successive cuts are made by minimizing the empirical error. Whether the middle square is ever cut will depend on the precise form of the stopping rule and the exact parameters of the distribution. The example is here to illustrate that consistency of greedily grown random forests is a delicate issue. Note however that if Breiman's original algorithm is used in this example (i.e., when all cells with more than one data point in it are split) then one obtains a consistent classification rule. If, on the other hand, horizontal or vertical cuts are selected to minimize the probability of error, and $k_n \to \infty$ in such a way that $k_n = O(n^{1/2 - \varepsilon})$ for some $\varepsilon > 0$, then, as errors on the middle square are never more than about $O(1/\sqrt{n})$ (by the limit law for the Kolmogorov-Smirnov statistic), we see that thin strips of probability mass more than $1/\sqrt{n}$ are preferentially cut. By choosing the probability weights of the strips, one can easily see that we can construct more than $2k_n$ such strips.
Thus, when $k_n = O(n^{1/2 - \varepsilon})$, no consistency is possible on that example.

We note here that many versions of random forest classifiers build on random tree classifiers based on bootstrap subsampling. This is the case of Breiman's principal random forest classifier.
Figure 2: A tree based on partitioning the plane into rectangles. The right subtree of each internal node belongs to the inside of a rectangle, and the left subtree belongs to the complement of the same rectangle ($i^c$ denotes the complement of $i$). Rectangles are not allowed to overlap.

Breiman suggests taking a random sample of size $n$ drawn with replacement from the original data. While this may result in an improved behavior in some practical instances, it is easy to see that such a subsampling procedure does not vary the consistency property of any of the classifiers studied in this paper. For example, nonconsistency of Breiman's random forest classifier with bootstrap resampling for the distribution considered in the proof of Proposition 8 follows from the fact that the two layered nearest neighbors on both sides are included in the bootstrap sample with a probability bounded away from zero and therefore the weight of these two points is too large, making consistency impossible.

In order to remedy the inconsistency of greedily grown tree classifiers, Devroye, Györfi, and Lugosi (1996, Section 20.14) introduce a greedy tree classifier which, instead of cutting every cell along just one direction, cuts out a whole hyperrectangle from a cell in a way to optimize the empirical error. The disadvantage of this method is that in each step, $d$ parameters need to be optimized jointly and this may be computationally prohibitive if $d$ is not very small. (The computational complexity of the method is $O(n^d)$.) However, we may use the methodology of random forests to define a computationally feasible consistent greedily grown random forest classifier.

In order to define the consistent greedy random forest, we first recall the tree classifier of Devroye, Györfi, and Lugosi (1996, Section 20.14). The space is partitioned into rectangles as shown in Figure 2. A hyperrectangle defines a split in a natural way. A partition is denoted by $\mathcal{P}$, and a decision on a set $A \in \mathcal{P}$ is by majority vote.
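We write $g_{\mathcal{P}}$ for such a rule:
\[
g_{\mathcal{P}}(x) = \mathbb{1}_{\left\{ \sum_{i : X_i \in A(x)} Y_i > \sum_{i : X_i \in A(x)} (1 - Y_i) \right\}},
\]
where $A(x)$ denotes the cell of the partition containing $x$. Given a partition $\mathcal{P}$, a legal hyperrectangle $T$ is one for which $T \cap A = \emptyset$ or $T \subseteq A$ for all sets $A \in \mathcal{P}$. If we refine $\mathcal{P}$ by adding a legal rectangle $T$ somewhere, then we obtain the partition $\mathcal{T}$. The decision $g_{\mathcal{T}}$ agrees with $g_{\mathcal{P}}$ except on the set $A \in \mathcal{P}$ that contains $T$.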
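The majority vote rule $g_{\mathcal{P}}$ can be sketched in a few lines of Python. The partition is represented here by a hypothetical function `cell_of` that maps a point to a hashable label of the cell containing it; this representation and the names used are assumptions made for illustration only.

```python
import math
from collections import Counter

def majority_vote_rule(X, Y, cell_of):
    """Build g_P: predict 1 at x exactly when label 1 holds a strict
    majority among the training points in the cell A(x) containing x."""
    ones = Counter()
    zeros = Counter()
    for xi, yi in zip(X, Y):
        if yi == 1:
            ones[cell_of(xi)] += 1
        else:
            zeros[cell_of(xi)] += 1

    def g(x):
        a = cell_of(x)
        # Ties and empty cells give 0, matching the strict inequality.
        return int(ones[a] > zeros[a])

    return g

# Hypothetical partition of the real line into unit intervals [k, k + 1).
def unit_interval(x):
    return math.floor(x[0])

X = [(0.1,), (0.4,), (0.7,), (1.2,), (1.8,)]
Y = [1, 1, 0, 0, 1]
g = majority_vote_rule(X, Y, unit_interval)
```

In this example the cell [0, 1) holds two points with label 1 and one with label 0, so `g` predicts 1 there; the cell [1, 2) is tied and any empty cell has no votes, so `g` predicts 0 in both cases.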
Introduce the convenient notation
\[
\nu_j(A) = \mathbb{P}\{X \in A, Y = j\}, \quad j \in \{0, 1\},
\]
\[
\nu_{j,n}(A) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{\{X_i \in A, \, Y_i = j\}}, \quad j \in \{0, 1\}.
\]
The empirical error of $g_{\mathcal{P}}$ is
\[
L_n(\mathcal{P}) \stackrel{\mathrm{def}}{=} \sum_{R \in \mathcal{P}} L_n(R),
\]
where
\[
L_n(R) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{\{X_i \in R, \, g_{\mathcal{P}}(X_i) \ne Y_i\}} = \min\left(\nu_{0,n}(R), \nu_{1,n}(R)\right).
\]
We may similarly define $L_n(\mathcal{T})$. Given a partition $\mathcal{P}$, the greedy classifier selects that legal rectangle $T$ for which $L_n(\mathcal{T})$ is minimal (with any appropriate policy for breaking ties). Let $R$ be the set of $\mathcal{P}$ containing $T$. Then the greedy classifier picks that $T$ for which
\[
L_n(T) + L_n(R \setminus T) - L_n(R)
\]
is minimal. Starting with the trivial partition $\mathcal{P}_0 = \{\mathbb{R}^d\}$, we repeat the previous step $k$ times, leading thus to $k + 1$ regions. The sequence of partitions is denoted by $\mathcal{P}_0, \mathcal{P}_1, \ldots, \mathcal{P}_k$. Devroye, Györfi, and Lugosi (1996, Theorem 20.9) establish consistency of this classifier. More precisely, it is shown that if $X$ has nonatomic marginals, then the greedy classifier with $k \to \infty$ and
\[
k = o\left(\sqrt{n / \log n}\right)
\]
is consistent.

Based on the greedy tree classifier, we may define a random forest classifier by considering its bagging version. More precisely, let $q_n \in [0, 1]$ be a parameter and let $Z = Z(D_n)$ denote a random subsample of size binomial$(n, q_n)$ of the training data (i.e., each pair $(X_i, Y_i)$ is selected at random, without replacement, from $D_n$, with probability $q_n$) and let $g_n(x, Z)$ be the greedy tree classifier (as defined above) based on the training data $Z(D_n)$. Define the corresponding averaged classifier $\bar{g}_n$. We call $\bar{g}_n$ the greedy random forest classifier. Note that $\bar{g}_n$ is just the bagging version of the greedy tree classifier and therefore Theorem 6 applies:

Theorem 9 The greedy random forest classifier is consistent whenever $X$ has nonatomic marginals in $\mathbb{R}^d$, $n q_n \to \infty$, $k_n \to \infty$ and $k_n = o\left(\sqrt{n q_n / \log(n q_n)}\right)$ as $n \to \infty$.

Proof This follows from Theorem 6 and the fact that the greedy tree classifier is consistent (see Theorem 20.9 of Devroye, Györfi, and Lugosi (1996)).

Observe that the computational complexity of building the randomized tree classifier $g_n(x, Z)$ is $O((n q_n)^d)$. Thus, the complexity of computing the voting classifier $g_n^{(m)}$ is $m (n q_n)^d$.
If $q_n \ll 1$, this may be a significant speed-up compared to the complexity $O(n^d)$ of computing a single tree classifier using the full sample. Repeated subsampling and averaging may make up for the effect of decreased sample size.
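The subsample-and-vote construction can be sketched generically in Python. The wrapper below keeps each training pair independently with probability q, fits a base classifier on every subsample and takes a vote over m draws, predicting 1 when at least half of the votes are 1 (this tie convention is an assumption). The base learner `fit_1nn` is only a simple stand-in chosen to keep the example short; it is not the greedy tree classifier discussed above, and all names are hypothetical.

```python
import numpy as np

def fit_1nn(Xs, Ys):
    """Stand-in base learner: 1-nearest neighbour on the subsample.
    Predicts 0 when the subsample is empty."""
    if len(Xs) == 0:
        return lambda x: 0

    def g(x):
        d = np.linalg.norm(Xs - np.asarray(x, dtype=float), axis=1)
        return int(Ys[np.argmin(d)])  # argmin keeps the smallest index on ties

    return g

def voting_classifier(X, Y, fit_base, q, m, seed=0):
    """Vote over m base classifiers, each fitted on a random subsample in
    which every pair (X_i, Y_i) is kept independently with probability q."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y)
    members = []
    for _ in range(m):
        keep = rng.random(len(X)) < q  # subsample size is binomial(n, q)
        members.append(fit_base(X[keep], Y[keep]))

    def predict(x):
        votes = [g(x) for g in members]
        return int(np.mean(votes) >= 0.5)

    return predict

X = [[0.0], [1.0], [2.0]]
Y = [0, 1, 1]
full = voting_classifier(X, Y, fit_1nn, q=1.0, m=5)   # every subsample is the full data
empty = voting_classifier(X, Y, fit_1nn, q=0.0, m=5)  # every subsample is empty
```

With q = 1 every member sees the full sample, so the vote coincides with the base classifier; with q = 0 every subsample is empty and the stand-in learner always returns 0. Each member is fitted on about $n q$ points, which is the source of the speed-up when $q$ is small.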
Acknowledgments

We thank James Malley for stimulating discussions. We also thank three referees for valuable comments and insightful suggestions. The second author's research was sponsored by NSERC Grant A3456 and FQRNT Grant 90 ER. The third author acknowledges support by the Spanish Ministry of Science and Technology grant MTM and by the PASCAL Network of Excellence under EC grant no.

References

Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Computation, 9.
L. Breiman. Bagging predictors. Machine Learning, 24.
L. Breiman. Arcing classifiers. The Annals of Statistics, 24.
L. Breiman. Some infinite theory for predictor ensembles. Technical Report 577, Statistics Department, UC Berkeley, breiman.
L. Breiman. Random forests. Machine Learning, 45:5-32.
L. Breiman. Consistency for a simple model of random forests. Technical Report 670, Statistics Department, UC Berkeley.
A. Cutler and G. Zhao. Pert - Perfect random tree ensembles. Computing Science and Statistics, 33.
L. Devroye. Applications of the theory of records in the study of random trees. Acta Informatica, 26.
L. Devroye. A note on the height of binary search trees. Journal of the ACM, 33.
L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, 1996.
T.G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40.
T.G. Dietterich. Ensemble methods in machine learning. In J. Kittler and F. Roli (Eds.), First International Workshop on Multiple Classifier Systems, Lecture Notes in Computer Science, pp. 1-15, Springer-Verlag, New York.
Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In L. Saitta (Ed.), Machine Learning: Proceedings of the 13th International Conference, Morgan Kaufmann, San Francisco.
Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101, 2006.
N. Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7.
H.M. Mahmoud. Evolution of Random Search Trees. John Wiley, New York.
C. Stone. Consistent nonparametric regression. The Annals of Statistics, 5.
More information3 Basic Definitions of Probability Theory
3 Basic Defiitios of Probability Theory 3defprob.tex: Feb 10, 2003 Classical probability Frequecy probability axiomatic probability Historical developemet: Classical Frequecy Axiomatic The Axiomatic defiitio
More informationrepresented by 4! different arrangements of boxes, divide by 4! to get ways
Problem Set #6 solutios A juggler colors idetical jugglig balls red, white, ad blue (a I how may ways ca this be doe if each color is used at least oce? Let us preemptively color oe ball i each color,
More informationSoving Recurrence Relations
Sovig Recurrece Relatios Part 1. Homogeeous liear 2d degree relatios with costat coefficiets. Cosider the recurrece relatio ( ) T () + at ( 1) + bt ( 2) = 0 This is called a homogeeous liear 2d degree
More informationCenter, Spread, and Shape in Inference: Claims, Caveats, and Insights
Ceter, Spread, ad Shape i Iferece: Claims, Caveats, ad Isights Dr. Nacy Pfeig (Uiversity of Pittsburgh) AMATYC November 2008 Prelimiary Activities 1. I would like to produce a iterval estimate for the
More informationThe Limit of a Sequence
3 The Limit of a Sequece 3. Defiitio of limit. I Chapter we discussed the limit of sequeces that were mootoe; this restrictio allowed some shortcuts ad gave a quick itroductio to the cocept. But may importat
More informationA ConstantFactor Approximation Algorithm for the Link Building Problem
A CostatFactor Approximatio Algorithm for the Lik Buildig Problem Marti Olse 1, Aastasios Viglas 2, ad Ilia Zvedeiouk 2 1 Ceter for Iovatio ad Busiess Developmet, Istitute of Busiess ad Techology, Aarhus
More informationMath Discrete Math Combinatorics MULTIPLICATION PRINCIPLE:
Math 355  Discrete Math 4.14.4 Combiatorics Notes MULTIPLICATION PRINCIPLE: If there m ways to do somethig ad ways to do aother thig the there are m ways to do both. I the laguage of set theory: Let
More informationSECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES
SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES Read Sectio 1.5 (pages 5 9) Overview I Sectio 1.5 we lear to work with summatio otatio ad formulas. We will also itroduce a brief overview of sequeces,
More informationUC Berkeley Department of Electrical Engineering and Computer Science. EE 126: Probablity and Random Processes. Solutions 9 Spring 2006
Exam format UC Bereley Departmet of Electrical Egieerig ad Computer Sciece EE 6: Probablity ad Radom Processes Solutios 9 Sprig 006 The secod midterm will be held o Wedesday May 7; CHECK the fial exam
More informationApproximating Area under a curve with rectangles. To find the area under a curve we approximate the area using rectangles and then use limits to find
1.8 Approximatig Area uder a curve with rectagles 1.6 To fid the area uder a curve we approximate the area usig rectagles ad the use limits to fid 1.4 the area. Example 1 Suppose we wat to estimate 1.
More informationPROCEEDINGS OF THE YEREVAN STATE UNIVERSITY AN ALTERNATIVE MODEL FOR BONUSMALUS SYSTEM
PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY Physical ad Mathematical Scieces 2015, 1, p. 15 19 M a t h e m a t i c s AN ALTERNATIVE MODEL FOR BONUSMALUS SYSTEM A. G. GULYAN Chair of Actuarial Mathematics
More informationFactors of sums of powers of binomial coefficients
ACTA ARITHMETICA LXXXVI.1 (1998) Factors of sums of powers of biomial coefficiets by Neil J. Cali (Clemso, S.C.) Dedicated to the memory of Paul Erdős 1. Itroductio. It is well ow that if ( ) a f,a = the
More informationB1. Fourier Analysis of Discrete Time Signals
B. Fourier Aalysis of Discrete Time Sigals Objectives Itroduce discrete time periodic sigals Defie the Discrete Fourier Series (DFS) expasio of periodic sigals Defie the Discrete Fourier Trasform (DFT)
More informationSection 1.6: Proof by Mathematical Induction
Sectio.6 Proof by Iductio Sectio.6: Proof by Mathematical Iductio Purpose of Sectio: To itroduce the Priciple of Mathematical Iductio, both weak ad the strog versios, ad show how certai types of theorems
More informationNATIONAL SENIOR CERTIFICATE GRADE 12
NATIONAL SENIOR CERTIFICATE GRADE MATHEMATICS P EXEMPLAR 04 MARKS: 50 TIME: 3 hours This questio paper cosists of 8 pages ad iformatio sheet. Please tur over Mathematics/P DBE/04 NSC Grade Eemplar INSTRUCTIONS
More informationNotes on Hypothesis Testing
Probability & Statistics Grishpa Notes o Hypothesis Testig A radom sample X = X 1,..., X is observed, with joit pmf/pdf f θ x 1,..., x. The values x = x 1,..., x of X lie i some sample space X. The parameter
More informationx(x 1)(x 2)... (x k + 1) = [x] k n+m 1
1 Coutig mappigs For every real x ad positive iteger k, let [x] k deote the fallig factorial ad x(x 1)(x 2)... (x k + 1) ( ) x = [x] k k k!, ( ) k = 1. 0 I the sequel, X = {x 1,..., x m }, Y = {y 1,...,
More informationInstitute for the Advancement of University Learning & Department of Statistics
Istitute for the Advacemet of Uiversity Learig & Departmet of Statistics Descriptive Statistics for Research (Hilary Term, 00) Lecture 5: Cofidece Itervals (I.) Itroductio Cofidece itervals (or regios)
More informationPerfect Packing Theorems and the AverageCase Behavior of Optimal and Online Bin Packing
SIAM REVIEW Vol. 44, No. 1, pp. 95 108 c 2002 Society for Idustrial ad Applied Mathematics Perfect Packig Theorems ad the AverageCase Behavior of Optimal ad Olie Bi Packig E. G. Coffma, Jr. C. Courcoubetis
More information6 Algorithm analysis
6 Algorithm aalysis Geerally, a algorithm has three cases Best case Average case Worse case. To demostrate, let us cosider the a really simple search algorithm which searches for k i the set A{a 1 a...
More informationLearning outcomes. Algorithms and Data Structures. Time Complexity Analysis. Time Complexity Analysis How fast is the algorithm? Prof. Dr.
Algorithms ad Data Structures Algorithm efficiecy Learig outcomes Able to carry out simple asymptotic aalysisof algorithms Prof. Dr. Qi Xi 2 Time Complexity Aalysis How fast is the algorithm? Code the
More information