Consistency of Random Forests and Other Averaging Classifiers
Journal of Machine Learning Research 9 (2008) Submitted 1/08; Revised 5/08; Published 9/08

Gérard Biau
LSTA & LPMA, Université Pierre et Marie Curie Paris VI
Boîte 158, 175 rue du Chevaleret, Paris, France

Luc Devroye
School of Computer Science, McGill University
Montreal, Canada H3A 2K6

Gábor Lugosi
ICREA and Department of Economics, Pompeu Fabra University
Ramon Trias Fargas, Barcelona, Spain

Editor: Peter Bartlett

© 2008 Gérard Biau, Luc Devroye and Gábor Lugosi.

Abstract

In the last years of his life, Leo Breiman promoted random forests for use in classification. He suggested using averaging as a means of obtaining good discrimination rules. The base classifiers used for averaging are simple and randomized, often based on random samples from the data. He left a few questions unanswered regarding the consistency of such rules. In this paper, we give a number of theorems that establish the universal consistency of averaging rules. We also show that some popular classifiers, including one suggested by Breiman, are not universally consistent.

Keywords: random forests, classification trees, consistency, bagging

1. Introduction

This paper is dedicated to the memory of Leo Breiman.

Ensemble methods, popular in machine learning, are learning algorithms that construct a set of many individual classifiers (called base learners) and combine them to classify new data points by taking a weighted or unweighted vote of their predictions. It is now well-known that ensembles are often much more accurate than the individual classifiers that make them up. The success of ensemble algorithms on many benchmark data sets has raised considerable interest in understanding why such methods succeed and identifying circumstances in which they can be expected to produce good results. These methods differ in the way the base learner is fit and combined. For example, bagging (Breiman, 1996) proceeds by generating bootstrap samples from the original data set, constructing a classifier from each bootstrap sample, and voting to combine. In boosting (Freund and Schapire, 1996) and arcing algorithms (Breiman, 1998) the successive classifiers are constructed by giving increased weight to those points that have been frequently misclassified, and the classifiers are combined using weighted voting. On the other hand, random split selection (Dietterich, 2000)
grows trees on the original data set. For a fixed number $S$, at each node, $S$ best splits (in terms of minimizing deviance) are found and the actual split is randomly and uniformly selected from them. For a comprehensive review of ensemble methods, we refer the reader to Dietterich (2000a) and the references therein.

Breiman (2001) provides a general framework for tree ensembles called random forests. Each tree depends on the values of a random vector sampled independently and with the same distribution for all trees. Thus, a random forest is a classifier that consists of many decision trees and outputs the class that is the mode of the classes output by individual trees. Algorithms for inducing a random forest were first developed by Breiman and Cutler, and Random Forests is their trademark. The web page provides a collection of downloadable technical reports, and gives an overview of random forests as well as comments on the features of the method.

Random forests have been shown to give excellent performance on a number of practical problems. They work fast, generally exhibit a substantial performance improvement over single tree classifiers such as CART, and yield generalization error rates that compare favorably to the best statistical and machine learning methods. In fact, random forests are among the most accurate general-purpose classifiers available (see, for example, Breiman, 2001).

Different random forests differ in how randomness is introduced in the tree building process, ranging from extreme random splitting strategies (Breiman, 2000; Cutler and Zhao, 2001) to more involved data-dependent strategies (Amit and Geman, 1997; Breiman, 2001; Dietterich, 2000). As a matter of fact, the statistical mechanism of random forests is not yet fully understood and is still under active investigation. Unlike single trees, where consistency is proved letting the number of observations in each terminal node become large (Devroye, Györfi, and Lugosi, 1996, Chapter 20), random forests are generally built to have a small number of cases in each terminal node.
Although the mechanism of random forest algorithms appears simple, it is difficult to analyze and remains largely unknown. Some attempts to investigate the driving force behind consistency of random forests are by Breiman (2000, 2004) and Lin and Jeon (2006), who establish a connection between random forests and adaptive nearest neighbor methods. Meinshausen (2006) proved consistency of certain random forests in the context of so-called quantile regression.

In this paper we offer consistency theorems for various versions of random forests and other randomized ensemble classifiers. In Section 2 we introduce a general framework for studying classifiers based on averaging randomized base classifiers. We prove a simple but useful proposition showing that averaged classifiers are consistent whenever the base classifiers are. In Section 3 we prove consistency of two simple random forest classifiers, the purely random forest (suggested by Breiman as a starting point for study) and the scale-invariant random forest classifiers. In Section 4 it is shown that averaging may convert inconsistent rules into consistent ones. In Section 5 we briefly investigate consistency of bagging rules. We show that, in general, bagging preserves consistency of the base rule and it may even create consistent rules from inconsistent ones. In particular, we show that if the bootstrap samples are sufficiently small, the bagged version of the 1-nearest neighbor classifier is consistent.
Finally, in Section 6 we consider random forest classifiers based on randomized, greedily grown tree classifiers. We argue that some greedy random forest classifiers, including Breiman's random forest classifier, are inconsistent and suggest a consistent greedy random forest classifier.

2. Voting and Averaged Classifiers

Let $(X,Y),(X_1,Y_1),\ldots,(X_n,Y_n)$ be i.i.d. pairs of random variables such that $X$ (the so-called feature vector) takes its values in $\mathbb{R}^d$ while $Y$ (the label) is a binary $\{0,1\}$-valued random variable. The joint distribution of $(X,Y)$ is determined by the marginal distribution $\mu$ of $X$ (i.e., $P\{X \in A\} = \mu(A)$ for all Borel sets $A \subset \mathbb{R}^d$) and the a posteriori probability $\eta : \mathbb{R}^d \to [0,1]$ defined by

$$\eta(x) = P\{Y = 1 \mid X = x\}.$$

The collection $(X_1,Y_1),\ldots,(X_n,Y_n)$ is called the training data, and is denoted by $D_n$. A classifier $g_n$ is a binary-valued function of $X$ and $D_n$ whose probability of error is defined by

$$L(g_n) = P_{(X,Y)}\{g_n(X,D_n) \neq Y\}$$

where $P_{(X,Y)}$ denotes probability with respect to the pair $(X,Y)$ (i.e., conditional probability, given $D_n$). For brevity, we write $g_n(X) = g_n(X,D_n)$. It is well-known (see, for example, Devroye, Györfi, and Lugosi, 1996) that the classifier that minimizes the probability of error, the so-called Bayes classifier, is $g^*(x) = \mathbf{1}_{\{\eta(x) \ge 1/2\}}$. The risk of $g^*$ is called the Bayes risk: $L^* = L(g^*)$. A sequence $\{g_n\}$ of classifiers is consistent for a certain distribution of $(X,Y)$ if $L(g_n) \to L^*$ in probability.

In this paper we investigate classifiers that calculate their decisions by taking a majority vote over randomized classifiers. A randomized classifier may use a random variable $Z$ to calculate its decision. More precisely, let $\mathcal{Z}$ be some measurable space and let $Z$ take its values in $\mathcal{Z}$. A randomized classifier is an arbitrary function of the form $g_n(X,Z,D_n)$, which we abbreviate by $g_n(X,Z)$. The probability of error of $g_n$ becomes

$$L(g_n) = P_{(X,Y),Z}\{g_n(X,Z,D_n) \neq Y\} = P\{g_n(X,Z,D_n) \neq Y \mid D_n\}.$$
The definition of consistency remains the same by augmenting the probability space appropriately to include the randomization.

Given any randomized classifier, one may calculate the classifier for various draws of the randomizing variable $Z$. It is then a natural idea to define an averaged classifier by taking a majority vote among the obtained random classifiers. Assume that $Z_1,\ldots,Z_m$ are identically distributed draws of the randomizing variable, having the same distribution as $Z$. Throughout the paper, we assume that $Z_1,\ldots,Z_m$ are independent, conditionally on $X$, $Y$, and $D_n$. Letting $Z^m = (Z_1,\ldots,Z_m)$, one may define the corresponding voting classifier by

$$g_n^{(m)}(x,Z^m,D_n) = \begin{cases} 1 & \text{if } \frac{1}{m}\sum_{j=1}^m g_n(x,Z_j,D_n) \ge \frac{1}{2}, \\ 0 & \text{otherwise.} \end{cases}$$

By the strong law of large numbers, for any fixed $x$ and $D_n$ for which $P_Z\{g_n(x,Z,D_n) = 1\} \neq 1/2$, we have almost surely

$$\lim_{m\to\infty} g_n^{(m)}(x,Z^m,D_n) = \bar g_n(x,D_n),$$

where

$$\bar g_n(x,D_n) = \bar g_n(x) = \mathbf{1}_{\{E_Z g_n(x,Z) \ge 1/2\}}$$
is a (non-randomized) classifier that we call the averaged classifier. (Here $P_Z$ and $E_Z$ denote probability and expectation with respect to the randomizing variable $Z$, that is, conditionally on $X$, $Y$, and $D_n$.)

$\bar g_n$ may be interpreted as an idealized version of the classifier $g_n^{(m)}$ that draws many independent copies of the randomizing variable $Z$ and takes a majority vote over the resulting classifiers.

Our first result states that consistency of a randomized classifier is preserved by averaging.

Proposition 1 Assume that the sequence $\{g_n\}$ of randomized classifiers is consistent for a certain distribution of $(X,Y)$. Then the voting classifier $g_n^{(m)}$ (for any value of $m$) and the averaged classifier $\bar g_n$ are also consistent.

Proof Consistency of $\{g_n\}$ is equivalent to saying that $E L(g_n) = P\{g_n(X,Z) \neq Y\} \to L^*$. In fact, since $P\{g_n(X,Z) \neq Y \mid X = x\} \ge P\{g^*(X) \neq Y \mid X = x\}$ for all $x \in \mathbb{R}^d$, consistency of $\{g_n\}$ means that for $\mu$-almost all $x$,

$$P\{g_n(X,Z) \neq Y \mid X = x\} \to P\{g^*(X) \neq Y \mid X = x\} = \min(\eta(x), 1-\eta(x)).$$

Without loss of generality, assume that $\eta(x) > 1/2$. (In the case of $\eta(x) = 1/2$ any classifier has a conditional probability of error $1/2$ and there is nothing to prove.) Then

$$P\{g_n(X,Z) \neq Y \mid X = x\} = (2\eta(x) - 1)\,P\{g_n(x,Z) = 0\} + 1 - \eta(x),$$

and by consistency we have $P\{g_n(x,Z) = 0\} \to 0$.

To prove consistency of the voting classifier $g_n^{(m)}$, it suffices to show that $P\{g_n^{(m)}(x,Z^m) = 0\} \to 0$ for $\mu$-almost all $x$ for which $\eta(x) > 1/2$. However,

$$P\{g_n^{(m)}(x,Z^m) = 0\} = P\left\{\frac{1}{m}\sum_{j=1}^m \mathbf{1}_{\{g_n(x,Z_j)=0\}} > \frac{1}{2}\right\} \le 2E\left[\frac{1}{m}\sum_{j=1}^m \mathbf{1}_{\{g_n(x,Z_j)=0\}}\right] \quad \text{(by Markov's inequality)}$$

$$= 2P\{g_n(x,Z) = 0\} \to 0.$$

Consistency of the averaged classifier is proved by a similar argument.

3. Random Forests

Random forests, introduced by Breiman, are averaged classifiers in the sense defined in Section 2. Formally, a random forest with $m$ trees is a classifier consisting of a collection of randomized base tree classifiers $g_n(x,Z_1),\ldots,g_n(x,Z_m)$ where $Z_1,\ldots,Z_m$ are identically distributed random vectors, independent conditionally on $X$, $Y$, and $D_n$.
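The voting rule of Section 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `voting_classifier` and `averaged_classifier_mc`, and the callables `base` (a randomized base classifier taking a point and one draw of the randomization) and `sample_z` (a sampler for the randomizing variable), are hypothetical and do not come from the paper. The averaged classifier is approximated here by a vote over many draws, which is justified by the law of large numbers only when the vote probability differs from 1/2.

```python
import random


def voting_classifier(base, x, draws):
    """Majority vote over base(x, z) for each z in draws.

    base is a hypothetical randomized classifier returning 0 or 1.
    A vote share of exactly 1/2 is resolved in favor of label 1,
    matching the ">= 1/2" convention in the definition.
    """
    votes = [base(x, z) for z in draws]
    return 1 if sum(votes) / len(votes) >= 0.5 else 0


def averaged_classifier_mc(base, x, sample_z, m=10_000, seed=0):
    """Monte Carlo stand-in for the averaged classifier.

    sample_z(rng) is an assumed helper that returns one draw of the
    randomizing variable; m independent draws are voted on.
    """
    rng = random.Random(seed)
    draws = [sample_z(rng) for _ in range(m)]
    return voting_classifier(base, x, draws)
```

With a small `m` this is the voting classifier itself; letting `m` grow gives an approximation of the averaged rule.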
The randomizing variable is typically used to determine how the successive cuts are performed when building the tree, such as selection of the node and the coordinate to split, as well as the position of the split. The random forest classifier takes a majority vote among the random tree classifiers. If $m$ is large, the random forest classifier is well approximated by the averaged classifier
$\bar g_n(x) = \mathbf{1}_{\{E_Z g_n(x,Z) \ge 1/2\}}$. For brevity, we state most results of this paper for the averaged classifier $\bar g_n$ only, though by Proposition 1 various results remain true for the voting classifier $g_n^{(m)}$ as well.

In this section we analyze a simple random forest already considered by Breiman (2000), which we call the purely random forest. The random tree classifier $g_n(x,Z)$ is constructed as follows. Assume, for simplicity, that $\mu$ is supported on $[0,1]^d$. All nodes of the tree are associated with rectangular cells such that at each step of the construction of the tree, the collection of cells associated with the leaves of the tree (i.e., external nodes) forms a partition of $[0,1]^d$. The root of the random tree is $[0,1]^d$ itself. At each step of the construction of the tree, a leaf is chosen uniformly at random. The split variable $J$ is then selected uniformly at random from the $d$ candidates $x^{(1)},\ldots,x^{(d)}$. Finally, the selected cell is split along the randomly chosen variable at a random location, chosen according to a uniform random variable on the length of the chosen side of the selected cell. The procedure is repeated $k$ times where $k \ge 1$ is a deterministic parameter, fixed beforehand by the user, and possibly depending on $n$.

The randomized classifier $g_n(x,Z)$ takes a majority vote among all $Y_i$ for which the corresponding feature vector $X_i$ falls in the same cell of the random partition as $x$. (For concreteness, break ties in favor of the label 1.)

The purely random forest classifier is a radically simplified version of random forest classifiers used in practice. The main simplification lies in the fact that recursive cell splits do not depend on the labels $Y_1,\ldots,Y_n$. The next theorem mainly serves as an illustration of how the consistency problem of random forest classifiers may be attacked. More involved versions of random forest classifiers are discussed in subsequent sections.

Theorem 2 Assume that the distribution of $X$ is supported on $[0,1]^d$. Then the purely random forest classifier $\bar g_n$ is consistent whenever $k \to \infty$ and $k/n \to 0$ as $k \to \infty$.
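The construction of the purely random tree described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the representation of a cell as a list of `(low, high)` intervals, and the use of a finite vote over `m` trees in place of the averaged classifier are all assumptions made for the example. Cells are treated as closed boxes, so a point lying exactly on a cut (a probability-zero event for continuous data) is assigned to the first matching cell.

```python
import random


def build_purely_random_partition(d, k, rng):
    """Return the k + 1 leaf cells of a purely random tree on [0,1]^d."""
    cells = [[(0.0, 1.0)] * d]            # the root cell is [0,1]^d
    for _ in range(k):
        idx = rng.randrange(len(cells))   # leaf chosen uniformly at random
        cell = cells.pop(idx)
        j = rng.randrange(d)              # split coordinate chosen uniformly
        lo, hi = cell[j]
        cut = rng.uniform(lo, hi)         # uniform position along that side
        left, right = list(cell), list(cell)
        left[j], right[j] = (lo, cut), (cut, hi)
        cells += [left, right]
    return cells


def in_cell(cell, x):
    """True if the point x lies in the closed box described by cell."""
    return all(lo <= xi <= hi for (lo, hi), xi in zip(cell, x))


def tree_classify(cells, X, Y, x):
    """Majority vote among labels of data points sharing a cell with x."""
    cell = next(c for c in cells if in_cell(c, x))
    labels = [y for xi, y in zip(X, Y) if in_cell(cell, xi)]
    # Ties (including an empty cell) are broken in favor of label 1.
    return 1 if 2 * sum(labels) >= len(labels) else 0


def forest_classify(X, Y, x, k, m, seed=0):
    """Vote over m independent purely random trees with k cuts each."""
    rng = random.Random(seed)
    d = len(x)
    votes = sum(
        tree_classify(build_purely_random_partition(d, k, rng), X, Y, x)
        for _ in range(m)
    )
    return 1 if 2 * votes >= m else 0
```

Note that the labels play no role in `build_purely_random_partition`, which is exactly the simplification the text points out.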
Proof By Proposition 1 it suffices to prove consistency of the randomized base tree classifier $g_n$. To this end, we recall a general consistency theorem for partitioning classifiers proved in (Devroye, Györfi, and Lugosi, 1996, Theorem 6.1). According to this theorem, $g_n$ is consistent if both $\mathrm{diam}(A_n(X,Z)) \to 0$ in probability and $N_n(X,Z) \to \infty$ in probability, where $A_n(x,Z)$ is the rectangular cell of the random partition containing $x$ and

$$N_n(x,Z) = \sum_{i=1}^n \mathbf{1}_{\{X_i \in A_n(x,Z)\}}$$

is the number of data points falling in the same cell as $x$.

First we show that $N_n(X,Z) \to \infty$ in probability. Consider the random tree partition defined by $Z$. Observe that the partition has $k+1$ rectangular cells, say $A_1,\ldots,A_{k+1}$. Let $N_1,\ldots,N_{k+1}$ denote the number of points of $X,X_1,\ldots,X_n$ falling in these $k+1$ cells. Let $S = \{X,X_1,\ldots,X_n\}$ denote the set of positions of these $n+1$ points. Since these points are independent and identically distributed, fixing the set $S$ (but not the order of the points) and $Z$, the conditional probability that $X$ falls in the $i$-th cell equals $N_i/(n+1)$. Thus, for every fixed $t > 0$,

$$P\{N_n(X,Z) < t\} = E\big[P\{N_n(X,Z) < t \mid S,Z\}\big] = E\left[\sum_{i: N_i < t} \frac{N_i}{n+1}\right] \le (t-1)\frac{k+1}{n+1}$$
which converges to zero by our assumption on $k$.

It remains to show that $\mathrm{diam}(A_n(X,Z)) \to 0$ in probability. To this aim, let $V_n = V_n(x,Z)$ be the size of the first dimension of the rectangle containing $x$. Let $T_n = T_n(x,Z)$ be the number of times that the box containing $x$ is split when we construct the random tree partition. Let $K_n$ be binomial $(T_n, 1/d)$, representing the number of times the box containing $x$ is split along the first coordinate.

Clearly, it suffices to show that $V_n(x,Z) \to 0$ in probability for $\mu$-almost all $x$, so it is enough to show that for all $x$, $E[V_n(x,Z)] \to 0$. Observe that if $U_1,U_2,\ldots$ are independent uniform $[0,1]$, then

$$E[V_n(x,Z)] \le E\left[E\left[\prod_{i=1}^{K_n} \max(U_i, 1-U_i) \,\Big|\, K_n\right]\right] = E\left[\big(E[\max(U_1, 1-U_1)]\big)^{K_n}\right] = E\left[(3/4)^{K_n}\right]$$

$$= E\left[\left(1 - \frac{1}{d} + \frac{3}{4d}\right)^{T_n}\right] = E\left[\left(1 - \frac{1}{4d}\right)^{T_n}\right].$$

Thus, it suffices to show that $T_n \to \infty$ in probability. To this end, note that the partition tree is statistically related to a random binary search tree with $k+1$ external nodes (and thus $k$ internal nodes). Such a tree is obtained as follows. Initially, the root is the sole external node, and there are no internal nodes. Select an external node uniformly at random, make it an internal node and give it two children, both external. Repeat until we have precisely $k$ internal nodes and $k+1$ external nodes. The resulting tree is the random binary search tree on $k$ internal nodes (see Devroye 1988 and Mahmoud 1992 for more equivalent constructions of random binary search trees). It is known that all levels up to $l = \lfloor 0.37 \log k \rfloor$ are full with probability tending to one as $k \to \infty$ (Devroye, 1986). The last full level $F_n$ is called the fill-up level. Clearly, the partition tree has this property. Therefore, we know that all final cells have been cut at least $l$ times and therefore $T_n \ge l$ with probability converging to 1. This concludes the proof of Theorem 2.

Remark 3 We observe that the largest first dimension among external nodes does not tend to zero in probability except for $d = 1$. For $d \ge 2$, it tends to a limit random variable that is not atomic at zero (this can be shown using the theory of branching processes).
Thus the proof above could not have used the uniform smallness of all cells. Despite the fact that the random partition contains some cells of huge diameter of non-shrinking size, the rule based on it is consistent.

Next we consider a scale-invariant version of the purely random forest classifier. In this variant the root cell is the entire feature space and the random tree is grown up to $k$ cuts. The leaf cell to cut and the direction $J$ in which the cell is cut are chosen uniformly at random, exactly as in the purely random forest classifier. The only difference is that the position of the cut is now chosen in a data-based manner: if the cell to be cut contains $N$ of the data points $X,X_1,\ldots,X_n$, then a random index $I$ is chosen uniformly from the set $\{0,1,\ldots,N\}$ and the cell is cut so that, when ordered by their $J$-th components, the points with the $I$ smallest values fall in one of the subcells and the rest in
the other. To avoid ties, we assume that the distribution of $X$ has non-atomic marginals. In this case the random tree is well-defined with probability one. Just like before, the associated classifier takes a majority vote over the labels of the data points falling in the same cell as $X$. The scale-invariant random forest classifier is defined as the corresponding averaged classifier.

Theorem 4 Assume that the distribution of $X$ has non-atomic marginals in $\mathbb{R}^d$. Then the scale-invariant random forest classifier $\bar g_n$ is consistent whenever $k \to \infty$ and $k/n \to 0$ as $k \to \infty$.

Proof Once again, we may use Proposition 1 and (Devroye, Györfi, and Lugosi, 1996, Theorem 6.1) to prove consistency of the randomized base tree classifier $g_n$. The proof of the fact that $N_n(X,Z) \to \infty$ in probability is the same as in Theorem 2.

To show that $\mathrm{diam}(A_n(X,Z)) \to 0$ in probability, we begin by noting that, just as in the case of the purely random forest classifier, the partition tree is equivalent to a binary search tree, and therefore with probability converging to one, all final cells have been cut at least $l = \lfloor 0.37 \log k \rfloor$ times.

Since the classification rule is scale-invariant, we may assume, without loss of generality, that the distribution of $X$ is concentrated on the unit cube $[0,1]^d$.

Let $n_i$ denote the cardinality of the $i$-th cell in the partition, $1 \le i \le k+1$, where the cardinality of a cell $C$ is $|C \cap \{X,X_1,\ldots,X_n\}|$. Thus, $\sum_{i=1}^{k+1} n_i = n+1$. Let $V_i$ be the first dimension of the $i$-th cell. Let $V_n(X)$ be the first dimension of the cell that contains $X$. Clearly, given the $n_i$'s, $V_n(X) = V_i$ with probability $n_i/(n+1)$. We need to show that $E[V_n(X)] \to 0$. But we have

$$E[V_n(X)] = E\left[\frac{\sum_{i=1}^{k+1} n_i V_i}{n+1}\right].$$

So, it suffices to show that $E\left[\sum_i n_i V_i\right] = o(n)$.

It is worthy of mention that the random split of a box can be imagined as follows. Given that we split along the $s$-th coordinate axis, and that a box has $m$ points, then we select one of the $m+1$ spacings defined by these $m$ points uniformly at random, still for that $s$-th coordinate. We cut that spacing properly but are free to do so anywhere.
We can cut in proportions $\lambda, 1-\lambda$ with $\lambda \in (0,1)$, and the value of $\lambda$ may vary from cut to cut and even be data-dependent. In fact, then, each internal and external node of our partition tree has associated with it two important quantities, a cardinality, and its first dimension. If we keep using $i$ to index cells, then we can use $n_i$ and $V_i$ for the $i$-th cell, even if it is an internal cell.

Let $\mathcal{A}$ be the collection of external nodes in the subtree of the $i$-th cell. Then trivially,

$$\sum_{j \in \mathcal{A}} n_j V_j \le n_i V_i.$$

Thus, if $\mathcal{E}$ is the collection of all external nodes of a partition tree, $l$ is at most the minimum path distance from any cell in $\mathcal{E}$ to the root, and $\mathcal{L}$ is the collection of all nodes at distance $l$ from the root, then, by the last inequality,

$$\sum_{i \in \mathcal{E}} n_i V_i \le \sum_{i \in \mathcal{L}} n_i V_i.$$

Thus, using the notion of fill-up level $F_n$ of the binary search tree, and setting $l = \lfloor 0.37 \log k \rfloor$, we have

$$E\left[\sum_{i \in \mathcal{E}} n_i V_i\right] \le n P\{F_n < l\} + E\left[\sum_{i \in \mathcal{L}} n_i V_i\right].$$
We have seen that the first term is $o(n)$. We argue that the second term is not more than $n(1 - 1/(8d))^l$, which is $o(n)$ since $k \to \infty$. That will conclude the proof.

It suffices now to argue recursively and fix one cell of cardinality $n$ and first dimension $V$. Let $\mathcal{C}$ be the collection of its children. We will show that

$$E\left[\sum_{i \in \mathcal{C}} n_i V_i\right] \le \left(1 - \frac{1}{8d}\right) nV.$$

Repeating this recursively $l$ times shows that

$$E\left[\sum_{i \in \mathcal{L}} n_i V_i\right] \le n\left(1 - \frac{1}{8d}\right)^l$$

because $V = 1$ at the root.

Fix that cell of cardinality $n$, and assume without loss of generality that $V = 1$. Let the spacings along the first coordinate be $a_1,\ldots,a_{n+1}$, their sum being one. With probability $1 - 1/d$, the first axis is not cut, and thus $\sum_{i \in \mathcal{C}} n_i V_i = n$. With probability $1/d$, the first axis is cut in two parts. We will show that conditional on the event that the first direction is cut,

$$E\left[\sum_{i \in \mathcal{C}} n_i V_i\right] \le \frac{7n}{8}.$$

Unconditionally, we have

$$E\left[\sum_{i \in \mathcal{C}} n_i V_i\right] \le \left(1 - \frac{1}{d}\right) n + \frac{1}{d} \cdot \frac{7}{8}\, n = \left(1 - \frac{1}{8d}\right) n,$$

as required. So, let us prove the conditional result. Using $\delta_j$ to denote numbers drawn from $(0,1)$, possibly random, we have

$$E\left[\sum_{i \in \mathcal{C}} n_i V_i\right] = \frac{1}{n+1}\, E\left[\sum_{j=1}^{n+1} \Big[(j-1)(a_1 + \cdots + a_{j-1} + a_j \delta_j) + (n+1-j)\big(a_j(1-\delta_j) + a_{j+1} + \cdots + a_{n+1}\big)\Big]\right]$$

$$= \frac{1}{n+1}\, E\left[\sum_{k=1}^{n+1} a_k \left(\sum_{k<j\le n+1} (j-1) + \sum_{1\le j<k} (n+1-j) + \delta_k (k-1) + (1-\delta_k)(n+1-k)\right)\right]$$

$$\le \frac{1}{n+1} \sum_{k=1}^{n+1} a_k \left(\frac{n(n+1)}{2} - \frac{k(k-1)}{2} + \frac{n(n+1)}{2} - \frac{(n-k+1)(n-k+2)}{2} + \max(k-1, n+1-k)\right)$$

$$= \frac{1}{n+1} \sum_{k=1}^{n+1} a_k \left(\frac{n(n+1)}{2} + (k-1)(n+1-k) + \max(k-1, n+1-k)\right)$$

$$\le \frac{1}{n+1} \left(\frac{n(n+1)}{2} + \frac{n^2}{4} + n\right) \sum_{k=1}^{n+1} a_k = \frac{n\big((3/4)n + 3/2\big)}{n+1} \le \frac{7n}{8}$$

if $n > 4$.

Our definition of the scale-invariant random forest classifier permits cells to be cut such that one of the created cells becomes empty. One may easily prevent this by artificially forcing a minimum number of points in each cell. This may be done by restricting the random position of each cut so that both created subcells contain at least, say, $m$ points. By a minor modification of the proof above it is easy to see that as long as $m$ is bounded by a constant, the resulting random forest classifier remains consistent under the same conditions as in Theorem 4.

4. Creating Consistent Rules by Randomization

Proposition 1 shows that if a randomized classifier is consistent, then the corresponding averaged classifier remains consistent. The converse is not true. There exist inconsistent randomized classifiers such that their averaged version becomes consistent. Indeed, Breiman's (2001) original random forest classifier builds tree classifiers by successive randomized cuts until the cell of the point $X$ to be classified contains only one data point, and classifies $X$ as the label of this data point. Breiman's random forest classifier is just the averaged version of such randomized tree classifiers. The randomized base classifier $g_n(x,Z)$ is obviously not consistent for all distributions.

This does not imply that the averaged random forest classifier is not consistent. In fact, in this section we will see that averaging may boost inconsistent base classifiers into consistent ones. We point out in Section 6 that there are distributions of $(X,Y)$ for which Breiman's random forest classifier is not consistent. The counterexample shown in Proposition 8 is such that the distribution of $X$ does not have a density. It is possible, however, that Breiman's random forest classifier is consistent whenever the distribution of $X$ has a density.
Breiman's rule is difficult to analyze as each cut of the random tree is determined by a complicated function of the entire data set $D_n$ (i.e., both feature vectors and labels). However, in Section 6 below we provide arguments suggesting that Breiman's random forest is not consistent when a density exists. Instead of Breiman's rule, next we analyze a stylized version by showing that inconsistent randomized rules that take the label of only one neighbor into account can be made consistent by averaging.

For simplicity, we consider the case $d = 1$, though the whole argument extends, in a straightforward way, to the multivariate case. To avoid complications introduced by ties, assume that $X$ has a non-atomic distribution. Define a randomized nearest neighbor rule as follows: for a fixed $x \in \mathbb{R}$, let $X_{(1)}(x), X_{(2)}(x), \ldots, X_{(n)}(x)$ be the ordering of the data points $X_1,\ldots,X_n$ according to increasing distances from $x$. Let $U_1,\ldots,U_n$ be i.i.d. random variables, uniformly distributed over $[0,1]$. The vector of these random variables constitutes the randomization $Z$ of the classifier. We define $g_n(x,Z)$
to be equal to the label $Y_{(i)}(x)$ of the data point $X_{(i)}(x)$ for which

$$\max(i, mU_i) \le \max(j, mU_j) \quad \text{for all } j = 1,\ldots,n,$$

where $m$ is a parameter of the rule. We call $X_{(i)}(x)$ the perturbed nearest neighbor of $x$. Note that $X_{(1)}(x)$ is the (unperturbed) nearest neighbor of $x$. To obtain the perturbed version, we artificially add a random uniform coordinate and select a data point with the randomized rule defined above. Since ties occur with probability zero, the perturbed nearest neighbor classifier is well defined almost surely. It is clearly not, in general, a consistent classifier.

Call the corresponding averaged classifier $\bar g_n(x) = \mathbf{1}_{\{E_Z g_n(x,Z) \ge 1/2\}}$ the averaged perturbed nearest neighbor classifier.

In the proof of the consistency result below, we use Stone's (1977) general consistency theorem for locally weighted average classifiers, see also (Devroye, Györfi, and Lugosi, 1996, Theorem 6.3). Stone's theorem concerns classifiers that take the form

$$g_n(x) = \mathbf{1}_{\left\{\sum_{i=1}^n Y_i W_{ni}(x) \ge \sum_{i=1}^n (1-Y_i) W_{ni}(x)\right\}}$$

where the weights $W_{ni}(x) = W_{ni}(x,X_1,\ldots,X_n)$ are non-negative and sum to one. According to Stone's theorem, consistency holds if the following three conditions are satisfied:

(i) $\lim_{n\to\infty} E\left[\max_{1\le i\le n} W_{ni}(X)\right] = 0$.

(ii) For all $a > 0$, $\lim_{n\to\infty} E\left[\sum_{i=1}^n W_{ni}(X)\,\mathbf{1}_{\{\|X_i - X\| > a\}}\right] = 0$.

(iii) There is a constant $c$ such that, for every non-negative measurable function $f$ satisfying $E f(X) < \infty$, $E\left[\sum_{i=1}^n W_{ni}(X) f(X_i)\right] \le c\,E f(X)$.

Theorem 5 The averaged perturbed nearest neighbor classifier $\bar g_n$ is consistent whenever the parameter $m$ is such that $m \to \infty$ and $m/n \to 0$.

Proof If we define

$$W_{ni}(x) = P_Z\{X_i \text{ is the perturbed nearest neighbor of } x\}$$

then it is clear that the averaged perturbed nearest neighbor classifier is a locally weighted average classifier and Stone's theorem may be applied. It is convenient to introduce the notation

$$p_i(x) = P_Z\{X_{(i)}(x) \text{ is the perturbed nearest neighbor of } x\}$$

and write $W_{ni}(x) = \sum_{j=1}^n \mathbf{1}_{\{X_i = X_{(j)}(x)\}}\, p_j(x)$.
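The selection rule for the perturbed nearest neighbor is easy to simulate, which gives a rough feel for the rank probabilities $p_i$. The sketch below is an illustration only: the function names are hypothetical, and the probabilities are Monte Carlo estimates rather than the exact values used in the proof. Because $\max(1, mU_1) \le m$ always holds, a rank larger than $m$ can never be selected, so the estimated weights are supported on the first $m$ neighbors.

```python
import random


def perturbed_nn_rank(n, m, rng):
    """Return the rank i (1-based) that minimizes max(i, m * U_i).

    U_1, ..., U_n are i.i.d. uniform on [0, 1); ties have probability zero.
    """
    u = [rng.random() for _ in range(n)]
    return min(range(1, n + 1), key=lambda i: max(i, m * u[i - 1]))


def estimate_rank_probabilities(n, m, trials=20_000, seed=0):
    """Monte Carlo estimate of p_1, ..., p_n for given n and m."""
    rng = random.Random(seed)
    counts = [0] * n
    for _ in range(trials):
        counts[perturbed_nn_rank(n, m, rng) - 1] += 1
    return [c / trials for c in counts]
```

For example, `estimate_rank_probabilities(50, 10)` returns fifty estimates whose entries beyond the tenth are all zero.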
To check the conditions of Stone's theorem, first note that

$$p_i(x) = P\left\{mU_i \le i \le \min_{j<i} mU_j\right\} + P\left\{i < mU_i \le \min_{j\le n} \max(j, mU_j)\right\}$$

$$= \mathbf{1}_{\{i \le m\}}\, \frac{i}{m}\left(1 - \frac{i}{m}\right)^{i-1} + P\left\{i < mU_i \le \min_{j\le n} \max(j, mU_j)\right\}.$$

Now we are prepared to check the conditions of Stone's theorem. To prove that (i) holds, note that by monotonicity of $p_i(x)$ in $i$, it suffices to show that $p_1(x) \to 0$. But clearly, for $m \ge 2$,

$$p_1(x) \le \frac{1}{m} + P\left\{U_1 \le \min_{j\le m} \max\left(\frac{j}{m}, U_j\right)\right\}$$

$$= \frac{1}{m} + E\left[\prod_{j=2}^m P\left\{U_1 \le \max\left(\frac{j}{m}, U_j\right) \,\Big|\, U_1\right\}\right]$$

$$= \frac{1}{m} + E\left[\prod_{j=2}^m \left[1 - \mathbf{1}_{\{U_1 > j/m\}} U_1\right]\right]$$

$$\le \frac{1}{m} + E\left[(1-U_1)^{\lceil mU_1 \rceil - 2}\, \mathbf{1}_{\{\lceil mU_1 \rceil \ge 3\}}\right] + P\{\lceil mU_1 \rceil < 3\}$$

which converges to zero by monotone convergence as $m \to \infty$.

(ii) follows by the condition $m/n \to 0$ since $\sum_{i=1}^n W_{ni}(X)\,\mathbf{1}_{\{\|X_i - X\| > a\}} = 0$ whenever the distance of the $m$-th nearest neighbor of $X$ to $X$ is at most $a$. But this happens eventually, almost surely, see (Devroye, Györfi, and Lugosi, 1996, Lemma 5.1).

Finally, to check (iii), we use again the monotonicity of $p_i(x)$ in $i$. We may write $p_i(x) = a_i + a_{i+1} + \cdots + a_n$ for some non-negative numbers $a_j$, $1 \le j \le n$, depending upon $m$ and $n$ but not $x$. Observe that $\sum_{j=1}^n j a_j = \sum_{i=1}^n p_i(x) = 1$. But then

$$E\left[\sum_{i=1}^n W_{ni}(X) f(X_i)\right] = E\left[\sum_{i=1}^n p_i(X) f(X_{(i)})\right] = E\left[\sum_{i=1}^n \sum_{j=i}^n a_j f(X_{(i)})\right] = E\left[\sum_{j=1}^n a_j \sum_{i=1}^j f(X_{(i)})\right]$$

$$= \sum_{j=1}^n a_j \sum_{i=1}^j E\left[f(X_{(i)})\right] \le c \sum_{j=1}^n a_j\, j\, E f(X) = c\,E f(X) \sum_{j=1}^n a_j\, j = c\,E f(X)$$

as desired. (The inequality holds by Stone's (1977) lemma, see (Devroye, Györfi, and Lugosi, 1996, Lemma 5.3), where $c$ is a constant.)

5. Bagging

One of the first and simplest ways of randomizing and averaging classifiers in order to improve their performance is bagging, suggested by Breiman (1996). In bagging, randomization is achieved by generating many bootstrap samples from the original data set. Breiman suggests selecting $n$ training pairs $(X_i,Y_i)$ at random, with replacement, from the bag of all training pairs $\{(X_1,Y_1),\ldots,(X_n,Y_n)\}$. Denoting the random selection process by $Z$, this way one obtains new training data $D_n(Z)$ with possible repetitions, and given a classifier $g_n(X,D_n)$, one can calculate the randomized classifier $g_n(X,Z,D_n) = g_n(X,D_n(Z))$. Breiman suggests repeating this procedure for many independent draws of the bootstrap sample, say $m$ of them, and calculating the voting classifier $g_n^{(m)}(X,Z^m,D_n)$ as defined in Section 2.

In this section we consider a generalized version of bagging predictors in which the size of the bootstrap samples is not necessarily the same as that of the original sample. Also, to avoid complications and ambiguities due to replicated data points, we exclude repetitions in the bootstrapped data. This is assumed for convenience but sampling with replacement can be treated by minor modifications of the arguments below.

To describe the model we consider, introduce a parameter $q_n \in [0,1]$. In the bootstrap sample $D_n(Z)$ each data pair $(X_i,Y_i)$ is present with probability $q_n$, independently of each other. Thus, the size of the bootstrapped data is a binomial random variable $N$ with parameters $n$ and $q_n$. Given a sequence of (non-randomized) classifiers $\{g_n\}$, we may thus define the randomized classifier

$$g_n(X,Z,D_n) = g_N(X,D_n(Z)),$$

that is, the classifier is defined based on the randomly re-sampled data.
By drawing $m$ independent bootstrap samples $D_n(Z_1),\ldots,D_n(Z_m)$ (with sizes $N_1,\ldots,N_m$), we may define the bagging classifier $g_n^{(m)}(X,Z^m,D_n)$ as the voting classifier based on the randomized classifiers $g_{N_1}(X,D_n(Z_1)),\ldots,g_{N_m}(X,D_n(Z_m))$ as in Section 2.

For the theoretical analysis it is more convenient to consider the averaged classifier $\bar g_n(x,D_n) = \mathbf{1}_{\{E_Z g_N(x,D_n(Z)) \ge 1/2\}}$, which is the limiting classifier one obtains as the number $m$ of the bootstrap replicates grows to infinity.

The following result establishes consistency of bagging classifiers under the assumption that the original classifier is consistent. It suffices that the expected size of the bootstrap sample goes to infinity. The result is an immediate consequence of Proposition 1. Note that the choice of $m$ does not matter in Theorem 6. It can be one, constant, or a function of $n$.

Theorem 6 Let $\{g_n\}$ be a sequence of classifiers that is consistent for the distribution of $(X,Y)$. Consider the bagging classifiers $g_n^{(m)}(x,Z^m,D_n)$ and $\bar g_n(x,D_n)$ defined above, using parameter $q_n$. If $nq_n \to \infty$ as $n \to \infty$ then both classifiers are consistent.
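The subsampling variant of bagging studied in this section can be sketched as follows. The code is an illustration under assumptions: `subsample`, `bagging_classify`, and the toy base rule `majority_rule` are hypothetical names, the base classifier is any callable taking the resampled data and a query point, and a finite vote over `m` resamples stands in for the averaged classifier.

```python
import random


def subsample(X, Y, q, rng):
    """Keep each pair (X_i, Y_i) independently with probability q.

    No repetitions occur, so the sample size is Binomial(n, q).
    """
    keep = [i for i in range(len(X)) if rng.random() < q]
    return [X[i] for i in keep], [Y[i] for i in keep]


def bagging_classify(base, X, Y, x, q, m, seed=0):
    """Vote over m classifiers, each built from an independent subsample."""
    rng = random.Random(seed)
    votes = 0
    for _ in range(m):
        Xs, Ys = subsample(X, Y, q, rng)
        votes += base(Xs, Ys, x)
    # A vote share of exactly 1/2 is resolved in favor of label 1.
    return 1 if 2 * votes >= m else 0


def majority_rule(Xs, Ys, x):
    """Toy base classifier (an assumption for this example): ignore x and
    return the majority label of the subsample, with ties going to 1."""
    return 1 if 2 * sum(Ys) >= len(Ys) else 0
```

Setting `q = 1.0` keeps every data pair, so each of the `m` votes coincides with the base classifier applied to the full sample.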
If a classifier is insensitive to duplicates in the data, Breiman's original suggestion is roughly equivalent to taking $q_n \approx 1 - 1/e$. However, it may be advantageous to choose much smaller values of $q_n$. In fact, small values of $q_n$ may turn inconsistent classifiers into consistent ones via the bagging procedure. We illustrate this phenomenon on the simple example of the 1-nearest neighbor rule.

Recall that the 1-nearest neighbor rule sets $g_n(x,D_n) = Y_{(1)}(x)$ where $Y_{(1)}(x)$ is the label of the feature vector $X_{(1)}(x)$ whose Euclidean distance to $x$ is minimal among all $X_1,\ldots,X_n$. Ties are broken in favor of smallest indices. It is well-known that $g_n$ is consistent only if either $L^* = 0$ or $L^* = 1/2$, otherwise its asymptotic probability of error is strictly greater than $L^*$. However, by bagging one may turn the 1-nearest neighbor classifier into a consistent one, provided that the size of the bootstrap sample is sufficiently small. The next result characterizes consistency of the bagging version of the 1-nearest neighbor classifier in terms of the parameter $q_n$.

Theorem 7 The bagging averaged 1-nearest neighbor classifier $\bar g_n(x,D_n)$ is consistent for all distributions of $(X,Y)$ if and only if $q_n \to 0$ and $nq_n \to \infty$.

Proof It is obvious that both $q_n \to 0$ and $nq_n \to \infty$ are necessary for consistency for all distributions.

Assume now that $q_n \to 0$ and $nq_n \to \infty$. The key observation is that $\bar g_n(x,D_n)$ is a locally weighted average classifier for which Stone's consistency theorem, recalled in Section 4, applies. Recall that for a fixed $x \in \mathbb{R}$, $X_{(1)}(x), X_{(2)}(x), \ldots, X_{(n)}(x)$ denotes the ordering of the data points $X_1,\ldots,X_n$ according to increasing distances from $x$. (Points with equal distances to $x$ are ordered according to their indices.) Observe that $\bar g_n$ may be written as

$$\bar g_n(x,D_n) = \mathbf{1}_{\left\{\sum_{i=1}^n Y_i W_{ni}(x) \ge \sum_{i=1}^n (1-Y_i) W_{ni}(x)\right\}}$$

where $W_{ni}(x) = \sum_{j=1}^n \mathbf{1}_{\{X_i = X_{(j)}(x)\}}\, p_j(x)$ and $p_i(x) = (1-q_n)^{i-1} q_n$ is defined as the probability (with respect to the random selection $Z$ of the bootstrap sample) that $X_{(i)}(x)$ is the nearest neighbor of $x$ in the sample $D_n(Z)$.
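Because the weights have the closed form $p_i = (1-q_n)^{i-1} q_n$, the bagging averaged 1-nearest neighbor rule can be evaluated exactly without drawing any bootstrap samples. The sketch below is an illustration, not the authors' code: the function name is hypothetical, points are assumed to be tuples of floats, and the event that the subsample is empty (probability $(1-q_n)^n$) is ignored, as it is in the weighted form above.

```python
import math


def bagged_one_nn(X, Y, x, q):
    """Exact averaged bagged 1-NN decision using geometric rank weights.

    The i-th nearest neighbor of x receives weight (1 - q)**(i - 1) * q.
    Equal distances are ordered by index, as in the text.
    """
    order = sorted(range(len(X)), key=lambda i: (math.dist(X[i], x), i))
    weight_one = weight_zero = 0.0
    for rank, i in enumerate(order, start=1):
        p = (1.0 - q) ** (rank - 1) * q
        if Y[i] == 1:
            weight_one += p
        else:
            weight_zero += p
    return 1 if weight_one >= weight_zero else 0
```

With `q = 1.0` all weight sits on the nearest neighbor and the rule reduces to the plain 1-nearest neighbor classifier; a small `q` spreads the weight over many neighbors, which is the smoothing effect the theorem exploits.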
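It suffices to prove that the weights $W_{ni}(X)$ satisfy the three conditions of Stone's theorem. Condition (i) obviously holds because $\max_{1\le i\le n} W_{ni}(X) = p_1(X) = q_n \to 0$.

To check condition (ii), define $k_n = \lceil \sqrt{n/q_n} \rceil$. Since $nq_n \to \infty$ implies that $k_n/n \to 0$, it follows from (Devroye, Györfi, and Lugosi, 1996, Lemma 5.1) that eventually, almost surely, $\|X - X_{(k_n)}(X)\| \le a$ and therefore

$$\sum_{i=1}^n W_{ni}(X)\,\mathbf{1}_{\{\|X_i - X\| > a\}} \le \sum_{i=k_n+1}^n p_i(X) = q_n \sum_{i=k_n+1}^n (1-q_n)^{i-1} \le (1-q_n)^{k_n} \le (1-q_n)^{\sqrt{n/q_n}} \le e^{-\sqrt{nq_n}}$$

where we used $1 - q_n \le e^{-q_n}$. Therefore, $\sum_{i=1}^n W_{ni}(X)\,\mathbf{1}_{\{\|X_i - X\| > a\}} \to 0$ almost surely and Stone's second condition is satisfied by dominated convergence.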
Finally, condition (iii) follows from the fact that $p_i(x)$ is monotone decreasing in $i$, after using an argument as in the proof of Theorem [...].

Random Forests Based on Greedily Grown Trees

In this section we study random forest classifiers that are based on randomized tree classifiers that are constructed in a greedy manner, by recursively splitting cells to minimize an empirical error criterion. Such greedy forests were introduced by Breiman (2001, 2004) and have shown excellent performance in many applications. One of his most popular classifiers is an averaging classifier, $\bar{g}_n$, based on a randomized tree classifier $g_n(x, Z)$ defined as follows. The algorithm has a parameter $1 \le v < d$ which is a positive integer. The feature space $\mathbb{R}^d$ is partitioned recursively to form a tree partition. The root of the random tree is $\mathbb{R}^d$. At each step of the construction of the tree, a leaf is chosen uniformly at random. $v$ variables are selected uniformly at random from the $d$ candidates $x^{(1)}, \ldots, x^{(d)}$. A split is selected along one of these $v$ variables to minimize the number of misclassified training points if a majority vote is used in each cell. The procedure is repeated until every cell contains exactly one training point $X_i$. (This is always possible if the distribution of $X$ has non-atomic marginals.) In some versions of Breiman's algorithm, a bootstrap subsample of the training data is selected before the construction of each tree to increase the effect of randomization.

As observed by Lin and Jeon (2006), Breiman's classifier is a weighted layered nearest neighbor classifier, that is, a classifier that takes a (weighted) majority vote among the layered nearest neighbors of the observation $x$. $X_i$ is called a layered nearest neighbor of $x$ if the rectangle defined by $x$ and $X_i$ as their opposing vertices does not contain any other data point $X_j$ ($j \ne i$). This property of Breiman's random forest classifier is a simple consequence of the fact that each tree is grown until every cell contains just one data point.
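The definition of a layered nearest neighbor can be made concrete with a short sketch. The code below is an illustrative assumption rather than anything from the paper: the function name and the toy data are hypothetical, and the rectangle is treated as closed, which is harmless when ties occur with probability zero.

```python
import numpy as np


def layered_nearest_neighbors(x, X):
    """Indices i such that the axis-parallel rectangle spanned by x and X[i]
    (as opposing vertices) contains no other data point X[j], j != i."""
    x = np.asarray(x, dtype=float)
    X = np.asarray(X, dtype=float)
    result = []
    for i in range(len(X)):
        lo = np.minimum(x, X[i])
        hi = np.maximum(x, X[i])
        inside = np.all((X >= lo) & (X <= hi), axis=1)
        inside[i] = False  # X[i] itself does not count
        if not inside.any():
            result.append(i)
    return result


# Points in general position: several layered nearest neighbors.
print(layered_nearest_neighbors([0.5, 0.5],
                                [[0.2, 0.8], [0.8, 0.2], [0.3, 0.3], [0.1, 0.1]]))
# prints [0, 1, 2]

# Points on the diagonal x1 = x2: only the closest point on each side qualifies.
print(layered_nearest_neighbors([0.5, 0.5],
                                [[0.1, 0.1], [0.4, 0.4], [0.6, 0.6], [0.9, 0.9]]))
# prints [1, 2]
```

The second call shows the degenerate behavior for data concentrated on a diagonal segment: however many points there are, only two of them can be layered nearest neighbors of a query point on the segment.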
Unfortunately, this simple property prevents the random tree classifier from being consistent for all distributions:

Proposition 8 There exists a distribution of $(X, Y)$ such that $X$ has non-atomic marginals for which Breiman's random forest classifier is not consistent.

Proof The proof works for any weighted layered nearest neighbor classifier. Let the distribution of $X$ be uniform on the segment $\{x = (x^{(1)}, \ldots, x^{(d)}) : x^{(1)} = \cdots = x^{(d)}, \ x^{(1)} \in [0, 1]\}$ and let the distribution of $Y$ be such that $L^* \notin \{0, 1/2\}$. Then with probability one, $X$ has only two layered nearest neighbors and the classification rule is not consistent. (Note that Problem 11.6 in Devroye, Györfi, and Lugosi 1996 erroneously asks the reader to prove consistency of the (unweighted) layered nearest neighbor rule for any distribution with non-atomic marginals. As the example in this proof shows, the statement of the exercise is incorrect. Consistency of the layered nearest neighbor rule is true, however, if the distribution of $X$ has a density.)

One may also wonder whether Breiman's random forest classifier is consistent if, instead of growing the tree down to cells with a single data point, one uses a different stopping rule, for example if one fixes the total number of cuts at $k$ and lets $k$ grow slowly as in the examples of Section 3. The next two-dimensional example provides an indication that this is not necessarily the case. Consider the joint distribution of $(X, Y)$ sketched in Figure 1. $X$ has a uniform distribution on $[0,1] \times [0,1] \cup [1,2] \times [1,2] \cup [2,3] \times [2,3]$. $Y$ is a function of $X$, that is, $\eta(x) \in \{0, 1\}$ and $L^* = 0$. The lower left square $[0,1] \times [0,1]$ is divided into countably infinitely many vertical stripes in
which the stripes with $\eta(x) = 0$ and $\eta(x) = 1$ alternate. The upper right square $[2,3] \times [2,3]$ is divided similarly into horizontal stripes. The middle rectangle $[1,2] \times [1,2]$ is a $2 \times 2$ checkerboard.

Figure 1: An example of a distribution for which greedy random forests are inconsistent. The distribution of $X$ is uniform on the union of the three large squares. White areas represent the set where $\eta(x) = 0$ and on the grey regions $\eta(x) = 1$.

Consider Breiman's random forest classifier with $v = 1$ (the only possible choice when $d = 2$). For simplicity, consider the case when, instead of minimizing the empirical error, each tree is grown by minimizing the true probability of error at each split in each random tree. Then it is easy to see that no matter what the sequence of random selection of split directions is and no matter for how long each tree is grown, no tree will ever cut the middle rectangle and therefore the probability of error of the corresponding random forest classifier is at least $1/6$.

It is not so clear what happens in this example if the successive cuts are made by minimizing the empirical error. Whether the middle square is ever cut will depend on the precise form of the stopping rule and the exact parameters of the distribution. The example is here to illustrate that consistency of greedily grown random forests is a delicate issue. Note, however, that if Breiman's original algorithm is used in this example (i.e., when all cells with more than one data point in them are split) then one obtains a consistent classification rule. If, on the other hand, horizontal or vertical cuts are selected to minimize the probability of error, and $k \to \infty$ in such a way that $k = O(n^{1/2 - \varepsilon})$ for some $\varepsilon > 0$, then, as errors on the middle square are never more than about $O(1/\sqrt{n})$ (by the limit law for the Kolmogorov-Smirnov statistic), we see that thin strips of probability mass more than $1/\sqrt{n}$ are preferentially cut. By choosing the probability weights of the strips, one can easily see that we can construct more than $2k$ such strips.
Thus, when $k = O(n^{1/2 - \varepsilon})$, no consistency is possible on that example.

We note here that many versions of random forest classifiers build on random tree classifiers based on bootstrap subsampling. This is the case of Breiman's principal random forest classifier.
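As a side check on the two-dimensional example above, the claim that a cut of the middle checkerboard never reduces the true probability of error can be verified by direct computation. The sketch below rests on assumptions: the middle square is rescaled to the unit square with uniform mass 1, the colouring ($\eta = 1$ on the lower-left and upper-right quarters) is one of the two possible choices, and the function names are hypothetical. Only vertical cuts are shown; horizontal cuts behave the same way by symmetry.

```python
def strip_masses(a, b):
    """Masses of {eta = 1} and {eta = 0} inside the strip [a, b] x [0, 1]
    of a 2 x 2 checkerboard on the unit square (uniform mass 1)."""
    width_left = max(0.0, min(b, 0.5) - min(a, 0.5))   # part with x below 1/2
    width_right = max(0.0, max(b, 0.5) - max(a, 0.5))  # part with x above 1/2
    # Assumed colouring: eta = 1 on the lower-left and upper-right quarters,
    # so each half-height column contributes equally to both classes.
    mass_one = 0.5 * width_left + 0.5 * width_right
    mass_zero = 0.5 * width_left + 0.5 * width_right
    return mass_one, mass_zero


def error_after_vertical_cut(t):
    """True error on the checkerboard after one vertical cut at t,
    with a majority vote (by true mass) in each of the two cells."""
    return min(strip_masses(0.0, t)) + min(strip_masses(t, 1.0))


error_without_cut = min(strip_masses(0.0, 1.0))
for t in (0.1, 0.25, 0.5, 0.8):
    # The cut never improves on the uncut error of 1/2.
    print(t, error_after_vertical_cut(t), error_without_cut)

# The middle square carries one third of the total mass.
middle_square_contribution = error_without_cut / 3.0
```

Since every cut leaves the error on the checkerboard at one half of its mass, while a cut in a striped square gives a strict improvement, a rule that minimizes the true error has no reason to split the middle square, which accounts for the lower bound of $1/6$.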
Breiman suggests taking a random sample of size $n$ drawn with replacement from the original data. While this may result in an improved behavior in some practical instances, it is easy to see that such a subsampling procedure does not change the consistency property of any of the classifiers studied in this paper. For example, non-consistency of Breiman's random forest classifier with bootstrap resampling for the distribution considered in the proof of Proposition 8 follows from the fact that the two layered nearest neighbors on both sides are included in the bootstrap sample with a probability bounded away from zero and therefore the weight of these two points is too large, making consistency impossible.

Figure 2: A tree based on partitioning the plane into rectangles. The right subtree of each internal node belongs to the inside of a rectangle, and the left subtree belongs to the complement of the same rectangle ($i^c$ denotes the complement of $i$). Rectangles are not allowed to overlap.

In order to remedy the inconsistency of greedily grown tree classifiers, (Devroye, Györfi, and Lugosi, 1996, Section 20.14) introduce a greedy tree classifier which, instead of cutting every cell along just one direction, cuts out a whole hyper-rectangle from a cell in a way to optimize the empirical error. The disadvantage of this method is that in each step, $d$ parameters need to be optimized jointly and this may be computationally prohibitive if $d$ is not very small. (The computational complexity of the method is $O(n^d)$.) However, we may use the methodology of random forests to define a computationally feasible consistent greedily grown random forest classifier.

In order to define the consistent greedy random forest, we first recall the tree classifier of (Devroye, Györfi, and Lugosi, 1996, Section 20.14). The space is partitioned into rectangles as shown in Figure 2. A hyper-rectangle defines a split in a natural way. A partition is denoted by $\mathcal{P}$, and a decision on a set $A \in \mathcal{P}$ is by majority vote.
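We write $g_{\mathcal{P}}$ for such a rule:

$$g_{\mathcal{P}}(x) = \mathbb{1}_{\left\{ \sum_{i : X_i \in A(x)} Y_i > \sum_{i : X_i \in A(x)} (1 - Y_i) \right\}}$$

where $A(x)$ denotes the cell of the partition containing $x$. Given a partition $\mathcal{P}$, a legal hyper-rectangle $T$ is one for which $T \cap A = \emptyset$ or $T \subseteq A$ for all sets $A \in \mathcal{P}$. If we refine $\mathcal{P}$ by adding a legal rectangle $T$ somewhere, then we obtain the partition $\mathcal{T}$. The decision $g_{\mathcal{T}}$ agrees with $g_{\mathcal{P}}$ except on the set $A \in \mathcal{P}$ that contains $T$.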
Introduce the convenient notation

$$\nu_j(A) = \mathbb{P}\{X \in A, Y = j\}, \quad j \in \{0, 1\},$$

$$\nu_{j,n}(A) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}_{\{X_i \in A, \, Y_i = j\}}, \quad j \in \{0, 1\}.$$

The empirical error of $g_{\mathcal{P}}$ is

$$L_n(\mathcal{P}) \stackrel{\text{def}}{=} \sum_{R \in \mathcal{P}} L_n(R),$$

where

$$L_n(R) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}_{\{X_i \in R, \, g_{\mathcal{P}}(X_i) \ne Y_i\}} = \min(\nu_{0,n}(R), \nu_{1,n}(R)).$$

We may similarly define $L_n(\mathcal{T})$. Given a partition $\mathcal{P}$, the greedy classifier selects that legal rectangle $T$ for which $L_n(\mathcal{T})$ is minimal (with any appropriate policy for breaking ties). Let $R$ be the set of $\mathcal{P}$ containing $T$. Then the greedy classifier picks that $T$ for which

$$L_n(T) + L_n(R - T) - L_n(R)$$

is minimal. Starting with the trivial partition $\mathcal{P}_0 = \{\mathbb{R}^d\}$, we repeat the previous step $k$ times, leading thus to $k + 1$ regions. The sequence of partitions is denoted by $\mathcal{P}_0, \mathcal{P}_1, \ldots, \mathcal{P}_k$. (Devroye, Györfi, and Lugosi, 1996, Theorem 20.9) establish consistency of this classifier. More precisely, it is shown that if $X$ has non-atomic marginals, then the greedy classifier with $k \to \infty$ and $k = o\left(\sqrt{n / \log n}\right)$ is consistent.

Based on the greedy tree classifier, we may define a random forest classifier by considering its bagging version. More precisely, let $q_n \in [0, 1]$ be a parameter and let $Z = Z(D_n)$ denote a random subsample of size binomial$(n, q_n)$ of the training data (i.e., each pair $(X_i, Y_i)$ is selected at random, without replacement, from $D_n$, with probability $q_n$) and let $g_n(x, Z)$ be the greedy tree classifier (as defined above) based on the training data $Z(D_n)$. Define the corresponding averaged classifier $\bar{g}_n$. We call $\bar{g}_n$ the greedy random forest classifier. Note that $\bar{g}_n$ is just the bagging version of the greedy tree classifier and therefore Theorem 6 applies:

Theorem 9 The greedy random forest classifier is consistent whenever $X$ has non-atomic marginals in $\mathbb{R}^d$, $q_n n \to \infty$, $k \to \infty$ and $k = o\left(\sqrt{q_n n / \log(q_n n)}\right)$ as $n \to \infty$.

Proof This follows from Theorem 6 and the fact that the greedy tree classifier is consistent (see Theorem 20.9 of Devroye, Györfi, and Lugosi (1996)).

Observe that the computational complexity of building the randomized tree classifier $g_n(x, Z)$ is $O((q_n n)^d)$. Thus, the complexity of computing the voting classifier $g_n^{(m)}$ is $m (q_n n)^d$.
If $q_n \ll 1$, this may be a significant speed-up compared to the complexity $O(n^d)$ of computing a single tree classifier using the full sample. Repeated subsampling and averaging may make up for the effect of decreased sample size.
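To make the greedy criterion of this section concrete, the first greedy step on the trivial partition $\{\mathbb{R}^d\}$ can be sketched by brute force. This is a naive illustration under stated assumptions, not the procedure of Devroye, Györfi, and Lugosi (1996): the function names and toy data are hypothetical, candidate rectangles are closed and have their sides at observed coordinate values, and only the first step is performed, so no legality check is needed.

```python
import itertools
import numpy as np


def min_class_count(y):
    """n * min(nu_{0,n}(R), nu_{1,n}(R)) for the labels y of the points in a cell R."""
    ones = int(np.sum(y))
    return min(ones, len(y) - ones)


def best_first_rectangle(X, y):
    """Brute-force search for the rectangle T minimizing
    L_n(T) + L_n(R - T) - L_n(R) when R is the whole space.

    Returns (change in empirical error, rectangle as per-axis (lo, hi) pairs).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape
    base = min_class_count(y)
    best_change, best_box = 0, None
    # Candidate sides per coordinate: all pairs lo <= hi of observed values.
    sides = [
        list(itertools.combinations_with_replacement(np.unique(X[:, j]), 2))
        for j in range(d)
    ]
    for box in itertools.product(*sides):
        inside = np.ones(n, dtype=bool)
        for j, (lo, hi) in enumerate(box):
            inside &= (X[:, j] >= lo) & (X[:, j] <= hi)
        change = min_class_count(y[inside]) + min_class_count(y[~inside]) - base
        if change < best_change:
            best_change = change
            best_box = tuple((float(lo), float(hi)) for lo, hi in box)
    return best_change / n, best_box


# Hypothetical toy data: class 1 sits in the middle, class 0 at the four corners.
X_demo = [[0, 0], [3, 0], [0, 3], [3, 3], [1, 1], [2, 2], [1, 2], [2, 1]]
y_demo = [0, 0, 0, 0, 1, 1, 1, 1]
change, box = best_first_rectangle(X_demo, y_demo)
print(change)  # prints -0.5: the empirical error drops from 1/2 to 0
```

No single axis-parallel cut separates the two classes in this toy data, whereas cutting out one rectangle does, which is the motivation for optimizing over whole hyper-rectangles. The nested loops over candidate sides also make the rapid growth of the search cost with the sample size and the dimension visible, and this cost is what subsampling is meant to reduce.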
Acknowledgments

We thank James Malley for stimulating discussions. We also thank three referees for valuable comments and insightful suggestions. The second author's research was sponsored by NSERC Grant A3456 and FQRNT Grant 90-ER [...]. The third author acknowledges support by the Spanish Ministry of Science and Technology grant MTM [...] and by the PASCAL Network of Excellence under EC grant no. [...].

References

Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Computation, 9.
L. Breiman. Bagging predictors. Machine Learning, 24, 1996.
L. Breiman. Arcing classifiers. The Annals of Statistics, 24.
L. Breiman. Some infinite theory for predictor ensembles. Technical Report 577, Statistics Department, UC Berkeley.
L. Breiman. Random forests. Machine Learning, 45:5-32, 2001.
L. Breiman. Consistency for a simple model of random forests. Technical Report 670, Statistics Department, UC Berkeley, 2004.
A. Cutler and G. Zhao. Pert - Perfect random tree ensembles. Computing Science and Statistics, 33.
L. Devroye. Applications of the theory of records in the study of random trees. Acta Informatica, 26.
L. Devroye. A note on the height of binary search trees. Journal of the ACM, 33.
L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, 1996.
T.G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40.
T.G. Dietterich. Ensemble methods in machine learning. In J. Kittler and F. Roli (Eds.), First International Workshop on Multiple Classifier Systems, Lecture Notes in Computer Science, pp. 1-15, Springer-Verlag, New York.
Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In L. Saitta (Ed.), Machine Learning: Proceedings of the 13th International Conference, Morgan Kaufmann, San Francisco.
Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101, 2006.
N. Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7.
H.M. Mahmoud. Evolution of Random Search Trees. John Wiley, New York.
C. Stone. Consistent nonparametric regression. The Annals of Statistics, 5.
Department of Computer Science, University of Otago
Departmet of Computer Sciece, Uiversity of Otago Techical Report OUCS-2006-09 Permutatios Cotaiig May Patters Authors: M.H. Albert Departmet of Computer Sciece, Uiversity of Otago Micah Colema, Rya Fly
Asymptotic Growth of Functions
CMPS Itroductio to Aalysis of Algorithms Fall 3 Asymptotic Growth of Fuctios We itroduce several types of asymptotic otatio which are used to compare the performace ad efficiecy of algorithms As we ll
Properties of MLE: consistency, asymptotic normality. Fisher information.
Lecture 3 Properties of MLE: cosistecy, asymptotic ormality. Fisher iformatio. I this sectio we will try to uderstad why MLEs are good. Let us recall two facts from probability that we be used ofte throughout
I. Chi-squared Distributions
1 M 358K Supplemet to Chapter 23: CHI-SQUARED DISTRIBUTIONS, T-DISTRIBUTIONS, AND DEGREES OF FREEDOM To uderstad t-distributios, we first eed to look at aother family of distributios, the chi-squared distributios.
In nite Sequences. Dr. Philippe B. Laval Kennesaw State University. October 9, 2008
I ite Sequeces Dr. Philippe B. Laval Keesaw State Uiversity October 9, 2008 Abstract This had out is a itroductio to i ite sequeces. mai de itios ad presets some elemetary results. It gives the I ite Sequeces
Modified Line Search Method for Global Optimization
Modified Lie Search Method for Global Optimizatio Cria Grosa ad Ajith Abraham Ceter of Excellece for Quatifiable Quality of Service Norwegia Uiversity of Sciece ad Techology Trodheim, Norway {cria, ajith}@q2s.tu.o
Lecture 4: Cauchy sequences, Bolzano-Weierstrass, and the Squeeze theorem
Lecture 4: Cauchy sequeces, Bolzao-Weierstrass, ad the Squeeze theorem The purpose of this lecture is more modest tha the previous oes. It is to state certai coditios uder which we are guarateed that limits
Chapter 6: Variance, the law of large numbers and the Monte-Carlo method
Chapter 6: Variace, the law of large umbers ad the Mote-Carlo method Expected value, variace, ad Chebyshev iequality. If X is a radom variable recall that the expected value of X, E[X] is the average value
5 Boolean Decision Trees (February 11)
5 Boolea Decisio Trees (February 11) 5.1 Graph Coectivity Suppose we are give a udirected graph G, represeted as a boolea adjacecy matrix = (a ij ), where a ij = 1 if ad oly if vertices i ad j are coected
Sequences and Series
CHAPTER 9 Sequeces ad Series 9.. Covergece: Defiitio ad Examples Sequeces The purpose of this chapter is to itroduce a particular way of geeratig algorithms for fidig the values of fuctios defied by their
THE ABRACADABRA PROBLEM
THE ABRACADABRA PROBLEM FRANCESCO CARAVENNA Abstract. We preset a detailed solutio of Exercise E0.6 i [Wil9]: i a radom sequece of letters, draw idepedetly ad uiformly from the Eglish alphabet, the expected
MARTINGALES AND A BASIC APPLICATION
MARTINGALES AND A BASIC APPLICATION TURNER SMITH Abstract. This paper will develop the measure-theoretic approach to probability i order to preset the defiitio of martigales. From there we will apply this
Taking DCOP to the Real World: Efficient Complete Solutions for Distributed Multi-Event Scheduling
Taig DCOP to the Real World: Efficiet Complete Solutios for Distributed Multi-Evet Schedulig Rajiv T. Maheswara, Milid Tambe, Emma Bowrig, Joatha P. Pearce, ad Pradeep araatham Uiversity of Souther Califoria
Discrete Mathematics and Probability Theory Spring 2014 Anant Sahai Note 13
EECS 70 Discrete Mathematics ad Probability Theory Sprig 2014 Aat Sahai Note 13 Itroductio At this poit, we have see eough examples that it is worth just takig stock of our model of probability ad may
THE HEIGHT OF q-binary SEARCH TREES
THE HEIGHT OF q-binary SEARCH TREES MICHAEL DRMOTA AND HELMUT PRODINGER Abstract. q biary search trees are obtaied from words, equipped with the geometric distributio istead of permutatios. The average
Chapter 7 Methods of Finding Estimators
Chapter 7 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 011 Chapter 7 Methods of Fidig Estimators Sectio 7.1 Itroductio Defiitio 7.1.1 A poit estimator is ay fuctio W( X) W( X1, X,, X ) of
Week 3 Conditional probabilities, Bayes formula, WEEK 3 page 1 Expected value of a random variable
Week 3 Coditioal probabilities, Bayes formula, WEEK 3 page 1 Expected value of a radom variable We recall our discussio of 5 card poker hads. Example 13 : a) What is the probability of evet A that a 5
Convexity, Inequalities, and Norms
Covexity, Iequalities, ad Norms Covex Fuctios You are probably familiar with the otio of cocavity of fuctios. Give a twicedifferetiable fuctio ϕ: R R, We say that ϕ is covex (or cocave up) if ϕ (x) 0 for
Incremental calculation of weighted mean and variance
Icremetal calculatio of weighted mea ad variace Toy Fich [email protected] [email protected] Uiversity of Cambridge Computig Service February 009 Abstract I these otes I eplai how to derive formulae for umerically
A probabilistic proof of a binomial identity
A probabilistic proof of a biomial idetity Joatho Peterso Abstract We give a elemetary probabilistic proof of a biomial idetity. The proof is obtaied by computig the probability of a certai evet i two
0.7 0.6 0.2 0 0 96 96.5 97 97.5 98 98.5 99 99.5 100 100.5 96.5 97 97.5 98 98.5 99 99.5 100 100.5
Sectio 13 Kolmogorov-Smirov test. Suppose that we have a i.i.d. sample X 1,..., X with some ukow distributio P ad we would like to test the hypothesis that P is equal to a particular distributio P 0, i.e.
Maximum Likelihood Estimators.
Lecture 2 Maximum Likelihood Estimators. Matlab example. As a motivatio, let us look at oe Matlab example. Let us geerate a radom sample of size 00 from beta distributio Beta(5, 2). We will lear the defiitio
LECTURE 13: Cross-validation
LECTURE 3: Cross-validatio Resampli methods Cross Validatio Bootstrap Bias ad variace estimatio with the Bootstrap Three-way data partitioi Itroductio to Patter Aalysis Ricardo Gutierrez-Osua Texas A&M
Non-life insurance mathematics. Nils F. Haavardsson, University of Oslo and DNB Skadeforsikring
No-life isurace mathematics Nils F. Haavardsso, Uiversity of Oslo ad DNB Skadeforsikrig Mai issues so far Why does isurace work? How is risk premium defied ad why is it importat? How ca claim frequecy
CHAPTER 3 DIGITAL CODING OF SIGNALS
CHAPTER 3 DIGITAL CODING OF SIGNALS Computers are ofte used to automate the recordig of measuremets. The trasducers ad sigal coditioig circuits produce a voltage sigal that is proportioal to a quatity
Hypothesis testing. Null and alternative hypotheses
Hypothesis testig Aother importat use of samplig distributios is to test hypotheses about populatio parameters, e.g. mea, proportio, regressio coefficiets, etc. For example, it is possible to stipulate
Plug-in martingales for testing exchangeability on-line
Plug-i martigales for testig exchageability o-lie Valetia Fedorova, Alex Gammerma, Ilia Nouretdiov, ad Vladimir Vovk Computer Learig Research Cetre Royal Holloway, Uiversity of Lodo, UK {valetia,ilia,alex,vovk}@cs.rhul.ac.uk
A Faster Clause-Shortening Algorithm for SAT with No Restriction on Clause Length
Joural o Satisfiability, Boolea Modelig ad Computatio 1 2005) 49-60 A Faster Clause-Shorteig Algorithm for SAT with No Restrictio o Clause Legth Evgey Datsi Alexader Wolpert Departmet of Computer Sciece
Entropy of bi-capacities
Etropy of bi-capacities Iva Kojadiovic LINA CNRS FRE 2729 Site école polytechique de l uiv. de Nates Rue Christia Pauc 44306 Nates, Frace [email protected] Jea-Luc Marichal Applied Mathematics
5: Introduction to Estimation
5: Itroductio to Estimatio Cotets Acroyms ad symbols... 1 Statistical iferece... Estimatig µ with cofidece... 3 Samplig distributio of the mea... 3 Cofidece Iterval for μ whe σ is kow before had... 4 Sample
Output Analysis (2, Chapters 10 &11 Law)
B. Maddah ENMG 6 Simulatio 05/0/07 Output Aalysis (, Chapters 10 &11 Law) Comparig alterative system cofiguratio Sice the output of a simulatio is radom, the comparig differet systems via simulatio should
Vladimir N. Burkov, Dmitri A. Novikov MODELS AND METHODS OF MULTIPROJECTS MANAGEMENT
Keywords: project maagemet, resource allocatio, etwork plaig Vladimir N Burkov, Dmitri A Novikov MODELS AND METHODS OF MULTIPROJECTS MANAGEMENT The paper deals with the problems of resource allocatio betwee
CHAPTER 7: Central Limit Theorem: CLT for Averages (Means)
CHAPTER 7: Cetral Limit Theorem: CLT for Averages (Meas) X = the umber obtaied whe rollig oe six sided die oce. If we roll a six sided die oce, the mea of the probability distributio is X P(X = x) Simulatio:
The Stable Marriage Problem
The Stable Marriage Problem William Hut Lae Departmet of Computer Sciece ad Electrical Egieerig, West Virgiia Uiversity, Morgatow, WV [email protected] 1 Itroductio Imagie you are a matchmaker,
THE REGRESSION MODEL IN MATRIX FORM. For simple linear regression, meaning one predictor, the model is. for i = 1, 2, 3,, n
We will cosider the liear regressio model i matrix form. For simple liear regressio, meaig oe predictor, the model is i = + x i + ε i for i =,,,, This model icludes the assumptio that the ε i s are a sample
The following example will help us understand The Sampling Distribution of the Mean. C1 C2 C3 C4 C5 50 miles 84 miles 38 miles 120 miles 48 miles
The followig eample will help us uderstad The Samplig Distributio of the Mea Review: The populatio is the etire collectio of all idividuals or objects of iterest The sample is the portio of the populatio
A Constant-Factor Approximation Algorithm for the Link Building Problem
A Costat-Factor Approximatio Algorithm for the Lik Buildig Problem Marti Olse 1, Aastasios Viglas 2, ad Ilia Zvedeiouk 2 1 Ceter for Iovatio ad Busiess Developmet, Istitute of Busiess ad Techology, Aarhus
CS103X: Discrete Structures Homework 4 Solutions
CS103X: Discrete Structures Homewor 4 Solutios Due February 22, 2008 Exercise 1 10 poits. Silico Valley questios: a How may possible six-figure salaries i whole dollar amouts are there that cotai at least
SAMPLE QUESTIONS FOR FINAL EXAM. (1) (2) (3) (4) Find the following using the definition of the Riemann integral: (2x + 1)dx
SAMPLE QUESTIONS FOR FINAL EXAM REAL ANALYSIS I FALL 006 3 4 Fid the followig usig the defiitio of the Riema itegral: a 0 x + dx 3 Cosider the partitio P x 0 3, x 3 +, x 3 +,......, x 3 3 + 3 of the iterval
Project Deliverables. CS 361, Lecture 28. Outline. Project Deliverables. Administrative. Project Comments
Project Deliverables CS 361, Lecture 28 Jared Saia Uiversity of New Mexico Each Group should tur i oe group project cosistig of: About 6-12 pages of text (ca be loger with appedix) 6-12 figures (please
Overview of some probability distributions.
Lecture Overview of some probability distributios. I this lecture we will review several commo distributios that will be used ofte throughtout the class. Each distributio is usually described by its probability
Center, Spread, and Shape in Inference: Claims, Caveats, and Insights
Ceter, Spread, ad Shape i Iferece: Claims, Caveats, ad Isights Dr. Nacy Pfeig (Uiversity of Pittsburgh) AMATYC November 2008 Prelimiary Activities 1. I would like to produce a iterval estimate for the
3 Basic Definitions of Probability Theory
3 Basic Defiitios of Probability Theory 3defprob.tex: Feb 10, 2003 Classical probability Frequecy probability axiomatic probability Historical developemet: Classical Frequecy Axiomatic The Axiomatic defiitio
PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY AN ALTERNATIVE MODEL FOR BONUS-MALUS SYSTEM
PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY Physical ad Mathematical Scieces 2015, 1, p. 15 19 M a t h e m a t i c s AN ALTERNATIVE MODEL FOR BONUS-MALUS SYSTEM A. G. GULYAN Chair of Actuarial Mathematics
Factors of sums of powers of binomial coefficients
ACTA ARITHMETICA LXXXVI.1 (1998) Factors of sums of powers of biomial coefficiets by Neil J. Cali (Clemso, S.C.) Dedicated to the memory of Paul Erdős 1. Itroductio. It is well ow that if ( ) a f,a = the
Soving Recurrence Relations
Sovig Recurrece Relatios Part 1. Homogeeous liear 2d degree relatios with costat coefficiets. Cosider the recurrece relatio ( ) T () + at ( 1) + bt ( 2) = 0 This is called a homogeeous liear 2d degree
Approximating Area under a curve with rectangles. To find the area under a curve we approximate the area using rectangles and then use limits to find
1.8 Approximatig Area uder a curve with rectagles 1.6 To fid the area uder a curve we approximate the area usig rectagles ad the use limits to fid 1.4 the area. Example 1 Suppose we wat to estimate 1.
A Recursive Formula for Moments of a Binomial Distribution
A Recursive Formula for Momets of a Biomial Distributio Árpád Béyi beyi@mathumassedu, Uiversity of Massachusetts, Amherst, MA 01003 ad Saverio M Maago smmaago@psavymil Naval Postgraduate School, Moterey,
NATIONAL SENIOR CERTIFICATE GRADE 12
NATIONAL SENIOR CERTIFICATE GRADE MATHEMATICS P EXEMPLAR 04 MARKS: 50 TIME: 3 hours This questio paper cosists of 8 pages ad iformatio sheet. Please tur over Mathematics/P DBE/04 NSC Grade Eemplar INSTRUCTIONS
UC Berkeley Department of Electrical Engineering and Computer Science. EE 126: Probablity and Random Processes. Solutions 9 Spring 2006
Exam format UC Bereley Departmet of Electrical Egieerig ad Computer Sciece EE 6: Probablity ad Radom Processes Solutios 9 Sprig 006 The secod midterm will be held o Wedesday May 7; CHECK the fial exam
SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES
SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES Read Sectio 1.5 (pages 5 9) Overview I Sectio 1.5 we lear to work with summatio otatio ad formulas. We will also itroduce a brief overview of sequeces,
Ekkehart Schlicht: Economic Surplus and Derived Demand
Ekkehart Schlicht: Ecoomic Surplus ad Derived Demad Muich Discussio Paper No. 2006-17 Departmet of Ecoomics Uiversity of Muich Volkswirtschaftliche Fakultät Ludwig-Maximilias-Uiversität Müche Olie at http://epub.ub.ui-mueche.de/940/
Universal coding for classes of sources
Coexios module: m46228 Uiversal codig for classes of sources Dever Greee This work is produced by The Coexios Project ad licesed uder the Creative Commos Attributio Licese We have discussed several parametric
Class Meeting # 16: The Fourier Transform on R n
MATH 18.152 COUSE NOTES - CLASS MEETING # 16 18.152 Itroductio to PDEs, Fall 2011 Professor: Jared Speck Class Meetig # 16: The Fourier Trasform o 1. Itroductio to the Fourier Trasform Earlier i the course,
How To Solve The Homewor Problem Beautifully
Egieerig 33 eautiful Homewor et 3 of 7 Kuszmar roblem.5.5 large departmet store sells sport shirts i three sizes small, medium, ad large, three patters plaid, prit, ad stripe, ad two sleeve legths log
where: T = number of years of cash flow in investment's life n = the year in which the cash flow X n i = IRR = the internal rate of return
EVALUATING ALTERNATIVE CAPITAL INVESTMENT PROGRAMS By Ke D. Duft, Extesio Ecoomist I the March 98 issue of this publicatio we reviewed the procedure by which a capital ivestmet project was assessed. The
Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed.
This documet was writte ad copyrighted by Paul Dawkis. Use of this documet ad its olie versio is govered by the Terms ad Coditios of Use located at http://tutorial.math.lamar.edu/terms.asp. The olie versio
Solutions to Selected Problems In: Pattern Classification by Duda, Hart, Stork
Solutios to Selected Problems I: Patter Classificatio by Duda, Hart, Stork Joh L. Weatherwax February 4, 008 Problem Solutios Chapter Bayesia Decisio Theory Problem radomized rules Part a: Let Rx be the
Example 2 Find the square root of 0. The only square root of 0 is 0 (since 0 is not positive or negative, so those choices don t exist here).
BEGINNING ALGEBRA Roots ad Radicals (revised summer, 00 Olso) Packet to Supplemet the Curret Textbook - Part Review of Square Roots & Irratioals (This portio ca be ay time before Part ad should mostly
Lecture 3. denote the orthogonal complement of S k. Then. 1 x S k. n. 2 x T Ax = ( ) λ x. with x = 1, we have. i = λ k x 2 = λ k.
18.409 A Algorithmist s Toolkit September 17, 009 Lecture 3 Lecturer: Joatha Keler Scribe: Adre Wibisoo 1 Outlie Today s lecture covers three mai parts: Courat-Fischer formula ad Rayleigh quotiets The
Measures of Spread and Boxplots Discrete Math, Section 9.4
Measures of Spread ad Boxplots Discrete Math, Sectio 9.4 We start with a example: Example 1: Comparig Mea ad Media Compute the mea ad media of each data set: S 1 = {4, 6, 8, 10, 1, 14, 16} S = {4, 7, 9,
Notes on exponential generating functions and structures.
Notes o expoetial geeratig fuctios ad structures. 1. The cocept of a structure. Cosider the followig coutig problems: (1) to fid for each the umber of partitios of a -elemet set, (2) to fid for each the
Chapter 5 O A Cojecture Of Erdíos Proceedigs NCUR VIII è1994è, Vol II, pp 794í798 Jeærey F Gold Departmet of Mathematics, Departmet of Physics Uiversity of Utah Do H Tucker Departmet of Mathematics Uiversity
The analysis of the Cournot oligopoly model considering the subjective motive in the strategy selection
The aalysis of the Courot oligopoly model cosiderig the subjective motive i the strategy selectio Shigehito Furuyama Teruhisa Nakai Departmet of Systems Maagemet Egieerig Faculty of Egieerig Kasai Uiversity
INFINITE SERIES KEITH CONRAD
INFINITE SERIES KEITH CONRAD. Itroductio The two basic cocepts of calculus, differetiatio ad itegratio, are defied i terms of limits (Newto quotiets ad Riema sums). I additio to these is a third fudametal
Section 11.3: The Integral Test
Sectio.3: The Itegral Test Most of the series we have looked at have either diverged or have coverged ad we have bee able to fid what they coverge to. I geeral however, the problem is much more difficult
