OUT-OF-BAG ESTIMATION

Leo Breiman*
Statistics Department
University of California
Berkeley, CA


Abstract

In bagging, predictors are constructed using bootstrap samples from the training set and then aggregated to form a bagged predictor. Each bootstrap sample leaves out about 37% of the examples. These left-out examples can be used to form accurate estimates of important quantities. For instance, they can be used to give much improved estimates of node probabilities and node error rates in decision trees. Using estimated outputs instead of the observed outputs improves accuracy in regression trees. They can also be used to give nearly optimal estimates of generalization errors for bagged predictors.

* Partially supported by NSF Grant

1. Introduction

We assume that there is a training set T = {(yn,xn), n = 1,...,N} and a method for constructing a predictor Q(x,T) using the given training set. The output variable y can either be a class label (classification) or numerical (regression). In bagging (Breiman [1996a]) a sequence of training sets T1,B,...,TK,B of the same size as T is generated by bootstrap selection from T. Then K predictors are constructed such that the kth predictor Q(x,Tk,B) is based on the kth bootstrap training set. It was shown that if these predictors are aggregated (by averaging in regression or voting in classification), then the resultant predictor can be considerably more accurate than the original predictor. Accuracy is increased if the prediction method is unstable, i.e. if small changes in the training set or in the parameters used in construction can result in large changes in the resulting predictor. The examples given in Breiman [1996a] were based on trees and subset selection in regression, but it is known that neural nets are also unstable, as are other well-known prediction methods. Other methods, such as nearest neighbors, are stable. It turns out that bagging, besides its primary purpose of increasing accuracy, has valuable by-products. Roughly 37% of the examples in the training set T do not appear in a particular bootstrap training set Tk,B.
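As a check on the 37% figure: the chance that a given example is missed by all N draws of a bootstrap sample is (1 - 1/N)^N, which tends to 1/e, about 0.368. A minimal simulation (my own illustration, not from the paper) confirms this:

```python
import random

def out_of_bag_fraction(n, trials=200, seed=0):
    """Average fraction of examples missing from a bootstrap sample of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        in_bag = {rng.randrange(n) for _ in range(n)}  # bootstrap: n draws with replacement
        total += 1.0 - len(in_bag) / n                 # fraction left out of the bag
    return total / trials

frac = out_of_bag_fraction(1000)
# (1 - 1/n)^n tends to 1/e, so frac is close to 0.368, the "about 37%" in the text
```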
Thus, to the predictor Q(x,Tk,B) these examples are unused test examples. If K = 100, each particular example (y,x) in the training set has about 37 predictions among the Q(x,Tk,B) such that Tk,B does not contain (y,x). The predictions for examples that are "out-of-the-bag" can be used to form accurate estimates of important quantities. For example, in classification, the out-of-bag predictions can be used to estimate the probabilities that the example belongs to any one of the J possible classes. Applied to CART, this gives a method for estimating node probabilities more accurately than anything available to date. Applied to regression trees, we get an improved method for estimating the expected error in a node prediction. In regression, using the out-of-bag estimated values for the outputs instead of the actual training set outputs gives more accurate trees. Simple and accurate
out-of-bag estimates can be given for the generalization error of bagged predictors. Unlike cross-validation, these require no additional computing.

In this paper, we first look at estimates of node class probabilities in CART (Section 2), and then at estimates of mean-squared node errors in the regression version of CART (Section 3). Section 4 looks at how much accuracy is lost by using estimates that are averaged over terminal nodes. Indications from synthetic data are that the averaging can account for a major component of the error. Section 5 gives some theoretical justification for the accuracy of out-of-bag estimates in terms of a pointwise bias-variance decomposition. Section 6 gives the effect of constructing regression trees using the out-of-bag output estimates. In Section 7, the out-of-bag estimates of generalization error for bagged predictors are defined and studied. The Appendix gives some statistical details.

The present work came from two stimuli. One was the dissatisfaction, over many years, but growing stronger more recently, with the biased node class probability and error estimates in CART. The other consisted of two papers. One, by Tibshirani [1996], proposes an out-of-bag estimate as part of a method for estimating generalization error for any classifier. The second, by Wolpert and Macready [1996], looks at a number of methods for estimating generalization error for bagged regressions; among them is a method using out-of-bag predictions equivalent to the method we give in Section 7.

2. Estimating node class probabilities

Assume that the training set T consists of independent draws from a Y,X distribution, where Y is a J-class output label and X is a multivariate input vector. Define p*(x) as the probability vector with components p*(j|x) = P(Y=j|X=x). Most classification algorithms use the training set T to construct a probability predictor pR(x,T) that outputs a nonnegative, sum-one J-vector pR(x,T) = (pR(1|x),...,pR(J|x)) and then classify x as that class for which pR(j|x) is maximum. In many applications, the components of pR as well as the classification are important.
For instance, in medical survival analysis, estimates of the survival probability are important. In some construction methods, the resubstitution values pR(x) are intrinsically biased estimates of p*(x). This is true of methods like trees or neural nets, where the optimization over T tries to drive all components of pR to zero except for a single component that goes to one. The resulting vectors pR are poor estimates of the true class probabilities p*. With trees, the pR estimates are constant over each terminal node and are given by the proportion of class j examples in the terminal node. In Breiman et al. [1984] two methods were proposed to improve the estimates. In his thesis, Walker [1992] showed that the first of the two methods worked reasonably well on some synthetic data. However, the method only estimates max_j p*(j|t), and the results are difficult to compare with those given below.

To define the problem better, here is the target: assume again that the training set T consists of independent draws from a Y,X distribution. Assume that a tree C(x,T) with terminal nodes {t} has already been constructed. For a given terminal node t, define p*(j|t) = P(Y=j|X in t). The vector p*(t) is what we want to estimate. The resubstitution probability estimate pR(x,T) is constant over each terminal node t and consists of the relative class proportions of the training set examples in t. The out-of-bag estimate pB is gotten as follows: draw 100 bootstrap replicates of T, getting T1,B,...,T100,B. For each k, build the tree classifier C(x,Tk,B). For each (y,x) in T, define pB(x) as the average of the pR(x,Tk,B) over all k such that (y,x) is not in Tk,B. Then for any terminal node t, let pB(t) be the average of pB(x) over all x in t.
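The mechanics of pB(x) can be sketched as follows. To keep the sketch self-contained, the "tree" grown on each bootstrap sample is a single root node that predicts the sample's class proportions; this stand-in for CART, and the data below, are my own simplifications, not the paper's:

```python
import random
from collections import Counter

def oob_class_probs(examples, n_classes, n_boot=100, seed=0):
    """Out-of-bag probability estimates p_B(x) for each training example.

    `examples` is a list of (y, x) pairs.  For illustration, the 'tree' grown
    on each bootstrap sample is a single node predicting the sample's class
    proportions; a real implementation would grow a CART tree instead.
    """
    rng = random.Random(seed)
    n = len(examples)
    sums = [[0.0] * n_classes for _ in range(n)]   # running sum of p_R over OOB trees
    counts = [0] * n                               # number of OOB trees per example
    for _ in range(n_boot):
        in_bag = [rng.randrange(n) for _ in range(n)]
        freq = Counter(examples[i][0] for i in in_bag)
        p_r = [freq[j] / n for j in range(n_classes)]   # one-node tree: class proportions
        for i in set(range(n)) - set(in_bag):           # average only over trees not containing i
            for j in range(n_classes):
                sums[i][j] += p_r[j]
            counts[i] += 1
    return [[s / c for s in row] if c else None
            for row, c in zip(sums, counts)]

data = [(0, None)] * 60 + [(1, None)] * 40   # 60% class 0, 40% class 1
p_b = oob_class_probs(data, n_classes=2)
```

The node estimate pB(t) would then be the average of these vectors over the examples in t.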
2.1 Experimental results

We illustrate, by experiment, the improved accuracy of the out-of-bag estimates compared to the resubstitution method using synthetic and real data. For any two J-probabilities p and p', denote

|p - p'| = sum_j |p(j) - p'(j)|
|p - p'|_2 = ( sum_j (p(j) - p'(j))^2 )^(1/2)

Let q*(t) = P(X in t) be the probability that an example falls into the terminal node t. For any estimates {p(t)} of the {p*(t)}, define two error measures:

E1 = sum_t q*(t) |p*(t) - p(t)|
E2 = ( sum_t q*(t) |p*(t) - p(t)|_2^2 )^(1/2)

The difference between the two measures is that large differences are weighted more heavily by E2. To simplify the interpretation, we divide E1 by J and E2 by sqrt(J). Then E1 measures the absolute average error in estimating each component of the probability vector, while E2 measures the corresponding rms average error. We use a test set to estimate q*(t) as the proportion q'(t) of the test set falling into node t, and estimate p*(t) by the proportions of classes p'(t) in those test set examples in t. This lets us estimate the two error measures. With E2, it is possible to derive a correction that adjusts for the error in using p'(t) instead of p*(t). The correction is generally small if the test set is large and is derived in the Appendix.

Synthetic Data

We give results for four sets of synthetic data (see Breiman [1996b] for specific definitions):

Table 1 Synthetic Data Set Summary

Data Set     Classes   Inputs   Training   Test
waveform
twonorm
threenorm
ringnorm

In all cases, there were 50 iterations with the training and test sets generated anew in each iteration and 100 replications in the bagging. The results given are the averages over the 50 iterations:

Table 2 Node Probability Errors

Data Set     EB1   ER1   EB1/ER1   EB2   ER2   EB2/ER2
waveform
twonorm
threenorm
ringnorm

The ratios of errors in the 3rd and 6th columns show that the out-of-bag estimates give significant error reductions.
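Given the node probabilities and the true and estimated class vectors, the two normalized error measures are a direct computation; the dict-based node representation below is my own illustration:

```python
import math

def node_prob_errors(q, p_true, p_est):
    """Normalized E1 and E2 error measures for node probability estimates.

    q[t] is the node probability q*(t); p_true[t] and p_est[t] are the
    J-vectors p*(t) and p(t).  Returns (E1/J, E2/sqrt(J)) as in the text.
    """
    J = len(next(iter(p_true.values())))
    e1 = sum(q[t] * sum(abs(a - b) for a, b in zip(p_true[t], p_est[t]))
             for t in q)
    e2 = math.sqrt(sum(q[t] * sum((a - b) ** 2 for a, b in zip(p_true[t], p_est[t]))
                       for t in q))
    return e1 / J, e2 / math.sqrt(J)

# Two terminal nodes, two classes; the estimate is off by 0.1 in node t1 only.
q = {'t1': 0.5, 't2': 0.5}
e1, e2 = node_prob_errors(q,
                          {'t1': (0.7, 0.3), 't2': (0.4, 0.6)},
                          {'t1': (0.6, 0.4), 't2': (0.4, 0.6)})
```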
Real Data

The data sets we used in this experiment are available in the UCI repository. We used some of the larger data sets to insure test sets large enough to give adequate estimates of q* and p*.

Table 3 Data Set Summary

Data Set        Classes   Inputs   Training   Test
breast cancer
diabetes
vehicle
satellite
dna

The data sets listed in Table 3 consisted of examples whose number was the total of the test and training set numbers. So, for example, the breast cancer data set had 699 examples. The last two data sets listed came pre-separated into test and training sets. For instance, the satellite data came as a 4435-example training set and a 2000-example test set. These were put together to create a single data set with 6435 examples. Note that the training set sizes are 100 per class. There were 50 runs on each data set. In each run, the data set was randomly divided into training and test sets with sizes as listed in Table 3, and 100 bootstrap replicates were generated from each training set. The results, averaged over the 50 runs, are given in Table 4.

Table 4 Node Probability Errors

Data Set        EB1   ER1   EB1/ER1   EB2   ER2   EB2/ER2
breast cancer
diabetes
vehicle
satellite
dna

The results generally show a significant decrease in estimation error when the out-of-bag estimates are used.

3. Estimating node error in regression

Assume here that the training set consists of independent draws from the distribution Y,X, where Y is a numerical output and X a multivariate input. Some methods for constructing a predictor f(x,T) of y using the training set also try to construct an estimate of the average error in the prediction, for instance by giving an estimate of the rms error in the prediction. When these error estimates are based on the training set error, they are often biased toward the low side. In trees, the predicted value f(x,T) is constant over each terminal node t and is equal to the average y(t) of the training set outputs over the node. The within-node error estimate eR(t) for t is computed as the rms error over all examples in the training set falling into t.
However, since the recursive splitting in CART is based on trying to minimize this error measure, it is clearly biased low as an estimate of the true error rate e*(t), defined as:
e*(t) = ( E((Y - y(t))^2 | X in t) )^(1/2)

The out-of-bag estimates are gotten this way: draw 100 bootstrap replicates of T, getting T1,B,...,T100,B. For each k, build the tree predictor f(x,Tk,B). For each (y,x) in T, define sB(x) as the average of (y - f(x,Tk,B))^2 over all k such that (y,x) is not in Tk,B. Then for any node t, define eB(t) as the square root of the average of sB(x) over all x in t.

3.1 Experimental results

For any estimate e(t) of e*(t), define two error measures:

E1 = sum_t q*(t) |e*(t) - e(t)|
E2 = ( sum_t q*(t) (e*(t) - e(t))^2 )^(1/2)

To illustrate the improved accuracy of the out-of-bag estimates of e*(t), five data sets are used, the same five that were used in Breiman [1996a]. The first three of the data sets are synthetic, the last two real.

Table 5 Data Set Summary

Data Set      Inputs   Training   Test
Friedman #1
Friedman #2
Friedman #3
Ozone
Boston

There were 50 runs of the procedure on each data set, and 100 bootstrap baggings in each run. In the synthetic data sets, the training and test sets were freshly generated for each run. With the real data sets, for each run a different random split into training and test set was used. The values of q* and e* were estimated using the test set. For E2, an adjustment was used to correct for this approximation (see Appendix). The results, averaged over the 50 runs, are given in Table 6. For interpretability, the error measures displayed have been divided by the standard deviation of the combined training and test set.

Table 6 Node Estimation Errors

Data Set      EB1   ER1   EB1/ER1   EB2   ER2   EB2/ER2
Friedman #1
Friedman #2
Friedman #3
Ozone
Boston

Again, there are significant reductions in estimation error.

4. Error due to within-node variability
Suppose that we want to estimate a function of the inputs h*(x). Only if we want to stay in the structure of a single tree, with predictions constant over each terminal node, does it make sense to estimate h*(x) by some estimate of h*(t). The target function h*(x) may have considerable variability over the region defined by a terminal node in a tree. Given a tree with terminal nodes {t} and estimates h(t) of h*(t), denote the corresponding estimate of h* by h(x,T). The squared error in estimating h*, given that the inputs x are drawn from the distribution of the random vector X, can be decomposed as:

E_X (h*(X) - h(X,T))^2 = sum_t E_X((h*(X) - h*(t))^2 | X in t) P(X in t) + sum_t (h*(t) - h(t))^2 P(X in t)

The first term in this decomposition we call the error due to within-node variability; the second is the node estimation error. The relevant question is how large the within-node variability error is compared to the node estimation error. Obviously, this will be problem dependent, but we give some evidence below that it may be a major portion of the error. We look at the problem of estimating the conditional probabilities p*(x). The expression analogous to the above decomposition is:

E_X |p*(X) - p(X,T)|_2^2 = sum_t q*(t) E_X(|p*(X) - p*(t)|_2^2 | X in t) + sum_t q*(t) |p*(t) - p(t)|_2^2     (4.1)

For the four synthetic classification data sets used in Section 2, p*(x) can be evaluated exactly. Therefore, replacing the expectations over X by averages over a large test set (5000 examples), the error EV due to within-node variability (first term in (4.1)) and the error EN due to node estimation (second term in (4.1)) can be evaluated. There are two methods of node estimation: the standard method leads to error ENR and the out-of-bag method to error ENB. Another method of estimating p* is to utilize the sequence of bagged predictors. For each (y,x) in the test set, define pB(x) as the average of the pR(x,Tk,B) over all k. Then defining EB as

EB = E_X |p*(X) - pB(X)|_2^2,

this quantity is also evaluated for the synthetic data by averaging over the test data. The experimental procedure consists of 50 iterations for each synthetic data set.
In each iteration, a 5000-example test set and a 100J-example training set are generated, and the following ratios evaluated:

R1 = 100 ENR/(ENR + EV)
R2 = 100 (ENR - ENB)/(ENR + EV)
R3 = 100 ENB/(ENB + EV)
R4 = 100 EB/EV

Thus, R1 is the percent of the total error due to node estimation when the resubstitution method of estimating p*(t) is used; R2 is the percent reduction in the total error when the bagging estimate pB(t) is used. When pB(t) is used, R3 is the percent of the total error due to node estimation. Finally, R4 is 100 times the ratio of the error using the pointwise bagging estimate of p* to the error using estimates constant over the terminal nodes but using the optimal node estimate p*(t). Table 5 gives the results.
Table 5 Error Ratios (%) in Estimating Class Probabilities

Data Set     R1   R2   R3   R4
waveform
twonorm
threenorm
ringnorm

For these synthetic data sets, the values of R1 show that within-node variability accounts for about two-thirds of the error. The second column (R2) shows that use of the bagging estimate pB(t) eliminates most of the error due to node estimation, but that the reduction is relatively modest because of the strong contribution of within-node variability. The results for R3 show that when pB(t) is used, only about 10% of the total error is due to node estimation, so that we are close to the limit of what can be accomplished using estimates of p* constant over nodes. The final column gives the good news that using the pointwise bagging estimates pB(x) gives about a 50% reduction compared to the best possible node estimate. It is a bit disconcerting to see how much accuracy is lost by the averaging of estimates over nodes. Smyth et al. [1996] avoid this averaging by using a kernel density method to estimate variable within-node densities. However, pointwise bagging estimates may give comparable or better results. Care must be taken in generalizing from these synthetic data sets, as I suspect they may have more within-node variability than typical real data sets.

5. Why it works: the pointwise bias-variance decomposition

Suppose there is some underlying function h*(x) that we want to estimate using the training set T, and that we have some method h(x,T) for estimating h*(x). Then for x fixed we can write, using E_T to denote expectation over replicate training sets of the same size drawn from the same distribution,

E_T (h*(x) - h(x,T))^2 = (h*(x) - E_T h(x,T))^2 + E_T (h(x,T) - E_T h(x,T))^2.

This is a pointwise-in-x version of the now familiar bias-variance decomposition. The interesting thing it shows is that at each point x, E_T h(x,T) has lower squared error than does h(x,T): it has zero variance but the same bias. That is, averaging h(x,T) over replicate training sets improves performance at each individual value of x. Bagging tries to get an estimate of E_T h(x,T) by averaging over the values of h(x,Tk,B).
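A quick Monte Carlo check of the pointwise decomposition (my own illustration, not the paper's): the "method" is simply the sample mean of n noisy observations of a fixed target value, standing in for an unstable predictor evaluated at a fixed x.

```python
import random

def mc_decomposition(theta=2.0, n=25, reps=2000, seed=0):
    """Monte Carlo check of E_T(h* - h)^2 = bias^2 + variance at a fixed x.

    h(x,T) is the mean of n noisy observations of the target value theta,
    recomputed over `reps` replicate training sets.
    """
    rng = random.Random(seed)
    hs = [sum(rng.gauss(theta, 1.0) for _ in range(n)) / n for _ in range(reps)]
    mean_h = sum(hs) / reps                            # estimate of E_T h(x,T)
    mse = sum((theta - h) ** 2 for h in hs) / reps     # E_T (h* - h)^2
    bias_sq = (theta - mean_h) ** 2
    var = sum((h - mean_h) ** 2 for h in hs) / reps
    return mse, bias_sq, var

mse, bias_sq, var = mc_decomposition()
# mse equals bias_sq + var; the aggregated predictor E_T h keeps the bias_sq
# term but drops the var term entirely, which is why bagging helps.
```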
Now E_T h(x,T) is computed assuming x is held fixed and T is chosen in a way that does not depend on x. But if x is in the training set, then the Tk,B often contain x, violating the assumption. A better imitation of E_T h(x,T) would be to leave x out of the training set and do bagging on the deleted training set. But this is exactly what out-of-bag estimation does, resulting in more accurate estimates of h*(x) at every example in the training set. When these are averaged over any terminal node t, more accurate estimates hB(t) of h*(t) = E(h*(X)|X in t) result.

6. Trees using out-of-bag output estimates

In regression, for (y,x) an example in the training set, define the out-of-bag estimate yB for the output y to be the average of f(x,Tk,B) over all k such that x is not in Tk,B. The out-of-bag output estimates will generally be less noisy than the original outputs. This suggests the
possibility of growing a tree using the yB as the output values for the training set. We did this using the data sets described in Section 3, and followed the procedure in Breiman [1996a]. With the real data sets, we randomly subdivided them so that 90% served as the training set and 10% as a test set. A tree was grown and pruned using the original training set, and the 10% test set was used to get a mean-squared error estimate. Then we did 100 bootstrap iterations and computed the yB. Finally, a single tree was grown using the yB as outputs, and its error was measured using the 10% test set. The random subdivision was repeated 50 times and the test set errors averaged. With the three synthetic data sets, a training set of 200 and a test set of 2000 were freshly generated in each of the 50 runs. The results are given in Table 6.

Table 6 Mean Square Test Set Error

Data Set        Error, Original Outputs   Error, OB Outputs
Friedman #1
Friedman #2*
Friedman #3**
Boston
Ozone
(* entries x1000, ** entries /1000)

These decreases are not as dramatic as those given by bagging. On the other hand, they involve prediction by a single tree, generally of about the same size as those grown on the original training set. If the desire is to increase accuracy while retaining interpretability, then using the out-of-bag outputs does quite well. The story in classification is that using the out-of-bag output estimates gives very little improvement in accuracy and can actually result in less accurate trees. The out-of-bag output estimates consist of the probability vectors pB(x). CART was modified to accept probability vectors as outputs in tree construction, and a procedure similar to that used in regression was tried on a number of data sets, with disappointing results. The problem is twofold. First, while the probability vector estimates may be more accurate, the classification depends only on the location of the maximum component. Second, for data sets with substantial misclassification rates, the out-of-bag estimates may produce more distortion than the original class labels.

7. Out-of-bag estimates for bagged predictors
In this section, we reinforce the work by Tibshirani [1996] and Wolpert and Macready [1996], both of whom proposed using out-of-bag estimates as an ingredient in estimates of generalization error. Wolpert and Macready worked on regression-type problems and proposed a number of methods for estimating the generalization error of bagged predictors. The method they found that gave best performance is a special case of the method we propose. Tibshirani used out-of-bag estimates of variance to estimate generalization error for arbitrary classifiers. We explore estimates of generalization error for bagged predictors. For classification, our results are new. As Wolpert and Macready point out in their paper, cross-validating bagged predictors may lead to large computing efforts. The out-of-bag estimates are efficient in that they can be computed in the same run that constructs the bagged predictor, with little additional effort. Our experiments below also give evidence that these estimates are close to optimal. Suppose again that we have a training set T consisting of examples with an output variable y, which can be a multidimensional vector with numerical or categorical coordinates, and corresponding input x. A method is used to construct a predictor f(x,T), and a given loss function
L(y,f) measures the error in predicting y by f. Form bootstrap training sets Tk,B, predictors f(x,Tk,B), and aggregate these predictors in an appropriate way to form the bagged predictor fB(x). For each (y,x) in the training set, aggregate the predictors only over those k for which Tk,B does not contain (y,x). Denote these out-of-bag predictors by fOB. Then the out-of-bag estimate for the generalization error is the average of L(y,fOB(x)) over all examples in the training set. Denote by eTS the test set error estimate and by eOB the out-of-bag error estimate. In all of the runs in Sections 2 and 3, we also accumulated the averages of eTS, eOB, and |eTS - eOB|. We can also compute the expected value of |eTS - eOB| under the assumption that eOB is computed using a test set of the same size as the training set and independent of the actual test set used (see Appendix). We claim that this expected value is a lower bound for how well we could use the training set to estimate the generalization error of the bagged predictor. Our reasoning is this: given that the training set is used to construct the predictor, the most accurate estimate of its error is given by a test set independent of the training set. If we used a test set of the same size as the training set, this is as well as can be done using this number of examples. Therefore, we can judge the efficiency of any generalization error estimate based on the training set by comparing its accuracy to the estimate we would get using a test set of the size of the training set. Table 7 contains the results for the classification runs for both the real and synthetic data. The last column is the ratio of the experimentally observed |eTS - eOB| to the expected value if the training set were an independent test set. The closer this ratio is to one, the closer to optimal eOB is. Table 8 gives the corresponding results for regression.
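For classification with voting, the out-of-bag error estimate can be sketched as below. The toy base learner (a threshold stump on a single input) and the synthetic two-class data are my own illustration, not the paper's:

```python
import random

def stump_fit(sample):
    """Toy base learner: threshold halfway between the two class means."""
    xs0 = [x for y, x in sample if y == 0]
    xs1 = [x for y, x in sample if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2.0

def oob_error(train, n_boot=100, seed=0):
    """Out-of-bag estimate of a bagged classifier's misclassification rate.

    For each example, votes are aggregated only over the bootstrap
    classifiers whose training sample did not contain that example.
    """
    rng = random.Random(seed)
    n = len(train)
    votes = [[0, 0] for _ in range(n)]              # OOB votes per example
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        thresh = stump_fit([train[i] for i in idx])
        for i in set(range(n)) - set(idx):          # out-of-bag for this classifier
            pred = 1 if train[i][1] > thresh else 0
            votes[i][pred] += 1
    wrong = sum(1 for (y, x), v in zip(train, votes)
                if (v[1] > v[0]) != (y == 1))
    return wrong / n

rng = random.Random(2)
train = [(0, rng.gauss(-1.0, 1.0)) for _ in range(100)] + \
        [(1, rng.gauss(+1.0, 1.0)) for _ in range(100)]
e_ob = oob_error(train)   # roughly the Bayes rate for these overlapping classes
```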
Table 7 Estimates of Generalization Error (% Misclassification)

Data Set        Av eTS   Av eOB   Av |eTS - eOB|   Ratio
waveform
twonorm
threenorm
ringnorm
breast cancer
diabetes
vehicle
satellite
dna

Table 8 Estimates of Generalization Error (Mean Squared Error)

Data Set        Av eTS   Av eOB   Av |eTS - eOB|   Ratio
Friedman #1
Friedman #2*    24.7     23.6
Friedman #3**   32.8     30.5     6.7              1.06
Boston
Ozone
(* entries x1000, ** entries /1000)

Tables 7 and 8 show that the out-of-bag estimates are remarkably accurate. On the whole, the ratio values are close to one, reflecting the accuracy of the out-of-bag estimates of the generalization error of the bagged predictors. In classification, the out-of-bag estimates
appear almost unbiased, i.e. the average of eOB is almost equal to the average of eTS. But the estimates in regression may be systematically low. The two ratio values slightly less than one in Table 8 we attribute to random fluctuations. The denominator in the ratio column depends on a parameter which has to be estimated from the data. Error in this parameter estimate may drive the ratio low (see Appendix for details).

References

Breiman, L. [1996a] Bagging Predictors, Machine Learning 26, No. 2.
Breiman, L. [1996b] Bias, Variance, and Arcing Classifiers, submitted to Annals of Statistics. ftp://ftp.stat.berkeley.edu/pub/breiman/arcall.ps
Breiman, L., Friedman, J., Olshen, R., and Stone, C. [1984] Classification and Regression Trees, Wadsworth.
Smyth, P., Gray, A., and Fayyad, U. [1996] Retrofitting Tree Classifiers Using Kernel Density Estimation, Proceedings of the 1995 Conference on Machine Learning, San Francisco, CA: Morgan Kaufmann.
Tibshirani, R. [1996] Bias, Variance, and Prediction Error for Classification Rules, Technical Report, Statistics Department, University of Toronto.
Walker, Michael G. [1992] Probability Estimation for Classification Trees and Sequence Analysis, Ph.D. Dissertation, Departments of Computer Science and Medicine, Stanford University, Stanford, CA.
Wolpert, D.H. and Macready, W.G. [1996] An Efficient Method to Estimate Bagging's Generalization Error.

Appendix

I. Adjustment to E2 in classification

With moderate-sized test sets we can get, at best, only noisy estimates of p*. Using these estimates to compute error rates may lead to biases in the results. However, a simple adjustment is possible in the E2 error criterion. For {p(t)} probability estimates in the terminal nodes depending only on the examples in T, E2 is defined by

E2^2 = sum_t q*(t) |p*(t) - p(t)|^2     (2.1)

Let p'(t) be the class proportions of the test set S examples in node t, so that p'(t) is an estimate of p*(t). We assume that the examples in S and T are independent. Let N(t) be the number of test set examples falling into terminal node t. Conditional on N(t), p'(j|t) times N(t) has a binomial distribution B(p*(j|t), N(t)).
Hence, p'(j|t) has expectation p*(j|t) and variance p*(j|t)(1 - p*(j|t))/N(t). Write

|p' - p|^2 = |p* - p|^2 + |p' - p*|^2 + 2(p* - p, p' - p*).

Taking expectations of both sides with respect to the examples in S, holding N(t) constant, gives

N(t) E|p' - p|^2 = N(t) |p* - p|^2 + 1 - |p*|^2     (A.1)

Putting p = 0 in (A.1) gives

N(t) E|p'|^2 = N(t) |p*|^2 + 1 - |p*|^2     (A.2)
Solving (A.1) and (A.2) for N(t)|p* - p|^2 gives

N(t) |p* - p|^2 = N(t) E|p' - p|^2 - (N(t)/(N(t) - 1)) (1 - E|p'|^2)

Thus, we estimate the error measure E2 as:

E2^2 = sum_t q'(t) [ |p'(t) - p(t)|^2 - (1 - |p'(t)|^2)/(N(t) - 1) ]     (A.3)

where N(t) is the number of test set examples in node t, q'(t) is N(t) divided by the total number of test set examples, and we define the second term in the brackets to be zero if N(t) = 1. This second term in (A.3) is the adjustment. In our examples, it had only a small effect on the results.

II. Adjustment to E2 in regression

In regression, E2 is defined as the square root of R = sum_t q*(t) (e*(t) - e(t))^2. Since e*(t) is unknown, we estimate it as the square root of the average of (y - y(t))^2 over all test set (y,x) falling in t, and denote this estimate by e'(t). Let R' = sum_t q*(t) (e'(t) - e(t))^2. Then we want to adjust R' so that it is an unbiased estimate of R, i.e. so that ER' = R when the expectation is taken over the test set examples. To simplify this computation, we assume that:

i) the test set output values in t are normally distributed with true mean y*(t);
ii) sum_t q*(t) |y*(t) - y(t)| is small.

Now, e'(t)^2 = sum_n (y_n - y(t))^2 / N(t), where the sum is over all test set (y,x) falling into t. By the Central Limit Theorem, e'(t)^2 = e*(t)^2 + Z/sqrt(N(t)), where Z is approximately normally distributed with mean zero and variance equal to the variance of (Y - y(t))^2 conditional on X in t. It follows that E e'(t)^2 = e*(t)^2 and, to first order in 1/N(t),

E e'(t) = e*(t) - EZ^2/(8 e*(t)^3 N(t)).

Recall that for a normally distributed random variable U with variance sigma^2, the variance of U^2 is 2 sigma^4. Using assumption i) gives EZ^2 = 2 (var(Y - y(t)))^2. By assumption ii), the variance of Y - y(t) on t can be approximated by e*(t)^2. Thus,

E e'(t) = e*(t)[1 - .25/N(t)]     (A.4)
Write

(e'(t) - e(t))^2 = (e'(t) - e*(t))^2 + (e*(t) - e(t))^2 + 2 (e'(t) - e*(t))(e*(t) - e(t)).     (A.5)

Taking the expectation of (A.5) with respect to the test set and using (A.4) gives

E(e'(t) - e(t))^2 = E(e*(t) - e(t))^2 + e*(t) e(t)/(2 N(t))

Approximating e*(t) e(t) by e'(t)^2 gives the adjusted measure:

E2^2 = sum_t q'(t) [ (e'(t) - e(t))^2 - .5 e'(t)^2/N(t) ]

Again, the adjustment contributes a relatively small correction in our runs.

III. Lower bound for training set accuracy

Suppose we have a classifier Q(x) with true generalization misclassification rate e*. That is, for the distribution of the random vector Y,X, e* = P(Y != Q(X)). Two sets of data T' = {(y'n,x'n), n = 1,...,N'} and T = {(yn,xn), n = 1,...,N} are independently drawn from the underlying distribution of Y,X and run through the classifier. The first has classification error rate e' and the second e. We evaluate g(N',N) = E|e' - e|. In the context of the experiments on out-of-bag estimation, the first set of data is the given test set. The second set is the training set. The training set is used both to form the bagged classifier and the out-of-bag estimate of the generalization error. Suppose we had an untouched test set of the same size as the training set and used this new test set to estimate the generalization error. Certainly, we would do better than with any way of using the training set over again to do the same thing. Thus g(N',N) is a lower bound for the accuracy of generalization error estimates using a training set of N examples, when it is being compared to a test set of N' examples. Now, e is given by

e = sum_n V_n / N     (A.6)

where V_n is one if the nth case in T is misclassified and zero otherwise. By the Central Limit Theorem, e = e* + Z/sqrt(N), where Z is approximately normal with mean zero and variance e*(1 - e*). Similarly, e' = e* + Z'/sqrt(N'), where Z' is approximately normal with mean zero, also with variance e*(1 - e*), and independent of Z. So e' - e is normal with mean zero and variance s = e*(1 - e*)[(1/N) + (1/N')]. The expectation E|U| of a mean-zero normal variable U with variance s is sqrt(2s/pi), and it was this last expression that was used as a comparison in Section 7, with s estimated using e' in place of e*.
In regression, the predictor is a numerical-valued function f(x). The true generalization error is e* = E(Y - f(X))^2. Expression (A.6) holds with V_n = (y_n - f(x_n))^2. Again, e = e* + Z/sqrt(N), where Z is approximately normal with mean zero and variance c, with c the variance of (Y - f(X))^2. Repeating the argument above, e' - e is a normal variable with mean zero and variance
s = c[(1/N) + (1/N')], so E|e' - e| equals sqrt(2s/pi). The variance of (Y - f(X))^2 is given by E(Y - f(X))^4 - (E(Y - f(X))^2)^2. This value is approximated using the corresponding moments over the test set, leading to the evaluation of the lower bound.
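In the classification case, the lower bound g(N',N) reduces to a one-line formula once s is estimated; the error rate and sample sizes below are illustrative, not from the paper:

```python
import math

def lower_bound(err_test, n_train, n_test):
    """Expected |e_TS - e_OB| if the OOB estimate behaved like an independent
    test set of training-set size: E|U| = sqrt(2 s / pi) for U ~ N(0, s),
    with s = e(1 - e)(1/N + 1/N') and e estimated by the test set error."""
    s = err_test * (1.0 - err_test) * (1.0 / n_train + 1.0 / n_test)
    return math.sqrt(2.0 * s / math.pi)

# e.g. a 10% test set error with 300 training and 5000 test examples
g = lower_bound(0.10, 300, 5000)
```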
journal of compuer and sysem scences 55, 119139 (1997) arcle no. SS971504 A Decsonheorec Generalzaon of OnLne Learnng and an Applcaon o Boosng* Yoav Freund and Rober E. Schapre  A6 Labs, 180 Park Avenue,
More informationThe effect of the increase in the monetary base on Japan s economy at zero interest rates: an empirical analysis 1
The effec of he icrease i he moeary base o Japa s ecoomy a zero ieres raes: a empirical aalysis 1 Takeshi Kimura, Hiroshi Kobayashi, Ju Muraaga ad Hiroshi Ugai, 2 Bak of Japa Absrac I his paper, we quaify
More informationProbability Estimates for Multiclass Classification by Pairwise Coupling
Journal of Machine Learning Research 5 (2004) 975005 Submitted /03; Revised 05/04; Published 8/04 Probability Estimates for Multiclass Classification by Pairwise Coupling TingFan Wu ChihJen Lin Department
More informationA New Approach to Linear Filtering and Prediction Problems 1
R. E. KALMAN Research Insue for Advanced Sudy, Balmore, Md. A New Approach o Lnear Flerng and Predcon Problems The classcal flerng and predcon problem s reexamned usng he Bode Shannon represenaon of
More informationHow to Use Expert Advice
NICOLÒ CESABIANCHI Università di Milano, Milan, Italy YOAV FREUND AT&T Labs, Florham Park, New Jersey DAVID HAUSSLER AND DAVID P. HELMBOLD University of California, Santa Cruz, Santa Cruz, California
More informationUsing Kalman Filter to Extract and Test for Common Stochastic Trends 1
Usig Kalma Filer o Exrac ad Tes for Commo Sochasic Treds Yoosoo Chag 2, Bibo Jiag 3 ad Joo Y. Park 4 Absrac This paper cosiders a sae space model wih iegraed lae variables. The model provides a effecive
More informationGetting the Most Out of Ensemble Selection
Getting the Most Out of Ensemble Selection Rich Caruana, Art Munson, Alexandru NiculescuMizil Department of Computer Science Cornell University Technical Report 20062045 {caruana, mmunson, alexn} @cs.cornell.edu
More informationVery Simple Classification Rules Perform Well on Most Commonly Used Datasets
Very Simple Classification Rules Perform Well on Most Commonly Used Datasets Robert C. Holte (holte@csi.uottawa.ca) Computer Science Department, University of Ottawa, Ottawa, Canada K1N 6N5 The classification
More informationThe Optimality of Naive Bayes
The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most
More informationIntroduction to. Hypothesis Testing CHAPTER LEARNING OBJECTIVES. 1 Identify the four steps of hypothesis testing.
Introduction to Hypothesis Testing CHAPTER 8 LEARNING OBJECTIVES After reading this chapter, you should be able to: 1 Identify the four steps of hypothesis testing. 2 Define null hypothesis, alternative
More informationVIPer12ADIP VIPer12AS
VIPer12ADIP VIPer12AS LOW POWER OFF LINE SMPS PRIMARY SWITCHER TYPICAL POWER CAPABILITY Mains ype SO8 DIP8 European (195265 Vac) 8 W 13 W US / Wide range (85265 Vac) 5 W 8 W n FIXED 60 KHZ SWITCHING
More informationMisunderstandings between experimentalists and observationalists about causal inference
J. R. Statist. Soc. A (2008) 171, Part 2, pp. 481 502 Misunderstandings between experimentalists and observationalists about causal inference Kosuke Imai, Princeton University, USA Gary King Harvard University,
More information