Stéphane Boucheron 1, Olivier Bousquet 2 and Gábor Lugosi 3


ESAIM: Probability and Statistics
URL: Will be set by the publisher

THEORY OF CLASSIFICATION: A SURVEY OF SOME RECENT ADVANCES

Stéphane Boucheron 1, Olivier Bousquet 2 and Gábor Lugosi 3

Abstract. The last few years have witnessed important new developments in the theory and practice of pattern classification. We intend to survey some of the main new ideas that have led to these recent results.

Résumé. The practice and theory of pattern recognition have seen important developments in recent years. This survey aims to present some of the new ideas that led to these developments.

1991 Mathematics Subject Classification. 62G08, 60E15, 68Q32.

September 23, 2005.

Contents
1 Introduction
2 Basic model
3 Empirical risk minimization and Rademacher averages
4 Minimizing cost functions: some basic ideas behind boosting and support vector machines
4.1 Margin-based performance bounds
4.2 Convex cost functionals
5 Tighter bounds for empirical risk minimization
5.1 Relative deviations
5.2 Noise and fast rates
5.3 Localization
5.4 Cost functions
5.5 Minimax lower bounds
6 PAC-Bayesian bounds
7 Stability
8 Model selection
8.1 Oracle inequalities
8.2 A glimpse at model selection methods
8.3 Naive penalization
8.4 Ideal penalties
8.5 Localized Rademacher complexities
8.6 Pre-testing
8.7 Revisiting hold-out estimates
References

Keywords and phrases: Pattern Recognition, Statistical Learning Theory, Concentration Inequalities, Empirical Processes, Model Selection.

The authors acknowledge support by the PASCAL Network of Excellence under EC grant no. The work of the third author was supported by the Spanish Ministry of Science and Technology and FEDER, grant BMF.

1 Laboratoire Probabilités et Modèles Aléatoires, CNRS & Université Paris VII, Paris, France, www.proba.jussieu.fr/~boucheron
2 Pertinence SA, 32 rue des Jeûneurs, Paris, France
3 Department of Economics, Pompeu Fabra University, Ramon Trias Fargas 25-27, Barcelona, Spain

© EDP Sciences, SMAI 1999

1 Introduction

The last few years have witnessed important new developments in the theory and practice of pattern classification. The introduction of new and effective techniques for handling high-dimensional problems, such as boosting and support vector machines, has revolutionized the practice of pattern recognition. At the same time, the better understanding of the application of empirical process theory and concentration inequalities has led to effective new ways of studying these methods and provided a statistical explanation for their success. These new tools have also helped develop new model selection methods that are at the heart of many classification algorithms.

The purpose of this survey is to offer an overview of some of these theoretical tools and give the main ideas of the analysis of some of the important algorithms. This survey does not attempt to be exhaustive. The selection of the topics is largely biased by the personal taste of the authors. We also limit ourselves to describing the key ideas in a simple way, often sacrificing generality. In these cases the reader is pointed to the references for the sharpest and more general results available. References and bibliographical remarks are given at the end of each section, in an attempt to avoid interruptions in the arguments.

2 Basic model

The problem of pattern classification is about guessing or predicting the unknown class of an observation. An observation is often a collection of numerical and/or categorical measurements represented by a $d$-dimensional vector $x$, but in some cases it may even be a curve or an image. In our model we simply assume that $x \in \mathcal{X}$ where $\mathcal{X}$ is some abstract measurable space equipped with a $\sigma$-algebra. The unknown nature of the observation is called a class. It is denoted by $y$ and in the simplest case takes values in the binary set $\{-1, 1\}$. In these notes we restrict our attention to binary classification. The
reason is simplicity and that the binary problem already captures many of the main features of more general problems. Even though there is much to say about multiclass classification, this survey does not cover this increasing field of research.

In classification, one creates a function $g : \mathcal{X} \to \{-1, 1\}$ which represents one's guess of $y$ given $x$. The mapping $g$ is called a classifier. The classifier errs on $x$ if $g(x) \neq y$.

To formalize the learning problem, we introduce a probabilistic setting, and let $(X, Y)$ be an $\mathcal{X} \times \{-1, 1\}$-valued random pair, modeling an observation and its corresponding class. The distribution of the random pair $(X, Y)$ may be described by the probability distribution of $X$ (given by the probabilities $P\{X \in A\}$ for all measurable subsets $A$ of $\mathcal{X}$) and $\eta(x) = P\{Y = 1 \mid X = x\}$. The function $\eta$ is called the a posteriori probability. We measure the performance of a classifier $g$ by its probability of error

$$L(g) = P\{g(X) \neq Y\}.$$

Given $\eta$, one may easily construct a classifier with minimal probability of error. In particular, it is easy to see that if we define

$$g^*(x) = \begin{cases} 1 & \text{if } \eta(x) > 1/2 \\ -1 & \text{otherwise} \end{cases}$$
then $L(g^*) \le L(g)$ for any classifier $g$. The minimal risk $L^* \stackrel{\mathrm{def}}{=} L(g^*)$ is called the Bayes risk (or Bayes error). More precisely, it is immediate to see that

$$L(g) - L^* = E\left[ 1_{\{g(X) \neq g^*(X)\}} \, |2\eta(X) - 1| \right] \ge 0 \qquad (1)$$

(see, e.g., [72]). The optimal classifier $g^*$ is often called the Bayes classifier.

In the statistical model we focus on, one has access to a collection of data $(X_i, Y_i)$, $1 \le i \le n$. We assume that the data $D_n$ consists of a sequence of independent identically distributed (i.i.d.) random pairs $(X_1, Y_1), \dots, (X_n, Y_n)$ with the same distribution as that of $(X, Y)$. A classifier is constructed on the basis of $D_n = (X_1, Y_1, \dots, X_n, Y_n)$ and is denoted by $g_n$. Thus, the value of $Y$ is guessed by $g_n(X) = g_n(X; X_1, Y_1, \dots, X_n, Y_n)$. The performance of $g_n$ is measured by its (conditional) probability of error

$$L(g_n) = P\{g_n(X) \neq Y \mid D_n\}.$$

The focus of the theory (and practice) of classification is to construct classifiers $g_n$ whose probability of error is as close to $L^*$ as possible.

Obviously, the whole arsenal of traditional parametric and nonparametric statistics may be used to attack this problem. However, the high-dimensional nature of many of the new applications (such as image recognition, text classification, microbiological applications, etc.) leads to territories beyond the reach of traditional methods. Most new advances of statistical learning theory aim to face these new challenges.

Bibliographical remarks. Several textbooks, surveys, and research monographs have been written on pattern classification and statistical learning theory. A partial list includes Fukunaga [97], Duda and Hart [77], Vapnik and Chervonenkis [233], Devijver and Kittler [70], Vapnik [229, 230], Breiman, Friedman, Olshen, and Stone [53], Natarajan [175], McLachlan [169], Anthony and Biggs [10], Kearns and Vazirani [117], Devroye, Györfi, and Lugosi [72], Ripley [185], Vidyasagar [235], Kulkarni, Lugosi, and Venkatesh [128], Anthony and Bartlett [9], Duda, Hart, and Stork [78], Lugosi [144], and Mendelson [171].

3 Empirical risk minimization and Rademacher averages

A simple and natural approach to the
classification problem is to consider a class $\mathcal{C}$ of classifiers $g : \mathcal{X} \to \{-1, 1\}$ and use data-based estimates of the probabilities of error $L(g)$ to select a classifier from the class. The most natural choice to estimate the probability of error $L(g) = P\{g(X) \neq Y\}$ is the error count

$$L_n(g) = \frac{1}{n} \sum_{i=1}^n 1_{\{g(X_i) \neq Y_i\}}.$$

$L_n(g)$ is called the empirical error of the classifier $g$.

First we outline the basics of the theory of empirical risk minimization (i.e., the classification analog of M-estimation). Denote by $g_n^*$ the classifier that minimizes the estimated probability of error over the class:

$$L_n(g_n^*) \le L_n(g) \quad \text{for all } g \in \mathcal{C}.$$

Then the probability of error $L(g_n^*) = P\{g_n^*(X) \neq Y \mid D_n\}$ of the selected rule is easily seen to satisfy the elementary inequalities

$$L(g_n^*) - \inf_{g \in \mathcal{C}} L(g) \le 2 \sup_{g \in \mathcal{C}} |L_n(g) - L(g)|, \qquad (2)$$

$$L(g_n^*) \le L_n(g_n^*) + \sup_{g \in \mathcal{C}} |L_n(g) - L(g)|.$$
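The selection rule just described can be sketched in a few lines. This is only an illustration under stated assumptions: the class is a hypothetical finite family of one-dimensional threshold classifiers $g_a(x) = 1$ if $x \ge a$ and $-1$ otherwise, the data are synthetic, and the function names (`threshold_classifier`, `empirical_error`, `erm`) are ours rather than anything defined in the survey.

```python
import random


def threshold_classifier(a):
    """Hypothetical base classifier: g_a(x) = 1 if x >= a, else -1."""
    return lambda x: 1 if x >= a else -1


def empirical_error(g, sample):
    """Error count L_n(g): fraction of pairs (x, y) with g(x) != y."""
    return sum(1 for x, y in sample if g(x) != y) / len(sample)


def erm(thresholds, sample):
    """Return the threshold whose classifier minimizes the empirical error."""
    return min(
        thresholds,
        key=lambda a: empirical_error(threshold_classifier(a), sample),
    )


# Synthetic data (an assumption for illustration): the label is the sign of
# x - 0.3, flipped independently with probability 0.1.
rng = random.Random(0)
sample = []
for _ in range(500):
    x = rng.random()
    y = 1 if x >= 0.3 else -1
    if rng.random() < 0.1:
        y = -y
    sample.append((x, y))

thresholds = [k / 20 for k in range(21)]
a_hat = erm(thresholds, sample)
err_hat = empirical_error(threshold_classifier(a_hat), sample)
```

For a finite class the minimization is a plain scan; the computational difficulty discussed later in the survey arises for rich classes such as half-spaces in high dimension.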
We see that by guaranteeing that the uniform deviation $\sup_{g \in \mathcal{C}} |L_n(g) - L(g)|$ of estimated probabilities from their true values is small, we make sure that the probability of error of the selected classifier $g_n^*$ is not much larger than the best probability of error in the class $\mathcal{C}$ and, at the same time, that the empirical estimate $L_n(g_n^*)$ is also good.

It is important to note at this point that bounding the excess risk by the maximal deviation as in (2) is quite loose in many situations. In Section 5 we survey some ways of obtaining improved bounds. On the other hand, the simple inequality above offers a convenient way of understanding some of the basic principles and it is even sharp in a certain minimax sense, see Section 5.5.

Clearly, the random variable $n L_n(g)$ is binomially distributed with parameters $n$ and $L(g)$. Thus, to obtain bounds for the success of empirical error minimization, we need to study uniform deviations of binomial random variables from their means. We formulate the problem in a somewhat more general way as follows. Let $X_1, \dots, X_n$ be independent, identically distributed random variables taking values in some set $\mathcal{X}$ and let $\mathcal{F}$ be a class of bounded functions $\mathcal{X} \to [-1, 1]$. Denoting expectation and empirical averages by $P f = E f(X_1)$ and $P_n f = (1/n) \sum_{i=1}^n f(X_i)$, we are interested in upper bounds for the maximal deviation

$$\sup_{f \in \mathcal{F}} (P f - P_n f).$$

Concentration inequalities are among the basic tools in studying such deviations. The simplest, yet quite powerful exponential concentration inequality is the bounded differences inequality.

Theorem 3.1 (bounded differences inequality). Let $g : \mathcal{X}^n \to \mathbb{R}$ be a function of $n$ variables such that for some nonnegative constants $c_1, \dots, c_n$,

$$\sup_{x_1, \dots, x_n, \, x_i' \in \mathcal{X}} |g(x_1, \dots, x_n) - g(x_1, \dots, x_{i-1}, x_i', x_{i+1}, \dots, x_n)| \le c_i, \quad 1 \le i \le n.$$

Let $X_1, \dots, X_n$ be independent random variables. The random variable $Z = g(X_1, \dots, X_n)$ satisfies

$$P\{|Z - EZ| > t\} \le 2 e^{-2t^2/C}$$

where $C = \sum_{i=1}^n c_i^2$.

The bounded differences assumption means that if the $i$-th variable of $g$ is changed while keeping all the others fixed, the value of the function cannot change by more than $c_i$. Our main example for such a function is

$$Z = \sup_{f \in \mathcal{F}} |P f - P_n f|.$$

Obviously, $Z$ satisfies the bounded differences assumption with $c_i = 2/n$ and therefore, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$,

$$\sup_{f \in \mathcal{F}} |P f - P_n f| \le E \sup_{f \in \mathcal{F}} |P f - P_n f| + \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}. \qquad (3)$$

This concentration result allows us to focus on the expected value, which can be bounded conveniently by a simple symmetrization device. Introduce a ghost sample $X_1', \dots, X_n'$, independent of the $X_i$ and distributed identically. If $P_n' f = (1/n) \sum_{i=1}^n f(X_i')$ denotes the empirical averages measured on the ghost sample, then by Jensen's inequality,

$$E \sup_{f \in \mathcal{F}} |P f - P_n f| = E \sup_{f \in \mathcal{F}} \left| E\left[ P_n' f - P_n f \mid X_1, \dots, X_n \right] \right| \le E \sup_{f \in \mathcal{F}} |P_n' f - P_n f|.$$
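As a rough numerical sanity check of (3), the simulation below draws many samples and counts how often $Z$ exceeds its average by more than $\sqrt{2 \log(1/\delta)/n}$. It is a sketch under assumptions that are ours, not the survey's: $X$ is uniform on $[0, 1]$, the class consists of the indicators $f_a(x) = 1_{\{x \le a\}}$ on a finite grid of thresholds (so that $P f_a = a$), and the Monte Carlo mean over trials stands in for $EZ$.

```python
import math
import random


def sup_deviation(xs, grid):
    """Z = sup_a |P f_a - P_n f_a| for f_a(x) = 1{x <= a}, with P f_a = a."""
    n = len(xs)
    return max(abs(a - sum(1 for x in xs if x <= a) / n) for a in grid)


rng = random.Random(1)
n, trials, delta = 200, 300, 0.1
grid = [k / 50 for k in range(51)]

zs = [sup_deviation([rng.random() for _ in range(n)], grid) for _ in range(trials)]
mean_z = sum(zs) / trials                     # Monte Carlo proxy for E Z
t = math.sqrt(2 * math.log(1 / delta) / n)    # deviation term in (3)
exceed = sum(1 for z in zs if z > mean_z + t) / trials
```

In this toy setting the observed exceedance frequency stays far below $\delta$, which reflects that the bounded differences inequality is a worst-case bound.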
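Let now $\sigma_1, \dots, \sigma_n$ be independent (Rademacher) random variables with $P\{\sigma_i = 1\} = P\{\sigma_i = -1\} = 1/2$, independent of the $X_i$ and $X_i'$. Then

$$E \sup_{f \in \mathcal{F}} |P_n' f - P_n f| = E\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \left| \sum_{i=1}^n (f(X_i') - f(X_i)) \right| \right] = E\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \left| \sum_{i=1}^n \sigma_i (f(X_i') - f(X_i)) \right| \right] \le 2 E\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \left| \sum_{i=1}^n \sigma_i f(X_i) \right| \right].$$

Let $A \subset \mathbb{R}^n$ be a bounded set of vectors $a = (a_1, \dots, a_n)$, and introduce the quantity

$$R_n(A) = E \sup_{a \in A} \frac{1}{n} \left| \sum_{i=1}^n \sigma_i a_i \right|.$$

$R_n(A)$ is called the Rademacher average associated with $A$. For a given sequence $x_1, \dots, x_n \in \mathcal{X}$, we write $\mathcal{F}(x_1^n)$ for the class of $n$-vectors $(f(x_1), \dots, f(x_n))$ with $f \in \mathcal{F}$. Thus, using this notation, we have deduced the following.

Theorem 3.2. With probability at least $1 - \delta$,

$$\sup_{f \in \mathcal{F}} |P f - P_n f| \le 2 E R_n(\mathcal{F}(X_1^n)) + \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}.$$

We also have

$$\sup_{f \in \mathcal{F}} |P f - P_n f| \le 2 R_n(\mathcal{F}(X_1^n)) + \sqrt{\frac{2 \log \frac{2}{\delta}}{n}}.$$

The second statement follows simply by noticing that the random variable $R_n(\mathcal{F}(X_1^n))$ satisfies the conditions of the bounded differences inequality. The second inequality is our first data-dependent performance bound. It involves the Rademacher average of the coordinate projection of $\mathcal{F}$ given by the data $X_1, \dots, X_n$. Given the data, one may compute the Rademacher average, for example, by Monte Carlo integration. Note that for a given choice of the random signs $\sigma_1, \dots, \sigma_n$, the computation of $\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i)$ is equivalent to minimizing $-\sum_{i=1}^n \sigma_i f(X_i)$ over $f \in \mathcal{F}$ and therefore it is computationally equivalent to empirical risk minimization. $R_n(\mathcal{F}(X_1^n))$ measures the richness of the class $\mathcal{F}$ and provides a sharp estimate for the maximal deviations. In fact, one may prove that

$$\frac{1}{2} E R_n(\mathcal{F}(X_1^n)) - \frac{1}{2\sqrt{n}} \le E \sup_{f \in \mathcal{F}} |P f - P_n f| \le 2 E R_n(\mathcal{F}(X_1^n))$$

(see, e.g., van der Vaart and Wellner [227]).

Next we recall some of the simple structural properties of Rademacher averages.

Theorem 3.3 (properties of Rademacher averages). Let $A, B$ be bounded subsets of $\mathbb{R}^n$ and let $c \in \mathbb{R}$ be a constant. Then

$$R_n(A \cup B) \le R_n(A) + R_n(B), \qquad R_n(c \cdot A) = |c| R_n(A), \qquad R_n(A \oplus B) \le R_n(A) + R_n(B)$$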
where $c \cdot A = \{ca : a \in A\}$ and $A \oplus B = \{a + b : a \in A, b \in B\}$. Moreover, if $A = \{a^{(1)}, \dots, a^{(N)}\} \subset \mathbb{R}^n$ is a finite set, then

$$R_n(A) \le \max_{j=1,\dots,N} \|a^{(j)}\| \frac{\sqrt{2 \log N}}{n} \qquad (4)$$

where $\|\cdot\|$ denotes Euclidean norm. If

$$\mathrm{absconv}(A) = \left\{ \sum_{j=1}^N c_j a^{(j)} : N \in \mathbb{N}, \ \sum_{j=1}^N |c_j| \le 1, \ a^{(j)} \in A \right\}$$

is the absolute convex hull of $A$, then

$$R_n(A) = R_n(\mathrm{absconv}(A)). \qquad (5)$$

Finally, the contraction principle states that if $\phi : \mathbb{R} \to \mathbb{R}$ is a function with $\phi(0) = 0$ and Lipschitz constant $L_\phi$ and $\phi \circ A$ is the set of vectors of form $(\phi(a_1), \dots, \phi(a_n)) \in \mathbb{R}^n$ with $a \in A$, then

$$R_n(\phi \circ A) \le L_\phi R_n(A).$$

Proof. The first three properties are immediate from the definition. Inequality (4) follows by Hoeffding's inequality which states that if $X$ is a bounded zero-mean random variable taking values in an interval $[\alpha, \beta]$, then for any $s > 0$, $E \exp(sX) \le \exp\left(s^2 (\beta - \alpha)^2 / 8\right)$. In particular, by independence,

$$E \exp\left( s \frac{1}{n} \sum_{i=1}^n \sigma_i a_i \right) = \prod_{i=1}^n E \exp\left( s \frac{1}{n} \sigma_i a_i \right) \le \prod_{i=1}^n \exp\left( \frac{s^2 a_i^2}{2 n^2} \right) = \exp\left( \frac{s^2 \|a\|^2}{2 n^2} \right).$$

This implies that

$$e^{s R_n(A)} = \exp\left( s E \max_{j=1,\dots,N} \frac{1}{n} \sum_{i=1}^n \sigma_i a_i^{(j)} \right) \le E \exp\left( s \max_{j=1,\dots,N} \frac{1}{n} \sum_{i=1}^n \sigma_i a_i^{(j)} \right) \le \sum_{j=1}^N E e^{s \frac{1}{n} \sum_{i=1}^n \sigma_i a_i^{(j)}} \le N \max_{j=1,\dots,N} \exp\left( \frac{s^2 \|a^{(j)}\|^2}{2 n^2} \right).$$

Taking the logarithm of both sides, dividing by $s$, and choosing $s$ to minimize the obtained upper bound for $R_n(A)$, we arrive at (4). The identity (5) is easily seen from the definition. For a proof of the contraction principle, see Ledoux and Talagrand [133].

Often it is useful to derive further upper bounds on Rademacher averages. As an illustration, we consider the case when $\mathcal{F}$ is a class of indicator functions. Recall that this is the case in our motivating example in the classification problem described above when each $f \in \mathcal{F}$ is the indicator function of a set of the form $\{(x, y) : g(x) \neq y\}$. In such a case, for any collection of points $x_1^n = (x_1, \dots, x_n)$, $\mathcal{F}(x_1^n)$ is a finite subset of $\mathbb{R}^n$ whose cardinality is denoted by $S_{\mathcal{F}}(x_1^n)$ and is called the VC shatter coefficient (where VC stands for Vapnik-Chervonenkis). Obviously, $S_{\mathcal{F}}(x_1^n) \le 2^n$. By inequality (4), we have, for all $x_1^n$,

$$R_n(\mathcal{F}(x_1^n)) \le \sqrt{\frac{2 \log S_{\mathcal{F}}(x_1^n)}{n}} \qquad (6)$$

where we used the fact that for each $f \in \mathcal{F}$, $\sum_i f(x_i)^2 \le n$. In particular,
$$E \sup_{f \in \mathcal{F}} |P f - P_n f| \le 2 E \sqrt{\frac{2 \log S_{\mathcal{F}}(X_1^n)}{n}}.$$

The logarithm of the VC shatter coefficient may be upper bounded in terms of a combinatorial quantity, called the VC dimension. If $A \subset \{-1, 1\}^n$, then the VC dimension of $A$ is the size $V$ of the largest set of indices
$\{i_1, \dots, i_V\} \subset \{1, \dots, n\}$ such that for each binary $V$-vector $b = (b_1, \dots, b_V) \in \{-1, 1\}^V$ there exists an $a = (a_1, \dots, a_n) \in A$ such that $(a_{i_1}, \dots, a_{i_V}) = b$. The key inequality establishing a relationship between shatter coefficients and VC dimension is known as Sauer's lemma which states that the cardinality of any set $A \subset \{-1, 1\}^n$ may be upper bounded as

$$|A| \le \sum_{i=0}^V \binom{n}{i} \le (n + 1)^V$$

where $V$ is the VC dimension of $A$. In particular,

$$\log S_{\mathcal{F}}(x_1^n) \le V(x_1^n) \log(n + 1)$$

where we denote by $V(x_1^n)$ the VC dimension of $\mathcal{F}(x_1^n)$. Thus, the expected maximal deviation $E \sup_{f \in \mathcal{F}} |P f - P_n f|$ may be upper bounded by $2 E \sqrt{2 V(X_1^n) \log(n + 1)/n}$. To obtain distribution-free upper bounds, introduce the VC dimension of a class of binary functions $\mathcal{F}$, defined by

$$V = \sup_{n, \, x_1^n} V(x_1^n).$$

Then we obtain the following version of what has been known as the Vapnik-Chervonenkis inequality:

Theorem 3.4 (Vapnik-Chervonenkis inequality). For all distributions one has

$$E \sup_{f \in \mathcal{F}} (P f - P_n f) \le 2 \sqrt{\frac{2 V \log(n + 1)}{n}}.$$

Also, for a universal constant $C$,

$$E \sup_{f \in \mathcal{F}} (P f - P_n f) \le C \sqrt{\frac{V}{n}}.$$

The second inequality, which removes the logarithmic factor, follows from a somewhat refined analysis (called chaining).

The VC dimension is an important combinatorial parameter of the class and many of its properties are well known. Here we just recall one useful result and refer the reader to the references for further study: let $\mathcal{G}$ be an $m$-dimensional vector space of real-valued functions defined on $\mathcal{X}$. The class of indicator functions

$$\mathcal{F} = \left\{ f(x) = 1_{g(x) \ge 0} : g \in \mathcal{G} \right\}$$

has VC dimension $V \le m$.

Bibliographical remarks. Uniform deviations of averages from their expectations is one of the central problems of empirical process theory. Here we merely refer to some of the comprehensive coverages, such as Shorack and Wellner [199], Giné [98], van der Vaart and Wellner [227], Vapnik [231], Dudley [83]. The use of empirical processes in classification was pioneered by Vapnik and Chervonenkis [232, 233] and rediscovered 20 years later by Blumer, Ehrenfeucht, Haussler, and Warmuth [41], Ehrenfeucht, Haussler, Kearns, and Valiant [88]. For surveys see Natarajan [175], Devroye [71],
Anthony and Biggs [10], Kearns and Vazirani [117], Vapnik [230, 231], Devroye, Györfi, and Lugosi [72], Ripley [185], Vidyasagar [235], Anthony and Bartlett [9].

The bounded differences inequality was formulated explicitly first by McDiarmid [166] (see also the surveys [167]). The martingale methods used by McDiarmid had appeared in early work of Hoeffding [109], Azuma [18], Yurinskii [242, 243], Milman and Schechtman [174]. Closely related concentration results have been obtained in various ways including information-theoretic methods (see Ahlswede, Gács, and Körner [1], Marton [154],
[155], [156], Dembo [69], Massart [158] and Rio [183]), Talagrand's induction method [217], [213], [216] (see also McDiarmid [168], Luczak and McDiarmid [143], Panchenko [ ]) and the so-called entropy method, based on logarithmic Sobolev inequalities, developed by Ledoux [132], [131], see also Bobkov and Ledoux [42], Massart [159], Rio [183], Boucheron, Lugosi, and Massart [45, 46], Bousquet [47], and Boucheron, Bousquet, Lugosi, and Massart [44].

Symmetrization was at the basis of the original arguments of Vapnik and Chervonenkis [232, 233]. We learnt the simple symmetrization trick shown above from Giné and Zinn [99] but different forms of symmetrization have been at the core of obtaining related results of similar flavor, see also Anthony and Shawe-Taylor [11], Cannon, Ettinger, Hush, Scovel [55], Herbrich and Williamson [108], Mendelson and Philips [172].

The use of Rademacher averages in classification was first promoted by Koltchinskii [124] and Bartlett, Boucheron, and Lugosi [24], see also Koltchinskii and Panchenko [126, 127], Bartlett and Mendelson [29], Bartlett, Bousquet, and Mendelson [25], Bousquet, Koltchinskii, and Panchenko [50], Kégl, Linder, and Lugosi [13], Mendelson [170].

Hoeffding's inequality appears in [109]. For a proof of the contraction principle we refer to Ledoux and Talagrand [133].

Sauer's lemma was proved independently by Sauer [189], Shelah [198], and Vapnik and Chervonenkis [232]. For related combinatorial results we refer to Frankl [90], Haussler [106], Alesker [7], Alon, Ben-David, Cesa-Bianchi, and Haussler [8], Szarek and Talagrand [210], Cesa-Bianchi and Haussler [60], Mendelson and Vershynin [173], [188].

The second inequality of Theorem 3.4 is based on the method of chaining, and was first proved by Dudley [81].

The question of how $\sup_{f \in \mathcal{F}} |P f - P_n f|$ behaves has been known as the Glivenko-Cantelli problem and much has been said about it. A few key references include Vapnik and Chervonenkis [232, 234], Dudley [79, 81, 82], Talagrand [211, 212, 214, 218], Dudley, Giné, and Zinn [84], Alon, Ben-David, Cesa-Bianchi, and Haussler [8], Li,
Long, and Srinivasan [138], Mendelson and Vershynin [173].

The VC dimension has been widely studied and many of its properties are known. We refer to Cover [63], Dudley [80, 83], Steele [204], Wenocur and Dudley [238], Assouad [15], Khovanskii [118], Macintyre and Sontag [149], Goldberg and Jerrum [101], Karpinski and A. Macintyre [114], Koiran and Sontag [121], Anthony and Bartlett [9], and Bartlett and Maass [28].

4 Minimizing cost functions: some basic ideas behind boosting and support vector machines

The results summarized in the previous section reveal that minimizing the empirical risk $L_n(g)$ over a class $\mathcal{C}$ of classifiers with a VC dimension much smaller than the sample size $n$ is guaranteed to work well. This result has two fundamental problems. First, by requiring that the VC dimension be small, one imposes serious limitations on the approximation properties of the class. In particular, even though the probability of error $L(g_n^*)$ of the empirical risk minimizer is close to the smallest probability of error $\inf_{g \in \mathcal{C}} L(g)$ in the class, $\inf_{g \in \mathcal{C}} L(g) - L^*$ may be very large. The other problem is algorithmic: minimizing the empirical probability of misclassification $L_n(g)$ is very often a computationally difficult problem. Even in seemingly simple cases, for example when $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{C}$ is the class of classifiers that split the space of observations by a hyperplane, the minimization problem is NP-hard.

The computational difficulty of learning problems deserves some more attention. Let us consider in more detail the problem in the case of half-spaces. Formally, we are given a sample, that is a sequence of $n$ vectors $(x_1, \dots, x_n)$ from $\mathbb{R}^d$ and a sequence of $n$ labels $(y_1, \dots, y_n)$ from $\{-1, 1\}^n$, and in order to minimize the empirical misclassification risk we are asked to find $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ so as to minimize

$$\# \{k : y_k (\langle w, x_k \rangle - b) \le 0\}.$$

Without loss of generality, the vectors constituting the sample are assumed to have rational coefficients, and the size of the data is the sum of the bit lengths of the vectors making the sample. Not only minimizing the number
of misclassification errors has been proved to be at least as hard as solving any NP-complete problem, but even approximately minimizing the number of misclassification errors within a constant factor of the optimum has been shown to be NP-hard.

This means that, unless P = NP, we will not be able to build a computationally efficient empirical risk minimizer for half-spaces that will work for all input space dimensions. If the input space dimension $d$ is fixed, an algorithm running in $O(n^{d-1} \log n)$ steps enumerates the trace of half-spaces on a sample of length $n$. This allows an exhaustive search for the empirical risk minimizer. Such a possibility should be considered with circumspection since its range of applications would not extend much beyond problems where the input dimension is less than 5.

4.1 Margin-based performance bounds

An attempt to solve both of these problems is to modify the empirical functional to be minimized by introducing a cost function. Next we describe the main ideas of empirical minimization of cost functionals and its analysis. We consider classifiers of the form

$$g_f(x) = \begin{cases} 1 & \text{if } f(x) \ge 0 \\ -1 & \text{otherwise} \end{cases}$$

where $f : \mathcal{X} \to \mathbb{R}$ is a real-valued function. In such a case the probability of error of $g_f$ may be written as

$$L(g_f) = P\{\mathrm{sgn}(f(X)) \neq Y\} \le E 1_{f(X) Y < 0}.$$

To lighten notation we will simply write $L(f) = L(g_f)$. Let $\phi : \mathbb{R} \to \mathbb{R}_+$ be a nonnegative cost function such that $\phi(x) \ge 1_{x > 0}$. (Typical choices of $\phi$ include $\phi(x) = e^x$, $\phi(x) = \log_2(1 + e^x)$, and $\phi(x) = (1 + x)_+$.) Introduce the cost functional and its empirical version by

$$A(f) = E \phi(-f(X) Y) \quad \text{and} \quad A_n(f) = \frac{1}{n} \sum_{i=1}^n \phi(-f(X_i) Y_i).$$

Obviously, $L(f) \le A(f)$ and $L_n(f) \le A_n(f)$.

Theorem 4.1. Assume that the function $f_n$ is chosen from a class $\mathcal{F}$ based on the data $(Z_1, \dots, Z_n) \stackrel{\mathrm{def}}{=} (X_1, Y_1), \dots, (X_n, Y_n)$. Let $B$ denote a uniform upper bound on $\phi(-f(x) y)$ and let $L_\phi$ be the Lipschitz constant of $\phi$. Then the probability of error of the corresponding classifier may be bounded, with probability at least $1 - \delta$, by

$$L(f_n) \le A_n(f_n) + 2 L_\phi E R_n(\mathcal{F}(X_1^n)) + B \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}.$$

Thus, the Rademacher average of the class of real-valued functions $f$ bounds the performance of the classifier.
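The relation $L_n(f) \le A_n(f)$ between the empirical error and the empirical cost can be checked numerically for the three typical cost functions named above. The sketch below is illustrative only: the data, the scoring function `f`, and all function names are assumptions of ours.

```python
import math
import random

# The three typical cost functions phi mentioned in the text.
COSTS = {
    "exponential": lambda x: math.exp(x),
    "logit": lambda x: math.log2(1 + math.exp(x)),
    "hinge": lambda x: max(0.0, 1 + x),
}


def empirical_error(f, sample):
    """L_n(f): fraction of pairs misclassified by g_f(x) = 1 if f(x) >= 0 else -1."""
    return sum(1 for x, y in sample if (1 if f(x) >= 0 else -1) != y) / len(sample)


def empirical_cost(f, sample, phi):
    """A_n(f) = (1/n) * sum_i phi(-f(X_i) * Y_i)."""
    return sum(phi(-f(x) * y) for x, y in sample) / len(sample)


# Synthetic one-dimensional data (an assumption for illustration).
rng = random.Random(2)
sample = []
for _ in range(200):
    x = rng.uniform(-1, 1)
    y = 1 if x + rng.gauss(0, 0.3) >= 0 else -1
    sample.append((x, y))


def f(x):
    """Hypothetical real-valued scoring function."""
    return 2.0 * x - 0.1


err = empirical_error(f, sample)
costs = {name: empirical_cost(f, sample, phi) for name, phi in COSTS.items()}
```

Each of these cost functions satisfies $\phi(0) = 1$, so the inequality also holds for points with $f(x) = 0$.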
Proof. The proof is similar to the argument of the previous section:

$$L(f_n) \le A(f_n) \le A_n(f_n) + \sup_{f \in \mathcal{F}} (A(f) - A_n(f))$$

$$\le A_n(f_n) + 2 E R_n(\phi \circ \mathcal{H}(Z_1^n)) + B \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}$$

(where $\mathcal{H}$ is the class of functions $\mathcal{X} \times \{-1, 1\} \to \mathbb{R}$ of the form $-f(x) y$, $f \in \mathcal{F}$)

$$\le A_n(f_n) + 2 L_\phi E R_n(\mathcal{H}(Z_1^n)) + B \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}$$

(by the contraction principle of Theorem 3.3)

$$= A_n(f_n) + 2 L_\phi E R_n(\mathcal{F}(X_1^n)) + B \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}.$$

4.1.1 Weighted voting schemes

In many applications such as boosting and bagging, classifiers are combined by weighted voting schemes, which means that the classification rule is obtained by means of functions $f$ from a class

$$\mathcal{F}_\lambda = \left\{ f(x) = \sum_{j=1}^N c_j g_j(x) : N \in \mathbb{N}, \ \sum_{j=1}^N |c_j| \le \lambda, \ g_1, \dots, g_N \in \mathcal{C} \right\} \qquad (7)$$

where $\mathcal{C}$ is a class of base classifiers, that is, functions defined on $\mathcal{X}$, taking values in $\{-1, 1\}$. A classifier of this form may be thought of as one that, upon observing $x$, takes a weighted vote of the classifiers $g_1, \dots, g_N$ (using the weights $c_1, \dots, c_N$) and decides according to the weighted majority. In this case, by (5) and (6) we have

$$R_n(\mathcal{F}_\lambda(X_1^n)) \le \lambda R_n(\mathcal{C}(X_1^n)) \le \lambda \sqrt{\frac{2 V_{\mathcal{C}} \log(n + 1)}{n}}$$

where $V_{\mathcal{C}}$ is the VC dimension of the base class.

To understand the richness of classes formed by weighted averages of classifiers from a base class, just consider the simple one-dimensional example in which the base class $\mathcal{C}$ contains all classifiers of the form $g(x) = 2 \cdot 1_{x \le a} - 1$, $a \in \mathbb{R}$. Then $V_{\mathcal{C}} = 1$ and the closure of $\mathcal{F}_\lambda$ (under the $L_\infty$ norm) is the set of all functions of total variation bounded by $2\lambda$. Thus, $\mathcal{F}_\lambda$ is rich in the sense that any classifier may be approximated by classifiers associated with the functions in $\mathcal{F}_\lambda$. In particular, the VC dimension of the class of all classifiers induced by functions in $\mathcal{F}_\lambda$ is infinite. For such large classes of classifiers it is impossible to guarantee that $L(f_n)$ exceeds the minimal risk in the class by no more than something of the order of $n^{-1/2}$ (see Section 5.5). However, $L(f_n)$ may be made as small as the minimum of the cost functional $A(f)$ over the class plus $O(n^{-1/2})$.

Summarizing, we have obtained that if $\mathcal{F}_\lambda$ is of the form indicated above, then for any function $f_n$ chosen from $\mathcal{F}_\lambda$ in a data-based manner, the
probability of error of the associated classifier satisfies, with probability at least $1 - \delta$,

$$L(f_n) \le A_n(f_n) + 2 L_\phi \lambda \sqrt{\frac{2 V_{\mathcal{C}} \log(n + 1)}{n}} + B \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}. \qquad (8)$$

The remarkable fact about this inequality is that the upper bound only involves the VC dimension of the class $\mathcal{C}$ of base classifiers, which is typically small. The price we pay is that the first term on the right-hand side is
the empirical cost functional instead of the empirical probability of error. As a first illustration, consider the example when $\gamma$ is a fixed positive parameter and

$$\phi(x) = \begin{cases} 0 & \text{if } x \le -\gamma \\ 1 & \text{if } x \ge 0 \\ 1 + x/\gamma & \text{otherwise.} \end{cases}$$

In this case $B = 1$ and $L_\phi = 1/\gamma$. Notice also that $1_{x > 0} \le \phi(x) \le 1_{x > -\gamma}$ and therefore $A_n(f) \le L_n^\gamma(f)$ where $L_n^\gamma(f)$ is the so-called margin error defined by

$$L_n^\gamma(f) = \frac{1}{n} \sum_{i=1}^n 1_{f(X_i) Y_i < \gamma}.$$

Notice that for all $\gamma > 0$, $L_n^\gamma(f) \ge L_n(f)$ and that $L_n^\gamma(f)$ is increasing in $\gamma$. An interpretation of the margin error $L_n^\gamma(f)$ is that it counts, apart from the number of misclassified pairs $(X_i, Y_i)$, also those which are well classified but only with a small "confidence" (or "margin") by $f$. Thus, (8) implies the following margin-based bound for the risk:

Corollary 4.2. For any $\gamma > 0$, with probability at least $1 - \delta$,

$$L(f_n) \le L_n^\gamma(f_n) + 2 \frac{\lambda}{\gamma} \sqrt{\frac{2 V_{\mathcal{C}} \log(n + 1)}{n}} + \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}. \qquad (9)$$

Notice that, as $\gamma$ grows, the first term of the sum increases, while the second decreases. The bound can be very useful whenever a classifier has a small margin error for a relatively large $\gamma$ (i.e., if the classifier classifies the training data well with high "confidence") since the second term only depends on the VC dimension of the small base class $\mathcal{C}$. This result has been used to explain the good behavior of some voting methods such as AdaBoost, since these methods have a tendency to find classifiers that classify the data points well with a large margin.

4.1.2 Kernel methods

Another popular way to obtain classification rules from a class of real-valued functions, which is used in kernel methods such as Support Vector Machines (SVM) or Kernel Fisher Discriminant (KFD), is to consider balls of a reproducing kernel Hilbert space. The basic idea is to use a positive definite kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, that is, a symmetric function satisfying

$$\sum_{i,j=1}^n \alpha_i \alpha_j k(x_i, x_j) \ge 0,$$

for all choices of $n$, $\alpha_1, \dots, \alpha_n \in \mathbb{R}$ and $x_1, \dots, x_n \in \mathcal{X}$. Such a function naturally generates a space of functions of the form

$$\mathcal{F} = \left\{ f(\cdot) = \sum_{i=1}^n \alpha_i k(x_i, \cdot) : n \in \mathbb{N}, \ \alpha_i \in \mathbb{R}, \ x_i \in \mathcal{X} \right\},$$

which, with the inner product

$$\left\langle \sum \alpha_i k(x_i, \cdot), \sum \beta_j k(x_j, \cdot) \right\rangle \stackrel{\mathrm{def}}{=} \sum \alpha_i \beta_j k(x_i, x_j)$$

can be completed into a Hilbert space. The key property is that for all $x_1, x_2 \in \mathcal{X}$ there exist elements $f_{x_1}, f_{x_2} \in \mathcal{F}$ such that $k(x_1, x_2) = \langle f_{x_1}, f_{x_2} \rangle$. This means that any linear algorithm based on computing inner products can be extended into a nonlinear version by replacing the inner products by a kernel function. The advantage is that even though the algorithm remains of low complexity, it works in a class of functions that can potentially represent any continuous function arbitrarily well (provided $k$ is chosen appropriately).
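The positive definiteness condition can be illustrated numerically. The sketch below assumes the Gaussian kernel on the real line, which is a standard positive definite kernel but is our choice and not one made in the survey; the function names are likewise hypothetical.

```python
import math
import random


def gaussian_kernel(x, z, bandwidth=1.0):
    """k(x, z) = exp(-(x - z)^2 / (2 * bandwidth^2)); an illustrative kernel choice."""
    return math.exp(-((x - z) ** 2) / (2 * bandwidth ** 2))


def quadratic_form(alphas, points, kernel):
    """sum_{i,j} alpha_i alpha_j k(x_i, x_j): squared norm of sum_i alpha_i k(x_i, .)."""
    return sum(
        ai * aj * kernel(xi, xj)
        for ai, xi in zip(alphas, points)
        for aj, xj in zip(alphas, points)
    )


rng = random.Random(3)
points = [rng.uniform(-2, 2) for _ in range(8)]

# Evaluate the quadratic form for many random coefficient vectors alpha.
forms = [
    quadratic_form([rng.gauss(0, 1) for _ in points], points, gaussian_kernel)
    for _ in range(100)
]
smallest = min(forms)
```

A finite number of random coefficient vectors cannot prove positive definiteness; the check only illustrates what the condition asserts.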
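Algorithms working with kernels usually perform minimization of a cost functional on a ball of the associated reproducing kernel Hilbert space of the form

$$\mathcal{F}_\lambda = \left\{ f(x) = \sum_{j=1}^N c_j k(x_j, x) : N \in \mathbb{N}, \ \sum_{i,j=1}^N c_i c_j k(x_i, x_j) \le \lambda^2, \ x_1, \dots, x_N \in \mathcal{X} \right\}. \qquad (10)$$

Notice that, in contrast with (7) where the constraint is of $\ell_1$ type, the constraint here is of $\ell_2$ type. Also, the basis functions, instead of being chosen from a fixed class, are determined by elements of $\mathcal{X}$ themselves.

An important property of functions in the reproducing kernel Hilbert space associated with $k$ is that for all $x \in \mathcal{X}$,

$$f(x) = \langle f, k(x, \cdot) \rangle.$$

This is called the reproducing property. The reproducing property may be used to estimate precisely the Rademacher average of $\mathcal{F}_\lambda$. Indeed, denoting by $E_\sigma$ expectation with respect to the Rademacher variables $\sigma_1, \dots, \sigma_n$, we have

$$R_n(\mathcal{F}_\lambda(X_1^n)) = \frac{1}{n} E_\sigma \sup_{\|f\| \le \lambda} \sum_{i=1}^n \sigma_i f(X_i) = \frac{1}{n} E_\sigma \sup_{\|f\| \le \lambda} \sum_{i=1}^n \sigma_i \langle f, k(X_i, \cdot) \rangle = \frac{\lambda}{n} E_\sigma \left\| \sum_{i=1}^n \sigma_i k(X_i, \cdot) \right\|$$

by the Cauchy-Schwarz inequality, where $\|\cdot\|$ denotes the norm in the reproducing kernel Hilbert space. The Kahane-Khinchine inequality states that for any vectors $a_1, \dots, a_n$ in a Hilbert space,

$$\frac{1}{2} E \left\| \sum_{i=1}^n \sigma_i a_i \right\|^2 \le \left( E \left\| \sum_{i=1}^n \sigma_i a_i \right\| \right)^2 \le E \left\| \sum_{i=1}^n \sigma_i a_i \right\|^2.$$

It is also easy to see that

$$E \left\| \sum_{i=1}^n \sigma_i a_i \right\|^2 = E \sum_{i,j=1}^n \sigma_i \sigma_j \langle a_i, a_j \rangle = \sum_{i=1}^n \|a_i\|^2,$$

so we obtain

$$\frac{\lambda}{n \sqrt{2}} \sqrt{\sum_{i=1}^n k(X_i, X_i)} \le R_n(\mathcal{F}_\lambda(X_1^n)) \le \frac{\lambda}{n} \sqrt{\sum_{i=1}^n k(X_i, X_i)}.$$

This is very nice as it gives a bound that can be computed very easily from the data. A reasoning similar to the one leading to (9), using the bounded differences inequality to replace the Rademacher average by its empirical version, gives the following.

Corollary 4.3. Let $f_n$ be any function chosen from the ball $\mathcal{F}_\lambda$. Then, with probability at least $1 - \delta$,

$$L(f_n) \le L_n^\gamma(f_n) + 2 \frac{\lambda}{\gamma n} \sqrt{\sum_{i=1}^n k(X_i, X_i)} + \sqrt{\frac{2 \log \frac{2}{\delta}}{n}}.$$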
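The two-sided estimate of the Rademacher average of a kernel ball lends itself to a quick numerical check. The sketch below uses the identity $\| \sum_i \sigma_i k(X_i, \cdot) \|^2 = \sigma^\top K \sigma$, where $K$ is the Gram matrix, and estimates $E_\sigma$ by Monte Carlo. The Gaussian kernel, the synthetic points, and all names are assumptions made for illustration.

```python
import math
import random


def gaussian_kernel(x, z):
    """Illustrative kernel choice (an assumption): k(x, z) = exp(-(x - z)^2 / 2)."""
    return math.exp(-((x - z) ** 2) / 2)


def kernel_ball_rademacher(points, lam, kernel, draws, rng):
    """Monte Carlo estimate of (lam / n) * E_sigma || sum_i sigma_i k(X_i, .) ||.

    By the reproducing property the squared norm equals sigma^T K sigma,
    where K[i][j] = k(X_i, X_j) is the Gram matrix of the sample.
    """
    n = len(points)
    gram = [[kernel(a, b) for b in points] for a in points]
    total = 0.0
    for _ in range(draws):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        q = sum(sigma[i] * sigma[j] * gram[i][j] for i in range(n) for j in range(n))
        total += math.sqrt(max(q, 0.0))  # guard against tiny negative rounding error
    return lam * total / (draws * n)


rng = random.Random(4)
points = [rng.uniform(-2, 2) for _ in range(30)]
lam = 1.0

estimate = kernel_ball_rademacher(points, lam, gaussian_kernel, draws=2000, rng=rng)
trace = sum(gaussian_kernel(x, x) for x in points)   # sum_i k(X_i, X_i)
upper = lam * math.sqrt(trace) / len(points)
lower = upper / math.sqrt(2)
```

The upper bound needs only the diagonal of the Gram matrix, which is why it is so cheap to evaluate from the data.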
4.2 Convex cost functionals

Next we show that a proper choice of the cost function $\phi$ has further advantages. To this end, we consider nonnegative convex nondecreasing cost functions with $\lim_{x \to -\infty} \phi(x) = 0$ and $\phi(0) = 1$. Main examples of $\phi$ include the exponential cost function $\phi(x) = e^x$ used in AdaBoost and related boosting algorithms, the logit cost function $\phi(x) = \log_2(1 + e^x)$, and the hinge loss (or soft margin loss) $\phi(x) = (1 + x)_+$ used in support vector machines. One of the main advantages of using convex cost functions is that minimizing the empirical cost $A_n(f)$ often becomes a convex optimization problem and is therefore computationally feasible. In fact, most boosting and support vector machine classifiers may be viewed as empirical minimizers of a convex cost functional.

However, minimizing convex cost functionals has other theoretical advantages. To understand this, assume, in addition to the above, that $\phi$ is strictly convex and differentiable. Then it is easy to determine the function $f^*$ minimizing the cost functional $A(f) = E \phi(-Y f(X))$. Just note that for each $x \in \mathcal{X}$,

$$E\left[ \phi(-Y f(X)) \mid X = x \right] = \eta(x) \phi(-f(x)) + (1 - \eta(x)) \phi(f(x))$$

and therefore the function $f^*$ is given by

$$f^*(x) = \mathop{\mathrm{argmin}}_\alpha h_{\eta(x)}(\alpha)$$

where for each $\eta \in [0, 1]$, $h_\eta(\alpha) = \eta \phi(-\alpha) + (1 - \eta) \phi(\alpha)$. Note that $h_\eta$ is strictly convex and therefore $f^*$ is well defined (though it may take values $\pm\infty$ if $\eta$ equals 0 or 1). Assuming that $h_\eta$ is differentiable, the minimum is achieved for the value of $\alpha$ for which $h_\eta'(\alpha) = 0$, that is, when

$$\frac{\eta}{1 - \eta} = \frac{\phi'(\alpha)}{\phi'(-\alpha)}.$$

Since $\phi'$ is strictly increasing, we see that the solution is positive if and only if $\eta > 1/2$. This reveals the important fact that the minimizer $f^*$ of the functional $A(f)$ is such that the corresponding classifier $g^*(x) = 2 \cdot 1_{f^*(x) \ge 0} - 1$ is just the Bayes classifier. Thus, minimizing a convex cost functional leads to an optimal classifier. For example, if $\phi(x) = e^x$ is the exponential cost function, then $f^*(x) = (1/2) \log(\eta(x)/(1 - \eta(x)))$. In the case of the logit cost $\phi(x) = \log_2(1 + e^x)$, we have $f^*(x) = \log(\eta(x)/(1 - \eta(x)))$.

We note here that, even though the hinge
loss $\phi(x) = (1 + x)_+$ does not satisfy the conditions for $\phi$ used above (e.g., it is not strictly convex), it is easy to see that the function $f^*$ minimizing the cost functional equals

$$f^*(x) = \begin{cases} 1 & \text{if } \eta(x) > 1/2 \\ -1 & \text{if } \eta(x) < 1/2. \end{cases}$$

Thus, in this case $f^*$ not only induces the Bayes classifier but is equal to it.

To obtain inequalities for the probability of error of classifiers based on minimization of empirical cost functionals, we need to establish a relationship between the excess probability of error $L(f) - L^*$ and the corresponding excess cost functional $A(f) - A^*$ where $A^* = A(f^*) = \inf_f A(f)$. Here we recall a simple inequality of Zhang [244] which states that if the function $H : [0, 1] \to \mathbb{R}$ is defined by $H(\eta) = \inf_\alpha h_\eta(\alpha)$ and the cost function $\phi$ is such that for some positive constants $s \ge 1$ and $c$,

$$\left| \frac{1}{2} - \eta \right|^s \le c^s (1 - H(\eta)), \quad \eta \in [0, 1],$$

then for any function $f : \mathcal{X} \to \mathbb{R}$,

$$L(f) - L^* \le 2c \left( A(f) - A^* \right)^{1/s}. \qquad (11)$$
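The closed forms of the minimizer $f^*$ for the exponential and logit costs can be checked by minimizing $h_\eta$ numerically. The following is a small sketch; the ternary search, the search interval, and the function names are choices of ours for illustration.

```python
import math


def h(eta, alpha, phi):
    """h_eta(alpha) = eta * phi(-alpha) + (1 - eta) * phi(alpha)."""
    return eta * phi(-alpha) + (1 - eta) * phi(alpha)


def argmin_convex(fun, lo=-20.0, hi=20.0, iters=200):
    """Ternary search for the minimizer of a strictly convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if fun(m1) < fun(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


def exp_cost(x):
    return math.exp(x)


def logit_cost(x):
    return math.log2(1 + math.exp(x))


results = []
for eta in (0.1, 0.3, 0.7, 0.9):
    log_odds = math.log(eta / (1 - eta))
    a_exp = argmin_convex(lambda a: h(eta, a, exp_cost))
    a_logit = argmin_convex(lambda a: h(eta, a, logit_cost))
    # Tuple layout: (eta, numeric exp minimizer, closed form, numeric logit minimizer, closed form)
    results.append((eta, a_exp, log_odds / 2, a_logit, log_odds))
```

In each case the numerical minimizer is positive exactly when $\eta > 1/2$, matching the statement that the induced classifier is the Bayes classifier.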
(The simple proof of this inequality is based on the expression (1) and elementary convexity properties of $h_\eta$.) In the special case of the exponential and logit cost functions, $H(\eta) = 2\sqrt{\eta(1 - \eta)}$ and $H(\eta) = -\eta\log_2\eta - (1 - \eta)\log_2(1 - \eta)$, respectively. In both cases it is easy to see that the condition above is satisfied with $s = 2$ and $c = 1/\sqrt{2}$.

Theorem 4.4 (excess risk of convex risk minimizers). Assume that $f_n$ is chosen from a class $\mathcal{F}_\lambda$ defined in (7) by minimizing the empirical cost functional $A_n(f)$ using either the exponential or the logit cost function. Then, with probability at least $1 - \delta$,

$$L(f_n) - L^* \le 2\left(2 L_\phi \lambda \sqrt{\frac{2 V_{\mathcal{C}} \log(n + 1)}{n}} + B\sqrt{\frac{2\log\frac{1}{\delta}}{n}}\right)^{1/2} + \sqrt{2}\left(\inf_{f \in \mathcal{F}_\lambda} A(f) - A^*\right)^{1/2}.$$

Proof.

$$\begin{aligned}
L(f_n) - L^* &\le \sqrt{2}\,\left(A(f_n) - A^*\right)^{1/2} \\
&\le \sqrt{2}\left(A(f_n) - \inf_{f \in \mathcal{F}_\lambda} A(f)\right)^{1/2} + \sqrt{2}\left(\inf_{f \in \mathcal{F}_\lambda} A(f) - A^*\right)^{1/2} \\
&\le 2\left(\sup_{f \in \mathcal{F}_\lambda} |A(f) - A_n(f)|\right)^{1/2} + \sqrt{2}\left(\inf_{f \in \mathcal{F}_\lambda} A(f) - A^*\right)^{1/2} \quad \text{(just like in (2))} \\
&\le 2\left(2 L_\phi \lambda \sqrt{\frac{2 V_{\mathcal{C}} \log(n + 1)}{n}} + B\sqrt{\frac{2\log\frac{1}{\delta}}{n}}\right)^{1/2} + \sqrt{2}\left(\inf_{f \in \mathcal{F}_\lambda} A(f) - A^*\right)^{1/2}
\end{aligned}$$

with probability at least $1 - \delta$, where at the last step we used the same bound for $\sup_{f \in \mathcal{F}_\lambda} |A(f) - A_n(f)|$ as in (8).

Note that for the exponential cost function $L_\phi = e^\lambda$ and $B = \lambda$, while for the logit cost $L_\phi \le 1$ and $B = \lambda$. In both cases, if there exists a $\lambda$ sufficiently large so that $\inf_{f \in \mathcal{F}_\lambda} A(f) = A^*$, then the approximation error disappears and we obtain $L(f_n) - L^* = O(n^{-1/4})$. The fact that the exponent in the rate of convergence is dimension-free is remarkable. (We note here that these rates may be further improved by applying the refined techniques summarized in Section 5.3, see also [40].) It is an interesting approximation-theoretic challenge to understand what kind of functions $f^*$ may be obtained as a convex combination of base classifiers and, more generally, to describe approximation properties of classes of functions of the form (7).

Next we describe a simple example in which the above-mentioned approximation properties are well understood. Consider the case when $\mathcal{X} = [0, 1]^d$ and the base class $\mathcal{C}$ contains all decision stumps, that is, all classifiers of the form $s^+_{i,t}(x) = \mathbb{1}_{x^{(i)} \ge t} - \mathbb{1}_{x^{(i)} < t}$ and $s^-_{i,t}(x) = \mathbb{1}_{x^{(i)} < t} - \mathbb{1}_{x^{(i)} \ge t}$, $t \in [0, 1]$, $i = 1, \dots, d$, where $x^{(i)}$ denotes the $i$-th coordinate of $x$. In this case the VC dimension of the base class is easily seen to be bounded by $V_{\mathcal{C}} \le 2\log_2(2d)$. Also it is easy to see that the closure of $\mathcal{F}_\lambda$ with respect to the supremum norm contains all functions $f$ of the form

$$f(x) = f_1(x^{(1)}) + \cdots + f_d(x^{(d)}),$$

where the functions $f_i : [0, 1] \to \mathbb{R}$ are such that $|f_1|_{TV} + \cdots + |f_d|_{TV} \le \lambda$, where $|f_i|_{TV}$ denotes the total variation of the function $f_i$. Therefore, if $f^*$ has the above form, we have $\inf_{f \in \mathcal{F}_\lambda} A(f) = A(f^*)$. Recalling that the function $f^*$ optimizing the cost $A(f)$ has the form

$$f^*(x) = \frac{1}{2}\log\frac{\eta(x)}{1 - \eta(x)}$$
in the case of the exponential cost function and

$$f^*(x) = \log\frac{\eta(x)}{1 - \eta(x)}$$

in the case of the logit cost function, we see that boosting using decision stumps is especially well fitted to the so-called additive logistic model in which $\eta$ is assumed to be such that $\log(\eta/(1 - \eta))$ is an additive function (i.e., it can be written as a sum of univariate functions of the components of $x$). Thus, when $\eta$ permits an additive logistic representation, the rate of convergence of the classifier is fast and has a very mild dependence on the dimension.

Consider next the case of the hinge loss $\phi(x) = (1 + x)_+$ often used in Support Vector Machines and related kernel methods. In this case $H(\eta) = 2\min(\eta, 1 - \eta)$ and therefore inequality (11) holds with $c = 1/2$ and $s = 1$. Thus,

$$L(f_n) - L^* \le A(f_n) - A^*$$

and the analysis above leads to even better rates of convergence. However, in this case $f^*(x) = 2 \cdot \mathbb{1}_{\eta(x) \ge 1/2} - 1$, and approximating this function by weighted sums of base functions may be more difficult than in the case of exponential and logit costs. Once again, the approximation-theoretic part of the problem is far from being well understood, and it is difficult to give recommendations about which cost function is more advantageous and what base classes should be used.

Bibliographical remarks. For results on the algorithmic difficulty of empirical risk minimization, see Johnson and Preparata [112], Vu [236], Bartlett and Ben-David [26], Ben-David, Eiron, and Simon [32]. Boosting algorithms were originally introduced by Freund and Schapire (see [91], [94], and [190]), as adaptive aggregation of simple classifiers contained in a small base class. The analysis based on the observation that AdaBoost and related methods tend to produce large-margin classifiers appears in Schapire, Freund, Bartlett, and Lee [191], and Koltchinskii and Panchenko [127]. It was Breiman [51] who observed that boosting performs gradient descent optimization of an empirical cost function different from the number of misclassified samples, see also Mason, Baxter, Bartlett, and Frean [157], Collins, Schapire, and Singer [61], Friedman,
Hastie, and Tibshirani [95]. Based on this view, various versions of boosting algorithms have been shown to be consistent in different settings, see Breiman [52], Bühlmann and Yu [54], Blanchard, Lugosi, and Vayatis [40], Jiang [111], Lugosi and Vayatis [146], Mannor and Meir [152], Mannor, Meir, and Zhang [153], Zhang [244]. Inequality (8) was first obtained by Schapire, Freund, Bartlett, and Lee [191]. The analysis presented here is due to Koltchinskii and Panchenko [127].

Other classifiers based on weighted voting schemes have been considered by Catoni [57-59], Yang [241], Freund, Mansour, and Schapire [93].

Kernel methods were pioneered by Aizerman, Braverman, and Rozonoer [2-5], Vapnik and Lerner [228], Bashkirov, Braverman, and Muchnik [31], Vapnik and Chervonenkis [233], and Specht [203].

Support vector machines originate in the pioneering work of Boser, Guyon, and Vapnik [43], Cortes and Vapnik [62]. For surveys we refer to Cristianini and Shawe-Taylor [65], Smola, Bartlett, Schölkopf, and Schuurmans [201], Hastie, Tibshirani, and Friedman [104], Schölkopf and Smola [192].

The study of universal approximation properties of kernels and statistical consistency of Support Vector Machines is due to Steinwart [ ], Lin [140, 141], Zhou [245], and Blanchard, Bousquet, and Massart [39].

We have considered the case of minimization of a loss function on a ball of the reproducing kernel Hilbert space. However, it is computationally more convenient to formulate the problem as the minimization of a regularized functional of the form

$$\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \phi(-Y_i f(X_i)) + \lambda \|f\|^2.$$

The standard Support Vector Machine algorithm then corresponds to the choice of $\phi(x) = (1 + x)_+$.

Kernel based regularization algorithms were studied by Kimeldorf and Wahba [120] and Craven and Wahba [64] in the context of regression. Relationships between Support Vector Machines and regularization were described
by Smola, Schölkopf, and Müller [202] and Evgeniou, Pontil, and Poggio [89]. General properties of regularized algorithms in reproducing kernel Hilbert spaces are investigated by Cucker and Smale [68], Steinwart [206], Zhang [244].

Various properties of the Support Vector Machine algorithm are investigated by Vapnik [230, 231], Schölkopf and Smola [192], Scovel and Steinwart [195] and Steinwart [208, 209].

The fact that minimizing an exponential cost functional leads to the Bayes classifier was pointed out by Breiman [52], see also Lugosi and Vayatis [146], Zhang [244]. For a comprehensive theory of the connection between cost functions and probability of misclassification, see Bartlett, Jordan, and McAuliffe [27]. Zhang's lemma (11) appears in [244]. For various generalizations and refinements we refer to Bartlett, Jordan, and McAuliffe [27] and Blanchard, Lugosi, and Vayatis [40].

5 Tighter bounds for empirical risk minimization

This section is dedicated to the description of some refinements of the ideas described in the earlier sections. What we have seen so far only used first-order properties of the functions that we considered, namely their boundedness. It turns out that using second-order properties, like the variance of the functions, many of the above results can be made sharper.

5.1 Relative deviations

In order to understand the basic phenomenon, let us go back to the simplest case in which one has a fixed function $f$ with values in $\{0, 1\}$. In this case, $P_n f$ is an average of independent Bernoulli random variables with parameter $p = Pf$. Recall that, as a simple consequence of (3), with probability at least $1 - \delta$,

$$Pf - P_n f \le \sqrt{\frac{2\log\frac{1}{\delta}}{n}}. \qquad (12)$$

This is basically tight when $Pf = 1/2$, but can be significantly improved when $Pf$ is small. Indeed, Bernstein's inequality gives, with probability at least $1 - \delta$,

$$Pf - P_n f \le \sqrt{\frac{2\,\mathrm{Var}(f)\log\frac{1}{\delta}}{n}} + \frac{2\log\frac{1}{\delta}}{3n}. \qquad (13)$$

Since $f$ takes its values in $\{0, 1\}$, $\mathrm{Var}(f) = Pf(1 - Pf) \le Pf$, which shows that when $Pf$ is small, (13) is much better than (12).

5.1.1 General inequalities

Next we exploit the phenomenon described above to obtain sharper
performance bounds for empirical risk minimization. Note that if we consider the difference $Pf - P_n f$ uniformly over the class $\mathcal{F}$, the largest deviations are obtained by functions that have a large variance (i.e., $Pf$ is close to $1/2$). An idea is to scale each function by dividing it by $\sqrt{Pf}$ so that they all behave in a similar way. Thus, we bound the quantity

$$\sup_{f \in \mathcal{F}} \frac{Pf - P_n f}{\sqrt{Pf}}.$$

The first step consists in symmetrization of the tail probabilities. If $nt^2 \ge 2$,

$$\mathbb{P}\left\{\sup_{f \in \mathcal{F}} \frac{Pf - P_n f}{\sqrt{Pf}} \ge t\right\} \le 2\,\mathbb{P}\left\{\sup_{f \in \mathcal{F}} \frac{P'_n f - P_n f}{\sqrt{(P_n f + P'_n f)/2}} \ge t\right\}.$$
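Next we introduce Rademacher random variables, obtaining, by simple symmetrization,

$$2\,\mathbb{P}\left\{\sup_{f \in \mathcal{F}} \frac{P'_n f - P_n f}{\sqrt{(P_n f + P'_n f)/2}} \ge t\right\} = 2\,\mathbb{E}\left[\mathbb{P}_\sigma\left\{\sup_{f \in \mathcal{F}} \frac{\frac{1}{n}\sum_{i=1}^{n} \sigma_i\,(f(X'_i) - f(X_i))}{\sqrt{(P_n f + P'_n f)/2}} \ge t\right\}\right]$$

(where $\mathbb{P}_\sigma$ is the conditional probability, given the $X_i$ and $X'_i$). The last step uses tail bounds for individual functions and a union bound over $\mathcal{F}(X_1^{2n})$, where $X_1^{2n}$ denotes the union of the initial sample $X_1^n$ and of the extra symmetrization sample $X'_1, \dots, X'_n$. Summarizing, we obtain the following inequalities:

Theorem 5.1. Let $\mathcal{F}$ be a class of functions taking binary values in $\{0, 1\}$. For any $\delta \in (0, 1)$, with probability at least $1 - \delta$, all $f \in \mathcal{F}$ satisfy

$$\frac{Pf - P_n f}{\sqrt{Pf}} \le 2\sqrt{\frac{\log S_{\mathcal{F}}(X_1^{2n}) + \log\frac{4}{\delta}}{n}}.$$

Also, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$,

$$\frac{P_n f - Pf}{\sqrt{P_n f}} \le 2\sqrt{\frac{\log S_{\mathcal{F}}(X_1^{2n}) + \log\frac{4}{\delta}}{n}}.$$

As a consequence, we have that for all $s > 0$, with probability at least $1 - \delta$,

$$\sup_{f \in \mathcal{F}} \frac{Pf - P_n f}{Pf + P_n f + s/2} \le 2\sqrt{\frac{\log S_{\mathcal{F}}(X_1^{2n}) + \log\frac{4}{\delta}}{sn}} \qquad (14)$$

and the same is true if $P$ and $P_n$ are permuted. Another consequence of Theorem 5.1 with interesting applications is the following. For all $t \in (0, 1]$, with probability at least $1 - \delta$,

$$\forall f \in \mathcal{F}, \quad P_n f \le (1 - t)Pf \ \text{ implies } \ Pf \le 4\,\frac{\log S_{\mathcal{F}}(X_1^{2n}) + \log\frac{4}{\delta}}{t^2 n}. \qquad (15)$$

In particular, setting $t = 1$,

$$\forall f \in \mathcal{F}, \quad P_n f = 0 \ \text{ implies } \ Pf \le 4\,\frac{\log S_{\mathcal{F}}(X_1^{2n}) + \log\frac{4}{\delta}}{n}.$$

5.1.2 Applications to empirical risk minimization

It is easy to see that, for nonnegative numbers $A, B, C \ge 0$, the fact that $A \le B\sqrt{A} + C$ entails $A \le B^2 + B\sqrt{C} + C$. Applying this to the first inequality of Theorem 5.1 with $A = Pf$ and $C = P_n f$, we obtain that, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$,

$$Pf \le P_n f + 2\sqrt{P_n f\,\frac{\log S_{\mathcal{F}}(X_1^{2n}) + \log\frac{4}{\delta}}{n}} + 4\,\frac{\log S_{\mathcal{F}}(X_1^{2n}) + \log\frac{4}{\delta}}{n}.$$

Corollary 5.2. Let $g_n^*$ be the empirical risk minimizer in a class $\mathcal{C}$ of VC dimension $V$. Then, with probability at least $1 - \delta$,

$$L(g_n^*) \le L_n(g_n^*) + 2\sqrt{L_n(g_n^*)\,\frac{2V\log(n + 1) + \log\frac{4}{\delta}}{n}} + 4\,\frac{2V\log(n + 1) + \log\frac{4}{\delta}}{n}.$$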
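To make the algebra behind Corollary 5.2 concrete, here is a small Python sketch (ours, not from the original text) that evaluates the right-hand side of the corollary and numerically checks the elementary implication $A \le B\sqrt{A} + C \Rightarrow A \le B^2 + B\sqrt{C} + C$ at its extreme point. The function names are hypothetical, and the sample values of $n$, $V$, and $\delta$ are arbitrary illustrations.

```python
import math

def complexity_term(n, V, delta):
    """(2 V log(n + 1) + log(4 / delta)) / n, as it appears in Corollary 5.2."""
    return (2.0 * V * math.log(n + 1) + math.log(4.0 / delta)) / n

def corollary_52_bound(emp_risk, n, V, delta):
    """Right-hand side of Corollary 5.2 for empirical risk L_n(g_n^*) = emp_risk."""
    eps = complexity_term(n, V, delta)
    return emp_risk + 2.0 * math.sqrt(emp_risk * eps) + 4.0 * eps

def largest_solution(B, C):
    """Largest A >= 0 with A = B * sqrt(A) + C (solve the quadratic in sqrt(A))."""
    root = 0.5 * (B + math.sqrt(B * B + 4.0 * C))
    return root * root

if __name__ == "__main__":
    n, V, delta = 10_000, 5, 0.05
    for emp_risk in (0.0, 0.01, 0.1):
        print(emp_risk, corollary_52_bound(emp_risk, n, V, delta))
```

When the empirical risk is zero, the bound reduces to four times the complexity term, of order $V\log n/n$; the square-root term only matters when the empirical risk is comparable to or larger than that quantity.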
Consider first the extreme situation when there exists a classifier in $\mathcal{C}$ which classifies without error. This also means that for some $g' \in \mathcal{C}$, $Y = g'(X)$ with probability one. This is clearly a quite restrictive assumption, only satisfied in very special cases. Nevertheless, the assumption that $\inf_{g \in \mathcal{C}} L(g) = 0$ has been commonly used in computational learning theory, perhaps because of its mathematical simplicity. In such a case, clearly $L_n(g_n^*) = 0$, so that we get, with probability at least $1 - \delta$,

$$L(g_n^*) - \inf_{g \in \mathcal{C}} L(g) \le 4\,\frac{2V\log(n + 1) + \log\frac{4}{\delta}}{n}. \qquad (16)$$

The main point here is that the upper bound obtained in this special case is of smaller order of magnitude than in the general case ($O(V \ln n/n)$ as opposed to $O(\sqrt{V \ln n/n})$). One can actually obtain a version which interpolates between these two cases as follows: for simplicity, assume that there is a classifier $g'$ in $\mathcal{C}$ such that $L(g') = \inf_{g \in \mathcal{C}} L(g)$. Then we have

$$L_n(g_n^*) \le L_n(g') = L_n(g') - L(g') + L(g').$$

Using Bernstein's inequality, we get, with probability $1 - \delta$,

$$L_n(g') - L(g') \le \sqrt{\frac{2 L(g')\log\frac{1}{\delta}}{n}} + \frac{2\log\frac{1}{\delta}}{3n},$$

which, together with Corollary 5.2, yields:

Corollary 5.3. There exists a constant $C$ such that, with probability at least $1 - \delta$,

$$L(g_n^*) - \inf_{g \in \mathcal{C}} L(g) \le C\left(\sqrt{\inf_{g \in \mathcal{C}} L(g)\,\frac{V\log n + \log\frac{1}{\delta}}{n}} + \frac{V\log n + \log\frac{1}{\delta}}{n}\right).$$

5.2 Noise and fast rates

We have seen that in the case where $f$ takes values in $\{0, 1\}$ there is a nice relationship between the variance of $f$ (which controls the size of the deviations between $Pf$ and $P_n f$) and its expectation, namely, $\mathrm{Var}(f) \le Pf$. This is the key property that allows one to obtain faster rates of convergence for $L(g_n^*) - \inf_{g \in \mathcal{C}} L(g)$.

In particular, in the ideal situation mentioned above, when $\inf_{g \in \mathcal{C}} L(g) = 0$, the difference $L(g_n^*) - \inf_{g \in \mathcal{C}} L(g)$ may be much smaller than the worst-case difference $\sup_{g \in \mathcal{C}} (L(g) - L_n(g))$. This actually happens in many cases, whenever the distribution satisfies certain conditions. Next we describe such conditions and show how the finer bounds can be derived.

The main idea is that, in order to get precise rates for $L(g_n^*) - \inf_{g \in \mathcal{C}} L(g)$, we consider functions of the form $\mathbb{1}_{g(X) \ne Y} - \mathbb{1}_{g'(X) \ne Y}$, where $g'$ is a classifier minimizing the loss in the class $\mathcal{C}$, that is, such that $L(g') = \inf_{g \in \mathcal{C}} L(g)$. Note that functions of this form are no longer nonnegative.

To illustrate the basic ideas in the simplest possible setting, consider the case when the loss class $\mathcal{F}$ is a finite set of $N$ functions of the form $\mathbb{1}_{g(X) \ne Y} - \mathbb{1}_{g'(X) \ne Y}$. In addition, we assume that there is a relationship between the variance and the expectation of the functions in $\mathcal{F}$ given by the inequality

$$\mathrm{Var}(f) \le \left(\frac{Pf}{h}\right)^{\alpha} \qquad (17)$$
for some $h > 0$ and $\alpha \in (0, 1]$. By Bernstein's inequality and a union bound over the elements of $\mathcal{C}$, we have that, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$,

$$Pf \le P_n f + \sqrt{\frac{2\,(Pf/h)^{\alpha}\log\frac{N}{\delta}}{n}} + \frac{4\log\frac{N}{\delta}}{3n}.$$

As a consequence, using the fact that $P_n f = L_n(g_n^*) - L_n(g') \le 0$ for the function $f$ associated with the empirical risk minimizer, we have with probability at least $1 - \delta$,

$$L(g_n^*) - L(g') \le \sqrt{\frac{2\,((L(g_n^*) - L(g'))/h)^{\alpha}\log\frac{N}{\delta}}{n}} + \frac{4\log\frac{N}{\delta}}{3n}.$$

Solving this inequality for $L(g_n^*) - L(g')$ finally gives that with probability at least $1 - \delta$,

$$L(g_n^*) - \inf_{g \in \mathcal{C}} L(g) \le 2\left(\frac{\log\frac{N}{\delta}}{n h^{\alpha}}\right)^{\frac{1}{2 - \alpha}}. \qquad (18)$$

Note that the obtained rate is then faster than $n^{-1/2}$ whenever $\alpha > 0$. In particular, for $\alpha = 1$ we get $n^{-1}$ as in the ideal case.

It now remains to show whether (17) is a reasonable assumption. As the simplest possible example, assume that the Bayes classifier $g^*$ belongs to the class $\mathcal{C}$ (i.e., $g' = g^*$) and the a posteriori probability function $\eta$ is bounded away from $1/2$, that is, there exists a positive constant $h$ such that for all $x \in \mathcal{X}$, $|2\eta(x) - 1| > h$. Note that the assumption $g' = g^*$ is very restrictive and is unlikely to be satisfied in practice, especially if the class $\mathcal{C}$ is finite, as it is assumed in this discussion. The assumption that $\eta$ is bounded away from $1/2$ may also appear to be quite specific. However, the situation described here may serve as a first illustration of a nontrivial example when fast rates may be achieved. Since $|\mathbb{1}_{g(X) \ne Y} - \mathbb{1}_{g^*(X) \ne Y}| \le \mathbb{1}_{g(X) \ne g^*(X)}$, the conditions stated above and (1) imply that

$$\mathrm{Var}(f) \le \mathbb{E}\left[\mathbb{1}_{g(X) \ne g^*(X)}\right] \le \frac{1}{h}\,\mathbb{E}\left[|2\eta(X) - 1|\,\mathbb{1}_{g(X) \ne g^*(X)}\right] = \frac{1}{h}\,(L(g) - L^*).$$

Thus (17) holds with this $h$ and $\alpha = 1$, which shows that, with probability at least $1 - \delta$,

$$L(g_n^*) - L^* \le C\,\frac{\log\frac{N}{\delta}}{hn}. \qquad (19)$$

Thus, the empirical risk minimizer has a significantly better performance than predicted by the results of the previous section whenever the Bayes classifier is in the class $\mathcal{C}$ and the a posteriori probability $\eta$ stays away from $1/2$. The behavior of $\eta$ in the vicinity of $1/2$ has been known to play an important role in the difficulty of the classification problem, see [72, 239, 240]. Roughly speaking, if $\eta$ has a complex behavior around the critical
threshold $1/2$, then one cannot avoid estimating $\eta$, which is a typically difficult nonparametric regression problem. However, the classification problem is significantly easier than regression if $\eta$ is far from $1/2$ with a large probability.

The condition of $\eta$ being bounded away from $1/2$ may be significantly relaxed and generalized. Indeed, in the context of discriminant analysis, Mammen and Tsybakov [151] and Tsybakov [221] formulated a useful condition that has been adopted by many authors. Let $\alpha \in [0, 1)$. Then the Mammen-Tsybakov condition may
be stated by any of the following three equivalent statements:

(1) $\exists \beta > 0$, $\forall g \in \{0, 1\}^{\mathcal{X}}$, $\quad \mathbb{E}\left[\mathbb{1}_{g(X) \ne g^*(X)}\right] \le \beta\,(L(g) - L^*)^{\alpha}$
(2) $\exists c > 0$, $\forall A \subset \mathcal{X}$, $\quad \int_A dP(x) \le c\left(\int_A |2\eta(x) - 1|\,dP(x)\right)^{\alpha}$
(3) $\exists B > 0$, $\forall t \ge 0$, $\quad \mathbb{P}\{|2\eta(X) - 1| \le t\} \le B\,t^{\frac{\alpha}{1 - \alpha}}$

We refer to this as the Mammen-Tsybakov noise condition. The proof that these statements are equivalent is straightforward, and we omit it, but we comment on the meaning of these statements. Notice first that $\alpha$ has to be in $[0, 1]$ because

$$L(g) - L^* = \mathbb{E}\left[|2\eta(X) - 1|\,\mathbb{1}_{g(X) \ne g^*(X)}\right] \le \mathbb{E}\,\mathbb{1}_{g(X) \ne g^*(X)}.$$

Also, when $\alpha = 0$ these conditions are void. The case $\alpha = 1$ in (1) is realized when there exists an $s > 0$ such that $|2\eta(X) - 1| > s$ almost surely (which is just the extreme noise condition we considered above).

The most important consequence of these conditions is that they imply a relationship between the variance and the expectation of functions of the form $\mathbb{1}_{g(X) \ne Y} - \mathbb{1}_{g^*(X) \ne Y}$. Indeed, we obtain

$$\mathbb{E}\left[\left(\mathbb{1}_{g(X) \ne Y} - \mathbb{1}_{g^*(X) \ne Y}\right)^2\right] \le c\,(L(g) - L^*)^{\alpha}.$$

This is thus enough to get (18) for a finite class of functions.

The sharper bounds, established in this section and the next, come at the price of the assumption that the Bayes classifier is in the class $\mathcal{C}$. Because of this, it is difficult to compare the fast rates achieved with the slower rates proved in Section 3. On the other hand, noise conditions like the Mammen-Tsybakov condition may be used to get improvements even when $g^*$ is not contained in $\mathcal{C}$. In these cases the approximation error $L(g') - L^*$ also needs to be taken into account, and the situation becomes somewhat more complex. We return to these issues in Sections 5.3.5 and 8.

5.3 Localization

The purpose of this section is to generalize the simple argument of the previous section to more general classes $\mathcal{C}$ of classifiers. This generalization reveals the importance of the modulus of continuity of the empirical process as a measure of complexity of the learning problem.

5.3.1 Talagrand's inequality

One of the most important recent developments in empirical process theory is a concentration inequality for the supremum of an empirical process first proved by Talagrand [212] and refined later
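by various authors. This inequality is at the heart of many key developments in statistical learning theory. Here we recall the following version:

Theorem 5.4. Let $b > 0$ and set $\mathcal{F}$ to be a set of functions from $\mathcal{X}$ to $\mathbb{R}$. Assume that all functions in $\mathcal{F}$ satisfy $Pf - f \le b$. Then, with probability at least $1 - \delta$, for any $\theta > 0$,

$$\sup_{f \in \mathcal{F}} (Pf - P_n f) \le (1 + \theta)\,\mathbb{E}\left[\sup_{f \in \mathcal{F}} (Pf - P_n f)\right] + \sqrt{\frac{2\,(\sup_{f \in \mathcal{F}} \mathrm{Var}(f))\log\frac{1}{\delta}}{n}} + \frac{(1 + 3/\theta)\,b\log\frac{1}{\delta}}{3n},$$

which, for $\theta = 1$, translates to

$$\sup_{f \in \mathcal{F}} (Pf - P_n f) \le 2\,\mathbb{E}\left[\sup_{f \in \mathcal{F}} (Pf - P_n f)\right] + \sqrt{\frac{2\,(\sup_{f \in \mathcal{F}} \mathrm{Var}(f))\log\frac{1}{\delta}}{n}} + \frac{4b\log\frac{1}{\delta}}{3n}.$$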
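The role of the free parameter $\theta$ in Theorem 5.4 can be seen by evaluating the right-hand side directly. The following Python sketch is only an illustration under our own naming: `exp_sup` stands for an assumed value of $\mathbb{E}\sup_{f}(Pf - P_n f)$ and `var_sup` for $\sup_f \mathrm{Var}(f)$, both of which must be supplied by the user since the theorem does not compute them.

```python
import math

def talagrand_rhs(exp_sup, var_sup, b, n, delta, theta=1.0):
    """Right-hand side of the bound in Theorem 5.4.

    exp_sup : assumed value of E sup_f (Pf - P_n f)
    var_sup : assumed value of sup_f Var(f)
    b       : uniform bound such that Pf - f <= b
    theta   : free trade-off parameter, theta > 0
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    log_term = math.log(1.0 / delta)
    return ((1.0 + theta) * exp_sup
            + math.sqrt(2.0 * var_sup * log_term / n)
            + (1.0 + 3.0 / theta) * b * log_term / (3.0 * n))

if __name__ == "__main__":
    # Arbitrary illustrative inputs, not taken from the text.
    for theta in (0.1, 0.5, 1.0, 2.0):
        print(theta, talagrand_rhs(0.05, 0.25, 1.0, 1000, 0.05, theta))
```

A small $\theta$ keeps the constant in front of the expected supremum close to one at the price of inflating the $b\log(1/\delta)/n$ term, while $\theta = 1$ gives the simplified form stated at the end of the theorem.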
More informationNonlife insurance mathematics. Nils F. Haavardsson, University of Oslo and DNB Skadeforsikring
Nolife isurace mathematics Nils F. Haavardsso, Uiversity of Oslo ad DNB Skadeforsikrig Mai issues so far Why does isurace work? How is risk premium defied ad why is it importat? How ca claim frequecy
More informationOutput Analysis (2, Chapters 10 &11 Law)
B. Maddah ENMG 6 Simulatio 05/0/07 Output Aalysis (, Chapters 10 &11 Law) Comparig alterative system cofiguratio Sice the output of a simulatio is radom, the comparig differet systems via simulatio should
More informationUniversal coding for classes of sources
Coexios module: m46228 Uiversal codig for classes of sources Dever Greee This work is produced by The Coexios Project ad licesed uder the Creative Commos Attributio Licese We have discussed several parametric
More informationRecursion and Recurrences
Chapter 5 Recursio ad Recurreces 5.1 Growth Rates of Solutios to Recurreces Divide ad Coquer Algorithms Oe of the most basic ad powerful algorithmic techiques is divide ad coquer. Cosider, for example,
More informationStatistical inference: example 1. Inferential Statistics
Statistical iferece: example 1 Iferetial Statistics POPULATION SAMPLE A clothig store chai regularly buys from a supplier large quatities of a certai piece of clothig. Each item ca be classified either
More informationNormal Distribution.
Normal Distributio www.icrf.l Normal distributio I probability theory, the ormal or Gaussia distributio, is a cotiuous probability distributio that is ofte used as a first approimatio to describe realvalued
More informationTaking DCOP to the Real World: Efficient Complete Solutions for Distributed MultiEvent Scheduling
Taig DCOP to the Real World: Efficiet Complete Solutios for Distributed MultiEvet Schedulig Rajiv T. Maheswara, Milid Tambe, Emma Bowrig, Joatha P. Pearce, ad Pradeep araatham Uiversity of Souther Califoria
More informationChapter 7  Sampling Distributions. 1 Introduction. What is statistics? It consist of three major areas:
Chapter 7  Samplig Distributios 1 Itroductio What is statistics? It cosist of three major areas: Data Collectio: samplig plas ad experimetal desigs Descriptive Statistics: umerical ad graphical summaries
More informationOverview of some probability distributions.
Lecture Overview of some probability distributios. I this lecture we will review several commo distributios that will be used ofte throughtout the class. Each distributio is usually described by its probability
More informationConfidence Intervals for One Mean
Chapter 420 Cofidece Itervals for Oe Mea Itroductio This routie calculates the sample size ecessary to achieve a specified distace from the mea to the cofidece limit(s) at a stated cofidece level for a
More information1. C. The formula for the confidence interval for a population mean is: x t, which was
s 1. C. The formula for the cofidece iterval for a populatio mea is: x t, which was based o the sample Mea. So, x is guarateed to be i the iterval you form.. D. Use the rule : pvalue
More informationHere are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed.
This documet was writte ad copyrighted by Paul Dawkis. Use of this documet ad its olie versio is govered by the Terms ad Coditios of Use located at http://tutorial.math.lamar.edu/terms.asp. The olie versio
More informationINFINITE SERIES KEITH CONRAD
INFINITE SERIES KEITH CONRAD. Itroductio The two basic cocepts of calculus, differetiatio ad itegratio, are defied i terms of limits (Newto quotiets ad Riema sums). I additio to these is a third fudametal
More informationPROCEEDINGS OF THE YEREVAN STATE UNIVERSITY AN ALTERNATIVE MODEL FOR BONUSMALUS SYSTEM
PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY Physical ad Mathematical Scieces 2015, 1, p. 15 19 M a t h e m a t i c s AN ALTERNATIVE MODEL FOR BONUSMALUS SYSTEM A. G. GULYAN Chair of Actuarial Mathematics
More informationInfinite Sequences and Series
CHAPTER 4 Ifiite Sequeces ad Series 4.1. Sequeces A sequece is a ifiite ordered list of umbers, for example the sequece of odd positive itegers: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29...
More informationMeasures of Spread and Boxplots Discrete Math, Section 9.4
Measures of Spread ad Boxplots Discrete Math, Sectio 9.4 We start with a example: Example 1: Comparig Mea ad Media Compute the mea ad media of each data set: S 1 = {4, 6, 8, 10, 1, 14, 16} S = {4, 7, 9,
More informationTHE ABRACADABRA PROBLEM
THE ABRACADABRA PROBLEM FRANCESCO CARAVENNA Abstract. We preset a detailed solutio of Exercise E0.6 i [Wil9]: i a radom sequece of letters, draw idepedetly ad uiformly from the Eglish alphabet, the expected
More informationThe Stable Marriage Problem
The Stable Marriage Problem William Hut Lae Departmet of Computer Sciece ad Electrical Egieerig, West Virgiia Uiversity, Morgatow, WV William.Hut@mail.wvu.edu 1 Itroductio Imagie you are a matchmaker,
More informationFactors of sums of powers of binomial coefficients
ACTA ARITHMETICA LXXXVI.1 (1998) Factors of sums of powers of biomial coefficiets by Neil J. Cali (Clemso, S.C.) Dedicated to the memory of Paul Erdős 1. Itroductio. It is well ow that if ( ) a f,a = the
More informationTIGHT BOUNDS ON EXPECTED ORDER STATISTICS
Probability i the Egieerig ad Iformatioal Scieces, 20, 2006, 667 686+ Prited i the U+S+A+ TIGHT BOUNDS ON EXPECTED ORDER STATISTICS DIMITRIS BERTSIMAS Sloa School of Maagemet ad Operatios Research Ceter
More informationSECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES
SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES Read Sectio 1.5 (pages 5 9) Overview I Sectio 1.5 we lear to work with summatio otatio ad formulas. We will also itroduce a brief overview of sequeces,
More informationLecture 3. denote the orthogonal complement of S k. Then. 1 x S k. n. 2 x T Ax = ( ) λ x. with x = 1, we have. i = λ k x 2 = λ k.
18.409 A Algorithmist s Toolkit September 17, 009 Lecture 3 Lecturer: Joatha Keler Scribe: Adre Wibisoo 1 Outlie Today s lecture covers three mai parts: CouratFischer formula ad Rayleigh quotiets The
More informationAn example of nonquenched convergence in the conditional central limit theorem for partial sums of a linear process
A example of oqueched covergece i the coditioal cetral limit theorem for partial sums of a liear process Dalibor Volý ad Michael Woodroofe Abstract A causal liear processes X,X 0,X is costructed for which
More informationThe Euler Totient, the Möbius and the Divisor Functions
The Euler Totiet, the Möbius ad the Divisor Fuctios Rosica Dieva July 29, 2005 Mout Holyoke College South Hadley, MA 01075 1 Ackowledgemets This work was supported by the Mout Holyoke College fellowship
More informationUniversity of California, Los Angeles Department of Statistics. Distributions related to the normal distribution
Uiversity of Califoria, Los Ageles Departmet of Statistics Statistics 100B Istructor: Nicolas Christou Three importat distributios: Distributios related to the ormal distributio Chisquare (χ ) distributio.
More informationGrade 7. Strand: Number Specific Learning Outcomes It is expected that students will:
Strad: Number Specific Learig Outcomes It is expected that studets will: 7.N.1. Determie ad explai why a umber is divisible by 2, 3, 4, 5, 6, 8, 9, or 10, ad why a umber caot be divided by 0. [C, R] [C]
More informationCase Study. Normal and t Distributions. Density Plot. Normal Distributions
Case Study Normal ad t Distributios Bret Halo ad Bret Larget Departmet of Statistics Uiversity of Wiscosi Madiso October 11 13, 2011 Case Study Body temperature varies withi idividuals over time (it ca
More informationUnit 20 Hypotheses Testing
Uit 2 Hypotheses Testig Objectives: To uderstad how to formulate a ull hypothesis ad a alterative hypothesis about a populatio proportio, ad how to choose a sigificace level To uderstad how to collect
More informationAQA STATISTICS 1 REVISION NOTES
AQA STATISTICS 1 REVISION NOTES AVERAGES AND MEASURES OF SPREAD www.mathsbox.org.uk Mode : the most commo or most popular data value the oly average that ca be used for qualitative data ot suitable if
More informationAnalyzing Longitudinal Data from Complex Surveys Using SUDAAN
Aalyzig Logitudial Data from Complex Surveys Usig SUDAAN Darryl Creel Statistics ad Epidemiology, RTI Iteratioal, 312 Trotter Farm Drive, Rockville, MD, 20850 Abstract SUDAAN: Software for the Statistical
More informationPerfect Packing Theorems and the AverageCase Behavior of Optimal and Online Bin Packing
SIAM REVIEW Vol. 44, No. 1, pp. 95 108 c 2002 Society for Idustrial ad Applied Mathematics Perfect Packig Theorems ad the AverageCase Behavior of Optimal ad Olie Bi Packig E. G. Coffma, Jr. C. Courcoubetis
More information2.7 Sequences, Sequences of Sets
2.7. SEQUENCES, SEQUENCES OF SETS 67 2.7 Sequeces, Sequeces of Sets 2.7.1 Sequeces Defiitio 190 (sequece Let S be some set. 1. A sequece i S is a fuctio f : K S where K = { N : 0 for some 0 N}. 2. For
More informationA Faster ClauseShortening Algorithm for SAT with No Restriction on Clause Length
Joural o Satisfiability, Boolea Modelig ad Computatio 1 2005) 4960 A Faster ClauseShorteig Algorithm for SAT with No Restrictio o Clause Legth Evgey Datsi Alexader Wolpert Departmet of Computer Sciece
More informationThe following example will help us understand The Sampling Distribution of the Mean. C1 C2 C3 C4 C5 50 miles 84 miles 38 miles 120 miles 48 miles
The followig eample will help us uderstad The Samplig Distributio of the Mea Review: The populatio is the etire collectio of all idividuals or objects of iterest The sample is the portio of the populatio
More informationSwaps: Constant maturity swaps (CMS) and constant maturity. Treasury (CMT) swaps
Swaps: Costat maturity swaps (CMS) ad costat maturity reasury (CM) swaps A Costat Maturity Swap (CMS) swap is a swap where oe of the legs pays (respectively receives) a swap rate of a fixed maturity, while
More informationAMS 2000 subject classification. Primary 62G08, 62G20; secondary 62G99
VARIABLE SELECTION IN NONPARAMETRIC ADDITIVE MODELS Jia Huag 1, Joel L. Horowitz 2 ad Fegrog Wei 3 1 Uiversity of Iowa, 2 Northwester Uiversity ad 3 Uiversity of West Georgia Abstract We cosider a oparametric
More informationProblem Set 1 Oligopoly, market shares and concentration indexes
Advaced Idustrial Ecoomics Sprig 2016 Joha Steek 29 April 2016 Problem Set 1 Oligopoly, market shares ad cocetratio idexes 1 1 Price Competitio... 3 1.1 Courot Oligopoly with Homogeous Goods ad Differet
More informationOverview. Learning Objectives. Point Estimate. Estimation. Estimating the Value of a Parameter Using Confidence Intervals
Overview Estimatig the Value of a Parameter Usig Cofidece Itervals We apply the results about the sample mea the problem of estimatio Estimatio is the process of usig sample data estimate the value of
More informationCHAPTER 3 THE TIME VALUE OF MONEY
CHAPTER 3 THE TIME VALUE OF MONEY OVERVIEW A dollar i the had today is worth more tha a dollar to be received i the future because, if you had it ow, you could ivest that dollar ad ear iterest. Of all
More informationChapter 7: Confidence Interval and Sample Size
Chapter 7: Cofidece Iterval ad Sample Size Learig Objectives Upo successful completio of Chapter 7, you will be able to: Fid the cofidece iterval for the mea, proportio, ad variace. Determie the miimum
More informationAnnuities Under Random Rates of Interest II By Abraham Zaks. Technion I.I.T. Haifa ISRAEL and Haifa University Haifa ISRAEL.
Auities Uder Radom Rates of Iterest II By Abraham Zas Techio I.I.T. Haifa ISRAEL ad Haifa Uiversity Haifa ISRAEL Departmet of Mathematics, Techio  Israel Istitute of Techology, 3000, Haifa, Israel I memory
More informationLecture 5: Span, linear independence, bases, and dimension
Lecture 5: Spa, liear idepedece, bases, ad dimesio Travis Schedler Thurs, Sep 23, 2010 (versio: 9/21 9:55 PM) 1 Motivatio Motivatio To uderstad what it meas that R has dimesio oe, R 2 dimesio 2, etc.;
More informationDescriptive statistics deals with the description or simple analysis of population or sample data.
Descriptive statistics Some basic cocepts A populatio is a fiite or ifiite collectio of idividuals or objects. Ofte it is impossible or impractical to get data o all the members of the populatio ad a small
More information