Stéphane Boucheron 1, Olivier Bousquet 2 and Gábor Lugosi 3

ESAIM: Probability and Statistics URL: Will be set by the publisher

THEORY OF CLASSIFICATION: A SURVEY OF SOME RECENT ADVANCES

Stéphane Boucheron 1, Olivier Bousquet 2 and Gábor Lugosi 3

Abstract. The last few years have witnessed important new developments in the theory and practice of pattern classification. We intend to survey some of the main new ideas that have led to these recent results.

Résumé. The practice and theory of pattern recognition have seen important developments in recent years. This survey aims to present some of the new ideas that have led to these developments.

1991 Mathematics Subject Classification. 62G08, 60E15, 68Q32.

September 23, 2005.

Contents
1. Introduction
2. Basic model
3. Empirical risk minimization and Rademacher averages
4. Minimizing cost functions: some basic ideas behind boosting and support vector machines
  4.1. Margin-based performance bounds
  4.2. Convex cost functionals
5. Tighter bounds for empirical risk minimization
  5.1. Relative deviations
  5.2. Noise and fast rates
  5.3. Localization
  5.4. Cost functions
  5.5. Minimax lower bounds
6. PAC-Bayesian bounds
7. Stability
8. Model selection
  8.1. Oracle inequalities

Keywords and phrases: Pattern Recognition, Statistical Learning Theory, Concentration Inequalities, Empirical Processes, Model Selection.

The authors acknowledge support by the PASCAL Network of Excellence under EC grant no. The work of the third author was supported by the Spanish Ministry of Science and Technology and FEDER, grant BMF.

1 Laboratoire Probabilités et Modèles Aléatoires, CNRS & Université Paris VII, Paris, France; www.proba.jussieu.fr/~boucheron
2 Pertinence SA, 32 rue des Jeûneurs, Paris, France
3 Department of Economics, Pompeu Fabra University, Ramon Trias Fargas 25-27, Barcelona, Spain; lugosi@upf.es

© EDP Sciences, SMAI 1999

  8.2. A glimpse at model selection methods
    Naive penalization
    Ideal penalties
    Localized Rademacher complexities
    Pre-testing
    Revisiting hold-out estimates
References

1. Introduction

The last few years have witnessed important new developments in the theory and practice of pattern classification. The introduction of new and effective techniques of handling high-dimensional problems such as boosting and support vector machines has revolutionized the practice of pattern recognition. At the same time, the better understanding of the application of empirical process theory and concentration inequalities has led to effective new ways of studying these methods and provided a statistical explanation for their success. These new tools have also helped develop new model selection methods that are at the heart of many classification algorithms.

The purpose of this survey is to offer an overview of some of these theoretical tools and give the main ideas of the analysis of some of the important algorithms. This survey does not attempt to be exhaustive. The selection of the topics is largely biased by the personal taste of the authors. We also limit ourselves to describing the key ideas in a simple way, often sacrificing generality. In these cases the reader is pointed to the references for the sharpest and more general results available. References and bibliographical remarks are given at the end of each section, in an attempt to avoid interruptions in the arguments.

2. Basic model

The problem of pattern classification is about guessing or predicting the unknown class of an observation. An observation is often a collection of numerical and/or categorical measurements represented by a $d$-dimensional vector $x$ but in some cases it may even be a curve or an image. In our model we simply assume that $x \in \mathcal{X}$ where $\mathcal{X}$ is some abstract measurable space equipped with a $\sigma$-algebra. The unknown nature of the observation is called a class. It is denoted by $y$ and in the simplest case takes values in the binary set $\{-1, 1\}$. In these notes we restrict our attention to binary classification. The
reason is simplicity and that the binary problem already captures many of the main features of more general problems. Even though there is much to say about multiclass classification, this survey does not cover this growing field of research.

In classification, one creates a function $g : \mathcal{X} \to \{-1,1\}$ which represents one's guess of $y$ given $x$. The mapping $g$ is called a classifier. The classifier errs on $x$ if $g(x) \neq y$. To formalize the learning problem, we introduce a probabilistic setting, and let $(X, Y)$ be an $\mathcal{X} \times \{-1,1\}$-valued random pair, modeling observation and its corresponding class. The distribution of the random pair $(X, Y)$ may be described by the probability distribution of $X$ (given by the probabilities $\mathbb{P}\{X \in A\}$ for all measurable subsets $A$ of $\mathcal{X}$) and $\eta(x) = \mathbb{P}\{Y = 1 \mid X = x\}$. The function $\eta$ is called the a posteriori probability. We measure the performance of classifier $g$ by its probability of error $L(g) = \mathbb{P}\{g(X) \neq Y\}$. Given $\eta$, one may easily construct a classifier with minimal probability of error. In particular, it is easy to see that if we define
\[
g^*(x) = \begin{cases} 1 & \text{if } \eta(x) > 1/2 \\ -1 & \text{otherwise} \end{cases}
\]

then $L(g^*) \le L(g)$ for any classifier $g$. The minimal risk $L^* \stackrel{\mathrm{def}}{=} L(g^*)$ is called the Bayes risk (or Bayes error). More precisely, it is immediate to see that
\[
L(g) - L^* = \mathbb{E}\left[ \mathbb{1}_{\{g(X) \neq g^*(X)\}} \, |2\eta(X) - 1| \right] \ge 0 \tag{1}
\]
(see, e.g., [72]). The optimal classifier $g^*$ is often called the Bayes classifier. In the statistical model we focus on, one has access to a collection of data $(X_i, Y_i)$, $1 \le i \le n$. We assume that the data $D_n$ consists of a sequence of independent identically distributed (i.i.d.) random pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ with the same distribution as that of $(X, Y)$. A classifier is constructed on the basis of $D_n = (X_1, Y_1, \ldots, X_n, Y_n)$ and is denoted by $g_n$. Thus, the value of $Y$ is guessed by $g_n(X) = g_n(X; X_1, Y_1, \ldots, X_n, Y_n)$. The performance of $g_n$ is measured by its (conditional) probability of error
\[
L(g_n) = \mathbb{P}\{g_n(X) \neq Y \mid D_n\}.
\]
The focus of the theory (and practice) of classification is to construct classifiers $g_n$ whose probability of error is as close to $L^*$ as possible. Obviously, the whole arsenal of traditional parametric and nonparametric statistics may be used to attack this problem. However, the high-dimensional nature of many of the new applications (such as image recognition, text classification, micro-biological applications, etc.) leads to territories beyond the reach of traditional methods. Most new advances of statistical learning theory aim to face these new challenges.

Bibliographical remarks. Several textbooks, surveys, and research monographs have been written on pattern classification and statistical learning theory. A partial list includes Fukunaga [97], Duda and Hart [77], Vapnik and Chervonenkis [233], Devijver and Kittler [70], Vapnik [229, 230], Breiman, Friedman, Olshen, and Stone [53], Natarajan [175], McLachlan [169], Anthony and Biggs [10], Kearns and Vazirani [117], Devroye, Györfi, and Lugosi [72], Ripley [185], Vidyasagar [235], Kulkarni, Lugosi, and Venkatesh [128], Anthony and Bartlett [9], Duda, Hart, and Stork [78], Lugosi [144], and Mendelson [171].

3. Empirical risk minimization and Rademacher averages

A simple and natural approach to the
classification problem is to consider a class $\mathcal{C}$ of classifiers $g : \mathcal{X} \to \{-1,1\}$ and use data-based estimates of the probabilities of error $L(g)$ to select a classifier from the class. The most natural choice to estimate the probability of error $L(g) = \mathbb{P}\{g(X) \neq Y\}$ is the error count
\[
L_n(g) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{\{g(X_i) \neq Y_i\}}.
\]
$L_n(g)$ is called the empirical error of the classifier $g$.

First we outline the basics of the theory of empirical risk minimization (i.e., the classification analog of M-estimation). Denote by $g_n^*$ the classifier that minimizes the estimated probability of error over the class: $L_n(g_n^*) \le L_n(g)$ for all $g \in \mathcal{C}$. Then the probability of error $L(g_n^*) = \mathbb{P}\{g_n^*(X) \neq Y \mid D_n\}$ of the selected rule is easily seen to satisfy the elementary inequalities
\[
L(g_n^*) - \inf_{g \in \mathcal{C}} L(g) \le 2 \sup_{g \in \mathcal{C}} |L_n(g) - L(g)|, \qquad L(g_n^*) \le L_n(g_n^*) + \sup_{g \in \mathcal{C}} |L_n(g) - L(g)|. \tag{2}
\]
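As a toy illustration of empirical risk minimization over a finite class, the following sketch selects a one-dimensional threshold classifier by minimizing the error count $L_n(g)$. All data and the threshold class below are hypothetical choices for illustration, not taken from the text.

```python
import random

random.seed(0)
# Sketch: empirical risk minimization over a finite class of 1-D threshold
# classifiers g_t(x) = 1 if x > t else -1 (hypothetical example data).
n = 200
X = [random.random() for _ in range(n)]
Y = [1 if x > 0.3 else -1 for x in X]        # noiseless labels, true threshold 0.3

thresholds = [i / 20 for i in range(21)]      # the finite class C

def empirical_error(t):
    # L_n(g_t) = (1/n) * #{i : g_t(X_i) != Y_i}
    return sum(1 for x, y in zip(X, Y) if (1 if x > t else -1) != y) / n

t_hat = min(thresholds, key=empirical_error)  # the empirical risk minimizer
```

Since the true threshold $0.3$ lies on the grid, the minimizer attains zero empirical error here; with label noise $L_n(g_n^*)$ would be positive, and inequality (2) controls how far $L(g_n^*)$ can then be from the best risk in the class.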

We see that by guaranteeing that the uniform deviation $\sup_{g \in \mathcal{C}} |L_n(g) - L(g)|$ of estimated probabilities from their true values is small, we make sure that the probability of error of the selected classifier $g_n^*$ is not much larger than the best probability of error in the class $\mathcal{C}$ and at the same time the empirical estimate $L_n(g_n^*)$ is also good. It is important to note at this point that bounding the excess risk by the maximal deviation as in (2) is quite loose in many situations. In Section 5 we survey some ways of obtaining improved bounds. On the other hand, the simple inequality above offers a convenient way of understanding some of the basic principles and it is even sharp in a certain minimax sense, see Section 5.5.

Clearly, the random variable $n L_n(g)$ is binomially distributed with parameters $n$ and $L(g)$. Thus, to obtain bounds for the success of empirical error minimization, we need to study uniform deviations of binomial random variables from their means. We formulate the problem in a somewhat more general way as follows. Let $X_1, \ldots, X_n$ be independent, identically distributed random variables taking values in some set $\mathcal{X}$ and let $\mathcal{F}$ be a class of bounded functions $\mathcal{X} \to [-1, 1]$. Denoting expectation and empirical averages by $Pf = \mathbb{E} f(X_1)$ and $P_n f = (1/n) \sum_{i=1}^n f(X_i)$, we are interested in upper bounds for the maximal deviation
\[
\sup_{f \in \mathcal{F}} (Pf - P_n f).
\]
Concentration inequalities are among the basic tools in studying such deviations. The simplest, yet quite powerful exponential concentration inequality is the bounded differences inequality.

Theorem 3.1 (bounded differences inequality). Let $g : \mathcal{X}^n \to \mathbb{R}$ be a function of $n$ variables such that for some nonnegative constants $c_1, \ldots, c_n$,
\[
\sup_{x_1, \ldots, x_n, \, x_i' \in \mathcal{X}} \left| g(x_1, \ldots, x_n) - g(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n) \right| \le c_i, \qquad 1 \le i \le n.
\]
Let $X_1, \ldots, X_n$ be independent random variables. Then the random variable $Z = g(X_1, \ldots, X_n)$ satisfies
\[
\mathbb{P}\{|Z - \mathbb{E} Z| > t\} \le 2 e^{-2t^2 / C}
\]
where $C = \sum_{i=1}^n c_i^2$.

The bounded differences assumption means that if the $i$-th variable of $g$ is changed while keeping all the others fixed, the value of the function cannot change by more than $c_i$. Our main example for such a function is $Z = \sup_{f \in \mathcal{F}} (Pf - P_n f)$.
Obviously, $Z$ satisfies the bounded differences assumption with $c_i = 2/n$ and therefore, for any $\delta \in (0,1)$, with probability at least $1 - \delta$,
\[
\sup_{f \in \mathcal{F}} (Pf - P_n f) \le \mathbb{E} \sup_{f \in \mathcal{F}} (Pf - P_n f) + \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}. \tag{3}
\]
This concentration result allows us to focus on the expected value, which can be bounded conveniently by a simple symmetrization device. Introduce a ghost sample $X_1', \ldots, X_n'$, independent of the $X_i$ and distributed identically. If $P_n' f = (1/n) \sum_{i=1}^n f(X_i')$ denotes the empirical averages measured on the ghost sample, then by Jensen's inequality,
\[
\mathbb{E} \sup_{f \in \mathcal{F}} (Pf - P_n f) = \mathbb{E} \sup_{f \in \mathcal{F}} \mathbb{E}\left[ P_n' f - P_n f \mid X_1, \ldots, X_n \right] \le \mathbb{E} \sup_{f \in \mathcal{F}} (P_n' f - P_n f).
\]
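A quick simulation (with parameters chosen arbitrarily for illustration) can sanity-check the bounded differences tail bound in its simplest instance, $Z = \frac{1}{n}\sum_i X_i$ with $X_i \in [0,1]$, where each $c_i = 1/n$ and hence $C = 1/n$:

```python
import random
from math import exp

random.seed(1)
# For Z = mean of n iid uniform[0,1] variables, each c_i = 1/n, so C = 1/n and
# the bounded differences inequality gives P{|Z - EZ| > t} <= 2*exp(-2*n*t^2).
n, t, reps = 100, 0.1, 5000
tail = sum(
    abs(sum(random.random() for _ in range(n)) / n - 0.5) > t
    for _ in range(reps)
) / reps
bound = 2 * exp(-2 * n * t ** 2)  # here 2*e^{-2}, about 0.27
```

The empirical tail frequency sits far below the bound, as expected: the bounded differences inequality is distribution-free and therefore not tight for this particular choice.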

Let now $\sigma_1, \ldots, \sigma_n$ be independent (Rademacher) random variables with $\mathbb{P}\{\sigma_i = 1\} = \mathbb{P}\{\sigma_i = -1\} = 1/2$, independent of the $X_i$ and $X_i'$. Then
\[
\mathbb{E} \sup_{f \in \mathcal{F}} (P_n' f - P_n f) = \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \left( f(X_i') - f(X_i) \right) = \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i \left( f(X_i') - f(X_i) \right) \le 2\, \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i).
\]
Let $A \subset \mathbb{R}^n$ be a bounded set of vectors $a = (a_1, \ldots, a_n)$, and introduce the quantity
\[
R_n(A) = \mathbb{E} \sup_{a \in A} \frac{1}{n} \sum_{i=1}^n \sigma_i a_i.
\]
$R_n(A)$ is called the Rademacher average associated with $A$. For a given sequence $x_1, \ldots, x_n \in \mathcal{X}$, we write $\mathcal{F}(x_1^n)$ for the class of $n$-vectors $(f(x_1), \ldots, f(x_n))$ with $f \in \mathcal{F}$. Thus, using this notation, we have deduced the following.

Theorem 3.2. With probability at least $1 - \delta$,
\[
\sup_{f \in \mathcal{F}} (Pf - P_n f) \le 2\, \mathbb{E} R_n(\mathcal{F}(X_1^n)) + \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}.
\]
We also have
\[
\sup_{f \in \mathcal{F}} (Pf - P_n f) \le 2 R_n(\mathcal{F}(X_1^n)) + \sqrt{\frac{2 \log \frac{2}{\delta}}{n}}.
\]

The second statement follows simply by noticing that the random variable $R_n(\mathcal{F}(X_1^n))$ satisfies the conditions of the bounded differences inequality. The second inequality is our first data-dependent performance bound. It involves the Rademacher average of the coordinate projection of $\mathcal{F}$ given by the data $X_1, \ldots, X_n$. Given the data, one may compute the Rademacher average, for example, by Monte Carlo integration. Note that for a given choice of the random signs $\sigma_1, \ldots, \sigma_n$, the computation of $\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i)$ is equivalent to minimizing $-\sum_{i=1}^n \sigma_i f(X_i)$ over $f \in \mathcal{F}$ and therefore it is computationally equivalent to empirical risk minimization. $R_n(\mathcal{F}(X_1^n))$ measures the richness of the class $\mathcal{F}$ and provides a sharp estimate for the maximal deviations. In fact, one may prove that
\[
\frac{1}{2}\, \mathbb{E} R_n(\mathcal{F}(X_1^n)) - \frac{1}{2\sqrt{n}} \le \mathbb{E} \sup_{f \in \mathcal{F}} |Pf - P_n f| \le 2\, \mathbb{E} R_n(\mathcal{F}(X_1^n))
\]
(see, e.g., van der Vaart and Wellner [227]).

Next we recall some of the simple structural properties of Rademacher averages.

Theorem 3.3 (properties of Rademacher averages). Let $A, B$ be bounded subsets of $\mathbb{R}^n$ and let $c \in \mathbb{R}$ be a constant. Then
\[
R_n(A \cup B) \le R_n(A) + R_n(B), \qquad R_n(c \cdot A) = |c|\, R_n(A), \qquad R_n(A \oplus B) \le R_n(A) + R_n(B)
\]

where $c \cdot A = \{ca : a \in A\}$ and $A \oplus B = \{a + b : a \in A, b \in B\}$. Moreover, if $A = \{a^{(1)}, \ldots, a^{(N)}\} \subset \mathbb{R}^n$ is a finite set, then
\[
R_n(A) \le \max_{j=1,\ldots,N} \|a^{(j)}\| \, \frac{\sqrt{2 \log N}}{n} \tag{4}
\]
where $\|\cdot\|$ denotes the Euclidean norm. If
\[
\mathrm{absconv}(A) = \left\{ \sum_{j=1}^N c_j a^{(j)} : N \in \mathbb{N}, \ \sum_{j=1}^N |c_j| \le 1, \ a^{(j)} \in A \right\}
\]
is the absolute convex hull of $A$, then
\[
R_n(A) = R_n(\mathrm{absconv}(A)). \tag{5}
\]
Finally, the contraction principle states that if $\phi : \mathbb{R} \to \mathbb{R}$ is a function with $\phi(0) = 0$ and Lipschitz constant $L_\phi$, and $\phi \circ A$ is the set of vectors of the form $(\phi(a_1), \ldots, \phi(a_n)) \in \mathbb{R}^n$ with $a \in A$, then $R_n(\phi \circ A) \le L_\phi R_n(A)$.

Proof. The first three properties are immediate from the definition. Inequality (4) follows by Hoeffding's inequality, which states that if $X$ is a bounded zero-mean random variable taking values in an interval $[\alpha, \beta]$, then for any $s > 0$, $\mathbb{E} \exp(sX) \le \exp\left( s^2 (\beta - \alpha)^2 / 8 \right)$. In particular, by independence,
\[
\mathbb{E} \exp\left( \frac{s}{n} \sum_{i=1}^n \sigma_i a_i \right) = \prod_{i=1}^n \mathbb{E} \exp\left( \frac{s}{n} \sigma_i a_i \right) \le \prod_{i=1}^n \exp\left( \frac{s^2 a_i^2}{2n^2} \right) = \exp\left( \frac{s^2 \|a\|^2}{2n^2} \right).
\]
This implies that
\[
e^{s R_n(A)} = e^{s\, \mathbb{E} \max_{j} \frac{1}{n} \sum_i \sigma_i a_i^{(j)}} \le \mathbb{E} \max_{j=1,\ldots,N} e^{\frac{s}{n} \sum_i \sigma_i a_i^{(j)}} \le \sum_{j=1}^N \mathbb{E}\, e^{\frac{s}{n} \sum_i \sigma_i a_i^{(j)}} \le N \max_{j=1,\ldots,N} \exp\left( \frac{s^2 \|a^{(j)}\|^2}{2n^2} \right).
\]
Taking the logarithm of both sides, dividing by $s$, and choosing $s$ to minimize the obtained upper bound for $R_n(A)$, we arrive at (4). The identity (5) is easily seen from the definition. For a proof of the contraction principle, see Ledoux and Talagrand [133].

Often it is useful to derive further upper bounds on Rademacher averages. As an illustration, we consider the case when $\mathcal{F}$ is a class of indicator functions. Recall that this is the case in our motivating example in the classification problem described above when each $f \in \mathcal{F}$ is the indicator function of a set of the form $\{(x, y) : g(x) \neq y\}$. In such a case, for any collection of points $x_1^n = (x_1, \ldots, x_n)$, $\mathcal{F}(x_1^n)$ is a finite subset of $\mathbb{R}^n$ whose cardinality is denoted by $S_{\mathcal{F}}(x_1^n)$ and is called the VC shatter coefficient (where VC stands for Vapnik-Chervonenkis). Obviously, $S_{\mathcal{F}}(x_1^n) \le 2^n$. By inequality (4), we have, for all $x_1^n$,
\[
R_n(\mathcal{F}(x_1^n)) \le \sqrt{\frac{2 \log S_{\mathcal{F}}(x_1^n)}{n}} \tag{6}
\]
where we used the fact that for each $f \in \mathcal{F}$, $\sum_i f(x_i)^2 \le n$. In
particular,
\[
\mathbb{E} \sup_{f \in \mathcal{F}} |Pf - P_n f| \le 2\, \mathbb{E} \sqrt{\frac{2 \log S_{\mathcal{F}}(X_1^n)}{n}}.
\]
The logarithm of the VC shatter coefficient may be upper bounded in terms of a combinatorial quantity, called the VC dimension. If $A \subset \{-1, 1\}^n$, then the VC dimension of $A$ is the size $V$ of the largest set of indices

$\{i_1, \ldots, i_V\} \subset \{1, \ldots, n\}$ such that for each binary $V$-vector $b = (b_1, \ldots, b_V) \in \{-1, 1\}^V$ there exists an $a = (a_1, \ldots, a_n) \in A$ such that $(a_{i_1}, \ldots, a_{i_V}) = b$. The key inequality establishing a relationship between shatter coefficients and VC dimension is known as Sauer's lemma, which states that the cardinality of any set $A \subset \{-1, 1\}^n$ may be upper bounded as
\[
|A| \le \sum_{i=0}^V \binom{n}{i} \le (n+1)^V
\]
where $V$ is the VC dimension of $A$. In particular,
\[
\log S_{\mathcal{F}}(x_1^n) \le V(x_1^n) \log(n+1)
\]
where we denote by $V(x_1^n)$ the VC dimension of $\mathcal{F}(x_1^n)$. Thus, the expected maximal deviation $\mathbb{E} \sup_{f \in \mathcal{F}} |Pf - P_n f|$ may be upper bounded by $2\, \mathbb{E} \sqrt{2 V(X_1^n) \log(n+1)/n}$. To obtain distribution-free upper bounds, introduce the VC dimension of a class of binary functions $\mathcal{F}$, defined by
\[
V = \sup_{n,\, x_1^n} V(x_1^n).
\]
Then we obtain the following version of what has been known as the Vapnik-Chervonenkis inequality.

Theorem 3.4 (Vapnik-Chervonenkis inequality). For all distributions one has
\[
\mathbb{E} \sup_{f \in \mathcal{F}} (Pf - P_n f) \le 2 \sqrt{\frac{2 V \log(n+1)}{n}}.
\]
Also, for a universal constant $C$,
\[
\mathbb{E} \sup_{f \in \mathcal{F}} (Pf - P_n f) \le C \sqrt{\frac{V}{n}}.
\]

The second inequality, which allows one to remove the logarithmic factor, follows from a somewhat refined analysis (called chaining). The VC dimension is an important combinatorial parameter of the class and many of its properties are well known. Here we just recall one useful result and refer the reader to the references for further study: let $\mathcal{G}$ be an $m$-dimensional vector space of real-valued functions defined on $\mathcal{X}$. Then the class of indicator functions
\[
\mathcal{F} = \left\{ f(x) = \mathbb{1}_{\{g(x) \ge 0\}} : g \in \mathcal{G} \right\}
\]
has VC dimension $V \le m$.

Bibliographical remarks. Uniform deviations of averages from their expectations is one of the central problems of empirical process theory. Here we merely refer to some of the comprehensive coverages, such as Shorack and Wellner [199], Giné [98], van der Vaart and Wellner [227], Vapnik [231], Dudley [83]. The use of empirical processes in classification was pioneered by Vapnik and Chervonenkis [232, 233] and re-discovered 20 years later by Blumer, Ehrenfeucht, Haussler, and Warmuth [41], Ehrenfeucht, Haussler, Kearns, and Valiant [88]. For surveys see Natarajan [175], Devroye
[71], Anthony and Biggs [10], Kearns and Vazirani [117], Vapnik [230, 231], Devroye, Györfi, and Lugosi [72], Ripley [185], Vidyasagar [235], Anthony and Bartlett [9]. The bounded differences inequality was first formulated explicitly by McDiarmid [166] (see also the survey [167]). The martingale methods used by McDiarmid had appeared in early work of Hoeffding [109], Azuma [18], Yurinskii [242, 243], Milman and Schechtman [174]. Closely related concentration results have been obtained in various ways including information-theoretic methods (see Ahlswede, Gács, and Körner [1], Marton [154],

[155], [156], Dembo [69], Massart [158] and Rio [183]), Talagrand's induction method [217], [213], [216] (see also McDiarmid [168], Luczak and McDiarmid [143], Panchenko [ ]) and the so-called entropy method, based on logarithmic Sobolev inequalities, developed by Ledoux [132], [131], see also Bobkov and Ledoux [42], Massart [159], Rio [183], Boucheron, Lugosi, and Massart [45, 46], Bousquet [47], and Boucheron, Bousquet, Lugosi, and Massart [44]. Symmetrization was at the basis of the original arguments of Vapnik and Chervonenkis [232, 233]. We learnt the simple symmetrization trick shown above from Giné and Zinn [99] but different forms of symmetrization have been at the core of obtaining related results of similar flavor, see also Anthony and Shawe-Taylor [11], Cannon, Ettinger, Hush, and Scovel [55], Herbrich and Williamson [108], Mendelson and Philips [172]. The use of Rademacher averages in classification was first promoted by Koltchinskii [124] and Bartlett, Boucheron, and Lugosi [24], see also Koltchinskii and Panchenko [126, 127], Bartlett and Mendelson [29], Bartlett, Bousquet, and Mendelson [25], Bousquet, Koltchinskii, and Panchenko [50], Kégl, Linder, and Lugosi [13], Mendelson [170]. Hoeffding's inequality appears in [109]. For a proof of the contraction principle we refer to Ledoux and Talagrand [133]. Sauer's lemma was proved independently by Sauer [189], Shelah [198], and Vapnik and Chervonenkis [232]. For related combinatorial results we refer to Frankl [90], Haussler [106], Alesker [7], Alon, Ben-David, Cesa-Bianchi, and Haussler [8], Szarek and Talagrand [210], Cesa-Bianchi and Haussler [60], Mendelson and Vershynin [173], [188]. The second inequality of Theorem 3.4 is based on the method of chaining, and was first proved by Dudley [81]. The question of how $\sup_{f \in \mathcal{F}} (Pf - P_n f)$ behaves has been known as the Glivenko-Cantelli problem and much has been said about it. A few key references include Vapnik and Chervonenkis [232, 234], Dudley [79, 81, 82], Talagrand [211, 212, 214, 218], Dudley, Giné, and Zinn [84], Alon, Ben-David, Cesa-Bianchi, and Haussler
[8], Li, Long, and Srinivasan [138], Mendelson and Vershynin [173]. The VC dimension has been widely studied and many of its properties are known. We refer to Cover [63], Dudley [80, 83], Steele [204], Wenocur and Dudley [238], Assouad [15], Khovanskii [118], Macintyre and Sontag [149], Goldberg and Jerrum [101], Karpinski and Macintyre [114], Koiran and Sontag [121], Anthony and Bartlett [9], and Bartlett and Maass [28].

4. Minimizing cost functions: some basic ideas behind boosting and support vector machines

The results summarized in the previous section reveal that minimizing the empirical risk $L_n(g)$ over a class $\mathcal{C}$ of classifiers with a VC dimension much smaller than the sample size $n$ is guaranteed to work well. This result has two fundamental problems. First, by requiring that the VC dimension be small, one imposes serious limitations on the approximation properties of the class. In particular, even though the probability of error $L(g_n^*)$ of the empirical risk minimizer is close to the smallest probability of error $\inf_{g \in \mathcal{C}} L(g)$ in the class, $\inf_{g \in \mathcal{C}} L(g) - L^*$ may be very large. The other problem is algorithmic: minimizing the empirical probability of misclassification $L_n(g)$ is very often a computationally difficult problem. Even in seemingly simple cases, for example when $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{C}$ is the class of classifiers that split the space of observations by a hyperplane, the minimization problem is NP-hard.

The computational difficulty of learning problems deserves some more attention. Let us consider in more detail the problem in the case of half-spaces. Formally, we are given a sample, that is a sequence of vectors $(x_1, \ldots, x_n)$ from $\mathbb{R}^d$ and a sequence of labels $(y_1, \ldots, y_n)$ from $\{-1, 1\}$, and in order to minimize the empirical misclassification risk we are asked to find $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ so as to minimize
\[
\# \left\{ k : y_k \left( \langle w, x_k \rangle - b \right) \le 0 \right\}.
\]
Without loss of generality, the vectors constituting the sample are assumed to have rational coefficients, and the size of the data is the sum of the bit lengths of the vectors making the sample. Not only minimizing the number

of misclassification errors has been proved to be at least as hard as solving any NP-complete problem, but even approximately minimizing the number of misclassification errors within a constant factor of the optimum has been shown to be NP-hard. This means that, unless P = NP, we will not be able to build a computationally efficient empirical risk minimizer for half-spaces that works for all input space dimensions. If the input space dimension $d$ is fixed, an algorithm running in $O(n^{d-1} \log n)$ steps enumerates the trace of half-spaces on a sample of length $n$. This allows an exhaustive search for the empirical risk minimizer. Such a possibility should be considered with circumspection since its range of applications would not extend much beyond problems where the input dimension is less than 5.

4.1. Margin-based performance bounds

An attempt to solve both of these problems is to modify the empirical functional to be minimized by introducing a cost function. Next we describe the main ideas of empirical minimization of cost functionals and its analysis. We consider classifiers of the form
\[
g_f(x) = \begin{cases} 1 & \text{if } f(x) \ge 0 \\ -1 & \text{otherwise} \end{cases}
\]
where $f : \mathcal{X} \to \mathbb{R}$ is a real-valued function. In such a case the probability of error of $g_f$ may be written as
\[
L(g_f) = \mathbb{P}\{\operatorname{sgn}(f(X)) \neq Y\} = \mathbb{E}\, \mathbb{1}_{\{f(X) Y < 0\}}.
\]
To lighten notation we will simply write $L(f) = L(g_f)$. Let $\phi : \mathbb{R} \to \mathbb{R}_+$ be a nonnegative cost function such that $\phi(x) \ge \mathbb{1}_{\{x > 0\}}$. (Typical choices of $\phi$ include $\phi(x) = e^x$, $\phi(x) = \log_2(1 + e^x)$, and $\phi(x) = (1 + x)_+$.) Introduce the cost functional and its empirical version by
\[
A(f) = \mathbb{E}\, \phi(-f(X) Y) \quad \text{and} \quad A_n(f) = \frac{1}{n} \sum_{i=1}^n \phi(-f(X_i) Y_i).
\]
Obviously, $L(f) \le A(f)$ and $L_n(f) \le A_n(f)$.

Theorem 4.1. Assume that the function $f_n$ is chosen from a class $\mathcal{F}$ based on the data $(Z_1, \ldots, Z_n) \stackrel{\mathrm{def}}{=} (X_1, Y_1), \ldots, (X_n, Y_n)$. Let $B$ denote a uniform upper bound on $\phi(-f(x) y)$ and let $L_\phi$ be the Lipschitz constant of $\phi$. Then the probability of error of the corresponding classifier may be bounded, with probability at least $1 - \delta$, by
\[
L(f_n) \le A_n(f_n) + 2 L_\phi\, \mathbb{E} R_n(\mathcal{F}(X_1^n)) + B \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}.
\]
Thus, the Rademacher average of the class of real-valued functions $f$ bounds the
performance of the classifier.
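The three typical cost functions listed above each dominate the 0-1 indicator, which is exactly what makes $L(f) \le A(f)$ work. A quick numeric check over an arbitrarily chosen grid:

```python
from math import exp, log2

# The typical surrogate costs from the text, each satisfying phi(x) >= 1_{x>0}
# (and phi(0) = 1 for all three).
costs = {
    "exponential": lambda x: exp(x),
    "logit": lambda x: log2(1 + exp(x)),
    "hinge": lambda x: max(0.0, 1 + x),
}
grid = [i / 10 for i in range(-50, 51)]
dominates = all(
    phi(x) >= (1.0 if x > 0 else 0.0) for phi in costs.values() for x in grid
)
```

Since $\phi(-f(X_i)Y_i) \ge \mathbb{1}_{\{-f(X_i)Y_i > 0\}}$ pointwise, averaging gives $L_n(f) \le A_n(f)$, and taking expectations gives $L(f) \le A(f)$.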

Proof. The proof is similar to the argument of the previous section:
\[
L(f_n) \le A(f_n) \le A_n(f_n) + \sup_{f \in \mathcal{F}} \left( A(f) - A_n(f) \right) \le A_n(f_n) + 2\, \mathbb{E} R_n(\phi \circ \mathcal{H}(Z_1^n)) + B \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}
\]
(where $\mathcal{H}$ is the class of functions $\mathcal{X} \times \{-1,1\} \to \mathbb{R}$ of the form $-f(x) y$, $f \in \mathcal{F}$)
\[
\le A_n(f_n) + 2 L_\phi\, \mathbb{E} R_n(\mathcal{H}(Z_1^n)) + B \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}
\]
(by the contraction principle of Theorem 3.3)
\[
= A_n(f_n) + 2 L_\phi\, \mathbb{E} R_n(\mathcal{F}(X_1^n)) + B \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}.
\]

4.1.1. Weighted voting schemes

In many applications such as boosting and bagging, classifiers are combined by weighted voting schemes, which means that the classification rule is obtained by means of functions $f$ from a class
\[
\mathcal{F}_\lambda = \left\{ f(x) = \sum_{j=1}^N c_j g_j(x) : N \in \mathbb{N}, \ \sum_{j=1}^N |c_j| \le \lambda, \ g_1, \ldots, g_N \in \mathcal{C} \right\} \tag{7}
\]
where $\mathcal{C}$ is a class of base classifiers, that is, functions defined on $\mathcal{X}$, taking values in $\{-1, 1\}$. A classifier of this form may be thought of as one that, upon observing $x$, takes a weighted vote of the classifiers $g_1, \ldots, g_N$ (using the weights $c_1, \ldots, c_N$) and decides according to the weighted majority. In this case, by (5) and (6) we have
\[
R_n(\mathcal{F}_\lambda(X_1^n)) \le \lambda R_n(\mathcal{C}(X_1^n)) \le \lambda \sqrt{\frac{2 V_{\mathcal{C}} \log(n+1)}{n}}
\]
where $V_{\mathcal{C}}$ is the VC dimension of the base class.

To understand the richness of classes formed by weighted averages of classifiers from a base class, just consider the simple one-dimensional example in which the base class $\mathcal{C}$ contains all classifiers of the form $g(x) = 2\, \mathbb{1}_{\{x \ge a\}} - 1$, $a \in \mathbb{R}$. Then $V_{\mathcal{C}} = 1$ and the closure of $\mathcal{F}_\lambda$ (under the $L_\infty$ norm) is the set of all functions of total variation bounded by $2\lambda$. Thus, $\mathcal{F}_\lambda$ is rich in the sense that any classifier may be approximated by classifiers associated with the functions in $\mathcal{F}_\lambda$. In particular, the VC dimension of the class of all classifiers induced by functions in $\mathcal{F}_\lambda$ is infinite. For such large classes of classifiers it is impossible to guarantee that $L(f_n)$ exceeds the minimal risk in the class by something of the order of $n^{-1/2}$ (see Section 5.5). However, $L(f_n)$ may be made as small as the minimum of the cost functional $A(f)$ over the class plus $O(n^{-1/2})$. Summarizing, we have obtained that if $\mathcal{F}_\lambda$ is of the form indicated above, then for any function $f_n$ chosen from $\mathcal{F}_\lambda$ in a data-based manner,
the probability of error of the associated classifier satisfies, with probability at least $1 - \delta$,
\[
L(f_n) \le A_n(f_n) + 2 L_\phi \lambda \sqrt{\frac{2 V_{\mathcal{C}} \log(n+1)}{n}} + B \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}. \tag{8}
\]
The remarkable fact about this inequality is that the upper bound only involves the VC dimension of the class $\mathcal{C}$ of base classifiers, which is typically small. The price we pay is that the first term on the right-hand side is

the empirical cost functional instead of the empirical probability of error. As a first illustration, consider the example when $\gamma$ is a fixed positive parameter and
\[
\phi(x) = \begin{cases} 0 & \text{if } x \le -\gamma \\ 1 & \text{if } x > 0 \\ 1 + x/\gamma & \text{otherwise.} \end{cases}
\]
In this case $B = 1$ and $L_\phi = 1/\gamma$. Notice also that $\mathbb{1}_{\{x > 0\}} \le \phi(x) \le \mathbb{1}_{\{x > -\gamma\}}$ and therefore $A_n(f) \le L_n^\gamma(f)$ where $L_n^\gamma(f)$ is the so-called margin error defined by
\[
L_n^\gamma(f) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{\{f(X_i) Y_i < \gamma\}}.
\]
Notice that for all $\gamma > 0$, $L_n^\gamma(f) \ge L_n(f)$ and that $L_n^\gamma(f)$ is increasing in $\gamma$. An interpretation of the margin error $L_n^\gamma(f)$ is that it counts, apart from the number of misclassified pairs $(X_i, Y_i)$, also those which are well classified but only with a small confidence (or margin) by $f$. Thus, (8) implies the following margin-based bound for the risk.

Corollary 4.2. For any $\gamma > 0$, with probability at least $1 - \delta$,
\[
L(f_n) \le L_n^\gamma(f_n) + \frac{2\lambda}{\gamma} \sqrt{\frac{2 V_{\mathcal{C}} \log(n+1)}{n}} + \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}. \tag{9}
\]

Notice that, as $\gamma$ grows, the first term of the sum increases, while the second decreases. The bound can be very useful whenever a classifier has a small margin error for a relatively large $\gamma$ (i.e., if the classifier classifies the training data well with high confidence) since the second term only depends on the VC dimension of the small base class $\mathcal{C}$. This result has been used to explain the good behavior of some voting methods such as AdaBoost, since these methods have a tendency to find classifiers that classify the data points well with a large margin.

4.1.2. Kernel methods

Another popular way to obtain classification rules from a class of real-valued functions, used in kernel methods such as Support Vector Machines (SVM) or Kernel Fisher Discriminant (KFD), is to consider balls of a reproducing kernel Hilbert space. The basic idea is to use a positive definite kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, that is, a symmetric function satisfying
\[
\sum_{i,j=1}^n \alpha_i \alpha_j k(x_i, x_j) \ge 0
\]
for all choices of $n$, $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ and $x_1, \ldots, x_n \in \mathcal{X}$. Such a function naturally generates a space of functions of the form
\[
\mathcal{F} = \left\{ f(\cdot) = \sum_{i=1}^n \alpha_i k(x_i, \cdot) : n \in \mathbb{N}, \ \alpha_i \in \mathbb{R}, \ x_i \in \mathcal{X} \right\}
\]
which, with the inner product
\[
\left\langle \sum_i \alpha_i k(x_i, \cdot), \sum_j \beta_j k(x_j, \cdot) \right\rangle \stackrel{\mathrm{def}}{=} \sum_{i,j} \alpha_i \beta_j k(x_i, x_j),
\]
can be completed into a Hilbert space. The key property is that for all $x_1, x_2 \in \mathcal{X}$ there exist elements $f_{x_1}, f_{x_2} \in \mathcal{F}$ such that $k(x_1, x_2) = \langle f_{x_1}, f_{x_2} \rangle$. This means that any linear algorithm based on computing inner products can be extended into a non-linear version by replacing the inner products by a kernel function. The advantage is that even though the algorithm remains of low complexity, it works in a class of functions that can potentially represent any continuous function arbitrarily well (provided $k$ is chosen appropriately).
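The positive definiteness required of $k$ can be spot-checked numerically. Here the Gaussian kernel $k(x, z) = e^{-(x-z)^2}$ serves as a standard example (it is a choice made for this sketch, not one fixed by the text):

```python
import random
from math import exp

random.seed(2)
# Spot-check positive definiteness of the Gaussian kernel:
# sum_{i,j} a_i a_j k(x_i, x_j) >= 0 for random points and coefficients.
k = lambda x, z: exp(-(x - z) ** 2)
worst = float("inf")
for _ in range(200):
    xs = [random.uniform(-3.0, 3.0) for _ in range(8)]
    a = [random.uniform(-1.0, 1.0) for _ in range(8)]
    q = sum(a[i] * a[j] * k(xs[i], xs[j]) for i in range(8) for j in range(8))
    worst = min(worst, q)
```

Such a check can of course only refute positive definiteness, never prove it; for the Gaussian kernel positive definiteness is a classical fact.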

Algorithms working with kernels usually perform minimization of a cost functional on a ball of the associated reproducing kernel Hilbert space of the form
\[
\mathcal{F}_\lambda = \left\{ f(x) = \sum_{j=1}^N c_j k(x_j, x) : N \in \mathbb{N}, \ \sum_{i,j=1}^N c_i c_j k(x_i, x_j) \le \lambda^2, \ x_1, \ldots, x_N \in \mathcal{X} \right\}. \tag{10}
\]
Notice that, in contrast with (7) where the constraint is of $\ell_1$ type, the constraint here is of $\ell_2$ type. Also, the basis functions, instead of being chosen from a fixed class, are determined by elements of $\mathcal{X}$ themselves. An important property of functions in the reproducing kernel Hilbert space associated with $k$ is that for all $x \in \mathcal{X}$,
\[
f(x) = \langle f, k(x, \cdot) \rangle.
\]
This is called the reproducing property. The reproducing property may be used to estimate precisely the Rademacher average of $\mathcal{F}_\lambda$. Indeed, denoting by $\mathbb{E}_\sigma$ expectation with respect to the Rademacher variables $\sigma_1, \ldots, \sigma_n$, we have
\[
R_n(\mathcal{F}_\lambda(X_1^n)) = \frac{1}{n}\, \mathbb{E}_\sigma \sup_{\|f\| \le \lambda} \sum_{i=1}^n \sigma_i f(X_i) = \frac{1}{n}\, \mathbb{E}_\sigma \sup_{\|f\| \le \lambda} \left\langle f, \sum_{i=1}^n \sigma_i k(X_i, \cdot) \right\rangle = \frac{\lambda}{n}\, \mathbb{E}_\sigma \left\| \sum_{i=1}^n \sigma_i k(X_i, \cdot) \right\|
\]
by the Cauchy-Schwarz inequality, where $\|\cdot\|$ denotes the norm in the reproducing kernel Hilbert space. The Kahane-Khinchine inequality states that for any vectors $a_1, \ldots, a_n$ in a Hilbert space,
\[
\frac{1}{2}\, \mathbb{E} \left\| \sum_{i=1}^n \sigma_i a_i \right\|^2 \le \left( \mathbb{E} \left\| \sum_{i=1}^n \sigma_i a_i \right\| \right)^2 \le \mathbb{E} \left\| \sum_{i=1}^n \sigma_i a_i \right\|^2.
\]
It is also easy to see that
\[
\mathbb{E} \left\| \sum_{i=1}^n \sigma_i a_i \right\|^2 = \mathbb{E} \sum_{i,j=1}^n \sigma_i \sigma_j \langle a_i, a_j \rangle = \sum_{i=1}^n \|a_i\|^2,
\]
so we obtain
\[
\frac{\lambda}{\sqrt{2}\, n} \sqrt{\sum_{i=1}^n k(X_i, X_i)} \le R_n(\mathcal{F}_\lambda(X_1^n)) \le \frac{\lambda}{n} \sqrt{\sum_{i=1}^n k(X_i, X_i)}.
\]
This is very nice as it gives a bound that can be computed very easily from the data. A reasoning similar to the one leading to (9), using the bounded differences inequality to replace the Rademacher average by its empirical version, gives the following.

Corollary 4.3. Let $f_n$ be any function chosen from the ball $\mathcal{F}_\lambda$. Then, with probability at least $1 - \delta$,
\[
L(f_n) \le L_n^\gamma(f_n) + \frac{2\lambda}{\gamma n} \sqrt{\sum_{i=1}^n k(X_i, X_i)} + \sqrt{\frac{2 \log \frac{2}{\delta}}{n}}.
\]
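The data-dependent quantity $\frac{\lambda}{n}\sqrt{\sum_i k(X_i, X_i)}$ is indeed trivial to compute. For any kernel with $k(x,x) = 1$, such as the Gaussian kernel used in this sketch (an arbitrary choice), it collapses to $\lambda/\sqrt{n}$:

```python
import random
from math import exp, sqrt

random.seed(3)
# Data-dependent upper bound on R_n(F_lambda(X_1^n)) from the text:
# (lambda/n) * sqrt(sum_i k(X_i, X_i)).
# For a kernel with k(x, x) = 1, e.g. Gaussian, this equals lambda/sqrt(n).
k = lambda x, z: exp(-(x - z) ** 2)
n, lam = 100, 2.0
X = [random.random() for _ in range(n)]
bound = (lam / n) * sqrt(sum(k(x, x) for x in X))
```

Plugged into Corollary 4.3, such a kernel gives a margin bound whose complexity term is $\frac{2\lambda}{\gamma\sqrt{n}}$, independent of the dimension of $\mathcal{X}$.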

4.2. Convex cost functionals

Next we show that a proper choice of the cost function $\phi$ has further advantages. To this end, we consider nonnegative convex nondecreasing cost functions with $\lim_{x \to -\infty} \phi(x) = 0$ and $\phi(0) = 1$. Main examples of $\phi$ include the exponential cost function $\phi(x) = e^x$ used in AdaBoost and related boosting algorithms, the logit cost function $\phi(x) = \log_2(1 + e^x)$, and the hinge loss (or soft margin loss) $\phi(x) = (1 + x)_+$ used in support vector machines. One of the main advantages of using convex cost functions is that minimizing the empirical cost $A_n(f)$ often becomes a convex optimization problem and is therefore computationally feasible. In fact, most boosting and support vector machine classifiers may be viewed as empirical minimizers of a convex cost functional. However, minimizing convex cost functionals has other theoretical advantages. To understand this, assume, in addition to the above, that $\phi$ is strictly convex and differentiable. Then it is easy to determine the function $f^*$ minimizing the cost functional $A(f) = \mathbb{E}\, \phi(-Y f(X))$. Just note that for each $x \in \mathcal{X}$,
\[
\mathbb{E}\left[ \phi(-Y f(X)) \mid X = x \right] = \eta(x) \phi(-f(x)) + (1 - \eta(x)) \phi(f(x)),
\]
and therefore the function $f^*$ is given by
\[
f^*(x) = \operatorname*{arg\,min}_\alpha h_{\eta(x)}(\alpha) \qquad \text{where, for each } \eta \in [0,1], \quad h_\eta(\alpha) = \eta \phi(-\alpha) + (1 - \eta) \phi(\alpha).
\]
Note that $h_\eta$ is strictly convex and therefore $f^*$ is well defined (though it may take values $\pm\infty$ if $\eta$ equals 0 or 1). Assuming that $h_\eta$ is differentiable, the minimum is achieved for the value of $\alpha$ for which $h_\eta'(\alpha) = 0$, that is, when
\[
\frac{\eta}{1 - \eta} = \frac{\phi'(\alpha)}{\phi'(-\alpha)}.
\]
Since $\phi$ is strictly increasing, we see that the solution is positive if and only if $\eta > 1/2$. This reveals the important fact that the minimizer $f^*$ of the functional $A(f)$ is such that the corresponding classifier $g^*(x) = 2\, \mathbb{1}_{\{f^*(x) \ge 0\}} - 1$ is just the Bayes classifier. Thus, minimizing a convex cost functional leads to an optimal classifier. For example, if $\phi(x) = e^x$ is the exponential cost function, then $f^*(x) = (1/2) \log(\eta(x) / (1 - \eta(x)))$. In the case of the logit cost $\phi(x) = \log_2(1 + e^x)$, we have $f^*(x) = \log(\eta(x) / (1 - \eta(x)))$. We note here that, even though the hinge
loss $\phi(x) = (1 + x)_+$ does not satisfy the conditions for $\phi$ used above (e.g., it is not strictly convex), it is easy to see that the function $f^*$ minimizing the cost functional equals
\[
f^*(x) = \begin{cases} 1 & \text{if } \eta(x) > 1/2 \\ -1 & \text{if } \eta(x) < 1/2. \end{cases}
\]
Thus, in this case $f^*$ not only induces the Bayes classifier but is equal to it.

To obtain inequalities for the probability of error of classifiers based on minimization of empirical cost functionals, we need to establish a relationship between the excess probability of error $L(f) - L^*$ and the corresponding excess cost functional $A(f) - A^*$, where $A^* = A(f^*) = \inf_f A(f)$. Here we recall a simple inequality of Zhang [244] which states that if the function $H : [0,1] \to \mathbb{R}$ is defined by $H(\eta) = \inf_\alpha h_\eta(\alpha)$ and the cost function $\phi$ is such that for some positive constants $s \ge 1$ and $c$,
\[
\left| \frac{1}{2} - \eta \right|^s \le c^s \left( 1 - H(\eta) \right), \qquad \eta \in [0, 1],
\]
then for any function $f : \mathcal{X} \to \mathbb{R}$,
\[
L(f) - L^* \le 2c \left( A(f) - A^* \right)^{1/s}. \tag{11}
\]
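For the exponential cost $\phi(x) = e^x$, a direct computation gives $H(\eta) = \inf_\alpha h_\eta(\alpha) = 2\sqrt{\eta(1-\eta)}$, and Zhang's condition then holds with $s = 2$ and $c = 1/\sqrt{2}$. A grid check of this claim:

```python
from math import sqrt

# Zhang's condition for the exponential cost: H(eta) = 2*sqrt(eta*(1-eta)),
# and |1/2 - eta|^2 <= (1/2)*(1 - H(eta)) for all eta in [0,1]
# (i.e. s = 2, c = 1/sqrt(2)), checked on a fine grid.
H = lambda e: 2.0 * sqrt(e * (1.0 - e))
etas = [i / 1000 for i in range(1001)]
condition_holds = all(
    (0.5 - e) ** 2 <= 0.5 * (1.0 - H(e)) + 1e-12 for e in etas
)
```

In fact the inequality can be verified exactly: $1 - H(\eta) = (\sqrt{\eta} - \sqrt{1-\eta})^2$ and $(\eta - 1/2)^2 = \tfrac{1}{4}(\sqrt{\eta} - \sqrt{1-\eta})^2 (\sqrt{\eta} + \sqrt{1-\eta})^2 \le \tfrac{1}{2}(1 - H(\eta))$ since $(\sqrt{\eta} + \sqrt{1-\eta})^2 \le 2$.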

(The simple proof of this inequality is based on the expression (1) and elementary convexity properties of h_η.) In the special case of the exponential and logit cost functions, H(η) = 2√(η(1 − η)) and H(η) = −η log_2 η − (1 − η) log_2(1 − η), respectively. In both cases it is easy to see that the condition above is satisfied with s = 2 and c = 1/√2.

Theorem 4.4 (excess risk of convex risk minimizers). Assume that f_n is chosen from a class F_λ defined in (7) by minimizing the empirical cost functional A_n(f) using either the exponential or the logit cost function. Then, with probability at least 1 − δ,

L(f_n) − L* ≤ 2 (2L_φ λ √(2V_C log(n + 1)/n) + B √(2 log(1/δ)/n))^{1/2} + √2 (inf_{f∈F_λ} A(f) − A*)^{1/2}.

Proof.

L(f_n) − L* ≤ √2 (A(f_n) − A*)^{1/2}
≤ √2 (A(f_n) − inf_{f∈F_λ} A(f))^{1/2} + √2 (inf_{f∈F_λ} A(f) − A*)^{1/2}
≤ √2 (2 sup_{f∈F_λ} |A(f) − A_n(f)|)^{1/2} + √2 (inf_{f∈F_λ} A(f) − A*)^{1/2}   (just like in (2))
≤ 2 (2L_φ λ √(2V_C log(n + 1)/n) + B √(2 log(1/δ)/n))^{1/2} + √2 (inf_{f∈F_λ} A(f) − A*)^{1/2}

with probability at least 1 − δ, where at the last step we used the same bound for sup_{f∈F_λ} |A(f) − A_n(f)| as in (8).

Note that for the exponential cost function L_φ = B = e^λ, while for the logit cost function L_φ ≤ 1 and B ≤ λ. In both cases, if there exists a λ sufficiently large so that inf_{f∈F_λ} A(f) = A*, then the approximation error disappears and we obtain L(f_n) − L* = O(n^{−1/4}). The fact that the exponent in the rate of convergence is dimension-free is remarkable. (We note here that these rates may be further improved by applying the refined techniques described in Section 5.3, see also [40].) It is an interesting approximation-theoretic challenge to understand what kind of functions f* may be obtained as a convex combination of base classifiers and, more generally, to describe approximation properties of classes of functions of the form (7). Next we describe a simple example in which the above-mentioned approximation properties are well understood. Consider the case when X = [0, 1]^d and the base class C contains all decision stumps, that is, all classifiers of the form

s_{i,t}^+(x) = 1_{x^{(i)}≥t} − 1_{x^{(i)}<t} and s_{i,t}^−(x) = 1_{x^{(i)}<t} − 1_{x^{(i)}≥t}, t ∈ [0, 1], i = 1, …, d,

where x^{(i)} denotes the i-th coordinate of x. In this case the VC dimension of the base class is easily seen to be bounded by V_C ≤ 2 log_2(2d). Also, it is easy to see that the closure of F_λ with respect to the supremum norm contains all functions f of the form

f(x) = f_1(x^{(1)}) + ··· + f_d(x^{(d)}),

where the functions f_i : [0, 1] → R are such that ||f_1||_TV + ··· + ||f_d||_TV ≤ λ, where ||f_i||_TV denotes the total variation of the function f_i. Therefore, if f* has this form, we have inf_{f∈F_λ} A(f) = A(f*). Recalling that the function f* optimizing the cost A(f) has the form

f*(x) = (1/2) log(η(x)/(1 − η(x)))

in the case of the exponential cost function, and f*(x) = log(η(x)/(1 − η(x))) in the case of the logit cost function, we see that boosting using decision stumps is especially well fitted to the so-called additive logistic model, in which η is assumed to be such that log(η/(1 − η)) is an additive function (i.e., it can be written as a sum of univariate functions of the components of x). Thus, when η permits an additive logistic representation, the rate of convergence of the classifier is fast and has a very mild dependence on the dimension.

Consider next the case of the hinge loss φ(x) = (1 + x)_+ often used in Support Vector Machines and related kernel methods. In this case H(η) = 2 min(η, 1 − η), and therefore inequality (11) holds with c = 1/2 and s = 1. Thus,

L(f_n) − L* ≤ A(f_n) − A*,

and the analysis above leads to even better rates of convergence. However, in this case f*(x) = 2·1_{η(x)≥1/2} − 1, and approximating this function by weighted sums of base functions may be more difficult than in the case of the exponential and logit costs. Once again, the approximation-theoretic part of the problem is far from being well understood, and it is difficult to give recommendations about which cost function is more advantageous and what base classes should be used.

Bibliographical remarks. For results on the algorithmic difficulty of empirical risk minimization, see Johnson and Preparata [112], Vu [236], Bartlett and Ben-David [26], Ben-David, Eiron, and Simon [32]. Boosting algorithms were originally introduced by Freund and Schapire (see [91], [94], and [190]) as adaptive aggregation of simple classifiers contained in a small base class. The analysis based on the observation that AdaBoost and related methods tend to produce large-margin classifiers appears in Schapire, Freund, Bartlett, and Lee [191], and Koltchinskii and Panchenko [127]. It was Breiman [51] who observed that boosting performs gradient descent optimization of an empirical cost function different from the number of misclassified samples; see also Mason, Baxter, Bartlett, and Frean [157], Collins, Schapire, and Singer [61],
Friedman, Hastie, and Tibshirani [95]. Based on this view, various versions of boosting algorithms have been shown to be consistent in different settings, see Breiman [52], Bühlmann and Yu [54], Blanchard, Lugosi, and Vayatis [40], Jiang [111], Lugosi and Vayatis [146], Mannor and Meir [152], Mannor, Meir, and Zhang [153], Zhang [244]. Inequality (8) was first obtained by Schapire, Freund, Bartlett, and Lee [191]. The analysis presented here is due to Koltchinskii and Panchenko [127]. Other classifiers based on weighted voting schemes have been considered by Catoni [57-59], Yang [241], Freund, Mansour, and Schapire [93]. Kernel methods were pioneered by Aizerman, Braverman, and Rozonoer [2-5], Vapnik and Lerner [228], Bashkirov, Braverman, and Muchnik [31], Vapnik and Chervonenkis [233], and Specht [203]. Support vector machines originate in the pioneering work of Boser, Guyon, and Vapnik [43], Cortes and Vapnik [62]. For surveys we refer to Cristianini and Shawe-Taylor [65], Smola, Bartlett, Schölkopf, and Schuurmans [201], Hastie, Tibshirani, and Friedman [104], Schölkopf and Smola [192]. The study of universal approximation properties of kernels and statistical consistency of Support Vector Machines is due to Steinwart [ ], Lin [140, 141], Zhou [245], and Blanchard, Bousquet, and Massart [39]. We have considered the case of minimization of a loss function on a ball of the reproducing kernel Hilbert space. However, it is computationally more convenient to formulate the problem as the minimization of a regularized functional of the form

min_{f∈F} (1/n) Σ_{i=1}^n φ(−Y_i f(X_i)) + λ||f||².

The standard Support Vector Machine algorithm then corresponds to the choice of φ(x) = (1 + x)_+. Kernel-based regularization algorithms were studied by Kimeldorf and Wahba [120] and Craven and Wahba [64] in the context of regression. Relationships between Support Vector Machines and regularization were described

by Smola, Schölkopf, and Müller [202] and Evgeniou, Pontil, and Poggio [89]. General properties of regularized algorithms in reproducing kernel Hilbert spaces are investigated by Cucker and Smale [68], Steinwart [206], Zhang [244]. Various properties of the Support Vector Machine algorithm are investigated by Vapnik [230, 231], Schölkopf and Smola [192], Scovel and Steinwart [195], and Steinwart [208, 209]. The fact that minimizing an exponential cost functional leads to the Bayes classifier was pointed out by Breiman [52]; see also Lugosi and Vayatis [146], Zhang [244]. For a comprehensive theory of the connection between cost functions and probability of misclassification, see Bartlett, Jordan, and McAuliffe [27]. Zhang's lemma (11) appears in [244]. For various generalizations and refinements we refer to Bartlett, Jordan, and McAuliffe [27] and Blanchard, Lugosi, and Vayatis [40].

5. Tighter bounds for empirical risk minimization

This section is dedicated to the description of some refinements of the ideas described in the earlier sections. What we have seen so far only used first-order properties of the functions we considered, namely their boundedness. It turns out that by using second-order properties, such as the variance of the functions, many of the above results can be made sharper.

5.1. Relative deviations

In order to understand the basic phenomenon, let us go back to the simplest case in which one has a fixed function f with values in {0, 1}. In this case, P_n f is an average of independent Bernoulli random variables with parameter p = P f. Recall that, as a simple consequence of (3), with probability at least 1 − δ,

P f − P_n f ≤ √(log(1/δ)/(2n)). (12)

This is basically tight when P f = 1/2, but can be significantly improved when P f is small. Indeed, Bernstein's inequality gives, with probability at least 1 − δ,

P f − P_n f ≤ √(2 Var(f) log(1/δ)/n) + 2 log(1/δ)/(3n). (13)

Since f takes its values in {0, 1}, Var(f) = P f(1 − P f) ≤ P f, which shows that when P f is small, (13) is much better than (12).

5.1.1. General inequalities

Next we exploit the phenomenon described above to obtain sharper
performance bounds for empirical risk minimization. Note that if we consider the difference P f − P_n f uniformly over the class F, the largest deviations are obtained by functions that have a large variance (i.e., P f close to 1/2). An idea is to scale each function by dividing it by √(P f) so that they all behave in a similar way. Thus, we bound the quantity

sup_{f∈F} (P f − P_n f)/√(P f).

The first step consists in symmetrization of the tail probabilities: if nt² ≥ 2,

P{ sup_{f∈F} (P f − P_n f)/√(P f) ≥ t } ≤ 2P{ sup_{f∈F} (P_n' f − P_n f)/√((P_n f + P_n' f)/2) ≥ t },

where P_n' denotes the empirical measure of an independent ghost sample X_1', …, X_n'.
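The advantage of (13) over (12) when P f is small is easy to quantify. The following sketch (ours, not from the survey; the function names are hypothetical) evaluates the two right-hand sides, using Var(f) ≤ P f for a {0, 1}-valued f:

```python
import math

def hoeffding_dev(n, delta):
    # right-hand side of (12)
    return math.sqrt(math.log(1 / delta) / (2 * n))

def bernstein_dev(p, n, delta):
    # right-hand side of (13), with Var(f) <= p for a {0,1}-valued f with P f = p
    return math.sqrt(2 * p * math.log(1 / delta) / n) + 2 * math.log(1 / delta) / (3 * n)

n, delta = 10000, 0.01
assert bernstein_dev(0.001, n, delta) < hoeffding_dev(n, delta)  # small P f: Bernstein wins
assert bernstein_dev(0.5, n, delta) > hoeffding_dev(n, delta)    # P f = 1/2: Hoeffding is tighter
```

At n = 10000 and δ = 0.01, the Bernstein bound is roughly an order of magnitude smaller when P f = 0.001, while Hoeffding's bound is tighter at P f = 1/2, as the text predicts.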

Next we introduce Rademacher random variables, obtaining, by simple symmetrization,

2P{ sup_{f∈F} (P_n' f − P_n f)/√((P_n f + P_n' f)/2) ≥ t } = 2E[ P_σ{ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i(f(X_i') − f(X_i))/√((P_n f + P_n' f)/2) ≥ t } ]

(where P_σ is the conditional probability, given the X_i and X_i'). The last step uses tail bounds for individual functions and a union bound over F(X_1^{2n}), where X_1^{2n} denotes the union of the initial sample X_1, …, X_n and of the extra symmetrization sample X_1', …, X_n'. Summarizing, we obtain the following inequalities:

Theorem 5.1. Let F be a class of functions taking binary values in {0, 1}. For any δ ∈ (0, 1), with probability at least 1 − δ, all f ∈ F satisfy

P f − P_n f ≤ 2 √(P f (log S_F(X_1^{2n}) + log(4/δ))/n).

Also, with probability at least 1 − δ, for all f ∈ F,

P_n f − P f ≤ 2 √(P_n f (log S_F(X_1^{2n}) + log(4/δ))/n).

As a consequence, we have that for all s > 0, with probability at least 1 − δ,

sup_{f∈F} (P f − P_n f)/(P f + P_n f + s/2) ≤ 2 √((log S_F(X_1^{2n}) + log(4/δ))/(sn)), (14)

and the same is true if P and P_n are permuted. Another consequence of Theorem 5.1 with interesting applications is the following: for all t ∈ (0, 1], with probability at least 1 − δ,

for all f ∈ F, P_n f ≤ (1 − t)P f implies P f ≤ (4/(t²n))(log S_F(X_1^{2n}) + log(4/δ)). (15)

In particular, setting t = 1,

for all f ∈ F, P_n f = 0 implies P f ≤ (4/n)(log S_F(X_1^{2n}) + log(4/δ)).

5.1.2. Applications to empirical risk minimization

It is easy to see that, for non-negative numbers A, B, C ≥ 0, the fact that A ≤ B√A + C entails A ≤ B² + B√C + C, so that we obtain from the second inequality of Theorem 5.1 that, with probability at least 1 − δ, for all f ∈ F,

P f ≤ P_n f + 2 √(P_n f (log S_F(X_1^{2n}) + log(4/δ))/n) + 4 (log S_F(X_1^{2n}) + log(4/δ))/n.

Corollary 5.2. Let g_n be the empirical risk minimizer in a class C of VC dimension V. Then, with probability at least 1 − δ,

L(g_n) ≤ L_n(g_n) + 2 √(L_n(g_n)(2V log(n + 1) + log(4/δ))/n) + 4 (2V log(n + 1) + log(4/δ))/n.
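Corollary 5.2 can be evaluated directly; note how the middle term vanishes when the empirical risk is zero, leaving a bound of order V log n / n. (The sketch below is ours, with constants following our reading of the corollary; the function name is hypothetical.)

```python
import math

def corollary_5_2_bound(Ln, n, V, delta):
    # right-hand side of Corollary 5.2, with Ln = L_n(g_n)
    D = (2 * V * math.log(n + 1) + math.log(4 / delta)) / n
    return Ln + 2 * math.sqrt(Ln * D) + 4 * D

n, V, delta = 100000, 10, 0.05
b_zero = corollary_5_2_bound(0.0, n, V, delta)  # zero empirical risk: only the fast 4D term survives
b_half = corollary_5_2_bound(0.5, n, V, delta)  # L_n = 1/2: the sqrt term dominates the excess
assert abs(b_zero - 4 * (2 * V * math.log(n + 1) + math.log(4 / delta)) / n) < 1e-12
assert b_half - 0.5 > math.sqrt((2 * V * math.log(n + 1) + math.log(4 / delta)) / n)
```

This is exactly the contrast developed next: an O(V log n / n) bound in the zero-error case versus an O(√(V log n / n)) bound in general.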

Consider first the extreme situation when there exists a classifier in C which classifies without error. This also means that for some g' ∈ C, Y = g'(X) with probability one. This is clearly a quite restrictive assumption, only satisfied in very special cases. Nevertheless, the assumption that inf_{g∈C} L(g) = 0 has been commonly used in computational learning theory, perhaps because of its mathematical simplicity. In such a case, clearly L_n(g_n) = 0, so that we get, with probability at least 1 − δ,

L(g_n) − inf_{g∈C} L(g) ≤ 4 (2V log(n + 1) + log(4/δ))/n. (16)

The main point here is that the upper bound obtained in this special case is of smaller order of magnitude (O(V log n/n) as opposed to O(√(V log n/n))) than in the general case. One can actually obtain a version which interpolates between these two cases as follows. For simplicity, assume that there is a classifier g* in C such that L(g*) = inf_{g∈C} L(g). Then we have

L_n(g_n) ≤ L_n(g*) = L_n(g*) − L(g*) + L(g*).

Using Bernstein's inequality, we get, with probability at least 1 − δ,

L_n(g*) − L(g*) ≤ √(2L(g*) log(1/δ)/n) + 2 log(1/δ)/(3n),

which, together with Corollary 5.2, yields:

Corollary 5.3. There exists a constant C such that, with probability at least 1 − δ,

L(g_n) − inf_{g∈C} L(g) ≤ C ( √(inf_{g∈C} L(g) (V log n + log(1/δ))/n) + (V log n + log(1/δ))/n ).

5.2. Noise and fast rates

We have seen that in the case where f takes values in {0, 1} there is a nice relationship between the variance of f (which controls the size of the deviations between P_n f and P f) and its expectation, namely, Var(f) ≤ P f. This is the key property that allows one to obtain faster rates of convergence for L(g_n) − inf_{g∈C} L(g). In particular, in the ideal situation mentioned above, when inf_{g∈C} L(g) = 0, the difference L(g_n) − inf_{g∈C} L(g) may be much smaller than the worst-case difference sup_{g∈C}(L(g) − L_n(g)). This actually happens in many cases, whenever the distribution satisfies certain conditions. Next we describe such conditions and show how the finer bounds can be derived. The main idea is that, in order to get precise rates for L(g_n) − inf_{g∈C} L(g), we consider functions of the form

1_{g(X)≠Y} − 1_{g*(X)≠Y},

where g* is a classifier minimizing the loss in the class C, that is, such that L(g*) = inf_{g∈C} L(g). Note that functions of this form are no longer non-negative. To illustrate the basic ideas in the simplest possible setting, consider the case when the loss class F is a finite set of N functions of the form 1_{g(X)≠Y} − 1_{g*(X)≠Y}. In addition, we assume that there is a relationship between the variance and the expectation of the functions in F given by the inequality

Var(f) ≤ (P f / h)^α (17)

for some h > 0 and α ∈ (0, 1]. By Bernstein's inequality and a union bound over the elements of F, we have that, with probability at least 1 − δ, for all f ∈ F,

P f ≤ P_n f + √(2(P f/h)^α log(N/δ)/n) + 4 log(N/δ)/(3n).

As a consequence, using the fact that P_n f_n = L_n(g_n) − L_n(g*) ≤ 0 for the function f_n corresponding to the empirical risk minimizer g_n, we have with probability at least 1 − δ,

L(g_n) − L(g*) ≤ √(2((L(g_n) − L(g*))/h)^α log(N/δ)/n) + 4 log(N/δ)/(3n).

Solving this inequality for L(g_n) − L(g*) finally gives that, with probability at least 1 − δ,

L(g_n) − inf_{g∈C} L(g) ≤ (2 log(N/δ)/(h^α n))^{1/(2−α)}. (18)

Note that the obtained rate is then faster than n^{−1/2} whenever α > 0. In particular, for α = 1 we get n^{−1}, as in the ideal case. It now remains to show whether (17) is a reasonable assumption. As the simplest possible example, assume that the Bayes classifier g* belongs to the class C (i.e., the minimizer of the loss in C is the Bayes classifier) and that the a posteriori probability function η is bounded away from 1/2, that is, there exists a positive constant h such that for all x ∈ X, |2η(x) − 1| > h. Note that the assumption that the Bayes classifier is in C is very restrictive and is unlikely to be satisfied in practice, especially if the class C is finite, as is assumed in this discussion. The assumption that η is bounded away from 1/2 may also appear to be quite specific. However, the situation described here may serve as a first illustration of a nontrivial example when fast rates may be achieved. Since |1_{g(X)≠Y} − 1_{g*(X)≠Y}| ≤ 1_{g(X)≠g*(X)}, the conditions stated above and (1) imply that

Var(f) ≤ E[1_{g(X)≠g*(X)}] ≤ (1/h) E[|2η(X) − 1| 1_{g(X)≠g*(X)}] = (L(g) − L*)/h.

Thus (17) holds with α = 1, which shows that, with probability at least 1 − δ,

L(g_n) − L* ≤ C log(N/δ)/(hn). (19)

Thus, the empirical risk minimizer has a significantly better performance than predicted by the results of the previous section whenever the Bayes classifier is in the class C and the a posteriori probability η stays away from 1/2. The behavior of η in the vicinity of 1/2 has been known to play an important role in the difficulty of the classification problem, see [72, 239, 240]. Roughly speaking, if η has a complex behavior around the critical
threshold 1/2, then one cannot avoid estimating η, which is a typically difficult nonparametric regression problem. However, the classification problem is significantly easier than regression if η is far from 1/2 with large probability. The condition of η being bounded away from 1/2 may be significantly relaxed and generalized. Indeed, in the context of discriminant analysis, Mammen and Tsybakov [151] and Tsybakov [221] formulated a useful condition that has been adopted by many authors. Let α ∈ [0, 1). Then the Mammen-Tsybakov condition may

be stated by any of the following three equivalent statements:

(1) there exists β > 0 such that for every g : X → {0, 1}, E[1_{g(X)≠g*(X)}] ≤ β (L(g) − L*)^α;
(2) there exists c > 0 such that for every A ⊂ X, ∫_A dP(x) ≤ c (∫_A |2η(x) − 1| dP(x))^α;
(3) there exists B > 0 such that for every t ≥ 0, P{|2η(X) − 1| ≤ t} ≤ B t^{α/(1−α)}.

We refer to this as the Mammen-Tsybakov noise condition. The proof that these statements are equivalent is straightforward, and we omit it, but we comment on their meaning. Notice first that α has to be in [0, 1] because

L(g) − L* = E[|2η(X) − 1| 1_{g(X)≠g*(X)}] ≤ E[1_{g(X)≠g*(X)}].

Also, when α = 0 these conditions are void. The case α = 1 in (1) is realized when there exists an s > 0 such that |2η(X) − 1| > s almost surely (which is just the extreme noise condition we considered above). The most important consequence of these conditions is that they imply a relationship between the variance and the expectation of functions of the form 1_{g(X)≠Y} − 1_{g*(X)≠Y}. Indeed, we obtain

E[(1_{g(X)≠Y} − 1_{g*(X)≠Y})²] ≤ c (L(g) − L*)^α.

This is thus enough to get (18) for a finite class of functions. The sharper bounds, established in this section and the next, come at the price of the assumption that the Bayes classifier is in the class C. Because of this, it is difficult to compare the fast rates achieved with the slower rates proved in Section 3. On the other hand, noise conditions like the Mammen-Tsybakov condition may be used to get improvements even when g* is not contained in C. In these cases the approximation error L(g*) − L* also needs to be taken into account, and the situation becomes somewhat more complex. We return to these issues in Sections 5.3.5 and 8.

5.3. Localization

The purpose of this section is to generalize the simple argument of the previous section to more general classes C of classifiers. This generalization reveals the importance of the modulus of continuity of the empirical process as a measure of complexity of the learning problem.

5.3.1. Talagrand's inequality

One of the most important recent developments in empirical process theory is a concentration inequality for the supremum of an empirical process, first proved by Talagrand [212] and refined
later by various authors. This inequality is at the heart of many key developments in statistical learning theory. Here we recall the following version:

Theorem 5.4. Let b > 0 and let F be a set of functions from X to R. Assume that all functions in F satisfy P f − f ≤ b. Then, with probability at least 1 − δ, for any θ > 0,

sup_{f∈F} (P f − P_n f) ≤ (1 + θ) E[sup_{f∈F} (P f − P_n f)] + √(2 sup_{f∈F} Var(f) log(1/δ)/n) + (1 + 3/θ) b log(1/δ)/(3n),

which, for θ = 1, translates to

sup_{f∈F} (P f − P_n f) ≤ 2 E[sup_{f∈F} (P f − P_n f)] + √(2 sup_{f∈F} Var(f) log(1/δ)/n) + 4b log(1/δ)/(3n).
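Before the localization argument is developed further, it is worth seeing how strong the improvement encoded in (18) already is: the exponent 1/(2 − α) interpolates between the slow n^{−1/2} regime (α = 0) and the fast n^{−1} regime (α = 1). Taking the display (18) at face value with h = 1 and log(N/δ) ≈ 10 (a numerical illustration of ours; the function name is hypothetical):

```python
def fast_rate(n, alpha, logNdelta=10.0, h=1.0):
    # bound (18): (2 log(N/delta) / (h**alpha * n)) ** (1/(2 - alpha))
    return (2 * logNdelta / (h ** alpha * n)) ** (1.0 / (2 - alpha))

n = 10**6
r0, r1 = fast_rate(n, 0.0), fast_rate(n, 1.0)
assert abs(r0 - (20 / n) ** 0.5) < 1e-12  # alpha = 0: the usual n^{-1/2} rate
assert abs(r1 - 20 / n) < 1e-12           # alpha = 1: the fast n^{-1} rate
assert r1 < r0
```

At n = 10^6 the two regimes differ by more than two orders of magnitude, which is why the noise conditions above have attracted so much attention.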


More information

Non-life insurance mathematics. Nils F. Haavardsson, University of Oslo and DNB Skadeforsikring

Non-life insurance mathematics. Nils F. Haavardsson, University of Oslo and DNB Skadeforsikring No-life isurace mathematics Nils F. Haavardsso, Uiversity of Oslo ad DNB Skadeforsikrig Mai issues so far Why does isurace work? How is risk premium defied ad why is it importat? How ca claim frequecy

More information

Chapter 7 - Sampling Distributions. 1 Introduction. What is statistics? It consist of three major areas:

Chapter 7 - Sampling Distributions. 1 Introduction. What is statistics? It consist of three major areas: Chapter 7 - Samplig Distributios 1 Itroductio What is statistics? It cosist of three major areas: Data Collectio: samplig plas ad experimetal desigs Descriptive Statistics: umerical ad graphical summaries

More information

Taking DCOP to the Real World: Efficient Complete Solutions for Distributed Multi-Event Scheduling

Taking DCOP to the Real World: Efficient Complete Solutions for Distributed Multi-Event Scheduling Taig DCOP to the Real World: Efficiet Complete Solutios for Distributed Multi-Evet Schedulig Rajiv T. Maheswara, Milid Tambe, Emma Bowrig, Joatha P. Pearce, ad Pradeep araatham Uiversity of Souther Califoria

More information

Overview of some probability distributions.

Overview of some probability distributions. Lecture Overview of some probability distributios. I this lecture we will review several commo distributios that will be used ofte throughtout the class. Each distributio is usually described by its probability

More information

Normal Distribution.

Normal Distribution. Normal Distributio www.icrf.l Normal distributio I probability theory, the ormal or Gaussia distributio, is a cotiuous probability distributio that is ofte used as a first approimatio to describe realvalued

More information

Confidence Intervals for One Mean

Confidence Intervals for One Mean Chapter 420 Cofidece Itervals for Oe Mea Itroductio This routie calculates the sample size ecessary to achieve a specified distace from the mea to the cofidece limit(s) at a stated cofidece level for a

More information

Statistical inference: example 1. Inferential Statistics

Statistical inference: example 1. Inferential Statistics Statistical iferece: example 1 Iferetial Statistics POPULATION SAMPLE A clothig store chai regularly buys from a supplier large quatities of a certai piece of clothig. Each item ca be classified either

More information

Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed.

Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed. This documet was writte ad copyrighted by Paul Dawkis. Use of this documet ad its olie versio is govered by the Terms ad Coditios of Use located at http://tutorial.math.lamar.edu/terms.asp. The olie versio

More information

INFINITE SERIES KEITH CONRAD

INFINITE SERIES KEITH CONRAD INFINITE SERIES KEITH CONRAD. Itroductio The two basic cocepts of calculus, differetiatio ad itegratio, are defied i terms of limits (Newto quotiets ad Riema sums). I additio to these is a third fudametal

More information

1. C. The formula for the confidence interval for a population mean is: x t, which was

1. C. The formula for the confidence interval for a population mean is: x t, which was s 1. C. The formula for the cofidece iterval for a populatio mea is: x t, which was based o the sample Mea. So, x is guarateed to be i the iterval you form.. D. Use the rule : p-value

More information

PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY AN ALTERNATIVE MODEL FOR BONUS-MALUS SYSTEM

PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY AN ALTERNATIVE MODEL FOR BONUS-MALUS SYSTEM PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY Physical ad Mathematical Scieces 2015, 1, p. 15 19 M a t h e m a t i c s AN ALTERNATIVE MODEL FOR BONUS-MALUS SYSTEM A. G. GULYAN Chair of Actuarial Mathematics

More information

Infinite Sequences and Series

Infinite Sequences and Series CHAPTER 4 Ifiite Sequeces ad Series 4.1. Sequeces A sequece is a ifiite ordered list of umbers, for example the sequece of odd positive itegers: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29...

More information

Measures of Spread and Boxplots Discrete Math, Section 9.4

Measures of Spread and Boxplots Discrete Math, Section 9.4 Measures of Spread ad Boxplots Discrete Math, Sectio 9.4 We start with a example: Example 1: Comparig Mea ad Media Compute the mea ad media of each data set: S 1 = {4, 6, 8, 10, 1, 14, 16} S = {4, 7, 9,

More information

THE ABRACADABRA PROBLEM

THE ABRACADABRA PROBLEM THE ABRACADABRA PROBLEM FRANCESCO CARAVENNA Abstract. We preset a detailed solutio of Exercise E0.6 i [Wil9]: i a radom sequece of letters, draw idepedetly ad uiformly from the Eglish alphabet, the expected

More information

The Stable Marriage Problem

The Stable Marriage Problem The Stable Marriage Problem William Hut Lae Departmet of Computer Sciece ad Electrical Egieerig, West Virgiia Uiversity, Morgatow, WV William.Hut@mail.wvu.edu 1 Itroductio Imagie you are a matchmaker,

More information

Factors of sums of powers of binomial coefficients

Factors of sums of powers of binomial coefficients ACTA ARITHMETICA LXXXVI.1 (1998) Factors of sums of powers of biomial coefficiets by Neil J. Cali (Clemso, S.C.) Dedicated to the memory of Paul Erdős 1. Itroductio. It is well ow that if ( ) a f,a = the

More information

TIGHT BOUNDS ON EXPECTED ORDER STATISTICS

TIGHT BOUNDS ON EXPECTED ORDER STATISTICS Probability i the Egieerig ad Iformatioal Scieces, 20, 2006, 667 686+ Prited i the U+S+A+ TIGHT BOUNDS ON EXPECTED ORDER STATISTICS DIMITRIS BERTSIMAS Sloa School of Maagemet ad Operatios Research Ceter

More information

Perfect Packing Theorems and the Average-Case Behavior of Optimal and Online Bin Packing

Perfect Packing Theorems and the Average-Case Behavior of Optimal and Online Bin Packing SIAM REVIEW Vol. 44, No. 1, pp. 95 108 c 2002 Society for Idustrial ad Applied Mathematics Perfect Packig Theorems ad the Average-Case Behavior of Optimal ad Olie Bi Packig E. G. Coffma, Jr. C. Courcoubetis

More information

SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES

SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES Read Sectio 1.5 (pages 5 9) Overview I Sectio 1.5 we lear to work with summatio otatio ad formulas. We will also itroduce a brief overview of sequeces,

More information

Lecture 3. denote the orthogonal complement of S k. Then. 1 x S k. n. 2 x T Ax = ( ) λ x. with x = 1, we have. i = λ k x 2 = λ k.

Lecture 3. denote the orthogonal complement of S k. Then. 1 x S k. n. 2 x T Ax = ( ) λ x. with x = 1, we have. i = λ k x 2 = λ k. 18.409 A Algorithmist s Toolkit September 17, 009 Lecture 3 Lecturer: Joatha Keler Scribe: Adre Wibisoo 1 Outlie Today s lecture covers three mai parts: Courat-Fischer formula ad Rayleigh quotiets The

More information

University of California, Los Angeles Department of Statistics. Distributions related to the normal distribution

University of California, Los Angeles Department of Statistics. Distributions related to the normal distribution Uiversity of Califoria, Los Ageles Departmet of Statistics Statistics 100B Istructor: Nicolas Christou Three importat distributios: Distributios related to the ormal distributio Chi-square (χ ) distributio.

More information

Analyzing Longitudinal Data from Complex Surveys Using SUDAAN

Analyzing Longitudinal Data from Complex Surveys Using SUDAAN Aalyzig Logitudial Data from Complex Surveys Usig SUDAAN Darryl Creel Statistics ad Epidemiology, RTI Iteratioal, 312 Trotter Farm Drive, Rockville, MD, 20850 Abstract SUDAAN: Software for the Statistical

More information

Case Study. Normal and t Distributions. Density Plot. Normal Distributions

Case Study. Normal and t Distributions. Density Plot. Normal Distributions Case Study Normal ad t Distributios Bret Halo ad Bret Larget Departmet of Statistics Uiversity of Wiscosi Madiso October 11 13, 2011 Case Study Body temperature varies withi idividuals over time (it ca

More information

Swaps: Constant maturity swaps (CMS) and constant maturity. Treasury (CMT) swaps

Swaps: Constant maturity swaps (CMS) and constant maturity. Treasury (CMT) swaps Swaps: Costat maturity swaps (CMS) ad costat maturity reasury (CM) swaps A Costat Maturity Swap (CMS) swap is a swap where oe of the legs pays (respectively receives) a swap rate of a fixed maturity, while

More information

AMS 2000 subject classification. Primary 62G08, 62G20; secondary 62G99

AMS 2000 subject classification. Primary 62G08, 62G20; secondary 62G99 VARIABLE SELECTION IN NONPARAMETRIC ADDITIVE MODELS Jia Huag 1, Joel L. Horowitz 2 ad Fegrog Wei 3 1 Uiversity of Iowa, 2 Northwester Uiversity ad 3 Uiversity of West Georgia Abstract We cosider a oparametric

More information

The following example will help us understand The Sampling Distribution of the Mean. C1 C2 C3 C4 C5 50 miles 84 miles 38 miles 120 miles 48 miles

The following example will help us understand The Sampling Distribution of the Mean. C1 C2 C3 C4 C5 50 miles 84 miles 38 miles 120 miles 48 miles The followig eample will help us uderstad The Samplig Distributio of the Mea Review: The populatio is the etire collectio of all idividuals or objects of iterest The sample is the portio of the populatio

More information

A Faster Clause-Shortening Algorithm for SAT with No Restriction on Clause Length

A Faster Clause-Shortening Algorithm for SAT with No Restriction on Clause Length Joural o Satisfiability, Boolea Modelig ad Computatio 1 2005) 49-60 A Faster Clause-Shorteig Algorithm for SAT with No Restrictio o Clause Legth Evgey Datsi Alexader Wolpert Departmet of Computer Sciece

More information

CHAPTER 3 THE TIME VALUE OF MONEY

CHAPTER 3 THE TIME VALUE OF MONEY CHAPTER 3 THE TIME VALUE OF MONEY OVERVIEW A dollar i the had today is worth more tha a dollar to be received i the future because, if you had it ow, you could ivest that dollar ad ear iterest. Of all

More information

Research Article Sign Data Derivative Recovery

Research Article Sign Data Derivative Recovery Iteratioal Scholarly Research Network ISRN Applied Mathematics Volume 0, Article ID 63070, 7 pages doi:0.540/0/63070 Research Article Sig Data Derivative Recovery L. M. Housto, G. A. Glass, ad A. D. Dymikov

More information

Entropy of bi-capacities

Entropy of bi-capacities Etropy of bi-capacities Iva Kojadiovic LINA CNRS FRE 2729 Site école polytechique de l uiv. de Nates Rue Christia Pauc 44306 Nates, Frace iva.kojadiovic@uiv-ates.fr Jea-Luc Marichal Applied Mathematics

More information

Chapter 7: Confidence Interval and Sample Size

Chapter 7: Confidence Interval and Sample Size Chapter 7: Cofidece Iterval ad Sample Size Learig Objectives Upo successful completio of Chapter 7, you will be able to: Fid the cofidece iterval for the mea, proportio, ad variace. Determie the miimum

More information

Hypergeometric Distributions

Hypergeometric Distributions 7.4 Hypergeometric Distributios Whe choosig the startig lie-up for a game, a coach obviously has to choose a differet player for each positio. Similarly, whe a uio elects delegates for a covetio or you

More information

Annuities Under Random Rates of Interest II By Abraham Zaks. Technion I.I.T. Haifa ISRAEL and Haifa University Haifa ISRAEL.

Annuities Under Random Rates of Interest II By Abraham Zaks. Technion I.I.T. Haifa ISRAEL and Haifa University Haifa ISRAEL. Auities Uder Radom Rates of Iterest II By Abraham Zas Techio I.I.T. Haifa ISRAEL ad Haifa Uiversity Haifa ISRAEL Departmet of Mathematics, Techio - Israel Istitute of Techology, 3000, Haifa, Israel I memory

More information

THE TWO-VARIABLE LINEAR REGRESSION MODEL

THE TWO-VARIABLE LINEAR REGRESSION MODEL THE TWO-VARIABLE LINEAR REGRESSION MODEL Herma J. Bieres Pesylvaia State Uiversity April 30, 202. Itroductio Suppose you are a ecoomics or busiess maor i a college close to the beach i the souther part

More information

.04. This means $1000 is multiplied by 1.02 five times, once for each of the remaining sixmonth

.04. This means $1000 is multiplied by 1.02 five times, once for each of the remaining sixmonth Questio 1: What is a ordiary auity? Let s look at a ordiary auity that is certai ad simple. By this, we mea a auity over a fixed term whose paymet period matches the iterest coversio period. Additioally,

More information

Lecture 5: Span, linear independence, bases, and dimension

Lecture 5: Span, linear independence, bases, and dimension Lecture 5: Spa, liear idepedece, bases, ad dimesio Travis Schedler Thurs, Sep 23, 2010 (versio: 9/21 9:55 PM) 1 Motivatio Motivatio To uderstad what it meas that R has dimesio oe, R 2 dimesio 2, etc.;

More information

Example 2 Find the square root of 0. The only square root of 0 is 0 (since 0 is not positive or negative, so those choices don t exist here).

Example 2 Find the square root of 0. The only square root of 0 is 0 (since 0 is not positive or negative, so those choices don t exist here). BEGINNING ALGEBRA Roots ad Radicals (revised summer, 00 Olso) Packet to Supplemet the Curret Textbook - Part Review of Square Roots & Irratioals (This portio ca be ay time before Part ad should mostly

More information

1 The Gaussian channel

1 The Gaussian channel ECE 77 Lecture 0 The Gaussia chael Objective: I this lecture we will lear about commuicatio over a chael of practical iterest, i which the trasmitted sigal is subjected to additive white Gaussia oise.

More information

Overview. Learning Objectives. Point Estimate. Estimation. Estimating the Value of a Parameter Using Confidence Intervals

Overview. Learning Objectives. Point Estimate. Estimation. Estimating the Value of a Parameter Using Confidence Intervals Overview Estimatig the Value of a Parameter Usig Cofidece Itervals We apply the results about the sample mea the problem of estimatio Estimatio is the process of usig sample data estimate the value of

More information

Concentration of Measure

Concentration of Measure Copyright c 2008 2010 Joh Lafferty, Ha Liu, ad Larry Wasserma Do Not Distribute Chapter 7 Cocetratio of Measure Ofte we wat to show that some radom quatity is close to its mea with high probability Results

More information

1 Correlation and Regression Analysis

1 Correlation and Regression Analysis 1 Correlatio ad Regressio Aalysis I this sectio we will be ivestigatig the relatioship betwee two cotiuous variable, such as height ad weight, the cocetratio of a ijected drug ad heart rate, or the cosumptio

More information

Chapter 5: Inner Product Spaces

Chapter 5: Inner Product Spaces Chapter 5: Ier Product Spaces Chapter 5: Ier Product Spaces SECION A Itroductio to Ier Product Spaces By the ed of this sectio you will be able to uderstad what is meat by a ier product space give examples

More information

Estimating Probability Distributions by Observing Betting Practices

Estimating Probability Distributions by Observing Betting Practices 5th Iteratioal Symposium o Imprecise Probability: Theories ad Applicatios, Prague, Czech Republic, 007 Estimatig Probability Distributios by Observig Bettig Practices Dr C Lych Natioal Uiversity of Irelad,

More information

Approximating Area under a curve with rectangles. To find the area under a curve we approximate the area using rectangles and then use limits to find

Approximating Area under a curve with rectangles. To find the area under a curve we approximate the area using rectangles and then use limits to find 1.8 Approximatig Area uder a curve with rectagles 1.6 To fid the area uder a curve we approximate the area usig rectagles ad the use limits to fid 1.4 the area. Example 1 Suppose we wat to estimate 1.

More information

UC Berkeley Department of Electrical Engineering and Computer Science. EE 126: Probablity and Random Processes. Solutions 9 Spring 2006

UC Berkeley Department of Electrical Engineering and Computer Science. EE 126: Probablity and Random Processes. Solutions 9 Spring 2006 Exam format UC Bereley Departmet of Electrical Egieerig ad Computer Sciece EE 6: Probablity ad Radom Processes Solutios 9 Sprig 006 The secod midterm will be held o Wedesday May 7; CHECK the fial exam

More information

Exploratory Data Analysis

Exploratory Data Analysis 1 Exploratory Data Aalysis Exploratory data aalysis is ofte the rst step i a statistical aalysis, for it helps uderstadig the mai features of the particular sample that a aalyst is usig. Itelliget descriptios

More information

Systems Design Project: Indoor Location of Wireless Devices

Systems Design Project: Indoor Location of Wireless Devices Systems Desig Project: Idoor Locatio of Wireless Devices Prepared By: Bria Murphy Seior Systems Sciece ad Egieerig Washigto Uiversity i St. Louis Phoe: (805) 698-5295 Email: bcm1@cec.wustl.edu Supervised

More information

Quadrat Sampling in Population Ecology

Quadrat Sampling in Population Ecology Quadrat Samplig i Populatio Ecology Backgroud Estimatig the abudace of orgaisms. Ecology is ofte referred to as the "study of distributio ad abudace". This beig true, we would ofte like to kow how may

More information

3. Greatest Common Divisor - Least Common Multiple

3. Greatest Common Divisor - Least Common Multiple 3 Greatest Commo Divisor - Least Commo Multiple Defiitio 31: The greatest commo divisor of two atural umbers a ad b is the largest atural umber c which divides both a ad b We deote the greatest commo gcd

More information

Lesson 17 Pearson s Correlation Coefficient

Lesson 17 Pearson s Correlation Coefficient Outlie Measures of Relatioships Pearso s Correlatio Coefficiet (r) -types of data -scatter plots -measure of directio -measure of stregth Computatio -covariatio of X ad Y -uique variatio i X ad Y -measurig

More information

BASIC STATISTICS. f(x 1,x 2,..., x n )=f(x 1 )f(x 2 ) f(x n )= f(x i ) (1)

BASIC STATISTICS. f(x 1,x 2,..., x n )=f(x 1 )f(x 2 ) f(x n )= f(x i ) (1) BASIC STATISTICS. SAMPLES, RANDOM SAMPLING AND SAMPLE STATISTICS.. Radom Sample. The radom variables X,X 2,..., X are called a radom sample of size from the populatio f(x if X,X 2,..., X are mutually idepedet

More information

arxiv:1506.03481v1 [stat.me] 10 Jun 2015

arxiv:1506.03481v1 [stat.me] 10 Jun 2015 BEHAVIOUR OF ABC FOR BIG DATA By Wetao Li ad Paul Fearhead Lacaster Uiversity arxiv:1506.03481v1 [stat.me] 10 Ju 2015 May statistical applicatios ivolve models that it is difficult to evaluate the likelihood,

More information

An Efficient Polynomial Approximation of the Normal Distribution Function & Its Inverse Function

An Efficient Polynomial Approximation of the Normal Distribution Function & Its Inverse Function A Efficiet Polyomial Approximatio of the Normal Distributio Fuctio & Its Iverse Fuctio Wisto A. Richards, 1 Robi Atoie, * 1 Asho Sahai, ad 3 M. Raghuadh Acharya 1 Departmet of Mathematics & Computer Sciece;

More information