Introduction to Statistical Learning Theory


1 Introduction to Statistical Learning Theory

Olivier Bousquet (1), Stéphane Boucheron (2), and Gábor Lugosi (3)

(1) Max-Planck Institute for Biological Cybernetics, Spemannstr. 38, Tübingen, Germany. olivier.bousquet@m4x.org
(2) Université de Paris-Sud, Laboratoire d'Informatique, Bâtiment 490, Orsay Cedex, France. stephane.boucheron@lri.fr
(3) Department of Economics, Pompeu Fabra University, Ramon Trias Fargas 25-27, Barcelona, Spain. lugosi@upf.es

Abstract. The goal of statistical learning theory is to study, in a statistical framework, the properties of learning algorithms. In particular, most results take the form of so-called error bounds. This tutorial introduces the techniques that are used to obtain such results.

1 Introduction

The main goal of statistical learning theory is to provide a framework for studying the problem of inference, that is, of gaining knowledge, making predictions, making decisions or constructing models from a set of data. This is studied in a statistical framework, that is, there are assumptions of a statistical nature about the underlying phenomena (in the way the data is generated). As a motivation for the need for such a theory, let us just quote V. Vapnik [1]: "Nothing is more practical than a good theory." Indeed, a theory of inference should be able to give a formal definition of words like learning, generalization, and overfitting, and also to characterize the performance of learning algorithms so that, ultimately, it may help design better learning algorithms. There are thus two goals: make things more precise and derive new or improved algorithms.

1.1 Learning and Inference

What is under study here is the process of inductive inference, which can roughly be summarized as the following steps:

1. Observe a phenomenon.
2. Construct a model of that phenomenon.
3. Make predictions using this model.

Of course, this definition is very general and could be taken more or less as the goal of the Natural Sciences. The goal of Machine Learning is to actually automate this process, and the goal of Learning Theory is to formalize it.

In this tutorial we consider a special case of the above process: the supervised learning framework for pattern recognition. In this framework, the data consists of instance-label pairs, where the label is either +1 or -1. Given a set of such pairs, a learning algorithm constructs a function mapping instances to labels. This function should be such that it makes few mistakes when predicting the labels of unseen instances.

Of course, given some training data, it is always possible to build a function that fits the data exactly. But, in the presence of noise, this may not be the best thing to do, as it would lead to poor performance on unseen instances (this is usually referred to as overfitting).

[Fig. 1: Trade-off between fit and complexity]

The general idea behind the design of learning algorithms is thus to look for regularities (in a sense to be defined later) in the observed phenomenon (i.e. the training data). These can then be generalized from the observed past to the future. Typically, one would look, in a collection of possible models, for one which fits the data well but at the same time is as simple as possible (see Figure 1). This immediately raises the question of how to measure and quantify the simplicity of a model (i.e. a {-1, +1}-valued function).

It turns out that there are many ways to do so, but no best one. For example, in Physics, people tend to prefer models which have a small number of constants and which correspond to simple mathematical formulas. Often, the length of the description of a model in a coding language can be an indication of its complexity. In classical statistics, the number of free parameters of a model is usually a measure of its complexity. Surprising as it may seem, there is no universal way of measuring simplicity (or its counterpart, complexity), and the choice of a specific measure inherently depends on the problem at hand. It is actually in this choice that the designer of the learning algorithm introduces knowledge about the specific phenomenon under study.

This lack of a universally best choice can actually be formalized in what is called the No Free Lunch theorem, which in essence says that if there is no assumption on how the past (i.e. training data) is related to the future (i.e. test data), prediction is impossible. Even more, if there is no a priori restriction on the possible phenomena that are expected, it is impossible to generalize, and there is thus no algorithm better than another (any algorithm would be beaten by another one on some phenomenon). Hence the need to make assumptions, like the fact that the phenomenon we observe can be explained by a simple model. However, as we said, simplicity is not an absolute notion, and this leads to the statement that data cannot replace knowledge, or in pseudo-mathematical terms:

Generalization = Data + Knowledge

1.2 Assumptions

We now make more precise the assumptions that are made by the Statistical Learning Theory framework. Indeed, as we said before, we need to assume that the future (i.e. test) observations are related to the past (i.e. training) ones, so that the phenomenon is somewhat stationary. At the core of the theory is a probabilistic model of the phenomenon (or data generation process). Within this model, the relationship between past and future observations is that they are both sampled independently from the same distribution (i.i.d.). The independence assumption means that each new observation yields maximum information. The identical distribution means that the observations give information about the underlying phenomenon (here a probability distribution). An immediate consequence of this very general setting is that one can construct algorithms (e.g. k-nearest neighbors with appropriate k) that are consistent, which means that, as one gets more and more data, the predictions of the algorithm get closer and closer to the optimal ones. So this seems to indicate that we can have some sort of universal algorithm. Unfortunately, any (consistent) algorithm can have an arbitrarily bad behavior when given a finite training set. These notions are formalized in Appendix B. Again, this discussion indicates that generalization can only come when one adds specific knowledge to the data. Each learning algorithm encodes specific

knowledge (or a specific assumption about how the optimal classifier looks), and works best when this assumption is satisfied by the problem to which it is applied.

Bibliographical remarks. Several textbooks, surveys, and research monographs have been written on pattern classification and statistical learning theory. A partial list includes Anthony and Bartlett [2], Breiman, Friedman, Olshen, and Stone [3], Devroye, Györfi, and Lugosi [4], Duda and Hart [5], Fukunaga [6], Kearns and Vazirani [7], Kulkarni, Lugosi, and Venkatesh [8], Lugosi [9], McLachlan [10], Mendelson [11], Natarajan [12], Vapnik [13, 14, 1], and Vapnik and Chervonenkis [15].

2 Formalization

We consider an input space X and output space Y. Since we restrict ourselves to binary classification, we choose Y = {-1, 1}. Formally, we assume that the pairs (X, Y) ∈ X × Y are random variables distributed according to an unknown distribution P. We observe a sequence of n i.i.d. pairs (X_i, Y_i) sampled according to P, and the goal is to construct a function g : X → Y which predicts Y from X.

We need a criterion to choose this function g. This criterion is a low probability of error P(g(X) ≠ Y). We thus define the risk of g as

R(g) = P(g(X) ≠ Y) = E[1_{g(X) ≠ Y}].

Notice that P can be decomposed as P_X × P(Y|X). We introduce the regression function η(x) = E[Y|X = x] = 2 P[Y = 1|X = x] - 1 and the target function (or Bayes classifier) t(x) = sgn η(x). This function achieves the minimum risk over all possible measurable functions: R(t) = inf_g R(g). We will denote the value R(t) by R*, called the Bayes risk. In the deterministic case, one has Y = t(X) almost surely (P[Y = 1|X] ∈ {0, 1}) and R* = 0. In the general case we can define the noise level as s(x) = min(P[Y = 1|X = x], 1 - P[Y = 1|X = x]) = (1 - |η(x)|)/2 (s(X) = 0 almost surely in the deterministic case), and this gives R* = E[s(X)].

Our goal is thus to identify the function t, but since P is unknown we cannot directly measure the risk, and we also cannot know directly the value of t at the data points. We can only measure the agreement of a candidate function with the data. This is called the
empirical risk:

R_n(g) = (1/n) Σ_{i=1}^n 1_{g(X_i) ≠ Y_i}.

It is common to use this quantity as a criterion to select an estimate of t.
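In code, the empirical risk is simply the fraction of sample points on which a candidate classifier disagrees with the labels. The sketch below is illustrative and not from the tutorial: the noisy 1-D data and the candidate classifier g(x) = sgn(x) are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D data: X uniform on [-1, 1], noisy labels Y in {-1, +1}.
n = 1000
X = rng.uniform(-1.0, 1.0, size=n)
Y = np.where(X + 0.3 * rng.standard_normal(n) > 0, 1, -1)

def empirical_risk(g, X, Y):
    """R_n(g) = (1/n) * sum of indicators 1[g(X_i) != Y_i]."""
    return np.mean(g(X) != Y)

# A simple candidate classifier g(x) = sgn(x).
g = lambda x: np.where(x > 0, 1, -1)
print(empirical_risk(g, X, Y))
```

Because the labels are noisy, even this near-optimal candidate has a strictly positive empirical risk.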

2.1 Algorithms

Now that the goal is clearly specified, we review the common strategies to (approximately) achieve it. We denote by g_n the function returned by the algorithm. Because one cannot compute R(g) but only approximate it by R_n(g), it would be unreasonable to look for the function minimizing R_n(g) among all possible functions. Indeed, when the input space is infinite, one can always construct a function g_n which perfectly predicts the labels of the training data (i.e. g_n(X_i) = Y_i and R_n(g_n) = 0) but behaves on the other points as the opposite of the target function t, i.e. g_n(X) = -Y, so that R(g_n) = 1.[4] So one would have minimum empirical risk but maximum risk. It is thus necessary to prevent this overfitting situation. There are essentially two ways to do this (which can be combined): the first one is to restrict the class of functions in which the minimization is performed, and the second is to modify the criterion to be minimized (e.g. adding a penalty for complicated functions).

Empirical Risk Minimization. This algorithm is one of the most straightforward, yet it is usually efficient. The idea is to choose a model G of possible functions and to minimize the empirical risk in that model:

g_n = arg min_{g ∈ G} R_n(g).

Of course, this will work best when the target function belongs to G. However, it is rare to be able to make such an assumption, so one may want to enlarge the model as much as possible while preventing overfitting.

Structural Risk Minimization. The idea here is to choose an infinite sequence {G_d : d = 1, 2, ...} of models of increasing size and to minimize the empirical risk in each model with an added penalty for the size of the model:

g_n = arg min_{g ∈ G_d, d ∈ N} R_n(g) + pen(d, n).

The penalty pen(d, n) gives preference to models where the estimation error is small, and measures the size or capacity of the model.

Regularization. Another approach, usually easier to implement, consists in choosing a large model G (possibly dense in the continuous functions, for example) and defining on G a regularizer, typically a norm ‖g‖. One then has to minimize the regularized empirical risk:
g_n = arg min_{g ∈ G} R_n(g) + λ‖g‖².

[4] Strictly speaking, this is only possible if the probability distribution satisfies some mild conditions (e.g. has no atoms). Otherwise, it may not be possible to achieve R(g_n) = 1, but even in this case, provided the support of P contains infinitely many points, a similar phenomenon occurs.
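Structural Risk Minimization can be sketched on toy data. Everything below is an illustrative assumption, not from the tutorial: the nested models G_d are thresholds on a grid of 2^d points, and the penalty pen(d, n) = sqrt(d/n) is a hypothetical choice of complexity term.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = rng.uniform(0.0, 1.0, n)
# Hypothetical noisy labels with a true decision boundary near 0.37.
Y = np.where(X + 0.1 * rng.standard_normal(n) > 0.37, 1, -1)

def risk_n(t, X, Y):
    """Empirical risk of the threshold classifier g_t(x) = sgn(x - t)."""
    return np.mean(np.where(X > t, 1, -1) != Y)

# SRM over nested models G_1 ⊂ G_2 ⊂ ...: thresholds on a grid of 2^d points,
# with the illustrative penalty pen(d, n) = sqrt(d / n).
def srm(X, Y, max_d=8):
    best_score, best_t = np.inf, None
    for d in range(1, max_d + 1):
        for t in np.linspace(0.0, 1.0, 2 ** d):
            score = risk_n(t, X, Y) + np.sqrt(d / len(X))
            if score < best_score:
                best_score, best_t = score, t
    return best_t

print(srm(X, Y))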

Compared to SRM, there is here a free parameter λ, called the regularization parameter, which allows one to choose the right trade-off between fit and complexity. Tuning λ is usually a hard problem, and most often one uses extra validation data for this task. Most existing (and successful) methods can be thought of as regularization methods.

Normalized Regularization. There are other possible approaches when the regularizer can, in some sense, be normalized, i.e. when it corresponds to some probability distribution over G. Given a probability distribution π defined on G (usually called a prior), one can use as a regularizer -log π(g).[5] Reciprocally, from a regularizer of the form ‖g‖², if there exists a measure µ on G such that ∫ e^{-λ‖g‖²} dµ(g) < ∞ for some λ > 0, then one can construct a prior corresponding to this regularizer. For example, if G is the set of hyperplanes in R^d going through the origin, G can be identified with R^d and, taking µ as the Lebesgue measure, it is possible to go from the Euclidean norm regularizer to a spherical Gaussian measure on R^d as a prior.[6]

This type of normalized regularizer, or prior, can be used to construct another probability distribution ρ on G (usually called a posterior), as

ρ(g) = e^{-γ R_n(g)} π(g) / Z(γ),

where γ ≥ 0 is a free parameter and Z(γ) is a normalization factor.

There are several ways in which this ρ can be used. If we take the function maximizing it, we recover regularization, as

arg max_{g ∈ G} ρ(g) = arg min_{g ∈ G} γ R_n(g) - log π(g),

where the regularizer is -γ^{-1} log π(g).[7]

Also, ρ can be used to randomize the predictions. In that case, before computing the predicted label for an input x, one samples a function g according to ρ and outputs g(x). This procedure is usually called Gibbs classification.

Another way in which the distribution ρ constructed above can be used is by taking the expected prediction of the functions in G:

g_n(x) = sgn(E_ρ[g(x)]).

[5] This is fine when G is countable. In the continuous case, one has to consider the density associated to π. We omit these details.

[6] Generalization to infinite dimensional Hilbert
spaces can also be done, but it requires more care. One can, for example, establish a correspondence between the norm of a reproducing kernel Hilbert space and a Gaussian process prior whose covariance function is the kernel of this space.

[7] Note that minimizing γ R_n(g) - log π(g) is equivalent to minimizing R_n(g) - γ^{-1} log π(g).
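Both uses of the posterior ρ defined above (randomized prediction and expected prediction) can be sketched for a finite class. All concrete choices here are hypothetical and for illustration only: a grid of threshold classifiers as G, a uniform prior π, a noiseless target sgn(x - 0.5), and γ scaled by n.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = rng.uniform(0.0, 1.0, n)
Y = np.where(X > 0.5, 1, -1)   # hypothetical noiseless target sgn(x - 0.5)

# Finite class G: threshold classifiers g_t on a grid, with a uniform prior pi.
ts = np.linspace(0.0, 1.0, 51)
preds = np.where(X[None, :] > ts[:, None], 1, -1)   # |G| x n predictions
emp_risk = np.mean(preds != Y, axis=1)              # R_n(g) for each g in G

gamma = 5.0                                         # illustrative free parameter
rho = np.exp(-gamma * n * emp_risk)                 # posterior ~ e^{-gamma n R_n(g)} pi(g)
rho /= rho.sum()

# Gibbs classification: sample g ~ rho, then predict g(x).
g_idx = rng.choice(len(ts), p=rho)
x = 0.8
gibbs_pred = 1 if x > ts[g_idx] else -1

# Expected prediction: g_n(x) = sgn(E_rho[g(x)]).
avg_pred = np.sign(np.sum(rho * np.where(x > ts, 1, -1)))
print(gibbs_pred, avg_pred)
```

With this choice of γ, the posterior concentrates on thresholds with zero empirical risk, so both prediction rules agree with the target away from the boundary.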

This is typically called Bayesian averaging.

At this point we have to insist again on the fact that the choice of the class G and of the associated regularizer or prior has to come from a priori knowledge about the task at hand; there is no universally best choice.

2.2 Bounds

We have presented the framework of the theory and the type of algorithms that it studies; we now introduce the kind of results that it aims at. The overall goal is to characterize the risk that some algorithm may have in a given situation. More precisely, a learning algorithm takes as input the data (X_1, Y_1), ..., (X_n, Y_n) and produces a function g_n which depends on this data. We want to estimate the risk of g_n. However, R(g_n) is a random variable (since it depends on the data), and it cannot be computed from the data (since it also depends on the unknown P). Estimates of R(g_n) thus usually take the form of probabilistic bounds.

Notice that when the algorithm chooses its output from a model G, it is possible, by introducing the best function g* in G, with R(g*) = inf_{g ∈ G} R(g), to write

R(g_n) - R* = [R(g*) - R*] + [R(g_n) - R(g*)].

The first term on the right hand side is usually called the approximation error, and measures how well functions in G can approach the target (it would be zero if t ∈ G). The second term, called the estimation error, is a random quantity (it depends on the data) and measures how close g_n is to the best possible choice in G.

Estimating the approximation error is usually hard, since it requires knowledge about the target. Classically, in Statistical Learning Theory it is preferable to avoid making specific assumptions about the target (such as its belonging to some model); the assumptions are rather on the value of R*, or on the noise function s. It is also known that for any (consistent) algorithm, the rate of convergence to zero of the approximation error[8] can be arbitrarily slow if one does not make assumptions about the regularity of the target, while the rate of convergence of the estimation error can be computed without any such assumption. We will thus focus on
the estimation error.

Another possible decomposition of the risk is the following:

R(g_n) = R_n(g_n) + [R(g_n) - R_n(g_n)].

In this case, one estimates the risk by its empirical counterpart, plus some quantity which approximates (or upper bounds) R(g_n) - R_n(g_n).

To summarize, we write the three types of results we may be interested in.

[8] For this convergence to mean anything, one has to consider algorithms which choose functions from a class which grows with the sample size. This is the case, for example, of Structural Risk Minimization or Regularization based algorithms.

Error bound: R(g_n) ≤ R_n(g_n) + B(n, G). This corresponds to the estimation of the risk from an empirical quantity.

Error bound relative to the best in the class: R(g_n) ≤ R(g*) + B(n, G). This tells how close to optimal the algorithm is, given the model it uses.

Error bound relative to the Bayes risk: R(g_n) ≤ R* + B(n, G). This gives theoretical guarantees on the convergence to the Bayes risk.

3 Basic Bounds

In this section we show how to obtain simple error bounds (also called generalization bounds). The elementary material from probability theory that is needed here and in the later sections is summarized in Appendix A.

3.1 Relationship to Empirical Processes

Recall that we want to estimate the risk R(g_n) = E[1_{g_n(X) ≠ Y}] of the function g_n returned by the algorithm after seeing the data (X_1, Y_1), ..., (X_n, Y_n). This quantity cannot be observed (P is unknown) and is a random variable (since it depends on the data). Hence one way to make a statement about this quantity is to say how it relates to an estimate such as the empirical risk R_n(g_n). This relationship can take the form of upper and lower bounds for

P[R(g_n) - R_n(g_n) > ε].

For convenience, let Z_i = (X_i, Y_i) and Z = (X, Y). Given G, define the loss class

F = {f : (x, y) → 1_{g(x) ≠ y} : g ∈ G}.   (1)

Notice that G contains functions with range in {-1, 1} while F contains nonnegative functions with range in {0, 1}. In the remainder of the tutorial, we will go back and forth between F and G (as there is a bijection between them), sometimes stating the results in terms of functions in F and sometimes in terms of functions in G. It will be clear from the context which classes G and F we refer to, and F will always be derived from the last mentioned class G in the way of (1).

We use the shorthand notation P f = E[f(X, Y)] and P_n f = (1/n) Σ_{i=1}^n f(X_i, Y_i). P_n is usually called the empirical measure associated to the training sample. With this notation, the quantity of interest (the difference between true and empirical risks) can be written as

P f_n - P_n f_n.   (2)

An empirical process is a collection of random variables indexed by a class of functions, and such that each random
variable is distributed as a sum of i.i.d. random variables (values taken by the function at the data points):

{P f - P_n f}_{f ∈ F}.

One of the most studied quantities associated with empirical processes is their supremum:

sup_{f ∈ F} (P f - P_n f).

It is clear that if we know an upper bound on this quantity, it will also be an upper bound on (2). This shows that the theory of empirical processes is a great source of tools and techniques for Statistical Learning Theory.

3.2 Hoeffding's Inequality

Let us rewrite again the quantity we are interested in as follows:

R(g) - R_n(g) = E[f(Z)] - (1/n) Σ_{i=1}^n f(Z_i).

It is easy to recognize here the difference between the expectation and the empirical average of the random variable f(Z). By the law of large numbers, we immediately obtain that

P[ lim_{n→∞} ( (1/n) Σ_{i=1}^n f(Z_i) - E[f(Z)] ) = 0 ] = 1.

This indicates that, with enough samples, the empirical risk of a function is a good approximation to its true risk. It turns out that there exists a quantitative version of the law of large numbers when the variables are bounded.

Theorem 1 (Hoeffding). Let Z_1, ..., Z_n be i.i.d. random variables with f(Z) ∈ [a, b]. Then for all ε > 0, we have

P[ |(1/n) Σ_{i=1}^n f(Z_i) - E[f(Z)]| > ε ] ≤ 2 exp(-2nε² / (b - a)²).

Let us rewrite the above formula to better understand its consequences. Denote the right hand side by δ. Then

P[ |P_n f - P f| > (b - a) √(log(2/δ) / (2n)) ] ≤ δ,

or (by inversion, see Appendix A), with probability at least 1 - δ,

|P_n f - P f| ≤ (b - a) √(log(2/δ) / (2n)).
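Hoeffding's inequality can be checked numerically. The sketch below uses hypothetical Bernoulli data (so f is the identity and [a, b] = [0, 1]) and compares the observed frequency of large deviations with the bound of Theorem 1; this is an illustration, not part of the tutorial.

```python
import numpy as np

rng = np.random.default_rng(3)
n, eps, trials = 100, 0.1, 20000

# Z_i ~ Bernoulli(0.3); f is the identity, so f(Z) is in [0, 1] (a = 0, b = 1).
Z = rng.random((trials, n)) < 0.3
dev = np.abs(Z.mean(axis=1) - 0.3)
freq = np.mean(dev > eps)

bound = 2 * np.exp(-2 * n * eps ** 2)   # Hoeffding: 2 exp(-2 n eps^2 / (b-a)^2)
print(freq, bound)
```

The observed frequency stays well below the bound, which is expected: Hoeffding only uses boundedness, not the actual variance of the variables.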

Applying this to f(Z) = 1_{g(X) ≠ Y}, we get that for any g, and any δ > 0, with probability at least 1 - δ,

R(g) ≤ R_n(g) + √(log(2/δ) / (2n)).   (3)

Notice that one has to consider a fixed function g, and the probability is with respect to the sampling of the data. If the function depends on the data, this does not apply!

3.3 Limitations

Although the above result seems very nice (since it applies to any class of bounded functions), it is actually severely limited. Indeed, what it essentially says is that for each (fixed) function f ∈ F, there is a set S of samples for which P f - P_n f ≤ √(log(2/δ) / (2n)) (and this set of samples has measure P[S] ≥ 1 - δ). However, these sets S may be different for different functions. In other words, for the observed sample, only some of the functions in F will satisfy this inequality.

Another way to explain the limitation of Hoeffding's inequality is the following. If we take for G the class of all {-1, 1}-valued (measurable) functions, then for any fixed sample, there exists a function f ∈ F such that P f - P_n f = 1. To see this, take the function g which satisfies g(X_i) = Y_i on the data and g(X) = -Y everywhere else. This does not contradict Hoeffding's inequality, but it shows that it does not yield what we need.

Figure 2 illustrates the above argument.

[Fig. 2: Convergence of the empirical risk to the true risk over the class of functions]

The horizontal axis corresponds

to the functions in the class. The two curves represent the true risk and the empirical risk (for some training sample) of these functions. The true risk is fixed, while for each different sample, the empirical risk will be a different curve. If we observe a fixed function g and take several different samples, the point on the empirical curve will fluctuate around the true risk, with fluctuations controlled by Hoeffding's inequality. However, for a fixed sample, if the class G is big enough, one can find, somewhere along the axis, a function for which the difference between the two curves will be very large.

3.4 Uniform Deviations

Before seeing the data, we do not know which function the algorithm will choose. The idea is to consider uniform deviations:

R(f_n) - R_n(f_n) ≤ sup_{f ∈ F} (R(f) - R_n(f)).   (4)

In other words, if we can upper bound the supremum on the right, we are done. For this, we need a bound which holds simultaneously for all functions in a class.

Let us explain how one can construct such uniform bounds. Consider two functions f_1, f_2 and define

C_i = {(x_1, y_1), ..., (x_n, y_n) : P f_i - P_n f_i > ε}.

This set contains all the 'bad' samples, i.e. those for which the bound fails. From Hoeffding's inequality, for each i,

P[C_i] ≤ δ.

We want to measure how many samples are 'bad' for i = 1 or i = 2. For this we use (see Appendix A)

P[C_1 ∪ C_2] ≤ P[C_1] + P[C_2] ≤ 2δ.

More generally, if we have N functions in our class, we can write

P[C_1 ∪ ... ∪ C_N] ≤ Σ_{i=1}^N P[C_i].

As a result we obtain

P[ ∃f ∈ {f_1, ..., f_N} : P f - P_n f > ε ] ≤ Σ_{i=1}^N P[P f_i - P_n f_i > ε] ≤ N exp(-2nε²).

Hence, for G = {g_1, ..., g_N}, for all δ > 0, with probability at least 1 - δ,

∀g ∈ G, R(g) ≤ R_n(g) + √((log N + log(1/δ)) / (2n)).

This is an error bound. Indeed, if we know that our algorithm picks functions from G, we can apply this result to g_n itself.

Notice that the main difference with Hoeffding's inequality is the extra log N term on the right hand side. This is the term which accounts for the fact that we want N bounds to hold simultaneously. Another interpretation of this term is as the number of bits one would require to specify one function in G. It turns out that this kind of coding interpretation of generalization bounds is often possible and can be used to obtain error estimates [16].

3.5 Estimation Error

Using the same idea as before, and with no additional effort, we can also get a bound on the estimation error. We start from the inequality

R(g_n) ≤ R_n(g_n) + sup_{g ∈ G} (R(g) - R_n(g)),

which we combine with (4) and with the fact that, since g_n minimizes the empirical risk in G,

R_n(g_n) - R_n(g*) ≤ 0.

Thus we obtain

R(g_n) = R(g_n) - R(g*) + R(g*)
       ≤ R(g_n) - R_n(g_n) + R_n(g*) - R(g*) + R(g*)
       ≤ 2 sup_{g ∈ G} |R(g) - R_n(g)| + R(g*).

We obtain that, with probability at least 1 - δ,

R(g_n) ≤ R(g*) + 2 √((log N + log(2/δ)) / (2n)).

We notice that in the right hand side both terms depend on the size of the class G. If this size increases, the first term will decrease, while the second will increase.

3.6 Summary and Perspective

At this point, we can summarize what we have exposed so far. Inference requires assumptions on the process generating the data (data sampled i.i.d. from an unknown P); generalization requires knowledge (e.g. restriction, structure, or prior).
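The finite-class bound can be checked by simulation over repeated training sets. The setup below is an illustrative assumption, not from the tutorial: a class of N = 50 threshold classifiers on [0, 1] with X uniform and Y = sgn(X - 0.5), for which the true risk of g_t is |t - 0.5| and can be compared exactly with the empirical risk.

```python
import numpy as np

rng = np.random.default_rng(4)
n, N, delta, trials = 200, 50, 0.05, 2000

ts = np.linspace(0.0, 1.0, N)      # class G of N threshold classifiers
true_risk = np.abs(ts - 0.5)       # for X ~ U[0,1], Y = sgn(X - 0.5): R(g_t) = |t - 0.5|

bound = np.sqrt((np.log(N) + np.log(1 / delta)) / (2 * n))

failures = 0
for _ in range(trials):
    X = rng.uniform(0.0, 1.0, n)
    Y = np.where(X > 0.5, 1, -1)
    preds = np.where(X[None, :] > ts[:, None], 1, -1)
    emp_risk = np.mean(preds != Y, axis=1)
    # Does the uniform deviation exceed the bound on this training set?
    if np.max(true_risk - emp_risk) > bound:
        failures += 1

print(failures / trials)
```

The fraction of training sets on which the uniform bound fails should stay below δ = 0.05 (in fact far below it, since the union bound treats the highly correlated thresholds as if they were independent).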

The error bounds are valid with respect to the repeated sampling of training sets:

- For a fixed function g, for most of the samples, R(g) - R_n(g) ≈ 1/√n.
- For most of the samples, if |G| = N, sup_{g ∈ G} (R(g) - R_n(g)) ≈ √(log N / n).

The extra variability comes from the fact that the chosen g_n changes with the data. So the result we have obtained so far is that, with high probability, for a finite class of size N,

sup_{g ∈ G} (R(g) - R_n(g)) ≤ √((log N + log(1/δ)) / (2n)).

There are several things that can be improved:

- Hoeffding's inequality only uses the boundedness of the functions, not their variance.
- The union bound is as bad as if all the functions in the class were independent (i.e. as if f_1(Z) and f_2(Z) were independent).
- The supremum over G of R(g) - R_n(g) is not necessarily what the algorithm would choose, so that upper bounding R(g_n) - R_n(g_n) by the supremum might be loose.

4 Infinite Case: Vapnik-Chervonenkis Theory

In this section we show how to extend the previous results to the case where the class G is infinite. This requires, in the non-countable case, the introduction of tools from Vapnik-Chervonenkis Theory.

4.1 Refined Union Bound and Countable Case

We first start with a simple refinement of the union bound that allows us to extend the previous results to the (countably) infinite case. Recall that, by Hoeffding's inequality, for each f ∈ F, for each δ > 0 (possibly depending on f, which we write δ(f)),

P[ P f - P_n f > √(log(1/δ(f)) / (2n)) ] ≤ δ(f).

Hence, if we have a countable set F, the union bound immediately yields

P[ ∃f ∈ F : P f - P_n f > √(log(1/δ(f)) / (2n)) ] ≤ Σ_{f ∈ F} δ(f).

Choosing δ(f) = δ p(f) with Σ_{f ∈ F} p(f) = 1, this makes the right-hand side equal to δ, and we get the following result: with probability at least 1 - δ,

∀f ∈ F, P f ≤ P_n f + √((log(1/p(f)) + log(1/δ)) / (2n)).

We notice that if F is finite (with size N), taking a uniform p gives the log N as before.

Using this approach, it is possible to put knowledge about the algorithm into p(f), but p should be chosen before seeing the data, so it is not possible to 'cheat' by setting all the weight to the function returned by the algorithm after seeing the data (which would give the smallest possible bound). But, in general, if p is well chosen, the bound will have a small value. Hence, the bound can be improved if one knows ahead of time the functions that the algorithm is likely to pick (i.e. knowledge improves the bound).

4.2 General Case

When the set G is uncountable, the previous approach does not directly work. The general idea is to look at the function class 'projected' on the sample. More precisely, given a sample z_1, ..., z_n, we consider

F_{z_1,...,z_n} = {(f(z_1), ..., f(z_n)) : f ∈ F}.

The size of this set is the number of possible ways in which the data (z_1, ..., z_n) can be classified. Since the functions f can only take two values, this set will always be finite, no matter how big F is.

Definition 1 (Growth function). The growth function is the maximum number of ways into which n points can be classified by the function class:

S_F(n) = sup_{(z_1,...,z_n)} |F_{z_1,...,z_n}|.

We have defined the growth function in terms of the loss class F, but we can do the same with the initial class G, and notice that S_F(n) = S_G(n). It turns out that this growth function can be used as a measure of the 'size' of a class of functions, as demonstrated by the following result.

Theorem 2 (Vapnik-Chervonenkis). For any δ > 0, with probability at least 1 - δ,

∀g ∈ G, R(g) ≤ R_n(g) + 2 √(2 (log S_G(2n) + log(2/δ)) / n).

Notice that, in the finite case where |G| = N, we have S_G(n) ≤ N, so that this bound is always better than the one we had before (except for the constants). But the problem now becomes one of computing S_G(n).

4.3 VC Dimension

Since g ∈ {-1, 1}, it is clear that S_G(n) ≤ 2^n. If S_G(n) = 2^n, there is a set of size n such that the class of functions can generate any classification on these points (we say that G shatters the set).

Definition 2 (VC dimension). The VC dimension of a class G is the largest n such that S_G(n) = 2^n.

In other words, the VC dimension of a class G is the size of the largest set that it can shatter. In order to illustrate this definition, we give some examples. The first one is the set of half-planes in R^d (see Figure 3). In this case, as depicted for the case d = 2, one can shatter a set of d + 1 points but no set of d + 2 points, which means that the VC dimension is d + 1.

[Fig. 3: Computing the VC dimension of hyperplanes in dimension 2: a set of 3 points can be shattered, but no set of four points.]

It is interesting to notice that the number of parameters needed to define half-spaces in R^d is d, so that a natural question is whether the VC dimension is related to the number of parameters of the function class. The next example, depicted in Figure 4, is a family of functions with one parameter only, {sgn(sin(tx)) : t ∈ R}, which actually has infinite VC dimension (this is an exercise left to the reader).
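Shattering and the growth function can be made concrete with an even simpler one-parameter class (an illustrative example, not one worked out in the tutorial text): half-lines g_t(x) = sgn(x - t) on the real line. The sketch below enumerates the labelings this class realizes on three points, showing S_G(3) = 4 < 2³, so no set of three points is shattered; one point is, and the VC dimension is 1.

```python
# Class G of half-lines on R: g_t(x) = sgn(x - t).
def achievable_labelings(points):
    """Enumerate all labelings of `points` realizable by some threshold t."""
    pts = sorted(points)
    mids = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    # One threshold below all points, one between each pair, one above all.
    thresholds = [pts[0] - 1.0] + mids + [pts[-1] + 1.0]
    return {tuple(1 if x > t else -1 for x in pts) for t in thresholds}

labelings = achievable_labelings([0.1, 0.4, 0.9])
print(len(labelings), 2 ** 3)   # 4 of the 8 possible labelings: 3 points are not shattered
```

Mixed labelings such as (+1, -1, +1) are never produced, which is exactly why the growth function (here n + 1) stops being 2^n beyond the VC dimension.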

[Fig. 4: VC dimension of sinusoids.]

It remains to show how the notion of VC dimension can bring a solution to the problem of computing the growth function. Indeed, at first glance, if we know that a class has VC dimension h, it entails that for all n ≤ h, S_G(n) = 2^n, and S_G(n) < 2^n otherwise. This seems of little use, but actually an intriguing phenomenon occurs for n ≥ h, as depicted in Figure 5.

[Fig. 5: Typical behavior of the log growth function.]

The growth function, which is exponential (its logarithm is linear) up until the VC dimension, becomes polynomial afterwards. This behavior is captured in the following lemma.

Lemma 1 (Vapnik and Chervonenkis, Sauer, Shelah). Let G be a class of functions with finite VC dimension h. Then for all n,

S_G(n) ≤ Σ_{i=0}^{h} (n choose i),

and for all n ≥ h,

S_G(n) ≤ (en/h)^h.

Using this lemma along with Theorem 2, we immediately obtain that, if G has VC dimension h, with probability at least 1 - δ,

∀g ∈ G, R(g) ≤ R_n(g) + 2 √(2 (h log(2en/h) + log(2/δ)) / n).

What is important to recall from this result is that the difference between the true and empirical risk is at most of order √(h log n / n).

An interpretation of VC dimension and growth functions is that they measure the 'effective' size of the class, that is, the size of the projection of the class onto finite samples. In addition, this measure does not just count the number of functions in the class, but depends on the geometry of the class (or rather of its projections). Finally, the finiteness of the VC dimension ensures that the empirical risk will converge uniformly over the class to the true risk.

4.4 Symmetrization

We now indicate how to prove Theorem 2. The key ingredient of the proof is the so-called symmetrization lemma. The idea is to replace the true risk by an estimate computed on an independent set of data. This is of course a mathematical technique and does not mean one needs to have more data to be able to apply the result. The extra data set is usually called a 'virtual' or 'ghost' sample. We will denote by Z'_1, ..., Z'_n an independent (ghost) sample and by P'_n the corresponding empirical measure.

Lemma 2 (Symmetrization). For any t > 0 such that nt² ≥ 2,

P[ sup_{f ∈ F} (P - P_n)f ≥ t ] ≤ 2 P[ sup_{f ∈ F} (P'_n - P_n)f ≥ t/2 ].

Proof. Let f_n be the function achieving the supremum (note that it depends on Z_1, ..., Z_n). One has (with ∧ denoting the conjunction of two events)

1_{(P - P_n)f_n > t} 1_{(P - P'_n)f_n < t/2} = 1_{(P - P_n)f_n > t ∧ (P - P'_n)f_n < t/2} ≤ 1_{(P'_n - P_n)f_n > t/2}.

Taking expectations with respect to the second sample gives

1_{(P - P_n)f_n > t} P'[ (P - P'_n)f_n < t/2 ] ≤ P'[ (P'_n - P_n)f_n > t/2 ].

By Chebyshev's inequality (see Appendix A),

P'[ (P - P'_n)f ≥ t/2 ] ≤ 4 Var f / (n t²) ≤ 1 / (n t²).

Indeed, a random variable with range in [0, 1] has variance less than 1/4. Hence

1_{(P - P_n)f_n > t} (1 - 1/(nt²)) ≤ P'[ (P'_n - P_n)f_n > t/2 ].

Taking expectation with respect to the first sample gives the result.

This lemma allows us to replace the expectation P f by an empirical average over the ghost sample. As a result, the right hand side only depends on the projection of the class F on the double sample,

F_{Z_1,...,Z_n,Z'_1,...,Z'_n},

which contains finitely many different vectors. One can thus use the simple union bound that was presented before in the finite case. The other ingredient that is needed to obtain Theorem 2 is again Hoeffding's inequality, in the following form:

P[ P_n f - P'_n f > t ] ≤ e^{-nt²/2}.

We now just have to put the pieces together:

P[ sup_{f ∈ F} (P - P_n)f ≥ t ]
  ≤ 2 P[ sup_{f ∈ F} (P'_n - P_n)f ≥ t/2 ]
  ≤ 2 S_F(2n) P[ (P'_n - P_n)f ≥ t/2 ]
  ≤ 4 S_F(2n) e^{-nt²/8}.

Using inversion finishes the proof of Theorem 2.

4.5 VC Entropy

One important aspect of the VC dimension is that it is distribution independent. Hence, it allows one to get bounds that do not depend on the problem at hand: the same bound holds for any distribution. Although this may be seen as an advantage, it can also be a drawback since, as a result, the bound may be loose for most distributions. We now show how to modify the proof above to get a distribution-dependent result. We use the following notation: N(F, z_1, ..., z_n) := |F_{z_1,...,z_n}|.

Definition 3 (VC entropy). The (annealed) VC entropy is defined as

H_F(n) = log E[ N(F, Z_1, ..., Z_n) ].
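For simple classes, the expectation inside the annealed VC entropy can be estimated by Monte Carlo. The sketch below is illustrative and not from the tutorial: it uses threshold classifiers on uniform data (a fine grid of thresholds stands in for the full class) and averages the number of distinct projections over simulated samples; for this class one expects about n + 1 projections per sample.

```python
import numpy as np

rng = np.random.default_rng(6)
n, trials = 50, 200
ts = np.linspace(0.0, 1.0, 5001)   # fine grid standing in for all thresholds

counts = []
for _ in range(trials):
    Z = rng.uniform(0.0, 1.0, n)
    # Projection of the class on the sample: distinct {-1,+1}^n vectors.
    proj = np.unique(np.where(Z[None, :] > ts[:, None], 1, -1), axis=0)
    counts.append(len(proj))

H = np.log(np.mean(counts))        # Monte Carlo estimate of H(n) = log E[N(F, Z_1..Z_n)]
print(H, np.log(n + 1))
```

The estimate is close to log(n + 1), far below the worst-case value n log 2, which is the point of distribution-dependent capacity measures.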

Theorem 3. For any δ > 0, with probability at least 1 - δ,

∀g ∈ G, R(g) ≤ R_n(g) + 2 √(2 (H_G(2n) + log(2/δ)) / n).

Proof. We again begin with the symmetrization lemma, so that we have to upper bound the quantity

I = P[ sup_{f ∈ F} (P'_n - P_n)f ≥ t/2 ].

Let σ_1, ..., σ_n be independent random variables such that P(σ_i = 1) = P(σ_i = -1) = 1/2 (they are called Rademacher variables). We notice that the quantities (P'_n - P_n)f and (1/n) Σ_{i=1}^n σ_i (f(Z'_i) - f(Z_i)) have the same distribution, since changing one σ_i corresponds to exchanging Z_i and Z'_i. Hence we have

I ≤ E[ P_σ[ sup_{f ∈ F_{Z_1,...,Z_n,Z'_1,...,Z'_n}} (1/n) Σ_{i=1}^n σ_i (f(Z'_i) - f(Z_i)) ≥ t/2 ] ],

and the union bound leads to

I ≤ E[ N(F, Z_1, ..., Z_n, Z'_1, ..., Z'_n) max_f P_σ[ (1/n) Σ_{i=1}^n σ_i (f(Z'_i) - f(Z_i)) ≥ t/2 ] ].

Since σ_i (f(Z'_i) - f(Z_i)) ∈ [-1, 1], Hoeffding's inequality finally gives

I ≤ E[ N(F, Z_1, ..., Z_n, Z'_1, ..., Z'_n) ] e^{-nt²/8}.

The rest of the proof is as before.

5 Capacity Measures

We have seen so far three measures of capacity or 'size' of classes of functions: the VC dimension and the growth function, both distribution independent, and the VC entropy, which depends on the distribution. Apart from the VC dimension, they are usually hard or impossible to compute. There are, however, other measures which not only may give sharper estimates, but also have properties that make their computation possible from the data only.

5.1 Covering Numbers

We start by endowing the function class F with the following (random) metric:

d_n(f, f') = (1/n) |{f(Z_i) ≠ f'(Z_i) : i = 1, ..., n}|.

This is the normalized Hamming distance of the projections on the sample. Given such a metric, we say that a set f_1,…,f_N covers F at radius ε if

F ⊂ ∪_{i=1}^N B(f_i, ε).

We then define the covering numbers of F as follows.

Definition 4 (Covering number). The covering number of F at radius ε, with respect to d_n, denoted by N(F, ε, n), is the minimum size of a cover of radius ε.

Notice that it does not matter whether we apply this definition to the original class G or the loss class F, since N(F, ε, n) = N(G, ε, n). The covering numbers characterize the size of a function class as measured by the metric d_n. The rate of growth of the logarithm of N(G, ε, n), usually called the metric entropy, is related to the classical concept of vector dimension. Indeed, if G is a compact set in a d-dimensional Euclidean space, N(G, ε, n) ≈ ε^{−d}.

When the covering numbers are finite, it is possible to approximate the class G by a finite set of functions (which cover G). This again allows to use the finite union bound, provided we can relate the behavior of all functions in G to that of the functions in the cover. A typical result, which we give without proof, is the following.

Theorem 4. For any t > 0,

P[∃g ∈ G : R(g) > R_n(g) + t] ≤ 8 E[N(G, t, n)] e^{−nt²/128}.

Covering numbers can also be defined for classes of real-valued functions.

We now relate the covering numbers to the VC dimension. Notice that, because the functions in G can only take two values, for all ε > 0, N(G, ε, n) ≤ |G_{Z_1^n}| = N(G, Z_1^n). Hence the VC entropy corresponds to log covering numbers at minimal scale, which implies log N(G, ε, n) ≤ h log(en/h), but one can have a considerably better result.

Lemma 3 (Haussler). Let G be a class of VC dimension h. Then, for all ε > 0, all n, and any sample,

N(G, ε, n) ≤ C h (4e)^h ε^{−h}.

The interest of this result is that the upper bound does not depend on the sample size n. The covering number bound is a generalization of the VC entropy bound where the scale is adapted to the error. It turns out that this result can be improved by considering all scales (see Section 5.2).

5.2 Rademacher Averages

Recall that we used in the proof of Theorem 3 Rademacher random variables, i.e. independent {−1, 1}-valued random variables with probability 1/2 of taking either value.
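To make the covering-number definition concrete, here is a small illustrative sketch (not from the text; the class of threshold classifiers, the sample and the radius are toy choices): it enumerates the projections of the class on a sample and builds a cover at radius ε under the empirical Hamming metric d_n with a simple greedy procedure.

```python
# Greedy covering of a function class under the empirical Hamming metric
# d_n(f, f') = (1/n) * #{i : f(Z_i) != f'(Z_i)}.
import random

random.seed(1)
n = 20
sample = sorted(random.random() for _ in range(n))

# Threshold classifiers g_theta(x) = 1 if x >= theta; their projections
# on the sample form at most n + 1 distinct {0,1}-vectors.
projections = {tuple(1 if x >= theta else 0 for x in sample)
               for theta in [0.0] + [x + 1e-9 for x in sample]}

def d_n(u, v):
    return sum(a != b for a, b in zip(u, v)) / len(u)

def greedy_cover(points, eps):
    # Every point not within eps of an existing center becomes a center,
    # so the returned set covers all points at radius eps.
    centers = []
    for p in points:
        if all(d_n(p, c) > eps for c in centers):
            centers.append(p)
    return centers

eps = 0.25
cover = greedy_cover(sorted(projections), eps)
# |cover| is an upper bound on N(F, eps, n) for this sample.
assert all(any(d_n(p, c) <= eps for c in cover) for p in projections)
print(len(projections), len(cover))
```

The greedy cover is not minimal in general, but it already shows the point of Definition 4: at radius ε the class is summarized by far fewer representatives than it has projections.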

For convenience we introduce the following notation (signed empirical measure): R_n f = (1/n) Σ_{i=1}^n σ_i f(Z_i). We will denote by E_σ the expectation taken with respect to the Rademacher variables only (i.e. conditionally on the data), while E will denote the expectation with respect to all the random variables (i.e. the data, the ghost sample and the Rademacher variables).

Definition 5 (Rademacher averages). For a class F of functions, the Rademacher average is defined as

R(F) = E[sup_{f∈F} R_n f],

and the conditional Rademacher average is defined as

R_n(F) = E_σ[sup_{f∈F} R_n f].

We now state the fundamental result involving Rademacher averages.

Theorem 5. For all δ > 0, with probability at least 1 − δ,

∀f ∈ F, Pf ≤ P_n f + 2R(F) + √( log(1/δ) / (2n) ),

and also, with probability at least 1 − δ,

∀f ∈ F, Pf ≤ P_n f + 2R_n(F) + √( 2 log(2/δ) / n ).

It is remarkable that one can obtain a bound (the second part of the theorem) which depends solely on the data.

The proof of the above result requires a powerful tool called a concentration inequality for empirical processes. Actually, Hoeffding's inequality is a (simple) concentration inequality, in the sense that when n increases, the empirical average is concentrated around the expectation. It is possible to generalize this result to functions that depend on i.i.d. random variables, as shown in the theorem below.

Theorem 6 (McDiarmid [17]). Assume that for all i = 1,…,n,

sup_{z_1,…,z_n, z'_i} |F(z_1,…,z_i,…,z_n) − F(z_1,…,z'_i,…,z_n)| ≤ c.

Then for all ε > 0,

P[|F − E[F]| > ε] ≤ 2 exp( −2ε² / (nc²) ).

The meaning of this result is thus that, as soon as one has a function of n independent random variables whose variation is bounded when one variable is modified, the function satisfies a Hoeffding-like inequality.

Proof of Theorem 5. The proof follows three steps:

1. Use concentration to relate sup_{f∈F} (Pf − P_n f) to its expectation,
2. use symmetrization to relate this expectation to the Rademacher average,
3. use concentration again to relate the Rademacher average to the conditional one.

We first show that McDiarmid's inequality can be applied to sup_{f∈F} (Pf − P_n f). We denote temporarily by P_n^i the empirical measure obtained by modifying one element of the sample (e.g. Z_i is replaced by Z'_i). It is easy to check that the following holds:

sup_f (Pf − P_n f) − sup_f (Pf − P_n^i f) ≤ sup_f (P_n^i f − P_n f).

Since f ∈ {0, 1}, we obtain

P_n^i f − P_n f = (f(Z'_i) − f(Z_i))/n ≤ 1/n,

and thus McDiarmid's inequality can be applied with c = 1/n. This concludes the first step of the proof.

We next prove the (first part of the) following symmetrization lemma.

Lemma 4. For any class F,

E[sup_{f∈F} (Pf − P_n f)] ≤ 2 E[sup_{f∈F} R_n f],

and a similar inequality holds in the reverse direction, up to a factor 1/2.

Proof. We only prove the first part. We introduce a ghost sample and its corresponding empirical measure P'_n. We successively use the fact that E'[P'_n f] = Pf, and the fact that the supremum is a convex function (hence we can apply Jensen's inequality, see Appendix A):

E[sup_f (Pf − P_n f)] = E[sup_f (E'[P'_n f] − P_n f)]
  ≤ E[sup_f (P'_n f − P_n f)]
  = E[ E_σ[ sup_f (1/n) Σ_{i=1}^n σ_i (f(Z'_i) − f(Z_i)) ] ]
  ≤ E[ E_σ[ sup_f (1/n) Σ_{i=1}^n σ_i f(Z'_i) ] ] + E[ E_σ[ sup_f (1/n) Σ_{i=1}^n (−σ_i) f(Z_i) ] ]
  = 2 E[sup_f R_n f],

where the third step uses the fact that f(Z'_i) − f(Z_i) and σ_i(f(Z'_i) − f(Z_i)) have the same distribution, and the last step uses the fact that σ_i f(Z'_i) and −σ_i f(Z_i) both have the same distribution as σ_i f(Z_i). □

The above already establishes the first part of Theorem 5. For the second part, we need to use concentration again. For this we apply McDiarmid's inequality to the functional

F(Z_1,…,Z_n) = R_n(F).

It is easy to check that F satisfies McDiarmid's assumption with c = 1/n. As a result, E[F] = R(F) can be sharply estimated by F = R_n(F). □

Loss Class and Initial Class. In order to make use of Theorem 5, we have to relate the Rademacher average of the loss class to that of the initial class. This can be done with the following derivation, where one uses the fact that σ_i and −σ_i Y_i have the same distribution:

R(F) = E[sup_{g∈G} (1/n) Σ_{i=1}^n σ_i 1_{g(X_i)≠Y_i}]
     = E[sup_{g∈G} (1/n) Σ_{i=1}^n σ_i (1 − Y_i g(X_i))/2]
     = (1/2) E[sup_{g∈G} (1/n) Σ_{i=1}^n (−σ_i Y_i) g(X_i)]
     = (1/2) R(G).

Notice that the same is valid for conditional Rademacher averages, so that we obtain that, with probability at least 1 − δ,

∀g ∈ G, R(g) ≤ R_n(g) + R_n(G) + √( 2 log(2/δ) / n ).

Computing the Rademacher Averages. We now assess the difficulty of actually computing the Rademacher averages. We write the following:

(1/2) E_σ[sup_{g∈G} (1/n) Σ_{i=1}^n σ_i g(X_i)]
  = (1/2) E_σ[sup_{g∈G} (1 − (2/n) Σ_{i=1}^n 1_{g(X_i)≠σ_i})]
  = 1/2 − E_σ[inf_{g∈G} (1/n) Σ_{i=1}^n 1_{g(X_i)≠σ_i}]
  = 1/2 − E_σ[inf_{g∈G} R_n(g, σ)].

This indicates that, given a sample and a realization of the random variables σ_1,…,σ_n, computing R_n(G) is not harder than computing the empirical risk minimizer in G. Indeed, the procedure would be to generate the σ_i randomly and minimize the empirical error in G with respect to the labels σ_i.

An advantage of rewriting R_n(G) as above is that it gives an intuition of what it actually measures: it measures how much the class G can fit random noise. If the class G is very large, there will always be a function which can perfectly fit the σ_i, and then R_n(G) = 1/2, so that there is no hope of uniform convergence to zero of the difference between true and empirical risks.

For a finite set with |G| = N, one can show that

R_n(G) ≤ √( 2 log N / n ),

where we again see the logarithmic factor log N. A consequence of this is that, by considering the projection on the sample of a class G with VC dimension h, and using Lemma 1, we have

R(G) ≤ 2 √( h log(en/h) / n ).

This result, along with Theorem 5, allows to recover the Vapnik–Chervonenkis bound with a concentration-based proof.

Although the benefit of using concentration may not be entirely clear at that point, let us just mention that one can actually improve the dependence on n of the above bound. This is based on the so-called chaining technique. The idea is to use covering numbers at all scales in order to capture the geometry of the class in a better way than the VC entropy does. One has the following result, called Dudley's entropy bound:

R_n(F) ≤ (C/√n) ∫_0^∞ √( log N(F, t, n) ) dt.

As a consequence, along with Haussler's upper bound, we can get the following result:

R_n(F) ≤ C √( h/n ).

We can thus, with this approach, remove the unnecessary log n factor of the VC bound.

6 Advanced Topics

In this section, we point out several ways in which the results presented so far can be improved. The main source of improvement actually comes, as mentioned earlier, from the fact that the Hoeffding and McDiarmid inequalities do not make use of the variance of the functions.
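The procedure just described can be sketched numerically (a toy illustration, not from the tutorial; the class of threshold classifiers on the line and all sizes are arbitrary choices): we estimate R_n(G) = 1/2 − E_σ[inf_{g∈G} R_n(g, σ)] by repeatedly drawing random signs and minimizing the empirical error over the class.

```python
import random

random.seed(2)
n, trials = 50, 200
xs = sorted(random.random() for _ in range(n))

def best_fit_error(sigma):
    # ERM step: empirical risk of the best threshold classifier (in either
    # orientation) against the random labels sigma.
    best = 1.0
    for theta in [0.0] + [x + 1e-9 for x in xs]:
        pred = [1 if x >= theta else -1 for x in xs]
        err = sum(p != s for p, s in zip(pred, sigma)) / n
        best = min(best, err, 1 - err)  # 1 - err: flipped orientation
    return best

# Monte Carlo estimate of R_n(G) = 1/2 - E_sigma[ inf_g R_n(g, sigma) ].
avg_inf = sum(best_fit_error([random.choice((-1, 1)) for _ in range(n)])
              for _ in range(trials)) / trials
rademacher_estimate = 0.5 - avg_inf
print(rademacher_estimate)
assert 0 < rademacher_estimate <= 0.5
```

Because a small class of thresholds cannot fit random signs well, the estimate stays far from the degenerate value 1/2; a class rich enough to fit every sign pattern would drive it to exactly 1/2, as the text explains.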

6.1 Binomial Tails

We recall that the functions we consider are binary valued. So, if we consider a fixed function f, the distribution of n P_n f is actually a binomial law of parameters Pf and n (since we are summing n i.i.d. random variables f(Z_i) which can be either 0 or 1 and are equal to 1 with probability E[f(Z_i)] = Pf). Denoting p = Pf, we can have an exact expression for the deviations of P_n f from Pf:

P[Pf − P_n f ≥ t] = Σ_{k=0}^{⌊n(p−t)⌋} C(n, k) p^k (1−p)^{n−k}.

Since this expression is not easy to manipulate, we have used an upper bound provided by Hoeffding's inequality. However, there exist other (sharper) upper bounds. The following quantities are all upper bounds on P[Pf − P_n f ≥ t]:

  ((1−p)/(1−p−t))^{n(1−p−t)} (p/(p+t))^{n(p+t)}   (exponential)
  e^{−np((1−t/p) log(1−t/p) + t/p)}                 (Bennett)
  e^{−nt² / (2p(1−p) + 2t/3)}                       (Bernstein)
  e^{−2nt²}                                         (Hoeffding)

Examining the above bounds (and using inversion), we can say that, roughly speaking, the small deviations of Pf − P_n f have a Gaussian behavior of the form exp(−nt²/(2p(1−p))) (i.e. Gaussian with variance p(1−p)), while the large deviations have a Poisson behavior of the form exp(−3nt/2). So the tails are heavier than Gaussian, and Hoeffding's inequality consists in upper bounding the tails with a Gaussian with maximum variance, hence the term exp(−2nt²).

Each function f ∈ F has a different variance Pf(1 − Pf) ≤ Pf. Moreover, for each f ∈ F, by Bernstein's inequality, with probability at least 1 − δ,

Pf ≤ P_n f + √( 2 Pf log(1/δ) / n ) + 2 log(1/δ) / (3n).

The Gaussian part (second term on the right-hand side) dominates (for Pf not too small, or n large enough), and it depends on Pf. We thus want to combine Bernstein's inequality with the union bound and symmetrization.

6.2 Normalization

The idea is to consider the ratio

(Pf − P_n f) / √(Pf).

Here (f ∈ {0, 1}), Var f ≤ P f² = Pf.
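For a concrete comparison (the numbers n, p, t below are arbitrary choices, not from the text), the following sketch computes the exact binomial tail P[Pf − P_n f ≥ t] and checks it against the Bernstein and Hoeffding bounds quoted above; for small p, the variance-sensitive Bernstein bound is visibly tighter.

```python
import math

n, p, t = 200, 0.1, 0.05

# Exact tail: P[P_n f <= p - t], where n * P_n f ~ Binomial(n, p).
k_max = math.floor(n * (p - t))
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
            for k in range(k_max + 1))

# Two of the upper bounds listed in the text.
bernstein = math.exp(-n * t**2 / (2 * p * (1 - p) + 2 * t / 3))
hoeffding = math.exp(-2 * n * t**2)

print(exact, bernstein, hoeffding)
# Hoeffding uses the worst-case variance 1/4; Bernstein uses p(1-p) = 0.09
# and is therefore much tighter at this p.
assert exact <= bernstein <= hoeffding
```

Changing p toward 1/2 shrinks the gap between the two bounds, which matches the discussion above: Hoeffding is exactly the Gaussian bound with maximal variance.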

The reason for considering this ratio is that, after normalization, the fluctuations are more uniform over the class F. Hence the supremum in

sup_{f∈F} (Pf − P_n f) / √(Pf)

is not necessarily attained at functions with large variance, as was the case previously. Moreover, we know that our goal is to find functions with small error Pf (hence small variance). The normalized supremum takes this into account.

We now state a result similar to Theorem 2 for the normalized supremum.

Theorem 7 (Vapnik–Chervonenkis [18]). For δ > 0, with probability at least 1 − δ,

∀f ∈ F, (Pf − P_n f) / √(Pf) ≤ 2 √( (log S_F(2n) + log(4/δ)) / n ),

and also, with probability at least 1 − δ,

∀f ∈ F, (P_n f − Pf) / √(P_n f) ≤ 2 √( (log S_F(2n) + log(4/δ)) / n ).

Proof. We only give a sketch of the proof. The first step is a variation of the symmetrization lemma:

P[ sup_f (Pf − P_n f)/√(Pf) ≥ t ] ≤ 2 P[ sup_f (P'_n f − P_n f)/√((P_n f + P'_n f)/2) ≥ t ].

The second step consists in randomization (with Rademacher variables):

… = 2 E[ P_σ[ sup_f (1/n) Σ_{i=1}^n σ_i (f(Z'_i) − f(Z_i)) / √((P_n f + P'_n f)/2) ≥ t ] ].

Finally, one uses a tail bound of Bernstein type. □

Let us explore the consequences of this result. From the fact that, for non-negative numbers A, B, C,

A ≤ B + C √A  ⟹  A ≤ B + C² + C √B,

we easily get, for example,

∀f ∈ F, Pf ≤ P_n f + 2 √( P_n f (log S_F(2n) + log(4/δ)) / n ) + 4 (log S_F(2n) + log(4/δ)) / n.
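The elementary implication used above can be checked numerically (a throwaway sketch, not part of the tutorial): for random non-negative B and C we take the largest A satisfying A ≤ B + C√A (the fixed point of A = B + C√A) and verify the claimed consequence.

```python
import math
import random

random.seed(3)
for _ in range(1000):
    B = random.uniform(0, 10)
    C = random.uniform(0, 10)
    # Largest A with A <= B + C*sqrt(A): solve the quadratic in sqrt(A).
    sqrt_a = (C + math.sqrt(C * C + 4 * B)) / 2
    A = sqrt_a * sqrt_a
    # Claimed consequence: A <= B + C^2 + C*sqrt(B).
    assert A <= B + C * C + C * math.sqrt(B) + 1e-9
print("ok")
```

Algebraically, the check reduces to √(C² + 4B) ≤ C + 2√B, which holds for all non-negative B, C after squaring.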

In the ideal situation where there is no noise (i.e. Y = t(X) almost surely) and t ∈ G, denoting by g_n the empirical risk minimizer, we have R* = 0 and also R_n(g_n) = 0. In particular, when G is a class of VC dimension h, we obtain

R(g_n) = O( h log n / n ).

So, in a way, Theorem 7 allows to interpolate between the best case, where the rate of convergence is O(h log n / n), and the worst case, where the rate is O(√(h log n / n)) (it does not allow to remove the log factor in this case).

It is also possible to derive from Theorem 7 relative error bounds for the minimizer of the empirical error. With probability at least 1 − δ,

R(g_n) ≤ R(g*) + 2 √( R(g*) (log S_G(2n) + log(4/δ)) / n ) + 4 (log S_G(2n) + log(4/δ)) / n.

We notice here that when R(g*) = 0 (i.e. t ∈ G and R* = 0), the rate is again of order 1/n, while, as soon as R(g*) > 0, the rate is of order 1/√n. Therefore, it is not possible to obtain a rate with a power of n in between 1/2 and 1. The main reason is that the factor R(g*) of the square-root term is not the right quantity to use here, since it does not vary with n. We will see later that one can instead have R(g_n) − R(g*) as a factor, which usually converges to zero as n increases. Unfortunately, Theorem 7 cannot be applied to functions of the type f − f* (which would be needed to have the mentioned factor), so we will need a refined approach.

6.3 Noise Conditions

The refinement we seek to obtain requires certain specific assumptions about the noise function s(x), the ideal case being when s(x) = 0 everywhere (which corresponds to R* = 0 and Y = t(X)). We now introduce quantities that measure how well-behaved the noise function is.

The situation is favorable when the regression function η(x) is not too close to 0, or at least not too often close to 0. Indeed, η(x) = 0 means that the noise is maximal at x (s(x) = 1/2) and that the label is completely undetermined (any prediction would yield an error with probability 1/2).

Definitions. There are two types of conditions.

Definition 6 (Massart's noise condition). For some c > 0, assume |η(X)| > 1/c almost surely.

This condition implies that there is no region where the decision is completely random; the noise is bounded away from 1/2.

Definition 7 (Tsybakov's noise condition). Let α ∈ [0, 1]. Assume that one of the following equivalent conditions is satisfied:

(i) ∃c > 0, ∀g ∈ {−1, 1}^X, P[g(X)η(X) ≤ 0] ≤ c (R(g) − R*)^α,
(ii) ∃c > 0, ∀A ⊂ X, ∫_A dP(x) ≤ c ( ∫_A |η(x)| dP(x) )^α,
(iii) ∃B > 0, ∀t ≥ 0, P[|η(X)| ≤ t] ≤ B t^{α/(1−α)}.

Condition (iii) is probably the easiest to interpret: it means that η(x) is close to the critical value 0 with low probability.

We indicate how to prove that conditions (i), (ii) and (iii) are indeed equivalent:

(i) ⇔ (ii) It is easy to check that R(g) − R* = E[|η(X)| 1_{g(X)η(X) ≤ 0}]. For each function g, there exists a set A such that 1_A = 1_{gη ≤ 0}, and conversely.

(ii) ⇒ (iii) Let A = {x : |η(x)| ≤ t}. Then

P[|η| ≤ t] = ∫_A dP(x) ≤ c ( ∫_A |η(x)| dP(x) )^α ≤ c t^α ( ∫_A dP(x) )^α,

and solving for ∫_A dP(x) gives P[|η| ≤ t] ≤ c^{1/(1−α)} t^{α/(1−α)}.

(iii) ⇒ (i) We write

R(g) − R* = E[|η| 1_{gη ≤ 0}] ≥ t E[1_{|η| ≥ t} 1_{gη ≤ 0}]
  ≥ t (P[gη ≤ 0] − P[|η| < t])
  ≥ t (P[gη ≤ 0] − B t^{α/(1−α)}).

Optimizing over t (taking t proportional to P[gη ≤ 0]^{(1−α)/α}) finally gives

P[gη ≤ 0] ≤ c (R(g) − R*)^α,

with a constant c depending only on B and α.

We notice that the parameter α has to be in [0, 1]. Indeed, one has the opposite inequality

R(g) − R* = E[|η(X)| 1_{gη ≤ 0}] ≤ P[g(X)η(X) ≤ 0],

which is incompatible with condition (i) if α > 1. We also notice that when α = 0, Tsybakov's condition is void, and when α = 1, it is equivalent to Massart's condition.

Consequences. The conditions we impose on the noise yield a crucial relationship between the variance and the expectation of functions in the so-called relative loss class, defined as

F̃ = { (x, y) ↦ f(x, y) − 1_{t(x)≠y} : f ∈ F }.

This relationship will allow to exploit Bernstein-type inequalities applied to this latter class.

Under Massart's condition, one has (written in terms of the initial class), for g ∈ G,

E[ (1_{g(X)≠Y} − 1_{t(X)≠Y})² ] ≤ c (R(g) − R*),

or, equivalently, for f ∈ F̃, Var f ≤ P f² ≤ c Pf. Under Tsybakov's condition this becomes, for g ∈ G,

E[ (1_{g(X)≠Y} − 1_{t(X)≠Y})² ] ≤ c (R(g) − R*)^α,

and, for f ∈ F̃, Var f ≤ P f² ≤ c (Pf)^α.

In the finite case, with |G| = N, one can easily apply Bernstein's inequality to F̃ and the finite union bound to get that, with probability at least 1 − δ, for all g ∈ G,

R(g) − R* ≤ R_n(g) − R_n(t) + √( 8c (R(g) − R*)^α log(N/δ) / n ) + 4 log(N/δ) / (3n).

As a consequence, when t ∈ G and g_n is the minimizer of the empirical error (hence R_n(g_n) ≤ R_n(t)), one has

R(g_n) − R* ≤ C ( log(N/δ) / n )^{1/(2−α)},

which is always better than n^{−1/2} for α > 0, and is valid even if R* > 0.

6.4 Local Rademacher Averages

In this section we generalize the above result by introducing a localized version of the Rademacher averages. Going from the finite to the general case is more involved than what has been seen before. We first give the appropriate definitions, then state the result and give a proof sketch.

Definitions. Local Rademacher averages refer to Rademacher averages of subsets of the function class determined by a condition on the variance of the functions.

Definition 8 (Local Rademacher average). The local Rademacher average at radius r ≥ 0 for the class F is defined as

R(F, r) = E[ sup_{f∈F : Pf² ≤ r} R_n f ].

The reason for this definition is that, as we have seen before, the crucial ingredient to obtain better rates of convergence is to use the variance of the functions. Localizing the Rademacher average allows to focus on the part of the function class where the fast rate phenomenon occurs, that is, functions with small variance.

Next we introduce the concept of a sub-root function, a real-valued function with certain monotonicity properties.

Definition 9 (Sub-root function). A function ψ : [0, ∞) → [0, ∞) is sub-root if
(i) ψ is non-decreasing,
(ii) ψ is non-negative,
(iii) ψ(r)/√r is non-increasing.

An immediate consequence of this definition is the following result.

Lemma 5. A sub-root function
(i) is continuous,
(ii) has a unique (non-zero) fixed point r* satisfying ψ(r*) = r*.

[Fig. 6: An example of a sub-root function and its fixed point.]

Before seeing the rationale for introducing the sub-root concept, we need yet another definition, that of a star-hull (somewhat similar to a convex hull).

Definition 10 (Star-hull). Let F be a set of functions. Its star-hull is defined as

⋆F = { αf : f ∈ F, α ∈ [0, 1] }.

Now, we state a lemma indicating that, by taking the star-hull of a class of functions, we are guaranteed that the local Rademacher average behaves like a sub-root function, and thus has a unique fixed point. This fixed point will turn out to be the key quantity in the relative error bounds.

Lemma 6. For any class of functions F, r ↦ R(⋆F, r) is sub-root.

One legitimate question is whether taking the star-hull does not enlarge the class too much. One way to see what the effect is on the size of the class is to compare the metric entropy (log covering numbers) of F and of ⋆F. It is possible to see that the entropy increases only by a logarithmic factor, which is essentially negligible.

Result. We now state the main result involving local Rademacher averages and their fixed point.

Theorem 8. Let F be a class of bounded functions (e.g. f ∈ [−1, 1]) and let r* be the fixed point of r ↦ R(⋆F, r). There exists a constant C > 0 such that, with probability at least 1 − δ,

∀f ∈ F, Pf − P_n f ≤ C ( √(r* Var f) + (log(1/δ) + log log n) / n ).

If, in addition, the functions in F satisfy Var f ≤ c (Pf)^β, then one obtains that, with probability at least 1 − δ,

∀f ∈ F, Pf ≤ C ( P_n f + (r*)^{1/(2−β)} + (log(1/δ) + log log n) / n ).

Proof. We only give the main steps of the proof.

1. The starting point is Talagrand's inequality for empirical processes, a generalization of McDiarmid's inequality of Bernstein type (i.e. one which includes the variance). This inequality tells us that, with high probability,

sup_{f∈F} (Pf − P_n f) ≤ E[ sup_{f∈F} (Pf − P_n f) ] + c √( sup_{f∈F} Var f / n ) + c'/n,

for some constants c, c'.

2. The second step consists in peeling the class, that is, splitting it into subclasses according to the variance of the functions:

F_k = { f : Var f ∈ [x^k, x^{k+1}) },
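As an illustration of Lemma 5 (a toy example, not from the text; the parameters a and b are arbitrary), take ψ(r) = a√r + b: it is non-negative, non-decreasing, and ψ(r)/√r is non-increasing, so it is sub-root and has a unique non-zero fixed point, which simple fixed-point iteration finds.

```python
import math

a, b = 2.0, 0.5  # arbitrary toy parameters

def psi(r):
    # A sub-root function: non-negative, non-decreasing, and psi(r)/sqrt(r)
    # = a + b/sqrt(r) is non-increasing, so Lemma 5 applies.
    return a * math.sqrt(r) + b

# Fixed-point iteration r <- psi(r) converges to r* with psi(r*) = r*.
r = 1.0
for _ in range(200):
    r = psi(r)

# Closed form for comparison: s = sqrt(r*) solves s^2 = a*s + b.
s = (a + math.sqrt(a * a + 4 * b)) / 2
print(r, s * s)
assert abs(psi(r) - r) < 1e-9
assert abs(r - s * s) < 1e-6
```

The sub-root property is what makes the iteration contract toward r* from either side; in the theorem, r* plays exactly this role for ψ(r) = R(⋆F, r).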


More information

CHAPTER 3 DIGITAL CODING OF SIGNALS

CHAPTER 3 DIGITAL CODING OF SIGNALS CHAPTER 3 DIGITAL CODING OF SIGNALS Computers are ofte used to automate the recordig of measuremets. The trasducers ad sigal coditioig circuits produce a voltage sigal that is proportioal to a quatity

More information

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies ( 3.1.1) Limitations of Experiments. Pseudocode ( 3.1.2) Theoretical Analysis

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies ( 3.1.1) Limitations of Experiments. Pseudocode ( 3.1.2) Theoretical Analysis Ruig Time ( 3.) Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects.

More information

LECTURE 13: Cross-validation

LECTURE 13: Cross-validation LECTURE 3: Cross-validatio Resampli methods Cross Validatio Bootstrap Bias ad variace estimatio with the Bootstrap Three-way data partitioi Itroductio to Patter Aalysis Ricardo Gutierrez-Osua Texas A&M

More information

CHAPTER 7: Central Limit Theorem: CLT for Averages (Means)

CHAPTER 7: Central Limit Theorem: CLT for Averages (Means) CHAPTER 7: Cetral Limit Theorem: CLT for Averages (Meas) X = the umber obtaied whe rollig oe six sided die oce. If we roll a six sided die oce, the mea of the probability distributio is X P(X = x) Simulatio:

More information

Week 3 Conditional probabilities, Bayes formula, WEEK 3 page 1 Expected value of a random variable

Week 3 Conditional probabilities, Bayes formula, WEEK 3 page 1 Expected value of a random variable Week 3 Coditioal probabilities, Bayes formula, WEEK 3 page 1 Expected value of a radom variable We recall our discussio of 5 card poker hads. Example 13 : a) What is the probability of evet A that a 5

More information

Taking DCOP to the Real World: Efficient Complete Solutions for Distributed Multi-Event Scheduling

Taking DCOP to the Real World: Efficient Complete Solutions for Distributed Multi-Event Scheduling Taig DCOP to the Real World: Efficiet Complete Solutios for Distributed Multi-Evet Schedulig Rajiv T. Maheswara, Milid Tambe, Emma Bowrig, Joatha P. Pearce, ad Pradeep araatham Uiversity of Souther Califoria

More information

Chapter 7 - Sampling Distributions. 1 Introduction. What is statistics? It consist of three major areas:

Chapter 7 - Sampling Distributions. 1 Introduction. What is statistics? It consist of three major areas: Chapter 7 - Samplig Distributios 1 Itroductio What is statistics? It cosist of three major areas: Data Collectio: samplig plas ad experimetal desigs Descriptive Statistics: umerical ad graphical summaries

More information

The Stable Marriage Problem

The Stable Marriage Problem The Stable Marriage Problem William Hut Lae Departmet of Computer Sciece ad Electrical Egieerig, West Virgiia Uiversity, Morgatow, WV William.Hut@mail.wvu.edu 1 Itroductio Imagie you are a matchmaker,

More information

Ekkehart Schlicht: Economic Surplus and Derived Demand

Ekkehart Schlicht: Economic Surplus and Derived Demand Ekkehart Schlicht: Ecoomic Surplus ad Derived Demad Muich Discussio Paper No. 2006-17 Departmet of Ecoomics Uiversity of Muich Volkswirtschaftliche Fakultät Ludwig-Maximilias-Uiversität Müche Olie at http://epub.ub.ui-mueche.de/940/

More information

THE HEIGHT OF q-binary SEARCH TREES

THE HEIGHT OF q-binary SEARCH TREES THE HEIGHT OF q-binary SEARCH TREES MICHAEL DRMOTA AND HELMUT PRODINGER Abstract. q biary search trees are obtaied from words, equipped with the geometric distributio istead of permutatios. The average

More information

Lecture 3. denote the orthogonal complement of S k. Then. 1 x S k. n. 2 x T Ax = ( ) λ x. with x = 1, we have. i = λ k x 2 = λ k.

Lecture 3. denote the orthogonal complement of S k. Then. 1 x S k. n. 2 x T Ax = ( ) λ x. with x = 1, we have. i = λ k x 2 = λ k. 18.409 A Algorithmist s Toolkit September 17, 009 Lecture 3 Lecturer: Joatha Keler Scribe: Adre Wibisoo 1 Outlie Today s lecture covers three mai parts: Courat-Fischer formula ad Rayleigh quotiets The

More information

WHEN IS THE (CO)SINE OF A RATIONAL ANGLE EQUAL TO A RATIONAL NUMBER?

WHEN IS THE (CO)SINE OF A RATIONAL ANGLE EQUAL TO A RATIONAL NUMBER? WHEN IS THE (CO)SINE OF A RATIONAL ANGLE EQUAL TO A RATIONAL NUMBER? JÖRG JAHNEL 1. My Motivatio Some Sort of a Itroductio Last term I tought Topological Groups at the Göttige Georg August Uiversity. This

More information

UC Berkeley Department of Electrical Engineering and Computer Science. EE 126: Probablity and Random Processes. Solutions 9 Spring 2006

UC Berkeley Department of Electrical Engineering and Computer Science. EE 126: Probablity and Random Processes. Solutions 9 Spring 2006 Exam format UC Bereley Departmet of Electrical Egieerig ad Computer Sciece EE 6: Probablity ad Radom Processes Solutios 9 Sprig 006 The secod midterm will be held o Wedesday May 7; CHECK the fial exam

More information

Theorems About Power Series

Theorems About Power Series Physics 6A Witer 20 Theorems About Power Series Cosider a power series, f(x) = a x, () where the a are real coefficiets ad x is a real variable. There exists a real o-egative umber R, called the radius

More information

Our aim is to show that under reasonable assumptions a given 2π-periodic function f can be represented as convergent series

Our aim is to show that under reasonable assumptions a given 2π-periodic function f can be represented as convergent series 8 Fourier Series Our aim is to show that uder reasoable assumptios a give -periodic fuctio f ca be represeted as coverget series f(x) = a + (a cos x + b si x). (8.) By defiitio, the covergece of the series

More information

Infinite Sequences and Series

Infinite Sequences and Series CHAPTER 4 Ifiite Sequeces ad Series 4.1. Sequeces A sequece is a ifiite ordered list of umbers, for example the sequece of odd positive itegers: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29...

More information

THE ABRACADABRA PROBLEM

THE ABRACADABRA PROBLEM THE ABRACADABRA PROBLEM FRANCESCO CARAVENNA Abstract. We preset a detailed solutio of Exercise E0.6 i [Wil9]: i a radom sequece of letters, draw idepedetly ad uiformly from the Eglish alphabet, the expected

More information

Stéphane Boucheron 1, Olivier Bousquet 2 and Gábor Lugosi 3

Stéphane Boucheron 1, Olivier Bousquet 2 and Gábor Lugosi 3 ESAIM: Probability ad Statistics URL: http://wwwemathfr/ps/ Will be set by the publisher THEORY OF CLASSIFICATION: A SURVEY OF SOME RECENT ADVANCES Stéphae Bouchero 1, Olivier Bousquet 2 ad Gábor Lugosi

More information

5 Boolean Decision Trees (February 11)

5 Boolean Decision Trees (February 11) 5 Boolea Decisio Trees (February 11) 5.1 Graph Coectivity Suppose we are give a udirected graph G, represeted as a boolea adjacecy matrix = (a ij ), where a ij = 1 if ad oly if vertices i ad j are coected

More information

1. MATHEMATICAL INDUCTION

1. MATHEMATICAL INDUCTION 1. MATHEMATICAL INDUCTION EXAMPLE 1: Prove that for ay iteger 1. Proof: 1 + 2 + 3 +... + ( + 1 2 (1.1 STEP 1: For 1 (1.1 is true, sice 1 1(1 + 1. 2 STEP 2: Suppose (1.1 is true for some k 1, that is 1

More information

Chapter 5: Inner Product Spaces

Chapter 5: Inner Product Spaces Chapter 5: Ier Product Spaces Chapter 5: Ier Product Spaces SECION A Itroductio to Ier Product Spaces By the ed of this sectio you will be able to uderstad what is meat by a ier product space give examples

More information

Concentration of Measure

Concentration of Measure Copyright c 2008 2010 Joh Lafferty, Ha Liu, ad Larry Wasserma Do Not Distribute Chapter 7 Cocetratio of Measure Ofte we wat to show that some radom quatity is close to its mea with high probability Results

More information

Basic Elements of Arithmetic Sequences and Series

Basic Elements of Arithmetic Sequences and Series MA40S PRE-CALCULUS UNIT G GEOMETRIC SEQUENCES CLASS NOTES (COMPLETED NO NEED TO COPY NOTES FROM OVERHEAD) Basic Elemets of Arithmetic Sequeces ad Series Objective: To establish basic elemets of arithmetic

More information

Case Study. Normal and t Distributions. Density Plot. Normal Distributions

Case Study. Normal and t Distributions. Density Plot. Normal Distributions Case Study Normal ad t Distributios Bret Halo ad Bret Larget Departmet of Statistics Uiversity of Wiscosi Madiso October 11 13, 2011 Case Study Body temperature varies withi idividuals over time (it ca

More information

BASIC STATISTICS. f(x 1,x 2,..., x n )=f(x 1 )f(x 2 ) f(x n )= f(x i ) (1)

BASIC STATISTICS. f(x 1,x 2,..., x n )=f(x 1 )f(x 2 ) f(x n )= f(x i ) (1) BASIC STATISTICS. SAMPLES, RANDOM SAMPLING AND SAMPLE STATISTICS.. Radom Sample. The radom variables X,X 2,..., X are called a radom sample of size from the populatio f(x if X,X 2,..., X are mutually idepedet

More information

Vladimir N. Burkov, Dmitri A. Novikov MODELS AND METHODS OF MULTIPROJECTS MANAGEMENT

Vladimir N. Burkov, Dmitri A. Novikov MODELS AND METHODS OF MULTIPROJECTS MANAGEMENT Keywords: project maagemet, resource allocatio, etwork plaig Vladimir N Burkov, Dmitri A Novikov MODELS AND METHODS OF MULTIPROJECTS MANAGEMENT The paper deals with the problems of resource allocatio betwee

More information

Non-life insurance mathematics. Nils F. Haavardsson, University of Oslo and DNB Skadeforsikring

Non-life insurance mathematics. Nils F. Haavardsson, University of Oslo and DNB Skadeforsikring No-life isurace mathematics Nils F. Haavardsso, Uiversity of Oslo ad DNB Skadeforsikrig Mai issues so far Why does isurace work? How is risk premium defied ad why is it importat? How ca claim frequecy

More information

3 Basic Definitions of Probability Theory

3 Basic Definitions of Probability Theory 3 Basic Defiitios of Probability Theory 3defprob.tex: Feb 10, 2003 Classical probability Frequecy probability axiomatic probability Historical developemet: Classical Frequecy Axiomatic The Axiomatic defiitio

More information

Math C067 Sampling Distributions

Math C067 Sampling Distributions Math C067 Samplig Distributios Sample Mea ad Sample Proportio Richard Beigel Some time betwee April 16, 2007 ad April 16, 2007 Examples of Samplig A pollster may try to estimate the proportio of voters

More information

THE ARITHMETIC OF INTEGERS. - multiplication, exponentiation, division, addition, and subtraction

THE ARITHMETIC OF INTEGERS. - multiplication, exponentiation, division, addition, and subtraction THE ARITHMETIC OF INTEGERS - multiplicatio, expoetiatio, divisio, additio, ad subtractio What to do ad what ot to do. THE INTEGERS Recall that a iteger is oe of the whole umbers, which may be either positive,

More information

Overview. Learning Objectives. Point Estimate. Estimation. Estimating the Value of a Parameter Using Confidence Intervals

Overview. Learning Objectives. Point Estimate. Estimation. Estimating the Value of a Parameter Using Confidence Intervals Overview Estimatig the Value of a Parameter Usig Cofidece Itervals We apply the results about the sample mea the problem of estimatio Estimatio is the process of usig sample data estimate the value of

More information

Irreducible polynomials with consecutive zero coefficients

Irreducible polynomials with consecutive zero coefficients Irreducible polyomials with cosecutive zero coefficiets Theodoulos Garefalakis Departmet of Mathematics, Uiversity of Crete, 71409 Heraklio, Greece Abstract Let q be a prime power. We cosider the problem

More information

.04. This means $1000 is multiplied by 1.02 five times, once for each of the remaining sixmonth

.04. This means $1000 is multiplied by 1.02 five times, once for each of the remaining sixmonth Questio 1: What is a ordiary auity? Let s look at a ordiary auity that is certai ad simple. By this, we mea a auity over a fixed term whose paymet period matches the iterest coversio period. Additioally,

More information

How To Solve The Homewor Problem Beautifully

How To Solve The Homewor Problem Beautifully Egieerig 33 eautiful Homewor et 3 of 7 Kuszmar roblem.5.5 large departmet store sells sport shirts i three sizes small, medium, ad large, three patters plaid, prit, ad stripe, ad two sleeve legths log

More information

1 Correlation and Regression Analysis

1 Correlation and Regression Analysis 1 Correlatio ad Regressio Aalysis I this sectio we will be ivestigatig the relatioship betwee two cotiuous variable, such as height ad weight, the cocetratio of a ijected drug ad heart rate, or the cosumptio

More information

Class Meeting # 16: The Fourier Transform on R n

Class Meeting # 16: The Fourier Transform on R n MATH 18.152 COUSE NOTES - CLASS MEETING # 16 18.152 Itroductio to PDEs, Fall 2011 Professor: Jared Speck Class Meetig # 16: The Fourier Trasform o 1. Itroductio to the Fourier Trasform Earlier i the course,

More information

Lecture 4: Cheeger s Inequality

Lecture 4: Cheeger s Inequality Spectral Graph Theory ad Applicatios WS 0/0 Lecture 4: Cheeger s Iequality Lecturer: Thomas Sauerwald & He Su Statemet of Cheeger s Iequality I this lecture we assume for simplicity that G is a d-regular

More information

SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES

SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES Read Sectio 1.5 (pages 5 9) Overview I Sectio 1.5 we lear to work with summatio otatio ad formulas. We will also itroduce a brief overview of sequeces,

More information

Measures of Spread and Boxplots Discrete Math, Section 9.4

Measures of Spread and Boxplots Discrete Math, Section 9.4 Measures of Spread ad Boxplots Discrete Math, Sectio 9.4 We start with a example: Example 1: Comparig Mea ad Media Compute the mea ad media of each data set: S 1 = {4, 6, 8, 10, 1, 14, 16} S = {4, 7, 9,

More information

SUPPLEMENTARY MATERIAL TO GENERAL NON-EXACT ORACLE INEQUALITIES FOR CLASSES WITH A SUBEXPONENTIAL ENVELOPE

SUPPLEMENTARY MATERIAL TO GENERAL NON-EXACT ORACLE INEQUALITIES FOR CLASSES WITH A SUBEXPONENTIAL ENVELOPE SUPPLEMENTARY MATERIAL TO GENERAL NON-EXACT ORACLE INEQUALITIES FOR CLASSES WITH A SUBEXPONENTIAL ENVELOPE By Guillaume Lecué CNRS, LAMA, Mare-la-vallée, 77454 Frace ad By Shahar Medelso Departmet of Mathematics,

More information

Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed.

Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed. This documet was writte ad copyrighted by Paul Dawkis. Use of this documet ad its olie versio is govered by the Terms ad Coditios of Use located at http://tutorial.math.lamar.edu/terms.asp. The olie versio

More information

Lesson 17 Pearson s Correlation Coefficient

Lesson 17 Pearson s Correlation Coefficient Outlie Measures of Relatioships Pearso s Correlatio Coefficiet (r) -types of data -scatter plots -measure of directio -measure of stregth Computatio -covariatio of X ad Y -uique variatio i X ad Y -measurig

More information

Confidence Intervals. CI for a population mean (σ is known and n > 30 or the variable is normally distributed in the.

Confidence Intervals. CI for a population mean (σ is known and n > 30 or the variable is normally distributed in the. Cofidece Itervals A cofidece iterval is a iterval whose purpose is to estimate a parameter (a umber that could, i theory, be calculated from the populatio, if measuremets were available for the whole populatio).

More information

Center, Spread, and Shape in Inference: Claims, Caveats, and Insights

Center, Spread, and Shape in Inference: Claims, Caveats, and Insights Ceter, Spread, ad Shape i Iferece: Claims, Caveats, ad Isights Dr. Nacy Pfeig (Uiversity of Pittsburgh) AMATYC November 2008 Prelimiary Activities 1. I would like to produce a iterval estimate for the

More information

Totally Corrective Boosting Algorithms that Maximize the Margin

Totally Corrective Boosting Algorithms that Maximize the Margin Mafred K. Warmuth mafred@cse.ucsc.edu Ju Liao liaoju@cse.ucsc.edu Uiversity of Califoria at Sata Cruz, Sata Cruz, CA 95064, USA Guar Rätsch Guar.Raetsch@tuebige.mpg.de Friedrich Miescher Laboratory of

More information

CS103X: Discrete Structures Homework 4 Solutions

CS103X: Discrete Structures Homework 4 Solutions CS103X: Discrete Structures Homewor 4 Solutios Due February 22, 2008 Exercise 1 10 poits. Silico Valley questios: a How may possible six-figure salaries i whole dollar amouts are there that cotai at least

More information

GCSE STATISTICS. 4) How to calculate the range: The difference between the biggest number and the smallest number.

GCSE STATISTICS. 4) How to calculate the range: The difference between the biggest number and the smallest number. GCSE STATISTICS You should kow: 1) How to draw a frequecy diagram: e.g. NUMBER TALLY FREQUENCY 1 3 5 ) How to draw a bar chart, a pictogram, ad a pie chart. 3) How to use averages: a) Mea - add up all

More information

A Mathematical Perspective on Gambling

A Mathematical Perspective on Gambling A Mathematical Perspective o Gamblig Molly Maxwell Abstract. This paper presets some basic topics i probability ad statistics, icludig sample spaces, probabilistic evets, expectatios, the biomial ad ormal

More information

The analysis of the Cournot oligopoly model considering the subjective motive in the strategy selection

The analysis of the Cournot oligopoly model considering the subjective motive in the strategy selection The aalysis of the Courot oligopoly model cosiderig the subjective motive i the strategy selectio Shigehito Furuyama Teruhisa Nakai Departmet of Systems Maagemet Egieerig Faculty of Egieerig Kasai Uiversity

More information

4.3. The Integral and Comparison Tests

4.3. The Integral and Comparison Tests 4.3. THE INTEGRAL AND COMPARISON TESTS 9 4.3. The Itegral ad Compariso Tests 4.3.. The Itegral Test. Suppose f is a cotiuous, positive, decreasig fuctio o [, ), ad let a = f(). The the covergece or divergece

More information

Z-TEST / Z-STATISTIC: used to test hypotheses about. µ when the population standard deviation is unknown

Z-TEST / Z-STATISTIC: used to test hypotheses about. µ when the population standard deviation is unknown Z-TEST / Z-STATISTIC: used to test hypotheses about µ whe the populatio stadard deviatio is kow ad populatio distributio is ormal or sample size is large T-TEST / T-STATISTIC: used to test hypotheses about

More information

Research Method (I) --Knowledge on Sampling (Simple Random Sampling)

Research Method (I) --Knowledge on Sampling (Simple Random Sampling) Research Method (I) --Kowledge o Samplig (Simple Radom Samplig) 1. Itroductio to samplig 1.1 Defiitio of samplig Samplig ca be defied as selectig part of the elemets i a populatio. It results i the fact

More information

Approximating Area under a curve with rectangles. To find the area under a curve we approximate the area using rectangles and then use limits to find

Approximating Area under a curve with rectangles. To find the area under a curve we approximate the area using rectangles and then use limits to find 1.8 Approximatig Area uder a curve with rectagles 1.6 To fid the area uder a curve we approximate the area usig rectagles ad the use limits to fid 1.4 the area. Example 1 Suppose we wat to estimate 1.

More information

COMPARISON OF THE EFFICIENCY OF S-CONTROL CHART AND EWMA-S 2 CONTROL CHART FOR THE CHANGES IN A PROCESS

COMPARISON OF THE EFFICIENCY OF S-CONTROL CHART AND EWMA-S 2 CONTROL CHART FOR THE CHANGES IN A PROCESS COMPARISON OF THE EFFICIENCY OF S-CONTROL CHART AND EWMA-S CONTROL CHART FOR THE CHANGES IN A PROCESS Supraee Lisawadi Departmet of Mathematics ad Statistics, Faculty of Sciece ad Techoology, Thammasat

More information