Statistical Learning Theory


Machine Learning Summer School, Kyoto, Japan
Alexander (Sasha) Rakhlin
University of Pennsylvania, The Wharton School
Penn Research in Machine Learning (PRiML)
August 27-28, 2012

References
Parts of these lectures are based on:
- O. Bousquet, S. Boucheron, G. Lugosi: Introduction to Statistical Learning Theory
- MLSS notes by O. Bousquet
- S. Mendelson: A Few Notes on Statistical Learning Theory
- Lecture notes by S. Shalev-Shwartz
- Lecture notes (S. R. and K. Sridharan)
Prerequisites: a basic familiarity with Probability is assumed.

Outline
Introduction
Statistical Learning Theory
  - The Setting of SLT
  - Consistency, No Free Lunch Theorems, Bias-Variance Tradeoff
  - Tools from Probability, Empirical Processes
  - From Finite to Infinite Classes
  - Uniform Convergence, Symmetrization, and Rademacher Complexity
  - Large Margin Theory for Classification
  - Properties of Rademacher Complexity
  - Covering Numbers and Scale-Sensitive Dimensions
  - Faster Rates
  - Model Selection
Sequential Prediction / Online Learning
  - Motivation
  - Supervised Learning
  - Online Convex and Linear Optimization
  - Online-to-Batch Conversion, SVM optimization

Example #1: Handwritten Digit Recognition
Imagine you are asked to write a computer program that recognizes postal codes on envelopes. You observe the huge amount of variation and ambiguity in the data.
[Figure: sample of handwritten digits from the MNIST dataset.]
One can try to hard-code all the possibilities, but this is likely to fail. It would be nice if a program looked at a large corpus of data and learned the distinctions!

Example #1: Handwritten Digit Recognition
We need to represent data in the computer. Pixel intensities are one possibility, but not necessarily the best one.
Feature representation: a feature map takes the image to a vector of features. [Figure: image of a digit and its feature vector.]
We also need to specify the label of this example: 3. The labeled example is then the pair (feature vector, 3).
After looking at many of these examples, we want the program to predict the label of the next handwritten digit.

Example #2: Predict Topic of a News Article
You would like to automatically collect news stories from the web and display them to the reader in the best possible way. You would like to group or filter these articles by topic. Hard-coding possible topics for articles is a daunting task!
Representation in the computer: [Figure: an article and its vector of word counts.] This is a bag-of-words representation.
If 1 stands for the category "politics", then this example can be represented as the pair (bag-of-words vector, 1).
After looking at many such examples, we would like the program to predict the topic of a new article.

Why Machine Learning?
It is impossible to hard-code all the knowledge into a computer program. The systems need to be adaptive to changes in the environment.
Examples:
- Computer vision: face detection, face recognition
- Audio: voice recognition, parsing
- Text: document topics, translation
- Ad placement on web pages
- Movie recommendations
- Spam detection

Machine Learning
(Human) learning is the process of acquiring knowledge or skill. Quite vague. How can we build a mathematical theory for something so imprecise?
Machine Learning is concerned with the design and analysis of algorithms that improve performance after observing data. That is, the acquired knowledge comes from data.
We need to make mathematically precise the following terms: performance, improve, data.

Learning from Examples
How is it possible to conclude something general from specific examples? Learning is inherently an ill-posed problem, as there are many alternatives that could be consistent with the observed examples.
Learning can be seen as the process of induction (as opposed to deduction): extrapolating from examples.
Prior knowledge is how we make the problem well-posed.
Memorization is not learning, not induction. Our theory should make this apparent.
It is very important to delineate assumptions. Then we will be able to prove mathematically that certain learning algorithms perform well.

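Data
Space of inputs (or, predictors): $\mathcal{X}$
- e.g. $x \in \mathcal{X} \subseteq \{0, 1, \ldots, 2^{16}\}^{64}$ is a string of pixel intensities in an $8 \times 8$ image.
- e.g. $x \in \mathcal{X} \subseteq \mathbb{R}^{33,000}$ is a set of gene expression levels.
- e.g. $x$ records # cigarettes/day, # drinks/day, and BMI.
[Figures: example inputs $x_1, x_2, \ldots$ for each of the three cases.]
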
Data
Sometimes the space $\mathcal{X}$ is uniquely defined for the problem. In other cases, such as in vision/text/audio applications, many possibilities exist, and a good feature representation is key to obtaining good performance.
This important part of machine learning applications will not be discussed in this lecture, and we will assume that $\mathcal{X}$ has been chosen by the practitioner.

Data
Space of outputs (or, responses): $\mathcal{Y}$
- e.g. $y \in \mathcal{Y} = \{0, 1\}$ is a binary label (1 = "cat")
- e.g. $y \in \mathcal{Y} = [0, 200]$ is life expectancy
A pair $(x, y)$ is a labeled example.
- e.g. $(x, y)$ is an example of an image with a label $y = 1$, which stands for the presence of a face in the image $x$
Dataset (or training data): $n$ examples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$
- e.g. a collection of images labeled according to the presence or absence of a face

The Multitude of Learning Frameworks
Presence/absence of labeled data:
- Supervised Learning: $\{(x_1, y_1), \ldots, (x_n, y_n)\}$
- Unsupervised Learning: $\{x_1, \ldots, x_n\}$
- Semi-supervised Learning: a mix of the above
This distinction is important, as labels are often difficult or expensive to obtain (e.g. one can collect a large corpus of emails, but which ones are spam?)
Types of labels:
- Binary Classification / Pattern Recognition: $\mathcal{Y} = \{0, 1\}$
- Multiclass: $\mathcal{Y} = \{0, \ldots, K\}$
- Regression: $\mathcal{Y} \subseteq \mathbb{R}$
- Structured prediction: $\mathcal{Y}$ is a set of complex objects (graphs, translations)

The Multitude of Learning Frameworks
Problems also differ in the protocol for obtaining data:
- Passive
- Active
and in assumptions on data:
- Batch (typically i.i.d.)
- Online (i.i.d. or worst-case or some stochastic process)
Even more involved: Reinforcement Learning and other frameworks.

Why Theory?
"... theory is the first term in the Taylor series of practice" (Thomas M. Cover, 1990 Shannon Lecture)
Theory and Practice should go hand-in-hand.
Boosting and Support Vector Machines came from theoretical considerations.
Sometimes theory suggests practical methods; sometimes practice comes ahead and theory tries to catch up and explain the performance.

This tutorial
First 2/3 of the tutorial: we will study the problem of supervised learning (with a focus on binary classification) with an i.i.d. assumption on the data.
The last 1/3 of the tutorial: we will turn to online learning without the i.i.d. assumption.

Statistical Learning Theory
The variable $x$ is related to $y$, and we would like to learn this relationship from data.
The relationship is encapsulated by a distribution $P$ on $\mathcal{X} \times \mathcal{Y}$.
Example: $x$ = [weight, blood glucose, ...] and $y$ is the risk of diabetes. We assume there is a relationship between $x$ and $y$: it is less likely to see certain $x$ co-occur with low risk and unlikely to see some other $x$ co-occur with high risk. This relationship is encapsulated by $P(x, y)$.
This is an assumption about the population of all $(x, y)$. However, what we see is a sample.

Statistical Learning Theory
Data are denoted by $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $n$ is the sample size.
The distribution $P$ is unknown to us (otherwise, there is no learning to be done).
The observed data are sampled independently from $P$ (the i.i.d. assumption).
It is often helpful to write $P = P_x \times P_{y|x}$. The distribution $P_x$ on the inputs is called the marginal distribution, while $P_{y|x}$ is the conditional distribution.

Statistical Learning Theory
Upon observing the training data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, the learner is asked to summarize what she has learned about the relationship between $x$ and $y$.
The learner's summary takes the form of a function $\hat{f}_n : \mathcal{X} \to \mathcal{Y}$. The hat indicates that this function depends on the training data.
Learning algorithm: a mapping $\{(x_1, y_1), \ldots, (x_n, y_n)\} \mapsto \hat{f}_n$.
The quality of the learned relationship is given by comparing the response $\hat{f}_n(x)$ to $y$ for a pair $(x, y)$ independently drawn from the same distribution $P$:
$\mathbb{E}_{(x,y)} \ell(\hat{f}_n(x), y)$
where $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is a loss function. This is our measure of performance.

Loss Functions
- Indicator loss (classification): $\ell(y, y') = I\{y \ne y'\}$
- Square loss: $\ell(y, y') = (y - y')^2$
- Absolute loss: $\ell(y, y') = |y - y'|$

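The three losses above can be written down directly. This is a minimal sketch; the function names are illustrative and not from the lecture.

```python
def indicator_loss(y_pred, y):
    """0-1 loss: 1 if the prediction differs from the label, else 0."""
    return 1.0 if y_pred != y else 0.0


def square_loss(y_pred, y):
    """Squared difference between prediction and response."""
    return (y_pred - y) ** 2


def absolute_loss(y_pred, y):
    """Absolute difference between prediction and response."""
    return abs(y_pred - y)
```
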
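Examples
Probably the simplest learning algorithm, and one you are likely familiar with, is linear least squares. Given $(x_1, y_1), \ldots, (x_n, y_n)$, let
$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n (y_i - \langle \beta, x_i \rangle)^2$
and define $\hat{f}_n(x) = \langle \hat{\beta}, x \rangle$.
Another basic method is regularized least squares:
$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n (y_i - \langle \beta, x_i \rangle)^2 + \lambda \|\beta\|^2$
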
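To make the two objectives concrete, here is a small sketch for the special case of scalar inputs ($d = 1$), where setting the derivative to zero gives the closed form $\hat{\beta} = \sum_i x_i y_i / (\sum_i x_i^2 + n\lambda)$. The restriction to $d = 1$ and the function name are assumptions made for brevity; the slide treats general $\mathbb{R}^d$.

```python
def least_squares_1d(xs, ys, lam=0.0):
    """Minimize (1/n) * sum_i (y_i - beta * x_i)^2 + lam * beta^2 over scalar beta.

    lam = 0 gives ordinary least squares; lam > 0 gives regularized least squares.
    Assumes sum_i x_i^2 + n * lam > 0.
    """
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
beta_ols = least_squares_1d(xs, ys)             # data lie on y = 2x, so beta = 2
beta_ridge = least_squares_1d(xs, ys, lam=1.0)  # 28 / 17, shrunk towards 0
```

The regularized solution is pulled towards zero relative to the unregularized one, which is the effect the penalty $\lambda \|\beta\|^2$ is meant to have.
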
Methods vs Problems
[Figure: schematic relating algorithms $\hat{f}_n$ (methods) to distributions $P$ (problems).]

Expected Loss and Empirical Loss
The expected loss of any function $f : \mathcal{X} \to \mathcal{Y}$ is
$L(f) = \mathbb{E}\,\ell(f(x), y)$
Since $P$ is unknown, we cannot calculate $L(f)$. However, we can calculate the empirical loss of $f : \mathcal{X} \to \mathcal{Y}$:
$\hat{L}(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)$

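The empirical loss is just an average over the sample. The sketch below computes it for a fixed predictor; the toy data and the threshold classifier are made up for illustration.

```python
def empirical_loss(f, data, loss):
    """Average loss of a fixed predictor f over the sample [(x_1, y_1), ..., (x_n, y_n)]."""
    return sum(loss(f(x), y) for x, y in data) / len(data)


def zero_one(y_pred, y):
    """Indicator loss."""
    return float(y_pred != y)


def f(x):
    """A fixed (data-independent) threshold classifier."""
    return int(x >= 0.5)


data = [(0.1, 0), (0.4, 1), (0.6, 1), (0.9, 1)]
print(empirical_loss(f, data, zero_one))  # 0.25: only (0.4, 1) is misclassified
```
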
... again, what is random here?
Since the data $(x_1, y_1), \ldots, (x_n, y_n)$ are a random i.i.d. draw from $P$,
- $\hat{L}(f)$ is a random quantity
- $\hat{f}_n$ is a random quantity (a random function, output of our learning procedure after seeing data)
- hence, $L(\hat{f}_n)$ is also a random quantity
- for a given $f : \mathcal{X} \to \mathcal{Y}$, the quantity $L(f)$ is not random!
It is important that these are understood before we proceed further.

The Gold Standard
Within the framework we set up, the smallest expected loss is achieved by the Bayes optimal function
$f^* = \arg\min_f L(f)$
where the minimization is over all (measurable) prediction rules $f : \mathcal{X} \to \mathcal{Y}$.
The value of the lowest expected loss is called the Bayes error:
$L(f^*) = \inf_f L(f)$
Of course, we cannot calculate any of these quantities since $P$ is unknown.

Bayes Optimal Function
The Bayes optimal function $f^*$ takes on the following forms in these two particular cases:
- Binary classification ($\mathcal{Y} = \{0, 1\}$) with the indicator loss: $f^*(x) = I\{\eta(x) \ge 1/2\}$, where $\eta(x) = \mathbb{E}[Y \mid X = x]$. [Figure: plot of $\eta(x)$, taking values between 0 and 1.]
- Regression ($\mathcal{Y} = \mathbb{R}$) with squared loss: $f^*(x) = \eta(x)$, where $\eta(x) = \mathbb{E}[Y \mid X = x]$.

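For a finite input space, the classification case can be computed explicitly when $\eta$ and $P_x$ are known. The sketch below thresholds $\eta$ at 1/2 and evaluates the Bayes error through the standard identity $L(f^*) = \mathbb{E}_x \min\{\eta(x), 1 - \eta(x)\}$ for the indicator loss. The three-point distribution is a made-up example.

```python
def bayes_classifier(eta):
    """f*(x) = 1 if eta(x) >= 1/2 else 0, with eta given as a dict x -> P(Y = 1 | X = x)."""
    return {x: int(p >= 0.5) for x, p in eta.items()}


def bayes_error(eta, p_x):
    """Bayes error under the indicator loss: sum_x P_x(x) * min(eta(x), 1 - eta(x))."""
    return sum(p_x[x] * min(eta[x], 1.0 - eta[x]) for x in eta)


eta = {"a": 0.9, "b": 0.2, "c": 0.5}    # hypothetical conditional probabilities
p_x = {"a": 0.5, "b": 0.25, "c": 0.25}  # hypothetical marginal on the inputs
f_star = bayes_classifier(eta)
err = bayes_error(eta, p_x)             # 0.5*0.1 + 0.25*0.2 + 0.25*0.5 = 0.225
```

In practice neither `eta` nor `p_x` is available, which is exactly why $f^*$ serves only as a benchmark.
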
The big question: is there a way to construct a learning algorithm with a guarantee that
$L(\hat{f}_n) - L(f^*)$
is small for large enough sample size $n$?

Consistency
An algorithm that ensures
$\lim_{n \to \infty} L(\hat{f}_n) = L(f^*)$ almost surely
is called consistent. Consistency ensures that our algorithm is approaching the best possible prediction performance as the sample size increases.
The good news: consistency is possible to achieve.
- easy if $\mathcal{X}$ is a finite or countable set
- not too hard if $\mathcal{X}$ is infinite, and the underlying relationship between $x$ and $y$ is "continuous"

The bad news...
In general, we cannot prove anything interesting about $L(\hat{f}_n) - L(f^*)$, unless we make further assumptions (incorporate prior knowledge).
What do we mean by "nothing interesting"? This is the subject of the so-called No Free Lunch Theorems. Unless we posit further assumptions:
- For any algorithm $\hat{f}_n$, any $n$ and any $\epsilon > 0$, there exists a distribution $P$ such that $L(f^*) = 0$ and $\mathbb{E} L(\hat{f}_n) \ge \frac{1}{2} - \epsilon$.
- For any algorithm $\hat{f}_n$, and any sequence $a_n$ that converges to 0, there exists a probability distribution $P$ such that $L(f^*) = 0$ and, for all $n$, $\mathbb{E} L(\hat{f}_n) \ge a_n$.
Reference: (Devroye, Györfi, Lugosi: A Probabilistic Theory of Pattern Recognition), (Bousquet, Boucheron, Lugosi, 2004)

Is this really "bad news"?
Not really. We always have some domain knowledge.
Two ways of incorporating prior knowledge:
- Direct way: assume that the distribution $P$ is not arbitrary (also known as a modeling approach, generative approach, statistical modeling)
- Indirect way: redefine the goal to perform as well as a reference set $\mathcal{F}$ of predictors:
  $L(\hat{f}_n) - \inf_{f \in \mathcal{F}} L(f)$
  This is known as a discriminative approach. $\mathcal{F}$ encapsulates our inductive bias.

Pros/Cons of the two approaches
Pros of the discriminative approach: we never assume that $P$ takes some particular form, but rather put our prior knowledge into the types of predictors that will do well. Cons: cannot really interpret $\hat{f}_n$.
Pros of the generative approach: can estimate the model / parameters of the distribution (inference). Cons: it is not clear what the analysis says if the assumption is actually violated.
Both approaches have their advantages. A machine learning researcher or practitioner should ideally know both and should understand their strengths and weaknesses.
In this tutorial we only focus on the discriminative approach.

Example: Linear Discriminant Analysis
Consider the classification problem with $\mathcal{Y} = \{0, 1\}$. Suppose the class-conditional densities are multivariate Gaussian with the same covariance $\Sigma = I$:
$p(x \mid y = 0) = (2\pi)^{-k/2} \exp\{-\tfrac{1}{2} \|x - \mu_0\|^2\}$
and
$p(x \mid y = 1) = (2\pi)^{-k/2} \exp\{-\tfrac{1}{2} \|x - \mu_1\|^2\}$
The best (Bayes) classifier is $f^* = I\{P(y = 1 \mid x) \ge 1/2\}$, which (for equal class priors) corresponds to the half-space defined by the decision boundary $p(x \mid y = 1) \ge p(x \mid y = 0)$. This boundary is linear.

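The linearity can be checked in one line by taking the log-ratio of the two densities. The derivation below is a worked sketch under the slide's assumptions ($\Sigma = I$) plus the assumption of equal class priors; with unequal priors $\pi_0, \pi_1$ the same computation goes through with an extra constant $\log(\pi_1/\pi_0)$, so the boundary is shifted but still linear.

```latex
\begin{align*}
\log \frac{p(x \mid y = 1)}{p(x \mid y = 0)}
  &= -\tfrac{1}{2}\|x - \mu_1\|^2 + \tfrac{1}{2}\|x - \mu_0\|^2 \\
  &= \langle x, \mu_1 - \mu_0 \rangle - \tfrac{1}{2}\left(\|\mu_1\|^2 - \|\mu_0\|^2\right),
\end{align*}
% the quadratic terms \|x\|^2 cancel, so
\[
p(x \mid y = 1) \ge p(x \mid y = 0)
\iff
\langle x, \mu_1 - \mu_0 \rangle \ge \tfrac{1}{2}\left(\|\mu_1\|^2 - \|\mu_0\|^2\right).
\]
```

The resulting set is a half-space with normal vector $\mu_1 - \mu_0$.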
Example: Linear Discriminant Analysis
The (linear) optimal decision boundary comes from our generative assumption on the form of the underlying distribution.
Alternatively, we could have indirectly postulated that we will be looking for a linear discriminant between the two classes, without making distributional assumptions.
Such linear discriminant (classification) functions are
$I\{\langle w, x \rangle \ge b\}$
for a unit-norm $w$ and some bias $b \in \mathbb{R}$.
Quadratic Discriminant Analysis: if unequal covariance matrices $\Sigma_1$ and $\Sigma_2$ are assumed, the resulting boundary is quadratic. We can then define the classification function by
$I\{q(x) \ge 0\}$
where $q(x)$ is a quadratic function.

Bias-Variance Tradeoff
How do we choose the inductive bias $\mathcal{F}$?
$L(\hat{f}_n) - L(f^*) = \underbrace{L(\hat{f}_n) - \inf_{f \in \mathcal{F}} L(f)}_{\text{Estimation Error}} + \underbrace{\inf_{f \in \mathcal{F}} L(f) - L(f^*)}_{\text{Approximation Error}}$
[Figure: $\hat{f}_n$, $f^*$, and $f_{\mathcal{F}}$ in relation to the class $\mathcal{F}$.]
Clearly, the two terms are at odds with each other:
- Making $\mathcal{F}$ larger means smaller approximation error but (as we will see) larger estimation error
- Taking a larger sample $n$ means smaller estimation error and has no effect on the approximation error.
- Thus, it makes sense to trade off the size of $\mathcal{F}$ and $n$. This is called Structural Risk Minimization, or Method of Sieves, or Model Selection.

Bias-Variance Tradeoff
We will only focus on the estimation error, yet the ideas we develop will make it possible to read about model selection on your own.
Note: if we guessed correctly and $f^* \in \mathcal{F}$, then
$L(\hat{f}_n) - L(f^*) = L(\hat{f}_n) - \inf_{f \in \mathcal{F}} L(f)$
For a particular problem, one hopes that prior knowledge about the problem can ensure that the approximation error $\inf_{f \in \mathcal{F}} L(f) - L(f^*)$ is small.

Occam's Razor
Occam's Razor is often quoted as a principle for choosing the simplest theory or explanation out of the possible ones.
However, this is a rather philosophical argument since simplicity is not uniquely defined. We will discuss this issue later.
What we will do is try to understand "complexity" when it comes to the behavior of certain stochastic processes. Such a question will be well-defined mathematically.

Looking Ahead
So far: represented prior knowledge by means of the class $\mathcal{F}$.
Looking forward, we can find an algorithm that, after looking at a dataset of size $n$, produces $\hat{f}_n$ such that
$L(\hat{f}_n) - \inf_{f \in \mathcal{F}} L(f)$
decreases (in a certain sense which we will make precise) at a non-trivial rate which depends on the "richness" of $\mathcal{F}$.
This will give a sample complexity guarantee: how many samples are needed to make the error smaller than a desired accuracy.

Types of Bounds
In expectation vs in probability (control the mean vs control the tails):
$\mathbb{E}\{L(\hat{f}_n) - \inf_{f \in \mathcal{F}} L(f)\} < \psi(n)$ vs $P(L(\hat{f}_n) - \inf_{f \in \mathcal{F}} L(f) \ge \epsilon) < \psi(n, \epsilon)$
The in-probability bound can be inverted as
$P(L(\hat{f}_n) - \inf_{f \in \mathcal{F}} L(f) \ge \phi(\delta, n)) < \delta$
by setting $\delta = \psi(n, \epsilon)$ and solving for $\epsilon$.
In this lecture, we are after the function $\phi(\delta, n)$. We will call it "the rate".
"With high probability" typically means logarithmic dependence of $\phi(\delta, n)$ on $1/\delta$. Very desirable: the bound grows only modestly even for high-confidence bounds.

Sample Complexity
Sample complexity is the sample size required by the algorithm $\hat{f}_n$ to guarantee $L(\hat{f}_n) - \inf_{f \in \mathcal{F}} L(f) \le \epsilon$ with probability at least $1 - \delta$.
Of course, we just need to invert a bound
$P(L(\hat{f}_n) - \inf_{f \in \mathcal{F}} L(f) \ge \phi(\delta, n)) < \delta$
by setting $\epsilon = \phi(\delta, n)$ and solving for $n$.
In other words, $n(\epsilon, \delta)$ is the sample complexity of the algorithm $\hat{f}_n$ if
$P(L(\hat{f}_n) - \inf_{f \in \mathcal{F}} L(f) \ge \epsilon) \le \delta$
as soon as $n \ge n(\epsilon, \delta)$.
Hence, "rate" can be translated into "sample complexity" and vice versa.
Easy to remember: rate $O(1/\sqrt{n})$ means $O(1/\epsilon^2)$ sample complexity, whereas rate $O(1/n)$ is a smaller $O(1/\epsilon)$ sample complexity.

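The inversion from rate to sample complexity can also be done numerically. The sketch below assumes only that the rate is nonincreasing in $n$ (with $\delta$ held fixed and absorbed into the function) and searches for the smallest $n$ at which it drops to $\epsilon$. The helper name and the two example rates, with constants set to 1, are illustrative.

```python
import math


def sample_complexity(rate, eps, n_max=10**12):
    """Smallest n with rate(n) <= eps, assuming rate is nonincreasing in n."""
    hi = 1
    while rate(hi) > eps:          # doubling phase: find some n that works
        hi *= 2
        if hi > n_max:
            raise ValueError("rate does not reach eps below n_max")
    lo = hi // 2                   # either 0, or a value known to fail
    while hi - lo > 1:             # binary search between failing lo and working hi
        mid = (lo + hi) // 2
        if rate(mid) <= eps:
            hi = mid
        else:
            lo = mid
    return hi


n_slow = sample_complexity(lambda n: 1.0 / math.sqrt(n), eps=0.01)  # 10000 = 1/eps^2
n_fast = sample_complexity(lambda n: 1.0 / n, eps=0.01)             # 100 = 1/eps
```

The two results match the rule of thumb on the slide: a $1/\sqrt{n}$ rate costs $1/\epsilon^2$ samples, a $1/n$ rate only $1/\epsilon$.
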
Types of Bounds
Other distinctions to keep in mind: we can ask for bounds (either in expectation or in probability) on the following random variables:
(A) $L(\hat{f}_n) - L(f^*)$
(B) $L(\hat{f}_n) - \inf_{f \in \mathcal{F}} L(f)$
(C) $L(\hat{f}_n) - \hat{L}(\hat{f}_n)$
(D) $\sup_{f \in \mathcal{F}} \{L(f) - \hat{L}(f)\}$
(E) $\sup_{f \in \mathcal{F}} \{L(f) - \hat{L}(f) - \mathrm{pen}_n(f)\}$
Let's make sure we understand the differences between these random quantities!

Types of Bounds
Upper bounds on (D) and (E) are used as tools for achieving the other bounds. Let's see why.
Obviously, for any algorithm that outputs $\hat{f}_n \in \mathcal{F}$,
$L(\hat{f}_n) - \hat{L}(\hat{f}_n) \le \sup_{f \in \mathcal{F}} \{L(f) - \hat{L}(f)\}$
and so a bound on (D) implies a bound on (C).
How about a bound on (B)? Is it implied by (C) or (D)? It depends on what the algorithm does!
Denote $f_{\mathcal{F}} = \arg\min_{f \in \mathcal{F}} L(f)$. Suppose (D) is small. It then makes sense to ask the learning algorithm to minimize (or approximately minimize) the empirical error (why?)

Canonical Algorithms
Empirical Risk Minimization (ERM) algorithm:
$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{L}(f)$
Regularized Empirical Risk Minimization algorithm:
$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{L}(f) + \mathrm{pen}_n(f)$
We will deal with the regularized ERM a bit later. For now, let's focus on ERM.
Remark: to actually compute $f \in \mathcal{F}$ minimizing the above objectives, one needs to employ some optimization methods. In practice, the objective might be optimized only approximately.

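Performance of ERM
If $\hat{f}_n$ is an ERM,
$L(\hat{f}_n) - L(f_{\mathcal{F}}) = \{L(\hat{f}_n) - \hat{L}(\hat{f}_n)\} + \{\hat{L}(\hat{f}_n) - \hat{L}(f_{\mathcal{F}})\} + \{\hat{L}(f_{\mathcal{F}}) - L(f_{\mathcal{F}})\}$
$\le \underbrace{\{L(\hat{f}_n) - \hat{L}(\hat{f}_n)\}}_{(C)} + \{\hat{L}(f_{\mathcal{F}}) - L(f_{\mathcal{F}})\}$
$\le \underbrace{\sup_{f \in \mathcal{F}} \{L(f) - \hat{L}(f)\}}_{(D)} + \{\hat{L}(f_{\mathcal{F}}) - L(f_{\mathcal{F}})\}$
because the second term is nonpositive: $\hat{f}_n$ minimizes $\hat{L}$ over $\mathcal{F}$. So, (C) implies a bound on (B) when $\hat{f}_n$ is ERM (or "close" to ERM). Also, (D) implies a bound on (B).
What about this extra term $\hat{L}(f_{\mathcal{F}}) - L(f_{\mathcal{F}})$? The Law of Large Numbers says that for i.i.d. random variables the average converges to the expectation, and with bounded second moment the Central Limit Theorem gives fluctuations of order $1/\sqrt{n}$. Let's quantify this.
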
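Hoeffding Inequality
Let $W, W_1, \ldots, W_n$ be i.i.d. such that $P(a \le W \le b) = 1$. Then
$P\left(\mathbb{E}W - \frac{1}{n} \sum_{i=1}^n W_i > \epsilon\right) \le \exp\left(-\frac{2n\epsilon^2}{(b - a)^2}\right)$
and
$P\left(\frac{1}{n} \sum_{i=1}^n W_i - \mathbb{E}W > \epsilon\right) \le \exp\left(-\frac{2n\epsilon^2}{(b - a)^2}\right)$
Let $W_i = \ell(f_{\mathcal{F}}(x_i), y_i)$. Clearly, $W_1, \ldots, W_n$ are i.i.d. Then,
$P\left(|L(f_{\mathcal{F}}) - \hat{L}(f_{\mathcal{F}})| > \epsilon\right) \le 2 \exp\left(-\frac{2n\epsilon^2}{(b - a)^2}\right)$
assuming $a \le \ell(f_{\mathcal{F}}(x), y) \le b$ for all $x \in \mathcal{X}$, $y \in \mathcal{Y}$.
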
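A quick numerical sanity check of the one-sided inequality: the sketch below compares the Hoeffding bound with the simulated tail frequency for averages of Bernoulli variables (so $a = 0$, $b = 1$). The choice of Bernoulli(1/2), the number of trials, and the seed are arbitrary illustration choices.

```python
import math
import random


def hoeffding_bound(n, eps, a=0.0, b=1.0):
    """One-sided Hoeffding bound exp(-2 n eps^2 / (b - a)^2)."""
    return math.exp(-2.0 * n * eps ** 2 / (b - a) ** 2)


def empirical_tail(n, eps, p=0.5, trials=2000, seed=0):
    """Fraction of trials in which the mean of n Bernoulli(p) draws exceeds p by more than eps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() < p for _ in range(n)) / n
        if mean - p > eps:
            hits += 1
    return hits / trials


bound = hoeffding_bound(100, 0.1)   # exp(-2), roughly 0.135
freq = empirical_tail(100, 0.1)     # simulated tail frequency, well below the bound
```

The simulated frequency is typically much smaller than the bound: Hoeffding's inequality holds for every bounded distribution and is not tight for any particular one.
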
Wait, Are We Done?
Can't we conclude directly that (C) is small? That is,
$P\left(\mathbb{E}\ell(\hat{f}_n(x), y) - \frac{1}{n} \sum_{i=1}^n \ell(\hat{f}_n(x_i), y_i) > \epsilon\right) \le 2 \exp\left(-\frac{2n\epsilon^2}{(b - a)^2}\right)$ ?
No! The random variables $\ell(\hat{f}_n(x_i), y_i)$ are not necessarily independent and it is possible that
$\mathbb{E}\ell(\hat{f}_n(x), y) = \mathbb{E}W \ne \mathbb{E}\ell(\hat{f}_n(x_i), y_i) = \mathbb{E}W_i$
The expected loss is "out of sample" performance while the second term is "in sample".
We say that $\ell(\hat{f}_n(x_i), y_i)$ is a biased estimate of $\mathbb{E}\ell(\hat{f}_n(x), y)$. How bad can this bias be?

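Example
- $\mathcal{X} = [0, 1]$, $\mathcal{Y} = \{0, 1\}$
- $\ell(f(X_i), Y_i) = I\{f(X_i) \ne Y_i\}$
- distribution $P = P_x \times P_{y|x}$ with $P_x = \mathrm{Unif}[0, 1]$ and $P_{y|x} = \delta_{y=1}$
- function class $\mathcal{F} = \bigcup_{n \in \mathbb{N}} \{f = f_S : S \subset \mathcal{X}, |S| = n, f_S(x) = I\{x \in S\}\}$
ERM $\hat{f}_n$ memorizes (perfectly fits) the data, but has no ability to generalize. Observe that
$0 = \mathbb{E}\ell(\hat{f}_n(x_i), y_i) \ne \mathbb{E}\ell(\hat{f}_n(x), y) = 1$
This phenomenon is called overfitting.
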
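The example is easy to simulate. In the sketch below the memorizing rule $f_S$ with $S$ equal to the set of training inputs attains zero training error, while on fresh draws from the same distribution it is wrong every time (a fresh uniform draw coincides with a training point with probability zero). Sample sizes and the seed are arbitrary.

```python
import random


def memorizer(train_xs):
    """ERM over the class {f_S}: predict 1 exactly on the training inputs, 0 elsewhere."""
    seen = set(train_xs)
    return lambda x: int(x in seen)


rng = random.Random(0)
train = [(rng.random(), 1) for _ in range(50)]    # P_x = Unif[0, 1], label is always 1
test = [(rng.random(), 1) for _ in range(1000)]   # fresh draws from the same P
f_hat = memorizer([x for x, _ in train])

train_err = sum(f_hat(x) != y for x, y in train) / len(train)  # in-sample loss
test_err = sum(f_hat(x) != y for x, y in test) / len(test)     # out-of-sample estimate
```

Here `train_err` is 0 and `test_err` is 1, matching the two expectations on the slide.
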
Example
Not only is (C) large in this example. Also, uniform deviations (D) do not converge to zero.
For any $n \in \mathbb{N}$ and any $(x_1, y_1), \ldots, (x_n, y_n) \sim P$,
$\sup_{f \in \mathcal{F}} \left\{\mathbb{E}_{x,y}\ell(f(x), y) - \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)\right\} = 1$
Where do we go from here? Two approaches:
1. understand how to upper bound uniform deviations (D)
2. find properties of algorithms that limit in some way the bias of $\ell(\hat{f}_n(x_i), y_i)$. Stability and compression are two such approaches.

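Uniform Deviations
We first focus on understanding
$\sup_{f \in \mathcal{F}} \left\{\mathbb{E}_{x,y}\ell(f(x), y) - \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)\right\}$
If $\mathcal{F} = \{f_0\}$ consists of a single function, then clearly
$\sup_{f \in \mathcal{F}} \left\{\mathbb{E}\ell(f(x), y) - \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)\right\} = \mathbb{E}\ell(f_0(x), y) - \frac{1}{n} \sum_{i=1}^n \ell(f_0(x_i), y_i)$
This quantity is $O_P(1/\sqrt{n})$ by Hoeffding's inequality, assuming $a \le \ell(f_0(x), y) \le b$.
Moral: for "simple" classes $\mathcal{F}$ the uniform deviations (D) can be bounded, while for "rich" classes they cannot. We will see how far we can push the size of $\mathcal{F}$.
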
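A bit of notation to simplify things...
To ease the notation,
- Let $z_i = (x_i, y_i)$ so that the training data is $\{z_1, \ldots, z_n\}$
- $g(z) = \ell(f(x), y)$ for $z = (x, y)$
- Loss class $\mathcal{G} = \{g : g(z) = \ell(f(x), y)\} = \ell \circ \mathcal{F}$
- $\hat{g}_n = \ell(\hat{f}_n(\cdot), \cdot)$, $g_{\mathcal{G}} = \ell(f_{\mathcal{F}}(\cdot), \cdot)$
- $g^* = \arg\min_g \mathbb{E}g(z) = \ell(f^*(\cdot), \cdot)$ is the Bayes optimal (loss) function
We can now work with the set $\mathcal{G}$, but keep in mind that each $g \in \mathcal{G}$ corresponds to an $f \in \mathcal{F}$:
$g \in \mathcal{G} \longleftrightarrow f \in \mathcal{F}$
Once again, the quantity of interest is
$\sup_{g \in \mathcal{G}} \left\{\mathbb{E}g(z) - \frac{1}{n} \sum_{i=1}^n g(z_i)\right\}$
On the next slide, we visualize the deviations $\mathbb{E}g(z) - \frac{1}{n} \sum_{i=1}^n g(z_i)$ for all possible functions $g$ and discuss all the concepts introduced so far.
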
Empirical Process Viewpoint
[Figure, built up over several slides: over the set of all functions $g$, the expected loss $\mathbb{E}g$ and the empirical average $\frac{1}{n} \sum_{i=1}^n g(z_i)$ are plotted as curves above the level 0; marked on the horizontal axis are $g^*$, the class $\mathcal{G}$, its best element $g_{\mathcal{G}}$, and the empirical minimizer $\hat{g}_n$.]

Empirical Process Viewpoint
A stochastic process is a collection of random variables indexed by some set.
An empirical process is a stochastic process
$\left\{\mathbb{E}g(z) - \frac{1}{n} \sum_{i=1}^n g(z_i)\right\}_{g \in \mathcal{G}}$
indexed by a function class $\mathcal{G}$.
Uniform Law of Large Numbers:
$\sup_{g \in \mathcal{G}} \left|\mathbb{E}g - \frac{1}{n} \sum_{i=1}^n g(z_i)\right| \to 0$
in probability.
Key question: How "big" can $\mathcal{G}$ be for the supremum of the empirical process to still be manageable?

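Union Bound (Boole's inequality)
Boole's inequality: for a finite or countable set of events,
$P(\cup_j A_j) \le \sum_j P(A_j)$
Let $\mathcal{G} = \{g_1, \ldots, g_N\}$. Then
$P\left(\exists g \in \mathcal{G} : \mathbb{E}g - \frac{1}{n} \sum_{i=1}^n g(z_i) > \epsilon\right) \le \sum_{j=1}^N P\left(\mathbb{E}g_j - \frac{1}{n} \sum_{i=1}^n g_j(z_i) > \epsilon\right)$
Assuming $P(a \le g(z_i) \le b) = 1$ for every $g \in \mathcal{G}$,
$P\left(\sup_{g \in \mathcal{G}} \left\{\mathbb{E}g - \frac{1}{n} \sum_{i=1}^n g(z_i)\right\} > \epsilon\right) \le N \exp\left(-\frac{2n\epsilon^2}{(b - a)^2}\right)$
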
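Finite Class
Alternatively, we set $\delta = N \exp\left(-\frac{2n\epsilon^2}{(b - a)^2}\right)$ and write
$P\left(\sup_{g \in \mathcal{G}} \left\{\mathbb{E}g - \frac{1}{n} \sum_{i=1}^n g(z_i)\right\} > (b - a) \sqrt{\frac{\log(N) + \log(1/\delta)}{2n}}\right) \le \delta$
Another way to write it: with probability at least $1 - \delta$,
$\sup_{g \in \mathcal{G}} \left\{\mathbb{E}g - \frac{1}{n} \sum_{i=1}^n g(z_i)\right\} \le (b - a) \sqrt{\frac{\log(N) + \log(1/\delta)}{2n}}$
Hence, with probability at least $1 - \delta$, the ERM algorithm $\hat{f}_n$ for a class $\mathcal{F}$ of cardinality $N$ satisfies
$L(\hat{f}_n) - \inf_{f \in \mathcal{F}} L(f) \le 2(b - a) \sqrt{\frac{\log(N) + \log(1/\delta)}{2n}}$
assuming $a \le \ell(f(x), y) \le b$ for all $f \in \mathcal{F}$, $x \in \mathcal{X}$, $y \in \mathcal{Y}$.
The constant 2 is due to the $L(f_{\mathcal{F}}) - \hat{L}(f_{\mathcal{F}})$ term. This is a loose upper bound.
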
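Plugging numbers into the finite-class bound shows how mild the dependence on $N$ and $\delta$ is. The sketch below simply evaluates the two expressions above; the function names are illustrative.

```python
import math


def finite_class_deviation(n, N, delta, a=0.0, b=1.0):
    """(b - a) * sqrt((log N + log(1/delta)) / (2 n)): uniform deviation bound for |G| = N."""
    return (b - a) * math.sqrt((math.log(N) + math.log(1.0 / delta)) / (2.0 * n))


def erm_excess_risk_bound(n, N, delta, a=0.0, b=1.0):
    """Bound on L(f_hat) - inf_F L(f) for ERM over a class of cardinality N."""
    return 2.0 * finite_class_deviation(n, N, delta, a, b)


# A million functions instead of a thousand only doubles log N.
small = erm_excess_risk_bound(n=10000, N=10**3, delta=0.05)
large = erm_excess_risk_bound(n=10000, N=10**6, delta=0.05)
```

Quadrupling $n$ halves the bound, in line with the $O(1/\sqrt{n})$ rate.
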
Once again...
A take-away message is that the following two statements are worlds apart:
with probability at least $1 - \delta$, for any $g \in \mathcal{G}$, $\mathbb{E}g - \frac{1}{n} \sum_{i=1}^n g(z_i) \le \epsilon$
vs
for any $g \in \mathcal{G}$, with probability at least $1 - \delta$, $\mathbb{E}g - \frac{1}{n} \sum_{i=1}^n g(z_i) \le \epsilon$
The second statement follows from the CLT, while the first statement is often difficult to obtain and only holds for some $\mathcal{G}$.

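Countable Class: Weighted Union Bound
Let $\mathcal{G}$ be countable and fix a distribution $w$ on $\mathcal{G}$ such that $\sum_{g \in \mathcal{G}} w(g) \le 1$. For any $\delta > 0$, for any $g \in \mathcal{G}$,
$P\left(\mathbb{E}g - \frac{1}{n} \sum_{i=1}^n g(z_i) \ge (b - a) \sqrt{\frac{\log(1/w(g)) + \log(1/\delta)}{2n}}\right) \le \delta \cdot w(g)$
by Hoeffding's inequality (easy to verify!). By the Union Bound,
$P\left(\exists g \in \mathcal{G} : \mathbb{E}g - \frac{1}{n} \sum_{i=1}^n g(z_i) \ge (b - a) \sqrt{\frac{\log(1/w(g)) + \log(1/\delta)}{2n}}\right) \le \delta \sum_{g \in \mathcal{G}} w(g) \le \delta$
Therefore, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$,
$L(f) - \hat{L}(f) \le \underbrace{(b - a) \sqrt{\frac{\log(1/w(f)) + \log(1/\delta)}{2n}}}_{\mathrm{pen}_n(f)}$
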
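Countable Class: Weighted Union Bound
If $\hat{f}_n$ is a regularized ERM,
$L(\hat{f}_n) - L(f_{\mathcal{F}}) = \{L(\hat{f}_n) - \hat{L}(\hat{f}_n) - \mathrm{pen}_n(\hat{f}_n)\} + \{\hat{L}(\hat{f}_n) + \mathrm{pen}_n(\hat{f}_n) - \hat{L}(f_{\mathcal{F}}) - \mathrm{pen}_n(f_{\mathcal{F}})\} + \{\hat{L}(f_{\mathcal{F}}) - L(f_{\mathcal{F}})\} + \mathrm{pen}_n(f_{\mathcal{F}})$
$\le \sup_{f \in \mathcal{F}} \{L(f) - \hat{L}(f) - \mathrm{pen}_n(f)\} + \{\hat{L}(f_{\mathcal{F}}) - L(f_{\mathcal{F}})\} + \mathrm{pen}_n(f_{\mathcal{F}})$
So, (E) implies a bound on (B) when $\hat{f}_n$ is regularized ERM. From the weighted union bound for a countable class:
$L(\hat{f}_n) - L(f_{\mathcal{F}}) \le \{\hat{L}(f_{\mathcal{F}}) - L(f_{\mathcal{F}})\} + \mathrm{pen}_n(f_{\mathcal{F}}) \le 2(b - a) \sqrt{\frac{\log(1/w(f_{\mathcal{F}})) + \log(1/\delta)}{2n}}$
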
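The penalty from the weighted union bound can be used directly for selection. The sketch below picks, among a few candidates with given empirical losses and prior weights $w(f)$, the one minimizing $\hat{L}(f) + \mathrm{pen}_n(f)$. The candidate names, losses, and weights are invented for illustration.

```python
import math


def penalty(n, w_f, delta, a=0.0, b=1.0):
    """pen_n(f) = (b - a) * sqrt((log(1/w(f)) + log(1/delta)) / (2 n))."""
    return (b - a) * math.sqrt((math.log(1.0 / w_f) + math.log(1.0 / delta)) / (2.0 * n))


def regularized_erm(candidates, n, delta):
    """candidates: list of (name, empirical_loss, weight); return the name minimizing loss + penalty."""
    return min(candidates, key=lambda c: c[1] + penalty(n, c[2], delta))[0]


candidates = [
    ("simple", 0.20, 0.5),     # slightly worse fit, large prior weight
    ("complex", 0.18, 0.001),  # slightly better fit, small prior weight
]
choice_small_n = regularized_erm(candidates, n=100, delta=0.05)
choice_large_n = regularized_erm(candidates, n=100000, delta=0.05)
```

With few samples the penalty dominates and the heavily weighted function wins; with many samples the penalty shrinks like $1/\sqrt{n}$ and the better empirical fit is preferred.
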
Uncountable Class: Compression Bounds
Let us make the dependence of the algorithm $\hat{f}_n$ on the training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ explicit: $\hat{f}_n = \hat{f}_n[S]$.
Suppose $\mathcal{F}$ has the property that there exists a "compression function" $C_k$ which selects from any dataset $S$ of any size $n$ a subset of $k$ labeled examples $C_k(S) \subseteq S$ such that the algorithm can be written as
$\hat{f}_n[S] = \hat{f}_k[C_k(S)]$
Then,
$L(\hat{f}_n) - \hat{L}(\hat{f}_n) = \mathbb{E}\ell(\hat{f}_k[C_k(S)](x), y) - \frac{1}{n} \sum_{i=1}^n \ell(\hat{f}_k[C_k(S)](x_i), y_i)$
$\le \max_{I \subseteq \{1, \ldots, n\}, |I| \le k} \left\{\mathbb{E}\ell(\hat{f}_k[S_I](x), y) - \frac{1}{n} \sum_{i=1}^n \ell(\hat{f}_k[S_I](x_i), y_i)\right\}$

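Uncountable Class: Compression Bounds
Since $\hat{f}_k[S_I]$ only depends on $k$ out of $n$ points, the empirical average is "mostly out of sample". Adding and subtracting
$\frac{1}{n} \sum_{(x', y') \in W} \ell(\hat{f}_k[S_I](x'), y')$
for an additional set of i.i.d. random variables $W = \{(x'_1, y'_1), \ldots, (x'_k, y'_k)\}$ results in an upper bound
$\max_{I \subseteq \{1, \ldots, n\}, |I| \le k} \left\{\mathbb{E}\ell(\hat{f}_k[S_I](x), y) - \frac{1}{n} \sum_{(x, y) \in (S \setminus S_I) \cup W_{|I|}} \ell(\hat{f}_k[S_I](x), y)\right\} + \frac{(b - a)k}{n}$
We appeal to the union bound over the $\binom{n}{k}$ possibilities, with a Hoeffding's bound for each. Then with probability at least $1 - \delta$,
$L(\hat{f}_n) - \inf_{f \in \mathcal{F}} L(f) \le 2(b - a) \sqrt{\frac{k \log(en/k) + \log(1/\delta)}{2n}} + \frac{(b - a)k}{n}$
assuming $a \le \ell(f(x), y) \le b$ for all $f \in \mathcal{F}$, $x \in \mathcal{X}$, $y \in \mathcal{Y}$.
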
Example: Classification with Thresholds in 1D
- $\mathcal{X} = [0, 1]$, $\mathcal{Y} = \{0, 1\}$
- $\mathcal{F} = \{f_\theta : f_\theta(x) = I\{x \ge \theta\}, \theta \in [0, 1]\}$
- $\ell(f_\theta(x), y) = I\{f_\theta(x) \ne y\}$
[Figure: the threshold $\hat{f}_n$ on the interval $[0, 1]$.]
For any set of data $(x_1, y_1), \ldots, (x_n, y_n)$, the ERM solution $\hat{f}_n$ has the property that the first occurrence $x_l$ on the left of the threshold has label $y_l = 0$, while the first occurrence $x_r$ on the right has label $y_r = 1$.
It is enough to take $k = 2$ and define $\hat{f}_n[S] = \hat{f}_2[(x_l, 0), (x_r, 1)]$.

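ERM over thresholds can be computed by brute force, since on a given sample only finitely many thresholds behave differently. The sketch below scans $\theta = 0$, each sample point, and $\theta = 1$, and returns a threshold with the fewest mistakes. The scan over candidates is one simple way to do it, not necessarily how the lecture intends ERM to be computed, and the toy data are made up.

```python
def erm_threshold(data):
    """ERM over f_theta(x) = 1[x >= theta], theta in [0, 1], under the indicator loss.

    Thresholds between two consecutive sample points make identical predictions
    on the sample, so scanning 0, the sample points, and 1 is enough.
    """
    def errors(theta):
        return sum(int(x >= theta) != y for x, y in data)

    candidates = [0.0] + sorted(x for x, _ in data) + [1.0]
    return min(candidates, key=errors)


data = [(0.1, 0), (0.3, 0), (0.6, 1), (0.8, 1)]
theta_hat = erm_threshold(data)  # 0.6: the only candidate with zero mistakes
```

In this separable sample the two points adjacent to the threshold, $(0.3, 0)$ on the left and $(0.6, 1)$ on the right, are the compression set of size $k = 2$: running the same procedure on just those two points returns the same threshold.
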
Stability
Yet another way to limit the bias of $\ell(\hat{f}_n(x_i), y_i)$ as an estimate of $L(\hat{f}_n)$ is through a notion of stability.
An algorithm $\hat{f}_n$ is stable if a change (or removal) of a single data point does not change (in a certain mathematical sense) the function $\hat{f}_n$ by much.
Of course, a dumb algorithm which outputs $\hat{f}_n = f_0$ without even looking at data is very stable and $\ell(\hat{f}_n(x_i), y_i)$ are independent random variables... But it is not a good algorithm! We would like to have an algorithm that both approximately minimizes the empirical error and is stable.
It turns out that certain types of regularization methods are stable. Example:
$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n (f(x_i) - y_i)^2 + \lambda \|f\|_K^2$
where $\|\cdot\|_K$ is the norm induced by the kernel of a reproducing kernel Hilbert space (RKHS) $\mathcal{F}$.

Summary so far
We proved upper bounds on $L(\hat{f}_n) - L(f_{\mathcal{F}})$ for
- ERM over a finite class
- Regularized ERM over a countable class (weighted union bound)
- ERM over classes $\mathcal{F}$ with the compression property
- ERM or Regularized ERM that are stable (only sketched it)
What about a more general situation? Is there a way to measure complexity of $\mathcal{F}$ that tells us whether ERM will succeed?
More informationA probabilistic proof of a binomial identity
A probabilistic proof of a biomial idetity Joatho Peterso Abstract We give a elemetary probabilistic proof of a biomial idetity. The proof is obtaied by computig the probability of a certai evet i two
More informationIntroduction to Statistical Learning Theory
Itroductio to Statistical Learig Theory Olivier Bousquet 1, Stéphae Bouchero 2, ad Gábor Lugosi 3 1 MaxPlack Istitute for Biological Cyberetics Spemastr 38, D72076 Tübige, Germay olivierbousquet@m4xorg
More informationTHE REGRESSION MODEL IN MATRIX FORM. For simple linear regression, meaning one predictor, the model is. for i = 1, 2, 3,, n
We will cosider the liear regressio model i matrix form. For simple liear regressio, meaig oe predictor, the model is i = + x i + ε i for i =,,,, This model icludes the assumptio that the ε i s are a sample
More information0.7 0.6 0.2 0 0 96 96.5 97 97.5 98 98.5 99 99.5 100 100.5 96.5 97 97.5 98 98.5 99 99.5 100 100.5
Sectio 13 KolmogorovSmirov test. Suppose that we have a i.i.d. sample X 1,..., X with some ukow distributio P ad we would like to test the hypothesis that P is equal to a particular distributio P 0, i.e.
More informationOutput Analysis (2, Chapters 10 &11 Law)
B. Maddah ENMG 6 Simulatio 05/0/07 Output Aalysis (, Chapters 10 &11 Law) Comparig alterative system cofiguratio Sice the output of a simulatio is radom, the comparig differet systems via simulatio should
More informationModified Line Search Method for Global Optimization
Modified Lie Search Method for Global Optimizatio Cria Grosa ad Ajith Abraham Ceter of Excellece for Quatifiable Quality of Service Norwegia Uiversity of Sciece ad Techology Trodheim, Norway {cria, ajith}@q2s.tu.o
More informationLecture 13. Lecturer: Jonathan Kelner Scribe: Jonathan Pines (2009)
18.409 A Algorithmist s Toolkit October 27, 2009 Lecture 13 Lecturer: Joatha Keler Scribe: Joatha Pies (2009) 1 Outlie Last time, we proved the BruMikowski iequality for boxes. Today we ll go over the
More informationSequences and Series
CHAPTER 9 Sequeces ad Series 9.. Covergece: Defiitio ad Examples Sequeces The purpose of this chapter is to itroduce a particular way of geeratig algorithms for fidig the values of fuctios defied by their
More informationConvexity, Inequalities, and Norms
Covexity, Iequalities, ad Norms Covex Fuctios You are probably familiar with the otio of cocavity of fuctios. Give a twicedifferetiable fuctio ϕ: R R, We say that ϕ is covex (or cocave up) if ϕ (x) 0 for
More informationSAMPLE QUESTIONS FOR FINAL EXAM. (1) (2) (3) (4) Find the following using the definition of the Riemann integral: (2x + 1)dx
SAMPLE QUESTIONS FOR FINAL EXAM REAL ANALYSIS I FALL 006 3 4 Fid the followig usig the defiitio of the Riema itegral: a 0 x + dx 3 Cosider the partitio P x 0 3, x 3 +, x 3 +,......, x 3 3 + 3 of the iterval
More informationSequences II. Chapter 3. 3.1 Convergent Sequences
Chapter 3 Sequeces II 3. Coverget Sequeces Plot a graph of the sequece a ) = 2, 3 2, 4 3, 5 + 4,...,,... To what limit do you thik this sequece teds? What ca you say about the sequece a )? For ǫ = 0.,
More informationKey Ideas Section 81: Overview hypothesis testing Hypothesis Hypothesis Test Section 82: Basics of Hypothesis Testing Null Hypothesis
Chapter 8 Key Ideas Hypothesis (Null ad Alterative), Hypothesis Test, Test Statistic, Pvalue Type I Error, Type II Error, Sigificace Level, Power Sectio 81: Overview Cofidece Itervals (Chapter 7) are
More informationClass Meeting # 16: The Fourier Transform on R n
MATH 18.152 COUSE NOTES  CLASS MEETING # 16 18.152 Itroductio to PDEs, Fall 2011 Professor: Jared Speck Class Meetig # 16: The Fourier Trasform o 1. Itroductio to the Fourier Trasform Earlier i the course,
More informationChapter 7  Sampling Distributions. 1 Introduction. What is statistics? It consist of three major areas:
Chapter 7  Samplig Distributios 1 Itroductio What is statistics? It cosist of three major areas: Data Collectio: samplig plas ad experimetal desigs Descriptive Statistics: umerical ad graphical summaries
More information5 Boolean Decision Trees (February 11)
5 Boolea Decisio Trees (February 11) 5.1 Graph Coectivity Suppose we are give a udirected graph G, represeted as a boolea adjacecy matrix = (a ij ), where a ij = 1 if ad oly if vertices i ad j are coected
More informationCase Study. Normal and t Distributions. Density Plot. Normal Distributions
Case Study Normal ad t Distributios Bret Halo ad Bret Larget Departmet of Statistics Uiversity of Wiscosi Madiso October 11 13, 2011 Case Study Body temperature varies withi idividuals over time (it ca
More informationNonlife insurance mathematics. Nils F. Haavardsson, University of Oslo and DNB Skadeforsikring
Nolife isurace mathematics Nils F. Haavardsso, Uiversity of Oslo ad DNB Skadeforsikrig Mai issues so far Why does isurace work? How is risk premium defied ad why is it importat? How ca claim frequecy
More informationSECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES
SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES Read Sectio 1.5 (pages 5 9) Overview I Sectio 1.5 we lear to work with summatio otatio ad formulas. We will also itroduce a brief overview of sequeces,
More informationCenter, Spread, and Shape in Inference: Claims, Caveats, and Insights
Ceter, Spread, ad Shape i Iferece: Claims, Caveats, ad Isights Dr. Nacy Pfeig (Uiversity of Pittsburgh) AMATYC November 2008 Prelimiary Activities 1. I would like to produce a iterval estimate for the
More informationUniversal coding for classes of sources
Coexios module: m46228 Uiversal codig for classes of sources Dever Greee This work is produced by The Coexios Project ad licesed uder the Creative Commos Attributio Licese We have discussed several parametric
More information5: Introduction to Estimation
5: Itroductio to Estimatio Cotets Acroyms ad symbols... 1 Statistical iferece... Estimatig µ with cofidece... 3 Samplig distributio of the mea... 3 Cofidece Iterval for μ whe σ is kow before had... 4 Sample
More information4.1 Sigma Notation and Riemann Sums
0 the itegral. Sigma Notatio ad Riema Sums Oe strategy for calculatig the area of a regio is to cut the regio ito simple shapes, calculate the area of each simple shape, ad the add these smaller areas
More informationLecture 4: Cauchy sequences, BolzanoWeierstrass, and the Squeeze theorem
Lecture 4: Cauchy sequeces, BolzaoWeierstrass, ad the Squeeze theorem The purpose of this lecture is more modest tha the previous oes. It is to state certai coditios uder which we are guarateed that limits
More information1. C. The formula for the confidence interval for a population mean is: x t, which was
s 1. C. The formula for the cofidece iterval for a populatio mea is: x t, which was based o the sample Mea. So, x is guarateed to be i the iterval you form.. D. Use the rule : pvalue
More informationDepartment of Computer Science, University of Otago
Departmet of Computer Sciece, Uiversity of Otago Techical Report OUCS200609 Permutatios Cotaiig May Patters Authors: M.H. Albert Departmet of Computer Sciece, Uiversity of Otago Micah Colema, Rya Fly
More information1 Correlation and Regression Analysis
1 Correlatio ad Regressio Aalysis I this sectio we will be ivestigatig the relatioship betwee two cotiuous variable, such as height ad weight, the cocetratio of a ijected drug ad heart rate, or the cosumptio
More informationMARTINGALES AND A BASIC APPLICATION
MARTINGALES AND A BASIC APPLICATION TURNER SMITH Abstract. This paper will develop the measuretheoretic approach to probability i order to preset the defiitio of martigales. From there we will apply this
More informationSUPPLEMENTARY MATERIAL TO GENERAL NONEXACT ORACLE INEQUALITIES FOR CLASSES WITH A SUBEXPONENTIAL ENVELOPE
SUPPLEMENTARY MATERIAL TO GENERAL NONEXACT ORACLE INEQUALITIES FOR CLASSES WITH A SUBEXPONENTIAL ENVELOPE By Guillaume Lecué CNRS, LAMA, Marelavallée, 77454 Frace ad By Shahar Medelso Departmet of Mathematics,
More informationLesson 17 Pearson s Correlation Coefficient
Outlie Measures of Relatioships Pearso s Correlatio Coefficiet (r) types of data scatter plots measure of directio measure of stregth Computatio covariatio of X ad Y uique variatio i X ad Y measurig
More informationCHAPTER 3 DIGITAL CODING OF SIGNALS
CHAPTER 3 DIGITAL CODING OF SIGNALS Computers are ofte used to automate the recordig of measuremets. The trasducers ad sigal coditioig circuits produce a voltage sigal that is proportioal to a quatity
More informationConfidence Intervals for One Mean
Chapter 420 Cofidece Itervals for Oe Mea Itroductio This routie calculates the sample size ecessary to achieve a specified distace from the mea to the cofidece limit(s) at a stated cofidece level for a
More informationPlugin martingales for testing exchangeability online
Plugi martigales for testig exchageability olie Valetia Fedorova, Alex Gammerma, Ilia Nouretdiov, ad Vladimir Vovk Computer Learig Research Cetre Royal Holloway, Uiversity of Lodo, UK {valetia,ilia,alex,vovk}@cs.rhul.ac.uk
More informationMath C067 Sampling Distributions
Math C067 Samplig Distributios Sample Mea ad Sample Proportio Richard Beigel Some time betwee April 16, 2007 ad April 16, 2007 Examples of Samplig A pollster may try to estimate the proportio of voters
More informationOverview. Learning Objectives. Point Estimate. Estimation. Estimating the Value of a Parameter Using Confidence Intervals
Overview Estimatig the Value of a Parameter Usig Cofidece Itervals We apply the results about the sample mea the problem of estimatio Estimatio is the process of usig sample data estimate the value of
More informationPROCEEDINGS OF THE YEREVAN STATE UNIVERSITY AN ALTERNATIVE MODEL FOR BONUSMALUS SYSTEM
PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY Physical ad Mathematical Scieces 2015, 1, p. 15 19 M a t h e m a t i c s AN ALTERNATIVE MODEL FOR BONUSMALUS SYSTEM A. G. GULYAN Chair of Actuarial Mathematics
More informationUniversity of California, Los Angeles Department of Statistics. Distributions related to the normal distribution
Uiversity of Califoria, Los Ageles Departmet of Statistics Statistics 100B Istructor: Nicolas Christou Three importat distributios: Distributios related to the ormal distributio Chisquare (χ ) distributio.
More information3 Basic Definitions of Probability Theory
3 Basic Defiitios of Probability Theory 3defprob.tex: Feb 10, 2003 Classical probability Frequecy probability axiomatic probability Historical developemet: Classical Frequecy Axiomatic The Axiomatic defiitio
More informationHypergeometric Distributions
7.4 Hypergeometric Distributios Whe choosig the startig lieup for a game, a coach obviously has to choose a differet player for each positio. Similarly, whe a uio elects delegates for a covetio or you
More informationConfidence Intervals. CI for a population mean (σ is known and n > 30 or the variable is normally distributed in the.
Cofidece Itervals A cofidece iterval is a iterval whose purpose is to estimate a parameter (a umber that could, i theory, be calculated from the populatio, if measuremets were available for the whole populatio).
More informationLinear classifier MAXIMUM ENTROPY. Linear regression. Logistic regression 11/3/11. f 1
Liear classifier A liear classifier predicts the label based o a weighted, liear combiatio of the features predictio = w 0 + w 1 f 1 + w 2 f 2 +...+ w m f m For two classes, a liear classifier ca be viewed
More informationWeek 3 Conditional probabilities, Bayes formula, WEEK 3 page 1 Expected value of a random variable
Week 3 Coditioal probabilities, Bayes formula, WEEK 3 page 1 Expected value of a radom variable We recall our discussio of 5 card poker hads. Example 13 : a) What is the probability of evet A that a 5
More informationCHAPTER 7: Central Limit Theorem: CLT for Averages (Means)
CHAPTER 7: Cetral Limit Theorem: CLT for Averages (Meas) X = the umber obtaied whe rollig oe six sided die oce. If we roll a six sided die oce, the mea of the probability distributio is X P(X = x) Simulatio:
More informationHere are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed.
This documet was writte ad copyrighted by Paul Dawkis. Use of this documet ad its olie versio is govered by the Terms ad Coditios of Use located at http://tutorial.math.lamar.edu/terms.asp. The olie versio
More informationTHE HEIGHT OF qbinary SEARCH TREES
THE HEIGHT OF qbinary SEARCH TREES MICHAEL DRMOTA AND HELMUT PRODINGER Abstract. q biary search trees are obtaied from words, equipped with the geometric distributio istead of permutatios. The average
More informationINFINITE SERIES KEITH CONRAD
INFINITE SERIES KEITH CONRAD. Itroductio The two basic cocepts of calculus, differetiatio ad itegratio, are defied i terms of limits (Newto quotiets ad Riema sums). I additio to these is a third fudametal
More informationUnit 20 Hypotheses Testing
Uit 2 Hypotheses Testig Objectives: To uderstad how to formulate a ull hypothesis ad a alterative hypothesis about a populatio proportio, ad how to choose a sigificace level To uderstad how to collect
More informationCS103A Handout 23 Winter 2002 February 22, 2002 Solving Recurrence Relations
CS3A Hadout 3 Witer 00 February, 00 Solvig Recurrece Relatios Itroductio A wide variety of recurrece problems occur i models. Some of these recurrece relatios ca be solved usig iteratio or some other ad
More information1 The Gaussian channel
ECE 77 Lecture 0 The Gaussia chael Objective: I this lecture we will lear about commuicatio over a chael of practical iterest, i which the trasmitted sigal is subjected to additive white Gaussia oise.
More information1 Computing the Standard Deviation of Sample Means
Computig the Stadard Deviatio of Sample Meas Quality cotrol charts are based o sample meas ot o idividual values withi a sample. A sample is a group of items, which are cosidered all together for our aalysis.
More informationSolutions to Selected Problems In: Pattern Classification by Duda, Hart, Stork
Solutios to Selected Problems I: Patter Classificatio by Duda, Hart, Stork Joh L. Weatherwax February 4, 008 Problem Solutios Chapter Bayesia Decisio Theory Problem radomized rules Part a: Let Rx be the
More informationChapter 14 Nonparametric Statistics
Chapter 14 Noparametric Statistics A.K.A. distributiofree statistics! Does ot deped o the populatio fittig ay particular type of distributio (e.g, ormal). Sice these methods make fewer assumptios, they
More informationThe Stable Marriage Problem
The Stable Marriage Problem William Hut Lae Departmet of Computer Sciece ad Electrical Egieerig, West Virgiia Uiversity, Morgatow, WV William.Hut@mail.wvu.edu 1 Itroductio Imagie you are a matchmaker,
More informationTaking DCOP to the Real World: Efficient Complete Solutions for Distributed MultiEvent Scheduling
Taig DCOP to the Real World: Efficiet Complete Solutios for Distributed MultiEvet Schedulig Rajiv T. Maheswara, Milid Tambe, Emma Bowrig, Joatha P. Pearce, ad Pradeep araatham Uiversity of Souther Califoria
More informationSubject CT5 Contingencies Core Technical Syllabus
Subject CT5 Cotigecies Core Techical Syllabus for the 2015 exams 1 Jue 2014 Aim The aim of the Cotigecies subject is to provide a groudig i the mathematical techiques which ca be used to model ad value
More informationDistributions of Order Statistics
Chapter 2 Distributios of Order Statistics We give some importat formulae for distributios of order statistics. For example, where F k: (x)=p{x k, x} = I F(x) (k, k + 1), I x (a,b)= 1 x t a 1 (1 t) b 1
More informationStéphane Boucheron 1, Olivier Bousquet 2 and Gábor Lugosi 3
ESAIM: Probability ad Statistics URL: http://wwwemathfr/ps/ Will be set by the publisher THEORY OF CLASSIFICATION: A SURVEY OF SOME RECENT ADVANCES Stéphae Bouchero 1, Olivier Bousquet 2 ad Gábor Lugosi
More informationVladimir N. Burkov, Dmitri A. Novikov MODELS AND METHODS OF MULTIPROJECTS MANAGEMENT
Keywords: project maagemet, resource allocatio, etwork plaig Vladimir N Burkov, Dmitri A Novikov MODELS AND METHODS OF MULTIPROJECTS MANAGEMENT The paper deals with the problems of resource allocatio betwee
More informationExample 2 Find the square root of 0. The only square root of 0 is 0 (since 0 is not positive or negative, so those choices don t exist here).
BEGINNING ALGEBRA Roots ad Radicals (revised summer, 00 Olso) Packet to Supplemet the Curret Textbook  Part Review of Square Roots & Irratioals (This portio ca be ay time before Part ad should mostly
More informationRunning Time ( 3.1) Analysis of Algorithms. Experimental Studies ( 3.1.1) Limitations of Experiments. Pseudocode ( 3.1.2) Theoretical Analysis
Ruig Time ( 3.) Aalysis of Algorithms Iput Algorithm Output A algorithm is a stepbystep procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects.
More informationThe second difference is the sequence of differences of the first difference sequence, 2
Differece Equatios I differetial equatios, you look for a fuctio that satisfies ad equatio ivolvig derivatives. I differece equatios, istead of a fuctio of a cotiuous variable (such as time), we look for
More informationTHE ABRACADABRA PROBLEM
THE ABRACADABRA PROBLEM FRANCESCO CARAVENNA Abstract. We preset a detailed solutio of Exercise E0.6 i [Wil9]: i a radom sequece of letters, draw idepedetly ad uiformly from the Eglish alphabet, the expected
More informationNotes on exponential generating functions and structures.
Notes o expoetial geeratig fuctios ad structures. 1. The cocept of a structure. Cosider the followig coutig problems: (1) to fid for each the umber of partitios of a elemet set, (2) to fid for each the
More informationLecture 3. denote the orthogonal complement of S k. Then. 1 x S k. n. 2 x T Ax = ( ) λ x. with x = 1, we have. i = λ k x 2 = λ k.
18.409 A Algorithmist s Toolkit September 17, 009 Lecture 3 Lecturer: Joatha Keler Scribe: Adre Wibisoo 1 Outlie Today s lecture covers three mai parts: CouratFischer formula ad Rayleigh quotiets The
More informationIncremental calculation of weighted mean and variance
Icremetal calculatio of weighted mea ad variace Toy Fich faf@cam.ac.uk dot@dotat.at Uiversity of Cambridge Computig Service February 009 Abstract I these otes I eplai how to derive formulae for umerically
More informationSection 11.3: The Integral Test
Sectio.3: The Itegral Test Most of the series we have looked at have either diverged or have coverged ad we have bee able to fid what they coverge to. I geeral however, the problem is much more difficult
More information1 Hypothesis testing for a single mean
BST 140.65 Hypothesis Testig Review otes 1 Hypothesis testig for a sigle mea 1. The ull, or status quo, hypothesis is labeled H 0, the alterative H a or H 1 or H.... A type I error occurs whe we falsely
More informationMeasures of Spread and Boxplots Discrete Math, Section 9.4
Measures of Spread ad Boxplots Discrete Math, Sectio 9.4 We start with a example: Example 1: Comparig Mea ad Media Compute the mea ad media of each data set: S 1 = {4, 6, 8, 10, 1, 14, 16} S = {4, 7, 9,
More informationInstitute of Actuaries of India Subject CT1 Financial Mathematics
Istitute of Actuaries of Idia Subject CT1 Fiacial Mathematics For 2014 Examiatios Subject CT1 Fiacial Mathematics Core Techical Aim The aim of the Fiacial Mathematics subject is to provide a groudig i
More informationResearch Method (I) Knowledge on Sampling (Simple Random Sampling)
Research Method (I) Kowledge o Samplig (Simple Radom Samplig) 1. Itroductio to samplig 1.1 Defiitio of samplig Samplig ca be defied as selectig part of the elemets i a populatio. It results i the fact
More informationCHAPTER 3 THE TIME VALUE OF MONEY
CHAPTER 3 THE TIME VALUE OF MONEY OVERVIEW A dollar i the had today is worth more tha a dollar to be received i the future because, if you had it ow, you could ivest that dollar ad ear iterest. Of all
More informationEntropy of bicapacities
Etropy of bicapacities Iva Kojadiovic LINA CNRS FRE 2729 Site école polytechique de l uiv. de Nates Rue Christia Pauc 44306 Nates, Frace iva.kojadiovic@uivates.fr JeaLuc Marichal Applied Mathematics
More informationDetermining the sample size
Determiig the sample size Oe of the most commo questios ay statisticia gets asked is How large a sample size do I eed? Researchers are ofte surprised to fid out that the aswer depeds o a umber of factors
More informationEkkehart Schlicht: Economic Surplus and Derived Demand
Ekkehart Schlicht: Ecoomic Surplus ad Derived Demad Muich Discussio Paper No. 200617 Departmet of Ecoomics Uiversity of Muich Volkswirtschaftliche Fakultät LudwigMaximiliasUiversität Müche Olie at http://epub.ub.uimueche.de/940/
More information*The most important feature of MRP as compared with ordinary inventory control analysis is its time phasing feature.
Itegrated Productio ad Ivetory Cotrol System MRP ad MRP II Framework of Maufacturig System Ivetory cotrol, productio schedulig, capacity plaig ad fiacial ad busiess decisios i a productio system are iterrelated.
More informationTrading the randomness  Designing an optimal trading strategy under a drifted random walk price model
Tradig the radomess  Desigig a optimal tradig strategy uder a drifted radom walk price model Yuao Wu Math 20 Project Paper Professor Zachary Hamaker Abstract: I this paper the author iteds to explore
More informationTotally Corrective Boosting Algorithms that Maximize the Margin
Mafred K. Warmuth mafred@cse.ucsc.edu Ju Liao liaoju@cse.ucsc.edu Uiversity of Califoria at Sata Cruz, Sata Cruz, CA 95064, USA Guar Rätsch Guar.Raetsch@tuebige.mpg.de Friedrich Miescher Laboratory of
More informationInfinite Sequences and Series
CHAPTER 4 Ifiite Sequeces ad Series 4.1. Sequeces A sequece is a ifiite ordered list of umbers, for example the sequece of odd positive itegers: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29...
More informationRecursion and Recurrences
Chapter 5 Recursio ad Recurreces 5.1 Growth Rates of Solutios to Recurreces Divide ad Coquer Algorithms Oe of the most basic ad powerful algorithmic techiques is divide ad coquer. Cosider, for example,
More informationPSYCHOLOGICAL STATISTICS
UNIVERSITY OF CALICUT SCHOOL OF DISTANCE EDUCATION B Sc. Cousellig Psychology (0 Adm.) IV SEMESTER COMPLEMENTARY COURSE PSYCHOLOGICAL STATISTICS QUESTION BANK. Iferetial statistics is the brach of statistics
More informationAn example of nonquenched convergence in the conditional central limit theorem for partial sums of a linear process
A example of oqueched covergece i the coditioal cetral limit theorem for partial sums of a liear process Dalibor Volý ad Michael Woodroofe Abstract A causal liear processes X,X 0,X is costructed for which
More informationBuilding Blocks Problem Related to Harmonic Series
TMME, vol3, o, p.76 Buildig Blocks Problem Related to Harmoic Series Yutaka Nishiyama Osaka Uiversity of Ecoomics, Japa Abstract: I this discussio I give a eplaatio of the divergece ad covergece of ifiite
More informationStatistical inference: example 1. Inferential Statistics
Statistical iferece: example 1 Iferetial Statistics POPULATION SAMPLE A clothig store chai regularly buys from a supplier large quatities of a certai piece of clothig. Each item ca be classified either
More informationwhere: T = number of years of cash flow in investment's life n = the year in which the cash flow X n i = IRR = the internal rate of return
EVALUATING ALTERNATIVE CAPITAL INVESTMENT PROGRAMS By Ke D. Duft, Extesio Ecoomist I the March 98 issue of this publicatio we reviewed the procedure by which a capital ivestmet project was assessed. The
More informationLecture 2: Karger s Min Cut Algorithm
priceto uiv. F 3 cos 5: Advaced Algorithm Desig Lecture : Karger s Mi Cut Algorithm Lecturer: Sajeev Arora Scribe:Sajeev Today s topic is simple but gorgeous: Karger s mi cut algorithm ad its extesio.
More informationChapter 5 Discrete Probability Distributions
Slides Prepared by JOHN S. LOUCKS St. Edward s Uiversity Slide Chapter 5 Discrete Probability Distributios Radom Variables Discrete Probability Distributios Epected Value ad Variace Poisso Distributio
More informationTAYLOR SERIES, POWER SERIES
TAYLOR SERIES, POWER SERIES The followig represets a (icomplete) collectio of thigs that we covered o the subject of Taylor series ad power series. Warig. Be prepared to prove ay of these thigs durig the
More information