10-705/36-705 Intermediate Statistics


1 10-705/36-705 Intermediate Statistics, Larry Wasserman, Fall 2011
Syllabus

Week | Class I | Class II | Day III | Class IV
August 29 | Review | Review, Inequalities | - | Inequalities
September 5 | No Class | O_P and o_P | HW 1 [sol] | VC Theory
September 12 | Convergence | Convergence | HW 2 [sol] | Test I
September 19 | Convergence Addendum | Sufficiency | HW 3 [sol] | Sufficiency
September 26 | Likelihood | Point Estimation | HW 4 [sol] | Minimax Theory
October 3 | Minimax Summary | Asymptotics | HW 5 [sol] | Asymptotics
October 10 | Asymptotics | Review | - | Test II
October 17 | Testing | Testing | HW 6 [sol] | Mid-semester Break
October 24 | Testing | Confidence Intervals | HW 7 [sol] | Confidence Intervals
October 31 | Nonparametric | Nonparametric | - | Review
November 7 | Test III | No Class | HW 8 [sol] | The Bootstrap
November 14 | The Bootstrap | Bayesian Inference | HW 9 [sol] | Bayesian Inference
November 21 | No Class | No Class | - | No Class
November 28 | Prediction | Prediction | HW 10 [sol] | Model Selection
December 5 | Multiple Testing | Causation | Individual Sequences | Practice Final

2 10-705/36-705: Intermediate Statistics, Fall 2010

Professor: Larry Wasserman
Office: Baker Hall 8 A
Email: larry@stat.cmu.edu
Phone:
Office hours: Mondays, 1:30-2:30
Class Time: Mon-Wed-Fri 1:30 - 2:20
Location: GHC 4307
TAs: Wanjie Wang and Xiaolin Yang
Website: www.stat.cmu.edu/~larry/=stat705

Objective: This course will cover the fundamentals of theoretical statistics. Topics include: point and interval estimation, hypothesis testing, data reduction, convergence concepts, Bayesian inference, nonparametric statistics and bootstrap resampling. We will cover Chapters 5-10 from Casella and Berger plus some supplementary material. This course is excellent preparation for advanced work in Statistics and Machine Learning.

Textbook: Casella, G. and Berger, R. L. (2002). Statistical Inference, 2nd ed.

Background: I assume that you are familiar with the material in Chapters 1-4 of Casella and Berger.

Other Recommended Texts:
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference.
Bickel, P. J. and Doksum, K. A. (1977). Mathematical Statistics.
Rice, J. A. (1977). Mathematical Statistics and Data Analysis, Second Edition.

Grading:
20%: Test I (Sept. 6) on the material of Chapters 1-4
20%: Test II (October 4)
20%: Test III (November 7)
30%: Final Exam (Date set by the University)
10%: Homework

3 Exams: All exams are closed book. Do NOT buy a plane ticket until the final exam has been scheduled.

Homework: Homework assignments will be posted on the web. Hand in homework to Mari Alice McShane, 8 Baker Hall, by 3 pm Thursday. No late homework.

Reading and Class Notes: Class notes will be posted on the web regularly. Bring a copy to class. The notes are not meant to be a substitute for the book and hence are generally quite terse. Read both the notes and the text before lecture. Sometimes I will cover topics from other sources.

Group Work: You are encouraged to work with others on the homework. But write up your final solutions on your own.

Course Outline:
1. Quick Review of Chapters 1-4
2. Inequalities
3. Vapnik-Chervonenkis Theory
4. Convergence
5. Sufficiency
6. Likelihood
7. Point Estimation
8. Minimax Theory
9. Asymptotics
10. Robustness
11. Hypothesis Testing
12. Confidence Intervals
13. Nonparametric Inference
14. Prediction and Classification
15. The Bootstrap
16. Bayesian Inference
17. Markov Chain Monte Carlo
18. Model Selection

4 Lecture Notes Quick Review of Basic Probability (Casella ad Berger Chapters -4) Probability Review Chapters -4 are a review. I will assume you have read ad uderstood Chapters -4. Let us recall some of the key ideas.. Radom Variables A radom variable is a map X from a probability space Ω to R. We write P (X A) = P ({ω Ω : X(ω) A}) ad we write X P to mea that X has distributio P. fuctio (cdf) of X is F X (x) = F (x) = P (X x). If X is discrete, its probability mass fuctio (pmf) is The cumulative distributio p X (x) = p(x) = P (X = x). If X is cotiuous, the its probability desity fuctio fuctio (pdf) satisfies P (X A) = p X (x)dx = p(x)dx ad p X (x) = p(x) = F (x). The followig are all equivalet: A X P, X F, X p. Suppose that X P ad Y Q. We say that X ad Y have the same distributio if A P (X A) = Q(Y A) for all A. I other words, P = Q. I that case we say that X ad Y are equal i distributio ad we write X = d Y. It ca be show that X = d Y if ad oly if F X (t) = F Y (t) for all t.. Expected Values The mea or expected value of g(x) is { g(x)p(x)dx if X is cotiuous E (g(x)) = g(x)df (x) = g(x)dp (x) = j g(x j)p(x j ) if X is discrete.

5 Recall that:. E( k j= c jg j (X)) = k j= c je(g j (X)).. If X,..., X are idepedet the ( ) E X i = i= i E (X i ). 3. We ofte write µ = E(X). 4. σ = Var (X) = E ((X µ) ) is the Variace. 5. Var (X) = E (X ) µ. 6. If X,..., X are idepedet the ( ) Var a i X i = i= i a i Var (X i ). 7. The covariace is Cov(X, Y ) = E((X µ x )(Y µ y )) = E(XY ) µ X µ Y ad the correlatio is ρ(x, Y ) = Cov(X, Y )/σ x σ y. Recall that ρ(x, Y ). The coditioal expectatio of Y give X is the radom variable E(Y X) whose value, whe X = x is E(Y X = x) = y p(y x)dy where p(y x) = p(x, y)/p(x). The Law of Total Expectatio or Law of Iterated Expectatio: E(Y ) = E [ E(Y X) ] = E(Y X = x)p X (x)dx. The Law of Total Variace is Var(Y ) = Var [ E(Y X) ] + E [ Var(Y X) ]. The th momet is E (X ) ad the th cetral momet is E ((X µ) ). The momet geeratig fuctio (mgf) is M X (t) = E ( e tx). The, M () X (t) t=0 = E (X ). If M X (t) = M Y (t) for all t i a iterval aroud 0 the X d = Y.

6 .3 Expoetial Families A family of distributios {p(x; θ) : θ Θ} is called a expoetial family if { k } p(x; θ) = h(x)c(θ) exp w i (θ)t i (x). Example X Poisso(λ) is expoetial family sice p(x) = P (X = x) = e λ λ x = x! x! e λ exp{log λ x}. Example X U (0, θ) is ot a expoetial family. The desity is where I A (x) = if x A ad 0 otherwise. i= p X (x) = θ I (0,θ)(x) We ca rewrite a expoetial family i terms of a atural parameterizatio. For k = we have p(x; η) = h(x) exp{ηt(x) A(η)} where A(η) = log For example a Poisso ca be writte as h(x) exp{ηt(x)}dx. p(x; η) = exp{ηx e η }/x! where the atural parameter is η = log λ. Let X have a expoetial family distributio. The E (t(x)) = A (η), Practice Problem: Prove the above result..4 Trasformatios Var (t(x)) = A (η). Let Y = g(x). The F Y (y) = P(Y y) = P(g(X) y) = where The p Y (y) = F Y (y). If g is mootoic, the where h = g. A y = {x : g(x) y}. p Y (y) = p X (h(y)) dh(y) dy 3 A(y) p X (x)dx
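
Returning to the exponential family identities above, E(t(X)) = A'(η) and Var(t(X)) = A''(η) (left as a practice problem): here is a quick numerical check for the Poisson case, where t(x) = x and A(η) = e^η. This is only an illustrative sketch; the value of λ is an arbitrary choice.

import numpy as np

# Poisson in natural form: p(x; eta) = exp(eta*x - e^eta)/x!, with eta = log(lambda).
# The identities E[t(X)] = A'(eta) and Var[t(X)] = A''(eta) say that, for t(x) = x,
# both the mean and the variance equal e^eta = lambda.
lam = 3.5                      # arbitrary choice
eta = np.log(lam)
A_prime = np.exp(eta)          # A'(eta)
A_doubleprime = np.exp(eta)    # A''(eta)

rng = np.random.default_rng(0)
x = rng.poisson(lam, size=200_000)
print(A_prime, x.mean())       # both close to 3.5
print(A_doubleprime, x.var())  # both close to 3.5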

7 Example 3 Let p X (x) = e x for x > 0. Hece F X (x) = e x. Let Y = g(x) = log X. The ad p Y (y) = e y e ey for y R. F Y (y) = P (Y y) = P (log(x) y) = P (X e y ) = F X (e y ) = e ey Example 4 Practice problem. Let X be uiform o (, ) ad let Y = X. Fid the desity of Y. Let Z = g(x, Y ). For exampe, Z = X + Y or Z = X/Y. The we fid the pdf of Z as follows:. For each z, fid the set A z = {(x, y) : g(x, y) z}.. Fid the CDF F Z (z) = P (Z z) = P (g(x, Y ) z) = P ({(x, y) : g(x, y) z}) = p X,Y (x, y)dxdy. A z 3. The pdf is p Z (z) = F Z (z). Example 5 Practice problem. Let (X, Y ) be uiform o the uit square. Let Z = X/Y. Fid the desity of Z..5 Idepedece Recall that X ad Y are idepedet if ad oly if for all A ad B. P(X A, Y B) = P(X A)P(Y B) Theorem 6 Let (X, Y ) be a bivariate radom vector with p X,Y (x, y). X ad Y are idepedet iff p X,Y (x, y) = p X (x)p Y (y). X,..., X are idepedet if ad oly if P(X A,..., X A ) = P(X i A i ). Thus, p X,...,X (x,..., x ) = i= p X i (x i ). If X,..., X are idepedet ad idetically distributed we say they are iid (or that they are a radom sample) ad we write X,..., X P or X,..., X F or X,..., X p. 4 i=

8 .6 Importat Distributios X N(µ, σ ) if p(x) = σ π e (x µ) /(σ). If X R d the X N(µ, Σ) if ( p(x) = (π) d/ Σ exp ) (x µ)t Σ (x µ). X χ p if X = p j= Z j where Z,..., Z p N(0, ). X Beroulli(θ) if P(X = ) = θ ad P(X = 0) = θ ad hece p(x) = θ x ( θ) x x = 0,. X Biomial(θ) if p(x) = P(X = x) = ( ) θ x ( θ) x x x {0,..., }. X Uiform(0, θ) if p(x) = I(0 x θ)/θ..7 Sample Mea ad Variace The sample mea is ad the sample variace is X = X i, S = i (X i X). Let X,..., X be iid with µ = E(X i ) = µ ad σ = Var(X i ) = σ. The E(X) = µ, Theorem 7 If X,..., X N(µ, σ ) the (a) X N(µ, σ ) i Var(X) = σ, E(S ) = σ. (b) ( )S σ χ (c) X ad S are idepedet 5

9 .8 Delta Method If X N(µ, σ ), Y = g(x) ad σ is small the To see this, ote that Y N(g(µ), σ (g (µ)) ). Y = g(x) = g(µ) + (X µ)g (µ) + (X µ) g (ξ) for some ξ. Now E((X µ) ) = σ which we are assumig is small ad so Y = g(x) g(µ) + (X µ)g (µ). Thus Hece, E(Y ) g(µ), Var(Y ) (g (µ)) σ. g(x) N ( g(µ), (g (µ)) σ ). Appedix: Useful Facts Facts about sums i= i = (+). i= i = (+)(+) 6. Geometric series: a + ar + ar +... = a, for 0 < r <. r Partial Geometric series a + ar + ar ar = a( r ) r. Biomial Theorem x=0 ( ) a x = ( + a), x x=0 ( ) a x b x = (a + b). x Hypergeometric idetity x=0 ( )( ) a b = x x ( a + b ). 6

10 Commo Distributios Discrete Uiform X U (,..., N) X takes values x =,,..., N P (X = x) = /N E (X) = x xp (X = x) = x x N = N E (X ) = x x P (X = x) = x x N = N Biomial X Bi(, p) X takes values x = 0,,..., P (X = x) = ( ) x p x ( p) x Hypergeometric X Hypergeometric(N, M, K) P (X = x) = (M x )( N M K x ) ( N K) Geometric X Geom(p) P (X = x) = ( p) x p, x =,,... E (X) = x x( p)x = p x Poisso X Poisso(λ) P (X = x) = e λ λ x x! x = 0,,,... E (X) = Var (X) = λ N(N+) = (N+) N(N+)(N+) 6 d ( ( dp p)x ) = p p =. p p M X (t) = x=0 etx e λ λ x = e λ (λe t ) x x! x=0 = e λ e λet = e λ(et ). x! 7

11 E (X) = λe t e λ(et ) t=0 = λ. Use mgf to show: if X Poisso(λ ), X Poisso(λ ), idepedet the Y = X + X Poisso(λ + λ ). Cotiuous Distributios Normal X N(µ, σ ) p(x) = πσ exp{ σ (x µ) }, x R mgf M X (t) = exp{µt + σ t /}. E (X) = µ Var (X) = σ. e.g., If Z N(0, ) ad X = µ + σz, the X N(µ, σ ). Show this... Proof. which is the mgf of a N(µ, σ ). Alterative proof: M X (t) = E ( e tx) = E ( e t(µ+σz)) = e tµ E ( e tσz) = e tµ M Z (tσ) = e tµ e (tσ) / = e tµ+t σ / F X (x) = P (X x) = P (µ + σz x) = P ( ) x µ = F Z σ ( ) x µ p X (x) = F X(x) = p Z σ σ { = exp ( ) } x µ π σ σ { = exp ( ) } x µ, πσ σ which is the pdf of a N(µ, σ ). ( Z x µ ) σ 8

12 Gamma X Γ(α, β). p X (x) = Γ(α)β α x α e x/β, x a positive real. Γ(α) = 0 x α e x/β dx. β α Importat statistical distributio: χ p = Γ( p, ). χ p = p i= X i, where X i N(0, ), iid. Expoetial X expoe(β) p X (x) = β e x/β, x a positive real. expoe(β) = Γ(, β). e.g., Used to model waitig time of a Poisso Process. Suppose N is the umber of phoe calls i hour ad N P oisso(λ). Let T be the time betwee cosecutive phoe calls, the T expoe(/λ) ad E (T ) = (/λ). If X,..., X are iid expoe(β), the i X i Γ(, β). Memoryless Property: If X expoe(β), the P (X > t + s X > t) = P (X > s). Liear Regressio Model the respose (Y ) as a liear fuctio of the parameters ad covariates (x) plus radom error (ɛ). Y i = θ(x, β) + ɛ i where θ(x, β) = Xβ = β 0 + β x + β x β k x k. 9

13 Geeralized Liear Model Model the atural parameters as liear fuctios of the the covariates. Example: Logistic Regressio. P (Y = X = x) = I other words, Y X = x Bi(, p(x)) ad where η(x) = β T x eβt x + e βt x. ( ) p(x) η(x) = log. p(x) Logistic Regressio cosists of modellig the atural parameter, which is called the log odds ratio, as a liear fuctio of covariates. Locatio ad Scale Families, CB 3.5 Let p(x) be a pdf. Locatio family : {p(x µ) = p(x µ) : µ R} Scale family : { p(x σ) = ( x ) } σ f : σ > 0 σ Locatio Scale family : { p(x µ, σ) = ( ) } x µ σ f : µ R, σ > 0 σ () Locatio family. Shifts the pdf. e.g., Uiform with p(x) = o (0, ) ad p(x θ) = o (θ, θ + ). e.g., Normal with stadard pdf the desity of a N(0, ) ad locatio family pdf N(θ, ). () Scale family. Stretches the pdf. e.g., Normal with stadard pdf the desity of a N(0, ) ad scale family pdf N(0, σ ). (3) Locatio-Scale family. Stretches ad shifts the pdf. e.g., Normal with stadard pdf the desity of a N(0, ) ad locatio-scale family pdf N(θ, σ ), i.e., x µ p( ). σ σ 0

14 Multiomial Distributio The multivariate versio of a Biomial is called a Multiomial. Cosider drawig a ball from a ur with has balls with k differet colors labeled color, color,..., color k. Let p = (p, p,..., p k ) where j p j = ad p j is the probability of drawig color j. Draw balls from the ur (idepedetly ad with replacemet) ad let X = (X, X,..., X k ) be the cout of the umber of balls of each color draw. We say that X has a Multiomial (, p) distributio. The pdf is ( ) p(x) = p x... p x k k x,..., x. k Multivariate Normal Distributio We ow defie the multivariate ormal distributio ad derive its basic properties. We wat to allow the possibility of multivariate ormal distributios whose covariace matrix is ot ecessarily positive defiite. Therefore, we caot defie the distributio by its desity fuctio. Istead we defie the distributio by its momet geeratig fuctio. (The reader may woder how a radom vector ca have a momet geeratig fuctio if it has o desity fuctio. However, the momet geeratig fuctio ca be defied usig more geeral types of itegratio. I this book, we assume that such a defiitio is possible but fid the momet geeratig fuctio by elemetary meas.) We fid the desity fuctio for the case of positive defiite covariace matrix i Theorem 5. Lemma 8 (a). Let X = AY + b The The M X (t) = exp (b t)m Y (A t). (b). Let c be a costat. Let Z = cy. The (c). Let Y = M Z (t) = M Y (ct). ( Y Y ), t = ( t t ( ) t M Y (t ) = M Y. 0 (d). Y ad Y are idepedet if ad oly if ( ) ( ( ) t t M Y = M t Y )M 0 Y. 0t )

15 We start with Z,..., Z idepedet radom variables such that Z i N (0, ). Let Z = (Z,..., Z ). The E(Z) = 0, cov(z) = I, M Z (t) = exp t i = exp t t. () Let µ be a vector ad A a matrix. Let Y = AZ + µ. The E(Y) = µ cov(y) = AA. () Let Σ = AA. We ow show that the distributio of Y depeds oly o µ ad Σ. The momet geeratig fuctio M Y (t) is give by ( M Y (t) = exp(µ t)m Z (A t) = exp µ t + t (A ) ( ) A)t = exp µ t + t Σt. With this motivatio i mid, let µ be a vector, ad let Σ be a oegative defiite matrix. The we say that the -dimesioal radom vector Y has a -dimesioal ormal distributio with mea vector µ, ad covariace matrix Σ, if Y has momet geeratig fuctio ( ) M Y (t) = exp µ t + t Σt. (3) We write Y N (µ, Σ). The followig theorem summarizes some elemetary facts about multivariate ormal distributios. Theorem 9 (a). If Y N (µ, Σ), the E(Y) = µ, cov(y) = Σ. (b). If Y N (µ, Σ), c is a scalar, the cy N (cµ, c Σ). (c). Let Y N (µ, Σ). If A is p, b is p, the AY + b N p (Aµ + b, AΣA ). (d). Let µ be ay vector, ad let Σ be ay oegative defiite matrix. The there exists Y such that Y N (µ, Σ). Proof. (a). This follows directly from () above. (b) ad (c). Homework. (d). Let Z,..., Z be idepedet, Z i N(0, ). Let Z = (Z,..., Z ). It is easily verified that Z N (0, I). Let Y = Σ / Z + µ. By part b, above, Y N (Σ / 0 + µ, Σ). We have ow show that the family of ormal distributios is preserved uder liear operatios o the radom vectors. We ow show that it is preserved uder takig margial ad coditioal distributios.

16 Theorem 0 Suppose that Y N (µ, Σ). Let ( ) ( ) Y µ Y =, µ =, Σ = Y µ ( Σ Σ Σ Σ where Y ad µ are p, ad Σ is p p. (a). Y N p (µ, Σ ), Y N p (µ, Σ ). (b). Y ad Y are idepedet if ad oly if Σ = 0. (c). If Σ > 0, the the coditio distributio of Y give Y is Y Y N p (µ + Σ Σ (Y µ ), Σ Σ Σ Σ ). Proof. (a). Let t = (t, t ) where t is p. The joit momet geeratig fuctio of Y ad Y is Therefore, M Y (t) = exp(µ t + µ t + (t Σ t + t Σ t + t Σ t + t Σ t )). M Y ( t 0 ) = exp(µ t + ( ) t Σ t ), M Y = exp(µ 0t t + t Σ t ). By Lemma c, we see that Y N p (µ, Σ ), Y N p (µ, Σ ). (b). We ote that ( ( ) t M Y (t) = M Y )M 0 Y 0t if ad oly if t Σ t + t Σ t = 0. Sice Σ is symmetric ad t Σ t is a scalar, we see that t Σ t = t Σ t. Fially, t Σ t = 0 for all t R p, t R p if ad oly if Σ = 0, ad the result follows from Lemma d. (c). We first fid the joit distributio of X = Y Σ Σ Y ad Y. ( ) ( X I Σ Σ )( ) Y = 0 I Y Therefore, by Theorem c, the joit distributio of X ad Y is ( ) (( X µ Σ Σ ) ( µ Σ Σ Σ )) Σ 0 N, Y µ 0 Σ ad hece X ad Y are idepedet. Therefore, the coditioal distributio of X give Y is the same as the margial distributio of X, Y ). X Y N p (µ Σ Σ µ, Σ Σ Σ Σ ). 3

17 Sice Y is just a costat i the coditioal distributio of X give Y we have, by Theorem c, that the coditioal distributio of Y = X + Σ Σ Y give Y is Y Y N p (µ Σ Σ µ + Σ Σ Y, Σ Σ Σ Σ ) Note that we eed Σ > 0 i part c so that Σ exists. Lemma Let Y N (µ, σ I), where Y = (Y,..., Y ), µ = (µ,..., µ ) ad σ > 0 is a scalar. The the Y i are idepedet, Y i N (µ, σ ) ad ( ) Y = Y Y µ χ µ σ σ. σ Proof. Let Y i be idepedet, Y i N (µ i, σ ). The joit momet geeratig fuctio of the Y i is M Y (t) = (exp(µ i t i + σ t i )) = exp(µ t + σ t t) i= which is the momet geeratig fuctio of a radom vector that is ormally distributed with mea vector µ ad covariace matrix σ I. Fially, Y Y = ΣYi, µ µ = Σµ i ad Y i /σ N (µ i /σ, ). Therefore Y Y/σ χ (µ µ/σ ) by the defiitio of the ocetral χ distributio. We are ow ready to derive the osigular ormal desity fuctio. Theorem Let Y N (µ, Σ), with Σ > 0. The Y has desity fuctio ( p Y (y) = exp ) (π) / Σ / (y µ) Σ (y µ). Proof. We could derive this by fidig the momet geeratig fuctio of this desity ad showig that it satisfied (3). We would also have to show that this fuctio is a desity fuctio. We ca avoid all that by startig with a radom vector whose distributio we kow. Let Z N (0, I). Z = (Z,..., Z ). The the Z i are idepedet ad Z i N (0, ), by Lemma 4. Therefore, the joit desity of the Z i is ( p Z (z) = exp ) ( (π) / z i = exp ) (π) / z z. i= Let Y = Σ / Z + µ. By Theorem c, Y N (µ, Σ). Also Z = Σ / (Y µ), ad the trasformatio from Z to Y is therefore ivertible. Furthermore, the Jacobia of this iverse trasformatio is just Σ / = Σ /. Hece the desity of Y is p Y (y) = p Z (Σ / (y µ)) Σ ( / = exp ) Σ / (π) / (y µ) Σ (y µ). 4

18 We ow prove a result that is useful later i the book ad is also the basis for Pearso s χ tests. Theorem 3 Let Y N (µ, Σ), Σ > 0. The (a). Y Σ Y χ (µ Σ µ). (b). (Y µ) Σ (Y µ) χ (0). Proof. (a). Let Z = Σ / Y N (Σ / µ, I). By Lemma 4, we see that (b). Follows fairly directly. Z Z = Y Σ Y χ (µ Σ µ). The Spherical Normal For the first part of this book, the most importat class of multivariate ormal distributio is the class i which Y N (µ, σ I). We ow show that this distributio is spherically symmetric about µ. A rotatio about µ is give by X = Γ(Y µ) + µ, where Γ is a orthogoal matrix (i.e., ΓΓ = I). By Theorem, X N (µ, σ I), so that the distributio is uchaged uder rotatios about µ. We therefore call this ormal distributio the spherical ormal distributio. If σ = 0, the P (Y = µ) =. Otherwise its desity fuctio (by Theorem 4) is p Y (y) = (π) / σ exp ( y µ σ By Lemma 4, we ote that the compoets of Y are idepedetly ormally distributed with commo variace σ. I fact, the spherical ormal distributio is the oly multivariate distributio with idepedet compoets that is spherically symmetric. ). 5

19 Lecture Notes 2: Probability Inequalities

Inequalities are useful for bounding quantities that might otherwise be hard to compute. They will also be used in the theory of convergence.

Theorem 1 (The Gaussian Tail Inequality) Let X ~ N(0, 1). Then
P(|X| > ε) ≤ (2/ε) e^{-ε²/2}.
If X_1, ..., X_n ~ N(0, 1) then
P(|X̄_n| > ε) ≤ (2/(√n ε)) e^{-nε²/2}.

Proof. The density of X is φ(x) = (2π)^{-1/2} e^{-x²/2}. Hence,
P(X > ε) = ∫_ε^∞ φ(s) ds ≤ (1/ε) ∫_ε^∞ s φ(s) ds = -(1/ε) ∫_ε^∞ φ'(s) ds = φ(ε)/ε ≤ (1/ε) e^{-ε²/2}.
By symmetry,
P(|X| > ε) ≤ (2/ε) e^{-ε²/2}.
Now let X_1, ..., X_n ~ N(0, 1). Then X̄_n = n^{-1} Σ_{i=1}^n X_i ~ N(0, 1/n). Thus, X̄_n =_d n^{-1/2} Z where Z ~ N(0, 1) and
P(|X̄_n| > ε) = P(n^{-1/2} |Z| > ε) = P(|Z| > √n ε) ≤ (2/(√n ε)) e^{-nε²/2}.
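
A quick Monte Carlo check of the two bounds in Theorem 1 (a sketch, not part of the notes; the values of ε and n are arbitrary choices):

import numpy as np

# Gaussian tail inequality: P(|X| > eps) <= (2/eps) exp(-eps^2/2) for X ~ N(0,1),
# and P(|Xbar_n| > eps) <= (2/(sqrt(n) eps)) exp(-n eps^2/2) for the mean of n of them.
rng = np.random.default_rng(1)

eps = 1.5
x = rng.normal(size=1_000_000)
print(np.mean(np.abs(x) > eps), 2 / eps * np.exp(-eps**2 / 2))      # ~0.134 vs 0.433

n, eps = 25, 0.3
xbar = rng.normal(size=(200_000, n)).mean(axis=1)
print(np.mean(np.abs(xbar) > eps),
      2 / (np.sqrt(n) * eps) * np.exp(-n * eps**2 / 2))             # ~0.134 vs 0.433
# Same numbers as above, because Xbar_n has the same distribution as n^{-1/2} Z.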

20 Theorem 2 (Markov's inequality) Let X be a non-negative random variable and suppose that E(X) exists. For any t > 0,
P(X > t) ≤ E(X)/t. (1)
Proof. Since X > 0,
E(X) = ∫_0^∞ x p(x) dx = ∫_0^t x p(x) dx + ∫_t^∞ x p(x) dx ≥ ∫_t^∞ x p(x) dx ≥ t ∫_t^∞ p(x) dx = t P(X > t).

Theorem 3 (Chebyshev's inequality) Let µ = E(X) and σ² = Var(X). Then,
P(|X - µ| ≥ t) ≤ σ²/t²  and  P(|Z| ≥ k) ≤ 1/k²  (2)
where Z = (X - µ)/σ. In particular, P(|Z| > 2) ≤ 1/4 and P(|Z| > 3) ≤ 1/9.
Proof. We use Markov's inequality to conclude that
P(|X - µ| ≥ t) = P(|X - µ|² ≥ t²) ≤ E(X - µ)²/t² = σ²/t².
The second part follows by setting t = kσ.

If X_1, ..., X_n ~ Bernoulli(p) and X̄_n = n^{-1} Σ_{i=1}^n X_i then Var(X̄_n) = Var(X_1)/n = p(1 - p)/n and
P(|X̄_n - p| > ε) ≤ Var(X̄_n)/ε² = p(1 - p)/(nε²) ≤ 1/(4nε²)
since p(1 - p) ≤ 1/4 for all p.

2 Hoeffding's Inequality
Hoeffding's inequality is similar in spirit to Markov's inequality but it is a sharper inequality. We begin with the following important result.
Lemma 4 Suppose that E(X) = 0 and that a ≤ X ≤ b. Then
E(e^{tX}) ≤ e^{t²(b-a)²/8}.

21 Recall that a fuctio g is covex if for each x, y ad each α [0, ], g(αx + ( α)y) αg(x) + ( α)g(y). Proof. Sice a X b, we ca write X as a covex combiatio of a ad b, amely, X = αb + ( α)a where α = (X a)/(b a). By the covexity of the fuctio y e ty we have e tx αe tb + ( α)e ta = X a b a etb + b X b a eta. Take expectatios of both sides ad use the fact that E(X) = 0 to get Ee tx a b a etb + b b a eta = e g(u) (3) where u = t(b a), g(u) = γu + log( γ + γe u ) ad γ = a/(b a). Note that g(0) = g (0) = 0. Also, g (u) /4 for all u > 0. By Taylor s theorem, there is a ξ (0, u) such that g(u) = g(0) + ug (0) + u g (ξ) = u g (ξ) u 8 = t (b a). 8 Hece, Ee tx e g(u) e t (b a) /8. Next, we eed to use Cheroff s method. Lemma 5 Let X be a radom variable. The Proof. For ay t > 0, P(X > ɛ) if t 0 e tɛ E(e tx ). P(X > ɛ) = P(e X > e ɛ ) = P(e tx > e tɛ ) e tɛ E(e tx ). Sice this is true for every t 0, the result follows. Theorem 6 (Hoeffdig s Iequality) Let Y,..., Y be iid observatios such that E(Y i ) = µ ad a Y i b where a < 0 < b. The, for ay ɛ > 0, P ( Y µ ɛ ) e ɛ /(b a). (4) Proof. Without los of geerality, we asume that µ = 0. First we have P( Y ɛ) = P(Y ɛ) + P(Y ɛ) = P(Y ɛ) + P( Y ɛ). 3

22 Next we use Cheroff s method. For ay t > 0, we have, from Markov s iequality, that ( ) ( P(Y ɛ) = P Y i ɛ = P e P ) i= Y i e ɛ i= ( = P e t P i= Y i e tɛ ) e tɛ E ( e t P ) i= Y i = e tɛ i E(e ty i ) = e tɛ (E(e ty i )). From Lemma 4, E(e ty i ) e t (b a) /8. So P(Y ɛ) e tɛ e t (b a) /8. This is miimized by settig t = 4ɛ/(b a) givig P(Y ɛ) e ɛ /(b a). Applyig the same argumet to P( Y ɛ) yields the result. Example 7 Let X,..., X Beroulli(p). Chebyshev s iequality yields Accordig to Hoeffdig s iequality, which decreases much faster. P( X p > ɛ) 4ɛ. P( X p > ɛ) e ɛ Corollary 8 If X, X,..., X are idepedet with P(a X i b) = ad commo mea µ, the, with probability at least δ, ( ) c X µ log (5) δ where c = (b a). 3 The Bouded Differece Iequality So far we have focused o sums of radom variables. The followig result exteds Hoeffdig s iequality to more geeral fuctios g(x,..., x ). Here we cosider McDiarmid s iequality, also kow as the Bouded Differece iequality. 4
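
Before moving to the bounded difference inequality, here is a small numerical comparison of the Chebyshev bound 1/(4nε²) and the Hoeffding bound 2e^{-2nε²} from Example 7, together with a Monte Carlo estimate of the actual probability. A sketch only; n, ε and p are arbitrary choices.

import numpy as np

# Compare the two bounds on P(|pbar_n - p| > eps) for X_1,...,X_n ~ Bernoulli(p).
n, eps, p = 2000, 0.05, 0.3
print("Chebyshev bound:", 1 / (4 * n * eps**2))            # 0.05
print("Hoeffding bound:", 2 * np.exp(-2 * n * eps**2))     # ~9.1e-05

rng = np.random.default_rng(2)
pbar = rng.binomial(n, p, size=100_000) / n
print("Monte Carlo:", np.mean(np.abs(pbar - p) > eps))     # ~0 (true value is about 1e-06)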

23 Theorem 9 (McDiarmid) Let X,..., X be idepedet radom variables. Suppose that sup g(x,..., x i, x i, x i+,..., x ) g(x,..., x i, x i, x i+,..., x ) c i (6) x,...,x,x i for i =,...,. The ( P g(x,..., X ) E(g(X,..., X )) ɛ ) } exp { ɛ. (7) i= c i Proof. Let V i = E(g X,..., X i ) E(g X,..., X i ). The g(x,..., X ) E(g(X,..., X )) = i= V i ad E(V i X,..., X i ) = 0. Usig a similar argumet as i Hoeffdig s Lemma we have, E(e tv i X,..., X i ) e t c i /8. (8) Now, for ay t > 0, ( ) P (g(x,..., X ) E(g(X,..., X )) ɛ) = P V i ɛ ( = P = e tɛ E e t P i= ) ( i= V i e tɛ e tɛ E e t P ) i= V i ( )) (e tv X,..., X e t P i= V i E e tɛ e t c /8 E (e t P ) i= V i. e tɛ e t P i= c i. The result follows by takig t = 4ɛ/ i= c i. Example 0 If we take g(x,..., x ) = i= x i the we get back Hoeffdig s iequality. Example Suppose we throw m balls ito bis. What fractio of bis are empty? Let Z be the umber of empty bis ad let F = Z/ be the fractio of empty bis. We ca write Z = i= Z i where Z i = of bi i is empty ad Z i = 0 otherwise. The µ = E(Z) = E(Z i ) = ( /) m = e m log( /) e m/ i= ad θ = E(F ) = µ/ e m/. How close is Z to µ? Note that the Z i s are ot idepedet so we caot just apply Hoeffdig. Istead, we proceed as follows. 5

24 Defie variables X,..., X m where X s = i if ball s falls ito bi i. The Z = g(x,..., X m ). If we move oe ball ito a differet bi, the Z ca chage by at most. Hece, (6) holds with c i = ad so P( Z µ > t) e t /m. Recall that he fractio of empty bis is F = Z/m with mea θ = µ/. We have P( F θ > t) = P( Z µ > t) e t /m. 4 Bouds o Expected Values Theorem (Cauchy-Schwartz iequality) If X ad Y have fiite variaces the E XY E(X )E(Y ). (9) The Cauchy-Schwarz iequality ca be writte as Cov (X, Y ) σ Xσ Y. Recall that a fuctio g is covex if for each x, y ad each α [0, ], g(αx + ( α)y) αg(x) + ( α)g(y). If g is twice differetiable ad g (x) 0 for all x, the g is covex. It ca be show that if g is covex, the g lies above ay lie that touches g at some poit, called a taget lie. A fuctio g is cocave if g is covex. Examples of covex fuctios are g(x) = x ad g(x) = e x. Examples of cocave fuctios are g(x) = x ad g(x) = log x. Theorem 3 (Jese s iequality) If g is covex, the Eg(X) g(ex). (0) If g is cocave, the Eg(X) g(ex). () Proof. Let L(x) = a+bx be a lie, taget to g(x) at the poit E(X). Sice g is covex, it lies above the lie L(x). So, Eg(X) EL(X) = E(a + bx) = a + be(x) = L(E(X)) = g(ex). Example 4 From Jese s iequality we see that E(X ) (EX). 6
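
A quick numerical illustration of Jensen's inequality for one convex and one concave function, using an Exponential random variable (the choice of distribution is arbitrary):

import numpy as np

# Jensen: E g(X) >= g(E X) for convex g (here g(x) = x^2),
#         E g(X) <= g(E X) for concave g (here g(x) = log x).
rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=1_000_000)

print(np.mean(x**2), np.mean(x)**2)              # ~8 vs ~4:  E(X^2) >= (EX)^2
print(np.mean(np.log(x)), np.log(np.mean(x)))    # ~0.12 vs ~0.69:  E(log X) <= log E(X)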

25 Example 5 (Kullback Leibler Distace) Defie the Kullback-Leibler distace betwee two desities p ad q by ( ) p(x) D(p, q) = p(x) log dx. q(x) Note that D(p, p) = 0. We will use Jese to show that D(p, q) 0. Let X f. The ( ) ( ) q(x) q(x) D(p, q) = E log log E = log p(x) q(x) dx = log q(x)dx = log() = 0. p(x) p(x) p(x) So, D(p, q) 0 ad hece D(p, q) 0. Example 6 It follows from Jese s iequality that 3 types of meas ca be ordered. Assume that a,..., a are positive umbers ad defie the arithmetic, geometric ad harmoic meas as a A = (a a ) The a H a G a A. a G = (a... a ) / a H = a a ). Suppose we have a expoetial boud o P(X > ɛ). I that case we ca boud E(X ) as follows. Theorem 7 Suppose that X 0 ad that for every ɛ > 0, for some c > 0 ad c > /e. The, P(X > ɛ) c e c ɛ () where C = ( + log(c ))/c. E(X ) C. (3) Proof. Recall that for ay oegative radom variable Y, E(Y ) = P(Y t)dt. 0 Hece, for ay a > 0, E(X ) = 0 P(X t)dt = a 0 P(X t)dt + Equatio () implies that P(X > t) c e c t. Hece, E(X ) a + a P(X t)dt = a + a a P(X t)dt a + P(X t)dt a + c 7 a a P(X t)dt. e c t dt = a + c e c a c.

26 Set a = log(c )/(c ) ad coclude that Fially, we have E(X ) log(c ) c + c = + log(c ) c. E(X ) E(X ) + log(c ) c. Now we cosider boudig the maximum of a set of radom variables. Theorem 8 Let X,..., X be radom variables. Suppose there exists σ > 0 such that E(e tx i ) e tσ / for all t > 0. The ( ) E max X i σ log. (4) i Proof. By Jese s iequality, { ( )} exp te max X i E i ( { }) exp t max X i i ) ( = E max exp {tx i} i E (exp {tx i }) e t σ /. i= Thus, ( ) E max X i log + tσ i t. The result follows by settig t = log /σ. 5 O P ad o P I statisics, probability ad machie learig, we make use of o P ad O P otatio. Recall first, that a = o() meas that a 0 as. a = o(b ) meas that a /b = o(). a = O() meas that a is evetually bouded, that is, for all large, a C for some C > 0. a = O(b ) meas that a /b = O(). We write a b if both a /b ad b /a are evetually bouded. I computer sicece this s writte as a = Θ(b ) but we prefer usig a b sice, i statistics, Θ ofte deotes a parameter space. Now we move o to the probabilistic versios. Say that Y = o P () if, for every ɛ > 0, P( Y > ɛ) 0. 8

27 Say that Y = o P (a ) if, Y /a = o P (). Say that Y = O P () if, for every ɛ > 0, there is a C > 0 such that P( Y > C) ɛ. Say that Y = O P (a ) if Y /a = O P (). Let s use Hoeffdig s iequality to show that sample proportios are O P (/ ) withi the the true mea. Let Y,..., Y be coi flips i.e. Y i {0, }. Let p = P(Y i = ). Let p = Y i. i= We will show that: p p = o P () ad p p = O P (/ ). We have that P( p p > ɛ) e ɛ 0 ad so p p = o P (). Also, P( p p > C) = P ( p p > C ) e C < δ if we pick C large eough. Hece, ( p p) = O P () ad so ( ) p p = O P. Now cosider m cois with probabilities p,..., p m. The P(max p j p j > ɛ) j m P( p j p j > ɛ) uio boud j= m j= e ɛ Hoeffdig = me ɛ = exp { (ɛ log m) }. Supose that m e γ where 0 γ <. The P(max p j p j > ɛ) exp { (ɛ γ ) } 0. j Hece, max p j p j = o P (). j 9
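
A quick simulation of the maximal inequality from the previous page, E(max_i X_i) ≤ σ√(2 log n), using iid N(0, 1) variables (which satisfy the moment generating function condition with σ = 1). The sample sizes below are arbitrary choices.

import numpy as np

# E(max of n standard normals) versus the bound sqrt(2 log n).
rng = np.random.default_rng(4)
for n in (10, 100, 1000):
    x = rng.normal(size=(100_000, n))
    print(n, x.max(axis=1).mean(), np.sqrt(2 * np.log(n)))
# The simulated mean stays below the bound for every n (e.g. ~1.54 vs 2.15 at n = 10).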

28 Uiform Bouds Lecture Notes 3 Recall that, if X,..., X Beroulli(p) ad p = i= X i the, from Hoeffdig s iequality, P( p p > ɛ) e ɛ. Sometimes we wat to say more tha this. Example Suppose that X,..., X have cdf F. Let F (t) = I(X i t). i= We call F the empirical cdf. How close is F to F? That is, how big is F (t) F (t)? From Hoeffdig s iequality, P( F (t) F (t) > ɛ) e ɛ. But that is oly for oe poit t. How big is sup t F (t) F (t)? We would like a boud of the form ( ) P sup F (t) F (t) > ɛ t somethig small. Example Suppose that X,..., X P. Let P (A) = I(X i A). i= How close is P (A) to P (A)? That is, how big is P (A) P (A)? From Hoeffdig s iequality, P( P (A) P (A) > ɛ) e ɛ. But that is oly for oe set A. How big is sup A A P (A) P (A) for a class of sets A? We would like a boud of the form ( ) P sup P (A) P (A) > ɛ A A somethig small. Example 3 (Classificatio.) Suppose we observe data (X, Y ),..., (X, Y ) where Y i {0, }. Let (X, Y ) be a ew pair. Suppose we observe X. Now we wat to predict Y. A classifier h is a fuctio h(x) which takes values i {0, }. Whe we observe X we predict Y with h(x). The classificatio error, or risk, is the probability of a error: R(h) = P(Y h(x)).

29 The traiig error is the fractio of errors o the observed data (X, Y ),..., (X, Y ): R(h) = I(Y i h(x i )). By Hoeffdig s iequality, i= P( R(h) R(h) > ɛ) e ɛ. How do we choose a classifier? Oe way is to start with a set of classifiers H. The we defie ĥ to be the member of H that miimizes the traiig error. Thus ĥ = argmi h H R(h). A example is the set of liear classifiers. Suppose that x R d. A liear classifier has the form h(x) = of β T x 0 ad h(x) = 0 of β T x < 0 where β = (β,..., β d ) T is a set of parameters. Although ĥ miimizes R(h), it does ot miimize R(h). Let h miimize the true error R(h). A fudametal questio is: how close is R(ĥ) to R(h )? We will see later tha R(ĥ) is close to R(h ) if sup h R(h) R(h) is small. So we wat ( ) P sup R(h) R(h) > ɛ h somethig small. More geerally, we ca state out goal as follows. For ay fuctio f defie P (f) = f(x)dp (x), P (f) = f(x i ). Let F be a set of fuctios. I our first example, each f was of the form f t (x) = I(x t) ad F = {f t : t R}. We wat to boud ) ( P sup P (f) P (f) > ɛ f F We will see that the bouds we obtai have the form ( P sup P (f) P (f) > ɛ f F i= ) c κ(f)e c ɛ where c ad c are positive costats ad κ(f) is a measure of the size (or complexity) of the class F. Similarly, if A is a class of sets the we wat a boud of the form ( ) P sup P (A) P (A) > ɛ c κ(a)e c ɛ A A where P (A) = i= I(X i A). Bouds like these are called uiform bods sice they hold uiformly over a class of fuctios or over a class of sets..

30 Fiite Classes Let F = {f,..., f N }. Suppose that max sup f j (x) B. j N We will make use of the uio boud. Recall that ) P (A AN x N P(A j ). Let A j be the evet that P (f j ) P (f) > ɛ. From Hoeffdig s iequality, P(A j ) e ɛ /(B ). The ( ) P sup P (f) P (f) > ɛ = P(A AN ) f F N N P(A j ) e ɛ /(B ) = Ne ɛ /(B ). Thus we have show that ( ) P sup P (f) P (f) > ɛ f F κe ɛ /(B ) j= where κ = F. The same idea applies to classes of sets. Let A = {A,..., A N } be a fiite collectio of sets. By the same reasoig we have ( ) P sup P (A) P (A) > ɛ A A κe ɛ /(B ) where κ = F ad P (A) = i= I(X i A). To exted these ideas to ifiite classes like F = {f t : t R} we eed to itroduce a few more cocepts. j= j= 3 Shatterig Let A be a class of sets. Some examples are:. A = {(, t] : t R}.. A = {(a, b) : a b}. 3. A = {(a, b) (c, d) : a b c d}. 3

31 4. A = all discs i R d. 5. A = all rectagles i R d. 6. A = all half-spaces i R d = {x : β T x 0}. 7. A = all covex sets i R d. Let F = {x,..., x } be a fiite set. Let G be a subset of F. Say that A picks out G if A F = G for some A A. For example, let A = {(a, b) : a b}. Suppose that F = {,, 7, 8, 9} ad G = {, 7}. The A picks out G sice A F = G if we choose A = (.5, 7.5) for example. Let S(A, F ) be the umber of these subsets picked out by A. Of course S(A, F ). Example 4 Let A = {(a, b) : a b} ad F = {,, 3}. The A ca pick out:, {}, {}, {3}, {, }, {, 3}, {,, 3}. So s(a, F ) = 7. Note that 7 < 8 = 3. If F = {, 6} the A ca pick out: I this case s(a, F ) = 4 =., {}, {6}, {, 6}. We say that F is shattered if s(a, F ) = where is the umber of poits i F. Let F deote all fiite sets with elemets. Defie the shatter coefficiet Note that s (A). s (A) = sup F F s(a, F ). The followig theorem is due to Vapik ad Chervoeis. The proof is beyod the scope of the course. (If you take 0-70/36-70 you will lear the proof.) 4

32 Class A VC dimesio V A A = {A,..., A N } log N Itervals [a, b] o the real lie Discs i R 3 Closed balls i R d d + Rectagles i R d d Half-spaces i R d d + Covex polygos i R Covex polygos with d vertices d + Table : The VC dimesio of some classes A. Theorem 5 Let A be a class of sets. The ( ) P sup P (A) P (A) > ɛ A A 8 s (A) e ɛ /3. () This partly solves oe of our problems. But, how big ca s (A) be? Sometimes s (A) = for all. For example, let A be all polygos i the plae. The s (A) = for all. But, i may cases, we will see that s (A) = for all up to some iteger d ad the s (A) < for all > d. The Vapik-Chervoekis (VC) dimesio is d = d(a) = largest such that s (A) =. I other words, d is the size of the largest set that ca be shattered. Thus, s (A) = for all d ad s (A) < for all > d. The VC dimesios of some commo examples are summarized i Table. Now here is a iterestig questio: for > d how does s (A) behave? It is less tha but how much less? Theorem 6 (Sauer s Theorem) Suppose that A has fiite VC dimesio d. The, for all d, s(a, ) ( + ) d. () 5

33 We coclude that: Theorem 7 Let A be a class of sets with VC dimesio d <. The ( ) P sup P (A) P (A) > ɛ A A 8 ( + ) d e ɛ /3. (3) Example 8 Let s retur to our first example. Suppose that X,..., X have cdf F. Let F (t) = I(X i t). i= We would like to boud P(sup t F (t) F (t) > ɛ). Notice that F (t) = P (A) where A = (, t]. Let A = {(, t] : t R}. This has VC dimesio d =. So ( ) P(sup F (t) F (t) > ɛ) = P t sup P (A) P (A) > ɛ A A 8 ( + ) e ɛ /3. I fact, there is a tighter boud i this case called the DKW (Dvoretsky-Kiefer-Wolfowitz) iequality: P(sup F (t) F (t) > ɛ) e ɛ. t 4 Boudig Expectatios Eearlier we saw that we ca use expoetial bouds o probabilities to get bouds o expectatios. Let us recall how that works. Cosider a fiite collectio A = {A,..., A N }. Let We kow that Z = max j N P (A j ) P (A j ). P(Z > ɛ) me ɛ. (4) But ow we wat to boud ( ) E(Z ) = max P (A j ) P (A j ). j N We ca rewrite (4) as or, i other words, Recall that, i geeral, if Y 0 the P(Z > ɛ ) Ne ɛ. P(Z > t) Ne t. E(Y ) = 0 6 P(Y > t)dt.
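
A small Monte Carlo check of the DKW inequality above, with F the Uniform(0, 1) cdf (the choice of F is arbitrary: for any continuous F the supremum has the same distribution). A sketch, not part of the notes; n, ε and the number of repetitions are arbitrary.

import numpy as np

# DKW: P(sup_t |F_n(t) - F(t)| > eps) <= 2 exp(-2 n eps^2).
rng = np.random.default_rng(5)
n, eps, reps = 100, 0.12, 20_000

x = np.sort(rng.uniform(size=(reps, n)), axis=1)
grid = np.arange(1, n + 1) / n
d_plus = (grid - x).max(axis=1)                 # max_i ( i/n - X_(i) )
d_minus = (x - (grid - 1.0 / n)).max(axis=1)    # max_i ( X_(i) - (i-1)/n )
d = np.maximum(d_plus, d_minus)                 # sup_t |F_n(t) - F(t)|

print(np.mean(d > eps), 2 * np.exp(-2 * n * eps**2))   # ~0.10 vs 0.112
# The bound holds and is nearly tight in this regime.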

34 Hece, for ay s, E(Z) = = 0 s 0 s + P(Z > t)dt P(Z > t)dt + s s P(Z > t)dt s + N e t dt s ( ) e s = s + N P(Z > t)dt = s + N e s. Let s = log(n)/(). The E(Z) s + N e s = log N + = log N +. Fially, we use Cauchy-Schwartz: E(Z ) ( ) log N + log N E(Z) = O. I summary: ( ) E max P (A j ) P (A j ) = O j N ( ) log N. For a sigle set A we would have E P (A) P (A) O(/ ). The boud oly icreases logarithmically with N. 7

35 Radom Samples Lecture Notes 4 Let X,..., X F. A statistic is ay fuctio T = g(x,..., X ). Recall that the sample mea is X = X i ad sample variace is S = Let µ = E(X i ) ad σ = Var(X i ). Recall that E(X ) = µ, i= (X i X ). i= Var(X ) = σ, E(S ) = σ. Theorem If X,..., X N(µ, σ ) the X N(µ, σ /). So, Proof. We kow that M Xi (s) = e µs+σ s /. M X (t) = E(e tx ) = E(e t P i= X i ) = (Ee txi/ ) = (M Xi (t/)) = { } = exp µt + which is the mgf of a N(µ, σ /). σ t ( ) e (µt/)+σ t /( ) Example (Example 5..0). Let Z,..., Z Cauchy(0, ). The Z Cauchy(0, ). Lemma 3 If X,..., X N(µ, σ ) the T = X µ S/ t N(0, ). Let X (),..., X () deoted the ordered values: X () X () X (). The X (),..., X () are called the order statistics.

36 2 Convergence
Let X_1, X_2, ... be a sequence of random variables and let X be another random variable. Let F_n denote the cdf of X_n and let F denote the cdf of X.
1. X_n converges almost surely to X, written X_n ->a.s. X, if, for every ε > 0,
P(lim_{n→∞} |X_n - X| < ε) = 1. (1)
2. X_n converges to X in probability, written X_n ->P X, if, for every ε > 0,
P(|X_n - X| > ε) -> 0 (2)
as n -> ∞. In other words, X_n - X = o_P(1).
3. X_n converges to X in quadratic mean (also called convergence in L_2), written X_n ->qm X, if
E(X_n - X)² -> 0 (3)
as n -> ∞.
4. X_n converges to X in distribution, written X_n ⇝ X, if
lim_{n→∞} F_n(t) = F(t) (4)
at all t for which F is continuous.

Convergence to a Constant. A random variable X has a point mass distribution if there exists a constant c such that P(X = c) = 1. The distribution for X is denoted by δ_c and we write X ~ δ_c. If X_n ->P δ_c then we also write X_n ->P c. Similarly for the other types of convergence.

Theorem 4 X_n ->a.s. X if and only if, for every ε > 0,
lim_{n→∞} P(sup_{m≥n} |X_m - X| ≤ ε) = 1.

Example 5 (Example 5.5.8). This example shows that convergence in probability does not imply almost sure convergence. Let S = [0, 1]. Let P be uniform on [0, 1]. We draw s ~ P. Let X(s) = s and let
X_1 = s + I_{[0,1]}(s), X_2 = s + I_{[0,1/2]}(s), X_3 = s + I_{[1/2,1]}(s),
X_4 = s + I_{[0,1/3]}(s), X_5 = s + I_{[1/3,2/3]}(s), X_6 = s + I_{[2/3,1]}(s),
etc. Then X_n ->P X. But, for each s, X_n(s) does not converge to X(s). Hence, X_n does not converge almost surely to X.

37 Example 6 Let X N(0, /). Ituitively, X is cocetratig at 0 so we would like to say that X coverges to 0. Let s see if this is true. Let F be the distributio fuctio for a poit mass at 0. Note that X N(0, ). Let Z deote a stadard ormal radom variable. For t < 0, sice t. For t > 0, F (t) = P(X < t) = P( X < t) = P(Z < t) 0 F (t) = P(X < t) = P( X < t) = P(Z < t) sice t. Hece, F (t) F (t) for all t 0 ad so X 0. Notice that F (0) = / F (/) = so covergece fails at t = 0. That does t matter because t = 0 is ot a cotiuity poit of F ad the defiitio of covergece i distributio oly requires covergece at cotiuity poits. Now cosider covergece i probability. For ay ɛ > 0, usig Markov s iequality, as. Hece, X P 0. P( X > ɛ) = P( X > ɛ ) E(X ) ɛ = ɛ 0 The ext theorem gives the relatioship betwee the types of covergece. Theorem 7 The followig relatioships hold: (a) X qm X implies that X P X. (b) X P X implies that X X. (c) If X X ad if P(X = c) = for some real umber c, the X P X. as (d) X X implies X P X. I geeral, oe of the reverse implicatios hold except the special case i (c). Proof. We start by provig (a). Suppose that X qm X. Fix ɛ > 0. The, usig Markov s iequality, P( X X > ɛ) = P( X X > ɛ ) E X X ɛ 0. Proof of (b). Fix ɛ > 0 ad let x be a cotiuity poit of F. The F (x) = P(X x) = P(X x, X x + ɛ) + P(X x, X > x + ɛ) P(X x + ɛ) + P( X X > ɛ) = F (x + ɛ) + P( X X > ɛ). 3

38 Also, F (x ɛ) = P(X x ɛ) = P(X x ɛ, X x) + P(X x ɛ, X > x) F (x) + P( X X > ɛ). Hece, F (x ɛ) P( X X > ɛ) F (x) F (x + ɛ) + P( X X > ɛ). Take the limit as to coclude that F (x ɛ) lim if F (x) lim sup F (x) F (x + ɛ). This holds for all ɛ > 0. Take the limit as ɛ 0 ad use the fact that F is cotiuous at x ad coclude that lim F (x) = F (x). Proof of (c). Fix ɛ > 0. The, P( X c > ɛ) = P(X < c ɛ) + P(X > c + ɛ) Proof of (d). This follows from Theorem 4. P(X c ɛ) + P(X > c + ɛ) = F (c ɛ) + F (c + ɛ) F (c ɛ) + F (c + ɛ) = 0 + = 0. Let us ow show that the reverse implicatios do ot hold. Covergece i probability does ot imply covergece i quadratic mea. Let U Uif(0, ) ad let X = I (0,/) (U). The P( X > ɛ) = P( I (0,/) (U) > ɛ) = P(0 U < /) = / 0. Hece, X P 0. But E(X) = / du = for all so X 0 does ot coverge i quadratic mea. Covergece i distributio does ot imply covergece i probability. Let X N(0, ). Let X = X for =,, 3,...; hece X N(0, ). X has the same distributio fuctio as X for all so, trivially, lim F (x) = F (x) for all x. Therefore, X X. But P( X X > ɛ) = P( X > ɛ) = P( X > ɛ/) 0. So X does ot coverge to X i probability. The relatioships betwee the types of covergece ca be summarized as follows: q.m. a.s. prob distributio 4

39 Example 8 Oe might cojecture that if X P b, the E(X ) b. This is ot true. Let X be a radom variable defied by P(X = ) = / ad P(X = 0) = (/). Now, P( X < ɛ) = P(X = 0) = (/). Hece, X P 0. However, E(X ) = [ (/)] + [0 ( (/))] =. Thus, E(X ). Example 9 Let X,..., X Uiform(0, ). Let X () = max i X i. First we claim that P. This follows sice X () P( X () > ɛ) = P(X () ɛ) = i P(X i ɛ) = ( ɛ) 0. Also So ( X () ) Exp(). P(( X () ) t) = P(X () (t/)) = ( t/) e t. Some covergece properties are preserved uder trasformatios. Theorem 0 Let X, X, Y, Y be radom variables. Let g be a cotiuous fuctio. (a) If X P X ad Y P Y, the X + Y P X + Y. (b) If X qm X ad Y qm Y, the X + Y qm X + Y. (c) If X X ad Y c, the X + Y X + c. (d) If X P X ad Y P Y, the X Y P XY. (e) If X X ad Y c, the X Y cx. (f) If X P X, the g(x ) P g(x). (g) If X X, the g(x ) g(x). Parts (c) ad (e) are kow as Slutzky s theorem Parts (f) ad (g) are kow as The Cotiuous Mappig Theorem. It is worth otig that X X ad Y Y does ot i geeral imply that X +Y X + Y. 3 The Law of Large Numbers The law of large umbers (LLN) says that the mea of a large sample is close to the mea of the distributio. For example, the proportio of heads of a large umber of tosses of a fair coi is expected to be close to /. We ow make this more precise. Let X, X,... be a iid sample, let µ = E(X ) ad σ = Var(X ). Recall that the sample mea is defied as X = i= X i ad that E(X ) = µ ad Var(X ) = σ /. 5

40 Theorem 11 (The Weak Law of Large Numbers (WLLN)) If X_1, ..., X_n are iid, then X̄_n ->P µ. Thus, X̄_n - µ = o_P(1).
Interpretation of the WLLN: The distribution of X̄_n becomes more concentrated around µ as n gets large.
Proof. Assume that σ < ∞. This is not necessary but it simplifies the proof. Using Chebyshev's inequality,
P(|X̄_n - µ| > ε) ≤ Var(X̄_n)/ε² = σ²/(nε²)
which tends to 0 as n -> ∞.

Theorem 12 (The Strong Law of Large Numbers). Let X_1, ..., X_n be iid with mean µ. Then X̄_n ->a.s. µ.
The proof is beyond the scope of this course.

4 The Central Limit Theorem
The law of large numbers says that the distribution of X̄_n piles up near µ. This isn't enough to help us approximate probability statements about X̄_n. For this we need the central limit theorem. Suppose that X_1, ..., X_n are iid with mean µ and variance σ². The central limit theorem (CLT) says that X̄_n = n^{-1} Σ_i X_i has a distribution which is approximately Normal with mean µ and variance σ²/n. This is remarkable since nothing is assumed about the distribution of X_i, except the existence of the mean and variance.

Theorem 13 (The Central Limit Theorem (CLT)) Let X_1, ..., X_n be iid with mean µ and variance σ². Let X̄_n = n^{-1} Σ_{i=1}^n X_i. Then
Z_n ≡ (X̄_n - µ)/√Var(X̄_n) = √n (X̄_n - µ)/σ ⇝ Z
where Z ~ N(0, 1). In other words,
lim_{n→∞} P(Z_n ≤ z) = Φ(z) = ∫_{-∞}^z (2π)^{-1/2} e^{-x²/2} dx.
Interpretation: Probability statements about X̄_n can be approximated using a Normal distribution. It's the probability statements that we are approximating, not the random variable itself.
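
A minimal numerical illustration of this interpretation: for skewed Exponential(1) data (so µ = σ = 1; the distribution, n and c below are arbitrary choices), the Monte Carlo value of P(X̄_n ≤ c) is compared with the Normal approximation Φ(√n (c - µ)/σ).

import numpy as np
from scipy.stats import norm

# CLT check: X_1,...,X_n ~ Exponential(1), so mu = sigma = 1.
rng = np.random.default_rng(6)
n, c = 50, 1.2
xbar = rng.exponential(scale=1.0, size=(200_000, n)).mean(axis=1)

print(np.mean(xbar <= c))                       # Monte Carlo, ~0.93
print(norm.cdf(np.sqrt(n) * (c - 1.0) / 1.0))   # Normal approximation, ~0.92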

41 A cosequece of the CLT is that X µ = O P ( I additio to Z N(0, ), there are several forms of otatio to deote the fact that the distributio of Z is covergig to a Normal. They all mea the same thig. Here they are: ) Z N(0, ) ) X N (µ, σ ) X µ N (0, σ (X µ) N ( 0, σ ) (X µ) N(0, ). σ Recall that if X is a radom variable, its momet geeratig fuctio (mgf) is ψ X (t) = Ee tx. Assume i what follows that the mgf is fiite i a eighborhood aroud t = 0. Lemma 4 Let Z, Z,... be a sequece of radom variables. Let ψ be the mgf of Z. Let Z be aother radom variable ad deote its mgf by ψ. If ψ (t) ψ(t) for all t i some ope iterval aroud 0, the Z Z. Proof of the cetral limit theorem. Let Y i = (X i µ)/σ. The, Z = / i Y i. Let ψ(t) be the mgf of Y i. The mgf of i Y i is (ψ(t)) ad mgf of Z is [ψ(t/ )] ξ (t). Now ψ (0) = E(Y ) = 0, ψ (0) = E(Y ) = Var(Y ) =. So,. Now, ψ(t) = ψ(0) + tψ (0) + t! ψ (0) + t3 3! ψ (0) + = t + t3 3! ψ (0) + = + t + t3 3! ψ (0) + ξ (t) = [ ψ ( t )] = = [ + t + t3 3! 3/ ψ (0) + [ t + + t3 ψ (0) + 3! / e t / 7 ] ]

42 which is the mgf of a N(0,). The result follows from Lemma 4. I the last step we used the fact that if a a the ( + a ) e a. The cetral limit theorem tells us that Z = (X µ)/σ is approximately N(0,). However, we rarely kow σ. We ca estimate σ from X,..., X by S = (X i X ). i= This raises the followig questio: if we replace σ with S, is the cetral limit theorem still true? The aswer is yes. Theorem 5 Assume the same coditios as the CLT. The, T = (X µ) S N(0, ). Proof. We have that where ad Now Z N(0, ) ad W T = Z W (X µ) Z = σ W = σ. S P. The result follows from Slutzky s theorem. There is also a multivariate versio of the cetral limit theorem. Recall that X = (X,..., X k ) T has a multivariate Normal distributio with mea vector µ ad covariace matrix Σ if ( f(x) = exp ) (π) k/ Σ / (x µ)t Σ (x µ). I this case we write X N(µ, Σ). Theorem 6 (Multivariate cetral limit theorem) Let X,..., X be iid radom vectors where X i = (X i,..., X ki ) T with mea µ = (µ,..., µ k ) T ad covariace matrix Σ. Let X = (X,..., X k ) T where X j = i= X ji. The, (X µ) N(0, Σ). 8

43 5 The Delta Method If Y has a limitig Normal distributio the the delta method allows us to fid the limitig distributio of g(y ) where g is ay smooth fuctio. Theorem 7 (The Delta Method) Suppose that (Y µ) N(0, ) σ ad that g is a differetiable fuctio such that g (µ) 0. The I other words, Y N ( µ, σ ) (g(y ) g(µ)) g (µ) σ implies that N(0, ). g(y ) N ( g(µ), (g (µ)) σ ). Example 8 Let X,..., X be iid with fiite mea µ ad fiite variace σ. By the cetral limit theorem, (X µ)/σ N(0, ). Let W = e X. Thus, W = g(x ) where g(s) = e s. Sice g (s) = e s, the delta method implies that W N(e µ, e µ σ /). There is also a multivariate versio of the delta method. Theorem 9 (The Multivariate Delta Method) Suppose that Y = (Y,..., Y k ) is a sequece of radom vectors such that (Y µ) N(0, Σ). Let g : R k R ad let g(y) = Let µ deote g(y) evaluated at y = µ ad assume that the elemets of µ are ozero. The (g(y ) g(µ)) N ( 0, T µ Σ µ ). Example 0 Let ( X X ), ( X X g y. g y k. ),..., ( X X be iid radom vectors with mea µ = (µ, µ ) T ad variace Σ. Let X = X i, i= 9 X = i= ) X i

44 ad defie Y = X X. Thus, Y = g(x, X ) where g(s, s ) = s s. By the cetral limit theorem, ( ) X µ N(0, Σ). X µ Now ad so g(s) = ( T σ σ µ Σ µ = (µ µ ) σ σ ( g s g s ) ( µ µ ) = ( s s ) ) = µ σ + µ µ σ + µ σ. Therefore, (X X µ µ ) N (0, µ σ + µ µ σ + µ σ ). 0
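
A simulation of the univariate delta method example above (W_n = e^{X̄_n}), comparing the simulated mean and variance of W_n with the approximations e^µ and e^{2µ}σ²/n. A sketch; the values of µ, σ and n are arbitrary.

import numpy as np

# Delta method: W_n = exp(Xbar_n) is approximately N(e^mu, e^{2 mu} sigma^2 / n).
rng = np.random.default_rng(7)
mu, sigma, n = 0.5, 2.0, 400
xbar = rng.normal(mu, sigma, size=(20_000, n)).mean(axis=1)
w = np.exp(xbar)

print(w.mean(), np.exp(mu))                       # both ~1.65
print(w.var(), np.exp(2 * mu) * sigma**2 / n)     # both ~0.027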

45 Addedum to Lecture Notes 4 where Here is the proof that T = (X µ) S N(0, ) S = (X i X ). i= Step. We first show that R Note that R = P σ where R = (X i X ). i= Xi i= ( ) X i. Defie Y i = X i. The, usig the LLN (law of large umbers) Next, by the LLN, i= X i = i= Y P i E(Y i ) = E(Xi ) = µ + σ. i= X P i µ. Sice g(t) = t is cotiuous, the cotiuous mappig theorem implies that ( ) X i P µ. Thus R i= i= P (µ + σ ) µ = σ. Step. Note that Sice, R S = ( ) R. P σ ad /( ), we have that S P σ. Step 3. Sice g(t) = t is cotiuous, (for t 0) the cotiuous mappig theorem implies that S P σ.

46 Step 4. Sice g(t) = t/σ is cotiuous, the cotiuous mappig theorem implies that S /σ P. Step 5. Sice g(t) = /t is cotiuous (for t > 0) the cotiuous mappig theorem implies that σ/s P. Sice covergece i probability implies covergece i distributio, σ/s. Step 5. Note that ( ) ( ) (X µ) σ T = V W. σ S Now V Z where Z N(0, ) by the CLT. Ad we showed that W. By Slutzky s theorem, T = V W Z = Z.

47 Lecture Notes 5

1 Statistical Models
A statistical model P is a collection of probability distributions (or a collection of densities). An example of a nonparametric model is
P = { p : ∫ (p''(x))² dx < ∞ }.
A parametric model has the form
P = { p(x; θ) : θ ∈ Θ }
where Θ ⊂ R^d. An example is the set of Normal densities {p(x; θ) = (2π)^{-1/2} e^{-(x-θ)²/2}}.
For now, we focus on parametric models. The model comes from assumptions. Some examples:
Time until something fails is often modeled by an exponential distribution.
Number of rare events is often modeled by a Poisson distribution.
Lengths and weights are often modeled by a Normal distribution.
These models are not correct. But they might be useful. Later we consider nonparametric methods that do not assume a parametric model.

2 Statistics
Let X_1, ..., X_n ~ p(x; θ). Let X^n ≡ (X_1, ..., X_n). Any function T = T(X_1, ..., X_n) is itself a random variable which we will call a statistic. Some examples are:
order statistics, X_(1) ≤ X_(2) ≤ ··· ≤ X_(n)

48 sample mea: X = i X i, sample variace: S = i (X i x), sample media: middle value of ordered statistics, sample miimum: X () sample maximum: X () sample rage: X () X () sample iterquartile rage: X (.75) X (.5) Example If X,..., X Γ(α, β), the X Γ(α, β/). Proof: This is the mgf of Γ(α, β/). M X = E[e tx ] = E[e P X i t/ ] = E[e Xi(t/) ] i [( ) α ] [ ] α = [M X (t/)] = =. βt/ β/t Example If X,..., X N(µ, σ ) the X N(µ, σ /). Example 3 If X,..., X iid Cauchy(0,), for x R, the X Cauchy(0,). p(x) = π( + x ) Example 4 If X,..., X N(µ, σ ) the The proof is based o the mgf. ( ) S χ σ ( ).
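
Example 4 above states that (n - 1)S²/σ² ~ χ²_{n-1} for a Normal sample. A minimal check comparing the simulated mean and variance with n - 1 and 2(n - 1); the values of µ, σ and n are arbitrary choices.

import numpy as np

# (n-1) S^2 / sigma^2 should behave like a chi-square with n-1 degrees of freedom.
rng = np.random.default_rng(8)
mu, sigma, n = 3.0, 2.0, 10
x = rng.normal(mu, sigma, size=(200_000, n))
q = (n - 1) * x.var(axis=1, ddof=1) / sigma**2    # ddof=1 gives the sample variance S^2

print(q.mean(), n - 1)        # ~9
print(q.var(), 2 * (n - 1))   # ~18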

49 Example 5 Let X (), X (),..., X () be the order statistics, which meas that the sample X, X,..., X has bee ordered from smallest to largest: X () X () X (). Now, F X(k) (x) = P (X (k) x) = P (at least k of the X,..., X x) = P (exactly j of the X,..., X x) = j=k j=k ( ) [F X (x)] j [ F X (x)] j j Differetiate to fid the pdf (See CB p. 9): p X(k) (x) =! (k )!( k)! [F X(x)] k p(x) [ F X (x)] k. 3 Sufficiecy (Ch 6 CB) We cotiue with parametric iferece. reductio as a formal cocept. I this sectio we discuss data Sample X = X,, X F. Assume F belogs to a family of distributios, (e.g. F is Normal), idexed by some parameter θ. We wat to lear about θ ad try to summarize the data without throwig ay iformatio about θ away. If a statistic T (X,, X ) cotais all the iformatio about θ i the sample we say T is sufficiet. 3

50 3. Sufficiet Statistics Defiitio: T is sufficiet for θ if the coditioal distributio of X T does ot deped o θ. Thus, f(x,..., x t; θ) = f(x,..., x t). Example 6 X,, X Poisso(θ). Let T = i= X i. The, But Hece, p X T (x t) = P(X = x T (X ) = t) = P (X = x ad T = t). P (T = t) 0 if T (x ) t P (X = x ad T = t) = P (X = x ) if T (X ) = t P (X = x ) = Now, T (x ) = x i = t ad so e θ θ x i i= x i! = e θ θ P x i (xi!) = e θ θ t (xi!). P (T = t) = e θ (θ) t t! sice T Poisso(θ). Thus, P (X = x ) P (T = t) = t! ( x i )! t which does ot deped o θ. So T = i X i is a sufficiet statistic for θ. Other sufficiet statistics are: T = 3.7 i X i, T = ( i X i, X 4 ), ad T (X,..., X ) = (X,..., X ). 3. Sufficiet Partitios It is better to describe sufficiecy i terms of partitios of the sample space. Example 7 Let X, X, X 3 Beroulli(θ). Let T = X i. 4

51 x t p(x t) (0, 0, 0) t = 0 (0, 0, ) t = /3 (0,, 0) t = /3 (, 0, 0) t = /3 (0,, ) t = /3 (, 0, ) t = /3 (,, 0) t = /3 (,, ) t = 3 8 elemets 4 elemets. A partitio B,..., B k is sufficiet if f(x X B) does ot deped o θ.. A statistic T iduces a partitio. For each t, {x : T (x) = t} is oe elemet of the partitio. T is sufficiet if ad oly if the partitio is sufficiet. 3. Two statistics ca geerate the same partitio: example: i X i ad 3 i X i. 4. If we split ay elemet B i of a sufficiet partitio ito smaller pieces, we get aother sufficiet partitio. Example 8 Let X, X, X 3 Beroulli(θ). The T = X is ot sufficiet. Look at its partitio: 5

52 3.3 The Factorizatio Theorem x t p(x t) (0, 0, 0) t = 0 ( θ) (0, 0, ) t = 0 θ( θ) (0,, 0) t = 0 θ( θ) (0,, ) t = 0 θ (, 0, 0) t = ( θ) (, 0, ) t = θ( θ) (,, 0) t = θ( θ) (,, ) t = θ 8 elemets elemets Theorem 9 T (X ) is sufficiet for θ if the joit pdf/pmf of X ca be factored as p(x ; θ) = h(x ) g(t; θ). Example 0 Let X,, X Poisso. The p(x ; θ) = e θ θ P X i (xi!) = (xi!) e θ θ P i X i. Example X,, X N(µ, σ ). The p(x ; µ, σ ) = ( ) { exp πσ (xi x) + (x µ) σ }. (a) If σ kow: ( ) { } p(x (xi x) ; µ) = exp πσ σ } {{ } h(x ) { } (x µ) exp. σ } {{ } g(t (x ) µ) Thus, X is sufficiet for µ. (b) If (µ, σ ) ukow the T = (X, S ) is sufficiet. So is T = ( X i, X i ). 6

53 3.4 Miimal Sufficiet Statistics (MSS) We wat the greatest reductio i dimesio. Example X,, X N(0, σ ). Some sufficiet statistics are: T (X,, X ) = (X,, X ) T (X,, X ) = (X,, X) ( m T (X,, X ) = Xi, T (X,, X ) = X i. i= i=m+ X i ) Defiitio: T is a Miimal Sufficiet Statistic if the followig two statemets are true:. T is sufficiet ad. If U is ay other sufficiet statistic the T = g(u) for some fuctio g. I other words, T geerates the coarsest sufficiet partitio. Suppose U is sufficiet. Suppose T = H(U) is also sufficiet. T provides greater reductio tha U uless H is a trasformatio, i which case T ad U are equivalet. Example 3 X N(0, σ ). X is sufficiet. X is sufficiet. X is MSS. So are X, X 4, e X. Example 4 Let X, X, X 3 Beroulli(θ). Let T = X i. 7

54 x t p(x t) u p(x u) (0, 0, 0) t = 0 u = 0 (0, 0, ) t = /3 u = /3 (0,, 0) t = /3 u = /3 (, 0, 0) t = /3 u = /3 (0,, ) t = /3 u = 73 / (, 0, ) t = /3 u = 73 / (,, 0) t = /3 u = 9 (,, ) t = 3 u = 03 Note that U ad T are both sufficiet but U is ot miimal. 3.5 How to fid a Miimal Sufficiet Statistic Theorem 5 Defie Suppose that T has the followig property: R(x, y ; θ) = p(y ; θ) p(x ; θ). R(x, y ; θ) does ot deped o θ if ad oly if T (y ) = T (x ). The T is a MSS. Example 6 Y,, Y iid Poisso (θ). p(y ; θ) = e θ θ P y i yi, p(y ; θ) p(x ; θ) = θ P yi P x i yi!/ x i! which is idepedet of θ iff y i = x i. This implies that T (Y ) = Y i is a miimal sufficiet statistic for θ. The miimal sufficiet statistic is ot uique. But, the miimal sufficiet partitio is uique. 8

55 Example 7 Cauchy. The p(x; θ) = p(y ; θ) p(x ; θ) = The ratio is a costat fuctio of θ if π( + (x θ) ). { + (x i θ) }. { + (y j θ) } i= j= T (Y ) = (Y (),, Y () ). It is techically harder to show that this is true oly if T is the order statistics, but it could be doe usig theorems about polyomials. Havig show this, oe ca coclude that the order statistics are the miimal sufficiet statistics for θ. Note: Igore the material o completeess ad acillary statistics. 9

56 Lecture Notes 6 The Likelihood Fuctio Defiitio. Let X = (X,, X ) have joit desity p(x ; θ) = p(x,..., x ; θ) where θ Θ. The likelihood fuctio L : Θ [0, ) is defied by L(θ) L(θ; x ) = p(x ; θ) where x is fixed ad θ varies i Θ.. The likelihood fuctio is a fuctio of θ.. The likelihood fuctio is ot a probability desity fuctio. 3. If the data are iid the the likelihood is L(θ) = p(x i ; θ) i= iid case oly. 4. The likelihood is oly defied up to a costat of proportioality. 5. The likelihood fuctio is used (i) to geerate estimators (the maximum likelihood estimator) ad (ii) as a key igrediet i Bayesia iferece. Example These samples have the same likelihood fuctio: (X, X, X 3 ) Multiomial ( = 6, θ, θ, θ) X = (, 3, ) = L(θ) = 6!!3!! θ θ 3 ( θ) θ 4 ( θ) X = (,, ) = L(θ) = 6!!!! θ θ ( θ) θ 4 ( θ) Example X,, X N(µ, ). The, ( ) { } L(µ) = exp (x i µ) exp { } π (x µ). i=

57 Example 3 Let X,..., X Beroulli(p). The L(p) p X ( p) X for p [0, ] where X = i X i. Theorem 4 Write x y if L(θ x ) L(θ y ). The partitio iduced by is the miimal sufficiet partitio. Example 5 A o iid example. A AR() time series auto regressive model. The model is: X N(0, σ ) ad X i+ = θx i + e i+ e i iid N(0, σ ). It ca be show that we have the Markov property: o(x + x, x,, x ) = p(x + x ). The likelihood fuctio is L(θ) = p(x ; θ) = p(x ; θ)p(x x ; θ) p(x x,..., x ; θ) = p(x x ; θ)p(x x ; θ) p(x x ; θ)p(x ; θ) ( ) = exp πθ θ (x +i θx i ). i= Likelihood, Sufficiecy ad the Likelihood Priciple The likelihood fuctio is a miimal sufficiet statistic. That is, if we defie the equivalece relatio: x y whe L(θ; x ) L(θ; y ) the the resultig partitio is miimal sufficiet. Does this mea that the likelihood fuctio cotais all the relevat iformatio? Some people say yes it does. This is sometimes called the likelihood priciple. That is, the likelihood priciple says that the likelihood fuctio cotais all the ifomatio i the data. This is FALSE. Here is a simple example to illustrate why. Let C = {c,..., c N } be a fiite set of costats. For simplicity, asssume that c j {0, } (although this is ot importat). Let θ = N N j= c j. Suppose we wat to estimate θ. We proceed as follows.

58 Let S,..., S Beroulli(π) where π is kow. If S i = you get to see c i. Otherwise, you do ot. (This is a example of survey samplig.) The likelihood fuctio is π S i ( π) S i. i The ukow parameter does ot appear i the likelihood. I fact, there are o ukow parameters i the likelihood! The likelihood fuctio cotais o iformatio at all. But we ca estimate θ. Let θ = Nπ N c j S j. j= The E( θ) = θ. Hoeffdig s iequality implies that P( θ θ > ɛ) e ɛ π. Hece, θ is close to θ with high probability. Summary: the miimal sufficiet statistic has all the iformatio you eed to compute the likelihood. But that does ot mea that all the iformatio is i the likelihood. 3

59 Lecture Notes 7

1 Parametric Point Estimation
X_1, ..., X_n ~ p(x; θ). We want to estimate θ = (θ_1, ..., θ_k). An estimator is a function of the data.
Methods:
1. Method of Moments (MOM)
2. Maximum likelihood (MLE)
3. Bayesian estimators
Evaluating Estimators:
1. Bias and Variance
2. Mean squared error (MSE)
3. Minimax Theory
4. Large sample theory (later).
θ̂ = θ̂_n = w(X_1, ..., X_n)

2 Some Terminology
E_θ(θ̂) = ∫ ··· ∫ θ̂(x_1, ..., x_n) p(x_1; θ) ··· p(x_n; θ) dx_1 ··· dx_n
Bias: E_θ(θ̂) - θ
the distribution of θ̂ is called its sampling distribution
the standard deviation of θ̂ is called the standard error, denoted by se(θ̂)
θ̂ is consistent if θ̂_n ->P θ
later we will see that if bias -> 0 and Var(θ̂_n) -> 0 as n -> ∞ then θ̂_n is consistent
an estimator is robust if it is not strongly affected by perturbations in the data (more later)

60 3 Method of Moments
Define
m_1 = (1/n) Σ_{i=1}^n X_i,    µ_1(θ) = E(X_i)
m_2 = (1/n) Σ_{i=1}^n X_i²,   µ_2(θ) = E(X_i²)
...
m_k = (1/n) Σ_{i=1}^n X_i^k,  µ_k(θ) = E(X_i^k).
Let θ̂ = (θ̂_1, ..., θ̂_k) solve:
m_j = µ_j(θ̂), j = 1, ..., k.

Example 1 N(β, σ²) with θ = (β, σ²). Then µ_1 = β and µ_2 = σ² + β². Equate:
(1/n) Σ_i X_i = β̂,  (1/n) Σ_i X_i² = σ̂² + β̂²
to get
β̂ = X̄_n,  σ̂² = (1/n) Σ_{i=1}^n (X_i - X̄_n)².

Example 2 Suppose X_1, ..., X_n ~ Binomial(k, p) where both k and p are unknown. We get
k̂ p̂ = X̄_n,  (1/n) Σ_i X_i² = k̂ p̂ (1 - p̂) + k̂² p̂²
giving
p̂ = X̄_n / k̂,  k̂ = X̄_n² / (X̄_n - (1/n) Σ_i (X_i - X̄_n)²).

4 Maximum Likelihood

Let θ̂ maximize

L(θ) = p(X_1, ..., X_n; θ).

This is the same as maximizing

l(θ) = log L(θ).

Often it suffices to solve

∂l(θ)/∂θ_j = 0,   j = 1, ..., k.

Example 3 Bernoulli. L(p) = ∏_i p^{X_i} (1 − p)^{1 − X_i} = p^S (1 − p)^{n − S} where S = ∑_i X_i. So l(p) = S log p + (n − S) log(1 − p) and p̂ = X̄.

Example 4 X_1, ..., X_n ~ N(µ, 1). Then

L(µ) ∝ ∏_i e^{−(X_i − µ)^2/2} ∝ e^{−n(X̄ − µ)^2/2},   l(µ) = −(n/2)(X̄ − µ)^2

and µ̂ = X̄. For N(µ, σ^2) we have

L(µ, σ^2) ∝ ∏_i (1/σ) exp{ −(1/(2σ^2)) (X_i − µ)^2 }

and

l(µ, σ^2) = −n log σ − (1/(2σ^2)) ∑_{i=1}^n (X_i − µ)^2.

Set

∂l/∂µ = 0,   ∂l/∂σ^2 = 0

to get

µ̂ = (1/n) ∑_{i=1}^n X_i,   σ̂^2 = (1/n) ∑_{i=1}^n (X_i − X̄)^2.
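A sketch that maximizes the Bernoulli log-likelihood of Example 3 numerically (by a crude grid search, with illustrative data and seed) and compares the result with the closed form p̂ = X̄:

import numpy as np

rng = np.random.default_rng(1)
X = rng.binomial(1, 0.7, size=200)       # Bernoulli(p = 0.7) sample
S, n = X.sum(), len(X)

def loglik(p):
    return S * np.log(p) + (n - S) * np.log(1 - p)

grid = np.linspace(0.001, 0.999, 9999)
p_numeric = grid[np.argmax(loglik(grid))]
print(p_numeric, X.mean())               # both close to 0.7, and to each other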

Example 5 Let X_1, ..., X_n ~ Uniform(0, θ). Then

L(θ) = (1/θ^n) I(θ > X_{(n)})

and so θ̂ = X_{(n)}.

The mle is equivariant: if η = g(θ) then η̂ = g(θ̂). Suppose g is invertible so η = g(θ) and θ = g^{−1}(η). Define L*(η) = L(θ) where θ = g^{−1}(η). So, for any η,

L*(η̂) = L(θ̂) ≥ L(θ) = L*(η)

and hence η̂ = g(θ̂) maximizes L*(η). For non-invertible functions this is still true if we define

L*(η) = sup_{θ: g(θ) = η} L(θ).

Example 6 Bernoulli. The mle is p̂ = X̄. Let ψ = log(p/(1 − p)). Then ψ̂ = log(p̂/(1 − p̂)). Later, we will see that maximum likelihood estimators have certain optimality properties.

5 Bayes Estimator

Regard θ as random. Start with a prior distribution π(θ). Note that f(x|θ)π(θ) = f(x, θ). Now compute the posterior distribution by Bayes' theorem:

π(θ|x) = f(x|θ)π(θ) / m(x)   where   m(x) = ∫ f(x|θ)π(θ) dθ.

This can be written as π(θ|x) ∝ L(θ)π(θ).

Now compute a point estimator from the posterior. For example:

θ̂ = E(θ|x) = ∫ θ π(θ|x) dθ = ∫ θ f(x|θ)π(θ) dθ / ∫ f(x|θ)π(θ) dθ.

This approach is controversial. We will discuss the controversy and the meaning of the prior later in the course. For now, we just think of this as a way to define an estimator.

Example 7 Let X_1, ..., X_n ~ Bernoulli(p). Let the prior be p ~ Beta(α, β). Hence

π(p) = [Γ(α + β)/(Γ(α)Γ(β))] p^{α−1} (1 − p)^{β−1},   where Γ(α) = ∫_0^∞ t^{α−1} e^{−t} dt.

Set Y = ∑_i X_i. Then

π(p|X) ∝ p^Y (1 − p)^{n−Y} × p^{α−1} (1 − p)^{β−1} = p^{Y+α−1} (1 − p)^{n−Y+β−1},
          (likelihood)         (prior)

Therefore, p|X ~ Beta(Y + α, n − Y + β). (See page 35 for more details.) The Bayes estimator is

p̂ = (Y + α)/((Y + α) + (n − Y + β)) = (Y + α)/(α + β + n) = (1 − λ_n) p̂_mle + λ_n p̄

where

p̄ = α/(α + β),   λ_n = (α + β)/(α + β + n).

This is an example of a conjugate prior.

Example 8 Let X_1, ..., X_n ~ N(µ, σ^2) with σ known. Let µ ~ N(m, τ^2). Then

E(µ|X) = [τ^2/(τ^2 + σ^2/n)] X̄ + [(σ^2/n)/(τ^2 + σ^2/n)] m   and   Var(µ|X) = (σ^2/n) τ^2 / (τ^2 + σ^2/n).
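A numerical illustration of Example 7's identity p̂ = (1 − λ_n) p̂_mle + λ_n p̄, assuming an arbitrary Beta(2, 2) prior and simulated Bernoulli data (all values illustrative):

import numpy as np

rng = np.random.default_rng(2)
n, p_true = 50, 0.25
Y = rng.binomial(1, p_true, size=n).sum()

alpha, beta = 2.0, 2.0
post_mean = (Y + alpha) / (n + alpha + beta)           # Bayes estimator under squared error

# The same number written as a convex combination of the mle and the prior mean:
lam = (alpha + beta) / (alpha + beta + n)
p_mle, p_bar = Y / n, alpha / (alpha + beta)
print(post_mean, (1 - lam) * p_mle + lam * p_bar)      # identical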

6 MSE

The mean squared error (MSE) is

E_θ(θ̂ − θ)^2 = ∫ ⋯ ∫ (θ̂(x_1, ..., x_n) − θ)^2 f(x_1; θ) ⋯ f(x_n; θ) dx_1 ⋯ dx_n.

The bias is B = E_θ(θ̂) − θ and the variance is V = Var_θ(θ̂).

Theorem 9 We have MSE = B^2 + V.

Proof. Let m = E_θ(θ̂). Then

MSE = E_θ(θ̂ − θ)^2 = E_θ(θ̂ − m + m − θ)^2
    = E_θ(θ̂ − m)^2 + (m − θ)^2 + 2 E_θ(θ̂ − m)(m − θ)
    = E_θ(θ̂ − m)^2 + (m − θ)^2 = V + B^2.

An estimator is unbiased if the bias is 0. In that case, MSE = Variance. There is often a tradeoff between bias and variance: low bias can imply high variance and vice versa.

Example 10 Let X_1, ..., X_n ~ N(µ, σ^2). Then E(X̄) = µ and E(S^2) = σ^2. The MSEs are

E(X̄ − µ)^2 = σ^2/n,   E(S^2 − σ^2)^2 = 2σ^4/(n − 1).

See p 33 for calculations.
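A short simulation consistent with Theorem 9: the Monte Carlo MSE of σ̂^2 = (1/n) ∑ (X_i − X̄)^2 matches its bias^2 + variance decomposition (the Normal parameters and replication count are illustrative):

import numpy as np

rng = np.random.default_rng(3)
mu, sigma2, n, reps = 0.0, 4.0, 20, 100_000

X = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
sig2_hat = X.var(axis=1)                     # the biased mle, dividing by n

mse = np.mean((sig2_hat - sigma2) ** 2)
bias = sig2_hat.mean() - sigma2
var = sig2_hat.var()
print(mse, bias**2 + var)                    # the two agree up to Monte Carlo error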

7 Best Unbiased Estimators

What is the smallest variance of an unbiased estimator? This was once considered an important question. Today we consider it not so important: there is no reason to require an estimator to be unbiased, and having small MSE is more important. However, for completeness, we will briefly consider the question.

An estimator W* is UMVUE (Uniform Minimum Variance Unbiased Estimator) for τ(θ) if (i) E_θ(W*) = τ(θ) for all θ and (ii) if E_θ(W) = τ(θ) for all θ then Var_θ(W*) ≤ Var_θ(W).

The Cramer-Rao inequality gives a lower bound on the variance of any unbiased estimator W. The bound is:

Var_θ(W) ≥ ( (d/dθ) E_θ W )^2 / E_θ( (∂/∂θ) log f(X^n; θ) )^2 = (τ'(θ))^2 / I_n(θ).

There is also a link with sufficiency.

Theorem (The Rao-Blackwell Theorem). Let W be an unbiased estimator of τ(θ) and let T be a sufficient statistic. Define W' = φ(T) = E(W|T). Then W' is unbiased and Var_θ(W') ≤ Var_θ(W) for all θ.

Note that φ is a well-defined estimator since, by sufficiency, it does not depend on θ.

Proof. We have E_θ(W') = E_θ(E(W|T)) = E_θ(W) = τ(θ) so W' is unbiased. Also,

Var_θ(W) = Var_θ(E(W|T)) + E_θ(Var(W|T)) = Var_θ(W') + E_θ(Var(W|T)) ≥ Var_θ(W').

Ignore the material on completeness.

66 Miimax Theory Lecture Notes 8 Suppose we wat to estimate a parameter θ usig data X = (X,..., X ). What is the best possible estimator θ = θ(x,..., X ) of θ? Miimax theory provides a framework for aswerig this questio.. Itroductio Let θ = θ(x ) be a estimator for the parameter θ Θ. We start with a loss fuctio L(θ, θ) that measures how good the estimator is. For example: L(θ, θ) = (θ θ) L(θ, θ) = θ θ L(θ, θ) = θ θ p L(θ, θ) = 0 if θ = θ or if θ θ L(θ, θ) = I( θ θ > c) L(θ, θ) = ) log p(x; θ)dx ( p(x; θ) p(x; b θ) squared error loss, absolute error loss, L p loss, zero oe loss, large deviatio loss, Kullback Leibler loss. If θ = (θ,..., θ k ) is a vector the some commo loss fuctios are L(θ, θ) = θ θ = k ( θ j θ j ), j= ( k ) /p L(θ, θ) = θ θ p = θ j θ j p. Whe the problem is to predict a Y {0, } based o some classifier h(x) a commoly used loss is L(Y, h(x)) = I(Y h(x)). j= For real valued predictio a commo loss fuctio is L(Y, Ŷ ) = (Y Ŷ ). The risk of a estimator θ is ) R(θ, θ) = E θ (L(θ, θ) = L(θ, θ(x,..., x ))p(x,..., x ; θ)dx. ()

67 Whe the loss fuctio is squared error, the risk is just the MSE (mea squared error): R(θ, θ) = E θ ( θ θ) = Var θ ( θ) + bias. () If we do ot state what loss fuctio we are usig, assume the loss fuctio is squared error. The miimax risk is R = if bθ sup R(θ, θ) θ where the ifimum is over all estimators. A estimator θ is a miimax estimator if sup R(θ, θ) = if sup R(θ, θ). θ bθ θ Example Let X,..., X N(θ, ). We will see that X is miimax with respect to may differet loss fuctios. The risk is /. Example Let X,..., X be a sample from a desity f. Let F be the class of smooth desities (defied more precisely later). We will see (later i the course) that the miimax risk for estimatig f is C 4/5.. Comparig Risk Fuctios To compare two estimators, we compare their risk fuctios. However, this does ot provide a clear aswer as to which estimator is better. Cosider the followig examples. Example 3 Let X N(θ, ) ad assume we are usig squared error loss. Cosider two estimators: θ = X ad θ = 3. The risk fuctios are R(θ, θ ) = E θ (X θ) = ad R(θ, θ ) = E θ (3 θ) = (3 θ). If < θ < 4 the R(θ, θ ) < R(θ, θ ), otherwise, R(θ, θ ) < R(θ, θ ). Neither estimator uiformly domiates the other; see Figure. Example 4 Let X,..., X Beroulli(p). Cosider squared error loss ad let p = X. Sice this has zero bias, we have that Aother estimator is R(p, p ) = Var(X) = p = Y + α α + β + p( p).

Figure 1: Comparing two risk functions. Neither risk function dominates the other at all values of θ.

where Y = ∑_{i=1}^n X_i and α and β are positive constants. Now,

R(p, p̂_2) = Var_p(p̂_2) + (bias_p(p̂_2))^2
          = Var_p( (Y + α)/(n + α + β) ) + ( E_p( (Y + α)/(n + α + β) ) − p )^2
          = n p(1 − p)/(n + α + β)^2 + ( (np + α)/(n + α + β) − p )^2.

Let α = β = √(n/4). The resulting estimator is

p̂_2 = (Y + √(n/4)) / (n + √n)

and the risk function is

R(p, p̂_2) = n/(4(n + √n)^2) = 1/(4(1 + √n)^2).

(p̂_2 is the posterior mean using a Beta(α, β) prior.) The risk functions are plotted in Figure 2. As we can see, neither estimator uniformly dominates the other.

These examples highlight the need to be able to compare risk functions. To do so, we need a one-number summary of the risk function. Two such summaries are the maximum risk and the Bayes risk. The maximum risk is

R̄(θ̂) = sup_{θ ∈ Θ} R(θ, θ̂)    (3)
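As a quick numerical aside, anticipating Example 5 below, here is a check of these risk formulas at the illustrative value n = 20 (numpy assumed):

import numpy as np

n = 20
p = np.linspace(0, 1, 101)

risk1 = p * (1 - p) / n                          # risk of the mle p1 = Y/n
risk2 = n / (4 * (n + np.sqrt(n)) ** 2)          # constant risk of p2 = (Y + sqrt(n)/2)/(n + sqrt(n))

print(risk1.max(), risk2)                        # 1/(4n) = 0.0125  vs  about 0.0083
print(np.mean(risk1 < risk2))                    # fraction of p values where the mle has smaller risk

For this n the second estimator has the smaller maximum risk, while the mle wins for p far from 1/2.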

Figure 2: Risk functions for p̂_1 and p̂_2 in Example 4. The solid curve is R(p̂_1). The dotted line is R(p̂_2).

and the Bayes risk under prior π is

B_π(θ̂) = ∫ R(θ, θ̂) π(θ) dθ.    (4)

Example 5 Consider again the two estimators in Example 4. We have

R̄(p̂_1) = max_{0 ≤ p ≤ 1} p(1 − p)/n = 1/(4n)

and

R̄(p̂_2) = max_p n/(4(n + √n)^2) = n/(4(n + √n)^2).

Based on maximum risk, p̂_2 is a better estimator since R̄(p̂_2) < R̄(p̂_1). However, when n is large, R(p, p̂_1) is smaller except for a small region in the parameter space near p = 1/2. Thus, many people prefer p̂_1 to p̂_2. This illustrates that one-number summaries like maximum risk are imperfect.

These two summaries of the risk function suggest two different methods for devising estimators: choosing θ̂ to minimize the maximum risk leads to minimax estimators; choosing θ̂ to minimize the Bayes risk leads to Bayes estimators. An estimator θ̂ that minimizes the Bayes risk is called a Bayes estimator. That is,

B_π(θ̂) = inf_θ̃ B_π(θ̃)    (5)

70 where the ifimum is over all estimators θ. A estimator that miimizes the maximum risk is called a miimax estimator. That is, sup R(θ, θ) = if θ eθ sup R(θ, θ) (6) θ where the ifimum is over all estimators θ. We call the right had side of (6), amely, R R (Θ) = if bθ sup R(θ, θ), (7) θ Θ the miimax risk. Statistical decisio theory has two goals: determie the miimax risk R ad fid a estimator that achieves this risk. Oce we have foud the miimax risk R we wat to fid the miimax estimator that achieves this risk: sup R(θ, θ) = if θ Θ bθ Sometimes we settle for a asymptotically miimax estimator sup R(θ, θ). (8) θ Θ sup R(θ, θ) if sup R(θ, θ) (9) θ Θ bθ θ Θ where a b meas that a /b. Eve that ca prove too difficult ad we might settle for a estimator that achieves the miimax rate, sup R(θ, θ) if sup R(θ, θ) (0) θ Θ bθ θ Θ where a b meas that both a /b ad b /a are both bouded as..3 Bayes Estimators Let π be a prior distributio. After observig X = (X,..., X ), the posterior distributio is, accordig to Bayes theorem, P(θ A X ) = p(x,..., X A θ)π(θ)dθ p(x Θ,..., X θ)π(θ)dθ = L(θ)π(θ)dθ A L(θ)π(θ)dθ () Θ where L(θ) = p(x ; θ) is the likelihood fuctio. The posterior has desity π(θ x ) = p(x θ)π(θ) m(x ) () where m(x ) = p(x θ)π(θ)dθ is the margial distributio of X. Defie the posterior risk of a estimator θ(x ) by r( θ x ) = L(θ, θ(x ))π(θ x )dθ. (3) 5

71 Theorem 6 The Bayes risk B π ( θ) satisfies B π ( θ) = r( θ x )m(x ) dx. (4) Let θ(x ) be the value of θ that miimizes r( θ x ). The θ is the Bayes estimator. Proof.Let p(x, θ) = p(x θ)π(θ) deote the joit desity of X ad θ. We ca rewrite the Bayes risk as follows: B π ( θ) = = = ( R(θ, θ)π(θ)dθ = L(θ, θ(x ))p(x θ)dx )π(θ)dθ L(θ, θ(x ))p(x, θ)dx dθ = L(θ, θ(x ))π(θ x )m(x )dx dθ ( ) L(θ, θ(x ))π(θ x )dθ m(x ) dx = r( θ x )m(x ) dx. If we choose θ(x ) to be the value of θ that miimizes r( θ x ) the we will miimize the itegrad at every x ad thus miimize the itegral r( θ x )m(x )dx. Now we ca fid a explicit formula for the Bayes estimator for some specific loss fuctios. Theorem 7 If L(θ, θ) = (θ θ) the the Bayes estimator is θ(x ) = θπ(θ x )dθ = E(θ X = x ). (5) If L(θ, θ) = θ θ the the Bayes estimator is the media of the posterior π(θ x ). If L(θ, θ) is zero oe loss, the the Bayes estimator is the mode of the posterior π(θ x ). Proof.We will prove the theorem for squared error loss. The Bayes estimator θ(x ) miimizes r( θ x ) = (θ θ(x )) π(θ x )dθ. Takig the derivative of r( θ x ) with respect to θ(x ) ad settig it equal to zero yields the equatio (θ θ(x ))π(θ x )dθ = 0. Solvig for θ(x ) we get 5. Example 8 Let X,..., X N(µ, σ ) where σ is kow. Suppose we use a N(a, b ) prior for µ. The Bayes estimator with respect to squared error loss is the posterior mea, which is θ(x,..., X ) = b b + σ X + σ a. (6) b + σ 6

72 .4 Miimax Estimators Fidig miimax estimators is complicated ad we caot attempt a complete coverage of that theory here but we will metio a few key results. The mai message to take away from this sectio is: Bayes estimators with a costat risk fuctio are miimax. Theorem 9 Let θ be the Bayes estimator for some prior π. If the θ is miimax ad π is called a least favorable prior. R(θ, θ) B π ( θ) for all θ (7) Proof.Suppose that θ is ot miimax. The there is aother estimator θ 0 such that sup θ R(θ, θ 0 ) < sup θ R(θ, θ). Sice the average of a fuctio is always less tha or equal to its maximum, we have that B π ( θ 0 ) sup θ R(θ, θ 0 ). Hece, which is a cotradictio. B π ( θ 0 ) sup θ R(θ, θ 0 ) < sup R(θ, θ) B π ( θ) (8) θ Theorem 0 Suppose that θ is the Bayes estimator with respect to some prior π. If the risk is costat the θ is miimax. Proof.The Bayes risk is B π ( θ) = R(θ, θ)π(θ)dθ = c ad hece R(θ, θ) B π ( θ) for all θ. Now apply the previous theorem. Example Cosider the Beroulli model with squared error loss. I example 4 we showed that the estimator p(x i= ) = X i + /4 + has a costat risk fuctio. This estimator is the posterior mea, ad hece the Bayes estimator, for the prior Beta(α, β) with α = β = /4. Hece, by the previous theorem, this estimator is miimax. Example Cosider agai the Beroulli but with loss fuctio L(p, p) = Let p(x ) = p = i= X i/. The risk is ( ) ( p p) R(p, p) = E = p( p) (p p) p( p). p( p) ( ) p( p) = which, as a fuctio of p, is costat. It ca be show that, for this loss fuctio, p(x ) is the Bayes estimator uder the prior π(p) =. Hece, p is miimax. 7

73 What is the miimax estimator for a Normal model? To aswer this questio i geerality we first eed a defiitio. A fuctio l is bowl-shaped if the sets {x : l(x) c} are covex ad symmetric about the origi. A loss fuctio L is bowl-shaped if L(θ, θ) = l(θ θ) for some bowl-shaped fuctio l. Theorem 3 Suppose that the radom vector X has a Normal distributio with mea vector θ ad covariace matrix Σ. If the loss fuctio is bowl-shaped the X is the uique (up to sets of measure zero) miimax estimator of θ. If the parameter space is restricted, the the theorem above does ot apply as the ext example shows. Example 4 Suppose that X N(θ, ) ad that θ is kow to lie i the iterval [ m, m] where 0 < m <. The uique, miimax estimator uder squared error loss is ( ) e θ(x) mx e mx = m. e mx + e mx This is the Bayes estimator with respect to the prior that puts mass / at m ad mass / at m. The risk is ot costat but it does satisfy R(θ, θ) B π ( θ) for all θ; see Figure 3. Hece, Theorem 9 implies that θ is miimax. This might seem like a toy example but it is ot. The essece of moder miimax theory is that the miimax risk depeds crucially o how the space is restricted. The bouded iterval case is the tip of the iceberg. Proof That X is Miimax Uder Squared Error Loss. Now we will explai why X is justified by miimax theory. Let X N p (θ, I) be multivariate Normal with mea vector θ = (θ,..., θ p ). We will prove that θ = X is miimax whe L(θ, θ) = θ θ. Assig the prior π = N(0, c I). The the posterior is Θ X = x N ( c x + c, c + c I ). (9) The Bayes risk for a estimator θ is R π ( θ) = R(θ, θ)π(θ)dθ which is miimized by the posterior mea θ = c X/( + c ). Direct computatio shows that R π ( θ) = pc /( + c ). Hece, if θ is ay estimator, the pc = R + c π ( θ) R π (θ ) (0) = R(θ, θ)dπ(θ) sup R(θ, θ). () θ We have ow proved that R(Θ) pc /( + c ) for every c > 0 ad hece But the risk of θ = X is p. So, θ = X is miimax. R(Θ) p. () 8

74 θ Figure 3: Risk fuctio for costraied Normal with m=.5. The two short dashed lies show the least favorable prior which puts its mass at two poits..5 Maximum Likelihood For parametric models that satisfy weak regularity coditios, the maximum likelihood estimator is approximately miimax. Cosider squared error loss which is squared bias plus variace. I parametric models with large samples, it ca be show that the variace term domiates the bias so the risk of the mle θ roughly equals the variace: R(θ, θ) = Var θ ( θ) + bias Var θ ( θ). (3) The variace of the mle is approximately Var( θ) where I(θ) is the Fisher iformatio. I(θ) Hece, R(θ, θ) I(θ). (4) For ay other estimator θ, it ca be show that for large, R(θ, θ ) R(θ, θ). So the maximum likelihood estimator is approximately miimax. This assumes that the dimesio of θ is fixed ad is icreasig..6 The Hodges Example Here is a iterestig example about the subtleties of optimal estimators. Let X,..., X N(θ, ). The mle is θ = X = i= X i. But cosider the followig estimator due to Typically, the squared bias is order O( ) while the variace is of order O( ). 9

Hodges. Let

J_n = [ −1/n^{1/4}, 1/n^{1/4} ]    (5)

and define

θ̂_n = X̄_n if X̄_n ∉ J_n;   θ̂_n = 0 if X̄_n ∈ J_n.    (6)

Suppose that θ ≠ 0. Choose a small ε so that 0 is not contained in I = (θ − ε, θ + ε). By the law of large numbers, P(X̄_n ∈ I) → 1. In the meantime, J_n is shrinking. See Figure 4. Thus, for n large, θ̂_n = X̄_n with high probability. We conclude that, for any θ ≠ 0, θ̂_n behaves like X̄_n.

When θ = 0,

P(X̄_n ∈ J_n) = P(|X̄_n| ≤ 1/n^{1/4})    (7)
            = P(√n |X̄_n| ≤ n^{1/4}) = P(|N(0, 1)| ≤ n^{1/4}) → 1.    (8)

Thus, for n large, θ̂_n = 0 = θ with high probability. This is a much better estimator of θ than X̄_n.

We conclude that Hodges' estimator is like X̄_n when θ ≠ 0 and is better than X̄_n when θ = 0. So X̄_n is not the best estimator; θ̂_n is better. Or is it? Figure 5 shows the mean squared error, or risk, R_n(θ) = E(θ̂_n − θ)^2 as a function of θ (for n = 1000). The horizontal line is the risk of X̄_n. The risk of θ̂_n is good at θ = 0. At any fixed θ, it will eventually behave like the risk of X̄_n. But the maximum risk of θ̂_n is terrible. We pay for the improvement at θ = 0 by an increase in risk elsewhere.

There are two lessons here. First, we need to pay attention to the maximum risk. Second, it is better to look at uniform asymptotics lim_n sup_θ R_n(θ) rather than pointwise asymptotics sup_θ lim_n R_n(θ).
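A simulation sketch of the Hodges estimator's risk at a few values of θ, in the spirit of Figure 5; n, the θ values, the seed, and the replication count are arbitrary, and X̄_n is drawn directly from its exact N(θ, 1/n) distribution:

import numpy as np

rng = np.random.default_rng(4)
n, reps = 1000, 20_000

def hodges_risk(theta):
    xbar = rng.normal(theta, 1 / np.sqrt(n), size=reps)      # Xbar_n ~ N(theta, 1/n)
    est = np.where(np.abs(xbar) <= n ** (-0.25), 0.0, xbar)  # the Hodges estimator
    return np.mean((est - theta) ** 2)

for theta in [0.0, 0.02, 0.1, 1.0]:
    print(theta, hodges_risk(theta), 1 / n)                  # compare with the risk of the sample mean

The risk is far below 1/n at θ = 0 but much larger than 1/n near the threshold (θ = 0.1 here), which is the point of the example.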

Figure 4: Top: when θ ≠ 0, X̄_n will eventually be in I and will miss the interval J_n. Bottom: when θ = 0, X̄_n is about 1/√n away from 0 and so is eventually in J_n.

Figure 5: The risk of the Hodges estimator for n = 1000 as a function of θ. The horizontal line is the risk of the sample mean.

77 36-705/0-705 Summary of Miimax Theory Larry Wasserma October 5, 0. Loss L(θ, θ) where θ = θ(x,..., X ). Remember that θ is a fuctio of X,..., X.. Risk R(θ, θ) = E θ [L(θ, θ)] = L(θ, θ(x,..., x ))p(x,..., x ; θ)dx dx. 3. If L(θ, θ) = ( θ θ) the R(θ, θ) = E θ ( θ θ) = MSE = bias + variace. 4. Maximum risk: we defie how good a estimator is by its maximum risk 5. Miimax risk: 6. A estimator θ is miimax if R = if bθ sup R(θ, θ). θ sup R(θ, θ). θ Θ sup R(θ, θ) = R. θ Θ 7. The Bayes risk for a estimator θ, with respect to a prior π is B π ( θ) = R(θ, θ)π(θ)dθ. 8. A estimator θ π is the Bayes estimator with respect to a prior π if B π ( θ π ) = if bθ B π ( θ). I other words, θ π miimizes B π ( θ) over all estimators. 9. The Bayes risk ca we re-writte as B π ( θ) = r( θ) m(x,..., x )dx dx where m(x,..., x ) = p(x,..., x ; θ)π(θ)dθ ad r( θ) = L(θ, θ)p(θ x,..., x )dθ. Hece, to miimize B π ( θ π ) is suffices to miimize r( θ). 0. Key Theorem: Suppose that (i) θ is the Bayes estimator with respect to some prior π ad (ii) R(θ, θ) is costat. The θ is miimax.

78 . Bouds. Sometimes it is hard to fid R so it is useful to fid a lower boud ad a upper boud o the miimax risk. The followig result is helpful: Theorem: Let θ π by a Bayes estimator with respect to some prior π. Let θ be ay estimator. The: B π ( θ π ) R sup R(θ, θ ). () θ Proof of the Lower Boud. Let θ π be the Bayes estimator for some prior π. Let θ be ay other estimator. The, B π ( θ π ) B π ( θ) = R(θ, θ)π(θ)dθ sup R(θ, θ). θ Take the if over all θ ad coclude that B π ( θ π ) if bθ sup R(θ, θ) = R. θ Hece, R B π ( θ π ). Proof of the Upper boud. Choose ay estimator θ. The R = if bθ sup θ R(θ, θ) sup R(θ, θ ). θ. How to prove that X is miimax for the Normal model. Let X,..., X N(θ, σ ) where σ is kow. Let L(θ, θ) = ( θ θ). (a) First we show that R = σ /. We do this by gettig a lower boud ad a upper boud o R. (b) Lower Boud. Let π = N(0, c ). The posterior p(θ X,..., X ) is N(a, b ) where a = X/σ c + σ ad b = +. c σ The Bayes estimator miimizes r( θ) = ( θ θ) p(θ x,..., x )dθ. This is miimized by θ π = θp(θ x,..., x )dθ = E(θ X,..., X ). But E(θ X,..., X ) = a. So the Bayes esimator is θ π = X/σ +. c σ Next we compute R( θ π, θ). This meas we eed to compute the MSE of θ π. The bias θ π is θσ /(σ + c ). The variace of θ π is c 4 σ /(σ + c ). So R(θ, θ π ) = bias + variace = θ σ 4 (σ + c ) + c4 σ (σ + c ) = θ σ 4 + c 4 σ (σ + c ).

79 Let us ow compute the Bayes risk of this estimator. It is B π ( θ π ) = R(θ, θ π )π(θ)dθ = σ4 θ π(θ)dθ + c 4 σ (σ + c ) By (), this proves that = σ4 c + c 4 σ = σ (σ + c ) σ +. c R σ σ +. c (c) Upper Boud. Choose θ = X. The R(θ, θ) = σ /. By (), R sup R(θ, θ) = σ θ. (d) Combiig the lower ad upper boud we see that σ c σ + R σ. This boud is true for all c > 0. If take the limit as c the we get that R = σ. We have succeeded i fidig the miimax risk R. (e) The last step is to fid a miimax estimator. We have to fid a estimator whose maximum risk is R. But we already saw that X has maximum risk equal to R. Hece X is miimax. 3

80 Lecture Notes 9 Asymptotic (Large Sample) Theory Review of o, O, etc.. a = o() mea a 0 as.. A radom sequece A is o p () if A P 0 as. 3. A radom sequece A is o p (b ) if A /b P 0 as. 4. p o p () = o p ( p ), so o p (/ ) = o p () P o p () o p () = o p ().. a = O() if a is bouded by a costat as.. A radom sequece Y is O p () if for every ɛ > 0 there exists a costat M such that lim P ( Y > M) < ɛ as. 3. A radom sequece Y is O p (b ) if Y /b is O p (). 4. If Y Y, the Y is O p (). 5. If (Y c) Y the Y = O P (/ ). (potetial test qustio: prove this) 6. O p () O p () = O p (). 7. o p () O p () = o p (). Distaces Betwee Probability Distributios Let P ad Q be distributios with desities p ad q. We will use the followig distaces betwee P ad Q.. Total variatio distace V (P, Q) = sup A P (A) Q(A).

81 . L distace d (P, Q) = p q. 3. Helliger distace h(p, Q) = ( p q). 4. Kullback-Leibler distace K(P, Q) = p log(p/q). 5. L distace d (P, Q) = (p q). Here are some properties of these distaces:. V (P, Q) = d (P, Q). (prove this!). h (P, Q) = ( pq). 3. V (P, Q) h(p, Q) V (P, Q). 4. h (P, Q) K(P, Q). 5. V (P, Q) h(p, Q) K(P, Q). 6. V (P, Q) K(P, Q)/. 3 Cosistecy θ = T (X ) is cosistet for θ if P θ θ as. I other words, θ θ = o p (). Here are two commo ways to prove that θ cosistet. Method : Show that, for all ε > 0, P( θ θ ε) 0.

Method 2. Prove convergence in quadratic mean:

MSE(θ̂_n) = Bias^2(θ̂_n) + Var(θ̂_n) → 0.

If bias → 0 and var → 0 then θ̂_n →qm θ, which implies that θ̂_n →p θ.

Example 1 Bernoulli(p). The mle p̂ has bias 0 and variance p(1 − p)/n → 0. So p̂ →P p and is consistent. Now let ψ = log(p/(1 − p)). Then ψ̂ = log(p̂/(1 − p̂)). Now ψ̂ = g(p̂) where g(p) = log(p/(1 − p)). By the continuous mapping theorem, ψ̂ →P ψ, so this is consistent. Now consider

p̃ = (n X̄ + 1)/(n + 2).

Then

bias = E(p̃) − p = (1 − 2p)/(n + 2) → 0   and   var = n p(1 − p)/(n + 2)^2 → 0.

So this is consistent.

Example 2 X_1, ..., X_n ~ Uniform(0, θ). Let θ̂ = X_{(n)}. By direct proof (we did it earlier) we have θ̂ →P θ.

Method of moments estimators are typically consistent. Consider one parameter. Recall that µ(θ̂) = m_1 where m_1 = (1/n) ∑_{i=1}^n X_i. Assume that µ^{−1} exists and is continuous. So θ̂ = µ^{−1}(m_1). By the WLLN, m_1 →P µ(θ). So, by the continuous mapping theorem,

θ̂ = µ^{−1}(m_1) →P µ^{−1}(µ(θ)) = θ.
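A simulation sketch of Example 1: the mle p̂ and the logit ψ̂ = log(p̂/(1 − p̂)) both concentrate around the truth as n grows (the tolerances 0.05 and 0.2, p = 0.4, and the seed are illustrative):

import numpy as np

rng = np.random.default_rng(5)
p_true = 0.4
psi_true = np.log(p_true / (1 - p_true))

for n in [100, 1000, 10_000, 100_000]:
    p_hat = rng.binomial(n, p_true, size=2000) / n
    psi_hat = np.log(p_hat / (1 - p_hat))
    print(n, np.mean(np.abs(p_hat - p_true) > 0.05),
          np.mean(np.abs(psi_hat - psi_true) > 0.2))
# both exceedance probabilities go to 0 as n increases, as consistency requires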

83 Suppose that the model cosists of fiitely may distict desities {p 0, p,..., p N }. The likelihood fuctio is L(p j ) = p j (X i ). i= The mle p is the desity p j that maximizes L(p j ). Without loss of geerality, assume that the true desity is p 0. Theorem 3 P( p p 0 ) 0 as. Proof. Let us begi by first provig a iequality. Let ɛ j = h(p 0, p j ). The, for j 0, ( ) ( ) ( L(pj ) P L(p 0 ) > ) e ɛ j / p j (X i ) = P p i= 0 (X i ) > e ɛ j / p j (X i ) = P p i= 0 (X i ) > e ɛ4 j / ( ) ( ) e ɛ j /4 p j (X i ) E = e ɛ j /4 p j (X i ) E p i= 0 (X i ) p i= 0 (X i ) ( ) ( ) = e ɛ j /4 pj p 0 = e ɛ j /4 h (p 0, p j ) = e ɛ j ( /4 ɛ j { ( )} = e ɛ j /4 exp log ɛ j e ɛ j /4 e ɛ j / = e ɛ j /. We used the fact that h (p 0, p j ) = p0 p j ad also that log( x) x for x > 0. Let ɛ = mi{ɛ,..., ɛ N }. The ( L(pj ) P( p p 0 ) P L(p 0 ) > / e ɛ j N ( ) L(pj ) P L(p 0 ) > e ɛ j / j= ) for some j N e ɛ j / Ne ɛ / 0. j= ) We ca prove a similar result usig Kullback-Leibler distace as follows. Let X, X,... be iid F θ. Let θ 0 be the true value of θ ad let θ be some other value. We will show that 4

84 L(θ 0 )/L(θ) > with probability tedig to. We assume that the model is idetifiable; this meas that θ θ implies that K(θ, θ ) > 0 where K is the Kullback-Leibler distace. Theorem 4 Suppose the model is idetifiable. Let θ 0 be the true value of the parameter. For ay θ θ 0 as. ( ) L(θ0 ) P L(θ) > Proof. We have (l(θ 0) l(θ)) = p = = log p(x i ; θ 0 ) i= log p(x i ; θ) i= E(log p(x; θ 0 )) E(log p(x; θ)) (log p(x; θ 0 ))p(x; θ 0 )dx (log p(x; θ))p(x; θ 0 )dx ( log p(x; θ ) 0) p(x; θ 0 )dx p(x; θ) = K(θ 0, θ) > 0. So ( ) L(θ0 ) P L(θ) > = P (l(θ 0 ) l(θ) > 0) ( ) = P (l(θ 0) l(θ)) > 0. This is ot quite eough to show that θ θ 0. Example 5 Icosistecy of a mle. I all examples so far, but the umber of parameters is fixed. What if the umber of parameters also goes to? Let Y, Y N(µ, σ ) Y, Y N(µ, σ ).. Y, Y N(µ, σ ). 5

85 Some calculatios show that σ (Y ij Y i ) =. i= It is easy to show (good test questio) that j= σ p σ. Note that the modified estimator σ is cosistet. The reaso why cosistecy fails is because the dimesio of the parameter space is icreasig with. Theorem 6 Uder regularity coditios o the model {p(x; θ) : θ Θ}, the mle is cosistet. 5 Score ad Fisher Iformatio The score ad Fisher iformatio are the key quatities i may aspects of statistical iferece. (See Sectio 7.3. of CB.) Suppose for ow that θ R. L(θ) = p(x ; θ) l(θ) = log L(θ) S(θ) = θ l(θ) score fuctio. Recall that the value θ that maximizes L(θ) is the maximum likelihood estimator (mle). Equivaletly, θ maximizes l(θ). Note that θ = T (X,..., X ) is a fuctio of the data. Ofte, we get θ by differetiatio. I that case θ solves S( θ) = 0. We ll discuss the mle i detail later. 6

86 Some Notatio: Recall that E θ (g(x)) g(x)p(x; θ)dx. Theorem 7 Uder regularity coditios, E θ [S(θ)] = 0. I other words, ( ) log p(x,..., x ; θ) p(x,..., x ; θ)dx... dx = 0. θ That is, if the expected value is take at the same θ as we evaluate S, the the expectatio is 0. This does ot hold whe the θ s mismatch: E θ0 [S(θ )] 0. Proof. log p(x ; θ) E θ [S(θ)] = p(x ; θ) dx dx θ θ = p(x ; θ) p(x ; θ) p(x ; θ) dx dx = p(x ; θ) dx dx θ } {{ } = 0. Example 8 Let X,..., X N(θ, ). The S(θ) = (X i θ). i= Warig: If the support of f depeds o θ, the ad θ caot be switched. The ext quatity of iterest is the Fisher Iformatio or Expected Iformatio. The iformatio is used to calculate the variace of quatities that arise i iferece problems 7

such as the mle θ̂. It is called information because it tells how much information is in the likelihood about θ. The definition is:

I(θ) = E_θ[S(θ)^2]
     = E_θ[S(θ)^2] − (E_θ[S(θ)])^2 = Var_θ(S(θ))   since E_θ[S(θ)] = 0
     = −E_θ[ (∂^2/∂θ^2) l(θ) ]   (easiest way to calculate).

We will prove the final equality under regularity conditions shortly.

I(θ) grows linearly in n, so for an iid sample, a more careful notation would be I_n(θ):

I_n(θ) = −E_θ[ (∂^2/∂θ^2) l(θ) ] = −E_θ[ (∂^2/∂θ^2) ∑_{i=1}^n log p(X_i; θ) ]
       = −n E_θ[ (∂^2/∂θ^2) log p(X_1; θ) ] = n I_1(θ).

Note that the Fisher information is a function of θ in two places: the derivative is w.r.t. θ and the information is evaluated at a particular value of θ; the expectation is w.r.t. θ also. The notation only allows for a single value of θ because the two quantities should match.

A related quantity of interest is the observed information, defined as

Î_n(θ) = −(∂^2/∂θ^2) l(θ) = −(∂^2/∂θ^2) ∑_{i=1}^n log p(X_i; θ).

By the LLN, Î_n(θ)/n →P I_1(θ). So the observed information can be used as a good approximation to the Fisher information.

Let us prove the identity E_θ[S(θ)^2] = −E_θ[ (∂^2/∂θ^2) l(θ) ]. For simplicity take n = 1. First note that

∫ p = 1 ⟹ ∫ p' = 0 ⟹ ∫ p'' = 0,   and   ∫ p' = ∫ (p'/p) p = 0 ⟹ E(p'/p) = 0.

Let l = log p and S = l' = p'/p. Then l'' = (p''/p) − (p'/p)^2 and

V(S) = E(S^2) − (E(S))^2 = E(S^2) = E( (p'/p)^2 )
     = E( (p'/p)^2 ) − E( p''/p )                    (since E(p''/p) = ∫ p'' = 0)
     = −E( (p''/p) − (p'/p)^2 ) = −E(l'').

Why is I(θ) called Information? Later we will see that Var(θ̂) ≈ 1/I_n(θ).

The Vector Case. Let θ = (θ_1, ..., θ_K). L(θ) and l(θ) are defined as before.

S(θ) = [ ∂l(θ)/∂θ_i ]_{i=1,...,K}   a vector of dimension K.

Information: I(θ) = Var[S(θ)] is the variance-covariance matrix of S(θ), I(θ) = [I_ij]_{i,j=1,...,K}, where

I_ij = −E_θ[ ∂^2 l(θ) / ∂θ_i ∂θ_j ].

I^{−1}(θ) gives the asymptotic variance of θ̂. (This is the inverse of the matrix, evaluated at the proper component of the inverse.)
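A Monte Carlo check of the scalar identities E_θ[S(θ)] = 0 and Var_θ(S(θ)) = −E_θ[l''(θ)] = I(θ) for a single Bernoulli(p) observation, where S(p) = X/p − (1 − X)/(1 − p) and I(p) = 1/(p(1 − p)); the value of p and the sample size are illustrative:

import numpy as np

rng = np.random.default_rng(6)
p = 0.3
X = rng.binomial(1, p, size=1_000_000)

S = X / p - (1 - X) / (1 - p)                  # score for one Bernoulli observation
l2 = -X / p**2 - (1 - X) / (1 - p)**2          # second derivative of the log-likelihood

print(S.mean())                                 # approximately 0
print(S.var(), -l2.mean(), 1 / (p * (1 - p)))   # all approximately 4.76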

89 Example 9 X,, X N(µ, γ) L(µ, γ) = i= πγ exp l(µ, γ) = K log γ γ Σ(x i µ) S(µ, γ) = Σ(x γ i µ) + Σ(x γ γ i µ) I(µ, γ) = E = γ γ 0 0 γ Σ(x γ i µ) { } { } γ (x i µ) γ exp γ Σ(x i µ) Σ(x γ i µ) Σ(x γ γ 3 i µ) You ca check that E θ (S) = (0, 0) T. 6 Efficiecy ad Asymptotic Normality If ( θ θ) N(0, v ) the we call v the asymptotic variace of θ. This is ot the same as the limit of the variace which is lim Var( θ ). Cosider X. I this case, the asymptotic variace is σ. We also have that lim Var(X ) = σ. I this case, they are the same. I geeral, the latter may be larger (or eve ifiite). Example 0 (Example 0..0) Suppose we observe Y N(0, ) with probability p ad Y N(0, σ) with probability p. We ca write this as a hierachical model: W Beroulli(p ) Y W N(0, W + ( W )σ). Now, Var(Y ) = VarE(Y W ) + EVar(Y W ) = Var(0) + E(W + ( W )σ) = p + ( p )σ. 0

90 Suppose that p, σ ad that ( p )σ. The Var(Y ). The P(Y a) = p P(Z a) + ( p )P(Z a/σ ) P(Z a) ad so Y N(0, ). So the asymptotic variace is. where Suppose we wat to estimate τ(θ). Let v(θ) = τ (θ) I(θ) ( ) ( ) I(θ) = Var log p(x; θ) = E θ log p(x; θ). θ θ We call v(θ) the Cramer-Rao lower boud. Geerally, ay well-behaved estimator will have a limitig variace bigger tha or equal to v(θ). We say that W is efficiet if (W τ(θ)) N(0, v(θ)). Theorem Let X, X,..., be iid. Assume that the model satisfies the regularity coditios i Let θ be the mle. The (τ( θ) τ(θ)) N(0, v(θ)). So τ( θ) is cosistet ad efficiet. We will ow prove the asymptotic ormality of the mle. Theorem Hece, ( θ θ) N ( ) 0,. I(θ) θ = θ + O P ( ). Proof. By Taylor s theorem 0 = l ( θ) = l (θ) + ( θ θ)l (θ) +.

91 Hece Now ( θ θ) l (θ) l (θ) = A B. A = l (θ) = S(θ, X i ) = (S 0) i= where S(θ, X i ) is the score fuctio based o X i. Recall that E(S(θ, X i )) = 0 ad Var(S(θ, X i )) = I(θ). By the cetral limit theorem, A N(0, I(θ)) = I(θ)Z where Z N(0, ). By the WLLN, B P E(l ) = I(θ). By Slutsky s theorem A B I(θ)Z I(θ) = Z ( ) = N 0,. I(θ) I(θ) So Theorem follows by the delta method: ( ) ( θ θ) N 0,. I(θ) ( θ θ) N(0, /I(θ)) implies that (τ( θ ) τ(θ)) N(0, (τ (θ)) /I(θ)). The stadard error of θ is The estimated stadard error is se = I(θ) = I (θ). ŝe = I ( θ). The stadard error of τ = τ( θ) is se = τ (θ) I(θ) = τ (θ) I (θ).

The estimated standard error is

ŝe = |τ'(θ̂)| / √(I_n(θ̂)).

Example 13 X_1, ..., X_n iid Exponential(θ). Let t = ∑_i x_i. So:

p(x; θ) = θ e^{−θx},   L(θ) = e^{−θt + n log θ},   l(θ) = −θt + n log θ,
S(θ) = −t + n/θ,   θ̂ = n/t = 1/X̄,   l''(θ) = −n/θ^2,   I_n(θ) = E[−l''(θ)] = n/θ^2,

θ̂ ≈ N(θ, θ^2/n).

Example 14 X_1, ..., X_n ~ Bernoulli(p). The mle is p̂ = X̄. The Fisher information for n = 1 is

I(p) = 1/(p(1 − p)).

So

√n (p̂ − p) ⇝ N(0, p(1 − p)).

Informally, p̂ ≈ N(p, p(1 − p)/n). The asymptotic variance is p(1 − p)/n. This can be estimated by p̂(1 − p̂)/n. That is, the estimated standard error of the mle is

ŝe = √( p̂(1 − p̂)/n ).

Now suppose we want to estimate τ = p/(1 − p). The mle is τ̂ = p̂/(1 − p̂). Now τ'(p) = 1/(1 − p)^2. The estimated standard error is

ŝe(τ̂) = |τ'(p̂)| ŝe(p̂) = [1/(1 − p̂)^2] √( p̂(1 − p̂)/n ) = √( p̂ / (n (1 − p̂)^3) ).

7 Relative Efficiency

If

√n (W_n − τ(θ)) ⇝ N(0, σ_W^2),   √n (V_n − τ(θ)) ⇝ N(0, σ_V^2)
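A simulation check of Example 14's standard errors: the Monte Carlo standard deviations of p̂ and of τ̂ = p̂/(1 − p̂) against the formulas above (n, p, and the number of replications are illustrative):

import numpy as np

rng = np.random.default_rng(7)
n, p = 400, 0.3
p_hat = rng.binomial(n, p, size=50_000) / n
tau_hat = p_hat / (1 - p_hat)

print(p_hat.std(), np.sqrt(p * (1 - p) / n))            # se of the mle
print(tau_hat.std(), np.sqrt(p / (n * (1 - p) ** 3)))   # delta-method se of tau-hat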

93 the the asymptotic relative efficiecy (ARE) is ARE(V, W ) = σ W σ V Example 5 (0..7). Let X,..., X Poisso(λ). The mle of λ is X. Let. τ = P(X i = 0). So τ = e λ. Defie Y i = I(X i = 0). This suggests the estimator W = Y i. i= Aother estimator is the mle The delta method gives We have V = e bλ. Var(V ) λe λ. (W τ) N(0, e λ ( e λ )) (V τ) N(0, λe λ ). So ARE(W, V ) = λ e λ. Sice the mle is efficiet, we kow that, i geeral, ARE(W, mle). 8 Robustess The mle is efficiet oly if the model is right. The mle ca be bad if the model is wrog. That is why we should cosider usig oparametric methods. Oe ca also replace the mle with estimators that are more robust. 4

94 Suppose we assume that X,..., X N(θ, σ ). The mle is θ = X. Suppose, however that we have a perturbed model X i is N(θ, σ ) with probability δ ad X i is Cauchy with probability δ. The, Var(X ) =. Cosider the media M. We will show that ARE(media, mle) =.64. But, uder the perturbed model the media still performs well while the mle is terrible. I other words, we ca trade efficiecy for robustess. Let us ow fid the limitig distributio of M. Let Y i = I(X i µ + a/ ). The Y i Beroulli(p ) where p = P (µ + a/ ) = P (µ) + Also, i Y i has mea p ad stadard deviatio Note that, The, M µ + a a p(µ) + o( / ) = + a p(µ) + o( / ). σ = p ( p ). if ad oly if i Y i +. P( ( (M µ) a) = P M µ + a ) ( ) = P Y i + i ( i = P Y + ) i p p. σ σ Now, + p ap(µ) σ ad hece P( ( (M µ) a) P(Z ap(µ)) = P Z ) p(µ) a so that (M µ) N For a stadard Normal, (p(0)) = ( ) 0,. (p(µ)) ( ) Z = P p(µ) a

95 Lecture Notes 0 Hypothesis Testig Itroductio (See Chapter 8 ad Chapter 0.3.) Null hypothesis: H 0 : θ Θ 0 Alterative hypothesis: H : θ Θ where Θ 0 Θ =. Example X,..., X Beroulli(p). H 0 : p = H : p. The questio is ot whether H 0 is true or false. The questio is whether there is sufficiet evidece to reject H 0, much like a court case. Our possible actios are: reject H 0 or retai (do t reject) H 0. Decisio H 0 true H true Retai H 0 Reject H 0 Type I error (false positive) Type II error (false egative) Warig: Hypothesis testig should oly be used whe it is appropriate. Ofte times, people use hypothesis tetsig whe it would be much more appropriate to use cofidece itervals (which is the ext topic).

96 Costructig Tests. Choose a test statistic W = W (X,..., X ).. Choose a rejectio regio R. 3. If W R we reject H 0 otherwise we retai H 0. Example X,..., X Beroulli(p). H 0 : p = H : p. Let W = i= X i. Let R = {x : w(x ) / > δ}. So we reject H 0 if W / > δ. We eed to choose W ad R so that the test has good statistical properties. We will cosider the followig tests:. Neyma-Pearso Test. Wald test 3. Likelihood Ratio Test (LRT) 4. the permutatio test 5. the score test (optioal) Before we discuss these methods, we first eed to talk about how we evaluate tests. 3 Evaluatig Tests Suppose we reject H 0 whe X = (X,..., X ) R. Defie the power fuctio by β(θ) = P θ (X R). We wat β(θ) to be small whe θ Θ 0 ad we wat β(θ) to be large whe θ Θ. The geeral strategy is:

97 . Fix α [0, ].. Now try to maximize β(θ) for θ Θ subject to β(θ) α for θ Θ 0. We eed the followig defiitios. A test is size α if A test is level α if sup β(θ) = α. θ Θ 0 sup β(θ) α. θ Θ 0 A size α test ad a level α test are almost the same thig. The distictio is made bcause sometimes we wat a size α test ad we caot costruct a test with exact size α but we ca costruct oe with a smaller error rate. Example 3 X,..., X N(θ, σ ) with σ kow. Suppose H 0 : θ = θ 0, H : θ > θ 0. This is called a oe-sided alterative. Suppose we reject H 0 if W > c where W = X θ 0 σ/. The ( ) X θ 0 β(θ) = P θ σ/ > c ( X θ = P θ σ/ > c + θ ) 0 θ σ/ ( = P Z > c + θ ) 0 θ σ/ ( = Φ c + θ ) 0 θ σ/ where Φ is the cdf of a stadard Normal. Now sup β(θ) = β(θ 0 ) = Φ(c). θ Θ 0 3

98 To get a size α test, set Φ(c) = α so that c = z α where z α = Φ ( α). Our test is: reject H 0 whe W = X θ 0 σ/ > z α. Example 4 X,..., X N(θ, σ ) with σ kow. Suppose H 0 : θ θ 0, H : θ θ 0. This is called a two-sided alterative. We will reject H 0 if W > c where W is defied as before. Now β(θ) = P θ (W < c) + P θ (W > c) ( ) ( ) X θ 0 = P θ σ/ < c X θ 0 + P θ σ/ > c ( = P Z < c + θ ) ( 0 θ σ/ + P Z > c + θ ) 0 θ σ/ ( = Φ c + θ ) ( 0 θ σ/ + Φ c + θ ) 0 θ σ/ ( = Φ c + θ ) ( 0 θ σ/ + Φ c θ ) 0 θ σ/ sice Φ( x) = Φ(x). The size is β(θ 0 ) = Φ( c). To get a size α test we set Φ( c) = α so that c = Φ (α/) = Φ ( α/) = z α/. The test is: reject H 0 whe W = X θ 0 σ/ > z α/. 4 The Neyma-Pearso Test Let C α deote all level α tests. A test i C α with power fuctio β is uiformly most powerful (UMP) if the followig holds: if β is the power fuctio of ay other test i C α the β(θ) β (θ) for all θ Θ. 4

99 Cosider testig H 0 : θ = θ 0 versus H : θ = θ. (Simple ull ad simple alterative.) Theorem 5 Suppose we set { R = x = (x,..., x ) : where k is chose so that I other words, reject H 0 if This test is a UMP level α test. } { f(x,..., X ; θ ) f(x,..., X ; θ 0 ) > k = x : P θ0 (X R) = α. L(θ ) L(θ 0 ) > k. } L(θ ) L(θ 0 ) > k This is theorem 8.3. i the book. The proof is short; you should read the proof. Notes:. Igore the material o uio-itersectio tests ad mootoote likelihood ratios (MLR).. I geeral it is hard to fid UMP tests. Sometimes they do t eve exist. Still, we ca fid tests with good properties. 5 The Wald Test Let W = θ θ 0. se Uder the uusal coditios we have that uder H 0, W N(0, ). Hece, a asymptotic level α test is to reject whe W > z α/. For example, with Beroulli data, to test H 0 : p = p 0, You ca also use W = p p 0 bp( bp) W = p p 0. p 0 ( p 0 ). 5

100 I other words, to compute the stadard error, you ca replace θ with a estimate θ or by the ull value θ 0. 6 The Likelihood Ratio Test (LRT) This test is simple: reject H 0 if λ(x ) c where where θ 0 maximizes L(θ) subject to θ Θ 0. λ(x ) = sup θ Θ 0 L(θ) sup θ Θ L(θ) = L( θ 0 ) L( θ) Example 6 X,..., X N(θ, ). Suppose After some algebra (see page 376), So H 0 : θ = θ 0, H : θ θ 0. λ = exp { } (X θ 0 ). R = {x : λ c} = {x : X θ 0 c } where c = log c/. Choosig c to make this level α gives: reject if W > z α/ where W = (X θ 0 ) which is the test we costructed before. Example 7 X,..., X N(θ, σ ). Suppose The H 0 : θ θ 0, H : θ θ 0. λ(x ) = L(θ 0, σ 0 ) L( θ, σ) where σ 0 maximizes the likelihood subject to θ = θ 0. I the homework, you will prove that λ(x ) < c correspods to rejectig whe T > k for some costat k where T = X θ 0 S/. 6

101 Uder H 0, T has a t-distributio with degrees of freedom. So the fial test is: reject H 0 if T > t,α/. This is called Studet s t-test. It was iveted by William Gosset workig at Guiess Breweries ad writig uder the pseudoym Srudet. Theorem 8 Cosider testig H 0 : θ = θ 0 versus H : θ θ 0 where θ R. Uder H 0, log λ(x ) χ. Hece, if we let W = log λ(x ) the P θ0 (W > χ,α) α as. ad so Proof. Usig a Taylor expasio: l(θ) l( θ) + l ( θ)(θ θ) + l (θ θ) ( θ) = l( θ) + l (θ θ) ( θ) P log λ(x ) = l( θ) l(θ 0 ) l( θ) l( θ) l ( θ)(θ θ) = l ( θ)(θ θ) = l ( θ) I (θ 0 ) I (θ 0 )( ( θ θ 0 )) = A B. Now A by the WLLN ad B N(0, ). The result follows by Slutsky s theorem. Example 9 X,..., X Poisso(λ). We wat to test H 0 : λ = λ 0 versus H : λ λ 0. The log λ(x ) = [(λ 0 λ) λ log(λ 0 / λ)]. We reject H 0 whe log λ(x ) > χ,α. 7

102 Now suppose that θ = (θ,..., θ k ). Suppose that H 0 fixes some of the parameters. The log λ(x ) χ ν where ν = dim(θ) dim(θ 0 ). Example 0 Cosider a multiomial with θ = (p,..., p 5 ). So Suppose we wat to test L(θ) = p y p y 5 5. H 0 : p = p = p 3 ad p 4 = p 5 versus the alterative that H 0 is false. I this case ν = 4 = 3. The LRT test statistic is λ(x ) = 5 i= py j 0j 5 i= py j j where p j = Y j /, p 0 = p 0 = p 30 = (Y + Y + Y 3 )/, p 40 = p 50 = ( 3 p 0 )/. These calculatios are o p 49. Make sure you uderstad them. Now we reject H 0 if λ(x ) > χ 3,α. 7 p-values Whe we test at a give level α we will reject or ot reject. It is useful to summarize what levels we would reject at ad what levels we woud ot reject at. The p-value is the smallest α at which we would reject H 0. I other words, we reject at all α p. So, if the pvalue is 0.03, the we would reject at α = 0.05 but ot at α = 0.0. Hece, to test at level α whe p < α. 8

103 Theorem Suppose we have a test of the form: reject whe W (X ) > c. The the p-value whe X = x is p(x ) = sup θ Θ 0 P θ (W (X ) W (x )). Example X,..., X N(θ, ). Test that H 0 : θ = θ 0 versus H : θ θ 0. We reject whe W is large, where W = (X θ 0 ). So p = P θ0 ( (X θ 0 ) > w ) = P ( Z > w) = Φ( w ). Theorem 3 Uder H 0, p Uif(0, ). Importat. Note that p is NOT equal to P (H 0 X,..., X ). The latter is a Bayesia quatity which we will discuss later. 8 The Permutatio Test This is a very cool test. approximatios. Suppose we have data ad We wat to test: Let Create labels It is distributio free ad it does ot ivolve ay asymptotic X,..., X F Y,..., Y m G. H 0 : F = G versus H : F G. Z = (X,..., X, Y,..., Y m ). L = (,...,,,..., ). } {{ } } {{ } values m values 9

A test statistic can be written as a function of Z and L. For example, if W = |X̄ − Ȳ| then we can write

W = | ∑_{i=1}^N Z_i I(L_i = 1) / ∑_{i=1}^N I(L_i = 1)  −  ∑_{i=1}^N Z_i I(L_i = 2) / ∑_{i=1}^N I(L_i = 2) |

where N = n + m. So we write W = g(L, Z). Define

p = (1/N!) ∑_π I( g(L_π, Z) > g(L, Z) )

where L_π is a permutation of the labels and the sum is over all permutations. Under H_0, permuting the labels does not change the distribution. In other words, g(L, Z) has an equal chance of having any rank among all the permuted values. That is, under H_0, p ≈ Unif(0, 1), and if we reject when p < α, then we have a level α test.

Summing over all permutations is infeasible. But it suffices to use a random sample of permutations. So we do this:

1. Compute a random permutation of the labels and compute W. Do this K times giving values W_1, ..., W_K.
2. Compute the p-value

(1/K) ∑_{j=1}^K I(W_j > W).

9 The Score Test (Optional)

Recall that the score statistic is

S(θ) = (∂/∂θ) log f(X_1, ..., X_n; θ) = ∑_{i=1}^n (∂/∂θ) log f(X_i; θ).

Recall that E_θ S(θ) = 0 and Var_θ S(θ) = I_n(θ). By the CLT,

Z = S(θ_0) / √(I_n(θ_0)) ⇝ N(0, 1)

under H_0. So we reject if |Z| > z_{α/2}. The advantage of the score test is that it does not require maximizing the likelihood function.

Example 14 For the Bernoulli,

S(p) = n(p̂ − p)/(p(1 − p)),   I_n(p) = n/(p(1 − p))

and so

Z = (p̂ − p_0) / √( p_0(1 − p_0)/n ).

This is the same as the Wald test in this case.
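To make the permutation test of Section 8 concrete, here is a minimal sketch using W = |X̄ − Ȳ| and K random permutations; the data, K, the seed, and the helper name perm_test are illustrative choices:

import numpy as np

rng = np.random.default_rng(8)

def perm_test(x, y, K=5000):
    # two-sample permutation test with statistic W = |mean(x) - mean(y)|
    z = np.concatenate([x, y])
    n = len(x)
    w_obs = abs(x.mean() - y.mean())
    w_perm = np.empty(K)
    for k in range(K):
        zp = rng.permutation(z)                   # random relabelling of the pooled data
        w_perm[k] = abs(zp[:n].mean() - zp[n:].mean())
    return np.mean(w_perm > w_obs)                # the p-value from Section 8

x = rng.normal(0.0, 1.0, size=30)
y = rng.normal(0.5, 1.0, size=40)
print(perm_test(x, y))                            # typically small here: evidence that F differs from G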

106 Lecture Notes Iterval Estimatio (Cofidece Itervals) Chapter 9 ad Chapter 0.4 Itroductio Fid C = [L(X,..., X ), U(X,..., X )] so that ) P θ (L(X,..., X ) θ U(X,..., X ) α for all θ Θ. I other words: ( ) if P θ L(X,..., X ) θ U(X,..., X ) α. θ Θ We say that C has coverage α or that C is a α cofidece iterval. Note that C is radom ad θ is fixed (but ukow). More geerally, a α cofidece set C is a (radom) set C Θ such that ( ) if P θ θ C (X,..., X ) α. θ Θ Agai, C is radom, θ is ot. Example Let X,..., X N(θ, σ). Suppose that σ is kow. Let L = L(X,..., X ) = X c ad U = U(X,..., X ) = X + c. The P θ (L θ U) = P θ (X c θ X + c) = P θ ( c < X θ < c) = P θ ( c ( = P c σ < Z < c ) σ = Φ( c /σ) = α σ < (X θ) = Φ(c /σ) Φ( c /σ) if we choose c = σz α/ /. So, if we defie C = X ± σz α/ the for all θ. P θ (θ C ) = α σ < c ) σ

Example 2 X_i ~ N(θ_i, 1) for i = 1, ..., n. Let C_n = {θ ∈ R^n : ||X − θ||^2 ≤ χ^2_{n,α}}. Then

P_θ(θ ∉ C_n) = P_θ( ||X − θ||^2 > χ^2_{n,α} ) = P(χ^2_n > χ^2_{n,α}) = α.

Four methods:
1. Probability Inequalities
2. Inverting a test
3. Pivots
4. Large Sample Approximations

Optimal confidence intervals are confidence intervals that are as short as possible, but we will not discuss optimality.

2 Using Probability Inequalities

Intervals that are valid for finite samples can be obtained by probability inequalities.

Example 3 Let X_1, ..., X_n ~ Bernoulli(p). By Hoeffding's inequality:

P(|p̂ − p| > ε) ≤ 2 e^{−2nε^2}.

Let

ε_n = √( (1/(2n)) log(2/α) ).

Then

P( |p̂ − p| > ε_n ) ≤ α.

Hence, P(p ∈ C) ≥ 1 − α where C = (p̂ − ε_n, p̂ + ε_n).
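A small comparison of Example 3's Hoeffding interval with the usual large-sample (Wald) interval, assuming n = 100 Bernoulli observations, α = 0.05, and z_{α/2} ≈ 1.96 (all illustrative):

import numpy as np

rng = np.random.default_rng(9)
n, p, alpha = 100, 0.3, 0.05
p_hat = rng.binomial(n, p) / n

eps = np.sqrt(np.log(2 / alpha) / (2 * n))         # Hoeffding half-width, distribution-free
wald = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)     # large-sample half-width

print((p_hat - eps, p_hat + eps))                   # finite-sample guarantee, wider
print((p_hat - wald, p_hat + wald))                 # shorter, only asymptotically valid

The Hoeffding interval is wider but its coverage guarantee holds for every n; the Wald interval is shorter but only approximately valid.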

Example 4 Let X_1, ..., X_n ~ F. Suppose we want a confidence band for F. We can use VC theory. Remember that

P( sup_x |F̂_n(x) − F(x)| > ε ) ≤ 2 e^{−2nε^2}.

Let

ε_n = √( (1/(2n)) log(2/α) ).

Then

P( sup_x |F̂_n(x) − F(x)| > ε_n ) ≤ α.

Hence, for all F,

P_F( L(t) ≤ F(t) ≤ U(t) for all t ) ≥ 1 − α

where

L(t) = F̂_n(t) − ε_n,   U(t) = F̂_n(t) + ε_n.

We can improve this by taking

L(t) = max{ F̂_n(t) − ε_n, 0 },   U(t) = min{ F̂_n(t) + ε_n, 1 }.

3 Inverting a Test

For each θ_0, construct a level α test of H_0 : θ = θ_0 versus H_1 : θ ≠ θ_0. Define φ_{θ_0}(x^n) = 1 if we reject and φ_{θ_0}(x^n) = 0 if we don't reject. Let A(θ_0) be the acceptance region, that is, A(θ_0) = {x^n : φ_{θ_0}(x^n) = 0}. Let

C(x^n) = {θ : x^n ∈ A(θ)} = {θ : φ_θ(x^n) = 0}.

Theorem 5 For each θ, P_θ(θ ∈ C(x^n)) ≥ 1 − α.

Proof. P_θ(θ ∉ C(x^n)) is the probability of rejecting θ when θ is true, which is at most α.

109 The coverse is also true: if C(x ) is a α cofidece iterval the the test: reject H 0 if θ 0 / C(x ) is a level α test. Example 6 Suppose we use the LRT. We reject H 0 whe So C = { L(θ 0 ) L( θ) c. θ : } L(θ) L( θ) c. See Example 9..3 for a detailed example ivolvig the expoetial distributio. Example 7 Let X,..., X N(µ, σ ) with σ kow. The LRT of H 0 : µ = µ 0 rejects whe X µ 0 σ z α/. So A(µ) = {x : X µ 0 < σ } z α/ ad so µ C(X ) if ad oly if X µ σ z α/. I other words, C = X ± σ z α/. If σ is ukow, the this becomes C = X ± S t,α/. (Good practice questio.) 4

110 4 Pivots A fuctio Q(X,..., X, θ) is a pivot if the distributio of Q does ot deped o θ. For example, if X,..., X N(θ, ) the X θ N(0, /) so Q = X θ is a pivot. Let a ad b be such that P θ (a Q(X, θ) b) α for all θ. We ca fid such a a ad b because Q is a pivot. It follows immediately that C(x) = {θ : a Q(x, θ) b} has coverage α. Example 8 Let X,..., X N(µ, σ ). (σ kow.) The Z = (X µ) σ N(0, ). We kow that ad so Thus P P ( z α/ Z z α/ ) = α ( z α/ (X µ) σ C = X ± σ z α/. z α/ ) = α. If σ is ukow, the this becomes C = X ± S t,α/ because T = (X µ) S t. 5

111 Example 9 Let X,..., X Uiform(0, θ). Let Q = X () /θ. The P(Q t) = i P(X i tθ) = t so Q is a pivot. Let c = α /. The P(Q c ) = α. Also, P(Q ) =. Therefore, ( α = P(c Q ) = P c X () θ ( = P c θ ) X () ( = P X () θ X ) () c so a α cofidece iterval is ( X (), ) X (). α / ) 5 Large Sample Cofidece Itervals We kow that, uder regularity coditios, θ θ N(0, ) se where θ is the mle ad se = / I ( θ). So this is a asymptotic pivot ad a approximate cofidece iterval is θ ± z α/ se. By the delta method, a cofidece iterval for τ(θ) is τ( θ ) ± z α/ se( θ) τ ( θ ). By ivertig the LRT ad usig the χ limitig distributio we get the LRT large sample cofidece set: { ( ) } L(θ) C = θ : log χ k,α. L( θ) 6

112 The P θ (θ C) α for each θ. Example 0 Let X,..., X Beroulli(p). Usig the Wald statistic so a approximate cofidece iterval is p p bp( bp) N(0, ) p( p) p ± z α/. Usig the LRT we get { ( ) } p Y ( p) Y C = p : log χ p Y ( p) Y,α. These itervals are differet but, for large, they are early the same. iterval ca be costructed by ivertig a test. A fiite sample 6 A Pivot For the cdf Let X,..., X F. We wat to costruct two fuctios L(t) L(t, X) ad U(t) U(t, X) such that P F (L(t) F (t) U(t) for all t) α for all F. Let where F (x) = K = sup F (x) F (x) x i= I(X i x) = #{X i x} 7

113 is the empirical distribito fuctio. We claim that K is a pivot. To see this, let U i = F (X i ). The U,..., U Uiform(0, ). So K = sup F (x) F (x) x = sup I(X i x) F (x) x i= = sup I(F (X i ) F (x)) F (x) x i= = sup I(U i F (x)) F (x) x i= = sup I(U i t) t 0 t i= ad the latter has a distributio depedig oly o U,..., U. We could fid, by simulatio, a umber c such that ( P A cofidece set is the sup 0 t ) I(U i t) t > c = α. i= C = {F : sup F (x) F (x) < c}. x 8

114 Lecture Notes Noparametric Iferece This is ot i the text. Suppose we wat to estimate somethig without assumig a parametric model. Some examples are:. Estimate the cdf F.. Estimate a desity fuctio p(x). 3. Estimate a regressio fuctio m(x) = E(Y X = x). 4. Estimate a fuctioal T (P ) of a distributio P for example T (P ) = E(X) = x p(x)dx. The cdf ad the Empirical Probability We already solved this problem whe we did VC theory. Give X,..., X F where X i R we use, We saw that Hece, ad ( P sup x F (x) = I(X i x). i= F ) (x) F (x) > ɛ e ɛ. sup x P F (x) F (x) 0 ( sup F (x) F (x) = O P x It ca be show that this is the miimax rate of covergece. I other words, More geerally, for X i R d, we set ). P (A) = I(X i A). i=

115 We saw that, for ay class A with VC dimesio v, ( ) P sup P (A) P (A) > ɛ A A c v e c ɛ. Desity Estimatio X,..., X are iid with desity p. For simplicity assume that X i R. What happes if we try to do maximum likelihood? The likelihood is L(p) = p(x i ). i= We ca make this as large as we wat by makig p highly peaked at each X i. So sup p L(p) = ad the mle is the desity that puts ifiite spikes at each X i. We will eed to put some restrictio o p. For example { } p P = p : p 0, p =, p (x) dx C. The most commoly used oparametric desity estimator is probably the histogram. Aother commo estimator is the kerel desity estimator. A kerel K is a symmetric desity fuctio with mea 0. The estimator is p (x) = where h > 0 is called the badwidth. i= h K ( ) x Xi The badwidth cotrols the smoothess of the estimator. Larger h makes f smoother. As a loss fuctio we will use L(p, p) = (p(x) p(x)) dx. h The risk is R = E (L(p, p)) = E(p(x) p(x)) dx = (b (x) + v(x))dx where b(x) = E( p(x)) p(x)

116 is the bias ad v(x) = Var( p(x)). Let Y i = ( ) x h K Xi. h The p (x) = i= Y i ad ( ) E( p(x)) = E Y i = E(Y i ) i= ( ( )) = E h K Xi x h ( ) u x = h K p(u)du h = K(t)p(x + ht)dt where u = x + ht ) = K(t) (p(x) + htp (x) + h t p (x) + o(h ) dt = p(x) K(t)dt + hp (x) tk(t)dt + h p (x) t K(t)dt + o(h )dt where κ = t K(t)dt. So ad Thus = (p(x) ) + (hp (x) 0) + h p (x)κ + o(h ) E( p(x)) p(x) + h p (x)κ b(x) h p (x)κ. b (x)dx = h4 4 κ i= (p (x)) dx. Now we compute the variace. We have ( ) v(x) = Var Y i = VarY i = E(Y i ) (E(Y i )). 3

117 Now ( ( )) E(Yi Xi x ) = E h K h ( ) u x = h K p(u)du h = K (t)p(x + ht)dt u = x + ht h p(x) K (t)dt = p(x)ξ h h where ξ = K (t)dt. Now (E(Y i )) ) (p(x) + h p (x)κ = f (x) + O(h ) f (x). So ad Fially, Note that v(x) = E(Y i ) (E(Y i)) R h4 4 κ p(x) h + f (x) = p(x)ξ ( ) h + o p(x)ξ h h v(x)dx ξ h. (p (x)) dx + ξ h = Ch4 + ξ h. h bias, variace If we choose h = h to satisfy h bias, variace. h 0, h the we see that p (x) P p(x). If we miimize over h we get h = ( ) /5 ξ = O 4C ( ) /5. 4

This gives

R* ≈ C n^{−4/5}

for some constant C. Can we do better? The answer, based on minimax theory, is no.

Theorem 1 There is a constant a such that

inf_{p̂} sup_{f ∈ F} R(f, p̂) ≥ a n^{−4/5}.

So the kernel estimator achieves the minimax rate of convergence. The histogram converges at the sub-optimal rate of n^{−2/3}. Proving these facts is beyond the scope of the course.

There are many practical questions such as: how to choose h in practice, how to extend to higher dimensions, etc. These are discussed in 10-702 as well as other courses.

3 Regression

We observe (X_1, Y_1), ..., (X_n, Y_n). Given a new X we want to predict Y. If our prediction is m(x) then the predictive loss is (Y − m(x))^2. Later in the course we will discuss prediction in detail and we will see that the optimal predictor is the regression function

m(x) = E(Y | X = x) = ∫ y p(y|x) dy.

The kernel estimator is

m̂_n(x) = ∑_{i=1}^n Y_i K( (x − X_i)/h ) / ∑_{i=1}^n K( (x − X_i)/h ).

The properties are similar to kernel density estimation. Again, you will study this in more detail in some other classes.

4 Functionals

Let X_1, ..., X_n ~ F. Let F be all distributions. A map T : F → R is called a statistical functional.
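Before moving on to functionals, here is a minimal kernel density estimator in the sense of Section 2, with a Gaussian kernel and a bandwidth of order n^{−1/5}; the data, the rule-of-thumb constant 1.06, and the evaluation grid are illustrative choices:

import numpy as np

rng = np.random.default_rng(10)
X = rng.normal(0.0, 1.0, size=500)
n = len(X)
h = 1.06 * X.std() * n ** (-1 / 5)        # a common rule-of-thumb bandwidth, h = O(n^(-1/5))

def p_hat(x, data=X, h=h):
    # (1/n) sum_i (1/h) K((x - X_i)/h) with the N(0,1) kernel
    u = (x[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.mean(axis=1) / h

grid = np.linspace(-4, 4, 9)
print(np.round(p_hat(grid), 3))           # compare with the true N(0,1) density on the grid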

119 Notatio. Let F be a distributio fuctio. Let f deote the probability mass fuctio if F is discrete ad the probability desity fuctio if F is cotiuous. The itegral g(x)df (x) is iterpreted as follows: j g(x)df (x) = g(x j)p(x j ) g(x)p(x)dx if F is discrete if F is cotiuous. A statistical fuctioal T (F ) is ay fuctio of of the cdf F. Examples iclude the mea µ = x df (x), the variace σ = (x µ) df (x), the media m = F (/), ad the largest eigevalue of the covariace matrix Σ. The plug-i estimator of θ = T (F ) is defied by θ = T ( F ). A fuctioal of the form a(x)df (x) is called a liear fuctioal. The empirical cdf F (x) is discrete, puttig mass / at each X i. Hece, if T (F ) = a(x)df (x) is a liear fuctioal the the plug-i estimator for liear fuctioal T (F ) = a(x)df (x) is: T ( F ) = a(x)d F (x) = a(x i ). Let ŝe be a estimate of the stadard error of T ( F ). I may cases, it turs out that i= θ = T ( F ) N(T (F ), ŝe ). I that case, a approximate α cofidece iterval for T (F ) is the θ ± z α/ ŝe. We ca use the Wald statistic to do a hypothesis test. W = θ θ 0 se 6

120 Example (The mea) Let µ = T (F ) = x df (x). The plug-i estimator is µ = x d F (x) = X. The stadard error is se = Var(X ) = σ/. If σ deotes a estimate of σ, the the estimated stadard error is ŝe = σ/. A Normal-based cofidece iterval for µ is X ± z α/ σ/. Example 3 (The variace) Let σ = Var(X) = x df (x) ( x df (x) ). The plug-i estimator is σ = = = ( x d F (x) Xi i= ( xd F (x)) () ) X i () i= (X i X ). (3) i= Example 4 (Quatiles) Let F be strictly icreasig with desity f. Let T (F ) = F (p) be the p th quatile. The estimate of T (F ) is ot ivertible. To avoid ambiguity we defie the p th sample quatile. F (p). We have to be a bit careful sice F is F (p) = if{x : F (x) p}. We call F (p) How do we estimate the stadard error? There are two approaches. Oe is based o somethig called the ifluece fuctio which is a oparametric versio of the score fuctio. We wo t cover that i this course. The secod approach is to use the bootstrap which we will discuss i a upcomig lecture. 5 Optioal: The Ifluece Fuctio If you are curious what the ifluece is, I will describe it here. This sectio is optioal ad you ca skip it if you prefer. The ifluece fuctio is defied by ( ) T ( ɛ)f + ɛδx T (F ) L F (x) = lim ɛ 0 ɛ 7

121 where δ x deote a poit mass distributio at x: δ x (y) = 0 if y < x ad δ x (y) = if y x. The empirical ifluece fuctio is defied by ( ) T ( ɛ) F + ɛδ x T ( F ) L(x) = lim. ɛ 0 ɛ The ifluece fuctio is the oparametric versio of the score fuctio. More precisely, it behaves like the score divided by the Fisher iformatio, L = score/iformatio = S/I. Theorem 5 If T is Hadamard differetiable with respect to d(f, G) = sup x F (x) G(x) the (T ( F ) T (F )) N(0, τ ) where τ = L F (x)df (x). Also, (T ( F ) T (F )) ŝe N(0, ) where ŝe = τ/ ad τ = L (X i ). i= We call the approximatio (T ( F ) T (F ))/ŝe N(0, ) the fuctioal delta method or the oparametric delta method. From the ormal approximatio, a large sample cofidece iterval is: T ( F ) ± z α/ ŝe. Example 6 (The mea) Let θ = T (F ) = x df (x). The plug-i estimator is θ = x d F (x) = X. Also, T (( ɛ)f +ɛδ x ) = ( ɛ)θ+ɛx. Thus, L(x) = x θ, L(x) = x X ad ŝe = σ / where σ = i= (X i X ). A poitwise asymptotic oparametric 95 percet cofidece iterval for θ is X ± ŝe. Hadamard differetiability is a smoothess coditio o T. 8

122 Example 7 (Quatiles) Let F be strictly icreasig with positive desity f, ad let T (F ) = F (p) be the p th quatile. The ifluece fuctio is L(x) = p p(θ), x θ p p(θ), x > θ. The asymptotic variace of T ( F ) is τ = L (x)df (x) = p( p) f (θ). 9

123 This is mostly ot i the text. Lecture Notes 3 The Bootstrap Itroductio Ca we estimate the mea of a distributio without usig a parametric model? Yes. The key idea is to first estimate the distributio fuctio oparametrically. The we ca get a estimate of the mea (ad may other parameters) from the distributio fuctio. How ca we get the stadard error of that estimator? The aswer is: the bootstrap. The bootstrap is a oparametric method for fidig stadard errors ad cofidece itervals. Notatio. Let F be a distributio fuctio. Let p deote the probability mass fuctio if F is discrete ad the probability desity fuctio if F is cotiuous. The itegral g(x)df (x) is iterpreted as follows: j g(x)df (x) = g(x j)p(x j ) if F is discrete () g(x)p(x)dx if F is cotiuous. For 0 < α < defie z α by P(Z > z α ) = α where Z N(0, ). Thus z α = Φ ( α) = Φ (α). Review of The Empirical Distributio Fuctio The bootstrap uses the empirical distributio fuctio. Let X,..., X F where F (x) = P(X x) is a distributio fuctio o the real lie. We ca estimate F with the empirical distributio fuctio F, the cdf that puts mass / at each data poit X i. Recall that the empirical distributio fuctio F is defied by F (x) = I(X i x) () i=

124 where if X i x I(X i x) = 0 if X i > x. (3) From () it follows that g(x)d F (x) = i= g(x i). Accordig to the Gliveko Catelli Theorem, sup F (x) F (x) as 0. (4) x Hece, F is a cosistet estimator of F. I fact, the covergece is fast. Accordig to the Dvoretzky Kiefer Wolfowitz (DKW) iequality, for ay ɛ > 0, ( ) P sup F (x) F (x) > ɛ x e ɛ. (5) If ɛ = c / where c, the P(sup x F (x) F (x) > ɛ ) 0. Hece, sup x F (x) F (x) = O P ( / ). 3 Statistical Fuctioals Recall that a statistical fuctioal T (F ) is ay fuctio of of the cdf F. Examples iclude the mea µ = x df (x), the variace σ = (x µ) df (x), m = F (/), ad the largest eigevalue of the covariace matrix Σ. The plug-i estimator of θ = T (F ) is defied by θ = T ( F ). (6) Let ŝe be a estimate of the stadard error of T ( F ). (We will see how to get this later.) I may cases, it turs out that T ( F ) N(T (F ), ŝe ). (7) I that case, a approximate α cofidece iterval for T (F ) is the T ( F ) ± z α/ ŝe. (8)

125 Example (The mea) Let µ = T (F ) = x df (x). The plug-i estimator is µ = x d F (x) = X. The stadard error is se = Var(X ) = σ/. If σ deotes a estimate of σ, the the estimated stadard error is ŝe = σ/. A Normal-based cofidece iterval for µ is X ± z α/ σ/. Example A fuctioal of the form a(x)df (x) is called a liear fuctioal. (Recall that a(x)df (x) is defied to be a(x)p(x)dx i the cotiuous case ad j a(x j)p(x j ) i the discrete case.) The empirical cdf F (x) is discrete, puttig mass / at each X i. Hece, if T (F ) = a(x)df (x) is a liear fuctioal the the plug-i estimator for liear fuctioal T (F ) = a(x)df (x) is: T ( F ) = a(x)d F (x) = a(x i ). (9) Example 3 (The variace) Let σ = Var(X) = x df (x) ( x df (x) ). The plug-i estimator is σ = = = ( x d F (x) Xi i= ( i= xd F (x)) (0) ) X i () i= (X i X ). () i= Example 4 (The skewess) Let µ ad σ deote the mea ad variace of a radom variable X. The skewess which measures the lack of symmetry of a distributio is defied to be κ = E(X µ)3 σ 3 = (x µ) 3 df (x) { (x µ) df (x) } 3/. (3) To fid the plug-i estimate, first recall that µ = i= X i ad σ = i= (X i µ). The plug-i estimate of κ is κ = (x µ) 3 d F (x) { (x µ) d F (x)} 3/ = i= (X i µ) 3. (4) σ 3 3

126 Example 5 (Correlatio) Let Z = (X, Y ) ad let ρ = T (F ) = E(X µ X )(Y µ Y )/(σ x σ y ) deote the correlatio betwee X ad Y, where F (x, y) is bivariate. We ca write T (F ) = a(t (F ), T (F ), T 3 (F ), T 4 (F ), T 5 (F )) where T (F ) = x df (z) T (F ) = y df (z) T 3 (F ) = xy df (z) T 4 (F ) = x df (z) T 5 (F ) = y df (z) (5) ad a(t,..., t 5 ) = t 3 t t (t4 t )(t 5 t ). (6) Replace F with F i T (F ),..., T 5 (F ), ad take ρ = a(t ( F ), T ( F ), T 3 ( F ), T 4 ( F ), T 5 ( F )). (7) We get ρ = which is called the sample correlatio. i= (X i X )(Y i Y ) i= (X i X ) i= (Y i Y ) (8) Example 6 (Quatiles) Let F be strictly icreasig with desity f. Let T (F ) = F (p) be the p th quatile. The estimate of T (F ) is ot ivertible. To avoid ambiguity we defie the p th sample quatile. F (p). We have to be a bit careful sice F is F (p) = if{x : F (x) p}. We call F (p) 4 The Bootstrap Let T = g(x,..., X ) be a statistic ad let Var F (T ) deote the variace of T. We have added the subscript F to emphasize that the variace is itself a fuctio of F. I other words Var F (T ) = (g(x,..., X ) µ) df (x )df (x ) df (x ) where µ = E(T ) = g(x,..., X )df (x )df (x ) df (x ). 4

If we knew F we could, at least in principle, compute the variance. For example, if T_n = n^{-1} Σ_{i=1}^n X_i, then

Var_F(T_n) = σ²/n = ( ∫ x² dF(x) − ( ∫ x dF(x) )² ) / n.  (19)

In other words, the variance of T_n is itself a function of F. We can write Var_F(T_n) = U(F) for some U. Therefore, to estimate Var_F(T_n) we can use Var_{F̂_n}(T_n) = U(F̂_n). This is the bootstrap estimate of the variance. To repeat: we estimate U(F) = Var_F(T_n) with U(F̂_n) = Var_{F̂_n}(T_n). In other words, we use a plug-in estimator of the variance.

But how can we compute Var_{F̂_n}(T_n)? We approximate it with a simulation estimate denoted by v_boot. Specifically, we do the following steps:

Bootstrap Variance Estimation

1. Draw X_1^*, ..., X_n^* ~ F̂_n.
2. Compute T_n^* = g(X_1^*, ..., X_n^*).
3. Repeat steps 1 and 2, B times, to get T_{n,1}^*, ..., T_{n,B}^*.
4. Let

v_boot = (1/B) Σ_{b=1}^B ( T_{n,b}^* − (1/B) Σ_{r=1}^B T_{n,r}^* )².  (20)

By the law of large numbers, v_boot → Var_{F̂_n}(T_n) almost surely as B → ∞. The estimated standard error of T_n is ŝe_boot = √v_boot. The following diagram illustrates the bootstrap idea:

Real world:      F    =>  X_1, ..., X_n      =>  T_n = g(X_1, ..., X_n)
Bootstrap world: F̂_n  =>  X_1^*, ..., X_n^*  =>  T_n^* = g(X_1^*, ..., X_n^*)
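The four steps above translate directly into code. The following is a minimal R sketch (not from the notes); the statistic (the median), the Exponential(1) data, and B = 1000 are arbitrary illustrative choices.

    # Bootstrap estimate of the variance (and standard error) of T_n = g(X_1,...,X_n).
    boot_var <- function(x, g, B = 1000) {
      n <- length(x)
      Tstar <- replicate(B, {
        xstar <- sample(x, n, replace = TRUE)   # draw X_1*,...,X_n* from F_hat_n
        g(xstar)                                # compute T_n* = g(X_1*,...,X_n*)
      })
      v_boot <- mean((Tstar - mean(Tstar))^2)   # equation (20)
      list(v_boot = v_boot, se_boot = sqrt(v_boot))
    }

    # Example: standard error of the sample median for illustrative Exponential(1) data.
    set.seed(1)
    x <- rexp(50)
    boot_var(x, median)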

Bootstrap for The Median

Given data X = (X(1), ..., X(n)):

    T <- median(X)
    Tboot <- vector of length B
    for (i in 1:B) {
        Xstar <- sample of size n from X (with replacement)
        Tboot[i] <- median(Xstar)
    }
    se <- sqrt(variance(Tboot))

Figure 1: Pseudo-code for bootstrapping the median.

The approximation involved is

Var_F(T_n)  ≈  Var_{F̂_n}(T_n)  ≈  v_boot,  (21)

where the first approximation has error O(1/√n) and the second has error O(1/√B).

How do we simulate from F̂_n? Since F̂_n gives probability 1/n to each data point, drawing n points at random from F̂_n is the same as drawing a sample of size n with replacement from the original data. Therefore step 1 can be replaced by:

1. Draw X_1^*, ..., X_n^* with replacement from X_1, ..., X_n.

Example 7. Figure 1 shows pseudo-code for using the bootstrap to estimate the standard error of the median.

5 The Parametric Bootstrap

So far, we have estimated F nonparametrically. There is also a parametric bootstrap. If F_θ depends on a parameter θ and θ̂ is an estimate of θ, then we simply sample from F_θ̂

instead of F̂_n. This is just as accurate as, but much simpler than, the delta method. Here is more detail. Suppose that X_1, ..., X_n ~ p(x; θ). Let θ̂ be the mle. Let τ = g(θ). Then τ̂ = g(θ̂). To get the standard error of τ̂ we need to compute the Fisher information and then do the delta method. The bootstrap allows us to avoid both steps. We just do the following:

1. Compute the estimate θ̂ from the data X_1, ..., X_n.
2. Draw a sample X_1^*, ..., X_n^* ~ f(x; θ̂). Compute θ̂^* and τ̂^* = g(θ̂^*) from the new data. Repeat B times to get τ̂_1^*, ..., τ̂_B^*.
3. Compute the standard deviation

ŝe = √{ (1/B) Σ_{b=1}^B ( τ̂_b^* − τ̄^* )² }  where  τ̄^* = (1/B) Σ_{b=1}^B τ̂_b^*.  (22)

No need to get the Fisher information or do the delta method.

6 Bootstrap Confidence Intervals

There are several ways to construct bootstrap confidence intervals. They vary in ease of calculation and accuracy.

Normal Interval. The simplest is the Normal interval

θ̂_n ± z_{α/2} ŝe_boot  (23)

where ŝe_boot is the bootstrap estimate of the standard error.

Pivotal Intervals. Let θ = T(F) and θ̂_n = T(F̂_n). We can also construct an approximate confidence interval for θ using the (approximate) pivot √n (θ̂_n − θ) as follows:

C_n = ( θ̂_n − Ĥ^{-1}(1 − α/2)/√n,  θ̂_n − Ĥ^{-1}(α/2)/√n )  (24)

where

Ĥ(r) = (1/B) Σ_{j=1}^B I( √n (θ̂_j^* − θ̂_n) ≤ r )  (25)

is the bootstrap estimate of

H(r) = P_F( √n (θ̂_n − θ) ≤ r ),  with  H̄(r) = P_{F̂_n}( √n (θ̂_n^* − θ̂_n) ≤ r ).  (26)

Theorem 8. Under appropriate conditions on T, sup_u |H(u) − H̄(u)| →P 0 as n → ∞ and sup_u |H̄(u) − Ĥ(u)| →P 0 as B → ∞.

Now we can show that the confidence interval has coverage that is approximately equal to 1 − α. Applying Theorem 8 we have

P(θ ∈ C_n) = P( θ̂_n − Ĥ^{-1}(1 − α/2)/√n ≤ θ ≤ θ̂_n − Ĥ^{-1}(α/2)/√n )
           = P( Ĥ^{-1}(α/2) ≤ √n (θ̂_n − θ) ≤ Ĥ^{-1}(1 − α/2) )
           = H( Ĥ^{-1}(1 − α/2) ) − H( Ĥ^{-1}(α/2) )
           ≈ Ĥ( Ĥ^{-1}(1 − α/2) ) − Ĥ( Ĥ^{-1}(α/2) )
           = (1 − α/2) − α/2 = 1 − α.

7 Remarks About The Bootstrap

1. The bootstrap is nonparametric but it does require some assumptions. You can't assume it is always valid. (See the appendix.)

2. The bootstrap is an asymptotic method. Thus the coverage of the confidence interval is 1 − α + r_n where the remainder r_n → 0 as n → ∞.

3. There is a related method called the jackknife where the standard error is estimated by leaving out one observation at a time. However, the bootstrap is valid under weaker conditions than the jackknife. See Shao and Tu (1995).

4. Another way to construct a bootstrap confidence interval is to set C_n = [a, b] where a is the α/2 quantile of θ̂_1^*, ..., θ̂_B^* and b is the 1 − α/2 quantile. This is called the percentile interval. This interval seems very intuitive but does not have the theoretical support of the interval in (24). However, in practice, the percentile interval and the interval in (24) are often quite similar.

5. There are many cases where the bootstrap is not formally justified. This is especially true with discrete structures like trees and graphs. Nonetheless, the bootstrap can be used in an informal way to get some intuition about the variability of the procedure. But keep in mind that the formal guarantees may not apply in these cases. For example, see Holmes (2003) for a discussion of the bootstrap applied to phylogenetic trees.

6. There is an improvement on the bootstrap called subsampling. In this case, we draw samples of size m < n without replacement. Subsampling produces valid confidence intervals under weaker conditions than the bootstrap. See Politis, Romano and Wolf (1999).

7. There are many modifications of the bootstrap that lead to more accurate confidence intervals; see Efron (1996).

8 Examples

Example 9 (The Median). The top left plot of Figure 2 shows the density for a χ² distribution with 4 degrees of freedom. The top right plot shows a histogram of n = 50 draws from this distribution. Let θ = T(P) be the median. The true value is θ ≈ 3.36. The sample median θ̂ is close to this. We computed B = 1000 bootstrap values θ̂_1^*, ..., θ̂_B^*, shown in the histogram (bottom left plot). The estimated standard error is smaller than the true standard error in this case. Next we conducted a small simulation. We drew a sample of size n and computed the 95 percent bootstrap confidence interval. We repeated this process N = 100 times. The bottom right plot shows the 100 intervals. The vertical line is the true value of θ. The percentage of intervals that cover θ is 0.83, which shows that the bootstrap interval undercovers in this case.

Figure 2: Top left: density of a χ² with 4 degrees of freedom; the vertical line shows the median. Top right: n = 50 draws from the distribution. Bottom left: B = 1000 bootstrap values θ̂_1^*, ..., θ̂_B^*. Bottom right: bootstrap confidence intervals from 100 experiments.

Example 10 (Nonparametric Regression). The bootstrap is often used informally to get a sense of the variability of a procedure. Consider the data (X_1, Y_1), ..., (X_n, Y_n) in the top left plot of Figure 3. To estimate the regression function m(x) = E(Y | X = x) we use a kernel regression estimator given by m̂(x) = Σ_{i=1}^n Y_i w_i(x) where w_i(x) = K((x − X_i)/h) / Σ_j K((x − X_j)/h) and K(x) = e^{−x²/2} is a Gaussian kernel. The estimated curve is shown in the top right plot. We now create B = 1,000 bootstrap replications, resulting in curves m̂_1^*, ..., m̂_B^* shown in the bottom left plot. At each x, we find the .025 and .975 quantiles of the bootstrap replications.

This results in the upper and lower bands in the bottom right plot. The bootstrap reveals greater variability in the estimated curve around x = 0.5.

The reason why we call this an informal use of the bootstrap is that the bands shown in the lower right plot are not rigorous confidence bands. There are several reasons for this. First, we used a percentile interval (described in the earlier list of remarks) rather than the interval defined by (24). Second, we have not adjusted for the fact that we are making simultaneous bands over all x. Finally, the theory of the bootstrap does not directly apply to nonparametric smoothing. Roughly speaking, we are really creating approximate confidence intervals for E(m̂(x)) rather than for m(x). Despite these shortcomings, the bootstrap is still regarded as a useful tool here, but we must keep in mind that it is being used in an informal way. Some authors refer to the bands as variability bands rather than confidence bands for this reason.

Example 11 (Estimating Eigenvalues). Let X_1, ..., X_n be random vectors where X_i ∈ R^p and let Σ be the covariance matrix of X_i. A common dimension reduction technique is principal components, which involves finding the spectral decomposition Σ = E Λ E^T where the columns of E are the eigenvectors of Σ and Λ is a diagonal matrix whose diagonal elements are the ordered eigenvalues λ_1 ≥ ··· ≥ λ_p. The data dimension can be reduced to q < p by projecting each data point onto the first q eigenvectors. We choose q such that Σ_{j=q+1}^p λ_j is small. Of course, we need to estimate the eigenvectors and eigenvalues. For now, let us focus on estimating the largest eigenvalue and denote this by θ. An estimate of θ is the largest eigenvalue θ̂ of the sample covariance matrix

S = (1/n) Σ_{i=1}^n (X_i − X̄)(X_i − X̄)^T.  (27)

It is not at all obvious how we can estimate the standard error of θ̂ or how to find a confidence interval for θ. In this example, the bootstrap works as follows. Draw a sample of size n with replacement from X_1, ..., X_n. The new sample is denoted by X_1^*, ..., X_n^*. Compute the sample covariance matrix S^* of the new data and let θ̂^* denote the largest eigenvalue of S^*. Repeat this process B times where B is typically about 10,000. This yields bootstrap values θ̂_1^*, ..., θ̂_B^*.

Figure 3: Top left: the data (X_1, Y_1), ..., (X_n, Y_n). Top right: the kernel regression estimator. Bottom left: 1,000 bootstrap replications. Bottom right: 95 percent variability bands.

The standard deviation of θ̂_1^*, ..., θ̂_B^* is an estimate of the standard error of the original estimator θ̂. Figure 4 shows a PCA analysis of US arrest data. The last plot shows bootstrap replications of the first principal component.

Example 12 (Median Regression). Consider the linear regression model

Y_i = X_i^T β + ε_i.  (28)

Instead of using least squares to estimate β, define β̂ to minimize

median_i |Y_i − X_i^T β|.  (29)

The resulting estimator β̂ is more resistant to outliers than the least squares estimator. But how can we find the standard error of β̂? Using the bootstrap approach, we resample the pairs of data to get the bootstrap sample (X_1^*, Y_1^*), ..., (X_n^*, Y_n^*) and then we get the corresponding bootstrap estimate β̂^*. We can repeat this many times and use the standard deviation of the bootstrap estimates to estimate the standard error of β̂. Figure 5 shows bootstrap replications of fits from regression and robust regression (minimizing L_1 error instead of squared error) in a dataset with an outlier.

Warning! The bootstrap is not magic. Its validity requires some conditions to hold. When the conditions don't hold, the bootstrap, like any method, can give misleading answers.
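Example 11 is easy to implement. The following is a minimal R sketch (not from the notes); the dimension p = 5, the planted eigenvalues, n, and B are illustrative choices, and B is kept smaller than the 10,000 suggested above purely for speed.

    # Bootstrap standard error for the largest eigenvalue of the covariance matrix.
    set.seed(1)
    n <- 200; p <- 5
    X <- matrix(rnorm(n * p), n, p) %*% diag(sqrt(c(5, 3, 1, 1, 1)))  # planted eigenvalues

    largest_eig <- function(X) {
      S <- cov(X)                                # sample covariance matrix
      max(eigen(S, symmetric = TRUE, only.values = TRUE)$values)
    }

    theta_hat <- largest_eig(X)
    B <- 1000
    theta_star <- replicate(B, largest_eig(X[sample(n, n, replace = TRUE), , drop = FALSE]))

    c(theta_hat = theta_hat, se_boot = sd(theta_star))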

Figure 4: US Arrest Data. Biplot of the states on the first two principal components (variables: Murder, Assault, Rape, UrbanPop), the variances explained by the components, and a histogram of the bootstrap replications of the first principal component.

Figure 5: Robust Regression.
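Before moving on, here is a minimal R sketch (not part of the notes) that computes the Normal interval (23), the pivotal interval (24), and the percentile interval of Remark 4 for the median of simulated χ²_4 data, mirroring Example 9; the sample size, B, and α are illustrative choices.

    # Nonparametric bootstrap confidence intervals: Normal, pivotal (24), percentile.
    boot_ci <- function(x, g, B = 2000, alpha = 0.05) {
      n <- length(x)
      theta_hat <- g(x)
      theta_star <- replicate(B, g(sample(x, n, replace = TRUE)))
      se_boot <- sd(theta_star)

      normal_ci <- theta_hat + c(-1, 1) * qnorm(1 - alpha / 2) * se_boot

      # Pivotal interval: quantiles of the approximate pivot sqrt(n)(theta* - theta_hat).
      r <- sqrt(n) * (theta_star - theta_hat)
      pivotal_ci <- c(theta_hat - quantile(r, 1 - alpha / 2) / sqrt(n),
                      theta_hat - quantile(r, alpha / 2) / sqrt(n))

      percentile_ci <- quantile(theta_star, c(alpha / 2, 1 - alpha / 2))

      list(normal = normal_ci, pivotal = unname(pivotal_ci),
           percentile = unname(percentile_ci))
    }

    set.seed(1)
    x <- rchisq(50, df = 4)    # mimics the chi-squared(4) example
    boot_ci(x, median)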

Lecture Notes 14: Bayesian Inference

Relevant material is scattered throughout the book: see Sections 7.2.3, 8.2.2 and 9.2.4. We will also cover some material that is not in the book.

1 Introduction

So far we have been using frequentist (or classical) methods. In the frequentist approach, probability is interpreted as long run frequencies. The goal of frequentist inference is to create procedures with long run guarantees. Indeed, a better name for frequentist inference might be procedural inference. Moreover, the guarantees should be uniform over θ if possible. For example, a confidence interval traps the true value of θ with probability 1 − α, no matter what the true value of θ is. In frequentist inference, procedures are random while parameters are fixed, unknown quantities.

In the Bayesian approach, probability is regarded as a measure of subjective degree of belief. In this framework, everything, including parameters, is regarded as random. There are no long run frequency guarantees. Bayesian inference is quite controversial. Note that when we used Bayes estimators in minimax theory, we were not doing Bayesian inference. We were simply using Bayesian estimators as a method to derive minimax estimators.

2 The Mechanics of Bayes

Let X_1, ..., X_n ~ p(x | θ). In Bayesian inference we also include a prior π(θ). It follows from Bayes' theorem that the posterior distribution of θ given the data is

π(θ | X_1, ..., X_n) = p(X_1, ..., X_n | θ) π(θ) / m(X_1, ..., X_n)

where

m(X_1, ..., X_n) = ∫ p(X_1, ..., X_n | θ) π(θ) dθ.

Hence,

π(θ | X_1, ..., X_n) ∝ L(θ) π(θ)

where L(θ) = p(X_1, ..., X_n | θ) is the likelihood function. The interpretation is that π(θ | X_1, ..., X_n) represents your subjective beliefs about θ after observing X_1, ..., X_n.

A commonly used point estimator is the posterior mean

θ̄ = E(θ | X_1, ..., X_n) = ∫ θ π(θ | X_1, ..., X_n) dθ = ∫ θ L(θ) π(θ) dθ / ∫ L(θ) π(θ) dθ.

For interval estimation we use C = (a, b) where a and b are chosen so that

∫_a^b π(θ | X_1, ..., X_n) dθ = 1 − α.

The interpretation is that

P(θ ∈ C | X_1, ..., X_n) = 1 − α.

This does not mean that C traps θ with probability 1 − α. We will discuss the distinction in detail later.

Example 1. Let X_1, ..., X_n ~ Bernoulli(p). Let the prior be p ~ Beta(α, β). Hence

π(p) = Γ(α + β) / ( Γ(α) Γ(β) ) p^{α−1} (1 − p)^{β−1}  where  Γ(α) = ∫_0^∞ t^{α−1} e^{−t} dt.

Set Y = Σ_i X_i. Then

π(p | X) ∝ p^Y (1 − p)^{n−Y} × p^{α−1} (1 − p)^{β−1} = p^{Y+α−1} (1 − p)^{n−Y+β−1},
            (likelihood)       (prior)

Therefore, p | X ~ Beta(Y + α, n − Y + β). (See the text for more details.) The Bayes estimator is

p̄ = (Y + α) / ( (Y + α) + (n − Y + β) ) = (Y + α) / (α + β + n) = (1 − λ_n) p̂_mle + λ_n p_0

where

p_0 = α / (α + β),  λ_n = (α + β) / (α + β + n).

This is an example of a conjugate prior.
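Example 1 can be checked numerically. The following is a minimal R sketch (not from the notes); the Beta(2, 2) prior, n = 40, and the true p = 0.3 are arbitrary illustrative choices.

    # Beta-Bernoulli conjugate updating from Example 1.
    set.seed(1)
    n <- 40
    x <- rbinom(n, size = 1, prob = 0.3)
    Y <- sum(x)

    alpha <- 2; beta <- 2                       # Beta(2, 2) prior
    post_a <- Y + alpha                          # posterior is Beta(Y + alpha, n - Y + beta)
    post_b <- n - Y + beta

    p_mle   <- Y / n
    p_bayes <- post_a / (post_a + post_b)        # posterior mean: shrinks the mle toward alpha/(alpha+beta)
    cred_95 <- qbeta(c(0.025, 0.975), post_a, post_b)   # 95% posterior (credible) interval

    c(mle = p_mle, posterior_mean = p_bayes, lower = cred_95[1], upper = cred_95[2])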

Example 2. Let X_1, ..., X_n ~ N(µ, σ²) with σ known. Let µ ~ N(m, τ²). Then

E(µ | X) = ( τ² / (τ² + σ²/n) ) X̄ + ( (σ²/n) / (τ² + σ²/n) ) m

and

Var(µ | X) = σ² τ² / (n τ² + σ²).

3 Where Does the Prior Come From?

This is the million dollar question. In principle, the Bayesian is supposed to choose a prior π that represents their prior information. This will be challenging in high dimensional cases, to say the least. Also, critics will say that someone's prior opinions should not be included in a data analysis because this is not scientific.

There has been some effort to define noninformative priors but this has not worked out so well. An example is the Jeffreys prior, which is defined to be π(θ) ∝ √I(θ). You can use a flat prior but be aware that this prior doesn't retain its flatness under transformations. In high dimensional cases, the prior ends up being highly influential. The result is that Bayesian methods tend to have poor frequentist behavior. We'll return to this point soon. It is common to use flat priors even if they don't integrate to 1. This is possible since the posterior might still integrate to 1 even if the prior doesn't.

4 Large Sample Theory

There is a Bayesian central limit theorem. In nice models, with n large,

π(θ | X_1, ..., X_n) ≈ N( θ̂, 1/(n I(θ̂)) )  (1)

where θ̂ is the mle and I is the Fisher information. In these cases, the 1 − α Bayesian intervals will be approximately the same as the frequentist confidence intervals. That is, an approximate 1 − α posterior interval is

C = θ̂ ± z_{α/2} / √( n I(θ̂) )

which is the Wald confidence interval. However, this is only true if n is large and the dimension of the model is fixed.

Here is a rough derivation of (1). Note that

log π(θ | X_1, ..., X_n) = Σ_{i=1}^n log p(X_i | θ) + log π(θ) − log C_n

where C_n is the normalizing constant. Now the sum has n terms, which grows with sample size. The last two terms are O(1). So the sum dominates; that is,

log π(θ | X_1, ..., X_n) ≈ Σ_{i=1}^n log p(X_i | θ) = ℓ(θ).

Next, we note that

ℓ(θ) ≈ ℓ(θ̂) + (θ − θ̂) ℓ′(θ̂) + ((θ − θ̂)²/2) ℓ″(θ̂).

Now ℓ′(θ̂) = 0, so

ℓ(θ) ≈ ℓ(θ̂) + ((θ − θ̂)²/2) ℓ″(θ̂).

Thus, approximately,

π(θ | X_1, ..., X_n) ∝ exp( −(θ − θ̂)² / (2σ²) )

where σ² = −1/ℓ″(θ̂). Let ℓ_i = log p(X_i | θ_0) where θ_0 is the true value. Since θ̂ ≈ θ_0,

ℓ″(θ̂) ≈ ℓ″(θ_0) = Σ_i ℓ_i″ = n (1/n) Σ_i ℓ_i″ ≈ −n I(θ_0) ≈ −n I(θ̂)

and therefore σ² ≈ 1/(n I(θ̂)).

5 Bayes Versus Frequentist

In general, Bayesian and frequentist inferences can be quite different. If C_n is a 1 − α Bayesian interval then

P(θ ∈ C_n | X) = 1 − α.

This does not imply that

frequentist coverage = inf_θ P_θ(θ ∈ C_n) ≈ 1 − α.

Typically, a 1 − α Bayesian interval has coverage lower than 1 − α. Suppose you wake up every day and produce a Bayesian 95 percent interval for some parameter. (A different parameter every day.) The fraction of times your interval contains the true parameter will not be 95 percent. Here are some examples to make this clear.

Example 3 (Normal means). Let X_i ~ N(µ_i, 1), i = 1, ..., n. Suppose we use the flat prior π(µ_1, ..., µ_n) ∝ 1. Then, with µ = (µ_1, ..., µ_n), the posterior for µ is multivariate Normal with mean X = (X_1, ..., X_n) and covariance matrix equal to the identity matrix. Let θ = Σ_{i=1}^n µ_i². Let C_n = [c_n, ∞) where c_n is chosen so that P(θ ∈ C_n | X_1, ..., X_n) = .95. How often, in the frequentist sense, does C_n trap θ? Stein (1959) showed that

P_µ(θ ∈ C_n) → 0, as n → ∞.

Thus, P_µ(θ ∈ C_n) ≈ 0 even though P(θ ∈ C_n | X_1, ..., X_n) = .95.

Example 4 (Sampling to a Foregone Conclusion). Let X_1, X_2, ... ~ N(θ, 1). Suppose we continue sampling until T_n > k where T_n = √n |X̄_n| and k is a fixed number, say, k = 10. The sample size N is now a random variable. It can be shown that P(N < ∞) = 1. It can also be shown that the posterior π(θ | X_1, ..., X_N) is the same as if N had been fixed in advance. That is, the randomness in N does not affect the posterior. Now if the prior π(θ) is smooth then the posterior is approximately θ | X_1, ..., X_N ~ N(X̄_N, 1/N). Hence, if C_N = X̄_N ± 1.96/√N then P(θ ∈ C_N | X_1, ..., X_N) ≈ .95. Notice that 0 is never in C_N since, when we stop sampling, T_N > 10, and therefore

|X̄_N| − 1.96/√N > 10/√N − 1.96/√N > 0.  (2)

Hence, when θ = 0, P_θ(θ ∈ C_N) = 0. Thus, the coverage is

Coverage = inf_θ P_θ(θ ∈ C_N) = 0.

This is called sampling to a foregone conclusion and is a real issue in sequential clinical trials.

Example 5. Here is an example we discussed earlier. Let C = {c_1, ..., c_N} be a finite set of constants. For simplicity, assume that c_j ∈ {0, 1} (although this is not important). Let θ = N^{-1} Σ_{j=1}^N c_j. Suppose we want to estimate θ. We proceed as follows. Let S_1, ..., S_N ~ Bernoulli(π) where π is known. If S_i = 1 you get to see c_i. Otherwise, you do not. (This is an example of survey sampling.) The likelihood function is

Π_i π^{S_i} (1 − π)^{1−S_i}.

The unknown parameter does not appear in the likelihood. In fact, there are no unknown parameters in the likelihood! The likelihood function contains no information at all. The posterior is the same as the prior. But we can estimate θ. Let

θ̂ = (1/(N π)) Σ_{j=1}^N c_j S_j.

Then E(θ̂) = θ. Hoeffding's inequality implies that

P( |θ̂ − θ| > ε ) ≤ 2 e^{−2 N ε² π²}.

Hence, θ̂ is close to θ with high probability. In particular, a 1 − α confidence interval is θ̂ ± √( log(2/α) / (2 N π²) ).

6 Bayesian Computing

If θ = (θ_1, ..., θ_p) is a vector then the posterior π(θ | X_1, ..., X_n) is a multivariate distribution. If you are interested in one parameter, θ_1 for example, then you need to find the marginal posterior:

π(θ_1 | X_1, ..., X_n) = ∫ π(θ_1, ..., θ_p | X_1, ..., X_n) dθ_2 ··· dθ_p.

Usually, this integral is intractable. In practice, we resort to Monte Carlo methods such as Markov chain Monte Carlo; these are discussed in other courses.

7 Bayesian Hypothesis Testing

Bayesian hypothesis testing can be done as follows. Suppose that θ ∈ R and we want to test

H_0: θ = θ_0 versus H_1: θ ≠ θ_0.

If we really believe that there is a positive prior probability that H_0 is true then we can use a prior of the form

a δ_{θ_0} + (1 − a) g(θ)

where 0 < a < 1 is the prior probability that H_0 is true and g is a smooth prior density over θ which represents our prior beliefs about θ when H_0 is false. It follows from Bayes' theorem that

P(θ = θ_0 | X_1, ..., X_n) = a p(X_1, ..., X_n | θ_0) / ( a p(X_1, ..., X_n | θ_0) + (1 − a) ∫ p(X_1, ..., X_n | θ) g(θ) dθ )
                          = a L(θ_0) / ( a L(θ_0) + (1 − a) m )

where m = ∫ L(θ) g(θ) dθ. It can be shown that P(θ = θ_0 | X_1, ..., X_n) is very sensitive to the choice of g. Sometimes, people like to summarize the test by using the Bayes factor B, which is defined to be the posterior odds divided by the prior odds:

B = posterior odds / prior odds

where

posterior odds = P(θ = θ_0 | X_1, ..., X_n) / ( 1 − P(θ = θ_0 | X_1, ..., X_n) )
              = [ a L(θ_0) / (a L(θ_0) + (1 − a) m) ] / [ (1 − a) m / (a L(θ_0) + (1 − a) m) ]
              = a L(θ_0) / ( (1 − a) m )

and

prior odds = P(θ = θ_0) / P(θ ≠ θ_0) = a / (1 − a),

and hence

B = L(θ_0) / m.

Example 6. Let X_1, ..., X_n ~ N(θ, 1). Let's test H_0: θ = 0 versus H_1: θ ≠ 0. Suppose we take g(θ) to be N(0, 1). Thus, g(θ) = (1/√(2π)) e^{−θ²/2}. Let us further take a = 1/2. Then, after some tedious integration to compute m(X_1, ..., X_n), we get

P(θ = θ_0 | X_1, ..., X_n) = L(0) / ( L(0) + m ) = e^{−n X̄²/2} / ( e^{−n X̄²/2} + (1/√(n+1)) e^{−n X̄²/(2(n+1))} ).

On the other hand, the p-value for the usual test is p = 2Φ(−√n |X̄_n|). Figure 1 shows the posterior of H_0 and the p-value as a function of X̄_n when n = 100. Note that they are very different. Unlike in estimation, in testing there is little agreement between Bayes and frequentist methods.

8 Conclusion

Bayesian and frequentist inference are answering two different questions. Frequentist inference answers the question: How do I construct a procedure that has frequency guarantees? Bayesian inference answers the question: How do I update my subjective beliefs after I observe some data? In parametric models, if n is large and the dimension of the model is fixed, Bayes and frequentist procedures will be similar. Otherwise, they can be quite different.
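The contrast in Example 6 is easy to reproduce. The following is a minimal R sketch (not from the notes) that plots P(θ = 0 | data) and the p-value as functions of the sample mean for n = 100, as in Figure 1; the grid of values is an arbitrary choice.

    # Posterior probability of H0 versus the p-value, as functions of the sample mean.
    n    <- 100
    xbar <- seq(0, 0.5, length.out = 200)

    post_H0 <- exp(-n * xbar^2 / 2) /
      (exp(-n * xbar^2 / 2) + exp(-n * xbar^2 / (2 * (n + 1))) / sqrt(n + 1))
    pval <- 2 * pnorm(-sqrt(n) * abs(xbar))

    matplot(xbar, cbind(post_H0, pval), type = "l", lty = c(1, 2), col = 1,
            xlab = "sample mean", ylab = "probability")
    legend("topright", c("P(H0 | data)", "p-value"), lty = c(1, 2))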

Figure 1: Solid line: P(θ = 0 | X_1, ..., X_n) versus X̄_n. Dashed line: p-value versus X̄_n.

Lecture Notes 15: Prediction

This is mostly not in the text. Some relevant material is in Chapters 11 and 12.

1 Introduction

We observe training data (X_1, Y_1), ..., (X_n, Y_n). Given a new pair (X, Y) we want to predict Y from X. There are two common versions:

1. Y ∈ {0, 1}. This is called classification, or discrimination, or pattern recognition. (More generally, Y can be discrete.)
2. Y ∈ R. This is called regression.

For classification we will use the following loss function. Let h(x) be our prediction of Y when X = x. Thus h(x) ∈ {0, 1}. The function h is called a classifier. The classification loss is I(Y ≠ h(X)) and the classification risk is

R(h) = P(Y ≠ h(X)) = E( I(Y ≠ h(X)) ).

For regression, suppose our prediction of Y when X = x is g(x). We will use the squared error prediction loss (Y − g(X))² and then the risk is

R(g) = E(Y − g(X))².

2 Regression

Theorem 1. R(g) is minimized by

m(x) = E(Y | X = x) = ∫ y p(y | x) dy.

Proof. Let g(x) be any function of x. Then

R(g) = E(Y − g(X))² = E(Y − m(X) + m(X) − g(X))²
     = E(Y − m(X))² + E(m(X) − g(X))² + 2 E( (Y − m(X))(m(X) − g(X)) )
     ≥ E(Y − m(X))² + 2 E( (Y − m(X))(m(X) − g(X)) )
     = E(Y − m(X))² + 2 E E[ (Y − m(X))(m(X) − g(X)) | X ]
     = E(Y − m(X))² + 2 E[ (E(Y|X) − m(X))(m(X) − g(X)) ]
     = E(Y − m(X))² + 2 E[ (m(X) − m(X))(m(X) − g(X)) ]
     = E(Y − m(X))² = R(m).

Hence, to make predictions, we need to estimate m(x) = E(Y | X = x). The simplest approach is to use a parametric model. In particular, the linear regression model assumes that m(x) is a linear function of x. (More precisely, we seek the best linear predictor.) Suppose that X_i ∈ R^p so that X_i = (X_{i1}, ..., X_{ip})^T. Then the linear regression model is

m(x) = β_0 + Σ_{j=1}^p β_j x_j.

We can write

Y_i = β_0 + Σ_{j=1}^p β_j X_{ij} + ε_i,  i = 1, ..., n

where ε_1, ..., ε_n are iid with mean 0. If we use the convention that X_{i1} = 1 then we can write the model more simply as

Y_i = Σ_{j=1}^p β_j X_{ij} + ε_i = β^T X_i + ε_i,  i = 1, ..., n  (1)

where β = (β_1, ..., β_p)^T and X_i = (X_{i1}, ..., X_{ip})^T. Let us define Y = (Y_1, ..., Y_n)^T, ε = (ε_1, ..., ε_n)^T and let X be the n × p matrix with X(i, j) = X_{ij}. Then we can write (1) as

Y = Xβ + ε.

The least squares estimator β̂ is the β that minimizes

Σ_{i=1}^n (Y_i − X_i^T β)² = ||Y − Xβ||².

Theorem 2. Suppose that X^T X is invertible. Then the least squares estimator is

β̂ = (X^T X)^{-1} X^T Y.

The fitted values or predicted values are Ŷ = (Ŷ_1, ..., Ŷ_n)^T where Ŷ_i = X_i^T β̂. Hence,

Ŷ = X β̂ = HY

where

H = X (X^T X)^{-1} X^T

is called the hat matrix.

Theorem 3. The matrix H is symmetric and idempotent: H² = H. Moreover, HY is the projection of Y onto the column space of X. This is discussed in more detail in the regression courses.

Theorem 4. Suppose that the linear model is correct. Also, suppose that Var(ε_i) = σ². Then

√n (β̂ − β) ≈ N(0, σ² (X^T X / n)^{-1}).

This model is virtually never correct, so view this result with caution.
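Theorem 2 is a one-line computation. The following is a minimal R sketch (not from the notes); the simulated design, n, p, and the true β are illustrative, and lm() is used only as a check.

    # Least squares via the normal equations (Theorem 2) and the hat matrix H.
    set.seed(1)
    n <- 100; p <- 3
    X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # first column of 1's (intercept)
    beta <- c(1, 2, -1)
    Y <- X %*% beta + rnorm(n)

    XtX_inv  <- solve(t(X) %*% X)
    beta_hat <- XtX_inv %*% t(X) %*% Y        # (X'X)^{-1} X'Y
    H        <- X %*% XtX_inv %*% t(X)        # hat matrix; H %*% Y gives fitted values

    max(abs(beta_hat - coef(lm(Y ~ X - 1))))  # agrees with lm() up to rounding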

Under the (questionable) assumption that the linear model is correct, we can also say the following. A consistent estimator of σ² is

σ̂² = RSS / (n − p)

and

(β̂_j − β_j) / s_j ≈ N(0, 1)

where the standard error s_j is the square root of the j-th diagonal element of σ̂² (X^T X)^{-1}. To test H_0: β_j = 0 versus H_1: β_j ≠ 0 we reject if |β̂_j / s_j| > z_{α/2}. An approximate 1 − α confidence interval for β_j is β̂_j ± z_{α/2} s_j.

Theorem 5. Suppose that the linear model is correct and that ε_1, ..., ε_n ~ N(0, σ²). Then the least squares estimator is the maximum likelihood estimator.

3 Linear Prediction When the Model is Wrong

When the model is wrong (and it always is) the least squares estimator still has the following good property. Let β_* minimize

R(β) = E(Y − X^T β)².

We call ℓ_*(x) = x^T β_* the best linear predictor.

Theorem 6. Under weak conditions, R(β̂) − R(β_*) →P 0. Hence, the least squares estimator approximates the best linear predictor.

Let's prove this in the case with one covariate. Then

R(β) = E(Y − Xβ)² = E(Y²) − 2β E(XY) + β² E(X²).

Minimizing with respect to β we get

β_* = E(XY) / E(X²)

assuming that 0 < E(X²) < ∞ and E(XY) < ∞. Now

β̂ = Σ_i X_i Y_i / Σ_i X_i² = ( (1/n) Σ_i X_i Y_i ) / ( (1/n) Σ_i X_i² ).

By the law of large numbers and the continuous mapping theorem, β̂ →P β_*. Since R(β) is a continuous function of β, it follows from the continuous mapping theorem that R(β̂) →P R(β_*). In fact,

β̂ = ( E(XY) + O_P(1/√n) ) / ( E(X²) + O_P(1/√n) ) = β_* + O_P(1/√n)

and

R(β̂) = R(β_*) + (β̂ − β_*) R′(β_*) + o(|β̂ − β_*|),

and so

R(β̂) ≤ R(β_*) + O_P(1/√n).

The message here is that least squares estimates the best linear predictor: we don't have to assume that the truth is linear.

4 Nonparametric Regression

Suppose we want to estimate m(x) where we only assume that m is a smooth function. The kernel regression estimator is

m̂(x) = Σ_i Y_i w_i(x)

where

w_i(x) = K( (x − X_i)/h ) / Σ_j K( (x − X_j)/h ).

Here K is a kernel and h is a bandwidth. The properties are similar to those of kernel density estimation. The properties of m̂ are discussed in more detail in other courses. An example is shown in Figure 1.

Figure 1: A kernel regression estimator.
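The kernel regression estimator is simple to code. The following is a minimal R sketch (not from the notes); the simulated data, the Gaussian kernel, and the bandwidth h = 0.05 are illustrative choices.

    # Nadaraya-Watson kernel regression m_hat(x) = sum_i Y_i w_i(x), Gaussian kernel.
    set.seed(1)
    n <- 100
    x <- runif(n)
    y <- sin(4 * pi * x) + rnorm(n, sd = 0.3)

    kernel_reg <- function(x0, x, y, h) {
      sapply(x0, function(u) {
        w <- dnorm((u - x) / h)      # K((x0 - X_i)/h)
        sum(w * y) / sum(w)          # weighted average of the Y_i
      })
    }

    grid <- seq(0, 1, length.out = 200)
    plot(x, y, col = "grey")
    lines(grid, kernel_reg(grid, x, y, h = 0.05))   # h chosen by eye here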

5 Classification

The best classifier is the so-called Bayes classifier defined by

h_B(x) = I( m(x) ≥ 1/2 )

where m(x) = E(Y | X = x) = P(Y = 1 | X = x).

Theorem 7. For any h, R(h) ≥ R(h_B).

Proof. For any h,

R(h) − R(h_B) = P(Y ≠ h(X)) − P(Y ≠ h_B(X))
             = ∫ P(Y ≠ h(x) | X = x) p(x) dx − ∫ P(Y ≠ h_B(x) | X = x) p(x) dx
             = ∫ ( P(Y ≠ h(x) | X = x) − P(Y ≠ h_B(x) | X = x) ) p(x) dx.

We will show that P(Y ≠ h(x) | X = x) − P(Y ≠ h_B(x) | X = x) ≥ 0 for all x. Now

P(Y ≠ h(x) | X = x) − P(Y ≠ h_B(x) | X = x)
  = ( h(x) P(Y = 0 | X = x) + (1 − h(x)) P(Y = 1 | X = x) ) − ( h_B(x) P(Y = 0 | X = x) + (1 − h_B(x)) P(Y = 1 | X = x) )
  = ( h(x)(1 − m(x)) + (1 − h(x)) m(x) ) − ( h_B(x)(1 − m(x)) + (1 − h_B(x)) m(x) )
  = 2 ( m(x) − 1/2 )( h_B(x) − h(x) ) ≥ 0

since h_B(x) = 1 if and only if m(x) ≥ 1/2.

The most direct approach to classification is empirical risk minimization (ERM). We start with a set of classifiers H. Each h ∈ H is a function h: x → {0, 1}. The training error or empirical risk is

R̂(h) = (1/n) Σ_{i=1}^n I(Y_i ≠ h(X_i)).

We choose ĥ to minimize R̂:

ĥ = argmin_{h ∈ H} R̂(h).

For example, a linear classifier has the form h_β(x) = I(β^T x ≥ 0). The set of linear classifiers is H = { h_β : β ∈ R^p }.

Theorem 8. Suppose that H has VC dimension d < ∞. Let ĥ be the empirical risk minimizer and let h_* = argmin_{h ∈ H} R(h) be the best classifier in H. Then, for any ε > 0,

P( R(ĥ) > R(h_*) + 2ε ) ≤ c_1 n^d e^{−c_2 n ε²}

for some constants c_1 and c_2.

Proof. Recall that

P( sup_{h ∈ H} |R̂(h) − R(h)| > ε ) ≤ c_1 n^d e^{−c_2 n ε²}.

But when sup_{h ∈ H} |R̂(h) − R(h)| ≤ ε we have

R(ĥ) ≤ R̂(ĥ) + ε ≤ R̂(h_*) + ε ≤ R(h_*) + 2ε.

Empirical risk minimization is difficult because R̂(h) is not a smooth function. Thus, we often use other approaches. One idea is to use a surrogate loss function. To explain this idea, it will be convenient to relabel the Y_i's as being +1 or −1. Many classifiers then take the form

h(x) = sign(f(x))

for some f(x). For example, linear classifiers have f(x) = x^T β. The classification loss is then

L(Y, f, X) = I( Y f(X) < 0 )

since an error occurs if and only if Y and f(X) have different signs. An example of a surrogate loss is the hinge function (1 − Y f(X))_+. Instead of minimizing classification loss, we minimize

(1/n) Σ_i ( 1 − Y_i f(X_i) )_+.

The resulting classifier is called a support vector machine.

Another approach to classification is plug-in classification. We replace the Bayes rule h_B(x) = I(m(x) ≥ 1/2) with

ĥ(x) = I( m̂(x) ≥ 1/2 )

where m̂ is an estimate of the regression function. The estimate m̂ can be parametric or nonparametric. A common parametric estimator is logistic regression. Here, we assume that

m(x; β) = e^{x^T β} / ( 1 + e^{x^T β} ).

Since Y_i is Bernoulli, the likelihood is

L(β) = Π_{i=1}^n m(X_i; β)^{Y_i} ( 1 − m(X_i; β) )^{1 − Y_i}.

We compute the mle β̂ numerically. See Section 12.3 of the text.

What is the relationship between classification and regression? Generally speaking, classification is easier. This follows from the next result.

Theorem 9. Let m(x) = E(Y | X = x) and let h_m(x) = I(m(x) ≥ 1/2) be the Bayes rule. Let g be any function and let h_g(x) = I(g(x) ≥ 1/2). Then

R(h_g) − R(h_m) ≤ 2 √( ∫ |g(x) − m(x)|² dP(x) ).

Proof. We showed earlier that

R(h_g) − R(h_m) = ∫ [ P(Y ≠ h_g(x) | X = x) − P(Y ≠ h_m(x) | X = x) ] dP(x)

and that

P(Y ≠ h_g(x) | X = x) − P(Y ≠ h_m(x) | X = x) = 2 ( m(x) − 1/2 )( h_m(x) − h_g(x) ).

Now

2 ( m(x) − 1/2 )( h_m(x) − h_g(x) ) = 2 |m(x) − 1/2| I( h_m(x) ≠ h_g(x) ) ≤ 2 |m(x) − g(x)|

since h_m(x) ≠ h_g(x) implies that |m(x) − 1/2| ≤ |m(x) − g(x)|. Hence,

R(h_g) − R(h_m) = 2 ∫ |m(x) − 1/2| I( h_m(x) ≠ h_g(x) ) dP(x) ≤ 2 ∫ |m(x) − g(x)| dP(x) ≤ 2 √( ∫ |g(x) − m(x)|² dP(x) )

where the last step follows from the Cauchy-Schwarz inequality.

Hence, if we have an estimator m̂ such that ∫ |m̂(x) − m(x)|² dP(x) is small, then the excess classification risk is also small. But the reverse is not true.
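Plug-in classification with logistic regression can be done in a few lines. The following is a minimal R sketch (not from the notes); the simulated data and the true coefficients are illustrative choices.

    # Plug-in classification: estimate m(x) = P(Y=1 | X=x) with glm(), then
    # classify with h_hat(x) = I(m_hat(x) >= 1/2).
    set.seed(1)
    n <- 500
    x1 <- rnorm(n); x2 <- rnorm(n)
    p  <- plogis(1 + 2 * x1 - x2)            # true m(x) under a logistic model
    y  <- rbinom(n, 1, p)
    dat <- data.frame(y, x1, x2)

    fit   <- glm(y ~ x1 + x2, family = binomial, data = dat)   # mle of beta
    m_hat <- predict(fit, type = "response")                   # estimated m(X_i)
    h_hat <- as.integer(m_hat >= 0.5)                          # plug-in classifier

    mean(h_hat != y)                                           # training error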

Lecture Notes 16: Model Selection

Not in the text.

1 Introduction

Sometimes we have a set of possible models and we want to choose the best model. Model selection methods help us choose a good model. Here are some examples.

Example 1. Suppose you use a polynomial to model the regression function:

m(x) = E(Y | X = x) = β_0 + β_1 x + ··· + β_p x^p.

You will need to choose the order of the polynomial p. We can think of this as a sequence of models M_1, ..., M_p, ... indexed by p.

Example 2. Suppose you have data Y_1, ..., Y_n on age at death for n people. You want to model the distribution of Y. Some popular models are:

1. M_1: the exponential distribution: f(y; θ) = θ e^{−θy}.
2. M_2: the gamma distribution: f(y; a, b) = (b^a / Γ(a)) y^{a−1} e^{−by}.
3. M_3: the log-normal distribution: we take log Y ~ N(µ, σ²).

Example 3. Suppose you have time series data Y_1, Y_2, .... A common model is the AR (autoregressive) model:

Y_t = a_1 Y_{t−1} + a_2 Y_{t−2} + ··· + a_k Y_{t−k} + ε_t

where ε_t ~ N(0, σ²). The number k is called the order of the model. We need to choose k.

Example 4. In a linear regression model, you need to choose which variables to include in the regression. This is called variable selection. This problem is discussed at length in the regression courses.

The most common model selection methods are:

1. AIC (and related methods like C_p).
2. Cross-validation.
3. BIC (and related methods like MDL, Bayesian model selection).

We need to distinguish between two goals:

1. Find the model that gives the best prediction (without assuming that any of the models are correct).
2. Assume one of the models is the true model and find the true model.

Generally speaking, AIC and cross-validation are used for goal 1 while BIC is used for goal 2.

2 AIC

Suppose we have models M_1, ..., M_k where each model is a set of densities:

M_j = { p(y; θ_j) : θ_j ∈ Θ_j }.

We have data Y_1, ..., Y_n drawn from some density f. We do not assume that f is in any of the models.

Let θ̂_j be the mle from model j. An estimate of f, based on model j, is f̂_j(y) = p(y; θ̂_j). The quality of f̂_j(y) as an estimate of f can be measured by the Kullback-Leibler distance:

K(f, f̂_j) = ∫ f(y) log( f(y) / f̂_j(y) ) dy = ∫ f(y) log f(y) dy − ∫ f(y) log f̂_j(y) dy.

The first term does not depend on j. So minimizing K(f, f̂_j) over j is the same as maximizing

K_j = ∫ f(y) log p(y; θ̂_j) dy.

We need to estimate K_j. Intuitively, you might think that a good estimate of K_j is

K̄_j = (1/n) Σ_{i=1}^n log p(Y_i; θ̂_j) = (1/n) ℓ_j(θ̂_j)

where ℓ_j(θ_j) is the log-likelihood function for model j. However, this estimate is very biased because the data are being used twice: first to get the mle and second to estimate the integral. Akaike showed that the bias is approximately d_j/n where d_j = dimension(Θ_j). Therefore we use

K̂_j = K̄_j − d_j/n.

Now, define

AIC(j) = 2n K̂_j = 2 ℓ_j(θ̂_j) − 2 d_j.

Notice that maximizing K̂_j is the same as maximizing AIC(j) over j. Why do we multiply by 2n? Just for historical reasons. We can multiply by any constant; it won't change which model we pick. In fact, different texts use different versions of AIC. AIC stands for Akaike Information Criterion. Akaike was a famous Japanese statistician who died recently (August 2009).

3 Theoretical Derivation of AIC

Let us now look closer to see where the formulas come from. Recall that

K_j = ∫ f(y) log p(y; θ̂_j) dy.

For simplicity, let us focus on one model and drop the subscript j. We want to estimate

K = ∫ f(y) log p(y; θ̂) dy.

Our goal is to show that

E( K̄ − K ) ≈ d/n,  where  K̄ = (1/n) Σ_{i=1}^n log p(Y_i; θ̂)

and d is the dimension of θ.

Some Notation and Background. Let θ_0 minimize K(f, p(·; θ)). So p(y; θ_0) is the closest density in the model to the true density. Let ℓ(y, θ) = log p(y; θ), let

s(y, θ) = ∂ log p(y; θ) / ∂θ

be the score, and let H(y, θ) be the matrix of second derivatives. Let Z_n = √n (θ̂ − θ_0) and recall that

Z_n ≈ N(0, J^{-1} V J^{-1})

where J = −E[H(Y, θ_0)] and V = Var(s(Y, θ_0)). In class we proved that V = J. But that proof assumed the model was correct. We are not assuming that. Let

S_n = (1/√n) Σ_{i=1}^n s(Y_i, θ_0).

By the CLT,

S_n ≈ N(0, V).

Hence, in distribution,

J Z_n ≈ S_n.  (1)

Here we used the fact that Var(AX) = A Var(X) A^T. Thus Var(J Z_n) = J (J^{-1} V J^{-1}) J^T = V.

We will need one other fact. Let ε be a random vector with mean µ and covariance Σ. Let Q = ε^T A ε. (Q is called a quadratic form.) Then

E(Q) = trace(AΣ) + µ^T A µ.

The details. By using a Taylor series,

K ≈ ∫ f(y) [ log p(y; θ_0) + (θ̂ − θ_0)^T s(y, θ_0) + (1/2)(θ̂ − θ_0)^T H(y, θ_0)(θ̂ − θ_0) ] dy = K_0 − (1/(2n)) Z_n^T J Z_n

where K_0 = ∫ f(y) log p(y; θ_0) dy. The second term dropped out because, like the score function, ∫ f(y) s(y, θ_0) dy = 0. Again we do a Taylor series to get

K̄ ≈ (1/n) Σ_{i=1}^n [ ℓ(Y_i, θ_0) + (θ̂ − θ_0)^T s(Y_i, θ_0) + (1/2)(θ̂ − θ_0)^T H(Y_i, θ_0)(θ̂ − θ_0) ]
   = K_0 + A_n + (1/n) Z_n^T S_n − (1/(2n)) Z_n^T J_n Z_n
   ≈ K_0 + A_n + (1/n) Z_n^T S_n − (1/(2n)) Z_n^T J Z_n

where

J_n = −(1/n) Σ_{i=1}^n H(Y_i, θ_0) →P J,

and

A_n = (1/n) Σ_{i=1}^n ( ℓ(Y_i, θ_0) − K_0 ).

Hence,

K̄ − K ≈ A_n + (1/n) Z_n^T S_n ≈ A_n + (1/n) Z_n^T J Z_n

where we used (1). We conclude that

E(K̄ − K) ≈ E(A_n) + (1/n) E( Z_n^T J Z_n ) = 0 + (1/n) trace( J J^{-1} V J^{-1} ) = (1/n) trace( J^{-1} V ).

Hence, K ≈ K̄ − trace(J^{-1}V)/n. If the model is correct, then J = V so that trace(J^{-1}V) = trace(I) = d. Thus we would use K̂ = K̄ − d/n.

You can see that there are a lot of approximations and assumptions being used. So AIC is a very crude tool. Cross-validation is much more reliable.

4 Cross-Validation

There are various flavors of cross-validation. In general, the data are split into a training set and a test set. The models are fit on the training set and are used to predict the test set. Usually, many such splits are used and the results are averaged over splits. However, to keep things simple, we will use a single split.

Suppose again that we have models M_1, ..., M_k. Assume there are 2n data points. Split the data randomly into two halves that we will denote D = (Y_1, ..., Y_n) and T = (Ỹ_1, ..., Ỹ_n). Use D to find the mle's θ̂_j. Then define

K̂_j = (1/n) Σ_{i=1}^n log p(Ỹ_i; θ̂_j).

Note that E(K̂_j) = K_j; there is no bias because θ̂_j is independent of the Ỹ_i. We will assume that |log p(y; θ)| ≤ B < ∞. By Hoeffding's inequality,

P( max_j |K̂_j − K_j| > ε ) ≤ 2 k e^{−n ε² / (2B²)}.

Let

ε_n = B √( 2 log(2k/α) / n ).

Then

P( max_j |K̂_j − K_j| > ε_n ) ≤ α.

If we choose ĵ = argmax_j K̂_j, then, with probability at least 1 − α,

K_ĵ ≥ max_j K_j − 2 ε_n = max_j K_j − O( √( log k / n ) ).

So with high probability, you choose close to the best model. This argument can be improved and also applies to regression, classification etc. Of course, with regression, the loss function is E(Y − m(X))² and the cross-validation score is then

(1/n) Σ_{i=1}^n ( Ỹ_i − m̂(X̃_i) )².

For classification we use

(1/n) Σ_{i=1}^n I( Ỹ_i ≠ ĥ(X̃_i) ).

We have made essentially no assumptions or approximations. (The bound on log f can be relaxed.) The beauty of cross-validation is its simplicity and generality. It can be shown that AIC and cross-validation have very similar behavior. But cross-validation works under weaker conditions.

5 BIC

BIC stands for Bayesian Information Criterion. It is also known as the Schwarz Criterion after Gideon Schwarz. It is virtually identical to the MDL (minimum description length) criterion.

We choose j to maximize

BIC_j = ℓ_j(θ̂_j) − (d_j/2) log n.

This is the same as AIC but the penalty is harsher. Thus, BIC tends to choose simpler models. Here is the derivation. We put a prior π_j(θ_j) on the parameter θ_j. We also put a prior probability p_j that model M_j is the true model. By Bayes' theorem,

P(M_j | Y_1, ..., Y_n) ∝ p(Y_1, ..., Y_n | M_j) p_j.

Furthermore,

p(Y_1, ..., Y_n | M_j) = ∫ p(Y_1, ..., Y_n | M_j, θ_j) π_j(θ_j) dθ_j = ∫ L(θ_j) π_j(θ_j) dθ_j.

We now choose j to maximize P(M_j | Y_1, ..., Y_n). Equivalently, we choose j to maximize

log ∫ L(θ_j) π_j(θ_j) dθ_j + log p_j.

Some Taylor series approximations show that

log ∫ L(θ_j) π_j(θ_j) dθ_j + log p_j ≈ ℓ_j(θ̂_j) − (d_j/2) log n = BIC_j.

What happened to the prior? It can be shown that the terms involving the prior are lower order than the terms that appear in the formula for BIC_j, so they have been dropped.

BIC behaves quite differently than AIC or cross-validation. It is also based on different assumptions. BIC assumes that one of the models is true and that you are trying to find the model most likely to be true in the Bayesian sense. AIC and cross-validation are trying to find the model that predicts the best.

6 Model Averaging

Bayesian Approach. Suppose we want to predict a new observation Y. Let D = {Y_1, ..., Y_n} be the observed data. Then

p(y | D) = Σ_j p(y | D, M_j) P(M_j | D)

where

P(M_j | D) = ∫ L(θ_j) π_j(θ_j) dθ_j / Σ_s ∫ L(θ_s) π_s(θ_s) dθ_s ≈ e^{BIC_j} / Σ_s e^{BIC_s}.

Frequentist Approach. There is a large and growing literature on frequentist model averaging. It is discussed in other courses.

7 A Simple Normal Example

Let Y_1, ..., Y_n ~ N(µ, 1). We want to compare two models:

M_0: N(0, 1), and M_1: N(µ, 1).

Hypothesis Testing. We test

H_0: µ = 0 versus H_1: µ ≠ 0.

The test statistic is

Z = (Ȳ − 0) / √Var(Ȳ) = √n Ȳ.

We reject H_0 if |Z| > z_{α/2}. For α = 0.05, we reject H_0 if |Z| > 2, i.e., if |Ȳ| > 2/√n.

AIC. The likelihood is proportional to

L(µ) = Π_{i=1}^n e^{−(Y_i − µ)²/2} = e^{−n(Ȳ − µ)²/2} e^{−S²/2}

where S² = Σ_i (Y_i − Ȳ)². Hence,

ℓ(µ) = −n(Ȳ − µ)²/2 − S²/2.

Recall that AIC(j) = 2 ℓ_j(θ̂_j) − 2 d_j. The AIC scores are

AIC_0 = 2 ℓ(0) − 0 = −n Ȳ² − S²

and

AIC_1 = 2 ℓ(µ̂) − 2 = −S² − 2

since µ̂ = Ȳ. We choose model 1 if

AIC_1 > AIC_0,

that is, if

−S² − 2 > −n Ȳ² − S²,  or  |Ȳ| > √(2/n).

Similar to, but not the same as, the hypothesis test.

BIC. The BIC scores are

BIC_0 = ℓ(0) − 0 = −n Ȳ²/2 − S²/2

and

BIC_1 = ℓ(µ̂) − (1/2) log n = −S²/2 − (1/2) log n.

We choose model 1 if

BIC_1 > BIC_0,

that is, if

|Ȳ| > √( log n / n ).

Hypothesis testing: controls type I errors.
AIC/CV/C_p: finds the most predictive model.
BIC: finds the true model (with high probability).
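The following is a minimal R sketch (not from the notes) comparing AIC, BIC, and split-sample cross-validation for two of the models in Example 2 (exponential versus log-normal); the gamma-distributed data and the sample sizes are illustrative, so neither candidate model is true.

    # Compare AIC, BIC, and split-sample CV (larger is better for each score).
    set.seed(1)
    y <- rgamma(200, shape = 2, rate = 1)
    n <- length(y)

    loglik_exp   <- function(y, rate) sum(dexp(y, rate, log = TRUE))
    loglik_lnorm <- function(y, m, s) sum(dlnorm(y, m, s, log = TRUE))

    # mles (closed form); AIC = 2*loglik - 2d, BIC = loglik - (d/2)*log(n)
    rate_hat <- 1 / mean(y); m_hat <- mean(log(y)); s_hat <- sd(log(y))
    l_exp <- loglik_exp(y, rate_hat);  l_ln <- loglik_lnorm(y, m_hat, s_hat)
    AIC_scores <- c(exp = 2 * l_exp - 2 * 1, lnorm = 2 * l_ln - 2 * 2)
    BIC_scores <- c(exp = l_exp - 0.5 * log(n), lnorm = l_ln - 1 * log(n))

    # split-sample cross-validation: fit on one half, score held-out log-likelihood
    idx <- sample(n, n / 2); tr <- y[idx]; te <- y[-idx]
    CV_scores <- c(exp   = loglik_exp(te, 1 / mean(tr)),
                   lnorm = loglik_lnorm(te, mean(log(tr)), sd(log(tr))))

    rbind(AIC = AIC_scores, BIC = BIC_scores, CV = CV_scores)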

Lecture Notes 17: Multiple Testing and Confidence Intervals

Suppose we need to test many null hypotheses H_{0,1}, ..., H_{0,N} where N could be very large. We cannot simply test each hypothesis at level α because, if N is large, we are sure to make lots of type I errors just by chance. We need to do some sort of multiplicity adjustment.

Familywise Error Control. Suppose we get a p-value p_j for each null hypothesis. Let I = { i : H_{0,i} is true }. If we reject H_{0,i} for any i ∈ I then we have made an error. Let R = { j : we reject H_{0,j} } be the set of hypotheses we reject. We say that we have controlled the familywise error rate at level α if

P( R ∩ I ≠ ∅ ) ≤ α.

The easiest way to control the familywise error rate is the Bonferroni method. The idea is to reject H_{0,i} if and only if p_i < α/N. Then

P(making a false rejection) = P( p_i < α/N for some i ∈ I )
                            ≤ Σ_{i ∈ I} P( p_i < α/N )
                            = Σ_{i ∈ I} α/N = α |I| / N ≤ α

since p_i ~ Unif(0, 1) for i ∈ I. So we have overall control of the type I error. However, it can have low power.

The Normal Case. Suppose that we have N sample means Ȳ_1, ..., Ȳ_N, each based on n Normal observations with variance 1. So Ȳ_j ≈ N(µ_j, 1/n). To test H_{0,j}: µ_j = 0 we can use the test statistic T_j = √n |Ȳ_j|.

The p-value is p_j = 2Φ(−T_j). If we did uncorrected testing we would reject when p_j < α, which means T_j > z_{α/2}. A useful approximation is z_α ≈ √(2 log(1/α)). So we reject, roughly, when T_j > √(2 log(2/α)). Under the Bonferroni correction we reject when p_j < α/N, which corresponds to T_j > √(2 log(2N/α)), roughly. Hence, the familywise rejection threshold grows like √(2 log N).

False Discovery Control. The Bonferroni adjustment is very strict. A weaker type of control is based on the false discovery rate. Suppose we reject a set of hypotheses R. Define the false discovery proportion

FDP = |R ∩ I| / |R|

where the ratio is defined to be 0 in case both the numerator and denominator are 0. Our goal is to find a method for choosing R such that

FDR = E(FDP) ≤ α.

The Benjamini-Hochberg method works as follows:

1. Find the ordered p-values P_(1) < ··· < P_(N).
2. Let j = max{ i : P_(i) < iα/N }. Let T = P_(j).
3. Let R = { i : P_i ≤ T }.

Reference: Benjamini and Hochberg (1995).
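The Bonferroni and Benjamini-Hochberg procedures are easy to implement. The following is a minimal R sketch (not from the notes) mirroring the setup of the example that follows; N, the number of non-null means, and α are illustrative choices.

    # Bonferroni and Benjamini-Hochberg on a vector of p-values.
    set.seed(1)
    N <- 1000
    mu <- c(rep(3, 50), rep(0, N - 50))
    T_stat <- rnorm(N, mean = mu)             # Y_j ~ N(mu_j, 1); test H0j: mu_j = 0
    pvals <- 2 * pnorm(-abs(T_stat))
    alpha <- 0.05

    bonf_reject <- which(pvals < alpha / N)   # familywise control

    # Benjamini-Hochberg: largest i with P_(i) < i*alpha/N, reject all p <= P_(i).
    p_sorted <- sort(pvals)
    below <- which(p_sorted < (1:N) * alpha / N)
    bh_reject <- if (length(below) == 0) integer(0) else
      which(pvals <= p_sorted[max(below)])

    c(bonferroni = length(bonf_reject), BH = length(bh_reject))
    # Equivalently, p.adjust(pvals, "bonferroni") and p.adjust(pvals, "BH")
    # give the corresponding adjusted p-values.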

Let us see why this controls the FDR. Consider, in general, rejecting all hypotheses for which P_i < t. Let W_i = 1 if H_{0,i} is true and W_i = 0 otherwise. Let Ĝ be the empirical distribution of the p-values and let G(t) = E(Ĝ(t)). In this case,

FDP = ( (1/N) Σ_{i=1}^N W_i I(P_i < t) ) / ( (1/N) Σ_{i=1}^N I(P_i < t) ).

Hence,

E(FDP) ≈ E( (1/N) Σ_i W_i I(P_i < t) ) / E( (1/N) Σ_i I(P_i < t) )
       = ( (1/N) Σ_i W_i E(I(P_i < t)) ) / G(t)
       ≤ ( t |I| / N ) / G(t) ≤ t / G(t) ≈ t / Ĝ(t).

Let t = P_(i) for some i; then Ĝ(t) = i/N. Thus, FDR ≲ P_(i) N / i. Setting this equal to α we get P_(i) < iα/N, which is the Benjamini-Hochberg rule.

FDR control typically has higher power than familywise control. But they are controlling different things. You have to decide, based on the context, which is appropriate.

Example 1. Figure 1 shows an example where Y_j ~ N(µ_j, 1) for j = 1, ..., 1,000. In this example, µ_j = 3 for j ≤ 50 and µ_j = 0 for j > 50. The figure shows the test statistics, the p-values, the sorted log p-values with the Bonferroni threshold and the sorted log p-values with the FDR threshold (using α = 0.05). Bonferroni rejects 7 hypotheses while FDR rejects more.

Multiple Confidence Intervals. A similar problem occurs with confidence intervals. If we construct a confidence interval C for one parameter θ then P(θ ∈ C) ≥ 1 − α. But if we construct confidence intervals C_1, ..., C_N for N parameters θ_1, ..., θ_N then we want to ensure that

P(θ_j ∈ C_j for all j = 1, ..., N) ≥ 1 − α.

To do this, we construct each confidence interval C_j at level 1 − α/N. Then

P(θ_j ∉ C_j for some j) ≤ Σ_j P(θ_j ∉ C_j) ≤ Σ_j α/N = α.

Figure 1: Top left: 1,000 test statistics. Top right: the p-values. Bottom left: sorted log p-values and the Bonferroni threshold. Bottom right: sorted log p-values and the FDR threshold.

2 Causation

Most of statistics and machine learning is concerned with prediction. A typical question is: what is a good prediction of Y given that I observe that X = x? Causation is concerned with questions of the form: what is a good prediction of Y given that I set X = x? The difference between passively observing X = x and actively intervening and setting X = x is significant and requires different techniques and, typically, much stronger assumptions.

Consider this story. A mother notices that tall kids have a higher reading level than short kids. (This is because the tall kids are older.) The mother puts her small child on a device and stretches the child until he is tall. She is dismayed to find out that his reading level has not changed. The mother is correct that height and reading skill are associated. Put another way, you can use height to predict reading skill. But that does not imply that height causes reading skill. This is what statisticians mean when they say: correlation is not causation.

On the other hand, consider smoking and lung cancer. We know that smoking and lung cancer are associated. But we also believe that smoking causes lung cancer. In this case, we recognize that intervening and forcing someone to smoke does change his probability of getting lung cancer.

The difference between prediction (association/correlation) and causation is this: in prediction we are interested in

P(Y ∈ A | X = x),

which means: the probability that Y ∈ A given that we observe that X is equal to x. For causation we are interested in

P(Y ∈ A | set X = x),

which means: the probability that Y ∈ A given that we set X equal to x. Prediction is about passive observation. Causation is about active intervention. Most of statistics and

machine learning concerns prediction. But sometimes causation is the primary focus. The phrase correlation is not causation can be written mathematically as

P(Y ∈ A | X = x) ≠ P(Y ∈ A | set X = x).

Despite the fact that causation and association are different, people mix them up all the time, even people trained in statistics and machine learning. On TV recently there was a report that good health is associated with getting seven hours of sleep. So far so good. Then the reporter goes on to say that, therefore, everyone should strive to sleep exactly seven hours so they will be healthy. Wrong. That's confusing causation and association. Another TV report pointed out a correlation between people who brush their teeth regularly and low rates of heart disease. An interesting correlation. Then the reporter (a doctor in this case) went on to urge people to brush their teeth to save their hearts. Wrong!

To avoid this confusion we need a way to discuss causation mathematically. That is, we need some way to make P(Y ∈ A | set X = x) formal. There are two common ways to do this. One is to use counterfactuals. The other is to use causal graphs. These approaches are equivalent. They are two different languages for saying the same thing.

Causal inference is tricky and should be used with great caution. The main messages are:

1. Causal effects can be estimated consistently from randomized experiments.
2. It is difficult to estimate causal effects from observational (non-randomized) experiments.
3. All causal conclusions from observational studies should be regarded as very tentative.

Causal inference is a vast topic. We will only touch on the main ideas here.

Counterfactuals. Consider two variables X and Y. Suppose that X is a binary variable that represents some treatment. For example, X = 1 means the subject was treated and X = 0 means the subject was given placebo. The response variable Y is real-valued. We can address the problem of predicting Y from X by estimating E(Y | X = x). To address causal questions, we introduce counterfactuals. Let Y_1 denote the response we observe if the subject is treated, i.e. if we set X = 1. Let Y_0 denote the response we observe if the

subject is not treated, i.e. if we set X = 0. If we treat a subject, we observe Y_1 but we do not observe Y_0. Indeed, Y_0 is the value we would have observed if the subject had not been treated. The unobserved variable is called a counterfactual. We have enlarged our set of variables from (X, Y) to (X, Y, Y_0, Y_1). Note that

Y = X Y_1 + (1 − X) Y_0.  (1)

A small dataset might look like this (the asterisks indicate unobserved variables):

X   Y   Y_0  Y_1
1   1   *    1
1   0   *    0
1   1   *    1
1   1   *    1
0   0   0    *
0   1   1    *
0   0   0    *
0   1   1    *

To answer causal questions, we are interested in the distribution p(y_0, y_1). We can interpret p(y_1) as p(y | set X = 1) and we can interpret p(y_0) as p(y | set X = 0). In particular, we might want to estimate the mean treatment effect or mean causal effect

θ = E(Y_1) − E(Y_0) = E(Y | set X = 1) − E(Y | set X = 0).

The parameter θ has the following interpretation: θ is the mean response if we forced everyone to take the treatment minus the mean response if we forced everyone not to take the treatment.

Suppose now that we observe a sample (X_1, Y_1), ..., (X_n, Y_n). Can we estimate θ? No. In general, there is no consistent estimator of θ. We can estimate

α = E(Y | X = 1) − E(Y | X = 0)

but α is not equal to θ.

However, suppose that we did a randomized experiment where we randomly assigned each person to treatment or placebo by the flip of a coin. In this case, X will be independent of (Y_0, Y_1). In symbols:

random treatment assignment implies: (Y_0, Y_1) ⊥ X.

Hence, in this case,

α = E(Y | X = 1) − E(Y | X = 0)
  = E(Y_1 | X = 1) − E(Y_0 | X = 0)   since Y = XY_1 + (1 − X)Y_0
  = E(Y_1) − E(Y_0) = θ               since (Y_0, Y_1) ⊥ X.

Hence, random assignment makes θ equal to α, and α can be consistently estimated. If X is randomly assigned then correlation = causation. This is why people spend millions of dollars doing randomized experiments.

In some cases it is not feasible to do a randomized experiment. Smoking and lung cancer is an example. Can we estimate causal parameters from observational (non-randomized) studies? The answer is: sort of. In an observational study, the treated and untreated groups will not be comparable. Maybe the healthy people chose to take the treatment and the unhealthy people didn't. In other words, X is not independent of (Y_0, Y_1). The treatment may have no effect but we would still see a strong association between Y and X. In other words, α might be large even though θ = 0.

To account for the differences in the groups, we might measure confounding variables. These are the variables that affect both X and Y. By definition, there are no such variables in a randomized experiment. The hope is that if we measure enough confounding variables Z = (Z_1, ..., Z_k), then, perhaps, the treated and untreated groups will be comparable, conditional on Z. Formally, we hope that X is independent of (Y_0, Y_1) conditional on Z. If this is true,

we can estimate θ since

θ = E(Y_1) − E(Y_0)
  = ∫ E(Y_1 | Z = z) p(z) dz − ∫ E(Y_0 | Z = z) p(z) dz
  = ∫ E(Y_1 | X = 1, Z = z) p(z) dz − ∫ E(Y_0 | X = 0, Z = z) p(z) dz
  = ∫ E(Y | X = 1, Z = z) p(z) dz − ∫ E(Y | X = 0, Z = z) p(z) dz  (2)

where we used the fact that X is independent of (Y_0, Y_1) conditional on Z in the third line and the fact that Y = XY_1 + (1 − X)Y_0 in the fourth line. The latter quantity can be estimated by

θ̂ = (1/n) Σ_{i=1}^n m̂(1, Z_i) − (1/n) Σ_{i=1}^n m̂(0, Z_i)

where m̂(x, z) is an estimate of the regression function m(x, z) = E(Y | X = x, Z = z). This is known as adjusting for confounders and θ̂ is called the adjusted treatment effect. It is instructive to compare the causal effect

θ = E(Y_1) − E(Y_0) = ∫ E(Y | X = 1, Z = z) p(z) dz − ∫ E(Y | X = 0, Z = z) p(z) dz

with the predictive quantity

α = E(Y | X = 1) − E(Y | X = 0) = ∫ E(Y | X = 1, Z = z) p(z | X = 1) dz − ∫ E(Y | X = 0, Z = z) p(z | X = 0) dz,

which are mathematically (and conceptually) quite different.

We need to treat θ̂ cautiously. It is very unlikely that we have successfully measured all the relevant confounding variables, so θ̂ should be regarded as a crude approximation to θ at best.

Causal Graphs. Another way to capture the difference between P(Y ∈ A | X = x) and P(Y ∈ A | set X = x) is to represent the distribution using a directed graph, and then we capture the second statement by performing certain operations on the graph.
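The adjusted treatment effect θ̂ can be illustrated by simulation. The following is a minimal R sketch (not from the notes); the data-generating mechanism, with Z affecting both X and Y and a true causal effect of 1, is an arbitrary illustrative choice, and a linear model is used for m(x, z).

    # Compare the unadjusted difference alpha_hat with the adjusted effect theta_hat (2).
    set.seed(1)
    n <- 5000
    Z <- rnorm(n)
    X <- rbinom(n, 1, plogis(2 * Z))        # confounded treatment assignment
    Y <- 1 * X + 3 * Z + rnorm(n)           # true mean causal effect theta = 1

    alpha_hat <- mean(Y[X == 1]) - mean(Y[X == 0])   # association; badly biased here

    m_hat <- lm(Y ~ X + Z)                  # estimate m(x, z) = E(Y | X = x, Z = z)
    theta_hat <- mean(predict(m_hat, data.frame(X = 1, Z = Z))) -
                 mean(predict(m_hat, data.frame(X = 0, Z = Z)))

    c(alpha_hat = alpha_hat, theta_hat = theta_hat)  # theta_hat should be near 1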

A Directed Acyclic Graph (DAG) is a graph for a set of variables with no cycles. The graph defines a set of distributions of the form

p(y_1, ..., y_k) = Π_j p( y_j | parents(y_j) )

where parents(y_j) are the parents of y_j. A causal graph is a DAG with extra information. A DAG is a causal graph if it correctly encodes the effect of setting a variable to a fixed value.

Consider a graph G in which Z has arrows into both X and Y and X has an arrow into Y (as in the middle panel of Figure 3). Here, X denotes treatment, Y is the response and Z is a confounding variable. To find the causal distribution p(y | set X = x) we do the following steps:

1. Form a new graph G_* by removing all arrows into X. Now set X equal to x. This corresponds to replacing the joint distribution p(x, y, z) = p(z) p(x | z) p(y | x, z) with the new distribution p_*(y, z) = p(z) p(y | x, z). The factor p(x | z) is removed because we now regard x as a fixed number.

2. Compute the distribution of y from the new distribution:

p(y | set X = x) ≡ p_*(y) = ∫ p_*(y, z) dz = ∫ p(z) p(y | x, z) dz.

Now we have that

θ = p(y | set X = 1) − p(y | set X = 0) form gives E(Y | set X = 1) − E(Y | set X = 0) = ∫ E(Y | X = 1, Z = z) p(z) dz − ∫ E(Y | X = 0, Z = z) p(z) dz.

This is precisely the same equation as (2). Both approaches lead to the same thing. If there were unobserved confounding variables, then the formula for θ would involve these variables and the causal effect would be non-estimable (as before). In a randomized experiment, there would be no arrow from Z to X. (That's the point of randomization.) In that case the above calculation shows that θ = E(Y | X = 1) − E(Y | X = 0), just as we saw with the counterfactual approach.

To understand the difference between p(y | x) and p(y | set x) more clearly, it is helpful to consider two different computer programs. Consider the DAG in Figure 2.

Figure 2: Conditioning versus intervening.

The probability function for a distribution consistent with this DAG has the form p(x, y, z) = p(x) p(y | x) p(z | x, y). The following is pseudocode for generating from this distribution.

    For i = 1, ..., n:
        x_i <- p_X(x_i)
        y_i <- p_{Y|X}(y_i | x_i)
        z_i <- p_{Z|X,Y}(z_i | x_i, y_i)

Suppose we run this code, yielding data (x_1, y_1, z_1), ..., (x_n, y_n, z_n). Among all the times that we observe Y = y, how often is Z = z? The answer to this question is given by the conditional distribution of Z given Y. Specifically,

P(Z = z | Y = y) = P(Y = y, Z = z) / P(Y = y) = p(y, z) / p(y)
                = Σ_x p(x, y, z) / p(y) = Σ_x p(x) p(y | x) p(z | x, y) / p(y)
                = Σ_x p(x, y) p(z | x, y) / p(y) = Σ_x p(z | x, y) p(x | y).

Now suppose we intervene by changing the computer code. Specifically, suppose we fix Y at the value y. The code now looks like this:

    set Y = y
    for i = 1, ..., n:
        x_i <- p_X(x_i)
        z_i <- p_{Z|X,Y}(z_i | x_i, y)

Having set Y = y, how often was Z = z? To answer, note that the intervention has changed the joint probability to be

p_*(x, z) = p(x) p(z | x, y).

The answer to our question is given by the marginal distribution

p_*(z) = Σ_x p_*(x, z) = Σ_x p(x) p(z | x, y).

This is p(z | set Y = y).

Example. You may have noticed a correlation between rain and having a wet lawn; that is, the variable Rain is not independent of the variable Wet Lawn, and hence p_{R,W}(r, w) ≠ p_R(r) p_W(w) where R denotes Rain and W denotes Wet Lawn. Consider the following two DAGs:

Rain → Wet Lawn;    Rain ← Wet Lawn.

The first DAG implies that p(w, r) = p(r) p(w | r) while the second implies that p(w, r) = p(w) p(r | w). No matter what the joint distribution p(w, r) is, both graphs are correct. Both imply that R and W are not independent. But, intuitively, if we want a graph to indicate causation, the first graph is right and the second is wrong. Throwing water on your lawn doesn't cause rain. The reason we feel the first is correct while the second is wrong is because the interventions implied by the first graph are correct. Look at the first graph and form the intervention W = 1, where 1 denotes wet lawn. Following the rules of intervention, we break the arrows into W to get the modified graph:

Rain;    set Wet Lawn = 1

with distribution p_*(r) = p(r). Thus

P(R = r | W := w) = P(R = r)

tells us that a wet lawn does not cause rain. Suppose we (wrongly) assume that the second graph is the correct causal graph and form the intervention W = 1 on the second graph. There are no arrows into W that need to be broken, so the intervention graph is the same as the original graph. Thus p_*(r) = p(r | w), which would imply that changing wetness changes rain. Clearly, this is nonsense. Both are correct probability graphs but only the first is correct causally. We know the correct causal graph by using background knowledge.

Learning Causal Structure? We could try to learn the correct causal graph from data but this is dangerous. In fact it is impossible with two variables. With more than two variables there are methods that can find the causal graph under certain assumptions, but they are large sample methods and, furthermore, there is no way to ever know if the sample size you have is large enough to make the methods reliable.

Randomization Again. We can use DAGs to represent confounding variables. If X is a treatment and Y is an outcome, a confounding variable Z is a variable with arrows into both X and Y; see Figure 3. It is easy to check, using the formalism of interventions, that the following facts are true:

In a randomized study, the arrow between Z and X is broken. In this case, even with Z unobserved (represented by enclosing Z in a circle), the causal relationship between X and Y is estimable because it can be shown that E(Y | X := x) = E(Y | X = x), which does not involve the unobserved Z.

In an observational study, with all confounders observed, we get E(Y | X := x) = ∫ E(Y | X = x, Z = z) p(z) dz, which is just the adjusted treatment effect. If Z is unobserved then we cannot estimate the causal effect because E(Y | X := x) = ∫ E(Y | X = x, Z = z) dF_Z(z) involves the unobserved Z. We can't just use X and Y since, in this case, P(Y = y | X = x) ≠ P(Y = y | X := x), which is just another way of saying that causation is not association.

Figure 3: Randomized study; observational study with measured confounders; observational study with unmeasured confounders. The circled variables are unobserved.

3 Individual Sequence Prediction

The goal is to predict y_t from y_1, ..., y_{t−1} with no assumptions on the sequence. The data are not assumed to be iid; they are not even assumed to be random. This is a version of online learning. For simplicity assume that y_t ∈ {0, 1}. Suppose we have a set of prediction algorithms (or experts):

F = {F_1, ..., F_N}.

Let F_{j,t} be the prediction of algorithm j at time t based on y^{t−1} = (y_1, ..., y_{t−1}). At time t:

1. You see y^{t−1} and (F_{1,t}, ..., F_{N,t}).
2. You predict P_t.
3. y_t is revealed.
4. You suffer loss ℓ(P_t, y_t).

We will focus on the loss ℓ(P_t, y_t) = |P_t − y_t|, but the theory works well for any convex loss. The cumulative loss of expert j is

L_j(y^n) = Σ_{t=1}^n |F_{j,t} − y_t|;  more generally write S_{j,t} = Σ_{s=1}^t |F_{j,s} − y_s|, so that L_j(y^n) = S_{j,n}.

The maximum regret is

R_n = max_{y^n ∈ {0,1}^n} ( L_P(y^n) − min_j L_j(y^n) )

(Reference: Prediction, Learning, and Games. Nicolò Cesa-Bianchi and Gábor Lugosi, 2006.)

and the minimax regret is

V_n = inf_P max_{y^n ∈ {0,1}^n} ( L_P(y^n) − min_j L_j(y^n) ).

Let

P_t(y^{t−1}) = Σ_{j=1}^N w_{j,t} F_{j,t}

where

w_{j,t} = exp{ −γ S_{j,t−1} } / Z_t   and   Z_t = Σ_{j=1}^N exp{ −γ S_{j,t−1} }.

The w_j's are called exponential weights.

Theorem 3. Let γ = √(8 log N / n). Then

L_P(y^n) ≤ min_{1 ≤ j ≤ N} L_j(y^n) + √( (n/2) log N ).

Proof. The idea is to place upper and lower bounds on log(Z_{n+1}/Z_1) and then solve for L_P(y^n).

First bound: We have

log( Z_{n+1}/Z_1 ) = log( Σ_{j=1}^N exp{ −γ L_j(y^n) } ) − log N
                   ≥ log( max_j exp{ −γ L_j(y^n) } ) − log N
                   = −γ min_j L_j(y^n) − log N.  (3)

Second bound: Note that

log( Z_{t+1}/Z_t ) = log( Σ_{j=1}^N w_{j,t} e^{ −γ |F_{j,t} − y_t| } ) = log E( e^{ −γ |F_{j,t} − y_t| } ).

This is a formal expectation with respect to the distribution over j that gives probability w_{j,t} to expert j. Recall Hoeffding's bound for the mgf: if a ≤ X ≤ b,

log E(e^{sX}) ≤ s E(X) + s²(b − a)²/8.

So:

log E( e^{ −γ |F_{j,t} − y_t| } ) ≤ −γ E|F_{j,t} − y_t| + γ²/8 = −γ Σ_j w_{j,t} |F_{j,t} − y_t| + γ²/8 ≤ −γ |P_t(y^{t−1}) − y_t| + γ²/8.

Summing over t:

log( Z_{n+1}/Z_1 ) ≤ −γ L_P(y^n) + nγ²/8.  (4)

Combining (3) and (4) we get

−γ min_j L_j(y^n) − log N ≤ −γ L_P(y^n) + nγ²/8.

Rearranging the terms we have:

L_P(y^n) ≤ min_j L_j(y^n) + (log N)/γ + nγ/8.

Set γ = √(8 log N / n) to get

L_P(y^n) ≤ min_{1 ≤ j ≤ N} L_j(y^n) + √( (n/2) log N ).

The result held for a specific time n. We can make the result uniform over time as follows. If we set γ_t = √(8 log N / t) then we have:

L_P(y^n) ≤ min_j L_j(y^n) + 2 √( (n/2) log N ) + √( (log N)/8 )

for all n and for all y_1, y_2, ....

Now suppose that F is an infinite class. A set G = {G_1, ..., G_N} is an r-covering if, for every F ∈ F and every y^n there is a G_j such that

Σ_{t=1}^n | F_t(y^{t−1}) − G_{j,t}(y^{t−1}) | ≤ r.

Let N_n(r) denote the size of the smallest r-covering.

Theorem 4 (Cesa-Bianchi and Lugosi). We have that

V_n(F) ≤ inf_{r>0} ( r + √( (n/2) log N_n(r) ) ).

Cesa-Bianchi and Lugosi also construct a predictor that nearly achieves the bound, of the form

P_t = Σ_{k=1}^∞ a_k P_t^{(k)}

where P_t^{(k)} is a predictor based on a finite subset of F.

Using batchification it is possible to use online learning for non-online learning. Suppose we are given data (Z_1, ..., Z_n) where Z_i = (X_i, Y_i) and an arbitrary algorithm A that takes data and outputs a classifier H. We used uniform convergence theory to analyze H, but online methods provide an alternative analysis. We apply A sequentially to get classifiers H_0, H_1, ..., H_{n−1}. Let

M_n = (1/n) Σ_{t=1}^n ℓ( H_{t−1}(X_t), Y_t ).

To choose a final classifier:

1. usual batch method: use the last one, H_{n−1}.
2. average: H̄ = (1/n) Σ_{t=1}^n H_{t−1}.
3. selection: choose H_t to minimize

(1/(n − t)) Σ_{i=t+1}^n ℓ( H_t(X_i), Y_i ) + √( (1/(2(n − t))) log( (n+1)/δ ) ).

Analyzing the last classifier requires assumptions on A, uniform convergence etc. This is not needed for the other two methods.

Theorem 5. If ℓ is convex,

P( R(H̄) ≥ M_n + √( (2/n) log(1/δ) ) ) ≤ δ.

For any ℓ,

P( R(Ĥ) ≥ M_n + √( (36/n) log( 2(n+1)/δ ) ) ) ≤ δ.

(Reference: Cesa-Bianchi, Conconi and Gentile (2004).)
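The exponential weights forecaster of Theorem 3 is only a few lines of code. The following is a minimal R sketch (not from the notes); the five experts and the deterministic 0/1 sequence are arbitrary illustrative choices.

    # Exponential weights prediction over N experts with absolute loss.
    exp_weights <- function(y, preds, gamma) {
      # y: 0/1 sequence of length n; preds: n x N matrix with preds[t, j] = F_{j,t} in [0,1]
      n <- length(y); N <- ncol(preds)
      S <- rep(0, N)                           # cumulative losses S_{j,t-1}
      P <- numeric(n)
      for (t in 1:n) {
        w <- exp(-gamma * S); w <- w / sum(w)  # exponential weights
        P[t] <- sum(w * preds[t, ])            # forecaster's prediction P_t
        S <- S + abs(preds[t, ] - y[t])        # update expert losses
      }
      list(forecaster_loss = sum(abs(P - y)),
           best_expert_loss = min(colSums(abs(preds - y))))
    }

    set.seed(1)
    n <- 500; N <- 5
    y <- rep(c(0, 1, 1), length.out = n)                 # an arbitrary deterministic sequence
    preds <- cbind(0, 1, 0.5, rep_len(c(1, 0), n), rbinom(n, 1, 0.5))  # five simple "experts"
    exp_weights(y, preds, gamma = sqrt(8 * log(N) / n))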

184 Homework Due: Thursday Sept 8 by 3:00 From Casella ad Berger:. Chapter, problem.47.. Chapter, problem Chapter, problem.. 4. Chapter, problem Chapter, problem.7a. 6. Chapter, problem Chapter, problem Chapter 3, problem Chapter 4, problem Chapter 4, problem 4.5.

Homework 2
Due: Thursday Sept 15 by 3:00

1. Let X_n be a sequence of random variables such that X_n ≥ 0 for all n. Suppose that P(X_n > t) ≤ (n/t)^k where k > 1. Derive an upper bound on E(X_n).

2. Let X_1, ..., X_n ~ Unif(0,1). Let Y_n = max_i X_i. (i) Bound E(Y_n) using the method we derived in the lecture notes. (ii) Find an exact expression for E(Y_n). Compare the result to part (i).

3. An improvement on Hoeffding's inequality is Bernstein's inequality. Let X_1, ..., X_n be iid, with mean µ, Var(X_i) = σ^2 and |X_i| ≤ c. Then Bernstein's inequality says that

P( |X̄_n - µ| > ε ) ≤ 2 exp{ - n ε^2 / (2σ^2 + 2cε/3) }.

(When σ^2 is sufficiently small, this bound is tighter than Hoeffding's inequality.) Let X_1, ..., X_n ~ Uniform(0,1) and A_n = [0, 1/n]. Let p_n = P(X_i ∈ A_n) and let

p̂_n = (1/n) Σ_{i=1}^n I_{A_n}(X_i).

(i) Use Hoeffding's inequality and Bernstein's inequality to bound P(|p̂_n - p_n| > ε). (ii) Show that the bound from Bernstein's inequality is tighter. (iii) Show that Hoeffding's inequality implies p̂_n - p_n = O_P(1/√n) but that Bernstein's inequality implies p̂_n - p_n = O_P(1/n).

4. Show that X_n = o_P(a_n) and Y_n = O_P(b_n) implies that X_n Y_n = o_P(a_n b_n).
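Not part of the assignment, but a quick numerical sketch (Python with numpy assumed; the choice eps = 5/n and the function names are mine) of why problem 3(iii) comes out the way it does: plug the indicator's variance into the two bounds and watch how they behave for deviations on the 1/n scale.

```python
import numpy as np

# Rough comparison of the two tail bounds in problem 3 for the indicator
# Y_i = I(X_i in A_n), which has c = 1 and sigma^2 = (1/n)(1 - 1/n).
def hoeffding_bound(n, eps):
    return 2 * np.exp(-2 * n * eps**2)

def bernstein_bound(n, eps):
    sigma2 = (1.0 / n) * (1.0 - 1.0 / n)
    return 2 * np.exp(-n * eps**2 / (2 * sigma2 + 2 * eps / 3))

for n in (100, 1000, 10000):
    eps = 5.0 / n                       # a deviation on the 1/n scale
    print(n, hoeffding_bound(n, eps), bernstein_bound(n, eps))

# For eps of order 1/n the Hoeffding bound degenerates toward 2, while the
# Bernstein bound stays small: the same gap that gives O_P(1/sqrt(n)) versus
# O_P(1/n) in part (iii).
```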

Homework 3
Due: Thursday Sept 22 by 3:00

1. Let A be a class of sets. Let B = {A^c : A ∈ A}. Show that s_n(B) = s_n(A).

2. Let A and B be classes of sets. Let C = { A ∩ B : A ∈ A, B ∈ B }. Show that s_n(C) ≤ s_n(A) s_n(B).

3. Show that s_{n+m}(A) ≤ s_n(A) s_m(A).

4. Let A = { [a, b] ∪ [c, d] : a, b, c, d ∈ R, a ≤ b ≤ c ≤ d }. Find the VC dimension of A.
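A small empirical check related to problem 4 (Python; illustration only, not a substitute for the combinatorial argument): a labeling of sorted points on the line is realizable by a union of at most two intervals exactly when the positive labels form at most two consecutive runs, so shattering of small point sets can be checked by enumeration.

```python
from itertools import product

# Empirical check (not a proof): can a 0/1 labeling of sorted points be
# picked out by a union of at most two closed intervals?
def pickable(labels):
    runs, prev = 0, 0
    for v in labels:
        if v == 1 and prev == 0:
            runs += 1                 # a new run of positive labels starts
        prev = v
    return runs <= 2

def shatters(n_points):
    """Is every labeling of n_points distinct sorted points realizable?"""
    return all(pickable(labels) for labels in product((0, 1), repeat=n_points))

print(shatters(4))   # True: any labeling of 4 points has at most 2 runs
print(shatters(5))   # False: the labeling 1,0,1,0,1 needs three intervals
# Consistent with the VC dimension of unions of two intervals being 4.
```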

Homework 4
Due: Thursday September 29 by 3:00

Homework 5
Due: Thursday October 6 by 3:00

(b) and (e).

4. Write (x_1, ..., x_n) ↔ (y_1, ..., y_n) to mean that the likelihood function based on (x_1, ..., x_n) is proportional to the likelihood function based on (y_1, ..., y_n). The equivalence relation ↔ induces a partition Π of the sample space: (x_1, ..., x_n) and (y_1, ..., y_n) are in the same element of the partition if and only if (x_1, ..., x_n) ↔ (y_1, ..., y_n). Show that Π is a minimal sufficient partition.

In class, we found the minimax estimator for the Bernoulli. Here, you will fill in the details. Let X_1, ..., X_n ~ Bernoulli(p). Let L(p, p̂) = (p - p̂)^2.
(a) Let p̂ be the Bayes estimator using a Beta(α, β) prior. Find the Bayes estimator.
(b) Compute the risk function.
(c) Compute the Bayes risk.
(d) Find α and β to make the risk constant and hence find the minimax estimator.

Homework 6
Due: Thursday October 20 by 3:00

Homework 7
Due: Thursday October 27 by 3:00

(a,b) (a,b,c,e)

7. Show that, when H_0 is true, the p-value has a Uniform(0,1) distribution.

Homework 8
Due: Thursday November 10 by 3:00

(a) (a)

4. Let X_1, ..., X_n ~ Uniform(0, θ). Find the 1 - α likelihood ratio confidence interval for θ. Note: the limiting χ^2 theory does not apply to this example. You need to find the cutoff value directly.

5. Let X_1, ..., X_n ~ p and assume that 0 ≤ X_i ≤ 1. The histogram density estimator is defined as follows. Divide [0, 1] into m bins B_1 = [0, 1/m], B_2 = (1/m, 2/m], .... Let h = 1/m and let

θ̂_j = (1/n) Σ_{i=1}^n I(X_i ∈ B_j).

Let p̂(x) = θ̂_j / h when x ∈ B_j. Find the asymptotic MSE. Find the best h. Find the rate of convergence of the estimator.
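For intuition on problem 5, a short sketch (Python with numpy; the Beta(2,5) data, the seed, and the exact bin count are arbitrary choices of mine) that builds the histogram estimator with a bin width of order n^(-1/3), the order suggested by the MSE calculation the problem asks for.

```python
import numpy as np

# Sketch of the histogram density estimator from problem 5 (illustration only).
rng = np.random.default_rng(1)
n = 5000
X = rng.beta(2, 5, size=n)                # some density supported on [0, 1]

m = max(1, int(round(n ** (1 / 3))))      # number of bins, so h = 1/m ~ n^(-1/3)
h = 1.0 / m
counts, edges = np.histogram(X, bins=m, range=(0.0, 1.0))
theta_hat = counts / n                    # theta_hat_j = (1/n) sum_i I(X_i in B_j)
p_hat = theta_hat / h                     # p_hat(x) = theta_hat_j / h on bin B_j

def estimate(x):
    """Evaluate the histogram estimator at a point x in [0, 1]."""
    j = min(int(x / h), m - 1)            # index of the bin containing x
    return p_hat[j]

print(estimate(0.25), estimate(0.75))
```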

Homework 9
Due: Thursday Nov 17 by 3:00

1. Let X_1, ..., X_n ~ p and let p̂_h denote the kernel density estimator with bandwidth h. Let R(h) = E[L(h)] denote the risk, where L(h) = ∫ ( p̂_h(x) - p(x) )^2 dx.

(a) Define R̃(h) = E[ L̃(h) ] where L̃(h) = ∫ ( p̂_h(x) )^2 dx - 2 ∫ p̂_h(x) p(x) dx. Show that minimizing R̃(h) over h is equivalent to minimizing R(h).

(b) Let Y_1, ..., Y_n be a second sample from p. Define

R̂(h) = ∫ ( p̂_h(x) )^2 dx - (2/n) Σ_{i=1}^n p̂_h(Y_i)

where p̂_h is still based on X_1, ..., X_n. Show that E[R̂(h)] = R̃(h). (Hence, R̂(h) can be used as an estimate of the risk.)

2. Again, let p̂_h denote the kernel density estimator. Use Hoeffding's inequality to find a bound on P( |p̂_h(x) - p̄_h(x)| > ε ) where p̄_h(x) = E( p̂_h(x) ).

3. Let X_1, ..., X_n ~ Bernoulli(θ). Let θ̂ = (1/n) Σ_{i=1}^n X_i. Let X_1^*, ..., X_n^* denote a bootstrap sample. Let θ̂^* = (1/n) Σ_{i=1}^n X_i^*. Find the following four quantities: E(θ̂^* | X_1, ..., X_n), E(θ̂^*), V(θ̂^* | X_1, ..., X_n), V(θ̂^*).

4. The bootstrap estimate of Var(θ̂) is V(θ̂^* | X_1, ..., X_n). (In other words, when B → ∞, the bootstrap estimate of variance converges to V(θ̂^* | X_1, ..., X_n).) Show that the bootstrap is consistent, in the sense that

V(θ̂^* | X_1, ..., X_n) / Var(θ̂) →_P 1.
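A quick illustration of problems 3 and 4 (Python with numpy; not the requested derivation, and the sample size, θ and seed below are arbitrary): the Monte Carlo bootstrap variance of the resampled mean settles near the exact conditional variance X̄(1 - X̄)/n.

```python
import numpy as np

# Bootstrap variance of the Bernoulli sample mean, compared with the exact
# conditional variance V(theta_hat* | X_1,...,X_n) = Xbar(1 - Xbar)/n.
rng = np.random.default_rng(2)
n, theta, B = 200, 0.3, 20000
X = rng.binomial(1, theta, size=n)
theta_hat = X.mean()

# Resample X with replacement B times and recompute the mean each time.
boot_means = np.array([rng.choice(X, size=n, replace=True).mean()
                       for _ in range(B)])
v_boot = boot_means.var()

v_exact = theta_hat * (1 - theta_hat) / n
print(v_boot, v_exact)      # the two values should be close
```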

Homework 10
Due: Thursday December 1 by 3:00

1. Suppose that V = Σ_{j=1}^k (Z_j + θ_j)^2 where Z_1, ..., Z_k are independent, standard Normal random variables. We say that V has a non-central χ^2 distribution with non-centrality parameter λ = Σ_j θ_j^2 and k degrees of freedom. We write V ~ χ^2_k(λ).

(a) Show that if V ~ χ^2_k(λ) then E(V) = k + λ and Var(V) = 2(k + 2λ).

(b) Let Y_i ~ N(µ_i, 1) for i = 1, ..., n. Find the posterior distribution of µ = (µ_1, ..., µ_n) using a flat prior.

(c) Find the posterior distribution of τ = Σ_i µ_i^2.

(d) Find the mean τ̂ of the posterior.

(e) Find the bias and variance of τ̂.

(f) Show that τ̂ is not a consistent estimator of τ. (Technically, the parameter τ = τ_n is changing with n. You may assume that τ_n is bounded as n increases.) Hint: you may use the fact that if V ~ χ^2_k(λ), then (V - E(V)) / sqrt(Var(V)) ⇝ N(0, 1).

(g) Find c_n so that P(τ ∈ C_n | X_1, ..., X_n) = 1 - α where C_n = [c_n, ∞).

(h) Construct an unbiased estimator of τ. Compare this to the Bayes estimator.

(i) Find a (frequentist) confidence interval A_n = [a_n, ∞) such that P_µ(τ ∈ A_n) = 1 - α for all µ. Compare this to the Bayes posterior interval C_n.
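A simulation check of part (a) (Python with numpy; an illustration rather than the proof asked for, and the particular θ vector, seed and replication count are made up): the sample mean and variance of V should be close to k + λ and 2(k + 2λ).

```python
import numpy as np

# Monte Carlo check of the mean and variance of a non-central chi-square.
rng = np.random.default_rng(3)
k = 5
theta = np.array([0.5, -1.0, 2.0, 0.0, 1.5])
lam = np.sum(theta ** 2)                  # non-centrality parameter

reps = 200000
Z = rng.standard_normal((reps, k))
V = np.sum((Z + theta) ** 2, axis=1)      # V = sum_j (Z_j + theta_j)^2

print(V.mean(), k + lam)                  # should be close to k + lambda
print(V.var(), 2 * (k + 2 * lam))         # should be close to 2(k + 2*lambda)
```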

Homework 1 Solutions

195 .7 a. y P y X y y f X x dx y y PY y P X y y P X y y f X x dx y Differetiatio gives f Y y y 9 y y 9 y

196 3.3 a. c h x iti x dx. We have i * Note that exp log c j * * exp 3.3(a) * c h x iti x dx j i exp iterchage itegratio ad differetiatio * c h x iti x dx j i j j j exp * c h x t x t x dx x h x log c tj t j x j * c t x dx exp j j ad

197 * * log c log c j j * * * c c log c (3.3 (b)) j j * c h x iti tj j i exp x dx x * tj x h x c dx t j x tj x tj x Var tj x 3.3 b. x exp / x;, x exp x lx The atural parameters ad sufficiet statistics are [ /, ], t x [ x,log x]. Further, c * log log log log / log Therefore * x log c log /


199 Itermediate Statistics HW Problem As X 0 with Prob, we have E[X] = 0 ( F (t))dt = 0 P (X > t)dt = a 0 P (X > t)dt + a P (X > t)dt With P (X > t) ad P (X > t) (/t) k, we have the upper boud of E[X], E[X] = a 0 P (X > t)dt + a P (X > t)dt a 0 dt + a (/t) k dt = a + (k )a k. Set the derivative of it to be 0, we get + /a k = 0, so we get a =. The upper boud of E[X] is + = k. k k Problem The cumulative desity fuctio of Y is P (Y y) = P (X i y, i ) = P (X i y) = y, 0 y. i= So, the expected value of Y is Problem 3 E[Y ] = 0 P (Y > t)dt = 0 ( t )dt = + = (i) Note that p = P (X i A ) = /. Let Y i = I A (X i ), the E[Y i ] = E[I A (X i )] = P (X i A ) = /, Ȳ = Y i = i= I A (X i ) = ˆp, i= +. therefore:. Hoeffdig s iequality: Y i = 0 or, thus the boud is 0 Y i, ad fially P ( ˆp p ɛ) = P ( Ȳ E[Y ] ɛ) exp{ ɛ ( 0) } = ; e ɛ

200 . Berstei s iequality: still 0 Y i, hece Y i, ad the variace is as Y i Beroulli(/). So, we have V ar(y i ) = /( /) =, ɛ P ( ˆp p ɛ) = P ( Ȳ E[Y ] ɛ) exp{ ( )/ + ɛ/3 }. (ii) Whe ɛ is small ad is large, ( )/ + ɛ/3 will be very small, i the order of /, so ( )/ + ɛ/3 < /, ad so we have ɛ exp{ ( )/ + ɛ/3 } e ɛ. Therefore, Berstei s iequality is tighter tha Hoeffdig s iequality. (iii) Use Hoeffdig s iequality, P ( ˆp p / C) = P ( ˆp p C/ ) e (C/ ) = e C. So, for ay δ, whe C is large eough, there is P ( ˆp p / iequality implies ˆp p = O p (/ ). Use Berstei s iequality, we have C) δ, therefore, Hoeffidig s P ( ˆp p / C / C) = P ( ˆp p C/) exp{ ( )/ + C/3 }. Simplify the expoetial part, we have C / ( )/ + C/3 = C / ( )/ + C/3 = C ( )/ + C/3 3C, for large ad large C. So, i all, we have P ( ˆp p / C) e 3C. For ay δ, there is C large eough, such that the probability is smaller tha δ. So, Berstei s iequality implies ˆp p = O p (/).

Problem 4

202 Itermediate Statistics HW3 Problem Notice that (A F ) (A c F ) = (A A c ) F =. O the other had, (A F ) (A c F ) = (A A c ) F = Ω F = F. This shows A ad A c pick differet parts of F, that is, (A c F ) = F \(A F ). For ay fiite set F w/ elemets, say the total umber of distict A F is m, the for every distict A F, there is a correspodig A c B such that A c F picks the other part of F. The the total umber of distict A c F is also m. So we have S(A, F ) = S(B, F ), takig sup o both sides, we have, Problem s (A) = s (B) C = {A B : A A, B B}. Notice that for C C, C F = (A B) F = (A F ) (B F ), therefore, A F C F F, ad B F C F F. For ay fiite set F w/ elemets, say the total umber of distict A F is m ad the total umber of distict B F is m. The, the total umber of distict C F w/ C = A B, i.e. the total umber of distict itersectios (A F ) (B F ) is at most m m (the maximum umber of distict pairs). That is S(C, F ) S(A, F )S(B, F ), takig sup o both sides, s (C) sup [S(A, F )S(B, F )] F F sup S(A, F ) sup S(B, F ) = s (A)s (B). F F F F Problem 3 Let F +m = F F m where F with elemets ad F m with m elemets are disjoit ad F +m have m + elemets. For A A, A F +m = A (F F m ) = (A F ) (A F m ). Therefore, A F A F +m F +m, ad A F m A F +m F +m. For ay fiite set F w/ elemets ad F m w/ m elemets, say the total umber of distict A F is ad

203 the total umber of distict A F m is m. The, the total umber of distict A F +m w/ F +m = F F m, which are subsets of distict uios (A F ) (A F m ) is at most m (the maximum umber of distict pairs). That is S(A, F +m ) S(A, F )S(A, F m ), takig sup o both sides, s +m (A) sup [S(A, F )S(A, F m )] F F,F m F m sup S(A, F ) sup S(A, F m ) = s (A)s m (A). F F F m F m Problem 4 A is the set of sigle itervals or joit of two separate itervals o the real lie.. Let F 4 = {, 0,, } with 4 elemets. The ).[,.5] F =, ).[.5, 0.5] F = { }, 3).[.5, 0.5] F = {, 0}, 4).[.5,.5] F = {, 0, }, 5).[.5,.5] F = {, 0,, }, 6).[ 0.5, 0.5] F = {0}, 7).[ 0.5,.5] F = {0, }, 8).[ 0.5,.5] F = {0,, }, 9).[0.5,.5] F = {}, 0).[0.5,.5] F = {, }, ).[.5,.5] F = {}, ).[.5, 0.5] [0.5,.5] F = {, }, 3).[.5, 0.5] [.5,.5] F = {, }, 4).[ 0.5, 0.5] [.5,.5] F = {0, }, 5).[.5, 0.5] [.5,.5] F = {, 0, }, 6).[.5, 0.5] [0.5,.5] F = {,, } So s(a, F 4 ) = 4 = 6 ad s 4 (A) = 6. The VC dimesio of A, d(a) = max{ : s (A) = } 4.. For set F, st. 5, eg F 5 = {, 0,,, 3}, it is impossible A F 5 = {, 0, }, sice ay iterval coverig {, } will also cover {0}, similarly, the iterval coverig {0, } will also cover {}. This is suffice to show that the VC dimesio of A is less tha 5. So we have 4 d(a) < 5, that is d(a) = 4.

204 Test Solutios Problem X ad X are iid Uif(0,), the, f X,X (x, x ) = f X (x )f X (x ) = I(x (0, ))I(x (0, )) { = 4 0 < x <, 0 < x < 0, ow F Y (y) = P (Y y) = P (X X y) = P (X X + y) = A f X,X (x, x )dx dx where, A = {(x, x ) R : x x + y}. Sice the itegral is over a fuctio which takes value 4 over a square ad 0 everywhere else, the value of the itegral is equal to 4 of the area of the regio determied by the itersectio of A with the square 0 < x <, 0 < x <. The four differet cases are show i the last page. The cdf is, 0 y (+y) F Y (y) = 8 < y < 0 ( y) 8 0 y y > Differetiate it wrt y to get the pdf, Problem f Y (y) = df Y dy = y 4 0 y +y 4 y < 0 0 ow Let X i iid Beroulli(p) for i =,,...,. The X = i= X i has Biomial(, p) distributio ad the MGF of X, M X = i= M X i = (M Xi ). We kow the MGF of Beroulli distributio is M X = E[e tx ] = e t p + e 0 ( p)

205 The we have M X = (M X ) = (e t p + ( p)) Problem 3 We kow that, E(g(X) Y ) = g(x)p(x y)dx The, E(E(g(X) Y )) = E(g(X) Y )p(y)dy = [ g(x)p(x y)dx]p(y)dy = g(x)p(x y)p(y)dydx = g(x)[ p(x, y)dy]dx = g(x)p(x)dx = E(g(X)) Problem 4 We kow that X Uif(, ) ad Y = e X, the { e X 0 x Y = e X x < 0 ad y e. The attached figure shows how the fuctio looks like. The cdf is 0 y < F Y (y) = P (Y y) = P (e X P ( log(y) x log(y)) = log y log y 3 y) = dx = 3 log y y e P ( log(y) x ) = log y+ log y 3dx = 3 e y < e y e Differetiate the cdf with respect to y, we get the pdf, 3y y e p Y (y) = 3y e y < e 0 ow

206 Itermediate Statistics HW lim X X,limx X 0, for ay, we ca fid a m ad a N such that Sice F x F x F x / / P X m for > PY c m for > N. N. The, sice PY c m lim, we ca fid a N such that P A B P A P B Note that, the P X Y c P X m, Y c m P X m P Y c m / / for max N, N. Thus P X Y c lim.

207 5.36 (a) E Y E E Y N E N E N Var Y E Var Y N Var E Y N E 4N Var N (b) Y / ty / t / ty/ M t E e e E e t Y t / e E E e N total expectatio N t / / t / / e E t e t t / t/ e e e 0 0 e! t / t/ e Poisso! t/ t t t t/ 8 t/ 8 8 t/ 8 e e e e 3 t t t t 3 t t t t t t 8 t e e as which is the mgf of N(0,). e Taylor expasio

208 Itermediate Statistics HW5 Problem (C&B 6.) Problem (C&B 6.4) Problem 3 (C&B 6.9) (b) (e)

209 Problem 4 Refer to Notes 6 p. for the defiitio of miimal sufficiet partitio. Let θ be the parameter of the distributio ad f be the joit pdf. f(x,..., x θ) f(y,..., y θ) is idepedet of θ if ad oly if (x,..., x ) (y,..., y ). Therefore, by C&B Theorem 6..3, is a miimal sufficiet partitio for θ. Problem 5 (C&B 7.). x = 0, the likelihood L(θ) = 3 I(θ = )+ 4 I(θ = )+0 I(θ = 3) = 3 I(θ = )+ 4 I(θ = ), therefore, the MLE ˆθ = ;. x =, L(θ) = 3 I(θ = ) + 4 I(θ = ), ˆθ = ; 3. x =, L(θ) = 4 I(θ = ) + 4 I(θ = 3), ˆθ = or ˆθ = 3; 4. x = 3, L(θ) = 6 I(θ = ) + 4 I(θ = ) + I(θ = ), ˆθ = 3; 5. x = 4, L(θ) = 6 I(θ = ) + 4 I(θ = 3), ˆθ = 3.

210 Fially, X = 0, ; ˆθ = or 3 X = ; 3 X = 3, 4. Problem 6 (C&B 7.5(a)) Problem 7 (C&B 7.8) Problem 8 (C&B 7.9) 3

211 Problem 9 (a) Bayes estimator uder square error loss L(p, ˆp) = (p ˆp) is the posterior mea. iid X i Beroulli(p), p Beta(α, β) are cojugate, the posterior is p X Beta(α + i X i, β + i X i). Therefore, Bayes estimator ˆp = α+p i X i. α+β+ (b) Risk fuctio for ˆp R(p, ˆp) = E p [L(p, ˆp)] = MSE(ˆp) = (E[ˆp] p) + V [ˆp] = ( α + p α + β + p( p) p) + (α + β + ) (α( p) βp) p( p) = + (α + β + ) (α + β + ) (c) Bayes risk for ˆp B(π, ˆp) = R(p, ˆp)π(p)dp 4

212 = = = = = (α + β + ) [(α + β) (p α α + β ) + p p ]π(p)dp (α + β + ) [(α + αβ β) (α + β) (α + β + ) + α α + β ( αβ (α + β) (α + β + ) + α (α + β) )] (α + β + ) [ αβ α + β + + α α + β α(α + ) (α + β)(α + β + ) ] (α + β + ) [ αβ α + β + + αβ (α + β)(α + β + ) ] αβ (α + β)(α + β + )(α + β + ) (d) The risk R(p, ˆp) = (α( p) βp) p( p) + (α + β + ) (α + β + ) = (α + β + ) {p [(α+β) ]+p[ α(α+β)]+α } is a d order polyomial of p. To make it costat, set { { (α + β) = 0; α = = α(α + β) = 0. β = Thus ˆp m = α+p i X i α+β+ = P /+ i X i + is the miimax estimator. ;. 5
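To see the constant-risk property numerically, here is a short sketch (Python with numpy; n = 50, the grid of p values, and the Monte Carlo settings are arbitrary choices of mine) comparing the risk of the minimax estimator derived above with the MLE's risk p(1 - p)/n under squared error loss.

```python
import numpy as np

# The minimax estimator p_m = (sum X_i + sqrt(n)/2) / (n + sqrt(n)) has
# constant risk 1 / (4 (sqrt(n) + 1)^2); the MLE's risk p(1-p)/n varies with p.
n = 50
p_grid = np.linspace(0.01, 0.99, 99)
risk_mle = p_grid * (1 - p_grid) / n
risk_minimax = 1.0 / (4 * (np.sqrt(n) + 1) ** 2)

# Monte Carlo check of the minimax estimator's risk at a few values of p.
rng = np.random.default_rng(4)
for p in (0.1, 0.5, 0.9):
    X_sum = rng.binomial(n, p, size=100000)            # sum of n Bernoullis
    p_m = (X_sum + np.sqrt(n) / 2) / (n + np.sqrt(n))
    print(p, np.mean((p_m - p) ** 2), risk_minimax)    # both columns agree

print(risk_mle.max(), risk_minimax)   # max MLE risk (at p = 1/2) vs constant risk
```

The minimax estimator's maximum risk 1/(4(sqrt(n)+1)^2) sits below the MLE's maximum risk 1/(4n), which is the point of the construction.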

213 36705 Itermediate StatisticsChapter Homework 0 6 Solutios Problem C & B 0. Asymptotic Evaluatios 0. First calculate some momets for this distributio. EX = θ/3, E X =/3, VarX = 3 θ 9. So 3 X is a ubiased estimator of θ with variace Var(3 X ) = 9(VarX)/ = (3 θ )/ 0 as. So by Theorem 0..3, 3 X is a cosistet estimator of θ. 0.3 a. The log likelihood is Itermediate Statistics HW6 Oct, 00 Problem C & B 0. log (πθ) (xi θ)/θ. Differetiate ad set equal to zero, ad a little algebra will show that the MLE is the root Problem of θ + θ (0 W = 0. poits) The roots of this equatio are ( ± +4W )/, ad the MLE is the root with the plus sig, as it has to be oegative. By theorem b. The secod 0 i lecture derivative 4, of the log likelihood is ( x i + θ)/(θ3 ), yieldig a expected P P P W Fisher θ, aiformatio = ofa W θ = θ. P P b 0= a W + b θ +0=θ. Xi I(θ) = E + θ θ θ 3 = θ + θ, Problem (0 poits, 0 poits each) ad by Theorem 0.. the variace of the MLE is /I(θ). Problem 3 C & B a. Write a The log likelihood is Xi Y i Xi (X i + i ) Xi X = i i X =+ i X. log(πθ) (xi θ) /θ. i Differetiate From ormality ad set adequal idepedece to zero, ad a little algebra will show that the MLE is the root of θ EX + θ W =0. Therootsofthisequatioare( i i =0, VarX i i = σ (µ + τ ), EXi = µ + τ ± +4W, VarXi =τ )/, (µ ad + τ the ), MLE is the root with the plus sig, as it has to be oegative. ad Cov(X i,x i i ) = 0. Applyig the formulas of Example 5.5.7, the asymptotic mea b The secod ad variace derivative are of the log likelihood is ( x i + θ)/(θ 3 ), yieldig a expectatied Fisher iformatio of Xi Y i Xi Y E i X ad Var X i X σ (µ τ ) i [(µ + τ )] = σ i + θ I(θ) = E θ = θ +, (µ + τ ) θ 3 θ b.

214 So root by Theorem with the plus 0..3, sig, 3X asisita has cosistet to be oegative. estimator of θ. 0.3b. a. The Thesecod log likelihood derivative is of the log likelihood is ( x i + θ)/(θ3 ), yieldig a expected Fisher iformatio of log (πθ) (xi θ)/θ. 0. First Differetiate calculatead some setmomets equal Xi I(θ) to zero, = E for this addistributio. a little algebra + θ θ θ 3 = will θ show + that θ, the MLE is the root of θ + θ W = 0. The roots of this equatio are ( ± +4W ad by Theorem 0.. EX the = variace θ/3, E of X the =/3, MLE is VarX /I(θ). = )/, ad the MLE is the root with the plus sig, as it has to be oegative. 3 θ 9. b. The secod derivative of the log likelihood is ( x i 0.4 a. Write + θ)/(θ3 ), yieldig a expected So Fisher 3 X iformatio is a ubiased of estimator Xi Y of θ with i Xi (Xvariace i + i ) Xi X = i i =+ X. Var(3 X Xi I(θ) i ) = = E 9(VarX)/ =(3 + θ θ θ )/ 0 as. θ 3 = θ + θ, From ormality ad idepedece Soad by by Theorem0..3, X the isvariace a cosistet of theestimator MLE is /I(θ). of θ a. a. Write The EXlog i i =0, likelihood VarXis i i = σ (µ + τ ), EXi = µ + τ, VarXi =τ (µ + τ ), ad Cov(X i,x i i ) = 0. Applyig the formulas of Example 5.5.7, the asymptotic mea Problem 3 C & B 0.4 log (πθ) Xi Y i Xi (X i + i ) Xi X = (xi θ)/θ. i X =+ i i X. i ad variace are From Differetiate ormality adset idepedece equal to zero, ad a little algebra will show that the MLE is the root of θ + θ W = Xi 0. Y The roots of this equatio i Xi Y i are ( E X ad Var i σ (µ ± + +4W τ ) )/, [(µ + τ )] = σ ad E the MLE is the root EXwith i i =0, the plus VarXsig, i i = asσ it (µ has + to τ ), be oegative. EX X i = µ β+ τ, VarX (µ + τ i ) b. The secod derivative of the log likelihood is ( i =τ (µ + τ ), ad Cov(X x i + θ)/(θ3 ), yieldig a expected i,x i i ) = 0. Applyig the formulas of Example 5.5.7, the asymptotic mea b. ad ad Fisher variace iformatio are of Yi i = β Xi Y i Var Xi Xi X σ (µ X + i τ ) with approximate mea β ad variace i [(µ σ + /(µ τ )] = σ I(θ) = E + θ θ θ 3 = θ + Xi Y i Xi Y (µ θ+ τ, E ) X ad Var i i X σ (µ + τ ) i ). [(µ + τ )] = σ (µ + τ ) ad by Theorem 0.. the variace of the MLE is /I(θ). b. 0.4 a. Write Yi i Xi Y i Xi = β (X+ i + i ) Xi X = Xi Xi i i X =+ i X. i with approximate mea β ad variace σ /(µ ). From ormality ad idepedece c. EX i i =0, VarX i i = σ (µ + τ Yi ), = β EX + i = µ i + τ, VarXi =τ (µ + τ ), X i X i ad withcov(x approximate i,x i i ) mea = 0. βapplyig ad variace the σformulas /(µ ). of Example 5.5.7, the asymptotic mea ad 0-variace are Solutios Maual for Statistical Iferece 0.5 a. The itegral of ET is ubouded ear zero. We have c. Xi Y i Xi Y E i X ad Var i X σ (µ + τ ) i [(µ τ )] = σ ET Yi = β + > i πσ 0 x e (x µ) /σ dx > πσ (µ + τ ) X K dx =, Problem 4, C & B x i X i b. where K = with max 0 x approximate e (x µ) /σ Deote the desity of (µ, σ ) as fmea (x), the β ad variace σ /(µ ). Yi i b. If we delete the iterval ( δ, δ), 0.5 a. The itegral of ET the the itegrad = β + is bouded, that is, over the rage of is ubouded Xi ear Xi zero. We have itegratio /x < /δ. X i ( δ)f (x) + δf(x) with approximate mea β ad variace σ c. Assume µ>0. A similar argumet works for /(µ µ<0. ). ET The > πσ 0 x e (x µ) /σ dx > πσ K Let Y Beroulli(δ), the P dx =, 0 x P ( δ <X<δ) =P [ (Y = ) = δ ( δ µ) < ad P (Y = 0) (X µ) < = δ ad, (δ µ)] <P[Z < (δ µ)], V ar(x i ) = E(V ar(x i Y )) + V ar(e(x where K = max 0 x e (x µ) /σ i Y )) where Z (0, ). For δ <µ, the probability goes to 0 as. 0.7 NoteWe that, eedv to ar(x b. 
assume If i we Y = that delete ) τ(θ) = the τ isad iterval differetiable V ar(x ( δ, i Y at δ), = θ the = 0) θ 0 =, the σ, true itegrad value ofis the bouded, parameter. that The is, over the rage of we apply Theorem itegratio to /xtheorem < /δ We will doc. a more Assume geeral µ>0. E(V problem ar(x A similar i that Y )) icludes = argumet τ δ + a) σ ad works ( b) δ) as for special µ<0. cases. TheSuppose we wat to estimate λ t e λ /t! =P (X = t). Let P ( δ <X<δ) =P [ ( δ µ) < (X µ) < (δ µ)] <P[Z < (δ µ)], if X = t T = T (X,...,X )= where Z (0, ). For δ <µ, the probability 0 if X goes = t. to 0 as. 0- Solutios Maual for Statistical Iferece The0.7 ET = We P (T eed = ) to= assume P (X = that t), so τ(θ) T isisdifferetiable a ubiased estimator. θ = Sice θ 0, the true X i isvaluea complete of the parameter. The sufficiet statistic we apply fortheorem λ, E(T X i ) is UMVUE. to Theorem 0... UMVUE is 0 for y = X i <t, ad for y t, 0.9 We will do a more geeral problem that icludes a) ad b) as special cases. Suppose we wat to estimate E(T y) λ t e λ = /t! P (X =P (X = t). Let = t X i = y) if X = t = P (X = t, X i = y) T T (X,...,X )= P ( X i = y) 0 if X = t.

215 Also E(X i Y = ) = θ, E(X i Y = 0) = µ ad E(E(X i Y )) = θδ + ( δ)µ, V ar(e(x i Y )) = Y (E(X i Y ) E(E(X i Y ))) P (Y ) By the fact that X i s are iid, = (θ θδ ( δ)µ) δ + (µ δθ ( δ)µ) ( δ) = δ( δ)(θ µ) V ar( X) = V ar( (X i )) = (τ δ + σ ( δ) + δ( δ)(θ µ) ) i Sice the mea ad variace of Cauchy distributio do ot exist, ay cotamiate of cauchy distributio will make (θ µ) ad τ ifiite. So V ar(x i ) will be ifiite. Problem 5, C & B 0.9 X i (θ, σ ). a). V ar( (X)) = V ar( X i ) = V ar( i i X i ) = ( i V ar(x i ) + i<j Cov(X i, X j )) = (σ + ( ) ρσ ) = (σ + ( )ρσ ) So, as, V ar( X) 0. b). V ar( X) = V ar( i X i ) = ( i V ar(x i ) + i<j Cov(X i, X j )) = (σ + Cov(X i, X j )) i= j=i+ 3

216 c). We kow = (σ + = σ + σ Corr(X, X i ) = i= j=i+ ρ i j σ ) ρ ρ ( ρ ρ ) Cov(X, X i ) V ar(x )V ar(x i ) Ad sice δ i iid (0, ) we ca use δ for all δ i s, X = ρx + δ So, X 3 = ρ(ρx + δ ) + δ... i X i = ρ i X + ρ j δ j=0 i Cov(X, X i ) = Cov(X, ρ i X + ρ j δ ) j=0 i = ρ i Cov(X, X ) + Cov(X, ρ j δ ) = ρ i V ar(x ) = ρ i σ j= Also, i V ar(x i ) = ρ (i ) V ar(x ) + ρ j V ar(δ ) = ρ (i ) σ + ρ(i ) ρ = ρ j=0 Give σ = ρ, Corr(X, X i ) = ρ i 4

217 Itermediate Statistics Test Solutio () Let X,..., X Beroulli(θ) where 0 < θ <. Let W = X i ( X i ). i= (a) Show that there is a umber µ such that W coverges i probability µ. Solutio: As X i is either 0 or, so X i ( X i ) = 0 with probability. Hece, W has poit mass probability at 0. So, E[W ] = 0, V ar(x ) = 0. Obviously, if we set µ = 0, we have P ( W µ > ɛ) = 0, for ay ɛ > 0. Hece W coverges to µ i probability. (b) Fid the limitig distributio of (W µ). Solutio: W µ has poit mass at 0, so (W µ) also has poit mass at 0. The limitig distributio is P ( (W µ) = 0) =. () Let X,..., X Normal(θ, ). (a) Let T = (X,..., X ). Show that T is ot sufficiet. Solutio: As T = (X,..., X ), the coditioal distributio of (X,..., X T = (t,..., t )) is f(x,..., X, T = t) f(t ) where = f(x = t,..., X = t, X ) f(x = t,..., X = t ) f(x ) = π e (X θ). = f(x ) i= f(t i) i= f(t i) = f(x ), Obviously, the coditioal pdf f(x,..., X T = (t,..., t )) depeds o θ, so T is ot sufficiet. (b) Show that U = i= X i is miimal sufficiet. Solutio: From the solutio above, the ratio betwee probability is P f(x,..., x T ) i= f(y,..., y T ) = (x e i y i ) +θ P i= (x i y i ).

218 Obviously, whe i= x i = i= y i, the ratio does ot deped o θ, which meas that T (X ) = i= x i is sufficiet. To make sure that the ratio does ot deped o θ, there must be i= x i = i= y i, so T (X ) = i= x i is also miimal. I all, T (X ) = i= x i is miimal sufficiet statistic. (3) Let X,..., X be draw from a uiform distributio o the set where θ > 0. [0, ] [, + θ] (a) Fid the method of momets estimator ˆθ of θ. Solutio: The probability desity fuctio for X is { 0 x, x + θ, f X (x) = +θ 0 ow. So, the expectatio for X is EX = To fid the momet estimator, let 0 x +θ + θ dx + x + θ dx = θ + 4θ + ( + θ). E[X] = X i = X, i= ad we get the solutio ˆθ = X + X X + 3, ˆθ = X X X + 3. As θ > 0, oly the ˆθ is kept, ad the estimator is ˆθ = X + X X + 3. (b) Fid the mea squared error of ˆθ. Solutio: The form is too complicated, so we use Delta method to fid the approximate MSE of ˆθ. I part (a), we have that E[X] = θ + 4θ + ( + θ).

219 The variace for X ca also be calculated, as V ar(x) = E[X ] (E[X]) = (θ + )3 7 3( + θ) (E[X]), which eds up as With CLT, we kow that V ar(x) = θ4 + 4θ 3 + 8θ + 8θ + ( + θ). X N( θ + 4θ + ( + θ), θ 4 + 4θ 3 + 8θ + 8θ + ) ( + θ) Let g(x) = x + x x + 3, so ˆθ = g( X), with ad the approximate MSE is g( θ + 4θ + ( + θ) ) = θ, g ( θ + 4θ + (θ + ) ) = ( + θ) ( + θ) +, MSE(ˆθ) = (E[(θ ˆθ)]) + V ar(ˆθ) = (θ + ) ( θ 4 + 4θ 3 + 8θ + 8θ + ( + θ) + ) ( + θ) = ( θ + + 4θ 3 + 8θ + 8θ + ( + θ) + ) 3 (c) Show that ˆθ is cosistet. Solutio: Accordig to the result i part (b), the mea squared error of ˆθ goes to 0 i the order of O(/), so ˆθ θ i probability, which meas that ˆθ is cosistet. (4) Let X,..., X N(θ, ). Let τ = e θ +. (a) Fid the maximum likelihood estimator ˆτ of τ ad show that it is cosistet. Solutio: The likelihood fuctio for θ is L(θ; X ) = i= π e (x i θ), 3

220 hece the log-likelihood fuctio is Take derivative of l(θ), we have l(θ; X ) = log π (x i θ). θ l(θ) = i= x i θ. To fid the MLE of θ, let l(θ) = 0, ad the solutio is θ i= ˆθ = x i. i= As this is the oly solutio for the derivative fuctio, so this is global maximum, ad MLE for θ is ˆθ = i= x i. As MLE of fuctio g(θ) is fuctio of MLE g(ˆθ), so MLE for τ is ˆτ = e P i= x i +. Obviously, the distributio of ˆθ is ˆθ N(θ, /), so ˆθ θ i probability. As ˆτ is a cotiuous fuctio of ˆθ, accordig to cotiuous mappig theorem, ˆτ g(θ) = τ i probability, which meas that MLE ˆτ is cosistet. (b) Cosider some loss fuctio L(τ, ˆτ). Defie what it meas for a estimator to be a miimax estimator for τ. Solutio: We say ˆτ is a miimax estimator of τ, if for ay other estimator τ, there is sup τ R(τ, ˆτ) sup R(τ, τ), τ where R(τ, ˆτ) = E τ L(τ, ˆτ) = L(τ, ˆτ(x ))f(x ; τ)dx. (c) Let π be a prior for θ. Fid the Bayes estimator for τ uder the loss L(τ, ˆτ) = (ˆτ τ) /τ. Solutio: To fid the Bayes estimator, for ay x, we wat to choose ˆτ(x ) to miimize r(ˆτ x ) = L(τ, ˆτ(x ))π(θ x )dθ. 4

221 Itroduce Loss fuctio L(τ, ˆτ) = (ˆτ τ) τ r(ˆτ x ) with respect to ˆτ, we have i the equatio, ad take the derivative of ˆτ r(ˆτ x ) = Let ˆτ r(ˆτ x ) = 0, the equatio is which is equivalet with ˆτ (ˆτ τ) π(θ x )dθ. τ (ˆτ τ(θ)) π(θ x )dθ = 0, τ τ π(θ x )dθ π(θ x )dθ = 0, hece the solutio is ˆτ(x ) = / τ π(θ x )dτ = /E[/τ x ]. 5

222 0 Fall 0 705/ Homework 7 Solutios

223

224 d d 0.3 a. By CLT, we have ˆ ˆ p p, p p /, p p, p p /. Stackig them together, ad cosiderig that they are idepedet, we have pˆ d p p p / 0, pˆ p. Usig Delta s method, it is easy to show that 0 p p / d pˆ pˆ p p, p p / p p /. Uder H0: p p p. ˆp is the MLE of p, thus pˆ p p. Combiig these facts with Slutzkey s theorem, we get Therefore, T. d

225

226 7. Show that, whe H0 is true, the the p value has a Uiform (0,) distributio. Proof: First, accordig to C&B Theorem..0, the cdf of a cotiuous r.v. follows Uiform(0, ).

227 Itermediate Statistics HW8 Problem (C & B 9.) Solutio: Problem (C & B 9.4(a)) Solutio: (a).

228 Problem 3 (C & B 9.33(a)) Solutio: Problem 4 Solutio: The likelihood fuctio for θ is L(θ; X,..., X ) = θ I X () θ, X () = max{x,..., X }. So, the MLE is ˆθ = X (). The likelihood ratio for data is L(θ) L(ˆθ) = ˆθ θ I X() θ I X() ˆθ Hece, usig LRT, we accept the H 0 whe = X () θ I X () θ. L(θ) L(ˆθ) = X () θ I X () θ > C. Choose a proper C to make sure that the test has size α. For Uiform distributio, the size could be calculated as P θ ( X () θ I X () θ C) = P θ (X () C / θ) = (C/ θ) = C. θ

229 So, take C = α to make sure that the LRT is with size α. I this sese, the acceptace regio is A(θ) = {X,, X : θ X () > α / θ}, ad the correspodig α cofidece iterval is (X (), X () α / ). Problem 5 Solutio: Give x, say that x B j, the the estimator is ˆp(x) = ˆθ j = h h i= I(X i B j ). For ay X i, the distributio of I(X i B j ) is Beroulli Distributio with parameter p = P (X B j ) = B j p(t)dt. As X,, X are i.i.d samples, so I(X B j ),, I(X B j ) are also i.i.d samples. Hece, we have the expectatio ad variace for ˆθ j as E[ˆθ j ] = p(t)dt, B j ad ad var(ˆθ j ) = p(t)dt( p(t)dt). B j B j Hece, the bias ad variace for the estimator is bias(ˆp(x)) = p(t)dt p(x), h B j var(ˆp(x)) = p(t)dt( p(t)dt) h B j B j So, the MSE for this sigle poit is MSE(x) = b + v = ( p(t)dt p(x)) + p(t)dt( p(t)dt). h B j h B j B j Now try to estimate the MSE term by term. Taylor expasio shows that p(t)dt = hp(x)+p (t x) (x) (t x)dt+ B j B j Bj p ( x)dx = hp(x)+hp (x)(h(j ) x)+o(h3 ). I the bi B j, the itegratio over bias square is ( p(t)dt p(x)) dx = p (x) (h(j B j h B j B j ) x) dx + O(h 3 ), 3

230 ad by the mea value theorem, B j ( h B j p(t)dt p(x)) dx p ( x j ) Bj (h(j ) x) dx = p ( x j ) h3. Hece, we have 0 bias(x) dx = m j= B j bias(x) dx m j= p ( x j ) h3 p (x j ) dx h. For the variace part, i the bi B j, it does ot chage, so the itegratio is vdt = h( p(t)dt( p(t)dt)) = B j h B j B j h ( p(t)dt ( p(t)dt) ), B j B j ad o [0, ] iterval, the variace is 0 vdt = m j= h ( p(t)dt ( p(t)dt) ) = B j B j h h m ( p(t)dt). B j With mea value theorem, we have that B j p(t)dt = p( x j )h, so it becomes j= 0 vdt = h h m j= h p( x j ) h ( p (x)dx). So, the approximatio of MSE o the desity fuctio is MSE = b + v p (x j ) dx h + h ( p (x)dx). If we take C = p (x j ) dx/, ad C = ( p (x)dx), the the approximate MSE is MSE C h + C h, so the best h should be O( /3 ), ad the correspodig covergece rate is /3. 4

231 0 Fall 0 705/ Test 3 Solutios () Let X X Beroulli( θ) θ ( ),, ~, 0, (a) Fid MLE ˆ θ, score fuctio, ad Fisher iformatio. Solutio: ΣX i ( θ; X ) = θ ( θ) ΣX i L ˆ θ = X i i= log L ΣXi θ S ( θ ) = = θ θ( θ) log L I ( θ ) = E θ = ( ) θ θ θ (b) Fid the limitig distributio of ˆ τ = e θˆ. Solutio: Accordig to Thm Lecture 9: ( ( ) ( )) τ ( θ) I ( θ ) ( θ) d ˆ ' τ θ τ θ N 0, = N 0, θ θ d θ ( ˆ θ e θ τ θ) N e, (c) Fid the Wald test for H0: θ = /, H: θ /. Solutio: θ ( e ( )) se ( ˆ θ ) = θ ( θ ) ˆ θ / Therefore the Wald test is: reject whe > zα /. / 4 () Let X X ( θ ),, ~, N.

232 (a) Fid the level α Neyma Pearso test for H0: θ =, H: θ =. Solutio: The Neyma Pearso test rejects whe T( X ) ( ) T x > k α. exp ( X i ) L( θ ) i = = = exp 3 / L( θ0 ) exp ( X i ) i ( ( X )) To get k α ( ( ) α) exp ( 3 / ) ( ( ) α) ( ) P T x > k = α P X > k = α P X > c = α zα Sice X ~ N (,/ ), we kow P0 X > + = α. Therefore the test is: reject whe zα X > +. (b) I what sese is Neyma Pearso optimal: Solutio: Neyma Pearso is optimal because it is UMP: amog all the level α tests, Neyma Pearso has the largest power fuctio for all θ Θ i.e. has miimum type II error. (c) Fid the LRT of H0: θ =, H: θ. Solutio: ˆ θ =, ˆ θ = X λ 0 ( X ) MLE L( ˆ θ exp ) ( Xi θ0 ) exp ( ) 0 Xi i i = = = L ( ˆ θ ) exp ( X ˆ i θ ) exp ( Xi X) i i = exp = exp i ( Xi ) ( Xi X) ( X )

233 X ~ N 0, X ~ χ. Therefore the LRT is: reject We kow uder H0 ( ) ( ) ( ) X > χ (or equivaletly X > / ). H0 whe ( ), α z α (3) Let ( ) X,, ~ 0,, 0 X Uiform θ θ > (a) Fid the likelihood ad MLE. Solutio: ( ) ( θ ) L= I( Xi θ ) = I X ( ) θ i θ ˆ θ = X = max X i i (b) Fid the form of LRT for H0: θ =, H: θ Solutio: The LRT rejects H0 if ( X ) ( θ0 ) ( ˆ θ ) λ c. ( ) ( ( ) ) ( ( ) ( ) ) L ˆ I X λ ( X ) = = = X ( ) I X( ) L I X X X ( ) Therefore, if X ( ) >, always reject H0. Otherwise reject H0 if X ( ) is smaller tha some value. (c) Fid the form of likelihood ratio cofidece iterval. Solutio: L( θ ) C = θ : c L ) L( ˆ θ ) X ( ˆ θ ) I( X( ) θ ) θ ( ) = θ = I X ( ) θ ( ) X θ ( )

234 Whe θ < X ( ), this ratio is always zero. Whe θ X ( ), this ratio is mootoically decreasig with θ. Therefore, C should have the form X ( ), U. (4) Let X X p( x θ ),, ~ ; (a) Let C(X ) be a α cofidece iterval for θ. Cosider testig H0: θ = θ0, H: θ θ0. Suppose we reject H0 if 0 C( X ) Solutio: ( θ C( X )) θ Pθ ( θ C( X )) θ Pθ θ C( X ) θ Pθ θ C( X ) if P θ θ if α ( ) ( ) sup α sup α θ. Show that this defies a level α test for H0. α which is the defiitio of a level α test.,, ~ 0,, 0 / X X Uiform θ θ >. Let C = X( ), X( )/ α. Show that C is a α CI (b) ( ) for θ. Solutio: For θ, ( ) ( ) ( ) ( Pθ ( X) ) ( ) / / / ( / ) ( ( )) ( ( )) P θ C = P X θ X α = P θα X = P θα X θ θ θ θ / / = θα = θα = α θ (c) Use (a) ad (b) to defie a level α test of H0: θ =, H: θ. Fid its power fuctio. Solutio: / / The test is: reject H 0 if X( ), X( )/ α X( ) >, or, X ( ) < α..

235 / / Pθ ( X X ) P ( ( ) θ X ) P θ ( X( ) ) / ( ( ) ) θ ( ( ) α ) / ( ) ( θ α ) ( ) ( ) ( ) β θ = > < α = > + < α = P X < + P X < θ ( Pθ X ) P ( X ) = < + < θ α α θ α + θ < θ / / = α < θ

236 36705 Itermediate Statistics Homework 9 Solutios Problem a). R(h) = E(L(h)) = E (ˆp h (x)) dx E ˆp h (x)p(x)dx + (p(x)) dx Sice the last term of the rhs has othig to do with h, differetiate R(h) with respective to h, d (p(x)) /dh = 0. The, mi h ( R(h) = mi E(L(h)) = mi E h h (ˆp h (x)) dx E ) ˆp h (x)p(x)dx = mi R(h) h b). E ˆR(h) = E = E X = E X (ˆp h (x)) dx E(ˆp h (Y i )) i= (ˆp h (x)) dx E X,Y (ˆp h (Y )) () (ˆp h (x)) dx E X (ˆp h (y)p(y)dy) () The secod expectatio i () is with respect both X s ad Y s, while the secod expectatio i () is with respect to X s. So with prove that E ˆR(h) = R(h) = E (ˆp h (x)) dx E ˆp h (x)p(x)dx. Problem Kerel desity estimator is defied as ˆp h (x) = i= h K ( ) Xi x h

237 ( ) Let Y i = h K Xi x h. Sice kerel K is a symmetric desity with expectatio 0. The a < Y i < b with a = mi(k( X i x h ))/h ad b = max(k( X i x h ))/h. We ca apply Hoeffdig iequality that Problem 3 P ( ˆp h (x) p h (x) > ɛ) = P ( i= Y i E(Y ) > ɛ) e ɛ (b a) E(ˆθ X,..., X ) = E[Y Biomial(, X)] = X = X E(ˆθ ) = E(E(ˆθ X,..., X )) = EX = θ Var(ˆθ X,..., X ) = X( X) = X( X) Var(ˆθ ) = Var(E(ˆθ X,..., X )) + E(Var(ˆθ X,..., X )) = = Var(X) + E( X( X)) = = θ( θ) + (EX EX ) = = (θ( θ) + θ θ( θ) θ ) = = θ( θ) θ( θ) = ( ) = θ( θ) Problem 4 V( ˆθ X,..., X ) Var( θ ˆ ) = X( X) p( p) X P p by law of large umber The, X( X) P p( p) by Theorem 0 i Lecture 4 So X( X) p( p) P

238 Itermediate Statistics HW0 Problem (C & B 7.3) Solutio: Problem (C & B 9.7) Solutio:

239 Problem 3 Solutio: (a) Say that X j = Z j + θ j, the X j N(θ j, ). The momet geeratig fuctio for X j N(θ, ) is M X (t) = e θt+t /. So, we have the momets for X as ad E[X] = θ, E[X ] = + θ, E[X 3 ] = 3θ + θ 3, E[X 4 ] = 3 + 6θ + θ 4, V ar(x ) = E[X 4 ] (E[X ]) = 3 + θ + θ 4 ( + θ ) = + 4θ. Because X j s are idepedet, so we have E[V ] = k E[Xj ] = j= k + θj = k + λ, j= k V ar[v ] = V ar[xj ] = j= (b) The posterior distributio of µ is k + 4θj = (k + λ). j= f(µ) = f(y µ)f(µ) = Π i= e (y i µ i ) π So, the posterior distributio of µ is N(y, I )..

240 (c) I this case, the distributio of τ = i µ i is χ (λ), where λ = i y i. (d) Accordig to the result i (a), the posterior mea of τ is ˆτ = + i y i. (e) Let W = i y i ad thus ˆτ = + W. By defiitio, W χ ( i µ i ). Therefore bias(ˆτ) = E(ˆτ) i µ i = E( + W ) i µ i = + ( + i µ i ) i µ i = V ar(ˆτ) = V ar( + W ) = V ar(w ) = + 4 i µ i (f) Accordig to the hit, W N (E(W ), V ar(w )) = N ( + i µ i, + 4 i µ i ). Now cosider the probability P ( ˆτ τ > ɛ) for a arbitrarily small ɛ. This probability will ever approach sice ˆτ τ = W + i µ i N (, + 4 i µ i ). I other words, the desity of ˆτ τ will ever cocetrate aroud zero. Therefore, ˆτ is ot cosistet. (g) From (c), τ = i µ i is χ (λ), the the α cofidece iterval for τ is C = [χ,α( i y i ), + ). (h) From (e), bias(ˆτ) =, the E(ˆτ ) = 0. τ = ˆτ = W is a ubiased estimator. V ar( τ) = + 4 i µ i. As, V ar( τ) 0, so τ is ot cosistet either. (i) From (e) we have W χ ( i µ i ) = χ (τ). Suppose we wat to test H 0 : τ = τ 0 vs. H : τ τ 0, the the rejectio regio of a level α test is R = {W : W χ,α(τ 0 )}. By ivertig this test, we have a size α cofidece iterval A = {τ : W χ,α(τ)}. The iterval i (g) is actually Bayesia credible iterval where the parameter τ is radom ad the iterval is determied by the posterior distributio of τ. The iterval i (i) is the frequetist cofidece iterval which we assume it is fixed ad the iterval is determied from the distributio of the estimator of τ. 3

241 Practice Fial Exam. Let X,..., X be iid from a distributio with mea µ ad variace σ. Let S = (X i X ) i= where X = i= X i. Prove that S P σ.. Let θ > 0. Let S θ deote the square i the plae whose four corers are (θ, θ), ( θ, θ), ( θ, θ) ad (θ, θ). Let X,..., X be iid data from a uiform distributio over S θ. (Note that each X i R.) ( θ, θ ) (θ,θ) ( θ, θ) (θ, θ) (a) Fid a miimal sufficiet statistic. (b) Fid the maximum likelihood estimate (mle). (c) Show that the mle is cosistet.

242 3. Let X,..., X Poisso(λ) ad let Y,..., Y m Poisso(γ). Assume that the two samples are idepedet. (a) Fid the Wald test for testig H 0 : λ = γ versus H : λ γ. (b) Fid the likelihood ratio test for testig H 0 : λ = γ versus H : λ γ. What is the (approximate) level α critical value? (c) Fid a approximate α cofidece iterval for λ γ. (d) Fid the BIC criterio for decidig betwee the two models: Model I: ν = γ. Model II: ν γ. 4. Let X,..., X Uif(0, θ). (a) Let θ = ax where a > 0 is a costat. Fid the risk of θ uder squared error loss. (b) Fid the posterior mea usig the (improper) prior π(θ) /θ. (c) Suppose ow that 0 θ B where B > 0 is give. Hece the parameter space is Θ = [0, B]. Let θ be the Bayes estimator (assumig squared error loss) assumig that the prior puts all its mass at θ = 0. I other words, the prior is a poit mass at θ = 0. Prove that the posterior mea is ot miimax. (Hit: You eed oly fid some other estimator θ such that sup θ Θ R(θ, θ) < sup θ Θ R(θ, θ). 5. Suppose that (Y, X) are radom variables where Y {0, } ad X R. Suppose that ad that X Y = 0 Uif( 5, 5) X Y = Uif(, ). Further suppose that P(Y = 0) = P(Y = ) = /. (a) Fid m(x) = P(Y = X = x). (b) Let A = {(a, b) : a, b R, a b}. Fid the VC dimesio of A.

243 (c) Let H = {h A : A A} where h A (x) = if x A ad h A (x) = 0 if x / A. Show that the Bayes rule h is i H. (d) Let ĥ be the empirical risk miimizer based o data (X, Y ),..., (X, Y ). Show that R(ĥ) R(h ) ɛ with high probability. 6. Let X, X be iid Uiform(0, ). Fid the desity of Y = X + X. 7. Let X,..., X be iid data from a uiform distributio over the disc of radius θ i R. Thus, X i R ad { if x θ f(x; θ) = πθ 0 otherwise where x = x + x. (a) Fid a miimal sufficiet statistic. (b) Fid the maximum likelihood estimate (mle). (c) Show that the mle is cosistet. 8. Let X Biomial(, p) ad Y Biomial(m, q). Assume that X ad Y are idepedet. (a) Fid the Wald test for testig H 0 : p = q versus H : p q. (b) Fid the likelihood ratio test for testig H 0 : p = q versus H : p q. (c) Fid a approximate α cofidece iterval for θ = p q. 9. Let X f(x; θ) where θ Θ. Let L( θ, θ) be a loss fuctios. (a) Defie the followig terms: risk fuctio, miimax estimator, Bayes estimator. (b) Show that a Bayes estimator with costat risk is miimax. 3

244 0. Let X,..., X N(θ, ). Let π be a N(0, ) prior: π(θ) = π e θ /. (a) Fid the posterior distributio for θ. (b) Fid the posterior mea θ. (c) Fid the mea squared error R(θ, θ) = E θ (θ θ).. Let X,..., X Poisso(λ). (a) Fid the mle λ. (b) Fid the score fuctio. (c) Fid the Fisher iformtio. (d) Fid the limitig distributio of the mle. (e) Show that λ is cosistet. (f) Let ψ = e λ. Fid the limitig distributio of ψ = e bλ. (g) Show that ψ is a cosistet estimate of ψ.. Let X,..., X be a sample from f(x; θ) = (/)( + θx) where < x < ad < θ <. (a) Fid the mle θ. Show that it is cosistet. (b) Fid the method of momets estimator ad show that it is cosistet. 4

More information

Basic Elements of Arithmetic Sequences and Series

Basic Elements of Arithmetic Sequences and Series MA40S PRE-CALCULUS UNIT G GEOMETRIC SEQUENCES CLASS NOTES (COMPLETED NO NEED TO COPY NOTES FROM OVERHEAD) Basic Elemets of Arithmetic Sequeces ad Series Objective: To establish basic elemets of arithmetic

More information

Determining the sample size

Determining the sample size Determiig the sample size Oe of the most commo questios ay statisticia gets asked is How large a sample size do I eed? Researchers are ofte surprised to fid out that the aswer depeds o a umber of factors

More information

THE HEIGHT OF q-binary SEARCH TREES

THE HEIGHT OF q-binary SEARCH TREES THE HEIGHT OF q-binary SEARCH TREES MICHAEL DRMOTA AND HELMUT PRODINGER Abstract. q biary search trees are obtaied from words, equipped with the geometric distributio istead of permutatios. The average

More information

3 Basic Definitions of Probability Theory

3 Basic Definitions of Probability Theory 3 Basic Defiitios of Probability Theory 3defprob.tex: Feb 10, 2003 Classical probability Frequecy probability axiomatic probability Historical developemet: Classical Frequecy Axiomatic The Axiomatic defiitio

More information

FIBONACCI NUMBERS: AN APPLICATION OF LINEAR ALGEBRA. 1. Powers of a matrix

FIBONACCI NUMBERS: AN APPLICATION OF LINEAR ALGEBRA. 1. Powers of a matrix FIBONACCI NUMBERS: AN APPLICATION OF LINEAR ALGEBRA. Powers of a matrix We begi with a propositio which illustrates the usefuless of the diagoalizatio. Recall that a square matrix A is diogaalizable if

More information

Confidence Intervals for One Mean

Confidence Intervals for One Mean Chapter 420 Cofidece Itervals for Oe Mea Itroductio This routie calculates the sample size ecessary to achieve a specified distace from the mea to the cofidece limit(s) at a stated cofidece level for a

More information

One-sample test of proportions

One-sample test of proportions Oe-sample test of proportios The Settig: Idividuals i some populatio ca be classified ito oe of two categories. You wat to make iferece about the proportio i each category, so you draw a sample. Examples:

More information

GCSE STATISTICS. 4) How to calculate the range: The difference between the biggest number and the smallest number.

GCSE STATISTICS. 4) How to calculate the range: The difference between the biggest number and the smallest number. GCSE STATISTICS You should kow: 1) How to draw a frequecy diagram: e.g. NUMBER TALLY FREQUENCY 1 3 5 ) How to draw a bar chart, a pictogram, ad a pie chart. 3) How to use averages: a) Mea - add up all

More information

Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed.

Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed. This documet was writte ad copyrighted by Paul Dawkis. Use of this documet ad its olie versio is govered by the Terms ad Coditios of Use located at http://tutorial.math.lamar.edu/terms.asp. The olie versio

More information

NOTES ON PROBABILITY Greg Lawler Last Updated: March 21, 2016

NOTES ON PROBABILITY Greg Lawler Last Updated: March 21, 2016 NOTES ON PROBBILITY Greg Lawler Last Updated: March 21, 2016 Overview This is a itroductio to the mathematical foudatios of probability theory. It is iteded as a supplemet or follow-up to a graduate course

More information

Modified Line Search Method for Global Optimization

Modified Line Search Method for Global Optimization Modified Lie Search Method for Global Optimizatio Cria Grosa ad Ajith Abraham Ceter of Excellece for Quatifiable Quality of Service Norwegia Uiversity of Sciece ad Techology Trodheim, Norway {cria, ajith}@q2s.tu.o

More information

The Stable Marriage Problem

The Stable Marriage Problem The Stable Marriage Problem William Hut Lae Departmet of Computer Sciece ad Electrical Egieerig, West Virgiia Uiversity, Morgatow, WV [email protected] 1 Itroductio Imagie you are a matchmaker,

More information

Confidence Intervals. CI for a population mean (σ is known and n > 30 or the variable is normally distributed in the.

Confidence Intervals. CI for a population mean (σ is known and n > 30 or the variable is normally distributed in the. Cofidece Itervals A cofidece iterval is a iterval whose purpose is to estimate a parameter (a umber that could, i theory, be calculated from the populatio, if measuremets were available for the whole populatio).

More information

Plug-in martingales for testing exchangeability on-line

Plug-in martingales for testing exchangeability on-line Plug-i martigales for testig exchageability o-lie Valetia Fedorova, Alex Gammerma, Ilia Nouretdiov, ad Vladimir Vovk Computer Learig Research Cetre Royal Holloway, Uiversity of Lodo, UK {valetia,ilia,alex,vovk}@cs.rhul.ac.uk

More information

MARTINGALES AND A BASIC APPLICATION

MARTINGALES AND A BASIC APPLICATION MARTINGALES AND A BASIC APPLICATION TURNER SMITH Abstract. This paper will develop the measure-theoretic approach to probability i order to preset the defiitio of martigales. From there we will apply this

More information

Example 2 Find the square root of 0. The only square root of 0 is 0 (since 0 is not positive or negative, so those choices don t exist here).

Example 2 Find the square root of 0. The only square root of 0 is 0 (since 0 is not positive or negative, so those choices don t exist here). BEGINNING ALGEBRA Roots ad Radicals (revised summer, 00 Olso) Packet to Supplemet the Curret Textbook - Part Review of Square Roots & Irratioals (This portio ca be ay time before Part ad should mostly

More information

Concentration of Measure

Concentration of Measure Copyright c 2008 2010 Joh Lafferty, Ha Liu, ad Larry Wasserma Do Not Distribute Chapter 7 Cocetratio of Measure Ofte we wat to show that some radom quatity is close to its mea with high probability Results

More information

WHEN IS THE (CO)SINE OF A RATIONAL ANGLE EQUAL TO A RATIONAL NUMBER?

WHEN IS THE (CO)SINE OF A RATIONAL ANGLE EQUAL TO A RATIONAL NUMBER? WHEN IS THE (CO)SINE OF A RATIONAL ANGLE EQUAL TO A RATIONAL NUMBER? JÖRG JAHNEL 1. My Motivatio Some Sort of a Itroductio Last term I tought Topological Groups at the Göttige Georg August Uiversity. This

More information

CME 302: NUMERICAL LINEAR ALGEBRA FALL 2005/06 LECTURE 8

CME 302: NUMERICAL LINEAR ALGEBRA FALL 2005/06 LECTURE 8 CME 30: NUMERICAL LINEAR ALGEBRA FALL 005/06 LECTURE 8 GENE H GOLUB 1 Positive Defiite Matrices A matrix A is positive defiite if x Ax > 0 for all ozero x A positive defiite matrix has real ad positive

More information

Taking DCOP to the Real World: Efficient Complete Solutions for Distributed Multi-Event Scheduling

Taking DCOP to the Real World: Efficient Complete Solutions for Distributed Multi-Event Scheduling Taig DCOP to the Real World: Efficiet Complete Solutios for Distributed Multi-Evet Schedulig Rajiv T. Maheswara, Milid Tambe, Emma Bowrig, Joatha P. Pearce, ad Pradeep araatham Uiversity of Souther Califoria

More information

AP Calculus BC 2003 Scoring Guidelines Form B

AP Calculus BC 2003 Scoring Guidelines Form B AP Calculus BC Scorig Guidelies Form B The materials icluded i these files are iteded for use by AP teachers for course ad exam preparatio; permissio for ay other use must be sought from the Advaced Placemet

More information

INFINITE SERIES KEITH CONRAD

INFINITE SERIES KEITH CONRAD INFINITE SERIES KEITH CONRAD. Itroductio The two basic cocepts of calculus, differetiatio ad itegratio, are defied i terms of limits (Newto quotiets ad Riema sums). I additio to these is a third fudametal

More information

THE TWO-VARIABLE LINEAR REGRESSION MODEL

THE TWO-VARIABLE LINEAR REGRESSION MODEL THE TWO-VARIABLE LINEAR REGRESSION MODEL Herma J. Bieres Pesylvaia State Uiversity April 30, 202. Itroductio Suppose you are a ecoomics or busiess maor i a college close to the beach i the souther part

More information

Building Blocks Problem Related to Harmonic Series

Building Blocks Problem Related to Harmonic Series TMME, vol3, o, p.76 Buildig Blocks Problem Related to Harmoic Series Yutaka Nishiyama Osaka Uiversity of Ecoomics, Japa Abstract: I this discussio I give a eplaatio of the divergece ad covergece of ifiite

More information

Notes on exponential generating functions and structures.

Notes on exponential generating functions and structures. Notes o expoetial geeratig fuctios ad structures. 1. The cocept of a structure. Cosider the followig coutig problems: (1) to fid for each the umber of partitios of a -elemet set, (2) to fid for each the

More information

NATIONAL SENIOR CERTIFICATE GRADE 12

NATIONAL SENIOR CERTIFICATE GRADE 12 NATIONAL SENIOR CERTIFICATE GRADE MATHEMATICS P EXEMPLAR 04 MARKS: 50 TIME: 3 hours This questio paper cosists of 8 pages ad iformatio sheet. Please tur over Mathematics/P DBE/04 NSC Grade Eemplar INSTRUCTIONS

More information

CS103X: Discrete Structures Homework 4 Solutions

CS103X: Discrete Structures Homework 4 Solutions CS103X: Discrete Structures Homewor 4 Solutios Due February 22, 2008 Exercise 1 10 poits. Silico Valley questios: a How may possible six-figure salaries i whole dollar amouts are there that cotai at least

More information

Approximating Area under a curve with rectangles. To find the area under a curve we approximate the area using rectangles and then use limits to find

Approximating Area under a curve with rectangles. To find the area under a curve we approximate the area using rectangles and then use limits to find 1.8 Approximatig Area uder a curve with rectagles 1.6 To fid the area uder a curve we approximate the area usig rectagles ad the use limits to fid 1.4 the area. Example 1 Suppose we wat to estimate 1.

More information

Universal coding for classes of sources

Universal coding for classes of sources Coexios module: m46228 Uiversal codig for classes of sources Dever Greee This work is produced by The Coexios Project ad licesed uder the Creative Commos Attributio Licese We have discussed several parametric

More information

5 Boolean Decision Trees (February 11)

5 Boolean Decision Trees (February 11) 5 Boolea Decisio Trees (February 11) 5.1 Graph Coectivity Suppose we are give a udirected graph G, represeted as a boolea adjacecy matrix = (a ij ), where a ij = 1 if ad oly if vertices i ad j are coected

More information

THE ABRACADABRA PROBLEM

THE ABRACADABRA PROBLEM THE ABRACADABRA PROBLEM FRANCESCO CARAVENNA Abstract. We preset a detailed solutio of Exercise E0.6 i [Wil9]: i a radom sequece of letters, draw idepedetly ad uiformly from the Eglish alphabet, the expected

More information

Descriptive Statistics

Descriptive Statistics Descriptive Statistics We leared to describe data sets graphically. We ca also describe a data set umerically. Measures of Locatio Defiitio The sample mea is the arithmetic average of values. We deote

More information

Inference on Proportion. Chapter 8 Tests of Statistical Hypotheses. Sampling Distribution of Sample Proportion. Confidence Interval

Inference on Proportion. Chapter 8 Tests of Statistical Hypotheses. Sampling Distribution of Sample Proportion. Confidence Interval Chapter 8 Tests of Statistical Hypotheses 8. Tests about Proportios HT - Iferece o Proportio Parameter: Populatio Proportio p (or π) (Percetage of people has o health isurace) x Statistic: Sample Proportio

More information

Center, Spread, and Shape in Inference: Claims, Caveats, and Insights

Center, Spread, and Shape in Inference: Claims, Caveats, and Insights Ceter, Spread, ad Shape i Iferece: Claims, Caveats, ad Isights Dr. Nacy Pfeig (Uiversity of Pittsburgh) AMATYC November 2008 Prelimiary Activities 1. I would like to produce a iterval estimate for the

More information

Permutations, the Parity Theorem, and Determinants

Permutations, the Parity Theorem, and Determinants 1 Permutatios, the Parity Theorem, ad Determiats Joh A. Guber Departmet of Electrical ad Computer Egieerig Uiversity of Wiscosi Madiso Cotets 1 What is a Permutatio 1 2 Cycles 2 2.1 Traspositios 4 3 Orbits

More information

Lecture 3. denote the orthogonal complement of S k. Then. 1 x S k. n. 2 x T Ax = ( ) λ x. with x = 1, we have. i = λ k x 2 = λ k.

Lecture 3. denote the orthogonal complement of S k. Then. 1 x S k. n. 2 x T Ax = ( ) λ x. with x = 1, we have. i = λ k x 2 = λ k. 18.409 A Algorithmist s Toolkit September 17, 009 Lecture 3 Lecturer: Joatha Keler Scribe: Adre Wibisoo 1 Outlie Today s lecture covers three mai parts: Courat-Fischer formula ad Rayleigh quotiets The

More information