Lecture 2 Maximum Likelihood Estimators. Matlab example. As a motivatio, let us look at oe Matlab example. Let us geerate a radom sample of size 00 from beta distributio Beta(5, 2). We will lear the defiitio of beta distributio later, at this poit we oly eed to kow that this isi a cotiuous distributio o the iterval [0, ]. This ca be doe by typig X=betard(5,2,00,). Let us fit differet distributios by usig a distributio fittig tool dfittool. We try to fit ormal distributio ad beta distributio to this sample ad the results are displayed i figure 2.. Desity 3 2.5 2.5 00 samples ~ Beta(5,2) Normal fit Beta fit Cumulative probability 0.9 0.8 0.7 0.3 00 samples ~ Beta(5,2) Normal fit Beta fit 0. 0 0 0.3 0.7 0.8 0.9 0.3 0.7 0.8 0.9 Figure 2.: Fittig a radom sample of size 00 from Beta(5, 2). (a) Histogram of the data ad p.d.f.s of fitted ormal (solid lie) ad beta (dashed lie) distributios; (b) Empirical c.d.f. ad c.d.f.s of fitted ormal ad beta distributios. Besides the graphs, the distributio fittig tool outputs the followig iformatio: Distributio: Normal Log likelihood: 55.257 7
Domai: -If < y < If Mea: 0.7429 Variace: 0.095845 Parameter Estimate Std. Err. mu 0.7429 0.039945 sigma 0.39945 0.00997064 Estimated covariace of parameter estimates: mu sigma mu 0.00095845 6.0523e-020 sigma 6.0523e-020 9.9436e-005 Distributio: Beta Log likelihood: 63.8445 Domai: 0 < y < Mea: 0.7437 Variace: 0.08452 Parameter Estimate Std. Err. a 6.97783.08827 b 2.43424 0.37835 Estimated covariace of parameter estimates: a b a.8433 0.370094 b 0.370094 349 The value Log likelihood idicates that the tool uses the maximum likelihood estimators to fit the distributio, which will be the topic of the ext few lectures. Notice the Parameter estimates - give the data dfittool estimates the ukow parameters of the distributio ad the graphs the p.d.f. or c.d.f. correspodig to these parameters. Sice the data was geerated from beta distributio, it is ot surprisig that beta distributio fit seems better tha ormal distributio fit, which is particularly clear from figure 2. (b), that compares how estimated c.d.f. fits the empirical c.d.f. Empirical c.d.f. is defied as F (x) = I(X i x) i= where I(X x) is the idicator that X i is x. I other words, F (x) is the proportio of observatios below level x. Oe ca ask several questios about this example:. How to estimate the ukow parameters of a distributio give the data from this distributio? 8
2. How good are these estimates, are they close to the actual true parameters? 3. Does the data come from a particular type of distributio, for example, ormal or beta distributio? I the ext few lectures we will study the first two questios ad we will assume that we kow what type of distributio the sample comes from, so we oly do ot kow the parameters of the distributio. I the cotext of the above example, we would be told that the data comes from beta distributio, but the parameters (5, 2) would be ukow. Of course, i geeral we might ot kow what kid of distributio the data comes from - we will study this type of questios later whe we look at the so called goodess-of-fit hypotheses tests. I particular, we will see graphs like 2. (b) agai whe we study the Kolmogorov-Smirov goodess-of-fit test. Example. We cosider a dataset of various body measuremets from [] (dataset ca be dowloaded from joural s website), icludig weight, height, waist girth, abdome girth, etc. First, we use Matlab fittig tool to fit weight ad waist girth of me ad wome (separately) with logormal distributio, see figure 2.2 (a) ad (b). Wikipedia article about ormal distributio gives a referece to a 932 book Problems of Relative Growth by Julia Huxley for the explaatio why the sizes of full-grow aimals are approximately log-ormal. Oe short explaatio is cosistecy betwee liear ad volume dimesios - if liear dimesios are logormal ad volume dimesios are proportioal to cube of liear dimesios the they also are logormal. Assumptio that sizes are ormal would violate this cosistecy, sice the cube of ormal is ot ormal. We observe, hovewer, that the fit of wome s waist with logormal is ot very accurate. Later i the class we will lear several statistical tests to decide if the data comes from a certai distributio or a family of distributios, but here is a preview of what s to come. Chi-squared goodess-of-fit test rejects the hypothesis that the distributio of logarithms of wome s waists is ormal: [h,p,stats]=chi2gof(log_wome_waist) h =, p = 5.2297e-004 stats = chi2stat: 22.0027 df: 5 edges: [x9 double] O: [2 44 67 60 28 8 2 0] E: [x8 double] ad so does Lilliefor s test (adjusted Kolmogorov-Smirov test): [h,p,stats]=lillietest(log_wome_waist) h =, p = 0, stats = 0.084. The same tests accept the hypotheses that other variables have logormal distributio. Author s i [] suggest that we ca fit wome s waist with Gamma distributio. Sice Gamma 9
0.9 0.9 0.8 0.8 Cumulative probability 0.7 0.3 0. wome s weight logormal fit wome me s weight logormal fit me Cumulative probability 0.7 0.3 0. wome s waist logormal fit wome me s waist logormal fit me 0 0 50 60 70 80 90 00 0 60 70 80 90 00 0 0.9 0.8 Cumulative probability 0.7 0.3 0. 0 wome s waist (shifted) Gamma fit ormal fit 5 0 5 20 25 30 35 40 45 Figure 2.2: Fittig weight (upper left) ad waist girth (upper right) with logormal distributio. Lower left: fittig wome s waist with shifted Gamma ad ormal distributios. does ot have a traslatio (shift) parameter, whe we fit Gamma distributio we ca either add to it a shift parameter or istead shift all data to start at zero. I figure 2.2 (c) we fit Gamma ad, for the sake of illustratio, ormal distributio, to wome s waist sample. As we ca see, Gamma fits the data better tha logormal ad much better tha ormal. To fid the parameters of fitted Gamma distributio we use Matlab gamfit fuctio: param=gamfit(wome_waist_shift) param = 2.8700 4.4960. Chi-squared goodess-of-fit test for a specific (fitted) Gamma distributio: [h,p,stats]=chi2gof(wome_waist_shift, cdf,@(z)gamcdf(z,param(),param(2))) 0
h = 0, p = 0.9289, stats = chi2stat: 2.4763, df: 7 accepts the hypothesis that the sample has Gamma distributio (2.87, 4.496). This test is ot accurate i some sese, which will be explaied later. Oe ca also check that Gamma distributio fits well other variables - me s waist girth, weight of me ad weight of wome. Let us cosider a family of distributios P idexed by a parameter (which could be a vector of parameters) ϕ that belogs to a set. For example, we could cosider a family of ormal distributios N(, α 2 ) i which case the parameter would be ϕ = (, α 2 ) - the mea ad variace of the distributio. Let f(x ϕ) be either a probability fuctio (i case of discrete distributio) or a probability desity fuctio (cotiuous case) of the distributio P. Suppose we are give a i.i.d. sample X,..., X with ukow distributio P from this family, i.e. parameter ϕ is ukow. A likelihood fuctio is defied by (ϕ) = f(x ϕ)... f(x ϕ). We thik of the sample X,..., X as give umbers ad we thik of as a fuctio of the parameter ϕ oly. The likelihood fuctio has a clear iterpretatio. For example, if our distributios are discrete the the probability fuctio f(x ϕ) = P (X = x) is the probability to observe a poit x ad the likelihood fuctio (ϕ) = f(x ϕ)... f(x ϕ) = P (X )... P (X ) = P (X,..., X ) is the probability to observe the sample X,..., X whe the parameters of the distributio are equal to ϕ. I the cotiuous case the likelihood fuctio (ϕ) is the probability desity fuctio of the vector (X,..., X ). Defiitio: (Maximum Likelihood Estimators.) Suppose that there exists a parameter ϕˆ that maximizes the likelihood fuctio (ϕ) o the set of possible parameters, i.e. (ϕˆ) = max (ϕ). The ϕˆ is called the Maximum Likelihood Estimator (MLE). Whe fidig the MLE it sometimes easier to maximize the log-likelihood fuctio sice (ϕ) maximize log (ϕ) maximize maximizig is equivalet to maximizig log. Log-likelihood fuctio ca be writte as log (ϕ) = log f(x i ϕ). i= Let us give several examples of computig the MLE.
Example. Beroulli distributio B(p). X = {0, }, P(X = ) = p, P(X = 0) = p, p [0, ]. Probability fuctio i this case is give by p, x = f(x p) = p, x = 0 = p x ( p) x. Likelihood fuctio is (p) = f(x p)f(x 2 p)... f(x p) # of s ( p) # of 0 s X +...+X = p = p ( p) (X +...+X ) ad the log-likelihood fuctio is log (p) = (X +... + X ) log p + ( (X +... + X )) log( p). To maximize this over p [0, ] let us fid the critical poit (log (p)) = 0, (X +... + X ) ( (X +... + X )) = 0. p p Solvig this for p gives, X +... + X p = = X ad, therefore, the proportio of successes pˆ = X i the sample is the MLEstimator of the ukow true probability of success, which is a very atural ad ituitive estimator. For example, by law of large umbers, we kow that X EX = p i probability (we will recall this defiitio i the ext lecture), which meas that our estimate will approximate the ukow parameter p well whe we get more ad more data. Remark. I each example, oce we compute the estimate of parameters, we ca try to prove directly, usig the explicit form of the estimate, that it approximates well the ukow parameters, as we did i Example. However, i the ext lecture we will describe i a geeral settig that MLE has good properties. Example 2. Normal distributio N(, α 2 ). The p.d.f. of ormal distributio is ad, therefore, likelihood fuctio is (X ) 2 2 2 f(x (, α 2 )) = 2α e. (, α 2 ) = e (X i ) 2 2 2. 2α i= 2
ad log-likelihood fuctio is log (, α 2 ) = (X i ) 2 log log α 2 2α 2 i= d (X i ) 2 = 2 (X i ) = 0 log log α (X i X ) 2 2 + (X i X ) 2 = 0 α α 3 αˆ2 = (X X ) 2 i. The ormal distributio fit i figure 2. correspods to these parameters (ˆ, αˆ2). Exercise. Geerate a ormal sample i Matlab ad fit it with a ormal distributio usig dfittool. The plot a p.d.f. or c.d.f. correspodig to MLE above ad compare this with dfittool. Let us give oe more example of MLE. Uiform distributio U[0, ϕ] o the iterval [0, ϕ]. This distributio has p.d.f., 0 x ϕ, f(x ϕ) = 0, otherwise. = log log α (X i ) 2. 2 2α 2 We wat to maximize the log-likelihood with respect to < < ad α 2 > 0. First, obviously, for ay α we eed to miimize (X i ) 2 over. The critical poit coditio is d i= i= ad solvig this for we get that ˆ = X. We ca plug this estimate i the log-likelihood ad it remais to maximize over α. The critical poit coditio reads, 2α 2 i= ad solvig this for α we obtai that the MLE of α 2 is i= i= The likelihood fuctio (ϕ) = f(x i ϕ) = i= = ϕ I(X,..., X [0, ϕ]) ϕ I(max(X,..., X ) ϕ). 3
Here the idicator fuctio I(A) equals to if evet A happes ad 0 otherwise. What the idicator above meas is that the likelihood will be equal to 0 if at least oe of the factors is 0 ad this will happe if at least oe observatio X i will fall outside of the allowed iterval [0, ϕ]. Aother way to say it is that the maximum amog observatios will exceed ϕ, i.e. ad (ϕ) = 0 if ϕ < max(x,..., X ), (ϕ) = if ϕ max(x,..., X ). ϕ Therefore, lookig at the figure 2.3 we see that ϕˆ = max(x,..., X ) is the MLE. 2.8.6.4 (ϕ).2 0.8 PSfrag replacemets 0 0.5 2 2.5 3 max(x,..., X ) ϕ Figure 2.3: MLE for the uiform distributio. Sometimes it is ot so easy to fid the maximum of the likelihood fuctio as i the examples above ad oe might have to do it umerically. Also, MLE does ot always exist. Here is a example: let us cosider uiform distributio U[0, ϕ) ad defie the desity by, 0 x < ϕ, f(x ϕ) = 0, otherwise. The differece is that we excluded the poit ϕ by settig f(ϕ ϕ) = 0. The the likelihood fuctio is ϕ i= (ϕ) = f(x i ϕ) = I(max(X,..., X ) < ϕ) 4
ad the maximum at the poit ϕˆ = max(x,..., X ) is ot achieved. Of course, this is a artificial example that shows that sometimes oe eeds to be careful. Refereces: [] Grete Heiz, Louis J. Peterso, Roger W. Johso, Carter J. Kerk, (2003) Explorig Relatioships i Body Dimesios. Joural of Statistics Educatio, Volume, Number 2. 5