Stat 504, Lecture 3

Review (cont'd.): Loglikelihood and Confidence Intervals

Review: Let X_1, X_2, ..., X_n be a simple random sample from a probability distribution f(x; θ). A parameter θ of f(x; θ) is a quantity that is characteristic of f(x; θ). A statistic T is any quantity that can be calculated from a sample; it's a function of X_1, ..., X_n. An estimate θ̂ for θ is a single number that is a reasonable value for θ. An estimator θ̂ for θ is a statistic that gives the formula for computing the estimate θ̂.

The likelihood of the sample is the joint PDF (or PMF):

L(θ) = f(x_1, ..., x_n; θ) = ∏_{i=1}^n f(x_i; θ)

The maximum likelihood estimate (MLE) θ̂_MLE maximizes L(θ): L(θ̂_MLE) ≥ L(θ) for all θ. If we use the X_i's instead of the x_i's, then θ̂ is the maximum likelihood estimator. Usually the MLE:

- is unbiased: E(θ̂) = θ
- is consistent: θ̂ → θ as n → ∞
- is efficient: has small SE(θ̂) as n → ∞
- is asymptotically normal: (θ̂ − θ)/SE(θ̂) ≈ N(0, 1)

The loglikelihood function is defined to be the natural logarithm of the likelihood function,

l(θ; x) = log L(θ; x).

For a variety of reasons, statisticians often work with the loglikelihood rather than with the likelihood. One reason is that l(θ; x) tends to be a simpler function than L(θ; x): when we take logs, products are changed to sums. If X = (X_1, X_2, ..., X_n) is an iid sample from a probability distribution f(x; θ), the overall likelihood is the product of the likelihoods for the individual X_i's:

L(θ; x) = ∏_{i=1}^n f(x_i; θ) = ∏_{i=1}^n L(θ; x_i)

The loglikelihood, however, is the sum of the individual loglikelihoods:

l(θ; x) = ∑_{i=1}^n log f(x_i; θ) = ∑_{i=1}^n l(θ; x_i).

Below are some examples of loglikelihood functions.
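To make the product-versus-sum point concrete, here is a small Python check (the notes' own code is in R; this helper and the Poisson example are my own scaffolding): the sum of individual loglikelihoods equals the log of the product of individual likelihoods.

```python
import math

def poisson_logpmf(x, lam):
    # log f(x; lambda) = x*log(lambda) - lambda - log(x!)
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def loglik(lam, xs):
    # loglikelihood of an iid sample: the sum of individual loglikelihoods
    return sum(poisson_logpmf(x, lam) for x in xs)

xs = [2, 0, 3, 1]           # a made-up iid Poisson sample
l_sum = loglik(2.0, xs)
# the same quantity via the product form L(theta; x) = prod_i f(x_i; theta)
L_prod = math.prod(math.exp(poisson_logpmf(x, 2.0)) for x in xs)
print(l_sum, math.log(L_prod))  # the two agree
```

The product form underflows quickly for large n, which is another practical reason to work on the log scale.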
Binomial. Suppose X ∼ Bin(n, p) where n is known. The likelihood function is

L(p; x) = [ n! / ( x! (n−x)! ) ] p^x (1−p)^(n−x)

and so the loglikelihood function is

l(p; x) = k + x log p + (n−x) log(1−p),

where k is a constant that doesn't involve the parameter p. In the future we will omit the constant, because it's statistically irrelevant.

Poisson. Suppose X = (X_1, X_2, ..., X_n) is an iid sample from a Poisson distribution with parameter λ. The likelihood is

L(λ; x) = ∏_{i=1}^n λ^(x_i) e^(−λ) / x_i!  =  λ^(∑ x_i) e^(−nλ) / ( x_1! x_2! ⋯ x_n! ),

and the loglikelihood is

l(λ; x) = ( ∑_{i=1}^n x_i ) log λ − nλ,

ignoring the constant terms that don't depend on λ. As the above examples show, l(θ; x) often looks nicer than L(θ; x) because the products become sums and the exponents become multipliers.

Asymptotic confidence intervals

The loglikelihood forms the basis for many approximate confidence intervals and hypothesis tests, because it behaves in a predictable manner as the sample size grows. The following example illustrates what happens to l(θ; x) as n becomes large.

In-Class Exercise: Suppose that we observe X = 1 from a binomial distribution with n = 4 and p unknown. Calculate the loglikelihood. What does the graph of the likelihood look like? Find the MLE (do you understand the difference between the estimator and the estimate?). Locate the MLE on the graph of the likelihood.
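One way to check the exercise numerically: a short Python sketch (names mine, not from the notes) that evaluates the binomial loglikelihood kernel on a grid and confirms that the grid maximizer sits at x/n.

```python
import math

def binom_loglik(p, x, n):
    # binomial loglikelihood kernel, constant term omitted
    return x * math.log(p) + (n - x) * math.log(1 - p)

x, n = 1, 4
grid = [i / 1000 for i in range(1, 1000)]          # p in (0, 1)
p_hat = max(grid, key=lambda p: binom_loglik(p, x, n))
print(p_hat)  # 0.25, i.e. x/n
```

A grid search is crude but matches the calculus answer here, since the loglikelihood is concave with a unique maximum.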
The MLE is p̂ = 1/4 = .25. Ignoring constants, the loglikelihood is

l(p; x) = log p + 3 log(1−p),

which looks like this:

[Figure: graph of l(p; x) versus p, peaked at p̂ = .25]

Here is sample code for plotting this function in R:

p<-seq(from=.01,to=.80,by=.01)
loglik<-log(p) + 3*log(1-p)
plot(p,loglik,xlab="p",ylab="",type="l",xlim=c(0,1))

(For clarity, I omitted from the plot all values of p beyond .8, because for p > .8 the loglikelihood drops down so low that including these values of p would distort the plot's appearance. When plotting loglikelihoods, we don't need to include all θ values in the parameter space; in fact, it's a good idea to limit the domain to those θ's for which the loglikelihood is no more than 2 or 3 units below the maximum value l(θ̂; x) because, in a single-parameter problem, any θ whose loglikelihood is more than 2 or 3 units below the maximum is highly implausible.)

Now suppose that we observe X = 10 from a binomial distribution with n = 40. The MLE is again p̂ = 10/40 = .25, but the loglikelihood is

l(p; x) = 10 log p + 30 log(1−p),

[Figure: graph of l(p; x) versus p for n = 40]

Finally, suppose that we observe X = 100 from a binomial with n = 400. The MLE is still p̂ = 100/400 = .25, but the loglikelihood is now

l(p; x) = 100 log p + 300 log(1−p),

[Figure: graph of l(p; x) versus p for n = 400]

As n gets larger, two things are happening to the loglikelihood. First, l(p; x) is becoming more sharply peaked around p̂. Second, l(p; x) is becoming more symmetric about p̂.

The first point shows that as the sample size grows, we are becoming more confident that the true parameter lies close to p̂. If the loglikelihood is highly peaked, that is, if it drops sharply as we move away from the MLE, then the evidence is strong that p is near p̂. A flatter loglikelihood, on the other hand, means that p is not well estimated and the range of plausible values is wide. In fact, the curvature of the loglikelihood (i.e. the second derivative of l(θ; x) with respect to θ) is an important measure of statistical information about θ.
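The sharpening of the peak can be checked numerically. This Python sketch (helper names mine) approximates the curvature −l''(p̂; x) by a central second difference for the three sample sizes; it grows in proportion to n.

```python
import math

def loglik(p, x, n):
    # binomial loglikelihood kernel
    return x * math.log(p) + (n - x) * math.log(1 - p)

def neg_curvature(x, n, p, h=1e-4):
    # -l''(p; x) via a central second difference
    return -(loglik(p + h, x, n) - 2 * loglik(p, x, n) + loglik(p - h, x, n)) / h**2

for x, n in [(1, 4), (10, 40), (100, 400)]:
    print(n, round(neg_curvature(x, n, 0.25), 1))  # roughly n / (.25 * .75)
```

Each tenfold increase in n multiplies the curvature at the MLE by ten, which is the numerical face of the "sharper peak" observation.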
The second point, that the loglikelihood function becomes more symmetric about the MLE as the sample size grows, forms the basis for constructing asymptotic (large-sample) confidence intervals for the unknown parameter. In a wide variety of problems, as the sample size grows the loglikelihood approaches a quadratic function (i.e. a parabola) centered at the MLE.
The parabola is significant because that is the shape of the loglikelihood from the normal distribution. If we had a random sample of any size from a normal distribution with known variance σ² and unknown mean µ, the loglikelihood would be a perfect parabola centered at the MLE µ̂ = x̄ = ∑_{i=1}^n x_i / n.

From elementary statistics, we know that if we have a sample from a normal distribution with known variance σ², a 95% confidence interval for the mean µ is

x̄ ± 1.96 σ/√n.   (1)

The confidence interval (1) is valid because over repeated samples the estimate x̄ is normally distributed about the true value µ with a standard deviation of σ/√n. The quantity σ/√n is called the standard error; it measures the variability of the sample mean x̄ about the true mean µ. The number 1.96 comes from a table of the standard normal distribution; the area under the standard normal density curve between −1.96 and 1.96 is .95 or 95%.

There is much confusion about how to interpret a confidence interval (CI). A CI is NOT a probability statement about θ, since θ is a fixed value, not a random variable. One interpretation: if we took many samples, most of our intervals would capture the true parameter (e.g. 95% of our intervals will contain the true parameter).

Example: A nationwide telephone poll of adults was conducted by NY Times/CBS News in January. About 58% of respondents said they feel optimistic about the next four years. The results are reported with a margin of error of 3%.

In Stat 504, the parameter of interest will not be the mean of a normal population, but some other parameter θ pertaining to a discrete probability distribution. We will often estimate the parameter by its MLE θ̂. But because in large samples the loglikelihood function l(θ; x) approaches a parabola centered at θ̂, we will be able to use a method similar to (1) to form approximate confidence intervals for θ.
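As a small illustration of interval (1), here is a Python sketch (the data and function name are hypothetical, invented for the example) computing x̄ ± 1.96 σ/√n:

```python
import math

def normal_ci(xs, sigma, z=1.96):
    # 95% CI for a normal mean with known sigma: xbar +/- z * sigma / sqrt(n)
    n = len(xs)
    xbar = sum(xs) / n
    se = sigma / math.sqrt(n)   # the standard error of xbar
    return xbar - z * se, xbar + z * se

lo, hi = normal_ci([4.1, 5.2, 4.7, 5.0], sigma=1.0)
print(lo, hi)  # centered at xbar = 4.75 with half-width 1.96 * 0.5
```

Replacing 1.96 by 1.645 would give the 90% version of the same interval.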
Just as x̄ is normally distributed about µ, θ̂ is approximately normally distributed about θ in large samples. This property is called the asymptotic normality of the MLE, and the technique of forming confidence intervals is called the asymptotic normal approximation. This method works for a wide variety of statistical models, including all the models that we will use in this course.

The asymptotic normal 95% confidence interval for a parameter θ has the form

θ̂ ± 1.96 · 1/√( −l''(θ̂; x) ),   (2)

where l''(θ̂; x) is the second derivative of the loglikelihood function with respect to θ, evaluated at θ = θ̂.

Of course, we can also form intervals with confidence coefficients other than 95%. All we need to do is to replace 1.96 in (2) by z, a value from a table of the standard normal distribution, where ±z encloses the desired level of confidence. If we wanted a 90% confidence interval, for example, we would use 1.645.

Observed and expected information

The quantity −l''(θ̂; x) is called the observed information, and 1/√( −l''(θ̂; x) ) is an approximate standard error for θ̂. As the loglikelihood becomes more sharply peaked about the MLE, the observed information grows and the standard error goes down.

When calculating asymptotic confidence intervals, statisticians often replace the second derivative of the loglikelihood by its expectation; that is, replace −l''(θ; x) by the function

I(θ) = −E[ l''(θ; x) ],

which is called the expected information or the Fisher information. In that case, the 95% confidence interval would become

θ̂ ± 1.96 · 1/√( I(θ̂) ).   (3)

When the sample size is large, the two confidence intervals (2) and (3) tend to be very close. In some problems, the two are identical.

Now we give a few examples of asymptotic confidence intervals.

Bernoulli.
If X is Bernoulli with success probability p, the loglikelihood is

l(p; x) = x log p + (1−x) log(1−p),

the first derivative is

l'(p; x) = x/p − (1−x)/(1−p) = (x − p) / ( p(1−p) ),

and the second derivative is

l''(p; x) = −(x − p)² / ( p²(1−p)² )

(to derive this, use the fact that x² = x). Because E[(x − p)²] = V(x) = p(1−p), the Fisher information is

I(p) = 1 / ( p(1−p) ).
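The key step, E[(x − p)²] = p(1−p), is easy to verify by summing over the two outcomes; a tiny Python check (scaffolding mine):

```python
# E[(x - p)^2] for Bernoulli(p): weight (1-p)^2 by P(x=1)=p and p^2 by P(x=0)=1-p
p = 0.3
var = (1 - p) ** 2 * p + (0 - p) ** 2 * (1 - p)
print(var, p * (1 - p))  # both 0.21
```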
Of course, a single Bernoulli trial does not provide enough information to get a reasonable confidence interval for p. Let's see what happens when we have multiple trials.

Binomial. If X ∼ Bin(n, p), then the loglikelihood is

l(p; x) = x log p + (n−x) log(1−p),

the first derivative is

l'(p; x) = x/p − (n−x)/(1−p) = (x − np) / ( p(1−p) ),

the second derivative is

l''(p; x) = −(x − 2xp + np²) / ( p²(1−p)² ),

and the Fisher information is

I(p) = n / ( p(1−p) ).

Notice that the Fisher information for the Bin(n, p) model is n times the Fisher information from a single Bernoulli trial. This is a general principle: if we observe a sample of size n,

X = (X_1, X_2, ..., X_n),

where X_1, X_2, ..., X_n are independent random variables, then the Fisher information from X is the sum of the Fisher information functions from the individual X_i's. If X_1, X_2, ..., X_n are iid, then the Fisher information from X is n times the Fisher information from a single observation X_i.

Thus an approximate 95% confidence interval for p based on the Fisher information is

p̂ ± 1.96 √( p̂(1−p̂)/n ),   (4)

where p̂ = x/n is the MLE. What happens if we use the observed information rather than the expected information? Evaluating the second derivative l''(p; x) at the MLE p̂ = x/n gives

−l''(p̂; x) = n / ( p̂(1−p̂) ),

so the 95% interval based on the observed information is identical to (4). Unfortunately, Agresti (2002, p. 15) points out that the interval (4) performs poorly unless n is very large; the actual coverage can be considerably less than the nominal rate of 95%.

The confidence interval (4) has two unusual features:

- The endpoints can stray outside the parameter space; that is, one can get a lower limit less than 0 or an upper limit greater than 1.
- If we happen to observe no successes (x = 0) or no failures (x = n), the interval becomes degenerate (has zero width) and misses the true parameter. This unfortunate event becomes quite likely when the actual p is close to zero or one.

A variety of fixes are available.
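Both unusual features, and the equality of the observed- and expected-information versions, can be demonstrated in a few lines of Python (function names are my own):

```python
import math

def wald_ci(x, n, z=1.96):
    # interval (4): p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
    p_hat = x / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

def observed_info_se(x, n, h=1e-4):
    # SE from the observed information, 1 / sqrt(-l''(p_hat; x)),
    # with the second derivative approximated by a central difference
    ll = lambda p: x * math.log(p) + (n - x) * math.log(1 - p)
    p_hat = x / n
    l2 = (ll(p_hat + h) - 2 * ll(p_hat) + ll(p_hat - h)) / h**2
    return 1.0 / math.sqrt(-l2)

lo, hi = wald_ci(2, 20)
print(lo, hi)              # lower endpoint strays below 0
print(wald_ci(0, 20))      # degenerate interval (0.0, 0.0) when x = 0
print(observed_info_se(2, 20), math.sqrt(0.1 * 0.9 / 20))  # the two SEs agree
```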
One ad hoc fix, which can work surprisingly well, is to replace p̂ by

p̃ = (x + .5) / (n + 1),

which is equivalent to adding half a success and half a failure; that keeps the interval from becoming degenerate. To keep the endpoints within the parameter space, we can express the parameter on a different scale, such as the log-odds

θ = log( p / (1−p) ),

which we will discuss later.

Poisson. If X = (X_1, X_2, ..., X_n) is an iid sample from a Poisson distribution with parameter λ, the loglikelihood is

l(λ; x) = ( ∑_{i=1}^n x_i ) log λ − nλ,

the first derivative is

l'(λ; x) = ( ∑_i x_i ) / λ − n,

the second derivative is

l''(λ; x) = −( ∑_i x_i ) / λ²,

and the Fisher information is

I(λ) = n/λ.

An approximate 95% interval based on the observed or expected information is

λ̂ ± 1.96 √( λ̂/n ),   (5)

where λ̂ = ∑_i x_i / n is the MLE.
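A Python sketch of interval (5); the counts below are made up for illustration:

```python
import math

def poisson_ci(xs, z=1.96):
    # interval (5): lam_hat +/- z * sqrt(lam_hat / n)
    n = len(xs)
    lam_hat = sum(xs) / n
    se = math.sqrt(lam_hat / n)
    return lam_hat - z * se, lam_hat + z * se

lo, hi = poisson_ci([3, 1, 4, 2, 5, 3, 2, 4])  # hypothetical counts, n = 8
print(lo, hi)  # centered at lam_hat = 3.0
```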
Once again, this interval may not perform well in some circumstances; we can often get better results by changing the scale of the parameter.

Alternative parameterizations

Statistical theory tells us that if n is large enough, the true coverage of the approximate intervals (2) or (3) will be very close to 95%. How large n must be in practice depends on the particulars of the problem. Sometimes an approximate interval performs poorly because the loglikelihood function doesn't closely resemble a parabola. If so, we may be able to improve the quality of the approximation by applying a suitable reparameterization, a transformation of the parameter to a new scale. Here is an example.

Suppose we observe X = 2 from a binomial distribution Bin(20, p). The MLE is p̂ = 2/20 = .10 and the loglikelihood is not very symmetric:

[Figure: graph of l(p; x) versus p, skewed, with p̂ = .10 near the boundary]

This asymmetry arises because p̂ is close to the boundary of the parameter space. We know that p must lie between zero and one. When p̂ is close to zero or one, the loglikelihood tends to be more skewed than it would be if p̂ were near .5. The usual 95% confidence interval is

p̂ ± 1.96 √( p̂(1−p̂)/n ) = 0.10 ± 0.13,

or (−.03, .23), which strays outside the parameter space.

The logistic or logit transformation is defined as

φ = log( p / (1−p) ).   (6)

The logit is also called the log odds, because p/(1−p) is the odds associated with p. Whereas p is a proportion and must lie between 0 and 1, φ may take any value from −∞ to +∞, so the logit transformation solves the problem of a sharp boundary in the parameter space. Solving (6) for p produces the back-transformation

p = e^φ / (1 + e^φ).   (7)

Let's rewrite the binomial loglikelihood in terms of φ:

l(φ; x) = x log p + (n−x) log(1−p)
        = x log( p/(1−p) ) + n log(1−p)
        = xφ + n log( 1/(1 + e^φ) )
        = xφ − n log(1 + e^φ).

Now let's graph the loglikelihood l(φ; x) versus φ:

[Figure: graph of l(φ; x) versus φ]

It's still skewed, but not quite as sharply as before.
This plot strongly suggests that an asymptotic confidence interval constructed on the φ scale will be more accurate in coverage than an interval constructed on the p scale. An approximate 95% confidence interval for φ is

φ̂ ± 1.96 · 1/√( I(φ̂) ),

where φ̂ is the MLE of φ, and I(φ) is the Fisher information for φ. To find the MLE for φ, all we need to do is apply the logit transformation to p̂:

φ̂ = log( .1/.9 ) = −2.197.
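The derivation above claims that l(φ; x) = xφ − n log(1 + e^φ) is the same function as l(p; x), just expressed on a new scale. A quick Python check (scaffolding mine) evaluates both at corresponding points:

```python
import math

n, x = 20, 2
lp = lambda p: x * math.log(p) + (n - x) * math.log(1 - p)   # loglik in p
lphi = lambda f: x * f - n * math.log(1 + math.exp(f))       # loglik in phi
for p in (0.05, 0.10, 0.30):
    f = math.log(p / (1 - p))                                # logit of p
    assert abs(lp(p) - lphi(f)) < 1e-9  # same value on either scale
print(math.log(0.1 / 0.9))  # the MLE on the phi scale, about -2.197
```

Because the loglikelihood is unchanged by the reparameterization, the MLE transforms directly: φ̂ is just the logit of p̂.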
The general method for reparameterization is as follows. First, we choose a transformation φ = φ(θ) for which we think the loglikelihood will be symmetric. Assuming for a moment that we know the Fisher information for φ, we can calculate a 95% confidence interval for φ. Then, because our interest is not really in φ but in θ, we transform the endpoints of the confidence interval back to the θ scale. This new confidence interval for θ will not be exactly symmetric (i.e. θ̂ will not lie exactly in the center of it), but the coverage of this procedure should be closer to 95% than for intervals computed directly on the θ-scale.

In detail: we calculate θ̂, the MLE for θ, and transform it to the φ scale,

φ̂ = φ(θ̂).

Next we need to calculate I(φ̂), the Fisher information for φ. It turns out that this is given by

I(φ̂) = I(θ̂) / [ φ'(θ̂) ]²,   (8)

where φ'(θ) is the first derivative of φ with respect to θ. Then the endpoints of a 95% confidence interval for φ are

φ_low = φ̂ − 1.96 √( 1/I(φ̂) ),
φ_high = φ̂ + 1.96 √( 1/I(φ̂) ).

The approximate 95% confidence interval for φ is [φ_low, φ_high]. The corresponding confidence interval for θ is obtained by transforming φ_low and φ_high back to the original θ scale. A few common transformations are shown in Table 1, along with their back-transformations and derivatives.

Table 1: Some common transformations, their back transformations, and derivatives.

  transformation           back                    derivative
  φ = log( θ/(1−θ) )       θ = e^φ / (1 + e^φ)     φ'(θ) = 1 / ( θ(1−θ) )
  φ = log θ                θ = e^φ                 φ'(θ) = 1/θ
  φ = √θ                   θ = φ²                  φ'(θ) = 1 / ( 2√θ )
  φ = θ^(1/3)              θ = φ³                  φ'(θ) = 1 / ( 3 θ^(2/3) )
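The recipe above can be sketched in Python (function and argument names are my own). Applying it to binomial data with n = 20, x = 2 and the logit row of Table 1 gives an interval of roughly (0.025, 0.324):

```python
import math

def repar_ci(theta_hat, info_theta, phi, phi_prime, back, z=1.96):
    # transform the MLE, get I(phi_hat) = I(theta_hat) / phi'(theta_hat)^2 (eq. 8),
    # form the interval on the phi scale, then map the endpoints back
    phi_hat = phi(theta_hat)
    info_phi = info_theta / phi_prime(theta_hat) ** 2
    se = 1.0 / math.sqrt(info_phi)
    return back(phi_hat - z * se), back(phi_hat + z * se)

n, x = 20, 2
p_hat = x / n
lo, hi = repar_ci(
    p_hat,
    n / (p_hat * (1 - p_hat)),                  # I(p) = n / (p(1-p))
    lambda p: math.log(p / (1 - p)),            # logit transformation
    lambda p: 1.0 / (p * (1 - p)),              # its derivative, from Table 1
    lambda f: math.exp(f) / (1 + math.exp(f)),  # back-transformation (7)
)
print(lo, hi)  # about (0.025, 0.324); both endpoints stay inside (0, 1)
```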
Going back to the binomial example with n = 20 and X = 2, let's form a 95% confidence interval for φ = log( p/(1−p) ). The MLE for p is p̂ = 2/20 = .10, so the MLE for φ is

φ̂ = log( .1/.9 ) = −2.197.

Using the derivative of the logit transformation from Table 1, the Fisher information for φ is

I(φ) = I(p) / [ φ'(p) ]²
     = [ n / ( p(1−p) ) ] · [ p(1−p) ]²
     = n p(1−p).

Evaluating it at the MLE gives

I(φ̂) = 20(.1)(.9) = 1.8.

The endpoints of the 95% confidence interval for φ are

φ_low = −2.197 − 1.96 √(1/1.8) = −3.658,
φ_high = −2.197 + 1.96 √(1/1.8) = −0.736,

and the corresponding endpoints of the confidence interval for p are

p_low = e^(−3.658) / (1 + e^(−3.658)) = 0.025,
p_high = e^(−0.736) / (1 + e^(−0.736)) = 0.324.

The MLE p̂ = .10 is not exactly in the middle of this interval, but who says that a confidence interval must be symmetric about the point estimate?

Intervals based on the likelihood ratio

Another way to form a confidence interval for a single parameter is to find all values of θ for which the loglikelihood l(θ; x) is within a given tolerance of the maximum value l(θ̂; x). Statistical theory tells us that, if θ_0 is the true value of the parameter, then the likelihood-ratio statistic

2 log[ L(θ̂; x) / L(θ_0; x) ] = 2 [ l(θ̂; x) − l(θ_0; x) ]   (9)

is approximately distributed as χ²_1 when the sample size n is large. This gives rise to the well-known likelihood-ratio (LR) test. In the LR test of the null hypothesis H_0: θ = θ_0 versus the two-sided alternative H_1: θ ≠ θ_0, we would reject H_0 at the α-level if the LR statistic (9) exceeds the 100(1−α)th percentile of the χ²_1 distribution. That is, for an α = .05-level test, we would reject H_0 if the LR statistic is greater than 3.84.
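For a concrete LR test, take the same binomial data (x = 2, n = 20) and test the hypothetical null H_0: p = 0.5; a Python sketch (names mine):

```python
import math

def lr_statistic(loglik, theta_hat, theta0):
    # likelihood-ratio statistic (9): 2 * [ l(theta_hat; x) - l(theta0; x) ]
    return 2 * (loglik(theta_hat) - loglik(theta0))

# binomial loglikelihood kernel for x = 2, n = 20
ll = lambda p: 2 * math.log(p) + 18 * math.log(1 - p)
stat = lr_statistic(ll, 0.10, 0.5)
print(stat, stat > 3.84)  # about 14.7, well above 3.84, so reject H0 at .05
```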
The LR testing principle can also be used to construct confidence intervals. An approximate 100(1−α)% confidence interval for θ consists of all the possible θ_0's for which the null hypothesis H_0: θ = θ_0 would not be rejected at the α level. For a 95% interval, the interval would consist of all the values of θ for which

2 [ l(θ̂; x) − l(θ; x) ] ≤ 3.84,

or

l(θ; x) ≥ l(θ̂; x) − 1.92.

In other words, the 95% interval includes all values of θ for which the loglikelihood function drops off by no more than 1.92 units.

Returning to our binomial example, suppose that we observe X = 2 from a binomial distribution with n = 20 and p unknown. The loglikelihood is

l(p; x) = 2 log p + 18 log(1−p),

the MLE is p̂ = x/n = .10, and the maximized loglikelihood is

l(p̂; x) = 2 log .1 + 18 log .9 = −6.50.

Let's add a horizontal line to the plot of the loglikelihood at the value −6.50 − 1.92 = −8.42:

[Figure: graph of l(p; x) versus p with a horizontal line at −8.42]

The horizontal line intersects the loglikelihood curve at p = .018 and p = .278. Therefore, the LR confidence interval for p is (.018, .278).

When n is large, the LR method will tend to produce intervals very similar to those based on the observed or expected information. Unlike the information-based intervals, however, the LR intervals are scale-invariant. That is, if we find the LR interval for a transformed version of the parameter such as φ = log( p/(1−p) ) and then transform the endpoints back to the p-scale, we get exactly the same answer as if we apply the LR method directly on the p-scale. For that reason, statisticians tend to like the LR method better.
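The intersection points can be found numerically by scanning a grid, as in this Python sketch (grid resolution and function names are my choices):

```python
import math

def lr_interval(loglik, theta_hat, grid, cutoff=1.92):
    # all grid values whose loglikelihood is within `cutoff` of the maximum
    l_max = loglik(theta_hat)
    inside = [t for t in grid if loglik(t) >= l_max - cutoff]
    return min(inside), max(inside)

# binomial example: x = 2, n = 20
ll = lambda p: 2 * math.log(p) + 18 * math.log(1 - p)
grid = [i / 10000 for i in range(1, 10000)]
lo, hi = lr_interval(ll, 0.10, grid)
print(lo, hi)  # close to (.018, .278)
```

A bisection search on each side of the MLE would locate the endpoints more precisely, but the grid scan makes the "drop of 1.92 units" definition transparent.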
If the loglikelihood function expressed on a particular scale is nearly quadratic, then an information-based interval calculated on that scale will agree closely with the LR interval. Therefore, if the information-based interval agrees with the LR interval, that provides some evidence that the normal approximation is working well on that particular scale. If the information-based interval is quite different from the LR interval, the appropriateness of the normal approximation is doubtful, and the LR approximation is probably better.