Lecture 8. Confidence intervals and the central limit theorem
Mathematical Statistics and Discrete Mathematics
November 25th, 2015
Central limit theorem

Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from the distribution of $X$ with mean $\mu$ and variance $\sigma^2$. Then, for large $n$,
\[
\sum_{i=1}^{n} X_i \approx N(n\mu, n\sigma^2), \qquad
\bar{X} \approx N(\mu, \sigma^2/n), \qquad
\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \approx N(0, 1).
\]
Here $X \approx Y$ means that $X$ and $Y$ have approximately the same distribution.

Note that the central limit theorem is valid for any random variable $X$ with finite mean $\mu$ and finite variance $\sigma^2$. In particular, $X$ can be discrete, and the theorem says that the sample means for large sample sizes are well approximated by the continuous normal distribution.

Note that if $X$ is normal, then the relations above are exact rather than approximate equalities in distribution.
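The statement above can be checked numerically. The following is a minimal simulation sketch, not part of the lecture: the choice of the Uniform(0, 1) distribution, the sample size n = 30, the number of repetitions and the seed are all illustrative assumptions. It standardizes sample means and compares the empirical probability of being at most 1 with the standard normal value $F_Z(1)$.

```python
import random
from statistics import NormalDist

def standardized_mean(n, rng):
    """Draw n Uniform(0, 1) values and return (mean - mu) / (sigma / sqrt(n))."""
    mu = 0.5                  # mean of Uniform(0, 1)
    sigma = (1 / 12) ** 0.5   # standard deviation of Uniform(0, 1)
    xbar = sum(rng.random() for _ in range(n)) / n
    return (xbar - mu) / (sigma / n ** 0.5)

rng = random.Random(2015)     # fixed seed, chosen arbitrarily
n, reps = 30, 20000           # illustrative sizes, not taken from the lecture
draws = [standardized_mean(n, rng) for _ in range(reps)]

# Empirical P(standardized mean <= 1) versus the standard normal CDF at 1.
empirical = sum(z <= 1 for z in draws) / reps
theoretical = NormalDist().cdf(1.0)
print(f"empirical {empirical:.3f}, normal {theoretical:.3f}")
```

The two printed numbers should be close. How large n must be for a good match depends on the underlying distribution.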
Figure: A comparison of PDFs of sums of n independent uniform random variables on (0, 1) for n = 1, 2, 3, 4.
Figure: A comparison of PDFs of Binom(n, p) with n = 20, 30, 40 and p = 0.5 on the left, and n = 60, 90, 120 and p = 0.1 on the right.

One can see that the shape of the PDF approaches the bell curve of the normal distribution. Note that the number of variables n required for a good approximation by a normal distribution depends on the distribution of a single variable.
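We toss a fair coin 400 times. Let $X$ be the total number of heads. We want to know
\[
P(190 \le X \le 210).
\]
We have $X \sim \mathrm{Binom}(400, 1/2)$,
\[
\mu = E[X] = 400 \cdot \tfrac{1}{2} = 200, \qquad
\sigma^2 = \mathrm{Var}[X] = 400 \cdot \tfrac{1}{2} \cdot \left(1 - \tfrac{1}{2}\right) = 100.
\]
By the central limit theorem, $(X - 200)/10$ is approximately distributed like a standard normal variable $Z$, and hence
\[
\begin{aligned}
P(190 \le X \le 210) &= P(X \le 210) - P(X \le 189) \\
&= P(X - 200 \le 10) - P(X - 200 \le -11) \\
&= P\left(\frac{X - 200}{10} \le 1\right) - P\left(\frac{X - 200}{10} \le -1.1\right) \\
&\approx F_Z(1) - F_Z(-1.1) = 0.8413 - 0.1357 = 0.7056.
\end{aligned}
\]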
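As a sanity check on the coin example, the sketch below recomputes the normal approximation with the same cut points (210 and 189) and compares it with the exact binomial probability. The comparison with the exact value is an addition for illustration and is not part of the lecture.

```python
from math import comb
from statistics import NormalDist

n, p = 400, 0.5
mu = n * p                          # 200
sigma = (n * p * (1 - p)) ** 0.5    # 10

# Normal approximation with the cut points used in the example: F_Z(1) - F_Z(-1.1).
Z = NormalDist()
approx = Z.cdf((210 - mu) / sigma) - Z.cdf((189 - mu) / sigma)

# Exact P(190 <= X <= 210) for X ~ Binom(400, 1/2), using integer arithmetic.
exact = sum(comb(n, k) for k in range(190, 211)) / 2 ** n

print(f"normal approximation {approx:.4f}, exact {exact:.4f}")
```

The approximate value agrees with the table-based 0.7056 up to rounding of the table entries.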
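Confidence intervals for $\mu$ with arbitrary data and $\sigma^2$ known

Let $X$ be an arbitrary random variable with known variance $\sigma^2$, and let $X_1, X_2, \ldots, X_n$ be a random sample of large size $n$ from the distribution of $X$. Let $Z \sim N(0, 1)$ be a standard normal variable, and let $z_{\alpha/2} > 0$ be such that
\[
F_Z(-z_{\alpha/2}) = \alpha/2.
\]
Then, the random interval $[L, R]$, where
\[
L = \bar{X} - z_{\alpha/2}\,\sigma/\sqrt{n} \quad \text{and} \quad R = \bar{X} + z_{\alpha/2}\,\sigma/\sqrt{n},
\]
is a confidence interval for the true mean $\mu$ with confidence level $1 - \alpha$, that is,
\[
P(L \le \mu \le R) \approx 1 - \alpha.
\]
The equality is approximate because it rests on the central limit theorem. It is exact when $X$ is normal.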
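The interval above can be computed as in the following minimal sketch. The function name `z_interval` and the input values (sample mean 10, σ = 2, n = 100) are hypothetical and chosen only for illustration. The critical value $z_{\alpha/2}$ is obtained from the standard normal quantile function at $1 - \alpha/2$, which is equivalent to the condition $F_Z(-z_{\alpha/2}) = \alpha/2$ by symmetry.

```python
from statistics import NormalDist

def z_interval(xbar, sigma, n, alpha=0.05):
    """Return (L, R) = xbar -/+ z_{alpha/2} * sigma / sqrt(n) for known sigma."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2} > 0
    half_width = z * sigma / n ** 0.5
    return xbar - half_width, xbar + half_width

# Hypothetical inputs, not taken from the lecture.
lo, hi = z_interval(xbar=10.0, sigma=2.0, n=100, alpha=0.05)
print(f"95% confidence interval: [{lo:.3f}, {hi:.3f}]")
```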
Chi-squared and t-distribution

If $Z_1, Z_2, \ldots, Z_n$ is a random sample of size $n$ from the standard normal distribution, then we say that the random variable
\[
Q = \sum_{i=1}^{n} Z_i^2
\]
has chi-squared distribution with $n$ degrees of freedom. We denote this by writing $Q \sim \chi^2(n)$.

If $Z$ and $Q$ are independent random variables such that $Z$ is a standard normal variable and $Q$ has chi-squared distribution with $n$ degrees of freedom, then we say that the random variable
\[
T = \frac{Z}{\sqrt{Q/n}}
\]
has t-distribution with $n$ degrees of freedom.

These are very important distributions, and numerical values of their CDFs are found in standard statistical tables.
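Both definitions translate directly into a simulation. The sketch below is an illustration, not part of the lecture: it builds chi-squared and t draws from standard normal draws exactly as defined above, with n = 5 degrees of freedom, the seed and the number of repetitions chosen arbitrarily. It uses the standard facts, not stated in the lecture, that a chi-squared variable with n degrees of freedom has mean n and that the t-distribution is symmetric around 0.

```python
import random

def chi_squared_draw(n, rng):
    """Q = sum of squares of n independent standard normal draws."""
    return sum(rng.gauss(0, 1) ** 2 for _ in range(n))

def t_draw(n, rng):
    """T = Z / sqrt(Q / n), with Z standard normal and Q chi-squared, independent."""
    z = rng.gauss(0, 1)
    q = chi_squared_draw(n, rng)   # uses fresh draws, so it is independent of z
    return z / (q / n) ** 0.5

rng = random.Random(8)             # fixed seed, chosen arbitrarily
n, reps = 5, 20000                 # illustrative sizes
q_mean = sum(chi_squared_draw(n, rng) for _ in range(reps)) / reps
t_mean = sum(t_draw(n, rng) for _ in range(reps)) / reps
print(f"mean of Q: {q_mean:.3f} (n = {n}), mean of T: {t_mean:.3f}")
```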
Figure: The PDFs of the chi-squared distribution with 5, 10, 20, 30 degrees of freedom.
Figure: A comparison of PDFs of the t-distribution with 1, 3, 10, and 30 degrees of freedom (orange) and the standard normal distribution (blue).
If $X_1, X_2, \ldots, X_n$ is a sample from the normal distribution $N(\mu, \sigma^2)$, $\bar{X}$ is the sample mean, and $S^2$ is the sample variance, then
\[
\frac{(n-1)S^2}{\sigma^2}
\]
has chi-squared distribution with $n - 1$ degrees of freedom, and
\[
\frac{\bar{X} - \mu}{S/\sqrt{n}}
\]
has t-distribution with $n - 1$ degrees of freedom.

Proof. The proof is outside the scope of the course. Partial arguments can be found in the book.
Confidence intervals for $\mu$ with normal data and $\sigma^2$ unknown

Let $X$ be a normal random variable with unknown variance, and let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from the distribution of $X$. Let $T_{n-1}$ be a random variable that has t-distribution with $n - 1$ degrees of freedom, and let $t_{\alpha/2} > 0$ be such that
\[
F_{T_{n-1}}(t_{\alpha/2}) = 1 - \alpha/2.
\]
Then, the random interval $[L, R]$, where
\[
L = \bar{X} - t_{\alpha/2}\,S/\sqrt{n} \quad \text{and} \quad R = \bar{X} + t_{\alpha/2}\,S/\sqrt{n},
\]
is a confidence interval for the true mean $\mu$ of $X$ with confidence level $1 - \alpha$, that is,
\[
P(L \le \mu \le R) = 1 - \alpha.
\]
Proof. The proof is analogous to the one for $\sigma^2$ known. We use the fact that
\[
\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim T_{n-1}.
\]
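The manufacturer claims that their mix of nuts and fruits contains 33g of fruit per 100g. We want to check this claim. We buy 5 packages and weigh the fruit content. We obtain the following numbers:

31.84, 32.35, 31.20, 32.89, 32.80.

We find
\[
\bar{x} = \frac{1}{5} \sum_{i=1}^{5} x_i = 32.22, \quad \text{and} \quad
s^2 = \frac{1}{4} \left( \sum_{i=1}^{5} x_i^2 - 5\,\bar{x}^2 \right) = 0.50.
\]
We assume that the sample comes from a normal distribution. In the tables, we find that $t_{0.025} = 2.776$ for 4 degrees of freedom. The 95% confidence interval is then
\[
[l, r] = \left[ \bar{x} - t_{0.025} \frac{s}{\sqrt{5}},\ \bar{x} + t_{0.025} \frac{s}{\sqrt{5}} \right]
= \left[ 32.22 - 2.776 \frac{\sqrt{0.5}}{\sqrt{5}},\ 32.22 + 2.776 \frac{\sqrt{0.5}}{\sqrt{5}} \right]
= [31.34, 33.10].
\]
The claimed value 33 lies inside this interval, so the data do not contradict the manufacturer's claim at the 95% confidence level.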
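The computation in this example can be reproduced with the short sketch below. The critical value 2.776 is taken from the tables as quoted in the example and entered by hand, because the Python standard library has no t-distribution quantile function. The sketch works with unrounded intermediate values, so its right endpoint comes out near 33.09 rather than the 33.10 obtained from the rounded values 32.22 and 0.50.

```python
from statistics import mean, variance

data = [31.84, 32.35, 31.20, 32.89, 32.80]   # fruit content in g per 100 g
n = len(data)

xbar = mean(data)        # sample mean
s2 = variance(data)      # sample variance with denominator n - 1
t_crit = 2.776           # t_{0.025} for n - 1 = 4 degrees of freedom, from the tables

half_width = t_crit * (s2 / n) ** 0.5
lo, hi = xbar - half_width, xbar + half_width
print(f"mean {xbar:.2f}, s^2 {s2:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```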
Confidence intervals for $\sigma^2$ with normal data

Let $X$ be a normal random variable with unknown variance, and let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from the distribution of $X$. Let $\chi^2_{n-1}$ be a random variable that has chi-squared distribution with $n - 1$ degrees of freedom, and let $\chi^2_{\alpha/2}, \chi^2_{1-\alpha/2} > 0$ be numbers such that
\[
F_{\chi^2_{n-1}}(\chi^2_{\alpha/2}) = 1 - \alpha/2 \quad \text{and} \quad F_{\chi^2_{n-1}}(\chi^2_{1-\alpha/2}) = \alpha/2.
\]
Then, the random interval $[L, R]$, where
\[
L = \frac{(n-1)S^2}{\chi^2_{\alpha/2}} \quad \text{and} \quad R = \frac{(n-1)S^2}{\chi^2_{1-\alpha/2}},
\]
is a confidence interval for the true variance $\sigma^2$ of $X$ with confidence level $1 - \alpha$, that is,
\[
P(L \le \sigma^2 \le R) = 1 - \alpha.
\]
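Proof. We will use the fact that $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$. By the definition of $\chi^2_{\alpha/2}, \chi^2_{1-\alpha/2} > 0$, we have
\[
\begin{aligned}
1 - \alpha &= P\left( \chi^2_{\alpha/2} \ge \frac{(n-1)S^2}{\sigma^2} \ge \chi^2_{1-\alpha/2} \right) \\
&= P\left( \frac{\chi^2_{\alpha/2}}{(n-1)S^2} \ge \frac{1}{\sigma^2} \ge \frac{\chi^2_{1-\alpha/2}}{(n-1)S^2} \right) \\
&= P\left( \frac{(n-1)S^2}{\chi^2_{1-\alpha/2}} \ge \sigma^2 \ge \frac{(n-1)S^2}{\chi^2_{\alpha/2}} \right).
\end{aligned}
\]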
Let us find a 95% confidence interval for $\sigma^2$ in the fruit mix example. We have $s^2 = 0.50$, $n - 1 = 4$, $\alpha/2 = 0.025$. We find in the tables that
\[
\chi^2_{1-\alpha/2} = \chi^2_{0.975} = 0.484 \quad \text{and} \quad \chi^2_{\alpha/2} = \chi^2_{0.025} = 11.1.
\]
Hence, the confidence interval is
\[
[l, r] = \left[ \frac{4 \cdot 0.5}{11.1},\ \frac{4 \cdot 0.5}{0.484} \right] = [0.18, 4.13].
\]
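The arithmetic of this example is short enough to verify directly. The sketch below assumes the table values 11.1 and 0.484 for the upper and lower chi-squared critical points with 4 degrees of freedom; the value 0.484 is the one consistent with the stated right endpoint 4.13. Both are entered by hand, because the Python standard library has no chi-squared quantile function.

```python
n_minus_1 = 4        # degrees of freedom, n - 1
s2 = 0.50            # sample variance from the fruit mix example

chi2_upper = 11.1    # chi^2_{0.025}: CDF equals 0.975 here (from the tables)
chi2_lower = 0.484   # chi^2_{0.975}: CDF equals 0.025 here (from the tables)

# The larger critical value gives the left endpoint, the smaller one the right endpoint.
lo = n_minus_1 * s2 / chi2_upper
hi = n_minus_1 * s2 / chi2_lower
print(f"95% CI for the variance: [{lo:.2f}, {hi:.2f}]")
```

Note that the interval is not symmetric around $s^2 = 0.50$, which reflects the skewness of the chi-squared distribution.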