Where are we? Recap from last time. Lecture 4: Confidence Intervals & Hypothesis Testing. Statistical inference

Lecture 4: Confidence Intervals & Hypothesis Testing
Sandy Eckel
seckel@jhsph.edu
25 April 2008

Where are we? Recap from last time

Lecture 3 Summary
- The Normal Distribution
- Sampling distributions (i.e., of the sample mean)
- The Central Limit Theorem

Today, we'll discuss
- Confidence intervals for population parameters
- The t-distribution
- Hypothesis testing (p-values)

Recall: Summary of Sampling Distributions

| Statistic | Mean of sampling distribution | Variance of sampling distribution |
|---|---|---|
| $\bar{X}$ | $\mu$ | $\sigma^2 / n$ |
| $\bar{X}_1 - \bar{X}_2$ | $\mu_1 - \mu_2$ | $\sigma_1^2 / n_1 + \sigma_2^2 / n_2$ |
| $\hat{p}$ | $p$ | $pq / n$ |
| $n\hat{p}$ | $np$ | $npq$ |
| $\hat{p}_1 - \hat{p}_2$ | $p_1 - p_2$ | $p_1 q_1 / n_1 + p_2 q_2 / n_2$ |

(Here $q = 1 - p$.)

Statistical inference

Two methods
- Estimation (Confidence intervals)
- Hypothesis testing (p-values)

Both make use of sampling distributions
- Remember to use CLT

Sampling distributions allow us to make statements about the unobserved true population parameter in relation to the observed sample statistic.
This is called statistical inference!
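The first row of the sampling-distribution table can be checked by simulation. The sketch below is a minimal illustration and not part of the original lecture: the population values (mean 3000, standard deviation 500), the sample size, the number of replications, and the function name `simulate_sample_means` are all assumptions chosen for the example.

```python
import random
import statistics


def simulate_sample_means(mu, sigma, n, reps, seed=0):
    """Draw `reps` samples of size `n` from N(mu, sigma^2); return their means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


# Hypothetical population: mu = 3000, sigma = 500 (assumed values).
MU, SIGMA, N, REPS = 3000.0, 500.0, 25, 5000
sample_means = simulate_sample_means(MU, SIGMA, N, REPS)

# Theory: the sample mean has mean mu and variance sigma^2 / n.
print("mean of sample means:    ", statistics.fmean(sample_means), "theory:", MU)
print("variance of sample means:", statistics.variance(sample_means), "theory:", SIGMA**2 / N)
```

With these assumed inputs the simulated mean should land close to 3000 and the simulated variance close to 500^2 / 25 = 10000, up to simulation noise.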

What do we mean by Estimation

Point estimation
- An estimator of a population parameter: a statistic (e.g., $\bar{X}$, $\hat{p}$)
- An estimate of a population parameter: the value of the estimator for a particular sample
  - From a sample of 100 infants, sample mean birth weight was $\bar{X} = 3012$ grams
  - From a sample of 100 Vitamin A treated girls, 2 died, so $\hat{p} = 2/100 = 0.02$

Interval estimation
- A point estimate plus an interval that expresses the uncertainty or variability associated with the estimate

What do we mean by Hypothesis Testing

- Pre-specify a null hypothesis and an alternative hypothesis that relate to the population parameter
- Given the observed data (and resulting statistic), we decide to reject or fail to reject the null hypothesis in favor of the alternative
- Significance testing

Point Estimation

- $\bar{X}$ is a point estimator of $\mu$
- $\bar{X}_1 - \bar{X}_2$ is a point estimator of $\mu_1 - \mu_2$
- $\hat{p}$ is a point estimator of $p$
- $\hat{p}_1 - \hat{p}_2$ is a point estimator of $p_1 - p_2$

We know the sampling distribution of these statistics, e.g.
$\bar{X} \sim N(\mu_{\bar{X}} = \mu,\ \sigma^2_{\bar{X}} = \sigma^2 / n)$

If $\sigma^2$ is not known, we can use $s^2$, the sample variance, as a point estimator of $\sigma^2$.

Interval Estimation

100(1 − α)% Confidence interval:
estimate ± (critical value of z or t) × (standard error)

Example: Confidence interval for the population mean. Plugging in the values, we get
$\bar{X} \pm z_{\alpha/2}\, \sigma_{\bar{X}} = [L, U]$

Note: $z_{\alpha/2}$ is the value such that, under a standard normal curve, the area to the right of $z_{\alpha/2}$ is $\alpha/2$ and the area to the left of $-z_{\alpha/2}$ is $\alpha/2$.
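The general recipe "estimate ± (critical value) × (standard error)" can be sketched in a few lines of Python using only the standard library. The helper names `z_critical` and `interval` are made up for this example, and the population standard deviation of 500 g is an assumed value, since the slides give only the sample mean (3012 g) and the sample size (100).

```python
from statistics import NormalDist


def z_critical(confidence):
    """Return z_{alpha/2}: the point with upper-tail area alpha/2 under N(0, 1)."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def interval(estimate, critical_value, standard_error):
    """Generic form: estimate +/- (critical value) * (standard error)."""
    margin = critical_value * standard_error
    return estimate - margin, estimate + margin


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.3f}")

# Birth-weight point estimate from the slide (3012 g, n = 100).
# sigma = 500 g is an ASSUMED value for illustration; the slide does not give it.
lower, upper = interval(3012.0, z_critical(0.95), 500.0 / 100 ** 0.5)
print(f"95% CI: [{lower:.1f}, {upper:.1f}]")
```

The three critical values come out near 1.645, 1.960, and 2.576, which is where the familiar 1.96 in 95% intervals comes from.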

Derivation of Confidence Interval (CI) for the mean

We get the 100(1 − α)% confidence interval for µ by taking:

$P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) = 1 - \alpha$
$P\left(-z_{\alpha/2} \le \dfrac{\bar{X} - \mu}{\sigma_{\bar{X}}} \le z_{\alpha/2}\right) = 1 - \alpha$
$P(-z_{\alpha/2}\, \sigma_{\bar{X}} \le \bar{X} - \mu \le z_{\alpha/2}\, \sigma_{\bar{X}}) = 1 - \alpha$

After some algebra:

$P(\bar{X} - z_{\alpha/2}\, \sigma_{\bar{X}} \le \mu \le \bar{X} + z_{\alpha/2}\, \sigma_{\bar{X}}) = 1 - \alpha$
$P(L \le \mu \le U) = 1 - \alpha$

Summary: CI for mean

A 100(1 − α)% confidence interval for µ, the population mean, is given by the interval estimate
$\bar{X} \pm z_{\alpha/2}\, \sigma_{\bar{X}}$
when the population variance $\sigma^2$ is known.

- The population variance is very rarely known (!), but you'll see we can deal with this...
- In this class, we'll always use 100(1 − α)% = 95% confidence intervals, but you might sometimes see 90% or 99% CIs in the literature.

Interpretation of the CI for µ

- Before the data are observed, the probability is at least (1 − α) that [L, U] will contain µ, the population parameter
- In repeated sampling from a normally distributed population, 100(1 − α)% of all intervals of the form above will include the population mean µ
- After the data are observed, the constructed interval [L, U] either contains the true mean or it does not (no probability involved anymore)

Known Variance Assumption

- Sampling from a normally distributed population with known variance ($\sigma^2$ known)
- Confidence interval: $\bar{X} \pm z_{\alpha/2}\, \sigma_{\bar{X}}$
- What if $\sigma^2$ is unknown?
- Best we can do is use the best estimate we have of the population variance: the sample variance
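The repeated-sampling interpretation can be illustrated with a small simulation: draw many samples from a normal population with known variance, build the interval each time, and count how often it contains µ. This is only a sketch under assumed settings (population mean 3000, standard deviation 500, samples of size 100, 2000 replications); the function names `z_interval` and `coverage` are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def z_interval(sample, sigma, confidence=0.95):
    """Known-variance CI: xbar +/- z_{alpha/2} * sigma / sqrt(n)."""
    n = len(sample)
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * sigma / n ** 0.5
    xbar = fmean(sample)
    return xbar - half_width, xbar + half_width


def coverage(mu, sigma, n, reps, seed=1):
    """Fraction of simulated intervals [L, U] that contain the true mean mu."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = z_interval(sample, sigma)
        if lower <= mu <= upper:
            hits += 1
    return hits / reps


# Assumed population values, chosen only for illustration.
observed_coverage = coverage(mu=3000.0, sigma=500.0, n=100, reps=2000)
print("fraction of intervals containing mu:", observed_coverage)
```

The printed fraction should be close to 0.95. Any single interval from the loop either contains 3000 or does not, which is the point of the "after the data are observed" bullet.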

Using the Sample Variance

- Sampling from a normally distributed population with population variance unknown
- We can make use of the sample variance $s^2$
- Now we construct the confidence interval as:
  - $\bar{X} \pm z_{\alpha/2}\, s_{\bar{X}}$ when n is large
  - $\bar{X} \pm t_{\alpha/2,\, n-1}\, s_{\bar{X}}$ when n is small
- Here, $s_{\bar{X}} = s / \sqrt{n}$ and $t_{\alpha/2,\, n-1}$ has n − 1 degrees of freedom

The t-distribution

- Estimate $\sigma^2$ with $s^2$
- Once σ is replaced by s, the standardized sample mean is no longer exactly normal, so we need the t-distribution:

$t = \dfrac{\bar{X} - \mu}{s / \sqrt{n}}$

[Figure: t densities for df = 2, df = 5, and df = 20; horizontal axis x, vertical axis Density]

Properties of the t-distribution

- mean = median = mode = 0
- Symmetric about the mean
- t ranges from −∞ to +∞
- Family of distributions determined by n − 1, the degrees of freedom
- The t distribution approaches the standard normal distribution as n − 1 approaches ∞

Comparing t with normal

[Figure: density of the standard normal compared with t with df = 2; horizontal axis x, vertical axis Density]
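To see why the t critical value matters for small n, the sketch below builds both the t-based and the z-based 95% interval from the same small sample. The ten birth weights are made-up data, the function name `mean_ci_95` is hypothetical, and the t critical values are copied from a standard t-table because the Python standard library has no t quantile function.

```python
import statistics
from statistics import NormalDist

# Two-sided 95% critical values t_{0.025, df}, copied from a standard t-table.
T_CRIT_95 = {2: 4.303, 5: 2.571, 9: 2.262, 20: 2.086, 30: 2.042}


def mean_ci_95(sample, use_t=True):
    """95% CI for the mean with sigma unknown: xbar +/- crit * s / sqrt(n)."""
    n = len(sample)
    xbar = statistics.fmean(sample)
    s = statistics.stdev(sample)   # sample SD (divides by n - 1)
    se = s / n ** 0.5              # estimated standard error of the mean
    crit = T_CRIT_95[n - 1] if use_t else NormalDist().inv_cdf(0.975)
    return xbar - crit * se, xbar + crit * se


# Hypothetical birth weights in grams (made-up data, n = 10, so df = 9).
weights = [2950, 3100, 3240, 2870, 3010, 3320, 2790, 3150, 3060, 2980]

t_low, t_high = mean_ci_95(weights, use_t=True)
z_low, z_high = mean_ci_95(weights, use_t=False)
print(f"t interval: [{t_low:.1f}, {t_high:.1f}]")
print(f"z interval: [{z_low:.1f}, {z_high:.1f}]")
```

The t interval is wider than the z interval because 2.262 exceeds 1.96. The table values shrink toward 1.96 as the degrees of freedom grow, matching the limiting property listed on the slide.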

T-tables

Summary: Confidence intervals for means

| Population Distribution | Sample Size | Population Variance | 95% Confidence Interval |
|---|---|---|---|
| Normal | Any | $\sigma^2$ known | $\bar{X} \pm 1.96\, \sigma / \sqrt{n}$ |
| Normal | Any | $\sigma^2$ unknown, use $s^2$ | $\bar{X} \pm t_{0.025,\, n-1}\, s / \sqrt{n}$ |
| Not Normal/Unknown | Large | $\sigma^2$ known | $\bar{X} \pm 1.96\, \sigma / \sqrt{n}$ |
| Not Normal/Unknown | Large | $\sigma^2$ unknown, use $s^2$ | $\bar{X} \pm 1.96\, s / \sqrt{n}$ |
| Not Normal/Unknown | Small | Any | Non-parametric methods |
| Binomial | Large | - | $\hat{p} \pm 1.96 \sqrt{\hat{p}(1 - \hat{p}) / n}$ |
| Binomial | Small | - | Exact methods |

Confidence Intervals for Differences in Means

This is a bit tricky.

Recall that formulas for CIs for a single mean depend on
- whether or not $\sigma^2$ is known
- the sample size

For a difference in means, the formula for a CI depends on
- whether or not the variances are assumed to be equal when the variances are unknown
- sample sizes in each group

Equal Variances Assumption

When variances are assumed to be equal, the standard error of the difference is estimated by:

$\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}$

Here, $s_p^2$ is the pooled variance

$s_p^2 = \dfrac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}$

where the degrees of freedom (df) = $n_1 + n_2 - 2$.

Recall, $n_1$ is the size of sample 1, and $n_2$ is the size of sample 2.
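The pooled-variance interval can be sketched directly from summary statistics. The numbers below are hypothetical (two groups of 10 with sample variances 4 and 6 and sample means 12 and 10), the function names are made up for the example, and the critical value t_{0.025, 18} = 2.101 is read from a standard t-table rather than computed.

```python
import math


def pooled_variance(n1, var1, n2, var2):
    """s_p^2 = ((n1 - 1) s1^2 + (n2 - 1) s2^2) / (n1 + n2 - 2)."""
    return ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)


def pooled_ci(xbar1, var1, n1, xbar2, var2, n2, t_crit):
    """CI for mu1 - mu2 assuming equal variances; t_crit has n1 + n2 - 2 df."""
    sp2 = pooled_variance(n1, var1, n2, var2)
    se = math.sqrt(sp2 / n1 + sp2 / n2)  # standard error of the difference
    diff = xbar1 - xbar2
    return diff - t_crit * se, diff + t_crit * se


# Hypothetical summary statistics (made up for illustration).
# t_{0.025, 18} = 2.101 comes from a standard t-table (df = 10 + 10 - 2).
sp2 = pooled_variance(10, 4.0, 10, 6.0)
low, high = pooled_ci(12.0, 4.0, 10, 10.0, 6.0, 10, t_crit=2.101)
print("pooled variance:", sp2)
print("95% CI for mu1 - mu2:", (round(low, 3), round(high, 3)))
```

With equal group sizes the pooled variance is just the average of the two sample variances (here 5.0), so the standard error is 1 and the interval is 2 ± 2.101.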

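Unequal Variances Assumption

When variances are assumed to be unequal, the standard error of the difference is estimated by:

$\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$

Here, df = ν, and

$\nu = \dfrac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2 / n_1)^2}{n_1 - 1} + \dfrac{(s_2^2 / n_2)^2}{n_2 - 1}}$

Summary: Confidence intervals for difference of means

| Population Distribution | Sample Size | Population Variances | 95% Confidence Interval |
|---|---|---|---|
| Normal | Any | known | $(\bar{X}_1 - \bar{X}_2) \pm 1.96 \sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}$ |
| Normal | Any | unknown, $\sigma_1^2 = \sigma_2^2$ | $(\bar{X}_1 - \bar{X}_2) \pm t_{0.025,\, n_1 + n_2 - 2} \sqrt{s_p^2 / n_1 + s_p^2 / n_2}$ |
| Normal | Any | unknown, $\sigma_1^2 \ne \sigma_2^2$ | $(\bar{X}_1 - \bar{X}_2) \pm t_{0.025,\, \nu} \sqrt{s_1^2 / n_1 + s_2^2 / n_2}$ |
| Not Normal/Unknown | Large | known | $(\bar{X}_1 - \bar{X}_2) \pm 1.96 \sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}$ |
| Not Normal/Unknown | Large | unknown, $\sigma_1^2 = \sigma_2^2$ | $(\bar{X}_1 - \bar{X}_2) \pm 1.96 \sqrt{s_p^2 / n_1 + s_p^2 / n_2}$ |
| Not Normal/Unknown | Large | unknown, $\sigma_1^2 \ne \sigma_2^2$ | $(\bar{X}_1 - \bar{X}_2) \pm 1.96 \sqrt{s_1^2 / n_1 + s_2^2 / n_2}$ |
| Not Normal/Unknown | Small | Any | Non-parametric methods |

Confidence intervals for difference of proportions

| Population Distribution | Sample Size | 95% Confidence Interval |
|---|---|---|
| Binomial | Large | $(\hat{p}_1 - \hat{p}_2) \pm 1.96 \sqrt{\hat{p}_1 (1 - \hat{p}_1) / n_1 + \hat{p}_2 (1 - \hat{p}_2) / n_2}$ |
| Binomial | Small | Exact methods |

Recap: Statistical Inference

Estimation
- Point estimation
- Confidence intervals

Hypothesis Testing
- This is next!
- We will first discuss hypothesis testing as it applies to means of distributions for continuous variables
- We will then discuss discrete data (specifically dichotomous variables), probably next week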
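The degrees of freedom ν on the unequal-variances slide (the formula commonly called the Welch-Satterthwaite approximation) are easy to get wrong by hand, so a short sketch may help. The function name `welch_se_and_df` and all input numbers are assumptions made for illustration.

```python
import math


def welch_se_and_df(var1, n1, var2, n2):
    """Return (standard error, nu) for a difference in means, unequal variances.

    se = sqrt(s1^2/n1 + s2^2/n2)
    nu = (s1^2/n1 + s2^2/n2)^2 / ((s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1))
    """
    a = var1 / n1
    b = var2 / n2
    se = math.sqrt(a + b)
    nu = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return se, nu


# Hypothetical inputs: equal variances and equal group sizes.
se_eq, nu_eq = welch_se_and_df(4.0, 10, 4.0, 10)
# Hypothetical inputs: very different variances, same group sizes.
se_uneq, nu_uneq = welch_se_and_df(1.0, 10, 16.0, 10)

print("equal variances:   se =", round(se_eq, 4), " nu =", round(nu_eq, 4))
print("unequal variances: se =", round(se_uneq, 4), " nu =", round(nu_uneq, 4))
```

When both variances and both sample sizes match, ν equals n1 + n2 − 2, the same df as the pooled case. As the variances diverge, ν drops toward the smaller of n1 − 1 and n2 − 1. Since ν is usually not an integer, a common convention when reading a t-table is to round it down.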

To be continued...

The remaining material from this lecture on Hypothesis Testing has been moved to Lecture 5.