Distribution of the Sample Mean

Estimation of the population mean
In many investigations the data of interest take on a wide range of possible values. Examples: attachment loss (mm) and DMFS. With this type of data it is often of interest to estimate the population mean, μ. A common estimator for μ is the sample mean, $\bar{X}$. In this lecture we will focus on the sampling distribution of $\bar{X}$.

Example: Fluoride Varnish Study*
Children in Yakima, WA were randomized to two different methods of fluoride varnish delivery and followed for ~3 years. The outcome of interest was the number of surfaces with new decay.
* Weinstein, P. et al. Caries Research 2009;43(6):484-90.

Example: Fluoride Varnish Study
We can summarize the observed data with the sample mean and standard deviation. The sample mean is used as an estimate of the true population mean: $\bar{X} = 7.4$ for the Standard group. How good an estimate is it?
[Displayed: means ± standard deviations]

Example: Fluoride Varnish Study
$\bar{X}$ is a random variable. Its value is determined by which people are randomly chosen to be in the sample. Many possible samples, many possible $\bar{X}$s.
[Figure: 20 possible sample means: 7.4, 7.8, 7.3, 6.8, 7.8, 7.9, 8.0, 8.1, 7.4, 7.4, 7.5, 7.0, 7.0, 7.8, 7.6, 6.9, 6.6, 7.0, 7.2, 7.0]

Example: Fluoride Varnish Study
In our study we only see one occurrence of the sample mean. We will have a better idea of how good our one estimate is if we have good knowledge of how $\bar{X}$ behaves, that is, if we know the probability distribution of $\bar{X}$.
[Figure: 20 possible sample means, ranging from 6.6 to 8.1]

The Central Limit Theorem
An important result in probability theory states that the probability distribution for averages (i.e. $\bar{X}$) is approximately the Normal distribution*, provided the size of the sample is reasonably large. This result will often hold regardless of the distribution of the original data.
[Figure: probability distribution for $\bar{X}$, centered at μ]
*some restrictions will apply
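The claim can be checked informally by simulation. The sketch below is only an illustration under assumed inputs: the population is a hypothetical right-skewed exponential distribution with mean 7.4 (not the study data), and `simulate_sample_means` and `skewness` are made-up helper names. If the theorem applies, the skewness of the simulated sample means should shrink toward 0 as n grows.

```python
import random
import statistics


def simulate_sample_means(n, reps=5000, seed=1):
    """Draw `reps` samples of size n and return the list of their means."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        # Hypothetical population: exponential with mean 7.4 (skewed, not Normal).
        sample = [rng.expovariate(1 / 7.4) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


def skewness(values):
    """Sample skewness; values near 0 indicate a symmetric, Normal-like shape."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - m) / sd) ** 3 for v in values])


means_10 = simulate_sample_means(10)
means_100 = simulate_sample_means(100)
print("skewness of sample means, n=10: ", round(skewness(means_10), 2))
print("skewness of sample means, n=100:", round(skewness(means_100), 2))
```

The sample means for n = 100 should look noticeably more symmetric than those for n = 10, even though the underlying data are strongly skewed.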

The approximation with the Normal distribution is not as good with only 10 observations.

More on the distribution of $\bar{X}$
The expected value of $\bar{X}$ is μ, so $\bar{X}$ is unbiased: on average, $\bar{X}$ is perfect as an estimator of μ.

More on the distribution of $\bar{X}$
The standard deviation of $\bar{X}$ is
$SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}$
where σ is the standard deviation in the population and n is the number of people in the sample. It is called the standard error of the mean, or SEM.

More on the distribution of $\bar{X}$
$SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}$
One can think of $SE(\bar{X})$ as the average error that $\bar{X}$ makes when estimating μ, or the precision of the estimate. The precision of $\bar{X}$ is better (SEM is smaller) when the sample is larger (larger n). The precision is worse (SEM is greater) when the population is more variable (has greater σ).
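A small numerical sketch of how the SEM responds to n and σ. The function name `standard_error` and the input values are chosen for illustration only; the point is that quadrupling n halves the SEM, while doubling σ doubles it.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


# Illustrative values only: sigma = 20 with increasing sample sizes.
for n in (10, 40, 160):
    print(f"sigma=20, n={n:>3}: SEM = {standard_error(20, n):.2f}")

# Same sample size, more variable population.
print(f"sigma=40, n= 40: SEM = {standard_error(40, 40):.2f}")
```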

More on the distribution of $\bar{X}$
By the Central Limit Theorem, when n is reasonably large the distribution of $\bar{X}$ will be approximately Normal, with mean μ and standard deviation $\sigma/\sqrt{n}$:
$\bar{X} \sim \text{Normal}\left(\mu, \frac{\sigma^2}{n}\right)$

Example: Birthweight data
The histogram shows the distribution of birthweights at a Boston hospital, with μ = 112 oz and σ = 20.6 oz. Estimate the probability that the mean birthweight of the next 20 babies born will be greater than 120 oz.
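One way to work this example is sketched below. It assumes the next 20 babies behave like a random sample from this population and that n = 20 is large enough for the Normal approximation to $\bar{X}$ to be reasonable; neither assumption is verified here. `NormalDist` is from Python's standard `statistics` module.

```python
import math
from statistics import NormalDist

mu, sigma, n = 112.0, 20.6, 20           # values given in the example
sem = sigma / math.sqrt(n)               # standard error of the mean

# Approximate sampling distribution of the sample mean: Normal(mu, sem^2).
xbar_dist = NormalDist(mu=mu, sigma=sem)

z = (120 - mu) / sem                     # z-score of 120 oz on the X-bar scale
prob = 1 - xbar_dist.cdf(120)            # P(X-bar > 120)

print(f"SEM = {sem:.2f} oz, z = {z:.2f}, P(mean > 120) = {prob:.3f}")
```

Under these assumptions the probability comes out at roughly 0.04, so a mean above 120 oz for 20 babies would be fairly unusual.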

Law of Large Numbers
Recall $\bar{X} \sim \text{Normal}\left(\mu, \frac{\sigma^2}{n}\right)$. As n gets large, the distribution of $\bar{X}$ is forced to be closer and closer to μ. With larger sample sizes, $\bar{X}$ provides a better estimate of μ. The same is true for the sample standard deviation s: as the sample size increases, s should get closer to the population standard deviation σ.
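A simulation sketch of this behavior. For simplicity it assumes a Normal population and borrows μ = 112 and σ = 20.6 from the birthweight example; the real birthweight distribution need not be Normal, and the sample sizes are arbitrary choices.

```python
import random
import statistics

rng = random.Random(0)
MU, SIGMA = 112.0, 20.6   # assumed population values (from the birthweight example)

results = {}
for n in (10, 1000, 100000):
    # Assumption: the population is Normal, purely to keep the sketch simple.
    sample = [rng.gauss(MU, SIGMA) for _ in range(n)]
    xbar = statistics.fmean(sample)
    s = statistics.stdev(sample)
    results[n] = (xbar, s)
    print(f"n={n:>6}: sample mean = {xbar:7.2f}, sample SD = {s:6.2f}")
```

With larger n, both the sample mean and the sample standard deviation should settle near the population values.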

Standard Error versus Standard Deviation
Standard Deviation: describes the variability of a population or a sample.
Standard Error: describes the variability of an estimator that is usually a function of the whole sample.

Confidence intervals for the mean
If n is large enough we can use the result that
$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$
to construct a confidence interval for μ. However, this would result in a formula that involves σ, a value that we don't usually know. In practice we will estimate σ with the sample standard deviation, s. Substituting the random variable s for σ will alter the distribution of the Z score slightly.

The t distribution
The distribution of the statistic
$T = \frac{\bar{X} - \mu}{s/\sqrt{n}}$
is called a t distribution with n-1 degrees of freedom, and is denoted by $t_{n-1}$.

The t distribution
$T = \frac{\bar{X} - \mu}{s/\sqrt{n}}$
The shape of the t distribution is similar to the Normal distribution, but it has higher variability. How much higher depends on the degrees of freedom, which depend on the sample size.

The t distribution
$T = \frac{\bar{X} - \mu}{s/\sqrt{n}}$
The larger the sample, the less variability. t distributions with higher degrees of freedom are more similar to the Normal distribution.
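The extra variability of T can be seen in a simulation sketch. It draws samples from a standard Normal population (an assumption made for illustration), computes T for each sample, and counts how often |T| exceeds 1.96. For a standard Normal Z that fraction would be about 5%. The helper names `t_statistic` and `tail_fraction` are hypothetical.

```python
import math
import random


def t_statistic(sample, mu):
    """T = (x-bar - mu) / (s / sqrt(n)) for one sample."""
    n = len(sample)
    xbar = sum(sample) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
    return (xbar - mu) / (s / math.sqrt(n))


def tail_fraction(n, reps=5000, cutoff=1.96, seed=2):
    """Fraction of simulated T statistics with |T| > cutoff."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # true mean is 0
        if abs(t_statistic(sample, 0.0)) > cutoff:
            hits += 1
    return hits / reps


small_n = tail_fraction(5)     # 4 degrees of freedom
large_n = tail_fraction(100)   # 99 degrees of freedom
print(f"P(|T| > 1.96), n=5:   {small_n:.3f}")
print(f"P(|T| > 1.96), n=100: {large_n:.3f}")
```

With n = 5 the tail fraction should be well above 5%, while with n = 100 it should be close to the Normal value.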

Confidence intervals for the mean
If X is Normal or n is large, then $T = \frac{\bar{X} - \mu}{s/\sqrt{n}}$ follows a t distribution with n-1 degrees of freedom, and
$P\left(-t_{n-1,0.975} < \frac{\bar{X} - \mu}{s/\sqrt{n}} < t_{n-1,0.975}\right) = 0.95,$
where $t_{n-1,0.975}$ is the 97.5th percentile of a $t_{n-1}$ distribution. Doing some algebra gets
$P\left(\bar{X} - t_{n-1,0.975}\frac{s}{\sqrt{n}} < \mu < \bar{X} + t_{n-1,0.975}\frac{s}{\sqrt{n}}\right) = 0.95$
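The algebra can be spelled out as follows. This is one possible sequence of steps, writing $t$ as shorthand for $t_{n-1,0.975}$; the shorthand is introduced here only for readability.

```latex
\begin{align*}
-t &< \frac{\bar{X} - \mu}{s/\sqrt{n}} < t
  && \text{(event inside the probability)} \\
-t\,\frac{s}{\sqrt{n}} &< \bar{X} - \mu < t\,\frac{s}{\sqrt{n}}
  && \text{(multiply through by } s/\sqrt{n} > 0\text{)} \\
-\bar{X} - t\,\frac{s}{\sqrt{n}} &< -\mu < -\bar{X} + t\,\frac{s}{\sqrt{n}}
  && \text{(subtract } \bar{X}\text{)} \\
\bar{X} - t\,\frac{s}{\sqrt{n}} &< \mu < \bar{X} + t\,\frac{s}{\sqrt{n}}
  && \text{(multiply by } -1\text{, which reverses the inequalities)}
\end{align*}
```

Each step describes the same event, so the probability stays at 0.95 throughout.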

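Confidence intervals for the mean
$P\left(\bar{X} - t_{n-1,0.975}\frac{s}{\sqrt{n}} < \mu < \bar{X} + t_{n-1,0.975}\frac{s}{\sqrt{n}}\right) = 0.95$
says that there is 95% probability that the interval
$\left(\bar{X} - t_{n-1,0.975}\frac{s}{\sqrt{n}},\ \bar{X} + t_{n-1,0.975}\frac{s}{\sqrt{n}}\right)$
will contain μ.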
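The 95% statement is about the procedure: across many repeated samples, about 95% of the intervals built this way should contain μ. The sketch below checks that by simulation. The population (Normal with mean 0 and SD 5) and the number of repetitions are hypothetical choices, and the critical value 2.06 for n = 25 is the table value quoted in the chewing gum example.

```python
import math
import random

T_CRIT = 2.06            # t_{24,0.975}, the table value quoted in the lecture
N = 25                   # sample size matching 24 degrees of freedom
MU, SIGMA = 0.0, 5.0     # hypothetical population, for illustration only


def t_interval(sample, t_crit):
    """Return (low, high) = x-bar -/+ t_crit * s / sqrt(n)."""
    n = len(sample)
    xbar = sum(sample) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
    half_width = t_crit * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width


def coverage(reps=4000, seed=3):
    """Fraction of simulated intervals that contain the true mean MU."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
        low, high = t_interval(sample, T_CRIT)
        if low < MU < high:
            hits += 1
    return hits / reps


observed_coverage = coverage()
print(f"Fraction of intervals containing mu: {observed_coverage:.3f}")
```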

Example: chewing gum data
Group A comprised n = 25 children. The sample mean of the change in DMFS was -0.72. The sample standard deviation, s, was 5.37. A 95% confidence interval for the true mean change in DMFS is
$\left(-0.72 - t_{24,0.975}\frac{5.37}{\sqrt{25}},\ -0.72 + t_{24,0.975}\frac{5.37}{\sqrt{25}}\right)$
We can look up $t_{24,0.975} = 2.06$ in Table 4 in the coursepack (or use Excel):
$= (-0.72 - 2.06 \times 1.07,\ -0.72 + 2.06 \times 1.07) = (-0.72 - 2.20,\ -0.72 + 2.20) = (-2.92, 1.48)$

General formula
A 100(1-α)% confidence interval for μ is
$\bar{X} \pm t_{n-1,1-\alpha/2}\frac{s}{\sqrt{n}}$
where $t_{n-1,1-\alpha/2}$ is the 100(1-α/2)th percentile of a $t_{n-1}$ distribution.
Example: Suppose n = 30. For a 95% confidence interval, we use α = 0.05, so we use $t_{29,0.975} = 2.05$, the 97.5th percentile.

(1-α)*100% confidence interval for the mean
α is meant to indicate the error we are willing to live with. That is, when estimating the mean with 95% confidence, we are allowing an α = 5% chance of missing the true mean. It is standard to use the 100(1-α/2)th percentile because we want to split the error evenly on either side of the interval.
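The mapping from confidence level to percentile can be sketched in a few lines. The function name `upper_percentile` is made up for this illustration. The Normal quantile is printed only as a large-sample reference point; it is not the t critical value, which still has to come from a t table or statistical software.

```python
from statistics import NormalDist


def upper_percentile(confidence_level):
    """Return 1 - alpha/2, the percentile (as a proportion) used for the interval."""
    alpha = 1 - confidence_level      # total error we are willing to accept
    return 1 - alpha / 2              # half of alpha is left in each tail


for level in (0.90, 0.95, 0.99):
    p = upper_percentile(level)
    z = NormalDist().inv_cdf(p)       # Normal reference value, not a t quantile
    print(f"{level:.0%} CI -> percentile {p:.3f}, Normal quantile {z:.3f}")
```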

Example: chewing gum data
Compute a 99% confidence interval for the true mean change in DMFS for Group A. Acceptable error in this case would be α = 1%, so we use the 100(1-0.01/2) = 99.5th percentile. From Table 4, $t_{24,0.995} = 2.80$, thus the 99% CI is
$-0.72 \pm 2.80 \times \frac{5.37}{\sqrt{25}} = -0.72 \pm 3.01 = (-3.73, 2.29)$
Note: the 99% confidence interval is wider than the 95% confidence interval. It needs to be wider to have a better chance of covering μ.
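Both chewing gum intervals can be reproduced with a short sketch. The sample mean is taken as -0.72, the value consistent with the reported intervals, and the critical values 2.06 and 2.80 are the table values quoted in the lecture. The helper `t_interval` is a hypothetical name. Small differences in the last digit relative to the slides come from rounding the standard error to 1.07 there.

```python
import math


def t_interval(xbar, s, n, t_crit):
    """Confidence interval x-bar -/+ t_crit * s / sqrt(n)."""
    half_width = t_crit * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width


XBAR, S, N = -0.72, 5.37, 25          # Group A summary statistics

ci95 = t_interval(XBAR, S, N, 2.06)   # t_{24,0.975} from the lecture's table
ci99 = t_interval(XBAR, S, N, 2.80)   # t_{24,0.995} from the lecture's table

print(f"95% CI: ({ci95[0]:.2f}, {ci95[1]:.2f})")
print(f"99% CI: ({ci99[0]:.2f}, {ci99[1]:.2f})")
print("99% interval is wider:", (ci99[1] - ci99[0]) > (ci95[1] - ci95[0]))
```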