Sampling (cont'd) and Confidence Intervals, 11.220, Lecture 9, 8 March 2006, R. Ryznar

Census Surveys
Decennial Census
  Every household (over 11 million) gets the short form, and 17% (about 1/6) get the long form.
  Misses approximately 0.12% of the population overall (including about 2.78% of the black population).
  Why do it?
Current Population Survey
  60,000 households interviewed every month.
The American Community Survey
  Contacts 3 million households (including some from every county) and will replace the long form in 2010.

Gallup Polls Many do not believe a survey of 1500-2000 respondents can represent the views of all Americans.

Estimation
Parameter: A number that describes the population. We don't know its value.
Statistic: A number that describes a sample. It can change from sample to sample.
If we take lots of samples, the statistic follows a predictable pattern.

Sampling Variability

Law of Large Numbers
As the number of trials increases, the average outcome approaches the mean of the population (i.e., the expected outcome) and the standard deviation of the average outcome approaches zero.
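The statement above can be checked with a small simulation. This is a sketch under assumed conditions, not part of the lecture: the "population" is a fair six-sided die (mean 3.5), and the function names, the number of repeats, and the seed are illustrative choices.

```python
import random
import statistics

def average_of_rolls(n_rolls, rng):
    """Average of n_rolls fair die rolls; the population mean is 3.5."""
    return sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls

def spread_of_averages(n_rolls, n_repeats=300, seed=0):
    """Repeat the experiment n_repeats times and summarize the averages.

    Returns (mean of the averages, standard deviation of the averages).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    averages = [average_of_rolls(n_rolls, rng) for _ in range(n_repeats)]
    return statistics.mean(averages), statistics.stdev(averages)

for n in (10, 100, 1000):
    mean_avg, sd_avg = spread_of_averages(n)
    print(f"n={n:5d}  mean of averages={mean_avg:.3f}  sd of averages={sd_avg:.3f}")
```

As n grows, the averages should cluster more tightly around 3.5, with their standard deviation shrinking roughly in proportion to 1/sqrt(n).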

To reduce bias, take an SRS (simple random sample). To reduce variability, take larger samples. The margin of error is about sampling variability. We say, "The president's approval rating is 40%, plus or minus 3 percentage points." We are 95 percent confident that the true population proportion is between 37% and 43%.

Central Limit Theorem
The distribution of an average tends to be Normal, even when the distribution from which the average is computed is decidedly non-normal.
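A quick way to see this is to average draws from a strongly skewed population and check whether the averages behave like a Normal variable. The sketch below assumes an exponential population with mean 1 and standard deviation 1 (an illustrative choice, not from the lecture) and counts how many sample means fall within two standard errors of the true mean; for a Normal distribution that share is about 95%.

```python
import random

def sample_means(n, n_samples=2000, seed=1):
    """Means of n_samples samples of size n from an exponential population.

    The exponential distribution with rate 1 has mean 1 and sd 1 and is
    heavily right-skewed, i.e. decidedly non-normal.
    """
    rng = random.Random(seed)
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n
            for _ in range(n_samples)]

def share_within_two_se(means, n, mu=1.0, sigma=1.0):
    """Fraction of sample means within 2 standard errors of the true mean."""
    se = sigma / n ** 0.5
    return sum(abs(m - mu) <= 2 * se for m in means) / len(means)

for n in (2, 10, 50):
    share = share_within_two_se(sample_means(n), n)
    print(f"n={n:3d}  share of means within 2 s.e. = {share:.3f}")
```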

Quick method: the margin of error for a 95% confidence interval around a sample proportion is approximately $1/\sqrt{n}$.

Margin of Error and Sample Size
$1/\sqrt{n} = 1/\sqrt{1600} = 1/40 = 0.025$, or 2.5% (about 3%)
$1/\sqrt{n} = 1/\sqrt{2527} = 1/50.27 = 0.020$, or 2.0%
$1/\sqrt{n} = 1/\sqrt{100} = 1/10 = 0.1$, or 10%
The size of the population has little influence on the behavior of statistics from random samples. The population size does not matter as long as it is at least 100 times larger than the sample.
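The quick method is the margin $2\sqrt{p(1-p)/n}$ evaluated at its worst case, p = 0.5, where it simplifies to $1/\sqrt{n}$. The sketch below reproduces the three sample sizes above and compares the quick value with the margin for a specific proportion; the function names are illustrative.

```python
import math

def quick_margin_of_error(n):
    """Quick 95% margin of error for a sample proportion: 1 / sqrt(n)."""
    return 1 / math.sqrt(n)

def margin_of_error(p, n, z=2.0):
    """Margin of error z * sqrt(p(1-p)/n) for a given proportion p."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (1600, 2527, 100):
    print(f"n={n:5d}  quick margin = {quick_margin_of_error(n):.4f}")

# At p = 0.5 the two formulas agree; for any other p the quick method
# is conservative (it overstates the margin).
print(margin_of_error(0.5, 1600), margin_of_error(0.153, 1600))
```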

Estimating a Population Proportion
We take a survey (SRS) to estimate the percentage of overweight children aged 6-11 years in the general population.
$\hat{p} = \frac{\text{count of successes in the sample}}{n} = \frac{408}{2673} = 15.3\%$

Sampling Distribution of a Sample Proportion
If the sample is large enough, the sampling distribution of $\hat{p}$ is approximately normal.
The mean of the sampling distribution is $p$.
The standard deviation of the sampling distribution is $\sqrt{\frac{p(1-p)}{n}}$.

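The standard deviation estimated from our sample (substituting $\hat{p}$ for the unknown $p$) is:
$\sqrt{\frac{p(1-p)}{n}} \approx \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{.153(.847)}{2673}} = .006963$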

The 95% Confidence Interval around our estimate is:
$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
$.153 \pm 1.96(.00696)$
$.153 \pm .0136$
(13.9%, 16.7%)

The 95% Confidence Interval around our estimate (rounding $z_{\alpha/2}$ from 1.96 to 2) is:
$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
$.153 \pm 2(.00696)$
$.153 \pm .0139$
(13.9%, 16.7%)

What if you wanted a 99% Confidence Interval around our estimate?
$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
$.153 \pm 2.58(.00696)$
$.153 \pm .018$
(13.5%, 17.1%)
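The proportion intervals above can be reproduced in a few lines. This is a minimal sketch using the counts from the overweight-children example (408 of 2673); the function name is illustrative. It keeps $\hat{p}$ unrounded, so the endpoints can differ from the slide values, which start from the rounded .153, in the last digit.

```python
import math

def proportion_ci(successes, n, z):
    """Large-sample confidence interval for a population proportion.

    Returns (p_hat, standard error, (lower, upper)).
    """
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * se
    return p_hat, se, (p_hat - margin, p_hat + margin)

# z multipliers used in the lecture: 1.96 and 2 for 95%, 2.58 for 99%.
for label, z in (("95% (z=1.96)", 1.96), ("95% (z=2)", 2.0), ("99% (z=2.58)", 2.58)):
    p_hat, se, (lo, hi) = proportion_ci(408, 2673, z)
    print(f"{label:13s} p_hat={p_hat:.3f} se={se:.5f} interval=({lo:.3f}, {hi:.3f})")
```

Raising the confidence level from 95% to 99% widens the interval because the multiplier grows while the standard error stays the same.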

Sampling distribution of a sample mean
Choose an SRS of size n from a population in which individuals have mean µ and standard deviation σ. Let $\bar{x}$ be the mean of the sample. Then:
The sampling distribution of $\bar{x}$ is approximately normal when the sample size n is large.
The mean of the sampling distribution is equal to µ.
The standard deviation (standard error of the estimate) of the sampling distribution is $\sigma_{\bar{x}} = \text{s.e.} = \sigma/\sqrt{n}$.

Confidence Interval for a Population Mean (µ)
When n is large (n > 30), the sample standard deviation s is close to σ and can be used to estimate it.
Confidence Interval for a population mean:
$\bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}} = \bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \approx \bar{x} \pm z_{\alpha/2}\frac{s}{\sqrt{n}}$

Suppose a program director wants to estimate the average length of time (in months) clients remain in a rehab clinic program. She takes a random sample of 100 clients' records and uses the sample's mean, $\bar{x}$, to estimate µ, the population mean. We start by calculating the mean and the sample standard deviation. Assume that:
$\sum x = 465$ and $\sum (x - \bar{x})^2 = 2{,}387$

Then,
$\bar{x} = \frac{\sum x}{n} = \frac{465}{100} = 4.65$
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{2{,}387}{99} = 24.11$ and $s = 4.9$
Since we have a large sample (n = 100), we can substitute s for σ. A 95% confidence interval for the mean number of months spent in the program is
$\bar{x} \pm 2\frac{s}{\sqrt{100}} = 4.65 \pm 2\frac{4.9}{10} = 4.65 \pm .98$
Confidence Interval = (3.67, 5.63)
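The same arithmetic can be written as a small function that starts from the two sums given for the rehab example. This is a sketch with an illustrative function name; it uses the multiplier 2, as the lecture does, and keeps s unrounded, which gives the same interval to two decimals.

```python
import math

def mean_ci_from_sums(sum_x, sum_sq_dev, n, z=2.0):
    """Large-sample confidence interval for a mean from summary sums.

    sum_x      : sum of the observations
    sum_sq_dev : sum of squared deviations from the sample mean
    Returns (mean, sample sd, (lower, upper)).
    """
    mean = sum_x / n
    s = math.sqrt(sum_sq_dev / (n - 1))  # sample standard deviation
    margin = z * s / math.sqrt(n)        # z times the standard error
    return mean, s, (mean - margin, mean + margin)

mean, s, (lo, hi) = mean_ci_from_sums(465, 2387, 100)
print(f"mean={mean:.2f}  s={s:.2f}  interval=({lo:.2f}, {hi:.2f})")
```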

Small sample estimates of µ
$\bar{x} \pm t_{\alpha/2}\frac{s}{\sqrt{n}}$
where $t_{\alpha/2}$ is based on (n - 1) degrees of freedom.
Assumption: A random sample is selected from a population with a relative frequency distribution that is approximately normal.

Food prices have been going up rapidly. To periodically assess the increase in prices, you purchase the same items at twenty different grocery stores. The mean and standard deviation of the costs at the twenty supermarkets are:
$\bar{x} = \$26.84$ and $s = \$2.63$
If we assume that the distribution of costs for the grocery basket at all supermarkets is approximately normal, we can use the t-statistic to form the confidence interval. For a confidence level of 95%, we need the tabulated value of t with df = n - 1 = 19. From the t table we see that $t_{\alpha/2} = t_{.025} = 2.093$.
$\bar{x} \pm t_{.025}\frac{s}{\sqrt{n}} = 26.84 \pm 2.093\frac{2.63}{\sqrt{20}} = 26.84 \pm 1.23 = (25.61, 28.07)$
Thus, we are reasonably confident (95%) that the interval from $25.61 to $28.07 contains the true mean cost µ of the grocery basket. This is because if we were to employ our interval estimator on repeated occasions, 95% of the intervals constructed would contain µ.
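The grocery-basket interval and the "repeated occasions" interpretation can both be sketched in code. The critical value 2.093 is the tabulated value quoted above, entered as a constant rather than computed. The coverage simulation is an illustration only: its population mean (27.0), standard deviation (2.6), number of repeats, and seed are assumptions, not data from the lecture.

```python
import math
import random
import statistics

T_CRIT_95_DF19 = 2.093  # tabulated t value for 95% confidence, 19 df

def t_interval(mean, s, n, t_crit):
    """Small-sample confidence interval: mean +/- t * s / sqrt(n)."""
    margin = t_crit * s / math.sqrt(n)
    return mean - margin, mean + margin

def coverage(n=20, repeats=2000, mu=27.0, sigma=2.6,
             t_crit=T_CRIT_95_DF19, seed=2):
    """Share of simulated intervals that contain the true mean mu.

    Each repeat draws n values from an assumed Normal(mu, sigma) population.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lo, hi = t_interval(statistics.mean(sample),
                            statistics.stdev(sample), n, t_crit)
        hits += lo <= mu <= hi
    return hits / repeats

lo, hi = t_interval(26.84, 2.63, 20, T_CRIT_95_DF19)
print(f"interval = ({lo:.2f}, {hi:.2f})")
print(f"simulated coverage = {coverage():.3f}")
```

The simulated coverage should land near 0.95, which is what "95% confident" means: the procedure captures µ in about 95% of repeated samples, not that any one interval has a 95% chance of being right.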

Determining Sample Size
How can the appropriate sample size be determined? First determine how reliable you want the estimate to be.
Example: Consider the rehab program example where we estimated the mean length of time clients stayed in the program. A sample of 100 clients' records produced an estimate, $\bar{x}$, that was within .98 month of µ with probability equal to .95. What if we wanted to estimate the true mean to within .5 month with a probability equal to .95? How large a sample would be required?

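For the sample size n = 100, we found an approximate 95% confidence interval to be
$\bar{x} \pm 2\sigma_{\bar{x}} = 4.65 \pm .98$
If we now want our estimator to be within .5 month of µ, we must have
$2\sigma_{\bar{x}} = .5$ or $\frac{2\sigma}{\sqrt{n}} = .5$
Using s = 4.9 as the estimate of σ:
$\frac{2(4.9)}{\sqrt{n}} = .5$
$\sqrt{n} = \frac{2(4.9)}{.5} = 19.6$
$n = (19.6)^2 = 384.16 \approx 384$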

You would have to sample approximately 384 clients' records in order to estimate the mean length of stay in the program, µ, to within .5 month with probability equal to .95.
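Solving $2s/\sqrt{n} = B$ for n gives $n = (2s/B)^2$, where B is the desired bound on the error. The sketch below applies that to the rehab example; the function name and the symbol B are illustrative. Rounding up instead of to the nearest integer gives 385, which guarantees the bound is met; the lecture's "approximately 384" rounds to the nearest value.

```python
import math

def required_sample_size(s, bound, z=2.0):
    """Sample size n such that z * s / sqrt(n) equals the desired bound."""
    return (z * s / bound) ** 2

n_exact = required_sample_size(s=4.9, bound=0.5)
print(f"n = {n_exact:.2f}")                   # unrounded solution
print(f"rounded up: {math.ceil(n_exact)}")    # smallest n meeting the bound

# Halving the bound roughly quadruples the required sample size.
print(required_sample_size(4.9, 0.25) / n_exact)
```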

Understanding Degrees of Freedom
Statisticians use the term "degrees of freedom" to describe the number of values in the final calculation of a statistic that are free to vary. Consider, for example, the statistic $s^2$. To calculate the $s^2$ of a random sample, we must first calculate the mean of that sample and then compute the sum of the squared deviations from that mean. While there will be n such squared deviations, only (n - 1) of them are, in fact, free to assume any value whatsoever. This is because the final deviation must come from the one value of X that makes the sum of all the Xs divided by n equal the obtained mean of the sample. All of the other (n - 1) squared deviations from the mean can, theoretically, have any values whatsoever. For these reasons, the statistic $s^2$ is said to have only (n - 1) degrees of freedom.
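The constraint described above is that deviations from the sample mean always sum to zero, so once (n - 1) of them are known the last one is forced. A minimal sketch with a hypothetical three-value sample (the data and function names are illustrative, not from the lecture):

```python
def deviations_from_mean(values):
    """Deviation of each value from the sample mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def last_deviation(free_deviations):
    """The one remaining deviation, forced because all deviations sum to 0."""
    return -sum(free_deviations)

def sample_variance(values):
    """s^2: sum of squared deviations divided by (n - 1) degrees of freedom."""
    devs = deviations_from_mean(values)
    return sum(d * d for d in devs) / (len(values) - 1)

sample = [2, 4, 9]                 # hypothetical data, mean = 5
devs = deviations_from_mean(sample)
print(devs)                        # [-3.0, -1.0, 4.0]
print(last_deviation(devs[:-1]))   # 4.0, recovered from the other two
print(sample_variance(sample))     # 13.0
```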