Sociology 360: Statistics for Sociologists I
Chapter 11: Sampling Distributions

Population and sample; parameter and statistic
The population is the entire group we are interested in.
A parameter is a number describing a characteristic of the population. Parameters are usually unknown.
A statistic is a number describing a characteristic of a sample (the part of the population we actually observe).
We often use a statistic to estimate an unknown population parameter.

Populations and samples: Notation
Numerical summaries for quantitative variables, plus the proportion for a dichotomous categorical variable:

                      Population parameter    Sample statistic
  Mean                µ                       x̄
  Standard deviation  σ                       s
  Proportion          p                       p̂

Question about notation
The mean distance run in a year by a sample of subscribers to Runner's World can be represented by:
Key question: What if we drew another sample?
How closely does a sample reflect the population?
[Diagram: a sample drawn from the population of states; value shown: 7.332]
How likely is it that a statistic estimated from a sample will be close to the population parameter?
We can use statistical theory to answer this question if we used a probability sample to calculate the statistic.

The law of large numbers
As the number of randomly drawn observations (n) in a sample increases:
  the mean of the sample (x̄) gets closer and closer to the population mean µ (quantitative variable);
  the sample proportion (p̂) gets closer and closer to the population proportion p (categorical variable).

Distribution of x̄ (the sample mean)
We take many random samples of a given size n from a population with mean µ and standard deviation σ.
Some sample means will be above the population mean µ and some will be below, making up the sampling distribution.
[Figure: histogram of some sample averages]
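The law of large numbers stated above can be checked with a small simulation. This is a minimal sketch, not part of the original slides: the population is hypothetical (10,000 right-skewed values), and the function name `sample_mean` is made up for illustration.

```python
import random
import statistics


def sample_mean(population, n, rng):
    """Mean of n observations drawn at random, with replacement, from population."""
    return statistics.fmean([rng.choice(population) for _ in range(n)])


# Hypothetical population (an assumption): 10,000 right-skewed values.
rng = random.Random(42)
population = [rng.expovariate(1 / 5) for _ in range(10_000)]
mu = statistics.fmean(population)

print(f"population mean mu = {mu:.3f}")
for n in (10, 100, 1_000, 100_000):
    xbar = sample_mean(population, n, rng)
    print(f"n = {n:>7,}: x-bar = {xbar:.3f}, distance from mu = {abs(xbar - mu):.3f}")
```

The distance from µ tends to shrink as n grows, although any single run can wobble at small n.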
Facts about the distribution of x̄
The mean, or center, of the sampling distribution of x̄ is equal to the population mean µ.
The standard deviation of the sampling distribution is σ/√n, where n is the sample size.
[Figure: sampling distribution of x̄, centered at µ with standard deviation σ/√n]

A sampling distribution is a distribution of sample statistics
When sampling randomly from a given population:
  The sampling distribution describes what happens to the statistic when we take all possible random samples of a fixed size n.
  Like other distributions, we can describe the center and the spread of sampling distributions.
  The sampling distribution of a statistic is the probability distribution of that statistic.

Why we use sampling distributions
We have data from sample surveys.
How accurate are our estimates of the population parameters of interest? E.g., would 5 states be enough to estimate the mean murder rate for all 50 states? 10 states?
If we know about the sampling distribution of a statistic, we can say how precise (close to the population parameter) the statistic is likely to be.

Sample size and the spread of sampling distributions
[Figure: three histograms of the means of 1000 samples, for n = 5, n = 10, and n = 15; horizontal axis: mean (0 to 15); vertical axis: fraction]

  Sample size   Mean of sample means   Standard deviation
  5             7.4048                 1.7048
  10            7.3596                 1.1564
  15            7.3380                 0.8553
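The pattern in the table above (center stays put, spread shrinks with n) can be reproduced in a few lines. This is a sketch under stated assumptions: the state murder-rate data behind the slides is not reproduced here, so the code draws from a hypothetical normal population with µ = 50 and σ = 10, and `simulate_sample_means` is an illustrative name.

```python
import math
import random
import statistics

MU, SIGMA = 50.0, 10.0  # hypothetical population mean and SD (assumptions)


def simulate_sample_means(n, num_samples=1000, seed=0):
    """Return the means of num_samples random samples, each of size n."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.gauss(MU, SIGMA) for _ in range(n)])
        for _ in range(num_samples)
    ]


for n in (5, 10, 15):
    means = simulate_sample_means(n)
    print(
        f"n = {n:>2}: mean of sample means = {statistics.fmean(means):.2f}, "
        f"SD of sample means = {statistics.stdev(means):.2f}, "
        f"sigma/sqrt(n) = {SIGMA / math.sqrt(n):.2f}"
    )
```

The simulated standard deviation of the sample means should land close to σ/√n for each n, while the mean of the sample means stays near µ.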
Relationships of the statistics to the parameters
The mean of the sampling distribution of the sample mean is equal to the population mean, or, using symbols: the mean of x̄ is equal to µ.
Because the average value of x̄, over many samples, is equal to µ, we say that x̄ is an unbiased estimator of µ.
The standard deviation of the sampling distribution of x̄ is smaller than the standard deviation of the population whenever the sample size is larger than 1. Specifically, the standard deviation of x̄ is equal to σ/√n.
Treating n = 1 (a single observation) as one possibility, we see that averages are less variable than individual observations.

Shape of the sampling distribution with a normal population
When a variable is normally distributed in its population, the sampling distribution of x̄ over all possible samples of size n is also normal.

Summary for normal populations
If variable X is N(µ, σ), then the distribution of the sample mean x̄ is N(µ, σ/√n).

Problem: IQ scores
In a selected population of adults, IQ is normally distributed with mean 112 and standard deviation 20. Suppose 200 adults are randomly selected for a market research campaign. The distribution of the sample mean IQ is:
  A) normal, mean 112, standard deviation 20.
  B) normal, mean 112, standard deviation 1.414.
  C) normal, mean 112, standard deviation 0.1.
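The IQ problem above comes down to one application of σ/√n. A minimal check, with `sd_of_sample_mean` as an illustrative helper name rather than anything from the slides:

```python
import math


def sd_of_sample_mean(sigma, n):
    """Standard deviation of the sampling distribution of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


# Values from the problem: sigma = 20, n = 200.
se = sd_of_sample_mean(sigma=20, n=200)
print(f"SD of the sample mean IQ: {se:.3f}")
```

The result, about 1.414, together with the unchanged mean of 112 and the normal population, points to option B.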
Problem: IQ scores, continued
Suppose that we would be satisfied with a standard deviation of the mean of 5. How many individuals would we need to sample?

Practical notes
Large samples are not always feasible.
Not all variables are normally distributed. Example: income is strongly skewed to the right.
Is x̄ still a good estimator of µ? In large samples? In small samples?

The central limit theorem
Central Limit Theorem: When randomly sampling from any population with mean µ and standard deviation σ, when n is large enough, the sampling distribution of x̄ is approximately normal: N(µ, σ/√n).

Question about distributions and the CLT
If the first graph shows the population, which plot could be the sampling distribution of x̄ if all samples of size n = 50 are drawn?
[Figure panels: population with strongly skewed distribution; sampling distribution of x̄ for n = 2 observations; sampling distribution of x̄ for n = 10 observations; sampling distribution of x̄ for n = 25 observations]
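Two ideas from the slides above can be checked numerically: the sample size needed to reach a target standard deviation of the mean, and the way sample means from a skewed population look more symmetric as n grows. This is a rough sketch; the exponential population and the helper names (`required_sample_size`, `skewness`, `sample_means`) are assumptions made for illustration.

```python
import math
import random
import statistics


def required_sample_size(sigma, target_sd):
    """Smallest n for which sigma / sqrt(n) is at most target_sd."""
    return math.ceil((sigma / target_sd) ** 2)


def skewness(values):
    """Moment-based skewness: mean cubed deviation divided by SD cubed."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([(v - m) ** 3 for v in values]) / sd ** 3


def sample_means(n, num_samples=5000, seed=0):
    """Means of num_samples samples of size n from a strongly right-skewed
    (exponential) hypothetical population."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(num_samples)
    ]


# IQ problem: sigma = 20, target SD of the mean = 5.
print("n needed for SD of the mean = 5:", required_sample_size(20, 5))

for n in (2, 10, 25, 50):
    print(f"n = {n:>2}: skewness of sample means = {skewness(sample_means(n)):.2f}")
```

For the IQ problem this gives n = 16, since 20/√16 = 5. The skewness of the simulated sample means should fall toward 0 (the value for a normal curve) as n increases.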
Another question about distributions and the CLT
The following density curve represents waiting times at a customer service counter at a national department store. The mean waiting time is 5 minutes with standard deviation 5 minutes. If we took all possible samples of size n = 100, how would you describe the sampling distribution of the sample means (x̄)? Shape? Center? Spread?
[Figure: density curve of waiting times]

Sampling distributions and normality
When sample size is small, the sampling distribution of the mean will resemble the population distribution.
As sample size increases, the sampling distribution of the mean becomes more normal-shaped, regardless of the shape of the population distribution.
A sample size of 25 is generally enough to obtain an approximately normal sampling distribution from a strongly skewed population or even one with mild outliers.
A sample size of 40 will typically be good enough to overcome extreme skewness and outliers.

The three distributions to keep straight
Distribution of a variable in the population
  Mean = µ; standard deviation = σ
  Units/cases = people, states, etc.
Distribution of a variable in a sample
  Mean = x̄; standard deviation = s
  Statistics estimate parameters
  Units/cases = people, states, etc.
Distribution of a mean calculated from repeated samples
  Mean = µ; standard deviation = σ/√n
  It is the sampling distribution of x̄
  Units/cases = samples

Using the central limit theorem
In 1997 mean family income in the United States was $49,692 with a standard deviation of $39,802.
What is the minimum sample size we should use and why?
Using this sample size, find the probability that the sample you draw will have a mean income above $60,000.
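The waiting-time and family-income questions above can both be worked with the normal approximation N(µ, σ/√n). The sketch below uses Python's standard `statistics.NormalDist`. The choice n = 40 for the income problem is an assumption taken from the rule of thumb on the slide for extreme skewness, not a stated answer, and `sampling_distribution` is an illustrative helper name.

```python
import math
from statistics import NormalDist


def sampling_distribution(mu, sigma, n):
    """Normal approximation to the sampling distribution of the mean."""
    return NormalDist(mu, sigma / math.sqrt(n))


# Waiting times: mu = 5, sigma = 5, n = 100.
waiting = sampling_distribution(mu=5, sigma=5, n=100)
print(f"waiting times: center = {waiting.mean}, spread = {waiting.stdev}")

# Family income: mu = 49,692, sigma = 39,802.
n_income = 40  # assumption: rule of thumb for strongly skewed populations
income = sampling_distribution(mu=49_692, sigma=39_802, n=n_income)
p_above = 1 - income.cdf(60_000)
print(f"P(sample mean income > $60,000) with n = {n_income}: {p_above:.3f}")
```

For the waiting times, the center is 5 minutes and the spread is 5/√100 = 0.5 minutes. For income, the probability comes out at roughly 5% under these assumptions, and a larger n would make it smaller.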
More problems: Stocks
1987 was a bad year for the stock market. Of 1815 stocks on the NYSE:
  the average return was 3.5%;
  the standard deviation was 26%.
Stock returns were normally distributed.
1) What is the probability that a randomly selected stock lost more than 30% of its value in 1987?
2) What is the probability that a portfolio of 5 randomly chosen stocks lost more than 30% of its value?
3) Why do experts recommend larger portfolios as less risky?
4) If I randomly picked 5 stocks, what's the least I could have lost if I were in the bottom 5% of the returns distribution?

Concluding comments
Some things to know include:
  what a sampling distribution is and why sampling distributions are important
  the effect of sample size on the sampling distribution
  the center and variability of a sampling distribution
  how to think of a sampling distribution as a probability model
  the Law of Large Numbers and the Central Limit Theorem
  keeping straight the population, sample, and sampling distribution
  what parts of the following expression are true for any sampling distribution, and what is true only in certain situations: N(µ, σ/√n)

Review of important concepts
Sampling distributions are theoretical distributions: they are the distribution of x̄ over all possible samples of size n.
The spread of a sampling distribution depends on the number of cases over which you calculate the mean (the sample size n), as well as on the spread of the population, measured by σ.
When you calculate means over more cases (larger n), the variability of the sampling distribution decreases, and the sample means fall closer and closer to the population mean (by the LLN).
Because sampling distributions are theoretical distributions, for any given population they vary only with the number of cases used to calculate the mean. Their characteristics are not affected by the number of samples that might be drawn.
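The stock problems earlier in this section can be set up the same way. This is a hedged sketch, not an answer key. It assumes a portfolio's return is the plain average of k independently chosen stocks, and it uses the mean return as printed in the problem (3.5%). The mean is an argument, so a different value (for example -3.5) can be passed instead. The names `portfolio_return_dist` and `summarize` are illustrative.

```python
import math
from statistics import NormalDist

SIGMA = 26.0  # standard deviation of individual stock returns, in percent


def portfolio_return_dist(mu, sigma, k):
    """Normal model for the mean return of k randomly chosen stocks
    (assumes equal weights and independent draws)."""
    return NormalDist(mu, sigma / math.sqrt(k))


def summarize(mu):
    one = portfolio_return_dist(mu, SIGMA, 1)
    five = portfolio_return_dist(mu, SIGMA, 5)
    return {
        "p_one_stock_below_-30": one.cdf(-30),        # question 1
        "p_five_stocks_below_-30": five.cdf(-30),     # question 2
        "five_stocks_5th_percentile": five.inv_cdf(0.05),  # question 4
    }


results = summarize(mu=3.5)  # mean as printed in the problem
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

The probability of a loss beyond 30% is much smaller for the 5-stock portfolio than for a single stock, because the spread drops from 26 to 26/√5. That is the point behind question 3. The 5th percentile gives the cutoff return for the bottom 5% of 5-stock portfolios.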