Objectives. 3.3 Toward statistical inference. Toward statistical inference Sampling variability

Transcription

1 Objectives. 3.3 Toward statistical inference: Population versus sample (CIS, Chapter 6); Toward statistical inference; Sampling variability. Further reading: (some of the concepts introduced in this link are beyond this class). Adapted from authors' slides, 2012 W.H. Freeman and Company.

2 The inconvenient truth. So far we have assumed the mean of a population is known. In reality the population is unknown, so its mean is unknown. Inference is the task of estimating the unknown population mean from a sample that is very small relative to the population. We illustrate what is meant by this in the following examples. See also the recent journal article from Poultry Science.

3 Towards statistical inference. In a survey of 2000 randomly sampled college students, 62% of the sample reported that they had encountered some type of harassment. Parents are worried: what is the truth about the millions of students who are currently at college? Because the sample was taken at random, it seems quite reasonable to suppose that it is representative of the population of college students. This suggests that about 62% of all college students may have encountered some type of harassment. The 62% is in fact an estimate of the population proportion who have encountered harassment. What is the exact proportion? This is the start of statistical inference, where we infer conclusions about the entire population based on a sample. 62% is not the exact value; the estimate will vary from sample to sample, and our objective in the next few lectures is to understand this variability. This will help us to understand the reliability of the estimate.

4 Refresher: Definitions. Population: The entire group of individuals in which we are interested but cannot assess or observe directly. Examples: all college students, all calves, etc. Often the population is described by a mathematical model. Sample: The part of the population we actually examine and for which we do have data. How well the sample represents the population depends on the sampling method, as well as on the sample size. A parameter is a number describing a characteristic of the population. A statistic is a number describing a characteristic of a sample.

5 Example: M&M data. To illustrate what we mean by a population and sample, let us return to the M&M example. Let us suppose that the 170 M&M bags represent the population of M&Ms (in reality we do not observe the population, so this is just an example for illustration). The population mean is then the average number of M&Ms per bag over these 170 bags. A random sample of size 5 is taken. There are a very large number of different random samples that can be taken! Note: Examples of random samples are given in homework 1. On the next two slides we show how to sample from the distribution. Top plot: The distribution for the number of M&Ms in a bag (over 170 bags). Middle plot: One sample of size 5. Lower plot: The average of that sample (sample mean). Observe how the sample mean is different for the two samples.

6 Sample 1

7 Sample 2

8 Sampling variability. As illustrated in the previous example, each time we take a sample from a population we are likely to get a different set of individuals and calculate a different value for our statistic (such as the sample mean). This is called sampling variability. This might suggest that the sample and the statistic contain no information about the population. However, the good news is that if we imagine taking lots of random samples of the same size from a given population, the variation from sample to sample (the sampling distribution) will follow a predictable pattern. All of statistical inference is based on this idea: to see how trustworthy a statistic is, we ask what would happen if we kept repeating the sampling many times.
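The repeated-sampling thought experiment can be tried on a computer. The following is only a minimal sketch: the population is a hypothetical set of 170 made-up counts between 5 and 21 (not the actual M&M data), and the names `population` and `draw_sample_mean` are illustrative assumptions.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical stand-in population: 170 made-up bag counts (NOT the real M&M data).
population = [rng.randint(5, 21) for _ in range(170)]
population_mean = statistics.mean(population)

def draw_sample_mean(sample_size):
    """Draw one simple random sample (without replacement) and return its mean."""
    return statistics.mean(rng.sample(population, sample_size))

# Two samples of size 5 will usually give two different sample means.
first_mean = draw_sample_mean(5)
second_mean = draw_sample_mean(5)

# Repeating the sampling many times shows the predictable pattern:
# the sample means pile up around the population mean.
many_means = [draw_sample_mean(5) for _ in range(10_000)]
average_of_means = statistics.mean(many_means)

print(f"population mean         : {population_mean:.2f}")
print(f"two sample means        : {first_mean:.2f}, {second_mean:.2f}")
print(f"average of 10,000 means : {average_of_means:.2f}")
```

Individual sample means move around from draw to draw, but their long-run average sits very close to the population mean.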

9 We measure the quality of a statistic (such as the sample mean) with two criteria. Accuracy (bias): Random samples provide accurate estimates of a parameter because they are unbiased (or close to unbiased, depending on the random sampling method). Accuracy is achieved by sampling in a good way (i.e. randomly sampling over the population of interest) and by using a well-constructed statistic. Typically we will assume an estimator is unbiased. When reading an article, identify the population of interest and the potential biases which may arise. Reliability (variability): A reliable estimation method is one that would give similar results if the random sampling were repeated. The less variable a statistic, the more reliable it is. Random sampling enables us to measure the variability of a statistic. We do this with the standard error; in the next slide we define what this means. Important: The larger the sample size, the less variable the corresponding estimator will be. To understand the above concepts look at the question at the end of this page:

10 Measuring Variability. We have come across variability before. Recall in Chapter 3 we used the standard deviation to measure the variability in the sample. The sample standard deviation measures the typical deviation of each observation from the sample mean: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}$. The same criterion is used to measure the variability in the sample mean (and all other estimators). This is called the standard error. More precisely, we measure the average spread of the estimator around the true mean. Looking back at the M&M examples, it would appear that we have to calculate the sample mean for every possible sample! This is impossible. Remarkably, we can find a very nice expression for the standard error which requires very little effort!

11 Population size does not matter. There are about 15 million students in higher education. In the harassment survey about 2000 people were randomly surveyed. This means that the sexual harassment survey interviewed about one in every 7500 students. 62% is an estimate of the true population proportion. Question: Would the estimate of the proportion be better if the population size were smaller? For example, 1.5 million students rather than 15 million students. Answer: No. Only the size of the sample, in this case n = 2000, has an influence on its reliability, not the size of the population. Statistical inference is not based on how close the sample size is to the population size (usually we assume that the population is infinite). It is based on the idea that a simple random sample is representative of the entire population.
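A small simulation can make this answer concrete. This is a sketch under stated assumptions: we pretend the true population proportion is exactly 0.62 (in reality it is unknown), and the function name `spread_of_estimates` and the number of repetitions are illustrative choices.

```python
import random
import statistics

def spread_of_estimates(population_size, sample_size, true_share, repetitions, seed):
    """Standard deviation of the sample proportion over repeated simple random samples."""
    rng = random.Random(seed)
    # Individuals are numbered 0 .. population_size - 1; the first `cutoff` of them
    # are the ones who have the characteristic of interest.
    cutoff = int(true_share * population_size)
    estimates = []
    for _ in range(repetitions):
        sample = rng.sample(range(population_size), sample_size)  # without replacement
        estimates.append(sum(1 for person in sample if person < cutoff) / sample_size)
    return statistics.stdev(estimates)

# ASSUMPTION: a true share of 0.62 in both populations, purely for illustration.
small_pop = spread_of_estimates(1_500_000, 2000, 0.62, 300, seed=1)
large_pop = spread_of_estimates(15_000_000, 2000, 0.62, 300, seed=2)

print(f"spread of the estimate, 1.5 million students: {small_pop:.4f}")
print(f"spread of the estimate, 15 million students : {large_pop:.4f}")
```

Both spreads come out close to each other (roughly one percentage point), even though one population is ten times larger than the other.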

12 Summary and what's to come. The techniques of statistics allow us to draw inferences or conclusions about a population using the data from a sample. Your estimate of the population parameter is only as good as your sampling design, so work hard to eliminate biases (design your experiment well). Your sample statistic is only an estimate, and if you randomly sampled again you would probably get a somewhat different result (more of this next). In the next section we will show: (1) The distribution of the estimates (for much of the course it will be the sample mean) will, if the sample size is large enough, be approximately normally distributed even if the observations are not normal. (2) The standard error (reliability) has a simple formula!

13 Objectives. 5.1 Sampling distribution of a sample mean (CIS, Chapter 8): The mean and standard deviation of x̄; For normally distributed populations; The central limit theorem (CIS, Chapter 8 and p103). Additional reading: samp_dist_mean.html

14 Simulation tools used. To demonstrate the concepts here I will be using an applet in StatCrunch called Sampling Distributions. It is highly recommended that you try this out yourself: Applets -> Sampling Distributions. Select the distribution (uniform etc.) or choose the data table (your own data). Press Compute. Choose your sample size (this is how large a sample you use). The "times" setting has NOTHING to do with sample size. It is the number of samples you draw (this part is the thought experiment). You should make this as large as possible (I usually set it to 100,000). Press the + sign next to Sampling means to get the QQplot of the distribution of the sample mean. Do not press the + sign next to Samples; this will give you the QQplot of the sample. Conceptually, what we will be doing is rather sophisticated and it will take time to precisely understand the ideas behind inference. This is NOT plug and chug. Note that you can customize the (parent) distribution from which you sample by left-clicking over the parent distribution and dragging the cursor to give the distribution the shape you want.

15 M&M example. Look first at the distribution of the total number of M&Ms in a bag. We will treat this as our "population". Just comparing the histogram with the normal curve we can see that it is not normal. There are two reasons for this: a) The mix of different types of M&Ms (milk chocolate, peanut and peanut butter) induces multimodality in the distribution. b) The number of M&Ms is a numerical discrete random variable. In the following examples we will be drawing M&M bags (numbers) from this distribution. It is analogous to putting all 170 counts in a bag and drawing them out (with replacement). We see that we are most likely to draw the number 18 and least likely to draw 14 (within the range 5-21).

16 Distribution of average: sample 5 Let us now look at the distribution of the sample mean of all samples of size 5. That is we randomly sample 5 values from the population, and take the sample mean.

17 QQplot of average: sample 5. Let us now look at the QQplot of the sample mean of all samples of size 5 (corresponding to the histogram on the previous page). Observations: 1. The histogram of the sample mean is more bell-shaped than the original distribution. However, it is certainly not normal (the spikes we see are due to taking the average of only 5 numbers, which is not continuous enough). 2. There is less spread in the distribution of the averages than in the original histogram. 3. The QQplot shows a large deviation from normality in the tails.

19 QQplot of average: sample 10. Let us now look at the QQplot of the sample mean of all samples of size 10 (corresponding to the histogram on the previous page). Observations: 1. The histogram of the sample mean is a lot more bell-shaped than the original distribution. The spikes that were seen for sample size 5 have gone (the bumps you see on the histogram are due to binwidth). 2. There is even less spread in the distribution of the averages than in the original histogram. 3. The QQplot shows only a small deviation from normality in the top tail of the distribution.

21 QQplot of average: sample 20 Let us now look at the QQplot of the sample mean of all samples of size 20 (corresponding to the histogram on the previous page) Observations: 1. The histogram of the sample mean is pretty much normal. 2. There is even less spread in the distribution of the averages than the original histogram. 3. The QQplot shows only a very tiny deviation from normality in the tails of the distribution.

23 QQplot of average: sample 40 Let us now look at the QQplot of the sample mean of all samples of size 40 (corresponding to the histogram on the previous page) Observations: 1. The histogram of the sample mean is almost normal. 2. There is even less spread in the distribution of the averages than the original histogram. 3. The QQplot is very close to the x=y line.

24 Summary: Sampling distribution of M&Ms

25 Summary of averages of M&Ms
Sample size, standard error, comment:
original (n = 1): 4.64/√1, Not normal
n = 5: 4.64/√5, More unimodal
n = 10: 4.64/√10, Getting normal
n = 20: 4.64/√20, Mostly there
n = 40: 4.64/√40, Pretty much normal
This example illustrates three major insights: (1) The distributions of the sample means are centered about the true mean. This tells us that the sample mean is not biased. (2) The spread in the sample means decreases as the sample size used to evaluate them increases. The spread/reliability/variability is measured using the standard error, which has the formula σ/√n (in this case σ = 4.64 and n = 5, 10, 20 or 40). (3) The distribution of the sample mean becomes more normal (look at the QQplots) as the sample size grows.
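The first two insights can be checked by simulation. The sketch below is an illustration under assumptions: the population is a hypothetical bimodal set of 170 counts (not the actual M&M data, so its σ is not 4.64), and sampling is done with replacement as described for the applet.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed for reproducibility

# Hypothetical bimodal population of 170 counts (NOT the real M&M data).
population = [rng.choice([6, 7, 8, 17, 18, 18, 19, 20]) for _ in range(170)]
mu = statistics.mean(population)
sigma = statistics.pstdev(population)  # population standard deviation

summary = {}
for n in (5, 10, 20, 40):
    # Thought experiment: many samples of size n, drawn with replacement.
    sample_means = [sum(rng.choices(population, k=n)) / n for _ in range(5_000)]
    summary[n] = {
        "centre": statistics.fmean(sample_means),        # where the sample means are centred
        "simulated_se": statistics.stdev(sample_means),  # spread of the sample means
        "formula_se": sigma / math.sqrt(n),              # the formula sigma / sqrt(n)
    }

print(f"population: mean = {mu:.2f}, sigma = {sigma:.2f}")
for n, row in summary.items():
    print(f"n = {n:2d}: centre = {row['centre']:.2f}, "
          f"simulated s.e. = {row['simulated_se']:.3f}, "
          f"sigma/sqrt(n) = {row['formula_se']:.3f}")
```

For each sample size the centre of the simulated sample means should match the population mean, and the simulated spread should match σ/√n closely.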

26 Properties: Sample mean for normally distributed data. When a variable in a population is normally distributed, the sampling distribution of x̄ for all possible samples of size n is also normally distributed. If the population is Normal(µ, σ) then the sample mean's distribution is Normal(µ, σ/√n). Note that the sample average has less variability than any individual observation. [Figure: population distribution and the narrower sampling distribution of x̄]

27 Properties: Sample mean of non-normally distributed data. Central Limit Theorem: When randomly sampling from any population with mean µ and standard deviation σ, if n is large enough then the sampling distribution of x̄ is approximately normal: x̄ ~ N(µ, σ/√n). [Figure panels: Population with strongly skewed distribution; Sampling distribution of x̄ for n = 2 observations; Sampling distribution of x̄ for n = 10 observations; Sampling distribution of x̄ for n = 25 observations]
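To see the theorem at work without plots, we can track the skewness of the sample means as n grows. This is a sketch, assuming an exponential population as the "strongly skewed" example (the slide does not say which distribution was used); the helper names are illustrative.

```python
import random

rng = random.Random(7)  # fixed seed for reproducibility

def skewness(values):
    """Skewness: third central moment divided by variance ** 1.5 (0 means symmetric)."""
    count = len(values)
    centre = sum(values) / count
    variance = sum((v - centre) ** 2 for v in values) / count
    third_moment = sum((v - centre) ** 3 for v in values) / count
    return third_moment / variance ** 1.5

def sample_mean(n):
    """Mean of n draws from a strongly right-skewed (exponential) population."""
    return sum(rng.expovariate(1.0) for _ in range(n)) / n

skew_by_n = {}
for n in (2, 10, 25):
    means = [sample_mean(n) for _ in range(20_000)]
    skew_by_n[n] = skewness(means)
    print(f"n = {n:2d}: skewness of the sample means = {skew_by_n[n]:.2f}")
```

The skewness shrinks toward 0 (the value for a normal distribution) as the sample size increases, although at n = 25 some right skew is still visible.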

28 Calculation Practice. In 2010 the combined SAT scores had mean 1016 and standard deviation 212. They also had an approximately normal distribution, so the population distribution is Normal(µ = 1016; σ = 212). In Chapter 4, we used the normal distribution to show that the probability of a randomly selected student scoring 1100 or higher is 34.5%. Now, suppose 50 students are randomly selected and their SAT scores averaged. What is the probability that the average is greater than 1100? The sampling distribution of the sample average when n = 50 is Normal(µ = 1016; σ/√n = 212/√50 = 29.98). Using these values, the z-score for 1100 is $z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{1100 - 1016}{29.98} = 2.80$. Looking up the area to the right of 2.80 in Table A, there is only a 0.25% chance that the average of 50 randomly sampled students is more than 1100. In this example we do not use the CLT because the original data is assumed normal.
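The same calculation can be done on a computer instead of with Table A. A minimal sketch using Python's standard-library `statistics.NormalDist`; the variable names are illustrative.

```python
import math
from statistics import NormalDist

mu, sigma, n = 1016, 212, 50

# One randomly selected student: use the population standard deviation sigma.
p_single = 1 - NormalDist(mu, sigma).cdf(1100)

# Average of 50 students: use the standard error sigma / sqrt(n) instead.
standard_error = sigma / math.sqrt(n)
z = (1100 - mu) / standard_error
p_average = 1 - NormalDist(mu, standard_error).cdf(1100)

print(f"standard error           : {standard_error:.2f}")
print(f"z-score for 1100         : {z:.2f}")
print(f"P(one student > 1100)    : {p_single:.3f}")
print(f"P(average of 50 > 1100)  : {p_average:.4f}")
```

The only change between the two probabilities is the denominator: σ for one student, σ/√n for the average.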

29 Calculation Practice. Hypokalemia is diagnosed when blood potassium levels are below 3.5mEq/dl. Let's assume that we know a patient whose measured potassium levels vary daily according to the Normal(µ = 3.8, σ = 0.2) distribution. If only one measurement is made, what is the probability that this patient will be misdiagnosed with hypokalemia? $z = \frac{x - \mu}{\sigma} = \frac{3.5 - 3.8}{0.2} = -1.5$, so $P(z < -1.5) \approx 0.0668$, about 6.7%. Instead, if measurements are taken on 4 separate days, what is the probability of a misdiagnosis (in this case, that the sample mean based on 4 measurements is below 3.5)? $z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{3.5 - 3.8}{0.2/\sqrt{4}} = -3$, so $P(z < -3) \approx 0.0013$, about 0.13%. Note: If the problem is about the sample mean, make sure to standardize (get z) using the standard error for the sample mean.
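A short sketch of this comparison, showing how averaging over more days shrinks the chance of a misdiagnosis. The helper name `misdiagnosis_probability` is an illustrative assumption; the numbers are those of the slide.

```python
import math
from statistics import NormalDist

mu, sigma, cutoff = 3.8, 0.2, 3.5
standard_normal = NormalDist()  # mean 0, standard deviation 1

def misdiagnosis_probability(days):
    """z-score and P(sample mean of `days` measurements falls below the cutoff)."""
    standard_error = sigma / math.sqrt(days)
    z = (cutoff - mu) / standard_error
    return z, standard_normal.cdf(z)

z_one, p_one = misdiagnosis_probability(1)
z_four, p_four = misdiagnosis_probability(4)

print(f"1 measurement : z = {z_one:.2f}, P(misdiagnosis) = {p_one:.4f}")
print(f"4 measurements: z = {z_four:.2f}, P(misdiagnosis) = {p_four:.4f}")
```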

30 Calculation Practice: using the CLT. In Chapter 4 we discussed ACT scores. We argued that because the grades are numerical discrete over a small range, the grade distribution could not be normally distributed. This means we cannot use the normal distribution to calculate probabilities for one randomly selected person. BUT if the sample size is large enough we can use the normal distribution to calculate probabilities for averages. We recall the mean ACT score is 20 with standard deviation 5. Question: 50 students are randomly selected. Calculate the probability their average (sample mean) score will be greater than 18. Answer: The sample mean has the same mean as the original distribution, which we know is 20. The standard error of the sample mean is s.e. = 5/√50 = 0.707. We use this to make the z-transform $z = \frac{18 - 20}{0.707} = -2.83$. Looking this up in the z-tables (or using a computer) we see that the probability is 99.7%. This means there is a very large chance that the sample mean is greater than 18.

31 Calculation Practice. Let us return to the weights of calves at 0.5 weeks. The distribution is below. Looking at the plot, it seems that a normal density (with mean µ, the mean calf weight, and standard deviation 7.7) is a rough approximation of the underlying distribution of calves' weights (see also the QQplot given at the end of Chapter 4). Question (a): Using the normal density, calculate the approximate probability that a calf weighs more than 100 pounds. Answer: Make a z-transform: z = (100 − µ)/7.7 = 1.28. Looking this up in the z-tables we have 90%. Therefore the approximate probability that a calf weighs more than 100 pounds is 10%.

32 Question (b): Let us suppose that the sample mean of 10 calves is taken. Using the normal approximation of the sample mean, what is the probability that the sample mean will be greater than 100 pounds? Answer: The mean of the sample mean is the same as the mean weight of calves, µ. The standard deviation of the sample mean is 7.7/√10 = 2.4. By making the z-transform we have z = (100 − µ)/2.4 = 4.12. Looking 4.12 up in the z-tables, we see that it is in the far upper tail, thus the probability is close to 0%. The sizes of the probabilities calculated in (a) and (b) are compared in the above plots.

33 Of the two probabilities calculated above, which is likely to be closest to the true probability? Both probabilities were calculated using the normal distribution. But this is only an approximation of the true distribution of calf weights and of the sample mean of calf weights. From the histogram two pages back, it appears that the density for the underlying weights of calves is only very approximately normal. Thus it is unlikely that the probability calculated for the weight of one calf is that accurate. On the other hand, the Central Limit Theorem tells us that the distribution of the sample mean gets closer to normal as the sample size grows. The second probability we calculated was based on the average weight of 10 calves. The distribution of the average is likely to be more normal than the distribution of the weight of a single calf. Thus the second probability, based on the average, is more accurate (closer to the true probability).

34 Calculation practice. A farmer wants to use a vehicle to carry 30 week-old calves. The vehicle he plans to use can carry a maximum load of 2760 pounds. He knows the mean weight µ of a calf (in pounds) and that the standard deviation is 7.7. What is the chance the vehicle can carry the calves? We need to turn the total weight into the sample mean. We observe that requiring the total weight of 30 calves to be less than 2760 pounds is the same as requiring the sample mean weight of 30 calves to be less than 2760/30 = 92: $\sum_{i=1}^{30} X_i < 2760 \;\Rightarrow\; \bar{X} = \frac{1}{30}\sum_{i=1}^{30} X_i < 92$. Therefore, we have turned the problem from totals into averages and can apply the CLT to calculate the probability using the normal distribution.

35 Calculation practice (cont). We know from the central limit theorem that the sample mean is close to normally distributed. Thus the distribution of the sample mean is approximately normal with mean µ (the mean calf weight) and standard deviation 7.7/√30 = 1.4. We know that for the vehicle to carry the calves, the sample mean has to be less than 92 pounds. Calculate the z-transform z = (92 − µ)/1.4 = 1.35 and look up the z-tables to get 0.911. Conclusion: There is a 91.1% chance the vehicle can carry the week-old calves. In mathematical symbols: $P\left(\sum_{i=1}^{30} X_i < 2760\right) = P\left(\bar{X} = \frac{1}{30}\sum_{i=1}^{30} X_i < 92\right) = P(Z < 1.35) = 0.911$.
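The step from totals to averages can be checked numerically. This is only a sketch: the mean calf weight is not reproduced in this transcript, so the value 90.1 pounds below is an assumption chosen for illustration (it is roughly what a z-value of 1.35 would imply); all variable names are hypothetical.

```python
import math
from statistics import NormalDist

mean_weight = 90.1   # ASSUMED value for illustration only; not taken from the slides
sd_weight = 7.7
calves = 30
max_load = 2760

# Route 1: work with the sample mean, which must stay below max_load / calves.
mean_limit = max_load / calves                    # 92 pounds per calf
standard_error = sd_weight / math.sqrt(calves)    # standard deviation of the sample mean
p_via_mean = NormalDist(mean_weight, standard_error).cdf(mean_limit)

# Route 2: work with the total directly. For independent weights the total has
# mean calves * mean_weight and standard deviation sd_weight * sqrt(calves).
p_via_total = NormalDist(calves * mean_weight,
                         sd_weight * math.sqrt(calves)).cdf(max_load)

print(f"limit on the sample mean : {mean_limit:.1f} pounds")
print(f"P via the sample mean    : {p_via_mean:.3f}")
print(f"P via the total          : {p_via_total:.3f}")
```

Both routes give the same probability, which is why the problem about totals can be handed over to the sample mean.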

36 How large is a large enough sample size? It depends on the population distribution. More observations are required if the population distribution has a large standard deviation or if it is far from normal in distribution. A sample size of 25 is generally enough to obtain a normal sampling distribution from a population with some skewness or even mild outliers. A sample size of 40 will typically be good enough to overcome some skewness and outliers. More importantly, n should be large enough to make the standard error sufficiently small; then we can get meaningful and precise inferences. We can check this by using the Sampling distribution applet. In many cases, even n = 40 is not large enough to give results reliable enough when there is a lot at stake. This is why clinical trials, political polls and marketing surveys typically observe hundreds or even thousands of individuals.

37 The effect of skewness on the CLT Below we look at the sample mean taken from data with a large right skew

38 The corresponding QQplot of the sample mean. Observations: 1. We see that the standard error is 4.7/√40, which is as it should be. 2. However, the QQplot deviates far from normality in the tails. The distribution of the sample mean still has a slight right skew (look back at the QQplots in Chapter 4). This demonstrates that when data is highly skewed, we need a much larger sample size for the CLT to kick in. 3. Calculations based on normality of the average will not be completely correct.

39 Effect of binary data on the CLT. Binary data arises in several situations: Male or Female, Like or Dislike, or wherever there are two possible outcomes. In this example, we have encoded one outcome as 0 and the other as 1 (it does not really matter which way). We see that the proportion in the "1" category is about 20%; this is what is meant by the mean. This data is discrete and clearly skewed.

40 The corresponding QQplot of the sample mean. Observations: 1. We see that the standard error is 0.405/√50, which is as it should be. 2. However, the QQplot deviates far from normality in the tails. The horizontal lines demonstrate that the average over 50 still takes discrete values (though not integers). We also see a U shape, which shows that the sample mean is still skewed. 3. Calculations based on normality of the average will not be completely correct.

41 Example: Income distribution. Let's consider the very large database of individual incomes from the Bureau of Labor Statistics as our population. Income is strongly right skewed. We take 1000 SRSs of 25 incomes, calculate the sample mean for each, and make a histogram of these 1000 means. We also take 1000 SRSs of 100 incomes, calculate the sample mean for each, and make a histogram of these 1000 means. Which histogram corresponds to samples of size 100? Which to samples of size 25?

42 So many standard deviations! In statistics we talk about different kinds of standard deviations, and it can be hard to keep track of them. s is the standard deviation of a set (sample) of data. It is a statistic we can compute once we have the data. σ is the standard deviation of a population (which is much too big to observe completely). It is a parameter; usually, we will never know its true value. σ/√n is the standard deviation of the values of x̄ from all possible random samples of size n. It refers to the sample mean, not to data. It is also called the standard error of x̄. s/√n is our estimate of σ/√n, since we do not know the value of σ. Example: From a survey of students taking statistics, n = 459 responded to the question "How many Facebook friends do you have?" The sample standard deviation was s = 589.5, so the standard error for the sample mean is s/√n = 589.5/√459 ≈ 27.5. The sample mean x̄ is an estimate for µ, the mean of the population of all students required to take the class, and s is an estimate for the population standard deviation σ.

43 Summary. x̄ is always unbiased for µ, even if the population's distribution is very different from a normal distribution. The standard deviation of x̄, σ/√n, measures the variability due to random sampling. If the population is approximately normal or if the sample size n is large, we can use the normal distribution to compute probabilities for x̄. We just have to remember to use σ/√n, not σ, in the denominator when calculating z. This means we can say something about how close x̄ is likely to be to µ. Generally it is quite likely (95% chance) that x̄ will be within 2 standard errors of µ. Not all variables are normally distributed and large samples are not always attainable. In such circumstances, a statistician should be consulted for proper methods of statistical inference and calculation.
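The "within 2 standard errors" statement can be checked by simulation. A minimal sketch, assuming an exponential population (mean 1, standard deviation 1) and samples of size 40; both choices are illustrative and not taken from the slides.

```python
import math
import random

rng = random.Random(2024)  # fixed seed for reproducibility

mu, sigma = 1.0, 1.0   # exponential population with rate 1: mean 1, standard deviation 1
n = 40                 # sample size
standard_error = sigma / math.sqrt(n)

repetitions = 10_000
hits = 0
for _ in range(repetitions):
    x_bar = sum(rng.expovariate(1.0) for _ in range(n)) / n
    # Count the samples whose mean lands within 2 standard errors of mu.
    if abs(x_bar - mu) <= 2 * standard_error:
        hits += 1

coverage = hits / repetitions
print(f"share of sample means within 2 standard errors of mu: {coverage:.3f}")
```

Even though the population here is far from normal, the share comes out close to 95%, which is the sense in which x̄ is "quite likely" to be within 2 standard errors of µ.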

44 Accompanying problems associated with this Chapter Quiz 5 Quiz 6 Homework 2, Q6. Homework 3.