Weeks 3 & 4: Z Tables and the Sampling Distribution of X̄
The Standard Normal Distribution, or Z distribution, is the distribution of a random variable Z ~ N(0, 1²). The distribution of any other normal random variable, X ~ N(µ, σ²), can be converted to a Z by standardizing: Z = (X − µ)/σ. Probabilities for these variables are areas under the curve, but since we don't use calculus in this course, we use software or a Z table to find probabilities. The random variable Z is continuous, which means the probability at any exact point is always 0. Thus, we will find probabilities for ranges of values.
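As a quick software check on the table method, here is a minimal sketch in Python (assuming SciPy is available; the values of µ, σ, and x are made up for illustration):

```python
from scipy.stats import norm

mu, sigma = 10.0, 2.0   # assumed population mean and standard deviation
x = 12.5                # assumed value of interest

z = (x - mu) / sigma    # standardize X to a Z-score
print(z)                # 1.25

# P(X < x) = P(Z < z); norm.cdf gives the area to the left of a value
print(norm.cdf(z))                       # about 0.8944, matching the Z table
print(norm.cdf(x, loc=mu, scale=sigma))  # same answer without standardizing by hand
```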
First, some general characteristics of the Z distribution. The area under the entire curve is 1, since it represents all possible values. Because the curve is symmetric, the mean equals the median, so the area under the curve to the left of 0 is 0.5 (as is the area to the right). We say, "The probability that Z is less than 0 is 0.5." This is written as P(Z < 0) = 0.5. Again, since Z is continuous, P(Z ≤ 0) = P(Z < 0) = 0.5.
We will use the Z table found on the Stat30X webpage: http://www.stat.tamu.edu/stat30x/zttables.php Notice that the only entry appearing on both pages of the table is z = 0.00, with probability 0.5000. The rows of the table are the z-scores, with the columns indicating the second decimal place. The body of the table contains the probability to the left of any particular z-score z.zz. For example, P(Z < 0.00) = 0.5000 and P(Z < 0.07) = 0.5279.
Examples of reading the table: P(Z < 1.25) = 0.8944 and P(Z < 0.50) = 0.6915.
P(Z < −0.75) = 0.2266 and P(Z < −2.01) = 0.0222.
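These lookups are easy to verify in software; a quick check with SciPy (assumed available):

```python
from scipy.stats import norm

# left-tail probabilities, matching the table lookups above
for z in (1.25, 0.50, -0.75, -2.01):
    print(f"P(Z < {z:5.2f}) = {norm.cdf(z):.4f}")
# P(Z <  1.25) = 0.8944
# P(Z <  0.50) = 0.6915
# P(Z < -0.75) = 0.2266
# P(Z < -2.01) = 0.0222
```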
The Z table only gives probabilities to the left of a value. If we want probabilities to the right, we use the complement rule, P(Z > z) = 1 − P(Z < z). P(Z > 1.25) = 1 − 0.8944 = 0.1056 and P(Z > 0.50) = 1 − 0.6915 = 0.3085.
P(Z > −0.75) = 1 − 0.2266 = 0.7734 and P(Z > −2.01) = 1 − 0.0222 = 0.9778.
To find the probability between two numbers, find the larger area (using the larger value) first and then subtract the smaller area. Remember, a probability can never be negative, so check your work! P(−2.01 < Z < 2.01) = P(Z < 2.01) − P(Z < −2.01) = 0.9778 − 0.0222 = 0.9556.
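A between-two-values probability is just a difference of two left-tail areas; a minimal check in Python (SciPy assumed):

```python
from scipy.stats import norm

lo, hi = -2.01, 2.01
prob = norm.cdf(hi) - norm.cdf(lo)   # larger left-tail area minus the smaller one
print(round(prob, 4))                # 0.9556
```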
Now suppose we have a non-standard normal, X ~ N(µ, σ²), and we want to know the probability that X is less than some value. We must first convert the X to a Z and then use the probabilities from the Z table. Recall that if X ~ N(µ, σ²), then Z = (X − µ)/σ ~ N(0, 1²), so P(X < x) = P(Z < (x − µ)/σ). Beware: P(X > x) is not 1 − P(Z < x) unless X is already standard normal. You must convert x to a z-score before using the table and the complement rule.
Suppose X ~ N(2, 3²). Given a value x, find the corresponding z and then the probability.
Find P(X > 5) = P((X − µ)/σ > (5 − 2)/3) = P(Z > 1) = 1 − P(Z < 1) = 1 − 0.8413 = 0.1587.
Find P(−4 < X < 8) = P((−4 − 2)/3 < (X − µ)/σ < (8 − 2)/3) = P(−2 < Z < 2) = P(Z < 2) − P(Z < −2) = 0.9772 − 0.0228 = 0.9544.
Find P(X < −4 or X > 8) = P(Z < −2 or Z > 2) = P(Z < −2) + P(Z > 2) = 0.0228 + (1 − 0.9772) = 0.0456.
Note: Since the two tail areas are the same size, you could have just doubled the lower tail.
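The same three probabilities computed in Python (SciPy assumed), passing µ and σ directly so the standardization happens internally:

```python
from scipy.stats import norm

mu, sigma = 2, 3   # X ~ N(2, 3^2)

p_gt5  = 1 - norm.cdf(5, loc=mu, scale=sigma)              # P(X > 5)
p_mid  = norm.cdf(8, mu, sigma) - norm.cdf(-4, mu, sigma)  # P(-4 < X < 8)
p_tail = 1 - p_mid                                         # P(X < -4 or X > 8)

print(round(p_gt5, 4), round(p_mid, 4), round(p_tail, 4))
# 0.1587 0.9545 0.0455 (the table gives 0.9544 and 0.0456 because its entries are rounded)
```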
Reverse use of the Z table: finding z-scores given probabilities. Find the z* such that P(Z < z*) = 0.8485, where 0.8485 is some probability. Answer: z* = 1.03. Find z* such that P(Z < z*) = 0.2981. Answer: z* = −0.53. Find z* such that P(Z > z*) = 0.1056 = 1 − P(Z < z*). Answer: P(Z < z*) = 1 − P(Z > z*) = 1 − 0.1056 = 0.8944, so z* = 1.25.
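In software, the reverse lookup is the inverse CDF (quantile function); with SciPy this is norm.ppf:

```python
from scipy.stats import norm

print(round(norm.ppf(0.8485), 2))      #  1.03 : P(Z < z*) = 0.8485
print(round(norm.ppf(0.2981), 2))      # -0.53 : P(Z < z*) = 0.2981
print(round(norm.ppf(1 - 0.1056), 2))  #  1.25 : P(Z > z*) = 0.1056
```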
Finding Centered Probabilities. What if P(−z* < Z < z*) = 0.85, where 0.85 is a central area under the Z curve (if it's not central, we can't do this)? Since the total area under the curve is 1 and 1 − 0.85 = 0.15, there is 0.15 of the area outside −z* and z*. And since the Z curve is centered at 0, half of this area (0.075) is below −z* and the other half is above z*.
This means P(−z* < Z < z*) = 0.85 = P(Z < z*) − P(Z < −z*) = (1 − 0.075) − 0.075. We can now find −z* such that P(Z < −z*) = 0.075, which gives −z* = −1.44, so z* = 1.44. If we call the central area 1 − α (we'll discover why later), then the outside area is α and the area to look up is α/2.
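A quick software check of the central-area calculation (SciPy assumed); for a central area of 1 − α, the tail area on each side is α/2:

```python
from scipy.stats import norm

central = 0.85                    # desired central area, 1 - alpha
alpha = 1 - central               # total area in the two tails
z_star = norm.ppf(1 - alpha / 2)  # upper cut-off; the lower one is -z_star by symmetry

print(round(z_star, 2))                                # 1.44
print(round(norm.cdf(z_star) - norm.cdf(-z_star), 4))  # 0.85, the central area
```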
Standard Normal 5-Number Summary. We know from Chapter 1 that the IQR = Q3 − Q1 covers the middle 50% of a distribution. So what are z_Q1 and z_Q3? P(z_Q1 < Z < z_Q3) = 0.50 = P(Z < z_Q3) − P(Z < z_Q1) = 0.75 − 0.25, or P(Z < z_Q1) = 0.25 and P(Z < z_Q3) = 0.75. Answer: z_Q1 = −0.675 and z_Q3 = 0.675. Adding these numbers to the Empirical Rule values gives easy reference points for the middle 50%, 68%, 95%, and 99.7% of the distribution.
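The same quartiles from the inverse CDF (SciPy assumed):

```python
from scipy.stats import norm

z_q1, z_q3 = norm.ppf(0.25), norm.ppf(0.75)
print(round(z_q1, 3), round(z_q3, 3))   # -0.674 and 0.674, i.e. roughly +/-0.675
```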
Non-standard Normal Example. Suppose the sample proportion, p̂, of 100 students who think that there is insufficient parking is normally distributed with a mean of 0.8 and a standard deviation of 0.04. As long as we know the distribution is normal, and we know µ and σ, we can find any probability! p̂ ~ N(µ_p̂ = 0.8, σ²_p̂ = 0.04²). How often would we get a sample proportion of 0.75 or less? P(p̂ ≤ 0.75) = P((p̂ − µ_p̂)/σ_p̂ ≤ (0.75 − 0.8)/0.04) = P(Z ≤ −1.25) = 0.1056.
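The same calculation in Python (SciPy assumed):

```python
from scipy.stats import norm

mu_phat, sd_phat = 0.80, 0.04   # given mean and standard deviation of the sample proportion
print(round(norm.cdf(0.75, loc=mu_phat, scale=sd_phat), 4))   # 0.1056
```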
Inference. So what good are these probabilities? Recall from the Introduction that an important area of statistics is inference: drawing a conclusion based on data and making decisions based on how likely something is to occur. Since probabilities tell us how often things occur, we can use them to make our decisions. But probabilities come from the whole population, which would mean we need a census, a complete listing of all of the data. We need to be able to make our decisions based on samples, or even one sample.
Inference: Inferential Statistics. General Idea of Inferential Statistics: We take a sample from the whole population. We summarize the sample using important statistics. We use those summaries to make inferences about the whole population. We realize there may be some error involved in making those inferences.
Inference: Inferential Statistics. Example (1988, the Steering Committee of the Physicians' Health Study Research Group). Question: Can aspirin reduce the risk of heart attack in humans? Sample: 22,071 male physicians between the ages of 40 and 84, randomly assigned to one of two groups. One group took an ordinary aspirin tablet every other day (headache or not). The other group took a placebo every other day; this group is the control group. Summary statistic: The rate of heart attacks in the group taking aspirin was only 55% of the rate of heart attacks in the placebo group. Inference to population: Taking aspirin causes a lower rate of heart attacks in humans.
Inference: Basics for sampling. Samples should not be biased: no favoring of any individual in the population. Examples of biased samples: selecting goldfish from a particular store, or polling your neighbors rather than the whole city. The selection of an individual in the population should not affect the selection of the next individual: independence. Example of a non-independent sample: when taking a survey on the cost of a college education, we ask both the mother and the father of the same student. Samples should be large enough to adequately cover the population. Example of a sample that is too small: suppose only 20 physicians were used in the aspirin study.
Inference: Basics for sampling. Samples should have the smallest variability possible. We know that there are many different possible samples, so we want to make sure our statistics are consistent. The larger the sample we use, the less the different sample statistics will vary. Although there are many types of samples, we will only discuss the simplest, a simple random sample (SRS): every sample of a particular size, n, from the population has an equal chance of being selected. An SRS produces an unbiased statistic.
Inference: Sampling Distribution. Since statistics vary from sample to sample, there is a distribution of them, called a sampling distribution, which is the distribution of all of the values taken by the statistic in all possible samples of the same size, n, from the same population. We can then examine the shape, center, and spread of the sampling distribution. We know that there are many statistics we can calculate from a sample, but we're going to start with the sample mean, X̄.
Inference: Bias and Variance. Bias concerns the center of the sampling distribution. A statistic used to estimate a parameter is unbiased if the mean of the sampling distribution is equal to the true value of the parameter being estimated. This says that the mean of the sample means is the same as the mean of the population sampled, µ_X̄ = µ_X. To reduce bias, we use a random sample. Variability is described by the spread of the sampling distribution. To reduce the variability of a statistic, use a larger sample; the larger the sample size, n, the smaller the variance of the statistic. The reason this is true is that the variance of the sample mean gets smaller as the sample size increases: σ²_X̄ = σ²_X/n, or σ_X̄ = σ_X/√n.
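A small simulation illustrating both facts; the population (Exponential with µ = σ = 1) and the sample sizes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
pop_mean, pop_sd = 1.0, 1.0   # an Exponential(1) population has mu = sigma = 1

for n in (10, 100, 1000):
    # draw 20,000 samples of size n and compute each sample's mean
    means = rng.exponential(scale=1.0, size=(20_000, n)).mean(axis=1)
    print(n, round(means.mean(), 3), round(means.std(), 3), round(pop_sd / np.sqrt(n), 3))

# the mean of the sample means stays near mu = 1 (unbiased), while their
# standard deviation shrinks like sigma / sqrt(n): about 0.316, 0.1, 0.032
```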
Inference: Bias and Variance Summary. Population distribution of a random variable: the distribution of all the members of the population. Parameters, for example µ and σ, help describe the distribution. Sampling distribution of a sample statistic: this is not the distribution of the sample! The sampling distribution is the distribution of a statistic. If we take many, many samples and calculate the statistic for each of those samples, the distribution of all those statistics is the sampling distribution. We will start with the sampling distribution of the sample mean, X̄.
Sampling Distribution for Numeric Data: Sampling Distribution of a Sample Mean. We already know that if we take random samples, the sample mean is unbiased, µ_X̄ = µ_X, so we know the center. We can minimize the variance by using a large sample size n, since σ_X̄ = σ_X/√n, so we know the spread. Since the sample mean of a normal random variable is also normal, we know the shape. So, if X is normal, the distribution of the sample mean, or sampling distribution of the sample mean, is X̄_n ~ N(µ, (σ/√n)²), where the subscript on X̄ indicates the sample size.
Sampling Distribution for Numeric Data: Examples of Sample Mean. There has been some concern that young children are spending too much time watching television. A study in Columbia, South Carolina recorded the number of cartoon shows watched per child from 7:00 a.m. to 1:00 p.m. on a particular Saturday morning for 28 different children. The results were as follows: 2, 2, 1, 3, 3, 5, 7, 5, 3, 8, 1, 4, 0, 4, 2, 0, 4, 2, 7, 3, 6, 1, 3, 5, 6, 4, 4, 4. (Adapted from Intro. to Statistics, Milton, McTeer and Corbet, 1997.) Suppose the true average for all of South Carolina is 3.4 with a standard deviation of 2.1, and that the data is normal.
Sampling Distribution for Numeric Data: Examples of Sample Mean. What is the population mean? µ = 3.4. What is the sample mean? x̄ = 99/28 ≈ 3.536. What is the approximate sampling distribution (of the sample mean)? X̄_28 ~ N(3.4, (2.1/√28)²) ≈ N(3.4, 0.4²). Again, what does this mean?
Sampling Distribution for Numeric Data: Examples of Sample Mean. Suppose we take many, many samples (each sample of size 28), and then we find the sample mean for each sample. The sampling distribution of all those means (2.9, 3.4, 4.1, ...) is distributed N(3.4, 0.4²).
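A sketch of that "many, many samples" idea by simulation, assuming the population of cartoon-show counts really is N(3.4, 2.1²):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 3.4, 2.1, 28

# draw 10,000 samples of 28 children each and keep each sample's mean
sample_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)

print(round(sample_means.mean(), 2))   # close to 3.4 = mu
print(round(sample_means.std(), 2))    # close to 2.1 / sqrt(28), about 0.4
```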
Sampling Distribution for Numeric Data: The Central Limit Theorem. What if the original data (the parent population) is not normal? The Central Limit Theorem states that for any population with mean µ and standard deviation σ, the sampling distribution of the sample mean, X̄_n, is approximately normal when n is large: X̄_n ~ N(µ, (σ/√n)²), approximately. The central limit theorem is a very powerful tool in statistics. Remember, the central limit theorem works for any distribution. Let us see how well it works for the years on pennies.
Sampling Distribution for Numeric Data: Example of Central Limit Theorem. [Histogram: Penny Population Distribution, mint years of the 276 pennies.]
Sampling Distribution for Numeric Data: Example of Central Limit Theorem. Note from the previous slide that the distribution is highly left-skewed. The mean of the 276 penny years is 1992.9, and the standard deviation is 8.7. Let us take 50 samples of size 10. According to the Central Limit Theorem, the sampling distribution of the sample means should be approximately normal with mean 1992.9 and standard deviation 8.7/√10 = 2.75.
Sampling Distribution for Numeric Data: Example of Central Limit Theorem. That is, the sampling distribution, the distribution of the x̄'s, should be close to a normal distribution. Suppose we took 50 samples from these pennies and plotted the sample means:
Sampling Distribution for Numeric Data: Example of Central Limit Theorem. The distribution of the means of the 50 samples is shown in the histogram on the previous slide. Notice that the mean of the 50 sample means is close to 1992.9 = µ, and their standard deviation is not far from 2.75 = σ_X̄. The distribution of the 50 sample means is slightly skewed, but closer to the normal distribution than the original data. So n = 10 isn't large enough, and taking larger samples would produce a more normal distribution of sample means. So what is large enough? Theory says at least n = 30, but sometimes more is needed.
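The penny data themselves are not reproduced here, but the same experiment can be sketched with a made-up left-skewed population of "mint years" (an assumption, not the actual 276 pennies):

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical left-skewed population of mint years: most pennies are recent
population = 2000 - rng.exponential(scale=8.0, size=276).round()

mu, sigma = population.mean(), population.std()
for n in (10, 30, 100):
    samples = rng.choice(population, size=(50, n), replace=True)  # 50 samples of size n
    means = samples.mean(axis=1)
    print(n, round(means.mean(), 1), round(means.std(), 2), round(sigma / np.sqrt(n), 2))

# as n grows, the spread of the sample means tracks sigma/sqrt(n), and their
# histogram looks more and more normal even though the population is skewed
```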
Sampling Distribution for Numeric Data: Recap. So in general: The mean of the sample means is the mean of the data, µ_X̄ = µ_X. The standard deviation of the sample means is the standard deviation of the data divided by the square root of the sample size, σ_X̄ = σ_X/√n. If the data is normal, then the distribution of the sample means is exactly normal. But even if the distribution of the data isn't known, we can say the distribution of the sample means is approximately normal as long as we take a large sample.
Sampling Distribution for Numeric Data: Example. Example: Suppose past studies indicate it takes an average of 15 minutes, with a standard deviation of 5 minutes, to memorize a short passage of 100 words. A psychologist claims a new method of memorization will reduce the average time. A random sample of 40 people use the new method, and the average time required to memorize the passage is found to be 12.5 minutes. 12.5 minutes is obviously less than 15, but is it small enough to say that the new method actually reduces the average time, or is it just random chance that produced such a small sample mean? How likely is x̄ ≤ 12.5 if µ = 15? First, X̄ ~ N(15, (5/√40)²) = N(15, 0.79²). P(X̄ < 12.5) = P(Z < (12.5 − 15)/0.79) = P(Z < −3.16) = 0.0008. So, even though 12.5 isn't much different from 15 minutes, an average this small should rarely if ever happen if µ really is 15.
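The final probability, computed directly (SciPy assumed):

```python
import math
from scipy.stats import norm

mu, sigma, n = 15, 5, 40
se = sigma / math.sqrt(n)   # about 0.79, the standard deviation of the sample mean
print(round(norm.cdf(12.5, loc=mu, scale=se), 4))   # 0.0008
```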