Outline
1. Confidence Intervals for Proportions
2. Sample Sizes for Proportions
3. Student's t-distribution
4. Confidence Intervals without σ

Confidence Interval for µ (pretending we know σ)

Suppose a population has standard deviation $\sigma$. Taking a sample of $n$ individuals, you obtain a sample mean $\bar{x}$. Then you can be $y$-confident that the true mean $\mu$ is in the interval
\[ \left(\bar{x} - z\frac{\sigma}{\sqrt{n}},\ \bar{x} + z\frac{\sigma}{\sqrt{n}}\right), \]
where $z$ is the number obtained from the z-table (using $y$). That's great for numerical data, but what about categorical data?

Question: Suppose you take a sample of $n$ individuals from a population and find that $x$ of them are successes, so that your sample proportion is $\hat{p} = x/n$. Then $\hat{p}$ is our estimate of the population proportion $p$, but what is (say) a 95% confidence interval for $p$?

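Some correspondences

This...              corresponds to...
$\mu$                $p$ (parameter)
$\bar{x}$            $\hat{p}$ (statistic)
$\sigma/\sqrt{n}$    $\sqrt{pq/n}$ (standard error)

(Recall that $q = 1 - p$.)

Answer: So let's just make those substitutions in our formula! Then $p$ is somewhere in
\[ \left(\hat{p} - z\sqrt{\frac{pq}{n}},\ \hat{p} + z\sqrt{\frac{pq}{n}}\right). \]
But wait! We don't know $p$; that's the whole point! Happily, it's good enough to use $\hat{p}$ for $p$ and $\hat{q} = 1 - \hat{p}$ for $q$:
\[ \left(\hat{p} - z\sqrt{\frac{\hat{p}\hat{q}}{n}},\ \hat{p} + z\sqrt{\frac{\hat{p}\hat{q}}{n}}\right). \]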

Confidence Interval for Proportions

If a sample of size $n$ reveals a sample proportion of $\hat{p}$, then the confidence interval for the population proportion $p$ is
\[ \left(\hat{p} - z\sqrt{\frac{\hat{p}\hat{q}}{n}},\ \hat{p} + z\sqrt{\frac{\hat{p}\hat{q}}{n}}\right), \]
where $\hat{q} = 1 - \hat{p}$ and $z$ is the z-score obtained from the confidence level in the usual way. This is good enough as long as the sample size is fairly large and the population proportion is not too close to 0 or to 1.

Example: Fish

Example: 400 randomly chosen people were asked whether they like fish; 160 said yes. Find a 97% confidence interval for $p$, the proportion of people in the whole population who like fish.

Solution: First, let's see what a 97% confidence interval looks like.

Example: Fish

We need a 97% confidence interval.

[Figure: standard normal curve with middle area 0.97, remaining area 0.03, and left tail of area 0.015 ending at $-2.17$.]

1. Draw the standard normal curve Z.
2. Draw vertical bars and label the middle with 0.97.
3. That means the remaining area is $1 - 0.97 = 0.03$.
4. That means the left tail has area $0.03/2 = 0.015$.
5. The z-table (backwards) tells us the tail ends at $-2.17$.
6. So we need 2.17 standard errors!
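The backwards table lookup in steps 3 to 5 can also be done numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper name `z_for_confidence` is made up for this illustration and is not part of the lecture.

```python
from statistics import NormalDist

def z_for_confidence(confidence):
    """Number of standard errors that capture the middle `confidence` area of Z."""
    left_tail = (1 - confidence) / 2          # step 4: area of the left tail
    # inv_cdf reads the z-table "backwards": it returns the point where the
    # left tail ends, which is negative, so we flip the sign.
    return -NormalDist().inv_cdf(left_tail)

print(round(z_for_confidence(0.97), 2))  # 2.17
```

A printed z-table is rounded to two decimals, so a computed value can differ slightly from the table value in later digits.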

Example: Fish

Solution, cont. For 97% confidence we need 2.17 standard errors. Now $\hat{p} = 160/400 = 0.4$, so $\hat{q} = 0.6$; also, $n = 400$. Thus the standard error is
\[ \sqrt{\frac{\hat{p}\hat{q}}{n}} = \sqrt{\frac{(0.4)(0.6)}{400}} = 0.0245. \]
Hence our confidence interval is
\[ \left(\hat{p} - 2.17\sqrt{\frac{\hat{p}\hat{q}}{n}},\ \hat{p} + 2.17\sqrt{\frac{\hat{p}\hat{q}}{n}}\right) = \bigl(0.4 - 2.17(0.0245),\ 0.4 + 2.17(0.0245)\bigr) = (0.347,\ 0.453). \]
Thus we can be 97% confident that the true proportion of people who like fish is somewhere between 34.7% and 45.3%.
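As a check on the arithmetic, here is a small sketch of the proportion interval in Python. The function name `proportion_ci` is an assumption of this example, and it computes $z$ from the normal distribution instead of reading it from a table.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence):
    """Confidence interval for a population proportion (large-sample formula)."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    z = -NormalDist().inv_cdf((1 - confidence) / 2)   # number of standard errors
    margin = z * sqrt(p_hat * q_hat / n)              # z times the standard error
    return p_hat - margin, p_hat + margin

low, high = proportion_ci(160, 400, 0.97)
print(f"({low:.3f}, {high:.3f})")  # (0.347, 0.453)
```

Like the formula itself, this sketch is only appropriate when the sample is fairly large and the proportion is not too close to 0 or 1.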

Finding a good sample size for proportions

Last time, we saw how to find the sample size you need to get a confidence interval of a certain size. Can we do that for proportions as well?

Example: You want to find the true proportion of red-haired people in North Dakota. How many North Dakota residents should you choose randomly in order to be 96% confident that your conclusions are accurate within 3 percentage points?

Solution: First, let's see what a 96% confidence interval looks like.

Example: Red Hair

We need a 96% confidence interval.

[Figure: standard normal curve with middle area 0.96, remaining area 0.04, and left tail of area 0.02 ending at $-2.05$.]

1. Draw the standard normal curve Z.
2. Draw vertical bars and label the middle with 0.96.
3. That means the remaining area is $1 - 0.96 = 0.04$.
4. That means the left tail has area $0.04/2 = 0.02$.
5. The z-table (backwards) tells us the tail ends at $-2.05$.
6. So we need 2.05 standard errors!

Example: Red Hair

Solution, cont. The accuracy of our 96% confidence interval is thus 2.05 standard errors, or $2.05\sqrt{\hat{p}\hat{q}/n}$. We want that accuracy to be within 0.03, so we want
\[ 0.03 \ge 2.05\sqrt{\frac{\hat{p}\hat{q}}{n}}. \]
Solving for $n$,
\[ 0.03\sqrt{n} \ge 2.05\sqrt{\hat{p}\hat{q}} \]
\[ \sqrt{n} \ge \frac{2.05\sqrt{\hat{p}\hat{q}}}{0.03} \approx 68.33\sqrt{\hat{p}\hat{q}} \]
\[ n \ge \left(\frac{2.05}{0.03}\right)^2 \hat{p}\hat{q} \approx 4669.44\,\hat{p}\hat{q}. \]
We need $n$ to be at least $4669.44\,\hat{p}\hat{q}$. But we don't know $\hat{p}$ and $\hat{q}$ until we do the survey!

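What saves us

[Figure: graph of $p(1 - p)$ for $p$ between 0 and 1, with its peak value 0.25 marked.]

Fortunately, we know $\hat{p}$ is somewhere between 0 and 1. Also, $\hat{p}\hat{q} = \hat{p}(1 - \hat{p})$. If we graph the function $p(1 - p)$, we see that it can't get too large! In fact, the largest it can be is 0.25. That is, $\hat{p}\hat{q} \le 0.25$.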
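The slide reads the bound off the graph. For readers who prefer algebra, the same bound follows from completing the square (this short derivation is an addition, not part of the original slides):

```latex
\[
  p(1 - p) \;=\; p - p^2 \;=\; \frac{1}{4} - \left(p - \frac{1}{2}\right)^2 \;\le\; \frac{1}{4},
\]
% Equality holds exactly when p = 1/2, since the squared term is never negative.
```

So the worst case for sample-size planning is a population split evenly, $p = 1/2$.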

Example: Red Hair

Solution, cont. We found out that we need $n$ to be at least $4669.44\,\hat{p}\hat{q}$. Since the largest $\hat{p}\hat{q}$ can be is 0.25, that means we need
\[ n \ge 4669.44 \times 0.25 = 1167.36. \]
Thus we need at least 1,168 people in our survey in order to be 96% sure that our survey is accurate within three percentage points.

Summary: Finding sample size for proportions

1. Find out what z-score you need for the desired confidence level.
2. Set your desired accuracy equal to $z\sqrt{\hat{p}\hat{q}/n}$.
3. Plug in the z-score you found.
4. Instead of $\hat{p}\hat{q}$, use their maximum value, namely 0.25.
5. Now solve for $n$.
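The five steps above reduce to one formula, $n \ge (z/\text{accuracy})^2 \times 0.25$, rounded up. A minimal sketch, assuming the z-score has already been looked up (the function and parameter names are invented for this example):

```python
from math import ceil

def sample_size_for_proportion(z, margin, pq=0.25):
    """Smallest whole n satisfying z * sqrt(pq / n) <= margin.

    pq defaults to 0.25, the largest value p*(1 - p) can take (step 4).
    """
    return ceil((z / margin) ** 2 * pq)

# Red hair example: z = 2.05 from the table, accuracy of 3 percentage points.
print(sample_size_for_proportion(2.05, 0.03))  # 1168
```

Rounding up matters: 1,167 people would fall just short of the required 1167.36.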

Getting rid of σ

The 95% confidence interval is
\[ \left(\bar{x} - 2\frac{\sigma}{\sqrt{n}},\ \bar{x} + 2\frac{\sigma}{\sqrt{n}}\right). \]
In practice, we know $\bar{x}$ and $n$, but we don't know $\sigma$. What can we do? Our best guess for $\sigma$ is $s$, the sample standard deviation.

Sample Variance and Standard Deviation

Definition: If our sample yields the list of numbers $\{x_1, x_2, \ldots, x_n\}$, then the sample variance is given by
\[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}. \]
The sample standard deviation $s$ is the square root of the sample variance.

Alternate form: An easier version for computing the sample variance is
\[ s^2 = \frac{(x_1)^2 + (x_2)^2 + \cdots + (x_n)^2 - n\bar{x}^2}{n - 1}. \]
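To see that the definition and the alternate form agree, here is a short sketch that computes both and compares them with the standard library's `statistics.variance`. The data list is made up purely for illustration.

```python
from math import isclose
from statistics import variance

def sample_variance_definition(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Alternate form: (sum of squares - n * mean^2) / (n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    return (sum(x * x for x in xs) - n * x_bar ** 2) / (n - 1)

data = [2.0, 4.0, 4.0, 5.0, 7.0]   # hypothetical sample, not from the lecture
print(sample_variance_definition(data), sample_variance_shortcut(data))
print(isclose(sample_variance_definition(data), variance(data)))  # True
```

The shortcut is convenient by hand, but with floating-point numbers it can lose precision when the mean is large relative to the spread, so the definition is the safer choice in code.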

Using s instead of σ

The simplest thing to do would be to use $s$ instead of $\sigma$ in our confidence interval formula. We could try
\[ \left(\bar{x} - z\frac{s}{\sqrt{n}},\ \bar{x} + z\frac{s}{\sqrt{n}}\right). \]
This actually works surprisingly well... some of the time. For the rest of the time, we need another approach, known as Student's t-distribution.

Where it began...

[Figures: William S. Gosset ("Student"); Arthur Guinness Son & Co. Ltd.]

In the early 1900s, Guinness employed Gosset as a statistician to help improve their beer. Brewing is a long, expensive process, and Gosset often had only a few batches of beer in his samples. Gosset found that using $s/\sqrt{n}$ worked well when he had a large $n$, but when $n$ was small, it was producing confidence intervals that were too narrow.

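Why it goes wrong

Before, when we constructed the 95% confidence interval for $\mu$, we got error bars of $2\sigma/\sqrt{n}$ from the uncertainty of where $\mu$ was.

[Figure: interval from $\bar{x} - 2\sigma/\sqrt{n}$ to $\bar{x} + 2\sigma/\sqrt{n}$, centered at $\bar{x}$, labeled "uncertainty from unknown µ".]

But if we don't know $\sigma$, then that just adds to our uncertainty!

[Figure: interval from $\bar{x} - 2s/\sqrt{n}$ to $\bar{x} + 2s/\sqrt{n}$, centered at $\bar{x}$, labeled "uncertainty from unknown µ" plus "extra uncertainty from unknown σ".]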

Gosset's observation

If we use $s$ instead of $\sigma$, then we're more uncertain. Therefore we need more multiples of $s/\sqrt{n}$ than we would need of $\sigma/\sqrt{n}$. We got the number of standard errors to use from the Z-distribution. So that's the wrong distribution to use!

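[Figure: bell curve with the area within 2 standard errors marked; labeled 95% and then 88%.]

If we were using $\sigma$, within 2 standard errors we would have 95% confidence. Because we're working with $s$ instead of $\sigma$, we have less confidence! So we need a flatter distribution than Z!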
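The drop in confidence can be seen in a small simulation. The sketch below draws many samples from a normal population and counts how often the interval "within 2 standard errors" captures $\mu$, once with the true $\sigma$ and once with $s$. The sample size $n = 5$, the trial count, and the seed are all choices made for this illustration; the slide does not say which sample size its 88% figure assumes.

```python
import random
from math import sqrt

def coverage(n, trials=20_000, mu=0.0, sigma=1.0, seed=0):
    """Fraction of samples whose 2-standard-error interval contains mu."""
    rng = random.Random(seed)
    hits_sigma = hits_s = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        s = sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))
        error = abs(x_bar - mu)
        if error <= 2 * sigma / sqrt(n):   # interval built with the true sigma
            hits_sigma += 1
        if error <= 2 * s / sqrt(n):       # interval built with the estimate s
            hits_s += 1
    return hits_sigma / trials, hits_s / trials

with_sigma, with_s = coverage(n=5)
print(f"using sigma: {with_sigma:.3f}, using s: {with_s:.3f}")
```

With $n = 5$ the first fraction should land near 0.95 and the second noticeably lower, in the neighborhood of 0.88; rerunning with a larger `n` shows the gap shrinking.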

The Story of Student

Wm. S. Gosset discovered the flatter distribution that gives correct confidence intervals for small sample sizes. Some years earlier, a Guinness employee had published some of the company's brewing secrets, so Guinness prohibited its employees from publishing. Gosset pleaded with Guinness to let him publish math. They finally gave him permission, under one condition.

The t-distribution

[Figure: the z-distribution plotted against the t-distribution, first with 1 degree of freedom and then with 30 degrees of freedom.]

Gosset found the formula for the right distribution for small samples. There's a different distribution for each sample size. If your sample size is $n$, you use the t-distribution with $n - 1$ degrees of freedom. If $n \ge 30$, then the t-distribution is almost exactly the normal curve Z.

Student's Conclusions

To make a confidence interval when we don't know $\sigma$, we replace $\sigma/\sqrt{n}$ with our estimate $s/\sqrt{n}$.
If our sample size $n$ is at least 30, we use the Z-curve just like last time.
If our sample size $n$ is less than 30, we use the t-curve for $n - 1$ degrees of freedom.
So the only change in our procedure is to look up the numbers in a different table!

Finding a y-confidence interval from a small sample

[Figure: Student's t-distribution with middle area $y$, remaining area $1 - y$, and left tail of area $(1 - y)/2$ ending at $t$.]

1. Subtract 1 from the sample size $n$ to get $n - 1$ degrees of freedom.
2. Draw Student's t-distribution with $n - 1$ degrees of freedom.
3. Draw two vertical bars symmetrically on the graph, and label the middle with $y$.
4. That means the remaining area is $1 - y$.
5. That means the left tail has area $(1 - y)/2$.
6. Use the appropriate t-table to learn where that tail ends!
7. Use that many standard errors $s/\sqrt{n}$!

Example: Sugar

Example: Mrs. Smith is worried about her family's health, so she keeps track of how much sugar they use. In five randomly picked weeks, they used the following amounts of sugar (in pounds): 3.8, 4.5, 5.2, 4.0, 5.5. Construct a 94% confidence interval for the true mean $\mu$.

Solution: First we need to find the sample mean $\bar{x}$ and sample standard deviation $s$.
\[ \bar{x} = \frac{3.8 + 4.5 + 5.2 + 4.0 + 5.5}{5} = 4.6. \]
\[ s^2 = \frac{3.8^2 + 4.5^2 + 5.2^2 + 4.0^2 + 5.5^2 - 5 \cdot 4.6^2}{5 - 1} = 0.545, \]
so $s = \sqrt{0.545} = 0.738$. Next, we need to see what a 94% confidence interval looks like for a sample size of $n = 5$.

Example: Sugar

[Figure: Student's t-distribution with 4 degrees of freedom, middle area 0.94, remaining area 0.06, and left tail of area 0.03 ending at $-2.60$.]

1. $n = 5$, so we need $5 - 1 = 4$ degrees of freedom.
2. Draw Student's t-distribution with 4 degrees of freedom.
3. Draw two vertical bars symmetrically on the graph, and label the middle with 0.94.
4. That means the remaining area is 0.06.
5. That means the left tail has area 0.03.
6. The t-table for 4 degrees of freedom says the tail ends at $-2.60$.
7. So we need 2.60 standard errors $s/\sqrt{n}$!

Example: Sugar

Solution, cont. We need 2.60 standard errors; recall that $n = 5$, $\bar{x} = 4.6$, and $s = 0.738$. So the confidence interval for the true mean consumption $\mu$ is
\[ \left(\bar{x} - 2.60\frac{s}{\sqrt{n}},\ \bar{x} + 2.60\frac{s}{\sqrt{n}}\right) = \left(4.6 - 2.60\,\frac{0.738}{\sqrt{5}},\ 4.6 + 2.60\,\frac{0.738}{\sqrt{5}}\right) = (3.741,\ 5.459). \]
Thus Mrs. Smith can be 94% sure that her family averages between 3.741 pounds and 5.459 pounds of sugar per week.
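The whole sugar calculation can be sketched in a few lines of Python. Python's standard library has no t-table, so this sketch takes the value 2.60 from the lookup above as an input; the function name `t_interval` is made up for this example.

```python
from math import sqrt
from statistics import mean, stdev

def t_interval(data, t_value):
    """Confidence interval x_bar +/- t * s / sqrt(n), with t looked up separately."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)                   # sample standard deviation (divides by n - 1)
    margin = t_value * s / sqrt(n)    # t standard errors
    return x_bar - margin, x_bar + margin

sugar = [3.8, 4.5, 5.2, 4.0, 5.5]     # pounds per week, from the example
t_table = 2.60                        # t-table value for 4 degrees of freedom, tail area 0.03

low, high = t_interval(sugar, t_table)
print(f"({low:.3f}, {high:.3f})")
```

The endpoints agree with the interval above to within rounding in the third decimal place; small differences come from how far $t$ and $s$ are rounded before multiplying.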