
1 Inference for Population Proportions

So far we were interested in estimating and answering questions about population means. In these cases our parameters of interest were either a population mean $\mu$ or a difference between two population means, $\mu_1 - \mu_2$. We will now study the analysis of population proportions, for example:

- the proportion of voters who intend to vote for the incumbent prime minister in the next election
- the proportion of cancer patients who are going to survive at least 5 years after treatment
- the proportion of batteries which last at least 6 hours
- the proportion of students who receive an A in Stat 151

We can also interpret the population proportion $p$ as the probability of the event of interest (when randomly choosing an individual from the population).

2 Estimation of a population proportion p

The sample proportion $\hat{p}$ is defined for a sample by

$$\hat{p} = \frac{x}{n},$$

where $x$ denotes the number of members in the sample that have the specified attribute (the number of successes), and $n$ denotes the sample size.

It seems natural to use the sample proportion for estimating a population proportion. In order to confirm that this is statistically reasonable, we need to study the distribution of $\hat{p}$ (why is this a random variable?). The following is also called the Central Limit Theorem for proportions.

For samples of size $n$:

1. (mean) $\mu_{\hat{p}} = p$
2. (standard deviation) $\sigma_{\hat{p}} = \sqrt{p(1-p)/n}$
3. (shape) If $n$ is large, then $\hat{p}$ is approximately normally distributed.

The first property means that $\hat{p}$ is an unbiased estimator for $p$; the second property means that the larger $n$, the more likely $\hat{p}$ falls close to $p$; and the last property lets us construct confidence intervals and tests (yippee!).

Example: A study showed that the proportion of people in the 20 to 34 age group with an IQ (on the Wechsler Intelligence Scale) of over 120 is about 0.35. Calculate the probability for the event that in a sample of 50 there are more than 20 people with an IQ of at least 120. For this sample $\hat{p} = 20/50 = 0.4$. We will calculate how likely a sample proportion of 0.4 (or larger) is to occur in a sample of size 50 when the true population proportion is 0.35:

$$P(\hat{p} \ge 0.4) = P\left(\frac{\hat{p} - 0.35}{\sqrt{0.35(0.65)/50}} \ge \frac{0.4 - 0.35}{\sqrt{0.35(0.65)/50}}\right) \quad \text{(standardize)}$$

$$= P(Z \ge 0.74) = 1 - P(Z < 0.74) = 1 - 0.7704 = 0.2296 \quad \text{(table II)}$$

We calculated that the probability that more than 20 out of 50 people (between 20 and 34) have an IQ greater than 120 is 0.23. Not that unlikely.

2.1 Large-Sample Confidence Interval for a Population Proportion p

Let $p$ be the probability of an event of interest. We saw before that $\hat{p} = x/n$ is an unbiased estimate for $p$, if $x$ is the number of successes in $n$ trials. Usually $p$ is unknown, and based on a random sample we can calculate a $(1-\alpha)100\%$ confidence interval.

A $(1-\alpha)100\%$ Large Sample Confidence Interval for a Population Proportion $p$:

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$

where $z_{\alpha/2}$ is the $(1-\alpha/2)$ percentile of a standard normal distribution. Since $p$ is unknown, it is estimated using $\hat{p}$. The sample size is considered large when the normal approximation to the binomial distribution is adequate, namely when the number of successes and the number of failures are both at least five.

Proof:

$$P\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \le p \le \hat{p} + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}\right) = P\left(-z_{\alpha/2} \le \frac{p - \hat{p}}{\sqrt{p(1-p)/n}} \le z_{\alpha/2}\right)$$

$$= P\left(\frac{p - \hat{p}}{\sqrt{p(1-p)/n}} \le z_{\alpha/2}\right) - P\left(\frac{p - \hat{p}}{\sqrt{p(1-p)/n}} \le -z_{\alpha/2}\right)$$

$$= \left(1 - \frac{\alpha}{2}\right) - \left(1 - \left(1 - \frac{\alpha}{2}\right)\right) = 1 - \alpha,$$

since $(p - \hat{p})/\sqrt{p(1-p)/n}$ is, according to the Central Limit Theorem, approximately standard normally distributed.

Remark: A confidence interval is calculated when $p$ is unknown, so the boundaries are calculated by replacing $p$ by the unbiased estimator $\hat{p}$. This is only appropriate if $n$ is large, and it results in an approximate confidence interval; that means the probability for the parameter to fall into the interval is approximately $1-\alpha$. So we use:

Let $z_{\alpha/2}$ be the $(1-\alpha/2)$ percentile of the standard normal distribution, and let $np > 5$ and $n(1-p) > 5$. Then

$$\left[\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\ ;\ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right]$$

is an approximate $(1-\alpha)$ confidence interval for $p$.

Example: Consider flipping a coin 1000 times. In only 400 of the experiments HEAD was observed. Is this a surprising number if the coin is unbiased? To answer this question, calculate a 95% confidence interval from this data and check if 0.5 (the probability for HEAD when tossing an unbiased coin) is in the confidence interval.

First check if the conditions are met, using $\hat{p} = 400/1000 = 0.4$ in place of $p$: $n\hat{p} = 1000(0.4) = 400 > 5$ and $n(1-\hat{p}) = 1000(0.6) = 600 > 5$. We conclude that we can apply the Central Limit Theorem and can use the method described above for obtaining a confidence interval.

$$\left[\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\ ;\ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right] = \left[0.4 - 1.96\sqrt{\frac{0.4(0.6)}{1000}}\ ;\ 0.4 + 1.96\sqrt{\frac{0.4(0.6)}{1000}}\right]$$

$$= [0.4 - 0.03\ ;\ 0.4 + 0.03] = [0.37\ ;\ 0.43]$$

We can be 95% confident that the true probability for HEAD is in the interval [0.37; 0.43]. Since 0.5 is not in the interval, it seems unlikely that 0.5 is the true probability for HEAD. Check the coin to see what makes it biased!

2.2 Choosing the Sample Size

The Margin of Error for the estimation of $p$ is

$$E = z_{\alpha/2}\sqrt{p(1-p)/n}.$$

Choosing the sample size for estimating a proportion $p$ follows the same argument as finding the sample size for estimating a mean $\mu$, only that the formula is based on another confidence interval. Assume a probability $p$ shall be estimated within a margin of error of $E$ with a $(1-\alpha)100\%$ confidence interval; then

$$n = \left(\frac{z_{\alpha/2}\sqrt{p(1-p)}}{E}\right)^2.$$

Since $p$ is not known, use a guess, or use $p = 0.5$ as a conservative value in this formula.

Example: A poll shall be conducted to find the proportion of Canadians supporting the Liberal party within a margin of error of 3% ($E = 0.03$). Then, with $z_{0.025} = 1.96$ for 95% confidence,

$$n = \left(\frac{1.96\sqrt{0.5(0.5)}}{0.03}\right)^2 = 1067.1.$$

A sample size of 1068 would be required to meet this goal. (This is why most polls are based on samples of size a little above 1000.)
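The three worked examples of Sections 2 to 2.2 can be retraced numerically. The following is a minimal sketch, assuming Python 3.8 or later and the standard-library `statistics.NormalDist` in place of table II; the function names `upper_tail_prob`, `proportion_ci` and `sample_size` are illustrative and do not come from these notes.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution, stands in for table II


def upper_tail_prob(p_hat, p, n):
    """P(sample proportion >= p_hat) under the normal approximation."""
    z = (p_hat - p) / sqrt(p * (1 - p) / n)  # standardize
    return 1 - Z.cdf(z)


def proportion_ci(x, n, conf=0.95):
    """Approximate large-sample confidence interval for p (p replaced by p_hat)."""
    p_hat = x / n
    z = Z.inv_cdf(1 - (1 - conf) / 2)  # z_{alpha/2}
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def sample_size(E, conf=0.95, p=0.5):
    """Smallest n with margin of error at most E; p=0.5 is the conservative guess."""
    z = Z.inv_cdf(1 - (1 - conf) / 2)
    return ceil((z * sqrt(p * (1 - p)) / E) ** 2)


print(upper_tail_prob(0.4, 0.35, 50))  # IQ example, about 0.23
print(proportion_ci(400, 1000))        # coin example, roughly (0.37, 0.43)
print(sample_size(0.03))               # poll example: 1068
```

Because the code does not round $z$ to two decimals before the lookup, the first result differs from the table-based 0.2296 in the third decimal place.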

2.3 A Large Sample Test Concerning a Proportion p

For developing a test, the facts we know from the CLT again have to be considered. The point estimator for a proportion is the sample proportion $\hat{p}$. From the Central Limit Theorem we know about the sampling distribution of $\hat{p}$ that:

1. $\mu_{\hat{p}} = p$
2. $\sigma_{\hat{p}} = \sqrt{p(1-p)/n}$
3. If $n$ is large, the sampling distribution of $\hat{p}$ is approximately normal.

So we get that

$$z = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}}$$

is approximately standard normally distributed for large sample sizes. Using these properties it can be proved that the following procedure is a statistical test which ensures that the probability of making an error of type I is less than or equal to $\alpha$.

A Large Sample Test concerning a Proportion p

1. Hypotheses:

   Test type     Hypotheses
   Upper tail    $H_0: p \le p_0$ versus $H_a: p > p_0$
   Lower tail    $H_0: p \ge p_0$ versus $H_a: p < p_0$
   Two tail      $H_0: p = p_0$ versus $H_a: p \ne p_0$

   Choose $\alpha$.

2. Assumption: Random sample, and the sample size is large, that is, $n\hat{p} > 5$ and $n(1-\hat{p}) > 5$.

3. Test statistic: Let $p_0$ be a value between zero and one and define the test statistic

   $$z_0 = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$$

4. p-value and Rejection Region:

   Test type     p-value              Rejection Region
   Upper tail    $P(z > z_0)$         $z_0 > z_\alpha$
   Lower tail    $P(z < z_0)$         $z_0 < -z_\alpha$
   Two tail      $2P(z > |z_0|)$      $|z_0| > z_{\alpha/2}$

   where $z_\alpha$ is the $(1-\alpha)$ percentile of the standard normal distribution.

5. Decision: If P-value $\le \alpha$ or $z_0$ falls in the rejection region, then reject $H_0$. If P-value $> \alpha$ or $z_0$ does not fall in the rejection region, then do not reject $H_0$.

6. Context.

Example: Suppose that you want to show that the proportion of adults above 40 who participate in fitness activities is below 0.2.

1. So you want to test (putting what you want to show into the alternative hypothesis $H_a$) $H_0: p \ge 0.2$ vs. $H_a: p < 0.2$ at a significance level of $\alpha = 0.05$.

2. The sample size is $n = 100$ and the number of people sampled who participate in those activities equals 19, so that $\hat{p} = 0.19$, $n\hat{p} = 19 > 5$ and $n(1-\hat{p}) = 81 > 5$, so the assumptions are met (assuming the sample was randomly chosen).

3. Then

   $$z_0 = \frac{0.19 - 0.2}{\sqrt{0.2(0.8)/100}} = \frac{-0.01}{0.04} = -0.25$$

4. Now calculate the P-value. According to the choice of $H_a$ it is a lower tail test, so the P-value is the lower tail probability: P-value $= P(z < -0.25) = 0.4013$ (from table II).

5. Decision: Since $0.4013 =$ P-value $> 0.05 = \alpha$, $H_0$ is not rejected.

6. Context: At a significance level of 5% the sample data do not provide sufficient evidence that less than 20% of adults 40 and older take part in fitness activities.
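The six-step procedure reduces to a few lines of arithmetic. Below is a minimal sketch, assuming Python 3.8 or later and the standard-library `statistics.NormalDist` instead of table II; the function name `one_proportion_z_test` and its `tail` argument are illustrative choices, not part of these notes.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(x, n, p0, tail="two"):
    """Large-sample z test for a proportion; returns (z0, p_value).

    tail is "upper", "lower" or "two", matching the test types in the table.
    """
    p_hat = x / n
    # The standard error uses the hypothesized value p0, not p_hat.
    z0 = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    Z = NormalDist()
    if tail == "upper":
        p_value = 1 - Z.cdf(z0)
    elif tail == "lower":
        p_value = Z.cdf(z0)
    elif tail == "two":
        p_value = 2 * (1 - Z.cdf(abs(z0)))
    else:
        raise ValueError("tail must be 'upper', 'lower' or 'two'")
    return z0, p_value


# Fitness example: 19 of 100 sampled adults, H0: p >= 0.2 vs Ha: p < 0.2
z0, p_value = one_proportion_z_test(19, 100, 0.2, tail="lower")
alpha = 0.05
print(round(z0, 2), round(p_value, 4))
print("reject H0" if p_value <= alpha else "do not reject H0")
```

The sketch leaves the assumption check ($n\hat{p} > 5$ and $n(1-\hat{p}) > 5$) to the caller, as in step 2 of the procedure.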

2.4 Estimating the Difference between Two Population Proportions

Instead of comparing two population means, let's now compare two population proportions. Assume you want to compare

- the rate of people who play computer games in the age groups of 20 to 30 and 30 to 40
- the proportion of defective items manufactured in two production lines

The statistic for estimating the difference in two population proportions that comes to mind is the difference in the sample proportions, $(\hat{p}_1 - \hat{p}_2)$. Let us study the sampling distribution of this statistic to construct a confidence interval.

Properties of the Sampling Distribution of the Difference between two Sample Proportions $(\hat{p}_1 - \hat{p}_2)$

Consider that you have two independent samples of sizes $n_1$ and $n_2$ from binomial populations with parameters $p_1$ and $p_2$, respectively. The sampling distribution of $(\hat{p}_1 - \hat{p}_2)$ has these properties:

1. The mean of $(\hat{p}_1 - \hat{p}_2)$ is

   $$\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2$$

   and the standard error is

   $$SE = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

   which is estimated by

   $$\widehat{SE} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

2. The sampling distribution of $(\hat{p}_1 - \hat{p}_2)$ is approximately normal when the sample sizes $n_1$ and $n_2$ are large, that is, when $n_1 p_1 > 5$ and $n_1(1-p_1) > 5$ and $n_2 p_2 > 5$ and $n_2(1-p_2) > 5$.

These results now lead to the description of the estimation of $(p_1 - p_2)$.

Large Sample Point Estimation of $(p_1 - p_2)$

Point estimate: $(\hat{p}_1 - \hat{p}_2)$

Margin of error:

$$z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

Large Sample $(1-\alpha)100\%$ Confidence Interval for $(p_1 - p_2)$

$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

For this we have to assume again that $n_1$ and $n_2$ are large, that is, $n_1 p_1$, $n_1(1-p_1)$, $n_2 p_2$, $n_2(1-p_2)$ are greater than 5.

When we try to apply the tools described above, we find that $p_1$ and $p_2$, the population proportions, are unknown. In order to use the above procedures, we have to replace the population proportions by their estimates $\hat{p}_1$ and $\hat{p}_2$. So you will estimate the margin of error (at 95% confidence) by

$$\pm 1.96\,\widehat{SE} = \pm 1.96\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

and use the following

Approximate Large Sample $(1-\alpha)100\%$ Confidence Interval for $(p_1 - p_2)$

$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

For this we have to assume again that $n_1$ and $n_2$ are large, that is, $n_1 p_1$, $n_1(1-p_1)$, $n_2 p_2$, $n_2(1-p_2)$ are greater than 5.

Example: Suppose we want to compare two therapies. The criterion for the comparison is the probability of surviving at least 5 years after therapy. The study produced the following data:

                      Population 1    Population 2
   $n$                100             80
   $x$                90              70
   $\hat{p} = x/n$    0.9             0.875

That is, 90 out of 100 patients who underwent therapy 1 survived at least 5 years. If we use $\hat{p}_1$ as estimate for $p_1$ and $\hat{p}_2$ as estimate for $p_2$, we find that $n_1 p_1$, $n_1(1-p_1)$, $n_2 p_2$, $n_2(1-p_2)$ are all greater than 5. So we can use the formula from above for calculating a 95% confidence interval for $p_1 - p_2$:

$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} = 0.025 \pm 1.96\sqrt{\frac{0.9(0.1)}{100} + \frac{0.875(0.125)}{80}} = 0.025 \pm 0.093$$

or $[-0.068\ ;\ 0.118]$.

Since 0 is captured in this interval, we find that this data does not provide evidence that the two therapies result in different probabilities of surviving 5 years. They can be different, but this data does not show it.
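The approximate interval for $p_1 - p_2$ can be computed directly from the four counts. The following is a minimal sketch, assuming Python 3.8 or later and the standard-library `statistics.NormalDist`; the function name `two_proportion_ci` is illustrative. In the usage line, 90 out of 100 is the therapy 1 result stated in the text, while 70 out of 80 for therapy 2 is an assumption chosen to be consistent with $\hat{p}_2 = 0.875$ and the upper bound 0.118.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ci(x1, n1, x2, n2, conf=0.95):
    """Approximate large-sample confidence interval for p1 - p2."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    # Estimated standard error: each population proportion replaced by its estimate.
    se_hat = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # z_{alpha/2}
    diff = p1_hat - p2_hat
    return diff - z * se_hat, diff + z * se_hat


# Therapy example; the counts 70 and 80 for therapy 2 are assumed, see above.
low, high = two_proportion_ci(90, 100, 70, 80)
print(round(low, 3), round(high, 3))
print("0 inside interval:", low <= 0 <= high)
```

Checking whether 0 lies inside the interval mirrors the conclusion drawn in the example: an interval that contains 0 gives no evidence of a difference.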

2.5 Statistical Test for Two Population Proportions p1 and p2

Notation:

                   population proportion    sample size    sample proportion
   population 1    $p_1$                    $n_1$          $\hat{p}_1$
   population 2    $p_2$                    $n_2$          $\hat{p}_2$

Large-Sample z Test for comparing $p_1$ and $p_2$

Hypotheses:

   Test type     Hypotheses
   Upper tail    $H_0: p_1 - p_2 \le 0$ versus $H_a: p_1 - p_2 > 0$
   Lower tail    $H_0: p_1 - p_2 \ge 0$ versus $H_a: p_1 - p_2 < 0$
   Two tail      $H_0: p_1 - p_2 = 0$ versus $H_a: p_1 - p_2 \ne 0$

Assumption: Random samples, and both sample sizes are large: $n_1\hat{p}_1 > 5$, $n_1(1-\hat{p}_1) > 5$, $n_2\hat{p}_2 > 5$, $n_2(1-\hat{p}_2) > 5$.

Test statistic:

$$z_0 = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{\hat{p}_c(1-\hat{p}_c)}{n_1} + \dfrac{\hat{p}_c(1-\hat{p}_c)}{n_2}}}$$

where $\hat{p}_c = (x_1 + x_2)/(n_1 + n_2)$ is the combined sample proportion of both samples.

P-value and Rejection Region:

   Test type     P-value              Rejection Region
   Upper tail    $P(z > z_0)$         $z_0 > z_\alpha$
   Lower tail    $P(z < z_0)$         $z_0 < -z_\alpha$
   Two tail      $2P(z > |z_0|)$      $|z_0| > z_{\alpha/2}$

Decision

Context

Example: Find out whether the proportions of red M&M's in the plain and peanut variety differ at a significance level of 0.05. The sample:

                         Plain (1)    Peanut (2)
   Sample size           56           32
   Number of red M&Ms    12           8

This results in $\hat{p}_1 = 12/56 = 0.2143$ and $\hat{p}_2 = 8/32 = 0.25$ and $\hat{p}_c = (12+8)/(56+32) = 20/88 = 0.227$.

1. The question asks for a test of $H_0: p_1 - p_2 = 0$ vs. $H_a: p_1 - p_2 \ne 0$, with $\alpha = 0.05$.

2. Assumption: Since $\hat{p}_1 n_1$, $(1-\hat{p}_1) n_1$, $\hat{p}_2 n_2$, $(1-\hat{p}_2) n_2$ are all greater than 5, the assumptions are met and the test will deliver a reliable result.

3. Test statistic:

   $$z_0 = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{\hat{p}_c(1-\hat{p}_c)}{n_1} + \dfrac{\hat{p}_c(1-\hat{p}_c)}{n_2}}} = \frac{0.2143 - 0.25}{\sqrt{\dfrac{0.227(0.773)}{56} + \dfrac{0.227(0.773)}{32}}} = \frac{-0.0357}{0.0928} = -0.38$$

4. Rejection region: With $\alpha = 0.05$ the rejection region for a two-tailed test is $|z_0| > z_{\alpha/2} = 1.96$. Or, using the p-value: 2-tailed p-value $= 2P(z > |z_0|) = 2P(z > 0.38) = 2(1 - 0.6480) = 0.7040$.

5. Decision: Since the P-value is not smaller than $\alpha = 0.05$ (equivalently, $|z_0| = 0.38$ does not exceed 1.96), do not reject $H_0$ at a significance level of 0.05.

6. At a significance level of 5% we conclude that we do not have enough evidence that the proportion of red M&M's is different for the plain and peanut variety.
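The M&M calculation can be cross-checked in code. This is a minimal sketch, assuming Python 3.8 or later and the standard-library `statistics.NormalDist`; the function name `two_proportion_z_test` is illustrative, and only the two-tailed p-value is computed. The combined proportion is taken as $(x_1 + x_2)/(n_1 + n_2)$, as in the example's $20/88$.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-tailed large-sample z test of H0: p1 - p2 = 0; returns (z0, p_value)."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_c = (x1 + x2) / (n1 + n2)  # combined sample proportion
    se = sqrt(p_c * (1 - p_c) / n1 + p_c * (1 - p_c) / n2)
    z0 = (p1_hat - p2_hat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z0)))
    return z0, p_value


# Red M&M's: 12 of 56 plain, 8 of 32 peanut
z0, p_value = two_proportion_z_test(12, 56, 8, 32)
z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)  # z_{alpha/2} for alpha = 0.05
print(round(z0, 2), round(p_value, 2))
print("reject H0" if abs(z0) > z_crit else "do not reject H0")
```

The p-value comes out near 0.70 and differs slightly from a table-based value because $z_0$ is not rounded to two decimals before the lookup; the decision is the same either way.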