Math C067 Samplig Distributios Sample Mea ad Sample Proportio Richard Beigel Some time betwee April 16, 2007 ad April 16, 2007 Examples of Samplig A pollster may try to estimate the proportio of voters i favor of a particular presidetial cadidate by pollig a radom sample of voters. A statisticia may wat to estimate the average startig icome of ew college graduates by fidig the mea of a radom sample of ew college graduates. Numbers Populatio size: N Sample size: (usually much smaller tha N) Samplig with Replacemet Choose 1 perso, Choose a 2d perso (who could be the same as the first), Choose a 3rd perso (who could be the same as the first ad/or secod),..., Choose a th perso Samplig without Replacemet Choose differet people Rules of thumb Our formulas will be correct for samplig with replacemet. Our formulas will be approximately correct for samplig without replacemet provided that the populatio size N is large. 1
Sample Mea Suppose that we are tryig to estimate a radom variable that is based o a large populatio, e.g., we wat to estimate average icome. Radom Variable that we wish to estimate: X Expected value (mea) of X: µ Stadard deviatio of X: σ We defie a radom variable X called the Sample Mea via the followig radom process: Evaluate X at radom poits, i.e., choose radom people ad fid out how much each of them ears Fid the average of those values Formulas The expected value of X is the same as the the expected value of X, i.e., µ X = µ (correct eve if we sample without replacemet). The variace of X is smaller tha the variace of X by a factor of (the sample size), i.e., σ 2 X = 1 σ2 The stadard deviatio of X is smaller tha the stadard deviatio of X by a factor of (the sample size), i.e., σ X = 1 σ 2
Example: Test scores. Oe millio studets take a stadardized test. Their average score is 500. The stadard deviatio of their scores is 100. A statisticia chooses 400 studets at radom with replacemet ad computes the mea X of their test scores. What is the expected value of X? What is the stadard deviatio of X? µ X = µ = 500 σ X = 1 σ = 1 400 100 = 1 20 100 = 5 Note: these formulas do ot deped i ay way o the populatio size N. Example: Test scores. Oe millio studets take a stadardized test. Their average score is 500. The stadard deviatio of their scores is 100. A statisticia chooses a group of 400 differet studets at radom ad computes the mea X of their test scores. What is the expected value of X? What is the stadard deviatio of X? µ X = µ = 500. This formula works with or without replacemet. Because the populatio size is large we ca use the same formula we used with replacemet to estimate the stadard deviatio. σ X 1 σ = 1 400 100 = 1 20 100 = 5 Example: Average icome 10 millio people work i Metropolis. Their average icome is $100,000. The stadard deviatio of their icomes is 2500. A statisticia chooses 100 differet Metropolis workers at radom ad computes their average icome I. Was the sample chose with replacemet or without replacemet? What is the expected value of I? What is the stadard deviatio of I? What is the probability that I is betwee $99,500 ad $100,500. I is just a sample mea, so we ca call it X. (But the problem statemet ca call it aythig. It s up to you to recogize that I is the sample mea.) µ I = µ X = µ = 100000 σ I = σ X 1 σ = 1 2500 = 1 2500 = 250 100 10 The Sample Mea is approximately ormally distributed so we ca use the methods of chapter 6 to estimate Pr[99500 X 100500]. We are lookig at 2 stadard deviatios o each side of the mea so Pr[99500 X 100500] 0.4772 + 0.4772 = 0.9544 Note: 30. You ca approximate a Sample Mea by a Normal Distributio provided that 3
Sample Proportio Suppose that a fractio (proportio) p of a populatio favors cadidate C for presidet. We take a radom sample cosistig of people Let X be a radom variable that deotes the umber of people i the sample who favor cadidate C The the distributio of X is B(, p). Let ˆP be a radom variable that deotes the fractio (proportio) of people i the sample that favor cadidate C The ˆP = 1 X ˆP is called the Sample Proportio E(X) = p, E( ˆP ) = p Var(X) = p(1 p), Var( ˆP ) = p(1 p) σ X = p(1 p), σ ˆP = p(1 p) Example: A Electio Suppose that 55 percet of the voters favor Cadidate C for presidet. If a radom sample of 1000 voters is chose, estimate the probability that betwee 52 ad 58 percet of the voters i the sample will favor Cadidate C. To aswer this questio we ca work with B(, p) ad estimate Pr[520 X 580] or we ca work directly with the Sample Proportio ad ask whether Pr[0.52 ˆP 0.58]. Both radom variables X ad ˆP are approximately ormally distributed. Solve this o your ow ad the compare your aswer to the aswers o the ext page. 4
Solved Example: A Electio Suppose that 55 percet of the voters favor Cadidate C for presidet. If a radom sample of 1000 voters is chose, estimate the probability that betwee 52 ad 58 percet of the voters i the sample will favor Cadidate C. To aswer this questio we ca work with B(, p) ad estimate Pr[520 X 580] or we ca work directly with the Sample Proportio ad ask whether Pr[0.52 ˆP 0.58]. Both radom variables X ad ˆP are approximately ormally distributed. Solve this o your ow ad the compare your aswer to the aswers o the ext page. Solutio usig Biomial Distributio E[X] = p = 1000 0.55 = 550. Var[X] = pq = p(1 p) = 1000 0.55 0.45 = 247.5 σ X = Var[X] = 247.5 15.73 Pr[520 X 580] = Pr[519.5 X 580.5] = Pr[519.5 X 550] + Pr[550 < X 580.5] The first iterval cosists of 550 519.5 1.94 stadard deviatios. The secod iterval cosists 15.73 of 580.5 550 1.94 stadard deviatios. Therefore 15.73 Pr[520 X 580] 0.4738 + 0.4738 = 0.9476 Did you get the correct aswer? If ot, go back ad try agai. Solutio usig Sample Proportio E[ ˆP ] = p = 0.55 Var[ ˆP ] = pq σ ˆP = = p(1 p) = 0.55 0.45 1000 = 0.0002475 Var[ ˆP ] = 0.0002475 0.01573 Pr[0.52 ˆP 0.58] = Pr[0.52 ˆP 0.55] + Pr[0.55 < ˆP 0.58] The first iterval cosists of 0.55 0.52 1.91 stadard deviatios. The secod iterval cosists 0.01573 of 0.58 0.55 1.91 stadard deviatios. Therefore 0.01573 Pr[0.52 ˆP 058] 0.4719 + 0.4719 = 0.9438 (The aswer we got by usig the Biomial Distributio was more accurate because we expaded the iterval by 0.5 i both directios.) Did you get the correct aswer? If ot, go back ad try agai. 5
Solved Example: 1936 Presidetial Electio Suppose that 61% of the voters favor Cadidate R for Presidet. You poll a radom sample cosistig of 3000 voters. What is the probability that a majority of your sample favors Cadidate R? Try this yourself before readig o. 6
Solved Example: 1936 Presidetial Electio Suppose that 61% of the voters favor Cadidate R for Presidet. You poll a radom sample cosistig of 3000 voters. What is the probability that a majority of your sample favors Cadidate R? Solutio usig Biomial Distributio Let X deote the umber of voters i your sample who favor Cadidate R. The distributio of X is B(3000, 0.61). We wish to estimate Pr[X > 1500], which is the same as Pr[X 1501]. E[X] = p = 3000 0.61 = 1830. Var[X] = pq = p(1 p) = 3000 0.61 0.39 = 713.7. σ X = Var[X] = 713.7 26.72 Pr[X 1501] = Pr[X 1500.5] = Pr[1500.5 X] = Pr[1500.5 X 1830] + Pr[1830 < X] The first iterval cosists of approximately 1830 1500.5 26.72 12.33 stadard deviatios. The secod iterval cosists of half of the distributio. Therefore, Solutio usig Sample Proportio E[ ˆP ] = p = 0.61 Var[ ˆP ] = pq σ ˆP = Pr[X 1501] 0.5 + 0.5 = 1.0 = p(1 p) = 0.61 0.39 3000 = 0.0000793 Var[ ˆP ] = 0.0000793 0.008905 Pr[ ˆP > 0.5] = Pr[0.5 < ˆP ] = Pr[0.5 < ˆP 0.61] + Pr[0.61 < ˆP ] The first iterval cosists of approximately 0.61 0.5 12.35 stadard deviatios. The secod 0.008905 iterval cosists of half of the distributio. Therefore, Pr[ ˆP > 0.5] 0.5 + 0.5 = 1.0 7
Gallup s Gambit I a 1935, George Gallup came up with a clever marketig strategy for his opiio polls. Gallup bet that he would correctly predict the results of the 1936 Presidetial Electio. He promised to refud all subscriptio fees for the etire year if his predictio was icorrect. Gallup did predict the Electio results correctly based o a sample of approximately 3000 voters. Reader s Digest icorrectly predicted the result based o a sample of 10,000,000 voters. How could Gallup predict correctly based o such a small sample? Because the actual electio was t really close, eve his small sample provided a comfort level of 12 stadard deviatios. How could Reader s Digest predict icorrectly based o such a large sample? Their sample was ot represetative, but cosisted maily of people who had a car or a telephoe. Lesso: Radomess is much more importat tha sample size. (Accuracy i samplig is also importat, sice some people lie to pollsters. There is a sciece to desigig questioaires.) Some recet electios have bee decided by less tha a 1% margi. It is iterestig to look at how Gallup s poll might have tured out, had the 1936 Electio ot bee a ladslide. Example: What if the Electio had bee close? Suppose that 50.5% of the voters favor Cadidate R for Presidet. You poll a radom sample cosistig of 3000 voters. What is the probability that a majority of your sample favors Cadidate R? Try this oe yourself before readig the solutio. 8
Example: What if the Electio had bee close? Suppose that 50.5% of the voters favor Cadidate R for Presidet. You poll a radom sample cosistig of 3000 voters. What is the probability that a majority of your sample favors Cadidate R? Solutio usig Biomial Distributio Let X deote the umber of voters i your sample who favor Cadidate R. The distributio of X is B(3000, 0.505). We wish to estimate Pr[X > 1500], which is the same as Pr[X 1501]. E[X] = p = 3000 0.505 = 1515. Var[X] = pq = p(1 p) = 3000 0.505 0.495 = 749.925. σ X = Var[X] = 749.925 27.38 Pr[X 1501] = Pr[X 1500.5] = Pr[1500.5 X] = Pr[1500.5 X 1515] + Pr[1515 < X] The first iterval cosists of approximately 1515 1500.5 27.38 0.53 stadard deviatios. The secod iterval cosists of half of the distributio. Therefore, Pr[X 1501] 0.2019 + 0.5 = 0.7019 Thus Gallup s chace of predictig the Electio correctly would have bee about 70%, assumig he used a represetative sample. If his sample was skewed (like Reader s Digest s), the his chace could have bee much smaller. Solutio usig Sample Proportio E[ ˆP ] = p = 0.505 Var[ ˆP ] = pq σ ˆP = = p(1 p) = 0.505 0.495 3000 = 0.000083325 Var[ ˆP ] = 0.000083325 0.009128 Pr[ ˆP > 0.5] = Pr[0.5 < ˆP ] = Pr[0.5 < ˆP 0.505] + Pr[0.505 < ˆP ] The first iterval cosists of approximately 0.505 0.5 0.55 stadard deviatios. The secod 0.009128 iterval cosists of half of the distributio. Therefore, Pr[ ˆP > 0.5] 0.2088 + 0.5 = 0.7088 Thus Gallup s chace of predictig the Electio correctly would have bee about 71%, assumig he used a represetative sample. If his sample was skewed (like Reader s Digest s), the his chace could have bee much smaller. 9
Example: How good was Gallup s Sample? Oly 54% of Gallup s sample favored Roosevelt. Suppose that 61% of the voters favor Cadidate R for Presidet. You poll a radom sample cosistig of 3000 voters. What is the probability that 54% or less your sample favors Cadidate R? Try this yourself before readig o. 10
Solved Example: How good was Gallup s Sample? Suppose that 61% of the voters favor Cadidate R for Presidet. You poll a radom sample cosistig of 3000 voters. What is the probability that 54% or less your sample favors Cadidate R? Solutio usig Biomial Distributio Let X deote the umber of voters i your sample who favor Cadidate R. The distributio of X is B(3000, 0.61). 54% of 3000 is 1620. We wish to estimate Pr[X 1620]. E[X] = p = 3000 0.61 = 1830. Var[X] = pq = p(1 p) = 3000 0.61 0.39 = 713.7. σ X = Var[X] = 713.7 26.72 Pr[X 1620] = Pr[X 1620.5] = Pr[X 1830] Pr[1620.5 < X 1830] The first iterval cosists of half of the distributio. 1830 1620.5 7.84 stadard deviatios. Therefore, 26.72 The secod iterval cosists of Pr[X 1620] 0.5 0.5 = 0.0 This suggests that Gallup s sample was ot close to uiform, i.e., ot represetative of the populatio. Perhaps Gallup just got lucky. They say that Fortue favors the bold. Solutio usig Sample Proportio E[ ˆP ] = p = 0.61 Var[ ˆP ] = pq σ ˆP = = p(1 p) = 0.61 0.39 3000 = 0.0000793 Var[ ˆP ] = 0.0000793 0.008905 Pr[ ˆP 0.54] = Pr[ ˆP 0.61] Pr[0.54 < ˆP 0.61] The first iterval cosists of approximately 0.61 0.54 7.86 stadard deviatios. The secod 0.008905 iterval cosists of half of the distributio. Therefore, Pr[ ˆP 0.54] 0.5 0.5 = 0.0 This suggests that Gallup s sample was ot close to uiform, i.e., ot represetative of the populatio. Perhaps Gallup just got lucky. They say that Fortue favors the bold. 11
Gallup vs. the Tout If the 1936 Electio had bee a close oe (decided by oly a 1% margi), there is substatial probability (30% or more) that Gallup would have predicted it wrog. Was Gallup takig a big risk? Perhaps Gallup would have take a larger sample if it looked like the electio was goig to be close. Or, perhaps Gallup had othig to lose. Suppose that you are gambler. A tout is a perso who sells predictios, usually writte o somethig called a tout sheet. For example, the tout might predict whether the Phillies will beat the Cardials toight. The tout charges $1 for his predictio but he always offers a moey-back guaratee. Key facts about tout sheets: 1. Tout sheets are worthless. Why? I reality, the tout will tell oe customer that Philadelphia will wi ad tell aother customer that Philadelphia will lose. So half of the predictios are guarateed to be correct. (Not like Gallup Polls) 2. The tout ever loses. Whe he predicts correctly, he ears $1. Whe he predicts icorrectly he ears $0. (Like Gallup, although Gallup also exteded his offer to previous customers.) 3. Like most somethig-for-othig schemes, sellig tout sheets is illegal. (Not like Gallup.) 4. Touts thrive o repeat busiess. People who wi are likely to retur to the tout for aother predictio, which is usually more expesive tha the first. (Like Gallup, although I do t thik he jacked up his rates.) 5. A tout s customers are kow i gamblig circles as suckers. (Like Gallup s customers, if they were fooled by the gambit.) 12