5: Itroductio to Estimatio Cotets Acroyms ad symbols... 1 Statistical iferece... Estimatig µ with cofidece... 3 Samplig distributio of the mea... 3 Cofidece Iterval for μ whe σ is kow before had... 4 Sample Size Requiremets for estimatig µ with cofidece... 6 Estimatig p with cofidece... 7 Samplig distributio of the proportio... 7 Cofidece iterval for p... 7 Sample size requiremet for estimatig p with cofidece... 9 Acroyms ad symbols qˆ complemet of the sample proportio x sample mea pˆ sample proportio 1 α cofidece level CI cofidece iterval LCL lower cofidece limit m margi of error sample size NHTS ull hypothesis test of sigificace p biomial success parameter ( populatio proportio ) s sample stadard deviatio SDM samplig distributio of mea (hypothetical probability model) SEM stadard error of the mea SEP stadard error of the proportio UCL upper cofidece limit α alpha level μ expected value ( populatio mea ) σ stadard deviatio parameter Page 5.1 (C:\Users\B. Burt Gerstma\Dropbox\StatPrimer\estimatio.docx, 5/8/016)
Statistical iferece Statistical iferece is the act of geeralizig from the data ( sample ) to a larger pheomeo ( populatio ) with calculated degree of certaity. The act of geeralizig ad derivig statistical judgmets is the process of iferece. [Note: There is a distictio betwee causal iferece ad statistical iferece. Here we cosider oly statistical iferece.] The two commo forms of statistical iferece are: Estimatio Null hypothesis tests of sigificace (NHTS) There are two forms of estimatio: Poit estimatio (maximally likely value for parameter) Iterval estimatio (also called cofidece iterval for parameter) This chapter itroduces estimatio. The followig chapter itroduced NHTS. Both estimatio ad NHTS are used to ifer parameters. A parameter is a statistical costat that describes a feature about a pheomea, populatio, pmf, or pdf. Examples of parameters iclude: Biomial probability of success p (also called the populatio proportio ) Expected value μ (also called the populatio mea ) Stadard deviatio σ (also called the populatio stadard deviatio ) Poit estimates are sigle poits that are used to ifer parameters directly. For example, Sample proportio pˆ ( p hat ) is the poit estimator of p Sample mea x ( x bar ) is the poit estimator of μ Sample stadard deviatio s is the poit estimator of σ Notice the use of differet symbols to distiguish estimators ad parameters. More importatly, poit estimates ad parameters represet fudametally differet thigs. Poit estimates are calculated from the data; parameters are ot. Poit estimates vary from study to study; parameters do ot. Poit estimates are radom variables: parameters are costats. Page 5. (C:\Users\B. Burt Gerstma\Dropbox\StatPrimer\estimatio.docx, 5/8/016)
Estimatig µ with cofidece Samplig distributio of the mea Although poit estimate x is a valuable reflectios of parameter μ, it provides o iformatio about the precisio of the estimate. We ask: How precise is x as estimate of μ? How much ca we expect ay give x to vary from µ? The variability of x as the poit estimate of μ starts by cosiderig a hypothetical distributio called the samplig distributio of a mea (SDM for short). Uderstadig the SDM is difficult because it is based o a thought experimet that does t occur i actuality, beig a hypothetical distributio based o mathematical laws ad probabilities. The SDM imagies what would happe if we took repeated samples of the same size from the same (or similar) populatios doe uder the idetical coditios. From this hypothetical experimet we build a pmf or pdf that is used to determie probabilities for various hypothetical outcomes. Without goig ito too much detail, the SDM reveals that: x is a ubiased estimate of μ; the SDM teds to be ormal (Gaussia) whe the populatio is ormal or whe the sample is adequately large; the stadard deviatio of the SDM is equal to σ. This statistic which is called the stadard error of the mea (SEM) predicts how closely the x s i the SDM are likely to cluster aroud the value of μ ad is a reflectio of the precisio of x as a estimate of μ: SEM= σ Note that this formula is based o σ ad ot o sample stadard deviatio s. Recall that σ is NOT calculated from the data ad is derived from a exteral source. Also ote that the SEM is iversely proportio to the square root of. Numerical example. Suppose a measuremet that has σ = 10. o A sample of = 1 for this variable derives SEM = o A sample of = 4 derives SEM = o A sample of = 16 derives SEM = σ = 10 / 4 = 5 σ = 10 / 16 =.5 σ = 10 / 1 = 10 Each time we quadruple, the SEM is cut i half. This is called the square root law the precisio of the mea is iversely proportioal to the square root of the sample size. Page 5.3 (C:\Users\B. Burt Gerstma\Dropbox\StatPrimer\estimatio.docx, 5/8/016)
Cofidece Iterval for μ whe σ is kow before had To gai further isight ito µ, we surroud the poit estimate with a margi of error: This forms a cofidece iterval (CI). The lower ed of the cofidece iterval is the lower cofidece limit (LCL). The upper ed is the upper cofidece limit (UCL). Note: The margi of error is the plus-or-mius wiggle-room draw aroud the poit estimate; it is equal to half the cofidece iterval legth. Let (1 α)100% represet the cofidece level of a cofidece iterval. The α ( alpha ) level represets the lack of cofidece ad is the chace the researcher is willig to take i ot capturig the value of the parameter. A (1 α)100% CI for μ is give by: x ± ( z 1 )( SEM ) α / The z1-α/ i this formula is the z quatile associatio with a 1 α level of cofidece. The reaso we use z1-α/ istead of z1-α i this formula is because the radom error (imprecisio) is split betwee uderestimates (left tail of the SDM) ad overestimates (right tail of the SDM). The cofidece level 1 α area lies betwee z1 α/ ad z1 α/: Page 5.4 (C:\Users\B. Burt Gerstma\Dropbox\StatPrimer\estimatio.docx, 5/8/016)
You may use the z/t table o the StatPrimer website to determie z quatiles for various levels of cofidece. Here are the commo levels of cofidece ad their associated alpha levels ad z quatiles: (1 α)100% α z1-α/ 90%.10 1.64 95%.05 1.96 99%.01.58 Numerical example, 90% CI for µ. Suppose we have a sample of = 10 with SEM = 4.30 ad x = 9.0. The z quatile for 10% cofidece is z 1.10/ = z.95 = 1.64 ad the 90% CI for μ = 9.0 ± (1.64)(4.30) = 9.0 ± 7.1 = (1.9, 36.1). We use this iferece to address populatio mea μ ad NOT about sample mea x. Note that the margi of error for this estimate is ±7.1. Numerical example, 95% CI for µ. The z quatile for 95% cofidece is z 1.05/ = z.975 = 1.96. The 95% CI for μ = 9.0 ± (1.96)(4.30) = 9.0 ± 8.4 = (0.6, 37.4). Note that the margi of error for this estimate is ±8.4. Numerical example, 99% CI for µ. Usig the same data, α =.01 for 99% cofidece ad the 99% CI for μ = 9.0 ± (.58)(4.30) = 9.0 ± 11.1 = (17.9, 40.1). Note that the margi of error for this estimate is ±11.1. Here are cofidece iterval legths (UCL LCL) of the three itervals just calculated: Cofidece Level Cofidece Iterval Cofidece Iterval Legth 90% (1.9, 36.1) 36.1 1.9 = 14. 95% (0.6, 37.4) 37.4 0.6 = 16.8 99% (17.9, 40.1) 40.1 17.9 =. The cofidece iterval legth grows as the level of cofidece icreases from 90% to 95% to 99%.This is because there is a trade-off betwee the cofidece ad margi of error. You ca achieve a smaller margi of error if you are willig to pay the price of less cofidece. Therefore, as Dr. Evil might say, 95% is pretty stadard. Numerical example. Suppose a populatio has σ = 15 (ot calculated, but kow ahead of time) ad ukow mea μ. We take a radom sample of 10 observatios from this populatio ad observe the followig values: {1, 4, 5, 11, 30, 50, 8, 7, 4, 5}. Based o these 10 observatios, x = 9.0, SEM = 15/ 10 = 4.73 ad a 95% CI for μ = 9.0 ± (1.96)(4.73) = 9.0 ± 9.7 = (19.73, 38.7). Iterpretatio otes: The margi of error (m) is the plus or mius value surroudig the estimate. I this case m = ±9.7. We use these cofidece iterval to address potetial locatios of the populatio mea μ, NOT the sample mea x. Page 5.5 (C:\Users\B. Burt Gerstma\Dropbox\StatPrimer\estimatio.docx, 5/8/016)
Sample Size Requiremets for estimatig µ with cofidece Oe of the questios we ofte faces is How much data should be collected? Collectig too much data is a waste of time ad moey. Also, by collectig fewer data poits we ca devote more time ad eergy ito makig these measuremets accuracy. However, collectig too little data reders our estimate too imprecise to be useful. To address the questio of sample size requiremets, let m represet the desired margi of error of a estimate. This is equivalet to half the ultimate cofidece iterval legth. σ Note that margi of error m = z1 α /. Solvig this equatio for derives, σ = z1 α / m We always roud results from this formula up to the ext iteger to esure that we have a margi of error o greater tha m. Note that to determie the sample size requiremets for estimatig µ with a give level of cofidece requires specificatio of the z quatile based o the desired level of cofidece (z1 α/), populatio stadard deviatio (σ), ad desired margi of error (m). Numerical examples. Suppose we have a variable with stadard deviatio σ = 15 ad wat to estimate µ with 95% cofidece. σ 15 The samples size required to achieve a margi of error of 5 = z1 / =1.96 m 5 36. 15 The samples size required to achieve a margi of error of.5 is =1.96 = 144..5 Agai, doublig the precisio requires quadruplig the sample size. ε = Page 5.6 (C:\Users\B. Burt Gerstma\Dropbox\StatPrimer\estimatio.docx, 5/8/016)
Estimatig p with cofidece Samplig distributio of the proportio Estimatig parameter p is aalogous to estimatig parameter µ. However, istead of usig x as a ubiased poit estimate of μ, we use pˆ as a ubiased estimate of p. The symbol pˆ ( p-hat ) represets the sample proportio: umber of successes i the sample pˆ = For example, if we fid 17 smokers i a SRS of 57 idividuals, pˆ = 17 / 57 = 0.98. We ask, How precise is pˆ as are reflectio of p? How much ca we expect ay give pˆ to vary from p? I samples that are large, the samplig distributio of pˆ is approximately ormal with a mea of pq p ad stadard error of the proportio SEP = where q = 1 p. The SEP quatifies the precisio of the sample proportio as a estimate of parameter p. Cofidece iterval for p This approach should be used oly i samples that are large. a Use this rule to determie if the sample is large eough: if pq 5 proceed with this method. (Call this the pq rule ). A approximate (1 α)100% CI for p is give by ˆ α / p ± ( z 1 )( SEP) where the estimated SEP = pˆ qˆ. Numerical example. A SRS of 57 idividuals reveals 17 smokers. Therefore, pˆ = 17 / 57 = 0.98, qˆ = 1 0.98 = 0.7018 ad pˆq ˆ = (.98)(.7018)(57) = 11.9. Thus, the sample is large pq ˆ ˆ.98.7018 to proceed with the above formula. The estimated SEP = = = 0.06059 ad 57 the 95% CI for p =.98 ± (1.96)(.06059) =.98 ±.1188 = (.1794,.4170). Thus, the populatio prevalece is betwee 18% ad 4% with 95% cofidece. a A more precise formula that ca be used i small samples is provided i a future chapter. Page 5.7 (C:\Users\B. Burt Gerstma\Dropbox\StatPrimer\estimatio.docx, 5/8/016)
Estimatio of a proportio (step-by-step summary) Step 1. Review the research questio ad idetify the parameter. Read the research questio. Verify that we have a sigle sample that addresses a biomial proportio (p). Step. Poit estimate. Calculate the sample proportio ( pˆ ) as the poit estimate of the parameter. Step 3. Cofidece iterval. Determie whether the z (ormal approximatio) formula ca be used with the pq rule. If so, determie the z percetile for the give level of pˆ qˆ cofidece (table) ad the stadard error of the proportio SEP =. Apply the formula p ± ( z 1 )( SEP). ˆ α / Step 4. Iterpret the results. I plai laguage report what proportio ad the variable it address. Report the cofidece iterval; beig clear about what populatio is beig addressed. Reported results should be rouds as appropriate to the reader. Illustratio Of 673 people surveyed, 170 have risk factor X. We wat to determie the populatio prevalece of the risk factor with 95% cofidece. Step 1. Prevalece is the proportio of idividuals with a biary trait. Therefore we wish to estimate parameter p. Step. pˆ = 170 / 673 =.06360 = 6.4%. Step 3. pq ˆ ˆ = 673(.0636)(1.0636) = 159 z method OK. pq ˆ ˆ (.0636)(1.0636) SEP = = =.0047 673 The 95% CI for p = pˆ ± ( z 1 α / )( SEP) = 0.636 ± 1.96.0047 =.0636 ±.0093 = (.0543,.079) = (5.4%, 7.3%) Step 4. The prevalece i the sample was 6.4%. The prevalece i the populatio is betwee 5.4% ad 7.3% with 95% cofidece. Page 5.8 (C:\Users\B. Burt Gerstma\Dropbox\StatPrimer\estimatio.docx, 5/8/016)
Sample size requiremet for estimatig p with cofidece I plaig a study, we wat to collect eough data to estimate p with adequate precisio. Earlier i the chapter we determied the sample size requiremets to estimate µ with cofidece. We apply a similar method to determie the sample size requiremets to estimate p. Let m represet the margi of error. This provides the wiggle room aroud pˆ for our cofidece iterval ad is equal to half the cofidece iterval legth. To achieve margi of error m, = z p * α 1 m q * where p * represet the a educated guess for the proportio ad q * = 1 p *. Whe o reasoable guess of p is available, use p* = 0.50 to provide a worst-case sceario sample size that will provide more tha eough data. Numeric example: We wat to sample a populatio ad calculate a 95% cofidece for the prevalece of smokig. How large a sample is eeded to achieve a margi of error of 0.05 if we assume the prevalece of smokig is roughly 30% * * z 1 1.96 0.30 Solutio: To achieve a margi of error of 0.05, = α p q = m 0.05 Roud this up to 33 to esure adequate precisio. 0.70 = 3.7. How large a sample is eeded to shrik the margi of error to 0.03? 1.96 0.30 0.70 To achieve a margi of error of 0.05, =. = 896.4, so study 897 idividuals. 0.03 Page 5.9 (C:\Users\B. Burt Gerstma\Dropbox\StatPrimer\estimatio.docx, 5/8/016)