Hypothesis testing using complex survey data

A Short Course presented by Peter Lynn, University of Essex
in association with the conference of the European Survey Research Association
Prague, 25 June 2007
1. Objective: Simple Hypothesis Tests

Survey data are often used to test hypotheses. Hypotheses of interest are typically complex, involving several variables, for example:

- Differences in pay between men and women in urban areas can be explained by differences in occupation, hours worked and length of time in post

But in this course the examples we will use will be simple hypotheses. The ideas extend to more complex hypotheses.

Consider the following question, which is asked on the European Social Survey (ESS):

Generally speaking, would you say that most people can be trusted, or that you can't be too careful in dealing with people? Please tell me on a score of 0 to 10, where 0 means you can't be too careful and 10 means that most people can be trusted.

    You can't be                                          Most people       (Don't
    too careful                                           can be trusted     know)
        00   01   02   03   04   05   06   07   08   09   10                  88

We might be interested in whether the mean score given in reply to this question (ppltrst) differs between nations. If the mean score is higher in one nation than another, then we might conclude that people in the first nation are more trusting than people in the second nation. The mean scores given in the Czech Republic (CZ), Hungary (HU), Slovenia (SI), France (FR) and Portugal (PT) by ESS round 1 respondents (2002-03) were as follows:

      CZ        FR        HU        PT        SI
    4.2543    4.4673    4.0794    4.0007    3.9768

It would appear that the French are the most trusting amongst these five nations, with the Slovenians the least trusting. But are these differences in means significant? In other words, are we confident that they reflect true differences in means between the respective populations as a whole? To answer this question, we need more information than just the means themselves. To see just what information we need, we must consider sampling theory.
2. Revision of some basic sampling theory

Sampling theory allows us to make statements about the precision of a sample estimate. Essentially, these are statements about how likely it is that a sample estimate falls within a particular distance of the true population value of which it is an estimate. This likelihood, or probability, depends solely on the sample design.

A sample design, D, defines a large set of possible samples that could be selected. For a particular estimator, E (e.g. mean score on the ESS trust question), each of those samples will provide an estimate. The estimates will vary over the samples. The complete set of possible estimates is known as the sampling distribution of estimator E under sample design D.

For most sample designs used in social surveys and for many of the kinds of estimators in which we are typically interested, sampling distributions are approximately normally distributed, meaning that they have a bell shape:

[Figure: histogram of a sampling distribution (frequency against estimate), showing an approximately normal bell shape]

The normal distribution has some useful properties. It is symmetric. And there is a known relationship between the distance from the centre of the distribution (in terms of standard deviations) and the proportion of the area under the curve covered. For example, plus or minus 1.96 standard deviations covers 95% of the area under the curve. In the case of a sampling distribution, this means that 95% of the samples that might be selected under design D will produce an estimate that is within 1.96 standard deviations of the true population value (assuming that the sampling distribution is centred on the true value).

So, to make a precision statement of the form, "there is a 95% chance that the true value is within plus or minus z units of our sample estimate", we need only to be able to estimate the standard deviation of the sampling distribution of the estimator, otherwise known as the standard error of the estimate. This is the extra information that we need in order to assess whether observed differences in means are significant.
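The idea of a sampling distribution can be illustrated by simulation. The following is a minimal Python sketch, not part of the course materials: the population of 0-10 scores is invented, and the population size, sample size and number of draws are arbitrary assumptions. It draws repeated simple random samples, records the sample mean each time, and checks what share of the estimates falls within 1.96 standard deviations of the true mean.

```python
import random
import statistics

def sampling_distribution(population, n, draws, seed=0):
    """Return the sample means from `draws` simple random samples of size n."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(population, n)) for _ in range(draws)]

# Hypothetical population of 0-10 scores (invented for illustration, not ESS data).
pop_rng = random.Random(1)
population = [pop_rng.randint(0, 10) for _ in range(20000)]
true_mean = statistics.fmean(population)

estimates = sampling_distribution(population, n=200, draws=2000)

# The standard deviation of the simulated estimates approximates the standard error.
standard_error = statistics.pstdev(estimates)

# Share of simulated samples whose estimate lies within 1.96 standard errors
# of the true population mean; under approximate normality this is near 0.95.
share_within = sum(
    abs(e - true_mean) <= 1.96 * standard_error for e in estimates
) / len(estimates)

print(f"true mean = {true_mean:.3f}")
print(f"standard error = {standard_error:.3f}")
print(f"share within 1.96 SE = {share_within:.3f}")
```

In practice only one sample is available, so the standard error has to be estimated from that sample rather than from repeated draws.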
Let's consider the case of simple random sampling (SRS). It is a somewhat artificial case as SRS is rarely used in practice. But it is useful, for three reasons:

- The theory is relatively simple, so it is a comfortable place to start;
- SRS provides a standard design which we can use as a benchmark, against which to compare other (more realistic) designs;
- Much data analysis software carries out calculations under the assumption that the data are from a SRS, either by default or as the only option. We should try to understand what our software is doing.

SRS is a sample design where every unit in the study population has an equal, and independent, probability of selection. Note that many of the features often used in practical sample design, such as stratification, clustering and the use of variable sampling fractions, are not permissible within the definition of SRS. Stratification and clustering both cause selection probabilities to be dependent; variable sampling fractions cause selection probabilities to be unequal.

If we select a SRS of n units from a population of N units, the (sampling) variance of the sample mean of a variable y will be:

$$\mathrm{Var}(\bar{y}) = \left(1 - \frac{n}{N}\right)\frac{S^2}{n} \qquad (1)$$

where

$$S^2 = \mathrm{Var}(y) = \frac{\sum_{i=1}^{N}\left(y_i - \bar{Y}\right)^2}{N - 1} \quad \text{and} \quad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i ,$$

with $\bar{Y}$ denoting the population mean of y.

In most data analysis software, if you request the variance of a mean, this is the quantity that will be estimated (by default). In fact, the term $1 - n/N$, known as the finite population correction, will almost certainly be ignored, as the software does not know N, the size of the population. Ignoring this term usually makes no difference as the value of this term is typically very close to 1.0. And $S^2$ will most likely be estimated by its sample analogue, $s^2$. So the estimate provided by the software will be:

$$\widehat{\mathrm{Var}}(\bar{y}) = \frac{s^2}{n} \qquad (2)$$

The standard error is the square root of the variance, so the estimated standard error is simply the square root of the estimated variance as in (2).
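Expressions (1) and (2) can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `srs_standard_error`, the eight scores and the population sizes are all invented, and it is not the calculation routine of any particular package.

```python
import math
import statistics

def srs_standard_error(sample, population_size=None):
    """Estimated standard error of a sample mean under SRS.

    With population_size=None the finite population correction is ignored,
    as in expression (2); otherwise the factor (1 - n/N) from expression (1)
    is applied, with S^2 estimated by the sample variance s^2.
    """
    n = len(sample)
    s2 = statistics.variance(sample)  # sample variance, n - 1 denominator
    fpc = 1.0 if population_size is None else 1.0 - n / population_size
    return math.sqrt(fpc * s2 / n)

# Hypothetical 0-10 trust scores from eight respondents (invented values).
scores = [0, 3, 5, 5, 6, 7, 8, 10]

se_no_fpc = srs_standard_error(scores)
se_small_pop = srs_standard_error(scores, population_size=80)
se_large_pop = srs_standard_error(scores, population_size=8_000_000)

print(f"SE ignoring fpc:        {se_no_fpc:.4f}")
print(f"SE with N = 80:         {se_small_pop:.4f}")
print(f"SE with N = 8,000,000:  {se_large_pop:.4f}")
```

With a population of millions the correction is negligible, which is why ignoring it usually makes no practical difference in national surveys.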
3. Testing Differences in Mean Scores

The estimated standard errors of the mean trust scores (assuming SRS) are:

Nation       |     Mean   Std. Err.
-------------+---------------------
CZ           |   4.2543    .06590
FR           |   4.4673    .05821
HU           |   4.0794    .05838
PT           |   4.0007    .05927
SI           |   3.9768    .06498

So now we can estimate 95% confidence intervals around the means, as these are plus or minus 1.96 standard errors. Our software gives us:

Nation       |     Mean   Std. Err.   [95% Conf. Interval]
-------------+--------------------------------------------
CZ           |   4.2543    .06590      4.1251     4.3834
FR           |   4.4673    .05821      4.3532     4.5814
HU           |   4.0794    .05838      3.9649     4.1938
PT           |   4.0007    .05927      3.8845     4.1168
SI           |   3.9768    .06498      3.8495     4.1042

But how does this help us to assess whether the means are different from one another? Well, if we compare the confidence intervals for CZ and SI we see that they do not overlap at all. So it seems very unlikely that the true values for those two countries are the same. But if we compare, say, CZ and FR we find that the intervals overlap (slightly). So we still cannot be sure whether the difference is significant.

We need to state a formal hypothesis. We usually do this in terms of a null hypothesis, for example:

$$H_0 : \bar{Y}_{CZ} = \bar{Y}_{FR}$$

We then carry out a test to determine whether the data contain evidence to reject the null hypothesis. If the test rejects the null hypothesis then we would say that we have evidence that the means for CZ and FR differ. An appropriate test for a difference in means is a Wald test. We can ask our software to perform this for us:

[ppltrst]CZ - [ppltrst]FR = 0:  F(1, 30970) =  5.87;  Prob > F = 0.0154

So, there appears to be a probability of only 0.0154, or 1.54%, that we would have observed a difference in means at least as large as the one actually observed, if the true means were the same. We might say that at the 0.05 level we would reject the null hypothesis of equal means in CZ and FR. So, the French are more trusting than the Czechs!

The figure below shows the estimated confidence intervals for the mean trust score for all five nations:
[Figure: estimated 95% confidence intervals (lower and upper limits) for the mean trust score in each of the five nations, assuming SRS]

F-test results of comparisons between CZ and each of the other countries are as follows:

[ppltrst]CZ - [ppltrst]FR = 0:  F(1, 30970) =  5.87;  Prob > F = 0.0154
[ppltrst]CZ - [ppltrst]HU = 0:  F(1, 30970) =  3.95;  Prob > F = 0.0470
[ppltrst]CZ - [ppltrst]PT = 0:  F(1, 30970) =  8.19;  Prob > F = 0.0042
[ppltrst]CZ - [ppltrst]SI = 0:  F(1, 30970) =  8.99;  Prob > F = 0.0027

4. Variable Sampling Fractions

However, the estimates presented so far all assume that the sample in each nation is SRS. In fact, the ESS sample design is not SRS in any of these nations (see Lynn et al 2007). HU and SI both selected their ESS round 1 sample from their national population register, enabling them to select persons with equal probabilities. But the other three nations used sample designs in which selection probabilities varied between units (persons). In all three cases the units listed and selected were addresses or households rather than persons. Then, in the field, interviewers would randomly select one person at the address to interview. This results in persons living alone having greater selection probabilities than persons living in 2-person households, etc.

When a sample design involves variable sampling fractions, design weights should be used in order to permit design-unbiased estimation. Design weights simply make each observation contribute to the estimate in inverse proportion to its selection probability. If households were selected with equal probabilities and then one person selected at random at each household then, compared to persons living alone, those living in 2-person households would receive a relative design weight of 2.0, those in 3-person households a weight of 3.0, and so on.

A weighted sample mean (estimate of population mean) would be calculated as follows:
$$\bar{y}_w = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i} \qquad (3)$$

We should take the design weights into account in estimating the mean trust scores. If we ask our software to estimate means using (3), having specified which variable on the data set contains the design weight, $w_i$, we obtain:

Nation       |     Mean   Std. Err.   [95% Conf. Interval]
-------------+--------------------------------------------
CZ           |   4.2889    .06519      4.1611     4.4167
FR           |   4.4759    .05811      4.3620     4.5898
HU           |   4.0794    .05838      3.9649     4.1938
PT           |   4.1638    .06033      4.0456     4.2821
SI           |   3.9768    .06498      3.8495     4.1042

Note that the estimates for both HU and SI are exactly the same as before, but for the other three nations both the estimate of the mean and the width of the confidence intervals have changed. These changes also affect the results of our tests of differences, which are now as follows:

[ppltrst]CZ - [ppltrst]FR = 0:  F(1, 30970) =  4.59;  Prob > F = 0.0322
[ppltrst]CZ - [ppltrst]HU = 0:  F(1, 30970) =  5.73;  Prob > F = 0.0167
[ppltrst]CZ - [ppltrst]PT = 0:  F(1, 30970) =  1.98;  Prob > F = 0.1593
[ppltrst]CZ - [ppltrst]SI = 0:  F(1, 30970) = 11.49;  Prob > F = 0.0007

It seems that by ignoring the design weights, as we did earlier, we were over-estimating the significance of the differences between CZ and both FR and PT, but under-estimating the significance of the difference between CZ and HU. This can be seen in the plot of the estimated confidence intervals, using weighted data:

[Figure: estimated 95% confidence intervals (lower and upper limits) for the mean trust score in each of the five nations, using weighted means]
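The weighted mean in expression (3), combined with the household-size weighting described in section 4, can be sketched as follows. This is a hypothetical illustration in Python rather than the Stata calculation used for the tables: the five respondents and their household sizes are invented, and the weights are simply the number of eligible persons in each household.

```python
def weighted_mean(values, weights):
    """Weighted sample mean: sum(w_i * y_i) / sum(w_i), as in expression (3)."""
    total_weight = sum(weights)
    return sum(w * y for w, y in zip(weights, values)) / total_weight

# Hypothetical respondents: (trust score, eligible persons in the household).
# One person was selected per household, so the relative design weight
# equals the household size (the inverse of the within-household
# selection probability).
respondents = [(2, 1), (4, 1), (5, 2), (6, 2), (8, 3)]

scores = [score for score, _ in respondents]
design_weights = [float(size) for _, size in respondents]

unweighted = sum(scores) / len(scores)
weighted = weighted_mean(scores, design_weights)

print(f"unweighted mean = {unweighted:.4f}")
print(f"weighted mean   = {weighted:.4f}")
```

In this invented example the respondents from larger households happen to give higher scores, so the weighted mean is higher than the unweighted one; with real data the direction of the shift depends on how the variable relates to the weights.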
The intervals for both FR and PT now overlap with that for CZ more than before, while the interval for HU overlaps with that for CZ less than before.

5. Some More Sampling Theory

In fact, design weights affect not only estimates of means but also the variance of those estimates. This can be seen in the expression for the variance of a mean under stratified simple random sampling, as we can think of the weighting classes as strata (compare this with expression (1)):

$$\mathrm{Var}(\bar{y}) = \sum_{h=1}^{H} \left(\frac{N_h}{N}\right)^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} \qquad (4)$$

Note that the design weights are $w_h = \dfrac{N_h / N}{n_h / n}$ and that $\sum_{h=1}^{H} \dfrac{N_h}{N} = \sum_{h=1}^{H} \dfrac{w_h n_h}{n} = 1$, so (if we ignore the finite population corrections), we can rewrite this as:

$$\mathrm{Var}(\bar{y}) = \sum_{h=1}^{H} \frac{w_h^2 \, n_h \, S_h^2}{n^2} \qquad (5)$$

This can be estimated from the survey data provided we know the design weights for each sample unit (the $s_h^2$ provide estimates of $S_h^2$). We can ask our software to estimate standard errors and confidence intervals taking into account the design weights:

Nation       |     Mean   Std. Err.   [95% Conf. Interval]
-------------+--------------------------------------------
CZ           |   4.2889    .07258      4.1466     4.4311
FR           |   4.4759    .06456      4.3494     4.6025
HU           |   4.0794    .05837      3.9650     4.1938
PT           |   4.1638    .08387      3.9995     4.3282
SI           |   3.9768    .06496      3.8495     4.1041
----------------------------------------------------------

Note that the estimates of standard error are now larger than in the previous analysis for the three nations that do not have equal-probability designs. The standard error estimate has increased by a factor of 1.39 for PT, 1.11 for CZ and 1.11 for FR. These factors may be referred to as "mis-specification factors": the factor by which the standard error is underestimated due to mis-specifying the data structure. The mis-specification factor is closely related to, though not identical to, the design factor. The design factor due to the use of variable sampling fractions is the increase in standard errors relative to a SRS.

The tests of differences are now as follows:
[ppltrst]CZ - [ppltrst]FR = 0:  F(1, 30970) =  3.71;  Prob > F = 0.0542
[ppltrst]CZ - [ppltrst]HU = 0:  F(1, 30970) =  5.06;  Prob > F = 0.0245
[ppltrst]CZ - [ppltrst]PT = 0:  F(1, 30970) =  1.27;  Prob > F = 0.2597
[ppltrst]CZ - [ppltrst]SI = 0:  F(1, 30970) = 10.26;  Prob > F = 0.0014

The P-values have increased in all cases. In particular, the P-value for the CZ-FR difference is now larger than 0.05, so we would no longer reject at this level the null hypothesis of equal means in CZ and FR. Remember that the P-value for this comparison was only 0.015 in our initial analysis where we ignored design weights completely. Again, we can see this graphically, as the confidence intervals for CZ and FR clearly overlap more than in the previous analyses:

[Figure: estimated 95% confidence intervals (lower and upper limits) for the mean trust score in each of the five nations, taking design weights into account in variance estimation]

6. Clustering

The use of variable sampling fractions (and hence design weights) is not the only way in which the ESS sample designs differ from SRS. In all five countries, multi-stage samples are selected, resulting in samples that are clustered. This has the potential to affect standard errors of estimates. In general, if clusters are more homogeneous than the overall population, which is often the case, sample clustering will increase the size of standard errors.

The form of the variance of a mean gets complicated if we have both variable sampling fractions and a multi-stage clustered design (see, e.g., StataCorp 2005, p.61), but the approximate effect of a clustered design is to increase the variance by a factor of:

$$\mathrm{Deff}_c(\bar{y}) = 1 + \left(b^* - 1\right)\rho_y \qquad (6)$$

where $b^*$ is a weighted mean cluster sample size and $\rho_y$ is the intra-cluster correlation for y (see Kish 1965, pp.170-171; Lynn & Gabler 2005).
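As a rough numerical illustration of expression (6), the following Python sketch computes the design effect due to clustering and the corresponding inflation of the standard error (the square root of the variance inflation). The mean cluster size of 6 and the intra-cluster correlation of 0.05 are invented values chosen only for illustration; they are not ESS estimates.

```python
import math

def clustering_design_effect(mean_cluster_size, intra_cluster_correlation):
    """Approximate variance inflation due to clustering: 1 + (b* - 1) * rho."""
    return 1.0 + (mean_cluster_size - 1.0) * intra_cluster_correlation

# Hypothetical inputs (assumptions for illustration only).
b_star = 6.0   # weighted mean cluster sample size
rho = 0.05     # intra-cluster correlation for y

deff = clustering_design_effect(b_star, rho)
se_inflation = math.sqrt(deff)  # factor applied to the standard error

print(f"design effect (variance) = {deff:.3f}")
print(f"standard error inflation = {se_inflation:.3f}")
```

Even a small intra-cluster correlation can matter when clusters are large, because the correlation is multiplied by the cluster size minus one.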
If we ask our software to take into account the sample clustering as well as the design weights, we get the following estimates:

Nation       |     Mean   Std. Err.   [95% Conf. Interval]
-------------+--------------------------------------------
CZ           |   4.2889    .08774      4.1169     4.4611
FR           |   4.4759    .07356      4.3319     4.6200
HU           |   4.0794    .08558      3.9116     4.2471
PT           |   4.1638    .12063      3.9273     4.4003
SI           |   3.9768    .08622      3.8078     4.1459
----------------------------------------------------------

[ppltrst]CZ - [ppltrst]FR = 0:  F(1, 30970) =  2.65;  Prob > F = 0.1036
[ppltrst]CZ - [ppltrst]HU = 0:  F(1, 30970) =  3.04;  Prob > F = 0.0812
[ppltrst]CZ - [ppltrst]PT = 0:  F(1, 30970) =  0.73;  Prob > F = 0.3933
[ppltrst]CZ - [ppltrst]SI = 0:  F(1, 30970) =  6.55;  Prob > F = 0.0155

[Figure: estimated 95% confidence intervals (lower and upper limits) for the mean trust score in each of the five nations, taking both design weights and clustering into account]

What we observe is that if we take the relevant features of the sample design into account, the mean for CZ is not significantly different from the mean for FR, HU or PT at the 0.05 level. It is different from the mean for SI at the 0.05 level, but not at the 0.01 level. This contrasts sharply with the results that we obtained with our naïve analysis, assuming SRS. In that case it seemed that all four of the differences were significant at the 0.05 level and two of them at the 0.01 level. Taking the sample design correctly into account alters the conclusions!

Furthermore, we have seen that the differences in the estimates of standard errors are partly due to the effect of variable sampling fractions and partly due to the effect of clustering of the sampling, so it is important to take both these factors into account.
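The tests of differences used throughout can be approximated by hand from two estimated means and their standard errors. The Python sketch below is an approximation under stated assumptions, not a reimplementation of Stata's test command: it treats the two national samples as independent (so the variance of the difference is the sum of the two squared standard errors) and uses the large-sample normal approximation for the P-value. The function name `wald_test_difference` is invented. The inputs are the design-based estimates for the first two nations in the table above (means 4.2889 and 4.4759, standard errors 0.08774 and 0.07356).

```python
import math

def wald_test_difference(mean_a, se_a, mean_b, se_b):
    """Wald test of H0: mean_a = mean_b for two independent estimates.

    Returns (F, p). F is the squared difference divided by the estimated
    variance of the difference; p is the two-sided P-value from the normal
    approximation, 2 * (1 - Phi(sqrt(F))), computed via erfc.
    """
    diff = mean_a - mean_b
    f_stat = diff ** 2 / (se_a ** 2 + se_b ** 2)
    p_value = math.erfc(math.sqrt(f_stat / 2.0))
    return f_stat, p_value

f_stat, p_value = wald_test_difference(4.2889, 0.08774, 4.4759, 0.07356)
print(f"F = {f_stat:.2f}, P = {p_value:.4f}")
```

The result should land close to, though not exactly on, the corresponding software output, since the published means and standard errors are rounded and the software works with the full design information.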
7. Another Example

Another question on the ESS (pplhlp) has a similar structure to the one analysed above, but a different topic:

Do you think that most people would try to take advantage of you if they got the chance, or would they try to be fair?

    Most people would                                     Most people       (Don't
    try to take                                           would try to       know)
    advantage of me                                       be fair
        00   01   02   03   04   05   06   07   08   09   10                  88

If we run equivalent analyses to those presented above, again using ESS round 1 data, we obtain the following results:

7.1: Results assuming SRS

Nation       |     Mean   Std. Err.   [95% Conf. Interval]
-------------+--------------------------------------------
CZ           |   3.9471    .06319      3.8233     4.0710
FR           |   4.4175    .05973      4.3004     4.5346
HU           |   4.1556    .05754      4.0429     4.2684
PT           |   3.7926    .05444      3.6859     3.8993
SI           |   4.2389    .06359      4.1143     4.3635
----------------------------------------------------------

[pplhlp]CZ - [pplhlp]FR = 0:  F(1, 30970) = 29.26;  Prob > F = 0.0000
[pplhlp]CZ - [pplhlp]HU = 0:  F(1, 30970) =  5.95;  Prob > F = 0.0147
[pplhlp]CZ - [pplhlp]PT = 0:  F(1, 30970) =  3.63;  Prob > F = 0.0640
[pplhlp]CZ - [pplhlp]SI = 0:  F(1, 30970) = 10.59;  Prob > F = 0.0011

[Figure: estimated 95% confidence intervals (lower and upper limits) for the mean score in each of the five nations, assuming SRS]
Differences between the mean for CZ and those for FR and SI appear highly significant (P<0.01); the difference with HU appears significant at the 0.05 level (P=0.015) and the difference with PT is almost significant at the 0.05 level (P=0.064).

7.2: Results using weighted means but assuming SRS in variance estimation

Nation       |     Mean   Std. Err.   [95% Conf. Interval]
-------------+--------------------------------------------
CZ           |   3.9491    .06307      3.8255     4.0728
FR           |   4.3740    .06013      4.2561     4.4919
HU           |   4.1556    .05754      4.0429     4.2684
PT           |   3.9059    .05507      3.7980     4.0139
SI           |   4.2389    .06359      4.1143     4.3635
----------------------------------------------------------

[pplhlp]CZ - [pplhlp]FR = 0:  F(1, 30970) = 23.77;  Prob > F = 0.0000
[pplhlp]CZ - [pplhlp]HU = 0:  F(1, 30970) =  5.85;  Prob > F = 0.0156
[pplhlp]CZ - [pplhlp]PT = 0:  F(1, 30970) =  0.27;  Prob > F = 0.6058
[pplhlp]CZ - [pplhlp]SI = 0:  F(1, 30970) = 10.47;  Prob > F = 0.0012

[Figure: estimated 95% confidence intervals (lower and upper limits) for the mean score in each of the five nations, using weighted means]

The main change here is that the weighted mean for PT is higher than the unweighted mean, with the result that the mean for PT no longer appears significantly different from that for CZ.

7.3: Results taking account of weighting, but not clustering, in variance estimation

Nation       |     Mean   Std. Err.   [95% Conf. Interval]
-------------+--------------------------------------------
CZ           |   3.9491    .07049      3.8110     4.0873
FR           |   4.3740    .06777      4.2412     4.5068
HU           |   4.1556    .05753      4.0429     4.2684
PT           |   3.9059    .07521      3.7585     4.0533
SI           |   4.2389    .06357      4.1143     4.3635
----------------------------------------------------------
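[pplhlp]CZ - [pplhlp]FR = 0:  F(1, 30970) = 18.88;  Prob > F = 0.0000
[pplhlp]CZ - [pplhlp]HU = 0:  F(1, 30970) =  5.15;  Prob > F = 0.0232
[pplhlp]CZ - [pplhlp]PT = 0:  F(1, 30970) =  0.18;  Prob > F = 0.6750
[pplhlp]CZ - [pplhlp]SI = 0:  F(1, 30970) =  9.32;  Prob > F = 0.0023

[Figure: estimated 95% confidence intervals (lower and upper limits) for the mean score in each of the five nations, taking design weights into account in variance estimation]

P-values have increased for all four tests, but the differences are unlikely to affect conclusions.

7.4: Results taking account of both weighting and clustering

Nation       |     Mean   Std. Err.   [95% Conf. Interval]
-------------+--------------------------------------------
CZ           |   3.9491    .08587      3.7808     4.1174
FR           |   4.3740    .07069      4.2354     4.5126
HU           |   4.1556    .07490      4.0088     4.3025
PT           |   3.9059    .10910      3.6920     4.1198
SI           |   4.2389    .07537      4.0911     4.3867
----------------------------------------------------------

[Figure: estimated 95% confidence intervals (lower and upper limits) for the mean score in each of the five nations, taking both design weights and clustering into account]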
[pplhlp]CZ - [pplhlp]FR = 0:  F(1, 30970) = 15.38;  Prob > F = 0.0001
[pplhlp]CZ - [pplhlp]HU = 0:  F(1, 30970) =  3.20;  Prob > F = 0.0739
[pplhlp]CZ - [pplhlp]PT = 0:  F(1, 30970) =  0.10;  Prob > F = 0.75
[pplhlp]CZ - [pplhlp]SI = 0:  F(1, 30970) =  6.39;  Prob > F = 0.0115

In this example, the most dramatic impact of mis-specification is to over-state the difference in means between CZ and PT. However, this is mainly caused by failure to apply design weights in estimating the mean: the analysis in section 7.2 already showed no significant difference between CZ and PT, even without taking the design into account. The other noticeable impact of mis-specification is to over-state the evidence of a difference between CZ and HU. This is caused entirely by the failure to estimate the variance of the estimates correctly (P=0.016 in 7.2, cf. P=0.074 in 7.4).

8. A Third Example: Change Between Rounds

Here we are interested in testing whether the mean score changes between rounds 1 (2002-03) and 2 (2004-05) of the ESS. We carry out the estimation in the same four ways as previously, for the variable ppltrst for Luxembourg:

8.1: Results assuming SRS

Round        |     Mean   Std. Err.   [95% Conf. Interval]
-------------+--------------------------------------------
1            |   5.2133    .05871      5.0982     5.3284
2            |   5.0080    .06093      4.8885     5.1274
----------------------------------------------------------

[ppltrst]1 - [ppltrst]2 = 0:  F(1, 30970) =  5.89;  Prob > F = 0.0153

8.2: Results using weighted means but assuming SRS in variance estimation

Round        |     Mean   Std. Err.   [95% Conf. Interval]
-------------+--------------------------------------------
1            |   5.1848    .05846      5.0701     5.2994
2            |   5.0152    .06065      4.8963     5.1342
----------------------------------------------------------

[ppltrst]1 - [ppltrst]2 = 0:  F(1, 30970) =  4.05;  Prob > F = 0.0443

8.3: Results taking account of weighting, but not clustering, in variance estimation

Round        |     Mean   Std. Err.   [95% Conf. Interval]
-------------+--------------------------------------------
1            |   5.1848    .06519      5.0570     5.3126
2            |   5.0152    .07456      4.8691     5.1614
----------------------------------------------------------

[ppltrst]1 - [ppltrst]2 = 0:  F(1, 30970) =  2.93;  Prob > F = 0.0870
The sample design in Luxembourg was unclustered, so there is no need to take into account clustering. In this example, the test of a difference in means, correctly taking into account the sample design, provides no evidence at the 0.05 level of a difference (P=0.087). But ignoring the weights in variance estimation would suggest evidence of a reduction in trusting between ESS rounds 1 and 2 (P=0.044). And additionally ignoring the weights in estimating the means would suggest even stronger evidence of a reduction (P=0.015).

9. Some Comments on Software Implementation

The analyses presented here were carried out in Stata. The commands are quite simple to implement, using the SVY commands to take into account the sample design. It is necessary to have a variable that contains the design weight (dweight) and a variable that indicates the cluster, or primary sampling unit (psunit).

For comparing the mean of ppltrst between the five countries:

    svyset [pw = dweight], psu(psunit)
    svy: mean ppltrst if (set==1 & essround==1), over(ctcode)
    test [ppltrst]4 = [ppltrst]9
    test [ppltrst]4 = [ppltrst]12
    test [ppltrst]4 = [ppltrst]19
    test [ppltrst]4 = [ppltrst]21

For comparing the mean of ppltrst between rounds 1 and 2 for Luxembourg:

    svyset [pw = dweight]
    svy: mean ppltrst if ctcode==15, over(essround)
    test [ppltrst]1 = [ppltrst]2

Similar commands are available in SPSS (in the Advanced Statistics module) and in SUDAAN.

References

Kish L (1965) Survey Sampling. New York: Wiley.

Lynn P & Gabler S (2005) Approximations to b* in the prediction of design effects due to clustering, Survey Methodology 31, 101-104.

Lynn P, Häder S, Gabler S & Laaksonen S (2007) Methods for achieving equivalence of samples in cross-national surveys: the European Social Survey experience, Journal of Official Statistics 23, 107-124.

StataCorp (2005) Stata Survey Data Reference Manual Release 9. Stata Press: College Station, Texas.