How accurate are self-selection web surveys?


Jelke Bethlehem

The views expressed in this paper are those of the author(s) and do not necessarily reflect the policies of Statistics Netherlands.

Discussion paper (08014)

Statistics Netherlands, The Hague/Heerlen, 2008

Publisher: Statistics Netherlands, Henri Faasdreef 312, 2492 JP The Hague. Prepress: Statistics Netherlands, Facility Services. Cover: TelDesign, Rotterdam.

Statistics Netherlands, The Hague/Heerlen, 2008. Reproduction is permitted. Statistics Netherlands must be quoted as source.

Summary: A web survey seems to be an attractive means of collecting survey data, because it provides simple, cheap and fast access to a large group of people. However, there are pitfalls. Due to methodological problems, the quality of the outcomes of web surveys may be seriously affected. This paper addresses one of these problems, and that is self-selection of respondents. Self-selection leads to a lack of representativity and thus to biased estimates. The effect of self-selection on the distributional characteristics of estimators is described in detail. It is shown that the bias of estimators in self-selection surveys can be much larger than in surveys based on traditional probability samples. A simulation study also shows what can go wrong in a self-selection web survey. It is explored whether some correction techniques (adjustment weighting and use of reference surveys) can improve the quality of the outcomes. It turns out that there is no guarantee for success.

Keywords: web survey, online survey, self-selection, bias, representativity, adjustment weighting, reference survey.

1. Trends in data collection

Collecting data using surveys is often a complex, costly and time-consuming process. Not surprisingly, continuous attempts have been made all through the history of survey research to improve timeliness and reduce costs, while at the same time maintaining a high level of data quality.

Developments in information technology in the last decades of the previous century made it possible to use microcomputers for data collection. This led to the introduction of computer-assisted interviewing (CAI). Replacing the paper questionnaire by an electronic one turned out to have many advantages, among which were considerably shorter survey processing times and higher data quality. More on the benefits of CAI can be found in Couper et al. (1998).

The rapid development of the Internet has led to another new type of data collection: Computer Assisted Web Interviewing (CAWI). Such a web survey (also sometimes called online survey) is almost always self-administered: respondents visit a website, and complete the questionnaire by filling in a form on-line. Web surveys have some attractive advantages in terms of costs and timeliness:

- Now that so many people are connected to the Internet, a web survey is a simple means to get access to a large group of potential respondents;
- Questionnaires can be distributed at very low costs. No interviewers are needed, and there are no mailing and printing costs;
- Surveys can be launched very quickly. Little time is lost between the moment the questionnaire is ready and the start of the fieldwork.

Thus, web surveys are a fast, cheap and attractive means of collecting large amounts of data. Not surprisingly, many survey organisations have implemented such surveys. However, the question is whether a web survey is also attractive from a quality point of view, because there are methodological problems. These problems are caused by using the Internet as a selection instrument for respondents.

This paper shows that the quality of web surveys may be seriously affected by these problems, making it difficult, if not impossible, to make proper inference with respect to the target population of the survey. The two main causes of problems are under-coverage and self-selection. This paper focuses on self-selection problems and only briefly touches upon under-coverage. The effects of under-coverage are treated in more detail in Bethlehem (2007).

Some theory about self-selection is developed in this paper. It explores to what extent weighting adjustment techniques can help to solve the problem. Practical implications are shown using data from a fictitious population.

2. Coverage and self-selection

The objective of a survey is always to collect information about a well-defined target population. To that end a sample is selected from this population. The methodology of survey sampling has been developed over a period of more than 100 years. It is based on the fundamental principle of probability sampling. Selecting random samples makes it possible to apply probability theory. Consequently, the accuracy of estimators can be quantified and controlled. The probability sampling principle has been successfully applied in official and academic statistics since the 1940s, and to a lesser extent also in more commercial market research.

At first sight, web surveys have much in common with other types of surveys. It is just another mode of data collection. Questions are not asked face-to-face or by telephone, but over the Internet. What is different for many web surveys, however, is that the principles of probability sampling have not been applied. Samples are not constructed by means of probability sampling but instead rely on self-selection of respondents. This can have a major impact on survey results.

There is also another methodological problem that web surveys share with surveys based on probability samples, and that is under-coverage. This problem occurs when elements in the target population do not appear in the sampling frame. Under-coverage can be a serious problem for web surveys. If the target population consists of all people with an Internet connection, there is no problem. However, usually the target population is wider than that. Then, under-coverage occurs due to the fact that still many people do not have access to the Internet.

Bethlehem (2007) describes the situation in The Netherlands with respect to Internet access. In the period from 1998 to 2006 the percentage of persons with Internet has increased from 16% to 85%. The question is whether this Internet population differs from the complete target population. The answer is yes in The Netherlands. Specific groups are substantially under-represented, like the elderly, the low educated, and the non-native part of the population. The results described above are in line with the findings of authors in other countries. See e.g. Couper (2000), and Dillman and Bowker (2001).

One could argue that this problem may disappear as the Internet penetration increases further. However, this is not evident. Bethlehem (2007) shows that the bias due to under-coverage of the estimator for the population mean of some variable $Y$ is equal to

$$B(\bar{y}_I) = E(\bar{y}_I) - \bar{Y} = \bar{Y}_I - \bar{Y} = \frac{N_{NI}}{N}\,(\bar{Y}_I - \bar{Y}_{NI}). \qquad (2.1)$$

The estimator $\bar{y}_I$ is the sample mean based on observations from just the Internet population. The means of $Y$ in the Internet population and non-Internet population are denoted by $\bar{Y}_I$ and $\bar{Y}_{NI}$ respectively. Furthermore, $N$ is the size of the total population and $N_{NI}$ is the size of the non-Internet population.

The magnitude of this bias is determined by two factors. The first factor is the relative size $N_{NI}/N$ of the population without Internet. The bias will decrease as a smaller proportion of the population does not have access to Internet. The second factor is the contrast $\bar{Y}_I - \bar{Y}_{NI}$ between the Internet population and the non-Internet population. The more the mean of the target variable differs for these two sub-populations, the larger the bias will be.

An increased Internet coverage will reduce the bias because the factor $N_{NI}/N$ is smaller. However, the contrast does not necessarily decrease as Internet coverage grows. It is even possible that the remaining hard-core group of people without Internet will be more and more deviant. This may cause the contrast to increase. So, taking into account the combined effect of both factors, there is no guarantee that increased Internet coverage will reduce the under-coverage bias.

3. Effect of self-selection

Horvitz and Thompson (1952) show in their seminal paper that unbiased estimates of population characteristics can be computed only if a real probability sample has been drawn, every element in the population has a non-zero probability of selection, and all these probabilities are known to the researcher. Furthermore, only under these conditions can the accuracy of estimates be computed.

Many web surveys are not based on probability sampling. The survey questionnaire is simply put on the web. Respondents are those people who happen to have Internet, visit the website and decide to participate in the survey. These surveys are called self-selection surveys. The problem is that the survey researcher is not in control of the selection process. Selection probabilities are unknown and, moreover, they are considerably smaller than in traditional probability surveys. Therefore, no unbiased estimates can be computed nor can the accuracy of estimates be determined.

The effects of self-selection can be illustrated using an example related to the general elections in The Netherlands in 2006. Various organisations made attempts to use opinion polls to predict the outcome of these elections. The results of these polls are summarised in table 3.1. Politieke Barometer, Peil.nl and De Stemming are opinion polls carried out by market research agencies. They are all based on samples from web panels (also called access panels). To reduce a possible bias, adjustment weighting has been carried out. The polls were conducted one day before the election. The Mean Absolute Difference indicates how big the differences (on average) are between the poll and the election results. Differences are particularly large for the more volatile parties like PvdA, SP and the PVV.

DPES is the Dutch Parliamentary Election Study. The fieldwork was carried out by Statistics Netherlands in a few weeks just before the elections. The probability sampling principle has been followed here. A true (two-stage) probability sample was drawn. Respondents were interviewed face-to-face (using CAPI). The predictions of this survey are much better than those based on the online opinion polls.

Table 3.1. Dutch Parliamentary elections 2006. Outcomes and the results of various opinion surveys
Columns: Election result, Politieke Barometer, Peil.nl, De Stemming, DPES 2006.
Rows: Sample size; seats in parliament for CDA (Christian democrats), PvdA (social democrats), VVD (liberals), SP (socialists), GL (green party), D66 (liberal democrats), ChristenUnie (Christian), SGP (Christian), PvdD (Animal party), PVV (conservative), Other parties; Mean Absolute Difference.

Probability sampling has the additional advantage that it provides protection against certain groups in the population attempting to manipulate the outcomes of the survey. This may typically play a role in opinion polls. Self-selection does not have this safeguard. An example of this effect could be observed in the election of the 2005 Book of the Year Award (Dutch: NS Publieksprijs), a high-profile literary prize. The winning book was determined by means of a poll on a website. People could vote for one of the nominated books or mention another book of their choice. More than 90,000 people participated in the survey. The winner turned out to be the new interconfessional Bible translation launched by the Netherlands and Flanders Bible Societies. This book was not nominated, but nevertheless an overwhelming majority (72%) voted for it. This was due to a campaign launched by (among others) Bible societies, a Christian broadcaster and a Christian newspaper. Although this was all completely within the rules of the contest, the group of voters could clearly not be considered to be representative of the Dutch population.

4. The theoretical framework

Let the target population of the survey consist of $N$ identifiable elements, which are labelled $1, 2, \ldots, N$. Associated with each element $k$ is a value $Y_k$ of the target variable $Y$. The aim of the web survey is assumed to be estimation of the population mean

$$\bar{Y} = \frac{1}{N}\sum_{k=1}^{N} Y_k \qquad (4.1)$$

of the target variable $Y$.

Participation in a self-selection web survey requires in the first place that respondents are aware of the existence of a survey (they have to accidentally visit the website, or they have to follow up a banner or an e-mail message). In the second place, they have to make the decision to fill in the questionnaire on the Internet. All this means that each element $k$ in the population has an unknown probability $\rho_k$ of participating in the survey, for $k = 1, 2, \ldots, N$.

The responding elements can be denoted by a series

$$r_1, r_2, \ldots, r_N \qquad (4.2)$$

of $N$ indicators, where the $k$-th indicator $r_k$ assumes the value 1 if element $k$ participates, and otherwise it assumes the value 0, for $k = 1, 2, \ldots, N$. The expected value $\rho_k = E(r_k)$ will be called the response propensity of element $k$. The random variables $r_1, r_2, \ldots, r_N$ are independent. This sample selection process is a form of Poisson sampling. However, in practical applications of Poisson sampling the selection probabilities are known, whereas they are unknown in a self-selection survey.

The number of respondents is equal to

$$n = \sum_{k=1}^{N} r_k. \qquad (4.3)$$

Note that $n$ is a random variable. A naive estimator of the population mean is the sample mean

$$\bar{y} = \frac{1}{n}\sum_{k=1}^{N} r_k Y_k. \qquad (4.4)$$

This estimator implicitly assumes every element in the population to have the same probability of participating in the survey. Quantity (4.4) is the ratio of two random variables. It can be shown that its expected value is approximately equal to the ratio of the expected values of both random variables. Hence

$$E(\bar{y}) \approx \tilde{Y} = \frac{1}{N}\sum_{k=1}^{N} \frac{\rho_k}{\bar{\rho}}\, Y_k, \qquad (4.5)$$

where $\bar{\rho}$ is the mean of all response probabilities. Using an approach similar to Cochran (1977, p. 31), it can be shown that the variance of the sample mean is approximately equal to

$$V(\bar{y}) \approx \frac{1}{(N\bar{\rho})^2}\sum_{k=1}^{N} \rho_k (1-\rho_k)\,(Y_k - \tilde{Y})^2. \qquad (4.6)$$

Note that this expression for the variance does not contain a sample size (because no fixed size sample was drawn), but the expected sample size $N\bar{\rho}$. Not surprisingly, the variance decreases as the expected sample size increases.

Generally, the expected value of this sample mean is not equal to the population mean. The only situation in which the bias vanishes is that in which all response probabilities in the Internet population are equal. In terms of nonresponse correction theory, this comes down to Missing Completely At Random (MCAR). Indeed, in this case, self-selection leads to a representative sample because all elements have the same selection probability.

Bethlehem (1988) shows that the bias of the sample mean (4.4) can be written as

$$B(\bar{y}) = E(\bar{y}) - \bar{Y} \approx \tilde{Y} - \bar{Y} = \frac{C_{\rho Y}}{\bar{\rho}}, \qquad (4.7)$$

where

$$C_{\rho Y} = \frac{1}{N}\sum_{k=1}^{N} (\rho_k - \bar{\rho})(Y_k - \bar{Y}) \qquad (4.8)$$

is the covariance between the values of the target variable and the response probabilities. The bias of the sample mean (as an estimator of the population mean) is therefore determined by two factors:

- The average response probability. The more likely people are to participate in the survey, the higher the average response probability will be, and thus the smaller the bias will be.
- The relationship between the target variable and response behaviour. The higher the correlation between the values of the target variable and the response probabilities, the higher the bias will be.

Three situations can be distinguished in which this bias vanishes:

1. All response probabilities are equal. Again, this is the case in which the self-selection process can be compared with a simple random sample;

2. All values of the target variable are equal. This situation is very unlikely to occur. If this were the case, no survey would be necessary. One observation would be sufficient.
3. There is no relationship between target variable and response behaviour. It means participation does not depend on the value of the target variable. This corresponds to Missing Completely At Random (MCAR).

Expression (4.7) for the bias of the sample mean can be rewritten as

$$B(\bar{y}) = E(\bar{y}) - \bar{Y} \approx \frac{R_{\rho Y}\, S_\rho\, S_Y}{\bar{\rho}}, \qquad (4.9)$$

in which $R_{\rho Y}$ is the value of the correlation between the target variable and the response probabilities, $S_\rho$ is the standard deviation of the response probabilities, and $S_Y$ is the standard deviation of the target variable. Given the mean response probability $\bar{\rho}$, there is a maximum value the standard deviation $S_\rho$ cannot exceed:

$$S_\rho \le \sqrt{\bar{\rho}\,(1-\bar{\rho})}. \qquad (4.10)$$

This implies that in the worst case ($S_\rho$ assumes its maximum value and the correlation $R_{\rho Y}$ is equal to either $+1$ or $-1$) the absolute value of the bias will be equal to

$$|B_{\max}(\bar{y})| = S_Y \sqrt{\frac{1}{\bar{\rho}} - 1}. \qquad (4.11)$$

Bethlehem (1988) shows that formula (4.7) also applies in the situation in which a probability sample has been drawn, and subsequently non-response occurs during the fieldwork. Consequently, expression (4.11) provides a means to compare potential biases in various survey designs. For example, regular surveys of Statistics Netherlands are all based on probability sampling. Their response rates are around 70%. This means the absolute maximum bias is equal to $0.65 \times S_Y$. One of the largest web surveys in The Netherlands is 21minuten.nl. This survey is supposed to supply answers to questions about important problems in Dutch society. It is a self-selection web survey. Within a period of six weeks in 2006 about 170,000 people completed the online questionnaire. The target population of this survey was not defined, as everyone could participate. If it is assumed the target population consists of all Dutch from the age of 18, the average response probability is equal to 170,000 / 12,800,000 = 0.0133. Hence, the absolute maximum bias is equal to $8.61 \times S_Y$. It can be concluded that the bias of a large web survey can be a factor 13 larger than the bias of a smaller probability survey.

5. Weighting adjustment

Weighting adjustment is a family of techniques that attempt to improve the quality of survey estimates by making use of auxiliary information. Auxiliary information is defined here as a set of variables that have been measured in the survey, and for which information on their population distribution is available. By comparing the population distribution of an auxiliary variable with its sample distribution, it can be assessed whether or not the sample is representative for the population (with respect to this variable). If these distributions differ considerably, one must conclude that the sample is selective. Note that for a probability sample in which non-response has occurred, it is also possible to use the distribution of the auxiliary variables in the complete sample instead of their population distribution. Such information can sometimes be retrieved from the sampling frame. This situation does not apply to self-selection samples as there is no sampling frame.

To correct for a lack of representativity, adjustment weights can be computed. Weights are assigned to records of all respondents. Estimates of population characteristics can now be obtained by using weighted values instead of the unweighted values. Weighting adjustment is often used to correct surveys that are affected by nonresponse, see e.g. Bethlehem (2002).

This section explores the possibility to reduce the bias of self-selection web survey estimates by applying post-stratification. This is a well-known and often used weighting technique. To carry out post-stratification, one or more qualitative auxiliary variables are needed. Here, only one such variable is considered. The extension to more variables is essentially the same.

Suppose there is an auxiliary variable $X$ having $L$ categories. So it divides the target population into $L$ strata. The strata are denoted by the subsets $U_1, U_2, \ldots, U_L$ of the population $U$. The number of target population elements in stratum $U_h$ is denoted by $N_h$, for $h = 1, 2, \ldots, L$. The population size is equal to $N = N_1 + N_2 + \ldots + N_L$. This is the population information assumed to be available.

Suppose a self-selection sample is selected from the Internet population. If $n_h$ denotes the number of respondents in stratum $h$, then $n = n_1 + n_2 + \ldots + n_L$. The values of the $n_h$ are the result of a Poisson sampling process, so they are random variables. Post-stratification assigns identical adjustment weights to all elements in the same stratum. The weight $w_k$ for a respondent $k$ in stratum $h$ is equal to

$$w_k = \frac{N_h / N}{n_h / n}. \qquad (5.1)$$

The simple sample mean

$$\bar{y} = \frac{1}{n}\sum_{k=1}^{N} r_k Y_k \qquad (5.2)$$

is now replaced by the weighted sample mean

$$\bar{y}_{PS} = \frac{1}{n}\sum_{k=1}^{N} w_k r_k Y_k. \qquad (5.3)$$
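As an illustration of the weights (5.1) and the weighted mean (5.3), the following is a minimal Python sketch. The function names, the stratum labels and the toy data are hypothetical and not taken from the paper; the sketch also assumes every stratum contains at least one respondent, since otherwise the weight is undefined.

```python
from collections import Counter


def post_stratification_weights(pop_counts, respondent_strata):
    """Weight (5.1): w_k = (N_h / N) / (n_h / n) for a respondent in stratum h."""
    N = sum(pop_counts.values())          # population size
    n = len(respondent_strata)            # number of respondents
    n_h = Counter(respondent_strata)      # respondents per stratum
    return [(pop_counts[h] / N) / (n_h[h] / n) for h in respondent_strata]


def weighted_mean(values, weights):
    """Weighted sample mean (5.3): (1/n) * sum of w_k * y_k over respondents."""
    return sum(w * y for w, y in zip(weights, values)) / len(values)


# Hypothetical toy data: known population counts per age stratum and a
# self-selected sample in which the young are over-represented.
pop_counts = {"young": 400, "middle": 350, "old": 250}
strata = ["young"] * 6 + ["middle"] * 3 + ["old"] * 1
y = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]

weights = post_stratification_weights(pop_counts, strata)
naive = sum(y) / len(y)                  # unweighted sample mean (5.2)
adjusted = weighted_mean(y, weights)     # post-stratified mean (5.3)
print(f"naive = {naive:.3f}, post-stratified = {adjusted:.3f}")
```

In this toy example the naive mean is 0.400 and the weighted mean is about 0.317, which is the same as weighting the three stratum means (0.5, 1/3 and 0) by the population shares 0.40, 0.35 and 0.25. The weights sum to the number of respondents.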

Substituting the weights and working out this expression leads to the post-stratification estimator

$$\bar{y}_{PS} = \sum_{h=1}^{L} W_h\, \bar{y}_h, \qquad (5.4)$$

where $\bar{y}_h$ is the sample mean in stratum $h$ and $W_h = N_h/N$ is the relative size of stratum $h$. The expected value of this post-stratification estimator is equal to

$$E(\bar{y}_{PS}) = \sum_{h=1}^{L} W_h\, E(\bar{y}_h) \approx \sum_{h=1}^{L} W_h\, \tilde{Y}_h, \qquad (5.5)$$

where

$$\tilde{Y}_h = \frac{1}{N_h}\sum_{k=1}^{N_h} \frac{\rho_{k,h}}{\bar{\rho}_h}\, Y_{k,h} \qquad (5.6)$$

is the weighted mean of the target variable in stratum $h$. The subscript $k,h$ denotes the $k$-th element in stratum $h$, and $\bar{\rho}_h$ is the average response probability in stratum $h$. Expression (5.6) is the analogue of expression (4.5), but now computed for stratum $h$. Generally, this mean will not be equal to the mean $\bar{Y}_h$ of the target variable in stratum $h$ of the target population. The bias of this estimator is equal to

$$B(\bar{y}_{PS}) = E(\bar{y}_{PS}) - \bar{Y} \approx \sum_{h=1}^{L} W_h\,(\tilde{Y}_h - \bar{Y}_h) = \sum_{h=1}^{L} W_h\, \frac{R_{\rho Y,h}\, S_{\rho,h}\, S_{Y,h}}{\bar{\rho}_h}, \qquad (5.7)$$

where the subscript $h$ indicates that the respective quantities are computed just for stratum $h$ and not for the complete population. This bias will be small if

- The response probabilities are similar within strata;
- The values of the target variable are similar within strata;
- There is no correlation between response behaviour and the target variable within strata.

These conditions can be realised if there is a strong relationship between the target variable $Y$ and the stratification variable $X$. Then the variation in the values of $Y$ manifests itself between strata but not within strata. In other words, the strata are homogeneous with respect to the target variable. Also if the strata are homogeneous with respect to the response probabilities, the bias will be reduced. In nonresponse correction terminology, this situation comes down to Missing At Random (MAR).

In conclusion it can be said that application of post-stratification will successfully reduce the bias of the estimator if proper auxiliary variables can be found. Such variables should satisfy three conditions:

- They have to be measured in the survey;
- Their population distribution $(N_1, N_2, \ldots, N_L)$ must be known;
- They must produce homogeneous strata.

Unfortunately, such variables are rarely available, or there is only a weak correlation.

It can be shown that, in general, the variance of the post-stratification estimator is equal to

$$V(\bar{y}_{PS}) = \sum_{h=1}^{L} W_h^2\, V(\bar{y}_h). \qquad (5.8)$$

In the case of a self-selection web survey, the variance $V(\bar{y}_h)$ of the sample mean in a stratum is the analogue of variance (4.6) but restricted to observations in that stratum. Therefore, the variance of the post-stratification estimator is approximately equal to

$$V(\bar{y}_{PS}) \approx \sum_{h=1}^{L} W_h^2\, \frac{1}{(N_h \bar{\rho}_h)^2} \sum_{k \in U_h} \rho_k (1-\rho_k)\,(Y_k - \tilde{Y}_h)^2. \qquad (5.9)$$

This variance is small if the strata are homogeneous with respect to the target variable. So, a strong correlation between the target variable $Y$ and the stratification variable $X$ will reduce both the bias and the variance of the estimator.

6. Weighting adjustment with a reference sample

The previous section showed that post-stratification can be an effective correction technique provided auxiliary variables are available that have a strong correlation with the target variables of the survey. If such variables are not available, it might be considered to conduct a reference survey. This reference survey is based on a small probability sample, where data collection takes place with a mode different from the web, e.g. CAPI (Computer Assisted Personal Interviewing, with laptops) or CATI (Computer Assisted Telephone Interviewing). The reference survey approach has been applied by several market research organisations, see e.g. Börsch-Supan et al. (2004) and Duffy et al. (2005).

Under the assumption of full response, or ignorable nonresponse, this reference survey will produce unbiased estimates of quantities that have also been measured in the web survey. Unbiased estimates for the target variable can be computed, but due to the small sample size, these estimates will have a substantial variance. The question is now whether estimates can be improved by combining the large sample size of the web surveys with the unbiased estimates of the reference survey.

To explore this, it is assumed that one qualitative auxiliary variable is observed both in the web survey and the reference survey, and that this variable has a strong correlation with the target variable of the survey. Then a form of post-stratification can be applied where the stratum means are estimated using web survey data and the stratum weights are estimated using the reference survey data. This leads to the post-stratification estimator

$$\bar{y}_{RS} = \sum_{h=1}^{L} \frac{m_h}{m}\, \bar{y}_h, \qquad (6.1)$$

where $\bar{y}_h$ is the web survey based estimate for the mean of stratum $h$ of the target population (for $h = 1, 2, \ldots, L$), and $m_h/m$ is the relative sample size in stratum $h$ for the reference sample (for $h = 1, 2, \ldots, L$). Under the conditions described above the quantity $m_h/m$ is an unbiased estimate of $W_h = N_h/N$.

Let $I$ denote the probability distribution for the web survey and let $P$ be the probability distribution for the reference survey. Then the expected value of the post-stratification estimator is equal to

$$E(\bar{y}_{RS}) = E_I\, E_P(\bar{y}_{RS} \mid \bar{y}_1, \ldots, \bar{y}_L) = E_I\Big(\sum_{h=1}^{L} W_h\, \bar{y}_h\Big) \approx \sum_{h=1}^{L} W_h\, \tilde{Y}_h, \qquad (6.2)$$

where $W_h = N_h/N$ is the relative size of stratum $h$ in the target population. So, the expected value of this estimator is identical to that of the post-stratification estimator (5.4). The bias of this estimator is equal to

$$B(\bar{y}_{RS}) = E(\bar{y}_{RS}) - \bar{Y} \approx \sum_{h=1}^{L} W_h\,(\tilde{Y}_h - \bar{Y}_h) = \sum_{h=1}^{L} W_h\, \frac{R_{\rho Y,h}\, S_{\rho,h}\, S_{Y,h}}{\bar{\rho}_h}. \qquad (6.3)$$

A strong relationship between the target variable and the auxiliary variable used for computing the weights means that there is little or no variation of the target variable within the strata. Consequently, the correlation between target variable and response behaviour will be small, and the same applies to the standard deviation of the target variable. So, using a reference survey with the proper auxiliary variables can substantially reduce the bias of web survey estimates. Note that the bias of the reference survey estimator is equal to that of the post-stratification estimator, see expression (5.7).

An interesting aspect of the reference survey approach is that any variable can be used for adjustment weighting as long as it is measured in both surveys. For example, some market research organisations use webographics or psychographic variables that divide the population in 'mentality groups'. People in the same groups have more or less the same level of motivation and interest to participate in such surveys. Effective weighting variables approach the MAR situation as much as possible. This implies that within weighting strata there is no relationship between participating in a web survey and the target variables of the survey.
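The reference survey estimator, with stratum means taken from the web survey and stratum shares $m_h/m$ taken from the reference sample, might be sketched in Python as follows. The function name, the group labels and the toy data are hypothetical; the sketch assumes every stratum observed in the reference sample also has at least one web respondent.

```python
from collections import Counter, defaultdict


def reference_survey_estimate(web_strata, web_values, ref_strata):
    """Sum over strata of (m_h / m) * (web-survey mean in stratum h)."""
    sums = defaultdict(float)
    counts = Counter()
    for h, y in zip(web_strata, web_values):
        sums[h] += y
        counts[h] += 1

    m = len(ref_strata)            # reference sample size
    m_h = Counter(ref_strata)      # reference respondents per stratum

    missing = set(m_h) - set(counts)
    if missing:
        raise ValueError(f"no web respondents in strata: {sorted(missing)}")

    return sum((m_h[h] / m) * (sums[h] / counts[h]) for h in m_h)


# Hypothetical toy data: two 'mentality groups' A and B.
web_strata = ["A"] * 4 + ["B"] * 4
web_values = [1, 1, 1, 0, 0, 0, 0, 1]      # stratum means 0.75 and 0.25
ref_strata = ["A"] * 2 + ["B"] * 8         # small probability sample

naive = sum(web_values) / len(web_values)
estimate = reference_survey_estimate(web_strata, web_values, ref_strata)
print(f"naive web mean = {naive:.2f}, reference-weighted = {estimate:.2f}")
```

Here the naive web mean is 0.50, while the reference-weighted estimate is 0.2 × 0.75 + 0.8 × 0.25 = 0.35. Because the shares come from only ten reference observations in this toy example, they would themselves vary considerably from sample to sample.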

14 It can be sown tat if a reference surve is used, te variance of te poststratification estiator is equal to ~ ( + ( ( + ( RS ( (6.4 Te proof is iven in te appendix. Te quantit is easured in te web surve. Terefore its variance will be at ost of te order / E ( n /( ρ. Tis ( eans tat te first ter in te variance of te post-stratification estiator will be of te order /, te second ter of order /(E(n, and te tird ter of order /E(n. Since E(n will enerall be uc larer tan in practical situations, te first ter in te variance will doinate, i.e. te (sall size of te reference surve will deterine te accurac of te estiates. Moreover, since strata are based on roups of people wit te sae pscorapics scores, and taret variables a ver well be related to te pscorapic variables, te stratu eans a var substantiall. Tis also contributes to a lare value of te first variance coponent. Te conclusion is tat a lare nuber of observations in te web surve do not elp to produce accurate estiates. Te reference surve approac a reduce te bias of estiates, but it does so at te cost of a ier variance. Te effectiveness of a surve desin is soeties also indicated b eans of te effective saple size. Tis is te saple size of a siple rando saple of eleents tat would produce an estiator wit te sae precision. Use of a reference surve iplies tat te effective saple size is uc lower tan te size of te web surve. See section 8 for an exaple sowin tis effect. 7. ropensit weitin ropensit weitin is used b several arket researc oranisations to correct for a possible bias in teir web surves, see e.. Börsc-Supan et al. (004 and Duff et al. (005. Te oriinal idea beind propensit weitin oes back to Rosenbau & Rubin (983, 984. ropensit scores are obtained b odellin a variable tat indicates weter or not soeone participates in te surve. Usuall a loistic reression odel is used were te indicator variable is te dependent variable and attitudinal variables are te explanator variables. 
Tese attitudinal variables are assued to explain w soeone participates or not. Fittin te loistic reression odel coes down to estiatin te probabilit (propensit score of participatin, iven te values of te explanator variables. Eac person k in te population is assued to ave a certain, unknown probabilit k of participatin in te surve, for k,,..,. et r, r,, r denote indicator 4

15 variables, were r k if person k participates in te surve, and r k 0 oterwise. Consequentl, (r k k. Te propensit score (X is te conditional probabilit tat a person wit observed caracteristics X participates, i.e. ρ ( X ( Xr (7. It is assued tat witin te strata defined b te values of te observed caracteristics X, all persons ave te sae participation propensit. Tis is te Missin At Rando (MAR assuption. Te propensit score is often odelled usin a loit odel: ρ( X k lo α + β X k ρ X (7. k ( Te odel is fitted usin Maxiu ikeliood estiation. Once propensit scores ave been estiated, te are used to stratif te population. Eac stratu consists of eleents wit (approxiatel te sae propensit scores. If indeed all eleents witin a stratu ave te sae response propensit, tere will be no bias if just te eleents in te Internet population are used for estiation purposes. Cocran (968 clais tat five strata are usuall sufficient to reove a lare part of te bias. Te arket researc aenc Harris Interactive was aon te first to appl propensit score weitin, see Teranian et al. (00. To be able to appl propensit score weitin, two conditions ave to be fulfilled. Te first condition is tat proper auxiliar variables ust be available. Tese are variables tat are capable of explainin weter or not soeone is willin to participate in te web surve. ariables often used easure eneral attitudes and beaviour. Te are soeties referred to as weborapic or pscorapic variables. Sconlau et al. (004 ention as exaples Do ou often feel alone? and On ow an separate occasions did ou watc news proras on T durin te past 30 das?. It sould be rearked tat attitudinal questions are uc less reliable tan factual questions. Respondents a never ave tout of te topics addressed in attitudinal questions. Te ave to ake up teir ind at te ver oent te question is asked. Teir answers a be depend on teir current circustances, and a ver over tie. Terefore, attitudinal question a be subject to substantial easureent errors. 
Te second condition for tis tpe of adjustent weitin is tat te population distribution of te weborapic variables ust be available. Tis is enerall not te case. A possible solution to tis proble is to carr out an additional reference surve. To allow for unbiased estiation of te population distribution, te reference surve ust be based on a true probabilit saple fro te entire taret population. Suc a reference surve can be sall in ters of te nuber of questions asked. It can be liited to te weborapic questions. referabl, te saple size of te 5

16 reference surve sould be lare enou to allow for precise estiation. A sall saple size results in lare standard errors of estiates. Sconlau et al. (004 describe te reference surve of Harris Interactive. Tis is a CATI surve, usin rando diit diallin. Tis reference surve is used to adjust several web surves. Sconlau et al. (003 stress tat te success of tis approac depends on two assuptions: ( te weborapics variables are capable of explainin te difference between te web surve respondents and te oter persons in te taret population, and ( te reference surve does not suffer fro noninorable nonresponse. In practical situations it will not be eas to satisf tese conditions. It sould be noted tat fro a teoretical point of view propensit weitin sould be sufficient to reove te bias. However, in practice te propensit score variable will often be cobined wit oter (deorapic variables in a ore extended weitin procedure, see e.. Sconlau ( A siulation stud To explore te effects of self-selection and correction tecniques, a siulation stud was carried out. A fictitious population was constructed. For tis population, votin intentions for te next eneral elections were siulated and analsed. Te relationsips between variables involved were odelled soewat stroner tan te probabl would be in a real life situation. Effects are terefore ore pronounced, akin it clearer wat te pitfalls are. Te caracteristics of estiators (before and after correction were coputed based on a lare nuber of siulations. First, te distribution of te estiator was deterined in te ideal situation of a siple rando saple fro te taret population. Ten, it was explored ow te caracteristics of te estiator cane if self-selection is applied. Finall, te effects of weitin (post-stratification and reference surve were analsed. A fictitious population of 00,000 individuals was constructed. Tere were five variables: Te variable Internet indicates ow active a person is on te internet. Tere are two cateories. 
These are very active users and more passive users. Active users make up 1% of the population and passive users 99%. Active users have a response propensity of 0.99 and passive users have a response propensity of 0.01.

The variable Age has three categories: young, middle-aged and old. Of the active Internet users, 60% are young, 30% are middle-aged and only 10% are old. The age distribution for passive Internet users is 40% young, 35% middle-aged and 25% old. So, typically, younger people are more active Internet users.

The variable NEP indicates whether a person will vote for the National Elderly Party (NEP). The probability of voting for this party depends only on age. Probabilities are 0.00 (for young), 0.30 (for middle-aged) and 0.60 (for old).

The variable NIP indicates whether a person will vote for the New Internet Party (NIP). The probability of voting for this party depends on both age and use of the Internet. For active Internet users, the probabilities were 0.80 (for young), 0.40 (for middle-aged) and 0.20 (for old). All probabilities were equal to 0.10 for passive Internet users. So, for active users the probability of voting for the NIP decreases with age. The voting probability is always low for passive users.

Figure 8.1. Relationships between variables (diagram nodes: Age, Internet, NEP, NIP)

Figure 8.1 shows the relationships between the variables in a graphical way.

The decision not to participate in a self-selection survey can be seen as a form of nonresponse. The theory on nonresponse (see for example Little & Rubin, 2002) distinguishes three nonresponse generating mechanisms:

Missing Completely At Random (MCAR). There is no relationship at all between the mechanism causing missingness and the target variables of the survey. This is the ideal situation. The mechanism only leads to a reduced number of observations. Estimators will not be biased.

Missing At Random (MAR). There is an indirect relationship between the mechanism causing missingness and the target variables of the survey. The relationship runs through a third variable, and this variable is measured in the survey as an auxiliary variable. In this case estimates are biased, but it is possible to correct for this bias. For example, if the auxiliary variable is used to construct strata, there will be no bias within strata, and post-stratification will remove the bias.

Not Missing At Random (NMAR). There is a direct relationship between the mechanism causing missingness and the target variables of the survey. This is the worst case. Estimators will be biased and it is not possible to remove this bias.

The variable NEP (National Elderly Party) suffers from missingness due to MAR in the experiment.
There is a direct relationship between voting for this party and age, and there is also a direct relationship between age and the probability to participate in the survey. This will cause estimates to be biased. It should be possible to correct for this bias by weighting using the variable age.

The variable NIP (New Internet Party) suffers from NMAR. There exists a direct relationship between voting for this party and the response probability. Estimates will be biased, and there is no correction possible.
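The way self-selection distorts the age composition of the respondents can be worked out directly from the design of the fictitious population. The following is a minimal sketch, not the author's code: the parameter values are taken from the description of the population and should be treated as assumptions, and all names (`SHARE`, `RESP`, `age_distribution`) are hypothetical.

```python
# Design of the fictitious population (values assumed from the description).
POP_SIZE = 100_000
SHARE = {"active": 0.01, "passive": 0.99}      # Internet groups
RESP = {"active": 0.99, "passive": 0.01}       # response propensities
AGE = {
    "active": {"young": 0.60, "middle": 0.30, "old": 0.10},
    "passive": {"young": 0.40, "middle": 0.35, "old": 0.25},
}


def age_distribution(respondents_only):
    """Age shares in the population, or among the expected respondents."""
    totals = {"young": 0.0, "middle": 0.0, "old": 0.0}
    for group, share in SHARE.items():
        # Respondents are weighted by their group's response propensity.
        weight = share * (RESP[group] if respondents_only else 1.0)
        for age, p in AGE[group].items():
            totals[age] += weight * p
    norm = sum(totals.values())
    return {age: value / norm for age, value in totals.items()}


mean_propensity = sum(SHARE[g] * RESP[g] for g in SHARE)
expected_sample_size = POP_SIZE * mean_propensity
population_ages = age_distribution(respondents_only=False)
respondent_ages = age_distribution(respondents_only=True)

print(f"mean response propensity: {mean_propensity:.4f}")
print(f"expected sample size:     {expected_sample_size:.0f}")
print("age shares, population:  ", population_ages)
print("age shares, respondents: ", respondent_ages)
```

Under these design values the old make up roughly a quarter of the population but a noticeably smaller share of the expected respondents, which is the under-representation that drives the bias for the NEP. The figures are design expectations; a realised population will deviate slightly from them.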

The distribution of estimators for the percentage of voters for both parties was determined in various situations by repeating the selection of the sample 500 times. The average response probability in the population is […]. Therefore, the expected sample size in a self-selection survey is equal to 1972.

Figure 8.2 contains the results for the variable NEP (votes for National Elderly Party). The upper-left graph shows the distribution of the estimator for simple random samples of size 1972 from the target population. The vertical line denotes the population value to be estimated (25.6%). The estimator has a symmetric distribution around this value. This is a clear indication that the estimator is unbiased.

The upper-right graph shows what happens if samples are selected by means of self-selection. The shape of the distribution remains more or less the same, but the distribution as a whole has shifted to the left. All values of the estimator are systematically too low. The expected value of the estimator is only 20.5%. The estimator is biased. The explanation of this bias is simple: relatively few elderly are active Internet users. Therefore, they are under-represented in the samples. These are typically people who will vote for the NEP.

Figure 8.2. Results of the simulations for variable NEP (panels: simple random sample; self-selection survey; self-selection survey, weighting by age; self-selection survey + reference survey)

The lower-left graph in figure 8.2 shows the distribution of the estimator in case of post-stratification by age. The bias is removed. This is possible because this is a case of Missing At Random (MAR).
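The three NEP scenarios described above (simple random sample, self-selection, post-stratification by age) can be mimicked with a small Monte Carlo experiment. This is a rough sketch under stated assumptions rather than a reproduction of the original study: the parameter values are taken from the description of the population, the simple random sample is given the same size as each realised self-selection sample instead of a fixed size, and all function names are hypothetical.

```python
import random
from collections import Counter
from statistics import fmean

AGES = ("young", "middle", "old")
# Age distribution keyed by "is an active Internet user" (assumed values).
AGE_DIST = {True: (0.60, 0.30, 0.10), False: (0.40, 0.35, 0.25)}
P_NEP = {"young": 0.00, "middle": 0.30, "old": 0.60}


def build_population(size, rng):
    """Return a list of (active, age, votes_nep) tuples."""
    people = []
    for _ in range(size):
        active = rng.random() < 0.01
        age = rng.choices(AGES, weights=AGE_DIST[active])[0]
        people.append((active, age, rng.random() < P_NEP[age]))
    return people


def share(sample):
    """Unweighted proportion of NEP voters in a sample."""
    return sum(p[2] for p in sample) / len(sample)


def self_select(pop, rng):
    """Each person decides independently whether to take part."""
    return [p for p in pop if rng.random() < (0.99 if p[0] else 0.01)]


def post_stratify(sample, age_shares):
    """Weight the stratum means by the given population age shares."""
    estimate = 0.0
    for age in AGES:
        stratum = [p for p in sample if p[1] == age]
        if stratum:  # an empty stratum contributes nothing
            estimate += age_shares[age] * share(stratum)
    return estimate


def simulate(pop_size=100_000, reps=500, seed=1):
    rng = random.Random(seed)
    pop = build_population(pop_size, rng)
    counts = Counter(p[1] for p in pop)
    age_shares = {age: counts[age] / pop_size for age in AGES}
    srs, selfsel, weighted = [], [], []
    for _ in range(reps):
        respondents = self_select(pop, rng)
        srs.append(share(rng.sample(pop, len(respondents))))
        selfsel.append(share(respondents))
        weighted.append(post_stratify(respondents, age_shares))
    return {
        "population": share(pop),
        "srs": fmean(srs),
        "self_selection": fmean(selfsel),
        "post_stratified": fmean(weighted),
    }


if __name__ == "__main__":
    for name, value in simulate(reps=100).items():
        print(f"{name:16s} {100 * value:5.1f}%")
```

In such a run the mean of the self-selection estimator should sit clearly below the population value, while the simple random sample and the post-stratified estimator should both centre near it. Exact figures depend on the seed and on the realised population.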

Post-stratification by age can only be applied if the distribution of age in the population is known. If this is not the case, one could consider conducting a small reference survey (n = …00), in which this population distribution is estimated unbiasedly. The lower-right graph in figure 8.2 shows what happens in this case. The bias is removed, but at the cost of a substantial increase in variance.

Figure 8.3 shows the results for the variable NIP (votes for New Internet Party). The upper-left graph shows the distribution of the estimator for simple random samples of size 1972 from the target population. The vertical line denotes the population value to be estimated (10.5%). Since the estimator has a symmetric distribution around this value, it is clear that the estimator is unbiased.

The upper-right graph shows what happens if samples are selected by means of self-selection. The distribution has shifted to the right considerably. All values of the estimator are systematically too high. The expected value of the estimator is now 35.6%. The estimator is severely biased. The explanation of this bias is straightforward: voters for the NIP are over-represented in the self-selection samples.

Figure 8.3. Results of the simulations for variable NIP (panels: simple random sample; self-selection survey; self-selection survey, weighting by age; self-selection survey + reference survey)

The lower-left graph in figure 8.3 shows the effect of post-stratification by age. Only a small part of the bias is removed. Weighting is not successful. This is not surprising as there is a direct relationship between voting for the NIP and use of Internet. This is a case of NMAR.
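Why weighting by age works for the NEP but not for the NIP can also be seen without simulation, by computing the expected values implied by the design. The sketch below is illustrative only: the parameter values are assumptions taken from the description of the population, the post-stratified value is a large-sample approximation (ratio of expectations per stratum), and the names `VOTE` and `summary` are hypothetical.

```python
AGES = ("young", "middle", "old")
SHARE = {"active": 0.01, "passive": 0.99}       # Internet groups
RESP = {"active": 0.99, "passive": 0.01}        # response propensities
AGE = {"active": (0.60, 0.30, 0.10), "passive": (0.40, 0.35, 0.25)}
VOTE = {
    "NEP": {"active": (0.00, 0.30, 0.60), "passive": (0.00, 0.30, 0.60)},
    "NIP": {"active": (0.80, 0.40, 0.20), "passive": (0.10, 0.10, 0.10)},
}


def summary(party):
    """Return (population value, unweighted respondent mean,
    post-stratified value) for the given party."""
    pop = raw_num = raw_den = post = 0.0
    for i in range(len(AGES)):
        age_share = num = den = 0.0
        for group in SHARE:
            cell = SHARE[group] * AGE[group][i]   # population share of the cell
            p = VOTE[party][group][i]
            age_share += cell
            pop += cell * p
            num += cell * RESP[group] * p         # expected responding voters
            den += cell * RESP[group]             # expected respondents
        raw_num += num
        raw_den += den
        post += age_share * num / den             # stratum mean times true age share
    return pop, raw_num / raw_den, post


for party in ("NEP", "NIP"):
    pop, raw, post = summary(party)
    print(f"{party}: population {pop:.3f}, self-selection {raw:.3f}, "
          f"post-stratified by age {post:.3f}")
```

For the NEP the post-stratified value coincides with the population value, because voting depends on age only. For the NIP it moves only slightly away from the unweighted respondent mean and stays far above the population value, because within each age class the active Internet users still dominate the respondents.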

Also in this case one can consider conducting a small reference survey if the population distribution of age is not available. The lower-right graph in figure 8.3 shows what happens in this case. Only a small part of the bias is removed and at the same time there is a substantial increase in variance.

The following conclusions can be drawn from this simulation study:

If Missing At Random (MAR) or Not Missing At Random (NMAR) applies to survey participation, estimates based on a self-selection web survey will be biased;
There is no guarantee that weighting will remove the bias. This correction technique will only work in case of Missing At Random (MAR), and if the proper auxiliary variables are used for weighting;
A reference survey will only be effective in removing the bias if Missing At Random (MAR) applies, and the proper auxiliary variables are measured;
Use of a small reference survey will always substantially increase the variance of estimators.

9. Discussion and conclusions

This paper discussed some of the methodological problems caused by self-selection in web surveys. The underlying question is whether such a survey can be used as a data collection instrument for making valid inference about a target population. Costs and timeliness are important arguments in favour of web surveys. However, this paper concentrated on quality aspects like unbiasedness and accuracy of estimates.

It was shown that self-selection can cause estimates of population characteristics to be biased. This seems to be similar to the effect of nonresponse in traditional probability sampling based surveys. However, it was shown that the bias in self-selection surveys can be substantially larger. Depending on the response rate in a web survey, the bias can in a worst case situation even be more than 13 times as large.

Weighting techniques (including propensity weighting) can help to reduce the bias, but only if the sample selection mechanism satisfies the Missing at Random (MAR) condition. This is a strong assumption. It requires weighting variables that show a strong relationship with the target variables of the survey and the response probabilities.
Often such variables are not available. Sometimes a reference survey is used as a means to obtain the proper weighting variables. Indeed, this approach can be successful if such variables can be measured both in the web survey and in the reference survey. There are some reports that webographics variables seem to work well. These attitudinal or lifestyle variables seem to be capable of explaining response behaviour. They measure activities of respondents (e.g. reading) and perceptions about possible violations of privacy.
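The variance cost of a small reference survey can be approximated with a short calculation. The sketch below isolates the extra standard error that arises when the weights are estimated from a simple random reference sample, holding the stratum means from the web survey fixed. It is a simplified illustration, not the author's computation: the age shares and NEP stratum means are design values assumed from the simulation study, and the reference sample sizes are hypothetical.

```python
import math


def reference_survey_se(shares, stratum_means, m):
    """Standard error added by estimating the stratum shares from a
    simple random reference sample of size m (multinomial approximation),
    with the stratum means treated as fixed."""
    mean = sum(p * y for p, y in zip(shares, stratum_means))
    second = sum(p * y * y for p, y in zip(shares, stratum_means))
    return math.sqrt((second - mean ** 2) / m)


def srs_se(p, n):
    """Standard error of a proportion from a simple random sample of size n."""
    return math.sqrt(p * (1 - p) / n)


# Assumed design values: age shares (young, middle-aged, old) and the
# NEP voting probabilities within each age class.
AGE_SHARES = (0.402, 0.3495, 0.2485)
NEP_MEANS = (0.0, 0.3, 0.6)

baseline = srs_se(0.25395, 1980)
print(f"simple random sample, n = 1980: se = {baseline:.4f}")
for m in (100, 400, 1600):  # hypothetical reference sample sizes
    extra = reference_survey_se(AGE_SHARES, NEP_MEANS, m)
    print(f"reference survey, m = {m:4d}: extra se = {extra:.4f}")
```

With a reference sample of around a hundred respondents the added standard error is more than twice that of a simple random sample of about two thousand, which is consistent with the substantial increase in variance reported for the reference survey panels.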
