Repeated sampling in Successive Survey


Statistics institution

Repeated Sampling in Successive Survey (RSSS)

Xiaolu Cao

15-credit thesis in Statistics III (15-högskolepoängsuppsats inom Statistik III), autumn term 2011 (HT2011)

Supervisor: Mikael Möller

Abstract

In this thesis, the idea of sampling design for successive surveys is investigated. A good sampling design can influence the result of a survey: it can decrease the bias of the estimates obtained in real-world surveys. The author compares a new sampling design with a classical sampling design. The thesis is concerned with using repeated sampling within the realm of stratified random sampling. It presents the theory and shows how the theory is applied. The data for the application are generated by a Monte Carlo method. The results from the simulations show that using repeated sampling to estimate the population mean gives higher precision and lower variance than using a fixed sample.

Thanks

I would like to express my gratitude to Mikael Möller for his support and guidance during the writing of this thesis.

Contents

1 Introduction
1.1 Background
1.2 Related work
1.3 Data set
2 Theoretical Framework
2.1 Notations and concept of repeated sampling
2.2 Repeated sampling
2.3 Simple random sampling
2.4 Repeated sampling within realm of simple random sampling
2.5 Repeated sampling within realm of stratified random sampling
2.6 Estimation of sample size
3 Experimental results
3.1 Implementation issues
3.2 Experimental result for data set 1
3.2.1 Sample size
3.2.2 Comparison for fixed sample and repeated sample
3.2.3 Estimated variance
3.3 Optimality
3.4 Experimental result for data set 2
3.4.1 Experiment result for testing scenario
3.4.2 Sample size
3.4.3 Comparison for fixed sample and repeated sample
3.4.4 Estimate variance
3.5 Summary of experiment
4 Conclusion and discussion
References

1 Introduction

1.1 Background

Successive surveys are surveys carried out on repeated occasions. Many fields of study use the survey method, e.g. sociology, economics and medicine. The objective is to estimate characteristics of a population on repeated occasions in order to measure time trends as well as current values of the characteristics (see [Rao and Graham]). An example is the estimation of the yearly average income of people living in Stockholm county.

When changes in time-dependent population values are to be examined, several sampling alternatives, as listed by Yates (1960), can be used:

(i) a new sample on each occasion;
(ii) a fixed sample used on all occasions;
(iii) a subsample of the original sample on the second occasion;
(iv) a partial replacement of units from occasion to occasion.

In this thesis we concentrate on comparing (ii), a fixed sample used on all occasions, with (iv), a partial replacement of units from occasion to occasion (see [Manoussakis]). A partial replacement of units from occasion to occasion is usually known as repeated sampling, i.e. part of the sample is replaced at equidistant times. During this process, on the $h$-th occasion we may have parts of the sample that are matched with the $(h-1)$-th occasion, parts that are matched with both the $(h-1)$-th and the $(h-2)$-th occasion, and so on (see [Patterson]). Finding the optimal replacement strategy for two or more occasions is very important.

The thesis compares the conventional method (ii) with repeated sampling within the realm of stratified random sampling. Stratified sampling divides the population $\Omega$ of size $N$ into $K$ homogeneous subpopulations of sizes $N_1, N_2, \ldots, N_K$. Each subpopulation is called a stratum. A sample is drawn from each stratum, and the drawings must be independent in different strata. If the process of drawing a sample is random, the whole procedure is called stratified random sampling.

Stratified sampling is a common technique and works well for populations with a variety of attributes. The principal reasons are (see [Cochran]):

1. If data of known precision are wanted for certain subdivisions of the population, it is advisable to treat each subdivision as a population in its own right.

2. Administrative convenience may dictate the use of stratification; for example, the agency conducting the survey may have field offices, each of which can supervise the survey for a part of the population.

3. Sampling problems may differ markedly in different parts of the population. With human populations, people living in institutions (e.g., hotels, hospitals, prisons) are often placed in a different stratum from people living in ordinary homes because a different approach to the sampling is appropriate for the two situations. In sampling businesses we may possess a list of the large firms, which are placed in a separate stratum. Some type of area sampling may have to be used for the smaller firms.

4. Stratification may produce a gain in precision in the estimates of characteristics of the whole population. If it is possible to divide a heterogeneous population into subpopulations, each of which is internally homogeneous, stratification may produce a gain in precision in the estimates of the characteristics. This is suggested by the name stratum, with its implication of a division into layers. If each stratum is homogeneous, in that the measurements vary little from one unit to another, a precise estimate of any stratum mean can be obtained from a small sample in that stratum. These estimates can then be combined into a precise estimate for the whole population.

For the reasons above, this thesis focuses on using repeated sampling within the realm of stratified random sampling. The improvement obtained by using a repeated sample is investigated both by theoretical derivations and by data simulation.

1.2 Related work

Random sampling, stratified sampling, cluster sampling, two-stage stratified sampling, etc. usually use fixed samples. In these cases, every occasion uses the same units. However, measuring the same units in surveys on successive occasions might not be a wise choice, because using the same fixed sample may negatively influence the investigated units (e.g. people), and this might influence the final result. This motivates methods that draw a "new" sample on each occasion. On the other hand, when a new sample is used on each occasion, the continuity of statistical features between the current sample and the new sample cannot be guaranteed, and new samples also cost extra.

For these reasons the repeated sampling technique was invented. This technique consists of a partial replacement of units from occasion to occasion: on the $(h+1)$-th occasion, part of the units are replaced. The process of repeated sampling may thus be summarized as follows: the sample for the $(h+1)$-th occasion includes some units that are matched with the $h$-th occasion, some that are matched with both the $h$-th and the $(h-1)$-th occasion, and so on.

The method of repeated sampling within the realm of simple random sampling is mature. But there are only a few papers about the use of repeated sampling within the realm of more complex sampling, e.g. stratified random sampling, stratified cluster sampling, two-stage stratified sampling etc. Stratified sampling is a method of sampling from a heterogeneous population.
Stratification is the process of dividing this population into homogeneous strata before sampling. This thesis uses repeated sampling within stratified sampling. Here stratified sampling comprises stratified random sampling, stratified cluster sampling and two-stage stratified sampling. In this thesis we focus on repeated sampling within the realm of stratified random sampling.

1.3 Data set

Since real data were not available, we decided to use the Monte Carlo method to generate datasets. Hence we first simulate the population from which we will take a sample. Examples of characteristics are the income of all employees at Stockholm University during 2007, 2008, 2009 etc., or the BMI (body mass index) of all people living in Stockholm county during 2006, 2007, 2008 etc. We simulate income data since individual-level data cannot be obtained from public official statistics.

The expression "Monte Carlo method" is actually very general. Monte Carlo methods, also called MC methods, are stochastic techniques based on pseudo-random number generators and probability theory. Monte Carlo methods are used in many different areas, from economics to nuclear physics to regulating the flow of traffic. Of course the way they are applied varies widely from field to field, and there are dozens of subsets of Monte Carlo methods even within chemistry. But, strictly speaking, to call something a "Monte Carlo" experiment, all you need to do is use random numbers when examining your problem (see [Woller]).

Generally, we assume that the data follow a Gaussian distribution. Thus, we use the function normrnd in MATLAB to generate our dataset. From Statistics Sweden we obtain the average income for males and females living in Stockholm county during 2006-2009.

Table "Average income" (tkr): one row for male and one row for female, one column per year.

With the aid of the MATLAB function normrnd, a call such as normrnd(µ, σ, N, 1) generates N Gaussian random numbers with mean µ and standard deviation σ. In our case, we generate different datasets by using the given mean values (as shown in the table "Average income") and varying the standard deviation. Typical standard deviation values are in the range 10 to 50.
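The data generation step can be sketched in Python as a rough stand-in for the MATLAB call described above. The function name `simulate_population`, the seeds, and the stratum means, standard deviations and sizes are illustrative assumptions, not the Statistics Sweden figures used in the thesis.

```python
import random
import statistics


def simulate_population(mean, sd, size, seed=None):
    """Draw `size` Gaussian incomes (tkr) with the given mean and standard deviation."""
    rng = random.Random(seed)
    return [rng.gauss(mean, sd) for _ in range(size)]


# Hypothetical stratum parameters (assumed for illustration only).
male = simulate_population(mean=300.0, sd=30.0, size=10_000, seed=1)
female = simulate_population(mean=250.0, sd=30.0, size=10_000, seed=2)

# The simulated population mean and standard deviation should be close to the inputs.
print(round(statistics.mean(male), 1), round(statistics.stdev(male), 1))
```

Fixing the seed makes a simulated population reproducible, which matters when the same population has to be sampled under two competing designs.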

2 Theoretical Framework

2.1 Notations and concept of repeated sampling

In the sequel we are interested in measuring individual income in Stockholm county. More exactly, we want to estimate the mean income and its standard deviation under two different sampling schemes. We introduce the following notation:

$Y_j(h)$ = income of individual $j$ at occasion $h$ in Stockholm county, where $j \in \{1, 2, \ldots, N\}$.

The size of the population $N$ is the same at every occasion $h$. Let

$$\Omega = \{Y_1(h), Y_2(h), \ldots, Y_N(h)\}$$

$$\mu(h) = \bar{Y}(h) = \frac{1}{N} \sum_{j=1}^{N} Y_j(h)$$

$$\sigma^2(h) = \frac{1}{N-1} \sum_{j=1}^{N} \left( Y_j(h) - \bar{Y}(h) \right)^2$$

Here occasion $h$ stands for one of the years investigated ($h$ = 2006, 2007, 2008, 2009). From now on we use lowercase $y$ to denote a sample value, and our samples are of size $n$. For partial replacement of units, we use $m$ to denote the kept part and $u$ to denote the replacement part. $\hat{\mu}_m(h)$ is the estimate of the population mean based on the kept part of the original sample at the $h$-th occasion, and $\hat{\mu}_{u+m}(h)$ is the estimate of the population mean based on repeated sampling at the $h$-th occasion.

2.2 Repeated sampling

In a stratified random sampling process, we do simple random sampling in each stratum. Using repeated sampling within the realm of stratified random sampling means using repeated sampling in each stratum. Repeated sampling is a process that changes the sample from occasion to occasion.

Figure 1: dropped units, kept units, new units.
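One rotation step of the kind shown in Figure 1 can be sketched as follows. This is a minimal illustration, assuming that units are identified by integer ids and that the new units are drawn only from units outside the previous sample; the function name `rotate_sample` and the sizes are hypothetical.

```python
import random


def rotate_sample(sample_ids, population_ids, u, rng):
    """Keep n - u units of the previous sample and draw u new units to replace the rest."""
    n = len(sample_ids)
    if not 0 <= u <= n:
        raise ValueError("u must be between 0 and the sample size")
    kept = rng.sample(sample_ids, n - u)          # kept (matched) part, size m = n - u
    previous = set(sample_ids)
    # Assumption: replacements come from units that were not in the previous sample.
    candidates = [i for i in population_ids if i not in previous]
    new = rng.sample(candidates, u)               # replacement part, size u
    return kept, new


rng = random.Random(0)
population_ids = list(range(1000))
first = rng.sample(population_ids, 100)           # sample at the first occasion
kept, new = rotate_sample(first, population_ids, u=70, rng=rng)
print(len(kept), len(new))
```

The sample at the next occasion is then `kept + new`, so the sample size stays the same on every occasion.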

We write $(h-1)$-th occasion instead of previous occasion and $h$-th occasion instead of current occasion. On the $h$-th occasion, we keep the units in the non-replacement part of the sample used at the $(h-1)$-th occasion and drop the rest. We draw new observations to replace the dropped ones. This process is called repeated sampling and is illustrated in Figure 1.

2.3 Simple random sampling

The idea behind simple random sampling is that the values of the sample characteristics are approximately the same as the corresponding values in the population. If we use a fixed sample for a survey, the accuracy of the measurements in this sample will decrease over time due to population changes. It is believed that a repeated sample is better than a fixed sample.

In repeated sampling, we need to take into account the correlation between the current occasion $h$ and the previous occasion $h-1$. We use the correlation coefficient $\rho$ to denote this relationship. Generally, the value of $\rho$ is between 0 and 1. When $\rho = 1$, the previous sample is not good for estimating the current population characteristics, and the proportion of replacement is 100%. Conversely, if $\rho = 0$, the data from the current occasion have no relationship with the previous occasion. The previous sample is nevertheless still useful for estimating the population characteristics at the current occasion, so we need not change any units in it: the proportion of replacement is 0%.

In the real world, we normally use the standard deviation $\sigma(\bar{y})$ to measure how close the estimate $\bar{y}$ from a sample is to the population mean $\mu$. The larger the standard deviation $\sigma(\bar{y})$, the flatter the distribution curve and the more spread out the data. In other words, the estimate of the mean has low precision. Conversely, if $\sigma(\bar{y})$ is small, the distribution curve is peaked and all data are close to the mean. Hence, the estimate of the mean has high precision.

In simple random sampling, the standard deviation of the mean $\bar{y}$ when sampling from

1. an infinite population is

$$\sigma(\bar{y}) = \frac{\sigma}{\sqrt{n}}$$

2. a finite population is

$$\sigma(\bar{y}) = \frac{\sigma}{\sqrt{n}} \sqrt{1 - \frac{n}{N}}$$

From the formulas above we see that the population standard deviation, the sample size and the population size all affect the standard deviation of the mean $\bar{y}$. But these formulas are not sufficient for calculating the standard deviation of the mean $\bar{y}$ in repeated sampling. In repeated sampling, the sample size and the population size are the same on all occasions, but the sample is not the same from occasion to occasion. The correlation between the two occasions affects the sample estimate, and hence also the sample characteristics for the current occasion. Further, the proportion of replacement also affects the characteristics of the sample. The correlation helps us decide how many units to replace in order to get the estimate of highest precision. If we want to estimate the characteristics of a population well, we need a sample that resembles the population. These are the reasons why we should add the correlation coefficient $\rho$ and the optimal proportion of replacement $p$ to the formulas above.
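The two formulas for the standard deviation of the mean can be checked numerically with a small sketch. The function name `se_mean` and the numbers in the example are assumptions chosen for illustration.

```python
import math


def se_mean(sigma, n, N=None):
    """Standard deviation of the sample mean under simple random sampling.

    With N=None the population is treated as infinite; otherwise the
    finite population factor sqrt(1 - n/N) is applied.
    """
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt(1 - n / N)
    return se


# Hypothetical values: sigma = 30 tkr, sample of 100 units.
print(se_mean(30.0, 100))          # infinite population
print(se_mean(30.0, 100, N=400))   # finite population of 400 units
```

The finite population value is never larger than the infinite population value, and it shrinks to zero when the sample covers the whole population.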

2.4 Repeated sampling within realm of simple random sampling

First, consider a survey on only two successive occasions, $h-1$ and $h$. The sample size $n$ is the same on all occasions, and the measured variable is denoted by $y$. So $y(h)$ is the variable at the current occasion and $y(h-1)$ the variable at the previous occasion. Let $m$ be the number of kept units ($m \le n$) and $u$ the number of dropped units ($u = n - m$). We then have the following sampling schemes for occasion $h$ given occasion $h-1$ (where we write $r \mid s$ for measures at time $r$ given the sample at time $s$):

1. All $n$ units are newly sampled at $h$ and their values are measured at the current occasion $h$. In this case

$$\hat{\mu}(h \mid h) = \bar{y}(h \mid h), \qquad V(\hat{\mu}(h \mid h)) = \frac{s^2(h \mid h)}{n}$$

In the sequel we write $\bar{y}(h)$ and $s^2(h)$ when the sample time is the same as the measurement time.

2. All $n$ units are from the previous occasion $h-1$ but their values are measured at the current occasion $h$:

$$\hat{\mu}(h \mid h-1) = \bar{y}(h \mid h-1), \qquad V(\hat{\mu}(h \mid h-1)) = \frac{s^2(h \mid h-1)}{n}$$

3. The sample is regarded as a population in its own right and we take a subsample of size $m$ from it. The sample at the current occasion is a subsample of the original sample at the previous occasion, which means $m < n$. The linear regression estimate is designed to increase precision by use of an auxiliary variate $Y_j(h-1)$ that is correlated with $Y_j(h)$. When the relation between $Y_j(h)$ and $Y_j(h-1)$ is examined, it may be found that although the relation is approximately linear, the line does not go through the origin. This suggests an estimate based on the linear regression of $y_j(h)$ on $y_j(h-1)$ rather than on the ratio of the two variables. We suppose that the values $y_j(h)$ and $y_j(h-1)$ are each obtained for every unit in the sample and that the population mean $\mu(h-1)$ of $\{Y_j(h-1)\}$ is known. The linear regression estimate of $\mu(h)$, the population mean of $\{Y_j(h)\}$, is, for the kept part and some $\beta$ (to be decided) (see [Cochran]),

$$\hat{\mu}_m(h \mid h-1) = \bar{y}_m(h \mid h-1) + \beta \left[ \bar{y}(h-1) - \bar{y}_m(h-1) \right] \quad (1)$$

Here $\hat{\mu}_m(h \mid h-1)$ is the estimate of $\mu(h)$ given by the kept part, at time $h$, with the sample taken at time $h-1$. We let $f = m/n$; the estimate of the variance is then

$$V_{min}(\hat{\mu}_m(h \mid h-1)) = \frac{1-f}{m} \cdot \frac{1}{m-1} \sum_{j=1}^{m} \left[ \left( y_j(h) - \bar{y}_m(h \mid h-1) \right) - \beta \left( y_j(h-1) - \bar{y}(h-1) \right) \right]^2$$

$$= \frac{1-f}{m} \left[ s^2(h) - 2\beta\, s(h, h-1) + \beta^2 s^2(h-1) \right] \quad (2)$$

where $s(h, h-1)$ is the covariance between the $h$-th and the $(h-1)$-th occasion. Here, the regression coefficient is estimated as

$$\hat{\beta} = \frac{\sum_{j=1}^{m} \left( y_j(h) - \bar{y}_m(h \mid h-1) \right) \left( y_j(h-1) - \bar{y}(h-1) \right)}{\sum_{j=1}^{m} \left( y_j(h-1) - \bar{y}(h-1) \right)^2} = \frac{s(h, h-1)}{s^2(h-1)} \quad (3)$$

The population correlation coefficient $\rho$ between the $h$-th and the $(h-1)$-th occasion is estimated as

$$\hat{\rho} = \frac{s(h, h-1)}{s(h)\, s(h-1)} \quad (4)$$

where $\hat{\rho}$ depends on the occasions $h$ and $h-1$. We combine equations (2), (3) and (4) and get

$$V_{min}(\hat{\mu}_m(h \mid h-1)) = \frac{s^2(h) \left( 1 - \hat{\rho}^2 \right)}{m} + \frac{\hat{\rho}^2 s^2(h)}{n} \quad (5)$$

4. The corresponding calculations for the replacement part are as follows. All observations in the replacement part at the current occasion are new and have no correlation with the observations at the previous occasion. So the sample mean is

$$\hat{\mu}_u(h) = \bar{y}_u(h) = \frac{1}{u} \sum_{j=1}^{u} y_j(h)$$

and

$$V(\hat{\mu}_u(h)) = \frac{s^2(h)}{u}$$

5. From 3 we get $\hat{\mu}_m(h \mid h-1)$ and from 4 we get $\hat{\mu}_u(h)$. Then, at time $h$, all units are measured and combined into the estimate

$$\hat{\mu}_{u+m}(h) = \varphi\, \hat{\mu}_u(h) + (1 - \varphi)\, \hat{\mu}_m(h \mid h-1) \quad (6)$$

Here $\varphi$ is a constant weight factor, and we observe that $\bar{y}_u(h \mid h)$ is independent of $\hat{\mu}_m(h \mid h-1)$. In the following equations, $g(h)$ is short for $g(h \mid h)$. Inserting equation (1) into (6) gives

$$\hat{\mu}_{u+m}(h) = \varphi\, \hat{\mu}_u(h) + (1 - \varphi) \left\{ \bar{y}_m(h \mid h-1) + \beta \left[ \bar{y}(h-1) - \bar{y}_m(h-1) \right] \right\} \quad (7)$$

To calculate the variance we use equations (5), (8) and (9):

$$V(\hat{\mu}_u(h)) = \frac{s^2(h)}{u} = \frac{1}{w_u(h)} \quad (8)$$

$$\frac{s^2(h) \left( 1 - \hat{\rho}^2 \right)}{m} + \frac{\hat{\rho}^2 s^2(h)}{n} = \frac{1}{w_m(h)} \quad (9)$$

$$V(\hat{\mu}_{u+m}(h)) = \varphi^2\, V(\hat{\mu}_u(h)) + (1 - \varphi)^2\, V_{min}(\hat{\mu}_m(h \mid h-1)) \quad (10)$$

$$= \varphi^2 \frac{s^2(h)}{u} + (1 - \varphi)^2 \left( \frac{s^2(h) \left( 1 - \hat{\rho}^2 \right)}{m} + \frac{\hat{\rho}^2 s^2(h)}{n} \right) \quad (11)$$

$$= \varphi^2 \frac{1}{w_u(h)} + (1 - \varphi)^2 \frac{1}{w_m(h)} \quad (12)$$

We want the minimum variance and the corresponding value of $\varphi$. We take the derivative of $V(\hat{\mu}_{u+m}(h))$ with respect to $\varphi$ and set it equal to zero. This gives the optimal weight factor for the minimum variance estimate:

$$\varphi = \frac{w_u(h)}{w_m(h) + w_u(h)} \quad (13)$$

Since each weight is the reciprocal of a variance, this also means

$$\varphi = \frac{V_{min}(\hat{\mu}_m(h \mid h-1))}{V_{min}(\hat{\mu}_m(h \mid h-1)) + V(\hat{\mu}_u(h))}$$

We then insert (13) for $\varphi$ in (12) to get the variance estimate under the optimal $\varphi$:

$$V_{optimal}(\hat{\mu}_{u+m}(h)) = \frac{1}{w_u(h) + w_m(h)} = s^2(h)\, \frac{n - u \hat{\rho}^2}{n^2 - u^2 \hat{\rho}^2} \quad (14)$$

The optimal proportion of replacement is an important quantity. We want to find the proportion of replacement that gives the minimum variance for the estimate of the population mean. In other words, it helps us estimate the population mean with the highest precision. To get the optimal proportion of replacement, we differentiate $V(\hat{\mu}_{u+m}(h))$ in (14) with respect to $u$ and set the derivative equal to zero. After some calculations we get

$$p = \frac{u}{n} = \frac{1}{1 + \sqrt{1 - \hat{\rho}^2}} \quad (15)$$
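Equations (8), (9), (13), (14) and (15) can be checked numerically with a short sketch. The function names and the values of $s^2$, $n$ and $\hat{\rho}$ below are assumptions for illustration; a brute-force search over $u$ is used only to confirm that (15) locates the minimum of (14).

```python
import math


def combined_variance(s2, n, u, rho):
    """Variance (14) of the combined estimate when u of the n units are replaced."""
    return s2 * (n - u * rho**2) / (n**2 - u**2 * rho**2)


def optimal_weight(s2, n, u, rho):
    """Weight phi from (13), with w_u and w_m taken from (8) and (9). Requires 0 < u < n."""
    m = n - u
    w_u = u / s2
    w_m = 1.0 / (s2 * (1 - rho**2) / m + rho**2 * s2 / n)
    return w_u / (w_u + w_m)


def optimal_replacement_fraction(rho):
    """Proportion of replacement u/n from (15)."""
    return 1.0 / (1.0 + math.sqrt(1.0 - rho**2))


# Hypothetical values: s^2 = 900, n = 100, rho-hat = 0.9.
s2, n, rho = 900.0, 100, 0.9
u = round(optimal_replacement_fraction(rho) * n)
best_u = min(range(1, n), key=lambda k: combined_variance(s2, n, k, rho))
print(u, best_u, combined_variance(s2, n, u, rho), s2 / n)
```

With these numbers the variance under partial replacement is clearly below $s^2/n$, the variance obtained when the whole sample is either kept or replaced.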

Inserting (15) into equation (14) gives the minimum variance estimate under the optimal $\varphi$ and $p$:

$$V_{min}(\hat{\mu}_{u+m}(h)) = \frac{s^2(h)}{2n} \left( 1 + \sqrt{1 - \hat{\rho}^2} \right)$$

Second, we consider partial replacement of units on more than two occasions. We assume that these are the $(h-2)$-th, $(h-1)$-th and $h$-th occasions. On the $h$-th occasion, we use the estimate of the mean at the $(h-1)$-th occasion as the auxiliary variable to help us estimate the mean at the $h$-th occasion. In the two-occasion case above, the $(h-1)$-th occasion was the first occasion, but in this case it is the second occasion. That is the reason why we use $\hat{\mu}_m(h-1 \mid h-2)$ instead of $\bar{y}(h-1)$, and $\varphi(h)$ instead of $\varphi$. We still assume that the population variance $S^2$ is the same on all occasions. The estimate of the population mean on the $h$-th occasion is

$$\hat{\mu}_{u+m}(h) = \varphi(h)\, \hat{\mu}_u(h) + (1 - \varphi(h))\, \hat{\mu}_m(h \mid h-1) \qquad (h > 2) \quad (16)$$

From (13) we get

$$\varphi(h) = \frac{w_u(h)}{w_u(h) + w_m(h)} \quad (17)$$

$$\hat{\mu}_u(h) = \frac{1}{u} \sum_{i=1}^{u} y_i(h)$$

$$V(\hat{\mu}_u(h)) = \frac{s^2(h)}{u(h)} = \frac{1}{w_u(h)} \quad (18)$$

$$\hat{\mu}_m(h \mid h-1) = \bar{y}_m(h) + \beta \left( \hat{\mu}_m(h-1 \mid h-2) - \bar{y}_m(h-1) \right) \quad (19)$$

Combining (16) and (19) gives the formula for the estimate of the population mean at the $h$-th occasion when $h > 2$:

$$\hat{\mu}_{u+m}(h) = \varphi(h)\, \hat{\mu}_u(h) + (1 - \varphi(h)) \left\{ \bar{y}_m(h) + \beta \left[ \hat{\mu}_m(h-1 \mid h-2) - \bar{y}_m(h-1) \right] \right\}$$

Here

$$\hat{\mu}_m(h-1 \mid h-2) = \bar{y}_m(h-1) + \beta \left( \hat{\mu}_m(h-2 \mid h-3) - \bar{y}_m(h-2) \right)$$

so the variance of $\hat{\mu}_m(h \mid h-1)$ becomes

$$V_{min}(\hat{\mu}_m(h \mid h-1)) = \frac{s^2(h) \left( 1 - \hat{\rho}^2 \right)}{m(h)} + \hat{\rho}^2\, V_{min}(\hat{\mu}_m(h-1)) = \frac{1}{w_m(h)} \quad (20)$$

We combine equations (12), (17), (18) and (20):

$$V(\hat{\mu}_{u+m}(h)) = \varphi^2(h) \frac{s^2(h)}{u(h)} + (1 - \varphi(h))^2 \left[ \frac{s^2(h) \left( 1 - \hat{\rho}^2 \right)}{m(h)} + \hat{\rho}^2\, V_{min}(\hat{\mu}_m(h-1)) \right]$$

Combining equations (18) and (20), and using equation (17) for $\varphi(h)$, we get

$$V(\hat{\mu}_{u+m}(h)) = \frac{1}{w_u(h) + w_m(h)} = \frac{g(h)\, S^2}{n}$$

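We assume that $S^2$ is the same on all occasions, but the sample variance $s^2(h)$ changes from occasion to occasion. If we use $S^2$ to express $V(\hat{\mu}_{u+m}(h))$, we need a weight $g(h)$, which is the ratio of the variance at the $h$-th occasion to the variance at the first occasion. At the first occasion we assume that $s^2(1) \approx S^2$, so $g(1) = 1$ when $h = 1$:

$$g(h) = \frac{n\, V(\hat{\mu}_{u+m}(h))}{S^2}$$

$$Q = \frac{S^2}{V(\hat{\mu}_{u+m}(h))} = \frac{n}{g(h)} = S^2 \left( w_u(h) + w_m(h) \right) = u(h) + \frac{1}{\dfrac{1 - \hat{\rho}^2}{m(h)} + \dfrac{\hat{\rho}^2 g(h-1)}{n}} \quad (21)$$

We want to maximize $Q$, which corresponds to a minimum variance estimate. So we take the derivative of equation (21) with respect to $m(h)$, with $u(h) = n - m(h)$, and set it equal to zero. After some simplification this gives

$$\frac{1 - \hat{\rho}^2}{m^2(h)} = \left( \frac{1 - \hat{\rho}^2}{m(h)} + \frac{\hat{\rho}^2 g(h-1)}{n} \right)^2$$

For the kept part, this equation has the solution

$$\frac{\hat{m}(h)}{n} = \frac{\sqrt{1 - \hat{\rho}^2}}{g(h-1) \left( 1 + \sqrt{1 - \hat{\rho}^2} \right)} \quad (22)$$

Combining equations (21) and (22) we get

$$\frac{1}{g(h)} = 1 + \frac{1 - \sqrt{1 - \hat{\rho}^2}}{g(h-1) \left( 1 + \sqrt{1 - \hat{\rho}^2} \right)} \quad (23)$$

Now define $r(h)$ and $b$ as

$$r(h) = \frac{1}{g(h)}, \qquad b = \frac{1 - \sqrt{1 - \hat{\rho}^2}}{1 + \sqrt{1 - \hat{\rho}^2}}$$

We may then rewrite equation (23) as

$$r(h) = 1 + b\, r(h-1)$$

Since $r(1) = 1$, it is readily found that

$$r(h) = \frac{1}{g(h)} = 1 + b + b^2 + \cdots + b^{h-1} = \frac{1 - b^h}{1 - b}$$

and hence

$$g(h) = \frac{1 - b}{1 - b^h}, \qquad g(h-1) = \frac{1 - b}{1 - b^{h-1}} \quad (24)$$

Combining equations (22) and (24),

$$\frac{\hat{m}(h)}{n} = \frac{\left( 1 - b^{h-1} \right) \sqrt{1 - \hat{\rho}^2}}{(1 - b) \left( 1 + \sqrt{1 - \hat{\rho}^2} \right)} = \frac{1 - b^{h-1}}{2}$$

From this follows

$$\frac{\hat{u}(h)}{n} = 1 - \frac{\hat{m}(h)}{n} = \frac{1 + b^{h-1}}{2}$$

which gives us the optimal proportion of replacement

$$p = \frac{\hat{u}(h)}{n} = \frac{1}{2} \left( 1 + \left( \frac{1 - \sqrt{1 - \hat{\rho}^2}}{1 + \sqrt{1 - \hat{\rho}^2}} \right)^{h-1} \right) \quad (25)$$

2.5 Repeated sampling within realm of stratified random sampling

With the aid of the equations above, we can derive formulas for stratified random sampling with partial replacement. $N$ is the size of the whole population and $K$ is the number of strata, indexed by $i = 1, 2, \ldots, K$. The stratum sizes are $N_1, N_2, \ldots, N_K$, with $\sum_{i=1}^{K} N_i = N$. The sample size in stratum $i$ is $n_i$, $m_i$ is the number of kept units and $u_i$ is the number of replacement units in stratum $i$, so that $m_i + u_i = n_i$, $i = 1, 2, \ldots, K$. The total sample size is $n$, so $\sum_{i=1}^{K} n_i = n$. The value of the $j$-th observation in the $i$-th stratum is denoted by $Y_{ij}$, and $Y_{ij}(h)$ is the value of the $j$-th observation in the $i$-th stratum on the $h$-th occasion. The weight of each stratum is $W_i = N_i / N$, and the sample size for each stratum is $n_i = n W_i$.

The estimate of the mean for stratum $i$ is $\bar{y}_i(h)$, where

$$\bar{y}_i(h) = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij}(h)$$

The variance for stratum $i$ is $S_i^2(h)$, where

$$S_i^2(h) = \frac{1}{N_i - 1} \sum_{j=1}^{N_i} \left( Y_{ij}(h) - \mu_i(h) \right)^2, \qquad i = 1, 2, \ldots, K$$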

The sample variance estimate for stratum $i$ is $s_i^2(h)$, where

$$s_i^2(h) = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} \left( y_{ij}(h) - \bar{y}_i(h) \right)^2$$

The estimate of the population mean in stratified random sampling is

$$\hat{\mu}_{u+m}(h) = \sum_{i=1}^{K} W_i\, \hat{\mu}_i(h) \quad (26)$$

Combining equations (7), (16) and (26),

$$\hat{\mu}_{u+m}(h) = \sum_{i=1}^{K} W_i \left\{ \varphi_i(h)\, \bar{y}_{u_i}(h) + (1 - \varphi_i(h)) \left[ \bar{y}_{m_i}(h) + \beta \left( \hat{\mu}_{m_i}(h-1) - \bar{y}_{m_i}(h-1) \right) \right] \right\}$$

This is the formula for the minimum variance linear unbiased estimator $\hat{\mu}_{u+m}(h)$ of the true population mean $\bar{Y}(h)$ on the $h$-th occasion. If the proportion of replacement is not equal to zero, the estimate of the mean in each stratum, $\hat{\mu}_i(h)$, is a minimum variance linear unbiased estimator. So in a stratified sampling case, if $\hat{\mu}_i(h)$ is a minimum variance linear unbiased estimator for $i = 1, 2, \ldots, K$, then $\hat{\mu}_{u+m}(h) = \sum_{i=1}^{K} W_i\, \hat{\mu}_i(h)$ is a minimum variance linear unbiased estimator of $\bar{Y}(h)$ (see [Prabhu Ajgaonkar]).

The variance of $\hat{\mu}_{u+m}(h)$ is

$$V(\hat{\mu}_{u+m}(h)) = \sum_{i=1}^{K} W_i^2\, V(\hat{\mu}_i(h)) \quad (27)$$

Combining equations (11) and (27) we get

$$V(\hat{\mu}_{u+m}(h)) = \sum_{i=1}^{K} W_i^2 \left\{ \varphi_i^2(h) \frac{S_i^2(h)}{u_i(h)} + (1 - \varphi_i(h))^2 \left[ \frac{S_i^2(h) \left( 1 - \hat{\rho}_i^2(h) \right)}{m_i(h)} + \frac{\hat{\rho}_i^2(h)\, S_i^2(h)}{n_i} \right] \right\}$$

Using the sample variance $s^2$ instead of the true variance $S^2$ to estimate the stratum variance in equation (11) gives

$$\hat{v}(\bar{y}_i(h)) = \varphi_i^2(h)\, \hat{v}(\bar{y}_{u_i}(h)) + (1 - \varphi_i(h))^2\, \hat{v}(\bar{y}_{m_i}(h)) \quad (28)$$

$$= \varphi_i^2(h) \frac{s_i^2(h)}{u_i(h)} + (1 - \varphi_i(h))^2 \left( \frac{s_i^2(h) \left( 1 - \hat{\rho}_i^2(h) \right)}{m_i(h)} + \frac{\hat{\rho}_i^2(h)\, s_i^2(h)}{n_i} \right)$$

2.6 Estimation of sample size

When stratified random sampling is used on successive occasions, the sample size $n$ and the population size $N$ are the same on all occasions. We only need to consider the total sample size when $h = 1$, because at $h = 1$ the method of sampling is plain stratified random sampling. The estimation of the sample size for repeated sampling within the realm of stratified random sampling is therefore the same as in stratified random sampling. Here it is assumed that the estimate has a specified variance $V$. If instead the margin of error $d$ has been specified, $V = (d/t)^2$, where $t$ is the normal deviate corresponding to the allowable probability that the error will exceed the desired margin (see [Cochran]). As a reminder, $W_i = N_i / N$ and $w_i = n_i / n$.

$$V = \frac{1}{n} \sum_{i} \frac{W_i^2 S_i^2}{w_i} - \frac{1}{N} \sum_{i} W_i S_i^2$$

From this follows

$$n = \frac{\sum_{i} W_i^2 S_i^2 / w_i}{V + \frac{1}{N} \sum_{i} W_i S_i^2}$$

Presumed optimum allocation (for the fixed sample size $n$), $w_i \propto W_i S_i$:

$$n = \frac{\left( \sum_{i} W_i S_i \right)^2}{V + \frac{1}{N} \sum_{i} W_i S_i^2}$$

Proportional allocation, $w_i = W_i$:

$$n = \frac{\sum_{i} W_i S_i^2}{V + \frac{1}{N} \sum_{i} W_i S_i^2} \quad (29)$$
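The two sample size formulas can be sketched as follows. The function names and all numbers (two strata with weights 0.6 and 0.4, stratum standard deviations 30 and 20, a margin of error of 2 tkr at $t = 1.96$, and $N = 10000$) are assumptions chosen for illustration, not the values used in the thesis.

```python
import math


def sample_size_proportional(weights, variances, V, N):
    """Total sample size n from (29), proportional allocation (w_i = W_i)."""
    within = sum(W * S2 for W, S2 in zip(weights, variances))
    return within / (V + within / N)


def sample_size_optimum(weights, variances, V, N):
    """Total sample size n under presumed optimum allocation (w_i proportional to W_i * S_i)."""
    within = sum(W * S2 for W, S2 in zip(weights, variances))
    spread = sum(W * math.sqrt(S2) for W, S2 in zip(weights, variances))
    return spread**2 / (V + within / N)


# Hypothetical inputs.
weights = [0.6, 0.4]          # W_i = N_i / N
variances = [900.0, 400.0]    # S_i^2
V = (2.0 / 1.96) ** 2         # V = (d / t)^2
N = 10_000

n_prop = math.ceil(sample_size_proportional(weights, variances, V, N))
n_opt = math.ceil(sample_size_optimum(weights, variances, V, N))
print(n_prop, n_opt)
```

Rounding up with `math.ceil` keeps the achieved variance at or below the target $V$. The optimum allocation never requires more units than the proportional one, and the two coincide when all stratum variances are equal.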

3 Experimental results

3.1 Implementation issues

This thesis focuses on a minimum variance estimation problem for a time-dependent population mean. We restrict our investigation to the case of linear unbiased estimators, and to fixed sample and population sizes on all occasions.

This section is the experimental part. First, we use the Monte Carlo method to generate data. The data simulation is based on the mean values of people's average yearly income in Stockholm county during 2006, 2007, 2008 and 2009. The income population consists of all people living in Stockholm county; we write it $\Omega = \{Y_1, Y_2, \ldots, Y_N\}$. We first calculate the true population mean for 2006, 2007, 2008 and 2009. The value for 2006 is the base information; we only use it to decide the sample size. We use equation (29) to calculate the sample size.

The data are separated into two strata by gender: one stratum is Male and one is Female. The strata have $N_1$ and $N_2$ individuals. We denote them $\Omega_{male} = \{Y_1^m, Y_2^m, \ldots, Y_{N_1}^m\}$ and $\Omega_{female} = \{Y_1^f, Y_2^f, \ldots, Y_{N_2}^f\}$. We then use random sampling in each stratum to get the samples, and the sample sizes are $n_1$ and $n_2$. The samples are $\{y_1^m, y_2^m, \ldots, y_{n_1}^m\}$ and $\{y_1^f, y_2^f, \ldots, y_{n_2}^f\}$.

After that, we use the fixed sample to estimate the mean and variance for the whole population in 2007, 2008 and 2009; the estimates of the mean are $\hat{\mu}_{fixed}(07)$, $\hat{\mu}_{fixed}(08)$ and $\hat{\mu}_{fixed}(09)$. We then use repeated sampling to estimate the mean and variance for the population in 2008 and 2009, because 2007 is the first occasion. At the first occasion the mean and variance are the same as the fixed sample estimates.

We use equation (25) to calculate the optimal size of the replacement part in each stratum. These are $u_{male}(08)$, $u_{female}(08)$, $u_{male}(09)$ and $u_{female}(09)$. The size of the kept part is $m$, with $n_1 - u_{male}(08) = m_{male}(08)$, $n_2 - u_{female}(08) = m_{female}(08)$, $n_1 - u_{male}(09) = m_{male}(09)$ and $n_2 - u_{female}(09) = m_{female}(09)$. We estimate the sample mean and sample variance for the new sample, which consists of the replacement part plus the kept part, in each stratum.
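The step that applies equation (25) to each stratum can be sketched as below. The function name `replacement_fraction`, the stratum sample size of 120 and the value $\hat{\rho} = 0.9$ are assumptions for illustration; occasions are numbered so that $h = 1$ is the first occasion (2007 in the experiment), $h = 2$ is 2008 and $h = 3$ is 2009.

```python
import math


def replacement_fraction(rho, h):
    """Optimal proportion of replacement u(h)/n from (25); h = 1 is the first occasion."""
    r = math.sqrt(1.0 - rho**2)
    b = (1.0 - r) / (1.0 + r)
    return 0.5 * (1.0 + b ** (h - 1))


# Hypothetical stratum: sample size 120 and estimated correlation 0.9.
n_stratum, rho = 120, 0.9
for h in (2, 3):
    u = round(replacement_fraction(rho, h) * n_stratum)
    m = n_stratum - u            # kept part, m = n - u
    print(h, u, m)
```

For $h = 2$ the result agrees with the two-occasion proportion in (15), and for $\hat{\rho} = 0$ the formula gives one half for every $h \ge 2$.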
After that, we use equations (26) and (27) to calculate the population mean and variance for 2008 and 2009. We can thus estimate the population mean both with the fixed sample and with the repeated sampling method. The method whose estimate is closer to the true mean is the one that gives the estimate with the highest precision, and the ratio of the two variances shows how much the precision differs between the two methods. During this experiment we only use the yearly income data for 2007, 2008 and 2009. When we use repeated sampling to estimate the mean and variance for the whole population, we also estimate the optimal proportion of replacement.

3.2 Experimental result for data set 1

3.2.1 Sample size

Calculated from base year 2006; the same for all occasions.

Table columns: $N$, $N_{male}$, $N_{female}$, $n$, $n_{male}$, $n_{female}$.

3.2.2 Comparison for fixed sample and repeated sample

Table rows, per year: true mean $\bar{Y}$ (tkr); estimates $\hat{\bar{Y}}_{fixed}$ and $\hat{\bar{Y}}_{repeated}$ (tkr); estimates of $V(\hat{\bar{Y}}_{fixed})$ and $V(\hat{\bar{Y}}_{repeated})$; $(u/n)_{male}$ and $(u/n)_{female}$; and the bias of estimation:

Year                                    2007     2008     2009
Bias of estimation (fixed sample)       0.014%   0.04%    0.106%
Bias of estimation (repeated sample)             0.017%   0.06%

3.2.3 Estimated variance

Table row: $V(\hat{\bar{Y}}_{fixed}) / V(\hat{\bar{Y}}_{repeated})$.

From this table we see that the estimated bias is always smaller for repeated sampling than for "fixed" sampling. It is also found that $V(\hat{\mu}_{fixed}) > V(\hat{\mu}_{repeated})$ for both 2008 and 2009. Hence repeated sampling is found to be the better choice.

3.3 Optimality

Here we check whether the estimated optimal proportion of replacement is a real optimum or not. The method for estimating the optimal proportion of replacement is the same in 2008 and 2009, so we test whether the estimated proportion is optimal for 2008 ($h = 2$). If it is an optimal proportion in 2008, it will also be an optimal proportion in 2009. During this experiment we only change the proportion of replacement, since the correlation coefficient, the regression coefficient, the optimum stratum weights etc. do not change.

The previous experiment gave estimates of the optimum proportion of replacement for the male and for the female stratum (about 0.71 for the male stratum). We use the proportions 0.6 and 0.8 to check the bias. If both 0.6 and 0.8 give a higher bias, then the proportion of replacement obtained from the previous experiment is the better one. We then also check 0.65 and 0.75.

$(u/n)_{male}$ at second occasion   0.6      0.65     0.71     0.75     0.8
Bias of estimation                  0.036%   0.0%     0.017%   0.07%    0.06%

In this experiment we change the proportion of replacement to values lower and higher than the one estimated in the previous experiment. As the proportion gets closer to 0.71, the bias of estimation decreases and approaches the 0.017% obtained in the previous experiment. This means that the proportion of replacement estimated in the previous experiment is the optimum proportion of replacement.

3.4 Experimental result for data set 2

We again use simulated data. We simulate once more, but with a larger spread (standard deviation σ = 50), and keep the total population size, the strata and the stratum sizes the same. We use these new data to repeat the experiment above.

3.4.1 Experiment result for testing scenario

3.4.2 Sample size

Calculated from base year 2006; the same for all occasions.

Table columns: $N$, $N_{male}$, $N_{female}$, $n$, $n_{male}$, $n_{female}$.

3.4.3 Comparison for fixed sample and repeated sample

Table rows, per year: true mean $\bar{Y}$ (tkr); estimates $\hat{\bar{Y}}_{fixed}$ and $\hat{\bar{Y}}_{repeated}$ (tkr); estimates of $V(\hat{\bar{Y}}_{fixed})$ and $V(\hat{\bar{Y}}_{repeated})$; $(u/n)_{male}$ and $(u/n)_{female}$; and the bias of estimation:

Year                                    2007     2008     2009
Bias of estimation (fixed sample)       %        %        0.185%
Bias of estimation (repeated sample)             %        0.039%

3.4.4 Estimate variance

Table row: $V(\hat{\bar{Y}}_{fixed}) / V(\hat{\bar{Y}}_{repeated})$.

The results again show that repeated sampling gives estimates with higher precision than fixed sampling. Hence repeated sampling is found to be the better choice, whether the variance of the data is large or small.

3.5 Summary of experiment

The results of these experiments show that estimates of the mean of the whole population obtained by repeated sampling, within the realm of stratified sampling, are closer to the true population mean than those obtained by fixed sampling. In other words, repeated sampling gives higher precision than fixed sampling. The formulas for repeated sampling that we derived in the theory part are useful for estimating the population mean with higher precision. They also help us to get the optimal proportion of replacement.
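The comparison between a fixed and a repeated sample can be illustrated with a small Monte Carlo sketch for two occasions. This is not the thesis experiment: it assumes a single stratum, an effectively infinite Gaussian population (no finite population correction), hypothetical means and standard deviation, and it estimates $\hat{\beta}$, $\hat{\rho}$ and $s^2(h)$ from the kept part only, using deviations from the kept-part means. All function names are illustrative.

```python
import math
import random
import statistics


def repeated_estimate(prev, cur_matched, cur_new, m):
    """Combined estimate (6) from the kept part (first m units of prev) and the new units."""
    n, u = len(prev), len(cur_new)
    prev_m = prev[:m]
    mean_prev, mean_prev_m = statistics.fmean(prev), statistics.fmean(prev_m)
    mean_cur_m = statistics.fmean(cur_matched)
    # Covariance and variances estimated from the kept part (an assumption of this sketch).
    sxy = sum((x - mean_prev_m) * (y - mean_cur_m)
              for x, y in zip(prev_m, cur_matched)) / (m - 1)
    sxx = statistics.variance(prev_m)
    syy = statistics.variance(cur_matched)
    beta = sxy / sxx
    rho2 = sxy**2 / (sxx * syy)
    mu_m = mean_cur_m + beta * (mean_prev - mean_prev_m)       # equation (1)
    w_u = u / syy                                              # equation (8)
    w_m = 1.0 / (syy * (1 - rho2) / m + rho2 * syy / n)        # equation (9)
    phi = w_u / (w_u + w_m)                                    # equation (13)
    return phi * statistics.fmean(cur_new) + (1 - phi) * mu_m  # equation (6)


def simulate(n=100, rho=0.9, sigma=30.0, mu=(300.0, 310.0), reps=2000, seed=0):
    """Return the empirical variances of the fixed and the repeated estimator."""
    rng = random.Random(seed)
    u = round(n / (1 + math.sqrt(1 - rho**2)))                 # equation (15)
    m = n - u
    fixed, repeated = [], []
    for _ in range(reps):
        z1 = [rng.gauss(0, 1) for _ in range(n)]
        prev = [mu[0] + sigma * z for z in z1]
        # Current values of the same n units, correlated with the previous values.
        cur = [mu[1] + sigma * (rho * z + math.sqrt(1 - rho**2) * rng.gauss(0, 1))
               for z in z1]
        new = [mu[1] + sigma * rng.gauss(0, 1) for _ in range(u)]
        fixed.append(statistics.fmean(cur))
        repeated.append(repeated_estimate(prev, cur[:m], new, m))
    return statistics.variance(fixed), statistics.variance(repeated)


v_fixed, v_repeated = simulate()
print(v_fixed, v_repeated, v_fixed / v_repeated)
```

Under these assumptions the empirical variance of the repeated estimator comes out below that of the fixed sample mean, in line with the theory in Section 2.4. The size of the gain depends strongly on the assumed correlation between occasions.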

4 Conclusion and discussion

This thesis has derived formulas for estimating the population mean and variance when using repeated sampling within the realm of stratified random sampling. After deriving the formulas, we used data from SCB (Statistics Sweden) to simulate the whole process. The results from the simulation for data set 1 show that the use of repeated sampling can improve the precision considerably (the bias drops from 0.04% to 0.017%) and lower the variance compared to fixed sampling when estimating the population mean for 2008. For 2009 we attain comparable improvements.

For these simulations, the result is that stratified random sampling with partial replacement gives a lower variance when estimating the population mean than stratified random sampling without replacement of units (a fixed sample). This is because the sample with partial replacement on the $h$-th occasion includes some units that are matched with the $(h-1)$-th occasion and some units that are matched with both the $(h-1)$-th and the $(h-2)$-th occasion. Using this sample to estimate the population mean on the $h$-th occasion gives higher precision than using a sample that is matched only with the $h$-th occasion.

During the whole process, the optimal proportion of replacement is a very important characteristic. In the section Optimality, we test whether the estimated proportion is optimal or not. We vary the proportion of replacement around our estimate from the experiment for data set 1: since our estimate is about 0.71, we choose 0.6, 0.65 and 0.75, 0.8. The results show that when the proportion gets closer to the estimated proportion, the bias of the estimate of the population mean does decrease. That means the proportion estimated in the experiment for data set 1 is a real optimal proportion of replacement.

Using repeated sampling within the realm of stratified random sampling is one of many sampling designs. Designing a sample is an art, and a successful sampling design has a direct positive influence on the result of the survey. In this thesis the results from the experimental part show that our assumption is correct.
Using repeated sampling within the realm of stratified random sampling can help us get estimates with higher precision than fixed sampling. This is a good use of sampling design. But there are also alternatives to stratified random sampling, such as single-stage cluster sampling, systematic sampling, double sampling, stratified cluster sampling and two-stage stratified sampling. We are not sure whether repeated sampling also gives estimates with higher precision within these methods, since in this thesis we only studied repeated sampling within the realm of stratified random sampling. Using repeated sampling within other sampling methods may, however, also give good sampling designs.

References

[Cochran] W.G. Cochran, Sampling Techniques, third edition, Wiley, New York.

[Eckler] A.R. Eckler, Rotation Sampling, Ann. Math. Statist.

[Gbur and Sielken] E.E. Gbur and R.L. Sielken, Rotation Sampling Design, Texas A&M University.

[Manoussakis] E. Manoussakis, Repeated Sampling with Partial Replacement of Units, The Annals of Statistics, Vol. 5, No. 4 (Jul., 1977).

[Patterson] H.D. Patterson, Sampling on successive occasions with partial replacement of units, Journal of the Royal Statistical Society, Series B, 12 (1950).

[Prabhu Ajgaonkar] S.G. Prabhu-Ajgaonkar and B.D. Tikkiwal, On classes of estimators of the variance function of a linear estimator, Volume 18, Number 1, 15-20, DOI: /BF06143

[Rao and Graham] J.N.K. Rao and Jack E. Graham, Designs for Sampling on Repeated Occasions, Journal of the American Statistical Association, Vol. 59, No. 306 (Jun., 1964).

[Ulam] N. Metropolis and S. Ulam, The Monte Carlo Method, Journal of the American Statistical Association, Vol. 44, No. 247 (Sep., 1949).

[Woller] J. Woller, The Basics of Monte Carlo Simulations, University of Nebraska-Lincoln, Physical Chemistry Lab, Spring