Confidence Intervals for Capture-Recapture Data With Matching

Executive summary

Capture-recapture data is often used to estimate populations. The classical application for animal populations is to take two samples and count how many individuals are present in both. The classical method involves assuming stochastic independence (i.e. that having been in one sample has no effect on the probability of being in the other). For animals this may be a reasonable assumption, but for humans it is much less credible. People who do not return their census forms (and hence are not in sample 1) are more likely to be less than co-operative when the enumerator calls (and hence less likely to be in the coverage survey, which is sample 2). This paper presents a Bayesian method which expresses the probability distribution of the population not only as a function of the three numbers of people (those in both samples, in sample 1 only and in sample 2 only) but also of the correlation coefficient φ between the samples. The median estimate of the population size and the upper and lower 2½ per cent points can be modelled as functions (i) of φ (to which they are quite sensitive) and (ii) of the form of the prior distribution (to which they are not very sensitive). The method is illustrated using Scottish data from the 2011 Census. Computer code, sample output and tips for improving efficiency are presented.

1 Introduction

The use of capture-recapture data to estimate animal populations dates back to the early 20th century. One current example of its use to estimate human populations is in the census, where it is common to follow up a general census enumeration (which is known not to cover all members of a population) with a census coverage survey (in which intensive effort is expended in achieving accurate counts in a minority of geographical areas). If this minority is representative, comparing the census and coverage counts allows an impression to be formed of the adjustments which must be made to the census counts to make them better reflect the actual population. However,
to do this, it is necessary to count how many persons were in both the census and the coverage survey and how many were in one but not the other. Finally, it is necessary to estimate how many persons did not feature in either enumeration. Classically, this was done by assuming that inclusion in the two samples was independent, even though there are good grounds for assuming that this is not the case and that, in particular, those with an incentive not to participate in the census are also more likely to be unwilling to take part in the coverage survey. In this report, a Bayesian argument is used to model how the estimate of the number of people missing from both samples, and its credible interval, change as a function of the extent of stochastic dependence between the two samples. First the problem is stated formally. Then the Bayesian solution is stated and the various components needed to implement the argument are developed, along with a discussion of matters relevant to computational efficiency. Finally, the method is applied to sample data sets with different assumptions about the Bayesian prior distribution.
2 The problem:

A random sample of size A + B cases is taken. A second random sample of size A + C is taken, of which A cases were also in the first sample. What is the probability that there are exactly N = A + B + C + k cases in the total population?

              S1 In   S1 Out
    S2 In       A       C
    S2 Out      B       k

3 A solution:

A complete solution is not possible solely on the basis of the information given. However, a partial solution can be developed as follows. First, we note that N differs from k only by the sum of the three known constants A, B and C. We therefore require the probability density of k conditional on A, B and C. We can write

    P(k | A, C) = P(A, C | k) P(k) / Σ_k P(A, C | k) P(k)    (1)

This is the classical Bayesian solution, in which the posterior probability is proportional to the product of the likelihood and the prior probability.

3a The likelihood function:

The vector (A, C, k) follows a multinomial distribution with integer parameter N = A + B + C + k of the form

    P(A, C | k) = [(A + B + C + k)! / (A! B! C! k!)] p_A^A p_B^B p_C^C p_k^k    (2)

The four probabilities must be constant over k since, while the cell entries vary probabilistically, the parameters on which they are based are functions of the samples and of the stochastic relationship between them, and these are fixed. The probabilities are unknown and must be based on assumptions; the method will be as credible as these assumptions. It seems credible to assume that p_A, p_B and p_C will stand in the same proportions as A, B and C, which gives us two constraints. A third is provided by the requirement that the four probabilities sum to one. Instead of imposing a fourth constraint, we express the result in terms of the degree of stochastic dependence between the samples. This is measured by the Pearson correlation coefficient for dichotomous variables, which we can define in terms of the four probabilities as

    φ = (p_A p_k − p_B p_C) / √[(p_A + p_B)(p_C + p_k)(p_A + p_C)(p_B + p_k)]

Substituting the three constraints into the expression for φ and solving for p_A yields the rather ungainly result

    p_A = [1 − (φ²(B̃ + C̃) + φ √(4 B̃ C̃ + φ²(B̃ − C̃)²)) / (2(1 − φ²))] / [(1 + B̃)(1 + C̃)]    (3)

where B̃ = B / A and C̃ = C / A. Then p_B = p_A B̃, p_C = p_A C̃ and p_k = 1 − p_A − p_B − p_C. These can then be substituted directly into (2) to calculate the likelihood.
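As a concreteness check, the cell probabilities implied by equation (3) can be sketched in Python (an illustrative translation of the formula; the function and variable names are ours, not from the code in appendix B):

```python
import math

def cell_probs(A, B, C, phi):
    """Cell probabilities (p_A, p_B, p_C, p_k) implied by the counts
    A (in both samples), B (sample 1 only), C (sample 2 only)
    and the dichotomous correlation phi, following equation (3)."""
    rb, rc = B / A, C / A                 # the ratios B-tilde and C-tilde
    t1 = (1 + rb) * (1 + rc)
    t2 = math.sqrt(4 * rb * rc + phi**2 * (rb - rc)**2)
    p_a = (1 - (phi**2 * (rb + rc) + phi * t2) / (2 * (1 - phi**2))) / t1
    p_b, p_c = p_a * rb, p_a * rc         # same proportions as A, B, C
    p_k = 1 - p_a - p_b - p_c             # the four probabilities sum to one
    return p_a, p_b, p_c, p_k

p_a, p_b, p_c, p_k = cell_probs(388, 56, 75, 0.0)
# With phi = 0 the cross-product difference p_a*p_k - p_b*p_c vanishes
print(p_a, p_b, p_c, p_k, p_a * p_k - p_b * p_c)
```

With φ = 0 the formula collapses to p_A = 1 / [(1 + B̃)(1 + C̃)], and the cross-product difference in the numerator of the φ definition is zero, as independence requires.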
We assume that there is positive correlation between the samples, so that if a case has been included in one of the samples, the probability of inclusion in the other is increased. The value of φ can be varied systematically from zero upwards to explore the effect on the confidence intervals of changing the degree of stochastic dependence between the samples. To implement these calculations it is safer to use the natural logarithm of the likelihood, as this removes the danger of numeric overflows and underflows. It is useful to note that the natural logarithm of k! is returned by the function GAMMALN(k+1) in Excel and the function lgamma(k+1) in Enterprise Guide. The value for the first term of the likelihood, where k = 0, can be found by substituting this value in (2). Thereafter it is easier to calculate each term as a function of the preceding one, using the relationship

    P(A, C | k) = P(A, C | k − 1) × N p_k / k,  where N = A + B + C + k

3b The prior distribution:

Use of the multiplication rule for probabilities in (1) implies that the prior and the likelihood must be stochastically independent, and so the numbers A, B and C cannot be used in the definition of the prior. A safe approach would be to adopt an uninformative prior with a large variance, representing little prior knowledge about the value of N. However, it may be argued that this does not use all the information available, and that other relevant data (particularly previous estimates of N, including Mid Year Estimates or MYEs) may be used. This approach is adopted in the next section. Priors can be divided into one-parameter models, such as the Poisson distribution, and two-parameter models, such as the normal distribution. The difference in practice is that two-parameter models allow the mean and variance of the prior to be controlled independently.

We now illustrate these points with hypothetical, though realistic, data.¹ In all cases the algorithm was run until the log likelihood fell below a value equal to one millionth of its highest value.

4a An example using a Poisson prior:

This illustration of
the method uses a one-parameter Poisson prior with parameter λ. The prior density is given by f(N) = e^(−λ) λ^N / N! and the mean and variance are both equal to the parameter λ. In this example it was assumed that 388 cases were found in both samples, 56 in the first one only and 75 in the second one only. It was also assumed that a previous study had returned a population estimate of 550, so a Poisson prior with parameter λ = 550 was used.²

    Table 1:                              2.5     50   97.5
    Complete independence (φ = 0.00)        4     11     18
    Low dependence (φ = 0.10)              12     21     31
    Low dependence (φ = 0.20)              22     33     46
    Low dependence (φ = 0.30)              36     49     64
    Medium dependence (φ = 0.40)           54     71     89
    Medium dependence (φ = 0.50)           81    101    122
    Medium dependence (φ = 0.60)          123    146    171

Footnotes
1) The code for all the calculations in this section is given in appendix B to this note.
2) Implementing the Poisson form for the prior is eased by noting that each term can be derived simply by multiplying the previous one by λ/N.
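The shortcut in footnote 2 can be checked numerically; a minimal Python sketch (ours, not the SAS code of appendix B) compares the term-by-term recurrence ln f(N) = ln f(N − 1) + ln(λ/N) against direct evaluation of the Poisson log density:

```python
import math

lam = 550.0   # Poisson prior parameter
N0 = 519      # starting point: A + B + C with k = 0

# Direct evaluation of ln f(N0), then the footnote-2 recurrence up to N = 600
logp = -lam + N0 * math.log(lam) - math.lgamma(N0 + 1)
for N in range(N0 + 1, 601):
    logp += math.log(lam / N)          # each term is the previous one times lam/N

direct = -lam + 600 * math.log(lam) - math.lgamma(601)
print(logp, direct)
```

The recurrence avoids re-evaluating the log-gamma function at every cycle, which is the point of the footnote.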
    Table 1 (continued):                  2.5     50   97.5
    High dependence (φ = 0.70)            193    222    253
    High dependence (φ = 0.80)            336    374    413
    High dependence (φ = 0.84)            444    487    532

Table 1 gives the 2.5, 50 and 97.5 percentage points of the posterior distribution of the number k of cases included in neither sample. The rows represent increasing degrees of dependence. The table shows that the degree of dependence makes a large difference to the outcome throughout the range of values, and that the effect increases the higher the value. As φ assumes higher values³, the calculation becomes unstable and the results cannot be relied on for values above 0.80. Informal experience suggests that the dependence is in the low range (φ equal to 0.4 or below), but even within this range there is considerable variation.

4b An example using a normal prior:

This requires both a mean and a variance. The MYE can serve as the mean, as it did for the Poisson. Confidence intervals are not published for MYEs (otherwise, the variance parameter could be derived from them), so another approach must be used. One such approach is to link the variance to the mean. The one-parameter Poisson prior above in effect did this because, with a parameter as high as 550, the Poisson closely approximates the normal distribution with mean and variance both equal to 550. Table 1 was recompiled using the normal density with these parameters, but the result was too close to Table 1 to merit replication. Only at the bottom end of the table did the results differ perceptibly, with the normal giving slightly lower cut-offs: for φ = 0.8, for example, the figures were 301, 334 and 367. Setting the variance equal to the mean is equivalent to assuming that the estimate of 550 has a standard deviation of 23.4, or 4.3 per cent, which could well be reasonable for MYEs. To explore the effect of changing the variance parameter, two further simulations were undertaken using the same mean of 550 but variances of 450 and 700 (equivalent to standard deviations of 3.9 and 4.8 per cent respectively). The results are given in Table 2, which shows that the sensitivity of the posterior distribution
cut-offs to changes in the variance of the normal prior increases with the degree of dependency between the samples, but is only important for degrees which are higher than those likely to be met in practice.

    Table 2: Using a normal prior with μ = 550
                                           σ² = 450              σ² = 700
                                        2.5     50   97.5     2.5     50   97.5
    Complete independence (φ = 0.00)      4     11     18       4     10     18
    Low dependence (φ = 0.10)            12     21     31      12     21     31
    Low dependence (φ = 0.20)            22     33     46      22     33     46
    Low dependence (φ = 0.30)            35     49     64      36     50     65
    Medium dependence (φ = 0.40)         54     70     87      55     72     90
    Medium dependence (φ = 0.50)         79     98    118      83    103    124
    Medium dependence (φ = 0.60)        116    138    161     126    150    176
    High dependence (φ = 0.70)          174    199    226     199    228    258
    High dependence (φ = 0.80)          274    304    335     336    372    409

Footnote
3) While the upper bound of φ is in theory one, for some combinations of A, C and k it may be significantly less than this. This can lead to results which are out of bounds, such as negative values for probabilities where the bracketed term in (3) assumes a negative value.
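The failure mode described in footnote 3 is easy to reproduce; a small Python sketch (ours, using the example counts) shows p_A from equation (3) turning negative once φ is pushed high enough:

```python
import math

def p_a(A, B, C, phi):
    # p_A from equation (3); rb = B/A, rc = C/A
    rb, rc = B / A, C / A
    t1 = (1 + rb) * (1 + rc)
    t2 = math.sqrt(4 * rb * rc + phi**2 * (rb - rc)**2)
    return (1 - (phi**2 * (rb + rc) + phi * t2) / (2 * (1 - phi**2))) / t1

print(p_a(388, 56, 75, 0.5))   # valid: a positive probability
print(p_a(388, 56, 75, 0.9))   # out of bounds: negative, so phi = 0.9 is not attainable here
```

This is why the attainable upper bound of φ depends on the observed counts, and why results near φ = 0.8 and above should be treated with caution.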
    Table 2 (continued):                   σ² = 450              σ² = 700
                                        2.5     50   97.5     2.5     50   97.5
    High dependence (φ = 0.84)          338    370    403     429    468    508

Implementing the normal form for the prior is eased by expressing each term as a function of the previous one, using the relationship

    ln f(N) = ln f(N − 1) + (μ − N + ½) / σ²

A diagnostic statistic of interest concerns the compatibility between the prior and posterior distributions. In Bayesian analysis it is common (and safe) to use an uninformative prior, as this will be compatible with any posterior distribution which is likely to occur in practice. However, if an informative prior is used, as here, it is useful to be aware of the extent of the compatibility. For large values of A, B and C, an approximate method of doing this is to divide the difference between the means of the prior and posterior distributions by the standard deviation of the difference, the latter being the square root of the sum of their variances. The ratio is approximately normally distributed with zero mean and unit variance. Formally,

    z = (μ_posterior − μ_prior) / √(σ²_posterior + σ²_prior)

For the data given above, z took values from −0.83 where φ = 0.0, through +0.11 for φ = 0.2 and +1.60 for φ = 0.4, to +10.50 for φ = 0.8. The values varied very little from table to table.

5 Conclusion:

The method outlined in section 3 and illustrated in section 4 can be applied in practice to capture-recapture data. There are two elements of uncertainty which must be addressed to apply it successfully. The first element concerns the form of the prior and the parameters used for it. The central limit theorem shows that the distributions of sums and means of samples tend to the normal distribution as sample size increases. This is true for all distributions, subject only to the observations being stochastically independent. In this sense the normal is the limiting case of all other distributions. It is a common choice for the prior in Bayesian analyses and can be used here. MYEs can be used for the mean, and any reasonable value for the variance, as the results are not sensitive to this at low levels of dependency. The second element concerns the likelihood, where an assumption must be made
about the degree of stochastic dependency between the two samples. Section 4 suggests that the results are sensitive to this, and it will be important to have a rationale for how the question is dealt with. It is not necessary to assume a single value for the correlation coefficient. We might use the knowledge we have about the correlation to assume that it is not less than 0.10 but not greater than 0.3. A distribution over the remaining range would then express both what we do know about φ and what we do not know. This could be implemented in the WinBUGS software by expressing φ as following a beta distribution with the bulk of the probability mass concentrated in the interval from 0.1 to 0.3. In practice, however, a close approximation to this could be obtained by taking a weighted average of the relevant rows in Table 2. If, for example, we are content to assume that φ equals 0.3, 0.2 and 0.1 with probabilities 0.3, 0.4 and 0.3 respectively, then the percentage points (rounded to the nearest integer) would be 23, 34 and 47. In summary, subject to a satisfactory decision about the degree of dependency between the samples, this method gives technically very defensible results for the confidence intervals of population estimates based on capture-recapture data.
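The weighted-average shortcut can be written out explicitly; a Python sketch (ours) using the σ² = 450 rows of Table 2:

```python
# Percentage points (2.5, 50, 97.5) for phi = 0.3, 0.2, 0.1,
# taken from the sigma-squared = 450 columns of Table 2
rows = {0.3: (35, 49, 64), 0.2: (22, 33, 46), 0.1: (12, 21, 31)}
weights = {0.3: 0.3, 0.2: 0.4, 0.1: 0.3}   # assumed P(phi = 0.3, 0.2, 0.1)

mixed = [round(sum(weights[phi] * rows[phi][i] for phi in rows))
         for i in range(3)]
print(mixed)
```

which reproduces the 23, 34 and 47 quoted above.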
Appendix A: Comparison of the method with the Chapman estimator and variance

The Chapman method assumes that the two samples are stochastically independent. The estimate of the population and its approximate variance are given by

    N̂ = (A + B + 1)(A + C + 1) / (A + 1) − 1

and

    var(N̂) = (A + B + 1)(A + C + 1) B C / [(A + 1)²(A + 2)]

In the case of the above example, this gives an estimate of 530 with a variance of 14.7 which, assuming a normal approximation, gives upper and lower cut-offs of 522.5 and 537.5. The Complete independence rows of the three tables in section 4 give corresponding values of 523, 530 and 537. Thus, while the estimate is the same, the above method offers slightly smaller confidence intervals than the Chapman method.
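For reference, the Chapman calculation above can be sketched in Python (ours, not part of the paper's code):

```python
def chapman(A, B, C):
    """Chapman estimator and approximate variance, assuming independence.
    A = in both samples, B = sample 1 only, C = sample 2 only."""
    n_hat = (A + B + 1) * (A + C + 1) / (A + 1) - 1
    var = (A + B + 1) * (A + C + 1) * B * C / ((A + 1) ** 2 * (A + 2))
    return n_hat, var

n_hat, var = chapman(388, 56, 75)
print(round(n_hat), round(var, 1))
```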
Appendix B: Code used for the calculations in section 4

* The first step calculates and stores the product terms which are
  proportional to the posterior distribution, and the proportionality
  constant. The second step calculates the posterior distribution and
  the three cut-offs;

data products (keep = k loglk logpr) const (keep = const);
  N1 = 56; N2 = 75; N12 = 388; * Numbers of cases in sample 1 only, sample 2 only, and both samples;
  ncyc = 5000;    * The maximum number of cycles;
  crit = -10000;  * The criterion below which no further cycles are run;
  const = 0;      * The constant of proportionality;
  sum = 0;        * The sum of the prior distribution;
  phi = 0.20;     * The dichotomous correlation;
  R1 = N1 / N12;
  R2 = N2 / N12;
  t1 = (1 + R1) * (1 + R2);
  t2 = sqrt(4 * R1 * R2 + phi**2 * (R1 - R2)**2);
  p1 = (1 / t1) * (1 - (phi**2 * (R1 + R2) + phi * t2) / (2 * (1 - phi**2)));
  p2 = p1 * R1;
  p3 = p1 * R2;
  pk = 1 - p1 - p2 - p3;  * The probability parameters;
  k = 0;                  * k is the number of cases not in either sample;
label1:
  N = k + N1 + N2 + N12;
  if k = 0 then do;
    loglk = lgamma(1 + N) - lgamma(1 + N12) - lgamma(1 + N1) - lgamma(1 + N2)
            + N12 * log(p1) + N1 * log(p2) + N2 * log(p3);
    * For the normal prior we have;
    mu = 550;
    var = 450;
    z = (N - mu) ** 2 / var;
    logpr = -(log(2 * 3.1415926 * var) + z) / 2;
    /* For the Poisson prior we have;
       lamda = 550;
       logpr = -lamda + N * log(lamda) - lgamma(N + 1); */
  end;
  else do;
    loglk = loglk + log(N / k) + log(pk);
    logpr = logpr + (mu - N + 0.5) / var;
    * For the Poisson prior, logpr = logpr + log(lamda / N);
  end;
  const = const + exp(loglk + logpr);
  sum = sum + exp(logpr);
  output products;
  k = k + 1;
  if k le ncyc and loglk gt crit then goto label1;
  output const;
  const = const * 10 ** 20;
  put " Constant of proportionality * 10 ** 20 = " const;
  put " Number of cycles run = " k;
  acc = int(-log(max(sum, 2 - sum) - 1) / log(10));
  put " Sum of prior probabilities is " sum " and equals one to " acc " places of decimals ";
run;

data _null_;
  set const;
  call symput('const', const);
  put " const = " const;
run;
data _null_;
  retain s0 s1 s2 0 p975 p500 p025;
  const = symget('const');
  set products end = eof;
  term = exp(loglk + logpr) / const;
  if s0 lt 0.025 and s0 + term ge 0.025 then p025 = max(k - 1, 0);
    * do not include the next term, as it would take you over the boundary;
  if s0 lt 0.5 and s0 + term ge 0.5 then do;
    if s0 + term / 2 gt 0.5 then p500 = k - 1;
    else p500 = k;
  end;
  if s0 lt 0.975 and s0 + term ge 0.975 then p975 = k;
    * do include the next term, so that it does take you over the boundary;
  s0 = s0 + term;
  s1 = s1 + term * k;
  s2 = s2 + term * k * k;
  if eof then do;
    CI = p975 - p025;
    acc = int(-log(max(s0, 2 - s0) - 1) / log(10));
    put " Percentage points of the number of cases in neither sample:";
    put " 2.5 point = " p025 " median = " p500 " 97.5 point = " p975;
    put " Sum of posterior probabilities is " s0 " and equals one to " acc " places of decimals ";
    * At the end, s0 is the sum of the whole posterior distribution, which
      should equal one; acc measures how close it gets;
    z = (388 + 56 + 75 + s1 - 550) / sqrt(550 + s2 - s1**2);
    put " Measure of the consistency between the prior and the posterior = " z;
  end;
run;
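As a cross-check on the two SAS steps above, the whole calculation can be sketched in Python (an illustrative translation, ours, not the production code; shown here with the Poisson prior, λ = 550 and φ = 0, so the printed cut-offs should sit close to the complete-independence row of Table 1):

```python
import math

A, B, C = 388, 56, 75      # in both samples, sample 1 only, sample 2 only
phi = 0.0                  # complete independence
lam = 550.0                # Poisson prior parameter

# Cell probabilities from equation (3)
rb, rc = B / A, C / A
t1 = (1 + rb) * (1 + rc)
t2 = math.sqrt(4 * rb * rc + phi**2 * (rb - rc)**2)
p_a = (1 - (phi**2 * (rb + rc) + phi * t2) / (2 * (1 - phi**2))) / t1
p_b, p_c = p_a * rb, p_a * rc
p_k = 1 - p_a - p_b - p_c

# Step 1: unnormalised posterior terms over k, built recursively
terms = []
loglk = logpr = 0.0
for k in range(5000):
    N = A + B + C + k
    if k == 0:
        loglk = (math.lgamma(N + 1) - math.lgamma(A + 1) - math.lgamma(B + 1)
                 - math.lgamma(C + 1)
                 + A * math.log(p_a) + B * math.log(p_b) + C * math.log(p_c))
        logpr = -lam + N * math.log(lam) - math.lgamma(N + 1)
    else:
        loglk += math.log(N * p_k / k)   # L(k) = L(k-1) * N * p_k / k
        logpr += math.log(lam / N)       # footnote-2 recurrence for the prior
    terms.append(math.exp(loglk + logpr))
    if loglk < -10000:                   # stopping criterion, as in the SAS code
        break

# Step 2: cut-offs of the posterior, using the same boundary rules as the SAS code
total = sum(terms)
cum, p025, p500, p975 = 0.0, None, None, None
for k, t in enumerate(terms):
    u = t / total
    if p025 is None and cum + u >= 0.025:
        p025 = max(k - 1, 0)             # do not cross the 2.5 per cent boundary
    if p500 is None and cum + u >= 0.5:
        p500 = k - 1 if cum + u / 2 > 0.5 else k
    if p975 is None and cum + u >= 0.975:
        p975 = k                         # do cross the 97.5 per cent boundary
    cum += u
print(p025, p500, p975)
```

The log-scale recurrences mirror the SAS loop, so the same efficiency remarks apply: only the k = 0 term needs the log-gamma function for the prior.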