Regression Analysis of Probability-Linked Data

Official Statistics Research Series, Vol 4, 2009
ISSN ; ISBN (Online)

Regression Analysis of Probability-Linked Data

Ray Chambers
Centre for Statistical and Survey Methodology, University of Wollongong

This report was commissioned by Official Statistics Research, through Statistics New Zealand. The opinions, findings, recommendations and conclusions expressed in this report are those of the author(s), do not necessarily represent Statistics New Zealand and should not be reported as those of Statistics New Zealand. The department takes no responsibility for any omissions or errors in the information contained here.

Abstract

Data obtained after probability linkage of administrative registers will typically include errors because some linked records contain data items that are sourced from different individuals. Such errors can induce bias in standard statistical analyses if ignored. In this report we describe some approaches to eliminating this bias in the case of linear regression analysis and, more generally, when inference is based on an estimating equation, with an emphasis on logistic regression. Simulation results that illustrate the gains from allowing for linkage error in linear and logistic regression analysis are presented, as are extensions of the approach to situations where a sample is linked to a register and where the linked registers are of unequal size.

Keywords

Record matching, linkage errors, linear regression, logistic regression, estimating equations, measurement error.

Reproduction of material

Material in this report may be reproduced and published, provided that it does not purport to be published under government authority and that acknowledgement is made of this source.

Citation

Chambers, R. (2009). Regression analysis of probability-linked data, Official Statistics Research Series, 4. Available from

Published by Statistics New Zealand Tatauranga Aotearoa, Wellington, New Zealand
ISSN (Online) ISBN (Online)

Acknowledgements

The theory set out in this paper was not developed in a vacuum. It has benefited considerably from advice and critical input from Walt Davis of Statistics New Zealand, Milorad Kovacevic of Statistics Canada and Glenys Bishop and James Chipperfield of the Australian Bureau of Statistics. My thanks go out to all of them for their encouragement. Also, I would like to acknowledge the input of the referee who provided me with the details of the Neter et al. (1965) reference. This is a well-written paper that nicely summarises many of the statistical issues that I have attempted to tackle in this report.

Contents

1 Introduction
    Background and assumptions
    Research questions
2 Linear regression using linked data
    Bias-corrected OLS inference
    Efficient linear estimation using linked data
    Maximum likelihood using linked data
    A fixed population approach
3 Using estimating functions with probability-linked data
    Correcting estimating functions for linkage error
    Application to linear and logistic regression
    Variance estimation when linkage probabilities are estimated
    Maximum likelihood logistic regression with linked data
4 Simulation analysis
    Simulation of linear regression with linked data
    Simulation of logistic regression based on linked data
5 Regression analysis under sample to register linkage
6 Regression analysis under nested linkage
    Using estimating functions with nested linked data
    Fitting linear and logistic models to nested linked data
    Reversing the nesting
7 Conclusions and further research
References
Appendix 1 Approximating the V matrix
Appendix 2 R Code for linear model fitting and simulation
    R functions for linear regression analysis
    R code for linear model simulations
        Simulation of known lambda case
        Simulation of estimated lambda case
Appendix 3 R code for logistic model fitting and simulation
    R functions for logistic regression analysis
    R Code for logistic model simulations
        Simulation of known lambda case
        Simulation of estimated lambda case

List of tables

Table 1 Options for G (θ) in logistic regression
Table 2 Specification of Ĝ and θû in (50) for the linear case
Table 3 Simulation results for the linear model
Table 4 Simulation results for slope estimators under the logistic model

List of figures

Figure 1 Boxplots of percentage relative errors generated by different estimators in linear model simulations
Figure 2 Boxplots of percentage relative errors generated by different slope estimators in logistic model simulations

1 Introduction

In their seminal paper on the topic, Fellegi and Sunter (1969) defined record linkage as "a solution to the problem of recognizing those records in two files which represent identical persons, objects, or events...". Record linkage allows data for a single individual to be compiled from different data sources, enabling more powerful and effective analyses to be carried out than would otherwise be the case. In particular, datasets created by linking individual records constitute a critical resource for research in health, epidemiology, economics, demography, sociology and many other scientific areas. National statistical agencies increasingly rely on linking surveys to administrative registers to provide more accurate measurement and to reduce respondent burden. Frequently, one or more datasets (whether all administrative data or a mix of administrative and survey data) are linked to answer a broader array of research questions than can be addressed through any of the datasets individually.

Linked longitudinal datasets are particularly useful in health-related research. These are datasets created by matching individual health and health-related records from a variety of sources over a period of time. For example, a longitudinal dataset created by linking hospital admission and general practitioner records to private health insurance expenditure records for individuals in a particular social and/or demographic group could be used to build models for how changes in that group's health expenditure influence subsequent uptake of medical services. This type of linkage is able to bring together a much better picture of the driving factors behind many public health issues.
Thus, using data obtained from linking physician billing claims held by the Ontario Health Insurance Plan with data for consenting Ontario respondents to the 1994/95 Canadian National Population Health Survey, Iron, Manuel and Williams (2003) report on an analysis of the relationship between utilization and costs of physician services and incidence of self-reported chronic conditions for residents of Ontario province in Canada.

Data linkage is not confined to the health sciences. In a review commissioned by the UK Department for Trade and Industry, Chesher and Nesheim (2006) describe the extensive use of data linkage in economic research, particularly in the United States. Statistics New Zealand has recently developed a linked longitudinal employer-employee dataset based on linking administrative data held on the NZ Inland Revenue Department's tax system and Statistics New Zealand's list of NZ businesses. This dataset allows the analysis of job and worker flows, employment tenure, multiple job holding and business demography. Similarly, the Census Data Enhancement Initiative of the Australian Bureau of Statistics (ABS) aims to create a Statistical Longitudinal Census Dataset that integrates census data from the same individuals over a number of censuses, with the objective of building a research resource for longitudinal analysis of the Australian population. In the UK, the Interdepartmental Migration and Population Task Force set up by the Office for National Statistics has recently recommended the use of record linkage to improve migration and population statistics in the UK. The aim in this case is to link administrative, health register, school enrollment and university student data with incoming passenger survey and labour force survey data to create an integrated longitudinal data set that will allow in-depth analysis of the UK migrant experience.

The process used to link datasets often involves a probabilistic matching of records from one dataset to another.
In most linkage operations, matching variables present in both datasets are used to maximise the probability that the values of the variables making up the linked record are the correct measurements for the population unit corresponding to that record. However, when analysis is undertaken using the resulting linked data, the errors inherent in this type of record matching are typically ignored. This is unfortunate since these errors introduce bias and additional variability into standard statistical estimation techniques. This poses a significant barrier to policy-relevant research using probabilistically linked data.

Statistical methods for linking datasets are now well established (Herzog, Scheuren and Winkler, 2007), with recent statistical research in this area mainly focused on the confidentiality issues that arise as a consequence of linkage. See Sibthorpe, Kliewer and Smith (1995) and Trutwein, Holman and Rosman (2005) for an Australian health data perspective, and Mackie and Bradburn (2000) for contributions to a workshop on confidentiality and linkage sponsored by the US Committee on National Statistics and the Institute of Medicine. In contrast, aside from the notable contributions of Neter et al. (1965), Scheuren and Winkler (1993) and Lahiri and Larsen (2005), there has been comparatively little methodological research carried out on the impact of linkage errors on analysis of linked data.

Linkage errors are the errors caused by incorrectly linking different population units as well as the errors caused by not linking the same population units in the datasets that are linked. These errors are a particular type of measurement error, and can lead to biased inference unless appropriate steps are taken to control and/or adjust for this bias.

In this report we develop a methodological framework that can be used to provide appropriate modifications to standard statistical analysis methods to ensure that they remain unbiased when used with probabilistically linked data. The framework is based on modelling the relationship between the probabilistically linked data and the true data that would be obtained if error-free linkage were possible. Inference then proceeds on the basis of a combined model defined by the integration of this linkage error model with the statistical model for the true data values that is of primary interest. Our assumptions about the data linkage situation and a description of a simple model for linkage errors are set out in the following sub-section.
In section 2 we apply these ideas to fitting a simple linear regression model to linked data from two registers that each cover the same population. In section 3 this theory is generalised to the case where the statistical model of interest is fitted via the solution of an estimating equation, with application to logistic regression serving to motivate our approach. Simulation results for both linear and logistic regression are described in section 4 and illustrate the potential gains from the modified analytic methods that we propose. In section 5 we extend our framework to the important case of linking a survey to a register, while in section 6 we look at another important extension, where the registers that are linked are nested, in the sense that the population making up one of the linked registers is a subgroup of the population making up the other. Section 7 concludes the report with a short discussion of avenues for further research.

1.1 Background and assumptions

In what follows we assume the existence of a population of $N$ units, indexed by $i = 1, \dots, N$, such that, for each unit in this population, it is possible to measure the values of a scalar random variable $Y$ and a vector random variable $X$. We are interested in modelling the relationship between $Y$ and $X$ in this population, and in particular we seek to fit a model of the form $E(Y \mid X) = g(X; \beta)$ for the regression of $Y$ on $X$. Here $g$ corresponds to a known functional form while the parameter $\beta$ is unknown and needs to be estimated. This is usually straightforward if we have the values of $Y$ and $X$ for a random sample of units from this population. Unfortunately, we do not have such a sample. Instead, we have access to two registers that separately contain the population values of $Y$ and $X$. We shall refer to these as the Y-register and X-register respectively from now on. For the time being we also assume that both registers refer to the same population and have no duplicates, so each is made up of $N$ records.
If each unit in the population has a unique identifier, and this identifier is also stored on both registers, we can use it to link records from the two registers, and then estimate $\beta$ using the $Y$ and $X$ values associated with the $N$ linked records. Unfortunately, such a unique identifier does not exist. Instead, some form of probability-linking algorithm is used to associate (i.e. link) records on the X-register with records on the Y-register. This algorithm makes it possible (at least conceptually) to link every record on the X-register with a record on the Y-register. That is, linkage is complete and one to one between the Y and X-registers. Clearly, the data set constructed by this process (the linked data) can contain linkage errors, i.e. records where the values of $Y$ and $X$ actually come from different population units.

Although it may be theoretically possible for any two records on the Y and X-registers to be linked, most reasonable probability linking algorithms will only attempt to link records that are similar in some sense. Consequently, we shall assume that the linked records can be partitioned into $Q$ distinct blocks such that there is no possibility that linked records in different blocks contain data for the same population unit. We model this situation by assuming that the different blocks correspond to different values of a categorical population variable $Z$ that can be derived from the information on either register, and which is defined in such a way that if a record on one register does not have the same value of $Z$ as the record on the other register, then it is reasonable to assume that these two records cannot correspond to the same unit in the original population. Conversely, the fact that a Y-register record and an X-register record have the same value for $Z$ does not guarantee that they correspond to the same unit, and so linkage errors can still occur within a block. We refer to $Z$ as a blocking variable, and to those population units with the same value of $Z$ as being in the same block.

Note that errors in measurement of $Z$ can lead to the same population unit having one value of $Z$ on the Y-register and another on the X-register, which invalidates the assumption of no linkage errors when Y and X-register records have different values of $Z$. Consequently, we shall assume that $Z$ is measured without error on both the Y-register and the X-register. With this set-up, data linkage errors only occur among records in both registers in the same block.
This property of the blocking variable $Z$ indicates a subtle but key difference between the use of the blocking concept in our development and its use in data linkage methodology. In the latter case, blocking variables define stages (or "passes") in the linkage process, where at any particular stage matching is carried out with respect to a particular blocking variable. That is, only those remaining unmatched records at this stage with the same value for this blocking variable are considered as potential matches. However, once all matches at a particular stage of the process are declared, all remaining unmatched records are then considered as candidates for matching at the next stage using another blocking variable. Consequently it is quite possible that links can be created between Y and X-register records that have different values for any particular blocking variable.

In our case the blocking variable $Z$ is an ex-post construct. It defines a partition of the declared links into groups such that all linkage errors are isolated within the groups; there are no errors that cross group boundaries. Clearly $Z$ can be defined in terms of the blocking variables used in creating the links, but there is no fundamental requirement for this. The main requirement is that $Z$ partitions both the Y and X-registers so that all (or virtually all) linkage errors are confined to the groups of records defined by the distinct values of this variable.

Without loss of generality, we denote the $Q$ distinct values taken by $Z$ by $1, 2, \dots, Q$. Let block $q$ correspond to the $M_q$ population units with $Z = q$, so $N = \sum_q M_q$. Since $Z$ is measured without error in both registers, and linkage is complete, the number of records in block $q$ in each register is the same, i.e. $M_q$. Let $i$ index the records in the linked data set. Again, without loss of generality we assume that this index is the same as the one used to index the X-register, i.e. the linkage process associates a record from the Y-register, with its associated Y-value, with each record on the X-register. In block $q$ we then have $M_q$ linked data pairs $(y_i^{*}, x_i)$, where $y_i^{*}$ denotes the Y-value from block $q$ on the Y-register that is matched to $x_i$. More accurately, the record with $Y = y_i^{*}$ in block $q$ on the Y-register is matched to the record with $X = x_i$ in block $q$ on the X-register.

We use $y_q^{*}$ to denote the vector of order $M_q$ defined by the linked values $y_i^{*}$ in block $q$ and $X_q$ to denote the matrix with rows defined by the values $x_i$ in the same block. Also, let $y_q$ denote the unknown vector of order $M_q$, with entries indexed as in the X-register, that corresponds to the true Y values associated with $X_q$. Since one and only one of the $M_q$ records in block $q$ on the Y-register can be matched to each distinct record in the corresponding block on the X-register, we model randomness in the outcome of the linkage process via the identity

$$ y_q^{*} = A_q y_q \tag{1} $$

where $A_q = [a_{ij}^{q}]$ is an unknown random permutation matrix of order $M_q$. Note that the entries $a_{ij}^{q}$ of $A_q$ are either zero or one, with a value of one occurring just once in each row and column. Also, since we are assuming that linkage errors are confined to blocks, it is natural to impose the condition that $A_{q_1}$ and $A_{q_2}$ are independently distributed when $q_1 \neq q_2$.

Clearly inference based on linked data will involve assumptions about the distribution of the $A_q$. In this report we assume that linkage is non-informative at each level of $Z$, i.e. the distribution of $A_q$ is independent of $y_q$ given $X_q$. Let

$$ E(A_q \mid X_q) = E_q. \tag{2} $$

Given the care that typically goes into the construction of a linked data set, it seems reasonable that a declared link is more likely to be correct than incorrect. Although the probability that such a link is correct will typically vary between the records that make up the linked dataset, as a first approximation we assume that the probability of correct linkage is the same for all records in a block. We also assume that it is equally likely that any two Y-records in the same block that are not linked to a particular X-record in that block could in fact be the correct link for this record. A simple way of characterising both of these assumptions is via an exchangeable linkage errors model, where for each value of $q$ and for $i \neq j$,

$$ \Pr(\text{correct linkage}) = \Pr(a_{ii}^{q} = 1) = \lambda_q \tag{3} $$

$$ \Pr(\text{incorrect linkage}) = \Pr(a_{ij}^{q} = 1) = \gamma_q. \tag{4} $$

Given that (3) and (4) hold, it follows that (2) is then of the form

$$ E_q = (\lambda_q - \gamma_q) I_q + \gamma_q \mathbf{1}_q \mathbf{1}_q' \tag{5} $$

where $I_q$ is the identity matrix of order $M_q$ and $\mathbf{1}_q$ denotes a vector of ones of length $M_q$. Since $\mathbf{1}_q' A_q = \mathbf{1}_q'$ and $A_q \mathbf{1}_q = \mathbf{1}_q$ we have $\mathbf{1}_q' E_q = \mathbf{1}_q'$ and $E_q \mathbf{1}_q = \mathbf{1}_q$, which means that (5) implies

$$ \lambda_q + (M_q - 1)\gamma_q = 1. \tag{6} $$

In other words, we just need to specify $\lambda_q$ in order to completely specify the first order properties of the linkage mechanism under the model (5). This will be particularly useful later since estimation of $\lambda_q$ requires only that we know whether a defined link is correct or incorrect, and not the identity of the correct link.

The model specified by (3) and (4) represents what is probably the simplest way of characterising the behaviour of a probability-based linkage process, and will form the basis for the theory developed in this report. It was originally suggested by Neter et al. (1965) in a groundbreaking paper that investigated its use in assessing the impact of linkage error on response error analysis, where alternative data sources were linked to respondent records in order to assess the extent of response error in these records. As these authors note, and as we shall see in the next section, the impact of linkage error defined by (3) and (4) is to attenuate the relationship between the study variable (in their case the difference between the survey value and the linked alternative value) and explanatory covariates.

Depending on the available information from the operation of the linkage process, more sophisticated models for linkage error can be formulated. For example, it may be the case that the Y and X-registers are ordered so that only nearby records in the linked data can possibly correspond to the correct link. This can be modelled by replacing (4) by

$$ \Pr(\text{incorrect linkage}) = \Pr(a_{ij}^{q} = 1) = \begin{cases} \dfrac{1 - \lambda_q}{2m} & \text{if } |j - i| \le m \\ 0 & \text{otherwise} \end{cases} $$

with appropriate modifications for values of $i$ close to either the beginning or the end of the X-register. Here $2m$ denotes the number of nearest neighbours to $y_i^{*}$ in the linked data set that can actually contain the correct value $y_i$.

Another extension is where there exists another variable on the X-register, say $W$, with values $w_i$ that vary within a block, such that the probability of a correct link depends on these values.
For example, we could have

$$ \Pr(\text{correct linkage}) = \Pr(a_{ii}^{q} = 1) = p(w_i, w_i; \lambda) $$

and, for $i \neq j$,

$$ \Pr(\text{incorrect linkage}) = \Pr(a_{ij}^{q} = 1) = p(w_i, w_j; \lambda) $$

where $p(w_i, w_j; \lambda)$ is a function that (i) takes values in the interval $[0, 1]$; (ii) is maximised when $w_i = w_j$; and (iii) satisfies $\sum_{j=1}^{M_q} p(w_i, w_j; \lambda) = 1$. An obvious candidate function in this case is where $p(w_i, w_j; \lambda)$ is proportional to $\exp(-\lambda |w_i - w_j|)$. Note however that if $W$ is categorical and available on both registers, then by including it in the definition of the blocking variable $Z$ we recover the situation implicit in the exchangeable linkage errors model, where all linked records in the same block have the same probability of being incorrectly linked.

1.2 Research questions

Given the preceding development, there are a number of questions that immediately arise.

1. What are the properties of the estimator of $\beta$ based on the linked data that assumes all linkages are correct?
2. Are there more efficient ways of estimating $\beta$ using the linked data?

The methodological framework described in the previous sub-section was based on a number of strong assumptions about the linkage process that will typically be violated. As a consequence, we can ask further questions.

3. How do we need to modify our inference when linkage is incomplete (i.e. there are unlinked records in one or both of the Y and X-registers)?
4. What happens when one or both of the Y and X-registers are based on sample survey data? How do we integrate sample selection and linkage in inference?
5. We have assumed that all components of $X$ are on one register. What happens if some components are actually held on the Y-register? More generally, what happens if components of $X$ are held on different registers and these are linked either prior to the linkage to the Y-register or simultaneously with the linkage to the Y-register?

In the rest of this report we develop some theory that may help in answering these questions.
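As a concrete instance of the exchangeable linkage errors model of section 1.1, the following worked example evaluates (5) and (6) for a single block. The block size $M_q = 5$ and the correct-linkage probability $\lambda_q = 0.8$ are illustrative values chosen here, not values taken from the report.

```latex
% Illustrative values (assumed): M_q = 5, \lambda_q = 0.8.
% From (6): \gamma_q = (1 - \lambda_q)/(M_q - 1).
\gamma_q = \frac{1 - 0.8}{5 - 1} = 0.05,
\qquad
E_q = (0.8 - 0.05)\, I_5 + 0.05\, \mathbf{1}_5 \mathbf{1}_5'
    = \begin{pmatrix}
        0.80 & 0.05 & 0.05 & 0.05 & 0.05 \\
        0.05 & 0.80 & 0.05 & 0.05 & 0.05 \\
        0.05 & 0.05 & 0.80 & 0.05 & 0.05 \\
        0.05 & 0.05 & 0.05 & 0.80 & 0.05 \\
        0.05 & 0.05 & 0.05 & 0.05 & 0.80
      \end{pmatrix}.
% Each row and column sums to 0.80 + 4(0.05) = 1, as required by
% E_q \mathbf{1}_5 = \mathbf{1}_5 and \mathbf{1}_5' E_q = \mathbf{1}_5'.
```

In this example a declared link is correct with probability 0.8, and each of the four other Y-records in the block is the true match with probability 0.05, so a single parameter per block fixes the whole matrix.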

2 Linear regression using linked data

In this section we consider the situation where the widely used linear regression model is the focus of inference. That is, the population values of $Y$ and $X$ in each block (i.e. those associated with population units with the same value of $Z$) satisfy

$$ E_X(y_q) = X_q \beta = f_q \tag{7} $$

$$ \mathrm{Var}_X(y_q) = \sigma^2 I_q \tag{8} $$

where we use a subscript of $X$ to denote conditioning on the value $X_q$ of the explanatory variables in block $q$. Note that in addition to the regression parameter $\beta$ in (7), which is the target of inference, (8) now includes an unknown scale parameter $\sigma^2$. Given the $y_q$ and $X_q$, the optimal estimator of $\beta$ is then its Ordinary Least Squares (OLS) estimator

$$ \hat\beta = \Big( \sum_q X_q' X_q \Big)^{-1} \sum_q X_q' y_q. \tag{9} $$

Unfortunately, unless the linkage is perfect, (9) cannot be calculated. Instead, what is usually done is to substitute the linked data values $y_q^{*}$ for $y_q$ in this expression, which leads to the naive linked data OLS estimator

$$ \hat\beta^{*} = \Big( \sum_q X_q' X_q \Big)^{-1} \sum_q X_q' y_q^{*}. \tag{10} $$

2.1 Bias-corrected OLS inference

Under the linkage error model (1) it is easy to see that (10) is actually

$$ \hat\beta^{*} = \Big( \sum_q X_q' X_q \Big)^{-1} \sum_q X_q' A_q y_q. $$

Under non-informative linkage

$$ E_X(A_q y_q) = E_X(A_q) E_X(y_q) = E_q f_q $$

so

$$ E_X(\hat\beta^{*}) = \Big( \sum_q X_q' X_q \Big)^{-1} \Big( \sum_q X_q' E_q X_q \Big) \beta = D\beta. \tag{11} $$

That is, the naive OLS estimator (10) based on the linked data set is biased. Provided $E_q$ is known and the inverse of the matrix $D$ in (11) exists, an unbiased estimator of $\beta$ in this situation is

$$ \hat\beta_R = D^{-1} \hat\beta^{*} = \Big\{ \Big( \sum_q X_q' X_q \Big)^{-1} \sum_q X_q' E_q X_q \Big\}^{-1} \hat\beta^{*} $$

which, since $\sum_q X_q' E_q X_q$ is then of full rank, reduces to

$$ \hat\beta_R = \Big( \sum_q X_q' E_q X_q \Big)^{-1} \sum_q X_q' y_q^{*}. \tag{12} $$

Note that the subscript of $R$ used to denote the estimator defined by (12) serves as a reminder that this estimator is based on a ratio-type correction for the bias in the naive estimator (10).
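The attenuation in (11) and the ratio-type correction in (12) can be sketched numerically. The following is a minimal illustration only, using a single block, an intercept plus one covariate, and assumed values for `M`, `lam`, `beta` and `x`; the helper names (`exchangeable_E`, `matmul`, `transpose`, `solve2`) are hypothetical and are not taken from the report's R appendices. Rather than simulating linkage, it applies the estimators to the expected linked values $E_q f_q$, so the results are the expectations of (10) and (12) when the regression errors are ignored.

```python
# Sketch of (11) and (12) for one block under the exchangeable model (5)-(6).
# All numerical values below are assumptions chosen for illustration.

def exchangeable_E(M, lam):
    """Return E_q = (lam - gamma) I + gamma 1 1', with gamma fixed by (6)."""
    gamma = (1.0 - lam) / (M - 1)
    return [[lam if i == j else gamma for j in range(M)] for i in range(M)]

def matmul(A, B):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

def solve2(A, b):
    """Solve the 2x2 linear system A z = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

M, lam = 5, 0.8                      # block size and Pr(correct link)
beta = [1.0, 2.0]                    # true intercept and slope
x = [0.0, 1.0, 2.0, 3.0, 4.0]
X = [[1.0, xi] for xi in x]          # design matrix X_q
f = [[beta[0] + beta[1] * xi] for xi in x]   # f_q = X_q beta as a column

E = exchangeable_E(M, lam)
Ef = matmul(E, f)                    # E_X(y*_q) = E_q f_q
Xt = transpose(X)
XtEf = [row[0] for row in matmul(Xt, Ef)]

# Naive estimator (10) applied to E_q f_q: this equals D beta in (11).
expected_naive = solve2(matmul(Xt, X), XtEf)
# Ratio-corrected estimator (12) applied to E_q f_q: this recovers beta.
expected_corrected = solve2(matmul(Xt, matmul(E, X)), XtEf)

print("naive:", [round(v, 6) for v in expected_naive])
print("corrected:", [round(v, 6) for v in expected_corrected])
```

With these assumed inputs the naive slope is shrunk by the factor $\lambda_q - \gamma_q$ while the corrected estimator returns the original coefficients, which is the attenuation effect described for the exchangeable model. With several blocks, the same calculation would sum the $X_q' X_q$, $X_q' E_q X_q$ and $X_q' y_q^{*}$ terms over $q$ before solving.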

We use an iterated expectation argument to obtain the variance of $\hat\beta_R$. To start, observe that

$$ \mathrm{Var}_X(\hat\beta_R) = D^{-1} \mathrm{Var}_X(\hat\beta^{*}) (D^{-1})' $$

where

$$ \mathrm{Var}_X(\hat\beta^{*}) = E_X\{ \mathrm{Var}_{AX}(\hat\beta^{*}) \} + \mathrm{Var}_X\{ E_{AX}(\hat\beta^{*}) \}. $$

Here a subscript of $AX$ denotes conditioning on both $A_q$ and $X_q$, so

$$ E_{AX}(\hat\beta^{*}) = \Big( \sum_q X_q' X_q \Big)^{-1} \Big( \sum_q X_q' A_q X_q \Big) \beta $$

and

$$ \mathrm{Var}_{AX}(\hat\beta^{*}) = \sigma^2 \Big( \sum_q X_q' X_q \Big)^{-1} \Big( \sum_q X_q' A_q A_q' X_q \Big) \Big( \sum_q X_q' X_q \Big)^{-1} = \sigma^2 \Big( \sum_q X_q' X_q \Big)^{-1} $$

since $A_q A_q' = I_q$. Put $V_q = \mathrm{Var}_X(A_q X_q \beta) = \mathrm{Var}_X(A_q f_q)$. It follows that

$$ \mathrm{Var}_X(\hat\beta_R) = D^{-1} \Big( \sum_q X_q' X_q \Big)^{-1} \Big\{ \sum_q X_q' (\sigma^2 I_q + V_q) X_q \Big\} \Big( \sum_q X_q' X_q \Big)^{-1} (D^{-1})' = \Big( \sum_q X_q' E_q X_q \Big)^{-1} \Big\{ \sum_q X_q' (\sigma^2 I_q + V_q) X_q \Big\} \Big\{ \Big( \sum_q X_q' E_q X_q \Big)^{-1} \Big\}'. \tag{13} $$

An estimator of (13) is then

$$ \hat V_X(\hat\beta_R) = \Big( \sum_q X_q' E_q X_q \Big)^{-1} \Big\{ \sum_q X_q' (\hat\sigma^2 I_q + \hat V_q) X_q \Big\} \Big\{ \Big( \sum_q X_q' E_q X_q \Big)^{-1} \Big\}' \tag{14} $$

where $\hat\sigma^2$ and $\hat V_q$ are estimates of $\sigma^2$ and $V_q$ respectively.

In order to define these estimates, we note that after some simplification, and using the fact that $A_q' A_q = I_q$,

$$ E_X\Big\{ \sum_q (y_q^{*} - f_q)'(y_q^{*} - f_q) \Big\} = E_X\Big\{ \sum_q \big[ y_q' y_q - y_q' f_q - f_q' y_q + f_q' f_q - 2 f_q' (A_q - I_q) y_q \big] \Big\} = N\sigma^2 - 2 \sum_q f_q' (E_q - I_q) f_q. $$

It follows that when $f_q$ and $E_q$ are known,

$$ \hat\sigma^2 = N^{-1} \Big\{ \sum_q (y_q^{*} - f_q)'(y_q^{*} - f_q) - 2 \sum_q f_q' (I_q - E_q) f_q \Big\} \tag{15} $$

is an unbiased method of moments estimator of $\sigma^2$ under the linkage errors model (1) and the linear model specified by (7) and (8). Note that (15) can take negative values. In practice, we replace $f_q$ by $\hat f_q = X_q \hat\beta_R$ in (15) to then obtain a consistent estimator of $\sigma^2$.

Development of an expression for $\hat V_q$ is somewhat more complicated. In Appendix 1 we show that a large $M_q$ approximation to $V_q$, given a simple second-order extension of the exchangeable linkage errors model defined by (3) and (4), is

$$ V_q \approx \mathrm{diag}\Big[ (1 - \lambda_q) \big\{ \lambda_q (f_i - \bar f_q)^2 + \bar f_q^{(2)} - \bar f_q^{\,2} \big\} \Big] \tag{16} $$

where $f_q = (f_i)$ and $\bar f_q$, $\bar f_q^{(2)}$ denote the block averages of the components of $f_q$ and their squares respectively. In order to calculate $\hat V_q$ we replace these components in (16) by their estimated values.

The approach to linear regression estimation using probability-linked data described above is in the spirit of Scheuren and Winkler (1993), where it is suggested that one corrects the naive estimator using an estimate of its bias under an appropriate model for the linkage error process. In our case the ratio-type adjustment we use for this purpose depends on knowing (or having good estimates of) the parameters (i.e. the $\lambda_q$) that characterise this process. As noted earlier, all that is required to estimate these parameters is access to a random audit sample of the linked records in each block, where the only thing we need to know is whether the declared links are correct or not. This could also be done by comparison with the output from a "gold standard" (e.g. clerical) linkage operation carried out on this sample of records.

2.2 Efficient linear estimation using linked data

An alternative approach to fitting a linear model using the probability-linked data is based on directly modelling the regression relationship between the linked values $y_q^{*}$ and the values in $X_q$. Since $y_q^{*} = A_q y_q$, and $A_q$ and $y_q$ are independently distributed given $X_q$, it follows that

$$ E_X(y_q^{*}) = E_X(A_q) E_X(y_q) = E_q X_q \beta = H_q \beta. \tag{17} $$

That is, the $y_q^{*}$ also follow a linear model with regression coefficient $\beta$ but with a modified set of explanatory variables $H_q = E_q X_q$ in block $q$. Lahiri and Larsen (2005) note this relationship and suggest estimation of $\beta$ using the OLS estimator for this situation,

$$ \hat\beta_A = \Big( \sum_q H_q' H_q \Big)^{-1} \sum_q H_q' y_q^{*} = \Big( \sum_q X_q' E_q' E_q X_q \Big)^{-1} \sum_q X_q' E_q' y_q^{*}. \tag{18} $$

However, the optimality of this estimator depends on the regression errors under (17) being homoskedastic.
It is easy to see that this condition generally does not hold, since implicit in the development leading to (13) is the fact that Var X ( y )= σ 2 I + V = Σ (19) which implies that the variances of the regression errors defined by the linked data vary between blocks. he Best Linear Unbiased Estimator (BLUE) for β given these data is ˆβ C = ( H Σ 1 H ) 1 H Σ 1 ( y )= X E Σ 1 E X Note that (20) depends on Σ, and hence on σ 2 ( ) 1 X E Σ 1 ( y ). (20) and β. Its empirical (EBLUE) version is defined by substituting estimates for these parameters and iterating, using the estimate (15) for σ 2 developed in the previous sub-section, combined with the estimate of β defined by (20). Standard plug-in type sandwich-type estimators of the variances of (18) and (20) are easily developed using the estimates ˆσ 2 and ˆV developed in the previous sub-section. hese are 14

15 ( ) 1 ˆV X ( ˆβA )= X E E X in the case of (18) and in the case of (20). X { E ( ˆσ 2 I + ˆV )E X X E E X ( ) 1 1 { E X ˆV X ( ˆβC )= X E ˆσ 2 I + ˆV ( ) 1 Note that such plug-in estimators ignore the contribution to the variance associated with estimation of the linkage model parameters and hence may be biased low. his issue is further discussed in section Maximum likelihood using linked data An alternative approach to constructing an efficient estimator of β given the linked data is to use the Missing Information Principle or MIP (Orchard and Woodbury, 1972) to derive the maximum likelihood estimator of this parameter given the linked data. In order to do so, we extend the linear model (7) and (8) to include an assumption of normality. hat is, given X, we assume that y : N( f,σ 2 I ). (21) (22) When the y are known, the score function for β and σ 2 has components and sc 1 = 1 σ 2 sc 2 = N 2σ σ 4 ( y f ) (23) X ( y f ) ( y f ). (24) In order to apply the MIP, we replace (14) and (15) by their conditional expectations given y and X. Using an iterated expectations argument again, we see that Cov X ( y, y )= σ 2 E X ( A )+ Cov X ( f,a f )= σ 2 E. Combining this result with (17) and (19), it follows that and so y X y : N f E f,σ I 2 E E Σ and E X ( y y )= f + E Σ 1 ( y E f )= ŷ We therefore replace (23) by sc 1 = 1 σ 2 Var X ( y y )= σ 2 ( I E Σ 1 E ). ( ŷ f ) = 1 X and, since y y = y y, we replace (24) by σ 2 E Σ 1 y E f (25) X ( ) 15

16 sc 2 = N 2σ σ 4 he MLEs for β and σ 2 = N 2σ σ 4 y y 2f ŷ + f ( f ) parameters. Since ŷ is a function of β and σ 2 { ( y f ) ( y f ) 2f ( ŷ y ). are defined by setting (25) and (26) to zero and solving for these (26) this needs to be done numerically. Note that the solution to setting (25) to zero is the BLUE (20). Since the MLE for σ 2 obtained by setting (26) to zero is not the same as the method of moments estimator (15), the MLE and the EBLUE for β will not be the same. However, they are typically very close. In order to estimate the variances and covariances of these MLEs, we calculate the matrixvalued observed information function corresponding to the MIP-based score function for these parameters and invert it. his can be done by either numerically differentiating (25) and (26), or by using the MIP information identity. his identity states that the information function for β and σ 2 given the linked data is the conditional expectation of the y known information function given the linked data minus the conditional variance of the y known score function given the linked data. Denoting conditioning on the linked data ( y ; = 1,2,K,Q) by a superscript of *, the information function generated by these data is where and info = E X E X info 21 E X ( info 11 ) E X ( info 12 ) ( ) E X ( ) E X ( info 22 )= N 2σ σ 6 E X Var X ( sc 1 )= 1 σ 4 ( info 12 )= E X X info 22 Var X sc 1 Cov X sc 2, sc 1 ( info 11 )= 1 X σ 2 X ( ) Cov X ( sc 1, sc 2 ) ( ) Var X ( ) sc 2 { ( y f ) ( y f ) 2f ( ŷ y ) ( info 21 )= 1 σ 4 Var ( y y )X X = 1 σ 2 X ( ŷ f ) X ( I E Σ 1 E )X Cov X ( sc 1, sc 2 )= 1 Cov 2σ 6 X X ( y f ), ( y f ) { ( y f )y = 1 Cov 2σ 6 X X y, 2f ( y y ) = 1 X σ 6 Var X ( y y )f = 1 X σ 4 ( I E Σ 1 E )f. (27) 16

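$$\begin{aligned}
\mathrm{Var}_X^*(sc_2) &= \frac{1}{4\sigma^8} \sum_q \mathrm{Var}_X\big\{ (y_q - f_q)^T (y_q - f_q) \,\big|\, y_q^* \big\} \\
&= \frac{1}{4\sigma^8} \sum_q \mathrm{Var}_X\big\{ y_q^T y_q - y_q^T f_q - f_q^T y_q + f_q^T f_q \,\big|\, y_q^* \big\} \\
&= \frac{1}{\sigma^8} \sum_q \mathrm{Var}_X\big( f_q^T y_q \,\big|\, y_q^* \big) \\
&= \frac{1}{\sigma^6} \sum_q f_q^T \big(I_q - E_q^T \Sigma_q^{-1} E_q\big) f_q.
\end{aligned}$$

The observed information for $\beta$ and $\sigma^2$ is the value of $\mathrm{info}^*$ at the values of the MLEs for these parameters. The inverse of this matrix is then used as an estimate of the variance/covariance matrix of these estimators. Note that the value of the matrix

$$\begin{pmatrix} \mathrm{Var}_X^*(sc_1) & \mathrm{Cov}_X^*(sc_1, sc_2) \\ \mathrm{Cov}_X^*(sc_2, sc_1) & \mathrm{Var}_X^*(sc_2) \end{pmatrix}$$

at the MLEs for $\beta$ and $\sigma^2$ is a measure of the information loss caused by incorrect linkage.

2.4 A fixed population approach

Suppose that we have perfectly linked data. The efficient estimator of the regression parameter $\beta$ is then the 'y known' OLS estimator

$$B = \Big(\sum_q X_q^T X_q\Big)^{-1} \Big(\sum_q X_q^T y_q\Big). \tag{28}$$

So far, our emphasis has been on estimation of $\beta$. However, it is legitimate to also consider prediction of $B$ given the fixed finite population of Y and X-values that define the Y and X-registers. In this context, we denote conditioning on these values (i.e. on the values of $y_q$ and $X_q$) by a subscript of $YX$ and look for a predictor $\hat B$ of $B$ that satisfies (over repeated applications of the probability linkage process)

$$E_{YX}(\hat B) = B. \tag{29}$$

Note that none of $\hat\beta_R$, $\hat\beta_A$ and $\hat\beta_C$ satisfy (29) since we have

$$E_{YX}(\hat\beta_R) = \Big(\sum_q X_q^T E_q X_q\Big)^{-1} \Big(\sum_q X_q^T E_q y_q\Big) \neq B$$

$$E_{YX}(\hat\beta_A) = \Big(\sum_q X_q^T E_q^T E_q X_q\Big)^{-1} \Big(\sum_q X_q^T E_q^T E_q y_q\Big) \neq B$$

and

$$E_{YX}(\hat\beta_C) = \Big(\sum_q X_q^T E_q^T \Sigma_q^{-1} E_q X_q\Big)^{-1} \Big(\sum_q X_q^T E_q^T \Sigma_q^{-1} E_q y_q\Big) \neq B.$$

In order to derive a predictor that satisfies (29), consider the class of linear predictors of $B$ that can be written in the form

$$\hat B = \Big(\sum_q X_q^T X_q\Big)^{-1} \sum_q X_q^T (K_q y_q^*).$$

If $K_q E_q = I_q$ it is straightforward to see that

$$E_{YX}(\hat B) = \Big(\sum_q X_q^T X_q\Big)^{-1} \Big(\sum_q X_q^T K_q E_q y_q\Big) = B.$$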

If $E_q$ is of full rank (as is the case with (5) when $\lambda_q > \gamma_q$), then an obvious choice is $K_q = E_q^{-1}$. More generally, Kovacevic (personal communication, 2008) has suggested that one put $K_q = (E_q^T E_q)^{-1} E_q^T$, leading to the predictor

$$\hat\beta_B = \Big(\sum_q X_q^T X_q\Big)^{-1} \sum_q X_q^T \big\{ (E_q^T E_q)^{-1} E_q^T y_q^* \big\}. \tag{30}$$

Since (30) is linear in the $y_q^*$, variance estimation for this predictor using a plug-in sandwich-based approach follows directly. The resulting variance estimator is

$$\hat V_X(\hat\beta_B) = \Big(\sum_q X_q^T X_q\Big)^{-1} \Big\{ \sum_q X_q^T (E_q^T E_q)^{-1} E_q^T \big(\hat\sigma^2 I_q + \hat V_q\big) E_q (E_q^T E_q)^{-1} X_q \Big\} \Big(\sum_q X_q^T X_q\Big)^{-1}. \tag{31}$$
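The condition $K_q E_q = I_q$ behind (30) can be checked numerically. The following is a minimal sketch rather than the author's implementation. It assumes the exchangeable form of $E_q$ quoted in section 3.3, with $\lambda_q$ on the diagonal and $(1 - \lambda_q)/(M_q - 1)$ elsewhere, and the helper names `exchangeable_E` and `predictor_B` are hypothetical. The linked vector $y_q^*$ is replaced by its expectation $E_q y_q$ given the register values, so the check concerns the expectation in (29) only, not sampling behaviour.

```python
import numpy as np

def exchangeable_E(M, lam):
    """Assumed exchangeable form of E_q: lam on the diagonal,
    (1 - lam) / (M - 1) off the diagonal, so every row sums to one."""
    gamma = (1.0 - lam) / (M - 1)
    return (lam - gamma) * np.eye(M) + gamma * np.ones((M, M))

def predictor_B(X_blocks, E_blocks, ystar_blocks):
    """Predictor (30): OLS applied to K_q y*_q with K_q = (E_q'E_q)^{-1} E_q'."""
    p = X_blocks[0].shape[1]
    XtX = np.zeros((p, p))
    XtKy = np.zeros(p)
    for X, E, ystar in zip(X_blocks, E_blocks, ystar_blocks):
        K = np.linalg.solve(E.T @ E, E.T)   # (E'E)^{-1} E'
        XtX += X.T @ X
        XtKy += X.T @ (K @ ystar)
    return np.linalg.solve(XtX, XtKy)

rng = np.random.default_rng(0)
sizes, lams = [5, 8, 6], [0.9, 0.8, 0.95]   # illustrative block sizes and lambdas
X_blocks = [np.column_stack([np.ones(M), rng.normal(size=M)]) for M in sizes]
y_blocks = [rng.normal(size=M) for M in sizes]
E_blocks = [exchangeable_E(M, lam) for M, lam in zip(sizes, lams)]

# 'y known' OLS vector B of (28), computed from the correctly linked data
X_all, y_all = np.vstack(X_blocks), np.concatenate(y_blocks)
B = np.linalg.lstsq(X_all, y_all, rcond=None)[0]

# Replace y*_q by its conditional expectation E_q y_q
expected_ystar = [E @ y for E, y in zip(E_blocks, y_blocks)]
B_hat_mean = predictor_B(X_blocks, E_blocks, expected_ystar)

# Naive OLS applied to the same expected linked data, ignoring linkage error
naive_mean = np.linalg.lstsq(X_all, np.concatenate(expected_ystar), rcond=None)[0]
```

In this sketch `B_hat_mean` agrees with `B` up to rounding, while `naive_mean` does not, which is the sense in which (30) satisfies (29).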

3 Using estimating functions with probability-linked data

In this section we consider extension of the ideas developed for linear regression analysis in the previous section to where the regression model of interest is fitted via the solution of an estimating equation. In particular, we assume that this model is characterised by a $p$-dimensional parameter $\theta$, which is then estimated by solving $H(\theta) = 0$ where $H(\theta)$ is a $p$-dimensional unbiased estimating function for $\theta$, i.e. a function of the data that satisfies $E_X\{H(\theta_0)\} = 0$ where $\theta_0$ is the true value of $\theta$. Let $\partial_\theta$ denote the partial differentiation operator with respect to the components of $\theta$. The resulting estimator $\hat\theta$ can then be shown to be approximately unbiased for $\theta_0$ since, under appropriate smoothness conditions

$$0 = H(\hat\theta) \approx H(\theta_0) + (\partial_\theta H_0)(\hat\theta - \theta_0).$$

Here $\partial_\theta H_0$ is the $p \times p$ matrix of first order partial derivatives of $H(\theta)$ with respect to the components of $\theta$, evaluated at $\theta_0$. Since $H(\theta)$ is an unbiased estimating function, it immediately follows that

$$E_X(\hat\theta - \theta_0) \approx -(\partial_\theta H_0)^{-1} E_X\{H(\theta_0)\} = 0$$

provided $\partial_\theta H_0$ is of full rank, and so $\hat\theta$ is approximately unbiased for $\theta_0$. Furthermore, we then also have

$$\mathrm{Var}_X(\hat\theta) \approx (\partial_\theta H_0)^{-1}\, \mathrm{Var}_X\{H(\theta_0)\}\, \big\{(\partial_\theta H_0)^{-1}\big\}^T \tag{32}$$

leading to the usual sandwich-type estimator of this variance

$$\hat V_X(\hat\theta) = \big[(\partial_\theta H_0)^{-1}\big]_{\theta_0 = \hat\theta}\; \hat V_X\{H(\theta_0)\}\; \big[\{(\partial_\theta H_0)^{-1}\}^T\big]_{\theta_0 = \hat\theta} \tag{33}$$

where $\hat V_X\{H(\theta_0)\}$ is an estimate of $\mathrm{Var}_X\{H(\theta_0)\}$. Typically, it is a plug-in estimate, i.e. $\mathrm{Var}_X\{H(\theta_0)\}$ evaluated at $\theta_0 = \hat\theta$.

3.1 Correcting estimating functions for linkage error

We now turn our attention to the situation where a regression model is fitted using an estimating function and data that have been linked using a probability-based method. In particular, we shall concern ourselves with situations where $H(\theta)$ is of the form

$$H(\theta) = \sum_{i=1}^{N} G_i(\theta) \{ y_i - f_i(\theta) \} \tag{34}$$

where $f_i(\theta_0) = E_X(y_i)$ and $G_i(\theta)$ is a vector of order $p$ which is a function of $\theta$ and $X_i$ but not of $y_i$.
Clearly (34) defines an unbiased estimating function for $\theta_0$, which we can write in blocked form as

$$H(\theta) = \sum_q G_q(\theta) \{ y_q - f_q(\theta) \} \tag{35}$$

where $G_q(\theta)$ is the $p \times M_q$ matrix with columns defined by the vectors $G_i(\theta)$ associated with the population units making up block $q$, and $f_q(\theta)$ is the vector of order $M_q$ defined by their corresponding values of $f_i(\theta)$. Now consider the situation described in section 1.1 where instead of $y_q$, we have access to a probability-linked version of this vector, $y_q^* = A_q y_q$. Here $A_q$ is a random permutation matrix of order $M_q$ distributed independently of $y_q$ given the values in $X_q$ (i.e. linkage is non-informative given the values of the explanatory variables), with values of $A_q$ distributed independently between blocks and where $E_X(A_q) = E_q$. Let $H^*(\theta)$ denote the value of (35) when we use $y_q^*$ instead of $y_q$. That is, our naive estimator $\hat\theta^*$ of $\theta_0$ that assumes no linkage errors satisfies

$$H^*(\hat\theta^*) = \sum_q G_q(\hat\theta^*) \{ y_q^* - f_q(\hat\theta^*) \} = 0. \tag{36}$$

Clearly, since

$$E_X\{H^*(\theta_0)\} = \sum_q G_q(\theta_0) (E_q - I_q) f_q(\theta_0) \neq 0$$

we see that $H^*(\theta)$ is biased if linkage is not perfect, and so the resulting estimator $\hat\theta^*$ is also biased in this case. Given the value of $E_q$, we can correct for this bias, replacing the estimating function $H^*(\theta)$ by its bias-corrected version

$$H^{adj}(\theta) = H^*(\theta) - \sum_q G_q(\theta) (E_q - I_q) f_q(\theta) = \sum_q G_q(\theta) \{ y_q^* - E_q f_q(\theta) \}. \tag{37}$$

Our bias-adjusted estimator of $\theta$ based on the linked data is then $\hat\theta^{adj}$ where $H^{adj}(\hat\theta^{adj}) = 0$.

The general results for inference based on unbiased estimating functions clearly apply to $H^{adj}(\theta)$ defined by (37). It immediately follows that the large sample variance of $\hat\theta^{adj}$ is given by (32) with $H^{adj}(\theta)$ substituted for $H(\theta)$. That is,

$$\mathrm{Var}_X(\hat\theta^{adj}) \approx \big[ (\partial_\theta H^{adj})_{\theta = \theta_0} \big]^{-1}\, \mathrm{Var}_X\{H^{adj}(\theta_0)\}\, \Big[ \big\{ (\partial_\theta H^{adj})_{\theta = \theta_0} \big\}^{-1} \Big]^T \tag{38}$$

with plug-in sandwich-type estimator, see (33), of the form

$$\hat V_X(\hat\theta^{adj}) = \big\{ \partial_\theta H^{adj}(\hat\theta^{adj}) \big\}^{-1}\, \hat V_X\{H^{adj}(\theta_0)\}\, \Big[ \big\{ \partial_\theta H^{adj}(\hat\theta^{adj}) \big\}^{-1} \Big]^T \tag{39}$$

where $\partial_\theta H^{adj}(\hat\theta^{adj}) = \partial_\theta H^{adj}(\theta) \big|_{\theta = \hat\theta^{adj}}$. In order to define $\hat V_X\{H^{adj}(\theta_0)\}$ in (39) we put $\mathrm{Var}_X(y_q) = \Omega_q(\theta_0)$ and observe that then

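$$\begin{aligned}
\mathrm{Var}_X(y_q^*) &= E_X\{ \mathrm{Var}_X(A_q y_q \mid A_q) \} + \mathrm{Var}_X\{ E_X(A_q y_q \mid A_q) \} \\
&= E_X\{ A_q \mathrm{Var}_X(y_q) A_q^T \} + \mathrm{Var}_X\{ A_q f_q(\theta_0) \} \\
&= E_X\{ A_q \Omega_q(\theta_0) A_q^T \} + V_q(\theta_0) \\
&= \Sigma_q(\theta_0)
\end{aligned} \tag{40}$$

so

$$\mathrm{Var}_X\{H^{adj}(\theta_0)\} = \sum_q G_q(\theta_0)\, \mathrm{Var}_X(y_q^*)\, G_q^T(\theta_0) = \sum_q G_q(\theta_0) \Sigma_q(\theta_0) G_q^T(\theta_0)$$

and hence

$$\hat V_X\{H^{adj}(\theta_0)\} = \sum_q G_q(\hat\theta^{adj})\, \Sigma_q(\hat\theta^{adj})\, G_q^T(\hat\theta^{adj}). \tag{41}$$

In order to compute (41) we need to estimate the covariance matrix $\Sigma_q(\theta)$ specified by (40). In turn, this requires that we estimate both $V_q(\theta_0)$, which can be approximated via (16) after replacing $f_i$ by $f_i(\hat\theta^{adj})$, and $E_X\{A_q \Omega_q(\theta_0) A_q^T\}$, which will depend on the particular model that we assume for the $y_q$.

Next, in order to define the matrix of partial derivatives $\partial_\theta H^{adj}(\hat\theta^{adj})$ in (39) we note that although in theory

$$\partial_\theta H^{adj} = \sum_q \partial_\theta \big[ G_q(\theta) \{ y_q^* - E_q f_q(\theta) \} \big],$$

it is often the case that $G_q(\theta)$ varies little as $\theta$ changes. Consequently, we approximate this derivative by $\partial_\theta H^{adj} \approx -\sum_q G_q(\theta) E_q\, \partial_\theta f_q(\theta)$. That is, we put

$$\partial_\theta H^{adj}(\hat\theta^{adj}) = -\sum_q G_q(\hat\theta^{adj})\, E_q\, \partial_\theta f_q(\hat\theta^{adj}) \tag{42}$$

where $\partial_\theta f_q(\hat\theta^{adj}) = \partial_\theta f_q(\theta) \big|_{\theta = \hat\theta^{adj}}$. The final variance estimator for $\hat\theta^{adj}$ is then obtained by substituting (41) and (42) into (39).

3.2 Application to linear and logistic regression

Although we have already developed the theory for linear regression in section 2, it is interesting to see how the results obtained there can be obtained as special cases of the general estimating equation theory set out in the previous sub-section. In particular, the Lahiri-Larsen estimator (18) and the BLUE (20) can be obtained from (37) by setting $\theta = \beta$ and $f_q(\beta) = X_q \beta$ (so $\partial_\beta f_q(\beta) = X_q$) with $G_q = X_q^T E_q^T$ in the case of (18) and $G_q = X_q^T E_q^T \Sigma_q^{-1}$ in the case of (20). As far as the predictor (30) of $B$ is concerned, we note that it can be expressed as the solution to

$$\sum_q X_q^T (E_q^T E_q)^{-1} E_q^T \big( y_q^* - E_q X_q \hat\beta \big) = 0.$$

It follows that in this case $G_q = X_q^T (E_q^T E_q)^{-1} E_q^T$, which leads to $\partial_\beta H^{adj}(\hat\beta_B) = -\sum_q X_q^T X_q$.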
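Because $f_q(\beta) = X_q \beta$ is linear in $\beta$, the adjusted estimating equation $\sum_q G_q (y_q^* - E_q X_q \beta) = 0$ has the closed-form solution $\hat\beta = (\sum_q G_q E_q X_q)^{-1} \sum_q G_q y_q^*$ for any fixed $G_q$. The sketch below illustrates this for the choices of $G_q$ associated with (18) and (30). It is an illustration under stated assumptions, not the author's code: the exchangeable form of $E_q$ is assumed, the function names are hypothetical, and the linked data are set to their noise-free expectation $E_q X_q \beta_0$ so that the result can be compared with a known $\beta_0$.

```python
import numpy as np

def exchangeable_E(M, lam):
    """Assumed exchangeable form of E_q: lam on the diagonal,
    (1 - lam) / (M - 1) off the diagonal."""
    gamma = (1.0 - lam) / (M - 1)
    return (lam - gamma) * np.eye(M) + gamma * np.ones((M, M))

def solve_adjusted_linear(G_blocks, E_blocks, X_blocks, ystar_blocks):
    """Solve sum_q G_q (y*_q - E_q X_q beta) = 0 for beta in closed form."""
    lhs = sum(G @ E @ X for G, E, X in zip(G_blocks, E_blocks, X_blocks))
    rhs = sum(G @ ys for G, ys in zip(G_blocks, ystar_blocks))
    return np.linalg.solve(lhs, rhs)

rng = np.random.default_rng(1)
beta0 = np.array([2.0, -1.0])               # illustrative true parameter
sizes, lams = [6, 10], [0.85, 0.7]          # illustrative block sizes and lambdas
X_blocks = [np.column_stack([np.ones(M), rng.normal(size=M)]) for M in sizes]
E_blocks = [exchangeable_E(M, lam) for M, lam in zip(sizes, lams)]

# Noise-free linked data: y*_q set equal to its expectation E_q X_q beta0
ystar_blocks = [E @ X @ beta0 for E, X in zip(E_blocks, X_blocks)]

# G_q = X_q' E_q' corresponds to (18); G_q = X_q' (E_q'E_q)^{-1} E_q' to (30)
G_A = [X.T @ E.T for X, E in zip(X_blocks, E_blocks)]
G_B = [X.T @ np.linalg.solve(E.T @ E, E.T) for X, E in zip(X_blocks, E_blocks)]

beta_A = solve_adjusted_linear(G_A, E_blocks, X_blocks, ystar_blocks)
beta_B = solve_adjusted_linear(G_B, E_blocks, X_blocks, ystar_blocks)

# For (30), sum_q G_q E_q X_q collapses to sum_q X_q' X_q
lhs_B = sum(G @ E @ X for G, E, X in zip(G_B, E_blocks, X_blocks))
XtX = sum(X.T @ X for X in X_blocks)
```

Both `beta_A` and `beta_B` return `beta0` here, and `lhs_B` equals `XtX`, matching the derivative quoted for the predictor (30) up to sign.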

In contrast, the ratio-adjusted estimator (12) cannot be expressed as the solution of an estimating equation of the form $\sum_q G_q (y_q^* - E_q X_q \beta) = 0$, being instead the solution to the alternative ratio-type estimating equation

$$H_R(\beta) = \sum_q X_q^T ( y_q^* - X_q D \beta ) = 0 \tag{43}$$

where $D = \big(\sum_q X_q^T X_q\big)^{-1} \big(\sum_q X_q^T E_q X_q\big)$. As a consequence, the results in the previous subsection do not apply to it directly. However, it is not difficult to show that $H_R(\beta)$ also defines an unbiased estimating function under the assumed linear model, since

$$E_X\Big\{ \sum_q X_q^T (y_q^* - X_q D \beta) \Big\} = \Big\{ \sum_q X_q^T (E_q X_q - X_q D) \Big\} \beta = \Big\{ \sum_q X_q^T E_q X_q - \Big(\sum_q X_q^T X_q\Big) D \Big\} \beta = 0.$$

The linearisation argument that was earlier used to define an estimator of variance in the standard estimating function approach also applies to (12) when it is written as the solution to (43). In particular, we have

$$\partial_\beta H_R(\beta) = -\Big(\sum_q X_q^T X_q\Big) D = -\sum_q X_q^T E_q X_q \tag{44}$$

and

$$\mathrm{Var}_X\{H_R(\beta_0)\} = \sum_q X_q^T \Sigma_q X_q. \tag{45}$$

When (44) and (45) are substituted in (38) we obtain the variance expression (13), leading to the same plug-in estimator of variance as specified by (39).

The case where the regression model of interest corresponds to linear logistic regression is of special interest. Here $f_q(\theta) = \{ f_i(\theta);\ i \in q \}$ where

$$f_i(\theta) = \frac{\exp(x_i^T \theta)}{1 + \exp(x_i^T \theta)}. \tag{46}$$

It follows that

$$\partial_\theta f_q(\theta) = D_q(\theta) X_q \tag{47}$$

where $D_q(\theta) = \mathrm{diag}\big[ f_i(\theta) \{ 1 - f_i(\theta) \};\ i \in q \big]$.

The standard maximum likelihood estimating function (i.e. the score function) for the logistic regression model puts $G_q(\theta) = X_q^T$ in (35). However, this is not the only choice for this matrix when we estimate $\theta$ via the adjusted estimating equation (37). In particular we can also use the expressions for $G_q(\theta)$ that lead to the linear regression estimators (18), (20) and (30) introduced in section 2. We summarise these options in Table 1. Here option M defines the estimating equation for the MLE under perfect linkage, option A leads to the Lahiri-Larsen estimator (18) under a linear model and option B leads to the predictor (30) of the finite population regression vector (28) under the same model.
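In contrast, option C in Table 1 defines the second order efficient version of (35), which in the logistic case is given by

$$G_q^{opt}(\theta) = \big\{ \partial_\theta E_X(y_q^*) \big\}^T\, \mathrm{Var}_X^{-1}(y_q^*) = \big\{ \partial_\theta f_q(\theta) \big\}^T E_q^T\, \Sigma_q^{-1}(\theta) = X_q^T D_q(\theta) E_q^T\, \Sigma_q^{-1}(\theta). \tag{48}$$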
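To make the logistic case concrete, the sketch below solves the adjusted estimating equation (37) by Newton-type iteration for options M and A, using the derivative approximation (42) together with (47). It is a minimal illustration under stated assumptions rather than the author's implementation: the exchangeable form of $E_q$ is assumed, the function names are hypothetical, and the linked responses are replaced by their expectations $E_q f_q(\theta_0)$, so the example shows only that the adjusted equation has its root at $\theta_0$ while the unadjusted one does not.

```python
import numpy as np

def exchangeable_E(M, lam):
    """Assumed exchangeable form of E_q: lam on the diagonal,
    (1 - lam) / (M - 1) off the diagonal."""
    gamma = (1.0 - lam) / (M - 1)
    return (lam - gamma) * np.eye(M) + gamma * np.ones((M, M))

def logistic_mean(X, theta):
    """f_i(theta) of (46) for every row of X."""
    return 1.0 / (1.0 + np.exp(-(X @ theta)))

def solve_adjusted_logistic(X_blocks, E_blocks, ystar_blocks, option="M",
                            n_iter=50, tol=1e-10):
    """Solve sum_q G_q {y*_q - E_q f_q(theta)} = 0 by Newton-type steps.

    option "M": G_q = X_q'.   option "A": G_q = X_q' E_q'.
    """
    p = X_blocks[0].shape[1]
    theta = np.zeros(p)
    for _ in range(n_iter):
        H = np.zeros(p)
        J = np.zeros((p, p))   # sum_q G_q E_q D_q X_q, minus the derivative in (42)
        for X, E, ystar in zip(X_blocks, E_blocks, ystar_blocks):
            f = logistic_mean(X, theta)
            G = X.T if option == "M" else X.T @ E.T
            D_X = (f * (1.0 - f))[:, None] * X   # D_q(theta) X_q, see (47)
            H += G @ (ystar - E @ f)
            J += G @ E @ D_X
        step = np.linalg.solve(J, H)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta

rng = np.random.default_rng(2)
theta0 = np.array([-0.5, 1.0])              # illustrative true parameter
sizes, lams = [40, 60], [0.9, 0.8]          # illustrative block sizes and lambdas
X_blocks = [np.column_stack([np.ones(M), rng.normal(size=M)]) for M in sizes]
E_blocks = [exchangeable_E(M, lam) for M, lam in zip(sizes, lams)]

# Noise-free linked responses: y*_q set equal to E_q f_q(theta0)
ystar_blocks = [E @ logistic_mean(X, theta0) for E, X in zip(E_blocks, X_blocks)]

theta_M = solve_adjusted_logistic(X_blocks, E_blocks, ystar_blocks, option="M")
theta_A = solve_adjusted_logistic(X_blocks, E_blocks, ystar_blocks, option="A")

# Naive fit: the same equation with E_q replaced by the identity
I_blocks = [np.eye(M) for M in sizes]
theta_naive = solve_adjusted_logistic(X_blocks, I_blocks, ystar_blocks, option="M")
```

Option C would additionally require an estimate of $\Sigma_q(\theta)$ at each iteration, which this sketch does not attempt.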

It is easy to see that the corresponding optimal version of $G_q(\theta)$ in the linear case is $G_q = X_q^T E_q^T \Sigma_q^{-1}$ and leads to the BLUE (20). It should be noted, however, that option B in Table 1 does not have the same finite population interpretation for logistic regression as it has in the linear regression context. In particular, it is not clear whether use of option B leads to a predictor of the estimator of $\theta$ defined by the correctly linked data. Further research is necessary in this area.

Table 1 Options for $G_q(\theta)$ in logistic regression

Option    $G_q(\theta)$
M         $X_q^T$
A         $X_q^T E_q^T$
C         $X_q^T D_q(\theta) E_q^T \Sigma_q^{-1}(\theta)$
B         $X_q^T (E_q^T E_q)^{-1} E_q^T$

For each of the options set out in Table 1, variance estimation for the solution to the adjusted estimating equation defined by (37) uses the plug-in sandwich estimator (39), with $\partial_\theta H^{adj}(\hat\theta^{adj})$ defined by (42) and with $\hat V_X\{H^{adj}(\theta_0)\}$ given by (41). In order to compute the latter expression, we observe that under the logistic model $\Omega_q(\theta_0)$ is $D_q(\theta_0)$ and so

$$E_X\{ A_q \Omega_q(\theta_0) A_q^T \} = E_X\{ A_q D_q(\theta_0) A_q^T \}.$$

Now

$$A_q D_q(\theta) A_q^T = \mathrm{diag}\Big[ \sum_{j=1}^{M_q} f_j(\theta) \{ 1 - f_j(\theta) \} a_{ij} \Big]$$

so

$$E_X\{ A_q D_q(\theta) A_q^T \} = \mathrm{diag}\Big[ \sum_{j=1}^{M_q} f_j(\theta) \{ 1 - f_j(\theta) \} e_{ij} \Big]$$

where $A_q = [a_{ij}]$ and $E_q = [e_{ij}]$.

3.3 Variance estimation when linkage probabilities are estimated

The development so far has assumed that the matrix of expected values $E_q$ for the stochastic linkage matrix $A_q$ is known. If this matrix is specified using the exchangeable errors model (5) then this is equivalent to assuming that the probabilities $\lambda_q$ of correct linkage are known. This is highly unlikely to be the case, and these probabilities will usually be estimated in some way. The extra uncertainty arising from this estimation then needs to be accounted for when carrying out variance estimation for the estimators of $\theta$ that use $E_q$ to correct for bias induced by linkage errors.

Let $\lambda$ denote the vector defined by the block-specific values of $\lambda_q$. The estimating function (37) then needs to be replaced by

$$H^{adj}(\theta, \lambda) = \sum_q G_q(\theta) \{ y_q^* - E_q(\lambda_q) f_q(\theta) \} = \sum_q U_q(\theta, \lambda_q)$$

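which is now considered to be a function of both $\theta$ and $\lambda$, allowing us to develop a first order Taylor series approximation of the form

$$0 = H^{adj}(\hat\theta, \hat\lambda) \approx H^{adj}(\theta_0, \lambda_0) + \partial_\theta H^{adj}(\theta_0, \lambda_0) (\hat\theta - \theta_0) + \partial_\lambda H^{adj}(\theta_0, \lambda_0) (\hat\lambda - \lambda_0)$$

or

$$\hat\theta - \theta_0 \approx -(\partial_\theta H_0)^{-1} \big\{ H_0 + (\partial_\lambda H_0)(\hat\lambda - \lambda_0) \big\}$$

where $H_0 = H^{adj}(\theta_0, \lambda_0)$. Here $\theta_0$ and $\lambda_0$ denote the true values of these parameters with $\hat\theta$ and $\hat\lambda$ their corresponding estimators. It immediately follows that we can approximate the variance of $\hat\theta$ by

$$\begin{aligned}
\mathrm{Var}_X(\hat\theta) &\approx (\partial_\theta H_0)^{-1}\, \mathrm{Var}_X\big\{ H_0 + (\partial_\lambda H_0)(\hat\lambda - \lambda_0) \big\}\, \big\{ (\partial_\theta H_0)^{-1} \big\}^T \\
&= (\partial_\theta H_0)^{-1} \big\{ \mathrm{Var}_X(H_0) + (\partial_\lambda H_0)\, \mathrm{Var}_X(\hat\lambda)\, (\partial_\lambda H_0)^T \big\} \big\{ (\partial_\theta H_0)^{-1} \big\}^T \\
&= \Big\{ \sum_q \partial_\theta U_q(\theta_0, \lambda_{q0}) \Big\}^{-1} \Big\{ \sum_q (\Psi_{1q} + \Psi_{2q}) \Big\} \Big[ \Big\{ \sum_q \partial_\theta U_q(\theta_0, \lambda_{q0}) \Big\}^{-1} \Big]^T
\end{aligned}$$

where

$$\Psi_{1q} = \mathrm{Var}_X\{ U_q(\theta_0, \lambda_{q0}) \} = G_q(\theta_0)\, \mathrm{Var}_X(y_q^*)\, G_q^T(\theta_0) = G_{q0} \Sigma_{q0} G_{q0}^T$$

$$\Psi_{2q} = (\partial_{\lambda_q} U_{q0})\, \mathrm{Var}_X(\hat\lambda_q)\, (\partial_{\lambda_q} U_{q0})^T$$

and $U_{q0} = U_q(\theta_0, \lambda_{q0})$. Note that we have also assumed that the distribution of $\hat\lambda$ is (at least approximately) independent of the distribution of $H_0$. To proceed further, we require an expression for

$$\partial_{\lambda_q} U_{q0} = \partial_{\lambda_q} \big[ G_{q0} \{ y_q^* - E_q(\lambda_q) f_{q0} \} \big] = -G_{q0}\, \partial_{\lambda_q} \{ E_q(\lambda_q) \}\, f_{q0}$$

where $f_{q0} = f_q(\theta_0)$. Under the exchangeable model (5) for linkage errors

$$E_q(\lambda_q) = \Big( \lambda_q - \frac{1 - \lambda_q}{M_q - 1} \Big) I_q + \frac{1 - \lambda_q}{M_q - 1}\, 1_q 1_q^T = (M_q - 1)^{-1} \big\{ (\lambda_q M_q - 1) I_q + (1 - \lambda_q) 1_q 1_q^T \big\}$$

so

$$\partial_{\lambda_q} \{ E_q(\lambda_q) \} = (M_q - 1)^{-1} \big( M_q I_q - 1_q 1_q^T \big)$$

and hence

$$\partial_{\lambda_q} U_{q0} = -(M_q - 1)^{-1} G_{q0} \big( M_q I_q - 1_q 1_q^T \big) f_{q0}.$$

That is, we have

$$\mathrm{Var}_X(\hat\theta) \approx \Big\{ \sum_q \partial_\theta U_{q0} \Big\}^{-1} \Big\{ \sum_q G_{q0} (\Sigma_{q0} + \Delta_{q0}) G_{q0}^T \Big\} \Big[ \Big\{ \sum_q \partial_\theta U_{q0} \Big\}^{-1} \Big]^T \tag{49}$$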

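where

$$\Delta_{q0} = (M_q - 1)^{-2}\, \mathrm{Var}_X(\hat\lambda_q)\, \big( M_q I_q - 1_q 1_q^T \big) f_{q0} f_{q0}^T \big( M_q I_q - 1_q 1_q^T \big) = M_q^2 (M_q - 1)^{-2}\, \mathrm{Var}_X(\hat\lambda_q)\, \big( f_{q0} - 1_q \bar f_{q0} \big) \big( f_{q0} - 1_q \bar f_{q0} \big)^T$$

and $\bar f_{q0} = M_q^{-1} 1_q^T f_{q0}$. If the estimates of the linkage probabilities are obtained by checking a random audit sample of linked records in each block, then $\mathrm{Var}_X(\hat\lambda_q) = m_q^{-1} \lambda_{q0} (1 - \lambda_{q0})$, where $m_q$ is the size of the audit sample in block $q$. Variance estimation based on (49) then follows in the usual way, by plugging in estimates for unknown quantities. That is, our estimator of $\mathrm{Var}_X(\hat\theta)$ is

$$\hat V_X(\hat\theta) = \Big( \sum_q \partial_\theta \hat U_q \Big)^{-1} \Big\{ \sum_q \hat G_q (\hat\Sigma_q + \hat\Delta_q) \hat G_q^T \Big\} \Big\{ \Big( \sum_q \partial_\theta \hat U_q \Big)^{-1} \Big\}^T \tag{50}$$

where a hat denotes a plug-in estimate. Table 2 shows the specification of the components of (50) for the important special case of linear regression and the linkage bias corrected estimators described in section 2.

Table 2 Specification of $\hat G_q$ and $\partial_\theta \hat U_q$ in (50) for the linear case

Estimator    $\hat G_q$                                                          $\partial_\theta \hat U_q$
(12)         $X_q^T$                                                             $-X_q^T \hat E_q X_q$
(18)         $X_q^T \hat E_q^T$                                                  $-\hat G_q \hat E_q X_q$
(20)         $X_q^T \hat E_q^T (\hat\sigma^2 I_q + \hat V_q)^{-1}$               $-\hat G_q \hat E_q X_q$
(30)         $X_q^T (\hat E_q^T \hat E_q)^{-1} \hat E_q^T$                       $-\hat G_q \hat E_q X_q$

3.4 Maximum likelihood logistic regression with linked data

Finally, we explore maximum likelihood estimation of a logistic model based on application of the MIP in the situation when the data are linked. If perfectly linked data were available (i.e. $y_q$ and $X_q$) the MLE for $\theta$ satisfies

$$sc(\theta) = \sum_q X_q^T \{ y_q - f_q(\theta) \} = 0.$$

Applying the MIP, the MLE for this parameter given the linked data therefore satisfies

$$sc^*(\theta) = \sum_q \big\{ E_X( X_q^T y_q \mid y_q^* ) - X_q^T f_q(\theta) \big\} = 0. \tag{51}$$

Implementing this approach requires that we know, or can approximate, the conditional expectation $E_X(X_q^T y_q \mid y_q^*)$. Typically, $X_q$ will contain an intercept that we can assume corresponds to the first column $X_{1q}$ of $X_q$. In this case it is clear that

$$E_X( X_{1q}^T y_q \mid y_q^* ) = E_X( 1_q^T y_q \mid y_q^* ) = 1_q^T y_q^*$$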

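so we only need to approximate $E_X(\tilde X_q^T y_q \mid y_q^*)$, where $\tilde X_q = [X_{2q} \cdots X_{pq}]$ denotes the remaining $p - 1$ columns of $X_q$. We conjecture that a reasonable approximation to this conditional expectation is

$$E_X( \tilde X_q^T y_q \mid y_q^* ) \approx E_X( \tilde X_q^T y_q \mid \tilde X_q^T y_q^*, \bar y_q^* ). \tag{52}$$

Provided $M_q$ is large enough, the joint distribution of $\tilde X_q^T y_q$, $\tilde X_q^T y_q^*$ and $\bar y_q^*$ given $X_q$ can be approximated by a multivariate normal distribution, with

$$E_X( \tilde X_q^T y_q ) = \tilde X_q^T f_q(\theta), \qquad E_X( \tilde X_q^T y_q^* ) = \tilde X_q^T E_q f_q(\theta), \qquad E_X( \bar y_q^* ) = \bar f_q(\theta)$$

and

$$\mathrm{Var}_X( \tilde X_q^T y_q ) = \tilde X_q^T D_q(\theta) \tilde X_q, \qquad \mathrm{Var}_X( \tilde X_q^T y_q^* ) = \tilde X_q^T \Sigma_q(\theta) \tilde X_q, \qquad \mathrm{Var}_X( \bar y_q^* ) = M_q^{-2}\, 1_q^T D_q(\theta) 1_q$$

$$\mathrm{Cov}_X( \tilde X_q^T y_q, \tilde X_q^T y_q^* ) = \tilde X_q^T D_q(\theta) E_q^T \tilde X_q, \qquad \mathrm{Cov}_X( \tilde X_q^T y_q, \bar y_q^* ) = M_q^{-1} \tilde X_q^T D_q(\theta) 1_q, \qquad \mathrm{Cov}_X( \tilde X_q^T y_q^*, \bar y_q^* ) = M_q^{-1} \tilde X_q^T E_q D_q(\theta) 1_q.$$

Since $\bar y_q^* = \bar y_q$ it immediately follows that

$$E_X( \tilde X_q^T y_q \mid \tilde X_q^T y_q^*, \bar y_q^* ) = \tilde X_q^T f_q(\theta) + \Gamma_q(\theta) \begin{pmatrix} \tilde X_q^T \{ y_q^* - E_q f_q(\theta) \} \\ \bar y_q^* - \bar f_q(\theta) \end{pmatrix} \tag{53}$$

where

$$\Gamma_q(\theta) = \begin{pmatrix} \tilde X_q^T D_q(\theta) E_q^T \tilde X_q & M_q^{-1} \tilde X_q^T D_q(\theta) 1_q \end{pmatrix} \begin{pmatrix} \tilde X_q^T \Sigma_q(\theta) \tilde X_q & M_q^{-1} \tilde X_q^T E_q D_q(\theta) 1_q \\ M_q^{-1} 1_q^T D_q(\theta) E_q^T \tilde X_q & M_q^{-2}\, 1_q^T D_q(\theta) 1_q \end{pmatrix}^{-1} = \begin{pmatrix} \Gamma_{1q}(\theta) & \Gamma_{2q}(\theta) \end{pmatrix}.$$

Here $\Gamma_{1q}(\theta)$ is of order $(p - 1) \times (p - 1)$ and $\Gamma_{2q}(\theta)$ is of order $(p - 1) \times 1$. Substituting (53) into the estimating equation (51) for the MLE then leads to the approximate ML estimating equation

$$sc^*(\theta) = \sum_q \begin{pmatrix} 1_q^T \{ y_q^* - f_q(\theta) \} \\ \Gamma_{1q}(\theta) \tilde X_q^T \{ y_q^* - E_q f_q(\theta) \} + \Gamma_{2q}(\theta) \{ \bar y_q^* - \bar f_q(\theta) \} \end{pmatrix} = 0$$

or equivalently

$$sc^*(\theta) = \sum_q \begin{pmatrix} 1_q^T \{ y_q^* - f_q(\theta) \} \\ \Gamma_q(\theta) \begin{pmatrix} \tilde X_q^T \{ y_q^* - E_q f_q(\theta) \} \\ \bar y_q^* - \bar f_q(\theta) \end{pmatrix} \end{pmatrix} = 0.$$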