An extension of the factoring likelihood approach for non-monotone missing data

Jae Kwang Kim    Dong Wan Shin

January 14, 2010

ABSTRACT

We address the problem of parameter estimation in multivariate distributions under ignorable non-monotone missing data. The factoring likelihood method for monotone missing data, termed by Rubin (1974), is extended to a more general case of non-monotone missing data. The proposed method is equivalent to the Newton-Raphson method for the observed likelihood, but avoids the burden of computing the first and the second partial derivatives of the observed likelihood. Instead, the maximum likelihood estimates and their information matrices for each partition of the data set are computed separately and combined naturally using the generalized least squares method. A numerical example is presented to illustrate the method. A Monte Carlo experiment compares the proposed method with the EM method.

KEY WORDS: EM algorithm, Gauss-Newton method, Generalized least squares, Maximum likelihood estimator, Missing at random.

Jae Kwang Kim: Department of Statistics, Iowa State University, Ames, IA, 50014, U.S.A., jkim@iastate.edu, phone: 1-515-294-3225, fax: 1-515-294-4040
Dong Wan Shin: Department of Statistics, Ewha University, Seoul, 120-750, Korea, shindw@ewha.ac.kr, phone: 82-011-9914-0189, fax: 82-2-3277-3607
1 Introduction

Missing data is quite common in practice. Statistical analysis of data with missing values is an important practical problem because missing data is oftentimes non-ignorable. When we simply ignore missing data, the resulting estimates will have nonresponse bias if the responding part of the sample is systematically different from the nonresponding part of the sample. Also, we may lose information contained in the partially observed data. Little and Rubin (2002) and Molenberghs and Kenward (2007) provide comprehensive overviews of the missing data problem.

We consider statistical inference for data with missing values using the maximum likelihood method. Specifically, we propose a computational tool for obtaining the maximum likelihood estimator (MLE) under multivariate missing data. To explain the basic idea, we use an example of bivariate normal data. An extension to general multivariate data is proposed later in Section 2.3. Let y_i = (y1i, y2i)' be bivariate normal random variables distributed as

  ( y1i )            [ ( µ1 )   ( σ11  σ12 ) ]
  ( y2i )  ~ iid  N  [ ( µ2 ) , ( σ12  σ22 ) ],    (1)

where iid is the abbreviation of independently and identically distributed. Note that five parameters, µ1, µ2, σ11, σ12, and σ22, are needed to identify the bivariate normal distribution. We assume that the observations are missing at random (MAR) as defined in Rubin (1976), so that the relevant likelihood is the observed likelihood, or the marginal likelihood of the observed data. Under MAR, we can ignore the response mechanism when estimating the population parameters.
If the missing data pattern is monotone in the sense that the set of respondents for one variable is a subset of the respondent set of the other variable, the observed likelihood can be factored into the marginal likelihood for one variable and the conditional likelihood for the second variable given the first, so that the maximum likelihood estimates can be computed separately from each likelihood. For example, assume that y1 is fully observed with n observations and y2 is observed for r observations. Anderson (1957) first considered maximum likelihood parameter estimation under this setup by using an alternative representation of the bivariate normal distribution as

  y1i ~ iid N(µ1, σ11)
  y2i | y1i ~ iid N(β20.1 + β21.1 y1i, σ22.1),    (2)

where β20.1 = µ2 − β21.1 µ1, β21.1 = σ11^{-1} σ12, and σ22.1 = σ22 − β21.1^2 σ11. The observed likelihood is then written as a product of the marginal likelihood of the fully observed variable y1 and the conditional likelihood of y2 given y1. Thus, the parameters µ1 and σ11 for the marginal distribution of y1 can be estimated with n observations, and the regression parameters β20.1, β21.1, and σ22.1 can be estimated from the conditional distribution with r observations.

The factoring likelihood (FL) method, termed by Rubin (1974), expresses the observed likelihood as a product of the marginal likelihood and the conditional likelihood so that the maximum likelihood estimates can be obtained separately from each likelihood. Note that the FL approach consists of two steps. In the first step, the likelihood is factored, and in the second step the MLE for each likelihood is computed separately. In many cases, the MLEs are easily computed in the FL approach because the marginal and the conditional likelihoods are known, so that we can directly use the known solutions of the likelihood equations for each likelihood.
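To make Anderson's factored estimation concrete, the following Python sketch (mine, not the authors') computes the MLEs under the monotone pattern just described: the marginal parameters of y1 from all n observations, the regression parameters of y2 on y1 from the r complete pairs, and the implied (µ2, σ12, σ22). The function name and the use of np.nan to mark missing values are illustrative assumptions.

```python
import numpy as np

def factored_mle_monotone(y1, y2):
    """Anderson (1957)-style factored MLEs for bivariate normal data with a
    monotone pattern: y1 (length n) is fully observed, y2 is observed for r
    units and np.nan elsewhere.  A sketch under those assumptions."""
    obs = ~np.isnan(y2)                      # respondents for y2

    # Marginal parameters of y1 use all n observations.
    mu1 = y1.mean()
    s11 = y1.var()                           # MLE uses divisor n

    # Regression of y2 on y1 uses the r complete pairs.
    y1r, y2r = y1[obs], y2[obs]
    b21 = np.cov(y1r, y2r, bias=True)[0, 1] / y1r.var()
    b20 = y2r.mean() - b21 * y1r.mean()
    s22_1 = np.mean((y2r - b20 - b21 * y1r) ** 2)

    # Back-transform to (mu2, sigma12, sigma22) using the identities below (2).
    mu2 = b20 + b21 * mu1
    s12 = b21 * s11
    s22 = s22_1 + b21 ** 2 * s11
    return dict(mu1=mu1, mu2=mu2, s11=s11, s12=s12, s22=s22)
```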
For monotone missing data, the MLEs for the conditional distribution are independent of those for the marginal distribution. This is because the two sets of parameters - the parameters for the marginal likelihood and those for the conditional likelihood - are orthogonal (Cox and Reid, 1987), and as a result the MLEs for the conditional likelihood are not affected by the MLEs for the marginal likelihood. Rubin (1974) recommended the FL approach as a general framework for the analysis of missing data with a monotone missing pattern. The main advantage of the FL approach is its computational simplicity.

Under non-monotone missing data patterns, however, the FL approach is not directly applicable. The EM algorithm, proposed by Dempster et al. (1977), successfully provides MLEs under a general missing pattern. The EM algorithm also avoids the calculation of the observed likelihood function and uses only the complete-data likelihood function. Despite its popularity, there are several shortcomings of the EM algorithm. First, the computation is performed iteratively and the convergence is notoriously slow (Liu and Rubin, 1994). Second, the covariance matrix of the estimated parameters is not provided directly (Louis, 1982; Meng and Rubin, 1991). The focus of this paper is to propose an alternative method that resolves these two issues at the same time.

In this paper, we consider an extension of the FL method to non-monotone missing data. To apply the FL method to non-monotone missing data, in addition to the two steps in the original FL approach, we need another step that combines the separate MLEs computed for each likelihood to produce
the final MLEs. The proposed method turns out to be essentially the same as the direct maximum likelihood method using the Newton-Raphson algorithm, which converges much faster than the EM algorithm. Furthermore, the proposed method provides the asymptotic variance-covariance matrix of the MLEs directly as a by-product of the computation. Using the variance-covariance expression, the asymptotic variances are compared with those of estimators obtained by ignoring some part of the partially observed data. A related work is Chen et al. (2008), who compared variances of estimators for regression models with missing responses and covariates.

The proposed method is an extension of the preliminary work of Kim (2004), who considered the case of bivariate missing data. In Section 2, the result of Kim (2004) is reviewed and extended to a more general class of multivariate missing data. Efficiency comparisons based on the asymptotic variance-covariance matrix obtained from the proposed method are discussed in Section 3. The proposed method is applied to a categorical data example in Section 4. Results from a limited simulation study are presented in Section 5. Concluding remarks are made in Section 6.

2 Proposed method

The proposed method can be described in the following three steps:

[Step 1] Partition the original sample into several disjoint sets according to the missing pattern.

[Step 2] Compute the MLEs for the identified parameters separately in each partition of the sample.
[Step 3] Combine the estimators to get a set of final estimates using a generalized least squares (GLS) form.

Kim (2004) discusses these procedures in detail for the bivariate case. We review the result of Kim (2004) for the bivariate normal case in Section 2.1. In Section 2.2, we consider a general class of bivariate distributions. In Section 2.3, the proposed method is extended to multivariate distributions.

2.1 Bivariate normal case

To simplify the presentation, we describe the proposed method in the bivariate normal setup with a non-monotone missing pattern. The joint distribution of y = (y1, y2)' is parameterized by the five parameters in model (1) or (2). For the convenience of the factoring method described in Section 1, we use the parametrization in (2) and let θ = (β20.1, β21.1, σ22.1, µ1, σ11)'.

In Step 1, we partition the sample into several disjoint sets according to the pattern of missingness. In the case of a non-monotone missing pattern with two variables, we have 3 = 2^2 − 1 types of respondents that contain information about the parameters. The first set H has both y1 and y2 observed, the second set K has y1 observed but y2 missing, and the third set L has y2 observed but y1 missing. See Table 1. Let nH, nK, nL be the sample sizes of the sets H, K, L, respectively. The cases with both y1 and y2 missing can be safely removed from the sample.

In Step 2, we obtain the parameter estimators in each set. For set H, we have the five parameters ηH = (β20.1, β21.1, σ22.1, µ1, σ11)' of the conditional distribution of y2 given y1 and the marginal distribution of y1, with MLEs η̂H = (β̂20.1,H, β̂21.1,H, σ̂22.1,H, µ̂1,H, σ̂11,H)'. For set K, the MLEs
η̂K = (µ̂1,K, σ̂11,K)' are obtained for ηK = (µ1, σ11)', the parameters of the marginal distribution of y1. For set L, the MLEs η̂L = (µ̂2,L, σ̂22,L)' are obtained for ηL = (µ2, σ22)', where µ2 = β20.1 + β21.1 µ1 and σ22 = σ22.1 + β21.1^2 σ11.

Table 1. An illustration of the missing data structure under the bivariate normal distribution

  Set   y1         y2         Sample Size   Estimable parameters
  H     Observed   Observed   nH            µ1, µ2, σ11, σ12, σ22
  K     Observed   Missing    nK            µ1, σ11
  L     Missing    Observed   nL            µ2, σ22

In Step 3, we use the GLS method to combine the three estimators η̂H, η̂K, η̂L to get a final estimator of the parameter θ. Let η̂ = (η̂H', η̂K', η̂L')'. Then

  η̂ = (β̂20.1,H, β̂21.1,H, σ̂22.1,H, µ̂1,H, σ̂11,H, µ̂1,K, σ̂11,K, µ̂2,L, σ̂22,L)'.    (3)

The expected value of this estimator is

  η(θ) = (β20.1, β21.1, σ22.1, µ1, σ11, µ1, σ11, β20.1 + β21.1 µ1, σ22.1 + β21.1^2 σ11)'    (4)

and the asymptotic covariance matrix is

  V = diag{ Σ22.1/nH, 2σ22.1^2/nH, σ11/nH, 2σ11^2/nH, σ11/nK, 2σ11^2/nK, σ22/nL, 2σ22^2/nL },    (5)

where

  Σ22.1 = σ22.1 ( 1 + σ11^{-1} µ1^2   −σ11^{-1} µ1
                  −σ11^{-1} µ1          σ11^{-1}   ).

Note that Σ22.1 = {E[(1, y1)'(1, y1)]}^{-1} σ22.1 = [ 1, µ1; µ1, σ11 + µ1^2 ]^{-1} σ22.1. Derivation of the asymptotic covariance matrix of the first five estimates in (3) is straightforward and can be found, for example, in Subsection 7.2.2 of Little
and Rubin (2002). We have a block-diagonal structure of V in (5) because µ̂1K and σ̂11K are independent due to normality, and observations in different sets are independent due to the iid assumption.

Note that the nine elements of η(θ) are related to each other because they are all functions of the five elements of the vector θ. The information contained in the extra four equations has not yet been utilized in constructing the estimators η̂H, η̂K, η̂L. This information can be employed to construct a fully efficient estimator of θ by combining η̂H, η̂K, η̂L through a generalized least squares (GLS) regression of η̂ = (η̂H', η̂K', η̂L')' on θ as follows:

  η̂ − η(θ̂S) = (∂η/∂θ')(θ − θ̂S) + error,

where θ̂S is an initial estimator. The expected value and variance of η̂ in (4) and (5) can be viewed as a nonlinear model in the five parameters of θ. Using a Taylor series expansion of the nonlinear model, a step of the Gauss-Newton method can be formulated as

  eη = X (θ − θ̂S) + u,    (6)

where eη = η̂ − η(θ̂S), η(θ̂S) is the vector (4) evaluated at θ̂S,

  X = ∂η/∂θ' =
      [ 1        0            0    0       0
        0        1            0    0       0
        0        0            1    0       0
        0        0            0    1       0
        0        0            0    0       1
        0        0            0    1       0
        0        0            0    0       1
        1        µ1           0    β21.1   0
        0        2β21.1σ11    1    0       β21.1^2 ],    (7)

and, approximately, u ~ (0, V).
Table 2. Summary for the bivariate normal case.

  y1   y2   Data Set   Size   Estimable parameters   Asymptotic variance
  O    M    K          nK     ηK = θ1                WK = diag(σ11, 2σ11^2)
  O    O    H          nH     ηH = (θ1', θ2')'       WH = diag(WK, Σ22.1, 2σ22.1^2)
  M    O    L          nL     ηL = (µ2, σ22)'        WL = diag(σ22, 2σ22^2)

  O: observed, M: missing, θ1 = (µ1, σ11)', θ2 = (β20.1, β21.1, σ22.1)', η = (ηH', ηK', ηL')', θ = ηH,
  V = diag(WH/nH, WK/nK, WL/nL), Σ22.1 = {E[(1, y1)'(1, y1)]}^{-1} σ22.1, X = ∂η/∂θ',
  µ2 = β20.1 + β21.1 µ1, σ22 = σ22.1 + β21.1^2 σ11, and V is the covariance matrix defined in (5).

The Gauss-Newton method for the estimation of nonlinear models can be found in Seber and Wild (1989). Relations among the parameters η, θ, X, and V are summarized in Table 2.

A simple initial estimator is the weighted average of the available estimators from the data sets, defined as

  θ̂S = (β̂20.1,H, β̂21.1,H, σ̂22.1,H, µ̂1,HK, σ̂11,HK)',    (8)

where µ̂1,HK = (1 − pK) µ̂1,H + pK µ̂1,K, σ̂11,HK = (1 − pK) σ̂11,H + pK σ̂11,K, and pK = nK/(nH + nK). This initial value is a √n-consistent estimate of θ and guarantees the consistency of the one-step estimator. The procedure can be carried out iteratively until convergence. Given the current value θ̂(t), the solution of the Gauss-Newton method can be obtained iteratively as

  θ̂(t+1) = θ̂(t) + ( X(t)' V̂(t)^{-1} X(t) )^{-1} X(t)' V̂(t)^{-1} { η̂ − η(θ̂(t)) },    (9)

where X(t) and V̂(t) are evaluated from X in (7) and V in (5), respectively, using the current value θ̂(t). The covariance matrix of the estimator in (9) can be estimated by

  C = ( X(t)' V̂(t)^{-1} X(t) )^{-1},    (10)

when the iteration is stopped at the t-th iteration.
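As a concrete illustration of Steps 2 and 3, here is a minimal Python sketch of the Gauss-Newton/GLS iteration (9) and the variance estimator (10) for the bivariate normal case, assembling η(θ), X, and V from (4), (5), and (7). The helper names and the numpy-based implementation are my own choices rather than anything prescribed in the paper; η̂ is assumed to have been computed from the three sets as in (3).

```python
import numpy as np

def eta_X_V(theta, nH, nK, nL):
    """Components of the GLS step (9) for the bivariate normal case:
    eta(theta) from (4), X = d eta / d theta' from (7), and V from (5).
    A sketch following the notation of Section 2.1."""
    b20, b21, s221, mu1, s11 = theta
    mu2 = b20 + b21 * mu1
    s22 = s221 + b21 ** 2 * s11
    eta = np.array([b20, b21, s221, mu1, s11, mu1, s11, mu2, s22])

    X = np.zeros((9, 5))
    X[:5, :5] = np.eye(5)                              # eta_H = theta
    X[5, 3] = 1.0                                      # d mu1 / d mu1
    X[6, 4] = 1.0                                      # d s11 / d s11
    X[7] = [1.0, mu1, 0.0, b21, 0.0]                   # d mu2 / d theta
    X[8] = [0.0, 2 * b21 * s11, 1.0, 0.0, b21 ** 2]    # d s22 / d theta

    Sigma221 = s221 * np.array([[1 + mu1 ** 2 / s11, -mu1 / s11],
                                [-mu1 / s11, 1 / s11]])
    V = np.zeros((9, 9))
    V[:2, :2] = Sigma221 / nH
    np.fill_diagonal(V[2:, 2:],
                     [2 * s221 ** 2 / nH, s11 / nH, 2 * s11 ** 2 / nH,
                      s11 / nK, 2 * s11 ** 2 / nK,
                      s22 / nL, 2 * s22 ** 2 / nL])
    return eta, X, V

def gauss_newton(eta_hat, theta0, nH, nK, nL, n_iter=10):
    """Iterate (9); the covariance estimate (10) comes from the last step."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        eta, X, V = eta_X_V(theta, nH, nK, nL)
        Vinv = np.linalg.inv(V)
        C = np.linalg.inv(X.T @ Vinv @ X)
        theta = theta + C @ X.T @ Vinv @ (eta_hat - eta)
    return theta, C
```

Starting from the initial estimator (8), a single pass of this update already gives the one-step estimator discussed in Remark 2 below.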
Remark 1 - Bivariate normal monotone case. In the case of monotone missing data consisting of the sets H and K, the iteration (9) produces the estimator obtained by the factoring likelihood method, given in Little and Rubin (2002). In order to see this, note that (9) reduces to

  θ̂(t+1) = θ̂(t) + ( XHK' V̂HK(t)^{-1} XHK )^{-1} XHK' V̂HK(t)^{-1} { η̂HK − η(θ̂(t)) },    (11)

where

  XHK = [ 1  0  0  0  0
          0  1  0  0  0
          0  0  1  0  0
          0  0  0  1  0
          0  0  0  0  1
          0  0  0  1  0
          0  0  0  0  1 ],

  V̂HK(t) = diag{ Σ̂22.1(t)/nH, 2σ̂22.1(t)^2/nH, σ̂11(t)/nH, 2σ̂11(t)^2/nH, σ̂11(t)/nK, 2σ̂11(t)^2/nK },

and

  η̂HK = (β̂20.1,H, β̂21.1,H, σ̂22.1,H, µ̂1,H, σ̂11,H, µ̂1K, σ̂11K)'.

Starting with the initial estimator θ̂(0) = (β̂20.1,H, β̂21.1,H, σ̂22.1,H, µ̂1,H, σ̂11,H)', the estimator constructed from data set H, one application of (9) leads to the one-step estimate

  θ̂(1) = (β̂20.1,H, β̂21.1,H, σ̂22.1,H, µ̂1,HK, σ̂11,HK)',    (12)

where µ̂1,HK and σ̂11,HK are defined after (8). This is the same as the MLE of Anderson (1957) from the original factoring likelihood method. When only y2 is subject to missingness, the estimated regression coefficient for the regression of y2 on y1 using the set H only is fully efficient, but the estimated regression coefficient for the regression of y1 on y2 based on the set H only is not fully efficient.
2.2 General bivariate case

We now consider the general bivariate case where the joint distribution is not necessarily normal. Assume that the joint distribution of y = (y1, y2)' is parameterized by θ. For set H, let ηH = ηH(θ) be a parametrization of the joint distribution of (y1, y2) such that the information matrix, say IH(ηH), is easy to compute. One such parametrization is ηH = (ηH1', ηH2')', where ηH1 is the parameter vector for the conditional distribution of y1 given y2 and ηH2 is the parameter vector for the marginal distribution of y2. Since the parameters for the conditional distribution are orthogonal to those for the marginal distribution, the parametrization ηH = (ηH1', ηH2')' results in a block-diagonal IH(ηH). For set K, let ηK = ηK(θ) be a parametrization of the marginal distribution of y1 such that the information matrix, say IK(ηK), is easy to compute. Define ηL similarly. The parametrization for set H need not be the same as that for set K or for set L, which provides flexibility in choosing the parametrization. A separate orthogonal parametrization in each set leads to computational advantages over the direct maximum likelihood method.

Let η̂H be the MLE of ηH constructed from the sample set H. Define η̂K and η̂L similarly. Let VH, VK, and VL be the estimated covariance matrices of the MLEs η̂H, η̂K, and η̂L, respectively, and let V = diag(VH, VK, VL). Note that V^{-1} = diag{IH(ηH), IK(ηK), IL(ηL)}. Because V is a function of θ, we write V = V(θ). Also, define

  X(θ) = ( ∂ηH/∂θ'
           ∂ηK/∂θ'
           ∂ηL/∂θ' )

and η(θ) = (ηH(θ)', ηK(θ)', ηL(θ)')'. Note that η̂ = (η̂H', η̂K', η̂L')' is
different from η(θ̂) in general.

Using the above notation, the maximum likelihood estimator can be computed iteratively from (9) with X(t) = X(θ̂(t)) and V̂(t) = V(θ̂(t)).

We now show that the procedure in (9) produces a fully efficient estimator of θ in that it is equivalent to the Newton-Raphson procedure for ML estimation based on the observed likelihood. Let the score function of the observed likelihood be defined as S_obs(y; θ) = ∂ log l_obs(θ)/∂θ, where

  l_obs(θ) = ∏_i ∫ f(y_i; θ) dy_{i(mis)},

with y_{i(mis)} defined to be the missing part of y_i, is the observed likelihood function of the parameter θ. The Newton-Raphson method for maximum likelihood estimation can be defined as

  θ̂(t+1) = θ̂(t) + [ I_obs(θ̂(t)) ]^{-1} S_obs(y; θ̂(t)),    (13)

where I_obs(θ) = E[ −∂^2 log l_obs(θ)/∂θ∂θ' ] is the expected information matrix for θ. We show in Theorem 1 below that the iterations (9) and (13) are identical in that, starting from θ̂(t), the two iterations produce identical values of θ̂(t+1) for all t. Therefore, our procedure gives a fully efficient estimator of θ. Equivalence between the Gauss-Newton estimator and the maximum likelihood estimator will be established in a more general multivariate situation in Section 2.3. Note that evaluation of the likelihood l_obs(θ), and hence direct computation of the MLE through the scoring method (13), is not trivial
due to the complexities in evaluating the integral in the observed likelihood. On the other hand, our procedure is easy to implement because it involves evaluation of likelihoods corresponding only to the observed parts. The following theorem establishes the equivalence between the Gauss-Newton estimator in (9) and the maximum likelihood estimator in (13).

Theorem 1 The Gauss-Newton estimator (9) is equivalent to the maximum likelihood estimator (13) in that, starting from θ̂(t), (9) and (13) give the same value of θ̂(t+1).

Proof. See Appendix A.

Note that, because of the nature of the Newton-Raphson algorithm, the iteration (9) converges much faster than the usual EM algorithm. Moreover, our procedure directly produces a simple estimator C of the variance of the MLE θ̂, while the EM algorithm does not give a direct estimate of the variance of θ̂.

Remark 2 - One-step estimator. Given a suitable choice of the initial estimate θ̂S, the one-step estimator

  θ̂ = θ̂S + ( XS' V̂S^{-1} XS )^{-1} XS' V̂S^{-1} eη,    (14)

can be a very good approximation to the maximum likelihood estimator, where XS and V̂S are evaluated from X and V, respectively, using the initial estimator θ̂S. The one-step Newton-Raphson estimator (13) using √n-consistent initial estimates is asymptotically equivalent to the MLE (Lehmann, 1983, Theorem 3.1). By Theorem 1, the one-step Gauss-Newton estimator (14) is also asymptotically equivalent to the MLE.
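In terms of the sketch given after (10), the one-step estimator (14) is simply a single pass of the same update. The snippet below assumes that η̂, a √n-consistent start θ̂S, and the sample sizes have already been computed as in Section 2.1; the variable names are hypothetical.

```python
# One-step Gauss-Newton estimator (14): a single iteration of (9) started
# from a root-n consistent theta_S; asymptotically equivalent to the MLE.
theta_onestep, C_onestep = gauss_newton(eta_hat, theta_S, nH, nK, nL, n_iter=1)
```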
2.3 General multivariate case

One advantage of the GLS Gauss-Newton procedure (9) is that it easily extends to a general multivariate case with p variables, y = (y1, ..., yp)'. Any general missing data set can be partitioned into, say, H1, ..., Hq, mutually disjoint and exhaustive data sets such that, for each data set Hj, j = 1, ..., q, all the elements share the same missing pattern. Therefore, each set Hj can be considered a complete data set if only the observed variables are considered. Let θ be a parameter vector by which the joint distribution of y is fully indexed and identified. We choose a parameter vector ηj for the joint distribution of the observed variables corresponding to Hj such that the joint distribution is identified and the information matrix, say Ij(θ), is easy to compute. Let η̂j be the MLE of ηj computed from the data set Hj, which is easily obtained because Hj is complete. We have Vj = var(η̂j) = Ij^{-1}(θ). Let η = (η1', ..., ηq')' and η̂ = (η̂1', ..., η̂q')'. Then V = var(η̂) = diag(I1^{-1}(θ), ..., Iq^{-1}(θ)). Letting X = ∂η/∂θ', with an initial consistent estimator θ̂S, the iteration

  θ̂ = θ̂S + ( X' V^{-1} X )^{-1} X' V^{-1} ( η̂ − η(θ̂S) )    (15)

defines a one-step Gauss-Newton procedure for ML estimation. The following theorem establishes the equivalence of the proposed estimator and the MLE. The proof is a straightforward extension of that of Theorem 1 and is skipped for brevity.

Theorem 2 The one-step estimator (15) is equivalent to the maximum likelihood estimator.
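Step 1 of the procedure is purely mechanical. The following sketch (an illustration under the stated setup, not the authors' code) groups the rows of an n × p data matrix by their missing pattern; each resulting set Hj is then analyzed as a complete data set in its observed variables to obtain η̂j and Ij.

```python
import numpy as np

def partition_by_pattern(Y):
    """Step 1 of the proposed method: split the n x p data matrix Y (np.nan
    marking missing entries) into the sets H_1, ..., H_q of rows that share
    the same missing pattern.  Rows with nothing observed are dropped."""
    obs = ~np.isnan(Y)
    patterns = {}
    for i, row in enumerate(obs):
        key = tuple(row)
        if any(key):                       # discard completely missing rows
            patterns.setdefault(key, []).append(i)
    # Each entry: (indices of observed variables, row indices of that set).
    return [(np.flatnonzero(key), np.array(rows))
            for key, rows in patterns.items()]
```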
In order to implement our procedure (15), we need to specify (θ, η, X, V) as well as an initial estimator θ̂S. The specification of (θ, η, X, V) depends on the data distribution and the missing pattern. In the following remarks, explicit expressions for (θ, η, X, V) are given for some important cases. These remarks demonstrate that our procedure (15) can be easily implemented for multivariate normal cases, and hence for multiple regressions with any missing pattern.

Remark 3 - 3-variate general non-monotone missing case. We give a detailed implementation of the Gauss-Newton procedure (15) for the 3-variate case. Assume that y = (y1, y2, y3)' is jointly normal N3(µ, Σ), µ = (µ1, µ2, µ3)', Σ = (σij). All 7 = 2^3 − 1 possible missing cases are displayed in Table 3, together with the corresponding expressions for θ, η, X and V. The parameters θ1, θ2, θ3 correspond to the parameters of the distribution of y1, the conditional distribution of y2 given y1, and the conditional distribution of y3 given (y1, y2), respectively. The parameter θ = (θ1', θ2', θ3')' fully parameterizes the joint distribution of (y1, y2, y3). The conditional parameters can be written in the following regression equations:

  y1 = µ1 + e1,                                  e1 ~ N(0, σ11),
  y2 = β20.1 + β21.1 y1 + e2.1,                  e2.1 ~ N(0, σ22.1),
  y3 = β30.12 + β31.12 y1 + β32.12 y2 + e3.12,   e3.12 ~ N(0, σ33.12),
  y3 = β30.1 + β31.1 y1 + e3.1,                  e3.1 ~ N(0, σ33.1),
  y3 = β30.2 + β31.2 y2 + e3.2,                  e3.2 ~ N(0, σ33.2),

in which the regression errors are independent of the regressors.

Construction of the estimator η̂j from data set Hj is obvious from the definition of the parameter ηj.
For example, η̂1 = (µ̂1,1, σ̂11,1)' is estimated from H1; η̂2 = (θ̂1,2', θ̂2,2')' is estimated from data set H2, where θ̂1,2 = (µ̂1,2, σ̂11,2)' is constructed from variable y1 and θ̂2,2 = (β̂20.1,2, β̂21.1,2, σ̂22.1,2)' is constructed from the regression of y2 on y1; η̂7 = (µ̂2,7, σ̂22,7, β̂30.2,7, β̂31.2,7, σ̂33.2,7)' is estimated from H7, where (µ̂2,7, σ̂22,7) is constructed from variable y2 and (β̂30.2,7, β̂31.2,7, σ̂33.2,7) is constructed from the regression of y3 on y2. We then have η̂ = (η̂7', ..., η̂1')'.

A simple initial estimator θ̂S = (θ̂1S', θ̂2S', θ̂3S')' can be constructed by averaging the available estimators from the data sets H1, ..., H7 as given by

  θ̂1S = (n1 θ̂1,1 + n2 θ̂1,2 + n3 θ̂1,3 + n6 θ̂1,6)/(n1 + n2 + n3 + n6),
  θ̂2S = (n2 θ̂2,2 + n3 θ̂2,3)/(n2 + n3),
  θ̂3S = θ̂3,3.

For the evaluation of η and X = ∂η/∂θ', we need expressions for η with respect to θ and their derivatives. This issue is addressed in Remark 4 below. We now have all the materials for implementing (15). Observe that some elements of η4, ..., η7 are nonlinear functions of θ. Therefore, the X matrix has elements other than 0 or 1, as occurred in the last two rows of X in (7). For a monotone missing pattern, the sets H4-H7 are empty and the X matrix consists only of elements 0 or 1, as in Remark 1.

Remark 4 - Evaluation of η and X. Consider the general p-dimensional normal case Np(µ, Σ), µ = (µ1, ..., µp)', Σ = (σij). As shown in Remark 3, the elements of η take one of the following three forms: {θj, j = 1, ..., p}; {µ, Σ}; or {parameters, say (βj0.J, βjJ.J, σjj.J), of the regression of yj on a vector, say yJ, a subvector of (y1, y2, ..., y_{j-1})', such that yj = βj0.J + βjJ.J' yJ + ej.J, j = 2, ..., p}. In order to compute η and X = ∂η/∂θ', we need expressions for the parameters µ, Σ and (βj0.J, βjJ.J, σjj.J) in terms of the conditional parameters θ = (θ1', ..., θp')'.
Table 3. 3-dimensional normal case: non-monotone missing.

  y1  y2  y3   Data Set   Size   Estimable parameters                    Asymptotic variance
  O   M   M    H1         n1     η1 = θ1                                 W1 = diag(σ11, 2σ11^2)
  O   O   M    H2         n2     η2 = (θ1', θ2')'                        W2 = diag(W1, Σ22.1, 2σ22.1^2)
  O   O   O    H3         n3     η3 = (θ1', θ2', θ3')'                   W3 = diag(W2, Σ33.12, 2σ33.12^2)
  M   M   O    H4         n4     η4 = (µ3, σ33)'                         W4 = diag(σ33, 2σ33^2)
  M   O   M    H5         n5     η5 = (µ2, σ22)'                         W5 = diag(σ22, 2σ22^2)
  O   M   O    H6         n6     η6 = (θ1', β30.1, β31.1, σ33.1)'        W6 = diag(W1, Σ33.1, 2σ33.1^2)
  M   O   O    H7         n7     η7 = (µ2, σ22, β30.2, β31.2, σ33.2)'    W7 = diag(σ22, 2σ22^2, Σ33.2, 2σ33.2^2)

  θ1 = (µ1, σ11)', θ2 = (β20.1, β21.1, σ22.1)', θ3 = (β30.12, β31.12, β32.12, σ33.12)'; µ = (µ1, µ2, µ3)' and Σ = (σij) are computed from (θ1, θ2, θ3) using the recursion in Remark 4 below; β30.1 = µ3 − β31.1 µ1, β31.1 = σ13/σ11, σ33.1 = σ33 − β31.1^2 σ11; β30.2 = µ3 − β31.2 µ2, β31.2 = σ23/σ22, σ33.2 = σ33 − β31.2^2 σ22; Σ22.1 = {E[(1, y1)'(1, y1)]}^{-1} σ22.1, Σ33.12 = {E[(1, y1, y2)'(1, y1, y2)]}^{-1} σ33.12, Σ33.1 = {E[(1, y1)'(1, y1)]}^{-1} σ33.1, Σ33.2 = {E[(1, y2)'(1, y2)]}^{-1} σ33.2; η = (η7', η6', ..., η1')', θ = η3, V = diag(W7/n7, W6/n6, ..., W1/n1), X = ∂η/∂θ'.

Using the regression expression

  y_{j+1} = β(j+1)0.12...j + β(j+1)1.12...j y1 + ... + β(j+1)j.12...j yj + e(j+1).12...j,

we get, for j = 0, 1, 2, ..., p−1,

  µ_{j+1} = E(y_{j+1}) = β(j+1)0.12...j + β(j+1)1.12...j µ1 + ... + β(j+1)j.12...j µj,

  σ_{i,j+1} = cov(yi, y_{j+1}) = β(j+1)1.12...j σ_{i1} + ... + β(j+1)j.12...j σ_{ij},   i = 1, 2, ..., j,

and

  σ_{j+1,j+1} = var(y_{j+1}) = σ(j+1)(j+1).12...j + Σ_{k=1}^{j} Σ_{l=1}^{j} β(j+1)k.12...j β(j+1)l.12...j σ_{kl}.

Note that, in the above three equations, (µ_{j+1}, σ_{1,j+1}, σ_{2,j+1}, ..., σ_{j+1,j+1}) is expressed in terms of the conditional parameter θ_{j+1} = (β(j+1)0.12...j, β(j+1)1.12...j, ..., β(j+1)j.12...j, σ(j+1)(j+1).12...j)' and the marginal parameters (µj, σ_{1j}, ..., σ_{jj}), the latter of which is a function of
θj, θ_{j−1}, ..., θ1. Therefore, recursive evaluation of these three equations for j = 0, 1, ..., p−1, with the initial values given by θ1 = (µ1, σ11)', gives the required expression for the marginal parameters (µj, σ_{1j}, ..., σ_{j−1,j}, σ_{jj}), j = 1, ..., p, in terms of the conditional parameters θj, θ_{j−1}, ..., θ1.

Partial derivatives are recursively computed as follows: for j = 0, 1, ..., p−1,

  ∂µ_{j+1}/∂θt = Σ_{l=1}^{j} β(j+1)l.12...j ∂µl/∂θt     if t = 1, ..., j,
               = (1, µ1, ..., µj, 0)                    if t = j+1,
               = 0                                      if t = j+2, j+3, ..., p,

  ∂σ_{i,j+1}/∂θt = β(j+1)1.12...j ∂σ_{i1}/∂θt + ... + β(j+1)j.12...j ∂σ_{ij}/∂θt,   i = 1, 2, ..., j,  t = 1, 2, ..., j,

  ∂σ_{j+1,j+1}/∂θt = Σ_{k=1}^{j} Σ_{l=1}^{j} β(j+1)k.12...j β(j+1)l.12...j ∂σ_{kl}/∂θt,   t = 1, 2, ..., j,

  ∂σ_{i,j+1}/∂θ_{j+1} = (0, σ_{i1}, ..., σ_{ij}, 0),   i = 1, ..., j,

  ∂σ_{j+1,j+1}/∂θ_{j+1} = (0, 2 Σ_{k=1}^{j} β(j+1)k.12...j σ_{k1}, 2 Σ_{k=1}^{j} β(j+1)k.12...j σ_{k2}, ..., 2 Σ_{k=1}^{j} β(j+1)k.12...j σ_{kj}, 1),

  ∂σ_{i,j+1}/∂θt = 0,   i = 1, 2, ..., j+1,   t = j+2, j+3, ..., p.

Given these expressions for µ, Σ and their partial derivatives in terms of θ, it is straightforward to compute (βj0.J, βjJ.J, σjj.J) and the corresponding derivatives because the regression parameters are simple functions of µ and Σ.
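The recursion in Remark 4 is easy to program. The Python sketch below recovers (µ, Σ) from the conditional parameters θ1, ..., θp; the list-of-arrays layout of θ and the function name are my own conventions rather than notation from the paper, and the derivative recursion would be coded analogously.

```python
import numpy as np

def conditional_to_marginal(theta):
    """Recursion of Remark 4: recover (mu, Sigma) of a p-variate normal from
    theta_1 = (mu1, s11) and, for j = 2, ..., p,
    theta_j = (beta_{j0.1..j-1}, beta_{j1.1..j-1}, ..., beta_{j,j-1.1..j-1},
               s_{jj.1..j-1}).
    `theta` is a list of 1-D arrays; a sketch assuming this layout."""
    p = len(theta)
    mu = np.zeros(p)
    Sigma = np.zeros((p, p))
    mu[0], Sigma[0, 0] = theta[0]
    for j in range(1, p):                       # j indexes y_{j+1} in the paper
        b0, b, s_cond = theta[j][0], theta[j][1:-1], theta[j][-1]
        mu[j] = b0 + b @ mu[:j]                 # mean equation
        Sigma[j, :j] = b @ Sigma[:j, :j]        # cov(y_i, y_{j+1}), i <= j
        Sigma[:j, j] = Sigma[j, :j]
        Sigma[j, j] = s_cond + b @ Sigma[:j, :j] @ b   # variance equation
    return mu, Sigma
```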
Remark 5 - A 4-variate non-monotone case. In Table 4, expressions for θ, η, X and V are given for a 4-dimensional case with a specific missing pattern.

Table 4. A 4-dimensional normal case - non-monotone missing.

  y1  y2  y3  y4   Data Set   Size   Estimable parameters          Asymptotic variance
  O   M   M   M    H1         n1     η1 = (µ1, σ11)'               W1 = diag(σ11, 2σ11^2)
  M   O   M   M    H2         n2     η2 = (µ2, σ22)'               W2 = diag(σ22, 2σ22^2)
  M   M   O   M    H3         n3     η3 = (µ3, σ33)'               W3 = diag(σ33, 2σ33^2)
  M   M   M   O    H4         n4     η4 = (µ4, σ44)'               W4 = diag(σ44, 2σ44^2)
  O   O   O   O    H5         n5     η5 = (θ1', θ2', θ3', θ4')'    W5, see below

  θ1 = (µ1, σ11)', θ2 = (β20.1, β21.1, σ22.1)', θ3 = (β30.12, β31.12, β32.12, σ33.12)', θ4 = (β40.123, β41.123, β42.123, β43.123, σ44.123)'; µ = (µ1, ..., µ4)' and Σ = (σij) are computed from (θ1, ..., θ4) using the recursion in Remark 4; W5 = diag(σ11, 2σ11^2, Σ22.1, 2σ22.1^2, Σ33.12, 2σ33.12^2, Σ44.123, 2σ44.123^2); Σ22.1 = {E[(1, y1)'(1, y1)]}^{-1} σ22.1, Σ33.12 = {E[(1, y1, y2)'(1, y1, y2)]}^{-1} σ33.12, Σ44.123 = {E[(1, y1, y2, y3)'(1, y1, y2, y3)]}^{-1} σ44.123; η = (η5', η4', ..., η1')', θ = η5, V = diag(W5/n5, W4/n4, ..., W1/n1), X = ∂η/∂θ'.

3 Efficiency comparison

We compare the efficiencies of estimators constructed from different combinations of the data sets H, K, L for the bivariate normal case. Under the non-monotone missing pattern, we can compute the following four types of estimates:

1. θ̂H: the maximum likelihood estimator using the samples in the set H.
2. θ̂HK: the maximum likelihood estimator using the samples in H ∪ K.
3. θ̂HL: the maximum likelihood estimator using the samples in H ∪ L.
4. θ̂HKL: the maximum likelihood estimator using the whole sample.

By Theorem 1, the Gauss-Newton estimator (9) is asymptotically equal to θ̂HKL. Write X' = [XH', XK', XL'], where XH is the upper 5 × 5 submatrix of X, XK is the 2 × 5 submatrix in the middle of X, and XL is the 2 × 5 submatrix at the bottom of X. Similarly, we can decompose η̂ = (η̂H', η̂K', η̂L')' and V = diag{VH, VK, VL}.
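Before turning to the closed-form expressions, note that all four asymptotic covariance matrices can be obtained numerically by summing the information blocks of the sets that are used. The sketch below reuses the eta_X_V() helper from the Section 2.1 sketch (an assumption of this illustration) and is convenient for checking the formulas that follow.

```python
import numpy as np

def asy_variances(theta, nH, nK, nL):
    """Asymptotic covariance matrices of theta-hat for the four combinations
    of data sets in Section 3, obtained by adding the information blocks
    X_j' V_j^{-1} X_j.  A numerical companion to the closed-form results."""
    _, X, V = eta_X_V(theta, nH, nK, nL)
    Vinv = np.linalg.inv(V)
    blocks = {"H": slice(0, 5), "K": slice(5, 7), "L": slice(7, 9)}
    info = {s: X[idx].T @ Vinv[idx, idx] @ X[idx] for s, idx in blocks.items()}
    var = lambda *sets: np.linalg.inv(sum(info[s] for s in sets))
    return {"H": var("H"), "HK": var("H", "K"),
            "HL": var("H", "L"), "HKL": var("H", "K", "L")}
```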
Using the arguments in the proof of Theorem 1, the asymptotic variance of θ̂H is (XH' V̂H^{-1} XH)^{-1}. Similarly, we have

  Var(θ̂HL) = ( XH' V̂H^{-1} XH + XL' V̂L^{-1} XL )^{-1},
  Var(θ̂HK) = ( XH' V̂H^{-1} XH + XK' V̂K^{-1} XK )^{-1},

and

  Var(θ̂HKL) = ( XH' V̂H^{-1} XH + XK' V̂K^{-1} XK + XL' V̂L^{-1} XL )^{-1}.

Using matrix algebra such as

  ( XH' V̂H^{-1} XH + XL' V̂L^{-1} XL )^{-1}
    = ( XH' V̂H^{-1} XH )^{-1} − ( XH' V̂H^{-1} XH )^{-1} XL' [ V̂L + XL ( XH' V̂H^{-1} XH )^{-1} XL' ]^{-1} XL ( XH' V̂H^{-1} XH )^{-1},

we can derive expressions for the variances of the estimators.

For estimates of the slope parameter, the asymptotic variances are

  Var(β̂21.1,HK) = σ22.1/(σ11 nH) = Var(β̂21.1,H),
  Var(β̂21.1,HL) = σ22.1/(σ11 nH) { 1 − 2 pL ρ^2 (1 − ρ^2) },
  Var(β̂21.1,HKL) = σ22.1/(σ11 nH) { 1 − 2 pL ρ^2 (1 − ρ^2)/(1 − pL pK ρ^4) },

where ρ^2 = σ12^2/(σ11 σ22), pK = nK/(nH + nK), and pL = nL/(nH + nL). See Appendix B for derivations of these variances and of the other variances below. Thus, we have

  Var(β̂21.1,H) = Var(β̂21.1,HK) ≥ Var(β̂21.1,HL) ≥ Var(β̂21.1,HKL).    (16)

Here strict inequalities generally hold except for special trivial cases. Note that the asymptotic variance of β̂21.1,HK is the same as the variance of β̂21.1,H,
which implies that there is no gain in efficiency from adding set K (missing y2) to H. On the other hand, by comparing Var(β̂21.1,HL) with Var(β̂21.1,H), we observe an efficiency gain from adding set L (missing y1) to H. This analysis is similar to the results of Little (1992), who summarized statistical results for regression with missing X's, where the data sets are H and L. Little (1992) did not include the case of missing y2's because the data set (K) with missing y2 does not contain additional information for estimating the regression parameter.

It is interesting to observe that even though adding K (the data set with missing y2) to H does not improve the efficiency of the regression parameter estimate, i.e., Var(β̂21.1,H) = Var(β̂21.1,HK), adding K to (H, L) does improve the efficiency, i.e., Var(β̂21.1,HL) > Var(β̂21.1,HKL).

Using these expressions, we can investigate the variance reduction of β̂21.1,HKL over β̂21.1,HL. For example, the relative efficiency of β̂21.1,HKL over β̂21.1,HL is larger for larger values of pK, pL, or ρ. If pL = pK = 0.5 and ρ = 0.5, the relative efficiency value is 1.0037. If pL = pK = 0.9 and ρ = 0.9, the relative efficiency value is 1.768.

For the other parameters of the conditional distribution, relationships similar to (16) hold. For the marginal parameters, we have

  Var(µ̂1,HK) = (σ11/nH)(1 − pK),
  Var(µ̂1,HL) = (σ11/nH)(1 − pL ρ^2),
  Var(µ̂1,HKL) = (σ11/nH){ (1 − pK) − pL ρ^2 (1 − pK)^2/(1 − pL pK ρ^2) },
and

  Var(σ̂11,HK) = (2σ11^2/nH)(1 − pK),
  Var(σ̂11,HL) = (2σ11^2/nH)(1 − pL ρ^4),
  Var(σ̂11,HKL) = (2σ11^2/nH){ (1 − pK) − pL ρ^4 (1 − pK)^2/(1 − pL pK ρ^4) }.

Note that the efficiency of the estimators of the marginal parameters (µ1, σ11) of y1 improves if additional data for y2 with y1 missing are provided. In particular, if nK = nL, then

  Var(µ̂1,H) ≥ Var(µ̂1,HL) ≥ Var(µ̂1,HK) ≥ Var(µ̂1,HKL)

and

  Var(σ̂11,H) ≥ Var(σ̂11,HL) ≥ Var(σ̂11,HK) ≥ Var(σ̂11,HKL).

4 A Numerical Example

For a numerical example, we consider the data set adapted from Bishop, Fienberg and Holland (1975, Table 1.4-2). Table 5 gives the data for a 2^3 table of three categorical variables (Y1 = Clinic, Y2 = Parental care, Y3 = Survival) with one supplemental margin for Y2 and Y3 and another supplemental margin for Y1 and Y3. In this setup, the Yi are all dichotomous, taking values 0 or 1, and 8 parameters can be defined as

  πijk = Pr(Y1 = i, Y2 = j, Y3 = k),   i = 0, 1; j = 0, 1; k = 0, 1.
Table 5. A 2^3 table with supplemental margins

  Set   y1   y2   y3   Count
  H     1    1    1    293
        1    0    1    176
        0    1    1     23
        0    0    1    197
        1    1    0      4
        1    0    0      3
        0    1    0      2
        0    0    0     17
  K     1    -    1    100
        0    -    1     82
        1    -    0      5
        0    -    0      6
  L     -    1    1     90
        -    0    1    150
        -    1    0      5
        -    0    0     10
For the orthogonal parametrization, we use

  ηH = ( π1|11, π1|10, π1|01, π1|00, π+1|1, π+1|0, π++1 )',

where

  πi|jk = Pr(Y1 = i | Y2 = j, Y3 = k),  π+j|k = Pr(Y2 = j | Y3 = k),  π++k = Pr(Y3 = k).

We also set θ = (θ1, θ2, θ3, θ4, θ5, θ6, θ7)' = ηH. Note that the validity of the proposed method does not depend on the choice of the parametrization. A suitable parametrization simply makes the computation of the information matrix simple.

From the data in Table 5, we can obtain 13 observations for the 7 parameters. The observation vector can be written as η̂ = (η̂H', η̂K', η̂L')', where

  η̂H = (293/316, 4/6, 176/373, 3/20, 316/689, 6/26, 689/715)',
  η̂K = (π̂1|+1,K, π̂1|+0,K, π̂++1,K)' = (100/182, 5/11, 182/193)',
  η̂L = (π̂+1|1,L, π̂+1|0,L, π̂++1,L)' = (90/240, 5/15, 240/255)',

with the expectations η(θ) = (ηH', ηK', ηL')', where ηH = θ,

  ηK = ( π1|11 π+1|1 + π1|01 π+0|1, π1|10 π+1|0 + π1|00 π+0|0, π++1 )'
     = ( θ1 θ5 + θ3 (1 − θ5), θ2 θ6 + θ4 (1 − θ6), θ7 )',

and

  ηL = ( π+1|1, π+1|0, π++1 )' = ( θ5, θ6, θ7 )',

and the variance-covariance matrix V = diag{ VH/nH, VK/nK, VL/nL }, where

  VH = diag{ θ1(1 − θ1), θ2(1 − θ2), ..., θ7(1 − θ7) },
  VK = diag{ π1|+1(1 − π1|+1), π1|+0(1 − π1|+0), π++1(1 − π++1) },
  VL = diag{ θ5(1 − θ5), θ6(1 − θ6), θ7(1 − θ7) }.
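All of the quantities above can be assembled in a few lines; the following Python sketch carries out the Gauss-Newton step described next, using the variance specification exactly as stated above. The code is an illustration (variable names are mine), not the authors' implementation.

```python
import numpy as np

nH, nK, nL = 715, 193, 255
eta_hat = np.array([293/316, 4/6, 176/373, 3/20, 316/689, 6/26, 689/715,  # H
                    100/182, 5/11, 182/193,                               # K
                    90/240, 5/15, 240/255])                               # L
theta = np.array([293/316, 4/6, 176/373, 3/20, 406/929, 11/41, 1111/1163])

def eta_X_V(t):
    t1, t2, t3, t4, t5, t6, t7 = t
    pi1_p1 = t1*t5 + t3*(1 - t5)          # Pr(Y1 = 1 | Y3 = 1)
    pi1_p0 = t2*t6 + t4*(1 - t6)          # Pr(Y1 = 1 | Y3 = 0)
    eta = np.concatenate([t, [pi1_p1, pi1_p0, t7], [t5, t6, t7]])
    X = np.zeros((13, 7))
    X[:7, :7] = np.eye(7)                 # eta_H = theta
    X[7] = [t5, 0, 1 - t5, 0, t1 - t3, 0, 0]
    X[8] = [0, t6, 0, 1 - t6, 0, t2 - t4, 0]
    X[9, 6] = X[12, 6] = 1
    X[10, 4] = X[11, 5] = 1
    VH = t*(1 - t) / nH
    VK = np.array([pi1_p1*(1 - pi1_p1), pi1_p0*(1 - pi1_p0), t7*(1 - t7)]) / nK
    VL = np.array([t5*(1 - t5), t6*(1 - t6), t7*(1 - t7)]) / nL
    return eta, X, np.diag(np.concatenate([VH, VK, VL]))

# One Gauss-Newton step (9); standard errors from (10).
eta0, X, V = eta_X_V(theta)
Vinv = np.linalg.inv(V)
C = np.linalg.inv(X.T @ Vinv @ X)
theta1 = theta + C @ X.T @ Vinv @ (eta_hat - eta0)
se = np.sqrt(np.diag(C))
```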
The Gauss-Newton method as in (9) can be used to solve the resulting nonlinear model for the seven parameters, where the initial estimator of θ is

  θ̂S = (293/316, 4/6, 176/373, 3/20, 406/929, 11/41, 1111/1163)'

and the transpose of the X matrix is

  X' = [ 1  0  0  0  0  0  0  θ5       0        0  0  0  0
         0  1  0  0  0  0  0  0        θ6       0  0  0  0
         0  0  1  0  0  0  0  1 − θ5   0        0  0  0  0
         0  0  0  1  0  0  0  0        1 − θ6   0  0  0  0
         0  0  0  0  1  0  0  θ1 − θ3  0        0  1  0  0
         0  0  0  0  0  1  0  0        θ2 − θ4  0  0  1  0
         0  0  0  0  0  0  1  0        0        1  0  0  1 ].

The resulting one-step estimates are

  θ̂(1) = (0.923, 0.678, 0.454, 0.168, 0.426, 0.272, 0.955)'.

The standard errors of the estimated values are computed from (10) and are

  (0.0098, 0.0173, 0.0178, 0.0134, 0.0155, 0.0140, 0.00606).

On the other hand, the standard errors of the initial parameter estimates are

  (0.0097, 0.0176, 0.0187, 0.0134, 0.0159, 0.0142, 0.00606).

Note that there is no efficiency gain for π̂++1 because y3 is fully observed throughout the sample.

5 Simulation Study

To test our theory with finite sample sizes, we perform a limited simulation study. For the simulated data sets, we generate B = 10,000 samples of size n from
the population

  x ~ iid N(1, 1),
  y1 ~ iid N(0, 1),
  y2 = 0.1 + 0.7 y1 + e,   e ~ N(0, 0.7^2).

We use two levels of sample size, n = 100 and n = 500. Variable x is always observed, and variables y1 and y2 are subject to missingness. The response probability for y1 follows a logistic regression model such that logit{Pr(y1 is observed)} = x, the response probability for y2 follows a logistic regression model such that logit{Pr(y2 is observed)} = 0.7x, and the two response indicators are independent.

The one-step Gauss-Newton (GN) estimation and the EM estimation are compared. The estimates from the EM algorithm are computed after 10 iterations with the same initial values as for the one-step GN estimator.

Table 6. Monte Carlo variance of the point estimators under two different estimation schemes, based on 10,000 Monte Carlo samples.

  Sample size   Parameter   EM estimation   GN estimation   True variance
  100           µ1          .0124           .0124           .0122
                µ2          .0132           .0132           .0130
                σ11         .0271           .0244           .0256
                σ22         .0277           .0275           .0276
                σ12         .0193           .0188           .0196
  500           µ1          .00242          .00242          .00244
                µ2          .00262          .00262          .00260
                σ11         .00544          .00515          .00511
                σ22         .00551          .00550          .00551
                σ12         .00400          .00396          .00392
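A sketch of the data-generating step of this simulation is given below, under the design as stated above (x always observed, missingness of y1 and y2 driven by logistic models in x only, so the mechanism is missing at random). The function and variable names are illustrative, not the authors' code.

```python
import numpy as np

def generate_sample(n, rng):
    """One simulated sample of the stated design: x always observed, y1 and
    y2 made missing according to logistic response models in x."""
    x = rng.normal(1.0, 1.0, n)
    y1 = rng.normal(0.0, 1.0, n)
    y2 = 0.1 + 0.7 * y1 + rng.normal(0.0, 0.7, n)
    r1 = rng.uniform(size=n) < 1 / (1 + np.exp(-x))        # y1 observed
    r2 = rng.uniform(size=n) < 1 / (1 + np.exp(-0.7 * x))  # y2 observed
    y1 = np.where(r1, y1, np.nan)
    y2 = np.where(r2, y2, np.nan)
    return x, y1, y2

rng = np.random.default_rng(1)
x, y1, y2 = generate_sample(500, rng)
```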
Table 7. Monte Carlo results for the estimated variance of the proposed method, based on 10,000 Monte Carlo samples.

  Sample size   Parameter   Mean Est. Var.   Rel. Bias   t-statistic
  100           µ1          .01186           -.05        -3.21
                µ2          .01268           -.04        -2.88
                σ11         .02474            .01         0.93
                σ22         .02678           -.03        -1.76
                σ12         .01884            .00         0.00
  500           µ1          .002429           .01         0.39
                µ2          .002589          -.01        -0.88
                σ11         .005092          -.01        -0.82
                σ22         .005483          -.00        -0.15
                σ12         .003899          -.02        -1.10

The means and variances of the point estimators and the means of the estimated variances are calculated. Because the point estimators are all unbiased in the simulation, their simulation means are not listed here. Table 6 displays the Monte Carlo variances of the point estimators for each estimation method. The theoretical asymptotic variances of the MLE are also computed and presented in the last column of Table 6. The simulation results in Table 6 are generally consistent with the theoretical results. The simulation variances are slightly larger than the theoretical variances because the estimators were not computed until convergence.

Table 7 displays the mean, relative bias, and t-statistic of the estimated variance of the one-step GN method. The relative bias is the Monte Carlo bias of the variance estimator divided by the Monte Carlo variance, where the variance is given in Table 6. The t-statistic for testing the hypothesis of zero bias is the Monte Carlo estimated bias divided by the Monte Carlo standard error of the estimated bias. The t-values as well as the values
of the relative biases indicate that the estimated variances of our estimators computed using (10) are close to their theoretical values.

The simulation results in Table 6 show that the two procedures have similar performance in terms of point estimation. The efficiency is slightly better for the one-step GN method because the EM algorithm was terminated after 10 iterations. The efficiency improvement is larger for the variance parameters than for the mean parameters, which suggests that the convergence of the EM algorithm is faster for the mean parameters than for the variance parameters. Also, as can be seen in Table 7, the one-step GN method provides consistent variance estimates for all parameters. The performance is better for the larger sample size because the consistency of the variance estimator is justified by the asymptotic theory.

6 Concluding remarks

We have proposed a Gauss-Newton algorithm for obtaining the maximum likelihood estimator under a general non-monotone missing data scenario. The proposed method is shown to be algebraically equivalent to the Newton-Raphson method but avoids the burden of obtaining the observed likelihood. Instead, the MLEs separately computed from each partition, based on the marginal and full likelihoods, are combined in a natural way. The way we combine the information takes the form of GLS estimation and thus can be easily implemented using existing software. The estimated covariance matrix is computed automatically and shows good finite-sample performance in the simulation study. The proposed method is not restricted to the multivariate normal distribution. It can be applied to any parametric multivariate
distribution as long as the computation of the marginal likelihoods and the full likelihood is relatively easier than that of the observed likelihood.

The proposed method assumes an ignorable response mechanism. A more realistic situation would be the case when the probability of y2 being missing depends on the value of y1. In this case, the assumption of missing at random no longer holds and we have to take the response mechanism into account. Further investigation in this direction is a topic for future research.

Appendix

A. Proof of Theorem 1

Note that the observed log-likelihood can be written as a sum of the log-likelihoods in each set:

  log l_obs(θ) = log lH(θ) + log lK(θ) + log lL(θ),    (A.1)

where lH(θ) = ∏_{i∈H} f(yi; θ) is the likelihood function defined in set H, and lK and lL are defined similarly. Under MAR, lH is the likelihood for the joint distribution of y1 and y2, lK is the likelihood for the marginal distribution of y1, and lL is the likelihood for the marginal distribution of y2. By (A.1), the score function for the likelihood can be written as

  S_obs(y; θ) = SH(y; θ) + SK(y; θ) + SL(y; θ)    (A.2)

and the expected information matrix also satisfies the additive decomposition

  I_obs(θ) = IH(θ) + IK(θ) + IL(θ),    (A.3)

where IH(θ) = E[ −∂^2 log lH(θ)/∂θ∂θ' ], and IK(θ) and IL(θ) are defined similarly.
The equation in (A.3) can be written as

  I_obs(θ) = (∂ηH'/∂θ) IH(ηH) (∂ηH/∂θ') + (∂ηK'/∂θ) IK(ηK) (∂ηK/∂θ') + (∂ηL'/∂θ) IL(ηL) (∂ηL/∂θ')
           = X' V̂^{-1} X,    (A.4)

where X' = (∂ηH'/∂θ, ∂ηK'/∂θ, ∂ηL'/∂θ) and V̂^{-1} = diag{ IH(ηH), IK(ηK), IL(ηL) }.

Now, consider the score function in (A.2). Using the chain rule, the score function can be written as

  S_obs(y; θ) = (∂ηH'/∂θ) SH(y; ηH) + (∂ηK'/∂θ) SK(y; ηK) + (∂ηL'/∂θ) SL(y; ηL).    (A.5)

Let η̂H be the MLE under the likelihood lH. Taking a Taylor expansion of SH(y; ηH) around η̂H leads to

  SH(y; ηH) ≐ SH(y; η̂H) − IH(η̂H)(ηH − η̂H),

where IH(ηH) = −∂^2 log lH(ηH)/∂ηH∂ηH'. Using SH(y; η̂H) = 0 and the convergence of the observed information matrix to the expected information matrix, we have

  SH(y; ηH) ≐ IH(η̂H)(η̂H − ηH).

Similar results hold for the sets K and L. Thus, (A.5) becomes

  S_obs(y; θ) ≐ (∂ηH'/∂θ) IH(η̂H)(η̂H − ηH) + (∂ηK'/∂θ) IK(η̂K)(η̂K − ηK) + (∂ηL'/∂θ) IL(η̂L)(η̂L − ηL)
             = X' V̂^{-1} (η̂ − η).    (A.6)

Therefore, inserting (A.4) and (A.6) into (13), we obtain (9).
B. Computations for the variance formulas

Using

  ( XH' V̂H^{-1} XH + XK' V̂K^{-1} XK )^{-1}
    = ( XH' V̂H^{-1} XH )^{-1} − ( XH' V̂H^{-1} XH )^{-1} XK' [ V̂K + XK ( XH' V̂H^{-1} XH )^{-1} XK' ]^{-1} XK ( XH' V̂H^{-1} XH )^{-1},

it can easily be shown that

  ( XH' V̂H^{-1} XH + XK' V̂K^{-1} XK )^{-1} = diag{ Σ22.1/nH, 2σ22.1^2/nH, σ11/(nH + nK), 2σ11^2/(nH + nK) }.

Now, to use the formula

  ( XH' V̂H^{-1} XH + XL' V̂L^{-1} XL )^{-1}
    = ( XH' V̂H^{-1} XH )^{-1} − ( XH' V̂H^{-1} XH )^{-1} XL' [ V̂L + XL ( XH' V̂H^{-1} XH )^{-1} XL' ]^{-1} XL ( XH' V̂H^{-1} XH )^{-1},

note that

  V̂L + XL ( XH' V̂H^{-1} XH )^{-1} XL' = diag{ σ22/nHL, 2σ22^2/nHL },

where nHL^{-1} = nH^{-1} + nL^{-1}, and

  XL ( XH' V̂H^{-1} XH )^{-1} = (1/nH) ( σ22.1            0            0         σ12   0
                                        −2β21.1σ22.1µ1   2β21.1σ22.1  2σ22.1^2  0     2σ12^2 ).

Thus, we have

  ( XH' V̂H^{-1} XH + XL' V̂L^{-1} XL )^{-1} = V̂H − (nHL/nH^2) A' diag{ 1/σ22, 1/(2σ22^2) } A,

where A = nH XL ( XH' V̂H^{-1} XH )^{-1} is the 2 × 5 matrix displayed above,
which gives the variance of θ̂HL.

To compute the variance of θ̂HKL, use

  ( XHK' V̂HK^{-1} XHK + XL' V̂L^{-1} XL )^{-1}
    = ( XHK' V̂HK^{-1} XHK )^{-1} − ( XHK' V̂HK^{-1} XHK )^{-1} XL' [ V̂L + XL ( XHK' V̂HK^{-1} XHK )^{-1} XL' ]^{-1} XL ( XHK' V̂HK^{-1} XHK )^{-1},

where

  ( XHK' V̂HK^{-1} XHK )^{-1} = ( XH' V̂H^{-1} XH + XK' V̂K^{-1} XK )^{-1}.

Writing

  D ≡ V̂L + XL ( XHK' V̂HK^{-1} XHK )^{-1} XL'
    = diag{ (σ22/nHL)(1 − pL pK ρ^2), (2σ22^2/nHL)(1 − pL pK ρ^4) },

where nHL^{-1} = nH^{-1} + nL^{-1}, pL = nL/(nH + nL), and pK = nK/(nH + nK), and

  XL ( XHK' V̂HK^{-1} XHK )^{-1} = (1/nH) ( σ22.1            0            0         σ12(1 − pK)   0
                                           −2β21.1σ22.1µ1   2β21.1σ22.1  2σ22.1^2  0             2σ12^2(1 − pK) ),

the variance of θ̂HKL can be obtained as

  ( XHK' V̂HK^{-1} XHK + XL' V̂L^{-1} XL )^{-1} = ( XHK' V̂HK^{-1} XHK )^{-1} − (1/nH^2) B' D^{-1} B,

where B = nH XL ( XHK' V̂HK^{-1} XHK )^{-1} is the 2 × 5 matrix displayed above.
References

Anderson, T.W. (1957). Maximum likelihood estimates for the multivariate normal distribution when some observations are missing. Journal of the American Statistical Association 52, 200-203.

Chen, Q., Ibrahim, J.G., Chen, M.-H., and Senchaudhuri, P. (2008). Theory and inference for regression models with missing responses and covariates. Journal of Multivariate Analysis 99, 1302-1331.

Cox, D.R. and Reid, N. (1987). Parameter orthogonality and approximate conditional inference (with discussion). Journal of the Royal Statistical Society, Series B 49, 1-39.

Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B 39, 1-38.

Kim, J.K. (2004). Extension of factoring likelihood approach to non-monotone missing data. Journal of the Korean Statistical Society 33, 401-410.

Lehmann, E.L. (1983). Theory of Point Estimation. Wiley, New York.

Little, R.J.A. (1982). Models for nonresponse in sample surveys. Journal of the American Statistical Association 77, 237-250.

Little, R.J.A. (1992). Regression with missing X's: A review. Journal of the American Statistical Association 87, 1227-1237.
Little, R.J.A. and Rubin, D.B. (2002). Statistical Analysis with Missing Data. Wiley, New York.

Liu, C. and Rubin, D.B. (1994). The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence. Biometrika 81, 633-648.

Louis, T.A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, Series B 44, 226-233.

Meng, X.-L. and Rubin, D.B. (1991). Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm. Journal of the American Statistical Association 86, 899-909.

Molenberghs, G. and Kenward, M. (2007). Missing Data in Clinical Studies. Wiley, New York.

Rubin, D.B. (1974). Characterizing the estimation of parameters in incomplete-data problems. Journal of the American Statistical Association 69, 467-474.

Rubin, D.B. (1976). Inference and missing data. Biometrika 63, 581-592.

Seber, G.A.F. and Wild, C.J. (1989). Nonlinear Regression. Wiley, New York.