Statistics 580: The EM Algorithm


Introduction

The EM algorithm is a very general iterative algorithm for parameter estimation by maximum likelihood when some of the random variables involved are not observed, i.e., are considered missing or incomplete. The EM algorithm formalizes an intuitive idea for obtaining parameter estimates when some of the data are missing:

i. replace missing values by estimated values;
ii. estimate the parameters;
iii. repeat step (i) using the estimated parameter values as if they were the true values, and step (ii) using the estimated values as if they were observed values, iterating until convergence.

This idea had been in use for many years before Orchard and Woodbury (1972), in their missing information principle, provided the theoretical foundation for it. The term EM was introduced by Dempster, Laird, and Rubin (1977), where proofs of general results about the behavior of the algorithm were first given, along with a large number of applications.

For this discussion, suppose that we have a random vector $y$ whose joint density $f(y; \theta)$ is indexed by a $p$-dimensional parameter $\theta \in \Theta \subseteq R^p$. If the complete-data vector $y$ were observed, it would be of interest to compute the maximum likelihood estimate of $\theta$ based on the distribution of $y$. The log-likelihood function of $y$,
$$ \log L(\theta; y) = \ell(\theta; y) = \log f(y; \theta), $$
is then the function to be maximized.

In the presence of missing data, however, only a function of the complete-data vector $y$ is observed. We express this by writing $y = (y_{obs}, y_{mis})$, where $y_{obs}$ denotes the observed but incomplete data and $y_{mis}$ denotes the unobserved or missing data. For simplicity of description, assume that the missing data are missing at random (Rubin, 1976), so that
$$ f(y; \theta) = f(y_{obs}, y_{mis}; \theta) = f_1(y_{obs}; \theta)\, f_2(y_{mis} \mid y_{obs}; \theta), $$
where $f_1$ is the joint density of $y_{obs}$ and $f_2$ is the conditional density of $y_{mis}$ given the observed data $y_{obs}$. It follows that
$$ \ell_{obs}(\theta; y_{obs}) = \ell(\theta; y) - \log f_2(y_{mis} \mid y_{obs}; \theta), $$
where $\ell_{obs}(\theta; y_{obs})$ is the observed-data log-likelihood.

The EM algorithm is useful when maximizing $\ell_{obs}$ can be difficult but maximizing the complete-data log-likelihood $\ell$ is simple. However, since $y$ is not fully observed, $\ell$ cannot be evaluated and hence cannot be maximized directly. The EM algorithm attempts to maximize $\ell(\theta; y)$ iteratively by replacing it with its conditional expectation given the observed data $y_{obs}$. This expectation is computed with respect to the distribution of the complete data evaluated at the current estimate of $\theta$.

More specifically, if $\theta^{(0)}$ is an initial value for $\theta$, then on the first iteration we compute
$$ Q(\theta; \theta^{(0)}) = E_{\theta^{(0)}}\left[\ell(\theta; y) \mid y_{obs}\right]. $$
$Q(\theta; \theta^{(0)})$ is then maximized with respect to $\theta$; that is, $\theta^{(1)}$ is found such that
$$ Q(\theta^{(1)}; \theta^{(0)}) \ge Q(\theta; \theta^{(0)}) \quad \text{for all } \theta \in \Theta. $$
Thus the EM algorithm consists of an E-step (expectation step) followed by an M-step (maximization step), defined as follows:

E-step: Compute $Q(\theta; \theta^{(t)})$, where
$$ Q(\theta; \theta^{(t)}) = E_{\theta^{(t)}}\left[\ell(\theta; y) \mid y_{obs}\right]. $$

M-step: Find $\theta^{(t+1)} \in \Theta$ such that
$$ Q(\theta^{(t+1)}; \theta^{(t)}) \ge Q(\theta; \theta^{(t)}) \quad \text{for all } \theta \in \Theta. $$

The E-step and the M-step are repeated alternately until the difference $L(\theta^{(t+1)}) - L(\theta^{(t)})$ is less than $\delta$, a prescribed small quantity.

The computation of these two steps simplifies a great deal when the complete-data log-likelihood is linear in the sufficient statistic for $\theta$. In particular, this is the case when the distribution of the complete-data vector $y$ belongs to the exponential family. The E-step then reduces to computing the expectation of the complete-data sufficient statistic given the observed data. The M-step also simplifies: although it nominally involves maximizing the expected log-likelihood computed in the E-step, in the exponential family case this maximization can be avoided. Instead, the conditional expectations of the sufficient statistics computed in the E-step are substituted directly for the sufficient statistics in the closed-form expressions for the complete-data maximum likelihood estimators of $\theta$, giving the next iterate. Several examples are discussed below to illustrate these steps in the exponential family case.
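Purely as an illustration of the E-step/M-step template just described (an addition, not part of the original notes), the following R sketch gives a generic EM loop; the functions e_step(), m_step(), and loglik() are hypothetical placeholders that a user would supply for a particular model.

em <- function(theta0, y_obs, e_step, m_step, loglik,
               delta = 1e-8, maxiter = 1000) {
  theta <- theta0
  L_old <- loglik(theta, y_obs)
  for (iter in seq_len(maxiter)) {
    suff  <- e_step(theta, y_obs)   # E-step: expected complete-data sufficient statistics
    theta <- m_step(suff, y_obs)    # M-step: complete-data MLE with the expectations plugged in
    L_new <- loglik(theta, y_obs)   # observed-data log-likelihood; EM never decreases it
    if (abs(L_new - L_old) < delta) break
    L_old <- L_new
  }
  list(theta = theta, loglik = L_new, iterations = iter)
}

Each of the examples below amounts to supplying a specific E-step, M-step, and stopping rule of this form.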

As a general algorithm for complex maximum likelihood computations, the EM algorithm has several appealing properties relative to other iterative algorithms such as Newton-Raphson. First, it is typically easy to implement because it relies on complete-data computations: the E-step of each iteration only involves taking expectations over complete-data conditional distributions, and the M-step only requires complete-data maximum likelihood estimation, for which simple closed-form expressions are often already available. Second, it is numerically stable: each iteration increases the log-likelihood $\ell_{obs}(\theta; y_{obs})$, and if $\ell_{obs}(\theta; y_{obs})$ is bounded, the sequence $\ell_{obs}(\theta^{(t)}; y_{obs})$ converges to a stationary value. If the sequence $\theta^{(t)}$ converges, it does so to a local maximum or saddle point of $\ell_{obs}(\theta; y_{obs})$, and to the unique MLE if $\ell_{obs}(\theta; y_{obs})$ is unimodal. A disadvantage of EM is that its rate of convergence can be extremely slow if a lot of data are missing: Dempster, Laird, and Rubin (1977) show that convergence is linear, with the rate determined by the fraction of the information about $\theta$ in $\ell(\theta; y)$ that is missing.

Example 1: Univariate Normal Sample

Let the complete-data vector $y = (y_1, \ldots, y_n)^T$ be a random sample from $N(\mu, \sigma^2)$. Then
$$ f(y; \mu, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-\frac{1}{2}\sum_{i=1}^n \frac{(y_i - \mu)^2}{\sigma^2}\right\} = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-\frac{1}{2\sigma^2}\left(\sum y_i^2 - 2\mu\sum y_i + n\mu^2\right)\right\}, $$
which implies that $\left(\sum y_i, \sum y_i^2\right)$ are sufficient statistics for $\theta = (\mu, \sigma^2)^T$. The complete-data log-likelihood function is
$$ \ell(\mu, \sigma^2; y) = -\frac{n}{2}\log(\sigma^2) - \frac{1}{2}\sum_{i=1}^n \frac{(y_i - \mu)^2}{\sigma^2} + \text{constant} = -\frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n y_i^2 + \frac{\mu}{\sigma^2}\sum_{i=1}^n y_i - \frac{n\mu^2}{2\sigma^2} + \text{constant}. $$
It follows that the complete-data log-likelihood is linear in the complete-data sufficient statistics.

Suppose $y_i$, $i = 1, \ldots, m$ are observed and $y_i$, $i = m+1, \ldots, n$ are missing (at random), where the $y_i$ are assumed to be i.i.d. $N(\mu, \sigma^2)$. Denote the observed data vector by $y_{obs} = (y_1, \ldots, y_m)^T$. Since the complete data $y$ are from the exponential family, the E-step requires the computation of
$$ E_\theta\left(\sum_{i=1}^n y_i \,\Big|\, y_{obs}\right) \quad \text{and} \quad E_\theta\left(\sum_{i=1}^n y_i^2 \,\Big|\, y_{obs}\right), $$
instead of the expectation of the complete-data log-likelihood function itself. Thus, at the $t$-th iteration of the E-step, compute
$$ s_1^{(t)} = E_{\mu^{(t)},\sigma^{2(t)}}\left(\sum_{i=1}^n y_i \,\Big|\, y_{obs}\right) = \sum_{i=1}^m y_i + (n - m)\,\mu^{(t)} \qquad (1) $$
since $E_{\mu^{(t)},\sigma^{2(t)}}(y_i) = \mu^{(t)}$ for the missing observations, where $\mu^{(t)}$ and $\sigma^{2(t)}$ are the current estimates of $\mu$ and $\sigma^2$, and

$$ s_2^{(t)} = E_{\mu^{(t)},\sigma^{2(t)}}\left(\sum_{i=1}^n y_i^2 \,\Big|\, y_{obs}\right) = \sum_{i=1}^m y_i^2 + (n - m)\left[\sigma^{2(t)} + \mu^{(t)2}\right] \qquad (2) $$
since $E_{\mu^{(t)},\sigma^{2(t)}}(y_i^2) = \sigma^{2(t)} + \mu^{(t)2}$.

For the M-step, first note that the complete-data maximum likelihood estimates of $\mu$ and $\sigma^2$ are
$$ \hat\mu = \frac{\sum_{i=1}^n y_i}{n} \quad \text{and} \quad \hat\sigma^2 = \frac{\sum_{i=1}^n y_i^2}{n} - \left(\frac{\sum_{i=1}^n y_i}{n}\right)^2. $$
The M-step is defined by substituting the expectations computed in the E-step for the complete-data sufficient statistics on the right-hand side of these expressions, which yields the new iterates of $\mu$ and $\sigma^2$. (Note that the complete-data sufficient statistics themselves cannot be computed directly, since $y_{m+1}, \ldots, y_n$ have not been observed.) We get
$$ \mu^{(t+1)} = \frac{s_1^{(t)}}{n} \qquad (3) $$
and
$$ \sigma^{2(t+1)} = \frac{s_2^{(t)}}{n} - \mu^{(t+1)2}. \qquad (4) $$

Thus the E-step consists of evaluating (1) and (2), beginning with starting values $\mu^{(0)}$ and $\sigma^{2(0)}$, and the M-step consists of substituting these into (3) and (4) to calculate new values $\mu^{(1)}$ and $\sigma^{2(1)}$, and so on. The EM algorithm therefore iterates successively between (1)-(2) and (3)-(4). Of course, in this example it is not necessary to use the EM algorithm at all, since the maximum likelihood estimates of $(\mu, \sigma^2)$ are clearly given by $\hat\mu = \sum_{i=1}^m y_i/m$ and $\hat\sigma^2 = \sum_{i=1}^m y_i^2/m - \hat\mu^2$.
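To connect formulas (1)-(4) to code, here is a small R sketch (an illustration added to these notes, not taken from them); y_obs is assumed to hold the m observed values and n is the complete sample size.

em_normal <- function(y_obs, n, mu = mean(y_obs), sigma2 = var(y_obs),
                      delta = 1e-10, maxiter = 1000) {
  m <- length(y_obs)
  for (t in seq_len(maxiter)) {
    # E-step: expected complete-data sufficient statistics, equations (1) and (2)
    s1 <- sum(y_obs)   + (n - m) * mu
    s2 <- sum(y_obs^2) + (n - m) * (sigma2 + mu^2)
    # M-step: plug the expectations into the complete-data MLE formulas, (3) and (4)
    mu_new     <- s1 / n
    sigma2_new <- s2 / n - mu_new^2
    if (max(abs(mu_new - mu), abs(sigma2_new - sigma2)) < delta) {
      mu <- mu_new; sigma2 <- sigma2_new; break
    }
    mu <- mu_new; sigma2 <- sigma2_new
  }
  c(mu = mu, sigma2 = sigma2)
}

As noted above, the fixed point of this iteration is just the observed-data MLE: em_normal(y_obs, n) converges to mean(y_obs) and mean(y_obs^2) - mean(y_obs)^2.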

Example 2: Sampling from a Multinomial Population

In Example 1 the incomplete data were, in effect, missing data in the conventional sense. In general, however, the EM algorithm also applies to situations in which the complete data contain variables that are not observable by definition. In that set-up the observed data can be viewed as a function, or mapping, of the complete data. The following example was used by Dempster, Laird, and Rubin (1977) as an illustration of the EM algorithm.

Let $y_{obs} = (38, 34, 125)^T$ be observed counts from a multinomial population with cell probabilities $\left(\tfrac{1}{2} - \tfrac{1}{2}\theta,\ \tfrac{1}{4}\theta,\ \tfrac{1}{2} + \tfrac{1}{4}\theta\right)$. The objective is to obtain the maximum likelihood estimate of $\theta$. To put this into the framework of an incomplete-data problem, define $y = (y_1, y_2, y_3, y_4)^T$ with multinomial probabilities
$$ \left(\tfrac{1}{2} - \tfrac{1}{2}\theta,\ \tfrac{1}{4}\theta,\ \tfrac{1}{4}\theta,\ \tfrac{1}{2}\right) \equiv (p_1, p_2, p_3, p_4). $$
The vector $y$ is the complete data. Then define
$$ y_{obs} = (y_1,\ y_2,\ y_3 + y_4)^T $$
as the observed data vector, a function of the complete-data vector. Since only $y_3 + y_4$ is observed, and $y_3$ and $y_4$ are not observed separately, the observed data are incomplete; however, this is not simply a missing data problem.

The complete-data log-likelihood is
$$ \ell(\theta; y) = y_1 \log p_1 + y_2 \log p_2 + y_3 \log p_3 + y_4 \log p_4 + \text{const.}, $$
which is linear in $y_1, y_2, y_3, y_4$, and these are also the sufficient statistics. The E-step requires that $E_\theta(y \mid y_{obs})$ be computed; that is,
$$ E_\theta(y_1 \mid y_{obs}) = y_1 = 38, \qquad E_\theta(y_2 \mid y_{obs}) = y_2 = 34, $$
$$ E_\theta(y_3 \mid y_{obs}) = E_\theta(y_3 \mid y_3 + y_4) = 125\,\frac{\tfrac{1}{4}\theta}{\tfrac{1}{2} + \tfrac{1}{4}\theta}, $$
since, conditional on $y_3 + y_4 = 125$, $y_3$ is distributed as Binomial$(125, p)$ with $p = \tfrac{1}{4}\theta \big/ \left(\tfrac{1}{2} + \tfrac{1}{4}\theta\right)$. Similarly,
$$ E_\theta(y_4 \mid y_{obs}) = E_\theta(y_4 \mid y_3 + y_4) = 125\,\frac{\tfrac{1}{2}}{\tfrac{1}{2} + \tfrac{1}{4}\theta}. $$
Only
$$ y_3^{(t)} = E_{\theta^{(t)}}(y_3 \mid y_{obs}) = \frac{125\,\tfrac{1}{4}\theta^{(t)}}{\tfrac{1}{2} + \tfrac{1}{4}\theta^{(t)}} \qquad (1) $$
actually needs to be computed at the $t$-th iteration of the E-step, as seen below.

For the M-step, note that the complete-data maximum likelihood estimate of $\theta$ is
$$ \hat\theta = \frac{y_2 + y_3}{y_1 + y_2 + y_3}. $$
(To verify this, maximize
$$ \ell(\theta; y) = y_1 \log\left(\tfrac{1}{2} - \tfrac{1}{2}\theta\right) + y_2 \log\tfrac{1}{4}\theta + y_3 \log\tfrac{1}{4}\theta + y_4 \log\tfrac{1}{2} $$
and show that the above is indeed the maximizer.) Thus, substituting the expectations from the E-step for the sufficient statistics in this expression gives
$$ \theta^{(t+1)} = \frac{34 + y_3^{(t)}}{72 + y_3^{(t)}}. \qquad (2) $$
Iterating between (1) and (2) defines the EM algorithm for this problem.
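A few lines of R suffice to carry out this iteration (an illustrative sketch added here; the starting value 0.5 is an arbitrary choice, not taken from the notes):

th <- 0.5                                      # arbitrary starting value theta^(0)
for (t in 1:50) {
  y3     <- 125 * (th / 4) / (1/2 + th / 4)    # E-step, equation (1)
  th_new <- (34 + y3) / (72 + y3)              # M-step, equation (2)
  if (abs(th_new - th) < 1e-10) { th <- th_new; break }
  th <- th_new
}
th    # converges to the MLE of theta, approximately 0.627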

The table below, from Little and Rubin (1987), illustrates the convergence of the EM iterates for this problem from a starting value $\theta^{(0)}$.

Table 1. The EM Algorithm for Example 2 (from Little and Rubin (1987)). Columns: $t$, $\theta^{(t)}$, $\theta^{(t)} - \hat\theta$, and $(\theta^{(t+1)} - \hat\theta)/(\theta^{(t)} - \hat\theta)$.

Example 3: Sample from a Binomial/Poisson Mixture

The following table shows the number of children of $N$ widows entitled to support from a certain pension fund.

Number of children:      0      1      2      3      4      5      6
Observed # of widows:  $n_0$  $n_1$  $n_2$  $n_3$  $n_4$  $n_5$  $n_6$

Since the actual data were not consistent with being a random sample from a Poisson distribution (the number of widows with no children being too large), the following alternative model was adopted. Assume that the discrete random variable is distributed as a mixture of two populations:

Population A: with probability $\xi$, the random variable takes the value 0;
Population B: with probability $(1 - \xi)$, the random variable follows a Poisson distribution with mean $\lambda$.

Let the observed vector of counts be $n_{obs} = (n_0, n_1, \ldots, n_6)^T$. The problem is to obtain the maximum likelihood estimate of $(\lambda, \xi)$. This is reformulated as an incomplete-data problem by regarding the observed number of widows with no children as the sum of observations coming from each of the two populations. Define
$$ n_0 = n_A + n_B, $$
where $n_A$ is the number of widows with no children from population A and $n_B = n_0 - n_A$ is the number of widows with no children from population B. The problem is an incomplete-data problem because $n_A$ is not observed. Let $n = (n_A, n_B, n_1, n_2, \ldots, n_6)$ be the complete-data vector, where we assume that $n_A$ and $n_B$ are observed and $n_0 = n_A + n_B$.

Then the observed counts have density
$$ f(n_{obs}; \xi, \lambda) = k(n_{obs})\,\{P(y = 0)\}^{n_0}\prod_{i=1}^{6}\{P(y = i)\}^{n_i} = k(n_{obs})\left[\xi + (1-\xi)e^{-\lambda}\right]^{n_0}\prod_{i=1}^{6}\left\{\frac{(1-\xi)e^{-\lambda}\lambda^i}{i!}\right\}^{n_i}, $$
where $k(n_{obs})$ is a multinomial coefficient that does not depend on $(\xi, \lambda)$; the mixture term $\left[\xi + (1-\xi)e^{-\lambda}\right]^{n_0}$ makes direct maximization awkward. For the complete data, $n_0 = n_A + n_B$ is split into its two components, and the complete-data density is
$$ f(n; \xi, \lambda) = k(n)\,\xi^{n_A}\left[(1-\xi)e^{-\lambda}\right]^{n_B}\prod_{i=1}^{6}\left\{\frac{(1-\xi)e^{-\lambda}\lambda^i}{i!}\right\}^{n_i}. $$
The complete-data sufficient statistic is $(n_A, n_B, n_1, \ldots, n_6)$, and the complete-data log-likelihood is
$$ \ell(\xi, \lambda; n) = n_A \log\xi + (N - n_A)\left[\log(1-\xi) - \lambda\right] + \sum_{i=1}^{6} i\,n_i \log\lambda + \text{const.}, $$
where $N = n_A + n_B + \sum_{i=1}^6 n_i = \sum_{i=0}^6 n_i$. Thus the complete-data log-likelihood is linear in the sufficient statistics.

The E-step requires the computation of $E_{\xi,\lambda}(n \mid n_{obs})$. This gives
$$ E_{\xi,\lambda}(n_i \mid n_{obs}) = n_i, \quad i = 1, \ldots, 6, $$
and
$$ E_{\xi,\lambda}(n_A \mid n_{obs}) = \frac{n_0\,\xi}{\xi + (1-\xi)\exp(-\lambda)}, $$
since, conditional on $n_0$, $n_A$ is Binomial$(n_0, p)$ with $p = \frac{p_A}{p_A + p_B}$, where $p_A = \xi$ and $p_B = (1-\xi)e^{-\lambda}$. The expression for $E_{\xi,\lambda}(n_B \mid n_{obs})$ follows from that for $E(n_A \mid n_{obs})$ via $n_B = n_0 - n_A$ and is not needed separately in the E-step. So the E-step consists of computing
$$ n_A^{(t)} = \frac{n_0\,\xi^{(t)}}{\xi^{(t)} + (1-\xi^{(t)})\exp(-\lambda^{(t)})} \qquad (1) $$
at the $t$-th iteration.

For the M-step, the complete-data maximum likelihood estimates of $(\xi, \lambda)$ are needed. To obtain them, note that $n_A \sim \text{Bin}(N, \xi)$ and that $n_B, n_1, \ldots, n_6$ are the observed counts for the values $0, 1, \ldots, 6$ of a Poisson distribution with parameter $\lambda$. Thus the complete-data maximum likelihood estimates of $\xi$ and $\lambda$ are
$$ \hat\xi = \frac{n_A}{N} $$

and
$$ \hat\lambda = \frac{\sum_{i=1}^6 i\,n_i}{n_B + \sum_{i=1}^6 n_i}. $$
The M-step therefore computes
$$ \xi^{(t+1)} = \frac{n_A^{(t)}}{N} \qquad (2) $$
and
$$ \lambda^{(t+1)} = \frac{\sum_{i=1}^6 i\,n_i}{n_B^{(t)} + \sum_{i=1}^6 n_i}, \qquad (3) $$
where $n_B^{(t)} = n_0 - n_A^{(t)}$. The EM algorithm consists of iterating successively between (1) and (2)-(3). The following data are reproduced from Thisted (1988).

Number of children:   0   1   2   3   4   5   6
Number of widows:     (counts as reproduced from Thisted (1988))

Starting with $\xi^{(0)} = 0.75$ and $\lambda^{(0)} = 0.40$, the results in Table 2 were obtained.

Table 2. EM Iterations for the Pension Data. Columns: $t$, $\xi$, $\lambda$, $n_A$, $n_B$.

A single iteration produced estimates that are within 0.5% of the maximum likelihood estimates and are comparable to the results after about four iterations of Newton-Raphson. However, the convergence of the subsequent iterations is very slow; this behavior is more typical of the EM algorithm.
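Since the widow counts themselves are not reproduced above, the following R sketch (an illustration added to these notes) takes a generic count vector n_counts = c(n0, n1, ..., n6) and carries out iterations (1)-(3), with the starting values used in the notes as defaults.

em_mixture <- function(n_counts, xi = 0.75, lambda = 0.40,
                       delta = 1e-10, maxiter = 500) {
  # n_counts = c(n0, n1, ..., n6): observed numbers of widows with 0, ..., 6 children
  n0 <- n_counts[1]
  ni <- n_counts[-1]                      # n1, ..., n6
  N  <- sum(n_counts)
  sum_i_ni <- sum((1:6) * ni)             # sum of i * n_i
  for (t in seq_len(maxiter)) {
    # E-step, equation (1): expected number of zero counts coming from population A
    nA <- n0 * xi / (xi + (1 - xi) * exp(-lambda))
    nB <- n0 - nA
    # M-step, equations (2) and (3)
    xi_new     <- nA / N
    lambda_new <- sum_i_ni / (nB + sum(ni))
    if (max(abs(xi_new - xi), abs(lambda_new - lambda)) < delta) {
      xi <- xi_new; lambda <- lambda_new; break
    }
    xi <- xi_new; lambda <- lambda_new
  }
  c(xi = xi, lambda = lambda, nA = nA, nB = nB)
}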

Example 4: Variance Component Estimation (Little and Rubin (1987))

The following example is from Snedecor and Cochran (1967, p. 290). In a study of artificial insemination of cows, semen samples from six randomly selected bulls were tested for their ability to produce conceptions. The number of samples tested varied from bull to bull, and the response variable was the percentage of conceptions obtained from each sample. Interest here centers on the variability of the bull effects, which are taken to be random effects. The data are:

Table 3. Data for Example 4 (from Snedecor and Cochran (1967)). Columns: Bull ($i$), Percentages of Conception, $n_i$; there are $n_i$ samples for bull $i$ and $\sum_i n_i = 35$ samples in total.

A common model for the analysis of such data is the one-way random effects model
$$ y_{ij} = a_i + \epsilon_{ij}, \qquad j = 1, \ldots, n_i, \quad i = 1, \ldots, k, $$
where the bull effects $a_i$ are assumed to be i.i.d. $N(\mu, \sigma_a^2)$ and the within-bull errors $\epsilon_{ij}$ i.i.d. $N(0, \sigma^2)$, with the $a_i$ and $\epsilon_{ij}$ independent. The standard one-way random effects analysis of variance has the layout

Source   d.f.   S.S.   M.S.   F    E(M.S.)
Bull       5                       $\sigma^2 + 5.67\,\sigma_a^2$
Error     29                       $\sigma^2$
Total     34

Equating observed and expected mean squares in this table gives $s^2 = $ MS(Error) as the estimate of $\sigma^2$ and $\left(\text{MS(Bull)} - \text{MS(Error)}\right)/5.67$ as the estimate of $\sigma_a^2$.

To construct an EM algorithm for obtaining the MLEs of $\theta = (\mu, \sigma_a^2, \sigma^2)$, first consider the joint density of $y^* = (y, a)^T$, where $y^*$ is taken as the complete data. This joint density can be written as the product of two factors: the first corresponds to the conditional density of the $y_{ij}$ given the $a_i$, and the second to the density of the $a_i$:
$$ f(y^*; \theta) = f_1(y \mid a; \theta)\, f_2(a; \theta) = \prod_i \prod_j \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(y_{ij} - a_i)^2} \times \prod_i \frac{1}{\sqrt{2\pi}\,\sigma_a}\, e^{-\frac{1}{2\sigma_a^2}(a_i - \mu)^2}. $$

Thus the log-likelihood is linear in the following complete-data sufficient statistics:
$$ T_1 = \sum_i a_i, \qquad T_2 = \sum_i a_i^2, \qquad T_3 = \sum_i \sum_j (y_{ij} - a_i)^2 = \sum_i \sum_j (y_{ij} - \bar y_{i.})^2 + \sum_i n_i (\bar y_{i.} - a_i)^2. $$
Here "complete data" means that both $y$ and $a$ are available. Since only $y$ is observed, let $y_{obs} = y$. The E-step of the EM algorithm then requires the computation of the expectations of $T_1$, $T_2$ and $T_3$ given $y_{obs}$, i.e., $E_\theta(T_i \mid y)$ for $i = 1, 2, 3$. The conditional distribution of $a$ given $y$ is needed for computing these expectations.

First, note that the joint distribution of $y^* = (y, a)^T$ is $(N + k)$-dimensional multivariate normal, $N(\mu^*, \Sigma^*)$, where $\mu^* = (\mu_y^T, \mu_a^T)^T$ with $\mu_y = \mu\, j_N$ and $\mu_a = \mu\, j_k$, and $\Sigma^*$ is the $(N + k) \times (N + k)$ matrix
$$ \Sigma^* = \begin{pmatrix} \Sigma & \Sigma_{12} \\ \Sigma_{12}^T & \sigma_a^2 I_k \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_1 & & 0 \\ & \ddots & \\ 0 & & \Sigma_k \end{pmatrix}, \qquad \Sigma_{12} = \sigma_a^2 \begin{pmatrix} j_{n_1} & & 0 \\ & \ddots & \\ 0 & & j_{n_k} \end{pmatrix}, $$
where $\Sigma_i = \sigma^2 I_{n_i} + \sigma_a^2 J_{n_i}$ is an $n_i \times n_i$ matrix. The covariance matrix $\Sigma$ of the marginal distribution of $y$ is obtained by noting that the $y_{ij}$ are jointly normal with common mean $\mu$, common variance $\sigma^2 + \sigma_a^2$, covariance $\sigma_a^2$ within the same bull, and covariance 0 between bulls. That is,
$$ \operatorname{Cov}(y_{ij}, y_{i'j'}) = \operatorname{Cov}(a_i + \epsilon_{ij},\ a_{i'} + \epsilon_{i'j'}) = \begin{cases} \sigma^2 + \sigma_a^2 & \text{if } i = i',\ j = j', \\ \sigma_a^2 & \text{if } i = i',\ j \ne j', \\ 0 & \text{if } i \ne i'. \end{cases} $$
$\Sigma_{12}$ is the covariance of $y$ and $a$ and follows from the fact that $\operatorname{Cov}(y_{ij}, a_{i'}) = \sigma_a^2$ if $i = i'$ and $0$ if $i \ne i'$.

The inverse of $\Sigma$ is needed for the computation of the conditional distribution of $a$ given $y$; it is block diagonal with blocks
$$ \Sigma_i^{-1} = \frac{1}{\sigma^2}\left[I_{n_i} - \frac{\sigma_a^2}{\sigma^2 + n_i \sigma_a^2}\, J_{n_i}\right]. $$
Using a well-known theorem from multivariate normal theory, the distribution of $a$ given $y$ is $N(\alpha, A)$, where
$$ \alpha = \mu_a + \Sigma_{12}^T \Sigma^{-1} (y - \mu_y) \qquad \text{and} \qquad A = \sigma_a^2 I_k - \Sigma_{12}^T \Sigma^{-1} \Sigma_{12}. $$
It can be shown, after some algebra, that the $a_i$ given $y$ are independently distributed as
$$ a_i \mid y \;\sim\; N\left(w_i \mu + (1 - w_i)\bar y_{i.},\ v_i\right), $$

where $w_i = \sigma^2/(\sigma^2 + n_i \sigma_a^2)$, $\bar y_{i.} = \left(\sum_{j=1}^{n_i} y_{ij}\right)/n_i$, and $v_i = w_i \sigma_a^2$. Recall that this conditional distribution was derived so that the expectations of $T_1$, $T_2$ and $T_3$ given $y$ (i.e., $y_{obs}$) could be computed; these now follow easily. The $t$-th iteration of the E-step is therefore
$$ T_1^{(t)} = \sum_i \left[w_i^{(t)}\mu^{(t)} + (1 - w_i^{(t)})\bar y_{i.}\right], $$
$$ T_2^{(t)} = \sum_i \left\{\left[w_i^{(t)}\mu^{(t)} + (1 - w_i^{(t)})\bar y_{i.}\right]^2 + v_i^{(t)}\right\}, $$
$$ T_3^{(t)} = \sum_i \sum_j (y_{ij} - \bar y_{i.})^2 + \sum_i n_i\left[w_i^{(t)2}(\mu^{(t)} - \bar y_{i.})^2 + v_i^{(t)}\right]. $$
Since the complete-data maximum likelihood estimates are
$$ \hat\mu = \frac{T_1}{k}, \qquad \hat\sigma_a^2 = \frac{T_2}{k} - \hat\mu^2, \qquad \hat\sigma^2 = \frac{T_3}{N}, $$
the M-step is obtained by substituting the expectations calculated in the E-step for the sufficient statistics in these expressions:
$$ \mu^{(t+1)} = \frac{T_1^{(t)}}{k}, \qquad \sigma_a^{2(t+1)} = \frac{T_2^{(t)}}{k} - \mu^{(t+1)2}, \qquad \sigma^{2(t+1)} = \frac{T_3^{(t)}}{N}. $$
Iterating between these two sets of equations defines the EM algorithm. With starting values $\mu^{(0)} = 54.0$, $\sigma^{2(0)} = 70.0$ and $\sigma_a^{2(0)} = 248.0$, the maximum likelihood estimates $\hat\mu$, $\hat\sigma_a^2$ and $\hat\sigma^2$ were obtained after 30 iterations. These can be compared with the estimates of $\sigma_a^2$ and $\sigma^2$ obtained by equating observed and expected mean squares in the random effects analysis of variance given above.
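The E- and M-steps above translate directly into R. The sketch below is an illustration added to these notes (not the original code); the data are supplied as a list y_list containing one vector of conception percentages per bull, and the defaults are the starting values quoted above.

em_oneway <- function(y_list, mu = 54.0, sig2 = 70.0, sig2a = 248.0,
                      delta = 1e-10, maxiter = 5000) {
  k    <- length(y_list)
  ni   <- sapply(y_list, length)
  N    <- sum(ni)
  ybar <- sapply(y_list, mean)
  SSW  <- sum(sapply(y_list, function(v) sum((v - mean(v))^2)))  # within-bull sum of squares
  for (t in seq_len(maxiter)) {
    # E-step: conditional moments of the a_i given y
    w  <- sig2 / (sig2 + ni * sig2a)
    ai <- w * mu + (1 - w) * ybar          # E(a_i | y)
    vi <- w * sig2a                        # Var(a_i | y)
    T1 <- sum(ai)
    T2 <- sum(ai^2 + vi)
    T3 <- SSW + sum(ni * (w^2 * (mu - ybar)^2 + vi))
    # M-step: complete-data MLEs with the conditional expectations plugged in
    mu_new    <- T1 / k
    sig2a_new <- T2 / k - mu_new^2
    sig2_new  <- T3 / N
    if (max(abs(c(mu_new - mu, sig2a_new - sig2a, sig2_new - sig2))) < delta) {
      mu <- mu_new; sig2a <- sig2a_new; sig2 <- sig2_new; break
    }
    mu <- mu_new; sig2a <- sig2a_new; sig2 <- sig2_new
  }
  c(mu = mu, sigma2_a = sig2a, sigma2 = sig2)
}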

Convergence of the EM Algorithm

The EM algorithm attempts to maximize $\ell_{obs}(\theta; y_{obs})$ by working with the complete-data log-likelihood $\ell(\theta; y)$. Each iteration of EM has two steps: an E-step and an M-step. The $t$-th E-step finds the conditional expectation of the complete-data log-likelihood with respect to the conditional distribution of $y$ given $y_{obs}$ and the current estimate $\theta^{(t)}$:
$$ Q(\theta; \theta^{(t)}) = E_{\theta^{(t)}}\left[\ell(\theta; y) \mid y_{obs}\right] = \int \ell(\theta; y)\, f(y \mid y_{obs}; \theta^{(t)})\, dy, $$
regarded as a function of $\theta$ for fixed $y_{obs}$ and fixed $\theta^{(t)}$. The $t$-th M-step then finds $\theta^{(t+1)}$ to maximize $Q(\theta; \theta^{(t)})$, i.e., finds $\theta^{(t+1)}$ such that
$$ Q(\theta^{(t+1)}; \theta^{(t)}) \ge Q(\theta; \theta^{(t)}) \quad \text{for all } \theta \in \Theta. $$

To verify that this iteration produces a sequence of iterates that converges to a maximum of $\ell_{obs}(\theta; y_{obs})$, first take conditional expectations of both sides of
$$ \ell_{obs}(\theta; y_{obs}) = \ell(\theta; y) - \log f_2(y_{mis} \mid y_{obs}; \theta) $$
over the distribution of $y$ given $y_{obs}$ at the current estimate $\theta^{(t)}$. Then $\ell_{obs}(\theta; y_{obs})$ can be expressed as
$$ \ell_{obs}(\theta; y_{obs}) = \int \ell(\theta; y)\, f(y \mid y_{obs}; \theta^{(t)})\, dy - \int \log f_2(y_{mis} \mid y_{obs}; \theta)\, f(y \mid y_{obs}; \theta^{(t)})\, dy $$
$$ = E_{\theta^{(t)}}\left[\ell(\theta; y) \mid y_{obs}\right] - E_{\theta^{(t)}}\left[\log f_2(y_{mis} \mid y_{obs}; \theta) \mid y_{obs}\right] = Q(\theta; \theta^{(t)}) - H(\theta; \theta^{(t)}), $$
where $Q(\theta; \theta^{(t)})$ is as defined earlier and $H(\theta; \theta^{(t)}) = E_{\theta^{(t)}}\left[\log f_2(y_{mis} \mid y_{obs}; \theta) \mid y_{obs}\right]$.

The following Lemma is used to prove the main result, namely that the sequence of iterates $\theta^{(t)}$ produced by the EM algorithm converges at least to a local maximum of $\ell_{obs}(\theta; y_{obs})$.

Lemma: For any $\theta \in \Theta$, $H(\theta; \theta^{(t)}) \le H(\theta^{(t)}; \theta^{(t)})$.

Theorem: The EM algorithm increases $\ell_{obs}(\theta; y_{obs})$ at each iteration, that is,
$$ \ell_{obs}(\theta^{(t+1)}; y_{obs}) \ge \ell_{obs}(\theta^{(t)}; y_{obs}), $$
with equality if and only if $Q(\theta^{(t+1)}; \theta^{(t)}) = Q(\theta^{(t)}; \theta^{(t)})$. (A short proof sketch is given at the end of this section.)

This Theorem implies that increasing $Q(\theta; \theta^{(t)})$ at each step leads to maximizing, or at least steadily increasing, $\ell_{obs}(\theta; y_{obs})$. Although the general theory of EM applies to any model, it is particularly useful when the complete data $y$ are from an exponential family since, as seen in the examples, the E-step then reduces to finding the conditional expectations of the complete-data sufficient statistics, and the M-step is often simple. Nevertheless, even when the complete data $y$ are from an exponential family, there are important applications in which complete-data maximum likelihood estimation itself is complicated; see, for example, Little and Rubin (1987) on selection models and log-linear models, which generally require iterative M-steps.
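For completeness, here is a short proof sketch of the Lemma and the Theorem, in the standard way (following Dempster, Laird, and Rubin (1977)); it is added here for reference rather than reproduced from the notes. By Jensen's inequality applied to the concave function $\log$,
$$ H(\theta; \theta^{(t)}) - H(\theta^{(t)}; \theta^{(t)}) = E_{\theta^{(t)}}\left[\log\frac{f_2(y_{mis} \mid y_{obs}; \theta)}{f_2(y_{mis} \mid y_{obs}; \theta^{(t)})} \,\Big|\, y_{obs}\right] \le \log E_{\theta^{(t)}}\left[\frac{f_2(y_{mis} \mid y_{obs}; \theta)}{f_2(y_{mis} \mid y_{obs}; \theta^{(t)})} \,\Big|\, y_{obs}\right] = \log 1 = 0, $$
which proves the Lemma. Then, since $\ell_{obs} = Q - H$,
$$ \ell_{obs}(\theta^{(t+1)}; y_{obs}) - \ell_{obs}(\theta^{(t)}; y_{obs}) = \left[Q(\theta^{(t+1)}; \theta^{(t)}) - Q(\theta^{(t)}; \theta^{(t)})\right] - \left[H(\theta^{(t+1)}; \theta^{(t)}) - H(\theta^{(t)}; \theta^{(t)})\right] \ge 0, $$
because the first bracket is nonnegative by the definition of the M-step and the second is nonpositive by the Lemma.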

In a more general context, EM has been widely used in recent years in computations related to Bayesian analysis, to find the posterior mode of $\theta$, which maximizes $\ell(\theta; y_{obs}) + \log p(\theta)$ over $\theta \in \Theta$ for a prior density $p(\theta)$. In Bayesian computations, the log-likelihoods used above are thus simply replaced by log-posteriors.

Extensions of the EM Algorithm

Some Definitions and Notation

Regular Exponential Family (REF). The joint density of an exponential family may be written in the form
$$ f(y; \theta) = b(y)\, \exp\left\{c(\theta)^T s(y)\right\} / a(\theta), $$
where $s(y)$ is a $k \times 1$ vector of sufficient statistics, $c(\theta)$ is a $k \times 1$ vector of parameters, $\theta$ is a $d \times 1$ vector in $\Omega$, a $d$-dimensional convex set such that $f(y; \theta)$ is a density, and $b(y)$ and $a(\theta)$ are scalars. $c(\theta)$ is called the natural or canonical parameter vector. If $k = d$ and the Jacobian of $c(\theta)$, $\partial c/\partial\theta$, is a full-rank $k \times k$ matrix, then $f(y; \theta)$ is said to belong to a regular exponential family (REF). In this case the density can be written in terms of the natural parameter as
$$ f(y; \theta) = b(y)\, \exp\left\{\theta^T s(y)\right\} / a(\theta). $$

Complete-data score vector:
$$ S(\theta; y) = \frac{\partial \ell(\theta; y)}{\partial \theta}. $$

Observed-data score vector:
$$ S_{obs}(\theta; y_{obs}) = \frac{\partial \ell_{obs}(\theta; y_{obs})}{\partial \theta}. $$

It can also be shown that
$$ S_{obs}(\theta; y_{obs}) = E_\theta\left[S(\theta; y) \mid y_{obs}\right], $$
assuming that the conditions for interchanging the operations of expectation and differentiation hold.

Complete-data information matrix:
$$ I(\theta; y) = -\frac{\partial^2 \ell(\theta; y)}{\partial\theta\,\partial\theta^T}. $$

Complete-data expected information matrix:
$$ \mathcal{I}(\theta) = E_\theta\left[I(\theta; y)\right]. $$

Observed-data information matrix:
$$ I_{obs}(\theta; y_{obs}) = -\frac{\partial^2 \ell_{obs}(\theta; y_{obs})}{\partial\theta\,\partial\theta^T}. $$

Observed-data expected information matrix:
$$ \mathcal{I}_{obs}(\theta) = E_\theta\left[I_{obs}(\theta; y_{obs})\right]. $$

Conditional expected information matrix:
$$ I_c(\theta; y_{obs}) = E_\theta\left[I(\theta; y) \mid y_{obs}\right]. $$

Missing Information Principle

Recall that
$$ \ell_{obs}(\theta; y_{obs}) = \ell(\theta; y) - \log f_2(y_{mis} \mid y_{obs}; \theta). $$
Differentiating twice with respect to $\theta$ gives
$$ I_{obs}(\theta; y_{obs}) = I(\theta; y) + \frac{\partial^2 \log f_2(y_{mis} \mid y_{obs}; \theta)}{\partial\theta\,\partial\theta^T}. $$
Now taking expectations over the conditional distribution of $y$ given $y_{obs}$,
$$ I_{obs}(\theta; y_{obs}) = I_c(\theta; y_{obs}) - I_{mis}(\theta; y_{obs}), $$
where the missing information matrix is
$$ I_{mis}(\theta; y_{obs}) = -E_\theta\left[\frac{\partial^2 \log f_2(y_{mis} \mid y_{obs}; \theta)}{\partial\theta\,\partial\theta^T} \,\Big|\, y_{obs}\right]. $$
In other words, the missing information principle asserts that

Observed information = Complete information - Missing information.

Convergence Rate of EM

The EM algorithm implicitly defines a mapping
$$ \theta^{(t+1)} = M(\theta^{(t)}), \qquad t = 0, 1, \ldots, $$
where $M(\theta) = \left(M_1(\theta), \ldots, M_d(\theta)\right)$. For the problem of maximizing $Q(\theta; \theta')$, it can be shown that $M$ has a fixed point, and since $M$ is continuous and monotone, $\theta^{(t)}$ converges to a point $\theta^* \in \Omega$.

Consider a Taylor series expansion of $\theta^{(t+1)} = M(\theta^{(t)})$ about $\theta^*$, noting that $\theta^* = M(\theta^*)$:
$$ M(\theta^{(t)}) = M(\theta^*) + (\theta^{(t)} - \theta^*)\, \frac{\partial M(\theta)}{\partial\theta}\bigg|_{\theta = \theta^*}, $$
which leads to
$$ \theta^{(t+1)} - \theta^* = (\theta^{(t)} - \theta^*)\, DM, $$
where $DM = \partial M(\theta)/\partial\theta\big|_{\theta = \theta^*}$ is a $d \times d$ matrix. Thus, near $\theta^*$, the EM algorithm is essentially a linear iteration with rate matrix $DM$.

Definition. Recall that the rate of convergence of an iterative process is defined as
$$ r = \lim_{t \to \infty} \frac{\|\theta^{(t+1)} - \theta^*\|}{\|\theta^{(t)} - \theta^*\|}, $$
where $\|\cdot\|$ is any vector norm. For the EM algorithm, the rate of convergence is therefore
$$ r = \lambda_{\max} = \text{the largest eigenvalue of } DM. $$
Dempster, Laird, and Rubin (1977) have shown that
$$ DM = I_{mis}(\theta^*; y_{obs})\, I_c^{-1}(\theta^*; y_{obs}). $$
Thus the rate of convergence is the largest eigenvalue of $I_{mis}(\theta^*; y_{obs})\, I_c^{-1}(\theta^*; y_{obs})$, i.e., of the matrix-valued fraction of missing information.

Obtaining the Covariance Matrix of the MLE from the EM Algorithm

In maximum likelihood estimation, the large-sample covariance matrix of the MLE $\hat\theta$ is usually estimated by the inverse of the observed information matrix. When the EM algorithm is used to compute the maximum likelihood estimates, once convergence is reached we could in principle evaluate $I_{obs}(\hat\theta; y_{obs})$ directly. However, this requires the second-order derivatives of the observed-data log-likelihood $\ell_{obs}(\theta; y_{obs})$, which is usually not a viable option, since we appealed to the EM algorithm expressly to avoid working with $\ell_{obs}(\theta; y_{obs})$ itself. Thus we need to approximate $I_{obs}(\hat\theta; y_{obs})$ by other methods if the EM algorithm is to be a useful alternative to other iterative techniques for maximum likelihood estimation. For this we need a few more results.

In the REF case we can use
$$ I_c(\hat\theta; y_{obs}) = I(\hat\theta; y) $$
to obtain the conditional expected information matrix, because the complete-data log-likelihood is linear in $s(y)$, so that $I(\theta; y)$ depends on $y$ at most linearly through $s(y)$ (and, in the natural parameterization, not at all):
$$ I_c(\theta; y_{obs}) = E_\theta\left[I(\theta; y) \mid y_{obs}\right] = I(\theta; y)\Big|_{s(y) \to E_\theta[s(y)\mid y_{obs}]}. $$

That is, $I_c(\hat\theta; y_{obs})$ can be obtained by replacing the sufficient statistic $s(y)$ by its conditional expectation, evaluated at $\theta = \hat\theta$, in $I(\theta; y)$, the complete-data information matrix.

Example 2 (continued):
$$ \ell(\theta; y) = y_1 \log\left(\tfrac{1}{2} - \tfrac{1}{2}\theta\right) + y_2 \log\tfrac{\theta}{4} + y_3 \log\tfrac{\theta}{4} + y_4 \log\tfrac{1}{2} $$
$$ \frac{\partial \ell(\theta; y)}{\partial\theta} = -\frac{y_1}{1 - \theta} + \frac{y_2}{\theta} + \frac{y_3}{\theta} $$
$$ -\frac{\partial^2 \ell(\theta; y)}{\partial\theta^2} = I(\theta; y) = \frac{y_1}{(1-\theta)^2} + \frac{y_2 + y_3}{\theta^2} $$
$$ E\left[I(\theta; y) \mid y_{obs}\right] = I_c(\theta; y_{obs}) = \frac{38}{(1-\theta)^2} + \frac{34}{\theta^2} + \frac{125}{(2+\theta)\,\theta} $$
$$ I_c(\hat\theta; y_{obs}) = \frac{38}{(1-\hat\theta)^2} + \frac{34}{\hat\theta^2} + \frac{125}{(2+\hat\theta)\,\hat\theta}. $$
This computation can be completed using the value of $\hat\theta$ from the earlier results. The convergence rate can be calculated similarly and is found to be 0.1328, which agrees with the value observed in the actual iterations.
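The computation can be finished numerically; the R sketch below (an addition for illustration, not part of the original notes) locates theta-hat by one-dimensional optimization and reproduces the quoted convergence rate of about 0.1328.

# Observed-data log-likelihood for Example 2, with observed counts (38, 34, 125)
lobs <- function(th) 38*log((1 - th)/2) + 34*log(th/4) + 125*log(1/2 + th/4)
that <- optimize(lobs, c(0.01, 0.99), maximum = TRUE)$maximum   # MLE, about 0.6268
# Conditional expected (complete) information and observed information at theta-hat
Ic   <- 38/(1 - that)^2 + 34/that^2 + 125/((2 + that)*that)
Iobs <- 38/(1 - that)^2 + 34/that^2 + 125/(2 + that)^2
Imis <- Ic - Iobs          # missing information
Imis / Ic                  # fraction of missing information = EM rate, about 0.1328
sqrt(1/Iobs)               # large-sample standard error of theta-hat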

Generalized EM Algorithm (GEM)

Recall that in the M-step we maximize $Q(\theta; \theta^{(t)})$, i.e., we find $\theta^{(t+1)}$ such that
$$ Q(\theta^{(t+1)}; \theta^{(t)}) \ge Q(\theta; \theta^{(t)}) \quad \text{for all } \theta. $$
In the generalized version of the EM algorithm we require only that $\theta^{(t+1)}$ be chosen such that
$$ Q(\theta^{(t+1)}; \theta^{(t)}) \ge Q(\theta^{(t)}; \theta^{(t)}), $$
i.e., $\theta^{(t+1)}$ is chosen to increase $Q(\theta; \theta^{(t)})$ over its value at $\theta^{(t)}$ at each iteration $t$. This is sufficient to ensure that $\ell_{obs}(\theta^{(t+1)}; y_{obs}) \ge \ell_{obs}(\theta^{(t)}; y_{obs})$ at each iteration, so a GEM sequence of iterates also increases the likelihood and, under regularity conditions, converges to a local maximum.

GEM Algorithm Based on a Single N-R Step

GEM-type algorithms are used when a global maximizer of $Q(\theta; \theta^{(t)})$ does not exist in closed form. In that case an iterative method would in general be needed to carry out the M-step, which might make the procedure computationally infeasible. Since in a GEM it is not essential actually to maximize $Q$, but only to increase the likelihood, we may replace the M-step with a step that achieves this. One possibility is a single iteration of the Newton-Raphson (N-R) algorithm applied to $Q$. Let
$$ \theta^{(t+1)} = \theta^{(t)} + a^{(t)}\,\delta^{(t)}, \qquad \delta^{(t)} = -\left[\frac{\partial^2 Q(\theta; \theta^{(t)})}{\partial\theta\,\partial\theta^T}\right]^{-1}_{\theta=\theta^{(t)}} \left[\frac{\partial Q(\theta; \theta^{(t)})}{\partial\theta}\right]_{\theta=\theta^{(t)}}, $$
i.e., $\delta^{(t)}$ is the N-R direction at $\theta^{(t)}$, and $0 < a^{(t)} \le 1$. If $a^{(t)} = 1$ this defines an exact N-R step. Here $a^{(t)}$ is chosen so that the update defines a GEM sequence; this will be achieved if $a^{(t)} < 2$ as $t \to \infty$.

General Mixed Model

The general mixed linear model is
$$ y = X\beta + \sum_{i=1}^{r} Z_i u_i + \epsilon, $$
where $y$ is an observed $n \times 1$ random vector, $X$ ($n \times p$) and $Z_i$ ($n \times q_i$) are matrices of known constants, $\beta$ ($p \times 1$) is a vector of unknown fixed parameters, and the $u_i$ ($q_i \times 1$) are vectors of unobservable random effects. The error vector $\epsilon$ ($n \times 1$) is assumed to be distributed as $n$-dimensional multivariate normal $N(0, \sigma_0^2 I_n)$, and each $u_i$ is assumed to have a $q_i$-dimensional multivariate normal distribution $N_{q_i}(0, \sigma_i^2 I_{q_i})$ for $i = 1, 2, \ldots, r$, independent of one another and of $\epsilon$. We take the complete-data vector to be $(y, u_1, \ldots, u_r)$, where $y$ is the incomplete, or observed, data vector. It can be shown easily that the covariance matrix of $y$ is the $n \times n$ matrix
$$ V = \sum_{i=1}^{r} Z_i Z_i^T \sigma_i^2 + \sigma_0^2 I_n. $$
Let $q = \sum_{i=0}^{r} q_i$, where $q_0 = n$. The joint distribution of $y$ and $u_1, \ldots, u_r$ is $q$-dimensional multivariate normal, $N(\mu, \Sigma)$, where
$$ \mu = \begin{pmatrix} X\beta \\ 0 \end{pmatrix} \qquad \text{and} \qquad \Sigma = \begin{pmatrix} V & \{\sigma_i^2 Z_i\}_{i=1}^r \\ \{\sigma_i^2 Z_i^T\}_{i=1}^r & \operatorname{diag}\{\sigma_i^2 I_{q_i}\}_{i=1}^r \end{pmatrix}. $$
Thus the density function of $(y, u_1, \ldots, u_r)$ is
$$ f(y, u_1, u_2, \ldots, u_r) = (2\pi)^{-q/2}\, |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2}\, w^T \Sigma^{-1} w\right), $$
where $w = \left[(y - X\beta)^T, u_1^T, \ldots, u_r^T\right]^T$. Factoring the joint density as $f(y \mid u_1, \ldots, u_r)\prod_i f(u_i)$ gives the complete-data log-likelihood
$$ \ell = -\tfrac{1}{2}\, q \log(2\pi) - \tfrac{1}{2}\sum_{i=0}^{r} q_i \log\sigma_i^2 - \tfrac{1}{2}\sum_{i=0}^{r} \frac{u_i^T u_i}{\sigma_i^2}, $$

where $u_0 = y - X\beta - \sum_{i=1}^{r} Z_i u_i\ (= \epsilon)$. Thus the sufficient statistics are
$$ u_i^T u_i, \quad i = 0, \ldots, r, \qquad \text{and} \qquad y - \sum_{i=1}^{r} Z_i u_i, $$
and the complete-data maximum likelihood estimates are
$$ \hat\sigma_i^2 = \frac{u_i^T u_i}{q_i}, \quad i = 0, 1, \ldots, r, \qquad \hat\beta = (X^T X)^{-} X^T\left(y - \sum_{i=1}^{r} Z_i u_i\right). $$

Special Case: Two-Variance Components Model

The general mixed linear model reduces to
$$ y = X\beta + Z_1 u_1 + \epsilon, $$
where $\epsilon \sim N(0, \sigma_0^2 I_n)$ and $u_1 \sim N(0, \sigma_1^2 I_{q_1})$, and the covariance matrix of $y$ is now
$$ V = Z_1 Z_1^T \sigma_1^2 + \sigma_0^2 I_n. $$
The complete-data log-likelihood is
$$ \ell = -\tfrac{1}{2}\, q \log(2\pi) - \tfrac{1}{2}\sum_{i=0}^{1} q_i \log\sigma_i^2 - \tfrac{1}{2}\sum_{i=0}^{1} \frac{u_i^T u_i}{\sigma_i^2}, $$
where $u_0 = y - X\beta - Z_1 u_1$. The complete-data MLEs are
$$ \hat\sigma_i^2 = \frac{u_i^T u_i}{q_i}, \quad i = 0, 1, \qquad \hat\beta = (X^T X)^{-} X^T (y - Z_1 u_1). $$

We need the expected values of the sufficient statistics $u_i^T u_i$, $i = 0, 1$, and $y - Z_1 u_1$, conditional on the observed data vector $y$. Since $u_i \mid y$ is distributed as the $q_i$-dimensional multivariate normal
$$ N\left(\sigma_i^2 Z_i^T V^{-1}(y - X\beta),\ \sigma_i^2 I_{q_i} - \sigma_i^4 Z_i^T V^{-1} Z_i\right), $$
we have
$$ E(u_i^T u_i \mid y) = \sigma_i^4 (y - X\beta)^T V^{-1} Z_i Z_i^T V^{-1} (y - X\beta) + \operatorname{tr}\left(\sigma_i^2 I_{q_i} - \sigma_i^4 Z_i^T V^{-1} Z_i\right), $$
$$ E(y - Z_1 u_1 \mid y) = X\beta + \sigma_0^2 V^{-1}(y - X\beta), $$
noting that
$$ E(u_0 \mid y) = \sigma_0^2 V^{-1}(y - X\beta), \qquad E(u_0^T u_0 \mid y) = \sigma_0^4 (y - X\beta)^T V^{-1} Z_0 Z_0^T V^{-1} (y - X\beta) + \operatorname{tr}\left(\sigma_0^2 I_{q_0} - \sigma_0^4 Z_0^T V^{-1} Z_0\right), $$
where $Z_0 = I_n$.
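The conditional-expectation formulas above can be written as a short R function. This is an illustrative sketch added to the notes (the argument names y, X, Z1, sig2_0, sig2_1 are assumptions of the sketch); it mirrors the em.mixed() code listed later in these notes.

cond_moments <- function(y, X, Z1, beta, sig2_0, sig2_1) {
  n    <- length(y)
  V    <- sig2_1 * Z1 %*% t(Z1) + sig2_0 * diag(n)
  Vinv <- solve(V)
  r    <- y - X %*% beta
  # E(u1' u1 | y) and E(u0' u0 | y), with Z0 = I_n
  s1 <- sig2_1^2 * t(r) %*% Vinv %*% Z1 %*% t(Z1) %*% Vinv %*% r +
        sig2_1 * ncol(Z1) - sig2_1^2 * sum(diag(t(Z1) %*% Vinv %*% Z1))
  s0 <- sig2_0^2 * t(r) %*% Vinv %*% Vinv %*% r +
        sig2_0 * n - sig2_0^2 * sum(diag(Vinv))
  w  <- X %*% beta + sig2_0 * Vinv %*% r        # E(y - Z1 u1 | y)
  list(s0 = as.numeric(s0), s1 = as.numeric(s1), w = w)
}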

From the above we can derive the following EM-type algorithms for this case.

Basic EM Algorithm

Step 1 (E-step). Set $V^{(t)} = Z_1 Z_1^T \sigma_1^{2(t)} + \sigma_0^{2(t)} I_n$ and, for $i = 0, 1$, calculate
$$ \hat s_i^{(t)} = E(u_i^T u_i \mid y)\Big|_{\beta=\beta^{(t)},\,\sigma_i^2=\sigma_i^{2(t)}} = \sigma_i^{4(t)} (y - X\beta^{(t)})^T V^{(t)-1} Z_i Z_i^T V^{(t)-1} (y - X\beta^{(t)}) + \operatorname{tr}\left(\sigma_i^{2(t)} I_{q_i} - \sigma_i^{4(t)} Z_i^T V^{(t)-1} Z_i\right), $$
$$ \hat w^{(t)} = E(y - Z_1 u_1 \mid y)\Big|_{\beta=\beta^{(t)},\,\sigma_i^2=\sigma_i^{2(t)}} = X\beta^{(t)} + \sigma_0^{2(t)} V^{(t)-1} (y - X\beta^{(t)}). $$

Step 2 (M-step).
$$ \sigma_i^{2(t+1)} = \hat s_i^{(t)}/q_i, \quad i = 0, 1, \qquad \beta^{(t+1)} = (X^T X)^{-1} X^T \hat w^{(t)}. $$

ECM Algorithm

Step 1 (E-step). Set $V^{(t)} = Z_1 Z_1^T \sigma_1^{2(t)} + \sigma_0^{2(t)} I_n$ and, for $i = 0, 1$, calculate
$$ \hat s_i^{(t)} = \sigma_i^{4(t)} (y - X\beta^{(t)})^T V^{(t)-1} Z_i Z_i^T V^{(t)-1} (y - X\beta^{(t)}) + \operatorname{tr}\left(\sigma_i^{2(t)} I_{q_i} - \sigma_i^{4(t)} Z_i^T V^{(t)-1} Z_i\right). $$

Step 2 (M-step). Partition the parameter vector $\theta = (\sigma_0^2, \sigma_1^2, \beta)$ as $\theta_1 = (\sigma_0^2, \sigma_1^2)$ and $\theta_2 = \beta$.

CM-step 1. Maximize the complete-data log-likelihood over $\theta_1$:
$$ \sigma_i^{2(t+1)} = \hat s_i^{(t)}/q_i, \quad i = 0, 1. $$

CM-step 2. Calculate $\beta^{(t+1)}$ as
$$ \beta^{(t+1)} = (X^T X)^{-1} X^T \hat w^{(t+1)}, \qquad \text{where } \hat w^{(t+1)} = X\beta^{(t)} + \sigma_0^{2(t+1)} V^{(t+1)-1} (y - X\beta^{(t)}). $$

ECME Algorithm

Step 1 (E-step). Set $V^{(t)} = Z_1 Z_1^T \sigma_1^{2(t)} + \sigma_0^{2(t)} I_n$ and, for $i = 0, 1$, calculate
$$ \hat s_i^{(t)} = E(u_i^T u_i \mid y)\Big|_{\sigma_i^2=\sigma_i^{2(t)}} = \sigma_i^{4(t)}\, y^T P^{(t)} Z_i Z_i^T P^{(t)} y + \operatorname{tr}\left(\sigma_i^{2(t)} I_{q_i} - \sigma_i^{4(t)} Z_i^T V^{(t)-1} Z_i\right), $$
where
$$ P^{(t)} = V^{(t)-1} - V^{(t)-1} X\left(X^T V^{(t)-1} X\right)^{-} X^T V^{(t)-1}. $$

Step 2 (M-step). Partition $\theta$ as $\theta_1 = (\sigma_0^2, \sigma_1^2)$ and $\theta_2 = \beta$, as in ECM.

CM-step 1. Maximize the complete-data log-likelihood over $\theta_1$:
$$ \sigma_i^{2(t+1)} = \hat s_i^{(t)}/q_i, \quad i = 0, 1. $$

CM-step 2. Maximize the observed-data log-likelihood over $\theta_2$ given $\theta_1^{(t+1)} = (\sigma_0^{2(t+1)}, \sigma_1^{2(t+1)})$:
$$ \beta^{(t+1)} = \left(X^T V^{(t+1)-1} X\right)^{-1} X^T V^{(t+1)-1}\, y. $$
(Note: this is the weighted (generalized) least squares estimator of $\beta$.)
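As an illustration of how the ECME variant differs from the basic EM in its beta-update, here is a sketch of one ECME iteration in R (an addition, not the original code; it assumes X has full column rank and uses the same y, x, z conventions as the em.mixed() function listed below).

ecme_step <- function(y, x, z, var0, var1) {
  n  <- length(y)
  q1 <- ncol(z)
  V    <- var1 * z %*% t(z) + var0 * diag(n)
  Vinv <- solve(V)
  # P projects out the fixed effects (generalized least squares residual operator)
  P  <- Vinv - Vinv %*% x %*% solve(t(x) %*% Vinv %*% x) %*% t(x) %*% Vinv
  # E-step quantities (do not require a current value of beta)
  s0 <- var0^2 * t(y) %*% P %*% P %*% y + var0 * n  - var0^2 * sum(diag(Vinv))
  s1 <- var1^2 * t(y) %*% P %*% z %*% t(z) %*% P %*% y +
        var1 * q1 - var1^2 * sum(diag(t(z) %*% Vinv %*% z))
  # CM-step 1: update the variance components
  var0_new <- as.numeric(s0) / n
  var1_new <- as.numeric(s1) / q1
  # CM-step 2: GLS estimate of beta at the updated variances
  Vnew  <- var1_new * z %*% t(z) + var0_new * diag(n)
  Vninv <- solve(Vnew)
  beta_new <- solve(t(x) %*% Vninv %*% x) %*% t(x) %*% Vninv %*% y
  list(beta = beta_new, var0 = var0_new, var1 = var1_new)
}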

Example of Mixed Model Analysis Using the EM Algorithm

The first example is an evaluation of the breeding value of a set of five sires in raising pigs, taken from Snedecor and Cochran (1967). (The data are reported in Appendix 4.) The experiment was designed so that each sire was mated to a random group of dams, each mating producing a litter of pigs whose characteristics (here, average daily gain) were the criterion. The model to be fitted is
$$ y_{ijk} = \mu + \alpha_i + \beta_{ij} + \epsilon_{ijk}, $$
where $\alpha_i$ is a fixed effect associated with the $i$-th sire, $\beta_{ij}$ is a random effect associated with the $i$-th sire and $j$-th dam, and $\epsilon_{ijk}$ is a random error term. Three different initial values for $(\sigma_0^2, \sigma_1^2)$ were used: $(1, 1)$, $(10, 10)$ and $(0.038, 0.0375)$; the last corresponds to the estimates from the SAS ANOVA procedure.

Table 1. Average Daily Gain of Two Pigs of Each Litter (in pounds). Columns: Sire, Dam, Gain (two gain records per litter); the gain values are those entered as the vector y in the R session below.

###########################################################
#     Classical EM algorithm for the Linear Mixed Model   #
###########################################################

em.mixed <- function(y, x, z, beta, var0, var1, maxiter = 2000, tolerance = 1e-0010)
{
   time <- proc.time()
   n  <- nrow(y)
   q1 <- ncol(z)          # number of random effects u1 (columns of z)
   conv <- 1
   L0 <- loglike(y, x, z, beta, var0, var1)
   i <- 0
   cat(" Iter.   sigma0   sigma1   Likelihood", fill = T)
   repeat {
      if (i > maxiter) { conv <- 0; break }
      # marginal covariance matrix V = var1 * Z Z' + var0 * I at the current estimates
      V    <- c(var1) * z %*% t(z) + c(var0) * diag(n)
      Vinv <- solve(V)
      xb    <- x %*% beta
      resid <- (y - xb)
      temp1 <- Vinv %*% resid
      # E-step: expected values of u0'u0, u1'u1 and y - Z u1 given y
      s0 <- c(var0)^2 * t(temp1) %*% temp1 + c(var0) * n - c(var0)^2 * tr(Vinv)
      s1 <- c(var1)^2 * t(temp1) %*% z %*% t(z) %*% temp1 +
            c(var1) * q1 - c(var1)^2 * tr(t(z) %*% Vinv %*% z)
      w  <- xb + c(var0) * temp1
      # M-step: update the variance components and beta
      var0 <- s0/n
      var1 <- s1/q1
      beta <- ginverse(t(x) %*% x) %*% t(x) %*% w   # ginverse: generalized inverse (MASS::ginv in R)
      L1 <- loglike(y, x, z, beta, var0, var1)
      if (L1 < L0) {
         print("log-likelihood must increase, L1 < L0, break.")
         conv <- 0
         break
      }
      i <- i + 1
      cat(" ", i, " ", var0, " ", var1, " ", L1, fill = T)
      if (abs(L1 - L0) < tolerance) { break }       # check for convergence
      L0 <- L1
   }
   list(beta = beta, var0 = var0, var1 = var1, loglikelihood = L0)
}

#########################################################
# loglike calculates the LogLikelihood for Mixed Model  #
#########################################################

loglike <- function(y, x, z, beta, var0, var1)
{
   n <- nrow(y)
   V    <- c(var1) * z %*% t(z) + c(var0) * diag(n)
   Vinv <- ginverse(V)
   xb    <- x %*% beta
   resid <- (y - xb)
   temp1 <- Vinv %*% resid
   (-.5) * (log(det(V)) + t(resid) %*% temp1)
}

> y <- matrix(c(2.77, 2.38, 2.58, 2.94, 2.28, 2.22, 3.01, 2.61, , 2.71,
                2.72, 2.74, 2.87, 2.46, 2.31, 2.24, , 2.56, 2.50, 2.48), 20, 1)
> x1 <- rep(c(1,0,0,0,0), rep(4,5))
> x2 <- rep(c(0,1,0,0,0), rep(4,5))
> x3 <- rep(c(0,0,1,0,0), rep(4,5))
> x4 <- rep(c(0,0,0,1,0), rep(4,5))
> x <- cbind(1, x1, x2, x3, x4)
> x
        x1 x2 x3 x4
 [1,] 1  1  0  0  0
 [2,] 1  1  0  0  0
 [3,] 1  1  0  0  0
 [4,] 1  1  0  0  0
 [5,] 1  0  1  0  0
 [6,] 1  0  1  0  0
 [7,] 1  0  1  0  0
 [8,] 1  0  1  0  0
 [9,] 1  0  0  1  0
[10,] 1  0  0  1  0
[11,] 1  0  0  1  0
[12,] 1  0  0  1  0
[13,] 1  0  0  0  1

[14,] 1  0  0  0  1
[15,] 1  0  0  0  1
[16,] 1  0  0  0  1
[17,] 1  0  0  0  0
[18,] 1  0  0  0  0
[19,] 1  0  0  0  0
[20,] 1  0  0  0  0
> beta <- lm(y ~ x1 + x2 + x3 + x4)$coefficients
> beta
               [,1]
(Intercept)     ...
x1              ...
x2              ...
x3              ...
x4              ...
> z <- matrix(rep(as.vector(diag(1,10)), rep(2,100)), 20, 10)
> z
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    1    0    0    0    0    0    0    0    0     0
 [2,]    1    0    0    0    0    0    0    0    0     0
 [3,]    0    1    0    0    0    0    0    0    0     0
 [4,]    0    1    0    0    0    0    0    0    0     0
 [5,]    0    0    1    0    0    0    0    0    0     0
 [6,]    0    0    1    0    0    0    0    0    0     0
 [7,]    0    0    0    1    0    0    0    0    0     0
 [8,]    0    0    0    1    0    0    0    0    0     0
 [9,]    0    0    0    0    1    0    0    0    0     0
[10,]    0    0    0    0    1    0    0    0    0     0
[11,]    0    0    0    0    0    1    0    0    0     0
[12,]    0    0    0    0    0    1    0    0    0     0
[13,]    0    0    0    0    0    0    1    0    0     0
[14,]    0    0    0    0    0    0    1    0    0     0
[15,]    0    0    0    0    0    0    0    1    0     0
[16,]    0    0    0    0    0    0    0    1    0     0
[17,]    0    0    0    0    0    0    0    0    1     0
[18,]    0    0    0    0    0    0    0    0    1     0
[19,]    0    0    0    0    0    0    0    0    0     1
[20,]    0    0    0    0    0    0    0    0    0     1
> tolerance <- 1e-10

> maxiter <- 2000
> seed <- 100
> tr <- function(x) sum(diag(x))
> pig.em.results = em.mixed(y, x, z, beta, 1, 1)
 Iter.   sigma0   sigma1   Likelihood
   (values of sigma0, sigma1 and the log-likelihood printed at each iteration)


> pig.em.results
$beta:
      [,1]
[1,]   ...
[2,]   ...
[3,]   ...
[4,]   ...
[5,]   ...

$var0:
      [,1]
[1,]   ...

$var1:
      [,1]
[1,]   ...

$loglikelihood:
      [,1]
[1,]   ...
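As a cross-check (an addition, not part of the original notes), the same maximum likelihood fit can be obtained with a standard mixed-model routine. The sketch below assumes the full y vector has been entered and uses the sire/dam layout implied by the x and z matrices above (five sires with four records each, ten litters with two records each); it requires the lme4 package.

library(lme4)
gain <- as.vector(y)                        # the 20 gain values entered above
sire <- factor(rep(1:5,  each = 4))         # five sires, four records each
dam  <- factor(rep(1:10, each = 2))         # ten dams (litters), two records each
fit  <- lmer(gain ~ sire + (1 | dam), REML = FALSE)   # ML fit, to match the EM estimates
summary(fit)    # the dam variance corresponds to var1, the residual variance to var0

The fixed-effect parameterization differs from the dummy coding used in x, but the variance component estimates should agree with those produced by em.mixed().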

References

Aitkin, M. and Wilson, G.T. (1980). Mixture Models, Outliers and the EM Algorithm. Technometrics, 22.

Baker, S.G. (1990). A Simple EM Algorithm for Capture-Recapture Data with Categorical Covariates. Biometrics, 46.

Callanan, T. and Harville, D.A. (1991). Some New Algorithms for Computing REML of Variance Components. Journal of Statistical Computation and Simulation, 38.

Davenport, J.W., Pierce, M.A. and Hathaway, R.J. (1988). A Numerical Comparison of EM and Quasi-Newton Type Algorithms for MLE for a Mixture of Normals. Computer Science and Statistics: Proceedings of the 20th Symposium on the Interface.

Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39.

Fessler, J.A. and Hero, A.O. (1994). Space-Alternating Generalized Expectation-Maximization Algorithm. IEEE Transactions on Signal Processing, 42.

Jamshidian, M. and Jennrich, R.I. (1993). Conjugate Gradient Acceleration of the EM Algorithm. Journal of the American Statistical Association, 88.

Laird, N.M. (1982). Computation of Variance Components Using the EM Algorithm. Journal of Statistical Computation and Simulation, 14.

Laird, N.M., Lange, N. and Stram, D. (1987). Maximum Likelihood Computation with Repeated Measures. Journal of the American Statistical Association, 83.

Lindstrom, M.J. and Bates, D.M. (1988). Newton-Raphson and EM Algorithms for Linear Mixed-Effects Models for Repeated-Measures Data. Journal of the American Statistical Association, 83.

Little, R.J.A. and Rubin, D.B. (1987). Statistical Analysis with Missing Data. New York: John Wiley.

Liu, C.H. and Rubin, D.B. (1994). The ECME Algorithm: A Simple Extension of EM and ECM with Faster Monotone Convergence. Biometrika, 81.

Louis, T.A. (1982). Finding the Observed Information Matrix when Using the EM Algorithm. Journal of the Royal Statistical Society, Series B, 44.

Meng, X.L. (1994). On the Rate of Convergence of the ECM Algorithm. Annals of Statistics, 22.

Meng, X.L. and Rubin, D.B. (1991). Using EM to Obtain Asymptotic Variance-Covariance Matrices: The SEM Algorithm. Journal of the American Statistical Association, 86.

Meng, X.L. and Rubin, D.B. (1993). Maximum Likelihood Estimation via the ECM Algorithm: A General Framework. Biometrika, 80.

Meilijson, I. (1989). A Fast Improvement to the EM Algorithm on its Own Terms. Journal of the Royal Statistical Society, Series B, 44.

Orchard, T. and Woodbury, M.A. (1972). A Missing Information Principle: Theory and Applications. Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1.

Rubin, D.B. (1976). Inference and Missing Data. Biometrika, 63.

Rubin, D.B. (1991). EM and Beyond. Psychometrika, 56.

Tanner, M.A. and Wong, W.H. (1987). The Calculation of Posterior Distributions by Data Augmentation (with discussion). Journal of the American Statistical Association, 82.

Titterington, D.M. (1984). Recursive Parameter Estimation Using Incomplete Data. Journal of the Royal Statistical Society, Series B, 46.

Wei, G.C.G. and Tanner, M.A. (1990). A Monte Carlo Implementation of the EM Algorithm and the Poor Man's Data Augmentation Algorithms. Journal of the American Statistical Association, 85.

Wu, C.F.J. (1983). On the Convergence Properties of the EM Algorithm. The Annals of Statistics, 11.


Item selection by latent class-based methods: an application to nursing homes evaluation Item selection by latent class-based methods: an application to nursing homes evaluation Francesco Bartolucci, Giorgio E. Montanari, Silvia Pandolfi 1 Department of Economics, Finance and Statistics University

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Math 541: Statistical Theory II Lecturer: Songfeng Zheng Maximum Likelihood Estimation 1 Maximum Likelihood Estimation Maximum likelihood is a relatively simple method of constructing an estimator for

More information

Maximum Likelihood Estimation of Logistic Regression Models: Theory and Implementation

Maximum Likelihood Estimation of Logistic Regression Models: Theory and Implementation Maximum Likelihood Estimation of Logistic Regression Models: Theory and Implementation Abstract This article presents an overview of the logistic regression model for dependent variables having two or

More information

A LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA

A LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA REVSTAT Statistical Journal Volume 4, Number 2, June 2006, 131 142 A LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA Authors: Daiane Aparecida Zuanetti Departamento de Estatística, Universidade Federal de São

More information

Factorial experimental designs and generalized linear models

Factorial experimental designs and generalized linear models Statistics & Operations Research Transactions SORT 29 (2) July-December 2005, 249-268 ISSN: 1696-2281 www.idescat.net/sort Statistics & Operations Research c Institut d Estadística de Transactions Catalunya

More information

Logistic Regression (1/24/13)

Logistic Regression (1/24/13) STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used

More information

Optimizing Prediction with Hierarchical Models: Bayesian Clustering

Optimizing Prediction with Hierarchical Models: Bayesian Clustering 1 Technical Report 06/93, (August 30, 1993). Presidencia de la Generalidad. Caballeros 9, 46001 - Valencia, Spain. Tel. (34)(6) 386.6138, Fax (34)(6) 386.3626, e-mail: bernardo@mac.uv.es Optimizing Prediction

More information

Statistical Methods for the Analysis of Alcohol and Drug Uses for Young Adults

Statistical Methods for the Analysis of Alcohol and Drug Uses for Young Adults Journal of Data Science 7(2009), 469-485 Statistical Methods for the Analysis of Alcohol and Drug Uses for Young Adults Liang Zhu, Jianguo Sun and Phillip Wood University of Missouri Abstract: Alcohol

More information

Generating Random Numbers Variance Reduction Quasi-Monte Carlo. Simulation Methods. Leonid Kogan. MIT, Sloan. 15.450, Fall 2010

Generating Random Numbers Variance Reduction Quasi-Monte Carlo. Simulation Methods. Leonid Kogan. MIT, Sloan. 15.450, Fall 2010 Simulation Methods Leonid Kogan MIT, Sloan 15.450, Fall 2010 c Leonid Kogan ( MIT, Sloan ) Simulation Methods 15.450, Fall 2010 1 / 35 Outline 1 Generating Random Numbers 2 Variance Reduction 3 Quasi-Monte

More information

(Quasi-)Newton methods

(Quasi-)Newton methods (Quasi-)Newton methods 1 Introduction 1.1 Newton method Newton method is a method to find the zeros of a differentiable non-linear function g, x such that g(x) = 0, where g : R n R n. Given a starting

More information

THE CENTRAL LIMIT THEOREM TORONTO

THE CENTRAL LIMIT THEOREM TORONTO THE CENTRAL LIMIT THEOREM DANIEL RÜDT UNIVERSITY OF TORONTO MARCH, 2010 Contents 1 Introduction 1 2 Mathematical Background 3 3 The Central Limit Theorem 4 4 Examples 4 4.1 Roulette......................................

More information

Ideal Class Group and Units

Ideal Class Group and Units Chapter 4 Ideal Class Group and Units We are now interested in understanding two aspects of ring of integers of number fields: how principal they are (that is, what is the proportion of principal ideals

More information

Time Series Analysis

Time Series Analysis Time Series Analysis hm@imm.dtu.dk Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby 1 Outline of the lecture Identification of univariate time series models, cont.:

More information

GLM I An Introduction to Generalized Linear Models

GLM I An Introduction to Generalized Linear Models GLM I An Introduction to Generalized Linear Models CAS Ratemaking and Product Management Seminar March 2009 Presented by: Tanya D. Havlicek, Actuarial Assistant 0 ANTITRUST Notice The Casualty Actuarial

More information

Problem of Missing Data

Problem of Missing Data VASA Mission of VA Statisticians Association (VASA) Promote & disseminate statistical methodological research relevant to VA studies; Facilitate communication & collaboration among VA-affiliated statisticians;

More information

Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure

Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure Belyaev Mikhail 1,2,3, Burnaev Evgeny 1,2,3, Kapushev Yermek 1,2 1 Institute for Information Transmission

More information

Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression

Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression Saikat Maitra and Jun Yan Abstract: Dimension reduction is one of the major tasks for multivariate

More information

Advanced statistical inference. Suhasini Subba Rao Email: suhasini.subbarao@stat.tamu.edu

Advanced statistical inference. Suhasini Subba Rao Email: suhasini.subbarao@stat.tamu.edu Advanced statistical inference Suhasini Subba Rao Email: suhasini.subbarao@stat.tamu.edu August 1, 2012 2 Chapter 1 Basic Inference 1.1 A review of results in statistical inference In this section, we

More information

A Tutorial on MM Algorithms

A Tutorial on MM Algorithms A Tutorial on MM Algorithms David R. Hunter 1 Kenneth Lange 2 Department of Statistics 1 Penn State University University Park, PA 16802-2111 Departments of Biomathematics and Human Genetics 2 David Geffen

More information

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher

More information

Stat 704 Data Analysis I Probability Review

Stat 704 Data Analysis I Probability Review 1 / 30 Stat 704 Data Analysis I Probability Review Timothy Hanson Department of Statistics, University of South Carolina Course information 2 / 30 Logistics: Tuesday/Thursday 11:40am to 12:55pm in LeConte

More information

Constrained Bayes and Empirical Bayes Estimator Applications in Insurance Pricing

Constrained Bayes and Empirical Bayes Estimator Applications in Insurance Pricing Communications for Statistical Applications and Methods 2013, Vol 20, No 4, 321 327 DOI: http://dxdoiorg/105351/csam2013204321 Constrained Bayes and Empirical Bayes Estimator Applications in Insurance

More information

Package EstCRM. July 13, 2015

Package EstCRM. July 13, 2015 Version 1.4 Date 2015-7-11 Package EstCRM July 13, 2015 Title Calibrating Parameters for the Samejima's Continuous IRT Model Author Cengiz Zopluoglu Maintainer Cengiz Zopluoglu

More information

Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:

More information

Adaptive Search with Stochastic Acceptance Probabilities for Global Optimization

Adaptive Search with Stochastic Acceptance Probabilities for Global Optimization Adaptive Search with Stochastic Acceptance Probabilities for Global Optimization Archis Ghate a and Robert L. Smith b a Industrial Engineering, University of Washington, Box 352650, Seattle, Washington,

More information

Department of Economics

Department of Economics Department of Economics On Testing for Diagonality of Large Dimensional Covariance Matrices George Kapetanios Working Paper No. 526 October 2004 ISSN 1473-0278 On Testing for Diagonality of Large Dimensional

More information

Statistical Theory. Prof. Gesine Reinert

Statistical Theory. Prof. Gesine Reinert Statistical Theory Prof. Gesine Reinert November 23, 2009 Aim: To review and extend the main ideas in Statistical Inference, both from a frequentist viewpoint and from a Bayesian viewpoint. This course

More information

1 Maximum likelihood estimation

1 Maximum likelihood estimation COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

More information

About predictions in spatial autoregressive models: Optimal and almost optimal strategies

About predictions in spatial autoregressive models: Optimal and almost optimal strategies About predictions in spatial autoregressive models: ptimal and almost optimal strategies Michel Goulard INRA, UMR 1201 DYNAFR, Chemin de Borde Rouge BP52627, F31326 Castanet-Tolosan, FRANCE Thibault Laurent

More information

Using Mixtures-of-Distributions models to inform farm size selection decisions in representative farm modelling. Philip Kostov and Seamus McErlean

Using Mixtures-of-Distributions models to inform farm size selection decisions in representative farm modelling. Philip Kostov and Seamus McErlean Using Mixtures-of-Distributions models to inform farm size selection decisions in representative farm modelling. by Philip Kostov and Seamus McErlean Working Paper, Agricultural and Food Economics, Queen

More information

Validation of Software for Bayesian Models using Posterior Quantiles. Samantha R. Cook Andrew Gelman Donald B. Rubin DRAFT

Validation of Software for Bayesian Models using Posterior Quantiles. Samantha R. Cook Andrew Gelman Donald B. Rubin DRAFT Validation of Software for Bayesian Models using Posterior Quantiles Samantha R. Cook Andrew Gelman Donald B. Rubin DRAFT Abstract We present a simulation-based method designed to establish that software

More information

Continued Fractions and the Euclidean Algorithm

Continued Fractions and the Euclidean Algorithm Continued Fractions and the Euclidean Algorithm Lecture notes prepared for MATH 326, Spring 997 Department of Mathematics and Statistics University at Albany William F Hammond Table of Contents Introduction

More information

171:290 Model Selection Lecture II: The Akaike Information Criterion

171:290 Model Selection Lecture II: The Akaike Information Criterion 171:290 Model Selection Lecture II: The Akaike Information Criterion Department of Biostatistics Department of Statistics and Actuarial Science August 28, 2012 Introduction AIC, the Akaike Information

More information

Treatment of Incomplete Data in the Field of Operational Risk: The Effects on Parameter Estimates, EL and UL Figures

Treatment of Incomplete Data in the Field of Operational Risk: The Effects on Parameter Estimates, EL and UL Figures Chernobai.qxd 2/1/ 1: PM Page 1 Treatment of Incomplete Data in the Field of Operational Risk: The Effects on Parameter Estimates, EL and UL Figures Anna Chernobai; Christian Menn*; Svetlozar T. Rachev;

More information

Using the Delta Method to Construct Confidence Intervals for Predicted Probabilities, Rates, and Discrete Changes

Using the Delta Method to Construct Confidence Intervals for Predicted Probabilities, Rates, and Discrete Changes Using the Delta Method to Construct Confidence Intervals for Predicted Probabilities, Rates, Discrete Changes JunXuJ.ScottLong Indiana University August 22, 2005 The paper provides technical details on

More information

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could

More information

Exponential Random Graph Models for Social Network Analysis. Danny Wyatt 590AI March 6, 2009

Exponential Random Graph Models for Social Network Analysis. Danny Wyatt 590AI March 6, 2009 Exponential Random Graph Models for Social Network Analysis Danny Wyatt 590AI March 6, 2009 Traditional Social Network Analysis Covered by Eytan Traditional SNA uses descriptive statistics Path lengths

More information

Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

More information

Markov Chain Monte Carlo Simulation Made Simple

Markov Chain Monte Carlo Simulation Made Simple Markov Chain Monte Carlo Simulation Made Simple Alastair Smith Department of Politics New York University April2,2003 1 Markov Chain Monte Carlo (MCMC) simualtion is a powerful technique to perform numerical

More information