ON EM-LIKE ALGORITHMS FOR MINIMUM DISTANCE ESTIMATION. P.P.B. Eggermont and V.N. LaRiccia University of Delaware


March 1998

Abstract

We study minimum distance estimation problems related to maximum likelihood estimation in positron emission tomography (PET), which admit algorithms similar to the standard EM algorithm for PET, with the same type of monotonicity properties as the EM algorithm; see Vardi, Shepp, and Kaufman [25]. We derive the algorithms via the majorizing function approach of De Pierro [11], as well as via the alternating projections approach of Csiszár and Tusnády [7], and prove the monotonicity properties of these algorithms. The distances studied include the Hellinger distance and cross-entropy. Pearson's $\varphi^2$ distance fits in, but does not seem to enjoy both monotonicity properties. For nonnegatively constrained least squares problems the two approaches lead to different algorithms, both of which enjoy the strong monotonicity properties.

Corresponding author: Paul Eggermont, Department of Mathematical Sciences, University of Delaware, Newark, Delaware. Telephone: (302); fax: (302); e-mail: eggermon@math.udel.edu

1 Introduction

In this paper we study various minimum distance estimation problems that are similar to maximum likelihood estimation for positron emission tomography and that admit minimization algorithms similar to the EM algorithm of Shepp and Vardi [23], with similar monotonicity properties. The distances under discussion are mainly the Hellinger distance and Pearson's $\varphi^2$ distance. The last one was recently studied by Mair, Rao and Anderson [19]. We also discuss smoothed (roughness penalized) minimum distance estimation problems, and briefly discuss minimum cross-entropy and minimum Burg-entropy estimation problems. The results are new for the minimum Hellinger distance estimation problem, as well as for the smoothed versions of the Hellinger and Pearson's $\varphi^2$ problems.

In minimum Hellinger distance estimation one solves the problem

(1.1)  minimize $H(b, Ax) \overset{\mathrm{def}}{=} \sum_{i=1}^{n} \big| \sqrt{[Ax]_i} - \sqrt{b_i} \big|^2$ subject to $x \ge 0$ componentwise,

where $A \in \mathbb{R}^{n \times m}$ is a nonnegative matrix, with coefficients $a_{ij}$ and with column sums equal to one:

(1.2)  $\sum_{i=1}^{n} a_{ij} = 1$, $j = 1, 2, \dots, m$,

and $b \in \mathbb{R}^n$ is a nonnegative data vector. The Hellinger distance is closely related to both the Kullback-Leibler distance

(1.3)  $\mathrm{KL}(u, w) = \sum_{i=1}^{n} u_i \log \dfrac{u_i}{w_i} + w_i - u_i$

and Pearson's $\varphi^2$ distance

(1.4)  $P(u, w) = \sum_{i=1}^{n} \dfrac{|u_i - w_i|^2}{w_i}$.

The problem

(1.5)  minimize $\mathrm{KL}(b, Ax)$ subject to $x \ge 0$

is the maximum likelihood estimation problem familiar from astronomical image processing, Richardson [21], Lucy [18], and emission tomography, Rockmore and Macovski [22], Shepp and Vardi [23]. There, the underlying model is that $b_1, b_2, \dots, b_n$

are independent Poisson random variables with means $[Ax_o]_i$, $i = 1, 2, \dots, n$. Here $x_o$ is an unknown probability vector one wishes to estimate. So $x_o$ satisfies

(1.6)  $\sum_{j=1}^{m} x_{o,j} = 1$.

Since the number of parameters to be estimated is typically quite large, this problem behaves like a nonparametric estimation problem. Pearson's $\varphi^2$ distance arises from the normal approximation to the Poisson distribution, see Mair, Rao and Anderson [19]. In this context, minimum Hellinger distance estimation is suggested by the role it plays in parametric estimation problems. For parametric problems, minimum Hellinger distance estimation enjoys optimality properties similar to maximum likelihood estimation, if the postulated model is in fact true. Moreover, its robustness with respect to modeling errors is well documented, see, e.g., Beran [1], Tamura and Boos [24], and references therein.

Here we concentrate on methods for solving (1.1), with special emphasis on EM-like algorithms with EM-like monotonicity properties. In the process we point out other, similar minimization problems with similar algorithms. Byrne [3] does more or less the same, but considers a quite different set of algorithms. See also Byrne [4].

The EM algorithm for solving (1.5) is, starting from any strictly positive vector $x^1$,

(1.7)  $x_j^{k+1} = x_j^k \, [A^T r^k]_j$, $j = 1, 2, \dots, m$,

with $r_i^k = b_i / [Ax^k]_i$. (We abbreviate this as $r^k = b / Ax^k$.) The model (1.5) and the algorithm (1.7) were introduced by Richardson [21] and Lucy [18] in astronomical image processing, and by Shepp and Vardi [23] in positron emission tomography. Vardi, Shepp, and Kaufman [25] derived the two wonderful monotonicity properties of the EM algorithm. The first monotonicity property is that

(1.8)  $\mathrm{KL}(b, Ax^k) - \mathrm{KL}(b, Ax^{k+1}) \ge \mathrm{KL}(x^{k+1}, x^k)$, $k \ge 1$,

which says that the algorithm (1.7) decreases the negative log-likelihood $\mathrm{KL}(b, Ax)$. This is about the least one would expect of an algorithm for minimizing KL. The second one is quite unexpected. If $x^*$ is any solution of (1.5) then

(1.9)  $\mathrm{KL}(x^*, x^k) - \mathrm{KL}(x^*, x^{k+1}) \ge \mathrm{KL}(b, Ax^k) - \mathrm{KL}(b, Ax^{k+1})$.

In combination with (1.8) this says that the $x^k$ get closer to every solution of (1.5). (The everyday image is that the $x^k$ land on the solution set like a helicopter on an airfield,

rather than like a plane.) The convergence of the algorithm (1.7) to a solution of (1.5) is an easy consequence, see, e.g., Vardi, Shepp, and Kaufman [25] or Byrne [2].

Vardi, Shepp, and Kaufman [25] modeled their proof of the monotonicity properties (1.8) and (1.9) on the alternating projection approach of Csiszár and Tusnády [7]. There are two aspects to this geometric view. The first one comprises the setting in which the alternating projection method may be formulated and in which it solves the original minimization problem if the algorithm in fact converges. The second aspect is the proof of the convergence of the algorithm, which requires extra conditions on the objective function. The Csiszár and Tusnády [7] approach applies in full to (1.5), with the resulting algorithm (1.7) and monotonicity properties (1.8) and (1.9). The approach applies only partially to minimizing Pearson's $\varphi^2$ distance and the Hellinger distance. Mair, Rao and Anderson [19] showed that the Csiszár and Tusnády [7] approach applies to minimizing $P(b, Ax)$, with the resulting algorithm

(1.10)  $x_j^{k+1} = x_j^k \, \{ [A^T r^k]_j \}^{1/2}$,

where $r_i^k = ( b_i / [Ax^k]_i )^2$. Unfortunately, this is where it ends. There is a first monotonicity property, of course, but a second monotonicity property analogous to (1.9) is not provided by the alternating projections approach. Likewise, the Csiszár and Tusnády [7] approach applies to the minimum Hellinger distance estimation problem (1.1), with the resulting algorithm

(1.11)  $x_j^{k+1} = x_j^k \, \{ [A^T r^k]_j \}^{2}$,

where now $r_i^k = ( b_i / [Ax^k]_i )^{1/2}$. The (dis)similarity with (1.10) is uncanny. Unfortunately, here too a second monotonicity property is not provided.

However, there is a second approach to deriving these algorithms. De Pierro [9], [11] used this approach both to derive algorithms for penalized versions of (1.5), and to show monotonicity properties. This was based on his interpretation of the analytic proofs of the monotonicity properties (1.8) and (1.9) by Mülthei and Schorr [20]. De Pierro [11] calls it the majorizing function approach, because it is based on the inequality

(1.12)  $\mathrm{KL}(b, Ax) \le \mathrm{KL}(b, Ay) + \Lambda_{KL}(x, y)$,

with

(1.13)  $\Lambda_{KL}(x, y) = \sum_{j=1}^{m} y_j \, [A^T \{ b / Ay \}]_j \, \log \dfrac{y_j}{x_j} + x_j - y_j$,

for nonnegative $x, y \in \mathbb{R}^m$. Note that $\Lambda_{KL}(y, y) = 0$. The EM algorithm now arises by minimizing $\Lambda_{KL}(x, x^k)$ over $x$. We show in this paper that this approach extends to the algorithms (1.10) and (1.11). We prove the following.

(1.14) Theorem. Let $x^1 \in \mathbb{R}^m$ be strictly positive, and let $x^*$ be any solution of (1.1). Then the sequence $\{ x^k \}_k$ generated by (1.11) satisfies

  $H(b, Ax^k) - H(b, Ax^{k+1}) \ge H(x^k, x^{k+1})$,
  $\mathrm{KL}(x^*, x^k) - \mathrm{KL}(x^*, x^{k+1}) \ge 2 \, \{ H(b, Ax^{k+1}) - H(b, Ax^*) \}$.

Again, the convergence of the algorithm (1.11) is an easy consequence. An unexplained feature of the second monotonicity property is that the Kullback-Leibler distance pops up again. For the algorithm (1.10) we are not so fortunate. There is a first monotonicity property, Mair, Rao and Anderson [19],

(1.15)  $P(b, Ax^k) - P(b, Ax^{k+1}) \ge P(x^{k+1}, x^k)$,

but a second monotonicity property analogous to (1.9) remains elusive in this set-up as well.

At this point we cannot resist mentioning our smoothed EM algorithm. Let $S \in \mathbb{R}^{m \times m}$ be a symmetric (nonnegative) smoothing matrix with all column sums equal to 1, and define the nonlinear smoother $N$ (based on geometric averages) by

(1.16)  $[N x]_j = \exp( [S \{ \log x \}]_j )$, $j = 1, 2, \dots, m$.

The smoothed version of the maximum likelihood estimation problem (1.5) is

(1.17)  minimize $\sum_{i=1}^{n} b_i \log \dfrac{b_i}{[A N x]_i} + [Ax]_i - b_i$ subject to $x \ge 0$ componentwise.

The problem (1.17) also admits an EM algorithm, viz.

(1.18)  $x^{k+1} = S \{ (N x^k) \cdot (A^T r^k) \}$,

with $r_i^k = b_i / [A N x^k]_i$ for all $i$, where $(N x^k) \cdot (A^T r^k)$ is the componentwise product of the two vectors $N x^k$ and $A^T r^k$. Moreover, the analogues of the monotonicity properties hold, see Eggermont [13], Eggermont and LaRiccia [14]. The rather surprising

thing is that there is an analogue of this for (1.1). With the smoothing matrix as before, define the nonlinear smoother $M$ by

(1.19)  $[M x]_j = \{ [S(\sqrt{x})]_j \}^2$,

and consider the problem

(1.20)  minimize $H(b, A, x) \overset{\mathrm{def}}{=} \sum_{i=1}^{n} b_i - 2 \sqrt{b_i \, [A M x]_i} + [Ax]_i$ subject to $x \ge 0$ componentwise.

The algorithm for (1.20) analogous to (1.18) is

(1.21)  $x^{k+1} = \big( S \{ \sqrt{M x^k} \cdot (A^T r^k) \} \big)^2$,

and its monotonicity properties are stated in the following theorem.

(1.22) Theorem. Let $x^1 \in \mathbb{R}^m$ be strictly positive, and let $x^*$ be any solution of (1.20). Then the sequence $\{ x^k \}_k$ generated by (1.21) satisfies

  $H(b, A, x^k) - H(b, A, x^{k+1}) \ge H(x^k, x^{k+1})$,
  $\mathrm{KL}(x^*, x^k) - \mathrm{KL}(x^*, x^{k+1}) \ge 2 \, \{ H(b, A, x^{k+1}) - H(b, A, x^*) \}$.

There is a similar algorithm with analogous monotonicity properties for the minimization problem (1.17) with the nonlinear smoother $N$ replaced by the nonlinear smoother $M$, see Eggermont and LaRiccia [15]. Finally, there is an analogous smoothed version with the analogous monotonicity properties for minimum Pearson's $\varphi^2$ estimation, see §5 (but no second monotonicity property). The proofs of all these monotonicity properties for these smoothed algorithms are substantially the same, but a unifying theory, say along the lines of Csiszár and Tusnády [7], has not been forthcoming.

Earlier on we mentioned the close connection between the Kullback-Leibler, Pearson's $\varphi^2$ and Hellinger distances. This is further illustrated by considering the following two algorithms for solving (1.5). With $\Lambda_{KL}$ the majorizing function, and starting from a strictly positive vector $x^1$, let $x^{k+1}$ be the solution to

(1.22)  minimize $\Lambda_{KL}(x, x^k) + P(x, x^k)$ subject to $x \ge 0$.

It turns out that the resulting algorithm is a multiplicatively relaxed version of the EM algorithm (1.7), viz.

(1.23)  $x_j^{k+1} = x_j^k \, \big( [A^T \{ b / Ax^k \}]_j \big)^{1/2}$.
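The multiplicative iterations (1.7), (1.11) and (1.23) are short enough to try numerically. The following is a minimal sketch, not the authors' code. It assumes NumPy, a small random matrix with unit column sums, and consistent data $b = A x_{\mathrm{true}}$, so that the hypothetical vector `x_true` solves both (1.1) and (1.5) with distance zero. All function and variable names are illustrative assumptions.

```python
import numpy as np

def kl(u, w):
    """Kullback-Leibler distance (1.3) for strictly positive vectors."""
    return float(np.sum(u * np.log(u / w) + w - u))

def hellinger(u, w):
    """Hellinger distance as in (1.1): squared differences of square roots."""
    return float(np.sum((np.sqrt(u) - np.sqrt(w)) ** 2))

def em_step(A, b, x):
    """EM iteration (1.7)."""
    return x * (A.T @ (b / (A @ x)))

def hellinger_step(A, b, x):
    """Hellinger iteration (1.11)."""
    return x * (A.T @ np.sqrt(b / (A @ x))) ** 2

def relaxed_em_step(A, b, x):
    """Multiplicatively relaxed EM iteration (1.23)."""
    return x * np.sqrt(A.T @ (b / (A @ x)))

# Assumed test problem: random nonnegative A with unit column sums, as in (1.2).
rng = np.random.default_rng(0)
n, m = 8, 5
A = rng.random((n, m)) + 0.1
A /= A.sum(axis=0)
x_true = rng.random(m) + 0.1
b = A @ x_true            # consistent data: x_true attains distance zero

x_em = np.ones(m)
x_h = np.ones(m)
em_gaps, h_gaps = [], []
for _ in range(200):
    nxt = em_step(A, b, x_em)
    # (1.8): first monotonicity property of the EM algorithm
    em_gaps.append(kl(b, A @ x_em) - kl(b, A @ nxt) - kl(nxt, x_em))
    x_em = nxt

    nxt = hellinger_step(A, b, x_h)
    # Theorem (1.14), both inequalities, using H(b, A x_true) = 0
    h_gaps.append(hellinger(b, A @ x_h) - hellinger(b, A @ nxt)
                  - hellinger(x_h, nxt))
    h_gaps.append(kl(x_true, x_h) - kl(x_true, nxt)
                  - 2.0 * hellinger(b, A @ nxt))
    x_h = nxt

# Both minima should be nonnegative up to rounding error.
print(min(em_gaps), min(h_gaps))
```

With consistent data the vector `x_true` is a fixed point of all three updates, which gives a quick sanity check on any implementation.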

Note the difference with algorithm (1.10)! The algorithm (1.23) has just about the same monotonicity properties (1.8) and (1.9), see Iusem [17]. The Hellinger analogue of (1.22) also works. That is, if $x^{k+1}$ is defined (recursively) as the solution to

(1.24)  minimize $\Lambda_{KL}(x, x^k) + H(x, x^k)$ subject to $x \ge 0$,

then

(1.25)  $x_j^{k+1} = x_j^k \, \big\{ \tfrac14 + \tfrac14 \sqrt{ 1 + 8 \, [A^T \{ b / Ax^k \}]_j } \, \big\}^2$, $j = 1, 2, \dots, m$,

and this too is a multiplicatively and additively relaxed version of (1.7) and satisfies analogues of the two monotonicity properties. We omit the details. We emphasize again that these last two algorithms are merely stated to show the close interplay between the three distances under discussion.

In the next section we discuss the alternating projection method and point out some applications. In §3 we discuss its application to minimum Hellinger distance estimation, and derive the algorithm. In §4 and §5 we discuss the majorizing function approach to minimum Hellinger and minimum Pearson's $\varphi^2$ estimation problems, as well as to minimizing Burg-entropy. In §6 we briefly discuss the majorizing function approach to nonnegatively constrained least squares estimation: in this case this leads to an algorithm different from the Csiszár and Tusnády [7] approach.

2 Alternating projections onto closed convex subsets of $\mathbb{R}^d$

In this section we discuss the alternating projection method of Csiszár and Tusnády [7], and give a slightly more general proof of the convergence. However, the exposition follows quite closely that of Csiszár and Tusnády [7]. Since projections onto closed convex sets may be thought of as being obtained as solutions of minimum distance problems, we begin by introducing suitable generalizations of (the square of) Euclidean distance.

Let $b : \mathrm{domain}\, b \subset \mathbb{R}^d \to \mathbb{R} \cup \{ \infty \}$ be a proper convex, lower semicontinuous function. For simplicity we assume also that $b$ is differentiable on its domain. If $b$ is not differentiable, then the notion of subgradients may be used, but this would cause technical complications. On $\mathrm{domain}\, B = \mathrm{domain}\, b \times \mathrm{domain}\, b$ define

(2.1)  $B(x, y) = b(x) - b(y) - \langle \nabla b(y), \, x - y \rangle$,

where $\nabla b$ denotes the gradient of $b$. Note that $B(x, y) \ge 0$ for all $x, y$, by the convexity of $b$. To strengthen the interpretation of $B(x, y)$ as distance squared we make the following assumptions.

(B1) $B(x, y)$ is convex in $x, y$ jointly, and strictly convex in $x$ and in $y$ separately.
(B2) $B(x, y)$ is lower semicontinuous in $x, y$ jointly.
(B3) $B(x, y)$ has bounded level sets for fixed $x$, and for fixed $y$.
(B4) If $B(x_n, y_n) \to 0$, and $\{ x_n \}_n$ or $\{ y_n \}_n$ is bounded, then $\| x_n - y_n \| \to 0$.
(B5) If $x_o \in P$, and $\| x_o - y_n \| \to 0$, then $B(x_o, y_n) \to 0$.

These conditions are somewhat technical, but they are precisely what is needed later on. An important feature is that we do not require symmetry of $B(x, y)$ in $x$ and $y$.

(2.2) Remark. It is easily checked that $B$ satisfies the above conditions when $b$ is one of the following three examples:
(a) $b(x) = \sum_{j=1}^{m} x_j \log x_j$, $x \ge 0$;
(b) $b(x) = \sum_{j=1}^{m} x_j^2$, $x \in \mathbb{R}^m$;
(c) $b(x) = \sum_{j=1}^{m} x_j^p$, $x \ge 0$, where $1 < p < 2$.
It is not so clear whether there are other (interesting) examples.

(2.3) Remark. It is likewise easily checked that the functions $B$ given below are not of the form (2.1), but do satisfy (B1) through (B5).
(a) $B(x, y) = \sum_{j=1}^{m} | \sqrt{x_j} - \sqrt{y_j} |^2$, $x, y \ge 0$.
(b) $B(x, y) = \sum_{j=1}^{m} | x_j - y_j |^2 / y_j$, $y > 0$, $x \ge 0$.

Since there is no symmetry, the function $B(x, y)$ gives rise to two kinds of projections.

(2.4) Definition. Let $C \subset \mathbb{R}^d$ be a nonempty closed convex set.
(a) Let $q \in \mathbb{R}^d$. We define the $B_1$-projection of $q$ onto $C$ as the unique element $p \in C$ such that

(2.5)  $B(p, q) = \min \{ B(x, q) : x \in C \}$.

We denote $p$ as $p = \Pi q$ when the set $C$ is clear from the context.
(b) Let $p \in \mathbb{R}^d$. The $B_2$-projection of $p$ onto $C$ is defined as the unique $q \in C$ such that

(2.6)  $B(p, q) = \min \{ B(p, y) : y \in C \}$.

We denote $q$ as $q = \Pi\Pi p$.

For this definition to work, it needs to be shown that $\Pi$ and $\Pi\Pi$ are in fact well defined operators. This is indeed so, but we omit the details.
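The distance (2.1) and the first two examples of Remark (2.2) can be checked numerically. The sketch below is an illustration under stated assumptions: it uses NumPy, random strictly positive test vectors, and a hypothetical helper `bregman` that evaluates (2.1) from a function $b$ and its gradient. It compares the result with the Kullback-Leibler distance (1.3) for example (a) and with the squared Euclidean distance for example (b).

```python
import numpy as np

def bregman(b, grad_b, x, y):
    """B(x, y) = b(x) - b(y) - <grad b(y), x - y>, as in (2.1)."""
    return float(b(x) - b(y) - np.dot(grad_b(y), x - y))

# Example (a) of Remark (2.2): b(x) = sum_j x_j log x_j
def entropy(x):
    return np.sum(x * np.log(x))

def entropy_grad(x):
    return np.log(x) + 1.0

# Example (b) of Remark (2.2): b(x) = sum_j x_j^2
def square(x):
    return np.sum(x ** 2)

def square_grad(x):
    return 2.0 * x

rng = np.random.default_rng(1)
x = rng.random(6) + 0.1      # assumed strictly positive test vectors
y = rng.random(6) + 0.1

kl_xy = float(np.sum(x * np.log(x / y) + y - x))   # KL(x, y) from (1.3)
eucl_xy = float(np.sum((x - y) ** 2))              # squared Euclidean distance

b_entropy = bregman(entropy, entropy_grad, x, y)
b_square = bregman(square, square_grad, x, y)
b_entropy_swapped = bregman(entropy, entropy_grad, y, x)

print(b_entropy, kl_xy)       # these two agree up to rounding
print(b_square, eucl_xy)      # and so do these
print(b_entropy_swapped)      # B(y, x): in general not equal to B(x, y)
```

Printing `b_entropy_swapped` next to `b_entropy` shows the lack of symmetry that motivates the two kinds of projections in Definition (2.4).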

It is useful to introduce the set of all elements in $P$ that have finite distance to $Q$, and vice versa. Let

(2.7)  $B(P, q) = \inf \{ B(p, q) : p \in P \}$.

The expression $B(p, Q)$ is defined similarly.

We may now define the alternating projection method associated with the distance (squared) $B$. Consider two nonempty closed convex sets $P, Q \subset \mathbb{R}^d$. For reasons that will transpire later we wish to find points $p^* \in P$, $q^* \in Q$ such that

(2.8)  $B(p^*, q^*) = \min \{ B(p, q) : p \in P, \ q \in Q \}$.

The alternating projection method for solving this problem would go as follows. Let $q_1 \in Q$ be arbitrary, but such that there exists an $x \in P$ with $B(x, q_1) < \infty$. Let $p_1 \in P$ be the $B_1$-projection of $q_1$ onto $P$. Then let $q_2 \in Q$ be the $B_2$-projection of $p_1$ onto $Q$, and repeat ad infinitum. This gives rise to two sequences $\{ p_n \}_n \subset P$, $\{ q_n \}_n \subset Q$ recursively defined by

(2.9)  $p_n = \Pi q_n$, $q_{n+1} = \Pi\Pi p_n$, $n = 1, 2, \dots$.

It has to be shown that this algorithm does not break down, but again we omit the details.

We proceed with proving the convergence of the alternating projection method, and begin by deriving the so-called three-points and four-points properties.

(2.10) Lemma (Three-points property). Let $q_1 \in Q$, with $B(P, q_1) < \infty$, and let $p_1 = \Pi q_1$. Then for all $p \in P$

  $B(p, q_1) - B(p_1, q_1) \ge B(p, p_1)$.

Proof. The left hand side equals

  $b(p) - b(p_1) - \langle \nabla b(q_1), \, p - p_1 \rangle = b(p) - b(p_1) - \langle \nabla b(p_1), \, p - p_1 \rangle + \langle \nabla b(p_1) - \nabla b(q_1), \, p - p_1 \rangle$.

Since $p_1$ realizes $\min \{ B(p, q_1) : p \in P \}$, which is a convex minimization problem, the Kuhn-Tucker conditions tell us that

  $\langle \nabla_1 B(p_1, q_1), \, p - p_1 \rangle \ge 0$ for all $p \in P$,

where $\nabla_1 B$ denotes the gradient of $B(p, q)$ with respect to $p$ (the first variable). But $\nabla_1 B(p, q) = \nabla b(p) - \nabla b(q)$, so the result follows. Q.e.d.

The above Three-points property regarding the $B_1$-projection seems reasonable enough; cf. the case of the Euclidean norm squared. The Four-points property regarding the $B_2$-projection is much more mysterious.
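The Three-points property can be observed in a case where the $B_1$-projection has a closed form. The sketch below is an illustration under stated assumptions, not taken from the paper: it takes $B = \mathrm{KL}$ (example (a) of Remark (2.2)) and the hypothetical set $P = \{ p \ge 0 : \sum_j p_j = c \}$, for which a Lagrange multiplier argument gives the projection $p_1 = c \, q_1 / \sum_j q_{1,j}$. For this affine constraint the inequality of Lemma (2.10) is expected to hold with equality, up to rounding.

```python
import numpy as np

def kl(u, w):
    """Kullback-Leibler distance (1.3) for strictly positive vectors."""
    return float(np.sum(u * np.log(u / w) + w - u))

def project_onto_sum_constraint(q, c):
    """B_1-projection of q onto P = {p >= 0 : sum(p) = c} for B = KL.

    Assumption of this sketch: stationarity of KL(p, q) + lambda * sum(p)
    gives p proportional to q, so the projection is a rescaling of q.
    """
    return c * q / q.sum()

rng = np.random.default_rng(2)
c = 3.0
q1 = rng.random(7) + 0.1
p1 = project_onto_sum_constraint(q1, c)

# Three-points property (2.10): B(p, q1) - B(p1, q1) >= B(p, p1) for p in P.
gaps = []
for _ in range(100):
    p = rng.random(7) + 0.1
    p *= c / p.sum()                      # a random element of P
    gaps.append(kl(p, q1) - kl(p1, q1) - kl(p, p1))
    # p1 should do at least as well as p as a minimizer of B(., q1)
    assert kl(p1, q1) <= kl(p, q1) + 1e-12

print(min(gaps), max(gaps))   # both are zero up to rounding error
```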

(2.11) Lemma (Four-points property). Let $p_1 \in P$, with $B(p_1, Q) < \infty$, and let $q_2 = \Pi\Pi p_1$. Then for all $x \in P$, $y \in Q$

  $B(x, q_2) \le B(x, p_1) + B(x, y)$.

Proof. Using the identity

  $B(x, p_1) = B(x, q_2) - B(p_1, q_2) - \langle \nabla_1 B(p_1, q_2), \, x - p_1 \rangle$

we have that

(2.12)  $B(x, p_1) + B(x, y) - B(x, q_2) = B(x, y) - B(p_1, q_2) - \langle \nabla_1 B(p_1, q_2), \, x - p_1 \rangle$.

Now, $B(x, y)$ is convex in $x, y$ jointly, so

  $B(x, y) \ge B(p_1, q_2) + \langle \nabla_1 B(p_1, q_2), \, x - p_1 \rangle + \langle \nabla_2 B(p_1, q_2), \, y - q_2 \rangle$,

with $\nabla_2 B$ denoting the derivative (gradient) of $B$ with respect to the second variable. Thus the expression on the right of (2.12) dominates

  $\langle \nabla_2 B(p_1, q_2), \, y - q_2 \rangle$,

which is nonnegative for all $y \in Q$, by the Kuhn-Tucker conditions for the optimality of $q_2$. Q.e.d.

The full content of these lemmas is not so obvious. The following two monotonicity properties are quite remarkable consequences. With an eye towards the application to maximum likelihood estimation we define the functional $\Lambda$ as

(2.13)  $\Lambda(q) = B(\Pi q, q)$, for all $q \in Q$ with $B(P, q) < \infty$.

(2.14) First Monotonicity Property. Let $q_1 \in Q$ with $\Lambda(q_1) < \infty$. Then $\Lambda(q_2) < \infty$, and

  $\Lambda(q_1) - \Lambda(q_2) \ge B(p_1, p_2) \ge 0$.

Proof. Observe that

  $\Lambda(q_1) - \Lambda(q_2) = \{ B(p_1, q_1) - B(p_1, q_2) \} + \{ B(p_1, q_2) - B(p_2, q_2) \}$.

The expression between the first pair of curly brackets is nonnegative since $q_2 = \Pi\Pi p_1$. The Three-points lemma provides the lower bound $B(p_1, p_2)$ for the second expression. Q.e.d.

To formulate the second monotonicity property, let $P^* \subset P$ be the set of all $p_o \in P$ such that

(2.15)  $B(p_o, Q) = B(P, Q) = \inf \{ B(x, y) : x \in P, \ y \in Q \}$.

So $P^*$ is the set of solutions $p^*$ of the minimum distance problem (2.8).
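The iteration (2.9) and the First Monotonicity Property (2.14) can be watched in the simplest setting, the squared Euclidean distance of Remark (2.2)(b), where both projections are ordinary orthogonal projections. The sketch below is an illustration only. The two sets (a closed unit disc and a half-plane at distance 1 from it), the starting point and all names are assumptions chosen so that the projections have closed forms.

```python
import numpy as np

def proj_disc(q):
    """Projection onto P = closed unit disc centred at the origin."""
    r = np.linalg.norm(q)
    return q if r <= 1.0 else q / r

def proj_halfplane(p):
    """Projection onto Q = {y : y[0] >= 2}."""
    return np.array([max(p[0], 2.0), p[1]])

def B(x, y):
    """Squared Euclidean distance, i.e. (2.1) with b(x) = sum_j x_j^2."""
    return float(np.sum((x - y) ** 2))

q = np.array([3.0, 4.0])          # q_1 in Q
p = proj_disc(q)                  # p_1 = Pi q_1
lam = [B(p, q)]                   # Lambda(q_1) as in (2.13)
mono_gaps = []
for _ in range(60):
    q_next = proj_halfplane(p)    # q_{n+1}: projection of p_n onto Q
    p_next = proj_disc(q_next)    # p_{n+1}: projection of q_{n+1} onto P
    lam.append(B(p_next, q_next))
    # (2.14): Lambda(q_n) - Lambda(q_{n+1}) >= B(p_n, p_{n+1})
    mono_gaps.append(lam[-2] - lam[-1] - B(p, p_next))
    p, q = p_next, q_next

print(p, q, lam[-1])   # approaches p = (1, 0), q = (2, 0), squared distance 1
```

The sets are disjoint, so the limit is a pair of closest points rather than a common point, which is the situation (2.8) is designed for.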

(2.16) Second Monotonicity Property. Let $p^* \in P^*$, and set $q^* = \Pi\Pi p^*$. Select $p_1 \in P$ such that $B(p^*, p_1) < \infty$. Then $B(p^*, p_2) < \infty$ as well, and

  $B(p^*, p_1) - B(p^*, p_2) \ge \Lambda(q_2) - \Lambda(q^*)$.

Proof. The Four-points lemma, with $x = p^*$, $y = q^*$, says that

  $B(p^*, p_1) \ge B(p^*, q_2) - B(p^*, q^*)$,

and the Three-points lemma, with the indices incremented by 1, gives

  $- B(p^*, p_2) \ge B(p_2, q_2) - B(p^*, q_2)$.

Adding these two inequalities gives

  $B(p^*, p_1) - B(p^*, p_2) \ge B(p_2, q_2) - B(p^*, q^*)$,

which is the required inequality. Q.e.d.

The proof that the alternating projection method converges is now quite simple, modulo a rather annoying assumption. In the fully general setting there appears to be no way around it. In specific instances it is always easily verified.

(2.17) Theorem. Let $p_1 \in P$ be such that $B(p^*, p_1) < \infty$ for all $p^* \in P^*$. Then $\{ p_n \}_n$ converges to some $p_o \in P^*$, and $\{ q_n \}_n$ converges to some $q_o \in Q$, and

  $B(p_o, q_o) = \min \{ B(p, q) : p \in P, \ q \in Q \}$.

Proof. By the First Monotonicity Property, $\{ \Lambda(q_n) \}_n$ is decreasing. Let $p^* \in P^*$, and let $q^* = \Pi\Pi p^*$. By the Second Monotonicity Property $\{ B(p^*, p_n) \}_n$ is decreasing, and since it is a nonnegative sequence, it has a limit. Again the Second Monotonicity Property then implies that $\Lambda(q_n) \to \Lambda(q^*)$. Also, from the boundedness of $\{ B(p^*, p_n) \}_n$, condition (B3) implies that $\{ p_n \}_n$ is bounded, so it has a convergent subsequence, denoted by $\{ p_n \}_{n \in M}$, where $M \subset \mathbb{N}$. Let $p_o$ be the limit of this subsequence. Now $\{ q_n \}_{n \in M}$ is bounded, so it too has convergent subsequences. Without loss of generality, we may assume that $\{ q_{n+1} \}_{n \in M}$ is convergent, say with limit $q_o$. By the lower semicontinuity (B2) of $B$, then

  $B(p_o, q_o) \le \liminf_{n \in M} B(p_n, q_n) = \liminf_{n \in M} \Lambda(q_n) = \Lambda(q^*)$,

where $\liminf_{n \in M}$ denotes the liminf as $n \to \infty$, $n \in M$. It follows that $p_o \in P^*$ (and that $q_o = \Pi\Pi p_o$, but never mind). To prove the convergence of the whole sequences, apply the above with $p^*$ replaced by $p_o$. (Here the strange condition that $B(p^*, p_1) < \infty$ for all $p^* \in P^*$ comes into play.) Then $\{ B(p_o, p_n) \}_n$ is decreasing, and by (B5) a subsequence converges to 0. It follows that the whole sequence converges to 0, so $p_n \to p_o$, $n \to \infty$ ($n \in \mathbb{N}$). Now, since $\{ q_n \}_n$ is bounded, every subsequence has itself a convergent subsequence. Call the limit $q^{(o)}$. By the lower semicontinuity of $B(p, q)$, we get just as above that $B(p_o, q^{(o)}) \le \Lambda(q^*)$. It follows that $q^{(o)} = \Pi\Pi p_o$, and then that the whole sequence $\{ q_n \}_n$ converges to $q^{(o)}$. The last statement follows from $p^* \in P^*$, and $p^* = \Pi q^*$, so that $\Lambda(q^*)$ is equal to the distance between $P$ and $Q$. Q.e.d.

(2.18) Remark. It is interesting to note that the alternating projection method and the associated Three- and Four-points properties, as well as the two monotonicity properties, work also for the problem

  minimize $\mathcal{B}(p, q) \overset{\mathrm{def}}{=} B(p, q) + F(q)$ subject to $p \in P$, $q \in Q$.

Here $F$ is a differentiable convex function on $Q$. Denoting the $\mathcal{B}_1$-projection of $q$ onto $P$ by $p = \Pi q$, and the $\mathcal{B}_2$-projection of $p$ onto $Q$ by $q = \Pi\Pi p$, the Three- and Four-points properties read, resp.,

  $\mathcal{B}(p, q) - \mathcal{B}(\Pi q, q) \ge B(p, \Pi q)$,
  $\mathcal{B}(x, \Pi\Pi p) \le \mathcal{B}(x, y) + B(x, p)$.

Note the distinction between $\mathcal{B}$ and $B$. This is especially interesting in the case where $P = Q$, since then one is minimizing $F(p)$ over $p$. For $B(p, q) = \mathrm{KL}(p, q)$ this leads to the implicit algorithm discussed in Eggermont [12], viz.

(2.19)  $x_j^{k+1} = \dfrac{x_j^k}{1 + [\nabla F(x^{k+1})]_j}$, $j = 1, 2, \dots, m$.

(2.20) Remark. We note that the standard application of the theory is to minimizing $\mathrm{KL}(b, Ax)$, with nonnegative $A \in \mathbb{R}^{n \times m}$ with column sums equal to 1, and nonnegative

$b \in \mathbb{R}^n$. It is interesting to note that it also applies to minimizing $\mathrm{KL}(Ax, b)$. The resulting algorithm is

(2.21)  $x_j^{k+1} = x_j^k \, \exp\big( [A^T \{ \log( b / Ax^k ) \}]_j \big)$, $j = 1, 2, \dots, m$,

and the algorithm converges as per the general theory. It is interesting to note that if $Ax = b$ has a nonnegative solution then the algorithm (2.21), with $x^1 = u$, a strictly positive vector, converges to the solution of

(2.22)  minimize $\sum_{j=1}^{m} x_j \log \dfrac{x_j}{u_j} + u_j - x_j$ subject to $x \ge 0$, $Ax = b$.

See Elfving [16]. What happens when $Ax = b$ does not have an exact nonnegative solution is not so easy, apparently.

3 Least Hellinger distance estimation

We now apply the alternating projections method to the minimum Hellinger distance estimation problem (1.1). Note that the Hellinger distance $H(p, q)$ satisfies the properties (B1) through (B5), but is not of the form (2.1) (e.g., the gradient is not of the required form). So the general theory of §2 does not tell us whether this alternating projection method converges or not.

The alternating projections set-up is similar to the one employed for minimizing $\mathrm{KL}(b, Ax)$ by Csiszár and Tusnády [7], and for minimum Pearson's $\varphi^2$ distance by Mair, Rao and Anderson [19]. Thus, let $P$ and $Q$ be defined as

(3.1)  $P = \{ (p_{ij}) \in \mathbb{R}^{n \times m} : p \ge 0, \ \sum_{j} p_{ij} = b_i, \ i = 1, 2, \dots, n \}$,
       $Q = \{ (a_{ij} x_j) \in \mathbb{R}^{n \times m} : x \ge 0 \}$,

and consider the problem

(3.2)  minimize $H(p, q) = \sum_{i,j} \big| \sqrt{p_{ij}} - \sqrt{q_{ij}} \big|^2$ subject to $p \in P$, $q \in Q$.

It is of course not clear why solutions to (3.2) should provide solutions to (1.1), but it will transpire that they do. To determine the projection steps of the alternating projection method, let $q^1 = (a_{ij} x_j^1) \in Q$ be given. The $H_1$-projection of $q^1$ onto $P$ is obtained by minimizing

$H(p, q^1)$ over $p \in P$. Ignoring the nonnegativity constraints on $p$, the Lagrange Multiplier Theorem yields that $p$ should solve

  $1 - \sqrt{ q^1_{ij} / p_{ij} } + \lambda_i = 0$,

for suitable $\lambda_i$, and hence $p_{ij} = a_{ij} x_j^1 / (1 + \lambda_i)^2$ for all $i, j$. This shows that we are justified in ignoring the nonnegativity constraint on $p$. Summing over $j$ results in $b_i = [Ax^1]_i / (1 + \lambda_i)^2$, and so, for all $i, j$,

(3.3)  $p^1_{ij} = \dfrac{a_{ij} \, b_i \, x_j^1}{[Ax^1]_i}$.

The $H_2$-projection of $p^1$ onto $Q$ is determined by minimizing

  $H(p^1, q) = \sum_{j} \Big( \sum_i a_{ij} \Big) x_j - 2 \sqrt{x_j} \, \Big( \sum_i \sqrt{ a_{ij} \, p^1_{ij} } \Big) + \cdots$,

where $\cdots$ denotes terms independent of $x$. Ignoring the nonnegativity constraint on $x$, and setting the gradient to 0, yields $\sqrt{x_j} = \sum_i \sqrt{ a_{ij} \, p^1_{ij} }$, or

  $\sqrt{x_j^{2}} = \sqrt{x_j^{1}} \, \sum_i a_{ij} \sqrt{ b_i / [Ax^1]_i }$,

where the superscript on $x$ is the iteration index, and $q^2_{ij} = (a_{ij} x_j^2)$. So we were justified in ignoring the nonnegativity constraints, and the algorithm is

(3.4)  $x_j^{2} = x_j^{1} \, \big( [A^T \{ b / Ax^1 \}^{1/2}]_j \big)^2$, $j = 1, 2, \dots, m$,

as advertised in the introduction. The geometric intuition tells us that this algorithm converges. In the next section we give an alternative derivation, and prove that it converges.

It is interesting to note that in all three minimum distance problems (Kullback-Leibler, Pearson's $\varphi^2$ and Hellinger) the first projection step (3.3) is the same. This begs for an explanation. Indeed, all three functions $\mathrm{KL}(x, y)$, $P(x, y)$ and $H(x, y)$ may be written in the form

(3.5)  $\Psi(x, y) = \sum_{j} y_j \, \psi( x_j / y_j )$,

where $\psi$ is an increasing, differentiable, convex function defined for nonnegative numbers. The functions $\Psi$ are referred to as entropy functions, see, e.g., Chen and Teboulle [6], and references therein. It can now be shown that for given $q$, $q_{ij} = a_{ij} x_j$, the solution $p$ to the problem

(3.6)  minimize $\Psi(p, q)$ subject to $p \in P$,

with $P$ as in (3.1), is given by

(3.7)  $p_{ij} = \dfrac{a_{ij} \, b_i \, x_j}{[Ax]_i}$.

It should be noted that $\Psi$ satisfies the conditions (B1) through (B5), but again is not of the required form (2.1). It is not clear that a Csiszár-Tusnády theory could be worked out for this family of functions (3.5).

4 Majorizing functions for Hellinger distance

We now apply the majorizing function approach of De Pierro [11] to the minimum Hellinger distance estimation problem (1.1). Note that $H(b, Ax)$ is convex in $x$. We begin by deriving a majorizing function, or, as we like to call it, a Tendentious Inequality, because it will suggest the minimization algorithm. We have

(4.1)  $H(b, Ax) = \sum_{i=1}^{n} [Ax]_i - 2 \sqrt{ b_i \, [Ax]_i } + b_i$,

so only the second term needs consideration. Writing $Ax = A \{ y (x / y) \}$, we get by convexity of the function $t \mapsto - \sqrt{t}$ that

  $- \sqrt{[Ax]_i} = - \sqrt{[Ay]_i} \, \Big\{ \dfrac{ [A \{ y (x/y) \}]_i }{ [Ay]_i } \Big\}^{1/2} \le - \sqrt{[Ay]_i} \, \dfrac{ [A \{ y \, [x/y]^{1/2} \}]_i }{ [Ay]_i } = - \dfrac{ [A \{ \sqrt{xy} \}]_i }{ \sqrt{[Ay]_i} }$.

It follows that

  $H(b, Ax) \le \sum_{i=1}^{n} [Ax]_i - 2 \sqrt{b_i} \, \dfrac{ [A \sqrt{xy}]_i }{ \sqrt{[Ay]_i} } + b_i$,

or,

(4.2)  $H(b, Ax) \le \mathcal{H}(x, y) \overset{\mathrm{def}}{=} \sum_{j=1}^{m} x_j - 2 \sqrt{x_j y_j} \, [A^T \sqrt{b / Ay}]_j + \sum_{i=1}^{n} b_i$.

This is the Tendentious Inequality. The minimization algorithm it suggests for solving (1.1) is as follows. If $y = x^k$ is a guess for a solution of (1.1), obtain a new and improved(?) guess $x^{k+1}$ by minimizing $\mathcal{H}(x, x^k)$ as function of $x$. The result is that

(4.3)  $x_j^{k+1} = x_j^k \, \big( [A^T \sqrt{b / Ax^k}]_j \big)^2$, $j = 1, 2, \dots, m$.

We now investigate the monotonicity properties. In our search for the First Monotonicity Property we observe the following. For ease of notation we let $y = x^k$ and $x = x^{k+1}$. Then

  $H(b, Ax) \le \mathcal{H}(x, y) = - \sum_{j=1}^{m} y_j \, \{ [A^T \sqrt{b / Ay}]_j \}^2 + \sum_{i=1}^{n} b_i$
  $\qquad = H(b, Ay) - \sum_{j=1}^{m} y_j \, \{ -1 + [A^T \sqrt{b / Ay}]_j \}^2$
  $\qquad = H(b, Ay) - \sum_{j=1}^{m} \big| \sqrt{x_j} - \sqrt{y_j} \big|^2$.

The formulation in terms of $x^k$ and $x^{k+1}$ reads

(4.4)  $H(b, Ax^k) - H(b, Ax^{k+1}) \ge H(x^k, x^{k+1})$,

which is the First Monotonicity Property. Note the lack of any hint of Kullback-Leibler.

But Kullback-Leibler pops up in the Second Monotonicity Property. It turns out that the Second Monotonicity Property takes just about the standard form. Let $x^*$ be a solution of (1.1). By the Kuhn-Tucker conditions for (1.1), $x^*$ is then a fixed point of the iteration (4.3), so $[A^T \sqrt{b / Ax^*}]_j = 1$ whenever $x^*_j > 0$. Now, with KL the standard Kullback-Leibler divergence,

(4.5)  $\mathrm{KL}(x^*, x^k) - \mathrm{KL}(x^*, x^{k+1}) = \sum_{j=1}^{m} x^*_j \log \dfrac{x_j^{k+1}}{x_j^k} + x_j^k - x_j^{k+1}$
  $\qquad = \sum_{j=1}^{m} x_j^k - x_j^{k+1} + 2 \, x^*_j \log [A^T \sqrt{b / Ax^k}]_j$.

In the usual fashion we have

  $[A^T \sqrt{b / Ax^k}]_j = \Big[ A^T \Big\{ \sqrt{ \dfrac{b}{Ax^*} } \, \sqrt{ \dfrac{Ax^*}{Ax^k} } \Big\} \Big]_j$,

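and so, by the concavity of the logarithm,

(4.6)  $\mathrm{KL}(x^*, x^k) - \mathrm{KL}(x^*, x^{k+1}) \ge \sum_{j=1}^{m} x_j^k - x_j^{k+1} + 2 \, x^*_j \Big[ A^T \Big\{ \sqrt{ \dfrac{b}{Ax^*} } \, \log \sqrt{ \dfrac{Ax^*}{Ax^k} } \Big\} \Big]_j$
  $\qquad = \sum_{j=1}^{m} ( x_j^k - x_j^{k+1} ) + \sum_{i=1}^{n} 2 \sqrt{ b_i \, [Ax^*]_i } \, \log \sqrt{ \dfrac{[Ax^*]_i}{[Ax^k]_i} }$
  $\qquad \ge \sum_{j=1}^{m} ( x_j^k - x_j^{k+1} ) + \sum_{i=1}^{n} 2 \sqrt{ b_i \, [Ax^*]_i } - 2 \sqrt{ b_i \, [Ax^k]_i }$,

where in the last line we used the inequality $\log t \ge 1 - t^{-1}$. Consequently

(4.7)  $\mathrm{KL}(x^*, x^k) - \mathrm{KL}(x^*, x^{k+1}) \ge \sum_{i=1}^{n} \big\{ [Ax^k]_i - 2 \sqrt{ b_i \, [Ax^k]_i } + b_i \big\} - \sum_{i=1}^{n} \big\{ [Ax^*]_i - 2 \sqrt{ b_i \, [Ax^*]_i } + b_i \big\} + \mathrm{rest}$
  $\qquad = H(b, Ax^k) - H(b, Ax^*) + \mathrm{rest}$
  $\qquad \ge H(b, Ax^{k+1}) + H(x^{k+1}, x^k) - H(b, Ax^*) + \mathrm{rest}$,

where in the last line we used (4.4), and the rest is given by

(4.8)  $\mathrm{rest} = \sum_{i=1}^{n} [Ax^*]_i - \sum_{j=1}^{m} x_j^{k+1} = \sum_{j=1}^{m} x^*_j - x_j^{k+1}$.

Now,

(4.9)  $H(x^{k+1}, x^k) + \mathrm{rest} = \sum_{j=1}^{m} x_j^{k+1} + x_j^k - 2 \sqrt{ x_j^{k+1} x_j^k } + x^*_j - x_j^{k+1}$
  $\qquad = \sum_{j=1}^{m} x_j^k - 2 \, x_j^k \, [A^T \sqrt{b / Ax^k}]_j + x^*_j$
  $\qquad = \sum_{i=1}^{n} \big\{ [Ax^k]_i - 2 \sqrt{ b_i \, [Ax^k]_i } + b_i \big\} + \mathrm{rem} = H(b, Ax^k) + \mathrm{rem}$,

where

  $\mathrm{rem} = \sum_{j=1}^{m} x^*_j - \sum_{i=1}^{n} b_i$.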

Rather surprisingly, $\mathrm{rem} = - H(b, Ax^*)$, as we now show. Since $x^*$ is a fixed point of (4.3),

(4.10)  $\sum_{j=1}^{m} x^*_j = \sum_{j=1}^{m} x^*_j \, [A^T \sqrt{b / Ax^*}]_j = \sum_{i=1}^{n} \sqrt{ b_i \, [Ax^*]_i }$,

and so

  $\mathrm{rem} = \sum_{j=1}^{m} ( - x^*_j + 2 x^*_j ) - \sum_{i=1}^{n} b_i = \sum_{i=1}^{n} \big\{ - [Ax^*]_i + 2 \sqrt{ b_i \, [Ax^*]_i } - b_i \big\} = - H(b, Ax^*)$.

It follows that

(4.11)  $\mathrm{KL}(x^*, x^k) - \mathrm{KL}(x^*, x^{k+1}) \ge H(b, Ax^{k+1}) + H(b, Ax^k) - 2 H(b, Ax^*) \ge 0$,

which implies

(4.12)  $\mathrm{KL}(x^*, x^k) - \mathrm{KL}(x^*, x^{k+1}) \ge 2 \, \{ H(b, Ax^{k+1}) - H(b, Ax^*) \} \ge 0$.

Either (4.11) or (4.12) may be considered as the Second Monotonicity Property.

The majorizing function approach applies to the smoothed minimum Hellinger distance problem (1.20). At the end of this section we show one may view this as a regularized version of (1.1). Note that $H(b, A, x)$ is convex. The Tendentious Inequality is

(4.13)  $H(b, A, x) \le \sum_{i=1}^{n} b_i + \sum_{j=1}^{m} x_j - 2 \sqrt{x_j} \, \big[ S \{ \sqrt{My} \; A^T \sqrt{ b / AMy } \} \big]_j$,

which gives rise to the algorithm (1.21). The first monotonicity property of Theorem (1.22) is similar to the unsmoothed case. For the second monotonicity property we work backwards, in several steps analogous to the unsmoothed case. The first ingredient is the observation that for any solution $x^*$ of (1.20)

(4.14)  $\sum_{j=1}^{m} x^*_j - \sum_{i=1}^{n} b_i = - H(b, A, x^*)$.

The proof is just about the same as before: $x^*$ is a fixed point of (1.21), so, with $r^* = \sqrt{ b / AMx^* }$,

  $\sum_{j=1}^{m} x^*_j = \sum_{j=1}^{m} \sqrt{x^*_j} \, \big[ S( \sqrt{Mx^*} \, (A^T r^*) ) \big]_j = \sum_{j=1}^{m} [Mx^*]_j \, [A^T r^*]_j = \sum_{i=1}^{n} [AMx^*]_i \, r^*_i = \sum_{i=1}^{n} \sqrt{ b_i \, [AMx^*]_i }$,

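where we used duality twice (or interchanged the order of summation). Now (4.14) follows as in (4.10).

The second step is to show that

(4.15)  $H(b, A, x^k) - H(b, A, x^*) = H(x^k, x^{k+1}) + \sum_{j=1}^{m} x^*_j - x_j^{k+1}$.

This too follows similarly to the unsmoothed case: using (4.14) we have

  $H(b, A, x^k) - H(b, A, x^*) = H(b, A, x^k) + \sum_{j=1}^{m} x^*_j - \sum_{i=1}^{n} b_i = \sum_{j=1}^{m} ( x_j^k + x^*_j ) - \sum_{i=1}^{n} 2 \sqrt{ b_i \, [AMx^k]_i }$,

and now, with $r^k = \sqrt{ b / AMx^k }$,

  $\sum_{i=1}^{n} \sqrt{ b_i \, [AMx^k]_i } = \sum_{i=1}^{n} [AMx^k]_i \, r_i^k = \sum_{j=1}^{m} [Mx^k]_j \, [A^T r^k]_j$
  $\qquad = \sum_{j=1}^{m} [S \sqrt{x^k}]_j \, \sqrt{[Mx^k]_j} \, [A^T r^k]_j = \sum_{j=1}^{m} \sqrt{x_j^k} \, \big[ S( \sqrt{Mx^k} \, (A^T r^k) ) \big]_j = \sum_{j=1}^{m} \sqrt{ x_j^k \, x_j^{k+1} }$.

Going back we get

  $H(b, A, x^k) - H(b, A, x^*) = \sum_{j=1}^{m} x_j^k - 2 \sqrt{ x_j^k \, x_j^{k+1} } + x^*_j$,

and (4.15) follows.

Now backtracking as in (4.7) → (4.6) → (4.5) we get that

  $H(b, A, x^{k+1}) + H(x^{k+1}, x^k) - H(b, A, x^*) + \sum_{j=1}^{m} ( x^*_j - x_j^{k+1} )$
  $\qquad \le \sum_{j=1}^{m} ( x_j^k - x_j^{k+1} ) + \sum_{i=1}^{n} 2 \big( \sqrt{ b_i \, [AMx^*]_i } - \sqrt{ b_i \, [AMx^k]_i } \big)$
  $\qquad \le \sum_{j=1}^{m} ( x_j^k - x_j^{k+1} ) + \sum_{i=1}^{n} 2 \sqrt{ b_i \, [AMx^*]_i } \, \log \sqrt{ \dfrac{[AMx^*]_i}{[AMx^k]_i} }$
  $\qquad \le \sum_{j=1}^{m} ( x_j^k - x_j^{k+1} ) + \sum_{j=1}^{m} 2 \, [Mx^*]_j \, [A^T r^*]_j \, \log \dfrac{[A^T r^k]_j}{[A^T r^*]_j}$.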

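So we now have

(4.16)  $H(b, A, x^{k+1}) + H(x^{k+1}, x^k) - H(b, A, x^*) + \sum_{j=1}^{m} ( x^*_j - x_j^{k+1} ) \le \sum_{j=1}^{m} ( x_j^k - x_j^{k+1} ) + \mathrm{SUM}$,

with

(4.17)  $\mathrm{SUM} = \sum_{j=1}^{m} 2 \, [Mx^*]_j \, [A^T r^*]_j \, \log \dfrac{[A^T r^k]_j}{[A^T r^*]_j}$.

The last step is to get from here to $\mathrm{KL}(x^*, x^k) - \mathrm{KL}(x^*, x^{k+1})$. We rewrite SUM as $\mathrm{SUM} = \mathrm{SUM}_I + \mathrm{SUM}_{II}$, with

  $\mathrm{SUM}_I = \sum_{j=1}^{m} 2 \, [Mx^*]_j \, [A^T r^*]_j \, \log \dfrac{ \sqrt{[Mx^k]_j} \, [A^T r^k]_j }{ \sqrt{[Mx^*]_j} \, [A^T r^*]_j }$,
  $\mathrm{SUM}_{II} = \sum_{j=1}^{m} 2 \, [Mx^*]_j \, [A^T r^*]_j \, \log \dfrac{ \sqrt{[Mx^*]_j} }{ \sqrt{[Mx^k]_j} }$.

With arguments used before,

  $\mathrm{SUM}_I = \sum_{j=1}^{m} 2 \sqrt{x^*_j} \, \Big[ S \Big\{ \sqrt{Mx^*} \, (A^T r^*) \, \log \dfrac{ \sqrt{Mx^k} \; A^T r^k }{ \sqrt{Mx^*} \; A^T r^* } \Big\} \Big]_j$,

and, now, in view of the iteration (1.21), of which $x^*$ is a fixed point,

  $S^{-1} \sqrt{x^{k+1}} = \sqrt{Mx^k} \; A^T r^k$, $\quad S^{-1} \sqrt{x^*} = \sqrt{Mx^*} \; A^T r^*$,

assuming that $S$ is invertible. (The following goes through without this assumption, actually.) So we may write $\mathrm{SUM}_I$ as

(4.18)  $\mathrm{SUM}_I = \sum_{j=1}^{m} 2 \sqrt{x^*_j} \, \Big[ S \Big\{ ( S^{-1} \sqrt{x^*} ) \, \log \dfrac{ S^{-1} \sqrt{x^{k+1}} }{ S^{-1} \sqrt{x^*} } \Big\} \Big]_j$.

It should be noted that $S^{-1} \sqrt{x^{k+1}}$ and $S^{-1} \sqrt{x^*}$ are nonnegative vectors. Now, for any nonnegative function $U$, by the concavity of the logarithm

  $\dfrac{ S \big( ( S^{-1} \sqrt{x^*} ) \log U \big) }{ S ( S^{-1} \sqrt{x^*} ) } \le \log \dfrac{ S \big( ( S^{-1} \sqrt{x^*} ) \, U \big) }{ S ( S^{-1} \sqrt{x^*} ) }$,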