ON EM-LIKE ALGORITHMS FOR MINIMUM DISTANCE ESTIMATION

P.P.B. Eggermont and V.N. LaRiccia, University of Delaware


March 1998

Abstract

We study minimum distance estimation problems related to maximum likelihood estimation in positron emission tomography (PET), which admit algorithms similar to the standard EM algorithm for PET, with the same type of monotonicity properties as the EM algorithm; see Vardi, Shepp, and Kaufman [25]. We derive the algorithms via the majorizing function approach of De Pierro [11], as well as via the alternating projections approach of Csiszár and Tusnády [7], and prove the monotonicity properties of these algorithms. The distances studied include the Hellinger distance and cross-entropy. Pearson's $\varphi^2$ distance fits in, but does not seem to enjoy both monotonicity properties. For nonnegatively constrained least squares problems the two approaches lead to different algorithms, both of which enjoy the strong monotonicity properties.

Corresponding author: Paul Eggermont, Department of Mathematical Sciences, University of Delaware, Newark, Delaware. telephone : (302) fax : (302)
1. Introduction

In this paper we study various minimum distance estimation problems that are similar to maximum likelihood estimation for positron emission tomography and that admit minimization algorithms similar to the EM algorithm of Shepp and Vardi [23], with similar monotonicity properties. The distances under discussion are mainly the Hellinger distance and Pearson's $\varphi^2$ distance. The last one was recently studied by Mair, Rao and Anderson [19]. We also discuss smoothed (roughness penalized) minimum distance estimation problems, and briefly discuss minimum cross-entropy and minimum Burg-entropy estimation problems. The results are new for the minimum Hellinger distance estimation problem, as well as for the smoothed versions of the Hellinger and Pearson's $\varphi^2$ problems.

In minimum Hellinger distance estimation one solves the problem
\[
\text{minimize}\quad H(b, Ax) \overset{\text{def}}{=} \sum_{i=1}^{n} \Bigl| \sqrt{[Ax]_i} - \sqrt{b_i} \Bigr|^2 \quad \text{subject to}\quad x \ge 0 \ \text{componentwise}, \tag{1.1}
\]
where $A \in \mathbb{R}^{n \times m}$ is a nonnegative matrix, with coefficients $a_{ij}$ and with column sums equal to one:
\[
\sum_{i=1}^{n} a_{ij} = 1, \qquad j = 1, 2, \dots, m, \tag{1.2}
\]
and $b \in \mathbb{R}^n$ is a nonnegative data vector. The Hellinger distance is closely related to both the Kullback-Leibler distance
\[
KL(u, w) = \sum_{i=1}^{n} u_i \log \frac{u_i}{w_i} + w_i - u_i, \tag{1.3}
\]
and Pearson's $\varphi^2$ distance
\[
P(u, w) = \sum_{i=1}^{n} \frac{| u_i - w_i |^2}{w_i}. \tag{1.4}
\]
The problem
\[
\text{minimize}\quad KL(b, Ax) \quad \text{subject to}\quad x \ge 0 \tag{1.5}
\]
is the maximum likelihood estimation problem familiar from astronomical image processing, Richardson [21], Lucy [18], and emission tomography, Rockmore and Macovski [22], Shepp and Vardi [23]. There, the underlying model is that $b_1, b_2, \dots, b_n$
are independent Poisson random variables with means $[Ax_o]_i$, $i = 1, 2, \dots, n$. Here $x_o$ is an unknown probability vector one wishes to estimate. So $x_o$ satisfies
\[
\sum_{j=1}^{m} x_{o,j} = 1. \tag{1.6}
\]
Since the number of parameters to be estimated is typically quite large, this problem behaves like a nonparametric estimation problem. Pearson's $\varphi^2$ distance arises from the normal approximation to the Poisson distribution, see Mair, Rao and Anderson [19]. In this context, minimum Hellinger distance estimation is suggested by the role it plays in parametric estimation problems. For parametric problems, minimum Hellinger distance estimation enjoys optimality properties similar to maximum likelihood estimation, if the postulated model is in fact true. Moreover, its robustness with respect to modeling errors is well documented, see, e.g., Beran [1], Tamura and Boos [24], and references therein.

Here we concentrate on methods for solving (1.1), with special emphasis on EM-like algorithms with EM-like monotonicity properties. In the process we point out other, similar minimization problems with similar algorithms. Byrne [3] does more or less the same, but considers a quite different set of algorithms. See also Byrne [4].

The EM algorithm for solving (1.5) is, starting from any strictly positive vector $x^1$,
\[
x_j^{k+1} = x_j^{k}\, [A^T r^k]_j, \qquad j = 1, 2, \dots, m, \tag{1.7}
\]
with $r_i^k = b_i / [Ax^k]_i$. (We abbreviate this as $r^k = b / Ax^k$.) The model (1.5) and the algorithm (1.7) were introduced by Richardson [21] and Lucy [18] in astronomical image processing, and by Shepp and Vardi [23] in positron emission tomography. Vardi, Shepp, and Kaufman [25] derived the two wonderful monotonicity properties of the EM algorithm. The first monotonicity property is that
\[
KL(b, Ax^k) - KL(b, Ax^{k+1}) \ge KL(x^{k+1}, x^k), \qquad k \ge 1, \tag{1.8}
\]
which says that the algorithm (1.7) decreases the negative log-likelihood $KL(b, Ax)$. This is about the least one would expect of an algorithm for minimizing $KL$. The second one is quite unexpected. If $x^*$ is any solution of (1.5) then
\[
KL(x^*, x^k) - KL(x^*, x^{k+1}) \ge KL(b, Ax^k) - KL(b, Ax^{k+1}). \tag{1.9}
\]
In combination with (1.8) this says that the $x^k$ get closer to every solution of (1.5). (The everyday image is that the $x^k$ land on the solution set like a helicopter on an airfield,
rather than like a plane.) The convergence of the algorithm (1.7) to a solution of (1.5) is an easy consequence, see, e.g., Vardi, Shepp, and Kaufman [25] or Byrne [2].

Vardi, Shepp, and Kaufman [25] modeled their proof of the monotonicity properties (1.8) and (1.9) on the alternating projection approach of Csiszár and Tusnády [7]. There are two aspects to this geometric view. The first one comprises the setting in which the alternating projection method may be formulated and in which it solves the original minimization problem if the algorithm in fact converges. The second aspect is the proof of the convergence of the algorithm, which requires extra conditions on the objective function. The Csiszár and Tusnády [7] approach applies in full to (1.5) with the resulting algorithm (1.7) and monotonicity properties (1.8) and (1.9). The approach applies only partially to minimizing Pearson's $\varphi^2$ distance and the Hellinger distance. Mair, Rao and Anderson [19] showed that the Csiszár and Tusnády [7] approach applies to minimizing $P(b, Ax)$, with the resulting algorithm
\[
x_j^{k+1} = x_j^{k}\, \bigl\{ [A^T r^k]_j \bigr\}^{1/2}, \tag{1.10}
\]
where $r_i^k = ( b_i / [Ax^k]_i )^2$. Unfortunately, this is where it ends. There is a first monotonicity property, of course, but a second monotonicity property analogous to (1.9) is not provided by the alternating projections approach. Likewise, the Csiszár and Tusnády [7] approach applies to the minimum Hellinger distance estimation problem (1.1) with the resulting algorithm
\[
x_j^{k+1} = x_j^{k}\, \bigl\{ [A^T r^k]_j \bigr\}^{2}, \tag{1.11}
\]
where now $r_i^k = ( b_i / [Ax^k]_i )^{1/2}$. The (dis)similarity with (1.10) is uncanny. Unfortunately, here too a second monotonicity property is not provided.

However, there is a second approach to deriving these algorithms. De Pierro [9], [11] used this approach both to derive algorithms for penalized versions of (1.5), and to show monotonicity properties. This was based on his interpretation of the analytic proofs of the monotonicity properties (1.8) and (1.9) by Mülthei and Schorr [20]. De Pierro
[11] calls it the majorizing function approach, because it is based on the inequality
\[
KL(b, Ax) \le KL(b, Ay) + \Lambda_{KL}(x, y), \tag{1.12}
\]
with
\[
\Lambda_{KL}(x, y) = \sum_{j=1}^{m} y_j\, \bigl[ A^T \{ b / Ay \} \bigr]_j \log \frac{y_j}{x_j} + x_j - y_j, \tag{1.13}
\]
for nonnegative $x, y \in \mathbb{R}^m$. Note that $\Lambda_{KL}(y, y) = 0$. The EM algorithm now arises by minimizing $\Lambda_{KL}(x, x^k)$ over $x$. We show in this paper that this approach extends to the algorithms (1.10) and (1.11). We prove the following.

(1.14) Theorem. Let $x^1 \in \mathbb{R}^m$ be strictly positive, and let $x^*$ be any solution of (1.1). Then the sequence $\{ x^k \}_k$ generated by (1.11) satisfies
\[
H(b, Ax^k) - H(b, Ax^{k+1}) \ge H(x^k, x^{k+1}),
\]
\[
KL(x^*, x^k) - KL(x^*, x^{k+1}) \ge 2\, \bigl\{ H(b, Ax^{k+1}) - H(b, Ax^*) \bigr\}.
\]

Again, the convergence of the algorithm (1.11) is an easy consequence. An unexplained feature of the second monotonicity property is that the Kullback-Leibler distance pops up again.

For the algorithm (1.10) we are not so fortunate. There is a first monotonicity property, Mair, Rao and Anderson [19],
\[
P(b, Ax^k) - P(b, Ax^{k+1}) \ge P(x^{k+1}, x^k), \tag{1.15}
\]
but a second monotonicity property analogous to (1.9) remains elusive in this setup as well.

At this point we cannot resist mentioning our smoothed EM algorithm. Let $S \in \mathbb{R}^{m \times m}$ be a symmetric (nonnegative) smoothing matrix with all column sums equal to 1, and define the nonlinear smoother $N$ (based on geometric averages) by
\[
[ N x ]_j = \exp\bigl( [ S \{ \log x \} ]_j \bigr), \qquad j = 1, 2, \dots, m. \tag{1.16}
\]
The smoothed version of the maximum likelihood estimation problem (1.5) is
\[
\text{minimize}\quad \sum_{i=1}^{n} b_i \log \frac{b_i}{[ A N x ]_i} + [ Ax ]_i - b_i \quad \text{subject to}\quad x \ge 0 \ \text{componentwise}. \tag{1.17}
\]
The problem (1.17) also admits an EM algorithm, viz.
\[
x^{k+1} = S \bigl\{ (N x^k) \cdot (A^T r^k) \bigr\}, \tag{1.18}
\]
with $r_i^k = b_i / [ A N x^k ]_i$ for all $i$, where $(N x^k) \cdot (A^T r^k)$ is the componentwise product of the two vectors $N x^k$ and $A^T r^k$. Moreover, the analogues of the monotonicity properties hold, see Eggermont [13], Eggermont and LaRiccia [14]. The rather surprising
thing is that there is an analogue of this for (1.1). With the smoothing matrix as before, define the nonlinear smoother $M$ by
\[
[ M x ]_j = \bigl\{ [ S( \sqrt{x} ) ]_j \bigr\}^2, \tag{1.19}
\]
and consider the problem
\[
\text{minimize}\quad H(b, A, x) \overset{\text{def}}{=} \sum_{i=1}^{n} b_i - 2 \sqrt{ b_i\, [ A M x ]_i } + [ Ax ]_i \quad \text{subject to}\quad x \ge 0 \ \text{componentwise}. \tag{1.20}
\]
The algorithm for (1.20) analogous to (1.18) is
\[
x^{k+1} = \Bigl( S \bigl\{ \sqrt{M x^k} \cdot (A^T r^k) \bigr\} \Bigr)^2, \tag{1.21}
\]
where now $r^k = \sqrt{ b / A M x^k }$, and its monotonicity properties are stated in the following theorem.

(1.22) Theorem. Let $x^1 \in \mathbb{R}^m$ be strictly positive, and let $x^*$ be any solution of (1.20). Then the sequence $\{ x^k \}_k$ generated by (1.21) satisfies
\[
H(b, A, x^k) - H(b, A, x^{k+1}) \ge H(x^k, x^{k+1}),
\]
\[
KL(x^*, x^k) - KL(x^*, x^{k+1}) \ge 2\, \bigl\{ H(b, A, x^{k+1}) - H(b, A, x^*) \bigr\}.
\]

There is a similar algorithm with analogous monotonicity properties for the minimization problem (1.17) with the nonlinear smoother $N$ replaced by the nonlinear smoother $M$, see Eggermont and LaRiccia [15]. Finally, there is an analogous smoothed version with the analogous monotonicity properties for minimum Pearson's $\varphi^2$ estimation, see §5 (but no second monotonicity property). The proofs of all these monotonicity properties for these smoothed algorithms are substantially the same, but a unifying theory, say along the lines of Csiszár and Tusnády [7], has not been forthcoming.

Earlier on we mentioned the close connection between the Kullback-Leibler, Pearson's $\varphi^2$ and Hellinger distances. This is further illustrated by considering the following two algorithms for solving (1.5). With $\Lambda_{KL}$ the majorizing function, and starting from a strictly positive vector $x^1$, let $x^{k+1}$ be the solution to
\[
\text{minimize}\quad \Lambda_{KL}(x, x^k) + P(x, x^k) \quad \text{subject to}\quad x \ge 0. \tag{1.22}
\]
It turns out that the resulting algorithm is a multiplicatively relaxed version of the EM algorithm (1.7), viz.
\[
x_j^{k+1} = x_j^{k}\, \bigl( [ A^T \{ b / Ax^k \} ]_j \bigr)^{1/2}. \tag{1.23}
\]
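The iterations (1.7) and (1.11), together with the first monotonicity properties (1.8) and the first inequality of Theorem (1.14), can be checked numerically on a small example. The following is a minimal sketch in plain Python, not code from the paper: the matrix `A` (columns summing to one), the data `b`, and all function names are hypothetical choices made only for illustration.

```python
import math

def matvec(A, x):
    """Return A x for a matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def rmatvec(A, r):
    """Return A^T r."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]

def kl(u, w):
    """KL(u, w) = sum u log(u/w) + w - u, as in (1.3), with 0 log 0 = 0."""
    return sum((ui * math.log(ui / wi) if ui > 0 else 0.0) + wi - ui
               for ui, wi in zip(u, w))

def hellinger(u, w):
    """H(u, w) = sum |sqrt(u) - sqrt(w)|^2, as in (1.1)."""
    return sum((math.sqrt(ui) - math.sqrt(wi)) ** 2 for ui, wi in zip(u, w))

def em_step(A, b, x):
    """One step of (1.7): x_j <- x_j [A^T (b / Ax)]_j."""
    r = [bi / yi for bi, yi in zip(b, matvec(A, x))]
    return [xj * cj for xj, cj in zip(x, rmatvec(A, r))]

def hellinger_step(A, b, x):
    """One step of (1.11): x_j <- x_j ([A^T sqrt(b / Ax)]_j)^2."""
    r = [math.sqrt(bi / yi) for bi, yi in zip(b, matvec(A, x))]
    return [xj * cj ** 2 for xj, cj in zip(x, rmatvec(A, r))]

# Illustrative data (assumption): every column of A sums to one.
A = [[0.5, 0.2, 0.1],
     [0.3, 0.3, 0.2],
     [0.1, 0.4, 0.3],
     [0.1, 0.1, 0.4]]
b = [4.0, 1.0, 3.0, 2.0]

def smallest_gap(step, dist, n_iter=50):
    """Run the iteration and return the final iterate together with the
    smallest value of dist(b, Ax^k) - dist(b, Ax^{k+1}) - dist(x^{k+1}, x^k)."""
    x = [1.0, 1.0, 1.0]
    gaps = []
    for _ in range(n_iter):
        x_new = step(A, b, x)
        decrease = dist(b, matvec(A, x)) - dist(b, matvec(A, x_new))
        gaps.append(decrease - dist(x_new, x))
        x = x_new
    return x, min(gaps)

x_em, gap_em = smallest_gap(em_step, kl)
x_h, gap_h = smallest_gap(hellinger_step, hellinger)
print("EM (1.8) smallest gap:", gap_em)
print("Hellinger (1.14) smallest gap:", gap_h)
```

Up to rounding error both gaps should be nonnegative, which is what (1.8) and Theorem (1.14) assert. The sketch says nothing about the second monotonicity properties, which involve a solution $x^*$ that is not known in advance.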
Note the difference between algorithm (1.23) and algorithm (1.10)! The algorithm (1.23) has just about the same monotonicity properties (1.8) and (1.9), see Iusem [17]. The Hellinger analogue of (1.22) also works. That is, if $x^{k+1}$ is defined (recursively) as the solution to
\[
\text{minimize}\quad \Lambda_{KL}(x, x^k) + H(x, x^k) \quad \text{subject to}\quad x \ge 0, \tag{1.24}
\]
then
\[
x_j^{k+1} = x_j^{k}\, \{\ \ \}\, [ A^T \{ b / Ax^k \} ]_j, \qquad j = 1, 2, \dots, m, \tag{1.25}
\]
and this too is a multiplicatively and additively relaxed version of (1.7) and satisfies analogues of the two monotonicity properties. We omit the details. We emphasize again that these last two algorithms are merely stated to show the close interplay between the three distances under discussion.

In the next section we discuss the alternating projection method and point out some applications. In §3 we discuss its application to minimum Hellinger distance estimation, and derive the algorithm. In §4 and §5 we discuss the majorizing function approach to minimum Hellinger and minimum Pearson's $\varphi^2$ estimation problems, as well as to minimizing Burg-entropy. In §6 we briefly discuss the majorizing function approach to nonnegatively constrained least squares estimation: in this case this leads to an algorithm different from the Csiszár and Tusnády [7] approach.

2. Alternating projections onto closed convex subsets of $\mathbb{R}^d$

In this section we discuss the alternating projection method of Csiszár and Tusnády [7], and give a slightly more general proof of the convergence. However, the exposition follows quite closely that of Csiszár and Tusnády [7]. Since projections onto closed convex sets may be thought of as being obtained as solutions of minimum distance problems, we begin by introducing suitable generalizations of (the square of) Euclidean distance.

Let $b : \operatorname{domain} b \subset \mathbb{R}^d \to \mathbb{R} \cup \{ \infty \}$ be a proper convex, lower semicontinuous function. For simplicity we assume also that $b$ is differentiable on its domain. If $b$ is not differentiable, then the notion of subgradients may be used, but this would cause technical complications. On $\operatorname{domain} B = \operatorname{domain} b \times \operatorname{domain} b$ define
\[
B(x, y) = b(x) - b(y) - \langle \nabla b(y),\, x - y \rangle, \tag{2.1}
\]
where $\nabla b$ denotes the gradient of $b$. Note that $B(x, y) \ge 0$ for all $x, y$, by the convexity of $b$. To strengthen the interpretation of $B(x, y)$ as distance squared we make the following assumptions.

(B1) $B(x, y)$ is convex in $x, y$ jointly, and strictly convex in $x$ and in $y$ separately.
(B2) $B(x, y)$ is lower semicontinuous in $x, y$ jointly.
(B3) $B(x, y)$ has bounded level sets for fixed $x$, and for fixed $y$.
(B4) If $B(x_n, y_n) \to 0$, and $\{ x_n \}_n$ or $\{ y_n \}_n$ is bounded, then $x_n - y_n \to 0$.
(B5) If $x_o \in P$, and $x_o - y_n \to 0$, then $B(x_o, y_n) \to 0$.

These conditions are somewhat technical, but they are precisely what is needed later on. An important feature is that we do not require symmetry of $B(x, y)$ in $x$ and $y$.

(2.2) Remark. It is easily checked that $B$ satisfies the above conditions when $b$ is one of the following three examples:
(a) $b(x) = \sum_{j=1}^{m} x_j \log x_j$, $x \ge 0$;
(b) $b(x) = \sum_{j=1}^{m} x_j^2$, $x \in \mathbb{R}^m$;
(c) $b(x) = \sum_{j=1}^{m} x_j^p$, $x \ge 0$, where $1 < p < 2$.
It is not so clear whether there are other (interesting) examples.

(2.3) Remark. It is likewise easily checked that the functions $B$ given below are not of the form (2.1), but do satisfy (B1) through (B5).
(a) $B(x, y) = \sum_{j=1}^{m} | \sqrt{x_j} - \sqrt{y_j} |^2$, $x, y \ge 0$.
(b) $B(x, y) = \sum_{j=1}^{m} | x_j - y_j |^2 / y_j$, $y > 0$, $x \ge 0$.

Since there is no symmetry, the function $B(x, y)$ gives rise to two kinds of projections.

(2.4) Definition. Let $C \subset \mathbb{R}^d$ be a nonempty closed convex set.
(a) Let $q \in \mathbb{R}^d$. We define the $B_1$-projection of $q$ onto $C$ as the unique element $p \in C$ such that
\[
B(p, q) = \min \{ B(x, q) : x \in C \}. \tag{2.5}
\]
We denote $p$ as $p = \Pi q$ when the set $C$ is clear from the context.
(b) Let $p \in \mathbb{R}^d$. The $B_2$-projection of $p$ onto $C$ is defined as the unique $q \in C$ such that
\[
B(p, q) = \min \{ B(p, y) : y \in C \}. \tag{2.6}
\]
We denote $q$ as $q = \Pi\Pi p$.

For this definition to work, it needs to be shown that $\Pi$ and $\Pi\Pi$ are in fact well defined operators. This is indeed so, but we omit the details.
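To make definition (2.1) concrete, the sketch below evaluates $B(x, y)$ for examples (a) and (b) of Remark (2.2). For $b(x) = \sum_j x_j \log x_j$ it reproduces the Kullback-Leibler distance (1.3), and for $b(x) = \sum_j x_j^2$ it reproduces the squared Euclidean distance. The function names and the two test vectors are assumptions made for illustration only.

```python
import math

def bregman(b, grad_b, x, y):
    """B(x, y) = b(x) - b(y) - <grad b(y), x - y>, as in (2.1)."""
    inner = sum(g * (xj - yj) for g, xj, yj in zip(grad_b(y), x, y))
    return b(x) - b(y) - inner

def neg_entropy(x):
    """Example (2.2)(a): b(x) = sum x_j log x_j, for strictly positive x."""
    return sum(xj * math.log(xj) for xj in x)

def neg_entropy_grad(x):
    return [math.log(xj) + 1.0 for xj in x]

def sq_norm(x):
    """Example (2.2)(b): b(x) = sum x_j^2."""
    return sum(xj * xj for xj in x)

def sq_norm_grad(x):
    return [2.0 * xj for xj in x]

def kl(u, w):
    """KL(u, w) from (1.3), for strictly positive u and w."""
    return sum(ui * math.log(ui / wi) + wi - ui for ui, wi in zip(u, w))

# Two arbitrary strictly positive vectors (illustrative values).
x = [0.2, 0.5, 1.3]
y = [0.7, 0.4, 0.9]

b_entropy_xy = bregman(neg_entropy, neg_entropy_grad, x, y)
b_entropy_yx = bregman(neg_entropy, neg_entropy_grad, y, x)
b_square_xy = bregman(sq_norm, sq_norm_grad, x, y)

print(b_entropy_xy, kl(x, y))      # the two values agree
print(b_entropy_xy, b_entropy_yx)  # B is not symmetric in general
print(b_square_xy)                 # squared Euclidean distance
```

The lack of symmetry visible in the second printed pair is the reason the text distinguishes the $B_1$-projection from the $B_2$-projection.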
It is useful to introduce the set of all elements in $P$ that have finite distance to $Q$, and vice versa. Let
\[
B(P, q) = \inf \{ B(p, q) : p \in P \}. \tag{2.7}
\]
The expression $B(p, Q)$ is defined similarly.

We may now define the alternating projection method associated with the distance (squared) $B$. Consider two nonempty closed convex sets $P, Q \subset \mathbb{R}^d$. For reasons that will transpire later we wish to find points $p^* \in P$, $q^* \in Q$ such that
\[
B(p^*, q^*) = \min \{ B(p, q) : p \in P,\ q \in Q \}. \tag{2.8}
\]
The alternating projection method for solving this problem would go as follows. Let $q_1 \in Q$ be arbitrary, but such that there exists an $x \in P$ with $B(x, q_1) < \infty$. Let $p_1 \in P$ be the $B_1$-projection of $q_1$ onto $P$. Then let $q_2 \in Q$ be the $B_2$-projection of $p_1$ onto $Q$, and repeat ad infinitum. This gives rise to two sequences $\{ p_n \}_n \subset P$, $\{ q_n \}_n \subset Q$ recursively defined by
\[
p_n = \Pi q_n, \qquad q_{n+1} = \Pi\Pi p_n, \qquad n = 1, 2, \dots. \tag{2.9}
\]
It has to be shown that this algorithm does not break down, but again we omit the details.

We proceed with proving the convergence of the alternating projection method, and begin by deriving the so-called three-points and four-points properties.

(2.10) Lemma (Three-points property). Let $q_1 \in Q$, with $B(P, q_1) < \infty$, and let $p_1 = \Pi q_1$. Then for all $p \in P$
\[
B(p, q_1) - B(p_1, q_1) \ge B(p, p_1).
\]

Proof. The left hand side equals
\[
b(p) - b(p_1) - \langle \nabla b(q_1),\, p - p_1 \rangle = b(p) - b(p_1) - \langle \nabla b(p_1),\, p - p_1 \rangle + \langle \nabla b(p_1) - \nabla b(q_1),\, p - p_1 \rangle.
\]
Since $p_1$ realizes $\min \{ B(p, q_1) : p \in P \}$, which is a convex minimization problem, the Kuhn-Tucker conditions tell us that
\[
\langle \nabla_1 B(p_1, q_1),\, p - p_1 \rangle \ge 0
\]
for all $p \in P$, where $\nabla_1 B$ denotes the gradient of $B(p, q)$ with respect to $p$ (the first variable). But $\nabla_1 B(p, q) = \nabla b(p) - \nabla b(q)$, so the result follows. Q.e.d.

The above three-points property regarding the $B_1$-projection seems reasonable enough; cf. the case of the Euclidean norm squared. The four-points property regarding the $B_2$-projection is much more mysterious.
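In the Euclidean case $b(x) = \sum_j x_j^2$, where $B(x, y)$ is the squared distance and the $B_1$-projection is the ordinary orthogonal projection, the three-points property of Lemma (2.10) is easy to test numerically. The sketch below does so for a box, chosen only because its projection is a componentwise clip; the set, the sampling ranges, and the names are illustrative assumptions.

```python
import random

def sq_dist(x, y):
    """B(x, y) for b(x) = sum x_j^2, i.e. the squared Euclidean distance."""
    return sum((xj - yj) ** 2 for xj, yj in zip(x, y))

def project_box(q, lo, hi):
    """Orthogonal projection of q onto the box P = [lo, hi]^d."""
    return [min(max(qj, lo), hi) for qj in q]

random.seed(0)
lo, hi, dim = 0.0, 1.0, 4

# Three-points property: B(p, q1) - B(p1, q1) - B(p, p1) >= 0 for all p in P,
# where p1 is the projection of q1 onto P.
worst_slack = float("inf")
for _ in range(1000):
    q1 = [random.uniform(-3.0, 3.0) for _ in range(dim)]
    p1 = project_box(q1, lo, hi)
    p = [random.uniform(lo, hi) for _ in range(dim)]
    slack = sq_dist(p, q1) - sq_dist(p1, q1) - sq_dist(p, p1)
    worst_slack = min(worst_slack, slack)

print("smallest slack over all samples:", worst_slack)
```

In this special case the slack equals $2 \langle p - p_1,\, p_1 - q_1 \rangle$, so the check is just the familiar obtuse-angle characterization of a projection onto a convex set.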
(2.11) Lemma (Four-points property). Let $p_1 \in P$, with $B(p_1, Q) < \infty$, and let $q_2 = \Pi\Pi p_1$. Then for all $x \in P$, $y \in Q$
\[
B(x, q_2) \le B(x, p_1) + B(x, y).
\]

Proof. Using the identity
\[
B(x, p_1) = B(x, q_2) - B(p_1, q_2) - \langle \nabla_1 B(p_1, q_2),\, x - p_1 \rangle
\]
we have that
\[
B(x, p_1) + B(x, y) - B(x, q_2) = B(x, y) - B(p_1, q_2) - \langle \nabla_1 B(p_1, q_2),\, x - p_1 \rangle. \tag{2.12}
\]
Now, $B(x, y)$ is convex in $x, y$ jointly, so
\[
B(x, y) \ge B(p_1, q_2) + \langle \nabla_1 B(p_1, q_2),\, x - p_1 \rangle + \langle \nabla_2 B(p_1, q_2),\, y - q_2 \rangle,
\]
with $\nabla_2 B$ denoting the derivative (gradient) of $B$ with respect to the second variable. Thus the expression on the right of (2.12) dominates $\langle \nabla_2 B(p_1, q_2),\, y - q_2 \rangle$, which is nonnegative for all $y \in Q$, by the Kuhn-Tucker conditions for the optimality of $q_2$. Q.e.d.

The full content of these lemmas is not so obvious. The following two monotonicity properties are quite remarkable consequences. With an eye towards the application to maximum likelihood estimation we define the functional $\Lambda$ as
\[
\Lambda(q) = B(\Pi q, q), \tag{2.13}
\]
for all $q \in Q$ with $B(P, q) < \infty$.

(2.14) First Monotonicity Property. Let $q_1 \in Q$ with $\Lambda(q_1) < \infty$. Then $\Lambda(q_2) < \infty$, and
\[
\Lambda(q_1) - \Lambda(q_2) \ge B(p_1, p_2) \ge 0.
\]

Proof. Observe that
\[
\Lambda(q_1) - \Lambda(q_2) = \{ B(p_1, q_1) - B(p_1, q_2) \} + \{ B(p_1, q_2) - B(p_2, q_2) \}.
\]
The expression between the first pair of curly brackets is nonnegative since $q_2 = \Pi\Pi p_1$. The three-points lemma provides the lower bound $B(p_1, p_2)$ for the second expression. Q.e.d.

To formulate the second monotonicity property, let $P^* \subset P$ be the set of all $p_o \in P$ such that
\[
B(p_o, Q) = B(P, Q) = \inf \{ B(x, y) : x \in P,\ y \in Q \}. \tag{2.15}
\]
So $P^*$ is the set of solutions $p^*$ of the minimum distance problem (2.8).
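The alternating projection method (2.9) and the First Monotonicity Property (2.14) can be illustrated in the plane with the squared Euclidean distance as $B$. The sketch below is an assumption-laden toy example: $P$ is taken to be the closed unit disc, $Q$ the half-plane $\{ x_1 \ge 2 \}$, and the starting point and function names are arbitrary. For these two sets the minimum of $B$ is attained at $p = (1, 0)$, $q = (2, 0)$.

```python
import math

def sq_dist(x, y):
    """B(x, y) for b(x) = sum x_j^2, i.e. the squared Euclidean distance."""
    return sum((xj - yj) ** 2 for xj, yj in zip(x, y))

def project_disc(q):
    """Projection onto P = {x : |x| <= 1}."""
    norm = math.hypot(q[0], q[1])
    return list(q) if norm <= 1.0 else [q[0] / norm, q[1] / norm]

def project_halfplane(p):
    """Projection onto Q = {x : x_1 >= 2}."""
    return [max(p[0], 2.0), p[1]]

def alternate(q1, n_iter):
    """Run p_n = Pi q_n, q_{n+1} = PiPi p_n and record, for each n, the triple
    (Lambda(q_n), Lambda(q_{n+1}), B(p_n, p_{n+1}))."""
    q = list(q1)
    p = project_disc(q)
    history = []
    for _ in range(n_iter):
        lam = sq_dist(p, q)
        q_next = project_halfplane(p)
        p_next = project_disc(q_next)
        history.append((lam, sq_dist(p_next, q_next), sq_dist(p, p_next)))
        p, q = p_next, q_next
    return p, q, history

p_final, q_final, history = alternate([3.0, 4.0], 200)
print("p:", p_final, "q:", q_final, "B(p, q):", sq_dist(p_final, q_final))
```

Each recorded triple should satisfy $\Lambda(q_n) - \Lambda(q_{n+1}) \ge B(p_n, p_{n+1})$ up to rounding, and the final value of $B(p, q)$ should be close to 1, the squared distance between the two sets.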
(2.16) Second Monotonicity Property. Let $p^* \in P^*$, and set $q^* = \Pi\Pi p^*$. Select $p_1 \in P$ such that $B(p^*, p_1) < \infty$. Then $B(p^*, p_2) < \infty$ as well, and
\[
B(p^*, p_1) - B(p^*, p_2) \ge \Lambda(q_2) - \Lambda(q^*).
\]

Proof. The four-points lemma, with $x = p^*$, $y = q^*$, says that
\[
B(p^*, p_1) \ge B(p^*, q_2) - B(p^*, q^*),
\]
and the three-points lemma, with the indices incremented by 1, gives
\[
- B(p^*, p_2) \ge B(p_2, q_2) - B(p^*, q_2).
\]
Adding these two inequalities gives
\[
B(p^*, p_1) - B(p^*, p_2) \ge B(p_2, q_2) - B(p^*, q^*),
\]
which is the required inequality. Q.e.d.

The proof that the alternating projection method converges is now quite simple, modulo a rather annoying assumption. In the fully general setting there appears to be no way around it. In specific instances it is always easily verified.

(2.17) Theorem. Let $p_1 \in P$ be such that $B(p^*, p_1) < \infty$ for all $p^* \in P^*$. Then $\{ p_n \}_n$ converges to some $p_o \in P^*$, and $\{ q_n \}_n$ converges to some $q_o \in Q$, and
\[
B(p_o, q_o) = \min \{ B(p, q) : p \in P,\ q \in Q \}.
\]

Proof. By the First Monotonicity Property, $\{ \Lambda(q_n) \}_n$ is decreasing. Let $p^* \in P^*$, and let $q^* = \Pi\Pi p^*$. By the Second Monotonicity Property $\{ B(p^*, p_n) \}_n$ is decreasing, and since it is a nonnegative sequence, it has a limit. Again the Second Monotonicity Property then implies that $\Lambda(q_n) \to \Lambda(q^*)$. Also, from the boundedness of $\{ B(p^*, p_n) \}_n$ condition (B3) implies that $\{ p_n \}_n$ is bounded, so it has a convergent subsequence, denoted by $\{ p_n \}_{n \in M}$ where $M \subset \mathbb{N}$. Let $p_o$ be the limit of this subsequence. Now $\{ q_n \}_{n \in M}$ is bounded, so it too has convergent subsequences. Without loss of generality, we may assume that $\{ q_{n+1} \}_{n \in M}$ is convergent, say with limit $q_o$. By the lower semicontinuity (B2) of $B$, then
\[
B(p_o, q_o) \le \liminf_{n \in M} B(p_n, q_n) = \liminf_{n \in M} \Lambda(q_n) = \Lambda(q^*),
\]
where $\liminf_{n \in M}$ denotes the liminf as $n \to \infty$, $n \in M$. It follows that $p_o \in P^*$ (and that $q_o = \Pi\Pi p_o$, but never mind). To prove the convergence of the whole sequences, apply the above with $p^*$ replaced by $p_o$. (Here the strange condition that $B(p^*, p_1) < \infty$ for all $p^* \in P^*$ comes into play.) Then $\{ B(p_o, p_n) \}_n$ is decreasing, and by (B5) a subsequence converges to 0. It follows that the whole sequence converges to 0, so $p_n \to p_o$, $n \to \infty$ ($n \in \mathbb{N}$). Now, since $\{ q_n \}_n$ is bounded, every subsequence has itself a convergent subsequence. Call the limit $q^{(o)}$. By the lower semicontinuity of $B(p, q)$, we get just as above that $B(p_o, q^{(o)}) \le \Lambda(q^*)$. It follows that $q^{(o)} = \Pi\Pi p_o$, and then that the whole sequence $\{ q_n \}_n$ converges to $q^{(o)}$. The last statement follows from $p^* \in P^*$, and $p^* = \Pi q^*$, so that $\Lambda(q^*)$ is equal to the distance between $P$ and $Q$. Q.e.d.

(2.18) Remark. It is interesting to note that the alternating projection method and the associated three- and four-points properties, as well as the two monotonicity properties, work also for the problem
\[
\text{minimize}\quad \mathcal{B}(p, q) \overset{\text{def}}{=} B(p, q) + F(q) \quad \text{subject to}\quad p \in P,\ q \in Q.
\]
Here $F$ is a differentiable convex function on $Q$. Denoting the $\mathcal{B}_1$-projection of $q$ onto $P$ by $p = \Pi q$, and the $\mathcal{B}_2$-projection of $p$ onto $Q$ by $q = \Pi\Pi p$, the three- and four-points properties read, resp.,
\[
\mathcal{B}(p, q) - \mathcal{B}(\Pi q, q) \ge B(p, \Pi q), \qquad \mathcal{B}(x, \Pi\Pi p) - \mathcal{B}(x, y) \le B(x, p).
\]
Note the distinction between $\mathcal{B}$ and $B$. This is especially interesting in the case where $P = Q$, since then one is minimizing $F(p)$ over $p$. For $B(p, q) = KL(p, q)$ this leads to the implicit algorithm discussed in Eggermont [12], viz.
\[
x_j^{k+1} = \frac{x_j^{k}}{1 + [ \nabla F(x^{k+1}) ]_j}, \qquad j = 1, 2, \dots, m. \tag{2.19}
\]

(2.20) Remark. We note that the standard application of the theory is to minimizing $KL(b, Ax)$, with nonnegative $A \in \mathbb{R}^{n \times m}$ with column sums equal to 1, and nonnegative
$b \in \mathbb{R}^n$. It is interesting to note that it also applies to minimizing $KL(Ax, b)$. The resulting algorithm is
\[
x_j^{k+1} = x_j^{k} \exp\bigl( [ A^T \{ \log( b / Ax^k ) \} ]_j \bigr), \qquad j = 1, 2, \dots, m, \tag{2.21}
\]
and the algorithm converges as per the general theory. It is interesting to note that if $Ax = b$ has a nonnegative solution then the algorithm (2.21), with $x^1 = u$, a strictly positive vector, converges to the solution of
\[
\text{minimize}\quad \sum_{j=1}^{m} x_j \log \frac{x_j}{u_j} + u_j - x_j \quad \text{subject to}\quad x \ge 0,\ Ax = b. \tag{2.22}
\]
See Elfving [16]. What happens when $Ax = b$ does not have an exact nonnegative solution is not so easy, apparently.

3. Least Hellinger distance estimation

We now apply the alternating projections method to the minimum Hellinger distance estimation problem (1.1). Note that the Hellinger distance $H(p, q)$ satisfies the properties (B1) through (B5), but is not of the form (2.1) (e.g., the gradient is not of the required form). So the general theory of §2 does not tell us whether this alternating projection method converges or not. The alternating projections setup is similar to the one employed for minimizing $KL(b, Ax)$ by Csiszár and Tusnády [7], and for minimum Pearson's $\varphi^2$ distance by Mair, Rao and Anderson [19]. Thus, let $P$ and $Q$ be defined as
\[
P = \Bigl\{ (p_{ij}) \in \mathbb{R}^{n \times m} : p \ge 0,\ \sum_{j=1}^{m} p_{ij} = b_i,\ i = 1, 2, \dots, n \Bigr\}, \qquad
Q = \bigl\{ (a_{ij} x_j) \in \mathbb{R}^{n \times m} : x \ge 0 \bigr\}, \tag{3.1}
\]
and consider the problem
\[
\text{minimize}\quad H(p, q) = \sum_{i, j} \bigl| \sqrt{p_{ij}} - \sqrt{q_{ij}} \bigr|^2 \quad \text{subject to}\quad p \in P,\ q \in Q. \tag{3.2}
\]
It is of course not clear why solutions to (3.2) should provide solutions to (1.1), but it will transpire that they do.

To determine the projection steps of the alternating projection method, let $q^1 = (a_{ij} x_j^1) \in Q$ be given. The $H_1$-projection of $q^1$ onto $P$ is obtained by minimizing
$H(p, q^1)$ over $p \in P$ (superscripts on $x$, $p$ and $q$ denote iterates). Ignoring the nonnegativity constraints on $p$, the Lagrange Multiplier Theorem yields that $p$ should solve
\[
1 - \frac{ \sqrt{ q_{ij}^1 } }{ \sqrt{ p_{ij} } } + \lambda_i = 0,
\]
for suitable $\lambda_i$, and hence $p_{ij} = a_{ij} x_j^1 / (1 + \lambda_i)^2$ for all $i, j$. This shows that we are justified in ignoring the nonnegativity constraint on $p$. Summing over $j$ results in $b_i = [Ax^1]_i / (1 + \lambda_i)^2$, and so, for all $i, j$,
\[
p_{ij}^1 = \frac{ a_{ij}\, b_i\, x_j^1 }{ [Ax^1]_i }. \tag{3.3}
\]
The $H_2$-projection of $p^1$ onto $Q$ is determined by minimizing
\[
H(p^1, q) = \sum_{j} \Bigl( \sum_{i} a_{ij} \Bigr) x_j - 2 \sqrt{x_j} \Bigl( \sum_{i} \sqrt{ a_{ij}\, p_{ij}^1 } \Bigr) + \cdots,
\]
where $\cdots$ denotes terms independent of $x$. Ignoring the nonnegativity constraint on $x$, and setting the gradient to 0 yields $\sqrt{x_j} = \sum_i \sqrt{ a_{ij}\, p_{ij}^1 }$, or
\[
\sqrt{ x_j^{2} } = \sqrt{ x_j^{1} }\, \sum_{i} a_{ij} \sqrt{ b_i / [Ax^1]_i },
\]
and $q^2 = (a_{ij} x_j^2)$. So we were justified in ignoring the nonnegativity constraints, and the algorithm is
\[
x_j^{2} = x_j^{1}\, \bigl( [ A^T \{ b / Ax^1 \}^{1/2} ]_j \bigr)^2, \qquad j = 1, 2, \dots, m, \tag{3.4}
\]
as advertised in the introduction. The geometric intuition tells us that this algorithm converges. In the next section we give an alternative derivation, and prove that it converges.

It is interesting to note that in all three minimum distance problems (Kullback-Leibler, Pearson's $\varphi^2$ and Hellinger) the first projection step (3.3) is the same. This begs for an explanation. Indeed, all three functions $KL(x, y)$, $P(x, y)$ and $H(x, y)$ may be written in the form
\[
\Psi(x, y) = \sum_{j=1}^{n} y_j\, \psi( x_j / y_j ), \tag{3.5}
\]
where $\psi$ is an increasing, differentiable, convex function defined for nonnegative numbers. The functions $\Psi$ are referred to as entropy functions, see, e.g., Chen and Teboulle [6], and references therein. It can now be shown that for given $q$, $q_{ij} = a_{ij} x_j$, the solution $p$ to the problem
\[
\text{minimize}\quad \Psi(p, q) \quad \text{subject to}\quad p \in P, \tag{3.6}
\]
with $P$ as in (3.1), is given by
\[
p_{ij} = \frac{ a_{ij}\, b_i\, x_j }{ [ Ax ]_i }. \tag{3.7}
\]
It should be noted that $\Psi$ satisfies the conditions (B1) through (B5), but again is not of the required form (2.1). It is not clear that a Csiszár-Tusnády theory could be worked out for this family of functions (3.5).

4. Majorizing functions for Hellinger distance

We now apply the majorizing function approach of De Pierro [11] to the minimum Hellinger distance estimation problem (1.1). Note that $H(b, Ax)$ is convex in $x$. We begin by deriving a majorizing function, or, as we like to call it, a Tendentious Inequality, because it will suggest the minimization algorithm. We have
\[
H(b, Ax) = \sum_{i=1}^{n} [Ax]_i - 2 \sqrt{ b_i\, [Ax]_i } + b_i, \tag{4.1}
\]
so only the second term needs consideration. Writing $Ax = A\{ y\, (x / y) \}$, we get by convexity of the function $t \mapsto -\sqrt{t}$ that
\[
\sqrt{ [Ax]_i } = \sqrt{ [Ay]_i } \left[ \frac{ [ A \{ y\, (x / y) \} ]_i }{ [Ay]_i } \right]^{1/2} \ge \sqrt{ [Ay]_i }\; \frac{ [ A \{ y\, [ x / y ]^{1/2} \} ]_i }{ [Ay]_i } = \frac{ [ A \{ \sqrt{ x y } \} ]_i }{ \sqrt{ [Ay]_i } }.
\]
It follows that
\[
H(b, Ax) \le \sum_{i=1}^{n} [Ax]_i - 2\, [ A \sqrt{ x y } ]_i\, \frac{ \sqrt{ b_i } }{ \sqrt{ [Ay]_i } } + b_i,
\]
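or,
\[
H(b, Ax) \le \mathcal{H}(x, y) \overset{\text{def}}{=} \sum_{j=1}^{m} x_j - 2 \sqrt{ x_j y_j }\, \bigl[ A^T \sqrt{ b / Ay } \bigr]_j + \sum_{i=1}^{n} b_i. \tag{4.2}
\]
This is the Tendentious Inequality. The minimization algorithm it suggests for solving (1.1) is as follows. If $y = x^k$ is a guess for a solution of (1.1), obtain a new and improved(?) guess $x^{k+1}$ by minimizing $\mathcal{H}(x, x^k)$ as a function of $x$. The result is that
\[
x_j^{k+1} = x_j^{k}\, \bigl( [ A^T \sqrt{ b / Ax^k } ]_j \bigr)^2, \qquad j = 1, 2, \dots, m. \tag{4.3}
\]

We now investigate the monotonicity properties. In our search for the First Monotonicity Property we observe the following. For ease of notation we let $y = x^k$ and $x = x^{k+1}$. Then
\[
\begin{aligned}
H(b, Ax) \le \mathcal{H}(x, y) &= - \sum_{j=1}^{m} y_j \bigl\{ [ A^T \sqrt{ b / Ay } ]_j \bigr\}^2 + \sum_{i=1}^{n} b_i \\
&= H(b, Ay) - \sum_{j=1}^{m} y_j \bigl\{ 1 - [ A^T \sqrt{ b / Ay } ]_j \bigr\}^2 \\
&= H(b, Ay) - \sum_{j=1}^{m} \bigl| \sqrt{ x_j } - \sqrt{ y_j } \bigr|^2.
\end{aligned}
\]
The formulation in terms of $x^k$ and $x^{k+1}$ reads
\[
H(b, Ax^k) - H(b, Ax^{k+1}) \ge H(x^k, x^{k+1}), \tag{4.4}
\]
which is the First Monotonicity Property. Note the lack of any hint of Kullback-Leibler.

But Kullback-Leibler pops up in the Second Monotonicity Property. It turns out that the Second Monotonicity Property takes just about the standard form. Let $x^*$ be a solution of (1.1). By (4.4) then $x^*$ is a fixed point of the iteration (4.3), so $[ A^T \sqrt{ b / Ax^* } ]_j = 1$ whenever $x_j^* > 0$. Now, with $KL$ the standard Kullback-Leibler divergence,
\[
\begin{aligned}
KL(x^*, x^k) - KL(x^*, x^{k+1}) &= \sum_{j=1}^{m} x_j^* \log \frac{ x_j^{k+1} }{ x_j^{k} } + x_j^{k} - x_j^{k+1} \\
&= \sum_{j=1}^{m} x_j^{k} - x_j^{k+1} + 2\, x_j^* \log [ A^T \sqrt{ b / Ax^k } ]_j.
\end{aligned} \tag{4.5}
\]
In the usual fashion we have
\[
[ A^T \sqrt{ b / Ax^k } ]_j = \Bigl[ A^T \Bigl\{ \sqrt{ \frac{ b }{ Ax^* } }\, \sqrt{ \frac{ Ax^* }{ Ax^k } } \Bigr\} \Bigr]_j,
\]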
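and so, by the concavity of the logarithm,
\[
\begin{aligned}
KL(x^*, x^k) - KL(x^*, x^{k+1}) &\ge \sum_{j=1}^{m} x_j^{k} - x_j^{k+1} + 2\, x_j^* \Bigl[ A^T \Bigl\{ \sqrt{ \frac{ b }{ Ax^* } }\, \log \sqrt{ \frac{ Ax^* }{ Ax^k } } \Bigr\} \Bigr]_j \\
&= \sum_{j=1}^{m} \bigl( x_j^{k} - x_j^{k+1} \bigr) + \sum_{i=1}^{n} 2 \sqrt{ b_i\, [Ax^*]_i }\, \log \sqrt{ \frac{ [Ax^*]_i }{ [Ax^k]_i } } \\
&\ge \sum_{j=1}^{m} \bigl( x_j^{k} - x_j^{k+1} \bigr) + \sum_{i=1}^{n} 2 \sqrt{ b_i\, [Ax^*]_i } - 2 \sqrt{ b_i\, [Ax^k]_i },
\end{aligned} \tag{4.6}
\]
where in the last line we used the inequality $\log t \ge 1 - t^{-1}$. Consequently
\[
\begin{aligned}
KL(x^*, x^k) - KL(x^*, x^{k+1}) &\ge \sum_{i=1}^{n} \bigl\{ [Ax^k]_i - 2 \sqrt{ b_i\, [Ax^k]_i } + b_i \bigr\} - \sum_{i=1}^{n} \bigl\{ [Ax^*]_i - 2 \sqrt{ b_i\, [Ax^*]_i } + b_i \bigr\} + \text{rest} \\
&= H(b, Ax^k) - H(b, Ax^*) + \text{rest} \\
&\ge H(b, Ax^{k+1}) + H(x^{k+1}, x^k) - H(b, Ax^*) + \text{rest},
\end{aligned} \tag{4.7}
\]
where in the last line we used (4.4), and the rest is given by
\[
\text{rest} = \sum_{i=1}^{n} [Ax^*]_i - \sum_{j=1}^{m} x_j^{k+1} = \sum_{j=1}^{m} x_j^* - x_j^{k+1}. \tag{4.8}
\]
Now,
\[
\begin{aligned}
H(x^{k+1}, x^k) + \text{rest} &= \sum_{j=1}^{m} x_j^{k+1} + x_j^{k} - 2 \sqrt{ x_j^{k+1} x_j^{k} } + x_j^* - x_j^{k+1} \\
&= \sum_{j=1}^{m} x_j^{k} - 2\, x_j^{k}\, [ A^T \sqrt{ b / Ax^k } ]_j + x_j^* \\
&= \sum_{i=1}^{n} \bigl\{ [Ax^k]_i - 2 \sqrt{ b_i\, [Ax^k]_i } + b_i \bigr\} + \text{rem} \\
&= H(b, Ax^k) + \text{rem},
\end{aligned} \tag{4.9}
\]
where
\[
\text{rem} = \sum_{j=1}^{m} x_j^* - \sum_{i=1}^{n} b_i.
\]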
Rather surprisingly, $\text{rem} = - H(b, Ax^*)$, as we now show. Since $x^*$ is a fixed point of (4.3),
\[
\sum_{j=1}^{m} x_j^* = \sum_{j=1}^{m} x_j^*\, [ A^T \sqrt{ b / Ax^* } ]_j = \sum_{i=1}^{n} \sqrt{ b_i\, [Ax^*]_i }, \tag{4.10}
\]
and so
\[
\text{rem} = \sum_{j=1}^{m} - x_j^* + \sum_{j=1}^{m} 2 x_j^* - \sum_{i=1}^{n} b_i = \sum_{i=1}^{n} - [Ax^*]_i + 2 \sqrt{ b_i\, [Ax^*]_i } - b_i = - H(b, Ax^*).
\]
It follows that
\[
KL(x^*, x^k) - KL(x^*, x^{k+1}) \ge H(b, Ax^{k+1}) + H(b, Ax^k) - 2 H(b, Ax^*) \ge 0, \tag{4.11}
\]
which implies
\[
KL(x^*, x^k) - KL(x^*, x^{k+1}) \ge 2\, \bigl\{ H(b, Ax^{k+1}) - H(b, Ax^*) \bigr\} \ge 0. \tag{4.12}
\]
Either (4.11) or (4.12) may be considered as the Second Monotonicity Property.

The majorizing function approach applies to the smoothed minimum Hellinger distance problem (1.20). At the end of this section we show one may view this as a regularized version of (1.1). Note that $H(b, A, x)$ is convex. The Tendentious Inequality is
\[
H(b, A, x) \le \sum_{i=1}^{n} b_i + \sum_{j=1}^{m} x_j - 2 \sqrt{ x_j }\, \bigl[ S \{ \sqrt{ My } \cdot A^T \sqrt{ b / AMy } \} \bigr]_j, \tag{4.13}
\]
which gives rise to the algorithm (1.21). The first monotonicity property of Theorem (1.22) is similar to the unsmoothed case. For the second monotonicity property we work backwards, in several steps analogous to the unsmoothed case. The first ingredient is the observation that for any solution $x^*$ of (1.20)
\[
\sum_{j=1}^{m} x_j^* - \sum_{i=1}^{n} b_i = - H(b, A, x^*). \tag{4.14}
\]
The proof is just about the same as before: $x^*$ is a fixed point of (1.21), so with $r^* = \sqrt{ b / AMx^* }$,
\[
\sum_{j=1}^{m} x_j^* = \sum_{j=1}^{m} \sqrt{ x_j^* }\, \bigl[ S( \sqrt{ Mx^* } \cdot (A^T r^*) ) \bigr]_j = \sum_{j=1}^{m} [ Mx^* ]_j\, [ A^T r^* ]_j = \sum_{i=1}^{n} [ AMx^* ]_i\, r_i^* = \sum_{i=1}^{n} \sqrt{ b_i\, [ AMx^* ]_i },
\]
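where we used duality twice (or interchanging the order of summation). Now (4.14) follows as in (4.10). The second step is to show that
\[
H(b, A, x^k) - H(b, A, x^*) = H(x^k, x^{k+1}) + \sum_{j=1}^{m} x_j^* - x_j^{k+1}. \tag{4.15}
\]
This too follows similarly to the unsmoothed case: using (4.14) we have
\[
H(b, A, x^k) - H(b, A, x^*) = H(b, A, x^k) + \sum_{j=1}^{m} x_j^* - \sum_{i=1}^{n} b_i = \sum_{j=1}^{m} \bigl( x_j^{k} + x_j^* \bigr) - \sum_{i=1}^{n} 2 \sqrt{ b_i\, [ AMx^k ]_i },
\]
and now, with $r^k = \sqrt{ b / AMx^k }$,
\[
\begin{aligned}
\sum_{i=1}^{n} \sqrt{ b_i\, [ AMx^k ]_i } = \sum_{i=1}^{n} [ AMx^k ]_i\, r_i^k &= \sum_{j=1}^{m} [ Mx^k ]_j\, [ A^T r^k ]_j \\
&= \sum_{j=1}^{m} [ S \sqrt{ x^k } ]_j\, \sqrt{ [ Mx^k ]_j }\, [ A^T r^k ]_j \\
&= \sum_{j=1}^{m} \sqrt{ x_j^k }\, \bigl[ S ( \sqrt{ Mx^k } \cdot (A^T r^k) ) \bigr]_j \\
&= \sum_{j=1}^{m} \sqrt{ x_j^{k}\, x_j^{k+1} }.
\end{aligned}
\]
Going back we get
\[
H(b, A, x^k) - H(b, A, x^*) = \sum_{j=1}^{m} x_j^{k} - 2 \sqrt{ x_j^{k}\, x_j^{k+1} } + x_j^*,
\]
and (4.15) follows.

Now backtracking as in (4.7), (4.6), (4.5) we get that
\[
\begin{aligned}
H(b, A, x^{k+1}) + H(x^{k+1}, x^k) - H(b, A, x^*) &+ \sum_{j=1}^{m} x_j^* - x_j^{k+1} \\
&\le \sum_{j=1}^{m} \bigl( x_j^{k} - x_j^{k+1} \bigr) + \sum_{i=1}^{n} 2 \Bigl( \sqrt{ b_i\, [ AMx^* ]_i } - \sqrt{ b_i\, [ AMx^k ]_i } \Bigr) \\
&\le \sum_{j=1}^{m} \bigl( x_j^{k} - x_j^{k+1} \bigr) + \sum_{i=1}^{n} 2 \sqrt{ b_i\, [ AMx^* ]_i }\, \log \sqrt{ \frac{ [ AMx^* ]_i }{ [ AMx^k ]_i } } \\
&\le \sum_{j=1}^{m} \bigl( x_j^{k} - x_j^{k+1} \bigr) + \sum_{j=1}^{m} 2\, [ Mx^* ]_j\, [ A^T r^* ]_j\, \log \frac{ [ A^T r^k ]_j }{ [ A^T r^* ]_j }.
\end{aligned}
\]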
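So we now have
\[
H(b, A, x^{k+1}) + H(x^{k+1}, x^k) - H(b, A, x^*) + \sum_{j=1}^{m} x_j^* - x_j^{k+1} \le \sum_{j=1}^{m} \bigl( x_j^{k} - x_j^{k+1} \bigr) + \text{SUM}, \tag{4.16}
\]
with
\[
\text{SUM} = \sum_{j=1}^{m} 2\, [ Mx^* ]_j\, [ A^T r^* ]_j\, \log \frac{ [ A^T r^k ]_j }{ [ A^T r^* ]_j }. \tag{4.17}
\]
The last step is to get from here to $KL(x^*, x^k) - KL(x^*, x^{k+1})$. We rewrite SUM as $\text{SUM} = \text{SUM}_I + \text{SUM}_{II}$, with
\[
\text{SUM}_I = \sum_{j=1}^{m} 2\, [ Mx^* ]_j\, [ A^T r^* ]_j\, \log \frac{ \sqrt{ [ Mx^k ]_j }\, [ A^T r^k ]_j }{ \sqrt{ [ Mx^* ]_j }\, [ A^T r^* ]_j },
\qquad
\text{SUM}_{II} = \sum_{j=1}^{m} 2\, [ Mx^* ]_j\, [ A^T r^* ]_j\, \log \frac{ \sqrt{ [ Mx^* ]_j } }{ \sqrt{ [ Mx^k ]_j } }.
\]
With arguments used before,
\[
\text{SUM}_I = \sum_{j=1}^{m} 2 \sqrt{ x_j^* }\, \Bigl[ S \Bigl\{ \sqrt{ Mx^* } \cdot (A^T r^*)\, \log \frac{ \sqrt{ Mx^k } \cdot A^T r^k }{ \sqrt{ Mx^* } \cdot A^T r^* } \Bigr\} \Bigr]_j,
\]
and, now, in view of the iteration (1.21), of which $x^*$ is a fixed point,
\[
S^{\text{inv}} \sqrt{ x^{k+1} } = \sqrt{ Mx^k } \cdot A^T r^k, \qquad S^{\text{inv}} \sqrt{ x^* } = \sqrt{ Mx^* } \cdot A^T r^*,
\]
assuming that $S$ is invertible. (The following goes through without this assumption, actually.) So we may write $\text{SUM}_I$ as
\[
\text{SUM}_I = \sum_{j=1}^{m} 2 \sqrt{ x_j^* }\, \Bigl[ S \Bigl\{ ( S^{\text{inv}} \sqrt{ x^* } )\, \log \frac{ S^{\text{inv}} \sqrt{ x^{k+1} } }{ S^{\text{inv}} \sqrt{ x^* } } \Bigr\} \Bigr]_j. \tag{4.18}
\]
It should be noted that $S^{\text{inv}} \sqrt{ x^{k+1} }$ and $S^{\text{inv}} \sqrt{ x^* }$ are nonnegative vectors. Now, for any nonnegative function $U$, by the concavity of the logarithm
\[
\frac{ S \bigl( ( S^{\text{inv}} \sqrt{ x^* } )\, \log U \bigr) }{ S \bigl( S^{\text{inv}} \sqrt{ x^* } \bigr) } \le \log \frac{ S \bigl( ( S^{\text{inv}} \sqrt{ x^* } )\, U \bigr) }{ S \bigl( S^{\text{inv}} \sqrt{ x^* } \bigr) },
\]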
More informationRegression. Chapter 2. 2.1 Weightspace View
Chapter Regression Supervised learning can be divided into regression and classification problems. Whereas the outputs for classification are discrete class labels, regression is concerned with the prediction
More informationHow many numbers there are?
How many numbers there are? RADEK HONZIK Radek Honzik: Charles University, Department of Logic, Celetná 20, Praha 1, 116 42, Czech Republic radek.honzik@ff.cuni.cz Contents 1 What are numbers 2 1.1 Natural
More informationFrom Sparse Solutions of Systems of Equations to Sparse Modeling of Signals and Images
SIAM REVIEW Vol. 51,No. 1,pp. 34 81 c 2009 Society for Industrial and Applied Mathematics From Sparse Solutions of Systems of Equations to Sparse Modeling of Signals and Images Alfred M. Bruckstein David
More informationThe Backpropagation Algorithm
7 The Backpropagation Algorithm 7. Learning as gradient descent We saw in the last chapter that multilayered networks are capable of computing a wider range of Boolean functions than networks with a single
More informationAn Elementary Introduction to Modern Convex Geometry
Flavors of Geometry MSRI Publications Volume 3, 997 An Elementary Introduction to Modern Convex Geometry KEITH BALL Contents Preface Lecture. Basic Notions 2 Lecture 2. Spherical Sections of the Cube 8
More informationSketching as a Tool for Numerical Linear Algebra
Foundations and Trends R in Theoretical Computer Science Vol. 10, No. 12 (2014) 1 157 c 2014 D. P. Woodruff DOI: 10.1561/0400000060 Sketching as a Tool for Numerical Linear Algebra David P. Woodruff IBM
More informationGeneralized compact knapsacks, cyclic lattices, and efficient oneway functions
Generalized compact knapsacks, cyclic lattices, and efficient oneway functions Daniele Micciancio University of California, San Diego 9500 Gilman Drive La Jolla, CA 920930404, USA daniele@cs.ucsd.edu
More informationCOSAMP: ITERATIVE SIGNAL RECOVERY FROM INCOMPLETE AND INACCURATE SAMPLES
COSAMP: ITERATIVE SIGNAL RECOVERY FROM INCOMPLETE AND INACCURATE SAMPLES D NEEDELL AND J A TROPP Abstract Compressive sampling offers a new paradigm for acquiring signals that are compressible with respect
More informationSpaceTime Approach to NonRelativistic Quantum Mechanics
R. P. Feynman, Rev. of Mod. Phys., 20, 367 1948 SpaceTime Approach to NonRelativistic Quantum Mechanics R.P. Feynman Cornell University, Ithaca, New York Reprinted in Quantum Electrodynamics, edited
More informationSteering User Behavior with Badges
Steering User Behavior with Badges Ashton Anderson Daniel Huttenlocher Jon Kleinberg Jure Leskovec Stanford University Cornell University Cornell University Stanford University ashton@cs.stanford.edu {dph,
More informationControllability and Observability of Partial Differential Equations: Some results and open problems
Controllability and Observability of Partial Differential Equations: Some results and open problems Enrique ZUAZUA Departamento de Matemáticas Universidad Autónoma 2849 Madrid. Spain. enrique.zuazua@uam.es
More informationOptimization with SparsityInducing Penalties. Contents
Foundations and Trends R in Machine Learning Vol. 4, No. 1 (2011) 1 106 c 2012 F. Bach, R. Jenatton, J. Mairal and G. Obozinski DOI: 10.1561/2200000015 Optimization with SparsityInducing Penalties By
More informationRegular Languages are Testable with a Constant Number of Queries
Regular Languages are Testable with a Constant Number of Queries Noga Alon Michael Krivelevich Ilan Newman Mario Szegedy Abstract We continue the study of combinatorial property testing, initiated by Goldreich,
More informationA Tutorial on Spectral Clustering
A Tutorial on Spectral Clustering Ulrike von Luxburg Max Planck Institute for Biological Cybernetics Spemannstr. 38, 7276 Tübingen, Germany ulrike.luxburg@tuebingen.mpg.de This article appears in Statistics
More informationEMPIRICAL PROCESSES: THEORY AND APPLICATIONS
NSFCBMS Regional Conference Series in Probability and Statistics Volume 2 EMPIRICAL PROCESSES: THEORY AND APPLICATIONS David Pollard Yale University Sponsored by the Conference Board of the Mathematical
More informationON THE DISTRIBUTION OF SPACINGS BETWEEN ZEROS OF THE ZETA FUNCTION. A. M. Odlyzko AT&T Bell Laboratories Murray Hill, New Jersey ABSTRACT
ON THE DISTRIBUTION OF SPACINGS BETWEEN ZEROS OF THE ZETA FUNCTION A. M. Odlyzko AT&T Bell Laboratories Murray Hill, New Jersey ABSTRACT A numerical study of the distribution of spacings between zeros
More informationMUSTHAVE MATH TOOLS FOR GRADUATE STUDY IN ECONOMICS
MUSTHAVE MATH TOOLS FOR GRADUATE STUDY IN ECONOMICS William Neilson Department of Economics University of Tennessee Knoxville September 29 289 by William Neilson web.utk.edu/~wneilson/mathbook.pdf Acknowledgments
More informationIntellectual Need and ProblemFree Activity in the Mathematics Classroom
Intellectual Need 1 Intellectual Need and ProblemFree Activity in the Mathematics Classroom Evan Fuller, Jeffrey M. Rabin, Guershon Harel University of California, San Diego Correspondence concerning
More information