1 MINIMIZATION OF ENTROPY FUNCTIONALS UNDER MOMENT CONSTRAINTS

I. Csiszár (Budapest)

Given a σ-finite measure space $(X, \mathcal{X}, \mu)$ and a $d$-tuple $\varphi = (\varphi_1, \dots, \varphi_d)$ of measurable functions on $X$, for $a = (a_1, \dots, a_d) \in \mathbb{R}^d$ let $L_a$ denote the family of probability density functions $g$ on $X$ satisfying $\int \varphi g \, d\mu = a$, that is, $\int \varphi_i g \, d\mu = a_i$, $i = 1, \dots, d$.

Extensively studied problem: minimize

$$J(g) = \int g \log g \, d\mu \quad \text{(negative Shannon entropy)}$$

or

$$K(g, h) = \int g \log \frac{g}{h} \, d\mu \quad \text{(Kullback-Leibler distance, I-divergence, relative entropy)}$$

subject to $g \in L_a$.

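2 First this problem, then its extension to other entropies and distances will be considered.

For $\vartheta = (\vartheta_1, \dots, \vartheta_d) \in \mathbb{R}^d$ denote

$$\Lambda(\vartheta) = \log \int e^{\langle \vartheta, \varphi \rangle} \, d\mu, \qquad \langle \vartheta, \varphi \rangle = \sum_{i=1}^{d} \vartheta_i \varphi_i .$$

Assume: $\mathrm{dom}(\Lambda) = \{\vartheta : \Lambda(\vartheta) < +\infty\}$ is nonempty.

Not hard to show: $\Lambda(\vartheta)$ is the convex conjugate of the function $H(a) = \inf_{g \in L_a} J(g)$:

$$\Lambda(\vartheta) = H^*(\vartheta) = \sup_{a \in \mathbb{R}^d} \left[ \langle \vartheta, a \rangle - H(a) \right].$$

Dual problem associated with the primal problem of minimizing $J(g)$ subject to $g \in L_a$: maximize

$$\ell_a(\vartheta) = \langle \vartheta, a \rangle - \Lambda(\vartheta) \quad \text{for } \vartheta \in \mathbb{R}^d .$$

The supremum of $\ell_a(\vartheta)$ is the convex conjugate of $\Lambda(\vartheta)$, thus the second conjugate $H^{**}(a)$ of $H(a)$. Always $H(a) \ge H^{**}(a)$; the difference is called the duality gap.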

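3 Exponential family with canonical statistic $\varphi$:

$$E = \{ f_\vartheta = e^{\langle \vartheta, \varphi \rangle - \Lambda(\vartheta)} : \vartheta \in \mathrm{dom}(\Lambda) \}.$$

When the empirical mean $\frac{1}{n} \sum_{j=1}^{n} \varphi(x_j)$ of $\varphi$ in a sample $x_1, \dots, x_n$ drawn from a density in $E$ is equal to $a$, the normalized log-likelihood function is $\ell_a(\vartheta)$; for this $a$ the dual problem means ML estimation.

Moreover, if $g \in L_a$ then

$$\text{(lik.id)} \qquad K(g, f_\vartheta) = J(g) - \ell_a(\vartheta), \quad \vartheta \in \mathrm{dom}(\Lambda),$$

hence, provided $J(g)$ is finite, the dual problem is equivalent to minimizing $K(g, f_\vartheta)$ for $f_\vartheta \in E$.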
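The likelihood identity (lik.id) can be checked numerically on a small finite space. The sketch below is only an illustration under assumptions that are not part of the talk: $X = \{0, 1, 2, 3\}$ with counting measure, $d = 1$, $\varphi(x) = x$, and the log-partition function written as `log_partition` (denoted $\Lambda$ here). All function names are hypothetical.

```python
import math

# Assumed toy setting (not from the talk): X = {0,1,2,3}, counting measure, phi(x) = x.
phi = [0.0, 1.0, 2.0, 3.0]

def log_partition(theta):
    """Lambda(theta) = log sum_x exp(theta * phi(x))."""
    return math.log(sum(math.exp(theta * p) for p in phi))

def exp_family_density(theta):
    """f_theta(x) = exp(theta * phi(x) - Lambda(theta))."""
    lam = log_partition(theta)
    return [math.exp(theta * p - lam) for p in phi]

def neg_entropy(g):
    """J(g) = sum_x g(x) log g(x), with the convention 0 log 0 = 0."""
    return sum(gi * math.log(gi) for gi in g if gi > 0)

def kl(g, h):
    """K(g, h) = sum_x g(x) log(g(x) / h(x))."""
    return sum(gi * math.log(gi / hi) for gi, hi in zip(g, h) if gi > 0)

g = [0.1, 0.4, 0.3, 0.2]                      # an arbitrary probability density
a = sum(gi * p for gi, p in zip(g, phi))      # its moment, so g belongs to L_a
theta = 0.7                                   # an arbitrary parameter value

lhs = kl(g, exp_family_density(theta))                        # K(g, f_theta)
rhs = neg_entropy(g) - (theta * a - log_partition(theta))     # J(g) - l_a(theta)
print(lhs, rhs)
```

The two printed values should agree up to rounding for any choice of `g` and `theta`, since the identity only uses the moment constraint on `g`.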

4 Note: this interpretation of the dual problem does not apply if $a \notin \mathrm{dom}(H)$, in which case $J(g) = +\infty$ for all $g \in L_a$, even though $H^{**}(a) < H(a) = +\infty$ is possible.

Elementary proposition: If $L_a \cap E \ne \emptyset$, it contains a single $g_a$, and for this

$$J(g) = J(g_a) + K(g, g_a), \quad g \in L_a;$$

equivalently, the Pythagorean identity holds:

$$K(g, f) = K(g_a, f) + K(g, g_a), \quad g \in L_a, \ f \in E.$$

In this case, the duality gap is 0: $H(a) = H^{**}(a) = J(g_a)$, and the common member $g_a$ of $L_a$ and $E$ is simultaneously the I-projection to $L_a$ of each $f \in E$ and the reverse I-projection to $E$ of each $g \in L_a$ with $J(g) < +\infty$.
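A minimal numerical sketch of the elementary proposition, under the same kind of assumed toy setting ($X = \{0, 1, 2, 3\}$, counting measure, $\varphi(x) = x$): the common member $g_a$ of $L_a$ and $E$ is found by Newton's method on the one-dimensional dual problem, and both identities are then checked. The solver, the starting point and the helper names are illustrative choices, not taken from the talk.

```python
import math

phi = [0.0, 1.0, 2.0, 3.0]   # assumed canonical statistic on X = {0,1,2,3}

def density(theta):
    """Member f_theta of the exponential family (counting measure)."""
    w = [math.exp(theta * p) for p in phi]
    z = sum(w)
    return [wi / z for wi in w]

def mean(g):
    return sum(gi * p for gi, p in zip(g, phi))

def neg_entropy(g):
    return sum(gi * math.log(gi) for gi in g if gi > 0)

def kl(g, h):
    return sum(gi * math.log(gi / hi) for gi, hi in zip(g, h) if gi > 0)

def solve_theta(a, steps=50):
    """Newton ascent on l_a(theta) = theta * a - Lambda(theta).

    The gradient is a - mean(f_theta); the second derivative of Lambda
    is the variance of phi under f_theta.
    """
    theta = 0.0
    for _ in range(steps):
        f = density(theta)
        m = mean(f)
        var = sum(fi * (p - m) ** 2 for fi, p in zip(f, phi))
        theta += (a - m) / var
    return theta

g = [0.1, 0.4, 0.3, 0.2]
a = mean(g)                       # a lies in the interior of [0, 3]
g_a = density(solve_theta(a))     # common member of L_a and E
f = density(-0.5)                 # an arbitrary member of E

entropy_gap = neg_entropy(g) - neg_entropy(g_a) - kl(g, g_a)
pythagoras_gap = kl(g, f) - kl(g_a, f) - kl(g, g_a)
print(entropy_gap, pythagoras_gap)
```

Both gaps should vanish up to rounding. Newton's method is used only because the toy dual problem is smooth and one-dimensional; any concave maximizer would do.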

5 HISTORY HINTS

Boltzmann, Gibbs: in the 19th century
Jaynes, Kullback: in the fifties
Čencov 1972: information projections, diff. geom. approach
Barndorff-Nielsen 1977: convex analysis approach to MLE for exponential families
Csiszár 1975, Topsøe 1979: generalized minimizer when minimum not attained (Shannon case)
Csiszár 1991: axiomatic approach
Borwein and Lewis 1991: convex analysis approach for general entropies
Csiszár 1995: generalized minimizer, general case
Several recent works employ advanced Orlicz space techniques (Léonard) or diff. geom. (Amari and Nagaoka 2000, etc.)

6 This talk is based on works of Csiszár and Matúš and hopefully will show that classical tools suffice for treating the problem efficiently.

Convex core of a finite measure $Q$ on $\mathbb{R}^d$ (Csiszár - Matúš 2001):

$\mathrm{cc}(Q)$ = intersection of all convex Borel sets with full $Q$-measure
= set of means of all probability measures $P \ll Q$ that have a mean.

For the measure $\mu$ on $X$, define

$$\mathrm{cc}_\varphi(\mu) = \left\{ \int \varphi g \, d\mu : g \text{ prob. density}, \ \varphi g \text{ integrable} \right\} = \{ a \in \mathbb{R}^d : L_a \ne \emptyset \}.$$

If $\mu$ is finite then $\mathrm{cc}_\varphi(\mu) = \mathrm{cc}(\mu_\varphi)$, where $\mu_\varphi$ is the image of $\mu$ on $\mathbb{R}^d$.

7 Lemma: If $a \in \mathrm{cc}_\varphi(\mu)$, there exists $g \in L_a$ with $\mu(\{x : g(x) > 0\}) < +\infty$, $g$ bounded.

Corollary: $\mathrm{dom}(H) = \mathrm{cc}_\varphi(\mu)$, that is, the necessary condition $L_a \ne \emptyset$ for $H(a) = \inf_{g \in L_a} J(g) < +\infty$ is sufficient, as well.

Face of a convex set $C \subseteq \mathbb{R}^d$: nonempty convex subset $F \subseteq C$ such that a convex combination $tx + (1-t)y$ of $x \in C$ and $y \in C$ (with $0 < t < 1$) belongs to $F$ only if $x, y \in F$.

For a face $F$ of $\mathrm{cc}_\varphi(\mu)$, denote $X_F = \{x : \varphi(x) \in \mathrm{cl}(F)\}$.

Lemma: For $a$ in a face $F$ of $\mathrm{cc}_\varphi(\mu)$, each $g \in L_a$ vanishes outside $X_F$ ($\mu$-a.e.).

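8 Extended exponential family $\mathrm{ext}\,E$: the union of the families $E_F$ for all faces $F$ of $\mathrm{cc}_\varphi(\mu)$, where

$$E_F = \{ f_{F,\vartheta} = e^{\langle \vartheta, \varphi \rangle - \Lambda_F(\vartheta)} \, 1_{X_F} : \vartheta \in \mathrm{dom}(\Lambda_F) \}, \qquad \Lambda_F(\vartheta) = \log \int_{X_F} e^{\langle \vartheta, \varphi \rangle} \, d\mu,$$

with $X_F = \{x : \varphi(x) \in \mathrm{cl}(F)\}$.

Theorem 1 (Csiszár - Matúš 2003): Whenever $L_a \ne \emptyset$, thus $a \in \mathrm{cc}_\varphi(\mu)$, there exists a unique $g_a$, perhaps not in $L_a$, such that

$$J(g) = H(a) + K(g, g_a), \quad g \in L_a.$$

Moreover $g_a \in E_F$, for the face $F$ of $\mathrm{cc}_\varphi(\mu)$ whose relative interior contains $a$.

Clearly, if $g_a \in L_a$ then it minimizes $J(g)$ subject to $g \in L_a$. Otherwise, it is a generalized minimizer: every sequence $g_n$ in $L_a$ with $J(g_n) \to H(a)$ satisfies $K(g_n, g_a) \to 0$, in particular, $g_n \to g_a$ in $L^1(\mu)$.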

9 Generalized Pythagorean identity:

$$K(g, f) = K(L_a, f) + K(g, g_a), \quad g \in L_a, \ f \in E,$$

where $K(L_a, f) = \inf_{g \in L_a} K(g, f) \ge K(g_a, f)$.

Thus, $g_a$ is the generalized I-projection to $L_a$ of each $f \in E$.

If $a \in \mathrm{ri}(\mathrm{cc}_\varphi(\mu))$, thus $g_a \in E$, then $g_a$ is also the reverse I-projection to $E$ of each $g \in L_a$ with $J(g) < +\infty$, and the duality gap is zero.

$g_a \notin L_a$ can happen if $g_a = f_\vartheta$ with $\vartheta$ on the boundary of $\mathrm{dom}(\Lambda)$; $g_a$ may be the same for several vectors $a$.

Existence of minimizer (I-projection): $g_a \in L_a$ holds for all $a \in \mathrm{ri}(\mathrm{cc}_\varphi(\mu))$ if and only if $\Lambda$ is steep, and for all $a \in \mathrm{ri}(F)$ if and only if $\Lambda_F$ is steep.

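10 Theorem 2 (Csiszár - Matúš): If $H^{**}(a) = \sup_{\vartheta \in \mathbb{R}^d} \ell_a(\vartheta)$ is finite, there exists a unique density $h_a$ such that

$$H^{**}(a) - \ell_a(\vartheta) \ge K(h_a, f_\vartheta), \quad \vartheta \in \mathrm{dom}(\Lambda).$$

Moreover, $h_a \in E_F$ where $F$ is the largest face of $\mathrm{cc}_\varphi(\mu)$ with $a \in \mathrm{ri}(F) + \mathrm{barr}(\mathrm{dom}(\Lambda))$.

Here barr denotes the barrier cone: for any convex set $C \subseteq \mathbb{R}^d$, $\mathrm{barr}(C) = \{ b : \sup_{c \in C} \langle b, c \rangle < +\infty \}$.

Supplement: $\mathrm{dom}(H^{**}) = \mathrm{cc}_\varphi(\mu) + \mathrm{barr}(\mathrm{dom}(\Lambda))$.

The maximum of $\ell_a(\vartheta)$ is attained (MLE exists) if and only if $h_a \in E$. Otherwise, $h_a$ is a generalized MLE: every sequence $\vartheta_n$ in $\mathrm{dom}(\Lambda)$ with $\ell_a(\vartheta_n) \to H^{**}(a)$ satisfies $K(h_a, f_{\vartheta_n}) \to 0$, in particular, $f_{\vartheta_n} \to h_a$ in $L^1(\mu)$.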

11 GENERAL ENTROPY FUNCTIONALS

In the sequel, $\gamma$ is a given strictly convex, differentiable function on $(0, +\infty)$; $\gamma(0)$ is defined as $\lim_{t \downarrow 0} \gamma(t)$; later, $\gamma'(0)$ and $\gamma'(+\infty)$ are also defined as limits.

$\gamma$-entropy of a nonnegative function $g$ on $X$:

$$J_\gamma(g) = \int \gamma(g) \, d\mu$$

Familiar choices of $\gamma$, in addition to $t \log t$:

$\gamma(t) = -\log t$: Burg entropy
$\gamma(t) = \mathrm{sign}(\alpha - 1) \, t^\alpha$: Rényi (Tsallis) entropy

Problem: minimize $J_\gamma(g)$ subject to $g \in L_a$, where $L_a$ is defined slightly differently than before: attention is not restricted to probability densities, accordingly we set $\varphi = (\varphi_0, \varphi_1, \dots, \varphi_d)$ with $\varphi_0$ identically 1, and for $a = (a_0, \dots, a_d) \in \mathbb{R}^{1+d}$,

$$L_a = \left\{ g \ge 0 : \int \varphi g \, d\mu = a \right\}.$$

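12 BASIC TOOLS

The convex conjugate of $\gamma$,

$$\gamma^*(r) = \sup_{t > 0} \, [rt - \gamma(t)],$$

is a nondecreasing convex function, finite and differentiable in $(-\infty, \gamma'(+\infty))$, and its derivative goes to $+\infty$ as $r \uparrow \gamma'(+\infty)$. $\gamma^*(\gamma'(+\infty))$ may or may not be finite.

Denote by $u$ the function on $\mathbb{R}$ equal to $(\gamma^*)'$ in $(-\infty, \gamma'(+\infty))$ and $+\infty$ outside. Then $u(r) = 0$ if $r \le \gamma'(0)$, and $u$ is strictly increasing from 0 to $+\infty$ in the interval $(\gamma'(0), \gamma'(+\infty))$.

Lemma 1. For $r < \gamma'(+\infty)$,

$$\gamma'(u(r)) = \max[\gamma'(0), r] = r + |\gamma'(0) - r|_+ ,$$

$$\gamma(u(r)) + \gamma^*(r) = r \, u(r),$$

where $|x|_+ = \max(x, 0)$ denotes the positive part.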
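As a sanity check of Lemma 1, the sketch below works through one concrete choice, $\gamma(t) = t^2$ (the case $\alpha = 2$ of the power family). This choice and the closed forms used for it are assumptions made for illustration: here $\gamma'(0) = 0$, $\gamma'(+\infty) = +\infty$, $\gamma^*(r) = r^2/4$ for $r > 0$ and $0$ otherwise, and $u(r) = \max(r, 0)/2$. The closed form of the conjugate is also compared against a crude grid supremum.

```python
# Assumed example: gamma(t) = t^2, so gamma'(t) = 2t and gamma'(0) = 0.
GAMMA_PRIME_AT_0 = 0.0

def gamma(t):
    return t * t

def gamma_prime(t):
    return 2.0 * t

def gamma_conj(r):
    """Closed form of sup_{t>0} [r t - t^2]."""
    return r * r / 4.0 if r > 0 else 0.0

def gamma_conj_grid(r, step=1e-3, n=10000):
    """Crude grid approximation of the same supremum, for cross-checking."""
    return max(r * (k * step) - (k * step) ** 2 for k in range(1, n + 1))

def u(r):
    """Derivative of gamma_conj; equals 0 for r <= gamma'(0)."""
    return max(r, 0.0) / 2.0

def positive_part(x):
    return max(x, 0.0)

checks = []
for r in [-2.0, -0.5, 0.0, 0.3, 1.0, 4.0]:
    first = gamma_prime(u(r)) - (r + positive_part(GAMMA_PRIME_AT_0 - r))
    second = gamma(u(r)) + gamma_conj(r) - r * u(r)
    checks.append((r, first, second))
    print(r, first, second)
```

Each printed pair of residuals should be zero up to rounding, including for $r < \gamma'(0)$, where $u(r) = 0$ and the positive-part term is active.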

13 For nonnegative numbers $t, s$ define

$$\Delta_\gamma(t, s) = \gamma(t) - [\gamma(s) + \gamma'(s)(t - s)]$$

(not meaningful for $s = 0$ if $\gamma'(0) = -\infty$; then we set $\Delta_\gamma(0, 0) = 0$, $\Delta_\gamma(t, 0) = +\infty$ if $t > 0$).

Bregman distance of nonnegative functions $g, h$ on $X$:

$$B_\gamma(g, h) = \int \Delta_\gamma(g, h) \, d\mu$$

Clearly, $B_\gamma(g, h) \ge 0$, with equality iff $g = h$ $[\mu]$.
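To make the Bregman distance concrete, the following sketch evaluates it on an assumed four-point space with a weight vector `mu` standing in for the measure. For $\gamma(t) = t \log t$ the integrand reduces to $t \log(t/s) - t + s$, so for two probability densities the Bregman distance coincides with the I-divergence; for $\gamma(t) = t^2$ it is the squared $L^2$ distance. The data and helper names are hypothetical.

```python
import math

def bregman_integrand(gamma, gamma_prime, t, s):
    """gamma(t) - [gamma(s) + gamma'(s) (t - s)], for t, s > 0."""
    return gamma(t) - (gamma(s) + gamma_prime(s) * (t - s))

def bregman_distance(gamma, gamma_prime, g, h, mu):
    """Weighted sum of the integrand over a finite space with weights mu."""
    return sum(m * bregman_integrand(gamma, gamma_prime, gi, hi)
               for gi, hi, m in zip(g, h, mu))

# (gamma, gamma') pairs for two choices of gamma
shannon = (lambda t: t * math.log(t), lambda t: math.log(t) + 1.0)
square = (lambda t: t * t, lambda t: 2.0 * t)

mu = [1.0, 1.0, 1.0, 1.0]          # counting measure on four points (assumed)
g = [0.1, 0.4, 0.3, 0.2]
h = [0.25, 0.25, 0.25, 0.25]

b_shannon = bregman_distance(*shannon, g, h, mu)
i_divergence = sum(m * gi * math.log(gi / hi) for gi, hi, m in zip(g, h, mu))

b_square = bregman_distance(*square, g, h, mu)
l2_squared = sum(m * (gi - hi) ** 2 for gi, hi, m in zip(g, h, mu))

print(b_shannon, i_divergence)
print(b_square, l2_squared)
```

The first pair agrees only because `g` and `h` have the same total mass; for general nonnegative functions the extra term $\int (h - g) \, d\mu$ remains.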

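14 KEY IDENTITY

Denote:

$L_a$: the family of nonnegative (measurable) functions $g$ on $X$ satisfying the constraints $\int g \varphi \, d\mu = a$ ($a = (a_0, \dots, a_d) \in \mathbb{R}^{1+d}$)

$F_\gamma$: the family of functions $f_\vartheta = u(\langle \vartheta, \varphi \rangle)$ with $\vartheta \in \mathbb{R}^{1+d}$ such that $\int \gamma^*(\langle \vartheta, \varphi \rangle) \, d\mu$ is finite, and $\langle \vartheta, \varphi \rangle < \gamma'(+\infty)$ $[\mu]$

Key identity: For $g \in L_a$ and $f_\vartheta \in F_\gamma$,

$$J_\gamma(g) - \left[ \langle \vartheta, a \rangle - \int \gamma^*(\langle \vartheta, \varphi \rangle) \, d\mu \right] = B_\gamma(g, f_\vartheta) + \int g \, |\gamma'(0) - \langle \vartheta, \varphi \rangle|_+ \, d\mu,$$

where $|x|_+ = \max(x, 0)$.

Proof: Immediate, using Lemma 1.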
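The key identity can be verified by direct computation in a small case. The sketch assumes $\gamma(t) = t^2$, $X = \{0, 1, 2, 3\}$ with counting measure, and $\varphi = (1, x)$; these choices, the particular $g$ and $\vartheta$, and all names are illustrative only. The parameter is chosen so that $\langle \vartheta, \varphi \rangle$ is negative at one point, which makes the positive-part correction term nonzero.

```python
# Assumed toy setting: gamma(t) = t^2 on X = {0,1,2,3}, counting measure, phi = (1, x).
xs = [0.0, 1.0, 2.0, 3.0]
GAMMA_PRIME_AT_0 = 0.0                     # gamma'(0) for gamma(t) = t^2

def gamma(t):
    return t * t

def gamma_conj(r):
    return r * r / 4.0 if r > 0 else 0.0   # conjugate of t^2 over t > 0

def u(r):
    return max(r, 0.0) / 2.0               # derivative of gamma_conj

def bregman_integrand(t, s):
    return gamma(t) - (gamma(s) + 2.0 * s * (t - s))

g = [0.5, 1.0, 0.2, 0.8]                   # nonnegative, not a probability density
a = (sum(g), sum(gi * x for gi, x in zip(g, xs)))   # moments of g, so g is in L_a

theta = (-1.0, 1.0)
r = [theta[0] + theta[1] * x for x in xs]  # <theta, phi>(x) = x - 1
f = [u(ri) for ri in r]                    # f_theta = u(<theta, phi>)

dual_value = (theta[0] * a[0] + theta[1] * a[1]
              - sum(gamma_conj(ri) for ri in r))
lhs = sum(gamma(gi) for gi in g) - dual_value

bregman = sum(bregman_integrand(gi, fi) for gi, fi in zip(g, f))
correction = sum(gi * max(GAMMA_PRIME_AT_0 - ri, 0.0) for gi, ri in zip(g, r))
rhs = bregman + correction

print(lhs, rhs, correction)
```

Dropping `correction` would break the equality here, which shows why the identity carries the extra term whenever $f_\vartheta$ vanishes on a set where $g$ does not.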

15 Proposition: If $L_a \cap F_\gamma \ne \emptyset$, it consists of a single function $g^* = f_{\vartheta^*}$; this $g^*$ minimizes $J_\gamma(g)$ subject to $g \in L_a$, and $\vartheta^*$ maximizes $\langle \vartheta, a \rangle - \int \gamma^*(\langle \vartheta, \varphi \rangle) \, d\mu$; these minimum and maximum are equal. [But $\vartheta^*$ need not be unique, only $f_{\vartheta^*}$ is.]

Proof: Immediate from the key identity.

The family $F_\gamma$ is the $\gamma$-analogue of an exponential family in the theory of Shannon entropy maximization.

While the functions in $F_\gamma$ need not be probability densities, in the case $\gamma(t) = t \log t$ they are exactly the constant multiples of the probability densities in $F_\gamma$, which form an exponential family in the familiar statistical sense. For other $\gamma$, however, no simple way is apparent to identify the probability densities in $F_\gamma$.

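16 Convex conjugate of $H_\gamma(a) = \inf_{g \in L_a} J_\gamma(g)$:

$$H_\gamma^*(\vartheta) = \sup_{a \in \mathbb{R}^{1+d}} \left[ \langle \vartheta, a \rangle - H_\gamma(a) \right], \quad \vartheta \in \mathbb{R}^{1+d}.$$

Lemma 2: If $\mathrm{dom}(H_\gamma) \ne \emptyset$, thus there exists some $g$ with $\int \gamma(g) \, d\mu < +\infty$ and $g \varphi_i$ integrable for $i = 0, \dots, d$, then

$$H_\gamma^*(\vartheta) = \int \gamma^*(\langle \vartheta, \varphi \rangle) \, d\mu .$$

Dual problem: find the dual value

$$H_\gamma^{**}(a) = \sup_{\vartheta \in \mathbb{R}^{1+d}} \left[ \langle \vartheta, a \rangle - H_\gamma^*(\vartheta) \right],$$

and if it is finite, find $\vartheta \in \mathbb{R}^{1+d}$ that attains the maximum, if such $\vartheta$ exists (dual attainment). In the latter case, the function $f_\vartheta = u(\langle \vartheta, \varphi \rangle)$ will be called the dual solution, rather than $\vartheta$ itself.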

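17 Lemma 3: For $\vartheta \in \mathrm{dom}(H_\gamma^*)$, the directional derivative of $H_\gamma^*$ at $\vartheta$, in a direction $\tau$, exists and equals

$$\int f_\vartheta \langle \tau, \varphi \rangle \, d\mu < +\infty$$

whenever $\vartheta + t\tau \in \mathrm{dom}(H_\gamma^*)$ for some $t > 0$. In particular, $H_\gamma^*$ is differentiable in the interior of its essential domain, with the gradient equal to $\int f_\vartheta \varphi \, d\mu$.

Corollary: For $a \in \mathbb{R}^{1+d}$ with $H_\gamma^{**}(a)$ finite, a dual solution satisfies $\int f_\vartheta \, d\mu \le a_0$.

Proof: Straightforward calculus. Differentiation within the integral is justified by monotone convergence.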

18 Proposition 2: If $H_\gamma(a)$ is finite and $a$ is in the relative interior of $\mathrm{dom}(H_\gamma)$ then the primal and dual values are equal, and dual attainment holds. Moreover, the dual solution $f_\vartheta$ satisfies for each $g \in L_a$

$$J_\gamma(g) = H_\gamma(a) + B_\gamma(g, f_\vartheta) + \int g \, |\gamma'(0) - \langle \vartheta, \varphi \rangle|_+ \, d\mu .$$

If, in addition, $H_\gamma^*(\vartheta) = \int \gamma^*(\langle \vartheta, \varphi \rangle) \, d\mu$ is essentially smooth then the dual solution $f_\vartheta$ belongs to $L_a$, hence it is a primal solution, too.

Proof: The first assertion is a general convex analysis result. The second assertion follows from it, by the key identity. The last assertion follows by Lemma 3, since if $H_\gamma^*$ is essentially smooth, the maximizing $\vartheta$ has to be in the interior of $\mathrm{dom}(H_\gamma^*)$.

19 Theorem 1: If $\gamma(0) = 0$ then

$$\mathrm{dom}(H_\gamma) = \{ ta : t \ge 0, \ a \in \mathrm{cc}_\varphi(\mu) \}.$$

If $\gamma(0) = +\infty$ (and $\mu(X) < +\infty$) then $\mathrm{dom}(H_\gamma)$ is either empty, or

$$\mathrm{dom}(H_\gamma) = \{ ta : t > 0, \ a \in \mathrm{ri}(\mathrm{cc}_\varphi(\mu)) \}.$$

In both cases, $\mathrm{dom}(H_\gamma)$ is a cone that has a simple description in terms of $\mathrm{cc}_\varphi(\mu)$.

In the case $\gamma(0) = +\infty$, no general criterion appears available to determine whether $\mathrm{dom}(H_\gamma)$ is empty or not, but if it is nonempty, then Proposition 2 tells the story, as $\mathrm{dom}(H_\gamma)$ is relatively open.

In the sequel, we concentrate on the case $\gamma(0) = 0$.

20 Given nonzero $a \in \mathrm{dom}(H_\gamma)$, equivalently $a_0 > 0$ and $a_0^{-1} a \in \mathrm{cc}_\varphi(\mu)$, let $F(a)$ denote the face of $\mathrm{cc}_\varphi(\mu)$ whose relative interior contains $a_0^{-1} a$.

Let $\nu$ denote the restriction of $\mu$ to $X_{F(a)} = \varphi^{-1}(\mathrm{cl}(F(a)))$.

Each $g \in L_a$ vanishes outside $X_{F(a)}$, hence

$$H_\gamma(a) = \inf_{g \in L_a} \int \gamma(g) \, d\nu .$$

Proposition 2 applies to this $a$ and the measure $\nu$ in the role of $\mu$, because $a_0^{-1} a$ is in the relative interior of the face $F(a)$, which equals $\mathrm{cc}_\varphi(\nu)$.

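21 It follows, provided $H_\gamma(a) > -\infty$, that the maximum of

$$\langle \vartheta, a \rangle - \int \gamma^*(\langle \vartheta, \varphi \rangle) \, d\nu$$

is attained, and with a maximizing $\vartheta$, the function

$$f_{F,\vartheta} = u(\langle \vartheta, \varphi \rangle) \, 1_{X_{F(a)}}, \qquad X_{F(a)} = \varphi^{-1}(\mathrm{cl}(F(a))),$$

satisfies for all $g \in L_a$

$$J_\gamma(g) = H_\gamma(a) + B_\gamma(g, f_{F,\vartheta}) + \int g \, |\gamma'(0) - \langle \vartheta, \varphi \rangle|_+ \, d\mu .$$

Theorem 2: To every $a \ne 0$ with $H_\gamma(a)$ finite, there exists a (unique) generalized primal solution, of the form $f_{F,\vartheta} = u(\langle \vartheta, \varphi \rangle) \, 1_{X_{F(a)}}$, and it satisfies the above identity.

Essential smoothness of $\int_{X_{F(a)}} \gamma^*(\langle \vartheta, \varphi \rangle) \, d\mu$ is a sufficient condition for primal attainment.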