MINIMIZATION OF ENTROPY FUNCTIONALS UNDER MOMENT CONSTRAINTS

I. Csiszár (Budapest)

Given a σ-finite measure space (X, 𝒳, µ) and a d-tuple ϕ = (ϕ_1, ..., ϕ_d) of measurable functions on X, for a = (a_1, ..., a_d) ∈ R^d let L_a denote the family of probability density functions g on X satisfying ∫ϕg dµ = a, that is, ∫ϕ_i g dµ = a_i, i = 1, ..., d.

Extensively studied problem: minimize

  J(g) = ∫ g log g dµ   (negative Shannon entropy)

or

  K(g, h) = ∫ g log (g/h) dµ   (Kullback-Leibler distance, I-divergence, relative entropy)

subject to g ∈ L_a.
First this problem, then its extension to other entropies and distances will be considered.

For ϑ = (ϑ_1, ..., ϑ_d) ∈ R^d denote

  Λ(ϑ) = log ∫ e^⟨ϑ,ϕ⟩ dµ,   ⟨ϑ, ϕ⟩ = Σ_{i=1}^d ϑ_i ϕ_i.

Assume: dom(Λ) = {ϑ : Λ(ϑ) < +∞} is nonempty.

Not hard to show: Λ(ϑ) is the convex conjugate of the function H(a) = inf_{g ∈ L_a} J(g):

  Λ(ϑ) = H*(ϑ) = sup_{a ∈ R^d} [⟨ϑ, a⟩ − H(a)].

Dual problem associated with the primal problem of minimizing J(g) subject to g ∈ L_a: maximize

  l_a(ϑ) = ⟨ϑ, a⟩ − Λ(ϑ)  for ϑ ∈ R^d.

The supremum of l_a(ϑ) is the convex conjugate of Λ(ϑ), thus the second conjugate H**(a) of H(a). Always H(a) ≥ H**(a); the difference is called the duality gap.
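The dual problem is easy to illustrate numerically. Below is a minimal sketch on a finite set with counting measure; the choices X = {0,...,5}, ϕ(x) = x, and a = 2 are hypothetical, made only for illustration. Maximizing l_a(ϑ) recovers the entropy-minimizing (maximum-entropy) density.

```python
import numpy as np
from scipy.optimize import minimize

# Toy discrete example: X = {0,...,5}, mu = counting measure, phi(x) = x
# (d = 1).  Minimizing J(g) = sum g log g subject to sum(phi * g) = a is
# solved via the dual: maximize l_a(theta) = <theta, a> - Lambda(theta).
x = np.arange(6.0)
phi = x                      # canonical statistic
a = 2.0                      # target mean, interior of cc_phi(mu) = [0, 5]

def Lambda(theta):
    return np.log(np.exp(theta * phi).sum())

res = minimize(lambda t: Lambda(t[0]) - t[0] * a, x0=[0.0])
theta = res.x[0]
g = np.exp(theta * phi - Lambda(theta))   # minimizer g_a = f_theta in E

print(g.sum())          # ≈ 1: f_theta is a probability density
print((phi * g).sum())  # ≈ 2: the moment constraint holds at the optimum
```

At the maximizing ϑ the gradient condition ∫ϕ f_ϑ dµ = a holds, which is why the dual solution automatically lies in L_a here (a is interior, so the duality gap is zero).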
Exponential family with canonical statistic ϕ:

  E = {f_ϑ = e^{⟨ϑ,ϕ⟩ − Λ(ϑ)} : ϑ ∈ dom(Λ)}.

When the empirical mean (1/n) Σ_{j=1}^n ϕ(x_j) of ϕ in a sample x_1, ..., x_n drawn from a density in E is equal to a, the normalized log-likelihood function is l_a(ϑ); for this a, the dual problem means ML estimation.

Moreover, if g ∈ L_a then

  (lik.id)  K(g, f_ϑ) = J(g) − l_a(ϑ),  ϑ ∈ dom(Λ),

hence, provided J(g) is finite, the dual problem is equivalent to minimizing K(g, f_ϑ) for f_ϑ ∈ E.
Note: this interpretation of the dual problem does not apply if a ∉ dom(H), in which case J(g) = +∞ for all g ∈ L_a, even though H**(a) < H(a) = +∞ is possible.

Elementary proposition: If L_a ∩ E ≠ ∅, it contains a single g_a, and for this

  J(g) = J(g_a) + K(g, g_a),  g ∈ L_a;

equivalently, the Pythagorean identity holds:

  K(g, f) = K(g_a, f) + K(g, g_a),  g ∈ L_a, f ∈ E.

In this case, the duality gap is 0: H(a) = H**(a) = J(g_a), and the common member g_a of L_a and E is simultaneously the I-projection to L_a of each f ∈ E and the reverse I-projection to E of each g ∈ L_a with J(g) < +∞.
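The Pythagorean identity can be verified numerically. The sketch below uses hypothetical toy choices (X = {0,...,5}, ϕ(x) = x, a = 2, counting measure): g_a is found by solving the dual, g is another member of L_a, and f is an arbitrary member of E.

```python
import numpy as np
from scipy.optimize import minimize

# Numerical check of K(g, f) = K(g_a, f) + K(g, g_a) on a toy space.
phi = np.arange(6.0)   # X = {0,...,5}, phi(x) = x
a = 2.0

def K(p, q):  # I-divergence; support of p assumed inside support of q
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

# g_a = the common member of L_a and E, found via the dual problem
Lambda = lambda t: np.log(np.exp(t * phi).sum())
theta_a = minimize(lambda t: Lambda(t[0]) - t[0] * a, [0.0]).x[0]
g_a = np.exp(theta_a * phi - Lambda(theta_a))

g = np.array([0.2, 0.2, 0.2, 0.2, 0.2, 0.0])  # another member of L_a (mean 2)
f = np.exp(0.3 * phi - Lambda(0.3))            # an arbitrary member of E

lhs = K(g, f)
rhs = K(g_a, f) + K(g, g_a)
print(abs(lhs - rhs))   # ≈ 0: the Pythagorean identity holds
```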
HISTORY HINTS

Boltzmann, Gibbs: 19th century
Jaynes, Kullback: in the fifties
Čencov 1972: information projections, differential-geometric approach
Barndorff-Nielsen 1977: convex analysis approach to MLE for exponential families
Csiszár 1975, Topsøe 1979: generalized minimizer when the minimum is not attained (Shannon case)
Csiszár 1991: axiomatic approach
Borwein and Lewis 1991: convex analysis approach for general entropies
Csiszár 1995: generalized minimizer, general case

Several recent works employ advanced Orlicz space techniques (Léonard 2001-2007) or differential geometry (Amari and Nagaoka 2000, etc.)
This talk is based on works of Csiszár and Matúš 2001-2008 and hopefully will show that classical tools suffice for treating the problem efficiently.

Convex core of a finite measure Q on R^d (Csiszár - Matúš 2001):

  cc(Q) = intersection of all convex Borel sets with full Q-measure
        = set of means of all probability measures P ≪ Q that have a mean.

For the measure µ on X, define

  cc_ϕ(µ) = {∫ϕg dµ : g prob. density, ϕg integrable} = {a ∈ R^d : L_a ≠ ∅}.

If µ is finite then cc_ϕ(µ) = cc(µ_ϕ), where µ_ϕ is the image of µ on R^d under ϕ.
Lemma: If a ∈ cc_ϕ(µ), there exists g ∈ L_a with µ({x : g(x) > 0}) < +∞, g bounded.

Corollary: dom(H) = cc_ϕ(µ), that is, the necessary condition L_a ≠ ∅ for H(a) = inf_{g ∈ L_a} J(g) < +∞ is sufficient as well.

Face of a convex set C ⊆ R^d: a nonempty convex subset F ⊆ C such that a convex combination tx + (1−t)y of x ∈ C and y ∈ C (with 0 < t < 1) belongs to F only if x, y ∈ F.

For a face F of cc_ϕ(µ), denote X_F = {x : ϕ(x) ∈ cl(F)}.

Lemma: For a in a face F of cc_ϕ(µ), each g ∈ L_a vanishes outside X_F (µ-a.e.)
Extended exponential family ext E: the union of the families E_F for all faces F of cc_ϕ(µ), where

  E_F = {f_{F,ϑ} = e^{⟨ϑ,ϕ⟩ − Λ_F(ϑ)} 1_{X_F} : ϑ ∈ dom(Λ_F)},
  Λ_F(ϑ) = log ∫_{X_F} e^{⟨ϑ,ϕ⟩} dµ.

Theorem 1 (Csiszár - Matúš 2003): Whenever L_a ≠ ∅, thus a ∈ cc_ϕ(µ), there exists a unique g_a, perhaps not in L_a, such that

  J(g) = H(a) + K(g, g_a),  g ∈ L_a.

Moreover g_a ∈ E_F, for the face F of cc_ϕ(µ) whose relative interior contains a.

Clearly, if g_a ∈ L_a then it minimizes J(g) subject to g ∈ L_a. Otherwise, it is a generalized minimizer: every sequence g_n in L_a with J(g_n) → H(a) satisfies K(g_n, g_a) → 0, in particular, g_n → g_a in L_1(µ).
Generalized Pythagorean identity:

  K(g, f) = K(L_a, f) + K(g, g_a),  g ∈ L_a, f ∈ E,

where K(L_a, f) = inf_{g ∈ L_a} K(g, f) ≥ K(g_a, f). Thus, g_a is the generalized I-projection to L_a of each f ∈ E.

If a ∈ ri(cc_ϕ(µ)), thus g_a ∈ E, then g_a is also the reverse I-projection to E of each g ∈ L_a with J(g) < +∞, and the duality gap is zero.

g_a ∉ L_a can happen if g_a = f_ϑ with ϑ on the boundary of dom(Λ); g_a may be the same for several vectors a.

Existence of minimizer (I-projection): g_a ∈ L_a holds for all a ∈ ri(cc_ϕ(µ)) if and only if Λ is steep, and for all a ∈ ri(F) if and only if Λ_F is steep.
Theorem 2 (Csiszár - Matúš 2003, 2008): If H**(a) = sup_{ϑ ∈ R^d} l_a(ϑ) is finite, there exists a unique density h_a such that

  H**(a) − l_a(ϑ) ≥ K(h_a, f_ϑ),  ϑ ∈ dom(Λ).

Moreover, h_a ∈ E_F, where F is the largest face of cc_ϕ(µ) with a ∈ ri(F) + barr(dom(Λ)). Here barr denotes the barrier cone: for any convex set C ⊆ R^d,

  barr(C) = {b : sup_{c ∈ C} ⟨b, c⟩ < +∞}.

Supplement: dom(H**) = cc_ϕ(µ) + barr(dom(Λ)).

The maximum of l_a(ϑ) is attained (MLE exists) if and only if h_a ∈ E. Otherwise, h_a is a generalized MLE: every sequence ϑ_n in dom(Λ) with l_a(ϑ_n) → H**(a) satisfies K(h_a, f_{ϑ_n}) → 0, in particular, f_{ϑ_n} → h_a in L_1(µ).
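A toy illustration of the generalized MLE: with the hypothetical choices X = {0,...,5}, ϕ(x) = x and a = 0 (an extreme point of cc_ϕ(µ) = [0, 5]), the supremum of l_0(ϑ) = −Λ(ϑ) is 0 but is not attained, while f_ϑ converges in L_1 to h_0 = 1_{x=0} ∈ E_F as ϑ → −∞.

```python
import numpy as np

# For a = 0 on the boundary of cc_phi(mu), l_0(theta) = -Lambda(theta)
# increases to 0 as theta -> -inf; the generalized MLE is h_0 = 1_{x=0}.
phi = np.arange(6.0)
h0 = np.array([1.0, 0, 0, 0, 0, 0])   # member of E_F for the face F = {0}

for theta in [-1.0, -5.0, -20.0]:
    Lam = np.log(np.exp(theta * phi).sum())
    f = np.exp(theta * phi - Lam)
    # l_0(theta) = -Lam approaches 0; the L1 distance to h0 shrinks
    print(theta, -Lam, np.abs(f - h0).sum())
```

No finite ϑ attains the supremum, matching the theorem: h_0 lies in the extended family E_F but not in E.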
GENERAL ENTROPY FUNCTIONALS

In the sequel, γ is a given strictly convex, differentiable function on (0, +∞); γ(0) is defined as lim_{t↓0} γ(t); later, γ′(0), γ′(+∞) are also defined as limits.

γ-entropy of a nonnegative function g on X:

  J_γ(g) = ∫ γ(g) dµ.

Familiar choices of γ, in addition to t log t:

  γ(t) = −log t   (Burg entropy)
  γ(t) = sign(α − 1) t^α   (Rényi / Tsallis entropy)

Problem: minimize J_γ(g) subject to g ∈ L_a, where L_a is defined slightly differently than before: attention is not restricted to probability densities; accordingly we set ϕ = (ϕ_0, ϕ_1, ..., ϕ_d) with ϕ_0 identically 1, and for a = (a_0, ..., a_d) ∈ R^{1+d},

  L_a = {g ≥ 0 : ∫ϕg dµ = a}.
BASIC TOOLS

The convex conjugate of γ, γ*(r) = sup_{t>0} [rt − γ(t)], is a nondecreasing convex function, finite and differentiable in (−∞, γ′(+∞)), and its derivative goes to +∞ as r → γ′(+∞). γ*(γ′(+∞)) may or may not be finite.

Denote by u the function on R equal to (γ*)′ in (−∞, γ′(+∞)) and +∞ outside. Then u(r) = 0 if r ≤ γ′(0), and u is strictly increasing from 0 to +∞ in the interval (γ′(0), γ′(+∞)).

Lemma 1. For r < γ′(+∞):

  γ′(u(r)) = max[γ′(0), r] = r + (γ′(0) − r)⁺,
  γ(u(r)) + γ*(r) = r u(r).
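The identities of Lemma 1 are easy to check in closed form for the two familiar entropies; the sketch below verifies them numerically (the closed forms for γ*, u are standard and easy to derive by hand).

```python
import numpy as np

# Shannon: gamma(t) = t log t, gamma'(t) = log t + 1, gamma'(0) = -inf,
# gamma'(+inf) = +inf, gamma*(r) = e^{r-1}, u(r) = (gamma*)'(r) = e^{r-1}.
for r in [-2.0, 0.0, 1.5]:
    u = np.exp(r - 1)
    # gamma'(u(r)) = max[gamma'(0), r] = r, since gamma'(0) = -inf
    assert np.isclose(np.log(u) + 1, r)
    # gamma(u(r)) + gamma*(r) = r u(r)
    assert np.isclose(u * np.log(u) + np.exp(r - 1), r * u)

# Burg: gamma(t) = -log t, gamma'(t) = -1/t, gamma'(0) = -inf,
# gamma'(+inf) = 0; for r < 0: gamma*(r) = -1 - log(-r), u(r) = -1/r.
for r in [-3.0, -0.5]:
    u = -1.0 / r
    assert np.isclose(-1.0 / u, r)                           # gamma'(u(r)) = r
    assert np.isclose(-np.log(u) + (-1 - np.log(-r)), r * u) # both equal -1

print("Lemma 1 identities verified")
```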
For nonnegative numbers t, s define

  Δγ(t, s) = γ(t) − [γ(s) + γ′(s)(t − s)]

(not meaningful for s = 0 if γ′(0) = −∞; then we set Δγ(0, 0) = 0, Δγ(t, 0) = +∞ if t > 0).

Bregman distance of nonnegative functions g, h on X:

  B_γ(g, h) = ∫ Δγ(g, h) dµ.

Clearly, B_γ(g, h) ≥ 0, with equality iff g = h [µ].
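For γ(t) = t log t the Bregman distance reduces to a generalized I-divergence: Δγ(t, s) = t log(t/s) − t + s, so on probability densities B_γ coincides with K(g, h). A short numerical sketch (discrete space, counting measure; all data are randomly generated for illustration):

```python
import numpy as np

# Bregman distance for gamma(t) = t log t on a discrete space.
rng = np.random.default_rng(0)

def bregman_shannon(g, h):
    # Delta_gamma(t, s) = t log(t/s) - t + s for gamma(t) = t log t
    return float((g * np.log(g / h) - g + h).sum())

g = rng.random(5); g /= g.sum()
h = rng.random(5); h /= h.sum()

B = bregman_shannon(g, h)
KL = float((g * np.log(g / h)).sum())
# Both are probability vectors, so the -g + h terms cancel: B = K(g, h)
print(B >= 0, np.isclose(B, KL))
```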
KEY IDENTITY

Denote:

L_a: the family of nonnegative (measurable) functions g on X satisfying the constraints ∫gϕ dµ = a (a = (a_0, ..., a_d) ∈ R^{1+d}).

F_γ: the family of functions f_ϑ = u(⟨ϑ, ϕ⟩) with ϑ ∈ R^{1+d} such that ∫γ*(⟨ϑ, ϕ⟩) dµ is finite and ⟨ϑ, ϕ⟩ < γ′(+∞) [µ].

Key identity: For g ∈ L_a and f_ϑ ∈ F_γ,

  J_γ(g) − [⟨ϑ, a⟩ − ∫γ*(⟨ϑ, ϕ⟩) dµ] = B_γ(g, f_ϑ) + ∫ g (γ′(0) − ⟨ϑ, ϕ⟩)⁺ dµ.

Proof: Immediate, using Lemma 1.
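The key identity can be checked numerically. The sketch below uses γ(t) = t log t on a toy discrete space with ϕ = (1, x) (ϕ_0 identically 1); for this γ, γ′(0) = −∞, so the last term of the identity vanishes. All concrete choices (support, ϑ, the random g) are illustrative.

```python
import numpy as np

# Key identity for gamma(t) = t log t: gamma*(r) = e^{r-1}, u(r) = e^{r-1}.
x = np.arange(5.0)
phi = np.vstack([np.ones(5), x])          # rows: phi_0 = 1, phi_1 = x

rng = np.random.default_rng(1)
g = rng.random(5)                          # an arbitrary nonnegative g
a = phi @ g                                # its moment vector, so g is in L_a

theta = np.array([-0.5, 0.2])              # arbitrary; f_theta is in F_gamma
r = theta @ phi                            # <theta, phi> pointwise
f = np.exp(r - 1)                          # f_theta = u(<theta, phi>)

J = (g * np.log(g)).sum()                  # J_gamma(g)
dual_val = theta @ a - np.exp(r - 1).sum() # <theta, a> - integral of gamma*
B = (g * np.log(g / f) - g + f).sum()      # Bregman distance B_gamma(g, f)

print(np.isclose(J - dual_val, B))         # True: the key identity
```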
Proposition: If L_a ∩ F_γ ≠ ∅, it consists of a single function g = f_ϑ; this g minimizes J_γ(g) subject to g ∈ L_a, and ϑ maximizes ⟨ϑ, a⟩ − ∫γ*(⟨ϑ, ϕ⟩) dµ; this minimum and this maximum are equal. [But ϑ need not be unique, only f_ϑ is.]

Proof: Immediate from the key identity.

The family F_γ is the γ-analogue of an exponential family in the theory of Shannon entropy maximization. While the functions in F_γ need not be probability densities, in the case γ(t) = t log t they are exactly the constant multiples of the probability densities in F_γ, which form an exponential family in the familiar statistical sense. For other γ, however, no simple way is apparent to identify the probability densities in F_γ.
Convex conjugate of H_γ(a) = inf_{g ∈ L_a} J_γ(g):

  H_γ*(ϑ) = sup_{a ∈ R^{1+d}} [⟨ϑ, a⟩ − H_γ(a)],  ϑ ∈ R^{1+d}.

Lemma 2: If dom(H_γ) ≠ ∅, thus there exists some g with ∫γ(g) dµ < +∞ and gϕ_i integrable for i = 0, ..., d, then

  H_γ*(ϑ) = ∫γ*(⟨ϑ, ϕ⟩) dµ.

Dual problem: find the dual value

  H_γ**(a) = sup_{ϑ ∈ R^{1+d}} [⟨ϑ, a⟩ − H_γ*(ϑ)],

and if it is finite, find ϑ ∈ R^{1+d} that attains the maximum, if such ϑ exists (dual attainment). In the latter case, the function f_ϑ = u(⟨ϑ, ϕ⟩) will be called the dual solution, rather than ϑ itself.
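To show the general machinery with a γ other than t log t, the sketch below solves the dual problem for the Burg entropy γ(t) = −log t on a toy discrete space; all concrete choices (X = {1,...,5}, ϕ = (1, x), a = (1, 3)) are illustrative. For this γ, γ*(r) = −1 − log(−r) for r < 0 and u(r) = −1/r, so the dual solution is f_ϑ(x) = −1/⟨ϑ, ϕ(x)⟩.

```python
import numpy as np
from scipy.optimize import minimize

# Dual problem for the Burg entropy: maximize <theta, a> - sum gamma*(<theta, phi>).
x = np.arange(1.0, 6.0)
phi = np.vstack([np.ones(5), x])     # phi_0 = 1 handles normalization
a = np.array([1.0, 3.0])             # a_0 = 1: probability density, mean 3

def neg_dual(theta):
    r = theta @ phi
    if np.any(r >= 0):               # outside the domain of gamma*
        return 1e10                  # penalty keeps the search feasible
    return -(theta @ a - (-1 - np.log(-r)).sum())

res = minimize(neg_dual, x0=np.array([-1.0, 0.0]), method="Nelder-Mead",
               options={"xatol": 1e-10, "fatol": 1e-12, "maxiter": 10000})
f = -1.0 / (res.x @ phi)             # dual solution f_theta = u(<theta, phi>)

print(f)          # ≈ 0.2 each: mean 3 is the center, so f is uniform here
print(phi @ f)    # ≈ (1, 3): f_theta meets the moment constraints
```

At the optimum the gradient condition of Lemma 3 (∫f_ϑ ϕ dµ = a) forces f_ϑ into L_a, so it is also a primal solution, in line with Proposition 2.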
Lemma 3: For ϑ ∈ dom(H_γ*), the directional derivative of H_γ* at ϑ, in a direction τ, exists and equals ∫f_ϑ ⟨τ, ϕ⟩ dµ < +∞ whenever ϑ + tτ ∈ dom(H_γ*) for some t > 0. In particular, H_γ* is differentiable in the interior of its essential domain, with gradient equal to ∫f_ϑ ϕ dµ.

Corollary: For a ∈ R^{1+d} with H_γ**(a) finite, a dual solution satisfies ∫f_ϑ dµ ≤ a_0.

Proof: Straightforward calculus. Differentiation within the integral is justified by monotone convergence.
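The gradient formula of Lemma 3 can be spot-checked by finite differences. A sketch for γ(t) = t log t on a toy discrete space (all choices illustrative), where H_γ*(ϑ) = Σ e^{⟨ϑ,ϕ⟩−1}:

```python
import numpy as np

# Finite-difference check of grad H_gamma*(theta) = sum f_theta * phi.
x = np.arange(5.0)
phi = np.vstack([np.ones(5), x])     # phi = (1, x)
theta = np.array([-0.3, 0.1])        # arbitrary interior point of dom(H_gamma*)

def Hstar(t):
    return np.exp(t @ phi - 1).sum() # integral of gamma*(<t, phi>)

f = np.exp(theta @ phi - 1)          # f_theta = u(<theta, phi>)
analytic = phi @ f                   # gradient per Lemma 3

eps = 1e-6
numeric = np.array([(Hstar(theta + eps * e) - Hstar(theta - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
print(np.allclose(analytic, numeric, rtol=1e-5))   # True
```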
Proposition 2: If H_γ(a) is finite and a is in the relative interior of dom(H_γ), then the primal and dual values are equal, and dual attainment holds. Moreover, the dual solution f_ϑ satisfies, for each g ∈ L_a,

  J_γ(g) = H_γ(a) + B_γ(g, f_ϑ) + ∫ g (γ′(0) − ⟨ϑ, ϕ⟩)⁺ dµ.

If, in addition, H_γ*(ϑ) = ∫γ*(⟨ϑ, ϕ⟩) dµ is essentially smooth, then the dual solution f_ϑ belongs to L_a, hence it is a primal solution, too.

Proof: The first assertion is a general convex analysis result. The second assertion follows from it, by the key identity. The last assertion follows by Lemma 3, since if H_γ* is essentially smooth, the maximizing ϑ has to be in the interior of dom(H_γ*).
Theorem 1: If γ(0) = 0 then

  dom(H_γ) = {ta : t ≥ 0, a ∈ cc_ϕ(µ)}.

If γ(0) = +∞ (and µ(X) < +∞) then dom(H_γ) is either empty, or

  dom(H_γ) = {ta : t > 0, a ∈ ri(cc_ϕ(µ))} = ri(dom(H_γ)).

In both cases, dom(H_γ) is a cone that has a simple description in terms of cc_ϕ(µ). In the case γ(0) = +∞, no general criterion appears available to determine whether dom(H_γ) is empty or not; but if it is nonempty, then Proposition 2 tells the whole story, as dom(H_γ) is relatively open.

In the sequel, we concentrate on the case γ(0) = 0.
Given nonzero a ∈ dom(H_γ), equivalently a_0 > 0 and a_0⁻¹ a ∈ cc_ϕ(µ), let F(a) denote the face of cc_ϕ(µ) whose relative interior contains a_0⁻¹ a. Let ν denote the restriction of µ to X_{F(a)} = ϕ⁻¹(cl F(a)).

Each g ∈ L_a vanishes outside X_{F(a)}, hence

  H_γ(a) = inf_{g ∈ L_a} ∫γ(g) dν.

Proposition 2 applies to this a and the measure ν in the role of µ, because a_0⁻¹ a is in the relative interior of the face F(a), which equals cc_ϕ(ν).
It follows, provided H_γ(a) > −∞, that the maximum of ⟨ϑ, a⟩ − ∫γ*(⟨ϑ, ϕ⟩) dν is attained, and with a maximizing ϑ, the function

  f_{F,ϑ} = u(⟨ϑ, ϕ⟩) 1_{X_{F(a)}}

satisfies for all g ∈ L_a

  J_γ(g) = H_γ(a) + B_γ(g, f_{F,ϑ}) + ∫ g (γ′(0) − ⟨ϑ, ϕ⟩)⁺ dµ.

Theorem 2: To every a ≠ 0 with H_γ(a) finite, there exists a (unique) generalized primal solution, of the form f_{F,ϑ} = u(⟨ϑ, ϕ⟩) 1_{X_{F(a)}}, and it satisfies the above identity. Essential smoothness of ∫_{X_{F(a)}} γ*(⟨ϑ, ϕ⟩) dµ is a sufficient condition for primal attainment.