From Maxent to Machine Learning and Back
T. Sears (ANU)
Maxent 2007, March 2007
50 Years Ago...
"The principles and mathematical methods of statistical mechanics are seen to be of much more general applicability... In the problem of prediction, the maximization of entropy is not an application of a law of physics, but merely a method of reasoning which ensures that no unconscious arbitrary assumptions have been introduced."
E.T. Jaynes, 1957
... a method of reasoning...
"Jenkins, if I want another yes-man I'll build one."
Outline
1 Generalizing Maxent
2 Two Examples
3 Broader Comparisons
4 Extensions/Conclusions
You are here: 1 Generalizing Maxent
Generalizing Maxent: The Classic Maxent Problem
Minimize negative entropy subject to linear constraints:

  min_p S(p) := Σ_{i=1}^N p_i log(p_i)   subject to Ap = b, p_i ≥ 0

A is M × N with M < N, a wide matrix. b is a "data" vector.
A := [B; 1^T] contains a normalization constraint.
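The problem above is small enough to hand to a generic constrained solver. A minimal numerical sketch, assuming NumPy/SciPy; the 4-outcome A and b below are illustrative, not from the talk:

```python
import numpy as np
from scipy.optimize import minimize

def maxent(A, b):
    """Minimize sum_i p_i log p_i subject to A p = b, p >= 0."""
    N = A.shape[1]

    def neg_entropy(p):
        p = np.clip(p, 1e-12, None)      # keep log defined near the boundary
        return float(np.sum(p * np.log(p)))

    res = minimize(
        neg_entropy,
        x0=np.full(N, 1.0 / N),          # start from the uniform distribution
        method="SLSQP",
        bounds=[(0.0, 1.0)] * N,
        constraints=[{"type": "eq", "fun": lambda p: A @ p - b}],
    )
    return res.x

# A stacks one moment constraint on top of the normalization row 1^T.
A = np.array([[0., 1., 2., 3.],
              [1., 1., 1., 1.]])
b = np.array([1.5, 1.0])                 # mean 1.5 = unconstrained mean
p = maxent(A, b)
```

With the mean set to its unconstrained value, the maxent solution is the uniform distribution, as expected.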
Generalizing Maxent: Extending the Classic Maxent Problem

  min_p S(p)  subject to Ap = b              -- original problem
  min_p S(p) + δ_{0}(Ap − b)                 -- convert constraints to a convex function
  min_p S(p) + δ_{0}(‖Ap − b‖_P)             -- use any norm...
  min_p S(p) + δ_{εB_P}(Ap − b)              -- ... and relax constraints
  min_p △F(p, p_0) + δ_{εB_P}(Ap − b)        -- generalize SBG entropy to a Bregman divergence
  min_µ F*(A^T µ + p_0') − ⟨µ, b⟩ + ε‖µ‖_Q   -- find the Fenchel dual problem to solve

In the dual, F*(A^T µ + p_0') − ⟨µ, b⟩ plays the role of a likelihood and ε‖µ‖_Q a prior: it's a more general form of the MAP problem.
Generalizing Maxent: Characterizing the Solution (compare to statistical models)
After solving for µ* we can recover the optimal primal solution:

  p* = ∇F*(A^T µ* + p_0')

where ∇F* determines the family and A^T µ* is the score.

p* comes from a family of distributions.
The entropy function (F) determines the family (∇F*).
SBG entropy → exponential family.
Any nice F → some family.
Generalizing Maxent: Generalizing the Exponential Family
The q-exponential:

  exp_q(p) := [1 + (1 − q)p]_+^{1/(1−q)}   for q ≠ 1
  exp_q(p) := exp(p)                        for q = 1

[Plot: exp_q for q = 0.5, 1, 1.5; the q = 1.5 curve has a vertical asymptote.]
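A direct transcription of the definition, assuming the [·]_+ truncation applies whenever the base is non-positive:

```python
import math

def exp_q(p, q):
    """q-exponential: [1 + (1-q) p]_+ ** (1/(1-q)) for q != 1, exp(p) for q = 1."""
    if q == 1:
        return math.exp(p)
    base = 1.0 + (1.0 - q) * p
    if base <= 0.0:
        return 0.0                 # the [.]_+ truncation (q < 1 cuts the tail)
    return base ** (1.0 / (1.0 - q))
```

For q > 1 the function decays as a power law (fat tails); for q < 1 it hits zero at finite argument.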
Generalizing Maxent: Tail Behavior

[Plot: tails of exp_q for several q.]

q > 1 naturally gives fat tails.
q < 1 truncates the tail.
You are here: 2 Two Examples
Two Examples: Loaded Die Example Setup
A die with 6 faces.
Expected value of 4.5, instead of 3.5 for a fair die.
For this problem:

  A = ( 1 2 3 4 5 6 )    and    b = ( 4.5 )
      ( 1 1 1 1 1 1 )              ( 1   )

Find p*, assuming S → S_q, p_0 is uniform, ε = 0.
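For SBG entropy (q = 1) the dual collapses to one scalar equation for the moment multiplier: p_i ∝ exp(µ i), with µ chosen so the mean is 4.5. A stdlib-only sketch (bisection on the monotone mean; the helper names are illustrative):

```python
import math

def die_maxent(target_mean, faces=(1, 2, 3, 4, 5, 6)):
    """Maxent die distribution p_i proportional to exp(mu * i), E[i] = target_mean."""
    def mean(mu):
        w = [math.exp(mu * f) for f in faces]
        z = sum(w)
        return sum(f * wi for f, wi in zip(faces, w)) / z

    lo, hi = -10.0, 10.0                 # bracket for the multiplier mu
    for _ in range(200):                 # bisection: mean(mu) is increasing in mu
        mid = 0.5 * (lo + hi)
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    mu = 0.5 * (lo + hi)
    w = [math.exp(mu * f) for f in faces]
    z = sum(w)
    return [wi / z for wi in w]

p = die_maxent(4.5)
```

The resulting distribution tilts mass geometrically toward the high faces, which is exactly the exponential-family form of the classic solution.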
Two Examples: Loaded Die Example, Sensitivity of Each Event Varies

[Plot: probability of each face for q = 0.1, 1.0, 1.9.]

Higher q raises weight on face 1 and face 6. Opposite for 3, 4, 5.
Task: Make a two-way market on each die face. Which is easiest?
Two Examples: The Dantzig Selector (Entropy Function as Prior Information)
Background: Consider a variation on linear regression ŷ = Xβ. Choose β via

  min_β ‖β‖_1 + δ_{εB}(X^T(Xβ − y))

The non-zero entries of the solution can exactly identify the correct set of regressors with high probability under special conditions. (Candès and Tao, Ann. Stat. 2007)
Special conditions: low noise, sparse true model β.
Application area: Compressed Sensing.
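After the +/− change of variables the selector is a plain linear program, so it can be sketched with scipy.optimize.linprog; the data below are synthetic and the tolerance eps is illustrative:

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, eps):
    """min ||beta||_1  s.t.  ||X^T (X beta - y)||_inf <= eps, via beta = b+ - b-."""
    n = X.shape[1]
    G = X.T @ X
    h = X.T @ y
    # Variables u = [b+, b-] >= 0; objective sum(u) = ||beta||_1.
    c = np.ones(2 * n)
    A_ub = np.vstack([np.hstack([G, -G]),     #  X^T(X beta - y) <= eps
                      np.hstack([-G, G])])    # -X^T(X beta - y) <= eps
    b_ub = np.concatenate([h + eps, -h + eps])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
    u = res.x
    return u[:n] - u[n:]

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 10))
beta_true = np.zeros(10)
beta_true[2], beta_true[7] = 1.5, -2.0        # sparse truth
y = X @ beta_true                             # noiseless observations
beta_hat = dantzig_selector(X, y, eps=1e-6)
```

In this easy noiseless, overdetermined setting the true β is feasible, so the minimizer matches it almost exactly; the interesting regime in the paper is of course M < N.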
Two Examples: Dantzig Selector Connection
Change of variables (the "+/−" trick): β = [I −I] p.
‖β‖_1 can be approached using S_q with q → 0.
The entropy function S_q captures part of the prior knowledge.
You are here: 3 Broader Comparisons
Broader Comparisons: Value Regularization
Problem: Model preferences over parameters can't be easily compared.
Solution: Compare outputs instead (Rifkin and Lippert, JMLR, 2007). Many methods can be viewed as also solving

  min_y R(y) + L(y − b)

The regularizer, R, wants smooth outputs, y.
The loss, L, wants a close fit to the data, b (e.g. match labels).
These goals typically compete.
Broader Comparisons: Generalized Maxent and Value Regularization
To apply this idea to maxent: change variables y = Ap.
The regularizer corresponds to an image function:

  R(y) = (AS)(y) := min_p S(p) + δ_{0}(Ap − y)

Loss is straightforward:

  L(y) = δ_{εB_P}(y − b)
Broader Comparisons: SVMs and Value Regularization
The Support Vector Machine (SVM, Vapnik) is one of the best known machine learning algorithms.
Loss function is the soft-margin hinge loss: ½ max(0, 1 − by).
Regularizer uses a data-dependent positive definite matrix K.
In value-regularization terms the objective function is:

  ½ λ y^T K^{-1} y + Σ_i hingeloss(y_i, b_i)
  (first term: R; second term: L)

Compare to the generalized maxent objective function:

  (AS)(y) + δ_{εB_P}(y − b)
  (first term: R; second term: L)
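The value-regularization objective can be evaluated directly once K, λ, and the labels are fixed. A toy sketch with an identity kernel matrix (all values illustrative):

```python
import numpy as np

def hinge(y, b):
    """Soft-margin hinge loss on one output/label pair, as on the slide."""
    return 0.5 * max(0.0, 1.0 - b * y)

def svm_value_objective(y, b, K, lam):
    """R(y) + L(y): 0.5 * lam * y^T K^{-1} y + sum_i hinge(y_i, b_i)."""
    R = 0.5 * lam * float(y @ np.linalg.solve(K, y))
    L = sum(hinge(yi, bi) for yi, bi in zip(y, b))
    return R + L

K = np.eye(2)                    # toy kernel matrix
y = np.array([1.0, -1.0])        # outputs matching the labels with margin 1
b = np.array([1.0, -1.0])
obj = svm_value_objective(y, b, K, lam=1.0)   # R = 1.0, L = 0.0
```

Shrinking y toward zero lowers R but starts to pay hinge loss, which is the R-versus-L competition described above.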
You are here: 4 Extensions/Conclusions
Extensions/Conclusions: Other Models, Briefly
Many NLP models owe a direct debt; the connection is easily seen.
Conditional models, graphical models:
  Use the exponential family (SBG entropy), almost always.
  Often replace marginal distributions with empirical counterparts. Strong assumption, big simplification.
Non-probabilistic models: relax normalization; use the +/− trick.
Continuous/mixed models: p becomes a function, A becomes an operator. Call in the mathematicians and approximation theory.
Extensions/Conclusions: Summary
There is a class of models based on convex functions, which have interchangeable parts.
Strong/exact connection to MAP estimation.
Fenchel duality permits a quick switch of model assumptions.
Benefit: the modular approach allows exploration of model space, by the modeler, or the computer.
Key required tool: flexible, non-smooth optimization tools.
Harder: Characterize the prior knowledge represented in the choice of Regularizer and Loss.
Harder: Incorporate/factor out knowledge of the task(s) to be performed with the model.
Extensions/Conclusions: The End
Thank You
You are here: 5 Appendix
  Generalizing the Maxent Problem
  The Consequences of Normalization
  Phi-Exponential Families
  p* as a Projection
Appendix: Software for Experiments
Apply a quasi-Newton method (LMVM) to the dual problem.
The objective function requires a matrix-vector multiplication (A^T v, v ∈ R^{M×1}).
The gradient requires an additional matrix-vector multiplication (A v, v ∈ R^{N×1}).
Built on PETSc/TAO/Elefant. Will run single or parallel (MPI) with a simple switch.
Additional features to accommodate non-smooth duals to constraint relaxations.
Possible synergy: Choon-Hui, Alex, and Vishy announce a high-performance non-smooth optimization package.
Appendix: Classic Maxent Solution (Exponential Family Distribution)
The constraint is equivalent to: Ap = [B; 1^T] p = [b; 1].
Normalization is just another feature. Try to hide its existence in the solution:

  p* = exp[A^T µ]
     = exp[B^T µ_B + 1 µ_1]
     = exp[B^T µ_B − 1 T(µ_B)]
     = (1/Z(µ_B)) exp[B^T µ_B]

T is the log-partition function. Z is the partition function.
Appendix: Convex Analysis Recap (a quick detour)
The convex conjugate of a convex function F is

  F*(p*) := sup_{p ∈ dom(F)} { ⟨p*, p⟩ − F(p) }

F is Legendre if:
1 C = int(dom F) is non-empty
2 F is differentiable on C
3 |∇F(p)| → ∞ as p → bdry(dom F)
For Legendre functions (in int(dom F*)) we have p = ∇F*(p*).
Appendix: A More General Objective Function, the Bregman Divergence

  △F(p, q) := F(p) − F(q) − ⟨∇F(q), p − q⟩

Let q be uniform (q_i = 1/N) and S be SBG entropy. Then

  △S(p, q) = Σ_i p_i log(p_i) − Σ_i q_i log(q_i) − Σ_i (1 + log(q_i))(p_i − q_i)
           = S(p) + log(N)

using q_i = 1/N, so the gradient term is constant and Σ_i (p_i − q_i) = 0.
△S is relative entropy when q is not uniform. But we are not restricted to SBG entropy...
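The identity △S(p, q) = S(p) + log N for uniform q is easy to verify numerically. A stdlib-only sketch:

```python
import math

def neg_entropy(p):
    """SBG negative entropy S(p) = sum_i p_i log p_i (0 log 0 := 0)."""
    return sum(pi * math.log(pi) for pi in p if pi > 0)

def bregman_neg_entropy(p, q):
    """Bregman divergence of S: S(p) - S(q) - <grad S(q), p - q>."""
    grad_q = [1.0 + math.log(qi) for qi in q]
    return (neg_entropy(p) - neg_entropy(q)
            - sum(g * (pi - qi) for g, pi, qi in zip(grad_q, p, q)))

N = 4
q = [1.0 / N] * N                        # uniform reference
p = [0.1, 0.2, 0.3, 0.4]
lhs = bregman_neg_entropy(p, q)
rhs = neg_entropy(p) + math.log(N)       # the identity on the slide
```

The divergence of any distribution from itself is exactly zero, which is also a quick check of the definition.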
Appendix: A More General Maxent Problem, New Objective Function

  min_{p ∈ R^n} △F(p, p_0)   subject to Ap = b and p_i ≥ 0.

Solve it by using the Fenchel dual

  max_{µ : A^T µ + p_0' ∈ dom F*}  −F*(A^T µ + p_0') + ⟨b, µ⟩

(writing p_0' := ∇F(p_0)), where (if F is Legendre) p* = ∇F*(A^T µ* + p_0').
Appendix: Solution to the Problem, New Distribution Families
This solution is more general but similar to the exponential family.

  p* = ∇F*(B^T µ_B + p_0' + 1 µ_1)
     = ∇F*(B^T µ_B + p_0' − 1 T(µ_B))

Here, T(µ_B) is defined implicitly via

  1^T ∇F*(B^T µ_B + p_0' − 1 T(µ_B)) = 1
Appendix: Scale Function Properties (analog of the partition function)
T is not simple to calculate. But we can deduce that T is convex and use implicit differentiation to calculate its gradient:

  0 = (B − ∇T(µ_B) 1^T) ∇²F*(B^T µ_B + p_0' − 1 T(µ_B)) 1

which on rearrangement gives

  ∇T(µ_B) = B ∇²F*(·) 1 / (1^T ∇²F*(·) 1) = B q̃

with ∇²F* evaluated at the same argument as above.
Appendix: Escort Distribution
When F is additively separable, q̃ is indeed a probability distribution. (Can you see why?)

  q̃ := ∇²F*(·) 1 / (1^T ∇²F*(·) 1)

So B q̃ is an expectation. When does p* = q̃?
Appendix: A Concrete Class of Entropies Based on φ-Logarithms
Usual construction:  log(p) = ∫_1^p (1/x) dx
Deformed log:        log_φ(p) := ∫_1^p (1/φ(x)) dx
Any positive increasing φ will do.
Apply a scaling/smoothing normalization operation to obtain another such function: ψ(p).
Form the negative entropy term: s_φ(p) = p log_ψ(1/p).
Leads to:
  Convenient gradient:    ∇s_φ(p) = log_φ(p) + k_φ
  φ-exponential family:   p* = exp_φ[A^T µ + p_0' − k_φ]
Appendix: Example from the Physics Literature, φ(x) = x^q
Try this: pick q between 0 and 2 and let φ(p) = p^q.

[Panel a: φ(p) = p^q for q = 0.5, 1, 1.5.]

This yields the q-logarithm from the non-extensive thermodynamics literature:

  log_q(x) := (x^{1−q} − 1) / (1 − q)

[Panel b: ψ(x) = x^{2−q}/(2 − q).] In this case the scaling/smoothing operation only scales and reparameterizes φ to yield ψ.
[Panel d: log_ψ(p) = (2 − q) log_{2−q}(p).] Use this log to form the negative entropy: p log_ψ(1/p).
[Panel e: s_φ(p) = p log_{2−q}(p).] Only Legendre for q > 1. Why?
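The q-logarithm and q-exponential are inverses on their common range, which gives a quick numerical check (stdlib only):

```python
import math

def log_q(x, q):
    """q-logarithm: (x^(1-q) - 1)/(1-q) for q != 1, log(x) for q = 1."""
    if q == 1:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(p, q):
    """q-exponential, the inverse of log_q where the base is positive."""
    if q == 1:
        return math.exp(p)
    base = 1.0 + (1.0 - q) * p
    return base ** (1.0 / (1.0 - q)) if base > 0 else 0.0
```

Both functions reduce to the ordinary log and exp as q approaches 1.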
Appendix: Looking at Projections, q Examples

[Figures: projection of p_0 onto {p : Ap = b} under S_q for several q.]

q → 0: same as orthogonal projection.
q = .6: a curved projection.
Usual normalization: an oblique projection; it actually relates directly to projection under SBG entropy.
q = 1.6: curved again.
Appendix: Four Views of Optimality
1 Solution to the primal problem (a Bregman projection).
2 Intersection of e-flat and m-flat manifolds.
3 Reverse-distance solution. Non-convex!
4 Orthogonality conditions. Sometimes used in algorithm design.

[Figures: Bregman projection; manifold intersection; smallest reverse distance; pseudo-orthogonality.]