From Maxent to Machine Learning and Back

From Maxent to Machine Learning and Back. T. Sears, ANU. Maxent 2007, March 2007.

50 Years Ago...

"The principles and mathematical methods of statistical mechanics are seen to be of much more general applicability... In the problem of prediction, the maximization of entropy is not an application of a law of physics, but merely a method of reasoning which ensures that no unconscious arbitrary assumptions have been introduced." (E.T. Jaynes, 1957)

... a method of reasoning... "Jenkins, if I want another yes-man I'll build one."

Outline
1. Generalizing Maxent
2. Two Examples
3. Broader Comparisons
4. Extensions/Conclusions

1. Generalizing Maxent

Generalizing Maxent: The Classic Maxent Problem

Minimize negative entropy subject to linear constraints:

    min_p  S(p) := Σ_{i=1}^N p_i log(p_i)
    subject to  Ap = b,  p_i ≥ 0

A is M × N with M < N, a wide matrix. b is a data vector. A := [B; 1ᵀ] contains a normalization constraint.

Generalizing Maxent: Extending the Classic Maxent Problem

    min_p  S(p)  subject to  Ap = b               (original problem)
    min_p  S(p) + δ_{0}(Ap − b)                   (convert constraints to a convex function)
    min_p  S(p) + δ_{0}(‖Ap − b‖_P)               (use any norm...)
    min_p  S(p) + δ_{εB_P}(Ap − b)                (... and relax constraints)
    min_p  Δ_F(p, p_0) + δ_{εB_P}(Ap − b)         (generalize SBG entropy to a Bregman divergence)
    min_μ  F*(Aᵀμ + ∇F(p_0)) − ⟨μ, b⟩ + ε‖μ‖_Q    (find the Fenchel dual problem to solve)

In the dual, the first two terms play the role of a likelihood and the norm term ε‖μ‖_Q a prior: it is a more general form of the MAP problem.

Generalizing Maxent: Characterizing the Solution

Compare to statistical models. After solving for μ* we can recover the optimal primal solution:

    p* = ∇F*(Aᵀμ* + ∇F(p_0))

where the argument Aᵀμ* + ∇F(p_0) is the score and ∇F* determines the family.

- p* comes from a family of distributions.
- The entropy function (F) determines the family (∇F*).
- SBG entropy → exponential family.
- Any nice F → some family.

Generalizing Maxent: Generalizing the Exponential Family

[Figure: the q-exponential exp_q for q = 0.5, 1, 1.5; the q = 1.5 curve has a vertical asymptote.]

    exp_q(p) := [1 + (1 − q)p]_+^{1/(1−q)}   for q ≠ 1
    exp_q(p) := exp(p)                       for q = 1
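As a numerical sketch of the definition above (numpy; the helper name exp_q is ours):

```python
import numpy as np

def exp_q(p, q):
    """q-exponential: [1 + (1 - q) p]_+ ** (1 / (1 - q)) for q != 1, exp(p) for q = 1."""
    p = np.asarray(p, dtype=float)
    if q == 1.0:
        return np.exp(p)
    base = np.maximum(1.0 + (1.0 - q) * p, 0.0)   # the [.]_+ truncation
    return base ** (1.0 / (1.0 - q))
```

For q < 1 the function is exactly zero left of p = −1/(1 − q) (the truncation); for q > 1 it blows up at p = 1/(q − 1), the asymptote in the figure; and the family is continuous in q through the q = 1 exponential.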

Generalizing Maxent: Tail Behavior

[Figure: exp_q(p) for p ∈ [−3, 0] and several q.]

q > 1 naturally gives fat tails. q < 1 truncates the tail.

2. Two Examples

Two Examples: Loaded Die Example

Setup:
- A die with 6 faces.
- Expected value of 4.5, instead of 3.5 for a fair die.
- For this problem:

      A = ( 1 2 3 4 5 6 )   and   b = ( 4.5 )
          ( 1 1 1 1 1 1 )             ( 1   )

- Find p*, assuming S = S_q, p_0 uniform, ε = 0.
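For the classic SBG case (q = 1) this instance has the familiar closed-form solution p_i ∝ exp(λ i), and the single non-trivial multiplier λ can be found by bisection on the mean constraint. A minimal sketch (helper names ours; the q ≠ 1 members of S_q need the general dual machinery instead):

```python
import numpy as np

faces = np.arange(1, 7)

def gibbs(lam):
    # maxent solution for the die at multiplier lam: p_i proportional to exp(lam * i)
    w = np.exp(lam * faces)
    return w / w.sum()

def solve_lambda(target_mean=4.5, lo=-5.0, hi=5.0):
    # E[face] is strictly increasing in lam, so bisect until it hits the target
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if gibbs(mid) @ faces < target_mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

p_star = gibbs(solve_lambda())   # loaded towards the high faces
```

The resulting distribution is normalized, has mean 4.5, and puts monotonically increasing weight on the higher faces.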

Two Examples: Loaded Die Example, Sensitivity of Each Event Varies

[Figure: probability of each face for q = 0.1, 1.0, 1.9.]

Higher q raises the weight on face 1 and face 6; the opposite holds for faces 3, 4, 5.

Task: make a two-way market on each die face. Which is easiest?

Two Examples: The Dantzig Selector (Entropy Function as Prior Information)

Background: consider a variation on linear regression, ŷ = Xβ. Choose β via

    min_β  ‖β‖_1 + δ_{εB}(Xᵀ(Xβ − y))

- The non-zero entries of the solution can exactly identify the correct set of regressors with high probability under special conditions (Candès and Tao, Ann. Stat. 2007).
- Special conditions: low noise, sparse true model β.
- Application area: compressed sensing.

Two Examples: Dantzig Selector Connection

- Change of variables (the "+/−" trick): β = [I  −I] p, with p ≥ 0.
- ‖β‖_1 can be approached using S_q with q → 0.
- The entropy function S_q captures part of the prior knowledge.
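Putting the last two slides together: after the +/− change of variables the Dantzig selector is an ordinary linear program. A sketch with scipy.optimize.linprog on a made-up noiseless instance (the design X, the sparsity pattern, and ε are all our illustrative choices):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 30, 10
X = rng.standard_normal((n, m))
beta_true = np.zeros(m)
beta_true[[2, 7]] = [1.5, -2.0]        # sparse true model
y = X @ beta_true                       # noiseless, so recovery should be near-exact
eps = 1e-6

# min ||beta||_1  s.t.  ||X^T (y - X beta)||_inf <= eps,
# with beta = u - v, u >= 0, v >= 0 (the +/- trick), objective 1^T (u + v).
G = X.T @ X
c = np.ones(2 * m)
A_ub = np.block([[G, -G], [-G, G]])
b_ub = np.concatenate([X.T @ y + eps, -(X.T @ y) + eps])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, method="highs")
beta_hat = res.x[:m] - res.x[m:]
```

In this noiseless, well-conditioned setting the feasible set shrinks to a tiny neighbourhood of the true β, so beta_hat recovers the two non-zero regressors.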

3. Broader Comparisons

Broader Comparisons: Value Regularization

Problem: model preferences over parameters can't easily be compared.
Solution: compare outputs instead (Rifkin and Lippert, JMLR, 2007). Many methods can be viewed as also solving

    min_y  R(y) + L(y − b)

- The regularizer R wants smooth outputs y.
- The loss L wants a close fit to the data b (e.g. match labels).
- These goals typically compete.

Broader Comparisons: Generalized Maxent and Value Regularization

To apply this idea to maxent:
- Change variables: y = Ap.
- The regularizer corresponds to an image function:

      R(y) = (AS)(y) = min_p  S(p) + δ_{0}(Ap − y)

- Loss is straightforward:

      L(y) = δ_{εB_P}(y − b)

Broader Comparisons: SVMs and Value Regularization

The Support Vector Machine (SVM, Vapnik) is one of the best known machine learning algorithms.
- Loss function is the soft-margin hinge loss: max(0, 1 − by).
- Regularizer uses a data-dependent positive definite matrix K.
- In value-regularization terms the objective function is

      (1/2) λ yᵀ K⁻¹ y  +  Σ_i hingeloss(y_i, b_i)
      [regularizer R]      [loss L]

Compare to the generalized maxent objective function:

      (AS)(y) + δ_{εB_P}(y − b)
      [R]       [L]
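A toy evaluation of the competition between R and L in the SVM objective above, on made-up 1-D data (the kernel bandwidth, λ, and all names are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=12)
b = np.sign(x)                               # labels: which side of zero

# data-dependent positive definite matrix K (Gaussian kernel, jittered)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.5) + 1e-8 * np.eye(12)

def hinge_loss(y, labels):
    return np.maximum(0.0, 1.0 - labels * y).sum()

def svm_value_objective(y, lam=0.1):
    # (1/2) lam y^T K^{-1} y  +  sum_i hinge(y_i, b_i)
    return 0.5 * lam * (y @ np.linalg.solve(K, y)) + hinge_loss(y, b)
```

y = 0 pleases the regularizer and maximally displeases the loss; y = b fits every label at a pure regularization cost. The minimizer sits in between.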

4. Extensions/Conclusions

Extensions/Conclusions: Other Models, Briefly

- Many NLP models owe a direct debt; the connection is easily seen.
- Conditional models, graphical models.
- Use the exponential family (SBG entropy), almost always.
- Often replace marginal distributions with empirical counterparts. Strong assumption, big simplification.
- Non-probabilistic models: relax normalization; use the +/− trick.
- Continuous/mixed models: p becomes a function, A becomes an operator. Call in the mathematicians and approximation theory.

Extensions/Conclusions: Summary

- There is a class of models based on convex functions, with interchangeable parts.
- Strong/exact connection to MAP estimation.
- Fenchel duality permits a quick switch of model assumptions.
- Benefit: the modular approach allows exploration of model space, by the modeler or the computer.
- Key required tool: flexible, non-smooth optimization.
- Harder: characterize the prior knowledge represented in the choice of regularizer and loss.
- Harder: incorporate/factor out knowledge of the task(s) to be performed with the model.

The End. Thank you.

Appendix
- Generalizing the Maxent Problem
- The Consequences of Normalization
- Phi-Exponential Families
- p* as a Projection

Appendix: Software for Experiments

- Apply a quasi-Newton method (LMVM) to the dual problem.
- The objective function requires a matrix-vector multiplication (Aᵀv, v ∈ R^M).
- The gradient requires an additional matrix-vector multiplication (Av, v ∈ R^N).
- Built on PETSc/TAO/Elefant. Will run single-process or parallel (MPI) with a simple switch.
- Additional features accommodate non-smooth duals to constraint relaxations.
- Possible synergy: Choon-Hui, Alex, and Vishy announce a high-performance non-smooth optimization package.
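On a single machine the same pattern can be sketched with scipy's L-BFGS-B standing in for TAO's LMVM, here on the talk's loaded-die instance with SBG entropy and ε = 0 (all helper names ours):

```python
import numpy as np
from scipy.optimize import minimize

# loaded-die instance: first row of A holds the face values, second the normalization
A = np.vstack([np.arange(1, 7), np.ones(6)])
b = np.array([4.5, 1.0])
p0 = np.full(6, 1.0 / 6.0)                  # uniform prior

def dual(mu):
    # For F(p) = sum(p log p - p): F*(u) = sum(exp(u)) and grad F(p0) = log(p0).
    p = p0 * np.exp(A.T @ mu)               # objective: one mat-vec with A^T
    return p.sum() - b @ mu, A @ p - b      # gradient: one more mat-vec with A

res = minimize(dual, np.zeros(2), jac=True, method="L-BFGS-B")
p_star = p0 * np.exp(A.T @ res.x)           # recover the primal solution
```

Note the pattern from the slide: the objective evaluation needs only Aᵀμ, and the gradient needs one additional multiplication by A; at the optimum Ap* = b, so both the mean and normalization constraints are met.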

Appendix (Generalizing the Maxent Problem): Classic Maxent Solution, Exponential Family Distribution

The constraint is equivalent to: Ap = [B; 1ᵀ] p = [b; 1]. Normalization is just another feature. Try to hide its existence in the solution:

    p* = exp[Aᵀμ]
       = exp[Bᵀμ_B + 1 μ_1]
       = exp[Bᵀμ_B − 1 T(μ_B)]
       = (1/Z(μ_B)) exp[Bᵀμ_B]

T is the log-partition function. Z is the partition function.

Appendix (Generalizing the Maxent Problem): Convex Analysis Recap, a Quick Detour

The convex conjugate of a convex function F is

    F*(p*) := sup_{p ∈ dom F} { ⟨p*, p⟩ − F(p) }

F is Legendre if:
1. C = int(dom F) is non-empty
2. F is differentiable on C
3. |∇F(p)| → ∞ as p → bdry(dom F)

For Legendre functions (on int(dom F)) we have p = ∇F*(p*), i.e. ∇F* inverts ∇F.
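For one concrete Legendre pair (our illustrative choice F(p) = Σ(p_i log p_i − p_i) on p > 0, for which ∇F = log, F* = Σ exp, and ∇F* = exp), the conjugate and the inversion property can be checked numerically:

```python
import numpy as np

def F_scalar(x):
    return x * np.log(x) - x               # separable piece of F

def Fstar_scalar_numeric(u):
    # F*(u) = sup_x { u x - F(x) }, approximated on a grid (true value: exp(u))
    grid = np.linspace(1e-3, 20.0, 400001)
    return np.max(u * grid - F_scalar(grid))

p = np.array([0.2, 0.5, 1.7])
roundtrip = np.exp(np.log(p))              # grad F* (grad F (p)) = p
```

The grid supremum lands on exp(u), and composing the two gradients returns the original point, as the Legendre property promises.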

Appendix (Generalizing the Maxent Problem): A More General Objective Function, the Bregman Divergence

    Δ_F(p, q) := F(p) − F(q) − ⟨∇F(q), p − q⟩

Let q be uniform (q_i = 1/N) and let S be the SBG entropy. Then

    Δ_S(p, q) = Σ_i [ p_i log(p_i) − q_i log(q_i) − (1 + log(q_i))(p_i − q_i) ]
              = Σ_i p_i log(p_i) + (log N) Σ_i p_i − Σ_i p_i + Σ_i q_i
              = S(p) + log(N)     (for normalized p)

Δ_S is the relative entropy when q is not uniform. But we are not restricted to SBG entropy...
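The two claims on this slide are easy to verify numerically (numpy; helper names ours):

```python
import numpy as np

def S(p):
    return np.sum(p * np.log(p))           # SBG negative entropy

def bregman_S(p, q):
    # Delta_S(p, q) = S(p) - S(q) - <grad S(q), p - q>, with grad S(q) = 1 + log(q)
    return S(p) - S(q) - (1.0 + np.log(q)) @ (p - q)

rng = np.random.default_rng(1)
p = rng.random(5); p /= p.sum()
q = rng.random(5); q /= q.sum()

kl = np.sum(p * np.log(p / q))             # relative entropy KL(p || q)
uniform = np.full(5, 1.0 / 5.0)
```

For normalized p and q the Bregman divergence of S coincides with the relative entropy, and against the uniform q it reduces to S(p) + log N.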

Appendix (Generalizing the Maxent Problem): A More General Maxent Problem, New Objective Function

    min_{p ∈ R^n}  Δ_F(p, p_0)  subject to  Ap = b  and  p_i ≥ 0

Solve it by using the Fenchel dual

    max_μ  −F*(Aᵀμ + ∇F(p_0)) + ⟨b, μ⟩   over μ with Aᵀμ + ∇F(p_0) ∈ dom F*

where (if F is Legendre) p* = ∇F*(Aᵀμ* + ∇F(p_0)).

Appendix (Generalizing the Maxent Problem): Solution to the Problem, New Distribution Families

This solution is more general, but similar to the exponential family:

    p* = ∇F*(Bᵀμ_B + ∇F(p_0) + 1 μ_1)
       = ∇F*(Bᵀμ_B + ∇F(p_0) − 1 T(μ_B))

Here T(μ_B) is defined implicitly via

    1ᵀ ∇F*(Bᵀμ_B + ∇F(p_0) − 1 T(μ_B)) = 1

Appendix (The Consequences of Normalization): Scale Function Properties, an Analog of the Partition Function

T is not simple to calculate. But we can deduce that T is convex, and use implicit differentiation to calculate its gradient. Writing u* = Bᵀμ_B + ∇F(p_0) − 1 T(μ_B) for the optimal dual point,

    0 = (B − ∇T(μ_B) 1ᵀ) ∇²F*(u*) 1

which on rearrangement gives

    ∇T(μ_B) = B ∇²F*(u*) 1 / (1ᵀ ∇²F*(u*) 1) = B q̃

Appendix (The Consequences of Normalization): Escort Distribution

When F is additively separable, q̃ is indeed a probability distribution. (Can you see why?)

    q̃ := ∇²F*(u*) 1 / (1ᵀ ∇²F*(u*) 1)

evaluated at the optimal dual point u* = Bᵀμ_B + ∇F(p_0) − 1 T(μ_B). So B q̃ is an expectation. When does p* = q̃?

Appendix A Concrete Class of Entropies based on φ-logarithms Phi-Exponential Families log(p) = p 1 1 dx Usual construction x T. Sears (ANU) From Maxent to Machine Learning and Back Maxent 2007 33 / 36

Appendix: A Concrete Class of Entropies Based on φ-Logarithms
Phi-Exponential Families

Deformed log:

    \log_\phi(p) = \int_1^p \frac{1}{\phi(x)}\,dx

Any positive increasing φ will do.

Apply a scaling/smoothing normalization operation to obtain another such function, ψ(p).

Form the negative entropy term: s_\phi(p) = -p\,\log_\psi(1/p).

This leads to:
- A convenient gradient: s_\phi'(p) = \log_\phi(p) + k_\phi
- The φ-exponential family: p^* = \exp_\phi[A^T\mu + p_0 - k_\phi]
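The deformed log can be sanity-checked numerically: evaluate \log_\phi(p) = \int_1^p dx/\phi(x) by midpoint quadrature for φ(x) = x^q and compare with the closed-form q-logarithm \log_q(p) = (p^{1-q} - 1)/(1-q) from the next slide. A minimal sketch:

```python
def log_phi(p, phi, n=20000):
    # midpoint-rule quadrature of integral_1^p dx / phi(x);
    # integrate forward for p >= 1, backward (with a sign flip) for p < 1
    a, b = (1.0, p) if p >= 1.0 else (p, 1.0)
    h = (b - a) / n
    s = sum(1.0 / phi(a + (i + 0.5) * h) for i in range(n)) * h
    return s if p >= 1.0 else -s

q = 0.5
log_q = lambda p: (p ** (1 - q) - 1) / (1 - q)   # closed-form q-logarithm

for p in (0.25, 0.5, 2.0, 4.0):
    print(log_phi(p, lambda x: x ** q), log_q(p))  # pairs agree
```

The same quadrature works for any positive increasing φ, which is the point of the construction: the family is defined directly by φ, with no closed form required.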

Appendix: Phi-Exponential Families
Example from the Physics Literature: φ(x) = x^q

Try this: pick q between 0 and 2 and let φ(p) = p^q.
[Figure a: plot of φ(p) = p^q for q = 0.5, 1, 1.5]

This yields the q-logarithm from the non-extensive thermodynamics literature:

    \log_q(x) := \frac{x^{1-q} - 1}{1-q}

[Figure: plot of \log_q for q = 0.5, 1, 1.5]

Scaling/smoothing operation:

    \psi(x) = \left(\int_0^{1/x} \frac{u}{\phi(u)}\,du\right)^{-1}

In this case the operation only scales and reparameterizes φ to yield ψ.
[Figure b: plot of ψ(p) = (2-q)\,p^{2-q} for q = 0.5, 1, 1.5]

Use this log to form the negative entropy: -p\,\log_\psi(1/p).
[Figure d: plot of \log_\psi(p) = \log_{2-q}(p)/(2-q) for q = 0.5, 1, 1.5]

[Figure e: plot of s_\phi(p) = p\,\log_q(p)/(2-q) for q = 0.5, 1, 1.5]

Only Legendre for q > 1. Why?
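A numerical hint at the "Why?" (a sketch, using the closed forms above): Legendre-type entropies need the gradient to blow up at the boundary p → 0 (essential smoothness). With s_\phi(p) = p\,\log_q(p)/(2-q), the derivative is \log_q(p) plus a constant, and \log_q(p) = (p^{1-q}-1)/(1-q) stays bounded as p → 0 when q < 1 but diverges to -∞ when q > 1.

```python
def log_q(p, q):
    # Tsallis q-logarithm, the gradient of s_phi up to an additive constant
    return (p ** (1 - q) - 1) / (1 - q)

for q in (0.5, 1.5):
    vals = [log_q(p, q) for p in (1e-2, 1e-4, 1e-8)]
    print(q, vals)
# q = 0.5: values approach the finite limit -1/(1-q) = -2 (not steep)
# q = 1.5: values diverge towards -infinity (steep, hence Legendre)
```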

Appendix: p^* as a Projection
Looking at Projections: q Examples

Each figure shows the projection of the prior p_0 onto the constraint set {p : Ap = b}, giving p^*:

- q → 0: same as orthogonal projection. [Figure: Orthogonal Projection]
- q = 0.6: a curved projection. [Figure: Curved Projection]
- Usual normalization: actually relates directly to projection under SBG entropy. [Figure: Oblique Projection]
- q = 1.6: curved again. [Figure: Curved Again]

Appendix: p^* as a Projection
Four Views of Optimality

- Solution to the primal problem. [Figure: Bregman Projection]
- Intersection of e-flat and m-flat manifolds. [Figure: Manifold Intersection]
- Reverse distance solution. Non-convex! [Figure: Smallest Reverse Distance]
- Orthogonality conditions; sometimes used in algorithm design. [Figure: Pseudo Orthogonality]
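The first view can be sketched concretely in the Shannon (SBG) case: the Bregman (KL) projection of a prior p_0 onto {p : Ap = b} is found by gradient ascent on the concave dual in μ, and the optimum has the exponential-family form p \propto p_0\,e^{A^T\mu}. The feature matrix, target b, and step size are illustrative assumptions.

```python
import math

A  = [[1.0, 0.0, 1.0, 0.0]]    # one feature over 4 states (illustrative)
b  = [0.6]                     # target expectation
p0 = [0.25, 0.25, 0.25, 0.25]  # uniform prior

mu = [0.0]
for _ in range(5000):
    # current primal point induced by mu: p proportional to p0 * exp(A^T mu)
    w = [p0[j] * math.exp(sum(A[k][j] * mu[k] for k in range(1)))
         for j in range(4)]
    Z = sum(w)
    p = [wj / Z for wj in w]
    # dual gradient is b - A p; ascend until the constraint holds
    grad = [b[k] - sum(A[k][j] * p[j] for j in range(4)) for k in range(1)]
    for k in range(1):
        mu[k] += 0.5 * grad[k]

print(p, sum(A[0][j] * p[j] for j in range(4)))  # Ap matches b, p normalized
```

At convergence the same p^* satisfies all four views; the orthogonality conditions are exactly the stationarity of this dual ascent.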