From Maxent to Machine Learning and Back

1 From Maxent to Machine Learning and Back. T. Sears, ANU, March 2007.

2 50 Years Ago... "The principles and mathematical methods of statistical mechanics are seen to be of much more general applicability... In the problem of prediction, the maximization of entropy is not an application of a law of physics, but merely a method of reasoning which ensures that no unconscious arbitrary assumptions have been introduced." E.T. Jaynes, 1957

3 ... a method of reasoning... "Jenkins, if I want another yes-man I'll build one."

4 Outline: 1 Generalizing Maxent 2 Two Examples 3 Broader Comparisons 4 Extensions/Conclusions

5 You are here: Generalizing Maxent. 1 Generalizing Maxent 2 Two Examples 3 Broader Comparisons 4 Extensions/Conclusions

6 Generalizing Maxent: The Classic Maxent Problem. Minimize negative entropy subject to linear constraints:
$\min_p S(p) := \sum_{i=1}^{N} p_i \log(p_i)$ subject to $Ap = b$, $p_i \ge 0$.
$A$ is $M \times N$ with $M < N$, a wide matrix. $b$ is a "data" vector. $A := \begin{bmatrix} B \\ \mathbf{1}^T \end{bmatrix}$ contains a normalization constraint.
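A minimal numerical sketch of this problem (not from the talk), assuming SciPy; the particular A, b, and the SLSQP solver are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

# Toy "wide" constraint system: M = 2 constraints on N = 6 probabilities.
A = np.vstack([np.arange(1, 7),      # a feature row (face values of a die)
               np.ones(6)])          # normalization row: sum(p) = 1
b = np.array([4.5, 1.0])

def neg_entropy(p):
    p = np.clip(p, 1e-12, None)      # keep the log finite near the boundary
    return np.sum(p * np.log(p))

res = minimize(neg_entropy,
               x0=np.full(6, 1 / 6),
               method="SLSQP",
               bounds=[(0, 1)] * 6,
               constraints={"type": "eq", "fun": lambda p: A @ p - b})
print(res.x)   # maxent distribution satisfying A p = b
```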

7 Generalizing Maxent: Extending the Classic Maxent Problem.
$\min_p S(p)$ subject to $Ap = b$
Original problem.

8 Generalizing Maxent: Extending the Classic Maxent Problem.
$\min_p S(p) + \delta_{\{0\}}(Ap - b)$
Original problem. Convert constraints to a convex function.

9 Generalizing Maxent: Extending the Classic Maxent Problem.
$\min_p S(p) + \delta_{\{0\}}(\|Ap - b\|_P)$
Original problem. Convert constraints to a convex function. Use any norm...

10 Generalizing Maxent: Extending the Classic Maxent Problem.
$\min_p S(p) + \delta_{\epsilon B_P}(Ap - b)$
Original problem. Convert constraints to a convex function. Use any norm... and relax constraints.

11 Generalizing Maxent: Extending the Classic Maxent Problem.
$\min_p D_F(p, p_0) + \delta_{\epsilon B_P}(Ap - b)$
Original problem. Convert constraints to a convex function. Use any norm... and relax constraints. Generalize SBG entropy to Bregman divergence.

12 Generalizing Maxent: Extending the Classic Maxent Problem.
$\min_\mu F^*(A^T\mu + p_0) - \langle \mu, b \rangle + \epsilon \|\mu\|_Q$
Original problem. Convert constraints to a convex function. Use any norm... and relax constraints. Generalize SBG entropy to Bregman divergence. Find the Fenchel dual problem to solve.

13 Generalizing Maxent: Extending the Classic Maxent Problem.
$\min_\mu \underbrace{F^*(A^T\mu + p_0) - \langle \mu, b \rangle}_{\text{Likelihood}} + \underbrace{\epsilon \|\mu\|_Q}_{\text{Prior}}$
Original problem. Convert constraints to a convex function. Use any norm... and relax constraints. Generalize SBG entropy to Bregman divergence. Find the Fenchel dual problem to solve. It's a more general form of the MAP problem.
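For the classic SBG case a short worked derivation shows how the conjugate and the MAP reading appear (a sketch with the equality-constrained, unrelaxed problem; the sign convention for $\mu$ follows the usual Lagrangian treatment):

$L(p, \mu) = \sum_i p_i \log p_i + \langle \mu, b - Ap \rangle, \qquad \partial L / \partial p_i = 0 \;\Rightarrow\; p_i^* = \exp\big([A^T\mu]_i - 1\big),$

$g(\mu) = \langle \mu, b \rangle - \sum_i \exp\big([A^T\mu]_i - 1\big),$

so maximizing $g$ (equivalently, minimizing $-g$) is a smooth unconstrained problem in $\mu$, and the extra $\epsilon\|\mu\|_Q$ term penalizes large multipliers exactly the way a log-prior does in MAP estimation.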

14 Generalizing Maxent: Characterizing the solution. Compare to statistical models. After solving for $\mu^*$ we can recover the optimal primal solution:
$p^* = \underbrace{\nabla F^*}_{\text{Family}} \big( \overbrace{A^T\mu^* + p_0}^{\text{Score}} \big)$

15 Generalizing Maxent: Characterizing the solution. Compare to statistical models. After solving for $\mu^*$ we can recover the optimal primal solution:
$p^* = \underbrace{\nabla F^*}_{\text{Family}} \big( \overbrace{A^T\mu^* + p_0}^{\text{Score}} \big)$
$p^*$ comes from a family of distributions.

16 Generalizing Maxent: Characterizing the solution. Compare to statistical models. After solving for $\mu^*$ we can recover the optimal primal solution:
$p^* = \underbrace{\nabla F^*}_{\text{Family}} \big( \overbrace{A^T\mu^* + p_0}^{\text{Score}} \big)$
$p^*$ comes from a family of distributions. Entropy function ($F$) determines the family ($\nabla F^*$).

17 Generalizing Maxent: Characterizing the solution. Compare to statistical models. After solving for $\mu^*$ we can recover the optimal primal solution:
$p^* = \underbrace{\nabla F^*}_{\text{Family}} \big( \overbrace{A^T\mu^* + p_0}^{\text{Score}} \big)$
$p^*$ comes from a family of distributions. Entropy function ($F$) determines the family ($\nabla F^*$). SBG entropy → exponential family.

18 Generalizing Maxent: Characterizing the solution. Compare to statistical models. After solving for $\mu^*$ we can recover the optimal primal solution:
$p^* = \underbrace{\nabla F^*}_{\text{Family}} \big( \overbrace{A^T\mu^* + p_0}^{\text{Score}} \big)$
$p^*$ comes from a family of distributions. Entropy function ($F$) determines the family ($\nabla F^*$). SBG entropy → exponential family. Any nice $F$ → some family.

19 Generalizing Maxent: Generalizing the Exponential Family. The q-exponential:
$\exp_q(p) := \begin{cases} \big[1 + (1-q)p\big]_+^{1/(1-q)}, & q \ne 1 \\ \exp(p), & q = 1 \end{cases}$
(Plot: $\exp_q$ for $q = 1.5$, $q = 1$, $q = 0.5$, with an asymptote marked for $q > 1$.)
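A small sketch of the q-exponential and the q-logarithm that inverts it (assuming NumPy; the clipping implements the $[\,\cdot\,]_+$ in the definition):

```python
import numpy as np

def exp_q(x, q):
    """q-exponential: [1 + (1-q) x]_+^(1/(1-q)), reducing to exp(x) at q = 1."""
    if q == 1:
        return np.exp(x)
    base = np.maximum(1 + (1 - q) * np.asarray(x, dtype=float), 0.0)
    return base ** (1.0 / (1 - q))

def log_q(p, q):
    """q-logarithm, the inverse of exp_q on its range: (p^(1-q) - 1) / (1 - q)."""
    p = np.asarray(p, dtype=float)
    if q == 1:
        return np.log(p)
    return (p ** (1 - q) - 1) / (1 - q)

x = np.linspace(-1.5, 2, 5)
print(np.allclose(log_q(exp_q(x, 0.5), 0.5), x))   # round-trip check prints True
```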

20 Generalizing Maxent: Tail Behavior. (Plot: tails of $\exp_q$.) $q > 1$ naturally gives fat tails. $q < 1$ truncates the tail.

21 You are here Two Examples 1 Generalizing Maxent 2 Two Examples 3 Broader Comparisons 4 Extensions/Conclusions

22 Two Examples: Loaded Die Example. Setup. A die with 6 faces.

23 Two Examples: Loaded Die Example. Setup. A die with 6 faces. Expected value of 4.5, instead of 3.5 for a fair die.

24 Two Examples: Loaded Die Example. Setup. A die with 6 faces. Expected value of 4.5, instead of 3.5 for a fair die. For this problem:
$A = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 \\ 1 & 1 & 1 & 1 & 1 & 1 \end{pmatrix}$ and $b = \begin{pmatrix} 4.5 \\ 1 \end{pmatrix}$

25 Two Examples: Loaded Die Example. Setup. A die with 6 faces. Expected value of 4.5, instead of 3.5 for a fair die. For this problem:
$A = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 \\ 1 & 1 & 1 & 1 & 1 & 1 \end{pmatrix}$ and $b = \begin{pmatrix} 4.5 \\ 1 \end{pmatrix}$
Find $p^*$, assuming $S \to S_q$, $p_0$ is uniform, $\epsilon = 0$.

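For the classic case ($q = 1$) the dual of this die problem is effectively one-dimensional once normalization is absorbed into the partition function, so a sketch like the following recovers $p^*$ by root finding (assuming SciPy; the search bracket [-2, 2] is an illustrative choice):

```python
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)
target_mean = 4.5

def p_of(mu):
    # Exponential-family form p_i proportional to exp(mu * i); Z enforces normalization.
    w = np.exp(mu * faces)
    return w / w.sum()

# Choose mu so the constraint E[face] = 4.5 holds.
mu_star = brentq(lambda mu: p_of(mu) @ faces - target_mean, -2.0, 2.0)
p_star = p_of(mu_star)
print(np.round(p_star, 4), p_star @ faces)   # maxent die probabilities, mean 4.5
```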

28 Two Examples: Loaded Die Example. (Plot: probability of each die face as a function of q; the sensitivity of each event varies.) Higher q raises weight on face 1 and face 6. Opposite for 3, 4, 5. Task: Make a two-way market on each die face. Which is easiest?

29 Two Examples: Example: The Dantzig Selector. Entropy Function as Prior Information. Background: Consider a variation on linear regression, $\hat y = X\beta$. Choose $\beta$ via
$\min_\beta \|\beta\|_1 + \delta_{\epsilon B}\big(X^T(X\beta - y)\big)$

30 Two Examples: Example: The Dantzig Selector. Entropy Function as Prior Information. Background: Consider a variation on linear regression, $\hat y = X\beta$. Choose $\beta$ via
$\min_\beta \|\beta\|_1 + \delta_{\epsilon B}\big(X^T(X\beta - y)\big)$
The non-zero entries of the solution can exactly identify the correct set of regressors with high probability under special conditions. (Candès and Tao, Ann. Stat. 2007)

31 Two Examples: Example: The Dantzig Selector. Entropy Function as Prior Information. Background: Consider a variation on linear regression, $\hat y = X\beta$. Choose $\beta$ via
$\min_\beta \|\beta\|_1 + \delta_{\epsilon B}\big(X^T(X\beta - y)\big)$
The non-zero entries of the solution can exactly identify the correct set of regressors with high probability under special conditions. (Candès and Tao, Ann. Stat. 2007) Special conditions: low noise, sparse true model $\beta$.

32 Two Examples: Example: The Dantzig Selector. Entropy Function as Prior Information. Background: Consider a variation on linear regression, $\hat y = X\beta$. Choose $\beta$ via
$\min_\beta \|\beta\|_1 + \delta_{\epsilon B}\big(X^T(X\beta - y)\big)$
The non-zero entries of the solution can exactly identify the correct set of regressors with high probability under special conditions. (Candès and Tao, Ann. Stat. 2007) Special conditions: low noise, sparse true model $\beta$. Application area: Compressed Sensing.
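A sketch of the Dantzig selector as a linear program, using the same +/- split described on the next slides (assuming SciPy's linprog; the synthetic data and the value of ε are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, p, eps = 50, 10, 1.0
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]                    # sparse ground truth
y = X @ beta_true + 0.1 * rng.normal(size=n)

G, g = X.T @ X, X.T @ y
# Variables z = [beta_plus, beta_minus] >= 0; minimize 1' z subject to
# |X'(y - X beta)|_inf <= eps, written as two blocks of linear inequalities.
c = np.ones(2 * p)
A_ub = np.block([[G, -G], [-G, G]])
b_ub = np.concatenate([eps + g, eps - g])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
beta_hat = res.x[:p] - res.x[p:]
print(np.round(beta_hat, 3))   # nonzeros should sit on the first two coordinates
```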

33 Two Examples: Dantzig Selector Connection. Change of variables (the "+/-" trick): $\beta = \begin{bmatrix} I & -I \end{bmatrix} p$.

34 Two Examples: Dantzig Selector Connection. Change of variables (the "+/-" trick): $\beta = \begin{bmatrix} I & -I \end{bmatrix} p$. $\|\beta\|_1$ can be approached using $S_q$ with $q \to 0$.

35 Two Examples: Dantzig Selector Connection. Change of variables (the "+/-" trick): $\beta = \begin{bmatrix} I & -I \end{bmatrix} p$. $\|\beta\|_1$ can be approached using $S_q$ with $q \to 0$. Entropy function $S_q$ captures part of the prior knowledge.
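A one-line sketch of why the split works, with $p_+$ and $p_-$ denoting the two blocks of $p$ (notation mine):
$\beta = \begin{bmatrix} I & -I \end{bmatrix}\begin{bmatrix} p_+ \\ p_- \end{bmatrix}, \quad p_+, p_- \ge 0, \qquad \|\beta\|_1 = \min \big\{ \mathbf{1}^T(p_+ + p_-) : p_+ - p_- = \beta,\; p_+, p_- \ge 0 \big\},$
so a separable penalty on the nonnegative vector $p = (p_+, p_-)$ can stand in for the $\ell_1$ norm of $\beta$, which is how the Dantzig selector LP in the sketch above was set up.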

36 You are here Broader Comparisons 1 Generalizing Maxent 2 Two Examples 3 Broader Comparisons 4 Extensions/Conclusions

37 Broader Comparisons: Making Broader Comparisons. Value Regularization. Problem: Model preferences over parameters can't be easily compared. Solution: Compare outputs instead (Rifkin and Lippert, JMLR, 2007). Many methods can be viewed as also solving
$\min_y R(y) + L(y - b)$

38 Broader Comparisons: Making Broader Comparisons. Value Regularization. Problem: Model preferences over parameters can't be easily compared. Solution: Compare outputs instead (Rifkin and Lippert, JMLR, 2007). Many methods can be viewed as also solving
$\min_y R(y) + L(y - b)$
The regularizer, $R$, wants smooth outputs, $y$.

39 Broader Comparisons: Making Broader Comparisons. Value Regularization. Problem: Model preferences over parameters can't be easily compared. Solution: Compare outputs instead (Rifkin and Lippert, JMLR, 2007). Many methods can be viewed as also solving
$\min_y R(y) + L(y - b)$
The regularizer, $R$, wants smooth outputs, $y$. The loss, $L$, wants a close fit to the data, $b$ (e.g. match labels).

40 Broader Comparisons: Making Broader Comparisons. Value Regularization. Problem: Model preferences over parameters can't be easily compared. Solution: Compare outputs instead (Rifkin and Lippert, JMLR, 2007). Many methods can be viewed as also solving
$\min_y R(y) + L(y - b)$
The regularizer, $R$, wants smooth outputs, $y$. The loss, $L$, wants a close fit to the data, $b$ (e.g. match labels). These goals typically compete.
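A concrete instance of the template, with a squared-error loss standing in for a general L so that the minimizer has a closed form (a sketch, not from the talk; assuming NumPy, with an illustrative RBF kernel and λ):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
b = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.size)   # noisy labels

# Smoothness regularizer R(y) = (lam/2) y' K^{-1} y with an RBF kernel K,
# loss L(y - b) = (1/2) ||y - b||^2.  Setting the gradient to zero gives
#   lam K^{-1} y + (y - b) = 0  =>  y = K (K + lam I)^{-1} b.
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * 0.1 ** 2))
lam = 0.5
y = K @ np.linalg.solve(K + lam * np.eye(x.size), b)
print(np.round(y[:5], 3))   # smoothed outputs: regularizer vs. fit to b
```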

41 Broader Comparisons: Generalized Maxent and Value Regularization. To apply this idea to maxent: Change variables $y = Ap$.

42 Broader Comparisons: Generalized Maxent and Value Regularization. To apply this idea to maxent: Change variables $y = Ap$. The regularizer corresponds to an image function:
$R(y) = (AS)(y) = \min_p S(p) + \delta_{\{0\}}(Ap - y)$

43 Broader Comparisons: Generalized Maxent and Value Regularization. To apply this idea to maxent: Change variables $y = Ap$. The regularizer corresponds to an image function:
$R(y) = (AS)(y) = \min_p S(p) + \delta_{\{0\}}(Ap - y)$
Loss is straightforward: $L(y) = \delta_{\epsilon B_P}(y - b)$

44 Broader Comparisons: SVMs and Value Regularization. The Support Vector Machine (SVM, Vapnik) is one of the best known machine learning algorithms. Loss function is the soft-margin hinge loss: $\max(0, 1 - by)$.

45 Broader Comparisons: SVMs and Value Regularization. The Support Vector Machine (SVM, Vapnik) is one of the best known machine learning algorithms. Loss function is the soft-margin hinge loss: $\max(0, 1 - by)$. Regularizer uses a data-dependent positive definite matrix $K$.

46 Broader Comparisons: SVMs and Value Regularization. The Support Vector Machine (SVM, Vapnik) is one of the best known machine learning algorithms. Loss function is the soft-margin hinge loss: $\max(0, 1 - by)$. Regularizer uses a data-dependent positive definite matrix $K$. In value regularization terms the objective function is:
$\underbrace{\tfrac{1}{2} \lambda y^T K^{-1} y}_{R} + \underbrace{\textstyle\sum_i \mathrm{hingeloss}(y_i, b_i)}_{L}$

48 Broader Comparisons: SVMs and Value Regularization. The Support Vector Machine (SVM, Vapnik) is one of the best known machine learning algorithms. Loss function is the soft-margin hinge loss: $\max(0, 1 - by)$. Regularizer uses a data-dependent positive definite matrix $K$. In value regularization terms the objective function is:
$\underbrace{\tfrac{1}{2} \lambda y^T K^{-1} y}_{R} + \underbrace{\textstyle\sum_i \mathrm{hingeloss}(y_i, b_i)}_{L}$
Compare to the generalized maxent objective function:
$\underbrace{(AS)(y)}_{R} + \underbrace{\delta_{\epsilon B_P}(y - b)}_{L}$
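A rough sketch of minimizing the value-regularized SVM objective directly over the outputs y by subgradient descent (assuming NumPy; the kernel, labels, step size, and iteration count are illustrative, and a production solver would work with the dual QP instead):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-1, 1, 40))
b = np.where(x > 0, 1.0, -1.0)                            # +/-1 labels
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * 0.1 ** 2))
lam, step = 0.1, 0.01

# Parameterize y = K a, so (lam/2) y' K^{-1} y = (lam/2) a' K a and we never
# need K^{-1} explicitly.
a = np.zeros(x.size)
for _ in range(2000):
    y = K @ a
    hinge_active = (1 - b * y) > 0                        # where the hinge is "on"
    a -= step * (K @ (lam * a - b * hinge_active))        # subgradient step
y = K @ a
print(np.mean(np.sign(y) == b))                           # training accuracy
```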

49 You are here Extensions/Conclusions 1 Generalizing Maxent 2 Two Examples 3 Broader Comparisons 4 Extensions/Conclusions

50 Extensions/Conclusions: Other Models, Briefly. Many NLP models owe a direct debt; the connection is easily seen.

51 Extensions/Conclusions: Other Models, Briefly. Many NLP models owe a direct debt; the connection is easily seen. Conditional models, graphical models.

52 Extensions/Conclusions: Other Models, Briefly. Many NLP models owe a direct debt; the connection is easily seen. Conditional models, graphical models. Use exponential family (SBG entropy), almost always.

53 Extensions/Conclusions: Other Models, Briefly. Many NLP models owe a direct debt; the connection is easily seen. Conditional models, graphical models. Use exponential family (SBG entropy), almost always. Often replace marginal distributions with empirical counterparts. Strong assumption, big simplification.

54 Extensions/Conclusions: Other Models, Briefly. Many NLP models owe a direct debt; the connection is easily seen. Conditional models, graphical models. Use exponential family (SBG entropy), almost always. Often replace marginal distributions with empirical counterparts. Strong assumption, big simplification. Non-probabilistic models. Relax normalization. Use the +/- trick.

55 Extensions/Conclusions: Other Models, Briefly. Many NLP models owe a direct debt; the connection is easily seen. Conditional models, graphical models. Use exponential family (SBG entropy), almost always. Often replace marginal distributions with empirical counterparts. Strong assumption, big simplification. Non-probabilistic models. Relax normalization. Use the +/- trick. Continuous/mixed models. $p$ becomes a function, $A$ becomes an operator. Call in the mathematicians and approximation theory.

56 Extensions/Conclusions: Summary. There is a class of models based on convex functions, which have interchangeable parts.

57 Extensions/Conclusions: Summary. There is a class of models based on convex functions, which have interchangeable parts. Strong/exact connection to MAP estimation.

58 Extensions/Conclusions: Summary. There is a class of models based on convex functions, which have interchangeable parts. Strong/exact connection to MAP estimation. Fenchel duality permits a quick switch of model assumptions.

59 Extensions/Conclusions: Summary. There is a class of models based on convex functions, which have interchangeable parts. Strong/exact connection to MAP estimation. Fenchel duality permits a quick switch of model assumptions. Benefit: the modular approach allows exploration of model space, by the modeler or the computer.

60 Extensions/Conclusions: Summary. There is a class of models based on convex functions, which have interchangeable parts. Strong/exact connection to MAP estimation. Fenchel duality permits a quick switch of model assumptions. Benefit: the modular approach allows exploration of model space, by the modeler or the computer. Key requirement: flexible, non-smooth optimization tools.

61 Extensions/Conclusions: Summary. There is a class of models based on convex functions, which have interchangeable parts. Strong/exact connection to MAP estimation. Fenchel duality permits a quick switch of model assumptions. Benefit: the modular approach allows exploration of model space, by the modeler or the computer. Key requirement: flexible, non-smooth optimization tools. Harder: Characterize the prior knowledge represented in the choice of Regularizer and Loss.

62 Extensions/Conclusions: Summary. There is a class of models based on convex functions, which have interchangeable parts. Strong/exact connection to MAP estimation. Fenchel duality permits a quick switch of model assumptions. Benefit: the modular approach allows exploration of model space, by the modeler or the computer. Key requirement: flexible, non-smooth optimization tools. Harder: Characterize the prior knowledge represented in the choice of Regularizer and Loss. Harder: Incorporate/factor out knowledge of the task(s) to be performed with the model.

63 The End Extensions/Conclusions Thank You

64 You are here Appendix 5 Appendix Generalizing the Maxent Problem The Consequences of Normalization Phi-Exponential Families p as a Projection

65 Appendix: Software for Experiments. Apply a quasi-Newton method (LMVM) to the dual problem.

66 Appendix: Software for Experiments. Apply a quasi-Newton method (LMVM) to the dual problem. Objective function requires a matrix-vector multiplication ($A^T v$, $v \in \mathbb{R}^{M \times 1}$).

67 Appendix: Software for Experiments. Apply a quasi-Newton method (LMVM) to the dual problem. Objective function requires a matrix-vector multiplication ($A^T v$, $v \in \mathbb{R}^{M \times 1}$). Gradient requires an additional matrix-vector multiplication ($A v$, $v \in \mathbb{R}^{N \times 1}$).

68 Appendix: Software for Experiments. Apply a quasi-Newton method (LMVM) to the dual problem. Objective function requires a matrix-vector multiplication ($A^T v$, $v \in \mathbb{R}^{M \times 1}$). Gradient requires an additional matrix-vector multiplication ($A v$, $v \in \mathbb{R}^{N \times 1}$). Built on PETSc/TAO/Elefant. Will run single or parallel (MPI) with a simple switch.

69 Appendix: Software for Experiments. Apply a quasi-Newton method (LMVM) to the dual problem. Objective function requires a matrix-vector multiplication ($A^T v$, $v \in \mathbb{R}^{M \times 1}$). Gradient requires an additional matrix-vector multiplication ($A v$, $v \in \mathbb{R}^{N \times 1}$). Built on PETSc/TAO/Elefant. Will run single or parallel (MPI) with a simple switch. Additional features to accommodate non-smooth duals to constraint relaxations.

70 Appendix: Software for Experiments. Apply a quasi-Newton method (LMVM) to the dual problem. Objective function requires a matrix-vector multiplication ($A^T v$, $v \in \mathbb{R}^{M \times 1}$). Gradient requires an additional matrix-vector multiplication ($A v$, $v \in \mathbb{R}^{N \times 1}$). Built on PETSc/TAO/Elefant. Will run single or parallel (MPI) with a simple switch. Additional features to accommodate non-smooth duals to constraint relaxations. Possible synergy: Choon-Hui, Alex, and Vishy announce a high-performance non-smooth optimization package.
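A toy version of that loop for the classic SBG dual, with SciPy's L-BFGS standing in for TAO's LMVM (the A, b, and the closed-form conjugate used here are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

A = np.vstack([np.arange(1, 7), np.ones(6)])    # features + normalization row
b = np.array([4.5, 1.0])

def dual(mu):
    # Objective needs one matvec with an M-vector: theta = A^T mu.
    theta = A.T @ mu
    p = np.exp(theta - 1)                       # grad of F* for SBG entropy
    # Gradient needs one more matvec with an N-vector: A p.
    return p.sum() - b @ mu, A @ p - b

res = minimize(dual, x0=np.zeros(2), jac=True, method="L-BFGS-B")
p_star = np.exp(A.T @ res.x - 1)                # recover the primal solution
print(np.round(p_star, 4), p_star.sum(), p_star @ np.arange(1, 7))
```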

71 Appendix: Generalizing the Maxent Problem. Classic Maxent Solution: Exponential Family Distribution. Constraint equivalent to:
$Ap = \begin{bmatrix} B \\ \mathbf{1}^T \end{bmatrix} p = \begin{bmatrix} b_B \\ 1 \end{bmatrix}$
Normalization is just another feature. Try to hide its existence in the solution:
$p^* = \exp[A^T\mu] = \exp[B^T\mu_B + \mathbf{1}\mu_1] = \exp[B^T\mu_B - \mathbf{1}\,T(\mu_B)] = \frac{1}{Z(\mu_B)} \exp[B^T\mu_B]$
$T$ is the log-partition function. $Z$ is the partition function.

72 Appendix: Generalizing the Maxent Problem. Convex Analysis Recap: a quick detour. The convex conjugate of a convex function $F$ is
$F^*(p^*) := \sup_{p \in \mathrm{dom}(F)} \{ \langle p, p^* \rangle - F(p) \}.$
$F$ is Legendre if: 1. $C = \mathrm{int}(\mathrm{dom}\, F)$ is non-empty; 2. $F$ is differentiable on $C$; 3. $\|\nabla F(p)\| \to \infty$ as $p \to \mathrm{bdry}(\mathrm{dom}\, F)$.
For Legendre functions (on $\mathrm{int}(\mathrm{dom}\, F)$) we have $p = \nabla F^*(p^*)$.
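As a concrete instance (a standard computation, not from the slides): for the SBG negative entropy on the positive orthant,
$F(p) = \sum_i p_i \log p_i \quad\Longrightarrow\quad F^*(\theta) = \sum_i e^{\theta_i - 1}, \qquad \nabla F^*(\theta) = \big( e^{\theta_i - 1} \big)_i,$
obtained by setting $\partial_{p_i}\big( \langle p, \theta \rangle - F(p) \big) = \theta_i - \log p_i - 1 = 0$, i.e. $p_i = e^{\theta_i - 1}$; this is the exponential-family map used throughout.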

73 Appendix: Generalizing the Maxent Problem. A More General Objective Function: Bregman Divergence.
$D_F(p, q) := F(p) - F(q) - \langle \nabla F(q), p - q \rangle$
Let $q$ be uniform ($q_i = 1/N$). $S$ is SBG entropy:
$D_S(p, q) = \sum_i \big[ p_i \log(p_i) - q_i \log(q_i) - (1 + \log(q_i))(p_i - q_i) \big] = -\sum_i p_i \log(1/p_i) + \sum_i p_i \log(N) - \sum_i p_i + \sum_i q_i = S(p) + \log(N).$
$D_S$ is relative entropy when $q$ is not uniform. But we are not restricted to SBG entropy...
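A quick numerical check of that identity (assuming NumPy; the helper names are mine):

```python
import numpy as np

def bregman(F, gradF, p, q):
    """D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>."""
    return F(p) - F(q) - gradF(q) @ (p - q)

neg_entropy = lambda p: np.sum(p * np.log(p))
grad_neg_entropy = lambda p: 1 + np.log(p)

N = 6
p = np.array([0.05, 0.05, 0.1, 0.1, 0.2, 0.5])
q = np.full(N, 1 / N)

lhs = bregman(neg_entropy, grad_neg_entropy, p, q)
rhs = neg_entropy(p) + np.log(N)          # S(p) + log N, as on the slide
print(np.isclose(lhs, rhs))               # True: divergence to the uniform prior
```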

74 Appendix: Generalizing the Maxent Problem. A More General Maxent Problem: New Objective Function.
$\min_{p \in \mathbb{R}^n} D_F(p, p_0)$ subject to $Ap = b$ and $p_i \ge 0$.
Solve it by using the Fenchel dual
$\max_{\mu} \; -F^*(A^T\mu + p_0) + \langle b, \mu \rangle$ (over $\mu$ for which $A^T\mu + p_0 \in \mathrm{dom}\, F^*$),
where (if $F$ is Legendre) $p^* = \nabla F^*(A^T\mu^* + p_0)$.

75 Appendix: Generalizing the Maxent Problem. Solution to the Problem: New Distribution Families. This solution is more general but similar to the exponential family.
$p^* = \nabla F^*(B^T\mu_B + p_0 + \mathbf{1}\mu_1) = \nabla F^*\big(B^T\mu_B + p_0 - \mathbf{1}\,T(\mu_B)\big)$
Here, $T(\mu_B)$ is defined implicitly via $\mathbf{1}^T \nabla F^*\big(B^T\mu_B + p_0 - \mathbf{1}\,T(\mu_B)\big) = 1$.

76 Appendix: The Consequences of Normalization. Scale Function Properties: analog of the partition function. $T$ is not simple to calculate. But we can deduce that $T$ is convex and use implicit differentiation to calculate its gradient:
$0 = \big(B - \nabla T(\mu_B)\mathbf{1}^T\big)\, \nabla^2 F^*\,\mathbf{1},$
which on rearrangement gives
$\nabla T(\mu_B) = \frac{B\,\nabla^2 F^*\,\mathbf{1}}{\mathbf{1}^T \nabla^2 F^*\,\mathbf{1}} = B q,$
with $\nabla^2 F^*$ evaluated at the same argument as in $p^*$.

77 Appendix: The Consequences of Normalization. Escort Distribution. When $F$ is additively separable, $q$ is indeed a probability distribution. (Can you see why?)
$q := \frac{\nabla^2 F^*\,\mathbf{1}}{\mathbf{1}^T \nabla^2 F^*\,\mathbf{1}}$
So $Bq$ is an expectation. When does $p^* = q$?

78 Appendix: Phi-Exponential Families. A Concrete Class of Entropies based on φ-logarithms. Usual construction:
$\log(p) = \int_1^p \frac{1}{x}\,dx$

79 Appendix: Phi-Exponential Families. A Concrete Class of Entropies based on φ-logarithms. Deformed log:
$\log_\phi(p) = \int_1^p \frac{1}{\phi(x)}\,dx$

80 Appendix: Phi-Exponential Families. A Concrete Class of Entropies based on φ-logarithms. Deformed log:
$\log_\phi(p) = \int_1^p \frac{1}{\phi(x)}\,dx$
Any positive increasing φ will do.

81 Appendix: Phi-Exponential Families. A Concrete Class of Entropies based on φ-logarithms. Deformed log:
$\log_\phi(p) = \int_1^p \frac{1}{\phi(x)}\,dx$
Any positive increasing φ will do. Apply a scaling/smoothing normalization operation to obtain another such function: ψ(p).

82 Appendix: Phi-Exponential Families. A Concrete Class of Entropies based on φ-logarithms. Deformed log:
$\log_\phi(p) = \int_1^p \frac{1}{\phi(x)}\,dx$
Any positive increasing φ will do. Apply a scaling/smoothing normalization operation to obtain another such function: ψ(p). Form the negative entropy term: $s_\phi(p) = -p \log_\psi(1/p)$.

83 Appendix: Phi-Exponential Families. A Concrete Class of Entropies based on φ-logarithms. Deformed log:
$\log_\phi(p) = \int_1^p \frac{1}{\phi(x)}\,dx$
Any positive increasing φ will do. Apply a scaling/smoothing normalization operation to obtain another such function: ψ(p). Form the negative entropy term: $s_\phi(p) = -p \log_\psi(1/p)$. Leads to: Convenient gradient: $s_\phi'(p) = \log_\phi(p) + k_\phi$. φ-exponential family: $p^* = \exp_\phi[A^T\mu + p_0 - k_\phi]$.
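A small numerical sketch of the deformed logarithm for an arbitrary positive increasing φ (assuming SciPy's quad; taking φ(x) = x^q reproduces the q-logarithm of the following slides):

```python
import numpy as np
from scipy.integrate import quad

def log_phi(p, phi):
    """Deformed logarithm: integral of 1/phi(x) from 1 to p."""
    val, _ = quad(lambda x: 1.0 / phi(x), 1.0, p)
    return val

q = 0.5
phi = lambda x: x ** q                        # Tsallis-style deformation
p = 2.0
numeric = log_phi(p, phi)
closed_form = (p ** (1 - q) - 1) / (1 - q)    # q-logarithm
print(np.isclose(numeric, closed_form))       # True
```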

84 Appendix: Phi-Exponential Families. Example from the Physics Literature: $\phi(x) = x^q$. Try this: Pick $q$ (between 0 and 2). Let $\phi(p) = p^q$. (Panel a: plot of $\phi(p) = p^q$ for $q = 1.5$, $q = 1$, $q = 0.5$.)

85 Appendix: Phi-Exponential Families. Example from the Physics Literature: $\phi(x) = x^q$. Yields the q-logarithm from the non-extensive thermodynamics literature:
$\log_q(x) := \frac{x^{1-q} - 1}{1 - q}$
(Panel: plot of $\log_\phi = \log_q$.)

86 Appendix: Phi-Exponential Families. Example from the Physics Literature: $\phi(x) = x^q$. Scaling/Smoothing operation:
$\psi(x) = \left( \int_0^{1/x} \frac{u}{\phi(u)}\,du \right)^{-1}$
In this case the operation only scales and reparameterizes φ to yield ψ. (Panel b: plot of $\psi(p) = (2-q)\,p^{2-q}$ for $q = 0.5$, $q = 1$, $q = 1.5$.)

87 Appendix: Phi-Exponential Families. Example from the Physics Literature: $\phi(x) = x^q$. Use this log to form the negative entropy: $s_\phi(p) = -p \log_\psi(1/p)$. (Panel d: plot of $\log_\psi(p) = \log_{2-q}(p)$ for $q = 0.5$, $q = 1$, $q = 1.5$.)

88 Appendix: Phi-Exponential Families. Example from the Physics Literature: $\phi(x) = x^q$. Only Legendre for $q > 1$. Why? (Panel e: plot of $s_\phi(p) = p \log_q(p) / (2-q)$ for $q = 0.5$, $q = 1$, $q = 1.5$.)

89 Appendix: p* as a Projection. Looking At Projections: q Examples. (Figure: orthogonal projection of $p_0$ onto the constraint set $\{p : Ap = b\}$, giving $p^*$.) $q \to 0$: same as orthogonal projection.

90 Appendix: p* as a Projection. Looking At Projections: q Examples. (Figure: curved projection.) $q \to 0$: same as orthogonal projection. $q = .6$.

91 Appendix: p* as a Projection. Looking At Projections: q Examples. (Figure: oblique projection.) $q \to 0$: same as orthogonal projection. $q = .6$. Usual normalization: actually relates directly to projection under SBG entropy.

92 Appendix: p* as a Projection. Looking At Projections: q Examples. (Figure: curved again.) $q \to 0$: same as orthogonal projection. $q = .6$. Usual normalization: actually relates directly to projection under SBG entropy. $q = 1.6$.

93 Appendix: p* as a Projection. Four Views of Optimality. Solution to the primal problem. (Figure: Bregman projection of $p_0$ onto $\{p : Ap = b\}$.)

94 Appendix: p* as a Projection. Four Views of Optimality. Solution to the primal problem. Intersection of e-flat and m-flat manifolds. (Figure: manifold intersection.)

95 Appendix: p* as a Projection. Four Views of Optimality. Solution to the primal problem. Intersection of e-flat and m-flat manifolds. Reverse distance solution. Non-convex! (Figure: smallest reverse distance.)

96 Appendix: p* as a Projection. Four Views of Optimality. Solution to the primal problem. Intersection of e-flat and m-flat manifolds. Reverse distance solution. Non-convex! Orthogonality conditions. Sometimes used in algorithm design. (Figure: pseudo-orthogonality.)


More information

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support

More information

5 Scalings with differential equations

5 Scalings with differential equations 5 Scalings with differential equations 5.1 Stretched coordinates Consider the first-order linear differential equation df dx + f = 0. Since it is first order, we expect a single solution to the homogeneous

More information

From Sparse Approximation to Forecast of Intraday Load Curves

From Sparse Approximation to Forecast of Intraday Load Curves From Sparse Approximation to Forecast of Intraday Load Curves Mathilde Mougeot Joint work with D. Picard, K. Tribouley (P7)& V. Lefieux, L. Teyssier-Maillard (RTE) 1/43 Electrical Consumption Time series

More information

Online Convex Optimization

Online Convex Optimization E0 370 Statistical Learning heory Lecture 19 Oct 22, 2013 Online Convex Optimization Lecturer: Shivani Agarwal Scribe: Aadirupa 1 Introduction In this lecture we shall look at a fairly general setting

More information

INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition)

INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition) INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition) Abstract Indirect inference is a simulation-based method for estimating the parameters of economic models. Its

More information

Linear Programming. April 12, 2005

Linear Programming. April 12, 2005 Linear Programming April 1, 005 Parts of this were adapted from Chapter 9 of i Introduction to Algorithms (Second Edition) /i by Cormen, Leiserson, Rivest and Stein. 1 What is linear programming? The first

More information

BIG DATA PROBLEMS AND LARGE-SCALE OPTIMIZATION: A DISTRIBUTED ALGORITHM FOR MATRIX FACTORIZATION

BIG DATA PROBLEMS AND LARGE-SCALE OPTIMIZATION: A DISTRIBUTED ALGORITHM FOR MATRIX FACTORIZATION BIG DATA PROBLEMS AND LARGE-SCALE OPTIMIZATION: A DISTRIBUTED ALGORITHM FOR MATRIX FACTORIZATION Ş. İlker Birbil Sabancı University Ali Taylan Cemgil 1, Hazal Koptagel 1, Figen Öztoprak 2, Umut Şimşekli

More information

Variational approach to restore point-like and curve-like singularities in imaging

Variational approach to restore point-like and curve-like singularities in imaging Variational approach to restore point-like and curve-like singularities in imaging Daniele Graziani joint work with Gilles Aubert and Laure Blanc-Féraud Roma 12/06/2012 Daniele Graziani (Roma) 12/06/2012

More information

Optimization. J(f) := λω(f) + R emp (f) (5.1) m l(f(x i ) y i ). (5.2) i=1

Optimization. J(f) := λω(f) + R emp (f) (5.1) m l(f(x i ) y i ). (5.2) i=1 5 Optimization Optimization plays an increasingly important role in machine learning. For instance, many machine learning algorithms minimize a regularized risk functional: with the empirical risk min

More information

Summer course on Convex Optimization. Fifth Lecture Interior-Point Methods (1) Michel Baes, K.U.Leuven Bharath Rangarajan, U.

Summer course on Convex Optimization. Fifth Lecture Interior-Point Methods (1) Michel Baes, K.U.Leuven Bharath Rangarajan, U. Summer course on Convex Optimization Fifth Lecture Interior-Point Methods (1) Michel Baes, K.U.Leuven Bharath Rangarajan, U.Minnesota Interior-Point Methods: the rebirth of an old idea Suppose that f is

More information

Multi-variable Calculus and Optimization

Multi-variable Calculus and Optimization Multi-variable Calculus and Optimization Dudley Cooke Trinity College Dublin Dudley Cooke (Trinity College Dublin) Multi-variable Calculus and Optimization 1 / 51 EC2040 Topic 3 - Multi-variable Calculus

More information