# The equivalence of logistic regression and maximum entropy models


John Mount

September 23, 2011

Abstract: As our colleague so aptly demonstrated ((link)), there is one derivation of logistic regression that is particularly beautiful. It is not as general as that found in Agresti [Agresti, 1990] (which deals with generalized linear models in their full generality), but it gets to the important balance equations very quickly. We will pursue this further to re-derive multi-category logistic regression in both its standard (sigmoid) phrasing and its equivalent maximum entropy clothing. It is well known that logistic regression and maximum entropy modeling are equivalent (see, for example, [Klein and Manning, 2003]), but we will show that the simpler derivation already given is a very good way to demonstrate the equivalence (and it points out that logistic regression is actually special, not just one of many equivalent GLMs).

## 1 Overview

We will proceed as follows:

1. This outline.
2. Introduce a simplified machine learning problem and some notation.
3. Re-discover logistic regression by a simplified version of the standard derivations.
4. Re-invent logistic regression by using the maximum entropy method.
5. Draw some conclusions.

## 2 Notation

Suppose our machine learning input is a sequence of real vectors of dimension $n$ (where we have already applied the usual machine learning tricks of converting categorical variables into indicators over levels and adding a constant variable to our representation). Our notation will be as follows:

1. $x(1), \ldots, x(m)$ will denote our input data. Each $x(i)$ is a vector in $\mathbb{R}^n$. We use the function-of-$i$ notation to denote which specific example we are working with. We will also use the variable $j$ to denote which of the $n$ coordinates or parameters we are interested in (as in $x(i)_j$).

2. $y(1), \ldots, y(m)$ will denote our known training outcomes. The $y(i)$ take values from the set $\{1, \ldots, k\}$. For standard two-class logistic regression we have $k = 2$ (the target class and its complement). There is no implied order relation between the classes; each class just uses a different symbol. We will use the variables $u$ and $v$ to indicate which class/category/outcome we are working with.

3. $\pi()$ will be the modeled (or learned) probability function we are solving for: $\pi() : \mathbb{R}^n \to \mathbb{R}^k$, or in English: $\pi()$ is a function mapping from the space the $x(i)$ live in into vectors indexed by the categories. $\pi(x)_u$ will denote the probability $\pi()$ assigns, in input situation $x$, to category $u$.

4. $A(u, v)$ will be a notational convenience called an indicator function. It is defined as the unique function with $A(u, v) = 1$ if $u = v$ and $0$ otherwise. We will use $A(\cdot, \cdot)$ to check whether a given $y(i)$ (a known training outcome) is assigned to the category $u$ we are working on at the time (so we will write expressions like $A(u, y(i))$).

Our notation is designed to be explicit (at the cost of some wordiness). For example, we will do most work in scalar notation (using partial derivatives) instead of the much more concise gradient operator notation.

Our goal from this training data is to learn a function such that $f(x(i)) = y(i)$ (for all $i$). Actually we will do a bit more: we will develop an estimate function $\pi() : \mathbb{R}^n \to \mathbb{R}^k$ such that $\pi(x(i))_v$ is a modeled estimate of the probability that $y(i) = v$. $\pi()$ should have the following properties:

1. $\pi(x)_v \ge 0$ always.
2. $\sum_{v=1}^{k} \pi(x)_v = 1$ always.
3. $\pi(x(i))_{y(i)}$ tends to be large.

The first two requirements state that $\pi()$ looks like a probability distribution over the possible outcomes, and the last states that $\pi()$ is a good prediction (matches our training data). In the next two sections we will formalize what "good" is. Everything we are going to do (except appealing to the Karush-Kuhn-Tucker theorem) is elementary.
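The indicator notation above translates directly into code; a minimal sketch (the function name `A` simply mirrors the text's notation):

```python
def A(u, v):
    """Indicator function from the notation section: A(u, v) = 1 if u == v, else 0."""
    return 1 if u == v else 0
```

In the derivations below this is used as `A(u, y_i)` to test whether training example `i` belongs to the category `u` currently under consideration.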
We use the slightly cumbersome partial derivative notation (in favor of the more terse gradient vector notation) to preserve the elementary nature of the derivations.

## 3 Standard Logistic Regression

In standard logistic regression we usually have $k = 2$ (one class is the class we are trying to classify into and the other class is taken as its complement). In this framework $\pi()$ is chosen, out of the blue, to be of the form:

$$\pi(x)_1 = \frac{e^{\lambda \cdot x}}{e^{\lambda \cdot x} + 1} \quad (1)$$

$$\pi(x)_2 = 1 - \pi(x)_1 \quad (2)$$

where $\lambda$ is a vector in $\mathbb{R}^n$. You can check that this form guarantees that our $\pi()$ looks like a probability distribution. Also, our model is completely specified by $\lambda$, so it is just a matter of picking the $\lambda$ that gives the best estimate. We also note: in this form $\pi(x)_1$ is the so-called sigmoid function, which is the inverse of the logit function (hence the name logistic regression). This also hints that logistic regression is equivalent to a single-layer neural net with sigmoid non-linearity (though logistic regression fitting by Fisher scoring is a bit more sophisticated than neural net back-prop training).
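The two-class form of equations (1) and (2) can be checked numerically; a small sketch (the weight vector and input below are made-up illustration values):

```python
import math

def sigmoid_pi(lam, x):
    """Two-class probabilities of equations (1) and (2):
    pi(x)_1 = e^(lam.x) / (e^(lam.x) + 1), pi(x)_2 = 1 - pi(x)_1."""
    z = sum(lj * xj for lj, xj in zip(lam, x))
    p1 = math.exp(z) / (math.exp(z) + 1.0)
    return (p1, 1.0 - p1)

# With a zero weight vector every input gets probability 1/2 for each class.
p = sigmoid_pi([0.0, 0.0], [1.0, 3.5])
```

Whatever `lam` is, the two entries are non-negative and sum to one, which is the "looks like a probability distribution" property claimed in the text.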

We are actually going to suggest a different way to write $\pi()$ that has more notational symmetry (and generalizes beyond $k = 2$ for free). Define:

$$\pi(x)_v = \frac{e^{\lambda_v \cdot x}}{\sum_{u=1}^{k} e^{\lambda_u \cdot x}}. \quad (3)$$

In this case $\lambda$ is now a $k$ by $n$ matrix, with one set of values per category we are trying to predict. Our original form is recovered by forcing $\lambda_2$ to be the all-zero vector. $\pi()$ of this form has two properties that are very useful in deriving logistic regression:

$$\frac{\partial}{\partial \lambda_{v,j}} \pi(x)_v = x_j \, \pi(x)_v (1 - \pi(x)_v) \quad (4)$$

$$\frac{\partial}{\partial \lambda_{u,j}} \pi(x)_v = -x_j \, \pi(x)_v \, \pi(x)_u \quad \text{(when } u \ne v\text{)} \quad (5)$$

Let us go back to our requirement that $\pi(x(i))_{y(i)}$ be large. We can formalize that as maximizing the product

$$\prod_{i=1}^{m} \pi(x(i))_{y(i)}, \quad (6)$$

which means we are maximizing the simultaneous likelihood the model assigns to the training data (treating the data as independent conditioned on the model parameters). This makes sense: the model should not consider the data it was trained on as unlikely. Equivalently, we could maximize a different function (but one that has its maximum in the same place), the log-likelihood:

$$f(\lambda) = \sum_{i=1}^{m} \log(\pi(x(i))_{y(i)}) \quad (7)$$

(remember $\pi(x)$ is a function of $x$ and $\lambda$). We want a $\lambda$ such that $f()$ is maximized. In this notation we have no constraints on $\lambda$, so we know that at any maximal point the derivatives of $f()$ with respect to $\lambda_{u,j}$ are zero for all $u, j$. It further turns out that, under fairly mild assumptions about the training data, the point where all the partial derivatives are zero is unique, and is therefore the unique global maximum. So we solve for the $\lambda$ where all the partial derivatives are zero.
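Equations (3) and (7) are short enough to state as code; a sketch, with `lam` as a list of $k$ weight rows (the function names are mine, and classes are indexed 0..k-1 rather than 1..k):

```python
import math

def pi(lam, x):
    """Equation (3): pi(x)_v = exp(lam_v . x) / sum_u exp(lam_u . x)."""
    scores = [math.exp(sum(lj * xj for lj, xj in zip(row, x))) for row in lam]
    total = sum(scores)
    return [s / total for s in scores]

def log_likelihood(lam, xs, ys):
    """Equation (7): f(lam) = sum_i log(pi(x(i))_{y(i)})."""
    return sum(math.log(pi(lam, x)[y]) for x, y in zip(xs, ys))
```

Note that for an all-zero `lam` every class gets probability $1/k$, the "simplest" model the maximum entropy section later singles out.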

$$\frac{\partial}{\partial \lambda_{u,j}} f(\lambda) = \sum_{i=1}^{m} \frac{\partial}{\partial \lambda_{u,j}} \log(\pi(x(i))_{y(i)}) \quad (8)$$

$$= \sum_{i=1}^{m} \frac{1}{\pi(x(i))_{y(i)}} \frac{\partial}{\partial \lambda_{u,j}} \pi(x(i))_{y(i)} \quad (9)$$

$$= \sum_{i=1,\, y(i)=u}^{m} \frac{1}{\pi(x(i))_u} \frac{\partial}{\partial \lambda_{u,j}} \pi(x(i))_u + \sum_{i=1,\, y(i)\ne u}^{m} \frac{1}{\pi(x(i))_{y(i)}} \frac{\partial}{\partial \lambda_{u,j}} \pi(x(i))_{y(i)} \quad (10)$$

$$= \sum_{i=1,\, y(i)=u}^{m} \frac{x(i)_j \, \pi(x(i))_u (1 - \pi(x(i))_u)}{\pi(x(i))_u} - \sum_{i=1,\, y(i)\ne u}^{m} \frac{x(i)_j \, \pi(x(i))_{y(i)} \, \pi(x(i))_u}{\pi(x(i))_{y(i)}} \quad (11)$$

$$= \sum_{i=1,\, y(i)=u}^{m} x(i)_j (1 - \pi(x(i))_u) - \sum_{i=1,\, y(i)\ne u}^{m} x(i)_j \, \pi(x(i))_u \quad (12)$$

$$= \sum_{i=1,\, y(i)=u}^{m} x(i)_j - \sum_{i=1}^{m} x(i)_j \, \pi(x(i))_u \quad (13)$$

And since this quantity is supposed to equal zero, we learn the relation:

$$\sum_{i=1}^{m} \pi(x(i))_u \, x(i)_j = \sum_{i=1,\, y(i)=u}^{m} x(i)_j \quad \text{(for all } u, j\text{).} \quad (14)$$

These equations are remarkable. They say: the sum of any coordinate $j$ of the $x(i)$ from all the training data in a particular category $u$ is equal to the sum of probability mass the model places in that coordinate, summed across all data (for all $u, j$). Summaries of the training data are preserved by the model. Using $A(u, y(i))$ we can re-write equation 14 as:

$$\sum_{i=1}^{m} \pi(x(i))_u \, x(i)_j = \sum_{i=1}^{m} A(u, y(i)) \, x(i)_j \quad \text{(for all } u, j\text{).} \quad (15)$$

This means the best chosen value of $\lambda$ creates a probability function $\pi(x(i))_u$ that behaves very much like the indicator function $A(u, y(i))$ (simultaneously for all variables and categories, when summed over $i$). We will call equations 14 and 15 the balance equations (drawing an analogy to the detailed balance equations from Markov chain theory). Also notice that the model parameters $\lambda$ do not appear explicitly in these equations (they are implicit in $\pi()$ but are not seen explicitly in the balance equations the way the $x(i)$ are). This means we are
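The balance equations can be verified numerically: fit $\lambda$ by plain gradient ascent on $f$, using the gradient from equation (13), and check equation (15) at the optimum. A sketch on a tiny made-up data set (the data, step size, and iteration count are arbitrary choices, and gradient ascent stands in for the faster Newton-style solvers the text discusses):

```python
import math

def softmax(scores):
    """Equation (3) applied to precomputed scores lam_v . x."""
    total = sum(math.exp(s) for s in scores)
    return [math.exp(s) / total for s in scores]

# Tiny made-up training set: n = 2 (first coordinate is the constant 1), k = 2 classes.
xs = [(1.0, 0.5), (1.0, 2.0), (1.0, -1.0), (1.0, 1.5), (1.0, 1.8), (1.0, -0.2)]
ys = [0, 1, 0, 1, 0, 1]  # overlapping classes, so the maximum is finite
k, n = 2, 2

lam = [[0.0] * n for _ in range(k)]
for _ in range(20000):
    # Gradient from equation (13): sum_{i: y(i)=u} x(i)_j - sum_i pi(x(i))_u x(i)_j
    grad = [[0.0] * n for _ in range(k)]
    for x, y in zip(xs, ys):
        p = softmax([sum(lj * xj for lj, xj in zip(row, x)) for row in lam])
        for u in range(k):
            for j in range(n):
                grad[u][j] += ((1.0 if y == u else 0.0) - p[u]) * x[j]
    for u in range(k):
        for j in range(n):
            lam[u][j] += 0.05 * grad[u][j]

# Balance equations (15): model probability mass matches the indicator sums.
for u in range(k):
    for j in range(n):
        lhs = sum(softmax([sum(lj * xj for lj, xj in zip(row, x))
                           for row in lam])[u] * x[j] for x in xs)
        rhs = sum(x[j] for x, y in zip(xs, ys) if y == u)
        assert abs(lhs - rhs) < 1e-2
```

The final asserts are exactly the "summaries of the training data are preserved by the model" claim: at the fitted $\lambda$, each per-class, per-coordinate sum of model probability mass agrees with the corresponding indicator sum.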

in some sense coordinate-free: the result we get depends only on the variables we have decided to measure/model and not on how we parameterized $\pi()$ in terms of $\lambda$.

Solving this system of $k \times n$ equations in our $k \times n$ unknown $\lambda$'s is slightly tricky, as $\pi()$ depends on the $\lambda$'s in a non-linear fashion. However, we can easily take derivatives of these sets of equations (compute the Jacobian of our system of equations, which is essentially the Hessian of $f()$). Then we can apply one of Newton's method, Fisher scoring, or iteratively re-weighted least squares to find a $\lambda$ that satisfies all of our equations very quickly.¹

From a logistic regression point of view we are done. The lucky form of the sigmoid function (as pulled out of thin air in equations 1 and 3) had such a simple derivative structure that we could derive logistic regression just by re-arranging a few sums.² But what if you didn't think of the sigmoid?

## 4 Maximum Entropy

In the equivalent maximum entropy derivation of logistic regression you don't have to cleverly guess the sigmoid form. Instead you assume you want a balance equation like equation 15 to be true, and you can, without needing any luck, solve for the necessary form of $\pi()$.

Start as follows. Assume nothing about $\pi()$; it could be any sort of hideously complicated function. Just start listing things you would like to be true about it (and don't worry whether you are asking for too many things to be true or too few). Suppose we require $\pi()$ to obey the following relations:

$$\pi(x)_v \ge 0 \text{ always} \quad (16)$$

$$\sum_{v=1}^{k} \pi(x)_v = 1 \text{ always} \quad (17)$$

$$\sum_{i=1}^{m} \pi(x(i))_u \, x(i)_j = \sum_{i=1}^{m} A(u, y(i)) \, x(i)_j \quad \text{(for all } u, j\text{)} \quad (18)$$

The first two conditions are needed for $\pi()$ to behave like a probability, and the third we can think of as saying $\pi(x(i))_u$ should well approximate the category indicator $A(u, y(i))$ on our training data.
Now, we have not yet limited the complexity of $\pi()$, which is something we typically do to make sure $\pi()$ is not some horrendous function that is impossible to implement and over-fits the training data in a non-useful fashion. One particularly bad $\pi()$ (which we do not want) is anything like:

- $\text{match}(x) =$ the minimal $i$ from $1, \ldots, m$ such that $x(i) = x$, or $0$ if there is no such $i$.
- $\text{pick}(i) = y(i)$ if $1 \le i \le m$, and $1$ otherwise.
- $\pi(x)$ such that $\pi(x)_{\text{pick}(\text{match}(x))} = 1$ and $0$ in all other coordinates.

This is an evil, useless cousin of the (useful) nearest neighbor model. It may get exact items from the training data right, but it moves to zero once you introduce a novel point. What we want is some notion of continuity, smoothness, minimum description length, Occam's razor, or low complexity. In information theory the standard thing to try is maximizing the entropy (which in turn minimizes the over-fitting) [Cover and Thomas, 1991]. The entropy of $\pi()$ would be defined as:

$$-\sum_{v=1}^{k} \sum_{i=1}^{m} \pi(x(i))_v \log(\pi(x(i))_v) \quad (19)$$

¹ Newton's zero-finding method is the easiest to derive; it is a remarkable property of GLMs that this method is, in this case, almost indistinguishable from Fisher scoring or iteratively re-weighted least squares.

² You can also derive the same update rules for more general GLMs, but the derivation requires slightly more cleverness.
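Equation (19) is likewise a one-liner; a sketch (the usual $0 \log 0 = 0$ convention is handled by skipping zero entries):

```python
import math

def entropy(pis):
    """Equation (19): -sum over examples i and categories v of pi_v * log(pi_v).
    `pis` is a list of per-example probability vectors."""
    return -sum(p * math.log(p) for row in pis for p in row if p > 0.0)

# The constant function pi(x)_v = 1/k is the most "spread out": one example, k = 2.
h_uniform = entropy([[0.5, 0.5]])  # log(2), about 0.693
h_peaked = entropy([[1.0, 0.0]])   # 0, a fully confident distribution
```

This makes concrete the claim in the next paragraph: the uniform model has strictly larger entropy than any peaked one.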

This equation isn't pulled out of thin air: it is the central equation of information theory. It is maximized for simple $\pi()$ (like the constant function $\pi(x)_v = 1/k$, independent of $x$). So we write our problem as a constrained optimization problem: find a function $\pi()$ maximizing equation 19 while obeying equations 16, 17, and 18. Despite being the simplest in form, equation 16 is the hardest to work with (because it is an inequality instead of an equality), so we leave it out for now and hope that what we find can be shown to satisfy equation 16 (which is how things turn out). Also notice we are not assuming any form for $\pi()$, so we are searching the space of all functions (not just a parameterized sub-space). We will actually derive the form instead of having to guess or know it.

In constrained optimization the standard technique is to set up the Lagrangian. This is just a sum over all constraints and the objective function.³ Our Lagrangian can be taken to be:

$$L = \sum_{j=1}^{n} \sum_{v=1}^{k} \lambda_{v,j} \left( \sum_{i=1}^{m} \pi(x(i))_v \, x(i)_j - A(v, y(i)) \, x(i)_j \right) + \sum_{i=1}^{m} \beta_i \left( \sum_{v=1}^{k} \pi(x(i))_v - 1 \right) - \sum_{v=1}^{k} \sum_{i=1}^{m} \pi(x(i))_v \log(\pi(x(i))_v) \quad (20)$$

What we have done (and this is a standard and re-usable technique) is form $L$ as a sum. For each equation we wanted to enforce, we added a corresponding term to our sum, multiplied by a new free (note-keeping) variable (the $\lambda$'s and $\beta$'s). The objective function is added in without a variable.⁴ The signs and variable names are all arbitrary, but we have picked $\lambda$ to be suggestive (the technique would work the same if we picked different signs and names). The subtle point (and it is a consequence of the Karush-Kuhn-Tucker theorem) is that certain things that are true about the optimal solution are also true about the Lagrangian. In particular, the partial derivatives we expect to vanish for an unconstrained optimization problem do in fact vanish for the Lagrangian at the optimum. So the unknown optimal $\pi()$ is such that:

$$\frac{\partial}{\partial \pi(x(i))_u} L = 0 \quad \text{(for all } i, u\text{).} \quad (21)$$
Note that we are treating $\pi()$ almost as a lookup table, in that we are assuming we can set $\pi(x(i))$ almost independently for each $i$, and further assuming the formal derivatives don't interact for different $i$. But we are also strict, in that we are enforcing a derivative condition at each and every data point. If we have a lot of data (i.e., $m$ is large and the points are diverse) we get a lot of conditions. This severely limits the unknown structure of the optimal $\pi()$ despite the apparent freedom of the formal analysis. Let us look at $\frac{\partial}{\partial \pi(x(i))_u} L$:

$$\frac{\partial}{\partial \pi(x(i))_u} L = \lambda_u \cdot x(i) + \beta_i - \log(\pi(x(i))_u) - 1 \quad (22)$$

Anything that does not directly involve $\pi(x(i))_u$ goes to zero, and we applied the standard rules of derivatives. As we said, we expect these derivatives to all be zero for an optimal $\pi()$, so we get a huge

³ Inequality constraints can also be put in the Lagrangian, but this requires introducing more variables.

⁴ All terms should have a free variable, but by scaling we can leave one variable off, so we left it off the objective function since it is unique.

family of equations, holding for all $i$ and $u$:

$$\lambda_u \cdot x(i) + \beta_i - \log(\pi(x(i))_u) - 1 = 0 \quad (23)$$

Or, re-arranging a bit:

$$\pi(x(i))_u = e^{\lambda_u \cdot x(i) + \beta_i - 1}. \quad (24)$$

This is starting to look promising. For example, we can now see that $\pi() > 0$ everywhere (a consequence of the exponential form). We can now start to solve for our free variables $\beta$ and $\lambda$, which parameterize the family of forms $\pi()$ we need to search to find an optimal solution. By equation 17 we know we want to enforce:

$$\sum_{v=1}^{k} e^{\lambda_v \cdot x(i) + \beta_i - 1} = 1. \quad (25)$$

So

$$e^{\beta_i - 1} = 1 \Big/ \sum_{v=1}^{k} e^{\lambda_v \cdot x(i)}, \quad (26)$$

and we can plug this $\beta_i$ back into $\pi()$ and simplify to get:

$$\pi(x)_u = \frac{e^{\lambda_u \cdot x}}{\sum_{v=1}^{k} e^{\lambda_v \cdot x}}. \quad (27)$$

This is the exact multi-category version of the sigmoid we had to be lucky enough to guess the first time. From this point on you solve as before.

This derivation also gives some indication that the sigmoid function (the inverse of the logit link) is in some sense special among the different GLM models (a good indication that you might as well try it until you know something about your problem that indicates you should use some other form). It might seem that guessing the sigmoid form is less trouble than appealing to maximum entropy. However, the sigmoid is a special trick (either it is appropriate or it is not), while the maximum entropy principle (and taking partial derivatives of the Lagrangian) is a general technique. The sigmoid is the "what"; the methods we used here (maximum entropy, the Lagrangian, and essentially the calculus of variations [Gelfand and Fomin, 1963]) are the "how".

## 5 Conclusion

We showed how to derive logistic regression twice. For the first derivation we guessed the useful sigmoid function and then used simple calculus to derive important properties of logistic regression. This derivation for the sigmoid alone (not introducing general GLMs and not introducing link functions) is much quicker and simpler than the standard derivations (which require some more clever substitutions).
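Steps (24) through (27) can be sanity-checked numerically: compute $\beta_i$ from equation (26) for an arbitrary $\lambda$ and $x$, and confirm that the Lagrangian form (24) reproduces the softmax (27). A sketch with made-up numbers:

```python
import math

def pi_from_lagrangian(lam, x):
    """Equation (24) with beta_i chosen by equation (26):
    pi_u = exp(lam_u . x + beta_i - 1), where exp(beta_i - 1) = 1 / sum_v exp(lam_v . x)."""
    scores = [sum(lj * xj for lj, xj in zip(row, x)) for row in lam]
    beta_minus_one = -math.log(sum(math.exp(s) for s in scores))
    return [math.exp(s + beta_minus_one) for s in scores]

def pi_softmax(lam, x):
    """Equation (27), the multi-category sigmoid."""
    scores = [sum(lj * xj for lj, xj in zip(row, x)) for row in lam]
    total = sum(math.exp(s) for s in scores)
    return [math.exp(s) / total for s in scores]

lam = [[0.3, -1.2], [0.0, 0.0], [2.0, 0.5]]  # arbitrary k = 3, n = 2 example
x = [1.0, 0.7]
a, b = pi_from_lagrangian(lam, x), pi_softmax(lam, x)
```

The two routes agree to floating-point precision, the entries are strictly positive (equation 16 is satisfied after the fact, as promised), and they sum to one.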
In the second derivation we used the central equation of information theory and the Karush-Kuhn-Tucker theorem to actually discover the sigmoid and logistic regression from first principles. This did require bringing in knowledge from two different fields (optimization for the Karush-Kuhn-Tucker conditions and information theory for the entropy equation). However, in both cases we took the most famous fact from a highly celebrated field. We have worked in detail to show the "how" and called out to larger theories to illustrate the "why".

## References

[Agresti, 1990] Agresti, A. (1990). Categorical Data Analysis.

[Cover and Thomas, 1991] Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory.

[Gelfand and Fomin, 1963] Gelfand, I. M. and Fomin, S. V. (1963). Calculus of Variations. Prentice Hall, Inc.

[Klein and Manning, 2003] Klein, D. and Manning, C. D. (2003). Maxent models, conditional estimation, and optimization.


### Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

### Applied Algorithm Design Lecture 5

Applied Algorithm Design Lecture 5 Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Applied Algorithm Design Lecture 5 1 / 86 Approximation Algorithms Pietro Michiardi (Eurecom) Applied Algorithm Design

### Big Data - Lecture 1 Optimization reminders

Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Schedule Introduction Major issues Examples Mathematics

### 6.042/18.062J Mathematics for Computer Science. Expected Value I

6.42/8.62J Mathematics for Computer Science Srini Devadas and Eric Lehman May 3, 25 Lecture otes Expected Value I The expectation or expected value of a random variable is a single number that tells you

### Chapter 21: The Discounted Utility Model

Chapter 21: The Discounted Utility Model 21.1: Introduction This is an important chapter in that it introduces, and explores the implications of, an empirically relevant utility function representing intertemporal

### Manipulator Kinematics. Prof. Matthew Spenko MMAE 540: Introduction to Robotics Illinois Institute of Technology

Manipulator Kinematics Prof. Matthew Spenko MMAE 540: Introduction to Robotics Illinois Institute of Technology Manipulator Kinematics Forward and Inverse Kinematics 2D Manipulator Forward Kinematics Forward

### 2x + y = 3. Since the second equation is precisely the same as the first equation, it is enough to find x and y satisfying the system

1. Systems of linear equations We are interested in the solutions to systems of linear equations. A linear equation is of the form 3x 5y + 2z + w = 3. The key thing is that we don t multiply the variables

### Machine Learning Logistic Regression

Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.

### These axioms must hold for all vectors ū, v, and w in V and all scalars c and d.

DEFINITION: A vector space is a nonempty set V of objects, called vectors, on which are defined two operations, called addition and multiplication by scalars (real numbers), subject to the following axioms

### Cost Minimization and the Cost Function

Cost Minimization and the Cost Function Juan Manuel Puerta October 5, 2009 So far we focused on profit maximization, we could look at a different problem, that is the cost minimization problem. This is

### Summation Algebra. x i

2 Summation Algebra In the next 3 chapters, we deal with the very basic results in summation algebra, descriptive statistics, and matrix algebra that are prerequisites for the study of SEM theory. You

### THE FUNDAMENTAL THEOREM OF ARBITRAGE PRICING

THE FUNDAMENTAL THEOREM OF ARBITRAGE PRICING 1. Introduction The Black-Scholes theory, which is the main subject of this course and its sequel, is based on the Efficient Market Hypothesis, that arbitrages

### 15.062 Data Mining: Algorithms and Applications Matrix Math Review

.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

### Building a Smooth Yield Curve. University of Chicago. Jeff Greco

Building a Smooth Yield Curve University of Chicago Jeff Greco email: jgreco@math.uchicago.edu Preliminaries As before, we will use continuously compounding Act/365 rates for both the zero coupon rates

### Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University

Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision

### The Envelope Theorem 1

John Nachbar Washington University April 2, 2015 1 Introduction. The Envelope Theorem 1 The Envelope theorem is a corollary of the Karush-Kuhn-Tucker theorem (KKT) that characterizes changes in the value

### The Reinvestment Assumption Dallas Brozik, Marshall University

The Reinvestment Assumption Dallas Brozik, Marshall University Every field of study has its little odd bits, and one of the odd bits in finance is the reinvestment assumption. It is an artifact of the

### Overview of Math Standards

Algebra 2 Welcome to math curriculum design maps for Manhattan- Ogden USD 383, striving to produce learners who are: Effective Communicators who clearly express ideas and effectively communicate with diverse

### 1 Short Introduction to Time Series

ECONOMICS 7344, Spring 202 Bent E. Sørensen January 24, 202 Short Introduction to Time Series A time series is a collection of stochastic variables x,.., x t,.., x T indexed by an integer value t. The

### Constrained optimization.

ams/econ 11b supplementary notes ucsc Constrained optimization. c 2010, Yonatan Katznelson 1. Constraints In many of the optimization problems that arise in economics, there are restrictions on the values

### CONTINUED FRACTIONS AND FACTORING. Niels Lauritzen

CONTINUED FRACTIONS AND FACTORING Niels Lauritzen ii NIELS LAURITZEN DEPARTMENT OF MATHEMATICAL SCIENCES UNIVERSITY OF AARHUS, DENMARK EMAIL: niels@imf.au.dk URL: http://home.imf.au.dk/niels/ Contents

### Section 1.1. Introduction to R n

The Calculus of Functions of Several Variables Section. Introduction to R n Calculus is the study of functional relationships and how related quantities change with each other. In your first exposure to

### MATH 551 - APPLIED MATRIX THEORY

MATH 55 - APPLIED MATRIX THEORY FINAL TEST: SAMPLE with SOLUTIONS (25 points NAME: PROBLEM (3 points A web of 5 pages is described by a directed graph whose matrix is given by A Do the following ( points

### Notes on Orthogonal and Symmetric Matrices MENU, Winter 2013

Notes on Orthogonal and Symmetric Matrices MENU, Winter 201 These notes summarize the main properties and uses of orthogonal and symmetric matrices. We covered quite a bit of material regarding these topics,

### Overview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Overview Classes 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) 2-4 Loglinear models (8) 5-4 15-17 hrs; 5B02 Building and

### Linear Codes. In the V[n,q] setting, the terms word and vector are interchangeable.

Linear Codes Linear Codes In the V[n,q] setting, an important class of codes are the linear codes, these codes are the ones whose code words form a sub-vector space of V[n,q]. If the subspace of V[n,q]

### A Quick Algebra Review

1. Simplifying Epressions. Solving Equations 3. Problem Solving 4. Inequalities 5. Absolute Values 6. Linear Equations 7. Systems of Equations 8. Laws of Eponents 9. Quadratics 10. Rationals 11. Radicals

### 3. Interpolation. Closing the Gaps of Discretization... Beyond Polynomials

3. Interpolation Closing the Gaps of Discretization... Beyond Polynomials Closing the Gaps of Discretization... Beyond Polynomials, December 19, 2012 1 3.3. Polynomial Splines Idea of Polynomial Splines

### Linear Codes. Chapter 3. 3.1 Basics

Chapter 3 Linear Codes In order to define codes that we can encode and decode efficiently, we add more structure to the codespace. We shall be mainly interested in linear codes. A linear code of length

### ELEC-E8104 Stochastics models and estimation, Lecture 3b: Linear Estimation in Static Systems

Stochastics models and estimation, Lecture 3b: Linear Estimation in Static Systems Minimum Mean Square Error (MMSE) MMSE estimation of Gaussian random vectors Linear MMSE estimator for arbitrarily distributed

### CHAPTER 3 Numbers and Numeral Systems

CHAPTER 3 Numbers and Numeral Systems Numbers play an important role in almost all areas of mathematics, not least in calculus. Virtually all calculus books contain a thorough description of the natural,

### How I won the Chess Ratings: Elo vs the rest of the world Competition

How I won the Chess Ratings: Elo vs the rest of the world Competition Yannis Sismanis November 2010 Abstract This article discusses in detail the rating system that won the kaggle competition Chess Ratings:

### We call this set an n-dimensional parallelogram (with one vertex 0). We also refer to the vectors x 1,..., x n as the edges of P.

Volumes of parallelograms 1 Chapter 8 Volumes of parallelograms In the present short chapter we are going to discuss the elementary geometrical objects which we call parallelograms. These are going to

### Stochastic Inventory Control

Chapter 3 Stochastic Inventory Control 1 In this chapter, we consider in much greater details certain dynamic inventory control problems of the type already encountered in section 1.3. In addition to the