Linear Threshold Units




Linear Threshold Units

h(x) = +1 if w_1 x_1 + … + w_n x_n ≥ w_0, and −1 otherwise

We assume that each feature x_j and each weight w_j is a real number (we will relax this later). We will study three different algorithms for learning linear threshold units:
- Perceptron: classifier
- Logistic Regression: conditional distribution
- Linear Discriminant Analysis: joint distribution
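As a concrete illustration of the prediction rule above, here is a minimal Python sketch of an LTU. The weights, threshold, and inputs are made-up values for illustration, not from the lecture.

```python
def ltu_predict(w, w0, x):
    """Return +1 if w . x >= w0 (the threshold), else -1."""
    activation = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if activation >= w0 else -1

# A 2-feature example: the unit fires when x1 + x2 >= 1.5.
w, w0 = [1.0, 1.0], 1.5
print(ltu_predict(w, w0, [1.0, 1.0]))   # 2.0 >= 1.5, so +1
print(ltu_predict(w, w0, [0.5, 0.5]))   # 1.0 < 1.5, so -1
```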

What can be represented by an LTU?
Conjunctions: e.g., x_1 ∧ x_2 ∧ x_3 ⇒ y can be written as x_1 + x_2 + x_3 ≥ 3.
At least m-of-n: e.g., at-least-2-of {x_1, x_2, x_3} ⇒ y can be written as x_1 + x_2 + x_3 ≥ 2.

Things that cannot be represented:
Non-trivial disjunctions: e.g., (x_1 ∧ x_2) ∨ (x_3 ∧ x_4) ⇒ y.
Exclusive-OR: (x_1 ∧ ¬x_2) ∨ (¬x_1 ∧ x_2) ⇒ y.

A canonical representation Given a training example of the form (⟨x_1, x_2, x_3, x_4⟩, y), transform it to (⟨1, x_1, x_2, x_3, x_4⟩, y). The parameter vector will then be w = ⟨w_0, w_1, w_2, w_3, w_4⟩. We will call the unthresholded hypothesis u(x, w): u(x, w) = w · x. Each hypothesis can be written h(x) = sgn(u(x, w)). Our goal is to find w.
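The canonical transformation above can be sketched in a few lines of Python; the particular weights and inputs are illustrative values, not from the lecture.

```python
def add_bias_feature(x):
    """Prepend a constant 1 so the threshold becomes weight w_0."""
    return [1.0] + list(x)

def u(x, w):
    """Unthresholded hypothesis u(x, w) = w . x."""
    return sum(wj * xj for wj, xj in zip(w, x))

def h(x, w):
    """Thresholded hypothesis h(x) = sgn(u(x, w))."""
    return 1 if u(x, w) >= 0 else -1

x = add_bias_feature([2.0, -1.0])   # -> [1.0, 2.0, -1.0]
w = [-0.5, 1.0, 1.0]                # w_0 = -0.5 plays the role of the threshold
print(h(x, w))                      # u = -0.5 + 2.0 - 1.0 = 0.5, so +1
```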

The LTU Hypothesis Space Fixed size: there are only finitely many distinct linear threshold units over n boolean features. Deterministic. Continuous parameters.

Geometrical View Consider three labeled training examples in the plane. We want a classifier that separates the positive examples from the negative with a line (shown on the slide).

The Unthresholded Discriminant Function is a Hyperplane The equation u(x) = w · x defines a plane (in general, a hyperplane) over the input space.

Machine Learning and Optimization When learning a classifier, the natural way to formulate the learning problem is the following. Given: a set of N training examples {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)} and a loss function L. Find: the weight vector w that minimizes the expected loss on the training data

J(w) = (1/N) Σ_{i=1}^{N} L(w · x_i, y_i).

In general, machine learning algorithms apply some optimization algorithm to find a good hypothesis. In this case, J is piecewise constant, which makes this a difficult problem.

Approximating the expected loss by a smooth function Simplify the optimization problem by replacing the original objective function with a smooth, differentiable function. For example, consider the hinge loss

J̃(w) = (1/N) Σ_{i=1}^{N} max(0, −y_i (w · x_i)).

(The slide plots this loss as a function of w · x for y = 1.)
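The surrogate objective above is easy to compute directly. Here is a minimal sketch; the weight vectors and toy examples are illustrative choices.

```python
def hinge_objective(examples, w):
    """J(w) = (1/N) sum_i max(0, -y_i (w . x_i)), with labels y in {+1, -1}."""
    total = 0.0
    for x, y in examples:
        total += max(0.0, -y * sum(wj * xj for wj, xj in zip(w, x)))
    return total / len(examples)

# Two examples with a bias feature first.
data = [([1.0, 1.0], 1), ([1.0, -1.0], -1)]
print(hinge_objective(data, [0.0, 1.0]))   # both correct with margin 1 -> 0.0
print(hinge_objective(data, [0.0, -1.0]))  # both wrong -> average loss 1.0
```

Unlike the 0/1 loss, this objective changes smoothly as w moves, which is what makes gradient methods applicable.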

Minimizing J by Gradient Descent Search Start with weight vector w_0. Compute the gradient ∇J(w_0) = (∂J(w_0)/∂w_0, ∂J(w_0)/∂w_1, …, ∂J(w_0)/∂w_n). Compute w_1 = w_0 − η ∇J(w_0), where η is a step-size parameter. Repeat until convergence.

Computing the Gradient Let J_i(w) = max(0, −y_i (w · x_i)), so that J(w) = (1/N) Σ_i J_i(w). Then

∂J(w)/∂w_k = (1/N) Σ_{i=1}^{N} ∂J_i(w)/∂w_k

∂J_i(w)/∂w_k = ∂/∂w_k max(0, −y_i Σ_j w_j x_ij) = 0 if y_i Σ_j w_j x_ij > 0, and −y_i x_ik otherwise.

Batch Perceptron Algorithm Given: training examples (x_i, y_i), i = 1 … N.

w ← (0, 0, …, 0)
repeat:
  g ← (0, 0, …, 0)
  for i = 1 … N:
    u_i ← w · x_i
    if y_i u_i < 0:
      for j = 1 … n: g_j ← g_j + y_i x_ij
  g ← g / N
  w ← w + η g
until convergence

Simplest case: η = 1, don't normalize g: the Fixed Increment Perceptron.
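The batch loop above can be sketched as plain Python. This is a hedged illustration: the step size, iteration cap, and toy dataset are my own choices, and a zero activation is treated as a mistake so that training can get started from w = 0.

```python
def batch_perceptron(examples, eta=1.0, max_iters=100):
    """Labels are +1/-1; each x is assumed to include the bias feature."""
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(max_iters):
        g = [0.0] * n
        mistakes = 0
        for x, y in examples:
            u = sum(wj * xj for wj, xj in zip(w, x))
            if y * u <= 0:                       # misclassified (or on the boundary)
                mistakes += 1
                for j in range(n):
                    g[j] += y * x[j]
        if mistakes == 0:                        # converged: every example correct
            break
        g = [gj / len(examples) for gj in g]     # averaged gradient
        w = [wj + eta * gj for wj, gj in zip(w, g)]
    return w

# Linearly separable toy data (bias feature first).
data = [([1.0, 0.0, 0.0], -1), ([1.0, 1.0, 1.0], 1),
        ([1.0, 0.0, 1.0], -1), ([1.0, 1.0, 0.5], 1)]
w = batch_perceptron(data)
print(w)
```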

Online Perceptron Algorithm

w ← (0, 0, …, 0)
repeat forever:
  accept a new example ⟨x_i, y_i⟩
  u_i ← w · x_i
  if y_i u_i < 0:
    for j = 1 … n: g_j ← y_i x_ij
    w ← w + η g

This is called stochastic gradient descent because the overall gradient is approximated by the gradient from each individual example.
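The online variant updates after each example instead of averaging over the whole batch. A minimal sketch, with an illustrative toy stream (a zero activation is treated as a mistake so training starts from w = 0):

```python
def online_perceptron(stream, eta=1.0):
    """Process (x, y) pairs one at a time; labels are +1/-1."""
    w = None
    for x, y in stream:
        if w is None:
            w = [0.0] * len(x)
        u = sum(wj * xj for wj, xj in zip(w, x))
        if y * u <= 0:                # mistake: single-example gradient step
            w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w

# Two passes over four separable examples (bias feature first).
data = [([1.0, 2.0], 1), ([1.0, -2.0], -1), ([1.0, 1.5], 1), ([1.0, -1.0], -1)]
w = online_perceptron(data * 2)
print(w)
```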

Learning Rates and Convergence The learning rate η must decrease to zero in order to guarantee convergence. The online case is known as the Robbins-Monro algorithm. It is guaranteed to converge under the following assumptions:

η_t → 0,  Σ_{t=0}^{∞} η_t = ∞,  Σ_{t=0}^{∞} η_t² < ∞.

The learning rate is also called the step size. Some algorithms (e.g., Newton's method, conjugate gradient) choose the step size automatically and converge faster. There is only one basin for linear threshold units, so a local minimum is the global minimum. Choosing a good starting point can make the algorithm converge faster.

Decision Boundaries A classifier can be viewed as partitioning the input space (or feature space) X into decision regions. A linear threshold unit always produces a linear decision boundary. A set of points that can be separated by a linear decision boundary is said to be linearly separable.

Exclusive-OR is Not Linearly Separable

Extending Perceptron to More than Two Classes If we have K > 2 classes, we can learn a separate LTU for each class. Let w_k be the weight vector for class k. We train it by treating examples from class y = k as positive and examples from all other classes as negative. Then we classify a new data point x according to ŷ = argmax_k w_k · x.

Summary of Perceptron algorithm for LTUs
- Directly learns a classifier.
- Local search: begins with an initial weight vector and modifies it iteratively to minimize an error function. The error function is loosely related to the goal of minimizing the number of classification errors.
- Eager: the classifier is constructed from the training examples, which can then be discarded.
- Online or batch: both variants of the algorithm can be used.

Logistic Regression Learn the conditional distribution P(y | x). Let p_y(x; w) be our estimate of P(y | x), where w is a vector of adjustable parameters. Assume only two classes y = 0 and y = 1, and

p_1(x; w) = exp(w · x) / (1 + exp(w · x))
p_0(x; w) = 1 − p_1(x; w).

On the homework, you will show that this is equivalent to

log [p_1(x; w) / p_0(x; w)] = w · x.

In other words, the log odds of class 1 is a linear function of x.
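A small sketch can verify numerically that the logistic form and the linear log odds agree; the particular w and x are illustrative values.

```python
import math

def p1(x, w):
    """p_1(x; w) = exp(w . x) / (1 + exp(w . x))."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    return math.exp(z) / (1.0 + math.exp(z))

w, x = [0.5, -1.0, 2.0], [1.0, 1.0, 1.0]
p = p1(x, w)
log_odds = math.log(p / (1.0 - p))
wx = sum(wj * xj for wj, xj in zip(w, x))
print(round(log_odds, 6), wx)   # the log odds equals w . x = 1.5
```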

Why the exp function? One reason: a linear function has a range from −∞ to +∞, and we need to force it to be positive and sum to 1 in order to be a probability.

Deriving a Learning Algorithm Since we are fitting a conditional probability distribution, we no longer seek to minimize the loss on the training data. Instead, we seek to find the probability distribution h that is most likely given the training data. Let S be the training sample. Our goal is to find h to maximize P(h | S):

argmax_h P(h | S) = argmax_h P(S | h) P(h) / P(S)   (Bayes' rule)
                  = argmax_h P(S | h) P(h)          (P(S) does not depend on h)
                  = argmax_h P(S | h)               (assuming a uniform prior P(h))
                  = argmax_h log P(S | h)

The distribution P(S | h) is called the likelihood function. The log likelihood is frequently used as the objective function for learning. It is often written as ℓ(w). The h that maximizes the likelihood on the training data is called the maximum likelihood estimator (MLE).

Computing the Likelihood In our framework, we assume that each training example (x_i, y_i) is drawn independently from the same (but unknown) probability distribution P(x, y). This means that the log likelihood of S is the sum of the log likelihoods of the individual training examples:

log P(S | h) = log Π_i P(x_i, y_i | h) = Σ_i log P(x_i, y_i | h).

Computing the Likelihood (2) Recall that any joint distribution P(a, b) can be factored as P(a | b) P(b). Hence, we can write

argmax_h log P(S | h) = argmax_h Σ_i log P(x_i, y_i | h) = argmax_h Σ_i log [P(y_i | x_i, h) P(x_i | h)].

In our case, P(x | h) = P(x), because it does not depend on h, so

argmax_h log P(S | h) = argmax_h Σ_i log [P(y_i | x_i, h) P(x_i)] = argmax_h Σ_i log P(y_i | x_i, h).

Log Likelihood for Conditional Probability Estimators We can express the log likelihood in a compact form known as the cross entropy. Consider an example (x_i, y_i). If y_i = 0, the log likelihood is log [1 − p_1(x_i; w)]; if y_i = 1, the log likelihood is log p_1(x_i; w). These cases are mutually exclusive, so we can combine them to obtain:

ℓ(y_i; x_i, w) = log P(y_i | x_i, w) = (1 − y_i) log [1 − p_1(x_i; w)] + y_i log p_1(x_i; w).

The goal of our learning algorithm will be to find w to maximize J(w) = Σ_i ℓ(y_i; x_i, w).
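A quick sanity check that the combined (1 − y)/y form reduces to the two separate cases; the probability value used is illustrative.

```python
import math

def log_likelihood(y, p):
    """l(y) = (1 - y) log(1 - p) + y log(p), for y in {0, 1}."""
    return (1 - y) * math.log(1.0 - p) + y * math.log(p)

p = 0.8
print(log_likelihood(1, p) == math.log(p))         # y = 1 keeps only log p
print(log_likelihood(0, p) == math.log(1.0 - p))   # y = 0 keeps only log(1 - p)
```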

Fitting Logistic Regression by Gradient Ascent

∂J(w)/∂w_j = Σ_i ∂ℓ(y_i; x_i, w)/∂w_j

∂ℓ(y_i; x_i, w)/∂w_j = ∂/∂w_j [ y_i log p_1(x_i; w) + (1 − y_i) log (1 − p_1(x_i; w)) ]
 = [ y_i / p_1(x_i; w) − (1 − y_i) / (1 − p_1(x_i; w)) ] · ∂p_1(x_i; w)/∂w_j
 = [ (y_i − p_1(x_i; w)) / (p_1(x_i; w) (1 − p_1(x_i; w))) ] · ∂p_1(x_i; w)/∂w_j

Gradient Computation (continued) Note that p_1 can also be written as

p_1(x_i; w) = 1 / (1 + exp(−w · x_i)).

From this, we obtain:

∂p_1(x_i; w)/∂w_j = [ exp(−w · x_i) / (1 + exp(−w · x_i))² ] x_ij = p_1(x_i; w) (1 − p_1(x_i; w)) x_ij.

Completing the Gradient Computation The gradient of the log likelihood of a single point is therefore

∂ℓ(y_i; x_i, w)/∂w_j = [ (y_i − p_1(x_i; w)) / (p_1(x_i; w) (1 − p_1(x_i; w))) ] · p_1(x_i; w) (1 − p_1(x_i; w)) x_ij = (y_i − p_1(x_i; w)) x_ij.

The overall gradient is

∂J(w)/∂w_j = Σ_i (y_i − p_1(x_i; w)) x_ij.

Batch Gradient Ascent for Logistic Regression Given: training examples (x_i, y_i), i = 1 … N.

w ← (0, 0, …, 0)
repeat:
  g ← (0, 0, …, 0)
  for i = 1 … N:
    p_i ← 1 / (1 + exp(−w · x_i))
    error_i ← y_i − p_i
    for j = 1 … n: g_j ← g_j + error_i · x_ij
  w ← w + η g

An online gradient ascent algorithm can be constructed, of course. Most statistical packages use a second-order (Newton-Raphson) algorithm for faster convergence. Each iteration of the second-order method can be viewed as a weighted least squares computation, so the algorithm is known as Iteratively Reweighted Least Squares (IRLS).
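The batch loop above translates directly into Python. This is a hedged sketch: the step size, iteration count, and toy dataset are illustrative choices, not values from the lecture.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(examples, eta=0.5, iters=2000):
    """examples: (x, y) pairs with y in {0, 1}; x includes the bias feature."""
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(iters):
        g = [0.0] * n
        for x, y in examples:
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            err = y - p                      # gradient term (y_i - p_1(x_i; w))
            for j in range(n):
                g[j] += err * x[j]
        w = [wj + eta * gj for wj, gj in zip(w, g)]   # ascent step
    return w

# y = 1 tends to occur for larger x_1 (bias feature first).
data = [([1.0, 0.0], 0), ([1.0, 1.0], 0), ([1.0, 1.0], 1), ([1.0, 2.0], 1)]
w = fit_logistic(data)
print([round(sigmoid(sum(wj * xj for wj, xj in zip(w, x))), 3) for x, _ in data])
```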

Logistic Regression Implements a Linear Discriminant Function In the 2-class, 0/1 loss function case, we should predict ŷ = 1 if

E_{y|x}[L(0, y)] > E_{y|x}[L(1, y)]
Σ_y P(y | x) L(0, y) > Σ_y P(y | x) L(1, y)
P(y = 1 | x) L(0, 1) > P(y = 0 | x) L(1, 0)
P(y = 1 | x) > P(y = 0 | x)          (for 0/1 loss, L(0, 1) = L(1, 0) = 1)
P(y = 1 | x) / P(y = 0 | x) > 1
log [P(y = 1 | x) / P(y = 0 | x)] > 0
w · x > 0.

A similar derivation can be done for arbitrary L(0, 1) and L(1, 0).

Extending Logistic Regression to K > 2 classes Choose class K to be the reference class and represent each of the other classes as a logistic function of the odds of class k versus class K:

log [P(y = 1 | x) / P(y = K | x)] = w_1 · x
log [P(y = 2 | x) / P(y = K | x)] = w_2 · x
…
log [P(y = K−1 | x) / P(y = K | x)] = w_{K−1} · x

Gradient ascent can be applied to simultaneously train all of these weight vectors w_k.

Logistic Regression for K > 2 (continued) The conditional probability for class k ≠ K can be computed as

P(y = k | x) = exp(w_k · x) / (1 + Σ_{ℓ=1}^{K−1} exp(w_ℓ · x)).

For class K, the conditional probability is

P(y = K | x) = 1 / (1 + Σ_{ℓ=1}^{K−1} exp(w_ℓ · x)).
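The reference-class construction above can be sketched as follows; the weight vectors and input are illustrative values for K = 3 classes with 2 features.

```python
import math

def class_probs(x, ws):
    """ws holds w_1 .. w_{K-1}; class K is the reference class."""
    scores = [math.exp(sum(wj * xj for wj, xj in zip(w, x))) for w in ws]
    denom = 1.0 + sum(scores)
    # Classes 1..K-1 get exp(w_k . x) / denom; class K gets 1 / denom.
    return [s / denom for s in scores] + [1.0 / denom]

ws = [[1.0, 0.5], [-0.5, 1.0]]
probs = class_probs([1.0, 2.0], ws)
print(round(sum(probs), 6))   # the K probabilities sum to 1
```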

Summary of Logistic Regression
- Learns the conditional probability distribution P(y | x).
- Local search: begins with an initial weight vector and modifies it iteratively to maximize the log likelihood of the data.
- Eager: the classifier is constructed from the training examples, which can then be discarded.
- Online or batch: both online and batch variants of the algorithm exist.

Linear Discriminant Analysis Learn P(x, y). This is sometimes called the generative approach, because we can think of P(x, y) as a model of how the data is generated. For example, if we factor the joint distribution into the form P(x, y) = P(y) P(x | y), we can think of P(y) as generating a value for y according to P(y). Then we can think of P(x | y) as generating a value for x given the previously-generated value for y. This can be described as a Bayesian network: y → x.

Linear Discriminant Analysis (2) P(y) is a discrete multinomial distribution. Example: P(y = 0) = 0.31, P(y = 1) = 0.69 will generate 31% negative examples and 69% positive examples. For LDA, we assume that P(x | y) is a multivariate normal distribution with mean μ_k and covariance matrix Σ:

P(x | y = k) = 1 / ((2π)^{n/2} |Σ|^{1/2}) · exp( −(1/2) (x − μ_k)^T Σ^{−1} (x − μ_k) ).

Multivariate Normal Distributions: A tutorial Recall that the univariate normal (Gaussian) distribution has the formula

p(x) = 1 / (√(2π) σ) · exp( −(x − μ)² / (2σ²) )

where μ is the mean and σ² is the variance. Graphically, it is the familiar bell curve.

The Multivariate Gaussian A 2-dimensional Gaussian is defined by a mean vector μ = (μ_1, μ_2) and a covariance matrix

Σ = [ σ²_{1,1}  σ²_{1,2} ; σ²_{2,1}  σ²_{2,2} ]

where σ²_{i,j} = E[(x_i − μ_i)(x_j − μ_j)] is the variance (if i = j) or covariance (if i ≠ j). Σ is symmetric and positive definite.

The Multivariate Gaussian (2) If Σ is the identity matrix [1 0; 0 1] and μ = (0, 0), we get the standard normal distribution.

The Multivariate Gaussian (3) If Σ is a diagonal matrix, then x_1 and x_2 are independent random variables, and lines of equal probability are ellipses parallel to the coordinate axes. (The slide shows an example with a diagonal Σ and its axis-aligned elliptical contours.)

The Multivariate Gaussian (4) Finally, if Σ is an arbitrary covariance matrix, then x_1 and x_2 are dependent, and lines of equal probability are ellipses tilted relative to the coordinate axes. (The slide shows an example with nonzero off-diagonal entries and its tilted elliptical contours.)

Estimating a Multivariate Gaussian Given a set of N data points {x_1, …, x_N}, we can compute the maximum likelihood estimate for the multivariate Gaussian distribution as follows:

μ̂ = (1/N) Σ_i x_i
Σ̂ = (1/N) Σ_i (x_i − μ̂)(x_i − μ̂)^T

Note that the product in the second equation is an outer product. The outer product of two vectors is a matrix:

x y^T = [x_1; x_2; x_3] [y_1 y_2 y_3] = [ x_1y_1  x_1y_2  x_1y_3 ; x_2y_1  x_2y_2  x_2y_3 ; x_3y_1  x_3y_2  x_3y_3 ]

For comparison, the usual dot product x^T y is a scalar.
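The two estimates above can be sketched with an explicit outer product; the four data points are an illustrative toy set.

```python
def mle_gaussian(X):
    """Maximum likelihood mean and covariance of a list of n-dim points."""
    N, n = len(X), len(X[0])
    mu = [sum(x[j] for x in X) / N for j in range(n)]
    sigma = [[0.0] * n for _ in range(n)]
    for x in X:
        d = [x[j] - mu[j] for j in range(n)]        # x_i - mu
        for a in range(n):
            for b in range(n):
                sigma[a][b] += d[a] * d[b] / N      # outer product (x-mu)(x-mu)^T
    return mu, sigma

X = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mu, sigma = mle_gaussian(X)
print(mu, sigma)   # mean (1, 1); unit variances, zero covariance
```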

The LDA Model Linear discriminant analysis assumes that the joint distribution has the form

P(x, y) = P(y) · 1 / ((2π)^{n/2} |Σ|^{1/2}) · exp( −(1/2) (x − μ_y)^T Σ^{−1} (x − μ_y) )

where each μ_y is the mean of a multivariate Gaussian for examples belonging to class y, and Σ is a single covariance matrix shared by all classes.

Fitting the LDA Model It is easy to learn the LDA model in a single pass through the data. Let π̂_k be our estimate of P(y = k), and let N_k be the number of training examples belonging to class k:

π̂_k = N_k / N
μ̂_k = (1/N_k) Σ_{i: y_i = k} x_i
Σ̂ = (1/N) Σ_i (x_i − μ̂_{y_i})(x_i − μ̂_{y_i})^T

Note that each x_i has its own class mean μ̂_{y_i} subtracted prior to taking the outer product. This gives us the pooled estimate of Σ.
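The single-pass fit above can be sketched for two classes; the toy examples are illustrative values.

```python
def fit_lda(examples):
    """examples: (x, y) pairs with y in {0, 1}. Returns (priors, means, pooled Sigma)."""
    n = len(examples[0][0])
    N = len(examples)
    Nk = {0: 0, 1: 0}
    sums = {0: [0.0] * n, 1: [0.0] * n}
    for x, y in examples:
        Nk[y] += 1
        for j in range(n):
            sums[y][j] += x[j]
    pi = {k: Nk[k] / N for k in (0, 1)}                       # class priors
    mu = {k: [s / Nk[k] for s in sums[k]] for k in (0, 1)}    # class means
    sigma = [[0.0] * n for _ in range(n)]
    for x, y in examples:
        d = [x[j] - mu[y][j] for j in range(n)]   # subtract the class's own mean
        for a in range(n):
            for b in range(n):
                sigma[a][b] += d[a] * d[b] / N    # pooled over both classes
    return pi, mu, sigma

data = [([0.0, 0.0], 0), ([2.0, 0.0], 0), ([4.0, 4.0], 1), ([6.0, 4.0], 1)]
pi, mu, sigma = fit_lda(data)
print(pi, mu, sigma)
```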

LDA learns an LTU Consider the 2-class case with a 0/1 loss function. Recall that

P(y = k | x) = P(x, y = k) / P(x) = P(x, y = k) / (P(x, y = 0) + P(x, y = 1)).

Also recall from our derivation of the Logistic Regression classifier that we should classify into class ŷ = 1 if P(y = 1 | x) / P(y = 0 | x) > 1. Hence, for LDA, we should classify into ŷ = 1 if

P(x, y = 1) / P(x, y = 0) > 1,

because the denominators cancel.

LDA learns an LTU (2) Substituting the LDA model,

P(x, y = 1) / P(x, y = 0) = [ P(y = 1) exp( −(1/2)(x − μ_1)^T Σ^{−1}(x − μ_1) ) ] / [ P(y = 0) exp( −(1/2)(x − μ_0)^T Σ^{−1}(x − μ_0) ) ]

because the normalizing constants (2π)^{n/2} |Σ|^{1/2} cancel. Taking logs:

log [P(x, y = 1) / P(x, y = 0)] = log [P(y = 1) / P(y = 0)] − (1/2) [ (x − μ_1)^T Σ^{−1}(x − μ_1) − (x − μ_0)^T Σ^{−1}(x − μ_0) ].

LDA learns an LTU (3) Let's focus on the term in brackets:

(x − μ_1)^T Σ^{−1}(x − μ_1) − (x − μ_0)^T Σ^{−1}(x − μ_0)

Expand the quadratic forms as follows:

(x − μ_1)^T Σ^{−1}(x − μ_1) = x^T Σ^{−1} x − x^T Σ^{−1} μ_1 − μ_1^T Σ^{−1} x + μ_1^T Σ^{−1} μ_1
(x − μ_0)^T Σ^{−1}(x − μ_0) = x^T Σ^{−1} x − x^T Σ^{−1} μ_0 − μ_0^T Σ^{−1} x + μ_0^T Σ^{−1} μ_0

Subtract the lower line from the upper line and collect similar terms. Note that the quadratic terms cancel! This leaves only terms linear in x:

−x^T Σ^{−1}(μ_1 − μ_0) − (μ_1 − μ_0)^T Σ^{−1} x + μ_1^T Σ^{−1} μ_1 − μ_0^T Σ^{−1} μ_0.

LDA learns an LTU (4) Note that since Σ^{−1} is symmetric, a^T Σ^{−1} b = b^T Σ^{−1} a for any two vectors a and b. Hence, the first two terms can be combined, and the bracketed difference becomes

−2 x^T Σ^{−1}(μ_1 − μ_0) + μ_1^T Σ^{−1} μ_1 − μ_0^T Σ^{−1} μ_0.

Now plug this back in:

log [P(x, y = 1) / P(x, y = 0)] = log [P(y = 1) / P(y = 0)] + x^T Σ^{−1}(μ_1 − μ_0) − (1/2) μ_1^T Σ^{−1} μ_1 + (1/2) μ_0^T Σ^{−1} μ_0.

LDA learns an LTU (5) Define

w = Σ^{−1}(μ_1 − μ_0)
c = log [P(y = 1) / P(y = 0)] − (1/2) μ_1^T Σ^{−1} μ_1 + (1/2) μ_0^T Σ^{−1} μ_0.

Then we classify ŷ = 1 if w · x + c > 0.
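Turning fitted LDA parameters into the linear rule w · x + c > 0 can be sketched for the 2-d case; the means, covariance, and priors below are illustrative, and inv2 is a small hypothetical helper for 2x2 inversion.

```python
import math

def inv2(S):
    """Inverse of a 2x2 matrix."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return [[S[1][1] / det, -S[0][1] / det],
            [-S[1][0] / det, S[0][0] / det]]

def lda_to_ltu(mu0, mu1, sigma, pi0, pi1):
    Si = inv2(sigma)
    diff = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    w = [Si[0][0] * diff[0] + Si[0][1] * diff[1],
         Si[1][0] * diff[0] + Si[1][1] * diff[1]]        # w = Sigma^{-1}(mu1 - mu0)
    def quad(m):                                         # m^T Sigma^{-1} m
        v = [Si[0][0] * m[0] + Si[0][1] * m[1],
             Si[1][0] * m[0] + Si[1][1] * m[1]]
        return m[0] * v[0] + m[1] * v[1]
    c = math.log(pi1 / pi0) - 0.5 * quad(mu1) + 0.5 * quad(mu0)
    return w, c

# Unit covariance, equal priors, means at (0,0) and (2,0):
w, c = lda_to_ltu([0.0, 0.0], [2.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.5, 0.5)
print(w, c)   # boundary 2*x1 - 2 > 0, i.e. x1 = 1, midway between the means
```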

Two Geometric Views of LDA View 1: Mahalanobis Distance The quantity

D_M(x, u) = (x − u)^T Σ^{−1} (x − u)

is known as the (squared) Mahalanobis distance between x and u. We can think of the matrix Σ^{−1} as a linear distortion of the coordinate system that converts the standard Euclidean distance into the Mahalanobis distance. Note that

log [π_k P(x | y = k)] = log π_k − (1/2) D_M(x, μ_k) + const.

Therefore, we can view LDA as computing D_M(x, μ_0) and D_M(x, μ_1) and then classifying x according to which mean, μ_0 or μ_1, is closest in Mahalanobis distance (corrected by log π_k).
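The distortion view above can be made concrete for 2-d inputs; the diagonal covariance is an illustrative choice.

```python
def mahalanobis_sq(x, u, sigma_inv):
    """Squared Mahalanobis distance D_M(x, u) = (x-u)^T Sigma^{-1} (x-u)."""
    d = [x[0] - u[0], x[1] - u[1]]
    v = [sigma_inv[0][0] * d[0] + sigma_inv[0][1] * d[1],
         sigma_inv[1][0] * d[0] + sigma_inv[1][1] * d[1]]
    return d[0] * v[0] + d[1] * v[1]

# With Sigma = diag(4, 1), a step of 2 along x1 costs as much as 1 along x2.
Si = [[0.25, 0.0], [0.0, 1.0]]   # inverse of diag(4, 1)
print(mahalanobis_sq([2.0, 0.0], [0.0, 0.0], Si),
      mahalanobis_sq([0.0, 1.0], [0.0, 0.0], Si))   # both 1.0
```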

View 2: Most Informative Low-Dimensional Projection LDA can also be viewed as finding a hyperplane of dimension K − 1 such that x and the {μ_k} are projected down into this hyperplane, and then x is classified to the nearest μ_k using Euclidean distance inside this hyperplane.

Generalizations of LDA General Gaussian Classifier: instead of assuming that all classes share the same Σ, we can allow each class k to have its own Σ_k. In this case, the resulting classifier will be a quadratic threshold unit (instead of an LTU). Naïve Gaussian Classifier: allow each class to have its own Σ_k, but require that each Σ_k be diagonal. This means that within each class, any pair of features x_{j1} and x_{j2} will be assumed to be statistically independent. The resulting classifier is still a quadratic threshold unit (but with a restricted form).

Summary of Linear Discriminant Analysis
- Learns the joint probability distribution P(x, y).
- Direct computation: the maximum likelihood estimate of P(x, y) can be computed from the data without search. However, inverting the matrix Σ requires O(n³) time.
- Eager: the classifier is constructed from the training examples, which can then be discarded.
- Batch: only a batch algorithm is available. An online algorithm could be constructed if there is an online algorithm for incrementally updating Σ^{−1}. (This is easy for the case where Σ is diagonal.)

Comparing Perceptron, Logistic Regression, and LDA How should we choose among these three algorithms? There is a big debate within the machine learning community!

Issues in the Debate Statistical efficiency: if the generative model P(x, y) is correct, then LDA usually gives the highest accuracy, particularly when the amount of training data is small. If the model is correct, LDA requires 30% less data than Logistic Regression in theory. Computational efficiency: generative models typically are the easiest to learn. In our example, LDA can be computed directly from the data without using gradient descent.

Issues in the Debate Robustness to changing loss functions: both generative and conditional probability models allow the loss function to be changed at run time without re-learning. Perceptron requires re-training the classifier when the loss function changes. Robustness to model assumptions: the generative model usually performs poorly when the assumptions are violated. For example, if P(x | y) is very non-Gaussian, then LDA won't work well. Logistic Regression is more robust to model assumptions, and Perceptron is even more robust. Robustness to missing values and noise: in many applications, some of the features x_ij may be missing or corrupted in some of the training examples. Generative models typically provide better ways of handling this than non-generative models.