Statistical Machine Learning




Statistical Machine Learning, UoC Stats 37700, Winter quarter. Lecture 4: classical linear and quadratic discriminants.

Linear separation. For two classes in $\mathbb{R}^d$, the simplest idea is to separate the classes using a hyperplane $H_{w,b} = \{x : w^\top x + b \ge 0\}$. The simplest extension to several classes: consider a family of linear scores $s_y(x) = w_y^\top x - b_y$ and the rule $f(x) = \operatorname{Arg\,Max}_{y \in \mathcal{Y}} s_y(x)$. Then the separation between any two classes is linear.

Classification via linear regression. The simplest idea for two classes: perform a standard linear regression of $Y$ (coded e.g. in $\{0,1\}$) on $X$, giving $\hat{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}$, where $\mathbf{X}$ is the $(n, d+1)$ extended data matrix and $\mathbf{Y}$ the $(n,1)$ vector of training classes. For a new point $x$, the linear regression function predicts $x^\top \hat{w}$, and the decision function is $\mathbf{1}\{x^\top \hat{w} \ge \frac{1}{2}\}$.
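The regress-then-threshold rule above can be sketched in a few lines of numpy; the synthetic two-cluster data, class means, and sample sizes below are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy classes in R^2, Y coded in {0, 1}.
X0 = rng.normal(loc=[-2.0, 0.0], size=(50, 2))
X1 = rng.normal(loc=[+2.0, 0.0], size=(50, 2))
X = np.vstack([X0, X1])
Y = np.concatenate([np.zeros(50), np.ones(50)])

# Extended (n, d+1) data matrix: prepend a constant column.
Xe = np.hstack([np.ones((len(X), 1)), X])

# Least-squares fit: w_hat = (X^T X)^{-1} X^T Y.
w_hat = np.linalg.solve(Xe.T @ Xe, Xe.T @ Y)

# Decision rule: predict class 1 iff x^T w_hat >= 1/2.
pred = (Xe @ w_hat >= 0.5).astype(int)
accuracy = (pred == Y).mean()
```

With the clusters this far apart, the thresholded regression recovers almost all training labels.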

We can extend this idea to $K$ classes by performing a regression on each of the class indicator variables $\mathbf{1}\{Y = y\}$, $y \in \mathcal{Y}$. In matrix form this is the same as above, replacing $\mathbf{Y}$ by the matrix of indicator responses. It is equivalent to solving globally the least squares problem $\min_W \sum_{i=1}^{n} \| Y_i - X_i W \|^2$, where $W$ is a coefficient matrix and $Y_i$ is the class indicator vector.

Problems using linear regression. For multiple classes, a masking problem is likely to occur. A possible fix is to extend the data vectors with quadratic components. Two disadvantages, however: masking can still occur when there are many classes, and increasing the degree of the components leads to too many parameters and overfitting.

Separating two Gaussians. We can adopt a simple parametric generative approach and model the classes by simple Gaussians. Assume a Gaussian generative model for the class distributions: $p(x \mid Y = i) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left(-\frac{1}{2}(x - m_i)^\top \Sigma_i^{-1}(x - m_i)\right)$. What is the Bayes classifier for this model? Remember that if the generating densities for classes 0 and 1 are $f_0, f_1$ and the marginal probability is $p = P(Y = 1)$, then the Bayes decision is given by $F(x) = \operatorname{Arg\,Max}(p f_1(x), (1-p) f_0(x))$.

Hence, denoting by $d_i$ the Mahalanobis distance corresponding to $\Sigma_i$, $d_i^2(x, y) = (x - y)^\top \Sigma_i^{-1}(x - y)$, the decision rule is class 1 or 2 according to whether $d_1^2(x, m_1) - d_2^2(x, m_2)$ falls below or above a threshold $t(p, \Sigma_1, \Sigma_2)$; it is a quadratic decision rule (QDA). In the case $\Sigma_1 = \Sigma_2$ it becomes a linear decision rule (LDA). We can use a pooled estimate for the two covariance matrices: $\hat{\Sigma} = \frac{1}{n-2}(S_1 + S_2)$, where $S_\ell = \sum_{i : Y_i = \ell} (X_i - \hat{m}_\ell)(X_i - \hat{m}_\ell)^\top$.
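A minimal numpy sketch of the pooled-covariance rule above, with equal priors (so the threshold is zero and we simply compare squared Mahalanobis distances); the two class means, sample sizes, and the query point are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian classes sharing a covariance (the LDA setting).
m1, m2 = np.array([0.0, 0.0]), np.array([3.0, 1.0])
X1 = rng.normal(size=(100, 2)) + m1
X2 = rng.normal(size=(100, 2)) + m2

mh1, mh2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - mh1).T @ (X1 - mh1)
S2 = (X2 - mh2).T @ (X2 - mh2)

# Pooled covariance estimate: Sigma = (S1 + S2) / (n - 2).
n = len(X1) + len(X2)
Sigma = (S1 + S2) / (n - 2)
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis2(x, m):
    d = x - m
    return d @ Sigma_inv @ d

# Equal priors: classify by the smaller squared Mahalanobis distance.
x = np.array([0.5, 0.2])
label = 1 if mahalanobis2(x, mh1) < mahalanobis2(x, mh2) else 2
```

The query point sits near the first class mean, so the rule assigns it to class 1.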

Multiclass: the previous analysis suggests looking at the criterion $\operatorname{Arg\,Min}_{y \in \mathcal{Y}} \delta_y(x)$, where $\delta_y(x) = \frac{1}{2}(x - m_y)^\top \Sigma_y^{-1}(x - m_y) + t_y(p_y, \Sigma_y)$ in the general QDA case (then the decision regions are intersections of quadratic regions), or $\delta_y(x) = -x^\top \Sigma^{-1} m_y + t_y(p_y)$ in the common-variance (LDA) case (then the decision regions are intersections of half-planes).

Relation to linear regression. Theorem: in the two-class case the direction of $w$ found by the Gaussian model coincides with the one found by classical linear regression... but the constants $b$ differ. In practice it is recommended not to trust either, but to treat the constant as a separate parameter, optimized to reduce the empirical classification error. Regression using quadratic terms does not give the same result as QDA.

Fisher's linear discriminant. Yet another approach to the problem: find the projection maximizing the ratio of inter-class to intra-class variance; for two classes, $J(w) = \frac{(w^\top \hat{m}_1 - w^\top \hat{m}_2)^2}{w^\top (S_1 + S_2) w}$, where $\hat{m}_\ell$ are the empirical class means and $S_\ell = \sum_{i : Y_i = \ell} (X_i - \hat{m}_\ell)(X_i - \hat{m}_\ell)^\top$. Setting $\frac{dJ}{dw} = 0$ leads to the solution $w = \lambda (S_1 + S_2)^{-1}(\hat{m}_1 - \hat{m}_2)$ (again, the scaling is arbitrary). The projection direction coincides with the previous methods; Fisher's criterion only provides the projection direction (again, optimize the constant separately).
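The closed-form maximizer can be checked empirically: a small numpy sketch (synthetic anisotropic data, invented mixing matrix and means) that compares $J$ at the Fisher direction against random directions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two classes with anisotropic within-class scatter.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
X1 = rng.normal(size=(80, 2)) @ A
X2 = rng.normal(size=(80, 2)) @ A + np.array([2.0, 2.0])

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1)
S2 = (X2 - m2).T @ (X2 - m2)
Sw = S1 + S2

def J(w):
    # Fisher ratio: between-class over within-class variance along w.
    return (w @ m1 - w @ m2) ** 2 / (w @ Sw @ w)

# Closed-form maximizer (up to arbitrary scaling).
w_fisher = np.linalg.solve(Sw, m1 - m2)

# The Fisher direction should dominate any other direction.
worst_gap = min(J(w_fisher) - J(rng.normal(size=2)) for _ in range(100))
```

Since $w = (S_1 + S_2)^{-1}(\hat m_1 - \hat m_2)$ maximizes $J$ exactly, the gap is nonnegative up to floating-point error for every sampled direction.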

Fisher's discriminant in multi-class. Fisher's criterion can be extended to the multi-class case by maximizing the ratio (Rayleigh coefficient) $J(w) = \frac{w^\top M w}{w^\top S w}$, where $S = \sum_y S_y$ is the pooled intraclass covariance and $M = \sum_y (m_y - m)(m_y - m)^\top$ is the interclass covariance (covariance of the class centroids). (Note that normalization of the matrices is unimportant.) This leads to the generalized eigenvalue problem $Mw = \lambda S w$. It can be iterated to find $|\mathcal{Y}| - 1$ dimensions by constraining orthogonality (for the scalar product $\langle w, w' \rangle = w^\top S w'$) with previously found directions. Equivalent to the following: whiten the data by applying $S^{-1/2}$; perform PCA on the transformed class centroids; apply $S^{-1/2}$ to the found directions.
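The whitening route above solves the generalized eigenproblem with plain numpy; a sketch on made-up three-class data in $\mathbb{R}^4$, where at most $K - 1 = 2$ directions can carry between-class signal.

```python
import numpy as np

rng = np.random.default_rng(3)

# Three classes in R^4 with shared within-class scatter.
means = [np.array([0., 0., 0., 0.]), np.array([3., 0., 0., 0.]),
         np.array([0., 3., 0., 0.])]
Xs = [rng.normal(size=(60, 4)) + m for m in means]

# Pooled within-class scatter S and between-class covariance M.
mhats = [X.mean(axis=0) for X in Xs]
S = sum((X - mh).T @ (X - mh) for X, mh in zip(Xs, mhats))
grand = np.mean(mhats, axis=0)
M = sum(np.outer(mh - grand, mh - grand) for mh in mhats)

# Solve M w = lambda S w by whitening: with S = U diag(e) U^T,
# the problem becomes an ordinary symmetric eigenproblem.
e, U = np.linalg.eigh(S)
S_inv_half = U @ np.diag(e ** -0.5) @ U.T
evals, V = np.linalg.eigh(S_inv_half @ M @ S_inv_half)
W = S_inv_half @ V[:, ::-1]   # canonical directions, leading first
evals = evals[::-1]
```

Because $M$ is built from three centroids around their mean, its rank is at most 2, so only the two leading eigenvalues are nonzero.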

Properties of Fisher's canonical projections. This is a linear dimension-reduction method aimed at separating the classes (using first- and second-moment information only). It is invariant under any invertible linear transform of the input space. When we take $L = \min(d, |\mathcal{Y}| - 1)$ canonical coordinates, this commutes with LDA. When we take $L < \min(d, |\mathcal{Y}| - 1)$ canonical coordinates, this is equivalent to a reduced-rank LDA, i.e. where we require the means of the Gaussians in the model to belong to a space of dimension $L$ (and perform ML fitting). It can also be seen as a CCA of $X$ with respect to the class indicator variables $Y$.

Regularized linear and quadratic discriminants. When the dimension $d$ is too large, overfitting and instability can occur. Looking back at standard linear regression, a possible remedy is ridge regression, finding $\hat{\beta}_\lambda = \operatorname{Arg\,Min}_\beta \sum_{i=1}^{N} \left( Y_i - \beta_0 - \sum_{j=1}^{d} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{d} \beta_j^2$. The solution is given by $\hat{\beta} = (\mathbf{X}^\top \mathbf{X} + \lambda I)^{-1} \mathbf{X}^\top \mathbf{Y}$: regularization by shrinkage.
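The shrinkage effect is easy to see numerically; a minimal sketch on synthetic regression data (centered, no separate intercept for brevity, with invented sample size and noise level).

```python
import numpy as np

rng = np.random.default_rng(4)

# Noisy linear data.
n, d = 40, 10
X = rng.normal(size=(n, d))
beta_true = rng.normal(size=d)
y = X @ beta_true + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    """Shrinkage solution (X^T X + lambda I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

beta_ols = ridge(X, y, 0.0)     # ordinary least squares
beta_big = ridge(X, y, 1e6)     # heavy shrinkage: coefficients -> 0
```

Increasing $\lambda$ pulls the coefficient vector toward zero, which is exactly the "regularization by shrinkage" of the slide.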

By a (weak) analogy with ridge regression, we can consider the following regularized version for the covariance estimate in LDA: $\hat{\Sigma}_\gamma = \gamma \hat{\Sigma} + (1 - \gamma)\hat{\sigma}^2 I$; another possibility is $\hat{\Sigma}_\gamma = \gamma \hat{\Sigma} + (1 - \gamma) D$, where $D$ is the diagonal matrix formed with the entries $\hat{\sigma}_i^2$. We can also regularize QDA using the following scheme for the estimator of the covariance matrix of class $k$: $\hat{\Sigma}_k(\alpha) = \alpha \hat{\Sigma}_k + (1 - \alpha)\hat{\Sigma}$... and we can even combine the two. In practice, as usual, it is recommended to use cross-validation to tune the parameters.
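A small sketch of the first scheme, $\hat\Sigma_\gamma = \gamma\hat\Sigma + (1-\gamma)\hat\sigma^2 I$, in the regime that motivates it: more features than samples, where the raw empirical covariance is singular but the shrunk estimate is invertible. Sample sizes and $\gamma$ are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# More dimensions than samples: the empirical covariance is singular.
n, d = 20, 30
X = rng.normal(size=(n, d))
Sigma_hat = np.cov(X, rowvar=False)

def shrink_to_identity(Sigma, gamma):
    """Sigma_gamma = gamma * Sigma + (1 - gamma) * sigma2 * I,
    with sigma2 the average variance."""
    d = Sigma.shape[0]
    sigma2 = np.trace(Sigma) / d
    return gamma * Sigma + (1 - gamma) * sigma2 * np.eye(d)

# The raw estimate is rank-deficient; the shrunk one is well conditioned.
eig_raw = np.linalg.eigvalsh(Sigma_hat).min()
eig_shrunk = np.linalg.eigvalsh(shrink_to_identity(Sigma_hat, 0.5)).min()
```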

Linear logistic regression. Recall that in the 2-class case, logistic regression aims at modeling the log-odds ratio function $\log \frac{P(Y=1 \mid X=x)}{P(Y=0 \mid X=x)} = \log \frac{\eta(x)}{1 - \eta(x)}$. In the multiclass case, this can be generalized to log-odds ratios with respect to some (arbitrary) reference class: $s_i(x) = \log \frac{P(Y=i \mid X=x)}{P(Y=0 \mid X=x)}$; again the resulting (plug-in) classifier outputs the max of the score functions. If we model the scores by linear functions, we again get a linear classifier.

The model $s_i(x) = \beta_{i0} + \beta_i^\top x$ gives rise to the conditional class probabilities $P(Y = i \mid X = x) = \frac{\exp(\beta_{i0} + \beta_i^\top x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^\top x)}$; we can fit this by maximum likelihood.
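The normalization above can be sketched directly; `class_probabilities` and all parameter values below are hypothetical names and numbers for the illustration.

```python
import numpy as np

def class_probabilities(x, betas0, betas):
    """P(Y = i | X = x) for the reference class 0 and classes 1..K-1.

    betas0: (K-1,) intercepts; betas: (K-1, d) slopes."""
    scores = betas0 + betas @ x        # s_i(x), i = 1..K-1
    num = np.exp(scores)
    denom = 1.0 + num.sum()            # the reference class has score 0
    return np.concatenate([[1.0 / denom], num / denom])

p = class_probabilities(np.array([1.0, -1.0]),
                        np.array([0.5, -0.5]),
                        np.array([[1.0, 0.0], [0.0, 1.0]]))
```

By construction the probabilities are positive and sum to one, whatever the scores.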

Algorithm for linear logistic regression. We consider the 2-class case ($\mathcal{Y} = \{0, 1\}$). The log-likelihood function is $\ell(\beta) = \sum_{i=1}^{n} \left( Y_i \beta^\top X_i - \log(1 + \exp(\beta^\top X_i)) \right)$ (where the data points $X_i$ are augmented with a constant coordinate), and $\frac{d\ell}{d\beta} = \sum_{i=1}^{n} X_i (Y_i - \eta(X_i, \beta)) \;(= 0)$; we can solve this using a Newton-Raphson algorithm with step $\beta^{\text{new}} = \beta^{\text{old}} - \left( \frac{d^2 \ell}{d\beta\, d\beta^\top} \right)^{-1} \frac{d\ell}{d\beta}$.

We have $\frac{d^2 \ell}{d\beta\, d\beta^\top} = -\sum_{i=1}^{n} X_i X_i^\top \, \eta(X_i, \beta)(1 - \eta(X_i, \beta))$. If we denote by $W$ the diagonal matrix of weights $\eta(X_i, \beta)(1 - \eta(X_i, \beta))$, we can rewrite the Newton-Raphson step in matrix form as $\beta^{\text{new}} = (\mathbf{X}^\top W \mathbf{X})^{-1} \mathbf{X}^\top W Z$, where $Z = \mathbf{X}\beta^{\text{old}} + W^{-1}(Y - \eta)$. This can be seen as an iteratively reweighted least squares fitting.
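A compact numpy sketch of this iteratively reweighted least squares loop on synthetic, non-separable data (the true coefficients, sample size, and iteration count are arbitrary choices for the example).

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic 2-class data, Y in {0, 1}, constant coordinate prepended.
n = 200
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
beta_true = np.array([0.5, 2.0, -1.0])
Y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

beta = np.zeros(3)
for _ in range(25):
    eta = 1 / (1 + np.exp(-X @ beta))      # eta(X_i, beta)
    w = eta * (1 - eta)                    # diagonal of W
    # Newton-Raphson step as weighted least squares on the
    # adjusted response Z = X beta + W^{-1} (Y - eta).
    z = X @ beta + (Y - eta) / w
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))

grad = X.T @ (Y - 1 / (1 + np.exp(-X @ beta)))  # score at the optimum
```

At convergence the score equation $\sum_i X_i (Y_i - \eta(X_i, \beta)) = 0$ holds to machine precision.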

LDA vs. logistic regression. LDA and logistic regression both fit a linear model to the log-odds ratio. They do not result in the same output, however. Why? The answer is that logistic regression only fits the conditional probabilities $P(Y = i \mid X)$ and remains agnostic as to the distribution of the covariate $X$. LDA, on the other hand, implicitly fits a model for the joint distribution of $(X, Y)$ (a mixture of Gaussians). In practice, logistic regression is therefore considered more adaptive, but also less robust.

The linear perceptron. Assume that the point clouds for the two classes in the training set turn out to be perfectly separable by some hyperplane. Then LDA will not necessarily return a hyperplane having zero training error. On the other hand, logistic regression will return infinite parameters (why?). Another approach: consider minimizing a criterion based on the distance of misclassified examples to the hyperplane: $D(w, b) = -\sum_{i : Y_i(X_i^\top w + b) < 0} Y_i (X_i^\top w + b)$ (here we assume $Y_i \in \{-1, 1\}$!). (Note that the actual average distance to the hyperplane would be $\|w\|^{-1} D(w, b)$.) Principle of perceptron training (more or less): minimize the above by a kind of stochastic gradient descent.

Convergence of perceptron training. Very simple iterative rule: first, put $R = \max_i \|X_i\|$. If all points are correctly classified, stop. If there are misclassified points, choose such a point $(X_i, Y_i)$ arbitrarily and put $\begin{bmatrix} w^{\text{new}} \\ b^{\text{new}} \end{bmatrix} = \begin{bmatrix} w^{\text{old}} \\ b^{\text{old}} \end{bmatrix} + \begin{bmatrix} Y_i X_i \\ Y_i R^2 \end{bmatrix}$. Repeat. Theorem: if there exists a separating hyperplane $(w^*, b^*)$ with $\|w^*\| = 1$ such that for all $i$, $Y_i(w^{*\top} X_i + b^*) \ge \gamma$, then the above algorithm finds a separating hyperplane in a finite number of steps, bounded by $\left( \frac{2R}{\gamma} \right)^2$.
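The update rule above in a few lines of numpy, run on made-up, clearly separable data; at initialization all margins are zero, so we treat zero margin as misclassified to get the loop started.

```python
import numpy as np

rng = np.random.default_rng(7)

# Linearly separable data, Y in {-1, +1}.
X = np.vstack([rng.normal(size=(40, 2)) + [4, 4],
               rng.normal(size=(40, 2)) - [4, 4]])
Y = np.concatenate([np.ones(40), -np.ones(40)])

R = np.max(np.linalg.norm(X, axis=1))
w, b = np.zeros(2), 0.0

steps = 0
while True:
    margins = Y * (X @ w + b)
    bad = np.flatnonzero(margins <= 0)   # zero margin counts as an error
    if len(bad) == 0:
        break
    i = bad[0]                           # pick a misclassified point
    w = w + Y[i] * X[i]                  # w_new = w_old + Y_i X_i
    b = b + Y[i] * R ** 2                # b_new = b_old + Y_i R^2
    steps += 1

train_errors = int(np.sum(Y * (X @ w + b) < 0))
```

Since the data are separable with a comfortable margin, the theorem guarantees termination, and the returned hyperplane has zero training error.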

Some problems with the perceptron algorithm: the number of steps required to converge can be large; if the classes are not separable, there is no guarantee of convergence, and in fact cycles can occur; there is no regularization and hence no protection against overfitting (though the number of steps can be used as a form of regularization).

The naive Bayes classifier. Assume that $X$ is a vector of binary features (possibly in very high dimension), $X \in \{0, 1\}^d$. We don't want to model very complicated dependencies. Naive assumption: given $Y$, the coordinates of $X$ are independent! This can be seen as an elementary graphical model: $Y \to (X^{(1)}, X^{(2)}, \dots, X^{(d)})$.

Assume we know the probability distribution of each coordinate (a Bernoulli variable) given $Y$: $p_{k,1} = P(X^{(k)} = 1 \mid Y = 1)$ and $p_{k,-1} = P(X^{(k)} = 1 \mid Y = -1)$. In this case the Bayes classifier is given by the sign of $\log \frac{P(Y = 1 \mid X)}{P(Y = -1 \mid X)} = \sum_k X^{(k)} \alpha_k + a$, where $\alpha_k = \log \frac{p_{k,1}(1 - p_{k,-1})}{(1 - p_{k,1})\, p_{k,-1}}$ and $a = \sum_k \log \frac{1 - p_{k,1}}{1 - p_{k,-1}} + \log \frac{P(Y = 1)}{P(Y = -1)}$. In practice, as for LDA, it is recommended to optimize the constant $a$ separately (to minimize the training error).
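The linearity of the log-odds is easy to verify numerically; the Bernoulli parameters below are invented, and equal priors are assumed so the prior term in $a$ vanishes.

```python
import numpy as np

# Known Bernoulli parameters: p_pos[k] = P(X^(k)=1 | Y=1),
# p_neg[k] = P(X^(k)=1 | Y=-1); equal class priors.
p_pos = np.array([0.9, 0.8, 0.2])
p_neg = np.array([0.1, 0.3, 0.7])

# Linear coefficients of the Bayes classifier.
alpha = np.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
a = np.sum(np.log((1 - p_pos) / (1 - p_neg)))

def classify(x):
    """Sign of the log-odds, a linear function of the binary features."""
    return 1 if x @ alpha + a > 0 else -1

label_typical_pos = classify(np.array([1, 1, 0]))  # looks like Y = +1
label_typical_neg = classify(np.array([0, 0, 1]))  # looks like Y = -1
```

A feature vector matching the high-probability pattern of each class is assigned to that class, as expected.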

Naive Bayes classifier generalized. We can generalize this idea to continuous-valued coordinates with the same conditional-independence hypothesis; we estimate the conditional density of each coordinate (e.g. with a Gaussian). The decision function becomes an additive model: $f(x) = \operatorname{sign}\left( \sum_k f^{(k)}(x^{(k)}) \right)$. Advantage of naive Bayes: robust also in high dimension; simple and surprisingly good in a number of situations, even when the assumption obviously does not hold. Disadvantage: it generally does not match the performance of more flexible methods.