11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial Least Squares Regression




Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics, Chap. 11, 2013/9/9, page 221

In this chapter, we review, for the most part, linear methods for classification. The only exception is quadratic discriminant analysis, a straightforward generalization of a linear technique. These methods are best known for their simplicity. A linear decision boundary is easy to understand and visualize, even in many dimensions. An example of such a boundary is shown in Figure 11.1 for Fisher iris data. Because of their high interpretability, linear methods are often the first choice for data analysis. They can be the only choice if the analyst seeks to discover linear relationships between variables and classes. If the analysis goal is maximization of predictive power and the data do not have a linear structure, nonparametric nonlinear methods should be favored over simple interpretable techniques. Linear discriminant analysis (LDA), also known as the Fisher discriminant, has been a very popular technique in particle physics and astrophysics. Quadratic discriminant analysis (QDA) is its closest cousin.

11.1 Discriminant Analysis

Suppose we observe a sample drawn from a multivariate normal distribution N(\mu, \Sigma) with mean vector \mu and covariance matrix \Sigma. The data are D-dimensional, and vectors, unless otherwise noted, are column-oriented. The multivariate density is then

P(x) = \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} \exp\left[ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right],   (11.1)

where |\Sigma| is the determinant of \Sigma. Suppose we observe a sample of data drawn from two classes, each described by a multivariate normal density

P(x|k) = \frac{1}{\sqrt{(2\pi)^D |\Sigma_k|}} \exp\left[ -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right]   (11.2)

Statistical Analysis Techniques in Particle Physics, First Edition. Ilya Narsky and Frank C. Porter. © 2014 WILEY-VCH Verlag GmbH & Co. KGaA. Published 2014 by WILEY-VCH Verlag GmbH & Co. KGaA.
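As a concrete check of the density (11.1), here is a minimal numpy sketch that evaluates it directly and compares the result against scipy's reference implementation. The parameter values are arbitrary illustrations, not taken from the text.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_density(x, mu, Sigma):
    """Evaluate the multivariate normal density (11.1) at point x."""
    D = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    # (x - mu)^T Sigma^{-1} (x - mu), via a linear solve instead of an inverse
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
x = np.array([0.3, 0.8])

p = mvn_density(x, mu, Sigma)
# Agrees with scipy.stats.multivariate_normal to machine precision
```

Using `np.linalg.solve` rather than forming `Sigma^{-1}` explicitly is the numerically preferred way to compute the quadratic form.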

Figure 11.1 Class boundary obtained by linear discriminant analysis for Fisher iris data. The square covers one observation of class versicolor and two observations of class virginica.

for classes k = 1, 2. Recall that Bayes rule gives

P(k|x) = \frac{\pi_k P(x|k)}{P(x)}   (11.3)

for the posterior probability P(k|x) of observing an instance of class k at point x. The unconditional probability P(x) in the denominator does not depend on k. The prior class probability \pi_k was introduced in Chapter 9; we discuss its role in discriminant analysis below. Let us take the natural logarithm of the posterior odds:

\log\frac{P(k=1|x)}{P(k=2|x)} = \log\frac{\pi_1}{\pi_2} - \frac{1}{2}\log\frac{|\Sigma_1|}{|\Sigma_2|} + x^T \left( \Sigma_1^{-1}\mu_1 - \Sigma_2^{-1}\mu_2 \right) - \frac{1}{2} x^T \left( \Sigma_1^{-1} - \Sigma_2^{-1} \right) x - \frac{1}{2} \left( \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_2^T \Sigma_2^{-1} \mu_2 \right).   (11.4)

The hyperplane separating the two classes is obtained by equating this log-ratio to zero. This is a quadratic function of x, hence quadratic discriminant analysis. If the two classes have the same covariance matrix, \Sigma_1 = \Sigma_2, the quadratic term disappears and we obtain linear discriminant analysis.

Fisher (1936) originally derived discriminant analysis in a different fashion. He searched for a direction q maximizing the separation between two classes,

S(q) = \frac{\left[ q^T (\mu_1 - \mu_2) \right]^2}{q^T \Sigma q}.   (11.5)

This separation is maximized at q = \Sigma^{-1}(\mu_1 - \mu_2). The terms not depending on x in (11.4) do not change the orientation of the hyperplane separating the two distributions; they only shift the boundary closer to one class and further away from the other. The formulation by Fisher is therefore equivalent to (11.4) for LDA.
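The log posterior odds (11.4) and the Fisher direction can be sketched in a few lines of numpy. This is an illustration under made-up class parameters, not code from the text; when the two covariance matrices are equal, the quadratic term in (11.4) vanishes and the score is linear in x with slope q = \Sigma^{-1}(\mu_1 - \mu_2).

```python
import numpy as np

def log_posterior_odds(x, pi, mu, Sigma):
    """Log posterior odds (11.4) for two Gaussian classes.

    pi, mu, Sigma are length-2 sequences of priors, mean vectors,
    and covariance matrices."""
    (pi1, pi2), (m1, m2), (S1, S2) = pi, mu, Sigma
    S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)
    return (np.log(pi1 / pi2)
            - 0.5 * np.log(np.linalg.det(S1) / np.linalg.det(S2))
            + x @ (S1i @ m1 - S2i @ m2)
            - 0.5 * x @ (S1i - S2i) @ x
            - 0.5 * (m1 @ S1i @ m1 - m2 @ S2i @ m2))

# Equal covariances: the boundary orientation is the Fisher direction.
m1, m2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
S = np.array([[2.0, 1.0], [1.0, 2.0]])
q = np.linalg.solve(S, m1 - m2)   # q = Sigma^{-1} (mu1 - mu2)

score = log_posterior_odds(np.array([0.7, -0.4]), [0.5, 0.5], [m1, m2], [S, S])
```

With equal covariances, the odds at x minus the odds at the origin should equal x @ q exactly, which makes the linearity easy to verify numerically.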

To use this formalism for K > 2 classes, choose one class, for example the last one, for normalization. Compute the posterior odds log[P(k|x)/P(K|x)] for k = 1, ..., K-1. The logarithm of the posterior odds is additive, that is, log[P(i|x)/P(j|x)] = log[P(i|x)/P(K|x)] - log[P(j|x)/P(K|x)] for classes i and j. The computed K-1 log-ratios give complete information about the hyperplanes of separation. If we need to compute the posterior probabilities, we require that \sum_{k=1}^{K} P(k|x) = 1 and obtain estimates of P(k|x) from the log-ratios. The same trick is used in other multiclass models such as multinomial logistic regression.

For prediction on new data, the class label is assigned by choosing the class with the largest posterior probability. In this formulation, the class prior probabilities merely shift the boundaries between the classes without changing their orientations (for LDA) or their shapes (for QDA). They are not used to estimate the class means or covariance matrices; hence, they can be applied after training. Alternatively, we could ignore the prior probabilities and classify observations by imposing thresholds on the computed log-ratios. These thresholds would be optimized using some physics-driven criteria. Physicists often follow the second approach.

As discussed in Chapter 9, classifying into the class with the largest posterior probability P(y|x) minimizes the classification error. If the posterior probabilities are accurately modeled, this classifier is optimal. If classes indeed have multivariate normal densities, QDA is the optimal classifier. If classes indeed have multivariate normal densities with equal covariance matrices, LDA is the optimal classifier. Usually, we must estimate the covariance matrices empirically. LDA is seemingly simple, but this simplicity may be deceiving.
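The normalization trick for recovering posteriors from the K-1 log-ratios can be sketched as follows: append a log-ratio of zero for the reference class K and normalize the exponentials so they sum to one (in effect, a softmax). The numbers below are illustrative only.

```python
import numpy as np

def posteriors_from_log_ratios(r):
    """Recover class posteriors from the K-1 log-ratios
    r_k = log[P(k|x)/P(K|x)], k = 1, ..., K-1."""
    z = np.append(r, 0.0)   # reference class K has log-ratio zero
    z -= z.max()            # guard against overflow before exponentiating
    p = np.exp(z)
    return p / p.sum()      # enforce sum_k P(k|x) = 1

p = posteriors_from_log_ratios(np.array([1.0, -0.5]))
# p sums to one, and the pairwise ratios reproduce the input log-odds
```

By construction p[i]/p[K-1] = exp(r_i), so the K-1 log-ratios are exactly recovered from the normalized posteriors.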
Subtleties in LDA implementation can change its result dramatically. Let us review them now.

11.1.1 Estimating the Covariance Matrix

Under the LDA assumptions, classes have equal covariance matrices and different means. Take the training data with known class labels. Let M be an N x K class membership matrix for N observations and K classes: m_{nk} = 1 if observation n is from class k and 0 otherwise. First, estimate the mean for each class in turn,

\hat{\mu}_k = \frac{\sum_{n=1}^{N} m_{nk} x_n}{\sum_{n=1}^{N} m_{nk}}.   (11.6)

Then compute the pooled-in covariance matrix. For example, use a maximum likelihood estimate,

\hat{\Sigma}_{ML} = \frac{1}{N} \sum_{k=1}^{K} \sum_{n=1}^{N} m_{nk} (x_n - \hat{\mu}_k)(x_n - \hat{\mu}_k)^T.   (11.7)

Vectors x_n and \hat{\mu}_k are D x 1 (column-oriented), and (x_n - \hat{\mu}_k)(x_n - \hat{\mu}_k)^T is therefore a symmetric D x D matrix. This maximum likelihood estimator is biased. To remove

Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c11 2013/9/9 page 224 le-tex 224 11 Linear Classification the bias, apply a small correction: OΣ D N N K O Σ ML. (11.8) Elementary statistics textbooks derive a similar correction for a univariate normal distribution and include a formula for an unbiased estimate, S 2 D nd1 (x n Nx) 2 /(N 1), of the variance, σ 2.Thestatistic(N 1)S 2 /σ 2 is distributed as χ 2 with N 1 degrees of freedom. We use N K instead of N 1becausewehaveK classes. Think of it as losing one degree of freedom per linear constraint. In this case, there are K linear constraints for K class means. In physics analysis, datasets are usually large, N K. For unweighted data, this correction can be safely neglected. For weighted data, the problem is a bit more involved. The weighted class means are given by Oμ k D nd1 m nkw n x n nd1 m nkw n. (11.9) The maximum likelihood estimate (11.7) generalizes to OΣ ML D KX kd1 nd1 NX m nk w n (x n Oμ k )(x n Oμ k ) T. (11.10) Above, we assume that the weights are normalized to sum to one: nd1 w n D 1. The unbiased estimate is then OΣ D OΣ ML 1 P K kd1 W (2) k W k, (11.11) where W k D nd1 m nkw n is the sum of weights in class k and W (2) k D nd1 m nkwn 2 is the sum of squared weights in class k. For class-free data K D 1, this simplifies to OΣ D OΣ ML /(1 nd1 w n 2 ). If all weights are set to 1/N, (11.11) simplifies to (11.8). In this case, the corrective term nd1 w n 2 attains minimum, and the denominator in (11.11) is close to 1. If the weights are highly nonuniform, the denominator in (11.11) can get close to zero. For LDA with two classes, this bias correction is, for the most part, irrelevant. Multiplying all elements of the covariance matrix by factor a is equivalent to multiplying x T Σ 1 (μ 1 μ 2 )by1/a. This multiplication does not change the orientation of the hyperplane separating the two classes, but it does change the posterior class probabilities at point x. 
Instead of using the predicted posterior probabilities directly, physicists often inspect the ROC curve and select a threshold on classification scores (in this case, posterior probabilities) by optimizing some function of true positive and false positive rates. If we fix the false positive rate, multiply the logratio at any point x by the same factor and measure the true positive rate, we will obtain the same value as we would without multiplication.
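The weighted estimates (11.9)-(11.11) can be sketched in numpy as follows. This is a minimal illustration, assuming class labels coded 0..K-1 and weights normalized to sum to one; for uniform weights w_n = 1/N it reduces to the N/(N-K) correction (11.8).

```python
import numpy as np

def pooled_covariance(X, y, w, K):
    """Bias-corrected pooled-in covariance for weighted data, (11.9)-(11.11).

    X: (N, D) data; y: labels in 0..K-1; w: weights summing to one."""
    N, D = X.shape
    Sigma_ml = np.zeros((D, D))
    correction = 0.0
    for k in range(K):
        wk = w[y == k]
        Xk = X[y == k]
        Wk = wk.sum()                       # sum of weights in class k
        Wk2 = (wk ** 2).sum()               # sum of squared weights in class k
        mu_k = (wk[:, None] * Xk).sum(axis=0) / Wk          # (11.9)
        diff = Xk - mu_k
        Sigma_ml += (wk[:, None] * diff).T @ diff           # (11.10)
        correction += Wk2 / Wk
    return Sigma_ml / (1.0 - correction)                    # (11.11)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 2, size=200)
w = np.full(200, 1.0 / 200)                 # uniform weights
Sigma = pooled_covariance(X, y, w, K=2)
```

With highly nonuniform weights the `correction` term grows and the denominator in (11.11) approaches zero, which is exactly the instability the text warns about.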

Unfortunately, this safety mechanism fails for QDA, multiclass LDA, and even LDA with two classes if the covariance matrix is estimated as a weighted combination of the individual covariance matrices, as described in Section 11.1.3. We refrain from recommending the unbiased estimate over the maximum likelihood estimate or the other way around. We merely point out this issue. If you work with highly nonuniform weights, you should investigate the stability of your analysis procedure with respect to weighting.

11.1.2 Verifying Discriminant Analysis Assumptions

The key assumptions for discriminant analysis are multivariate normality (for QDA and LDA) and equality of the class covariance matrices (for LDA). You can verify these assumptions numerically. Two popular tests of normality proposed in Mardia (1970) are based on multivariate skewness and kurtosis. The sample kurtosis is readily expressed in matrix notation,

\hat{k} = \frac{1}{N} \sum_{n=1}^{N} \left[ (x_n - \hat{\mu})^T \hat{\Sigma}^{-1} (x_n - \hat{\mu}) \right]^2,   (11.12)

where \hat{\mu} and \hat{\Sigma} are the usual estimates of the mean and covariance matrix. Asymptotically, \hat{k} has a normal distribution with mean D(D+2) and variance 8D(D+2)/N for sample size N and dimensionality D. A large observed value indicates a distribution with tails heavier than normal, and a small value points to a distribution with tails lighter than normal.

A less rigorous but more instructive procedure is to inspect a quantile-quantile (QQ) plot of the squared Mahalanobis distance (Healy, 1968). If X is a random vector drawn from a multivariate normal distribution with mean \mu and covariance \Sigma, its squared Mahalanobis distance (X - \mu)^T \Sigma^{-1} (X - \mu) has a \chi^2 distribution with D degrees of freedom. We can plot quantiles of the observed Mahalanobis distance versus quantiles of the \chi^2_D distribution.
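Both diagnostics rest on the squared Mahalanobis distances of the observations. The sketch below (numpy and scipy assumed, data simulated for illustration) computes Mardia's kurtosis (11.12) with its asymptotic z-score and the QQ-plot coordinates against the \chi^2_D quantiles.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
D, N = 3, 5000
X = rng.multivariate_normal(np.zeros(D), np.eye(D), size=N)

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False, bias=True)      # maximum likelihood estimate
diff = X - mu
# Squared Mahalanobis distance of every observation
d2 = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)

# Mardia kurtosis (11.12) and its asymptotic z-score
k_hat = (d2 ** 2).mean()
z = (k_hat - D * (D + 2)) / np.sqrt(8 * D * (D + 2) / N)

# QQ coordinates: sorted squared distances vs chi^2_D quantiles
probs = (np.arange(1, N + 1) - 0.5) / N
qq_x = chi2.ppf(probs, df=D)
qq_y = np.sort(d2)
# For normal data, k_hat is close to D(D+2) and (qq_x, qq_y) hugs the line y = x
```

Plotting `qq_y` against `qq_x` gives the Healy-style QQ plot; feeding the same code heavy-tailed data would push `k_hat` above D(D+2) and bend the upper end of the QQ curve away from the diagonal.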
A departure from a straight line would indicate a lack of normality, and extreme points would be considered as candidates for outliers. In practice we know neither \mu nor \Sigma and must substitute their estimates. As soon as we do, we can, strictly speaking, no longer use the \chi^2_D distribution, although it remains a reasonable approximation for large N. No statistical test is associated with this approach, but visual inspection often proves fruitful.

Equality of the class covariance matrices can be verified by a Bartlett multivariate test described in popular textbooks such as Anderson (2003). Formally, we test the hypothesis H_0: \Sigma_1 = ... = \Sigma_K against H_1: at least two of the \Sigma_k are different. The test statistic,

-2 \log V = (N - K) \log |\hat{\Sigma}| - \sum_{k=1}^{K} (n_k - 1) \log |\hat{\Sigma}_k|,   (11.13)

resembles a log-likelihood ratio for the pooled-in unbiased estimate \hat{\Sigma} and the unbiased class estimates \hat{\Sigma}_k. Here, n_k is the number of observations in class k. This formula would need to be modified for weighted data. When the sizes of all classes are comparable and large, -2 \log V can be approximated by a \chi^2 distribution with (K-1)D(D+1)/2 degrees of freedom (see, for example, Box, 1949). For small samples, the exact distribution can be found in Gupta and Tang (1984). H_0 should be rejected if the observed value of -2 \log V is large.

Intuitively, -2 \log V measures the lack of uniformity of the covariance matrices across the classes. The pooled-in estimate is simply a sum over class estimates, (N - K)\hat{\Sigma} = \sum_{k=1}^{K} (n_k - 1)\hat{\Sigma}_k. If the sum of several positive numbers is fixed, their product (equivalently, the sum of their logs) is maximal when the numbers are equal. This test is based on the same idea. The Bartlett test is sensitive to outliers and should not be used in their presence.

The tests described here are mostly of theoretical value. Practitioners often apply discriminant analysis when its assumptions do not hold. The ultimate test of any classification model is its performance. If discriminant analysis gives satisfactory predictive power for nonnormal samples, don't let the rigor of theory stand in your way. Likewise, you can verify that QDA improves over LDA by comparing the accuracies of the two models using one of the techniques reviewed in Chapter 10.

11.1.3 Applying LDA When LDA Assumptions Are Invalid

Under the LDA assumptions, all classes have multivariate normal distributions with different means and the same covariance matrix. The maximum likelihood estimate (11.10) is equal to the weighted average of the covariance matrix estimates per class:

\hat{\Sigma}_{ML} = \sum_{k=1}^{K} W_k \hat{\Sigma}_k.   (11.14)

Above, W_k = \sum_{n=1}^{N} m_{nk} w_n is the sum of weights in class k. As usual, we take \sum_{n=1}^{N} w_n = 1.

In practice, physicists apply LDA when none of the LDA conditions holds. The class densities are not normal and the covariance matrices are not equal. In these circumstances, you can still apply LDA and obtain some separation between the classes. But there is no theoretical justification for the pooled-in covariance matrix estimate (11.14). It is tempting to see how far we can get by experimenting with the covariance matrix estimate. Let us illustrate this problem with a hypothetical example. Suppose we have two classes with means

\mu_1 = \begin{pmatrix} 1/2 \\ 0 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} -1/2 \\ 0 \end{pmatrix}   (11.15)

and covariance matrices

\Sigma_1 = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \quad \Sigma_2 = \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}.   (11.16)

The two covariance matrices are inverse to each other, up to a constant. If we set the covariance matrix for LDA to (\Sigma_1 + \Sigma_2)/2, the predicted line of optimal separation is orthogonal to the first coordinate axis, and the optimal classification is given by x_1. If we set the covariance matrix for LDA to \Sigma_1, the optimal classification is 2x_1 - x_2. If we set the covariance matrix for LDA to \Sigma_2, the optimal classification is 2x_1 + x_2.

If we had to estimate the pooled-in covariance matrix on a training set, we would face the same problem. If classes 1 and 2 were represented equally in the training data, we would obtain (\Sigma_1 + \Sigma_2)/2. If the training set were composed mostly of observations of class 1, we would obtain \Sigma_1. Similarly for \Sigma_2.

Let us plot ROC curves for the three pooled-in matrix estimates. As explained in the previous chapter, a ROC curve is a plot of true positive rate (TPR) versus false positive rate (FPR), or accepted signal versus accepted background. Take class 1 to be signal and class 2 to be background. The three curves are shown in Figure 11.2. Your choice of the optimal curve (and therefore the optimal covariance matrix) would be defined by the specifics of your analysis. If you were mostly concerned with background suppression, you would be interested in the lower left corner of the plot and choose \Sigma_2 as your estimate. If your goal were to retain as much signal as possible at a modest background rejection rate, you would focus on the upper right corner of the plot and choose \Sigma_1. If you wanted the best overall quality of separation measured by the area under the ROC curve, you would choose (\Sigma_1 + \Sigma_2)/2. It is your analysis, so take your pick!
If we took this logic to the extreme, we could search for the best a in the linear combination \hat{\Sigma} = a \hat{\Sigma}_1 + (1-a) \hat{\Sigma}_2 by minimizing some criterion, perhaps the FPR at fixed TPR. If you engage in such optimization, you should ask yourself whether LDA is the right tool. At this point, you might want to give up the beloved linearity and switch to a more flexible technique such as QDA. In this example, QDA beats LDA at any FPR, no matter what covariance matrix estimate you choose.

Figure 11.2 ROC curves for LDA with three estimates of the covariance matrix and QDA.
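The hypothetical example (11.15)-(11.16) is easy to reproduce by Monte Carlo. The sketch below (numpy assumed; sample sizes and the random seed are arbitrary choices) draws the two classes, scores them along the LDA direction q = \Sigma^{-1}(\mu_1 - \mu_2) for each of the three covariance choices, and computes the area under the ROC curve via the equivalent Mann-Whitney rank statistic rather than by plotting the curve.

```python
import numpy as np

rng = np.random.default_rng(2)
mu1, mu2 = np.array([0.5, 0.0]), np.array([-0.5, 0.0])
S1 = np.array([[2.0, 1.0], [1.0, 2.0]])
S2 = np.array([[2.0, -1.0], [-1.0, 2.0]])

n = 20000
sig = rng.multivariate_normal(mu1, S1, size=n)   # class 1 = signal
bkg = rng.multivariate_normal(mu2, S2, size=n)   # class 2 = background

def auc(s_sig, s_bkg):
    """Area under the ROC curve via the Mann-Whitney rank-sum statistic."""
    scores = np.concatenate([s_sig, s_bkg])
    ranks = scores.argsort().argsort() + 1.0     # ordinal ranks, 1-based
    n_s, n_b = len(s_sig), len(s_bkg)
    return (ranks[:n_s].sum() - n_s * (n_s + 1) / 2) / (n_s * n_b)

aucs = {}
for name, S in [("avg", (S1 + S2) / 2), ("S1", S1), ("S2", S2)]:
    q = np.linalg.solve(S, mu1 - mu2)            # LDA direction for this choice
    aucs[name] = auc(sig @ q, bkg @ q)
# As the text states, the pooled average (S1+S2)/2 yields the largest AUC
```

The rank-statistic shortcut works because, in the absence of ties, the probability that a random signal score exceeds a random background score equals the area under the ROC curve.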