# 11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial Least Squares Regression


*Statistical Analysis Techniques in Particle Physics*, First Edition. Ilya Narsky and Frank C. Porter. Published 2014 by WILEY-VCH Verlag GmbH & Co. KGaA.

In this chapter, we review, for the most part, linear methods for classification. The only exception is quadratic discriminant analysis, a straightforward generalization of a linear technique. These methods are best known for their simplicity. A linear decision boundary is easy to understand and visualize, even in many dimensions. An example of such a boundary is shown in Figure 11.1 for Fisher iris data. Because of their high interpretability, linear methods are often the first choice for data analysis. They can be the only choice if the analyst seeks to discover linear relationships between variables and classes. If the analysis goal is maximization of the predictive power and the data do not have a linear structure, nonparametric nonlinear methods should be favored over simple interpretable techniques.

Linear discriminant analysis (LDA), also known as the Fisher discriminant, has been a very popular technique in particle and astrophysics. Quadratic discriminant analysis (QDA) is its closest cousin.

## 11.1 Discriminant Analysis

Suppose we observe a sample drawn from a multivariate normal distribution $N(\mu, \Sigma)$ with mean vector $\mu$ and covariance matrix $\Sigma$. The data are $D$-dimensional, and vectors, unless otherwise noted, are column-oriented. The multivariate density is then

$$
P(x) = \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} \exp\left[-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right], \tag{11.1}
$$

where $|\Sigma|$ is the determinant of $\Sigma$. Suppose we observe a sample of data drawn from two classes, each described by a multivariate normal density

$$
P(x\,|\,k) = \frac{1}{\sqrt{(2\pi)^D |\Sigma_k|}} \exp\left[-\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1}(x-\mu_k)\right] \tag{11.2}
$$
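As a quick sanity check of the density formula, (11.1) can be evaluated term by term and compared against `scipy.stats.multivariate_normal`. This is a minimal sketch; the mean, covariance, and evaluation point are arbitrary illustrative values, not taken from the text:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters for a 2D multivariate normal (hypothetical values).
mu = np.array([0.5, -1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([0.2, 0.3])

# Density (11.1), written out: normalization constant times the exponential
# of minus one half of the squared Mahalanobis distance.
D = len(mu)
diff = x - mu
norm_const = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
density = norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# Agrees with the library implementation.
print(density, multivariate_normal(mean=mu, cov=Sigma).pdf(x))
```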

*Figure 11.1 Class boundary obtained by linear discriminant analysis for Fisher iris data. The square covers one observation of class versicolor and two observations of class virginica.*

for classes $k = 1, 2$. Recall that Bayes' rule gives

$$
P(k\,|\,x) = \frac{\pi_k P(x\,|\,k)}{P(x)} \tag{11.3}
$$

for the posterior probability $P(k\,|\,x)$ of observing an instance of class $k$ at point $x$. The unconditional probability $P(x)$ in the denominator does not depend on $k$. The prior class probability $\pi_k$ was introduced in Chapter 9; we discuss its role in discriminant analysis below. Let us take a natural logarithm of the posterior odds:

$$
\log\frac{P(k=1\,|\,x)}{P(k=2\,|\,x)} = \log\frac{\pi_1}{\pi_2} - \frac{1}{2}\log\frac{|\Sigma_1|}{|\Sigma_2|} + x^T\!\left(\Sigma_1^{-1}\mu_1 - \Sigma_2^{-1}\mu_2\right) - \frac{1}{2}x^T\!\left(\Sigma_1^{-1} - \Sigma_2^{-1}\right)x - \frac{1}{2}\left(\mu_1^T\Sigma_1^{-1}\mu_1 - \mu_2^T\Sigma_2^{-1}\mu_2\right). \tag{11.4}
$$

The hyperplane separating the two classes is obtained by equating this log-ratio to zero. This is a quadratic function of $x$, hence *quadratic discriminant analysis*. If the two classes have the same covariance matrix, $\Sigma_1 = \Sigma_2$, the quadratic term disappears and we obtain *linear discriminant analysis*.

Fisher (1936) originally derived discriminant analysis in a different fashion. He searched for a direction $q$ maximizing the separation between two classes,

$$
S(q) = \frac{\left[q^T(\mu_1 - \mu_2)\right]^2}{q^T \Sigma q}. \tag{11.5}
$$

This separation is maximized at $q = \Sigma^{-1}(\mu_1 - \mu_2)$. The terms not depending on $x$ in (11.4) do not change the orientation of the hyperplane separating the two distributions; they only shift the boundary closer to one class and further away from the other. The formulation by Fisher is therefore equivalent to (11.4) for LDA.
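The log posterior odds (11.4) can be sketched in a few lines and checked against the direct log-ratio of the class densities. The parameters below are hypothetical illustrative values, and the check assumes known population quantities rather than estimates:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_posterior_odds(x, pi1, pi2, mu1, mu2, S1, S2):
    """Log posterior odds (11.4) for two multivariate normal classes."""
    S1inv, S2inv = np.linalg.inv(S1), np.linalg.inv(S2)
    return (np.log(pi1 / pi2)
            - 0.5 * np.log(np.linalg.det(S1) / np.linalg.det(S2))
            + x @ (S1inv @ mu1 - S2inv @ mu2)
            - 0.5 * x @ (S1inv - S2inv) @ x
            - 0.5 * (mu1 @ S1inv @ mu1 - mu2 @ S2inv @ mu2))

# Illustrative parameters (not from the text).
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.5])
S1 = np.array([[2.0, 0.5], [0.5, 1.0]])
S2 = np.array([[1.0, -0.3], [-0.3, 1.5]])
pi1 = pi2 = 0.5
x = np.array([0.3, -0.2])

# (11.4) agrees with log[pi_1 P(x|1)] - log[pi_2 P(x|2)] from the densities.
direct = (np.log(pi1) + multivariate_normal(mu1, S1).logpdf(x)
          - np.log(pi2) - multivariate_normal(mu2, S2).logpdf(x))
print(log_posterior_odds(x, pi1, pi2, mu1, mu2, S1, S2), direct)

# With equal covariances the quadratic term cancels and the boundary is
# linear along the Fisher direction q = Sigma^{-1} (mu1 - mu2).
q = np.linalg.solve(S1, mu1 - mu2)
```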

To use this formalism for $K > 2$ classes, choose one class, for example the last one, for normalization. Compute posterior odds $\log[P(k\,|\,x)/P(K\,|\,x)]$ for $k = 1, \ldots, K-1$. The logarithm of the posterior odds is additive, that is, $\log[P(i\,|\,x)/P(j\,|\,x)] = \log[P(i\,|\,x)/P(K\,|\,x)] - \log[P(j\,|\,x)/P(K\,|\,x)]$ for classes $i$ and $j$. The computed $K-1$ log-ratios give complete information about the hyperplanes of separation. If we need to compute the posterior probabilities, we require that $\sum_{k=1}^K P(k\,|\,x) = 1$ and obtain estimates of $P(k\,|\,x)$ from the log-ratios. The same trick is used in other multiclass models such as multinomial logistic regression. For prediction on new data, the class label is assigned by choosing the class with the largest posterior probability.

In this formulation, the class prior probabilities merely shift the boundaries between the classes without changing their orientations (for LDA) or their shapes (for QDA). They are not used to estimate the class means or covariance matrices; hence, they can be applied after training. Alternatively, we could ignore the prior probabilities and classify observations by imposing thresholds on the computed log-ratios. These thresholds would be optimized using some physics-driven criteria. Physicists often follow the second approach.

As discussed in Chapter 9, classifying into the class with the largest posterior probability $P(y\,|\,x)$ minimizes the classification error. If the posterior probabilities are accurately modeled, this classifier is optimal. If classes indeed have multivariate normal densities, QDA is the optimal classifier. If classes indeed have multivariate normal densities with equal covariance matrices, LDA is the optimal classifier. Most usually, we need to estimate the covariance matrices empirically. LDA is seemingly simple, but this simplicity may be deceiving. Subtleties in LDA implementation can change its result dramatically. Let us review them now.

### 11.1.1 Estimating the Covariance Matrix

Under the LDA assumptions, classes have equal covariance matrices and different means. Take the training data with known class labels. Let $M$ be an $N \times K$ class membership matrix for $N$ observations and $K$ classes: $m_{nk} = 1$ if observation $n$ is from class $k$ and 0 otherwise. First, estimate the mean for each class in turn,

$$
\hat{\mu}_k = \frac{\sum_{n=1}^N m_{nk} x_n}{\sum_{n=1}^N m_{nk}}. \tag{11.6}
$$

Then compute the pooled-in covariance matrix. For example, use a maximum likelihood estimate,

$$
\hat{\Sigma}_{\mathrm{ML}} = \frac{1}{N} \sum_{k=1}^K \sum_{n=1}^N m_{nk} (x_n - \hat{\mu}_k)(x_n - \hat{\mu}_k)^T. \tag{11.7}
$$

Vectors $x_n$ and $\hat{\mu}_k$ are $D \times 1$ (column-oriented), and $(x_n - \hat{\mu}_k)(x_n - \hat{\mu}_k)^T$ is therefore a symmetric $D \times D$ matrix. This maximum likelihood estimator is biased. To remove

the bias, apply a small correction:

$$
\hat{\Sigma} = \frac{N}{N-K}\,\hat{\Sigma}_{\mathrm{ML}}. \tag{11.8}
$$

Elementary statistics textbooks derive a similar correction for a univariate normal distribution and include a formula for an unbiased estimate, $S^2 = \sum_{n=1}^N (x_n - \bar{x})^2/(N-1)$, of the variance $\sigma^2$. The statistic $(N-1)S^2/\sigma^2$ is distributed as $\chi^2$ with $N-1$ degrees of freedom. We use $N-K$ instead of $N-1$ because we have $K$ classes. Think of it as losing one degree of freedom per linear constraint. In this case, there are $K$ linear constraints for the $K$ class means. In physics analysis, datasets are usually large, $N \gg K$. For unweighted data, this correction can be safely neglected.

For weighted data, the problem is a bit more involved. The weighted class means are given by

$$
\hat{\mu}_k = \frac{\sum_{n=1}^N m_{nk} w_n x_n}{\sum_{n=1}^N m_{nk} w_n}. \tag{11.9}
$$

The maximum likelihood estimate (11.7) generalizes to

$$
\hat{\Sigma}_{\mathrm{ML}} = \sum_{k=1}^K \sum_{n=1}^N m_{nk} w_n (x_n - \hat{\mu}_k)(x_n - \hat{\mu}_k)^T. \tag{11.10}
$$

Above, we assume that the weights are normalized to sum to one: $\sum_{n=1}^N w_n = 1$. The unbiased estimate is then

$$
\hat{\Sigma} = \frac{\hat{\Sigma}_{\mathrm{ML}}}{1 - \sum_{k=1}^K W_k^{(2)}/W_k}, \tag{11.11}
$$

where $W_k = \sum_{n=1}^N m_{nk} w_n$ is the sum of weights in class $k$ and $W_k^{(2)} = \sum_{n=1}^N m_{nk} w_n^2$ is the sum of squared weights in class $k$. For class-free data ($K = 1$), this simplifies to $\hat{\Sigma} = \hat{\Sigma}_{\mathrm{ML}}/(1 - \sum_{n=1}^N w_n^2)$. If all weights are set to $1/N$, (11.11) simplifies to (11.8). In this case, the corrective term $\sum_{n=1}^N w_n^2$ attains its minimum, and the denominator in (11.11) is close to 1. If the weights are highly nonuniform, the denominator in (11.11) can get close to zero.

For LDA with two classes, this bias correction is, for the most part, irrelevant. Multiplying all elements of the covariance matrix by a factor $a$ is equivalent to multiplying $x^T \Sigma^{-1}(\mu_1 - \mu_2)$ by $1/a$. This multiplication does not change the orientation of the hyperplane separating the two classes, but it does change the posterior class probabilities at point $x$. Instead of using the predicted posterior probabilities directly, physicists often inspect the ROC curve and select a threshold on classification scores (in this case, posterior probabilities) by optimizing some function of true positive and false positive rates. If we fix the false positive rate, multiply the log-ratio at any point $x$ by the same factor, and measure the true positive rate, we will obtain the same value as we would without multiplication.
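The estimates (11.9)–(11.11) translate directly into matrix operations on the membership matrix $M$. The sketch below uses arbitrary random data and a fixed seed, and checks that uniform weights $w_n = 1/N$ reproduce the unweighted correction (11.8):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K = 300, 3, 2
X = rng.normal(size=(N, D))              # arbitrary illustrative data
labels = rng.integers(0, K, size=N)
M = np.eye(K)[labels]                    # N x K membership matrix, m_nk
w = np.full(N, 1.0 / N)                  # uniform weights, normalized to one

# Weighted class means (11.9).
Wk = w @ M                               # sums of weights per class, W_k
mu_hat = (M * w[:, None]).T @ X / Wk[:, None]

# Pooled-in ML covariance (11.10): sum of w_n r_n r_n^T over all observations.
R = X - mu_hat[labels]                   # residuals x_n - mu_hat_{k(n)}
Sigma_ML = (R * w[:, None]).T @ R

# Bias correction (11.11).
Wk2 = (w ** 2) @ M                       # sums of squared weights, W_k^(2)
Sigma_hat = Sigma_ML / (1.0 - np.sum(Wk2 / Wk))

# For w_n = 1/N the denominator is (N - K)/N, recovering (11.8).
print(np.allclose(Sigma_hat, N / (N - K) * Sigma_ML))
```

Replacing `w` with nonuniform weights (renormalized to sum to one) gives the general weighted estimate without changing any other line.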

Unfortunately, this safety mechanism fails for QDA, multiclass LDA, and even LDA with two classes if the covariance matrix is estimated as a weighted combination of the individual covariance matrices, as described in the subsection below on applying LDA when its assumptions are invalid. We refrain from recommending the unbiased estimate over the maximum likelihood estimate or the other way around. We merely point out this issue. If you work with highly nonuniform weights, you should investigate the stability of your analysis procedure with respect to weighting.

### 11.1.2 Verifying Discriminant Analysis Assumptions

The key assumptions for discriminant analysis are multivariate normality (for QDA and LDA) and equality of the class covariance matrices (for LDA). You can verify these assumptions numerically. Two popular tests of normality proposed in Mardia (1970) are based on multivariate skewness and kurtosis. The sample kurtosis is readily expressed in matrix notation,

$$
\hat{k} = \frac{1}{N} \sum_{n=1}^N \left[(x_n - \hat{\mu})^T \hat{\Sigma}^{-1} (x_n - \hat{\mu})\right]^2, \tag{11.12}
$$

where $\hat{\mu}$ and $\hat{\Sigma}$ are the usual estimates of the mean and covariance matrix. Asymptotically, $\hat{k}$ has a normal distribution with mean $D(D+2)$ and variance $8D(D+2)/N$ for sample size $N$ and dimensionality $D$. A large observed value indicates a distribution with tails heavier than normal, and a small value points to a distribution with tails lighter than normal.

A less rigorous but more instructive procedure is to inspect a quantile-quantile (QQ) plot of the squared Mahalanobis distance (Healy, 1968). If $X$ is a random vector drawn from a multivariate normal distribution with mean $\mu$ and covariance $\Sigma$, its squared Mahalanobis distance $(X - \mu)^T \Sigma^{-1} (X - \mu)$ has a $\chi^2$ distribution with $D$ degrees of freedom. We can plot quantiles of the observed Mahalanobis distance versus quantiles of the $\chi^2_D$ distribution.
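The kurtosis statistic (11.12) and the quantile pairs for such a QQ plot can be sketched as follows. The data here are simulated multivariate normal draws with an arbitrary seed, and the z-score uses the asymptotic mean and variance quoted above:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
N, D = 2000, 4
X = rng.multivariate_normal(np.zeros(D), np.eye(D), size=N)  # illustrative data

# Usual estimates of mean and covariance (ML covariance here).
mu_hat = X.mean(axis=0)
Sigma_hat = np.cov(X, rowvar=False, bias=True)

# Squared Mahalanobis distances and the Mardia kurtosis (11.12).
R = X - mu_hat
d2 = np.einsum('ni,ni->n', R @ np.linalg.inv(Sigma_hat), R)
k_hat = np.mean(d2 ** 2)

# Asymptotic z-score: mean D(D+2), variance 8 D(D+2) / N.
z = (k_hat - D * (D + 2)) / np.sqrt(8 * D * (D + 2) / N)

# Quantile pairs for the QQ plot: chi^2_D quantiles versus sorted observed d2.
probs = (np.arange(1, N + 1) - 0.5) / N
qq = np.column_stack([chi2.ppf(probs, df=D), np.sort(d2)])
print(z)
```

For the simulated normal data the z-score should be modest; heavy-tailed data would drive it strongly positive, and the QQ pairs would bend away from the diagonal.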
A departure from a straight line would indicate a lack of normality, and extreme points would be considered candidates for outliers. In practice, we know neither $\mu$ nor $\Sigma$ and must substitute their estimates. As soon as we do, we can, strictly speaking, no longer use the $\chi^2_D$ distribution, although it remains a reasonable approximation for large $N$. No statistical test is associated with this approach, but visual inspection often proves fruitful.

Equality of the class covariance matrices can be verified by a Bartlett multivariate test described in popular textbooks such as Anderson (2003). Formally, we test the hypothesis $H_0: \Sigma_1 = \ldots = \Sigma_K$ against $H_1$: at least two $\Sigma$'s are different. The test statistic,

$$
-2\log V = (N-K)\log|\hat{\Sigma}| - \sum_{k=1}^K (n_k - 1)\log|\hat{\Sigma}_k|, \tag{11.13}
$$

resembles a log-likelihood ratio for the pooled-in unbiased estimate $\hat{\Sigma}$ and unbiased class estimates $\hat{\Sigma}_k$. Here, $n_k$ is the number of observations in class $k$. This formula would need to be modified for weighted data. When the sizes of all classes are comparable and large, $-2\log V$ can be approximated by a $\chi^2$ distribution with $(K-1)D(D+1)/2$ degrees of freedom (see, for example, Box, 1949). For small samples, the exact distribution can be found in Gupta and Tang (1984). $H_0$ should be rejected if the observed value of $-2\log V$ is large. Intuitively, $-2\log V$ measures the lack of uniformity of the covariance matrices across the classes. The pooled-in estimate is simply a sum over class estimates, $(N-K)\hat{\Sigma} = \sum_{k=1}^K (n_k - 1)\hat{\Sigma}_k$. If the sum of several positive numbers is fixed, their product (equivalently, the sum of their logs) is maximal when the numbers are equal. This test is based on the same idea. The Bartlett test is sensitive to outliers and should not be used in their presence.

The tests described here are mostly of theoretical value. Practitioners often apply discriminant analysis when its assumptions do not hold. The ultimate test of any classification model is its performance. If discriminant analysis gives satisfactory predictive power for nonnormal samples, don't let the rigor of theory stand in your way. Likewise, you can verify that QDA improves over LDA by comparing the accuracies of the two models using one of the techniques reviewed earlier in this book.

### 11.1.3 Applying LDA When LDA Assumptions Are Invalid

Under the LDA assumptions, all classes have multivariate normal distributions with different means and the same covariance matrix. The maximum likelihood estimate (11.10) is equal to the weighted average of the covariance matrix estimates per class:

$$
\hat{\Sigma}_{\mathrm{ML}} = \sum_{k=1}^K W_k \hat{\Sigma}_k. \tag{11.14}
$$

Above, $W_k = \sum_{n=1}^N m_{nk} w_n$ is the sum of weights in class $k$.
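As a numerical aside, the identity (11.14) between the pooled-in estimate (11.10) and the per-class weighted ML estimates holds exactly and can be checked on arbitrary weighted data (hypothetical data, weights normalized to sum to one):

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, K = 500, 2, 3
X = rng.normal(size=(N, D))               # illustrative data
labels = rng.integers(0, K, size=N)
w = rng.uniform(0.5, 2.0, size=N)
w /= w.sum()                              # normalize weights to sum to one

Wk = np.array([w[labels == k].sum() for k in range(K)])
Sigma_pooled = np.zeros((D, D))           # accumulates (11.10)
Sigma_k = []                              # per-class weighted ML estimates
for k in range(K):
    Xk, wk = X[labels == k], w[labels == k]
    mu_k = wk @ Xk / Wk[k]                # weighted class mean (11.9)
    Rk = Xk - mu_k
    Sigma_k.append((Rk * wk[:, None]).T @ Rk / Wk[k])
    Sigma_pooled += (Rk * wk[:, None]).T @ Rk

# (11.14): the pooled estimate is the W_k-weighted average of class estimates.
recombined = sum(Wk[k] * Sigma_k[k] for k in range(K))
print(np.allclose(Sigma_pooled, recombined))
```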
As usual, we take $\sum_{n=1}^N w_n = 1$.

In practice, physicists apply LDA when none of the LDA conditions holds. The class densities are not normal, and the covariance matrices are not equal. In these circumstances, you can still apply LDA and obtain some separation between the classes. But there is no theoretical justification for the pooled-in covariance matrix estimate (11.14). It is tempting to see how far we can get by experimenting with the covariance matrix estimate.

Let us illustrate this problem with a hypothetical example. Suppose we have two classes with means

$$
\mu_1 = \begin{pmatrix} 1/2 \\ 0 \end{pmatrix}, \qquad \mu_2 = \begin{pmatrix} -1/2 \\ 0 \end{pmatrix} \tag{11.15}
$$

and covariance matrices

$$
\Sigma_1 = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad \Sigma_2 = \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}. \tag{11.16}
$$

The two covariance matrices are inverse to each other, up to some constant. If we set the covariance matrix for LDA to $(\Sigma_1 + \Sigma_2)/2$, the predicted line of optimal separation is orthogonal to the first coordinate axis, and the optimal classification is given by $x_1$. If we set the covariance matrix for LDA to $\Sigma_1$, the optimal classification is $2x_1 - x_2$. If we set the covariance matrix for LDA to $\Sigma_2$, the optimal classification is $2x_1 + x_2$.

If we had to estimate the pooled-in covariance matrix on a training set, we would face the same problem. If classes 1 and 2 were represented equally in the training data, we would obtain $(\Sigma_1 + \Sigma_2)/2$. If the training set were composed mostly of observations of class 1, we would obtain $\Sigma_1$; similarly for $\Sigma_2$.

Let us plot ROC curves for the three pooled-in matrix estimates. As explained in the previous chapter, a ROC curve is a plot of true positive rate (TPR) versus false positive rate (FPR), or accepted signal versus accepted background. Take class 1 to be signal and class 2 to be background. The three curves are shown in Figure 11.2. Your choice of the optimal curve (and therefore the optimal covariance matrix) would be defined by the specifics of your analysis. If you were mostly concerned with background suppression, you would be interested in the lower left corner of the plot and choose $\Sigma_2$ as your estimate. If your goal were to retain as much signal as possible at a modest background rejection rate, you would focus on the upper right corner of the plot and choose $\Sigma_1$. If you wanted the best overall quality of separation, measured by the area under the ROC curve, you would choose $(\Sigma_1 + \Sigma_2)/2$. It is your analysis, so take your pick!
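The three directions quoted above follow from $q = \Sigma^{-1}(\mu_1 - \mu_2)$ and can be checked in a few lines. This sketch takes $\mu_2 = (-1/2, 0)^T$ and $\Sigma_2$ with off-diagonal $-1$, consistent with the statement that $\Sigma_2$ is proportional to $\Sigma_1^{-1}$:

```python
import numpy as np

mu1, mu2 = np.array([0.5, 0.0]), np.array([-0.5, 0.0])
S1 = np.array([[2.0, 1.0], [1.0, 2.0]])
S2 = np.array([[2.0, -1.0], [-1.0, 2.0]])

def lda_direction(S):
    # Fisher direction q = S^{-1} (mu1 - mu2), scaled so the first entry is 1.
    q = np.linalg.solve(S, mu1 - mu2)
    return q / q[0]

print(lda_direction((S1 + S2) / 2))   # proportional to (1, 0):  classify on x1
print(lda_direction(S1))              # proportional to (2, -1): 2 x1 - x2
print(lda_direction(S2))              # proportional to (2, 1):  2 x1 + x2
```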
If we took this logic to the extreme, we could search for the best $a$ in the linear combination $\hat{\Sigma} = a\hat{\Sigma}_1 + (1-a)\hat{\Sigma}_2$ by minimizing some criterion, perhaps FPR at fixed TPR. If you engage in such optimization, you should ask yourself if LDA is the right tool. At this point, you might want to give up the beloved linearity and switch to a more flexible technique such as QDA. In this example, QDA beats LDA at any FPR, no matter what covariance matrix estimate you choose.

*Figure 11.2 ROC curves for LDA with three estimates of the covariance matrix and QDA.*
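The LDA-versus-QDA comparison can be probed by Monte Carlo: draw from the two Gaussians of the example (again taking $\mu_2 = (-1/2, 0)^T$ and $\Sigma_2$ with off-diagonal $-1$), score each observation, and estimate the area under the ROC curve with the rank (Mann–Whitney) statistic. A sketch; sample size and seed are arbitrary:

```python
import numpy as np
from scipy.stats import multivariate_normal, rankdata

rng = np.random.default_rng(11)
mu1, mu2 = np.array([0.5, 0.0]), np.array([-0.5, 0.0])
S1 = np.array([[2.0, 1.0], [1.0, 2.0]])
S2 = np.array([[2.0, -1.0], [-1.0, 2.0]])

n = 20000
Xs = rng.multivariate_normal(mu1, S1, size=n)   # signal, class 1
Xb = rng.multivariate_normal(mu2, S2, size=n)   # background, class 2

def auc(sig_scores, bkg_scores):
    # Rank-based estimate of P(signal score > background score).
    r = rankdata(np.concatenate([sig_scores, bkg_scores]))
    n_s, n_b = len(sig_scores), len(bkg_scores)
    return (r[:n_s].sum() - n_s * (n_s + 1) / 2) / (n_s * n_b)

# LDA score with the pooled covariance; QDA score is the exact log-ratio (11.4)
# up to constants, computed here from the class densities.
q = np.linalg.solve((S1 + S2) / 2, mu1 - mu2)
lda = lambda X: X @ q
qda = lambda X: (multivariate_normal(mu1, S1).logpdf(X)
                 - multivariate_normal(mu2, S2).logpdf(X))

auc_lda = auc(lda(Xs), lda(Xb))
auc_qda = auc(qda(Xs), qda(Xb))
print(auc_lda, auc_qda)   # QDA should come out ahead
```

Since the class covariances differ, the quadratic term of (11.4) carries real discriminating information, and the QDA area under the curve exceeds that of any linear score.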


Prof. Dr. J. Franke All of Statistics 1.52 Binary response variables - logistic regression Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit

### Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

### Computational Assignment 4: Discriminant Analysis

Computational Assignment 4: Discriminant Analysis -Written by James Wilson -Edited by Andrew Nobel In this assignment, we will investigate running Fisher s Discriminant analysis in R. This is a powerful

### MINITAB ASSISTANT WHITE PAPER

MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

### A THEORETICAL COMPARISON OF DATA MASKING TECHNIQUES FOR NUMERICAL MICRODATA

A THEORETICAL COMPARISON OF DATA MASKING TECHNIQUES FOR NUMERICAL MICRODATA Krish Muralidhar University of Kentucky Rathindra Sarathy Oklahoma State University Agency Internal User Unmasked Result Subjects

### The Exponential Family

The Exponential Family David M. Blei Columbia University November 3, 2015 Definition A probability density in the exponential family has this form where p.x j / D h.x/ expf > t.x/ a./g; (1) is the natural

### Recall this chart that showed how most of our course would be organized:

Chapter 4 One-Way ANOVA Recall this chart that showed how most of our course would be organized: Explanatory Variable(s) Response Variable Methods Categorical Categorical Contingency Tables Categorical

### 4. Introduction to Statistics

Statistics for Engineers 4-1 4. Introduction to Statistics Descriptive Statistics Types of data A variate or random variable is a quantity or attribute whose value may vary from one unit of investigation

### CHI-SQUARE: TESTING FOR GOODNESS OF FIT

CHI-SQUARE: TESTING FOR GOODNESS OF FIT In the previous chapter we discussed procedures for fitting a hypothesized function to a set of experimental data points. Such procedures involve minimizing a quantity

### Reject Inference in Credit Scoring. Jie-Men Mok

Reject Inference in Credit Scoring Jie-Men Mok BMI paper January 2009 ii Preface In the Master programme of Business Mathematics and Informatics (BMI), it is required to perform research on a business

### Lecture 8: Signal Detection and Noise Assumption

ECE 83 Fall Statistical Signal Processing instructor: R. Nowak, scribe: Feng Ju Lecture 8: Signal Detection and Noise Assumption Signal Detection : X = W H : X = S + W where W N(, σ I n n and S = [s, s,...,

### 2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)

2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came

### MATH4427 Notebook 2 Spring 2016. 2 MATH4427 Notebook 2 3. 2.1 Definitions and Examples... 3. 2.2 Performance Measures for Estimators...

MATH4427 Notebook 2 Spring 2016 prepared by Professor Jenny Baglivo c Copyright 2009-2016 by Jenny A. Baglivo. All Rights Reserved. Contents 2 MATH4427 Notebook 2 3 2.1 Definitions and Examples...................................

### DISCRIMINANT FUNCTION ANALYSIS (DA)

DISCRIMINANT FUNCTION ANALYSIS (DA) John Poulsen and Aaron French Key words: assumptions, further reading, computations, standardized coefficents, structure matrix, tests of signficance Introduction Discriminant

### 1 Maximum likelihood estimation

COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

### Statistical Analysis with Missing Data

Statistical Analysis with Missing Data Second Edition RODERICK J. A. LITTLE DONALD B. RUBIN WILEY- INTERSCIENCE A JOHN WILEY & SONS, INC., PUBLICATION Contents Preface PARTI OVERVIEW AND BASIC APPROACHES

### Simple Regression Theory II 2010 Samuel L. Baker

SIMPLE REGRESSION THEORY II 1 Simple Regression Theory II 2010 Samuel L. Baker Assessing how good the regression equation is likely to be Assignment 1A gets into drawing inferences about how close the

### Lecture 9: Introduction to Pattern Analysis

Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns

### Gaussian Conjugate Prior Cheat Sheet

Gaussian Conjugate Prior Cheat Sheet Tom SF Haines 1 Purpose This document contains notes on how to handle the multivariate Gaussian 1 in a Bayesian setting. It focuses on the conjugate prior, its Bayesian

### Unit 29 Chi-Square Goodness-of-Fit Test

Unit 29 Chi-Square Goodness-of-Fit Test Objectives: To perform the chi-square hypothesis test concerning proportions corresponding to more than two categories of a qualitative variable To perform the Bonferroni

### MATHEMATICAL METHODS OF STATISTICS

MATHEMATICAL METHODS OF STATISTICS By HARALD CRAMER TROFESSOK IN THE UNIVERSITY OF STOCKHOLM Princeton PRINCETON UNIVERSITY PRESS 1946 TABLE OF CONTENTS. First Part. MATHEMATICAL INTRODUCTION. CHAPTERS

### Density Curve. A density curve is the graph of a continuous probability distribution. It must satisfy the following properties:

Density Curve A density curve is the graph of a continuous probability distribution. It must satisfy the following properties: 1. The total area under the curve must equal 1. 2. Every point on the curve

### Factor Analysis. Factor Analysis

Factor Analysis Principal Components Analysis, e.g. of stock price movements, sometimes suggests that several variables may be responding to a small number of underlying forces. In the factor model, we

### A MULTIVARIATE OUTLIER DETECTION METHOD

A MULTIVARIATE OUTLIER DETECTION METHOD P. Filzmoser Department of Statistics and Probability Theory Vienna, AUSTRIA e-mail: P.Filzmoser@tuwien.ac.at Abstract A method for the detection of multivariate

### Time Series Analysis

Time Series Analysis hm@imm.dtu.dk Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby 1 Outline of the lecture Identification of univariate time series models, cont.:

### Review Jeopardy. Blue vs. Orange. Review Jeopardy

Review Jeopardy Blue vs. Orange Review Jeopardy Jeopardy Round Lectures 0-3 Jeopardy Round \$200 How could I measure how far apart (i.e. how different) two observations, y 1 and y 2, are from each other?

### A crash course in probability and Naïve Bayes classification

Probability theory A crash course in probability and Naïve Bayes classification Chapter 9 Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s

### Introduction to Matrix Algebra

Psychology 7291: Multivariate Statistics (Carey) 8/27/98 Matrix Algebra - 1 Introduction to Matrix Algebra Definitions: A matrix is a collection of numbers ordered by rows and columns. It is customary

### Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares

Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Many economic models involve endogeneity: that is, a theoretical relationship does not fit

### Statistiek II. John Nerbonne. October 1, 2010. Dept of Information Science j.nerbonne@rug.nl

Dept of Information Science j.nerbonne@rug.nl October 1, 2010 Course outline 1 One-way ANOVA. 2 Factorial ANOVA. 3 Repeated measures ANOVA. 4 Correlation and regression. 5 Multiple regression. 6 Logistic

### Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

### Factor Analysis. Chapter 420. Introduction

Chapter 420 Introduction (FA) is an exploratory technique applied to a set of observed variables that seeks to find underlying factors (subsets of variables) from which the observed variables were generated.

### 4. Continuous Random Variables, the Pareto and Normal Distributions

4. Continuous Random Variables, the Pareto and Normal Distributions A continuous random variable X can take any value in a given range (e.g. height, weight, age). The distribution of a continuous random

### A Brief Introduction to SPSS Factor Analysis

A Brief Introduction to SPSS Factor Analysis SPSS has a procedure that conducts exploratory factor analysis. Before launching into a step by step example of how to use this procedure, it is recommended

### Ordinal Regression. Chapter

Ordinal Regression Chapter 4 Many variables of interest are ordinal. That is, you can rank the values, but the real distance between categories is unknown. Diseases are graded on scales from least severe