L10: Linear discriminant analysis
CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU



L10: Linear discriminant analysis
- Linear discriminant analysis, two classes
- Linear discriminant analysis, C classes
- LDA vs. PCA
- Limitations of LDA
- Variants of LDA
- Other dimensionality reduction methods

Linear discriminant analysis, two classes

Objective
- LDA seeks to reduce dimensionality while preserving as much of the class discriminatory information as possible
- Assume we have a set of D-dimensional samples $x^{(1)}, x^{(2)}, \ldots, x^{(N)}$, $N_1$ of which belong to class $\omega_1$ and $N_2$ to class $\omega_2$
- We seek to obtain a scalar $y$ by projecting the samples $x$ onto a line $y = w^T x$
- Of all the possible lines we would like to select the one that maximizes the separability of the scalars

In order to find a good projection vector, we need to define a measure of separation
- The mean vector of each class in $x$-space and $y$-space is
$$\mu_i = \frac{1}{N_i}\sum_{x \in \omega_i} x \qquad\text{and}\qquad \tilde{\mu}_i = \frac{1}{N_i}\sum_{y \in \omega_i} y = \frac{1}{N_i}\sum_{x \in \omega_i} w^T x = w^T \mu_i$$
- We could then choose the distance between the projected means as our objective function
$$J(w) = |\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T(\mu_1 - \mu_2)|$$
- However, the distance between projected means is not a good measure since it does not account for the standard deviation within classes
[Figure: two candidate axes; one axis has a larger distance between means, but the other axis yields better class separability]

Fisher's solution
- Fisher suggested maximizing the difference between the means, normalized by a measure of the within-class scatter
- For each class we define the scatter, an equivalent of the variance, as
$$\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2$$
where the quantity $(\tilde{s}_1^2 + \tilde{s}_2^2)$ is called the within-class scatter of the projected examples
- The Fisher linear discriminant is defined as the linear function $w^T x$ that maximizes the criterion function
$$J(w) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
- Therefore, we are looking for a projection where examples from the same class are projected very close to each other and, at the same time, the projected means are as far apart as possible

To find the optimum $w^*$, we must express $J(w)$ as a function of $w$
- First, we define a measure of the scatter in feature space $x$:
$$S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T \qquad S_1 + S_2 = S_W$$
where $S_W$ is called the within-class scatter matrix
- The scatter of the projection $y$ can then be expressed as a function of the scatter matrix in feature space $x$:
$$\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2 = \sum_{x \in \omega_i} (w^T x - w^T \mu_i)^2 = \sum_{x \in \omega_i} w^T (x - \mu_i)(x - \mu_i)^T w = w^T S_i w$$
$$\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w$$
- Similarly, the difference between the projected means can be expressed in terms of the means in the original feature space:
$$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (w^T \mu_1 - w^T \mu_2)^2 = w^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = w^T S_B w$$
The matrix $S_B$ is called the between-class scatter. Note that, since $S_B$ is the outer product of two vectors, its rank is at most one
- We can finally express the Fisher criterion in terms of $S_W$ and $S_B$ as
$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$

To find the maximum of $J(w)$ we differentiate and equate to zero:
$$\frac{d}{dw} J(w) = \frac{d}{dw}\left[\frac{w^T S_B w}{w^T S_W w}\right] = 0$$
$$\Rightarrow \left[w^T S_W w\right] \frac{d(w^T S_B w)}{dw} - \left[w^T S_B w\right] \frac{d(w^T S_W w)}{dw} = 0$$
$$\Rightarrow \left[w^T S_W w\right] 2 S_B w - \left[w^T S_B w\right] 2 S_W w = 0$$
- Dividing by $w^T S_W w$:
$$\left[\frac{w^T S_W w}{w^T S_W w}\right] S_B w - \left[\frac{w^T S_B w}{w^T S_W w}\right] S_W w = 0 \;\Rightarrow\; S_B w - J\, S_W w = 0 \;\Rightarrow\; S_W^{-1} S_B w - J\, w = 0$$
- Solving the generalized eigenvalue problem ($S_W^{-1} S_B w = J w$) yields
$$w^* = \arg\max_w \frac{w^T S_B w}{w^T S_W w} = S_W^{-1}(\mu_1 - \mu_2)$$
- This is known as Fisher's linear discriminant (1936), although it is not a discriminant but rather a specific choice of direction for the projection of the data down to one dimension
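
As a quick illustration, here is a minimal NumPy sketch of this closed-form two-class solution (the function name and the unit-norm normalization are choices made here for illustration, not something the slides prescribe):

```python
import numpy as np

def fisher_lda_2class(X1, X2):
    """Fisher's linear discriminant w* = S_W^{-1}(mu_1 - mu_2) for two
    classes given as (N_i x D) sample matrices."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: S_i = sum_{x in w_i} (x - mu_i)(x - mu_i)^T
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    SW = S1 + S2
    # Solve S_W w = (mu_1 - mu_2) rather than forming the inverse explicitly
    w = np.linalg.solve(SW, mu1 - mu2)
    return w / np.linalg.norm(w)  # normalized projection direction
```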

Example
- Compute the LDA projection for the following 2D dataset:
$$X_1 = \{(4,1), (2,4), (2,3), (3,6), (4,4)\} \qquad X_2 = \{(9,10), (6,8), (9,5), (8,7), (10,8)\}$$
- SOLUTION (by hand). The class statistics are:
$$S_1 = \begin{bmatrix} 0.8 & -0.4 \\ -0.4 & 2.64 \end{bmatrix} \qquad S_2 = \begin{bmatrix} 1.84 & -0.04 \\ -0.04 & 2.64 \end{bmatrix}$$
$$\mu_1 = [3.0 \;\; 3.6]^T \qquad \mu_2 = [8.4 \;\; 7.6]^T$$
- The within- and between-class scatter are:
$$S_B = \begin{bmatrix} 29.16 & 21.6 \\ 21.6 & 16.0 \end{bmatrix} \qquad S_W = \begin{bmatrix} 2.64 & -0.44 \\ -0.44 & 5.28 \end{bmatrix}$$
- The LDA projection is then obtained as the solution of the generalized eigenvalue problem $S_W^{-1} S_B v = \lambda v$:
$$\left| S_W^{-1} S_B - \lambda I \right| = \begin{vmatrix} 11.89 - \lambda & 8.81 \\ 5.08 & 3.76 - \lambda \end{vmatrix} = 0 \;\Rightarrow\; \lambda = 15.65$$
$$\begin{bmatrix} 11.89 & 8.81 \\ 5.08 & 3.76 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = 15.65 \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} \;\Rightarrow\; v = [0.92 \;\; 0.39]^T$$
- Or directly by $w^* = S_W^{-1}(\mu_1 - \mu_2)$
[Figure: scatter plot of the two classes in the $(x_1, x_2)$ plane with the LDA direction $w_{LDA}$]
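
The example can be checked numerically; the short script below (a sketch using the formulas from the previous slides) reproduces the class statistics and the LDA direction. Note that the slide reports the class scatters scaled by $1/N_i$; the scaling cancels in $S_W^{-1} S_B$, so the direction is unchanged:

```python
import numpy as np

X1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)  # [3.0, 3.6] and [8.4, 7.6]

# Class scatters, scaled by 1/N_i as on the slide
S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)     # [[0.8, -0.4], [-0.4, 2.64]]
S2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)     # [[1.84, -0.04], [-0.04, 2.64]]
SW = S1 + S2                                 # [[2.64, -0.44], [-0.44, 5.28]]
SB = np.outer(mu1 - mu2, mu1 - mu2)          # [[29.16, 21.6], [21.6, 16.0]]

# Generalized eigenvalue problem S_W^{-1} S_B v = lambda v
evals, evecs = np.linalg.eig(np.linalg.solve(SW, SB))
v = evecs[:, np.argmax(evals)]               # ~[0.92, 0.39] (up to sign)
lam = evals.max()                            # ~15.65

# Direct solution w* = S_W^{-1}(mu1 - mu2) gives the same direction
w = np.linalg.solve(SW, mu1 - mu2)
```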

LDA, C classes
- Fisher's LDA generalizes gracefully for C-class problems
- Instead of one projection $y$, we will now seek $(C-1)$ projections $[y_1, y_2, \ldots, y_{C-1}]$ by means of $(C-1)$ projection vectors $w_i$ arranged by columns into a projection matrix $W = [w_1 | w_2 | \ldots | w_{C-1}]$:
$$y_i = w_i^T x \;\Rightarrow\; y = W^T x$$
Derivation
- The within-class scatter generalizes as
$$S_W = \sum_{i=1}^{C} S_i \qquad\text{where } S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T \text{ and } \mu_i = \frac{1}{N_i}\sum_{x \in \omega_i} x$$
- And the between-class scatter becomes
$$S_B = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T \qquad\text{where } \mu = \frac{1}{N}\sum_{\forall x} x = \frac{1}{N}\sum_{i=1}^{C} N_i \mu_i$$
- Matrix $S_T = S_B + S_W$ is called the total scatter
[Figure: three-class example showing the per-class within-class scatters $S_{W_i}$ and between-class scatters $S_{B_i}$ in the $(x_1, x_2)$ plane]

Similarly, we define the mean vector and scatter matrices for the projected samples as
$$\tilde{\mu}_i = \frac{1}{N_i}\sum_{y \in \omega_i} y \qquad \tilde{\mu} = \frac{1}{N}\sum_{\forall y} y$$
$$\tilde{S}_W = \sum_{i=1}^{C} \sum_{y \in \omega_i} (y - \tilde{\mu}_i)(y - \tilde{\mu}_i)^T \qquad \tilde{S}_B = \sum_{i=1}^{C} N_i (\tilde{\mu}_i - \tilde{\mu})(\tilde{\mu}_i - \tilde{\mu})^T$$
- From our derivation for the two-class problem, we can write
$$\tilde{S}_W = W^T S_W W \qquad \tilde{S}_B = W^T S_B W$$
- Recall that we are looking for a projection that maximizes the ratio of between-class to within-class scatter. Since the projection is no longer a scalar (it has $C-1$ dimensions), we use the determinant of the scatter matrices to obtain a scalar objective function:
$$J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^T S_B W|}{|W^T S_W W|}$$
- And we will seek the projection matrix $W^*$ that maximizes this ratio

It can be shown that the optimal projection matrix $W^*$ is the one whose columns are the eigenvectors corresponding to the largest eigenvalues of the following generalized eigenvalue problem:
$$W^* = [w_1^* | w_2^* | \ldots | w_{C-1}^*] = \arg\max_W \frac{|W^T S_B W|}{|W^T S_W W|} \;\Rightarrow\; (S_B - \lambda_i S_W)\, w_i^* = 0$$
NOTES
- $S_B$ is the sum of $C$ matrices of rank one or less, and the mean vectors are constrained by $\frac{1}{C}\sum_{i=1}^{C} \mu_i = \mu$. Therefore, $S_B$ will be of rank $(C-1)$ or less, which means that only $(C-1)$ of the eigenvalues $\lambda_i$ will be non-zero
- The projections with maximum class separability information are the eigenvectors corresponding to the largest eigenvalues of $S_W^{-1} S_B$
- LDA can be derived as the Maximum Likelihood method for the case of normal class-conditional densities with equal covariance matrices
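
A hedged NumPy/SciPy sketch of the multi-class solution (the function name and the use of scipy.linalg.eigh for the symmetric-definite generalized problem are implementation choices made here; it assumes $S_W$ is non-singular, which in practice may call for regularization or a preliminary PCA step):

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, y, n_components=None):
    """Columns of the returned W are the eigenvectors of the generalized
    problem S_B w = lambda S_W w with the largest eigenvalues."""
    classes = np.unique(y)
    N, D = X.shape
    mu = X.mean(axis=0)                      # overall mean
    SW, SB = np.zeros((D, D)), np.zeros((D, D))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        SW += (Xc - mu_c).T @ (Xc - mu_c)    # within-class scatter
        SB += len(Xc) * np.outer(mu_c - mu, mu_c - mu)  # between-class
    evals, evecs = eigh(SB, SW)              # generalized symmetric solver
    order = np.argsort(evals)[::-1]          # largest eigenvalues first
    k = n_components or (len(classes) - 1)   # at most C-1 non-zero eigenvalues
    return evecs[:, order[:k]]               # projection matrix W (D x k)
```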

LDA vs. PCA
- This example illustrates the performance of PCA and LDA on an odor recognition problem
- Five types of coffee beans were presented to an array of gas sensors
- For each coffee type, a number of sniffs were performed and the response of the gas sensor array was processed in order to obtain a 60-dimensional feature vector
Results
- From the 3D scatter plots it is clear that LDA outperforms PCA in terms of class discrimination
- This is one example where the discriminatory information is not aligned with the direction of maximum variance
[Figure: normalized sensor responses and 3D scatter plots of the PCA and LDA projections for the five coffee types: Sulawesy, Kenya, Arabian, Sumatra, Colombia]
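
The coffee data themselves are not reproduced here, but the qualitative point is easy to demonstrate on synthetic data (everything below is invented for illustration): two Gaussian classes whose means differ only along a low-variance axis. PCA picks the high-variance but uninformative direction, while LDA picks the discriminative one:

```python
import numpy as np

rng = np.random.default_rng(0)
# Elongated classes: x1 has large variance but no class information;
# the class means differ only along x2.
cov = np.array([[9.0, 0.0], [0.0, 0.25]])
X1 = rng.multivariate_normal([0.0, -1.0], cov, size=200)
X2 = rng.multivariate_normal([0.0, +1.0], cov, size=200)
X = np.vstack([X1, X2])

# PCA: leading eigenvector of the total covariance -> roughly [1, 0]
evals, evecs = np.linalg.eigh(np.cov(X.T))
w_pca = evecs[:, np.argmax(evals)]

# LDA: w* = S_W^{-1}(mu1 - mu2) -> roughly [0, 1]
SW = ((X1 - X1.mean(0)).T @ (X1 - X1.mean(0))
      + (X2 - X2.mean(0)).T @ (X2 - X2.mean(0)))
w_lda = np.linalg.solve(SW, X1.mean(0) - X2.mean(0))
w_lda /= np.linalg.norm(w_lda)

print("PCA direction:", w_pca)   # aligned with maximum variance
print("LDA direction:", w_lda)   # aligned with class discrimination
```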

Limitations of LDA
- LDA produces at most $(C-1)$ feature projections. If the classification error estimates establish that more features are needed, some other method must be employed to provide those additional features
- LDA is a parametric method (it assumes unimodal Gaussian likelihoods). If the distributions are significantly non-Gaussian, the LDA projections may not preserve complex structure in the data needed for classification
- LDA will also fail if the discriminatory information is not in the mean but in the variance of the data
[Figure: examples where LDA fails: multimodal classes with equal means, and classes separated by variance rather than mean, with the PCA and LDA directions shown]

Variants of LDA
- Non-parametric LDA (Fukunaga): NPLDA relaxes the unimodal Gaussian assumption by computing $S_B$ using local information and the kNN rule. As a result, the matrix $S_B$ is full-rank, allowing us to extract more than $(C-1)$ features, and the projections are able to preserve the structure of the data more closely
- Orthonormal LDA (Okada and Tomita): OLDA computes projections that maximize the Fisher criterion and, at the same time, are pairwise orthonormal. The method used in OLDA combines the eigenvalue solution of $S_W^{-1} S_B$ and the Gram-Schmidt orthonormalization procedure. OLDA sequentially finds axes that maximize the Fisher criterion in the subspace orthogonal to all features already extracted, and is also capable of finding more than $(C-1)$ features
- Generalized LDA (Lowe): GLDA generalizes the Fisher criterion by incorporating a cost function similar to the one we used to compute the Bayes Risk. As a result, LDA can produce projections that are biased by the cost function, i.e., classes with a higher cost $C_{ij}$ will be placed further apart in the low-dimensional projection
- Multilayer perceptrons (Webb and Lowe): it has been shown that the hidden layers of multi-layer perceptrons perform non-linear discriminant analysis by maximizing $\mathrm{Tr}[S_B S_T^{-1}]$, where the scatter matrices are measured at the output of the last hidden layer

Other dimensionality reduction methods
Exploratory Projection Pursuit (Friedman and Tukey)
- EPP seeks an M-dimensional (typically M = 2 or 3) linear projection of the data that maximizes a measure of "interestingness"
- Interestingness is measured as departure from multivariate normality. This measure is not the variance and is commonly scale-free. In most implementations it is also affine invariant, so it does not depend on correlations between features. [Ripley, 1996]
- In other words, EPP seeks projections that separate clusters as much as possible and keep these clusters compact, a criterion similar to Fisher's, except that EPP does NOT use class labels
- Once an interesting projection is found, it is important to remove the structure it reveals so that other interesting views can be found more easily
[Figure: an "uninteresting" projection (a single undifferentiated blob) vs. an "interesting" projection (well-separated compact clusters)]

Sammon's non-linear mapping (Sammon)
- This method seeks a mapping onto an M-dimensional space that preserves the inter-point distances in the original N-dimensional space. This is accomplished by minimizing the following objective function:
$$E(d, d') = \sum_{i<j} \frac{\left[d(P_i, P_j) - d'(P_i, P_j)\right]^2}{d(P_i, P_j)}$$
- The original method did not obtain an explicit mapping but only a lookup table for the elements in the training set. Newer implementations based on neural networks do provide an explicit mapping for test data and also consider cost functions (e.g., Neuroscale)
- Sammon's mapping is closely related to Multi-Dimensional Scaling (MDS), a family of multivariate statistical methods commonly used in the social sciences. We will review MDS techniques when we cover manifold learning
[Figure: points $P_i, P_j$ in the original space mapped to $P_i', P_j'$ in the low-dimensional space so that $d'(P_i', P_j') \approx d(P_i, P_j)\ \forall\, i, j$]
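
A minimal gradient-descent sketch of Sammon's objective as written above (the learning rate, iteration count, and random initialization are arbitrary choices for illustration; Sammon's original method used a Newton-style update and included a normalizing constant that is omitted here, as it is on the slide):

```python
import numpy as np

def sammon(X, m=2, n_iter=500, lr=0.1, seed=0):
    """Map X (N x D) to an (N x m) configuration Y by gradient descent on
    E = sum_{i<j} (D_ij - d_ij)^2 / D_ij, with D the original distances."""
    rng = np.random.default_rng(seed)
    N = len(X)
    # Pairwise distances in the original space (epsilon avoids division by 0)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1) + 1e-12
    Y = rng.normal(scale=1e-2, size=(N, m))      # random low-dim start
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]     # pairwise differences
        d = np.linalg.norm(diff, axis=-1) + 1e-12
        # dE/dY_i = sum_j -2 (D_ij - d_ij) / (D_ij d_ij) * (Y_i - Y_j)
        g = -2.0 * (D - d) / (D * d)
        np.fill_diagonal(g, 0.0)
        Y -= lr * (g[:, :, None] * diff).sum(axis=1)
    return Y
```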