Principal Component Analysis




Principal Component Analysis ERS70D George Fernandez

INTRODUCTION

Analysis of multivariate data plays a key role in data analysis. Multivariate data consist of many different attributes or variables recorded for each observation. If there are p variables in a database, each variable could be regarded as constituting a different dimension in a p-dimensional hyperspace. Because a multi-dimensional hyperspace is often difficult to visualize, the main objectives of unsupervised learning methods are to reduce dimensionality, to score all observations based on a composite index, and to cluster similar observations together based on multiple attributes. Summarizing the multivariate attributes by two or three components that can be displayed graphically with minimal loss of information is useful in knowledge discovery.

PRINCIPAL COMPONENT ANALYSIS

Because it is hard to visualize multi-dimensional space, principal component analysis (PCA), a popular multivariate technique, is mainly used to reduce the dimensionality of p multi-attributes to two or three dimensions. PCA summarizes the variation in correlated multi-attributes into a set of uncorrelated components, each of which is a particular linear combination of the original variables. The extracted uncorrelated components are called principal components (PCs) and are estimated from the eigenvectors of the covariance or correlation matrix of the original variables. The objective of PCA is therefore to achieve parsimony: to extract the smallest number of components that account for most of the variation in the original multivariate data, and to summarize the data with little loss of information.

In PCA, uncorrelated PCs are extracted by linear transformations of the original variables so that the first few PCs contain most of the variation in the original dataset. These PCs are extracted in decreasing order of importance, so that the first PC accounts for as much of the variation as possible and each successive component accounts for a little less. Following PCA, the analyst tries to interpret the first few principal components in terms of the original variables, and thereby gain a greater understanding of the data. Reproducing the total system variability of the original p variables requires all p PCs; however, if the first few PCs account for a large proportion of the variability (80-90%), the objective of dimension reduction is achieved. Because the first principal component accounts for the co-variation shared by all attributes, it may be a better composite estimate than simple or weighted averages of the original variables. Thus, PCA can be useful when there is a high degree of correlation among the multi-attributes.
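The mechanics can be sketched in a few lines of NumPy (an illustrative sketch on synthetic data, not the SAS workflow that produced the output shown later): form the correlation matrix, take its eigenvalues and eigenvectors, sort them in decreasing order of importance, and read off the proportion of variation each PC accounts for.

```python
# Minimal PCA-by-eigendecomposition sketch (synthetic data, NumPy only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))  # correlated attributes

R = np.corrcoef(X, rowvar=False)        # p x p correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)    # eigh: solver for symmetric matrices
order = np.argsort(eigvals)[::-1]       # decreasing order of importance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("proportion of variation per PC:", eigvals / eigvals.sum())
```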

PRINCIPAL COMPONENT ANALYSIS

In PCA, the PCs can be extracted either from the original multivariate dataset or from the covariance or correlation matrix if the original dataset is not available. In deriving the PCs, the correlation matrix is commonly used when the variables in the dataset are measured in different units (annual income, educational level, number of cars owned per family) or when they have very different variances. Using the correlation matrix is equivalent to standardizing the variables to zero mean and unit standard deviation.

Applications of PCA

Hall, R.I., Leavitt, P.R., Quinlan, R., Dixit, A.S., and Smol, J.P. 1999. Effects of agriculture, urbanization, and climate on water quality in the northern Great Plains. Limnol. Oceanogr. 44(3, part 2): 739-756. PCA of sample scores and selected species from (a) diatom percent abundances, (b) fossil pigment concentrations, and (c) chironomid percent abundances.

Applications of PCA

[Figure: PCA bi-plot of sample scores and selected species, with PCA1 (45.5% of variation) and PCA2 (13.9%) as axes; labeled taxa include S. hantzschii, F. capucina, S. niagarae, and A. granulata, and samples are labeled by sediment-core year intervals.]

Correlation matrix (Pearson coefficients, with p-values below each coefficient):

               Y2          Y4          X4          X8          X           X5
               midrprce    ctympg      hp          pcap        width       weight

Y2 midrprce     1.00000    -0.58837     0.78498     0.04540     0.44994     0.64420
                            <.0001      <.0001      0.6692      <.0001      <.0001

Y4 ctympg      -0.58837     1.00000    -0.66755    -0.40888    -0.7759     -0.84055
                <.0001                  <.0001      <.0001      <.0001      <.0001

X4 hp           0.78498    -0.66755     1.00000    -0.00435     0.64054     0.73446
                <.0001      <.0001                  0.967       <.0001      <.0001

X8 pcap         0.04540    -0.40888    -0.00435     1.00000     0.48475     0.54683
                0.6692      <.0001      0.967                   <.0001      <.0001

X  width        0.44994    -0.7759      0.64054     0.48475     1.00000     0.87408
                <.0001      <.0001      <.0001      <.0001                  <.0001

X5 weight       0.64420    -0.84055     0.73446     0.54683     0.87408     1.00000
                <.0001      <.0001      <.0001      <.0001      <.0001
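As a numeric check of the standardization point made above, this NumPy sketch (synthetic data with deliberately mismatched column scales) confirms that the correlation matrix of the raw variables is identical to the covariance matrix of the standardized variables, so either route yields the same PCs.

```python
# Correlation-matrix PCA == covariance-matrix PCA on standardized data.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) * [1.0, 10.0, 100.0]  # very different variances

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # zero mean, unit sd
R = np.corrcoef(X, rowvar=False)                    # correlation of raw data
S_z = np.cov(Z, rowvar=False)                       # covariance of standardized data

print(np.allclose(R, S_z))                          # True: same matrix, same PCs
```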

Eigenvalue/eigenvector decomposition of a correlation matrix:

$$R = \begin{pmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{pmatrix} = \sum_{j=1}^{p} \lambda_j a_j a_j'$$

(shown here for p = 3), where the λ_j are the eigenvalues and the a_j are the corresponding eigenvectors.

PCA TERMINOLOGY

Eigenvalues measure the amount of the variation explained by each PC; the eigenvalue is largest for the first PC and smaller for the subsequent PCs. An eigenvalue greater than 1 indicates that the PC accounts for more variance than any one of the original variables in standardized data; this is commonly used as a cutoff point for deciding which PCs are retained. Eigenvectors provide the weights for computing the uncorrelated PCs, which are linear combinations of the centered standardized or centered unstandardized original variables.
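Both the decomposition and the eigenvalue-greater-than-1 retention rule can be checked numerically; a NumPy sketch on synthetic data:

```python
# Verify R = sum_j lambda_j * a_j a_j' and apply the eigenvalue > 1 rule.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 5)) @ rng.normal(size=(5, 5))
R = np.corrcoef(X, rowvar=False)

lam, A = np.linalg.eigh(R)
lam, A = lam[::-1], A[:, ::-1]            # sort eigenvalues in decreasing order

R_rebuilt = sum(l * np.outer(a, a) for l, a in zip(lam, A.T))
print(np.allclose(R, R_rebuilt))          # True: spectral decomposition holds
print("PCs retained (eigenvalue > 1):", int((lam > 1).sum()))
```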

Eigenvectors and eigenvalues

Eigenvectors:

                   1         2
Y2  midrprce     0.3771   -0.4420
Y4  ctympg      -0.4475   -0.0523
X4  hp           0.4181   -0.4260
X8  pcap         0.2297    0.7522
X   width        0.4394    0.1947
X5  weight       0.4867    0.1296

Eigenvalues of the Correlation Matrix: Total = 6

     Eigenvalue   Difference   Proportion   Cumulative
1    3.9481323    2.7240979    0.6580       0.6580
2    1.2240344    0.8439396    0.2040       0.8620
3    0.3800949    0.1123044    0.0633       0.9254
4    0.2677905    0.1444804    0.0446       0.9701
5    0.1233101    0.0666722    0.0206       0.9906
6    0.0566379                 0.0094       1.0000

PCA

Each principal component is a linear combination of the original variables:

$$PC_i = \sum_{j=1}^{p} a_{ij} x_j$$

where PC_i is the i-th principal component and the linear coefficients a_ij are the elements of the i-th eigenvector.
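A NumPy sketch (synthetic data) of how the eigenvector weights produce the components: standardize the variables, multiply by the eigenvector matrix, and verify that the resulting PCs are centered and mutually uncorrelated.

```python
# PC scores as PC_i = sum_j a_ij * z_j, with eigenvectors as weights.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # standardized variables

lam, A = np.linalg.eigh(np.corrcoef(X, rowvar=False))
A = A[:, np.argsort(lam)[::-1]]                     # columns: PC1, PC2, ...

scores = Z @ A                                      # n x p matrix of PC scores
print(scores.mean(axis=0).round(10))                # ~0: scores are centered
print(np.corrcoef(scores, rowvar=False).round(10))  # ~identity: uncorrelated
```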

Estimating the Number of PCs

Scree test: plotting the eigenvalues against the corresponding PC numbers produces a scree plot, which illustrates the rate of change in the magnitude of the eigenvalues. The rate of decline tends to be fast at first and then levels off. The elbow, the point at which the curve bends, is considered to indicate the maximum number of PCs to extract. One less PC than the number at the elbow might be appropriate if you are concerned about getting an overly defined solution.

[Figure: cars data, scree plot and parallel analysis for PCA; eigenvalue (y-axis) plotted against number of PCs 1-6 (x-axis), with one curve for the observed eigenvalues and one for the parallel-analysis eigenvalues.]
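A minimal matplotlib sketch of such a scree plot (synthetic data; the parallel-analysis curve is omitted):

```python
# Scree plot: eigenvalues vs. component number; look for the elbow.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6))
lam = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]

plt.plot(np.arange(1, 7), lam, "o-")
plt.axhline(1.0, linestyle="--")    # eigenvalue = 1 reference line
plt.xlabel("Number of PC")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()
```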

PCA TERMINOLOGY

PC loadings are the correlation coefficients between the PC scores and the original variables; they measure the importance of each variable in accounting for the variability in a PC. It is possible to interpret the first few PCs in terms of an 'overall' effect or a 'contrast' between groups of variables, based on the structure of the PC loadings. A high correlation between PC1 and a variable indicates that the variable is associated with the direction of the maximum amount of variation in the dataset; more than one variable might have a high correlation with PC1. A strong correlation between a variable and PC2 indicates that the variable is responsible for the next largest variation in the data, perpendicular to PC1, and so on. Conversely, if a variable does not correlate with any PC, or correlates only with the last PC or the one before it, this usually suggests that the variable contributes little or nothing to the variation in the dataset. PCA may therefore indicate which variables in a dataset are important and which are of little consequence; some of these low-performance variables might be removed from consideration to simplify the overall analysis.

PC loadings (Factor Pattern):

                Factor1   Factor2
Y2  midrprce     0.7493   -0.4890
Y4  ctympg      -0.8892   -0.0579
X4  hp           0.8308   -0.4714
X8  pcap         0.4565    0.8322
X   width        0.8731    0.2154
X5  weight       0.9671    0.1433
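A loading table like this can be reproduced numerically: the loading of a variable on a PC is its correlation with the PC score, which (for correlation-matrix PCA) equals the eigenvector element scaled by the square root of the eigenvalue. A NumPy sketch on synthetic data:

```python
# PC loadings: corr(variable, PC score) == eigenvector * sqrt(eigenvalue).
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(120, 4)) @ rng.normal(size=(4, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

lam, A = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(lam)[::-1]
lam, A = lam[order], A[:, order]
scores = Z @ A

loadings = A * np.sqrt(lam)                        # p x p loading matrix
corr = np.array([[np.corrcoef(Z[:, j], scores[:, i])[0, 1]
                  for i in range(4)] for j in range(4)])
print(np.allclose(loadings, corr))                 # True
```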

PC Scores

PC scores are the derived composite scores computed for each observation from the eigenvectors of each PC. The means of the PC scores are equal to zero, because they are linear combinations of the centered variables. These uncorrelated PC scores can be used in subsequent analyses: to check for multivariate normality, to detect multivariate outliers, or as a remedial measure in regression analysis with severe multicollinearity.

PC scores (first 16 and last 10 observations):

Obs   ID   Factor1    Factor2
  1   39   -2.60832   -0.36804
  2   82   -2.5032    -0.37408
  3    8   -2.0432    -0.46095
  4   32   -1.9308    -0.22925
  5   42   -1.67883   -0.5745
  6   85   -1.5346     0.3244
  7   74   -1.46923   -0.3646
  8   44   -1.4465     0.38442
  9   87   -1.4928    -0.30308
 10   47   -1.32023   -0.37438
 11   23   -1.23447    0.37825
 12   40   -1.852     -0.3955
 13   30   -1.6499     0.2589
 14   62   -1.4394     0.3846
 15   56   -1.0393     0.24003
 16   80   -1.058      0.35356
...
 83    9    1.2486    -3.6769
 84   28    1.25067   -0.895
 85         1.30762   -0.547
 86    3    1.345      0.854
 87   57    1.4063    -2.25287
 88    7    1.50488    0.8694
 89    6    1.53423    2.5236
 90   52    1.708      0.00869
 91   48    1.82993   -0.872
 92    0    1.87482   -0.58474
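One of the uses noted above, multivariate outlier detection, follows directly from the scores being uncorrelated: the sum of squared PC scores, each standardized by its eigenvalue, equals the squared Mahalanobis distance of an observation from the centroid. A NumPy sketch with one planted outlier:

```python
# Outlier screening with PC scores: sum_i (PC_i^2 / lambda_i) is the
# squared Mahalanobis distance because the scores are uncorrelated.
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
X[0] += 25.0                               # plant one gross outlier in row 0

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
lam, A = np.linalg.eigh(np.corrcoef(X, rowvar=False))
scores = Z @ A

d2 = ((scores**2) / lam).sum(axis=1)               # squared distance per row
print("most extreme observation:", d2.argmax())    # 0: the planted outlier
```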

BI-PLOT DISPLAY OF PCA

The bi-plot display is a visualization technique for investigating the inter-relationships between the observations and the variables in multivariate data. To display a bi-plot, the data are considered as a matrix in which the columns represent the variable space and the rows represent the observational space. The term bi-plot means that it is a plot of two dimensions in which the observation and variable spaces are plotted simultaneously. In PCA, the relationships between the PC scores and the PC loadings associated with any two PCs can be illustrated in a bi-plot display.

[Figure: bi-plot of the cars data (factor method: principal components; rotation: none); Factor1 scores on the x-axis and Factor2 scores on the y-axis, with observations plotted as numbered points and the variables Y2, Y4, X4, X8, X, and X5 plotted as vectors.]
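A matplotlib sketch of a bi-plot (synthetic data; the variable names x1..x4 are placeholders): observations appear as points in the PC1/PC2 score space, and each variable appears as a vector whose direction and length come from its loadings.

```python
# Bi-plot: PC scores as points, PC loadings as variable vectors.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

lam, A = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(lam)[::-1]
lam, A = lam[order], A[:, order]
scores, loadings = Z @ A, A * np.sqrt(lam)

plt.scatter(scores[:, 0], scores[:, 1], s=10)          # observation space
for j, name in enumerate(["x1", "x2", "x3", "x4"]):    # variable space
    # arrows scaled by an arbitrary factor (3) for visibility
    plt.arrow(0, 0, 3 * loadings[j, 0], 3 * loadings[j, 1], head_width=0.05)
    plt.annotate(name, (3 * loadings[j, 0], 3 * loadings[j, 1]))
plt.xlabel("PC1 score")
plt.ylabel("PC2 score")
plt.title("Bi-plot")
plt.show()
```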