Robust Multivariate Methods in Geostatistics

Peter Filzmoser (1), Clemens Reimann (2)
(1) Department of Statistics, Probability Theory, and Actuarial Mathematics, Vienna University of Technology, A-1040 Vienna, Austria
(2) Geological Survey of Norway, N-7491 Trondheim, Norway

Abstract: Two robust approaches to principal component analysis and factor analysis are presented. The methods are compared and their properties discussed. As an application we use a large geochemical data set that was previously analyzed in detail by univariate (geo-)statistical methods, and we explain the advantages of applying robust multivariate methods.

1 Introduction

In regional geochemistry, one advantage of multivariate methods is that instead of presenting maps for 50 (or more) chemical elements, only a few maps of principal components or factors need to be presented, containing a high percentage of the information of the single-element maps. Additionally, it may be possible to find effects which are not visible in the single-element maps. Factor analysis in particular is used in many kinds of applications to detect hidden structures in the data.

Geochemical data sets usually include outliers, which are caused by a multitude of different processes. It is well known that outliers can heavily influence classical statistical methods, including multivariate ones. Even a single (huge) outlier can completely determine the result of a principal component analysis. For that reason it is advisable to use robust multivariate methods for detecting the multivariate structure.

Section 2 treats two methods of robust principal component analysis. Two different versions of robust factor analysis which have recently been proposed are considered in Sections 3 and 4. Section 5 gives an example with a real geochemical data set.

2 Robust Principal Component Analysis

Let x be a p-dimensional random vector with E(x) = µ and Cov(x) = Σ. The covariance matrix can be decomposed as Σ = Γ A Γ', where the columns of Γ = (γ_{·1}, ..., γ_{·p}) are the eigenvectors of Σ and A is a diagonal matrix containing the corresponding eigenvalues of Σ (arranged in descending order). The principal components of x are defined by z = Γ'(x − µ). Classically, µ is estimated by the sample mean x̄, and

Σ by the sample covariance matrix S, which is then decomposed into eigenvectors and eigenvalues. Both x̄ and S are highly sensitive to outlying observations. Hence, for a serious analysis of geochemical data, a robust version of principal component analysis (PCA) has to be applied.

PCA can easily be robustified by estimating the covariance matrix Σ in a robust way, e.g. by taking the Minimum Covariance Determinant (MCD) estimator of Rousseeuw (1985). The robustly estimated covariance matrix is not influenced by outliers, and hence its eigenvector/eigenvalue decomposition is also robust. Since the MCD additionally gives a robust estimate of µ, the whole PCA procedure is robust. We will discuss the use of the MCD estimator in more detail in the context of factor analysis (Section 3).

Another way of robustifying PCA was introduced by Li and Chen (1985). The method is based on the projection pursuit technique: PCA can be seen as a special case of projection pursuit in which the variance of the projected data points is maximized. Let X = (x_{1·}, ..., x_{n·})' be a data matrix with observation vectors x_{i·} ∈ IR^p (i = 1, ..., n). Now assume that the first (k − 1) projection directions γ_{·1}, ..., γ_{·(k−1)} are already known. We define the projection matrices

    P_1 = I_p,    P_k = I_p − ∑_{j=1}^{k−1} γ_{·j} γ_{·j}'    (1)

so that P_k projects onto the orthogonal complement of the space spanned by the first (k − 1) projection directions. We are interested in finding a projection direction a which maximizes the function

    S(X P_k a)    (2)

under the restrictions a'a = 1 and P_k a = a (orthogonality to the previously found projection directions). Defining S in (2) as the classical sample standard deviation results in classical PCA. The method can easily be robustified by taking a robust measure of spread, e.g. the median absolute deviation (MAD),

    MAD(y) = med_i |y_i − med_j (y_j)|.    (3)

Since the number of possible projection directions is infinite, an approximate solution for maximizing (2) is obtained as follows.
The k-th projection direction is searched for only within the set

    A_{n,k} = { P_k(x_{1·} − µ_n) / ||P_k(x_{1·} − µ_n)||, ..., P_k(x_{n·} − µ_n) / ||P_k(x_{n·} − µ_n)|| }    (4)

where µ_n denotes a robust estimate of the location, like the L1-median or the component-wise median. The algorithm outlined above was suggested by Croux and Ruiz-Gazen (1996). It is easy to implement and fast to compute, which makes the method quite attractive in practice. Furthermore, this robust PCA method has a big advantage for high-dimensional data (large p), because it allows one to stop at a desired number k < p of components, whereas usually all p components have to be extracted by the eigenvector/eigenvalue decomposition of the (robust) covariance matrix. Moreover, the method still gives reliable results for n < p, which is important for a variety of applications; the computation of the MCD estimator, in contrast, requires at least n > p.

3 Robust Factor Analysis using the MCD

The aim of factor analysis (FA) is to summarize the correlation structure of observed variables x_1, ..., x_p. For this purpose one constructs k < p unobservable or latent variables f_1, ..., f_k, called the factors, which are linked to the original variables through the equation

    x_j = λ_{j1} f_1 + λ_{j2} f_2 + ... + λ_{jk} f_k + ε_j    (5)

for each 1 ≤ j ≤ p. The error variables ε_1, ..., ε_p are supposed to be independent, with specific variances ψ_1, ..., ψ_p. The coefficients λ_{jl} are called the factor loadings, and they are collected in the loadings matrix Λ. Using the vector notations x = (x_1, ..., x_p)', f = (f_1, ..., f_k)', and ε = (ε_1, ..., ε_p)', the usual conditions on factors and error terms can be written as E(f) = E(ε) = 0, Cov(f) = I_k, and Cov(ε) = Ψ, with Ψ a diagonal matrix containing the specific variances on its diagonal. Furthermore, ε and f are assumed to be independent. From these conditions it follows that the covariance matrix of x can be expressed as

    Σ = ΛΛ' + Ψ.    (6)

In classical FA the matrix Σ is estimated by the sample covariance matrix. Afterwards, decomposition (6) is used to obtain the estimators for Λ and Ψ.
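Before turning to robust factor analysis, the projection pursuit scheme of Section 2 can be sketched in a few lines of numpy. This is an illustrative sketch, not the authors' implementation: it uses the component-wise median as the robust center µ_n, the MAD of (3) as the scale S, and searches only the candidate set (4); the function names are ours.

```python
import numpy as np

def mad(y):
    """Median absolute deviation, the robust scale estimate of Eq. (3)."""
    return np.median(np.abs(y - np.median(y)))

def pp_robust_pca(X, k, scale=mad):
    """Projection pursuit PCA sketch: greedily pick k directions that
    maximize a robust spread of the projected data, searching only the
    candidate set (4) of projected, normalized data points."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    mu = np.median(X, axis=0)      # component-wise median as robust center
    Xc = X - mu
    P = np.eye(p)                  # projection matrix P_k of Eq. (1)
    directions, spreads = [], []
    for _ in range(k):
        cand = Xc @ P              # P is symmetric, so this projects the rows
        norms = np.linalg.norm(cand, axis=1)
        keep = norms > 1e-12
        cand = cand[keep] / norms[keep][:, None]
        s = np.array([scale(Xc @ a) for a in cand])  # S(X P_k a) of Eq. (2)
        best = int(np.argmax(s))
        a = cand[best]
        directions.append(a)
        spreads.append(s[best])
        P = P - np.outer(a, a)     # deflate: stay orthogonal to found directions
    return np.array(directions), np.array(spreads)
```

On data with one dominant direction of spread, the first recovered direction aligns with it; successive directions are orthogonal by construction, and the loop can be stopped at any k < p.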
Many methods have been proposed for this decomposition, of which maximum likelihood (ML) and the principal factor analysis (PFA) method are the most frequently used. As in the previous section, the parameter estimates can be heavily influenced by outliers when a classical estimate of the scatter matrix is used. The problem can be avoided when Σ is estimated by the MCD estimator, which looks for the subset of h out of all n observations whose covariance matrix has the smallest determinant. Typically, h ≈ 3n/4.
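To make the definition of the MCD concrete, the following deliberately naive sketch searches all size-h subsets exhaustively. This is feasible only for very small n; in practice one would use the FAST-MCD algorithm of Rousseeuw and Van Driessen (1999). The function name and test data are ours.

```python
import numpy as np
from itertools import combinations

def mcd_brute_force(X, h):
    """Minimum Covariance Determinant by exhaustive search: among all
    size-h subsets of the n observations, take the one whose sample
    covariance matrix has the smallest determinant; return that subset's
    mean and covariance as robust location/scatter estimates."""
    X = np.asarray(X, dtype=float)
    best_det, best_idx = np.inf, None
    for idx in combinations(range(len(X)), h):
        d = np.linalg.det(np.cov(X[list(idx)], rowvar=False))
        if d < best_det:
            best_det, best_idx = d, idx
    sub = X[list(best_idx)]
    return sub.mean(axis=0), np.cov(sub, rowvar=False)
```

Because any subset containing an outlier has a hugely inflated covariance determinant, the search settles on a clean subset, so the resulting location and scatter estimates ignore the outliers entirely.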

Pison et al. (1999) used the MCD for robustifying FA. They have shown that PFA based on the MCD results in a resistant FA method with a bounded influence function, which has better robustness properties than the ML-based counterpart. The empirical influence function can be used as a data-analytic tool. The method is also attractive for computational reasons, since a fast algorithm for the MCD estimator has recently been developed (Rousseeuw and Van Driessen (1999)).

4 FA using Robust Alternating Regressions

A limitation of the MCD-based approach is that the sample size n needs to be larger than the number of variables p. For samples with n ≤ p (which occur quite frequently in the practice of FA), a robust FA technique based on alternating regressions, originating from Croux et al. (1999), can be used. For this we consider the sample version of model (5),

    x_{ij} = ∑_{l=1}^{k} λ_{jl} f_{il} + ε_{ij}    (7)

for i = 1, ..., n and j = 1, ..., p. Suppose that preliminary estimates of the factor scores f_{il} are known, and consider them as constants for the moment. The loadings λ_{jl} can then be estimated by linear regressions of the x_j on the factors. Moreover, by applying a robust scale estimator to the resulting residuals (for example the MAD), estimates ψ̂_j of the ψ_j are easily obtained. Conversely, if preliminary estimates of the loadings are available, linear regression can again be used to estimate the factor scores: taking i fixed in (7) and treating the λ_{jl} as known, a regression of x_{ij} on the loadings λ_{jl} yields updated estimates of the factor scores. Since there is heteroscedasticity, weights proportional to (ψ̂_j)^{−1/2} should be included. Using robust principal components (Section 2) as starting values for the factor scores, an iterative process (called alternating or interlocking regressions) can be carried out to estimate the unknown parameters of the factor model.
To ensure robustness of the procedure, we use a weighted L1-regression estimator, since it is fast to compute and very robust. More details about the method and the choice of the weights can be found in Croux et al. (1999). Note that, in contrast to the method described in Section 3, the factor scores are estimated directly.
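The alternating scheme can be sketched as follows, with one simplification: plain least squares is used in both regression steps, where the robust method of this section would substitute the weighted L1-regression estimator. Function names are illustrative, and the scores are started from classical principal components rather than robust ones.

```python
import numpy as np

def alternating_fa(X, k, n_iter=50):
    """Alternating-regressions FA sketch for model (7): alternately
    regress the variables on the current factor scores to update the
    loadings, then each observation on the loadings to update its
    scores. Least squares is used here; the robust variant would use
    weighted L1-regression in both steps."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    # starting values for the scores from (classical) principal components
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    F = Xc @ Vt[:k].T                                # scores, (n, k)
    for _ in range(n_iter):
        L, *_ = np.linalg.lstsq(F, Xc, rcond=None)   # loadings step: F L = Xc
        F, *_ = np.linalg.lstsq(L.T, Xc.T, rcond=None)  # scores step: L' F' = Xc'
        F = F.T
    resid = Xc - F @ L
    psi = resid.var(axis=0)                          # specific variances
    return L.T, F, psi                               # loadings (p, k), scores, psi
```

Note that both steps only involve regressions with k unknowns, which is why the approach remains usable when n ≤ p, unlike the MCD-based method of Section 3.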

5 Example

We consider a data set described and analyzed by univariate methods in Reimann et al. (1998). From 1992 to 1998, the Geological Surveys of Finland (GTK) and Norway (NGU) and the Central Kola Expedition (CKE), Russia, carried out a large multi-element geochemical mapping project covering an area of 188,000 km² between 24° and 35.5°E, up to the Barents Sea coast. One of the sample media was the C-horizon of podzol profiles, developed on glacial drift. C-horizon samples were taken at 605 sites, and the contents of more than 50 chemical elements were measured for all samples. Although the project was mainly designed to reveal the environmental conditions in the area, the C-horizon was sampled to reflect the geogenic background.

In the following we apply the FA approach based on alternating regressions (Section 4). Robust PCA and MCD-based FA were used in Filzmoser (1999) for the upper layer, humus, of the complete data set. For the investigation of the C-horizon data we only considered the elements Ag, Al, As, Ba, Bi, Ca, Cd, Co, Cr, Cu, Fe, K, Mg, Mn, Na, Ni, P, Pb, S, Si, Sr, Th, V and Zn. These variables were transformed to a logarithmic scale to give a better approximation to the normal distribution. To put everything on a common scale, we first standardized the variables (robustly) to mean zero and variance one.

We analyze the data using both non-robust least squares (LS) regression and robust weighted L1-regression in the alternating regression scheme. We decided to extract 6 factors, which results in a proportion of explained total variance of 75% in both cases. The loadings of factors F1 to F6 are shown in Figure 1. To avoid clutter, only elements with absolute loadings larger than 0.3 are printed; the percentage of explained variance is given at the top of the plots. Figure 1 shows that for the first factor F1 there is just a slight difference between the non-robust (a) and the robust (b) method.
However, for the subsequent factors this difference grows; in particular, the loadings of factors F4 and F6 change strongly. It is also interesting to inspect the factor scores, which are estimated directly by our method. Because of space limitations we only show the scores of the second factor F2 (Figure 2), which is interesting because it nicely reflects the distribution of alkaline intrusions in the survey area. Figure 2 shows the whole region under consideration. The dark lines are the borders of the countries Russia (east), Norway (north-west), and Finland (south-west); the gray lines show rivers and the coast. At first glance the two results presented in Figure 2, the non-robust (a) and the robust (b) scores of factor F2, seem to be very similar. But already the ranges of the estimated scores differ: [−3.08, 4.98] for the non-robust and [−3.82, 5.13] for the robust method (in the maps we used the same scaling). The smaller range is typical for LS-based methods, because they try to fit all data points, including the outliers. Robust methods fit the majority of good data points, which leads to a reliable estimation. As a consequence, the regions with high and low outliers are represented more reliably by the robust method. In the maps, the two uppermost classes (crosses) mark areas which are underlain by alkaline bedrocks. The anomalies in the factor maps are much more prominent than the intrusions themselves in a geological map. The reason is that the emplacement of the intrusions was accompanied by the movement of large amounts of hydrothermal fluids, which changed the chemical composition of the intruded bedrocks. The map thus reflects the alteration haloes of these intrusions and demonstrates the importance of this geological process for a very large region.

References

CROUX, C., FILZMOSER, P., PISON, G., and ROUSSEEUW, P. J. (1999): Fitting Factor Models by Robust Interlocking Regression. Preprint, Vienna University of Technology.

CROUX, C. and RUIZ-GAZEN, A. (1996): A Fast Algorithm for Robust Principal Components Based on Projection Pursuit, in Prat (Ed.): Proceedings in Computational Statistics, Physica-Verlag, Heidelberg.

FILZMOSER, P. (1999): Robust Principal Component and Factor Analysis in the Geostatistical Treatment of Environmental Data. Environmetrics, 10, 363-375.

LI, G. and CHEN, Z. (1985): Projection-Pursuit Approach to Robust Dispersion Matrices and Principal Components: Primary Theory and Monte Carlo. J. Amer. Statist. Assoc., 80, 759-766.

PISON, G., ROUSSEEUW, P. J., FILZMOSER, P., and CROUX, C. (1999): Robust Factor Analysis. Preprint, Vienna University of Technology.

REIMANN, C., ÄYRÄS, M., CHEKUSHIN, V., BOGATYREV, I., BOYD, R., CARITAT, P. DE, DUTTER, R., FINNE, T. E., HALLERAKER, J. H., JÆGER, Ø., KASHULINA, G., LEHTO, O., NISKAVAARA, H., PAVLOV, V., RÄISÄNEN, M. L., STRAND, T., and VOLDEN, T.
(1998): Environmental Geochemical Atlas of the Central Barents Region. Geological Survey of Norway (NGU), Geological Survey of Finland (GTK), and Central Kola Expedition (CKE), Special Publication, Trondheim, Espoo, Monchegorsk.

ROUSSEEUW, P. J. (1985): Multivariate Estimation with High Breakdown Point, in Grossmann et al. (Eds.): Mathematical Statistics and Applications, Vol. B, Akadémiai Kiadó, Budapest.

ROUSSEEUW, P. J. and VAN DRIESSEN, K. (1999): A Fast Algorithm for the Minimum Covariance Determinant Estimator. Technometrics, 41, 212-223.

Figure 1: Loadings of the alternating regression based FA method using (a) LS-regression and (b) weighted L1-regression.

Figure 2: Scores of the second factor of the alternating regression based FA method using (a) LS-regression and (b) weighted L1-regression.