Principal Component Analysis

Mark Richardson
May 2009

Contents
1 Introduction
2 An Example From Multivariate Data Analysis
3 The Technical Details Of PCA
4 The Singular Value Decomposition
5 Image Compression Using PCA
6 Blind Source Separation
7 Conclusions
8 Appendix - MATLAB

1 Introduction

Principal Component Analysis (PCA) is the general name for a technique which uses sophisticated underlying mathematical principles to transform a number of possibly correlated variables into a smaller number of variables called principal components. The origins of PCA lie in multivariate data analysis; however, it has a wide range of other applications, as we will show in due course. PCA has been called "one of the most important results from applied linear algebra" [2], and perhaps its most common use is as the first step in trying to analyse large data sets. Some of the other common applications include de-noising signals, blind source separation and data compression.

In general terms, PCA uses a vector space transform to reduce the dimensionality of large data sets. Using mathematical projection, the original data set, which may have involved many variables, can often be interpreted in just a few variables (the principal components). It is therefore often the case that an examination of the reduced dimension data set will allow the user to spot trends, patterns and outliers in the data far more easily than would have been possible without performing the principal component analysis.

The aim of this essay is to explain the theoretical side of PCA and to provide examples of its application. We will begin with a non-rigorous motivational example from multivariate data analysis, in which we will attempt to extract some meaning from a 17 dimensional data set. After this motivational example, we shall discuss the PCA technique in terms of its linear algebra fundamentals. This will lead us to a method for implementing PCA for real-world data, and we will see that there is a close connection between PCA and the singular value decomposition (SVD) from numerical linear algebra. We will then look at two further examples of PCA in practice: Image Compression and Blind Source Separation.

2 An Example From Multivariate Data Analysis

In this section, we will examine some real life multivariate data in order to explain, in simple terms, what PCA achieves. We will perform a principal component analysis of this data and examine the results, though we will skip over the computational details for now.

Suppose that we are examining the following DEFRA (Department for Environment, Food and Rural Affairs) data showing the consumption in grams (per person, per week) of 17 different types of foodstuff measured and averaged in the four countries of the United Kingdom in 1997. We shall say that the 17 food types are the variables and the 4 countries are the observations. A cursory glance over the numbers in Table 1 does not reveal much; indeed, in general it is difficult to extract meaning from any given array of numbers. Given that this is actually a relatively small data set, we see that a powerful analytical method is absolutely necessary if we wish to observe trends and patterns in larger data.

Table 1: UK food consumption in 1997 (g/person/week). Source: DEFRA website. Columns: England, Wales, Scotland, N Ireland. Rows: Cheese, Carcass meat, Other meat, Fish, Fats and oils, Sugars, Fresh potatoes, Fresh Veg, Other Veg, Processed potatoes, Processed Veg, Fresh fruit, Cereals, Beverages, Soft drinks, Alcoholic drinks, Confectionery.

We need some way of making sense of the above data. Are there any trends present which are not obvious from glancing at the array of numbers? Traditionally, we would use a series of bivariate plots (scatter diagrams) and analyse these to try and determine any relationships between variables; however, the number of such plots required for such a task is typically O(n^2), where n is the number of variables. Clearly, for large data sets, this is not feasible. PCA generalises this idea and allows us to perform such an analysis simultaneously for many variables.

In our example above, we have 17 dimensional data for 4 countries. We can thus imagine plotting the 4 coordinates representing the 4 countries in 17 dimensional space. If there is any correlation between the observations (the countries), this will be observed in the 17 dimensional space by the correlated points being clustered close together, though of course, since we cannot visualise such a space, we are not able to see such clustering directly.

The first task of PCA is to identify a new set of orthogonal coordinate axes through the data. This is achieved by finding the direction of maximal variance through the coordinates in the 17 dimensional space. It is equivalent to obtaining the (least-squares) line of best fit through the plotted data. We call this new axis the first principal component of the data. Once this first principal component has been obtained, we can use orthogonal projection (in linear algebra and functional analysis, a projection is defined as a linear transformation, P, that maps from a given vector space to the same vector space and is such that P^2 = P) to map the coordinates down onto this new axis. In our food example above, the four 17 dimensional coordinates are projected down onto the first principal component to obtain the representation in Figure 1.

Figure 1: Projections onto the first principal component (1-D space), with the points labelled Eng, Wal, Scot and N Ire along the PC1 axis

This type of diagram is known as a score plot. Already, we can see that there are two potential clusters forming, in the sense that England, Wales and Scotland seem to be close together at one end of the principal component, whilst Northern Ireland is positioned at the opposite end of the axis.

The PCA method then obtains a second principal coordinate (axis) which is both orthogonal to the first PC, and is the next best direction for approximating the original data (i.e. it finds the direction of second largest variance in the data, chosen from directions which are orthogonal to the first principal component). We now have two orthogonal principal components defining a plane onto which, similarly to before, we can project our coordinates. This is shown in the 2 dimensional score plot in Figure 2. Notice that the inclusion of the second principal component has highlighted variation between the dietary habits present in England, Scotland and Wales.

Figure 2: Projections onto the first 2 principal components (2-D space), with axes PC1 and PC2

As part of the PCA method (which will be explained in detail later), we automatically obtain information about the contributions of each principal component to the total variance of the coordinates. In fact, in this case approximately 67% of the variance in the data is accounted for by the first principal component, and approximately 97% is accounted for in total by the first two principal components. We have therefore accounted for the vast majority of the variation in the data using a two dimensional plot - a dramatic reduction in dimensionality from seventeen dimensions to two.
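As a rough sketch of how these score-plot coordinates can be computed (the full script used for this section is reproduced in the Appendix), suppose that the Table 1 values have been entered into a 17 x 4 matrix X, with one row per food type and one column per country; the matrix X and the plotting details here are assumptions for illustration, not the essay's own script.

% Sketch, assuming X holds the Table 1 data (17 food types x 4 countries).
[m, n] = size(X);
Xc = X - repmat(mean(X,2), 1, n);         % centre each variable (row)
[E, D] = eig(Xc*Xc'/(n-1));               % eigen-decomposition of the covariance matrix
[lambda, idx] = sort(diag(D), 'descend'); % variances, largest first
E = E(:, idx);                            % principal component directions, most important first
scores = E(:, 1:2)'*Xc;                   % the coordinates plotted in Figures 1 and 2
plot(scores(1,:), scores(2,:), '.', 'markersize', 15)
xlabel('PC1'), ylabel('PC2')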

In practice, it is usually sufficient to include enough principal components so that somewhere in the region of 70-80% of the variation in the data is accounted for [3]. This information can be summarised in a plot of the variances (nonzero eigenvalues) with respect to the principal component number (eigenvector number), which is given in Figure 3, below.

Figure 3: Eigenspectrum (eigenvalue against eigenvector number)

We can also consider the influence of each of the original variables upon the principal components. This information can be summarised in the following plot, in Figure 4.

Figure 4: Load plot (effect(PC1) against effect(PC2)), showing the positions of the 17 food variables

Observe that there is a central group of variables around the middle of each principal component, with four variables on the periphery that do not seem to be part of the group. Recall the 2D score plot (Figure 2), on which England, Wales and Scotland were clustered together, whilst Northern Ireland was the country that was away from the cluster. Perhaps there is some association to be made between the variables that are away from the cluster in Figure 4 and the country that is located away from the rest of the countries in Figure 2, Northern Ireland. A look at the original data in Table 1 reveals that for the three variables Fresh potatoes, Alcoholic drinks and Fresh fruit, there is a noticeable difference between the values for England, Wales and Scotland, which are roughly similar, and Northern Ireland, which is usually significantly higher or lower.

PCA is able to make these associations for us. It has also successfully managed to reduce the dimensionality of our data set down from 17 to 2, allowing us to assert (using Figure 2) that England, Wales and Scotland are similar, with Northern Ireland being different in some way. Furthermore, using Figure 4 we were able to associate certain food types with each cluster of countries.
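The percentages quoted above come straight from the eigenvalues (variances). A small sketch, assuming lambda is the sorted vector of eigenvalues computed in the previous snippet:

% Proportion of the total variance captured by each principal component.
explained  = lambda/sum(lambda);   % approx. 0.67 for PC1 in the food example
cumulative = cumsum(explained);    % approx. 0.97 for the first two PCs
bar(lambda)                        % the eigenspectrum of Figure 3
xlabel('eigenvector number'), ylabel('eigenvalue')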

3 The Technical Details Of PCA

The principal component analysis for the example above took a large set of data and identified an optimal new basis in which to re-express the data. This mirrors the general aim of the PCA method: can we obtain another basis that is a linear combination of the original basis and that re-expresses the data optimally? There are some ambiguous terms in this statement, which we shall address shortly; however, for now let us frame the problem in the following way.

Assume that we start with a data set that is represented in terms of an m x n matrix, X, where the n columns are the samples (e.g. observations) and the m rows are the variables. We wish to linearly transform this matrix, X, into another matrix, Y, also of dimension m x n, so that for some m x m matrix, P,

Y = PX

This equation represents a change of basis. If we consider the rows of P to be the row vectors p_1, p_2, ..., p_m, and the columns of X to be the column vectors x_1, x_2, ..., x_n, then this relation can be interpreted in the following way:

PX = P ( x_1  x_2  ...  x_n ) = [ p_1.x_1   p_1.x_2   ...   p_1.x_n
                                  p_2.x_1   p_2.x_2   ...   p_2.x_n
                                    ...       ...     ...     ...
                                  p_m.x_1   p_m.x_2   ...   p_m.x_n ] = Y

Note that p_i, x_j are in R^m, and so p_i.x_j is just the standard Euclidean inner (dot) product. This tells us that the original data, X, is being projected onto the rows of P. Thus, the rows of P, {p_1, p_2, ..., p_m}, are a new basis for representing the columns of X. The rows of P will later become our principal component directions.

We now need to address the issue of what this new basis should be; indeed, what is the best way to re-express the data in X - in other words, how should we define independence between principal components in the new basis?

Principal component analysis defines independence by considering the variance of the data in the original basis. It seeks to de-correlate the original data by finding the directions in which variance is maximised, and then uses these directions to define the new basis. Recall the definition of the variance of a random variable, Z, with mean mu:

sigma^2_Z = E[(Z - mu)^2]

Suppose we have a vector of n discrete measurements, r~ = (r~_1, r~_2, ..., r~_n), with mean mu_r~. If we subtract the mean from each of the measurements, then we obtain a translated set of measurements r = (r_1, r_2, ..., r_n) that has zero mean. Thus, the variance of these measurements is given by the relation

sigma^2_r = (1/(n-1)) r r^T

(It is in fact correct to divide through by a factor of n-1 rather than n, a fact which we shall not justify here, but which is discussed in [2].) If we have a second vector of n measurements, s = (s_1, s_2, ..., s_n), again with zero mean, then we can generalise this idea to obtain the covariance of r and s. Covariance can be thought of as a measure of how much two variables change together. Variance is thus a special case of covariance, obtained when the two variables are identical. For the zero-mean measurement vectors r and s, the covariance is

sigma^2_rs = (1/(n-1)) r s^T

We can now generalise this idea to our m x n data matrix, X. Recall that m was the number of variables, and n the number of samples. We can therefore think of this matrix, X, in terms of m row vectors, each of length n:

X = [ x_{1,1}  x_{1,2}  ...  x_{1,n}        [ x_1^T
      x_{2,1}  x_{2,2}  ...  x_{2,n}    =     x_2^T       in R^{m x n},   with x_i^T in R^n
        ...      ...    ...    ...             ...
      x_{m,1}  x_{m,2}  ...  x_{m,n} ]        x_m^T ]

Since we have a row vector for each variable, each of these vectors contains all the samples for one particular variable; so, for example, x_i is a vector of the n samples for the i-th variable. It therefore makes sense to consider the following matrix product:

C_X = (1/(n-1)) X X^T = (1/(n-1)) [ x_1 x_1^T   x_1 x_2^T   ...   x_1 x_m^T
                                    x_2 x_1^T   x_2 x_2^T   ...   x_2 x_m^T
                                       ...         ...      ...      ...
                                    x_m x_1^T   x_m x_2^T   ...   x_m x_m^T ]   in R^{m x m}

If we look closely at the entries of this matrix, we see that we have computed all the possible covariance pairs between the m variables. Indeed, on the diagonal entries we have the variances, and on the off-diagonal entries we have the covariances. This matrix is therefore known as the covariance matrix.

Now let us return to the original problem, that of linearly transforming the original data matrix using the relation Y = PX, for some matrix P. We need to decide upon some features that we would like the transformed matrix, Y, to exhibit, and somehow relate these to the features of the corresponding covariance matrix, C_Y.

Covariance can be considered to be a measure of how well correlated two variables are. The PCA method makes the fundamental assumption that the variables in the transformed matrix should be as uncorrelated as possible. This is equivalent to saying that the covariances of different variables in the matrix C_Y should be as close to zero as possible (covariance matrices are always positive definite or positive semi-definite). Conversely, large variance values interest us, since they correspond to interesting dynamics in the system (small variances may well be noise). We therefore have the following requirements for constructing the covariance matrix, C_Y:

1. Maximise the signal, measured by variance (maximise the diagonal entries)
2. Minimise the covariance between variables (minimise the off-diagonal entries)

We thus come to the conclusion that, since the minimum possible covariance is zero, we are seeking a diagonal matrix, C_Y. If we can choose the transformation matrix, P, in such a way that C_Y is diagonal, then we will have achieved our objective.

We now make the assumption that the vectors in the new basis, p_1, p_2, ..., p_m, are orthogonal (in fact, we additionally assume that they are orthonormal). Far from being restrictive, this assumption enables us to proceed by using the tools of linear algebra to find a solution to the problem. Consider the formula for the covariance matrix, C_Y, and our interpretation of Y in terms of X and P:

C_Y = (1/(n-1)) Y Y^T
    = (1/(n-1)) (PX)(PX)^T
    = (1/(n-1)) (PX)(X^T P^T)
    = (1/(n-1)) P (X X^T) P^T

i.e.  C_Y = (1/(n-1)) P S P^T,   where S = X X^T

Note that S is an m x m symmetric matrix, since (X X^T)^T = (X^T)^T (X)^T = X X^T. We now invoke the well known theorem from linear algebra that every square symmetric matrix is orthogonally (orthonormally) diagonalisable. That is, we can write

S = E D E^T

where E is an m x m orthonormal matrix whose columns are the orthonormal eigenvectors of S, and D is a diagonal matrix which has the eigenvalues of S as its (diagonal) entries. The rank, r, of S is the number of orthonormal eigenvectors that it has. If S turns out to be rank-deficient, so that r is less than the size, m, of the matrix, then we simply need to generate m - r additional orthonormal vectors to fill the remaining columns of E.

It is at this point that we make a choice for the transformation matrix, P. By choosing the rows of P to be the eigenvectors of S, we ensure that P = E^T and vice-versa. Thus, substituting this into our derived expression for the covariance matrix, C_Y, gives

C_Y = (1/(n-1)) P S P^T = (1/(n-1)) E^T (E D E^T) E

Now, since E is an orthonormal matrix, we have E^T E = I, where I is the m x m identity matrix. Hence, for this special choice of P, we have

C_Y = (1/(n-1)) D

A last point to note is that, with this method, we automatically gain information about the relative importance of each principal component from the variances. The largest variance corresponds to the first principal component, the second largest to the second principal component, and so on. This therefore gives us a method for organising the data in the diagonalisation stage. Once we have obtained the eigenvalues and eigenvectors of S = X X^T, we sort the eigenvalues in descending order and place them in this order on the diagonal of D. We then construct the orthonormal matrix, E, by placing the associated eigenvectors in the same order to form the columns of E (i.e. we place the eigenvector that corresponds to the largest eigenvalue in the first column, the eigenvector corresponding to the second largest eigenvalue in the second column, and so on).

We have therefore achieved our objective of diagonalising the covariance matrix of the transformed data. The principal components (the rows of P) are the eigenvectors of the covariance matrix, XX^T, and the rows are in order of importance, telling us how "principal" each principal component is.
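The whole of the above construction can be checked numerically in a few lines. The following sketch (not code taken from the essay) uses a randomly generated, mean-centred data matrix; it builds P from the eigenvectors of S = XX^T and confirms that the covariance of Y = PX is diagonal:

% Sketch: PCA by orthogonal diagonalisation of S = X*X'.
X = randn(5, 200);                        % example data: m = 5 variables, n = 200 samples
X = X - repmat(mean(X,2), 1, size(X,2));  % subtract the row means
[m, n] = size(X);
S = X*X';                                 % m x m symmetric matrix
[E, D] = eig(S);                          % S = E*D*E', with orthonormal E
[d, idx] = sort(diag(D), 'descend');      % order eigenvalues, largest first
E = E(:, idx);
P = E';                                   % rows of P are the principal components
Y = P*X;                                  % the data expressed in the new basis
CY = Y*Y'/(n-1);                          % equals D/(n-1), i.e. diagonal
offdiag = norm(CY - diag(diag(CY)))       % numerically zero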

4 The Singular Value Decomposition

In this section, we will examine how the well known singular value decomposition (SVD) from linear algebra can be used in principal component analysis. Indeed, we will show that the derivation of PCA in the previous section and the SVD are closely related. We will not derive the SVD, as it is a well established result and can be found in any good book on numerical linear algebra, such as [4].

Given A in R^{n x m}, not necessarily of full rank, a singular value decomposition of A is

A = U Sigma V^T

where

U in R^{n x n} is orthonormal,
Sigma in R^{n x m} is diagonal,
V in R^{m x m} is orthonormal.

In addition, the diagonal entries, sigma_i, of Sigma are non-negative and are called the singular values of A. They are ordered such that the largest singular value, sigma_1, is placed in the (1,1) entry of Sigma, and the other singular values are placed in order down the diagonal; they satisfy sigma_1 >= sigma_2 >= ... >= sigma_p >= 0, where p = min(n, m). Note that we have reversed the row and column indices in defining the SVD from the way they were defined in the derivation of PCA in the previous section. The reason for doing this will become apparent shortly.

The SVD can be considered to be a general method for understanding change of basis, as can be illustrated by the following argument (which follows [4]). Since U in R^{n x n} and V in R^{m x m} are orthonormal matrices, their columns form bases for, respectively, the vector spaces R^n and R^m. Therefore, any vector b in R^n can be expanded in the basis formed by the columns of U (also known as the left singular vectors of A), and any vector x in R^m can be expanded in the basis formed by the columns of V (also known as the right singular vectors of A). The coefficient vectors for these expansions, b^ and x^, are given by

b^ = U^T b   and   x^ = V^T x

Now, if the relation b = Ax holds, then we can infer the following:

U^T b = U^T A x
   b^ = U^T (U Sigma V^T) x
   b^ = Sigma x^

Thus, the SVD allows us to assert that every matrix is diagonal, so long as we choose the appropriate bases for the domain and range spaces.

How does this link in to the previous analysis of PCA? Consider the n x m matrix, A, for which we have a singular value decomposition, A = U Sigma V^T. There is a theorem from linear algebra which says that the non-zero singular values of A are the square roots of the non-zero eigenvalues of AA^T or A^T A. The assertion, for the case A^T A, is proven in the following way:

A^T A = (U Sigma V^T)^T (U Sigma V^T) = (V Sigma^T U^T)(U Sigma V^T) = V (Sigma^T Sigma) V^T

We observe that A^T A is similar to Sigma^T Sigma, and thus it has the same eigenvalues. Since Sigma^T Sigma is a square (m x m), diagonal matrix, the eigenvalues are in fact the diagonal entries, which are the squares of the singular values. Note that the non-zero eigenvalues of the two matrices AA^T and A^T A are actually identical.

It should also be noted that we have effectively performed an eigenvalue decomposition of the matrix A^T A. Indeed, since A^T A is symmetric, this is an orthogonal diagonalisation, and thus the eigenvectors of A^T A are the columns of V. This will be important in making the practical connection between the SVD and the PCA of the matrix X, which is what we will do next.

Returning to the original m x n data matrix, X, let us define a new n x m matrix, Z:

Z = (1/sqrt(n-1)) X^T

Recall that, since the m rows of X contained the n data samples, we subtracted the row average from each entry to ensure zero mean across the rows. Thus, the new matrix, Z, has columns with zero mean. Consider forming the m x m matrix Z^T Z:

Z^T Z = ( (1/sqrt(n-1)) X^T )^T ( (1/sqrt(n-1)) X^T ) = (1/(n-1)) X X^T

i.e.  Z^T Z = C_X

We find that defining Z in this way ensures that Z^T Z is equal to the covariance matrix of X, C_X. From the discussion in the previous section, the principal components of X (which is what we are trying to identify) are the eigenvectors of C_X. Therefore, if we perform a singular value decomposition of the matrix Z^T Z, the principal components will be the columns of the orthogonal matrix V.

The last step is to relate the SVD of Z^T Z back to the change of basis represented by the relation Y = PX. We wish to project the original data onto the directions described by the principal components. Since we have the relation V = P^T, this is simply

Y = V^T X

If we wish to recover the original data, we simply compute (using the orthogonality of V)

X = V Y
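A corresponding sketch of this SVD route (again not code from the essay itself) shows the pipeline end to end: form Z, take the SVD of Z^T Z, project, and then invert the change of basis to recover the data exactly.

% Sketch: PCA via the SVD of Z'*Z, for a data matrix X with zero-mean rows.
X = randn(5, 200);                        % example data: m = 5 variables, n = 200 samples
X = X - repmat(mean(X,2), 1, size(X,2));  % subtract the row means
[m, n] = size(X);
Z = X'/sqrt(n-1);                         % n x m matrix, so that Z'*Z = C_X
[U, S, V] = svd(Z'*Z);                    % columns of V are the principal components
variances = diag(S);                      % eigenvalues of C_X, largest first
Y = V'*X;                                 % projection onto the principal components
Xrec = V*Y;                               % X recovered, since V is orthogonal
err = norm(X - Xrec)                      % numerically zero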

5 Image Compression Using PCA

In the previous section, we developed a method for principal component analysis which utilised the singular value decomposition of an m x m matrix Z^T Z, where Z = (1/sqrt(n-1)) X^T and X was an m x n data matrix. Since Z^T Z is in R^{m x m}, the matrix V obtained in the singular value decomposition of Z^T Z must also be of dimensions m x m. Recall also that the columns of V are the principal component directions, and that the SVD automatically sorts these components in decreasing order of importance, or "principality", so that the most principal component is the first column of V.

Suppose that, before projecting the data using the relation Y = V^T X, we were to truncate the matrix V so that we kept only the first r < m columns. We would thus have a matrix V~ in R^{m x r}. The projection Y~ = V~^T X is still dimensionally consistent, and the result of the product is a matrix Y~ in R^{r x n}. Suppose that we then wished to transform this data back to the original basis by computing X~ = V~ Y~. We thereby recover the dimensions of the original data matrix and obtain X~ in R^{m x n}.

The matrices X and X~ are of the same dimensions, but they are not the same matrix, since we truncated the matrix of principal components V in order to obtain X~. It is therefore reasonable to conclude that the matrix X~ has, in some sense, less information in it than the matrix X. Of course, in terms of memory allocation on a computer, this is certainly not the case, since both matrices have the same dimensions and would therefore be allotted the same amount of memory. However, the matrix X~ can be computed as the product of two smaller matrices (V~ and Y~). This, together with the fact that the important information in the matrix is captured by the first principal components, suggests a possible method for image compression.

During the subsequent analysis, we shall work with a standard test image that is often used in image processing and image compression. It is a greyscale picture of a butterfly, and is displayed in Figure 5. We will use MATLAB to perform the following analysis, though the principles can be applied in other computational packages.

Figure 5: The Butterfly greyscale test image

MATLAB considers greyscale images as objects consisting of two components: a matrix of pixels, and a colourmap. The Butterfly image above is stored in a 512 x 512 matrix (and therefore has this number of pixels). The colourmap is a 512 x 3 matrix. For RGB colour images, each image can be stored as a single three dimensional matrix, where the third dimension stores three numbers in the range [0, 1] corresponding to each pixel,

representing the intensity of the red, green and blue components. For a greyscale image such as the one we are dealing with, the colourmap matrix has three identical columns, with a scale representing intensity on the one dimensional grey scale. Each element of the pixel matrix contains a number representing a certain intensity of grey scale for an individual pixel. MATLAB displays all of the pixels simultaneously with the correct intensity, and the greyscale image that we see is produced.

The matrix containing the pixel information is our data matrix, X. We will perform a principal component analysis of this matrix, using the SVD method outlined above. The steps involved are exactly as described above and are summarised in the following MATLAB code.

[fly,map] = imread('butterfly.gif');  % load image into MATLAB
fly = double(fly);                    % convert to double precision
image(fly), colormap(map);            % display image
axis off, axis equal
[m,n] = size(fly);
mn = mean(fly,2);                     % compute row mean
X = fly - repmat(mn,1,n);             % subtract row mean to obtain X
Z = 1/sqrt(n-1)*X';                   % create matrix, Z
covZ = Z'*Z;                          % covariance matrix of X, equal to Z'*Z
%% Singular value decomposition
[U,S,V] = svd(covZ);
variances = diag(S).*diag(S);         % compute variances
bar(variances(1:30))                  % scree plot of variances
%% Extract first 40 principal components
PCs = 40;
VV = V(:,1:PCs);
Y = VV'*X;                            % project data onto PCs
ratio = 512/(2*PCs+1);                % compression ratio
XX = VV*Y;                            % convert back to original basis
XX = XX + repmat(mn,1,n);             % add the row means back on
image(XX), colormap(map), axis off;   % display results

Figure 6: MATLAB code for image compression PCA

In this case, we have chosen to use the first 40 (out of 512) principal components. What compression ratio does this equate to? To answer this question, we need to compare the amount of data we would have needed to store previously with what we can now store. Without compression, we would still have our 512 x 512 matrix to store. After selecting the first 40 principal components, we have the two matrices V~ and Y~ (VV and Y in the above MATLAB code) from which we can obtain the pixel matrix by computing the matrix product. Matrix V~ is 512 x 40, whilst matrix Y~ is 40 x 512. There is also one more matrix that we must use if we wish to display our image - the vector of means which we add back on after converting back to the original basis (this is just a 512 x 1 matrix, which we can later copy into a larger matrix to add to X~). We have therefore reduced the number of columns needed from 512 to 2 x 40 + 1 = 81, and the compression ratio is then calculated in the following way:

512 : 81,   i.e. approximately 6.3 : 1 compression

A decent ratio, it seems; however, what does the compressed image look like? The image for 40 principal components (6.3:1 compression) is displayed in Figure 7.

Figure 7: 40 principal components (6.3:1 compression)

The loss in quality is evident (after all, this is lossy compression, as opposed to lossless compression); however, considering the compression ratio, the trade off seems quite good. Let's look next at the eigenspectrum, in Figure 8.

Figure 8: Eigenspectrum (first 20 eigenvalues)

The first principal component accounts for just over 50% of the variance, the first two account for 69.8%, and the first six for 93.8%. This type of plot is not so informative here, as accounting for 93.8% of the variance in the data does not correspond to us seeing a clear image, as is shown in Figure 9.

Figure 9: Image compressed using 6 principal components

On the next page, in Figure 10, a selection of images is shown with an increasing number of principal components retained. In Table 2, the cumulative proportion of variance contributed by the leading eigenvalues is displayed.

Table 2: Cumulative variance accounted for by PCs (eigenvector number against cumulative proportion of variance)

Figure 10: The visual effect of retaining principal components. The twelve panels show, in order: 2 principal components (102.4:1 compression), 6 (39.4:1), 10 (24.4:1), 14 (17.7:1), 20 (12.5:1), 30 (8.4:1), 40 (6.3:1), 60 (4.2:1), 90 (2.8:1), 120 (2.1:1), 150 (1.7:1) and 180 principal components (1.4:1 compression).
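These compression ratios all come from the same counting argument as before: retaining k principal components of the 512 x 512 image means storing a 512 x k matrix, a k x 512 matrix and the 512 x 1 vector of row means, i.e. the equivalent of 2k + 1 columns instead of 512. A short sketch (not part of the original essay) that reproduces the quoted ratios:

% Compression ratios for the component counts shown in Figure 10,
% using ratio = 512/(2k+1) as derived in the text.
k = [2 6 10 14 20 30 40 60 90 120 150 180];
ratio = 512./(2*k + 1);
disp([k; ratio])        % e.g. k = 40 gives approximately 6.3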

6 Blind Source Separation

The final application of PCA in this report is motivated by the "cocktail party problem", a diagram of which is displayed in Figure 11. Imagine we have N people at a cocktail party. The N people are all speaking at once, resulting in a mixture of all the voices. Suppose that we wish to obtain the individual monologues from this mixture - how would we go about doing this?

Figure 11: The cocktail party problem (image courtesy of Gari Clifford, MIT)

The room has been equipped with exactly n microphones, spread around at different points in the room. Each microphone thus records a slightly different version of the combined signal, together with some random noise. By analysing these n combined signals using PCA, it is possible both to de-noise the group signal and to separate out the original sources. A formal statement of this problem is:

- Matrix Z in R^{m x n} consists of m samples of n independent sources
- The signals are mixed together linearly using a matrix A in R^{n x n}
- The matrix of observations is represented as the product X^T = AZ^T
- We attempt to demix the observations by finding W in R^{n x n} such that Y^T = WX^T
- The hope is that Y is approximately Z, and thus W is approximately A^{-1}

These points reflect the assumptions of the blind source separation (BSS) problem:

1. The mixture of source signals must be linear
2. The source signals are independent
3. The mixture (the matrix A) is stationary (constant)
4. The number of observations (microphones) is the same as the number of sources

In order to use PCA for BSS, we need to define independence in terms of the variance of the signals. In analogy with the previous examples and the discussion of PCA in Section 3, we assume that we will be able to de-correlate the individual signals by finding the (orthogonal)

directions of maximal variance for the matrix of observations, X. It is therefore possible to again use the SVD for this analysis. Consider the skinny SVD of X in R^{m x n}:

X = U Sigma V^T,   where U in R^{m x n}, Sigma in R^{n x n}, V in R^{n x n}

Comparing the skinny SVD with the full SVD: in both cases, the n x n matrix V is the same. Assuming that m >= n, the diagonal matrix of singular values, Sigma, is square (n x n) in the skinny case and rectangular (m x n) in the full case, with the additional m - n rows being "ghost" rows (i.e. having entries that are all zero). The first n columns of the matrix U in the full case are identical to the n columns of the skinny-case U, with the additional m - n columns being arbitrary orthogonal appendments.

Recall that we are trying to approximate the original signals matrix (Z ~ Y) by trying to find a matrix (W ~ A^{-1}) such that

Y^T = W X^T

This matrix is obtained by rearranging the equation for the skinny SVD:

X = U Sigma V^T
X^T = V Sigma^T U^T
U^T = Sigma^{-T} V^T X^T

Thus, we identify our approximation to the de-mixing matrix as W = Sigma^{-T} V^T, and our de-mixed signals are therefore the columns of the matrix U. Note that, since the matrix Sigma is square and diagonal, Sigma^{-T} = Sigma^{-1}, which is computed by simply taking the reciprocal of each diagonal entry of Sigma. However, if we are using the SVD method, it is not necessary to worry about explicitly calculating the matrix W, since the SVD automatically delivers us the de-mixed signals.

To illustrate this, consider constructing the matrix Z in R^{2001 x 3} consisting of the following three signals (the columns), sampled at 2001 equispaced points on the interval [0, 20] (the rows).

Figure 12: Three input signals for the BSS problem: sin(x), 2 mod(x/10, 1) and 0.5 sign(sin(pi x/4))

We now construct a 3 x 3 mixing matrix, A, by randomly perturbing the 3 x 3 identity matrix. The MATLAB code A=round((eye(M)+0.1*randn(3,3))*10)/10 achieves this. To demonstrate, an example mixing matrix could therefore be a 3 x 3 matrix whose entries are all within about 0.1 of those of the identity. To simulate the cocktail party, we will also add some noise to the signals before the mix (for this, we use the MATLAB code Z=Z+0.02*randn(2001,3)). After adding this noise and mixing the signals to obtain X via X^T = AZ^T, we obtain the following mixture of signals:

Figure 13: A mixture of the three input signals, with noise added

This mixing corresponds to each of the three microphones in the example above being close to a unique guest of the cocktail party, and therefore predominantly picking up what that individual is saying. We can now go ahead and perform a (skinny) singular value decomposition of the matrix X, using the command [u,s,v]=svd(X,0). Figure 14 shows the separated signals (the columns of U) plotted together, and individually.

Figure 14: The separated signals plotted together, and individually

We observe that the extracted signals are quite easily identifiable representations of the three input signals; however, note that the sine and sawtooth functions have been inverted, and that the scaling is different to that of the inputted signals. The performance of PCA for blind source separation is good in this case because the mixing matrix was close to the identity. If we generate normally distributed random numbers for the elements of the mixing matrix, we can get good results, but we can also get poor results. In the following figures, the upper left plot shows the three mixed signals and the subsequent 3 plots are the SVD extractions. Figures 15, 16 and 17 show examples of relatively good performance in extracting the original signals from the randomly mixed combination, whilst Figure 18 shows a relatively poor performance.

Figures 15, 16, 17 and 18: four randomly mixed combinations (upper left of each) and the corresponding SVD extractions
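The de-mixing matrix itself is never formed in the essay's code, since the columns of U already contain the separated signals. As a sketch (assuming X is the 2001 x 3 matrix of noisy, mixed observations constructed above), forming W = Sigma^{-T} V^T explicitly and applying it gives exactly the same result:

% Sketch: explicit de-mixing via the skinny SVD, X = U*S*V'.
% X is assumed to be the 2001x3 observation matrix from the example above.
[U, S, V] = svd(X, 0);     % skinny SVD
W = inv(S')*V';            % de-mixing matrix W = S^(-T)*V', as derived in the text
Y = (W*X')';               % recovered signals
check = norm(Y - U)        % numerically zero: Y is exactly the columns of U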

7 Conclusions

My aim in writing this article was that somebody with a similar level of mathematical knowledge as myself (i.e. early graduate level) would be able to gain a good introductory understanding of PCA by reading this essay. I hope that they would understand that it is a diverse tool in data analysis, with many applications, three of which we have covered in detail here. I would also hope that they would gain a good understanding of the surrounding mathematics, and of the close link that PCA has with the singular value decomposition.

I embarked upon writing this essay with only one application in mind, that of Blind Source Separation. However, when it came to researching the topic in detail, I found that there were many interesting applications of PCA, and I identified dimensionality reduction in multivariate data analysis and image compression as being two of the most appealing alternative applications.

Though it is a powerful technique, with a diverse range of possible applications, it is fair to say that PCA is not necessarily the best way to deal with each of the sample applications that I have discussed. For the multivariate data analysis example, we were able to identify that the inhabitants of Northern Ireland were in some way different in their dietary habits to those of the other three countries in the UK. We were also able to identify particular food groups with the eating habits of Northern Ireland, yet we were limited in being able to make distinctions between the dietary habits of the English, Scottish and Welsh. In order to explore this avenue, it would perhaps be necessary to perform a similar analysis on just those three countries.

Image compression (and more generally, data compression) is by now a mature field, and there are many sophisticated technologies available that perform this task. JPEG is an obvious and comparable example that springs to mind (JPEG can also involve lossy compression). JPEG utilises the discrete cosine transform to convert the image to a frequency-domain representation and generally achieves much higher quality for similar compression ratios when compared to PCA. Having said this, PCA is a nice technique in its own right for implementing image compression, and it is pleasing to find such a neat implementation.

As we saw in the last example, Blind Source Separation can cause problems for PCA under certain circumstances. PCA will not be able to separate the individual sources if the signals are combined nonlinearly, and can produce spurious results even if the combination is linear. PCA will also fail for BSS if the data is non-Gaussian. In this situation, a well known technique that works is called Independent Component Analysis (ICA). The main philosophical difference between the two methods is that PCA defines independence using variance, whilst ICA defines independence using statistical independence - it identifies the principal components by maximising the statistical independence between each of the components.

Writing the theoretical parts of this essay (Sections 3 and 4) was a very educational experience, and I was aided in doing this by the excellent paper by Jonathon Shlens, A Tutorial on Principal Component Analysis [2], and the famous book on Numerical Linear Algebra by Lloyd N. Trefethen and David Bau III [4]. However, the original motivation for writing this special topic came from the excellent lectures in Signals Processing delivered by Dr. I. Drobnjak and Dr. C. Orphanidou during Hilary term of 2008 at the Oxford University Mathematical Institute.

8 Appendix - MATLAB

Figure 19: MATLAB code: Data Analysis

X = [ ... ];                     % the 17 x 4 data matrix of Table 1 (values omitted here)
covmatrix = X*X';
data = X;
[M,N] = size(data);
mn = mean(data,2);
data = data - repmat(mn,1,N);
Y = data' / sqrt(N-1);
[u,s,PC] = svd(Y);
S = diag(s);
V = S .* S;
signals = PC' * data;

plot(signals(1,1),0,'b.',signals(1,2),0,'b.',...
     signals(1,3),0,'b.',signals(1,4),0,'r.','markersize',15)
xlabel('PC1')
text(signals(1,1)-25,.2,'Eng'), text(signals(1,2)-25,.2,'Wal'),
text(signals(1,3)-2,.2,'Scot'), text(signals(1,4)-3,.2,'N Ire')

plot(signals(1,1),signals(2,1),'b.',signals(1,2),signals(2,2),'b.',...
     signals(1,3),signals(2,3),'b.',signals(1,4),signals(2,4),'r.',...
     'markersize',15)
xlabel('PC1'), ylabel('PC2')
text(signals(1,1)+2,signals(2,1),'Eng')
text(signals(1,2)+2,signals(2,2),'Wal')
text(signals(1,3)+2,signals(2,3),'Scot')
text(signals(1,4)-6,signals(2,4),'N Ire')

plot(PC(1,1),PC(1,2),'m.',PC(2,1),PC(2,2),'m.',...
     PC(3,1),PC(3,2),'m.',PC(4,1),PC(4,2),'m.',...
     PC(5,1),PC(5,2),'m.',PC(6,1),PC(6,2),'m.',...
     PC(7,1),PC(7,2),'m.',PC(8,1),PC(8,2),'m.',...
     PC(9,1),PC(9,2),'m.',PC(10,1),PC(10,2),'m.',...
     PC(11,1),PC(11,2),'m.',PC(12,1),PC(12,2),'m.',...
     PC(13,1),PC(13,2),'m.',PC(14,1),PC(14,2),'m.',...
     PC(15,1),PC(15,2),'m.',PC(16,1),PC(16,2),'m.',...
     PC(17,1),PC(17,2),'m.','markersize',15)
xlabel('effect(PC1)'), ylabel('effect(PC2)')
text(PC(1,1),PC(1,2)-.1,'Cheese'), text(PC(2,1),PC(2,2)-.1,'Carcass meat')
text(PC(3,1),PC(3,2)-.1,'Other meat'), text(PC(4,1),PC(4,2)-.1,'Fish')
text(PC(5,1),PC(5,2)-.1,'Fats and oils'), text(PC(6,1),PC(6,2)-.1,'Sugars')
text(PC(7,1),PC(7,2)-.1,'Fresh potatoes')
text(PC(8,1),PC(8,2)-.1,'Fresh Veg')
text(PC(9,1),PC(9,2)-.1,'Other Veg')
text(PC(10,1),PC(10,2)-.1,'Processed potatoes')
text(PC(11,1),PC(11,2)-.1,'Processed Veg')
text(PC(12,1),PC(12,2)-.1,'Fresh fruit'),
text(PC(13,1),PC(13,2)-.1,'Cereals'), text(PC(14,1),PC(14,2)-.1,'Beverages')
text(PC(15,1),PC(15,2)-.1,'Soft drinks'),
text(PC(16,1),PC(16,2)-.1,'Alcoholic drinks')
text(PC(17,1),PC(17,2)-.1,'Confectionery')
%%
bar(V)
xlabel('eigenvector number'), ylabel('eigenvalue')
%%
t = sum(V); cumsum(V/t)

Figure 20: MATLAB code: Image Compression

clear all; close all; clc

[fly,map] = imread('butterfly.gif');
fly = double(fly);
whos

image(fly)
colormap(map)
axis off, axis equal

[m,n] = size(fly);
mn = mean(fly,2);
X = fly - repmat(mn,1,n);

Z = 1/sqrt(n-1)*X';
covZ = Z'*Z;

[U,S,V] = svd(covZ);

variances = diag(S).*diag(S);
bar(variances,'b')
xlim([0 20])
xlabel('eigenvector number')
ylabel('eigenvalue')

tot = sum(variances)
[[1:512]' cumsum(variances)/tot]

PCs = 40;
VV = V(:,1:PCs);
Y = VV'*X;
ratio = 512/(2*PCs+1)

XX = VV*Y;
XX = XX + repmat(mn,1,n);

image(XX)
colormap(map)
axis off, axis equal

z = 1;
for PCs = [2 6 10 14 20 30 40 60 90 120 150 180]
    VV = V(:,1:PCs);
    Y = VV'*X;
    XX = VV*Y;
    XX = XX + repmat(mn,1,n);
    subplot(4,3,z)
    z = z + 1;
    image(XX)
    colormap(map)
    axis off, axis equal
    title({[num2str(round(10*512/(2*PCs+1))/10) ':1 compression']; ...
           [int2str(PCs) ' principal components']})
end

Figure 21: MATLAB code: Blind Source Separation

clear all; close all; clc;

set(0,'defaultfigureposition',[ ... ],...          % position values not recoverable
    'defaultaxeslinewidth',.9,'defaultaxesfontsize',8,...
    'defaultlinelinewidth',1.0,'defaultpatchlinewidth',1.0,...
    'defaultlinemarkersize',5), format compact, format short

x = [0:.01:20]';
signalA = @(x) sin(x);
signalB = @(x) 2*mod(x/10,1);
signalC = @(x) sign(sin(.25*pi*x))*.5;
Z = [signalA(x) signalB(x) signalC(x)];

[N,M] = size(Z);
Z = Z + 0.02*randn(N,M);
[N,M] = size(Z);
A = round(10*randn(3,3))/10
X = A*Z';
X = X';

figure
subplot(2,2,1)
plot(X,'linewidth',2)
xlim([0,2001])
xlabel('x'), ylabel('y','rotation',0)

[u,s,v] = svd(X,0);

subplot(2,2,2)
plot(u(:,1),'b','linewidth',2)
xlim([0,2001])
xlabel('x'), ylabel('y','rotation',0)

subplot(2,2,3)
plot(u(:,2),'g','linewidth',2)
xlim([0,2001])
xlabel('x'), ylabel('y','rotation',0)

subplot(2,2,4)
plot(u(:,3),'r','linewidth',2)
xlim([0,2001])
xlabel('x'), ylabel('y','rotation',0)

References

[1] UMETRICS, Multivariate Data Analysis (MVA).
[2] Jonathon Shlens, A Tutorial on Principal Component Analysis.
[3] Soren Hojsgaard, Examples of multivariate analysis - Principal component analysis (PCA), Statistics and Decision Theory Research Unit, Danish Institute of Agricultural Sciences.
[4] Lloyd Trefethen & David Bau, Numerical Linear Algebra, SIAM.
[5] Signals and Systems Group, Uppsala University.
[6] Dr. I. Drobnjak, Oxford University MSc MMSC Signals Processing Lecture Notes (PCA/ICA).
[7] Gari Clifford, MIT, Blind Source Separation: PCA & ICA.


More information

Similar matrices and Jordan form

Similar matrices and Jordan form Similar matrices and Jordan form We ve nearly covered the entire heart of linear algebra once we ve finished singular value decompositions we ll have seen all the most central topics. A T A is positive

More information

Pearson s Correlation Coefficient

Pearson s Correlation Coefficient Pearson s Correlation Coefficient In this lesson, we will find a quantitative measure to describe the strength of a linear relationship (instead of using the terms strong or weak). A quantitative measure

More information

Polynomials. Jackie Nicholas Jacquie Hargreaves Janet Hunter

Polynomials. Jackie Nicholas Jacquie Hargreaves Janet Hunter Mathematics Learning Centre Polnomials Jackie Nicholas Jacquie Hargreaves Janet Hunter c 26 Universit of Sdne Mathematics Learning Centre, Universit of Sdne 1 1 Polnomials Man of the functions we will

More information

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering Overview Prognostic Models and Data Mining in Medicine, part I Cluster Analsis What is Cluster Analsis? K-Means Clustering Hierarchical Clustering Cluster Validit Eample: Microarra data analsis 6 Summar

More information

Linear Algebraic Equations, SVD, and the Pseudo-Inverse

Linear Algebraic Equations, SVD, and the Pseudo-Inverse Linear Algebraic Equations, SVD, and the Pseudo-Inverse Philip N. Sabes October, 21 1 A Little Background 1.1 Singular values and matrix inversion For non-smmetric matrices, the eigenvalues and singular

More information

x1 x 2 x 3 y 1 y 2 y 3 x 1 y 2 x 2 y 1 0.

x1 x 2 x 3 y 1 y 2 y 3 x 1 y 2 x 2 y 1 0. Cross product 1 Chapter 7 Cross product We are getting ready to study integration in several variables. Until now we have been doing only differential calculus. One outcome of this study will be our ability

More information

5.1. A Formula for Slope. Investigation: Points and Slope CONDENSED

5.1. A Formula for Slope. Investigation: Points and Slope CONDENSED CONDENSED L E S S O N 5.1 A Formula for Slope In this lesson ou will learn how to calculate the slope of a line given two points on the line determine whether a point lies on the same line as two given

More information

1 2 3 1 1 2 x = + x 2 + x 4 1 0 1

1 2 3 1 1 2 x = + x 2 + x 4 1 0 1 (d) If the vector b is the sum of the four columns of A, write down the complete solution to Ax = b. 1 2 3 1 1 2 x = + x 2 + x 4 1 0 0 1 0 1 2. (11 points) This problem finds the curve y = C + D 2 t which

More information

Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

More information

Factoring bivariate polynomials with integer coefficients via Newton polygons

Factoring bivariate polynomials with integer coefficients via Newton polygons Filomat 27:2 (2013), 215 226 DOI 10.2298/FIL1302215C Published b Facult of Sciences and Mathematics, Universit of Niš, Serbia Available at: http://www.pmf.ni.ac.rs/filomat Factoring bivariate polnomials

More information

Cluster Analysis Overview. Data Mining Techniques: Cluster Analysis. What is Cluster Analysis? What is Cluster Analysis?

Cluster Analysis Overview. Data Mining Techniques: Cluster Analysis. What is Cluster Analysis? What is Cluster Analysis? Cluster Analsis Overview Data Mining Techniques: Cluster Analsis Mirek Riedewald Man slides based on presentations b Han/Kamber, Tan/Steinbach/Kumar, and Andrew Moore Introduction Foundations: Measuring

More information

CHAPTER TEN. Key Concepts

CHAPTER TEN. Key Concepts CHAPTER TEN Ke Concepts linear regression: slope intercept residual error sum of squares or residual sum of squares sum of squares due to regression mean squares error mean squares regression (coefficient)

More information

EQUATIONS OF LINES IN SLOPE- INTERCEPT AND STANDARD FORM

EQUATIONS OF LINES IN SLOPE- INTERCEPT AND STANDARD FORM . Equations of Lines in Slope-Intercept and Standard Form ( ) 8 In this Slope-Intercept Form Standard Form section Using Slope-Intercept Form for Graphing Writing the Equation for a Line Applications (0,

More information

SECTION 7-4 Algebraic Vectors

SECTION 7-4 Algebraic Vectors 7-4 lgebraic Vectors 531 SECTIN 7-4 lgebraic Vectors From Geometric Vectors to lgebraic Vectors Vector ddition and Scalar Multiplication Unit Vectors lgebraic Properties Static Equilibrium Geometric vectors

More information

2.6. The Circle. Introduction. Prerequisites. Learning Outcomes

2.6. The Circle. Introduction. Prerequisites. Learning Outcomes The Circle 2.6 Introduction A circle is one of the most familiar geometrical figures and has been around a long time! In this brief Section we discuss the basic coordinate geometr of a circle - in particular

More information

f x a 0 n 1 a 0 a 1 cos x a 2 cos 2x a 3 cos 3x b 1 sin x b 2 sin 2x b 3 sin 3x a n cos nx b n sin nx n 1 f x dx y

f x a 0 n 1 a 0 a 1 cos x a 2 cos 2x a 3 cos 3x b 1 sin x b 2 sin 2x b 3 sin 3x a n cos nx b n sin nx n 1 f x dx y Fourier Series When the French mathematician Joseph Fourier (768 83) was tring to solve a problem in heat conduction, he needed to epress a function f as an infinite series of sine and cosine functions:

More information

Example 1: Model A Model B Total Available. Gizmos. Dodads. System:

Example 1: Model A Model B Total Available. Gizmos. Dodads. System: Lesson : Sstems of Equations and Matrices Outline Objectives: I can solve sstems of three linear equations in three variables. I can solve sstems of linear inequalities I can model and solve real-world

More information

α = u v. In other words, Orthogonal Projection

α = u v. In other words, Orthogonal Projection Orthogonal Projection Given any nonzero vector v, it is possible to decompose an arbitrary vector u into a component that points in the direction of v and one that points in a direction orthogonal to v

More information

Eigenvalues and Eigenvectors

Eigenvalues and Eigenvectors Chapter 6 Eigenvalues and Eigenvectors 6. Introduction to Eigenvalues Linear equations Ax D b come from steady state problems. Eigenvalues have their greatest importance in dynamic problems. The solution

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis ERS70D George Fernandez INTRODUCTION Analysis of multivariate data plays a key role in data analysis. Multivariate data consists of many different attributes or variables recorded

More information

Polynomial Degree and Finite Differences

Polynomial Degree and Finite Differences CONDENSED LESSON 7.1 Polynomial Degree and Finite Differences In this lesson you will learn the terminology associated with polynomials use the finite differences method to determine the degree of a polynomial

More information

1.6. Piecewise Functions. LEARN ABOUT the Math. Representing the problem using a graphical model

1.6. Piecewise Functions. LEARN ABOUT the Math. Representing the problem using a graphical model . Piecewise Functions YOU WILL NEED graph paper graphing calculator GOAL Understand, interpret, and graph situations that are described b piecewise functions. LEARN ABOUT the Math A cit parking lot uses

More information

Solving Quadratic Equations by Graphing. Consider an equation of the form. y ax 2 bx c a 0. In an equation of the form

Solving Quadratic Equations by Graphing. Consider an equation of the form. y ax 2 bx c a 0. In an equation of the form SECTION 11.3 Solving Quadratic Equations b Graphing 11.3 OBJECTIVES 1. Find an ais of smmetr 2. Find a verte 3. Graph a parabola 4. Solve quadratic equations b graphing 5. Solve an application involving

More information

6. The given function is only drawn for x > 0. Complete the function for x < 0 with the following conditions:

6. The given function is only drawn for x > 0. Complete the function for x < 0 with the following conditions: Precalculus Worksheet 1. Da 1 1. The relation described b the set of points {(-, 5 ),( 0, 5 ),(,8 ),(, 9) } is NOT a function. Eplain wh. For questions - 4, use the graph at the right.. Eplain wh the graph

More information

[1] Diagonal factorization

[1] Diagonal factorization 8.03 LA.6: Diagonalization and Orthogonal Matrices [ Diagonal factorization [2 Solving systems of first order differential equations [3 Symmetric and Orthonormal Matrices [ Diagonal factorization Recall:

More information

3. INNER PRODUCT SPACES

3. INNER PRODUCT SPACES . INNER PRODUCT SPACES.. Definition So far we have studied abstract vector spaces. These are a generalisation of the geometric spaces R and R. But these have more structure than just that of a vector space.

More information

that satisfies (2). Then (3) ax 0 + by 0 + cz 0 = d.

that satisfies (2). Then (3) ax 0 + by 0 + cz 0 = d. Planes.nb 1 Plotting Planes in Mathematica Copright 199, 1997, 1 b James F. Hurle, Universit of Connecticut, Department of Mathematics, Unit 39, Storrs CT 669-39. All rights reserved. This notebook discusses

More information

Common Core Unit Summary Grades 6 to 8

Common Core Unit Summary Grades 6 to 8 Common Core Unit Summary Grades 6 to 8 Grade 8: Unit 1: Congruence and Similarity- 8G1-8G5 rotations reflections and translations,( RRT=congruence) understand congruence of 2 d figures after RRT Dilations

More information

CHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES. From Exploratory Factor Analysis Ledyard R Tucker and Robert C.

CHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES. From Exploratory Factor Analysis Ledyard R Tucker and Robert C. CHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES From Exploratory Factor Analysis Ledyard R Tucker and Robert C MacCallum 1997 180 CHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES In

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS Systems of Equations and Matrices Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

More information

Trigonometry Review Workshop 1

Trigonometry Review Workshop 1 Trigonometr Review Workshop Definitions: Let P(,) be an point (not the origin) on the terminal side of an angle with measure θ and let r be the distance from the origin to P. Then the si trig functions

More information

Connecting Transformational Geometry and Transformations of Functions

Connecting Transformational Geometry and Transformations of Functions Connecting Transformational Geometr and Transformations of Functions Introductor Statements and Assumptions Isometries are rigid transformations that preserve distance and angles and therefore shapes.

More information

Linear Algebra Notes for Marsden and Tromba Vector Calculus

Linear Algebra Notes for Marsden and Tromba Vector Calculus Linear Algebra Notes for Marsden and Tromba Vector Calculus n-dimensional Euclidean Space and Matrices Definition of n space As was learned in Math b, a point in Euclidean three space can be thought of

More information

Au = = = 3u. Aw = = = 2w. so the action of A on u and w is very easy to picture: it simply amounts to a stretching by 3 and 2, respectively.

Au = = = 3u. Aw = = = 2w. so the action of A on u and w is very easy to picture: it simply amounts to a stretching by 3 and 2, respectively. Chapter 7 Eigenvalues and Eigenvectors In this last chapter of our exploration of Linear Algebra we will revisit eigenvalues and eigenvectors of matrices, concepts that were already introduced in Geometry

More information

arxiv:1404.1100v1 [cs.lg] 3 Apr 2014

arxiv:1404.1100v1 [cs.lg] 3 Apr 2014 A Tutorial on Principal Component Analysis Jonathon Shlens Google Research Mountain View, CA 94043 (Dated: April 7, 2014; Version 3.02) arxiv:1404.1100v1 [cs.lg] 3 Apr 2014 Principal component analysis

More information

Linear Equations in Two Variables

Linear Equations in Two Variables Section. Sets of Numbers and Interval Notation 0 Linear Equations in Two Variables. The Rectangular Coordinate Sstem and Midpoint Formula. Linear Equations in Two Variables. Slope of a Line. Equations

More information

Manifold Learning Examples PCA, LLE and ISOMAP

Manifold Learning Examples PCA, LLE and ISOMAP Manifold Learning Examples PCA, LLE and ISOMAP Dan Ventura October 14, 28 Abstract We try to give a helpful concrete example that demonstrates how to use PCA, LLE and Isomap, attempts to provide some intuition

More information

ONVIF Video Analytics Service Specification

ONVIF Video Analytics Service Specification ONVIF 1 Video Analtics. Ver. 2.2.1 ONVIF Video Analtics Service Specification Version 2.2.1 December, 2012 ONVIF 2 Video Analtics. Ver. 2.2.1 2008-2012 b ONVIF: Open Network Video Interface Forum Inc..

More information

Solving Systems of Linear Equations

Solving Systems of Linear Equations LECTURE 5 Solving Systems of Linear Equations Recall that we introduced the notion of matrices as a way of standardizing the expression of systems of linear equations In today s lecture I shall show how

More information

CHAPTER 10 SYSTEMS, MATRICES, AND DETERMINANTS

CHAPTER 10 SYSTEMS, MATRICES, AND DETERMINANTS CHAPTER 0 SYSTEMS, MATRICES, AND DETERMINANTS PRE-CALCULUS: A TEACHING TEXTBOOK Lesson 64 Solving Sstems In this chapter, we re going to focus on sstems of equations. As ou ma remember from algebra, sstems

More information

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information

( ) is proportional to ( 10 + x)!2. Calculate the

( ) is proportional to ( 10 + x)!2. Calculate the PRACTICE EXAMINATION NUMBER 6. An insurance company eamines its pool of auto insurance customers and gathers the following information: i) All customers insure at least one car. ii) 64 of the customers

More information

Linear algebra and the geometry of quadratic equations. Similarity transformations and orthogonal matrices

Linear algebra and the geometry of quadratic equations. Similarity transformations and orthogonal matrices MATH 30 Differential Equations Spring 006 Linear algebra and the geometry of quadratic equations Similarity transformations and orthogonal matrices First, some things to recall from linear algebra Two

More information

Graphing Linear Equations

Graphing Linear Equations 6.3 Graphing Linear Equations 6.3 OBJECTIVES 1. Graph a linear equation b plotting points 2. Graph a linear equation b the intercept method 3. Graph a linear equation b solving the equation for We are

More information