The ith principal component (PC) is the line that follows the eigenvector associated with the ith largest eigenvalue.

More Principal Components

Summary

Principal Components (PCs) are associated with the eigenvectors of either the covariance or the correlation matrix of the data. The ith principal component (PC) is the line that follows the eigenvector associated with the ith largest eigenvalue. The ith eigenvalue measures the variance in the direction of the ith principal component, and the ratio of that eigenvalue to the sum of all the eigenvalues is the proportion of variation explained by the ith PC. Normally, we're more interested in the cumulative proportion of variation explained by the first q PCs.

In practice, the first few PCs do the lion's share of explaining the variance. Eventually, the law of diminishing returns kicks in, and adding an additional PC brings very little improvement in the amount of variation explained. For this reason, PCs are often used as a dimension reduction technique: keep the first few PCs, discard the rest. How many to keep is up to you. The scree plot (see below) is sometimes helpful. In practice, people often like to use only the first two or three because they are easily visualized. If the first two capture most of the variation, then you can use a two-dimensional scatterplot to visualize a p-dimensional data set.

Features

The first PC is chosen so that it is aligned with the direction of maximum variance. The second is chosen to be uncorrelated with (orthogonal to) the first while capturing as much of the remaining variance as possible. And so on. The result is that the PCs are uncorrelated with each other (and, if the data are jointly normal, statistically independent). If the variables are normally distributed, then their cloud is a hyper-ellipse, and the PCs run along the axes of the ellipse.

The predicted values, also called scores, are found by multiplying each point by the eigenvector. This is the same as projecting each point onto the corresponding PC. The scores are, therefore, a linear combination of the initial variables. The eigenvectors give the recipe for creating scores: take this much of the first variable plus this much of the second, etc.
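The relationships described above (eigenvalues as variances, proportion of variation explained, scores as projections) can be checked numerically. The following is only a sketch: it assumes the USArrests data set that ships with R, uses the covariance matrix, and centers each variable before projecting (as princomp does). The variable names are illustrative.

```r
# Sketch only: USArrests ships with R; all names below are illustrative.
X <- scale(USArrests, center = TRUE, scale = FALSE)  # center each variable
S <- cov(X)                                          # covariance matrix
e <- eigen(S)                         # eigenvalues are returned in decreasing order

prop_var <- e$values / sum(e$values)  # proportion of variation explained by each PC
cum_var  <- cumsum(prop_var)          # cumulative proportion for the first q PCs

scores <- X %*% e$vectors             # project each centered point onto the PCs

# The variance of the i-th column of scores equals the i-th eigenvalue,
# and different columns of scores are uncorrelated.
score_var <- apply(scores, 2, var)
score_cor <- cor(scores)

print(round(cum_var, 3))
```

Each column of e$vectors is one "recipe": its entries are the weights applied to the centered variables to produce that PC's score.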
The components of the eigenvectors are sometimes called the loadings.

Correlation or Covariance?

If one variable has a much larger variance than the others, then it will tend to dominate the first principal component. Sometimes this variation is just an artifact of the units chosen. For example, if we have measured the heights of four people in feet, we might see 5.8, 5.3, 5.9, 6.0, which has a variance of .0967 squared feet. Converting the same list to inches, however, gives a variance of 13.92 squared inches (144 times as large), a much bigger value in absolute terms. When dealing with such situations (in which variables differ drastically in variance or in which variables are measured in different units), it is best to use the correlation matrix. This is equivalent to first standardizing each observation of each variable. That is, if x_i represents the humidity reading on day i, then replace it with (x_i - xbar)/s, where xbar is the average humidity over all days and s is the standard deviation.

Applied Principal Components

Principal Component Analysis is an exploratory technique useful for finding patterns or structure in high-dimensional data sets. Two immediate uses are dimension reduction and collinearity elimination. But it is also useful as a more general tool for understanding the structure of the data.

It is sometimes used in regression to solve the problem of collinearity. If a collection of p variables is collinear, then transforming them to p PCs produces a new set of variables that are uncorrelated with each other, which removes the collinearity. Unfortunately, these new variables are linear combinations of the old, and might have lost their interpretability.

Sometimes you get lucky, though, and the linear combination has physical meaning. For example, I've seen an example in which a biologist wished to compare squirrels living in different habitats. She was wondering if there were measurable physical characteristics that differed. So she measured height, width, length, ear height, etc. from both groups. It turned out that the first principal component was a linear combination that gave strong weights to width, length and height, and small weights to the other variables. She interpreted this to be a size index. The second principal component strongly weighted variables which she could interpret to be a shape index. The remaining PCs were discarded, and she now had two indices with which to compare the populations.

A PC analysis might include these steps:
1. Examine the cumulative explained variation and decide how many PCs to keep.
2. Examine the loadings of the first few PCs to determine if they are interpretable. Are some variables much more important than others?
3. Examine the relationship of variables to each other. Which are most alike?
4. Examine data points with respect to their PCs. Do they cluster? Are new trends apparent?
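The units example and the standardization claim from the Correlation or Covariance? discussion can be verified with a few lines of R. This is a sketch under stated assumptions: the four heights are the ones quoted in the text, and USArrests (a data set that ships with R) stands in for any multivariate data set.

```r
# Heights from the text, first in feet and then converted to inches.
heights_ft <- c(5.8, 5.3, 5.9, 6.0)
var_ft <- var(heights_ft)        # variance in squared feet
var_in <- var(heights_ft * 12)   # variance in squared inches: 12^2 = 144 times larger

# Standardizing each variable, (x - xbar) / s, and then taking the covariance
# gives the correlation matrix of the original data.
Z <- scale(USArrests)            # center and divide by the standard deviation
cov_of_standardized <- cov(Z)
cor_of_original     <- cor(USArrests)

print(c(feet = var_ft, inches = var_in))
print(all.equal(cov_of_standardized, cor_of_original))
```

Because the two matrices agree, a PC analysis of the standardized variables and a PC analysis based on the correlation matrix have the same eigenvalues and eigenvectors.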

Graphical Tools

The screeplot is, quite simply, a plot of the variances (the eigenvalues) on the vertical axis against the integers 1, 2, ..., p on the horizontal axis. The purpose is to visualize how quickly the additional variation falls off. Often, the screeplot will descend quickly and then level out. Many people will drop the PCs that occur after the kink at which the graph levels off. For some strange reason, R's default is to draw a bar graph instead of this line plot. This is a little more awkward, but the essential features remain.

The biplot is a much more useful tool. It is a two-dimensional projection of the data onto the first two PCs. Often the points are labelled with a meaningful identifier to aid in picking up trends. On the same plot, we then include vectors that represent the variables. These are plotted using the loadings of the first two PCs. This is more easily demonstrated than explained.

Example

Suppose we have three variables: Height, Weight and Hatsize. And suppose that for our data set, consisting of measurements on 100 men and women, we get these eigenvectors:

          PC1    PC2    PC3
Height    .33   -.20
Weight
Hatsize   .10    0

The first person in our data set, say, has these measurements: Height = 5.2 ft, Weight = 145 pounds, Hatsize = 6.5. (Assume we've somewhat incorrectly used the covariance matrix.) Then that person gets a point plotted at these coordinates:

PC1 = .33 * 5.2 + (Weight loading on PC1) * 145 + .10 * 6.5 (on the horizontal axis)
PC2 = -.20 * 5.2 + (Weight loading on PC2) * 145 + 0 * 6.5 (on the vertical axis)

And so on for each of the 100 observations. We might label these points M and F on the plot to see if there are differences between the men and the women.

The variable Height gets a vector drawn from the origin (0,0) to the point (.33, -.20). Weight gets a vector drawn from (0,0) to the point given by its loadings on PC1 and PC2, and Hatsize from (0,0) to (.10, 0). The lengths of these vectors are then scaled so that they are proportional to the variance.

The biplot has several useful features:

Points that are near each other are observations that had similar scores.

The cosine of the angle between two vectors approximates the correlation between those variables. Hence vectors pointing in the same direction are perfectly correlated, and those at right angles are uncorrelated.

The length of the difference vector between any two vectors is equal to the sampling variance of the difference of those two variables.

Using R

R has several ways of doing principal component analysis. The eigen function returns eigenvalues and eigenvectors. The prcomp function is a numerically stable routine that returns a prcomp object containing the square roots of the eigenvalues (sdev), the eigenvectors (rotation), and the scores (x). The princomp function is slightly less stable, but has more features. It returns a princomp object containing the square roots of the eigenvalues (sdev), the eigenvectors (loadings), the means for each variable (center) and the scores (scores), as well as some other things.

Calling summary on a princomp or prcomp object returns the percent of variation explained by each PC. Calling plot on either kind of object draws a screeplot. Calling biplot on a princomp object draws a biplot; current versions of R also provide a biplot method for prcomp objects.

Example 1: Violent crime in the US

The dataset USArrests that comes with R contains information on the number of arrests per 100,000 residents in each of the 50 US states in 1973 for 3 types of crimes. It also includes the percent of the population living in urban areas. For example:

             Murder  Assault  UrbanPop  Rape
Alabama        13.2      236        58  21.2
Alaska         10.0      263        48  44.5
California      9.0      276        91  40.6

From which we learn that Alabama had an arrest rate of 13.2 for Murder, while California had an arrest rate of 9.0. So, if you don't want to be murdered, California is perhaps the safer state. But of course, it is less safe for Assault and Rape. What is the best state overall for safety with respect to these measures? Is there a relationship between the size of the urban population and crime?
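The three routes described under Using R can be compared side by side. The sketch below assumes the USArrests data and an analysis on the correlation matrix; the object names pc1, pc2 and ev are illustrative. The plotting calls are wrapped so that they only run in an interactive session.

```r
pc1 <- prcomp(USArrests, scale. = TRUE)   # scale. = TRUE standardizes the variables
pc2 <- princomp(USArrests, cor = TRUE)    # cor = TRUE uses the correlation matrix
ev  <- eigen(cor(USArrests))              # direct eigen-decomposition

# All three give the same square roots of the eigenvalues.
print(rbind(eigen    = sqrt(ev$values),
            prcomp   = pc1$sdev,
            princomp = unname(pc2$sdev)))

# Eigenvectors are only defined up to sign, so compare absolute values.
max_diff <- max(abs(abs(pc1$rotation) - abs(unclass(pc2$loadings))))

print(summary(pc2))   # proportion and cumulative proportion of variance

if (interactive()) {
  screeplot(pc2, type = "lines")  # line-style screeplot instead of the default bars
  biplot(pc2)                     # biplot of the first two PCs
}
```

The scores from prcomp and princomp can differ by a small constant factor, because princomp uses divisor n rather than n - 1 when standardizing, so only the standard deviations and the loadings are compared here.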

Because the variables are very different in scale, we'll base our analysis on the correlation matrix. (We do this by including the option cor = TRUE in the call.)

> out <- princomp(USArrests, cor = TRUE)
> summary(out)
Importance of components:
                        Comp.1  Comp.2  Comp.3  Comp.4
Standard deviation
Proportion of Variance
Cumulative Proportion

The first three PCs explain almost all of the variation. There's certainly little to be gained by adding the fourth. If we stick with only the first two, we get an adequate (maybe) amount of the structure preserved.

> plot(out)

This screeplot shows that there's not much to be gained as we move from Comp.3 to Comp.4. Let's examine the loadings:

> out$loadings

Loadings:
          Comp.1  Comp.2  Comp.3  Comp.4
Murder
Assault
UrbanPop
Rape

The first principal component is an average of the three types of crime, with a little bit more added in for the percent of the population living in cities. States with high crime rates in all three categories will get a large negative score here, and it will be even larger if a lot of the population lives in cities. Put differently, states with slightly lower crime rates can still score high here if they have large urban populations.

The second principal component places its highest weights on UrbanPop and Murder, and in fact on the difference between them. States with high urban populations and low murder will get big negative values here. States with low urban populations and high murder will get big positive values. This is mitigated, somewhat, by the Rape and Assault rates.

The third PC is very difficult to interpret.

Keep in mind that since the analysis was done on the correlation matrix, terms like low and high mean low or high with respect to the average (and measured in standard units). The biplot helps summarize this:

> biplot(out)

First, note that Murder, Assault and Rape are highly correlated with each other, but have low correlation with UrbanPop! This means that states with a higher than average percentage of residents in cities do not necessarily make arrests at a higher than average rate.

You can see, now, how the variables contribute to the PCs. The three crime variables are nearly parallel to PC1, and so contribute heavily. But UrbanPop is nearly orthogonal, and so scoring high on it has a negligible effect on your placement along PC1.

States in the center are average on all variables. But look at California: it is extreme on both PCs, and in fact there are few states like it. It has a high urban population and high crime rates. Its very negative score on PC1 means that its overall crime rate is high, and its very negative score on PC2 means that, for its size, it has a low murder arrest rate. (It's the highest in terms of urban population, but only moderately high in terms of murder.)

Generally, states on the right-hand side of the graph have low overall crime rates. States in the upper half have high murder with respect to their low urban populations.

To choose states arbitrarily, South and North Dakota are very similar in crime and urban population, as are New Mexico and Michigan. Surprisingly, New Jersey and Hawaii are also similar.
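The reading of the biplot above can be cross-checked against the numbers. This is a sketch rather than part of the original analysis: crime_index is a hypothetical helper (the average of the three standardized crime variables) introduced only to find out which direction of PC1 corresponds to high crime, since the sign of a PC is arbitrary and may differ between versions of R.

```r
out <- princomp(USArrests, cor = TRUE)

# Pairwise correlations: the three crime variables versus UrbanPop.
print(round(cor(USArrests), 2))

# Scores on the first two PCs for some of the states discussed in the text.
states <- c("California", "North Dakota", "South Dakota",
            "New Mexico", "Michigan", "New Jersey", "Hawaii")
state_scores <- out$scores[states, 1:2]
print(round(state_scores, 2))

# Hypothetical helper: average standardized crime rate per state.
crime_index <- rowMeans(scale(USArrests[, c("Murder", "Assault", "Rape")]))

# direction is -1 if negative PC1 scores mean high crime (as in the text), +1 otherwise.
direction <- sign(cor(out$scores[, 1], crime_index))
print(direction)
```

If direction comes out as +1 on your installation, the descriptions of "negative" and "right-hand side" in the text are mirrored, but the interpretation is otherwise unchanged.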

Example 2: Ozone data

Returning to our Ozone data set, we again do our analysis on the correlation matrix.

> out <- princomp(o2, cor = TRUE)
> summary(out)
Importance of components:
                        Comp.1  Comp.2  Comp.3  Comp.4  Comp.5
Standard deviation
Proportion of Variance
Cumulative Proportion
                        Comp.6  Comp.7  Comp.8  Comp.9
Standard deviation
Proportion of Variance
Cumulative Proportion

We have to go all the way up to the 6th component before explaining a high percent of the variation, although the law of diminishing returns kicks in around the fourth or fifth component.

It would be surprising if we could interpret the PCs in a meaningful way based on their loadings. You are welcome to try. I could find nothing meaningful.

The biplot, however, reveals an interesting structure that we have not yet seen. The data were collected daily, and the numbers thus represent the day on which the observation was made. Recall that points near each other had similar scores. Now, notice that there are many sequences clumped together. For example, in the upper right corner you can see 111, 112, 114, 115, 116 nearby, which means that those five days, all occurring in the same week, had similar weather. This means that our observations were probably not independent, as we assumed they were when we did the regression.

The vectors tell us that humidity, windspeed and pressure are correlated with each other. Visibility and ozone are anti-correlated. The group of humidity, windspeed and pressure is almost independent of height and inversionht. (Some of the vectors do not have labels, but I'm looking at the loadings to figure out which is which.)

If you want to try to use these PCs to reduce the collinearity, you will find that your fit is not much improved. But examining the biplot quickly gave us insight into the relationship of the variables and also pointed out the dependence of subsequent days.
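The two closing points (regressing on the PCs does not improve the fit, and neighbouring days resemble each other) can be illustrated in R. The o2 data set used in these notes is not available here, so this sketch substitutes the airquality data set that ships with R. That substitution, and all object names, are assumptions made for illustration only.

```r
# Stand-in data: airquality ships with R; it is NOT the o2 data set of the notes.
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
pc <- princomp(aq[, c("Solar.R", "Wind", "Temp")], cor = TRUE)

# Regressing on all of the PCs gives the same fit as regressing on the original
# predictors, because the PCs are an invertible linear transformation of them.
fit_orig <- lm(Ozone ~ Solar.R + Wind + Temp, data = aq)
fit_pcs  <- lm(aq$Ozone ~ pc$scores)
r2_orig  <- summary(fit_orig)$r.squared
r2_pcs   <- summary(fit_pcs)$r.squared

# The PC scores themselves are uncorrelated, so the collinearity is gone.
pc_cor <- cor(pc$scores)

# Rows are in date order (with a few days dropped by na.omit), so the lag-1
# autocorrelation of the first PC's scores is a rough check on whether
# neighbouring days have similar scores.
s1   <- pc$scores[, 1]
lag1 <- cor(s1[-1], s1[-length(s1)])

print(c(r2_orig = r2_orig, r2_pcs = r2_pcs, lag1 = lag1))
```

A lag-1 autocorrelation well away from zero would point to the same kind of day-to-day dependence that the biplot revealed for the Ozone data.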