# Exploratory Factor Analysis


## 1. Introduction

- Principal components: explain many variables using a few new variables; not many assumptions attached.
- Exploratory factor analysis: similar idea, but based on a model.
- Idea: the variance of each variable depends on common factors (shared among the variables) plus a specific factor for that variable.
- Aim: identify the common factors and relate them to the original variables; want the variance of the specific factors ("error") to be small.
- Extra: use rotation to get clearer answers.

### Examples

One aim of factor analysis is to identify unobservable characteristics (e.g. attitudes, beliefs, perceptions): observe measurable variables and try to relate them.

Example: give children several tests of different kinds:

- reading comprehension, spelling, sentence completion
- addition and subtraction, counting

Hope to find factors picking out the first 3 tests ("verbal ability") and the last 2 ("numerical ability").

Another example: perceptions of luxury cars. Ask potential customers to rate many luxury cars on many features (style, reliability, performance, ...), and look for factors relating the features. Might get one factor based on reliability, fuel economy, maintenance, quality and durability: call it "sensible". Then another based on luxury, style and performance: call it, say, "appealing".

Aim: pick out common features of the variables.

## 2. How it works

Measure p variables X_1, ..., X_p. Assume they are standardized: E(X_i) = 0, var(X_i) = 1 for all i.

Start with one common factor: the observed X_i depends on a common factor ξ ("xi") plus a specific factor δ_i. Write

X_i = λ_i ξ + δ_i,

like regression, except that ξ and δ_i are not observable. Assume ξ is also standardized, and that ξ and δ_i are independent. Taking variances:

var(X_i) = var(λ_i ξ + δ_i) = λ_i² var(ξ) + var(δ_i) = λ_i² + var(δ_i).

But this is 1. Interpret λ_i² as the proportion of variation in X_i explained by the common factor (like R²); it is called the communality of X_i. The rest of the variance is down to the specific factor. Write θ_ii² = var(δ_i); the communality is then 1 − θ_ii². Communality near 1: X_i is a near-perfect measure of ξ. Near 0: X_i has nothing to do with the common factor. We want communalities that are not too small.

### Two common factors

In principal components, we cannot always use only one component; likewise, in factor analysis we may need 2 or more common factors. Assume ξ_1, ξ_2 are independent and standardized, and write

X_i = λ_i1 ξ_1 + λ_i2 ξ_2 + δ_i, with var(X_i) = λ_i1² + λ_i2² + θ_ii² = 1.

The specific variance θ_ii² appears only with X_i, so it affects only the variance (not the covariances) of X_i. The communality is now 1 − θ_ii² = λ_i1² + λ_i2². Hope that one of the λ_ij is reasonably large for each variable (e.g. if factor 1 is verbal ability and factor 2 mathematical ability, that each test has something to do with one of these).

### Finding a solution: principal factor analysis

Specific factors affect only variances, not covariances, so if we knew the θ_ii² = var(δ_i), they would affect only the diagonal of the variance-covariance matrix. Also, since we standardized the X_i, the variance-covariance matrix is the same as the correlation matrix R. So work with R, but with the θ_ii² subtracted off the diagonal. The problem can then be solved using the same ideas as principal components: find the eigenvalues, then use the eigenvectors as factors. (Not the only solution, but it will work.)
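The principal-factor step just described can be sketched in a few lines of NumPy. This is an illustrative Python sketch, not the course's SAS/IML; the 3-variable correlation matrix and the specific variances of 0.4 are made-up numbers:

```python
import numpy as np

def principal_factor(R, theta2, nfact):
    """One pass of principal factor extraction.

    R      : p x p correlation matrix
    theta2 : length-p vector of assumed specific variances
    nfact  : number of common factors to keep
    Returns the p x nfact matrix of factor loadings.
    """
    # Subtract the specific variances from the diagonal: R - diag(theta2)
    M = R - np.diag(np.asarray(theta2, dtype=float))
    # Eigendecomposition of the reduced (symmetric) matrix
    eigval, eigvec = np.linalg.eigh(M)
    # eigh returns ascending eigenvalues; keep the largest nfact
    order = np.argsort(eigval)[::-1][:nfact]
    # Loadings = eigenvector scaled by the square root of its eigenvalue
    return eigvec[:, order] * np.sqrt(np.maximum(eigval[order], 0.0))

# Toy example: one strong common factor shared by all three variables
R = np.array([[1.0, 0.6, 0.6],
              [0.6, 1.0, 0.6],
              [0.6, 0.6, 1.0]])
L = principal_factor(R, theta2=[0.4, 0.4, 0.4], nfact=1)
```

With these made-up numbers each loading squared comes out as 0.6, matching the assumed communality 1 − 0.4.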

## 3. Example: psychological test data

Take the correlation matrix for the psychological test data (page 132 of the text) and read it into an IML matrix R. Suppose the specific factor variances are all θ_ii² = 0.4 (the value actually used in the text). Then the following lines subtract them from the matrix diagonal and get the eigenvalues and eigenvectors:

```
theta = {0.40, 0.40, 0.40, 0.40, 0.40};
m = r - diag(theta);
print m;
call eigen(eval, evec, m);
print eval;
print evec;
```

The eigenvalues EVAL come out as in the text; the eigenvectors EVEC differ from the text. Not all the eigenvalues are positive: use those that look meaningfully positive (the first 2). The eigenvectors give factor loadings as in principal components. The first factor is mostly the first 3 tests (para, sent, word); the second factor the last 2 tests (add, dots). (Verbal and numerical skills.)

But were the communalities correct? Since 1 − θ_ii² = λ_i1² + λ_i2², use the current estimates of the λ_ij: each λ_ij² is the j-th eigenvalue times the squared eigenvector coefficient for variable i, and these are added up over the 2 retained factors. For the first variable, for example, the factor-1 term is 2.187(0.5345)². The starting values of 0.4 were probably not correct, since they were just guesses, so go back and repeat the process with the updated specific variances 0.314, 0.329, 0.339, 0.405, ...
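The guess-update-repeat scheme can be written as a short loop. A hedged Python sketch (the toy correlation matrix and the starting communality of 0.5 are made up for illustration, not the text's data):

```python
import numpy as np

def iterated_principal_factor(R, nfact, start=0.6, tol=1e-8, max_iter=500):
    """Iterate principal factor extraction until the communalities
    (1 - theta_ii^2) stop changing. `start` is the initial communality
    used for every variable."""
    p = R.shape[0]
    comm = np.full(p, float(start))       # current communality estimates
    L = np.zeros((p, nfact))
    for _ in range(max_iter):
        M = R - np.diag(1.0 - comm)       # reduced correlation matrix
        eigval, eigvec = np.linalg.eigh(M)
        order = np.argsort(eigval)[::-1][:nfact]
        L = eigvec[:, order] * np.sqrt(np.maximum(eigval[order], 0.0))
        new_comm = (L ** 2).sum(axis=1)   # updated 1 - theta_ii^2
        done = np.max(np.abs(new_comm - comm)) < tol
        comm = new_comm
        if done:
            break
    return L, comm

# Toy example: for this matrix the communalities converge to 0.6
R = np.array([[1.0, 0.6, 0.6],
              [0.6, 1.0, 0.6],
              [0.6, 0.6, 1.0]])
L_hat, comm = iterated_principal_factor(R, nfact=1, start=0.5)
```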

## 4. Iterating to convergence; PROC FACTOR

The updated values of 1 − θ_ii² are 0.720, 0.694, 0.679, 0.592, ... The eigenvalues and eigenvectors change slightly, so go back and repeat the process with 1 minus these values on the diagonal. Continue until there is no change in the eigenvalues, the eigenvectors, or the 1 − θ_ii². The final answers are basically as in the text.

### Using SAS's PROC FACTOR

Of course, we only did the above to show the method; in practice, use the canned procedure. SAS can use either the original data or the correlation matrix. For the latter, type the matrix into a file such as rex2.dat, with the variable name at the front of each line followed by that variable's row of correlations for para, sent, word, add and dots. Then read the file in a special way, with the same variable names:

```
data rmat(type=corr);
  _type_ = 'corr';
  infile "rex2.dat";
  input _name_ $ para sent word add dots;
```

The following does one step of the cycle, starting from 1 − θ_ii² = 0.6:

```
proc factor;
  priors 0.6 0.6 0.6 0.6 0.6;
```

and produces output showing the prior communality estimates and the eigenvalues of the reduced correlation matrix.

## 5. Iterated principal factor analysis

The one-step output lists the prior communality estimates for PARA, SENT, WORD, ADD and DOTS, then the eigenvalues of the reduced correlation matrix (Total = 3, so the average is 0.6), with each eigenvalue's difference, proportion and cumulative proportion. These are the same eigenvalues as we calculated.

To do iterated principal factor analysis, change the code to contain

```
proc factor method=prinit;
```

which gives (edited) an iteration history showing the change in the communalities at each iteration, ending with "Convergence criterion satisfied", then the eigenvalues of the reduced correlation matrix, the factor pattern for PARA, SENT, WORD, ADD and DOTS (under "Initial Factor Method: Iterated Principal Factor Analysis"), the variance explained by each factor, and the final communality estimates. This is the same (numerically) as the text and as our previous calculation.

### Mathematics of the common factor model

With c factors, we have a model like

X_i = λ_i1 ξ_1 + ... + λ_ic ξ_c + δ_i for each i, i = 1, 2, ..., p.

This is easier to write in matrix terms: X = Ξ Λ_c′ + Δ. Assumptions:

1. Common factors ξ uncorrelated, variance 1: Ξ′Ξ/(n − 1) = I.
2. Specific factors δ uncorrelated, with variances θ_ii²: Θ = Δ′Δ/(n − 1) diagonal.
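Together with the assumption that common and specific factors are uncorrelated, these assumptions imply R − Θ = Λ_c Λ_c′, which is easy to check numerically. An illustrative Python sketch; the 4 × 2 loading matrix below is made up:

```python
import numpy as np

# Made-up loadings for p = 4 standardized variables on c = 2 factors
Lam = np.array([[0.8, 0.1],
                [0.7, 0.2],
                [0.1, 0.9],
                [0.2, 0.7]])

# Choose the specific variances so each variable has total variance 1:
# theta_ii^2 = 1 - (communality)
theta2 = 1.0 - (Lam ** 2).sum(axis=1)
Theta = np.diag(theta2)

# The correlation matrix implied by the common factor model
R = Lam @ Lam.T + Theta
```

Subtracting Θ from R recovers Λ_c Λ_c′ exactly, which is why principal factor analysis works on the reduced correlation matrix.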

3. Common and specific factors uncorrelated: Ξ′Δ = 0.

## 6. Invariance under rotation

Since the X_i are standardized, the correlation matrix is R = X′X/(n − 1). Substitute the model for X, and remove any terms that are 0 by assumption. Finally,

R − Θ = Λ_c Λ_c′.

In principal components, each component is chosen to maximize variance (while being uncorrelated with the previous components). But here there is no such restriction: any matrix Λ_c satisfying the equation is OK. Consider an orthogonal matrix T, representing a rotation, and let Λ_c* = Λ_c T. Then

Λ_c* Λ_c*′ = Λ_c T T′ Λ_c′ = Λ_c Λ_c′,

because T T′ = I. That is, for any matrix of factor loadings solving the problem, any rotated version of it also solves the problem.

In principal components, it is difficult to interpret a medium-sized component loading. Idea: find a rotation method that makes the factors easy to interpret.

### Kaiser's varimax rotation

We want the factor loadings (elements of the rotated Λ_c) to be close to 0 or 1; then each factor clearly depends (or clearly does not depend) on each variable. Varimax finds the rotation that maximizes the sum over columns of the variance of the squared loadings; maximizing variances drives values towards the extremes. In SAS, change the FACTOR line to read

```
proc factor method=prinit rotate=varimax;
```

The PRIORS line can be eliminated. The results show "Rotation Method: Varimax", the orthogonal transformation matrix, the rotated factor pattern for PARA, SENT, WORD, ADD and DOTS, and the variance explained by each factor.
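Varimax can be implemented compactly with the standard SVD-based iteration. This is an illustrative Python sketch of that textbook algorithm, not SAS's internal code; the 4 × 2 loading matrix used to exercise it is made up:

```python
import numpy as np

def varimax(L, tol=1e-10, max_iter=1000):
    """Rotate loadings L (p x k) by an orthogonal matrix T chosen to
    maximize the summed column variances of the squared loadings
    (Kaiser's varimax). Returns (L @ T, T)."""
    p, k = L.shape
    T = np.eye(k)
    crit_old = 0.0
    for _ in range(max_iter):
        LR = L @ T
        # Gradient of the varimax criterion, polar-decomposed via SVD
        G = L.T @ (LR ** 3 - LR @ np.diag((LR ** 2).sum(axis=0)) / p)
        u, s, vt = np.linalg.svd(G)
        T = u @ vt
        crit_new = s.sum()
        if crit_new - crit_old < tol:
            break
        crit_old = crit_new
    return L @ T, T

# Made-up loadings with middling values that varimax should sharpen
L0 = np.array([[0.7, 0.3],
               [0.6, 0.4],
               [0.2, 0.8],
               [0.3, 0.7]])
L_rot, T = varimax(L0)
```

Because T is orthogonal, the rotation leaves each row's sum of squared loadings (the communality) unchanged, which is exactly the invariance argued above.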

## 7. Rotated factors; factor scores

Factor 1 now clearly depends on the three verbal tests (paragraph comprehension, sentence completion, word meaning); factor 2 is now clearly based on the numerical tests (addition, counting dots). The variance explained by the first factor is no longer the largest possible, but the 2 factors together still explain the same total amount of variance.

### Quartimax rotation

Similar idea to varimax, but now maximize the total row variance. This makes each variable load on as few factors as possible. (In varimax, a variable could still appear in several different factors.) In the example data set, the results are very similar, because each variable loaded on only one factor anyway under varimax.

### Factor scores

In principal components, we obtained component scores: values for each observation representing where that observation falls on each component. These provided a way to plot multidimensional data in 2 dimensions, using the component loadings to make linear combinations of the original variables. The same idea applies in factor analysis, but with a difficulty: we don't know the specific factors exactly. So estimate the factor scores by assuming the specific factors δ_i = 0:

Ξ̂ = X R⁻¹ Λ_c.

Λ_c depends on the rotation, so the factor scores do too.

### Saving and plotting factor scores

Factor scores depend on the original observations, so we need the original data, not just the correlation matrix. The data in the file stock_returns.dat are weekly rates of return for 5 stocks on the NYSE: Allied Chemical, du Pont, Union Carbide, Exxon and Texaco, collected over 100 weeks. The SAS FACTOR line is

```
proc factor method=prinit rotate=varimax priors=smc
            nfactors=2 out=... ;
```

Variations: priors=smc specifies communality guesses (squared multiple correlations) that should be closer to the truth; nfactors says to get 2 factors; out says to create an output dataset containing the factor scores.
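The score formula Ξ̂ = X R⁻¹ Λ_c is one line of linear algebra. A Python sketch (the data are simulated, not the stock returns; the R = I case is used only to make the expected answer obvious):

```python
import numpy as np

def factor_scores(X, R, Lam):
    """Regression-style factor score estimates Xi_hat = X R^{-1} Lambda.
    Columns of X are assumed to be already standardized."""
    # solve(R, Lam) computes R^{-1} Lam without forming the inverse
    return X @ np.linalg.solve(R, Lam)

# Simulated standardized data; with R = I the scores reduce to X @ Lam
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
Lam = np.array([[0.9], [0.8], [0.1]])
S = factor_scores(X, np.eye(3), Lam)
```

One row of S per observation and one column per factor, which is what gets plotted below.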

## 8. Interpreting the stock-return factors

In the factor pattern before rotation (loadings for ALLCHEM, DUPONT, UNIONCAR, EXXON and TEXACO, with the variance explained by each factor), factor 1 is basically an average of all the stocks, and factor 2 contrasts Texaco with du Pont. Factor 1 explains most of the variance. Compare the pattern after rotation: factor 1 picks out the first 3 (chemical) companies, and factor 2 picks out the last 2 (oil) companies. The explained variance is shared out more evenly between the factors.

We can print out the new dataset (including the factor scores) and plot the factor scores. (SAS by default uses the newest dataset.)

```
proc print;
proc plot;
  plot Factor1 * Factor2;
```

On the plot, pick out good/bad days for the chemical companies (top/bottom) and good/bad days for oil (right/left). Example: day 13, with a factor 1 score of 2.64. A good day for the chemical companies: du Pont and Union Carbide had big gains. An average day for oil: small gains for both Exxon and Texaco. Day 20: no gain for the chemical companies, but both Exxon and Texaco had solid gains.


Discriminant Function Analysis in SPSS To do DFA in SPSS, start from Classify in the Analyze menu (because we re trying to classify participants into different groups). In this case we re looking at a

### Power System Analysis Prof. A. K. Sinha Department of Electrical Engineering Indian Institute of Technology, Kharagpur. Lecture - 19 Power Flow IV

Power System Analysis Prof. A. K. Sinha Department of Electrical Engineering Indian Institute of Technology, Kharagpur Lecture - 19 Power Flow IV Welcome to lesson 19 on Power System Analysis. In this

### Exploratory Factor Analysis of Demographic Characteristics of Antenatal Clinic Attendees and their Association with HIV Risk

Doi:10.5901/mjss.2014.v5n20p303 Abstract Exploratory Factor Analysis of Demographic Characteristics of Antenatal Clinic Attendees and their Association with HIV Risk Wilbert Sibanda Philip D. Pretorius

### Notes on Applied Linear Regression

Notes on Applied Linear Regression Jamie DeCoster Department of Social Psychology Free University Amsterdam Van der Boechorststraat 1 1081 BT Amsterdam The Netherlands phone: +31 (0)20 444-8935 email:

### ( % . This matrix consists of \$ 4 5 " 5' the coefficients of the variables as they appear in the original system. The augmented 3 " 2 2 # 2 " 3 4&

Matrices define matrix We will use matrices to help us solve systems of equations. A matrix is a rectangular array of numbers enclosed in parentheses or brackets. In linear algebra, matrices are important

### ELEC-E8104 Stochastics models and estimation, Lecture 3b: Linear Estimation in Static Systems

Stochastics models and estimation, Lecture 3b: Linear Estimation in Static Systems Minimum Mean Square Error (MMSE) MMSE estimation of Gaussian random vectors Linear MMSE estimator for arbitrarily distributed

### 12/31/2016. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2

PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 Understand linear regression with a single predictor Understand how we assess the fit of a regression model Total Sum of Squares

### Standard Deviation Calculator

CSS.com Chapter 35 Standard Deviation Calculator Introduction The is a tool to calculate the standard deviation from the data, the standard error, the range, percentiles, the COV, confidence limits, or

### 13 MATH FACTS 101. 2 a = 1. 7. The elements of a vector have a graphical interpretation, which is particularly easy to see in two or three dimensions.

3 MATH FACTS 0 3 MATH FACTS 3. Vectors 3.. Definition We use the overhead arrow to denote a column vector, i.e., a linear segment with a direction. For example, in three-space, we write a vector in terms

### MATH 551 - APPLIED MATRIX THEORY

MATH 55 - APPLIED MATRIX THEORY FINAL TEST: SAMPLE with SOLUTIONS (25 points NAME: PROBLEM (3 points A web of 5 pages is described by a directed graph whose matrix is given by A Do the following ( points

### Data analysis process

Data analysis process Data collection and preparation Collect data Prepare codebook Set up structure of data Enter data Screen data for errors Exploration of data Descriptive Statistics Graphs Analysis