CANONICAL CORRELATION ANALYSIS

V.K. Bhatia
I.A.S.R.I., Library Avenue, New Delhi

A canonical correlation is the correlation between two canonical (latent) variables, one representing a set of independent variables and the other a set of dependent variables. Each set may be regarded as a latent variable based on the measured indicator variables in that set. The canonical weights are chosen so that the linear correlation between the two latent variables is maximized. Whereas multiple regression is used for many-to-one relationships, canonical correlation is used for many-to-many relationships. There may be more than one such linear correlation relating the two sets of variables, each representing a different dimension along which the independent set of variables is related to the dependent set. The purpose of canonical correlation is to explain the relation between the two sets of variables, not to model the individual variables.

By analogy with ordinary correlation, the squared canonical correlation is the percent of variance in the dependent canonical variable explained by the independent canonical variable along a given dimension (there may be more than one). In addition to asking how strong the relationship is between two latent variables, canonical correlation is useful in determining how many dimensions are needed to account for that relationship.

Canonical correlation finds the linear combination of the variables in one set that correlates most highly with a linear combination of the variables in the second set. This pair of linear combinations, or "root," is extracted and the process is repeated for the residual data, with the constraint that the second pair of linear combinations must not correlate with the first. The process is repeated until a successive root is no longer significant.
Canonical correlation is a member of the multivariate general linear hypothesis (MGLH) family and shares many of the assumptions of multiple regression, such as linearity of relationships, homoscedasticity (same level of relationship for the full range of the data), interval or near-interval data, untruncated variables, proper specification of the model, lack of high multicollinearity, and multivariate normality for purposes of hypothesis testing.

Often in applied research, scientists encounter sets of variables of large dimension and are faced with the problems of understanding dependency structures, reducing dimensionality, constructing a subset of good predictors from the explanatory variables, etc. Canonical Correlation Analysis (CCA) provides a tool to attack these problems. However, its appeal, and hence its motivation, seems to differ between theoretical statisticians and social scientists. We deal here with the various motivations of CCA mentioned above and with the related statistical inference procedures.

Dependency between Two Sets of Stochastic Variables

Let $X$ ($p \times 1$) be a random vector partitioned into two subvectors $X_1$ ($p_1 \times 1$) and $X_2$ ($p_2 \times 1$), with $p_1 \le p_2$ and $p_1 + p_2 = p$. Assume $E(X) = 0$. In order to study the dependency between $X_1$ and $X_2$, we seek to evaluate the maximum possible correlation between any two arbitrary linear compounds

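$$U = \alpha' X_1 \quad \text{and} \quad V = \gamma' X_2$$

subject to the normalizations

$$\mathrm{Var}(U) = \alpha' \Sigma_{11} \alpha = 1 \quad \text{and} \quad \mathrm{Var}(V) = \gamma' \Sigma_{22} \gamma = 1,$$

where

$$\mathrm{Disp}(X) = \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$

is partitioned according to that of $X$ as above. It follows that this maximum correlation, say $\rho_1$, is given by the positive square root of the largest eigenroot among the eigenroots $\rho_1^2 \ge \rho_2^2 \ge \dots \ge \rho_{p_1}^2$ of $\Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$ in the metric of $\Sigma_{11}$, i.e. of $\Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1}$. The vectors $\alpha$ and $\gamma$ are then given by $\alpha_1$, $\gamma_1$ such that $\alpha_1' \Sigma_{11} \alpha_1 = \gamma_1' \Sigma_{22} \gamma_1 = 1$ and

$$\begin{pmatrix} -\rho_1 \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\rho_1 \Sigma_{22} \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \gamma_1 \end{pmatrix} = 0. \qquad (2.1)$$

Alternatively, $\alpha$ and $\gamma$ may be obtained as the eigenvector solutions, subject to the same normalizations, of

$$(\Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} - \rho^2 I)\,\alpha = 0, \qquad (\Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} - \rho^2 I)\,\gamma = 0. \qquad (2.2)$$

Further, it follows that

$$\alpha = \Sigma_{11}^{-1} \Sigma_{12}\, \gamma / \rho \quad \text{and} \quad \gamma = \Sigma_{22}^{-1} \Sigma_{21}\, \alpha / \rho, \qquad (2.3)$$

so that one needs to solve only one of the two equations in (2.2). $\rho_1$ is called the (first) canonical correlation between $X_1$ and $X_2$, and $(U_1, V_1) = (\alpha_1' X_1, \gamma_1' X_2)$ the pair of first canonical variates. If $\Sigma_{ii}$, $i = 1$ or $2$, happens to be singular, one can use a g-inverse $\Sigma_{ii}^{-}$ in place of $\Sigma_{ii}^{-1}$ above. Note that

$p_1 = p_2 = 1 \Rightarrow \rho_1$ = the usual Pearson product-moment correlation coefficient (in absolute value) between the scalar random variables $X_1$ and $X_2$;

$p_1 = 1$, $p_2 > 1 \Rightarrow \rho_1$ = the multiple correlation coefficient between the scalar $X_1$ and the vector $X_2$.

Sample analogues are defined by replacing $\Sigma$ with the sample dispersion matrix.

Reduction of Dimensionality

In case $p_2$ or $p_1$ is large, it may become necessary to achieve a reduction of dimensionality without sacrificing much of the dependency between $X_1$ and $X_2$. We then seek further linear combinations $U_i = \alpha_i' X_1$, $V_i = \gamma_i' X_2$, $i = 1, 2, \dots, r+1$, such that $U_{r+1}$ and $V_{r+1}$ are maximally correlated among all linear combinations subject to having unit variances and further subject to being uncorrelated with $U_1, V_1, \dots, U_r, V_r$. It turns out that $\mathrm{Corr}(U_{r+1}, V_{r+1}) = \rho_{r+1}$ and that $\alpha_{r+1}$, $\gamma_{r+1}$ are simply solutions of (2.1) with $\rho_1$ replaced by $\rho_{r+1}$.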
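The eigenvalue route in (2.2) and (2.3) can be sketched numerically. The following is a minimal illustration, not a reference implementation: it assumes NumPy, uses sample covariance blocks in place of $\Sigma$, and runs on simulated data. The function name `canonical_correlations`, the coefficient matrix `B` and the sample sizes are hypothetical choices made only for this example.

```python
import numpy as np


def canonical_correlations(S11, S12, S22):
    """Solve the first equation of (2.2) for alpha, then use (2.3) for gamma.

    Assumes S11 and S22 are nonsingular and all canonical correlations are nonzero.
    """
    S21 = S12.T
    # M = S11^{-1} S12 S22^{-1} S21; its eigenvalues are the squared canonical correlations
    M = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21)
    eigvals, eigvecs = np.linalg.eig(M)
    order = np.argsort(eigvals.real)[::-1]          # largest root first
    rho = np.sqrt(np.clip(eigvals.real[order], 0.0, 1.0))
    A = eigvecs.real[:, order]
    # Normalize each column so that alpha' S11 alpha = 1
    A = A / np.sqrt(np.einsum("ij,ik,kj->j", A, S11, A))
    # (2.3): gamma = S22^{-1} S21 alpha / rho, column by column
    G = np.linalg.solve(S22, S21 @ A) / rho
    return rho, A, G


# Simulated data (illustrative assumption): X2 depends linearly on X1 plus noise
rng = np.random.default_rng(0)
n, p1, p2 = 500, 2, 3
X1 = rng.standard_normal((n, p1))
B = np.array([[1.0, 0.0, 0.5], [0.0, 0.5, 0.0]])
X2 = X1 @ B + rng.standard_normal((n, p2))

# Sample dispersion matrix, partitioned like Sigma
S = np.cov(np.hstack([X1, X2]), rowvar=False)
S11, S12, S22 = S[:p1, :p1], S[:p1, p1:], S[p1:, p1:]
rho, A, G = canonical_correlations(S11, S12, S22)

# Canonical variates U_i = alpha_i' X1 and V_i = gamma_i' X2 (centered data)
U = (X1 - X1.mean(axis=0)) @ A
V = (X2 - X2.mean(axis=0)) @ G
print("canonical correlations:", rho)
print("corr(U1, V1):", np.corrcoef(U[:, 0], V[:, 0])[0, 1])
```

In this sketch the sample correlation between the first pair of variates reproduces the first canonical correlation, and the columns of `A` and `G` satisfy the unit-variance and uncorrelatedness constraints with respect to `S11` and `S22`.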
When $\rho_{k+1}$ is judged to be not significantly different from zero for some $k+1$, one may then retain only the $2k$ variables $(U_i, V_i)$, $i = 1, 2, \dots, k$, for further analysis in place of the original $p = p_1 + p_2$, presumably much larger, number of variables. Note, however, that information on all $p_1 + p_2$ variables in $X_1$ and $X_2$ is still needed even to construct these $2k$ new variables.

Canonical Correlation in SPSS

o Canonical correlation is part of MANOVA in SPSS, in which one has to refer to one set of variables as "dependent" and the other as "covariates." It is available only in syntax. The command syntax is as follows, where set1 and set2 are variable lists:

    MANOVA set1 WITH set2
      /DISCRIM ALL ALPHA(1)
      /PRINT SIGNIF(MULTIV UNIV EIGEN DIMENR).

Note that one cannot save canonical scores with this method.

o Canonical correlation has to be run in syntax, not from the SPSS menus. If you just want to create a dataset with canonical variables, SPSS supplies, as part of the Advanced Statistics module, the CANCORR macro located in the file Canonical correlation.sps, usually in the same directory as the SPSS main program. Open the syntax window with File, New, Syntax. Enter this:

    INCLUDE 'c:\program Files\SPSS\Canonical correlation.sps'.
    CANCORR SET1=varlist/
            SET2=varlist/.

where each "varlist" is one of the two lists of numeric variables. Output will be saved to a file called "cc_tmp2.sav," which will contain the canonical scores as new variables along with the original data. These scores will be labeled s1_cv1 and s2_cv1, s1_cv2 and s2_cv2, and the like, standing for the scores on the two canonical variables associated with each canonical correlation. The macro will create two canonical variables for each canonical correlation, and the number of canonical correlations equals the number of variables in the smaller of SET1 and SET2.

o OVERALS, which is part of the SPSS Categories module, computes nonlinear canonical correlation analysis on two or more sets of variables.

Some Comments on the Canonical Correlations

There could be a situation where some variables have high structure correlations even though their canonical weights are near zero.
This could happen because the weights are partial coefficients whereas the structure correlations (canonical factor loadings) are not: if a given variable shares variance with other independent variables entered in the linear combination used to create a canonical variable, its canonical coefficient (weight) is computed from the residual variance it can explain after controlling for these variables. If an independent variable is totally redundant with another independent variable, its partial coefficient (canonical weight) will be zero. Nonetheless, such a variable might have a high correlation with the canonical variable (that is, a high structure coefficient). In summary, the canonical weights have to do with the unique contribution of an original variable to the canonical variable, whereas the structure correlations have to do with the simple, overall correlation of the original variable with the canonical variable.

Canonical correlation is not a measure of the percent of variance explained in the original variables. The square of the structure correlation is the percent of the variance in a given original variable accounted for by a given canonical variable on a given (usually the first) canonical correlation. Note that the average percent of variance explained in the original variables by a canonical variable (the mean of the squared structure correlations for the canonical variable) is not at all the same as the canonical correlation, which has to do with the correlation between the weighted sums of the two sets of variables. Put another way, the canonical correlation does not tell us how much of the variance in the original variables is explained by the canonical variables. Instead, that is determined from the squares of the structure correlations.

Canonical coefficients can be used to explain with which original variables a canonical correlation is predominantly associated. The canonical coefficients are standardized coefficients and (like beta weights in regression) their magnitudes can be compared. Looking at the SPSS output, which lists the canonical coefficients as columns and the variables in a set as rows, some researchers simply note the variables with the highest coefficients to determine which variables are associated with which canonical correlations, and use this as the basis for inferring the meaning of the dimension represented by the canonical correlation.

However, Levine (1977) argues against this procedure on the ground that the canonical coefficients may be subject to multicollinearity, leading to incorrect judgments. Also, because of suppression, a canonical coefficient may even have a different sign from the correlation of the original variable with the canonical variable. Levine therefore recommends interpreting the relations of the original variables to a canonical variable in terms of the correlations of the original variables with the canonical variables - that is, by structure coefficients. This is now the standard approach.

Redundancy in Canonical Correlation Analysis

Redundancy is the percent of variance in one set of variables accounted for by the variate of the other set.
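The distinction between canonical weights, structure correlations and redundancy can be illustrated with a small numerical sketch. This is only an illustration under stated assumptions: it uses NumPy, simulated data in which two independent variables overlap strongly, and hypothetical names (`standardize`, `load_x`, `load_y`, `own_y`, `redundancy_y`). Redundancy is computed here in the usual way, as the mean squared structure correlation of a set with its own variate multiplied by the squared canonical correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

# Hypothetical data: three "independent" variables, two "dependent" variables
X = rng.standard_normal((n, 3))
X[:, 1] = 0.9 * X[:, 0] + 0.45 * rng.standard_normal(n)  # x2 overlaps heavily with x1
Y = np.column_stack([
    X[:, 0] + X[:, 2] + rng.standard_normal(n),
    X[:, 2] + rng.standard_normal(n),
])


def standardize(M):
    """Center each column and scale it to unit sample variance."""
    return (M - M.mean(axis=0)) / M.std(axis=0, ddof=1)


Zx, Zy = standardize(X), standardize(Y)
Rxx = Zx.T @ Zx / (n - 1)
Ryy = Zy.T @ Zy / (n - 1)
Rxy = Zx.T @ Zy / (n - 1)

# First canonical dimension: eigenproblem for the dependent-set weights
M = np.linalg.solve(Ryy, Rxy.T) @ np.linalg.solve(Rxx, Rxy)
vals, vecs = np.linalg.eig(M)
k = np.argmax(vals.real)
r2 = vals.real[k]                                  # squared canonical correlation
b = vecs[:, k].real
b = b / np.sqrt(b @ Ryy @ b)                       # standardized weights, dependent set
a = np.linalg.solve(Rxx, Rxy @ b) / np.sqrt(r2)    # standardized weights, independent set

u = Zx @ a   # independent canonical variate (unit variance)
v = Zy @ b   # dependent canonical variate (unit variance)

# Structure correlations: simple correlations of original variables with their own variate
load_x = Zx.T @ u / (n - 1)
load_y = Zy.T @ v / (n - 1)

own_y = np.mean(load_y ** 2)   # share of dependent-set variance explained by its own variate
redundancy_y = own_y * r2      # share explained by the opposite (independent) variate

print("weights (independent set):  ", a)
print("structure correlations:     ", load_x)
print("squared canonical corr.:    ", r2)
print("own-variate proportion (Y): ", own_y)
print("redundancy (Y given X):     ", redundancy_y)
```

Comparing the printed weights with the structure correlations shows how the two can differ for the overlapping variables, and the last two lines show that redundancy is never larger than the proportion explained by the set's own variate.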
The researcher wants high redundancy, indicating that the independent variate accounts for a high percent of the variance in the dependent set of original variables. Note that this is not the squared canonical correlation, which is the percent of variance in the dependent variate accounted for by the independent variate. The redundancy analysis section of SAS output looks like that below, where rows 1 and 2 refer to the first and second canonical correlations extracted for these data. Italicized comments are not part of SAS output.

Canonical Redundancy Analysis

Raw variance tables are reported by SAS but are omitted here because redundancy is normally interpreted using the standardized tables.

Standardized Variance of the dependent variables Explained by
    Their Own Canonical Variables: Proportion, Cumulative Proportion
    Canonical R-Squared
    The Opposite Canonical Variables: Proportion, Cumulative Proportion

The table above shows that, for the first canonical correlation, although the independent canonical variable explains 47.15% of the variance in the dependent canonical variable, the independent canonical variable is able to predict only 11.29% of the variance in the individual original dependent variables. Also, the dependent canonical variable predicts only 23.94% of the variance in the individual original dependent variables. Similar statements could be made about the second canonical correlation (row 2).

Canonical Redundancy Analysis

Standardized Variance of the independent variables Explained by
    Their Own Canonical Variables: Proportion, Cumulative Proportion
    Canonical R-Squared
    The Opposite Canonical Variables: Proportion, Cumulative Proportion

The table above repeats the first, except for comparisons involving the independent canonical variable.

Canonical Redundancy Analysis

Squared Multiple Correlations Between the dependent variables and the First 'M' Canonical Variables of the independent variables
    Columns: M = 1, 2
    Rows: the three original dependent variables

In the table above, the columns represent the canonical correlations and the rows represent the original dependent variables, three in this case. The R-squareds are the percent of variance in each original dependent variable explained by the independent canonical variables. A similar table for the independent variables and the dependent canonical variables is also output by SAS but is not reproduced here.

Nonlinear Canonical Correlation (OVERALS)

Nonlinear canonical correlation analysis corresponds to categorical canonical correlation analysis with optimal scaling. The OVERALS procedure in SPSS (part of SPSS Categories) implements nonlinear canonical correlation. Independent variables can be nominal, ordinal, or interval, and there can be more than two sets of variables (more than one independent set and one dependent set).
Whereas ordinary canonical correlation maximizes correlations between the variable sets, in OVERALS the sets are compared to an unknown compromise set defined by the object scores. OVERALS uses optimal scaling, which quantifies categorical variables and then treats them as numerical variables, including applying nonlinear transformations to find the best-fitting model. For nominal variables, the order of the categories is not retained, but values are created for each category such that goodness of fit is maximized. For ordinal variables, order is retained and values maximizing fit are created. For interval variables, order is retained, as are equal distances between values.

Obtain OVERALS from the SPSS menu by selecting Analyze, Data Reduction, Optimal Scaling; select Multiple sets; select either Some variable(s) not multiple nominal or All variables multiple nominal; click Define; define at least two sets of variables; define the value range and measurement scale (optimal scaling level) for each selected variable. SPSS output includes frequencies, centroids, iteration history, object scores, category quantifications, weights, component loadings, single and multiple fit, object scores plots, category coordinates plots, component loadings plots, category centroids plots, and transformation plots.

Tip: To minimize output, use the Automatic Recode facility on the Transform menu to create consecutive categories beginning with 1 for variables treated as nominal or ordinal. For each variable scaled at the numerical (integer) level, subtract the smallest observed value from every value and add 1.

Warning: Optimal scaling recodes values on the fly to maximize goodness of fit for the given data. As with any atheoretical, post-hoc data mining procedure, there is a danger of overfitting the model to the given data. Therefore, it is particularly appropriate to employ cross-validation, developing the model on a training dataset and then assessing its generalizability by running the model on a separate validation dataset.

The SPSS manual notes, "If each set contains one variable, nonlinear canonical correlation analysis is equivalent to principal components analysis with optimal scaling. If each of these variables is multiple nominal, the analysis corresponds to homogeneity analysis.
If two sets of variables are involved and one of the sets contains only one variable, the analysis is identical to categorical regression with optimal scaling."

Reference

Levine, Mark S. (1977). Canonical Analysis and Factor Comparison. Thousand Oaks, CA: Sage Publications, Quantitative Applications in the Social Sciences Series, No. 6.