FACTOR ANALYSIS OF CATEGORICAL DATA IN SAS

Lei Han, Torsten B. Neilands, M. Margaret Dolcini
University of California, San Francisco

ABSTRACT

Data analysts frequently employ factor analysis to extract latent factors from sets of survey items. Often these items are not continuous scales; instead, they are either polytomous (e.g., Likert scaled) or dichotomous (e.g., "yes/no") items. The FACTOR procedure in SAS computes Pearson product-moment correlations from raw data as its default input matrix. This approach may not be optimal for polytomous or dichotomous input data. Polychoric and tetrachoric correlation coefficients for polytomous and dichotomous items, respectively, may be a better choice. This paper illustrates the use of the SAS Institute polychor.sas macro program to compute polychoric and tetrachoric correlation matrices; these matrices are subsequently analyzed using PROC FACTOR. Limitations and benefits of this approach are discussed.

INTRODUCTION

Survey research questionnaires often contain ordered polytomous variables such as Likert scale items and dichotomous variables ("yes/no" items) to measure respondent attitudes, beliefs, or preferences. Researchers frequently employ factor analysis techniques to ascertain the presence of underlying latent traits that govern responses to survey items. Factor analysis methods implemented in common statistical software programs such as SAS, however, were developed for the analysis of continuous variables via factoring of the Pearson correlations among the observed items. The basic factor analysis model assumes a linear relationship between a set of observed variables and latent trait variables, namely common factors. However, if survey items are related to a common factor via a nonlinear function, which is particularly likely for dichotomous variables, the linear factor analysis model may produce mathematical artifacts (McDonald, 1985).
For example, several researchers have noted that factor loadings for binary items tend to be highly correlated with item means (i.e., the proportion of "1" responses in a 0/1 dichotomy), suggesting that something other than an underlying trait is represented in such cases (Gorsuch, 1974; McDonald, 1985; Waltman & Dunbar, 1994). The correct interpretation of a factor's meaning depends on appropriate model selection and input data. In the present illustration we focus mainly on one input data issue, the correlation matrix to be factored, and one model selection issue, the choice of a factor extraction method. Specifically, given the choice of correlation matrices available from SAS, which correlation coefficients are most appropriate to describe the relationships among ordered polytomous or dichotomous variables and to serve as the input matrix for factor analysis?

Likert scaled items usually feature four to seven score points indicating level of agreement, importance, or frequency. Binary data, by definition, have two points. When Pearson correlation coefficients are computed for dichotomous or polytomous variables, the magnitudes of these correlations shrink due to range restriction. This attenuation is more severe when two variables differ substantially in their means or marginal distributions. Polychoric (for polytomous variables) and tetrachoric (for dichotomous variables) correlation coefficients have been proposed to overcome this limitation (Gorsuch, 1974; Cohen and Cohen, 1983).

Factor analysis results also depend on the factor extraction method used. One commonly employed method is principal axis factoring (PAF), in which the factor analysis begins with the squared multiple correlation (SMC) of each item with the remaining items placed in the main diagonal of the factored correlation matrix. Thus, the SMC
values represent initial estimates of communalities, the proportion of variance within each item accounted for by the factor analysis solution. A contrasting method, unweighted least squares (ULS), uses a least-squares criterion: it chooses factor loadings that minimize the squared residuals between the observed correlations and those reproduced by the factor model. This is one recommended extraction method for the factor analysis of polychoric and tetrachoric correlation coefficients.

The present paper addresses two practical questions. First, do factor analysis results differ substantially when dichotomous and Likert scaled variables are treated as categorical rather than as continuous variables? Second, which factor extraction procedure, principal axis factoring (PAF) or unweighted least squares (ULS), works best for dichotomous and Likert scale items (i.e., yields simple structure)?

METHODS

The data used in this illustration come from a longitudinal study of adolescent neighborhood friendships. Respondents are African American youth of ages 14-18 living in a single neighborhood in a West Coast city. Probability sampling and
multiple-phase subject recruitment procedures are described in detail in Dolcini et al. (2001). In the current analysis we use only cross-sectional data from the baseline interview (N = 201). In a face-to-face interview participants responded to items assessing health-related attitudes and behaviors. Each respondent also identified three to five close friends. A total of 742 friends ranging in age from 13 to 21 were nominated and relationship data were obtained.

Measures

For illustrative purposes, the present analyses use two measures included in the baseline interview questionnaire. One measure, composed of seven dichotomous (yes/no) items, is an inventory designed by Dolcini et al. (2001) to measure the quality of friendship. The second instrument comprises eight Likert scale items designed to assess degree of depression. These items are drawn from the Center for Epidemiological Studies Depression Scale (CESD; Radloff, 1977) short form. Items on this inventory were scored on a four-point Likert format: 1 = "Rarely or none of the time (< 1 day)"; 2 = "Some or a little of the time (1-2 days)"; 3 = "Occasionally or a moderate amount of the time (3-4 days)"; and 4 = "Most or all of the time (5-7 days)". Higher scores indicate higher levels of depression during the week. The items for the friendship scale and the CESD appear in Table 1 and in Table 2, respectively.

Procedure

Prior to factor analysis, the distributions of the two sets of items were examined and descriptive statistics were computed. SAS PROC FACTOR was run on both the Pearson r and the polychoric correlation matrices for both the binary and the Likert data sets. The maximum number of factors to be extracted was limited to two because the number of items for each instrument is relatively small. The two factor extraction methods described above, iterated principal factor estimation (PAF) and unweighted least squares (ULS), were applied to the polychoric correlation matrix.
Only PAF was performed on the Pearson r matrix because the distributions of all items are skewed in the same direction.

Factor analysis of a polychoric correlation matrix using SAS proceeds in two steps: (1) computing the polychoric correlation matrix and (2) submitting the computed polychoric correlation matrix to SAS PROC FACTOR for factor extraction. The SAS macro polychor.sas was used to compute polychoric and tetrachoric correlations. (The tetrachoric correlation is the polychoric correlation between two dichotomous items; the macro automatically computes tetrachoric correlations for dichotomous items.) The macro program can be downloaded from the SAS Institute's Web site at the following URL: http://www.sas.com/service/techsup/faq/stat_macro/polychor.html

Following computation of the polychoric correlations, the data analyst then factor analyzes the polychoric correlation matrix using PROC FACTOR, as shown in the sample syntax below.

    /* Library reference for the project data */
    libname data 'c:\my documents\nacs\baseline\data';

    /* Read the seven binary friendship items */
    data friend;
      infile 'c:\my documents\nacs\baseline\data\base7var.txt';
      input TRUST1 HOLD1 BACKUP1 MONEY1 PROBLEM1 TROUBLE1 BUSINES1;
    run;

    proc means; run;

    /* Load the macro, then compute the tetrachoric correlation matrix */
    %inc 'c:\my documents\nacs\sas\prg\polychor.sas';
    %polychor(data=friend, var=TRUST1 HOLD1 BACKUP1 MONEY1 PROBLEM1 TROUBLE1 BUSINES1, out=tetcorb, type=corr);

    proc print; run;

    /* Iterated principal factoring with SMC priors and promax rotation */
    proc factor data=tetcorb method=prinit priors=smc scree residuals rotate=promax;
      var TRUST1 HOLD1 BACKUP1 MONEY1 PROBLEM1 TROUBLE1 BUSINES1;
    run;

Based on the obtained factor loadings, items were grouped into subscales. Finally, Cronbach's coefficient alpha was computed for each subscale as an index of internal consistency.

RESULTS

Descriptive Statistics

The proportions of respondents who endorsed each of the friendship items are listed in Table 1. Descriptive statistics for the eight Likert variables from the CESD are listed in Table 2. Within each set of variables, all items are skewed in the same direction.
For the seven friendship variables, we asked only about participants' 3-5 closest friends, so the restricted range in means (from 0.82 to 0.97) is not surprising. The overall mean for the eight depression variables is 1.55, indicating that participants had minimal levels of depression at the time of interview.

Table 1. Descriptive Statistics for Seven Binary Items from the Friendship Inventory (N = 742).

Item    N (of 742)   P* (Mean)   Label
1       721          0.97        Do you trust your F?
2       671          0.90        Would you let F hold something for you?
3       715          0.96        Would F ever back you up?
4       691          0.93        Would you lend F money if you had it?
5       610          0.82        Would you talk to F about personal prob.?
6       622          0.83        If you were in trouble would you turn to F?
7       616          0.83        Would you tell F your business?
Total                0.89

* P is the proportion of respondents who endorse the item.
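Endorsement rates as extreme as those in Table 1 are the setting in which Pearson correlations are expected to shrink most. The following simulation is a minimal sketch of that attenuation and is not part of the original analysis: the latent correlation of 0.60, the cutpoints, the sample size, the seed, and the names sim, z1, z2, y1, and y2 are all assumptions chosen for illustration. It dichotomizes two correlated normal variables and compares the Pearson correlation of the resulting binary items with the tetrachoric estimate that PROC FREQ reports under the PLCORR option.

```sas
/* Hypothetical simulation: two latent normal traits correlated 0.60 */
data sim;
   call streaminit(2002);                 /* arbitrary seed for reproducibility */
   do id = 1 to 5000;
      z1 = rand('normal');
      z2 = 0.6*z1 + sqrt(1 - 0.6**2)*rand('normal');
      y1 = (z1 > -1.5);                   /* about 93% of cases scored 1 */
      y2 = (z2 > -1.0);                   /* about 84% of cases scored 1 */
      output;
   end;
run;

/* Pearson correlations: latent pair (z1, z2) versus dichotomized pair (y1, y2) */
proc corr data=sim pearson;
   var z1 z2 y1 y2;
run;

/* Tetrachoric correlation for the 2 x 2 table of the binary items */
proc freq data=sim;
   tables y1*y2 / plcorr;
run;
```

Under these assumptions the Pearson correlation between y1 and y2 should come out well below 0.60, while the tetrachoric estimate should land close to it.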
Table 2. Distributions of the CESD Items (N = 201).

Item                        N     Prop (%)   Mean   SD
1  Shake off the blues                       1.64   0.86
     < 1 day                113   56
     1-2 days                58   29
     3-4 days                20   10
     5-7 days                10    5
2  Feel depressed                            1.68   0.90
     < 1 day                112   56
     1-2 days                54   27
     3-4 days                23   11
     5-7 days                12    6
3  My life is a failure                      1.45   0.84
     < 1 day                143   72
     1-2 days                35   18
     3-4 days                10    5
     5-7 days                12    6
4  Fearful                                   1.36   0.66
     < 1 day                146   73
     1-2 days                43   21
     3-4 days                 8    4
     5-7 days                 4    2
5  Restless                                  1.81   0.96
     < 1 day                100   50
     1-2 days                55   28
     3-4 days                29   15
     5-7 days                16    8
6  Feel lonely                               1      0.92
     < 1 day                130   65
     1-2 days                43   22
     3-4 days                11    5
     5-7 days                16    8
7  Crying spells                             1.26   0.65
     < 1 day                165   82
     1-2 days                26   13
     3-4 days                 4    2
     5-7 days                 6    3
8  Feel sad                                  1.67   0.85
     < 1 day                108   54
     1-2 days                63   31
     3-4 days                20   10
     5-7 days                10    5
Total                                        1.55   0.55

Correlation Coefficients

For the seven dichotomous friendship items, the Pearson correlations ranged from 0.19 to 0.48, while the corresponding tetrachoric correlations ranged from 0.41 to 0.75, almost twice as large as the Pearson r values (see Table 3). The last three items correlate more highly than the first four items. For the eight Likert scaled CESD items, r ranges from 0.12 to 0.63; the corresponding polychoric values range from 0.25 to 0.73 (see Table 4). In general, the discrepancy between the polychoric and Pearson correlations is larger for binary data than for Likert variables. This finding is reasonable because, as the number of categories increases, the items behave more like continuous variables. This result suggests that Pearson correlation coefficients may be adequate for Likert scale items with large numbers of categories, provided that research participants use the full scale (i.e., that sufficient variance exists for all items employed in the analysis).

Table 3. Pearson r and Tetrachoric Correlation Coefficients for Seven Binary Items from the Friendship Inventory (N = 742).

[Table body not recoverable from the source text.]

Note: Underlined coefficients are Pearson product-moment correlations; non-underlined values are polychoric correlation coefficients.

Table 4. Pearson r and Polychoric Correlation Coefficients for Eight Likert CESD Items (N = 201).

[Table body not recoverable from the source text.]

Note: Underlined coefficients are Pearson product-moment correlations; non-underlined values are polychoric correlation coefficients.

Factor Analyses: Friendship Inventory

The principal axis factor analysis results for the binary data for the two types of correlation coefficients, Pearson r and tetrachoric, are summarized in Table 5 and Table 6, respectively. The tables report the squared multiple correlation (SMC) of each item with all other items in the analysis. The tables also show the first few positive eigenvalues of the reduced correlation matrix, each expressed as a proportion of the sum of all eigenvalues
(Prop). The tables also display the factor loadings for the one and two factor solutions, the latter obtained via promax rotation. Finally, the tables report the root mean-square residual (RMS) for each solution, an index of badness of fit of the factor analysis model to the input data. Smaller RMS values indicate superior factor analysis solutions.

If the data analyst does not specify the number of factors to extract, SAS uses a proportion of variance criterion to determine the number of factors to retain. Like Kaiser's eigenvalue greater than one rule, this criterion is based on eigenvalues: it retains the smallest number of factors whose cumulative proportion of common variance reaches the threshold. Because the reduced correlation matrix has negative eigenvalues, the proportions for the leading factors can sum to more than 1.00. From a practical perspective, this means that the sum of the proportion values should reach or exceed 1.00 to ensure that a sufficient number of factors is extracted. For example, in Table 5 the first factor's proportion value is 1.15, which exceeds the threshold value of 1.00, so the analyst would retain a single factor from this solution. By contrast, in Table 6 the sum of two factors' proportion values is required (0.94 + 0.09 = 1.03) to exceed the 1.00 cutoff value, so two factors are retained from this solution.

Table 5. Factor Loadings for Seven Binary Friendship Inventory Items Using Pearson Correlation Coefficients (N = 742).

                          1-Factor   2-Factors
Item     SMC     Prop*    F1         F1      F2
1        0.24    1.15     0.53       0.63
2        0.24    0.16     0.55       0.55
3        0.26    0.02     0          0.45
4        0.19             0.47       0.52
5        0.30             0.56               0.77
6        0.28             0.59               0.47
7        0.31             0.59               0.63
RMS:                      0.064      0.026
Alpha:                    0.73       0.64    0.69

* Proportion of each eigenvalue to the sum of all eigenvalues.

Table 6. Factor Loadings for Seven Binary Friendship Inventory Items Using Tetrachoric Correlation Coefficients (N = 742).

                          1-Factor   2-Factors
Item     SMC     Prop*    F1         F1      F2
1        0.71    0.94     0.85       0.80
2        0.61    0.09     0.76       0.79
3        0.77    0.04     0.87       0.47    0.47
4        0.55             0.69       0.76
5        0.64             0.74               0.94
6        0.69             0.79               0.66
7        0.65             0.77               0.66
RMS:                      0.077      0.040

* Proportion of each eigenvalue to the sum of all eigenvalues.
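The retention rule described above can be examined directly in PROC FACTOR. The sketch below assumes the friend and tetcorb data sets created in the Procedure section and leaves the number of factors unspecified so that the proportion criterion decides. PROPORTION=1.0 is written out only to make the 1.00 threshold visible, on the assumption that it matches the default behavior described above; this is an illustration, not necessarily the exact code behind Tables 5 and 6.

```sas
/* Pearson-based run: raw data in, so PROC FACTOR computes Pearson correlations */
proc factor data=friend method=prinit priors=smc proportion=1.0
            scree residuals rotate=promax;
   var TRUST1 HOLD1 BACKUP1 MONEY1 PROBLEM1 TROUBLE1 BUSINES1;
run;

/* Tetrachoric-based run: the TYPE=CORR data set from the polychor macro is the input */
proc factor data=tetcorb method=prinit priors=smc proportion=1.0
            scree residuals rotate=promax;
   var TRUST1 HOLD1 BACKUP1 MONEY1 PROBLEM1 TROUBLE1 BUSINES1;
run;
```

Replacing PROPORTION=1.0 with NFACTORS=1 or NFACTORS=2 is the usual way to request the one and two factor solutions reported in the tables, and replacing METHOD=PRINIT with METHOD=ULS requests unweighted least-squares extraction.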
Several differences between these two solutions are evident. First, the factor loadings in the solution based on Pearson correlations are attenuated. Interestingly, this attenuation is mirrored in the factor intercorrelation values obtained in the two factor solutions: r = 0 for the solution based on Pearson correlations whereas r = 0.67 for the solution based on tetrachoric correlations. Second, an investigator using Pearson correlations exclusively would extract a single factor via the proportion criterion, whereas an analyst using the tetrachoric correlation matrix as the input to the factor analysis would extract two factors via the proportion criterion. The reliability for the scale derived from the single factor solution is 0.73, whereas alpha is 0.64 and 0.69 for the subscales based on the first four and last three items, respectively, in the dual factor solution.

Factor Analyses: CESD

The factor analysis results for the Likert scale items based on Pearson and polychoric correlations are summarized in Table 7 and Table 8, respectively.

Table 7. Factor Loadings for Eight CESD Likert Items Using Pearson Correlation Coefficients (N = 201).

                          1-Factor   2-Factors
Item     SMC     Prop*    F1         (F1, F2)
1        0.47    1.01     0.73       0.49
2        0.55    0.09     0.76       0.85
3        0.43    0.07     0.65       0.67
4        0.17    0.02     0.39       0.40
5        0.14             0.32
6        0.47             0.70       0.51
7        0.25             0.46       0.64
8        0.58             0.80       0.73
RMS:                      0.061      0.043
Alpha:                    0.82       0.76    0.74

* Proportion of each eigenvalue to the sum of all eigenvalues.
[The source does not preserve which of F1 or F2 each 2-factor loading belongs to; values are listed in printed order.]

Table 8. Factor Loadings for Eight CESD Likert Items Using Polychoric Correlation Coefficients (N = 201).

                          1-Factor   2-Factors
Item     SMC     Prop*    F1         (F1, F2)
1        0.58    0.96     0.78       0.52    0.34
2        0.68    0.07     0.80       0.96
3        0.58    0.06     0.74       0.58
4        0.24    0.03     0.45
5        0.20             0.37       0.32
6        0.61             0.76       0.61
7        0.50             0.66       0.70
8        0.75             0.88       0.79
RMS:                      0.063      0.050

* Proportion of each eigenvalue to the sum of all eigenvalues.
[The source does not preserve which of F1 or F2 each 2-factor loading belongs to; values are listed in printed order.]
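The Alpha rows in Tables 5 and 7 are Cronbach's coefficient alpha values. One way to obtain such values in SAS is the ALPHA option of PROC CORR. The sketch below assumes the friend data set from the Procedure section and uses the grouping stated above for the friendship items (first four items, last three items); the NOMISS option, which drops cases with any missing item, is an added assumption. It illustrates the step and is not necessarily the code used for the reported values.

```sas
/* Alpha for all seven friendship items (single factor scale) */
proc corr data=friend alpha nomiss;
   var TRUST1 HOLD1 BACKUP1 MONEY1 PROBLEM1 TROUBLE1 BUSINES1;
run;

/* Alpha for the subscale formed by the first four items */
proc corr data=friend alpha nomiss;
   var TRUST1 HOLD1 BACKUP1 MONEY1;
run;

/* Alpha for the subscale formed by the last three items */
proc corr data=friend alpha nomiss;
   var PROBLEM1 TROUBLE1 BUSINES1;
run;
```

Each step prints raw and standardized alpha along with the alpha obtained when each item is deleted in turn, which can help flag items that weaken a subscale.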
As was the case in the analyses of the friendship inventory, the factor analysis based on Pearson correlations returns a one factor solution whereas the analysis based on polychoric correlations yields a two factor solution. Unlike the analysis of dichotomous items, however, the differences between the factor loadings based on the Pearson r and polychoric correlation matrices are negligible. Furthermore, the factor intercorrelation values for the two factor solutions derived from the Pearson and the polychoric correlation input matrices are identical: 0.66. Alpha is 0.82 for the eight items treated as a single factor scale, and 0.76 and 0.74 for the subscales derived from the two factor solution.

Factor Extraction Methods

To compare factor extraction methods, unweighted least-squares extraction was performed on the correlation matrices for each survey instrument. The results are not listed in the tables due to limited space. However, ULS extraction produced results highly similar to those of the PAF method with both binary and Likert scale data. Interestingly, the ULS method proved more vulnerable to Heywood cases (i.e., negative residual variance estimates) with the Likert scale polychoric correlations; removing items 4 and 5, those with the weakest communality estimates, fixed this problem.

DISCUSSION

Many measurable qualities of interest to researchers are dichotomous. This paper provides two empirical examples comparing factor analysis results from two different methods of analysis. SAS PROC FACTOR analyzes a Pearson product-moment correlation matrix by default.
While this approach may not perform poorly for the analysis of Likert scaled data where research participants endorse a sufficiently wide range of scale points, our examples illustrate the serious attenuation of correlation coefficients and the attendant reduction of factor loading and factor intercorrelation values when factor analyzing a Pearson product-moment correlation matrix derived from binary items. This in turn can lead to misinterpretation of the dimensionality of the solution and possibly to the omission of valuable survey items due to artificially low factor loading values for those items in a factor analysis based on Pearson correlations.

Fortunately, SAS Institute provides the readily available and easily employed polychor.sas macro program to compute polychoric and tetrachoric correlation coefficients. These coefficients may in turn be factor analyzed by SAS PROC FACTOR, as our example syntax demonstrates. This option affords SAS users a convenient method of comparing the results derived from the standard Pearson correlation-based factor analysis with those obtained from tetrachoric or polychoric correlations, a valuable analysis tool.

SAS PROC FACTOR gives the data analyst the flexibility to extract factors from correlation matrices using a number of extraction methods, including principal axis factoring and unweighted least squares. Some extraction methods may perform better with Pearson correlations whereas other methods may perform better with polychoric and tetrachoric correlations. The interaction of extraction method and input correlation matrix type is an area deserving further exploration and research.

The use of polychoric and tetrachoric correlations is not without limitations, however. Polychoric and tetrachoric correlations are assumed to be derived from a set of underlying normally distributed latent variables, an untestable assumption that may not be true in the population from which the researcher draws her data.
In addition, because each polychoric correlation coefficient is computed separately, a matrix of these coefficients may not be positive definite (i.e., it may be non-Gramian) (Gorsuch, 1974; Cohen and Cohen, 1983).

Special software programs designed for the factor analysis of categorical and dichotomous outcome data also exist and may be of use for extensive analyses involving such data (e.g., Mplus; Muthen & Muthen, 2001). These programs compute tetrachoric and polychoric correlation matrices internally as part of the factor analysis procedure, thereby saving the data analyst the two step process outlined above. Furthermore, the Mplus program generates several goodness-of-fit statistics that may be of use to data analysts in choosing the number of factors to retain from any given factor analysis solution.

Nonetheless, despite the advantages of special purpose software programs and the limitations inherent in the SAS-based approach documented above, we believe that the use of tetrachoric and polychoric correlation coefficients in conjunction with PROC FACTOR in SAS enables data analysts to obtain reasonable answers to research questions involving the factor analysis of dichotomous and polytomous survey items.

REFERENCES

Cohen, J. & Cohen, P. (1983). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc., Publishers.

Dolcini, M. M., Happer, G., Watson, S., Han, L., Ellen, J., & Catania, J. (2001). The structure and quality of adolescent friendships in an urban African American neighborhood. Paper presented at the biennial meeting of the Society for Research in Child Development, Minneapolis, MN.

Gorsuch, R. L. (1974). Factor Analysis. Philadelphia, PA: W. B. Saunders Company.

McDonald, R. P. (1985). Factor Analysis and Related Methods. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc., Publishers.
Muthen, B. (1978). Contributions to factor analysis of dichotomous variables. Psychometrika, 551-560.

Muthen, L. & Muthen, B. (2001). Mplus User's Guide, v2. Los Angeles, CA: Muthen & Muthen.

Uebersax, J. S. (2000). Estimating a latent trait model by factor analysis of tetrachoric correlations. Web site: http://ourworld.compuserve.com/homepages/jsuebersax/irt.htm

Waltman, K. K. & Dunbar, S. B. (1994). Dimensions of content and difficulty in binary test items. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans.

CONTACT INFORMATION

Lei Han, Ph.D.
Statistician
Center for AIDS Prevention Studies
University of California, San Francisco
74 New Montgomery St. Suite 600
San Francisco, CA 94105
Phone: (415) 597-9208
Fax: (415) 597-9395
E-mail Address: lhan@osg.ncsfedu

Tor Neilands, Ph.D.
Specialist/Senior Statistician

Margaret Dolcini, Ph.D.
Principal Investigator

Center for AIDS Prevention Studies
University of California, San Francisco
74 New Montgomery St. Suite 600
San Francisco, CA 94105

AUTHOR NOTES

The authors are grateful to Dr. Lance Pollack and Melissa Krone for an earlier review of a draft of this article.