Correspondence analysis and Related Methods, Part 2: Between-set versus within-set




1. What is multiple correspondence analysis (MCA)?
2. Why is MCA so useful as a method of visualizing questionnaire data?
3. How is MCA implemented in XLSTAT?

Classical or simple CA analyses the relationships between two variables, although the method has been extended to analyse other forms of tabular data, for example the product attribute data shown previously, as well as ratings and preferences, on an individual or aggregate level.

Multiple CA analyses several categorical variables where we are interested in all the relationships within the set of variables, not between one set and another. The best way to understand the difference is to see the data format for the MCA program in XLSTAT: individual-level responses to several questions.

Between-set versus within-set

Questions: Should a woman work full-time, work part-time or stay at home, or missing data [4 response categories]:
(Q1) before she has children;
(Q2) when she has a preschool child;
(Q3) when children are still at school;
(Q4) when all children have left home.
Demographics: Country [24], Sex [2], Age group [6]

[Data layout: responses to the four questions concerning working women, followed by the demographic categories]
Source: Family & Changing Gender Roles Survey, ISSP (1994)

Between-set means that there are two sets of variables and we are interested in the relationships between them, e.g. between the demographics and the question responses.

Within-set means that there is one set of variables and we are interested in the relationships amongst them, e.g. amongst the question responses. This is the multiple correspondence analysis (MCA) case.

Between-set example: Simple CA

Q3: Should a woman with a child at school work full-time, part-time or stay at home?

Columns: W = work full-time, w = work part-time, H = stay at home, ? = DK/unsure/missing, Total.
Rows (COUNTRY): AUS, DW, DE, GB, NIRL, USA, A, H, I, IRL, NL, N, S, CZ, SLO, PL, BG, RUS, NZ, CDN, IL, J, E, RP, followed by the Total row and the average profile.
[Table of counts by country and response; the cell values are garbled in the transcription]
Source: Family & Changing Gender Roles Survey, ISSP (1994)

[Figure: simple CA map of the 24 countries and the response categories W, w, H, ?; principal inertias 0.0737 (50.6%) and 0.0531 (36.5%), 87.1% of the inertia displayed]

Simple CA of multiway tables

Each country is split by gender, giving 48 country-gender groups. We say that the variables country and gender are interactively coded.

Columns as before: W, w, H, ?, Total.
Rows: AUSm, AUSf, DWm, DWf, ..., RPm, RPf, followed by the Total row and the average profile.
[Table of counts by country-gender group and response; the cell values are garbled in the transcription]

The average profile stays the same, so the definitions of the centre and of the geometric distance remain identical to those of the previous map. All that has been done is to split each country point into two profiles.

[Figure: simple CA map of the 48 country-gender groups and the response categories; principal inertias 0.0797 (51.5%) and 0.0546 (35.3%), 86.8% of the inertia displayed]

Inertia before the split: 0.1456. Inertia with the male/female split: 0.1546, so 5.8% of the inertia is due to male/female differences.
Ireland (IRL) has the largest male/female difference.
Bulgaria (BG) is the only country with a reverse male/female difference.
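The effect of interactive coding can be checked numerically. The following is a minimal sketch in Python with NumPy, not XLSTAT's implementation: the counts are hypothetical (they are NOT the ISSP data), and the helper name `simple_ca` is an illustrative choice. It shows that splitting each row by gender leaves the average profile unchanged while the total inertia can only increase.

```python
import numpy as np

def simple_ca(table):
    """Masses and principal inertias of a simple CA (SVD of standardized residuals)."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses = average row profile
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    singular_values = np.linalg.svd(S, compute_uv=False)
    return r, c, singular_values ** 2    # principal inertias

# Hypothetical counts (NOT the ISSP data): 3 countries x 2 genders,
# columns W, w, H, ?. Rows alternate male, female within each country.
split = np.array([
    [30, 12,  5, 3], [20, 18, 10, 2],
    [ 8, 25, 15, 2], [12, 20, 15, 3],
    [ 3, 12, 32, 3], [ 7, 18, 23, 2],
])
merged = split[0::2] + split[1::2]       # collapse gender: one row per country

_, c_merged, inertias_merged = simple_ca(merged)
_, c_split, inertias_split = simple_ca(split)

total_merged = inertias_merged.sum()
total_split = inertias_split.sum()
share_due_to_gender = (total_split - total_merged) / total_split

print("average profile unchanged:", np.allclose(c_merged, c_split))
print(f"total inertia before split: {total_merged:.4f}")
print(f"total inertia after split:  {total_split:.4f}")
print(f"share due to gender:        {share_due_to_gender:.1%}")
```

The last quantity is computed in the same way as the "due to M F" percentage quoted for the survey data: the increase in inertia divided by the inertia of the split table.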

Simple CA of multiway tables

[Figure: simple CA map of the country-gender-age groups (e.g. CDNf<25, NZm>66, PLm26-35, PLm>66, Hm>66) and the response categories W, w, H, ?; the two axes account for 54.3% and 33.0% of the inertia, 87.3% in total]

Interactive coding of country (24), gender (2) and age (6), giving 288 combinations.
Points tend to lie in a curved pattern (called the arch or horseshoe).
Points that lie inside the arch are polarized, e.g. PLm26-35, where few choose w and the responses are divided between W and H, whereas for NZm>66 73% choose w (average: 53% w).

Stacked tables

Row variables: Country (24), Gender (2), Age (6), Education (7), Marital status (5), Social class (8). Columns: W, w, H, ?.

Each variable is separately cross-tabulated with the question and the tables are then stacked one on top of another. Since the column margins of each table are identical (and the same as those of the interactively coded tables before), the basic geometry remains the same. It is just the detail that is sacrificed here: all the information is collapsed into main effects.

The inertia of the stacked table is the average of the inertias of its subtables.

Stacked tables

Should a (married) woman work full-time, part-time or stay at home...
... before having children;
... with a preschool child;
... with a child at school;
... when her children have left home?

Tables can be stacked row-wise and column-wise, adding the additional questions as columns (each with categories W, w, H, ?). This gives 24 contingency tables in a 6 × 4 pattern. Tables in the same row share their row margins and tables in the same column share their column margins. The inertia of the stacked table is again the average of the inertias of its subtables.

[Figure: CA map of the stacked tables, "Women in the workplace and 6 demographic variables", showing the countries, the categories of gender, age, education, marital status and social class, and the question responses]

Relationships between each demographic variable and each question are displayed jointly.
Relationships within questions and relationships within demographics are not displayed explicitly.
Join the categories of an ordinal variable, for example age, to see trends.
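The statement that the inertia of a stacked table is the average of the inertias of its subtables can be verified on simulated data. This is a minimal sketch in Python with NumPy. The respondents, the three demographic variables and the helper names `total_inertia` and `crosstab` are all hypothetical. The result relies on every subtable being built from the same respondents, so that all subtables share the same grand total and column margins.

```python
import numpy as np

def total_inertia(table):
    """Total inertia = Pearson chi-square statistic divided by the grand total."""
    N = np.asarray(table, dtype=float)
    n = N.sum()
    expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
    return ((N - expected) ** 2 / expected).sum() / n

def crosstab(row_codes, col_codes, n_rows, n_cols):
    """Contingency table of two integer-coded variables."""
    table = np.zeros((n_rows, n_cols))
    np.add.at(table, (row_codes, col_codes), 1)
    return table

rng = np.random.default_rng(0)
n = 2000
response = rng.integers(0, 4, size=n)            # simulated answers: W, w, H, ?

# Simulated demographics, loosely tied to the response so the tables are not pure noise.
category_counts = {"gender": 2, "age": 6, "education": 7}
subtables = []
for k in category_counts.values():
    noise = rng.integers(0, k, size=n)
    codes = np.where(rng.random(n) < 0.3, response % k, noise)
    subtables.append(crosstab(codes, response, k, 4))

stacked = np.vstack(subtables)                   # 15 rows x 4 columns
inertia_stacked = total_inertia(stacked)
inertia_mean = np.mean([total_inertia(t) for t in subtables])
print(inertia_stacked, inertia_mean)             # the two values agree
```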

Multiple correspondence analysis (MCA)

West & East German samples only.

[Original data and indicator matrix shown side by side: each row is a respondent with a response code from 1 to 4 for each of Qu. 1 to Qu. 4; in the indicator matrix each question becomes four dummy columns W, w, H, ?; and so on for 345 rows]

Response data are recoded as dummy variables: N rows, Q questions, the q-th question has $J_q$ categories, and the total number of categories is J (here N = 345, Q = 4, $J_q = 4$ for all q, J = 16).

One definition of MCA is that it is the CA of the indicator matrix.

MCA: XLSTAT initial output

Total inertia: 3
[Table: eigenvalues, percentages of inertia and cumulative percentages for F1 to F5, followed by the adjusted inertias, adjusted percentages and cumulative adjusted percentages; most values are garbled in the transcription]

Total inertia in MCA of the indicator matrix Z: $(J - Q)/Q = (16 - 4)/4 = 3$

Multiple correspondence analysis (MCA): Burt matrix

[Burt matrix: 16 × 16 table of counts with rows and columns 1W, 1w, 1H, 1?, ..., 4W, 4w, 4H, 4?; the cell values are garbled in the transcription]

The Burt matrix is the stacked matrix of all two-way contingency tables, including each variable with itself. If Z (N × J) is the indicator matrix, then the Burt matrix B (J × J) is $B = Z^T Z$. An alternative definition of MCA is that it is the CA of the Burt matrix.

MCA (Burt matrix version)

[Figure: MCA map of the 16 response categories 1W, ..., 4?; principal inertias 0.479 (41.9%) and 0.263 (23.0%)]

Results are the same for the Burt matrix; only the principal inertias change.
Relationships amongst (within) the set of questions are displayed jointly.
The missing-value categories are strongly associated with one another.
64.9% of the inertia is displayed (only 40.2% if the indicator matrix is analysed).
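The two definitions of MCA can be compared on a small simulated example. This is a minimal sketch in Python with NumPy, not XLSTAT's routine: the responses are simulated (they are NOT the German samples) and the helper names `indicator_matrix` and `principal_inertias` are illustrative. It builds Z and B = Z^T Z, confirms that the total inertia of the indicator matrix is (J - Q)/Q, and confirms that the principal inertias of the Burt matrix are the squares of those of the indicator matrix.

```python
import numpy as np

def indicator_matrix(responses, n_categories):
    """Dummy-code an (N, Q) array of integer codes into an N x J 0/1 matrix."""
    responses = np.asarray(responses)
    blocks = [np.eye(J_q)[responses[:, q]] for q, J_q in enumerate(n_categories)]
    return np.hstack(blocks)

def principal_inertias(table):
    """Principal inertias of the CA of a non-negative table."""
    P = table / table.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return np.linalg.svd(S, compute_uv=False) ** 2

# Simulated respondents: Q = 4 questions with 4 categories each, answers
# partly driven by a shared latent preference so the questions are associated.
rng = np.random.default_rng(1)
N, Q = 500, 4
n_categories = [4] * Q
latent = rng.integers(0, 4, size=(N, 1))
noise = rng.integers(0, 4, size=(N, Q))
responses = np.where(rng.random((N, Q)) < 0.5, latent, noise)

Z = indicator_matrix(responses, n_categories)    # N x J indicator matrix
B = Z.T @ Z                                      # J x J Burt matrix
J = Z.shape[1]

inertias_Z = principal_inertias(Z)
inertias_B = principal_inertias(B)

print("total inertia of Z:", inertias_Z.sum(), "expected:", (J - Q) / Q)
print("Burt inertias equal squared indicator inertias:",
      np.allclose(inertias_B, inertias_Z ** 2, atol=1e-10))
```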

Multiple correspondence analysis (MCA): Burt matrix, inertias of each subtable

[Burt matrix as before, with the inertia of each of its 4 × 4 subtables superimposed: each diagonal subtable (a question cross-tabulated with itself) has inertia 3, while the off-diagonal subtables have much smaller inertias]

MCA (adjusted)

The percentage of inertia explained is actually much higher: in MCA the overall inertia is inflated by the diagonal tables in the Burt matrix. The percentage is actually about 90%.

The total inertia of the Burt matrix is the average of the inertias of its submatrices (= 1.143). Since the diagonal inertias are so high, they inflate the average, hence the low percentages.

Adjustment of principal inertias (eigenvalues)

We can rescale an existing MCA solution in order to best fit the off-diagonal tables. All we need is the total inertia of the Burt matrix, inertia(B), and the principal inertias $\lambda_k$ of the Burt matrix in the solution space. If we have computed the solution on the indicator matrix Z (as in the MCA module of XLSTAT), the eigenvalues calculated are $\sqrt{\lambda_k}$, so the squares of all the principal inertias of Z need to be summed in order to get inertia(B). If you have analysed the Burt matrix B, inertia(B) is the total inertia.

Here are the steps to rescale the solution:

1. Calculate the average off-diagonal inertia:
$$\text{average off-diagonal inertia} = \frac{Q}{Q-1}\left(\text{inertia}(B) - \frac{J-Q}{Q^2}\right)$$
2. Calculate the adjusted principal inertias:
$$\text{adjusted principal inertias} = \left(\frac{Q}{Q-1}\right)^2\left(\sqrt{\lambda_k} - \frac{1}{Q}\right)^2 \quad \text{only for } \sqrt{\lambda_k} > \frac{1}{Q}$$
3. Calculate the adjusted percentages of inertia:
$$\text{adjusted percentages of inertia} = \frac{\text{adjusted principal inertias}}{\text{average off-diagonal inertia}}$$

[Figures: the same MCA map shown twice. MCA (Burt matrix version): principal inertias 0.479 (41.9%) and 0.263 (23.0%), 64.9% of the inertia displayed. MCA (adjusted): principal inertias 0.347 (66.2%) and 0.123 (23.5%), 89.7% of the inertia displayed]
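The three rescaling steps can be sketched in a few lines. This is a minimal illustration in Python with NumPy under stated assumptions, not XLSTAT's code: the function name `adjust_inertias` is hypothetical, its input is assumed to be the complete set of principal inertias from the CA of the indicator matrix (the square roots of the Burt principal inertias), and the responses are simulated.

```python
import numpy as np

def adjust_inertias(indicator_inertias, Q, J):
    """Rescale MCA principal inertias to fit the off-diagonal Burt subtables.

    indicator_inertias: ALL principal inertias of the indicator matrix Z,
    i.e. sqrt(lambda_k), where lambda_k are the Burt principal inertias.
    """
    root = np.asarray(indicator_inertias, dtype=float)
    inertia_B = np.sum(root ** 2)                          # total inertia of B
    # Step 1: average off-diagonal inertia
    avg_off_diag = Q / (Q - 1) * (inertia_B - (J - Q) / Q ** 2)
    # Step 2: adjusted principal inertias, only where sqrt(lambda_k) > 1/Q
    keep = root > 1.0 / Q
    adjusted = (Q / (Q - 1)) ** 2 * (root[keep] - 1.0 / Q) ** 2
    # Step 3: adjusted percentages of inertia
    percentages = 100 * adjusted / avg_off_diag
    return avg_off_diag, adjusted, percentages

# Simulated data (hypothetical): Q = 4 questions, 4 categories each.
rng = np.random.default_rng(2)
N, Q, J = 500, 4, 16
latent = rng.integers(0, 4, size=(N, 1))
noise = rng.integers(0, 4, size=(N, Q))
responses = np.where(rng.random((N, Q)) < 0.5, latent, noise)
Z = np.hstack([np.eye(4)[responses[:, q]] for q in range(Q)])

# CA of the indicator matrix via the SVD of its standardized residuals.
P = Z / Z.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
inertias_Z = np.linalg.svd(S, compute_uv=False) ** 2

avg_off_diag, adjusted, percentages = adjust_inertias(inertias_Z, Q, J)
print("raw percentages, first two axes:     ", 100 * inertias_Z[:2] / inertias_Z.sum())
print("adjusted percentages, first two axes:", percentages[:2])
```

A quick sanity check on the formulas: with Q = 2 questions of two categories each and indicator inertias 0.8 and 0.2, the function returns an average off-diagonal inertia of 0.36 and a single adjusted inertia of 0.36, that is 100%, which is the simple CA of the single two-way table.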

MCA: Women in the workplace, supplementary demographic groups

[Figure: MCA map with the demographic categories (country DW/DE, gender M/F, age groups, education levels, marital status) displayed as supplementary points]

Related topics

1. Subset correspondence analysis: restricting the analysis to a subset of categories (e.g. all substantive responses excluding missing categories, or missing categories by themselves, or middle categories).
2. Square asymmetric tables: mobility tables, brand-switching, migration...
3. Recoding of data before applying CA: ratings, preferences, paired comparisons, continuous-scale data (ratio and interval).
4. Stability and inference: concentration ellipses, convex hulls, permutation tests.
5. Canonical correspondence analysis (CCA): CA with explanatory variables (a combination of dimension reduction and regression).

Subset correspondence analysis

For example, analysing the women working data but ignoring the missing values. This is NOT just a CA of the table without the missing-value columns: the masses and metric of the complete matrix are maintained. In XLSTAT's MCA program you are given a menu for selecting which categories you want to retain or omit.

[Figure: subset correspondence analysis map of the substantive response categories, missing-value categories omitted]

Canonical correspondence analysis (CCA)

This has the same objective as CA but restricts the CA solution to be (linearly) related to external predictor variables. For example, we want to find the best low-dimensional view of the responses which is related to age (either age group or the original age variable).

[Figure: canonical correspondence analysis restricted to age group differences, showing the age groups agegp-1 to agegp-6 and the response categories Q1-1 to Q4-4]