Correspondence Analysis and Related Methods, Part 2: Between-set versus within-set

Transcription

1 Correspondence Analysis and Related Methods, Part 2

1. What is multiple correspondence analysis (MCA)?
2. Why is MCA so useful as a method of visualizing questionnaire data?
3. How is MCA implemented in XLSTAT?

Classical or simple CA analyses the relationships between two variables, although the method can be extended to analyse other forms of tabular data, for example the product-attribute data shown previously, as well as ratings and preferences, at an individual or aggregate level.

Multiple CA analyses several categorical variables where we are interested in all the relationships within the set of variables, not between one set and another. The best way to understand the difference is to look at the data format required by the MCA program in XLSTAT: individual-level responses to several questions.

Between-set versus within-set

Questions: Should a woman work full-time, work part-time or stay at home (or missing data) [4 response categories]:
(Q1) before she has children;
(Q2) when she has a preschool child;
(Q3) when children are still at school;
(Q4) when all children have left home.
Demographics: Country [24], Sex [2], Age group [6].

Data layout: the responses to the four questions concerning working women, followed by the demographic categories.
Source: Family & Changing Gender Roles Survey, ISSP (1994).

Between-set means that there are two sets of variables and we are interested in the relationships between them, e.g. between the demographics and the question responses.

Within-set means that there is one set of variables and we are interested in the relationships amongst them, e.g. amongst the question responses. This is the multiple correspondence analysis (MCA) case.

2 Between-set example: Simple CA

Q3: Should a woman with a child at school work full-time, part-time or stay at home?

Table (counts omitted): the rows are the 24 countries AUS, DW, DE, GB, NIRL, USA, A, H, I, IRL, NL, N, S, CZ, SLO, PL, BG, RUS, NZ, CDN, IL, J, E, RP, plus a Total row giving the average profile. The columns are W (work full-time), w (work part-time), H (stay at home), ? (DK/unsure/missing) and Total.
Source: Family & Changing Gender Roles Survey, ISSP (1994).

Simple CA of multiway tables

Each country is split by gender, giving 48 country-gender groups. We say that the variables country and gender are interactively coded. The table has the same columns as before and the rows AUSm, AUSf, DWm, DWf, ..., RPm, RPf, plus the Total row (average profile).

The average profile stays the same, so the definitions of the centre and of geometric distance remain identical to those of the previous map. All that has been done is to split each country point into two profiles.

Map: simple CA of the 24 countries and the response categories W, w, H, ?. The two axes account for 50.6% and 36.5% of the inertia, 87.1% in total.

Map: simple CA of the multiway table, with each country shown as a male point and a female point (e.g. IRLm, IRLf). The two axes account for 51.5% and 35.3% of the inertia, 86.8% in total.

Splitting by gender increases the total inertia, and the increase is the part of the inertia due to male-female differences. Ireland (IRL) has the largest male-female difference. Bulgaria (BG) is the only country with a reversed male-female difference.
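The statement that splitting each country by gender leaves the average profile unchanged can be checked on a small example. The sketch below is only an illustration: the counts are hypothetical (two made-up countries, not the ISSP data), and reading "% due to M F" as the relative increase in total inertia is an assumption.

```python
def total_inertia(table):
    """Total inertia of a two-way table: chi-square statistic / grand total."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2 / n


def average_profile(table):
    """Column totals divided by the grand total (the centre of the row profiles)."""
    n = sum(sum(row) for row in table)
    return [sum(col) / n for col in zip(*table)]


# Hypothetical counts; columns are W, w, H, ?
by_country = [
    [40, 100, 50, 10],  # country A
    [90, 80, 20, 10],   # country B
]
# The same respondents, interactively coded by country and gender
by_country_gender = [
    [15, 45, 35, 5],  # A, male
    [25, 55, 15, 5],  # A, female
    [40, 40, 15, 5],  # B, male
    [50, 40, 5, 5],   # B, female
]

inertia_before = total_inertia(by_country)
inertia_split = total_inertia(by_country_gender)
# Assumed reading of "% due to M F": the share of the split-table inertia
# that was not already present in the country-only table.
share_due_to_gender = (inertia_split - inertia_before) / inertia_split

print(average_profile(by_country))
print(average_profile(by_country_gender))
print(inertia_before, inertia_split, share_due_to_gender)
```

Because the column totals do not change when rows are split, the centre of the map stays where it was, while the total inertia can only stay the same or grow.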

3 Simple CA of multiway tables

Map: simple CA with interactive coding of country (24), gender (2) and age (6), giving 288 combinations (points labelled by country, gender and age group, e.g. NZm>66, PLm>66, Hm>66). The two axes account for 54.3% and 33.0% of the inertia, 87.3% in total.

Points tend to lie in a curved pattern (called an arch or horseshoe). Points that lie inside the arch are polarized: for example, one Polish male age group divides its responses between W and H with few w responses, whereas NZm>66 answers w 73% of the time, against an average of 53% w.

Stacked tables

Rows: Country (24), Gender (2), Age (6), Education (7), Marital status (5), Social class (8). Columns: W, w, H, ?.

Each variable is separately cross-tabulated with the question and the tables are then stacked one on top of another. Since the column margins of each table are identical (and the same as those of the interactively coded tables before), the basic geometry remains the same. It is just the detail that is sacrificed here: all the information is collapsed into main effects. The inertia of the stacked table is the average of the inertias of its subtables.

Stacked tables (continued)

Should a (married) woman work full-time, part-time or stay at home
... before having children;
... with a preschool child;
... with a child at school;
... when her children have left home?

Each of the four questions contributes a block of columns W, w, H, ? against the same six demographic variables. Tables can be stacked row-wise and column-wise, adding additional questions as columns. This gives 24 contingency tables in a 6 × 4 pattern, in which the row margins and column margins are the same. The inertia of the stacked table is the average of the inertias of its subtables.

Map: stacked tables, women in the workplace and 6 demographic variables, showing the countries, the response categories and the demographic categories.

Relationships between each demographic variable and each question are displayed jointly. Relationships within the questions and relationships within the demographics are not displayed explicitly. Join the categories of an ordinal variable, for example age, to see trends.
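The property that the inertia of a stacked table is the average of the inertias of its subtables can be verified numerically. The following is a minimal sketch with twelve hypothetical respondents and only two demographic variables. The helper names `crosstab` and `total_inertia` are illustrative and not part of XLSTAT.

```python
from collections import Counter


def crosstab(row_values, col_values, row_levels, col_levels):
    """Contingency table of two categorical variables measured on the same people."""
    counts = Counter(zip(row_values, col_values))
    return [[counts[(r, c)] for c in col_levels] for r in row_levels]


def total_inertia(table):
    """Total inertia of a two-way table: chi-square statistic / grand total."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2 / n


# Hypothetical respondents: (country, gender, answer to one question)
respondents = [
    ("A", "m", "W"), ("A", "m", "W"), ("A", "m", "w"),
    ("A", "f", "w"), ("A", "f", "H"), ("A", "f", "H"),
    ("B", "m", "W"), ("B", "m", "w"), ("B", "m", "w"),
    ("B", "f", "w"), ("B", "f", "w"), ("B", "f", "H"),
]
country = [r[0] for r in respondents]
gender = [r[1] for r in respondents]
answer = [r[2] for r in respondents]
answer_levels = ["W", "w", "H"]

by_country = crosstab(country, answer, ["A", "B"], answer_levels)
by_gender = crosstab(gender, answer, ["m", "f"], answer_levels)

# Stack the two tables on top of each other; the column margins are identical
# because both tables classify the same respondents.
stacked = by_country + by_gender

mean_inertia = (total_inertia(by_country) + total_inertia(by_gender)) / 2
print(total_inertia(stacked), mean_inertia)
```

The equality relies on every subtable being built from the same respondents, so that the expected counts in the stacked table coincide with those of each subtable.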

4 Multiple correspondence analysis (MCA)

West & East German samples only.

Original data and indicator matrix: each respondent's answers to Questions 1 to 4 are recoded as dummy variables, with one 0/1 column for each of the categories W, w, H, ? of each question, and so on for 345 rows.

There are N rows and Q questions, the q-th question has J_q categories, and the total number of categories is J (N = 345, Q = 4, J_q = 4 for all q, J = 16).

One definition of MCA is that it is the CA of the indicator matrix.

MCA: XLSTAT initial output

Total inertia: 3. For the factors F1 to F5 the output lists the eigenvalue, inertia (%), cumulative %, adjusted inertia, adjusted inertia (%) and cumulative % (values omitted).

Total inertia in MCA of the indicator matrix Z:

$$\frac{J - Q}{Q} = \frac{16 - 4}{4} = 3$$

Multiple correspondence analysis (MCA): Burt matrix

The Burt matrix is the stacked matrix of all two-way contingency tables, including the table of each variable with itself. Its rows and columns are the 16 response categories.

If Z (N × J) is the indicator matrix, then the Burt matrix B (J × J) is

$$B = Z^{\mathsf{T}} Z$$

An alternative definition of MCA is that it is the CA of the Burt matrix.

MCA (Burt matrix version)

Map: MCA of the Burt matrix, showing the 16 response categories of the four questions.

The results are the same for the Burt matrix as for the indicator matrix; only the principal inertias change. Relationships amongst (within) the set of questions are displayed jointly. The missing value categories are strongly associated with one another. The map displays 64.9% of the inertia (considerably less if the indicator matrix is analysed).
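The recoding into an indicator matrix, the Burt matrix B = Z^T Z and the total inertia (J - Q)/Q can be illustrated on a toy data set. This sketch assumes six hypothetical respondents, Q = 3 questions and three categories per question (so J = 9), which is smaller than the German sample described above. It computes inertias only, not the MCA coordinates.

```python
def total_inertia(table):
    """Total inertia of a two-way table: chi-square statistic / grand total."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2 / n


levels = ["W", "w", "H"]
# Hypothetical individual-level responses: one row per respondent, one answer per question
responses = [
    ("W", "W", "w"),
    ("w", "H", "w"),
    ("H", "H", "H"),
    ("w", "w", "W"),
    ("W", "w", "w"),
    ("H", "W", "H"),
]

# Indicator matrix Z: one 0/1 dummy column per category of each question
Z = [[1 if answer == level else 0 for answer in row for level in levels]
     for row in responses]
N, Q, J = len(Z), len(responses[0]), len(Z[0])

# Burt matrix B = Z^T Z: all two-way tables, including each question with itself
burt = [[sum(z[a] * z[b] for z in Z) for b in range(J)] for a in range(J)]

print(N, Q, J)
print(total_inertia(Z), (J - Q) / Q)
```

The total inertia of Z depends only on J and Q, not on the responses themselves, which is why it says little about how strongly the questions are associated.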

5 Multiple correspondence analysis (MCA): Burt matrix, inertias of each subtable

MCA (adjusted)

The percentage of variance explained is actually much higher than reported: in MCA the overall inertia is inflated by the diagonal tables of the Burt matrix. The percentage is actually about 90%.

The total inertia of the Burt matrix is the average of the inertias of its submatrices. Since the diagonal inertias are so high, they inflate the average, hence the low percentages.

Adjustment of principal inertias (eigenvalues)

We can rescale an existing MCA solution in order to best fit the off-diagonal tables. All we need is the total inertia of the Burt matrix, inertia(B), and the principal inertias $\lambda_k$ of the Burt matrix in the solution space. If we have computed the solution on the indicator matrix Z (as in the MCA module of XLSTAT), the eigenvalues calculated are $\sqrt{\lambda_k}$, so the squares of all the principal inertias of Z need to be summed in order to get inertia(B). If you have analysed the Burt matrix B, inertia(B) is the total inertia.

Here are the steps to rescale the solution:

1. Calculate the average off-diagonal inertia:

$$\text{average off-diagonal inertia} = \frac{Q}{Q-1}\left(\text{inertia}(B) - \frac{J-Q}{Q^{2}}\right)$$

2. Calculate the adjusted principal inertias:

$$\text{adjusted principal inertia}_k = \left(\frac{Q}{Q-1}\right)^{2}\left(\sqrt{\lambda_k} - \frac{1}{Q}\right)^{2} \quad \text{only for } \sqrt{\lambda_k} > \frac{1}{Q}$$

3. Calculate the adjusted percentages of inertia:

$$\text{adjusted percentage of inertia}_k = \frac{\text{adjusted principal inertia}_k}{\text{average off-diagonal inertia}}$$

Maps: the MCA (Burt matrix version) solution and its adjusted version, each annotated with its principal inertias and percentages of inertia.
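The three rescaling steps can be sketched as a small function. It assumes the input is the full list of non-trivial principal inertias from an MCA of the indicator matrix, so they are squared to obtain the Burt-matrix values. The function name `adjust_mca` and the example eigenvalues are hypothetical.

```python
import math


def adjust_mca(indicator_inertias, Q, J):
    """Rescale MCA principal inertias to fit the off-diagonal tables of the Burt matrix.

    indicator_inertias: all non-trivial principal inertias from the MCA of Z
    Q: number of questions, J: total number of categories
    """
    # Principal inertias of the Burt matrix are the squares of those of Z
    burt_inertias = [lam ** 2 for lam in indicator_inertias]
    inertia_b = sum(burt_inertias)

    # Step 1: average off-diagonal inertia
    avg_off_diagonal = Q / (Q - 1) * (inertia_b - (J - Q) / Q ** 2)

    # Step 2: adjusted principal inertias, only where sqrt(lambda_k) > 1/Q
    adjusted = [
        (Q / (Q - 1)) ** 2 * (math.sqrt(lam_b) - 1 / Q) ** 2
        for lam_b in burt_inertias
        if math.sqrt(lam_b) > 1 / Q
    ]

    # Step 3: adjusted percentages of inertia
    percentages = [100 * a / avg_off_diagonal for a in adjusted]
    return avg_off_diagonal, adjusted, percentages


# Hypothetical check with Q = 2 binary questions (J = 4): if the 2 x 2 table has
# simple-CA inertia 0.16, the indicator-matrix inertias are (1 +/- 0.4) / 2.
avg, adjusted, pct = adjust_mca([0.7, 0.3], Q=2, J=4)
print(avg, adjusted, pct)
```

With only two questions the adjustment recovers the simple CA of their cross-tabulation, which gives a convenient sanity check: the single adjusted inertia equals the average off-diagonal inertia, so the adjusted percentage is 100.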

6 MCA: Women in the workplace, with supplementary demographic groups

Map: MCA of the four questions, with the demographic categories (the two German samples DW and DE, gender, marital status, age groups and education groups) added as supplementary points.

Related topics

1. Subset correspondence analysis: restricting the analysis to a subset of categories (e.g. all substantive responses excluding missing categories, or missing categories by themselves, or middle categories).
2. Square asymmetric tables: mobility tables, brand-switching, migration.
3. Recoding of data before applying CA: ratings, preferences, paired comparisons, continuous-scale data (ratio and interval).
4. Stability and inference: concentration ellipses, convex hulls, permutation tests.
5. Canonical correspondence analysis (CCA): CA with explanatory variables (a combination of dimension reduction and regression).

Subset correspondence analysis

An example is analysing the women working data but ignoring the missing values. This is NOT just a CA of the table without the missing value columns: the masses and metric of the complete matrix are maintained. In XLSTAT's MCA program you are given a menu for selecting which categories you want to retain or omit.

Map: subset correspondence analysis of the substantive response categories.
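The difference between a subset analysis and simply deleting the missing-value column can be seen from the inertias alone. The sketch below uses a hypothetical 3 × 4 table with columns W, w, H, ? and computes only inertias, not the subset CA map itself. `column_inertias` is an illustrative helper, not an XLSTAT function.

```python
def column_inertias(table):
    """Contribution of each column to the total inertia, using the table's own margins."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    contributions = []
    for j, col_total in enumerate(col_totals):
        chi2_j = 0.0
        for i, row in enumerate(table):
            expected = row_totals[i] * col_total / n
            chi2_j += (row[j] - expected) ** 2 / expected
        contributions.append(chi2_j / n)
    return contributions


# Hypothetical counts; columns are W, w, H, ?
full_table = [
    [40, 100, 50, 10],
    [90, 80, 20, 10],
    [30, 50, 60, 60],
]

parts = column_inertias(full_table)
total = sum(parts)

# Subset analysis: keep the masses and metric of the complete table,
# and analyse only the W, w, H columns (or only the ? column).
subset_substantive = sum(parts[:3])
subset_missing = parts[3]

# Naive alternative: delete the ? column and recompute all margins
reduced_table = [row[:3] for row in full_table]
reduced_inertia = sum(column_inertias(reduced_table))

print(total, subset_substantive, subset_missing)
print(reduced_inertia)
```

Because the margins of the complete table are retained, the inertias of the two subsets add up exactly to the total inertia, which is not true of the table with the column deleted.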

7 Canonical correspondence analysis (CCA)

Map: canonical correspondence analysis restricted to age group differences, showing the response categories of Q1 to Q4 together with the age groups (agegp). The first axis accounts for 63.5% of the inertia.

This has the same objective as CA but restricts the CA solution to be (linearly) related to external predictor variables. For example, we want to find the best low-dimensional view of the responses which is related to age (either the age group or the original age variable).