Tutorial on Exploratory Data Analysis
1 Tutorial on Exploratory Data Analysis Julie Josse, François Husson, Sébastien Lê julie.josse at agrocampus-ouest.fr francois.husson at agrocampus-ouest.fr Applied Mathematics Department, Agrocampus Ouest useR! 2008, Dortmund, August 11th 1 / 48
2 Why a tutorial on Exploratory Data Analysis? Our research focuses on multi-table data analysis. We have been teaching exploratory data analysis for a long time (but in French...). We made an R package: the possibility to add supplementary information; a more geometrical point of view that allows drawing graphs; the possibility to propose new methods (taking into account different structures in the data); a user-friendly package oriented to practitioners (a very easy GUI). 2 / 48
3 Outline 1 Principal Component Analysis (PCA) 2 Correspondence Analysis (CA) 3 Multiple Correspondence Analysis (MCA) 4 Some extensions Please ask questions! 3 / 48
4 Multivariate data analysis Principal Component Analysis (PCA): continuous variables. Correspondence Analysis (CA): contingency tables. Multiple Correspondence Analysis (MCA): categorical variables. Dimensionality reduction: describe the dataset with a smaller number of variables. Techniques widely used for applications such as data compression, data reconstruction, preprocessing before clustering, etc. 4 / 48
5 Exploratory data analysis Descriptive methods Data visualization Geometrical approach: importance to graphical outputs Identification of clusters, detection of outliers French school (Benzécri) 5 / 48
6 Principal Component Analysis 6 / 48
7 PCA in R prcomp, princomp from stats; dudi.pca from ade4; PCA from FactoMineR. 7 / 48
8 PCA deals with which kind of data? Notations: Figure: Data table in PCA. x_i. denotes individual i; S = (1/I) X_c' X_c is the covariance matrix (X_c the centred data table); W = X_c X_c' is the inner-product matrix. PCA deals with continuous variables, but categorical variables can also be included in the analysis. 8 / 48
9 Some examples Many examples Sensory analysis: products - descriptors Environmental data: plants - measurements; waters - physico-chemical analyses Economy: countries - economic indicators Microbiology: cheeses - microbiological analyses etc. Today we illustrate PCA with: data decathlon: athletes performances during two athletics meetings data chicken: genomics data 9 / 48
10 Introduction PCA example: decathlon data, the performances of 41 athletes during two decathlon meetings, the Olympic Games and the Decastar (2004). 41 athletes (rows), 13 variables (columns): 10 continuous variables corresponding to the performances (100m, Long.jump, Shot.put, High.jump, 400m, 110m.hurdle, Discus, Pole.vault, Javeline, 1500m); 2 continuous variables corresponding to the rank and the points obtained; 1 categorical variable (Competition) corresponding to the athletics meeting (Decastar or OlympicG). The individuals cloud and the variables cloud help to interpret each other. [Extract of the data table omitted.] 10 / 48
11 Problems - objectives Individuals study: similarity between individuals across all the variables (Euclidean distance); partition of the individuals. Variables study: are there linear relationships between variables? Visualization of the correlation matrix; find some synthetic variables. Link between the two studies: characterization of the groups of individuals by the variables; particular individuals help to better understand the links between variables. 11 / 48
12 Two point clouds: the cloud of the individuals (rows) and the cloud of the variables (columns) of the data table X. [Figure omitted.] 12 / 48
13 Pre-processing: Mean centering, Scaling? Mean centering does not modify the shape of the cloud. Scaling: variables are always scaled when they are not in the same units (e.g. long jump in m or in cm, time for 1500 m in s). PCA is always centered and often scaled. 13 / 48
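The centering and scaling step above can be sketched in a few lines. This is an illustration in Python/NumPy (the tutorial itself uses R/FactoMineR), on hypothetical made-up values for two variables with very different units:

```python
import numpy as np

# Hypothetical toy data: 4 athletes, 2 variables on very different scales
# (long jump in metres, 1500 m time in seconds) -- the values are made up.
X = np.array([[7.1, 260.0],
              [7.4, 270.0],
              [6.9, 255.0],
              [7.6, 285.0]])

Xc = X - X.mean(axis=0)       # centering: moves the barycentre of the cloud to 0
Xs = Xc / X.std(axis=0)       # scaling: gives every variable variance 1

print(np.allclose(Xs.mean(axis=0), 0.0), np.allclose(Xs.std(axis=0), 1.0))
# -> True True
```

Centering leaves the shape of the cloud unchanged (it is a translation), while scaling makes the contribution of each variable independent of its unit.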
14 Individuals cloud Individuals lie in R^K. Similarity between individuals: Euclidean distance. Study the structure, i.e. the shape, of the individuals cloud. 14 / 48
15 Inertia Total inertia (multidimensional variance): the distance between the data and the barycentre g: I_g = (1/I) Σ_{i=1}^{I} ||x_i. − g||² = (1/I) Σ_{i=1}^{I} (x_i. − g)'(x_i. − g) = tr(S) = Σ_s λ_s (= K when the variables are scaled). 15 / 48
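The identity total inertia = tr(S) = Σ λ_s = K (for scaled data) can be checked numerically; a small sketch in Python/NumPy on simulated data (the dimensions 41 × 10 mirror the decathlon performances, the values are random):

```python
import numpy as np

rng = np.random.default_rng(0)
I, K = 41, 10                              # 41 individuals, 10 variables
X = rng.normal(size=(I, K))
Z = (X - X.mean(axis=0)) / X.std(axis=0)   # centred and scaled data

S = Z.T @ Z / I                            # covariance matrix of the scaled data
inertia = np.trace(S)                      # total inertia = tr(S)
eigvals = np.linalg.eigvalsh(S)

# tr(S) = sum of the eigenvalues = K for scaled variables
assert np.isclose(inertia, K)
assert np.isclose(eigvals.sum(), inertia)
```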
16 Fit the individuals cloud Find the subspace that best sums up the data: the closest one by projection. Figure: Camel or dromedary? 16 / 48
17 Fit the individuals cloud Projection of x_i. on the axis u_1: P_{u_1}(x_i.) = u_1 (u_1'u_1)^{-1} u_1' x_i.; coordinate: F_{u_1}(i) = <x_i., u_1>. Maximize the variance of the projected data = minimize the distance between the individuals and their projections. Best representation of the diversity and variability of the individuals; do not distort the distances between individuals. 17 / 48
18 Find the subspace (1) Find the first axis u_1, for which the variance of F_{u_1} = Xu_1 is maximized: u_1 = argmax_{u_1} var(Xu_1) with u_1'u_1 = 1. It leads to: var(F_{u_1}) = (1/I)(Xu_1)'(Xu_1) = u_1' (1/I) X'X u_1, i.e. maximize u_1' S u_1 with u_1'u_1 = 1. u_1 is the first eigenvector of S (associated with the largest eigenvalue λ_1): S u_1 = λ_1 u_1. This eigenvector is known as the first axis (loadings). 18 / 48
19 Find the subspace (2) Projected inertia on the first axis: var(F_{u_1}) = var(Xu_1) = λ_1. Percentage of variance explained by the first axis: λ_1 / Σ_k λ_k. Additional axes are defined in an incremental fashion: each new direction is chosen by maximizing the projected variance among all orthogonal directions. Solution: the K eigenvectors u_1, ..., u_K of the data covariance matrix corresponding to the K largest eigenvalues λ_1, ..., λ_K. 19 / 48
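The eigenvector characterization above can be verified directly: the variance of the data projected on the top eigenvector equals λ_1, and no other unit direction does better. A minimal sketch in Python/NumPy on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(100, 5))
Z -= Z.mean(axis=0)                     # centred data
S = Z.T @ Z / Z.shape[0]                # covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)    # eigenvalues in ascending order
u1, lam1 = eigvecs[:, -1], eigvals[-1]  # first axis = top eigenvector

F1 = Z @ u1                             # coordinates of the individuals on axis 1
assert np.isclose(F1.var(), lam1)       # projected variance equals λ1

# no other unit direction beats λ1 (λ1 maximizes the Rayleigh quotient)
u = rng.normal(size=5)
u /= np.linalg.norm(u)
assert (Z @ u).var() <= lam1 + 1e-12
```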
20 Example: graph of the individuals Individuals factor map (PCA), Dimension 1 (32.72%) × Dimension 2 (17.37%). [Map of the 41 athletes omitted.] We need the variables to interpret the dimensions of variability. 20 / 48
21 Can we interpret the individuals graph with the variables? Correlation between variable x.k and F_u (the vector of the individuals' coordinates, in R^I): plotting r(F_1, k) against r(F_2, k) for each variable k gives the correlation circle graph. 21 / 48
22 Can we interpret the individuals graph with the variables? Variables factor map (PCA), Dimension 1 (32.72%) × Dimension 2 (17.37%). Figure: Graph of the variables (100m, Long.jump, Shot.put, High.jump, 400m, 110m.hurdle, Discus, Pole.vault, Javeline, 1500m). 22 / 48
23 Find the subspace (1) Variables lie in R^I. Find the first dimension v_1 (in R^I) which maximizes the variance of the projected data. Projection of x.k on v_1: P_{v_1}(x.k) = v_1 (v_1'v_1)^{-1} v_1' x.k; coordinate: G_{v_1}(k) = <v_1, x.k>. Maximize Σ_{k=1}^{K} G_{v_1}(k)² = Σ_{k=1}^{K} cor(v_1, x.k)² = Σ_{k=1}^{K} cos²(θ_k) with v_1'v_1 = 1. v_1 is the best synthetic variable. 23 / 48
24 Find the subspace (2) Solution: v_1 is the first eigenvector of W = XX', the individuals' inner-product matrix (associated with the largest eigenvalue λ_1): W v_1 = λ_1 v_1. The next dimensions are the other eigenvectors. Dimensionality reduction: the principal components are linear combinations of the variables; a subset of components sums up the data. 24 / 48
25 Fit the variables cloud Variables factor map (PCA), Dimension 1 (32.72%) × Dimension 2 (17.37%). Figure: Graph of the variables. Same representation as before! What a wonderful result! 25 / 48
26 Projections... Only well projected variables (high cos² between the variable and its projection) can be interpreted! [Figure illustrating well and poorly projected points omitted.] 26 / 48
27 Exercises... With 200 independent variables and 7 individuals, what does your correlation circle look like? 27 / 48
28 Exercises... With 200 independent variables and 7 individuals, what does your correlation circle look like? library(FactoMineR) mat <- matrix(rnorm(7*200,0,1),ncol=200) PCA(mat) 28 / 48
29 Link between the two representations: transition formulae Su = λu, i.e. X'Xu = λu, hence XX'Xu = λXu, i.e. W(Xu) = λ(Xu): W F_u = λ F_u. Since W v = λ v, F_u and v are colinear. Since ||F_u|| = √λ and ||v|| = 1, we have: v = (1/√λ) F_u; G_v = X'v = (1/√λ) X'F_u; u = (1/√λ) G_v; F_u = Xu = (1/√λ) X G_v. Hence the transition formulae: F_s(i) = (1/√λ_s) Σ_{k=1}^{K} x_ik G_s(k) and G_s(k) = (1/√λ_s) Σ_{i=1}^{I} x_ik F_s(i). 29 / 48
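The colinearity of F_u and v, and the transition from the individuals' to the variables' coordinates, can be checked numerically. A sketch in Python/NumPy; note the normalization here uses S = Z'Z/I and W = ZZ'/I, so the exact scaling constants differ slightly from the slide's convention, but the structure of the result is the same:

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.normal(size=(30, 6))
Z -= Z.mean(axis=0)
I = Z.shape[0]

S = Z.T @ Z / I                        # K x K covariance matrix
W = Z @ Z.T / I                        # I x I inner-product matrix

lam, U = np.linalg.eigh(S)
lam1, u1 = lam[-1], U[:, -1]
F1 = Z @ u1                            # individuals' coordinates on axis 1

# F1 is an eigenvector of W for the same eigenvalue: W F1 = λ1 F1,
# so F1 and v1 are colinear
assert np.allclose(W @ F1, lam1 * F1)

v1 = F1 / np.linalg.norm(F1)           # unit-norm eigenvector of W
# transition: the variables' coordinates on axis 1 follow from the
# individuals' coordinates (here G1 = sqrt(λ1) u1)
G1 = Z.T @ v1 / np.sqrt(I)
assert np.allclose(G1, np.sqrt(lam1) * u1)
```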
30 Example on decathlon Figure: Individuals and variables representations, Dim 1 (32.72%) × Dim 2 (17.37%). [Superimposed maps of the athletes and of the variables omitted.] 30 / 48
31 Supplementary information (1) For the continuous variables: projection of these supplementary variables onto the dimensions (e.g. Rank and Points on the correlation circle). For the individuals: projection. Supplementary information does not participate in building the axes. 31 / 48
32 Supplementary information (2) How to deal with (supplementary) categorical variables? [Data table with the Competition column, and the table of the category means, omitted.] The categories are projected at the barycentre of the individuals who take them. 32 / 48
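The barycentre rule above is simply a group mean in the factor space. A small sketch in Python/NumPy with hypothetical axis-1 coordinates (the values and the 3-vs-3 split are made up; the real decathlon data has 41 athletes):

```python
import numpy as np

# Hypothetical axis-1 coordinates for 6 athletes, and their meeting
coords = np.array([-1.2, -0.8, -1.5, 0.9, 1.3, 1.1])
meeting = np.array(["Decastar", "Decastar", "Decastar",
                    "OlympicG", "OlympicG", "OlympicG"])

# each supplementary category is plotted at the barycentre (mean point)
# of the individuals carrying that category
barycentres = {m: coords[meeting == m].mean() for m in np.unique(meeting)}
print(barycentres)   # Decastar near -1.17, OlympicG at 1.1
```

The same computation is repeated on every axis to place the category point on the factor map.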
33 Supplementary information (3) Individuals factor map (PCA), Dimension 1 (32.72%) × Dimension 2 (17.37%), with the categories Decastar and OlympicG projected. Figure: Projection of the supplementary categorical variable. 33 / 48
34 Confidence ellipses Individuals factor map (PCA), Dimension 1 (32.72%) × Dimension 2 (17.37%). Figure: Confidence ellipses around the barycentre of each category (Decastar, OlympicG). 34 / 48
35 Number of dimensions? Percentage of variance explained by each axis: information brought by the dimension. Quality of the approximation with the first S dimensions: Q_S = Σ_{s=1}^{S} λ_s / Σ_{s=1}^{K} λ_s. Dimensionality reduction implies information loss. Number of components to retain? Retain much of the variability in the data (the other components are noise). Bar plot of the eigenvalues: scree test. Tests on eigenvalues, confidence intervals, ... 35 / 48
36 Percentage of variance obtained under independence Is there a structure in my data? library(FactoMineR) nr <- 50 nc <- 8 iner <- rep(0,1000) for (i in 1:1000) { mat <- matrix(rnorm(nr*nc,0,1),ncol=nc) iner[i] <- PCA(mat,graph=FALSE)$eig[2,3] } quantile(iner,0.95) 36 / 48
37 Percentage of variance obtained under independence Is there a structure in my data? Table: 95% quantile of the inertia on the first two dimensions of a PCA on data with independent variables, as a function of the number of individuals and the number of variables. [Table values omitted.] 37 / 48
39 Quality of the representation: cos² For the variables: only well projected variables (high cos² between the variable and its projection) can be interpreted! For the individuals (same idea): the distance between individuals can only be interpreted for well projected individuals. res.pca$ind$cos2 gives, for each individual (e.g. Sebrle, Clay, Karpov), its cos² on Dim.1, Dim.2, ... [values omitted]. 39 / 48
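The cos² of an individual on an axis is its squared coordinate divided by its squared distance to the centre of the cloud, so across all the axes the cos² of each individual sum to 1. A sketch in Python/NumPy (random data, principal axes obtained via SVD):

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.normal(size=(20, 4))
Z -= Z.mean(axis=0)

_, _, Vt = np.linalg.svd(Z, full_matrices=False)
F = Z @ Vt.T                        # individuals' coordinates on all 4 axes

d2 = (Z ** 2).sum(axis=1)           # squared distance of each individual to the centre
cos2 = F ** 2 / d2[:, None]         # cos² of individual i on axis s

# over all the axes, the cos² of an individual sum to 1
assert np.allclose(cos2.sum(axis=1), 1.0)
```

An individual is well represented on a plane when its cos² on the two axes of that plane add up to a value close to 1.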
40 Contribution Contribution to the inertia of the axis. For the individuals: Ctr_s(i) = F_s²(i) / Σ_{i=1}^{I} F_s²(i) = F_s²(i) / (I λ_s). Individuals with a large coordinate contribute the most to the construction of the axis; round(res.pca$ind$contrib,2) gives these contributions (in %) for each individual (e.g. Sebrle, Clay, Karpov) on each dimension [values omitted]. For the variables: Ctr_s(k) = G_s²(k) / λ_s = cor(x.k, v_s)² / λ_s. Variables highly correlated with the principal component contribute the most to the construction of the dimension. 40 / 48
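Since the coordinates on axis s have mean 0 and variance λ_s, the denominator Σ_i F_s²(i) equals I λ_s, and the contributions on each axis sum to 1 (i.e. 100%). A sketch in Python/NumPy verifying this:

```python
import numpy as np

rng = np.random.default_rng(4)
Z = rng.normal(size=(15, 3))
Z -= Z.mean(axis=0)
I = Z.shape[0]

S = Z.T @ Z / I
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]      # decreasing eigenvalue order
F = Z @ U                           # coordinates of the individuals on all axes

ctr = F ** 2 / (I * lam)            # Ctr_s(i) = F_s(i)^2 / (I * lambda_s)
# on each axis, the contributions of the individuals sum to 1 (100 %)
assert np.allclose(ctr.sum(axis=0), 1.0)
```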
41 Description of the dimensions (1) By the quantitative variables: the correlation between each variable and the coordinates of the individuals on axis s (the principal component) is calculated; the correlation coefficients are sorted and the significant ones are given. For instance, dimdesc returns in $Dim.1$quanti the variables most correlated with dimension 1 (e.g. Points 0.96, Long.jump 0.74, Shot.put 0.62, with negative correlations for Rank, 400m, 110m.hurdle and 100m), and in $Dim.2$quanti those most correlated with dimension 2 (e.g. Discus 0.61, Shot.put 0.60). 41 / 48
42 Description of the dimensions (2) By the categorical variables: perform a one-way analysis of variance with the coordinates of the individuals on the axis explained by the categorical variable; an F-test per variable; for each category, a t-test to compare the mean of the category with the general mean. For instance, $Dim.1$quali gives the p-value of the F-test for Competition, and $Dim.1$category the estimate and p-value for the categories OlympicG and Decastar [values omitted]. 42 / 48
43 Practice library(FactoMineR) data(decathlon) res <- PCA(decathlon,quanti.sup=11:12,quali.sup=13) plot(res,habillage=13) res$eig x11() barplot(res$eig[,1],main="Eigenvalues",names.arg=1:nrow(res$eig)) res$ind$coord res$ind$cos2 res$ind$contrib dimdesc(res) aa <- cbind.data.frame(decathlon[,13],res$ind$coord) bb <- coord.ellipse(aa,bary=TRUE) plot.PCA(res,habillage=13,ellipse=bb) #write.infile(res,file="my_FactoMineR_results.csv") #to export a list 43 / 48
44 Application Chicken data: 43 chickens (individuals), 7407 genes (variables), one categorical variable: 6 diets corresponding to different stresses. Are genes differentially expressed from one stress to another? Dimensionality reduction: with a few principal components, we identify the structure in the data. 44 / 48
45 Individuals factor map (PCA), Dimension 1 (19.63%) × Dimension 2 (9.35%), chickens coloured by diet (N, J16, J16R16, J16R5, J48, J48R24). [Map omitted.] 45 / 48
46 Individuals factor map (PCA), Dimension 3 (7.24%) × Dimension 4 (5.87%), chickens coloured by diet (N, J16, J16R16, J16R5, J48, J48R24). [Map omitted.] 46 / 48
47 Individuals factor map (PCA), Dimension 1 (19.63%) × Dimension 2 (9.35%) (same map as slide 45). 47 / 48
48 Individuals factor map (PCA), Dimension 3 (7.24%) × Dimension 4 (5.87%) (same map as slide 46). 48 / 48
Reference: Lê, S., Josse, J., Husson, F. (2008). FactoMineR: An R Package for Multivariate Analysis. Journal of Statistical Software, 25(1). http://www.jstatsoft.org/
More informationAn introduction to using Microsoft Excel for quantitative data analysis
Contents An introduction to using Microsoft Excel for quantitative data analysis 1 Introduction... 1 2 Why use Excel?... 2 3 Quantitative data analysis tools in Excel... 3 4 Entering your data... 6 5 Preparing
More informationLecture 2: Descriptive Statistics and Exploratory Data Analysis
Lecture 2: Descriptive Statistics and Exploratory Data Analysis Further Thoughts on Experimental Design 16 Individuals (8 each from two populations) with replicates Pop 1 Pop 2 Randomly sample 4 individuals
More informationData Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining
Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 8/05/2005 1 What is data exploration? A preliminary
More informationMultivariate Statistical Inference and Applications
Multivariate Statistical Inference and Applications ALVIN C. RENCHER Department of Statistics Brigham Young University A Wiley-Interscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim
More informationIntroduction to General and Generalized Linear Models
Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby
More informationEXPLORING SPATIAL PATTERNS IN YOUR DATA
EXPLORING SPATIAL PATTERNS IN YOUR DATA OBJECTIVES Learn how to examine your data using the Geostatistical Analysis tools in ArcMap. Learn how to use descriptive statistics in ArcMap and Geoda to analyze
More informationLinear algebra and the geometry of quadratic equations. Similarity transformations and orthogonal matrices
MATH 30 Differential Equations Spring 006 Linear algebra and the geometry of quadratic equations Similarity transformations and orthogonal matrices First, some things to recall from linear algebra Two
More informationA New Method for Dimensionality Reduction using K- Means Clustering Algorithm for High Dimensional Data Set
A New Method for Dimensionality Reduction using K- Means Clustering Algorithm for High Dimensional Data Set D.Napoleon Assistant Professor Department of Computer Science Bharathiar University Coimbatore
More informationIntroduction to Multivariate Analysis
Introduction to Multivariate Analysis Lecture 1 August 24, 2005 Multivariate Analysis Lecture #1-8/24/2005 Slide 1 of 30 Today s Lecture Today s Lecture Syllabus and course overview Chapter 1 (a brief
More informationMehtap Ergüven Abstract of Ph.D. Dissertation for the degree of PhD of Engineering in Informatics
INTERNATIONAL BLACK SEA UNIVERSITY COMPUTER TECHNOLOGIES AND ENGINEERING FACULTY ELABORATION OF AN ALGORITHM OF DETECTING TESTS DIMENSIONALITY Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree
More informationSTATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
More informationSimple Linear Regression Inference
Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation
More informationLINEAR ALGEBRA W W L CHEN
LINEAR ALGEBRA W W L CHEN c W W L Chen, 1997, 2008 This chapter is available free to all individuals, on understanding that it is not to be used for financial gain, and may be downloaded and/or photocopied,
More informationbusiness statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar
business statistics using Excel Glyn Davis & Branko Pecar OXFORD UNIVERSITY PRESS Detailed contents Introduction to Microsoft Excel 2003 Overview Learning Objectives 1.1 Introduction to Microsoft Excel
More informationBowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition
Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition Online Learning Centre Technology Step-by-Step - Excel Microsoft Excel is a spreadsheet software application
More informationLecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
More informationIntroduction to Matrix Algebra
Psychology 7291: Multivariate Statistics (Carey) 8/27/98 Matrix Algebra - 1 Introduction to Matrix Algebra Definitions: A matrix is a collection of numbers ordered by rows and columns. It is customary
More informationJava Modules for Time Series Analysis
Java Modules for Time Series Analysis Agenda Clustering Non-normal distributions Multifactor modeling Implied ratings Time series prediction 1. Clustering + Cluster 1 Synthetic Clustering + Time series
More informationStatistics in Psychosocial Research Lecture 8 Factor Analysis I. Lecturer: Elizabeth Garrett-Mayer
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this
More informationStatistics Graduate Courses
Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.
More informationα = u v. In other words, Orthogonal Projection
Orthogonal Projection Given any nonzero vector v, it is possible to decompose an arbitrary vector u into a component that points in the direction of v and one that points in a direction orthogonal to v
More informationFactor Analysis. Sample StatFolio: factor analysis.sgp
STATGRAPHICS Rev. 1/10/005 Factor Analysis Summary The Factor Analysis procedure is designed to extract m common factors from a set of p quantitative variables X. In many situations, a small number of
More informationSimple Predictive Analytics Curtis Seare
Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use
More informationDidacticiel - Études de cas
1 Topic Multiple Factor Analysis for Mixed Data with Tanagra and R (FactoMineR package). Usually, as a factor analysis approach, we use the principal component analysis (PCA) when the active variables
More informationData analysis process
Data analysis process Data collection and preparation Collect data Prepare codebook Set up structure of data Enter data Screen data for errors Exploration of data Descriptive Statistics Graphs Analysis
More informationx + y + z = 1 2x + 3y + 4z = 0 5x + 6y + 7z = 3
Math 24 FINAL EXAM (2/9/9 - SOLUTIONS ( Find the general solution to the system of equations 2 4 5 6 7 ( r 2 2r r 2 r 5r r x + y + z 2x + y + 4z 5x + 6y + 7z 2 2 2 2 So x z + y 2z 2 and z is free. ( r
More informationData exploration with Microsoft Excel: analysing more than one variable
Data exploration with Microsoft Excel: analysing more than one variable Contents 1 Introduction... 1 2 Comparing different groups or different variables... 2 3 Exploring the association between categorical
More informationFactor Rotations in Factor Analyses.
Factor Rotations in Factor Analyses. Hervé Abdi 1 The University of Texas at Dallas Introduction The different methods of factor analysis first extract a set a factors from a data set. These factors are
More informationPrinciple Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression
Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression Saikat Maitra and Jun Yan Abstract: Dimension reduction is one of the major tasks for multivariate
More informationSection 6.1 - Inner Products and Norms
Section 6.1 - Inner Products and Norms Definition. Let V be a vector space over F {R, C}. An inner product on V is a function that assigns, to every ordered pair of vectors x and y in V, a scalar in F,
More informationMultivariate Analysis
Table Of Contents Multivariate Analysis... 1 Overview... 1 Principal Components... 2 Factor Analysis... 5 Cluster Observations... 12 Cluster Variables... 17 Cluster K-Means... 20 Discriminant Analysis...
More informationSoft Clustering with Projections: PCA, ICA, and Laplacian
1 Soft Clustering with Projections: PCA, ICA, and Laplacian David Gleich and Leonid Zhukov Abstract In this paper we present a comparison of three projection methods that use the eigenvectors of a matrix
More informationLecture 2. Marginal Functions, Average Functions, Elasticity, the Marginal Principle, and Constrained Optimization
Lecture 2. Marginal Functions, Average Functions, Elasticity, the Marginal Principle, and Constrained Optimization 2.1. Introduction Suppose that an economic relationship can be described by a real-valued
More informationCHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES. From Exploratory Factor Analysis Ledyard R Tucker and Robert C.
CHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES From Exploratory Factor Analysis Ledyard R Tucker and Robert C MacCallum 1997 180 CHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES In
More information