Clustering on Principal Components (HCPC) height 0 10 20 30 40 50 60 70 cluster 1 10 Kiev cluster 3 5 Moscow 0 Helsinki Minsk Krakow cluster 2 Oslo Copenhagen Prague Budapest Sarajevo Sofia Madrid Rome Athens Berlin Brussels Paris Lisbon -5 Reykjavik Stockholm Amsterdam London -10 Dublin -15-20 -10 0 10 20 30 LE RAY Guillaume MOLTO Quentin Students of AGROCAMPUS OUEST majored in applied statistics 1
Context R: A free, opensource software for statistics (1875 packages). FactoMineR: a R package, developped in Agrocampus- Ouest, dedicated to factorial analysis. The aim is to create a complementary tool to this package, dedicated to, especially after a factorial analysis. Wide range of choices and uses, results, and graphical representations. 2
Clustering and factorial analysis Factorial analysis and hierarchical are very complementary tools to explore data. Removing the last factors of a factorial analysis remove noise and makes the robuster. Analyses factorielles simples et multiples 4 éme édition, Escofier,Pagès 2008 3
Program structure Factorial analysis Factorial analysis PCA, MCA, MFA Clustering Ward, Euclidean partition K-means 4
Statistic methods (1) Factorial analysis : Function agnes Euclidean distance Ward criterion=d²(i,j)x(mi.mj)/(mi+mj) Suggested level to cut the tree: Intra-cluster inertia Partition comparison: Q=(I n+1 - I n )/I n+1 Max = nb of individuals/2 Inertia 0 2 4 6 8 10 1 2 3 4 5 6 7 8 9 10 Nb of clusters 5
Statistic methods (2) Factorial analysis Non optimal partition K means with the cluster centers 6
Statistic methods (2) Factorial analysis Non optimal partition K means with the cluster centers 7
Statistic methods (3) Clusters description Description by individuals: Use real individuals to caracterise clusters. Factorial analysis Description by variables: Give list of typical variable of clusters. Description by axes: Like in factorial analysis. 8
Dataset presentation 9
Factorial Analysis Dimension 2 (15.4%) -1.0-0.5 0.0 0.5 1.0 June July May August September April October November March February December January Factorial analysis -1.0-0.5 0.0 0.5 1.0 Dimension 1 (82.9%) Dim 2 (15.4%) -4-3 -2-1 0 1 2 3 Moscow Kiev Budapest Minsk Krakow Prague Sofia Helsinki Oslo Sarajevo Stockholm Copenhagen Berlin Paris Amsterdam Brussels London Dublin Reykjavik Madrid Rome Lisbon Athens -5 0 5 Dim 1 (82.9%) 10
0 10 20 30 40 50 60 70 Moscow Minsk Helsinki Oslo Clustering Stockholm Kiev Krakow Reykjavik Copenhagen Click to cut the tree Prague Sarajevo Sofia Berlin Budapest Dublin London Amsterdam Brussels Paris Madrid Rome Lisbon Athens 0 20 40 60 80 Option: inertia gain suggested level of cutting. Sort the individuals as on the first component. Factorial analysis 11
Clustering Factorial analysis 0 10 20 30 40 50 60 70 Moscow Minsk Helsinki Oslo Stockholm Kiev Krakow Reykjavik Copenhagen Prague Sarajevo Sofia Berlin Budapest Dublin London Amsterdam Brussels Paris Madrid Rome Lisbon Colored rectangles are drawn around the clusters. We keep the same color for each cluster in the next graphs (function rect). Athens Options: cut automatically the tree at the suggested level, Cut at level with a choosen number of clusters. 12
Factor map and clusters Factorial analysis -15-10 -5 0 5 10 Dim 2 (15.4%) cluster 1 cluster 2 cluster 3 Moscow Kiev Krakow Sofia Minsk Prague Oslo Berlin Sarajevo Helsinki Stockholm Copenhagen Paris Reykjavik Amsterdam Dublin Budapest Brussels London Madrid Rome Lisbon Athens -20-10 0 10 20 30 Dim 1 (82.9%) Options: Draw other axes, Remove the names, the centers. 13
Factor map, clusters, and tree Factorial analysis height 0 10 20 30 40 50 60 70 cluster 1 cluster 2 cluster 3 Moscow Elsinki Minsk Kiev Oslo Stockholm Krakow Prague Sofia Budapest Berlin Sarajevo Paris Copenhagen London Brussels Amsterdam Reykjavik Dublin Madrid Rome Lisbon -20-10 0 10 20 30 Dim 1 (82.9%) Athens 0-5 -10-15 Options: Draw only a part of the tree, Draw other axes, Remove the names Change the height 5 10 14
Cluster description (1) By individuals factorial analysis Option: the number of individuals for each cluster (here 2) Dim 2 (15.4%) -15-10 -5 0 5 10 cluster 1 cluster 2 cluster 3 Moscow Kiev Minsk Krakow Sofia Prague Sarajevo Helsinki Berlin Oslo Stockholm Copenhagen Paris Amsterdam Brussels London Reykjavik Budapest Dublin Madrid Rome Lisbon Athens -20-10 0 10 20 30 Dim 1 (82.9%) 15
Cluster description (1) By individuals factorial analysis Option: the number of individuals for each cluster (here 2) Dim 2 (15.4%) -15-10 -5 0 5 10 cluster 1 cluster 2 cluster 3 Moscow Kiev Minsk Krakow Sofia Prague Sarajevo Helsinki Berlin Oslo Stockholm Copenhagen Paris Amsterdam Brussels London Reykjavik Budapest Dublin Madrid Rome Lisbon Athens -20-10 0 10 20 30 Dim 1 (82.9%) 16
Cluster description (2) By individuals factorial analysis Option: the number of individuals for each cluster (here 2) Dim 2 (15.4%) -15-10 -5 0 5 10 cluster 1 cluster 2 cluster 3 Moscow Kiev Minsk Krakow Sofia Prague Sarajevo Helsinki Berlin Oslo Stockholm Copenhagen Paris Amsterdam Brussels London Reykjavik Budapest Dublin Madrid Rome Lisbon Athens -20-10 0 10 20 30 Dim 1 (82.9%) 17
Cluster description (3) By variables This is the result of a catdes, it describes the different clusters by the variables (the mean in the category, the v.test ) factorial analysis Option: the p.value (here 0.05). 18
Cluster description (3) By axes factorial analysis This is the result of a catdes, it describes the different clusters by the axes (the mean in the category, the v.test ) Option: the p.value (here 0.05). 19
Conclusion This function was presented with a PCA, but it also acepts: MCA and MFA results, directly a quantitative dataset (nonscaled PCA), a continuous variables to divide into modalities. A normal distribution divided in 3 clusters 20
Function plot.catdes cluster 1 cluster 3 cluster 1 cluster 2 cluster 3 v.test -4-2 0 2 4 Mars Février Novembre Octobre Décembre Janvier Avril Septembre Août Mai Juillet Juin v.test -4-2 0 2 4 Mai Janvier Décembre Juin Février Mars Avril Juillet Novembre Août Octobre Septembre v.test -4-2 0 2 4 Dim.1 v.test -4-2 0 2 4 Dim.3 v.test -4-2 0 2 4 Dim.1 It is a graphical representation of the desc.var results Option: show only the quantitative, qualitative variables or all 21