Exploratory data analysis: unsupervised approaches
Steven Kiddle
With thanks to Richard Dobson and Emanuele de Rinaldis
Lecture overview
Ø Background
Ø Revision
Ø Other clustering methods
Background
Motivation
One of the most challenging tasks for a scientist is making sense of large amounts of data. The power of high-throughput analysis comes not from the analysis of single genes, but from the analysis of many data points at once to identify patterns of gene expression. Unsupervised learning allows unexpected patterns to be spotted.
Supervised vs. unsupervised learning
Supervised: observations serve as inputs that are mapped to known outputs.
Unsupervised: outputs are unobserved, or not used in the initial analysis; structure is sought in the observations alone.
Clustering
Finding a partition of the data such that:
- the distance between objects within a cluster is minimised
- the distance between objects from different clusters is maximised
Applications
Ø Biology: finding similar organisms, sequences, molecular signatures
Ø Marketing: identifying groups of customers with similar preferences
Ø Earthquakes: locating epicentres based on recordings
Ø Images: image compression
Ø Many more
Revision
Hierarchical clustering revision
In R: hclust
In MATLAB: clusterdata
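The deck only names the R (`hclust`) and MATLAB (`clusterdata`) entry points. As a minimal illustrative sketch, the same agglomerative procedure can be run in Python with SciPy (the choice of average linkage and of two well-separated toy groups are assumptions for the example, not part of the lecture):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated groups of 2-D points.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, size=(5, 2)),
                  rng.normal(5, 0.3, size=(5, 2))])

# Agglomerative hierarchical clustering with average linkage.
Z = linkage(data, method="average")

# Cut the dendrogram so that it yields 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```

Here `Z` encodes the full merge tree (the dendrogram), and `fcluster` plays the role of choosing a cut height, just as one would cut the tree produced by `hclust`.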
Principal Components Analysis (PCA) revision
http://gettinggeneticsdone.blogspot.co.uk/
In R: prcomp
In MATLAB: princomp
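As with the R/MATLAB one-liners above, the mechanics of PCA can be sketched directly in Python with NumPy: centre the data, eigendecompose the covariance matrix, and project. The toy two-variable data set below is an assumption for illustration only:

```python
import numpy as np

# Strongly correlated 2-D data: PC1 should align with the long axis.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

# PCA by eigendecomposition of the covariance matrix of centred data.
centred = data - data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centred, rowvar=False))
order = np.argsort(eigvals)[::-1]           # sort components by variance
components = eigvecs[:, order]              # columns are PC1, PC2
scores = centred @ components               # data in PC coordinates
explained = eigvals[order] / eigvals.sum()  # proportion of variance per PC
```

For data this strongly correlated, nearly all of the variance is captured by PC1, which is exactly the situation where plotting the first few PCs reveals structure.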
Other clustering methods
Multidimensional scaling (MDS)
PCA is a special case of MDS. Unlike PCA, MDS can use non-linear transformations of the data points. (Aflalo et al., 2013)
Automatic identification of clusters
Standard approaches:
Ø K-means
Ø K-centres
K-means example (Li et al., 2010)
K-means algorithm
In R: kmeans
In MATLAB: kmeans
1) k initial "means" (in this case k = 3) are randomly generated within the data domain.
2) k clusters are created by associating every observation with the nearest mean.
3) The centroid of each of the k clusters becomes the new mean.
4) Steps 2 and 3 are repeated until convergence has been reached.
(Images: Wikimedia)
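The four steps above can be sketched directly in Python/NumPy. This is a minimal illustrative implementation, assuming Euclidean distance and a simple "means stopped moving" convergence test; the three-blob demo data are assumed for the example:

```python
import numpy as np

def kmeans(data, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1) k initial "means" generated at random within the data domain.
    means = rng.uniform(data.min(axis=0), data.max(axis=0),
                        size=(k, data.shape[1]))
    for _ in range(n_iter):
        # 2) associate every observation with the nearest mean (Euclidean).
        labels = np.linalg.norm(data[:, None, :] - means[None, :, :],
                                axis=2).argmin(axis=1)
        # 3) the centroid of each cluster becomes the new mean
        #    (empty clusters keep their old mean).
        new_means = np.array([data[labels == j].mean(axis=0)
                              if np.any(labels == j) else means[j]
                              for j in range(k)])
        # 4) repeat until the means stop moving.
        if np.allclose(new_means, means):
            break
        means = new_means
    # Final assignment so labels are consistent with the returned means.
    labels = np.linalg.norm(data[:, None, :] - means[None, :, :],
                            axis=2).argmin(axis=1)
    return means, labels

# Demo: three well-separated blobs.
rng = np.random.default_rng(4)
data = np.vstack([rng.normal(c, 0.2, size=(20, 2)) for c in (0, 5, 10)])
means, labels = kmeans(data, 3, seed=4)
```

Note that, like the real `kmeans` in R and MATLAB, the result depends on the random initialisation; production implementations typically run several restarts and keep the best.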
K-means disadvantages
Ø Assumes clusters have the same variance as each other in all directions (addressed by Expectation-Maximisation (EM) clustering).
Ø Requires a distance measure with a well-defined mean, such as Euclidean distance, i.e. the ordinary Pythagorean distance (addressed by K-centres).
(Wikimedia)
Unusual distance measures
Some useful distance measures do not support a meaningful "mean" point. Flight time between airports is one example: it can be affected by wind, busy airports, etc. (Wikimedia)
K-centres (also called K-medoids)
In R: pam {cluster}
In MATLAB: kcenters
Same data points as were used for k-means.
1) k data points are selected at random as the initial centres.
2) k clusters are created by associating every observation with the nearest centre.
3) For each cluster, the observation with the shortest total distance to the rest of the cluster is chosen as the new centre.
4) Steps 2 and 3 are repeated until convergence has been reached.
http://www.psi.toronto.edu/index.php?q=affinity%20propagation
(Images: Wikimedia)
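Because the centres are always actual data points, the steps above only ever need a distance matrix, never a mean; this is what lets K-centres handle the "unusual" distance measures mentioned earlier. A minimal Python sketch, assuming Euclidean distances and a two-blob demo data set for illustration:

```python
import numpy as np

def kmedoids(data, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data)
    # Pairwise Euclidean distances; any precomputed distance matrix would
    # work equally well, since the algorithm never averages points.
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    # 1) k data points selected at random as the initial centres.
    centres = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # 2) associate every observation with the nearest centre.
        labels = dist[:, centres].argmin(axis=1)
        # 3) within each cluster, the observation with the smallest total
        #    distance to the rest of the cluster becomes the new centre.
        new_centres = centres.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                totals = dist[np.ix_(members, members)].sum(axis=1)
                new_centres[j] = members[totals.argmin()]
        # 4) repeat until the set of centres stops changing.
        if set(new_centres) == set(centres):
            break
        centres = new_centres
    # Final assignment so labels are consistent with the returned centres.
    labels = dist[:, centres].argmin(axis=1)
    return centres, labels

# Demo: two well-separated blobs.
rng = np.random.default_rng(5)
data = np.vstack([rng.normal(c, 0.2, size=(15, 2)) for c in (0, 6)])
centres, labels = kmedoids(data, 2, seed=5)
```

The returned `centres` are indices into `data`, mirroring how `pam` reports medoid observations rather than synthetic centroids.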
Example of results (Wikimedia)
Even more clustering methods
K-means style:
Ø Expectation Maximisation
Ø Self-Organising Maps
Ø Neural Networks
K-centres style:
Ø Affinity Propagation (Frey et al., 2007)
Ø Simulated Annealing
For time series:
Ø SplineCluster (Heard et al., 2006)
(Wikimedia)