Introduction to Clustering
Yumi Kondo (University of British Columbia)
Student Seminar LSK301
Sep 25, 2010
Microarray Example: N = 65, p = 1756
Clustering
The data set $\{x_{ij}\}$, $i = 1,\dots,N$, $j = 1,\dots,p$, consists of p features measured on N independent observations.
A clustering algorithm seeks to assign the N observations in p-dimensional space, labeled $x_1,\dots,x_N$, to one of K groups, based on some similarity measure.
Unsupervised learning: the problem of finding groups in data without the help of a response variable.
There is no right or wrong partition.
What is a similarity measure?
$x_1$ and $x_2$ are observation vectors in p dimensions. Some examples:
- Euclidean distance: $\|x_i - x_{i'}\|_2$
- Absolute (L1) distance: $\|x_i - x_{i'}\|_1$
- Correlation, with $d = 1 - \text{correlation}$
(Figure: scatterplot of two features, axes n[,1] and n[,2].)
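As a small sketch of these three measures (in Python with NumPy rather than the R used in the talk; the example vectors are made up for illustration):

```python
import numpy as np

def euclidean(x, y):
    """L2 distance between two observation vectors."""
    return np.sqrt(np.sum((x - y) ** 2))

def absolute(x, y):
    """L1 (absolute / Manhattan) distance."""
    return np.sum(np.abs(x - y))

def correlation_distance(x, y):
    """d = 1 - Pearson correlation of the two vectors."""
    return 1.0 - np.corrcoef(x, y)[0, 1]

# y is a perfect linear function of x, so its correlation distance to x is 0
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 5.0])
```

Note that correlation distance ignores location and scale: two vectors with the same "shape" have distance 0 even when their Euclidean distance is large.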
What is K-means?
Clustering methods:
- Hierarchical clustering
- Non-hierarchical clustering: K-means
Hierarchical Clustering
It produces a dendrogram that represents a nested set of clusters: depending on where the dendrogram is cut, between 1 and N clusters can result.
Cool microarray example: http://genome-www.stanford.edu/breast_cancer/molecularportraits/download.shtm
Hierarchical Clustering: pros and cons
Pro: a nice tree (dendrogram) that visualizes the different levels of similarity between observations.
Con: computationally expensive.
Non-hierarchical clustering: K-means
K-means uses Euclidean distance as the similarity measure. The solution of K-means clustering is the partition such that
$$\min_{C_1,\dots,C_K} WSS = \min_{C_1,\dots,C_K} \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i, i' \in C_k} \|x_i - x_{i'}\|^2$$
(white board)
Note:
$$\begin{aligned}
WSS &= \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i, i' \in C_k} \|x_i - x_{i'}\|^2 \\
&= \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i, i' \in C_k} \|x_i - \bar{x}_k + \bar{x}_k - x_{i'}\|^2 \\
&= \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \bar{x}_k\|^2
\end{aligned}$$
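The identity above (pairwise form equals centroid form) can be checked numerically; a minimal sketch in Python with NumPy, on made-up data with three fixed clusters:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
labels = np.array([0] * 4 + [1] * 4 + [2] * 4)

wss_pairwise = 0.0
wss_centroid = 0.0
for k in range(3):
    Xk = X[labels == k]
    n_k = len(Xk)
    # pairwise form: sum over all ordered pairs (i, i') in C_k, scaled by 1/(2 n_k)
    sq_dists = ((Xk[:, None, :] - Xk[None, :, :]) ** 2).sum(axis=2)
    wss_pairwise += sq_dists.sum() / (2 * n_k)
    # centroid form: sum of squared deviations from the cluster mean
    wss_centroid += ((Xk - Xk.mean(axis=0)) ** 2).sum()
```

The two quantities agree to machine precision, which is why K-means can work with centroids instead of all pairwise distances.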
Algorithm for K-means
Steps 1 and 2 are iterated until convergence.
Step 1. Given cluster assignments $C_1,\dots,C_K$, the cluster centroids are calculated as
$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i \in C_k} x_i, \quad k = 1,\dots,K$$
Step 2. Given the cluster centroids, the objective function is minimized by assigning each observation to the closest cluster mean:
$$I_i = \operatorname*{argmin}_{1 \le k \le K} \|x_i - \hat{\mu}_k\|^2$$
(white board)
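The two-step iteration above can be sketched directly (Python/NumPy; the function name `kmeans` and the random centroid initialization are illustrative choices, not part of the slides):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate Step 2 (assign) and Step 1 (update)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]  # init at K random points
    for _ in range(n_iter):
        # Step 2: assign each observation to the nearest centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 1: recompute each centroid as its cluster's mean
        new = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                        else centroids[k] for k in range(K)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# two well-separated blobs for a quick sanity check
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(10, 0.5, (10, 2))])
labels, centroids = kmeans(X, K=2)
```

Each iteration can only decrease WSS, so the algorithm converges, though possibly to a local optimum; in practice one runs it from several random starts.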
Correlation as the similarity measure in K-means
It is not so easy to create an algorithm when the similarity measure is correlation: there is no simple analytic form for the cluster centroid.
Data transformation approach:
1. Normalize each observation vector:
$$\tilde{x} = \frac{x - \bar{x}}{\|x - \bar{x}\|_2 / \sqrt{p}}$$
2. Then the squared Euclidean distance between normalized vectors is proportional to the correlation distance $d_\rho(x, y) = 1 - \rho_{x,y}$:
$$\|\tilde{x} - \tilde{y}\|^2 = \sum_{j=1}^{p} (\tilde{x}_j - \tilde{y}_j)^2 = 2p - 2p \, \frac{\sum_j (x_j - \bar{x})(y_j - \bar{y})}{\|x - \bar{x}\| \, \|y - \bar{y}\|} = 2p \, (1 - \rho_{x,y})$$
So running ordinary K-means on the normalized data is equivalent to clustering with correlation distance.
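The identity $\|\tilde{x} - \tilde{y}\|^2 = 2p(1 - \rho_{x,y})$ can be verified numerically; a minimal sketch in Python with NumPy, on made-up vectors:

```python
import numpy as np

def normalize(x):
    """Center, then scale so the normalized vector has squared norm p."""
    centered = x - x.mean()
    return centered / (np.linalg.norm(centered) / np.sqrt(len(x)))

rng = np.random.default_rng(2)
x, y = rng.normal(size=50), rng.normal(size=50)

lhs = ((normalize(x) - normalize(y)) ** 2).sum()        # squared distance after normalizing
rhs = 2 * 50 * (1 - np.corrcoef(x, y)[0, 1])            # 2p(1 - correlation)
```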
Drawbacks of K-means
- No pretty tree.
- The number of clusters K must be known in advance.
- Not robust.
K must be known in advance — but how?
- GAP statistic, Tibshirani et al. (2001)
- Clest
GAP statistic
Idea behind the GAP statistic: find $\hat{k}$ such that $WSS_k$ shows an "elbow" decline.
(cool example in R)
Definition:
$$GAP(k) = \hat{E}_{\text{null}}[\log(WSS(k))] - \log(WSS(k))$$
$$\hat{k} = \text{the smallest } k \text{ such that } GAP(k) \ge GAP(k+1) - s_{k+1}$$
Standardize the graph of $\log(WSS(k))$ by comparing it with its expectation under an appropriate null reference distribution of the data:
$$\hat{E}_{\text{null}}[\log(WSS(k))] = \frac{1}{B} \sum_{b=1}^{B} \log(WSS(k)^b)$$
(another cool one in R)
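A minimal sketch of computing $GAP(k)$ (Python/NumPy instead of R; it re-defines a tiny K-means so the snippet is self-contained, and for simplicity draws the reference data from a plain bounding box rather than the PCA-aligned box described later):

```python
import numpy as np

def kmeans(X, K, seed=0, n_iter=50):
    """Minimal Lloyd's algorithm; returns labels and centroids."""
    rng = np.random.default_rng(seed)
    cents = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        labels = ((X[:, None] - cents[None]) ** 2).sum(-1).argmin(1)
        cents = np.array([X[labels == k].mean(0) if (labels == k).any()
                          else cents[k] for k in range(K)])
    return labels, cents

def log_wss(X, labels, cents):
    return np.log(sum(((X[labels == k] - cents[k]) ** 2).sum()
                      for k in range(len(cents))))

def gap(X, k, B=10, seed=0):
    """GAP(k) = mean_b log WSS(k)^b on reference data - log WSS(k) on X."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(0), X.max(0)
    obs = log_wss(X, *kmeans(X, k))
    ref = [log_wss(Z, *kmeans(Z, k, seed=b))
           for b, Z in enumerate(rng.uniform(lo, hi, size=(B, *X.shape)))]
    return np.mean(ref) - obs

# two well-separated blobs: the gap should be clearly larger at k = 2 than at k = 1
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(10, 0.5, (20, 2))])
g1, g2 = gap(X, 1), gap(X, 2)
```

The point of the log and the reference draws is that uniform data also shows a declining WSS curve; the gap isolates the decline that exceeds what structureless data would produce.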
Reference distribution: K = 1
Generate the reference features from a uniform distribution over a box aligned with the principal components of the data.
- Orthogonally diagonalize $S_X = PDP^\top$. D is a diagonal matrix with the eigenvalues $\lambda_1,\dots,\lambda_p$ of $S_X$ on the diagonal, arranged so that $\lambda_1 \ge \dots \ge \lambda_p \ge 0$, and P is an orthogonal matrix whose columns are the corresponding unit eigenvectors $u_1,\dots,u_p$.
- Transform via $X^* = XP$. Then $S_{X^*} = P^\top S_X P = P^\top P D P^\top P = D$, so the transformed data is no longer correlated.
- Draw uniform features $Z^*$ over the ranges of the columns of $X^*$.
- Finally, back-transform via $Z = Z^* P^\top$ to give the reference data set Z.
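The steps above can be sketched as follows (Python/NumPy; the function name `reference_sample` is an illustrative choice, and `np.linalg.eigh` returns eigenvalues in ascending rather than descending order, which does not affect the construction):

```python
import numpy as np

def reference_sample(X, seed=0):
    """One uniform reference data set over a box aligned with X's principal axes."""
    rng = np.random.default_rng(seed)
    # orthogonally diagonalize the sample covariance: S_X = P D P^T
    _, P = np.linalg.eigh(np.cov(X, rowvar=False))
    Xstar = X @ P                                 # rotate: columns now uncorrelated
    lo, hi = Xstar.min(axis=0), Xstar.max(axis=0)
    Zstar = rng.uniform(lo, hi, size=X.shape)     # uniform over the aligned box
    return Zstar @ P.T                            # back-transform to original axes

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
Z = reference_sample(X)
```

Aligning the box with the principal components (rather than the raw feature axes) makes the reference distribution respect the shape of the data cloud, which makes the null comparison less conservative for correlated features.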
Prediction-based resampling method: Clest
Clest returns the K which has the most stable predictability in the clustering procedure.
Algorithm: for each K, repeat the following process B times for the data set and for the reference data set:
- Partition the data set into a learning set and a testing set.
- Perform K-means on the learning set; return classifiers.
- Classify the testing set by the classifiers; return $C_{1,\text{classifier}},\dots,C_{K,\text{classifier}}$.
- Classify the testing set by K-means; return $C_{1,\text{K-means}},\dots,C_{K,\text{K-means}}$.
- Measure the similarity of the two partitions $C_{1,\text{classifier}},\dots,C_{K,\text{classifier}}$ and $C_{1,\text{K-means}},\dots,C_{K,\text{K-means}}$.
- Compute $S(k, \text{cluster labels}_1, \text{cluster labels}_2)$.
- Repeat this process B times for each K and obtain the average similarity measure $S_k$.
- Repeat the algorithm for the reference data set B times and obtain $S^0_k$.
- Obtain the standardized similarity measure $d_k = S_k - S^0_k$, and set $\hat{K} = \operatorname*{argmax}_{k \in \{1,\dots,K\}} d_k$.
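The core of one Clest round, the split/learn/predict/compare loop, can be sketched as follows (Python/NumPy; a tiny K-means is re-defined so the snippet is self-contained, nearest-learned-centroid is used as the classifier, pairwise co-membership agreement is used as the similarity S, and all function names are illustrative):

```python
import numpy as np

def kmeans(X, K, seed=0, n_iter=50):
    """Minimal Lloyd's algorithm; returns labels and centroids."""
    rng = np.random.default_rng(seed)
    cents = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        labels = ((X[:, None] - cents[None]) ** 2).sum(-1).argmin(1)
        cents = np.array([X[labels == k].mean(0) if (labels == k).any()
                          else cents[k] for k in range(K)])
    return labels, cents

def pair_agreement(a, b):
    """Fraction of pairs (i, i') on which two partitions agree; invariant to relabeling."""
    iu = np.triu_indices(len(a), 1)
    return ((a[:, None] == a[None, :])[iu] == (b[:, None] == b[None, :])[iu]).mean()

def clest_similarity(X, K, B=10, seed=0):
    """Average over B random splits: cluster the learning set, classify the testing
    set by nearest learned centroid, cluster the testing set directly, compare."""
    rng = np.random.default_rng(seed)
    scores = []
    for b in range(B):
        idx = rng.permutation(len(X))
        learn, test = X[idx[:len(X) // 2]], X[idx[len(X) // 2:]]
        _, cents = kmeans(learn, K, seed=b)                    # classifier from learning set
        predicted = ((test[:, None] - cents[None]) ** 2).sum(-1).argmin(1)
        direct, _ = kmeans(test, K, seed=b)                    # K-means on testing set
        scores.append(pair_agreement(predicted, direct))
    return float(np.mean(scores))

# with two clearly separated blobs, the K = 2 partitions are highly reproducible
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(20, 0.3, (20, 2))])
score = clest_similarity(X, K=2, B=5)
```

Clest would then repeat this on reference data, subtract the reference score, and pick the K that maximizes the standardized similarity $d_k$.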
Similarity measures of two partitions
Let P and Q represent two partitions. The clustering error rate is
$$CER = \frac{\sum_{i > i'} |I_P(i, i') - I_Q(i, i')|}{\binom{n}{2}}$$
where
$$I_P(i, i') = \begin{cases} 1 & \text{if } i \text{ and } i' \text{ belong to the same cluster under partition } P \\ 0 & \text{otherwise} \end{cases}$$
$0 \le CER \le 1$; CER = 0 means perfect agreement of the two partitions, and CER = 1 means complete disagreement.
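The CER definition above translates directly into code (Python/NumPy sketch; the function name `cer` is an illustrative choice):

```python
import numpy as np

def cer(p, q):
    """Clustering error rate: fraction of pairs (i, i') on which the two
    partitions disagree about co-membership."""
    p, q = np.asarray(p), np.asarray(q)
    same_p = p[:, None] == p[None, :]          # I_P(i, i') for all pairs
    same_q = q[:, None] == q[None, :]          # I_Q(i, i') for all pairs
    iu = np.triu_indices(len(p), 1)            # each unordered pair once
    return np.mean(same_p[iu] != same_q[iu])
```

Because CER only looks at whether pairs are grouped together, it is invariant to relabeling the clusters, which is exactly what is needed when comparing partitions whose labels carry no meaning.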
Does Clest outperform the GAP statistic?
Reference: Tibshirani, R., et al. (2001). Estimating the number of clusters in a data set via the gap statistic.