CPS260/BGT204 Algorithms in Computational Biology October 30, 2003
Lecturer: Pankaj K. Agarwal
Lecture 8: Clustering & classification
Scribe: Daun Hou

Open Problem

In Homework 2, problem 5 contains an open problem which may be easy or may be hard; you can publish a paper if you find the solution. The problem is: how can the space complexity of the algorithm be improved to O(1) or O(1 + log n)?

1. What is Clustering?

A loose definition of clustering could be: the process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects which are similar to each other and dissimilar to the objects belonging to other clusters.

Cluster analysis is also used to form descriptive statistics, to ascertain whether or not the data consists of a set of distinct subgroups, each group representing objects with substantially different properties. The latter goal requires an assessment of the degree of difference between the objects assigned to the respective clusters [5].

Example of Clustering

Central to clustering is deciding what constitutes a good clustering. This can only come from subject-matter considerations, and there is no absolute "best" criterion that would be independent of the final aim of the clustering. For example, we could be interested in
finding representatives for homogeneous groups (data reduction), in finding "natural" clusters and describing their unknown properties (natural data types), in finding useful and suitable groupings (useful data classes), or in finding unusual data objects (outlier detection).

Two important components of cluster analysis are the similarity (distance) measure between two data samples and the clustering algorithm.

2. Distance Measure

Different formulas for defining the distance between two data points can lead to different classification results. Domain knowledge must be used to guide the formulation of a suitable distance measure for each particular application. For high-dimensional data, a popular measure is the Minkowski metric:

    d(x_i, x_j) = ( Σ_{k=1}^{d} |x_{i,k} - x_{j,k}|^p )^{1/p},

where d is the dimensionality of the data. Special cases:

    p = 2: Euclidean distance
    p = 1: Manhattan distance
    p → ∞: "sup" (supremum) distance

However, there are no general theoretical guidelines for selecting a measure for any given application. In the case that the components of the data feature vectors are not immediately comparable, such as the days of the week, domain knowledge must be used to formulate an appropriate measure.

3. Clustering algorithms

Clustering algorithms may be classified as listed below:

Exclusive Clustering
In exclusive clustering, data are grouped in an exclusive way, so that a certain datum belongs to only one definite cluster. K-means clustering is one example of an exclusive clustering algorithm.
Overlapping Clustering
Overlapping clustering uses fuzzy sets to cluster the data, so that each point may belong to two or more clusters with different degrees of membership.

Hierarchical Clustering
Hierarchical clustering comes in two versions: agglomerative clustering and divisive clustering.

Agglomerative clustering is based on the union of the two nearest clusters. The beginning condition is realized by setting every datum as its own cluster. After a number of iterations, it reaches the final clusters wanted. Basically, this is a bottom-up version.

Divisive clustering starts from one cluster containing all data items. At each step, clusters are successively split into smaller clusters according to some dissimilarity measure. Basically, this is a top-down version.

Probabilistic Clustering
Probabilistic clustering, e.g. a mixture of Gaussians, uses a completely probabilistic approach.

4. Hierarchical Algorithm

Given a set of N objects S = {s_1, s_2, ..., s_N} to be clustered and a distance function D(c_i, c_j) between two clusters c_i, c_j ⊆ S with c_i ∩ c_j = ∅, build a hierarchy tree on S. The basic process of hierarchical clustering (S. C. Johnson, 1967) is as follows:

1. Start by assigning each object to its own cluster, c_i = {s_i} (i = 1, ..., N), so that if you have N objects, you have N clusters C = {c_1, c_2, ..., c_N}, each containing just one item.
2. Find the pair of clusters (c_i, c_j) such that D(c_i, c_j) ≤ D(c_i', c_j') for all pairs (c_i', c_j'), and merge them into a single cluster c_k = c_i ∪ c_j. Delete c_i and c_j from C and insert c_k into C, so that you now have one cluster less.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

An example of how the hierarchical algorithm leads to long clusters is shown in the following figure:
Example of Hierarchical Clustering

Step 3 of the hierarchical algorithm can be done in different ways, which is what distinguishes single-linkage from complete-linkage and average-linkage clustering.

In single-linkage clustering, the distance between one cluster and another cluster is equal to the shortest distance from any member of one cluster to any member of the other cluster:

    D(c_i, c_j) = min { d(a, b) : a ∈ c_i, b ∈ c_j }.

It is easy to see that, for the merged cluster c_k = c_i ∪ c_j,

    D(c_k, c_l) = min { D(c_i, c_l), D(c_j, c_l) }.

In complete-linkage clustering, the distance between one cluster and another cluster is equal to the greatest distance from any member of one cluster to any member of the other cluster:

    D(c_i, c_j) = max { d(a, b) : a ∈ c_i, b ∈ c_j }.

In average-linkage clustering, the distance between one cluster and another cluster is equal to the average distance from any member of one cluster to any member of the other cluster:

    D(c_i, c_j) = (1 / (|c_i| |c_j|)) Σ_{a ∈ c_i} Σ_{b ∈ c_j} d(a, b).

It is easy to see that, for c_k = c_i ∪ c_j,

    |c_k| D(c_k, c_l) = |c_i| D(c_i, c_l) + |c_j| D(c_j, c_l).

The main weaknesses of hierarchical clustering methods are that they do not scale well (the time complexity is at least O(N^2), where N is the total number of objects) and that they can never undo what was done previously; it is basically a greedy approach.
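The four-step procedure and the three linkage rules can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the names `agglomerate`, `single`, `complete`, and `average` are made up here, the point distance is 1-D absolute difference, and the naive pairwise search is slower than the O(N^2) bound mentioned above.

```python
def single(ci, cj, d):
    """Single linkage: D(c_i, c_j) = min d(a, b) over a in c_i, b in c_j."""
    return min(d(a, b) for a in ci for b in cj)

def complete(ci, cj, d):
    """Complete linkage: D(c_i, c_j) = max d(a, b) over a in c_i, b in c_j."""
    return max(d(a, b) for a in ci for b in cj)

def average(ci, cj, d):
    """Average linkage: mean of d(a, b) over all cross-cluster pairs."""
    return sum(d(a, b) for a in ci for b in cj) / (len(ci) * len(cj))

def agglomerate(points, d, linkage):
    """Steps 1-4 of the hierarchical algorithm; returns the merge history."""
    clusters = [[p] for p in points]              # step 1: one cluster per object
    history = []
    while len(clusters) > 1:
        # step 2: find the closest pair of clusters and merge them
        i, j = min(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)                   # steps 3-4: distances to the new
        history.append(merged)                    # cluster are re-evaluated next pass
    return history

d = lambda a, b: abs(a - b)                       # 1-D point distance
merges = agglomerate([0.0, 1.0, 5.0], d, single)  # first merge joins 0.0 and 1.0

# numerical check of the average-linkage merge identity stated above
ci, cj, cl = [0.0, 1.0], [4.0], [10.0, 12.0]
ck = ci + cj
assert abs(len(ck) * average(ck, cl, d)
           - (len(ci) * average(ci, cl, d) + len(cj) * average(cj, cl, d))) < 1e-9
```

Swapping `single` for `complete` or `average` changes only step 3's cluster-distance rule, which is exactly the distinction the text draws.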
An example [4]: clustering 4 data items in 2-dimensional space. (Figure omitted.)
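As a brief aside, the Minkowski metric of Section 2 and its special cases can be sketched as follows; the function name `minkowski` is mine, not from the notes.

```python
def minkowski(x, y, p):
    """d(x, y) = (sum_k |x_k - y_k|^p)^(1/p) for feature vectors x and y."""
    if p == float("inf"):                       # limiting case: sup distance
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (1.0, 2.0), (4.0, 6.0)
print(minkowski(x, y, 2))             # Euclidean distance: 5.0
print(minkowski(x, y, 1))             # Manhattan distance: 7.0
print(minkowski(x, y, float("inf")))  # sup distance: 4.0
```

The same `minkowski` function (most often with p = 2) can serve as the point distance d(a, b) in any of the clustering algorithms in these notes.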
5. K-Means Clustering

K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solves the clustering problem. The objective is to classify a given data set S = {s_1, s_2, ..., s_N} into a certain number k of clusters, fixed a priori. The idea is to define k initial centroids, one for each cluster c_j (j = 1, ..., k). The procedure is:

1. Initial clusters: C^0 = {c_1^0, c_2^0, ..., c_k^0}; the k initial centroids should be placed as far as possible from each other.
2. Calculate the centroids of the clusters:

       u_j^i = (1 / |c_j^i|) Σ_{x ∈ c_j^i} x,   j = 1, ..., k,

   where i denotes the i-th iteration.
3. Take each point belonging to the given data set and associate it to the nearest centroid:

       c_j^{i+1} = { x : d(x, u_j^i) ≤ d(x, u_{j'}^i) for all j' },   C^{i+1} = {c_1^{i+1}, ..., c_k^{i+1}}.

4. Repeat steps 2 and 3 until no more changes can be made to the clusters, that is, until C^{i+1} = C^i; in other words, the centroids do not move any more.

Each iteration takes O(kN) time, but we do not know in advance how many iterations it will take to converge.

Finally, this algorithm aims at minimizing an objective function, in this case a squared-error function:

    J = Σ_{j=1}^{k} Σ_{x ∈ c_j} ||x - u_j||^2.

Although it can be proved that the k-means algorithm will always terminate, the algorithm does not necessarily find the optimal configuration corresponding to the global minimum of the objective function; it might get stuck in a local minimum. The algorithm is also significantly sensitive to the initially selected cluster centers. To get out of a local minimum, the k-means algorithm can be run multiple times from different initial clusterings, or a simulated-annealing technique can be used.
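The four steps above can be sketched on 1-D data as follows. This is an illustrative implementation, not the lecture's: the name `kmeans` is mine, and it uses random initial centroids rather than the maximally spread placement that step 1 recommends.

```python
import random

def kmeans(points, k, iters=100):
    """Steps 1-4 of k-means on 1-D data; each iteration costs O(kN)."""
    centroids = random.sample(points, k)      # step 1 (random, not maximally spread)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]     # step 3: nearest-centroid assignment
        for x in points:
            j = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[j].append(x)
        # step 2: recompute centroids (keep the old one if a cluster went empty)
        new = [sum(c) / len(c) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:                  # step 4: centroids no longer move
            break
        centroids = new
    return centroids, clusters

random.seed(0)  # for reproducibility
centroids, clusters = kmeans([0.9, 1.0, 1.1, 7.8, 8.0, 8.2], 2)
# for this well-separated data the centroids converge to about 1.0 and 8.0
```

Rerunning without the fixed seed illustrates the sensitivity to initialization that the text mentions: on harder data, different starting centroids can converge to different local minima of J.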
6. Principal Component Analysis

The principal components v_i (i = 1, 2, ..., d) of a data set S ⊂ R^d, consisting of N d-dimensional random vectors S = {s_1, s_2, ..., s_N} (d >> log N), provide a sequence of best linear approximations to the data in terms of minimum mean-square error. Each original vector s_i (i = 1, 2, ..., N) can hence be represented as a linear combination of the principal components:

    s_i = Σ_{j=1}^{d} a_{ij} v_j.

This representation has several attractive properties:

1. The effectiveness of each principal component v_i in representing S is determined by its corresponding eigenvalue. Therefore, the principal component with the largest eigenvalue has the greatest importance in effectively approximating the original data set S, and vice versa.
2. The principal components v_i (i = 1, 2, ..., d) are mutually uncorrelated, that is, v_i^T v_j = 0 (i ≠ j ≤ d). Furthermore, in the special case where S is normally distributed, the principal components are mutually independent.
3. The set of principal components v_i (i = 1, 2, ..., k) corresponding to the k largest eigenvalues minimizes the mean-square error over all choices of k orthogonal vectors.

The main application of principal component analysis is feature-space reduction: the d-dimensional data can be projected onto a k-dimensional subspace using the first k principal components. In classification, the data might then be clustered around the k-dimensional hyperplane after the projection.

To calculate the principal components v_i (i = 1, 2, ..., d), the covariance matrix X of the data set S is first calculated:

    X = (1/N) Σ_{i=1}^{N} (s_i - u)(s_i - u)^T,   where u = (1/N) Σ_{i=1}^{N} s_i.

The average u of all vectors s_i in the data set is subtracted so that the eigenvector with the largest eigenvalue represents the subspace dimension in which the variance of the data set is maximized. The principal components v_i (i = 1, 2, ..., d) are the eigenvectors of the covariance matrix X and can be determined by solving the well-known eigenvalue decomposition problem:

    X v_i = λ_i v_i,

where λ_1 ≥ λ_2 ≥ ... ≥ λ_i ≥ ... ≥ λ_d.
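Property 1 can be illustrated numerically. For 2-D data the eigenvalues of the 2 x 2 covariance matrix have a closed form in terms of its trace and determinant, which keeps the sketch self-contained; the helper name `pca_eigenvalues_2d` is made up here, and real code would call a general eigensolver instead.

```python
import math

def pca_eigenvalues_2d(S):
    """Eigenvalues (largest first) of the covariance matrix
    X = (1/N) sum (s - u)(s - u)^T for 2-D data points S."""
    N = len(S)
    ux = sum(x for x, _ in S) / N                 # mean vector u
    uy = sum(y for _, y in S) / N
    cxx = sum((x - ux) ** 2 for x, _ in S) / N    # entries of X
    cyy = sum((y - uy) ** 2 for _, y in S) / N
    cxy = sum((x - ux) * (y - uy) for x, y in S) / N
    tr, det = cxx + cyy, cxx * cyy - cxy ** 2     # trace and determinant of X
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    return tr / 2 + disc, tr / 2 - disc           # lambda_1 >= lambda_2

# data lying almost exactly on the line y = x: the first principal
# component should capture nearly all of the variance
l1, l2 = pca_eigenvalues_2d([(0, 0), (1, 1), (2, 2), (3, 3.1)])
print(l1 / (l1 + l2))   # close to 1
```

Here the ratio λ_1 / (λ_1 + λ_2) exceeds 0.99, so projecting onto the first principal component alone loses almost nothing, which is the feature-space-reduction idea described above.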
One approach to selecting k, so that the first k eigenvectors of X capture the important variations in the data set S, is to require:
    ( Σ_{i=1}^{k} λ_i ) / ( Σ_{i=1}^{d} λ_i ) ≥ T,

where the threshold T is close to, but less than, unity.

7. Supervised Learning

All the cluster-analysis methods introduced above are examples of unsupervised learning algorithms. A learning method is considered unsupervised if it learns in the absence of a teacher signal that provides prior knowledge of the correct answer.

Supervised learning has a substantial advantage over unsupervised learning. In particular, supervised learning allows us to take advantage of our own knowledge about the classification problem we are trying to solve. Instead of just letting the algorithm work out for itself what the classes should be, we can tell it what we know about the classes: how many there are and what examples of each one look like. The supervised learning algorithm's job is then to find the features in the examples that are most useful in predicting the classes. Neural networks and support vector machines are widely used in supervised learning.

The following is an example of supervised learning. For feature vectors S = {s_1, s_2, ..., s_N} and a classifier χ : s_i → {+1, -1}, the objective is to find a hyperplane H passing through the origin so that the objective function

    Σ_i sign(H(s_i)) χ(s_i)

is maximized. Basically, you want to maximize the number of points for which sign(H(s_i)) = χ(s_i). Sometimes a hyperplane does not work because the points are not linearly separable; in that case, you could look at different hyper-surfaces to separate the data points.

References:

[1] S. C. Johnson (1967): "Hierarchical Clustering Schemes", Psychometrika, 32:241-254.
[2] J. B. MacQueen (1967): "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:281-297.
[3] A Tutorial on Clustering Algorithms, http://www.elet.polimi.it/upload/matteucci/Clustering/tutorial_html/
[4] Jeong-Ho Chang, http://cbit.snu.ac.kr/tutorial-2002/ppt/clusteranalysis.pdf
[5] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning.