Chapter 3: Cluster Analysis
3.1 Basic Concepts of Clustering
3.1.1 Cluster Analysis
3.1.2 Clustering Categories
3.2 Partitioning Methods
3.2.1 The Principle
3.2.2 K-Means Method
3.2.3 K-Medoids Method
3.2.4 CLARA
3.2.5 CLARANS
3.3 Hierarchical Methods
3.4 Density-based Methods
3.5 Clustering High-Dimensional Data
3.6 Outlier Analysis
3.1.1 Cluster Analysis
Unsupervised learning (i.e., the class label is unknown)
Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
Principle: maximize intra-class similarity and minimize inter-class similarity
Typical applications: WWW, social networks, marketing, biology, libraries, etc.
3.1.2 Clustering Categories
Partitioning methods: construct k partitions of the data
Hierarchical methods: create a hierarchical decomposition of the data
Density-based methods: grow a given cluster depending on its density (number of data objects)
Grid-based methods: quantize the object space into a finite number of cells
Model-based methods: hypothesize a model for each cluster and find the best fit of the data to the given model
Clustering high-dimensional data: subspace clustering
Constraint-based methods: used for user-specific applications
3.2.1 Partitioning Methods: The Principle
Given:
A data set of n objects
K, the number of clusters to form
Organize the objects into k partitions (k <= n), where each partition represents a cluster
The clusters are formed to optimize an objective partitioning criterion:
Objects within a cluster are similar
Objects of different clusters are dissimilar
3.2.2 K-Means Method
Choose 3 objects (cluster centroids)
Goal: create 3 clusters (partitions)
Assign each object to the closest centroid to form clusters
Update the cluster centroids
[Figure: objects assigned to three centroids, each marked +]
K-Means Method
Recompute clusters
If the centroids are stable, then stop
[Figure: recomputed centroids after reassignment]
K-Means Algorithm
Input:
K: the number of clusters
D: a data set containing n objects
Output: a set of k clusters
Method:
(1) Arbitrarily choose k objects from D as the initial cluster centers
(2) Repeat
(3) Reassign each object to the most similar cluster, based on the mean value of the objects in the cluster
(4) Update the cluster means
(5) Until no change
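Steps (1)-(5) can be sketched in Python. This is a minimal illustration, not the textbook's code: for reproducibility it seeds the centers with the first k objects instead of an arbitrary choice, and it assumes numeric feature vectors.

```python
import math

def kmeans(points, k, max_iter=100):
    centers = [list(p) for p in points[:k]]            # (1) initial cluster centers
    assign = None
    for _ in range(max_iter):                          # (2) repeat
        # (3) reassign each object to the closest (most similar) center
        new_assign = [min(range(k), key=lambda i: math.dist(p, centers[i]))
                      for p in points]
        if new_assign == assign:                       # (5) until no change
            break
        assign = new_assign
        for i in range(k):                             # (4) update the cluster means
            members = [p for p, a in zip(points, assign) if a == i]
            if members:
                centers[i] = [sum(xs) / len(members) for xs in zip(*members)]
    return centers, assign
```

For example, `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2)` converges in a few iterations to the two obvious groups.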
K-Means Properties
The algorithm attempts to determine k partitions that minimize the square-error function:
E = Σ_{i=1}^{k} Σ_{p∈C_i} |p − m_i|²
E: the sum of the squared error for all objects in the data set
p: the data point in the space representing an object
m_i: the mean of cluster C_i
It works well when the clusters are compact clouds that are rather well separated from one another
K-Means Properties
Advantages:
K-means is relatively scalable and efficient in processing large data sets
The computational complexity of the algorithm is O(nkt)
n: the total number of objects
k: the number of clusters
t: the number of iterations
Normally: k << n and t << n
Disadvantages:
Can be applied only when the mean of a cluster is defined
Users need to specify k
K-means is not suitable for discovering clusters with nonconvex shapes or clusters of very different sizes
It is sensitive to noise and outlier data points (they can influence the mean value)
Variations of the K-Means Method
A few variants of k-means differ in:
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
Handling categorical data: k-modes (Huang 98)
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data
November 2010, Data Mining: Concepts and Techniques
3.2.3 K-Medoids Method
Minimizes the sensitivity of k-means to outliers
Picks actual objects to represent clusters instead of mean values
Each remaining object is clustered with the representative object (medoid) to which it is the most similar
The algorithm minimizes the sum of the dissimilarities between each object and its corresponding reference point:
E = Σ_{i=1}^{k} Σ_{p∈C_i} |p − o_i|
E: the sum of absolute error for all objects in the data set
p: the data point in the space representing an object
o_i: the representative object of cluster C_i
K-Medoids Method: The Idea
Initial representatives are chosen randomly
The iterative process of replacing representative objects by non-representative objects continues as long as the quality of the clustering is improved
For each representative object O:
For each non-representative object R, swap O and R
Choose the configuration with the lowest cost
The cost function is the difference in absolute-error value if a current representative object is replaced by a non-representative object
K-Medoids Method: Example
Data objects:
Oi    A1   A2
O1     2    6
O2     3    4
O3     3    8
O4     4    7
O5     6    2
O6     6    4
O7     7    3
O8     7    4
O9     8    5
O10    7    6
Goal: create two clusters
Choose randomly two medoids: O2 = (3,4) and O8 = (7,4)
[Figure: scatter plot of the ten objects]
K-Medoids Method: Example
Assign each object to the closest representative object
Using the L1 metric (Manhattan distance), we form the following clusters:
Cluster1 = {O1, O2, O3, O4}
Cluster2 = {O5, O6, O7, O8, O9, O10}
[Figure: the two clusters around medoids O2 and O8]
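The assignment step can be reproduced with a short Python sketch (illustrative only; the coordinates come from the data-objects table in this example):

```python
# Ten objects from the example, keyed by name.
points = {"O1": (2, 6), "O2": (3, 4), "O3": (3, 8), "O4": (4, 7), "O5": (6, 2),
          "O6": (6, 4), "O7": (7, 3), "O8": (7, 4), "O9": (8, 5), "O10": (7, 6)}

def manhattan(p, q):
    # L1 (Manhattan) distance between two 2-D points
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

# The two randomly chosen medoids: O2 = (3,4) and O8 = (7,4).
medoids = {"cluster1": points["O2"], "cluster2": points["O8"]}
clusters = {name: [] for name in medoids}
for label, p in points.items():
    # assign each object to the closest representative object
    closest = min(medoids, key=lambda m: manhattan(p, medoids[m]))
    clusters[closest].append(label)
```

Running this yields Cluster1 = {O1, O2, O3, O4} and Cluster2 = {O5, O6, O7, O8, O9, O10}, matching the slide.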
K-Medoids Method: Example
Compute the absolute-error criterion for the set of medoids (O2, O8):
E = Σ_{i=1}^{k} Σ_{p∈C_i} |p − o_i|
K-Medoids Method: Example
The absolute-error criterion for the set of medoids (O2, O8):
E = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20
K-Medoids Method: Example
Choose a random object, O7
Swap O8 and O7
Compute the absolute-error criterion for the set of medoids (O2, O7):
E = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22
K-Medoids Method: Example
Compute the cost function:
S = absolute error [for (O2, O7)] − absolute error [for (O2, O8)] = 22 − 20 = 2
S > 0, so it is a bad idea to replace O8 by O7
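The two error values and the swap cost S can be checked with a few lines of Python (a sketch using the example's coordinates and the L1 metric):

```python
points = {"O1": (2, 6), "O2": (3, 4), "O3": (3, 8), "O4": (4, 7), "O5": (6, 2),
          "O6": (6, 4), "O7": (7, 3), "O8": (7, 4), "O9": (8, 5), "O10": (7, 6)}

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def absolute_error(medoid_labels):
    # each object contributes its L1 distance to the nearest medoid
    medoids = [points[m] for m in medoid_labels]
    return sum(min(manhattan(p, m) for m in medoids) for p in points.values())

e_before = absolute_error(["O2", "O8"])   # 20
e_after = absolute_error(["O2", "O7"])    # 22
s = e_after - e_before                    # 2 > 0: reject the swap
```

Since S = 2 > 0, the swap of O8 with O7 is rejected, as on the slide.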
K-Medoids Method
In this example, changing the medoid of cluster 2 did not change the assignments of objects to clusters.
What are the possible cases when we replace a medoid by another object?
K-Medoids Method
(In the figures, A and B are the current representative objects of clusters 1 and 2, and B is replaced by a randomly selected object.)
First case: P is currently assigned to A; the assignment of P to A does not change
Second case: P is currently assigned to B; P is reassigned to A
K-Medoids Method
Third case: P is currently assigned to B; P is reassigned to the new representative object
Fourth case: P is currently assigned to A; P is reassigned to the new representative object
K-Medoids Algorithm (PAM)
PAM: Partitioning Around Medoids
Input:
K: the number of clusters
D: a data set containing n objects
Output: a set of k clusters
Method:
(1) Arbitrarily choose k objects from D as representative objects (seeds)
(2) Repeat
(3) Assign each remaining object to the cluster with the nearest representative object
(4) For each representative object O_j
(5) Randomly select a non-representative object O_random
(6) Compute the total cost S of swapping representative object O_j with O_random
(7) If S < 0, then replace O_j with O_random
(8) Until no change
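The PAM loop can be sketched in Python. This is an illustrative, deterministic variant: instead of randomly selecting O_random as in step (5), it tries every non-representative object (the classic exhaustive swap), and it seeds with the first k objects.

```python
def manhattan(p, q):
    # L1 distance for points of any dimension
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    # sum of distances from each object to its nearest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(points[:k])                 # (1) choose k seeds
    best = total_cost(points, medoids)
    improved = True
    while improved:                            # (2) repeat ... (8) until no change
        improved = False
        for i in range(k):                     # (4) for each representative object
            for o in points:                   # (5) candidate replacement
                if o in medoids:
                    continue
                trial = medoids[:i] + [o] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:                # (6)-(7) accept the swap if S < 0
                    medoids, best, improved = trial, cost, True
    return medoids, best
```

On a toy data set such as `[(0, 0), (0, 1), (5, 5), (5, 6)]` with k = 2, the swaps quickly converge to one medoid per group.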
K-Medoids Properties (k-medoids vs. k-means)
The complexity of each iteration is O(k(n − k)²)
For large values of n and k, such computation becomes very costly
Advantages:
The k-medoids method is more robust than k-means in the presence of noise and outliers
Disadvantages:
K-medoids is more costly than the k-means method
Like k-means, k-medoids requires the user to specify k
It does not scale well for large data sets
3.2.4 CLARA
CLARA (Clustering LARge Applications) uses a sampling-based method to deal with large data sets
A random sample should closely represent the original data, and PAM is applied to the sample
The chosen medoids will likely be similar to what would have been chosen from the whole data set
CLARA
Draw multiple samples of the data set
Apply PAM to each sample
Choose the best clustering
Return the best clustering
[Figure: sample 1 ... sample m, each run through PAM to produce clusters; the best clustering is returned]
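The CLARA loop above can be sketched in Python. The `run_pam` parameter is a stand-in for any PAM implementation; note that each sample's medoids are scored against the full data set, not just the sample, when choosing the best clustering.

```python
import random

def clara(points, k, num_samples=5, sample_size=40, run_pam=None):
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_samples):
        # draw a random sample and cluster it with PAM
        sample = random.sample(points, min(sample_size, len(points)))
        medoids, _ = run_pam(sample, k)
        # evaluate the sample's medoids against the WHOLE data set (L1 metric)
        cost = sum(min(abs(p[0] - m[0]) + abs(p[1] - m[1]) for m in medoids)
                   for p in points)
        if cost < best_cost:               # keep the best clustering found so far
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```

A biased or unlucky sample may exclude the true best medoids, which is exactly the limitation discussed in the CLARA properties below.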
CLARA Properties
Complexity of each iteration: O(ks² + k(n − k))
s: the size of the sample
k: the number of clusters
n: the number of objects
PAM finds the best k medoids among a given data set; CLARA finds the best k medoids among the selected samples
Problems:
The best k medoids may not be selected during the sampling process; in this case, CLARA will never find the best clustering
If the sampling is biased, we cannot have a good clustering
It trades clustering quality for efficiency
3.2.5 CLARANS
CLARANS (Clustering Large Applications based upon RANdomized Search) was proposed to improve the quality and the scalability of CLARA
It combines sampling techniques with PAM
It does not confine itself to any sample at a given time
It draws a sample with some randomness in each step of the search
CLARANS: The Idea
Clustering view: the search moves from the current medoids to a neighboring set of medoids (differing in one object) whenever the neighbor has a lower cost
[Figure: graph of medoid sets with their costs; when no examined neighbor is cheaper, keep the current medoids]
CLARA vs. CLARANS: The Idea
CLARA:
Draws a sample of nodes at the beginning of the search
Neighbors are taken from the chosen sample
Restricts the search to a specific area of the original data
[Figure: first and second steps of the search; in both, the current medoids and their neighbors come from the chosen sample]
CLARANS: The Idea
CLARANS:
Does not confine the search to a localized area
Stops the search when a local minimum is found
Finds several local optima and outputs the clustering with the best local optimum
The number of neighbors sampled from the original data is specified by the user
[Figure: first and second steps of the search; at each step, a random sample of neighbors is drawn from the original data]
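The CLARANS search can be sketched as follows. This is a simplified illustration; the parameter names `numlocal` (number of local minima to find) and `maxneighbor` (number of random neighbors to examine) are assumptions following the original paper's terminology, and `cost_fn` is any clustering-cost function such as the absolute-error criterion.

```python
import random

def clarans(points, k, numlocal=2, maxneighbor=20, cost_fn=None):
    best_medoids, best_cost = None, float("inf")
    for _ in range(numlocal):                      # restart from a random node
        current = random.sample(points, k)
        current_cost = cost_fn(points, current)
        j = 0
        while j < maxneighbor:                     # examine random neighbors
            i = random.randrange(k)
            candidate = random.choice([p for p in points if p not in current])
            neighbor = current[:i] + [candidate] + current[i + 1:]
            neighbor_cost = cost_fn(points, neighbor)
            if neighbor_cost < current_cost:       # move to the cheaper neighbor
                current, current_cost, j = neighbor, neighbor_cost, 0
            else:
                j += 1
        if current_cost < best_cost:               # keep the best local minimum
            best_medoids, best_cost = current, current_cost
    return best_medoids, best_cost
```

Unlike PAM, only a random sample of neighbors is examined at each step; unlike CLARA, the neighbors are drawn from the full data set rather than from one fixed sample.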
CLARANS Properties
Advantages:
Experiments show that CLARANS is more effective than both PAM and CLARA
It handles outliers
Disadvantages:
The computational complexity of CLARANS is O(n²), where n is the number of objects
The clustering quality depends on the sampling method
Summary of Section 3.2
Partitioning methods find sphere-shaped clusters
K-means is efficient for large data sets but sensitive to outliers
PAM uses representative objects (medoids) as cluster centers instead of means
CLARA and CLARANS are used for clustering large databases