Data Mining (資料探勘), Tamkang University (淡江大學)
分群分析 (Cluster Analysis)
Min-Yuh Day (戴敏育), Assistant Professor (專任助理教授)
Dept. of Information Management, Tamkang University (淡江大學資訊管理學系)
http://mail.tku.edu.tw/myday/
課程大綱 (Syllabus)
週次 (Week) / 日期 (Date) / 內容 (Subject/Topics)
資料探勘導論 (Introduction to Data Mining)
關連分析 (Association Analysis)
分類與預測 (Classification and Prediction)
分群分析 (Cluster Analysis)
個案分析與實作一 (SAS EM 分群分析): Case Study 1 (Cluster Analysis, K-Means using SAS EM)
個案分析與實作二 (SAS EM 關連分析): Case Study 2 (Association Analysis using SAS EM)
教學行政觀摩日 (Off-campus study)
個案分析與實作三 (SAS EM 決策樹、模型評估): Case Study 3 (Decision Tree, Model Evaluation using SAS EM)
課程大綱 (Syllabus)
週次 (Week) / 日期 (Date) / 內容 (Subject/Topics)
期中報告 (Midterm Project Presentation)
期中考試週 (Midterm Exam)
個案分析與實作四 (SAS EM 迴歸分析、類神經網路): Case Study 4 (Regression Analysis, Artificial Neural Network using SAS EM)
文字探勘與網頁探勘 (Text and Web Mining)
海量資料分析 (Big Data Analytics)
期末報告 (Final Project Presentation)
畢業考試週 (Final Exam)
Outline
Cluster Analysis
K-Means Clustering
Source: Han & Kamber
Cluster Analysis
Used for automatic identification of natural groupings of things
Part of the machine-learning family
Employs unsupervised learning
Learns the clusters of things from past data, then assigns new instances
There is no output variable
Also known as segmentation
Source: Turban et al., Decision Support and Business Intelligence Systems
Cluster Analysis
Clustering of a set of objects based on the k-means method. (The mean of each cluster is marked by a "+".)
Source: Han & Kamber
Cluster Analysis
Clustering results may be used to:
Identify natural groupings of customers
Identify rules for assigning new cases to classes for targeting/diagnostic purposes
Provide characterization, definition, and labeling of populations
Decrease the size and complexity of problems for other data mining methods
Identify outliers in a specific domain (e.g., rare-event detection)
Source: Turban et al., Decision Support and Business Intelligence Systems
Example of Cluster Analysis
(Table: ten example points p_a through p_j with their (x, y) coordinates.)
Cluster Analysis for Data Mining
Analysis methods:
Statistical methods (including both hierarchical and nonhierarchical), such as k-means, k-modes, and so on
Neural networks (adaptive resonance theory [ART], self-organizing map [SOM])
Fuzzy logic (e.g., fuzzy c-means algorithm)
Genetic algorithms
Divisive versus agglomerative methods
Source: Turban et al., Decision Support and Business Intelligence Systems
Cluster Analysis for Data Mining
How many clusters?
There is no truly optimal way to calculate it; heuristics are often used:
Look at the sparseness of clusters.
Number of clusters ≈ (n/2)^(1/2), where n is the number of data points (a small numeric sketch follows below).
Use the Akaike information criterion (AIC).
Use the Bayesian information criterion (BIC).
Most cluster analysis methods involve the use of a distance measure to calculate the closeness between pairs of items, e.g., Euclidean versus Manhattan (rectilinear) distance.
Source: Turban et al., Decision Support and Business Intelligence Systems
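A minimal sketch of the (n/2)^(1/2) rule of thumb in Python; the dataset size n = 200 is an assumed example, not a value from the slides:

```python
import math

n = 200  # assumed number of data points (illustrative only)

# Rule-of-thumb estimate for the number of clusters: k ~ sqrt(n / 2)
k = round(math.sqrt(n / 2))
print(k)  # 10 for n = 200
```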
k-Means Clustering Algorithm
k: a pre-determined number of clusters
Algorithm (Step 0: determine the value of k)
Step 1: Randomly generate k points as initial cluster centers.
Step 2: Assign each point to the nearest cluster center.
Step 3: Re-compute the new cluster centers.
Repetition step: Repeat Steps 2 and 3 until some convergence criterion is met (usually that the assignment of points to clusters becomes stable). A minimal code sketch of these steps follows below.
Source: Turban et al., Decision Support and Business Intelligence Systems
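A minimal Python/NumPy sketch of the steps above, assuming a plain Euclidean distance; the function and variable names are my own, and this is an illustration rather than the SAS EM implementation used in the course case studies:

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    """Plain k-means sketch: choose k initial centers, then alternate
    point assignment and center re-computation until assignments stabilize."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)

    # Step 1: randomly pick k of the data points as initial cluster centers
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()

    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest cluster center (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_assignments = distances.argmin(axis=1)

        # Repetition step: stop when the assignment of points to clusters is stable
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments

        # Step 3: re-compute each cluster center as the mean of its member points
        for j in range(k):
            members = points[assignments == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)

    return centers, assignments
```

A call such as `centers, labels = k_means(data, k=2)` would return the final cluster means and the cluster label of every point.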
Cluster Analysis for Data Mining: k-Means Clustering Algorithm
(Figure: the algorithm's steps illustrated on sample data.)
Source: Turban et al., Decision Support and Business Intelligence Systems
Similarity and Dissimilarity Between Objects
Distances are normally used to measure the similarity or dissimilarity between two data objects.
Some popular ones include the Minkowski distance:

d(i,j) = \sqrt[q]{|x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q}

where i = (x_{i1}, x_{i2}, \ldots, x_{ip}) and j = (x_{j1}, x_{j2}, \ldots, x_{jp}) are two p-dimensional data objects, and q is a positive integer.

If q = 1, d is the Manhattan distance:

d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|

Source: Han & Kamber
Similarity and Dissimilarity Between Objects (Cont.)
If q = 2, d is the Euclidean distance:

d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2}

Properties:
d(i,j) \geq 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) \leq d(i,k) + d(k,j)

Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures.
Source: Han & Kamber
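A small Python sketch of the Minkowski distance and its q = 1 (Manhattan) and q = 2 (Euclidean) special cases; the function names are my own:

```python
def minkowski_distance(i, j, q):
    """Minkowski distance between two p-dimensional points i and j."""
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

def manhattan_distance(i, j):
    return minkowski_distance(i, j, q=1)  # q = 1

def euclidean_distance(i, j):
    return minkowski_distance(i, j, q=2)  # q = 2
```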
Euclidean distance vs. Manhattan distance
Distance between two points x1 = (x_{11}, x_{12}) and x2 = (x_{21}, x_{22}):
Euclidean distance = \sqrt{(x_{21}-x_{11})^2 + (x_{22}-x_{12})^2}
Manhattan distance = |x_{21}-x_{11}| + |x_{22}-x_{12}|
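As a worked check with two hypothetical points (the coordinates below are illustrative assumptions, not necessarily the values on the original slide), a minimal Python computation:

```python
import math

# Two hypothetical 2-D points (illustrative values only)
x1 = (1, 2)
x2 = (3, 5)

# Euclidean distance: sqrt((3-1)^2 + (5-2)^2) = sqrt(4 + 9) = sqrt(13) ~ 3.61
euclidean = math.sqrt((x2[0] - x1[0]) ** 2 + (x2[1] - x1[1]) ** 2)

# Manhattan distance: |3-1| + |5-2| = 2 + 3 = 5
manhattan = abs(x2[0] - x1[0]) + abs(x2[1] - x1[1])

print(euclidean, manhattan)
```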
Example: The K-Means Clustering Method
(Figure: arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; repeat until the clusters are stable.)
Source: Han & Kamber
K-Means Clustering Step by Step
(Table: the ten example points p_a through p_j with their (x, y) coordinates.)
K-Means Clustering
Step 1: K = 2; arbitrarily choose K objects as the initial cluster centers m1 and m2.
(Table: the points p_a through p_j with their (x, y) coordinates, and the initial centers m1 and m2.)
K-Means Clustering
Compute the seed points as the centroids of the clusters of the current partition, then assign each object to the most similar center.
(Table: for each point p_a through p_j, its coordinates, its distance to m1 and to m2, and the cluster it is assigned to.)
K-Means Clustering
Compute the seed points as the centroids of the clusters of the current partition, then assign each object to the most similar center (a short sketch of this assignment step follows below).
The Euclidean distance from a point b to a center m is \sqrt{(x_b - x_m)^2 + (y_b - y_m)^2}; the slide works this out for point b against both m1 and m2.
(Table: for each point p_a through p_j, its coordinates, its distance to m1 and to m2, and its assigned cluster.)
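A minimal sketch of the assignment step, assuming two illustrative centers and a handful of made-up points (none of these values come from the slide):

```python
import numpy as np

# Illustrative data: made-up points and two made-up centers
points = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 7.0]])
centers = np.array([[1.0, 1.0],   # m1
                    [8.0, 9.0]])  # m2

# Euclidean distance from every point to each center
distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

# Assign each point to the cluster of its nearest center (0 -> m1, 1 -> m2)
clusters = distances.argmin(axis=1)
print(clusters)  # [0 0 1 1]
```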
K-Means Clustering
Update the cluster means m1 and m2; repeat the assignment and update steps, and stop when there are no new assignments.
(Table: for each point, its coordinates, its distance to the updated m1 and m2, and its assigned cluster, with the updated centers.)
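Continuing the same made-up example from the previous sketch, the update step recomputes each cluster mean from the points currently assigned to it:

```python
import numpy as np

points = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 7.0]])
clusters = np.array([0, 0, 1, 1])  # assignments from the previous step

# New cluster means: the average of the member points of each cluster
m1 = points[clusters == 0].mean(axis=0)  # [1.5, 1.5]
m2 = points[clusters == 1].mean(axis=0)  # [8.5, 7.5]
print(m1, m2)
```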
K-Means Clustering
Update the cluster means again and repeat the assignment step, stopping when there are no new assignments.
(Table: for each point, its coordinates, its distance to the updated m1 and m2, and its assigned cluster, with the updated centers.)
K-Means Clustering
Stop when there are no new assignments.
(Table: each point's coordinates, its distance to m1 and to m2, and its final cluster assignment, with the final centers m1 and m2.)
K-Means Clustering (K = 2, two clusters)
Stop when there are no new assignments.
(Table: each point's coordinates, its distance to m1 and to m2, and its final cluster assignment, with the final centers m1 and m2.)
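In practice a library can run the whole loop; below is a hedged sketch using scikit-learn on made-up 2-D points standing in for p_a through p_j (an assumption for illustration only; the course case studies use SAS EM, not scikit-learn):

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points standing in for p_a .. p_j (illustrative only)
X = np.array([[1, 2], [2, 1], [1, 1], [2, 2], [3, 2],
              [8, 8], [9, 7], [8, 9], [7, 8], [9, 9]])

# K = 2, matching the two-cluster example above
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # final cluster means m1 and m2
```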
Summary
Cluster Analysis
K-Means Clustering
Source: Han & Kamber
References
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, Elsevier.
Efraim Turban, Ramesh Sharda, and Dursun Delen, Decision Support and Business Intelligence Systems, Ninth Edition, Pearson.