Clustering. Data Mining. Abraham Otero.


Samantha Benson
10 years ago

Clustering

Agenda: Introduction, Distance, K-nearest neighbors, Hierarchical clustering, Quick reference

Introduction

It seems logical that in a new situation we should act as we did in previous similar situations, if we succeeded in them. To take advantage of this strategy it is necessary to define what is meant by "similar", or the equivalent mathematical concept of "distance". It is also necessary to decide when to exploit this similarity: in an eager mode, processing the available data before starting the process; or in a lazy mode, processing the data as it arrives.

[Figure: problem formulation]

Distance

Several common distances: Manhattan (p-norm with p=1), Euclidean (p-norm with p=2), the general Minkowski distance or p-norm (any p ≥ 1), and Chebyshev (the limit of the p-norm as p → ∞).
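As a rough illustration of the distances listed above, the following Python sketch computes each of them for a pair of points. The function names are chosen here for illustration and do not come from the slides.

```python
import math

def minkowski(a, b, p):
    """p-norm distance between two equal-length numeric sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def manhattan(a, b):
    """Sum of absolute coordinate differences (p = 1)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance (p = 2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chebyshev(a, b):
    """Largest absolute coordinate difference (limit p -> infinity)."""
    return max(abs(x - y) for x, y in zip(a, b))

a, b = (0.0, 0.0), (3.0, 4.0)
print(manhattan(a, b))   # 7.0
print(euclidean(a, b))   # 5.0
print(chebyshev(a, b))   # 4.0
```

The same pair of points is 7, 5 or 4 units apart depending on the distance chosen, which is why the choice has to match the problem.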


Distance: be careful when applying distances. [Two figures]

Distance: always normalize first. But when normalizing, beware of outliers!
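A minimal sketch of why outliers matter when normalizing, assuming min-max scaling to [0, 1] as the normalization (the slides do not say which method they use). The data values are made up for illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20.0, 30.0, 40.0, 50.0]      # hypothetical attribute values
with_outlier = incomes + [1000.0]       # same data plus one extreme value

print(min_max_scale(incomes))           # values spread evenly over [0, 1]
print(min_max_scale(with_outlier))      # ordinary values squeezed near 0
```

With the outlier present, the four ordinary values end up below 0.05, so the attribute contributes almost nothing to any distance between them.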

Distance: sometimes we need to calculate the distance between a point and a set of points. [Figure]
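The figure for this slide did not survive, so the options below are common choices for a point-to-set distance rather than necessarily the ones the slide showed. The sketch assumes Euclidean distance; `point_to_set` and its `mode` parameter are hypothetical names.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def point_to_set(p, points, mode="min"):
    """Distance from point p to a set of points under several conventions."""
    dists = [euclidean(p, q) for q in points]
    if mode == "min":        # distance to the closest member
        return min(dists)
    if mode == "max":        # distance to the farthest member
        return max(dists)
    if mode == "mean":       # average distance to all members
        return sum(dists) / len(dists)
    if mode == "centroid":   # distance to the set's mean point
        return euclidean(p, centroid(points))
    raise ValueError(f"unknown mode: {mode}")

group = [(3.0, 0.0), (5.0, 0.0)]
print(point_to_set((0.0, 0.0), group, "min"))       # 3.0
print(point_to_set((0.0, 0.0), group, "max"))       # 5.0
print(point_to_set((0.0, 0.0), group, "centroid"))  # 4.0
```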


k-nearest neighbors

The k-nearest neighbors algorithm (k-NN) is a method for classifying objects based on the closest training examples in the feature space. It is a lazy, instance-based learning algorithm. An object is classified by a majority vote of its neighbors: it is assigned to the class that is most common among its k nearest neighbors.

It is one of the simplest methods, but it requires an initial set of labeled points. It is critical to determine an appropriate value for k: try several values. [Figure: example with two classes, Circle and Square]
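The majority vote can be sketched in a few lines of Python. This is an illustrative version assuming Euclidean distance (via `math.dist`, available from Python 3.8) and a made-up training set. The function name `knn_classify` is hypothetical.

```python
import math
from collections import Counter

def knn_classify(query, labeled_points, k):
    """Classify `query` by majority vote among its k nearest labeled points.

    labeled_points is a list of (vector, label) pairs.
    """
    # Lazy learning: nothing is precomputed, all work happens at query time.
    neighbors = sorted(labeled_points,
                       key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [
    ((1.0, 1.0), "circle"), ((1.5, 2.0), "circle"), ((2.0, 1.0), "circle"),
    ((5.0, 5.0), "square"), ((6.0, 5.5), "square"), ((5.5, 6.5), "square"),
]

print(knn_classify((1.2, 1.4), training, k=3))  # circle
print(knn_classify((5.4, 5.6), training, k=3))  # square
```

With two classes, an odd k avoids tied votes. Trying several values of k on held-out labeled points is one simple way to follow the advice above.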


K-means

K-means is a prototype-based clustering algorithm. Each of the existing classes is represented by a prototype vector (a fictitious instance of the class) called the centroid. Once the centroids have been calculated, a new element is classified by finding its closest centroid; that centroid's class is its class. The centroids divide the space into a set of regions called Voronoi regions.
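Classifying a new element by its closest centroid can be sketched as follows, assuming Euclidean distance and centroids that have already been computed. The centroid coordinates are invented for the example.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

centroids = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]

print(nearest_centroid((1.0, 1.0), centroids))  # 0
print(nearest_centroid((9.0, 1.0), centroids))  # 1
print(nearest_centroid((5.0, 6.0), centroids))  # 2
```

All points that map to index i form the Voronoi region of centroid i.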


Centroid calculation: [formula not recovered]

Algorithm: [pseudocode not recovered]
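The slide's own formula and pseudocode are missing from this transcript. As a stand-in, here is a minimal sketch of the standard K-means iteration (Lloyd's algorithm), which may differ in detail from what the slide showed. It assumes Euclidean distance, centroids computed as coordinate-wise means, and random initial centroids drawn from the data. The names `kmeans`, `mean_point`, `max_iter` and `seed` are choices made here.

```python
import math
import random

def mean_point(points):
    """Centroid of a non-empty cluster: the coordinate-wise mean."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, clusters) for a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # assumed initialization
    for _ in range(max_iter):
        # Assignment step: each point joins its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        # Update step: recompute centroids; keep the old one if a cluster is empty.
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:         # no change: converged
            break
        centroids = new_centroids
    return centroids, clusters

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = kmeans(data, k=2)
print(centroids)
print(clusters)
```

On this well-separated toy data the two groups of three points are recovered. On harder data the result depends on the initial centroids, so it is worth running with several seeds.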

Sample (successful) run: [figure]

Initialization matters: try different initial values.

The selection of K is critical: try different values of K. [Figure: K=3 versus K=4]

Limitations: different cluster sizes. [Figure]

Limitations: different densities. [Figure]

Limitations: non-globular shapes. [Figure]

One possible solution is to use many clusters: find parts of clusters, then put them together.

What about nominal attributes? We can define a function that takes one value if a = b and another otherwise. The distance between two classes is then given by: [formula not recovered]
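The values of the function and the distance formula were lost in this transcript. One common choice, assumed here and not confirmed by the slides, is a mismatch function that is 0 when the two values are equal and 1 otherwise, summed over all attributes. The names `delta` and `nominal_distance` are hypothetical.

```python
def delta(a, b):
    """Assumed mismatch function: 0 if the values are equal, 1 otherwise."""
    return 0 if a == b else 1

def nominal_distance(x, y):
    """Number of attributes on which two tuples of nominal values differ."""
    return sum(delta(a, b) for a, b in zip(x, y))

x = ("red", "small", "round")
y = ("red", "large", "square")
print(nominal_distance(x, y))  # 2
```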

KMeans demo: ering/tutorial_html/appletkm.html Applet/Code/Cluster.html

ualberta.ca/~yaling/cluster/ Applet/Code/Cluster.

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a data clustering algorithm that is not prototype based. It finds a number of clusters starting from the estimated density distribution of the corresponding nodes. It classifies points into three categories. A core point has more than a specified number of points (MinPts) within a radius Eps; these points are the interior of a cluster. A border point has fewer than MinPts points within Eps, but is in the neighborhood of a core point. A noise point is any point that is neither a core point nor a border point.

[Figure: example]
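The three categories can be sketched in Python as below. Two assumptions are made here: distance is Euclidean, and a point counts as core when its Eps-neighborhood, including the point itself, holds at least MinPts points. The slide's wording ("more than" and "fewer than") leaves the boundary case open, so the threshold is a convention, not something taken from the slides.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points)
            if math.dist(points[i], q) <= eps]

def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise'."""
    neighborhoods = [region_query(points, i, eps) for i in range(len(points))]
    # Assumed convention: at least min_pts points in the neighborhood.
    core = {i for i, nb in enumerate(neighborhoods) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighborhoods):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):   # near a core point, but not dense
            labels.append("border")
        else:
            labels.append("noise")
    return labels

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
       (2.2, 1.0), (10.0, 10.0)]
print(classify_points(pts, eps=1.5, min_pts=4))
# ['core', 'core', 'core', 'core', 'border', 'noise']
```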


Algorithm: (1) Classify points as noise, border and core. (2) Eliminate noise points. (3) Perform clustering on the remaining points.

[Figure: example]
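The three steps can be sketched as a compact DBSCAN in Python. This is an illustrative version, not the slides' own pseudocode. It assumes Euclidean distance and the "at least MinPts, counting the point itself" convention for core points. Noise points are "eliminated" by leaving them with the label -1.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks noise points."""
    n = len(points)
    neighbors = [[j for j in range(n)
                  if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    is_core = [len(nb) >= min_pts for nb in neighbors]   # assumed convention
    labels = [NOISE] * n
    cluster_id = 0
    for i in range(n):
        if not is_core[i] or labels[i] != NOISE:
            continue
        # Grow a new cluster outward from an unassigned core point.
        labels[i] = cluster_id
        frontier = [i]
        while frontier:
            current = frontier.pop()
            for j in neighbors[current]:
                if labels[j] == NOISE:
                    labels[j] = cluster_id
                    if is_core[j]:          # only core points keep expanding
                        frontier.append(j)
        cluster_id += 1
    return labels

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
       (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0),
       (20.0, 20.0)]
print(dbscan(pts, eps=1.5, min_pts=4))
# [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Border points receive the label of whichever cluster reaches them first and never extend a cluster themselves.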

Strong points: resistant to noise; can handle clusters of different shapes and sizes.

Weak points: clusters with varying densities; high-dimensional data (it usually becomes too sparse).

Parameter determination. For MinPts a small number is usually employed; for two-dimensional experimental data, 4 has been shown to be the most reasonable value. Eps is trickier, as we have seen. A possible solution: for points in a cluster, their k-th nearest neighbors are at roughly the same distance, whereas noise points have their k-th nearest neighbor at a greater distance. So, plot the sorted distance of every point to its k-th nearest neighbor. [Figure]
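The values to plot can be computed with a short Python sketch, assuming Euclidean distance. The function name `k_distances` is chosen here. Plotting is left out to keep the example dependency-free.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        # Distances to every other point, nearest first.
        dists = sorted(math.dist(p, q)
                       for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (20.0, 20.0)]
for d in k_distances(pts, k=3):
    print(round(d, 3))
```

In the sorted curve, cluster points form a flat stretch and noise points a sharp rise at the end. A value of Eps just below that rise is a reasonable starting guess.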

Demo: de/cluster.html

Hierarchical clustering

Hierarchical clustering builds a hierarchy of clusters based on distance measurements. The traditional representation of this hierarchy is a tree (called a dendrogram), with individual elements at the leaves and a single cluster containing every element at the root. The tree-like diagram can be interpreted as a sequence of merges or splits. Any desired number of clusters can be obtained by cutting the dendrogram at the proper level.

There are two main types of hierarchical clustering. Agglomerative (AGNES, Agglomerative NESting): start with the points as individual clusters and, at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left. Divisive (DIANA, DIvisive ANAlysis clustering): start with one all-inclusive cluster and, at each step, split a cluster until each cluster contains a single point (or there are k clusters).

In both cases, once a decision is made to merge two clusters or split one, it cannot be undone. There is no global minimization.
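The agglomerative variant can be sketched in Python as below. This is a simple quadratic-search illustration, not an efficient implementation. It assumes Euclidean distance and uses the minimum pairwise distance between clusters as the default measure of closeness. The names `agglomerative`, `single_link` and `linkage` are choices made here.

```python
import math

def single_link(c1, c2):
    """Distance between the two closest members of the clusters."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerative(points, k, linkage=single_link):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[p] for p in points]          # every point starts alone
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Greedy step: this merge is never revisited.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
print(agglomerative(pts, k=3))
# [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 5.0), (5.0, 6.0)], [(10.0, 0.0)]]
```

Stopping at k clusters corresponds to cutting the dendrogram at the level where k branches remain.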


Hierarchical clustering: how to define inter-cluster distance? [Figure]

Single link: can handle non-elliptical clusters; sensitive to noise and outliers.

Complete link: less sensitive to noise and outliers; tends to break large clusters; biased towards globular clusters.

Group average and centroid distance: less sensitive to noise and outliers; biased towards globular clusters.
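The inter-cluster distances named above can be written out as follows, using their usual definitions and assuming Euclidean distance between points. The function names are chosen here for illustration.

```python
import math

def single_link(c1, c2):
    """Smallest pairwise distance between the two clusters."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def complete_link(c1, c2):
    """Largest pairwise distance between the two clusters."""
    return max(math.dist(p, q) for p in c1 for q in c2)

def group_average(c1, c2):
    """Mean of all pairwise distances between the two clusters."""
    total = sum(math.dist(p, q) for p in c1 for q in c2)
    return total / (len(c1) * len(c2))

def centroid_distance(c1, c2):
    """Distance between the mean points of the two clusters."""
    m1 = tuple(sum(c) / len(c1) for c in zip(*c1))
    m2 = tuple(sum(c) / len(c2) for c in zip(*c2))
    return math.dist(m1, m2)

a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]
print(single_link(a, b))        # 3.0
print(centroid_distance(a, b))  # 3.0
print(complete_link(a, b))
print(group_average(a, b))
```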

Demo: Hierarchical clustering al_html/appleth.html

Quick reference

Some general tips for choosing the clustering algorithm:

Prototype-based and hierarchical clustering (except single link) tend to form globular clusters. This is good for vector quantization but not for other kinds of data.

Density-based and graph-based algorithms (except those in the previous rule) tend to form non-globular clusters.

Most clustering algorithms work well in low-dimensional spaces. If the dimensionality of the data is very large, consider reducing it beforehand (PCA).

If a taxonomy is to be created, consider hierarchical clustering. If a summarization of the data is needed, consider a partitional clustering.

Can we allow the algorithm to discard outliers? (Ex: DBSCAN.) They might represent unusually profitable customers.

Is it necessary to classify all the data? (Ex: we have to classify all documents in the database.)

Computing the mean makes sense only for real-valued attributes (K-means).

Define an appropriate distance. (Ex: Euclidean distance is valid for real-valued attributes only.)
