# Neural Networks Lesson 5 - Cluster Analysis

## Transcription

Neural Networks Lesson 5 - Cluster Analysis. Prof. Michele Scarpiniti, INFOCOM Dpt., Sapienza University of Rome. Rome, 29 October 2009.

### Cluster Analysis

Intelligent circuits and neural network applications can, in principle, be divided into two main categories:

1. Static data processing: pattern recognition, cluster analysis, associative memory;
2. Dynamic data processing: non-linear filtering, prediction, functional and operator approximation, dynamic pattern recognition, etc.

Let us consider cluster analysis, one of the main problems of interest to computer scientists and engineers.

Cluster analysis is the collective name for methods that group unlabeled data $x_i$ into subsets that are believed to reflect the underlying structure of the data generator. The techniques for clustering are many and diverse.

[Figure: scheme summarizing the clustering techniques]

Hierarchical algorithms find successive clusters using previously established clusters. These algorithms can be either agglomerative ("bottom-up") or divisive ("top-down"):

1. Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters;
2. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.

Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in hierarchical clustering.

Bayesian algorithms try to generate a posterior distribution over the collection of all partitions of the data.

Many clustering algorithms, partitional algorithms among them, require the number of clusters to be specified before the algorithm is run.

In clustering, also known as unsupervised pattern classification, there are no training data with known class labels. A clustering algorithm explores the similarity between the patterns and places similar patterns in the same cluster. Clustering is an unsupervised procedure that uses unlabeled samples. Unsupervised procedures are used for several reasons:

1. collecting and labeling a large set of sample patterns can be costly;
2. one can train with a large amount of unlabeled data, and then use supervision to label the groupings found;
3. exploratory data analysis can provide insight into the nature or structure of the data;
4. well-known clustering applications include data mining and data compression.

A cluster comprises a number of similar objects collected or grouped together. Patterns within a cluster are more similar to each other than are patterns in different clusters. Clusters may be described as connected regions of a multi-dimensional space containing a relatively high density of points, separated from other such regions by a region containing a relatively low density of points.

The number of clusters in the data often depends on the resolution (fine vs. coarse) with which we view the data. How many clusters do you see in the figure? 5, 8, 10, more?

[Figure: scatter of points whose apparent number of clusters depends on the viewing resolution]

Clustering is a very difficult problem because data can reveal clusters with different shapes and sizes. Most clustering algorithms are based on two popular techniques: iterative squared-error partitioning and agglomerative hierarchical clustering. One of the main challenges is to select an appropriate measure of similarity to define clusters, a choice that is often both data dependent (cluster shape) and context dependent.

We define a feature vector $\mathbf{x} = [x_1 \; x_2]^T \in \mathbb{R}^{2 \times 1}$ in a two-dimensional feature space. We also define a scatter plot as a graphical representation of the feature-space variables against the true class (species) of each sample. As an example, consider the scatter plot of the weight and width features of salmon and sea bass.

[Figure: scatter plot of weight vs. width for salmon and sea bass, with a decision boundary]

Taking as training data set a subset of fish that have been correctly classified, we can draw the classification function (or decision boundary) that divides the feature space into two regions. The decision boundary should be determined according to a criterion that evaluates the tradeoff between the complexity of the decision rule and its performance on unknown samples.

Let $\mathbf{x} = [x_1 \; x_2 \; \cdots \; x_N]^T \in \mathbb{R}^{N \times 1}$ be a feature vector in an N-dimensional feature space. Two (or more) classes of data are said to be linearly separable if there exists a (usually unknown) linear classification function, or decision boundary, able to separate them. In the slide's example figure, with $N = 2$, classes A and B are not linearly separable, whereas classes C and D are linearly separable.

[Figure: classes A and B (not linearly separable) and classes C and D (linearly separable)]

Returning to the problem of fish species classification, different criteria lead to different decision boundaries, and more complex models result in more complex boundaries. The slide's figure shows other possible non-linear decision boundaries that can be determined using different criteria. We may separate the training samples perfectly, but we cannot predict how well the boundary will generalize to unknown samples.

Consider the classification of two classes of patterns that are linearly separable. The linear classifier hyperplane can be estimated from the training data set using various distance measures. Obviously, different criteria produce different optimal solutions. For example, in the problem shown in the slide's figure, hyperplanes $H_1$, $H_2$ and $H_3$ are all optimum solutions relative to the distance measure chosen for minimization during the learning phase. However, only hyperplane $H_2$ produces no misclassification error. It follows that the choice of a distance measure for identifying the optimum separating hyperplane is one of the central points in pattern recognition, and that it depends strictly on the specific problem.

### Hierarchical clustering

Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:

1. Agglomerative: a "bottom-up" approach in which each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
2. Divisive: a "top-down" approach in which all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.

In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate metric (a measure of distance between pairs of observations) and a linkage criterion, which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.

### Hierarchical clustering: metric

The choice of an appropriate metric influences the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. Some commonly used metrics for hierarchical clustering are:

1. Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$;
2. Squared Euclidean distance: $d(x, y) = \sum_i (x_i - y_i)^2$;
3. Manhattan (city-block) distance: $d(x, y) = \sum_i |x_i - y_i|$;
4. Chebyshev (maximum) distance: $d(x, y) = \max_i |x_i - y_i|$;
5. Power distance: $d(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/r}$;
6. Percent disagreement: $d(x, y) = (\text{number of } x_i \neq y_i) / n$, where $n$ is the number of components;
7. Mahalanobis distance: $d(x, y) = \sqrt{(x - y)^T R^{-1} (x - y)}$, where $R$ is the covariance matrix of the data;
8. Cosine similarity: $d(x, y) = \cos^{-1} \dfrac{x \cdot y}{\|x\| \, \|y\|}$.
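Several of the metrics listed above can be written in a few lines each. The following is only a minimal sketch in plain Python: the function names are illustrative, vectors are assumed to be equal-length sequences of numbers, and the power distance is assumed to take the form $(\sum_i |x_i - y_i|^p)^{1/r}$.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def power_distance(x, y, p=2.0, r=2.0):
    # Assumed form: (sum_i |x_i - y_i|^p)^(1/r); p = r = 2 gives Euclidean.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / r)

def percent_disagreement(x, y):
    # Fraction of components on which the two vectors differ.
    return sum(1 for a, b in zip(x, y) if a != b) / len(x)

def cosine_angle(x, y):
    # Angle between the two vectors; undefined for zero vectors.
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] to guard against rounding before taking acos.
    return math.acos(max(-1.0, min(1.0, dot / norms)))

x, y = (0.0, 0.0), (3.0, 4.0)
print(euclidean(x, y), manhattan(x, y), chebyshev(x, y))
```

The same pair of points is at distance 5 under the Euclidean metric, 7 under Manhattan and 4 under Chebyshev, which illustrates how the choice of metric changes which elements count as "close".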

### Hierarchical clustering: linkage criteria

The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations. Some commonly used linkage criteria between two sets of observations $X$ and $Y$ (where $d$ is the chosen metric) are:

1. Maximum or complete linkage clustering: $\max \{ d(x, y) : x \in X, y \in Y \}$;
2. Minimum or single-linkage clustering: $\min \{ d(x, y) : x \in X, y \in Y \}$;
3. Mean or average linkage clustering, or UPGMA: $\dfrac{1}{|X| \, |Y|} \sum_{x \in X} \sum_{y \in Y} d(x, y)$;
4. The sum of all intra-cluster variance;
5. The increase in variance for the cluster being merged (Ward's criterion);
6. The probability that candidate clusters are spawned from the same distribution function (V-linkage).
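The first three linkage criteria translate directly into code. The sketch below is illustrative only: the function names and the sample sets are assumptions, and `d` can be any metric that takes two points.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def complete_linkage(X, Y, d):
    # Largest pairwise distance between the two sets.
    return max(d(x, y) for x in X for y in Y)

def single_linkage(X, Y, d):
    # Smallest pairwise distance between the two sets.
    return min(d(x, y) for x in X for y in Y)

def average_linkage(X, Y, d):
    # Mean of all |X| * |Y| pairwise distances (UPGMA).
    return sum(d(x, y) for x in X for y in Y) / (len(X) * len(Y))

X = [(0.0, 0.0), (0.0, 1.0)]
Y = [(3.0, 0.0), (4.0, 0.0)]
print(single_linkage(X, Y, euclidean))
print(average_linkage(X, Y, euclidean))
print(complete_linkage(X, Y, euclidean))
```

For any two sets the single-linkage value is never larger than the average, which in turn is never larger than the complete-linkage value.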

### Dendrogram

A dendrogram is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. Dendrograms are often used in computational biology to illustrate the clustering of genes.

As a clustering example, suppose a data set is to be clustered using Euclidean distance as the distance metric, and the result is drawn as a hierarchical clustering dendrogram. In the slide's figure, the top row of nodes represents the data, the remaining nodes represent the clusters to which the data belong, and the arrows represent the distance.

[Figure: example data set and its hierarchical clustering dendrogram]
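A dendrogram is fully determined by the sequence of merges and the distance at which each merge happens. As a rough sketch (a naive agglomerative procedure written for clarity, not efficiency; the function name and sample points are assumptions), the merge history can be computed as follows, with the built-in `min` giving single linkage and `max` giving complete linkage:

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def agglomerative(points, linkage=min):
    """Return the merge history as (left members, right members, height)."""
    clusters = [[i] for i in range(len(points))]  # each point starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        # Greedy step: find the pair of clusters with the smallest linkage.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                h = linkage(euclidean(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or h < best[0]:
                    best = (h, a, b)
        h, a, b = best
        merges.append((clusters[a], clusters[b], h))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

points = [(0.0,), (1.0,), (5.0,), (6.5,)]
for left, right, height in agglomerative(points):
    print(left, right, height)
```

Each printed row corresponds to one link of the dendrogram: the two groups joined and the height at which the link is drawn.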

The hierarchical cluster tree created by clustering is most easily understood when viewed graphically. The Statistics Toolbox of MATLAB® includes the dendrogram function, which plots this hierarchical tree information as a graph.

[Figure: dendrogram produced by the MATLAB dendrogram function]

In the figure, the numbers along the horizontal axis represent the indexes of the objects in the original data set. The links between objects are represented as upside-down U-shaped lines. The height of the U indicates the distance between the objects. For example, the link representing the cluster containing objects 1 and 3 has a height of 1. The link representing the cluster that groups object 2 together with objects 1, 3, 4, and 5 (which are already clustered as object 8) has a height of 2.5. The height represents the distance that the linkage function computes between objects 2 and 8.

### K-means algorithm

The k-means clustering algorithm is a method of cluster analysis which aims to partition N observations into K clusters in which each observation belongs to the cluster with the nearest mean. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that they both attempt to find the centers of natural clusters in the data.

Given a set of observations $(x_1, x_2, \ldots, x_N)$, where each observation is a d-dimensional real vector, k-means clustering aims to partition the N observations into K sets ($K < N$) $S = \{S_1, S_2, \ldots, S_K\}$ so as to minimize the within-cluster sum of squares (WCSS):

$$\arg\min_{S} \sum_{i=1}^{K} \sum_{x_j \in S_i} \| x_j - \mu_i \|^2$$

where $\mu_i$ is the mean of $S_i$.
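To make the objective concrete, the short sketch below evaluates the WCSS for two candidate partitions of the same four points. The helper names and the data are illustrative assumptions.

```python
def centroid(cluster):
    # Component-wise mean of the points in one cluster.
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def wcss(clusters):
    # Sum over clusters of squared Euclidean distances to the cluster mean.
    total = 0.0
    for cluster in clusters:
        mu = centroid(cluster)
        total += sum(sum((a - m) ** 2 for a, m in zip(p, mu)) for p in cluster)
    return total

good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
print(wcss(good), wcss(bad))
```

The partition that groups nearby points has a much smaller WCSS, which is exactly what the minimization rewards.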

The most common algorithm uses an iterative refinement technique. Due to its ubiquity it is often called the k-means algorithm; it is also referred to as Lloyd's algorithm, particularly in the computer science community.

Given an initial set of K means $m_1^{(1)}, \ldots, m_K^{(1)}$, which may be specified randomly or by some heuristic, the algorithm proceeds by alternating between two steps:

1. Assignment step: assign each observation to the cluster with the closest mean,

   $$S_i^{(t)} = \left\{ x_j : \| x_j - m_i^{(t)} \| \le \| x_j - m_{i^*}^{(t)} \| \text{ for all } i^* = 1, \ldots, K \right\}$$

2. Update step: calculate the new means to be the centroids of the observations in each cluster,

   $$m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j$$

The algorithm is deemed to have converged when the assignments no longer change.
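The two alternating steps can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names are made up, the initial means are passed in explicitly so that the run is reproducible, and an empty cluster simply keeps its previous mean (an assumption, since the slides do not cover that case).

```python
def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def lloyd(points, initial_means, max_iter=100):
    means = list(initial_means)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the closest mean for every observation.
        new_assignment = [
            min(range(len(means)), key=lambda i: squared_distance(p, means[i]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: assignments no longer change
        assignment = new_assignment
        # Update step: each mean becomes the centroid of its cluster.
        for i in range(len(means)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # assumption: an empty cluster keeps its old mean
                means[i] = mean(members)
    return means, assignment

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
means, assignment = lloyd(points, initial_means=[points[0], points[3]])
print(means, assignment)
```

Minimizing the squared distance in the assignment step gives the same assignments as minimizing the distance itself, so the square root is skipped.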

Example of the K-means algorithm:

1. K initial means (in this case K = 3) are randomly selected from the data set (shown in color);
2. K clusters are created by associating every observation with the nearest mean;
3. The centroid of each of the K clusters becomes the new mean;
4. Steps 2 and 3 are repeated until convergence has been reached.

As it is a heuristic algorithm, there is no guarantee that it will converge to the global optimum, and the result may depend on the initial clusters. As the algorithm is usually very fast, it is common to run it multiple times with different starting conditions.

The two key features of k-means which make it efficient are often regarded as its biggest drawbacks:

1. The number of clusters K is an input parameter: an inappropriate choice of K may yield poor results;
2. Euclidean distance is used as a metric and variance is used as a measure of cluster scatter.

The k-means clustering algorithm is commonly used in computer vision as a form of image segmentation. The results of the segmentation are used to aid border detection and object recognition. In this context, the standard Euclidean distance is usually insufficient in forming the clusters.

### K-medoids algorithm

The k-medoids algorithm is a clustering algorithm related to the k-means algorithm. Both the k-means and k-medoids algorithms are partitional (breaking the data set up into groups) and both attempt to minimize squared error, the distance between points labeled to be in a cluster and a point designated as the center of that cluster.

K-medoids is a classical partitioning technique that clusters a data set of N objects into K clusters, with K known a priori. In contrast to the k-means algorithm, k-medoids chooses data points as centers (medoids or exemplars). A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal. K-medoids is more robust to noise and outliers than k-means.

The most common realization of k-medoids clustering is the Partitioning Around Medoids (PAM) algorithm, which proceeds as follows:

1. Initialize: randomly select K of the N data points as the medoids;
2. Associate each data point with the closest medoid ("closest" here is defined using any valid distance metric, most commonly Euclidean distance, Manhattan distance or Minkowski distance);
3. For each medoid m and each non-medoid data point p, swap m and p and compute the total cost of the configuration;
4. Select the configuration with the lowest cost;
5. Repeat steps 2 to 4 until there is no change in the medoids.
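The swap-based search in the steps above can be sketched as follows. This is a simplified illustration, not an optimized PAM: the function names are hypothetical, the initial medoids are passed in as indices instead of being drawn at random (so the run is reproducible), and the total cost is taken to be the sum of distances from every point to its closest medoid.

```python
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def total_cost(points, medoids, dist):
    # Sum of distances from each point to its closest medoid.
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)

def pam(points, initial_medoids, dist):
    medoids = list(initial_medoids)  # indices into points
    best_cost = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        best_swap = None
        # Try swapping every medoid with every non-medoid point.
        for pos in range(len(medoids)):
            for candidate in range(len(points)):
                if candidate in medoids:
                    continue
                trial = medoids[:pos] + [candidate] + medoids[pos + 1:]
                cost = total_cost(points, trial, dist)
                if cost < best_cost:
                    best_cost, best_swap = cost, trial
        # Keep the lowest-cost configuration found in this sweep, if any.
        if best_swap is not None:
            medoids, improved = best_swap, True
    labels = [min(medoids, key=lambda m: dist(p, points[m])) for p in points]
    return medoids, labels, best_cost

points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
medoids, labels, cost = pam(points, initial_medoids=[1, 2], dist=manhattan)
print(sorted(medoids), labels, cost)
```

Even though both initial medoids lie in the same group, the swaps move one of them to the other group. Each label is the index of the medoid that the point is associated with.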

### K-medians algorithm

The k-medians algorithm is similar to the k-means one, but in this case the squared error is replaced by the absolute error, and the mean is replaced with a median-like object arising from optimization. The objective function is

$$\arg\min_{S} \sum_{i=1}^{K} \sum_{x_j \in S_i} \| x_j - \mu_i \|$$

There is some evidence that this procedure is more resistant to outliers or strong non-normality than regular k-means. However, like all centroid-based methods, it works best when the clusters are convex.

### Silhouette

Silhouette refers to a method of interpretation and validation of clusters of data. Assume the data has been clustered. For each datum $i$, let $a(i)$ be the average dissimilarity of $i$ with all other data within the same cluster. Any measure of dissimilarity can be used, but distance measures are the most common. We can interpret $a(i)$ as how well matched $i$ is to the cluster it is assigned to (the smaller the value, the better the matching).

Then find the average dissimilarity of $i$ with the data of another single cluster, and repeat this for every cluster of which $i$ is not a member. Denote by $b(i)$ the lowest of these average dissimilarities. The cluster that attains it is said to be the neighboring cluster of $i$ since, aside from the cluster $i$ is assigned to, it is the cluster $i$ fits best in.

We now define:

$$s(i) = \begin{cases} 1 - a(i)/b(i), & \text{if } a(i) < b(i) \\ 0, & \text{if } a(i) = b(i) \\ b(i)/a(i) - 1, & \text{if } a(i) > b(i) \end{cases}$$

From the above definition it is clear that $-1 \le s(i) \le 1$.

For $s(i)$ to be close to one we require $a(i) \ll b(i)$. As $a(i)$ is a measure of how dissimilar $i$ is to its own cluster, a small value means it is well matched. Furthermore, a large $b(i)$ implies that $i$ is badly matched to its neighboring cluster. Thus an $s(i)$ close to one means that the datum is appropriately clustered. If $s(i)$ is close to negative one, then by the same logic $i$ would be more appropriately placed in its neighboring cluster. An $s(i)$ near zero means that the datum is on the border of two natural clusters.
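The definition can be checked on a small example. The sketch below is illustrative: the function name is an assumption, it expects at least two clusters, and a point that is alone in its cluster is given the score 0, a convention that the definition above does not cover.

```python
def silhouette(points, labels, dist):
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # assumption: singleton clusters score 0
            continue
        a = sum(own) / len(own)  # average dissimilarity to own cluster
        # b: lowest average dissimilarity to any other cluster.
        b = min(
            sum(dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        if a < b:
            scores.append(1.0 - a / b)
        elif a > b:
            scores.append(b / a - 1.0)
        else:
            scores.append(0.0)
    return scores

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
dist = lambda p, q: abs(p[0] - q[0])
print(silhouette(points, [0, 0, 1, 1], dist))  # natural grouping
print(silhouette(points, [0, 1, 0, 1], dist))  # deliberately poor grouping
```

With the natural grouping every score is close to one; with the poor grouping the scores turn negative, signalling that the points would fit better in the neighboring cluster.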

### QT clustering algorithm

QT (quality threshold) clustering is an alternative method of partitioning data, invented for gene clustering. It requires more computing power than k-means, but does not require specifying the number of clusters a priori, and always returns the same result when run several times. The algorithm is:

1. The user chooses a maximum diameter for clusters;
2. Build a candidate cluster for each point by including the closest point, the next closest, and so on, until the diameter of the cluster surpasses the threshold;
3. Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration (the description leaves open what happens if more than one candidate cluster has the maximum number of points);
4. Recurse with the reduced set of points.

The distance between a point and a group of points is computed using complete linkage, i.e. as the maximum distance from the point to any member of the group.
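A direct, unoptimized transcription of these steps might look like the sketch below. The function name and the data are illustrative, and ties between equally large candidate clusters are broken in favor of the candidate built first, which is an assumption since the description does not settle that case.

```python
def qt_cluster(points, max_diameter, dist):
    remaining = list(range(len(points)))
    clusters = []
    while remaining:
        best = []
        for seed in remaining:
            candidate = [seed]
            others = [i for i in remaining if i != seed]
            while others:
                # Complete linkage: distance from a point to the candidate
                # is its maximum distance to any current member.
                nearest = min(
                    others,
                    key=lambda i: max(dist(points[i], points[j])
                                      for j in candidate),
                )
                reach = max(dist(points[nearest], points[j]) for j in candidate)
                if reach > max_diameter:
                    break  # adding it would exceed the allowed diameter
                candidate.append(nearest)
                others.remove(nearest)
            if len(candidate) > len(best):
                best = candidate  # assumption: the first largest candidate wins
        clusters.append(sorted(best))
        remaining = [i for i in remaining if i not in best]
    return clusters

points = [0.0, 1.0, 2.0, 10.0, 11.0, 30.0]
clusters = qt_cluster(points, max_diameter=2.5, dist=lambda a, b: abs(a - b))
print(clusters)
```

The number of clusters is not given in advance: it follows from the chosen diameter, and the isolated point ends up in a cluster of its own.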

### Spectral clustering

Given a set of data points $A$, the similarity matrix may be defined as a matrix $S$ where $s_{ij}$ represents a measure of the similarity between points $i, j \in A$. Spectral clustering techniques make use of the spectrum of the similarity matrix of the data to perform dimensionality reduction for clustering in fewer dimensions.

One such technique is the Shi-Malik algorithm, commonly used for image segmentation. It partitions points into two sets $(S_1, S_2)$ based on the eigenvector $v$ corresponding to the second-smallest eigenvalue of the Laplacian matrix

$$L = I - D^{-1/2} S D^{-1/2}$$

of $S$, where $D$ is the diagonal matrix with $d_{ii} = \sum_j s_{ij}$.

This partitioning may be done in various ways, such as by taking the median $m$ of the components in $v$, and placing all points whose component in $v$ is greater than $m$ in $S_1$, and the rest in $S_2$.
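The median split described above can be tried out without a linear algebra library. The sketch below makes several assumptions that are not in the slides: the similarity is a Gaussian kernel, and the eigenvector is found by power iteration on $2I - L$ with the known eigenvector $D^{1/2}\mathbf{1}$ (eigenvalue 0 of $L$) projected out, a technique swapped in for a proper eigensolver. All function names are illustrative.

```python
import math
import random

def gaussian_similarity(points, sigma=2.0):
    # Assumed similarity: s_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(p, q))
                      / (2.0 * sigma ** 2)) for q in points] for p in points]

def spectral_bipartition(S, iterations=500, seed=0):
    n = len(S)
    d = [sum(row) for row in S]  # d_ii = sum_j s_ij
    # B = 2I - L = I + D^{-1/2} S D^{-1/2}; its largest eigenvectors are
    # the eigenvectors of L with the smallest eigenvalues.
    B = [[S[i][j] / math.sqrt(d[i] * d[j]) + (1.0 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    # Eigenvector of L for eigenvalue 0, normalized: D^{1/2} 1.
    norm = math.sqrt(sum(d))
    u = [math.sqrt(di) / norm for di in d]
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        v = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
        proj = sum(ui * vi for ui, vi in zip(u, v))
        v = [vi - proj * ui for ui, vi in zip(u, v)]  # remove the trivial part
        length = math.sqrt(sum(vi * vi for vi in v))
        v = [vi / length for vi in v]
    ordered = sorted(v)
    m = (ordered[(n - 1) // 2] + ordered[n // 2]) / 2.0  # median of components
    s1 = [i for i in range(n) if v[i] > m]
    s2 = [i for i in range(n) if v[i] <= m]
    return s1, s2

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (6.0, 6.0), (6.0, 7.0), (7.0, 6.0)]
s1, s2 = spectral_bipartition(gaussian_similarity(points))
print(s1, s2)
```

Which of the two groups ends up in $S_1$ depends on the sign of the computed eigenvector, which is arbitrary; only the split itself is meaningful.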

### Gaussian mixture models clustering

There is another way to deal with clustering problems: a model-based approach, which consists in using certain models for clusters and attempting to optimize the fit between the data and the model. In practice, each cluster can be mathematically represented by a parametric distribution, like a Gaussian (continuous) or a Poisson (discrete). The entire data set is therefore modeled by a mixture of these distributions. An individual distribution used to model a specific cluster is often referred to as a component distribution.

A mixture model with high likelihood tends to have the following traits:

1. component distributions have high peaks (data in one cluster are tight);
2. the mixture model covers the data well (dominant patterns in the data are captured by component distributions).

Main advantages of model-based clustering:

1. well-studied statistical inference techniques are available;
2. flexibility in choosing the component distribution;
3. a density estimate is obtained for each cluster;
4. a soft classification is available.

The most widely used clustering method of this kind is the one based on learning a mixture of Gaussians: we can consider clusters as Gaussian distributions centered on their barycentres. In the slide's picture, the grey circle represents the first variance of the distribution.

[Figure: clusters modeled as Gaussian distributions centered on their barycentres]

Clusters are assigned by selecting the component that maximizes the posterior probability. Like k-means clustering, Gaussian mixture modeling uses an iterative algorithm that converges to a local optimum. Gaussian mixture modeling may be more appropriate than k-means clustering when clusters have different sizes and correlation within them.

The algorithm works in this way:

1. it chooses the parameter $\theta_i$ (for $i = 1, \ldots, K$) at random with prior probability $p(\theta_i) \sim \mathcal{N}\{\mu_i, \sigma\}$;
2. it obtains the posterior distribution (likelihood) $p(x \mid \theta_i)$, where $\theta_i = \{\mu_i, \sigma\}$;
3. the likelihood function is

   $$L = \sum_{i=1}^{K} p(\theta_i) \, p(x \mid \theta_i)$$

4. the likelihood function should now be maximized by solving $\dfrac{\partial L}{\partial \theta_i} = 0$, but this would be too difficult. That is why a simplified algorithm, known as EM (Expectation-Maximization), is used instead.
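The slides do not spell out the EM iterations, so the following is only a sketch under stated assumptions: one-dimensional data, K Gaussian components with their own means and a single shared standard deviation, and the standard textbook EM updates for a Gaussian mixture. The function names and the starting values are illustrative.

```python
import math

def gaussian_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_gmm_1d(data, means, sigma=1.0, iterations=50):
    k = len(means)
    means = list(means)
    weights = [1.0 / k] * k  # mixing proportions, initially uniform
    for _ in range(iterations):
        # E-step: posterior probability of each component for each datum.
        resp = []
        for x in data:
            joint = [weights[i] * gaussian_pdf(x, means[i], sigma)
                     for i in range(k)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate weights and means from the soft assignments.
        for i in range(k):
            n_i = sum(r[i] for r in resp)
            weights[i] = n_i / len(data)
            means[i] = sum(r[i] * x for r, x in zip(resp, data)) / n_i
        # Shared standard deviation (assumption: one sigma for all components).
        sigma = math.sqrt(
            sum(r[i] * (x - means[i]) ** 2
                for r, x in zip(resp, data) for i in range(k)) / len(data)
        )
    return weights, means, sigma

data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
weights, means, sigma = em_gmm_1d(data, means=[0.0, 6.0])
print(weights, means, sigma)
```

The responsibilities computed in the E-step are the soft classification mentioned among the advantages of model-based clustering; a hard assignment picks, for each datum, the component with the largest responsibility.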

### Bayesian clustering

Fundamentally, Bayesian clustering aims to obtain a posterior distribution over partitions of the data set $D$, denoted by $C = \{C_1, \ldots, C_K\}$, with or without specifying $K$. Several methods have been proposed for doing this. Usually they come down to specifying a hierarchical model mimicking the partial order on the class of partitions, so that the procedure is also hierarchical, usually agglomerative.

The first effort at Bayesian clustering was the hierarchical technique due to Makato and Tokunaga. Starting with the data $D = \{x_1, \ldots, x_N\}$ as N clusters of size 1, the idea is to merge clusters when the probability of the merged cluster $p(C_k \cup C_j)$ is greater than the probability of the individual clusters $p(C_k) \, p(C_j)$. Thus the clusters themselves are treated as random variables.

### Cluster Analysis. Isabel M. Rodrigues. Lisboa, 2014. Instituto Superior Técnico

Instituto Superior Técnico Lisboa, 2014 Introduction: Cluster analysis What is? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from

### Data Clustering. Dec 2nd, 2013 Kyrylo Bessonov

Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

### DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

### Unsupervised learning: Clustering

Unsupervised learning: Clustering Salissou Moutari Centre for Statistical Science and Operational Research CenSSOR 17 th September 2013 Unsupervised learning: Clustering 1/52 Outline 1 Introduction What

### Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

### Fig. 1 A typical Knowledge Discovery process [2]

Volume 4, Issue 7, July 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Review on Clustering

### ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

### L15: statistical clustering

Similarity measures Criterion functions Cluster validity Flat clustering algorithms k-means ISODATA L15: statistical clustering Hierarchical clustering algorithms Divisive Agglomerative CSCE 666 Pattern

### Machine Learning using MapReduce

Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

### Distance based clustering

// Distance based clustering Chapter ² ² Clustering Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 99). What is a cluster? Group of objects separated from other clusters Means

### Clustering UE 141 Spring 2013

Clustering UE 141 Spring 013 Jing Gao SUNY Buffalo 1 Definition of Clustering Finding groups of obects such that the obects in a group will be similar (or related) to one another and different from (or

### Machine Learning and Data Mining. Clustering. (adapted from) Prof. Alexander Ihler

Machine Learning and Data Mining Clustering (adapted from) Prof. Alexander Ihler Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

### Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in

### EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

### Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distance-based K-means, K-medoids,

### Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Clustering 15-381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv Bar-Joseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is

### Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

### Lecture 20: Clustering

Lecture 20: Clustering Wrap-up of neural nets (from last lecture Introduction to unsupervised learning K-means clustering COMP-424, Lecture 20 - April 3, 2013 1 Unsupervised learning In supervised learning,

### Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?

### Chapter ML:XI (continued)

Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

### Clustering & Association

Clustering - Overview What is cluster analysis? Grouping data objects based only on information found in the data describing these objects and their relationships Maximize the similarity within objects

### UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi-110 012 dcmishra@iasri.res.in What is Learning? "Learning denotes changes in a system that enable

### Robotics 2 Clustering & EM. Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard

Robotics 2 Clustering & EM Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard 1 Clustering (1) Common technique for statistical data analysis to detect structure (machine learning,

### Unsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning

Unsupervised Learning and Data Mining Unsupervised Learning and Data Mining Clustering Decision trees Artificial neural nets K-nearest neighbor Support vectors Linear regression Logistic regression...

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining /8/ What is Cluster

### . Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

Outline Part 1: of data clustering Non-Supervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties

### 10-810 /02-710 Computational Genomics. Clustering expression data

10-810 /02-710 Computational Genomics Clustering expression data What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally,

### Chapter 7. Cluster Analysis

Chapter 7. Cluster Analysis 1. What is Cluster Analysis? 2. A Categorization of Major Clustering Methods 3. Partitioning Methods 4. Hierarchical Methods 5. Density-Based Methods 6. Grid-Based Methods 7. Model-Based

### CLASSIFICATION AND CLUSTERING. Anveshi Charuvaka

CLASSIFICATION AND CLUSTERING Anveshi Charuvaka Learning from Data Classification Regression Clustering Anomaly Detection Contrast Set Mining Classification: Definition Given a collection of records (training

### Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based

### ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications

### Clustering and Data Mining in R

Clustering and Data Mining in R Workshop Supplement Thomas Girke December 10, 2011 Introduction Data Preprocessing Data Transformations Distance Methods Cluster Linkage Hierarchical Clustering Approaches

### Distances, Clustering, and Classification. Heatmaps

Distances, Clustering, and Classification Heatmaps 1 Distance Clustering organizes things that are close into groups What does it mean for two genes to be close? What does it mean for two samples to be

### Classification Techniques for Remote Sensing

Classification Techniques for Remote Sensing Selim Aksoy Department of Computer Engineering Bilkent University Bilkent, 06800, Ankara saksoy@cs.bilkent.edu.tr http://www.cs.bilkent.edu.tr/ saksoy/courses/cs551

### Clustering Hierarchical clustering and k-mean clustering

Clustering Hierarchical clustering and k-mean clustering Genome 373 Genomic Informatics Elhanan Borenstein The clustering problem: A quick review partition genes into distinct sets with high homogeneity

### Steven M. Holland. Department of Geology, University of Georgia, Athens, GA 30602-2501

CLUSTER ANALYSIS Steven M. Holland Department of Geology, University of Georgia, Athens, GA 30602-2501 January 2006 Introduction Cluster analysis includes a broad suite of techniques designed to find groups

### Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 10 th, 2013 Wolf-Tilo Balke and Kinda El Maarry Institut für Informationssysteme Technische Universität Braunschweig

### Clustering and Cluster Evaluation. Josh Stuart Tuesday, Feb 24, 2004 Read chap 4 in Causton

Clustering and Cluster Evaluation Josh Stuart Tuesday, Feb 24, 2004 Read chap 4 in Causton Clustering Methods Agglomerative Start with all separate, end with some connected Partitioning / Divisive Start

### Text Clustering. Clustering

Text Clustering 1 Clustering Partition unlabeled examples into disjoint subsets of clusters, such that: Examples within a cluster are very similar Examples in different clusters are very different Discover

### SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

### Standardization and Its Effects on K-Means Clustering Algorithm

Research Journal of Applied Sciences, Engineering and Technology 6(7): 399-3303, 03 ISSN: 040-7459; e-issn: 040-7467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03

### STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will

### There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:

Statistics: Rosie Cornish. 2007. 3.1 Cluster Analysis 1 Introduction This handout is designed to provide only a brief introduction to cluster analysis and how it is done. Books giving further details are

### Cluster Analysis: Basic Concepts and Methods

10 Cluster Analysis: Basic Concepts and Methods Imagine that you are the Director of Customer Relationships at AllElectronics, and you have five managers working for you. You would like to organize all

### Hierarchical Cluster Analysis Some Basics and Algorithms

Hierarchical Cluster Analysis Some Basics and Algorithms Nethra Sambamoorthi CRMportals Inc., 11 Bartram Road, Englishtown, NJ 07726 (NOTE: Please use always the latest copy of the document. Click on this

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Clustering Algorithms K-means and its variants Hierarchical clustering

### A comparison of various clustering methods and algorithms in data mining

Volume :2, Issue :5, 32-36 May 2015 www.allsubjectjournal.com e-issn: 2349-4182 p-issn: 2349-5979 Impact Factor: 3.762 R.Tamilselvi B.Sivasakthi R.Kavitha Assistant Professor A comparison of various clustering

### Cluster Analysis: Advanced Concepts

Cluster Analysis: Advanced Concepts and Algorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means

### An Enhanced Clustering Algorithm to Analyze Spatial Data

International Journal of Engineering and Technical Research (IJETR) ISSN: 2321-0869, Volume-2, Issue-7, July 2014 An Enhanced Clustering Algorithm to Analyze Spatial Data Dr. Mahesh Kumar, Mr. Sachin Yadav

### Cluster Analysis: Basic Concepts and Algorithms

8 Cluster Analysis: Basic Concepts and Algorithms Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should

### Data visualization and clustering. Genomics is to no small extent a data science

Data visualization and clustering Genomics is to no small extent a data science [www.data2discovery.org] Data visualization and clustering Genomics is to no small extent a data science [Andersson et al.,

### 15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

### Cluster Algorithms. Adriano Cruz adriano@nce.ufrj.br. 28 October 2013

Cluster Algorithms Adriano Cruz adriano@nce.ufrj.br 28 October 2013 Summary 1 K-Means

### An Introduction to Cluster Analysis for Data Mining

An Introduction to Cluster Analysis for Data Mining 10/02/2000 11:42 AM 1. INTRODUCTION 1.1. Scope of This Paper 1.2. What Cluster Analysis Is 1.3. What Cluster Analysis Is Not 2. OVERVIEW...

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

K-Means Cluster Analysis Chapter 3 PPDM Class Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will be similar

### Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

### The Scientific Data Mining Process

Chapter 4 The Scientific Data Mining Process "When I use a word," Humpty Dumpty said, in rather a scornful tone, "it means just what I choose it to mean, neither more nor less." Lewis Carroll [87, p. 214] In

### Introduction to Statistical Machine Learning

CHAPTER Introduction to Statistical Machine Learning We start with a gentle introduction to statistical machine learning. Readers familiar with machine learning may wish to skip directly to Section 2,

### Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

### Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Overview Prognostic Models and Data Mining in Medicine, part I Cluster Analysis What is Cluster Analysis? K-Means Clustering Hierarchical Clustering Cluster Validity Example: Microarray data analysis 6 Summary

### Chapter 4: Non-Parametric Classification

Chapter 4: Non-Parametric Classification Introduction Density Estimation Parzen Windows Kn-Nearest Neighbor Density Estimation K-Nearest Neighbor (KNN) Decision Rule Gaussian Mixture Model A weighted combination

### Decision Support System Methodology Using a Visual Approach for Cluster Analysis Problems

Decision Support System Methodology Using a Visual Approach for Cluster Analysis Problems Ran M. Bittmann School of Business Administration Ph.D. Thesis Submitted to the Senate of Bar-Ilan University Ramat-Gan,

### Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

### Lecture 9: Introduction to Pattern Analysis

Lecture 9: Introduction to Pattern Analysis Features, patterns and classifiers; Components of a PR system; An example; Probability definitions; Bayes Theorem; Gaussian densities. Features, patterns

### Cluster analysis Cosmin Lazar. COMO Lab VUB

Cluster analysis Cosmin Lazar COMO Lab VUB Introduction Cluster analysis foundations rely on one of the most fundamental, simple and very often unnoticed ways (or methods) of understanding and learning,

### Cluster Analysis: Basic Concepts and Algorithms

Cluster Analysis: Basic Concepts and Algorithms What does it mean clustering? Applications Types of clustering K-means Intuition Algorithm Choosing initial centroids Bisecting K-means Post-processing Strengths

### Cluster Analysis. Chapter. Chapter Outline. What You Will Learn in This Chapter

Chapter 5 Cluster Analysis Chapter Outline: Introduction; Business Situation; Model; Distance or Dissimilarities; Combinatorial Searches with K-Means; Statistical Mixture Model with

### Introduction to machine learning and pattern recognition Lecture 1 Coryn Bailer-Jones

Introduction to machine learning and pattern recognition Lecture 1 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 What is machine learning? Data description and interpretation

### Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009

Cluster Analysis Alison Merikangas Data Analysis Seminar 18 November 2009 Overview What is cluster analysis? Types of cluster Distance functions Clustering methods Agglomerative K-means Density-based Interpretation

### An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

### IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 7, JULY 2009 1181

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 7, JULY 2009 1181 The Global Kernel k-means Algorithm for Clustering in Feature Space Grigorios F. Tzortzis and Aristidis C. Likas, Senior Member, IEEE

### Going Big in Data Dimensionality:

LUDWIG- MAXIMILIANS- UNIVERSITY MUNICH DEPARTMENT INSTITUTE FOR INFORMATICS DATABASE Going Big in Data Dimensionality: Challenges and Solutions for Mining High Dimensional Data Peer Kröger Lehrstuhl für

### Cluster Analysis using R

Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other

### Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets

Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets Macario O. Cordel II and Arnulfo P. Azcarraga College of Computer Studies *Corresponding Author: macario.cordel@dlsu.edu.ph

### Important Characteristics of Cluster Analysis Techniques

Cluster Analysis Can we organize sampling entities into discrete classes, such that within-group similarity is maximized and among-group similarity is minimized according to some objective criterion? Sites

### Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

### Machine Learning for NLP

Natural Language Processing SoSe 2015 Machine Learning for NLP Dr. Mariana Neves May 4th, 2015 (based on the slides of Dr. Saeedeh Momtazi) Introduction Field of study that gives computers the ability

### Personalized Hierarchical Clustering

Personalized Hierarchical Clustering Korinna Bade, Andreas Nürnberger Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, D-39106 Magdeburg, Germany {kbade,nuernb}@iws.cs.uni-magdeburg.de

### Clustering & Visualization

Chapter 5 Clustering & Visualization Clustering in high-dimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to high-dimensional data.

### B490 Mining the Big Data. 2 Clustering

B490 Mining the Big Data 2 Clustering Qin Zhang 1-1 Motivations Group together similar documents/webpages/images/people/proteins/products One of the most important problems in machine learning, pattern

### Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental

### An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

### Data Mining Clustering. Sheets are based on those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering Toon Calders Sheets are based on those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining What is Cluster Analysis? Finding groups of objects such that the objects

### Introduction to Clustering

Introduction to Clustering Yumi Kondo Student Seminar LSK301 Sep 25, 2010 Yumi Kondo (University of British Columbia) Introduction to Clustering Sep 25, 2010 1 / 36 Microarray Example N=65 P=1756 Yumi

### Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

### Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet

### Computational Complexity between K-Means and K-Medoids Clustering Algorithms for Normal and Uniform Distributions of Data Points

Journal of Computer Science 6 (3): 363-368, 2010 ISSN 1549-3636 2010 Science Publications Computational Complexity between K-Means and K-Medoids Clustering Algorithms for Normal and Uniform Distributions

### Data Preprocessing. Week 2

Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227-240, pp. 250-250, and pp. 259-263 in the text book.

### HT2015: SC4 Statistical Data Mining and Machine Learning

HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric

### Summary Data Mining & Process Mining (1BM46) Content. Made by S.P.T. Ariesen

Summary Data Mining & Process Mining (1BM46) Made by S.P.T. Ariesen Content Data Mining part: Lecture 1, Lecture 2, Lecture 3, Lecture 4. Process mining part: Lecture 5

### Data Mining and Visualization

Data Mining and Visualization Jeremy Walton NAG Ltd, Oxford Overview Data mining components Functionality Example application Quality control Visualization Use of 3D Example application Market research

### Data Mining. Cluster Analysis

Data Mining, Tamkang University: Cluster Analysis. Min-Yuh Day (戴敏育), Assistant Professor, Dept. of Information Management, Tamkang University

### Linear Threshold Units

Linear Threshold Units: $h(x) = +1$ if $w_1 x_1 + \dots + w_n x_n \ge w_0$, and $-1$ otherwise. We assume that each feature $x_j$ and each weight $w_j$ is a real number (we will relax this later). We will study three different algorithms for learning linear