Neural Networks Lesson 5  Cluster Analysis


 Barrie Manning
 2 years ago
 Views:
Transcription
1 Neural Networks Lesson 5  Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt.  Sapienza University of Rome Rome, 29 October 2009 M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 1 / 35
2 1 Cluster Analysis 2 M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 2 / 35
3 Cluster Analysis M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 3 / 35
4 Cluster Analysis Intelligent Circuits and Neural Network applications, in principle, can be divided in two main categories 1 Static data processing: patterns recognition, cluster analysis, associative memory; 2 Dynamic data processing: nonlinear filtering, prediction, functional and operator approximation, dynamic pattern recognition, etc. Let us consider cluster analysis, one of the main problem of interest to computer scientists and engineers. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 4 / 35
5 Cluster Analysis Cluster Analysis Cluster Analysis is the collection procedure used to describe methods for grouping unlabeled data X i into subset that are believed to reflect the underlying structure of the data generator. The techniques for clustering are many and diverse, summarized in the following scheme M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 5 / 35
6 Cluster Analysis Cluster Analysis Hierarchical algorithms find successive clusters using previously established clusters. These algorithms can be either agglomerative ( bottomup ) or divisive ( topdown ): 1 Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters; 2 Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters. Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in the hierarchical clustering. Bayesian algorithms try to generate a posteriori distribution over the collection of all partitions of the data. Many clustering algorithms require specification of the number of clusters to produce in the input data set, prior to execution of the algorithm, like partitional algorithms. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 6 / 35
7 Cluster Analysis Cluster Analysis In clustering, also known as unsupervised pattern classification, there are no training data with known class labels. A clustering algorithm explores the similarity between the patterns and places similar patterns in a cluster. Clustering is an unsupervised procedure that uses unlabeled samples. Unsupervised procedures are used for several reasons: 1 collecting and labeling a large set of sample patterns can be costly; 2 one can train with large amount of unlabeled data, and then use supervision to label the groupings found; 3 exploratory data analysis can provide insight into the nature or structure of the data; 4 wellknown clustering applications include data mining and data compression. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 7 / 35
8 Cluster Analysis Cluster Analysis A cluster is comprised of a number of similar objects collected or grouped together. Patterns within a cluster are more similar to each other than are patterns in different clusters. Clusters may be described as connected regions of a multidimensional space containing a relatively high density of points, separated from other such regions by a region containing a relatively low density of points. The number of clusters in the data often depend on the resolution (fine vs. coarse) with which we view the data. How many clusters do you see in this figure? 5, 8, 10, more? M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 8 / 35
9 Cluster Analysis Cluster Analysis Clustering is a very difficult problem because data can reveal clusters with different shapes and sizes. Most of the clustering algorithms are based on the following two popular techniques: the iterative squarederror partitioning and the agglomerative hierarchical clustering. On of the main challenges is to select an appropriate measure of similarity to define clusters that is often both data (cluster shape) and context dependent. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 9 / 35
10 Cluster Analysis Cluster Analysis We define feature vector [x 1 x 2 ] T R 2 1, defined in a twodimensional feature space. We define, also, scatter plot as a graphical representation of the feature space variables vs true classified species. As an example we show the scatter plot of weight and width features of salmon and seabass. We can draw a decision boundary to divide the feature space into two regions. Considering training data set as a subset of fishes correctly classified, we can draw the classification function (or decision boundary) to divide the feature space into two regions. The decision boundary determination should be performed considering a criteria for evaluation the tradeoff between complexity of decision rules and their performance to unknown samples. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 10 / 35
11 Cluster Analysis Cluster Analysis Let x = [x 1 x 2 x N ] T R N 1 be a Ndimension feature space data set, then two (or more) data classes are defined linearly separable if there exists a (usually unknown) linear classification function or decision boundary function able to separate these classes. Considering the following figure, as an example, with N = 2, classes A and B are non linearly separable while, on the contrary, classes C and D are linearly separable. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 11 / 35
12 Cluster Analysis Cluster Analysis Taking into consideration the problem of fish species classification, different criteria lead to different decision boundaries also, more complex models result in more complex boundaries. The following figure shows others possible nonlinear decision boundary that can be determined using different criteria. We may distinguish training samples perfectly but we can t predict how well can generalize to unknown sample. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 12 / 35
13 Cluster Analysis Cluster Analysis Consider the classification of two classes of patterns that are linearly separable. The linear classifier hyperplane can be estimated from training data set, using various distance measure. Obviously, different criteria produce different optimal solution. For example, in the problem described following, hyperplanes H 1, H 2 and H 3 are all optimum solution relatively to the choice of the distance measure to be minimized during the learning phase. However, as we can observe, only hyperplane H 2 presents not misclassification error. From these considerations, it follows that the choice of a distance measure, for identification of the optimum separation hyperplane, is one of the central point in pattern recognition and is strictly dependent to the specific problem. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 13 / 35
14 M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 14 / 35
15 is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types: 1 Agglomerative: This is a bottom up approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. 2 Divisive: This is a top down approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram. In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate metric (a measure of distance between pairs of observations), and a linkage criteria which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 15 / 35
16 : metric The choice of an appropriate metric will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. Some commonly used metrics for hierarchical clustering are: 1 Euclidean distance: d(x, y) = i (x i y i ) 2 ; 2 Squared Euclidean distance: d(x, y) = i (x i y i ) 2 ; 3 Manhattan (Cityblock) distance: d(x, y) = i x i y i ; 4 Chebychev (maximum) distance: d(x, y) = max x i y i ; 5 Power distance: d(x, y) = r xi y i p ; 6 Percent disagreement: d(x, y) = Number of x i y i i ; 7 Mahalanobis distance: d(x, y) = i (x i y i )R 1 (x i y i ); 8 Cosine similarity: d(x, y) = cos 1 a b a b. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 16 / 35
17 : linkage criteria The linkage criteria determines the distance between sets of observations as a function of the pairwise distances between observations. Some commonly used linkage criteria between two sets of observations X and Y (where d is the chosen metric) are: 1 Maximum or complete linkage clustering: max {d(x, y) : x X, y Y }; 2 Minimum or singlelinkage clustering: min {d(x, y) : x X, y Y }; 3 Mean or average linkage clustering, or UPGMA: 1 X Y x X y Y d(x, y); 4 The sum of all intracluster variance; 5 The increase in variance for the cluster being merged (Ward s criterion); 6 The probability that candidate clusters are spawn from the same distribution function (Vlinkage). M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 17 / 35
18 Dendrogram A dendrogram is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. Dendrograms are often used in computational biology to illustrate the clustering of genes. For a clustering example, suppose this data is to be clustered using Euclidean distance as the distance metric. The hierarchical clustering dendrogram would be as such. Here the top row of nodes represent data, and the remaining nodes represent the clusters to which the data belong, and the arrows represent the distance. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 18 / 35
19 Dendrogram The hierarchical cluster tree created by clustering is most easily understood when viewed graphically. Statistics Toolbox of Matlab R includes the dendrogram function that plots this hierarchical tree information as a graph, as in the following example. In the figure, the numbers along the horizontal axis represent the indexes of the objects in the original data set. The links between objects are represented as upsidedown Ushaped lines. The height of the U indicates the distance between the objects. For example, the link representing the cluster containing objects 1 and 3 has a height of 1. The link representing the cluster that groups object 2 together with objects 1, 3, 4, and 5 (which are already clustered as object 8), has a height of 2.5. The height represents the distance linkage computes between objects 2 and 8. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 19 / 35
20 Kmeans algorithm The kmeans clustering algorithm is a method of cluster analysis which aims to partition N observations into K clusters in which each observation belongs to the cluster with the nearest mean. It is similar to the expectationmaximization algorithm for mixtures of Gaussians in that they both attempt to find the centers of natural clusters in the data. Given a set of observations (x 1, x 2,..., x N ), where each observation is a ddimensional real vector, then kmeans clustering aims to partition the N observations into K sets (K < N) S = {S 1, S 2,..., S K } so as to minimize the withincluster sum of squares (WCSS): where µ i is the mean of S i. arg min S K i=1 x j S i x j µ i 2 M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 20 / 35
21 Kmeans algorithm The most common algorithm uses an iterative refinement technique. Due to its ubiquity it is often called the kmeans algorithm; it is also referred to as Lloyd s algorithm, particularly in the computer science community. Given an initial set of K means m (1) 1,..., m(1) k, which may be specified randomly or by some heuristic, the algorithm proceeds by alternating between two steps: 1 Assignment step: assign each observation to the cluster with the closest mean { } S (t) i = x j : x j m (t) i x j m (t) i for all i = 1,..., K 2 Update step: calculate the new means to be the centroid of the observations in the cluster m (t+1) i = 1 S (t) x j i x j S (t) i The algorithm is deemed to have converged when the assignments no longer change. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 21 / 35
22 Example of Kmeans algorithm 1) K initial means (in this case K = 3) are randomly selected from the data set (shown in color). 2) K clusters are created by associating every observation with the nearest mean. 3) The centroid of each of the K clusters becomes the new means. 4) Steps 2 and 3 are repeated until convergence has been reached. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 22 / 35
23 Kmeans algorithm As it is a heuristic algorithm, there is no guarantee that it will converge to the global optimum, and the result may depend on the initial clusters. As the algorithm is usually very fast, it is common to run it multiple times with different starting conditions. The two key features of kmeans which make it efficient are often regarded as its biggest drawbacks: 1 The number of clusters K is an input parameter: an inappropriate choice of K may yield poor results; 2 Euclidean distance is used as a metric and variance is used as a measure of cluster scatter. The kmeans clustering algorithm is commonly used in computer vision as a form of image segmentation. The results of the segmentation are used to aid border detection and object recognition. In this context, the standard Euclidean distance is usually insufficient in forming the clusters. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 23 / 35
24 Kmedoids algorithm The kmedoids algorithm is a clustering algorithm related to the kmeans algorithm. Both the kmeans and kmedoids algorithms are partitional (breaking the dataset up into groups) and both attempt to minimize squared error, the distance between points labeled to be in a cluster and a point designated as the center of that cluster. Kmedoid is a classical partitioning technique of clustering that clusters the data set of N objects into K clusters known a priori. In contrast to the kmeans algorithm kmedoids chooses datapoints as centers (medoids or exemplars). A medoid can be defined as the object of a cluster, whose average dissimilarity to all the objects in the cluster is minimal. It is more robust to noise and outliers as compared to kmeans. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 24 / 35
25 Kmedoids algorithm The most common realization of kmedoid clustering is the Partitioning Around Medoids (PAM) algorithm and is as follows: 1 Initialize: randomly select K of the N data points as the medoids; 2 Associate each data point to the closest medoid ( closest here is defined using any valid distance metric, most commonly Euclidean distance, Manhattan distance or Minkowski distance); 3 For each mediod m: For each nonmediod data point p: Swap m and p and compute the total cost of the configuration; 4 Select the configuration with the lowest cost; 5 repeat steps 2 to 4 until there is no change in the medoid. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 25 / 35
26 Kmedians algorithm The kmedians algorithm is similar to the kmeans one, but in this case, the squared error is replaced by the absolute error, and the mean is replaced with a medianlike object arising from optimization. The objective function is arg min S K i=1 x j S i x j µ i There is some evidence that this procedure is more resistant to outliers or strong nonnormality than regular kmeans. However, like all centroidbased methods, it works best when the clusters are convex. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 26 / 35
27 Silhouette Silhouette refers to a method of interpretation and validation of clusters of data. Assume the data has been clustered. For each datum, i let a(i) be the average dissimilarity of i with all other data within the same cluster. Any measure of dissimilarity can be used but distance measures are the most common. We can interpret a(i) as how well matched i is to the cluster it is assigned (the smaller the value, the better the matching). Then find the average dissimilarity of i with the data of another single cluster. Repeat this for every cluster that i is not a member of. Denote the cluster with the lowest average dissimilarity to i by b(i). This cluster is said to be the neighboring cluster of i as it is, aside from the cluster i is assigned, the cluster i fits best in. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 27 / 35
28 Silhouette We now define: 1 a(i)/b(i), if a(i) < b(i) s (i) = 0, if a(i) = b(i) b(i)/a(i) 1, if a(i) > b(i) From the above definition it is clear that 1 s(i) 1 For s(i) to be close to one we require a(i) << b(i). As a(i) is a measure of how dissimilar i is to its own cluster, a small value means it is well matched. Furthermore, a large b(i) implies that i is badly matched to its neighboring cluster. Thus an s(i) close to one means that the datum is appropriately clustered. If s(i) is close to negative one, then by the same logic we see that i would be more appropriate if it was clustered in its neighboring cluster. An s(i) near zero means that the datum is on the border of two natural clusters. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 28 / 35
29 QT clustering algorithm The QT (quality threshold) clustering is an alternative method of partitioning data, invented for gene clustering. It requires more computing power than kmeans, but does not require specifying the number of clusters a priori, and always returns the same result when run several times. The algorithm is: 1 The user chooses a maximum diameter for clusters; 2 Build a candidate cluster for each point by including the closest point, the next closest, and so on, until the diameter of the cluster surpasses the threshold; 3 Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration. Must clarify what happens if more than 1 cluster has the maximum number of points; 4 Recurse with the reduced set of points. The distance between a point and a group of points is computed using complete linkage, i.e. as the maximum distance from the point to any member of the group. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 29 / 35
30 Spectral clustering Given a set of data points A, the similarity matrix may be defined as a matrix S where s ij represents a measure of the similarity between points i, j A. Spectral clustering techniques make use of the spectrum of the similarity matrix of the data to perform dimensionality reduction for clustering in fewer dimensions. One such technique is the ShiMalik algorithm, commonly used for image segmentation. It partitions points into two sets (S 1, S 2 ) based on the eigenvector v corresponding to the secondsmallest eigenvalue of the Laplacian matrix L = I D 1/2 SD 1/2 of S, where D is the diagonal matrix, with d ii = j s ij. This partitioning may be done in various ways, such as by taking the median m of the components in v, and placing all points whose component in v is greater than m in S 1, and the rest in S 2. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 30 / 35
31 Gaussian mixture models clustering There s another way to deal with clustering problems: a modelbased approach, which consists in using certain models for clusters and attempting to optimize the fit between the data and the model. In practice, each cluster can be mathematically represented by a parametric distribution, like a Gaussian (continuous) or a Poisson (discrete). The entire data set is therefore modeled by a mixture of these distributions. An individual distribution used to model a specific cluster is often referred to as a component distribution. A mixture model with high likelihood tends to have the following traits: 1 component distributions have high peaks (data in one cluster are tight); 2 the mixture model covers the data well (dominant patterns in the data are captured by component distributions). Main advantages of modelbased clustering: 1 wellstudied statistical inference techniques available; 2 flexibility in choosing the component distribution; 3 obtain a density estimation for each cluster; 4 a soft classification is available. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 31 / 35
32 Gaussian mixture models clustering The most widely used clustering method of this kind is the one based on learning a mixture of Gaussians: we can actually consider clusters as Gaussian distributions centered on their barycentres, as we can see in this picture, where the grey circle represents the first variance of the distribution: Clusters are assigned by selecting the component that maximizes the posterior probability. Like kmeans clustering, Gaussian mixture modeling uses an iterative algorithm that converges to a local optimum. Gaussian mixture modeling may be more appropriate than kmeans clustering when clusters have different sizes and correlation within them. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 32 / 35
33 Gaussian mixture models clustering The algorithm works in this way: 1 it chooses the parameter θ i (for i = 1,..., K) at random with prior probability p(θ i ) N {µ i, σ}; 2 it obtains the posterior distribution (likelihood), where θ i = {µ i, σ}; 3 the likelihood function is L = K p(θ i )p(x θ i ) i=1 4 Now the likelihood function should be maximized by calculating L θ i = 0, but it would be too difficult. That s why it is used a simplified algorithm, known as EM (ExpectationMaximization). M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 33 / 35
34 Bayesian clustering Fundamentally the Bayesian clustering aims to obtain a posterior distribution over partitions of the data set D, denoted by C = {C 1,..., C K }, with or without specifying K. Several methods have been proposed for how to do this. Usually they come down to specifying a hierarchical model mimicking the partial order on the class of partitions so that the procedure is also hierarchical, usually agglomerative. The first effort to Bayesian clustering was the hierarchical technique due to Makato and Tokunaga. Starting with the data D = {x 1,..., x N } as N clusters of size 1, the idea is to merge clusters when the probability of the merged cluster p(c k C j ) is greater than the probability of the individual clusters p(c k )p(c j ). Thus the clusters themselves are treated as random variables. M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 34 / 35
35 References B. Clarke, E. Fokoué, H.H. Zhang. Principles and Theory for data Mining and Machine Learning. Springer, S. Theodoridis and K. Koutroumbas. Pattern Recognition. Elsevier, J.P. Marques de Sà. Pattern Recognition. Springer, J.S. Liu, J.L. Zhang, M.J. Palumbo, C.E. Lawrence. Bayesian Clustering with Variable and Transformation selactions. in Bayesian Statistics (Bernardo et al. Eds.), Oxford University Press, M. Scarpiniti Neural Networks Lesson 5  Cluster Analysis 35 / 35
Cluster Analysis. Isabel M. Rodrigues. Lisboa, 2014. Instituto Superior Técnico
Instituto Superior Técnico Lisboa, 2014 Introduction: Cluster analysis What is? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from
More informationData Clustering. Dec 2nd, 2013 Kyrylo Bessonov
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms kmeans Hierarchical Main
More informationDATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDDLAB ISTI CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar
More informationUnsupervised learning: Clustering
Unsupervised learning: Clustering Salissou Moutari Centre for Statistical Science and Operational Research CenSSOR 17 th September 2013 Unsupervised learning: Clustering 1/52 Outline 1 Introduction What
More informationClustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016
Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with
More informationFig. 1 A typical Knowledge Discovery process [2]
Volume 4, Issue 7, July 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Review on Clustering
More informationROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015
ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti
More informationL15: statistical clustering
Similarity measures Criterion functions Cluster validity Flat clustering algorithms kmeans ISODATA L15: statistical clustering Hierarchical clustering algorithms Divisive Agglomerative CSCE 666 Pattern
More informationMachine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
More informationDistance based clustering
// Distance based clustering Chapter ² ² Clustering Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 99). What is a cluster? Group of objects separated from other clusters Means
More informationClustering UE 141 Spring 2013
Clustering UE 141 Spring 013 Jing Gao SUNY Buffalo 1 Definition of Clustering Finding groups of obects such that the obects in a group will be similar (or related) to one another and different from (or
More informationMachine Learning and Data Mining. Clustering. (adapted from) Prof. Alexander Ihler
Machine Learning and Data Mining Clustering (adapted from) Prof. Alexander Ihler Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationMedical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu
Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?
More informationClustering. Data Mining. Abraham Otero. Data Mining. Agenda
Clustering 1/46 Agenda Introduction Distance Knearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in
More informationEM Clustering Approach for MultiDimensional Analysis of Big Data Set
EM Clustering Approach for MultiDimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
More informationData Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining
Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distancebased Kmeans, Kmedoids,
More informationClustering. 15381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is
Clustering 15381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv BarJoseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is
More informationData Mining Project Report. Document Clustering. Meryem UzunPer
Data Mining Project Report Document Clustering Meryem UzunPer 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. Kmeans algorithm...
More informationLecture 20: Clustering
Lecture 20: Clustering Wrapup of neural nets (from last lecture Introduction to unsupervised learning Kmeans clustering COMP424, Lecture 20  April 3, 2013 1 Unsupervised learning In supervised learning,
More informationClustering. Adrian Groza. Department of Computer Science Technical University of ClujNapoca
Clustering Adrian Groza Department of Computer Science Technical University of ClujNapoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 Kmeans 3 Hierarchical Clustering What is Datamining?
More informationChapter ML:XI (continued)
Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis DensityBased Cluster Analysis Cluster Evaluation Constrained
More informationClustering & Association
Clustering  Overview What is cluster analysis? Grouping data objects based only on information found in the data describing these objects and their relationships Maximize the similarity within objects
More informationUNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS
UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi110 012 dcmishra@iasri.res.in What is Learning? "Learning denotes changes in a system that enable
More informationRobotics 2 Clustering & EM. Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard
Robotics 2 Clustering & EM Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard 1 Clustering (1) Common technique for statistical data analysis to detect structure (machine learning,
More informationUnsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning
Unsupervised Learning and Data Mining Unsupervised Learning and Data Mining Clustering Decision trees Artificial neural nets Knearest neighbor Support vectors Linear regression Logistic regression...
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining /8/ What is Cluster
More information. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns
Outline Part 1: of data clustering NonSupervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties
More information10810 /02710 Computational Genomics. Clustering expression data
10810 /02710 Computational Genomics Clustering expression data What is Clustering? Organizing data into clusters such that there is high intracluster similarity low intercluster similarity Informally,
More informationChapter 7. Cluster Analysis
Chapter 7. Cluster Analysis. What is Cluster Analysis?. A Categorization of Major Clustering Methods. Partitioning Methods. Hierarchical Methods 5. DensityBased Methods 6. GridBased Methods 7. ModelBased
More informationCLASSIFICATION AND CLUSTERING. Anveshi Charuvaka
CLASSIFICATION AND CLUSTERING Anveshi Charuvaka Learning from Data Classification Regression Clustering Anomaly Detection Contrast Set Mining Classification: Definition Given a collection of records (training
More informationData Mining. Cluster Analysis: Advanced Concepts and Algorithms
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototypebased clustering Densitybased clustering Graphbased
More informationARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)
ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications
More informationClustering and Data Mining in R
Clustering and Data Mining in R Workshop Supplement Thomas Girke December 10, 2011 Introduction Data Preprocessing Data Transformations Distance Methods Cluster Linkage Hierarchical Clustering Approaches
More informationDistances, Clustering, and Classification. Heatmaps
Distances, Clustering, and Classification Heatmaps 1 Distance Clustering organizes things that are close into groups What does it mean for two genes to be close? What does it mean for two samples to be
More informationClassification Techniques for Remote Sensing
Classification Techniques for Remote Sensing Selim Aksoy Department of Computer Engineering Bilkent University Bilkent, 06800, Ankara saksoy@cs.bilkent.edu.tr http://www.cs.bilkent.edu.tr/ saksoy/courses/cs551
More informationClustering Hierarchical clustering and kmean clustering
Clustering Hierarchical clustering and kmean clustering Genome 373 Genomic Informatics Elhanan Borenstein The clustering problem: A quick review partition genes into distinct sets with high homogeneity
More informationSteven M. Ho!and. Department of Geology, University of Georgia, Athens, GA 306022501
CLUSTER ANALYSIS Steven M. Ho!and Department of Geology, University of Georgia, Athens, GA 306022501 January 2006 Introduction Cluster analysis includes a broad suite of techniques designed to find groups
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 10 th, 2013 WolfTilo Balke and Kinda El Maarry Institut für Informationssysteme Technische Universität Braunschweig
More informationClustering and Cluster Evaluation. Josh Stuart Tuesday, Feb 24, 2004 Read chap 4 in Causton
Clustering and Cluster Evaluation Josh Stuart Tuesday, Feb 24, 2004 Read chap 4 in Causton Clustering Methods Agglomerative Start with all separate, end with some connected Partitioning / Divisive Start
More informationText Clustering. Clustering
Text Clustering 1 Clustering Partition unlabeled examples into disoint subsets of clusters, such that: Examples within a cluster are very similar Examples in different clusters are very different Discover
More informationSPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING
AAS 07228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations
More informationStandardization and Its Effects on KMeans Clustering Algorithm
Research Journal of Applied Sciences, Engineering and Technology 6(7): 3993303, 03 ISSN: 0407459; eissn: 0407467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03
More informationSTATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Webbased Analytics Table
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will
More informationThere are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:
Statistics: Rosie Cornish. 2007. 3.1 Cluster Analysis 1 Introduction This handout is designed to provide only a brief introduction to cluster analysis and how it is done. Books giving further details are
More informationCluster Analysis: Basic Concepts and Methods
10 Cluster Analysis: Basic Concepts and Methods Imagine that you are the Director of Customer Relationships at AllElectronics, and you have five managers working for you. You would like to organize all
More informationHierarchical Cluster Analysis Some Basics and Algorithms
Hierarchical Cluster Analysis Some Basics and Algorithms Nethra Sambamoorthi CRMportals Inc., 11 Bartram Road, Englishtown, NJ 07726 (NOTE: Please use always the latest copy of the document. Click on this
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Clustering Algorithms Kmeans and its variants Hierarchical clustering
More informationA comparison of various clustering methods and algorithms in data mining
Volume :2, Issue :5, 3236 May 2015 www.allsubjectjournal.com eissn: 23494182 pissn: 23495979 Impact Factor: 3.762 R.Tamilselvi B.Sivasakthi R.Kavitha Assistant Professor A comparison of various clustering
More informationCluster Analysis: Advanced Concepts
Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototypebased Fuzzy cmeans
More informationAn Enhanced Clustering Algorithm to Analyze Spatial Data
International Journal of Engineering and Technical Research (IJETR) ISSN: 23210869, Volume2, Issue7, July 2014 An Enhanced Clustering Algorithm to Analyze Spatial Data Dr. Mahesh Kumar, Mr. Sachin Yadav
More informationCluster Analysis: Basic Concepts and Algorithms
8 Cluster Analysis: Basic Concepts and Algorithms Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should
More informationData visualization and clustering. Genomics is to no small extend a data science
Data visualization and clustering Genomics is to no small extend a data science [www.data2discovery.org] Data visualization and clustering Genomics is to no small extend a data science [Andersson et al.,
More information15.062 Data Mining: Algorithms and Applications Matrix Math Review
.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop
More informationCluster Algorithms. Adriano Cruz adriano@nce.ufrj.br. 28 de outubro de 2013
Cluster Algorithms Adriano Cruz adriano@nce.ufrj.br 28 de outubro de 2013 Adriano Cruz adriano@nce.ufrj.br () Cluster Algorithms 28 de outubro de 2013 1 / 80 Summary 1 KMeans Adriano Cruz adriano@nce.ufrj.br
More informationAn Introduction to Cluster Analysis for Data Mining
An Introduction to Cluster Analysis for Data Mining 10/02/2000 11:42 AM 1. INTRODUCTION... 4 1.1. Scope of This Paper... 4 1.2. What Cluster Analysis Is... 4 1.3. What Cluster Analysis Is Not... 5 2. OVERVIEW...
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
More informationKMeans Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1
KMeans Cluster Analsis Chapter 3 PPDM Class Tan,Steinbach, Kumar Introduction to Data Mining 4/18/4 1 What is Cluster Analsis? Finding groups of objects such that the objects in a group will be similar
More informationComponent Ordering in Independent Component Analysis Based on Data Power
Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals
More informationThe Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
More informationIntroduction to Statistical Machine Learning
CHAPTER Introduction to Statistical Machine Learning We start with a gentle introduction to statistical machine learning. Readers familiar with machine learning may wish to skip directly to Section 2,
More informationComparison of Nonlinear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
CMPE 59H Comparison of Nonlinear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Nonlinear
More informationExample: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering
Overview Prognostic Models and Data Mining in Medicine, part I Cluster Analsis What is Cluster Analsis? KMeans Clustering Hierarchical Clustering Cluster Validit Eample: Microarra data analsis 6 Summar
More informationChapter 4: NonParametric Classification
Chapter 4: NonParametric Classification Introduction Density Estimation Parzen Windows KnNearest Neighbor Density Estimation KNearest Neighbor (KNN) Decision Rule Gaussian Mixture Model A weighted combination
More informationDecision Support System Methodology Using a Visual Approach for Cluster Analysis Problems
Decision Support System Methodology Using a Visual Approach for Cluster Analysis Problems Ran M. Bittmann School of Business Administration Ph.D. Thesis Submitted to the Senate of BarIlan University RamatGan,
More informationUsing Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
More informationLecture 9: Introduction to Pattern Analysis
Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns
More informationCluster analysis Cosmin Lazar. COMO Lab VUB
Cluster analysis Cosmin Lazar COMO Lab VUB Introduction Cluster analysis foundations rely on one of the most fundamental, simple and very often unnoticed ways (or methods) of understanding and learning,
More informationCluster Analysis: Basic Concepts and Algorithms
Cluster Analsis: Basic Concepts and Algorithms What does it mean clustering? Applications Tpes of clustering Kmeans Intuition Algorithm Choosing initial centroids Bisecting Kmeans Postprocessing Strengths
More informationCluster Analysis. Chapter. Chapter Outline. What You Will Learn in This Chapter
5 Chapter Cluster Analysis Chapter Outline Introduction, 210 Business Situation, 211 Model, 212 Distance or Dissimilarities, 213 Combinatorial Searches with KMeans, 216 Statistical Mixture Model with
More informationIntroduction to machine learning and pattern recognition Lecture 1 Coryn BailerJones
Introduction to machine learning and pattern recognition Lecture 1 Coryn BailerJones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 What is machine learning? Data description and interpretation
More informationCluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009
Cluster Analysis Alison Merikangas Data Analysis Seminar 18 November 2009 Overview What is cluster analysis? Types of cluster Distance functions Clustering methods Agglomerative Kmeans Densitybased Interpretation
More informationAn Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
More informationIEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 7, JULY 2009 1181
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 7, JULY 2009 1181 The Global Kernel kmeans Algorithm for Clustering in Feature Space Grigorios F. Tzortzis and Aristidis C. Likas, Senior Member, IEEE
More informationGoing Big in Data Dimensionality:
LUDWIG MAXIMILIANS UNIVERSITY MUNICH DEPARTMENT INSTITUTE FOR INFORMATICS DATABASE Going Big in Data Dimensionality: Challenges and Solutions for Mining High Dimensional Data Peer Kröger Lehrstuhl für
More informationCluster Analysis using R
Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other
More informationMethodology for Emulating Self Organizing Maps for Visualization of Large Datasets
Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets Macario O. Cordel II and Arnulfo P. Azcarraga College of Computer Studies *Corresponding Author: macario.cordel@dlsu.edu.ph
More informationImportant Characteristics of Cluster Analysis Techniques
Cluster Analysis Can we organize sampling entities into discrete classes, such that withingroup similarity is maximized and amonggroup similarity is minimized according to some objective criterion? Sites
More informationEnvironmental Remote Sensing GEOG 2021
Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class
More informationMachine Learning for NLP
Natural Language Processing SoSe 2015 Machine Learning for NLP Dr. Mariana Neves May 4th, 2015 (based on the slides of Dr. Saeedeh Momtazi) Introduction Field of study that gives computers the ability
More informationPersonalized Hierarchical Clustering
Personalized Hierarchical Clustering Korinna Bade, Andreas Nürnberger Faculty of Computer Science, OttovonGuerickeUniversity Magdeburg, D39106 Magdeburg, Germany {kbade,nuernb}@iws.cs.unimagdeburg.de
More informationClustering & Visualization
Chapter 5 Clustering & Visualization Clustering in highdimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to highdimensional data.
More informationB490 Mining the Big Data. 2 Clustering
B490 Mining the Big Data 2 Clustering Qin Zhang 11 Motivations Group together similar documents/webpages/images/people/proteins/products One of the most important problems in machine learning, pattern
More informationClassifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang
Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical microclustering algorithm ClusteringBased SVM (CBSVM) Experimental
More informationAn Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
More informationData Mining Clustering. Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining
Data Mining Clustering Toon Calders Sheets are based on the those provided b Tan, Steinbach, and Kumar. Introduction to Data Mining What is Cluster Analsis? Finding groups of objects such that the objects
More informationIntroduction to Clustering
Introduction to Clustering Yumi Kondo Student Seminar LSK301 Sep 25, 2010 Yumi Kondo (University of British Columbia) Introduction to Clustering Sep 25, 2010 1 / 36 Microarray Example N=65 P=1756 Yumi
More informationCategorical Data Visualization and Clustering Using Subjective Factors
Categorical Data Visualization and Clustering Using Subjective Factors ChiaHui Chang and ZhiKai Ding Department of Computer Science and Information Engineering, National Central University, ChungLi,
More informationAnalysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet
More informationComputational Complexity between KMeans and KMedoids Clustering Algorithms for Normal and Uniform Distributions of Data Points
Journal of Computer Science 6 (3): 363368, 2010 ISSN 15493636 2010 Science Publications Computational Complexity between KMeans and KMedoids Clustering Algorithms for Normal and Uniform Distributions
More informationData Preprocessing. Week 2
Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.
More informationHT2015: SC4 Statistical Data Mining and Machine Learning
HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric
More informationSummary Data Mining & Process Mining (1BM46) Content. Made by S.P.T. Ariesen
Summary Data Mining & Process Mining (1BM46) Made by S.P.T. Ariesen Content Data Mining part... 2 Lecture 1... 2 Lecture 2:... 4 Lecture 3... 7 Lecture 4... 9 Process mining part... 13 Lecture 5... 13
More informationData Mining and Visualization
Data Mining and Visualization Jeremy Walton NAG Ltd, Oxford Overview Data mining components Functionality Example application Quality control Visualization Use of 3D Example application Market research
More informationData Mining 資 料 探 勘. 分 群 分 析 (Cluster Analysis)
Data Mining 資 料 探 勘 Tamkang University 分 群 分 析 (Cluster Analysis) DM MI Wed,, (: :) (B) MinYuh Day 戴 敏 育 Assistant Professor 專 任 助 理 教 授 Dept. of Information Management, Tamkang University 淡 江 大 學 資
More informationLinear Threshold Units
Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear
More informationFlat Clustering KMeans Algorithm
Flat Clustering KMeans Algorithm 1. Purpose. Clustering algorithms group a set of documents into subsets or clusters. The cluster algorithms goal is to create clusters that are coherent internally, but
More informationData Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland
Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data
More information