Data Mining Cluster Analysis: Advanced Concepts and Algorithms. ref. Chapter 9. Introduction to Data Mining


 Trevor Roberts
1 Data Mining Cluster Analysis: Advanced Concepts and Algorithms
Ref.: Chapter 9, Introduction to Data Mining, by Tan, Steinbach, Kumar
2 Outline
Prototype-based: fuzzy c-means, mixture model clustering
Density-based: grid-based clustering, subspace clustering
Graph-based: Chameleon
Scalable clustering algorithms: CURE and BIRCH
Characteristics of clustering algorithms
Data e Web Mining
3 Hard (Crisp) vs Soft (Fuzzy) Clustering
Generalize the K-means objective function (over all N points):
$SSE = \sum_{j=1}^{k} \sum_{i=1}^{N} w_{ij} (x_i - c_j)^2$, subject to $\sum_{j=1}^{k} w_{ij} = 1$
$w_{ij}$: weight with which object $x_i$ belongs to cluster $C_j$
To minimize SSE, repeat the following steps:
  Fix $c_j$ and determine $w_{ij}$ (cluster assignment)
  Fix $w_{ij}$ and recompute $c_j$
Hard clustering: $w_{ij} \in \{0, 1\}$
4 Hard (Crisp) vs Soft (Fuzzy) Clustering
[Figure: a point x and two centroids, $c_1$ and $c_2$]
SSE(x) is minimized when $w_{x1} = 1$, $w_{x2} = 0$
5 Fuzzy C-means
Objective function, with fuzzifier p (p > 1):
$SSE = \sum_{j=1}^{k} \sum_{i=1}^{N} w_{ij}^{p} (x_i - c_j)^2$, subject to $\sum_{j=1}^{k} w_{ij} = 1$
$w_{ij}$: weight with which object $x_i$ belongs to cluster $C_j$
To minimize the objective function, repeat the following:
  Fix $c_j$ and determine $w_{ij}$
  Fix $w_{ij}$ and recompute $c_j$
Fuzzy clustering: $w_{ij} \in [0, 1]$
6 Fuzzy C-means
[Figure: a point x and two centroids, $c_1$ and $c_2$]
SSE(x) is minimized when $w_{x1} = 0.9$, $w_{x2} = 0.1$
7 Fuzzy C-means
Objective function, with fuzzifier p (p > 1):
$SSE = \sum_{j=1}^{k} \sum_{i=1}^{N} w_{ij}^{p} (x_i - c_j)^2$, subject to $\sum_{j=1}^{k} w_{ij} = 1$
Initialization: choose the weights $w_{ij}$ randomly
Repeat:
  Update centroids
  Update weights
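The update formulas for the two steps did not survive in the transcription. As a minimal sketch, the loop can be written with the standard fuzzy c-means updates, which is an assumption about what the slide showed: each centroid is the mean of the points weighted by $w_{ij}^{p}$, and each weight is proportional to $(1 / (x_i - c_j)^2)^{1/(p-1)}$, normalized over the clusters. The function name `fuzzy_c_means`, the 1-D data, the seed, and the fixed iteration count are all illustrative choices.

```python
import random

def fuzzy_c_means(points, k, p=2.0, iters=100, seed=0):
    """Fuzzy c-means on 1-D data; returns (centroids, weights)."""
    rng = random.Random(seed)
    # Initialization: random weights, normalized so each row sums to 1.
    w = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(k)]
        total = sum(row)
        w.append([v / total for v in row])
    centroids = [0.0] * k
    for _ in range(iters):
        # Update centroids: mean of the points weighted by w_ij ** p.
        for j in range(k):
            num = sum((w[i][j] ** p) * x for i, x in enumerate(points))
            den = sum(w[i][j] ** p for i in range(len(points)))
            centroids[j] = num / den
        # Update weights: closer centroids receive larger weights.
        for i, x in enumerate(points):
            d2 = [(x - c) ** 2 for c in centroids]
            if min(d2) == 0.0:
                # The point coincides with a centroid: full membership there.
                hit = d2.index(0.0)
                w[i] = [1.0 if j == hit else 0.0 for j in range(k)]
                continue
            inv = [(1.0 / d) ** (1.0 / (p - 1.0)) for d in d2]
            total = sum(inv)
            w[i] = [v / total for v in inv]
    return centroids, w

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centroids, weights = fuzzy_c_means(points, k=2)
```

On this toy data the two centroids settle near 1 and 5, and every row of `weights` sums to 1, matching the constraint on $w_{ij}$.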
8 Fuzzy K-means Applied to Sample Data
[Figure]
9 Hard (Crisp) vs Soft (Probabilistic) Clustering
The idea is to model the set of data points as arising from a mixture of distributions.
Typically, the normal (Gaussian) distribution is used, but other distributions have been used very profitably.
Clusters are found by estimating the parameters of the statistical distributions.
  A K-means-like algorithm, called the EM algorithm, can be used to estimate these parameters.
  In fact, K-means is a special case of this approach.
  This provides a compact representation of clusters.
  The probabilities with which a point belongs to each cluster provide a functionality similar to fuzzy clustering.
10 Probabilistic Clustering: Example
Informal example: consider modeling the points that generate the following histogram.
[Figure: histogram]
It looks like a combination of two normal distributions.
Suppose we can estimate the mean and standard deviation of each normal distribution.
  This completely describes the two clusters.
  We can compute the probabilities with which each point belongs to each cluster.
  We can assign each point to the cluster (distribution) in which it is most probable.
11 Probabilistic Clustering: EM Algorithm
Initialize the parameters
Repeat
  For each point, compute its probability under each distribution
  Using these probabilities, update the parameters of each distribution
Until there is no change
Very similar to K-means:
  Consists of assignment and update steps
  Can use random initialization (problem of local minima)
  For normal distributions, typically use K-means to initialize
If using normal distributions, can find elliptical as well as spherical shapes.
12 Probabilistic Clustering: EM Algorithm
Choose K seeds: the means of Gaussian distributions
Expectation (estimation): calculate each point's probability of belonging to each cluster, based on distance
Maximization: move the mean of each Gaussian to the centroid of the data set, weighted by the contribution of each point
Repeat until the means don't move
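The two alternating steps can be sketched for a 1-D mixture of Gaussians as follows. This is a minimal illustration under stated assumptions, not the slides' implementation: the function names, the fixed iteration count in place of a convergence test, the hand-picked starting parameters, and the small floor on the standard deviation are all choices made for the example.

```python
import math

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def em_gaussian_mixture(points, means, stds, priors, iters=50):
    """EM for a 1-D Gaussian mixture, starting from the given parameters."""
    k = len(means)
    means, stds, priors = list(means), list(stds), list(priors)
    resp = []
    for _ in range(iters):
        # E step: probability of each point under each distribution.
        resp = []
        for x in points:
            joint = [priors[j] * normal_pdf(x, means[j], stds[j]) for j in range(k)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M step: re-estimate parameters, weighting each point by its probability.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, points)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, points)) / nj
            stds[j] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsed component
            priors[j] = nj / len(points)
    return means, stds, priors, resp

points = [-4.2, -3.9, -4.0, -3.7, -4.4, 3.8, 4.1, 4.0, 4.3, 3.6]
means, stds, priors, resp = em_gaussian_mixture(
    points, means=[-1.0, 1.0], stds=[1.0, 1.0], priors=[0.5, 0.5])
```

Each row of `resp` holds the soft assignment of one point; taking the largest entry per row gives the hard assignment to the most probable cluster.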
13 Probabilistic Clustering Applied to Sample Data
[Figure]
14 Grid-based Clustering
A type of density-based clustering
15 Grid-based Clustering
Issues:
  How to discretize the dimensions: equal-width vs. equal-frequency discretization
  Cells containing points close to the border of a cluster can have very low density, so these cells are discarded.
  A possible solution is to reduce the size of the cells, but this may cause additional problems.
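The difference between the two discretization schemes can be seen on a small skewed sample. The sketch below is illustrative only: the function names and the toy values are assumptions, and `equal_width_bins` assumes the values are not all identical.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign each value to one of n_bins intervals holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

values = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 9.0, 10.0]
width_bins = equal_width_bins(values, 4)      # [0, 0, 0, 0, 0, 0, 3, 3]
freq_bins = equal_frequency_bins(values, 4)   # [0, 0, 1, 1, 2, 2, 3, 3]
```

Equal-width bins leave two of the four intervals empty on this data, while equal-frequency bins split the dense region into several intervals, which changes which cells count as dense.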
16 Subspace Clustering
Until now, we found clusters by considering all of the attributes.
Some clusters may involve only a subset of attributes, i.e., subspaces of the data.
Example:
  In a document collection, documents can be represented as vectors, where the dimensions correspond to terms.
  When K-means is used to find document clusters, the resulting clusters can typically be characterized by 10 or so terms.
17 Example
[Figure: three-dimensional scatter plot]
Three clear clusters.
The circle points are not a cluster in three dimensions.
If the dimensions are discretized (equal width), these points fall into low-density cells.
18 Histograms to determine density
[Figure: histograms]
Equi-width discretized space
Density threshold = 6%
Contiguous intervals to be clustered
22 Example: remarks
The circles do not form a cluster in the three dimensions, but they may form a cluster in some subspaces.
A cluster in the three dimensions is part of a cluster (maybe a larger one) in the subspaces.
23 CLIQUE Algorithm: Overview
A grid-based clustering algorithm that methodically finds subspace clusters
  Partitions the data space into rectangular units of equal volume
  Measures the density of each unit by the fraction of points it contains
  A unit is dense if the fraction of overall points it contains is above a user-specified threshold, τ
  A cluster is a collection of contiguous (touching) dense units
24 CLIQUE Algorithm
It is impractical to check each subspace to see if it is dense, because the number of subspaces is exponential: there are $2^n$ subspaces for n dimensions.
Monotone property of density-based clusters:
  If a set of points forms a density-based cluster in k dimensions, then the same set of points is also part of a density-based cluster in all possible subsets of those dimensions.
Very similar to the Apriori algorithm for frequent itemset mining
Can find overlapping clusters
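The monotone property allows an Apriori-style search: a higher-dimensional subspace is examined only if all of its lower-dimensional projections contain dense units. The sketch below stops at two dimensions and works on points that are already discretized into grid-cell indices. It is a simplified illustration, not the full CLIQUE algorithm; the function names, the toy data, and the threshold value are assumptions.

```python
from collections import Counter
from itertools import combinations

def dense_units(cells, dims, tau):
    """Dense units in the subspace `dims`: grid cells holding more than tau of all points."""
    counts = Counter(tuple(cell[d] for d in dims) for cell in cells)
    return {unit for unit, n in counts.items() if n / len(cells) > tau}

def clique_dense_subspaces(cells, tau):
    """Map each 1-D and 2-D subspace to its dense units, pruning Apriori-style."""
    n_dims = len(cells[0])
    result = {}
    for d in range(n_dims):
        units = dense_units(cells, (d,), tau)
        if units:
            result[(d,)] = units
    for a, b in combinations(range(n_dims), 2):
        # Monotone property: a dense 2-D unit must project onto dense 1-D units,
        # so a pair with a dimension that has no dense unit is skipped outright.
        if (a,) not in result or (b,) not in result:
            continue
        units = dense_units(cells, (a, b), tau)
        if units:
            result[(a, b)] = units
    return result

# Each point is given by its grid-cell index per dimension (already discretized).
cells = [(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 0, 3),
         (1, 2, 0), (2, 3, 1), (3, 1, 2), (3, 3, 3)]
subspaces = clique_dense_subspaces(cells, tau=0.3)
# {(0,): {(0,)}, (1,): {(0,)}, (0, 1): {(0, 0)}}
```

In this toy data no three-dimensional cell holds more than one point, yet four points share a dense unit in the subspace of dimensions 0 and 1, which is the situation the subspace example describes.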
25 CLIQUE Algorithm
[Figure]
26 Limitations of CLIQUE
Time complexity is exponential in the number of dimensions, especially if too many dense units are generated at lower stages.
May fail if clusters are of widely differing densities, since the threshold is fixed.
  Determining an appropriate threshold and unit interval length can be challenging.
27 Graph-Based Clustering: General Concepts
Graph-based clustering uses the proximity graph:
  Start with the proximity matrix
  Consider each point as a node in a graph
  Each edge between two nodes has a weight, which is the proximity between the two points
  Initially the proximity graph is fully connected
  MIN (single-link) and MAX (complete-link) can be viewed as starting with this graph
In the simplest case, clusters are connected components in the graph.
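Because the initial proximity graph is fully connected, some edges must be dropped before connected components become meaningful clusters. A minimal sketch, assuming a simple distance cutoff (the slides do not specify how edges are removed), with a union-find helper whose name is illustrative:

```python
def connected_components(n, edges):
    """Group nodes 0..n-1 into connected components via union-find."""
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())

points = [0.0, 0.4, 0.9, 5.0, 5.3, 9.8]
threshold = 1.0  # hypothetical cutoff: keep an edge only if the distance is below it
edges = [(i, j) for i in range(len(points)) for j in range(i + 1, len(points))
         if abs(points[i] - points[j]) < threshold]
clusters = connected_components(len(points), edges)  # [[0, 1, 2], [3, 4], [5]]
```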
28 Graph-Based Clustering: Chameleon
Based on several key ideas:
  Sparsification of the proximity graph
  Partitioning the data into clusters that are relatively pure subclusters of the true clusters
  Merging based on preserving characteristics of clusters
29 Graph-Based Clustering: Sparsification
The amount of data that needs to be processed is drastically reduced, which makes the algorithm more scalable.
  Sparsification can eliminate more than 99% of the entries in a proximity matrix.
  The amount of time required to cluster the data is drastically reduced.
  The size of the problems that can be handled is increased.
30 Graph-Based Clustering: Sparsification
Clustering may work better:
  Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point while breaking the connections to less similar points.
  The nearest neighbors of a point tend to belong to the same class as the point itself.
  This reduces the impact of noise and outliers and sharpens the distinction between clusters.
Sparsification facilitates the use of graph partitioning algorithms (or algorithms based on graph partitioning algorithms):
  Chameleon and hypergraph-based clustering
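One common way to sparsify is to keep only each point's k nearest neighbors. The sketch below is an illustration of that idea, not necessarily the scheme Chameleon uses; it assumes a symmetric similarity matrix, and the function name and toy matrix are hypothetical.

```python
def knn_sparsify(similarity, k):
    """Keep, for every node, only the edges to its k most similar neighbors."""
    n = len(similarity)
    sparse = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbors = sorted((j for j in range(n) if j != i),
                           key=lambda j: similarity[i][j], reverse=True)
        for j in neighbors[:k]:
            # Store the edge symmetrically so the result is an undirected graph.
            sparse[i][j] = similarity[i][j]
            sparse[j][i] = similarity[i][j]
    return sparse

similarity = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]
sparse = knn_sparsify(similarity, k=1)
```

With k = 1, only three of the six distinct pairs keep an edge, and the weak links from point 3 to points 0 and 1 disappear.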
31 Sparsification in the Clustering Process
[Figure]
32 Limitations of Current Merging Schemes
Existing merging schemes in hierarchical clustering algorithms are static in nature:
  MIN or CURE: merge two clusters based on their closeness (or minimum distance)
  GROUP-AVERAGE: merge two clusters based on their average connectivity
33 Limitations of Current Merging Schemes
[Figure: four panels, (a), (b), (c), (d)]
Closeness schemes will merge (a) and (b)
Average connectivity schemes will merge (c) and (d)
34 Chameleon: Clustering Using Dynamic Modeling
Adapt to the characteristics of the data set to find the natural clusters.
Use a dynamic model to measure the similarity between clusters:
  The main properties are the relative closeness and relative interconnectivity of the clusters.
  Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters.
  The merging scheme preserves self-similarity.
35 Experimental Results: CHAMELEON
[Figures, slides 35 to 37]
38 CURE: a Scalable Algorithm
Agglomerative hierarchical clustering algorithms vary in how the proximity of two clusters is computed:
  MIN (single link): susceptible to noise and outliers
  MAX (complete link), group AVERAGE, centroid: may not work well with non-globular clusters
The CURE (Clustering Using REpresentatives) algorithm tries to handle both problems.
It is a graph-based algorithm.
It starts with a proximity matrix or proximity graph.
39 CURE Algorithm
Represents a cluster using multiple representative points
  Goal: scalability, by choosing points that capture the geometry and shape of clusters
Representative points are found by selecting a constant number of points from a cluster:
  The first representative point is chosen to be the point farthest from the center of the cluster.
  The remaining representative points are chosen so that they are farthest from all previously chosen points.
40 CURE Algorithm
Shrink the representative points toward the center of the cluster by a shrinking factor, α.
  Shrinking the representative points toward the center helps avoid problems with noise and outliers.
Cluster similarity is the similarity of the closest pair of representative points (MIN) from different clusters.
41 CURE Algorithm
Uses an agglomerative hierarchical scheme to perform clustering:
  α = 0: similar to centroid-based
  α = 1: somewhat similar to single-link (MIN)
CURE is better able to handle clusters of arbitrary shapes and sizes.
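The selection and shrinking of representative points can be sketched as below. Two assumptions are made explicit. First, "farthest from all previously chosen points" is read as maximizing the distance to the nearest already-chosen point. Second, shrinking follows the convention implied by these slides, in which each point's offset from the center is multiplied by α, so that α = 0 collapses every representative onto the center and α = 1 leaves it in place. The function names and toy cluster are illustrative, and the points are assumed to be distinct.

```python
import math

def representatives(cluster, n_reps):
    """Pick well-scattered points: farthest from the center, then farthest from those chosen."""
    center = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    reps = [max(cluster, key=lambda p: math.dist(p, center))]
    while len(reps) < min(n_reps, len(cluster)):
        candidates = [p for p in cluster if p not in reps]
        # Farthest from all chosen points, measured by distance to the nearest one.
        reps.append(max(candidates, key=lambda p: min(math.dist(p, r) for r in reps)))
    return center, reps

def shrink(center, reps, alpha):
    """Scale each representative's offset from the center by alpha (slide convention)."""
    return [tuple(c + alpha * (x - c) for x, c in zip(p, center)) for p in reps]

def cluster_distance(reps_a, reps_b):
    """Distance between the closest pair of representatives (MIN)."""
    return min(math.dist(a, b) for a in reps_a for b in reps_b)

cluster = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (2.0, 2.0)]
center, reps = representatives(cluster, n_reps=2)
shrunk = shrink(center, reps, alpha=0.5)  # [(1.0, 1.0), (3.0, 3.0)]
```

In the agglomerative loop, `cluster_distance` would be evaluated on the shrunken representatives of each pair of clusters, and the closest pair merged.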
42 Experimental Results: CURE (10 clusters) [figure]
43 Experimental Results: CURE (9 clusters) [figure]
44 Experimental Results: CURE [figure; picture from CURE, Guha, Rastogi, Shim]
45 Experimental Results: CURE [figure panels: (centroid), (single link); picture from CURE, Guha, Rastogi, Shim]
46 CURE Cannot Handle Differing Densities [figure panels: Original Points, CURE]
47 BIRCH: a Scalable Algorithm
Balanced Iterative Reducing and Clustering using Hierarchies
Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans.
Weakness: handles only numeric data (Euclidean space) and is sensitive to the order of the data records.
48 BIRCH
Clustering Feature (CF): (N, LS, SS)
  N: number of points; LS: linear sum of the points; SS: sum of squares of the points
The CF is updated incrementally and is used to compute centroids and variance (which measures the diameter of the cluster).
It is also used to compute distances between clusters.
49 BIRCH
The CF is a compact storage for data on the points in a cluster.
It has enough information to calculate the intra-cluster distances.
The additivity theorem allows us to merge subclusters:
  $C_3 = C_1 \cup C_2$
  $CF_{C_3} = \langle N_{C_1} + N_{C_2},\ LS_{C_1} + LS_{C_2},\ SS_{C_1} + SS_{C_2} \rangle$
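A clustering feature and its additive merge fit in a few lines. In this sketch the class name `CF` and its method names are illustrative, and SS is stored as a single scalar (the sum of squared norms), which is one common convention and an assumption here. The diameter is computed as the root-mean-square pairwise distance, which can be derived from N, LS, and SS alone.

```python
from dataclasses import dataclass
import math

@dataclass
class CF:
    n: int      # number of points
    ls: tuple   # linear sum of the points, per dimension
    ss: float   # sum of squared norms of the points

    @classmethod
    def from_point(cls, point):
        return cls(1, tuple(point), sum(x * x for x in point))

    def merge(self, other):
        """Additivity: the CF of a union is the component-wise sum."""
        ls = tuple(a + b for a, b in zip(self.ls, other.ls))
        return CF(self.n + other.n, ls, self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def diameter(self):
        """Root-mean-square pairwise distance between the points (needs n >= 2)."""
        ls2 = sum(x * x for x in self.ls)
        mean_sq = (2 * self.n * self.ss - 2 * ls2) / (self.n * (self.n - 1))
        return math.sqrt(max(mean_sq, 0.0))

cf = CF.from_point((0.0, 0.0))
for point in [(2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]:
    cf = cf.merge(CF.from_point(point))
# cf is now CF(n=4, ls=(4.0, 4.0), ss=16.0), with centroid (1.0, 1.0)
```

No individual point is kept after the merge, which is why a CF tree can summarize a data set that does not fit in memory.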
50 BIRCH
Basic steps of BIRCH:
  Load the data into memory by creating a CF tree that summarizes the data (see the following slide).
  Perform global clustering. This produces a better clustering than the initial step. An agglomerative, hierarchical technique was selected.
  Redistribute the data points using the centroids of the clusters discovered in the global clustering phase, and thus discover a new (and hopefully better) set of clusters.
51 BIRCH
BIRCH maintains a balanced CF-tree:
  Branching factor B: maximum number of entries in a nonleaf node
  Maximum leaf size L: maximum number of entries in a leaf
  Threshold T: the diameter of a leaf is less than T
[Figure: CF tree with B = 7 and L = 6. The root and nonleaf nodes hold CF entries, each with a child pointer; leaf nodes hold CF entries and are linked by prev and next pointers]
52 Characteristics of Data
High dimensionality
Size of data set
Sparsity of attribute values
Noise and outliers
Types of attributes and type of data sets
Differences in attribute scale
Properties of the data space
  Can you define a meaningful centroid?
53 Characteristics of Clusters
Data distribution
Shape
Differing sizes
Differing densities
Poor separation
Relationship of clusters
Subspace clusters
54 Characteristics of Clustering Algorithms
Order dependence
Nondeterminism
Parameter selection
Scalability
Underlying model
Optimization-based approach
Territorial Analysis for Ratemaking by Philip Begher, Dario Biasini, Filip Branitchev, David Graham, Erik McCracken, Rachel Rogers and Alex Takacs Department of Statistics and Applied Probability University
More informationAuthors. Data Clustering: Algorithms and Applications
Authors Data Clustering: Algorithms and Applications 2 Contents 1 Gridbased Clustering 1 Wei Cheng, Wei Wang, and Sandra Batista 1.1 Introduction................................... 1 1.2 The Classical
More informationUnsupervised learning: Clustering
Unsupervised learning: Clustering Salissou Moutari Centre for Statistical Science and Operational Research CenSSOR 17 th September 2013 Unsupervised learning: Clustering 1/52 Outline 1 Introduction What
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationMachine Learning and Data Mining. Clustering. (adapted from) Prof. Alexander Ihler
Machine Learning and Data Mining Clustering (adapted from) Prof. Alexander Ihler Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand
More informationA comparison of various clustering methods and algorithms in data mining
Volume :2, Issue :5, 3236 May 2015 www.allsubjectjournal.com eissn: 23494182 pissn: 23495979 Impact Factor: 3.762 R.Tamilselvi B.Sivasakthi R.Kavitha Assistant Professor A comparison of various clustering
More informationMedical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu
Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?
More informationData visualization and clustering. Genomics is to no small extend a data science
Data visualization and clustering Genomics is to no small extend a data science [www.data2discovery.org] Data visualization and clustering Genomics is to no small extend a data science [Andersson et al.,
More informationClustering and Cluster Evaluation. Josh Stuart Tuesday, Feb 24, 2004 Read chap 4 in Causton
Clustering and Cluster Evaluation Josh Stuart Tuesday, Feb 24, 2004 Read chap 4 in Causton Clustering Methods Agglomerative Start with all separate, end with some connected Partitioning / Divisive Start
More informationPart 2: Community Detection
Chapter 8: Graph Data Part 2: Community Detection Based on Leskovec, Rajaraman, Ullman 2014: Mining of Massive Datasets Big Data Management and Analytics Outline Community Detection  Social networks 
More informationText Clustering. Clustering
Text Clustering 1 Clustering Partition unlabeled examples into disoint subsets of clusters, such that: Examples within a cluster are very similar Examples in different clusters are very different Discover
More informationCSE 494 CSE/CBS 598 (Fall 2007): Numerical Linear Algebra for Data Exploration Clustering Instructor: Jieping Ye
CSE 494 CSE/CBS 598 Fall 2007: Numerical Linear Algebra for Data Exploration Clustering Instructor: Jieping Ye 1 Introduction One important method for data compression and classification is to organize
More informationMaking SVMs Scalable to Large Data Sets using Hierarchical Cluster Indexing
SUBMISSION TO DATA MINING AND KNOWLEDGE DISCOVERY: AN INTERNATIONAL JOURNAL, MAY. 2005 100 Making SVMs Scalable to Large Data Sets using Hierarchical Cluster Indexing Hwanjo Yu, Jiong Yang, Jiawei Han,
More informationClustering in Ratemaking: Applications in Territories Clustering
Clustering in Ratemaking: Applications in Territories Clustering Ji Yao, Ph.D. Abstract: Clustering methods are briefly reviewed and their applications in insurance ratemaking are discussed in this paper.
More informationClustering. Chapter 7. 7.1 Introduction to Clustering Techniques. 7.1.1 Points, Spaces, and Distances
240 Chapter 7 Clustering Clustering is the process of examining a collection of points, and grouping the points into clusters according to some distance measure. The goal is that points in the same cluster
More informationData Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland
Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data
More informationOUTLIER ANALYSIS. Data Mining 1
OUTLIER ANALYSIS Data Mining 1 What Are Outliers? Outlier: A data object that deviates significantly from the normal objects as if it were generated by a different mechanism Ex.: Unusual credit card purchase,
More informationA Comparative Analysis of Various Clustering Techniques used for Very Large Datasets
A Comparative Analysis of Various Clustering Techniques used for Very Large Datasets Preeti Baser, Assistant Professor, SJPIBMCA, Gandhinagar, Gujarat, India 382 007 Research Scholar, R. K. University,
More informationA Survey of Kernel Clustering Methods
A Survey of Kernel Clustering Methods Maurizio Filippone, Francesco Camastra, Francesco Masulli and Stefano Rovetta Presented by: Kedar Grama Outline Unsupervised Learning and Clustering Types of clustering
More informationClustering and Data Mining in R
Clustering and Data Mining in R Workshop Supplement Thomas Girke December 10, 2011 Introduction Data Preprocessing Data Transformations Distance Methods Cluster Linkage Hierarchical Clustering Approaches
More informationForschungskolleg Data Analytics Methods and Techniques
Forschungskolleg Data Analytics Methods and Techniques Martin Hahmann, Gunnar Schröder, Phillip Grosse Prof. Dr.Ing. Wolfgang Lehner Why do we need it? We are drowning in data, but starving for knowledge!
More informationData Preprocessing. Week 2
Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.
More informationMachine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
More informationChapter 4: NonParametric Classification
Chapter 4: NonParametric Classification Introduction Density Estimation Parzen Windows KnNearest Neighbor Density Estimation KNearest Neighbor (KNN) Decision Rule Gaussian Mixture Model A weighted combination
More informationFlat Clustering KMeans Algorithm
Flat Clustering KMeans Algorithm 1. Purpose. Clustering algorithms group a set of documents into subsets or clusters. The cluster algorithms goal is to create clusters that are coherent internally, but
More informationThere are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:
Statistics: Rosie Cornish. 2007. 3.1 Cluster Analysis 1 Introduction This handout is designed to provide only a brief introduction to cluster analysis and how it is done. Books giving further details are
More informationOn Data Clustering Analysis: Scalability, Constraints and Validation
On Data Clustering Analysis: Scalability, Constraints and Validation Osmar R. Zaïane, Andrew Foss, ChiHoon Lee, and Weinan Wang University of Alberta, Edmonton, Alberta, Canada Summary. Clustering is
More informationOn Density Based Transforms for Uncertain Data Mining
On Density Based Transforms for Uncertain Data Mining Charu C. Aggarwal IBM T. J. Watson Research Center 19 Skyline Drive, Hawthorne, NY 10532 charu@us.ibm.com Abstract In spite of the great progress in
More informationA TwoStep Method for Clustering Mixed Categroical and Numeric Data
Tamkang Journal of Science and Engineering, Vol. 13, No. 1, pp. 11 19 (2010) 11 A TwoStep Method for Clustering Mixed Categroical and Numeric Data MingYi Shih*, JarWen Jheng and LienFu Lai Department
More informationA Study on the Hierarchical Data Clustering Algorithm Based on Gravity Theory
A Study on the Hierarchical Data Clustering Algorithm Based on Gravity Theory YenJen Oyang, ChienYu Chen, and TsuiWei Yang Department of Computer Science and Information Engineering National Taiwan
More informationRtrees. RTrees: A Dynamic Index Structure For Spatial Searching. RTree. Invariants
RTrees: A Dynamic Index Structure For Spatial Searching A. Guttman Rtrees Generalization of B+trees to higher dimensions Diskbased index structure Occupancy guarantee Multiple search paths Insertions
More information