# Clustering Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Save this PDF as:

Size: px
Start display at page:

Download "Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is"

## Transcription

1 Clustering Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv Bar-Joseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among objects. Why do we want to do that? Any REAL application? 1

2 Example: clusty Example: clustering genes Microarrays measures the activities of all genes in different conditions Clustering genes provide a lot of information about the genes An early killer application in this area The most cited (7,812) paper in PNAS! 2

3 Why clustering? Organizing data into clusters shows internal structure of the data Ex. Clusty and clustering genes above Sometimes the partitioning is the goal Ex. Market segmentation Prepare for other AI techniques Ex. Summarize news (cluster and then find centroid) Techniques for clustering is useful in knowledge discovery in data Ex. Underlying rules, reoccurring patterns, topics, etc. Outline Motivation Distance measure Hierarchical clustering Partitional clustering K-means Gaussian Mixture Models Number of clusters 3

4 What is a natural grouping among these objects? What is a natural grouping among these objects? Clustering is subjective Simpson's Family School Employees Females Males 4

5 What is Similarity? The quality or state of being similar; likeness; resemblance; as, a similarity of features. Webster's Dictionary Similarity is hard to define, but We know it when we see it The real meaning of similarity is a philosophical question. We will take a more pragmatic approach. Defining Distance Measures Definition: Let O 1 and O 2 be two objects from the universe of possible objects. The distance (dissimilarity) between O 1 and O 2 is a real number denoted by D(O 1,O 2 ) gene1 gene

6 d('', s) of s2+ch2) 0 s2) + else 1 s = ) + '') d(s1+ch1, s if 1, = 1 --i.e. ch1=ch2 d(s1, 0 fi, = d(s, min( d(s1+ch1, length s2+ch2) '') d(s1, then = d('', gene1 gene2 Inside these black boxes: some function on two variables (might be simple or very complex) 3 What properties should a distance measure have? D(A,B) = D(B,A) D(A,B) = 0 iif A = B D(A,B) 0 D(A,B) D(A,C) + D(B,C) Symmetry Constancy of Self-Similarity Positivity Triangular Inequality Outline Motivation Distance measure Hierarchical clustering Partitional clustering K-means Gaussian Mixture Models Number of clusters 6

7 Two Types of Clustering Partitional algorithms: Construct various partitions and then evaluate them by some criterion (we will see an example called BIRCH) Hierarchical algorithms: Create a hierarchical decomposition of the set of objects using some criterion Hierarchical Partitional Desirable Properties of a Clustering Algorithm Scalability (in terms of both time and space) Ability to deal with different data types Minimal requirements for domain knowledge to determine input parameters Able to deal with noise and outliers Insensitive to order of input records Incorporation of user-specified constraints Interpretability and usability 7

8 A Useful Tool for Summarizing Similarity Measurements In order to better appreciate and evaluate the examples given in the early part of this talk, we will now introduce the dendrogram. Terminal Branch Root Internal Node Leaf Internal Branch The similarity between two objects in a dendrogram is represented as the height of the lowest internal node they share. Note that hierarchies are commonly used to organize information, for example in a web portal. Yahoo s hierarchy is manually created, we will focus on automatic creation of hierarchies in data mining. Business & Economy B2B Finance Shopping Jobs Aerospace Agriculture Banking Bonds Animals Apparel Career Workspace 8

9 We can look at the dendrogram to determine the correct number of clusters. In this case, the two highly separated subtrees are highly suggestive of two clusters. (Things are rarely this clear cut, unfortunately) One potential use of a dendrogram is to detect outliers The single isolated branch is suggestive of a data point that is very different to all others Outlier 9

10 (How-to) Hierarchical Clustering The number of dendrograms with n leafs = (2n -3)!/[(2 (n -2) ) (n -2)!] Number Number of Possible of Leafs Dendrograms ,459,425 Since we cannot test all possible trees we will have to heuristic search of all possible trees. We could do this.. Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together. Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides. We begin with a distance matrix which contains the distances between every pair of objects in our database D(, ) = 8 D(, ) =

11 Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together. Consider all possible merges Choose the best Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together. Consider all possible merges Choose the best Consider all possible merges Choose the best 11

12 Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together. Consider all possible merges Choose the best Consider all possible merges Choose the best Consider all possible merges Choose the best Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together. Consider all possible merges Choose the best Consider all possible merges Choose the best Consider all possible merges Choose the best 12

13 Similarity criteria: Single Link cluster similarity = similarity of two most similar members - Potentially long and skinny clusters Hierarchical: Complete Link cluster similarity = similarity of two least similar members + tight clusters 13

14 Hierarchical: Average Link cluster similarity = average similarity of all pairs the most widely used similarity measure Robust against noise Single linkage Average linkage 14

15 Summary of Hierarchal Clustering Methods No need to specify the number of clusters in advance. Hierarchical structure maps nicely onto human intuition for some domains They do not scale well: time complexity of at least O(n 2 ), where n is the number of total objects. Like any heuristic search algorithms, local optima are a problem. Interpretation of results is (very) subjective. Partitional Clustering Nonhierarchical, each instance is placed in exactly one of K non-overlapping clusters. Since only one set of clusters is output, the user normally has to input the desired number of clusters K. 15

16 K-means Clustering: Initialization Decide K, and initialize K centers (randomly) k 1 2 k k 3 K-means Clustering: Iteration 1 Assign all objects to the nearest center. 5 Move a center to the mean of its members. 4 3 k 1 2 k k

17 K-means Clustering: Iteration 2 After moving centers, re-assign the objects 5 4 k k 2 k K-means Clustering: Iteration 2 After moving centers, re-assign the objects to nearest centers. Move a center to the mean of its new members. 5 4 k k 2 k

18 K-means Clustering: Finished! Re-assign and move centers, until 5 no objects changed membership. 4 k k 2 k Algorithm k-means 1. Decide on a value for K, the number of clusters. 2. Initialize the K cluster centers (randomly, if necessary). 3. Decide the class memberships of the N objects by assigning them to the nearest cluster center. 4. Re-estimate the K cluster centers, by assuming the memberships found above are correct. 5. Repeat 3 and 4 until none of the N objects changed membership in the last iteration. 18

19 Why K-means Works What is a good partition? High intra-cluster similarity K-means optimizes the average distance to members of the same cluster K n 1 k n k n k = 1 k i= 1 j= 1 which is twice the total distance to centers, also called squared error x ki x kj K n k se = x µ k = 1 i= 1 ki k Comments on K-Means Strength Simple, easy to implement and debug Intuitive objective function: optimizes intra-cluster similarity Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n. Weakness Applicable only when mean is defined, then what about categorical data? Often terminates at a local optimum. Initialization is important. Need to specify K, the number of clusters, in advance Unable to handle noisy data and outliers Not suitable to discover clusters with non-convex shapes Summary Assign members based on current centers Re-estimate centers based on current assignment 19

20 Outline Motivation Distance measure Hierarchical clustering Partitional clustering K-means Gaussian Mixture Models Number of clusters Gaussian Gaussian Mixture Models 2 1 ( x µ ) P( x) = ϕ ( x; µ, σ) = exp( ) 2 σ 2π 2σ ex. height of one population Gaussian Mixture P( C = i) = ω, P( x C = i) = ϕ ( x; µ, σ ) K i i i P( x) = P( x, C = i) = P( C = i) P( x C = i) = ωϕ ( x; µ, σ ) i= 1 i= 1 ex. height of two population K i i i 20

21 Gaussian Mixture Models Mixture of Multivariate Gaussian P( C = k) = ω, P( x C = i) = ϕ ( x; µ, Σ ) i i i ex. y-axis is blood pressure and x-axis is age GMM+EM = Soft K-means Decide the number of clusters, K Initialize parameters (randomly) E-step: assign probabilistic membership M-step: re-estimate parameters based on probabilistic membership Repeat until change in parameters are smaller than a threshold See R&N for details 21

22 Iteration 1 The cluster means are randomly assigned 22

23 Iteration 2 Iteration 5 23

24 Iteration 25 Strength of Gaussian Mixture Models Interpretability: learns a generative model of each cluster you can generate new data based on the learned model Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n. Intuitive (?) objective function: optimizes data likelihood Extensible to other mixture models for other data types e.g. mixture of multinomial for categorical data maximization instead of mean sensitivity to noise and outliers depend on the distribution 24

25 Weakness of Gaussian Mixture Models Often terminates at a local optimum. Initialization is important. Need to specify K, the number of clusters, in advance Not suitable to discover clusters with non-convex shapes Summary To learn Gaussian mixture, assign probabilistic membership based on current parameters, and reestimate parameters based on current membership Clustering methods: Comparison Hierarchical K-means GMM Running time naively, O(N 3 ) fastest (each iteration is linear) fast (each iteration is linear) Assumptions requires a similarity / distance measure strong assumptions strongest assumptions Input parameters none K (number of clusters) K (number of clusters) Clusters subjective (only a tree is returned) exactly K clusters exactly K clusters 25

26 Outline Motivation Distance measure Hierarchical clustering Partitional clustering K-means Gaussian Mixture Models Number of clusters How can we tell the right number of clusters? In general, this is a unsolved problem. However there are many approximate methods. In the next few slides we will see an example

27 When k = 1, the objective function is When k = 2, the objective function is

28 When k = 3, the objective function is We can plot the objective function values for k equals 1 to 6 The abrupt change at k = 2, is highly suggestive of two clusters in the data. This technique for determining the number of clusters is known as knee finding or elbow finding. Objective Function 1.00E E E E E E E E E E E+00 k Note that the results are not always as clear cut as in this toy example 28

29 What you should know Why is clustering useful What are the different types of clustering algorithms What are the assumptions we are making for each, and what can we get from them Unsolved issues: number of clusters, initialization, etc. Acknowledgement: modified from excellent slides of Eamonn Keogh, Ziv Bar-Joseph, Andrew Moore, and others. 29

### Unsupervised Learning. What is clustering for? What is clustering for? (cont )

Unsupervised Learning c: Artificial Intelligence The University of Iowa Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

### 10-810 /02-710 Computational Genomics. Clustering expression data

10-810 /02-710 Computational Genomics Clustering expression data What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally,

### Clustering UE 141 Spring 2013

Clustering UE 141 Spring 013 Jing Gao SUNY Buffalo 1 Definition of Clustering Finding groups of obects such that the obects in a group will be similar (or related) to one another and different from (or

### Cluster Analysis. Isabel M. Rodrigues. Lisboa, 2014. Instituto Superior Técnico

Instituto Superior Técnico Lisboa, 2014 Introduction: Cluster analysis What is? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from

### Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

### Chapter 7. Cluster Analysis

Chapter 7. Cluster Analysis. What is Cluster Analysis?. A Categorization of Major Clustering Methods. Partitioning Methods. Hierarchical Methods 5. Density-Based Methods 6. Grid-Based Methods 7. Model-Based

### Clustering Hierarchical clustering and k-mean clustering

Clustering Hierarchical clustering and k-mean clustering Genome 373 Genomic Informatics Elhanan Borenstein The clustering problem: A quick review partition genes into distinct sets with high homogeneity

### Unsupervised learning: Clustering

Unsupervised learning: Clustering Salissou Moutari Centre for Statistical Science and Operational Research CenSSOR 17 th September 2013 Unsupervised learning: Clustering 1/52 Outline 1 Introduction What

### Data Mining Cluster Analysis: Advanced Concepts and Algorithms. ref. Chapter 9. Introduction to Data Mining

Data Mining Cluster Analysis: Advanced Concepts and Algorithms ref. Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar 1 Outline Prototype-based Fuzzy c-means Mixture Model Clustering Density-based

### A Micro-course on Cluster Analysis Techniques

Università di Trento DiSCoF Dipartimento di Scienze della Cognizione e Formazione A Micro-course on Cluster Analysis Techniques luigi.lombardi@unitn.it Methodological course (COBRAS-DiSCoF) Finding groups

### Robotics 2 Clustering & EM. Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard

Robotics 2 Clustering & EM Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard 1 Clustering (1) Common technique for statistical data analysis to detect structure (machine learning,

### Text Clustering. Clustering

Text Clustering 1 Clustering Partition unlabeled examples into disoint subsets of clusters, such that: Examples within a cluster are very similar Examples in different clusters are very different Discover

### Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 10 th, 2013 Wolf-Tilo Balke and Kinda El Maarry Institut für Informationssysteme Technische Universität Braunschweig

### Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will

### Neural Networks Lesson 5 - Cluster Analysis

Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm michele.scarpiniti@uniroma1.it Rome, 29

### DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

### Fig. 1 A typical Knowledge Discovery process [2]

Volume 4, Issue 7, July 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Review on Clustering

### Data Warehousing and Data Mining

Data Warehousing and Data Mining Lecture Clustering Methods CITS CITS Wei Liu School of Computer Science and Software Engineering Faculty of Engineering, Computing and Mathematics Acknowledgement: The

### Distance based clustering

// Distance based clustering Chapter ² ² Clustering Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 99). What is a cluster? Group of objects separated from other clusters Means

### There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:

Statistics: Rosie Cornish. 2007. 3.1 Cluster Analysis 1 Introduction This handout is designed to provide only a brief introduction to cluster analysis and how it is done. Books giving further details are

### STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table

### UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi-110 012 dcmishra@iasri.res.in What is Learning? "Learning denotes changes in a system that enable

### ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

### Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distance-based K-means, K-medoids,

### Machine Learning and Data Mining. Clustering. (adapted from) Prof. Alexander Ihler

Machine Learning and Data Mining Clustering (adapted from) Prof. Alexander Ihler Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand

### CLUSTERING (Segmentation)

CLUSTERING (Segmentation) Dr. Saed Sayad University of Toronto 2010 saed.sayad@utoronto.ca http://chem-eng.utoronto.ca/~datamining/ 1 What is Clustering? Given a set of records, organize the records into

### 6. If there is no improvement of the categories after several steps, then choose new seeds using another criterion (e.g. the objects near the edge of

Clustering Clustering is an unsupervised learning method: there is no target value (class label) to be predicted, the goal is finding common patterns or grouping similar examples. Differences between models/algorithms

### Cluster Analysis: Basic Concepts and Algorithms

Cluster Analsis: Basic Concepts and Algorithms What does it mean clustering? Applications Tpes of clustering K-means Intuition Algorithm Choosing initial centroids Bisecting K-means Post-processing Strengths

### Cluster Analysis: Basic Concepts and Methods

10 Cluster Analysis: Basic Concepts and Methods Imagine that you are the Director of Customer Relationships at AllElectronics, and you have five managers working for you. You would like to organize all

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?

### Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009

Cluster Analysis Alison Merikangas Data Analysis Seminar 18 November 2009 Overview What is cluster analysis? Types of cluster Distance functions Clustering methods Agglomerative K-means Density-based Interpretation

### . Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

Outline Part 1: of data clustering Non-Supervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical

### Lecture 20: Clustering

Lecture 20: Clustering Wrap-up of neural nets (from last lecture Introduction to unsupervised learning K-means clustering COMP-424, Lecture 20 - April 3, 2013 1 Unsupervised learning In supervised learning,

### Data Clustering. Dec 2nd, 2013 Kyrylo Bessonov

Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

### Distances, Clustering, and Classification. Heatmaps

Distances, Clustering, and Classification Heatmaps 1 Distance Clustering organizes things that are close into groups What does it mean for two genes to be close? What does it mean for two samples to be

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining /8/ What is Cluster

### The SPSS TwoStep Cluster Component

White paper technical report The SPSS TwoStep Cluster Component A scalable component enabling more efficient customer segmentation Introduction The SPSS TwoStep Clustering Component is a scalable cluster

### Clustering - example. Given some data x i X Find a partitioning of the data into k disjunctive clusters Example: k-means clustering

Clustering - example Graph Mining and Graph Kernels Given some data x i X Find a partitioning of the data into k disjunctive clusters Example: k-means clustering x!!!!8!!! 8 x 1 1 Clustering - example

### Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in

### BIRCH: An Efficient Data Clustering Method For Very Large Databases

BIRCH: An Efficient Data Clustering Method For Very Large Databases Tian Zhang, Raghu Ramakrishnan, Miron Livny CPSC 504 Presenter: Discussion Leader: Sophia (Xueyao) Liang HelenJr, Birches. Online Image.

### Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

### L15: statistical clustering

Similarity measures Criterion functions Cluster validity Flat clustering algorithms k-means ISODATA L15: statistical clustering Hierarchical clustering algorithms Divisive Agglomerative CSCE 666 Pattern

### MODULE 15 Clustering Large Datasets LESSON 34

MODULE 15 Clustering Large Datasets LESSON 34 Incremental Clustering Keywords: Single Database Scan, Leader, BIRCH, Tree 1 Clustering Large Datasets Pattern matrix It is convenient to view the input data

### Mixture Models. Jia Li. Department of Statistics The Pennsylvania State University. Mixture Models

Mixture Models Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Clustering by Mixture Models General bacground on clustering Example method: -means Mixture model based

Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means

### CHAPTER 3 DATA MINING AND CLUSTERING

CHAPTER 3 DATA MINING AND CLUSTERING 3.1 Introduction Nowadays, large quantities of data are being accumulated. The amount of data collected is said to be almost doubled every 9 months. Seeking knowledge

### A comparison of various clustering methods and algorithms in data mining

Volume :2, Issue :5, 32-36 May 2015 www.allsubjectjournal.com e-issn: 2349-4182 p-issn: 2349-5979 Impact Factor: 3.762 R.Tamilselvi B.Sivasakthi R.Kavitha Assistant Professor A comparison of various clustering

### Data visualization and clustering. Genomics is to no small extend a data science

Data visualization and clustering Genomics is to no small extend a data science [www.data2discovery.org] Data visualization and clustering Genomics is to no small extend a data science [Andersson et al.,

### Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Overview Prognostic Models and Data Mining in Medicine, part I Cluster Analsis What is Cluster Analsis? K-Means Clustering Hierarchical Clustering Cluster Validit Eample: Microarra data analsis 6 Summar

### Clustering and Data Mining in R

Clustering and Data Mining in R Workshop Supplement Thomas Girke December 10, 2011 Introduction Data Preprocessing Data Transformations Distance Methods Cluster Linkage Hierarchical Clustering Approaches

### Flat Clustering K-Means Algorithm

Flat Clustering K-Means Algorithm 1. Purpose. Clustering algorithms group a set of documents into subsets or clusters. The cluster algorithms goal is to create clusters that are coherent internally, but

### Important Characteristics of Cluster Analysis Techniques

Cluster Analysis Can we organize sampling entities into discrete classes, such that within-group similarity is maximized and amonggroup similarity is minimized according to some objective criterion? Sites

### Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences

### ANALYTICAL TECHNIQUES FOR DATA VISUALIZATION

ANALYTICAL TECHNIQUES FOR DATA VISUALIZATION CSE 537 Ar@ficial Intelligence Professor Anita Wasilewska GROUP 2 TEAM MEMBERS: SAEED BOOR BOOR - 110564337 SHIH- YU TSAI - 110385129 HAN LI 110168054 SOURCES

### Clustering Techniques

Clustering Techniques Marco BOTTA Dipartimento di Informatica Università di Torino botta@di.unito.it www.di.unito.it/~botta/didattica/clustering.html Data Clustering Outline What is cluster analysis? What

### K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

K-Means Cluster Analsis Chapter 3 PPDM Class Tan,Steinbach, Kumar Introduction to Data Mining 4/18/4 1 What is Cluster Analsis? Finding groups of objects such that the objects in a group will be similar

### Unsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning

Unsupervised Learning and Data Mining Unsupervised Learning and Data Mining Clustering Decision trees Artificial neural nets K-nearest neighbor Support vectors Linear regression Logistic regression...

### Data Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering

Data Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering Clustering Algorithms Contents K-means Hierarchical algorithms Linkage functions Vector quantization Clustering Formulation Objects.................................

### ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications

### Data Mining Techniques Chapter 11: Automatic Cluster Detection

Data Mining Techniques Chapter 11: Automatic Cluster Detection Clustering............................................................. 2 k-means Clustering......................................................

### CLASSIFICATION AND CLUSTERING. Anveshi Charuvaka

CLASSIFICATION AND CLUSTERING Anveshi Charuvaka Learning from Data Classification Regression Clustering Anomaly Detection Contrast Set Mining Classification: Definition Given a collection of records (training

### Data Mining Practical Machine Learning Tools and Techniques

Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

### Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether

### Extensive Survey on Hierarchical Clustering Methods in Data Mining

Extensive Survey on Hierarchical Clustering Methods in Data Mining Dipak P Dabhi 1, Mihir R Patel 2 1Dipak P Dabhi Assistant Professor, Computer and Information Technology (C.E & I.T), C.G.P.I.T, Gujarat,

### Chapter 4 Data Mining A Short Introduction

Chapter 4 Data Mining A Short Introduction Data Mining - 1 1 Today's Question 1. Data Mining Overview 2. Association Rule Mining 3. Clustering 4. Classification Data Mining - 2 2 3. Clustering - Descriptive

### EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

### Chapter ML:XI (continued)

Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

### Clustering & Association

Clustering - Overview What is cluster analysis? Grouping data objects based only on information found in the data describing these objects and their relationships Maximize the similarity within objects

### Data Mining Clustering. Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering Toon Calders Sheets are based on the those provided b Tan, Steinbach, and Kumar. Introduction to Data Mining What is Cluster Analsis? Finding groups of objects such that the objects

### Data Preprocessing. Week 2

Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

### Clustering: Techniques & Applications. Nguyen Sinh Hoa, Nguyen Hung Son. 15 lutego 2006 Clustering 1

Clustering: Techniques & Applications Nguyen Sinh Hoa, Nguyen Hung Son 15 lutego 2006 Clustering 1 Agenda Introduction Clustering Methods Applications: Outlier Analysis Gene clustering Summary and Conclusions

### Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis

Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis Abdun Mahmood, Christopher Leckie, Parampalli Udaya Department of Computer Science and Software Engineering University of

### K-Means Clustering. Clustering and Classification Lecture 8

K-Means Clustering Clustering and Lecture 8 Today s Class K-means clustering: What it is How it works What it assumes Pitfalls of the method (locally optimal results) 2 From Last Time If you recall the

### SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

### DATA CLUSTERING USING MAPREDUCE

DATA CLUSTERING USING MAPREDUCE by Makho Ngazimbi A project submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science Boise State University March 2009

### An Enhanced Clustering Algorithm to Analyze Spatial Data

International Journal of Engineering and Technical Research (IJETR) ISSN: 2321-0869, Volume-2, Issue-7, July 2014 An Enhanced Clustering Algorithm to Analyze Spatial Data Dr. Mahesh Kumar, Mr. Sachin Yadav

### Machine Learning using MapReduce

Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

### Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

### Cluster Analysis using R

Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other

### An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

### Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Data Mining Cluster Analysis: Advanced Concepts and Algorithms Lecture Notes for Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004

### Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental

### Comparison and Analysis of Various Clustering Methods in Data mining On Education data set Using the weak tool

Comparison and Analysis of Various Clustering Metho in Data mining On Education data set Using the weak tool Abstract:- Data mining is used to find the hidden information pattern and relationship between

### A K-means-like Algorithm for K-medoids Clustering and Its Performance

A K-means-like Algorithm for K-medoids Clustering and Its Performance Hae-Sang Park*, Jong-Seok Lee and Chi-Hyuck Jun Department of Industrial and Management Engineering, POSTECH San 31 Hyoja-dong, Pohang

### Clustering & Visualization

Chapter 5 Clustering & Visualization Clustering in high-dimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to high-dimensional data.

### Clustering Techniques: A Brief Survey of Different Clustering Algorithms

Clustering Techniques: A Brief Survey of Different Clustering Algorithms Deepti Sisodia Technocrates Institute of Technology, Bhopal, India Lokesh Singh Technocrates Institute of Technology, Bhopal, India

### International Journal of Advance Research in Computer Science and Management Studies

Volume 2, Issue 12, December 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

### Hierarchical Cluster Analysis Some Basics and Algorithms

Hierarchical Cluster Analysis Some Basics and Algorithms Nethra Sambamoorthi CRMportals Inc., 11 Bartram Road, Englishtown, NJ 07726 (NOTE: Please use always the latest copy of the document. Click on this

### Machine Learning for NLP

Natural Language Processing SoSe 2015 Machine Learning for NLP Dr. Mariana Neves May 4th, 2015 (based on the slides of Dr. Saeedeh Momtazi) Introduction Field of study that gives computers the ability

### Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

### Data Mining for Knowledge Management. Classification

1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

### Standardization and Its Effects on K-Means Clustering Algorithm

Research Journal of Applied Sciences, Engineering and Technology 6(7): 399-3303, 03 ISSN: 040-7459; e-issn: 040-7467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03

### Hierarchical Clustering. Clustering Overview

lustering Overview Last lecture What is clustering Partitional algorithms: K-means Today s lecture Hierarchical algorithms ensity-based algorithms: SN Techniques for clustering large databases IRH UR ata

### Visualizing non-hierarchical and hierarchical cluster analyses with clustergrams

Visualizing non-hierarchical and hierarchical cluster analyses with clustergrams Matthias Schonlau RAND 7 Main Street Santa Monica, CA 947 USA Summary In hierarchical cluster analysis dendrogram graphs