Clustering. Clustering. What is Clustering? What is Clustering? What is Clustering? Types of Data in Cluster Analysis

Size: px
Start display at page:

Download "Clustering. Clustering. What is Clustering? What is Clustering? What is Clustering? Types of Data in Cluster Analysis"

Transcription

1 What is Clustering? Clustering Tpes of Data in Cluster Analsis Clustering A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods What is Clustering? Clustering of data is a method b which large sets of data are grouped into clusters of smaller sets of similar data. Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Clustering is unsupervised classification: no predefined classes Tpical applications What is Clustering? As a stand-alone tool to get insight into data distribution As a preprocessing step for other algorithms Use cluster detection when ou suspect that there are natural groupings that ma represent groups of customers or products that have lot in common. When there are man competing patterns in the data, making it hard to spot a single pattern, creating clusters of similar records reduces the compleit within clusters so that other data mining techniques are more likel to succeed.

2 Eamples of Clustering Applications Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs Land use: Identification of areas of similar land use in an earth observation database Insurance: Identifing groups of motor insurance polic holders with a high average claim cost Cit-planning: Identifing groups of houses according to their house tpe, value, and geographical location Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults Clustering definition Given a set of data points, each having a set of attributes, and a similarit measure among them, find clusters such that: data points in one cluster are more similar to one another (high intra-class similarit) data points in separate clusters are less similar to one another (low inter-class similarit ) Similarit measures: e.g. Euclidean distance if attributes are continuous. Requirements of Clustering in Data Mining Notion of a Cluster is Ambiguous Scalabilit Abilit to deal with different tpes of attributes Discover of clusters with arbitrar shape Minimal requirements for domain knowledge to determine input parameters Able to deal with noise and outliers Insensitive to order of input records High dimensionalit Incorporation of user-specified constraints Interpretabilit and usabilit Initial points. Two Clusters Si Clusters Four Clusters

3 Clustering What is Cluster Analsis? Tpes of Data in Cluster Analsis A Categorization of Major Clustering Methods Data Matri Represents n objects with p variables (attributes, measures) A relational table Partitioning Methods Hierarchical Methods M i M n L M L M L f M if M nf L M L M L p M ip M np Dissimilarit Matri Proimities of pairs of objects d(i,j): dissimilarit between objects i and j Nonnegative Close to : similar d(,) d(,) d(,) M M M d(n,) d(n,) L L Tpe of data in clustering analsis Continuous variables Binar variables Nominal and ordinal variables Variables of mied tpes

4 Continuous variables To avoid dependence on the choice of measurement units the data should be standardized. Standardize data Calculate the mean absolute deviation: s f = n ( f m + m m ) f f f nf f where m =... ) n ( + + f f f + nf Calculate the standardized measurement (z-score) z if = if m f s f Using mean absolute deviation is more robust than using standard deviation. Since the deviations are not squared the effect of outliers is somewhat reduced but their z-scores do not become to small; therefore, the outliers remain detectable. Similarit/Dissimilarit Between Objects Distances are normall used to measure the similarit or dissimilarit between two data objects Euclidean distance is probabl the most commonl chosen tpe of distance. It is the geometric distance in the multidimensional space: Properties d(i,j) d(i,i) = d(i,j) = d(j,i) d(i, j) d(i,j) d(i,k) + d(k,j) = k p = ( ki ) kj Similarit/Dissimilarit Between Objects Similarit/Dissimilarit Between Objects Cit-block (Manhattan) distance. This distance is simpl the sum of differences across dimensions. In most cases, this distance measure ields results similar to the Euclidean distance. However, note that in this measure, the effect of single large differences (outliers) is dampened (since the are not squared). d(i, j) = i j + i j ip jp The properties stated for the Euclidean distance also hold for this measure. Minkowski distance. Sometimes one ma want to increase or decrease the progressive weight that is placed on dimensions on which the respective objects are ver different. This measure enables to accomplish that and is computed as: d(i, j) = i q j + i q j ip q q jp

5 Similarit/Dissimilarit Between Objects Binar Variables Binar variable has onl two states: or If we have some idea of the relative importance that should be assigned to each variable, then we can weight them and obtain a weighted distance measure. d(i, j) = w ( ) + L + i j w ( p ip ) jp A binar variable is smmetric if both of its states are equall valuable, that is, there is no preference on which outcome should be coded as. A binar variable is asmmetric if the outcome of the states are not equall important, such as positive or negative outcomes of a disease test. Similarit that is based on smmetric binar variables is called invariant similarit. Binar Variables A contingenc table for binar data Object i sum a c a+ c Object j b+ d c+ d Simple matching coefficient (invariant, if the binar variable is smmetric): d(i, j) = b + c a + b + c + d Jaccard coefficient (noninvariant if the binar variable is asmmetric): d(i, j) = b + c a + b + c b d sum a+ b p Dissimilarit between Binar Variables Eample Name Gender Fever Cough Test- Test- Test- Test- Jack M Y N P N N N Mar F Y N P N P N Jim M Y P N N N N gender is a smmetric attribute the remaining attributes are asmmetric binar let the values Y and P be set to, and the value N be set to Jaccard coefficient + d(jack, mar) = = d(jack, jim) = = d(jim, mar) = =. + +

6 Nominal Variables A generalization of the binar variable in that it can take more than states, e.g., red, ellow, blue, green Ordinal Variables On ordinal variables order is important e.g. Gold, Silver, Bronze Method : simple matching m: # of matches, p: total # of variables p d(i,j) = p m Method : use a large number of binar variables creating a new binar variable for each of the M nominal states Can be treated like continuous the ordered states define the ranking,...,m f replacing if b their rank r {,...,M } if f map the range of each variable onto [, ] b replacing i-th object in the f-th variable b z if r = if M f compute the dissimilarit using methods for continuous variables Variables of Mied Tpes A database ma contain several/all tpes of variables continuous, smmetric binar, asmmetric binar, nominal and ordinal. One ma use a weighted formula to combine their effects. What is Cluster Analsis? Clustering Tpes of Data in Cluster Analsis d(i, j) = p (f) (f) δ d ij ij f = p f = (f) δ ij δ ij = if if is missing or if = jf = and the variable f is asmmetric binar δ ij = otherwise continuous and ordinal variables dij: normalized absolute distance binar and nominal variables dij= if if = jf ; otherwise dij= A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods

7 Major Clustering Approaches Partitioning algorithms: Construct various partitions and then evaluate them b some criterion Hierarch algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion Densit-based: Based on connectivit and densit functions. Able to find clusters of arbitrar shape. Continues growing a cluster as long as the densit of points in the neighborhood eceeds a specified limit. Clustering What is Cluster Analsis? Tpes of Data in Cluster Analsis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods Grid-based: Based on a multiple-level granularit structure that forms a grid structure on which all operations are performed. Performance depends onl on the number of cells in the grid. Model-based: A model is hpothesized for each of the clusters and the idea is to find the best fit of that model to each other Partitioning Algorithms: Basic Concept The K-Means Clustering Method Partitioning method: Construct a partition of a database D of n objects into a set of k clusters Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion Global optimal: ehaustivel enumerate all partitions Heuristic methods: k-means and k-medoids algorithms k-means: Each cluster is represented b the center of the cluster k-medoids or PAM (Partition around medoids): Each cluster is represented b one of the objects in the cluster Given k, the k-means algorithm is implemented in steps:. Partition objects into k nonempt subsets. Compute centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.. Assign each object to the cluster with the nearest seed point.. Go back to Step ; stop when no more new assignment.

8 K-means clustering (k=) Comments on the K-Means Method Strengths & Weaknesses Relativel efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normall, k, t << n. Often terminates at a local optimum Applicable onl when mean is defined Need to specif k, the number of clusters, in advance Sensitive to noise and outliers as a small number of such points can influence the mean value Not suitable to discover clusters with non-conve shapes Importance of Choosing Initial Centroids Importance of Choosing Initial Centroids Iteration... Iteration. Iteration Iteration Iteration Iteration Iteration Iteration Iteration.. Iteration.. Iteration

9 The K-Medoids Clustering Method PAM (Partitioning Around Medoids) Find representative objects, called medoids, in clusters PAM (Partitioning Around Medoids, ) starts from an initial set of medoids and iterativel replaces one of the medoids b one of the non-medoids if it improves the total distance of the resulting clustering PAM works effectivel for small data sets, but does not scale well for large data sets CLARA (Kaufmann & Rousseeuw, ) CLARANS (Ng & Han, ): Randomized sampling PAM (Kaufman and Rousseeuw, ) Use real object to represent the cluster Select k representative objects arbitraril For each pair of non-selected object h and selected object i, calculate the total swapping cost TC ih Select pair of i and h which corresponds to the minimum TC ih If min.tc ih <, i is replaced b h Then assign each non-selected object to the most similar representative object Repeat steps - until there is no change PAM Clustering: Total swapping cost TC ih = j C jih CLARA (Clustering LARge Applications) i,t: medoids h: medoid candidate j: a point t j i h t j h i CLARA (Kaufmann and Rousseeuw in ) draws a sample of the dataset and applies PAM on the sample in order to find the medoids. If the sample is representative the medoids of the sample should approimate the medoids of the entire dataset. To improve the approimation, multiple samples are drawn and the C jih : swapping cost due to j C jih =d(j,h)-d(j,i) h j i t C jih = i h t j best clustering is returned as the output The clustering accurac is measured b the average dissimilarit of all objects in the entire dataset. Eperiments show that samples of size +k give satisfactor results C jih =d(j,t)-d(j,i) C jih =d(j,h)-d(j,t)

10 CLARA (Clustering LARge Applications) Strengths and Weaknesses: Deals with larger data sets than PAM Efficienc depends on the sample size A good clustering based on samples will not necessaril represent a good clustering of the whole data set if the sample is biased CLARANS ( Randomized CLARA) CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han ) The clustering process can be presented as searching a graph where ever node is a potential solution, that is, a set of k medoids Two nodes are neighbors if their sets differ b onl one medoid Each node can be assigned a cost that is defined to be the total dissimilarit between ever object and the medoid of its cluster The problem corresponds to search for a minimum on the graph At each step, all neighbors of current node are searched; the neighbor which corresponds to the deepest descent in cost is chosen as the net solution CLARANS ( Randomized CLARA) For large values of n and k, eamining k(n-k) neighbors is time consuming. At each step, CLARANS draws sample of neighbors to eamine. Note that CLARA draws a sample of nodes at the beginning of search; therefore, CLARANS has the benefit of not confining the search to a restricted area. If the local optimum is found, CLARANS starts with new randoml selected node in search for a new local optimum. The number of local optimums to search for is a parameter. Clustering What is Cluster Analsis? Tpes of Data in Cluster Analsis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods It is more efficient and scalable than both PAM and CLARA; returns higher qualit clusters.

11 Hierarchical Clustering Hierarchical Clustering Use distance matri as clustering criteria. These methods work b grouping data into a tree of clusters. There are two tpes of hierarchical clustering: Agglomerative: bottom-up strateg Divisive: top-down strateg Does not require the number of clusters as an input, but needs a termination condition, e. g., could be the desired number of clusters or a distance threshold for merging Step Step Step Step Step a a b b a b c d e c c d e d d e e Step Step Step Step Step agglomerative divisive Agglomerative hierarchical clustering Clustering result: dendrogram

12 Linkage rules () Single link (nearest neighbor). The distance between two clusters is determined b the distance of the two closest objects (nearest neighbors) in the different clusters. This rule will, in a sense, string objects together to form clusters, and the resulting clusters tend to represent long "chains." Complete link (furthest neighbor). The distances between clusters are determined b the greatest distance between an two objects in the different clusters (i.e., b the "furthest neighbors"). This method usuall performs quite well in cases when the objects actuall form naturall distinct "clumps." If the clusters tend to be somehow elongated or of a "chain" tpe nature, then this method is inappropriate. Linkage rules () Pair-group average. The distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters. This method is also ver efficient when the objects form natural distinct "clumps," however, it performs equall well with elongated, "chain" tpe clusters. Pair-group centroid. The distance between two clusters is determined as the distance between centroids. AGNES (Agglomerative Nesting) DIANA (Divisive Analsis) Use the Single-Link method and the dissimilarit matri. Repeatedl merge nodes that have the least dissimilarit, e.g. merge C and C is objects from C and C form the minimum Euclidean distance between an two objects from different clusters. Eventuall all nodes belong to the same cluster Introduced in Kaufmann and Rousseeuw () Inverse order of AGNES All objects are used to form one initial cluster The cluster is split according to some principle, e.g, the maimum Euclidean distance between the closest neighboring objects in different clusters Eventuall each node forms a cluster on its own

13 More on Hierarchical Clustering Major weakness of agglomerative clustering methods do not scale well: time compleit of at least O(n), where n is the number of total objects can never undo what was done previousl Integration of hierarchical with distance-based clustering BIRCH: uses CF-tree and incrementall adjusts the qualit of sub-clusters CURE: selects well-scattered points from the cluster and then shrinks them towards the center of the cluster b a specified fraction DBscan: Densit-based Alg. based on local connectivit and densit functions BIRCH algorithm BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, b Zhang, Ramakrishnan, Livn (SIGMOD ) A tree is built that captures needed information to perform clustering Introduces two new concepts Clustering Feature (contains info about a cluster) Clustering Feature Tree which are used to summarize cluster representation BIRCH - Clustering Feature Vector A clustering feature is a triplet summarizing information about sub-clusters of objects. It registers crucial measurements for computing clusters in a compact form BIRCH - Clustering Feature Tree A tree that stores the clustering features for hierarchical clustering Clustering Feature: CF = (N, LS, SS) N: Number of data points Linear sum LS: N i==x i CF = (, (,),(,)) Square Sum SS: N i==x i (,) (,) (,) (,) (,) B = Ma. no. of CF in a non-leaf node L = Ma. no. of CF in a leaf node

14 Notes BIRCH algorithm A Leaf node represents a cluster. A sub-cluster in a leaf node must have a diameter no greater than a given threshold T. A point is inserted into the leaf node (cluster) to which is closer. When one item is inserted into a cluster at the leaf node T (for the corresponding sub-cluster) must be satisfied. The corresponding CF must be updated. Incrementall construct a CF tree, a hierarchical data structure for multiphase clustering Phase : scan DB to build an initial in-memor CF tree If threshold condition is violated If there is room to insert Insert point as a single cluster If not Leaf node split: take two farthest CFs and create two leaf nodes, put the remaining CFs (including the new one) into the closest node Update CF for non-leafs. Insert new non-leaf entr into parent node We ma have to split the parent as well. Spilt the root increases tree height b one. If not Insert point into the closest cluster If there is no space on the node the node is split. Phase : use an arbitrar clustering algorithm to cluster the leaf nodes of the CF-tree Some Comments on Birch CURE (Clustering Using REpresentatives ) It can be shown that CF vectors can be stored and calculated incrementall and accuratel as clusters are merged Eperiments have shown that scales linearl with the number of objects. Finds a good clustering with a single scan and improves the qualit with a few additional scans Proposed b Guha, Rastogi & Shim, Uses multiple representative points to evaluate distance between clusters. Handles onl numeric data, and sensitive to the order of the data record. Better suited to find spherical clusters. Representative points are well-scattered objects for the cluster and are shrunk towards the centre of the cluster. (adjusts well to arbitrar shaped clusters; and avoids single-link effect) At each step, the two clusters with the closest pair of representative points are merged.

15 Cure: The Algorithm Partial Clustering Draw random sample s (to ensure data fits memor) Partition sample to p partitions with size s/p (to speed up algorithm) Partiall cluster partitions into s/pq clusters (using hierarchical alg.) Eliminate outliers If a cluster grows too slowl or if is ver small at the end, eliminate it. Cluster partial clusters. Label data (cluster the entire database using c representative points for each cluster) Each cluster is represented b c representative points The r. p. are chosen to be far from each other The r.p. are shrunk toward the mean (the centroid) of the cluster (for α = all r.p. are shrunk to the centroid) The two clusters with the closest pair of r.p. are merged to form a new cluster and new r.p. are chosen (Hierarchical clustering) CURE :Data Partitioning and Clustering s = p = s/p = s/pq = Cure: Shrinking Representative Points Further cluster the partial clustering Partitioning the sample data Partial clustering Shrink the multiple representative points towards the gravit center b a fraction of α. (helps dampen the effects of outliers) Multiple representatives capture the shape of the cluster

16 CURE DBSCAN algorithm Having several representative points per cluster allows CURE to adjust well to the geometr of nonspherical shapes. Shrinking the scattered points toward the mean b a factor of α gets rid of surface abnormalities and mitigates the effect of outliers. Results with large datasets indicate that CURE scales well. Time compleit is O (n lg n) Densit-based Alg: based on local connectivit and densit functions Major features: Discover clusters of arbitrar shape Handle noise One scan DBSCAN: Densit-Based Clustering DBSCAN: Densit Concepts () Densit: the minimum number of points within a certain distance of each other. Two parameters: Eps : Maimum radius of the neighborhood Clustering based on densit (local cluster criterion), such as densit-connected points MinPts : Minimum number of points in an Eps-neighborhood of that point Each cluster has a considerable higher densit of points than outside of the cluster Core Point: object with at least MinPts objects within a radius Eps-neighborhood

17 DBSCAN: Densit Concepts () DBSCAN: Densit Concepts () Directl Densit-Reachable: A point p is directl densitreachable from a point q with respect to Eps, MinPts if ) p belongs to NEps(q) ) core point condition: NEps (q) >= MinPts Densit-reachable: A point p is densit-reachable from a point q wrt. Eps, MinPts if there is a chain of points p,, pn, p = q, pn = p such that pi+ is directl densit-reachable from pi q p p (a DDR point needs to be close to a core point but it does not need to be a core point itself, if not it is a border point) q p MinPts = Eps = cm Densit-connected: A point p is densit-connected to a point q wrt. Eps, MinPts if there is a point o such that both, p and q are densit-reachable from o wrt. Eps and MinPts. p o q DBSCAN: Cluster definition DBSCAN: The Algorithm Cluster C For all p,q if p is in C, and q is densit reachable from p, then q is also in C For all p,q in C: p is densit connected to q A cluster is defined as a maimal set of densit-connected points A cluster has a core set of points ver close to a large number of other points (core points) and then some other points (border points) that are sufficientl close to at least one core point. Arbitrar select a point p If p is not a core point, no points are densit-reachable from p and DBSCAN visits the net point of the database. If p is a core point, a cluster is formed. Retrieve all points densit-reachable from p wrt Eps and MinPts. Continue the process until all of the points have been processed. (it is possible that a border point could belong to two clusters. Such point will be assigned to whichever cluster is generated first)

18 Comments on DBSCAN DBScan For each core point which is not in a cluster Eplore its neighbourhood in search for ever densit reachable point For each neighbourhood point eplored If it is a core point -> further eplore it It it is not a core point -> assign to the cluster and do not eplore it Eperiments have shown DBScan to be faster and more precise than CLARANS Epected time compleit O(n lg n) The fact that a cluster is composed b the maimal set of points that are densit-connect it is a propert (and therefore a consequence) of the method References Data Mining: Concepts and Techniques, Jiawei Han, Micheline Kamber (Morgan Kaufmann - ) Data Mining: Introductor and Advanced Topics, Margaret Dunham (Prentice Hall, ) Clustering Web Search Results, Iwona Białnicka-Birula, Thank ou!!!

Chapter 7. Cluster Analysis

Chapter 7. Cluster Analysis Chapter 7. Cluster Analysis. What is Cluster Analysis?. A Categorization of Major Clustering Methods. Partitioning Methods. Hierarchical Methods 5. Density-Based Methods 6. Grid-Based Methods 7. Model-Based

More information

Data Mining for Knowledge Management. Clustering

Data Mining for Knowledge Management. Clustering Data Mining for Knowledge Management Clustering Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management Thanks for slides to: Jiawei Han Eamonn Keogh Jeff

More information

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 K-Means Cluster Analsis Chapter 3 PPDM Class Tan,Steinbach, Kumar Introduction to Data Mining 4/18/4 1 What is Cluster Analsis? Finding groups of objects such that the objects in a group will be similar

More information

Cluster Analysis: Basic Concepts and Algorithms

Cluster Analysis: Basic Concepts and Algorithms Cluster Analsis: Basic Concepts and Algorithms What does it mean clustering? Applications Tpes of clustering K-means Intuition Algorithm Choosing initial centroids Bisecting K-means Post-processing Strengths

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining /8/ What is Cluster

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/4 What is

More information

Cluster Analysis Overview. Data Mining Techniques: Cluster Analysis. What is Cluster Analysis? What is Cluster Analysis?

Cluster Analysis Overview. Data Mining Techniques: Cluster Analysis. What is Cluster Analysis? What is Cluster Analysis? Cluster Analsis Overview Data Mining Techniques: Cluster Analsis Mirek Riedewald Man slides based on presentations b Han/Kamber, Tan/Steinbach/Kumar, and Andrew Moore Introduction Foundations: Measuring

More information

Clustering: Techniques & Applications. Nguyen Sinh Hoa, Nguyen Hung Son. 15 lutego 2006 Clustering 1

Clustering: Techniques & Applications. Nguyen Sinh Hoa, Nguyen Hung Son. 15 lutego 2006 Clustering 1 Clustering: Techniques & Applications Nguyen Sinh Hoa, Nguyen Hung Son 15 lutego 2006 Clustering 1 Agenda Introduction Clustering Methods Applications: Outlier Analysis Gene clustering Summary and Conclusions

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Clustering Algorithms K-means and its variants Hierarchical clustering

More information

Cluster Analysis. Isabel M. Rodrigues. Lisboa, 2014. Instituto Superior Técnico

Cluster Analysis. Isabel M. Rodrigues. Lisboa, 2014. Instituto Superior Técnico Instituto Superior Técnico Lisboa, 2014 Introduction: Cluster analysis What is? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from

More information

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based

More information

Data Mining 5. Cluster Analysis

Data Mining 5. Cluster Analysis Data Mining 5. Cluster Analysis 5.2 Fall 2009 Instructor: Dr. Masoud Yaghini Outline Data Structures Interval-Valued (Numeric) Variables Binary Variables Categorical Variables Ordinal Variables Variables

More information

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering Overview Prognostic Models and Data Mining in Medicine, part I Cluster Analsis What is Cluster Analsis? K-Means Clustering Hierarchical Clustering Cluster Validit Eample: Microarra data analsis 6 Summar

More information

Clustering Techniques: A Brief Survey of Different Clustering Algorithms

Clustering Techniques: A Brief Survey of Different Clustering Algorithms Clustering Techniques: A Brief Survey of Different Clustering Algorithms Deepti Sisodia Technocrates Institute of Technology, Bhopal, India Lokesh Singh Technocrates Institute of Technology, Bhopal, India

More information

Cluster Analysis: Advanced Concepts

Cluster Analysis: Advanced Concepts Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means

More information

Data Clustering Techniques Qualifying Oral Examination Paper

Data Clustering Techniques Qualifying Oral Examination Paper Data Clustering Techniques Qualifying Oral Examination Paper Periklis Andritsos University of Toronto Department of Computer Science periklis@cs.toronto.edu March 11, 2002 1 Introduction During a cholera

More information

BIRCH: An Efficient Data Clustering Method For Very Large Databases

BIRCH: An Efficient Data Clustering Method For Very Large Databases BIRCH: An Efficient Data Clustering Method For Very Large Databases Tian Zhang, Raghu Ramakrishnan, Miron Livny CPSC 504 Presenter: Discussion Leader: Sophia (Xueyao) Liang HelenJr, Birches. Online Image.

More information

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in

More information

Unsupervised learning: Clustering

Unsupervised learning: Clustering Unsupervised learning: Clustering Salissou Moutari Centre for Statistical Science and Operational Research CenSSOR 17 th September 2013 Unsupervised learning: Clustering 1/52 Outline 1 Introduction What

More information

Cluster Analysis: Basic Concepts and Methods

Cluster Analysis: Basic Concepts and Methods 10 Cluster Analysis: Basic Concepts and Methods Imagine that you are the Director of Customer Relationships at AllElectronics, and you have five managers working for you. You would like to organize all

More information

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report. Document Clustering. Meryem Uzun-Per Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

More information

Clustering methods for Big data analysis

Clustering methods for Big data analysis Clustering methods for Big data analysis Keshav Sanse, Meena Sharma Abstract Today s age is the age of data. Nowadays the data is being produced at a tremendous rate. In order to make use of this large-scale

More information

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining Data Mining Cluster Analysis: Advanced Concepts and Algorithms Lecture Notes for Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004

More information

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is Clustering 15-381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv Bar-Joseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is

More information

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

More information

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data An Analysis on Density Based Clustering of Multi Dimensional Spatial Data K. Mumtaz 1 Assistant Professor, Department of MCA Vivekanandha Institute of Information and Management Studies, Tiruchengode,

More information

For supervised classification we have a variety of measures to evaluate how good our model is Accuracy, precision, recall

For supervised classification we have a variety of measures to evaluate how good our model is Accuracy, precision, recall Cluster Validation Cluster Validit For supervised classification we have a variet of measures to evaluate how good our model is Accurac, precision, recall For cluster analsis, the analogous question is

More information

On Clustering Validation Techniques

On Clustering Validation Techniques Journal of Intelligent Information Systems, 17:2/3, 107 145, 2001 c 2001 Kluwer Academic Publishers. Manufactured in The Netherlands. On Clustering Validation Techniques MARIA HALKIDI mhalk@aueb.gr YANNIS

More information

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining Data Mining Cluster Analysis: Advanced Concepts and Algorithms Lecture Notes for Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

Neural Networks Lesson 5 - Cluster Analysis

Neural Networks Lesson 5 - Cluster Analysis Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm michele.scarpiniti@uniroma1.it Rome, 29

More information

Client Based Power Iteration Clustering Algorithm to Reduce Dimensionality in Big Data

Client Based Power Iteration Clustering Algorithm to Reduce Dimensionality in Big Data Client Based Power Iteration Clustering Algorithm to Reduce Dimensionalit in Big Data Jaalatchum. D 1, Thambidurai. P 1, Department of CSE, PKIET, Karaikal, India Abstract - Clustering is a group of objects

More information

An Introduction to Cluster Analysis for Data Mining

An Introduction to Cluster Analysis for Data Mining An Introduction to Cluster Analysis for Data Mining 10/02/2000 11:42 AM 1. INTRODUCTION... 4 1.1. Scope of This Paper... 4 1.2. What Cluster Analysis Is... 4 1.3. What Cluster Analysis Is Not... 5 2. OVERVIEW...

More information

The SPSS TwoStep Cluster Component

The SPSS TwoStep Cluster Component White paper technical report The SPSS TwoStep Cluster Component A scalable component enabling more efficient customer segmentation Introduction The SPSS TwoStep Clustering Component is a scalable cluster

More information

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

More information

Clustering and Outlier Detection

Clustering and Outlier Detection Clustering and Outlier Detection Application Examples Customer segmentation How to partition customers into groups so that customers in each group are similar, while customers in different groups are dissimilar?

More information

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?

More information

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distance-based K-means, K-medoids,

More information

Clustering on Large Numeric Data Sets Using Hierarchical Approach Birch

Clustering on Large Numeric Data Sets Using Hierarchical Approach Birch Global Journal of Computer Science and Technology Software & Data Engineering Volume 12 Issue 12 Version 1.0 Year 2012 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global

More information

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns Outline Part 1: of data clustering Non-Supervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties

More information

Data Clustering Using Data Mining Techniques

Data Clustering Using Data Mining Techniques Data Clustering Using Data Mining Techniques S.R.Pande 1, Ms. S.S.Sambare 2, V.M.Thakre 3 Department of Computer Science, SSES Amti's Science College, Congressnagar, Nagpur, India 1 Department of Computer

More information

A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases

A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases Published in the Proceedings of 14th International Conference on Data Engineering (ICDE 98) A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases Xiaowei Xu, Martin Ester, Hans-Peter

More information

Clustering UE 141 Spring 2013

Clustering UE 141 Spring 2013 Clustering UE 141 Spring 013 Jing Gao SUNY Buffalo 1 Definition of Clustering Finding groups of obects such that the obects in a group will be similar (or related) to one another and different from (or

More information

Distances, Clustering, and Classification. Heatmaps

Distances, Clustering, and Classification. Heatmaps Distances, Clustering, and Classification Heatmaps 1 Distance Clustering organizes things that are close into groups What does it mean for two genes to be close? What does it mean for two samples to be

More information

Data Mining 資 料 探 勘. 分 群 分 析 (Cluster Analysis)

Data Mining 資 料 探 勘. 分 群 分 析 (Cluster Analysis) Data Mining 資 料 探 勘 Tamkang University 分 群 分 析 (Cluster Analysis) DM MI Wed,, (:- :) (B) Min-Yuh Day 戴 敏 育 Assistant Professor 專 任 助 理 教 授 Dept. of Information Management, Tamkang University 淡 江 大 學 資

More information

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows: Statistics: Rosie Cornish. 2007. 3.1 Cluster Analysis 1 Introduction This handout is designed to provide only a brief introduction to cluster analysis and how it is done. Books giving further details are

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will

More information

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental

More information

Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis

Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis Abdun Mahmood, Christopher Leckie, Parampalli Udaya Department of Computer Science and Software Engineering University of

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Classification and Prediction

Classification and Prediction Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA 30602-2501

Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA 30602-2501 CLUSTER ANALYSIS Steven M. Ho!and Department of Geology, University of Georgia, Athens, GA 30602-2501 January 2006 Introduction Cluster analysis includes a broad suite of techniques designed to find groups

More information

Chapter ML:XI (continued)

Chapter ML:XI (continued) Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

More information

Comparison and Analysis of Various Clustering Methods in Data mining On Education data set Using the weak tool

Comparison and Analysis of Various Clustering Methods in Data mining On Education data set Using the weak tool Comparison and Analysis of Various Clustering Metho in Data mining On Education data set Using the weak tool Abstract:- Data mining is used to find the hidden information pattern and relationship between

More information

Smart-Sample: An Efficient Algorithm for Clustering Large High-Dimensional Datasets

Smart-Sample: An Efficient Algorithm for Clustering Large High-Dimensional Datasets Smart-Sample: An Efficient Algorithm for Clustering Large High-Dimensional Datasets Dudu Lazarov, Gil David, Amir Averbuch School of Computer Science, Tel-Aviv University Tel-Aviv 69978, Israel Abstract

More information

A comparison of various clustering methods and algorithms in data mining

A comparison of various clustering methods and algorithms in data mining Volume :2, Issue :5, 32-36 May 2015 www.allsubjectjournal.com e-issn: 2349-4182 p-issn: 2349-5979 Impact Factor: 3.762 R.Tamilselvi B.Sivasakthi R.Kavitha Assistant Professor A comparison of various clustering

More information

Cluster Analysis: Basic Concepts and Algorithms

Cluster Analysis: Basic Concepts and Algorithms 8 Cluster Analysis: Basic Concepts and Algorithms Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 10 th, 2013 Wolf-Tilo Balke and Kinda El Maarry Institut für Informationssysteme Technische Universität Braunschweig

More information

Classification Techniques (1)

Classification Techniques (1) 10 10 Overview Classification Techniques (1) Today Classification Problem Classification based on Regression Distance-based Classification (KNN) Net Lecture Decision Trees Classification using Rules Quality

More information

Data Mining. Session 9 Main Theme Clustering. Dr. Jean-Claude Franchitti

Data Mining. Session 9 Main Theme Clustering. Dr. Jean-Claude Franchitti Data Mining Session 9 Main Theme Clustering Dr. Jean-Claude Franchitti New York University Computer Science Department Courant Institute of Mathematical Sciences Adapted from course textbook resources

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical

More information

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA Prakash Singh 1, Aarohi Surya 2 1 Department of Finance, IIM Lucknow, Lucknow, India 2 Department of Computer Science, LNMIIT, Jaipur,

More information

How To Solve The Cluster Algorithm

How To Solve The Cluster Algorithm Cluster Algorithms Adriano Cruz adriano@nce.ufrj.br 28 de outubro de 2013 Adriano Cruz adriano@nce.ufrj.br () Cluster Algorithms 28 de outubro de 2013 1 / 80 Summary 1 K-Means Adriano Cruz adriano@nce.ufrj.br

More information

Graph Mining and Social Network Analysis

Graph Mining and Social Network Analysis Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

More information

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi-110 012 dcmishra@iasri.res.in What is Learning? "Learning denotes changes in a system that enable

More information

Linköpings Universitet - ITN TNM033 2011-11-30 DBSCAN. A Density-Based Spatial Clustering of Application with Noise

Linköpings Universitet - ITN TNM033 2011-11-30 DBSCAN. A Density-Based Spatial Clustering of Application with Noise DBSCAN A Density-Based Spatial Clustering of Application with Noise Henrik Bäcklund (henba892), Anders Hedblom (andh893), Niklas Neijman (nikne866) 1 1. Introduction Today data is received automatically

More information

A Method for Decentralized Clustering in Large Multi-Agent Systems

A Method for Decentralized Clustering in Large Multi-Agent Systems A Method for Decentralized Clustering in Large Multi-Agent Systems Elth Ogston, Benno Overeinder, Maarten van Steen, and Frances Brazier Department of Computer Science, Vrije Universiteit Amsterdam {elth,bjo,steen,frances}@cs.vu.nl

More information

Data Mining: Foundation, Techniques and Applications

Data Mining: Foundation, Techniques and Applications Data Mining: Foundation, Techniques and Applications Lesson 1b :A Quick Overview of Data Mining Li Cuiping( 李 翠 平 ) School of Information Renmin University of China Anthony Tung( 鄧 锦 浩 ) School of Computing

More information

Unsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning

Unsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning Unsupervised Learning and Data Mining Unsupervised Learning and Data Mining Clustering Decision trees Artificial neural nets K-nearest neighbor Support vectors Linear regression Logistic regression...

More information

Standardization and Its Effects on K-Means Clustering Algorithm

Standardization and Its Effects on K-Means Clustering Algorithm Research Journal of Applied Sciences, Engineering and Technology 6(7): 399-3303, 03 ISSN: 040-7459; e-issn: 040-7467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03

More information

Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009

Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009 Cluster Analysis Alison Merikangas Data Analysis Seminar 18 November 2009 Overview What is cluster analysis? Types of cluster Distance functions Clustering methods Agglomerative K-means Density-based Interpretation

More information

Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

More information

USING THE AGGLOMERATIVE METHOD OF HIERARCHICAL CLUSTERING AS A DATA MINING TOOL IN CAPITAL MARKET 1. Vera Marinova Boncheva

USING THE AGGLOMERATIVE METHOD OF HIERARCHICAL CLUSTERING AS A DATA MINING TOOL IN CAPITAL MARKET 1. Vera Marinova Boncheva 382 [7] Reznik, A, Kussul, N., Sokolov, A.: Identification of user activity using neural networks. Cybernetics and computer techniques, vol. 123 (1999) 70 79. (in Russian) [8] Kussul, N., et al. : Multi-Agent

More information

GraphZip: A Fast and Automatic Compression Method for Spatial Data Clustering

GraphZip: A Fast and Automatic Compression Method for Spatial Data Clustering GraphZip: A Fast and Automatic Compression Method for Spatial Data Clustering Yu Qian Kang Zhang Department of Computer Science, The University of Texas at Dallas, Richardson, TX 75083-0688, USA {yxq012100,

More information

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier Data Mining: Concepts and Techniques Jiawei Han Micheline Kamber Simon Fräser University К MORGAN KAUFMANN PUBLISHERS AN IMPRINT OF Elsevier Contents Foreword Preface xix vii Chapter I Introduction I I.

More information

Data Mining Process Using Clustering: A Survey

Data Mining Process Using Clustering: A Survey Data Mining Process Using Clustering: A Survey Mohamad Saraee Department of Electrical and Computer Engineering Isfahan University of Techno1ogy, Isfahan, 84156-83111 saraee@cc.iut.ac.ir Najmeh Ahmadian

More information

Chapter 4 Data Mining A Short Introduction. 2006/7, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis Data Mining - 1

Chapter 4 Data Mining A Short Introduction. 2006/7, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis Data Mining - 1 Chapter 4 Data Mining A Short Introduction 2006/7, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis Data Mining - 1 1 Today's Question 1. Data Mining Overview 2. Association Rule Mining

More information

A Two-Step Method for Clustering Mixed Categroical and Numeric Data

A Two-Step Method for Clustering Mixed Categroical and Numeric Data Tamkang Journal of Science and Engineering, Vol. 13, No. 1, pp. 11 19 (2010) 11 A Two-Step Method for Clustering Mixed Categroical and Numeric Data Ming-Yi Shih*, Jar-Wen Jheng and Lien-Fu Lai Department

More information

DATA mining in general is the search for hidden patterns

DATA mining in general is the search for hidden patterns IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 5, SEPTEMBER/OCTOBER 2002 1003 CLARANS: A Method for Clustering Objects for Spatial Data Mining Raymond T. Ng and Jiawei Han, Member, IEEE

More information

A Comparative Analysis of Various Clustering Techniques used for Very Large Datasets

A Comparative Analysis of Various Clustering Techniques used for Very Large Datasets A Comparative Analysis of Various Clustering Techniques used for Very Large Datasets Preeti Baser, Assistant Professor, SJPIBMCA, Gandhinagar, Gujarat, India 382 007 Research Scholar, R. K. University,

More information

A Comparative Study of clustering algorithms Using weka tools

A Comparative Study of clustering algorithms Using weka tools A Comparative Study of clustering algorithms Using weka tools Bharat Chaudhari 1, Manan Parikh 2 1,2 MECSE, KITRC KALOL ABSTRACT Data clustering is a process of putting similar data into groups. A clustering

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

A Survey of Clustering Techniques

A Survey of Clustering Techniques A Survey of Clustering Techniques Pradeep Rai Asst. Prof., CSE Department, Kanpur Institute of Technology, Kanpur-0800 (India) Shubha Singh Asst. Prof., MCA Department, Kanpur Institute of Technology,

More information

Robust Outlier Detection Technique in Data Mining: A Univariate Approach

Robust Outlier Detection Technique in Data Mining: A Univariate Approach Robust Outlier Detection Technique in Data Mining: A Univariate Approach Singh Vijendra and Pathak Shivani Faculty of Engineering and Technology Mody Institute of Technology and Science Lakshmangarh, Sikar,

More information

Statistical Databases and Registers with some datamining

Statistical Databases and Registers with some datamining Unsupervised learning - Statistical Databases and Registers with some datamining a course in Survey Methodology and O cial Statistics Pages in the book: 501-528 Department of Statistics Stockholm University

More information

The Role of Visualization in Effective Data Cleaning

The Role of Visualization in Effective Data Cleaning The Role of Visualization in Effective Data Cleaning Yu Qian Dept. of Computer Science The University of Texas at Dallas Richardson, TX 75083-0688, USA qianyu@student.utdallas.edu Kang Zhang Dept. of Computer

More information

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data

More information

Solving Quadratic Equations by Graphing. Consider an equation of the form. y ax 2 bx c a 0. In an equation of the form

Solving Quadratic Equations by Graphing. Consider an equation of the form. y ax 2 bx c a 0. In an equation of the form SECTION 11.3 Solving Quadratic Equations b Graphing 11.3 OBJECTIVES 1. Find an ais of smmetr 2. Find a verte 3. Graph a parabola 4. Solve quadratic equations b graphing 5. Solve an application involving

More information

CHAPTER 20. Cluster Analysis

CHAPTER 20. Cluster Analysis CHAPTER 20 Cluster Analysis 20.1 Introduction 20.2 What Is Cluster Analysis? 20.3 Typical requirements 20.4 Types of Data in cluster Analysis 20.5 Interval-scaled Variables 20.6 Binary Variables 20.7 Nominal,Ordinal,

More information

Dynamical Clustering of Personalized Web Search Results

Dynamical Clustering of Personalized Web Search Results Dynamical Clustering of Personalized Web Search Results Xuehua Shen CS Dept, UIUC xshen@cs.uiuc.edu Hong Cheng CS Dept, UIUC hcheng3@uiuc.edu Abstract Most current search engines present the user a ranked

More information

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications

More information

A Review on Clustering and Outlier Analysis Techniques in Datamining

A Review on Clustering and Outlier Analysis Techniques in Datamining American Journal of Applied Sciences 9 (2): 254-258, 2012 ISSN 1546-9239 2012 Science Publications A Review on Clustering and Outlier Analysis Techniques in Datamining 1 Koteeswaran, S., 2 P. Visu and

More information

Concept of Cluster Analysis

Concept of Cluster Analysis RESEARCH PAPER ON CLUSTER TECHNIQUES OF DATA VARIATIONS Er. Arpit Gupta 1,Er.Ankit Gupta 2,Er. Amit Mishra 3 arpit_jp@yahoo.co.in, ank_mgcgv@yahoo.co.in,amitmishra.mtech@gmail.com Faculty Of Engineering

More information

Data Mining for Knowledge Management. Classification

Data Mining for Knowledge Management. Classification 1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

More information

Outlier Detection in Clustering

Outlier Detection in Clustering Outlier Detection in Clustering Svetlana Cherednichenko 24.01.2005 University of Joensuu Department of Computer Science Master s Thesis TABLE OF CONTENTS 1. INTRODUCTION...1 1.1. BASIC DEFINITIONS... 1

More information

Clustering Data Streams

Clustering Data Streams Clustering Data Streams Mohamed Elasmar Prashant Thiruvengadachari Javier Salinas Martin gtg091e@mail.gatech.edu tprashant@gmail.com javisal1@gatech.edu Introduction: Data mining is the science of extracting

More information

Cross-validation for detecting and preventing overfitting

Cross-validation for detecting and preventing overfitting Cross-validation for detecting and preventing overfitting Note to other teachers and users of these slides. Andrew would be delighted if ou found this source material useful in giving our own lectures.

More information

Clustering. Chapter 7. 7.1 Introduction to Clustering Techniques. 7.1.1 Points, Spaces, and Distances

Clustering. Chapter 7. 7.1 Introduction to Clustering Techniques. 7.1.1 Points, Spaces, and Distances 240 Chapter 7 Clustering Clustering is the process of examining a collection of points, and grouping the points into clusters according to some distance measure. The goal is that points in the same cluster

More information

Hierarchical Cluster Analysis Some Basics and Algorithms

Hierarchical Cluster Analysis Some Basics and Algorithms Hierarchical Cluster Analysis Some Basics and Algorithms Nethra Sambamoorthi CRMportals Inc., 11 Bartram Road, Englishtown, NJ 07726 (NOTE: Please use always the latest copy of the document. Click on this

More information