Cluster Analysis Overview. Data Mining Techniques: Cluster Analysis. What is Cluster Analysis? What is Cluster Analysis?

Size: px
Start display at page:

Download "Cluster Analysis Overview. Data Mining Techniques: Cluster Analysis. What is Cluster Analysis? What is Cluster Analysis?"

Transcription

1 Cluster Analsis Overview Data Mining Techniques: Cluster Analsis Mirek Riedewald Man slides based on presentations b Han/Kamber, Tan/Steinbach/Kumar, and Andrew Moore Introduction Foundations: Measuring Distance (Similarit) Partitioning Methods: K-Means Hierarchical Methods Densit-Based Methods Clustering High-Dimensional Data Cluster Evaluation What is Cluster Analsis? What is Cluster Analsis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Unsupervised learning: usuall no training set with known classes Tpical applications As a stand-alone tool to get insight into data properties As a preprocessing step for other algorithms Intra-cluster distances are minimized Inter-cluster distances are maimized Rich Applications, Multidisciplinar Efforts Pattern Recognition Spatial Data Analsis Image Processing Data Reduction Economic Science Market research WWW Document classification Weblogs: discover groups of similar access patterns Clustering precipitation in Australia Eamples of Clustering Applications Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs Land use: Identification of areas of similar land use in an earth observation database Insurance: Identifing groups of motor insurance polic holders with a high average claim cost Cit-planning: Identifing groups of houses according to their house tpe, value, and geographical location Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults

2 Qualit: What Is Good Clustering? Notion of a Cluster can be Ambiguous Cluster membership objects in same class High intra-class similarit, low inter-class similarit Choice of similarit measure is important How man clusters? Si Clusters Abilit to discover some or all of the hidden patterns Difficult to measure without ground truth Two Clusters Four Clusters 7 Distinctions Between Sets of Clusters Eclusive versus non-eclusive Non-eclusive clustering: points ma belong to multiple clusters Fuzz versus non-fuzz Fuzz clustering: a point belongs to ever cluster with some weight between and Weights must sum to Partial versus complete Cluster some or all of the data Heterogeneous versus homogeneous Clusters of widel different sizes, shapes, densities Cluster Analsis Overview Introduction Foundations: Measuring Distance (Similarit) Partitioning Methods: K-Means Hierarchical Methods Densit-Based Methods Clustering High-Dimensional Data Cluster Evaluation 9 Distance Clustering is inherentl connected to question of (dis-)similarit of objects How can we define similarit between objects? Similarit Between Objects Usuall measured b some notion of distance Popular choice: Minkowski distance q q q q dist ( i ), ( j) ( i) ( j) ( i) ( j) ( i) ( j) d d q is a positive integer q = : Manhattan distance dist ( i), ( j) ( i) ( j) ( i) ( j) ( i) ( j) d d q = : Euclidean distance: dist ( i), ( j) ( i) ( j) ( i) ( j) ( i ) ( j ) d d

3 Metrics Properties of a metric d(i,j) d(i,j) = if and onl if i=j d(i,j) = d(j,i) d(i,j) d(i,k) + d(k,j) Eamples: Euclidean distance, Manhattan distance Man other non-metric similarit measures eist After selecting the distance function, is it now clear how to compute similarit between objects? Challenges How to compute a distance for categorical attributes An attribute with a large domain often dominates the overall distance Weight and scale the attributes like for k-nn Curse of dimensionalit Curse of Dimensionalit Best solution: remove an attribute that is known to be ver nois or not interesting Tr different subsets of the attributes and determine where good clusters are found Nominal Attributes Method : work with original values Difference = if same value, difference = otherwise Method : transform to binar attributes New binar attribute for each domain value Encode specific domain value b setting corresponding binar attribute to and all others to Ordinal Attributes Method : treat as nominal Problem: loses ordering information Method : map to [,] Problem: To which values should the original values be mapped? Default: equi-distant mapping to [,] Scaling and Transforming Attributes Sometimes it might be necessar to transform numerical attributes to [,] or use another normalizing transformation, mabe even nonlinear (e.g., logarithm) Might need to weight attributes differentl Often requires epert knowledge or trial-anderror 7

4 Other Similarit Measures Special distance or similarit measures for man applications Might be a non-metric function Information retrieval Document similarit based on kewords Bioinformatics Gene features in micro-arras Calculating Cluster Distances Single link = smallest distance between an element in one cluster and an element in the other: dist(k i, K j ) = min( ip, jq ) Complete link = largest distance between an element in one cluster and an element in the other: dist(k i, K j ) = ma( ip, jq ) Average distance between an element in one cluster and an element in the other: dist(k i, K j ) = avg( ip, jq ) Distance between cluster centroids: dist(k i, K j ) = d(m i, m j ) Distance between cluster medoids: dist(k i, K j ) = dist( mi, mj ) Medoid: one chosen, centrall located object in the cluster 9 Cluster Centroid, Radius, and Diameter Centroid: the middle of a cluster C m Radius: square root of average distance from an point of the cluster to its centroid ( m) C R C Diameter: square root of average mean squared distance between all pairs of points in the cluster ( ) C C, D C ( C ) C C Cluster Analsis Overview Introduction Foundations: Measuring Distance (Similarit) Partitioning Methods: K-Means Hierarchical Methods Densit-Based Methods Clustering High-Dimensional Data Cluster Evaluation Partitioning Algorithms: Basic Concept Construct a partition of a database D of n objects into a set of K clusters, s.t. sum of squared distances to cluster representative m is minimized K ( m ) i C i i Given a K, find partition of K clusters that optimizes the chosen partitioning criterion Globall optimal: enumerate all partitions Heuristic methods K-means ( 7): each cluster represented b its centroid K-medoids ( 7): each cluster represented b one of the objects in the cluster K-means Clustering Each cluster is associated with a centroid Each object is assigned to the cluster with the closest centroid. Given K, select K random objects as initial centroids. Repeat until centroids do not change. Form K clusters b assigning ever object to its nearest centroid. Recompute centroid of each cluster

5 K-Means Eample Iteration. Overview of K-Means Convergence Iteration Iteration Iteration Iteration Iteration Iteration K-means Questions What is it tring to optimize? Will it alwas terminate? Will it find an optimal clustering? How should we start it? How could we automaticall choose the number of centers?.we ll deal with these questions net K-means Clustering Details Initial centroids often chosen randoml Clusters produced var from one run to another Distance usuall measured b Euclidean distance, cosine similarit, correlation, etc. Comparabl fast algorithm: O( n * K * I * d ) n = number of objects I = number of iterations d = number of attributes 7 Evaluating K-means Clusters Most common measure: Sum of Squared Error (SSE) For each point, the error is the distance to the nearest centroid K SSE dist ( m, ) i C i m i = centroid of cluster C i Given two clusterings, choose the one with the smallest error Eas wa to reduce SSE: increase K In practice, large K not interesting i K-means Convergence () Assign each to its nearest center (minimizes SSE for fied centers) () Choose centroid of all points in the same cluster as cluster center (minimizes SSE for fied clusters) Ccle through steps () and () = K-means algorithm Algorithm terminates when neither () nor () results in change of configuration Finite number of was of partitioning n records into K groups If the configuration changes on an iteration, it must have improved SSE So each time the configuration changes it must go to a configuration it has never been to before So if it tried to go on forever, it would eventuall run out of configurations 9

6 Will it Find the Optimal Clustering?... Importance of Initial Centroids Iteration Iteration Iteration Iteration Iteration Iteration Optimal Clustering Sub-optimal Clustering Will It Find The Optimal Clustering Now? Iteration Importance of Initial Centroids Iteration Iteration Iteration Iteration Iteration Problems with Selecting Initial Centroids Probabilit of starting with eactl one initial centroid per real cluster is ver low K selected for algorithm might be different from inherent K of the data Might randoml select multiple initial objects from same cluster Sometimes initial centroids will readjust themselves in the right wa, and sometimes the don t Clusters Eample Iteration Starting with two initial centroids in one cluster of each pair of clusters

7 Clusters Eample Iteration Iteration Clusters Eample Iteration Iteration - - Iteration Starting with two initial centroids in one cluster of each pair of clusters 7 Starting with some pairs of clusters having three initial centroids, while other have onl one. Clusters Eample Iteration Iteration Solutions to Initial Centroids Problem Iteration Iteration Multiple runs Helps, but probabilit is not on our side Sample and use hierarchical clustering to determine initial centroids Select more than k initial centroids and then select among these the initial centroids Select those that are most widel separated Postprocessing Eliminate small clusters that ma represent outliers Split clusters with high SSE Merge clusters that are close and have low SSE Starting with some pairs of clusters having three initial centroids, while other have onl one. 9 Limitations of K-means Limitations of K-means: Differing Sizes K-means has problems when clusters are of differing Sizes Densities Non-globular shapes K-means has problems when the data contains outliers K-means ( Clusters) 7

8 Limitations of K-means: Differing Densit Limitations of K-means: Non-globular Shapes K-means ( Clusters) K-means ( Clusters) Overcoming K-means Limitations Overcoming K-means Limitations K-means Clusters K-means Clusters One solution is to use man clusters. Find parts of clusters, then put them together. Overcoming K-means Limitations K-Means and Outliers K-means Clusters K-means algorithm is sensitive to outliers Centroid is average of cluster members Outlier can dominate average computation Solution: K-medoids Medoid = most centrall located real object in a cluster Algorithm similar to K-means, but finding medoid is much more epensive Tr all objects in cluster to find the one that minimizes SSE, or just tr a few randoml to reduce cost 7

9 Cluster Analsis Overview Introduction Foundations: Measuring Distance (Similarit) Partitioning Methods: K-Means Hierarchical Methods Densit-Based Methods Clustering High-Dimensional Data Cluster Evaluation Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Visualized as a dendrogram Tree-like diagram that records the sequences of merges or splits Strengths of Hierarchical Clustering Do not have to assume an particular number of clusters An number of clusters can be obtained b cutting the dendogram at the proper level Ma correspond to meaningful taonomies Eample in biological sciences (e.g., animal kingdom, phlogen reconstruction, ) Hierarchical Clustering Two main tpes of hierarchical clustering Agglomerative: Start with the given objects as individual clusters At each step, merge the closest pair of clusters until onl one cluster (or K clusters) left Divisive: Start with one, all-inclusive cluster At each step, split a cluster until each cluster contains a single object (or there are K clusters) Agglomerative Clustering Algorithm More popular hierarchical clustering technique Basic algorithm is straightforward. Compute the proimit matri. Let each data object be a cluster. Repeat until onl a single cluster remains. Merge the two closest clusters. Update the proimit matri Ke operation: computation of the proimit of two clusters Different approaches to defining the distance between clusters distinguish the different algorithms Starting Situation Clusters of individual objects, proimit matri p p p p p.. p p p p p.... Proimit Matri... p p p p p9 p p p 9

10 Intermediate Situation Intermediate Situation Some clusters are merged C C C C C C Merge closest clusters (C and C ) and update proimit matri C C C C C C C C C C C C C Proimit Matri C C C C C C C Proimit Matri C C C C... p p p p p9 p p p... p p p p p9 p p p After Merging Defining Cluster Distance How do we update the proimit matri? C C C C C U C C C C C U C??????? C C Proimit Matri Min: clusters near each other Ma: low diameter Avg: more robust against outliers Distance between centroids C U C... p p p p p9 p p p 7 Strength of MIN Limitations of MIN Two Clusters Two Clusters Can handle non-elliptical shapes Sensitive to noise and outliers 9

11 Strength of MAX Limitations of MAX Two Clusters Two Clusters Less susceptible to noise and outliers Tends to break large clusters Biased towards globular clusters Hierarchical Clustering: Average Compromise between Single and Complete Link Strengths Less susceptible to noise and outliers Limitations Biased towards globular clusters Cluster Similarit: Ward s Method Distance of two clusters is based on the increase in squared error when two clusters are merged Similar to group average if distance between objects is distance squared Less susceptible to noise and outliers Biased towards globular clusters Hierarchical analogue of K-means Can be used to initialize K-means Hierarchical Clustering: Comparison MIN MAX Ward s Method Group Average Time and Space Requirements O(n ) space for proimit matri n = number of objects O(n ) time in man cases There are n steps and at each step the proimit matri must be updated and searched Compleit can be reduced to O(n log(n) ) time for some approaches

12 Hierarchical Clustering: Problems and Limitations Once a decision is made to combine two clusters, it cannot be undone No objective function is directl minimized Different schemes have problems with one or more of the following: Sensitivit to noise and outliers Difficult handling different sized clusters and conve shapes Breaking large clusters Cluster Analsis Overview Introduction Foundations: Measuring Distance (Similarit) Partitioning Methods: K-Means Hierarchical Methods Densit-Based Methods Clustering High-Dimensional Data Cluster Evaluation 7 7 Densit-Based Clustering Methods Clustering based on densit of data objects in a neighborhood Local clustering criterion Major features: Discover clusters of arbitrar shape Handle noise Need densit parameters as termination condition 7 DBSCAN: Basic Concepts Two parameters: Eps: Maimum radius of the neighborhood N Eps (q): {p D dist(q,p) Eps} MinPts: Minimum number of points in an Epsneighborhood of that point A point p is directl densit-reachable from a point q w.r.t. Eps and MinPts if p belongs to N Eps (q) Core point condition: MinPts = N Eps (q) MinPts p q 7 Densit-Reachable, Densit-Connected DBSCAN: Classes of A point p is densit-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points q = p,p,, p n = p such that p i + is directl densitreachable from p i A point p is densit-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are densit-reachable from o w.r.t. Eps and MinPts Cluster = set of densit-connected points p q o p p q A point is a core point if it has more than a specified number of points (MinPts) within Eps At the interior of a cluster A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point At the outer surface of a cluster A noise point is an point that is not a core point or a border point Not part of an cluster 77 7

13 DBSCAN: Core, Border, and Noise DBSCAN Algorithm Repeat until all points have been processed Select a point p If p is core point then Retrieve and remove all points densit-reachable from p w.r.t. Eps and MinPts; output them as a cluster Discards all noise points (how?) Discovers clusters of arbitrar shape Fairl robust against noise Runtime: O(n ), space: O(n) O(n * timetofindinneighborhood) Can be O(n log(n)) with spatial inde 79 DBSCAN: Core, Border and Noise When DBSCAN Works Well Point tpes: core, border and noise Clusters Eps =, MinPts = When DBSCAN Does NOT Work Well DBSCAN: Determining Eps and MinPts (MinPts=, large Eps) Idea: for points in a cluster, their k-th nearest neighbors are at roughl the same distance Noise points have the k-th nearest neighbor at farther distance Plot the sorted distance of ever point to its k-th nearest neighbor Choose Eps where sharp change occurs MinPts = k k too large: small clusters labeled as noise k too small: small groups of outliers labeled as cluster Varing densities High-dimensional data (MinPts=, small Eps)

14 DBSCAN: Sensitive to Parameters Cluster Analsis Overview Introduction Foundations: Measuring Distance (Similarit) Partitioning Methods: K-Means Hierarchical Methods Densit-Based Methods Clustering High-Dimensional Data Cluster Evaluation 9 Clustering High-Dimensional Data Man applications: tet documents, DNA micro-arra data Major challenges: Irrelevant dimensions ma mask clusters Curse of dimensionalit for distance computation Clusters ma eist onl in some subspaces Methods Feature transformation, e.g., PCA and SVD Some useful onl when features are highl correlated/redundant Feature selection: wrapper or filter approaches Subspace-clustering: find clusters in all subspaces CLIQUE Curse of Dimensionalit Graphs on the right adapted from Parsons et al. KDD Eplorations Data in onl one dimension is relativel packed Adding a dimension stretches the objects across that dimension, moving them further apart High-dimensional data is ver sparse Distance measure becomes meaningless For man distributions, distances between objects become more similar in high dimensions 9 97 Wh Subspace Clustering? Adapted from Parsons et al. SIGKDD Eplorations CLIQUE (Clustering In QUEst) Automaticall identifies clusters in sub-spaces Eploits monotonicit propert If a set of points forms a dense cluster in d dimensions, the also form a cluster in an subset of these dimensions A region is dense if the fraction of data points in the region eceeds the input model parameter Sound familiar? Apriori algorithm... Algorithm is both densit-based and grid-based Partitions each dimension into the same number of equallength intervals Partitions an m-dimensional data space into non-overlapping rectangular units Cluster = maimal set of connected dense units within a subspace 9 99

15 Vacation 7 Salar (,) Vacation (week) CLIQUE Algorithm Find all dense regions in -dim space for each attribute. This is the set of dense -dim cells. Let k=. Repeat until there are no dense k-dim cells k = k+ Generate all candidate k-dim cells from dense (k-)-dim cells Eliminate cells with fewer than points Find clusters b taking union of all adjacent, highdensit cells of same dimensionalit Summarize each cluster using a small set of inequalities that describe the attribute ranges of the cells in the cluster = age 7 age age Compute intersection of dense age-salar and age-vacation regions Strengths and Weaknesses of CLIQUE Strengths Automaticall finds subspaces of the highest dimensionalit that contain high-densit clusters Insensitive to the order of objects in input and does not presume some canonical data distribution Scales linearl with input size and has good scalabilit with number of dimensions Weaknesses Need to tune grid size and densit threshold Each point can be a member of man clusters Can still have high mining cost (inherent problem for subspace clustering) Same densit threshold for low and high dimensionalit Cluster Analsis Overview Introduction Foundations: Measuring Distance (Similarit) Partitioning Methods: K-Means Hierarchical Methods Densit-Based Methods Clustering High-Dimensional Data Cluster Evaluation Cluster Validit on Test Data Cluster Validit Clustering: usuall no ground truth available Problem: clusters are in the ee of the beholder Then wh do we want to evaluate them? To avoid finding patterns in noise To compare clustering algorithms To compare two sets of clusters To compare two clusters

16 Random K-means Clusters found in Random Data DBSCAN Complete Link Measuring Cluster Validit Via Correlation Two matrices Similarit Matri Incidence Matri One row and one column for each object Entr is if the associated pair of objects belongs to the same cluster, otherwise Compute correlation between the two matrices Since the matrices are smmetric, onl the correlation between n(n-) / entries needs to be calculated. High correlation: objects close to each other tend to be in same cluster Not a good measure when clusters can be non-globular and intertwined Measuring Cluster Validit Via Correlation.9 Similarit Matri for Cluster Validation Order the similarit matri with respect to cluster labels and inspect visuall Block-diagonal matri for well-separated clusters Corr =.9 Corr = Similarit 9 Similarit Matri for Cluster Validation Clusters in random data are not so crisp Similarit Matri for Cluster Validation Clusters in random data are not so crisp Similarit.... Similarit.... DBSCAN K-means

17 Count SSE Similarit Matri for Cluster Validation Similarit Matri for Cluster Validation Clusters in random data are not so crisp Similarit Complete Link DBSCAN Sum of Squared Error When SSE Is Not So Great For fied number of clusters, lower SSE indicates better clustering Not necessaril true for non-globular, intertwined clusters Can also be used to estimate the number of clusters Run K-means for different K, compare SSE SSE of clusters found using K-means - K Comparison to Random Data or Clustering Need a framework to interpret an measure E.g., if measure =, is that good or bad? Statistical framework for cluster validit Compare cluster qualit measure on random data or random clustering to those on real data If value for random setting is unlikel, then cluster results are valid (cluster = non-random structure) For comparing the results of two different sets of cluster analses, a framework is less necessar But: need to know whether the difference between two inde values is significant Statistical Framework for SSE Eample: found clusters, got SSE =. for given data set Compare to SSE of clusters in random data Histogram: SSE of clusters in sets of random data points ( points from range.. for and ) Estimate mean, stdev for SSE on random data Check how man stdev awa from mean the real-data SSE is Random data SSE 7

18 Statistical Framework for Correlation Compare correlation of incidence and proimit matrices for well-separated data versus random data Random data Cluster Cohesion and Separation Cohesion: how closel related are objects in a cluster Can be measured b SSE (m i = centroid of cluster i): SSE ( mi) ( ) C i Ci i i, C i Separation: how well-separated are clusters Can be measured b between-cluster sum of squares (m = overall mean): BSS ( m m ) i C i i Corr = -.9 Corr = -. 9 cohesion separation Cohesion and Separation Eample Note: BSS + SSE = constant Minimize SSE get ma. BSS m m m K= cluster: K= clusters: SSE ( ) ( ) ( ) ( ) BSS ( ) Total SSE (.) (.) (.) (.) BSS (.) (. ) 9 Total 9 Silhouette Coefficient Combines ideas of both cohesion and separation For an individual object i Calculate a i = average distance of i to the objects in its cluster Calculate b i = average distance of i to objects in another cluster C, choosing the C that minimizes b i Silhouette coefficient of i = (b i -a i ) / ma{a i,b i } Range: [-,], but tpicall between and The closer to, the better Can calculate the Average Silhouette width over all objects b a Final Comment on Cluster Validit The validation of clustering structures is the most difficult and frustrating part of cluster analsis. Without a strong effort in this direction, cluster analsis will remain a black art accessible onl to those true believers who have eperience and great courage. Algorithms for Clustering Data, Jain and Dubes

19 Summar Cluster analsis groups objects based on their similarit (or distance) and has wide applications Measure of similarit (or distance) can be computed for all tpes of data Man different tpes of clustering algorithms Discover different tpes of clusters Man measures of clustering qualit, but absence of ground truth alwas a challenge 9

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining /8/ What is Cluster

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Clustering Algorithms K-means and its variants Hierarchical clustering

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/4 What is

More information

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering Overview Prognostic Models and Data Mining in Medicine, part I Cluster Analsis What is Cluster Analsis? K-Means Clustering Hierarchical Clustering Cluster Validit Eample: Microarra data analsis 6 Summar

More information

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 K-Means Cluster Analsis Chapter 3 PPDM Class Tan,Steinbach, Kumar Introduction to Data Mining 4/18/4 1 What is Cluster Analsis? Finding groups of objects such that the objects in a group will be similar

More information

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

More information

For supervised classification we have a variety of measures to evaluate how good our model is Accuracy, precision, recall

For supervised classification we have a variety of measures to evaluate how good our model is Accuracy, precision, recall Cluster Validation Cluster Validit For supervised classification we have a variet of measures to evaluate how good our model is Accurac, precision, recall For cluster analsis, the analogous question is

More information

Cluster Analysis: Basic Concepts and Algorithms

Cluster Analysis: Basic Concepts and Algorithms Cluster Analsis: Basic Concepts and Algorithms What does it mean clustering? Applications Tpes of clustering K-means Intuition Algorithm Choosing initial centroids Bisecting K-means Post-processing Strengths

More information

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distance-based K-means, K-medoids,

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical

More information

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?

More information

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in

More information

Chapter 7. Cluster Analysis

Chapter 7. Cluster Analysis Chapter 7. Cluster Analysis. What is Cluster Analysis?. A Categorization of Major Clustering Methods. Partitioning Methods. Hierarchical Methods 5. Density-Based Methods 6. Grid-Based Methods 7. Model-Based

More information

Cluster Analysis: Advanced Concepts

Cluster Analysis: Advanced Concepts Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means

More information

Clustering. Clustering. What is Clustering? What is Clustering? What is Clustering? Types of Data in Cluster Analysis

Clustering. Clustering. What is Clustering? What is Clustering? What is Clustering? Types of Data in Cluster Analysis What is Clustering? Clustering Tpes of Data in Cluster Analsis Clustering A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods What is Clustering? Clustering of data is

More information

Cluster Analysis. Isabel M. Rodrigues. Lisboa, 2014. Instituto Superior Técnico

Cluster Analysis. Isabel M. Rodrigues. Lisboa, 2014. Instituto Superior Técnico Instituto Superior Técnico Lisboa, 2014 Introduction: Cluster analysis What is? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will

More information

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based

More information

Unsupervised learning: Clustering

Unsupervised learning: Clustering Unsupervised learning: Clustering Salissou Moutari Centre for Statistical Science and Operational Research CenSSOR 17 th September 2013 Unsupervised learning: Clustering 1/52 Outline 1 Introduction What

More information

Cluster Analysis: Basic Concepts and Algorithms

Cluster Analysis: Basic Concepts and Algorithms 8 Cluster Analysis: Basic Concepts and Algorithms Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should

More information

Clustering UE 141 Spring 2013

Clustering UE 141 Spring 2013 Clustering UE 141 Spring 013 Jing Gao SUNY Buffalo 1 Definition of Clustering Finding groups of obects such that the obects in a group will be similar (or related) to one another and different from (or

More information

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining Data Mining Cluster Analysis: Advanced Concepts and Algorithms Lecture Notes for Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004

More information

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

More information

Data Mining for Knowledge Management. Clustering

Data Mining for Knowledge Management. Clustering Data Mining for Knowledge Management Clustering Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management Thanks for slides to: Jiawei Han Eamonn Keogh Jeff

More information

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining Data Mining Cluster Analysis: Advanced Concepts and Algorithms Lecture Notes for Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004

More information

An Introduction to Cluster Analysis for Data Mining

An Introduction to Cluster Analysis for Data Mining An Introduction to Cluster Analysis for Data Mining 10/02/2000 11:42 AM 1. INTRODUCTION... 4 1.1. Scope of This Paper... 4 1.2. What Cluster Analysis Is... 4 1.3. What Cluster Analysis Is Not... 5 2. OVERVIEW...

More information

Cluster Analysis: Basic Concepts and Methods

Cluster Analysis: Basic Concepts and Methods 10 Cluster Analysis: Basic Concepts and Methods Imagine that you are the Director of Customer Relationships at AllElectronics, and you have five managers working for you. You would like to organize all

More information

Chapter ML:XI (continued)

Chapter ML:XI (continued) Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

More information

Client Based Power Iteration Clustering Algorithm to Reduce Dimensionality in Big Data

Client Based Power Iteration Clustering Algorithm to Reduce Dimensionality in Big Data Client Based Power Iteration Clustering Algorithm to Reduce Dimensionalit in Big Data Jaalatchum. D 1, Thambidurai. P 1, Department of CSE, PKIET, Karaikal, India Abstract - Clustering is a group of objects

More information

Neural Networks Lesson 5 - Cluster Analysis

Neural Networks Lesson 5 - Cluster Analysis Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm michele.scarpiniti@uniroma1.it Rome, 29

More information

Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

Comparison and Analysis of Various Clustering Methods in Data mining On Education data set Using the weak tool

Comparison and Analysis of Various Clustering Methods in Data mining On Education data set Using the weak tool Comparison and Analysis of Various Clustering Metho in Data mining On Education data set Using the weak tool Abstract:- Data mining is used to find the hidden information pattern and relationship between

More information

Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009

Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009 Cluster Analysis Alison Merikangas Data Analysis Seminar 18 November 2009 Overview What is cluster analysis? Types of cluster Distance functions Clustering methods Agglomerative K-means Density-based Interpretation

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Clustering: Techniques & Applications. Nguyen Sinh Hoa, Nguyen Hung Son. 15 lutego 2006 Clustering 1

Clustering: Techniques & Applications. Nguyen Sinh Hoa, Nguyen Hung Son. 15 lutego 2006 Clustering 1 Clustering: Techniques & Applications Nguyen Sinh Hoa, Nguyen Hung Son 15 lutego 2006 Clustering 1 Agenda Introduction Clustering Methods Applications: Outlier Analysis Gene clustering Summary and Conclusions

More information

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data An Analysis on Density Based Clustering of Multi Dimensional Spatial Data K. Mumtaz 1 Assistant Professor, Department of MCA Vivekanandha Institute of Information and Management Studies, Tiruchengode,

More information

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is Clustering 15-381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv Bar-Joseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report. Document Clustering. Meryem Uzun-Per Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

More information

Concept of Cluster Analysis

Concept of Cluster Analysis RESEARCH PAPER ON CLUSTER TECHNIQUES OF DATA VARIATIONS Er. Arpit Gupta 1,Er.Ankit Gupta 2,Er. Amit Mishra 3 arpit_jp@yahoo.co.in, ank_mgcgv@yahoo.co.in,amitmishra.mtech@gmail.com Faculty Of Engineering

More information

Data Clustering Techniques Qualifying Oral Examination Paper

Data Clustering Techniques Qualifying Oral Examination Paper Data Clustering Techniques Qualifying Oral Examination Paper Periklis Andritsos University of Toronto Department of Computer Science periklis@cs.toronto.edu March 11, 2002 1 Introduction During a cholera

More information

Going Big in Data Dimensionality:

Going Big in Data Dimensionality: LUDWIG- MAXIMILIANS- UNIVERSITY MUNICH DEPARTMENT INSTITUTE FOR INFORMATICS DATABASE Going Big in Data Dimensionality: Challenges and Solutions for Mining High Dimensional Data Peer Kröger Lehrstuhl für

More information

Classification Techniques (1)

Classification Techniques (1) 10 10 Overview Classification Techniques (1) Today Classification Problem Classification based on Regression Distance-based Classification (KNN) Net Lecture Decision Trees Classification using Rules Quality

More information

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows: Statistics: Rosie Cornish. 2007. 3.1 Cluster Analysis 1 Introduction This handout is designed to provide only a brief introduction to cluster analysis and how it is done. Books giving further details are

More information

A comparison of various clustering methods and algorithms in data mining

A comparison of various clustering methods and algorithms in data mining Volume :2, Issue :5, 32-36 May 2015 www.allsubjectjournal.com e-issn: 2349-4182 p-issn: 2349-5979 Impact Factor: 3.762 R.Tamilselvi B.Sivasakthi R.Kavitha Assistant Professor A comparison of various clustering

More information

Cluster analysis Cosmin Lazar. COMO Lab VUB

Cluster analysis Cosmin Lazar. COMO Lab VUB Cluster analysis Cosmin Lazar COMO Lab VUB Introduction Cluster analysis foundations rely on one of the most fundamental, simple and very often unnoticed ways (or methods) of understanding and learning,

More information

Summary Data Mining & Process Mining (1BM46) Content. Made by S.P.T. Ariesen

Summary Data Mining & Process Mining (1BM46) Content. Made by S.P.T. Ariesen Summary Data Mining & Process Mining (1BM46) Made by S.P.T. Ariesen Content Data Mining part... 2 Lecture 1... 2 Lecture 2:... 4 Lecture 3... 7 Lecture 4... 9 Process mining part... 13 Lecture 5... 13

More information

Data Mining 資 料 探 勘. 分 群 分 析 (Cluster Analysis)

Data Mining 資 料 探 勘. 分 群 分 析 (Cluster Analysis) Data Mining 資 料 探 勘 Tamkang University 分 群 分 析 (Cluster Analysis) DM MI Wed,, (:- :) (B) Min-Yuh Day 戴 敏 育 Assistant Professor 專 任 助 理 教 授 Dept. of Information Management, Tamkang University 淡 江 大 學 資

More information

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

2 Basic Concepts and Techniques of Cluster Analysis

2 Basic Concepts and Techniques of Cluster Analysis The Challenges of Clustering High Dimensional Data * Michael Steinbach, Levent Ertöz, and Vipin Kumar Abstract Cluster analysis divides data into groups (clusters) for the purposes of summarization or

More information

Clustering methods for Big data analysis

Clustering methods for Big data analysis Clustering methods for Big data analysis Keshav Sanse, Meena Sharma Abstract Today s age is the age of data. Nowadays the data is being produced at a tremendous rate. In order to make use of this large-scale

More information

Clustering & Visualization

Clustering & Visualization Chapter 5 Clustering & Visualization Clustering in high-dimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to high-dimensional data.

More information

Data Preprocessing. Week 2

Data Preprocessing. Week 2 Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

More information

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA Prakash Singh 1, Aarohi Surya 2 1 Department of Finance, IIM Lucknow, Lucknow, India 2 Department of Computer Science, LNMIIT, Jaipur,

More information

How To Cluster On A Large Data Set

How To Cluster On A Large Data Set An Ameliorated Partitioning Clustering Algorithm for Large Data Sets Raghavi Chouhan 1, Abhishek Chauhan 2 MTech Scholar, CSE department, NRI Institute of Information Science and Technology, Bhopal, India

More information

Cross-validation for detecting and preventing overfitting

Cross-validation for detecting and preventing overfitting Cross-validation for detecting and preventing overfitting Note to other teachers and users of these slides. Andrew would be delighted if ou found this source material useful in giving our own lectures.

More information

Outlier Detection in Clustering

Outlier Detection in Clustering Outlier Detection in Clustering Svetlana Cherednichenko 24.01.2005 University of Joensuu Department of Computer Science Master s Thesis TABLE OF CONTENTS 1. INTRODUCTION...1 1.1. BASIC DEFINITIONS... 1

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

Cluster Analysis using R

Cluster Analysis using R Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other

More information

Clustering Techniques: A Brief Survey of Different Clustering Algorithms

Clustering Techniques: A Brief Survey of Different Clustering Algorithms Clustering Techniques: A Brief Survey of Different Clustering Algorithms Deepti Sisodia Technocrates Institute of Technology, Bhopal, India Lokesh Singh Technocrates Institute of Technology, Bhopal, India

More information

Data Mining 5. Cluster Analysis

Data Mining 5. Cluster Analysis Data Mining 5. Cluster Analysis 5.2 Fall 2009 Instructor: Dr. Masoud Yaghini Outline Data Structures Interval-Valued (Numeric) Variables Binary Variables Categorical Variables Ordinal Variables Variables

More information

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi-110 012 dcmishra@iasri.res.in What is Learning? "Learning denotes changes in a system that enable

More information

Distances, Clustering, and Classification. Heatmaps

Distances, Clustering, and Classification. Heatmaps Distances, Clustering, and Classification Heatmaps 1 Distance Clustering organizes things that are close into groups What does it mean for two genes to be close? What does it mean for two samples to be

More information

0.1 What is Cluster Analysis?

0.1 What is Cluster Analysis? Cluster Analysis 1 2 0.1 What is Cluster Analysis? Cluster analysis is concerned with forming groups of similar objects based on several measurements of different kinds made on the objects. The key idea

More information

Robust Outlier Detection Technique in Data Mining: A Univariate Approach

Robust Outlier Detection Technique in Data Mining: A Univariate Approach Robust Outlier Detection Technique in Data Mining: A Univariate Approach Singh Vijendra and Pathak Shivani Faculty of Engineering and Technology Mody Institute of Technology and Science Lakshmangarh, Sikar,

More information

Territorial Analysis for Ratemaking. Philip Begher, Dario Biasini, Filip Branitchev, David Graham, Erik McCracken, Rachel Rogers and Alex Takacs

Territorial Analysis for Ratemaking. Philip Begher, Dario Biasini, Filip Branitchev, David Graham, Erik McCracken, Rachel Rogers and Alex Takacs Territorial Analysis for Ratemaking by Philip Begher, Dario Biasini, Filip Branitchev, David Graham, Erik McCracken, Rachel Rogers and Alex Takacs Department of Statistics and Applied Probability University

More information

How To Perform An Ensemble Analysis

How To Perform An Ensemble Analysis Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier

More information

Linköpings Universitet - ITN TNM033 2011-11-30 DBSCAN. A Density-Based Spatial Clustering of Application with Noise

Linköpings Universitet - ITN TNM033 2011-11-30 DBSCAN. A Density-Based Spatial Clustering of Application with Noise DBSCAN A Density-Based Spatial Clustering of Application with Noise Henrik Bäcklund (henba892), Anders Hedblom (andh893), Niklas Neijman (nikne866) 1 1. Introduction Today data is received automatically

More information

Data Exploration and Preprocessing. Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)

Data Exploration and Preprocessing. Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) Data Exploration and Preprocessing Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

More information

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data

More information

The Role of Visualization in Effective Data Cleaning

The Role of Visualization in Effective Data Cleaning The Role of Visualization in Effective Data Cleaning Yu Qian Dept. of Computer Science The University of Texas at Dallas Richardson, TX 75083-0688, USA qianyu@student.utdallas.edu Kang Zhang Dept. of Computer

More information

A Comparative Study of clustering algorithms Using weka tools

A Comparative Study of clustering algorithms Using weka tools A Comparative Study of clustering algorithms Using weka tools Bharat Chaudhari 1, Manan Parikh 2 1,2 MECSE, KITRC KALOL ABSTRACT Data clustering is a process of putting similar data into groups. A clustering

More information

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns Outline Part 1: of data clustering Non-Supervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties

More information

Data Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan

Data Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan Data Mining: An Overview David Madigan http://www.stat.columbia.edu/~madigan Overview Brief Introduction to Data Mining Data Mining Algorithms Specific Eamples Algorithms: Disease Clusters Algorithms:

More information

On Clustering Validation Techniques

On Clustering Validation Techniques Journal of Intelligent Information Systems, 17:2/3, 107 145, 2001 c 2001 Kluwer Academic Publishers. Manufactured in The Netherlands. On Clustering Validation Techniques MARIA HALKIDI mhalk@aueb.gr YANNIS

More information

Statistical Databases and Registers with some datamining

Statistical Databases and Registers with some datamining Unsupervised learning - Statistical Databases and Registers with some datamining a course in Survey Methodology and O cial Statistics Pages in the book: 501-528 Department of Statistics Stockholm University

More information

Data Exploration Data Visualization

Data Exploration Data Visualization Data Exploration Data Visualization What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping to select

More information

CLUSTER ANALYSIS FOR SEGMENTATION

CLUSTER ANALYSIS FOR SEGMENTATION CLUSTER ANALYSIS FOR SEGMENTATION Introduction We all understand that consumers are not all alike. This provides a challenge for the development and marketing of profitable products and services. Not every

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 10 th, 2013 Wolf-Tilo Balke and Kinda El Maarry Institut für Informationssysteme Technische Universität Braunschweig

More information

A Method for Decentralized Clustering in Large Multi-Agent Systems

A Method for Decentralized Clustering in Large Multi-Agent Systems A Method for Decentralized Clustering in Large Multi-Agent Systems Elth Ogston, Benno Overeinder, Maarten van Steen, and Frances Brazier Department of Computer Science, Vrije Universiteit Amsterdam {elth,bjo,steen,frances}@cs.vu.nl

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

Clustering. Chapter 7. 7.1 Introduction to Clustering Techniques. 7.1.1 Points, Spaces, and Distances

Clustering. Chapter 7. 7.1 Introduction to Clustering Techniques. 7.1.1 Points, Spaces, and Distances 240 Chapter 7 Clustering Clustering is the process of examining a collection of points, and grouping the points into clusters according to some distance measure. The goal is that points in the same cluster

More information

A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases

A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases Published in the Proceedings of 14th International Conference on Data Engineering (ICDE 98) A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases Xiaowei Xu, Martin Ester, Hans-Peter

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

A Study of Web Log Analysis Using Clustering Techniques

A Study of Web Log Analysis Using Clustering Techniques A Study of Web Log Analysis Using Clustering Techniques Hemanshu Rana 1, Mayank Patel 2 Assistant Professor, Dept of CSE, M.G Institute of Technical Education, Gujarat India 1 Assistant Professor, Dept

More information

Dynamical Clustering of Personalized Web Search Results

Dynamical Clustering of Personalized Web Search Results Dynamical Clustering of Personalized Web Search Results Xuehua Shen CS Dept, UIUC xshen@cs.uiuc.edu Hong Cheng CS Dept, UIUC hcheng3@uiuc.edu Abstract Most current search engines present the user a ranked

More information

Authors. Data Clustering: Algorithms and Applications

Authors. Data Clustering: Algorithms and Applications Authors Data Clustering: Algorithms and Applications 2 Contents 1 Grid-based Clustering 1 Wei Cheng, Wei Wang, and Sandra Batista 1.1 Introduction................................... 1 1.2 The Classical

More information

Solving Quadratic Equations by Graphing. Consider an equation of the form. y ax 2 bx c a 0. In an equation of the form

Solving Quadratic Equations by Graphing. Consider an equation of the form. y ax 2 bx c a 0. In an equation of the form SECTION 11.3 Solving Quadratic Equations b Graphing 11.3 OBJECTIVES 1. Find an ais of smmetr 2. Find a verte 3. Graph a parabola 4. Solve quadratic equations b graphing 5. Solve an application involving

More information

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

More information

A Comparative Analysis of Various Clustering Techniques used for Very Large Datasets

A Comparative Analysis of Various Clustering Techniques used for Very Large Datasets A Comparative Analysis of Various Clustering Techniques used for Very Large Datasets Preeti Baser, Assistant Professor, SJPIBMCA, Gandhinagar, Gujarat, India 382 007 Research Scholar, R. K. University,

More information

Forschungskolleg Data Analytics Methods and Techniques

Forschungskolleg Data Analytics Methods and Techniques Forschungskolleg Data Analytics Methods and Techniques Martin Hahmann, Gunnar Schröder, Phillip Grosse Prof. Dr.-Ing. Wolfgang Lehner Why do we need it? We are drowning in data, but starving for knowledge!

More information

Downloaded from www.heinemann.co.uk/ib. equations. 2.4 The reciprocal function x 1 x

Downloaded from www.heinemann.co.uk/ib. equations. 2.4 The reciprocal function x 1 x Functions and equations Assessment statements. Concept of function f : f (); domain, range, image (value). Composite functions (f g); identit function. Inverse function f.. The graph of a function; its

More information

SoSe 2014: M-TANI: Big Data Analytics

SoSe 2014: M-TANI: Big Data Analytics SoSe 2014: M-TANI: Big Data Analytics Lecture 4 21/05/2014 Sead Izberovic Dr. Nikolaos Korfiatis Agenda Recap from the previous session Clustering Introduction Distance mesures Hierarchical Clustering

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

USING THE AGGLOMERATIVE METHOD OF HIERARCHICAL CLUSTERING AS A DATA MINING TOOL IN CAPITAL MARKET 1. Vera Marinova Boncheva

USING THE AGGLOMERATIVE METHOD OF HIERARCHICAL CLUSTERING AS A DATA MINING TOOL IN CAPITAL MARKET 1. Vera Marinova Boncheva 382 [7] Reznik, A, Kussul, N., Sokolov, A.: Identification of user activity using neural networks. Cybernetics and computer techniques, vol. 123 (1999) 70 79. (in Russian) [8] Kussul, N., et al. : Multi-Agent

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information

Unsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning

Unsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning Unsupervised Learning and Data Mining Unsupervised Learning and Data Mining Clustering Decision trees Artificial neural nets K-nearest neighbor Support vectors Linear regression Logistic regression...

More information

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step

More information