Text Analytics. Text Clustering. Ulf Leser

Size: px
Start display at page:

Download "Text Analytics. Text Clustering. Ulf Leser"

Transcription

1 Text Analytics Text Clustering Ulf Leser

2 Content of this Lecture (Text) clustering Cluster quality Clustering algorithms Application Ulf Leser: Text Analytics, Winter Semester 2010/2011 2

3 Clustering Clustering groups objects into (usually disjoint) sets Intuitively, each set should contain objects that are similar to each other and dissimilar to objects in any other set We need a similarity or distance function That is the only text-specific bit the rest is just clustering Often called unsupervised learning We don t know how many sets/classes there are (if there are any) We don t know how those sets should look like Ulf Leser: Text Analytics, Winter Semester 2010/2011 3

4 Nice Ulf Leser: Text Analytics, Winter Semester 2010/2011 4

5 Nice Not Nice In text clustering, we typically have > dimensions Ulf Leser: Text Analytics, Winter Semester 2010/2011 5

6 Text Clustering Applications Explorative data analysis Learn about the structure within your document collection Corpus preprocessing Clustering provides a semantic index to corpus Group docs into clusters to ease navigation Retrieval speed: Index only one representative per cluster Processing of search results Cluster all hits into groups of similar hits (in particular: duplicates) Improving search recall Return doc and all members of its cluster Has similarity to automatic relevance feedback using top-k docs Word sense disambiguation The different senses of a word should appear as clusters Ulf Leser: Text Analytics, Winter Semester 2010/2011 6

7 Processing Search Results The research breakthrough was labeling the clusters, i.e., grouping search results into folder topics [Clusty.com blog] Ulf Leser: Text Analytics, Winter Semester 2010/2011 7

8 Ulf Leser: Text Analytics, Winter Semester 2010/ Similarity between Documents Clustering requires a distance function Must be a metric: d(x,x)=0, d(x,y)=d(y,x), d(x,y) d(x,z)+d(z,y) In contrast to search, we now compare two docs And not a document and a query Nevertheless, the same methods are often used Vector space, TF*IDF, cosine distance ( ) ] [ * ] [ ] [ ]* [ * ), cos( ), ( = = = i d i d i d i d d d d d d d d d sim o

9 Clustering Speed In Information Retrieval We compare a vector with ~3 non-null values (query) with one with ~500 non-null values Usage inverted files to pick docs that have an overlap with the query (~3 log-scale searches in main memory) In clustering We compare ~500 nnv with one with ~500 nnv We need to compare all docs with all docs For some clustering methods; in any case many docs with many docs Inverted files offer much less speed-up compared to scanning To increase speed, feature selection is essential E.g., use the most descriptive terms Perform LSI on docs before clustering Ulf Leser: Text Analytics, Winter Semester 2010/2011 9

10 Cluster Labels To support user interaction, clusters need to have a name Names should capture the topic (semantic) of the cluster Many possibilities Chose term with highest TF*IDF value in cluster E.g. TF computed as average or considering all docs in cluster as one Chose term with highest TF*IDF value in cluster centre Apply statistical test to find terms whose TF*IDF distribution deviates the most between clusters For instance, test each cluster against all others Requires comparison of each cluster with each cluster for each term Only possible if applied after strict pre-filtering Report top-k terms (by whatever method) Extend to bi-grams Ulf Leser: Text Analytics, Winter Semester 2010/

11 Content of this Lecture (Text) clustering Cluster quality Clustering algorithms Application Ulf Leser: Text Analytics, Winter Semester 2010/

12 How many Clusters? Ulf Leser: Text Analytics, Winter Semester 2010/

13 Maybe 2? Ulf Leser: Text Analytics, Winter Semester 2010/

14 Maybe 4? Ulf Leser: Text Analytics, Winter Semester 2010/

15 Maybe 4 and One Outlier? Ulf Leser: Text Analytics, Winter Semester 2010/

16 Maybe 5? Ulf Leser: Text Analytics, Winter Semester 2010/

17 Maybe 4 and 2 at Different Levels? Ulf Leser: Text Analytics, Winter Semester 2010/

18 Which Distance Measure did you Use? Ulf Leser: Text Analytics, Winter Semester 2010/

19 Quality of a Clustering There is no true number of clusters We need a method to define which number of clusters we find reasonable (for the data) and which not Finding the clusters is a second, actually somewhat simpler (means: more clearly defined) problem In real applications, you cannot find this by looking at the data Because clustering should help you in looking at the data Explorative usage of course possible Ulf Leser: Text Analytics, Winter Semester 2010/

20 Distance to a Cluster We frequently will have to compute the distance between a point o and a cluster c, d(o,c) And sometimes distances between clusters see hier. clustering Various methods Let c m me the numerical center of a cluster Let c med me the central point of a cluster Let d(o 1,o 2 ) be the distance between two points d ( o, c) = d( o, cm) d ( o, c) = d( o, cmed d( o, c) = p c d( o, c)* ) 1 c Ulf Leser: Text Analytics, Winter Semester 2010/

21 Quality of a Clustering First Approach We could measure cluster quality by the average distance of objects within a cluster Definition Let f be a clustering of a set of objects O into a set of classes C with C =k. The k-score q k of f is q k ( f ) = c C f ( o) = c d( o, c) Remark We could define the k-score as the average distance across all objects pairs within a cluster no centers necessary Ulf Leser: Text Analytics, Winter Semester 2010/

22 6-Score Certainly better than the 2/4/5-score we have seen Thus: Chose the k with the best k-score? Ulf Leser: Text Analytics, Winter Semester 2010/

23 Disadvantage Always has a trivially optimal solution: k= O Score still be useful to compare different clusterings for the same k Ulf Leser: Text Analytics, Winter Semester 2010/

24 Silhouette Considering only the inner-cluster distances is not enough Points in a cluster should be close to each other but also far away from points in other clusters One way: Extend k-score, for instance to: q k ( f ) = d( o, c) c C f ( o) = c i C, i c C 1 But: Sensitive to outliers clusters far away from the others Alternative: Silhouette of a clustering d( o, i)* Punish points that are not uniquely assigned to one cluster Captures how clearly points are part of their cluster 1 Ulf Leser: Text Analytics, Winter Semester 2010/

25 More Formal Definition Let f: O C with C arbitrary. We define Inner score: a(o) = d(o,f(o)) Outer score: b(o) = min( d(o,c i )) with c i f(o) The silhouette of o, s(o), is defined as s( o) = b( o) a( o) max( a( o), b( o)) The silhouette s(f) of f is defined as s( f ) = s( o) Note It holds: -1 s(o) 1 s(o) 0: Point right between two cluster s(o) ~ 1: Point very close to only one (its own) cluster s(o) ~ -1: Point far away from its own cluster Ulf Leser: Text Analytics, Winter Semester 2010/

26 Behavior Silhouette is not always better / worse for more clusters s(o) probably higher s(o) probably lower Ulf Leser: Text Analytics, Winter Semester 2010/

27 Not the End Intuitive clusters need not be hyper-spheres Clusters need not have convex shapes Cluster centre need not be part of a cluster Requires completely different quality metrics Definition must fit to the data/application Not used in text clustering To my knowldge Source: [FPPS96] Ulf Leser: Text Analytics, Winter Semester 2010/

28 Content of this Lecture Text clustering Cluster quality Clustering algorithms Hierarchical clustering K-means Soft clustering: EM algorithm Application Ulf Leser: Text Analytics, Winter Semester 2010/

29 Classes of Cluster Algorithms Hierarchical clustering Iteratively creates a hierarchy of clusters Bottom-Up: Start from D clusters and merge until only 1 remains Top-Down: Start from one cluster and split Or until some stop criterion is met Partitioning Heuristically partition all objects in k clusters Guess a first partitioning and improve iteratively k is a parameter of the method, not a result Other Graph-Theoretic: Max-Cut (partitioning) etc. Density-base clustering Ulf Leser: Text Analytics, Winter Semester 2010/

30 Hierarchical Clustering Also called UPGMA Unweighted Pair-group method with arithmetic mean We only discuss the bottom-up approach Computes a binary tree (dendrogram) Simple algorithm Compute distance matrix M Choose pair d 1, d 2 with smallest distance Compute x = avg(d 1,d 2 ) (the centre point) Remove d 1, d 2 from M Insert x into M Distance between x and any d in M: Average distance between d 1 and d and d 2 and d Loop until M is empty Ulf Leser: Text Analytics, Winter Semester 2010/

31 Ulf Leser: Text Analytics, Winter Semester 2010/ Example ABCDEFG A B. C.. D... E... F... G... (B,D) a ACEFGa A C. E.. F... G... a... A B C D E F G ACGab A C. G.. a... b... (E,F) b A B C D E F G (A,b) c CGac C G. a.. c... A B C D E F G (C,G) d acd a c. d.. A B C D E F G (d,c) e A B C D E F G (a,e) f A B C D E F G ae a e.

32 Visual Ulf Leser: Text Analytics, Winter Semester 2010/

33 Properties Advantages Simple and intuitive Number of clusters is not an input of the method Usually good quality clusters (which clusters?) Disadvantage Does not really generate clusters Very expensive; let n= O, m= K Computing M requires O(n 2 ) space and O(mn 2 ) time Clustering requires O(m*n 2 *log(n)) Why? Not applicable as such to large doc sets Ulf Leser: Text Analytics, Winter Semester 2010/

34 Intuition Hierarchical clustering organizes a doc collection Ideally, hierarchical clustering directly creates a hierarchical and intuitive directory of the corpus Not easy Many, many ways to group objects hierarchical clustering will choose just one No guarantee that clusters make sense semantically Ulf Leser: Text Analytics, Winter Semester 2010/

35 Branch Length Use branch length to symbolize distance Outlier detection Outlier Ulf Leser: Text Analytics, Winter Semester 2010/

36 Variations We used the distance between the centers of two clusters to decide about distance between clusters Other alternatives (incurring different complexities) Single Link: Distance of the two closest docs in both clusters Complete Link: Distance of the two furthest docs Average Link: Average distance between pairs of docs from both clusters Centroid: Distance between centre points Ulf Leser: Text Analytics, Winter Semester 2010/

37 Variations We used the distance between the centers of clusters to decide about distance between clusters Other alternatives (incurring different complexities) Single Link: Distance of the two closest docs in both clusters Complete Link: Distance of the two furthest docs Average Link: Average distance between pairs of docs from both clusters Centroid: Distance between centre points Ulf Leser: Text Analytics, Winter Semester 2010/

38 Variations We used the distance between the centers of clusters to decide about distance between clusters Other alternatives (incurring different complexities) Single Link: Distance of the two closest docs in both clusters Complete Link: Distance of the two furthest docs Average Link: Average distance between pairs of docs from both clusters Centroid: Distance between centre points Ulf Leser: Text Analytics, Winter Semester 2010/

39 Variations We used the distance between the centers of clusters to decide about distance between clusters Other alternatives (incurring different complexities) Single Link: Distance of the two closest docs in both clusters Complete Link: Distance of the two furthest docs Average Link: Average distance between pairs of docs from both clusters Centroid: Distance between centre points Ulf Leser: Text Analytics, Winter Semester 2010/

40 Single-link versus Complete-link Ulf Leser: Text Analytics, Winter Semester 2010/

41 More Properties Single-link Optimizes a local criterion (only look at the closest pair) Similar to computing a minimal spanning tree With cuts at most expensive branches as going down the hierarchy Creates elongated clusters (chaining effect) Complete-link Optimizes a global criterion (look at the worst pair) Creates more compact, more convex, spherical clusters Ulf Leser: Text Analytics, Winter Semester 2010/

42 Content of this Lecture Text clustering Cluster quality Clustering algorithms Hierarchical clustering K-means Soft clustering: EM algorithm Application Ulf Leser: Text Analytics, Winter Semester 2010/

43 K-Means Probably is the best known clustering algorithm Partitioning method Requires the number k of clusters to be predefined Algorithm Fix k Guess k cluster centers at random Can use k docs or k random points in doc-space Loop forever Assign all docs to their closest cluster center If no doc has changed its assignment, stop Or if sufficiently few docs have changed their assignment Otherwise, compute new cluster centres Ulf Leser: Text Analytics, Winter Semester 2010/

44 Example 1 k=3 Choose random start points Quelle: Stanford, CS 262 Computational Genomics Ulf Leser: Text Analytics, Winter Semester 2010/

45 Example 2 Assign docs to closest cluster centre Ulf Leser: Text Analytics, Winter Semester 2010/

46 Example 3 Compute new cluster centre Ulf Leser: Text Analytics, Winter Semester 2010/

47 Example 4 Ulf Leser: Text Analytics, Winter Semester 2010/

48 Example 5 Ulf Leser: Text Analytics, Winter Semester 2010/

49 Example 6 Converged Ulf Leser: Text Analytics, Winter Semester 2010/

50 Properties Usually, k-means converges quite fast Let l be the number of iterations Complexity: O(l*k*n) Assignment: n*k distance computations New centers: Summing up n vectors k times Choosing the right start points is important K-Means is a greedy heuristic and only finds local optima Option 1: Start several times with different start points Option 2: Compute hierarchical clustering on random sample and choose cluster centers as start points ( Buckshot algorithm) How to choose k? Try for different k and use quality score(s) Ulf Leser: Text Analytics, Winter Semester 2010/

51 k-means and Outlier Try for k=3 Ulf Leser: Text Analytics, Winter Semester 2010/

52 Help: K-Medoid Chose the doc in the middle of a cluster as representative PAM: Partitioning around Medoids Advantage Less sensitive to outliers Also works for non-metric spaces as no new center point needs to be computed Disadvantage: Considerable more expensive Finding the median doc requires computing all pair-wise distances in each cluster in each round Overall complexity is O(n 3 ) Can save re-computations at the expense of more space Ulf Leser: Text Analytics, Winter Semester 2010/

53 k-medoid and Outlier Ulf Leser: Text Analytics, Winter Semester 2010/

54 Content of this Lecture Text clustering Cluster quality Clustering algorithms Hierarchical clustering K-means Soft clustering: EM algorithm Application Ulf Leser: Text Analytics, Winter Semester 2010/

55 Soft Clustering We always assumed docs are assigned exactly one cluster Probabilistic interpretation: All docs pertain to all clusters with a certain probability Generative model Assume we have k doc-producing devices Such as authors, topics, Each device produces docs that are normally distributed in vector space with a device-specific mean and variance Assume that k devices have produced D documents Clustering can be interpreted as re-covering mean and variance of each distribution / device Solution: Expectation Maximization Algorithm (EM) Ulf Leser: Text Analytics, Winter Semester 2010/

56 Expectation Maximization (rough sketch, no math) EM optimizes the set of parameters Θ of a multi-variant normal distribution (mean and variance of k clusters) given sample data Iterative process with two phases Guess an initial Θ Expectation: Assign all docs its most likely generator based on Θ Maximization: Compute new optimal Θ based on assignment Using MLE or other estimation techniques Iterate through both steps until convergence Finds a local optimum, convergence guaranteed K-Means: Special case of EM k normal distributions with different means but equal variance Ulf Leser: Text Analytics, Winter Semester 2010/

57 Content of this Lecture Text clustering Cluster quality Clustering algorithms Application Clustering Phenotypes Ulf Leser: Text Analytics, Winter Semester 2010/

58 Mining Phenotypes for Function Prediction Ulf Leser: Text Analytics, Winter Semester 2010/

59 Or Source: Ulf Leser: Text Analytics, Winter Semester 2010/

60 Mining Phenotypes: General Idea Established Genotype Gene A Phenotype Genotype Gene B Phenotype? Established Genes with similar genotypes likely have similar phenotypes Question If genes generate very similar phenotypes do they have the same genotype? Ulf Leser: Text Analytics, Winter Semester 2010/

61 Approach GO Annotation Gene A Phenotype Description Prediction Inference Similarity GO Annotation Gene B Phenotype Description Ulf Leser: Text Analytics, Winter Semester 2010/

62 Phenodocs 411,102 phenotypes Short: <250 words Remove all phenotypes associated to more than one gene (~500) PhenomicDB Remove small phenotypes Remove multi-gene phenotypes Remove stop words Stemming 39,610 phenodocs for 15,426 genes Phenodocs Ulf Leser: Text Analytics, Winter Semester 2010/

63 K-Means Clustering Hierarchical clustering would require ~ * = comparisons K-Means: Simple, iterative algorithm Number of clusters must be predefined We experimented with clusters Ulf Leser: Text Analytics, Winter Semester 2010/

64 Properties: Phenodoc Similarity of Genes Genes in phenoclusters Control (Random selection) Pair-wise similarity scores of phenodocs of genes in the same cluster, sorted by score Result: Phenodocs of genes in phenoclusters are highly similar to each other Ulf Leser: Text Analytics, Winter Semester 2010/

65 PPI: Inter-Connectedness Interacting proteins often share function PPI from BIOGRID database Not at all a complete dataset In >200 clusters, >30% of genes interact with each other Control (random groups): 3 clusters Result: Genes in phenoclusters interact with each other much more often than expected by chance Proteins and interactions from BioGrid. Red proteins have no phenotypes in PhenomicDB Ulf Leser: Text Analytics, Winter Semester 2010/

66 Coherence of Functional Annotation Comparison of GO annotation of genes in phenoclusters Data from Entrez Gene Similarity of two GO terms: Normalized n# of shared ancestors Similarity of two genes: Average of the top-k GO pairs >200 clusters with score >0.4 Control: 2 clusters Results: Genes in phenoclusters have a much higher coherence in functional annotation than expected by chance Gene Ontology Molecular Function Biological Process Physiological Process Catalytic Activity Cellular Process Binding Metabolism Transferase Activity Nucleotide Binding Protein Metabolism Kinase Activity Cell Communication Signal Transduction Protein Modification Ulf Leser: Text Analytics, Winter Semester 2010/

67 Function Prediction Can increased functional coherence of clusters be exploited for function prediction? General approach Compute phenoclusters For each cluster, compute set of associated genes (gene cluster) In each gene cluster, predict common GO terms to all genes Common: annotated to >50% of genes in the cluster Filtering clusters Idea: Find clusters a-priori which give hope for very good results Filter 1: Only clusters with >2 members and at least one common GO term Filter 2: Only clusters with GO coherence>0.4 Filter 3: Only clusters with PPI-connectedness >33% Ulf Leser: Text Analytics, Winter Semester 2010/

68 Evaluation How can we know how good we are? Cross-validation Separate genes in training (90%) and test (10%) Remove annotation from genes in test set Build clusters and predict functions on training set Compare predicted with removed annotations Precision and recall Repeat and average results Macro-average Note: This punishes new and potentially valid annotations Ulf Leser: Text Analytics, Winter Semester 2010/

69 Results for Different Filters (Filter 1) (Filter 1 & Filter 2) (Filter 1 & Filter 3) # of groups # of terms # of genes Precision 67.91% 62.52% 60.52% Recall 22.98% 26.16% 19.78% What if we consider predicted terms to be correct that are a little more general than the removed terms (filter 1)? One step more general: 75.6% precision, 28.7% recall Two steps: 76.3% precision, 30.7% recall The less stringent GO equality, the better the results This is a common trick in studies using GO Ulf Leser: Text Analytics, Winter Semester 2010/

70 Results for Different Cluster Sizes With increasing k Clusters are smaller Number of predicted terms increases Clusters are more homogeneous Number of genes which receive annotations stays roughly the same Precision decreases slowly, recall increases Effect of the rapid increase in number of predictions Ulf Leser: Text Analytics, Winter Semester 2010/

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 10 th, 2013 Wolf-Tilo Balke and Kinda El Maarry Institut für Informationssysteme Technische Universität Braunschweig

More information

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

More information

Distances, Clustering, and Classification. Heatmaps

Distances, Clustering, and Classification. Heatmaps Distances, Clustering, and Classification Heatmaps 1 Distance Clustering organizes things that are close into groups What does it mean for two genes to be close? What does it mean for two samples to be

More information

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report. Document Clustering. Meryem Uzun-Per Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

More information

Cluster Analysis. Isabel M. Rodrigues. Lisboa, 2014. Instituto Superior Técnico

Cluster Analysis. Isabel M. Rodrigues. Lisboa, 2014. Instituto Superior Técnico Instituto Superior Técnico Lisboa, 2014 Introduction: Cluster analysis What is? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from

More information

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi-110 012 dcmishra@iasri.res.in What is Learning? "Learning denotes changes in a system that enable

More information

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

More information

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distance-based K-means, K-medoids,

More information

Unsupervised learning: Clustering

Unsupervised learning: Clustering Unsupervised learning: Clustering Salissou Moutari Centre for Statistical Science and Operational Research CenSSOR 17 th September 2013 Unsupervised learning: Clustering 1/52 Outline 1 Introduction What

More information

Clustering UE 141 Spring 2013

Clustering UE 141 Spring 2013 Clustering UE 141 Spring 013 Jing Gao SUNY Buffalo 1 Definition of Clustering Finding groups of obects such that the obects in a group will be similar (or related) to one another and different from (or

More information

Neural Networks Lesson 5 - Cluster Analysis

Neural Networks Lesson 5 - Cluster Analysis Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm michele.scarpiniti@uniroma1.it Rome, 29

More information

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?

More information

Dynamical Clustering of Personalized Web Search Results

Dynamical Clustering of Personalized Web Search Results Dynamical Clustering of Personalized Web Search Results Xuehua Shen CS Dept, UIUC xshen@cs.uiuc.edu Hong Cheng CS Dept, UIUC hcheng3@uiuc.edu Abstract Most current search engines present the user a ranked

More information

Chapter 7. Cluster Analysis

Chapter 7. Cluster Analysis Chapter 7. Cluster Analysis. What is Cluster Analysis?. A Categorization of Major Clustering Methods. Partitioning Methods. Hierarchical Methods 5. Density-Based Methods 6. Grid-Based Methods 7. Model-Based

More information

Unsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning

Unsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning Unsupervised Learning and Data Mining Unsupervised Learning and Data Mining Clustering Decision trees Artificial neural nets K-nearest neighbor Support vectors Linear regression Logistic regression...

More information

Cluster Analysis: Advanced Concepts

Cluster Analysis: Advanced Concepts Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means

More information

Data Warehousing und Data Mining

Data Warehousing und Data Mining Data Warehousing und Data Mining Multidimensionale Indexstrukturen Ulf Leser Wissensmanagement in der Bioinformatik Content of this Lecture Multidimensional Indexing Grid-Files Kd-trees Ulf Leser: Data

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical

More information

A comparison of various clustering methods and algorithms in data mining

A comparison of various clustering methods and algorithms in data mining Volume :2, Issue :5, 32-36 May 2015 www.allsubjectjournal.com e-issn: 2349-4182 p-issn: 2349-5979 Impact Factor: 3.762 R.Tamilselvi B.Sivasakthi R.Kavitha Assistant Professor A comparison of various clustering

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28 Recognition Topics that we will try to cover: Indexing for fast retrieval (we still owe this one) History of recognition techniques Object classification Bag-of-words Spatial pyramids Neural Networks Object

More information

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns Outline Part 1: of data clustering Non-Supervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties

More information

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is Clustering 15-381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv Bar-Joseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is

More information

Chapter ML:XI (continued)

Chapter ML:XI (continued) Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

More information

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

More information

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows: Statistics: Rosie Cornish. 2007. 3.1 Cluster Analysis 1 Introduction This handout is designed to provide only a brief introduction to cluster analysis and how it is done. Books giving further details are

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining /8/ What is Cluster

More information

Clustering Connectionist and Statistical Language Processing

Clustering Connectionist and Statistical Language Processing Clustering Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes Clustering p.1/21 Overview clustering vs. classification supervised

More information

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based

More information

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering Overview Prognostic Models and Data Mining in Medicine, part I Cluster Analsis What is Cluster Analsis? K-Means Clustering Hierarchical Clustering Cluster Validit Eample: Microarra data analsis 6 Summar

More information

Cluster Analysis: Basic Concepts and Algorithms

Cluster Analysis: Basic Concepts and Algorithms Cluster Analsis: Basic Concepts and Algorithms What does it mean clustering? Applications Tpes of clustering K-means Intuition Algorithm Choosing initial centroids Bisecting K-means Post-processing Strengths

More information

A successful market segmentation initiative answers the following critical business questions: * How can we a. Customer Status.

A successful market segmentation initiative answers the following critical business questions: * How can we a. Customer Status. MARKET SEGMENTATION The simplest and most effective way to operate an organization is to deliver one product or service that meets the needs of one type of customer. However, to the delight of many organizations

More information

Statistical Databases and Registers with some datamining

Statistical Databases and Registers with some datamining Unsupervised learning - Statistical Databases and Registers with some datamining a course in Survey Methodology and O cial Statistics Pages in the book: 501-528 Department of Statistics Stockholm University

More information

Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

More information

CLUSTER ANALYSIS FOR SEGMENTATION

CLUSTER ANALYSIS FOR SEGMENTATION CLUSTER ANALYSIS FOR SEGMENTATION Introduction We all understand that consumers are not all alike. This provides a challenge for the development and marketing of profitable products and services. Not every

More information

Cluster Analysis: Basic Concepts and Methods

Cluster Analysis: Basic Concepts and Methods 10 Cluster Analysis: Basic Concepts and Methods Imagine that you are the Director of Customer Relationships at AllElectronics, and you have five managers working for you. You would like to organize all

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Cluster Analysis: Basic Concepts and Algorithms

Cluster Analysis: Basic Concepts and Algorithms 8 Cluster Analysis: Basic Concepts and Algorithms Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should

More information

BIRCH: An Efficient Data Clustering Method For Very Large Databases

BIRCH: An Efficient Data Clustering Method For Very Large Databases BIRCH: An Efficient Data Clustering Method For Very Large Databases Tian Zhang, Raghu Ramakrishnan, Miron Livny CPSC 504 Presenter: Discussion Leader: Sophia (Xueyao) Liang HelenJr, Birches. Online Image.

More information

Distances between Clustering, Hierarchical Clustering

Distances between Clustering, Hierarchical Clustering Distances between Clustering, Hierarchical Clustering 36-350, Data Mining 14 September 2009 Contents 1 Distances Between Partitions 1 2 Hierarchical clustering 2 2.1 Ward s method............................

More information

An Introduction to Cluster Analysis for Data Mining

An Introduction to Cluster Analysis for Data Mining An Introduction to Cluster Analysis for Data Mining 10/02/2000 11:42 AM 1. INTRODUCTION... 4 1.1. Scope of This Paper... 4 1.2. What Cluster Analysis Is... 4 1.3. What Cluster Analysis Is Not... 5 2. OVERVIEW...

More information

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA Prakash Singh 1, Aarohi Surya 2 1 Department of Finance, IIM Lucknow, Lucknow, India 2 Department of Computer Science, LNMIIT, Jaipur,

More information

Final Project Report

Final Project Report CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes

More information

Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA 30602-2501

Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA 30602-2501 CLUSTER ANALYSIS Steven M. Ho!and Department of Geology, University of Georgia, Athens, GA 30602-2501 January 2006 Introduction Cluster analysis includes a broad suite of techniques designed to find groups

More information

Forschungskolleg Data Analytics Methods and Techniques

Forschungskolleg Data Analytics Methods and Techniques Forschungskolleg Data Analytics Methods and Techniques Martin Hahmann, Gunnar Schröder, Phillip Grosse Prof. Dr.-Ing. Wolfgang Lehner Why do we need it? We are drowning in data, but starving for knowledge!

More information

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 K-Means Cluster Analsis Chapter 3 PPDM Class Tan,Steinbach, Kumar Introduction to Data Mining 4/18/4 1 What is Cluster Analsis? Finding groups of objects such that the objects in a group will be similar

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

Statistical Machine Translation: IBM Models 1 and 2

Statistical Machine Translation: IBM Models 1 and 2 Statistical Machine Translation: IBM Models 1 and 2 Michael Collins 1 Introduction The next few lectures of the course will be focused on machine translation, and in particular on statistical machine translation

More information

B490 Mining the Big Data. 2 Clustering

B490 Mining the Big Data. 2 Clustering B490 Mining the Big Data 2 Clustering Qin Zhang 1-1 Motivations Group together similar documents/webpages/images/people/proteins/products One of the most important problems in machine learning, pattern

More information

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

More information

Standardization and Its Effects on K-Means Clustering Algorithm

Standardization and Its Effects on K-Means Clustering Algorithm Research Journal of Applied Sciences, Engineering and Technology 6(7): 399-3303, 03 ISSN: 040-7459; e-issn: 040-7467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03

More information

A Survey of Clustering Techniques

A Survey of Clustering Techniques A Survey of Clustering Techniques Pradeep Rai Asst. Prof., CSE Department, Kanpur Institute of Technology, Kanpur-0800 (India) Shubha Singh Asst. Prof., MCA Department, Kanpur Institute of Technology,

More information

Fast Matching of Binary Features

Fast Matching of Binary Features Fast Matching of Binary Features Marius Muja and David G. Lowe Laboratory for Computational Intelligence University of British Columbia, Vancouver, Canada {mariusm,lowe}@cs.ubc.ca Abstract There has been

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

HES-SO Master of Science in Engineering. Clustering. Prof. Laura Elena Raileanu HES-SO Yverdon-les-Bains (HEIG-VD)

HES-SO Master of Science in Engineering. Clustering. Prof. Laura Elena Raileanu HES-SO Yverdon-les-Bains (HEIG-VD) HES-SO Master of Science in Engineering Clustering Prof. Laura Elena Raileanu HES-SO Yverdon-les-Bains (HEIG-VD) Plan Motivation Hierarchical Clustering K-Means Clustering 2 Problem Setup Arrange items

More information

Part 2: Community Detection

Part 2: Community Detection Chapter 8: Graph Data Part 2: Community Detection Based on Leskovec, Rajaraman, Ullman 2014: Mining of Massive Datasets Big Data Management and Analytics Outline Community Detection - Social networks -

More information

Gene expression analysis. Ulf Leser and Karin Zimmermann

Gene expression analysis. Ulf Leser and Karin Zimmermann Gene expression analysis Ulf Leser and Karin Zimmermann Ulf Leser: Bioinformatics, Wintersemester 2010/2011 1 Last lecture What are microarrays? - Biomolecular devices measuring the transcriptome of a

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information

Fast Contextual Preference Scoring of Database Tuples

Fast Contextual Preference Scoring of Database Tuples Fast Contextual Preference Scoring of Database Tuples Kostas Stefanidis Department of Computer Science, University of Ioannina, Greece Joint work with Evaggelia Pitoura http://dmod.cs.uoi.gr 2 Motivation

More information

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures ~ Spring~r Table of Contents 1. Introduction.. 1 1.1. What is the World Wide Web? 1 1.2. ABrief History of the Web

More information

Cluster analysis Cosmin Lazar. COMO Lab VUB

Cluster analysis Cosmin Lazar. COMO Lab VUB Cluster analysis Cosmin Lazar COMO Lab VUB Introduction Cluster analysis foundations rely on one of the most fundamental, simple and very often unnoticed ways (or methods) of understanding and learning,

More information

Towards running complex models on big data

Towards running complex models on big data Towards running complex models on big data Working with all the genomes in the world without changing the model (too much) Daniel Lawson Heilbronn Institute, University of Bristol 2013 1 / 17 Motivation

More information

Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms Big Data & Scripting Part II Streaming Algorithms 1, 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set),

More information

Lecture 6 Online and streaming algorithms for clustering

Lecture 6 Online and streaming algorithms for clustering CSE 291: Unsupervised learning Spring 2008 Lecture 6 Online and streaming algorithms for clustering 6.1 On-line k-clustering To the extent that clustering takes place in the brain, it happens in an on-line

More information

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr WEKA Gallirallus Zeland) australis : Endemic bird (New Characteristics Waikato university Weka is a collection

More information

3 An Illustrative Example

3 An Illustrative Example Objectives An Illustrative Example Objectives - Theory and Examples -2 Problem Statement -2 Perceptron - Two-Input Case -4 Pattern Recognition Example -5 Hamming Network -8 Feedforward Layer -8 Recurrent

More information

Hadoop SNS. renren.com. Saturday, December 3, 11

Hadoop SNS. renren.com. Saturday, December 3, 11 Hadoop SNS renren.com Saturday, December 3, 11 2.2 190 40 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December

More information

SoSe 2014: M-TANI: Big Data Analytics

SoSe 2014: M-TANI: Big Data Analytics SoSe 2014: M-TANI: Big Data Analytics Lecture 4 21/05/2014 Sead Izberovic Dr. Nikolaos Korfiatis Agenda Recap from the previous session Clustering Introduction Distance mesures Hierarchical Clustering

More information

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining Data Mining Cluster Analysis: Advanced Concepts and Algorithms Lecture Notes for Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004

More information

SQL Query Evaluation. Winter 2006-2007 Lecture 23

SQL Query Evaluation. Winter 2006-2007 Lecture 23 SQL Query Evaluation Winter 2006-2007 Lecture 23 SQL Query Processing Databases go through three steps: Parse SQL into an execution plan Optimize the execution plan Evaluate the optimized plan Execution

More information

The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy

The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy BMI Paper The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy Faculty of Sciences VU University Amsterdam De Boelelaan 1081 1081 HV Amsterdam Netherlands Author: R.D.R.

More information

Territorial Analysis for Ratemaking. Philip Begher, Dario Biasini, Filip Branitchev, David Graham, Erik McCracken, Rachel Rogers and Alex Takacs

Territorial Analysis for Ratemaking. Philip Begher, Dario Biasini, Filip Branitchev, David Graham, Erik McCracken, Rachel Rogers and Alex Takacs Territorial Analysis for Ratemaking by Philip Begher, Dario Biasini, Filip Branitchev, David Graham, Erik McCracken, Rachel Rogers and Alex Takacs Department of Statistics and Applied Probability University

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

Least-Squares Intersection of Lines

Least-Squares Intersection of Lines Least-Squares Intersection of Lines Johannes Traa - UIUC 2013 This write-up derives the least-squares solution for the intersection of lines. In the general case, a set of lines will not intersect at a

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Clustering Algorithms K-means and its variants Hierarchical clustering

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

Lecture 4 Online and streaming algorithms for clustering

Lecture 4 Online and streaming algorithms for clustering CSE 291: Geometric algorithms Spring 2013 Lecture 4 Online and streaming algorithms for clustering 4.1 On-line k-clustering To the extent that clustering takes place in the brain, it happens in an on-line

More information

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in

More information

CLUSTERING FOR FORENSIC ANALYSIS

CLUSTERING FOR FORENSIC ANALYSIS IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 4, Apr 2014, 129-136 Impact Journals CLUSTERING FOR FORENSIC ANALYSIS

More information

Summary Data Mining & Process Mining (1BM46) Content. Made by S.P.T. Ariesen

Summary Data Mining & Process Mining (1BM46) Content. Made by S.P.T. Ariesen Summary Data Mining & Process Mining (1BM46) Made by S.P.T. Ariesen Content Data Mining part... 2 Lecture 1... 2 Lecture 2:... 4 Lecture 3... 7 Lecture 4... 9 Process mining part... 13 Lecture 5... 13

More information

Vector Quantization and Clustering

Vector Quantization and Clustering Vector Quantization and Clustering Introduction K-means clustering Clustering issues Hierarchical clustering Divisive (top-down) clustering Agglomerative (bottom-up) clustering Applications to speech recognition

More information

Chapter 13: Query Processing. Basic Steps in Query Processing

Chapter 13: Query Processing. Basic Steps in Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

Introduction to Clustering

Introduction to Clustering Introduction to Clustering Yumi Kondo Student Seminar LSK301 Sep 25, 2010 Yumi Kondo (University of British Columbia) Introduction to Clustering Sep 25, 2010 1 / 36 Microarray Example N=65 P=1756 Yumi

More information

Clustering. Chapter 7. 7.1 Introduction to Clustering Techniques. 7.1.1 Points, Spaces, and Distances

Clustering. Chapter 7. 7.1 Introduction to Clustering Techniques. 7.1.1 Points, Spaces, and Distances 240 Chapter 7 Clustering Clustering is the process of examining a collection of points, and grouping the points into clusters according to some distance measure. The goal is that points in the same cluster

More information

Robust Outlier Detection Technique in Data Mining: A Univariate Approach

Robust Outlier Detection Technique in Data Mining: A Univariate Approach Robust Outlier Detection Technique in Data Mining: A Univariate Approach Singh Vijendra and Pathak Shivani Faculty of Engineering and Technology Mody Institute of Technology and Science Lakshmangarh, Sikar,

More information

How To Solve The Cluster Algorithm

How To Solve The Cluster Algorithm Cluster Algorithms Adriano Cruz adriano@nce.ufrj.br 28 de outubro de 2013 Adriano Cruz adriano@nce.ufrj.br () Cluster Algorithms 28 de outubro de 2013 1 / 80 Summary 1 K-Means Adriano Cruz adriano@nce.ufrj.br

More information

They can be obtained in HQJHQH format directly from the home page at: http://www.engene.cnb.uam.es/downloads/kobayashi.dat

They can be obtained in HQJHQH format directly from the home page at: http://www.engene.cnb.uam.es/downloads/kobayashi.dat HQJHQH70 *XLGHG7RXU This document contains a Guided Tour through the HQJHQH platform and it was created for training purposes with respect to the system options and analysis possibilities. It is not intended

More information

Constrained Clustering of Territories in the Context of Car Insurance

Constrained Clustering of Territories in the Context of Car Insurance Constrained Clustering of Territories in the Context of Car Insurance Samuel Perreault Jean-Philippe Le Cavalier Laval University July 2014 Perreault & Le Cavalier (ULaval) Constrained Clustering July

More information

Hierarchical Cluster Analysis Some Basics and Algorithms

Hierarchical Cluster Analysis Some Basics and Algorithms Hierarchical Cluster Analysis Some Basics and Algorithms Nethra Sambamoorthi CRMportals Inc., 11 Bartram Road, Englishtown, NJ 07726 (NOTE: Please use always the latest copy of the document. Click on this

More information

Lecture 3: Finding integer solutions to systems of linear equations

Lecture 3: Finding integer solutions to systems of linear equations Lecture 3: Finding integer solutions to systems of linear equations Algorithmic Number Theory (Fall 2014) Rutgers University Swastik Kopparty Scribe: Abhishek Bhrushundi 1 Overview The goal of this lecture

More information

Performance Metrics for Graph Mining Tasks

Performance Metrics for Graph Mining Tasks Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical

More information

Applied Algorithm Design Lecture 5

Applied Algorithm Design Lecture 5 Applied Algorithm Design Lecture 5 Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Applied Algorithm Design Lecture 5 1 / 86 Approximation Algorithms Pietro Michiardi (Eurecom) Applied Algorithm Design

More information

Principal components analysis

Principal components analysis CS229 Lecture notes Andrew Ng Part XI Principal components analysis In our discussion of factor analysis, we gave a way to model data x R n as approximately lying in some k-dimension subspace, where k

More information

A SURVEY OF TEXT CLUSTERING ALGORITHMS

A SURVEY OF TEXT CLUSTERING ALGORITHMS Chapter 4 A SURVEY OF TEXT CLUSTERING ALGORITHMS Charu C. Aggarwal IBM T. J. Watson Research Center Yorktown Heights, NY charu@us.ibm.com ChengXiang Zhai University of Illinois at Urbana-Champaign Urbana,

More information