Clustering and Outlier Detection

1 Clustering and Outlier Detection

2 Application Examples. Customer segmentation: how to partition customers into groups so that customers in each group are similar, while customers in different groups are dissimilar? Pattern recognition in images: how to identify objects in a satellite image? The pixels of an object are similar to each other in some way. Jian Pei: Data Mining -- Clustering and Outlier Detection 2

3 What Is Clustering? Group data into clusters: objects are similar to one another within the same cluster and dissimilar to the objects in other clusters. Unsupervised learning: no predefined classes. [Figure: Cluster 1, Cluster 2, and outliers.]

4 More Application Examples. A stand-alone tool: exploring data distribution. A preprocessing step for other algorithms. Pattern recognition, spatial data analysis, image processing, market research. WWW: clustering documents; clustering web log data to discover groups of similar access patterns.

5 What Is Good Clustering? High intra-class similarity and low inter-class similarity, depending on the similarity measure. The ability to discover some or all of the hidden patterns.

6 Requirements of Clustering. Scalability. Ability to deal with various types of attributes. Discovery of clusters with arbitrary shape. Minimal requirements for domain knowledge to determine input parameters.

7 Requirements of Clustering (Cont'd). Can deal with noise and outliers. Insensitive to the order of input records. Can handle high dimensionality. Incorporation of user-specified constraints. Interpretability and usability.

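8 Data Matrix. For memory-based clustering. Also called object-by-variable structure. Represents n objects with p variables (attributes, measures) as a relational table: $\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$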

9 Dissimilarity Matrix. For memory-based clustering. Also called object-by-object structure. Stores the proximities of pairs of objects. d(i, j): dissimilarity between objects i and j; nonnegative; close to 0 means similar. $\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$

10 How Good Is Clustering? Dissimilarity/similarity depends on the distance function. Different applications have different functions. Judgment of clustering quality is typically highly subjective.

11 Types of Data in Clustering. Interval-scaled variables. Binary variables. Nominal, ordinal, and ratio variables. Variables of mixed types.

12 Interval-valued Variables. Continuous measurements on a roughly linear scale: weight, height, latitude and longitude coordinates, temperature, etc. Effect of measurement units in attributes: smaller unit → larger variable range → larger effect on the result. Remedy: standardization plus background knowledge.

13 Standardization. Calculate the mean absolute deviation $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$, where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$. Calculate the standardized measurement (z-score) $z_{if} = \frac{x_{if} - m_f}{s_f}$. Mean absolute deviation is more robust: the effect of outliers is reduced but remains detectable.
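
As an illustration of the standardization on this slide, here is a minimal Python sketch. The function name `standardize` and the sample heights are assumptions made for the example; they do not come from the slides.

```python
def standardize(values):
    """Return the z-scores of one variable, using the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n                          # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    # Assumption: the values are not all identical, so s_f is nonzero.
    return [(x - m_f) / s_f for x in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]   # hypothetical data
heights_mm = [10.0 * h for h in heights_cm]        # same data, smaller unit

print(standardize(heights_cm))
print(standardize(heights_mm))
```

Both calls print the same z-scores up to rounding, which is the point of standardizing: the choice of measurement unit no longer changes the variable's influence on the distance.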

14 Similarity and Dissimilarity. Distances are the normally used measures. Minkowski distance, a generalization: $d(i,j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}$, $q > 0$. If q = 2, d is the Euclidean distance. If q = 1, d is the Manhattan distance. If q = ∞, d is the Chebyshev distance. Weighted distance: $d(i,j) = \sqrt[q]{w_1 |x_{i1} - x_{j1}|^q + w_2 |x_{i2} - x_{j2}|^q + \cdots + w_p |x_{ip} - x_{jp}|^q}$, $q > 0$.
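
The Minkowski family on this slide can be sketched in a few lines of Python. The function name `minkowski` and its optional `weights` argument are assumptions chosen for this example.

```python
import math


def minkowski(x, y, q, weights=None):
    """Minkowski distance between two p-dimensional objects, for q > 0."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if q == math.inf:
        # Chebyshev distance: with positive weights, the limit for growing q
        # is simply the largest coordinate difference.
        return max(diffs)
    if weights is None:
        weights = [1.0] * len(diffs)               # unweighted case
    return sum(w * d ** q for w, d in zip(weights, diffs)) ** (1.0 / q)


i, j = (1.0, 2.0), (4.0, 6.0)                      # hypothetical objects
print(minkowski(i, j, 1))                          # Manhattan distance
print(minkowski(i, j, 2))                          # Euclidean distance
print(minkowski(i, j, math.inf))                   # Chebyshev distance
print(minkowski(i, j, 2, weights=[4.0, 1.0]))      # weighted distance
```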

15 Manhattan and Chebyshev Distance. [Figure: Manhattan distance and Chebyshev distance.] When n = 2, the Chebyshev distance is the chess distance. Picture from Wikipedia.

16 Properties of Minkowski Distance. Nonnegative: $d(i,j) \ge 0$. The distance of an object to itself is 0: $d(i,i) = 0$. Symmetric: $d(i,j) = d(j,i)$. Triangle inequality: $d(i,j) \le d(i,k) + d(k,j)$. [Figure: triangle on objects i, j, k.]

17 Binary Variables. A contingency table for binary data (rows: object i; columns: object j):

            j = 1    j = 0    Sum
    i = 1   q        r        q+r
    i = 0   s        t        s+t
    Sum     q+s      r+t      p

Symmetric variable: each state carries the same weight; invariant similarity, $d(i,j) = \frac{r + s}{q + r + s + t}$. Asymmetric variable: the positive value carries more weight; noninvariant similarity (Jaccard), $d(i,j) = \frac{r + s}{q + r + s}$.
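
To make the two binary dissimilarities concrete, the sketch below counts the contingency-table cells q, r, s, t for two 0/1 vectors and applies both formulas. The function names and the sample vectors are assumptions for illustration only.

```python
def contingency(x, y):
    """Count the cells (q, r, s, t) of the contingency table of two 0/1 vectors."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return q, r, s, t


def symmetric_dissimilarity(x, y):
    q, r, s, t = contingency(x, y)
    return (r + s) / (q + r + s + t)


def asymmetric_dissimilarity(x, y):
    # Jaccard-style: negative matches (t) are ignored.
    # Assumption: at least one of the two objects has a positive value.
    q, r, s, _ = contingency(x, y)
    return (r + s) / (q + r + s)


obj_i = [1, 0, 1, 0, 0, 0]                         # hypothetical objects
obj_j = [1, 0, 1, 0, 1, 0]
print(symmetric_dissimilarity(obj_i, obj_j))
print(asymmetric_dissimilarity(obj_i, obj_j))
```

Here q = 2, r = 0, s = 1, t = 3, so the symmetric dissimilarity is 1/6 and the asymmetric one is 1/3: dropping the negative matches makes the single mismatch weigh more.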

18 Nominal Variables. A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green. Method 1: simple matching, $d(i,j) = \frac{p - m}{p}$, where m is the number of matches and p is the total number of variables. Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states.

19 Ordinal Variables. An ordinal variable can be discrete or continuous. Order is important, e.g., rank. Can be treated like interval-scaled variables: replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$; map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$; compute the dissimilarity using methods for interval-scaled variables.

20 Ratio-scaled Variables. Ratio-scaled variable: a positive measurement on a nonlinear scale, e.g., approximately at exponential scale, such as $Ae^{Bt}$. Treat them like interval-scaled variables? Not a good choice: the scale can be distorted! Alternatives: apply a logarithmic transformation, $y_{if} = \log(x_{if})$; or treat them as continuous ordinal data and treat their ranks as interval-scaled.

21 Variables of Mixed Types. A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio. One may use a weighted formula to combine their effects: $d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$.

22 Dimensionality Reduction. Clustering a high-dimensional data set is challenging: the distance between two points could be dominated by noise. Dimensionality reduction: choosing the informative dimensions for clustering analysis. Feature selection: choosing a subset of the existing dimensions. Feature construction: constructing a new (small) set of informative attributes.

23 Variance and Covariance. Given a set of 1-d points, how different are those points? Standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$. Variance: $s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$. Given a set of 2-d points, are the two dimensions correlated? Covariance: $cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$.
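
The sample statistics on this slide translate directly into code. The following sketch uses hypothetical function names and data; note that the variance is the covariance of a dimension with itself.

```python
def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Sample covariance of two dimensions (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def variance(xs):
    return covariance(xs, xs)


def std_dev(xs):
    return variance(xs) ** 0.5


xs = [1.0, 2.0, 3.0, 4.0, 5.0]                     # hypothetical 2-d points,
ys = [2.0, 4.0, 6.0, 8.0, 10.0]                    # stored per dimension
print(variance(xs), std_dev(xs), covariance(xs, ys))
```

A positive covariance, as in this example, indicates that the two dimensions tend to increase together.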

24 Principal Components. Art work and example from

25 Step 1: Mean Subtraction. Subtract the mean from each dimension for each data point. Intuition: centering the data set.

26 Step 2: Covariance Matrix. $C = \begin{bmatrix} cov(D_1, D_1) & cov(D_1, D_2) & \cdots & cov(D_1, D_n) \\ cov(D_2, D_1) & cov(D_2, D_2) & \cdots & cov(D_2, D_n) \\ \vdots & \vdots & \ddots & \vdots \\ cov(D_n, D_1) & cov(D_n, D_2) & \cdots & cov(D_n, D_n) \end{bmatrix}$

27 Step 3: Eigenvectors and Eigenvalues. Compute the eigenvectors and the eigenvalues of the covariance matrix. Intuition: find the direction-invariant vectors as candidates for new attributes. Eigenvalues indicate how much the direction-invariant vectors are scaled: the larger the eigenvalue, the better the vector manifests the data variance.

28 Step 4: Forming New Features. Choose the principal components and form new features. Typically, choose the top-k components.

29 New Features. NewData = RowFeatureVector × RowDataAdjust. The first principal component is used.
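
The four PCA steps (mean subtraction, covariance matrix, eigenvectors, new features) can be sketched without any library as follows. This is a simplified illustration, not the slides' own procedure: it finds only the first principal component and uses power iteration as a stand-in for a full eigen-decomposition. The function name and the data are assumptions.

```python
def pca_first_component(points, iterations=100):
    """Return the first principal component and the data projected onto it."""
    n, dims = len(points), len(points[0])
    # Step 1: subtract the mean of each dimension.
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    centered = [[p[d] - means[d] for d in range(dims)] for p in points]
    # Step 2: covariance matrix (sample covariance, n - 1 in the denominator).
    cov = [[sum(row[a] * row[b] for row in centered) / (n - 1)
            for b in range(dims)] for a in range(dims)]
    # Step 3: dominant eigenvector by power iteration.
    # Assumption: the start vector is not orthogonal to that eigenvector.
    v = [1.0] * dims
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dims)) for a in range(dims)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    # Step 4: new feature = projection of the centered data onto v.
    new_data = [sum(row[d] * v[d] for d in range(dims)) for row in centered]
    return v, new_data


points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]   # hypothetical data
component, new_data = pca_first_component(points)
print(component)
print(new_data)
```

Because the sample points lie exactly on the line y = 2x, the recovered component points along (1, 2), and the single new feature preserves all of the variance.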

30 Clustering Methods. K-means and partitioning methods. Hierarchical clustering. Density-based clustering. Grid-based clustering. Pattern-based clustering. Other clustering methods.

31 Partitioning Algorithms: Ideas. Partition n objects into k clusters, optimizing the chosen partitioning criterion. Global optimum: examine all possible partitions; with $(k^n - (k-1)^n - \cdots - 1)$ possible partitions, this is too expensive! Heuristic methods: k-means and k-medoids. K-means: a cluster is represented by its center. K-medoids or PAM (partitioning around medoids): each cluster is represented by one of the objects in the cluster.

32 K-means. Arbitrarily choose k objects as the initial cluster centers. Until no change, do: (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster; then update the cluster means, i.e., calculate the mean value of the objects for each cluster.
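
The k-means loop on this slide can be sketched as below. Two choices are assumptions made for the example: the "arbitrary" initial centers are simply the first k objects, and an empty cluster keeps its previous center. Function names and data are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100):
    """Return (assignment, centers) after iterating until no assignment changes."""
    centers = [list(p) for p in points[:k]]        # arbitrary choice: first k objects
    assignment = None
    for _ in range(max_iterations):
        # (Re)assign each object to the cluster with the most similar mean.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:           # no change: stop
            break
        assignment = new_assignment
        # Update the cluster means.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                            # an empty cluster keeps its center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]   # hypothetical
assignment, centers = k_means(points, k=2)
print(assignment)
print(centers)
```

With different initial centers the loop may stop at a different local optimum, which is why practical use often reruns it from several starting points.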

33 K-Means: Example (K = 2). [Figure: arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign.]

34 Pros and Cons of K-means. Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; k, t << n. Often terminates at a local optimum. Applicable only when the mean is defined; what about categorical data? Need to specify the number of clusters. Unable to handle noisy data and outliers. Unsuitable for discovering non-convex clusters.

35 Variations of the K-means. Aspects of variation: selection of the initial k means; dissimilarity calculations; strategies to calculate cluster means. Handling categorical data: k-modes, which uses the mode (the most frequent item(s)) instead of the mean. A mixture of categorical and numerical data: the k-prototype method. EM (expectation maximization): assign to each object a probability of belonging to a cluster.

36 A Problem of K-means. Sensitive to outliers. Outlier: an object with extremely large values; it may substantially distort the distribution of the data. K-medoids: represent a cluster by its medoid, the most centrally located object in the cluster.

37 PAM: A K-medoids Method. PAM: Partitioning Around Medoids. Arbitrarily choose k objects as the initial medoids. Until no change, do: (re)assign each object to the cluster of the nearest medoid; randomly select a non-medoid object o', and compute the total cost, S, of swapping medoid o with o'; if S < 0, then swap o with o' to form the new set of k medoids.

38 Swapping Cost. Measure whether o' is better than o as a medoid. Use the squared-error criterion $E = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, o_i)^2$. Compute $E_{o'} - E_o$. Negative: swapping brings benefit.
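
A small sketch of the swapping cost follows. It evaluates the squared-error criterion with every object assigned to its nearest medoid, before and after a swap. The function names, the 1-d data, and the absolute-difference distance are assumptions for the example.

```python
def total_cost(points, medoids, dist):
    """Squared-error criterion E, with each object assigned to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) ** 2 for p in points)


def swap_cost(points, medoids, o, o_new, dist):
    """Cost S of replacing medoid o by the non-medoid object o_new."""
    swapped = [o_new if m == o else m for m in medoids]
    return total_cost(points, swapped, dist) - total_cost(points, medoids, dist)


def abs_diff(a, b):
    return abs(a - b)


points = [1, 2, 3, 10, 11, 12]                     # hypothetical 1-d objects
medoids = [1, 10]
s = swap_cost(points, medoids, o=1, o_new=2, dist=abs_diff)
print(s)                                           # negative: the swap is beneficial
```

In this example E drops from 10 to 7 when medoid 1 is replaced by object 2, so S = -3 and PAM would perform the swap.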

39 PAM: Example (K = 2). [Figure: arbitrarily choose k objects as the initial medoids; assign each remaining object to the nearest medoid; then loop until no change: randomly select a non-medoid object O_random, compute the total cost of swapping, and swap O and O_random if quality is improved. A panel is labeled Total Cost = 26.]

40 Pros and Cons of PAM. PAM is more robust than k-means in the presence of noise and outliers: medoids are less influenced by outliers. PAM is efficient for small data sets but does not scale well to large data sets: $O(k(n-k)^2)$ for each iteration. Sampling-based method: CLARA.

41 CLARA. CLARA: Clustering LARge Applications (Kaufman and Rousseeuw, 1990). Built into statistical analysis packages, such as S+. Draw multiple samples of the data set, apply PAM on each sample, and return the best clustering. Performs better than PAM on larger data sets. Efficiency depends on the sample size. A good clustering on samples may not be a good clustering of the whole data set.

42 CLARANS. Clustering Large Applications based upon RANdomized Search. The problem space is a graph of clusterings: a vertex is a choice of k objects from n, so there are $\binom{n}{k}$ vertices in total. PAM searches the whole graph; CLARA searches some random sub-graphs; CLARANS climbs hills. Randomly sample a set and select k medoids. Consider neighbors of the medoids as candidates for new medoids. Use the sample set to verify. Repeat multiple times to avoid bad samples.

43 Hierarchical Clustering. Group data objects into a tree of clusters. [Figure: objects a, b, c, d, e. Agglomerative (AGNES), steps 0 to 4: a and b merge into ab, d and e into de, c joins de to form cde, and ab and cde merge into abcde. Divisive (DIANA) runs in the opposite direction, from step 4 back to step 0.]

44 AGNES (Agglomerative Nesting). Initially, each object is a cluster. Step-by-step cluster merging, until all objects form one cluster. Single-link approach: each cluster is represented by all of the objects in the cluster, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters.

45 Dendrogram. Shows how clusters are merged hierarchically. Decomposes data objects into a multilevel nested partitioning (a tree of clusters). A clustering of the data objects is obtained by cutting the dendrogram at the desired level: each connected component forms a cluster.

46 DIANA (DIvisive ANAlysis). Initially, all objects are in one cluster. Step-by-step splitting of clusters until each cluster contains only one object.

47 Distance Measures. Minimum distance: $d_{min}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} d(p, q)$. Maximum distance: $d_{max}(C_i, C_j) = \max_{p \in C_i,\, q \in C_j} d(p, q)$. Mean distance: $d_{mean}(C_i, C_j) = d(m_i, m_j)$. Average distance: $d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$. Here m is the mean of a cluster, C is a cluster, and n is the number of objects in a cluster.
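
The four inter-cluster distances can be compared side by side with a short sketch. It assumes Euclidean distance between objects (via `math.dist`, available from Python 3.8); the function names and the two sample clusters are hypothetical.

```python
import math


def centroid(cluster):
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def d_min(ci, cj):
    return min(math.dist(p, q) for p in ci for q in cj)


def d_max(ci, cj):
    return max(math.dist(p, q) for p in ci for q in cj)


def d_mean(ci, cj):
    return math.dist(centroid(ci), centroid(cj))


def d_avg(ci, cj):
    return sum(math.dist(p, q) for p in ci for q in cj) / (len(ci) * len(cj))


c1 = [(0.0, 0.0), (0.0, 2.0)]                      # hypothetical clusters
c2 = [(3.0, 0.0), (3.0, 2.0)]
print(d_min(c1, c2), d_max(c1, c2), d_mean(c1, c2), d_avg(c1, c2))
```

Using `d_min` as the merge criterion corresponds to the single-link approach; the other three give different merge orders on elongated or unevenly sized clusters.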

48 Challenges of Hierarchical Clustering Methods. Hard to choose merge/split points: merging/splitting is never undone, so merging/splitting decisions are critical. Do not scale well: $O(n^2)$. Remedy: integrating hierarchical clustering with other techniques, as in BIRCH, CURE, CHAMELEON, ROCK.

49 BIRCH. Balanced Iterative Reducing and Clustering using Hierarchies. CF (Clustering Feature) tree: a hierarchical data structure summarizing object info. Clustering objects → clustering leaf nodes of the CF tree.

50 Clustering Feature Vector. Clustering Feature: CF = (N, LS, SS). N: number of data points. LS: $\sum_{i=1}^{N} o_i$. SS: $\sum_{i=1}^{N} o_i^2$. Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190)).

51 CF-tree in BIRCH. Clustering feature: summarizes the statistics for a cluster; many cluster quality measures (e.g., radius, distance) can be derived from it. Additivity: $CF_1 + CF_2 = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2)$. A CF tree: a height-balanced tree storing the clustering features for a hierarchical clustering. A nonleaf node in the tree has descendants or children, and the nonleaf nodes store sums of the CFs of their children.
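
The clustering feature CF = (N, LS, SS) and its additivity can be checked with a few lines of code. The function names are assumptions; the five points are the ones from the clustering feature vector example, with LS and SS kept per dimension.

```python
def clustering_feature(points):
    """Return CF = (N, LS, SS) with per-dimension linear and squared sums."""
    n = len(points)
    ls = tuple(sum(col) for col in zip(*points))
    ss = tuple(sum(x * x for x in col) for col in zip(*points))
    return n, ls, ss


def merge(cf1, cf2):
    """Additivity: the CF of a union of disjoint clusters is the component-wise sum."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))


def centroid(cf):
    n, ls, _ = cf
    return tuple(x / n for x in ls)                # derived from the CF alone


points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
whole = clustering_feature(points)
parts = merge(clustering_feature(points[:2]), clustering_feature(points[2:]))
print(whole, parts, centroid(whole))
```

Merging the CFs of two sub-clusters gives the same triple as computing the CF over all five points, which is what lets a nonleaf node summarize its children without revisiting the data.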

52 CF Tree. [Figure: a CF tree with B = 7 and L = 6. The root and the non-leaf nodes hold entries CF_1, CF_2, CF_3, ..., each with a pointer to a child node; the leaf nodes hold CF entries and are chained together by prev and next pointers.]

53 Parameters of a CF-tree. Branching factor: the maximum number of children. Threshold: the maximum diameter of sub-clusters stored at the leaf nodes.

54 BIRCH Clustering. Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data). Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree.

55 Pros & Cons of BIRCH. Linear scalability: good clustering with a single scan; quality can be further improved by a few additional scans. Can handle only numeric data. Sensitive to the order of the data records.

56 Drawbacks of Square Error Based Methods. One representative per cluster: good only for convex-shaped clusters of similar size and density. K, the number of clusters, is a parameter: good only if k can be reasonably estimated.

57 CURE: the Ideas. Each cluster has c representatives: choose c well-scattered points in the cluster and shrink them towards the mean of the cluster by a fraction α. The representatives capture the physical shape and geometry of the cluster. Merge the closest two clusters, where the distance between two clusters is the distance between their two closest representatives.

58 CURE: The Algorithm. Draw a random sample S. Partition the sample into p partitions. Partially cluster each partition. Eliminate outliers: random sampling plus removing clusters that grow too slowly. Cluster the partial clusters until only k clusters are left. Shrink the representatives of clusters towards the cluster center.

59 Data Partitioning and Clustering. [Figure: x-y scatter plots illustrating data partitioning and clustering.]

60 CURE: Shrinking Representative Points. Shrink the multiple representative points towards the gravity center by a fraction α. Representatives capture the shape. [Figure: x-y scatter plots.]

61 Clustering Categorical Data: ROCK. RObust Clustering using linKs. Links: the number of common neighbors between two points. Use links to measure similarity/proximity; not distance based. Complexity: $O(n^2 + n m_m m_a + n^2 \log n)$. Basic ideas: similarity function and neighbors, $Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$. Let $T_1 = \{1, 2, 3\}$ and $T_2 = \{3, 4, 5\}$; then $Sim(T_1, T_2) = \frac{|\{3\}|}{|\{1, 2, 3, 4, 5\}|} = \frac{1}{5} = 0.2$.
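
The similarity function and the link count can be sketched as follows. The neighbor threshold `theta`, the convention that a point is not its own neighbor, the function names, and the sample transactions are all assumptions made for this illustration.

```python
def jaccard(t1, t2):
    """Sim(T1, T2) = |T1 intersect T2| / |T1 union T2| for two sets."""
    return len(t1 & t2) / len(t1 | t2)


def neighbors(transactions, index, theta):
    """Indices of the other transactions whose similarity is at least theta."""
    return {j for j, t in enumerate(transactions)
            if j != index and jaccard(transactions[index], t) >= theta}


def link(transactions, a, b, theta):
    """Number of common neighbors of transactions a and b."""
    return len(neighbors(transactions, a, theta) & neighbors(transactions, b, theta))


print(jaccard({1, 2, 3}, {3, 4, 5}))               # 1 common item out of 5

transactions = [{1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {3, 4, 5}]   # hypothetical
print(link(transactions, 0, 1, theta=0.5))
print(link(transactions, 0, 3, theta=0.5))
```

With theta = 0.5 the first two transactions share one common neighbor (the third), while the first and the last share none, so the link count separates them even though each pair has some items in common.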

62 Limitations. Merging decisions are based on static modeling; no special characteristics of clusters are considered. [Figure: cluster pairs C1, C2 and C1', C2'. CURE and BIRCH merge C1 and C2, although C1' and C2' are more appropriate for merging.]

63 Chameleon. Hierarchical clustering using dynamic modeling. Measures similarity based on a dynamic model: the interconnectivity and closeness (proximity) between two clusters versus the internal interconnectivity of the clusters and the closeness of items within the clusters. A two-phase algorithm: first, use a graph partitioning algorithm to cluster objects into a large number of relatively small sub-clusters; second, find the genuine clusters by repeatedly combining sub-clusters.

64 Overall Framework of CHAMELEON. Data Set → Construct Sparse Graph → Partition the Graph → Merge Partition → Final Clusters.

65 Drawback of Distance-based Methods. Hard to find clusters with irregular shapes. Hard to specify the number of clusters. Heuristic: a cluster must be dense.

66 Directly Density Reachable. Parameters: Eps, the maximum radius of the neighborhood; MinPts, the minimum number of points in an Eps-neighborhood of a point. $N_{Eps}(p) = \{q \mid dist(p, q) \le Eps\}$. Core object p: $|N_{Eps}(p)| \ge MinPts$. Point q is directly density-reachable from p iff $q \in N_{Eps}(p)$ and p is a core object. [Figure: points p and q, with MinPts = 3 and Eps = 1 cm.]

67 Density-Based Clustering. Density-reachable: if $p_2$ is directly density-reachable from $p_1$, $p_3$ from $p_2$, ..., and $p_n$ from $p_{n-1}$, then $p_n$ is density-reachable from $p_1$. Density-connected: if points p and q are both density-reachable from a point o, then p and q are density-connected. [Figure: a chain from $p_1$ to p and q; points p and q reached from o.]

68 DBSCAN. A cluster: a maximal set of density-connected points. Discovers clusters of arbitrary shape in spatial databases with noise. [Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5.]

69 DBSCAN: the Algorithm. Arbitrarily select a point p. Retrieve all points density-reachable from p with respect to Eps and MinPts. If p is a core point, a cluster is formed. If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database. Continue the process until all of the points have been processed.
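
A compact sketch of the DBSCAN procedure described on this slide is given below. It is an illustrative brute-force version under stated assumptions: Euclidean distance via `math.dist`, a linear scan for each neighborhood query (no spatial index), points visited in input order, and `-1` used as the noise label. All names and the sample data are hypothetical.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points in the Eps-neighborhood of point i (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)                  # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region_query(points, i, eps)
        if len(seeds) < min_pts:
            labels[i] = NOISE                      # may later become a border point
            continue
        labels[i] = cluster                        # i is a core point: new cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster                # border point reached from a core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:        # j is also core: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]                                # hypothetical data
print(dbscan(points, eps=1.5, min_pts=3))
```

The two squares of four points each become separate clusters, and the isolated point keeps the noise label because its neighborhood never reaches MinPts.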

70 Problems of DBSCAN. Different clusters may have very different densities. Clusters may be in hierarchies.

71 OPTICS: A Cluster-ordering Method. OPTICS: Ordering Points To Identify the Clustering Structure. Group points by density connectivity. Hierarchies of clusters. Visualize clusters and the hierarchy.

72 Ordering Points. Points that are strongly density-connected should be close to one another. Clusters that are density-connected should be close to one another and form a cluster of clusters.

73 OPTICS: An Example. [Figure: reachability plot. Horizontal axis: cluster order of the objects. Vertical axis: reachability distance, which is undefined for some objects, with ε thresholds marked.]

74 DENCLUE: Using Density Functions. DENsity-based CLUstEring. Major features: solid mathematical foundation; good for data sets with large amounts of noise; allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets; significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45); but needs a large number of parameters.

75 DENCLUE: Techniques. Use grid cells: only keep grid cells actually containing data points, and manage cells in a tree-based access structure. Influence function: describes the impact of a data point on its neighborhood. The overall density of the data space is the sum of the influence functions of all data points. Clustering by identifying density attractors. Density attractor: a local maximum of the overall density function.

76 Density Attractor

77 Center-defined and Arbitrary Clusters

78 A Shrinking-based Approach. Difficulties of multi-dimensional clustering: noise (outliers); clusters of various densities; not well-defined shapes. A novel preprocessing concept: shrinking. A shrinking-based clustering approach.

79 Intuition & Purpose. For data points in a data set, what if we could make them move towards the centroid of the natural subgroup they belong to? Naturally sparse subgroups become denser, and thus easier to detect. Noise is further isolated.

80 Inspiration. Newton's Universal Law of Gravitation: any two objects exert a gravitational force of attraction on each other. The direction of the force is along the line joining the objects. The magnitude of the force is directly proportional to the product of the gravitational masses of the objects, and inversely proportional to the square of the distance between them: $F_g = G \frac{m_1 m_2}{r^2}$. G: universal gravitational constant, $G = 6.67 \times 10^{-11}\ \mathrm{N\,m^2/kg^2}$.

81 The Concept of Shrinking. A data preprocessing technique that aims to optimize the inner structure of real data sets. Each data point is attracted by other data points and moves in the direction in which the attraction is the strongest. Can be applied in different fields.

82 Applying Shrinking in the Clustering Field. Shrink the naturally sparse clusters to make them much denser, to facilitate the subsequent cluster-detecting process. [Figure: multi-attribute hyperspace.]

83 Data Shrinking. Each data point moves along the direction of the density gradient, and the data set shrinks towards the inside of the clusters. Points are attracted by their neighbors and move to create denser clusters. The process proceeds iteratively, and is repeated until the data are stabilized or the number of iterations exceeds a threshold.

84 Approximation & Simplification. Problem: computing the mutual attraction of each pair of data points is too time-consuming, $O(n^2)$. Solution: drop Newton's constant G and set $m_1$ and $m_2$ to unit mass; only aggregate the gravitation surrounding each data point; use grids to simplify the computation.

85 Termination Condition. The average movement of all points in the current iteration is less than a threshold, or the number of iterations exceeds a threshold.

86 OPTICS on Pendigits Data. [Figure: before data shrinking; after data shrinking.]

87 Grid-based Clustering Methods. Ideas: using multi-resolution grid data structures; using dense grid cells to form clusters. Several interesting methods: STING, WaveCluster, CLIQUE.

88 STING: A Statistical Information Grid Approach. Complexity of spatial query answering and clustering: at least O(n) if each point has to be accessed; get summarization → lower complexity. The spatial area is divided into rectangular cells. Levels of cells correspond to different levels of resolution.

89 Grid and Cells in STING

90 STING: Hierarchical Structure of Cells. A cell at a high level is partitioned into a number of smaller cells at the next lower level. Statistical info of each cell is pre-computed and stored for query answering. Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells: count, mean, standard deviation, min, max; type of distribution (normal, uniform, etc.). For each cell in the current level, compute the confidence interval.

91 STING: Query Answering. A top-down approach. Start from a pre-selected layer, typically with a small number of cells. Remove the irrelevant cells from further consideration. When the current layer has been examined, proceed to the next lower level. Repeat this process until the bottom layer is reached.

92 STING: Pros and Cons. Complexity O(K), where K is the number of grid cells at the lowest level. Query-independent, easy to parallelize, supports incremental update. All the cluster boundaries are either horizontal or vertical; no diagonal boundary is detected.

93 WaveCluster. A multi-resolution clustering approach. Applies wavelet transformation to the feature space. Both grid-based and density-based. Input parameters: the number of grid cells for each dimension; the wavelet; the number of applications of the wavelet transform.

94 Wavelet Decomposition. Wavelets: a math tool for space-efficient hierarchical decomposition of functions. S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to Ŝ = [2 3/4, -1 1/4, 1/2, 0, 0, -1, -1, 0]. Compression: many small detail coefficients can be replaced by 0's, and only the significant coefficients are retained.
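
The decomposition of S can be reproduced with a short Haar transform sketch: at each level, neighboring values are replaced by their pairwise averages, and half of each pairwise difference is kept as a detail coefficient. The function name is an assumption, and the sketch assumes the signal length is a power of two.

```python
def haar_transform(signal):
    """Haar decomposition: overall average followed by detail coefficients,
    ordered from the coarsest level to the finest."""
    values = list(signal)
    details = []
    while len(values) > 1:
        pairs = range(0, len(values), 2)
        averages = [(values[i] + values[i + 1]) / 2 for i in pairs]
        diffs = [(values[i] - values[i + 1]) / 2 for i in pairs]
        details = diffs + details                  # coarser details go in front
        values = averages
    return values + details


s = [2, 2, 0, 2, 3, 5, 4, 4]
print(haar_transform(s))
```

For the slide's S this gives eight coefficients: 2 3/4, -1 1/4, 1/2, 0, 0, -1, -1, 0. Three of them are already zero, which is what makes the representation compressible.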

95 Haar Wavelet Coefficients. [Figure: hierarchical decomposition structure (a.k.a. "error tree"); coefficient supports; original frequency distribution.]

96 What Is Wavelet Transform? Decomposes a signal into different frequency subbands. Applicable to n-dimensional signals. Data are transformed to preserve the relative distance between objects at different levels of resolution. Allows natural clusters to become more distinguishable.

97 Wavelet Transformation

98 Why Wavelet Transform? Uses hat-shaped filters: emphasize regions where points cluster; suppress weaker information at their boundaries. Effective removal of outliers: insensitive to noise, insensitive to input order. Multi-resolution: detects arbitrarily shaped clusters at different scales. Efficient: complexity O(N). Only applicable to low-dimensional data.

99 WaveCluster: Method. Summarize the data by imposing a multidimensional grid structure onto the data space. Multidimensional spatial data objects are represented in an n-dimensional feature space. Apply the wavelet transform on the feature space to find the dense regions in the feature space. Applying the wavelet transform multiple times results in clusters at different scales, from fine to coarse.

100 CLIQUE. CLustering In QUEst. Automatically identifies subspaces of a high-dimensional data space. Both density-based and grid-based.

101 CLIQUE: the Ideas. Partition each dimension into the same number of equal-length intervals, which partitions an m-dimensional data space into non-overlapping rectangular units. A unit is dense if the number of data points in the unit exceeds a threshold. A cluster is a maximal set of connected dense units within a subspace.

102 CLIQUE: the Method. Partition the data space and find the number of points in each cell of the partition. Apriori: a k-d cell cannot be dense if one of its (k-1)-d projections is not dense. Identify clusters: determine dense units in all subspaces of interest, then determine connected dense units in all subspaces of interest. Generate a minimal description for the clusters: determine the minimal cover for each cluster.

103 CLIQUE: An Example. [Figure: plots of salary (10,000) against age and vacation (weeks) against age, and a combined salary, vacation, and age view.]

104 CLIQUE: Pros and Cons. Automatically finds subspaces of the highest dimensionality with high-density clusters. Insensitive to the order of input. Does not presume any canonical data distribution. Scales linearly with the size of input. Scales well with the number of dimensions. The quality of the clustering result may be degraded in exchange for the simplicity of the method.

105 Bad Results from CLIQUE. Parts of a cluster may be missed. A cluster from CLIQUE may contain noise.

106 Pattern-based Clustering. How to cluster the five objects? It is hard to define a global similarity measure.

107 Pattern-based Clustering. A cluster: a set of objects following the same pattern in a subset of dimensions (Wang et al., 2002).

108 Is That Subspace Clustering? It looks like clustering on a subset of dimensions, but not really. Subspace clustering uses a global distance/similarity measure, whereas pattern-based clustering looks at patterns. A subspace cluster according to a globally defined similarity measure may not follow the same pattern.

109 Two Distinct Features. No globally defined similarity/distance measure: can be used in many cases. Clusters may not be exclusive: an object can appear in multiple clusters. DNA micro-array data analysis using pattern-based clusters: identify subsets of genes whose expression levels change coherently under a subset of conditions, which is critical in revealing the significant connections in gene regulatory networks.

110 Objects Follow the Same Pattern? [Figure: pscore of a blue object and a green object on two dimensions D1 and D2.] The smaller the pscore, the more consistent the objects.

111 Pattern-based Clusters. pscore: the similarity between two objects r_x, r_y on two attributes a_u, a_v:
$$\mathrm{pscore}\left(\begin{bmatrix} r_x.a_u & r_x.a_v \\ r_y.a_u & r_y.a_v \end{bmatrix}\right) = \left|(r_x.a_u - r_y.a_u) - (r_x.a_v - r_y.a_v)\right|$$
δ-pcluster (R, D): for any objects r_x, r_y ∈ R and any attributes a_u, a_v ∈ D,
$$\mathrm{pscore}\left(\begin{bmatrix} r_x.a_u & r_x.a_v \\ r_y.a_u & r_y.a_v \end{bmatrix}\right) \le \delta, \qquad \delta \ge 0$$
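As a small sketch of the definition, assuming the pscore of two objects on two attributes is the absolute difference of their differences, the δ-pcluster condition can be checked by brute force over all object pairs and attribute pairs. The names `pscore` and `is_delta_pcluster` are chosen for this illustration only.

```python
from itertools import combinations

def pscore(x_u, x_v, y_u, y_v):
    """pscore of objects x, y on attributes u, v: |(x.u - y.u) - (x.v - y.v)|."""
    return abs((x_u - y_u) - (x_v - y_v))

def is_delta_pcluster(rows, attrs, delta):
    """Check whether the objects in `rows` form a delta-pcluster on `attrs`.

    rows:  list of objects, each a list of numeric attribute values
    attrs: indices of the attribute subset D
    """
    for rx, ry in combinations(rows, 2):
        for u, v in combinations(attrs, 2):
            if pscore(rx[u], rx[v], ry[u], ry[v]) > delta:
                return False
    return True
```

Two objects that differ only by a constant shift have pscore 0 on every attribute pair, so they form a 0-pcluster even though their Euclidean distance may be large.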

112 Maximal pcluster. If (R, D) is a δ-pcluster, then every subcluster (R', D') is a δ-pcluster, where R' ⊆ R and D' ⊆ D: an anti-monotonic property. A large pcluster is therefore accompanied by many small pclusters, and mining all of them is inefficient. Idea: mine only the maximal pclusters. A δ-pcluster is maximal if no proper super-cluster of it is a δ-pcluster.

113 Mining Maximal pclusters. Given: a cluster threshold δ, an attribute threshold min_a, and an object threshold min_o. Task: mine the complete set of significant maximal δ-pclusters. A significant δ-pcluster has at least min_o objects on at least min_a attributes.

114 pclusters and Frequent Itemsets. A transaction database can be modeled as a binary matrix. A frequent itemset is a sub-matrix of all 1's, i.e. a 0-pcluster on binary data. min_o: the support threshold. min_a: no fewer than min_a attributes. Maximal pclusters correspond to closed itemsets. Frequent itemset mining algorithms cannot be extended straightforwardly for mining pclusters on numeric data.

115 Where Should We Start from? How about the pclusters having only 2 objects or 2 attributes? MDS (maximal dimension set): a pcluster must have at least 2 objects and 2 attributes. [Figure: finding MDSs from the values of objects x and y, and their differences x - y, on attributes a to h.]

116 How to Assemble Larger pclusters? Systematically enumerate every combination of attributes D. For each attribute subset, find the maximal subsets of objects R such that (R, D) is a pcluster. Check whether (R, D) is maximal. Prune search branches as early as possible. Why attribute-first-object-later? Because # of objects >> # of attributes. Algorithm MaPle (Pei et al., 2003).

117 Pruning MDSs. Let (R, D) be a significant pcluster. Each attribute in D must appear in the object-pair MDS of every pair r_x, r_y ∈ R, i.e. in at least min_o(min_o - 1)/2 object-pair MDSs. Similarly, each object must appear in at least min_a(min_a - 1)/2 attribute-pair MDSs. Objects and attributes less frequent than this can be pruned. The pruning can be applied repeatedly until no more objects or attributes are pruned.

118 More Pruning Techniques. Consider only the possible attributes when growing larger pclusters. Prune local maximal pclusters having insufficient possible attributes. Extract common attributes from the possible attribute set directly. Prune non-maximal pclusters.

119 Gene-Sample-Time Series Data. [Figure: a three-dimensional array with gene, sample and time axes, together with its gene-sample, gene-time and sample-time matrices.] Each cell holds the expression level of gene i on sample j at time k.

120 Mining GST Microarray Data. Reduce the gene-sample-time series data to gene-sample data. Use Pearson's correlation coefficient as the coherence measure.

121 Basic Approaches. Sample-gene search: enumerate the subsets of samples systematically; for each subset of samples, find the genes that are coherent on the samples. Gene-sample search: enumerate the subsets of genes systematically; for each subset of genes, find the samples on which the genes are coherent.

122 Basic Tools. Set enumeration tree. Sample-gene search and gene-sample search are not symmetric: there are many genes but only a few samples, and there is no requirement that samples be coherent on genes.

123 Phenotypes and Informative Genes. [Figure: expression matrix of samples against genes; genes 1 to 4 are informative genes, genes 5 to 7 are non-informative genes.]

124 The Phenotype Mining Problem. Input: a microarray matrix and k. Output: phenotypes and informative genes. The samples are partitioned into k exclusive subsets (the phenotypes), and the informative genes discriminate the phenotypes. Machine learning methods: heuristic search and mutual reinforcing adjustment.

125 Requirements. The expression levels of each informative gene should be similar over the samples within each phenotype. The expression levels of each informative gene should display a clear dissimilarity between each pair of phenotypes.

126 Intra-phenotype Consistency. In a subset of genes (candidate informative genes), does every gene have good consistency on a set of samples? It is measured as the average variance of the genes in the subset over the samples; the smaller the value, the better.
$$\mathrm{Con}(G', S') = \frac{1}{|G'| \cdot (|S'| - 1)} \sum_{g_i \in G'} \sum_{s_j \in S'} \left(w_{i,j} - \overline{w}_{i,S'}\right)^2$$
Here w_{i,j} is the expression level of gene g_i on sample s_j, and \overline{w}_{i,S'} is its mean over the samples in S'.

127 Inter-phenotype Divergence. How well can a subset of genes (candidate informative genes) discriminate two phenotypes of samples? It is measured as the average, over the genes, of the absolute difference between the two phenotype means; the larger the value, the better.
$$\mathrm{Div}(G', S_1, S_2) = \frac{\sum_{g_i \in G'} \left|\overline{w}_{i,S_1} - \overline{w}_{i,S_2}\right|}{|G'|}$$
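The two measures can be sketched directly, assuming Con is the average per-gene sample variance over a set of samples and Div is the average absolute difference between the per-gene means of two phenotypes. The function names and the list-of-lists matrix `w` (rows are genes, columns are samples) are assumptions for this example.

```python
def mean(values):
    return sum(values) / len(values)

def con(w, genes, samples):
    """Intra-phenotype consistency: average sample variance of the genes."""
    total = 0.0
    for i in genes:
        m = mean([w[i][j] for j in samples])
        total += sum((w[i][j] - m) ** 2 for j in samples)
    return total / (len(genes) * (len(samples) - 1))

def div(w, genes, s1, s2):
    """Inter-phenotype divergence: average absolute difference of the means."""
    diffs = [
        abs(mean([w[i][j] for j in s1]) - mean([w[i][j] for j in s2]))
        for i in genes
    ]
    return sum(diffs) / len(genes)
```

For example, with `w = [[1, 1, 5, 5], [2, 4, 8, 10]]`, genes `[0, 1]` and phenotypes `[0, 1]` and `[2, 3]`, the first phenotype has consistency 1.0 and the divergence between the two phenotypes is 5.0.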

128 Quality of Phenotypes and Informative Genes.
$$\Omega = \frac{1}{\displaystyle\sum_{S_i, S_j\ (1 \le i, j \le K;\ i \ne j)} \frac{\mathrm{Con}(G', S_i) + \mathrm{Con}(G', S_j)}{\mathrm{Div}(G', S_i, S_j)}}$$
The higher the value, the better the quality.

129 Heuristic Search. Start from a random subset of genes and an arbitrary partition of the samples. Iteratively adjust the partition and the gene set toward a better solution. For each possible adjustment, compute the change in quality ΔΩ: for each gene, try the possible insert/remove; for each sample, try the best movement. If ΔΩ > 0, conduct the adjustment. If ΔΩ < 0, conduct the adjustment with probability
$$p = e^{\frac{\Delta\Omega}{\Omega \times T(i)}}$$
where T(i) is a decreasing simulated annealing function and i is the iteration number. T(i) = 1/(i+1) in our implementation.
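The acceptance rule is a standard simulated-annealing step and can be sketched as follows. The sketch assumes the acceptance probability for a worsening adjustment is exp(ΔΩ / (Ω · T(i))) with T(i) = 1/(i+1); the function names and the injectable random source are choices made for this illustration.

```python
import math
import random

def temperature(i):
    """Decreasing annealing schedule T(i) = 1 / (i + 1)."""
    return 1.0 / (i + 1)

def accept_probability(delta_omega, omega, i):
    """Probability of conducting an adjustment that changes quality by delta_omega."""
    if delta_omega > 0:
        return 1.0
    return math.exp(delta_omega / (omega * temperature(i)))

def accept(delta_omega, omega, i, rng=random):
    """Decide whether to conduct the adjustment at iteration i."""
    return rng.random() < accept_probability(delta_omega, omega, i)
```

Because T(i) shrinks as i grows, the same loss in quality becomes less and less likely to be accepted in later iterations, so the search gradually turns into plain hill climbing.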

130 Possible Adjustments. Insert a gene. Remove a gene. Move a sample.

131 Disadvantages of Heuristic Search. Samples and genes are examined and adjusted with equal chances, but # samples << # genes, so samples should play a more important role. Outliers in the samples should be handled specifically, because outliers strongly interfere with the quality and the adjustment decisions.

132 Mutual Reinforcing Adjustment. A two-phase approach: an iteration phase and a refinement phase. Mutual reinforcement: use the gene partition to improve the sample partition, and use the sample partition to improve the gene partition.

133 Fuzzy Clustering. Each point x_i takes a probability w_ij to belong to a cluster C_j. Requirements:
For each point x_i: $$\sum_{j=1}^{k} w_{ij} = 1$$
For each cluster C_j: $$0 < \sum_{i=1}^{m} w_{ij} < m$$

134 Fuzzy C-Means (FCM). Select an initial fuzzy pseudo-partition, i.e., assign values to all the w_ij. Repeat: compute the centroid of each cluster using the fuzzy pseudo-partition; recompute the fuzzy pseudo-partition, i.e., the w_ij. Until the centroids do not change (or the change is below some threshold).

135 Critical Details. Optimization of the sum of the squared error (SSE):
$$\mathrm{SSE}(C_1, \ldots, C_k) = \sum_{j=1}^{k} \sum_{i=1}^{m} w_{ij}^{p} \, \mathrm{dist}(x_i, c_j)^2$$
Computing centroids:
$$c_j = \sum_{i=1}^{m} w_{ij}^{p} x_i \Big/ \sum_{i=1}^{m} w_{ij}^{p}$$
Updating the fuzzy pseudo-partition:
$$w_{ij} = \frac{\left(1 / \mathrm{dist}(x_i, c_j)^2\right)^{\frac{1}{p-1}}}{\sum_{q=1}^{k} \left(1 / \mathrm{dist}(x_i, c_q)^2\right)^{\frac{1}{p-1}}}$$
When p = 2:
$$w_{ij} = \frac{1 / \mathrm{dist}(x_i, c_j)^2}{\sum_{q=1}^{k} 1 / \mathrm{dist}(x_i, c_q)^2}$$
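The FCM loop with these two update rules can be sketched in plain Python. This is an illustrative sketch, not a reference implementation: the function names, the squared-shift stopping test with tolerance `tol`, and the handling of a point that coincides with a centroid are assumptions. The initial pseudo-partition `w` is supplied by the caller and must give every cluster a positive total weight.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fuzzy_c_means(points, w, p=2.0, max_iter=100, tol=1e-9):
    """Run FCM from an initial fuzzy pseudo-partition w (m rows, k columns)."""
    m, k, dim = len(points), len(w[0]), len(points[0])
    w = [row[:] for row in w]  # do not modify the caller's matrix
    centroids = None
    for _ in range(max_iter):
        # Centroid step: c_j = sum_i w_ij^p x_i / sum_i w_ij^p.
        new_centroids = []
        for j in range(k):
            weights = [w[i][j] ** p for i in range(m)]
            total = sum(weights)
            new_centroids.append(tuple(
                sum(weights[i] * points[i][d] for i in range(m)) / total
                for d in range(dim)))
        # Membership step: w_ij proportional to (1 / dist^2)^(1 / (p - 1)).
        for i in range(m):
            d2 = [sq_dist(points[i], c) for c in new_centroids]
            zeros = [j for j, v in enumerate(d2) if v == 0.0]
            if zeros:
                # Assumption: a point sitting on a centroid belongs fully to it.
                w[i] = [1.0 / len(zeros) if j in zeros else 0.0 for j in range(k)]
            else:
                inv = [(1.0 / v) ** (1.0 / (p - 1.0)) for v in d2]
                s = sum(inv)
                w[i] = [v / s for v in inv]
        # Stop when no centroid moved by more than tol (squared distance).
        converged = centroids is not None and max(
            sq_dist(a, b) for a, b in zip(centroids, new_centroids)) < tol
        centroids = new_centroids
        if converged:
            break
    return centroids, w
```

On two well-separated groups of one-dimensional points, starting from a pseudo-partition that mildly favors the right cluster for each point, the memberships quickly become almost crisp while every row of `w` still sums to 1.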

136 Choice of p. When p → 1, FCM behaves like traditional k-means. When p is larger, the cluster centroids approach the global centroid of all data points. The partition becomes fuzzier as p increases.

137 Effectiveness

138 Mixture Models. A cluster can be modeled as a probability distribution. In practice, assume that a distribution can be approximated well by a multivariate normal distribution. Multiple clusters form a mixture of different probability distributions. A data set is a set of observations from a mixture of models.

139 Object Probability. Suppose there are k clusters and a set X of m objects. Let the j-th cluster have parameter θ_j = (µ_j, σ_j). The probability that a point is in the j-th cluster is w_j, with w_1 + … + w_k = 1. The probability of an object x is
$$\mathrm{prob}(x \mid \Theta) = \sum_{j=1}^{k} w_j \, p_j(x \mid \theta_j)$$
and the probability of the whole set is
$$\mathrm{prob}(X \mid \Theta) = \prod_{i=1}^{m} \mathrm{prob}(x_i \mid \Theta) = \prod_{i=1}^{m} \sum_{j=1}^{k} w_j \, p_j(x_i \mid \theta_j)$$

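140 Example. For a normal distribution,
$$\mathrm{prob}(x_i \mid \Theta) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$$
With θ_1 = (-4, 2) and θ_2 = (4, 2), the mixture is
$$\mathrm{prob}(x \mid \Theta) = w_1 \frac{1}{2\sqrt{2\pi}} e^{-\frac{(x+4)^2}{8}} + w_2 \frac{1}{2\sqrt{2\pi}} e^{-\frac{(x-4)^2}{8}}$$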

141 Maximum Likelihood Estimation. Maximum likelihood principle: if we know that a set of objects comes from one distribution but do not know its parameters, we can choose the parameters that maximize the probability of the data. Maximize
$$\mathrm{prob}(X \mid \Theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$$
Equivalently, maximize
$$\log \mathrm{prob}(X \mid \Theta) = -\sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2} - 0.5\, m \log 2\pi - m \log \sigma$$
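As a short worked step, assuming a single univariate normal distribution with parameters µ and σ, setting the partial derivatives of the log-likelihood to zero gives the familiar closed-form estimates (the hats denote the maximizing values):

```latex
\frac{\partial}{\partial \mu} \log \mathrm{prob}(X \mid \Theta)
  = \sum_{i=1}^{m} \frac{x_i - \mu}{\sigma^2} = 0
  \quad\Longrightarrow\quad
  \hat{\mu} = \frac{1}{m} \sum_{i=1}^{m} x_i

\frac{\partial}{\partial \sigma} \log \mathrm{prob}(X \mid \Theta)
  = \sum_{i=1}^{m} \frac{(x_i - \mu)^2}{\sigma^3} - \frac{m}{\sigma} = 0
  \quad\Longrightarrow\quad
  \hat{\sigma}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \hat{\mu})^2
```

So for one distribution the maximum likelihood estimates are simply the sample mean and the (biased) sample variance; for a mixture no such closed form exists, which is what motivates an iterative procedure.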

142 EM Algorithm. Expectation Maximization algorithm. Select an initial set of model parameters. Repeat: Expectation step: for each object x_i, calculate the probability that it belongs to each distribution θ_j, i.e., prob(x_i | θ_j). Maximization step: given the probabilities from the expectation step, find the new estimates of the parameters that maximize the expected likelihood. Until the parameters are stable.
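A minimal sketch of EM for a mixture of univariate normal distributions is given below. It assumes the standard formulation in which the expectation step computes, for each object, the normalized membership probability (responsibility) of each component; the function names, the fixed number of iterations in place of a stability test, and the small floor on σ are choices made for this illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the univariate normal distribution N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def em_gmm_1d(xs, mus, sigmas, weights, n_iter=50):
    """Fit a 1-d Gaussian mixture by EM, starting from the given parameters."""
    k = len(mus)
    for _ in range(n_iter):
        # Expectation step: membership probability of each object in each component.
        resp = []
        for x in xs:
            joint = [weights[j] * normal_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # Maximization step: weighted mean, weighted variance and mixing weight.
        new_mus, new_sigmas, new_weights = [], [], []
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mu = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mu) ** 2 for r, x in zip(resp, xs)) / nj
            new_mus.append(mu)
            new_sigmas.append(max(math.sqrt(var), 1e-6))  # floor avoids sigma = 0
            new_weights.append(nj / len(xs))
        mus, sigmas, weights = new_mus, new_sigmas, new_weights
    return mus, sigmas, weights
```

Each maximization step reuses the single-distribution estimates (mean and variance), only weighted by the membership probabilities, which is why the method is more general than k-means but also more expensive per iteration.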

143 Advantages and Disadvantages. Advantages: mixture models are more general than k-means and fuzzy c-means; clusters can be characterized by a small number of parameters; the results may satisfy the statistical assumptions of the generative models. Disadvantages: computationally expensive; need large data sets; hard to estimate the number of clusters.

144 Constrained Clustering. Constraints exist in the data space or in user queries. Example: ATM allocation with bridges and highways, where people can cross a highway only by a bridge.

145 Clustering With Obstacle Objects. [Figure: two clusterings of the same data, one not taking obstacles into account and one taking obstacles into account.]

146 Outlier Analysis. "One person's noise is another person's signal." Outliers: the objects considerably dissimilar from the remainder of the data. Examples: credit card fraud, Michael Jordan, intrusions, etc. Applications: credit card fraud detection, telecom fraud detection, intrusion detection, customer segmentation, medical analysis, etc.

147 Statistical Outlier Analysis. Discordancy/outlier tests: more than 100 tests have been proposed. They differ in the assumed data distribution, the distribution parameters, the number of outliers, and the types of expected outliers (for example, upper or lower outliers in an ordered sample).

148 Drawbacks of Statistical Approaches. Most tests are univariate: unsuitable for multidimensional datasets. All are distribution-based: the distribution is unknown in many applications.

149 Depth-based Methods. Organize data objects in layers with various depths. The shallow layers are more likely to contain outliers. Examples: peeling, depth contours. Complexity O(N^{k/2}) for k-d datasets: unacceptable for k > 2.

150 Depth-based Outliers: Example

151 Distance-based Outliers. A DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lie at a distance greater than D from O. Algorithms for mining distance-based outliers: the index-based algorithm, the nested-loop algorithm, and the cell-based algorithm.

152 Index-based Algorithms. Find DB(p, D)-outliers in T with n objects: find the objects having at most n(1-p) neighbors within radius D. Algorithm: build a standard multidimensional index; search every object O with radius D; if O has more than n(1-p) neighbors, O is not an outlier; else, output O.

153 Pros and Cons of Index-based Algorithms. Complexity of search: O(kN²). More scalable with dimensionality than depth-based approaches. Building the right index is very costly, and the index building cost renders the index-based algorithms non-competitive.

154 A Naïve Nested-loop Algorithm.
For j = 1 to n do
  Set count_j = 0;
  For k = 1 to n do: if dist(j, k) ≤ D then count_j++;
  If count_j ≤ n(1-p) then output j as an outlier;
No explicit index construction. Complexity O(N²). Many database scans.
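The nested loop translates almost line for line into Python. The sketch below assumes Euclidean distance, counts an object as its own neighbor (as the loop over all k does), and treats objects at distance at most D as neighbors so that it agrees with the DB(p, D) definition; the function name `db_outliers` is chosen for this example. It also stops counting as soon as an object has too many neighbors to be an outlier.

```python
import math

def db_outliers(points, p, d):
    """Return the indices of the DB(p, d)-outliers among `points`."""
    n = len(points)
    max_neighbors = n * (1 - p)  # an outlier has at most this many neighbors
    outliers = []
    for j, a in enumerate(points):
        count = 0
        for b in points:
            if math.dist(a, b) <= d:
                count += 1
                if count > max_neighbors:
                    break  # j cannot be an outlier; no need to count further
        if count <= max_neighbors:
            outliers.append(j)
    return outliers
```

For instance, with four points on the unit square and a fifth point at (10, 10), p = 0.7 and D = 2, only the fifth point is reported, because it has a single neighbor (itself) within distance 2.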

155 Optimizations of Nested-loop Algorithm. Once an object has more than n(1-p) neighbors within radius D, there is no need to count further. Use the data in main memory as much as possible. Reduce the number of database scans.

156 A Block-based Nested-loop Algorithm. Partition the available memory into two blocks of equal size. Fill the first block, compare the objects in the block, and mark non-outliers. Read the remaining objects into the second block, one block at a time, and compare objects from the first and second blocks; mark non-outliers, and only compare potential outliers in the first block. Output the unmarked objects in the first block as outliers. Swap the names of the first and second blocks, and repeat until all objects have been processed.

157 Example. The dataset has four blocks: A, B, C, and D. (1) Compare objects in A (1 read). (2) Compare objects in A to those in B, C, and D (3 reads). (3) Compare objects in D (0 reads). (4) Compare objects in D to those in A (0 reads). (5) Compare objects in D to those in B and C (2 reads). (6) Compare objects in C to those in C, D, A, and B (2 reads). (7) Compare objects in B to those in B, C, A, and D (2 reads). In total 10 blocks are read, i.e. 10/4 = 2.5 passes over T.

158 Analysis of the Nested-loop Algorithm. The data set is partitioned into n blocks. Total number of block reads: n + (n-2)(n-1) = n² - 2n + 2. The number of passes over the dataset is therefore (n² - 2n + 2)/n = n - 2 + 2/n, i.e. at least n - 2. Many passes for large datasets.
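The arithmetic can be checked with a few lines of Python, assuming the read count n + (n-2)(n-1) for a dataset of n blocks; the function names are chosen for this example.

```python
def block_reads(n):
    """Total block reads of the block-based nested loop on n blocks."""
    return n + (n - 2) * (n - 1)

def passes(n):
    """Equivalent number of full passes over the dataset."""
    return block_reads(n) / n
```

For the four-block example this gives 10 reads and 2.5 passes; for 10 blocks it gives 82 reads and 8.2 passes, so the number of passes grows roughly linearly with the number of blocks.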

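159 A Cell-based Approach. Cells have side length
$$l = \frac{D}{2\sqrt{2}}$$
Layer-1 and layer-2 neighbors of a cell C_{x,y}:
$$L_1(C_{x,y}) = \{ C_{u,v} \mid |u - x| \le 1,\ |v - y| \le 1,\ C_{u,v} \ne C_{x,y} \}$$
$$L_2(C_{x,y}) = \{ C_{u,v} \mid |u - x| \le 3,\ |v - y| \le 3,\ C_{u,v} \notin L_1(C_{x,y}),\ C_{u,v} \ne C_{x,y} \}$$
Let M = n(1-p) be the maximum number of objects allowed within distance D of an outlier. If there are more than M objects in C_{x,y}, there is no outlier in C_{x,y}. If there are more than M objects in C_{x,y} ∪ L_1(C_{x,y}), there is no outlier in C_{x,y}. If there are at most M objects in C_{x,y} ∪ L_1(C_{x,y}) ∪ L_2(C_{x,y}), all objects in C_{x,y} are outliers.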

160 The Algorithm. Quantize each object to its appropriate cell. Label all cells having more than M objects red: there is no outlier in red cells. Label pink the L_1 neighbours of red cells, and the cells having more than M objects in C_{x,y} ∪ L_1(C_{x,y}): there is no outlier in pink cells. Output the objects in cells having at most M objects in C_{x,y} ∪ L_1(C_{x,y}) ∪ L_2(C_{x,y}) as outliers. For the remaining cells, check the objects one by one.
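A two-dimensional sketch of the cell-based idea follows. It is a simplified illustration under stated assumptions rather than the published algorithm: cells have side D/(2√2), M is taken to be n(1-p), an object counts as its own neighbor, and red and pink cells are handled by one test (a cell with more than M objects in itself plus its layer-1 neighbours contains no outlier, which also covers every layer-1 neighbour of a red cell). The function name `cell_based_outliers` is hypothetical.

```python
import math
from collections import defaultdict

def cell_based_outliers(points, p, d):
    """Return sorted indices of the DB(p, d)-outliers among 2-d points."""
    n = len(points)
    m = n * (1 - p)                  # M: most neighbors an outlier may have
    side = d / (2 * math.sqrt(2))    # cell side length l
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(math.floor(x / side), math.floor(y / side))].append(idx)

    def count(cx, cy, radius):
        # Number of objects in the (2*radius+1)^2 cells centred on (cx, cy).
        return sum(len(cells.get((cx + dx, cy + dy), ()))
                   for dx in range(-radius, radius + 1)
                   for dy in range(-radius, radius + 1))

    outliers = []
    for (cx, cy), members in cells.items():
        near = count(cx, cy, 1)      # objects in the cell and its L1 layer
        if near > m:                 # red or pink cell: no outlier here
            continue
        if count(cx, cy, 3) <= m:    # too few objects even including L2
            outliers.extend(members)
            continue
        # Remaining cells: check object by object against the L2 layer only,
        # since every object in the cell or its L1 layer is within distance d.
        l2 = [j for dx in range(-3, 4) for dy in range(-3, 4)
              if max(abs(dx), abs(dy)) > 1
              for j in cells.get((cx + dx, cy + dy), ())]
        for i in members:
            neighbors = near + sum(
                1 for j in l2 if math.dist(points[i], points[j]) <= d)
            if neighbors <= m:
                outliers.append(i)
    return sorted(outliers)
```

The choice of cell side guarantees that two objects in the same cell or in layer-1 neighbouring cells are at most D apart, while objects beyond the layer-2 neighbours are more than D apart, so distances only need to be computed for the layer-2 candidates of the undecided cells.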


More information

Clustering UE 141 Spring 2013

Clustering UE 141 Spring 2013 Clustering UE 141 Spring 013 Jing Gao SUNY Buffalo 1 Definition of Clustering Finding groups of obects such that the obects in a group will be similar (or related) to one another and different from (or

More information

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi-110 012 dcmishra@iasri.res.in What is Learning? "Learning denotes changes in a system that enable

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

How To Solve The Cluster Algorithm

How To Solve The Cluster Algorithm Cluster Algorithms Adriano Cruz adriano@nce.ufrj.br 28 de outubro de 2013 Adriano Cruz adriano@nce.ufrj.br () Cluster Algorithms 28 de outubro de 2013 1 / 80 Summary 1 K-Means Adriano Cruz adriano@nce.ufrj.br

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

Data Preprocessing. Week 2

Data Preprocessing. Week 2 Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

Linköpings Universitet - ITN TNM033 2011-11-30 DBSCAN. A Density-Based Spatial Clustering of Application with Noise

Linköpings Universitet - ITN TNM033 2011-11-30 DBSCAN. A Density-Based Spatial Clustering of Application with Noise DBSCAN A Density-Based Spatial Clustering of Application with Noise Henrik Bäcklund (henba892), Anders Hedblom (andh893), Niklas Neijman (nikne866) 1 1. Introduction Today data is received automatically

More information

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distance-based K-means, K-medoids,

More information

Cluster Analysis: Basic Concepts and Algorithms

Cluster Analysis: Basic Concepts and Algorithms Cluster Analsis: Basic Concepts and Algorithms What does it mean clustering? Applications Tpes of clustering K-means Intuition Algorithm Choosing initial centroids Bisecting K-means Post-processing Strengths

More information

Data Clustering Using Data Mining Techniques

Data Clustering Using Data Mining Techniques Data Clustering Using Data Mining Techniques S.R.Pande 1, Ms. S.S.Sambare 2, V.M.Thakre 3 Department of Computer Science, SSES Amti's Science College, Congressnagar, Nagpur, India 1 Department of Computer

More information

Chapter 3: Cluster Analysis

Chapter 3: Cluster Analysis Chapter 3: Cluster Analysis 3.1 Basic Concepts of Clustering 3.2 Partitioning Methods 3.3 Hierarchical Methods 3.4 Density-Based Methods 3.5 Model-Based Methods 3.6 Clustering High-Dimensional Data 3.7

More information

VISUALIZING HIERARCHICAL DATA. Graham Wills SPSS Inc., http://willsfamily.org/gwills

VISUALIZING HIERARCHICAL DATA. Graham Wills SPSS Inc., http://willsfamily.org/gwills VISUALIZING HIERARCHICAL DATA Graham Wills SPSS Inc., http://willsfamily.org/gwills SYNONYMS Hierarchical Graph Layout, Visualizing Trees, Tree Drawing, Information Visualization on Hierarchies; Hierarchical

More information

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 K-Means Cluster Analsis Chapter 3 PPDM Class Tan,Steinbach, Kumar Introduction to Data Mining 4/18/4 1 What is Cluster Analsis? Finding groups of objects such that the objects in a group will be similar

More information

A Comparative Study of clustering algorithms Using weka tools

A Comparative Study of clustering algorithms Using weka tools A Comparative Study of clustering algorithms Using weka tools Bharat Chaudhari 1, Manan Parikh 2 1,2 MECSE, KITRC KALOL ABSTRACT Data clustering is a process of putting similar data into groups. A clustering

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining /8/ What is Cluster

More information

OPTICS: Ordering Points To Identify the Clustering Structure

OPTICS: Ordering Points To Identify the Clustering Structure Proc. ACM SIGMOD 99 Int. Conf. on Management of Data, Philadelphia PA, 1999. OPTICS: Ordering Points To Identify the Clustering Structure Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 10 th, 2013 Wolf-Tilo Balke and Kinda El Maarry Institut für Informationssysteme Technische Universität Braunschweig

More information

Outlier Detection in Clustering

Outlier Detection in Clustering Outlier Detection in Clustering Svetlana Cherednichenko 24.01.2005 University of Joensuu Department of Computer Science Master s Thesis TABLE OF CONTENTS 1. INTRODUCTION...1 1.1. BASIC DEFINITIONS... 1

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical

More information

The SPSS TwoStep Cluster Component

The SPSS TwoStep Cluster Component White paper technical report The SPSS TwoStep Cluster Component A scalable component enabling more efficient customer segmentation Introduction The SPSS TwoStep Clustering Component is a scalable cluster

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.

More information

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental

More information

Clustering Via Decision Tree Construction

Clustering Via Decision Tree Construction Clustering Via Decision Tree Construction Bing Liu 1, Yiyuan Xia 2, and Philip S. Yu 3 1 Department of Computer Science, University of Illinois at Chicago, 851 S. Morgan Street, Chicago, IL 60607-7053.

More information

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

More information

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows: Statistics: Rosie Cornish. 2007. 3.1 Cluster Analysis 1 Introduction This handout is designed to provide only a brief introduction to cluster analysis and how it is done. Books giving further details are

More information

CHAPTER 20. Cluster Analysis

CHAPTER 20. Cluster Analysis CHAPTER 20 Cluster Analysis 20.1 Introduction 20.2 What Is Cluster Analysis? 20.3 Typical requirements 20.4 Types of Data in cluster Analysis 20.5 Interval-scaled Variables 20.6 Binary Variables 20.7 Nominal,Ordinal,

More information

A Survey of Clustering Algorithms for Big Data: Taxonomy & Empirical Analysis

A Survey of Clustering Algorithms for Big Data: Taxonomy & Empirical Analysis TRANSACTIONS ON EMERGING TOPICS IN COMPUTING, 2014 1 A Survey of Clustering Algorithms for Big Data: Taxonomy & Empirical Analysis A. Fahad, N. Alshatri, Z. Tari, Member, IEEE, A. Alamri, I. Khalil A.

More information

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data

More information

Data Mining: Foundation, Techniques and Applications

Data Mining: Foundation, Techniques and Applications Data Mining: Foundation, Techniques and Applications Lesson 1b :A Quick Overview of Data Mining Li Cuiping( 李 翠 平 ) School of Information Renmin University of China Anthony Tung( 鄧 锦 浩 ) School of Computing

More information

Smart-Sample: An Efficient Algorithm for Clustering Large High-Dimensional Datasets

Smart-Sample: An Efficient Algorithm for Clustering Large High-Dimensional Datasets Smart-Sample: An Efficient Algorithm for Clustering Large High-Dimensional Datasets Dudu Lazarov, Gil David, Amir Averbuch School of Computer Science, Tel-Aviv University Tel-Aviv 69978, Israel Abstract

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:

More information

Forschungskolleg Data Analytics Methods and Techniques

Forschungskolleg Data Analytics Methods and Techniques Forschungskolleg Data Analytics Methods and Techniques Martin Hahmann, Gunnar Schröder, Phillip Grosse Prof. Dr.-Ing. Wolfgang Lehner Why do we need it? We are drowning in data, but starving for knowledge!

More information

Going Big in Data Dimensionality:

Going Big in Data Dimensionality: LUDWIG- MAXIMILIANS- UNIVERSITY MUNICH DEPARTMENT INSTITUTE FOR INFORMATICS DATABASE Going Big in Data Dimensionality: Challenges and Solutions for Mining High Dimensional Data Peer Kröger Lehrstuhl für

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli (alberto.ceselli@unimi.it)

More information

Unsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning

Unsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning Unsupervised Learning and Data Mining Unsupervised Learning and Data Mining Clustering Decision trees Artificial neural nets K-nearest neighbor Support vectors Linear regression Logistic regression...

More information

Philosophies and Advances in Scaling Mining Algorithms to Large Databases

Philosophies and Advances in Scaling Mining Algorithms to Large Databases Philosophies and Advances in Scaling Mining Algorithms to Large Databases Paul Bradley Apollo Data Technologies paul@apollodatatech.com Raghu Ramakrishnan UW-Madison raghu@cs.wisc.edu Johannes Gehrke Cornell

More information

A Method for Decentralized Clustering in Large Multi-Agent Systems

A Method for Decentralized Clustering in Large Multi-Agent Systems A Method for Decentralized Clustering in Large Multi-Agent Systems Elth Ogston, Benno Overeinder, Maarten van Steen, and Frances Brazier Department of Computer Science, Vrije Universiteit Amsterdam {elth,bjo,steen,frances}@cs.vu.nl

More information

Part 2: Community Detection

Part 2: Community Detection Chapter 8: Graph Data Part 2: Community Detection Based on Leskovec, Rajaraman, Ullman 2014: Mining of Massive Datasets Big Data Management and Analytics Outline Community Detection - Social networks -

More information

Cluster analysis Cosmin Lazar. COMO Lab VUB

Cluster analysis Cosmin Lazar. COMO Lab VUB Cluster analysis Cosmin Lazar COMO Lab VUB Introduction Cluster analysis foundations rely on one of the most fundamental, simple and very often unnoticed ways (or methods) of understanding and learning,

More information

Discovering Local Subgroups, with an Application to Fraud Detection

Discovering Local Subgroups, with an Application to Fraud Detection Discovering Local Subgroups, with an Application to Fraud Detection Abstract. In Subgroup Discovery, one is interested in finding subgroups that behave differently from the average behavior of the entire

More information

The Role of Visualization in Effective Data Cleaning

The Role of Visualization in Effective Data Cleaning The Role of Visualization in Effective Data Cleaning Yu Qian Dept. of Computer Science The University of Texas at Dallas Richardson, TX 75083-0688, USA qianyu@student.utdallas.edu Kang Zhang Dept. of Computer

More information

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague. http://ida.felk.cvut.

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague. http://ida.felk.cvut. Machine Learning and Data Analysis overview Jiří Kléma Department of Cybernetics, Czech Technical University in Prague http://ida.felk.cvut.cz psyllabus Lecture Lecturer Content 1. J. Kléma Introduction,

More information

A Survey of Clustering Techniques

A Survey of Clustering Techniques A Survey of Clustering Techniques Pradeep Rai Asst. Prof., CSE Department, Kanpur Institute of Technology, Kanpur-0800 (India) Shubha Singh Asst. Prof., MCA Department, Kanpur Institute of Technology,

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA 30602-2501

Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA 30602-2501 CLUSTER ANALYSIS Steven M. Ho!and Department of Geology, University of Georgia, Athens, GA 30602-2501 January 2006 Introduction Cluster analysis includes a broad suite of techniques designed to find groups

More information

GraphZip: A Fast and Automatic Compression Method for Spatial Data Clustering

GraphZip: A Fast and Automatic Compression Method for Spatial Data Clustering GraphZip: A Fast and Automatic Compression Method for Spatial Data Clustering Yu Qian Kang Zhang Department of Computer Science, The University of Texas at Dallas, Richardson, TX 75083-0688, USA {yxq012100,

More information

Summary Data Mining & Process Mining (1BM46) Content. Made by S.P.T. Ariesen

Summary Data Mining & Process Mining (1BM46) Content. Made by S.P.T. Ariesen Summary Data Mining & Process Mining (1BM46) Made by S.P.T. Ariesen Content Data Mining part... 2 Lecture 1... 2 Lecture 2:... 4 Lecture 3... 7 Lecture 4... 9 Process mining part... 13 Lecture 5... 13

More information

Statistical Databases and Registers with some datamining

Statistical Databases and Registers with some datamining Unsupervised learning - Statistical Databases and Registers with some datamining a course in Survey Methodology and O cial Statistics Pages in the book: 501-528 Department of Statistics Stockholm University

More information

R-trees. R-Trees: A Dynamic Index Structure For Spatial Searching. R-Tree. Invariants

R-trees. R-Trees: A Dynamic Index Structure For Spatial Searching. R-Tree. Invariants R-Trees: A Dynamic Index Structure For Spatial Searching A. Guttman R-trees Generalization of B+-trees to higher dimensions Disk-based index structure Occupancy guarantee Multiple search paths Insertions

More information