Cluster Validation
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is: accuracy, precision, recall. For cluster analysis, the analogous question is: how do we evaluate the "goodness" of the resulting clusters? But clusters are in the eye of the beholder! Then why do we want to evaluate them?
- To avoid finding patterns in noise
- To compare clustering algorithms
- To compare two sets of clusters
- To compare two clusters
Clusters found in Random Data
[Figure: random points and the clusterings found in them by DBSCAN, K-means, and Complete Link.]
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
- Use only the data
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the "correct" number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types.
- External Index: used to measure the extent to which cluster labels match externally supplied class labels. Example: entropy.
- Internal Index: used to measure the goodness of a clustering structure without respect to external information. Example: Sum of Squared Error (SSE).
- Relative Index: used to compare two different clusterings or clusters. Often an external or internal index is used for this function, e.g., SSE or entropy.
Sometimes these are referred to as criteria instead of indices. However, sometimes "criterion" is the general strategy and "index" is the numerical measure that implements the criterion.
Measuring Cluster Validity Via Correlation
Two matrices:
- Proximity Matrix
- Incidence Matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, and 0 if the pair belongs to different clusters.
Compute the correlation between the two matrices. Since the matrices are symmetric, only the correlation between n(n-1)/2 entries needs to be calculated. High correlation indicates that points that belong to the same cluster are close to each other. Not a good measure for some density- or contiguity-based clusters.
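A rough sketch of this measure (not the slides' own data): use pairwise distances as the proximity matrix, build the incidence matrix from a K-means labelling, and correlate the upper-triangular entries. All data sizes and parameters below are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((100, 2))                      # illustrative 2-D data

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

prox = squareform(pdist(X))                   # proximity matrix (Euclidean distances)
incid = (labels[:, None] == labels[None, :]).astype(float)   # 1 = same cluster, 0 = different

# Both matrices are symmetric, so correlate only the n(n-1)/2 upper-triangle entries.
iu = np.triu_indices_from(prox, k=1)
corr = np.corrcoef(prox[iu], incid[iu])[0, 1]
print(f"correlation = {corr:.3f}")
```

Because the proximity matrix here holds distances rather than similarities, a good clustering yields a strongly negative correlation (same-cluster pairs have small distances), which is why the values quoted on the next slide are negative.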
Measuring Cluster Validity Via Correlation
Correlation of the incidence and proximity matrices for the K-means clusterings of two data sets: Corr = -0.9235 for the data with well-separated clusters and Corr = -0.5810 for the random data.
[Figure: scatter plots of the two data sets.]
Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to cluster labels and inspect visually.
[Figure: well-separated clusters and the corresponding reordered similarity matrix, which shows sharp high-similarity blocks along the diagonal.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figures: random data clustered by DBSCAN, K-means, and Complete Link, each shown with its reordered similarity matrix; the diagonal blocks are much less distinct than for well-separated data.]
Using Similarity Matrix for Cluster Validation
[Figure: a more complicated data set, its DBSCAN clustering, and the corresponding reordered similarity matrix.]
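A minimal sketch of this visual check, assuming a simple similarity of one minus normalized distance and a K-means labelling; the random data here are made up, not the slides' examples.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.random((100, 2))                      # random data: expect a fuzzy picture

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
order = np.argsort(labels)                    # group rows/columns by cluster label

dist = squareform(pdist(X))
sim = 1.0 - dist / dist.max()                 # turn distances into similarities

plt.imshow(sim[order][:, order], cmap="viridis")
plt.colorbar(label="similarity")
plt.title("Similarity matrix ordered by cluster labels")
plt.show()
```

For well-separated clusters the reordered matrix shows crisp diagonal blocks; for random data like this, the blocks are weak, matching the pictures above.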
Internal Measures: SSE
Clusters in more complicated figures aren't well separated.
Internal Index: used to measure the goodness of a clustering structure without respect to external information; SSE is an example.
SSE is good for comparing two clusterings or two clusters (average SSE). It can also be used to estimate the number of clusters.
[Figure: SSE as a function of the number of clusters K; the knee in the curve suggests the natural number of clusters.]
Internal Measures: SSE
SSE curve for a more complicated data set.
[Figure: the more complicated data set and the SSE of the clusters found using K-means.]
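A short sketch of estimating the number of clusters from SSE, using scikit-learn's inertia_ attribute (which is exactly the within-cluster SSE); the ten-blob data set is an assumption, not the slides' data.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with ten Gaussian blobs (illustrative).
X, _ = make_blobs(n_samples=1000, centers=10, random_state=0)

ks = range(1, 16)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker="o")
plt.xlabel("K")
plt.ylabel("SSE")
plt.show()   # look for the "knee" of the curve near the true number of clusters
```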
Framework for Cluster Validity
Need a framework to interpret any measure. For example, if our measure of evaluation yields some particular value, is that good, fair, or poor? Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data. We can compare the values of an index that result from random data or random clusterings to those of a clustering result; if the observed value of the index is unlikely under randomness, then the cluster results are valid. These approaches are more complicated and harder to understand.
For comparing the results of two different sets of cluster analyses, a framework is less necessary. However, there is still the question of whether the difference between two index values is significant.
Statistical Framework for SSE
Example: compare an SSE of 0.005 against the SSE of three clusters found in random data. The histogram shows the SSE of three clusters in 500 sets of random data points of size 100, distributed over the range 0.2-0.8 for x and y values. The random SSE values fall roughly between 0.016 and 0.034, so an SSE as low as 0.005 is very unlikely to arise from random data.
[Figure: scatter plot of the clustered data and histogram of SSE values for the random data sets.]
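A hedged sketch of this randomization test: cluster many uniformly random data sets with K = 3 and see where an observed SSE falls in the resulting distribution. The 0.005 value, 500 trials, 100 points, and 0.2-0.8 range echo the example above as reconstructed; everything else is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def random_sse(n_points=100, k=3, low=0.2, high=0.8):
    # SSE of a K-means clustering of one uniformly random data set.
    X = rng.uniform(low, high, size=(n_points, 2))
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

null_sse = np.array([random_sse() for _ in range(500)])

observed_sse = 0.005                          # the value being tested
p_value = np.mean(null_sse <= observed_sse)   # fraction of random runs at least this good
print(f"null SSE range: [{null_sse.min():.3f}, {null_sse.max():.3f}], p = {p_value:.3f}")
```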
Statistical Framework for Correlation
Correlation of the incidence and proximity matrices for the K-means clusterings of the two data sets shown earlier: Corr = -0.9235 (well-separated clusters) and Corr = -0.5810 (random data).
[Figure: scatter plots of the two data sets.]
Internal Measures: Cohesion and Separation
Cluster Cohesion: measures how closely related the objects in a cluster are. Example: SSE.
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters. Example: squared error.
Cohesion is measured by the within-cluster sum of squares (SSE):
WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2
Separation is measured by the between-cluster sum of squares:
BSS = \sum_i |C_i| (m - m_i)^2
where |C_i| is the size of cluster i, m_i is the mean of cluster i, and m is the overall mean.
Internal Measures: Cohesion and Separation
Example (SSE): points 1, 2, 4, 5 on a line, with overall mean m = 3 and cluster means m_1 = 1.5, m_2 = 4.5. Note that BSS + WSS = constant.
K=1 cluster:
WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
BSS = 4 \times (3-3)^2 = 0
Total = 10 + 0 = 10
K=2 clusters:
WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
BSS = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9
Total = 1 + 9 = 10
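A tiny check of this worked example (illustrative code, not from the slides), computing WSS and BSS for the K=1 and K=2 groupings of the four points:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
m = x.mean()                                          # overall mean = 3.0

def wss_bss(clusters):
    # WSS: squared distances to each cluster's own mean.
    wss = sum(((c - c.mean()) ** 2).sum() for c in clusters)
    # BSS: cluster sizes times squared distance of cluster mean to overall mean.
    bss = sum(len(c) * (m - c.mean()) ** 2 for c in clusters)
    return wss, bss

print(wss_bss([x]))                                   # K=1: (10.0, 0.0)
print(wss_bss([x[:2], x[2:]]))                        # K=2: (1.0, 9.0); totals are both 10
```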
Internal Measures: Cohesion and Separation
A proximity-graph-based approach can also be used for cohesion and separation:
- Cluster cohesion is the sum of the weights of all links within a cluster.
- Cluster separation is the sum of the weights of links between nodes in the cluster and nodes outside the cluster.
[Figure: a proximity graph illustrating within-cluster links (cohesion) and between-cluster links (separation).]
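A small sketch of this graph-based view, using a hypothetical symmetric similarity matrix as the edge weights (the numbers are made up for illustration):

```python
import numpy as np

def cohesion_separation(sim, labels, c):
    """Graph-based cohesion and separation of cluster c, given a symmetric
    edge-weight (similarity) matrix and a label per node."""
    inside = labels == c
    outside = ~inside
    # Count each within-cluster edge once (upper triangle of the sub-matrix).
    cohesion = np.triu(sim[np.ix_(inside, inside)], k=1).sum()
    # Sum the weights of edges crossing the cluster boundary.
    separation = sim[np.ix_(inside, outside)].sum()
    return cohesion, separation

sim = np.array([[1.0, 0.9, 0.2, 0.1],
                [0.9, 1.0, 0.3, 0.2],
                [0.2, 0.3, 1.0, 0.8],
                [0.1, 0.2, 0.8, 1.0]])
labels = np.array([0, 0, 1, 1])
print(cohesion_separation(sim, labels, 0))    # (0.9, 0.8)
```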
Internal Measures: Silhouette Coefficient
The silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as for clusters and clusterings.
For an individual point i:
- Calculate a = average distance of i to the points in its cluster.
- Calculate b = min (average distance of i to the points in another cluster).
- The silhouette coefficient for the point is then s = 1 - a/b if a < b (or s = b/a - 1 if a >= b, which is not the usual case).
- s is typically between 0 and 1; the closer to 1 the better.
The average silhouette width can be calculated for a cluster or for a clustering.
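A brief sketch using scikit-learn, whose silhouette is the equivalent form s = (b - a) / max(a, b) (identical to 1 - a/b in the usual case a < b); the blob data are illustrative, not the slides' example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

s = silhouette_samples(X, labels)             # one coefficient per point
print("average silhouette width:", silhouette_score(X, labels))
print("per-cluster averages:", [s[labels == c].mean() for c in range(3)])
```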
External Measures of Cluster Validity: Entropy and Purity
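As a generic sketch of these two external measures (the class and cluster labels below are made up; this is not the slide's example), both can be computed from a cluster-versus-class contingency table:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def purity_and_entropy(classes, clusters):
    """Purity and entropy of a clustering against known class labels."""
    cm = confusion_matrix(classes, clusters)       # rows = classes, cols = clusters
    n = cm.sum()
    cluster_sizes = cm.sum(axis=0)

    # Purity: fraction of points belonging to the majority class of their cluster.
    purity = cm.max(axis=0).sum() / n

    # Entropy: size-weighted entropy of the class distribution inside each cluster.
    p = cm / np.maximum(cluster_sizes, 1)
    with np.errstate(divide="ignore", invalid="ignore"):
        logp = np.where(p > 0, np.log2(p), 0.0)
    entropy = (-(p * logp).sum(axis=0)) @ (cluster_sizes / n)
    return purity, entropy

classes = np.array([0, 0, 0, 1, 1, 2, 2, 2])     # hypothetical true class labels
clusters = np.array([0, 0, 1, 1, 1, 2, 2, 2])    # hypothetical cluster labels
print(purity_and_entropy(classes, clusters))     # (0.875, ~0.344)
```

Higher purity and lower entropy indicate clusters that align better with the externally supplied class labels.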
Final Comment on Cluster Validity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
- Algorithms for Clustering Data, Jain and Dubes