Performance Metrics for Graph Mining Tasks


 Ashlyn Wilson
2 Outline
Introduction to Performance Metrics
Supervised Learning Performance Metrics
Unsupervised Learning Performance Metrics
Optimizing Metrics
Statistical Significance Techniques
Model Comparison
4 Introduction to Performance Metrics
A performance metric measures how well a data mining algorithm performs on a given dataset. For example, if we apply a classification algorithm to a dataset, we first check how many of the data points were classified correctly. This is a performance metric, and its formal name is accuracy.
Performance metrics also help us decide whether one algorithm is better or worse than another. For example, if classification algorithm A classifies 80% of data points correctly and classification algorithm B classifies 90% correctly, we immediately conclude that algorithm B is doing better. There are some intricacies to this comparison that we will discuss in this chapter.
6 Supervised Learning Performance Metrics
Metrics that are applied when the ground truth is known (e.g., classification tasks).
Outline:
2×2 Confusion Matrix
Multi-level Confusion Matrix
Visual Metrics
Cross-validation
7 2×2 Confusion Matrix
A 2×2 matrix is used to tabulate the results of a 2-class supervised learning problem. The entry f(i,j) is the number of elements with class label i that were predicted to have class label j; + and − are the two class labels.

                Actual +                Actual −
Predicted +     f++ (True Positive)     f−+ (False Positive)    C = f++ + f−+
Predicted −     f+− (False Negative)    f−− (True Negative)     D = f+− + f−−
                A = f++ + f+−           B = f−+ + f−−           T = f++ + f+− + f−+ + f−−
8 2×2 Confusion Matrix Example
Results from a classification algorithm, and the corresponding 2×2 matrix for the given table.
[Table: Vertex ID, Actual, Predicted]

                Actual +    Actual −
Predicted +     4           1           C = 5
Predicted −     2           1           D = 3
                A = 6       B = 2       T = 8

True Positive = 4, False Positive = 1, True Negative = 1, False Negative = 2
9 2×2 Confusion Matrix Performance Metrics
A walkthrough of the different metrics, using the preceding example:
1. Accuracy is the proportion of correct predictions: (f++ + f−−) / T
2. Error rate is the proportion of incorrect predictions: (f+− + f−+) / T
3. Recall is the proportion of + data points that are predicted as +: f++ / (f++ + f+−)
4. Precision is the proportion of data points predicted as + that are truly +: f++ / (f++ + f−+)
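The four metrics above can be checked with a few lines of Python. This is a minimal sketch, not the deck's R package: the function name `two_class_metrics` is made up for illustration, and the counts are the ones listed in the 2×2 example (True Positive = 4, False Positive = 1, False Negative = 2, True Negative = 1).

```python
def two_class_metrics(tp, fp, fn, tn):
    """Return accuracy, error rate, recall and precision for a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,     # proportion of correct predictions
        "error_rate": (fp + fn) / total,   # proportion of incorrect predictions
        "recall": tp / (tp + fn),          # actual + points that were predicted as +
        "precision": tp / (tp + fp),       # predicted + points that are truly +
    }


# Counts from the example: TP = 4, FP = 1, FN = 2, TN = 1 (T = 8).
metrics = two_class_metrics(tp=4, fp=1, fn=2, tn=1)
print(metrics)
```

For these counts, accuracy is 5/8, error rate is 3/8, recall is 4/6 and precision is 4/5.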
10 Multi-level Confusion Matrix
An n×n matrix, where n is the number of classes and entry (i,j) represents the number of elements with class label i that were predicted to have class label j.
11 Multi-level Confusion Matrix Example
[Table: multi-level confusion matrix (Actual vs. Predicted) with the marginal sums of predictions and the marginal sums of actual values; T = 14]
12 Multi-level Confusion Matrix: Conversion to 2×2
2×2 matrix specific to Class 1:

                       Actual 1 (+)    Actual Not 1 (−)
Predicted 1 (+)        2               2                   C = 4
Predicted Not 1 (−)    2               8                   D = 10
                       A = 4           B = 10              T = 14

We can now apply all the 2×2 metrics:
Accuracy = 10/14, Error = 4/14, Recall = 2/4, Precision = 2/4
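The conversion can be sketched as a one-vs-rest collapse. The 3×3 matrix below is hypothetical, since the slide's own matrix did not survive; it was chosen only so that it sums to T = 14 and yields the 2, 2, 2, 8 counts for the first class. The sketch assumes rows are predicted classes and columns are actual classes, and `collapse_to_two_class` is an illustrative name.

```python
def collapse_to_two_class(matrix, k):
    """Collapse an n x n confusion matrix to (tp, fp, fn, tn) for class index k.

    Assumes matrix[p][a] counts points predicted as class p whose actual class is a.
    """
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    predicted_k = sum(matrix[k])               # everything predicted as k
    actual_k = sum(row[k] for row in matrix)   # everything that truly is k
    fp = predicted_k - tp
    fn = actual_k - tp
    tn = total - tp - fp - fn
    return tp, fp, fn, tn


# Hypothetical 3-class matrix with T = 14 (not the slide's original data).
example = [
    [2, 1, 1],
    [1, 2, 2],
    [1, 1, 3],
]
tp, fp, fn, tn = collapse_to_two_class(example, 0)
accuracy = (tp + tn) / (tp + fp + fn + tn)
```

Once collapsed, any of the 2×2 metrics can be applied to the four counts.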
13 Multi-level Confusion Matrix Performance Metrics
1. Critical Success Index (or Threat Score): for each class L, the ratio of correct predictions for class L to the sum of vertices that belong to L and those predicted as L.
2. Bias: for each class L, the ratio of the total number of points with class label L to the number of points predicted as L. Bias helps us understand whether a model is over- or under-predicting a class.
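A small sketch of both per-class metrics follows. It rests on two assumptions that the transcript does not confirm: the denominator of the Critical Success Index counts each vertex once (vertices that belong to L or are predicted as L), and the matrix has predicted classes as rows and actual classes as columns. The matrix values and the name `csi_and_bias` are hypothetical.

```python
def csi_and_bias(matrix, k):
    """Return (critical success index, bias) for class index k.

    Assumes matrix[p][a] counts points predicted as class p whose actual class is a.
    """
    correct = matrix[k][k]
    predicted_k = sum(matrix[k])
    actual_k = sum(row[k] for row in matrix)
    # Assumption: vertices that are both actual-k and predicted-k are counted once.
    csi = correct / (actual_k + predicted_k - correct)
    # Bias as worded on the slide: points labelled k over points predicted as k.
    bias = actual_k / predicted_k
    return csi, bias


# Hypothetical 3-class matrix (not the slide's original data).
example = [
    [2, 1, 1],
    [1, 2, 2],
    [1, 1, 3],
]
csi_first, bias_first = csi_and_bias(example, 0)
csi_third, bias_third = csi_and_bias(example, 2)
```

Under this definition, a bias above 1 means the class is predicted less often than it occurs. The function raises `ZeroDivisionError` for a class that is never predicted.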
14 Confusion Matrix R code
library(performancemetrics)
data(m)
M
     [,1] [,2]
[1,]    4    1
[2,]    2    1
twocrossconfusionmatrixmetrics(m)
data(multilevelm)
MultiLevelM
     [,1] [,2] [,3]
[1,]
[2,]
[3,]
multilevelconfusionmatrixmetrics(multilevelm)
15 Visual Metrics
Metrics that are plotted on a graph to give a visual picture of the performance of two-class classifiers.
[Figure: ROC plot, with False Positive Rate (0 to 1) on the x-axis and True Positive Rate (0 to 1) on the y-axis]
(0,1): ideal
(1,1): predicts the +ve class all the time
(0,0): predicts the −ve class all the time
Diagonal from (0,0) to (1,1): AUC = 0.5
Plot the performance of multiple models to decide which one performs best.
16 Understanding Model Performance Based on the ROC Plot
[Figure: ROC plot, with False Positive Rate on the x-axis, True Positive Rate on the y-axis, and the AUC = 0.5 diagonal]
Upper left: models that lie here have good performance. Note: this is where you aim to get the model.
Lower left: models that lie here are conservative. They will not predict + unless there is strong evidence, which gives low false positives but high false negatives.
Upper right: models that lie here are liberal. They will predict + with little evidence, which gives high false positives.
Below the diagonal: models that lie here perform worse than random. Note: models here can be negated to move them to the upper left.
17 ROC Plot Example
[Figure: ROC plot, with False Positive Rate on the x-axis and True Positive Rate on the y-axis]
M1: (0.1, 0.8)
M2: (0.5, 0.5)
M3: (0.3, 0.5)
M1's performance lies furthest toward the upper left and hence M1 is considered the best model.
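One way to make this comparison concrete is to rank the models by their distance to the ideal point (0,1). This is an assumed criterion chosen for illustration; the slide only compares the points visually. The (False Positive Rate, True Positive Rate) pairs are the ones given for M1, M2 and M3.

```python
import math


def distance_to_ideal(fpr, tpr):
    """Euclidean distance from an ROC point to the ideal corner (FPR = 0, TPR = 1)."""
    return math.hypot(fpr - 0.0, tpr - 1.0)


# (False Positive Rate, True Positive Rate) for each model in the example.
models = {"M1": (0.1, 0.8), "M2": (0.5, 0.5), "M3": (0.3, 0.5)}

# Smaller distance means closer to the upper-left corner.
ranked = sorted(models, key=lambda name: distance_to_ideal(*models[name]))
best = ranked[0]
print(ranked)
```

Other criteria, such as the area under a full ROC curve, may rank models differently when more than one operating point per model is available.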
18 Cross-validation
Cross-validation, also called rotation estimation, is a way to analyze how a predictive data mining model will perform on an unknown dataset, i.e., how well the model generalizes.
Strategy:
1. Divide the dataset into two non-overlapping subsets
2. Call one subset the test set and the other the training set
3. Build the model using the training set
4. Obtain predictions for the test set
5. Use the test set predictions to calculate all the performance metrics
Typically, cross-validation is performed for multiple iterations, selecting a different non-overlapping test and training set each time.
19 Types of Cross-validation
Hold-out: a random one third of the data is used as the test set and the remaining two thirds as the training set
k-fold: divide the data into k partitions, use one partition as the test set and the remaining k−1 partitions for training
Leave-one-out: special case of k-fold where k equals the number of data points, so each test set contains a single point
Note: selection of data points is typically done in a stratified manner, i.e., the class distribution in the test set is similar to that in the training set.
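The k-fold scheme can be sketched as an index generator. This is a minimal, non-stratified illustration using only the standard library; the function name `k_fold_splits` and the `seed` parameter are assumptions made for the example, and a stratified variant would additionally balance class labels across folds.

```python
import random


def k_fold_splits(n_points, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k roughly equal partitions.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 3-fold cross-validation over 9 points.
splits = list(k_fold_splits(9, 3))

# Leave-one-out is the case where k equals the number of points.
leave_one_out = list(k_fold_splits(5, 5))
```

Each point appears in exactly one test set, and the training and test sets of a fold never overlap.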
21 Unsupervised Learning Performance Metrics
Metrics that are applied when the ground truth is not always available (e.g., clustering tasks).
Outline:
Evaluation Using Prior Knowledge
Evaluation Using Cluster Properties
22 Evaluation Using Prior Knowledge
One way to test the effectiveness of an unsupervised learning method is to take a dataset D with known class labels, strip the labels, and provide the result as input to an unsupervised learning algorithm U. The resulting clusters are then compared with the known class labels to judge the performance of U.
To evaluate performance:
1. Contingency Table
2. Ideal and Observed Matrices
23 Contingency Table

                   Same Cluster    Different Cluster
Same Class         u11             u10
Different Class    u01             u00

(A) To fill the table, initialize u11, u01, u10, u00 to 0
(B) Then, for each pair of points of the form (v,w):
1. if v and w belong to the same class and the same cluster, increment u11
2. if v and w belong to the same class but different clusters, increment u10
3. if v and w belong to different classes but the same cluster, increment u01
4. if v and w belong to different classes and different clusters, increment u00
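Steps (A) and (B) translate directly into a pair-counting loop. The sketch below uses a toy dataset of four points invented for illustration; `contingency_counts` is a hypothetical name.

```python
from itertools import combinations


def contingency_counts(classes, clusters):
    """Count point pairs by same/different class and same/different cluster."""
    u11 = u10 = u01 = u00 = 0
    for v, w in combinations(range(len(classes)), 2):
        same_class = classes[v] == classes[w]
        same_cluster = clusters[v] == clusters[w]
        if same_class and same_cluster:
            u11 += 1
        elif same_class:
            u10 += 1
        elif same_cluster:
            u01 += 1
        else:
            u00 += 1
    return u11, u10, u01, u00


# Toy data: class labels known a priori, cluster ids assigned by some algorithm.
classes = ["a", "a", "b", "b"]
clusters = [0, 0, 0, 1]
counts = contingency_counts(classes, clusters)
```

With T points there are T(T−1)/2 pairs, so the four counts always sum to that number (6 for the toy data).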
24 Contingency Table Performance Metrics
Example matrix:

                   Same Cluster    Different Cluster
Same Class         9               4
Different Class    3               12

Rand Statistic, also called the simple matching coefficient, gives equal importance to placing a pair of points with the same class label in the same cluster and to placing a pair of points with different class labels in different clusters, i.e., it accounts for both the specificity and the sensitivity of the clustering:
Rand = (u11 + u00) / (u11 + u10 + u01 + u00) = 21/28 = 0.75
Jaccard Coefficient can be used when placing a pair of points with the same class label in the same cluster is what primarily matters:
Jaccard = u11 / (u11 + u10 + u01) = 9/16 = 0.5625
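As a quick check of the example matrix, both coefficients can be computed from the four counts (u11 = 9, u10 = 4, u01 = 3, u00 = 12). The function names are illustrative.

```python
def rand_statistic(u11, u10, u01, u00):
    """Fraction of pairs on which class labels and clusters agree."""
    return (u11 + u00) / (u11 + u10 + u01 + u00)


def jaccard_coefficient(u11, u10, u01):
    """Like the Rand statistic, but ignoring different-class, different-cluster pairs."""
    return u11 / (u11 + u10 + u01)


# Counts from the example contingency table.
rand = rand_statistic(9, 4, 3, 12)
jaccard = jaccard_coefficient(9, 4, 3)
print(rand, jaccard)
```

The Jaccard coefficient leaves u00 out of both numerator and denominator, which is why it is lower than the Rand statistic here.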
25 Ideal and Observed Matrices
Given that the number of points is T, the ideal matrix is a T×T matrix in which each cell (i,j) has a 1 if the points i and j belong to the same class and a 0 if they belong to different classes. The observed matrix is a T×T matrix in which each cell (i,j) has a 1 if the points i and j belong to the same cluster and a 0 if they belong to different clusters.
The Mantel Test is a statistical test of the correlation between two matrices of the same rank. The two matrices, in this case, are symmetric; hence, it is sufficient to analyze the lower or upper triangle of each matrix.
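The sketch below builds the upper triangle of each matrix as a flat 0/1 list and computes the Pearson correlation between the two, which is the statistic the Mantel test is based on. It does not perform the permutation step that turns the correlation into a significance level, and the toy labels and function names are assumptions made for illustration.

```python
import math
from itertools import combinations


def pair_indicator(labels):
    """Upper-triangle entries of the T x T same-label matrix, as a flat list."""
    return [1 if labels[i] == labels[j] else 0
            for i, j in combinations(range(len(labels)), 2)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (both must vary)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


classes = ["a", "a", "b", "b"]           # ideal matrix comes from class labels
ideal = pair_indicator(classes)

# A clustering that matches the classes exactly, and one that does not.
perfect = pearson(ideal, pair_indicator([1, 1, 0, 0]))
mixed = pearson(ideal, pair_indicator([0, 0, 0, 1]))
```

A correlation near 1 means the observed matrix reproduces the ideal one. `pearson` divides by zero if either list is constant, for example when all points fall in one cluster.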
26 Evaluation Using Prior Knowledge R code
library(performancemetrics)
data(contingencytable)
ContingencyTable
     [,1] [,2]
[1,]    9    4
[2,]    3   12
contingencytablemetrics(contingencytable)
27 Evaluation Using Cluster Properties
In the absence of prior knowledge, we have to rely on information from the clusters themselves to evaluate performance.
1. Cohesion measures how closely objects in the same cluster are related
2. Separation measures how distinct or separated a cluster is from all the other clusters
[Equations: definitions of cohesion and separation]
Here, g_i refers to cluster i, W is the total number of clusters, x and y are data points, and proximity can be any similarity measure (e.g., cosine similarity).
We want the cohesion to be close to 1 and the separation to be close to 0.
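The slide's own formulas did not survive, so the following is only one plausible reading: cohesion as the average pairwise cosine similarity inside a cluster, and separation as the average cosine similarity between points of two different clusters. The averaging, the function names, and the toy clusters are all assumptions.

```python
import math
from itertools import combinations


def cosine(x, y):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))


def cohesion(cluster):
    """Average proximity over all pairs of points within one cluster (assumed form)."""
    pairs = list(combinations(cluster, 2))
    return sum(cosine(x, y) for x, y in pairs) / len(pairs)


def separation(cluster_a, cluster_b):
    """Average proximity over all cross-cluster pairs of points (assumed form)."""
    total = sum(cosine(x, y) for x in cluster_a for y in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


# Toy clusters: g1 points along the x-axis, g2 points along the y-axis.
g1 = [(1.0, 0.0), (2.0, 0.0)]
g2 = [(0.0, 1.0), (0.0, 3.0)]
coh = cohesion(g1)
sep = separation(g1, g2)
```

With a similarity measure as the proximity, this toy example gives the outcome the slide describes as desirable: cohesion of 1 and separation of 0.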
29 Optimizing Metrics
Performance metrics that act as optimization functions for a data mining algorithm.
Outline:
Sum of Squared Errors
Preserved Variability
30 Sum of Squared Errors
The sum of squared errors (SSE) is typically used in clustering algorithms to measure the quality of the clusters obtained. It takes into consideration the distance between each point in a cluster and its cluster center (the centroid or some other chosen representative). For d_j, a point in cluster g_i, where m_i is the cluster center of g_i, and W is the total number of clusters, SSE is defined as follows:
$$SSE = \sum_{i=1}^{W} \sum_{d_j \in g_i} \mathrm{dist}(d_j, m_i)^2$$
This value is small when points are close to their cluster center, indicating a good clustering. Conversely, a large SSE indicates a poor clustering. Thus, clustering algorithms aim to minimize SSE.
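A minimal sketch of the SSE computation, assuming Euclidean distance and taking the centroid (the coordinate-wise mean) as each cluster center. The toy clusters and the name `sse` are illustrative.

```python
def sse(clusters):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        dims = len(points[0])
        # Centroid: coordinate-wise mean of the points in this cluster.
        centroid = [sum(p[d] for p in points) / len(points) for d in range(dims)]
        total += sum((p[d] - centroid[d]) ** 2 for p in points for d in range(dims))
    return total


# Two toy clusters: one with two points, one with a single point.
clusters = [
    [(0.0, 0.0), (2.0, 0.0)],   # centroid (1, 0), contributes 1 + 1
    [(5.0, 5.0)],               # centroid is the point itself, contributes 0
]
total_sse = sse(clusters)
```

Moving either point of the first cluster further from the other raises the total, which matches the reading that a large SSE indicates a poor clustering.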
31 Preserved Variability
Preserved variability is typically used in eigenvector-based dimension reduction techniques to quantify the variance preserved by the chosen dimensions. The objective of the dimension reduction technique is to maximize this parameter. Given that each point is represented in r dimensions, that k of them are chosen (k << r), and that the eigenvalues are λ_1 ≥ λ_2 ≥ … ≥ λ_{r−1} ≥ λ_r, the preserved variability (PV) is calculated as follows:
$$PV = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{r} \lambda_i}$$
The value of this parameter depends on the number of dimensions chosen: the more dimensions included, the higher the value. Choosing all the dimensions results in the perfect score of 1.
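The ratio can be illustrated with a short function. The eigenvalues below are made up for the example, and `preserved_variability` is a hypothetical name.

```python
def preserved_variability(eigenvalues, k):
    """Share of total variance kept by the k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical eigenvalues for a 4-dimensional dataset.
eigenvalues = [4.0, 3.0, 2.0, 1.0]
pv_two = preserved_variability(eigenvalues, 2)   # keep the top 2 dimensions
pv_all = preserved_variability(eigenvalues, 4)   # keep every dimension
```

Here the top two dimensions preserve 7/10 of the variance, and keeping all four gives the score of 1.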
33 Statistical Significance Techniques
Methods used to assess a p-value for the different performance metrics.
Scenario: Suppose we obtain cohesion = 0.99 for clustering algorithm A. At first glance, 0.99 looks like a very good score. However, it is possible that the underlying data is structured in such a way that we would get 0.99 no matter how we cluster the data. In that case, 0.99 is not very significant. One way to decide this is by statistical significance estimation. We will discuss the Monte Carlo procedure on the next slide.
34 Monte Carlo Procedure: Empirical p-value Estimation
The Monte Carlo procedure uses random sampling to assess whether a particular performance metric value could have been attained at random. For example, if a cluster of size 5 has a cohesion score of 0.99, we would be inclined to think that it is a very cohesive cluster. However, this value could result from the nature of the data rather than from the algorithm. To test the significance of this 0.99 value, we:
1. Sample N (usually 1000) random sets of size 5 from the dataset
2. Recalculate the cohesion for each of the 1000 sets
3. Count R, the number of random sets with value >= 0.99 (the original score of the cluster)
4. Compute the empirical p-value for the cluster of size 5 with score 0.99 as R/N
5. Apply a cutoff, say 0.05, to decide whether 0.99 is significant
Steps 1 to 4 are the Monte Carlo method for p-value estimation.
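Steps 1 to 4 can be sketched generically, with the metric passed in as a function. The names `empirical_p_value`, `score` and `seed`, and the toy score used in the usage lines, are assumptions for illustration; any cohesion function could be substituted.

```python
import random


def empirical_p_value(dataset, score, size, observed, n_samples=1000, seed=0):
    """Fraction of random subsets of the given size that score at least `observed`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(dataset, size)   # step 1: random set of the same size
        if score(sample) >= observed:        # steps 2 and 3: rescore and count
            hits += 1
    return hits / n_samples                  # step 4: R / N


# Toy usage: points on a line, scored by how tightly a subset is packed.
data = [i / 100 for i in range(100)]


def tightness(subset):
    """Hypothetical stand-in for a cohesion score, in the range 0 to 1."""
    return 1.0 - (max(subset) - min(subset))


p_value = empirical_p_value(data, tightness, size=5, observed=0.99)
significant = p_value < 0.05                 # step 5: apply the cutoff
```

A small p-value means random sets of the same size rarely reach the observed score, so the score is unlikely to be an artifact of the data alone.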
36 Model Comparison
Metrics that compare the performance of different algorithms.
Scenario:
1) Model 1 provides an accuracy of 70% and Model 2 provides an accuracy of 75%
2) At first glance, Model 2 seems better; however, it could be that Model 1 predicts Class 1 better than Class 2
3) Moreover, Class 1 is indeed more important than Class 2 for our problem
4) We can use model comparison methods to take this notion of importance into consideration when we pick one model over another
Cost-based Analysis is an important model comparison method discussed in the next few slides.
37 Cost-based Analysis
In real-world applications, certain aspects of model performance are considered more important than others. For example, if a person with cancer is diagnosed as cancer-free, or vice versa, then the prediction model should be especially penalized. This penalty can be introduced in the form of a cost matrix.

Cost Matrix (Actual vs. Predicted, with class labels + and −):
c11    c10
c01    c00

c11 is associated with f11 or u11
c10 is associated with f10 or u10
c01 is associated with f01 or u01
c00 is associated with f00 or u00
38 Cost-based Analysis: Cost of a Model
The cost and confusion matrices for Model M are given below (Actual vs. Predicted, with class labels + and −).

Confusion Matrix    Cost Matrix
f11    f10          c11    c10
f01    f00          c01    c00

Cost of Model M is given as:
$$C_M = c_{11} f_{11} + c_{10} f_{10} + c_{01} f_{01} + c_{00} f_{00}$$
39 Cost-based Analysis: Comparing Two Models
This analysis is typically used to select one model when we have more than one choice, obtained by using different algorithms or different parameters for the learning algorithms.
[Tables: Cost Matrix, Confusion Matrix of M_x, and Confusion Matrix of M_y, each Actual vs. Predicted]
Cost of M_x: 100
Cost of M_y: 200
C_Mx < C_My
Based purely on the cost model, M_x is the better model.
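The comparison can be sketched by multiplying each confusion-matrix cell by the matching cost cell and summing. All three matrices below are hypothetical, because the slide's own values did not survive, so the resulting costs are not the 100 and 200 quoted above. The layout follows the [f11 f10; f01 f00] ordering.

```python
def model_cost(confusion, cost):
    """Sum of cost[i][j] * confusion[i][j] over all cells."""
    return sum(
        c * f
        for cost_row, conf_row in zip(cost, confusion)
        for c, f in zip(cost_row, conf_row)
    )


# Hypothetical cost matrix: correct predictions are free, the two kinds of
# error are penalized differently.
cost_matrix = [
    [0, 10],
    [20, 0],
]

# Hypothetical confusion matrices for two candidate models.
confusion_mx = [[4, 1], [2, 1]]
confusion_my = [[3, 2], [2, 1]]

cost_mx = model_cost(confusion_mx, cost_matrix)
cost_my = model_cost(confusion_my, cost_matrix)
better = "Mx" if cost_mx < cost_my else "My"
```

With these values the first model is cheaper, so it would be preferred under a purely cost-based comparison.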
40 Cost-based Analysis R code
library(performancemetrics)
data(mx)
data(my)
data(costmatrix)
Mx
     [,1] [,2]
[1,]    4    1
[2,]    2    1
My
     [,1] [,2]
[1,]    3    2
[2,]    2    1
costanalysis(mx,costmatrix)
costanalysis(my,costmatrix)
More informationMeasurement in ediscovery
Measurement in ediscovery A Technical White Paper Herbert Roitblat, Ph.D. CTO, Chief Scientist Measurement in ediscovery From an informationscience perspective, ediscovery is about separating the responsive
More informationAnomaly detection. Problem motivation. Machine Learning
Anomaly detection Problem motivation Machine Learning Anomaly detection example Aircraft engine features: = heat generated = vibration intensity Dataset: New engine: (vibration) (heat) Density estimation
More informationDistance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center