CLASSIFICATION AND CLUSTERING. Anveshi Charuvaka


Learning from Data: classification, regression, clustering, anomaly detection, contrast set mining.

Classification: Definition
- Given a collection of records (the training set), each record contains a set of attributes, one of which is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Illustrating Classification Task

Examples of Classification Task
- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification Techniques
- Decision tree based methods
- Rule-based methods
- Memory based reasoning
- Neural networks
- Naïve Bayes and Bayesian belief networks
- Support vector machines

Instance Based Classifiers
Examples:
- Rote-learner: memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly.
- Nearest neighbor: uses the k closest points (nearest neighbors) to perform classification.

Nearest Neighbor Classifiers
Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
[Figure: compute the distance from the test record to the training records, then choose the k nearest records.]

Nearest-Neighbor Classifiers
Requires three things:
- The set of stored records
- A distance metric to compute the distance between records
- The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
- Compute its distance to the training records
- Identify the k nearest neighbors
- Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
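
The three classification steps above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names, the toy records, and the choice of Euclidean distance are assumptions made for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, query, k=3):
    """train is a list of (features, label) pairs; returns the majority label
    among the k stored records closest to query."""
    # Step 1 and 2: compute distances and keep the k nearest records.
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], query))[:k]
    # Step 3: majority vote over the neighbors' class labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical stored records: two well-separated classes.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B")]
print(knn_classify(train, (1.1, 1.0), k=3))
```

Sorting all records is the simplest way to find the k nearest; for large training sets a spatial index would usually replace it.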

Definition of Nearest Neighbor
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.

1-nearest-neighbor: Voronoi diagram

Nearest Neighbor Classification
- Compute the distance between two points: Euclidean distance.
- Determine the class from the nearest neighbor list: take the majority vote of class labels among the k nearest neighbors.
- Optionally weigh each vote according to distance, with weight factor w = 1/d^2.
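
A distance-weighted vote with w = 1/d^2 might look like the sketch below. The function name, the toy data, and the small `eps` term (added only to avoid division by zero when a query coincides with a stored record) are assumptions for illustration.

```python
import math
from collections import defaultdict

def weighted_knn(train, query, k=3, eps=1e-12):
    """Classify query by summing weights w = 1/d^2 per class over the k nearest records.
    eps is an assumed guard against d = 0, not part of the original formula."""
    nearest = sorted((math.dist(x, query), label) for x, label in train)[:k]
    scores = defaultdict(float)
    for d, label in nearest:
        scores[label] += 1.0 / (d * d + eps)
    return max(scores, key=scores.get)

# One very close "A" record against two more distant "B" records.
train = [((0.0,), "A"), ((2.0,), "B"), ((2.1,), "B")]
print(weighted_knn(train, (0.1,), k=3))
```

With k = 3 an unweighted majority vote on this data would pick "B" (two votes to one), while the weighted vote favors the much closer "A" record.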

Nearest Neighbor Classification
Choosing the value of k:
- If k is too small, the classifier is sensitive to noise points.
- If k is too large, the neighborhood may include points from other classes.

Nearest Neighbor Classification
Scaling issues: attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes. Example:
- The height of a person may vary from 1.5 m to 1.8 m
- The weight of a person may vary from 90 lb to 300 lb
- The income of a person may vary from $10K to $1M
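
One common way to handle the scaling issue is min-max normalization, which maps every attribute to the range [0, 1]. The slides do not prescribe a particular scaling method, so this is just one reasonable choice; the function name and the three example people are hypothetical.

```python
def min_max_scale(rows):
    """Rescale each column of rows (a list of equal-length tuples) to [0, 1].
    A constant column is mapped to 0.0 to avoid dividing by zero."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

# (height in m, weight in lb, income in $), using the ranges from the example above.
people = [(1.5, 90, 10_000), (1.8, 300, 1_000_000), (1.65, 195, 505_000)]
scaled = min_max_scale(people)
for row in scaled:
    print(row)
```

Without scaling, income differences in the hundreds of thousands would swamp height differences of a few tenths of a meter in any Euclidean distance.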

Evaluating Performance of Classifier
- Instances are partitioned into TRAIN and TEST sets.
- The TRAIN set is used to build the model.
- The TEST set is used to evaluate the classifier.
Methods for creating TRAIN and TEST sets:
- Holdout
- Random sub-sampling
- Cross-validation
- Bootstrap
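
Two of the listed methods, holdout and k-fold cross-validation, can be sketched as follows. The function names, the default 30% test fraction, and the fixed seed are assumptions chosen for the example.

```python
import random

def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle once and reserve a fraction of the records as the TEST set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (TRAIN, TEST)

def k_fold_splits(records, k=5, seed=0):
    """Yield k (TRAIN, TEST) pairs; each record lands in exactly one TEST fold."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, folds[i]

data = list(range(10))
train, test = holdout_split(data)
print(len(train), len(test))
for fold_train, fold_test in k_fold_splits(data, k=5):
    print(sorted(fold_test))
```

Random sub-sampling amounts to repeating the holdout split with different seeds and averaging the results.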

Classifier Accuracy
- Accuracy = number of instances correctly classified / total number of instances.
- Problematic for unbalanced classes: with 990 instances of class A and 10 of class B, a classifier that always predicts A has accuracy = 99%.
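
The 990/10 example can be reproduced directly. The helper name and label lists below are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = ["A"] * 990 + ["B"] * 10
y_pred = ["A"] * 1000            # a classifier that always predicts A

print(accuracy(y_true, y_pred))  # 0.99

# None of the 10 class-B instances is ever recognized.
b_found = sum(t == "B" and p == "B" for t, p in zip(y_true, y_pred))
print(b_found)                   # 0
```

The classifier scores 99% while being useless for class B, which is why accuracy alone is a poor measure on unbalanced data.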

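Unbalanced Classes
- Precision = TP/(TP + FP)
- Recall = TP/P
- F1 = 2 * Precision * Recall / (Precision + Recall)
- Sensitivity = TP/P
- Specificity = TN/N
- ROC (receiver operating characteristic) curve
Here P and N are the numbers of actual positive and actual negative instances, so recall and sensitivity are the same quantity.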
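
These measures all follow from the four confusion-matrix counts. The sketch below uses the standard definitions (precision = TP/(TP + FP), recall = sensitivity = TP/(TP + FN), specificity = TN/(TN + FP)); the function name and the example counts are hypothetical.

```python
def binary_metrics(tp, fp, tn, fn):
    """Precision, recall (sensitivity), specificity and F1 from confusion-matrix counts.
    Each ratio falls back to 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # TP / P
    specificity = tn / (tn + fp) if tn + fp else 0.0     # TN / N
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}

# Hypothetical result on 1000 instances with 10 actual positives.
m = binary_metrics(tp=6, fp=2, tn=988, fn=4)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Accuracy on these counts is 99.4%, yet recall is only 0.6, so the per-class measures expose what accuracy hides.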

Model Generalization: underfitting, overfitting, model complexity.

What is Cluster Analysis?
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
- Intra-cluster distances are minimized.
- Inter-cluster distances are maximized.

Notion of a Cluster can be Ambiguous
How many clusters?
[Figure panels: Two Clusters, Four Clusters, Six Clusters]

Types of Clusterings
- Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
- Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.

Partitional Clustering
[Figure panels: Original Points, A Partitional Clustering]

Hierarchical Clustering
[Figure panels: Traditional Hierarchical Clustering, Traditional Dendrogram, Non-traditional Hierarchical Clustering, Non-traditional Dendrogram]

K-means Clustering
- Partitional clustering approach
- Prototype-based: each cluster is associated with a centroid (center point)
- Each point is assigned to the cluster with the closest centroid
- The number of clusters, K, must be specified

K-means Clustering Details
- Initial centroids are often chosen randomly, so the clusters produced vary from one run to another.
- The centroid is (typically) the mean of the points in the cluster.
- Closeness is measured by Euclidean distance, cosine similarity, correlation, etc.
- K-means will converge for the common similarity measures mentioned above.
- Most of the convergence happens in the first few iterations, so the stopping condition is often changed to "until relatively few points change clusters".
- Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, d = number of attributes.
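
A minimal sketch of the basic K-means loop, assuming Euclidean distance and mean centroids. The function signature is an assumption: in particular the optional `init` argument (caller-supplied starting centroids) and the `seed` are conveniences added for reproducibility, and the loop stops when no point changes cluster rather than when "relatively few" do.

```python
import math
import random

def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Basic K-means with Euclidean distance; returns (centroids, labels)."""
    # Random initial centroids unless the caller supplies them.
    if init is not None:
        centroids = list(init)
    else:
        centroids = random.Random(seed).sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the closest centroid.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:      # no point changed cluster
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:               # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, labels

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(pts, k=2)
print(centroids, labels)
```

Each iteration computes n * K distances over d attributes, which is where the O(n * K * I * d) cost comes from.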

Importance of Choosing Initial Centroids


Limitations of K-means
K-means has problems when clusters are of differing:
- Sizes
- Densities
- Non-globular shapes
K-means also has problems when the data contains outliers.

Evaluating K-means Clusters
- The most common measure is the Sum of Squared Error (SSE).
- For each point, the error is the distance to the nearest cluster centroid.
- To get SSE, we square these errors and sum them:
  SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(c_i, x)^2
  where x is a data point in cluster C_i and c_i is the representative point for cluster C_i.
- Given two sets of clusters, we can choose the one with the smallest error.
- One easy way to reduce SSE is to increase K, the number of clusters.
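
The SSE definition translates almost directly into code. In this sketch the function name and the four example points are assumptions; the three clusterings are hand-picked to show how SSE falls as K grows.

```python
import math

def sse(points, centroids, labels):
    """Sum over all points of the squared distance to the centroid of its cluster."""
    return sum(math.dist(p, centroids[lab]) ** 2
               for p, lab in zip(points, labels))

pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# K = 1: one centroid at the overall mean.
sse_k1 = sse(pts, [(5.0, 1.0)], [0, 0, 0, 0])
# K = 2: one centroid per pair of nearby points.
sse_k2 = sse(pts, [(0.0, 1.0), (10.0, 1.0)], [0, 0, 1, 1])
# K = 4: every point is its own centroid.
sse_k4 = sse(pts, pts, [0, 1, 2, 3])

print(sse_k1, sse_k2, sse_k4)
```

SSE reaches zero when every point is its own cluster, so a lower SSE is only meaningful when comparing clusterings with the same K.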

Relative Index
[Figure: K-means SSE vs. number of clusters]

Cluster Validity
- For supervised classification we have a variety of measures to evaluate how good our model is: accuracy, precision, recall.
- For cluster analysis, the analogous question is how to evaluate the goodness of the resulting clusters.
- But clusters are in the eye of the beholder! Then why do we want to evaluate them?
  - To avoid finding patterns in noise
  - To compare clustering algorithms
  - To compare two sets of clusters
  - To compare two clusters

Clusters found in Random Data
[Figure panels: Random Points, DBSCAN, K-means, Complete Link]

Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
- External index: used to measure the extent to which cluster labels match externally supplied class labels. Example: entropy.
- Internal index: used to measure the goodness of a clustering structure without respect to external information. Example: Sum of Squared Error (SSE).
- Relative index: used to compare two different clusterings or clusters. Often an external or internal index is used for this function, e.g., SSE or entropy.
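
As an example of an external index, the sketch below computes a size-weighted entropy of the class labels inside each cluster. The slides name entropy without giving a formula, so the definition used here (per-cluster entropy in bits, weighted by cluster size) is an assumption based on the usual textbook form, and the function name and labels are hypothetical.

```python
import math
from collections import Counter

def clustering_entropy(cluster_labels, class_labels):
    """Weighted average, over clusters, of the entropy (in bits) of the class
    distribution inside each cluster. 0 means every cluster is pure."""
    n = len(cluster_labels)
    total = 0.0
    for c in set(cluster_labels):
        members = [cls for cl, cls in zip(cluster_labels, class_labels) if cl == c]
        counts = Counter(members)
        e = -sum((m / len(members)) * math.log2(m / len(members))
                 for m in counts.values())
        total += (len(members) / n) * e
    return total

classes = ["x", "x", "y", "y"]
print(clustering_entropy([0, 0, 1, 1], classes))  # clusters match the classes
print(clustering_entropy([0, 1, 0, 1], classes))  # each cluster is a 50/50 mix
```

Lower values indicate closer agreement between the clustering and the externally supplied class labels.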

Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets.
[Figure: two data sets, with Corr = -0.9235 and Corr = -0.5810]
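
A rough sketch of this measure, under stated assumptions: the incidence matrix has a 1 for every pair of points in the same cluster and 0 otherwise, the proximity matrix holds Euclidean distances, and the Pearson correlation is taken over the pairs above the diagonal. The function names and data are illustrative, and the code assumes neither list of values is constant.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (assumes non-zero variance)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def incidence_proximity_corr(points, labels):
    """Correlate 'same cluster' indicators with pairwise distances."""
    pairs = list(combinations(range(len(points)), 2))
    incidence = [1.0 if labels[i] == labels[j] else 0.0 for i, j in pairs]
    proximity = [math.dist(points[i], points[j]) for i, j in pairs]
    return pearson(incidence, proximity)

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good = incidence_proximity_corr(pts, [0, 0, 0, 1, 1, 1])
bad = incidence_proximity_corr(pts, [0, 1, 0, 1, 0, 1])
print(good, bad)
```

Because proximity is a distance here, a good clustering gives a strongly negative correlation: points in the same cluster are close together.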

Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to cluster labels and inspect visually.
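
The reordering step can be sketched as below. The similarity used here, 1 - d/d_max with Euclidean distance d, is an assumed definition chosen for illustration, as are the function name and the data; the text printout stands in for the heatmap one would normally inspect.

```python
import math

def ordered_similarity_matrix(points, labels):
    """Return the similarity matrix with rows and columns sorted by cluster label.
    Similarity is taken as 1 - d / d_max (an assumption for this sketch)."""
    order = sorted(range(len(points)), key=lambda i: labels[i])
    d_max = max(math.dist(p, q) for p in points for q in points) or 1.0
    return [[1.0 - math.dist(points[i], points[j]) / d_max for j in order]
            for i in order]

# Points from two clusters, deliberately interleaved in the input order.
pts = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
labels = [0, 1, 0, 1]
matrix = ordered_similarity_matrix(pts, labels)
for row in matrix:
    print(" ".join(f"{v:.2f}" for v in row))
```

After sorting, well-separated clusters show up as blocks of high similarity along the diagonal.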

Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: K-means]