Introduction to Clustering

Introduction to Clustering. Yumi Kondo, Student Seminar LSK301. Yumi Kondo (University of British Columbia), Sep 25, 2010.

Microarray example: N = 65, P = 1756.

Clustering. The data set {x_ij}, i = 1,...,N, j = 1,...,p, consists of p features measured on N independent observations. A clustering algorithm seeks to assign the N observations in p-dimensional space, labeled x_1,...,x_N, to one of K groups, based on some similarity measure. Unsupervised learning: the problem of finding groups in data without the help of a response variable. There is no right or wrong partition.

What is a similarity measure? x_i and x_{i'} are observation vectors in p dimensions. Some examples:
- Euclidean distance: ||x_i − x_{i'}||_2
- Absolute (Manhattan) distance: ||x_i − x_{i'}||_1
- Correlation, with d = 1 − correlation
(Scatterplot of two features, axes n[,1] and n[,2], omitted.)
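The three similarity measures above can be sketched in plain Python. This is an illustrative sketch, not code from the talk; the function names are chosen here.

```python
import math

def euclidean(x, y):
    # ||x - y||_2
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def absolute(x, y):
    # ||x - y||_1
    return sum(abs(a - b) for a, b in zip(x, y))

def correlation_dist(x, y):
    # d = 1 - Pearson correlation between x and y
    p = len(x)
    mx, my = sum(x) / p, sum(y) / p
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)
```

Note that correlation distance ignores scale and location: two perfectly correlated vectors have distance 0 even if their Euclidean distance is large.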

What is K-means? Clustering methods:
- Hierarchical clustering
- Non-hierarchical clustering, e.g. K-means

Hierarchical clustering. It produces a dendrogram that represents a nested set of clusters: depending on where the dendrogram is cut, between 1 and N clusters can result. Microarray example: http://genome-www.stanford.edu/breast_cancer/molecularportraits/download.shtm

Hierarchical clustering: pros and cons.
- Pro: nice tree (dendrogram); visualizes the different levels of similarity between observations.
- Con: computationally expensive!

Non-hierarchical clustering: K-means. K-means uses Euclidean distance as the similarity measure. The solution of K-means clustering is the partition attaining

min_{C_1,...,C_K} WSS(C_1,...,C_K) = min_{C_1,...,C_K} Σ_{k=1}^K (1/(2 n_k)) Σ_{i,i' ∈ C_k} ||x_i − x_{i'}||²

(worked on the whiteboard)

Note:

WSS = Σ_{k=1}^K (1/(2 n_k)) Σ_{i,i' ∈ C_k} ||x_i − x_{i'}||²
    = Σ_{k=1}^K (1/(2 n_k)) Σ_{i,i' ∈ C_k} ||(x_i − x̄_k) + (x̄_k − x_{i'})||²
    = Σ_{k=1}^K Σ_{i ∈ C_k} ||x_i − x̄_k||²

where x̄_k is the mean of the observations in cluster C_k.
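The identity above — the pairwise form of WSS equals the centroid form — can be checked numerically. A minimal sketch (not from the slides; the sum over i, i' runs over ordered pairs, which is why the 1/(2 n_k) factor appears):

```python
def wss_pairwise(clusters):
    # sum_k 1/(2 n_k) * sum over ordered pairs (i, i') in C_k of ||x_i - x_i'||^2
    total = 0.0
    for pts in clusters:
        nk = len(pts)
        s = sum(sum((a - b) ** 2 for a, b in zip(x, y))
                for x in pts for y in pts)
        total += s / (2 * nk)
    return total

def wss_centroid(clusters):
    # sum_k sum_{i in C_k} ||x_i - xbar_k||^2
    total = 0.0
    for pts in clusters:
        nk = len(pts)
        centroid = [sum(c) / nk for c in zip(*pts)]
        total += sum(sum((a - m) ** 2 for a, m in zip(x, centroid))
                     for x in pts)
    return total
```

For any partition the two functions agree, which is what makes the centroid update in K-means legitimate.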

Algorithm for K-means. Steps 1 and 2 are iterated until convergence.
Step 1. Given the cluster assignment C_1,...,C_K, the cluster centroids are calculated as μ̂_k = (1/n_k) Σ_{i ∈ C_k} x_i, k = 1,...,K.
Step 2. Given the cluster centroids, the objective function is minimized by assigning each observation to the closest cluster mean: I_i = argmin_{1≤k≤K} ||x_i − μ̂_k||².
(worked on the whiteboard)
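Steps 1 and 2 above are Lloyd's algorithm; a minimal self-contained sketch in Python follows. The function name and the random initialization from k distinct data points are choices made here, not part of the slides.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: iterate Step 1 (centroid update) and Step 2 (assignment)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # start from k distinct points
    assign = [-1] * len(points)
    for _ in range(iters):
        # Step 2: assign each observation to the closest cluster mean
        new_assign = [
            min(range(k),
                key=lambda j: sum((a - c) ** 2 for a, c in zip(x, centroids[j])))
            for x in points
        ]
        if new_assign == assign:      # converged: assignments stopped changing
            break
        assign = new_assign
        # Step 1: given the assignment, recompute each centroid as the cluster mean
        for j in range(k):
            members = [x for x, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return assign, centroids
```

Each iteration can only decrease WSS, so the loop terminates; as the slides note later, the result still depends on the initialization, which is why K-means is not robust.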

Correlation as the similarity measure in K-means. It is not so easy to design an algorithm when the similarity measure is correlation: there is no simple analytic form for the cluster centroid. Data transformation approach:
1. Normalize each observation vector: x̃_i = (x_i − x̄_i) / s_{x_i}, where x̄_i is the mean and s_{x_i} the standard deviation of the entries of x_i (with s_{x_i}² = (1/p) Σ_j (x_ij − x̄_i)², so that ||x̃_i||² = p).
2. Then, for normalized vectors x̃ and ỹ,

   ||x̃ − ỹ||² = Σ_{j=1}^p (x̃_j − ỹ_j)² = 2p − 2p ρ_{x,y} = 2p (1 − ρ_{x,y}),

so Euclidean distance on the normalized data is a monotone function of d = 1 − ρ_{x,y}, and ordinary K-means can be applied.

Non-hierarchical method: K-means. Drawbacks of K-means:
- No pretty tree (dendrogram).
- The number of clusters must be known in advance!
- Not robust.

K must be known in advance, but how? Two approaches:
- GAP statistic, Tibshirani et al. (2001)
- Clest

GAP statistic. The idea behind the GAP statistic: find k̂ such that WSS_k shows an elbow-shaped decline. (example in R)

Definition:

GAP(k) = Ê_null[log(WSS(k))] − log(WSS(k))
k̂ = the smallest k such that GAP(k) ≥ GAP(k+1) − s_{k+1}

This standardizes the graph of log(WSS(k)) by comparing it with its expectation under an appropriate null reference distribution of the data:

Ê_null[log(WSS(k))] = (1/B) Σ_{b=1}^B log(WSS(k)_b)

(another example in R)

Reference distribution (K = 1). Generate the reference features from a uniform distribution over a box aligned with the principal components of the data:
- Orthogonally diagonalize the sample covariance matrix: S_X = P D Pᵀ, where D is a diagonal matrix with the eigenvalues λ_1,...,λ_p of S_X on the diagonal, arranged so that λ_1 ≥ ... ≥ λ_p ≥ 0, and P is an orthogonal matrix whose columns are the corresponding unit eigenvectors u_1,...,u_p.
- Transform via X' = X P. Then S_{X'} = Pᵀ S_X P = Pᵀ P D Pᵀ P = D, so the transformed data are no longer correlated.
- Draw uniform features Z' over the range of the columns of X'.
- Finally, back-transform via Z = Z' Pᵀ to give the reference data set Z.
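The four steps above can be sketched with NumPy. This is an illustrative implementation under the stated scheme; the function name `reference_sample` is chosen here, and the sample covariance is used for S_X.

```python
import numpy as np

def reference_sample(X, rng):
    """Draw one reference data set: uniform over a box aligned with
    the principal components of X."""
    # Orthogonally diagonalize the sample covariance: S_X = P D P^T
    S = np.cov(X, rowvar=False)
    _, P = np.linalg.eigh(S)        # columns of P are unit eigenvectors
    Xp = X @ P                      # rotate into the eigenvector axes: X' = X P
    # Draw uniform features Z' over the range of each column of X'
    Zp = rng.uniform(Xp.min(axis=0), Xp.max(axis=0), size=Xp.shape)
    return Zp @ P.T                 # back-transform: Z = Z' P^T
```

In the GAP procedure, B such reference data sets are drawn and clustered to estimate Ê_null[log(WSS(k))].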


Prediction-based resampling method: Clest. Clest returns the K that has the most stable predictability in the clustering procedure. Algorithm: for each K, repeat the following process B times, for both the data set and a reference data set:
- Partition the data set into a learning set and a test set.
- Perform K-means on the learning set and build a classifier from the resulting clusters.
- Classify the test set with the classifier, returning C_{1,classifier},...,C_{K,classifier}.
- Cluster the test set with K-means, returning C_{1,K-means},...,C_{K,K-means}.
- Measure the similarity of the two partitions C_{1,classifier},...,C_{K,classifier} and C_{1,K-means},...,C_{K,K-means}.


Compute S(k, cluster labels 1, cluster labels 2). Repeat this process B times for each K and obtain the average of the measure, S_k. Repeat the algorithm for the reference data set B times and obtain S⁰_k. Obtain the standardized similarity measure d_k = S_k − S⁰_k, and set K̂ = argmax_{k ∈ {1,...,K}} d_k.
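The final selection rule above is a one-line argmax. A sketch assuming the averaged similarities are stored in dictionaries keyed by candidate k (names are illustrative):

```python
def clest_choose_k(S, S0):
    """S[k] = average observed similarity, S0[k] = average similarity
    on the reference data; choose K_hat = argmax_k d_k = S_k - S0_k."""
    d = {k: S[k] - S0[k] for k in S}
    return max(d, key=d.get)
```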



Similarity measures of two partitions. Let P and Q represent two partitions. The clustering error rate is

CER = Σ_{i > i'} |I_P(i, i') − I_Q(i, i')| / C(n, 2)

where I_P(i, i') = 1 if i and i' belong to the same cluster under partition P, and 0 otherwise, and C(n, 2) is the number of pairs. Then 0 ≤ CER ≤ 1: CER = 0 means perfect agreement of the two partitions, and CER = 1 means complete disagreement.
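The CER definition translates directly into code. This sketch represents each partition as a list of cluster labels (an input format assumed here, not specified on the slides):

```python
from itertools import combinations

def cer(labels_p, labels_q):
    """Clustering error rate between two partitions of the same n items."""
    n = len(labels_p)
    # count pairs (i, j) on which the two co-membership indicators disagree
    disagreements = sum(
        (labels_p[i] == labels_p[j]) != (labels_q[i] == labels_q[j])
        for i, j in combinations(range(n), 2)
    )
    return disagreements / (n * (n - 1) / 2)
```

Because CER compares co-membership rather than raw labels, it is invariant to relabeling the clusters, which is exactly what Clest needs when comparing a classifier's partition with a K-means partition.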

Does Clest outperform the GAP statistic?

Reference: Tibshirani, R., Walther, G., and Hastie, T. (2001). "Estimating the number of clusters in a data set via the gap statistic." Journal of the Royal Statistical Society: Series B, 63(2), 411-423.