Prototype Methods: K-Means


Prototype Methods: K-Means. Department of Statistics, The Pennsylvania State University. Email: jiali@stat.psu.edu

Prototype Methods. Essentially model-free. Not useful for understanding the nature of the relationship between the features and the class outcome, but they can be very effective as black-box prediction engines. Training data: $\{(x_1, g_1), (x_2, g_2), \ldots, (x_N, g_N)\}$, with class labels $g_i \in \{1, 2, \ldots, K\}$. Represent the training data by a set of points in feature space, also called prototypes. Each prototype has an associated class label, and a query point $x$ is classified to the class of the closest prototype. Methods differ in how the number and the positions of the prototypes are decided.
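A minimal sketch of this nearest-prototype rule, assuming NumPy; the prototype coordinates and labels below are hypothetical, purely for illustration:

```python
import numpy as np

def classify_nearest_prototype(x, prototypes, labels):
    """Assign x to the class label of the closest prototype (Euclidean distance)."""
    d = np.linalg.norm(prototypes - x, axis=1)   # distance from x to each prototype
    return labels[np.argmin(d)]

# Hypothetical prototypes with associated class labels, purely for illustration.
prototypes = np.array([[0.0, 0.0], [1.5, 1.5]])
labels = np.array([1, 2])
print(classify_nearest_prototype(np.array([0.4, 0.3]), prototypes, labels))   # -> 1
```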

K-means. Assume there are M prototypes denoted by $Z = \{z_1, z_2, \ldots, z_M\}$. Each training sample is assigned to one of the prototypes. Denote the assignment function by $A(\cdot)$; then $A(x_i) = j$ means the $i$th training sample is assigned to the $j$th prototype. Goal: minimize the total mean squared error between the training samples and their representative prototypes, that is, the trace of the pooled within-cluster covariance matrix:
$$\arg\min_{Z, A} \sum_{i=1}^{N} \| x_i - z_{A(x_i)} \|^2 .$$

Denote the objective function by
$$L(Z, A) = \sum_{i=1}^{N} \| x_i - z_{A(x_i)} \|^2 .$$
Intuition: training samples are tightly clustered around the prototypes. Hence, the prototypes serve as a compact representation for the training data.

Necessary Conditions. If $Z$ is fixed, the optimal assignment function $A(\cdot)$ should follow the nearest-neighbor rule, that is,
$$A(x_i) = \arg\min_{j \in \{1, 2, \ldots, M\}} \| x_i - z_j \| .$$
If $A(\cdot)$ is fixed, the prototype $z_j$ should be the average (centroid) of all the samples assigned to it:
$$z_j = \frac{\sum_{i:\, A(x_i) = j} x_i}{N_j} ,$$
where $N_j$ is the number of samples assigned to prototype $j$.

The Algorithm. Based on the necessary conditions, the k-means algorithm alternates between two steps: for a fixed set of centroids (prototypes), optimize $A(\cdot)$ by assigning each sample to its closest centroid using Euclidean distance; for a fixed assignment, update each centroid to the average of all the samples assigned to it. The algorithm converges because the objective function is non-increasing after each iteration, and it usually converges quickly. Stopping criterion: the ratio between the decrease and the current value of the objective function falls below a threshold.
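A minimal NumPy sketch of the alternation described above, with the relative-decrease stopping rule; function and variable names are my own, not from the lecture:

```python
import numpy as np

def kmeans(X, Z_init, tol=1e-6, max_iter=100):
    """Alternate the assignment and update steps until the relative
    decrease of the objective L(Z, A) falls below tol."""
    Z = np.asarray(Z_init, dtype=float).copy()
    X = np.asarray(X, dtype=float)
    prev_L = np.inf
    for _ in range(max_iter):
        # Assignment step: each sample goes to its closest centroid (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)   # N x M distance matrix
        A = np.argmin(d, axis=1)
        # Update step: each centroid becomes the mean of the samples assigned to it.
        for j in range(len(Z)):
            if np.any(A == j):
                Z[j] = X[A == j].mean(axis=0)
        L = np.sum((X - Z[A]) ** 2)                                 # objective L(Z, A)
        if prev_L < np.inf and prev_L - L <= tol * prev_L:
            break
        prev_L = L
    return Z, A, L
```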

Example. Training set: {1.2, 5.6, 3.7, 0.6, 0.1, 2.6}. Apply the k-means algorithm with 2 centroids, $\{z_1, z_2\}$. Initialization: randomly pick $z_1 = 2$, $z_2 = 5$.
Iteration 1 (centroids fixed at 2 and 5): updated clusters {1.2, 0.6, 0.1, 2.6} and {5.6, 3.7}.
Iteration 2 (clusters fixed): updated centroids 1.125 and 4.65.
Iteration 3 (centroids fixed at 1.125 and 4.65): clusters unchanged, so the algorithm stops.
The two prototypes are $z_1 = 1.125$, $z_2 = 4.65$. The objective function is $L(Z, A) = 5.3125$.

Initialization: randomly pick $z_1 = 0.8$, $z_2 = 3.8$.
Iteration 1 (centroids fixed at 0.8 and 3.8): updated clusters {1.2, 0.6, 0.1} and {5.6, 3.7, 2.6}.
Iteration 2 (clusters fixed): updated centroids 0.633 and 3.967.
Iteration 3 (centroids fixed at 0.633 and 3.967): clusters unchanged, so the algorithm stops.
The two prototypes are $z_1 = 0.633$, $z_2 = 3.967$. The objective function is $L(Z, A) = 5.2133$. Starting from different initial values, the k-means algorithm converges to different local optima. It can be shown that $\{z_1 = 0.633, z_2 = 3.967\}$ is the globally optimal solution.
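Assuming the `kmeans` sketch above is in scope, the two runs can be reproduced on this one-dimensional training set (reshaped to an N x 1 matrix):

```python
import numpy as np

X = np.array([1.2, 5.6, 3.7, 0.6, 0.1, 2.6]).reshape(-1, 1)   # N x 1 training matrix

Z1, A1, L1 = kmeans(X, Z_init=[[2.0], [5.0]])   # converges to centroids ~1.125, 4.65  (L ~ 5.3125)
Z2, A2, L2 = kmeans(X, Z_init=[[0.8], [3.8]])   # converges to centroids ~0.633, 3.967 (L ~ 5.2133)
```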

Classification by K-means. The primary application of k-means is clustering, or unsupervised classification, but it can be adapted to supervised classification: apply k-means clustering to the training data in each class separately, using R prototypes per class; assign a class label to each of the $K \times R$ prototypes; classify a new feature vector $x$ to the class of the closest prototype. This approach to using k-means for classification is referred to as Scheme 1.
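A sketch of Scheme 1, assuming the `kmeans` helper above is available; the initialization by sampling training points is my own choice:

```python
import numpy as np

def fit_scheme1(X, g, R, rng=None):
    """Scheme 1: run k-means within each class separately; every resulting
    prototype inherits the label of its class."""
    rng = rng or np.random.default_rng(0)
    prototypes, labels = [], []
    for k in np.unique(g):
        Xk = X[g == k]
        # Simple initialization: R distinct training points of this class (assumes len(Xk) >= R).
        Z_init = Xk[rng.choice(len(Xk), size=R, replace=False)]
        Zk, _, _ = kmeans(Xk, Z_init)
        prototypes.append(Zk)
        labels.extend([k] * R)
    return np.vstack(prototypes), np.array(labels)
```

A new point is then classified with the nearest-prototype rule sketched earlier.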

Another Approach. An alternative approach to classification by k-means: apply k-means clustering to the entire training data set, using M prototypes; for each prototype, count the number of samples from each class assigned to it and associate the prototype with the class that has the highest count; classify a new feature vector $x$ to the class of the closest prototype. This alternative approach is referred to as Scheme 2.
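A sketch of Scheme 2 under the same assumptions; it labels each prototype by majority vote among the training samples assigned to it:

```python
import numpy as np

def fit_scheme2(X, g, M, rng=None):
    """Scheme 2: run k-means on the whole training set, then label each
    prototype with the most frequent class among its assigned samples."""
    rng = rng or np.random.default_rng(0)
    Z_init = X[rng.choice(len(X), size=M, replace=False)]
    Z, A, _ = kmeans(X, Z_init)
    labels = np.empty(M, dtype=g.dtype)
    for j in range(M):
        # Assumes every prototype receives at least one training sample.
        classes, counts = np.unique(g[A == j], return_counts=True)
        labels[j] = classes[np.argmax(counts)]
    return Z, labels
```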

Simulation. Two classes both follow a normal distribution with a common covariance matrix
$$\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} .$$
The means of the two classes are
$$\mu_1 = \begin{pmatrix} 0.0 \\ 0.0 \end{pmatrix} , \qquad \mu_2 = \begin{pmatrix} 1.5 \\ 1.5 \end{pmatrix} .$$
The prior probabilities of the two classes are $\pi_1 = 0.3$, $\pi_2 = 0.7$. A training and a test data set are generated, each with 2000 samples.

The optimal decision boundary between the two classes is given by LDA, since the two class-conditional densities of the input are both normal with a common covariance matrix. The optimal decision rule (i.e., the Bayes rule) is
$$\hat{G}(X) = \begin{cases} 1 & X_1 + X_2 \le 0.9351 \\ 2 & \text{otherwise.} \end{cases}$$
The error rate computed using the test data set and the optimal decision rule is 11.75%.
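For this setup the threshold 0.9351 can be recovered directly from the LDA discriminant functions; a small sketch (my own variable names), not the code used to produce the reported error rate:

```python
import numpy as np

mu1, mu2 = np.array([0.0, 0.0]), np.array([1.5, 1.5])
pi1, pi2 = 0.3, 0.7

# With a common identity covariance, LDA assigns class 1 when
#   x @ (mu2 - mu1) <= 0.5 * (mu2 @ mu2 - mu1 @ mu1) + log(pi1 / pi2).
# Since mu2 - mu1 = (1.5, 1.5), the rule reduces to X1 + X2 <= threshold.
threshold = (0.5 * (mu2 @ mu2 - mu1 @ mu1) + np.log(pi1 / pi2)) / 1.5
print(round(float(threshold), 4))   # 0.9351
```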

The scatter plot of the training data set. Red star: Class 1. Blue circle: Class 2.

K-means with Scheme 1. For each class, R = 6 prototypes are used; the 6 prototypes for each class are shown in the figure. The black solid line is the classification boundary between the two classes given by k-means; the green dashed line is the optimal decision boundary. The error rate based on the test data is 20.15%.

K-means with Scheme 2. For the entire training data set, M = 12 prototypes are used. By counting the number of samples from each class that fall into each prototype, 4 prototypes are labeled as Class 1 and the other 8 as Class 2. The prototypes for each class are shown in the figure. The black solid line is the classification boundary given by k-means; the green dashed line is the optimal decision boundary. The error rate based on the test data is 13.15%.

Compare the Two Schemes. Scheme 1 works when there is a small amount of overlap between classes; Scheme 2 is more robust when there is a considerable amount of overlap. For Scheme 2, there is no need to specify the number of prototypes for each class separately: the number of prototypes assigned to each class is adjusted automatically, and classes with higher priors tend to occupy more prototypes.

Scheme 2 also has a better statistical interpretation: it attempts to estimate $\Pr(G = j \mid X)$ by a piecewise-constant function. Partition the feature space into cells, assume $\Pr(G = j \mid X)$ is constant for $X$ within a cell, and estimate $\Pr(G = j \mid X)$ by the empirical class frequencies among all the training samples that fall into that cell.
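A brief sketch of that estimate, assuming the assignment vector `A` and label vector `g` from a Scheme 2 fit; the cell is taken to be the set of samples assigned to prototype `j`:

```python
import numpy as np

def cell_posterior(g, A, j, classes):
    """Empirical estimate of Pr(G = c | X in cell j): the class frequencies
    among the training samples assigned to prototype j (assumes the cell is non-empty)."""
    in_cell = g[A == j]
    return np.array([np.mean(in_cell == c) for c in classes])
```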

Initialization. Randomly pick the prototypes to start the k-means iteration. Different initial prototypes may lead to different locally optimal solutions. Try several sets of initial prototypes and compare the final objective function values to choose the best solution. When randomly selecting initial prototypes, it is better to make sure that no prototype falls outside the range of the data set.

Initialization in the above simulation: generate M random vectors with independent dimensions, each dimension uniformly distributed on $[-1, 1]$. Then linearly transform the $j$th feature, $Z_j$, $j = 1, 2, \ldots, p$, of each prototype (a vector) by $Z_j \cdot s_j + m_j$, where $s_j$ is the sample standard deviation of dimension $j$ and $m_j$ is the sample mean of dimension $j$, both computed from the training data.
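A sketch of this initialization, assuming `X` is the N x p training matrix:

```python
import numpy as np

def random_prototypes(X, M, rng=None):
    """Draw M prototypes uniformly in [-1, 1]^p, then scale each dimension by its
    sample standard deviation and shift it by its sample mean."""
    rng = rng or np.random.default_rng(0)
    s = X.std(axis=0)     # sample standard deviation of each dimension
    m = X.mean(axis=0)    # sample mean of each dimension
    Z = rng.uniform(-1.0, 1.0, size=(M, X.shape[1]))
    return Z * s + m
```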

Linde-Buzo-Gray (LBG) Algorithm. An algorithm developed in vector quantization for the purpose of data compression. Reference: Y. Linde, A. Buzo and R. M. Gray, "An algorithm for vector quantizer design," IEEE Transactions on Communications, vol. COM-28, pp. 84-95, Jan. 1980.

The algorithm:
1. Find the centroid $z_1^{(1)}$ of the entire data set.
2. Set $k = 1$, $l = 1$.
3. If $k < M$, split the current centroids by adding small offsets. If $M - k \ge k$, split all $k$ centroids; otherwise, split only $M - k$ of them. Denote the number of centroids split by $\tilde{k} = \min(k, M - k)$. For example, to split $z_1^{(1)}$ into two centroids, let $z_1^{(2)} = z_1^{(1)}$ and $z_2^{(2)} = z_1^{(1)} + \epsilon$, where $\epsilon$ has a small norm and a random direction.
4. Set $k \leftarrow k + \tilde{k}$ and $l \leftarrow l + 1$.
5. Use $\{z_1^{(l)}, z_2^{(l)}, \ldots, z_k^{(l)}\}$ as the initial prototypes and apply the k-means iteration to update them.
6. If $k < M$, go back to step 3; otherwise, stop.
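A sketch of the LBG splitting schedule, reusing the `kmeans` helper above; the offset scale `eps_scale` is my own choice, standing in for the small random $\epsilon$:

```python
import numpy as np

def lbg(X, M, eps_scale=1e-3, rng=None):
    """Grow from 1 to M prototypes by repeatedly splitting centroids and
    refining the enlarged set with k-means."""
    rng = rng or np.random.default_rng(0)
    Z = X.mean(axis=0, keepdims=True)        # step 1: centroid of the entire data set
    k = 1
    while k < M:
        n_split = min(k, M - k)              # split all centroids, or only M - k of them
        offsets = eps_scale * rng.standard_normal((n_split, X.shape[1]))
        Z = np.vstack([Z, Z[:n_split] + offsets])   # each split adds a slightly offset copy
        k += n_split
        Z, _, _ = kmeans(X, Z)               # k-means on the whole data set with the new prototypes
    return Z
```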

Tree-structured Vector Quantization (TSVQ) for Clustering:
1. Apply 2-centroid k-means to the entire data set.
2. The data are assigned to the 2 centroids.
3. For the data assigned to each centroid, apply 2-centroid k-means separately.
4. Repeat the above step.
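A sketch of this recursive bisecting procedure, again reusing the `kmeans` helper; the `depth` stopping parameter is my own addition, since the slide does not state a stopping rule:

```python
import numpy as np

def tsvq(X, depth, rng=None):
    """Recursively split the data with 2-centroid k-means; the result is a tree of
    (centroid pair, left subtree, right subtree) tuples."""
    rng = rng or np.random.default_rng(0)
    if depth == 0 or len(X) < 2:
        return None
    Z_init = X[rng.choice(len(X), size=2, replace=False)]
    Z, A, _ = kmeans(X, Z_init)
    return (Z, tsvq(X[A == 0], depth - 1, rng), tsvq(X[A == 1], depth - 1, rng))
```

A query is quantized by descending the tree, comparing against only the two centroids at each level; this is the source of the roughly $2 \log_2(M)$ distance computations mentioned below.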

Compare with LBG: for LBG, after the initial prototypes are formed by splitting, k-means is applied to the overall data set, and the final result is M prototypes. For TSVQ, data partitioned into different centroids at the same level never affect each other in the future growth of the tree, and the final result is a tree structure. Fast searching: for k-means, to decide which cell a query $x$ goes to, M (the number of prototypes) distances need to be computed; for tree-structured clustering, only about $2 \log_2(M)$ distances need to be computed.

Comments on tree-structured clustering: It is structurally more constrained. But on the other hand, it provides more insight into the patterns in the data. It is greedy in the sense of optimizing at each step sequentially. An early bad decision will propagate its effect. It provides more algorithmic flexibility.

Choose the Number of Prototypes. Cross-validation (a data-driven approach): for different numbers of prototypes, compute the classification error rate using cross-validation, and select the number of prototypes that yields the minimum error rate. Cross-validation is often rather effective. Rule of thumb: on average, every cell should contain at least 5 to 10 samples. Other model selection approaches can also be used.
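A sketch of the cross-validated choice of the number of prototypes, assuming the `fit_scheme2` and `classify_nearest_prototype` helpers above; this is illustrative, not the exact procedure used for the diabetes example:

```python
import numpy as np

def cv_error(X, g, M, n_folds=2, rng=None):
    """Estimate the classification error of Scheme 2 with M prototypes
    by n-fold cross-validation."""
    rng = rng or np.random.default_rng(0)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        Z, labels = fit_scheme2(X[train], g[train], M, rng)
        pred = np.array([classify_nearest_prototype(x, Z, labels) for x in X[fold]])
        errors.append(np.mean(pred != g[fold]))
    return float(np.mean(errors))

# Scan candidate numbers of prototypes and keep the one with the smallest CV error:
# best_M = min(range(1, 21), key=lambda M: cv_error(X, g, M))
```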

Example. Diabetes data set; the two principal components are used. For the number of prototypes $M = 1, 2, \ldots, 20$, apply k-means to classification. Two-fold cross-validation is used to compute the error rates, and the error rate vs. M is plotted in the figure. When $M \ge 8$, the performance is similar; the minimum error rate of 27.34% is achieved at $M = 13$.

The prototypes assigned to the two classes and the classification boundary are shown in the figure: 9 prototypes are assigned to Class 1 (without diabetes) and 4 to Class 2 (with diabetes).