K-Means Clustering. Clustering and Classification Lecture 8


Today's Class: K-means clustering. What it is; how it works; what it assumes; pitfalls of the method (locally optimal results).

From Last Time: At the end of our last class, I implored you to consider how to take data on a single variable and run an ANOVA when, in this case, you had no idea what the grouping variable was. How would you assign the observations to groups?

ANOVA Model (Really, the GLM): The one-way ANOVA model can be written as Y_ij = μ_j + ε_ij, and re-written in regression form as Y_ij = β_0 + β_1 X_ij + ε_ij. Here X_ij is the indicator that observation i is a member of group j (for dummy coding).

Imagine If: Imagine I set the following parameters for the data: β_0 = 64, β_1 = 5, σ²_ε = 9. I will then use R to get some data, but I will not let you see what X happens to be.

Ok, So You Tell Me: How would you find what X was for each person? How would you find the mean for each group? Note that the data we picked had group means of 64 and 69; these numbers were selected to mimic the distribution of heights for American males and females. Given a set of heights, can you give me: the gender each individual height came from? The mean for each gender?
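The lecture generates this data in R; the same setup can be sketched in Python (a minimal stand-in, assuming β_0 = 64, β_1 = 5, σ²_ε = 9 as above, with the group indicator X hidden from the analyst):

```python
import random

random.seed(1)

# The lecture's parameters: beta_0 = 64, beta_1 = 5, error variance 9
# (so the error standard deviation is 3).
beta_0, beta_1, sigma = 64.0, 5.0, 3.0
n = 1000

# Hidden group labels (0 or 1), then observed heights.
X = [random.randint(0, 1) for _ in range(n)]
heights = [beta_0 + beta_1 * x + random.gauss(0, sigma) for x in X]
# Group means are 64 (X = 0) and 69 (X = 1), but the analyst only
# sees the single column `heights`.
```

All the analyst observes is the one column of heights; recovering X and the two group means is exactly the problem K-means addresses.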

K-means Clustering Algorithm

K-Means Clustering Algorithm: The K-means clustering algorithm is a simple method for estimating the mean (vectors) of a set of K groups. The simplicity of the algorithm, however, can also lead to some bad solutions: sub-optimal clusterings of objects.

K-means Algorithm Step #1: A typical version of the K-means algorithm runs in the following steps: 1. Initial cluster seeds are chosen (at random); these represent the temporary means of the clusters. Imagine our random numbers were 60 for group 1 and 70 for group 2. [Plot: the observed values (roughly 55 to 75) plotted by index, with Seed 1 marked at 60 and Seed 2 at 70.]

K-means Algorithm Step #2: 2. The squared Euclidean distance from each object to each cluster seed is computed, and each object is assigned to the closest cluster. [Plots: the same data by index, shown before and after each object is assigned to its nearest seed.]

K-means Algorithm Step #3: 3. For each cluster, the new centroid is computed, and each seed value is replaced by the respective cluster centroid. The new mean for cluster 1 is 62.3; the new mean for cluster 2 is 68.9. [Plot: the data by index with the updated cluster means at 62.3 and 68.9.]

K-means Algorithm Steps #4-#6: 4. The squared Euclidean distance from each object to each cluster centroid is computed, and the object is assigned to the cluster with the smallest squared Euclidean distance. 5. The cluster centroids are recalculated based on the new membership assignments. 6. Steps 4 and 5 are repeated until no object moves clusters.
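The six steps above can be sketched for one-dimensional data like the height example (a hedged Python illustration, not the lecture's R code):

```python
def kmeans_1d(data, seeds, max_iter=100):
    """Plain K-means on a list of numbers, following the lecture's steps.

    seeds: the initial cluster means (step 1).
    Returns (means, assignments).
    """
    means, assignments = list(seeds), None
    for _ in range(max_iter):
        # Steps 2/4: assign each object to the cluster whose mean is
        # closest in squared Euclidean distance.
        new = [min(range(len(means)), key=lambda k: (x - means[k]) ** 2)
               for x in data]
        if new == assignments:
            break  # step 6: no object moved clusters
        assignments = new
        # Steps 3/5: replace each seed/mean by its cluster's centroid.
        for k in range(len(means)):
            members = [x for x, a in zip(data, assignments) if a == k]
            if members:
                means[k] = sum(members) / len(members)
    return means, assignments

# Two well-separated groups, seeded at 60 and 70 as in the slides:
means, labels = kmeans_1d([60.1, 59.5, 61.0, 70.2, 69.8, 71.5], [60, 70])
# means is now roughly [60.2, 70.5]
```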

K-means Assumptions: K-means uses the squared Euclidean distance to allocate objects to clusters, with the implicit assumption that the variables are on roughly the same scale for such distances to be meaningful. Because the squared Euclidean distance is used, the K-means algorithm in effect tries to minimize the error sum of squares (SSE). A weaker assumption is that each group has roughly the same SSE, meaning that the variance/covariance matrices of objects within groups are equal.

Error Sum of Squares: For the entire set of objects, the error sum of squares is calculated by summing each object's squared Euclidean distance to its cluster centroid over all clusters: SSE = Σ_{j=1}^{K} Σ_{i ∈ cluster j} (x_i − x̄_j)². The goal of K-means is to find a solution such that no other solution has a lower SSE. This commonly does not happen in a single run of the algorithm. In our example, because we used a single variable and had two relatively non-overlapping groups, it will.
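In code, the SSE of a given solution is a one-liner (a Python sketch; x̄_j is the mean of cluster j):

```python
def sse(data, means, assignments):
    """Error sum of squares: each object's squared Euclidean distance
    to its assigned cluster mean, summed over all objects."""
    return sum((x - means[a]) ** 2 for x, a in zip(data, assignments))

# Two tight clusters around 1.5 and 9.0:
print(sse([1.0, 2.0, 9.0], [1.5, 9.0], [0, 0, 1]))  # → 0.5
```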

Wrapping Up: K-means is a nice method for quickly sorting your data into clusters; all you need to specify is the number of clusters you seek to find. Local optima in K-means can derail your results if you are not careful: run the process many times with differing starting values.

Local Optima: As discussed in the article by Steinley (2003), the solutions from K-means are not guaranteed to be globally optimal. Globally optimal = the solution whose SSE is smallest compared to *any* other possible solution. Solutions typically found by K-means are locally optimal: they have found the optimum of the objective function in only a small part of the search space.

Local Optima Example: Using code from Steinley (2003), we will now demonstrate the optimality problem in K-means using MATLAB. Steinley's code takes an input data set and runs K-means a number of times, recording the SSE each time; the solution with the lowest SSE should be the one used. To demonstrate, I will use our data, which has zero local optima, and the data set described on the next slide.
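Steinley's MATLAB code is not reproduced here, but the restart strategy it implements (run K-means many times, keep the lowest-SSE solution) can be sketched in Python (a hypothetical stand-in, not the article's code):

```python
import random

def kmeans_sse(data, seeds, max_iter=100):
    """One K-means run on a list of numbers; returns (sse, means)."""
    means, assign = list(seeds), None
    for _ in range(max_iter):
        # Assign each object to the nearest mean (squared distance).
        new = [min(range(len(means)), key=lambda k: (x - means[k]) ** 2)
               for x in data]
        if new == assign:
            break  # converged: no object moved clusters
        assign = new
        # Replace each mean by its cluster's centroid.
        for k in range(len(means)):
            members = [x for x, a in zip(data, assign) if a == k]
            if members:
                means[k] = sum(members) / len(members)
    err = sum((x - means[a]) ** 2 for x, a in zip(data, assign))
    return err, means

def best_of_restarts(data, k, n_restarts=50, seed=0):
    """Run K-means from many random seed sets; keep the lowest-SSE run."""
    rng = random.Random(seed)
    runs = [kmeans_sse(data, rng.sample(data, k)) for _ in range(n_restarts)]
    return min(runs, key=lambda r: r[0])
```

Any single run can land in a local optimum; across many restarts, the minimum SSE is the best available estimate of the global optimum.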

Hartigan Data: The data come from Hartigan (1975, p. 88), part of your readings. The nutritional content of eight foods is listed with respect to energy, protein, and calcium. We will try putting these objects into three clusters.

K-means Estimation Process: Part of the process of estimating K-means is determining how many means are necessary. Hartigan (and others) suggest running a (M)ANOVA following each solution to tell whether the groups are significantly different. If so, more clusters may need to be extracted: try K+1. If not, too many clusters may have been extracted. This approach, however, is at best a crude heuristic for determining the number of clusters: the hypothesis test is biased, because the cluster labels were themselves chosen to maximize group separation.
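For illustration only (this is a generic one-way ANOVA F statistic, not Hartigan's exact procedure), the F test on the resulting clusters can be computed directly; as noted above, this F is biased upward when the groups are cluster solutions:

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups (lists of numbers):
    F = (SS_between / (K - 1)) / (SS_within / (N - K))."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Two clearly separated groups give a large F:
print(anova_f([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))  # → 13.5
```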

Summarizing Your Result: To summarize your result, you can take the cluster means and describe the clusters. You can also examine the objects within each cluster to determine each cluster's meaning. Just note that if you land in a local optimum, both of these summaries will change when you run the analysis again.

Wrapping Up: K-means is a simple procedure for extracting clusters from data, but sometimes simplicity is not good: poor solutions result when its assumptions are violated. The K-means procedure will serve as an entry point to similar procedures later on. We will find that, by using statistical distributions, we can achieve similar results under more flexible assumptions.

Next Time: How to do K-means in R. Discussion of class project specifics. Presentation and discussion of an empirical research article featuring K-means.