Chapter 3: Cluster Analysis



Outline
3.1 Basic Concepts of Clustering
    3.1.1 Cluster Analysis
    3.1.2 Clustering Categories
3.2 Partitioning Methods
    3.2.1 The Principle
    3.2.2 K-Means Method
    3.2.3 K-Medoids Method
    3.2.4 CLARA
    3.2.5 CLARANS
3.3 Hierarchical Methods
3.4 Density-based Methods
3.5 Clustering High-Dimensional Data
3.6 Outlier Analysis

3.1.1 Cluster Analysis
- Unsupervised learning (i.e., the class label is unknown)
- Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
- Principle: maximize intra-class similarity and minimize inter-class similarity
- Typical applications: WWW, social networks, marketing, biology, library, etc.

3.1.2 Clustering Categories
- Partitioning methods: construct k partitions of the data
- Hierarchical methods: create a hierarchical decomposition of the data
- Density-based methods: grow a given cluster depending on its density (number of data objects)
- Grid-based methods: quantize the object space into a finite number of cells
- Model-based methods: hypothesize a model for each cluster and find the best fit of the data to the given model
- Clustering high-dimensional data: subspace clustering
- Constraint-based methods: used for user-specific applications

3.2.1 Partitioning Methods: The Principle
Given
- A data set of n objects
- K, the number of clusters to form
Organize the objects into k partitions (k <= n), where each partition represents a cluster.
The clusters are formed to optimize an objective partitioning criterion:
- Objects within a cluster are similar
- Objects of different clusters are dissimilar

3.2.2 K-Means Method
Goal: create 3 clusters (partitions)
- Choose 3 objects as the initial cluster centroids
- Assign each object to the closest centroid to form clusters
- Update the cluster centroids
- Recompute the clusters
- If the centroids are stable, then stop

K-Means Algorithm
Input
- K: the number of clusters
- D: a data set containing n objects
Output: a set of k clusters
Method:
(1) Arbitrarily choose k objects from D as the initial cluster centers
(2) Repeat
(3)   Reassign each object to the most similar cluster, based on the mean value of the objects in the cluster
(4)   Update the cluster means
(5) Until no change
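The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own implementation: the function name `kmeans`, the `max_iter` safety cap, the fixed `seed`, the use of squared Euclidean distance as the similarity measure, and the choice to keep the old center when a cluster becomes empty are all assumptions made here.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # (1) Arbitrarily choose k objects from D as the initial cluster centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):  # (2) repeat
        # (3) Reassign each object to the cluster whose mean is closest.
        new_assignment = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_assignment == assignment:  # (5) until no change
            break
        assignment = new_assignment
        # (4) Update the cluster means.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment

# Two well-separated groups of three points each (made-up data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other, whichever two objects are drawn as initial centers.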

K-Means Properties
The algorithm attempts to determine k partitions that minimize the square-error function

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - m_i)^2$$

- E: the sum of the squared error for all objects in the data set
- p: the data point in the space representing an object
- m_i: the mean of cluster C_i

It works well when the clusters are compact clouds that are rather well separated from one another.
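As a small worked instance of the square-error function, the sketch below computes E for a given partition. The helper name `squared_error` and the four data points are made up for illustration, and (p - m_i)^2 is read here as the squared Euclidean distance between a point and its cluster mean.

```python
def squared_error(clusters):
    """Sum over clusters of squared distances from each point to the cluster mean."""
    total = 0.0
    for members in clusters:
        # m_i: coordinate-wise mean of the cluster.
        mean = tuple(sum(c) / len(members) for c in zip(*members))
        # Add (p - m_i)^2 for every point p in C_i.
        total += sum(sum((a - b) ** 2 for a, b in zip(p, mean)) for p in members)
    return total

# Two clusters of two points each (hypothetical data).
clusters = [[(1.0, 1.0), (3.0, 1.0)], [(10.0, 0.0), (10.0, 4.0)]]
E = squared_error(clusters)  # means are (2, 1) and (10, 2): E = (1 + 1) + (4 + 4)
```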

K-Means Properties
Advantages
- K-means is relatively scalable and efficient in processing large data sets
- The computational complexity of the algorithm is O(nkt)
  - n: the total number of objects
  - k: the number of clusters
  - t: the number of iterations
  - Normally: k << n and t << n
Disadvantages
- Can be applied only when the mean of a cluster is defined
- Users need to specify k
- K-means is not suitable for discovering clusters with nonconvex shapes or clusters of very different size
- It is sensitive to noise and outlier data points (they can influence the mean value)

Variations of the K-Means Method
A few variants of k-means differ in
- Selection of the initial k means
- Dissimilarity calculations
- Strategies to calculate cluster means
Handling categorical data: k-modes (Huang '98)
- Replacing means of clusters with modes
- Using new dissimilarity measures to deal with categorical objects
- Using a frequency-based method to update modes of clusters
Handling a mixture of categorical and numerical data
(Slide source: Data Mining: Concepts and Techniques, November 2010)

3.2.3 K-Medoids Method
- Minimizes the sensitivity of k-means to outliers
- Picks actual objects to represent clusters instead of mean values
- Each remaining object is clustered with the representative object (medoid) to which it is the most similar
- The algorithm minimizes the sum of the dissimilarities between each object and its corresponding reference point

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - o_i|$$

- E: the sum of absolute error for all objects in the data set
- p: the data point in the space representing an object
- o_i: the representative object of cluster C_i

K-Medoids Method: The Idea
- Initial representatives are chosen randomly
- The iterative process of replacing representative objects by non-representative objects continues as long as the quality of the clustering is improved
- For each representative object O
  - For each non-representative object R, swap O and R
- Choose the configuration with the lowest cost
- The cost function is the difference in absolute-error value if a current representative object is replaced by a non-representative object

K-Medoids Method: Example
Data objects

Object  A1  A2
O1       2   6
O2       3   4
O3       3   8
O4       4   7
O5       6   2
O6       6   4
O7       7   3
O8       7   4
O9       8   5
O10      7   6

Goal: create two clusters
Choose randomly two medoids: O2 = (3,4) and O8 = (7,4)
[Figure: scatter plot of the ten data objects.]

K-Medoids Method: Example
Assign each object to the closest representative object.
Using the L1 metric (Manhattan), we form the following clusters:
- Cluster1 = {O1, O2, O3, O4}
- Cluster2 = {O5, O6, O7, O8, O9, O10}
[Figure: scatter plot of the data objects, with cluster1 and cluster2 outlined.]

K-Medoids Method: Example
Compute the absolute error criterion for the set of medoids (O2, O8):

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - o_i|$$

K-Medoids Method: Example
The absolute error criterion for the set of medoids (O2, O8):

$$E = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20$$
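The value of the absolute error criterion for this pair of medoids can be checked with a short script. The coordinates below are the ten data objects of the example as reconstructed from the partly garbled data table, so treat them as an assumption; the helper names `manhattan` and `absolute_error` are illustrative.

```python
# Data objects of the example (A1, A2); coordinates reconstructed from the table.
objects = {
    "O1": (2, 6), "O2": (3, 4), "O3": (3, 8), "O4": (4, 7), "O5": (6, 2),
    "O6": (6, 4), "O7": (7, 3), "O8": (7, 4), "O9": (8, 5), "O10": (7, 6),
}

def manhattan(p, q):
    """L1 (Manhattan) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))

def absolute_error(objects, medoids):
    """Each object contributes its distance to the closest medoid."""
    return sum(
        min(manhattan(p, objects[m]) for m in medoids)
        for p in objects.values()
    )

E = absolute_error(objects, ["O2", "O8"])
print(E)  # 20
```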

K-Medoids Method: Example
- Choose a random object, O7
- Swap O8 and O7
- Compute the absolute error criterion for the set of medoids (O2, O7):

$$E = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22$$

K-Medoids Method: Example
Compute the cost function:

S = absolute error for (O2, O7) - absolute error for (O2, O8) = 22 - 20 = 2

S > 0, so it is a bad idea to replace O8 by O7.
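The swap cost S can be computed the same way, as the difference between the absolute error after and before the swap. This is a sketch under the same assumption as before (coordinates reconstructed from the example's data table); `swap_cost` is a hypothetical helper name.

```python
# Data objects of the example (A1, A2); coordinates reconstructed from the table.
objects = {
    "O1": (2, 6), "O2": (3, 4), "O3": (3, 8), "O4": (4, 7), "O5": (6, 2),
    "O6": (6, 4), "O7": (7, 3), "O8": (7, 4), "O9": (8, 5), "O10": (7, 6),
}

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def absolute_error(medoids):
    return sum(min(manhattan(p, objects[m]) for m in medoids)
               for p in objects.values())

def swap_cost(medoids, old, new):
    """S = E(after replacing `old` by `new`) - E(before)."""
    swapped = [new if m == old else m for m in medoids]
    return absolute_error(swapped) - absolute_error(medoids)

S = swap_cost(["O2", "O8"], old="O8", new="O7")
print(S)  # 2, i.e. 22 - 20: positive, so the swap is rejected
```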

K-Medoids Method
In this example, changing the medoid of cluster 2 did not change the assignments of objects to clusters.
What are the possible cases when we replace a medoid by another object?

K-Medoids Method
Let A be the representative object of cluster 1 and B the representative object of cluster 2, and let B be replaced by a random object (the new B).
- First case: p is currently assigned to A. The assignment of p to A does not change.
- Second case: p is currently assigned to B. p is reassigned to A.
- Third case: p is currently assigned to B. p is reassigned to the new B.
- Fourth case: p is currently assigned to A. p is reassigned to the new B.
[Figure: four panels, one per case, showing cluster 1, cluster 2, the representative objects, the random object, and the point p.]

K-Medoids Algorithm (PAM)
PAM: Partitioning Around Medoids
Input
- K: the number of clusters
- D: a data set containing n objects
Output: a set of k clusters
Method:
(1) Arbitrarily choose k objects from D as representative objects (seeds)
(2) Repeat
(3)   Assign each remaining object to the cluster with the nearest representative object
(4)   For each representative object O_j
(5)     Randomly select a non-representative object O_random
(6)     Compute the total cost S of swapping representative object O_j with O_random
(7)     If S < 0, then replace O_j with O_random
(8) Until no change
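A compact Python sketch of this loop follows. It deviates from step (5) in one labeled way: instead of drawing a single random non-representative object, it tries every non-representative object for each medoid, so that "no change" means no single swap can lower the cost. The function names, the `initial` parameter, and the use of Manhattan distance are assumptions of this sketch, and the data are the example objects as reconstructed from the data table.

```python
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    """Absolute error E: each point contributes its distance to the nearest medoid."""
    return sum(min(manhattan(p, points[m]) for m in medoids) for p in points)

def pam(points, k, initial=None):
    # (1) Choose k objects as the initial representative objects (given as indices).
    medoids = list(initial) if initial is not None else list(range(k))
    best = total_cost(points, medoids)
    changed = True
    while changed:                                   # (2) repeat ... (8) until no change
        changed = False
        for j in range(k):                           # (4) for each representative O_j
            for r in range(len(points)):             # (5) here: every non-representative
                if r in medoids:
                    continue
                candidate = medoids[:j] + [r] + medoids[j + 1:]
                s = total_cost(points, candidate) - best   # (6) swap cost S
                if s < 0:                            # (7) keep improving swaps only
                    medoids, best = candidate, best + s
                    changed = True
    # (3) Assign every object to the cluster of its nearest representative.
    labels = [min(range(k), key=lambda j: manhattan(p, points[medoids[j]]))
              for p in points]
    return medoids, labels, best

points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
medoids, labels, cost = pam(points, k=2, initial=[1, 7])  # start from O2 and O8
```

Because only swaps with S < 0 are accepted, the returned cost can never exceed the cost of the starting medoids (20 for O2 and O8 on these data).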

K-Medoids Properties (k-medoids vs. k-means)
- The complexity of each iteration is O(k(n-k)^2)
- For large values of n and k, such computation becomes very costly
Advantages
- The k-medoids method is more robust than k-means in the presence of noise and outliers
Disadvantages
- K-medoids is more costly than the k-means method
- Like k-means, k-medoids requires the user to specify k
- It does not scale well for large data sets

3.2.4 CLARA
- CLARA (Clustering Large Applications) uses a sampling-based method to deal with large data sets
- A random sample should closely represent the original data
- PAM is applied to the sample
- The chosen medoids will likely be similar to what would have been chosen from the whole data set

CLARA
- Draw multiple samples of the data set
- Apply PAM to each sample
- Choose the best clustering
- Return the best clustering
[Figure: PAM is applied to each of sample 1, sample 2, ..., sample m, and each run yields its own set of clusters.]
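The sampling scheme above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the small `pam` helper applies any improving swap until none remains, the quality of each sample's medoids is judged by the absolute error over the whole data set, and the names `clara`, `num_samples`, `sample_size` and `seed` are hypothetical.

```python
import random

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    """Absolute error of `points` with respect to a list of medoid coordinates."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def pam(points, k):
    """Small PAM: keep applying any swap that lowers the absolute error."""
    medoids = list(points[:k])  # assumes the points are distinct
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for j in range(k):
            for r in points:
                if r in medoids:
                    continue
                candidate = medoids[:j] + [r] + medoids[j + 1:]
                cost = total_cost(points, candidate)
                if cost < best:
                    medoids, best, improved = candidate, cost, True
    return medoids

def clara(points, k, num_samples, sample_size, seed=0):
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_samples):
        sample = rng.sample(points, sample_size)  # draw one sample
        medoids = pam(sample, k)                  # apply PAM to the sample only
        cost = total_cost(points, medoids)        # evaluate on the whole data set
        if cost < best_cost:                      # keep the best clustering
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

# Two made-up blobs of 25 distinct points each.
points = [(x, y) for x in range(5) for y in range(5)]
points += [(x + 20, y + 20) for x in range(5) for y in range(5)]
medoids, cost = clara(points, k=2, num_samples=4, sample_size=12)
```

Running PAM on 12 objects instead of 50 is what makes each round cheap; the price is that the returned medoids can only be objects that happened to fall in some sample.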

CLARA Properties
- The complexity of each iteration is O(ks^2 + k(n-k))
  - s: the size of the sample
  - k: the number of clusters
  - n: the number of objects
- PAM finds the best k medoids among the given data, and CLARA finds the best k medoids among the selected samples
Problems
- The best k medoids may not be selected during the sampling process; in this case, CLARA will never find the best clustering
- If the sampling is biased, we cannot have a good clustering
- This is the trade-off made for efficiency

3.2.5 CLARANS
- CLARANS (Clustering Large Applications based upon RANdomized Search) was proposed to improve the quality and the scalability of CLARA
- It combines sampling techniques with PAM
- It does not confine itself to any sample at a given time
- It draws a sample with some randomness in each step of the search

CLARANS: The Idea
[Figure: clustering view. The current medoids and several neighboring sets of medoids are shown, each labeled with its cost. Caption: keep the current medoids.]

CLARANS: The Idea
CLARA
- Draws a sample of nodes at the beginning of the search
- Neighbors are from the chosen sample
- Restricts the search to a specific area of the original data
[Figure: at the first and second steps of the search, the neighbors of the current medoids are all taken from the chosen sample.]

CLARANS: The Idea
CLARANS
- Does not confine the search to a localized area
- Stops the search when a local minimum is found
- Finds several local optima and outputs the clustering with the best local optimum
- The number of neighbors sampled from the original data is specified by the user
[Figure: at the first and second steps of the search, a random sample of neighbors of the current medoids is drawn from the original data.]
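The randomized search described above can be sketched as follows. A neighbor of the current set of medoids is taken to be the same set with one medoid swapped for a non-medoid. The parameter names `num_local` (how many local optima to find) and `max_neighbor` (how many random neighbors to examine before declaring a local minimum) are names chosen for this sketch, as are the helper functions and the use of Manhattan distance.

```python
import random

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    """Absolute error with medoids given as indices into `points`."""
    return sum(min(manhattan(p, points[m]) for m in medoids) for p in points)

def clarans(points, k, num_local, max_neighbor, seed=0):
    rng = random.Random(seed)
    n = len(points)
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_local):                    # find several local optima
        current = rng.sample(range(n), k)         # random starting medoids
        current_cost = total_cost(points, current)
        tried = 0
        while tried < max_neighbor:
            # Draw a random neighbor from the ORIGINAL data, not from a fixed sample.
            j = rng.randrange(k)
            r = rng.choice([i for i in range(n) if i not in current])
            neighbor = current[:j] + [r] + current[j + 1:]
            neighbor_cost = total_cost(points, neighbor)
            if neighbor_cost < current_cost:      # move and restart the neighbor count
                current, current_cost = neighbor, neighbor_cost
                tried = 0
            else:
                tried += 1
        # No better neighbor among max_neighbor tries: treat as a local minimum.
        if current_cost < best_cost:
            best_medoids, best_cost = current, current_cost
    return best_medoids, best_cost

points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
medoids, cost = clarans(points, k=2, num_local=3, max_neighbor=10)
```

Larger `max_neighbor` values make each local search more thorough at a higher cost; with a small value the "local minimum" is only approximate.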

CLARANS Properties
Advantages
- Experiments show that CLARANS is more effective than both PAM and CLARA
- Handles outliers
Disadvantages
- The computational complexity of CLARANS is O(n^2), where n is the number of objects
- The clustering quality depends on the sampling method

Summary of Section 3.2
- Partitioning methods find sphere-shaped clusters
- K-means is efficient for large data sets but sensitive to outliers
- PAM uses actual objects at the centers of the clusters (medoids) instead of means
- CLARA and CLARANS are used for clustering large databases