Unsupervised Learning and Clustering


Supervised vs. Unsupervised Learning

Up to now we considered the supervised learning scenario, where we are given
1. samples $x_1, \dots, x_n$
2. class labels for all samples $x_1, \dots, x_n$
This is also called learning with a teacher, since the correct answer (the true class) is provided.
Today we consider the unsupervised learning scenario, where we are only given
1. samples $x_1, \dots, x_n$
This is also called learning without a teacher, since the correct answer is not provided. We do not split the data into training and test sets.

Unsupervised Learning

Data is not labeled.
Parametric approach: assume a parametric distribution of the data and estimate the parameters of this distribution. A lot is known about the data, so this is the easier case, but it is still much harder than the supervised case.
Nonparametric approach: group the data into clusters; each cluster (hopefully) says something about the categories (classes) present in the data. Little is known about the data, so this is the harder case.

Clustering

Seek natural clusters in the data.
What is a good clustering? Internal (within-cluster) distances should be small, and external (between-cluster) distances should be large.
Clustering is a way to discover new categories (classes).

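What We Need for Clustering

1. A proximity measure: either a similarity measure $s(x_i, x_k)$, which is large if $x_i$ and $x_k$ are similar, or a dissimilarity (or distance) measure $d(x_i, x_k)$, which is small if $x_i$ and $x_k$ are similar. [Figure: a distant pair has large d, small s; a close pair has large s, small d]
2. A criterion function to evaluate a clustering. [Figure: a good clustering and a bad clustering]
3. An algorithm to compute the clustering, for example by optimizing the criterion function.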

How Many Clusters?

[Figure: the same data seen as 3 clusters or as 2 clusters]
Possible approaches:
1. fix the number of clusters to k
2. find the best clustering according to the criterion function (the number of clusters may vary)

Proximity Measures

A good proximity measure is VERY application dependent.
Clusters should be invariant under the transformations natural to the problem.
For example, for object recognition we should have invariance to rotation. [Figure: an object and its rotated copy, at distance 0]
For character recognition we should not have invariance to rotation. [Figure: the characters 9 and 6]

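Distance (Dissimilarity) Measures

Manhattan (city block) distance, an approximation to Euclidean distance that is cheaper to compute:
$$d(x_i, x_j) = \sum_{k=1}^{d} \left| x_i^{(k)} - x_j^{(k)} \right|$$
Euclidean distance, which is translation invariant:
$$d(x_i, x_j) = \left( \sum_{k=1}^{d} \left( x_i^{(k)} - x_j^{(k)} \right)^2 \right)^{1/2}$$
Chebyshev distance, an approximation to Euclidean distance that is cheapest to compute:
$$d(x_i, x_j) = \max_{1 \le k \le d} \left| x_i^{(k)} - x_j^{(k)} \right|$$
Here $x^{(k)}$ is the $k$-th feature of sample $x$, and the upper limit $d$ is the number of features.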
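As a minimal sketch of the three distance measures on this slide, the following plain-Python functions compute the Manhattan, Euclidean, and Chebyshev distances between two feature vectors. The function names and the list-of-floats representation are assumptions made for illustration; they do not come from the slides.

```python
import math
from typing import Sequence


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    """City block distance: sum of absolute coordinate differences."""
    return sum(abs(p - q) for p, q in zip(a, b))


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def chebyshev(a: Sequence[float], b: Sequence[float]) -> float:
    """Chebyshev distance: largest absolute coordinate difference."""
    return max(abs(p - q) for p, q in zip(a, b))


x_i, x_j = [0.0, 0.0], [3.0, 4.0]
print(manhattan(x_i, x_j), euclidean(x_i, x_j), chebyshev(x_i, x_j))
```

For any pair of vectors the Chebyshev distance is at most the Euclidean distance, which in turn is at most the Manhattan distance, which is one way to see why the two cheaper measures can stand in for the Euclidean one.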

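Similarity Measures

Cosine similarity:
$$s(x_i, x_j) = \frac{x_i^T x_j}{\|x_i\| \, \|x_j\|}$$
The smaller the angle, the larger the similarity. It is a scale invariant measure, popular in text retrieval.
Correlation coefficient, popular in image processing:
$$s(x_i, x_j) = \frac{\sum_{k=1}^{d} \left( x_i^{(k)} - \bar{x}_i \right) \left( x_j^{(k)} - \bar{x}_j \right)}{\left[ \sum_{k=1}^{d} \left( x_i^{(k)} - \bar{x}_i \right)^2 \sum_{k=1}^{d} \left( x_j^{(k)} - \bar{x}_j \right)^2 \right]^{1/2}}$$
where $\bar{x}_i$ is the mean of the features of $x_i$.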
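The two similarity measures can be sketched in a few lines of plain Python. The names `cosine_similarity` and `correlation` are illustrative assumptions, and the sketch assumes neither vector is all zeros (or constant, for the correlation), since the denominator would then be zero.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b (assumes both are nonzero)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Correlation coefficient: cosine similarity of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [p - mean_a for p in a]
    centered_b = [q - mean_b for q in b]
    return cosine_similarity(centered_a, centered_b)


# Rescaling one vector leaves the cosine similarity unchanged.
print(cosine_similarity([1.0, 2.0], [2.0, 1.0]))
print(cosine_similarity([10.0, 20.0], [2.0, 1.0]))
```

Writing the correlation coefficient as the cosine similarity of centered vectors makes the relationship between the two formulas explicit.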

Feature Scale

An old problem: how to choose an appropriate relative scale for the features? [length (in meters or cm?), weight (in grams or kg?)]
In supervised learning, we can normalize to zero mean and unit variance with no problems.
In clustering this is more problematic: if the variance in the data is due to the presence of clusters, then normalizing the features is not a good thing.
[Figure: data before normalization and after normalization]
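For reference, the zero-mean, unit-variance normalization mentioned on this slide can be sketched as below. The helper name `standardize` is an assumption, and the sketch assumes every feature has nonzero variance. As the slide warns, applying it before clustering can shrink exactly the feature whose spread comes from the cluster structure.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def standardize(samples: Sequence[Sequence[float]]) -> List[List[float]]:
    """Rescale every feature (column) to zero mean and unit variance."""
    columns = list(zip(*samples))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population standard deviation
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in samples
    ]


# Length in meters next to weight in grams: very different raw scales.
data = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
print(standardize(data))
```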

Simplest Clustering Algorithm

Having defined a proximity function, we can develop a simple clustering algorithm: go over all sample pairs, and put them in the same cluster if the distance between them is less than some threshold distance $d_0$ (or if the similarity is larger than $s_0$).
Pros: simple to understand and implement.
Cons: very dependent on $d_0$ (or $s_0$); automatic choice of $d_0$ (or $s_0$) is not an easily solved issue.
[Figure: $d_0$ too small gives too many clusters; a larger $d_0$ gives a reasonable clustering; $d_0$ too large gives too few clusters]
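One way to realize this threshold rule is sketched below. It assumes Euclidean distance and uses a union-find structure so that "same cluster" is applied transitively; the function name `threshold_clustering` and the integer label output are illustrative choices, not part of the slides.

```python
import math
from itertools import combinations
from typing import List, Sequence


def threshold_clustering(samples: Sequence[Sequence[float]], d0: float) -> List[int]:
    """Link every pair closer than d0 and return one cluster label per sample."""
    parent = list(range(len(samples)))

    def find(i: int) -> int:
        # Follow parent links to the representative, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(samples)), 2):
        if math.dist(samples[i], samples[j]) < d0:
            parent[find(i)] = find(j)  # merge the two clusters

    roots = {}
    labels = []
    for i in range(len(samples)):
        # Number the clusters 0, 1, 2, ... in order of first appearance.
        labels.append(roots.setdefault(find(i), len(roots)))
    return labels


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
for d0 in (0.5, 2.0, 100.0):
    print(d0, threshold_clustering(points, d0))
```

Running it with a very small, a moderate, and a very large threshold reproduces the three regimes described on the slide: too many clusters, a reasonable clustering, and too few clusters.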

Criterion Functions for Clustering

We have samples $x_1, \dots, x_n$.
Suppose we partitioned the samples into $c$ subsets $D_1, \dots, D_c$. [Figure: samples partitioned into $D_1$, $D_2$, $D_3$]
There are approximately $c^n / c!$ distinct partitions.
We can define a criterion function $J(D_1, \dots, D_c)$ which measures the quality of a partitioning $D_1, \dots, D_c$.
Then the clustering problem is a well defined problem: the optimal clustering is the partition which optimizes the criterion function.
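To get a feel for the $c^n / c!$ estimate, the sketch below counts the partitions of n samples into exactly c non-empty clusters with the standard recurrence for Stirling numbers of the second kind and compares the count with the approximation. The function names are assumptions made for illustration.

```python
from functools import lru_cache
from math import factorial


@lru_cache(maxsize=None)
def count_partitions(n: int, c: int) -> int:
    """Exact number of ways to split n samples into c non-empty clusters."""
    if n == c:
        return 1  # every sample alone (also covers n = c = 0)
    if c == 0 or c > n:
        return 0
    # The last sample either forms its own cluster or joins one of the c clusters.
    return count_partitions(n - 1, c - 1) + c * count_partitions(n - 1, c)


def approx_partitions(n: int, c: int) -> float:
    """The approximation c^n / c! quoted on the slide."""
    return c ** n / factorial(c)


for n, c in [(4, 2), (10, 3), (30, 3)]:
    print(n, c, count_partitions(n, c), approx_partitions(n, c))
```

Even for a few dozen samples the count is far too large to enumerate, which is why the following slides turn to iterative optimization.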

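SSE Criterion Function

Let $n_i$ be the number of samples in $D_i$, and define the mean of the samples in $D_i$:
$$\mu_i = \frac{1}{n_i} \sum_{x \in D_i} x$$
Then the sum-of-squared-errors criterion function (to minimize) is:
$$J_{SSE} = \sum_{i=1}^{c} \sum_{x \in D_i} \|x - \mu_i\|^2$$
[Figure: two clusters with means $\mu_1$ and $\mu_2$]
Note that the number of clusters, $c$, is fixed.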
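The SSE criterion is straightforward to evaluate for a given partition. The sketch below assumes a partition is represented as a list of clusters, each a list of equal-length feature vectors; the names `cluster_mean` and `sse` are illustrative.

```python
from typing import List, Sequence


def cluster_mean(cluster: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of the samples in one cluster."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]


def sse(clusters: Sequence[Sequence[Sequence[float]]]) -> float:
    """Sum over clusters of squared distances from each sample to its cluster mean."""
    total = 0.0
    for cluster in clusters:
        mu = cluster_mean(cluster)
        for x in cluster:
            total += sum((xk - mk) ** 2 for xk, mk in zip(x, mu))
    return total


# Two ways to split the same four points into c = 2 clusters.
compact = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
stretched = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]
print(sse(compact), sse(stretched))
```

The partition that groups nearby points has the much smaller criterion value, which is what minimizing $J_{SSE}$ is meant to reward.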

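SSE Criterion Function

$$J_{SSE} = \sum_{i=1}^{c} \sum_{x \in D_i} \|x - \mu_i\|^2$$
The SSE criterion is appropriate when the data forms compact clouds that are relatively well separated.
The SSE criterion favors equally sized clusters, and may not be appropriate when the natural groupings have very different sizes.
[Figure: the natural grouping of clusters of very different sizes has large $J_{SSE}$; a partition that splits the large cluster has small $J_{SSE}$]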

Failure Example for $J_{SSE}$

[Figure: the natural clustering has the larger $J_{SSE}$; an unnatural partition has the smaller $J_{SSE}$]
The problem is that one of the natural clusters is not compact (the outer ring).

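Other Minimum Variance Criterion Functions

We can eliminate constant terms from
$$J_{SSE} = \sum_{i=1}^{c} \sum_{x \in D_i} \|x - \mu_i\|^2$$
We get an equivalent criterion function:
$$J_E = \frac{1}{2} \sum_{i=1}^{c} n_i \bar{d}_i, \qquad \bar{d}_i = \frac{1}{n_i^2} \sum_{x \in D_i} \sum_{y \in D_i} \|x - y\|^2$$
where $\bar{d}_i$ is the average squared Euclidean distance between all pairs of samples in $D_i$.
We can obtain other criterion functions by replacing $\|x - y\|^2$ by any other measure of distance between points in $D_i$.
Alternatively, we can replace $\bar{d}_i$ by the median, maximum, etc. instead of the average distance.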
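The equivalence can be checked with a short derivation, given here as a sketch. It assumes $\bar{d}_i$ is the average squared Euclidean distance over all ordered pairs in $D_i$ and that the pairwise criterion carries the factor $\frac{1}{2}$. Writing $x - y = (x - \mu_i) - (y - \mu_i)$ and using $\sum_{x \in D_i} (x - \mu_i) = 0$:

$$
\begin{aligned}
\sum_{x \in D_i} \sum_{y \in D_i} \|x - y\|^2
&= \sum_{x \in D_i} \sum_{y \in D_i} \left( \|x - \mu_i\|^2 + \|y - \mu_i\|^2 - 2 (x - \mu_i)^t (y - \mu_i) \right) \\
&= n_i \sum_{x \in D_i} \|x - \mu_i\|^2 + n_i \sum_{y \in D_i} \|y - \mu_i\|^2 - 2 \Big( \sum_{x \in D_i} (x - \mu_i) \Big)^t \Big( \sum_{y \in D_i} (y - \mu_i) \Big) \\
&= 2 n_i \sum_{x \in D_i} \|x - \mu_i\|^2
\end{aligned}
$$

Hence

$$
\frac{1}{2} n_i \bar{d}_i = \frac{1}{2} n_i \cdot \frac{1}{n_i^2} \cdot 2 n_i \sum_{x \in D_i} \|x - \mu_i\|^2 = \sum_{x \in D_i} \|x - \mu_i\|^2
$$

and summing over $i = 1, \dots, c$ gives $J_E = J_{SSE}$.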

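Maximum Distance Criterion

Consider
$$J_{max} = \sum_{i=1}^{c} n_i \max_{x, y \in D_i} \|x - y\|^2$$
This solves the previous case.
However, $J_{max}$ is not robust to outliers.
[Figure: two partitions, each labeled as having the smallest $J_{max}$]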

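Other Criterion Functions

Recall the definition of scatter matrices.
Scatter matrix for the $i$-th cluster:
$$S_i = \sum_{x \in D_i} (x - \mu_i)(x - \mu_i)^t$$
Within-cluster scatter matrix:
$$S_W = \sum_{i=1}^{c} S_i$$
The determinant of $S_W$ roughly measures the square of the volume.
Assuming $S_W$ is nonsingular, define the determinant criterion function:
$$J_d = |S_W| = \left| \sum_{i=1}^{c} S_i \right|$$
Scaling the axes multiplies $J_d$ by the same constant for every partition, so the partition that minimizes $J_d$ is invariant to scaling of the axes. This makes $J_d$ useful if there are unknown irrelevant linear transformations of the data.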
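A small sketch of the determinant criterion, restricted to 2-D samples so that the determinant can be written out by hand. The names `scatter_matrix` and `determinant_criterion` and the 2-D restriction are assumptions made to keep the example dependency-free.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def scatter_matrix(cluster: Sequence[Point]) -> List[List[float]]:
    """2x2 scatter matrix: sum over x of (x - mu)(x - mu)^t for 2-D samples."""
    n = len(cluster)
    mx = sum(x for x, _ in cluster) / n
    my = sum(y for _, y in cluster) / n
    sxx = sum((x - mx) ** 2 for x, _ in cluster)
    syy = sum((y - my) ** 2 for _, y in cluster)
    sxy = sum((x - mx) * (y - my) for x, y in cluster)
    return [[sxx, sxy], [sxy, syy]]


def determinant_criterion(clusters: Sequence[Sequence[Point]]) -> float:
    """J_d: determinant of the within-cluster scatter matrix S_W."""
    sw = [[0.0, 0.0], [0.0, 0.0]]
    for cluster in clusters:
        s = scatter_matrix(cluster)
        for r in range(2):
            for col in range(2):
                sw[r][col] += s[r][col]
    return sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]


natural = [[(0, 0), (1, 0), (0, 1), (1, 1)], [(10, 10), (11, 10), (10, 11), (11, 11)]]
mixed = [[(0, 0), (1, 0), (10, 10), (11, 10)], [(0, 1), (1, 1), (10, 11), (11, 11)]]
print(determinant_criterion(natural), determinant_criterion(mixed))
```

Stretching one axis by a factor $a$ multiplies $J_d$ by $a^2$ for every partition alike, so the ranking of partitions does not change.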

Iterative Optimization Algorithms

We now have both a proximity measure and a criterion function; we need an algorithm to find the optimal clustering.
Exhaustive search is impossible, since there are approximately $c^n / c!$ possible partitions.
Usually some iterative algorithm is used:
1. Find a reasonable initial partition
2. Repeat: move samples from one group to another such that the objective function J is improved
[Figure: moving samples to improve J, for example from J = 777,777 to J = 666,666]

Iterative Optimization Algorithms

Iterative optimization algorithms are similar to gradient descent:
they move in a direction of descent (ascent), but not in the steepest descent direction, since we have no derivative of the objective function;
the solution depends on the initial point;
they are not guaranteed to find the global minimum.
Main issue: how to move from the current partitioning to one which improves the objective function.

K-means Clustering

We now consider an example of an iterative optimization algorithm for the special case of the $J_{SSE}$ objective function:
$$J_{SSE} = \sum_{i=1}^{k} \sum_{x \in D_i} \|x - \mu_i\|^2$$
For a different objective function, we need a different optimization algorithm, of course.
Fix the number of clusters to k (c = k).
k-means is probably the most famous clustering algorithm; it has a smart way of moving from the current partitioning to the next one.

K-means Clustering

[Figure: example with k = 3]
1. Initialize: pick k cluster centers arbitrarily and assign each example to the closest center
2. Compute the sample mean for each cluster
3. Reassign all samples to the closest mean
4. If clusters changed at step 3, go to step 2
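The four steps can be sketched in plain Python as follows. This is a minimal illustration, not a tuned implementation: it assumes Euclidean distance, takes the initial centers as an argument so that runs are repeatable, keeps the old center if a cluster becomes empty, and caps the number of iterations. The name `k_means` and all parameter names are assumptions.

```python
import math
from typing import List, Optional, Sequence, Tuple


def k_means(
    samples: Sequence[Sequence[float]],
    initial_centers: Sequence[Sequence[float]],
    max_iter: int = 100,
) -> Tuple[List[int], List[List[float]]]:
    """Return (labels, centers) after alternating assignment and mean updates."""
    centers = [list(c) for c in initial_centers]
    labels: Optional[List[int]] = None
    for _ in range(max_iter):
        # Steps 1 and 3: assign every sample to its closest center.
        new_labels = [
            min(range(len(centers)), key=lambda i: math.dist(x, centers[i]))
            for x in samples
        ]
        # Step 4: stop as soon as no sample changed cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 2: recompute the mean of every non-empty cluster.
        for i in range(len(centers)):
            members = [x for x, lab in zip(samples, labels) if lab == i]
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels or [], centers


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = k_means(data, initial_centers=[(0, 0), (1, 0)])
print(labels, centers)
```

In this run both initial centers sit inside the same group of points, yet the alternation of steps 2 and 3 still separates the two groups. With other data or initial centers, the result can be a local minimum only.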

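K-means Clustering

Consider steps 2 and 3 of the algorithm.
2. Compute the sample mean for each cluster:
$$J_{SSE} = \sum_{i=1}^{k} \sum_{x \in D_i} \|x - \mu_i\|^2$$
[Figure: $J_{SSE}$ is the sum of the squared distances from each sample to its cluster mean, $\mu_1$ or $\mu_2$]
3. Reassign all samples to the closest mean. If we represent the clusters by their old means, the error has gotten smaller.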

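K-means Clustering

3. Reassign all samples to the closest mean. If we represent the clusters by their old means, the error has gotten smaller.
However, we represent the clusters by their new means, and the mean is always the representative of a cluster with the smallest error:
$$\frac{\partial}{\partial z} \sum_{x \in D_i} \frac{1}{2} \|x - z\|^2 = \frac{\partial}{\partial z} \sum_{x \in D_i} \frac{1}{2} \left( \|x\|^2 - 2 x^t z + \|z\|^2 \right) = \sum_{x \in D_i} (-x + z) = 0$$
$$\Rightarrow \quad z = \frac{1}{n_i} \sum_{x \in D_i} x$$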

K-means Clustering

We just proved that by doing steps 2 and 3, the objective function goes down: in two steps, we found a smart move which decreases the objective function.
Since there are only finitely many partitions and $J_{SSE}$ never increases, the algorithm converges after a finite number of iterations of steps 2 and 3.
However, the algorithm is not guaranteed to find a global minimum.
[Figure: a configuration of $\mu_1$ and $\mu_2$ where 2-means gets stuck, next to the global minimum of $J_{SSE}$]

K-means Clustering

Finding the optimum of $J_{SSE}$ is NP-hard.
In practice, k-means clustering usually performs well.
It is very efficient.
Its solution can be used as a starting point for other clustering algorithms.
There are still hundreds of papers on variants and improvements of k-means clustering every year.

Hierarchical Clustering

Up to now, we considered flat clustering.
For some data, hierarchical clustering is more appropriate than flat clustering.
[Figure: a flat clustering and a hierarchical clustering of the same data]

Hierarchical Clustering: Biological Taxonomy

[Figure: taxonomy tree. animal: with spine (dog, cat), no spine (jellyfish); plant: seed producing (apple, rose), spore producing (mushroom, mold)]

Hierarchical Clustering: Dendrogram

The preferred way to represent a hierarchical clustering is a dendrogram.
It is a binary tree.
Level k corresponds to a partitioning with n-k+1 clusters; if we need k clusters, we take the clustering from level n-k+1.
If samples are in the same cluster at level k, they stay in the same cluster at higher levels.
A dendrogram typically shows the similarity of the grouped clusters.

Example

Hierarchical Clustering: Venn Diagram

We can also use a Venn diagram to show a hierarchical clustering, but the similarity is not represented quantitatively.

Hierarchical Clustering

Algorithms for hierarchical clustering can be divided into two types:
1. Agglomerative (bottom up) procedures: start with n singleton clusters and form the hierarchy by merging the most similar clusters. [Figure: successive merges of numbered samples]
2. Divisive (top down) procedures: start with all samples in one cluster and form the hierarchy by splitting the worst clusters.

Divisive Hierarchical Clustering

Any flat algorithm which produces a fixed number of clusters can be used: set c = 2.

Agglomerative Hierarchical Clustering

Initialize with each example in a singleton cluster.
While there is more than 1 cluster:
1. find the 2 nearest clusters
2. merge them
Four common ways to measure cluster distance:
1. minimum distance: $d_{min}(D_i, D_j) = \min_{x \in D_i,\, y \in D_j} \|x - y\|$
2. maximum distance: $d_{max}(D_i, D_j) = \max_{x \in D_i,\, y \in D_j} \|x - y\|$
3. average distance: $d_{avg}(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{y \in D_j} \|x - y\|$
4. mean distance: $d_{mean}(D_i, D_j) = \|\mu_i - \mu_j\|$
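The merge loop and the four cluster distances can be sketched as below. This is a naive illustration that rescans all cluster pairs at every merge. It assumes Euclidean distance between samples and stops at a requested number of clusters instead of building the full hierarchy down to one cluster. The function and parameter names are assumptions.

```python
import math
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

Cluster = List[Tuple[float, ...]]


def d_min(a: Cluster, b: Cluster) -> float:
    """Minimum distance between a sample of a and a sample of b."""
    return min(math.dist(x, y) for x in a for y in b)


def d_max(a: Cluster, b: Cluster) -> float:
    """Maximum distance between a sample of a and a sample of b."""
    return max(math.dist(x, y) for x in a for y in b)


def d_avg(a: Cluster, b: Cluster) -> float:
    """Average distance over all cross-cluster pairs."""
    return sum(math.dist(x, y) for x in a for y in b) / (len(a) * len(b))


def d_mean(a: Cluster, b: Cluster) -> float:
    """Distance between the two cluster means."""
    mu_a = [sum(col) / len(a) for col in zip(*a)]
    mu_b = [sum(col) / len(b) for col in zip(*b)]
    return math.dist(mu_a, mu_b)


def agglomerative(
    samples: Sequence[Sequence[float]],
    num_clusters: int,
    cluster_distance: Callable[[Cluster, Cluster], float] = d_min,
) -> List[Cluster]:
    """Merge the two nearest clusters until num_clusters remain."""
    clusters: List[Cluster] = [[tuple(x)] for x in samples]
    while len(clusters) > num_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
print(agglomerative(points, 3, d_min))
print(agglomerative(points, 3, d_max))
```

Passing the distance as a function makes it easy to compare the four choices on the same data.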

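Single Linkage or Nearest Neighbor

Agglomerative clustering with minimum distance:
$$d_{min}(D_i, D_j) = \min_{x \in D_i,\, y \in D_j} \|x - y\|$$
It generates a minimum spanning tree. [Figure: samples joined by edges numbered in merge order]
It encourages growth of elongated clusters.
Disadvantage: it is very sensitive to noise.
[Figure: what we want at the level with c = 3, and what we get at the level with c = 3 because of one noisy sample]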

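Complete Linkage or Farthest Neighbor

Agglomerative clustering with maximum distance:
$$d_{max}(D_i, D_j) = \max_{x \in D_i,\, y \in D_j} \|x - y\|$$
It encourages compact clusters. [Figure: samples joined by edges numbered in merge order]
It does not work well if elongated clusters are present.
[Figure: three clusters $D_1$, $D_2$, $D_3$] Here $d_{max}(D_1, D_2) < d_{max}(D_2, D_3)$, thus $D_1$ and $D_2$ are merged instead of $D_2$ and $D_3$.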

Average and Mean Agglomerative Clustering

Agglomerative clustering is more robust under the average or the mean cluster distance:
$$d_{avg}(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{y \in D_j} \|x - y\|$$
$$d_{mean}(D_i, D_j) = \|\mu_i - \mu_j\|$$
The mean distance is cheaper to compute than the average distance.
Unfortunately, there is not much to say about agglomerative clustering theoretically, but it does work reasonably well in practice.

Agglomerative vs. Divisive

Agglomerative is faster to compute, in general.
Divisive may be less blind to the global structure of the data.
Divisive: when taking the first step (a split), we have access to all the data and can find the best possible split into 2 parts.
Agglomerative: when taking the first step (a merge), we do not consider the global structure of the data and only look at pairwise structure.

First (?) Application of Clustering

John Snow, a London physician, plotted the location of cholera deaths on a map during an outbreak in the 1850s. The locations indicated that cases were clustered around certain intersections where there were polluted wells, thus exposing both the problem and the solution.
From: Nina Mishra, HP Labs

Application of Clustering

Astronomy
SkyCat: clustered $2 \times 10^9$ sky objects into stars, galaxies, quasars, etc., based on radiation emitted in different spectrum bands.
From: Nina Mishra, HP Labs

Applications of Clustering

Image segmentation
Find interesting objects in images to focus attention on.
From: Image Segmentation by Nested Cuts, O. Veksler, CVPR 2000

Applications of Clustering

Image database organization for efficient search

Applications of Clustering

Data mining
Technology watch: the Derwent Database contains all patents filed in the last 10 years worldwide. Searching by keywords leads to thousands of documents. Find clusters in the database to see if there are any emerging technologies and what the competition is up to.
Marketing: given a customer database, find clusters of customers and tailor marketing schemes to them.

Applications of Clustering

Gene expression profile clustering: similar expressions, expect similar function

U18675    4CL    -0.151  -0.207   0.126   0.359   0.208   0.091  -0.083  -0.209
M84697    a-tub   0.188   0.030   0.111   0.094  -0.009  -0.173  -0.119  -0.136
M95595    ACC2    0.000   0.041   0.000   0.000   0.000   0.000   0.000   0.000
X66719    ACO1    0.058   0.155   0.082   0.284   0.240   0.065  -0.159  -0.010
U41998    ACT     0.096  -0.019   0.070   0.137   0.089   0.038   0.096  -0.070
AF057044  ACX1    0.268   0.403   0.679   0.785   0.565   0.260   0.203   0.252
AF057043  ACX2    0.415   0.000  -0.053   0.114   0.296   0.242   0.090   0.230
U40856    AIG1    0.096  -0.106  -0.027  -0.026  -0.005  -0.052   0.054   0.006
U40857    AIG2    0.311   0.140   0.257   0.261   0.158   0.056  -0.049   0.058
AF123253  AIM1   -0.040   0.002  -0.202  -0.040   0.077   0.081   0.088   0.224
X92510    AOS     0.473   0.560   0.914   0.625   0.375   0.387   0.019   0.141

From: De Smet F., Mathys J., Marchal K., Thijs G., De Moor B. & Moreau Y. 2002. Adaptive Quality-based clustering of gene expression profiles, Bioinformatics, 18(6), 735-746.

Applications of Clustering

Profiling web users
Use web access logs to generate a feature vector for each user.
Cluster users based on their feature vectors.
Identify common goals for users: shopping, job seekers, product seekers, tutorial seekers.
The clustering results can be used to improve web content and design.

Summary

Clustering (nonparametric unsupervised learning) is useful for discovering inherent structure in data.
Clustering is immensely useful in different fields.
Clustering comes naturally to humans (in up to 3 dimensions), but not so to computers.
It is very easy to design a clustering algorithm, but it is very hard to say if it does anything good.
General purpose clustering does not exist; for best results, clustering should be tuned to the application at hand.