Clustering Algorithms. Data Mining Clustering. Distance. Example. More Than One Mean. Mean Clustering

Data Mining Clustering
Kevin Swingler

Clustering Algorithms
- Organise data into a number of distinct groups (clusters) according to the similarity of their members and their differences from other clusters
- Take a new data point and assign it to one of the clusters (or, possibly, to none of them)

Distance
- Clustering is usually based on the distance between data points
- For numeric data, Euclidean distance is often used:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$

Example
[Figure: data point x_1 plotted between two means, m_1 and m_2]
- Data point x_1 belongs to mean m_2 because it is closest to it.

Mean Clustering
- We will look at an approach to clustering numeric data based on picking a number of mean values, one for each cluster
- You hopefully know that the mean (average) of a data set S containing |S| points is:

$$\bar{x} = \frac{1}{|S|} \sum_{x \in S} x$$

More Than One Mean
- What if we suspect that our data set is actually a number of data sets mixed together?
- Each one has a mean value of its own
- But we don't know which data point belongs to which set
- Clustering algorithms separate out the data and calculate the means
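The Euclidean distance, the mean, and the "belongs to the closest mean" rule from the slides above can be sketched in a few lines of Python. This is only an illustration: the function names and the coordinates for m_1, m_2 and x_1 are made up and do not come from the slides.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric points."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def nearest_mean(x, means):
    """Index of the mean that is closest to data point x."""
    return min(range(len(means)), key=lambda i: euclidean(x, means[i]))

# Made-up values, for illustration only.
m1, m2 = (0.0, 0.0), (4.0, 3.0)
x1 = (3.0, 3.0)

print(euclidean(m1, m2))           # 5.0
print(nearest_mean(x1, [m1, m2]))  # 1, i.e. x1 belongs to m2
print(mean([m1, m2]))              # (2.0, 1.5)
```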

Mean Clustering
- Imagine we think there are 5 clusters in our data
- We want to calculate 5 means: m_1, m_2, m_3, m_4, m_5
- And assign each data point, x_j, to one mean only
- That would lead to 5 data sets, S_1 ... S_5

Target
- The aim is to minimise the total squared distance between the data points and the means to which they are assigned:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - m_i \rVert^2$$

K-Means Clustering
- The k-means algorithm is a well-known method for clustering data by calculating the mean of each cluster
- You must decide how many clusters you want (the value k)
- The algorithm chooses the data subsets and calculates the means to minimise the total distance from all data points to their mean

K-Means
- Imagine a machine that worked in two distinct states, e.g. fast and slow
- Mean temperature might be 50 for the slow speed and 80 for the fast speed
[Figure: temperature plotted against time]

K-Means
- The machine might have a number of distinct states, all with differing acceptable ranges of temperature, pressure, etc.
- We don't know what these different states are, nor how many of them there are
- A clustering algorithm will find them
- K-means does so by finding the middle point of each

How K-Means Works
You tell it how many clusters you want it to find: k
1. It picks k different points from the data and assumes they are the centres of the clusters
2. It then calculates which of these clusters all the other points fall into by measuring their distance
3. Then, it calculates the average of all the points in each cluster and that is the new centre for each
4. Repeat from 2 until no points swap clusters
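The four steps under "How K-Means Works" can be sketched as follows. This is a minimal sketch, not a reference implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, and the temperature readings are assumptions made for illustration. It also assumes the data points are distinct, so that the k starting centres differ, and it leaves a centre unchanged if its cluster ever becomes empty.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Cluster a list of numeric tuples into k groups; returns (centres, labels)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)           # step 1: pick k data points as centres
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centre.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        if new_labels == labels:              # step 4: stop when no point swaps cluster
            break
        labels = new_labels
        # Step 3: move each centre to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, labels) if a == c]
            if members:                       # keep the old centre if a cluster is empty
                centres[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centres, labels

# Made-up temperature readings from a machine with a slow and a fast state.
temps = [(48.0,), (50.0,), (52.0,), (78.0,), (80.0,), (82.0,)]
centres, labels = kmeans(temps, k=2)
print(sorted(centres))
```

With these readings the two centres settle near the slow-state and fast-state temperatures. On less cleanly separated data the result can depend on which starting points are picked, so it is common to run the algorithm several times with different seeds.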

K-Means Disadvantages
- Only measures the mean for each cluster; tells you nothing of its shape. You must assume the cluster is round, but they rarely are
- You need to know k before you start
- The distance measure, in its simple form, assumes that all ranges are equally important

Clustering Algorithms
- Correct (or best) number of clusters cannot always be known
- Can be more than one acceptable way to organise a given set of data into clusters
- Algorithms are unsupervised. They are not given category names to fit data to
- Distance measures may need careful design

Hierarchical Clustering
Minimum Spanning Tree

Hierarchical Clustering
[Figure: clusters and the corresponding dendrogram]
- Looks for clusters within clusters
- Cluster 1 (root) is the whole data set
- That splits into a small number of subsets
- Each subset splits into 0 or more subsets, etc.

Hierarchical Clustering Algorithm
- Start with the same number of clusters as you have data points: every point is a cluster of its own
- Find the two clusters that are closest together and join them into one. Calculate their new centre
- Repeat until you have the desired number of clusters

Qualities of a Cluster
- The cluster hierarchy (and the k-means list of means) may store other data about its clusters:
- Population size: how many data points are in that cluster?
- Variance and range: how far from the centre does most of the data lie?
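The bottom-up procedure in "Hierarchical Clustering Algorithm" can be sketched like this. It is a simple, unoptimised illustration that measures closeness between cluster centres, as the slide suggests. The names `centre` and `agglomerate` and the sample points are assumptions, not part of the slides.

```python
import math
from itertools import combinations

def centre(cluster):
    """Mean point of a cluster (a non-empty list of numeric tuples)."""
    return tuple(sum(v) / len(cluster) for v in zip(*cluster))

def agglomerate(points, target):
    """Merge the two closest clusters repeatedly until `target` clusters remain."""
    clusters = [[p] for p in points]          # every point starts as its own cluster
    while len(clusters) > target:
        # Find the pair of clusters whose centres are closest together.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: math.dist(centre(clusters[ij[0]]), centre(clusters[ij[1]])),
        )
        merged = clusters[i] + clusters[j]    # join them into one cluster
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters

# Made-up points: two tight pairs and one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (5.0, 50.0)]
for cluster in agglomerate(points, target=3):
    # Population size and centre are the kind of per-cluster data one might store.
    print(len(cluster), centre(cluster))
```

Recomputing every centre on every pass keeps the sketch short. A practical version would cache distances and record each merge so that the dendrogram can be drawn afterwards.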

Association Rules

Market Basket Analysis
- Customers in a shop usually buy more than one item at a time
- Are there patterns in the purchases that help the shop?

Data Structure
- Association rules are derived from data that has a variable organisation:
  - No discrimination between inputs and outputs
  - Data organised into variable-sized baskets
  - Baskets contain items
  - Data set is a series of baskets
  - Analysis forms items into itemsets
- Data set:
  Basket 1 = Fish, Rice, Cabbage
  Basket 2 = Milk, Cornflakes
  etc.

Definitions - Data
- Item = single object or event, e.g. Bread
- Basket = a set of items that co-occurred, e.g. Bread and Milk bought together
- Itemset = any collection of one or more items (could be a subset of a basket)

The Rules
- A rule links two itemsets and is written thus: X => Y, where X and Y are itemsets
- E.g. {Bread} => {Butter} links bread buying to butter buying in the same basket
- E.g. {Egg, Flour, Milk} => {Sugar}
- Or {Egg, Flour} => {Milk, Sugar}
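One possible way to hold this data in Python is to treat each basket and each itemset as a set, and a rule as an ordered pair of itemsets. The representation and the helper name `occurs_in` are assumptions chosen for illustration. The slides do not prescribe a data structure.

```python
# Each basket is a variable-sized set of items; the data set is a list of baskets.
baskets = [
    frozenset({"Fish", "Rice", "Cabbage"}),
    frozenset({"Milk", "Cornflakes"}),
]

def occurs_in(itemset, basket):
    """True if every item of the itemset was bought in this basket."""
    return itemset <= basket

# A rule X => Y is an ordered pair of itemsets (X, Y).
rule = (frozenset({"Egg", "Flour"}), frozenset({"Milk", "Sugar"}))

print(occurs_in(frozenset({"Fish", "Rice"}), baskets[0]))  # True
print(occurs_in(frozenset({"Fish", "Milk"}), baskets[0]))  # False
```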

The Rules
X => Y
- Rules have two qualities associated with them:
  - Confidence = % of (transactions that contain X) that also contain Y
  - Support = % of all transactions that contain X and Y
- {Bread} => {Butter}, c = 60%, s = 10%
  - If someone buys bread, they will buy butter 60% of the time
  - 10% of all visitors to the shop buy bread and butter

Direction
- The direction is important: X => Y is not the same as Y => X
- For example, 80% of people who buy a torch buy batteries
- 5% of people who buy batteries buy a torch

Rule Sets
- A rule set contains a number of rules
- You could find all the useful rules for a complete rule set
- Many rules would have such low support and confidence that they are useless
- So, a rule set will have a minimum support and confidence level, below which rules are discarded

Finding the Rules
The apriori algorithm works as follows:
1. Find all the acceptable itemsets (support)
2. Use them to generate acceptable rules (confidence)
So, we find all the itemsets with more than our chosen support and then combine them into every possible rule, keeping those with an acceptable confidence

Step 1: Generate Itemsets
1. Find all the acceptable itemsets of size 1
2. Use the items from step 1 to generate all itemsets of size two and count their support. Keep those that are supported.
3. Repeat for increasingly large itemsets until none of the current size are supported
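Support, confidence, and the level-by-level itemset search of Step 1 can be sketched as below. This is a simplified illustration rather than a full apriori implementation: candidates of each size are built from every item that survived the previous size, without the finer pruning a production version would use. It also treats "acceptable" as support greater than or equal to the threshold. The function names and the ten baskets are made up.

```python
from itertools import combinations

def support(itemset, baskets):
    """Fraction of all baskets that contain every item in the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(x, y, baskets):
    """Fraction of the baskets containing X that also contain Y."""
    with_x = [b for b in baskets if x <= b]
    return sum(y <= b for b in with_x) / len(with_x)

def frequent_itemsets(baskets, min_support):
    """Step 1: grow itemsets one size at a time, keeping the supported ones."""
    items = sorted({item for b in baskets for item in b})
    current = [frozenset({i}) for i in items
               if support(frozenset({i}), baskets) >= min_support]
    kept, size = [], 1
    while current:
        kept.extend(current)
        size += 1
        surviving = sorted({item for s in current for item in s})
        candidates = [frozenset(c) for c in combinations(surviving, size)]
        current = [c for c in candidates if support(c, baskets) >= min_support]
    return kept

# Made-up data set of ten baskets.
baskets = (
    [frozenset({"Bread", "Butter"})] * 3
    + [frozenset({"Bread"})] * 2
    + [frozenset({"Milk"})] * 5
)
bread, butter = frozenset({"Bread"}), frozenset({"Butter"})
print(support(bread | butter, baskets))     # 0.3
print(confidence(bread, butter, baskets))   # 0.6
print(frequent_itemsets(baskets, min_support=0.2))
```

Here {Bread} => {Butter} has 30% support and 60% confidence, while the reverse rule {Butter} => {Bread} has 100% confidence. This is the asymmetry described in the Direction slide.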

With a minimum support of 20%:
- Bread = 40%: Keep
- Milk = 60%: Keep
- Porcini = 2%: Discard
- {Bread, Milk} = 30%: Keep
- {Bread, Milk, Sardines} = 15%: Discard
These are NOT rules yet! Just itemsets

Step 2: Generate Rules
- Generate every combination from the acceptable itemsets: X => Y where $X \cap Y = \emptyset$
- That is, where nothing in X appears in Y, and vice versa.
- {Bread} => {Milk} is good
- {Bread, Milk} => {Coffee} is good
- {Bread} => {Bread, Milk} is not allowed

Finally
- Discard all the rules that have a confidence score lower than some pre-defined target
- Remember, confidence is the percentage of baskets containing the left-hand side of the rule (X) that also contain the right-hand side (Y)
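Step 2 and the final confidence filter can be sketched by splitting each acceptable itemset into every pair of non-empty, disjoint parts X and Y and keeping the rules whose confidence meets the target. The function names, the threshold, and the torch and batteries baskets are assumptions for illustration. The sketch also assumes the itemset is already supported, so at least one basket contains X.

```python
from itertools import combinations

def confidence(x, y, baskets):
    """Fraction of the baskets containing X that also contain Y."""
    with_x = [b for b in baskets if x <= b]
    return sum(y <= b for b in with_x) / len(with_x)

def rules_from(itemset, baskets, min_confidence):
    """All rules X => Y with X and Y disjoint, non-empty, and X | Y == itemset."""
    rules = []
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            x = frozenset(lhs)
            y = itemset - x                   # Y shares nothing with X
            c = confidence(x, y, baskets)
            if c >= min_confidence:           # discard low-confidence rules
                rules.append((x, y, c))
    return rules

# Made-up baskets: most torch buyers also buy batteries, but not the reverse.
baskets = (
    [frozenset({"Torch", "Batteries"})] * 4
    + [frozenset({"Torch"})]
    + [frozenset({"Batteries"})] * 16
)
for x, y, c in rules_from(frozenset({"Torch", "Batteries"}), baskets, 0.5):
    print(sorted(x), "=>", sorted(y), c)      # ['Torch'] => ['Batteries'] 0.8
```

Only {Torch} => {Batteries} survives. The reverse rule has a confidence of 0.2 in this made-up data and is discarded.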
