Clustering Big Data. Efficient Data Mining Technologies. J Singh and Teresa Brooks. June 4, 2015


1 Clustering Big Data Efficient Data Mining Technologies J Singh and Teresa Brooks June 4, 2015

2 Hello Bulgaria. A website with thousands of pages: some pages are identical to other pages; some are nearly identical (same text, different pictures). We want smart indexing of the collection: save just one copy of duplicate pages, save one copy of nearly duplicate pages, and filter out similar documents when returning search results. We also want to keep the index up to date.


3 The Naïve Way to Address this Challenge. Represent each document as a point in d-dimensional space. Run a k-means algorithm on the document set, resulting in k clusters. When presented with a new document, find the nearest cluster, then find the documents within that cluster that are nearest to the document in question. The second step can be skipped if the cluster is small enough, i.e., if k is large enough that everything in the cluster is close.
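The naïve approach can be sketched roughly as follows. This is a minimal illustration, not the presenters' code: documents are assumed to be already converted to numeric tuples, and the names `kmeans` and `similar_docs` are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_index(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; the result depends on the random initial centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_index(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster goes empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    # Final assignment against the final centroids.
    clusters = [[] for _ in range(k)]
    for p in points:
        clusters[nearest_index(p, centroids)].append(p)
    return centroids, clusters

def similar_docs(query, centroids, clusters, top=2):
    """Find the nearest cluster, then the nearest documents inside it."""
    members = clusters[nearest_index(query, centroids)]
    return sorted(members, key=lambda p: sq_dist(query, p))[:top]

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
closest = similar_docs((9, 9), centroids, clusters, top=1)
```

Changing `k` means calling `kmeans` again on the whole collection, which is one of the costs this approach carries.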

4 The Naïve Way has conceptual problems. There is no good way to decide the optimal k, and all documents have to be re-clustered if we want to change k. It forces a document to be in a single cluster; in practice, a document can be similar to multiple clusters. All clusters are roughly the same size; in practice, the terrain is lumpy: some documents are one-of-a-kind and others are similar to many others.

5 The Naïve Way has technical problems. The end result depends on the initial choice of centroids, so results are not repeatable. Performance is O(nk) or worse, which is especially unfortunate because we want k to be large. The algorithm is not easily adapted to map/reduce: we need a pipeline of map/reduce jobs to compute it.

6 Any Alternatives? Clustering has been picked over quite well due to its combination of interesting math and wide applicability. Two dominant types have emerged: hierarchical clustering and partitional clustering (e.g., k-means). k-means variations are based on the choice of initial centroids, the choice of k, and the parameters at each iteration.


7 Another line of inquiry: Nearest Neighbor. Methods based on partitioning the search space: quad trees, kd-trees. Locality-Sensitive Hashing: hash functions are locality-sensitive if, for a random hash function h and any pair of points p, q: Pr[h(p)=h(q)] is high if p is close to q, and Pr[h(p)=h(q)] is low if p is far from q.

8 More on Nearest Neighbor. Locality-Sensitive Hashing (Indyk-Motwani): hash functions are locality-sensitive if, for a random hash function h and any pair of points p, q, we have: Pr[h(p)=h(q)] is high if p is close to q, and Pr[h(p)=h(q)] is low if p is far from q.

9 The LSH Idea. Treat items as vectors in d-dimensional space. Draw k random hyperplanes in that space. For each hyperplane, ask: is each vector on the (0) side of the hyperplane or the (1) side? This hashes each item into a number, e.g., Hash(Item 1) = 000 and Hash(Item 3) = 101. The magic is in choosing h1, h2, h3.
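A minimal sketch of the hyperplane idea, assuming hyperplanes through the origin drawn from Gaussian normal vectors (a common choice, not confirmed by the slides). The function names `make_hyperplanes` and `lsh_hash` are hypothetical.

```python
import random

def make_hyperplanes(k, d, seed=0):
    """Draw k random hyperplanes through the origin, each given by its normal vector."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]

def lsh_hash(vector, hyperplanes):
    """Return a k-bit string; bit i records which side of hyperplane i the vector is on."""
    bits = []
    for normal in hyperplanes:
        dot = sum(n * x for n, x in zip(normal, vector))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)

planes = make_hyperplanes(k=3, d=4)   # k=3 gives 3-bit codes like 000 or 101
item = [0.9, 0.1, 0.0, 0.4]
code = lsh_hash(item, planes)
```

Vectors separated by a small angle fall on the same side of most random hyperplanes, so they tend to share a code; vectors pointing in opposite directions get complementary codes.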

10 The LSH Hash Code Idea. Breaks d-dimensional space into proximity polyhedra. In the figure, each purple block represents a document and each bucket represents a group of alike docs. Docs within each bucket still need to be compared to see which ones are the closest.

11 A Brief History of LSH. Origins at Stanford (1998). Continuing research in universities, including Stanford, MIT, Rutgers, and Cornell. Continuing research in industry, including Intel, Microsoft, and Google. Textbook: A. Rajaraman and J. Ullman (2010). Our contribution: an extensible implementation for large datasets.

12 Choosing hash functions: Introducing minhash
1. Sample each document to get its shingles (small fragments). Examples: "Mary had a" gives "mary", "ary ", "ry h", "y ha", " had", ...; "CTAGTATAAA" gives "CTAGTATA", "TAGTATAA", "AGTATAAA"; "now is the time" gives "now is", "is the", "the time".
2. Calculate the hash value for every shingle.
3. Store the minimum hash value found in step 2.
4. Repeat steps 2 and 3 with different hash algorithms 199 more times to get a total of 200 minhash values.
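The minhash steps can be sketched as below. This is an illustration under stated assumptions: the "different hash algorithms" are simulated by salting SHA-1 with a seed, which is one convenient choice and not necessarily what the presenters used, and the function names are hypothetical.

```python
import hashlib

def char_shingles(text, size):
    """Step 1: all overlapping character fragments of the given size."""
    return {text[i:i + size] for i in range(len(text) - size + 1)}

def shingle_hash(shingle, seed):
    """Step 2: one member of a family of hash functions, selected by seed (assumption)."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def minhash_signature(shingles, num_hashes=200):
    """Steps 3 and 4: keep the minimum hash per function, for num_hashes functions.

    The shingle set must be non-empty.
    """
    return [min(shingle_hash(s, seed) for s in shingles) for seed in range(num_hashes)]

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching minhash positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

dna = char_shingles("CTAGTATAAA", 8)
signature = minhash_signature(dna)
```

For two documents, the fraction of positions where their signatures agree approximates the Jaccard similarity of their shingle sets, with accuracy improving as the number of hash functions grows.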

13 Interesting thing about minhashes. The resulting minhashes are 200 integer values representing a random selection of shingles. Property of minhashes: if the minhashes for two docs are the same, their shingles are likely to be the same; if the shingles for two docs are the same, the docs themselves are likely to be the same. Beware: minhash is specific to a particular similarity measure, Jaccard similarity. Other hash families exist for other similarity measures.

14 All 200 minhashes must match? If all minhashes match, it implies a strong similarity between docs. To catch most cases with weaker similarity, don't compare all minhashes at once; compare them in bands. Candidate pairs are those that hash to the same bucket for at least one band. Sometimes one band will reject a pair and another band will consider it a candidate.
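A small sketch of banding, assuming signatures are already computed and that the band's tuple of values itself serves as the bucket key (a real system would typically hash it to a bucket number). The name `candidate_pairs` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, bands):
    """Return pairs of doc ids whose signatures agree on every row of at least one band."""
    sig_len = len(next(iter(signatures.values())))
    rows = sig_len // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            # The band index is part of the key so bands never share buckets.
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for docs in buckets.values():
        pairs.update(combinations(sorted(docs), 2))
    return pairs

signatures = {
    "a": [1, 2, 3, 4],
    "b": [1, 2, 9, 9],   # agrees with "a" on the first band only
    "c": [7, 7, 7, 7],
}
with_bands = candidate_pairs(signatures, bands=2)
all_at_once = candidate_pairs(signatures, bands=1)
```

With two bands, "a" and "b" become a candidate pair through the first band even though the second band rejects them; with a single band covering the whole signature, no pair qualifies.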

15 LSH Involves a Tradeoff. Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives and false negatives. False positives: we need to examine more pairs that are not really similar, which costs more processing resources and more time. False negatives: we fail to examine pairs that were similar and don't find all similar results, but we get done faster!

16 Summary. Mine the data and place members into hash buckets. When you need to find a match, hash it, and possible nearest neighbors will be in one of b buckets. Algorithm performance is O(n).

17 Going Beyond k-means Demo J Singh and Teresa Brooks March 17, 2015

18 Peerbelt Results Example

19 Database Architecture Requirements. We need a very large range of bucket numbers. Bucket numbers in our implementation are to. Most buckets are empty, and empty buckets must not take any space in the database. Some buckets have a lot of documents in them, and we need to be able to locate all of them. To find documents similar to a given document, bucketize the document, then find other documents in the same buckets.
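These requirements can be illustrated with an in-memory stand-in for the database. This is only a sketch: the class `SparseBucketStore` and its methods are hypothetical and do not reflect the OpenLSH schema.

```python
class SparseBucketStore:
    """Stores only non-empty buckets, so a huge bucket-number range costs nothing."""

    def __init__(self):
        self._buckets = {}       # bucket number -> set of doc ids
        self._doc_buckets = {}   # doc id -> set of bucket numbers

    def add(self, doc_id, bucket_numbers):
        """Record a bucketized document under each of its bucket numbers."""
        self._doc_buckets[doc_id] = set(bucket_numbers)
        for number in bucket_numbers:
            self._buckets.setdefault(number, set()).add(doc_id)

    def docs_in(self, bucket_number):
        """All documents in a bucket; looking up an empty bucket does not create it."""
        return set(self._buckets.get(bucket_number, ()))

    def similar_to(self, doc_id):
        """Other documents that share at least one bucket with doc_id."""
        found = set()
        for number in self._doc_buckets.get(doc_id, ()):
            found |= self._buckets[number]
        found.discard(doc_id)
        return found

    def stored_bucket_count(self):
        return len(self._buckets)

store = SparseBucketStore()
store.add("d1", [2 ** 60 + 1, 5])
store.add("d2", [5, 99])
store.add("d3", [12])
```

A key-value or wide-column database offers the same property: rows exist only for buckets that contain at least one document.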

20 Implementation: OpenLSH. We started OpenLSH to provide a framework for LSH. Factor out the database: started on Google App Engine; virtualized interface to make it work on Cassandra. Factor out the calculation engine: started on Google App Engine; can plug in Google MapReduce; ported to run in batch mode on Cassandra.

21 Using OpenLSH. We're looking for one or two interesting use cases. Application areas: near de-duplication (covered with Peerbelt's data), stocks that move independently of the herd, filtering unique stories from the news. Contact us to discuss.

22 What you can do. For more information: links to code and data set are included. Run on App Engine; minimum setup required. Adapt it to your environment and needs. If you need help, send or create a GitHub issue. Send us a pull request for any improvements you make.

23 Thank you J Singh Principal, DataThinks Algorithms for j. datathinks. org Adj. Prof, Computer Science, WPI Teresa Brooks Senior Software Xero

24 Going Beyond k-means Appendix Slides J Singh and Teresa Brooks June 4, 2015

25 Running LSH on a cluster of machines. Can be implemented on a Map Reduce architecture.
Map step (emits each document into its buckets):
    def map(String docname, String doc):
        # [ skipped ]
        for bkt in buckets:
            emit(bkt, docname)
Reduce step:
    def reduce(String bkt, Iterator docnames):
        # [ skipped ]
        for dn in docnames:
            emit(bkt, dn)
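The map and reduce steps can be simulated in a single process, as in the sketch below. The slide skips how buckets are computed, so `buckets_for` is a hypothetical stand-in (a tiny 4-value minhash over word pairs, split into 2 bands), and `run_job` imitates the shuffle phase that a real map/reduce framework would provide.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 4      # assumption: tiny signature to keep the example small
ROWS_PER_BAND = 2   # assumption: 2 bands of 2 rows each

def stable_hash(text, seed):
    data = f"{seed}:{text}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def buckets_for(doc):
    """Hypothetical bucketizer: minhash over word pairs, one bucket key per band."""
    words = doc.lower().split()
    shingles = {" ".join(words[i:i + 2]) for i in range(len(words) - 1)} or {doc}
    signature = [min(stable_hash(s, seed) for s in shingles) for seed in range(NUM_HASHES)]
    return [
        f"{start}-" + "-".join(str(v) for v in signature[start:start + ROWS_PER_BAND])
        for start in range(0, NUM_HASHES, ROWS_PER_BAND)
    ]

def map_step(docname, doc):
    for bkt in buckets_for(doc):
        yield bkt, docname

def reduce_step(bkt, docnames):
    for dn in docnames:
        yield bkt, dn

def run_job(docs):
    """Run map, group the emitted pairs by bucket (the shuffle), then run reduce."""
    shuffled = defaultdict(list)
    for docname, doc in docs.items():
        for bkt, dn in map_step(docname, doc):
            shuffled[bkt].append(dn)
    output = defaultdict(list)
    for bkt, docnames in shuffled.items():
        for out_bkt, dn in reduce_step(bkt, docnames):
            output[out_bkt].append(dn)
    return dict(output)

docs = {
    "a": "now is the time",
    "b": "Now is the time",
    "c": "completely different words here",
}
result = run_job(docs)
```

After the job, each bucket key maps to the names of the documents that landed in it, so documents "a" and "b" appear together while "c" sits alone.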

26 Extending OpenLSH (p1): Distance Measures. The minhash family of functions using Jaccard distance is just one of several families of functions that can be used with the LSH technique. Jaccard similarity is a measure of how close sets are. The real distance (closeness) measure for sets is Jaccard distance, which is 1 minus the Jaccard similarity. Other distance measures: Euclidean distance (used in spaces with dimensions), cosine distance (used in spaces with dimensions), edit distance (used when two points are strings), Hamming distance (e.g., cat, kat, kit).
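Three of these measures are simple enough to sketch directly. The function names are illustrative; cosine and edit distance are left out because their exact definitions vary between sources.

```python
import math

def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    """1 minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of dimensions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def hamming_distance(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(1 for x, y in zip(s, t) if x != y)

cat_kat = hamming_distance("cat", "kat")
cat_kit = hamming_distance("cat", "kit")
```

Here "cat" and "kat" differ in one position and "cat" and "kit" in two, so each substitution moves a string one Hamming step further away.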

27 Extending OpenLSH (p2): Parallelize it. We suggested a potential map/reduce algorithm. Another paper: "Streaming Similarity Search over one Billion Tweets using Parallel Locality-Sensitive Hashing", Sundaram et al., 2014. App Engine provides the map reduce infrastructure to serve as a foundation.

28 LSH Tradeoff Example. If we had fewer than 20 bands (and more rows per band), fewer pairs would be selected for comparison and the number of false positives would go down, but the number of false negatives would go up. Performance would go up, but so would the error rate!
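The tradeoff can be made concrete with the standard banding formula: a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^r)^b for b bands of r rows. The sketch below assumes the 200 minhashes are split as 20 bands of 10 rows, which the slides imply but do not state, and compares that with 10 bands of 20 rows.

```python
def candidate_probability(s, bands, rows):
    """Probability that a pair with Jaccard similarity s matches in at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands

# Assumed split of 200 minhashes: 20 bands x 10 rows versus 10 bands x 20 rows.
for similarity in (0.5, 0.8, 0.95):
    wide = candidate_probability(similarity, bands=20, rows=10)
    narrow = candidate_probability(similarity, bands=10, rows=20)
    print(f"s={similarity}: 20x10 -> {wide:.4f}, 10x20 -> {narrow:.4f}")
```

At every similarity level strictly between 0 and 1 the 10-band setting selects fewer pairs, which lowers false positives and comparison cost but lets more genuinely similar pairs slip through.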

29 A Brief History of LSH. Origins at Stanford: Indyk, P.; Motwani, R. (1998). Gionis, A.; Indyk, P.; Motwani, R. (1999). Continuing work at MIT (Indyk). Parallel LSH. Textbook: A. Rajaraman and J. Ullman (2010). Our contribution: an extensible implementation for large datasets.
