Statistical Learning Theory Meets Big Data

Transcription

1 Statistical Learning Theory Meets Big Data Randomized algorithms for frequent itemsets Eli Upfal Brown University

2 Data, data, data In God we trust, all others (must) bring data Prof. W.E. Deming, Statistician, (attributed) Every two days now we create as much information as we did from the dawn of civilization up until 2003 Eric Schmidt, Google Exec. Chairman, 2009 Data explosion is bigger than Moore's Law Marissa Mayer, Yahoo! CEO, 2009

3 Big Data Every two days now we create as much information as we did from the dawn of civilization up until 2003 Eric Schmidt, Google Exec. Chairman, 2009 Data explosion is bigger than Moore's Law Marissa Mayer, Yahoo! CEO, 2009

4 Big Data Big data requires big storage and long execution time It's great to have big data but do we need to process all of it? Trade off accuracy of results for execution speed Can approximation algorithms for data analysis be fast and still guarantee high-quality results?

5 Frequent Itemsets Market Basket Analysis You own a grocery store Have copy of receipts from sales For each customer, you have the list of products she bought Interested in what groups of products are bought together correlated items Business decisions Optimization

6 Settings Transactional dataset D Figure from Tan et al. - Introduction to Data Mining

7 Transactions Transactional dataset D transaction Figure from Tan et al. - Introduction to Data Mining

8 Items Transactions are built on items from a domain items Figure from Tan et al. - Introduction to Data Mining

9 Itemsets Sets of items are called itemsets Itemsets Figure from Tan et al. - Introduction to Data Mining

10 Frequency Frequency of an itemset X in dataset D: : fraction of transactions of D containing X Example: Milk: 4/5, {Bread, Milk}: 3/5 Figure from Tan et al. - Introduction to Data Mining

11 Frequent Itemsets Frequent Itemsets mining with respect to Find all itemsets X with frequency

12 Frequent Itemsets Frequent Itemsets mining with respect to Find all itemsets X with frequency Top -K Frequent Itemsets Find all itemsets at least as frequent as the kth most frequent Association Rules X in a transaction is correlated with Y in the transaction

13 Frequent Itemsets mining Classical problem in data analysis There are algorithms to extract exact collection of FI's APriori [AS94], FPGrowth [HPY00], Eclat [Zaki00]

14 Frequent Itemsets mining Well studied classical problem in data analysis There are algorithms to extract exact collection of FI's APriori [AS94], FPGrowth [HPY00], Eclat [Zaki00] Running time depends on number of transactions in the dataset and on the number of frequent itemsets 10^8 transactions considered normal size 10^4 items 2^(10^4) itemsets total A transaction with d items contains 2^d itemsets

15 How to speed up mining? Decrease dependency on number of transactions:

16 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower

17 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable

18 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable Given, extract collection W of FI's such that: Frequency 0 1

19 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable Given, extract collection W of FI's such that: Frequency 0 1 Must be in W

20 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable Given, extract collection W of FI's such that: Frequency 0 1 Must not be in W Must be in W

21 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable Given, extract collection W of FI's such that: Frequency 0 1 Must not be in W May be in W Must be in W

22 Observa(on Data mining is exploratory in nature Exact results are usually not needed We want a good, realistic idea of the process Fast analysis is more important... if it still guarantees good-enough results! 22 Riondato et al., PARMA, CIKM'12

23 How to speed up mining? Decrease dependency on number of transactions: 1.create random sample of dataset 2.extract frequent itemsets from sample with lower Sampling force us to accept approximate results Need to quantify what is acceptable Given, extract collection W of itemset from sample, such that: Frequency 0 1 Must not be in W May be in W Must be in W Problem: choose right sample size and right

24 Choosing If we have, for all itemsets X simultaneously then does the trick

25 Choosing If we have, for all itemsets X simultaneously then does the trick Frequency 0 1 Must not be in W May be in W Must be in W

26 Choosing If we have, for all itemsets X simultaneously then does the trick Frequency 0 1 Must not be in W May be in W Must be in W Need sample size S s.t., with prob. at least, for all itemsets X, we have

27 Choosing the sample size Naïve approach Given itemset X, frequency of X in sample S is distributed like a binomial: Use Chernoff bound to bound Apply Union bound over all itemsets to find S such that

28 Choosing the sample size Naïve approach Given itemset X, frequency of X in sample S is distributed like a binomial: Use Chernoff bound to bound Apply Union bound over all itemsets to find S such that Problem: There's an exponential number of itemsets Sample size would too large (linear the in number of elements)

29 Vapnik-Chervonenkis Dimension Combinatorial property of a collection of subsets from a domain Measures the richness or expressivity of the collection of subsets If we know the VC-dim of a collection of subsets, we can compute the minimum sample size needed to approximate the sizes of all the subsets using the sample

30 Range spaces VC-Dimension is defined on range spaces (B,R): range space B: domain R: collection of subsets from B (ranges) No restrictions: B can be infinite R can be infinite R can contain infinitely-large subsets of B

31 Vapnik-Chervonenkis Dimension Range space For any, define C is shattered if The VC-Dimension of (B,R) is the size of the largest shattered subset of B

32 Example of VC-Dimension y B = x

33 Example of VC-Dimension y B = R = all axis-aligned rectangles in x

34 Example of VC-Dimension y B = R = all axis-aligned rectangles in x

35 Example of VC-Dimension y B = R = all axis-aligned rectangles in x Shattering 4 points: Easy Take any 4 points s.t. no 3 of them are aligned

36 Example of VC-Dimension y B = R = all axis-aligned rectangles in x Shattering 4 points: Easy Take any 4 points s.t. no 3 of them are aligned y x

37 Example of VC-Dimension y B = R = all axis-aligned rectangles in x Shattering 4 points: Easy Take any 4 points s.t. no 3 of them are aligned y Need 16 rectangles x

38 Example of VC-Dimension y B = R = all axis-aligned rectangles in x Shattering 5 points?

39 Example of VC-Dimension y B = R = all axis-aligned rectangles in x Shattering 5 points: impossible

40 Example of VC-Dimension y B = R = all axis-aligned rectangles in x Shattering 5 points: impossible Take any 5 points One of them that is contained in all rectangles containing the other four Impossible to find a rectangle containing only the other four VC(B,R)=4

41 Approximating sizes of ranges Sample Theorem: Let (B,R) have VC(B,R) d. Given, let S be a collection of points from B sampled uniformly at random. If then, Can approximate sizes of all simultaneously No need of union bound

42 VC-Dimension for Frequent Itemsets B = dataset D (set of transactions)

43 VC-Dimension for Frequent Itemsets B = dataset D (set of transactions) Itemset X, = transactions of D containing X

47 VC-Dimension for Frequent Itemsets B = dataset D (set of transactions) Itemset X, = transactions of D containing X To approximate using the sample theorem, need upper bound to VC(B,R)

48 Upper Bound to VC-Dim for FI's B = dataset D (set of transactions) Itemset X, = transactions of D containing X Theorem: [The d-index of a dataset] Let d be the maximum integer such that D contains at least d transactions of length at least d. Then VC(B,R) d

49 The d- index of the dataset d-index d: the maximum integer such that D contains at least d transactions of length at least d Example: this dataset has d-index d=3 bread beer milk coffee chips coke pasta bread coke chips milk coffee pasta milk Independent of D Can be computed with a single scan of the dataset d is an upper bound to the VC-dimension 49

50 Choosing the right sample size Theorem Let Let D be a dataset and d be the max integer such that D contains at least d transactions of length at least d Let S be a collection of transactions of D sampled independently and uniformly at random, with Then

51 Algorithm To extract approximate collection of freq itemsets: 1.Compute d for the dataset D 2.Compute sample size (function of d) 3.Create random sample S from D 4.Extract Freq Itemsets from S using

52 Algorithm To extract approximate collection of Freq. Itemsets: 1.Compute d for the dataset D 2.Compute sample size given d 3.Create random sample S from D 4.Extract Frequent Itemsets from S using Theorem: With probability at least, the returned set W of itemsets satisfies the desired property Frequency 0 1 Must not be in W May be in W Must be in W

53 A closer look... Expression of sample size:

54 A closer look... Expression of sample size: d: the maximum integer such that D contains at least d transactions of length at least d

55 A closer look... Expression of sample size: d: the maximum integer such that D contains at least d transactions of length at least d does not depend on D does not depend on does not depend on number of itemsets

56 A closer look... Expression of sample size: d: the maximum integer such that D contains at least d transactions of length at least d does not depend on D does not depend on does not depend on number of itemsets The time to mine S does not depend on D!!!

57 Proof (Intuition) For a set of k transactions to be shattered, each transaction must appear in 2^(k-1) different 's where X is an itemset A transaction only appears in the 's of the itemsets X it contains A transaction of length w contains 2^(w)-1 itemsets Need w k for the transaction to belong to a shattered set of size k To shatter k transactions they must all have length k Max k for which it happens is upper bound to VC-Dim

58 Experiments Sample always fits into main memory Output always satisfies required approx. guarantees Frequency accuracy even better than guaranteed Mining time significantly improved

59 Conclusions We showed how VC-dimension, a highly theoretical concept from statistical learning theory, can help in developing an efficient algorithm to approximate the collection of frequent itemsets, addressing one of the Big Data challenges through samp Riondato & Upfal: Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ECML/PKDD (1)

60 PARMA: A Randomized Algorithm for Approximate Associa;on Rules Mining in MapReduce Matteo Riondato, Justin A. DeBrabant, Rodrigo Fonseca, Eli Upfal

61 Goals Extract high quality approximation of FI(D,θ) by mining several small samples of D in parallel Adapt to available computational resources Minimize data replication and communication Achieve high parallelism 61 Riondato et al., PARMA, CIKM'12

62 Why parallelizing? For a very high quality approximation, the sample would still be large Sample creation time depends on size of dataset Parallelization allows us to achieve: higher accuracy (lower ε) higher confidence (lower δ) faster results Makes sense when dataset is stored on a distributed filesystem 62 Riondato et al., PARMA, CIKM'12

63 MapReduce framework for easy parallelization on large clusters of commodity machines Allows very fast analysis of large datasets Algorithms expressed through two functions: Map: (k,v) (k1,v1),(k2,v2),... Reduce: (r,(w1,w2,...)) (k1,v1),(k2,v2),... It is expensive to move data around (shuffling) Design goal: avoid data replication 63 Riondato et al., PARMA, CIKM'12

64 The op'miza'on framework Sample should fit into local memory No more samples than number of processors Exploit available computational resources to maximize quality of approximation Constraints and Objective function expressed in a Mixed Integer Non Linear Program Convex feasibility region and convex obj. function Easy and fast to solve with global MINLP solver 64 Riondato et al., PARMA, CIKM'12

65 The Algorithm: PARMA Very simple design 2 MapReduce Rounds 1 st round Map: create samples Reduce: mine each sample 2 nd round Map: identity function Reduce: aggregation/filtering of results 65 Riondato et al., PARMA, CIKM'12

66 1 st Round: Sampling and Mining Parallelization of the sequential algorithm Map: Sample Creation Random sampling with replacement... in parallel! Input: (transaction id, transaction) Output: (sample id, transaction) Reduce: Mining of each sample in parallel using lowered frequency threshold θ-ε/2 Can use any sequential exact mining algorithm (APriori, FPGrowth,...) Input: (sample id, (transaction1, transaction2,...)) Output: (itemset, (frequency, confidence interval)) 66 Riondato et al., PARMA, CIKM'12

67 2 nd Phase: Aggrega:on Map: Identity function Reduce: Filtering and Aggregation Input: (itemset,((freq1,conf_int1),(freq2,conf_int2),...) Output: (itemset, (frequency, confidence interval)) An itemset is sent in output only if it was frequent in a sufficient number of samples (determined by the optimization framework) Confidence interval: smallest interval containing a sufficient number of estimations freq1,freq2,... Frequency: median of confidence interval 67 Riondato et al., PARMA, CIKM'12

68 Analysis Theorem: With probability at least 1-δ, output is an ε-approximation to FI(D,θ) Proof intuition: Boosting-like approach Outputs of the reducers in stage 1 are ε-approximations for the local samples, with some probability 1-φ If an itemset appears in enough ε-approximations, then it is probably frequent in the dataset Consequences: better approximation, fewer false positives 68 Riondato et al., PARMA, CIKM'12

69 Implementa:on PARMA implemented as Java library for Hadoop Easy to run, possible integration with Apache Mahout FP-Growth used for the sample mining phase PARMA is agnostic to the choice of mining algorithm We used an existing Java implementation 1400 lines of code Available on github 69 Riondato et al., PARMA, CIKM'12

70 Experiments Run on Amazon Elastic MapReduce Up to 16 machines, with 17GB of RAM Artificial datasets using standard benchmark generators Various parameters (transaction size, items,...) Size between 5 to 50 million transactions Compared performances of PARMA with PFP and naive distributed counting PARMA performs significantly better (1.5x to 5x speed improvement) in all tests 70 Riondato et al., PARMA, CIKM'12

71 Run-me Analysis Runtime does not grow with size of data Because individual sample size does not grow! Much faster than PFP, faster as data grows! PARMA gets faster with more machines, even with more data. 71 Riondato et al., PARMA, CIKM'12

72 Speedup Quasi linear speedup Stage 1 very close to optimal Stage 2 no speedup Running time depends on number of itemsets No way to control the number of itemsets Common in frequent ideal itemsets mining algorithms 7 total Stage 1 Stage 2 speedup Riondato et al., PARMA, CIKM' number of instances 72

73 Accuracy Output always ε-approximation to FI(D,θ) Output not much larger than real collection Better frequency accuracy than guaranteed Smaller confidence intervals than guaranteed Riondato et al., PARMA, CIKM'12

74 Conclusions We showed how to efficiently use MapReduce to mine a high-quality approximation of the frequent itemsets and association rules from a large dataset with minimum data replication and quasi-linear speedup 74 Riondato et al., PARMA, CIKM'12

75 Publications Riondato, Upfal. Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ECML PKDD 2012 Riondato, DeBrabant, Fonseca, Upfal. PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce. CIKM 2012