Sublinear Algorithms for Big Data. Part 4: Random Topics

Transcription

1 Sublinear Algorithms for Big Data Part : Random Topics Qin Zhang 1-1

2 Topic 3: Random sampling in distributed data streams (based on a paper with ormode, Muthukrishnan and Yi, PODS 10, JAM 12) 2-1

3 Distributed streaming Motivated by database/networking applications Adaptive filters [Olston, Jiang, Widom, SIGMOD 03] A generic geometric approach [Scharfman et al. SIGMOD 06] Prediction models [ormode, Garofalakis, Muthukrishnan, Rastogi, SIGMOD 05] sensor networks network monitoring 3-1 environment monitoring cloud computing

4 Reservoir sampling [Waterman??; Vitter 85] Maintain a (uniform) sample (w/o replacement) of size s from a stream of n items Every subset of size s has equal probability to be the sample -1

5 Reservoir sampling [Waterman??; Vitter 85] Maintain a (uniform) sample (w/o replacement) of size s from a stream of n items Every subset of size s has equal probability to be the sample When the i-th item arrives With probability s/i, use it to replace an item in the current sample chosen uniformly at ranfom With probability 1 s/i, throw it away -2

6 Reservoir sampling from distributed streams When k = 1, reservoir sampling has cost Θ(s log n) When k 2, reservoir sampling has cost O(n) because it s costly to track i S k S 3 S S 1 time

7 Reservoir sampling from distributed streams When k = 1, reservoir sampling has cost Θ(s log n) When k 2, reservoir sampling has cost O(n) because it s costly to track i S k Tracking i approximately? Sampling won t be uniform S 3 S S 1 time

8 Reservoir sampling from distributed streams When k = 1, reservoir sampling has cost Θ(s log n) When k 2, reservoir sampling has cost O(n) because it s costly to track i S k S 3 S 2 Tracking i approximately? Sampling won t be uniform Key observation: We don t have to know the size of the population in order to sample! 5-3 S 1 time

9 Basic idea: binary Bernoulli sampling 6-1

10 Basic idea: binary Bernoulli sampling

11 Basic idea: binary Bernoulli sampling onditioned upon a row having s active items, we can draw a sample from the active items 6-3

12 Basic idea: binary Bernoulli sampling onditioned upon a row having s active items, we can draw a sample from the active items The could maintain a Bernoulli sample of size between s and O(s) 6-

13 Random sampling Algorithm [with ormode, Muthu & Yi, PODS 10 JAM 11] Initialize i = 0 sites S 1 S 2 S 3 Sites send in every item w.pr. 2 i In epoch i: S k 17-1

14 Random sampling Algorithm [with ormode, Muthu & Yi, PODS 10 JAM 11] Initialize i = 0 In epoch i: upper lower Sites send in every item w.pr. 2 i oordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. sites S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) 17-2

15 Random sampling Algorithm [with ormode, Muthu & Yi, PODS 10 JAM 11] Initialize i = 0 In epoch i: upper lower Sites send in every item w.pr. 2 i oordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. sites S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) When the lower sample reaches size s, the broadcasts to k sites advance to epoch i i+1 Discards the upper sample Randomlysplitsthelowersampleintoanewlowerandanuppersample 17-3

16 Random sampling Algorithm [with ormode, Muthu & Yi, PODS 10 JAM 11] Initialize i = 0 In epoch i: upper lower Sites send in every item w.pr. 2 i oordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. sites S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) When the lower sample reaches size s, the broadcasts to k sites advance to epoch i i+1 Discards the upper sample Randomlysplitsthelowersampleintoanewlowerandanuppersample orrectness: (1): In epoch i, each item is maintained in w. pr. 2 i 17-

17 Random sampling Algorithm [with ormode, Muthu & Yi, PODS 10 JAM 11] Initialize i = 0 In epoch i: upper lower Sites send in every item w.pr. 2 i oordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. sites S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) When the lower sample reaches size s, the broadcasts to k sites advance to epoch i i+1 Discards the upper sample Randomlysplitsthelowersampleintoanewlowerandanuppersample orrectness: (1): In epoch i, each item is maintained in w. pr. 2 i (2): Always s items are maintained in 17-5

18 A running example upper lower Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S 18-1

19 A running example upper lower Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S

20 A running example upper lower 1 Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S

21 A running example upper lower 1 Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S

22 A running example upper lower 1 2 Maintain s = 3 samples Epoch 0 (p = 1) sites S 1 S 2 S 3 S

26 A running example upper lower Maintain s = 3 samples Epoch 0 (p = 1) Now lower sample = 3 discard upper sample split lower sample advance to Epoch 1 sites S 1 S 2 S 3 S

27 A running example upper lower 1 5 Maintain s = 3 samples Epoch 0 (p = 1) Now lower sample = 3 discard upper sample split lower sample advance to Epoch 1 sites S 1 S 2 S 3 S

28 A running example (cont.) upper lower 1 5 Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S

29 A running example (cont.) upper lower 1 5 Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard) 19-2

30 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard)

31 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard)

32 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard) (discard) 19-5

33 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard) (discard)

34 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) sites S 1 S 2 S 3 S (discard) (discard)

35 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) Again lower sample = 3 discard upper sample split lower sample advance to Epoch 2 sites S 1 S 2 S 3 S (discard) (discard)

36 A running example (cont.) upper lower Maintain s = 3 samples Epoch 1 (p = 1/2) Again lower sample = 3 discard upper sample split lower sample advance to Epoch 2 sites S 1 S 2 S 3 S (discard) (discard)

37 A running example (cont.) upper lower Maintain s = 3 samples Epoch 2 (p = 1/) More items will be discarded locally sites S 1 S 2 S 3 S (discard) (discard)

38 A running example (cont.) upper lower Maintain s = 3 samples Epoch 2 (p = 1/) More items will be discarded locally Intuition: maintain a sample prob. at each site p s/n (n: total # items) without knowing n. sites S 1 S 2 S 3 S (discard) (discard)

39 Random sampling Analysis Initialize i = 0 In epoch i: upper lower Sites send in every item w.pr. 2 i oordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. sites S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) When the lower sample reaches size s, the broadcasts to k sites advance to epoch i i+1 Discards the upper sample Splitsthelowersampleintoanewlowersampleandanuppersample Analysis: (Msgs sent) Messages sent per epoch O(k +s) # epochs O(logn) = O((k +s)logn) 21-1

40 Random sampling Analysis and experiments an be 1. improved to Θ(klog k/s n+slogn) and 2. extended to sliding window cases. 22-1

41 Random sampling Analysis and experiments an be 1. improved to Θ(klog k/s n+slogn) and 2. extended to sliding window cases. Experiments on the real data set from 1998 world cup logs. log 2 (cost) theory 15 log 2 (cost) cost 10 3 practice log 2 s Basic case cost VS sample size n = ,k = log 2 s Time-based sliding cost VS sample size n = ,k = ( log2 Time-based sliding cost VS window size s = 128,k = 128 ) w

42 Random sampling Analysis and experiments an be 1. improved to Θ(klog k/s n+slogn) and 2. extended to sliding window cases. log 2 (cost) theory 15 practice Experiments on the real data set from 1998 world cup logs log 2 s Basic case cost VS sample size n = ,k = 128 total # items n = 7,000,000 # items sent,000 size of sample s = 128 # sites k =

43 Sampling from a (time-based) sliding window sliding window expired windows frozen window current window t 10-1

44 Sampling from a (time-based) sliding window sliding window t expired windows frozen window current window Sample for sliding window = need new ideas (1) a subsample of the (unexpired) sample of frozen window + (2) a subsample of the sample of current window by ISWoR 10-2

45 Sampling from a (time-based) sliding window sliding window expired windows frozen window current window Sample for sliding window = need new ideas (1) a subsample of the (unexpired) sample of frozen window + (2) a subsample of the sample of current window by ISWoR (1), (2) may be sampled by different rates. But as long as both have sizes min{s,# live items}, fine. t 10-3

46 Sampling from a (time-based) sliding window sliding window expired windows frozen window current window Sample for sliding window = need new ideas (1) a subsample of the (unexpired) sample of frozen window + (2) a subsample of the sample of current window by ISWoR (1), (2) may be sampled by different rates. But as long as both have sizes min{s,# live items}, fine. The key issue: how to guarantee both have sizes s? as items in the frozen window are expiring... t 10-

47 Sampling from a (time-based) sliding window sliding window expired windows frozen window current window Sample for sliding window = need new ideas (1) a subsample of the (unexpired) sample of frozen window + (2) a subsample of the sample of current window by ISWoR (1), (2) may be sampled by different rates. But as long as both have sizes min{s,# live items}, fine. The key issue: how to guarantee both have sizes s? as items in the frozen window are expiring... Solution: In the frozen window, find a good sample rate such that the sample size s. t 10-5

48 Dealing with the frozen window sliding window expired windows frozen window current window t Keep all the levels? Need O(w) communication 11-1

49 Dealing with the frozen window sliding window expired windows frozen window current window t Keep all the levels? Need O(w) communication (s = 2) Keep most recent sampled items in a level until s of them are also sampled at the next level. Total size: O(slogw) 11-2

50 Dealing with the frozen window sliding window expired windows frozen window current window t Keep all the levels? Need O(w) communication (s = 2) Keep most recent sampled items in a level until s of them are also sampled at the next level. Total size: O(slogw) Guaranteed: There is a blue window with s sampled items that covers the unexpired portion of the frozen window 11-3

51 Dealing with the frozen window: The algorithm (s = 2) Each site builds its own level-sampling structure for the current window until it freezes Needs O(slogw) space and O(1) time per item 12-1

52 Dealing with the frozen window: The algorithm (s = 2) Each site builds its own level-sampling structure for the current window until it freezes Needs O(slogw) space and O(1) time per item When the current window freezes For each level, do a k-way merge to build the level of the global structure at the. Total communication O((k + s) log w) 12-2