Sublinear Algorithms for Big Data, Part : Random Topics. Qin Zhang
Topic 3: Random sampling in distributed data streams (based on a paper with Cormode, Muthukrishnan and Yi, PODS '10 / JACM '12)
Distributed streaming. Motivated by database/networking applications:
Adaptive filters [Olston, Jiang, Widom, SIGMOD '03]
A generic geometric approach [Sharfman et al., SIGMOD '06]
Prediction models [Cormode, Garofalakis, Muthukrishnan, Rastogi, SIGMOD '05]
Application areas: sensor networks, network monitoring, environment monitoring, cloud computing.
Reservoir sampling [Waterman ??; Vitter '85]. Maintain a (uniform) sample (w/o replacement) of size s from a stream of n items; every subset of size s has equal probability of being the sample.
When the i-th item arrives:
With probability s/i, use it to replace an item in the current sample chosen uniformly at random.
With probability 1 - s/i, throw it away.
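The update rule above fits in a few lines. A minimal sketch (the function name and the explicit `rng` parameter are my own choices, not from the slides):

```python
import random

def reservoir_sample(stream, s, rng=random.Random(0)):
    """Uniform sample without replacement of size s from a stream of
    unknown length, in a single pass."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            sample.append(item)              # the first s items fill the reservoir
        elif rng.random() < s / i:           # keep the i-th item w.p. s/i ...
            sample[rng.randrange(s)] = item  # ... evicting a uniform victim
    return sample
```

An inductive argument shows each of the n items ends up in the sample with probability exactly s/n, and indeed every size-s subset is equally likely.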
Reservoir sampling from distributed streams. When k = 1, reservoir sampling has cost Θ(s log n). When k ≥ 2, reservoir sampling has cost O(n), because it's costly to track i (the current total stream length). Tracking i approximately? Then the sampling won't be uniform.
Key observation: we don't have to know the size of the population in order to sample!
[Figure: k sites S_1, ..., S_k, each observing a local stream over time.]
Basic idea: binary Bernoulli sampling.
[Figure: a binary table of coin flips, one row per level; row i contains each item independently w.p. 2^{-i}.]
Conditioned upon a row having ≥ s active items, we can draw a sample from the active items. The coordinator could maintain a Bernoulli sample of size between s and O(s).
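One standard way to realize the rows (a sketch under my own naming; the slides only show the binary table): give each item a geometric "level" by flipping fair coins, and put it in every row up to its level, so that row i holds each item independently with probability 2^{-i}.

```python
import random

def item_level(rng):
    """Flip fair coins; the level is the number of heads before the first
    tail, so Pr[level >= i] = 2^{-i}."""
    lvl = 0
    while rng.random() < 0.5:
        lvl += 1
    return lvl

def bernoulli_rows(stream, num_rows, rng=random.Random(0)):
    """Row i holds each item independently w.p. 2^{-i}; row 0 holds all."""
    rows = [[] for _ in range(num_rows)]
    for x in stream:
        lvl = item_level(rng)
        for i in range(min(lvl, num_rows - 1) + 1):
            rows[i].append(x)
    return rows
```

Because the coins are independent of the data, the active items in any fixed row are a uniform Bernoulli sample of the stream.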
Random sampling: the algorithm [with Cormode, Muthu & Yi, PODS '10 / JACM '12]
Setting: sites S_1, ..., S_k send items to a coordinator, which keeps an upper sample and a lower sample.
Initialize i = 0. In epoch i:
Sites send in every item w.p. 2^{-i}.
The coordinator maintains a lower sample and an upper sample: each received item goes to either with equal probability. (Each item is thus included in the lower sample w.p. 2^{-(i+1)}.)
When the lower sample reaches size s, the coordinator broadcasts to the k sites to advance to epoch i+1:
it discards the upper sample, and
randomly splits the lower sample into a new lower and a new upper sample.
Correctness: (1) in epoch i, each item is maintained (in the lower or upper sample) w.p. 2^{-i}; (2) at least s items are always maintained.
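The protocol above can be simulated in one process to check its behavior (a sketch: the sites are collapsed into the sending coin-flip, and the function name and return values are my own):

```python
import random

def distributed_sample(stream, s, rng=random.Random(0)):
    """Sketch of the epoch-based protocol, simulated in one process:
    "sites" share the current epoch's forwarding probability p = 2^{-i}."""
    i = 0                                   # current epoch
    lower, upper = [], []
    messages = 0                            # items forwarded to the coordinator
    for x in stream:
        if rng.random() < 2.0 ** -i:        # a site forwards x w.p. 2^{-i}
            messages += 1
            (lower if rng.random() < 0.5 else upper).append(x)
            while len(lower) >= s:          # lower sample full: advance epoch
                i += 1
                old_lower, lower, upper = lower, [], []   # upper is discarded
                for y in old_lower:         # randomly split the old lower sample
                    (lower if rng.random() < 0.5 else upper).append(y)
    return lower + upper, i, messages
```

Note the number of forwarded items stays tiny compared with the stream length, matching the intuition that p shrinks toward s/n.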
A running example (maintain s = 3 samples). Epoch 0 (p = 1): items 1, 2, 3, ... arrive at the sites. With p = 1 every item is forwarded to the coordinator, which places each received item into the lower or the upper sample with equal probability. As soon as the lower sample holds 3 items, the coordinator discards the upper sample, randomly splits the lower sample into a new lower and a new upper sample, and advances to Epoch 1.
A running example (cont.). Epoch 1 (p = 1/2): each site now forwards an arriving item only with probability 1/2. In the example, items 6 and 9 are discarded locally, while items 7, 8 and 10 are forwarded and placed into the lower or upper sample. When the lower sample again holds 3 items, the coordinator discards the upper sample, splits the lower sample, and advances to Epoch 2 (p = 1/4), where even more items will be discarded locally.
Intuition: maintain a sampling probability p ≈ s/n at each site (n: total # items) without knowing n.
Random sampling: analysis.
Messages sent per epoch: O(k + s) (a k-message broadcast to advance the epoch, plus the O(s) forwarded items). Number of epochs: O(log n). Total communication: O((k + s) log n).
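As a back-of-the-envelope check on the O((k + s) log n) bound, the following toy estimate (entirely my own construction, not from the paper) plugs in roughly one k-message broadcast and O(s) forwarded items per epoch:

```python
import math

def cost_bound(n, k, s):
    """Rough message-count estimate for the epoch-based protocol:
    about log2(n/s) epochs, each costing a k-message broadcast
    plus on the order of s forwarded items."""
    epochs = math.ceil(math.log2(max(2, n / s)))
    return epochs * (k + 2 * s)

# The setting of the experiments on the next slide: n = 7,000,000, k = s = 128.
print(cost_bound(7_000_000, 128, 128))   # a few thousand, vs. n = 7,000,000
```

Even this crude estimate is several orders of magnitude below the naive O(n) cost of forwarding everything.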
Random sampling: analysis and experiments. The bound can be (1) improved to Θ(k log_{k/s} n + s log n) and (2) extended to sliding-window cases.
Experiments on a real data set: the 1998 World Cup logs.
[Plots: (a) basic case, cost vs. sample size (theory vs. practice), n = 7,000,000, k = 128; (b) time-based sliding window, cost vs. sample size, n = 320000, k = 128; (c) time-based sliding window, cost vs. window size, s = 128, k = 128.]
Basic case summary: total # items n = 7,000,000; # items sent ,000; size of sample s = 128; # sites k = 128.
Sampling from a (time-based) sliding window.
[Figure: the timeline splits into expired windows, a frozen window, and the current window.]
Sampling from a sliding window needs new ideas. Sample for the sliding window = (1) a subsample of the (unexpired) sample of the frozen window + (2) a subsample of the sample of the current window (by ISWoR).
(1) and (2) may be sampled at different rates, but as long as both have size ≥ min{s, # live items}, that's fine.
The key issue: how to guarantee both have size ≥ s, as items in the frozen window keep expiring?
Solution: in the frozen window, find a good sampling rate such that the sample size is ≥ s.
Dealing with the frozen window.
[Figure: level-sampling structure over the frozen window (example with s = 2); level j contains each item w.p. 2^{-j}.]
Keep all the levels? That needs O(w) communication. Instead, keep the most recent sampled items in a level until s of them are also sampled at the next level. Total size: O(s log w).
Guarantee: there is always a level whose retained suffix contains ≥ s sampled items and covers the unexpired portion of the frozen window.
Dealing with the frozen window: the algorithm. Each site builds its own level-sampling structure for the current window until it freezes; this needs O(s log w) space and O(1) time per item. When the current window freezes, for each level, do a k-way merge to build that level of the global structure at the coordinator. Total communication: O((k + s) log w).
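The per-site level-sampling structure can be sketched as follows (my own simplified rendering of the rule "keep the most recent sampled items in a level until s of them are also sampled at the next level"; function name and the (timestamp, item) representation are assumptions):

```python
import random

def level_sample(window, s, rng=random.Random(0)):
    """Level j holds items sampled w.p. 2^{-j}; each level is then
    truncated to its most recent items, stopping once s of them
    also appear at the next level up.  Total size O(s log w)."""
    max_level = len(window).bit_length()
    levels = [[] for _ in range(max_level + 1)]
    for t, x in enumerate(window):
        lvl = 0                       # heads before the first tail
        while rng.random() < 0.5 and lvl < max_level:
            lvl += 1
        for j in range(lvl + 1):
            levels[j].append((t, x))  # (timestamp, item)
    for j in range(max_level):        # truncate level j against level j+1
        higher = {t for t, _ in levels[j + 1]}
        seen, start = 0, 0
        for idx in range(len(levels[j]) - 1, -1, -1):
            if levels[j][idx][0] in higher:
                seen += 1
                if seen > s:          # suffix now holds s next-level items
                    start = idx + 1
                    break
        levels[j] = levels[j][start:]
    return levels
```

Each truncated level holds about 2s items in expectation, so over the O(log w) levels the structure uses O(s log w) space, and for any unexpired suffix of the frozen window some level still retains s sampled items covering it.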