The Power of Both Choices: Practical Load Balancing for Distributed Stream Processing Engines
Anis Nasir
Coauthors: Gianmarco, Nicolas, David, Marco
Stream Processing Engines
- Online machine learning
- Real-time query processing
- Continuous computation
- Distributed RPC
Stream Processing Engines
Streaming applications are represented as directed acyclic graphs (DAGs).
(Figure: DAG of sources and workers)
Stream Grouping
- Key (or Fields) Grouping: hash-based assignment; stateful operations, e.g., PageRank, degree count
- Shuffle Grouping: round-robin assignment; stateless operations, e.g., data logging, OLTP
- All Grouping
(A minimal routing sketch of the first two follows this list.)
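The two routing policies can be contrasted in a few lines. The following is an illustrative Python sketch under our own naming (key_grouping, ShuffleGrouping), not the API of any specific DSPE:

```python
# Illustrative sketch: hash-based key grouping vs. round-robin shuffle grouping
# over n downstream workers.

def key_grouping(key, n):
    # All messages with the same key go to the same worker,
    # so per-key state can live on a single worker.
    return hash(key) % n

class ShuffleGrouping:
    def __init__(self, n):
        self.n = n
        self.next = 0

    def route(self, _message):
        # Round-robin: perfectly balanced, but per-key state
        # ends up scattered across all workers.
        worker = self.next
        self.next = (self.next + 1) % self.n
        return worker
```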
Data Distribution
- Skewed distributions: power-law, Zipf, log-normal, scale-free networks
- Example applications: social networks, web users, economy, biology
(A small sampling sketch follows this list.)
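To see how skewed such distributions are, a small sketch (our own, using NumPy's Zipf sampler; the parameter a=2.0 and the worker count are arbitrary choices) compares the hottest key's share of traffic with the ideal 1/n share:

```python
# Draw keys from a Zipf distribution and measure how much of the stream the
# single most frequent key accounts for.
import numpy as np
from collections import Counter

n_workers = 10
keys = np.random.zipf(a=2.0, size=100_000)            # Zipf-distributed key ids
p1 = Counter(keys.tolist()).most_common(1)[0][1] / len(keys)
print(f"hottest key carries {p1:.1%} of traffic; ideal share is {1/n_workers:.1%}")
```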
Key Grouping
(Figure: sources routing each key to a single worker)
Shuffle Grouping
(Figure: sources routing messages round-robin to workers, followed by an aggregator)
Stream Grouping: trade-offs
- Key Grouping: efficient routing, but load imbalance
- Shuffle Grouping: load balance, but additional memory and an additional aggregation phase
A possible solution: dynamic load rebalancing
- Detect load imbalance, then perform data migration
Challenges:
- How often to check for load imbalance
- Migration is not directly supported by most DSPEs and requires extra modification
- State management for stateful operations
Power of Two Choices (POTC)
Balls-and-bins problem. Algorithm:
- For each ball, pick two bins uniformly at random
- Assign the ball to the least loaded of the two bins
Issues:
- Non-uniform key distribution
- Load information
(See the sketch below.)
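A minimal balls-and-bins sketch of the process (our own code; names and parameters are illustrative):

```python
# Power of two choices: each ball probes two random bins and joins the least
# loaded one; return the resulting imbalance (max load minus average load).
import random

def potc(num_balls, num_bins, seed=0):
    rng = random.Random(seed)
    load = [0] * num_bins
    for _ in range(num_balls):
        i, j = rng.randrange(num_bins), rng.randrange(num_bins)
        target = i if load[i] <= load[j] else j   # pick the less loaded bin
        load[target] += 1
    return max(load) - sum(load) / num_bins

print(potc(num_balls=100_000, num_bins=10))
```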
Power of Two Choices
(Figure: sources applying the power of two choices to route keys to workers)
Issues:
- Consensus on keys
- Load information
- Load imbalance
Partial Key Grouping (PKG)
The power of both choices: key splitting
- Split each key across two workers
- Assign each message instance using the power of two choices
Benefits:
- Decentralized
- Stateless
- Handles skew
Partial Key Grouping
Local load estimation: each source estimates the load on the workers using its local routing history.
Benefits:
- No coordination among sources
- No communication with workers
(A sketch combining key splitting and local load estimation appears below.)
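A minimal sketch of a PKG router that combines key splitting with local load estimation (our own illustration; the class and hashing scheme are not taken from the paper):

```python
# Each source keeps only local counters of what it has sent to each worker,
# hashes every key to two candidate workers, and picks the locally less
# loaded of the two.
import hashlib

class PartialKeyGrouping:
    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.local_load = [0] * num_workers   # this source's routing history

    def _hash(self, key, salt):
        h = hashlib.md5(f"{salt}:{key}".encode()).hexdigest()
        return int(h, 16) % self.num_workers

    def route(self, key):
        # Two candidate workers per key (key splitting).
        a, b = self._hash(key, 0), self._hash(key, 1)
        target = a if self.local_load[a] <= self.local_load[b] else b
        self.local_load[target] += 1          # local estimation only
        return target
```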
Partial Key Grouping
(Figure: sources routing each key to one of its two candidate workers, followed by an aggregator)
Analysis: problem formulation
- n workers -> bins
- keys k_i ∈ K -> colors
- m messages -> colored balls
Goal: minimize the difference between the maximum and the average workload.
Analysis: key distribution
We pick each key k_i ∈ K with probability p_i from the distribution D, where p_1 ≥ p_2 ≥ p_3 ≥ …
The maximum load is proportional to the most frequent key (with probability p_1).
If p_1 > 2/n, the expected imbalance is lower bounded by I(m) ≥ (p_1/2 − 1/n) · m
(The reasoning is sketched below.)
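The reasoning behind this bound can be written out as follows (our paraphrase of the slide's argument, assuming d = 2 choices per key):

```latex
% With key splitting the hottest key (frequency p_1) is spread over d = 2
% workers, so one of them must receive at least p_1 m / 2 messages, while
% the average worker receives m / n. Hence, when p_1 > 2/n,
\[
  I(m) \;\ge\; \frac{p_1 m}{2} - \frac{m}{n}
       \;=\; \left(\frac{p_1}{2} - \frac{1}{n}\right) m .
\]
```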
Analysis
Assume a key distribution D with maximum probability p_1 ≤ 2/n. Then the imbalance after m steps of the Greedy-d process satisfies, with probability at least 1 − 1/n:
Analysis: an example with four workers
In the ideal scenario, each worker should handle 25% of the keys.
We need to consider three cases:
- p_1 = 2/4 = 0.5
- p_1 > 0.5
- p_1 < 0.5
(A short worked version of the three cases is sketched below.)
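A worked version of the three cases (our own arithmetic, following the slide):

```latex
% n = 4 workers, ideal share = 1/n = 25% of the messages per worker.
% Case p_1 = 0.5: the hottest key, split over two workers, gives each of
% them 0.5/2 = 25%, exactly the ideal share, so PKG can still balance.
% Case p_1 > 0.5: each of the two workers gets more than 25%, so some
% imbalance is unavoidable (the lower bound (p_1/2 - 1/n)m > 0 applies).
% Case p_1 < 0.5: the hottest key alone cannot overload a worker, and the
% Greedy-d analysis bounds the imbalance with probability at least 1 - 1/n.
\[
  \frac{p_1}{2} \le \frac{1}{n} = 25\% \iff p_1 \le 0.5 .
\]
```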
Applications
Most algorithms that use Shuffle Grouping can be expressed with Partial Key Grouping to reduce:
- Memory footprint
- Aggregation overhead
Algorithms that use Key Grouping can be rewritten to achieve load balance.
Examples
- Naïve Bayes classifier
- Streaming parallel decision trees
- Heavy hitters and Space Saving
Naïve Bayes Classifier
Counts co-occurrences of each feature and class value.
- Key Grouping (vertical parallelism): each feature is tracked by a single worker process
- Shuffle Grouping (horizontal parallelism): each feature is tracked by all worker processes
- Partial Key Grouping: each feature is tracked by exactly two processes
(A sketch of the two-way count aggregation is shown after this list.)
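As a sketch of why counting works with only two replicas per feature (our own illustration, not the paper's implementation), the aggregator simply sums the two partial counters:

```python
# With PKG each (feature, class) count is split across two workers; the
# aggregator adds the two partial counters to recover the global
# co-occurrence count used by the Naive Bayes model.
from collections import Counter

worker_counts = [Counter(), Counter()]      # the two replicas of a feature

def update(replica_id, feature, klass):
    # Each routed message increments the counter on one of the two workers.
    worker_counts[replica_id][(feature, klass)] += 1

def aggregate(feature, klass):
    # Counts are additive, so merging two partial states is a simple sum.
    return sum(c[(feature, klass)] for c in worker_counts)

update(0, "word=free", "spam")
update(1, "word=free", "spam")
print(aggregate("word=free", "spam"))       # -> 2
```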
Stream Groupings: a summary
Stream Grouping       | Pros                   | Cons
Key Grouping          | Scalable               | Load imbalance
Shuffle Grouping      | Load balance           | Memory overhead, aggregation O(W)
Partial Key Grouping  | Scalable, load balance | Memory cost, aggregation O(1)
Experiments
- What is the effect of key splitting on POTC?
- How does local estimation compare to a global oracle?
- How robust is Partial Key Grouping?
- How does PKG perform in a real deployment on Apache Storm?
Metric: load imbalance
The difference between the maximum and the average load of the workers at time t, as formalized below.
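Written out explicitly (the notation L_i(t) for the load of worker i at time t is ours):

```latex
% Load-imbalance metric as described on the slide, with n workers.
\[
  I(t) \;=\; \max_{i} L_i(t) \;-\; \frac{1}{n}\sum_{i=1}^{n} L_i(t) .
\]
```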
Effect of Key Splitting
(Figure: results on the Wikipedia (WP) and Twitter (TW) datasets as the number of workers varies)
Local Load Estimation
(Figure)
Robustness
- Changing trends in data: cashtags, used in the stock market to identify a publicly traded company, e.g., $AAPL for Apple
- Skewed load at the sources: social networks; test different data distributions at the sources
Robustness
(Figure: results with 5, 10, 50, and 100 workers)
Robustness
(Figure: imbalance I (messages) vs. number of workers (5, 10, 50, 100), for uniform and skewed load at the sources with L = 5, 10, 15, 20)
Real deployment: Apache Storm
(Figure: (a) throughput (keys/s) vs. CPU, and (b) delay (ms) vs. memory (keys), for PKG, SG, and KG at intervals from 10s to 600s)
Future Work
- Dynamic load balancing
  - Key migration: moving the top-k keys across workers
  - Partition migration: moving a subset of keys across workers
- Handling worker churn with Partial Key Grouping
- Applying queueing theory to Partial Key Grouping for load balancing
Conclusion
- Partial Key Grouping (PKG) reduces the load imbalance by up to seven orders of magnitude compared to Key Grouping
- PKG imposes constant memory and aggregation overhead, i.e., O(1), compared to Shuffle Grouping, which is O(W)
- On Apache Storm: 60% improvement in throughput, 45% improvement in latency
Anis Nasir Coauthors: Gianmarco, Nicolas, David, Marco
Datasets
- Twitter, 1.2G tweets (crawled July 2012)
- Wikipedia, 22M access logs
- Twitter, 690K cashtags (crawled Nov 2013)
- Social networks, 69M edges
- Synthetic, 10M keys
Streaming Parallel Decision Tree
- Uses shuffle grouping, where each worker builds a histogram
- An aggregator merges the results from the W workers
- Memory footprint proportional to W workers
- Partial Key Grouping reduces this overhead to 2 workers (a merge sketch follows)
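A hedged sketch of the merge step (our own simplification; the actual streaming decision-tree algorithm merges approximate fixed-size histograms rather than exact counters):

```python
# With PKG each feature's streaming histogram lives on only two workers, so
# the aggregator merges 2 partial histograms instead of W.
from collections import Counter

def merge_histograms(partials):
    """Merge per-worker histograms (bin -> count) into one histogram."""
    merged = Counter()
    for hist in partials:
        merged.update(hist)
    return merged

# With shuffle grouping the list would have W entries; with PKG only 2.
partial_a = Counter({0.1: 40, 0.2: 10})
partial_b = Counter({0.1: 35, 0.3: 5})
print(merge_histograms([partial_a, partial_b]))
```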
Heavy Hitters and Space Saving
Solves top-k items in constant time and space.
- Key Grouping: error depends on a single estimator; poor load balancing
- Shuffle Grouping: error bound depends on the number of workers
- Partial Key Grouping: better load balancing, better estimation
(A minimal Space Saving sketch appears below.)
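A minimal sketch of the Space Saving counter (the standard algorithm, in our own illustrative implementation):

```python
# Keep at most k counters; when a new item arrives and the table is full,
# evict the minimum counter and inherit its count + 1.
class SpaceSaving:
    def __init__(self, k):
        self.k = k
        self.counts = {}

    def offer(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.k:
            self.counts[item] = 1
        else:
            # Replace the current minimum; the new item's count
            # over-estimates its true frequency by at most the evicted count.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def top(self, n):
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:n]

ss = SpaceSaving(k=3)
for x in "aabacbadaa":
    ss.offer(x)
print(ss.top(2))
```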