Load Balancing in Stream Processing Engines. Anis Nasir Coauthors: Gianmarco, Nicolas, David, Marco

1 Load Balancing in Stream Processing Engines. Anis Nasir. Coauthors: Gianmarco, Nicolas, David, Marco

2 Stream Processing Engines: Online Machine Learning; Real-Time Query Processing; Continuous Computation; Distributed RPC

3 Stream Processing Engines. Streaming applications are represented by Directed Acyclic Graphs (DAGs). [Figure: DAG of sources and workers]

4 Stream Grouping. Key or Fields Grouping: hash-based assignment; stateful operations, e.g., page rank, degree count. Shuffle Grouping: round-robin assignment; stateless operations, e.g., data logging, OLTP. All Grouping.
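The two main groupings can be sketched in a few lines. This is only an illustration, not the routing code of any particular engine: the function names are hypothetical, and `zlib.crc32` is an assumed stand-in for whatever hash function a real engine uses.

```python
import itertools
import zlib


def key_grouping(key: str, num_workers: int) -> int:
    """Hash-based assignment: the same key always maps to the same worker."""
    # crc32 is an assumption chosen because it is deterministic across runs,
    # unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_workers


def shuffle_grouping(num_workers: int):
    """Round-robin assignment: returns a router that ignores the key."""
    next_worker = itertools.cycle(range(num_workers))
    return lambda _key: next(next_worker)
```

With `route = shuffle_grouping(3)`, successive calls to `route(key)` cycle through workers 0, 1, 2 regardless of the key, whereas `key_grouping("a", 3)` returns the same worker every time it is called.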

5 Data Distribution. Skewed distributions: power-law, Zipf distribution, log-normal, scale-free networks. Example applications: social networks, web users, economy, biology.

6 Key Grouping [Figure: diagram of sources and workers]

7 Shuffle Grouping [Figure: diagram of sources, workers, and an aggregator]

8 Stream Grouping. Key Grouping: efficient routing, but load imbalance. Shuffle Grouping: load balance, but additional memory and an additional aggregation phase.

9 A possible solution: dynamic load rebalancing (detect load imbalance, then perform data migration). Challenges: how often to check for load imbalance; migration is not directly supported by most DSPEs and requires extra modification; state management for stateful operations.

10 Power of two choices (POTC). Balls-and-bins problem. Algorithm: for each ball, pick two bins uniformly at random; assign the ball to the less loaded of the two bins. Issues: non-uniform key distribution; load information.
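The balls-and-bins algorithm above can be simulated directly. The sketch below is a minimal illustration under stated assumptions: the two bins are drawn independently (so they may coincide), ties go to the first bin, and the function names are hypothetical. A single-choice baseline is included for comparison.

```python
import random


def power_of_two_choices(num_balls: int, num_bins: int, rng: random.Random):
    """Place each ball in the less loaded of two randomly chosen bins."""
    loads = [0] * num_bins
    for _ in range(num_balls):
        a = rng.randrange(num_bins)
        b = rng.randrange(num_bins)  # assumption: drawn with replacement
        target = a if loads[a] <= loads[b] else b
        loads[target] += 1
    return loads


def single_choice(num_balls: int, num_bins: int, rng: random.Random):
    """Baseline: place each ball in one bin chosen uniformly at random."""
    loads = [0] * num_bins
    for _ in range(num_balls):
        loads[rng.randrange(num_bins)] += 1
    return loads
```

Comparing `max(loads)` from both functions for the same number of balls and bins gives a quick feel for how much the second choice tightens the maximum load.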

11 Power of two choices. Issues: consensus on keys; load information; load imbalance. [Figure: diagram of sources and workers]

12 Partial Key Grouping (PKG). The Power of Both Choices. Key splitting: split each key between two servers; assign each instance using the power of two choices. Benefits: decentralized; stateless; handles skew.

13 Partial Key Grouping. Local load estimation: each source estimates the load on the workers using its local routing history. Benefits: no coordination among sources; no communication with workers.
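Key splitting and local load estimation fit together in one small router per source. The sketch below is an illustration under assumptions, not the authors' implementation: the class name is hypothetical, and `zlib.crc32` and `zlib.adler32` are assumed stand-ins for two independent hash functions (they can map a key to the same worker, in which case the key is not split).

```python
import zlib


class PartialKeyGrouper:
    """One instance per source; uses only that source's routing history."""

    def __init__(self, num_workers: int):
        self.num_workers = num_workers
        # Local estimate: messages this source has sent to each worker.
        self.local_load = [0] * num_workers

    def candidates(self, key: str):
        """Two candidate workers per key, from two (assumed) hash functions."""
        data = key.encode("utf-8")
        first = zlib.crc32(data) % self.num_workers
        second = zlib.adler32(data) % self.num_workers
        return first, second

    def route(self, key: str) -> int:
        """Send the message to the candidate this source has loaded less."""
        first, second = self.candidates(key)
        if self.local_load[first] <= self.local_load[second]:
            target = first
        else:
            target = second
        self.local_load[target] += 1
        return target
```

Because every source computes the same two candidates for a key, sources agree on where a key may live without exchanging messages, and the load estimate never requires contacting the workers.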

14 Partial Key Grouping [Figure: diagram of sources, workers, and an aggregator]

15 Analysis. Problem formulation: n workers -> bins; keys ki ∈ K -> colors; m messages -> colored balls. Minimize the difference between the maximum and the average workload.

16 Analysis. Key distribution: we pick each key ki ∈ K with probability pi from the distribution D, where p1 ≥ p2 ≥ p3 ≥ .... The maximum load is proportional to the frequency of the most frequent key (probability p1). If p1 > 2/n, the expected imbalance I(m) is lower bounded by (p1/2 - 1/n) m.
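The intuition behind the bound is that the most frequent key can be split over at most two workers, so one of them receives at least p1·m/2 messages in expectation, while the average load is m/n. A small helper makes the arithmetic concrete. The function name is hypothetical, and returning 0 when p1 ≤ 2/n is an assumption made here (the slide's bound does not apply in that case, and imbalance is never negative).

```python
def imbalance_lower_bound(p1: float, n: int, m: int) -> float:
    """Lower bound (p1/2 - 1/n) * m on the expected imbalance after m messages."""
    if p1 <= 2 / n:
        # The bound from the slide only applies when p1 > 2/n;
        # 0 is the trivial lower bound otherwise (assumption of this sketch).
        return 0.0
    return (p1 / 2 - 1 / n) * m
```

For example, with n = 4 workers, m = 1000 messages, and p1 = 0.6, the bound is (0.3 - 0.25) × 1000 = 50 messages.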

17 Analysis. Assume a key distribution D with maximum probability p1 ≤ 2/n. Then the imbalance after m steps of the Greedy-d process satisfies, with probability at least 1 - 1/n: [bound missing from the extracted slide]

18 Analysis. An example with four workers: in the ideal scenario, each worker should handle 25% of the keys. We need to consider three cases: when p1 = 2/4 = 0.5; when p1 > 0.5; when p1 <

21 Applications. Most algorithms that use Shuffle Grouping can be expressed using Partial Key Grouping to reduce memory footprint and aggregation overhead. Algorithms that use Key Grouping can be rewritten to achieve load balance.

22 Examples: Naïve Bayes Classifier; Streaming Parallel Decision Trees; Heavy Hitters and Space Saving

23 Naïve Bayes Classifier. Counts co-occurrences of each feature and class value. Key Grouping (vertical parallelism): each feature is tracked by a single worker process. Shuffle Grouping (horizontal parallelism): each feature is tracked by all worker processes. Partial Key Grouping: each feature is tracked by exactly two processes.

24 Stream Groupings: A summary
Stream Grouping       | Pros                   | Cons
Key Grouping          | Scalable               | Load imbalance
Shuffle Grouping      | Load balance           | Memory overhead; aggregation O(W)
Partial Key Grouping  | Scalable; load balance | Memory cost; aggregation O(1)
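The O(W) versus O(1) aggregation cost in the summary can be seen when reading back a per-key count. The sketch below assumes each worker keeps a `collections.Counter` of the keys it has seen; the function names and the `candidates` argument (the two workers a key may be routed to) are hypothetical.

```python
from collections import Counter


def count_under_shuffle(worker_counts, key):
    """Shuffle grouping: any worker may hold part of the count, so merge all W."""
    return sum(counts[key] for counts in worker_counts)


def count_under_pkg(worker_counts, key, candidates):
    """PKG: only the key's two candidate workers can hold part of the count."""
    # set() avoids double counting when both candidates are the same worker.
    return sum(worker_counts[w][key] for w in set(candidates))
```

The first function touches every worker for every key, while the second touches at most two, independent of the number of workers.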

25 Experiments. What is the effect of key splitting on POTC? How does local estimation compare to a global oracle? How robust is Partial Key Grouping? How does PKG perform in a real deployment on Apache Storm?

26 Metric. Load imbalance: the difference between the maximum and the average load of the workers at time t.
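The metric is a one-liner once the per-worker loads at time t are available. In this sketch, `loads` is an assumed list of message counts, one per worker, and the function name is hypothetical.

```python
def load_imbalance(loads):
    """Difference between the maximum and the average worker load."""
    return max(loads) - sum(loads) / len(loads)
```

A perfectly balanced system such as `[10, 10, 10, 10]` has imbalance 0, while `[40, 0, 0, 0]` has imbalance 40 - 10 = 30.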

27 Effect of Key Splitting [Figure: results for the Wikipedia (WP) and Twitter (TW) datasets; x-axis: workers]

28 Local Load Estimation

29 Robustness. Changing trends in data: cashtags, used in the stock market to identify a publicly traded company, e.g., $AAPL for Apple. Skewed load at the sources: social networks; test different data distributions at the sources.

30 Robustness [Figure: panels for 5, 10, 50, and 100 workers]

31 Robustness [Figure: imbalance I (messages) vs. workers; series: Uniform and Skewed, each for L = 5, 10, 15, 20]

32 Real deployment: Apache Storm [Figure: throughput (keys/s) of PKG, SG, and KG; panel (a) axis: CPU delay (ms); panel (b) axis: memory (keys)]

33 Future Work. Dynamic load balancing: key migration (moving the top-k keys across workers); partition migration (moving a subset of keys across workers). Handling worker churn with Partial Key Grouping. Applying queuing theory with Partial Key Grouping for load balancing.

34 Conclusion. Partial Key Grouping (PKG) reduces the load imbalance by up to seven orders of magnitude compared to Key Grouping. PKG imposes constant memory and aggregation overhead, i.e., O(1), compared to Shuffle Grouping, which is O(W). On Apache Storm: 60% improvement in throughput; 45% improvement in latency.

35 Anis Nasir Coauthors: Gianmarco, Nicolas, David, Marco

36 Datasets. Twitter: 1.2G tweets (crawled July 2012). Wikipedia: 22M access logs. Twitter: 690K cashtags (crawled Nov 2013). Social networks: 69M edges. Synthetic: 10M keys.

37 Streaming Parallel Decision Tree. Uses shuffle grouping, where each worker generates a histogram. An aggregator merges the results from the W workers. Memory footprint is proportional to W workers. Partial Key Grouping reduces the overhead to 2 workers.

38 Heavy Hitters and Space Saving. Solves the top-k items problem in constant time and space. Key Grouping: error depends on a single estimator; poor load balancing. Shuffle Grouping: error bounds depend on the number of workers. Partial Key Grouping: better load balancing; better estimation.
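For readers unfamiliar with Space Saving, the counter update each worker would run can be sketched as follows. This is a simplified illustration with a hypothetical function name: it finds the minimum counter with a linear scan over the k counters, whereas constant-time variants need a more elaborate data structure that is not shown here.

```python
def space_saving(stream, k):
    """Approximate top-k counting with at most k counters."""
    counts = {}
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif len(counts) < k:
            counts[item] = 1
        else:
            # Evict the item with the smallest counter; the newcomer
            # inherits that count plus one (an overestimate).
            victim = min(counts, key=counts.get)
            counts[item] = counts.pop(victim) + 1
    return counts
```

Under Partial Key Grouping, the estimate for an item would be obtained by combining the counters of its two candidate workers rather than those of all workers.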
