Load Balancing for Distributed Stream Processing Engines. Muhammad Anis Uddin Nasir EMDC

Size: px

Start display at page:

Download "Load Balancing for Distributed Stream Processing Engines. Muhammad Anis Uddin Nasir EMDC 2011-13"

Natalie Holt
8 years ago
Views:

1 Load Balancing for Distributed Stream Processing Engines Muhammad Anis Uddin Nasir EMDC

2 About me Ex EMDC from Batch 011 (the party batch) Currently PhD Student at KTH Royal Institute of Technology Under supervision of Šarūnas Girdzijauskas Internship Ex-intern at Yahoo Labs Barcelona Intern at Aalto University

Technology Under supervision of Šarūnas Girdzijauskas

3 About me 3

4 About me 4

5 About me 5

6 About me 6

7 About me 7

8 About me 8

9 Why PhD? Freedom Work timings Set of problems Flexibility Conferences Internships 9

10 Research Interest Stream Processing Social Network Analysis Distributed Systems Decentralized Online Social Networks

11 Stream Processing Engines Streaming Application Online Machine Learning Real Time Query Processing Continuous Computation Streaming Frameworks Storm, Borealis, S4, Samza, Spark Streaming 11

12 Stream Processing Model Streaming Applications are represented by Directed Acyclic Graphs (DAGs) Source Data Stream Source Operators Data Channels 1

13 Stream Grouping Key or Fields Grouping Hash-based assignment Stateful operations, e.g., page rank, degree count Shuffle Grouping Round-robin assignment Stateless operations, e.g., data logging, OLTP 13

14 Key Grouping Key Grouping Scalable Low Memory Load Imbalance 14

15 Shuffle Grouping Shuffle Grouping Load Balance Memory O(W) Aggregation O(W) 15

16 Problem Formulation Input is a unbounded sequence of messages from a key distribution Each message is assigned to a worker for processing Load balance properties Memory Load Balance Network Load Balance Processing Load Balance Metric: Load Imbalance 16

for processing Load balance properties Memory Load Balance

17 Partial Key Grouping (PKG) Key Splitting Split each key into two server Assign each instance using power of two choices Local Load Estimation each source estimates load on workers using the local routing history 17

of two choices Local Load Estimation each source

18 Partial Key Grouping (PKG) Key Splitting Local Load Estimation 00 1 Source 1 0 Source 18

19 Partial Key Grouping (PKG) Key Splitting Distributed Stateless Handle Skew Local load estimation No coordination among sources No communication with workers 19

20 Partial Key Grouping PKG Load Balance Memory O(1) Aggregation O(1) 0

21 Streaming Applications Most algorithms that use Shuffle Grouping can be expressed using Partial Key Grouping to reduce: Memory footprint Aggregation overhead Algorithms that use Key Grouping can be rewritten to achieve load balance 1

22 Stream Grouping: Summary Grouping Pros Cons Key Grouping Scalable Memory Load Balance Load Imbalance Shuffle Grouping Partial Key Grouping Memory O(W) Aggregation O(W) Scalable Load Balance Memory O(1) Aggregation O(1)

23 Experiments What is the effect of key splitting on POTC? How does local estimation compare to a global oracle? How does PKG perform on a real deployment on Apache Storm? 3

24 Experimental Setup Metric the difference of maximum and the average load of the workers at time t Datasets Twitter, 1.G tweets (crawled July 01) Wikipedia, M access logs Twitter, 690K cashtags (crawled Nov 013) Social Networks, 69M edges Synthetic, M keys 4

25 Effect of Key Splitting 5

26 Local Load Estimation 6

27 Throughput (keys/s) Real deployment: Apache Storm PKG SG 00 KG s 300s 60s s 30s 60s s PKG SG KG 30s s (a) CPU delay (ms) 600s (b) Memory (keys)

28 Load Balancing for Distributed Stream Processing Engines Muhammad Anis Uddin Nasir EMDC

29 Does PKG work for highly ed data? 9

30 Skewness Source Source 30

31 Skewness Source Source 31

32 Skewness Source Source 3

33 High level ideas Splitting the high frequency keys on more than two workers Challenges How to decide which are the high frequency keys? How many workers to split the high frequency keys? 33

34 Solution Divide stream into two components Heavy Hitters Tail Distribution Split Heavy Hitters on all the workers Use Greedy W-Choices or Round Robin for heavy hitters Use vanilla PKG for tail 34

35 Skewness Source Source 35

36 Experiments- Threshold 5 workers I(t) (messages) workers 0 W-Choice / W 1/ W 1/ W 1/4 W 1/8 W 1. W-Choice workers I(t) (messages) W-Choice RR workers 0 RR RR workers 0-1 W-Choice workers 0 0 workers 0 50 workers 0-7 RR 1. 36

37 Experiments- Load Imbalance 5 workers I(t) (messages) workers k unique items PKG W-Choice RR -1 - k unique items workers workers 0 0k unique items 0k unique items M unique items -7 1M unique items 1M unique items M unique items workers workers k unique items workers workers workers 0-1 k unique items workers 0 0k unique items k unique items workers 0-3 I(t) (messages) -6 I(t) (messages) 50 workers

38 Conclusion Partial Key Grouping (PKG) reduces the load imbalance by up to seven orders of magnitude compared to Key Grouping PKG imposes constant memory and aggregation overhead, i.e., O(1), compared to Shuffle Grouping that is O(W) W-Choices improves PKG to achieve nearly perfect load balance for highly ed input 38

39 Load Balancing for Distributed Stream Processing Engines Muhammad Anis Uddin Nasir EMDC

Load Balancing in Stream Processing Engines. Anis Nasir Coauthors: Gianmarco, Nicolas, David, Marco

Load Balancing in Stream Processing Engines. Anis Nasir Coauthors: Gianmarco, Nicolas, David, Marco Engines Anis Nasir Coauthors: Gianmarco, Nicolas, David, Marco Stream Processing Engines Online Machine Learning Real Time Query Processing ConCnuous ComputaCon Distributed RPC 2 Stream Processing Engines