THEMIS: Fairness in Data Stream Processing under Overload


1 THEMIS: Fairness in Data Stream Processing under Overload
Evangelia Kalyvianaki (City University London, UK), Marco Fiscato (Imperial College London, UK), Theodoros Salonidis (IBM Research, USA), Peter R. Pietzuch (Imperial College London, UK)
Systems Support for Big Data and Social Graphs, CTR, King's College London, 2014


2 The Puzzle of Big Data Processing Tools
Popular online applications rely on a plethora of tools to support users' queries. For example:
1. LinkedIn supports [Sigmod13]:
   a. online data centres → Avatara, Voldemort, Kafka
   b. offline data centres → Azkaban, Hadoop, Kafka
2. Conviva:
   a. Spark, Hadoop, Hive for historical analysis
   b. Spark Streaming for real-time analysis
Data centres run multiple different big data processing engines side by side.


3 The Puzzle of Big Data Real-Time Processing Engines in Data Centres
[Figure: a data center whose clusters run Twitter Storm, SEEP, Spark Streaming and Apache S4]
Queries overload data center resources. How to efficiently allocate resources across clusters/engines?


4 Data Shedding
A well-known technique to handle transient overload conditions is to discard data.
[Figure: a data center with Twitter Storm, SEEP, Spark Streaming and Apache S4 clusters, some of them marked as overloaded]
How to control shedding across clusters/engines and queries in a distributed and fair manner?


5 Fairness in Data Stream Processing under Overload
Key Contributions:
1. SIC processing metric to quantify shedding in a query-agnostic way
2. Use of SIC to pass shedding information among clusters
3. Distributed SIC fairness to address continuous overload
Outline:
1. SIC processing quality metric
2. SIC fairness distributed algorithm
3. SIC fairness shedder
4. Evaluation results
5. Future work


6 Source Information Content (SIC) Metric
SIC measures the contribution of data from sources to results.
[Figure: three sources feed a graph of UDOs; source tuples carry fractional SIC values such as 1/3 and 1/4. Perfect processing: 1 + 1 + 1 = 3. Degraded processing: 4/6 + 1/2 + 2/3 = 11/6 < 3.]
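The arithmetic on this slide can be reproduced with a small sketch. It assumes, as the slide's numbers suggest, that each source contributes a total SIC of 1 split equally among its tuples, and that the result SIC is the sum over the tuples that survive shedding. The function name and the per-source tuple counts (6, 2 and 3 emitted; 4, 1 and 2 kept) are hypothetical, chosen only so that the totals match the slide's 11/6 and 3.

```python
from fractions import Fraction


def result_sic(sources):
    """Sum the SIC that reaches the result.

    `sources` is a list of (emitted, kept) tuple counts per source.
    Assumption: every source is worth 1 in total, so each of its
    tuples is worth 1/emitted and `kept` of them reach the result.
    """
    total = Fraction(0)
    for emitted, kept in sources:
        if emitted <= 0 or not 0 <= kept <= emitted:
            raise ValueError("need 0 <= kept <= emitted and emitted > 0")
        total += Fraction(kept, emitted)
    return total


# Hypothetical tuple counts that reproduce the slide's totals.
perfect = result_sic([(6, 6), (2, 2), (3, 3)])    # nothing shed
degraded = result_sic([(6, 4), (2, 1), (3, 2)])   # 4/6 + 1/2 + 2/3
```

Exact fractions are used so that the degraded total compares cleanly against the perfect one.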


7 SIC Correlation to Result Correctness
[Figure: three panels (AVG, COUNT, MAX) plotting mean absolute error against SIC values for the gaussian, uniform, exponential, mixed and planetlab series]
Result correctness correlates with the SIC metric. The correlation is stronger for certain types of queries.


8 Fair Shedding for Equalising SIC Values
Each local shedder equalises the SIC values of its own queries.
Global coordination is achieved with local informed shedding.
[Figure: a data center in which the Twitter Storm, SEEP, Spark Streaming and Apache S4 clusters each run a fair shedder and exchange results together with local SIC values]
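One way to picture local equalisation is a greedy loop that always admits the next tuple of the query whose kept SIC is currently lowest. This is a minimal sketch under assumptions, not the THEMIS algorithm itself: the function name `equalise_sic`, the `pending` layout (query name mapped to the SIC values of its buffered tuples) and the tuple-count `capacity` are all hypothetical.

```python
import heapq
from fractions import Fraction


def equalise_sic(pending, capacity):
    """Choose which buffered tuples to keep so that per-query SIC stays balanced.

    pending:  dict mapping query name -> list of per-tuple SIC values
    capacity: number of tuples the node can process in this interval
    Returns a dict mapping query name -> total SIC kept.
    """
    kept = {q: Fraction(0) for q in pending}
    next_idx = {q: 0 for q in pending}
    # Min-heap keyed on the SIC kept so far: the most degraded query is on top.
    heap = [(kept[q], q) for q in pending if pending[q]]
    heapq.heapify(heap)
    while capacity > 0 and heap:
        sic, q = heapq.heappop(heap)
        kept[q] = sic + pending[q][next_idx[q]]
        next_idx[q] += 1
        capacity -= 1
        if next_idx[q] < len(pending[q]):
            heapq.heappush(heap, (kept[q], q))
    return kept  # tuples never admitted are the ones that get shed


buffers = {"a": [Fraction(1, 4)] * 4, "b": [Fraction(1, 2)] * 2}
balanced = equalise_sic(buffers, capacity=3)
```

With room for three tuples, the sketch keeps two tuples of query `a` and one of query `b`, so both end at the same SIC value.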


9 SIC Fair Shedder
[Figure: a node with an input buffer, a fair shedder, operator processing threads and output data]
The shedder projects the effect of SIC loss onto the result query SIC → local decisions for global convergence.
An online cost model estimates the time to process an average tuple → handles node heterogeneity.
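An online cost model of this kind could be as simple as a running average of the observed per-tuple processing time. The sketch below uses an exponentially weighted moving average, which is an assumption made here for illustration; the slide does not say which estimator is used, and the class name, method names and `alpha` parameter are hypothetical.

```python
class TupleCostModel:
    """Running estimate of the time needed to process one average tuple."""

    def __init__(self, alpha=0.5):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha      # weight given to the newest measurement
        self.avg_cost = None    # seconds per tuple, unknown until first sample

    def observe(self, elapsed_seconds, tuples_processed):
        """Fold one measurement interval into the estimate."""
        if tuples_processed <= 0:
            return  # nothing was processed, so there is nothing to learn
        sample = elapsed_seconds / tuples_processed
        if self.avg_cost is None:
            self.avg_cost = sample
        else:
            self.avg_cost = self.alpha * sample + (1 - self.alpha) * self.avg_cost

    def capacity(self, interval_seconds):
        """Number of tuples this node can process in the next interval."""
        if self.avg_cost is None or self.avg_cost <= 0:
            raise ValueError("no usable cost estimate yet")
        return int(interval_seconds // self.avg_cost)


model = TupleCostModel(alpha=0.5)
model.observe(elapsed_seconds=1.0, tuples_processed=4)  # sample: 0.25 s per tuple
model.observe(elapsed_seconds=2.0, tuples_processed=4)  # sample: 0.50 s per tuple
```

Because each node measures its own processing time, a slower node derives a smaller capacity and sheds more, which is one way such a model could account for heterogeneous nodes.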


10 Single-Node Fairness
[Figure: mean SIC (from most degraded to least degraded) and mean Jain's index (from least fair to most fair) plotted against the number of queries]
The SIC fair shedder scales gracefully with the number of queries.
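Jain's index, the fairness measure on this slide, has a standard closed form: for values x_1..x_n it is (sum of x_i)^2 / (n * sum of x_i^2), ranging from 1/n (one query gets everything) to 1 (all queries equal). A short sketch applied to per-query SIC values follows; the function name and the choice to treat an all-zero input as perfectly fair are assumptions made here.

```python
def jains_index(values):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2)."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    sum_of_squares = sum(v * v for v in values)
    if sum_of_squares == 0:
        # Convention chosen for this sketch: all queries at zero count as equal.
        return 1.0
    total = sum(values)
    return (total * total) / (len(values) * sum_of_squares)


equal_sic = jains_index([0.5, 0.5, 0.5, 0.5])   # every query equally degraded
skewed_sic = jains_index([1.0, 0.0, 0.0, 0.0])  # one query keeps all its data
```

Note that the index only measures how evenly SIC is spread across queries, which is why the slide reports mean SIC alongside it.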

11 Multi-Cluster Fairness
Setup: 18 nodes, 2,000 operators, mixed workload (cov, top-5, avg).
[Figure: fairness, on a scale from least fair to most fair, of SIC fairness compared with the random baseline]
Equal-SIC fairness is better than the random baseline.

12 Conclusions and Future Work
1. Data shedding to address continuous overload
2. A simple SIC metric to measure source data contribution
3. Global fairness convergence with local informed decisions
Future Work:
1. Data shedding for approximate, controlled computing
2. Semantic shedding
Thank you! Questions?

