THEMIS: Fairness in Data Stream Processing under Overload
Evangelia Kalyvianaki, City University London, UK
Marco Fiscato, Imperial College London, UK
Theodoros Salonidis, IBM Research, USA
Peter Pietzuch, Imperial College London, UK (prp@doc.ic.ac.uk)
Systems Support for Big Data and Social Graphs CTR, King's College London, 2014
The Puzzle of Big Data Processing Tools
Popular online applications rely on a plethora of tools to support users' queries. For example:
1. LinkedIn [Sigmod13]:
   a. online data centres → Avatara, Voldemort, Kafka
   b. offline data centres → Azkaban, Hadoop, Kafka
2. Conviva:
   a. Spark, Hadoop and Hive for historical analysis
   b. Spark Streaming for real-time analysis
Data centres run multiple different big data processing engines collaboratively.
The Puzzle of Big Data: Real-Time Processing Engines in Data Centres
[Figure: a data centre running separate clusters for Twitter Storm, SEEP, Spark Streaming and Apache S4]
Queries overload data centre resources. How can resources be allocated efficiently across clusters/engines?
Data Shedding
A well-known technique to handle transient overload conditions is to discard data.
[Figure: the same data centre, with the Twitter Storm and Spark Streaming clusters overloaded]
How to control shedding across clusters/engines and queries in a distributed and fair manner?
Fairness in Data Stream Processing under Overload
Key Contributions:
1. SIC processing metric to quantify shedding in a query-agnostic way
2. Use of SIC to pass shedding information among clusters
3. Distributed SIC fairness to address continuous overload
Outline:
1. SIC processing quality metric
2. SIC fairness distributed algorithm
3. SIC fairness shedder
4. Evaluation results
5. Future work
Source Information Content (SIC) Metric
Measures the contribution of data from sources to results.
[Figure: three sources each contribute one unit of information, split across tuples (halves, thirds, quarters) and combined by user-defined operators (UDOs). Under perfect processing the result's SIC is 1 + 1 + 1 = 3; under degraded processing only part of each source's data reaches the result, e.g. 1 + 1/2 + 1/3 = 11/6 < 3.]
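The worked example in the figure can be sketched in code. This is a minimal illustration, not the paper's exact definition: it assumes each source spreads one unit of information equally across its tuples, and that an operator's output tuples share the combined SIC of their inputs equally (the helper name `operator_sic` is illustrative).

```python
from fractions import Fraction as F

def operator_sic(input_sics, n_outputs=1):
    # Illustrative rule: output tuples share the combined input SIC equally.
    return sum(input_sics) / n_outputs

# Three sources split one unit of information across 2, 3 and 4 tuples.
src_a = [F(1, 2)] * 2
src_b = [F(1, 3)] * 3
src_c = [F(1, 4)] * 4

# Perfect processing: the result carries all source information.
perfect = operator_sic(src_a + src_b + src_c)            # -> 3

# Degraded processing: a shedder dropped half of src_a and two thirds of src_b.
degraded = operator_sic(src_a[:1] + src_b[:1] + src_c)   # -> 11/6
```

The result SIC directly exposes how much of each source's information survived shedding, without needing to know what the query computes.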
SIC Correlation to Result Correctness
[Plots: mean absolute error versus SIC values for AVG, COUNT and MAX queries, over gaussian, uniform, exponential, mixed and planetlab input distributions]
The SIC metric correlates with result correctness, with stronger correlation for certain types of queries.
Fair Shedding for Equalising SIC Values
Each local shedder equalises the SIC values of its own queries; global coordination is achieved with local informed shedding.
[Figure: a fair shedder attached to each cluster (Twitter Storm, SEEP, Spark Streaming, Apache S4), each exchanging its result and local SIC]
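One way to realise "equalise the SIC values of the local queries" is a greedy loop that, while the node is over capacity, sheds input from the least-degraded (highest-SIC) query. This is a hedged sketch under assumed data structures (per-query input rate, per-tuple cost, current SIC), not THEMIS's actual algorithm; it also assumes a query's SIC falls in proportion to the tuples shed from it.

```python
def equalise_sic(queries, capacity, step=0.01):
    """Shed input load until the estimated CPU demand fits `capacity`,
    always taking tuples from the query with the highest SIC, which
    drives the SIC values of the local queries towards equality."""
    load = lambda: sum(q["rate"] * q["cost"] for q in queries.values())
    shed = {name: 0.0 for name in queries}
    while load() > capacity:
        name = max(queries, key=lambda n: queries[n]["sic"])
        q = queries[name]
        drop = q["rate"] * step                      # shed a small slice of its input
        q["sic"] *= (q["rate"] - drop) / q["rate"]   # assumed proportional SIC loss
        q["rate"] -= drop
        shed[name] += drop
    return shed

# Two identical queries, each needing 1.0 CPU, on a node with 1.5 CPU spare.
queries = {
    "top5": {"rate": 100.0, "cost": 0.01, "sic": 1.0},
    "avg":  {"rate": 100.0, "cost": 0.01, "sic": 1.0},
}
dropped = equalise_sic(queries, capacity=1.5)
```

After the loop both queries end up with (almost) the same SIC, which is exactly the local fairness goal above.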
SIC Fair Shedder
[Figure: the fair shedder sits on the input buffer, in front of an operator's processing threads, and filters data before it reaches them]
- The shedder projects the effect of SIC loss on the query result
- Query SIC → local decisions for global convergence
- An online cost model estimates the time to process an average tuple → handles node heterogeneity
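The online cost model above can be as simple as a moving average over observed batch processing times. A minimal sketch, assuming the shedder can time each processed batch (the class name and smoothing factor are illustrative, not from the paper):

```python
class TupleCostModel:
    """Estimate the time to process an average tuple on this node with an
    exponentially weighted moving average, so that heterogeneous nodes
    each learn their own per-tuple cost online."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha   # weight given to the newest sample
        self.cost = None     # seconds per tuple; unknown until first batch

    def observe(self, batch_seconds, n_tuples):
        sample = batch_seconds / n_tuples
        if self.cost is None:
            self.cost = sample
        else:
            self.cost = self.alpha * sample + (1 - self.alpha) * self.cost
        return self.cost

model = TupleCostModel(alpha=0.5)
model.observe(1.0, 10)        # first batch: 0.1 s per tuple
est = model.observe(2.0, 10)  # new sample 0.2 s, smoothed towards 0.15 s
```

Multiplying this per-tuple cost by a query's input rate gives the load estimate the shedder needs when deciding how much to drop.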
Single-Node Fairness
[Plots: mean SIC (most to least degraded) and Jain's index (least to most fair) as the number of queries grows from 30 to 330]
The SIC fair shedder scales gracefully with the number of queries.
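The fairness measure plotted above is Jain's index, which equals 1 when all queries have the same SIC and approaches 1/n as a single query dominates. A small self-contained sketch:

```python
def jains_index(values):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in (0, 1]."""
    n = len(values)
    return sum(values) ** 2 / (n * sum(x * x for x in values))

jains_index([0.5, 0.5, 0.5])  # -> 1.0: all queries equally degraded
jains_index([1.0, 0.0, 0.0])  # -> 1/3: one query keeps all its data
```

Applied to per-query SIC values, the index summarises in one number how evenly a shedder spread the degradation.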
Multi-Cluster Fairness
Setup: 18 nodes, 2,000 operators, mixed workload (cov, top-5, avg)
[Plot: fairness achieved by SIC-based shedding versus random shedding, from most fair to least fair]
SIC fairness achieves better fairness than random shedding.
Conclusions and Future Work
1. Data shedding to address continuous overload
2. A simple SIC metric to measure source data contribution
3. Global fairness convergence with local informed decisions
Future Work:
1. Data shedding for approximate, controlled computing
2. Semantic shedding
Thank you! Questions?
evangelia.kalyvianaki.1@city.ac.uk
http://www.staff.city.ac.uk/~sbbj913/