THEMIS: Fairness in Data Stream Processing under Overload




Evangelia Kalyvianaki, City University London, UK
Marco Fiscato, Imperial College London, UK
Theodoros Salonidis, IBM Research, USA
Peter R. Pietzuch, Imperial College London, UK (prp@doc.ic.ac.uk)
Systems Support for Big Data and Social Graphs, CTR, King's College London, 2014

The Puzzle of Big Data Processing Tools
Popular online applications rely on a plethora of tools to support users' queries. For example:
1. LinkedIn supports [Sigmod13]:
   a. online data centres → Avatara, Voldemort, Kafka
   b. offline data centres → Azkaban, Hadoop, Kafka
2. Conviva:
   a. Spark, Hadoop, Hive for historical analysis
   b. Spark Streaming for real-time analysis
Data centres run multiple different big data processing engines side by side.

The Puzzle of Big Data Real-Time Processing Engines in Data Centres
[Figure: a data center hosting several clusters that run Twitter Storm, SEEP, Spark Streaming and Apache S4]
Queries overload data center resources.
How to efficiently allocate resources across clusters/engines?

Data Shedding
A well-known technique to handle transient overload conditions is to discard data.
[Figure: a data center running Twitter Storm, SEEP, Spark Streaming and Apache S4, with some engines marked as overloaded]
How to control shedding across clusters/engines and queries in a distributed and fair manner?

Fairness in Data Stream Processing under Overload
Key Contributions:
1. SIC processing metric to quantify shedding in a query-agnostic way
2. Use of SIC to pass shedding information among clusters
3. Distributed SIC fairness to address continuous overload
Outline:
1. SIC processing quality metric
2. SIC fairness distributed algorithm
3. SIC fairness shedder
4. Evaluation results
5. Future work

Source Information Content (SIC) Metric
The SIC metric measures the contribution of data from sources to results.
[Figure: sources send tuples through UDOs to the results; each tuple is annotated with a fractional SIC value such as 1/4, 1/3 or 1/2]
Perfect processing: result SIC = 1 + 1 + 1 = 3
Degraded processing: result SIC = 4/6 + 1/2 + 2/3 = 11/6 < 3
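The arithmetic on this slide can be reproduced with a minimal sketch. It assumes, based only on the 1/4, 1/3 and 1/2 annotations in the figure, that the tuples a source emits in a window share one unit of SIC equally, and that a result's SIC is the sum of the SIC of the source tuples that reached it. The function names are hypothetical and are not taken from THEMIS.

```python
from fractions import Fraction


def source_tuple_sic(tuples_in_window):
    """SIC carried by one source tuple (assumption: equal share of 1 unit)."""
    return Fraction(1, tuples_in_window)


def result_sic(kept_per_source, emitted_per_source):
    """Sum the SIC of the tuples that survived shedding, over all sources."""
    total = Fraction(0)
    for kept, emitted in zip(kept_per_source, emitted_per_source):
        if emitted == 0:
            continue  # a silent source contributes nothing
        total += kept * source_tuple_sic(emitted)
    return total


# No shedding: every source contributes its full unit of SIC.
perfect = result_sic([4, 3, 2], [4, 3, 2])

# Shedding some tuples lowers the total below the perfect value.
degraded = result_sic([2, 3, 1], [4, 3, 2])

# The degraded total quoted on the slide, checked exactly.
slide_degraded = Fraction(4, 6) + Fraction(1, 2) + Fraction(2, 3)
```

Exact fractions are used so that the slide's total of 11/6 can be compared without floating-point rounding.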

SIC Correlation to Result Correctness
[Figure: three panels (AVG, COUNT, MAX) plotting mean absolute error against SIC values from 0 to 1, for the gaussian, uniform, exponential, mixed and planetlab datasets]
The SIC metric correlates with result correctness. The correlation is stronger for certain types of queries.
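One way to explore this relationship is a small experiment like the sketch below. It is an illustration only: it uses uniform random shedding on a single window, treats the kept fraction as a stand-in for the SIC value, and reports relative error, which may differ from the error normalisation used in the plots. All names are hypothetical.

```python
import random


def shed(values, keep_fraction, rng):
    """Keep a uniformly random subset of the window (assumed shedding policy)."""
    keep = round(len(values) * keep_fraction)
    return rng.sample(values, keep)


def relative_error(exact, approx):
    """Absolute error normalised by the exact answer."""
    return abs(exact - approx) / abs(exact) if exact else 0.0


def errors_for(values, keep_fraction, seed=0):
    """Relative error of AVG, COUNT and MAX after shedding one window."""
    rng = random.Random(seed)
    kept = shed(values, keep_fraction, rng)
    if not kept:
        # Everything was shed, so no result can be produced.
        return {"AVG": 1.0, "COUNT": 1.0, "MAX": 1.0}
    return {
        "AVG": relative_error(sum(values) / len(values), sum(kept) / len(kept)),
        "COUNT": relative_error(len(values), len(kept)),
        "MAX": relative_error(max(values), max(kept)),
    }


window = list(range(1, 101))
no_shedding = errors_for(window, 1.0)
half_shed = errors_for(window, 0.5)
```

Under these assumptions the COUNT error equals the shed fraction exactly, while the AVG and MAX errors depend on which tuples happen to be dropped.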

Fair Shedding for Equalising SIC Values
Each local shedder equalises the SIC values of its own queries.
Global coordination is achieved with local informed shedding.
[Figure: a data center in which Twitter Storm, SEEP, Spark Streaming and Apache S4 each run a fair shedder; results are passed on together with the local SIC]
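The idea of equalising SIC values on a single node can be sketched as a greedy max-min allocation. This is an assumed stand-in, not the THEMIS algorithm: the node admits one tuple at a time, always from the query whose kept SIC is currently lowest, until its capacity for the interval is used up. The function `fair_shed` and its parameters are hypothetical.

```python
import heapq
from fractions import Fraction


def fair_shed(pending, capacity):
    """Pick tuples to keep so that per-query SIC values stay as equal as possible.

    pending:  dict mapping query id -> list of per-tuple SIC values
    capacity: number of tuples the node can process in this interval
    Returns a dict mapping query id -> total SIC of the tuples kept.
    """
    kept = {q: Fraction(0) for q in pending}
    next_index = {q: 0 for q in pending}
    # Min-heap ordered by the SIC each query has accumulated so far.
    heap = [(Fraction(0), q) for q in pending if pending[q]]
    heapq.heapify(heap)
    while capacity > 0 and heap:
        sic, q = heapq.heappop(heap)
        sic += pending[q][next_index[q]]
        next_index[q] += 1
        kept[q] = sic
        capacity -= 1
        if next_index[q] < len(pending[q]):
            heapq.heappush(heap, (sic, q))
    return kept


pending = {
    "q1": [Fraction(1, 4)] * 4,  # four tuples worth 1/4 each
    "q2": [Fraction(1, 2)] * 2,  # two tuples worth 1/2 each
}
kept = fair_shed(pending, capacity=3)
```

With room for three tuples, this sketch keeps two tuples of `q1` and one of `q2`, so both queries end the interval with the same SIC. Everything not admitted is shed.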

SIC Fair Shedder
[Figure: an input buffer feeds the fair shedder, which passes tuples to the operator processing threads that produce the output data]
The shedder projects the effect of SIC loss onto the result query SIC → local decisions for global convergence.
An online cost model estimates the time to process an average tuple → accounts for node heterogeneity.
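A minimal sketch of such an online cost model follows. It assumes an exponentially weighted moving average over measured batch times, which is a choice made for this example and is not confirmed by the slides. The class name, `alpha` and the method names are hypothetical.

```python
class TupleCostModel:
    """Running estimate of the time needed to process one average tuple."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha        # weight of the newest measurement (assumed)
        self.avg_seconds = None   # no estimate until the first observation

    def observe(self, batch_seconds, batch_tuples):
        """Update the estimate from one measured batch."""
        if batch_tuples <= 0:
            return
        sample = batch_seconds / batch_tuples
        if self.avg_seconds is None:
            self.avg_seconds = sample
        else:
            self.avg_seconds = (
                self.alpha * sample + (1 - self.alpha) * self.avg_seconds
            )

    def capacity(self, interval_seconds):
        """Number of tuples this node can process in the next interval."""
        if not self.avg_seconds:
            return 0
        return int(interval_seconds / self.avg_seconds)


model = TupleCostModel(alpha=0.5)
model.observe(batch_seconds=1.0, batch_tuples=4)  # 0.25 s per tuple
model.observe(batch_seconds=3.0, batch_tuples=4)  # 0.75 s per tuple
tuples_next_interval = model.capacity(interval_seconds=2.0)
```

Because each node measures its own processing times, a slower node ends up with a lower capacity and sheds more, without any node-specific configuration.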

Single-Node Fairness
[Figure: mean SIC (from most degraded to least degraded) and mean Jain's index (from least fair to most fair) plotted against the number of queries, from 30 to 330]
The SIC fair shedder scales gracefully with the number of queries.
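Jain's index, used as the fairness measure in this plot, has the standard definition J(x) = (sum of x_i)^2 / (n * sum of x_i^2). It equals 1 when all values are equal and 1/n when a single value gets everything. A small sketch follows, with the hypothetical function name `jains_index` applied to per-query SIC values:

```python
def jains_index(values):
    """Jain's fairness index of a non-empty list of non-negative values."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    total = sum(values)
    squares = sum(v * v for v in values)
    if squares == 0:
        return 1.0  # all values are zero, so they are trivially equal
    return (total * total) / (n * squares)


equal_sic = jains_index([0.5, 0.5, 0.5])      # every query degraded equally
skewed_sic = jains_index([1.0, 0.0, 0.0, 0.0])  # one query keeps everything
```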

Multi-Cluster Fairness
Setup: 18 nodes, 2,000 operators, mixed workload (cov, top-5, avg)
[Figure: fairness of SIC fair shedding compared with random shedding, on an axis from least fair to most fair]
Equalising SIC values gives fairer results than random shedding.

Conclusions and Future Work
Conclusions:
1. Data shedding to address continuous overload
2. A simple SIC metric to measure source data contribution
3. Global fairness convergence with local informed decisions
Future Work:
1. Data shedding for approximate, controlled computing
2. Semantic shedding
Thank you! Questions?
evangelia.kalyvianaki.1@city.ac.uk
http://www.staff.city.ac.uk/~sbbj913/