Load Balancing for Distributed Stream Processing Engines. Muhammad Anis Uddin Nasir EMDC 2011-13





About me
- EMDC alumnus, batch of 2011 (the party batch)
- Currently a PhD student at KTH Royal Institute of Technology, supervised by Šarūnas Girdzijauskas
- Internships: former intern at Yahoo Labs Barcelona; intern at Aalto University


Why PhD?
- Freedom
- Work timings
- Set of problems
- Flexibility
- Conferences
- Internships

Research Interest
- Stream Processing
- Social Network Analysis
- Distributed Systems
- Decentralized Online Social Networks

Stream Processing Engines
- Streaming applications: online machine learning, real-time query processing, continuous computation
- Streaming frameworks: Storm, Borealis, S4, Samza, Spark Streaming

Stream Processing Model
- Streaming applications are represented by Directed Acyclic Graphs (DAGs)
- [Figure labels: Source, Data Stream, Operators, Data Channels]

Stream Grouping
- Key or Fields Grouping: hash-based assignment; used for stateful operations, e.g., page rank, degree count
- Shuffle Grouping: round-robin assignment; used for stateless operations, e.g., data logging, OLTP
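The two groupings above can be sketched in a few lines. This is a minimal illustration, not the code of any particular engine: the function names, the use of CRC32 as the hash, and the toy stream are all assumptions made for the example.

```python
import itertools
import zlib
from collections import Counter


def key_grouping(key: str, num_workers: int) -> int:
    """Hash-based assignment: all messages with the same key reach one worker."""
    # CRC32 is an assumed stand-in for whatever hash a real engine uses.
    return zlib.crc32(key.encode("utf-8")) % num_workers


def make_shuffle_grouping(num_workers: int):
    """Round-robin assignment: the key is ignored and workers are used in turn."""
    turn = itertools.cycle(range(num_workers))

    def route(_key: str) -> int:
        return next(turn)

    return route


# Hypothetical skewed stream: key "a" dominates.
stream = ["a"] * 6 + ["b", "c", "d"]

kg_load = Counter(key_grouping(k, 3) for k in stream)
shuffle = make_shuffle_grouping(3)
sg_load = Counter(shuffle(k) for k in stream)
```

With this stream, shuffle grouping gives each of the three workers exactly three messages, while key grouping sends all six "a" messages to a single worker, which is the load imbalance discussed on the following slides.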

Key Grouping
- Scalable
- Low memory
- Load imbalance

Shuffle Grouping
- Load balance
- Memory O(W)
- Aggregation O(W)

Problem Formulation
- Input is an unbounded sequence of messages drawn from a key distribution
- Each message is assigned to a worker for processing
- Load balance properties: memory load balance, network load balance, processing load balance
- Metric: load imbalance

Partial Key Grouping (PKG)
- Key Splitting: split each key across two workers; assign each message using the power of two choices
- Local Load Estimation: each source estimates the load on the workers using its local routing history
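The routing rule on this slide can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' implementation: the class name `PartialKeyGrouper`, the helper `_hash`, and the use of seeded MD5 to obtain two hash functions are all choices made for the example.

```python
import hashlib


def _hash(key: str, seed: int, num_workers: int) -> int:
    """Seeded hash (assumed construction) mapping a key to a worker index."""
    digest = hashlib.md5(f"{seed}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


class PartialKeyGrouper:
    """One instance per source; no state is shared with other sources or workers."""

    def __init__(self, num_workers: int):
        self.num_workers = num_workers
        # Local load estimate: messages this source has sent to each worker.
        self.local_load = [0] * num_workers

    def candidates(self, key: str) -> tuple:
        """Key splitting: every key has two candidate workers."""
        return (_hash(key, 1, self.num_workers), _hash(key, 2, self.num_workers))

    def route(self, key: str) -> int:
        """Power of two choices: send to the less loaded of the two candidates."""
        first, second = self.candidates(key)
        chosen = first if self.local_load[first] <= self.local_load[second] else second
        self.local_load[chosen] += 1
        return chosen


grouper = PartialKeyGrouper(num_workers=4)
destinations = {grouper.route("hot-key") for _ in range(1000)}
```

Whatever the stream, `destinations` for a single key never contains more than the two candidate workers, and the load estimate is built purely from the source's own routing decisions.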

Partial Key Grouping (PKG): Key Splitting and Local Load Estimation
[Figure: two sources with per-worker load counters; values not preserved in the transcript.]

Partial Key Grouping (PKG)
- Key splitting: distributed, stateless, handles skew
- Local load estimation: no coordination among sources, no communication with workers

Partial Key Grouping (PKG)
- Load balance
- Memory O(1)
- Aggregation O(1)

Streaming Applications
- Most algorithms that use Shuffle Grouping can be expressed using Partial Key Grouping to reduce memory footprint and aggregation overhead
- Algorithms that use Key Grouping can be rewritten to achieve load balance
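To make the aggregation overhead concrete, here is a small word-count style sketch. The worker states are hand-written, hypothetical data, and the helper names are assumptions. Under PKG each key can appear in the state of at most two workers, so the final merge touches at most two partial counts per key; under shuffle grouping a key could appear on all W workers.

```python
from collections import Counter


def merge_partial_counts(worker_states):
    """Aggregation step: sum the partial counts held by the workers."""
    total = Counter()
    for state in worker_states:
        total.update(state)  # Counter.update adds counts key by key
    return total


def partials_per_key(worker_states):
    """For each key, the number of workers holding a partial count for it."""
    holders = Counter()
    for state in worker_states:
        for key in state:
            holders[key] += 1
    return holders


# Hypothetical per-worker state after PKG routing: each key sits on <= 2 workers.
pkg_states = [
    Counter({"a": 3, "b": 1}),
    Counter({"a": 2}),
    Counter({"b": 4}),
]

totals = merge_partial_counts(pkg_states)
spread = partials_per_key(pkg_states)
```

Here `totals` holds 5 for both "a" and "b", and `spread` shows that each key had to be merged from two partial counts.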

Stream Grouping: Summary

Grouping             | Pros                                                   | Cons
Key Grouping         | Scalable; low memory                                   | Load imbalance
Shuffle Grouping     | Load balance                                           | Memory O(W); aggregation O(W)
Partial Key Grouping | Scalable; load balance; memory O(1); aggregation O(1)  |

Experiments
- What is the effect of key splitting on the power of two choices (POTC)?
- How does local estimation compare to a global oracle?
- How does PKG perform on a real deployment on Apache Storm?

Experimental Setup
- Metric: the difference between the maximum and the average load of the workers at time t
- Datasets:
  - Twitter, 1.2G tweets (crawled July 2012)
  - Wikipedia, M access logs
  - Twitter, 690K cashtags (crawled Nov 2013)
  - Social Networks, 69M edges
  - Synthetic, M keys
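The metric defined on this slide is simple to compute. A minimal sketch, assuming the load of each worker is given as a message count; the function name `load_imbalance` is an assumption for the example.

```python
def load_imbalance(loads):
    """Imbalance at time t: maximum worker load minus average worker load."""
    return max(loads) - sum(loads) / len(loads)


# Hypothetical snapshots of per-worker message counts.
skewed = load_imbalance([6, 2, 1])    # 6 - 3 = 3.0
balanced = load_imbalance([3, 3, 3])  # 3 - 3 = 0.0
```

A perfectly balanced assignment has imbalance zero, and the value grows as one worker receives more than its share.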

Effect of Key Splitting

Local Load Estimation

Real deployment: Apache Storm
[Figure: throughput (keys/s) for PKG, SG, and KG. Panel (a): x-axis CPU delay (ms). Panel (b): x-axis memory (keys), with points labeled by period (30s, 60s, 300s, 600s). Plot data not preserved in the transcript.]


Does PKG work for highly skewed data?

Skewness
[Figure: two sources; image not preserved in the transcript.]

High-level ideas
- Split the high-frequency keys across more than two workers
- Challenges:
  - How to decide which keys are high frequency?
  - Across how many workers should the high-frequency keys be split?

Solution
- Divide the stream into two components: heavy hitters and the tail of the distribution
- Split heavy hitters across all the workers
- Use Greedy W-Choices or Round Robin for heavy hitters
- Use vanilla PKG for the tail
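The split between heavy hitters and tail can be sketched as below. Several details are assumptions, since the slides do not specify them: heavy hitters are detected with exact per-key counts against a frequency threshold (a real system would more likely use an approximate counter), the threshold value is arbitrary, and the names `SkewAwareGrouper` and `_hash` are hypothetical. "Greedy W-Choices" is interpreted here as picking the least loaded of all W workers.

```python
import hashlib
from collections import Counter


def _hash(key: str, seed: int, num_workers: int) -> int:
    """Seeded hash (assumed construction) mapping a key to a worker index."""
    digest = hashlib.md5(f"{seed}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


class SkewAwareGrouper:
    def __init__(self, num_workers: int, threshold: float):
        self.num_workers = num_workers
        self.threshold = threshold        # assumed: fraction of messages seen so far
        self.local_load = [0] * num_workers
        self.key_counts = Counter()       # exact counts, an assumption of this sketch
        self.seen = 0

    def is_heavy_hitter(self, key: str) -> bool:
        return self.key_counts[key] > self.threshold * self.seen

    def route(self, key: str) -> int:
        self.seen += 1
        self.key_counts[key] += 1
        if self.is_heavy_hitter(key):
            # Greedy W-Choices: least loaded worker among all W.
            chosen = min(range(self.num_workers), key=self.local_load.__getitem__)
        else:
            # Tail keys: vanilla PKG, the less loaded of two hashed candidates.
            a = _hash(key, 1, self.num_workers)
            b = _hash(key, 2, self.num_workers)
            chosen = a if self.local_load[a] <= self.local_load[b] else b
        self.local_load[chosen] += 1
        return chosen


grouper = SkewAwareGrouper(num_workers=5, threshold=0.2)
for _ in range(1000):
    grouper.route("hot-key")
```

With a single dominant key, plain PKG would confine the load to two workers; in this sketch the key is classified as a heavy hitter and spread evenly over all five. Note that the first few messages of any key are classified as heavy hitters until enough of the stream has been seen, a warm-up effect of the simple threshold test used here.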


Experiments: Threshold
[Figure: load imbalance I(t) (messages) for W-Choices and Round Robin (RR) under different thresholds expressed as fractions of W, with one panel per number of workers (including 5 and 50). Plot data not preserved in the transcript.]

Experiments: Load Imbalance
[Figure: load imbalance I(t) (messages) for PKG, W-Choices, and RR, with panels by number of workers (including 5 and 50) and number of unique items (up to 1M). Plot data not preserved in the transcript.]

Conclusion
- Partial Key Grouping (PKG) reduces the load imbalance by up to seven orders of magnitude compared to Key Grouping
- PKG imposes constant memory and aggregation overhead, i.e., O(1), compared to Shuffle Grouping, which is O(W)
- W-Choices improves on PKG to achieve nearly perfect load balance for highly skewed input
