Load Balancing in Stream Processing Engines. Anis Nasir Coauthors: Gianmarco, Nicolas, David, Marco

Transcription:


Stream Processing Engines
- Online machine learning
- Real-time query processing
- Continuous computation
- Distributed RPC

Stream Processing Engines
Streaming applications are represented by directed acyclic graphs (DAGs).
[Figure: DAG of two sources feeding three workers]

Stream Grouping
- Key or Fields Grouping
  - Hash-based assignment
  - Stateful operations, e.g., page rank, degree count
- Shuffle Grouping
  - Round-robin assignment
  - Stateless operations, e.g., data logging, OLTP
- All Grouping
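The two main groupings can be sketched in a few lines. This is only an illustration, not the routing code of any particular engine: the function names, the CRC32 hash, and the toy stream with one hot key are all assumptions made for the example.

```python
import itertools
import zlib
from collections import Counter


def key_grouping(key: str, num_workers: int) -> int:
    """Hash-based assignment: the same key always maps to the same worker."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


def make_shuffle_grouping(num_workers: int):
    """Round-robin assignment: messages are spread evenly, ignoring the key."""
    counter = itertools.count()

    def route(_key: str) -> int:
        return next(counter) % num_workers

    return route


# Hypothetical skewed stream: one hot key accounts for 80% of the messages.
stream = ["hot"] * 80 + [f"k{i}" for i in range(20)]
num_workers = 4

kg_load = Counter(key_grouping(k, num_workers) for k in stream)
shuffle = make_shuffle_grouping(num_workers)
sg_load = Counter(shuffle(k) for k in stream)

print("key grouping load:    ", dict(kg_load))
print("shuffle grouping load:", dict(sg_load))
```

Under key grouping the worker that owns the hot key receives at least 80 of the 100 messages, while round-robin gives every worker exactly 25 but scatters each key over all workers.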

Data Distribution
- Skewed distributions: power-law, Zipf distribution, log-normal, scale-free network
- Example applications: social networks, web users, economy, biology

Key Grouping
[Figure: two sources routing to three workers]

Shuffle Grouping
[Figure: two sources routing to three workers, followed by an aggregator]

Stream Grouping
- Key Grouping
  - Efficient routing
  - Load imbalance
- Shuffle Grouping
  - Load balance
  - Additional memory
  - Additional aggregation phase

A possible solution: dynamic load rebalancing
- Detect load imbalance
- Perform data migration
Challenges
- How often to check for load imbalance
- Migration is not directly supported by most DSPEs and requires extra modification
- State management for stateful operations

Power of two choices (POTC)
- Balls-and-bins problem
- Algorithm
  - For each ball, pick two bins uniformly at random
  - Assign the ball to the least loaded of the two bins
- Issues
  - Non-uniform key distribution
  - Load information
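The balls-and-bins algorithm above can be simulated directly. The sketch below compares it with plain single-choice placement. The function names, the seeds, and the choice to draw the two bins with replacement are assumptions made for the illustration.

```python
import random


def one_choice(num_balls: int, num_bins: int, rng: random.Random) -> list:
    """Baseline: each ball goes to a single bin chosen uniformly at random."""
    loads = [0] * num_bins
    for _ in range(num_balls):
        loads[rng.randrange(num_bins)] += 1
    return loads


def two_choices(num_balls: int, num_bins: int, rng: random.Random) -> list:
    """POTC: sample two bins, place the ball in the less loaded one."""
    loads = [0] * num_bins
    for _ in range(num_balls):
        a = rng.randrange(num_bins)
        b = rng.randrange(num_bins)  # drawn with replacement (assumption)
        target = a if loads[a] <= loads[b] else b
        loads[target] += 1
    return loads


num_balls, num_bins = 100_000, 100
one = one_choice(num_balls, num_bins, random.Random(1))
two = two_choices(num_balls, num_bins, random.Random(2))

average = num_balls / num_bins
print("one choice:  max - avg =", max(one) - average)
print("two choices: max - avg =", max(two) - average)
```

The simulation assumes every ball is distinct. With repeated keys, as in a stream, the issues listed above appear: all messages of a key must agree on a worker, and the load of each worker has to be known somewhere.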

Power of two choices
[Figure: two sources routing to three workers]
Issues:
- Consensus on keys
- Load information
- Load imbalance

Partial Key Grouping (PKG)
The Power of Both Choices: Key Splitting
- Split each key across two servers
- Assign each instance using the power of two choices
Benefits:
- Decentralized
- Stateless
- Handles skew

Partial Key Grouping
Local load estimation
- Each source estimates the load on workers using its local routing history
Benefits:
- No coordination among sources
- No communication with workers
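Key splitting and local load estimation fit together in a small router. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class name `PartialKeyGrouper`, the seeded SHA-256 hashes, and the toy stream are hypothetical.

```python
import hashlib


class PartialKeyGrouper:
    """One instance per source; it only sees its own routing history."""

    def __init__(self, num_workers: int):
        self.num_workers = num_workers
        # Local estimate: messages this source has sent to each worker.
        self.local_load = [0] * num_workers

    def _hash(self, seed: str, key: str) -> int:
        digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_workers

    def candidates(self, key: str) -> tuple:
        """Two candidate workers per key, from two seeded hash functions."""
        return self._hash("h1", key), self._hash("h2", key)

    def route(self, key: str) -> int:
        first, second = self.candidates(key)
        # Pick the candidate that this source believes is less loaded.
        if self.local_load[first] <= self.local_load[second]:
            target = first
        else:
            target = second
        self.local_load[target] += 1
        return target


grouper = PartialKeyGrouper(num_workers=4)
stream = ["hot"] * 80 + [f"k{i}" for i in range(20)]
routes = [(key, grouper.route(key)) for key in stream]

hot_workers = {worker for key, worker in routes if key == "hot"}
print("workers handling the hot key:", sorted(hot_workers))
print("local load estimate:", grouper.local_load)
```

Because the candidates are derived from the key alone, every source computes the same pair without coordination, and the hot key is confined to at most two workers.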

Partial Key Grouping
[Figure: two sources routing to three workers, followed by an aggregator]

Analysis: Problem Formulation
- $n$ workers -> bins
- keys $k_i \in K$ -> colors
- $m$ messages -> colored balls
- Minimize the difference between the maximum and the average workload

Analysis: Key Distribution
- We pick each key $k_i \in K$ with probability $p_i$ from the distribution $D$, where $p_1 \ge p_2 \ge p_3 \ge \dots$
- Maximum load is proportional to the most frequent key (with probability $p_1$)
- If $p_1 > 2/n$, the expected imbalance will be lower bounded by $I(m) = \left(\frac{p_1}{2} - \frac{1}{n}\right) m$

Analysis
Assume a key distribution $D$ with maximum probability $p_1 \le 2/n$. Then the imbalance after $m$ steps of the Greedy-$d$ process satisfies, with probability at least $1 - 1/n$:
[Equation not recovered]

Analysis: An example with four workers
- In the ideal scenario, each worker should handle 25% of the keys
- We need to consider three cases:
  - When $p_1 = 2/4 = 0.5$
  - When $p_1 > 0.5$
  - When $p_1 < 0.5$
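The three cases can be checked numerically against the lower bound $(p_1/2 - 1/n)\,m$. The reasoning behind it: the two workers that share the most frequent key receive at least $p_1 m$ messages between them, so one of them gets at least $p_1 m / 2$, while the average load is $m/n$. The helper below is a hypothetical sketch; clamping at zero when $p_1 \le 2/n$ is an assumption made only because an imbalance cannot be negative.

```python
def imbalance_lower_bound(p1: float, n: int, m: int) -> float:
    """Lower bound (p1/2 - 1/n) * m on the expected imbalance.

    The bound is informative only when p1 > 2/n; otherwise it is clamped
    to zero (assumption for this sketch: imbalance is never negative).
    """
    return max(0.0, (p1 / 2 - 1 / n) * m)


n, m = 4, 1000  # four workers, 1000 messages (illustrative values)
for p1 in (0.3, 0.5, 0.8):
    bound = imbalance_lower_bound(p1, n, m)
    print(f"p1 = {p1}: imbalance of at least {bound:.1f} messages")
```

With four workers the threshold is $p_1 = 0.5$. At or below it the bound is zero, so balance is not ruled out. Above it, for example at $p_1 = 0.8$, the bound is $0.15 \times 1000 = 150$ messages, and no assignment that splits a key over only two workers can do better.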


Applications
- Most algorithms that use Shuffle Grouping can be expressed using Partial Key Grouping to reduce:
  - Memory footprint
  - Aggregation overhead
- Algorithms that use Key Grouping can be rewritten to achieve load balance

Examples
- Naïve Bayes Classifier
- Streaming Parallel Decision Trees
- Heavy Hitters and Space Saving

Naïve Bayes Classifier
- Counts co-occurrences of each feature and class value
- Key Grouping
  - Vertical parallelism: each feature is tracked by a single worker process
- Shuffle Grouping
  - Horizontal parallelism: each feature is tracked by all worker processes
- Partial Key Grouping
  - Each feature is tracked by exactly two processes
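A rough sketch of how co-occurrence counting might look when each feature is tracked by exactly two processes. Everything here is hypothetical (the helper names, the seeded hashes, the in-process lists standing in for workers). The point is only that a count is recovered by summing two partial counters, regardless of the number of workers.

```python
import hashlib
from collections import Counter

NUM_WORKERS = 4
# Each "worker" keeps partial counts of (feature, class) co-occurrences.
worker_counts = [Counter() for _ in range(NUM_WORKERS)]
# The source's local estimate of how many messages each worker received.
sent = [0] * NUM_WORKERS


def two_candidates(feature: str) -> tuple:
    """Two candidate workers for a feature, from two seeded hashes."""
    def h(seed: str) -> int:
        digest = hashlib.sha256(f"{seed}:{feature}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % NUM_WORKERS
    return h("h1"), h("h2")


def observe(feature: str, label: str) -> None:
    """Route one (feature, class) observation to the less loaded candidate."""
    first, second = two_candidates(feature)
    target = first if sent[first] <= sent[second] else second
    sent[target] += 1
    worker_counts[target][(feature, label)] += 1


def cooccurrence(feature: str, label: str) -> int:
    """Merge the partial counts from the (at most) two workers of a feature."""
    workers = set(two_candidates(feature))  # set(): candidates may coincide
    return sum(worker_counts[w][(feature, label)] for w in workers)


for _ in range(30):
    observe("free", "spam")
for _ in range(5):
    observe("free", "ham")
for _ in range(10):
    observe("meeting", "ham")

print("count(free, spam) =", cooccurrence("free", "spam"))
print("count(free, ham)  =", cooccurrence("free", "ham"))
```

Reading a count touches two workers at most, compared with all workers under Shuffle Grouping and one worker under Key Grouping.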

Stream Groupings: A summary

Stream Grouping        Pros                     Cons
Key Grouping           Scalable                 Load imbalance
Shuffle Grouping       Load balance             Memory overhead; aggregation O(W)
Partial Key Grouping   Scalable; load balance   Memory cost; aggregation O(1)

Experiments
- What is the effect of key splitting on POTC?
- How does local estimation compare to a global oracle?
- How robust is Partial Key Grouping?
- How does PKG perform on a real deployment on Apache Storm?

Metric
Load imbalance: the difference between the maximum and the average load of the workers at time t
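Written as a formula, the metric could read as follows. The symbol $L_i(t)$ for the load of worker $i$ at time $t$ and the use of $n$ for the number of workers are notation assumed here for illustration.

```latex
I(t) = \max_{1 \le i \le n} L_i(t) \;-\; \frac{1}{n} \sum_{i=1}^{n} L_i(t)
```

For instance, loads of 40, 30, 20, and 10 messages on four workers give an average of 25 and an imbalance of 15.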

Effect of Key Splitting
[Figure: panels for Wikipedia (WP) and Twitter (TW); x-axis: workers]

Local Load Estimation

Robustness
- Changing trends in data: cashtags
  - Used in the stock market to identify a publicly traded company, e.g., $AAPL for Apple
- Skewed load at source: social networks
  - Test different data distributions at the sources

Robustness
[Figure: panels for 5, 10, 50, and 100 workers]

Robustness
[Figure: imbalance I (messages) versus number of workers (5, 10, 50, 100); series: Uniform and Skewed for L 5, L 10, L 15, L 20]

Real deployment: Apache Storm
[Figure: throughput (keys/s) for PKG, SG, and KG; (a) x-axis: CPU delay (ms); (b) x-axis: memory (keys), with points labeled 10s to 600s]

Future Work
- Dynamic load balancing
  - Key migration: moving top-k keys across workers
  - Partition migration: moving a subset of keys across workers
- Handling worker churn with Partial Key Grouping
- Applying queuing theory with Partial Key Grouping for load balancing

Conclusion
- Partial Key Grouping (PKG) reduces the load imbalance by up to seven orders of magnitude compared to Key Grouping
- PKG imposes constant memory and aggregation overhead, i.e., O(1), compared to Shuffle Grouping, which is O(W)
- Apache Storm
  - 60% improvement in throughput
  - 45% improvement in latency


Datasets
- Twitter, 1.2G tweets (crawled July 2012)
- Wikipedia, 22M access logs
- Twitter, 690K cashtags (crawled Nov 2013)
- Social Networks, 69M edges
- Synthetic, 10M keys

Streaming Parallel Decision Tree
- Uses Shuffle Grouping, where each worker generates a histogram
- An aggregator is used to merge the results from W workers
- Memory footprint is proportional to W workers
- Partial Key Grouping reduces the overhead to 2 workers

Heavy Hitters and Space Saving
- Solve top-k items in constant time and space
- Key Grouping
  - Error depends on a single estimator
  - Poor load balancing
- Shuffle Grouping
  - Error bound depends on the number of workers
- Partial Key Grouping
  - Better load balancing
  - Better estimation
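For reference, the per-worker estimator can be sketched as below. This is a simplified rendering of the Space Saving idea with hypothetical names. It finds the minimum counter with a linear scan, so it does not achieve the constant-time updates of the full algorithm, which relies on a dedicated stream-summary structure.

```python
class SpaceSaving:
    """Tracks at most `capacity` counters; estimates never undercount."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.counts = {}  # item -> estimated count
        self.errors = {}  # item -> maximum overestimation of that count

    def offer(self, item) -> None:
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # Evict the item with the smallest counter and inherit its count.
            victim = min(self.counts, key=self.counts.get)  # O(capacity) scan
            floor = self.counts.pop(victim)
            self.errors.pop(victim)
            self.counts[item] = floor + 1
            self.errors[item] = floor

    def top(self, k: int) -> list:
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


# Illustrative stream: two heavy keys followed by a tail of rare keys.
stream = ["a"] * 50 + ["b"] * 30 + [f"x{i}" for i in range(20)]
summary = SpaceSaving(capacity=5)
for item in stream:
    summary.offer(item)

print("top-2:", summary.top(2))
```

Under Partial Key Grouping a key's occurrences are split over two such summaries, so a query for that key would merge two estimates rather than one per worker.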