Load Balancing for Distributed Stream Processing Engines. Muhammad Anis Uddin Nasir EMDC

Save this PDF as:
 WORD  PNG  TXT  JPG

Size: px
Start display at page:

Download "Load Balancing for Distributed Stream Processing Engines. Muhammad Anis Uddin Nasir EMDC 2011-13"

Transcription

1 Load Balancing for Distributed Stream Processing Engines Muhammad Anis Uddin Nasir EMDC

2 About me Ex EMDC from Batch 011 (the party batch) Currently PhD Student at KTH Royal Institute of Technology Under supervision of Šarūnas Girdzijauskas Internship Ex-intern at Yahoo Labs Barcelona Intern at Aalto University

3 About me 3

4 About me 4

5 About me 5

6 About me 6

7 About me 7

8 About me 8

9 Why PhD? Freedom Work timings Set of problems Flexibility Conferences Internships 9

10 Research Interest Stream Processing Social Network Analysis Distributed Systems Decentralized Online Social Networks

11 Stream Processing Engines Streaming Application Online Machine Learning Real Time Query Processing Continuous Computation Streaming Frameworks Storm, Borealis, S4, Samza, Spark Streaming 11

12 Stream Processing Model Streaming Applications are represented by Directed Acyclic Graphs (DAGs) Source Data Stream Source Operators Data Channels 1

13 Stream Grouping Key or Fields Grouping Hash-based assignment Stateful operations, e.g., page rank, degree count Shuffle Grouping Round-robin assignment Stateless operations, e.g., data logging, OLTP 13

14 Key Grouping Key Grouping Scalable Low Memory Load Imbalance 14

15 Shuffle Grouping Shuffle Grouping Load Balance Memory O(W) Aggregation O(W) 15

16 Problem Formulation Input is a unbounded sequence of messages from a key distribution Each message is assigned to a worker for processing Load balance properties Memory Load Balance Network Load Balance Processing Load Balance Metric: Load Imbalance 16

17 Partial Key Grouping (PKG) Key Splitting Split each key into two server Assign each instance using power of two choices Local Load Estimation each source estimates load on workers using the local routing history 17

18 Partial Key Grouping (PKG) Key Splitting Local Load Estimation 00 1 Source 1 0 Source 18

19 Partial Key Grouping (PKG) Key Splitting Distributed Stateless Handle Skew Local load estimation No coordination among sources No communication with workers 19

20 Partial Key Grouping PKG Load Balance Memory O(1) Aggregation O(1) 0

21 Streaming Applications Most algorithms that use Shuffle Grouping can be expressed using Partial Key Grouping to reduce: Memory footprint Aggregation overhead Algorithms that use Key Grouping can be rewritten to achieve load balance 1

22 Stream Grouping: Summary Grouping Pros Cons Key Grouping Scalable Memory Load Balance Load Imbalance Shuffle Grouping Partial Key Grouping Memory O(W) Aggregation O(W) Scalable Load Balance Memory O(1) Aggregation O(1)

23 Experiments What is the effect of key splitting on POTC? How does local estimation compare to a global oracle? How does PKG perform on a real deployment on Apache Storm? 3

24 Experimental Setup Metric the difference of maximum and the average load of the workers at time t Datasets Twitter, 1.G tweets (crawled July 01) Wikipedia, M access logs Twitter, 690K cashtags (crawled Nov 013) Social Networks, 69M edges Synthetic, M keys 4

25 Effect of Key Splitting 5

26 Local Load Estimation 6

27 Throughput (keys/s) Real deployment: Apache Storm PKG SG 00 KG s 300s 60s s 30s 60s s PKG SG KG 30s s (a) CPU delay (ms) 600s (b) Memory (keys)

28 Load Balancing for Distributed Stream Processing Engines Muhammad Anis Uddin Nasir EMDC

29 Does PKG work for highly ed data? 9

30 Skewness Source Source 30

31 Skewness Source Source 31

32 Skewness Source Source 3

33 High level ideas Splitting the high frequency keys on more than two workers Challenges How to decide which are the high frequency keys? How many workers to split the high frequency keys? 33

34 Solution Divide stream into two components Heavy Hitters Tail Distribution Split Heavy Hitters on all the workers Use Greedy W-Choices or Round Robin for heavy hitters Use vanilla PKG for tail 34

35 Skewness Source Source 35

36 Experiments- Threshold 5 workers I(t) (messages) workers 0 W-Choice / W 1/ W 1/ W 1/4 W 1/8 W 1. W-Choice workers I(t) (messages) W-Choice RR workers 0 RR RR workers 0-1 W-Choice workers 0 0 workers 0 50 workers 0-7 RR 1. 36

37 Experiments- Load Imbalance 5 workers I(t) (messages) workers k unique items PKG W-Choice RR -1 - k unique items workers workers 0 0k unique items 0k unique items M unique items -7 1M unique items 1M unique items M unique items workers workers k unique items workers workers workers 0-1 k unique items workers 0 0k unique items k unique items workers 0-3 I(t) (messages) -6 I(t) (messages) 50 workers

38 Conclusion Partial Key Grouping (PKG) reduces the load imbalance by up to seven orders of magnitude compared to Key Grouping PKG imposes constant memory and aggregation overhead, i.e., O(1), compared to Shuffle Grouping that is O(W) W-Choices improves PKG to achieve nearly perfect load balance for highly ed input 38

39 Load Balancing for Distributed Stream Processing Engines Muhammad Anis Uddin Nasir EMDC

Load Balancing in Stream Processing Engines. Anis Nasir Coauthors: Gianmarco, Nicolas, David, Marco

Load Balancing in Stream Processing Engines. Anis Nasir Coauthors: Gianmarco, Nicolas, David, Marco Engines Anis Nasir Coauthors: Gianmarco, Nicolas, David, Marco Stream Processing Engines Online Machine Learning Real Time Query Processing ConCnuous ComputaCon Distributed RPC 2 Stream Processing Engines

More information

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University

More information

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform

More information

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA Real Time Fraud Detection With Sequence Mining on Big Data Platform Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA Open Source Big Data Eco System Query (NOSQL) : Cassandra,

More information

5 Performance Management for Web Services. Rolf Stadler School of Electrical Engineering KTH Royal Institute of Technology. stadler@ee.kth.

5 Performance Management for Web Services. Rolf Stadler School of Electrical Engineering KTH Royal Institute of Technology. stadler@ee.kth. 5 Performance Management for Web Services Rolf Stadler School of Electrical Engineering KTH Royal Institute of Technology stadler@ee.kth.se April 2008 Overview Service Management Performance Mgt QoS Mgt

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

CS 558 Internet Systems and Technologies

CS 558 Internet Systems and Technologies CS 558 Internet Systems and Technologies Dimitris Deyannis deyannis@csd.uoc.gr 881 Heat seeking Honeypots: Design and Experience Abstract Compromised Web servers are used to perform many malicious activities.

More information

Putting Apache Kafka to Use!

Putting Apache Kafka to Use! Putting Apache Kafka to Use! Building a Real-time Data Platform for Event Streams! JAY KREPS, CONFLUENT! A Couple of Themes! Theme 1: Rise of Events! Theme 2: Immutability Everywhere! Level! Example! Immutable

More information

F1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) 15-799 10/21/2013

F1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) 15-799 10/21/2013 F1: A Distributed SQL Database That Scales Presentation by: Alex Degtiar (adegtiar@cmu.edu) 15-799 10/21/2013 What is F1? Distributed relational database Built to replace sharded MySQL back-end of AdWords

More information

Map-Reduce for Machine Learning on Multicore

Map-Reduce for Machine Learning on Multicore Map-Reduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang,

More information

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional

More information

Fast Data in the Era of Big Data: Twitter s Real-

Fast Data in the Era of Big Data: Twitter s Real- Fast Data in the Era of Big Data: Twitter s Real- Time Related Query Suggestion Architecture Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, Jimmy Lin Presented by: Rania Ibrahim 1 AGENDA Motivation

More information

Online and Scalable Data Validation in Advanced Metering Infrastructures

Online and Scalable Data Validation in Advanced Metering Infrastructures Online and Scalable Data Validation in Advanced Metering Infrastructures Chalmers University of technology Agenda 1. Problem statement 2. Preliminaries Data Streaming 3. Streaming-based Data Validation

More information

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani Technical Architect - Big Data Syntel Agenda Welcome to the Zoo! Evolution Timeline Traditional BI/DW Architecture Where Hadoop Fits In 2 Welcome to

More information

Tolerating Late Timing Faults in Real Time Stream Processing Systems. Stuart Perks

Tolerating Late Timing Faults in Real Time Stream Processing Systems. Stuart Perks School of Computing FACULTY OF ENGINEERING Tolerating Late Timing Faults in Real Time Stream Processing Systems Stuart Perks Submitted in accordance with the requirements for the degree of Advanced Computer

More information

Graphalytics: A Big Data Benchmark for Graph-Processing Platforms

Graphalytics: A Big Data Benchmark for Graph-Processing Platforms Graphalytics: A Big Data Benchmark for Graph-Processing Platforms Mihai Capotă, Tim Hegeman, Alexandru Iosup, Arnau Prat-Pérez, Orri Erling, Peter Boncz Delft University of Technology Universitat Politècnica

More information

Cluster Computing. ! Fault tolerance. ! Stateless. ! Throughput. ! Stateful. ! Response time. Architectures. Stateless vs. Stateful.

Cluster Computing. ! Fault tolerance. ! Stateless. ! Throughput. ! Stateful. ! Response time. Architectures. Stateless vs. Stateful. Architectures Cluster Computing Job Parallelism Request Parallelism 2 2010 VMware Inc. All rights reserved Replication Stateless vs. Stateful! Fault tolerance High availability despite failures If one

More information

Big Data Analytics - Accelerated. stream-horizon.com

Big Data Analytics - Accelerated. stream-horizon.com Big Data Analytics - Accelerated stream-horizon.com StreamHorizon & Big Data Integrates into your Data Processing Pipeline Seamlessly integrates at any point of your your data processing pipeline Implements

More information

Architectures for massive data management

Architectures for massive data management Architectures for massive data management Apache Kafka, Samza, Storm Albert Bifet albert.bifet@telecom-paristech.fr October 20, 2015 Stream Engine Motivation Digital Universe EMC Digital Universe with

More information

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is

More information

Exploring Oracle E-Business Suite Load Balancing Options. Venkat Perumal IT Convergence

Exploring Oracle E-Business Suite Load Balancing Options. Venkat Perumal IT Convergence Exploring Oracle E-Business Suite Load Balancing Options Venkat Perumal IT Convergence Objectives Overview of 11i load balancing techniques Load balancing architecture Scenarios to implement Load Balancing

More information

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015 Pulsar Realtime Analytics At Scale Tony Ng April 14, 2015 Big Data Trends Bigger data volumes More data sources DBs, logs, behavioral & business event streams, sensors Faster analysis Next day to hours

More information

Gossip-based Partitioning and Replication Middleware for Online Social Networks

Gossip-based Partitioning and Replication Middleware for Online Social Networks Gossip-based Partitioning and Replication Middleware for Online Social Networks Muhammad Anis Uddin Nasir Master of Science Thesis Stockholm, Sweden 2013 TRITA-ICT-EX-2013:181 School of Information and

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day

STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA Processing billions of events every day Neha Narkhede Co-founder and Head of Engineering @ Stealth Startup Prior to this Lead, Streams Infrastructure

More information

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

More information

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing /35 Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing Zuhair Khayyat 1 Karim Awara 1 Amani Alonazi 1 Hani Jamjoom 2 Dan Williams 2 Panos Kalnis 1 1 King Abdullah University of

More information

Big Data Analysis: Apache Storm Perspective

Big Data Analysis: Apache Storm Perspective Big Data Analysis: Apache Storm Perspective Muhammad Hussain Iqbal 1, Tariq Rahim Soomro 2 Faculty of Computing, SZABIST Dubai Abstract the boom in the technology has resulted in emergence of new concepts

More information

Performance and Clustering of J2EE Applications

Performance and Clustering of J2EE Applications Performance and Clustering of J2EE Applications SWE 645, Fall 2004 Nick Duan 1 Objectives Identify performance problems in J2EE Identify solutions for performance enhancement in development and configuration

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Infrastructures for big data

Infrastructures for big data Infrastructures for big data Rasmus Pagh 1 Today s lecture Three technologies for handling big data: MapReduce (Hadoop) BigTable (and descendants) Data stream algorithms Alternatives to (some uses of)

More information

Bayesian networks - Time-series models - Apache Spark & Scala

Bayesian networks - Time-series models - Apache Spark & Scala Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly

More information

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...

More information

YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing

YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing Eric Charles [http://echarles.net] @echarles Datalayer [http://datalayer.io] @datalayerio FOSDEM 02 Feb 2014 NoSQL DevRoom

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

I Logs. Apache Kafka, Stream Processing, and Real-time Data Jay Kreps

I Logs. Apache Kafka, Stream Processing, and Real-time Data Jay Kreps I Logs Apache Kafka, Stream Processing, and Real-time Data Jay Kreps The Plan 1. What is Data Integration? 2. What is Apache Kafka? 3. Logs and Distributed Systems 4. Logs and Data Integration 5. Logs

More information

Testing Big data is one of the biggest

Testing Big data is one of the biggest Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing

More information

B4: Experience with a Globally-Deployed Software Defined WAN TO APPEAR IN SIGCOMM 13

B4: Experience with a Globally-Deployed Software Defined WAN TO APPEAR IN SIGCOMM 13 B4: Experience with a Globally-Deployed Software Defined WAN TO APPEAR IN SIGCOMM 13 Google s Software Defined WAN Traditional WAN Routing Treat all bits the same 30% ~ 40% average utilization Cost of

More information

Streaming items through a cluster with Spark Streaming

Streaming items through a cluster with Spark Streaming Streaming items through a cluster with Spark Streaming Tathagata TD Das @tathadas CME 323: Distributed Algorithms and Optimization Stanford, May 6, 2015 Who am I? > Project Management Committee (PMC) member

More information

Agenda. Some Examples from Yahoo! Hadoop. Some Examples from Yahoo! Crawling. Cloud (data) management Ahmed Ali-Eldin. First part: Second part:

Agenda. Some Examples from Yahoo! Hadoop. Some Examples from Yahoo! Crawling. Cloud (data) management Ahmed Ali-Eldin. First part: Second part: Cloud (data) management Ahmed Ali-Eldin First part: ZooKeeper (Yahoo!) Agenda A highly available, scalable, distributed, configuration, consensus, group membership, leader election, naming, and coordination

More information

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input

More information

Management by Network Search

Management by Network Search Management by Network Search Misbah Uddin, Prof. Rolf Stadler KTH Royal Institute of Technology, Sweden Dr. Alex Clemm Cisco Systems, CA, USA November 11, 2014 ANRP Award Talks Session IETF 91 Honolulu,

More information

Can the Elephants Handle the NoSQL Onslaught?

Can the Elephants Handle the NoSQL Onslaught? Can the Elephants Handle the NoSQL Onslaught? Avrilia Floratou, Nikhil Teletia David J. DeWitt, Jignesh M. Patel, Donghui Zhang University of Wisconsin-Madison Microsoft Jim Gray Systems Lab Presented

More information

How Companies are! Using Spark

How Companies are! Using Spark How Companies are! Using Spark And where the Edge in Big Data will be Matei Zaharia History Decreasing storage costs have led to an explosion of big data Commodity cluster software, like Hadoop, has made

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

SPARK USE CASE IN TELCO. Apache Spark Night 9-2-2014! Chance Coble!

SPARK USE CASE IN TELCO. Apache Spark Night 9-2-2014! Chance Coble! SPARK USE CASE IN TELCO Apache Spark Night 9-2-2014! Chance Coble! Use Case Profile Telecommunications company Shared business problems/pain Scalable analytics infrastructure is a problem Pushing infrastructure

More information

Real-time Big Data Analytics with Storm

Real-time Big Data Analytics with Storm Ron Bodkin Founder & CEO, Think Big June 2013 Real-time Big Data Analytics with Storm Leading Provider of Data Science and Engineering Services Accelerating Your Time to Value IMAGINE Strategy and Roadmap

More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information

Group Based Load Balancing Algorithm in Cloud Computing Virtualization

Group Based Load Balancing Algorithm in Cloud Computing Virtualization Group Based Load Balancing Algorithm in Cloud Computing Virtualization Rishi Bhardwaj, 2 Sangeeta Mittal, Student, 2 Assistant Professor, Department of Computer Science, Jaypee Institute of Information

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

Big Data. A general approach to process external multimedia datasets. David Mera

Big Data. A general approach to process external multimedia datasets. David Mera Big Data A general approach to process external multimedia datasets David Mera Laboratory of Data Intensive Systems and Applications (DISA) Masaryk University Brno, Czech Republic 7/10/2014 Table of Contents

More information

The Evolvement of Big Data Systems

The Evolvement of Big Data Systems The Evolvement of Big Data Systems From the Perspective of an Information Security Application 2015 by Gang Chen, Sai Wu, Yuan Wang presented by Slavik Derevyanko Outline Authors and Netease Introduction

More information

Measuring CDN Performance. Hooman Beheshti, VP Technology

Measuring CDN Performance. Hooman Beheshti, VP Technology Measuring CDN Performance Hooman Beheshti, VP Technology Why this matters Performance is one of the main reasons we use a CDN Seems easy to measure, but isn t Performance is an easy way to comparison shop

More information

Recommendations for Performance Benchmarking

Recommendations for Performance Benchmarking Recommendations for Performance Benchmarking Shikhar Puri Abstract Performance benchmarking of applications is increasingly becoming essential before deployment. This paper covers recommendations and best

More information

Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014

Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014 Four Orders of Magnitude: Running Large Scale Accumulo Clusters Aaron Cordova Accumulo Summit, June 2014 Scale, Security, Schema Scale to scale 1 - (vt) to change the size of something let s scale the

More information

On demand synchronization and load distribution for database grid-based Web applications

On demand synchronization and load distribution for database grid-based Web applications Data & Knowledge Engineering 51 (24) 295 323 www.elsevier.com/locate/datak On demand synchronization and load distribution for database grid-based Web applications Wen-Syan Li *,1, Kemal Altintas, Murat

More information

KSU/CCIS CSC227 Tutorial # 4 CPU Scheduling SUMMER

KSU/CCIS CSC227 Tutorial # 4 CPU Scheduling SUMMER KSU/CCIS CSC227 Tutorial # 4 CPU Scheduling SUMMER - 2010 Question # 1: Select the most appropriate answer for each of the following questions: 1. is the number of processes that complete their execution

More information

Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:

Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation: CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm

More information

Storm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter

Storm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter Storm Distributed and fault-tolerant realtime computation Nathan Marz Twitter Basic info Open sourced September 19th Implementation is 15,000 lines of code Used by over 25 companies >2400 watchers on Github

More information

QoS & Traffic Management

QoS & Traffic Management QoS & Traffic Management Advanced Features for Managing Application Performance and Achieving End-to-End Quality of Service in Data Center and Cloud Computing Environments using Chelsio T4 Adapters Chelsio

More information

Java Performance. Adrian Dozsa TM-JUG 18.09.2014

Java Performance. Adrian Dozsa TM-JUG 18.09.2014 Java Performance Adrian Dozsa TM-JUG 18.09.2014 Agenda Requirements Performance Testing Micro-benchmarks Concurrency GC Tools Why is performance important? We hate slow web pages/apps We hate timeouts

More information

Play with Big Data on the Shoulders of Open Source

Play with Big Data on the Shoulders of Open Source OW2 Open Source Corporate Network Meeting Play with Big Data on the Shoulders of Open Source Liu Jie Technology Center of Software Engineering Institute of Software, Chinese Academy of Sciences 2012-10-19

More information

Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage

Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage Big Graph Analytics on Neo4j with Apache Spark Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage My background I only make it to the Open Stages :) Probably because Apache Neo4j

More information

JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra

JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra January 2014 Legal Notices Apache Cassandra, Spark and Solr and their respective logos are trademarks or registered trademarks

More information

Case Study - I. Industry: Social Networking Website Technology : J2EE AJAX, Spring, MySQL, Weblogic, Windows Server 2008.

Case Study - I. Industry: Social Networking Website Technology : J2EE AJAX, Spring, MySQL, Weblogic, Windows Server 2008. Case Study - I Industry: Social Networking Website Technology : J2EE AJAX, Spring, MySQL, Weblogic, Windows Server 2008 Challenges The scalability of the database servers to execute batch processes under

More information

TITAN BIG GRAPH DATA WITH CASSANDRA #TITANDB #CASSANDRA12

TITAN BIG GRAPH DATA WITH CASSANDRA #TITANDB #CASSANDRA12 TITAN BIG GRAPH DATA WITH CASSANDRA #TITANDB #CASSANDRA12 Matthias Broecheler, CTO August VIII, MMXII AURELIUS THINKAURELIUS.COM Abstract Titan is an open source distributed graph database build on top

More information

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Mining Large Datasets: Case of Mining Graph Data in the Cloud Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large

More information

DRIVING INNOVATION THROUGH DATA ACCELERATING BIG DATA APPLICATION DEVELOPMENT WITH CASCADING

DRIVING INNOVATION THROUGH DATA ACCELERATING BIG DATA APPLICATION DEVELOPMENT WITH CASCADING DRIVING INNOVATION THROUGH DATA ACCELERATING BIG DATA APPLICATION DEVELOPMENT WITH CASCADING Supreet Oberoi VP Field Engineering, Concurrent Inc GET TO KNOW CONCURRENT Leader in Application Infrastructure

More information

Management & Analysis of Big Data in Zenith Team

Management & Analysis of Big Data in Zenith Team Management & Analysis of Big Data in Zenith Team Zenith Team, INRIA & LIRMM Outline Introduction to MapReduce Dealing with Data Skew in Big Data Processing Data Partitioning for MapReduce Frequent Sequence

More information

Research trends relevant to data warehousing and OLAP include [Cuzzocrea et al.]: Combining the benefits of RDBMS and NoSQL database systems

Research trends relevant to data warehousing and OLAP include [Cuzzocrea et al.]: Combining the benefits of RDBMS and NoSQL database systems DATA WAREHOUSING RESEARCH TRENDS Research trends relevant to data warehousing and OLAP include [Cuzzocrea et al.]: Data source heterogeneity and incongruence Filtering out uncorrelated data Strongly unstructured

More information

Optimization of QoS for Cloud-Based Services through Elasticity and Network Awareness

Optimization of QoS for Cloud-Based Services through Elasticity and Network Awareness Master Thesis: Optimization of QoS for Cloud-Based Services through Elasticity and Network Awareness Alexander Fedulov 1 Agenda BonFIRE Project overview Motivation General System Architecture Monitoring

More information

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,

More information

Real-time Analytics at Facebook: Data Freeway and Puma. Zheng Shao 12/2/2011

Real-time Analytics at Facebook: Data Freeway and Puma. Zheng Shao 12/2/2011 Real-time Analytics at Facebook: Data Freeway and Puma Zheng Shao 12/2/2011 Agenda 1 Analytics and Real-time 2 Data Freeway 3 Puma 4 Future Works Analytics and Real-time what and why Facebook Insights

More information

Cognos8 Deployment Best Practices for Performance/Scalability. Barnaby Cole Practice Lead, Technical Services

Cognos8 Deployment Best Practices for Performance/Scalability. Barnaby Cole Practice Lead, Technical Services Cognos8 Deployment Best Practices for Performance/Scalability Barnaby Cole Practice Lead, Technical Services Agenda > Cognos 8 Architecture Overview > Cognos 8 Components > Load Balancing > Deployment

More information

Journée Thématique Big Data 13/03/2015

Journée Thématique Big Data 13/03/2015 Journée Thématique Big Data 13/03/2015 1 Agenda About Flaminem What Do We Want To Predict? What Is The Machine Learning Theory Behind It? How Does It Work In Practice? What Is Happening When Data Gets

More information

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future

More information

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person

More information

Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II)

Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II) UC BERKELEY Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II) Anthony D. Joseph LASER Summer School September 2013 My Talks at LASER 2013 1. AMP Lab introduction 2. The Datacenter

More information

A SURVEY ON LOAD BALANCING ALGORITHMS FOR CLOUD COMPUTING

A SURVEY ON LOAD BALANCING ALGORITHMS FOR CLOUD COMPUTING A SURVEY ON LOAD BALANCING ALGORITHMS FOR CLOUD COMPUTING Avtar Singh #1,Kamlesh Dutta #2, Himanshu Gupta #3 #1 Department of Computer Science and Engineering, Shoolini University, avtarz@gmail.com #2

More information

Using Summingbird for aggregating eye tracking data to find patterns in images in a multi-user environment

Using Summingbird for aggregating eye tracking data to find patterns in images in a multi-user environment Using Summingbird for aggregating eye tracking data to find patterns in images in a multi-user environment Johan Fogelström and Remzi Can Aksoy School of Computer Science and Communication (CSC), Royal

More information

The Top 10 7 Hadoop Patterns and Anti-patterns. Alex Holmes @

The Top 10 7 Hadoop Patterns and Anti-patterns. Alex Holmes @ The Top 10 7 Hadoop Patterns and Anti-patterns Alex Holmes @ whoami Alex Holmes Software engineer Working on distributed systems for many years Hadoop since 2008 @grep_alex grepalex.com what s hadoop...

More information

Unified Big Data Analytics Pipeline. 连 城 lian@databricks.com

Unified Big Data Analytics Pipeline. 连 城 lian@databricks.com Unified Big Data Analytics Pipeline 连 城 lian@databricks.com What is A fast and general engine for large-scale data processing An open source implementation of Resilient Distributed Datasets (RDD) Has an

More information

LOAD BALANCING TECHNIQUES FOR RELEASE 11i AND RELEASE 12 E-BUSINESS ENVIRONMENTS

LOAD BALANCING TECHNIQUES FOR RELEASE 11i AND RELEASE 12 E-BUSINESS ENVIRONMENTS LOAD BALANCING TECHNIQUES FOR RELEASE 11i AND RELEASE 12 E-BUSINESS ENVIRONMENTS Venkat Perumal IT Convergence Introduction Any application server based on a certain CPU, memory and other configurations

More information

BookKeeper. Flavio Junqueira Yahoo! Research, Barcelona. Hadoop in China 2011

BookKeeper. Flavio Junqueira Yahoo! Research, Barcelona. Hadoop in China 2011 BookKeeper Flavio Junqueira Yahoo! Research, Barcelona Hadoop in China 2011 What s BookKeeper? Shared storage for writing fast sequences of byte arrays Data is replicated Writes are striped Many processes

More information

Développement logiciel pour le Cloud (TLC)

Développement logiciel pour le Cloud (TLC) Table of Contents 3. PaaS: the example of Google AppEngine Guillaume Pierre Université de Rennes 1 Fall 2012 http://www.globule.org/~gpierre/ 1 2 Developing Java applications in AppEngine 3 The Data Store

More information

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

Big Data Architecture

Big Data Architecture Big Architecture Guido Schmutz BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH Guido Schmutz Working for Trivadis for more than

More information

Big Data Analytics Hadoop and Spark

Big Data Analytics Hadoop and Spark Big Data Analytics Hadoop and Spark Shelly Garion, Ph.D. IBM Research Haifa 1 What is Big Data? 2 What is Big Data? Big data usually includes data sets with sizes beyond the ability of commonly used software

More information

Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island

Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island Big Data Principles and best practices of scalable real-time data systems NATHAN MARZ JAMES WARREN II MANNING Shelter Island contents preface xiii acknowledgments xv about this book xviii ~1 Anew paradigm

More information

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB Overview of Databases On MacOS Karl Kuehn Automation Engineer RethinkDB Session Goals Introduce Database concepts Show example players Not Goals: Cover non-macos systems (Oracle) Teach you SQL Answer what

More information

Affinity Aware VM Colocation Mechanism for Cloud

Affinity Aware VM Colocation Mechanism for Cloud Affinity Aware VM Colocation Mechanism for Cloud Nilesh Pachorkar 1* and Rajesh Ingle 2 Received: 24-December-2014; Revised: 12-January-2015; Accepted: 12-January-2015 2014 ACCENTS Abstract The most of

More information

Making Big Data Processing Simple with Spark. Matei Zaharia

Making Big Data Processing Simple with Spark. Matei Zaharia Making Big Data Processing Simple with Spark Matei Zaharia December 17, 2015 What is Apache Spark? Fast and general cluster computing engine that generalizes the MapReduce model Makes it easy and fast

More information

MANAGING RESOURCES IN A BIG DATA CLUSTER.

MANAGING RESOURCES IN A BIG DATA CLUSTER. MANAGING RESOURCES IN A BIG DATA CLUSTER. Gautier Berthou (SICS) EMDC Summer Event 2015 www.hops.io @hopshadoop We are producing lot of data Where does they Come From? On-line services : PBs per day Scientific

More information

An Introduction to MOHAMMAD REZA KARIMI DASTJERDI SPRING

An Introduction to MOHAMMAD REZA KARIMI DASTJERDI SPRING An Introduction to MOHAMMAD REZA KARIMI DASTJERDI SPRING 2015 1 Table Of Contents Introduction Problems with RDBMs What is Hadoop? Who use Hadoop? Job Positions History Hadoop Distributions Hadoop Ecosystem

More information

Managing large clusters resources

Managing large clusters resources Managing large clusters resources ID2210 Gautier Berthou (SICS) Big Processing with No Locality Job( /crawler/bot/jd.io/1 ) submi t Workflow Manager Compute Grid Node Job This doesn t scale. Bandwidth

More information

Rakam: Distributed Analytics API

Rakam: Distributed Analytics API Rakam: Distributed Analytics API Burak Emre Kabakcı May 30, 2014 Abstract Today, most of the big data applications needs to compute data in real-time since the Internet develops quite fast and the users

More information