Towards an Optimized Big Data Processing System

Similar documents
Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Datacenters

Grid and Cloud Scheduling and Resource Management

Scheduling in IaaS Cloud Computing Environments: Anything New? Alexandru Iosup Parallel and Distributed Systems Group TU Delft

Cloud Computing Support for Massively Social Gaming (Rain for the Thirsty)

Scalable High Performance Systems

How To Scale A Complex Big Data Workflow

Open source Google-style large scale data analysis with Hadoop

BIG DATA TRENDS AND TECHNOLOGIES

Graphalytics: A Big Data Benchmark for Graph-Processing Platforms

CloudRank-D:A Benchmark Suite for Private Cloud Systems

GRAPHALYTICS A Big Data Benchmark for Graph-Processing Platforms

JSSSP, IPDPS WS, Boston, MA, USA May 24, 2013 TUD-PDS

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Large scale processing using Hadoop. Ján Vaňo

Hadoop IST 734 SS CHUNG

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Professor and Director, Division of Biological Sciences & Computation Institute, University of Chicago

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE

Hadoop implementation of MapReduce computational model. Ján Vaňo

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Duke University

Guidelines for Selecting Hadoop Schedulers based on System Heterogeneity

Apache Hadoop. Alexandru Costan

A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems

Extending Hadoop beyond MapReduce

Virtualizing Apache Hadoop. June, 2012

C-Meter: A Framework for Performance Analysis of Computing Clouds

C-Meter: A Framework for Performance Analysis of Computing Clouds

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data1 Generat / 20

Computing Load Aware and Long-View Load Balancing for Cluster Storage Systems

Stress-Testing Clouds for Big Data Applications

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Deployment Strategies for Distributed Applications on Cloud Computing Infrastructures

BIG DATA USING HADOOP

Chapter 7. Using Hadoop Cluster and MapReduce

Delay Scheduling. A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling

Open source large scale distributed data management with Google s MapReduce and Bigtable

From Wikipedia, the free encyclopedia

A Brief Introduction to Apache Tez

Characterizing Task Usage Shapes in Google s Compute Clusters

Big Data and Industrial Internet

CIS 4930/6930 Spring 2014 Introduction to Data Science Data Intensive Computing. University of Florida, CISE Department Prof.

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Big Data Processing with Google s MapReduce. Alexandru Costan

CSE-E5430 Scalable Cloud Computing Lecture 2

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012

MapReduce and Hadoop Distributed File System V I J A Y R A O

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

NextGen Infrastructure for Big DATA Analytics.

How To Scale Out Of A Nosql Database

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000

Hadoop. Sunday, November 25, 12

From Internet Data Centers to Data Centers in the Cloud

S06: Open-Source Stack for Cloud Computing

Data Refinery with Big Data Aspects

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

# Not a part of 1Z0-061 or 1Z0-144 Certification test, but very important technology in BIG DATA Analysis

Hadoop Ecosystem B Y R A H I M A.

Analyzing Big Data with AWS

SEAIP 2009 Presentation

Big Data. White Paper. Big Data Executive Overview WP-BD Jafar Shunnar & Dan Raver. Page 1 Last Updated

Computing in clouds: Where we come from, Where we are, What we can, Where we go

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

The Inside Scoop on Hadoop

BlobSeer: Towards efficient data storage management on large-scale, distributed systems

DynamicCloudSim: Simulating Heterogeneity in Computational Clouds

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Energy Efficient MapReduce

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

MapReduce and Hadoop Distributed File System

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

Big Data Benchmark. The Cloudy Approach. Dhruba Borthakur Nov 7, 2012 at the SDSC Industry Roundtable on Big Data and Big Value

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Hadoop & Spark Using Amazon EMR

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

NoSQL and Hadoop Technologies On Oracle Cloud

COMP9321 Web Application Engineering

A SURVEY ON MAPREDUCE IN CLOUD COMPUTING

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

YARN Apache Hadoop Next Generation Compute Platform

Federated Big Data for resource aggregation and load balancing with DIRAC

THE HADOOP DISTRIBUTED FILE SYSTEM

A Performance Analysis of Distributed Indexing using Terrier

Delft University of Technology Parallel and Distributed Systems Report Series. The Peer-to-Peer Trace Archive: Design and Comparative Trace Analysis

Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms

Hadoop Distributed File System. T Seminar On Multimedia Eero Kurkela

A Survey on Big Data Concepts and Tools

BIG DATA CHALLENGES AND PERSPECTIVES

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Big Data With Hadoop

Real Time Big Data Processing

Building your Big Data Architecture on Amazon Web Services

Entering the Zettabyte Age Jeffrey Krone

KOALA-C: A Task Allocator for Integrated Multicluster and Multicloud Environments

Introduction to Apache YARN Schedulers & Queues

CSE-E5430 Scalable Cloud Computing. Lecture 4

Transcription:

Towards an Optimized Big Data Processing System The Doctoral Symposium of the IEEE/ACM CCGrid 2013 Delft, The Netherlands Bogdan Ghiţ, Alexandru Iosup, and Dick Epema Parallel and Distributed Systems Group Delft University of Technology Delft, The Netherlands COMMIT/

PhD at TU Delft Candidate: Bogdan Ghiţ Group: Parallel and Distributed Systems Supervisors: Dick Epema, Alexandru Iosup Research Topic: Resource management in grids and clouds Start date: 24 October 2011 Finish date: 24 October 2015 2

The Data Deluge According to one estimate mankind created 150 exabytes (billion gigabytes) of data in 2005. This year, it will create 1,200 exabytes. The Data Deluge, The Economist, 25 February 2010 The data is difficult to store, even harder to analyze it 3

Data Sources Computer Science LinkedIn Dataset Size Daily batch processing for People you may know recommendations 15TB 1TB 100GB 10GB 1GB P2PTA BTWorld GTA Facebook 30 PB by March 2011 = 3,000 x Library of Congress 06 09 10 11 13 Year 4

MapReduce and Beyond Master-slave model MR-cluster Stack of frameworks for large-scale data processing Multiple users vs. Isolation MR-clusters on-demand Isolation w.r.t. performance, data, failure, and versioning Data volume vs. Limited resources Use resources from multiple clusters Dynamically change the size Performance vs. Fairness Capacity-based model Capability-based model 5

Road Map I. System prototype II. Benchmarking Tool for MR III. Performance model for MR IV. Optimize the resource management for MR PhD start-up PhD thesis Nov. 2011 Nov. 2012 Nov. 2013 Nov. 2014 Nov. 2015 MR-clusters on-demand Dynamic resizing of an MR-cluster MTAGS 2012 Best Paper Award Workload analysis I AM HERE IEEE Big Data 2013 Fairness between multiple MR-clusters Identify the performance boundaries Scheduling decisions based on the data and the workload Extended system Applicability to other frameworks and infrastructures 6

Dynamic MapReduce Clusters Complex resource management Single / multiple physical clusters Placement and scheduling policies Change resource allocations at runtime Data management issues MR-cluster structure: data replication vs. data locality Core nodes Execute tasks and store data locally Replication required when removed Transient nodes Execute tasks, do not store data Data transfers to read/write data 7

Resizing Mechanism Question: Given an MR-cluster, how can you tell if it is overloaded or underloaded? Monitor the MR cluster utilization: Grow-Shrink Policy (GSP) with transient nodes Size of grow and shrink steps: S grow and S shrink Baseline policies: grow with core nodes (GGDP) or grow with transient nodes (GGP) S grow S grow F min # tasks # slots S shrink F max S shrink Timeline 8

System Prototype Koala Grid Scheduler Enables processor and data co-allocation Implements placement and scheduling policies Application types: cycle-scavenging, workflows, OpenMPI Koala and MapReduce Developed an MR-Runner module to schedule MR jobs Provides isolated MR-clusters on a per-user basis Koala mechanism for resizing the MR-clusters MR jobs submissions transparent to Koala DAS-4 Infrastructure Real-world experiments on a multicluster system 6 clusters, over 1600 cores, 150 machines, 180 TB, 1-10 Gbit/s 9 Bogdan Ghit, Nezih Yigitbasi, Dick Epema. Resource Management for Dynamic MapReduce Clusters in Multicluster Systems (Best Paper Award). MTAGS 2012

Transient Nodes 10 x 20 x 30 x 30 x 20 x 10 x 40 x 50 GB data set Wordcount scales better than Sort on transient nodes 10

Resizing Performance 20 x 20 x Resizing bounds F min = 0.25 F max = 1.25 50 MR jobs 1 50 GB 20 x Resizing steps GSP S grow = 5 S shrink = 2 GG(D)P S grow = 2 11

Road Map I. System prototype II. Benchmarking Tool for MR III. Performance model for MR IV. Optimize the resource management for MR PhD start-up PhD thesis Nov. 2011 Nov. 2012 Nov. 2013 Nov. 2014 Nov. 2015 MR-clusters on-demand Dynamic resizing of an MR-cluster MTAGS 2012 Best Paper Award I AM HERE Workload analysis IEEE Big Data 2013 Fairness between multiple MR-clusters Identify the performance boundaries Scheduling decisions based on the data and the workload Extended system Applicability to other frameworks and infrastructures 12

Workload Analysis Question: Which are the major MapReduce use cases? Google, Facebook, Yahoo!, Cloudera, Microsoft Findings from 12 published production traces Our analysis of other 4 production traces Complex Workload Large variations in job submissions rates 90% of the jobs in all traces process and generate less than 1 GB, and complete in under 1 minute For large jobs, variations in job sizes vs. job durations Our PDS group analyzes 15 TB of BitTorrent logs with MapReduce 13

Benchmarking Tool Real-world applications Text processing, web searching, machine learning Trace-based workloads Analysis and modeling of traces from production clusters BTWorld use case Complex MR workflow 14 Pig queries / 33 MR jobs Aggregations, selections, joins projections 14

Road Map I. System prototype II. Benchmarking Tool for MR III. Performance model for MR IV. Optimize the resource management for MR PhD start-up PhD thesis Nov. 2011 Nov. 2012 Nov. 2013 Nov. 2014 Nov. 2015 MR-clusters on-demand Dynamic resizing of an MR-cluster MTAGS 2012 Best Paper Award Workload analysis I AM HERE IEEE Big Data 2013 Fairness between multiple MR-clusters Identify the performance boundaries Scheduling decisions based on the data and the workload Extended system Applicability to other frameworks and infrastructures 15

Fair-Sharing Across Multiple Users Question: Given multiple MR-clusters, how can you tell if one is working better than another? Schedule and provision concurrent MR-clusters Differentiate users and converge to a division of resources such that they get similar performance Weighted proportional allocations: Take snapshots in time of the queue sizes Maintain a history of finished jobs 16

Provisioning Policies Question: Can we obtain better performance with the dynamic MR-clusters? Data is hard to move Aprox. 3 h to transfer 1 TB between HDFS and the local storage (900 Mbps write speed) Removing a node with 100 GB makes ~ 6 failed jobs (1 Gbit/s Ethernet, avg. map task duration 24 s, most jobs have less than 150 tasks) Explore a large space of policies: Policies for establishing the weights (fair-shares) Policies for growing (core or transient nodes, single or multiple clusters) Policies for shrinking (preemptive or non-preemptive) 17

Performance Model Question: Which are the performance boundaries of the MR processing system? Analytical and statistical methods Metrics: Fairness users get similar performance Elasticity dynamic MR clusters Performace isolation multiple MR clusters Velocity of data processing Adaptivity to data explosion 18 http://research.spec.org/

Road Map I. System prototype II. Benchmarking III. Performance Tool for MR model for MR IV. Optimize the resource management for MR PhD start-up PhD thesis Nov. 2011 Nov. 2012 Nov. 2013 Nov. 2014 Nov. 2015 MR-clusters on-demand Dynamic resizing of an MR-cluster MTAGS 2012 Best Paper Award Workload analysis I AM HERE IEEE Big Data 2013 Fairness between multiple MR-clusters Identify the performance boundaries Scheduling decisions based on the data and the workload Extended system Applicability to other frameworks and infrastructures 19

Optimize the MapReduce System Question: Are the results obtained so far relevant for the large domain of data-processing systems? Provisioning policies with different optimization targets Incorporate knowledge about the workloads in scheduling and provisioning decisions Release the extended system with the full functionalities Investigate the applicability to other programming models and infrastructures 20

More Information Team: D. Epema, A. Iosup, M. Capotă, T. Hegeman, N. Yigitbasi, L. Fei, PDS publication database www.pds.ewi.tudelft.nl/research-publications/publications Home pages www.pds.ewi.tudeltf.nl/ghit www.pds.ewi.tudelft.nl/epema www.pds.ewi.tudelft.nl/~iosup Web sites: KOALA: www.st.ewi.tudelft.nl/koala 21