Improving MapReduce Performance in Heterogeneous Environments




Improving MapReduce Performance in Heterogeneous Environments
Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica
University of California at Berkeley

Motivation
1. MapReduce is becoming popular
Open-source implementation, Hadoop, used by Yahoo!, Facebook, Last.fm, and others
Scale: 20 PB/day at Google, O(10,000) nodes at Yahoo, 3000 jobs/day at Facebook

Motivation
2. Utility computing services like Amazon Elastic Compute Cloud (EC2) provide cheap on-demand computing
Price: 10 cents / VM / hour
Scale: thousands of VMs
Caveat: less control over performance

Results
Main challenge for Hadoop on EC2 was performance heterogeneity, which breaks task scheduler assumptions
Designed new LATE scheduler that can cut response time in half

Outline
1. MapReduce background
2. The problem of heterogeneity
3. LATE: a heterogeneity-aware scheduler

What is MapReduce?
Programming model to split computations into independent parallel tasks
Hides the complexity of fault tolerance
At 10,000s of nodes, some will fail every day
J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.

Fault Tolerance in MapReduce
1. Nodes fail: re-run tasks
2. Nodes slow (stragglers): run backup tasks
How to do this in a heterogeneous environment?

Heterogeneity in Virtualized Environments
VM technology isolates CPU and memory, but disk and network are shared
Full bandwidth when no contention
Equal shares when there is contention
2.5x performance difference
[Figure: I/O performance per VM (MB/s) vs. number of VMs on physical host]

Backup Tasks in Hadoop's Default Scheduler
Start primary tasks, then look for backups to launch as nodes become free
Tasks report progress score from 0 to 1
Launch backup if progress < avgProgress - 0.2
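The threshold rule on this slide can be sketched in a few lines. This is a minimal illustration, not Hadoop's actual code: the function name `pick_backup_candidates`, the `gap` parameter, and the sample scores are all hypothetical, and the rule is read as "progress score more than 0.2 below the average".

```python
from statistics import mean


def pick_backup_candidates(progress_scores, gap=0.2):
    """Return ids of tasks whose progress score is below (average - gap).

    progress_scores maps a task id to a progress score in [0, 1].
    The 0.2 gap comes from the slide; everything else is illustrative.
    """
    avg = mean(progress_scores.values())
    return [task for task, score in progress_scores.items() if score < avg - gap]


# Hypothetical job with four running tasks.
scores = {"t1": 0.90, "t2": 0.85, "t3": 0.40, "t4": 0.80}
candidates = pick_backup_candidates(scores)
print(candidates)
```

Note that the rule compares raw progress values, so a task that simply started later than its peers can fall below the threshold even when it is running at full speed.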

Problems in Heterogeneous Environment
1. Too many backups, thrashing shared resources like network bandwidth
2. Wrong tasks backed up
3. Backups may be placed on slow nodes
4. Breaks when tasks start at different times
Example: ~80% of reduces backed up, most losing to originals; network thrashed

Idea: Progress Rates
Instead of using progress values, compute progress rates, and back up tasks that are far enough below the mean
Problem: can still select the wrong tasks

Progress Rate Example
[Figure: task timeline per node over time (min), marked at 1 min and 2 min]
Node 1: 1 task/min
Node 2: 3x slower
Node 3: 1.9x slower

Progress Rate Example
What if the job had 5 tasks?
[Figure: task timeline per node over time (min), marked at 2 min]
Node 2: time left 1 min
Node 3: time left 1.8 min
Node 2 is slowest, but should back up Node 3's task!
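The numbers in this example can be reproduced with a short calculation. The sketch below assumes one reading of the slide: at t = 2 min, Node 2 has been running its first task since t = 0 at one third of Node 1's speed, and Node 3 finished its first task at t = 1.9 min and has been running a second one for 0.1 min. Time left is computed as (1 - progress) / rate. The helper names are hypothetical.

```python
def progress_rate(progress, elapsed_min):
    """Progress score gained per minute of execution."""
    return progress / elapsed_min


def time_left(progress, rate):
    """Minutes remaining if the task keeps its current rate."""
    return (1.0 - progress) / rate


# (progress score, minutes the current task has been running) at t = 2 min.
running = {
    "node2": (2.0 / 3.0, 2.0),  # 3x slower than 1 task/min, started at t = 0
    "node3": (0.1 / 1.9, 0.1),  # 1.9x slower, second task started at t = 1.9
}

rates = {node: progress_rate(p, t) for node, (p, t) in running.items()}
left = {node: time_left(p, rates[node]) for node, (p, _) in running.items()}

slowest_by_rate = min(rates, key=rates.get)   # what a rate-based rule targets
longest_to_finish = max(left, key=left.get)   # what actually delays the job

print(slowest_by_rate, longest_to_finish)
print({node: round(minutes, 2) for node, minutes in left.items()})
```

The two selections disagree: the lowest rate belongs to Node 2, while the task that will finish last is on Node 3.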

Our Scheduler: LATE
Insight: back up the task with the largest estimated finish time
"Longest Approximate Time to End"
Look forward instead of looking backward
Sanity thresholds:
  Cap number of backup tasks
  Launch backups on fast nodes
  Only back up tasks that are sufficiently slow

LATE Details
Estimating finish times:
  progress rate = progress score / execution time
  estimated time left = (1 - progress score) / progress rate
Threshold values: 10% cap on backups, 25th percentiles for slow node/task
Validated by sensitivity analysis
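A minimal sketch of how these pieces might fit together when a node asks for work, using progress rate = progress score / execution time and estimated time left = (1 - progress score) / progress rate. Several details are assumptions made for illustration and are not taken from the slides: the `Task` class, the `late_pick` function, the nearest-rank percentile, the use of "at or below the 25th percentile" as the cutoff, and the idea of a single numeric speed per node.

```python
import math
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    progress: float  # progress score in [0, 1]
    elapsed: float   # execution time so far, in any consistent unit

    @property
    def rate(self) -> float:
        return self.progress / self.elapsed

    @property
    def time_left(self) -> float:
        return (1.0 - self.progress) / self.rate


def percentile(values, pct):
    """Nearest-rank percentile (an illustrative choice of definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def late_pick(tasks, running_backups, total_slots, node_speed, all_node_speeds,
              cap=0.10, slow_pct=25):
    """Return the task to back up on the requesting node, or None."""
    # Threshold 1: cap the number of backup tasks running at once.
    if running_backups >= cap * total_slots:
        return None
    # Threshold 2: only launch backups on fast nodes.
    if node_speed <= percentile(all_node_speeds, slow_pct):
        return None
    # Tasks with no progress yet have no usable rate; skip them in this sketch.
    measurable = [t for t in tasks if t.progress > 0 and t.elapsed > 0]
    if not measurable:
        return None
    # Threshold 3: only consider tasks that are sufficiently slow.
    slow_rate = percentile([t.rate for t in measurable], slow_pct)
    slow = [t for t in measurable if t.rate <= slow_rate]
    # Core rule: back up the task with the longest estimated time to end.
    return max(slow, key=lambda t: t.time_left)


# Hypothetical cluster state.
tasks = [
    Task("t1", 0.90, 10), Task("t2", 0.80, 10), Task("t3", 0.60, 30),
    Task("t4", 0.85, 10), Task("t5", 0.10, 4), Task("t6", 0.70, 10),
    Task("t7", 0.95, 10), Task("t8", 0.60, 10),
]
speeds = [1.0, 0.9, 0.3, 0.8]

choice = late_pick(tasks, running_backups=0, total_slots=20,
                   node_speed=1.0, all_node_speeds=speeds)
print(choice.task_id if choice else None)
```

In this sample data the task with the lowest rate (t3) is not the one chosen, because t5 has more work remaining at a similar rate and therefore a longer estimated time left.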

LATE Example
[Figure: task timeline per node over time (min), marked at 2 min]
Node 2: progress = 66%, estimated time left: (1 - 0.66) / (1/3) = 1
Node 3: progress = 5.3%, estimated time left: (1 - 0.05) / (1/1.9) = 1.8
LATE correctly picks Node 3

Evaluation
Environments:
  EC2 (3 job types, 200-250 nodes)
  Small local testbed
Self-contention through VM placement
Stragglers through background processes

EC2 Sort with Stragglers
[Figure: normalized response time (worst, best, average) for No Backups, Hadoop Native, and LATE Scheduler]
Average 58% speedup over native, 220% over no backups
93% max speedup over native

EC2 Sort without Stragglers
[Figure: normalized response time (worst, best, average) for No Backups, Hadoop Native, and LATE Scheduler]
Average 27% speedup over native, 31% over no backups

Conclusion
Heterogeneity is a challenge for parallel apps, and is growing more important
Lessons:
  Back up tasks which hurt response time most
  Mind shared resources
2x improvement using simple algorithm

Questions?

EC2 Grep and WordCount
[Figure: two panels (Grep, WordCount) of normalized response time (worst, best, average) for No Backups, Hadoop Native, and LATE Scheduler]
Grep: 36% gain over native, 57% gain over no backups
WordCount: 8.5% gain over native, 179% gain over no backups