Mesos: A Platform for Fine-Grained Resource Sharing in Data Centers (II)




UC BERKELEY
Mesos: A Platform for Fine-Grained Resource Sharing in Data Centers (II)
Anthony D. Joseph, LASER Summer School, September 2013

My Talks at LASER 2013
1. AMP Lab introduction
2. The Datacenter Needs an Operating System
3. Mesos, part one
4. Dominant Resource Fairness
5. Mesos, part two
6. Spark

Collaborators
Matei Zaharia, Benjamin Hindman, Andy Konwinski, Ali Ghodsi, Randy Katz, Scott Shenker, Ion Stoica

Apache Mesos
A common resource sharing layer for diverse frameworks (Hive, Hadoop, MPI, Hypertable, Cassandra, Spark, PIQL, SCADS): a datacenter OS (e.g., Apache Mesos) running over per-node OSes (e.g., Linux, Windows)
Run multiple instances of the same framework
» Isolate production and experimental jobs
» Run multiple versions of the framework concurrently
Support specialized frameworks for problem domains

Implementation
20,000+ lines of C++
APIs in C, C++, Java, and Python
Master failover using ZooKeeper
Frameworks ported: Hadoop, MPI, Torque
New specialized frameworks: Spark, Apache/HAProxy
Open source Apache project: http://mesos.apache.org/

Frameworks
Ported frameworks:
» Hadoop (900-line patch)
» MPI (160-line wrapper scripts)
New frameworks:
» Spark, a Scala framework for iterative jobs (1,300 lines)
» Apache + HAProxy, an elastic web server farm (200 lines)

Isolation
Mesos has pluggable isolation modules to isolate tasks sharing a node
Currently supports Linux Containers and Solaris projects
» Can isolate memory, CPU, I/O, and network bandwidth
Could be a great place to use VMs

Apache ZooKeeper
[Diagram: a leader server and several follower servers coordinating many clients]
Multiple servers require coordination
» Leader election, group membership, work queues, data sharding, event notifications, configuration, and cluster management
Highly available, scalable, distributed coordination kernel
» Ordered updates and strong persistence guarantees
» Conditional updates (versioned), watches for data changes
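The leader-election recipe that ZooKeeper-based systems like Mesos rely on can be sketched without a live ensemble: each contender creates an ephemeral sequential znode, the contender holding the lowest sequence number is the leader, and when it dies its znode vanishes and the next-lowest takes over. A minimal pure-Python simulation of that ordering rule (the znode names and the `elect` helper are illustrative, not ZooKeeper API):

```python
# Simulate ZooKeeper's leader-election recipe: each contender holds an
# ephemeral sequential znode; the lowest sequence number is the leader.

def elect(znodes):
    """Return the leader among znodes named like 'master-0000000003'."""
    return min(znodes, key=lambda z: int(z.rsplit("-", 1)[1]))

# Three Mesos masters register and receive sequence numbers in order.
znodes = ["master-0000000001", "master-0000000002", "master-0000000003"]
print(elect(znodes))   # master-0000000001 is the active master

# The active master fails: its ephemeral znode disappears,
# and the next-lowest sequence number wins the re-election.
znodes.remove("master-0000000001")
print(elect(znodes))   # master-0000000002 takes over
```

In the real recipe each contender watches only the znode immediately below its own, which avoids a "herd effect" of every node re-contending on each failure.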

Master Failure
[Diagram: a 5-node ZooKeeper ensemble elects one of three Mesos masters as the active master; framework schedulers (Hadoop JobTracker, MPI scheduler) and slaves running Hadoop, MPI, and JVM executors all connect to the currently active master]
ZooKeeper is used to elect one active Mesos master; schedulers and slaves connect to whichever master is currently active

Resource Revocation
Killing tasks to make room for other users
Killing is typically not needed for short tasks
» If the average task length is 2 minutes, a new framework gets 10% of all machines within 12 seconds on average
[Figure: Hadoop job and task durations at Facebook]
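The 12-second figure follows from fine-grained tasks finishing continuously: if tasks have mean length T and end at a roughly uniform rate, a fraction f of all slots frees up in about f·T. A quick check of the slide's numbers under that back-of-the-envelope model (the formula is the implied approximation, not an exact queueing result):

```python
# Back-of-the-envelope: with mean task length T seconds and tasks ending
# at a uniform rate, roughly a fraction f of all slots frees up in f * T.

def wait_for_share(f, mean_task_len_s):
    return f * mean_task_len_s

# Slide's numbers: 2-minute average tasks, new framework wants 10% of machines.
print(wait_for_share(0.10, 120))  # 12.0 seconds
```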

Resource Revocation (2)
Not the normal case, because fine-grained tasks enable quick reallocation of resources
Sometimes necessary:
» Long-running tasks that never relinquish resources
» A buggy job running forever
» A greedy user who decides to make his tasks long
Safe allocation lets frameworks have long-running tasks, defined by the allocation policy
» Users will get at least their safe share within a specified time
» If they stay below their safe allocation, their tasks won't be killed

Resource Revocation (3)
Dealing with long tasks monopolizing nodes
» Let slaves have long slots and short slots
» Short slots are killed if a task uses them too long
Revoke only if a user is below its safe share and is interested in offers
» Revoke tasks from users farthest above their safe share
» A framework is given a grace period before its tasks are killed
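The victim-selection rule above can be sketched as a small allocator helper. The data structures and names here are hypothetical; only the policy itself (revoke from the user farthest above its safe share) comes from the slide:

```python
# Pick the revocation victim: the user farthest above its safe share.
# 'usage' and 'safe_share' map user -> fraction of the cluster (made-up data).

def pick_victim(usage, safe_share):
    over = {u: usage[u] - safe_share[u]
            for u in usage if usage[u] > safe_share[u]}
    if not over:
        return None                    # nobody above safe share: no revocation
    return max(over, key=over.get)     # farthest above its safe share

usage      = {"ads": 0.50, "spam": 0.35, "mpi": 0.15}
safe_share = {"ads": 0.30, "spam": 0.30, "mpi": 0.40}
print(pick_victim(usage, safe_share))  # 'ads' (0.20 over its safe share)
```

In the real system this fires only when some user is below its safe share and interested in offers, and the chosen framework gets a grace period before its tasks are killed.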

Example: Running MPI on Mesos
Users are always told their safe share
» Avoid revocation by staying below it
Giving each user a small safe share may not be enough if jobs need many machines
Can run a traditional HPC scheduler as a user with a large safe share of the cluster, and have MPI jobs queue up on it
» E.g., Torque gets 40% of the cluster

Example: Torque on Mesos
[Diagram: a Facebook.com cluster split 20% / 40% / 40% among Spam, Ads, and Torque (safe share = 40%); Torque queues MPI jobs from multiple users]

Some Mesos Deployments
» 1,000s of nodes running over a dozen production services
» Genomics researchers using Hadoop and Spark on Mesos
» Spark in use by Yahoo! Research
» Spark for analytics
» Hadoop and Spark used by machine learning researchers

Results

Dynamic Resource Sharing
[Plot: share of cluster (0–100%) over time (s) for MPI, Hadoop, and Spark sharing one cluster]

Elastic Web Server Farm
[Diagram: httperf sends HTTP requests to an haproxy scheduler performing load calculation; the load-gen framework and the web framework receive resource offers from the Mesos master; Mesos slaves run load-gen executor tasks and Apache web executor tasks, sending status updates back to the master]

Web Framework Results
[Figure omitted]

Scalability
[Figure: task startup overhead with 200 frameworks]

Fault Tolerance
[Figure: mean time to recovery, with 95% confidence intervals]

Deep Dive Experiments
Macrobenchmark experiment
» Test the benefits of using Mesos to multiplex a cluster between multiple diverse frameworks
High-level goals of the experiment
» Demonstrate increased CPU/memory utilization due to multiplexing available resources
» Demonstrate job runtime speedups

Macrobenchmark Setup
100 Extra Large EC2 instances (4 cores / 15 GB RAM per machine)
Experiment length: ~25 minutes
Realistic workload:
1. A Hadoop instance running a mix of small and large jobs based on the workload at Facebook
2. A Hadoop instance running a set of large batch jobs
3. Spark running a series of machine learning jobs
4. Torque running a series of MPI jobs

Goal of the Experiment
Run the four frameworks and their corresponding workloads
» 1st on a cluster shared via Mesos
» 2nd on 4 partitioned clusters, each ¼ the size of the shared cluster
Compare resource utilization and workload performance (i.e., job run times) on static partitioning vs. sharing with Mesos

Macrobenchmark Details: Breakdown of the Facebook Hive (Hadoop) Workload Mix

Bin  Job Type     Map Tasks  Reduce Tasks  Jobs Run
1    Selection    1          NA            38
2    Text search  2          NA            18
3    Aggregation  10         2             14
4    Selection    50         NA            12
5    Aggregation  100        10            6
6    Selection    200        NA            6
7    Text search  400        NA            4
8    Join         400        30            2

Results: CPU Allocation (100-node cluster)
[Plot: CPU allocation over time for Hadoop (Batch Jobs) and Hadoop (Facebook Mix)]

Sharing With Mesos vs. No-Sharing (Dedicated Cluster)
[Four plots of share of cluster (0–1) over time (s), each comparing the dedicated cluster against Mesos: (a) Facebook Hadoop Mix, (b) Large Hadoop Mix, (c) Spark, (d) Torque / MPI]

Cluster Utilization: Mesos vs. Dedicated Clusters
[Two plots over time (s): CPU utilization (0–100%) and memory utilization (0–50%), comparing Mesos against static partitioning]

Job Run Times (and Speedup) Grouped by Framework

Framework            Sum of Exec Times on     Sum of Exec Times  Speedup
                     Dedicated Cluster (s)    on Mesos (s)
Facebook Hadoop Mix  7235                     6319               1.14
Large Hadoop Mix     3143                     1494               2.10
Spark                1684                     1338               1.26
Torque / MPI         3210                     3352               0.96

2x speedup for the Large Hadoop Mix
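The speedup column is just the ratio of the two execution-time sums; recomputing it from the table's numbers:

```python
# Speedup = (sum of exec times on dedicated cluster) / (sum on Mesos).
runs = {
    "Facebook Hadoop Mix": (7235, 6319),
    "Large Hadoop Mix":    (3143, 1494),
    "Spark":               (1684, 1338),
    "Torque / MPI":        (3210, 3352),
}
for name, (dedicated_s, mesos_s) in runs.items():
    print(f"{name}: {dedicated_s / mesos_s:.2f}x")
# Facebook Hadoop Mix: 1.14x, Large Hadoop Mix: 2.10x,
# Spark: 1.26x, Torque / MPI: 0.96x
```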

Job Run Times (and Speedup) Grouped by Job Type

Framework            Job Type         Time on Dedicated  Avg. Speedup
                                      Cluster (s)        on Mesos
Facebook Hadoop Mix  selection (1)    24                 0.84
                     text search (2)  31                 0.90
                     aggregation (3)  82                 0.94
                     selection (4)    65                 1.40
                     aggregation (5)  192                1.26
                     selection (6)    136                1.71
                     text search (7)  137                2.14
                     join (8)         662                1.35
Large Hadoop Mix     text search      314                2.21
Spark                ALS              337                1.36
Torque / MPI         small tachyon    261                0.91
                     large tachyon    822                0.88

Discussion: Facebook Hadoop Mix Results
Smaller jobs perform worse on Mesos:
» A side effect of the interaction between the fair sharing performed by the Hadoop framework (among its jobs) and by Mesos (among frameworks)
» When Hadoop has more than 1/4 of the cluster, Mesos allocates freed-up resources to the framework farthest below its share
» Significant effect on any small Hadoop job submitted during this time (a long delay relative to its length)
» In contrast, Hadoop running alone can assign resources to the new job as soon as any of its tasks finishes

Discussion: Facebook Hadoop Mix Results
A similar problem with hierarchical fair sharing appears in networks
» Mitigation #1: run small jobs on a separate framework, or
» Mitigation #2: use lottery scheduling as the Mesos allocation policy

Discussion: Torque Results
Torque is the only framework that performed worse, on average, on Mesos
» Large tachyon jobs took on average 2 minutes longer
» Small ones took 20 s longer

Discussion: Torque Results
Causes of delay
» Partially due to Torque having to wait to launch 24 tasks on Mesos before starting each job; the average delay is 12 s
» The rest of the delay may be due to stragglers (slow nodes)
» In the standalone Torque run, two jobs each took ~60 s longer to run than the others
» Both jobs used a node that performed slower on single-node benchmarks than the others (Linux reported a 40% lower bogomips value on that node)
» Since tachyon hands out equal amounts of work to each node, it runs as slowly as the slowest node
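Because tachyon divides work evenly across nodes, the whole job finishes at the pace of the slowest node. A quick model of that straggler effect (the node counts and work units are illustrative; only the ~40% speed gap comes from the slide):

```python
# Evenly partitioned work finishes at the pace of the slowest node.

def job_time(work_per_node, node_speeds):
    """Time for a job that gives each node the same amount of work."""
    return max(work_per_node / s for s in node_speeds)

# One node reports ~40% lower bogomips than the rest.
speeds = [1.0] * 23 + [0.6]
print(job_time(60.0, speeds))   # ~100 s instead of 60 s on uniform nodes
```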

Macrobenchmark Summary
Evaluated the performance of a diverse set of frameworks, representing realistic workloads, running on Mesos versus a statically partitioned cluster
Showed a 10% increase in CPU utilization and an 18% increase in memory utilization
Some frameworks show significant speedups in job run time
Some frameworks show minor slowdowns in job run time due to experimental/environmental artifacts

Summary
Mesos is a platform for sharing data centers among diverse cluster computing frameworks
» Enables efficient fine-grained sharing
» Gives frameworks control over scheduling
Mesos is
» Scalable (50,000 slaves)
» Fault-tolerant (MTTR of 6 sec)
» Flexible enough to support a variety of frameworks (MPI, Hadoop, Spark, Apache, ...)
