Coflow: Mending the Application-Network Gap in Big Data Analytics. Mosharaf Chowdhury. UC Berkeley




Big Data. The volume of data businesses want to make sense of is increasing: an increasing variety of sources (Web, mobile, wearables, vehicles, scientific, ...); cheaper disks, SSDs, and memory; and stalling processor speeds.

Big Datacenters for Massive Parallelism. [Timeline, 2005-2015: MapReduce, Hadoop, Dryad, DryadLINQ, Hive, Pregel, Dremel, Spark, GraphLab, Storm, BlinkDB, Spark-Streaming, GraphX.]

Data-Parallel Applications. Multi-stage dataflow: computation interleaved with communication. Computation stage (e.g., Map, Reduce): distributed across many machines; tasks run in parallel. Communication stage (e.g., Shuffle): between successive computation stages, such as a Map stage and a Reduce stage. A communication stage cannot complete until all the data have been transferred.

Communication is Crucial. Performance: Facebook jobs spend ~25% of runtime on average in intermediate comm.1 As SSD-based and in-memory systems proliferate, the network is likely to become the primary bottleneck. 1. Based on a month-long trace with 20,000 jobs and 150 million tasks, collected from a 3000-machine Facebook production MapReduce cluster.

Faster Communication Stages: Networking Approach. Flow: transfers data from a source to a destination; the independent unit of allocation, sharing, load balancing, and/or prioritization. Configuration should be handled at the system level.

Existing Solutions (1980s-2015), from per-flow fairness to flow completion time: GPS, WFQ, CSFQ, RED, ECN, XCP, RCP, DCTCP, D3, D2TCP, FCP, DeTail, PDQ, pFabric. Independent flows cannot capture the collective communication behavior common in data-parallel applications.

Why Do They Fall Short? [A shuffle from senders s1, s2, s3 to receivers r1, r2.]

Why Do They Fall Short? [The same shuffle mapped onto the datacenter network: flows from s1, s2, s3 converge on the access links to r1 and r2.]

Why Do They Fall Short? Per-Flow Fair Sharing on the links to r1 and r2: Shuffle Completion Time = 5, Avg. Flow Completion Time = 3.66. Solutions focusing on flow completion time cannot further decrease the shuffle completion time.

Improve Application-Level Performance.1 Slow down faster flows to accelerate slower flows. Per-Flow Fair Sharing: Shuffle Completion Time = 5, Avg. Flow Completion Time = 3.66. Data-Proportional Allocation: all flows finish at time 4; Shuffle Completion Time = 4, Avg. Flow Completion Time = 4. 1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM 2011.

Faster Communication Stages: Systems Approach Applications know their performance goals, but they have no means to let the network know Configuration should be handled by the end users

Faster Communication Stages: Systems Approach (configuration should be handled by the end users) vs. Networking Approach (configuration should be handled at the system level).

MIND THE GAP. Holistic Approach: Applications and the Network Working Together.

Coflow. Communication abstraction for data-parallel applications to express their performance goals: 1. Minimize completion times, 2. Meet deadlines, or 3. Perform fair allocation.

Coflow structures: Single Flow, Parallel Flows, Broadcast, Aggregation, Shuffle, All-to-All.

How to schedule coflows online #1 for faster completion of coflows, #2 to meet more deadlines, and #3 for fair allocation of the network? [Figure: N senders and N receivers across the datacenter.]

Varys1 enables coflows in data-intensive clusters. 1. Coflow Scheduler: faster, application-aware data transfers throughout the network. 2. Global Coordination: consistent calculation and enforcement of scheduler decisions. 3. The Coflow API: decouples network optimizations from applications, relieving developers and end users. 1. Efficient Coflow Scheduling with Varys, SIGCOMM 2014.

Coflow. Communication abstraction for data-parallel applications to express their performance goals: 1. The size of each flow, 2. The total number of flows, and 3. The endpoints of individual flows.

Benefits of Inter-Coflow Scheduling. Coflow 1 has 3 units on Link 1; Coflow 2 has 6 units on Link 2 and 2 units on Link 1. Fair Sharing: Coflow1 comp. time = 5, Coflow2 comp. time = 6. Smallest-Flow First1,2: Coflow1 comp. time = 5, Coflow2 comp. time = 6. The Optimal: Coflow1 comp. time = 3, Coflow2 comp. time = 6. 1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM 2012. 2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM 2013.

Inter-Coflow Scheduling is NP-Hard. [Same two-coflow, two-link setup as above.] It maps to Concurrent Open Shop Scheduling1; examples include job scheduling and caching blocks, and solutions use an ordering heuristic. Fair Sharing: Coflow1 comp. time = 6, Coflow2 comp. time = 6. Flow-level Prioritization2,3: Coflow1 comp. time = 6, Coflow2 comp. time = 6. The Optimal: Coflow1 comp. time = 3, Coflow2 comp. time = 6. 1. A Note on the Complexity of the Concurrent Open Shop Problem, Journal of Scheduling, 9(4):389-396, 2006. 2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM 2012. 3. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM 2013.

Inter-Coflow Scheduling is NP-Hard. Concurrent Open Shop Scheduling with Coupled Resources: examples include job scheduling and caching blocks; solutions use an ordering heuristic; but coflows must also consider matching constraints between the input links and output links of the datacenter fabric. [Figure: flows of 3, 6, and 2 units crossing the fabric.]

Varys. Employs a two-step algorithm to minimize coflow completion times: 1. Ordering heuristic: keep an ordered list of coflows to be scheduled, preempting if needed. 2. Allocation algorithm: allocate the minimum required resources to each coflow to finish in minimum time.

Ordering Heuristic. Three coflows C1, C2, and C3 send to output ports O1, O2, O3: C1 has a 3-unit flow to each port; C2 has two 5-unit flows to one port; C3 has one 6-unit flow to that same port. Length: C1 = 3, C2 = 5, C3 = 6. Shortest-First: C1 ends at 3, C2 ends at 13, C3 ends at 19 (Total CCT = 35).

Ordering Heuristic. Width: C1 = 3, C2 = 2, C3 = 1. Narrowest-First: C3 ends at 6, C2 ends at 16, C1 ends at 19 (Total CCT = 41), versus Shortest-First (35).

Ordering Heuristic. Size: C1 = 9, C2 = 10, C3 = 6. Smallest-First: C3 ends at 6, C1 ends at 9, C2 ends at 19 (Total CCT = 34), versus Shortest-First (35) and Narrowest-First (41).

Ordering Heuristic. Bottleneck: C1 = 3, C2 = 10, C3 = 6. Smallest-Bottleneck: C1 ends at 3, C3 ends at 9, C2 ends at 19 (Total CCT = 31), versus Shortest-First (35), Narrowest-First (41), and Smallest-First (34).
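To make the comparison concrete, here is a minimal Python sketch (not Varys code) that replays the example above under all four orderings. The flow-to-port layout is my reading of the figure; ports have unit bandwidth and serve one coflow at a time in the chosen order, which is enough to reproduce the totals of 35, 41, 34, and 31.

from collections import defaultdict

# Each coflow is a list of (output_port, size) flows.
COFLOWS = {
    "C1": [("O1", 3), ("O2", 3), ("O3", 3)],  # length 3, width 3, size 9, bottleneck 3
    "C2": [("O3", 5), ("O3", 5)],             # length 5, width 2, size 10, bottleneck 10
    "C3": [("O3", 6)],                        # length 6, width 1, size 6, bottleneck 6
}

def per_port_demand(flows):
    demand = defaultdict(int)
    for port, size in flows:
        demand[port] += size
    return demand

def total_cct(order):
    """Sum of coflow completion times when every port serves coflows
    back-to-back in the given global order at rate 1."""
    port_free = defaultdict(int)  # time at which each port next becomes idle
    total = 0
    for name in order:
        finish = 0
        for port, demand in per_port_demand(COFLOWS[name]).items():
            port_free[port] += demand
            finish = max(finish, port_free[port])
        total += finish
    return total

METRICS = {
    "Shortest-First (length)": lambda f: max(size for _, size in f),
    "Narrowest-First (width)": lambda f: len(f),
    "Smallest-First (size)":   lambda f: sum(size for _, size in f),
    "Smallest-Bottleneck":     lambda f: max(per_port_demand(f).values()),
}

for label, metric in METRICS.items():
    order = sorted(COFLOWS, key=lambda name: metric(COFLOWS[name]))
    print(label, order, "Total CCT =", total_cct(order))
# Prints total CCTs of 35, 41, 34, and 31, matching the slides.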

Allocation Algorithm. A coflow cannot finish before its very last flow: finishing flows faster than the bottleneck cannot decrease a coflow's completion time. So allocate minimum flow rates such that all flows of a coflow finish together, on time.
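A minimal sketch of this allocation idea, which the Varys paper calls MADD (Minimum Allocation for Desired Duration): compute how long the coflow's bottleneck port needs at full rate, then slow every flow down so it finishes exactly then. The flow sizes and port names below are hypothetical.

from collections import defaultdict

def madd_rates(flows, port_bw):
    """flows: list of (src_port, dst_port, size). Returns (duration, per-flow rates)."""
    load = defaultdict(float)
    for src, dst, size in flows:
        load[src] += size
        load[dst] += size
    # Completion time of the most loaded (bottleneck) port at full rate.
    duration = max(load[port] / port_bw[port] for port in load)
    # Every flow is paced to finish exactly at `duration`; finishing earlier helps nobody.
    return duration, [size / duration for _, _, size in flows]

# Hypothetical 2x2 shuffle on 1-unit/sec ports: r1 carries 1 + 3 = 4 units.
flows = [("s1", "r1", 1), ("s1", "r2", 2), ("s2", "r1", 3), ("s2", "r2", 1)]
bandwidth = {port: 1.0 for port in ("s1", "s2", "r1", "r2")}
print(madd_rates(flows, bandwidth))  # duration 4.0, rates 0.25, 0.5, 0.75, 0.25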

Varys enables coflows in data-intensive clusters. 1. Coflow Scheduler: faster, application-aware data transfers throughout the network. 2. Global Coordination: consistent calculation and enforcement of scheduler decisions. 3. The Coflow API: decouples network optimizations from applications, relieving developers and end users.

The Need for Coordination. Two coflows across ports O1, O2, O3 with bottlenecks of 4 (C1) and 5 (C2). Scheduling with coordination: C1 ends at 4, C2 ends at 9 (Total CCT = 13).

The Need for Coordination. Scheduling with coordination: C1 ends at 4, C2 ends at 9 (Total CCT = 13). Scheduling without coordination: C1 ends at 7, C2 ends at 12 (Total CCT = 19). Uncoordinated local decisions interleave coflows, hurting performance.
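The effect is easy to reproduce with a toy example. The per-port demands below are hypothetical (the slide's exact figure is not recoverable from the transcript), but they show how per-port local orders interleave the two coflows and inflate total CCT.

DEMANDS = {                      # coflow -> {port: units}, unit bandwidth per port
    "C1": {"P1": 4, "P2": 1},
    "C2": {"P1": 1, "P2": 4},
}

def total_cct(order_per_port):
    """order_per_port maps each port to the order in which it serves coflows."""
    done = {name: 0 for name in DEMANDS}
    for port, order in order_per_port.items():
        t = 0
        for name in order:
            t += DEMANDS[name].get(port, 0)
            done[name] = max(done[name], t)  # a coflow ends when its last port does
    return sum(done.values())

coordinated   = {"P1": ["C1", "C2"], "P2": ["C1", "C2"]}  # one global order
uncoordinated = {"P1": ["C1", "C2"], "P2": ["C2", "C1"]}  # each port decides locally

print(total_cct(coordinated))    # C1 ends at 4, C2 at 5 -> total 9
print(total_cct(uncoordinated))  # both coflows drag on to 5 -> total 10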

Varys Architecture. Centralized master-slave architecture: applications use a client library to communicate with the master, and actual timing and rates are determined by the coflow scheduler. [Diagram: the Varys Master (Topology Monitor, Usage Estimator, Coflow Scheduler) coordinates Varys Daemons on every machine; sender and receiver tasks issue Put/Get/Reg calls through the Varys Client Library over the network interface and the (distributed) file system.] 1. Download from http://varys.net

Varys enables coflows in data-intensive clusters. 1. Coflow Scheduler: faster, application-aware data transfers throughout the network. 2. Global Coordination: consistent calculation and enforcement of scheduler decisions. 3. The Coflow API: decouples network optimizations from applications, relieving developers and end users.

The Coflow API. 1. NO changes to user jobs. 2. NO storage management. Operations: register, put, get, unregister.
@driver: b = register(broadcast); s = register(shuffle); id = b.put(content); ... b.unregister(); s.unregister()
@mapper: b.get(id); id_s1 = s.put(content)
@reducer: s.get(id_s1), ..., s.get(id_sl)
(The driver is the JobTracker; broadcast flows go from the driver to mappers, shuffle flows from mappers to reducers.)
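For illustration only, here is a self-contained Python sketch of how a driver, mappers, and reducers might drive the API above. The CoflowClient facade and its in-memory store are hypothetical stand-ins (Varys itself is a JVM library); only register, put, get, unregister, and the broadcast/shuffle coflow types come from the slide.

class CoflowClient:
    """Hypothetical client-library facade for a single coflow."""
    def __init__(self, coflow_type):
        self.coflow_type = coflow_type
        self.store = {}          # stands in for data moved over the network
        self.next_id = 0

    def put(self, content):      # returns an id that other tasks can get()
        flow_id = f"{self.coflow_type}-{self.next_id}"
        self.next_id += 1
        self.store[flow_id] = content
        return flow_id

    def get(self, flow_id):      # in Varys this would block until the flow arrives
        return self.store[flow_id]

    def unregister(self):        # tells the scheduler the coflow has completed
        self.store.clear()

def register(coflow_type):
    return CoflowClient(coflow_type)

# @driver
b = register("broadcast")
s = register("shuffle")
conf_id = b.put(b"job configuration")

# @mapper
conf = b.get(conf_id)
out_id = s.put(b"intermediate partition for a reducer")

# @reducer
partition = s.get(out_id)

# @driver, once the job finishes
b.unregister()
s.unregister()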

Evaluation. A 3000-machine trace-driven simulation matched against a 100-machine EC2 deployment. 1. Does it improve performance? YES. 2. Can it beat non-preemptive solutions? 3. Do we really need coordination?

Better than Per-Flow Fairness. Improvements over per-flow fair sharing, for communication-heavy jobs / all jobs: EC2: comm. improved 3.16X / 1.85X, jobs improved 2.50X / 1.25X. Simulation: comm. improved 4.86X / 3.21X, jobs improved 3.9X / 1.11X.

Preemption is Necessary [Sim.]. [Bar chart: average coflow completion time normalized to Varys (1.00) for Per-Flow Fairness, FIFO1, Per-Flow Prioritization, FIFO-LM, and Varys NC; overheads range up to 22.07X.] What about starvation? NO starvation. 1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM 2011

Lack of Coordination Hurts [Sim.]. [Same bar chart: overhead over Varys for Per-Flow Fairness, FIFO1, Per-Flow Prioritization2,3, FIFO-LM4, and Varys NC; values range from 1.00 up to 22.07X.] Smallest-flow-first (per-flow prioritization) minimizes flow completion time. FIFO-LM4 performs decentralized coflow scheduling: it suffers due to local decisions but works well for small, similar coflows. 1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM 2011. 2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM 2012. 3. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM 2013. 4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM 2014.

Coflow. Communication abstraction for data-parallel applications to express their performance goals: 1. The size of each flow, 2. The total number of flows, and 3. The endpoints of individual flows. But pipelining between stages, speculative executions, and task failures and restarts mean these are often not known in advance.

How to Perform Coflow Scheduling Without Complete Knowledge?

Implications. Minimizing average completion time:
                               Flows in a Single Link          Coflows in an Entire Datacenter
With complete knowledge:       Smallest-Flow-First             Ordering by Bottleneck Size + Data-Proportional Rate Allocation
Without complete knowledge:    Least-Attained Service (LAS)    ?
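For intuition on the single-link, no-knowledge cell of this table, here is a tiny LAS sketch with hypothetical flow sizes: the scheduler never looks at remaining size, yet short flows still tend to finish early because long flows accumulate attained service and lose priority.

def las_schedule(flow_sizes, quantum=1):
    """flow_sizes: {flow: total units on one link}. In every step, serve the
    unfinished flow that has attained the least service so far."""
    sent = {flow: 0 for flow in flow_sizes}
    time, done = 0, {}
    while len(done) < len(flow_sizes):
        flow = min((f for f in flow_sizes if f not in done), key=lambda f: sent[f])
        step = min(quantum, flow_sizes[flow] - sent[flow])
        time += step
        sent[flow] += step
        if sent[flow] == flow_sizes[flow]:
            done[flow] = time
    return done

print(las_schedule({"A": 2, "B": 6, "C": 3}))  # {'A': 4, 'C': 8, 'B': 11}: A exits first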

Revisiting Ordering Heuristics. On the earlier three-coflow example: Shortest-First (Total CCT = 35), Narrowest-First (41), Smallest-First (34), and Smallest-Bottleneck (31).

Coflow-Aware LAS (CLAS) Set priority that decreases with how much a coflow has already sent The more a coflow has sent, the lower its priority Smaller coflows finish faster Use total size of coflows to set priorities Avoids the drawbacks of full decentralization

Coflow-Aware LAS (CLAS). Continuous priority reduces to fair sharing when similar coflows coexist (priority oscillation): Coflow1 comp. time = 6, Coflow2 comp. time = 6. FIFO works well for similar coflows: Coflow1 comp. time = 3, Coflow2 comp. time = 6. Avoids the drawbacks of full decentralization.

Discretized Coflow-Aware LAS (D-CLAS). Priority discretization: change a coflow's priority when its total size exceeds predefined thresholds. Scheduling policies: FIFO within the same queue; prioritization across queues; weighted sharing across queues guarantees starvation avoidance. [Queues Q1 (highest priority) through QK (lowest priority), each served FIFO.]

How to Discretize Priorities? Exponentially spaced thresholds, with K = number of queues, A = threshold constant, and E = threshold exponent: Q1 (highest priority) holds coflows that have sent less than E^1 A bytes, Q2 holds [E^1 A, E^2 A), ..., and QK (lowest priority) holds everything beyond E^(K-1) A; each queue is served FIFO. Loose coordination suffices to calculate global coflow sizes; slaves make independent decisions in between. Small coflows (smaller than E^1 A) do not experience coordination overheads!
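A small sketch of the thresholding rule, using illustrative values for K, A, and E (not necessarily Aalo's defaults): a coflow's queue depends only on how much it has already sent, so its priority can only decrease over time.

def dclas_queue(bytes_sent, K=10, A=10 * 2**20, E=10):
    """Return the queue index (0 = Q1, highest priority) for a coflow that has
    sent `bytes_sent` bytes, given exponentially spaced thresholds E^i * A."""
    for i in range(1, K):                 # thresholds E^1*A, ..., E^(K-1)*A
        if bytes_sent < (E ** i) * A:
            return i - 1                  # still below the i-th threshold -> queue Qi
    return K - 1                          # beyond the last threshold -> lowest priority

for sent in (1 * 2**20, 50 * 2**20, 5 * 2**30, 10 * 2**40):
    print(sent, "->", "Q%d" % (dclas_queue(sent) + 1))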

Closely Approximates Varys [Sim. & EC2]. [Same bar chart of overhead over Varys, now adding coflow scheduling without complete knowledge: it stays close to Varys (1.00), while per-flow fairness, FIFO1, per-flow prioritization2,3, FIFO-LM4, and uncoordinated Varys remain worse, up to 22.07X.] 1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM 2011. 2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM 2012. 3. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM 2013. 4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM 2014.

My Contributions. Application-Aware Network Scheduling: Orchestra (SIGCOMM 11, merged with Spark), Varys (coflows, SIGCOMM 14, open-source), Aalo (SIGCOMM 15, open-source).

My Contributions. Application-Aware Network Scheduling: Orchestra (SIGCOMM 11), Varys (SIGCOMM 14), Aalo (SIGCOMM 15).

My Contributions. Network-Aware Applications: Spark (NSDI 12), Sinbad (SIGCOMM 13). Application-Aware Network Scheduling: Orchestra (SIGCOMM 11), Varys (SIGCOMM 14), Aalo (SIGCOMM 15). Datacenter Resource Allocation: FairCloud (SIGCOMM 12), HARP (SIGCOMM 12), ViNEYard (ToN 12).

My Contributions. Network-Aware Applications: Spark (NSDI 12, Top-Level Apache Project), Sinbad (SIGCOMM 13, merged at Facebook). Application-Aware Network Scheduling: Orchestra (SIGCOMM 11, merged with Spark), Varys (SIGCOMM 14, open-source), Aalo (SIGCOMM 15, open-source). Datacenter Resource Allocation: FairCloud (SIGCOMM 12, @HP), HARP (SIGCOMM 12, @Microsoft Bing), ViNEYard (ToN 12, open-source).

Communication-First Big Data Systems. In-Datacenter Analytics: cheaper SSDs and DRAM, the proliferation of optical networks, and resource disaggregation will make the network the primary bottleneck. Inter-Datacenter Analytics: bandwidth-constrained wide-area networks. End-User Delivery: faster and more responsive delivery of analytics results over the Internet for a better end-user experience.

MIND THE GAP between Systems and Networking. Better capture application-level performance goals using coflows. Coflows improve application-level performance and usability. Extends the networking and scheduling literature. Coordination, even if not free, is worth paying for in many cases. mosharaf@cs.berkeley.edu http://mosharaf.com

Improve Flow Completion Times. Per-Flow Fair Sharing: Shuffle Completion Time = 5, Avg. Flow Completion Time = 3.66. Smallest-Flow First1,2: flows on the link to r1 finish at 1, 2, and 4 and on the link to r2 at 1, 2, and 6; Shuffle Completion Time = 6, Avg. Flow Completion Time = 2.66. 1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM 2012. 2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM 2013.

Distributions of Coflow Characteristics. [CDFs of coflow length (bytes), width (number of flows), size (bytes), and bottleneck size (bytes); lengths, sizes, and bottlenecks span roughly 10^6 to 10^14 bytes, and widths span 1 to 10^8 flows.]

Traffic Sources. 1. Ingest and replicate new data. 2. Read input from remote machines, when needed. 3. Transfer intermediate data. 4. Write and replicate output. Percentage of traffic by category at Facebook: 46, 10, 30, and 14.

Distribution of Shuffle Durations. Performance: Facebook jobs spend ~25% of runtime on average in intermediate comm.1 [CDF: fraction of jobs vs. fraction of runtime spent in shuffle.] 1. Month-long trace from a 3000-machine MapReduce production cluster at Facebook: 20,000 jobs, 150 million tasks.

Theoretical Results. Structure of optimal schedules: permutation schedules might not always lead to the optimal solution. Approximation ratio of COSS-CR: a polynomial-time algorithm with constant approximation ratio (64/3).1 The need for coordination: fully decentralized schedulers can perform arbitrarily worse than the optimal. 1. Due to Zhen Qiu, Cliff Stein, and Yuan Zhong from the Department of Industrial Engineering and Operations Research, Columbia University, 2014.

The Coflow API. 1. NO changes to user jobs. 2. NO storage management. Operations: register, put, get, unregister.
@driver: b = register(broadcast, numFlows); s = register(shuffle, numFlows, {b}); id = b.put(content, size); ... b.unregister(); s.unregister()
@mapper: b.get(id); id_s1 = s.put(content, size)
@reducer: s.get(id_s1), ..., s.get(id_sl)

Varys. Employs a two-step algorithm to support coflow deadlines: 1. Admission control: do not admit any coflow that cannot be completed within its deadline without violating existing deadlines. 2. Allocation algorithm: allocate the minimum required resources to each coflow to finish it exactly at its deadline.
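A simplified sketch of the admission-control idea (an illustration of the policy, not Varys's implementation): reserve the minimum rate each flow of a new coflow needs to meet its deadline, and reject the coflow if any port would be oversubscribed. A real scheduler would also release reservations as admitted coflows finish.

from collections import defaultdict

class DeadlineAdmission:
    def __init__(self, port_bw):
        self.port_bw = port_bw               # bytes/sec per port
        self.reserved = defaultdict(float)   # port -> already-reserved rate

    def try_admit(self, flows, deadline):
        """flows: list of (src_port, dst_port, size_bytes). Reserves minimum
        rates and returns True iff the coflow fits without hurting earlier deadlines."""
        needed = defaultdict(float)
        for src, dst, size in flows:
            rate = size / deadline           # minimum rate to finish by the deadline
            needed[src] += rate
            needed[dst] += rate
        if any(self.reserved[p] + r > self.port_bw[p] for p, r in needed.items()):
            return False                     # would violate an existing reservation
        for p, r in needed.items():
            self.reserved[p] += r
        return True

ports = {p: 1.0e9 for p in ("s1", "s2", "r1", "r2")}            # 1 GB/s each
admission = DeadlineAdmission(ports)
print(admission.try_admit([("s1", "r1", 4e9), ("s2", "r1", 2e9)], deadline=10))  # True
print(admission.try_admit([("s2", "r1", 6e9)], deadline=10))    # False: r1 over capacity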

More Predictable. [Stacked bars of the % of coflows that met their deadline, were not admitted, or missed their deadline: the Facebook trace simulation compares Varys, per-flow fair sharing, and EDF (Earliest-Deadline First)1; the EC2 deployment compares Varys and fair sharing.] 1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM 2012

Optimizing Communication Performance: Systems Approach. Let users figure it out. Number of communication parameters*: Spark-v1.1.1: 6; Hadoop-v1.2.1: 10; YARN-v2.6.0: 20. * Lower bound. Does not include many parameters that can indirectly impact communication, e.g., the number of reducers. Also excludes control-plane communication/RPC parameters.

Experimental Methodology. Varys deployment in EC2: 100 m2.4xlarge machines; each machine has 8 CPU cores, 68.4 GB memory, and a 1 Gbps NIC; ~900 Mbps/machine during all-to-all communication. Trace-driven simulation: detailed replay of a day-long Facebook trace (circa October 2010); 3000-machine, 150-rack cluster with 10:1 oversubscription.