Coflow: Mending the Application-Network Gap in Big Data Analytics. Mosharaf Chowdhury, UC Berkeley
Big Data: the volume of data businesses want to make sense of is increasing. Increasing variety of sources (web, mobile, wearables, vehicles, scientific, ...). Cheaper disks, SSDs, and memory. Stalling processor speeds.
Big Datacenters for Massive Parallelism: a timeline (2005-2015) of data-parallel systems, including MapReduce, Hadoop, Dryad, DryadLINQ, Pregel, Hive, Dremel, GraphLab, Spark, Storm, BlinkDB, Spark-Streaming, and GraphX.
Data-Parallel Applications: multi-stage dataflow, with computation interleaved with communication. Computation stage (e.g., Map, Reduce): distributed across many machines; tasks run in parallel. Communication stage (e.g., Shuffle): between successive computation stages; a communication stage cannot complete until all the data have been transferred.
Communication is Crucial. Performance: Facebook jobs spend ~25% of runtime on average in intermediate communication. [1] As SSD-based and in-memory systems proliferate, the network is likely to become the primary bottleneck. 1. Based on a month-long trace with 20,000 jobs and 150 million tasks, collected from a 3000-machine Facebook production MapReduce cluster.
Flow: transfers data from a source to a destination; the independent unit of allocation, sharing, load balancing, and/or prioritization.
Faster Communication Stages, the Networking Approach: configuration should be handled at the system level.
Existing Solutions: a timeline (1980s-2015) including GPS, WFQ, CSFQ, RED, ECN, XCP, RCP, DCTCP, D3, D2TCP, DeTail, PDQ, pfabric, and FCP, targeting per-flow fairness or flow completion time. Independent flows cannot capture the collective communication behavior common in data-parallel applications.
Why Do They Fall Short? [Figure: a shuffle from senders s1, s2, s3 to receivers r1, r2 over the datacenter network] Per-Flow Fair Sharing: Shuffle Completion Time = 5; Avg. Flow Completion Time = 3.66. Solutions focusing on flow completion time cannot further decrease the shuffle completion time.
Improve Application-Level Performance [1]: slow down faster flows to accelerate slower flows. Per-Flow Fair Sharing: Shuffle Completion Time = 5; Avg. Flow Completion Time = 3.66. Data-Proportional Allocation: Shuffle Completion Time = 4; Avg. Flow Completion Time = 4. 1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM 2011.
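To make the idea concrete, here is a minimal sketch in Python (not Orchestra's actual implementation) of data-proportional sharing on a single link: each flow gets a share of the link proportional to its remaining bytes, so flows that share a bottleneck finish together. The flow names and sizes are made up for illustration.

    # Illustrative sketch of data-proportional sharing on one link (not Orchestra code).
    def data_proportional_rates(flows_on_link, link_capacity):
        """flows_on_link: dict flow_id -> remaining data units on this link.
        Returns dict flow_id -> rate, proportional to each flow's remaining data."""
        total = sum(flows_on_link.values())
        if total == 0:
            return {f: 0.0 for f in flows_on_link}
        return {f: link_capacity * size / total for f, size in flows_on_link.items()}

    # Made-up example: flows of 2 and 4 units share a link of capacity 1 unit/time.
    # The larger flow gets 2/3 of the link, so both finish together at t = 6.
    print(data_proportional_rates({"s1->r1": 2, "s2->r1": 4}, link_capacity=1.0))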
Faster Communication Stages, the Systems Approach: applications know their performance goals, but they have no means to let the network know. Configuration should be handled by the end users.
Faster Communication Stages: the Systems Approach says configuration should be handled by the end users; the Networking Approach says configuration should be handled at the system level.
MIND THE GAP: a holistic approach, with applications and the network working together.
Coflow: communication abstraction for data-parallel applications to express their performance goals: 1. minimize completion times, 2. meet deadlines, or 3. perform fair allocation.
Coflow structures: Broadcast, Aggregation, All-to-All, Shuffle, Single Flow, Parallel Flows.
How to schedule coflows online in the datacenter: #1 for faster completion of coflows? #2 to meet more deadlines? #3 for fair allocation of the network?
Varys [1] enables coflows in data-intensive clusters: 1. Coflow Scheduler: faster, application-aware data transfers throughout the network. 2. Global Coordination: consistent calculation and enforcement of scheduler decisions. 3. The Coflow API: decouples network optimizations from applications, relieving developers and end users. 1. Efficient Coflow Scheduling with Varys, SIGCOMM 2014.
Coflow: communication abstraction for data-parallel applications to express their performance goals. 1. The size of each flow, 2. the total number of flows, and 3. the endpoints of individual flows.
Benefits of Inter-Coflow Scheduling: Coflow 1 and Coflow 2 share two links (flows of 3 units, 6 units, and 2 units). Fair Sharing: Coflow 1 comp. time = 5, Coflow 2 comp. time = 6. Smallest-Flow First [1,2]: Coflow 1 comp. time = 5, Coflow 2 comp. time = 6. The Optimal: Coflow 1 comp. time = 3, Coflow 2 comp. time = 6. 1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM 2012. 2. pfabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM 2013.
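The arithmetic behind these numbers can be checked with a small sketch. The exact flow layout is not spelled out on the slide, so the layout below is an assumption that is consistent with its numbers: Link 1 carries a 6-unit flow of Coflow 2, Link 2 carries a 3-unit flow of Coflow 1 and a 2-unit flow of Coflow 2, and each link delivers 1 unit per time step.

    # ASSUMED flow layout (consistent with the slide, not stated on it):
    #   Link 1: Coflow 2, 6 units.   Link 2: Coflow 1, 3 units and Coflow 2, 2 units.

    # Per-flow fair sharing on Link 2: both flows get rate 0.5 until the 2-unit flow
    # ends at t = 4; Coflow 1's flow then finishes its last unit at t = 5.
    c1_fair = 4 + (3 - 0.5 * 4) / 1.0     # = 5
    c2_fair = max(6, 4)                   # = 6 (the 6-unit flow on Link 1 dominates)

    # Smallest-flow first on Link 2: the 2-unit flow ends at 2, the 3-unit flow at 5,
    # so Coflow 1 still finishes at 5 and Coflow 2 at 6, no better than fair sharing.
    c1_sff, c2_sff = 5, 6

    # Coflow-aware (optimal): run Coflow 1's 3-unit flow first on Link 2.
    c1_opt = 3                            # Coflow 1 done at t = 3
    c2_opt = max(6, 3 + 2)                # = 6, unchanged

    print("fair:", (c1_fair, c2_fair), "smallest-flow:", (c1_sff, c2_sff),
          "coflow-aware:", (c1_opt, c2_opt))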
Inter-Coflow Scheduling is NP-Hard: Coflow 1 and Coflow 2 share two links (flows of 3 units, 6 units, and 2 units). Inter-coflow scheduling is concurrent open shop scheduling [3]; examples include job scheduling and caching blocks; solutions use an ordering heuristic. Fair Sharing: Coflow 1 comp. time = 6, Coflow 2 comp. time = 6. Flow-level Prioritization [1,2]: Coflow 1 comp. time = 6, Coflow 2 comp. time = 6. The Optimal: Coflow 1 comp. time = 3, Coflow 2 comp. time = 6. 1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM 2012. 2. pfabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM 2013. 3. A Note on the Complexity of the Concurrent Open Shop Problem, Journal of Scheduling, 9(4):389-396, 2006.
Inter-Coflow Scheduling is NP-Hard: concurrent open shop scheduling with coupled resources. Examples include job scheduling and caching blocks; solutions use an ordering heuristic; here we must also consider matching constraints between the input links and output links of the datacenter fabric.
Varys employs a two-step algorithm to minimize coflow completion times: 1. Ordering heuristic: keep an ordered list of coflows to be scheduled, preempting if needed. 2. Allocation algorithm: allocate the minimum required resources to each coflow to finish it in minimum time.
Ordering Heuristic
[Figure: coflows C1, C2, C3 over a 3x3 datacenter fabric with output ports O1, O2, O3]
C1 / C2 / C3: Length 3 / 5 / 6; Width 3 / 2 / 1; Size 9 / 10 / 6; Bottleneck 3 / 10 / 6
Shortest-First: C1 ends at 3, C2 at 13, C3 at 19 (Total CCT = 35)
Narrowest-First: C3 ends at 6, C2 at 16, C1 at 19 (Total CCT = 41)
Smallest-First: C3 ends at 6, C1 at 9, C2 at 19 (Total CCT = 34)
Smallest-Bottleneck: C1 ends at 3, C3 at 9, C2 at 19 (Total CCT = 31)
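The four totals can be reproduced with a short sketch. The flow layout below is an assumption reconstructed from the Length / Width / Size / Bottleneck row (it is not given explicitly on the slide): C1 sends 3 units to each of O1, O2, O3; C2 sends 10 units to O3 (two 5-unit flows); C3 sends 6 units to O3. Each port drains 1 unit per time step, coflows get strict priority per port in the chosen order, and spare capacity is backfilled.

    # Sketch reproducing the totals above under an ASSUMED flow layout.
    COFLOWS = {"C1": {"O1": 3, "O2": 3, "O3": 3},   # length 3, width 3, size 9, bottleneck 3
               "C2": {"O3": 10},                    # two 5-unit flows to O3: length 5, width 2
               "C3": {"O3": 6}}                     # length 6, width 1, size 6, bottleneck 6

    def ccts(order):
        """Per-coflow completion times for a priority order: on each port, a coflow's
        flow finishes after all units of earlier coflows on that port."""
        scheduled = {}                              # port -> units already scheduled
        out = {}
        for c in order:
            finish = 0
            for port, units in COFLOWS[c].items():
                scheduled[port] = scheduled.get(port, 0) + units
                finish = max(finish, scheduled[port])
            out[c] = finish
        return out

    for name, order in [("Shortest-First", ["C1", "C2", "C3"]),
                        ("Narrowest-First", ["C3", "C2", "C1"]),
                        ("Smallest-First", ["C3", "C1", "C2"]),
                        ("Smallest-Bottleneck", ["C1", "C3", "C2"])]:
        times = ccts(order)
        print(f"{name:20s} {times}  total CCT = {sum(times.values())}")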
Allocation Algorithm: a coflow cannot finish before its very last flow, so finishing flows faster than the bottleneck cannot decrease a coflow's completion time. Allocate minimum flow rates such that all flows of a coflow finish together, on time.
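A minimal sketch of that allocation rule, assuming we already know each flow's remaining bytes and the bandwidth available at every port (the function and variable names are illustrative, not the Varys code): the coflow's earliest possible finish time is set by its most loaded port, and every flow is then given just enough rate to finish exactly at that time.

    # Illustrative sketch: finish-together rate allocation for one coflow.
    def allocate_for_coflow(flows, port_capacity):
        """flows: list of (src_port, dst_port, remaining_bytes).
        port_capacity: dict port -> available bandwidth in bytes/sec.
        Returns (completion_time, {flow_index: rate})."""
        load = {}
        for src, dst, size in flows:
            load[src] = load.get(src, 0) + size
            load[dst] = load.get(dst, 0) + size
        # The bottleneck port determines the soonest the whole coflow can finish.
        finish = max(load[p] / port_capacity[p] for p in load)
        # Slow every flow down so that it ends exactly at the bottleneck's finish time.
        return finish, {i: size / finish for i, (_, _, size) in enumerate(flows)}

    # Made-up example: two flows leave the same 1 GB/s source port (4 GB and 2 GB).
    t, rates = allocate_for_coflow([("in1", "out1", 4e9), ("in1", "out2", 2e9)],
                                   {"in1": 1e9, "out1": 1e9, "out2": 1e9})
    print(t, rates)   # bottleneck "in1" needs 6 s; rates are ~0.67 GB/s and ~0.33 GB/s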
Varys enables coflows in data-intensive clusters: 1. Coflow Scheduler: faster, application-aware data transfers throughout the network. 2. Global Coordination: consistent calculation and enforcement of scheduler decisions. 3. The Coflow API: decouples network optimizations from applications, relieving developers and end users.
The Need for Coordination: [Figure: coflows C1 and C2 over ports O1, O2, O3] C1 / C2 Bottleneck: 4 / 5. Scheduling with Coordination: C1 ends at 4, C2 ends at 9 (Total CCT = 13).
The Need for Coordination: Scheduling with Coordination: C1 ends at 4, C2 ends at 9 (Total CCT = 13). Scheduling without Coordination: C1 ends at 7, C2 ends at 12 (Total CCT = 19). Uncoordinated local decisions interleave coflows, hurting performance.
Varys Architecture: centralized master-slave architecture; applications use a client library to communicate with the master; actual timing and rates are determined by the coflow scheduler. [Figure: the Varys master (Topology Monitor, Usage Estimator, Coflow Scheduler) coordinates Varys daemons; sender, receiver, and driver tasks call the Varys client library (put, get, reg) over the network interface and the (distributed) file system] 1. Download from http://varys.net
Varys enables coflows in data-intensive clusters: 1. Coflow Scheduler: faster, application-aware data transfers throughout the network. 2. Global Coordination: consistent calculation and enforcement of scheduler decisions. 3. The Coflow API: decouples network optimizations from applications, relieving developers and end users.
The Coflow API: 1. NO changes to user jobs; 2. NO storage management.
@driver: b = register(broadcast); s = register(shuffle); id = b.put(content); ... b.unregister(); s.unregister()
@mapper: b.get(id); id_s1 = s.put(content)
@reducer: s.get(id_s1 ... id_sl)
[Figure: driver (JobTracker), mappers, and reducers exchanging the broadcast and shuffle coflows via register, put, get, and unregister]
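The snippet below is a toy, self-contained Python stand-in for the register / put / get / unregister pattern shown above. The class and function names are illustrative assumptions, not the real Varys client library (which ships with Varys; see http://varys.net); the point is only that user code hands data to a named coflow and fetches it by id, without managing storage or changing job logic.

    # Toy in-memory stand-in for the coflow API shape (NOT the Varys client library).
    import itertools

    class ToyCoflow:
        _ids = itertools.count()                  # unique ids across all coflows

        def __init__(self, kind):
            self.kind, self.blobs = kind, {}

        def put(self, content):                   # sender side: hand data to the coflow
            i = next(self._ids)
            self.blobs[i] = content
            return i                              # receivers later fetch by this id

        def get(self, i):                         # receiver side: fetch by id
            return self.blobs[i]

        def unregister(self):                     # tell the scheduler the coflow is over
            self.blobs.clear()

    def register(kind):
        return ToyCoflow(kind)

    # @driver
    b, s = register("broadcast"), register("shuffle")
    conf_id = b.put({"reducers": 2})
    # @mapper: each mapper reads the broadcast and contributes its shuffle partition
    conf = b.get(conf_id)
    shuffle_ids = [s.put(("partition", m, conf["reducers"])) for m in range(2)]
    # @reducer: fetch the shuffle pieces by id
    print([s.get(i) for i in shuffle_ids])
    b.unregister(); s.unregister()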
Evaluation: a 3000-machine trace-driven simulation matched against a 100-machine EC2 deployment. 1. Does it improve performance? YES 2. Can it beat non-preemptive solutions? 3. Do we really need coordination?
Better than Per-Flow Fairness
         Comm. Improv.           Job Improv.
         Comm.-Heavy   All       Comm.-Heavy   All
EC2      3.16X         1.85X     2.50X         1.25X
Sim.     4.86X         3.21X     3.39X         1.11X
Preemption is Necessary [Sim.]
[Bar chart: average CCT overhead over Varys for Per-Flow Fairness, FIFO [1], Per-Flow Prioritization, FIFO-LM, and Varys NC (no coordination); Varys = 1.00, other bars include 1.10, 3.21, 5.53, 5.65, and 22.07]
What about starvation? NO.
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM 2011
Lack of Coordination Hurts [Sim.]
[Same bar chart: average CCT overhead over Varys; Varys = 1.00, other bars include 1.10, 3.21, 5.53, 5.65, and 22.07]
Smallest-flow-first (per-flow prioritization) [2,3] minimizes flow completion time.
FIFO-LM [4] performs decentralized coflow scheduling: it works well for small, similar coflows but suffers due to local decisions.
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM 2011
2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM 2012
3. pfabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM 2013
4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM 2014
Coflow: communication abstraction for data-parallel applications to express their performance goals. 1. The size of each flow, 2. the total number of flows, and 3. the endpoints of individual flows: all of these may be unknown a priori because of pipelining between stages, speculative executions, and task failures and restarts.
How to Perform Coflow Scheduling Without Complete Knowledge?
Implications: Minimize Avg. Comp. Time
                             Flows in a Single Link          Coflows in an Entire Datacenter
With complete knowledge      Smallest-Flow-First             Ordering by Bottleneck Size + Data-Proportional Rate Allocation
Without complete knowledge   Least-Attained Service (LAS)    ?
Revisiting Ordering Heuristics
Shortest-First: C1 ends at 3, C2 at 13, C3 at 19 (Total CCT = 35)
Narrowest-First: C3 ends at 6, C2 at 16, C1 at 19 (Total CCT = 41)
Smallest-First: C3 ends at 6, C1 at 9, C2 at 19 (Total CCT = 34)
Smallest-Bottleneck: C1 ends at 3, C3 at 9, C2 at 19 (Total CCT = 31)
Coflow-Aware LAS (CLAS): set a priority that decreases with how much a coflow has already sent; the more a coflow has sent, the lower its priority, so smaller coflows finish faster. Use the total size sent by a coflow (across all its flows) to set its priority. Avoids the drawbacks of full decentralization.
Coflow-Aware LAS (CLAS): continuous priority reduces to fair sharing when similar coflows coexist (priority oscillation): Coflow 1 comp. time = 6, Coflow 2 comp. time = 6. FIFO works well for similar coflows: Coflow 1 comp. time = 3, Coflow 2 comp. time = 6.
Discretized Coflow-Aware LAS (D-CLAS): priority discretization, changing a coflow's priority when its total size sent exceeds predefined thresholds. Scheduling policies: FIFO within the same queue; prioritization across queues; weighted sharing across queues guarantees starvation avoidance. [Figure: queues Q1 (highest priority) through QK (lowest priority), each served FIFO]
How to Discretize Priorities? Exponentially spaced thresholds: K = number of queues, A = threshold constant, E = threshold exponent. [Figure: Q1 (highest priority) holds coflows that have sent [0, E^1·A); Q2 holds [E^1·A, E^2·A); ...; QK (lowest priority) holds everything above E^(K-1)·A; FIFO within each queue] Loose coordination suffices to calculate global coflow sizes; slaves make independent decisions in between. Small coflows (smaller than E^1·A) do not experience coordination overheads!
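A small sketch of that discretization, with K, A, and E keeping the meaning above and made-up byte counts in the example. A coflow's queue depends only on how much it has already sent, so the master only needs loosely coordinated global byte counts. The real design uses weighted sharing across the queues to avoid starvation; the sketch simplifies that to strict priority across queues with FIFO inside each queue.

    # Sketch of D-CLAS queue assignment with exponentially spaced thresholds:
    # Q1 holds coflows that have sent < E^1*A bytes, Q2 holds [E^1*A, E^2*A), ...,
    # QK holds everything >= E^(K-1)*A. Lower queue index = higher priority.
    def queue_of(bytes_sent, K=10, A=10 * 2**20, E=10):
        for q in range(1, K):
            if bytes_sent < (E ** q) * A:
                return q
        return K

    def schedule_order(coflows, K=10, A=10 * 2**20, E=10):
        """coflows: dict name -> (bytes_sent_so_far, arrival_order).
        Serve higher-priority queues first; FIFO (by arrival) within a queue."""
        queues = {q: [] for q in range(1, K + 1)}
        for name, (sent, arrival) in coflows.items():
            queues[queue_of(sent, K, A, E)].append((arrival, name))
        order = []
        for q in range(1, K + 1):
            order += [name for _, name in sorted(queues[q])]
        return order

    # Made-up example: a 1 MiB coflow jumps ahead of a 5 GiB one despite arriving later.
    print(schedule_order({"big": (5 * 2**30, 0), "small": (1 * 2**20, 1)}))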
Closely Approximates Varys [Sim. & EC2]
[Same bar chart with an added bar for coflow scheduling w/o complete knowledge; Varys = 1.00, other bars include 1.10, 3.21, 5.53, 5.65, and 22.07; compared schemes: Per-Flow Fairness, FIFO [1], Per-Flow Prioritization [2,3], FIFO-LM [4], Varys NC]
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM 2011
2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM 2012
3. pfabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM 2013
4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM 2014
My Contributions: Coflow (Application-Aware Network Scheduling): Orchestra (SIGCOMM'11; merged with Spark), Varys (SIGCOMM'14; open-source), Aalo (SIGCOMM'15; open-source).
My Contributions: Application-Aware Network Scheduling: Orchestra (SIGCOMM'11), Varys (SIGCOMM'14), Aalo (SIGCOMM'15).
My Contributions
Network-Aware Applications: Spark (NSDI'12), Sinbad (SIGCOMM'13)
Application-Aware Network Scheduling: Orchestra (SIGCOMM'11), Varys (SIGCOMM'14), Aalo (SIGCOMM'15)
Datacenter Resource Allocation: FairCloud (SIGCOMM'12), HARP (SIGCOMM'12), ViNEYard (ToN'12)
My Contributions
Network-Aware Applications: Spark (NSDI'12; top-level Apache project), Sinbad (SIGCOMM'13; merged at Facebook)
Application-Aware Network Scheduling: Orchestra (SIGCOMM'11; merged with Spark), Varys (SIGCOMM'14; open-source), Aalo (SIGCOMM'15; open-source)
Datacenter Resource Allocation: FairCloud (SIGCOMM'12; @HP), HARP (SIGCOMM'12; @Microsoft Bing), ViNEYard (ToN'12; open-source)
Communication-First Big Data Systems
In-Datacenter Analytics: cheaper SSDs and DRAM, the proliferation of optical networks, and resource disaggregation will make the network the primary bottleneck.
Inter-Datacenter Analytics: bandwidth-constrained wide-area networks.
End User Delivery: faster and more responsive delivery of analytics results over the Internet for a better end-user experience.
Systems | MIND THE GAP | Networking. Better capture application-level performance goals using coflows. Coflows improve application-level performance and usability. Extends the networking and scheduling literature. Coordination, even if not free, is worth paying for in many cases. mosharaf@cs.berkeley.edu | http://mosharaf.com
Improve Flow Completion Times: [Figure: the same 3-sender, 2-receiver shuffle over the datacenter network] Per-Flow Fair Sharing: Shuffle Completion Time = 5; Avg. Flow Completion Time = 3.66. Smallest-Flow First [1,2]: Shuffle Completion Time = 6; Avg. Flow Completion Time = 2.66. 1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM 2012. 2. pfabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM 2013.
Distributions of Coflow Characteristics: [CDFs: fraction of coflows vs. coflow length (bytes), coflow width (number of flows), coflow size (bytes), and coflow bottleneck size (bytes)]
Traffic Sources: 1. Ingest and replicate new data 2. Read input from remote machines, when needed 3. Transfer intermediate data 4. Write and replicate output. [Chart: percentage of traffic by category at Facebook: 46%, 30%, 14%, and 10%]
Distribution of Shuffle Durations. Performance: Facebook jobs spend ~25% of runtime on average in intermediate communication (month-long trace from a 3000-machine MapReduce production cluster at Facebook: 20,000 jobs, 150 million tasks). [CDF: fraction of jobs vs. fraction of runtime spent in shuffle]
Theoretical Results
Structure of optimal schedules: permutation schedules might not always lead to the optimal solution.
Approximation ratio of COSS-CR: a polynomial-time algorithm with constant approximation ratio (64/3). [1]
The need for coordination: fully decentralized schedulers can perform arbitrarily worse than the optimal.
1. Due to Zhen Qiu, Cliff Stein, and Yuan Zhong from the Department of Industrial Engineering and Operations Research, Columbia University, 2014.
The Coflow API: 1. NO changes to user jobs; 2. NO storage management.
@driver: b = register(broadcast, numflows); s = register(shuffle, numflows, {b}); id = b.put(content, size); ... b.unregister(); s.unregister()
@mapper: b.get(id); id_s1 = s.put(content, size)
@reducer: s.get(id_s1 ... id_sl)
[Figure: driver (JobTracker), mappers, and reducers exchanging the broadcast and shuffle coflows via register, put, get, and unregister]
Varys employs a two-step algorithm to support coflow deadlines: 1. Admission control: do not admit any coflow that cannot be completed within its deadline without violating existing deadlines. 2. Allocation algorithm: allocate the minimum required resources to each coflow to finish it exactly at its deadline.
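A minimal sketch of the admission test under simplifying assumptions (not the Varys implementation): every admitted coflow reserves a constant, just-in-time rate of remaining bytes divided by time to deadline on each port it uses, and a new coflow is admitted only if those reservations still fit within every port's capacity. Names and data structures are illustrative.

    # Illustrative deadline admission check (simplified; not Varys code).
    def required_rates(port_bytes, seconds_to_deadline):
        """port_bytes: dict port -> bytes this coflow must move through that port."""
        return {p: b / seconds_to_deadline for p, b in port_bytes.items()}

    def admit(new_coflow, admitted, capacity):
        """new_coflow and each admitted entry: (port_bytes, seconds_to_deadline).
        capacity: dict port -> bytes/sec. True iff the new coflow fits everywhere."""
        reserved = {p: 0.0 for p in capacity}
        for port_bytes, ttl in admitted:
            for p, r in required_rates(port_bytes, ttl).items():
                reserved[p] += r
        for p, r in required_rates(*new_coflow).items():
            if reserved.get(p, 0.0) + r > capacity[p]:
                return False          # admitting it would violate someone's deadline
        return True

    # Made-up example: port "out1" offers 1e9 B/s; one admitted coflow needs 0.5e9 B/s.
    existing = [({"out1": 5e9}, 10.0)]
    print(admit(({"out1": 2e9}, 4.0), existing, {"out1": 1e9}))   # needs 0.5e9 -> True
    print(admit(({"out1": 4e9}, 4.0), existing, {"out1": 1e9}))   # needs 1.0e9 -> False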
More Predictable: [Bar charts: percentage of coflows that met their deadline, were not admitted, or missed their deadline under Varys, per-flow fairness, and EDF (Earliest-Deadline First) [1]; Facebook trace simulation and EC2 deployment] 1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM 2012
Optimizing Communication Performance: Systems Approach. Let users figure it out. # Comm. Params*: Spark-v1.1.1: 6; Hadoop-v1.2.1: 10; YARN-v2.6.0: 20. *Lower bound. Does not include many parameters that can indirectly impact communication (e.g., number of reducers); also excludes control-plane communication/RPC parameters.
Experimental Methodology
Varys deployment in EC2: 100 m2.4xlarge machines, each with 8 CPU cores, 68.4 GB memory, and a 1 Gbps NIC (~900 Mbps/machine during all-to-all communication).
Trace-driven simulation: detailed replay of a day-long Facebook trace (circa October 2010) from a 3000-machine, 150-rack cluster with 10:1 oversubscription.