Advanced Computer Networks. Scheduling

1 Advanced Computer Networks: Scheduling. Patrick Stuedi, Qin Yin and Timothy Roscoe. Oriana Riva, Department of Computer Science, ETH Zürich. Spring Semester 2015.

2 Outline. Last time: load balancing, layer-4 switching, layer-7 switching, TCP splicing, Ananta. Today: scheduling and data transfer orchestration, covering two research projects.

3 Cluster computing frameworks. Cluster computing frameworks like MapReduce and Spark are an important class of data center applications (web search, machine learning, etc.), so their performance and efficiency are of major interest. Networking challenges discussed in lecture 6 (TCP): incast, buffer buildup at switches due to mice & elephant traffic patterns, missed deadlines (and SLAs), and poor utilization of multiple paths.

4 Remember: the partition/aggregate pattern. Partition work across workers, then aggregate the results.

5 Two key traffic patterns in MapReduce. Broadcast (one-to-many): the map stage partitions work across the job. Shuffle (many-to-many): the reduce stage aggregates the results.

6 MapReduce logs from a Facebook cluster. A week-long trace of 188'000 MapReduce jobs from a 3000-node cluster shows that jobs spend 33% of their time in the shuffle on average.

7 In lecture 6 (TCP) we discussed several techniques to improve networking for partition/aggregate applications: fine-grained TCP timers (reduce long-tail effects), DCTCP (reduces queue buildup), D3 and D2TCP (meet deadlines and SLAs), MPTCP (leverages multiple network paths). These approaches all work on a per-flow basis: none of them looks at the collective behavior of flows by taking job semantics into account, and there is no coordination between the individual network transfers within a single job.

9 Lack of coordination can hurt performance.

10 Scalability of the Netflix recommendation system: it becomes bottlenecked by communication as the cluster size increases.

11 Two research projects. Orchestra: coordinate all transfers within a MapReduce job. Fastpass: coordinate packet transfers and path selection among all flows.

12 Orchestra: Managing Data Transfers in Computer Clusters

13 Orchestra: key idea. Optimize at the level of transfers instead of individual flows, where a transfer is the set of all flows transporting data between two stages of a job. Coordination is done through three control components: Cornet (cooperative broadcast), Weighted Shuffle Scheduling (shuffle coordination), and the Inter-Transfer Controller (ITC, global coordination).

14 Cornet: cooperative broadcast. Key idea: broadcasting in MapReduce is very similar to data distribution mechanisms in the Internet such as BitTorrent, so Cornet is a BitTorrent-like protocol optimized for data centers. Data is split into blocks and distributed across the nodes of the data center; to receive, a node requests and gathers blocks from various nodes, and receivers of blocks become part of the sender set (as in BitTorrent). Cornet differs from classical BitTorrent: blocks are much larger (4 MB), the data center is assumed to have high bandwidth, there is no need for incentives (no selfish peers in the data center), and the protocol is topology-aware (the data center topology is known, so a receiver chooses a sender in the same rack where possible; see the sketch below).
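
A minimal Python sketch of that topology-aware sender choice (an illustration only, not the Cornet implementation; the rack map and block-ownership structures are assumed inputs):

```python
# Prefer a sender in the receiver's own rack; otherwise fall back to any holder.
import random

def pick_sender(block, receiver, holders, rack_of):
    """holders: dict block -> set of nodes that already hold the block.
    rack_of: dict node -> rack id (the data center topology is known)."""
    candidates = holders.get(block, set())
    if not candidates:
        return None
    same_rack = [n for n in candidates if rack_of[n] == rack_of[receiver]]
    # An in-rack sender keeps the transfer within the rack.
    return random.choice(same_rack if same_rack else list(candidates))

rack_of = {'n1': 0, 'n2': 0, 'n3': 1}
holders = {'blk7': {'n2', 'n3'}}
print(pick_sender('blk7', 'n1', holders, rack_of))  # 'n2', the in-rack holder
```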

15 Cornet performance. Experiment: 100 GB of data broadcast to 100 receivers on an Amazon EC2 cluster. Traditional broadcast implementations use a distributed file system to store and retrieve the broadcast data; Cornet is about 4-5 times more efficient.

16 Shuffle: status quo. (Figure: reducers at the top fetch map output from the mappers below.) Receivers need to fetch separate pieces of data from each sender. If every sender has an equal amount of data, all links are equally loaded and utilized. But what if the data sizes are unbalanced?

18 Shuffle: sender bottleneck. (Figure: reducers at the top fetch map output from the mappers below.) Senders s1, s2, s4 and s5 each hold one data unit for a single receiver, while sender s3 holds two data units for each of the two receivers. The outgoing link of sender s3 becomes the bottleneck if flows share bandwidth in a fair way.

19 Orchestra: Weighted Shuffle Scheduling (WSS). Key idea: assign a weight to each flow in a shuffle, and make the weight proportional to the amount of data that flow needs to transport (a sketch follows).
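
A minimal sketch of this weighting in Python, assuming unit link capacities and that receiver downlinks are the contended resource (names are illustrative, not Orchestra's API):

```python
# Weighted Shuffle Scheduling sketch: a flow's weight equals the amount of data
# it must move, and each receiver's downlink capacity is divided among its
# incoming flows in proportion to those weights.

def wss_rates(flows, link_capacity=1.0):
    """flows: dict (sender, receiver) -> data units to transfer.
    Returns dict (sender, receiver) -> rate allocated on the receiver downlink."""
    total_per_receiver = {}
    for (sender, receiver), size in flows.items():
        total_per_receiver[receiver] = total_per_receiver.get(receiver, 0.0) + size
    return {(sender, receiver): link_capacity * size / total_per_receiver[receiver]
            for (sender, receiver), size in flows.items()}

# The shuffle from the next slides: s3 holds twice as much data per receiver.
flows = {('s1', 'r1'): 1, ('s2', 'r1'): 1, ('s3', 'r1'): 2,
         ('s3', 'r2'): 2, ('s4', 'r2'): 1, ('s5', 'r2'): 1}
print(wss_rates(flows))  # s1, s2, s4, s5 flows get 0.25; both s3 flows get 0.5
```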

20 Example: shuffle with fair bandwidth sharing. Each receiver fetches data at 1/3 units/second from its three senders (three flows sharing the bandwidth at the receiver). After 3 seconds, all data from s1, s2, s4 and s5 has been fetched, but one unit of data is still left at s3 for each receiver. s3 then transmits the two remaining units at 1/2 units/second to each receiver (two flows sharing the bandwidth at the sender), so after two more seconds all units are transferred. Total time = 5 seconds (a quick arithmetic check follows).
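
A back-of-the-envelope check of this fair-sharing timeline, assuming unit-capacity links and the data sizes above:

```python
from fractions import Fraction as F

# Phase 1: each receiver splits its unit-capacity downlink over 3 flows.
phase1 = F(1) / F(1, 3)   # 3 s until s1, s2, s4, s5 have delivered their 1 unit
# Phase 2: only s3 remains; its uplink is split over the 2 remaining flows.
phase2 = F(1) / F(1, 2)   # 2 s to deliver the last unit to each receiver
print(phase1 + phase2)    # 5 seconds in total with fair sharing
```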

25 Example: shuffle with weighted scheduling. Receivers fetch data at 1/4 units/second from s1, s2, s4 and s5, and at 1/2 units/second from s3. Fetching the data from s1, s2, s4 and s5 takes 4 seconds; fetching the data from s3 also takes 4 seconds. Total time = 4 seconds, 25% faster than fair sharing (the matching arithmetic check follows).
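
The matching check for the weighted schedule, under the same unit-capacity assumption:

```python
from fractions import Fraction as F

# With weights proportional to data, each receiver gives the s3 flow half of its
# downlink and a quarter to each of its other two senders.
t_small = F(1) / F(1, 4)    # 4 s for s1, s2, s4, s5 to deliver their single unit
t_s3 = F(2) / F(1, 2)       # 4 s for s3 to deliver 2 units to each receiver
total = max(t_small, t_s3)
print(total, F(5) / total)  # 4 seconds, i.e. a 5/4 = 1.25x speedup over fair sharing
```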

28 Orchestra: end-to-end evaluation. 1.9x faster on 90 nodes.

29 Fastpass: A Centralized Zero-Queue Datacenter Network

30 Fastpass: key idea. Instead of flows sending packets in an uncoordinated fashion, use a central, datacenter-wide arbiter to schedule and assign paths to all packets. This removes the need for queues at switches and yields very high utilization.

31 Example: a packet travels from A to B.

32 Fastpass challenges. Fine-grained timing: can an arbiter make scheduling decisions at the required latencies? The time to transfer a 1500-byte MTU-sized packet at 10 Gbit/s is about 1230 nanoseconds (a quick check follows), so the arbiter assigns batches of timeslots in one go. Scalability: the arbiter needs to serve requests from many nodes, so request processing is parallelized efficiently on a multicore system.
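
A back-of-the-envelope check of that timeslot length, assuming the 1230 ns figure counts standard Ethernet framing overhead on top of the 1500-byte MTU:

```python
# 1500 B payload + 14 B header + 4 B FCS + 8 B preamble + 12 B inter-frame gap
wire_bytes = 1500 + 14 + 4 + 8 + 12
print(wire_bytes * 8 / 10e9 * 1e9)  # ~1230.4 ns per MTU-sized packet at 10 Gbit/s
```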

33 Fastpass system design. The client application issues a send() call on a socket; the Fastpass library intercepts the call and sends a demand request message (source, destination, data size) to the arbiter. The arbiter processes each request, performing two functions: timeslot allocation (assignment of a set of timeslots in which to transmit the data) and path selection (assignment of a path through the network for each packet). The arbiter then communicates the timeslot and path information back to the client (a sketch of the messages follows).
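
A hedged sketch of that exchange in Python; the type and field names below are illustrative stand-ins, not the actual Fastpass wire format:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DemandRequest:           # sent by the Fastpass client library to the arbiter
    src: str                   # source host
    dst: str                   # destination host
    size_bytes: int            # how much data the application wants to send

@dataclass
class Allocation:              # returned by the arbiter
    timeslots: List[int]       # when each MTU-sized chunk may be transmitted
    core_switch: List[int]     # which path (core switch) each chunk must take
```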

34 Timeslot allocation. Input: a set of demands (source/destination pairs and the number of timeslots required per pair). The arbiter sorts demands by the last timeslot allocated to each pair (for fairness) and processes them in that order, making sure no endpoint is double-booked within a slot. In the example, the 3rd demand cannot be allocated because its destination is already taken in that slot (a greedy sketch follows).
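
A minimal greedy sketch of this allocation for a single timeslot (an illustration, not the Fastpass arbiter code; demands are simplified to one slot each):

```python
# Demands are processed oldest-allocated first; a demand fits in the current
# timeslot only if neither its source nor its destination is already busy.

def allocate_timeslot(demands, last_allocated):
    """demands: list of (src, dst) pairs wanting one slot each.
    last_allocated: dict (src, dst) -> last timeslot granted (for fairness).
    Returns the demands admitted into this timeslot."""
    busy_src, busy_dst, admitted = set(), set(), []
    for src, dst in sorted(demands, key=lambda d: last_allocated.get(d, -1)):
        if src not in busy_src and dst not in busy_dst:
            admitted.append((src, dst))
            busy_src.add(src)
            busy_dst.add(dst)
        # else: an endpoint is already booked in this slot; retry in the next one.
    return admitted

# The third demand collides on destination B and must wait for a later slot.
print(allocate_timeslot([('A', 'B'), ('C', 'D'), ('E', 'B')], {}))
```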

35 Path selection using edge coloring. Example in a network with two tiers (ToR and core), where each ToR switch is connected to a subset of the core switches: path selection entails assigning a core switch to each packet. This is done through edge coloring in four steps: (1) input: the matching of src/dst pairs for a timeslot; (2) build a bipartite graph of ToR switches in which the source and destination ToR of every packet are connected by an edge; (3) color the edges so that no two edges incident on the same ToR have the same color; (4) the colors identify which core switch each packet uses (a greedy illustration follows).
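
A simple greedy stand-in for the coloring step (not the fast bipartite edge-coloring algorithm Fastpass uses; greedy coloring may need more core switches than an optimal coloring, but it shows the idea; ToR ids and packet lists are illustrative):

```python
# Each packet is an edge between its source ToR and destination ToR; its "color"
# is the index of the core switch it will traverse. No ToR sends two packets in
# the same timeslot through the same core switch, and none receives two.

def color_paths(packets):
    """packets: list of (src_tor, dst_tor) edges for one timeslot.
    Returns one core-switch index per packet."""
    used_at_src, used_at_dst, colors = {}, {}, []
    for src, dst in packets:
        # Pick the smallest core switch not yet used on either side's links.
        c = 0
        while c in used_at_src.get(src, set()) or c in used_at_dst.get(dst, set()):
            c += 1
        used_at_src.setdefault(src, set()).add(c)
        used_at_dst.setdefault(dst, set()).add(c)
        colors.append(c)
    return colors

# Two packets leave ToR 0 in the same timeslot: they get different core switches.
print(color_paths([(0, 1), (0, 2), (1, 2)]))  # [0, 1, 0]
```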

40 Ping round-trip times. Setup: a single rack with 32 servers, where 4 servers generate traffic to a single receiver. Fastpass reduces the end-to-end latency by 15x.

41 Queue length. Setup: a single rack with 32 servers, where 4 servers generate traffic to a single receiver. Fastpass reduces the queue length by 242x.

42 Summary. Two typical traffic patterns in data processing applications running in data centers are broadcast and shuffle, and uncoordinated network transfers lead to inefficiencies. Two research efforts address this: Orchestra coordinates network transfers within a MapReduce job; Fastpass coordinates packet transfers and the paths used across a data center.

43 References. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM 2011. Fastpass: A Centralized "Zero-Queue" Datacenter Network, SIGCOMM 2014.