Minimizing the Total Weighted Completion Time of Coflows in Datacenter Networks


Zhen Qiu, Cliff Stein, and Yuan Zhong

The authors are with the Department of Industrial Engineering and Operations Research, Columbia University, New York, NY, USA. {zq2110, cs205, yz2561}@columbia.edu. This research is supported in part by NSF CCF grants.

ABSTRACT

Communications in datacenter jobs (such as the shuffle operations in MapReduce applications) often involve many parallel flows, which may be processed simultaneously. This highly parallel structure presents new scheduling challenges in optimizing job-level performance objectives in data centers. Chowdhury and Stoica [11] introduced the coflow abstraction to capture these communication patterns, and recently Chowdhury et al. [13] developed effective heuristics to schedule coflows. In this paper, we consider the problem of efficiently scheduling coflows with release dates so as to minimize the total weighted completion time, which has been shown to be strongly NP-hard [13]. Our main result is the first polynomial-time deterministic approximation algorithm for this problem, with an approximation ratio of $67/3$, and a randomized version of the algorithm, with a ratio of $9 + 16\sqrt{2}/3$. Our results use techniques from both combinatorial scheduling and matching theory, and rely on a clever grouping of coflows. We also run experiments on a Facebook trace to test the practical performance of several algorithms, including our deterministic algorithm. Our experiments suggest that simple algorithms provide effective approximations of the optimal, and that our deterministic algorithm has near-optimal performance.

1 INTRODUCTION

With the explosive growth of data-parallel computation frameworks such as MapReduce [14], Hadoop [1, 8, 32], Spark [36], Google Dataflow [2], etc., modern data centers are able to process large-scale data sets at an unprecedented speed. A key factor in materializing this efficiency is parallelism: many applications written in these frameworks alternate between computation and communication stages, where a typical computation stage produces many pieces of intermediate data for further processing, which are transferred between groups of servers across the network. Data transfer within a communication stage involves a large collection of parallel flows, and a computation stage often cannot start until all flows within a preceding communication stage have finished [12, 15].

Figure 1: A MapReduce application in a $2 \times 2$ network.

While the aforementioned parallelism creates opportunities for faster data processing, it also presents challenges for network scheduling. In particular, traditional networking techniques focus on optimizing flow-level performance, such as minimizing flow completion times, and ignore application-level performance metrics. For example, the time that a communication stage completes is the time that the last flow within that stage finishes, so it does not matter if other flows of the same stage complete much earlier than the last.

To faithfully capture application-level communication requirements, Chowdhury and Stoica [11] introduced the coflow abstraction, defined to be a collection of parallel flows with a common performance goal. Effective scheduling heuristics were proposed in [13] to optimize coflow completion times. In this paper, we are interested in developing scheduling algorithms with provable performance guarantees, and our main contribution is a deterministic polynomial-time $67/3$-approximation algorithm and a randomized polynomial-time $(9 + 16\sqrt{2}/3)$-approximation algorithm for the problem of minimizing the total weighted completion time of coflows with release dates. These are the first $O(1)$-approximation algorithms for this problem. We also conduct experiments on real data gathered from Facebook, in which we compare our deterministic algorithm and its modifications to several other algorithms, evaluate their relative performances, and compare our solutions to an LP-based lower bound. Our algorithm performs well, and is much closer to the lower bound than the worst-case analysis predicts.

1.1 System Model

The network. In order to describe the coflow scheduling problem, we first need to specify the conceptual model of the datacenter network. Similar to [13], we abstract out the network as one giant non-blocking switch [ ] with $m$ ingress ports and $m$ egress ports, which we call an $m \times m$ network switch, where $m$ specifies the network size. Ingress ports represent physical or virtual links (e.g., Network Interface Cards) where data is transferred from servers to the network, and egress ports represent links through which servers receive data. We assume that the transfer of data within the network switch is instantaneous, so that any data that is transferred out of an ingress port is immediately available at the corresponding egress port. There are capacity constraints on the ingress and egress ports. For simplicity, we assume that all ports have unit capacity, i.e., one unit of data can be transferred through an ingress/egress port per unit time. See Figure 1 for an example of a $2 \times 2$ datacenter network switch. In the sequel, we sometimes use the terms inputs and outputs to mean ingress and egress ports, respectively.

Coflows. A coflow is defined as a collection of parallel flows with a common performance goal. We assume that all flows within a coflow arrive to the system at the same time, the release date of the coflow. To illustrate, consider the shuffling stage of a MapReduce application with 2 mappers and 2 reducers that arrives to a $2 \times 2$ network at time 0, as shown in Figure 1. Both mappers need to transfer intermediate data to both reducers. Therefore, the shuffling stage consists of $2 \times 2 = 4$ parallel flows, each one corresponding to a pair of ingress and egress ports. For example, the size of the flow that needs to be transferred from input 1 to output 2 is 2. In general, we use an $m \times m$ matrix $D = (d_{ij})_{i,j=1}^{m}$ to represent a coflow in an $m \times m$ network, where $d_{ij}$ denotes the size of the flow to be transferred from input $i$ to output $j$. We also assume that flows consist of discrete data units, so their sizes are integers.

Scheduling constraints. Since each input can transmit at most one data unit and each output can receive at most one data unit per time slot, a feasible schedule for a single time slot can be described by a matching between the inputs and outputs. When an input is matched to an output, a corresponding data unit (if available) is transferred across the network. To illustrate, consider again the coflow in Figure 1, given by the matrix $\begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix}$. When this coflow is the only one present, it can be completed in 3 time slots, using matching schedules described by $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ and $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$. Here entry 1 indicates a connection between the corresponding input and output, so for example, the first matching connects input 1 with output 1, and input 2 with output 2. When multiple coflows are present, it is also possible for a matching to involve data units from different coflows.

An alternative approach toward modeling the scheduling constraints is to allow the feasible schedules to consist of rate allocations, so that fractional data units can be processed in each time slot. This approach corresponds to finding fractional matchings, and is used in most of the networking literature. We do not adopt this approach. When rate allocations can vary continuously over time, there is a much larger (infinite) set of allowable schedules. Furthermore, unless the time horizon is exceptionally short, restricting decision making to integral time units results in a provably negligible degradation of performance. Integral matchings will give rise to a much cleaner problem formulation, as we see below, without sacrificing the richness of the problem.

Problem statement. We consider the following offline coflow scheduling problem with release dates. There are $n$ coflows, indexed by $k = 1, 2, \ldots, n$. Coflow $k$ is released to the system at time $r_k$, $k = 1, 2, \ldots, n$. Let the matrix of flow sizes of coflow $k$ be denoted by $D^{(k)} = (d_{ij}^{(k)})_{i,j=1}^{m}$, where $d_{ij}^{(k)}$ is the size of the flow to be transferred from input $i$ to output $j$ of coflow $k$. The completion time of coflow $k$, denoted by $C_k$, is the time when all flows from coflow $k$ have finished processing. Data units can be transferred across the network subject to the scheduling constraints described earlier. Let $y(i, j, k, t)$ be the number of data units being served in time slot $t$, which belong to coflow $k$ and which require transfer from input $i$ to output $j$. Then, in each time slot $t$, the following $2m$ matching constraints must be satisfied. For $i = 1, 2, \ldots, m$, $\sum_{k=1}^{n} \sum_{j'=1}^{m} y(i, j', k, t) \le 1$, ensuring that input $i$ processes at most one data unit at a time, and similarly, $\sum_{k=1}^{n} \sum_{i'=1}^{m} y(i', j, k, t) \le 1$ for each output $j = 1, 2, \ldots, m$.

For given positive weight parameters $w_k$, $k = 1, 2, \ldots, n$, we are interested in minimizing $\sum_{k=1}^{n} w_k C_k$, the total weighted completion time of coflows with release dates. In a data center, coflows often come from different applications, so the total weighted completion time of coflows is a reasonable user/application oriented performance objective. A larger weight indicates higher priority, i.e., the corresponding coflow needs to be completed more quickly.

To summarize, we can formulate our problem as the following mathematical program.

$$
\begin{aligned}
\text{(O)} \quad \text{Minimize} \quad & \sum_{k=1}^{n} w_k C_k \\
\text{subject to} \quad & \sum_{t=1}^{C_k} y(i, j, k, t) \ge d_{ij}^{(k)}, && \text{for } i, j = 1, \ldots, m,\ k = 1, \ldots, n; && (1) \\
& \sum_{k=1}^{n} \sum_{j'=1}^{m} y(i, j', k, t) \le 1, && \text{for } i = 1, \ldots, m,\ \forall t; && (2) \\
& \sum_{k=1}^{n} \sum_{i'=1}^{m} y(i', j, k, t) \le 1, && \text{for } j = 1, \ldots, m,\ \forall t; && (3) \\
& y(i, j, k, t) = 0 \ \text{ if } t < r_k, && \text{for } i, j = 1, \ldots, m,\ k = 1, \ldots, n,\ \forall t; && (4) \\
& y(i, j, k, t) \ \text{binary}, && \forall i, j, t, k. && (5)
\end{aligned}
$$

The load constraints (1) state that all processing requirements of flows need to be met upon the completion of each coflow. (2) and (3) are the matching constraints. The release date constraints (4) guarantee that coflows are being served only after they are released in the system. Note that this mathematical program is not an integer linear programming formulation, because variables $C_k$ are in the limit of the summation.

The coflow scheduling problem (O) generalizes some well-known scheduling problems. First, when $m = 1$, it is easy to see that coflow scheduling is equivalent to single-machine scheduling with release dates with the objective of minimizing the total weighted completion time, where preemption

is allowed. The latter problem is strongly NP-hard [22], which immediately implies the NP-hardness of problem (O) in general. When all coflow matrices are diagonal, coflow scheduling is equivalent to a concurrent open shop scheduling problem [ 0]. This connection has been observed in [13], and is described in more detail in Appendix A for completeness. Utilizing this connection, it can be shown that the problem (O) is strongly NP-hard, even in the special case where $r_k = 0$ and $w_k = 1$ for all $k$. A major difference between concurrent open shop scheduling and coflow scheduling is that for concurrent open shop, there exists an optimal permutation schedule in which jobs can be processed in the same order on all machines [ ], whereas permutation schedules need not be optimal for coflow scheduling [13].

1.2 Main Results

Theoretical results. Since the coflow scheduling problem (O) is NP-hard, we focus on finding approximation algorithms, that is, algorithms which run in polynomial time and return a solution whose value is guaranteed to be close to optimal. Let $C_k(OPT)$ and $C_k(A)$ be the completion times of coflow $k$ under an optimal and an approximation scheduling algorithm, respectively. Our main results are:

Theorem 1. There exists a deterministic polynomial time $67/3$-approximation algorithm, i.e.,
$$\frac{\sum_{k=1}^{n} w_k C_k(A)}{\sum_{k=1}^{n} w_k C_k(OPT)} \le \frac{67}{3}.$$

Theorem 2. There exists a randomized polynomial time $(9 + 16\sqrt{2}/3)$-approximation algorithm, i.e.,
$$\frac{E\left[\sum_{k=1}^{n} w_k C_k(A)\right]}{\sum_{k=1}^{n} w_k C_k(OPT)} \le 9 + \frac{16\sqrt{2}}{3}.$$

When all coflows have release dates 0, i.e., $r_k = 0$ for all $k$, we can improve the approximation ratios.

Corollary 1. If all coflows are released into the system at time 0, then there exists a deterministic polynomial time $64/3$-approximation algorithm.

Corollary 2. If all coflows are released into the system at time 0, then there exists a randomized polynomial time $(8 + 16\sqrt{2}/3)$-approximation algorithm.

Our deterministic (described in Algorithm 2) and randomized algorithms combine ideas from combinatorial scheduling and matching theory, along with some new insights. First, as with many other scheduling problems (see e.g. [17]), particularly for average completion time, we relax the problem formulation (O) to a polynomial-sized interval-indexed linear program (LP). The relaxation involves both dropping the matching constraints (2) and (3), and using intervals to make the LP polynomial sized. We then solve this LP, and use an optimal solution to the LP to obtain an ordered list of coflows. Then, we use this list to derive an actual schedule. To do so, we partition coflows into a polynomial number of groups, based on the minimum required completion times of the ordered coflows, and schedule the coflows in the same group as a single coflow, using matchings obtained from an integer version of the Birkhoff-von Neumann decomposition theorem (Lemma 4 and Algorithm 1).

The analysis of the algorithm couples techniques from the two areas in interesting ways. We analyze the interval-indexed linear program using tools similar to those used for other average completion time problems, especially concurrent open shop. The interval-indexed, rather than time-indexed, formulation is necessary to obtain a polynomial time algorithm, and we use a lower bound based on a priority-based load calculation (see Lemma 3). We also show how each coflow can be completed by a time bounded by a constant times the optimal completion time, via a clever grouping and decomposition of the coflow matrices. Here a challenge is to decompose the not necessarily polynomial length schedule into a polynomial number of matchings.

Experimental findings. We evaluate our algorithm, as well as the impact of several additional algorithmic and heuristic decisions, including coflow ordering, coflow grouping, and backfilling. Our evaluation uses a Hive/MapReduce trace collected from a large production cluster at Facebook [12, 13]. The main findings are as follows, with a more detailed discussion in Section 4 and Appendix D. Algorithms with coflow grouping consistently outperform those without grouping. Similarly, algorithms that use backfilling consistently outperform those that do not use backfilling. The performance of algorithms that use the LP-based ordering (15) is similar to those that order coflows according to their loads (see (18)). When combined with grouping and backfilling, these algorithms are nearly optimal. Note that the ordering of coflows according to load is used in [13]. Our LP-based deterministic algorithm has near-optimal performance.

1.3 Related work

The coflow abstraction was first proposed in [11], although the idea was present in a previous paper [12]. Chowdhury et al. [12] observed that the optimal processing time of a coflow is exactly equal to its load when the network is scheduling only a single coflow, and built upon this observation to schedule data transfer in a datacenter network. Chowdhury et al. [13] introduced the coflow scheduling problem without release dates, and provided effective scheduling heuristics. They also observed the connection of coflow scheduling with concurrent open shop, established the NP-hardness of the coflow scheduling problem, and showed via a simple counter-example how permutation schedules need not be optimal for coflow scheduling.

There has been a great deal of success over the past 20 years on combinatorial scheduling to minimize average completion time, see e.g. [ ]. This line of work typically uses a linear programming relaxation to obtain an ordering of jobs and then uses that ordering in some other polynomial-time algorithm. There has also been much work on shop scheduling, which we do not survey here, but note that traditional shop scheduling is not concurrent. In the language of our problem, that would mean that traditionally, two flows in the same coflow could not be processed simultaneously. The recently studied concurrent open shop problem

removes this restriction and models flows that can be processed in parallel. There have been several results showing that even restrictive special cases are NP-hard [ ]. There were several algorithms with super-constant (actually at least $m$, the number of machines) approximation ratios, e.g., [ 2 4 5]. Recently there have been several constant factor approximation algorithms using LP-relaxations. Wang and Cheng [35] used an interval-indexed formulation. Several authors have observed that a relaxation in completion time variables is possible [ ], and Mastrolilli et al. [25] gave a primal-dual 2-approximation algorithm and showed stronger hardness results. Relaxations in completion time variables presume the optimality of permutation schedules, which does not hold for the coflow scheduling problem. Thus, our work builds on the formulation in Wang and Cheng [35], even though their approach does not yield the strongest approximation ratio.

The design of our scheduling algorithms relies crucially on a fundamental result (Lemma 4 in this paper) concerning the decomposition of nonnegative integer-valued matrices into permutation matrices, which states that such a matrix can be written as a sum of $\rho$ permutation matrices, where $\rho$ is the maximum column and row sum. This result is closely related to the classical Birkhoff-von Neumann theorem [7], and has been stated in different forms and applied in different contexts. For an application in scheduling theory, see e.g., [21]. For applications in communication networks, see e.g., [ ].

2 LINEAR PROGRAM (LP) RELAXATION

In this section, we present an interval-indexed linear program relaxation of the scheduling problem (O), which produces a lower bound on $\sum_{k=1}^{n} w_k C_k(OPT)$, the optimal value of the total weighted completion time, as well as an ordering of coflows for our approximation algorithms (§2.1). We then define and analyze the concepts of maximum total input/output loads, which respect the ordering produced by the LP, and relate these concepts to $C_k(OPT)$ (§2.2). These relations will be used in the proofs of our main results in §3.3.

2.1 Two Linear Program Relaxations

From the discussion in §1.1, we know that problem (O) is NP-hard. Furthermore, the formulation in (1) - (5) is not immediately of use, since it is at least as hard as an integer linear program. We can, however, formulate an interval-indexed linear program (LP) by relaxing the following components from the original formulation.

(i) First, we develop new load constraints (see (8) and (9)) by relaxing the matching constraints (2) and (3) and the load constraints (1), and formulate a time-indexed linear program. (The matching constraints will be enforced in the actual scheduling algorithm.) The time-indexed LP has been used many times (e.g. [5, 17]) but typically for non-shop scheduling problems. Note that in order to use it for our problem, we drop the explicit matching constraints. We call this (LP-EXP) below.

(ii) Second, in order to get a polynomial sized formulation, we divide time (which may not be polynomially bounded) into a set of geometrically increasing intervals. We call this an interval-indexed integer program, and it is also commonly used in combinatorial scheduling. In doing so, we have a weaker relaxation than the time-indexed one, but one that can be solved in polynomial time. We then relax the interval-indexed integer program to a linear program and solve the linear program.

To implement relaxation (i), let us examine the load and matching constraints (1) - (3). Constraints (2) and (3) imply that in each time slot $t$, each input/output can process at most one data unit. Thus, the total amount of work that can be processed by an input/output by time $t$ is at most $t$. For each time slot $t$ and each $k = 1, 2, \ldots, n$, let $z_t^{(k)} \in \{0, 1\}$ be an indicator variable of the event that coflow $k$ completes in time slot $t$. Then

$$\sum_{s=1}^{t} \sum_{k=1}^{n} \sum_{j'=1}^{m} d_{ij'}^{(k)} z_s^{(k)} \quad \text{and} \quad \sum_{s=1}^{t} \sum_{k=1}^{n} \sum_{i'=1}^{m} d_{i'j}^{(k)} z_s^{(k)}$$

are, respectively, the total amount of work on input $i$ and output $j$ from all coflows that complete before time $t$. Therefore, for each $t$, we must have

$$\sum_{s=1}^{t} \sum_{k=1}^{n} \sum_{j'=1}^{m} d_{ij'}^{(k)} z_s^{(k)} \le t, \quad \text{for all } i = 1, 2, \ldots, m, \qquad (6)$$

$$\sum_{s=1}^{t} \sum_{k=1}^{n} \sum_{i'=1}^{m} d_{i'j}^{(k)} z_s^{(k)} \le t, \quad \text{for all } j = 1, 2, \ldots, m, \qquad (7)$$

which are the load constraints on the inputs and outputs.

To complete relaxation (i), we require an upper bound on the time needed to complete all coflows in an optimal scheduling algorithm. To this end, note that the naive algorithm which schedules one data unit in each time slot can complete processing all the coflows in $T = \max_k \{r_k\} + \sum_{k=1}^{n} \sum_{i,j=1}^{m} d_{ij}^{(k)}$ units of time, and it is clear that an optimal scheduling algorithm can finish processing all coflows by time $T$. Taking into account constraints (6) and (7), and relaxing the integer constraints $z_t^{(k)} \in \{0, 1\}$ into the corresponding linear constraints, we can formulate the following linear programming relaxation (LP-EXP) of the coflow scheduling problem (O).

$$
\begin{aligned}
\text{(LP-EXP)} \quad \text{Minimize} \quad & \sum_{k=1}^{n} w_k \sum_{t=1}^{T} t\, z_t^{(k)} \\
\text{subject to} \quad & \sum_{s=1}^{t} \sum_{k=1}^{n} \sum_{j'=1}^{m} d_{ij'}^{(k)} z_s^{(k)} \le t, && \text{for } i = 1, \ldots, m,\ t = 1, \ldots, T; && (8) \\
& \sum_{s=1}^{t} \sum_{k=1}^{n} \sum_{i'=1}^{m} d_{i'j}^{(k)} z_s^{(k)} \le t, && \text{for } j = 1, \ldots, m,\ t = 1, \ldots, T; && (9) \\
& z_t^{(k)} = 0 \ \text{ if } r_k + \sum_{j'=1}^{m} d_{ij'}^{(k)} > t \ \text{ or } \ r_k + \sum_{i'=1}^{m} d_{i'j}^{(k)} > t; && && (10) \\
& \sum_{t=1}^{T} z_t^{(k)} = 1, && \text{for } k = 1, \ldots, n; \\
& z_t^{(k)} \ge 0, && \text{for } k = 1, \ldots, n,\ t = 1, 2, \ldots, T.
\end{aligned}
$$

Since the time $T$ can be exponentially large in the sizes of the problem inputs, it is not a priori clear that the relaxation (LP-EXP) can be solved in polynomial time. In order to reduce the running time and find a polynomial time algorithm, we divide the time horizon into increasing time intervals: $[0, 1], (1, 2], (2, 4], \ldots, (2^{L-2}, 2^{L-1}]$, where $L$ is chosen to be the smallest integer such that $2^{L-1} \ge T$. The inequality guarantees that $2^{L-1}$ is a sufficiently large time horizon to complete all the coflows, even under a naive schedule. We also define the following notation for time points: $\tau_0 = 0$, and $\tau_l = 2^{l-1}$ for $l = 1, \ldots, L$. Thus, the $l$th time interval runs from time $\tau_{l-1}$ to $\tau_l$.
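The interval construction above can be illustrated with a short Python sketch. The function name `geometric_intervals` and the input layout (a list of release dates plus a list of $m \times m$ demand matrices given as nested lists) are assumptions made for this illustration and do not come from the paper.

```python
def geometric_intervals(release_dates, demands):
    """Return (T, L, taus) for the interval-indexed relaxation.

    release_dates: list of r_k, one per coflow.
    demands: list of m x m integer matrices (nested lists), one per coflow.
    """
    # T = max_k r_k + total number of data units over all coflows,
    # i.e. the time needed by the naive one-unit-per-slot schedule.
    total_units = sum(sum(sum(row) for row in d) for d in demands)
    horizon = max(release_dates) + total_units

    # L is the smallest integer with 2^(L-1) >= T.
    L = 1
    while 2 ** (L - 1) < horizon:
        L += 1

    # tau_0 = 0 and tau_l = 2^(l-1) for l = 1, ..., L.
    taus = [0] + [2 ** (l - 1) for l in range(1, L + 1)]
    return horizon, L, taus


# The single 2 x 2 coflow of Figure 1, released at time 0.
T, L, taus = geometric_intervals([0], [[[1, 2], [2, 1]]])
print(T, L, taus)
```

For the coflow of Figure 1 this gives $T = 6$ and $L = 4$, so the intervals are $[0, 1], (1, 2], (2, 4], (4, 8]$. The number of intervals grows only logarithmically in $T$, which is what keeps the interval-indexed program polynomial in size.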

For $k = 1, \ldots, n$ and $l = 1, \ldots, L$, let $x_l^{(k)}$ be the binary decision variable which indicates whether coflow $k$ is scheduled to complete within the interval $(\tau_{l-1}, \tau_l]$. We approximate the completion time variable $C_k$ by $\sum_{l=1}^{L} \tau_{l-1} x_l^{(k)}$, the left end point of the time interval in which coflow $k$ finishes, and consider the following linear program relaxation (LP).

$$
\begin{aligned}
\text{(LP)} \quad \text{Minimize} \quad & \sum_{k=1}^{n} w_k \sum_{l=1}^{L} \tau_{l-1} x_l^{(k)} \\
\text{subject to} \quad & \sum_{u=1}^{l} \sum_{k=1}^{n} \sum_{j'=1}^{m} d_{ij'}^{(k)} x_u^{(k)} \le \tau_l, && \text{for } i = 1, \ldots, m,\ l = 1, \ldots, L; && (11) \\
& \sum_{u=1}^{l} \sum_{k=1}^{n} \sum_{i'=1}^{m} d_{i'j}^{(k)} x_u^{(k)} \le \tau_l, && \text{for } j = 1, \ldots, m,\ l = 1, \ldots, L; && (12) \\
& x_l^{(k)} = 0 \ \text{ if } r_k + \sum_{j'=1}^{m} d_{ij'}^{(k)} > \tau_l \ \text{ or } \ r_k + \sum_{i'=1}^{m} d_{i'j}^{(k)} > \tau_l; && && (13) \\
& \sum_{l=1}^{L} x_l^{(k)} = 1, && \text{for } k = 1, \ldots, n; \\
& x_l^{(k)} \ge 0, && \text{for } k = 1, \ldots, n,\ l = 1, \ldots, L.
\end{aligned}
$$

The relaxations (LP-EXP) and (LP) are similar, except that the time-indexed variables $z_t^{(k)}$ are replaced by interval-indexed variables $x_l^{(k)}$. The following lemma is immediate.

Lemma 1. The optimal value of the linear program (LP) is a lower bound on the optimal total weighted completion time $\sum_{k=1}^{n} w_k C_k(OPT)$ of coflow scheduling problem (O).

Proof. Consider an optimal schedule of problem (O) and set $x_u^{(k)} = 1$ if coflow $k$ completes within the $u$th time interval. This is a feasible solution to problem (LP) with the load and capacity constraints (11) and (12) and the feasibility constraint (13) all satisfied. Moreover, since coflow $k$ completes within the $u$th interval, the coflow completion time is at least $\tau_{u-1}$. Hence, the objective value of the feasible solution constructed is no more than the optimal total weighted completion time $\sum_{k=1}^{n} w_k C_k(OPT)$.

Since the constraint matrix in problem (LP) is of size $O((n + m) \log T)$ by $O(n \log T)$ and the maximum size of the coefficients is $O(\log T)$, the number of bits of input to the problem is $O(n(m + n)(\log T)^3)$. The interior point method can solve problem (LP) in polynomial time [20].

From an optimal solution to (LP), we can obtain an ordering of coflows, and use this order in our scheduling algorithms (see e.g., Algorithm 2). To do so, let an optimal solution to problem (LP) be $\bar{x}_l^{(k)}$ for $k = 1, \ldots, n$ and $l = 1, \ldots, L$. The relaxed problem (LP) computes an approximated completion time

$$\bar{C}_k = \sum_{l=1}^{L} \tau_{l-1} \bar{x}_l^{(k)} \qquad (14)$$

for coflow $k$, $k = 1, 2, \ldots, n$, based on which we reorder coflows. More specifically, we re-order and index the coflows in a nondecreasing order of the approximated completion times $\bar{C}_k$, i.e.,

$$\bar{C}_1 \le \bar{C}_2 \le \cdots \le \bar{C}_n. \qquad (15)$$

For the rest of the paper, we will stick to this ordering and indexing of the coflows.

Wang and Cheng [35] gave a $16/3$-approximation algorithm for the concurrent open shop problem using a similar interval-indexed linear program. Our algorithms are more involved because we also have to address the matching constraints, which do not appear in the concurrent open shop problem.

2.2 Maximum Total Input / Output Loads

Here we define the maximum total input/output loads, respecting the ordering and indexing of (15). For each $k$, $k = 1, 2, \ldots, n$, define the maximum total input load $I_k$, the maximum total output load $J_k$ and the maximum total load $V_k$ by

$$I_k = \max_{i=1,\ldots,m} \left\{ \sum_{j'=1}^{m} \sum_{g=1}^{k} d_{ij'}^{(g)} \right\}, \quad J_k = \max_{j=1,\ldots,m} \left\{ \sum_{i'=1}^{m} \sum_{g=1}^{k} d_{i'j}^{(g)} \right\}, \quad \text{and} \quad V_k = \max\{I_k, J_k\}, \qquad (16)$$

respectively. For each $i$ (each $j$, respectively), $\sum_{j'=1}^{m} \sum_{g=1}^{k} d_{ij'}^{(g)}$ ($\sum_{i'=1}^{m} \sum_{g=1}^{k} d_{i'j}^{(g)}$, respectively) is the total processing requirement on input $i$ (output $j$) from coflows $1, 2, \ldots, k$. That is, the total load is the sum of the loads of the lower numbered coflows. By the load constraints, $V_k$ is a universal lower bound on the time required to finish processing coflows $1, 2, \ldots, k$, under any scheduling algorithm. We state this fact formally as a lemma.

Lemma 2. For $k = 1, 2, \ldots, n$, let $C^{(k)}$ be the time that all coflows $1, \ldots, k$ complete, where the indexing respects the order (15). Then, under any scheduling algorithm,

$$V_k \le C^{(k)} \qquad (17)$$

for all $k$ simultaneously.

The following lemma, which states that with a proper ordering of the coflows, $V_k$ is a $16/3$-approximation of the optimal $C_k(OPT)$ for all $k$ simultaneously, is crucial for the proof of our main results in the next section. The proof of the lemma is similar to that of Theorem 1 in [35]. We defer this proof to Appendix C.

Lemma 3. Let $\bar{C}_k$ be computed from problem (LP) by Eq. (14) and be indexed such that (15) is satisfied. Then $V_k \le (16/3)\, C_k(OPT)$, $k = 1, \ldots, n$.

3 APPROXIMATION ALGORITHMS

In this section, we describe a deterministic and a randomized polynomial time scheduling algorithm, with approximation ratios of $67/3$ and $9 + 16\sqrt{2}/3$ respectively, in the presence of arbitrary release dates. The same algorithms have approximation ratios $64/3$ and $8 + 16\sqrt{2}/3$ respectively, when all coflows are released at time 0. Both algorithms are based on the idea of efficiently scheduling coflows according to the ordering (15) produced by (LP). To fully describe the scheduling algorithms, we first present some preliminaries on the minimal amount of time to finish processing an arbitrary coflow using only matching schedules (Algorithm 1 and §3.1). Algorithm 1 will be used crucially in the design of the approximation algorithms, and, as we will see, it is effectively an integer version of the famous Birkhoff-von Neumann theorem [7], and hence the name Birkhoff-von Neumann decomposition. Details of the main algorithms are provided in §3.2, and we supply proofs of the complexities and approximation ratios in §3.3.

3.1 Birkhoff-von Neumann Decomposition

For an arbitrary coflow matrix $D = (d_{ij})_{i,j=1}^{m}$, where $d_{ij} \in \mathbb{Z}_+$ for all $i$ and $j$, we define $\rho(D)$, the load of coflow $D$, as follows:

$$\rho(D) = \max\left\{ \max_{i=1,\ldots,m} \left\{ \sum_{j'=1}^{m} d_{ij'} \right\},\ \max_{j=1,\ldots,m} \left\{ \sum_{i'=1}^{m} d_{i'j} \right\} \right\}. \qquad (18)$$

Note that for each $i$, $\sum_{j'=1}^{m} d_{ij'}$ is the total processing requirement of coflow $D$ on input $i$, and for each $j$, $\sum_{i'=1}^{m} d_{i'j}$ is that on output $j$. By the matching constraints, $\rho(D)$ is a universal lower bound on the completion time of coflow $D$, were it to be scheduled alone.

Lemma 4. There exists a polynomial time algorithm which finishes processing coflow $D$ in $\rho(D)$ time slots (using the matching schedules), were it to be scheduled alone.

Algorithm 1 describes a polynomial-time scheduling algorithm that can be used to prove Lemma 4. The idea of the algorithm is as follows. For a given coflow matrix $D = (d_{ij})_{i,j=1}^{m}$, we first augment it to a larger matrix $\tilde{D}$, whose row and column sums are all equal to $\rho(D)$ (Step 1). In each iteration of Step 1, we increase one entry of $D$ such that at least one more row or column sums to $\rho$. Therefore, at most $2m - 1$ iterations are required to get $\tilde{D}$. We then decompose $\tilde{D}$ into permutation matrices that correspond to the matching schedules (Step 2). More specifically, at the end of Step 2, we can write $\tilde{D} = \sum_{u=1}^{U} q_u \Pi_u$, so that $\Pi_u$ are permutation matrices, $q_u \in \mathbb{N}$ are such that $\sum_{u=1}^{U} q_u = \rho(D)$, and $U \le m^2$. The key to this decomposition is Step 2 (ii), where the existence of a perfect matching $M$ can be proved by a simple application of Hall's matching theorem [18].

We now consider the complexity of Step 2. In each iteration of Step 2, we can set at least one entry of $\tilde{D}$ to zero, so that the algorithm ends in $m^2$ iterations. The complexity of finding a perfect matching in Step 2 (ii) is $O(m^3)$, e.g., by using the maximum bipartite matching algorithm. Since both Steps 1 and 2 of Algorithm 1 have polynomial-time complexity, the algorithm itself has polynomial-time complexity.

If we divide both sides of the identity $\tilde{D} = \sum_{u=1}^{U} q_u \Pi_u$ by $\rho(D)$, then $\tilde{D}/\rho(D)$ is a doubly stochastic matrix, and the coefficients $q_u/\rho(D)$ sum up to 1. Therefore, $\tilde{D}/\rho(D)$ is a convex combination of the permutation matrices $\Pi_u$. Because of this natural connection of Algorithm 1 with the Birkhoff-von Neumann theorem [7], we call the algorithm the Birkhoff-von Neumann decomposition. Lemma 4 has been stated in slightly different forms, see e.g., Theorem 1 in [21], Theorem 4 in [1], and Fact 2 in [26].

Algorithm 1: Birkhoff-von Neumann Decomposition
  Data: A single coflow $D = (d_{ij})_{i,j=1}^{m}$.
  Result: A scheduling algorithm that uses at most a polynomial number of different matchings.
  Step 1: Augment $D$ to a matrix $\tilde{D} = (\tilde{d}_{ij})_{i,j=1}^{m}$, where $\tilde{d}_{ij} \ge d_{ij}$ for all $i$ and $j$, and all row and column sums of $\tilde{D}$ are equal to $\rho(D)$.
    Let $\eta = \min\left\{ \min_i \left\{ \sum_{j'=1}^{m} d_{ij'} \right\}, \min_j \left\{ \sum_{i'=1}^{m} d_{i'j} \right\} \right\}$ be the minimum of row sums and column sums, and let $\rho(D)$ be defined according to Equation (18).
    $\tilde{D} \leftarrow D$.
    while ($\eta < \rho$) do
      $i^* \leftarrow \arg\min_i \sum_{j'=1}^{m} \tilde{D}_{ij'}$; $j^* \leftarrow \arg\min_j \sum_{i'=1}^{m} \tilde{D}_{i'j}$.
      $\tilde{D} \leftarrow \tilde{D} + pE$, where $p = \min\left\{ \rho - \sum_{j'=1}^{m} \tilde{D}_{i^*j'},\ \rho - \sum_{i'=1}^{m} \tilde{D}_{i'j^*} \right\}$, $E_{ij} = 1$ if $i = i^*$ and $j = j^*$, and $E_{ij} = 0$ otherwise.
      $\eta \leftarrow \min\left\{ \min_i \left\{ \sum_{j'=1}^{m} \tilde{D}_{ij'} \right\}, \min_j \left\{ \sum_{i'=1}^{m} \tilde{D}_{i'j} \right\} \right\}$
    end
  Step 2: Decompose $\tilde{D}$ into permutation matrices $\Pi$.
    while ($\tilde{D} \ne 0$) do
      (i) Define an $m \times m$ binary matrix $G$ where $G_{ij} = 1$ if $\tilde{D}_{ij} > 0$, and $G_{ij} = 0$ otherwise, for $i, j = 1, \ldots, m$.
      (ii) Interpret $G$ as a bipartite graph, where an (undirected) edge $(i, j)$ is present if and only if $G_{ij} = 1$. Find a perfect matching $M$ on $G$ and define an $m \times m$ binary matrix $\Pi$ for the matching by $\Pi_{ij} = 1$ if $(i, j) \in M$, and $\Pi_{ij} = 0$ otherwise, for $i, j = 1, \ldots, m$.
      (iii) $\tilde{D} \leftarrow \tilde{D} - q\Pi$, where $q = \min\{\tilde{D}_{ij} : \Pi_{ij} > 0\}$. Process coflow $D$ using the matching $M$ for $q$ time slots. More specifically, process dataflow from input $i$ to output $j$ for $q$ time slots, if $(i, j) \in M$ and there is processing requirement remaining, for $i, j = 1, \ldots, m$.
    end

3.2 Approximation Algorithms

Here we present our deterministic and randomized scheduling algorithms. The deterministic algorithm is summarized in Algorithm 2, which consists of 2 steps. In Step 1, we solve (LP) to get the approximated completion time $\bar{C}_k$ for coflow ordering. Then, in Step 2, for each $k = 1, 2, \ldots, n$, we compute the maximum total load $V_k$ of coflow $k$, and identify the time interval $(\tau_{r(k)-1}, \tau_{r(k)}]$ that it belongs to, where $\tau_l$ are defined in §2.1. All the coflows that fall into the same time interval are combined into and treated as a single coflow, and processed using Algorithm 1 in §3.1.

The randomized scheduling algorithm follows the same steps as the deterministic one, except for the choice of the time intervals in Step 2 of Algorithm 2. Define the random time points $\tau'_l$ by $\tau'_0 = 0$, and $\tau'_l = T_0 a^{l-1}$, where $a = 1 + \sqrt{2}$, and $T_0 \sim \mathrm{Unif}[1, a]$ is uniformly distributed between 1 and $a$. Having picked the random time points $\tau'_l$, we then proceed to process the coflows in a similar fashion to the deterministic algorithm. Namely, for each $k = 1, 2, \ldots, n$, we identify the (random) time interval $(\tau'_{r'(k)-1}, \tau'_{r'(k)}]$ that coflow $k$ belongs to, and all coflows that belong to the same time interval are combined into a single coflow and processed using Algorithm 1.

3.3 Proofs of Main Results

We now establish the complexity and performance properties of our algorithms. We first provide the proof of Theorem 1 in detail. By a slight modification of the proof of Theorem 1, we can establish Corollary 1. We then provide the proofs of Theorem 2 and Corollary 2.
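The two steps of Algorithm 1 in §3.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: coflows are nested lists of nonnegative integers with $m \ge 1$, all function names are hypothetical, and the perfect matching in Step 2 (ii) is found with a simple augmenting-path search (Kuhn's algorithm), chosen here for brevity rather than taken from the paper.

```python
def load(D):
    """rho(D): the largest row or column sum, as in Eq. (18)."""
    m = len(D)
    rows = [sum(D[i]) for i in range(m)]
    cols = [sum(D[i][j] for i in range(m)) for j in range(m)]
    return max(rows + cols)


def augment(D):
    """Step 1: raise entries until every row and column sums to rho(D)."""
    m, rho = len(D), load(D)
    A = [row[:] for row in D]
    while True:
        rows = [sum(A[i]) for i in range(m)]
        cols = [sum(A[i][j] for i in range(m)) for j in range(m)]
        if min(rows + cols) >= rho:
            return A
        i, j = rows.index(min(rows)), cols.index(min(cols))
        # Fill the lightest row/column pair as far as both allow.
        A[i][j] += min(rho - rows[i], rho - cols[j])


def perfect_matching(A):
    """Step 2 (ii): perfect matching on the support of A via augmenting paths."""
    m = len(A)
    input_of_output = [-1] * m

    def try_assign(i, seen):
        for j in range(m):
            if A[i][j] > 0 and not seen[j]:
                seen[j] = True
                if input_of_output[j] == -1 or try_assign(input_of_output[j], seen):
                    input_of_output[j] = i
                    return True
        return False

    for i in range(m):
        if not try_assign(i, [False] * m):
            raise ValueError("no perfect matching on the support of A")
    return [(input_of_output[j], j) for j in range(m)]


def bvn_decompose(D):
    """Return a list of (q, matching) pairs; the q values sum to rho(D)."""
    A = augment(D)
    schedule = []
    while any(v > 0 for row in A for v in row):
        M = perfect_matching(A)
        q = min(A[i][j] for i, j in M)  # Step 2 (iii)
        for i, j in M:
            A[i][j] -= q
        schedule.append((q, M))
    return schedule


schedule = bvn_decompose([[1, 2], [2, 1]])
print(schedule, sum(q for q, _ in schedule))
```

On the coflow of Figure 1 the matchings are used for a total of 3 time slots, matching $\rho(D) = 3$. Each pass of the loop zeroes at least one entry of the augmented matrix, which is why at most $m^2$ distinct matchings are produced.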

Algorithm 2: Deterministic LP-based Approximation

Data: Coflows $\big(d_{ij}^{(k)}\big)_{i,j=1}^{m}$, for $k = 1, \ldots, n$.
Result: A scheduling algorithm that uses at most a polynomial number of different matchings.
Step 1: Given $n$ coflows, solve the linear program (LP). Let an optimal solution be given by $\bar{x}_l^{(k)}$, for $l = 1, 2, \ldots, L$ and $k = 1, 2, \ldots, n$. Compute the approximated completion time $\bar{C}_k$ by Eq. (14). Order and index the coflows according to (15).
Step 2: Compute the maximum total load $V_k$ for each $k$ by (16). Suppose that $V_k \in (\tau_{r(k)-1}, \tau_{r(k)}]$ for some function $r(\cdot)$ of $k$. Let the range of $r(\cdot)$ consist of values $s_1 < s_2 < \cdots < s_P$, and define the sets $S_u = \{k : \tau_{s_u - 1} < V_k \le \tau_{s_u}\}$, $u = 1, 2, \ldots, P$.
  $u \leftarrow 1$;
  while $u \le P$ do
    After all the coflows in set $S_u$ are released, schedule them as a single coflow with transfer requirement $\sum_{k \in S_u} d_{ij}^{(k)}$ from input $i$ to output $j$, and finish processing the coflow using Algorithm 1;
    $u \leftarrow u + 1$;
  end

Proofs of Theorem 1 and Corollary 1

Let $C_k(A)$ be the completion time of coflow $k$ under the deterministic scheduling algorithm (Algorithm 2). The following proposition will be used to prove Theorem 1.

Proposition 1. For all $k = 1, 2, \ldots, n$, the coflow completion time $C_k(A)$ satisfies

$$C_k(A) \le \max_{1 \le g \le k}\{r_g\} + 4V_k, \qquad (19)$$

where we recall that $r_g$ is the release time of coflow $g$, and the total loads $V_k$ are defined in Eq. (16).

Proof. Recall the notation used in Algorithm 2. For any coflow $k \in S_u$, $V_k \le \tau_{s_u}$. By Lemma 4, all coflows in the set $S_u$ can be finished processing within $\tau_{s_u}$ units of time. Define $\tilde{\tau}_0 = 0$ and $\tilde{\tau}_u = \tilde{\tau}_{u-1} + \tau_{s_u}$, $u = 1, 2, \ldots, P$. A simple induction argument shows that under Algorithm 2, the completion time of coflow $k$ satisfies

$$C_k(A) \le \max_{1 \le g \le k}\{r_g\} + \tilde{\tau}_u \quad \text{if } k \in S_u.$$

We now prove by induction on $u$ that $\tilde{\tau}_u \le 2\tau_{r(k)}$ if $k \in S_u$, $u = 1, 2, \ldots, P$. The base case holds because $\tilde{\tau}_1 = \tau_{s_1} = \tau_{r(k)}$ for $k \in S_1$. Suppose that the claim holds for $k \in S_u$. Let $k^* = \max\{k : k \in S_u\}$, so that $k^* + 1 \in S_{u+1}$. Then

$$\tilde{\tau}_{u+1} = \tilde{\tau}_u + \tau_{s_{u+1}} \le 2\tau_{r(k^*)} + \tau_{r(k^*+1)}.$$

Since the time intervals increase geometrically and satisfy $\tau_{l+1} = 2\tau_l$ for $l = 1, 2, \ldots, L$, and since $r(k^*) < r(k^*+1)$, we have $2\tau_{r(k^*)} \le \tau_{r(k^*+1)}$, and therefore $\tilde{\tau}_{u+1} \le 2\tau_{r(k^*+1)} = 2\tau_{r(k)}$ if $k \in S_{u+1}$. This completes the induction. Furthermore, if $\tau_{r(k)-1} < V_k \le \tau_{r(k)}$, then $\tau_{r(k)} = 2\tau_{r(k)-1} < 2V_k$. Thus,

$$C_k(A) \le \max_{1 \le g \le k}\{r_g\} + 2\tau_{r(k)} \le \max_{1 \le g \le k}\{r_g\} + 4V_k.$$

The proof of Theorem 1 is now a simple consequence of Lemmas 1 and 3 and Proposition 1.

Proof of Theorem 1. For all $k$, $\max_{1 \le g \le k}\{r_g\} \le \bar{C}_k$, where $\bar{C}_k$ (cf. (15)) are the approximated completion times computed from (LP); this follows immediately from the feasibility constraints (13) and the ordering of the $\bar{C}_k$ in (15). By Proposition 1 and Lemma 3, we have

$$C_k(A) \le \max_{1 \le g \le k}\{r_g\} + 4V_k \le \bar{C}_k + \frac{64}{3}\, C_k(OPT).$$

It follows that

$$\sum_{k=1}^{n} w_k C_k(A) \le \sum_{k=1}^{n} w_k \Big(\bar{C}_k + \frac{64}{3}\, C_k(OPT)\Big) \le \sum_{k=1}^{n} w_k C_k(OPT) + \frac{64}{3}\sum_{k=1}^{n} w_k C_k(OPT) = \frac{67}{3}\sum_{k=1}^{n} w_k C_k(OPT),$$

where the second inequality follows from Lemma 1.

We now consider the running time of Algorithm 2. The program (LP) in Step 1 can be solved in polynomial time, as discussed in §2.1. Thus, it suffices to show that Step 2 runs in polynomial time. Since there are only $O(\log T)$, a polynomial number, of intervals of the form $(\tau_{l-1}, \tau_l]$, it suffices to show that for each $u = 1, 2, \ldots, P$, Algorithm 1 completes processing all coflows in the set $S_u$ in polynomial time, where we recall the definition of $S_u$ in Step 2 of Algorithm 2. But this follows from Lemma 4. Thus, Algorithm 2 runs in polynomial time.

Proof of Corollary 1. Consider Algorithm 2 and suppose that all coflows are released at time 0, i.e., $r_k = 0$ for all $k$. By Proposition 1, $C_k(A) \le 4V_k$ for all $k = 1, 2, \ldots, n$. By inspection of the proof of Theorem 1, we have

$$\sum_{k=1}^{n} w_k C_k(A) \le \frac{64}{3}\sum_{k=1}^{n} w_k C_k(OPT).$$

The fact that Algorithm 2 has a polynomial running time was established in the proof of Theorem 1.

Before proceeding to the proofs of Theorem 2 and Corollary 2, let us provide some remarks on the upper bounds (19) in Proposition 1. Recall that Ineq. (19) holds simultaneously for all $k$. A natural question is then whether the upper bounds in (19) are tight. We leave this question as future work, but provide the following observation for now. Suppose that all coflows are released at time 0; then the upper bounds in (19) become $C_k(A) \le 4V_k$ for all $k$. By inspecting the proof of Proposition 1, it is easy to see that in fact $C^{(k)}(A) \le 4V_k$ for all $k$, where $C^{(k)}(A)$ is the completion time of coflows $1, 2, \ldots, k$ under our algorithm. Comparing this with the lower bounds (17) in Lemma 2, we see that the upper bounds are off by a factor of at most 4. The lower bounds (17) cannot be achieved simultaneously for all $k$; this fact can be demonstrated through a simple counter-example. See Appendix B for details.

Proofs of Theorem 2 and Corollary 2

Similar to the proof of Theorem 1, the proof of Theorem 2 relies on the following proposition, the randomized counterpart of Proposition 1.

Proposition 2. Let $C_k(A')$ be the random completion time of coflow $k$ under the randomized scheduling algorithm

described in §3.2. Then, for all $k = 1, 2, \ldots, n$,

$$E[C_k(A')] \le \max_{1 \le g \le k}\{r_g\} + \Big(\frac{3}{2} + \sqrt{2}\Big) V_k. \qquad (20)$$

Proof. Recall the random time points $\tau'_l$ defined by $\tau'_0 = 0$ and $\tau'_l = T_0 a^{l-1}$, where $a = 1 + \sqrt{2}$ and $T_0 \sim \mathrm{Unif}[1, a]$. Suppose that $V_k \in (\tau'_{r(k)-1}, \tau'_{r(k)}]$, and let $T_k = \tau'_{r(k)} - \tau'_{r(k)-1}$ for all $k$. Then

$$\frac{\tau'_{r(k)}}{T_k} = \frac{T_0 a^{r(k)-1}}{T_0 a^{r(k)-1} - T_0 a^{r(k)-2}} = \frac{a}{a-1}.$$

Since $T_0 \sim \mathrm{Unif}[1, a]$, $T_k$ is uniformly distributed on the interval $\big((a-1)V_k/a,\ (a-1)V_k\big)$. Thus,

$$E\big[\tau'_{r(k)}\big] = \frac{a}{a-1}\, E[T_k] = \frac{a}{a-1} \cdot \frac{1}{2}\Big(\frac{a-1}{a} + a - 1\Big) V_k = \frac{1+a}{2}\, V_k.$$

Similar to the $\tilde{\tau}_u$ used in the proof of Proposition 1, define $\tilde{\tau}'_u$ inductively by $\tilde{\tau}'_0 = 0$ and $\tilde{\tau}'_u = \tilde{\tau}'_{u-1} + \tau'_{s_u}$, $u = 1, 2, \ldots, P$. If $k \in S_u$, then

$$\tilde{\tau}'_u \le \sum_{l=1}^{r(k)} \tau'_l = \tau'_{r(k)} + \frac{\tau'_{r(k)}}{a} + \cdots + T_0 \le \frac{a}{a-1}\, \tau'_{r(k)}.$$

Similar to the proof of Proposition 1, we can establish that if $k \in S_u$, then $C_k(A') \le \max_{1 \le g \le k}\{r_g\} + \tilde{\tau}'_u$. Thus, with $a = 1 + \sqrt{2}$,

$$E[C_k(A')] \le E\Big[\max_{1 \le g \le k}\{r_g\} + \tilde{\tau}'_u\Big] \le \max_{1 \le g \le k}\{r_g\} + \frac{a}{a-1}\, E\big[\tau'_{r(k)}\big] \le \max_{1 \le g \le k}\{r_g\} + \frac{a^2 + a}{2(a-1)}\, V_k = \max_{1 \le g \le k}\{r_g\} + \Big(\frac{3}{2} + \sqrt{2}\Big) V_k.$$

Proof of Theorem 2. By Proposition 2 and Lemma 3, we have

$$E[C_k(A')] \le \max_{1 \le g \le k}\{r_g\} + \Big(\frac{3}{2} + \sqrt{2}\Big) V_k \le \bar{C}_k + \Big(8 + \frac{16\sqrt{2}}{3}\Big) C_k(OPT).$$

It follows that

$$E\Big[\sum_{k=1}^{n} w_k C_k(A')\Big] \le \sum_{k=1}^{n} w_k \Big(\bar{C}_k + \Big(8 + \frac{16\sqrt{2}}{3}\Big) C_k(OPT)\Big) \le \sum_{k=1}^{n} w_k C_k(OPT) + \Big(8 + \frac{16\sqrt{2}}{3}\Big)\sum_{k=1}^{n} w_k C_k(OPT) = \Big(9 + \frac{16\sqrt{2}}{3}\Big)\sum_{k=1}^{n} w_k C_k(OPT),$$

where the second inequality follows from Lemma 1.

Proof of Corollary 2. Consider the randomized algorithm and suppose that all coflows are released at time 0, i.e., $r_k = 0$ for all $k$. By Proposition 2, for all $k = 1, 2, \ldots, n$, $E[C_k(A')] \le \big(\tfrac{3}{2} + \sqrt{2}\big) V_k$. By inspection of the proof of Theorem 2, we have

$$E\Big[\sum_{k=1}^{n} w_k C_k(A')\Big] \le \Big(8 + \frac{16\sqrt{2}}{3}\Big)\sum_{k=1}^{n} w_k C_k(OPT).$$

4 EXPERIMENTS

In previous sections, we presented deterministic and randomized approximation algorithms with provable performance guarantees. In this section, we conduct some preliminary experiments to evaluate the practical performance of several algorithms, including our deterministic algorithm described in §3.2.

At a high level, both of our algorithms consist of two related stages. The ordering stage computes an ordering of coflows, and the scheduling stage produces a sequence of feasible schedules that respects this ordering. It is intuitively clear that an intelligent ordering of coflows
in the ordering stage can substantially reduce coflow completion times. As a result, we consider three different coflow orderings, including the LP-based ordering (15), and study how they affect algorithm performance. See §4.1 for more details.

The derivation of the actual sequence of schedules in the scheduling stage relies on two key ideas: scheduling according to an optimal (Birkhoff-von Neumann) decomposition, and a suitable grouping of the coflows. We note that, to some extent, grouping can be thought of as a dovetailing procedure, where skewed coflow matrices are consolidated to form more uniform ones, which can be efficiently cleared by matching schedules. It is then reasonable to expect that grouping can improve performance, so we compare algorithms with grouping and those without grouping to understand its effect. The particular grouping procedure that we consider here is the one described in Algorithm 2.

Backfilling is a common strategy used in scheduling for computer systems to increase utilization of system resources (see, e.g., [13]). Therefore, we will also investigate the performance gain of using backfilling in the scheduling stage. We focus on one natural backfilling technique, described in detail in §4.1.

In summary, we will evaluate the performance impact of coflow ordering, coflow grouping, and backfilling. Our evaluation uses a Hive/MapReduce trace collected from a large production cluster at Facebook [12, 13]. The main findings are as follows. Algorithms with coflow grouping consistently outperform those without grouping. Similarly, algorithms that use backfilling consistently outperform those that do not. The performance of algorithms that use the LP-based ordering (15) is similar to that of algorithms that order coflows according to their loads (see (18)). When combined with grouping and backfilling, these algorithms are nearly optimal. Furthermore, our LP-based deterministic algorithm has near-optimal performance.

4.1 Methodology

Workload. We use the same workload as described in [13]. The workload is based on a Hive/MapReduce trace at Facebook that was collected on a
3000-machine cluster with 150 racks, so the datacenter in the experiments can be modeled as a $150 \times 150$ network switch (and each coflow represented by a $150 \times 150$ matrix). The cluster has a 10:1 core-to-rack oversubscription ratio with a total bisection bandwidth of 300 Gbps. Therefore, each ingress/egress port has a capacity of 1 Gbps, or equivalently 128 MBps. We select the time unit to be 1/128 second accordingly, so that each port has

the capacity of 1 MB per time unit. We filter the coflows based on the number of non-zero flows, which we denote by $M$, and we consider three collections of coflows, filtered by the conditions $M \ge 50$, $M \ge 40$ and $M \ge 30$, respectively. As pointed out in [13], coflow scheduling algorithms may be ineffective for very sparse coflows in real datacenters, due to communication overhead, so we investigate the performance of our algorithms for these three collections. We also assume that all coflows arrive at time 0, so we do not consider the effect of release dates.

Algorithms and Metrics. We consider 12 different scheduling algorithms, which are specified by the ordering used in the ordering stage and the actual sequence of schedules used in the scheduling stage. We consider three different orderings, which we describe in detail below, and the following 4 cases in the scheduling stage: (a) without grouping or backfilling, which we refer to as the base case; (b) without grouping but with backfilling; (c) with grouping and without backfilling; and (d) with both grouping and backfilling. Algorithm 2 corresponds to the combination of LP-based ordering and case (c).

For ordering, three different possibilities are considered. We use $H_A$ to denote the naive ordering of coflows by coflow IDs from the production trace, $H_\rho$ to denote the ordering of coflows by the ratios between the maximum load $\rho$ (defined in (18)) and the weight $w$, which is also considered in [13], and $H_{LP}$ to denote the LP-based coflow ordering given in (15).

Given an ordering of the coflows, it is possible to partition them into groups using the procedure described in Step 2 of Algorithm 2. As discussed, we will consider algorithms with and without such grouping. When we do group the coflows, we treat all coflows within a group as a whole and say that they are consolidated into an aggregated coflow. Scheduling of coflows within a group makes use of the Birkhoff-von Neumann decomposition described in Algorithm 1, respecting the coflow order. Namely, if two data units from coflows $k$ and $k'$ within the same group use the same pair of input
and output, and $k$ is ordered before $k'$, then we always process the data unit from coflow $k$ first.

For backfilling, given a sequence of coflows with a given order, for coflow $k$ with a coflow matrix $D^{(k)}$, $k = 1, \ldots, n$, we compute the augmented coflow matrix $\tilde{D}^{(k)}$ and schedule the coflow by the Birkhoff-von Neumann decomposition in Algorithm 1. This decomposition may introduce unforced idle time whenever $\tilde{D}^{(k)} \ne D^{(k)}$. When we use a schedule that matches input $i$ to output $j$ to serve coflow $k$ with $D_{ij}^{(k)} < \tilde{D}_{ij}^{(k)}$, and there is no more service requirement on the pair of input $i$ and output $j$ for coflow $k$, we backfill in order from the flows on the same pair of ports in the subsequent coflows. When grouping is used, backfilling is applied to the aggregated coflows.

We compare the performance of different algorithms by considering ratios of the total weighted completion times. For the choice of coflow weights, we consider both the case of uniform weights and the case where weights are given by a random permutation of the set $\{1, 2, \ldots, n\}$.

4.2 Performance

We compute the total weighted completion times for all 3 orders in the 4 different cases (a)-(d) described in §4.1, through a set of experiments on filtered coflow data. The complete results can be found in Appendix D, and we present representative comparisons of the algorithms here.

Figure 2: Comparisons of total weighted completion times. (a) Comparison of total weighted completion times with respect to the base case for each order. Data are filtered by $M \ge 50$. Weights are random. (b) Comparison of total weighted completion times evaluated on data filtered by $M \ge 50$ in case (d), with both grouping and backfilling.

Figure 2a plots the total weighted completion times as percentages of the base case (a), for the case of random weights. Grouping and backfilling both improve the total weighted completion time with respect to the base case for all 3 orders. The reduction in the total weighted completion time from grouping is up to 27.19% and is consistently higher than the reduction from backfilling, which is up to 8.68%. For all 3 orders, scheduling
with both grouping and backfilling (i.e., case (d)) gives the smallest total weighted completion time, but the improvement from case (c) (only grouping) to case (d) is marginal.

We then compare the performance of different coflow orderings. Figure 2b shows the comparison of total weighted completion times evaluated on filtered coflow data, for both equal weights and random weights, in case (d), where the scheduling stage uses both grouping and backfilling. Compared with $H_A$, both $H_\rho$ and $H_{LP}$ reduce the total weighted completion times of coflows, by a ratio up to 805 and 779, respectively, with $H_\rho$ performing consistently better than $H_{LP}$.

A natural question to ask is how close $H_\rho$ and $H_{LP}$ are to the optimal. In order to get a tight lower bound of the scheduling problem (O), we solve (LP-EXP) for the case of random weights and when the number of non-zero flows $M \ge 50$. The ratio of the lower bound over the weighted completion time under $H_{LP}$ is [value lost in extraction], which implies that both $H_\rho$ and $H_{LP}$ in the ordering stage provide a good approximation of the optimal in practice. Due to time constraints, we did not compute the lower bounds for all cases, because (LP-EXP) is exponential in the size of the input and is extremely time consuming to solve.

Our experiments are only preliminary and leave open many questions. For example, we should systematically measure the benefit of the time-indexed versus the interval-indexed linear program. We should also compare the performance of the randomized algorithm, include varying release dates, and think of other heuristics to improve the solutions obtained. We leave this work, and testing on other data sets, to the full paper.

5 CONCLUSION AND OPEN PROBLEMS

We have given the first $O(1)$-approximation algorithms for minimizing the total weighted completion time of coflows in a datacenter network in the presence of release dates, and have performed preliminary experiments to evaluate the algorithm and several additional heuristics. Beyond the obvious question of improving the approximation ratio, this

work opens up several additional interesting directions in coflow scheduling, such as the consideration of other metrics and the addition of other realistic constraints, such as precedence constraints. We are particularly interested in minimizing weighted coflow processing time (usually called flow time in the literature), which is harder to approximate (the various hardness results from single machine scheduling will clearly carry over), but may lead to the development and understanding of better algorithms, possibly by considering resource augmentation.

Perhaps the most interesting questions involve making the algorithms and models more practical, for them to work in real time in a real system. While we consider release dates, our algorithms are not on-line, as they require the solution of an LP to compute a global ordering. We would also like to remove the centralized control and develop distributed algorithms, more suitable for implementation in a data center. To do so would require the development of a much simpler algorithm than the one here, possibly a primal-dual based algorithm. Finally, we observe that in applications the $D$ matrices may have uncertainty, and it would be interesting to design algorithms to deal with this uncertainty via robust or stochastic optimization.

Acknowledgement. Yuan Zhong would like to thank Mosharaf Chowdhury and Ion Stoica for numerous discussions on the coflow scheduling problem, and for sharing the Facebook data.

6 REFERENCES

[1] Apache Hadoop.
[2] Google Dataflow.
[3] Reza Ahmadi, Uttarayan Bagchi, and Thomas Roemer. Coordinated scheduling of customer orders for quick response. Naval Research Logistics, 52(6).
[4] Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick McKeown, Balaji Prabhakar, and Scott Shenker. pFabric: Minimal near-optimal datacenter transport. SIGCOMM Computer Communication Review, 43(4).
[5] E. Balas. On the facial structure of scheduling polyhedra. Mathematical Programming Studies, 24.
[6] Hitesh Ballani, Paolo Costa, Thomas Karagiannis, and Ant Rowstron. Towards predictable datacenter networks. SIGCOMM Computer Communication Review, 41(4).
[7] Garrett Birkhoff. Tres observaciones sobre el algebra lineal. Univ. Nac. Tucumán Rev. A, 5.
[8] Dhruba Borthakur. The Hadoop distributed file system: Architecture and design. Hadoop Project Website, 2007.
[9] Cheng-Shang Chang, Wen-Jyh Chen, and Hsiang-Yi Huang. Birkhoff-von Neumann input buffered crossbar switches. In INFOCOM.
[10] Zhi-Long Chen and Nicholas G. Hall. Supply chain scheduling: Conflict and cooperation in assembly systems. Operations Research, 55(6).
[11] Mosharaf Chowdhury and Ion Stoica. Coflow: A networking abstraction for cluster applications. In HotNets-XI.
[12] Mosharaf Chowdhury, Matei Zaharia, Justin Ma, Michael I. Jordan, and Ion Stoica. Managing data transfers in computer clusters with Orchestra. SIGCOMM Computer Communication Review, 41(4).
[13] Mosharaf Chowdhury, Yuan Zhong, and Ion Stoica. Efficient coflow scheduling with Varys. In SIGCOMM, 2014.
[14] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI.
[15] Fahad Dogar, Thomas Karagiannis, Hitesh Ballani, and Ant Rowstron. Decentralized task-aware scheduling for data center networks. Technical Report MSR-TR.
[16] Naveen Garg, Amit Kumar, and Vinayaka Pandit. Order scheduling models: Hardness and algorithms. In FSTTCS.
[17] Leslie A. Hall, Andreas S. Schulz, David B. Shmoys, and Joel Wein. Scheduling to minimize average completion time: Off-line and on-line approximation algorithms. Mathematics of Operations Research, 22(3).
[18] Marshall Hall. Combinatorial Theory. Addison-Wesley, 2nd edition, 1998.
[19] Nanxi Kang, Zhenming Liu, Jennifer Rexford, and David Walker. Optimizing the one big switch abstraction in software-defined networks. In CoNEXT.
[20] Narendra Karmarkar. A new polynomial-time algorithm for linear programming. Combinatorica, 4(4).
[21] E. L. Lawler and J. Labetoulle. On preemptive scheduling of unrelated parallel processors by linear programming. Journal of the ACM, 25(4).
[22] J. K. Lenstra, A. H. G. Rinnooy Kan, and P. Brucker. Complexity of machine scheduling problems. Annals of Discrete Mathematics, 1.
[23] Joseph Y. T. Leung, Haibing Li, and Michael Pinedo. Scheduling orders for multiple product types to minimize total weighted completion time. Discrete Applied Mathematics, 155(8).
[24] J. Li and N. Ansari. Enhanced Birkhoff-von Neumann decomposition algorithm for input queued switches. IEE Proceedings - Communications, 148(6).
[25] Monaldo Mastrolilli, Maurice Queyranne, Andreas S. Schulz, Ola Svensson, and Nelson A. Uhan. Minimizing the sum of weighted completion times in a concurrent open shop. Operations Research Letters, 38(5).
[26] Michael J. Neely, Eytan Modiano, and Yuan-Sheng Cheng. Logarithmic delay for $N \times N$ packet switches under the crossbar constraint. IEEE/ACM Transactions on Networking, 15(3).
[27] Cynthia A. Phillips, Cliff Stein, and Joel Wein. Minimizing average completion time in the presence of release dates. Mathematical Programming, 82(1-2).
[28] Michael Pinedo. Scheduling: Theory, Algorithms, and Systems. Springer, New York, NY, USA, 3rd edition, 2008.
[29] Lucian Popa, Arvind Krishnamurthy, Sylvia Ratnasamy, and Ion Stoica. FairCloud: Sharing the network in cloud computing. In HotNets-X, pages 22:1 ff.
[30] Thomas A. Roemer. A note on the complexity of the concurrent open shop problem. In Integer Programming and Combinatorial Optimization.
[31] Devavrat Shah, John N. Tsitsiklis, and Yuan Zhong. On queue-size scaling for input-queued switches. Preprint, 2014.
[32] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop distributed file system. In MSST.
[33] Martin Skutella. List Scheduling in Order of α-points on a Single Machine, volume 3484 of Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, 2006.
[34] Chang Sup Sung and Sang Hum Yoon. Minimizing total weighted completion time at a pre-assembly stage composed of two feeding machines. International Journal of Production Economics, 54(3).
[35] Guoqing Wang and T. C. Edwin Cheng. Customer order scheduling to minimize total weighted completion time. Omega, 35(5).
[36] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI.

Appendices

A CONNECTION WITH CONCURRENT OPEN SHOP SCHEDULING

The coflow scheduling problem (O) is closely related to the concurrent open shop problem [3, 30], as observed in [13]. Consider the formulation (O) of coflow scheduling. When all the coflows are given by diagonal matrices, coflow scheduling is equivalent to a concurrent open shop scheduling problem [13]. To see the equivalence, recall the problem setup of concurrent open shop. A set of $n$ jobs are released into a system with $m$ parallel machines at different times. For $k = 1, 2, \ldots, n$, job $k$ is released at time $r_k$ and requires $p_i^{(k)}$ units of processing time on machine $i$, for $i = 1, 2, \ldots, m$. The completion time of job $k$, which we denote by $C_k$, is the time at which its processing requirements on all machines are completed. Let $z(i, k, t)$ be the indicator variable which equals 1 if job $k$ is served on machine $i$ at time $t$, and 0 otherwise. Let $w_k$ be positive weight parameters, $k = 1, 2, \ldots, n$. Then we can formulate a program (CO) for concurrent open shop scheduling:

(CO) Minimize $\sum_{k=1}^{n} w_k C_k$
subject to
$$\sum_{t=1}^{C_k} z(i, k, t) \ge p_i^{(k)}, \quad \text{for } i = 1, \ldots, m,\ k = 1, \ldots, n; \qquad (21)$$
$$z(i, k, t) = 0 \ \text{ if } t < r_k, \quad \forall i, k, t; \qquad (22)$$
$$z(i, k, t) \ \text{binary}, \quad \forall i, k, t. \qquad (23)$$

Compare program (CO) with (O). Suppose that for each $k = 1, 2, \ldots, n$, $D^{(k)}$ is a diagonal matrix with diagonal entries given by $p_i^{(k)}$, $i = 1, 2, \ldots, m$. Then, when $i \ne j$, we always have $y(i, j, k, t) = 0$. If we rewrite $y(i, i, k, t)$ as $z(i, k, t)$, then in this case program (O) is equivalent to (CO).

The concurrent open shop scheduling problem has been shown to be strongly NP-hard [3]. Due to the connection discussed above, we see immediately that the coflow scheduling problem is also strongly NP-hard. We summarize this result in the following lemma.

Lemma 5. Problem (O) is NP-hard for $m \ge 2$.

Although there are similarities between the concurrent open shop and the coflow scheduling problem, there are also key differences which make coflow scheduling the more challenging problem. First, coflow scheduling involves coupled resources: the inputs and outputs are coupled by the matching constraints. As such, the coflow scheduling problem has also been called concurrent open shop with coupled
resources in [13]. Second, for concurrent open shop, there exists an optimal schedule in which jobs can be processed in the same order on all machines, i.e., the schedule is a permutation schedule [3]. In contrast, permutation schedules need not be optimal for coflow scheduling [13]. LP-relaxations based on completion time variables, which have been used for concurrent open shop (see, e.g., [25]), use permutation schedules in a crucial way and do not immediately extend to the coflow scheduling problem.

B LOWER BOUNDS (17) CANNOT BE ACHIEVED SIMULTANEOUSLY

We construct a simple counter-example to show that we cannot achieve the lower bounds $V_k$ in (17) simultaneously for all $k$. The counter-example involves 2 coflows in a datacenter with 3 inputs/outputs, which we can represent as $3 \times 3$ matrices. Let us use $D^{(1)}$ and $D^{(2)}$ to denote the matrices for coflow 1 and coflow 2, respectively. Suppose that $D^{(1)}$ and $D^{(2)}$ are given by

[matrices lost in extraction]

Define two time points $t_1 = \max\{I_1, J_1\} = 18$ and $t_2 = \max\{I_2, J_2\} = 30$. For coflow 1 to complete before time $t_1$, all the flows of coflow 1 must finish processing at time $t_1$. By the structure of $D^{(1)}$, inputs 1 & 3 and outputs 1 & 3 must work at full capacity on coflow 1 throughout the time period 0 to $t_1$. Furthermore, for both coflows to complete before time $t_2$, all ports must work at full capacity throughout the time period 0 to $t_2$, due to the structure of $D^{(1)} + D^{(2)}$. Therefore, at time $t_1$, the remaining flows to be transferred across the network switch are all from coflow 2, and their amount must be exactly $t_2 - t_1 = 12$ for each input and each output. Let the matrix $\tilde{D}^{(2)} = \big(\tilde{d}_{ij}^{(2)}\big)$ represent the collection of remaining flows to be transferred from time $t_1$ to time $t_2$. $\tilde{D}^{(2)}$ is a $3 \times 3$ matrix with all row sums and column sums equal to 12, and satisfies $\tilde{D}^{(2)} \le D^{(2)}$. However, since coflow 1 uses up all the capacity on inputs 1 & 3 and outputs 1 & 3 from time 0 up to time $t_1$, the remaining flows satisfy $\tilde{d}_{ij}^{(2)} = d_{ij}^{(2)}$ for $(i, j) \ne (2, 2)$. We conclude that such a matrix does not exist, because $\tilde{d}_{21}^{(2)} + \tilde{d}_{23}^{(2)} = 20 > 12$.

C PROOF OF LEMMA 3

The idea of the proof follows
Wang and Cheng [35]. Suppose that $\tau_{u-1} < \bar{C}_k \le \tau_u$ for some $u$. We consider the following three cases.

Case (1): $\bar{C}_k < \frac{5\tau_{u-1}}{4}$. For any $g = 1, \ldots, k$, we have

$$\frac{5\tau_{u-1}}{4} > \bar{C}_k \ge \bar{C}_g = \sum_{l=1}^{L} \tau_{l-1}\, \bar{x}_l^{(g)} \ge \tau_u \sum_{l=u+1}^{L} \bar{x}_l^{(g)} = 2\tau_{u-1}\Big(1 - \sum_{l=1}^{u} \bar{x}_l^{(g)}\Big).$$

Therefore,

$$\sum_{l=1}^{u} \bar{x}_l^{(g)} > \frac{3}{8}.$$

Let $g^* = \arg\min_{1 \le g \le k} \sum_{l=1}^{u} \bar{x}_l^{(g)}$. We know from Eq. (16) that

$$V_k = \max\Big\{\max_{i} \sum_{j'=1}^{m}\sum_{g=1}^{k} d_{ij'}^{(g)},\ \max_{j} \sum_{i'=1}^{m}\sum_{g=1}^{k} d_{i'j}^{(g)}\Big\} = \max\Big\{\max_{i} \sum_{j'=1}^{m}\sum_{g=1}^{k} d_{ij'}^{(g)}\, \frac{\sum_{l=1}^{u} \bar{x}_l^{(g)}}{\sum_{l=1}^{u} \bar{x}_l^{(g)}},\ \max_{j} \sum_{i'=1}^{m}\sum_{g=1}^{k} d_{i'j}^{(g)}\, \frac{\sum_{l=1}^{u} \bar{x}_l^{(g)}}{\sum_{l=1}^{u} \bar{x}_l^{(g)}}\Big\}.$$
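As a concrete illustration of the quantity $V_k$ just restated, and of how Step 2 of Algorithm 2 uses it to form the groups $S_u$, the following is a minimal Python sketch. It assumes that coflows are given as $m \times m$ nested lists already indexed in the order (15), and that the interval endpoints are $\tau_l = 2^{l-1}$ (i.e., $\tau_1 = 1$, which is an assumption; the text only states $\tau_0 = 0$ and $\tau_{l+1} = 2\tau_l$). The function names are hypothetical and are not taken from the paper.

```python
def max_total_loads(coflows):
    """Return [V_1, ..., V_n] for coflows listed in scheduling order.

    V_k is the largest row sum or column sum of D^(1) + ... + D^(k).
    """
    m = len(coflows[0])
    row = [0] * m  # running load on each input port
    col = [0] * m  # running load on each output port
    loads = []
    for D in coflows:
        for i in range(m):
            for j in range(m):
                row[i] += D[i][j]
                col[j] += D[i][j]
        loads.append(max(max(row), max(col)))
    return loads


def interval_index(load):
    """Smallest l >= 1 with load <= tau_l, ASSUMING tau_l = 2**(l - 1)."""
    l, tau = 1, 1
    while tau < load:
        tau *= 2
        l += 1
    return l


def group_by_interval(loads):
    """Map each interval index r(k) to the 1-based coflow indices k in it."""
    groups = {}
    for k, load in enumerate(loads, start=1):
        groups.setdefault(interval_index(load), []).append(k)
    return dict(sorted(groups.items()))


# Two toy 2 x 2 coflows (illustrative data, not from the trace).
coflows = [
    [[1, 2], [0, 1]],
    [[3, 0], [1, 1]],
]
loads = max_total_loads(coflows)    # V_1 = 3, V_2 = 6
groups = group_by_interval(loads)   # V_1 in (2, 4], V_2 in (4, 8]
```

Under these assumptions, each group would then be aggregated into a single coflow and cleared with Algorithm 1; the sketch stops at the grouping and does not implement the matching schedule.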

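We then have

$$V_k \le \frac{1}{\sum_{l=1}^{u} \bar{x}_l^{(g^*)}}\, \max\Big\{\max_{i} \sum_{j'=1}^{m}\sum_{g=1}^{k} d_{ij'}^{(g)} \sum_{l=1}^{u} \bar{x}_l^{(g)},\ \max_{j} \sum_{i'=1}^{m}\sum_{g=1}^{k} d_{i'j}^{(g)} \sum_{l=1}^{u} \bar{x}_l^{(g)}\Big\}.$$

Using the constraints (11) and (12), we have

$$V_k \le \frac{\tau_u}{\sum_{l=1}^{u} \bar{x}_l^{(g^*)}} < \frac{8\tau_u}{3} = \frac{16\tau_{u-1}}{3} < \frac{16}{3}\, \bar{C}_k.$$

Case (2): $\frac{5\tau_{u-1}}{4} \le \bar{C}_k < \frac{3\tau_{u-1}}{2}$. Define a sequence $\alpha_h$, $h = 1, 2, \ldots$, by

$$\alpha_1 = \frac{1}{4}, \qquad \alpha_h = 1 - \sum_{q=1}^{h-1} \alpha_q - \frac{1 - \alpha_1}{1 + \sum_{q=1}^{h-1} \alpha_q}, \quad h = 2, 3, \ldots$$

Note that a simple induction shows that for all $h$, $\alpha_h > 0$ and $\sum_{q=1}^{h} \alpha_q < 1/2$. Furthermore, $\sum_{q=1}^{h} \alpha_q \to 1/2$ as $h \to \infty$. Thus, there exists some $h$ such that

$$\Big(1 + \sum_{q=1}^{h-1} \alpha_q\Big)\tau_{u-1} < \bar{C}_k \le \Big(1 + \sum_{q=1}^{h} \alpha_q\Big)\tau_{u-1}.$$

For any $g = 1, \ldots, k$, we have

$$\Big(1 + \sum_{q=1}^{h} \alpha_q\Big)\tau_{u-1} \ge \bar{C}_k \ge \bar{C}_g = \sum_{l=1}^{L} \tau_{l-1}\, \bar{x}_l^{(g)} \ge \tau_u \sum_{l=u+1}^{L} \bar{x}_l^{(g)} = 2\tau_{u-1}\Big(1 - \sum_{l=1}^{u} \bar{x}_l^{(g)}\Big),$$

and hence

$$\sum_{l=1}^{u} \bar{x}_l^{(g)} \ge \frac{1 - \sum_{q=1}^{h} \alpha_q}{2}.$$

It follows that

$$V_k \le \frac{\tau_u}{\sum_{l=1}^{u} \bar{x}_l^{(g^*)}} \le \frac{2\tau_u}{1 - \sum_{q=1}^{h} \alpha_q} = \frac{4\tau_{u-1}}{1 - \sum_{q=1}^{h} \alpha_q} < \frac{4\bar{C}_k}{\big(1 - \sum_{q=1}^{h} \alpha_q\big)\big(1 + \sum_{q=1}^{h-1} \alpha_q\big)}.$$

By the definition of $\alpha_q$, we have

$$\Big(1 - \sum_{q=1}^{h} \alpha_q\Big)\Big(1 + \sum_{q=1}^{h-1} \alpha_q\Big) = 1 - \alpha_1.$$

Therefore,

$$V_k < \frac{4\bar{C}_k}{1 - \alpha_1} = \frac{16}{3}\, \bar{C}_k.$$

Case (3): $\bar{C}_k \ge \frac{3\tau_{u-1}}{2}$. For any $g = 1, \ldots, k$, we have

$$\tau_u \ge \bar{C}_k \ge \bar{C}_g = \sum_{l=1}^{L} \tau_{l-1}\, \bar{x}_l^{(g)} \ge \tau_{u+1} \sum_{l=u+2}^{L} \bar{x}_l^{(g)} = 2\tau_u\Big(1 - \sum_{l=1}^{u+1} \bar{x}_l^{(g)}\Big),$$

and hence

$$\sum_{l=1}^{u+1} \bar{x}_l^{(g)} \ge \frac{1}{2}.$$

Using the same argument as in case (1), we have

$$V_k \le \frac{\tau_{u+1}}{\sum_{l=1}^{u+1} \bar{x}_l^{(g^*)}} \le 2\tau_{u+1} = 8\tau_{u-1} \le \frac{16}{3}\, \bar{C}_k.$$

We know from Lemma 1 that $\bar{C}_k \le C_k(OPT)$, and the result follows.

D TABLE OF EXPERIMENTAL RESULTS

We present all the results from the set of experiments on data under various filtering rules, using different weights $w$, as described in the experimental section. We compute the normalized total weighted completion times in Table 1 to evaluate the performance of the combination of orders $H_A$, $H_\rho$ and $H_{LP}$ in the ordering stage and the 4 cases in the scheduling stage: (a) without grouping and without backfilling, (b) without grouping and with backfilling, (c) with grouping and without backfilling, (d) with grouping and with backfilling. The normalization is with respect to the completion times in case (d) with the ordering $H_{LP}$.

[Table 1: numerical entries lost in extraction. Rows: filtering rules $M \ge 50$, $M \ge 40$, $M \ge 30$, each with cases (a), (b), (c), (d). Columns: equal weights ($H_A$, $H_\rho$, $H_{LP}$) and random weights ($H_A$, $H_\rho$, $H_{LP}$).]

Table 1: Normalized
total weighted completion times for the combination of ordering heuristics and 4 scheduling methods. Data are filtered based on $M$.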
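The normalization described for Table 1 can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, and the numbers are made-up placeholders that are not taken from Table 1 or from the Facebook trace. Each algorithm is identified by an (ordering, case) pair, and every total weighted completion time is divided by the value for ordering $H_{LP}$ in case (d).

```python
def total_weighted_completion_time(weights, completion_times):
    """Objective value: sum over coflows of w_k * C_k."""
    return sum(w * c for w, c in zip(weights, completion_times))


def normalize(results, baseline_key):
    """Divide every entry by the baseline entry (here: H_LP in case (d))."""
    base = results[baseline_key]
    return {key: value / base for key, value in results.items()}


# Placeholder values only; they are NOT results from the paper.
results = {
    ("H_A", "a"): 300.0,
    ("H_rho", "d"): 96.0,
    ("H_LP", "d"): 100.0,
}
normalized = normalize(results, ("H_LP", "d"))

# Toy objective value for two coflows with weights 1 and 2.
objective = total_weighted_completion_time([1, 2], [4, 12])
```

With this convention the baseline entry is 1 by construction, so a value below 1 would indicate an algorithm that does better than $H_{LP}$ with grouping and backfilling on that data set.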