RAPIER: Integrating Routing and Scheduling for Coflow-aware Data Center Networks

Yangming Zhao (UESTC), Kai Chen, Wei Bai, Minlan Yu (USC), Chen Tian (HUST), Yanhui Geng (Huawei), Yiming Zhang (NUDT), Dan Li (Tsinghua), Sheng Wang (UESTC)
SING Group, HKUST

Abstract

In the data flow models of today's data center applications such as MapReduce, Spark and Dryad, multiple flows can comprise a coflow group semantically. Only completing all flows in a coflow is meaningful to an application. To optimize application performance, routing and scheduling must be jointly considered at the level of a coflow rather than individual flows. However, prior solutions have a significant limitation: they only consider scheduling, which is insufficient. To this end, we present RAPIER, a coflow-aware network optimization framework that seamlessly integrates routing and scheduling for better application performance. Using a small-scale testbed implementation and large-scale simulations, we demonstrate that RAPIER significantly reduces the average coflow completion time (CCT) by up to 79.% compared to the state-of-the-art scheduling-only solution, and it is readily implementable with existing commodity switches.

I. INTRODUCTION

Cluster computing frameworks such as MapReduce [], Dryad [8], Spark [] and so on have become the mainstream platforms for data processing and analysis in today's cloud services. A common feature of these different computing paradigms is that they all implement a data flow computing model, in which a group of data flows need to pass through a sequence of intermediate processing stages before generating the final results. These intermediate flow transfers can account for more than % of job completion time [8], and have a significant impact on job performance. Therefore, optimizing such flow transfers is important for applications.

The term coflow is defined as the set of all flows transferring data between two stages of a job [7]. To optimize application performance, we need to optimize flow transfers at the level of coflow rather than individual ones.
This is because the job completion time depends on the time it takes to complete the entire coflow, instead of the time to complete the individual flows composing it. For example, in MapReduce [] and BSP [], a stage cannot complete, or sometimes even start, before it receives all the flows in a coflow from the previous stage. From an application's perspective, when a stage is pending for the input data, the CPU often sits idle or is under-utilized. As a result, reducing the coflow completion time (CCT) can further improve CPU utilization, maximizing application performance and job throughput in a given time period. To minimize average CCT, both routing and scheduling must be considered simultaneously (see Section II for details).

This work was performed when Yangming Zhao was an intern student at the SING Group of HKUST under supervision of Prof. Kai Chen.

However, prior solutions for network flow optimization such as [,, 8, 9,,,, ] have significant limitations (Table I): some of them (e.g., [,,,, ]) are ineffective, because they are coflow-agnostic and do not account for the collective behavior of flows belonging to a coflow; existing coflow-aware solutions (e.g., [8, 9, ]) are insufficient, because although these approaches improve by starting to consider coflow-level semantics, they only focus on scheduling while neglecting an indispensable component: routing. We show in Section V that this can directly lead to .7% performance loss.

TABLE I. SUMMARY OF PRIOR SOLUTIONS AND COMPARISON TO RAPIER

Related work                              Coflow-aware   Routing   Scheduling
pFabric [], PDQ [], PASE [], D [], etc.   No             No        Yes
Varys [9], Baraat [], Orchestra [8]       Yes            No        Yes
RAPIER                                    Yes            Yes       Yes

Motivated by this situation, we design and implement RAPIER, a coflow-aware network optimization system for data center networks (DCNs). To improve average CCT, RAPIER seamlessly combines routing and scheduling by formulating them as a joint optimization model. This model is a nonlinear program with integer variables, so it cannot be solved directly. Accordingly, we propose an efficient heuristic that approximately solves the problem based on a relaxation of the model.
We evaluate RAPIER using a small-scale testbed implementation as well as large-scale simulations. Our evaluation results show that RAPIER can reduce average CCT by up to 79.% and .%, compared to the scheduling-only and routing-only schemes respectively. Our implementation verifies that RAPIER is readily implementable with existing commodity switches.

In summary, the main highlights of this paper include:
- A key observation that both routing and scheduling must be jointly considered for optimizing average CCT. (Section II)
- A coflow-aware network optimization solution, RAPIER, which takes into account routing and scheduling simultaneously for the first time. In the course of system design, we also develop fast and efficient online algorithms to approximately solve theoretically NP-hard problems.

  (Section III and Section IV)
- A real testbed implementation and extensive large-scale simulations. (Section V)

[Figure 1: panels (a) A possible case by ECMP; (b) Routing without coflow concept; (c) Optimal routing; (d) Optimal scheduling upon (a); (e) Optimal scheduling upon (b); (f) Optimal scheduling upon (c).]
Fig. 1. A motivating example, where (a)-(c) show different routing schemes, and (d)-(f) show the optimal scheduling schemes for (a)-(c).

II. A MOTIVATING EXAMPLE

In this section, we make key observations through a motivating example in Fig. 1. In this example, there are two coflows: Coflow a has flows f a and f a with the sizes of Mb and Mb respectively; Coflow b has flows f b and f b with the sizes of Mb and Mb respectively; and the link bandwidths are all Mbps. As a reference point, the optimal average CCT of this example should be .s.

The first takeaway is that scheduling alone is not sufficient to optimize average CCT. When the routing is fixed, good scheduling can minimize the average CCT by determining the sequence in which flows send out traffic. Fig. 1(a) shows a case of randomized routing by equal-cost multipath (ECMP). With a naive scheduling such as fair sharing, both coflows are dominated by path M d D, hence their CCTs are both s. If using the optimal scheduling shown in Fig. 1(d), the CCTs for the two coflows become s and s respectively; apparently, scheduling does play a critical role. However, the average CCT (in this routing) is only .s, which still has a .s gap to the real optimal value .s. It is clear that routing should also play a critical role: the loads of the two paths in Fig. 1(a) are severely unbalanced, where path M d D has a traffic load double that of path M u D.

The second takeaway is that considering routing and scheduling separately cannot optimize average CCT. As an example, in Fig. 1(b) a load-balancing routing places both flows of Coflow a on M u D and the flows of Coflow b on M d D; now the network is more balanced. However, the optimal CCTs for coflows a and b in this case are .s and .s respectively (see Fig. 1(e)); the average CCT .s is still not optimal. The reason is that flows of the same coflow are routed through the same path, which leaves little space for scheduling to take effect for reducing the average CCT.

The conclusion is that both routing and scheduling must be jointly considered in order to optimize average CCT. In our example, the minimal average CCT can be achieved by combining the routing in Fig. 1(c) and the scheduling in Fig. 1(f). In this case, the CCTs of the two coflows are s and .s respectively, and the average CCT is minimized. This motivates our design below.

III. DESIGN OVERVIEW

RAPIER optimizes the average CCT in data-intensive DCNs by coordinating routing and scheduling of flows in the network. Given each coflow with information about its individual flows, such as flow sizes and sources/destinations, RAPIER determines which paths carry these flows, when to start them, and at what rate to serve them, in order to optimize the average CCT of all the coflows in the network. Inspired by [8, 9], we design RAPIER to work in a centralized, cooperative manner. This decision is also coherent with many recent centralized data center designs such as [,,,,,, ], etc. As in prior works [, 9,,, ], RAPIER assumes that the information about a coflow can be readily derived from upper layer applications [7] or using state-of-the-art prediction techniques [].

A. Desirable Properties

We identify the following goals when designing RAPIER.

Algorithm 1: The RAPIER Framework
 1: Procedure MinimizeMeanCCT(Coflows Ω, Bandwidth R)
 2:   Sort all the coflows in Ω non-increasingly according to their waiting time; Ω' ← Ω
 3:   while Ω' ≠ Φ do
 4:     T_min ← ∞, C_min ← Φ;
 5:     for C ∈ Ω' do
 6:       T_C = MinimumCCT(C, R); /* compute the minimum completion time for coflow C, and the corresponding routing and rate allocation */
 7:       if C.waitTime() > δ then
 8:         T_min ← T_C, C_min ← C;
 9:         break;
10:       end if
11:       if T_C < T_min then
12:         T_min ← T_C, C_min ← C;
13:       end if
14:     end for
15:     Ω' ← Ω' \ C_min;
16:     Assign all the flows in coflow C_min using routing and rates computed in Line 6, and then update R;
17:   end while
18:   DistributeBandwidth(Ω, R); /* distribute the remaining bandwidth for work conservation */
19: end procedure

Scalability: RAPIER is necessarily an online system. Upon a new incoming coflow, the algorithms must be able to quickly and efficiently decide the routing paths, rates, and scheduling orders for all individual flows in the coflow. For this purpose, these algorithms must run in real time with low time complexity.

Starvation-free: As RAPIER allows bandwidth preemption, we must ensure that no coflow starves for an arbitrarily long period, even though starving it might benefit the average CCT in the network.

Work-conserving: Work conservation means that the network resource sits idle only if there is no traffic demand in the network. We require RAPIER to be work-conserving to fully utilize network capacity and to minimize CCT.

Readily deployable: The system should be readily implementable with existing commodity switches and easy to deploy without modifying any network devices.

Ensure coexistence: The system must be able to work with all types of traffic. In particular, latency-sensitive interactive traffic must be delivered without any delay.

B. RAPIER in a Nutshell

At a high level, to achieve scalability, RAPIER mainly orchestrates large coflows of data-intensive applications, while latency-sensitive individual flows and small coflows are treated as background traffic; background traffic can be sent directly and routed over the network using ECMP.
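The control loop of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Coflow` fields, the single-link model and `toy_min_cct` (a stand-in for MinimumCCT that simply grabs all residual bandwidth on one link) are hypothetical, as is the early exit when no remaining coflow fits.

```python
import math
from dataclasses import dataclass

@dataclass
class Coflow:
    name: str
    wait_time: float   # seconds spent waiting so far
    link: str          # toy model (assumption): the single link this coflow uses
    volume: float      # toy model (assumption): total volume on that link

def toy_min_cct(coflow, residual):
    """Hypothetical stand-in for MinimumCCT: take all residual bandwidth on one link."""
    bw = residual[coflow.link]
    if bw <= 0:
        return math.inf, {}
    return coflow.volume / bw, {coflow.link: bw}

def minimize_mean_cct(coflows, residual, min_cct, delta):
    """Return [(coflow name, CCT)] in admission order; mutates `residual`."""
    # Line 2: longest-waiting coflows are examined first.
    pending = sorted(coflows, key=lambda c: c.wait_time, reverse=True)
    schedule = []
    while pending:
        best_t, best_c, best_use = math.inf, None, {}
        for c in pending:
            t, use = min_cct(c, residual)
            if c.wait_time > delta:        # starvation guard: admit immediately
                best_t, best_c, best_use = t, c, use
                break
            if t < best_t:                 # otherwise keep the smallest CCT
                best_t, best_c, best_use = t, c, use
        if best_c is None:                 # assumption: stop when nothing fits
            break
        pending.remove(best_c)
        for link, bw in best_use.items():  # update the residual bandwidth R
            residual[link] -= bw
        schedule.append((best_c.name, best_t))
    return schedule

residual = {"l1": 1.0, "l2": 1.0}
coflows = [
    Coflow("A", 0.0, "l1", 4.0),
    Coflow("B", 0.0, "l2", 2.0),
    Coflow("C", 10.0, "l1", 8.0),   # has waited longer than delta
]
schedule = minimize_mean_cct(coflows, residual, toy_min_cct, delta=5.0)
print(schedule)
```

In this toy run, coflow C is admitted first because it exceeded the waiting threshold, even though its CCT is the largest; coflow A is left unscheduled and would only receive bandwidth in the later work-conservation step.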
A site broker periodically predicts the usage of background traffic in each link, and derives the residual bandwidth for coflow scheduling. We describe the coflow optimization framework of RAPIER with Algorithm 1, which is invoked whenever a new coflow arrives or an existing coflow finishes. More specifically, when a new coflow arrives, RAPIER is triggered to compute the routing and the transmission rate for each individual flow (we allow bandwidth preemption as described below). When an existing coflow finishes and network resource is released, we also need to trigger RAPIER to determine which coflows should take up the released bandwidth.

The underlying scheduling policy RAPIER assumes is the well-known minimum remaining time first (MRTF) [9, ]. As the input of Algorithm 1, all the coflows that are not completed should be included in Ω. In this case, even if a coflow is occupying bandwidth in the network, it may be preempted if a smaller coflow arrives. On the other hand, if part of a coflow has been served, its remaining volume information should be updated when we recompute the coflow order.

To prevent starvation, RAPIER prioritizes coflows that have been waiting longer than a user-defined threshold (Line 7-10). Otherwise, it is the turn of the coflow with the minimum completion time (Line 11-17). When a coflow is selected to send, RAPIER updates the bandwidth utilization and continues to find the coflow with the next minimum completion time (through Line 15-16). After the schedule order is determined, the remaining bandwidth is distributed to different coflows for the work-conservation purpose (Line 18).

Note that there are two key algorithms in RAPIER. The first one calculates the minimum completion time for each coflow given the information of all individual flows in this coflow and the network resource that can be used (Line 6). The other one distributes the remaining bandwidth for work conservation (Line 18). Designing these two complex algorithms is challenging.

IV. ALGORITHM DETAILS

In this section, we present the details for the two key algorithms in RAPIER.
In Section IV-A, we discuss how to calculate the minimum completion time for a single coflow by jointly optimizing routing and scheduling. After that, we analyze the approximation ratio of our algorithm in Section IV-B. In Section IV-C, we present the heuristic algorithm in RAPIER to distribute the remaining bandwidth to flows for work conservation.

A. Minimize Single Coflow Completion Time

Given the information of all the flows in a coflow, such as flow volume, source, and destination, together with the network resource (the residual bandwidth on each link), we can formulate the problem of minimizing the CCT of a coflow i as follows:

$$\text{minimize} \quad t_i \tag{1}$$

subject to:

$$\frac{v_i^j}{b_i^j} = t_i, \quad \forall j \tag{1a}$$
$$\sum_j \sum_{k:\, l \in p_i^{jk}} b_i^j x_i^{jk} \le R_l, \quad \forall l \tag{1b}$$
$$\sum_k x_i^{jk} = 1, \quad \forall j \tag{1c}$$
$$x_i^{jk} \in \{0, 1\}, \quad \forall j, k \tag{1d}$$

$t_i$ in the objective is the completion time of coflow i, and hence it should be minimized. Symbol $v_i^j$ is the flow volume of the j-th flow in coflow i, while $b_i^j$ is the bandwidth assigned to this flow. With constraint (1a), we directly enforce that the completion time of every flow equals the CCT, since it is reasonable for all the flows in a coflow to have the same completion time (namely, the bottleneck's completion time) in the optimum solution [7] [9]. Let $x_i^{jk}$ indicate whether the j-th flow in coflow i uses its k-th path (the link set of this path is denoted by $p_i^{jk}$). The left-hand term of (1b) calculates the capacity that is used by coflow i on link l, which should not exceed the residual capacity of link l, denoted by $R_l$. Constraints (1c) and (1d) require that a flow chooses exactly one routing path.

It is impossible to solve problem (1) directly, since this program is not only nonlinear but also has binary integer variables. This problem is an integer multi-commodity flow problem that is proven to be NP-hard []. Therefore, we resort to designing an efficient heuristic to solve it.

Based on constraint (1a), we know that the rate of each flow $b_i^j$ is directly proportional to its volume $v_i^j$, i.e., $b_i^j = \alpha_i v_i^j$. A larger $\alpha_i$ means that more bandwidth is obtained by the flows in coflow i, and hence a smaller completion time is required. In fact, we have $\alpha_i = 1/t_i$. The programming model (1) can be modified as follows:

$$\text{maximize} \quad \alpha_i \tag{2}$$

subject to:

$$\sum_j \sum_{k:\, l \in p_i^{jk}} \alpha_i v_i^j x_i^{jk} \le R_l, \quad \forall l \tag{2a}$$

together with (1c) and (1d).

However, there are still binary variables $x_i^{jk}$ in programming model (2), which makes the problem intractable on large-scale systems. Therefore, we further relax the binary constraint and obtain:

$$\text{maximize} \quad \alpha_i \tag{3}$$

subject to:

$$\sum_j \sum_{k:\, l \in p_i^{jk}} v_i^j \left(\alpha_i x_i^{jk}\right) \le R_l, \quad \forall l \tag{3a}$$
$$x_i^{jk} \ge 0, \quad \forall j, k \tag{3b}$$

together with (1c).

It should be noted that there is a product of two variables (i.e., $\alpha_i$ and $x_i^{jk}$) in constraint (3a). It makes the problem difficult to solve since it is a concave optimization.
To solve this problem, variables $m_i^{jk}$ are introduced to substitute this product (i.e., $m_i^{jk} = \alpha_i x_i^{jk}$) and we obtain:

$$\text{maximize} \quad \alpha_i \tag{4}$$

subject to:

$$\sum_j \sum_{k:\, l \in p_i^{jk}} v_i^j m_i^{jk} \le R_l, \quad \forall l \tag{4a}$$
$$\sum_k m_i^{jk} = \alpha_i, \quad \forall j \tag{4b}$$
$$m_i^{jk} \ge 0, \quad \forall j, k \tag{4c}$$

Now, problem (4) is a linear program that has only $\sum_j n_i^j + 1$ variables and $L + F_i$ constraints ($n_i^j$ is the number of candidate paths for the j-th flow in coflow i, $L$ is the number of links, and $F_i$ is the number of flows in coflow i). This is a small-scale linear program and can be solved in a timely manner. However, since we relax the binary integer constraint, the solution may contain fractional $x_i^{jk}$. To resolve this, we route the j-th flow in coflow i to the path $k^*$ such that $m_i^{jk^*} = \max_k m_i^{jk}$.

When the path of each flow is determined, i.e., $x_i^{jk}$ is fixed, we can go back to problem (2). We substitute in the obtained $x_i^{jk}$ values, which makes (2) a linear programming problem, and solve it. Given the path of each flow, the minimum CCT is exactly the inverse of the objective in (2).

In summary, the heuristic is: integrate scheduling and routing in the optimization together and let scheduling guide the routing selection; after fixing the routing with the approximation, the optimal scheduling is then derived. The heuristic to pursue the minimum completion time of a coflow is summarized as Algorithm 2.

B. Approximation Bound Analysis

We have presented a heuristic that pursues the minimum CCT by relaxing the original problem. In this part, we show how good the performance of the algorithm is through theoretical analysis.

Algorithm 2: Minimize Coflow Completion Time
 1: Procedure MinimumCCT(Coflow C_i, Bandwidth R)
 2:   Solve problem (4) with coflow and network resource utilization information
 3:   for all flow j in coflow C_i do
 4:     Initialize x_i^{jk} ← 0 for all k
 5:     k* ← arg max_k m_i^{jk}
 6:     x_i^{jk*} ← 1
 7:   end for
 8:   Solve (2) by fixing x_i^{jk} to be the result obtained from Line 3-7
 9:   b_i^j ← α_i v_i^j, where α_i is the objective in Line 8
10:   t_i ← 1/α_i
11:   return t_i
12: end procedure
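The rounding and fixed-path steps of Algorithm 2 (Line 3-10) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the fractional values `m` are taken as given (a hypothetical LP solution, since the solver call is omitted), and the function names, data layout and example numbers are not from the paper. Once every flow has a single path, constraint (2a) reads α · load_l ≤ R_l on each used link, so the largest feasible α is simply the minimum of R_l / load_l.

```python
import math

def round_paths(m):
    """m[j] lists the fractional LP values of flow j, one per candidate path.
    Returns, for every flow, the index of its largest entry (first wins on ties)."""
    return {j: max(range(len(vals)), key=vals.__getitem__) for j, vals in m.items()}

def fixed_path_cct(volumes, paths, choice, residual):
    """Solve problem (2) for fixed paths: alpha = min over used links of R_l / load_l,
    where load_l is the total volume routed over link l. Returns (CCT, rates)."""
    load = {}
    for j, k in choice.items():
        for link in paths[j][k]:
            load[link] = load.get(link, 0.0) + volumes[j]
    alpha = min(residual[link] / load[link] for link in load)
    if alpha <= 0:                      # some chosen link has no residual bandwidth
        return math.inf, {j: 0.0 for j in volumes}
    rates = {j: alpha * v for j, v in volumes.items()}   # b = alpha * v
    return 1.0 / alpha, rates

# Hypothetical two-flow coflow with two single-link candidate paths per flow.
volumes = {0: 4.0, 1: 2.0}
paths = {0: [["a"], ["b"]], 1: [["a"], ["b"]]}
m = {0: [0.1, 0.4], 1: [0.3, 0.2]}      # assumed output of the relaxed LP
residual = {"a": 1.0, "b": 1.0}

choice = round_paths(m)
cct, rates = fixed_path_cct(volumes, paths, choice, residual)
print(choice, cct, rates)
```

Here flow 0 is rounded to link b and flow 1 to link a; link b is the bottleneck (load 4 on capacity 1), so α = 0.25 and both flows finish together at time 4.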

Theorem 1: Assume the minimum CCT is $t_{min}$ and $t_{alg}$ is the CCT obtained by Algorithm 2, then

$$t_{alg} \le K \, t_{min}$$

where $K$ is the number of candidate paths for each flow.

Proof: To prove this theorem, the equivalent proposition is

$$\alpha_{alg} \ge \frac{\alpha_{max}}{K}$$

where $\alpha_{alg}$ and $\alpha_{max}$ are the inverses of $t_{alg}$ and $t_{min}$, respectively. Assume the objective of problem (4) is $\alpha_{upper}$; since (4) is a relaxation, there must be

$$\alpha_{upper} \ge \alpha_{max} \tag{5}$$

From Algorithm 2, we route each flow to the path with maximum $m_i^{jk}$. Because the $K$ values $m_i^{jk}$ of a flow sum to $\alpha_{upper}$ by (4b), the largest of them is at least $\alpha_{upper}/K$, and hence we have

$$m_i^{jk} \ge \frac{\alpha_{upper}}{K} x_i^{jk} \tag{6}$$

for any k. Substituting (6) into the constraints of problem (4), we have

$$\sum_j \sum_{k:\, l \in p_i^{jk}} \frac{\alpha_{upper}}{K} v_i^j x_i^{jk} \le \sum_j \sum_{k:\, l \in p_i^{jk}} m_i^{jk} v_i^j \le R_l \tag{7}$$

Combined with the fact that constraints (1c) and (1d) are guaranteed by Algorithm 2, we know that $\alpha_{upper}/K$ is a feasible solution to problem (2) for the given $x_i^{jk}$. It means that

$$\alpha_{alg} \ge \frac{\alpha_{upper}}{K} \ge \frac{\alpha_{max}}{K}$$

It is worth noting that although the theoretical bound is loose, in practice our implementation obtains very good results.

C. Distribute Bandwidth for Work-conservation

In Section IV-A, RAPIER only allocates minimal bandwidth to each flow, such that all the flows in a coflow are completed simultaneously. However, there may be some remaining bandwidth that can be used to serve more flows. We pursue the work-conserving property by distributing the remaining bandwidth to flows in RAPIER to optimize the overall system performance.

The key point in distributing bandwidth is how to determine the order in which flows preempt the bandwidth. First, for the coflows that have already been scheduled, more bandwidth for any flow in them cannot improve their CCT. Therefore, among all coflows, the coflows that have not been scheduled should have higher priority to use the remaining bandwidth; this also helps prevent starvation. Within a coflow, we prefer to allocate more bandwidth to the larger flows than to the smaller ones. This is because the flows with larger traffic volume are more likely to be the bottleneck of a coflow, i.e., to complete last if all the flows are served by best-effort delivery. Based on all these considerations, we design Algorithm 3 to distribute bandwidth to flows for the work-conserving purpose.
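The two ordering rules above (unscheduled coflows first, larger flows first) combined with a bottleneck search over candidate paths can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the dictionary and tuple layout, the flow identifiers, and the step that deducts the granted bandwidth from the residual capacity are illustrative choices, not details confirmed by the paper.

```python
import math

def assign_bandwidth(candidate_paths, residual):
    """Pick the candidate path with the largest bottleneck residual bandwidth.
    Assumption: the granted bandwidth is then deducted from every link on it."""
    best_path, best_bw = None, 0.0
    for path in candidate_paths:
        path_bw = min(residual[link] for link in path)   # bottleneck of this path
        if path_bw > best_bw:
            best_path, best_bw = path, path_bw
    if best_path is not None:
        for link in best_path:
            residual[link] -= best_bw
    return best_path, best_bw

def distribute_bandwidth(coflows, residual):
    """Visit coflows by non-increasing minimal CCT (unscheduled ones carry
    math.inf and therefore go first), and flows by non-increasing volume."""
    grants = []
    for coflow in sorted(coflows, key=lambda c: c["min_cct"], reverse=True):
        flows = sorted(coflow["flows"], key=lambda f: f[1], reverse=True)
        for flow_id, _volume, paths in flows:
            path, bw = assign_bandwidth(paths, residual)
            grants.append((flow_id, path, bw))
    return grants

# Hypothetical input: X is already scheduled, Y is not (infinite CCT).
residual = {"a": 3.0, "b": 1.0, "c": 2.0}
two_paths = [["a", "b"], ["a", "c"]]
coflows = [
    {"min_cct": 2.0, "flows": [("x1", 3.0, [["a", "c"]])]},
    {"min_cct": math.inf, "flows": [("y_small", 1.0, two_paths),
                                    ("y_big", 5.0, two_paths)]},
]
grants = distribute_bandwidth(coflows, residual)
print(grants)
```

In this example the unscheduled coflow Y is served before X, and its larger flow takes the wider path (a, c) first, which leaves nothing for the already scheduled flow x1.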
Algorithm 3: Distribute Remaining Bandwidth
 1: Procedure DistributeBandwidth(Coflows Ω, Bandwidth R)
 2:   Non-increasingly sort all the coflows in Ω in terms of their minimal CCT
 3:   for all C ∈ Ω do
 4:     Non-increasingly sort all the flows in C in terms of flow volume
 5:     for all f ∈ C do
 6:       AssignBandwidth(f, R)
 7:     end for
 8:   end for
 9: end procedure
10: Procedure AssignBandwidth(Flow f, Bandwidth R)
11:   maxBandwidth ← 0
12:   for all the candidate paths p for f do
13:     pathBandwidth ← ∞
14:     for all the links l ∈ p do
15:       if R_l < pathBandwidth then
16:         pathBandwidth ← R_l
17:       end if
18:     end for
19:     if pathBandwidth > maxBandwidth then
20:       maxBandwidth ← pathBandwidth
21:     end if
22:   end for
23:   return maxBandwidth
24: end procedure

In Line 2, we sort the coflows non-increasingly in terms of their minimal CCT. In this case, the coflows with infinite CCT, i.e., those that are not scheduled, get higher priority to preempt the bandwidth. The procedure AssignBandwidth() assigns the bandwidth to the corresponding flows. Note that the procedure AssignBandwidth() will route a flow to the path that can provide it with the maximum bandwidth.

V. EVALUATION

We evaluate RAPIER through a small-scale testbed emulation as well as large-scale simulations.

Schemes to compare: We compare RAPIER with the following schemes.
- Baseline: all the flows are routed by ECMP and all of them fairly compete for bandwidth.
- Scheduling-only (Varys): routes all the flows by ECMP but schedules them according to MRTF, which is conceptually equivalent to the state-of-the-art Varys [9].
- Routing-only: routes all the flows to pursue load balancing, but all the flows fairly compete for bandwidth.

Through comparison with the last two schemes, we can inspect the benefits brought by the two ingredients of RAPIER: routing and scheduling, respectively.

[Figure 2 labels: User, Kernel, Application, Enforcement Daemon, TCP/IP, Netfilter Hook, Enforcement Kernel Module, Flow table, Packet modifier, ioctl, Linux TC HTB, NIC driver.]
Fig. 2. Software stack of RAPIER's bandwidth enforcement.

Metrics: In this section, we define the performance of scheme 1 compared to scheme 2 as $(CCT_2 - CCT_1)/CCT_2$, where $CCT_1$ and $CCT_2$ are the average CCT derived by scheme 1 and scheme 2, respectively. Unless otherwise stated, the performance is compared to the baseline scheme.

A summary of the main results is as follows:
- Through the experiment on the small-scale leaf-spine testbed (Fig. 3), we can see that 8.% and 8.% of the average CCT can be reduced by RAPIER, compared to the baseline and routing-only schemes respectively.
- Results from simulations repeatedly indicate that RAPIER can reduce the average CCT by up to 79.%, .%, 9.79%, compared to state-of-the-art scheduling-only (e.g., Varys [9]), routing-only, and baseline schemes in different scenarios.
- When the network load changes, the performance of RAPIER is relatively stable; as a comparison, the performance of the routing-only and scheduling-only schemes varies a lot.
- When the inter-coflow arrival interval is large, RAPIER consistently shows very high performance gain. However, even if all the coflows arrive at the same time (the smallest arrival interval), RAPIER can still achieve .8% performance improvement in FatTree and 9.8% in VL2.

A. Implementation and Testbed Emulations

Implementation: The RAPIER prototype system consists of the central controller and end host enforcement modules. For routing enforcement we use the Software-defined Networking (SDN) technology to enable explicit routing. For bandwidth enforcement, we leverage Linux Traffic Control (TC) to perform per-flow rate limiting. The architecture of RAPIER's bandwidth enforcement is shown in Fig. 2. The enforcement daemon in user space communicates with the kernel module via ioctl to manage the flow table. The kernel module, located between the TCP/IP stack and TC, intercepts all outgoing packets and modifies the nfmark field of the socket buffer based on the rules in the flow table. The modified packets are then delivered to TC for rate limiting.
We leverage two-level Hierarchical Token Bucket (HTB) in TC: the root node classifies packets to their corresponding leaf nodes based on the nfmark field, and the leaf nodes enforce per-flow rates.

Fig. 3. Testbed topology.

Testbed: We build a leaf-spine topology as shown in Fig. 3. It interconnects 9 hosts through leaf (ToR) switches connected to spine switches using Gbps links, resulting in a non-blocking fabric. We use Pronto 9 8-port Gigabit Ethernet switches with the PicOS . system, which supports both Layer / and OpenFlow. Each server has a -core Intel E-.8GHz CPU, 8G memory, GB hard disk and G Ethernet NICs. The OS of the servers is Debian . bit version with the Linux ..8. kernel. The CPU, memory and hard disk are not bottlenecks in the experiments. We use iperf to generate TCP flows. The base round-trip time in our testbed is around us.

Experiment: In our experiment, we inject coflows into the network to evaluate the performance of RAPIER. As a comparison, we also evaluate the baseline and routing-only schemes. We do not include scheduling-only because we cannot get the exact flow paths with ECMP on the testbed. All the information of this experiment is summarized in Table II. It should be noted that the performance of the baseline scheme is averaged over tries, due to the randomness of ECMP. From this experiment, we can see that RAPIER can save (8. − 7.)/8. = 8.% of the average CCT compared to the baseline scheme, and it can reduce the average CCT by (.7 − 7.)/.7 = 8.% compared to the routing-only scheme.

Overhead: To make sure that the overhead of the enforcement module is negligible, we measured the extra CPU usage introduced by RAPIER's enforcement module. We generated more than 9Mbps of traffic with more than flows on a rack server (with a -core Intel E-.8GHz CPU). The extra CPU overhead introduced was around % (one core) compared with the case in which RAPIER's enforcement module was not used (no rate limiting). The throughput remained the same in both cases.
Actually, we note that, apart from the software solutions, some recent hardware solutions [] can also be used to achieve precise rate enforcement, especially at high link speeds, offloading some work from the CPU.

B. Large Scale Simulations

Simulation methodology: Existing packet-level simulators such as ns- are not suitable for our case due to their high overhead []. Similar to [, 9], we develop our own flow-level simulator. The simulator accounts for the flow arrival events and departure events, rather than packet sending and receiving

events, to reduce the simulation complexity. It updates the rate and remaining volume of each flow when an event occurs. To solve the linear programs in RAPIER, we embed the API provided by CPLEX . into our simulator.

In the simulations, we use a many-to-many communication pattern within a coflow and assume the inter-coflow arrival rate follows a Poisson distribution. We mainly evaluate aspects that may affect the performance of RAPIER: the width of a coflow (i.e., the number of flows within a coflow), the number of coflows in the network, and the inter-coflow arrival interval. For reasonable simulation time, we choose -server FatTree [] and VL2 [] as topologies. We also compared the results on the -server FatTree with those on an 89-server FatTree (on which the simulator runs much slower, over 8 hours for just one try) and observed similar performance. In the simulations, each of our results is an average of tries.

The overall simulation results are shown in Fig. 4-6. In general, we can see that RAPIER outperforms all other schemes in all scenarios.

[Table II: columns Coflow Id#, Flow Id#, Source, Destination, Volume (GB), and Coflow Completion Time (s) per scheme.]
TABLE II. SUMMARY OF TESTBED EXPERIMENT: THE AVERAGE CCT OF RAPIER IS 7., ROUTING-ONLY IS .7, AND BASELINE IS 8..

[Figure 4: panels (a) FatTree and (b) VL2 plot average coflow completion time (s) against coflow width; panels (c) FatTree and (d) VL2 plot performance compared to baseline (%) against coflow width.]
Fig. 4. The impact of coflow width.

Impact of coflow width: In each round of simulations, we send coflows with the same width into the network. Fig. 4 shows the simulation results. From this figure, we make the following observations.

Firstly, as shown in Fig. 4(a) and (b), the absolute average CCT increases with the coflow width. Compared to the baseline scheme, RAPIER can reduce average CCT by up to 79.% in FatTree, and .% in VL2. Without routing, the scheduling-only scheme would lose up to 9.% − .% = .7% (see Fig.
4(c) at a width of ) of the performance in FatTree.

Secondly, in Fig. 4(c) and (d), we observe a trend that the relative performance of the scheduling-only scheme almost always increases with the coflow width on both topologies. The reason is that when the coflow width is relatively small, the coflows are distributed over different parts of the network. In this case, coflows are unlikely to compete for bandwidth with each other. As a result, scheduling does not have much benefit, and the routing-only scheme achieves almost the same performance as RAPIER. With the increase of coflow width, different coflows will interleave with each other. Then, scheduling can effectively reduce average CCT by controlling the flow transmission rates.

Again, in Fig. 4(c) and (d), the performance of the routing-only scheme increases with the coflow width at first, but then decreases with it. This is because the flow collision probability (multiple flows being concurrently active on the same link) increases with the number of flows in the network. When the coflow width is relatively small, the routing scheme can get good performance as it resolves such collisions. However, when the coflow width is relatively large, the routing-only scheme cannot avoid such collisions and hence shows poor performance.

Thirdly, comparing Fig. 4(c) with (d), the routing-only scheme has better performance in FatTree than in VL2. The reason is that in FatTree, more link-disjoint paths can be found for different flows if they are from different source-destination pairs, which is not the case in VL2. Accordingly, routing has more optimization space to improve the average CCT in FatTree.

Fourthly, there is an anomaly in VL2 when the coflow width is 8 (see Fig. 4(d)). We would have expected the relative performance of the scheduling-only scheme to increase with the number of flows in the network. However, this expectation goes against the actual simulation results. We note in Fig. 4(b) that a large average CCT is caused by many flows in the network, which makes a large denominator in the performance definition. This accounts for the drop.
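The performance metric used throughout this section, and the denominator effect just mentioned, can be illustrated with a few lines of Python. This is a small sketch, assuming a hypothetical helper name `relative_improvement` and made-up CCT values; none of the numbers come from the paper's experiments.

```python
def relative_improvement(cct_scheme, cct_reference):
    """Performance of a scheme relative to a reference scheme:
    (CCT_reference - CCT_scheme) / CCT_reference."""
    return (cct_reference - cct_scheme) / cct_reference

# Hypothetical values: the same absolute saving of 6 s in both cases.
light_load = relative_improvement(cct_scheme=4.0, cct_reference=10.0)
heavy_load = relative_improvement(cct_scheme=14.0, cct_reference=20.0)
print(f"{light_load:.0%} vs {heavy_load:.0%}")
```

With an identical absolute saving, the relative figure halves when the reference CCT doubles, which is the kind of drop a larger denominator produces.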
Impact of coflow number: To evaluate how the performance of RAPIER is influenced by the number of coflows in the network, we fix the coflow width to be 8. From the results in Fig. 5, we make the following observations.

Firstly, as shown in Fig. 5(a) and (b), the average CCT

increases with the number of coflows in the network. RAPIER always clearly outperforms the routing-only and scheduling-only schemes. We can see from Fig. 5(a) that in FatTree the performance of RAPIER is up to (.79 − .)/.79 = 8.8% compared to the scheduling-only scheme (with coflows) and (.9 − .9)/.9 = 9.89% compared to the routing-only scheme (with coflows).

[Figure 5: panels (a) FatTree and (b) VL2 plot average coflow completion time (s) against the coflow number in the network; panels (c) FatTree and (d) VL2 plot performance compared to baseline (%) against the coflow number in the network.]
Fig. 5. The impact of coflow number.

[Figure 6: panels (a) FatTree and (b) VL2 plot average coflow completion time (s) against the average inter-coflow arrival interval (s); panels (c) FatTree and (d) VL2 plot performance compared to baseline (%) against the average inter-coflow arrival interval (s).]
Fig. 6. The impact of inter-coflow arrival interval.

Secondly, from Fig. 5(c) and (d), we can see that, in both topologies, RAPIER keeps relatively stable performance with different coflow numbers. The stable performance of RAPIER comes from its combination of routing and scheduling. When routing makes less contribution to RAPIER with the increase of coflow number, scheduling can contribute more to compensate for the performance loss.

Thirdly, we find that the scheduling-only scheme always outperforms the routing-only scheme in VL2 (Fig. 5(d)), and the scheduling-only scheme is more effective in VL2 than in FatTree (Fig. 5(c) and (d)). Actually, when there are more coflows competing with each other on the same link, scheduling makes more contribution to RAPIER than routing does. Compared to FatTree, there are fewer uplinks from ToRs in VL2, so it is likely that more flows will interleave with each other in VL2 than in FatTree. Hence, scheduling is more efficient for VL2 than for FatTree.
Impact of inter-coflow arrival interval: To investigate the impact of the inter-coflow arrival interval, we send out coflows with the width of 8 into the network, and observe the relationship between the system performance and the arrival interval of sequential coflows. Note that a larger average inter-coflow arrival interval indicates a lower coflow arrival rate. A zero arrival interval means that all the coflows arrive at the same time. We set the largest average inter-coflow arrival interval to be .8s, since each coflow may complete in at most s if it monopolizes the network in our simulations. From Fig. 6, we make the following observations.

Firstly, as shown in Fig. 6(a) and (b), the average CCT decreases with the increase of the average inter-coflow arrival interval. This is expected because, as explained above, a larger inter-coflow arrival interval means a lower coflow arrival rate. Furthermore, from Fig. 6(c) and Fig. 6(d), we find that with different inter-coflow arrival intervals, RAPIER can reduce CCT by up to 9.79% in FatTree and 8.% in VL2 compared to the baseline scheme. Even compared to the routing-only scheme and the scheduling-only scheme, in FatTree (Fig. 6(a)) the performance of RAPIER can be up to (. − .)/. = .% (when the arrival interval is .s) and (.77 − .9)/.77 = 79.% (when the arrival interval is .s), respectively.

Secondly, the performance of the scheduling-only scheme may first increase as the average inter-coflow arrival interval increases, and then decrease if the interval continues increasing after a certain point (see Fig. 6(c) and (d)). When the inter-coflow arrival interval is small, many coflows have to wait for the completion of other coflows. Hence, the baseline is large and it results in bad performance (see Fig. 6(a) and (b)). When the inter-coflow arrival interval is large, the later coflows may arrive when the previous coflows are almost complete. In this case, the scheduling scheme does not take effect to reduce the average CCT, since only a few coflows are in the network at the same time.
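The arrival process varied in these simulations can be sketched as follows. This is a minimal sketch, assuming that "Poisson" arrivals are generated in the usual way from exponentially distributed gaps with the given mean; the function name, the seed handling and the treatment of a zero interval are illustrative assumptions, not details taken from the authors' simulator.

```python
import random

def poisson_arrival_times(n_coflows, mean_interval, seed=0):
    """Arrival times of n_coflows whose gaps are exponential with mean
    `mean_interval` seconds. A zero mean puts every arrival at time 0."""
    if mean_interval == 0:
        return [0.0] * n_coflows
    rng = random.Random(seed)            # seeded so that runs are repeatable
    now, times = 0.0, []
    for _ in range(n_coflows):
        # expovariate takes the rate, i.e. 1 / mean gap
        now += rng.expovariate(1.0 / mean_interval)
        times.append(now)
    return times

arrivals = poisson_arrival_times(n_coflows=5, mean_interval=0.5)
simultaneous = poisson_arrival_times(n_coflows=5, mean_interval=0)
print(arrivals, simultaneous)
```

A larger `mean_interval` spreads the arrivals out, which corresponds to the lower arrival rate discussed above.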
Thirdly, in Fig. (c) and (d), we can see that the performance of RAPIER follows the same trend as the scheduling-only scheme when the inter-coflow arrival interval is small, and the same trend as the routing-only scheme when the inter-coflow arrival interval is large. This is again because scheduling is not effective in reducing the average CCT when only a few coflows are active in the network concurrently, while routing has little effect when too many coflows are in the network.

Takeaways: For reducing average CCT, routing contributes more when the network is relatively lightly loaded, since routing can reduce unnecessary flow collisions. In comparison, scheduling is more critical when network load increases and coflows interleave with each other. RAPIER succeeds because it integrates both routing and scheduling, and hence always outperforms the other schemes regardless of the network status.

VI. RELATED WORK

RAPIER contains two parts: routing and scheduling. There is a large spectrum of related work on either routing or scheduling. We only review some closely related ones here.

Flow routing in DCNs: Traditional traffic engineering solutions inside a data center [4] or across data centers [17, 19] focus on improving network resource utilization rather than reducing the average CCT. They leverage the short-term traffic predictability in DCNs to improve the system performance. In other work, zupdate [20] applies to the scenario where some network components face failure, while Hedera [2] and Duet [13] focus on how to distribute flows to balance the traffic load in the network. In contrast, RAPIER investigates how to distribute the flows belonging to the same coflow evenly across the network so that the average CCT can be further reduced by scheduling.

Individual flow scheduling in DCNs: There is also much existing work on optimizing network utilization and reducing average flow completion time (FCT) through scheduling, such as PDQ [16] and pfabric [3]. Both PDQ and pfabric are flow scheduling schemes that minimize FCT by tagging priorities on packets. Unfortunately, neither of them can be implemented using existing commodity switches, and hence they are not easy to deploy widely. Furthermore, they do not take the flow dependency semantics into account and thus are coflow-agnostic.

Coflow scheduling in DCNs: Orchestra [8] is perhaps the first work that takes the semantics among flows into account when optimizing flow transfers in data center clusters. After that, the work [7] summarizes the traffic patterns and flow dependency in DCNs and explicitly proposes the concept of coflow. More recent solutions (e.g., Varys [9] and Baraat [11]) apply the coflow (or task-aware) concept in their network optimizations; however, they focus only on scheduling while neglecting an indispensable part, routing, which makes these solutions insufficient.

VII. CONCLUSION

RAPIER is a system which optimizes average coflow completion time in DCNs by integrating routing and scheduling.
To the best of our knowledge, RAPIER is the first work that proposes and demonstrates that routing and scheduling must be jointly considered for optimizing the average CCT. Through real implementation and extensive simulations, we demonstrate that RAPIER works with existing commodity switches and achieves significant performance advantages over the scheduling-only or routing-only solutions.

Acknowledgements

This work was supported in part by HKRGC-ECS, National Basic Research Program of China (973) under Grant CB, CB9, CB, CB78, Huawei Noah's Ark Lab, NSFC Fund (7, 7, 79, 77, 9, and 987), and the Fundamental Research Funds for the Central Universities.

REFERENCES

[1] M. Al-Fares, A. Loukissas, and A. Vahdat, "A Scalable, Commodity Data Center Network Architecture," SIGCOMM Comput. Commun. Rev., vol. 38, no. 4, pp. 63-74, Aug. 2008.
[2] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat, "Hedera: Dynamic Flow Scheduling for Data Center Networks," in NSDI, 2010.
[3] M. Alizadeh, S. Yang, M. Sharif, S. Katti, N. McKeown, B. Prabhakar, and S. Shenker, "pfabric: Minimal Near-optimal Datacenter Transport," in SIGCOMM 2013.
[4] T. Benson, A. Anand, A. Akella, and M. Zhang, "MicroTE: Fine Grained Traffic Engineering for Data Centers," in CoNEXT 2011.
[5] K. Chen, C. Guo, H. Wu, J. Yuan, Z. Feng, Y. Chen, S. Lu, and W. Wu, "Generic and Automatic Address Configuration for Data Centers," in SIGCOMM, 2010.
[6] K. Chen, A. Singla, A. Singh, K. Ramachandran, L. Xu, Y. Zhang, X. Wen, and Y. Chen, "OSA: An Optical Switching Architecture for Data Center Networks with Unprecedented Flexibility," in NSDI, 2012.
[7] M. Chowdhury and I. Stoica, "Coflow: A Networking Abstraction for Cluster Applications," in HotNets-XI, 2012.
[8] M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica, "Managing Data Transfers in Computer Clusters with Orchestra," in SIGCOMM 2011.
[9] M. Chowdhury, Y. Zhong, and I. Stoica, "Efficient Coflow Scheduling with Varys," in SIGCOMM 2014.
[10] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Commun. ACM, vol. 51, no. 1, pp. 107-113, Jan. 2008.
[11] F. R. Dogar, T. Karagiannis, H. Ballani, and A. Rowstron, "Decentralized Task-Aware Scheduling for Data Center Networks," in SIGCOMM 2014.
[12] S. Even, A. Itai, and A. Shamir, "On the Complexity of Time Table and Multi-commodity Flow Problems," in Foundations of Computer Science, 1975, 16th Annual Symposium on, Oct. 1975, pp. 184-193.
[13] R. Gandhi, H. H. Liu, Y. C. Hu, G. Lu, J. Padhye, L. Yuan, and M. Zhang, "Duet: Cloud Scale Load Balancing with Hardware and Software," in SIGCOMM 2014.
[14] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," in SOSP 2003.
[15] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta, "VL2: A Scalable and Flexible Data Center Network," ACM SIGCOMM Comput. Commun. Rev., vol. 39, no. 4, pp. 51-62, Aug. 2009.
[16] C.-Y. Hong, M. Caesar, and P. B. Godfrey, "Finishing Flows Quickly with Preemptive Scheduling," in SIGCOMM 2012.
[17] C.-Y. Hong, S. Kandula, R. Mahajan, M. Zhang, and V. Gill, "Achieving High Utilization with Software-driven WAN," in SIGCOMM 2013.
[18] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: Distributed Data-parallel Programs from Sequential Building Blocks," in Eurosys 2007.
[19] S. Jain, A. Kumar, S. Mandal, J. Ong, L. Poutievski, A. Singh, S. Venkata, J. Wanderer, J. Zhou, M. Zhu, J. Zolla, U. Hölzle, S. Stuart, and A. Vahdat, "B4: Experience with a Globally-Deployed Software Defined WAN," in SIGCOMM 2013.
[20] H. H. Liu, X. Wu, M. Zhang, L. Yuan, R. Wattenhofer, and D. A. Maltz, "zupdate: Updating Data Center Networks with Zero Loss," in ACM SIGCOMM 2013.
[21] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, "Pregel: A System for Large-scale Graph Processing," in SIGMOD 2010.
[22] A. Munir, G. Baig, S. Irteza, I. A. Qazi, F. Dogar, and A. Liu, "Friends, not Foes - Synthesizing Existing Data Center Transport Strategies," in SIGCOMM 2014.
[23] Y. Peng, K. Chen, G. Wang, W. Bai, Z. Ma, and L. Gu, "HadoopWatch: A First Step Towards Comprehensive Traffic Forecasting in Cloud Computing," in INFOCOM 2014.
[24] S. Radhakrishnan, Y. Geng, V. Jeyakumar, A. Kabbani, G. Porter, and A. Vahdat, "SENIC: Scalable NIC for End-host Rate Limiting," in NSDI 2014.
[25] C. Wilson, H. Ballani, T. Karagiannis, and A. Rowstron, "Better Never Than Late: Meeting Deadlines in Datacenter Networks," ACM SIGCOMM Comput. Commun. Rev., vol. 41, no. 4, pp. 50-61, Aug. 2011.
[26] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster Computing with Working Sets," in HotCloud 2010.