zUpdate: Updating Data Center Networks with Zero Loss


Hongqiang Harry Liu (Yale University), Lihua Yuan (Microsoft), Xin Wu (Duke University), Roger Wattenhofer (Microsoft), Ming Zhang (Microsoft), David A. Maltz (Microsoft)

ABSTRACT

Datacenter networks (DCNs) are constantly evolving due to various updates such as switch upgrades and VM migrations. Each update must be carefully planned and executed in order to avoid disrupting the many mission-critical, interactive applications hosted in DCNs. The key challenge arises from the inherent difficulty in synchronizing the changes to many devices, which may result in unforeseen transient link load spikes or even congestion. We present one primitive, zUpdate, to perform congestion-free network updates under asynchronous switch and traffic matrix changes. We formulate the update problem using a network model and apply the model to a variety of representative update scenarios in DCNs. We develop novel techniques to handle several practical challenges in realizing zUpdate, implement a zUpdate prototype on OpenFlow switches, and deploy it on a testbed that resembles a real DCN topology. Our results, from both real-world experiments and large-scale trace-driven simulations, show that zUpdate can effectively perform congestion-free updates in production DCNs.

Categories and Subject Descriptors: C.2.1 [Computer Communication Networks]: Network Architecture and Design: Network communications. C.2.3 [Computer Communication Networks]: Network Operations.

Keywords: Data Center Network, Congestion, Network Update.

1. INTRODUCTION

The rise of cloud computing platforms and Internet-scale services has fueled the growth of large datacenter networks (DCNs) with thousands of switches and hundreds of thousands of servers. Due to the sheer number of hosted services and underlying physical devices, DCN updates occur frequently, whether triggered by operators, applications, or sometimes even failures. For example, DCN operators routinely upgrade existing switches or onboard new switches to fix known bugs or to add new capacity. For applications, migrating VMs or reconfiguring load balancers is the norm rather than the exception. Despite their prevalence, DCN updates can be challenging and distressing even for the most experienced operators. One key reason is the complex nature of the updates themselves. An update usually must be performed in multiple steps, each of which must be well planned to minimize disruption to the applications. Each step can involve changes to a myriad of switches which, if not properly coordinated, may lead to catastrophic incidents. Making matters even worse, there are different types of update with diverse requirements and objectives, forcing operators to develop and follow a unique process for each type of update. For these reasons, a DCN update may take hours or days to carefully plan and execute, while still running the risk of spiraling into an operator's nightmare.
This stark reality calls for a simple yet powerful abstraction for DCN updates, one that relieves operators from the nitty-gritty, such as deciding which devices to change or in what order, while offering a seamless update experience to the applications. We identify three essential properties of such an abstraction. First, it should provide a simple interface for operators to use. Second, it should handle a wide range of common update scenarios. Third, it should provide certain levels of guarantee which are relevant to the applications. The seminal work by Reitblatt et al. [17] introduces two abstractions for network updates: per-packet and per-flow consistency. These two abstractions guarantee that a packet or a flow is handled either by the old configuration before an update or by the new configuration after an update, but never by both. To implement these abstractions, they propose a two-phase commit mechanism which first populates the new configuration to the middle of the network and then flips the packet version numbers at the ingress switches. These abstractions can preserve certain useful trace properties, e.g., loop-freedom during the update, as long as these properties hold before and after the update. While the two abstractions are immensely useful, they are not a panacea for all the problems during DCN updates. In fact, a DCN update may trigger network-wide traffic migration, in which case many flows' configurations have to be changed. Because of the inherent difficulty in synchronizing the changes to the flows from different ingress switches, the link load during an update could get significantly higher than that before or after the update (see the example in §3). This problem is further exacerbated when the application traffic also fluctuates independently of the changes to switches. As a result, operators today are completely in the dark about how badly links could be congested during an update, not to mention how to come up with a feasible workaround. This paper introduces one key primitive, zUpdate, to perform congestion-free network-wide traffic migration during DCN updates. The letter "z" means zero loss and zero human effort. With zUpdate, operators simply need to describe the end requirements of the update, which can easily be converted into a set of input

constraints to zUpdate. zUpdate will then attempt to compute and execute a sequence of steps to progressively meet the end requirements from an initial traffic matrix and traffic distribution. When such a sequence is found, zUpdate guarantees that there will be no congestion throughout the update process. We demonstrate the power and simplicity of zUpdate by applying it to several realistic, complex update scenarios in large DCNs. To formalize the traffic migration problem, we present a network model that can precisely describe the relevant state of a network, specifically the traffic matrix and the traffic distribution. This model enables us to derive the sufficient and necessary conditions under which the transition between two network states will not incur any congestion. Based on that, we propose an algorithm to find a sequence of lossless transitions from an initial state to an end state which satisfies the end constraints of an update. We also illustrate by example how to translate the high-level, human-understandable update requirements into the corresponding mathematical constraints which are compliant with our model. zUpdate can be readily implemented on existing commodity OpenFlow switches. One major challenge in realizing zUpdate is the limited flow and group table sizes on those switches. Based on the observation that ECMP works sufficiently well for most of the flows in a DCN [9], we present an algorithm that greedily consolidates such flows to make efficient use of the limited table space. Furthermore, we devise heuristics to reduce the computation time and the switch update overhead. We summarize our contributions as follows:

- We introduce the zUpdate primitive to perform congestion-free network updates under asynchronous switch and traffic matrix changes.
- We formalize the network-wide traffic migration problem using a network model and propose a novel algorithm to solve it.
- We illustrate the power of zUpdate by applying it to several representative update scenarios in DCNs.
- We handle several practical challenges, e.g., switch table size limits and computation complexity, in implementing zUpdate.
- We build a zUpdate prototype on top of OpenFlow [3] switches and the Floodlight controller [1]. We extensively evaluate zUpdate both on a real network testbed and in large-scale simulations driven by the topology and traffic demand from a large production DCN.

2. DATACENTER NETWORK

Topology: A state-of-the-art DCN typically adopts a FatTree or Clos topology to attain high bisection bandwidth between servers. Figure 2(a) shows an example in which the switches are organized into three layers from top to bottom: Core, Aggregation (AGG), and Top-of-Rack (ToR). Servers are connected to the ToRs.

Forwarding and routing: In such a hierarchical network, traffic traverses a valley-free path from one ToR to another: it first goes upwards and then downwards. To limit the forwarding table size, the servers under the same ToR may share one IP prefix, and forwarding is performed based on each ToR's prefix. To fully exploit the redundant paths, each switch uses ECMP to evenly split traffic among multiple next hops. The emergence of Software Defined Networking (SDN) allows the forwarding tables of each switch to be directly controlled by a logically centralized controller, e.g., via the OpenFlow APIs, dramatically simplifying routing in DCNs.

Flow and group tables on commodity switches: An (OpenFlow) switch forwards packets by matching packet headers, e.g., source and destination IP addresses, against entries in the so-called flow table (Figure 6).
A flow entry specifies a pattern used for matching and actions taken on matching packets. To perform multipath forwarding, a flow entry can direct a matching packet to an entry in a group table, which can further direct the packet to one of its multiple next hops by hashing on the packet header. Various hash functions may be used to implement different load balancing schemes, such as ECMP or Weighted-Cost-Multi-Path (WCMP). To perform pattern matching, the flow table is made of TCAM (Ternary Content Addressable Memory), which is expensive and power-hungry. Thus, commodity switches have a limited flow table size, usually between 1K and 4K entries. The group table typically has 1K entries.

3. NETWORK UPDATE PROBLEM

- VM migration: Moving VMs among physical servers.
- Load balancer reconfiguration: Changing the mapping between a load balancer and its backend servers.
- Switch firmware upgrade: Rebooting a switch to install a new version of firmware.
- Switch failure repair: Shutting down a faulty switch to prevent failure propagation.
- New switch onboarding: Moving traffic to a new switch to test its functionality and compatibility.
Table 1: The common update scenarios in production DCNs.

We surveyed the operators of several production DCNs about the typical update scenarios and list them in Table 1. One common problem that makes these DCN updates hard is that they all have to deal with so-called network-wide traffic migration, where the forwarding rules of many flows have to be changed. For example, in a switch firmware upgrade, in order to avoid impacting the applications, operators would move all the traffic away from the target switch before performing the upgrade. Taking VM migration as another example, to relocate a group of VMs, all the traffic associated with the VMs has to be migrated as well.

[Figure 1: Transient load increase during traffic migration.]

Such network-wide traffic migration, if not done properly, could lead to severe congestion. The fundamental reason is the difficulty in synchronizing the changes to the flows from different ingress switches, causing certain links to carry significantly more traffic during the migration than before or after the migration. We illustrate this problem using a simple example in Figure 1. Flows f1 and f2 enter the network from ingress switches s1 and s2 respectively. To move the traffic distribution from the initial one in (a) to the final one in (b), we need to change the forwarding rules in both s1 and s2. As shown in (c), link l2 will carry the aggregate traffic of f1 and f2 if s1 is changed before s2. A similar problem occurs if s2 is changed first. In fact, this problem cannot be solved by the two-phase commit mechanism proposed in [17]. A modern DCN hosts many interactive applications, such as search and advertisement, which require very low latencies. Prior research [5] reported that even small losses and queuing delays could dramatically elevate the flow completion time and impair the user-perceived performance.

[Figure 2: This example shows how to perform a lossless firmware upgrade through careful traffic distribution transitions.]

Thus, it is critical to avoid congestion during DCN updates. Performing lossless network-wide traffic migration can be highly tricky in DCNs, because it often involves changes to many switches and its impact can ripple throughout the network. To avoid congestion, operators have to develop a thoughtful migration plan in which changes are made step by step and in an appropriate order. Furthermore, certain updates (e.g., VM migration) may require coordination between servers and switches. Operators thus have to carefully calibrate the impact of server changes along with that of switch changes. Finally, because each update scenario has its distinctive requirements, operators today have to create a customized migration plan for each scenario. For these reasons, network-wide traffic migration is an arduous and complicated process which can take weeks for operators to plan and execute, while some subtle yet important corner cases might still be overlooked. Thus, risk-averse operators sometimes deliberately defer an update, e.g., leaving switches running out-of-date, buggy firmware, because the potential damage from the update may outweigh the gains. Such a tendency severely hurts the efficiency and agility of the whole DCN.

Our goal is to provide a primitive called zUpdate to manage the network-wide traffic migration for all the DCN updates shown in Table 1. In our approach, operators only need to provide the end requirements of a specific DCN update, and zUpdate will automatically handle all the details, including computing a lossless (perhaps multi-step) migration plan and coordinating the changes to different switches. This dramatically simplifies the migration process and minimizes the burden placed on operators.

4. OVERVIEW

In this section, we illustrate with two examples how asynchronous switch and traffic matrix changes lead to congestion during traffic migration in a DCN and how to prevent the congestion through a carefully-designed migration plan.

Switch firmware upgrade: Figure 2(a) shows a FatTree [4] network where the capacity of each link is 1000. The numbers above the core switches and those below the ToRs are traffic demands. The number on each link is the traffic load and the arrow indicates the traffic direction. This figure shows the initial traffic distribution D0, where each switch uses ECMP to evenly split the traffic among the next hops. For example, the load on link ToR1→AGG1 is 300, half of the traffic demand ToR1→ToR2. The busiest link, CORE1→AGG3, has a load of 920, which is the sum of 620 (demand CORE1→ToR3/4), 150 (traffic AGG1→CORE1), and 150 (traffic AGG5→CORE1). No link is congested. Suppose we want to move the traffic away from AGG1 before taking it down for a firmware upgrade. A naive way is to disable link ToR1→AGG1 so that all the demand ToR1→ToR2 shifts to link ToR1→AGG2, whose load becomes 600 (as shown in Figure 2(b)). As a result, link CORE3→AGG4 will have a load of 1070 by combining 300 (traffic AGG2→CORE3), 150 (traffic AGG6→CORE3), and 620 (demand CORE3→ToR3/4), exceeding its capacity. Figure 2(c) shows the preceding congestion can be prevented through a proper traffic distribution D2, where ToR5 forwards 500 traffic on link ToR5→AGG5 and 100 traffic on link ToR5→AGG6 instead of using ECMP.
This reduces the load on link CORE3→AGG4 to 970, right below its capacity. However, to transition from the initial D0 (Figure 2(a)) to D2 (Figure 2(c)), we need to change the traffic split ratio on both ToR1 and ToR5. Since it is hard to change two switches simultaneously, we may end up with the traffic distribution shown in Figure 2(d), where link CORE1→AGG3 is congested, when ToR5 is changed before ToR1. Conversely, if ToR1 is changed first, we will have the traffic distribution shown in Figure 2(b), where link CORE3→AGG4 is congested. Given the asynchronous changes to different switches, it seems impossible to transition from D0 (Figure 2(a)) to D2 (Figure 2(c)) without causing any loss. Our basic idea is to introduce an intermediate traffic distribution D1 as a stepping stone, such that the transitions D0→D1 and D1→D2 are both lossless. Figure 2(e) shows such an intermediate D1, where ToR1 splits its traffic 200:400 and ToR5 splits its traffic 450:150. It is easy to verify that no link is congested in D1, since the busiest link CORE1→AGG3 has a load of 945. Furthermore, when transitioning from D0 (Figure 2(a)) to D1, no matter in what order ToR1 and ToR5 are changed, there will be no congestion. Figure 2(f) gives an example where ToR5 is changed before ToR1; the busiest link CORE1→AGG3 has a load of 995. Although not shown here, we verified that there is no congestion if ToR1 is changed first.

[Figure 3: This example shows how to avoid congestion by choosing the proper traffic split ratios for switches.]

Similarly, the transition from D1 (Figure 2(e)) to D2 (Figure 2(c)) is lossless regardless of the change order of ToR1 and ToR5. In this example, the key challenge is to find the appropriate D2 (Figure 2(c)) that satisfies the firmware upgrade requirement (moving traffic away from AGG1) as well as the appropriate D1 (Figure 2(e)) that bridges the lossless transitions from D0 (Figure 2(a)) to D2. We will explain how zUpdate computes the intermediate and final traffic distributions in §5.3.

Load balancer reconfiguration: Figure 3(a) shows another FatTree network where the link capacity remains 1000. One of the links, CORE2→AGG3, is down (a common incident in DCNs). Two servers, S1 and S2, are sending traffic to two services, S1 and S2, located in Container 1 and Container 2 respectively. The labels on some links, for example ToR5→AGG3, are of the form l1+l2, which indicates the traffic load towards Container 1 and Container 2 respectively. Suppose all the switches use ECMP to forward traffic. Figure 3(a) shows the traffic distribution under the initial traffic matrix T0, where the load on the busiest link CORE1→AGG1 is 920, which is the sum of 600 (the demand CORE1→ToR1/4) and 320 (the traffic on link AGG3→CORE1 towards Container 1). No link is congested. To be resilient to the failure of a single container, we now want S1 and S2 to run in both Container 1 and Container 2. For S2, we will instantiate a new server under ToR3 and reconfigure its load balancer (LB) (not shown in the figure) to shift half of its load from ToR9 to ToR3. For S1, we will take similar steps to shift half of its load from ToR3 to ToR9. Figure 3(b) shows the traffic distribution under the final traffic matrix T2 after the update. Note the label on link ToR5→AGG3: half of its traffic goes to the S1 under ToR2 in Container 1 and the other half goes to the S1 under ToR9 in Container 2. It is easy to verify that there is no congestion. However, the reconfigurations of the LBs of S1 and S2 usually cannot be done simultaneously because they reside on different devices. Such asynchrony may lead to a transient traffic matrix T1, shown in Figure 3(c), where S2's LB is reconfigured before S1's. This causes link ToR6→AGG3 to carry more traffic, half of which goes to the S2 in Container 1, and further causes congestion on link CORE1→AGG1. Although not shown here, we have verified that congestion will also happen if S1's LB is reconfigured first.

The congestion above is caused by asynchronous traffic matrix changes. Our basic idea to solve this problem is to find proper traffic split ratios for the switches such that there is no congestion under the initial, the final, or any possible transient traffic matrix during the update. Figure 3(d) shows one such solution, where ToR5 and ToR6 send 240 traffic to AGG3 and 400 traffic to AGG4, and the other switches still use ECMP. The load on the busiest link CORE1→AGG1 now becomes 960 and hence no link is congested under T1. Although not shown here, we have verified that, given such traffic split ratios, the network is congestion-free under the initial T0 (Figure 3(a)), the final T2 (Figure 3(b)), and the transient traffic matrix where S1's LB is reconfigured first. Generally, the asynchronous reconfigurations of multiple LBs could result in a large number of possible transient traffic matrices, making it hard to find proper traffic split ratios for all the switches. We will explain how to solve this problem with zUpdate in §6.2.
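The bookkeeping behind these examples is simple per-link accounting: push each flow's demand through its split ratios and compare the resulting load on every link against the link capacity. The short Python sketch below is our illustration of that accounting, not code from the paper; the topology, names, and numbers are assumed.

```python
from collections import defaultdict

def link_loads(flows):
    """flows: iterable of (demand, [(path, fraction), ...]); each path is a list
    of switches and a flow's fractions sum to 1."""
    loads = defaultdict(float)
    for demand, weighted_paths in flows:
        for path, frac in weighted_paths:
            for u, v in zip(path, path[1:]):
                loads[(u, v)] += demand * frac
    return loads

def congested_links(loads, capacity):
    # Report every link whose accumulated load exceeds its capacity.
    return {e: load for e, load in loads.items() if load > capacity[e]}

# Toy snapshot in the spirit of Figure 1: during an asynchronous change, both
# flows momentarily share the middle link, which exceeds its (assumed) capacity.
cap = defaultdict(lambda: 1000)
snapshot = [
    (600, [(["s1", "m", "d"], 1.0)]),   # f1 already moved onto the shared link
    (600, [(["s2", "m", "d"], 1.0)]),   # f2 not yet moved off the shared link
]
print(congested_links(link_loads(snapshot), cap))   # {('m', 'd'): 1200.0}
```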
The zUpdate process: We provide zUpdate(T0, D0, C) to perform lossless traffic migration for DCN updates. Given an initial traffic matrix T0, zUpdate will attempt to compute a sequence of lossless transitions from the initial traffic distribution D0 to a final traffic distribution Dn which satisfies the update requirements C.

[Figure 4: The high-level working process of zUpdate.]

Dn would then allow an update, e.g., upgrading a switch or reconfiguring an LB, to be executed without incurring any loss. Figure 4 shows the overall workflow of zUpdate. In the following, we first present a network model for describing lossless traffic distribution transitions (§5.1, §5.2) and an algorithm for computing a lossless transition plan (§5.3). We then explain how to represent the constraints in each update scenario (§6). After that, we show how to implement the transition plan on switches with limited table size (§7.1, §7.2), reduce computational complexity by confining the search space (§7.3), and reduce transition overhead by picking a proper search objective function (§7.4).

5. NETWORK MODEL

This section describes a network model under which we formally define the traffic matrix and the traffic distribution as the inputs to zUpdate. We use this model to derive the sufficient and necessary conditions for a lossless transition between two traffic distributions. In the end, we present an algorithm for computing a lossless transition plan using an optimization programming model.

Table 2: The key notations of the network model.
- V: the set of all switches.
- E: the set of all links between switches.
- G: the directed network graph G = (V, E).
- e_{v,u}: a directed link from switch v to u.
- c_{v,u}: the link capacity of e_{v,u}.
- f: a flow from an ingress to an egress switch.
- s_f: the ingress switch of f.
- d_f: the egress switch of f.
- p_f: a path taken by f from s_f to d_f.
- G_f: the subgraph formed by all the p_f's.
- T: the traffic matrix of the network.
- T_f: the flow size of f in T.
- l^f_{v,u}: the traffic load placed on e_{v,u} by f.
- D: a traffic distribution D := {l^f_{v,u} | ∀f, ∀e_{v,u} ∈ E}.
- r^f_{v,u}: a rule for f on e_{v,u}: r^f_{v,u} = l^f_{v,u} / T_f.
- R: all rules in the network: R := {r^f_{v,u} | ∀f, ∀e_{v,u} ∈ E}.
- R^f_v: rules for f on switch v: {r^f_{v,u} | ∀u : e_{v,u} ∈ E}.
- R^f: rules for f in the network: {r^f_{v,u} | ∀e_{v,u} ∈ E}.
- D(T): all feasible traffic distributions which fully deliver T.
- D_C(T): all traffic distributions in D(T) which satisfy constraints C.
- P(T): P(T) := {(D1, D2) | D1, D2 ∈ D(T), the direct transition D1 → D2 is lossless}.

5.1 Abstraction of Traffic Distribution

Network, flow and traffic matrix: A network is a directed graph G = (V, E), where V is the set of switches and E is the set of links between switches. A flow f enters the network at an ingress switch (s_f) and exits at an egress switch (d_f) through one or multiple paths (p_f). Let G_f be the subgraph formed by all the p_f's. For instance, in Figure 5, suppose f takes the two paths (1, 2, 4, 6, 8) and (1, 2, 4, 7, 8) from switch 1 to switch 8; then G_f comprises switches {1, 2, 4, 6, 7, 8} and the links between them. A traffic matrix T defines the size T_f of each flow.

Traffic distribution: Let l^f_{v,u} be f's traffic load on link e_{v,u}. We define D = {l^f_{v,u} | ∀f, ∀e_{v,u} ∈ E} as a traffic distribution, which represents each flow's load on each link. Given a T, we call D feasible if it satisfies:

∀f: Σ_{u∈V} l^f_{s_f,u} = Σ_{v∈V} l^f_{v,d_f} = T_f   (1)

∀f, ∀v ∈ V \ {s_f, d_f}: Σ_{u∈V} l^f_{v,u} = Σ_{u∈V} l^f_{u,v}   (2)

∀f, ∀e_{v,u} ∉ G_f: l^f_{v,u} = 0   (3)

∀e_{v,u} ∈ E: Σ_f l^f_{v,u} ≤ c_{v,u}   (4)

Equations (1) and (2) guarantee that all the traffic is fully delivered, (3) means a link should not carry f's traffic if it is not on the paths from s_f to d_f, and (4) means no link is congested. We denote by D(T) the set of all feasible traffic distributions under T.
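For a concrete D, the feasibility test in Equations (1)-(4) is mechanical. The following Python sketch is our illustration only; the dict-based encoding of D, T, G_f and the helper name is_feasible are assumptions, not the paper's code.

```python
def is_feasible(D, T, G_f, ingress, egress, capacity, eps=1e-6):
    """D: {f: {(v, u): load}}, T: {f: size}, G_f: {f: set of allowed links},
    ingress/egress: {f: switch}, capacity: {(v, u): c}."""
    for f, loads in D.items():
        # (1) all of f's traffic leaves the ingress and reaches the egress.
        out_src = sum(l for (v, u), l in loads.items() if v == ingress[f])
        in_dst = sum(l for (v, u), l in loads.items() if u == egress[f])
        if abs(out_src - T[f]) > eps or abs(in_dst - T[f]) > eps:
            return False
        # (2) flow conservation at every interior switch touched by f.
        interior = {x for e in loads for x in e} - {ingress[f], egress[f]}
        for w in interior:
            out_w = sum(l for (v, u), l in loads.items() if v == w)
            in_w = sum(l for (v, u), l in loads.items() if u == w)
            if abs(out_w - in_w) > eps:
                return False
        # (3) f places no load on links outside its subgraph G_f.
        if any(l > eps for e, l in loads.items() if e not in G_f[f]):
            return False
    # (4) no link carries more than its capacity.
    for e, c in capacity.items():
        if sum(D[f].get(e, 0.0) for f in D) > c + eps:
            return False
    return True
```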
Flow rule: We define a rule for a flow f on link e_{v,u} as r^f_{v,u} = l^f_{v,u} / T_f, which is essentially the value of l^f_{v,u} normalized by the flow size T_f. We also define the set of rules in the whole network as R = {r^f_{v,u} | ∀f, ∀e_{v,u} ∈ E}, the set of rules of a flow f as R^f = {r^f_{v,u} | ∀e_{v,u} ∈ E}, and the set of rules for a flow f on a switch v as R^f_v = {r^f_{v,u} | ∀u : e_{v,u} ∈ E}. Given T, we can compute the rule set R for the traffic distribution D and vice versa. We use D = T · R to denote the correspondence between D and R. We call an R feasible if its corresponding D is feasible. In practice, we install R into the switches to realize the corresponding D under T, because R is independent of the flow sizes and can be directly implemented with existing switch functions. We discuss the implementation details in §7.1.

5.2 Lossless Transition between Traffic Distributions

To transition from D1 to D2 under a given T, we need to change the corresponding rules from R1 to R2 on all the switches. A basic requirement for a lossless transition is that both D1 and D2 are feasible: D1 ∈ D(T) and D2 ∈ D(T). However, this requirement is insufficient due to asynchronous switch changes, as shown in §4. We explain this problem in more detail using the example in Figure 5, which is the subgraph G_f of a flow f from ToR 1 to ToR 8 in a small Clos network. Each of switches 1-5 has two next hops towards ToR 8. Thus l^f_{7,8} depends on f's rules on switches 1-5. When the switches are changed asynchronously, each of them could be using either old or new rules, resulting in 2^5 potential values of l^f_{7,8}. Generally, the number of potential values of l^f_{v,u} grows exponentially with the number of switches which may influence l^f_{v,u}. To guarantee a lossless transition under an arbitrary switch change order, Equation (4) must hold for any potential value of l^f_{v,u}, which is computationally infeasible to check in a large network.

To solve this state explosion problem, we leverage the two-phase commit mechanism [17] to change the rules of each flow. In the first phase, the new rules of f (R^{f,2}) are added to all the switches while f's packets, tagged with an old version number, are still processed with the old rules (R^{f,1}). In the second phase, s_f tags f's packets with a new version number, causing all the switches to process the packets carrying the new version number using R^{f,2}. To see how two-phase commit helps solve the state explosion problem, we observe that the subgraph G_f of a flow f (Figure 5) has multiple layers and the propagation delay between two adjacent layers is almost a constant. When there is no congestion, the queuing and processing delays on switches are negligibly small.

[Figure 5: Two-phase commit simplifies link load calculations.]

Suppose switch 1 flips to the new rules at time τ0; switch 4 will then receive the packets with the new version number on both of its incoming interfaces at τ0 + τ1 and flip to the new rules at that time. It will never receive a mix of packets with two different version numbers. Moreover, all the switches in the same layer will flip to the new rules simultaneously. This is illustrated in Figure 5, where switch 4 and switch 5 (in shaded boxes) have just flipped to the new rules while switch 6 and switch 7 (in unshaded boxes) are still using the old rules. Formally, we can prove:

LEMMA 1. Suppose a network uses two-phase commit to transition the traffic distribution of a flow f from D1 to D2. If G_f satisfies the following three conditions:
i) Layered structure: all switches in G_f can be partitioned into sets L_0, ..., L_m, where L_0 = {s_f}, L_m = {d_f}, and ∀e_{v,u} ∈ G_f, if v ∈ L_k then u ∈ L_{k+1}.
ii) Constant delay between adjacent layers: ∀e_{v,u} ∈ G_f, let s_{v,u} and r_{v,u} be the sending rate of v and the receiving rate of u, and let δ_{v,u} be the delay from the time when s_{v,u} changes to the time when r_{v,u} changes. Then ∀v_1, v_2 ∈ L_k, ∀e_{v_1,u_1}, e_{v_2,u_2} ∈ G_f: δ_{v_1,u_1} = δ_{v_2,u_2} = δ_k.
iii) No switch queuing or processing delay: given a switch u and e_{v,u}, e_{u,w} ∈ G_f, if the r_{v,u} on its incoming links change from l^1_{v,u} to l^2_{v,u} simultaneously, then s_{u,w} changes from l^1_{u,w} to l^2_{u,w} immediately at the same time.
then ∀e_{v,u} ∈ G_f, s_{v,u} and r_{v,u} are either l^1_{v,u} or l^2_{v,u} during the transition.

PROOF. See appendix.

Two-phase commit reduces the number of potential values of l^f_{v,u} to just two, but it does not completely solve the problem. In fact, when each f is changed asynchronously via two-phase commit, the number of potential values of Σ_f l^f_{v,u} in Equation (4) is 2^n, where n is the number of flows. To further reduce the complexity of checking (4), we introduce the following:

LEMMA 2. When each flow is changed independently, a transition from D1 to D2 is lossless if and only if:

∀e_{v,u} ∈ E: Σ_f max{l^{f,1}_{v,u}, l^{f,2}_{v,u}} ≤ c_{v,u}   (5)

PROOF. At any snapshot during the transition, for each e_{v,u} ∈ E, let F^1_{v,u} / F^2_{v,u} be the sets of flows with the old/new load values. Due to two-phase commit, F^1_{v,u} ∪ F^2_{v,u} contains all the flows on e_{v,u}.
(⇒) Construct F^1_{v,u} and F^2_{v,u} as follows: f is put into F^1_{v,u} if l^{f,1}_{v,u} ≥ l^{f,2}_{v,u}; otherwise it is put into F^2_{v,u}. Because the transition is congestion-free, we have:
Σ_{f∈F^1_{v,u}} l^{f,1}_{v,u} + Σ_{f∈F^2_{v,u}} l^{f,2}_{v,u} = Σ_f max{l^{f,1}_{v,u}, l^{f,2}_{v,u}} ≤ c_{v,u}
Hence, (5) holds.
(⇐) When (5) holds, we have:
Σ_{f∈F^1_{v,u}} l^{f,1}_{v,u} + Σ_{f∈F^2_{v,u}} l^{f,2}_{v,u} ≤ Σ_f max{l^{f,1}_{v,u}, l^{f,2}_{v,u}} ≤ c_{v,u}
Thus no link is congested at any snapshot during the transition.

Lemma 2 means we only need to check Equation (5) to ensure a lossless transition, which is now computationally feasible. Note that when the flow changes are dependent, e.g., the flows on the same ingress switch are tagged with the new version number simultaneously, (5) becomes a sufficient condition. We define P(T) as the set of all pairs of feasible traffic distributions (D1, D2) which satisfy (5) under traffic matrix T.

5.3 Computing Transition Plan

Given T0 and D0, zUpdate tries to find a feasible Dn which satisfies the constraints C and can be transitioned to from D0 without loss. The search is done by constructing an optimization programming model M. In its simplest form, M comprises Dn as the variable and two constraints: (i) Dn ∈ D_C(T0); (ii) (D0, Dn) ∈ P(T0). Note that (i) and (ii) can be represented with Equations (1)-(5). We defer the discussion of the constraints C to §6.
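Constraint (ii) reduces to the per-link check of Equation (5), which is cheap to verify once two distributions are given. The sketch below is ours, reusing the dict encoding from the earlier feasibility sketch; lossless_transition is an assumed helper name, not part of the paper's prototype.

```python
def lossless_transition(D1, D2, capacity, eps=1e-6):
    """Lemma 2 test: a transition D1 -> D2 is lossless under arbitrary per-flow
    asynchrony iff, on every link, the sum over flows of max(old, new) load
    stays within capacity. D1, D2: {f: {(v, u): load}}."""
    for e, c in capacity.items():
        worst = sum(max(D1.get(f, {}).get(e, 0.0), D2.get(f, {}).get(e, 0.0))
                    for f in set(D1) | set(D2))
        if worst > c + eps:
            return False
    return True
```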
If such a Dn is found, the problem is solved (by the definitions of D_C(T0) and P(T0) in Table 2) and a lossless transition can be performed in one step.

Algorithm 1: zUpdate(T0, D0, C)
  // D0 is the initial traffic distribution
  // If D0 satisfies the constraints C, return D0 directly
  if D0 ∈ D_C(T0) then
      return [D0];
  // The initial # of steps is 1, N is the max # of steps
  n ← 1;
  while n ≤ N do
      M ← new optimization model;
      D[0] ← D0;
      for k ← 1, 2, ..., n do
          D[k] ← new traffic distribution variable;
          M.addVariable(D[k]);
          // D[k] should be feasible under T0
          M.addConstraint(D[k] ∈ D(T0));
      for k ← 1, 2, ..., n do
          // Transition D[k-1] → D[k] is lossless
          M.addConstraint((D[k-1], D[k]) ∈ P(T0));
      // D[n] should satisfy the constraints C
      M.addConstraint(D[n] ∈ D_C(T0));
      // An objective is optional
      M.addObjective(objective);
      if M.solve() = Successful then
          return [D[1], ..., D[n]];
      n ← n + 1;
  return [];  // no solution is found

However, sometimes we cannot find a Dn which satisfies the two constraints above. When this happens, our key idea is to introduce a sequence of intermediate traffic distributions (D1, ..., D_{n-1}) to bridge the transition from D0 to Dn via n steps. Specifically,

zUpdate will attempt to find D_k (k = 1, ..., n) which satisfy: (I) Dn ∈ D_C(T0); (II) (D_{k-1}, D_k) ∈ P(T0). If such a sequence is found, it means a lossless transition from D0 to Dn can be performed in n steps. In this general form of M, the D_k (k = 1, ..., n) are the variables and (I) and (II) are the constraints.

Algorithm 1 shows the pseudocode of zUpdate(T0, D0, C). Since we do not know in advance how many steps are needed, we search from n = 1 and increment n by 1 until a solution is found or n reaches a predefined limit N. In essence, we aim to minimize the number of transition steps to save the overall transition time. Note that there may exist many solutions to M; we will show how to pick a proper objective function to reduce the transition overhead in §7.4.

6. HANDLING UPDATE SCENARIOS

In this section, we apply zUpdate to the various update scenarios listed in Table 1. Specifically, we explain how to formulate the requirements of each scenario as zUpdate's input constraints C.

6.1 Network Topology Updates

Certain update scenarios, e.g., switch firmware upgrade, switch failure repair, and new switch onboarding, involve network topology changes but no traffic matrix change. We may use zUpdate to transition from the initial traffic distribution D0 to a new traffic distribution D1 which satisfies the following requirements.

Switch firmware upgrade & switch failure repair: Before the operators shut down or reboot switches for a firmware upgrade or failure repair, they want to move all the traffic away from those switches to avoid disrupting the applications. Let U be the set of candidate switches. The preceding requirement can be represented as the following constraints C on the traffic distribution D1:

∀f, ∀u ∈ U, ∀e_{v,u} ∈ E: l^{f,1}_{v,u} = 0   (6)

which forces all the neighbor switches to stop forwarding traffic to switch u before the update.

New device onboarding: Before the operators add a new switch to the network, they want to test the functionality and performance of the new switch with some non-critical production traffic. Let u_0 be the new switch, F_test be the test flows, and G_f(u_0) be the subgraph formed by all the p_f's which traverse u_0. The preceding requirement can be represented as the following constraints C on the traffic distribution D1:

∀f ∈ F_test, ∀e_{v,u} ∉ G_f(u_0): l^{f,1}_{v,u} = 0   (7)

∀f ∉ F_test, ∀e_{v,u_0} ∈ E: l^{f,1}_{v,u_0} = 0   (8)

where (7) forces all the test flows to only use the paths through u_0, while (8) forces all the non-test flows not to traverse u_0.

Restoring ECMP: A DCN often uses ECMP in normal conditions, but WCMP during updates. After an upgrade or a test completes, operators may want to restore ECMP in the network. This can simply be represented as the following constraints C on D1:

∀f, ∀v ∈ V, ∀e_{v,u_1}, e_{v,u_2} ∈ G_f: l^{f,1}_{v,u_1} = l^{f,1}_{v,u_2}   (9)

which forces switches to evenly split f's traffic among the next hops.

6.2 Traffic Matrix Updates

Certain update scenarios, e.g., VM migration and LB reconfiguration, trigger traffic matrix changes. Let T0 and T1 be the initial and final traffic matrices. We may use zUpdate to transition from the initial traffic distribution D0 to a new D1 whose corresponding rule set R1 is feasible under T0, T1, and any possible transient traffic matrix during the update. As explained in §4, the number of possible transient traffic matrices can be enormous when many LBs (or VMs) are being updated. It is thus computationally infeasible even to enumerate all of them.
Our key idea to solve this problem is to introduce a maximum traffic matrix T_max that is larger than T0, T1, and any possible transient traffic matrix, and to only search for a D1 whose corresponding R1 is feasible under T_max. Suppose that during the update process the real traffic matrix T(t) is a function of time t. We define ∀f: T_{max,f} := sup_t(T_f(t)), where sup denotes the upper bound over time t. We derive the following:

LEMMA 3. Given a rule set R1, if T_max · R1 ∈ D(T_max), then T(t) · R1 ∈ D(T(t)).

PROOF. Because T_f(t) ≤ T_{max,f} and T_max · R1 ∈ D(T_max), for each e_{v,u} ∈ E we have:
Σ_f T_f(t) r^{f,1}_{v,u} ≤ Σ_f T_{max,f} r^{f,1}_{v,u} ≤ c_{v,u}
Hence, T(t) · R1 ∈ D(T(t)).

Lemma 3 says that if R1 is feasible under T_max, it is feasible throughout the update process. This means, before updating the traffic matrix from T0 to T1, we may use zUpdate to transition from D0 to a D1 whose corresponding R1 is feasible under T_max. This leads to the following constraints C on D1:

D1 = T0 · R1   (10)

T_max · R1 ∈ D(T_max)   (11)

Here T_max is specified by the application owners who are going to perform the update. In essence, Lemma 3 enables the operators to migrate multiple VMs in parallel, saving the overall migration time, while not incurring any congestion.

7. PRACTICAL ISSUES

In this section, we discuss several practical issues in implementing zUpdate, including switch table size, computational complexity, transition overhead, and unplanned failures and traffic matrix variations.

7.1 Implementing zUpdate on Switches

The output of zUpdate is a sequence of traffic distributions, each of which can be implemented by installing its corresponding flow table entries into the switches. Given a flow f's traffic distribution on a switch v, {l^f_{v,u}}, we compute a weight set W^f_v in which w^f_{v,u} = l^f_{v,u} / Σ_{u_i} l^f_{v,u_i}. In practice, a zUpdate flow f is the collection of all the 5-tuple flows from the same ingress switch to the same egress switch, since all these 5-tuple flows share the same set of paths. W^f_v can then be implemented on switch v by hashing each of f's 5-tuple flows onto one next hop in f's next-hop set using WCMP. As in [17], the version number used by two-phase commit can be encoded in the VLAN tag. Figure 6 shows an example of how to map the zUpdate flows to the flow and group table entries on an OpenFlow switch. The zUpdate flow (ver 2, s2, d2) maps to the switch flow table entry (vlan 2, s2, d2), which further points to group 2 in the switch group table. Group 2 implements this flow's weight set {0.25, 0.75, 1} using the SELECT group type and the WCMP hash function.
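A minimal sketch of this weight-set construction and of WCMP-style hashing is shown below. It is our illustration, not the switch or prototype implementation; the hash choice (MD5), the function names, and the 5-tuple values are assumptions. It computes w^f_{v,u} = l^f_{v,u} / Σ l^f_{v,u_i} and maps a 5-tuple onto the cumulative weight ranges.

```python
import hashlib

def weight_set(loads_on_switch):
    """loads_on_switch: {next_hop: l^f_{v,u}} for one zUpdate flow f at switch v."""
    total = sum(loads_on_switch.values())
    return {u: l / total for u, l in loads_on_switch.items()}

def wcmp_next_hop(five_tuple, weights):
    """Deterministically map a 5-tuple onto one next hop according to weights."""
    h = int(hashlib.md5(repr(five_tuple).encode()).hexdigest(), 16)
    x = (h % 10**6) / 10**6            # pseudo-uniform point in [0, 1)
    acc = 0.0
    for hop, w in sorted(weights.items()):
        acc += w
        if x < acc:
            return hop
    return hop                          # guard against rounding error

# Example: the 200:400 split of ToR1 in the intermediate distribution D1
# of Figure 2(e); the 5-tuple below is purely illustrative.
w = weight_set({"AGG1": 200, "AGG2": 400})     # {'AGG1': 1/3, 'AGG2': 2/3}
print(wcmp_next_hop(("10.0.0.1", "10.0.1.2", 6, 3345, 80), w))
```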

[Figure 6: Implementing zUpdate on an OpenFlow switch.]

7.2 Limited Flow and Group Table Size

As described in §2, a commodity switch has a limited flow table size, usually between 1K and 4K entries. However, a large DCN may have several hundred ToR switches, pushing the number of zUpdate flows beyond 100K, far exceeding the flow table size. Making things even worse, because two-phase commit is used, a flow table may hold two versions of the entry for each flow, potentially doubling the number of entries. Finally, the group table size on commodity switches also poses a challenge, since it is around 1K (sometimes smaller than the flow table size).

Our solution to this problem is motivated by one key observation: ECMP works reasonably well for most of the flows in a DCN. During a transition, there usually exist only a few bottleneck links on which congestion may arise. Such congestion can be avoided by adjusting the traffic distribution of a small number of critical flows. This allows us to significantly cut down the number of flow table entries by keeping most of the flows on ECMP.

Consolidating flow table entries: Let S be the flow table size and n be the number of ToR switches. In a flow table, we always have one wildcard entry for the destination prefix of each ToR switch, resulting in n wildcard entries in the table. Any flow that matches a wildcard entry simply uses ECMP. Figure 6 shows an example where the switch flow table has three wildcard entries for destinations d1, d2 and d3. Since the weight set of zUpdate flows (ver 1, s4, d3) and (ver 2, s4, d3) is {1, 0.5, 0.5}, they both map to one wildcard entry (*, *, d3) and use ECMP. Suppose we need to consolidate k zUpdate flows into the switch flow table. Excluding the wildcard entries, the flow table still has S - n free entries (note that S is almost certainly larger than n). Therefore, we select S - n critical flows and install a specific entry for each of them while forcing the remaining non-critical flows to use the wildcard entries (ECMP). This is illustrated in Figure 6, where the zUpdate flows (ver 1, s1, d1) and (ver 1, s3, d2) map to a specific and a wildcard entry in the switch flow table respectively. To resolve matching ambiguity in the switch flow table, a specific entry, e.g., (vlan 1, s1, d1), always has higher priority than a wildcard entry, e.g., (*, *, d1).

The remaining question is how to select the critical flows. Let D^f_v be the traffic distribution of a zUpdate flow f on switch v, and let D̂^f_v be the corresponding traffic distribution if f uses ECMP. We use δ^f_v = Σ_u |l^f_{v,u} - l̂^f_{v,u}| to quantify the penalty we pay if f is forced to use ECMP. To minimize the penalty caused by flow consolidation, we pick the top S - n flows with the largest penalty as the critical flows. In Figure 6, there are 3 critical flows whose penalty is greater than 0 and 5 non-critical flows whose penalty is 0. Because of two-phase commit, each zUpdate flow has two versions. We follow the preceding process to consolidate both versions of the flows into the switch flow table. As shown in Figure 6, the zUpdate flows (ver 1, s4, d3) and (ver 2, s4, d3) share the same wildcard entry in the switch flow table. In contrast, the zUpdate flows (ver 1, s1, d1) and (ver 2, s1, d1) map to one specific entry and one wildcard entry in the switch flow table separately.

On some switches, the group table size may not be large enough to hold the weight sets of all the critical flows. Let T be the group table size and m be the number of ECMP group entries. Because a group table must at least hold all the ECMP entries, T is almost always greater than m.
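The selection of critical flows can be summarized in a few lines. The sketch below is our rendering of the heuristic; the function names and the dict encoding are assumptions, not the prototype's code. It ranks flows by the ECMP-deviation penalty δ and keeps the top S - n within the free-entry budget.

```python
def ecmp_penalty(loads_on_switch):
    """loads_on_switch: {next_hop: desired load} for one zUpdate flow at one switch.
    The penalty is the total deviation from an even (ECMP) split."""
    total = sum(loads_on_switch.values())
    even = total / len(loads_on_switch)
    return sum(abs(l - even) for l in loads_on_switch.values())

def pick_critical_flows(per_flow_loads, table_size, num_tor_prefixes):
    """per_flow_loads: {flow: {next_hop: desired load}} at one switch.
    Returns the flows that get their own specific entries under a budget of
    S - n free slots; all other flows fall back to the wildcard ECMP entries."""
    budget = max(table_size - num_tor_prefixes, 0)
    ranked = sorted(per_flow_loads,
                    key=lambda f: ecmp_penalty(per_flow_loads[f]),
                    reverse=True)
    return set(ranked[:budget])
```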
After excluding the ECMP entries, the group table still has T - m free entries. If S - n > T - m, we follow the preceding process to select the T - m critical flows with the largest penalty and install a group entry for each of them, while forcing the remaining non-critical flows to use ECMP.

After flow consolidation, the real traffic distribution may deviate from the ideal traffic distribution computed by zUpdate. Thus, an ideal lossless transition from D0 to Dn may not be feasible due to the table size limits. To keep the no-loss guarantee, zUpdate checks the real loss of transitioning from D0 to Dn after flow consolidation and returns an empty list if loss does occur.

7.3 Reducing Computational Complexity

In §5.3, we construct an optimization programming model M to compute a lossless transition plan. Let |F|, |V|, and |E| be the numbers of flows, switches and links in the network, and let n be the number of transition steps. The total numbers of variables and constraints in M are O(n|F||E|) and O(n|F|(|V| + |E|)) respectively. In a large DCN, it could take a long time to solve M. Given a network, |V| and |E| are fixed and n is usually very small, so the key to shortening the computation time is to reduce |F|. Fortunately, in DCNs congestion usually occurs only on a small number of bottleneck links during traffic migration, and such congestion may be avoided by manipulating only the traffic distribution of the bottleneck flows that traverse those bottleneck links. Thus, our basic idea is to treat only the bottleneck flows as variables while fixing all the non-bottleneck flows as constants in M. This effectively reduces |F| to the number of bottleneck flows, which is far smaller than the total number of flows, dramatically improving the scalability of zUpdate.

Generally, without solving the (potentially expensive) M, it is difficult to know the bottleneck links precisely. To circumvent this problem, we use a simple heuristic called ECMP-Only (or ECMP-O) to roughly estimate the bottleneck links. In essence, ECMP-O mimics how operators perform traffic migration today by relying solely on ECMP. For a network topology update (§6.1), the final traffic distribution D1 must satisfy Equations (6)-(8), each of which is of the form l^{f,1}_{v,u} = 0. To meet each constraint l^{f,1}_{v,u} = 0, we simply remove the corresponding u from f's next-hop set on switch v. After that, we compute D1 by splitting each flow's traffic among its remaining next hops using ECMP.

Finally, we identify the bottleneck links as: i) the links congested during the one-step transition from D0 to D1 (violating Equation (5)); and ii) the links congested under D1 after the transition is done (violating Equation (4)). For a traffic matrix update (§6.2), ECMP-O does not perform any traffic distribution transition, and thus congestion can arise only during the traffic matrix changes. Let T_max be the maximum traffic matrix and R_ecmp be ECMP; we simply identify the bottleneck links as the links congested under D_max = T_max · R_ecmp.

7.4 Transition Overhead

To perform traffic distribution transitions, zUpdate needs to change the flow and group tables on switches. Besides guaranteeing a lossless transition, we would also like to minimize the number of table changes. Remember that under the optimization model M there may exist many possible solutions. We can favor solutions with low transition overhead by picking a proper objective function. As just discussed, the ECMP-related entries (e.g., wildcard entries) remain static in the flow and group tables. In contrast, the non-ECMP entries (e.g., specific entries) are more dynamic, since they are directly influenced by the transition plan computed by zUpdate. Hence, a simple way to reduce the number of table changes is to nudge more flows towards ECMP. This prompts us to minimize the following objective function in M:

Σ_{i=1}^{n} Σ_{f,v,u,w} |l^{f,i}_{v,u} - l^{f,i}_{v,w}|, where e_{v,u}, e_{v,w} ∈ E   (12)

in which n is the number of transition steps. Clearly, the objective value is 0 when all the flows use ECMP. One nice property of Equation (12) is its linearity: because Equations (1)-(12) are all linear (the absolute values can be linearized with standard auxiliary variables), M becomes a linear programming (LP) problem which can be solved efficiently.

7.5 Failures and Traffic Matrix Variations

It is trivial for zUpdate to handle unplanned failures during transitions. In fact, failures can be treated in the same way as switch upgrades (see §6.1) by adding the failed switches/links to the update requirements, e.g., those failed switches/links should not carry any traffic in the future. zUpdate will then attempt to re-compute a transition plan from the current traffic matrix and traffic distribution to meet the new update requirements. Handling traffic matrix variations is also quite simple. When estimating T_max, we may multiply it by an error margin η (η > 1). Lemma 3 guarantees that the transitions are lossless so long as the real traffic matrix T ≤ η T_max.

8. IMPLEMENTATION

[Figure 7: zUpdate's prototype implementation.]

Figure 7 shows the key components and workflow of zUpdate. When an operator wants to perform a DCN update, she submits a request containing the update requirements to the update scenario translator, which converts the operator's request into the formal update constraints (§6). The zUpdate engine takes the update constraints together with the current network topology, traffic matrix, and flow rules, and attempts to produce a lossless transition plan (§5, §7.3 & §7.4). If such a plan cannot be found, it notifies the operator, who may decide to revise or postpone the update. Otherwise, the transition plan translator converts the transition plan into the corresponding flow rules (§7.1 & §7.2). Finally, the OpenFlow controller pushes the flow rules into the switches. The zUpdate engine and the update scenario translator are implemented in C#, with Mosek [2] as the linear programming solver. The transition plan translator is written in 500+ lines of Python code. We use Floodlight 0.9 [1] as the OpenFlow controller and commodity switches which support OpenFlow 1.0 [3].
Given that WCMP is not available in OpenFlow 1.0, we emulate WCMP as follows: given the weight set of a zUpdate flow f at switch v, for each constituent 5-tuple flow ξ in f, we first compute the next hop u of ξ according to WCMP hashing and then insert into v a rule for ξ with u as the next hop.

9. EVALUATIONS

In this section, we show that zUpdate can effectively perform congestion-free traffic migration using both testbed experiments and large-scale simulations. Compared to alternative traffic migration approaches, zUpdate not only prevents loss but also reduces the transition time and transition overhead.

9.1 Experimental Methodology

Testbed experiments: Our testbed experiments run on a FatTree network with 4 CORE switches and 3 containers, as illustrated in Figure 3. (Note that there are 2 additional ToRs connected to AGG3,4 which are not shown in the figure because they do not send or receive any traffic.) All the switches support OpenFlow 1.0 with 10Gbps link speed. A commercial traffic generator is connected to all the ToRs and COREs to inject 5-tuple flows at a pre-configured constant bit rate.

Large-scale simulations: Our simulations are based on a production DCN with hundreds of switches and tens of thousands of servers. The flow and group table sizes are 750 and 1,000 entries respectively, matching the numbers of the commodity switches used in the DCN. To obtain the traffic matrices, we log the socket events on all the servers and aggregate the logs into ingress-to-egress flows over 10-minute intervals. A traffic matrix is comprised of all the ingress-to-egress flows in one interval. From the 144 traffic matrices in a typical working day, we pick 3 traffic matrices that correspond to the minimum, median and maximum network-wide traffic loads respectively. The simulations run on a commodity server with a quad-core Intel Xeon 2.3GHz CPU and 48GB RAM.

Alternative approaches: We compare zUpdate with three alternative approaches: (1) zUpdate-One-step (zUpdate-O): it uses zUpdate to compute the final traffic distribution and then jumps from the initial traffic distribution to the final one directly, omitting all the intermediate steps. (2) ECMP-O (defined in §7.3). (3) ECMP-Planned (ECMP-P): for a traffic matrix update, ECMP-P does not perform any traffic distribution transition (like ECMP-O). For a network topology update, ECMP-P has the same final traffic distribution as ECMP-O. Their only difference is that, when there are k ingress-to-egress flows to be migrated from the initial traffic distribution to the final traffic distribution, ECMP-O migrates all the k flows in one step while ECMP-P migrates only one flow in each

[Figure 8: The link utilization of the two busiest links in the switch upgrade example, under (a) ECMP-O, (b) zUpdate-O, and (c) zUpdate.]

[Figure 9: The link utilization of the two busiest links in the LB reconfiguration example, under (a) ECMP and (b) zUpdate.]

step, resulting in k! candidate migration sequences. In our simulations, ECMP-P evaluates 1,000 randomly-chosen candidate migration sequences and uses the one with the minimum losses. In essence, ECMP-P mimics how today's operators sequentially migrate multiple flows in a DCN.

Performance metrics: We use the following metrics to compare the different approaches. (1) Link utilization: the ratio between the link load and the link capacity. For ease of presentation, we represent link congestion as a link utilization value higher than 100%. (2) Post-transition loss (Post-TrLoss): the maximum link loss rate after reaching the final traffic distribution. (3) Transition loss (TrLoss): the maximum link loss rate under all the possible ingress-to-egress flow migration sequences during traffic distribution transitions. (4) Number of steps: the whole traffic migration process can be divided into multiple steps. The flow migrations within the same step are done in parallel, while the flow migrations of the next step cannot start until the flow migrations of the current step complete. This metric reflects how long the traffic migration process will take. (5) Switch touch times (STT): the total number of times the switches are reconfigured during a traffic migration. This metric reflects the transition overhead.

Calculating T_max: T_max includes two components: T_b and T^max_app. T_b is the background traffic, which is independent of the application being updated. T^max_app is the maximum traffic matrix comprised of only the ingress-to-egress flows (f_app's) related to the applications being updated. We calculate T^max_app as follows: for each f_app, the size of f_app in T^max_app is the largest size that f_app can possibly reach during the entire traffic matrix update process.

9.2 Testbed Experiments

We now conduct testbed experiments to reproduce the two traffic migration examples described in §4.

Switch upgrade: Figure 8 shows the real-time utilization of the two busiest links, CORE1→AGG3 and CORE3→AGG4, in the switch upgrade example (Figure 2). Figure 8(a) shows the transition process from Figure 2a (0s-6s) to Figure 2b (6s-14s) under ECMP-O. The two links initially carry the same amount of traffic. At 6s, ToR1→AGG1 is deactivated, triggering traffic loss on CORE3→AGG4 (Figure 2b). The congestion lasts until 14s, when ToR1→AGG1 is restored. Note that we deliberately shortened the switch upgrade period for illustration purposes. In addition, because only one ingress-to-egress flow (ToR1→ToR2) needs to be migrated, ECMP-P is the same as ECMP-O. Figure 8(b) shows the transition process from Figure 2a (0s-6s) to Figure 2d (6s-8s) to Figure 2c (8s-16s) under zUpdate-O. At 6s-8s, ToR5 and ToR1 are changed asynchronously, leading to a transient congestion on CORE1→AGG3 (Figure 2d). After ToR1,5 are changed, the upgrade of AGG1 is congestion-free at 8s-16s (Figure 2c). Once the upgrade of AGG1 completes at 16s, the network is restored back to ECMP.
Again because of the asynchronous changes to ToR5 and ToR1, another transient congestion happens on CORE3→AGG4 at 16s-18s. Figure 8(c) shows the transition process from Figure 2a (0s-6s) to Figure 2e (8s-16s) to Figure 2c (18s-26s) under zUpdate. Due to the introduction of an intermediate traffic distribution between 8s-16s (Figure 2e), the transition process is lossless despite the asynchronous switch changes at 6s-8s and 16s-18s.

LB reconfiguration: Figure 9 shows the real-time utilization of the two busiest links, CORE1→AGG1 and CORE3→AGG6, in the LB reconfiguration example (Figure 3). Figure 9(a) shows the migration process from Figure 3a (0s-6s) to Figure 3c (6s-14s) to Figure 3b (after 14s) under ECMP. At 6s-14s, S2's LB and S1's LB are reconfigured asynchronously, causing congestion on CORE1→AGG1 (Figure 3c). After both LBs are reconfigured at 14s, the network is congestion-free (Figure 3b). Figure 9(b) shows the migration process from Figure 3a (0s-6s) to Figure 3d (10s-18s) to Figure 3b (after 22s) under zUpdate. By changing the traffic split ratios on ToR5 and ToR6 at 6s-8s, zUpdate ensures the network is congestion-free even though S2's LB and S1's LB are reconfigured asynchronously at 10s-18s. Once the LB reconfiguration completes at 18s, the traffic split ratios on ToR5 and ToR6 are restored to ECMP at 20s-22s. Note that zUpdate is the same as zUpdate-O in this experiment, because there is no intermediate step in the traffic distribution transition.

9.3 Large-Scale Simulations

We run large-scale simulations to study how zUpdate enables lossless switch onboarding and VM migration in a production DCN.

Switch onboarding: In this experiment, a new CORE switch is initially connected to each container but carries no traffic. We then randomly select 1% of the ingress-to-egress flows, as test flows, to traverse the new CORE switch for testing. Figure 10(a) compares the different migration approaches under the median network-wide traffic load. The y-axis on the left is the traffic loss rate and the y-axis on the right is the number of steps. zUpdate attains zero loss by taking 2 transition steps. Although not shown in the figure, our flow consolidation heuristic (§7.2) successfully fits a large number of ingress-to-egress flows into the limited switch flow and group tables.

[Figure 10: Comparison of different migration approaches for (a) switch onboarding and (b) VM migration: transition loss, post-transition loss, and number of steps.]

[Figure 11: Why congestion occurs in switch onboarding.]

zUpdate-O has no post-transition loss but 8% transition loss because it takes just one transition step. ECMP-O incurs 7% transition loss and 3.5% post-transition loss. This is a bit counter-intuitive because the overall network capacity actually increases with the new switch. We explain this phenomenon with a simple example in Figure 11. Suppose there are 7 ingress-to-egress flows to ToR1, each of which is 2Gbps, and the link capacity is 10Gbps. Figure 11(a) shows the initial traffic distribution under ECMP, where each downward link to ToR1 carries 7Gbps of traffic. In Figure 11(b), 4 out of the 7 flows are selected as the test flows and are moved to the new CORE3. Thus, CORE3→AGG2 has 8Gbps of traffic (the 4 test flows) and CORE2→AGG2 has 3Gbps of traffic (half of the 3 non-test flows due to ECMP). This in turn overloads AGG2→ToR1 with 11Gbps of traffic. Figure 11(c) shows how zUpdate avoids the congestion by moving 2 non-test flows away from CORE2→AGG2 to AGG1→ToR1. This leaves only 1Gbps of traffic (half of the remaining non-test flow) on CORE2→AGG2 and reduces the load on AGG2→ToR1 to 9Gbps.

ECMP-P has smaller transition loss (4%) than ECMP-O because ECMP-P attempts to use a flow migration sequence that incurs the minimum loss. They have the same post-transition loss because their final traffic distributions are the same. Compared to zUpdate, ECMP-P has significantly higher loss although it takes hundreds of transition steps (which also implies a much longer transition period).

VM migration: In this experiment, we migrate a group of VMs from one ToR to another ToR in two different containers. During the live migration, the old and new VMs establish tunnels to synchronize data and running states [6, 13]. The total traffic rate of the tunnels is 6Gbps. Figure 10(b) compares the different migration approaches under the median network-wide traffic load. zUpdate takes 2 steps to reach a traffic distribution that can accommodate the large volume of tunneling traffic and the varying traffic matrices during the live migration. Hence, it does not incur any loss. In contrast, zUpdate-O has 4.5% transition loss because it skips the intermediate step taken by zUpdate. We combine ECMP-O and ECMP-P into ECMP because they are the same for traffic matrix updates (§9.1). ECMP's post-transition loss is large (7.4%) because it cannot handle the large volume of tunneling traffic during the live migration.

Impact of traffic load: We re-run the switch onboarding experiment under the minimum, median, and maximum network-wide traffic loads. In Figure 12, we omit the loss of zUpdate and the post-transition loss of zUpdate-O, since all of them are 0.

[Figure 12: Comparison under different traffic loads (Min-Load, Med-Load, Max-Load): transition loss of zUpdate-O, ECMP-P, and ECMP-O, post-transition loss of ECMP, and number of steps of zUpdate.]

We observe that only zUpdate can attain zero loss under different levels of traffic load. Surprisingly, the transition loss of zUpdate-O and ECMP-O is actually higher under the minimum load than under the median load.
VM migration: In this experiment, we migrate a group of VMs from one ToR to another ToR in two different containers. During the live migration, the old and new VMs establish tunnels to synchronize data and running states [6, 13]. The total traffic rate of the tunnels is 6Gbps. Figure 10(b) compares the different migration approaches under the median network-wide traffic load. zUpdate takes 2 steps to reach a traffic distribution that can accommodate the large volume of tunneling traffic and the varying traffic matrices during the live migration. Hence, it does not incur any loss. In contrast, zUpdate-O has 4.5% transition loss because it skips the intermediate step taken by zUpdate. We combine ECMP-O and ECMP-P into ECMP because they are the same for traffic matrix updates (§9.1). ECMP's post-transition loss is large (7.4%) because it cannot handle the large volume of tunneling traffic during the live migration.

Impact of traffic load: We re-run the switch onboarding experiment under the minimum, median, and maximum network-wide traffic loads. In Figure 12, we omit the loss of zUpdate and the post-transition loss of zUpdate-O, since all of them are 0.

Figure 12: Comparison under different traffic loads (Min-Load, Med-Load, Max-Load): transition loss of zUpdate-O, ECMP-P, and ECMP-O; post-transition loss of ECMP; and zUpdate's number of steps.

We observe that only zUpdate can attain zero loss under different levels of traffic load. Surprisingly, the transition loss of zUpdate-O and ECMP-O is actually higher under the minimum load than under the median load. This is because the traffic loss is determined to a large extent by a few bottleneck links. Hence, without careful planning, it is risky to perform network-wide traffic migration even during off-peak hours. Figure 12 also shows that zUpdate takes more transition steps as the network-wide traffic load grows. This is because when the traffic load is higher, it is more difficult for zUpdate to find the spare bandwidth to accommodate the temporary link load increase during transitions.

Transition overhead: Table 3 shows the number of switch touch times (STT) of the different migration approaches in the switch onboarding experiment. Compared to the STT of zUpdate-O and ECMP-O, the STT of zUpdate is double because it takes two steps instead of one. However, this also indicates that zUpdate touches at most 68 switches, which represent a small fraction of the several hundred switches in the DCN. This can be attributed to the heuristics in §7.3 and §7.4, which restrict the number of flows to be migrated. ECMP-P has a much larger STT than the other approaches because it takes a lot more transition steps.

Table 3: Comparison of transition overhead (STT for zUpdate, zUpdate-O, ECMP-O, and ECMP-P).

The computation time of zUpdate is reasonably small for performing traffic migration in large DCNs. In fact, the running time is below 1 minute for all the experiments except the maximum traffic load case in Figure 12, where it takes 2.5 minutes to compute a 4-step transition plan. This is because of the heuristic in §7.3, which ties the computation complexity to the number of bottleneck flows rather than the total number of flows, effectively reducing the number of variables by at least two orders of magnitude.
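The variable-reduction point can be pictured as follows: only flows that traverse a link whose worst-case load during the transition exceeds capacity need fresh split variables, while all other flows keep their current distribution. The sketch below illustrates that selection step under this simplifying assumption; the function bottleneck_flows and its inputs are illustrative, and the actual heuristic in §7.3 may differ in its details.

def bottleneck_flows(flow_links, link_load, capacity):
    # flow_links: dict flow_id -> set of links the flow traverses.
    # link_load:  dict link -> worst-case load (Gbps) during the transition.
    # capacity:   dict link -> link capacity (Gbps).
    # A flow is a "bottleneck flow" if it crosses at least one overloaded
    # link; only such flows receive new split-ratio variables, which is what
    # shrinks the optimization by orders of magnitude.
    overloaded = {l for l, load in link_load.items() if load > capacity[l]}
    return {f for f, links in flow_links.items() if links & overloaded}

# Example: two flows, one of which crosses a link that would be overloaded.
flows = {"f1": {"CORE2->AGG2", "AGG2->ToR1"},
         "f2": {"CORE1->AGG1", "AGG1->ToR1"}}
loads = {"CORE2->AGG2": 6, "AGG2->ToR1": 11, "CORE1->AGG1": 5, "AGG1->ToR1": 7}
caps = {link: 10 for link in loads}
print(bottleneck_flows(flows, loads, caps))   # {'f1'}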

10. RELATED WORK

Congestion during update: Several recent papers focus on preventing congestion during a specific type of update. Raza et al. [16] study the problem of how to schedule link weight changes during IGP migrations. Ghorbani et al. [9] attempt to find a congestion-free VM migration sequence. In contrast, our work provides one primitive for a variety of update scenarios. Another key difference is that they do not consider the transient congestion caused by asynchronous traffic matrix or switch changes, since they assume there is only one link weight change or one VM being migrated at a time.

Routing consistency: There is a rich body of work on preventing transient misbehaviors during routing protocol updates. Vanbever et al. [18] and Francois et al. [8] seek to guarantee no forwarding loops during IGP migrations and link metric reconfigurations. Consensus routing [10] is a policy routing protocol aiming at eliminating transient problems during BGP convergence. The work above emphasizes routing consistency rather than congestion.

Several tools have been created to statically check the correctness of network configurations. Feamster et al. [7] built a tool to detect errors in BGP configurations. Header Space Analysis (HSA) [12] and Anteater [15] can check a few useful network invariants, such as reachability and the absence of loops, in the forwarding plane. Built on this earlier work, VeriFlow [14] and real-time HSA [11] have been developed to check network invariants on the fly.

11. CONCLUSION

We have introduced zUpdate for performing congestion-free traffic migration in DCNs in the presence of asynchronous switch and traffic matrix changes. The core of zUpdate is an optimization programming model that enables lossless transitions from an initial traffic distribution to a final traffic distribution that meets the predefined update requirements. We have built a zUpdate prototype on top of OpenFlow switches and the Floodlight controller and demonstrated its capability in handling a variety of representative DCN update scenarios using both testbed experiments and large-scale simulations. zUpdate, in its current form, works only for hierarchical DCN topologies such as FatTree and Clos. We plan to extend zUpdate to support a wider range of network topologies in the future.

12. ACKNOWLEDGMENTS

We thank our shepherd, Nick Feamster, and the anonymous reviewers for their valuable feedback on improving this paper.

13. REFERENCES

[1] Floodlight.
[2] MOSEK.
[3] OpenFlow 1.0. openflow-spec-v1.0.0.pdf.
[4] M. Al-Fares, A. Loukissas, and A. Vahdat. A Scalable, Commodity Data Center Network Architecture. In SIGCOMM '08.
[5] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan. Data Center TCP (DCTCP). In SIGCOMM '10.
[6] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield. Live Migration of Virtual Machines. In NSDI '05.
[7] N. Feamster and H. Balakrishnan. Detecting BGP Configuration Faults with Static Analysis. In NSDI '05.
[8] P. Francois, O. Bonaventure, B. Decraene, and P. A. Coste. Avoiding Disruptions During Maintenance Operations on BGP Sessions. IEEE Trans. on Netw. and Serv. Manag.
[9] S. Ghorbani and M. Caesar. Walk the Line: Consistent Network Updates with Bandwidth Guarantees. In HotSDN '12.
[10] J. P. John, E. Katz-Bassett, A. Krishnamurthy, T. Anderson, and A. Venkataramani. Consensus Routing: the Internet as a Distributed System. In NSDI '08.
[11] P. Kazemian, M. Chang, H. Zeng, G. Varghese, N. McKeown, and S. Whyte. Real Time Network Policy Checking Using Header Space Analysis. In NSDI '13.
[12] P. Kazemian, G. Varghese, and N. McKeown. Header Space Analysis: Static Checking for Networks. In NSDI '12.
[13] E. Keller, S. Ghorbani, M. Caesar, and J. Rexford. Live Migration of an Entire Network (and its Hosts). In HotNets '12.
[14] A. Khurshid, W. Zhou, M. Caesar, and P. B. Godfrey. VeriFlow: Verifying Network-Wide Invariants in Real Time. In HotSDN '12.
[15] H. Mai, A. Khurshid, R. Agarwal, M. Caesar, P. B. Godfrey, and S. T. King. Debugging the Data Plane with Anteater. In SIGCOMM '11.
[16] S. Raza, Y. Zhu, and C.-N. Chuah. Graceful Network State Migrations. IEEE/ACM Transactions on Networking, 2011.
[17] M. Reitblatt, N. Foster, J. Rexford, C. Schlesinger, and D. Walker. Abstractions for Network Update. In SIGCOMM '12.
[18] L. Vanbever, S. Vissicchio, C. Pelsser, P. Francois, and O. Bonaventure. Seamless Network-Wide IGP Migrations. In SIGCOMM '11.
[19] X. Wu, D. Turner, C.-C. Chen, D. A. Maltz, X. Yang, L. Yuan, and M. Zhang. NetPilot: Automating Datacenter Network Failure Mitigation. In SIGCOMM '12.

APPENDIX

Proof of Lemma 1: Assume that at time t = τ_k, flow f's traffic distribution is:
- ∀v ∈ L_i (i < k), ∀e_{v,u} ∈ G_f: s_{v,u} = r_{v,u} = l^2_{v,u}.
- ∀v ∈ L_i (i > k), ∀e_{v,u} ∈ G_f: s_{v,u} = r_{v,u} = l^1_{v,u}.
- ∀v ∈ L_k, ∀e_{v,u} ∈ G_f: all the s_{v,u}'s are changing from l^1_{v,u} to l^2_{v,u} simultaneously, but all the r_{v,u}'s are still l^1_{v,u}.

Consider u ∈ L_{k+1} and ∀e_{v,u}, e_{u,w} ∈ G_f. According to Condition ii), during τ_k ≤ t < τ_{k+1} = τ_k + δ_k, all the r_{v,u}'s remain l^1_{v,u}. Therefore, f's traffic distribution is:
- ∀v ∈ L_i (i < k), ∀e_{v,u} ∈ G_f: s_{v,u} = r_{v,u} = l^2_{v,u}, because nothing has changed on these links.
- ∀v ∈ L_i (i > k), ∀e_{v,u} ∈ G_f: s_{v,u} = r_{v,u} = l^1_{v,u}, because nothing has changed on these links.
- ∀v ∈ L_k, ∀e_{v,u} ∈ G_f: s_{v,u} = l^2_{v,u} and r_{v,u} = l^1_{v,u}, since the rate change at the sending end has not yet reached the receiving end due to the link delay.

At t = τ_{k+1}, ∀u ∈ L_{k+1}, all the r_{v,u}'s change from l^1_{v,u} to l^2_{v,u} simultaneously. According to Condition iii), all the s_{u,w}'s also change from l^1_{u,w} to l^2_{u,w} at the same time. Thus, at t = τ_{k+1}:
- ∀v ∈ L_i (i < k+1), ∀e_{v,u} ∈ G_f: s_{v,u} = r_{v,u} = l^2_{v,u}.
- ∀v ∈ L_i (i > k+1), ∀e_{v,u} ∈ G_f: s_{v,u} = r_{v,u} = l^1_{v,u}.
- ∀v ∈ L_{k+1}, ∀e_{v,u} ∈ G_f: all the s_{v,u}'s are changing from l^1_{v,u} to l^2_{v,u} simultaneously, but all the r_{v,u}'s are still l^1_{v,u}.

At the beginning of the transition, t = τ_0, s_f, the only switch in L_0, starts to tag f's packets with the new version number, causing all of its outgoing links to change from the old sending rates to the new sending rates simultaneously. Hence, we have:
- ∀v ∈ L_i (i > 0), ∀e_{v,u} ∈ G_f: s_{v,u} = r_{v,u} = l^1_{v,u}.
- ∀v ∈ L_0, ∀e_{v,u} ∈ G_f: all the s_{v,u}'s are changing from l^1_{v,u} to l^2_{v,u} simultaneously, but all the r_{v,u}'s are still l^1_{v,u}.
which matches the assumption above with k = 0. Since we have shown that if the assumption holds at t = τ_k it also holds at t = τ_{k+1}, by induction it holds throughout the transition process. Because in each duration [τ_k, τ_{k+1}), ∀e_{v,u} ∈ G_f, s_{v,u} and r_{v,u} are each either l^1_{v,u} or l^2_{v,u}, the proof completes.
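For reference, the induction invariant established above can be stated compactly in the proof's own notation (a restatement only; l^1_{v,u} and l^2_{v,u} denote the old and new rates of link e_{v,u}):

\[
\text{At } t=\tau_k,\ \forall\, e_{v,u}\in G_f \text{ with } v\in L_i:\quad
s_{v,u}=
\begin{cases}
l^2_{v,u} & i<k,\\
l^1_{v,u}\ \text{or}\ l^2_{v,u} & i=k,\\
l^1_{v,u} & i>k,
\end{cases}
\qquad
r_{v,u}=
\begin{cases}
l^2_{v,u} & i<k,\\
l^1_{v,u} & i\ge k.
\end{cases}
\]

In every interval [τ_k, τ_{k+1}), each s_{v,u} and r_{v,u} therefore equals either l^1_{v,u} or l^2_{v,u}, which is the property the proof establishes.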
