IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 31, NO. 12, DECEMBER 2013 1

IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 31, NO. 12, DECEMBER 2013 1 Scaabe Muti-Cass Traffic Management in Data Center Backbone Networks Amitabha Ghosh, Sangtae Ha, Edward Crabbe, and Jennifer Rexford Abstract Large onine service providers (OSPs) often buid private backbone networks to interconnect data centers in mutipe ocations. These data centers house numerous appications that produce mutipe casses of traffic with diverse performance objectives. Appications in the same cass may aso have differences in reative importance to the OSP s core business. By controing both the hosts and the routers, an OSP can perform both appication rate-contro and network routing. However, centraized management of both rates and routes does not scae due to excessive message-passing between the hosts, routers, and management systems. Simiary, fuy-distributed approaches do not scae and converge sowy. To overcome these issues, we investigate two semi-centraized designs that ie at practica points aong the spectrum between fuy-distributed and fuy-centraized soutions. We achieve scaabiity by distributing computation across mutipe tiers of an optimization machinery. Our first design uses two tiers, representing the backbone and casses, to compute cass-eve ink bandwidths and appication sending rates. Our second design has an additiona tier representing individua data centers. Using optimization, we show that both designs provaby maximize the aggregate utiity over a traffic casses. Simuations on reaistic backbones show that the 3-tier design is more scaabe, but converges sower than the 2-tier design. Index Terms Traffic management; data centers; optimization; scaabiity; mutipe traffic casses. I. INTRODUCTION A. Wide-Area Traffic Management for OSPs RECENT years have seen an unprecedented growth in the number of data centers being depoyed by arge OSPs [1] [5]. These data centers store massive amounts of data and host a variety of services. Furthermore, to improve performance and reiabiity, mutipe data centers are depoyed to cover arge geographica regions with high-speed backbone networks interconnecting them [6], [7]. These backbones are often owned by the same OSP; for instance, Googe, Yahoo!, and Microsoft interconnect their mutipe arge-scae data centers with their own private backbones [8]. As the backbones themseves represent a substantia investment, it is highy desirabe that they be used efficienty and effectivey, whie simutaneousy respecting the characteristics of the traffic they carry. Manuscript received December 15, 2012; revised Juy 31, 2013. A. Ghosh is with the Networking Division at UtopiaCompression, and did this work as a post-doctora research associate at Princeton University (e-mai: amitabhg@utopiacompression.com). S. Ha is with the Eectrica Engineering Department at Princeton University (e-mai: sangtaeh@princeton.edu). E. Crabbe is with Googe (e-mai: edc@googe.com). J. Rexford is with the Computer Science Department at Princeton University (e-mai: jrex@cs.princeton.edu). Digita Object Identifier 10.1109/JSAC.2013.1312xx. 0733-8716/13/$31.00 c 2013 IEEE The types of services and appications hosted in these OSP networks can be quite diverse. In some cases, the same OSP which runs the backbone network aso contros the traffic sources. The appications can range from back-end services, such as background computation, search indexing, and data repication, to cient-triggered front-end services for creating end-user content [9], such as web search, onine gaming, and ive video streaming. Each of these appications can beong to a different cass of traffic that has its own performance objectives. For exampe, interactive appications are ikey to be deay-sensitive, whereas back-end appications can be throughput-sensitive. Appications that beong to the same traffic cass may aso have differences in reative importance to the OSP s core business. Despite many chaenges, these OSP network characteristics provide new opportunities for wide-area traffic management. In traditiona traffic engineering (TE), an ISP contros backbone routing and ink scheduing, but does not contro how end hosts shoud perform congestion contro. In contrast, an OSP contros both the hosts and the routers, offering the fexibiity to jointy optimize rate contro, backbone routing, and ink scheduing. Not knowing the end-to-end performance objectives of the appications, ISPs typicay focus on optimizing indirect measures of performance, such as minimizing congestion. In contrast, an OSP can maximize the aggregate performance across a appications, with a different utiity function for each traffic cass. Centraized traffic engineering is increasingy viewed as a viabe approach for wide-area networks. For instance, Googe has depoyed an OpenFow-enabed Software Defined Networking (SDN) soution for centraized TE in their G- scae interna backbone for interconnecting mutipe data centers [10]. Centraized TE soutions have the potentia to be agorithmicay simper, faster, scaabe, and more efficient. However, they are not scaabe for joint rate and route contro across mutipe casses, due to the need for excessive messagepassing between hosts, routers, and management systems. Fuy-distributed soutions, ikewise, are aso not scaabe and exhibit sow convergence. B. Semi-Centraized Scaabe Design Choices In this paper, we expore scaabe architectures that jointy optimize rate contro, routing, and ink scheduing. On one hand, our design choices are motivated by the advantages of centraized TE and its industry adoption. On the other hand, since we aso wish to perform rate contro, our approach distributes information and computation across mutipe tiers of an optimization machinery. To this end, we examine two

2 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 31, NO. 12, DECEMBER 2013 semi-centraized designs that both use a sma number of management entities to optimay aocate resources, but differ in their degree of distributedness.thefirstdesignhastwotiers with the management entities at the backbone and cass eves, whereas the second design has three tiers with an additiona management entity at the data center eve. The entities exchange information with each other to optimay subdivide network bandwidth between appications of different casses. Using optimization theory, we show that both our designs provaby maximize the aggregate utiity over a appications and traffic casses. We summarize our two designs beow: A2-TierDesign:The 2-tier design has a singe management entity caed ink coordinator (LC) on the first tier, and mutipe management entities caed cass aocators (CAs) on the second tier. The LC computes cass-eve aggregate bandwidth for every ink in the backbone and sends it to the CAs. Each CA then subdivides this bandwidth between different appications in its own cass. Using prima decomposition, we show that this 2-tier design not ony can compute optima routing paths, but aso map appication-eve sending rates on these paths. Simuations on reaistic backbone topoogies show that the system converges quicky (within a few tens of iterations for suitabe choices of step sizes), though at the expense of moderate amount of message-passing between the management entities. A 3-Tier Design: The 3-tier design has an additiona type of management entity caed data center aocator (DA) on the third tier. The LC, as before, computes cass-eve aggregate bandwidth and sends it to the CAs. Each CA, however, now subdivides this bandwidth not between different appications ike in the 2-tier design, but across mutipe data centers hosting traffic in its own cass. The task of computing appication-eve sending rates is deegated to the DAs. We use atwo-eveprimadecompositiontoproveoptimaityofthis3- tier design. Simuations show that the system exchanges fewer messages compared to the 2-tier design, but at the expense of sighty sower convergence. The rest of the paper is organized as foows: In Section II, we present the network and data center traffic modes, and formuate the muti-cass backbone traffic management probem. In Section III, we describe the componentsandfunctionaities of the two semi-centraized designs. In Section IV, we prove the optimaity of these designs using the theory of optimization decomposition. In Section V, we compare the two designs in terms of their convergence behavior and the number of messages exchanged unti convergence. We present reated work in Section VI, and finay concude in Section VII. II. MULTI-CLASS TRAFFIC-MANAGEMENT PROBLEM In this section, we first describe our inter-data center traffic mode and backbone network mode, and then formuate the muti-cass traffic-management probem. We aso introduce our key notations, as summarized in Tabe I. A. Inter-Data Center Traffic Mode 1) Traffic Casses and Fows: We consider a set J of data centers, geographicay-distributed and interconnected over a backbone network. The backbone carries inter-data center traffic fowing between different sourceanddestinationhosts ocated in the data centers. This traffic can be categorized into a set K of distinct casses based on its performance objectives. For instance, some of this traffic can beong to athroughput-sensitivecass,sometoadeay-sensitivecass, and some can have varying degrees of both throughput and deay requirements. Typicay, however, the network wi have just a few traffic casses. We index the data centers by j and the traffic casses by k. The cardinaities of the sets J and K are denoted by J and K, respectivey. Traffic within a given cass can beong to severa appications, each of which, in turn, may produce a arge number of fows. For exampe, deay-sensitive traffic may incude appications such as video streaming, Voice-over-IP (VoIP), and mutimedia conferencing, and each of these can have mutipe fows between different source-destination pairs. For simpicity of exposition, however, we do not differentiate between mutipe appications within a cass, but, instead, refer to the traffic between a given source-destination pair in a particuar direction as a fow, regardessoftheappicationto which it beongs. For instance, a fow can be at the granuarity of an individua UDP session, or at an aggregate eve between two end hosts or subnets in a particuar direction. We index the fows by s, anddenotebyf the set of a fows. We use F k to denote the set of fows that beongs to a particuar cass k. 2) Utiity Functions and Weights: We characterize the performance objective of each cass k by a utiity function U k ( ), parameterized by two positive coefficients a k and b k.thus, a the fows of a given cass have the same utiity function. We design this cass-wise utiity function as having two parts in it, f k ( ) and g k ( ), whicharemutipiedbythecoefficients a k and b k,respectivey,toincorporatedifferentdegreesof sensitivity to throughput and deay. We mode throughputsensitivity by making the first part, f k ( ), dependentonthe tota sending rate, and mode deay-sensitivity by making the second part, g k ( ), dependentontheaverageend-to-enddeay. The arguments of these functions are described ater in this section. The two coefficients can take different vaues to mode different types of traffic. For instance, a fixed rate traffic with deay requirement (e.g., VoIP) can be modeed using a k =0and b k > 0, whereasavariaberatetrafficwithdeay requirement (e.g., mutimedia streaming) can be modeed using both a k > 0 and b k > 0. Likewise,avariaberate traffic with no deay requirement (e.g., database backup or fie-downoading) can use a k > 0 and b k =0. Modeing Fow Importance: Despite a the fows of a given cass sharing the same utiity function, the individua fows may have differences in reative importance. For exampe, two database back-up fows that beong to a throughputsensitive cass, or two video streaming fows that beong to a deay-sensitive cass, may have different business importance. We mode this reative importance by assigning a positive weight to each fow, denoted by w k s for fow s of cass k. The weighted utiity of fow s of cass k can now be written

GHOSH et a.: SCALABLEMULTI-CLASSTRAFFICMANAGEMENTINDATACENTERBACKBONENETWORKS 3 as: Symbo TABLE I TABLE OF KEY NOTATIONS. Description L Set of unidirectiona inks in the backbone. K Set of distinct traffic casses. J Set of data centers. P Set of avaiabe paths. F Set of a fows across a casses. F k Set of fows in cass k. c Capacity of ink. ws k Weight of fow s of cass k. zsp k Rate of fow s of cass k on its p th path. y k Bandwidth aocated for cass k on ink. U k ( ) Utiity function of cass k. a k, b k Parameters to mode throughput and deay sensitivity of traffic in cass k. d k Average queuing deay on ink experienced by traffic of cass k. u k Utiization of ink due to traffic of cass k. A Topoogy matrix. R Path routing matrix. U k s = wk s [ ] a k f k ( ) b k g k ( ), (1) where the subtraction indicates that a fow gains more utiity by owering its average end-to-end deay. Modeing Throughput-Sensitivity: We assume that the rate-dependent function f k ( ) is non-negative, stricty concave, non-decreasing, and twice differentiabe in the tota sending rate x k s of fow s of cass k. Forexampe,thefunction can be ogarithmic, such as og ( x k s ),orbeongtoamore genera cass of functions that subsumes different notions of rate-fairness, typicay used for Internet TCP congestion contro (e.g., max-min fairness, proportiona fairness, α- fairness, etc.) [11], [12]. The concavity of f k ( ) is justified because of diminishing returns of the resources aocated to the fows. In other words, the first unit of bandwidth aocated to a fow is often more vaued than every additiona unit, especiay when the fow aready has an adequate bandwidth aocation. We aso assume that each fow has an infinite backog. Modeing Deay-Sensitivity: The second part g k ( ) of the utiity function is dependent on the average end-to-end deay for traffic in cass k.weconsiderthisaverageend-to-enddeay for a given path to be the sum of the propagation deays and the average queuing deays for a the inks on that path. We choose to minimize the average deay because of two reasons: (1) it resuts in ower deay for more traffic than minimizing some other metric, for instance, the maximum deay aong the paths; and (2) for mathematica convenience, which aows us to formuate the probem as a convex optimization probem and sove it more easiy. We do not caim to mode the queuing deay accuratey, but, instead, this is our way of keeping the inks beow high oad, whie favoring paths with ow propagation deay. The average queuing deay experienced by a fow on a given physica ink depends on severa factors, such as the number of queues, the ink-scheduing poicy, and the bandwidth aocated on that ink. In this work, we assume that each ink ink 1 ink 2 A ink 3 Fig. 1. Queue scheduing poicies: A work-conserving scheduer woud transmit additiona packets for the fow of cass A if ink 1 is ide, thus, causing additiona congestion on the common ink 3. A non-work-conserving scheduer woud not transmit such additiona packets. maintains one queue per cass, and configures aggregate ink bandwidth on a per-cass basis. To this end, the inks empoy a non-work-conserving scheduer. The reason we choose such ascheduerinsteadofawork-conservingoneisasfoows: Suppose there are two fows that share a common ink, but beong to two different casses A and B, as shown in Figure 1. If ink 1 is ide, then a work-conserving scheduer wi transmit additiona packets for the fow of cass A. This wi cause additiona congestion on ink 3, and incorrecty signa that cass A traffic needs more bandwidth on ink 3, eading to a redistribution of resources with esser bandwidth aocated to cass B. Suppose d k denotes the average queuing deay experienced by a packet of cass k on ink.thequeuingdeayisafunction of the ink oad due to the traffic of cass k, andtheaggregate bandwidth on ink aocated to cass k. This cass-eve bandwidth can be thought as the cass having its own virtua ink with a capacity equa to the aggregate bandwidth aocated on the physica ink. We denote the ink oad by L k,andthe cass-eve bandwidth by y k.wechoosetheformofthedeay function in such a way that it penaizes inks that approach or exceed their capacity. For mathematica convenience, we aso want this deay function to be non-negative, convex, nondecreasing, and twice-differentiabe. Thus, instead of using the M/M/1 queuing formua for average deay, d k =1/(y k Lk ), which is undefined for over-utiized inks, we use a piecewise inear approximation to it, foowing the approach in [13]. Denoting the utiization of ink due to the traffic of cass k by u k (i.e., u k = L k /yk ), we define this piecewise inear function as: d k B ( u k ) = m k L k + qk, (2) where m k = m(u k )/yk is a function that determines the sope of the region for cass k, andq k = q(u k )/yk is a function that determines the associated y-intercept for cass k. Likein[13], the vaues of d k are sma for ow utiizations, but increase rapidy as the oad approaches or exceeds the ink capacity. B. Backbone Network Mode 1) Backbone Topoogy and Paths: We mode the backbone network as a set L of interconnecting unidirectiona inks, with a finite capacity c,andapropagationdeayp for every

4 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 31, NO. 12, DECEMBER 2013 ink L.Tosupportargevoumesofdatatransferbetween remote data center ocations, in our mode, we consider the inks to typicay have high capacities. Let a path p be a non-empty subset of L, andetp be the set of avaiabe paths. The set P does not necessariy incude a possibe paths in the physica topoogy, but ony a subset of them chosen by operators. Since the backbone comprises onghau, high-capacity inks, most data center pairs have ony a few paths connecting them. These paths are set up in advance by OSP operators using, for instance, MPLS (Muti-Protoco Labe Switching) abes. Each path emerges from an edge router in a source data center, and passes through intermediate routers before ending at another edge router in a destination data center. 2) Fow Rates and Muti-Path Routing: We assume that each fow can use mutipe paths from its source to destination end host. Muti-path routing for inter-data center traffic is reativey easy to impement, because the backbone as we as the sources are owned by the same OSP. Muti-path routing provides severa benefits by spitting high-voume traffic aong mutipe paths, thereby increasing the end-to-end capacity. We denote by zsp k the sending rate of fow s of cass k on its p th path. We assume that each fow can be spit fexiby, i.e., the host or router originating the fow can send packets at the computed rates aong each of the specified paths. Without proper handing [14], such fow-eve muti-path routing might cause out-of-order deivery of packets. Conventiona techniques, such as hash-based spitting of traffic, can prevent this out-of-order arriva of packets beonging to the same TCP or UDP session. 3) Routing Matrix: Aroutingmatrixtypicayencodesthe mapping between paths, inks, and traffic fows in a network. For inter-data center routing, we spit this singe matrix into two parts to bring scaabiity in the resuting system. Since the number of fows per cass can be very arge compared to the number of paths and inks, we store the mapping between paths and inks in a smaer matrix A, caedthe topoogy matrix, ofsize L P, andthemappingbetween fows and paths in a arger matrix R, caedthepath routing matrix, ofsize P F. Thus,ifamanagementserverinthe backbone needs information ony about the inks comprising the paths, but not about the fows on those paths, it can store ony the smaer matrix A. This saves memory and communication overhead that woud be needed otherwise to keep a singe matrix updated a the time due to dynamic arriva and departure of fows. The eements of A and R are: { 1, if ink ies on path p A p = 0, otherwise. { 1, if fow s of cass k uses path p Rsp k = 0, otherwise. Since a fow can go over mutipe paths, we consider its weighted average deay as the sum of the products of path rates and average end-to-end deays on those paths. This captures the true average deay over a bits of traffic in the fow. Using (2), the deay-dependent function g k ( ) takes the foowing form for fow s of cass k: g k ( u k ) = p P R k sp zk sp ( ( A p p + m k Lk + ) ) qk. (3) L The ink oad is given by L k = s F k p P A pr k spz k sp. C. Maximizing Aggregate Utiity In an OSP network, the backbone and the traffic sources are under the same ownership, simpifying the task of choosing an objective function. As a natura choice, we consider that the OSP aims to maximize the sum of the utiities of a fows across a traffic casses, thus, maximizing the goba socia wefare. In other words, the fows do not have conficting objectives, but, instead, share a common goa prescribed by the OSP. The constraints in our probem are the ink capacity constraints, and the variabes are the fow path rates, zsp k,and the aggregate bandwidth, y k,foreachcassk on ink. The optimization probem can thus be written as: GLOBAL: maximize U = w k [ s a k f k ( x k ) s b k g k ( u k )] k K s F k subject to A p Rsp k zsp k y k, k, variabes s F k p P k K y k c, z k sp 0, k, s, p y k 0, k, where x k s = p P Rk sp zk sp is the tota sending rate of fow s of cass k over a its paths. The first constraint says that the tota sending rate of a the fows of cass k that go over ink cannot exceed the aggregate bandwidth y k.thesecond constraint says that the tota bandwidth aocated to a the casses on ink cannot exceed the capacity c.intheabove form, (4) is a convex optimization probem, due to the inear inequaity constraints and the concave objective function. III. SEMI-CENTRALIZED SCALABLE DESIGNS In this section, we present two semi-centraized designs that are both scaabe and use a sma number of management entities to optimay aocate fow-eve sending rates across mutipe casses in an OSP network. We describe the different components and their functionaitiescomprisingthe two designs, as we as the messages exchanged between the management entities. A. A Semi-Centraized 2-Tier Design Typicay, data centers host appications that originate a arge number of fows which beong to different traffic casses with diverse performance requirements. The fows can have different weights and can fow over mutipe paths from their source to destination end hosts. In such a setting, centraized traffic management soutions suffer from scaabiity issues, due to the need for excessive message-passing between the hosts, routers, and management systems. Likewise,fuy-distributed soutions are aso not scaabe and exhibit sow convergence. (4)

GHOSH et a.: SCALABLEMULTI-CLASSTRAFFICMANAGEMENTINDATACENTERBACKBONENETWORKS 5 Tier-1 Link Coordinator Tier-1 Link Coordinator Tier-2 Cass Aocator CA 1 1 2 Cass Aocator Cass Aocator... CA k... CA K 3 3 3......... DC 1 Sources DC j DC J Fig. 2. A semi-centraized 2-tier design with two types of management entities: a singe LC on tier-1, and mutipe CAs on tier-2. The arrows indicate message-passing. To overcome these drawbacks, we foow a top-down approach and propose a semi-centraized 2-tier design (see Figure 2) that is moduar, scaabe, and requires ony moderate amount of message-passing. The design has two types of management entities on its two tiers. On tier-1, it has a singe management entity caed a ink coordinator (LC), and on tier-2, there are mutipe management entities caed cass aocators (CAs). There is one CA assigned for each cass. These entities are hosted from one or mutipe servers in the backbone. Each type of entity has imited knowedge about the backbone and inter-data center traffic, but can communicate with each other to optimize fow-eve sending rates across mutipe casses. In the foowing, we describe the functionaities of these components. Link Coordinator on Tier-1: The LC is a centraized entity that knows ony about the network topoogy. In particuar, it knows the capacity c of every ink, andthetopoogy matrix A, but does not know the utiity functions of the casses, the weights, or the paths traversed by the individua fows. The LC optimizes aggregate ink bandwidths across mutipe casses in the backbone. In particuar, it computes an aggregate bandwidth y k for each cass k on every ink in the backbone. This aggregate bandwidth is shared by a the fows of cass k that go over ink, andiscaedthecasseve bandwidth. The LC then sends these y kstothecason tier-2 for driving fow rate aocation decisions. In turn, the LC receives a set of optima Lagrange mutipiers from each CA. This message-passing is shown in Figure 2 by the arrows abeed 1 and 2 between the LC and the CA of cass k. The Lagrange mutipiers, as described in the next section, are optima subgradients of an optimization probem soved by each CA, and are functions of the y ks. Cass Aocator on Tier-2: Unike the LC, which has no information about the inter-data center traffic, each CA knows the utiity function of its own cass, the weights of the fows, and the paths traversed by them. These weights and paths are communicated to the CAs by the end hosts. There are a tota of K CAs, one assigned for each cass k, denotedbyca k. Each CA optimizes sending rates across mutipe fows that beong to its own cass. In particuar, the CA of cass k first receives the cass-eve bandwidths y ksfromthelc,andthen soves an optimization probem to subdivide these bandwidths among the fows of its own cass. The CA then sends these Tier-2 Cass Aocator CA 1 Tier-3 DC Aocator DA k1 DC 1... CA k... CA K............... Sources 1 2 Cass Aocator DC Aocator DA kj DC j Cass Aocator 4 3 3 4 4 3 5 DC Aocator DA kj DC J Fig. 3. A semi-centraized 3-tier design with three types of management entities: a singe LC on tier-1, mutipe CAs on tier-2, and mutipe DAs on tier-3. The arrows indicate message-passing. optima path rates to the sources hosting these fows, as shown by the arrows abeed 3 in Figure 2. It aso sends the optima Lagrange mutipiers to the LC. B. A Semi-Centraized 3-Tier Design In our 2-tier design, each CA sends the fow-eve path rates to the end hosts that beong to its own cass. However, since the fows of a given cass can be hosted from any of the data centers, each CA potentiay communicates with a data centers. When the number of fows is very arge, this incurs expensive contro pane overhead. To overcome this drawback, we now propose a semi-centraized 3-tier design that is more scaabe and has ess message-passing overhead. The 3-tier design has a third type of management entity caed a data center aocator (DA) on tier-3, in addition to the LC on tier-1 and the CAs on tier-2 (see Figure 3). The functionaity of the LC is the same as before, i.e., it optimizes aggregate ink bandwidths across mutipe casses in the backbone. The CAs and the DAs, however, perform different tasks. In particuar, the task of aocating fow-eve path rates for a given cass is now spit between a CA and mutipe DAs assigned for that cass. We describe the functionaities of the CAs and the DAs in the foowing. Cass Aocator on Tier-2: Each CA optimizes aggregate ink bandwidths across mutipe data centers. As before, there are a tota of K CAs, one assigned for each cass k, denoted by CA k.unikethe2-tierdesign,wherethecasknewabout the utiity functions and the weights of the fows, in this 3-tier design, each CA knows ony the paths traversed by the fows of its own cass. The CA of cass k first receives the casseve bandwidths y ksfromthelc,andthensubdividesthese among mutipe data centers that host fows of cass k. We ca this the DC-eve bandwidth and denote it by y kj.this bandwidth is shared by the fows of cass k that originate from data center j and go over ink. TheCAthensendsthesey kj s to the DAs on tier-3 for driving fow-rate aocation decisions. In turn, the CA receives a set of optima Lagrange mutipiers from each DA. These Lagrange mutipiers are functions of the y kj s. The message-passing between the CA and the DAs is

6 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 31, NO. 12, DECEMBER 2013 Tier-2 SUBPROBLEM CLASS(1) Tier-1 MASTER PRIMAL... SUBPROBLEM... CLASS(k) SUBPROBLEM CLASS(K) Link Coordinator k* y k Fig. 5. Message passing in the 2-tier design. Cass Aocator CAk zk sp Fig. 4. A singe-eve prima decomposition of GLOBAL into a master prima on tier-1, and K cass-eve subprobems on tier-2. shown in Figure 3 by the arrows abeed 3 and 4. Finay, each CA aso sends the Lagrange mutipiers to the LC after optimizing the DC-eve bandwidths, as shown by the arrows abeed 1. Data Center Aocator on Tier-3: The DAs ie on tier- 3andarehostedbyserversocatedwithinthedatacenters. There is one DA assigned for each cass in each data center, summing up to a tota of KJ DAs. We denote the DA for cass k in data center j by DA kj.eachdahasaocaized view of the inter-data center traffic. It ony knows the utiity function of its cass, and the weights and paths of the fows that originate from its own data center. Each DA optimizes sending rates across mutipe fows in the same data center. In particuar, DA kj first receives the DC-eve bandwidths y kj s from the CA, and then optimay subdivides these among the fows of cass k that originate from data center j. Itthensends these optima path rates to the respective sources, shown by the arrow abeed 5 in Figure 3, as we as the optima Lagrange mutipiers to the CA from which it receives the y kj s. IV. OPTIMIZATION DECOMPOSITION In this section, we describe the optimization decomposition underying the two designs to prove optimaity of the traffic management protocos. We first present a singe-eve prima decomposition for the 2-tier design, and then a two-eve prima decomposition of the 3-tier design. We aso describe the associated message-passing, and the rate aocation updates by the different management entities. A. Decomposing the 2-Tier Design In our 2-tier design, the cass-eve bandwidths are first computed by the LC on tier-1, and then are subdivided by the CAs on tier-2 to compute fow-eve path rates. This particuar rate aocation strategy suggests a decomposition of GLOBAL into mutipe smaer subprobems, each of which can be soved independenty by a CA, whie together being coordinated by the LC. Such a decomposition, where the primary resources (i.e., bandwidths) are first optimized and coordinated at an aggregate eve, and then further subdivided among individua components (i.e., fows) by soving mutipe smaer subprobems, is known as prima decomposition. The coordination probem that optimizes aggregate resources is caed the master prima. 1) Prima Decomposition: The LC first aocates a casseve bandwidth y k for each cass k and ink in the backbone. This decomposes GLOBAL into a tota of K subprobems, as shown in Figure 4. Each of these subprobems is then independenty soved by a CA to subdivide y k among the fows of cass k that use ink. Wedefinethesubprobemfor cass k as: CLASS(k): maximize subject to U k = s F k p P s F k w k s [ a k f k ( x k ) s b k g k ( u k )] A p Rsp k zk sp yk, variabes z k sp 0, s F k,p where the right hand side y k of the constraint is a fixed quantity that pays the roe of fixed ink capacities, as in standard network utiity maximization (NUM) probems. These cass-eve subprobems are coordinated by soving a master prima in the LC. We define this master prima as: MASTER-PRIMAL: maximize subject to U = U ( k y k) k c, k K y k variabes y k 0, k, ( where U ) k y k is the optima objective vaue of (5) for a fixed cass-eve bandwidth vector y k = {y k, }. 2) Subgradient Updates: The master prima and the casseve subprobems are soved using a subgradient method in an iterative fashion unti convergence. Suppose at iteration t of the master prima, the LC aocates a cass-eve bandwidth vector y k (t) for cass k. TheCAthensubdividesthis bandwidth among the fows of cass k by soving (5). Let the optima Lagrange mutipiers corresponding to the constraints in (5) be λ k. Ceary, this is a function of y k (t). These Lagrange mutipiers are sent by the CAs to the LC, which then updates the cass-eve bandwidths for the next iteration as: [ y k (t +1)= y k (t)+β(t) λ ( k y k (t) )], (7) X where [ ] X denotes the projection onto the feasibe set X,and β(t) is a sma positive step size for iteration t. Figure5shows the message exchange for subgradient updates. We note that the above prima decomposition computes optima sending rates for a the fows in the network across a casses and data centers. B. Decomposing the 3-Tier Design In our 3-tier design, the cass-eve bandwidths are first computed by the LC on tier-1, and then are subdivided by (5) (6)

GHOSH et a.: SCALABLEMULTI-CLASSTRAFFICMANAGEMENTINDATACENTERBACKBONENETWORKS 7 Tier-2 SECONDARY-PRIMAL CLASS(1) Tier-1 MASTER-PRIMAL SECONDARY-PRIMAL...... CLASS(k) SECONDARY-PRIMAL CLASS(K) Link Coordinator sk* y k Cass Aocator CAk Fig. 7. Message passing in a 3-tier design. kj* y kj Data Center Aocator DAkj zk sp Tier-3 SUBPROBLEM... SUBPROBLEM... SUBPROBLEM DATACENTER(k,1) DATACENTER(k,j) DATACENTER(K,J) Fig. 6. A two-eve prima decomposition of GLOBAL into a master prima on tier-1, K cass-eve secondary primas on tier-2, and KJ DC-eve subprobems on tier-3. the CAs on tier-2 across mutipe data centers to compute DC-eve bandwidths. These DC-eve bandwidths are further subdivided by the DAs on tier-3 to compute fow-eve path rates. Like in the 2-tier design, this rate aocation strategy aso suggests a prima decomposition of GLOBAL into mutipe smaer subprobems that can be coordinated by different management entities. However, now there are two eves of prima decomposition; one in which GLOBAL is decomposed into mutipe subprobems at tier-2, and the other in which each tier-2 subprobem is further decomposedintomutipe smaer subprobems at tier-3. We describe this two-eve prima decomposition in the foowing. 1) A Two-Leve Prima Decomposition: As before, the LC first aocates a cass-eve bandwidth y k for each cass k and ink in the backbone. This decomposes GLOBAL into K cass-eve subprobems that are coordinated by a master prima in the LC, as shown in Figure 6. Each cass-eve subprobem is soved by a CA. However, unike in the 2-tier design where the CAs compute fow-eve path rates, each CA now ony aocates an aggregate DC-eve bandwidth for each data center. Computing the fow-eve path rates is deegated to the DAs, which sove the DC-eve subprobems. Let F kj denotes the set of fows of cass k that originate from data center j. Aso,etDATACENTER(k, j) denotethesubprobem soved by the DA of cass k in data center j, whichwedefine as: DATACENTER(k, j): maximize subject to U kj = s F kj p P s F kj w k s variabes z k sp 0, s Fkj,p [ a k f k ( x k ) s b k g k ( u k )] A p Rsp k zsp k y kj, where, as before, the right hand side y kj of the constraint is a fixed quantity. We note that the structure of (8) is simiar to that of (5), except for the superscript j which restricts considering ony the fows that originate from data center j. TheseDCeve subprobems are coordinated by the CAs. In particuar, CA k coordinates a the DC-eve probems that beong to cass k by soving a secondary prima, whichisdefinedas: (8) SECONDARY-PRIMAL(k): maximize subject to variabes U k = U ( kj y kj) j y k, j J y kj y kj 0, j, ( where U ) kj y kj is the optima objective vaue of (8) for a fixed DC-eve bandwidth vector y kj = {y kj, }. Finay, these secondary primas are coordinated by a master prima, definedas: MASTER-PRIMAL: maximize subject to U = U ( k y k) k c, k K y k variabes y k 0, (9) (10) where U k ( y k ) is the optima objective vaue of (9) for a given cass-eve bandwidth y k. 2) Subgradient Updates: The primas and the subprobems are soved in an iterative fashion unti convergence using a subgradient method as before. In particuar, a the secondary primas need to converge first before the master prima goes on to the next iteration. Let t and t denote the iteration indices of the master prima and secondary primas, respectivey. Suppose at iteration t, the LC fixes a cass-eve aggregate bandwidth y k (t). TheCAthenoptimaydividesthisamong the DAs of cass k to sove (9). In particuar, suppose at iteration t, the CA fixes a DC-eve aggregate bandwidth y kj (t ).ThentheDAusesthisasthetotabudgettosove (8) and compute fow-eve path rates. Suppose the optima Lagrange mutipiers corresponding to the constraints in (8) is λ kj,whichisafunctionofy kj (t ). These Lagrange mutipiers are sent by the DAs to the CA, which then updates the DC-eve bandwidth as: y kj (t +1)= [ y kj (t )+α(t ) s k ( y kj (t ) )] X. (11) Here s k ( ) is a subgradient corresponding to the constraints in (9), and is given by the sum of the optima Lagrange mutipiers of (8) as: s k = j J λkj ( y kj (t ) ).Asbefore, [ ] X denotes the projection onto the feasibe set X,andα(t ) is a sma positive step size for iteration t. Once a the secondary primas converge, the next iteration of the cass-eve bandwidth aocation takes pace. Suppose, the optima Lagrange mutipiers once (9) converges be s k. These Lagrange mutipiers are again a function of y k (t), and

8 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 31, NO. 12, DECEMBER 2013 DC2 1 1 2 3 3 DC3 Fig. 8. A simpe 3-node, 3-ink topoogy. DC1 2 % error 10 8 6 4 2 =1 =2 =5 =10 =20 9 11 DC4 10 DC2 6 5 4 DC1 1 2 0 0 50 100 150 # of iterations to convergence Fig. 10. Rate of convergence of the 2-tier design for different vaues of the cass-eve step size β. 8 DC3 Fig. 9. Abiene backbone network with 4 data centers ocated at nodes 1, 6, 8, and 11. are sent by the CAs to the LC. The LC then updates the casseve bandwidth as: 7 y k (t +1)= [ y k (t)+β(t) s ( y k (t) )] Y. (12) Here s( ) is a goba subgradient corresponding to the constraints in (10), and is given by the sum of the optima Lagrange mutipiers of (9) as: s = k K sk ( y k (t) ).As before, [ ] Y denotes the projection onto the feasibe set Y, and β(t) is a sma positive step size for iteration t. Figure7shows the message exchange for these subgradient updates. Like in the 2-tier design, here aso the two-eve prima decomposition computes optima sending rates for a the fows in the network across a casses and data centers. V. PERFORMANCE EVALUATION In this section, we evauate the performance of the two designs on two different topoogies. We first compare their convergence behavior, and then the number of messages exchanged unti convergence. A. Experimenta Setup Backbone Topoogies: We experiment with two network topoogies: (1) a simpe 3-node, 3-ink topoogy, as shown in Figure 8, and (2) the Abiene backbone network [15], as shown in Figure 9. The simpe topoogy aows us to reason about the system more easiy, whie the Abiene topoogy is more reaistic and has more number of nodes and path diversity. The simpe topoogy has a data center ocated at each node. We assume that both data centers, DC1 and DC2, host two casses of traffic that are throughput-sensitive. We set the parameters for the two casses as: a 1 =2, b 1 =0for cass C1, and a 2 =10, b 2 =0for cass C2. We assume that the ratedependent function is ogarithmic, and the weights of a the fows in both casses are set to one. Each ink has a capacity 3 of 100 Mbps. Foowing (1), the utiity functions are therefore given by: U k s = ak og( p P Rk sp zk sp ). The Abiene backbone has a ink capacity of 1 Gbps in each direction, and the propagation deays are chosen so as to approximate reaistic vaues between respective pairs of nodes. We suppose that there are four data centers ocated at nodes 1, 6, 8, and 11. Out of the many possibe paths between the data centers, we choose the first three shortest paths between each pair, resuting in a tota of 36 paths. B. Convergence: Sensitivity to Step Sizes α and β In our 2-tier design, the tunabe step size β contros how the cass-eve bandwidths and the goba utiity react to changes in the subgradient λ k (Equation (7)). Simiary, in our 3-tier design, the step sizes β and α, respectivey, contros how the cass-eve and DC-eve bandwidths react to changes in the subgradients λ k and s k (Equations (11) and (12)). In the foowing experiments, we compare the convergence behavior of the goba utiity in our two designs for constant vaues of these step sizes. We define convergence as being within 0.01% of the optima. We ca β the cass-eve step size, and α the DC-eve step size. Wefirstexperimentwiththesimpe topoogy. Convergence Rate of the 2-Tier Design: In Figure 10, we pot the percentage error of the goba utiity for the 2-tier design for different vaues of the cass-eve step size β. We observe that the utiity converges for a vaues of β between 1 and 20, and the rate of convergence increases rapidy for arger β. Werantheexperimentfordifferentvauesofβ up to 50, and a of them converge very quicky (not a those graphs are shown here). For instance, starting at β =25, the convergence is aways within 3 to 5 iterations. For β > 50, however, the utiity vaues osciate and the system never converges; the reason being, as β gets arger, there is a tendency for the aocations to overshoot beyond the feasibe region every singe iteration. Thus, the convergence of the 2-tier design is more sensitive to vaues of β when they are sma (i.e., 1 β 20), but not much once β exceeds 25. Convergence Rate of the 3-Tier Design: Figure 11 shows the percentage error of the goba utiity for the 3-tier design for different vaues of the cass-eve step size β,whenthedceve step size α is fixed at 2. We observe that the utiity con-

GHOSH et a.: SCALABLEMULTI-CLASSTRAFFICMANAGEMENTINDATACENTERBACKBONENETWORKS 9 % error 15 10 5 =1 =2 =5 =10 =20 0 0 100 200 300 400 500 # of iterations to convergence Fig. 11. The rate of convergence of the 3-tier design for different vaues of the cass-eve step size β, whenthedc-evestepsizeα is fixed at 2. # of iterations to convergence 250 200 150 100 50 =2 =4 =6 =10 0 2 6 10 14 18 22 26 30 Fig. 12. Number of iterations to convergence in the 3-tier design for sweeping the cass-eve step size β for different vaues of the DC-eve step size α. verges for a vaues of β beow 40, and the rate of convergence increases for higher β. Theexperimentisrunwiththenumber of iterations of the inner oop set to five for every iteration of the outer oop. We reca that the inner oop aocates DC-eve bandwidths (controed by α), whie the outer oop aocates cass-eve bandwidths (controed by β). Thus, for β =1, which takes about 470 iterations to converge, the number of cass-eve aocations is 470/5 =94.Comparing Figure 10 and 11, we see that the rate of convergence for the 3-tier design is sower than the 2-tier design. For β above 40, however, the cass-eve bandwidths overshoot the feasibe region in every iteration of the outer oop, and, as a resut, the system never converges. To get a better understanding of how sensitive the 3-design is in terms of its convergence behavior, we fix the DC-eve step size α at different vaues and sweep the cass-eve step size β through a the vaues between 2 and 30 in steps of 2. Figure 12 shows the pots for four different vaues of α. Weobservethatforfixedα, thenumberofiterationsto convergence initiay decreases for increasing β, butgraduay becomes amost constant for arge β. Forβ more than 18, we see that a the four curves fatten out and the β vaues have amost no effect on the rate of convergence. This behavior is simiar to the 2-tier design, where the convergence rate is very quick and remains amost unaffected for arge vaues of β. We aso see that the pots for arge vaues of α (i.e., α =4, 6, 10) are very cose to each other, indicating that the 3-tier design is TABLE II RATE OF CONVERGENCE OF THE TWO DESIGNS FOR DIFFERENT VALUES OF THE CLASS-LEVEL STEP SIZE β. ALSO SHOWN ARE THE GOOD VALUES OF THE DC-LEVEL STEP SIZE α FOR THE 3-TIER DESIGN. Cass-eve step size β 2-tier design 3-tier design sma β =1, 2 sow very sow, a α medium β =5, 10 moderate sow, a α arge β =20, 30 fast moderate, a α very arge 30 <β<40 fast moderate, α 16 extremey arge 40 β<50 fast does not converge β 50 does not converge does not converge ess sensitive to arge α. Furthermore,we find that for α>16 and β>30, thesystemosciatesandneverconverges.thus, the good vaues of α and β for which the 3-tier design is stabe are when α 16 and β 30. In summary, Tabe II ists the different vaues of α and β, and the convergence behavior of the two designs. In practice, one shoud choose those vaues of α and β for which the system converges quicky. We have not experimented with dynamic traffic demands in this work; however, for the private OSP backbone networks under consideration, the demand variabiity can be controed to some extent because the traffic sources are owned by the OSP, and the step sizes can be chosen accordingy. With respect to faut toerance (e.g., ink faiures), our proposed designs can expoit muti-path routing to spit traffic over the good paths, and aso re-run the optimization on the modified topoogy after faiure to compute new optima sending rates [16]. C. Message-Passing Overhead In both our designs, the management entities exchange messages with each other in order to aocate cass-eve and/or DC-eve bandwidths in an iterative fashion. In this section, we compute and compare the number of messages exchanged in the two designs. In the anaysis beow, F kj is the set of fows of cass k that are hosted from data center j, andtheroutingmatrix Rsp k encodes the paths used by each fow s. Aso,asstatedin Tabe I, K is number of traffic casses and L is the number of inks in the backbone. We aso assume that the number of cass-eve aocations required for the 2-tier design to converge is N, and that required for the 3-tier design to converge is N. For the 3-tier design, we further assume that M iterations of DC-eve aocations are needed for each iteration of the cass-eve aocation. Message-Passing in the 2-Tier Design: In our 2-tier design, the LC first aocates the cass-eve bandwidths, y k s, to each CA in the beginning of each iteration. The tota number of variabes passed by the LC to a the CAs is therefore KL. Each CA then computes the path rates for the individua fows of its own cass and sends the rates to the respective sources. Since the fows of each cass can originate from any of the data centers, each CA potentiay communicates with a the data centers. Thus, the number of path rate variabes sent by the CA of cass k is j J s F kj p Rk sp. Finay, after computing

10 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 31, NO. 12, DECEMBER 2013 the fow-eve path rates, each CA passes the optima Lagrange mutipiers, λ k,tothelcattheendofeachiteration,thus, sending a tota of KL variabes. Summing up, the number of variabes exchanged unti convergence is #ofmessages= N 2KL + R k sp. k K j J s F kj p (13) Message-Passing in the 3-Tier Design: In our 3-tier design, the LC first aocates the cass-eve bandwidths, y ks, to each CA in the beginning of each iteration of the outer oop, thus, sending a tota of KL variabes. Each CA then subdivides the y ksintodc-evebandwidths,ykj s, for each data center j, andsendsthesetothedas.thenumberof variabes passed by the CAs to a the DAs in each iteration of the inner oop is therefore JKL.EachDAthencomputes the path rates for the individua fows and sends the rates to the respective sources. However, since the DAs are ocated within each data center, these messages are sent ocay and are not counted in our anaysis. Once the fow-eve path rates are computed, each DA sends the optima Lagrange mutipiers, λ kj,tothecaofitscass.thisincurspassingatotaof JKL variabes for each inner oop iteration. Finay, each CA sends the optima Lagrange mutipiers, s k,tothelcatthe end of each iteration of the outer oop. This requires passing atotaofkl variabes from a the CAs. Summing up, the number of variabes exchanged unti convergence is #ofmessages = N [2KL +2JKLM]. (14) Comparison of Message-Passing Overhead: From (13), we see that the number of messages exchanged in the 2-tier design depends on the number of the fows, F kj,ofeachcass k hosted from each data center j. However,the2-tierdesign has ony a singe oop which iterativey aocates cass-eve bandwidths unti convergence. On the other hand, from (14) we observe that the number of messages exchanged in the 3- tier design does not depend on the number of fows; however, the design has two oops which iterativey aocate both casseve and DC-eve bandwidths. In Figure 13, we compare the message-passing overhead in the two designs as a function of the number of fows in each cass hosted between every pair of data centers in the Abiene topoogy. We choose step sizes for the two designs that converge within 100 iterations. We observe that, as expected, the number of messages exchanged in the 3-tier design remains constant with the number of fows. In the 2-tier design, however, though the number of messages exchanged is fewer for sma number of fows, it increases once the number of fows per cass hosted from each data center exceeds sighty more than 110. This amounts to a tota of 110 2 4=880 fows. Thus, in practice, if the number of fows is sma, one woud prefer to use the 2-tier design; otherwise, the 3-tier design is preferred. VI. RELATED WORK Traffic management using optimization theory [17], [18] has been extensivey studied in iterature [13], [19] [21]. Using # of messages exchanged 15 x 105 2 tier design 3 tier design 10 5 0 10 0 10 1 10 2 10 3 # of fows in each cass per data center Fig. 13. Message-passing overhead of the two designs as a function of the number of fows in each cass hosted from each data center in the Abiene topoogy. mutipe decompositions, a distributed protoco caed TRUMP is proposed in [20], which is adaptive, robust, and fexibe. In [21], a distributed architecture caed DaVinci is presented which can run customized protocos on mutipe virtua networks sharing the same physica topoogy to support appications with diverse performance objectives. Using adaptive network virtuaization, it is shown that DaVinci maximizes the aggregate utiity of a virtua networks. In [13], a distributed muti-path protoco is proposed for deay-sensitive, ineastic traffic using decomposition techniques. Despite using decomposition theory, the protocos proposed in [13], [20], [21] are fuy-distributed, which rey on periodic ink feedbacks from the routers to adjust the sending rates. In an ISP network, this is perhaps more desirabe because the traffic sources are typicay owned by end users, whereas the operators contro the routing paths and tune the ink weights. In comparison, the focus of this work is on widearea OSP networks [22], where the sources as we as the backbone are owned by the same OSP. To this end, our proposed designs are semi-centraized, which use a sma number of management entities to compute optima fow-eve path rates across mutipe casses and data centers. This makes our protocos better in terms of scaabiity, message-passing overhead, and convergence behavior. Moreover, whie TRUMP does not consider mutipe casses of traffic and different appications within each cass, DaVinci does not consider scaabiity, or appications within each virtua network having different weights. There have been severa recent studies on traffic modeing and characterization within a singe data center [23] [25]. However, itte is known about traffic management in OSP networks [9]. The study in [9] uses Yahoo! datasets to anayze both inter-data center and cient traffic, reveaing a hierarchica depoyment of data centers at different geographica ocations. In [24], the authors capture both macroscopic and microscopic traffic characteristics in data center networks from arge-scae data sets. In [23], an empirica study of end-to-end traffic patterns is carried out across mutipe data centers to examine

GHOSH et a.: SCALABLEMULTI-CLASSTRAFFICMANAGEMENTINDATACENTERBACKBONENETWORKS 11 tempora and spatia variations in ink oads and osses. The work aso provides a framework to derive parameters that define sending behavior of the traffic sources. Finay, Googe s gobay-depoyed wide area network (WAN) caed B4 [26] uses Software Defined Networking (SDN) principes and OpenFow [27] to simutaneousy support standard routing protocos and centraized TE. Leveraging the common ownership (by Googe) of a the appications, servers, and data center networks a the way up to the edge of the backbone, B4 can (1) adjudicate among competing demands under resource constraints, (2) use muti-path forwarding/tunneing to everage avaiabe network capacity according to appication priority, and (3) dynamicay reaocate bandwidth in the face of ink/switch faiures or shifting appication demands. VII. CONCLUSION In this work, we examined two aternative traffic management designs for OSP networks. Our designs are both scaabe and semi-centraized, but differ in their degree of distributedness. They have a tiered structure and use a sma number of management entities to optimize sending rates at the fow-eve across mutipe traffic casses. Using prima decomposition, we showed that both the designs are provaby optima in maximizing the aggregate utiity of a fows. Simuations on areaisticbackbonetopoogiesreveaedquickerconvergence rate of the 2-tier protoco as compared to the 3-tier one, but with increased message-passing. Our future research incudes experimenting with rea data sets. ACKNOWLEDGMENT Our work was supported in part by NSF grant 1162112 and agiftfromgooge.theauthorswoudiketothankprofessor Mung Chiang for his insightfu comments, and Neeanjana Gautam for proofreading the manuscript that heped us improve the quaity of this work. REFERENCES [1] Z. Zhang, M. Zhang, A. Greenberg, Y. C. Hu, R. Mahajan, and B. Christian, Optimizing Cost and Performance in Onine Service Provider Networks, in USENIX NSDI, SanJose,CA,Apri2010,pp. 33 47. [2] Googe Data Centers, http://www.googe.com/about/datacenters/. [3] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu, BCube: A High Performance, Server-Centric Network Architecture for Moduar Data Centers, in ACM SIGCOMM, Barceona, August 2009, pp. 63 74. [4] Googe Throws Open Doors to Its Top-Secret Data Center, http:// www.wired.com/wiredenterprise/2012/10/ff-inside-googe-data-center/. [5] How Dirty is Your Data? http://www.greenpeace.org/internationa/en/ pubications/reports/how-dirty-is-your-data/. [6] A. Mahimkar, A. Chiu, R. Doverspike, M. D. Feuer, P. Magi, E. Mavrogiorgis, J. Pastor, S. L. Woodward, and J. Yates, Bandwidth on Demand for Inter-Data Center Communication, in ACM HotNets, Cambridge, MA, November 2011, pp. 24:1 24:6. [7] A. Greenberg, J. Hamiton, D. A. Matz, and P. Pate, The Cost of a Coud: Research Probems in Data Center Networks, ACM SIGCOMM Computer Communication Review, vo.39,no.1,pp.68 73,December 2008. [8] Inside Microsoft s Internet Infrastructure & Its Pans For The Future, http://gigaom.com/2008/06/30/ microsofts-internet-infrastructure-its-big-pans/. [9] Y. Chen, S. Jain, V. K. Adhikary, Z.-L. Zhang, and K. Xu, A First Look at Inter-Data Center Traffic Characteristics via Yahoo! Datasets, in IEEE INFOCOM, Shanghai,Apri 2011,pp.1620 1628. [10] Googe, Inter-Datacenter WAN with Centraized TE using SDN and OpenFow, Open Networking Summit 2012, https://www. opennetworking.org/images/stories/downoads/misc/googesdn.pdf. [11] F. P. Key, A. Mauoo, and D. Tan, Rate Contro for Communication Networks: Shadow Prices, Proportiona Fairness and Stabiity, J. Operationa Research Society, vo.49,pp.237 252,March1998. [12] F. P. Key, Fairness and Stabiity of End-to-End Congestion Contro, European Journa of Contro, pp.159 176,2003. [13] U. Javed, M. Suchara, J. He, and J. Rexford, Mutipath Protoco for Deay-sensitive Traffic, in COMSNETS, Bangaore,January2009,pp. 438 445. [14] S. Kandua, D. Katabi, S. Sinha, and A. Berger, Dynamic Load Baancing Without Packet Reordering, ACM SIGCOMM Computer Communication Review, vo.37,pp.51 62,Apri2007. [15] Abiene Backbone, http://abiene.internet2.edu. [16] M. Suchara, D. Xu, R. Doverspike, D. Johnson, and J. Rexford, Network Architecture for Joint Faiure Recovery and Traffic Engineering, in ACM SIGMETRICS, San Jose,CA,June 2011,pp.97 108. [17] D. Paomar and M. Chiang, Aternative Distribution Agorithms for Network Utiity Maximization: Framework and Appications, IEEE/ACM Trans. Autom. Contro, vo. 52, no. 12, pp. 2254 2269, December 2007. [18] M. Chiang, S. H. Low, R. Caderbank, and J. C. Doye, Layering as Optimization Decomposition: A Mathematica Theory of Network Architectures, Proc. IEEE, vo.95,no.1,pp.255 312,January2007. [19] S. Foyd and V. Jacobson, Link-Sharing and Resource Management Modes for Packet Networks, IEEE/ACM Trans. Netw., vo. 3, no. 4, pp. 365 386, August 1995. [20] J. He, M. Suchara, M. Breser, J. Rexford,andM.Chiang, Rethinking Internet Traffic Management: From Mutipe Decompositions to a Practica Protoco, in ACM CoNEXT, New York, December 2007, pp. 1 12. [21] J. He, R. Zhang-Shen, Y. Li, C.-Y. Lee, J. Rexford, and M. Chiang, DaVinci: Dynamicay Adaptive Virtua Networks for a Customized Internet, in ACM CoNEXT, Madrid, December 2008, pp. 15:1 15:12. [22] J. W. Jiang, Wide-Area Traffic Management for Coud Services, Ph.D. Thesis, Princeton University, 2012. [23] T. Benson, A. Anand, A. Akea, andm. Zhang, UnderstandingData Center Traffic Characteristics, ACM SIGCOMM Computer Communication Review, vo.40,pp.92 99,January2010. [24] S. Kandua, S. Sengupta, A. Greenberg, P. Pate, and R. Chaiken, The Nature of Data Center Traffic: Measurements & Anaysis, in ACM IMC, Chicago, IL, November 2009, pp. 202 208. [25] A. Greenberg, J. R. Hamiton, N. Jain, S. Kandua, C. Kim, P. Lahiri, D. A. Matz, P. Pate, and S. Sengupta, VL2: A Scaabe and Fexibe Data Center Network, in ACM SIGCOMM, Barceona, August 2009, pp. 51 62. [26] S. Jain, A. Kumar, S. Manda, J. Ong, L. Poutievski, A. Singh, S. Venkata, J. Wanderer, J. Zhou, M. Zhu, J. Zoa, U. Höze, S. Stuart, and A. Vahdat, B4: Experience with a Gobay-Depoyed Software Defined WAN, in ACM SIGCOMM, HongKong,August2013. [27] OpenFow Specification, http://www.openfow.org/wp/documents/. Amitabha Ghosh received his Ph.D. in Eectrica Engineering from the University of Southern Caifornia in 2010, and was a post-doc at Princeton University during 2010-2012. His research interests ie in the design and anaysis of agorithms and protocos for sensor networks, ceuar networks, Wi-Fi, sateite communications, cognitive radios, and data center networks. Between 2000 and 2004, he worked for Acate-Lucent and Honeywe Labs in Bangaore on GSM/GPRS and ad hoc networks. Prior to that, he received his B.S. in Physics from Indian Institute of Technoogy, Kharagpur, in 1996, and his M.E. in Computer Science and Engineering from Indian Institute of Science, Bangaore, in 2000. He is currenty a Research & Deveopment Scientist at UtopiaCompression based in Los Angees, where he works on deveoping wireess networking and communication technoogies for tactica networks.

12 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 31, NO. 12, DECEMBER 2013 Sangtae Ha is the CTO and VP Engineering at DataMi. He received his Ph.D. in Computer Science from North Caroina State University in 2009. During his Ph.D. study, he co-authored CUBIC, a new mechanism for TCP congestion contro, and was the primary engineer pushing it into Linux 2.6.18. CUBIC has been the defaut TCP congestion contro agorithm in Linux since 2006 and is currenty used by more than 40% of Internet servers around the word as we as severa hundred miion Android/Linux devices. After competing his Ph.D., he co-founded the Princeton EDGE Lab (http://scenic.princeton.edu) as its first Associate Director in 2009 and ed its research team as an Associate Research Schoar at Princeton University from 2010 to 2013. In particuar, he ed the DataMi (http://www.datami.com) project team, which deveops Smart Data Pricing (SDP) soutions to hep Internet Service Providers (ISPs) meet the increasingy untenabe demand for mobie data. He is a co-founder of DataMi Inc. and a technica consutant to a few startups. He is an IEEE Senior Member and was a genera co-chair of the Smart Data Pricing workshop in INFOCOM 2013. In 2014, he wi join the University of Coorado Bouder as an Assistant Professor. Edward Crabbe is a network architect at Googe, where his responsibiities range from overa network design incuding backbone switching and routing architectures to nove protoco design and specification. He has been working on arge networks in one form or another since the ate nineties, and has hed senior positions at a number of tier one backbones and severa of today s top enterprises and coud hosting providers. Jennifer Rexford is the Gordon Y.S. Wu Professor of Engineering in the Computer Science department at Princeton University. Whie working at AT&T Research from 1996 to 2005, she designed networkmanagement techniques that are in daiy use in AT&Ts backbone network. Jennifer was chair of ACM SIGCOMM from 2003 to 2007, and currenty co-chairs the advisory counci of NSF s CISE directorate. She is an ACM Feow and received ACM s Grace Murray Hopper Award for outstanding young computer professiona of the year for 2004.