BigPi: Sharing Link Pools in Cloud Networks


Yu Chen, Xin Wu, Qiang Cao, Xiaowei Yang and Theophilus Benson
TR23-, Department of Computer Science, Duke University

Abstract

In cloud networks, sharing network resources among different VMs is critical to performance isolation. A sharing scheme aims to fully utilize the network resource, and to effectively prevent performance interference and even malicious attacks such as Denial-of-Service (DoS) attacks. Existing bandwidth allocation schemes attempt to achieve performance isolation on a per-link basis. However, due to the unbalanced congestion created by single-path routes, that level of isolation is suboptimal for the multi-rooted trees widely used in today's datacenters. In BigPi, we define a link pool as a set of parallel links from/to the same rack or pod, and find that if we evenly split each flow across all its shortest paths, we can share bandwidth on a per-link-pool basis. Sharing at the granularity of link pools achieves higher network utilization and per-source fairness by leveraging all potential paths. Through rigorous analysis and extensive simulation, we show that BigPi is highly scalable, and achieves both stronger isolation and higher overall throughput than is feasible when sharing on a per-link basis. In addition, within a VM, BigPi's bandwidth allocation achieves higher tail throughput.

I. INTRODUCTION

Driven by auto-scaling online services and the pay-as-you-go business model, cloud data centers have become highly shared environments. Virtual machines (VMs) of different tenants inevitably compete for shared network resources. This contention introduces performance interference and uncertainty. In extreme cases, open cloud service platforms suffer from abuse or even malicious attacks such as DoS attacks. As a result, both industry and the research community are actively looking for solutions that guarantee efficient and fair sharing of cloud networks.
A. The problem

Cloud providers architect their networks and design their management APIs to achieve economies of scale over both infrastructural and operational costs. To maximize the performance of these networks while maintaining economies of scale, cloud providers build high-bisection-bandwidth networks with cheap commodity switches interconnected as multi-rooted trees [2], [6]. Providers recuperate the cost of the networks indirectly through the fees paid by the tenants sharing them. As a direct result, providers are motivated to maximize the number of tenants sharing the network while simultaneously ensuring fairness.

The need to ensure maximal utility while providing fairness leads to two novel problems. The first is efficient utilization of a multi-rooted tree network of commodity devices: given the existence of multiple paths, the cloud must be able to leverage path parallelism and utilize all available bandwidth. The second is enforcing fairness between different VMs, and thus providing performance isolation. The notion of fairness is further complicated by the fact that fair share is an application-specific concept. For applications like web services, an important problem is to reduce the tail latency; a long tail latency potentially harms the user experience.

B. The challenge

Although piecemeal solutions exist for each of the two problems, no holistic solution solves both. For example, ECMP [] and MPTCP [2] both exploit parallelism by spreading load across links, but aim only for fairness among individual flows. A direct result of ignoring VMs when enforcing fairness is that malicious or selfish VMs may create multiple flows and starve or hurt other VMs. The scheme in [22] overcomes this drawback by providing per-VM fairness and enforcing isolation at the level of VMs rather than flows. Unfortunately, it fails to exploit path parallelism, because it enforces per-VM isolation by restricting each VM's traffic to a single path.
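The per-flow versus per-VM fairness gap can be made concrete with a toy calculation (our own illustration; the flow counts and weights are assumed, not taken from the paper's evaluation): on a single shared link, per-flow fairness hands a VM a share proportional to how many flows it opens, while per-VM fairness pins the share to the administrator-assigned weight.

```python
def per_flow_share(flows_a: int, flows_b: int) -> float:
    """Fraction of one link VM A gets when bandwidth is split
    equally among flows (TCP-like per-flow fairness)."""
    return flows_a / (flows_a + flows_b)

def per_vm_share(weight_a: float, weight_b: float) -> float:
    """Fraction VM A gets when bandwidth is split by VM weight,
    independent of how many flows each VM opens."""
    return weight_a / (weight_a + weight_b)

# A selfish VM opening 9 flows against a well-behaved VM's single flow:
print(per_flow_share(9, 1))    # 0.9 of the link under per-flow fairness
print(per_vm_share(1.0, 1.0))  # 0.5 of the link under per-VM fairness
```

Opening more flows moves `per_flow_share` arbitrarily close to 1, which is exactly the starvation avenue that per-VM isolation closes.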
Without a holistic solution that both exploits path parallelism and enforces fairness, we argue that providers will be limited in the number of VMs they can support and in the types of service level agreements they can make. Motivated by this, we aim to answer the question: can we design a scalable rate allocation mechanism that achieves high utilization of the multi-rooted tree topology, per-source fairness among the VMs, and high tail throughput? The answer can help cloud providers meet their service level agreements and help tenants achieve a better user experience.

C. Our solution

We design and implement BigPi, a hypervisor-based multipath rate allocation scheme. Instead of sharing links, BigPi shares a cloud network at a new granularity, which we call link pools. A link pool consists of the parallel links with equal capacity from/to the same rack/pod in a multi-rooted tree. By evenly splitting each flow across all its shortest paths, we carefully design our protocol so that the links within each link pool share the same level of congestion. Thus, BigPi balances the congestion on all the parallel links within each link pool. This effectively reduces the per-link constraints to per-link-pool constraints, and hence gives us the opportunity to find rate allocations that suit cloud sharing better. By balancing the congestion among parallel links, BigPi intelligently shifts unsatisfied demands on a congested link to other links with excess bandwidth within

the same link pool. Such intra-link-pool traffic shifting not only results in higher overall utilization, but also improves tail throughput. Moreover, since fairness is not coupled with the sharing granularity, BigPi still achieves per-source fairness, like previous schemes [22].

We further explore the optimal flow-splitting strategy in multi-rooted trees. We find that a demand matrix that is feasible under an arbitrary set of routes is also feasible when each flow is split evenly across all its shortest paths. Because a link pool consists of links from/to the same rack/pod, the removal of a link pool disconnects the corresponding rack/pod from the rest of the network; it is therefore a cut. Because the source and destination are on the two sides of each link pool, all flows must pass at least the same set of link pools as the shortest paths do. Detoured flows might bounce around at intermediate switches, increasing the congestion level on other link pools. By evenly splitting each flow across the shortest paths, BigPi incurs the least possible congestion, and achieves higher utilization and tail throughput than other routing strategies.

The main idea of BigPi is to estimate the rate of a VM on each link by periodically running a weighted multipath rate adaptation loop, and then to divide the bottleneck rate among its subflows. When there is no link failure or background traffic, BigPi's rate estimates for the links within the same link pool are theoretically the same, so BigPi effectively implements the even split of each flow across all its available paths. In the weighted multipath rate adaptation loop, BigPi periodically applies a weighted additive increase for every destination, and a weighted multiplicative decrease for every path. Compared with a single-path rate adaptation loop, the multipath rate adaptation loop increases network utilization and tail throughput.
We choose per-path feedback in the design of BigPi to achieve graceful degradation when there are link failures or background traffic.

D. Our contributions

In summary, this work makes the following contributions:
- We highlight the fact that a cross-layer design is required for protocols that allow efficient and fair sharing of cloud networks. Based on this observation, we propose BigPi, a hypervisor-based rate allocation scheme that provides VM-level performance isolation in multi-rooted tree networks.
- We design a multipath rate adaptation loop that evenly splits flows across all shortest paths while tolerating link failures and background traffic.
- We implement a prototype of BigPi and evaluate it on the Deterlab testbed [3]. Our evaluations show that BigPi achieves high utilization, per-source fairness and high tail throughput.
- We also implement BigPi in ns-2 [4], an event-driven simulator. Our extensive simulations show that BigPi achieves VM-level isolation across various traffic patterns and topologies, and can scale to 248 endhosts.

The rest of this paper is organized as follows. Section II introduces the background. Section III discusses the goals and challenges. Sections IV and V discuss the details of our design and implementation. We evaluate BigPi in Sections VI and VII, discuss alternative design choices in Section VIII, and conclude in Section IX.

II. BACKGROUND AND RELATED WORK

There are two categories of work in the field of sharing cloud networks. The first category proposes policies to share different traffic patterns and achieve performance isolation in multi-tenant networks. The second category focuses on achieving high utilization in multi-rooted tree topologies. We discuss both, along with our findings about them.

A. Performance isolation in the public cloud

The public cloud is shared by multiple mutually untrusting tenants running a wide range of applications.
Without network-level performance isolation, traffic from selfish or malicious tenants can interfere with, or even starve, well-behaved tenants. For example, with existing protocols like TCP, a selfish or malicious VM can get more than its fair share, or even launch denial-of-service attacks, by sending high-rate UDP flows or multiple parallel flows. Having to compete with the selfish/malicious VM, a well-behaved VM gets less than its fair share and can suffer starvation.

There are two approaches to the problem of performance isolation. The first approach delivers packets with best effort and achieves per-source fairness. The second approach provides quality of service through a minimum bandwidth guarantee for each VM.

1) Per-source fairness: The scheme in [22] is designed to achieve per-source fairness. Its controller is located in the virtualization network stack of the end hosts. The administrator decides the weight of each VM, and the controller divides the per-VM weight among the VM's flows. The controller enforces the per-flow rate assignment of each flow through a congestion-controlled tunnel.

Poor route selection is the major problem for this single-path scheme in a multi-rooted tree topology: some routes might be more congested than others. This leads to two consequences. First, as with all single-path transport protocols, some of the paths are underutilized. Second, because per-source fairness does not provide any guarantee for an individual flow, one-to-many flows can yield to one-to-one flows on more congested routes, resulting in poor tail latency. In applications like web services, this harms the experience of the end users [27].

2) Minimum guarantee: The second approach aims to provide a minimum guarantee for every VM. Oktopus [8] uses reservation to achieve a minimum bandwidth guarantee; however, it is not work conserving. EyeQ [4] assumes a network with full bisection bandwidth, and only shares the access links.
ElasticSwitch [2] achieves minimum guarantees and work conservation at the same time. To provide a minimum guarantee, admission control and planned VM placement are in general necessary [8]. One challenge of admission control is to design a policy that strikes a balance between complexity and performance. A simple admission policy like the virtual cluster [8] results in unused bandwidth and fewer VMs admitted at each host.

To admit more VMs and achieve higher utilization, admission policies usually require more information about the traffic matrices of the cloud applications. For example, Proteus [26] observed the variance of demand across different MapReduce phases, and requires profiling of the data-intensive applications. CloudMirror [6] requires an in-depth understanding of the network demands between different application components. The authors of [23] observed that 3% of their traffic is between different tenants, and their policy requires an understanding of the communication patterns between tenants. Because the cloud provider does not have direct access to those applications, these policies assume that the tenants have an in-depth understanding of their temporal and spatial network usage patterns, as well as the incentives to share that information with the cloud provider.

B. Achieving high utilization in multi-rooted trees

The topologies of public clouds are dominated by multi-rooted trees. The Fattree topology [6] in Figure 1 and the Cisco [2] topology are two examples. A multi-rooted tree usually consists of several layers of switches. For example, in the 3-layer multi-rooted tree shown in Figure 1, at the top level of the cluster are the core switches. A core switch interconnects the pods below it. Each pod is an independent and replicable management unit. At the top level of a pod are the aggregate switches. An aggregate switch interconnects the racks below it. Each rack consists of a ToR switch and a few hosts. The path parallelism lies in two layers: first, each rack is connected to multiple aggregate switches; second, each pod is connected to multiple core switches. Instead of using high-end switches as the core of the network, topologies like Fattree create multiple parallel paths from a source to a destination through multiple commodity switches, and thus achieve a similar oversubscription ratio at a much lower cost.
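The two layers of path parallelism can be counted explicitly. The sketch below builds a canonical k-ary Fattree (the wiring convention — aggregate switch a of each pod reaching cores a·k/2 through a·k/2 + k/2 − 1 — is the standard construction; the node encoding is our own) and counts shortest paths between hosts with BFS:

```python
from collections import deque
from itertools import product

def build_fattree(k):
    """Adjacency list of a k-ary Fattree: k pods, (k/2)^2 cores,
    k/2 aggregates and k/2 ToRs per pod, k/2 hosts per ToR."""
    adj, h = {}, k // 2
    def link(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    for p in range(k):
        for a in range(h):
            for c in range(a * h, a * h + h):       # each agg reaches h cores
                link(('agg', p, a), ('core', c))
            for t in range(h):                      # full agg-ToR mesh per pod
                link(('agg', p, a), ('tor', p, t))
        for t, x in product(range(h), range(h)):
            link(('tor', p, t), ('host', p, t, x))
    return adj

def count_shortest_paths(adj, src, dst):
    """Number of distinct shortest paths from src to dst (BFS counting)."""
    dist, ways, q = {src: 0}, {src: 1}, deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v], ways[v] = dist[u] + 1, ways[u]
                q.append(v)
            elif dist[v] == dist[u] + 1:
                ways[v] += ways[u]
    return ways.get(dst, 0)

adj = build_fattree(4)
print(count_shortest_paths(adj, ('host', 0, 0, 0), ('host', 1, 0, 0)))  # inter-pod: 4
print(count_shortest_paths(adj, ('host', 0, 0, 0), ('host', 0, 1, 0)))  # intra-pod: 2
```

For k = 4 this yields (k/2)² = 4 shortest paths between hosts in different pods and k/2 = 2 within a pod — one count per layer of parallelism described above.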
Given the sheer number of flows and the high churn observed in today's datacenter networks, an important problem is to achieve high utilization of those parallel paths in a scalable way. There are a few bandwidth allocation schemes, at different layers, that can highly utilize multi-rooted trees; Equal-Cost Multi-Path (ECMP) [] and multipath TCP [2] are two of them. First, at the network layer, ECMP leverages the path symmetry and maps each flow to a path by hashing the source and destination IP addresses and the application ports. Second, at the transport layer, MPTCP divides a flow into multiple subflows on different paths, and performs congestion control and path selection. Because these algorithms are not specifically designed for public clouds, performance isolation is not one of their goals.

III. GOALS AND CHALLENGES

This paper targets the problem of sharing a multi-rooted tree topology among a set of mutually untrusting tenants. The tenants are selfish, and can be malicious. The VMs owned by the tenants can send to an arbitrary number of destinations, with any kind of transport protocol.

Fig. 1. The VMs sending one flow and many flows have different amounts of bandwidth to share in a Fattree topology.

Because the minimum-guarantee approach requires the tenants to understand their network demand and to have the incentive to share information about their applications, in this work we focus on the best-effort approach.

A. Goals

BigPi is a distributed multipath rate allocation scheme that aims to achieve the following four goals.

1) Scalability: BigPi should scale to the sheer number of machines and the churn observed in today's cloud networks [9].

2) High utilization of multi-rooted trees: Because there are multiple parallel paths between any source-destination pair, when there is unsatisfied demand, none of the paths should be underutilized.
With high utilization, throughput-sensitive jobs have no excess bandwidth left to leverage, and thus cannot further improve their completion times.

3) Per-source fairness: Each VM is assigned a weight. With per-source fairness, each VM gets only its weighted fair share of the bottleneck links. A malicious or selfish tenant is unable to send more traffic by simply changing the number of flows or the underlying transport protocol.

4) High tail throughput: In addition to per-source fairness, the per-flow allocation inside a VM should be max-min fair, to prevent low tail throughput and the consequent harm to the service quality of cloud applications. For example, suppose a database server VM sends two flows to two different clients. If the allocation is biased, one client might be blocked, resulting in a slow response to a web server and consequently a timeout at the web server. This can harm the user experience of the web service [27].

B. Challenges

In this section, we analyze the challenges of achieving all the goals with a single-path rate allocation scheme. To achieve high utilization, a single-path rate allocation scheme is often complemented by a multipath routing algorithm. Because centralized routing algorithms have potential scaling problems [], we only consider distributed multipath routing algorithms like ECMP, where a flow randomly selects a route. With such routing algorithms, the VMs sending one-to-many flows cover more links at the inter-rack and inter-pod

links than the VMs sending one-to-one flows. This also creates unbalanced congestion, where some links carry more flows than others. For example, in Figure 1, the red VM covers more inter-rack links than the blue VM by sending multiple flows, and the links carrying the blue flows carry more flows than the other links at the same tier.

Fig. 2. Examples showing the tradeoffs between our goals: (a) three VMs share two bottlenecks; (b) two VMs, sending one flow and two flows respectively, share two bottlenecks.

We find two problems that prevent a single-path rate allocation scheme from achieving all the goals.

1) High utilization vs. per-source fairness: A single-path rate allocation scheme makes sure every VM gets its weighted fair share on each path. If all the routes of a VM are congested, the VM can get a rate less than its demand. By the definition of per-source fairness, all other VMs with no less demand get the same rate. As a result, there might be excess bandwidth on the unused paths. The congested VM could have used that bandwidth to improve its rate without harming others; thus, the single-path rate allocation scheme violates the high-utilization goal. For example, in Figure 2(a), three VMs share two bottleneck links, each with a demand of 1. To achieve per-source fairness, regardless of the routing algorithm, at least two of the VMs have unsatisfied demand while there is excess capacity on the alternate route.

2) High utilization + per-source fairness vs. high tail throughput: In a single-path rate allocation scheme, a VM with more flows usually has more paths to leverage. To achieve high utilization, the VM tends to use the bandwidth on less congested paths, and to yield the capacity of a more congested path to competing VMs with fewer flows, for better per-source fairness. As a result, each of its flows on the more congested path gets a lower throughput than its other flows.
For example, in Figure 2(b), there are two source VMs. The blue source VM in the first rack is sending two flows, b1 and b2, and the red source VM is sending one flow, r. The first blue flow b1 shares a bottleneck with r. With high utilization and per-source fairness, b2 and r will get a throughput of 1; however, b1 will not get any throughput. This allocation blocks flow b1 and results in poor tail throughput for the blue VM. If we instead choose to equalize the throughput of b1 and b2, we cannot achieve per-source fairness and high utilization at the same time. Getting the routes right solves the problem: if the blue flows pass agg1 and the red flow passes agg2, a single-path rate allocation scheme achieves all the goals. However, with distributed multipath routing protocols like ECMP, the probability of getting the right routes is only /4. This probability further decreases to 2/8 when there are 3 aggregation switches.

Fig. 3. (a) A Fattree topology and (b) its equivalent tree.

While the two problems are different, both are caused by unbalanced congestion on different paths. If there are multiple parallel dedicated paths with the same bandwidth and delay between two nodes, we can balance the congestion by having the sender evenly split its flows across all those parallel paths. Together, those paths can be viewed as a pool whose bandwidth equals the aggregate bandwidth of all the paths. Instead of sharing individual paths, we share the pool, and thus increase the probability of achieving all the goals. In the previous examples, by pooling the paths between ToR1 and ToR2, it is possible to achieve all the goals with a properly designed multipath rate allocation scheme. In Section IV, we generalize this observation into the design of BigPi, a multipath rate allocation scheme that shares the link pools in multi-rooted trees.
IV. DESIGN OF BIGPI

Inspired by the concept of resource pools [24] and equivalence classes [2], BigPi explicitly defines the link pools in a multi-rooted tree and constructs the equivalent tree from the link pools, as in Figure 3. Link pools are the basic sharing units in BigPi. Sharing the link pools makes more flow demands feasible than any other routing choice, and thus makes it more likely to achieve all the goals.

We design BigPi, a multipath rate allocation scheme, to share the link pools. BigPi runs inside the hypervisor of the sending host. It is a distributed solution, and scales well under the churn of today's public cloud networks. BigPi leverages the multiple paths between a source-destination pair, and shares the link pools by balancing the congestion on the links within the same pool. BigPi divides the bandwidth of a bottleneck link pool among different VMs according to their weights on that link pool, and thus achieves high utilization and per-source fairness without harming the tail throughput of the flows within a pool. In the subsequent sections, we first introduce the concept of link pools and show its advantages over other routing choices (Section IV-A). We then carefully design our rate allocation scheme to fairly share the link pools (Section IV-B).

A. Link pools in a multi-rooted tree

The idea of a link pool is to combine parallel links and treat them as a single link. The links within a link pool share

the congestion, and thus mitigate the problems caused by unbalanced congestion on different paths. In a multi-rooted tree, if we evenly split each flow across all its shortest paths, the links from/to the same rack/pod carry the same amount of congestion, because of the symmetry of the topology and the traffic pattern. We define those links as a link pool. We then show that if a demand is feasible with an arbitrary set of routes, it is also feasible by evenly splitting every flow across all its shortest paths.

1) Link pool:

Definition 1. All the links from/to the same ToR switch or the same pod form a link pool. A link inside a rack forms a link pool by itself. Any two links from the same link pool are equivalent links.

Figure 3(a) shows an example of the link pools: the links with the same color and dot type are in the same link pool. If we use all 4 shortest paths, and evenly split the flows between the two hosts across these paths, the amount of traffic on the links in the same link pool is the same. In fact, if every flow evenly splits its traffic over all its available paths, the amounts of traffic on the colored equivalent links are the same. We generalize this observation to all the link pools in Lemma 1.

Lemma 1. Given a traffic matrix, assume each flow is infinitely divisible. If we evenly split each flow across all its available paths, then the amounts of traffic on two links are the same if they are in the same link pool.

Proof: We first prove the "from" direction; a similar proof follows for the "to" direction because of the symmetry in multi-rooted trees. We first prove the claim for equivalent links from the same rack, and then for equivalent links from the same pod. When a flow arrives at a ToR switch from an end host, it has the same number of paths with the same hop count through each of its outgoing links (the ToR-aggregate links). Hence the algorithm evenly splits the flow over those outgoing links.
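Before continuing to the pod-level half of the proof, the balance that Lemma 1 asserts can be checked numerically. A minimal sketch for a 4-ary Fattree (the path enumeration and the example flow set are our own assumptions, not the paper's workload):

```python
from collections import defaultdict

H = 2  # k/2 for a 4-ary Fattree: 2 aggregates and 2 ToRs per pod

def shortest_paths(src, dst):
    """All shortest switch-level paths between hosts (pod, rack, idx).
    Core (a, c) attaches to aggregate a of every pod."""
    sp, sr, _ = src
    dp, dr, _ = dst
    if (sp, sr) == (dp, dr):                       # same rack
        return [[('tor', sp, sr)]]
    if sp == dp:                                   # same pod: ToR-agg-ToR
        return [[('tor', sp, sr), ('agg', sp, a), ('tor', dp, dr)]
                for a in range(H)]
    return [[('tor', sp, sr), ('agg', sp, a),      # inter-pod: via a core
             ('core', a, c), ('agg', dp, a), ('tor', dp, dr)]
            for a in range(H) for c in range(H)]

def link_loads(flows):
    """Evenly split each flow over all its shortest paths and
    accumulate the resulting load on every directed link."""
    load = defaultdict(float)
    for (s, d), demand in flows.items():
        paths = shortest_paths(s, d)
        for path in paths:
            hops = [('host', s)] + path + [('host', d)]
            for u, v in zip(hops, hops[1:]):
                load[(u, v)] += demand / len(paths)
    return load

flows = {((0, 0, 0), (1, 0, 0)): 3.0,   # inter-pod
         ((0, 0, 0), (0, 1, 0)): 1.0,   # intra-pod
         ((0, 1, 0), (1, 1, 0)): 2.0}   # inter-pod
loads = link_loads(flows)
print(loads[(('tor', 0, 0), ('agg', 0, 0))])   # 2.0 on both rack-(0,0) uplinks
print(loads[(('agg', 0, 1), ('core', 1, 1))])  # 1.25 on all four pod-0 uplinks
```

Both uplinks of rack (0,0) carry 2.0, and all four aggregate-to-core uplinks of pod 0 carry 1.25, matching the lemma even though the demands are unequal.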
Because the aggr-core links carry only inter-pod flows, we consider only those flows. First, the amount of traffic from the same aggregate switch is the same, following the same logic as above. Because the aggregate switches in the same pod are connected to the same set of ToR switches, and all of those ToR switches evenly split their traffic across these aggregate switches, the aggregate switches receive equal amounts of incoming traffic. Because the outgoing traffic equals the incoming traffic, and is again split evenly across the aggr-core links, the aggr-core links from the same pod carry the same amount of traffic.

2) Equivalent tree: Figure 3(b) shows the equivalent tree of the Fattree in Figure 3(a). According to Lemma 1, if we evenly split each flow across its shortest paths, the equivalent links carry the same amount of traffic, and thus impose the same bandwidth constraint. Within a pod, every ToR switch forms a link pool with all the aggregate switches. Thus, by collapsing all the aggregate switches into one big aggregate switch, and linking it to each ToR switch with the aggregate traffic and bandwidth of the corresponding link pool, the constraints do not change. Similarly, we can collapse the core switches into a single core switch and construct an equivalent tree. Formally, we define the equivalent tree of a multi-rooted tree by a top-down recursive construction in Algorithm 1.

Algorithm 1: Construction of the Equivalent Tree

For a multi-rooted tree MT(V, E), we construct its equivalent tree T(V', E') as follows:

1) Create the core. We create the single core switch c in V'.

2) Create the pods. For each pod P in MT, we create P' in T as follows:

a) Create an aggregate switch and connect it to the core c. We create an aggregate switch a' in the node set of P' and connect it to the core c with a link of bandwidth n_pod,aggr-core x B_aggr-core, where n_pod,aggr-core is the number of outgoing aggr-core links per pod, and B_aggr-core is the bandwidth of an aggr-core link in E.
b) Directly replicate the racks. For each rack R in P, we create its isomorphic mapping R' in T, and connect its ToR switch t' to the aggregate switch a' with a link of bandwidth n_rack,ToR-aggr x B_ToR-aggr, where n_rack,ToR-aggr is the number of outgoing ToR-aggr links per rack, and B_ToR-aggr is the bandwidth of a ToR-aggr link in MT.

3) Correctness of the link pools: In this section, we prove that if a demand matrix is feasible for a multi-rooted tree with arbitrary routes, it is also feasible for its equivalent tree, and thus it is also feasible by splitting each flow evenly across all its shortest paths. We first show, in Lemma 2, that because of possible detours, the demand on any link pool in MT is no less than the demand on its corresponding link in T.

Lemma 2. Any path between a source-destination pair must pass the same set of link pools in MT, which map to its route in T.

Proof: A route in MT must first reach the ToR switch of the source; similarly, it must end at the ToR switch of the destination. Thus, if the communication is inter-rack, it must pass the pools from/to those ToR switches. Similarly, if the communication is inter-pod, it must reach an aggregate switch in the same pod as the source on the way up, and an aggregate switch in the same pod as the destination on the way down, and thus pass the link pools from/to these pods. Hence, regardless of detours, a route passes the same set of link pools, which map to its route in T.

We define the rate allocation matrix for a graph G as D(G), where the demand from Host i to Host j is D_ij(G).

Theorem 1. For a multi-rooted tree MT(V, E) and its equivalent tree T(V', E') constructed according to Algorithm 1, a demand matrix D is feasible in T if and only if it is also feasible in MT.

Proof: Let D be a demand matrix for MT. For a demand D_ij from Host i to Host j, we construct our routing protocol as follows. If the demand is intra-rack, there is a single path, and we allocate the demand to that path. If it is within the same pod but not inside the same rack, we evenly split D_ij across the n_rack,ToR-aggr shortest paths. Otherwise, it is inter-pod, and we evenly split D_ij across the n_pod,aggr-core shortest paths. According to the construction of the equivalent tree and Lemma 1, there is a one-to-one mapping from a bandwidth constraint on a pool in MT to the bandwidth constraint on its equivalent link in T. Hence the feasibility of a demand in T implies its feasibility in MT. To prove the other direction, we need to consider routes in general, i.e., routes in MT may contain detours. However, because Lemma 2 has shown that the demand on a link pool in MT is no less than the demand on its corresponding link in T, a demand is feasible in T if it is feasible in MT.

B. Rate Allocation Scheme

We place BigPi in the hypervisor of the end hosts. It scales well compared with putting the rate control at a centralized controller, or enforcing fair queuing at the switches. Inspired by the rate allocator of [22], in BigPi the administrator assigns each VM a network weight according to its type. A VM's network weight is divided among its destinations, according to their demands and current shares of the bandwidth. In BigPi, because we need to split a flow across all its paths, we add one more abstraction, the path: the weight for each destination is further divided among its paths. Periodically, the hypervisor estimates a VM's aggregate rate for every link on its routes by running a weighted multipath TCP-like rate adaptation loop over all the active paths to all the destinations. We carefully design the rate adaptation loop to make it tolerant to link failures and background traffic.
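The loop just described can be sketched in a few lines. This is our own simplified rendering of the idea (per-path feedback, weighted additive increase, weighted multiplicative decrease); the constant ALPHA and all names are assumptions, not the prototype's code:

```python
from collections import defaultdict

ALPHA = 0.125  # multiplicative-decrease factor (assumed value)

def adapt(rate, weight, paths, bytes_by_path, lossy_paths):
    """One feedback round of a weighted multipath AIMD loop.

    rate:          dict link -> current estimated rate r_l for this VM
    weight:        the VM's administrator-assigned weight w
    paths:         dict path_id -> list of links on that path
    bytes_by_path: dict path_id -> bytes the receiver reported for it
    lossy_paths:   set of path_ids whose feedback reported a loss
    """
    total = sum(bytes_by_path.values()) or 1
    # Bytes each link carried, summed over the paths crossing it.
    link_bytes = defaultdict(float)
    for p, links in paths.items():
        for l in links:
            link_bytes[l] += bytes_by_path[p]
    for p, links in paths.items():
        if p not in lossy_paths:
            share = bytes_by_path[p] / total              # path's global proportion
            for l in links:
                rate[l] += weight * share                 # weighted additive increase
        else:
            for l in links:
                share_on_l = bytes_by_path[p] / (link_bytes[l] or 1)
                rate[l] -= rate[l] * ALPHA * share_on_l   # weighted multiplicative decrease
    return rate

print(adapt({'l1': 1.0, 'l2': 1.0}, 2.0,
            {'p1': ['l1'], 'p2': ['l2']},
            {'p1': 100, 'p2': 100}, set()))
# {'l1': 2.0, 'l2': 2.0}
```

With two loss-free, link-disjoint paths reporting equal byte counts, a weight-2 VM raises each link's estimate by 2 x 0.5, while a loss on one path scales down only that path's links, by at most ALPHA.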
BigPi then finds the bottleneck link on each path, and allocates its rate to all the flows in a max-min fair fashion. Similar to the weighted TCP-like rate adaptation loops of [], [22], [8], BigPi uses UDP packets to send congestion feedback from the receiver hypervisor to the sender hypervisor.

When there is no link failure or background traffic, a per-destination rate adaptation loop evenly splits each flow across all its shortest paths, and is thus able to achieve all our goals. However, in real datacenters failures are frequent [9], and background traffic may interfere with the normal traffic. With per-destination feedback, a failure or a conflict with background traffic on one path decreases the rates of the subflows on the other paths. As a result, throughput-sensitive applications in the VM might take substantially longer to finish and miss important deadlines.

BigPi is designed to tolerate link failures and background traffic. Instead of running a single rate adaptation loop between each source-destination pair, BigPi runs a separate rate adaptation loop for each path. When there is no background traffic or link failure, BigPi evenly splits a flow across all the paths: all the paths pass the same set of link pools, and within each link pool the links have the same congestion level, so all the paths to the same destination are expected to get the same rate estimate. To cope with background traffic or link failures, BigPi achieves graceful performance degradation by fully utilizing the normal paths, and responds quickly to failures or interference from background traffic on the affected paths.

Our rate adaptation logic is shown in Algorithm 2. Inspired by MPTCP [2], it consists of two parts. Periodically, if there is no loss on a path, BigPi triggers the weighted additive increase phase. This phase ensures that the VM increases its bandwidth share at a speed proportional to its weight.
If the feedback from the receiver hypervisor indicates a loss on the path, BigPi triggers the weighted multiplicative decrease phase. Only congested paths decrease their rates, by a fixed fraction α. This ensures quick response to link failures and background traffic, and achieves graceful performance degradation.

Algorithm 2: BigPi's rate adaptation loop to estimate a VM's rate r_l on a link l

// w: weight assigned to the VM.
// b_p: received bytes for path p.
// b_{p,l}: received bytes for path p on link l.
Periodically, for each path p crossing link l:
  if feedback is ACK then
    path p's global proportion: p_p <- b_p / Σ_i b_i
    r_l <- r_l + w · p_p
  else
    path p's proportion on l: p_{p,l} <- b_{p,l} / Σ_{i: l on path i} b_{i,l}
    r_l <- r_l − r_l · α · p_{p,l}
  end

V. IMPLEMENTATION

We implement our prototype as a user-level process that controls the hierarchical token bucket filter of tc. We modify Open vSwitch [5] to provide the congestion feedback. Our implementation consists of 25 lines of C++ code. Similar to [25], we create multiple paths between two VMs by assigning multiple addresses to a VM with IP aliasing, in the form path_id.pod_id.rack_id.host_id. Each intermediate hop chooses the right path according to the destination IP address. We implement the rate limiting logic in user space. Periodically (every 25 ms), BigPi computes the rate limit for each path between every source-destination pair and configures the rate limit of the hierarchical token bucket filter for the corresponding pair of source and destination IP addresses. To enable congestion detection, we modify the Open vSwitch code at the sender side to append a sequence number to each data packet. We modify the IP header and store the sequence number in the identification field; that field is reserved for reassembling fragmented packets, and we assume no packet fragmentation in datacenter networks. We also change the Open vSwitch code at the receiver side to detect holes in the sequence space on packet arrivals. We cache the congestion detection results and send the feedback UDP packets from Open vSwitch every millisecond, using the high-resolution timers in the kernel to control the timing of the feedback packets. Because BigPi evenly splits every flow over all the available paths, the delays on those paths are expected to be very close; we also do not observe significant reordering for packets sent on different paths in our testbed experiments. It is possible to achieve in-order delivery of packets by maintaining a small buffer and a global sequence number. We leave this as future work.

VI. EVALUATION ON DETERLAB

In this section, we evaluate our design on Deterlab [3]. The topology is a 2-level multi-rooted tree with 4 racks and 2 aggregate switches; each rack has 3 hosts. The bandwidth of every link is 1 Gbps. Each endhost is a pc3060 machine with a dual 3 GHz Xeon processor, 2 GB of RAM, and 4 network interface cards (NICs). Each switch is also a pc3060 machine running Open vSwitch, forwarding packets based only on the destination address. The two source VMs in the experiment have the same weight. The well-behaved VM sends traffic to one destination. We evaluate two scenarios for the selfish VM. In the first scenario, the selfish VM sends multiple flows to the same destination. In the second scenario, the selfish VM sends to multiple destinations. The sources and destinations are in different racks, and none of the VMs share a host. Because the result of Seawall depends on the underlying routes, we run the evaluation 20 times with routes randomly selected by ECMP. Each evaluation lasts 60 seconds. We plot the maximum, average, and minimum statistics for Seawall to give the reader a complete view of its performance. We use the proportion of selfish traffic to quantify per-source fairness.
1) Selfish VM with multiple flows: In this scenario, we let the selfish VM send multiple flows to one destination. Figure 4(a) plots the aggregate throughput of the two VMs. With more flows, the gap between the maximum and minimum throughputs of Seawall is much smaller, because ECMP distributes the flows more evenly over all the available paths when there are more of them. BigPi achieves better throughput. Figure 4(b) plots the proportion of selfish traffic. Both BigPi and Seawall are able to achieve per-source fairness. Seawall combines the feedback for flows to the same destination, while ECMP maps the flows to different paths, so those feedbacks interfere with each other. For example, consider the scenario where the selfish VM sends two flows. With a probability of 1/2, the two small flows are mapped to different paths, with one of them colliding with the large flow, and Seawall achieves the average value: the loss feedback is applied to both flows, resulting in suboptimal throughput for the other flow of the selfish VM. In this scenario, Seawall trades high utilization for fairness between flows within a VM. BigPi, by sharing on a per-link-pool basis, achieves higher utilization than Seawall.

Fig. 4. The aggregate throughput and the proportion of selfish traffic, when the selfish VM sends multiple flows to the same destination

Fig. 5. The aggregate throughput and the proportion of selfish traffic, when the selfish VM sends flows to multiple destinations

2) Selfish VM with multiple destinations: In this scenario, we let the selfish VM send traffic to multiple destinations. Figure 5(a) shows the aggregate throughput for both VMs. While BigPi achieves a bottleneck utilization of around 90%, Seawall achieves a bottleneck utilization of around 75%.
The gap between the throughputs of BigPi and Seawall is smaller than in Figure 4(a), because the multiple flows from the selfish VM no longer share the same loss feedback. However, BigPi has a higher throughput than Seawall in general. Figure 5(b) shows the percentage of selfish traffic. The selfish traffic of BigPi gets a share of around 50%; the selfish traffic of Seawall, however, gets a noticeably larger share than the well-behaved traffic — up to around 60% in some circumstances. In this scenario, Seawall trades the tail throughput for higher utilization and per-source fairness. Figure 6 shows the ratio between the minimum and maximum flow throughputs within the selfish VM; this figure evaluates the allocation's tail throughput compared with the other flows. Because both source VMs are bounded by the network, we would expect equal throughput for flows within the selfish VM, and thus a ratio of 1. The ratio is significantly smaller than 1 for Seawall, while BigPi achieves a ratio of approximately 1. The reason is that Seawall trades the tail throughput within a VM for per-source fairness. BigPi eliminates this tradeoff by per-link-pool sharing, and achieves an in-VM rate allocation closer to the max-min allocation according to the flow demands. We have also evaluated the case where flows are rate-limited by the host. In this experiment, we limit the rate of one of the one-to-many flows and measure the ratio between the minimum and maximum throughputs of the other one-to-many flows without rate limits. The result is shown in Figure 7. Similar to the case above, the rate limit on one flow does not change the tail throughput much.
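The 1/2 collision probability in the two-flow ECMP example above can be checked by enumerating the equal-cost hash outcomes — a toy verification, not part of the testbed setup:

```python
from itertools import product

# Two equal-cost paths; ECMP hashes each of the selfish VM's two flows
# independently and uniformly onto path 0 or path 1.
placements = list(product([0, 1], repeat=2))

# The two small flows land on different paths -- so one of them must share
# a path with the well-behaved VM's large flow -- in half of the outcomes.
split = sum(1 for a, b in placements if a != b)
print(split / len(placements))  # 0.5
```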

Fig. 6. The ratio between the minimum and maximum flow throughputs within the selfish VM, when the selfish VM is sending flows to multiple destinations

Fig. 7. The ratio between the minimum and maximum flow throughputs within the selfish VM as a function of the rate limit (Mbps), when the selfish VM is sending flows to three destinations

VII. SIMULATION RESULTS

Because of the limits on the number of machines and the number of host ports, we use extensive ns-2 simulations to evaluate the scalability of BigPi. We first vary parameters of the topology and the traffic pattern and evaluate our algorithms against those parameters. We then change the number of available paths for each flow. In the last part, we investigate scalability in a 2048-node fat-tree variant.

Simulation setup. We use a 2-level multi-rooted tree consisting of 2 racks, with 10 hosts in each rack. All the links in the topology have a bandwidth of 1 Gbps and a delay yielding a round-trip time of 2 ms; we choose the bandwidth and delay according to our measurements on Deterlab. The control interval for the rate controller is 25 ms. By default, two core switches interconnect the two racks. Each simulation runs for 60 seconds. Because the paths used play an important role for single-path Seawall, we run each simulation 5 times with different random seeds. We plot the maximum, average, and minimum statistics for Seawall. We use the proportion of selfish traffic to quantify per-source fairness.

A. Robustness to topology change

In practice, topologies and traffic patterns vary widely from one cloud network to another. In this section, we study the effect of those parameters on our algorithm.
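The per-source fairness metric used throughout these plots — the proportion of selfish traffic — is simple enough to state as code. This helper is a hypothetical illustration, not part of the BigPi prototype:

```python
def selfish_share(throughput, selfish_vms):
    """Proportion of selfish traffic: the aggregate throughput of the
    selfish VMs divided by the total throughput of all VMs."""
    total = sum(throughput.values())
    return sum(throughput[v] for v in selfish_vms) / total

# With equal weights and ideal per-source fairness, one well-behaved VM
# and one selfish VM should each end up with half of the bottleneck:
print(selfish_share({"well_behaved": 0.5, "selfish": 0.5}, ["selfish"]))  # 0.5
```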
Specifically, we vary the fanins and fanouts and show their impact on the different bandwidth allocation schemes. The fanin of a switch is the number of active endhosts sending traffic to it; it determines the number of flows a switch multiplexes, and the burstiness of the traffic. The fanout of a switch is the number of outgoing links from the switch; it determines the level of path diversity in the topology. In terms of traffic pattern, we vary the mix ratio between one-to-one flows and one-to-many flows.

Fig. 8. The utilization of the core network and the proportion of selfish traffic, when the selfish VMs send multiple flows to the same destination

Fig. 9. The utilization of the core network and the proportion of selfish traffic, when the selfish VMs send flows to multiple destinations

Throughput. Figures 8(a) and 9(a) show the aggregate throughput for all VMs with changing fanins. Each curve represents a different fanout; the error bars represent the maximum, minimum, and average throughput under different traffic mixes. We take the average over each traffic matrix for Seawall, since there are multiple ways of path selection. The utilization is computed as the ratio of the used core capacity. Note that the slope at the left side of each figure occurs when there is an insufficient number of active VMs; in those cases, the bottleneck is at the edge instead of the core. In both figures, for the same fanout, the curve for Seawall is always below the curve for BigPi, indicating a higher throughput for BigPi. The largest gap occurs when the fanout is approximately the same as the fanin. The traffic mix does not affect the result much. This is not surprising: BigPi is able to fully utilize the bottleneck link pool, whereas the collisions under ECMP make some links more congested than others. This is especially conspicuous when the fanin is approximately equal to the fanout, where ECMP is not able to balance the congestion evenly across paths.

Isolation. In Figures 8(b) and 9(b), we plot the proportion of selfish traffic. We normalize the throughput for each class of flows. The proportion of selfish traffic is always around 50% for BigPi, because BigPi evenly splits the flows over the links within a pool, and thus makes collisions less likely to happen. The results for Seawall differ with the type of selfish traffic. When the selfish flows are sent to the same destination, Seawall achieves a lower proportion of one-to-many traffic than its fair share, because of the shared congestion feedback among multiple flows within a VM. When the selfish flows are sent to different destinations, Seawall achieves a slightly higher throughput than its fair share. That is because, with ECMP, selfish VMs usually have more ways of leveraging the capacity on less congested paths, and are thus less likely to starve than well-behaved VMs. The result is consistent with our conclusions from the previous bar graphs.

Outliers for Seawall. The figures do not show the outliers for Seawall, since we take the average for Seawall over different path selections. We found that outliers are more likely to occur when there is a small percentage of selfish or well-behaved traffic and when the selfish traffic consists of multiple flows to the same destination. For example, in one of our simulations, there are 9 VMs, 8 of which are selfish. Out of all the flows, the well-behaved flow is allocated to one bottleneck with 5 of the selfish flows, and all the other 9 selfish flows are allocated to another bottleneck. In this scenario, because of Seawall's per-destination feedback, the selfish flow is only able to get 5/9 of the capacity of the first link, resulting in an average throughput of 3/9 for the selfish VMs and a throughput of 4/9 for the well-behaved VM. In this case, Seawall is significantly biased towards the VM with a single flow.

B. Varying the number of paths for each flow

One interesting question is whether a small number of paths is sufficient for leveraging the link pools. In this section, we change the number of paths for each flow. To observe a wider spectrum, we set the number of core switches to 5. In Figures 10 and 11, we plot the throughput for the different types of selfish traffic. Each curve represents a different number of paths used by the rate adaptation loop. If BigPi does not use all the paths, it randomly selects the paths in different runs and we take the average. The figures show that BigPi with 2 paths achieves higher throughput than with a single path, and is quite close to BigPi with all the paths; however, it shows a higher variance in aggregate throughput than BigPi with all the paths. The difference and the variance peak when the fanin is equal to the fanout. The proportion of selfish traffic is close to 50%, and does not change much with the number of paths.

Fig. 10. The aggregate throughput and the proportion of selfish traffic, when the selfish VM sends multiple flows to the same destination

Fig. 11. The aggregate throughput and the proportion of selfish traffic, when the selfish VM sends flows to multiple destinations

C. Scaling up

To evaluate scalability, we use a 16-pod fat-tree variant with 2048 hosts, where the over-subscription ratio within a pod is 1:2, and the over-subscription ratio between the aggr-core links and the ToR-aggr links is 1:1. All links have a bandwidth of 1 Gbps. We run our simulation with staggered traffic patterns. In the Staggered(p1, p2) pattern, each source has one destination on average and selects its destination randomly.
A flow has a probability p1 of staying within a rack, a probability p2 of crossing racks within a pod, and a probability 1 − p1 − p2 of crossing pods. We run our simulations with Staggered(0.2, 0.3) and a second staggered pattern with p2 = 0.3 but a different p1, to change the proportion of local traffic. We run 20 simulations for each staggered traffic pattern with different random seeds; each simulation runs for 60 seconds. In Figure 12, the well-behaved VMs send only one flow; the rest are selfish VMs, sending 2 to 6 flows each, with a decreasing probability of sending more flows. We plot the normalized throughput for each category. The two types of VMs get about the same throughput. Therefore, BigPi scales well to thousands of machines.

Fig. 12. The proportion of different types of VMs with different traffic patterns in a 16-pod fat-tree

VIII. DISCUSSION

Challenges of applying BigPi to more general topologies. The problem of achieving per-source fair allocation on more general topologies can be formulated as the max-min multi-commodity flow problem. If each flow is only allowed to use one path, it can be formulated as the unsplittable max-min multi-commodity flow problem [15], which is known to be NP-hard. If each flow is allowed to use more than one path, it can be formulated as the splittable max-min multi-commodity flow problem [17], which is solvable by iterative convex optimization. So far, however, we have not seen a distributed algorithm for general topologies.

Faster convergence of BigPi. In this paper, we use a multipath AIMD control loop for BigPi. Because the control interval is larger than the largest round-trip time in a datacenter network, and the bandwidth of datacenter networks is expected to continue to grow, a control loop that converges faster is desirable. [22] suggested applying Cubic TCP [13] and incorporating the explicit congestion feedback mechanism used by DCTCP [7]. This is also applicable to BigPi, because the rates of the subflows are controlled together, as one flow in the equivalent tree. The control loop, however, does not affect isolation in the long run; the benefit of a better control loop is mainly higher throughput in networks with large bandwidth-delay products.

Choice of flow weights. FairCloud [19] shows the tradeoff among minimum guarantees, proportionality, and utilization incentives, and suggests better ways of sharing the cloud. Instead of using per-source weights, it assigns a weight to each source-destination pair. BigPi is orthogonal to FairCloud: it can approximate PS-N and PS-P in FairCloud by applying FairCloud's weights and allocating flow capacity on a per-link-pool basis in multi-rooted trees.

IX. CONCLUSION

Sharing cloud networks has a tremendous influence on cloud computing. In this paper, we present BigPi, a new bandwidth allocation scheme for today's cloud networks. BigPi evenly splits flows over all the available paths and shares the network bandwidth at the granularity of link pools, instead of links. Compared with existing bandwidth allocation schemes, BigPi achieves higher utilization and stronger isolation.
BigPi also scales well to networks of up to 2048 hosts in our simulation. Therefore, we believe BigPi is a good choice for sharing today's cloud networks.

REFERENCES

[1] Analysis of an equal-cost multi-path algorithm. RFC 2992.
[2] Cisco data center infrastructure 2.5 design guide. /Data Center/DC Infra2 5/DCI SRNDa.pdf.
[3] Deterlab.
[4] The ns-2 simulator.
[5] Open vSwitch.
[6] Al-Fares, M., Loukissas, A., and Vahdat, A. A scalable, commodity data center network architecture. In Proceedings of ACM SIGCOMM 2008.
[7] Alizadeh, M., Greenberg, A., Maltz, D. A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., and Sridharan, M. Data center TCP (DCTCP). In Proceedings of ACM SIGCOMM 2010.
[8] Ballani, H., Costa, P., Karagiannis, T., and Rowstron, A. Towards predictable datacenter networks. In Proceedings of ACM SIGCOMM 2011.
[9] Benson, T., Anand, A., Akella, A., and Zhang, M. Understanding data center traffic characteristics. SIGCOMM Computer Communication Review 40, 1 (Jan. 2010).
[10] Crowcroft, J., and Oechslin, P. Differentiated end-to-end Internet services using a weighted proportional fair sharing TCP. SIGCOMM Computer Communication Review 28, 3 (July 1998).
[11] Curtis, A. R., Mogul, J. C., Tourrilhes, J., Yalagandula, P., Sharma, P., and Banerjee, S. DevoFlow: scaling flow management for high-performance networks. In Proceedings of ACM SIGCOMM 2011.
[12] Dixit, A., Prakash, P., Hu, Y., and Kompella, R. R. On the impact of packet spraying in data center networks. In Proceedings of IEEE INFOCOM 2013.
[13] Ha, S., Rhee, I., and Xu, L. CUBIC: a new TCP-friendly high-speed TCP variant. SIGOPS Operating Systems Review 42, 5 (July 2008).
[14] Jeyakumar, V., Alizadeh, M., Mazières, D., Prabhakar, B., Kim, C., and Greenberg, A. EyeQ: practical network performance isolation at the edge. In Proceedings of USENIX NSDI 2013.
[15] Kleinberg, J., Rabani, Y., and Tardos, E. Fairness in routing and load balancing. In IEEE FOCS 1999.
[16] Lee, J., Lee, M., Popa, L., Turner, Y., Banerjee, S., Sharma, P., and Stephenson, B. CloudMirror: application-aware bandwidth reservations in the cloud. In Proceedings of USENIX HotCloud 2013.
[17] Nace, D., Doan, L. N., Klopfenstein, O., and Bashllari, A. Max-min fairness in multi-commodity flows. Technical report, University of Technology of Compiègne.
[18] Nandagopal, T., Lee, K.-W., Li, J.-R., and Bharghavan, V. Scalable service differentiation using purely end-to-end mechanisms: features and limitations. Computer Networks 44, 6 (Apr. 2004).
[19] Popa, L., Kumar, G., Chowdhury, M., Krishnamurthy, A., Ratnasamy, S., and Stoica, I. FairCloud: sharing the network in cloud computing. In Proceedings of ACM SIGCOMM 2012.
[20] Popa, L., Yalagandula, P., Banerjee, S., Mogul, J. C., Turner, Y., and Santos, J. R. ElasticSwitch: practical work-conserving bandwidth guarantees for cloud computing. In Proceedings of ACM SIGCOMM 2013.
[21] Raiciu, C., Barre, S., Pluntke, C., Greenhalgh, A., Wischik, D., and Handley, M. Improving datacenter performance and robustness with multipath TCP. In Proceedings of ACM SIGCOMM 2011.
[22] Shieh, A., Kandula, S., Greenberg, A., Kim, C., and Saha, B. Sharing the data center network. In Proceedings of USENIX NSDI 2011.
[23] Singla, A., Hong, C.-Y., Popa, L., and Godfrey, P. B. Chatty tenants and the cloud network sharing problem. In Proceedings of USENIX NSDI 2013.
[24] Wischik, D., Handley, M., and Braun, M. B. The resource pooling principle. SIGCOMM Computer Communication Review 38, 5 (Sept. 2008).
[25] Wu, X., and Yang, X. DARD: distributed adaptive routing for datacenter networks. In Proceedings of IEEE ICDCS 2012.
[26] Xie, D., Ding, N., Hu, Y. C., and Kompella, R. The only constant is change: incorporating time-varying network reservations in data centers. In Proceedings of ACM SIGCOMM 2012.
[27] Xu, Y., Musgrave, Z., Noble, B., and Bailey, M. Bobtail: avoiding long tails in the cloud. In Proceedings of USENIX NSDI 2013.


More information

Performance of networks containing both MaxNet and SumNet links

Performance of networks containing both MaxNet and SumNet links Performance of networks containing both MaxNet and SumNet links Lachlan L. H. Andrew and Bartek P. Wydrowski Abstract Both MaxNet and SumNet are distributed congestion control architectures suitable for

More information

Resolving Packet Loss in a Computer Centre Applications

Resolving Packet Loss in a Computer Centre Applications International Journal of Computer Applications (975 8887) olume 74 No., July 3 Resolving Packet Loss in a Computer Centre Applications M. Rajalakshmi C.Angel K. M. Brindha Shree ABSTRACT The modern data

More information

Advanced Computer Networks. Datacenter Network Fabric

Advanced Computer Networks. Datacenter Network Fabric Advanced Computer Networks 263 3501 00 Datacenter Network Fabric Patrick Stuedi Spring Semester 2014 Oriana Riva, Department of Computer Science ETH Zürich 1 Outline Last week Today Supercomputer networking

More information

TRILL Large Layer 2 Network Solution

TRILL Large Layer 2 Network Solution TRILL Large Layer 2 Network Solution Contents 1 Network Architecture Requirements of Data Centers in the Cloud Computing Era... 3 2 TRILL Characteristics... 5 3 Huawei TRILL-based Large Layer 2 Network

More information

Improving Flow Completion Time for Short Flows in Datacenter Networks

Improving Flow Completion Time for Short Flows in Datacenter Networks Improving Flow Completion Time for Short Flows in Datacenter Networks Sijo Joy, Amiya Nayak School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada {sjoy028, nayak}@uottawa.ca

More information

OpenFlow Based Load Balancing

OpenFlow Based Load Balancing OpenFlow Based Load Balancing Hardeep Uppal and Dane Brandon University of Washington CSE561: Networking Project Report Abstract: In today s high-traffic internet, it is often desirable to have multiple

More information

DiFS: Distributed Flow Scheduling for Adaptive Routing in Hierarchical Data Center Networks

DiFS: Distributed Flow Scheduling for Adaptive Routing in Hierarchical Data Center Networks : Distributed Flow Scheduling for Adaptive Routing in Hierarchical Data Center Networks ABSTRACT Wenzhi Cui Department of Computer Science The University of Texas at Austin Austin, Texas, 78712 [email protected]

More information

The TCP Outcast Problem: Exposing Unfairness in Data Center Networks

The TCP Outcast Problem: Exposing Unfairness in Data Center Networks The TCP Outcast Problem: Exposing Unfairness in Data Center Networks Pawan Prakash, Advait Dixit, Y. Charlie Hu, Ramana Kompella Purdue University {pprakash, dixit, ychu, rkompella}@purdue.edu Abstract

More information

MANY big-data computing applications have been deployed

MANY big-data computing applications have been deployed 1 Data Transfer Scheduling for Maximizing Throughput of Big-Data Computing in Cloud Systems Ruitao Xie and Xiaohua Jia, Fellow, IEEE Abstract Many big-data computing applications have been deployed in

More information

Hedera: Dynamic Flow Scheduling for Data Center Networks

Hedera: Dynamic Flow Scheduling for Data Center Networks Hedera: Dynamic Flow Scheduling for Data Center Networks Mohammad Al-Fares Sivasankar Radhakrishnan Barath Raghavan * Nelson Huang Amin Vahdat UC San Diego * Williams College - USENIX NSDI 2010 - Motivation!"#$%&'($)*

More information

Datacenter Network Large Flow Detection and Scheduling from the Edge

Datacenter Network Large Flow Detection and Scheduling from the Edge Datacenter Network Large Flow Detection and Scheduling from the Edge Rui (Ray) Zhou [email protected] Supervisor : Prof. Rodrigo Fonseca Reading & Research Project - Spring 2014 Abstract Today, datacenter

More information

Towards Predictable Datacenter Networks

Towards Predictable Datacenter Networks Towards Predictable Datacenter Networks Hitesh Ballani, Paolo Costa, Thomas Karagiannis and Ant Rowstron Microsoft Research, Cambridge This talk is about Guaranteeing network performance for tenants in

More information

Per-Flow Queuing Allot's Approach to Bandwidth Management

Per-Flow Queuing Allot's Approach to Bandwidth Management White Paper Per-Flow Queuing Allot's Approach to Bandwidth Management Allot Communications, July 2006. All Rights Reserved. Table of Contents Executive Overview... 3 Understanding TCP/IP... 4 What is Bandwidth

More information

DARD: Distributed Adaptive Routing for Datacenter Networks

DARD: Distributed Adaptive Routing for Datacenter Networks DARD: Distributed Adaptive Routing for Datacenter Networks Xin Wu Dept. of Computer Science, Duke University Durham, USA [email protected] Xiaowei Yang Dept. of Computer Science, Duke University Durham,

More information

Performance Analysis of AQM Schemes in Wired and Wireless Networks based on TCP flow

Performance Analysis of AQM Schemes in Wired and Wireless Networks based on TCP flow International Journal of Soft Computing and Engineering (IJSCE) Performance Analysis of AQM Schemes in Wired and Wireless Networks based on TCP flow Abdullah Al Masud, Hossain Md. Shamim, Amina Akhter

More information

Load Balancing in Data Center Networks

Load Balancing in Data Center Networks Load Balancing in Data Center Networks Henry Xu Computer Science City University of Hong Kong HKUST, March 2, 2015 Background Aggregator Aggregator Aggregator Worker Worker Worker Worker Low latency for

More information

A Hybrid Electrical and Optical Networking Topology of Data Center for Big Data Network

A Hybrid Electrical and Optical Networking Topology of Data Center for Big Data Network ASEE 2014 Zone I Conference, April 3-5, 2014, University of Bridgeport, Bridgpeort, CT, USA A Hybrid Electrical and Optical Networking Topology of Data Center for Big Data Network Mohammad Naimur Rahman

More information

DARD: Distributed Adaptive Routing for Datacenter Networks

DARD: Distributed Adaptive Routing for Datacenter Networks : Distributed Adaptive Routing for Datacenter Networks TR-2- Xin Wu Xiaowei Yang Dept. of Computer Science, Duke University {xinwu, xwy}@cs.duke.edu ABSTRACT Datacenter networks typically have many paths

More information

How To Provide Qos Based Routing In The Internet

How To Provide Qos Based Routing In The Internet CHAPTER 2 QoS ROUTING AND ITS ROLE IN QOS PARADIGM 22 QoS ROUTING AND ITS ROLE IN QOS PARADIGM 2.1 INTRODUCTION As the main emphasis of the present research work is on achieving QoS in routing, hence this

More information

2004 Networks UK Publishers. Reprinted with permission.

2004 Networks UK Publishers. Reprinted with permission. Riikka Susitaival and Samuli Aalto. Adaptive load balancing with OSPF. In Proceedings of the Second International Working Conference on Performance Modelling and Evaluation of Heterogeneous Networks (HET

More information

Demand-Aware Flow Allocation in Data Center Networks

Demand-Aware Flow Allocation in Data Center Networks Demand-Aware Flow Allocation in Data Center Networks Dmitriy Kuptsov Aalto University/HIIT Espoo, Finland [email protected] Boris Nechaev Aalto University/HIIT Espoo, Finland [email protected]

More information

Optimizing Data Center Networks for Cloud Computing

Optimizing Data Center Networks for Cloud Computing PRAMAK 1 Optimizing Data Center Networks for Cloud Computing Data Center networks have evolved over time as the nature of computing changed. They evolved to handle the computing models based on main-frames,

More information

VMDC 3.0 Design Overview

VMDC 3.0 Design Overview CHAPTER 2 The Virtual Multiservice Data Center architecture is based on foundation principles of design in modularity, high availability, differentiated service support, secure multi-tenancy, and automated

More information

A Passive Method for Estimating End-to-End TCP Packet Loss

A Passive Method for Estimating End-to-End TCP Packet Loss A Passive Method for Estimating End-to-End TCP Packet Loss Peter Benko and Andras Veres Traffic Analysis and Network Performance Laboratory, Ericsson Research, Budapest, Hungary {Peter.Benko, Andras.Veres}@eth.ericsson.se

More information

Comparative Analysis of Congestion Control Algorithms Using ns-2

Comparative Analysis of Congestion Control Algorithms Using ns-2 www.ijcsi.org 89 Comparative Analysis of Congestion Control Algorithms Using ns-2 Sanjeev Patel 1, P. K. Gupta 2, Arjun Garg 3, Prateek Mehrotra 4 and Manish Chhabra 5 1 Deptt. of Computer Sc. & Engg,

More information

4 Internet QoS Management

4 Internet QoS Management 4 Internet QoS Management Rolf Stadler School of Electrical Engineering KTH Royal Institute of Technology [email protected] September 2008 Overview Network Management Performance Mgt QoS Mgt Resource Control

More information

Congestion Control Review. 15-441 Computer Networking. Resource Management Approaches. Traffic and Resource Management. What is congestion control?

Congestion Control Review. 15-441 Computer Networking. Resource Management Approaches. Traffic and Resource Management. What is congestion control? Congestion Control Review What is congestion control? 15-441 Computer Networking What is the principle of TCP? Lecture 22 Queue Management and QoS 2 Traffic and Resource Management Resource Management

More information

CROSS LAYER BASED MULTIPATH ROUTING FOR LOAD BALANCING

CROSS LAYER BASED MULTIPATH ROUTING FOR LOAD BALANCING CHAPTER 6 CROSS LAYER BASED MULTIPATH ROUTING FOR LOAD BALANCING 6.1 INTRODUCTION The technical challenges in WMNs are load balancing, optimal routing, fairness, network auto-configuration and mobility

More information

Large-Scale Distributed Systems. Datacenter Networks. COMP6511A Spring 2014 HKUST. Lin Gu [email protected]

Large-Scale Distributed Systems. Datacenter Networks. COMP6511A Spring 2014 HKUST. Lin Gu lingu@ieee.org Large-Scale Distributed Systems Datacenter Networks COMP6511A Spring 2014 HKUST Lin Gu [email protected] Datacenter Networking Major Components of a Datacenter Computing hardware (equipment racks) Power supply

More information

TCP over Multi-hop Wireless Networks * Overview of Transmission Control Protocol / Internet Protocol (TCP/IP) Internet Protocol (IP)

TCP over Multi-hop Wireless Networks * Overview of Transmission Control Protocol / Internet Protocol (TCP/IP) Internet Protocol (IP) TCP over Multi-hop Wireless Networks * Overview of Transmission Control Protocol / Internet Protocol (TCP/IP) *Slides adapted from a talk given by Nitin Vaidya. Wireless Computing and Network Systems Page

More information

Xiaoqiao Meng, Vasileios Pappas, Li Zhang IBM T.J. Watson Research Center Presented by: Payman Khani

Xiaoqiao Meng, Vasileios Pappas, Li Zhang IBM T.J. Watson Research Center Presented by: Payman Khani Improving the Scalability of Data Center Networks with Traffic-aware Virtual Machine Placement Xiaoqiao Meng, Vasileios Pappas, Li Zhang IBM T.J. Watson Research Center Presented by: Payman Khani Overview:

More information

Using Fuzzy Logic Control to Provide Intelligent Traffic Management Service for High-Speed Networks ABSTRACT:

Using Fuzzy Logic Control to Provide Intelligent Traffic Management Service for High-Speed Networks ABSTRACT: Using Fuzzy Logic Control to Provide Intelligent Traffic Management Service for High-Speed Networks ABSTRACT: In view of the fast-growing Internet traffic, this paper propose a distributed traffic management

More information

Data Center Netwokring with Multipath TCP

Data Center Netwokring with Multipath TCP UCL DEPARTMENT OF COMPUTER SCIENCE Research Note RN/1/3 Data Center Netwokring with 1/7/21 Costin Raiciu Christopher Pluntke Sebastien Barre Adam Greenhalgh Damon Wischik Mark Handley Abstract Data centre

More information

A Fast Path Recovery Mechanism for MPLS Networks

A Fast Path Recovery Mechanism for MPLS Networks A Fast Path Recovery Mechanism for MPLS Networks Jenhui Chen, Chung-Ching Chiou, and Shih-Lin Wu Department of Computer Science and Information Engineering Chang Gung University, Taoyuan, Taiwan, R.O.C.

More information

T. S. Eugene Ng Rice University

T. S. Eugene Ng Rice University T. S. Eugene Ng Rice University Guohui Wang, David Andersen, Michael Kaminsky, Konstantina Papagiannaki, Eugene Ng, Michael Kozuch, Michael Ryan, "c-through: Part-time Optics in Data Centers, SIGCOMM'10

More information

Router Scheduling Configuration Based on the Maximization of Benefit and Carried Best Effort Traffic

Router Scheduling Configuration Based on the Maximization of Benefit and Carried Best Effort Traffic Telecommunication Systems 24:2 4, 275 292, 2003 2003 Kluwer Academic Publishers. Manufactured in The Netherlands. Router Scheduling Configuration Based on the Maximization of Benefit and Carried Best Effort

More information

Multipath and Dynamic Queuing base load balancing in Data Centre Network

Multipath and Dynamic Queuing base load balancing in Data Centre Network Multipath and Dynamic Queuing base load balancing in Data Centre Network Sadhana Gotiya Pursuing M-Tech NIIST, Bhopal,India Nitin Mishra Asst.Professor NIIST, Bhopal,India ABSTRACT Data Centre Networks

More information

Quality of Service using Traffic Engineering over MPLS: An Analysis. Praveen Bhaniramka, Wei Sun, Raj Jain

Quality of Service using Traffic Engineering over MPLS: An Analysis. Praveen Bhaniramka, Wei Sun, Raj Jain Praveen Bhaniramka, Wei Sun, Raj Jain Department of Computer and Information Science The Ohio State University 201 Neil Ave, DL39 Columbus, OH 43210 USA Telephone Number: +1 614-292-3989 FAX number: +1

More information

PART III. OPS-based wide area networks

PART III. OPS-based wide area networks PART III OPS-based wide area networks Chapter 7 Introduction to the OPS-based wide area network 7.1 State-of-the-art In this thesis, we consider the general switch architecture with full connectivity

More information

Enabling Flow-based Routing Control in Data Center Networks using Probe and ECMP

Enabling Flow-based Routing Control in Data Center Networks using Probe and ECMP IEEE INFOCOM 2011 Workshop on Cloud Computing Enabling Flow-based Routing Control in Data Center Networks using Probe and ECMP Kang Xi, Yulei Liu and H. Jonathan Chao Polytechnic Institute of New York

More information

Falloc: Fair Network Bandwidth Allocation in IaaS Datacenters via a Bargaining Game Approach

Falloc: Fair Network Bandwidth Allocation in IaaS Datacenters via a Bargaining Game Approach Falloc: Fair Network Bandwidth Allocation in IaaS Datacenters via a Bargaining Game Approach Fangming Liu 1,2 In collaboration with Jian Guo 1,2, Haowen Tang 1,2, Yingnan Lian 1,2, Hai Jin 2 and John C.S.

More information

ICTCP: Incast Congestion Control for TCP in Data Center Networks

ICTCP: Incast Congestion Control for TCP in Data Center Networks ICTCP: Incast Congestion Control for TCP in Data Center Networks Haitao Wu, Zhenqian Feng, Chuanxiong Guo, Yongguang Zhang {hwu, v-zhfe, chguo, ygz}@microsoft.com, Microsoft Research Asia, China School

More information

Power-efficient Virtual Machine Placement and Migration in Data Centers

Power-efficient Virtual Machine Placement and Migration in Data Centers 2013 IEEE International Conference on Green Computing and Communications and IEEE Internet of Things and IEEE Cyber, Physical and Social Computing Power-efficient Virtual Machine Placement and Migration

More information

1. The subnet must prevent additional packets from entering the congested region until those already present can be processed.

1. The subnet must prevent additional packets from entering the congested region until those already present can be processed. Congestion Control When one part of the subnet (e.g. one or more routers in an area) becomes overloaded, congestion results. Because routers are receiving packets faster than they can forward them, one

More information

Technical Bulletin. Enabling Arista Advanced Monitoring. Overview

Technical Bulletin. Enabling Arista Advanced Monitoring. Overview Technical Bulletin Enabling Arista Advanced Monitoring Overview Highlights: Independent observation networks are costly and can t keep pace with the production network speed increase EOS eapi allows programmatic

More information

Network Infrastructure Services CS848 Project

Network Infrastructure Services CS848 Project Quality of Service Guarantees for Cloud Services CS848 Project presentation by Alexey Karyakin David R. Cheriton School of Computer Science University of Waterloo March 2010 Outline 1. Performance of cloud

More information

Cisco s Massively Scalable Data Center

Cisco s Massively Scalable Data Center Cisco s Massively Scalable Data Center Network Fabric for Warehouse Scale Computer At-A-Glance Datacenter is the Computer MSDC is the Network Cisco s Massively Scalable Data Center (MSDC) is a framework

More information

A Review on Quality of Service Architectures for Internet Network Service Provider (INSP)

A Review on Quality of Service Architectures for Internet Network Service Provider (INSP) A Review on Quality of Service Architectures for Internet Network Service Provider (INSP) Herman and Azizah bte Abd. Rahman Faculty of Computer Science and Information System Universiti Teknologi Malaysia

More information

CONGA: Distributed Congestion-Aware Load Balancing for Datacenters

CONGA: Distributed Congestion-Aware Load Balancing for Datacenters CONGA: Distributed Congestion-Aware Load Balancing for Datacenters Mohammad Alizadeh Tom Edsall, Sarang Dharmapurikar, Ramanan Vaidyanathan, Kevin Chu, Andy Fingerhut, Vinh The Lam, Francis Matus, Rong

More information

Transport layer issues in ad hoc wireless networks Dmitrij Lagutin, [email protected]

Transport layer issues in ad hoc wireless networks Dmitrij Lagutin, dlagutin@cc.hut.fi Transport layer issues in ad hoc wireless networks Dmitrij Lagutin, [email protected] 1. Introduction Ad hoc wireless networks pose a big challenge for transport layer protocol and transport layer protocols

More information

Ethernet Fabrics: An Architecture for Cloud Networking

Ethernet Fabrics: An Architecture for Cloud Networking WHITE PAPER www.brocade.com Data Center Ethernet Fabrics: An Architecture for Cloud Networking As data centers evolve to a world where information and applications can move anywhere in the cloud, classic

More information

Disjoint Path Algorithm for Load Balancing in MPLS network

Disjoint Path Algorithm for Load Balancing in MPLS network International Journal of Innovation and Scientific Research ISSN 2351-8014 Vol. 13 No. 1 Jan. 2015, pp. 193-199 2015 Innovative Space of Scientific Research Journals http://www.ijisr.issr-journals.org/

More information

AN OVERVIEW OF QUALITY OF SERVICE COMPUTER NETWORK

AN OVERVIEW OF QUALITY OF SERVICE COMPUTER NETWORK Abstract AN OVERVIEW OF QUALITY OF SERVICE COMPUTER NETWORK Mrs. Amandeep Kaur, Assistant Professor, Department of Computer Application, Apeejay Institute of Management, Ramamandi, Jalandhar-144001, Punjab,

More information

Adaptive Routing for Layer-2 Load Balancing in Data Center Networks

Adaptive Routing for Layer-2 Load Balancing in Data Center Networks Adaptive Routing for Layer-2 Load Balancing in Data Center Networks Renuga Kanagavelu, 2 Bu-Sung Lee, Francis, 3 Vasanth Ragavendran, Khin Mi Mi Aung,* Corresponding Author Data Storage Institute, Singapore.E-mail:

More information

Small is Better: Avoiding Latency Traps in Virtualized DataCenters

Small is Better: Avoiding Latency Traps in Virtualized DataCenters Small is Better: Avoiding Latency Traps in Virtualized DataCenters SOCC 2013 Yunjing Xu, Michael Bailey, Brian Noble, Farnam Jahanian University of Michigan 1 Outline Introduction Related Work Source of

More information

Testing Network Virtualization For Data Center and Cloud VERYX TECHNOLOGIES

Testing Network Virtualization For Data Center and Cloud VERYX TECHNOLOGIES Testing Network Virtualization For Data Center and Cloud VERYX TECHNOLOGIES Table of Contents Introduction... 1 Network Virtualization Overview... 1 Network Virtualization Key Requirements to be validated...

More information

Scaling 10Gb/s Clustering at Wire-Speed

Scaling 10Gb/s Clustering at Wire-Speed Scaling 10Gb/s Clustering at Wire-Speed InfiniBand offers cost-effective wire-speed scaling with deterministic performance Mellanox Technologies Inc. 2900 Stender Way, Santa Clara, CA 95054 Tel: 408-970-3400

More information

Chapter 4. VoIP Metric based Traffic Engineering to Support the Service Quality over the Internet (Inter-domain IP network)

Chapter 4. VoIP Metric based Traffic Engineering to Support the Service Quality over the Internet (Inter-domain IP network) Chapter 4 VoIP Metric based Traffic Engineering to Support the Service Quality over the Internet (Inter-domain IP network) 4.1 Introduction Traffic Engineering can be defined as a task of mapping traffic

More information

B4: Experience with a Globally-Deployed Software Defined WAN TO APPEAR IN SIGCOMM 13

B4: Experience with a Globally-Deployed Software Defined WAN TO APPEAR IN SIGCOMM 13 B4: Experience with a Globally-Deployed Software Defined WAN TO APPEAR IN SIGCOMM 13 Google s Software Defined WAN Traditional WAN Routing Treat all bits the same 30% ~ 40% average utilization Cost of

More information

High-Speed TCP Performance Characterization under Various Operating Systems

High-Speed TCP Performance Characterization under Various Operating Systems High-Speed TCP Performance Characterization under Various Operating Systems Y. Iwanaga, K. Kumazoe, D. Cavendish, M.Tsuru and Y. Oie Kyushu Institute of Technology 68-4, Kawazu, Iizuka-shi, Fukuoka, 82-852,

More information