Scaling up Applications over Distributed Clouds with Dynamic Layer-2 Exchange and Broadcast Service

Yufeng Xin, Ilya Baldin, Chris Heermann, Anirban Mandal, and Paul Ruth
RENCI, University of North Carolina at Chapel Hill, NC, USA
Email: {yxin,ibaldin,ckh,anirban,pruth}@renci.org

Abstract: In this paper, we study the problem of provisioning large-scale virtual clusters over federated clouds connected by multi-domain, layer-2 wide area networks. We first present the virtual cluster request abstraction and the abstraction models for substrate resource pools. Based on these two abstraction models, we develop a novel layer-2 exchange mechanism and an implementation of it in a multi-domain networked cloud environment. The design of the mechanism takes into consideration the realistic constraints in current network and cloud systems. We show that efficient cluster splitting, cloud data center selection, and resource allocation algorithms can be developed to provision large-scale virtual clusters across cloud sites. A prototype system has been deployed and integrated into the ExoGENI testbed for about a year and is being heavily used by scientific and data-analytic applications.

Index Terms: Virtual Cluster, Layer-2 Exchange, Inter-domain Broadcasting, Distributed Cloud

I. INTRODUCTION

Fueled by the big-data movement and the advancement of virtualization technologies, scaling up Cloud Computing Infrastructure-as-a-Service (IaaS) to multiple geographically distributed cloud sites has attracted considerable attention [2], [15], [16]. The major motivations are: (1) to further extend server consolidation from individual physical servers to the rack or datacenter level; (2) to lower the operational costs associated with the complexity of creating and maintaining the virtual system while maintaining a high level of quality of service, availability, and resource utilization; and (3) to scale up the provisioning capability by federating several small- or medium-scale cloud sites.

Large-scale virtualized systems, such as cloud-based Hadoop and multi-layer web service clusters, normally require a large number of customized virtual machines (VMs) interconnected by high-bandwidth network channels. Expressing and mapping a real topology with specific links connecting pairs of VMs (the Pipe model) is extremely difficult for both customers and service providers, as the traffic patterns between the VMs are usually highly dynamic, and general virtual topology mapping and maintenance are both computationally hard and operationally expensive [15]. As a result, various topology abstractions (e.g., the so-called Hose model) have been developed and widely used: the virtual cluster, the virtual oversubscribed cluster, and the tenant application graph abstraction [4], [8], [11]. In these abstraction models, clusters of VMs connect to one or more virtual switches that are interconnected as a single- or multi-layer virtual tree. However, these solutions were all developed for single data center environments, where tree-like physical data center network topologies are always assumed.

Similar abstraction approaches are applicable to describing the allocatable resources in a distributed IaaS cloud environment, where each cloud site can be modeled as resource pools of different types of VMs with different CPU, memory, storage, and network specifications. We believe this unified abstraction approach makes virtual system embedding easier and more efficient, as it hides the provisioning details inside each cloud site.
There are three different request embedding scenarios: (1) a request is always placed entirely in a single cloud site; (2) a request, like a large-scale web service, can be split into multiple pieces, each of which is placed in a cloud site, but these pieces don't communicate with each other; (3) a request, in the form of a virtual topology, can be split and embedded into multiple cloud sites, and these pieces need inter-cloud networking to interconnect with each other.

In general, when compared to the single data center scenario, the distributed cloud environment presents additional technical challenges: (1) federation mechanisms and systems are needed to handle trust, security, and pricing issues, especially when the sites are owned by different providers; (2) resources from different sites are more likely to be heterogeneous in terms of quantity and quality, which makes request placement decisions more difficult; (3) the additional wide-area transit networks that connect the cloud sites need to be explicitly included in the substrate resource model in order to do cross-site request topology embedding. In particular, the transit networks connecting the distributed cloud sites have a multi-domain mesh topology rather than a tree-like topology, and all have different resource quantity and quality constraints. As a result, the wide-area network resources need to be treated as first-class allocatable resources to satisfy the high bandwidth requirements normally posed by large-scale virtual system topology requests. The need to set up dynamic bandwidth-guaranteed connections between the cloud data centers as part of the provisioned virtual system makes the best-effort TCP/IP Internet inadequate. Thanks to recent technical advancements, many
advanced research and educational networks, and some commercial service providers, support dynamic circuit services by leveraging MPLS, VLAN, carrier Ethernet, or even layer-1 optical lightpath capabilities. Enabling this capability requires substantial engineering effort to bring the service to the premises of the cloud sites and to provide advanced inter-domain network control and management.

In this paper, we focus on the problem of virtual cluster mapping in a wide-area, multi-domain environment based on the topology and resource abstraction models. The tree-like topology of the virtual cluster abstraction implies that broadcasting and multicasting would be the desirable communication services, since they can significantly simplify the mapping process while greatly improving resource utilization in the transit networks. Various IP-layer multicasting solutions have proved difficult and are not widely available. Another popular solution is Virtual Private LAN Service (VPLS), which bridges multiple MPLS endpoints into a single LAN segment [14]. However, VPLS is not widely available due to its high cost and the complexity of automating its configuration.

Due to its pervasive deployment, low operational cost, and bandwidth-guarantee capability, layer-2 Ethernet networking has been the most popular solution for provisioning dynamic connection services since the Grid Computing era [13]. In recent years, it has been chosen by many distributed IaaS projects and testbeds, including GENI, as the underlying virtual networking platform. Different mechanisms and tools, so-called "stitching" in the GENI context [1], [6], have also been developed to provision multi-domain dynamic end-to-end layer-2 circuits. However, efficient layer-2 Ethernet broadcasting and multicasting solutions do not exist at all in a multi-domain networking environment, let alone automation of such a service in software.

In this paper, we present a novel layer-2 Ethernet exchange mechanism that provides wide-area broadcast services by deploying advanced Ethernet switches at strategically chosen locations in the multi-domain environment. The virtual switches in the requests' virtual topology abstraction are mapped to these exchange points. With this solution, we show that provisioning large-scale virtual computing clusters over federated clouds becomes operationally efficient for IaaS providers. It also becomes straightforward for users, as they can continue to use the familiar topology abstractions they have been using in a single data center environment. We also demonstrate that this service is automatically provisioned using a software implementation. As far as we know, our work is the first to address this problem systematically and to present a solution approach based on an off-the-shelf Ethernet switch platform. The presented mechanism and solution approach are general and can be easily integrated into any distributed Cloud or Grid computing system. Our major contributions are two-fold: (1) a layer-2 broadcasting service based on a novel layer-2 exchange mechanism and its implementation in a multi-domain, multi-cloud environment; and (2) algorithmic studies on efficient cluster splitting, cloud data center selection, and resource allocation for large-scale cluster requests, including an efficient broadcast tree construction algorithm.

The remainder of the paper is organized as follows. In Section II, we present the layer-2 broadcast service with the Ethernet exchange mechanism.
In Section III, we formally define the cluster splitting and resource allocation problem in this environment and present a set of heuristics to solve it. In Section IV, we validate the proposed mechanism and study the performance of the algorithms via simulation and experimental studies in the ExoGENI testbed, which has deployed the presented solution to provide a large-scale virtual cluster service. We conclude the paper in Section V.

II. VIRTUAL CLUSTER AND INTER-DOMAIN LAYER-2 BROADCASTING

An efficient resource abstraction model is very important in a federated cloud system, as it enables easy interoperability between different cloud sites, presents a unified view to the users to acquire resources, and still allows autonomic control of each individual site.

A. Virtual Cluster and Resource Abstraction

A virtual cluster request abstraction can be represented as $\langle N, B \rangle$, where $N$ (virtual) nodes connect to a virtual root $R$ via links with bandwidth $B$, as shown in Fig. 1(a). A more general representation is $\langle N, B, S, O \rangle$, a virtual oversubscribed cluster [4], as shown in Fig. 1(b). The added parameters are $S$, representing the grouping of the $N$ VMs in a two-layer structure, $S = \{S_1, S_2, \ldots, S_k\}$, $k = |S|$, and $O$, the oversubscription ratio that discounts the total bandwidth through the root by the intra-group communication. This model assumes an even distribution of VM groups, so the inter-group bandwidth is approximated as $|S_i| \cdot B / O$. We generalize the model to allow arbitrary group sizes: $V_i$ is the virtual switch connecting the VMs in group $S_i$ with an intra-group bandwidth $B \cdot |S_i|$, and $B_i$ is the required inter-group bandwidth for $S_i$. $R$ is the virtual root switch for the entire cluster. This cluster group abstraction can be easily expressed by the NodeGroup abstraction in ExoGENI, which specifies a group of VMs of the same type using the same customized image. VMware recently introduced a similar abstraction called the resource pool [8].

Fig. 1: Virtual Cluster Abstraction. (a) Virtual cluster: VMs $s_1, \ldots, s_N$ connect to the root $R$, each with bandwidth $B$. (b) Virtual oversubscribed cluster: groups $S_1, \ldots, S_k$ connect through virtual switches $V_1, \ldots, V_k$ to $R$ with bandwidths $B_1, \ldots, B_k$.
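To make the two request abstractions concrete, the following is a minimal Python sketch (our own illustration; the class and field names are not part of the ExoGENI codebase) of $\langle N, B \rangle$ and $\langle N, B, S, O \rangle$, including the approximate inter-group bandwidth described above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VirtualCluster:
    """<N, B>: N VMs, each linked to a virtual root R with bandwidth B."""
    n: int           # number of VMs
    b: float         # per-VM bandwidth to the root (e.g., Mb/s)

@dataclass
class VirtualOversubscribedCluster:
    """<N, B, S, O>: VMs partitioned into groups S_1..S_k under root R."""
    b: float                # per-VM bandwidth inside a group
    o: float                # oversubscription ratio O >= 1
    group_sizes: List[int]  # |S_1|, ..., |S_k| (arbitrary sizes allowed)

    @property
    def n(self) -> int:
        return sum(self.group_sizes)

    def intra_group_bandwidth(self, i: int) -> float:
        # Aggregate bandwidth inside group S_i.
        return self.b * self.group_sizes[i]

    def inter_group_bandwidth(self, i: int) -> float:
        # Approximate bandwidth on the branch V_i -> R, discounted by O.
        return self.group_sizes[i] * self.b / self.o

# Example: 3 groups of sizes 4, 8, 4 with B = 100 Mb/s and O = 2.
voc = VirtualOversubscribedCluster(b=100.0, o=2.0, group_sizes=[4, 8, 4])
print(voc.n, [voc.inter_group_bandwidth(i) for i in range(3)])
```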

An example distributed cloud system is shown in Fig. 2. There are several cloud sites ($C_1 \ldots C_M$), each of which is served by a corresponding regional network service provider from ($R_1 \ldots R_M$). In the center is a small number of wide-area transit network service providers ($P_1 \ldots P_N$) that connect the regional networks. The transit networks, connected in an arbitrary topology, provide either bandwidth-guaranteed dynamic or static virtual connection services between the cloud sites.

To express the allocatable resources of a cloud site or a network domain, we adopt a generic resource pool abstraction model. The available resource of a cloud site is represented as $\langle C_i, CC_i \rangle$. $C_i$ is the total number of VM resource units that can be reserved, where a unit is defined as a minimum VM type, i.e., a VM with 1 CPU core, 3 GB of memory, and 30 GB of storage in the ExoGENI testbed. The other instance types available to users, i.e., medium, large, or extra-large, are predefined as multiples of the unit type with different amounts of CPU, memory, and storage. $CC_i$ is a constraint vector that represents the total amount of CPU cores, memory, and storage capacity offered by cloud site $i$. In this way, we handle physical server heterogeneity at the cloud site level and leave the actual VM placement to the local policy of individual cloud sites. The resource of a network domain is represented as $\langle V_i, NC_i \rangle$, where $V_i$ is the total number of connections that the $i$th network provider can accommodate and $NC_i$ is a constraint vector representing various network constraints. For example, for a layer-2 network, this abstraction allows the provider to express the available VLAN tags and the available bandwidth on the border interfaces, which are considered when provisioning bandwidth-guaranteed inter-domain connections. A small code rendering of this abstraction appears below.

Fig. 2: Distributed Cloud Environment with Multi-Domain Layer-2 Networks

B. Inter-domain VLAN Exchange and Broadcasting

As mentioned before, mapping virtual clusters to a data center with a tree-type physical network topology has been extensively studied, as the virtual root switch and cluster switches can be directly mapped to the data center switches to form a virtual broadcast domain per virtual cluster. It is not possible to do so in a distributed cloud environment interconnected by multi-domain networks because (1) none of the existing network providers offer inter-domain layer-2 broadcast services; (2) using point-to-point connections to serve a virtual cluster placed in $N$ cloud sites would be very difficult, whether by setting up a full-mesh virtual topology among the $N$ sites or by setting up extra routing nodes for a non-mesh virtual topology, and both solutions require configuring multiple network interfaces in each VM; (3) dynamic circuit service is only available from a few advanced providers among some of the cloud sites. In order to connect two remote cloud sites (static sites), it is common practice to pre-configure a number of static connections. For $M$ such sites, $O(M^2)$ static connections are required, which is both cumbersome and inefficient.
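As referenced above, the resource pool abstraction can be rendered concretely as follows. This is a minimal sketch with our own field names, not ExoGENI's actual schema; the feasibility check over $\langle C_i, CC_i \rangle$ is simplified to the unit-multiple instance types described in the text.

```python
from dataclasses import dataclass

# Unit VM type in the ExoGENI testbed: 1 core, 3 GB memory, 30 GB storage.
UNIT = (1, 3, 30)

@dataclass
class CloudSitePool:
    """<C_i, CC_i>: reservable VM units plus an aggregate constraint vector."""
    units: int         # C_i: total reservable VM units
    cores: int         # CC_i components: total CPU cores,
    memory_gb: int     #   total memory,
    storage_gb: int    #   and total storage

    def can_host(self, n_vms: int, vm_multiple: int = 1) -> bool:
        """Check whether n_vms instances of `vm_multiple` x unit type fit."""
        need = n_vms * vm_multiple
        return (need <= self.units
                and need * UNIT[0] <= self.cores
                and need * UNIT[1] <= self.memory_gb
                and need * UNIT[2] <= self.storage_gb)

@dataclass
class NetworkDomainPool:
    """<V_i, NC_i>: connection count plus network constraints (simplified)."""
    connections: int   # V_i: connections the provider can accommodate
    vlan_tags: int     # NC_i (simplified): available VLAN tags
    border_bw: float   #   available bandwidth on border interfaces

# Example: a site with 100 units can host 20 "medium" (2x unit) VMs.
site = CloudSitePool(units=100, cores=128, memory_gb=384, storage_gb=4000)
print(site.can_host(20, vm_multiple=2))  # True if all four dimensions fit
```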
Even for sites that can be connected via dynamic connections, doing so in a multi-domain network environment is only feasible in a few advanced distributed IaaS control frameworks, like the ExoGENI system, which has a set of advanced stitching mechanisms to enable this inter-domain dynamic circuit capability [3].

Our solution is to deploy VLAN exchanges at strategically chosen locations to serve two purposes: (1) to enable dynamic VLAN circuit service between cloud sites that can only pre-configure static VLAN circuits to other places; we call this the VLAN exchange service, which essentially extends dynamic circuit services to static sites; and (2) to set up dynamic virtual layer-2 broadcast domains among multiple VLAN segments; we call this the layer-2 broadcasting service. As shown in Fig. 2, $E_1$ is such a VLAN exchange, consisting of high-capacity Ethernet switches controlled by provisioning software that is part of the distributed cloud provisioning software system. In the multi-domain, multi-cloud network model, an exchange is modeled as a separate domain.

The advantages of this solution are multi-fold: (1) instead of $O(M^2)$, only $O(M)$ static connections are needed for $M$ static sites; and instead of negotiating with and pre-configuring static connections to every other site, a static site only needs to do so with the exchange, which would normally be taken care of by the cloud federation system; (2) the basic bridging mechanism in the exchange is VLAN translation, i.e., each site can pre-configure connections with different VLAN ranges to the exchange domain, which can pick two arbitrary VLAN tags and connect them together via VLAN translation; this relieves operators from enforcing the stringent VLAN continuity requirement and from dealing with the complexity of creating multi-domain VLAN connections, and it also reduces the dependency on other advanced carrier Ethernet capabilities (e.g., Q-in-Q) for setting up end-to-end layer-2 connections; (3) the enabled layer-2 broadcasting service makes it a relatively simple process to create a virtual layer-2 broadcast domain for a virtual cluster embedded in multiple cloud sites: each participating cloud site creates a connection with possibly different VLAN
tags to the Exchange domain, and the exchange switch then translates the incoming VLANs into one common VLAN; (4) layer-2 network isolation, with bandwidth guarantees, is maintained between different users' virtual systems; (5) every cloud site can still implement its own virtual networking solution (provisioning $V_i$) independently, as long as it can interface with the exchange service.

Fig. 2 illustrates a distributed cloud system equipped with such an exchange. We first show how the VLAN exchange service sets up a VLAN connection between two virtual cluster groups $S_1$ and $S_2$ embedded in the cloud sites $C_8$ and $C_9$: first, two VLAN connections (inter-domain circuits) $V_1$ and $V_2$ are created from $C_8$ and $C_9$ to $E_1$; then $E_1$ translates $V_2$ to $V_1$. Next, we show the capability of the layer-2 broadcasting service to scale up the virtual cluster by adding a group $S_3$, embedded in $C_7$, via another connection $V_3$ to $E_1$, which subsequently translates $V_3$ to $V_1$. As a result, we have created a large-scale virtual cluster that is embedded in three different cloud sites.

Fig. 3 depicts the internal structure of an exchange domain, which consists of two Ethernet switches: an access switch and an exchange switch. The access switch plays the role of client switch of the transit networks and takes in the dynamic or static VLAN connections from the neighboring domains. Due to the significant cost of transponders, there is normally only one drop port at a PoP (Point of Presence) of advanced backbone transit network service providers, e.g., Internet2 or ESnet. Therefore, multiple connections may arrive at the same port of the access switch. At the same time, the main technical constraint is that VLAN translation is one-to-one per port, i.e., it is not possible to translate tags $i$ and $j$ coming in on the same port to a common tag $k$. Therefore, the second (exchange) switch is necessary; it connects to the access switch via several port-to-port direct links (exchange ports). The number of these ports limits the maximum cardinality of a broadcast connection. When multiple connections of a request arrive at the access switch, each connection is put on an exchange link between the two switches. Finally, at the exchange switch, the VLAN tag of each connection can be translated to an arbitrary common VLAN on its corresponding exchange port. For the example virtual cluster, the connections from $S_2$ and $S_3$ arrive on the same access port and, along with the connection from $S_1$, are mapped to three different exchange ports $e_1$, $e_2$, and $e_3$, through which $V_2$ and $V_3$ are translated to $V_1$ to create a single Ethernet broadcast domain. A minimal sketch of this translation bookkeeping appears below.

Fig. 3: Inter-domain VLAN Exchange

C. Sparse Deployment of Exchange Service

For a system with a large number of geographically distributed cloud sites, one exchange domain is not enough because of (1) the limited capacity (available VLAN tags and bandwidth) of an exchange switch for all the inter-group connections to go through, and (2) the long distance to some remote cloud sites, which would incur unacceptable delays. On the other hand, deployment of exchange domains comes with the increased cost of high-capacity switches and the associated co-location and fiber connectivity costs charged by transit network service providers. Therefore, it is desirable to deploy only a small number of exchanges at strategically chosen locations to serve users' requests.
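The sketch below renders the translation bookkeeping of Sec. II-B (our own illustration; port counts and tag values are assumptions, not measured ExoGENI parameters): each incoming circuit consumes one exchange port, translation is one-to-one per port, and the number of exchange ports bounds the broadcast cardinality, one of the capacity limits just noted.

```python
class VlanExchange:
    """Model of the two-switch exchange: access switch + exchange switch.

    Each incoming circuit (access_port, vlan_tag) is placed on its own
    exchange port, where its tag is translated to the broadcast domain's
    common VLAN. Translation is one-to-one per port, so two circuits that
    share an access port must use distinct exchange ports.
    """

    def __init__(self, num_exchange_ports: int):
        self.free_ports = list(range(num_exchange_ports))
        self.translations = {}  # exchange_port -> (access_port, in_tag, out_tag)

    def create_broadcast_domain(self, circuits, common_vlan):
        """circuits: list of (access_port, vlan_tag) pairs.

        Returns the per-port translation table, or raises if the broadcast
        cardinality exceeds the number of exchange ports.
        """
        if len(circuits) > len(self.free_ports):
            raise RuntimeError("broadcast cardinality exceeds exchange ports")
        table = {}
        for access_port, tag in circuits:
            e_port = self.free_ports.pop(0)   # one circuit per exchange port
            table[e_port] = (access_port, tag, common_vlan)
        self.translations.update(table)
        return table

# Example from Figs. 2/3: S2 and S3 arrive on the same access port (p1) with
# tags V2 and V3; S1 arrives on p0 with tag V1. All are translated to V1.
ex = VlanExchange(num_exchange_ports=8)
print(ex.create_broadcast_domain([("p0", 101), ("p1", 202), ("p1", 303)], 101))
```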
In this setting, the objective is to deploy as few exchange domains as possible while keeping the diameter of any virtual cluster bounded in a multi-domain network environment. In reality, taking both deployment and operating costs into consideration, the exchanges are likely to be deployed at the PoP locations of the backbone networks because of their abundant capacity and connectivity. In order to take the geographical distribution of these PoPs into account when modeling the inter-domain network environment, we abstract each network domain as a full-mesh topology between its border interfaces (PoPs or interconnects that connect to other network domains or cloud sites). What we obtain is an inter-domain graph $G(V, E)$, where the node set $V$ includes all the border interfaces of the network domains and the cloud sites. The edge set $E$ includes the inter-domain links and the intra-domain links of the mesh network abstraction. Each link $E_i$ is annotated with its available bandwidth and delay, $E_i(b_i, d_i)$. In this way, deploying an exchange domain is equivalent to enabling a node in $G$ with the exchange or broadcasting capability. This problem can be modeled as a variant of the facility location problem, a well-studied NP-hard problem with several efficient heuristics [5].

III. VIRTUAL CLUSTER PROVISIONING: EMBEDDING AND LOAD BALANCE

Resource allocation for virtual systems is often modeled as a topology embedding problem, which is NP-hard in general. For a distributed cloud environment connected by inter-domain networks, popular objectives include minimizing the distance among the selected cloud sites [2] or minimizing the inter-domain bandwidth allocation [15]. In this paper, we focus on virtual cluster-like requests, whose embedding in a bandwidth-constrained physical topology is equivalent to the NP-hard VPN embedding problem in the popular hose model [10]. Recently, a greedy heuristic for embedding on tree-like data center topologies was developed to maximize bandwidth utilization [4]. Our embedding problem differs from the existing studies because (1) our underlying substrate system model is a general topology with some exchange nodes, and (2) the virtual roots are always mapped to the exchange broadcasting domains. Nevertheless, this does not change the hardness
of the problem. Below, we outline several allocation models, policies, and associated algorithms with different objective, performance, and complexity tradeoffs.

A. Static virtual cluster embedding

To illustrate how virtual cluster mapping works in the presence of exchange domains, we first consider a static mapping case where the groups of a virtual oversubscribed cluster (VOC) request $\langle N, B, S, O \rangle$ are given, and the group-to-cloud mapping $S_i \rightarrow C_i$ is known. Static mapping is a common feature of distributed cloud systems, where users are allowed to choose where to create their VMs, normally based on performance, availability, or security considerations. In the simplest form, it is straightforward to first choose an exchange domain to be the virtual root $R$ so that the average distance or delay from all the mapped sites is minimized, and then to provision an inter-domain circuit for each branch, from site $C_i$ to $R$, with an individually assigned VLAN into the exchange domain.

In reality, as shown in Fig. 1(b), the bandwidth requirement $B_i$ on each branch of the virtual cluster may be asymmetric. As mentioned earlier, according to [4], $B_i = \min(|S_i|, N - |S_i|) \cdot B$, because the intra-group traffic does not leave its cloud site $i$. It is easy to see that this mapping problem can be solved as a multi-commodity flow problem consisting of $k$ flows $(R, C_i, B_i)$, $i \in 1, \ldots, k$, when the links in $G(V, E)$ are assigned a cost function. When there are multiple exchange domains, mapping a large-scale virtual cluster can take advantage of the flexibility of creating a multi-layer tree with the intermediate nodes mapped to exchange domains, so that the total cost of the mapping is minimized. However, this makes the problem a special form of the NP-hard multicast (Steiner) tree problem. Furthermore, in addition to bounding the delay from the sites to the root, we also want to minimize the delay variance among all the groups. The objective then becomes to make the tree as balanced as possible, i.e., to minimize the number of layers or the height of the tree, which is another variant of the NP-hard Steiner tree problem. We leave these advanced problem variants to future studies.

B. Dynamic virtual cluster embedding

We now look at the general VOC case. Our resource allocation policy is to always map a group $S_i$ as a whole into a cloud site; multiple groups can be consolidated into the same cloud site in order to minimize the number of involved cloud sites and, subsequently, the inter-cloud communication. This problem is equivalent to the classical NP-hard multi-dimensional bin packing problem [7], because there are multiple resource types in a requested VM and multiple resource constraints in the cloud sites, which act as the bins [9]. We adopt the popular FirstFit heuristic for bin packing of the VMs (a minimal sketch appears below); note that the bandwidth requirements and constraints need to be treated as individual dimensions of the bins. An efficient heuristic solution is to iterate over the bin packing solution space and call the static VC embedding algorithm studied in the last section in each iteration, until the final solution is reached. Finding a bin packing that forms a balanced tree rooted at the exchange adds extra complexity.

C. Automatic virtual cluster splitting and embedding

In this section, we turn our attention to the most general virtual cluster (VC) mapping problem, which does not require pre-grouping, so that the VMs can be split into arbitrary groups and embedded into arbitrary cloud sites.
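As referenced in Sec. III-B, here is a minimal FirstFit sketch (our own illustration; the paper does not prescribe this exact loop) that packs whole VM groups into cloud sites, treating units, cores, memory, storage, and inter-group bandwidth as independent bin dimensions.

```python
def first_fit_groups(groups, sites):
    """Place each group (a demand vector) into the first site that fits.

    groups: list of demand vectors, e.g. (units, cores, mem_gb, storage_gb, bw)
    sites:  list of residual capacity vectors of the same length (the bins)
    Returns a mapping group_index -> site_index, or None if some group
    cannot be placed (a real allocator would then reject or re-split).
    """
    placement = {}
    residual = [list(cap) for cap in sites]  # copy so inputs stay intact
    for g_idx, demand in enumerate(groups):
        for s_idx, cap in enumerate(residual):
            if all(d <= c for d, c in zip(demand, cap)):
                # Group fits in every dimension: commit and move on.
                for dim, d in enumerate(demand):
                    cap[dim] -= d
                placement[g_idx] = s_idx
                break
        else:
            return None  # no site could host this group
    return placement

# Example: two sites, three groups; dimensions (units, cores, mem, disk, bw).
sites = [(40, 40, 120, 1200, 1000), (20, 20, 60, 600, 500)]
groups = [(16, 16, 48, 480, 400)] * 3
print(first_fit_groups(groups, sites))  # {0: 0, 1: 0, 2: 1}
```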
Several studies have shown that a major pitfall of cloud computing is the virtualization overhead of the VMs, which may grow substantially when too many VMs are packed into a physical server [12], since absolute resource (network, I/O, etc.) isolation among VMs or different tenants is impossible. Therefore, load balancing is very important in a distributed cloud environment, where we wish the resource utilization of all cloud sites to be as even as possible. Note that this objective is the opposite of the original bin packing objective, which attempts to pack VMs into as few bins as possible. In order to better trade off these conflicting objectives, more advanced policies and algorithms need to be designed to map a large virtual cluster.

We start with the definition of an assignment utility function for embedding the group $S_i$ ($s_i = |S_i|$) in cloud site $C_i$:

$$U_i(S_i, C_i) = \frac{1}{C_i - s_i}, \quad s_i \le C_i.$$

The virtual cluster is split into $k$ groups ($k$ is a variable), and each group is embedded in a separate cloud site. For a given $k$, the resource allocation optimization objective is to find the optimal grouping $S_i^*$, $i \in 1, \ldots, k$, that minimizes the total utility

$$U = \sum_{i=1}^{k} U_i(S_i, C_i) = \sum_{i=1}^{k} \frac{1}{C_i - s_i} \quad (1)$$

subject to

$$\sum_{i=1}^{k} s_i = N. \quad (2)$$

Substituting $s_k = N - \sum_{i=1}^{k-1} s_i$, the objective function can be rewritten as

$$U = \sum_{i=1}^{k-1} \frac{1}{C_i - s_i} + \frac{1}{C_k - N + \sum_{i=1}^{k-1} s_i}. \quad (3)$$

From the first-order optimality condition, we obtain $k-1$ partial differentiation equations; setting $\partial U / \partial s_i = 0$ equalizes the residual capacities $C_i - s_i$ across the sites, so the $i$th equation is

$$\sum_{j=1}^{k-1} s_j + s_i = C_i - C_k + N. \quad (4)$$

Together with constraint (2), we can solve the $k$ linear equations to obtain the optimal solution $s_i^*$. In order to find the optimal number of groups $k^*$, the algorithm iteratively changes the number of splits and solves the above linear equations to obtain the total utility for each particular splitting until $k^*$ is found. A comprehensive solution would also take an inter-domain connection utility function into consideration in each iteration. We leave the algorithmic performance study to future papers.
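A minimal sketch of this splitting procedure follows (our own illustration; the paper leaves the iteration details open). Equations (2) and (4) are linear, so for a fixed $k$ the optimal $s_i$ has a closed form; the outer loop then searches over $k$ for the smallest total utility. Feasibility checks are simplified, and rounding $s_i$ to integer VM counts is left out.

```python
def optimal_split(capacities, n):
    """Search over k for the split minimizing U = sum 1/(C_i - s_i), s.t. sum s_i = N.

    capacities: site capacities C_1..C_M (we use the first k in each trial;
    a real allocator would also choose *which* sites to use).
    Returns (best_k, best_split, best_utility) or None if no split fits.
    """
    best = None
    for k in range(1, len(capacities) + 1):
        c = capacities[:k]
        # Summing eq. (4) over i = 1..k-1 gives
        #   k * T = sum_{i<k} C_i - (k-1)*C_k + (k-1)*N,  where T = sum_{i<k} s_i,
        # after which s_i = C_i - C_k + N - T and s_k = N - T.
        t = (sum(c[:-1]) - (k - 1) * c[-1] + (k - 1) * n) / k
        s = [c[i] - c[-1] + n - t for i in range(k - 1)] + [n - t]
        if any(si < 0 or si >= c[i] for i, si in enumerate(s)):
            continue  # infeasible split for this k
        # Note: eq. (4) equalizes the residual capacities C_i - s_i.
        u = sum(1.0 / (c[i] - s[i]) for i in range(k))
        if best is None or u < best[2]:
            best = (k, s, u)
    return best

# Example: N = 60 VMs over sites with capacities 50, 80, and 40 units.
# k = 1 is infeasible (60 > 50); k = 2 wins with s = [15, 45].
print(optimal_split([50, 80, 40], 60))
```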

IV. DEPLOYMENT AND EVALUATION

Since its initial deployment in 2013, the virtual cluster capability based on the multi-domain broadcast exchange service has become one of the most popular features of the ExoGENI testbed, due to its ease of use and its high expressibility and performance for large-scale virtual systems. An example is shown in Fig. 4, a live screenshot from the ExoGENI client GUI. On the left is a virtual cluster with four groups of different sizes, required to be placed in four different cloud sites in Boston (MA), Berkeley (CA), Gainesville (FL), and Chapel Hill (NC). On the right is the manifest, the actually provisioned system, which maps the virtual root to the exchange point deployed in the Internet2 PoP in Raleigh, North Carolina, and shows the four inter-domain paths connecting the groups mapped to the four cloud sites.

Fig. 4: A virtual cluster and its automated provisioning in the ExoGENI testbed

V. CONCLUSION

In recent years, there has been growing interest in scaling up cloud IaaS offerings by federating many geographically distributed cloud sites. To automate provisioning in such an environment, an efficient resource abstraction model is key, as it enables easy interoperability between different cloud sites, presents a unified view to the users to acquire resources, and still allows autonomy of each individual site. The second key is the capability to provision dynamic, bandwidth-guaranteed inter-cloud connections in the inter-domain networking environment. In this paper, we presented the virtual cluster request abstraction and the substrate resource pool abstraction models. To enable efficient inter-cloud virtual cluster provisioning, a novel layer-2 exchange and broadcasting mechanism and its implementation in a multi-domain network environment were introduced. We outlined several efficient cluster splitting, cloud data center selection, and resource allocation algorithms for mapping large-scale virtual clusters. A prototype system has been implemented and integrated into the ExoGENI testbed and is being heavily used by scientific and data analysis applications. We believe the new models and capabilities we developed will open up new research and development opportunities in architecture, policy, and algorithm design for efficiently operating and expanding large-scale distributed cloud systems. We plan to conduct more algorithmic studies and experiments on the performance of virtual clusters of different applications distributed in the testbed.

ACKNOWLEDGMENT

This work is supported by the US NSF GENI initiative, NSF Award ACI-1032573, and DOE ASCR DE-SC0005286.

REFERENCES

[1] http://wiki.exogeni.net.
[2] M. Alicherry and T. V. Lakshman. Network aware resource allocation in distributed clouds. In INFOCOM, 2012 Proceedings IEEE, pages 963-971, March 2012.
[3] I. Baldine, Y. Xin, A. Mandal, P. Ruth, A. Yumerefendi, and J. Chase. ExoGENI: A multi-domain infrastructure-as-a-service testbed. In TridentCom: International Conference on Testbeds and Research Infrastructures for the Development of Networks and Communities, June 2012.
[4] H. Ballani, P. Costa, T. Karagiannis, and A. Rowstron. Towards predictable datacenter networks. In Proceedings of the ACM SIGCOMM 2011 Conference, SIGCOMM '11, pages 242-253, New York, NY, USA, 2011. ACM.
[5] M. Bateni and M. Hajiaghayi. Assignment problem in content distribution networks: Unsplittable hard-capacitated facility location. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, New York, 2009.
[6] M. Berman, J. S. Chase, L. Landweber, A. Nakao, M. Ott, D. Raychaudhuri, R. Ricci, and I. Seskar. GENI: A federated testbed for innovative network experiments. Computer Networks, 2014.
[7] L. Epstein and R. van Stee. Optimal online algorithms for multidimensional packing problems. SIAM J. Comput., 35(2).
[8] A. Gulati, A. Holler, M. Ji, G. Shanmuganathan, C. Waldspurger, and X. Zhu. VMware distributed resource management: Design, implementation, and lessons learned. VMware Technical Journal, Spring 2012.
[9] Y. Guo, A. Stolyar, and A. Walid. Shadow-routing based dynamic algorithms for virtual machine placement in a network cloud. In INFOCOM, 2013 Proceedings IEEE, pages 620-628, April 2013.
[10] A. Gupta, J. Kleinberg, A. Kumar, R. Rastogi, and B. Yener. Provisioning a virtual private network: A network design problem for multicommodity flow. In Proceedings of the Thirty-third Annual ACM Symposium on Theory of Computing, STOC '01, pages 389-398, New York, NY, USA, 2001. ACM.
[11] J. Lee, M. Lee, L. Popa, Y. Turner, S. Banerjee, P. Sharma, and B. Stephenson. CloudMirror: Application-aware bandwidth reservations in the cloud. In 5th USENIX Workshop on Hot Topics in Cloud Computing, San Jose, CA, 2013. USENIX.
[12] A. Mandal, P. Ruth, I. Baldin, Y. Xin, C. Castillo, M. Rynge, and E. Deelman. Evaluating I/O aware network management for scientific workflows on networked clouds. In Proceedings of the Third International Workshop on Network-Aware Data Management, NDM '13, pages 2:1-2:10, New York, NY, USA, 2013. ACM.
[13] F. Travostino, J. Mambretti, and G. Karmous-Edwards (Eds.). Grid Networks: Enabling Grids with Advanced Communication Technology. Wiley, 2006.
[14] T. Wood, K. Ramakrishnan, P. Shenoy, and J. Van der Merwe. CloudNet: Dynamic pooling of cloud resources by live WAN migration of virtual machines. In Proceedings of the 7th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '11), volume 46, New York, NY, July 2011. ACM.
[15] Y. Xin, I. Baldine, A. Mandal, C. Heermann, J. Chase, and A. Yumerefendi. Embedding virtual topologies in networked clouds. In Proceedings of the 6th International Conference on Future Internet Technologies, CFI '11, pages 26-29, New York, NY, USA, 2011. ACM.
[16] Q. Zhang, Q. Zhu, M. F. Zhani, R. Boutaba, and J. L. Hellerstein. Dynamic service placement in geographically distributed clouds. IEEE Journal on Selected Areas in Communications (JSAC), 31(10), Oct. 2013.