Computer Networks 55 (2011)

Understanding data center network architectures in virtualized environments: A view from multi-tier applications

Yueping Zhang (a,*), Ao-Jan Su (b), Guofei Jiang (a)
a NEC Laboratories America, Inc., Princeton, NJ 08540, USA
b Northwestern University, Evanston, IL 60208, USA
* Corresponding author.

Article history: Received 27 April 2010; Received in revised form 23 December 2010; Accepted 2 March 2011; Available online 11 March 2011. Responsible Editor: H. De Meer.

Keywords: Data center networks; Multi-tier systems; Server virtualization; Application performance

Abstract: In recent years, data center network (DCN) architectures (e.g., DCell, FiConn, BCube, FatTree, and VL2) have received a surge of interest from both industry and academia. However, evaluation of these newly proposed DCN architectures has been limited to MapReduce or scientific-computing types of traffic patterns, and none of the existing studies provides an in-depth understanding of their performance in conventional transaction systems under realistic workloads. Moreover, it is also unclear how these architectures behave in virtualized environments. In this paper, we fill this void by conducting an experimental evaluation of FiConn and FatTree, respectively representatives of the hierarchical and flat architectures, in a clustered three-tier transaction system using a virtualized deployment. We evaluate these two architectures from the perspective of application performance and explicitly consider the impact of server virtualization. Our experiments cover two major testing scenarios, service fragmentation and failure resilience, from which we observe several fundamental characteristics embedded in both classes of network topologies and cast new light on the implications of virtualization for DCN architectures. The issues observed in this paper are generic and should be properly considered in any DCN design before actual deployment, especially for mission-critical real-time transaction systems. © 2011 Elsevier B.V. All rights reserved.

(A shorter version of this paper appeared in IEEE IWQoS 2010.)

1. Introduction

Driven by the recent proliferation of Cloud services and the trend of consolidating enterprise IT systems, data centers are experiencing rapid growth in both scale and complexity. At the same time, it has become widely recognized that traditional tree-like structures of data center networks (DCN) encounter a variety of challenges, such as limited server-to-server connectivity, vulnerability to single points of failure, lack of agility, insufficient scalability, and resource fragmentation. To address these problems, in the past two years several network architectures [1,8,10,11,14,16] have been proposed for large-scale data centers and have gained a significant amount of attention from both industry practitioners and the research community. In general, all existing DCN architectures can be classified into two categories, hierarchical and flat. The former is represented by the conventional tree topology (followed by recent proposals such as DCell, FiConn, and BCube) and employs a layered structure that is constructed recursively from lower-level components. The latter (e.g., FatTree and VL2), on the contrary, organizes all servers into the same level and interconnects L2 switches using certain topologies (e.g., Clos networks).
All these newly proposed DCN architectures are shown to significantly improve scalability, failure resilience, and aggregate network bandwidth over the conventional tree structure. However, all existing work assumes general-purpose underlying systems and does not explicitly
consider performance of complex applications or interactions between different application components. In contrast, data centers may exhibit very different traffic patterns depending on the applications running on top, e.g., searching/indexing, media streaming, cloud computing, or transaction systems. Moreover, there exists no experimental comparison of these DCN architectures that would provide a better understanding of their characteristics in a common physical setting. Finally, an in-depth study of the impact of server virtualization on DCN architectures is also missing from the current picture. In this paper, we seek to fill this void by conducting an experimental evaluation of two DCN architectures, FiConn and FatTree (respectively representatives of the hierarchical and flat architectures; see Section 2), in a conventional mission-critical multi-tier transaction system. We first implement FiConn and FatTree in a fully virtualized testbed and then justify the correctness of our implementation by comparing the experimental results to those obtained from a non-virtualized testbed. We then deploy RUBiS, a clustered three-tier eBay-like online auction system, on both FiConn and FatTree and examine application performance in different scenarios. We focus on two test cases, service fragmentation and failure resilience. In the study of service fragmentation, we mix all servers together instead of grouping servers (i.e., VMs in our case) belonging to the same service tier onto the same physical machine (PM). The reason is twofold. On one hand, this allows sufficient network traffic to be injected into the network, thereby enabling us to study the networking properties of the underlying DCN architectures. On the other hand, this way we are also able to examine how application performance is affected by DCN architectures when service tiers are fragmented across various network locations. In particular, our results suggest that FiConn is subject to performance degradation under service fragmentation (especially when CPU- and network-intensive components are collocated on the same PM) due to its server-centric design, while FatTree demonstrates higher resilience to this situation as a result of its flat server layout. The second testing scenario is failure resilience, where we examine the performance of FiConn and FatTree after certain servers go down. Since both FiConn and FatTree have multiple routing paths for each pair of nodes, traffic is rerouted through other paths upon server failures. Our results show that application performance is more subject to degradation in FiConn than in FatTree due to FiConn's interference between routing and computing. However, FatTree's network-centric design does not come without expense. For instance, FatTree exhibits inefficient resource utilization in certain cases as a result of its routing behavior.

1.1. Contributions

The main contributions of this paper can be summarized in the following three aspects. First, we perform an experimental comparison of newly proposed DCN architectures in a common system setting. Second, this paper employs a fully virtualized implementation and explicitly examines the impact of server virtualization on DCN architectures. Third, all experiments are performed in a cluster-based three-tier system where application performance (e.g., request throughput and response latency) is the major measurement metric.
At the time of this writing, we believe that none of the above three points has been properly studied in the existing literature, and this paper is the first to address all of them simultaneously. Even though our experiments focus on FiConn and FatTree, most of our results can be generalized and applied as guidelines for designing any DCN architecture intended to be deployed in modern virtualized data centers hosting mission-critical real-time transaction systems. However, we understand that this work is by no means a comprehensive experimental study of all existing DCN architectures. Instead, we encourage readers to consider it an initial step towards a deeper understanding of these architectures in a more realistic environment.

The rest of the paper is structured as follows. In Section 2, we provide a brief review of existing data center architectures. In Section 3, we discuss the motivation for conducting this work and the goals we seek to achieve. In Section 4, we provide the details of the system setup for our experiments. In Section 5, we present the experimental results and our main findings. Finally, in Section 6, we conclude the paper and point out directions for future work.

2. Background

In this section, we briefly review the conventional and newly proposed data center network architectures. Based on how the system is constructed, we can classify existing DCN architectures into two categories, hierarchical and flat.

2.1. Hierarchical architectures

In a hierarchical architecture, servers are arranged in different levels, and a higher-level network is composed of multiple lower-level components. Clearly, the most widely adopted hierarchical data center architecture is the conventional tree structure, in which the entire system can be constructed by recursive composition of sub-trees. Another hierarchical DCN architecture is DCell, which is proposed by Guo et al. and designed to improve a DCN's robustness to link/server failures and its scalability to large numbers of servers. Similarly to the tree structure, DCell employs a recursive design, in which a level-k component DCell_k is composed of (t_{k-1} + 1) level-(k-1) elements DCell_{k-1}, where t_{k-1} denotes the number of servers in a single DCell_{k-1}. It is shown that the number of servers in DCell scales doubly exponentially with the level k. In addition, DCell is proven to be highly robust to link failures in that its bisection width is lower bounded by 1/(2 t_k ln t_k) of that of its embedded complete graph. A comprehensive study of DCell's structural and graph properties is available in the literature.
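To make the doubly exponential growth concrete, the following minimal Python sketch (ours, not taken from the DCell paper) applies the recurrence t_k = t_{k-1}(t_{k-1} + 1) implied by the construction above; the function name and the choice of n = 4 servers per DCell_0 are our own illustrative assumptions.

```python
def dcell_servers(n: int, k: int) -> int:
    """Number of servers t_k in a level-k DCell whose DCell_0 holds n servers.

    Each DCell_k is built from (t_{k-1} + 1) copies of DCell_{k-1},
    so t_k = t_{k-1} * (t_{k-1} + 1).
    """
    t = n
    for _ in range(k):
        t = t * (t + 1)
    return t

# With n = 4 servers per DCell_0:
for k in range(4):
    print(k, dcell_servers(4, k))  # prints: 0 4, 1 20, 2 420, 3 176820
```

Even starting from only four servers per DCell_0, a level-3 DCell already holds over 170,000 servers, which is the scaling behavior the text describes.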
Guo et al. further propose BCube, which uses several layers of mini-switches to connect different levels of BCube nodes. For instance, when constructing a level-k BCube, one connects the k-th port of the i-th server in the j-th BCube_{k-1} to the j-th port of the i-th layer-k switch. This way, BCube provides multiple parallel paths between any pair of servers, thereby improving server-to-server network capacity and fault tolerance. In both DCell and BCube, routing is performed by each server, which is equipped with multiple network ports for forwarding traffic. Specifically, each server in a DCell_k or BCube_k requires (k + 1) network ports. Building on BCube-based containerized data centers, Wu et al. develop an inter-container architecture called MDCube. Similar to DCell and BCube, MDCube also employs a server-centric design and implements routing and load-balancing functionalities in servers. The last hierarchical topology in our list is FiConn, which shares the same recursive design principle as DCell and BCube but uses only two ports per server. Thus, FiConn benefits from reduced wiring cost, but as a direct consequence provides less aggregate network capacity and lower fault tolerance than DCell and BCube. FiConn also differs from DCell in that FiConn balances link usage across levels, whereas in DCell links at higher levels carry more traffic than those at lower levels. Like DCell and BCube, FiConn also achieves high scalability and excellent fault tolerance.

2.2. Flat architectures

In contrast to hierarchical architectures, another class of DCNs utilizes a flat organization of servers, which are placed into a single layer and interconnected with switches. Network traffic routing and forwarding are conducted purely in switches. One representative of this class is the FatTree architecture introduced by Vahdat et al. FatTree applies the Clos topologies originally designed for telephone networks and organizes commodity switches into a k-ary fat-tree, which constitutes a richly connected backbone for inter-server network communication. Using redundant switches and interconnections, each pair of servers in FatTree enjoys a separate routing path and can therefore communicate at nearly the full speed of their local network interfaces. Based on the FatTree architecture, the authors further develop a network addressing scheme called PortLand, which separates the MAC address and physical location of each server to improve the agility of the entire system. Bearing the same design principle as FatTree, VL2 (or its predecessor Monsoon) is another example of a flat architecture. In VL2, the interconnection topology of switches follows a folded Clos network, in which intermediate and aggregation switches are respectively organized into the two sides of a bipartite graph. If each aggregation and intermediate switch has D_A and D_I ports, respectively, the aggregate capacity between these two layers is D_A * D_I / 2 times the link capacity. In the same vein as PortLand, VL2 employs a directory system to realize the separation of a server's network address and physical location.
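As a worked illustration of this capacity figure, with numbers of our own choosing rather than from the VL2 paper: if D_A = D_I = 24 and each link runs at 10 Gbps, the aggregate capacity between the aggregation and intermediate layers would be 24 x 24 / 2 x 10 Gbps = 2880 Gbps, i.e., 2.88 Tbps across only 24 aggregation switches.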
3. Motivation

The data center market is currently experiencing exponential growth due to the consolidation of enterprise IT systems and the proliferation of Cloud services (e.g., Amazon Web Services, Google Apps, and Microsoft Azure). The core technology driving this trend is virtualization, owing to its appealing cost reduction, high agility, and excellent flexibility. In turn, the cost benefits and operational efficiency brought by data center consolidation further stimulate the development and deployment of virtualization technologies. According to industry estimates, more than half of the new servers shipped in 2009 were virtualized, and that number will hit 80% by 2012. At the same time, however, virtualization also blurs the boundaries between physical resources and introduces an added layer of complexity to the performance management of large-scale data centers. This is especially the case in mission-critical multi-tier transaction systems, where multiple layers of servers and a spectrum of services need to work together following an intricate application logic. However, all existing efforts on the experimental understanding of DCN architectures [1,8,10,11,14,16] focus only on general-purpose systems under synthetic workload patterns, with application contexts limited to MapReduce and scientific computing. None of them considers more realistic and complicated systems (such as multi-tier applications) or examines system performance from the application's perspective, both of which we believe are critical for understanding and developing any DCN architecture. Driven by this motivation, we conduct in this paper an experimental evaluation of DCN architectures in a fully virtualized testbed hosting a three-tier online auction system. Through this study, we attempt to answer questions such as: (1) Will application performance be affected by DCN architectures, and if so, how and why? (2) What are the fundamental differences between hierarchical and flat DCN architectures? (3) Are the DCN architectures of interest vulnerable to certain failure patterns? (4) Does server virtualization introduce any complications into the performance of DCN architectures?

4. System architecture

We start with an introduction of our system architecture and implementation methodology.

4.1. Cluster-based three-tier transaction system

In this paper, we conduct experiments on a three-tier transaction system running an eBay-like auction service, RUBiS, as illustrated in Fig. 1. As seen from the figure, the system is composed of a workload generator and a three-tier processing system. We set up the RUBiS workload generator on a separate physical machine to avoid performance interference with the back-end servers. The generated workload is represented by a number of concurrent user sessions, each consisting of a series of interactions that follow a pre-defined Markov transition matrix. RUBiS client sessions are implemented as light-weight Java threads and are therefore able to emulate a large set of concurrent end-users performing a rich variety of actions of an auction web client, such as registration, bidding, searching, browsing, and leaving feedback.
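As a rough illustration of such a Markov-driven session generator, the Python sketch below draws a sequence of interactions from a hand-made transition matrix. The states, probabilities, and function names here are hypothetical and far smaller than RUBiS's actual matrix (and the real client is implemented in Java, not Python).

```python
import random

# Hypothetical subset of RUBiS interactions and transition probabilities.
TRANSITIONS = {
    "Home":     [("Browse", 0.7), ("Register", 0.3)],
    "Register": [("Browse", 1.0)],
    "Browse":   [("ViewItem", 0.5), ("Search", 0.3), ("Home", 0.2)],
    "ViewItem": [("PutBid", 0.4), ("Browse", 0.6)],
    "Search":   [("ViewItem", 0.6), ("Browse", 0.4)],
    "PutBid":   [("Browse", 0.5), ("Home", 0.5)],
}

def session(length=10, state="Home"):
    """Generate one user session as a sequence of interactions."""
    seq = [state]
    for _ in range(length - 1):
        states, probs = zip(*TRANSITIONS[state])
        state = random.choices(states, weights=probs)[0]
        seq.append(state)
    return seq

print(session())  # e.g., ['Home', 'Browse', 'ViewItem', 'PutBid', ...]
```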
Fig. 1. Three-tier Web service.

These end-user requests are further converted by the three-tier servers into different operation sequences in the back-end according to the application logic. This three-tier architecture is typical of many Web services that consist of a Web server layer, an application server layer, and a database layer. We implement each of the three tiers as a cluster of four virtual machines with appropriate load balancing mechanisms. First, we set up the Web server cluster with Apache Web servers and the HAProxy load-balancing Web cluster application. The master Web server dispatches HTTP requests to the other Web servers in a round-robin manner. Second, the application server cluster is implemented with Tomcat 4.1 running the RUBiS servlets. The front-end Web servers communicate with the application servers via the Apache JServ Protocol (AJP) load balancing mechanism and connect to the back-end database servers via the Java Database Connectivity (JDBC) protocol. Third, the database cluster consists of one MySQL database server and three storage nodes. Data are evenly partitioned and distributed across the three storage nodes to avoid database hotspots.

4.2. Virtualization

We employ a fully virtualized implementation of the DCN architectures. Specifically, on each physical machine we install a CentOS operating system and the Xen 3.1 hypervisor. We note that the conclusions obtained in this paper are generic and should hold for other hypervisor implementations. All physical machines are equipped with an Intel Core 2 Duo 2.33 GHz CPU and 4 GB of RAM. In addition, we install on each physical machine an Intel quad-port Gigabit network interface card to provide abundant link capacity and isolated network channels for the VMs. For the VMs, we obtain CentOS 5.3 VM images from stacklet.com and install the appropriate applications according to the role of each VM in the three service tiers, as discussed in the previous section.

4.3. Monitoring system

RUBiS is equipped with a distributed monitoring system called Ganglia, which installs a daemon process called GMond on each server and streams real-time monitoring data (e.g., CPU, memory, network, and disk I/O) of the servers to a central visualization server. However, these data are rendered in XML and directly exported without maintaining a local copy on each server. There is no easy way for us to obtain the raw monitoring data and conduct the relevant analysis, such as correlation between time series and deep inspection of network packets. Therefore, Ganglia cannot be directly used for our purpose. Instead, we instrument our own monitoring system by combining standard utilities such as tcpstat, vmstat, and tcpdump. On the client side, we simply use the tools provided by RUBiS to monitor various application QoS metrics, such as response time, request throughput, and loss rate.
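For concreteness, a minimal sketch of such instrumentation is shown below; the interface name, file names, and sampling intervals are placeholders, and the actual scripts used in our testbed may differ.

```python
import subprocess

# Start the collectors in parallel; each logs to a file for offline analysis.
collectors = [
    (["vmstat", "1"], "vmstat.log"),                        # CPU/memory every 1 s
    (["tcpstat", "-i", "eth0", "1"], "tcpstat.log"),        # per-second traffic stats
    (["tcpdump", "-i", "eth0", "-w", "trace.pcap"], None),  # raw packet trace
]
procs = []
for cmd, logfile in collectors:
    out = open(logfile, "w") if logfile else subprocess.DEVNULL
    procs.append(subprocess.Popen(cmd, stdout=out, stderr=subprocess.DEVNULL))

# ... run the experiment, then stop the collectors:
# for p in procs:
#     p.terminate()
```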
4.4. FiConn setup

We implement two DCN architectures, FiConn and FatTree, which are respectively representatives of the hierarchical and flat DCN topologies. Fig. 2(a) depicts a non-virtualized setup of a level-1 FiConn, which is composed of three level-0 FiConns, each consisting of four servers with the same specifications as described in Section 4.2. These four servers are bridged together via a mini switch, and two of them also serve as routers for relaying network traffic traveling to/from the other two level-0 FiConns. Fig. 2(b) illustrates our implementation of this level-1 FiConn in a virtualized testbed. As seen from the figure, we use three physical machines (PMs) to implement the three level-0 FiConns. Each PM hosts four virtual machines and utilizes three Gigabit Ethernet ports, one of which serves as the network bridge interconnecting the four VMs while the other two serve as router ports. The three PMs are directly connected to each other using CAT 6 cross-over cables. We implement FiConn's routing algorithm using a statically configured IP routing table in each VM.

Fig. 2. FiConn testbed setup.
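The listing below is a minimal sketch of what such a static configuration might look like on one non-routing VM, driving the standard Linux ip utility from Python. The subnet plan and gateway addresses are hypothetical (our testbed's actual addressing is not reproduced here), and the routing-node VMs additionally need IP forwarding enabled.

```python
import subprocess

# Hypothetical plan: the i-th level-0 FiConn uses subnet 10.0.i.0/24, and the
# local routing-node VMs are reachable over the local bridge.
STATIC_ROUTES = [
    ("10.0.1.0/24", "10.0.0.2"),  # second level-0 FiConn, via one routing node
    ("10.0.2.0/24", "10.0.0.4"),  # third level-0 FiConn, via the other routing node
]

for subnet, gateway in STATIC_ROUTES:
    subprocess.run(["ip", "route", "add", subnet, "via", gateway], check=True)
```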
4.5. FatTree setup

We employ a 12-server FatTree topology as shown in Fig. 3(a). These 12 servers are grouped into three Pods, each of which is connected to the Core layer via a group of four inter-connected Pod switches. The Core layer also has four switches, each of which is directly connected to all three Pods. Our virtualized implementation of this FatTree topology is illustrated in Fig. 3(b) and is similar to the original FatTree implementation. Specifically, each Pod is contained in a single physical machine. The Pod switches and the corresponding two-level routing algorithm are implemented using the Click modular router with user-level configurations. The physical machines are equipped with four Gigabit Ethernet ports, each of which is assigned to one of the four up-facing links connecting the Pod and Core switches. The four Core switches are implemented as four separate VLANs on a single 24-port Gigabit switch. Based on the above implementation, we next present an experimental evaluation of FiConn and FatTree in our virtualized testbed hosting the three-tier online auction system RUBiS.

5. Experimental evaluation

5.1. System validation

Before conducting further evaluation, we first need to ensure that our virtualized implementation does not introduce undesirable artifacts and is able to faithfully reveal the characteristics of the original DCN architectures. Towards this end, we implement FiConn and FatTree in a separate non-virtualized testbed. Specifically, our non-virtualized implementation of FiConn exactly follows the settings illustrated in Fig. 2(a), where each server corresponds to an actual physical machine. Similarly, our non-virtualized implementation of FatTree is identical to that shown in Fig. 3(a) except for the Pod switches (i.e., edge and aggregation switches), which in our case are implemented using the Click modular router due to the unavailability of programmable hardware switches. Then, we conduct on both the virtualized and physical testbeds the same set of experiments, composed of seven runs with workload generated by 16, 32, 64, 128, 256, 512, and 1024 clients, respectively. In each experiment, we measure the average aggregate request throughput of all clients and the average response time of all (around 27) types of runtime transactions, including Register, Browse, ViewItem, PutBid, and Search. We also calculate the standard deviation of throughput and response time to capture second-order dynamics of the measurements.

Experiment results of FiConn and FatTree in both the virtualized and non-virtualized testbeds are shown in Fig. 4. As seen in Fig. 4(a), the throughput of RUBiS in virtualized FiConn increases linearly with the number of clients (note the logarithmic scale of the x-axis) up to 512 clients and drops at 1024 clients. This drop occurs because the machine running 1024 concurrent client threads experiences high CPU utilization and therefore cannot generate requests at a higher rate. This is corroborated by a similar behavior observed in the non-virtualized testbed, as shown in the figure. End-to-end response time exhibits a similar trend, with the response latency showing a nonlinear jump when the number of clients increases from 512 to 1024.
Fig. 3. FatTree testbed setup.

Fig. 4. Implementation validation: throughput (req/sec) and response time (ms) vs. number of clients (threads), under virtualized and non-virtualized deployments.
Compared to throughput, the high standard deviation of response time is due to the nature of the different request types, some of which are simple front-end operations (e.g., Browse) while others require complicated back-end processing (e.g., Search). From Fig. 4(b), we can see that FatTree behaves similarly to FiConn under different numbers of clients. We note that under large numbers of clients (i.e., 512 and 1024), FatTree exhibits lower throughput and longer response time than FiConn. This phenomenon is mainly a result of a fundamental property of the FatTree architecture; we delay a detailed discussion of this issue until Section 5.3. Moreover, as seen from both figures, FiConn and FatTree produce consistent behavior in the virtualized and non-virtualized testbeds, thereby demonstrating that our virtualized implementation does not introduce fundamental artifacts or performance overhead and is therefore able to faithfully capture the properties of the DCN architectures under non-virtualized implementation. As virtualization is becoming the driving technology of today's data centers, we focus in this paper on evaluating DCN architectures in fully virtualized environments. Thus, in the following two subsections, we examine FiConn and FatTree in more testing scenarios using only the virtualized implementation. In Section 5.4, we summarize our findings with a discussion of the impact of server virtualization on DCN architectures.

5.2. FiConn experiments

We start with FiConn and evaluate it in two testing cases, service fragmentation and failure resilience.

5.2.1. Service fragmentation

As mentioned in Section 4, a level-1 FiConn includes three level-0 FiConns, each consisting of four servers. At the same time, from the perspective of application setup, we organize the 12 servers into three tiers (i.e., Web, application, and database), each of which also has four servers. Clearly, performance of the system is optimized if the three tiers are exactly mapped to the three level-0 FiConns, i.e., placing the clusters of Web servers, application servers, and database servers each in its own level-0 FiConn. This is because, this way, intra-tier network traffic (which is oftentimes data-intensive, such as in the database layer) is delivered by efficient memory copying between VMs residing on the same physical machine. The experiments performed in the previous subsection (see Fig. 4) follow this VM placement scheme, which we call the ideal placement for ease of reference (noting that it may not be the ideal placement for other application settings or network architectures). In practice, however, it is very difficult to guarantee that servers with intensive communication are always placed close to each other. This is especially true in Cloud environments, where applications arrive and depart dynamically and resources are allocated and recycled on demand. In such environments, servers with intensive communication or belonging to the same service tier are very likely to be scattered across different network segments. We call this situation service fragmentation and are interested in understanding its impact on different DCN architectures. To achieve this goal, we mix together servers of different service tiers and rerun the experiments conducted in Fig. 4(a).
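To make the two placement schemes concrete, the sketch below spells out one ideal and one fragmented VM-to-PM mapping for the 12 VMs. The names are ours, and the fragmented mixture shown merely follows the collocation pattern discussed next (e.g., two database VMs and one Web VM on a single PM); the exact assignment in our runs may differ.

```python
# 12 VMs (4 per tier) on 3 PMs: the ideal placement keeps each tier on one PM,
# so intra-tier traffic stays within a physical machine.
IDEAL = {
    "PM0": ["web1", "web2", "web3", "web4"],
    "PM1": ["app1", "app2", "app3", "app4"],
    "PM2": ["db1", "db2", "db3", "db4"],
}

# Service fragmentation scatters each tier across PMs; note that PM0
# collocates two database VMs with a Web VM, the contention case below.
FRAGMENTED = {
    "PM0": ["web1", "db1", "db2", "app1"],
    "PM1": ["web2", "app2", "db3", "web3"],
    "PM2": ["app3", "web4", "app4", "db4"],
}
```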
We plot the results for 256, 512, and 1024 clients in Fig. 5(b). Comparing these results with those obtained under the ideal placement, illustrated in Fig. 5(a), it is clear that under heavy workloads (i.e., 512 and 1024 clients) the throughput of FiConn exhibits severe fluctuations and a lower average value under service fragmentation than under the ideal placement. The underlying cause is twofold. First, when VMs belonging to the same service layer are placed onto different physical machines, a significant amount of communication traffic is converted from low-overhead intra-PM memory copying to high-overhead inter-PM network I/O, which slows down the system's performance. Second, placing multiple resource-consuming VMs onto the same PM may lead to contention problems. In particular, in our setting, two database servers and one Web server are placed on a single physical server. Both types of servers are very CPU-intensive and are subject to CPU contention under large numbers of concurrent clients. This situation is illustrated in Fig. 6, where under 512 clients both the Web and database servers experience CPU saturation, which slows down the processing of Web requests and database queries and in turn translates into the throughput fluctuations depicted in Fig. 5(b). Moreover, for 512 clients, the Web server and database server exhibit a transition of CPU utilization from around 10% to 100% at time 8 s and 12 s, respectively. This shift is a result of the behavior of RUBiS clients, which usually start with light actions (such as accessing the home page and browsing items) and then switch to operations involving heavy back-end processing (such as searching and bidding).

5.2.2. Failure resilience

A common feature of both FiConn and FatTree is that there exist multiple routing paths between every pair of nodes. As a result, both architectures provide a certain degree of resilience to link/node failures. In this subsection, we examine FiConn's robustness to such failures. First, we take a closer look at how routing is realized in virtualized FiConn. Each level-0 FiConn has three active Ethernet interfaces, eth1, eth2, and eth3. Interface eth1 is a dummy port used as a network bridge in Xen's Dom0 to interconnect the four VMs. The other two ports, eth2 and eth3, are actual routing ports for forwarding traffic to/from the other two level-0 FiConns. Each routing port is associated with one VM, which we call a routing node. In the first level-0 FiConn, for instance, the two routing nodes are VMs A and C. Whenever a packet generated by any VM in the first level-0 FiConn needs to reach a server in the second level-0 FiConn, it first travels to VM A, then to a routing node in the second level-0 FiConn via eth2, and finally to its destination server. Based on this understanding, we examine FiConn's performance under the two failure patterns shown in Fig. 7. Specifically, we construct the first failure scenario (which we call Case I) by bringing down a non-routing node B (a Web server in our case) in the first level-0 FiConn. As a consequence of this action, Web requests will be evenly distributed to the other three Web servers; but traffic