Cloud-Scale Data Center Network Architecture

Cheng-Chun Tu
Advisor: Tzi-cker Chiueh

September 10, 2011
Abstract

A cloud-scale data center network imposes unique requirements that differ from those of the traditional network architecture, which is based on a combination of Layer 2 Ethernet switches and Layer 3 routers. The state of the art shows that today's Layer 3 plus Layer 2 model brings significant configuration overhead and fails to meet some critical requirements of virtualized data centers. Because Ethernet offers a high performance-to-cost ratio and ease of configuration, we argue that it is desirable to build the cloud-scale data center network relying only on Ethernet technology. The ITRI (Industrial Technology Research Institute, Taiwan) container computer is a modular computer designed to be a building block for constructing cloud-scale data centers. Rather than using a traditional data center network architecture, the ITRI container computer's internal interconnection fabric, called Peregrine, is specially architected to meet the scalability, fast fail-over, and multi-tenancy requirements of these data centers. Peregrine is an all-Layer 2 network that is designed to support up to one million Layer 2 end points, provide quick recovery from any single network link/device failure, and incorporate dynamic load-balancing routing to make the best use of all physical network links. In addition, Peregrine features a unique private IP address reuse mechanism that allows virtual machines assigned the same IP address to run on it simultaneously without interfering with one another. Finally, the Peregrine architecture is implementable using only off-the-shelf commodity Ethernet switches. This report describes the design and implementation of a fully operational Peregrine prototype, which is built on a folded Clos physical network topology, and presents the results and analysis of a performance evaluation study based on measurements taken on the prototype.
Contents

1 Introduction
  1.1 Characteristics of Cloud-Scale Data Centers
    1.1.1 Scale-out model
    1.1.2 Virtualization
    1.1.3 Multi-tenancy
  1.2 Requirements on Cloud-Scale Data Center Networks
    1.2.1 Any-to-any connectivity with non-blocking fabric
    1.2.2 Virtual machine mobility
    1.2.3 Fast fail-over
    1.2.4 Support for Multi-tenancy
    1.2.5 Load balancing routing

2 Current Data Center Network Architecture
  2.1 Hybrid design: Layer 2 plus Layer 3
  2.2 Limitations of standard Ethernet
    2.2.1 Revisiting the classic Ethernet
    2.2.2 Scalability issues of Ethernet
  2.3 Mapping the L2 + L3 design to cloud-scale requirements

3 All Layer 2 Network
  3.1 Design Issues
  3.2 Standards and Industrial Solutions
    3.2.1 Link Aggregation Protocols
    3.2.2 ECMP: Equal-Cost Multi-Path
    3.2.3 TRILL and RBridge
    3.2.4 802.1aq: Shortest Path Bridging
    3.2.5 Cisco FabricPath
    3.2.6 Brocade VCS
    3.2.7 Juniper QFabric
    3.2.8 OpenFlow
  3.3 Academic Solution
    3.3.1 PortLand
    3.3.2 VL2
    3.3.3 Monsoon

4 Peregrine: An All-Layer-2 Container Computer Network Architecture
  4.1 Introduction
  4.2 ITRI Container Computer
  4.3 Two-Stage Dual-Mode Packet Forwarding
  4.4 Fast Fail-Over
  4.5 Load Balancing Routing

5 Peregrine Implementation and Performance Evaluation
  5.1 Prototype Implementation
  5.2 Network Initialization
  5.3 Effectiveness of Load Balancing Routing
  5.4 Packet Forwarding Performance
  5.5 Fail-over Latency

6 Conclusion
Chapter 1 Introduction

A cloud-scale data center is a facility in which a large number of computer systems and associated components are housed together. A cloud-scale data center provides power management, cooling, a data communication infrastructure, management interfaces, and security. In recent years, cloud-scale data centers have been built to provide environments for a variety of applications that handle the core business and critical operational data of companies. Business applications such as on-line financial transaction processing, multimedia content delivery, and computationally intensive workloads are critical to business revenue. Companies rely heavily on the data center because it provides centralized control and management for business applications. As such, the data center is a key component that needs to be carefully designed to meet growing performance requirements.

1.1 Characteristics of Cloud-Scale Data Centers

1.1.1 Scale-out model

Cloud-scale data centers require a high-performance network interconnect at the scale of tens of thousands of servers. To connect such a large number of hosts, traditional data centers form a tree-like physical topology with progressively more expensive and specialized high-end devices when moving up the network hierarchy. This model is called the scale-up design, and it suffers from limited scalability. For example, communication among hundreds of racks requires a high-capacity backplane, which is usually deployed using the highest-end IP switches/routers with large backplane capacity. The existence of a few critical components in the cloud-scale data center demands extremely high reliability, periodic upgrades, and great maintenance effort. Over the years, the scale-up model has been replaced by the scale-out model, which aggregates a large number of low-cost commodity devices to achieve the same functionality provided by a few specialized, expensive components. Optimization is no longer confined to a
few specific components but is realized as a system-wide design. The use of commoditized devices brings flexibility: a deployment can easily scale out to a large number of nodes or shrink to fit individual needs. Because of the large scale of a data center, the cost difference between using commodity devices and using non-commodity, high-end devices can amount to billions of dollars for cloud service providers [2, 26].

1.1.2 Virtualization

Virtualization has proven to be a good solution for providing low-cost hosts in the cloud-scale data center. Machines running virtualization software can easily consolidate multiple applications/OSes from a variety of vendors onto a single physical server. While this trend may seem to be confined to the server side, it has a direct influence on the underlying network. The following lists the characteristics of the cloud-scale data center network that come from virtualization.

1. Unpredictable bandwidth requirements: With a typical virtualized server usually having more than four network interface cards, the density of virtual machines per physical machine becomes higher. The adoption of virtual machine migration software, e.g., VMware vMotion, to dynamically move servers on demand across the network greatly increases the volume of network traffic. Moreover, with hypervisors deployed in the data center, software such as VMware usually requires a broader Layer 2 domain, which is incompatible with traditional data center network design. Concepts such as dynamic provisioning and virtual machine migration make it very difficult to determine where traffic is coming from and where it is going. With an increasing number of virtual machines running on and moving between different physical servers, application traffic becomes unpredictable and congestion becomes common.

2. Security policy enforcement: In the cloud-scale, virtualized data center, operators need to maintain policy changes and manage configurations across many devices. The conventional design applies physical separation to protect or isolate different data center elements. For example, the tiers of the multi-tier model - web server, application, and database - are often physically separated at different locations. Today, all of the tiers may run on a single physical machine. As a result, virtualization breaks down the multi-tier security model. Moreover, while migrating a machine from one server to another has become easy, dependent devices such as firewalls and intrusion prevention systems often need to be reconfigured as well.

3. Management overhead: The boundary between the network, storage, and security teams becomes blurred. For example, when moving virtual machines from one server to another, application bandwidth requirements and security policies need to be properly reconfigured at multiple devices such as routers,
switches, and load balancers. Moreover, the hypervisor comes with a software-implemented switch, or virtual bridge, which runs on the server side, hidden from network management. Managing these elements becomes an issue, and network management tools today are only beginning to support the concept of virtualization.

1.1.3 Multi-tenancy

One important characteristic of cloud computing is that most information technology users do not need to own their hardware and software infrastructure. They either pay for their IT infrastructure usage on demand or get it for free from the cloud service providers. In this new ecosystem, the cloud service providers, who own the cloud-scale data centers, must consolidate the physical data center infrastructure (PDC) and create multiple illusions of resources assigned to each individual tenant dynamically. The basic resource abstraction for a tenant is a virtual data center (VDC), which consists of:

- One or multiple virtual machines, each equipped with a virtual CPU, a specific amount of physical memory, and a virtual disk of a specific capacity.
- A guaranteed minimum network bandwidth to the Internet.
- A set of management policies and configurations, including firewall rules, network configuration, disk backup policy, etc.
- The binary images for the OS and applications to be deployed on the virtual machines.

Tenants either fully rely on the VDC as their IT infrastructure or partially integrate their on-premise data centers into the cloud VDC. The latter requires a seamless integration of the remote resources in the cloud and the local resources hosted by the tenant. Having identified the characteristics of and challenges for the cloud-scale data center network, in the next section we list the general requirements for the data center network architecture and map these requirements to the various solutions presented in the following chapters.

1.2 Requirements on Cloud-Scale Data Center Networks

A cloud-scale data center network architecture needs to support a wide range of application types, each with different requirements. A large number of servers and applications, increasing management complexity, and high availability and reliability requirements are all key issues a data center framework needs to address. Before embarking on the design of the network architecture, we carefully reviewed the related research literature, studied possible use cases, and came up with the following requirements:
1.2.1 Any-to-any connectivity with non-blocking fabric

Servers are typically equipped with multiple NICs to provide high aggregate throughput for applications running on VMs. With the unprecedented bandwidth requirements of today's applications, the underlying physical topology should guarantee rich connectivity between the NICs of each pair of PMs, with no or low oversubscription, where oversubscription refers to the ratio of the allocated bandwidth per user to the guaranteed bandwidth per user. In other words, 1:1 oversubscription means any arbitrary host in the data center should be able to communicate with any other host at the full bandwidth capacity of its network interface. Rich connectivity also means having multiple candidate paths between two end points from which routing protocols can pick the optimal path. This implies less chance of application performance degradation caused by network congestion and/or device failures. The ultimate goal is to create a logically single switch with a non-blocking internal fabric that is able to scale to the requirements of a cloud-scale data center: 100,000 or more ports and approaching one million virtual machines.

1.2.2 Virtual machine mobility

Virtual machine (VM) migration is one of the strategies used to dynamically reassign VMs to PMs based on the cloud resource provisioning system. To maximize efficiency, a virtual machine should be able to migrate transparently to any physical machine in the data center for load balancing or energy saving purposes. More specifically, transparent migration means the IP address of the VM and its security policies remain consistent after migration, without affecting any service running on the VM. The cloud-scale network architecture should create the illusion of a single Layer 2 network as an unrestricted VM migration domain.

1.2.3 Fast fail-over

With the trend toward commodity [2] hardware, failures in the cloud-scale data center will be common. In the conventional design, detecting failures and recovering from them depends on Layer 2 or Layer 3 routing protocols, e.g., IS-IS and OSPF, which are broadcast-based and require seconds to recover. For example, the spanning tree protocol (STP) periodically probes the network state and takes tens of seconds to recalculate the tree when links or switches fail. However, the variety of network- and data-intensive applications hosted in the data center, e.g., on-line financial transaction processing, execute in milliseconds and definitely cannot tolerate a network meltdown of several seconds. The fail-over latency includes detecting the failure event, reporting it to the management server, possibly some root cause analysis, and finally taking the recovery action. We expect the cloud-scale data center network to provide a recovery mechanism at the scale of milliseconds to minimize the impact on application performance.
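To put these time scales in perspective, the following back-of-the-envelope Python sketch compares how many millisecond-scale transactions are disrupted by a multi-second STP reconvergence versus a millisecond-scale recovery. The reconvergence time, recovery target, and transaction rate below are illustrative assumptions, not measurements from any system described in this report.

    # All numbers below are illustrative assumptions, not measurements.
    stp_reconvergence_s = 30.0     # assumed classic STP reconvergence time
    fast_recovery_s = 0.05         # assumed millisecond-scale recovery target (50 ms)
    txn_rate_per_s = 10_000        # assumed transaction arrival rate on the affected path

    print("transactions disrupted by STP reconvergence:", int(stp_reconvergence_s * txn_rate_per_s))
    print("transactions disrupted by 50 ms recovery   :", int(fast_recovery_s * txn_rate_per_s))
    # 300000 vs. 500 -- the motivation for millisecond-scale fail-over.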
1.2.4 Support for Multi-tenancy

Cloud service providers offer their resources to customers (tenants) in an on-demand fashion. Customers use cloud resources to dynamically extend their existing IT infrastructure. Multi-tenancy support for a cloud-scale data center means that the cloud infrastructure should be shareable by different customers on top of a large pool of physical resources. To be more specific, the physical resources should be able to be partitioned into multiple logical and independent network resources that can be dynamically allocated to each customer. Moreover, the cloud service provider should offer a flexible and seamless deployment approach for customers to integrate or extend their on-premise IT infrastructure into the cloud. The most straightforward way to integrate a cloud-based virtual data center with an on-premise physical data center (PDC) into a seamless data center is to ensure they share the same IP address space, i.e., IP addresses are allocated and reclaimed by a single entity, and to connect them with a pair of properly configured VPN gateways.

1.2.5 Load balancing routing

Designing effective routing strategies for a cloud-scale data center depends on an understanding of its traffic patterns. [12] collects socket-level logs from 1500 servers and analyzes the resulting one-month data set. They identify two traffic patterns, named Work-Seeks-Bandwidth and Scatter-Gather, from servers supporting MapReduce-style jobs as well as a distributed, replicated block storage layer for persistent storage. Another study [7] indicates that the variability in data center traffic is not amenable to concise summarization, and hence engineering routes for just a few giant flows from traffic metrics is unlikely to work well. While it is still too early to claim whether data center network traffic has predictable patterns, there are a few important properties that the routing protocols should consider:

1. Topology awareness: Cloud-scale data centers usually deploy a mesh-like topology with multiple redundant links interconnecting the ToR, aggregation, and core layers. The routing algorithm should understand the topology and efficiently use the rich connectivity.

2. Multipathing: The routing protocol should be able to establish multiple paths between two end hosts to increase the aggregate bandwidth and avoid link congestion at particular hot spots.

3. Many-to-one/many-to-many traffic patterns: With the massive deployment of MapReduce-like applications in the cloud, it is desirable that the routing protocols be optimized for these many-to-one/many-to-many traffic patterns.

4. Traffic engineering: Since every workload in the data center is controlled by the cloud service provider, it is practical for the cloud IT staff to measure and analyze the traffic characteristics. By classifying different
requirements, e.g., low latency or high throughput, the routing protocol can engineer the routes according to individual needs.

5. Resilience: The routing protocols should be resilient to any change in the topology and respond in time to the various events that affect the current routing decisions.
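Before moving on, the oversubscription ratio from Section 1.2.1 can be made concrete with a short calculation. The rack dimensions below (40 servers with 1 GbE NICs behind 4 x 1 GbE uplinks) are the typical numbers cited later in this report and are used here only for illustration.

    # Oversubscription at the ToR uplinks = rack downlink capacity / uplink capacity.
    servers_per_rack = 40
    server_nic_gbps = 1
    uplinks = 4
    uplink_gbps = 1

    downlink_capacity = servers_per_rack * server_nic_gbps   # 40 Gbps can try to leave the rack
    uplink_capacity = uplinks * uplink_gbps                   # 4 Gbps can actually leave
    print("oversubscription = %d:1" % (downlink_capacity / uplink_capacity))   # 10:1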
Chapter 2 Current Data Center Network Architecture

Given the above requirements, this chapter first presents the design of the conventional data center network, which is a Layer 2 plus Layer 3 design. We then map the requirements onto the conventional design and discuss the issues that arise from it and their causes.

2.1 Hybrid design: Layer 2 plus Layer 3

Ethernet has become one of the most popular LAN technologies in many environments, including enterprise networks, campus networks, and data centers. Even Internet service providers use Ethernet as their backbone network to carry traffic between multiple sites. Because of Ethernet's high performance-to-cost ratio and ease of configuration, almost every computer system today is equipped with one or more Ethernet network interface cards, and Ethernet has become the dominant networking technology. To achieve ease of configuration, or the plug-and-play property, one of the fundamental design decisions of Ethernet is to use a broadcast model for querying and locating specific services. For example, the Address Resolution Protocol (ARP) [21] uses broadcast to discover the mapping from a target IP address to its MAC address. The DHCP protocol depends on broadcast to locate the DHCP server, which then assigns an IP address configuration to the client. Although the broadcast model brings many conveniences, it restricts the network size to only hundreds of hosts [17] and is thus unscalable for a large deployment. To deal with Ethernet's limited scalability, a large network today is composed of multiple Ethernet LANs interconnected by IP routing, the so-called Layer 2 plus Layer 3 (L2+L3) solution. In this design, the size of an Ethernet LAN is usually restricted to a few hundred hosts and each LAN forms an IP subnet. An IP subnet is a subset of the network given by an IP prefix representing the network identification. Each host in the subnet is assigned a host number.
The host number combined with the IP prefix forms the host's IP address. A router typically contains many interfaces, and each interface is associated with an IP subnet. The information for determining the outgoing interface from an IP prefix is maintained in the routing table, a data structure that maps IP prefixes to outgoing interfaces.

[Figure 2.1: Layer 2 plus Layer 3 hybrid network architecture design for data centers. ToR: Top-of-Rack switch; AS: aggregation L2 switch; LB: load balancer; AR: access L3 router; CR: core L3 router.]

Figure 2.1 shows the conventional L2 + L3 network architecture for a large campus or data center network. The network is a hierarchy from the core routers at the top down to the ToR (Top of Rack) switches and servers. There are usually 20 to 40 servers per rack, and each server is equipped with multiple NICs connected to ToR switches. ToR switches connect to the aggregation switches (AS), through which server-to-server traffic that crosses racks flows. Firewalls and server load balancing are applied at this layer to optimize the network and secure applications. The bottom-left part of the figure forms a single Layer 2 domain. In order to scale to a large number of nodes, another layer of the network, Layer 3 routing, is deployed to interconnect multiple Layer 2 domains. The access routers (AR) connect to the aggregation switches (AS) downstream and to the core routers (CR) for traffic coming from and going to the Internet.
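As a minimal illustration of the routing-table lookup just described, the following Python sketch performs longest-prefix matching over a toy table; the prefixes and interface names are made up for the example.

    import ipaddress

    # Toy routing table: IP prefix -> outgoing interface (hypothetical values).
    routing_table = {
        ipaddress.ip_network("10.1.0.0/16"): "eth1",
        ipaddress.ip_network("10.1.2.0/24"): "eth2",
        ipaddress.ip_network("0.0.0.0/0"): "uplink0",   # default route
    }

    def lookup(dst_ip):
        """Return the interface of the longest (most specific) matching prefix."""
        dst = ipaddress.ip_address(dst_ip)
        matches = [net for net in routing_table if dst in net]
        best = max(matches, key=lambda net: net.prefixlen)
        return routing_table[best]

    print(lookup("10.1.2.7"))   # eth2: the /24 is more specific than the /16
    print(lookup("10.1.9.9"))   # eth1
    print(lookup("8.8.8.8"))    # uplink0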
2.2 Limitations of standard Ethernet

2.2.1 Revisiting the classic Ethernet

Ethernet, standardized as IEEE 802.3, is a family of frame-based network technologies for local area networks (LANs). A LAN consists of multiple Ethernet bridges and hosts. Each host in an Ethernet is assigned a unique 48-bit MAC (Media Access Control) address. An Ethernet bridge connects multiple hosts and bridges to form a multi-hop network and maintains a data structure called the forwarding table, a map from destination MAC addresses to outgoing ports on the bridge. When a frame arrives on a particular port, the switch automatically associates the frame's source MAC address with that port, a process named source port learning. The bridge then forwards packets by looking up the forwarding table using the packet's destination MAC address to decide the outgoing port. If the destination MAC address is not present in the table, the bridge broadcasts the packet to all ports except the receiving port, resulting in a domain-wide flood. Another cause of flooding is sending a broadcast frame using the broadcast MAC address, i.e., ff:ff:ff:ff:ff:ff. Ethernet bridges, if not properly connected, suffer from broadcast storms caused by loops in the physical topology. Unlike an IP packet, an Ethernet frame does not carry a TTL (Time To Live) field. When a broadcast frame enters an Ethernet with a loop in its topology, the frame will be repeatedly replicated and forwarded to other bridges. This generates an unbounded number of frames in the LAN and blocks all other network traffic, resulting in a network meltdown. The IEEE 802.1D STP (Spanning Tree Protocol) aims to solve the loop problem. Given an arbitrary network topology, the bridges running STP coordinate among themselves and form a unique tree. STP first automatically elects a root bridge as the root node of the tree, and the bridges collectively compute the spanning tree by calculating their distances to the root bridge. STP converges when every link is either in the forwarding state or in the blocking state, and the Ethernet bridges are then only allowed to forward frames out of ports in the forwarding state. Coupled with the broadcast-based delivery model, this design gives Ethernet one of its most promising features: plug-and-play simplicity. When Ethernet switches and hosts are connected together, Ethernet is able to discover the topology automatically and learn host addresses and locations on the network with little configuration.

2.2.2 Scalability issues of Ethernet

This enchanting plug-and-play property of Ethernet has proven successful over the past decades. Unfortunately, it does not come without a cost. We now discuss the fundamental design model of Ethernet and why it does not meet the requirements of the cloud-scale data center.
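The source-learning and flood-on-miss behaviour described above, which underlies the scalability issues discussed next, can be summarized in a minimal Python sketch (the MAC addresses and port numbers are made up for illustration):

    class LearningBridge:
        """Toy model of classic Ethernet bridging: learn source MAC -> port,
        forward on a table hit, flood on a miss or on the broadcast address."""
        BROADCAST = "ff:ff:ff:ff:ff:ff"

        def __init__(self, ports):
            self.ports = ports
            self.fdb = {}                   # forwarding table: MAC -> port

        def receive(self, in_port, src_mac, dst_mac):
            self.fdb[src_mac] = in_port     # source port learning
            if dst_mac != self.BROADCAST and dst_mac in self.fdb:
                return [self.fdb[dst_mac]]  # unicast out the learned port
            # Unknown destination or broadcast: flood to all ports except ingress.
            return [p for p in self.ports if p != in_port]

    bridge = LearningBridge(ports=[1, 2, 3, 4])
    print(bridge.receive(1, "aa:aa:aa:aa:aa:01", "ff:ff:ff:ff:ff:ff"))   # flood -> [2, 3, 4]
    print(bridge.receive(2, "aa:aa:aa:aa:aa:02", "aa:aa:aa:aa:aa:01"))   # hit   -> [1]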
Limited forwarding table size

When a frame arrives, an Ethernet switch determines the outgoing interface by looking up the forwarding table. The forwarding table in a switch contains one entry per destination MAC address. Each entry is associated with an aging time; once it expires, the entry becomes invalid and can be reused by a new entry. However, under heavy load with a large and diverse set of hosts communicating with each other, the forwarding table can fill up and become unable to hold any new entry. In this case, incoming frames whose destination MAC addresses are not present in the table are flooded to all ports and cause a traffic storm. So why not increase the size of the forwarding table? The reason the forwarding table does not grow with the size of the network is cost. Traditionally the table is stored in Content Addressable Memory (CAM), a specialized hardware device that uses the stored contents as the key for retrieving the data associated with those contents. CAM provides very fast table lookup but is expensive (4-5 times as much as a conventional RAM of the same size) and has limited storage capacity. As the data center network grows, the number of distinct MAC addresses traversing a switch explodes. In a data center, servers equipped with more than four Ethernet cards are prevalent; moreover, with servers running hypervisors, each virtual machine is associated with a globally unique MAC address. Even worse, the adoption of virtual machine migration makes it impossible for the network administrator to provision the number of entries a particular switch should maintain in its table. As a result, the number of MAC addresses can grow rapidly in a short time at unpredictable locations, degrading overall network performance.

STP as a solution to loop prevention

As discussed before, Ethernet suffers from broadcast storms caused by loops, and STP is adopted to solve this problem. STP detects and breaks loops by blocking the redundant links, resulting in an underutilized network. Figure 2.2 shows an example.

[Figure 2.2: Classic STP loop prevention in Ethernet.]

On the left, the topology contains loops, and without STP a broadcast packet could cause disastrous endless looping of frames. With STP enabled, the protocol automatically elects a root bridge, in this case
SW1, and discovers all the redundant links that need to be blocked in the topology. The gray dashed lines in the figure represent links in the blocking state, meaning no data packet can pass through them. The resulting topology is a single tree (rightmost figure) without the looping problem, but it suffers from the following drawbacks:

1. A single tree for all traffic
2. A single path for unicast and multicast
3. 50% of the bandwidth unused

Slow fail-over

When links fail in an STP-enabled network, the spanning tree needs to be rebuilt. Network traffic during this unconverged period of time will be discarded, forcing network services to pause. Broadcast storms might happen during this time and cause the network to shut down. The convergence time of STP ranges from a few seconds to several minutes, depending on the size of the network and the protocol. In a data center environment, this can cause serious problems. Although the Rapid Spanning Tree Protocol (RSTP) provides a shorter convergence time by only recalculating the affected subset of links in the tree, in a mesh-like network RSTP still blocks a large portion of the links. The authors of [14] built a simulator for RSTP and evaluated its behavior under a mesh topology. They showed that RSTP takes multiple seconds to converge on a new spanning tree and concluded that RSTP is not scalable to a large data center.

Broadcast overhead

Protocols based on broadcast introduce overhead into the data center network. Ethernet uses broadcast as the control messaging mechanism for higher-layer protocols such as DHCP and ARP (Address Resolution Protocol) [21]. In Ethernet, to find the destination MAC address of the receiving end, the source end first sends an ARP broadcast querying for the MAC address of the destination IP address. The destination node replies to the source node with the MAC address of its receiving interface card. [14] shows that ARP traffic presents a significant burden on a large network. Although each host caches the IP-to-MAC address mappings, based on their results, a million hosts in a data center would generate 239 Mbps of ARP traffic arriving at each host at peak, which might cause congestion at bottleneck links and consume frame processing time. DHCP is another broadcast-based protocol, used ubiquitously to assign IP addresses in a LAN dynamically. When a host boots up, it broadcasts DHCP discovery frames to all the nodes in the same domain. The DHCP server responds with an available IP address for the requesting node. As with ARP, every broadcast frame must be processed by every end host, and broadcast frame processing cannot be offloaded to the network interface card.
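A rough Python estimate shows why domain-wide ARP broadcast becomes a burden at this scale. The per-host ARP rate and frame size below are assumptions chosen only for illustration (they happen to land in the same order of magnitude as the 239 Mbps figure reported in [14], but this is not that paper's methodology):

    # Every broadcast ARP request is delivered to every host in the broadcast domain.
    hosts = 1_000_000              # end points in one flat layer 2 domain
    arp_per_host_per_s = 0.5       # assumed average ARP requests per host per second
    frame_bits = 64 * 8            # minimum-size Ethernet frame

    load_per_host_bps = hosts * arp_per_host_per_s * frame_bits
    print("ARP load seen by each host: %.0f Mbps" % (load_per_host_bps / 1e6))   # 256 Mbps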
2.3 Mapping the L2 + L3 design to cloud-scale requirements

While the traditional L2 + L3 design has served the Internet for decades, when applied to the cloud-scale data center it manifests limitations that fail to meet the requirements of the cloud-scale data center network. In this section, we consider using the traditional L2 + L3 design to build a cloud-scale data center and examine the problems encountered.

1. Any-to-any connectivity with non-blocking fabric: A cloud-scale fabric must connect thousands of hosts while meeting an oversubscription of 1:1. The traditional L2 + L3 design typically does not form a mesh-like network that meets the non-blocking property because (1) if the fabric is deployed using only Layer 2 switches, STP simply blocks the redundant links and only a single tree is used for forwarding, and (2) if it is deployed using Layer 3 routers, it imposes great configuration overhead and is more expensive than using commodity Layer 2 switches. As a result, the traditional hierarchical design is not able to meet the low oversubscription requirement between layers. For example, although server-to-server communication within the same rack has an oversubscription of 1:1, traffic across racks usually has a higher oversubscription ratio, and as traffic moves up through the layers of the hierarchy, the ratio increases rapidly. Uplinks from servers to ToRs are typically 1:5 to 1:20 oversubscribed; for example, 40 1G NICs may connect to a ToR switch with only four 1G uplinks. Paths that route through the core layer might suffer from oversubscription of 1:80 to 1:240 [7]. The high oversubscription ratio constrains workload placement by preventing idle servers from being assigned and thus greatly degrades the performance of data-intensive applications.

2. Virtual machine mobility: VM migration needs to be transparent to applications, which means the IP address must remain the same. More specifically, this requires that migration happen only within the same Layer 2 domain, because migrating to another Layer 2 network requires reconfiguring the IP address and subnet mask to those of the target Layer 2 network. The L2+L3 design, which connects multiple small Layer 2 domains with Layer 3 routing, greatly restricts a VM's mobility to its current Layer 2 domain. Techniques such as using a VLAN (Virtual Local Area Network) to extend a domain virtually to another physical location, or tunneling techniques such as IP-in-IP, can be used to increase VM mobility if configured properly. However, this usually requires error-prone manual labor, and the result is a high turnaround time. Moreover, misconfiguration of VLANs can cause a serious network meltdown.

3. Fast fail-over: Conventional Ethernet relies on the Spanning Tree Protocol to guarantee loop-free packet forwarding. In a network that
operates normally, if a link or switch goes down, STP automatically picks a backup path from the redundant links and the network re-converges; after a few seconds, the network operates normally again. One problem with using STP as the fail-over mechanism is that its re-convergence time is too long, e.g., several seconds. Mission-critical applications such as financial transactions that execute in milliseconds cannot wait several seconds through a network disruption. Needless to say, Layer 3 routing protocols such as link-state routing impose even higher fail-over latency.

4. Support for Multi-tenancy: Multi-tenancy support requires the cloud-scale data center to provide a group of logically isolated virtual resources, such as a VDC (Virtual Data Center), for each tenant. Under the L2 + L3 architecture, a VLAN is typically one of the options for partitioning an Ethernet network and creating multiple virtual Layer 2 domains, one per tenant. Unlike a physical LAN, a VLAN can be defined logically and allows end hosts to be grouped together even if they are physically located at separate switches. The group shares the same broadcast domain and has the same attributes as a physical LAN. This is achieved by assigning each group a VLAN ID; frames coming from the group are tagged with the ID when entering the VLAN and untagged when leaving. Although VLANs offer the flexibility to create logical LANs, they come with some limitations. First, the subnets and IP address ranges for each VLAN must be provisioned and managed. Second, the Layer 3 routers connecting the VLANs need to be properly configured, and inter-VLAN traffic must be routed through a gateway. Moreover, only a limited number of VLAN IDs, i.e., 4096, can be used. The last and most important limitation is that there is no way, short of significant effort, to provide each tenant its own 24-bit IP address space, because routing packets with the same destination IP address but destined to different hosts on the same physical network is almost impossible. Figure 2.3 illustrates the difficulty using IP-in-IP tunneling. The cloud service provider is hosting two customers, A and B, and offers each of them a 24-bit private IP address space. Both customers happen to assign IP2 as the IP address of one of their VMs, and both VMs are located in the routing domain of router R3. When IP1 from customer B and IP3 from customer A both try to communicate with their VM with address IP2, the next-hop router R2 encapsulates the IP header with an outer IP header destined to router R3. Inside the data center fabric, the packets are routed by the outer IP header. As soon as the packets arrive at R3, R3 decapsulates the IP-in-IP packets and forwards them to the outgoing interface given by its routing table. Although IP-in-IP tunneling separates the routing domains of the cloud service provider and the customers, the problem still exists at the edge router (R3) when the destination IP addresses of different customers are the same, in this case IP2. This implies that R3 must apply some non-standard technique,
e.g., VRF (Virtual Routing and Forwarding), in order to route to the two hosts belonging to different VDCs. (A short sketch of this address-collision problem appears at the end of this chapter.)

[Figure 2.3: Multi-tenancy problem caused by IP address space reuse.]

5. Load balancing routing: Although IP routing protocols can to some extent consider the load of each path and do some balancing, within a Layer 2 domain the spanning tree protocol (STP) creates a single tree for packet forwarding and all redundant links are blocked. This is, in fact, a waste of available bandwidth and increases the likelihood of unbalanced link utilization. Although configuring a per-VLAN spanning tree (PVST) can improve load balance and overall throughput [22], using it effectively and configuring it dynamically requires periodically probing the network and reassigning the spanning tree root.

In this chapter, we revisited classic Ethernet's broadcast-based service model and the spanning tree protocol. We argued that standard Ethernet does not scale to a large number of hosts because of this outdated model, and we evaluated the L2 + L3 design against the cloud-scale data center network requirements. In the next chapter we present several academic and industrial solutions that aim to build a single large Layer 2 network fabric by solving the scalability issues of Ethernet mentioned above.
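The address-collision problem from requirement 4 above can be sketched in a few lines of Python. The addresses are made up; the point is that once the outer IP-in-IP header is stripped at the edge router, a lookup keyed only on the inner destination cannot distinguish the two tenants, whereas a VRF-style lookup keyed on (tenant, destination) can:

    def encapsulate(inner_src, inner_dst, edge_router_ip):
        # The fabric routes on the outer destination (the edge router's address).
        return {"outer_dst": edge_router_ip, "inner_src": inner_src, "inner_dst": inner_dst}

    pkt_a = encapsulate("10.0.0.3", "10.0.0.2", "192.168.100.3")   # tenant A -> its VM 10.0.0.2
    pkt_b = encapsulate("10.0.0.1", "10.0.0.2", "192.168.100.3")   # tenant B -> its VM 10.0.0.2

    # After decapsulation, a table keyed only on the inner destination collides:
    edge_table = {}
    for tenant, pkt in (("A", pkt_a), ("B", pkt_b)):
        edge_table[pkt["inner_dst"]] = "port-for-tenant-" + tenant   # second entry overwrites the first
    print(edge_table)   # {'10.0.0.2': 'port-for-tenant-B'}: tenant A's VM became unreachable

    # A VRF-style table keyed on (tenant, destination) keeps the two apart:
    vrf_table = {("A", "10.0.0.2"): "port-1", ("B", "10.0.0.2"): "port-7"}
    print(vrf_table[("A", "10.0.0.2")], vrf_table[("B", "10.0.0.2")])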
Chapter 3 All Layer 2 Network

We define an all-Layer 2 network to be a scalable network architecture that carries traffic based on Ethernet technologies. The network employs only commodity Ethernet switches as dumb packet forwarding engines. Forwarding decisions are made based on the Ethernet header and its corresponding entry in the forwarding table. There is only a single subnet, and the only routers in this network are the gateways connecting to the WAN.

3.1 Design Issues

Before examining the various proposed solutions, we first list the common design issues that network architects need to address when building a single, large-scale Ethernet network.

1. Physical network topology: Since the traditional tree topology imposes high oversubscription, the solution requires a physical topology design that is non-blocking, easy to manage, and extendable to a large number of nodes (a small capacity sketch follows this list).

2. Addressing: The solution should address the limited forwarding table problem by assigning MAC addresses in a way that both delivers frames correctly and minimizes forwarding table usage. The addressing techniques should be supported by commodity switches, or require only minimal modifications.

3. Routing: If STP is disabled in order to make efficient use of all links, the all-Layer 2 network must provide its own routing techniques. This includes topology discovery, loop prevention, and load balancing.

4. Fail-over: The design should include failure detection, disseminate failure information either centrally or in a distributed fashion, and trigger a recovery mechanism to quickly bring the network back to normal operation.
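As an illustration of design issue 1, the following sketch counts hosts and switches in a k-ary fat tree built entirely from identical k-port commodity switches, a standard non-blocking construction (PortLand, discussed later in this chapter, uses such a fat tree; the Peregrine prototype described in this report uses a folded Clos, to which the same style of counting applies):

    def fat_tree_capacity(k):
        """Counts for a k-ary fat tree of identical k-port switches."""
        hosts = k ** 3 // 4          # k pods * (k/2 edge switches) * (k/2 hosts per edge switch)
        edge = k * k // 2            # k pods * k/2 edge switches
        aggregation = k * k // 2     # k pods * k/2 aggregation switches
        core = (k // 2) ** 2
        return hosts, edge, aggregation, core

    for k in (16, 48, 64):
        hosts, edge, aggr, core = fat_tree_capacity(k)
        print("k=%2d-port switches: %6d hosts, %5d edge, %5d aggregation, %5d core"
              % (k, hosts, edge, aggr, core))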
3.2 Standards and Industrial Solutions

3.2.1 Link Aggregation Protocols

LACP: Link Aggregation Control Protocol

IEEE 802.3ad, also called link aggregation, MLT (Multi-Link Trunking), or NIC bonding, is an industry standard for combining multiple parallel network links into a single logical connection. The benefits are increased aggregate throughput and redundancy in case a link fails. Using LACP, a switch learns the identity of neighboring switches capable of supporting LACP and the capability of each port. It then groups similarly configured ports into a single logical link. Packets destined to the logical link (also called a trunk) are distributed to one of the links in the group based on some distribution algorithm. If one of the links in the group fails, traffic previously carried over the failed link moves to the remaining links within the same group. LACP is commonly deployed in cloud data centers and enterprise networks. For example, a fat-tree topology has higher bandwidth demand when moving up the hierarchy. A ToR switch can group four of its 10GE ports with another four ports on its aggregation switch, creating a 40 Gb uplink within a single spanning tree (STP) domain. Without LACP, grouping multiple links between two switches results in the redundant links being blocked by STP. LACP avoids this limitation because, from STP's point of view, once the links are aggregated, the group of links is treated as a single entity, eliminating the possibility of the redundant links being blocked.

SMLT: Split Multi-Link Trunking

[Figure 3.1: SMLT: Split Multi-Link Trunking (left) and STP's view (right).]

SMLT is an enhancement to LACP that removes the limitation that all the physical links in a group must terminate on the same switch. With LACP, the physical links aggregated in a group can only connect a single pair of switches, which increases cabling complexity and limits the flexibility of the network design. SMLT combines the bandwidth of multiple Ethernet ports while splitting the links across multiple switches. This offers not only larger aggregate bandwidth but also a more reliable
network design, because if one of the neighboring switches fails, the remaining switches can still forward traffic for the group. Figure 3.1 shows an example topology and its STP view. The left side of the figure shows SW1 and SW2 first forming a single logical switch by using stacking or trunking protocols, depending on the switch vendor; state such as the forwarding table and port status is synchronized between SW1 and SW2. SW3 and SW4 each split two of their ports to connect to SW1 and SW2, forming a multi-link trunking topology. Although the topology physically contains a loop, no link is blocked. The right side of the figure shows STP's view of this topology: because the spanning tree protocol does not see the redundant links inside the trunk port, no link is blocked.

3.2.2 ECMP: Equal-Cost Multi-Path

[Figure 3.2: 4-way Equal-Cost Multi-Path setup.]

Equal-cost multi-path routing (ECMP) is a load balancing routing technique, discussed in RFC 2991 [24], that spreads traffic over multiple best paths to a single destination. ECMP applies load balancing to flows such as TCP or UDP connections, potentially increasing the bandwidth between two endpoints by spreading the traffic over multiple paths. Figure 3.2 shows a four-way ECMP design. Assume a flow is established from A1 to B1 and a packet from A1 enters the next hop (R5). The default hashing configuration hashes flows based on the Layer 3 source and destination IP addresses and the Layer 4 port numbers, and determines the outgoing interface from one of the four uplinks, say R2. The upper-layer switch R2 then deterministically forwards the packet to R6, which delivers it to B1. The resulting effect is that the traffic between R5 and R1-R4 is randomly distributed across the uplinks on a per-flow (TCP or UDP) basis. However, ECMP has some limitations: if certain elephant flows are present, or multiple flows happen to hash to the same outgoing interface, congestion and unbalanced link utilization can still occur [3]. Although per-packet load balancing over multiple paths gives the best bandwidth utilization, it is usually avoided for the following reasons. First, per-packet load balancing
is likely to increase the number of out-of-order packets, increasing the workload on the receiving hosts. For example, TCP treats out-of-order delivery as an indication of network congestion: TCP decreases its window size and the throughput of the connection drops. Second, per-packet load balancing requires process switching on all but the highest-end routers, and process switching is 8 to 12 times more processor-intensive than fast switching. Finally, randomly spreading packets across multiple links makes the network harder for IT staff to debug. Table 3.1 compares SMLT and ECMP.

  Architecture                 | SMLT                  | ECMP
  Protocol layer               | Layer 2 forwarding    | Layer 3 IP routing
  Load balance granularity     | granular, per-host    | finer, per-flow or per-packet
  Path diversity exploration   | grouping ports        | IGP, BGP routing protocols

Table 3.1: A comparison of Trunking and ECMP.

SMLT and ECMP also share some common limitations (a small flow-hashing sketch follows this list):

1. Upstream hash collision: There is usually a limited number of ports or paths in a group. With hashing applied per host or per flow, there is a high probability that multiple large or medium-sized flows are assigned to the same link, causing congestion.

2. Downstream collision: Both ECMP and SMLT balance traffic at the edge device upstream toward the upper-layer devices. Two flows coming from different hosts might end up using the same downstream link, and congestion can happen.
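The per-flow hashing behaviour of ECMP, and the collision problem above, can be illustrated with a minimal Python sketch; the hash function and addresses here are stand-ins for whatever a real switch ASIC implements:

    import hashlib

    def ecmp_uplink(src_ip, dst_ip, src_port, dst_port, proto, n_uplinks=4):
        """Pick an uplink by hashing the flow identifier, so all packets of the
        same flow take the same path."""
        key = ("%s|%s|%d|%d|%s" % (src_ip, dst_ip, src_port, dst_port, proto)).encode()
        digest = hashlib.sha256(key).digest()
        return int.from_bytes(digest[:4], "big") % n_uplinks

    # Two flows between the same pair of hosts may still hash to the same uplink,
    # which is exactly the upstream collision problem described above.
    print(ecmp_uplink("10.0.1.5", "10.0.2.9", 40001, 80, "tcp"))
    print(ecmp_uplink("10.0.1.5", "10.0.2.9", 40002, 80, "tcp"))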
3.2.3 TRILL and RBridge

TRILL (Transparent Interconnection of Lots of Links) [25], invented by Radia Perlman to remove certain deficiencies of bridged Ethernet, is an IETF protocol implemented by devices called routing bridges (RBridges) [20]. TRILL is not intended to solve the scalability problems of the modern data center, but rather to avoid the disadvantages of standard bridging. As mentioned before, IP routing does not suffer from problems such as loops in the topology or STP's inefficiency; TRILL aims to avoid these problems by adding routing capability to the Layer 2 bridge. TRILL makes no assumption about the physical network topology, and TRILL switches can coexist with standard Ethernet switches, making TRILL an incremental feature. For addressing and routing, the RBridges in the data center run a link-state routing protocol such as IS-IS. All RBridges therefore have the topology information and are able to compute shortest-path routes to each other. An RBridge learns the MAC addresses of end hosts by inspecting the packets originating on its links. This information is distributed to all other RBridges so that every RBridge knows the appropriate destination RBridge for a given MAC address; this approach is not scalable to a very large number of end hosts. When a packet enters the RBridge network, the ingress RBridge determines the RBridge nickname associated with the destination MAC address and encapsulates the outgoing packet. The RBridge header includes a hop count, the egress RBridge nickname, and the ingress RBridge nickname. At each intermediate RBridge on the way to the destination RBridge, the hop count is decremented, which prevents transient loops during convergence.

[Figure 3.3: Forwarding paradigm of TRILL (from NIL Data Communications).]

Figure 3.3 shows an example of TRILL forwarding. The ingress RBridge receives the user's MAC frame and encapsulates it in a TRILL header carrying the addresses of ingress RBridge A and egress RBridge C. The TRILL datagram gets a new outer MAC header, which is looked up and rewritten every time the packet is forwarded by an RBridge; the hop count field in the TRILL header, HopC, is decremented while the rest of the TRILL header stays unchanged. To coexist with standard switches, the RBridge can use a standard Layer 2 header with its own protocol type: the RBridge header is appended after the standard Ethernet header and followed by the payload. When the egress RBridge C receives the packet, the RBridge header is removed so that the RBridge network is transparent to the destination host S. In brief, the design can be thought of as transparently interconnecting links while avoiding the disadvantages of bridging by using routing. TRILL focuses on the issues of looping and STP inefficiency, but it does not directly solve cloud-scale data center network problems such as limited forwarding table size, nor does it meet all of the requirements.

3.2.4 802.1aq: Shortest Path Bridging

Shortest Path Bridging (SPB) is an IEEE 802.1aq draft intended to serve as both a carrier and an enterprise solution. There are two flavors of SPB multipath bridging: Shortest Path Bridging MAC-in-MAC (SPBM) and Shortest Path Bridging
VLAN (SPBV). The 802.1ad (Q-in-Q) forwarding variant is called SPBV, and the 802.1ah (Provider Backbone Bridging, or MAC-in-MAC) variant is called SPBM. For the purpose of this report, we focus on SPBM. SPB reuses the Provider Backbone Bridging (PBB) 802.1ah [4] MAC-in-MAC technique. The ingress switch takes the customer's MAC frame and encapsulates it in an 802.1ah MAC frame. The 802.1ah frame header includes a service identifier (I-SID), which abstracts the service from the network by mapping one or multiple VLANs to an I-SID. SPB automatically constructs shortest paths through the network to extend LAN connectivity. The frame is forwarded based on the destination MAC address, and throughout the forwarding process the frame remains unchanged. This alleviates the limited forwarding table size problem, since only the edge switches need to learn the users' MAC addresses. For routing, SPB uses IS-IS as the link-state routing protocol to build the network topology and selects shortest paths according to the link metrics; traffic (unicast and multicast) is then assigned to those paths. For load balancing, SPB computes 16 source-based shortest path trees.

[Figure 3.4: Forwarding paradigm of Shortest Path Bridging, SPB (from NIL Data Communications).]

Figure 3.4 shows an example of SPBM forwarding. The 802.1aq edge switch (switch U) takes the user's MAC frame and encapsulates it in an 802.1ah (MAC-in-MAC) frame. The destination MAC address in the 802.1ah header is that of the egress switch (switch C). Throughout the backbone forwarding process, the frame remains untouched and the destination MAC address is unchanged. When the frame arrives at the egress switch C, it is decapsulated and delivered to the host S. Table 3.2 compares TRILL and SPB.

  Architecture            | TRILL                                              | SPB
  Topology                | general                                            | general
  Addressing              | TRILL header                                       | 802.1ah MAC-in-MAC
  Routing                 | link-state, IS-IS                                  | link-state, IS-IS
  Load balancing          | N x transit hash based                             | 16 x ECMP, source node based trees
  Loop prevention         | hop count and RPFC (Reverse Path Forwarding Check) | RPFC
  Fail-over latency       | depends on IS-IS                                   | depends on IS-IS
  Multicast or broadcast  | single tree                                        | source node based spanning tree
  Compatibility           | new header; requires new ASICs on every RBridge    | traditional Ethernet switching with 802.1ah-capable hardware

Table 3.2: Solution strategies comparison of TRILL and SPB [10].

3.2.5 Cisco FabricPath

Cisco FabricPath [23] is a Cisco NX-OS technology that combines Layer 2 configuration simplicity and flexibility with Layer 3 convergence and scale. The idea is to create simple, scalable, and efficient Layer 2 domains that are applicable to data centers. It increases server-to-server bandwidth with multiple active
paths and creates a non-blocking architecture to improve performance. Cisco claims that FabricPath is a superset of TRILL and will support TRILL once it is standardized. Regarding topology, FabricPath makes no assumption about the physical topology. An example in the FabricPath documentation [11, 23] suggests a two-layer topology (aggregation/spine and access switches) with 16-way ECMP. When the two switch layers are combined with 16-port 10 Gbps PortChannels, FabricPath can provide a data center fabric with 2.56 Tbps of bandwidth between switches. As for routing and addressing, similar to TRILL, the control plane of FabricPath is built on top of the IS-IS (Intermediate System-to-Intermediate System) routing protocol, and the routing table is computed for multicast and unicast destinations. Frames in FabricPath are forwarded along the shortest path, which reduces the overall network traffic load and increases efficiency. Frames are always forwarded to known addresses, which means no flooding compared to standard Ethernet. Moreover, FabricPath frames include a TTL field similar to IP, which prevents loops in the bridged Layer 2 network. To address the issue of limited forwarding table size, MAC addresses in FabricPath are learned selectively only at the edge, saving MAC address table space in the aggregation switches. The detection of link or switch failures relies on the IS-IS routing protocol. Between the access and aggregation layers, FabricPath load balances the traffic using multi-way ECMP, which can use all the available links between any two devices and spread the traffic across them.

3.2.6 Brocade VCS

Brocade VCS protocols remove the need for STP and allow all equal-cost paths to be active, which results in no single point of failure.
Brocade VCS enables organizations to preserve existing network designs and cabling and to achieve active-active server connections without using the Spanning Tree Protocol (STP). The Brocade VDX design utilizes a single standard LAG consisting of multiple 10 GbE connections, allowing two switches to appear as a single logical switch to the core routers. The underlying protocol, Transparent Interconnection of Lots of Links (TRILL), provides active multipath support, so that the rack server sees only a single ToR switch.

3.2.7 Juniper QFabric

[Figure 3.5: QFabric's design: Node, Interconnect, and Director.]

Juniper's virtual Layer 2 switching architecture, called QFabric [1], aims to distribute the control and data planes to the edge. QFabric makes the network itself behave like a single switch: inside every switch is a mesh-like fabric that is completely flat and provides any-to-any connectivity between ports. The design of QFabric, shown in Figure 3.5, has three basic components:

1. QF/Node: The QFabric Node provides access into and out of the fabric. The node devices are typically line cards that reside within a chassis switch, forming a high-density edge device.

2. QF/Interconnect: This is the backplane of QFabric. The design enables any-to-any connectivity, where every device is a single hop away from any other device. The initial release of the QFabric architecture supports the interconnection of up to 128 QF/Node edge devices, creating a single fabric capable of supporting 6,000 10GbE ports.
3. QF/Director: This component provides the control and management services for the fabric. The Director has an exclusive out-of-band control plane network for carrying control traffic between the QF/Node and QF/Interconnect devices. Moreover, it allows the data center network to appear as a single switch, providing simplicity of management to network operators.

In conclusion, QFabric provides a flattened, non-blocking network fabric that supports high-speed server-to-server communication. It allows the data center network to appear as a single switch, which dramatically reduces the cost of managing multiple switches. QFabric further separates the control plane and data plane to eliminate any single point of failure in the system. By adding ports to existing switches, the QFabric architecture can scale the data center network with minimal additional management and operation overhead.

3.2.8 OpenFlow

OpenFlow [13] is a programmable protocol designed to manage and direct traffic among OpenFlow switches from different vendors. Typically, the basic job of a networking device (bridge or router) is to make forwarding/routing decisions (the control plane) and subsequently forward the data (the data plane). The control plane runs control protocols, e.g., STP, RSTP, routing protocols, and MAC address learning. These control protocols program the forwarding information into the data plane, which can be a simple lookup table in TCAM (Ternary Content Addressable Memory); packets are then forwarded to the outgoing interface by looking up the table. Typically, the control plane uses a communication protocol to program the forwarding information into the data plane. Vendors today offer various degrees of programmability and proprietary protocols on their switches and routers for this purpose. As a result, global and unified network resource management and traffic engineering are limited by the inconsistency of devices from multiple vendors. OpenFlow's goal is to provide an open and standard protocol to program the tables in switches and routers from different vendors. OpenFlow consists of three parts:

1. Flow tables installed on switches
2. An OpenFlow controller
3. The OpenFlow protocol, with which the controller creates secure channels to switches

The flow table on each switch is controlled by the OpenFlow controller via the secure channel. The controller installs flow policies into the flow table. Paths through the network can be optimized for specific characteristics, such as SLA, end-to-end latency, or throughput. When an OpenFlow switch receives a frame, it first checks whether there is a matching entry in its flow table. If not, the switch forwards the frame to the controller. The controller makes forwarding decisions based on various fields in the frame, such as the source/destination MAC addresses, IP addresses, or port numbers. This part of the logic can also be used as a firewall to block or restrict certain network flows. Once the controller decides the forwarding
policy, it programs the information into the switch's flow table via the secure channel. In conclusion, with OpenFlow, network operators can slice off a part of the network devices (switches or routers) in an OpenFlow network and create virtual networks for researchers to develop new protocols. In addition, OpenFlow offers fine-grained, flow-level forwarding control without the restrictions of IP routing or STP.

3.3 Academic Solution

3.3.1 PortLand

[Figure 3.6: PortLand: packet forwarding and Actual MAC (AMAC) to Pseudo MAC (PMAC) mapping.]

The goal of PortLand [16] is to create a scalable, easily manageable, and fault-tolerant data center Layer 2 network fabric. It proposes a three-tier hierarchical topology and a scalable Layer 2 routing and forwarding protocol. PortLand designs the data center over a three-stage (core-aggregation-edge) fat-tree topology. Edge and aggregation switches are grouped into pods, and each pod connects to every core switch. PortLand overcomes the limited forwarding table size problem by assigning a hierarchical pseudo MAC address (PMAC) to each host, encoding its location (represented as pod.position.port.vmid). The design keeps the PMAC transparent to the host: the host remains unmodified and still uses its original MAC address (AMAC). When communication starts, a host sending out an ARP request receives the PMAC of the destination host (steps 1-3 in Figure 3.6). The forwarding of subsequent packets is based on the PMAC of the destination host, which means the AMACs of hosts never consume forwarding table entries except at the edge switches. When the edge switch associated with the destination receives a frame, it needs to perform PMAC-to-
When a packet enters the PortLand network fabric, switches can forward it based on the PMAC because the location of the destination is encoded in the address. This design requires switches to discover their own locations using PortLand's Location Discovery Protocol (LDP), and an edge switch automatically learns PMAC-to-AMAC mappings by observing incoming packets. PortLand guarantees loop-free forwarding by preventing a switch from forwarding packets to an upward-facing port if the packets were received from a switch higher in the hierarchy. Switches running LDP detect switch and link failures by exchanging liveness messages. The fabric manager keeps a fault matrix for the links and updates it with new information. Once the fault matrix changes, the fabric manager informs all affected switches of the failures along with the new topology, and those switches recalculate their forwarding tables accordingly. The authors treat the multipath and load balancing problem as orthogonal to this work and suggest flow-hashing ECMP as one way to achieve flow-level load balancing.

3.3.2 VL2

Figure 3.7: VL2: packet encapsulation, decapsulation, and VLB.

The main ideas behind VL2 are to create a large virtual layer 2 domain using commodity hardware, a non-oversubscribed Clos network topology, and a load balancing mechanism based on randomization. VL2 takes the scale-out approach and designs its topology by aggregating the capacity of a large number of commodity switches. The topology follows a three-layer design: intermediate, aggregation, and ToR (Top of Rack). The links between the intermediate and aggregation switches form a Clos [5] network, providing rich path diversity and no oversubscription.
VL2 leverages two different IP address families: location-specific IP addresses (LAs) and application-specific IP addresses (AAs). LAs are hierarchically assigned to all the switches, and the switches run an IP-based link-state routing protocol. AAs are allocated to applications and remain unchanged when an application migrates to another location. The idea is to create the illusion of a single large IP subnet (the AA address space) while the underlying network routes packets by LA. Similar to PortLand, this requires a directory system to maintain the AA-to-LA mappings. To route between servers, VL2 deploys a layer-2.5 agent on each server that intercepts ARP requests and redirects them to the directory system. The directory system responds with the LA associated with the destination AA, and the agent encapsulates the outgoing packets with this LA. To distribute load equally, VL2 applies Valiant Load Balancing (VLB): it randomly chooses one intermediate switch and encapsulates the LA of that intermediate switch into the outgoing packets. In brief, as shown in Figure 3.7, an outgoing packet is encapsulated with an intermediate switch LA and a ToR switch LA. Along the forwarding path, the intermediate switch and the ToR switch decapsulate the packet and forward it to the next hop. VL2 also uses ECMP [9] to distribute traffic across equal-cost paths. The combination of VLB and ECMP prevents any network path from being persistently overloaded in the data center. In addition, VL2 uses a link-state routing protocol to detect switch/link failures and maintain the switch-level topology.
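The VLB step can be summarized in a few lines of Python: the sending agent resolves the destination AA through the directory system, picks a random intermediate switch, and wraps the packet in two outer headers (intermediate LA, then destination ToR LA). Names such as directory and the record layout below are illustrative assumptions, not part of VL2.

    import random

    # Hypothetical directory: AA -> (ToR switch LA, list of intermediate switch LAs)
    directory = {
        "20.0.0.5": {"tor_la": "10.1.2.1",
                     "intermediates": ["10.0.0.1", "10.0.0.2", "10.0.0.3"]},
    }

    def vl2_encapsulate(payload: bytes, dst_aa: str):
        """Sketch of the VL2 agent's send path: AA lookup + VLB double encapsulation."""
        entry = directory[dst_aa]                     # normally an ARP intercept + directory query
        intermediate_la = random.choice(entry["intermediates"])   # Valiant Load Balancing
        # The outermost header targets the randomly chosen intermediate switch;
        # the inner header targets the destination's ToR switch.
        return {
            "outer_dst": intermediate_la,   # removed (decapsulated) at the intermediate switch
            "inner_dst": entry["tor_la"],   # removed at the ToR switch
            "dst_aa": dst_aa,               # used for final delivery to the server
            "payload": payload,
        }

    pkt = vl2_encapsulate(b"hello", "20.0.0.5")
    assert pkt["inner_dst"] == "10.1.2.1"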
3.3.3 Monsoon

Monsoon [8] aims to create an all-layer-2, scalable, and load-balancing network architecture. Monsoon uses a three-layer approach (ToR, ingress/egress, and intermediate) to scale its layer 2 domain to 100,000 servers. Each ToR switch has two 10-Gbps uplinks, and the ingress/egress and intermediate switches provide 10-Gbps ports. The design has a 1:1 oversubscription ratio and is well suited to load-balanced routing. To route between servers, the source node needs two pieces of information: first, a list of the MAC addresses responsible for handling the destination IP address, and second, a list of the MAC addresses of the ToR switches to which each of those servers is connected. A Monsoon agent on each server replaces the user-level ARP functionality, obtains this information from a Monsoon directory server, and encapsulates every outgoing packet. The directory server maintains a mapping from a server's IP address to a list of (server MAC address, ToR switch MAC address) pairs. When communication starts, each outgoing frame is encapsulated with the MAC addresses of the ToR switch and the intermediate switch and sent out. The intermediate switches and ToR switches, which support MAC-in-MAC tunneling, decapsulate the frame, and the frame eventually arrives at the destination host unaltered. With MAC-in-MAC tunneling, the switches in Monsoon forward frames based only on switch MAC addresses, which solves the limited forwarding table size problem.

To distribute workload in the data center, Monsoon provides a mechanism called load spreading. Load spreading is achieved by creating a VIP (Virtual IP) shared by a set of servers (a server pool). When requests come in, the directory server resolves the VIP to the list of MAC addresses associated with it and uses MAC rotation to provide efficient server-to-server forwarding. The sender uses consistent hashing to select a destination host from the MAC address list. When a server fails, Monsoon leverages the existing data center health service to remove or add servers from the server pools.

Issue              | PortLand                                                 | VL2                                                                  | Monsoon
Topology           | Two-layer, multi-root                                    | Three-layer, Clos network                                            | Three-layer, multi-root
Addressing         | Hierarchical, encodes location into the MAC address      | Flat, IP-in-IP, Location Address (LA) and Application Address (AA)   | Flat, MAC-in-MAC encapsulation, source routing
Routing            | Location Discovery Protocol, shortest path, route by MAC | Link-state routing, shortest path, route by IP                       | Centralized routing based on the 4D architecture
Load balance       | Flow-hashing ECMP                                        | VLB + flow-based ECMP                                                | VLB + MAC rotation
Loop prevention    | Packets traveling down cannot travel back up             | Depends on the IP routing protocol                                   | N/A
Fail-over latency  | Milliseconds; centrally controlled and notified          | Seconds; depends on routing protocol convergence time                | Based on ECMP to detect failures
ARP/DHCP handling  | Redirect ARP at the switch                               | Disable ARP and DHCP, replace with a user-level agent                | Redirect ARP and DHCP

Table 3.3: Summary of the solution strategies of the PortLand, VL2, and Monsoon architectures.
Chapter 4

Peregrine: An All-Layer-2 Container Computer Network Architecture

4.1 Introduction

Cloud computing ushers in an era in which most information technology users do not need to own the system hardware and software infrastructure on which their day-to-day IT applications run. They either pay for their IT infrastructure usage on demand or get it for free, e.g., through subsidies from advertisers. Although the concept of decoupling the use of IT infrastructure from its ownership has started to gain traction in the enterprise space only within the last three years, it has long been common and popular in the consumer space. In this new ecology, it is the cloud service providers that build and own the IT infrastructures on which third-party or their own cloud applications run and deliver services, and along the way they get reimbursed for the value provided to their respective users. The name of the game behind most cloud computing business models is economy of scale. By consolidating IT infrastructures within an organization (private cloud) or across multiple organizations (public cloud), both the capital expense (software licensing cost, hardware acquisition cost, etc.) and the operational expense (human system administration and support cost, energy usage cost, etc.) can be significantly cut down. In addition, by exploiting statistical multiplexing, a consolidated IT infrastructure can be made more capable, flexible, and robust than the sum of the parts from which it is consolidated.

Although the IT infrastructure consolidation brought forth by cloud computing has many benefits, it also escalates the scalability issues of IT infrastructure to a new level. One such issue is the scalability of a cloud data center's network architecture. This paper describes the design, implementation, and evaluation of a data center network called Peregrine, which is specifically designed for a container computer
built at the Industrial Technology Research Institute (ITRI) in Taiwan. The ITRI container computer is designed to be a modular building block for constructing a cloud data center computer, which in general is composed of multiple container computers that are connected by a data center network, is interfaced with the public Internet through one or multiple IP routers, and is designed as an integrated system whose hardware components, such as servers and switches, are stripped of unnecessary functionalities, whose resources are centrally configured, monitored, and managed, and which encourages system-wide optimizations to reach better global design tradeoffs. A key design decision of the ITRI container computer is to use only commodity hardware, including compute servers, network switches, and storage servers, and to leave high availability and performance optimization to the system software. Another key decision is to design a new data center network architecture from the ground up to meet the unique requirements imposed by a cloud data center computer. Before embarking on the design of the network architecture for the ITRI container computer, we carefully reviewed the related research literature, studied possible use cases, and came up with the following requirements:

1. There is only one network, which supports communications among programs, data storage accesses, and interactions with the Internet.

2. The network must be buildable from mainstream commodity layer-2 switches, for lower cost and better manageability.

3. The network must be able to support up to one million end points, each of which corresponds to a virtual or physical machine.

4. The fail-over latency for any single network link/device failure is lower than 100 msec.

5. The loads on the network's physical links are balanced.

6. The network must support private IP address reuse, i.e., multiple instances of the same private IP address can co-exist simultaneously.

The first requirement dictates that the ITRI container computer should not use a separate SAN for storage data accesses, and that its network must interface seamlessly with the container computer's internet edge logic component. The second requirement mandates that only mainstream, rather than high-end enterprise-grade, Ethernet switches be used and that the modifications required on these switches be minimized. The last requirement is included specifically to support Amazon EC2-like IaaS (infrastructure as a service) cloud services, where multiple virtual data centers are multiplexed on a physical data center, and each virtual data center is given the full private IP address space 10.X.X.X, so that a customer's virtual data center can seamlessly inter-operate with its existing on-premises physical data centers without any network/system reconfiguration, such as IP address re-assignment.

A natural choice for building an all-layer-2 data center network is the standard Ethernet architecture. Unfortunately, because conventional Ethernet is based on a spanning tree architecture, it cannot satisfy the third and fourth requirements. Moreover, because the number of forwarding table entries in most mainstream Ethernet switches is limited to at most 64,000, these switches are not equipped to meet the third requirement.
Finally, IP address reuse is actually considered a run-time configuration error and is thus impossible to support in standard Ethernet networks.

Peregrine satisfies all the requirements mentioned above. It uses a two-stage dual-mode packet forwarding mechanism to support up to 1M end points using only mainstream Ethernet switches. It incorporates load-aware routing to make the best use of all the physical network links, and proactively provisions primary and backup routes to anticipate potential network failures. Peregrine supports private IP address reuse through a protected address translation mechanism similar to virtual address translation. Finally, Peregrine only requires about 100 lines of code change on mainstream Ethernet switches.

4.2 ITRI Container Computer

Figure 4.1: System architecture of the ITRI container computer and its various system components.

Figure 4.1 shows the logical system architecture of the ITRI container computer. The ITRI container computer is physically housed in an ISO-standard 20-foot (6.096 meter) shipping container, and consists of 12 server racks lined up on both sides of the container with an access aisle in the middle, where each server rack holds up to 96 current-generation X86 CPUs and 3 TB of DRAM. Twelve JBOD (Just a Bunch Of Disks) storage servers, each packed with 40 disks, are installed in the container computer. Together with the local disks directly attached to the compute server nodes, the container boasts more than 1 petabyte of usable disk space.

The ITRI container computer uses a single 380 VDC power distribution network to distribute power to all of its hardware devices, avoiding the power efficiency loss due to unnecessary conversion between AC and DC. The PDU on each server rack is capable of supporting 25 kilowatts of power.
The cooling subsystem uses a combination of air and liquid cooling technologies, and is specifically designed to achieve an annual average PUE of 1.2 in a subtropical climate such as Taiwan's, where PUE is defined as the total amount of energy consumption divided by that of the IT equipment alone. The ITRI container computer is designed to subsume all hardware functionalities seen in a typical data center, and thus includes support for all internet edge logic such as NAT (network address translation), VPN (virtual private networking), traffic shaping, and server/network load balancing, which is implemented on general-purpose server clusters rather than commercial proprietary appliances. To support lights-out management, the ITRI container computer incorporates a comprehensive SNMP-based environmental monitoring and control subsystem to protect itself, including a fire-and-smoke detection system backed up by a clean-agent gas-fire suppression subsystem, a physical security alarm subsystem, and an early earthquake detection system that proactively shuts the system down in the event of an earthquake. The container computer is designed to sustain a Richter scale 6.0 earthquake with no operational impact.

Figure 4.2: The physical topology of the ITRI container computer's network is a modified Clos network.

The ITRI container computer's network is a modified Clos network, as shown in Figure 4.2. Every rack contains 48 server nodes, each having four 1GE NICs, and includes four top-of-rack (TOR) switches, each having 48 1GE ports and four 10GE ports. There is a virtual switch inside every server node that is connected to the server node's four NICs, which in turn are connected to the four TOR switches in the same rack. The four 10GE uplinks on each TOR switch are connected to four different regional switches, each of which has 48 10GE ports. To improve the performance of storage accesses, each storage server has four 10GE NICs and is directly connected to four different regional switches. In total, five regional switches are used in the ITRI container computer. Peregrine is designed to connect multiple ITRI container computers, but doing so will require another layer of core switches to establish the necessary connectivity.
4.3 Two-Stage Dual-Mode Packet Forwarding

Because the number of addressable hosts in a single IP subnet of an enterprise network rarely exceeds 5,000, the number of forwarding table entries on a large percentage of mainstream enterprise-grade Ethernet switches is no larger than 32,000. Coupled with the fact that Ethernet switches forward packets based on their destination address, mainstream Ethernet switches cannot be used to build a network with a million end points or hosts, because they cannot afford to allocate a forwarding table entry for each and every host. Peregrine solves this problem using a two-stage forwarding scheme. Hosts in a Peregrine network are partitioned into disjoint groups, each of which is proxied by a dedicated intermediary. That is, every intermediary is capable of reaching every host in its group in one hop. To send a packet to a destination D, the source S first identifies the intermediary associated with D, uses the MAC address of D's intermediary as the packet's destination address, and embeds D's MAC address somewhere inside the packet. This process is known as MAC-in-MAC (MIM) encapsulation. When S sends this MIM packet out, it reaches D's intermediary first, and the intermediary, knowing that it is an MIM packet, takes out the embedded MAC address of D, replaces the packet's destination address with it, and dispatches the packet to the normal packet forwarding process. The intermediary is said to perform MIM decapsulation in this case. The packet eventually arrives at D because we assume D's intermediary can always reach D in one hop.

Whenever a VM moves from one physical machine to another, the VM's routing state in the network must be modified accordingly so that packets destined to this VM can still reach it after its migration. Two-stage forwarding simplifies routing state migration in a way similar to mobile IP [19]: when a VM moves to a new PM, Peregrine changes the VM's intermediary to the intermediary covering the new PM and informs all parties previously communicating with the VM of this change. During the transition period, the old intermediary sends back a host-unreachable ICMP message on behalf of the migrated VM whenever it receives packets destined to the VM.

With two-stage forwarding, an intermediary switch only needs to allocate forwarding table entries for the other intermediaries and for all the hosts in the same group as the intermediary switch. Let G denote the number of hosts in an intermediary's group. In a 1,000,000-node Peregrine network, the number of forwarding table entries needed by every intermediary switch is thus 1,000,000/G + G. For a non-intermediary switch, it only needs to allocate forwarding table entries for the other intermediaries.
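A quick calculation, sketched below, makes this table-sizing argument concrete for the two intermediary choices discussed next (TOR switch versus virtual switch); the 1,000,000-host network size and the 32,000-entry table limit are the figures used in this section.

    # Forwarding-table sizing for two-stage forwarding: 1,000,000/G + G entries
    # per intermediary switch, where G is the number of hosts per group.
    HOSTS = 1_000_000
    TABLE_LIMIT = 32_000

    def entries_per_intermediary(group_size: int) -> int:
        num_intermediaries = HOSTS // group_size       # entries for all other intermediaries
        return num_intermediaries + group_size         # plus entries for the local group

    # TOR switch as intermediary: 48 servers/rack x 10 VMs/server = 480 hosts per group.
    print(entries_per_intermediary(480), "<", TABLE_LIMIT)    # about 2,563: fits easily

    # Virtual switch as intermediary: 10 VMs per server node.
    print(entries_per_intermediary(10), ">", TABLE_LIMIT)     # about 100,010: exceeds the limit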
In the context of the ITRI container computer, there are two possible choices for the intermediary: the TOR switch or the virtual switch inside every server node. When the intermediary is a TOR switch, G is 480, because a TOR switch is connected to 48 server nodes and every server node is assumed to run 10 virtual machines, and the required number of forwarding table entries per switch is about 2,600, which is well below the 32,000-entry forwarding table size limit mentioned above. When the intermediary is a virtual switch, G becomes 10, and the required number of forwarding table entries per switch becomes about 100,000, which is higher than the limit. However, choosing TOR switches as intermediaries requires modifications to these switches, although the modification effort is relatively minor, about 100 lines of code, as most modern switches support the ability to trap certain types of packets and handle them separately. More problematically, there is a serious performance penalty associated with MIM decapsulation, because existing switches are not designed to perform this function in the data plane and therefore have to support it in the control processor, whose packet processing rate can easily be 3 to 4 orders of magnitude slower than the data plane's packet forwarding rate. As more and more commercial switches support the OpenFlow standard [13], which provides the flexibility of customized packet processing in the data plane, this performance penalty could be significantly reduced.

Although two-stage forwarding provides the generality of exchanging packets between any two nodes in a 1,000,000-node network, this generality comes with a potential performance cost. To mitigate this overhead, we propose an optimization called dual-mode forwarding, which allows a source to send packets directly to those destinations with which it communicates frequently, and indirectly to the rest. Because the number of nodes with which a given node, say X, communicates frequently is expected to be small, say 100, even in a 1,000,000-node network, one can allocate forwarding table entries for these nodes on the switches along the paths between X and them to speed up these communications. This is possible because there are many unused entries in the switch forwarding tables after two-stage forwarding is adopted. More generally, Peregrine dynamically measures the traffic volume from every host to every other host, i.e., the traffic matrix, and sorts the resulting measurements into a list in decreasing order. Starting from the head entry of this list, Peregrine allocates a forwarding table entry in every switch on the path from the source to the destination of the entry, and continues down the list until either the entry's traffic volume is too low to be worthwhile or the occupancy ratio of any of the forwarding tables on the path exceeds a certain threshold. Initially, every node is reachable indirectly via its associated intermediary. As more traffic load information becomes available, Peregrine gradually builds up direct routes between those node pairs with heavy communications. Note that for a given node X, some nodes may find it worthwhile to build a direct route to X, while others may choose to use the original indirect route.
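The greedy allocation of direct routes just described can be sketched as follows; the threshold names and data structures are illustrative assumptions, not Peregrine's actual implementation.

    # Sketch: allocate direct routes to the heaviest host pairs until tables fill up.
    def allocate_direct_routes(traffic_matrix, path_of, table_of,
                               table_limit=32_000, occupancy_threshold=0.9,
                               min_volume=1.0):
        """traffic_matrix: {(src, dst): volume}
           path_of(src, dst): list of switches on the chosen path
           table_of: {switch: set of installed destination entries}"""
        direct = set()
        # Heaviest communicating pairs first.
        for (src, dst), volume in sorted(traffic_matrix.items(),
                                         key=lambda kv: kv[1], reverse=True):
            if volume < min_volume:          # remaining pairs stay on indirect routes
                break
            path = path_of(src, dst)
            # Skip the pair if any switch on the path is already too full.
            if any(len(table_of[sw]) >= occupancy_threshold * table_limit for sw in path):
                continue
            for sw in path:
                table_of[sw].add(dst)        # one forwarding entry per switch for dst
            direct.add((src, dst))
        return direct                        # these pairs bypass the intermediary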
4.4 Fast Fail-Over

Peregrine is designed to reduce the fail-over delay of any single switch/port/link failure to under 100 msec. To achieve this aggressive goal, for a given node X, Peregrine pre-computes a primary and a backup route from every other node to X, where the primary route and the backup route are node-disjoint and link-disjoint excluding the two end points, assuming the underlying physical network connectivity provides enough redundancy. Whenever a network link or device fails, the primary routes provisioned on the failed device or link are identified, and the nodes that are using these primary routes are notified to switch to their corresponding backup routes. The fail-over delay of a network device/link failure thus consists of the time to detect the device/link failure, the time to identify the affected primary routes and the nodes currently using these routes, and the time to inform these affected nodes to switch from the primary to the backup routes. Because Peregrine uses conventional Ethernet switches, which forward packets based on their destination address, the only way to forward packets destined to a given node X along different routes is to assign multiple MAC addresses to X, each representing a distinct route to reach X. At start-up time, Peregrine installs the pre-computed primary/backup routes to every host in the switches' forwarding tables. At run time, switching from the primary to the backup route of a given host is simply a matter of using the host's backup MAC address rather than its primary MAC address.

To enable fast fail-over, for a given host X, Peregrine pre-computes two disjoint paths from each of the other hosts to X. One simple way to achieve this is to compute two disjoint spanning trees (primary and backup), each of which is rooted at X and spans all other hosts. Whenever a network failure affects the primary spanning tree of host X, all other hosts are informed to switch to the backup spanning tree of X by using X's backup MAC address to reach X. That is, X is reachable to the rest of the world either through its primary MAC address or through its backup MAC address, but never both. The main advantage of this design is that it greatly simplifies the bookkeeping of the availability state for each host. However, there are two disadvantages. First, any failure that affects (even a small portion of) a given host's primary spanning tree renders the entire spanning tree unusable; the collateral damage of this coarse-grained fail-over strategy may be too severe. For example, a NIC-to-TOR link failure may disable many spanning trees. Second, although the Clos network provides rich connectivity, ensuring that a given node's primary and backup spanning trees are completely disjoint might be difficult and greatly reduces the flexibility of load balancing routing (described below). Therefore, Peregrine adopts a fine-grained fail-over approach called node-pair path (NPP) fail-over, which requires Peregrine to keep track, for each host X, of whether each of its communicating hosts currently uses X's primary or backup MAC address to reach X. This design allows two different hosts to reach X using X's primary and backup MAC addresses simultaneously. With the NPP design, although the primary paths from all other hosts to X form a spanning tree and the backup paths form another spanning tree, these two trees are no longer required to be disjoint; only the primary and backup paths between each node pair still need to be disjoint. By removing the requirement that the primary and backup trees of a given host be disjoint, the flexibility and efficiency of Peregrine's routing algorithm are significantly increased.
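The NPP bookkeeping and the fail-over reaction can be summarized in the following sketch; the data structures (per-pair path records and an active-MAC table) are assumptions used for illustration, not the prototype's actual code.

    # Sketch of node-pair path (NPP) fail-over bookkeeping.
    class NPPState:
        def __init__(self):
            # (src, dst) -> {"primary": [links], "backup": [links]}
            self.paths = {}
            # (src, dst) -> "primary" or "backup": which MAC of dst the src currently uses
            self.active = {}

        def install(self, src, dst, primary_links, backup_links):
            self.paths[(src, dst)] = {"primary": primary_links, "backup": backup_links}
            self.active[(src, dst)] = "primary"

        def handle_link_failure(self, failed_link):
            """Return the (src, dst) pairs that must switch to dst's backup MAC address."""
            affected = []
            for (src, dst), routes in self.paths.items():
                if failed_link in routes[self.active[(src, dst)]]:
                    # flip only this pair; other sources may keep using dst's primary MAC
                    self.active[(src, dst)] = "backup"
                    affected.append((src, dst))
            return affected   # in Peregrine, the DS pushes ARP cache updates to these sources

    # Two sources reaching the same destination over different paths:
    state = NPPState()
    state.install("VM3", "VM6", primary_links=["SW2-SW3"], backup_links=["SW1-SW4"])
    state.install("VM9", "VM6", primary_links=["SW1-SW3"], backup_links=["SW2-SW4"])
    assert state.handle_link_failure("SW2-SW3") == [("VM3", "VM6")]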
4.5 Load Balancing Routing

The traditional Ethernet architecture does not support dynamic routing that could accommodate fluctuating workload patterns; only Layer-3 routers provide such support. By exploiting the capability of populating the forwarding tables on switches, Peregrine supports load-balancing packet routing, which takes into account the following factors. First, the importance of different physical links in a data center network differs, even if the physical network topology is symmetric. For example, one physical link may be more critical than another because it is used by many hosts to access a storage server. Peregrine computes the notion of link criticality [6] and uses it to avoid choosing more critical links early on, so as to eventually achieve network-wide load balance. Second, the number of hops on the route between two hosts is an important quality indicator, because it determines the network latency as well as the amount of injected load.

A simple and effective load-balancing routing scheme is to compute a large number of paths between every source and destination pair <s, d>, and to distribute the traffic from s to d equally among these paths. However, this algorithm is infeasible because it would require a large number of forwarding table entries for each host. Instead, we could run this routing algorithm statically, and use its result to steer the direction of more practical routing algorithms. More concretely, we compute up to N shortest paths for every possible source/destination pair <s, d>, equally distribute the traffic between s and d over these N paths, and then compute the link criticality of a physical link l with respect to <s, d>, denoted as θ_l(s, d), as M/N, where M is the number of paths between s and d that go through the link l. The expected load from s to d on the link l is then θ_l(s, d) × TM(s, d), where TM(s, d) represents the bandwidth demand from s to d, and the total expected load on the link l is thus θ_l = Σ_{(s,d)} θ_l(s, d) × TM(s, d). Finally, we define the cost of the link l as cost(l) = θ_l / R_l, where R_l is the residual capacity of the link l, and we avoid choosing links with higher cost as much as possible when computing routes.

Given a traffic matrix, each of whose entries represents the bandwidth demand from one host to another, Peregrine first sorts its entries in decreasing order and then computes paths for these entries in this order. That is, host pairs with higher bandwidth demands are routed earlier. To compute the primary path for a host pair <s, d>, Peregrine computes K shortest paths from s to d, filters out those paths that cross switches whose forwarding tables are already full, and picks the path whose sum of link costs is minimum. After taking out the links on the primary path from s to d, Peregrine repeats the same process to calculate the backup path. After the primary and backup paths from s to d are computed, the residual capacity of every link on these two paths is reduced by TM(s, d), and the expected load and cost of other links are adjusted accordingly. Whenever a link experiences congestion because of traffic load fluctuations, Peregrine identifies all source-destination pairs whose primary or backup path passes through this link, deducts their measured bandwidth demands from the measured costs of the links on these primary paths, and applies the same routing algorithm to compute a new primary or backup path for each of these source-destination pairs, this time using their measured bandwidth demands.
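A compact sketch of the cost model above is given below: it computes θ_l(s, d) from the candidate shortest paths, accumulates the expected load θ_l, and derives cost(l) = θ_l / R_l. The function is a simplified illustration under the stated definitions, not the route algorithm server's actual code; the candidate paths are assumed to be supplied by a separate k-shortest-path search.

    # Sketch of the link-criticality cost model used to steer route selection.
    from collections import defaultdict
    from itertools import islice

    def link_costs(candidate_paths, traffic_matrix, residual_capacity, N=4):
        """candidate_paths: {(s, d): [path, ...]} where each path is a list of links,
           ordered from shortest to longest (e.g., from a k-shortest-path search).
           traffic_matrix: {(s, d): bandwidth demand TM(s, d)}
           residual_capacity: {link: R_l}"""
        expected_load = defaultdict(float)                 # theta_l
        for (s, d), demand in traffic_matrix.items():
            paths = list(islice(candidate_paths[(s, d)], N))
            per_link_paths = defaultdict(int)              # M: paths through each link
            for path in paths:
                for link in path:
                    per_link_paths[link] += 1
            for link, M in per_link_paths.items():
                criticality = M / len(paths)               # theta_l(s, d) = M / N
                expected_load[link] += criticality * demand
        # cost(l) = theta_l / R_l; higher-cost links are avoided when picking routes.
        return {link: expected_load[link] / residual_capacity[link]
                for link in residual_capacity}

    # Example: two host pairs sharing a storage-facing link make that link more expensive.
    paths = {("A", "S"): [["A-T1", "T1-S"], ["A-T2", "T2-S"]],
             ("B", "S"): [["B-T1", "T1-S"], ["B-T2", "T2-S"]]}
    tm = {("A", "S"): 500.0, ("B", "S"): 500.0}            # Mbit/s demands
    cap = {"A-T1": 1000.0, "A-T2": 1000.0, "B-T1": 1000.0, "B-T2": 1000.0,
           "T1-S": 1000.0, "T2-S": 1000.0}
    print(link_costs(paths, tm, cap))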
Chapter 5

Peregrine Implementation and Performance Evaluation

5.1 Prototype Implementation

The current Peregrine implementation on the ITRI container computer, as shown in Figure 5.1, consists of a kernel agent that performs MIM encapsulation and is installed in the Dom0 of every physical machine running Xen, a central directory server (DS) that performs generalized IP-to-MAC address look-up, and a central route algorithm server (RAS) that constantly collects the traffic matrix, runs the load-balancing routing algorithm based on the traffic matrix, and populates the switches with the resulting routing state.

Figure 5.1: The software architecture of the current Peregrine prototype, which consists of a kernel agent installed on every physical machine, a central directory server (DS) for IP-to-MAC address look-up, and a central route algorithm server (RAS) for route computation and routing state population.

With two-stage dual-mode packet forwarding, there are up to four ways to reach a Peregrine host X:

1. Route directly to X using X's primary MAC address,
2. Route directly to X using X's backup MAC address,
3. Route to X's primary intermediary and then to X using X's primary MAC address, and
4. Route to X's backup intermediary and then to X using X's backup MAC address.

The first two possibilities exist only for those hosts that are directly reachable. Accordingly, there are four MAC addresses associated with each Peregrine host: its primary MAC address, its backup MAC address, its primary intermediary's MAC address, and its backup intermediary's MAC address. Traditionally, translating a host's IP address to its MAC address is done via the ARP protocol, which is incompatible with Peregrine's design because it is based on broadcast queries and unicast responses.
Instead, Peregrine adopts a centralized directory service (DS) architecture, as shown in Figure 5.1, in which every ARP query about an IP address A is transparently intercepted by Peregrine's kernel agent and re-directed to the DS, which responds with the four MAC addresses associated with A and the availability status of the four routes to reach A.

Peregrine does not require modifications to the header structure of Ethernet packets. To perform MIM encapsulation for an outgoing packet, the Peregrine agent puts the primary or backup intermediary's MAC address in the packet's destination address field, and the MAC addresses of the sending and receiving hosts in the packet's source address field. This means that every Peregrine host's MAC address has only 24 bits, rather than 48 bits. In addition, the MAC addresses of all VMs in a Peregrine network are centrally allocated, and every VM is assigned two MAC addresses.

The centralized IP-to-MAC address mapping architecture also enables Peregrine to support private IP address reuse, which allows multiple virtual data centers (VDCs) to run on a single Peregrine network and gives each VDC the same private IP address space (e.g., 10.x.x.x). When a VM in a VDC issues an ARP query about an IP address, Peregrine consults the DS using the IP address and the ID of the VDC, which disambiguates the same IP address simultaneously used by multiple VDCs based on their VDC IDs.
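A minimal sketch of the DS lookup path, keyed by (VDC ID, IP address) as described above, is shown below; the record layout and names are assumptions made for illustration, not the DS's real schema.

    # Sketch: directory-service lookup keyed by (VDC ID, private IP address).
    from typing import NamedTuple, Optional

    class HostRecord(NamedTuple):
        primary_mac: str
        backup_mac: str
        primary_intermediary_mac: str
        backup_intermediary_mac: str
        primary_route_up: bool = True
        backup_route_up: bool = True

    class DirectoryService:
        def __init__(self):
            self._db = {}   # (vdc_id, ip) -> HostRecord

        def register(self, vdc_id: int, ip: str, record: HostRecord):
            self._db[(vdc_id, ip)] = record

        def resolve(self, vdc_id: int, ip: str) -> Optional[HostRecord]:
            # The same 10.x.x.x address can be registered by many VDCs;
            # the VDC ID keeps their entries apart.
            return self._db.get((vdc_id, ip))

    ds = DirectoryService()
    ds.register(1, "10.0.0.5", HostRecord("02:00:00:00:00:01", "02:00:00:00:00:02",
                                          "02:00:00:00:10:01", "02:00:00:00:10:02"))
    ds.register(2, "10.0.0.5", HostRecord("02:00:00:00:00:03", "02:00:00:00:00:04",
                                          "02:00:00:00:10:03", "02:00:00:00:10:04"))
    assert ds.resolve(1, "10.0.0.5").primary_mac != ds.resolve(2, "10.0.0.5").primary_mac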
Figure 5.2: MAC address translation for two-stage dual-mode forwarding.

Figure 5.2 gives an example that illustrates the MAC address look-up for two-stage dual-mode packet forwarding. When VM3 sends out an ARP query about VM6's IP address (step 1), the Peregrine agent installed in the Dom0 of VM3's physical machine (PM1) intercepts this query and submits the resulting query to the directory server (DS) (step 2). The DS looks up its database and sends the four MAC addresses and their availability status associated with VM6 (step 3) back to the Peregrine agent on PM1, which first creates and sends a legitimate ARP reply to VM3 and also caches the reply to answer future ARP queries about VM6's IP address. Once VM3 receives VM6's MAC address, it forms the associated packet and sends it out. In Peregrine, all packets from a DomU VM pass through the Peregrine agent in the Dom0 of the corresponding physical machine. For each packet going by, the Peregrine agent consults the ARP cache with the packet's destination IP address and rewrites the packet's destination MAC address field based on the MAC address look-up result. For example, in the case of Figure 5.2, VM6 can be reached in four ways: (1) Indirect Primary: The Peregrine agent on PM1 performs MIM encapsulation with the MAC address of VM6's primary intermediary and VM6's primary MAC address, and sends the packet out (step 4). When the packet arrives at VM6's primary intermediary, i.e., SW3, the switch decapsulates the MIM packet and forwards the resulting packet to VM6 (step 5). (2) Indirect Backup: Everything works in the same way as in the Indirect Primary case, except that it is VM6's backup intermediary, SW4, that is used for packet relaying. (3) Direct Primary: The destination MAC address of the outgoing packets is VM6's primary MAC address. (4) Direct Backup: The destination MAC address of the outgoing packets is VM6's backup MAC address.
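The encapsulation step can be sketched as below: the 6-byte destination field carries the chosen intermediary's MAC address, while the 6-byte source field packs the 24-bit identifiers of the sending and receiving hosts, as described earlier in this section. The byte ordering of the two packed identifiers and the helper names are illustrative assumptions, not the kernel agent's actual layout.

    # Sketch of Peregrine's MIM encapsulation: the destination field carries the
    # intermediary's 48-bit MAC; the source field packs two 24-bit host identifiers.

    def mim_encapsulate(intermediary_mac: bytes, src_host_id: int, dst_host_id: int,
                        payload: bytes) -> bytes:
        assert len(intermediary_mac) == 6
        assert src_host_id < 2**24 and dst_host_id < 2**24   # 24-bit host MACs
        src_field = src_host_id.to_bytes(3, "big") + dst_host_id.to_bytes(3, "big")
        return intermediary_mac + src_field + payload        # DA | SA | rest of frame

    def mim_decapsulate(frame: bytes, host_mac_of) -> bytes:
        """What an intermediary does: rewrite the DA with the real destination host MAC."""
        src_field = frame[6:12]
        dst_host_id = int.from_bytes(src_field[3:6], "big")
        new_da = host_mac_of(dst_host_id)                    # full 48-bit MAC of the host
        return new_da + src_field + frame[12:]

    # Example with made-up identifiers:
    frame = mim_encapsulate(bytes.fromhex("020000000a01"), 0x000003, 0x000006, b"data")
    out = mim_decapsulate(frame, lambda hid: hid.to_bytes(3, "big") + b"\x00\x00\x00")
    assert out[:3] == b"\x00\x00\x06"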
Figure 5.3: Switching from the direct primary route to the direct backup route upon a link failure.

Figure 5.3 illustrates how Peregrine's fast fail-over mechanism works. Initially, VM6's primary and backup MAC addresses, mac1 and mac2, are pre-populated by the RAS on the switches along the two disjoint routes (step 1). The primary route to VM6 goes through SW2 and SW3, while the backup route goes through SW1 and SW4. Whenever a link along the primary path from VM3 to VM6 goes down, an SNMP trap is sent from the link's adjacent switch to the RAS (step 2), which determines the source-destination pairs that are affected by the link failure and passes this information to the DS (step 3). The DS then informs the source hosts that their associated destination hosts are reachable only via their backup MAC addresses; in this case, it sends an ARP entry update to PM1 (step 4) indicating that packets sent from VM3 to VM6 should use mac2, VM6's backup MAC address, as the destination MAC address. After that, all packets destined to VM6 from VM3 go through the backup path (step 5).

Upon a link/switch failure, the DS only needs to update those physical machines that currently cache ARP entries invalidated by the failure, because it keeps track of which physical machines cache which ARP entries. The DS performs these ARP cache updates using unicast. For a given VM, the number of physical machines caching its ARP entry is expected to be relatively small. Therefore, the DS allocates enough space to record at most M caching machines for a given ARP entry, where M is tentatively set to 50. For a very popular VM that communicates with a large number of physical machines, a special flag is set in its ARP entry, and any modification to its ARP entry triggers an ARP update to every physical machine.

5.2 Network Initialization

To run the Peregrine architecture on commodity Ethernet switches, these switches are required to block broadcast packets, disable unicast packet flooding, and turn off the IEEE 802.1D spanning tree protocol (STP). When the ITRI container computer starts up, the switches on its network are first put in the standard STP mode, and then configured to satisfy Peregrine's requirements. However, turning off STP on the switches one by one could easily lead to packet looping and thus broadcast storms. To solve this problem, Peregrine uses the following algorithm (a code sketch of this procedure follows below) to convert the network from the standard STP mode to the Peregrine mode:

1. Statically configure all the switches in such a way that the switch directly attached to the Peregrine server called the RAS, which is also responsible for route computation, is the root of the initial spanning tree when the container computer network starts up.

2. Construct a list of switches by doing a breadth-first search of all the switches in the initial spanning tree, and reverse the list.

3. Starting from the RAS, visit each switch in the reversed list, turning off broadcast packet forwarding, unicast packet flooding, and STP on it, and populating its static forwarding table with the results of the load balancing routing algorithm.

Essentially, this algorithm starts the switch reconfiguration process from the leaves of the initial spanning tree, which is rooted at the switch attached to the RAS, and ensures that all the switches that are not yet reconfigured remain reachable through the initial spanning tree.
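The sketch below captures the reverse-BFS reconfiguration order; the switch-configuration calls are placeholders for the real CLI/SNMP operations and are not taken from the prototype.

    # Sketch: convert switches from STP mode to Peregrine mode in reverse-BFS order,
    # so every not-yet-reconfigured switch stays reachable over the initial spanning tree.
    from collections import deque

    def bfs_order(tree, root):
        """tree: {switch: [child switches]} of the initial spanning tree rooted at the RAS switch."""
        order, queue = [], deque([root])
        while queue:
            sw = queue.popleft()
            order.append(sw)
            queue.extend(tree.get(sw, []))
        return order

    def convert_to_peregrine_mode(tree, root, configure_switch, install_routes):
        # Leaves first: reverse of the BFS order computed from the RAS-attached root.
        for sw in reversed(bfs_order(tree, root)):
            configure_switch(sw, broadcast=False, unicast_flooding=False, stp=False)
            install_routes(sw)     # static forwarding entries from the routing algorithm

    # Example with a toy two-level tree; the two callbacks are illustrative stubs.
    tree = {"RAS-TOR": ["REG1", "REG2"], "REG1": ["TOR1"], "REG2": ["TOR2"]}
    visited = []
    convert_to_peregrine_mode(tree, "RAS-TOR",
                              lambda sw, **flags: visited.append(sw),
                              lambda sw: None)
    assert visited[-1] == "RAS-TOR"    # the RAS-attached root switch is reconfigured last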
We used two racks of the ITRI container computer as the evaluation testbed for the Peregrine prototype. The testbed consists of four 48-port TOR switches with four 10GE uplinks each, two 48-port 10GE regional switches, and 48 physical machines. Each physical machine is equipped with eight 2.53 GHz Intel Xeon CPU cores, 40 GB of DRAM, and four GE NICs, and is installed with CentOS 5.5 and the Linux kernel it ships with. Two physical machines are used to deploy the RAS and the DS. The MIM kernel agent is installed on all other physical machines. Each physical machine is connected to four TOR switches via a separate 1GE NIC, and each TOR switch in turn is connected to four regional switches via a separate 10GE link. No firmware modifications are required on these regional or TOR switches.

5.3 Effectiveness of Load Balancing Routing

We used a simulation approach to evaluate the effectiveness of Peregrine's load-balancing routing algorithm. The test network being simulated spans 52 physical machines with 384 links. Each physical machine is connected to four TOR switches via a separate 1GE NIC, and each TOR switch in turn is connected to four regional switches via a separate 10GE link. To derive realistic input network traffic loads, we started with the packet traces collected from the Lawrence Berkeley National Lab campus network [18]. Each packet trace spans a period of 300 to 1800 seconds from different subnets, with a total of around 9000 end hosts. We assumed that each packet trace represents a VM-to-VM traffic matrix in a virtual data center, and that the VMs are assigned to PMs in a random fashion. Because the ITRI container computer is designed to support multiple virtual data centers running concurrently on it, we created multiple multi-VDC traffic matrices, each of which is constructed by randomly combining five VM-to-VM traffic matrices into one traffic matrix. In total, 17 multi-VDC 300-second traces were created and replayed on the simulated network.

Given a multi-VDC packet trace, we used the first half of the trace to derive its traffic matrix and compute routes for the communicating physical machines, and replayed the second half of the trace on the simulated network using the resulting routes. The metric used to measure the effectiveness of the routing algorithms is the congestion count, N_c, during the trace replay period.
Figure 5.4: Congestion count (left Y axis) and additional traffic load (right Y axis) comparison between the full link criticality-based routing algorithm and the random shortest-path routing algorithm, using multiple multi-VDC packet traces as inputs.

Figure 5.5: Congestion count ratio between full link criticality-based routing and random shortest-path routing under different degrees of skewedness in the input traffic matrix.
For every second of the input trace, we placed the load of every communicating host pair during that second on the links along the pair's route in the simulated network. During this replay process, whenever a host pair's load is placed on a link whose capacity (Mbits/sec) is already exceeded, the congestion count is incremented by one.

Figure 5.4 compares the congestion counts of the full link criticality-based routing (FLCR) algorithm, which is load-aware, and the random shortest-path routing (RSPR) algorithm, which is load-insensitive, using as inputs the 17 multi-VDC packet traces described above. These two algorithms represent the two extremes of Peregrine's routing algorithm (Section 4.5): FLCR corresponds to setting Z to 100, whereas RSPR corresponds to setting Z to 0, where Z is the percentage of the total traffic volume whose host pairs are routed with link criticality-based routing. As expected, FLCR out-performs RSPR in all 17 traces, because the former strives to avoid congested links through the guidance of link criticality and expected link load. In contrast, RSPR relies only on randomization to avoid congestion and is thus less effective. The price that FLCR pays for avoiding congestion is that the paths it produces tend to be longer and have a larger hop count than those produced by RSPR. As a result, the total traffic load injected by FLCR tends to be higher than that injected by RSPR. Fortunately, the percentage of additional traffic load due to longer paths is insignificant, around 0.5%.

To explain why the effectiveness difference between FLCR and RSPR varies with the input traces, we measured the concentration percentage of each input trace, which is the percentage of the top heavy-traffic host pairs that account for 90% of the total traffic volume in the input trace, and correlated this percentage with the routing effectiveness difference, as represented by the ratio of the congestion counts (N_c) of FLCR and RSPR, for all 17 input traces. As shown by the solid curve in Figure 5.5, when an input trace has a lower concentration percentage, the congestion count ratio tends to be lower, indicating that the routing effectiveness gap between FLCR and RSPR is greater. This is because a lower concentration percentage means a higher degree of skewedness in the input workload, and the advantage of FLCR over RSPR is more pronounced when the input load is more skewed.

The complexity of full link criticality-based routing is O(L × P), where L is the number of physical network links and P is the number of PM pairs. From the multi-VDC traces, we found that most of the entries in their traffic matrices are insignificantly small; e.g., the traffic loads of fewer than 5% of the host pairs account for more than 90% of the total traffic volume, as shown in Figure 5.5. The solid curve corresponds to FLCR (Z=100), whereas the dotted curve corresponds to the case of applying link criticality-based routing only to the top heavy-traffic host pairs that are responsible for 90% of the total traffic volume, i.e., Z=90. The difference between these two curves is very small, indicating that the two configurations have similar routing effectiveness, although the Z=90 case requires much less route computation time than the Z=100 case. More concretely, the number of host pairs in a 500-server network to which link criticality-based routing is applied is reduced from 250K when Z=100 to 12.5K when Z=90. In our current implementation, the route computation time for 12.5K host pairs is about 10 minutes.
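The congestion-count metric used in this replay can be expressed compactly; the sketch below assumes per-second load samples and per-link capacities in Mbit/s, mirroring the replay procedure described at the start of this section.

    # Sketch: congestion count N_c for a trace replay.
    def congestion_count(per_second_loads, route_of, capacity):
        """per_second_loads: list (one entry per second) of {(src, dst): Mbit/s}
           route_of: {(src, dst): [links on the pair's route]}
           capacity: {link: Mbit/s}"""
        n_c = 0
        for second in per_second_loads:
            used = {link: 0.0 for link in capacity}
            for pair, load in second.items():
                for link in route_of[pair]:
                    # Placing load on an already-exceeded link counts as one congestion event.
                    if used[link] > capacity[link]:
                        n_c += 1
                    used[link] += load
            # link usage is reset at every second of the replay
        return n_c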
5.4 Packet Forwarding Performance

One concern about the Peregrine architecture is the DS throughput required to handle ARP requests from up to a million hosts. One study based on a traffic collection from 2456 hosts [15] showed that on average 89 ARP queries are issued per second. A simple extrapolation suggests that around 36K ARP queries per second are expected in a data center consisting of one million hosts. This requirement is well below the measured performance of the current DS implementation, 100K ARP queries per second. Assume that a physical machine caches ARP entries for one minute and hosts 20 VMs, and that each VM communicates continuously with 100 other distinct VMs. Under this assumption, every PM generates about 2,000 ARP requests every minute, which corresponds to roughly 30 ARP queries per second being directed to the DS. Since the DS is able to handle 100K ARP queries per second, we expect the current design to support up to 3.3K physical machines.

Encapsulation and Decapsulation Overhead: Another concern about the Peregrine architecture is the overhead of packet encapsulation and decapsulation. We took two physical machines, installed the MIM agent on them, ran four virtual machines on each of them, and established four TCP connections between these four pairs of VMs. We measured the throughputs of these four TCP connections with and without MIM encapsulation and decapsulation. The measured throughputs of these TCP connections when MIM is turned on are only about 0.5% lower than those without MIM. We further measured the latency of packet encapsulation and decapsulation inside the kernel. Under the same configuration, packet decapsulation took less than 1 usec (99th percentile), and packet encapsulation took 4 usec (99th percentile), because encapsulation requires looking up the ARP cache. For comparison, we also implemented the MIM decapsulation engine in a commodity Ethernet switch, and the packet decapsulation throughput is disappointingly low, around 100 packets/sec, because MIM decapsulation takes place in the control processor. This is why we chose to perform packet decapsulation on the physical machines rather than on the switches.
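The two ARP-load estimates quoted at the beginning of this section follow from simple arithmetic, reproduced below as a sanity check; all inputs are the figures given in the text.

    # Sanity check of the ARP-load estimates in Section 5.4.

    # Extrapolation from the measured trace: 89 ARP queries/s across 2,456 hosts.
    per_host_rate = 89 / 2456
    print(round(per_host_rate * 1_000_000))        # ~36,000 queries/s for one million hosts

    # Per-PM load under the stated assumptions: 20 VMs per PM, each talking to 100
    # distinct VMs, with ARP entries cached for one minute.
    arp_per_pm_per_minute = 20 * 100               # 2,000 cache refreshes per minute
    arp_per_pm_per_second = arp_per_pm_per_minute / 60
    print(round(arp_per_pm_per_second))            # ~33 queries/s per PM

    # With a DS that sustains 100K queries/s, the supportable number of PMs is:
    print(int(100_000 // arp_per_pm_per_second))   # ~3,000 PMs; the text's 3.3K figure comes
                                                   # from rounding the per-PM rate down to 30/s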
5.5 Fail-over Latency

To measure the fail-over latency, we measured the service disruption time of a UDP connection running between two physical machines of the evaluation testbed when one of its underlying links fails. The sender of this UDP connection sends one packet per millisecond to the receiver across a TOR switch. We then counted the number of packets lost due to a link failure and further broke down the total service disruption time into the following four components: (1) A neighboring switch of the failed link detects the link failure through polling of its local interfaces and sends out an SNMP trap to the RAS. (2) The RAS processes the link failure event to identify the affected destination hosts and passes them to the DS. (3) The DS updates its ARP database for these affected hosts and sends ARP cache updates about them to those physical machines that communicate with these hosts. (4) The MIM agent on a physical machine updates its ARP cache upon receiving such an ARP cache update message.

Step | 1. Switch | 2. RAS  | 3. DS  | 4. PM  | Total
Time | 20-80 ms  | < 20 ms | < 5 ms | < 2 ms | < 77 ms

Table 5.1: The breakdown of the fail-over latency of a single link failure in the evaluation testbed.

Table 5.1 shows the average time spent in each step over twenty failure runs in which the number of affected host pairs is fewer than 10. The upshot is that the average fail-over latency of the Peregrine prototype is around 77 msec. The only time-varying step in the fail-over latency is step (3), in which the DS sends out ARP cache updates to all physical machines caching the MAC addresses of the hosts affected by the link failure.

Figure 5.6: ARP cache update latency increases linearly with the number of physical machines whose ARP cache needs to be updated.

Figure 5.6 shows that the time taken by the DS to send out ARP cache updates increases linearly with the number of physical machines caching the MAC addresses of the affected hosts, because the DS needs to send these updates out in sequence. Even when the number of physical machines whose ARP cache needs to be updated is 1000, the total fail-over latency is increased by only an additional 300 msec.

5.6 Conclusion

Recognizing that the internal fabric of a container computer does not need to be compatible with other legacy IT infrastructures, the designers of the ITRI container computer devised an innovative data center network architecture called Peregrine, which employs commodity Ethernet switches only as dumb packet forwarding engines but removes most of the control plane functionalities of the
traditional Ethernet architecture, such as spanning tree, source learning, flooding, and broadcast-based ARP queries, and centralizes the address look-up, routing, and fast fail-over intelligence on dedicated servers. We have completed a fully operational Peregrine prototype, presented its design and implementation in this paper, and demonstrated the effectiveness and efficiency of the Peregrine architecture using simulation and measurement results. We are currently working on improvements to the Peregrine prototype, including stress-testing the prototype's robustness and scalability on a fully populated container computer and on a multi-container computer set-up, extending the DS to a distributed cluster implementation to enhance its scalability and availability, and porting Peregrine (particularly its packet decapsulation logic) to switches supporting the OpenFlow standard to further increase the number of end points that a single Peregrine network can span.
Bibliography

[1] The Juniper Networks QFabric architecture: A revolution in data center network design: Flattening the data center architecture.

[2] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. In Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication. ACM.

[3] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic flow scheduling for data center networks. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation. USENIX Association.

[4] P. Bottorff and S. Haddock. IEEE 802.1ah: Provider Backbone Bridges.

[5] C. Clos. A study of non-blocking switching networks. Bell System Technical Journal, 32(2).

[6] K. Gopalan, T. Chiueh, and Y. Lin. Network-wide load balancing routing with performance guarantees. In Communications, ICC '06, IEEE International Conference on, volume 2. IEEE.

[7] A. Greenberg, J. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta. VL2: A scalable and flexible data center network. ACM SIGCOMM Computer Communication Review, 39(4):51-62.

[8] A. Greenberg, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta. Towards a next generation data center architecture: scalability and commoditization. In Proceedings of the ACM Workshop on Programmable Routers for Extensible Services of Tomorrow. ACM.

[9] C. Hopps. Analysis of an equal-cost multi-path algorithm.

[10] A. Inc. Compare and contrast SPB and TRILL.

[11] C. Inc. Scaling data centers with FabricPath and the Cisco FabricPath switching system.

[12] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken. The nature of data center traffic: measurements & analysis. In Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement. ACM.

[13] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review, 38(2):69-74.

[14] A. Myers, E. Ng, and H. Zhang. Rethinking the service model: Scaling Ethernet to a million nodes. In Proc. HotNets.

[15] A. Myers, E. Ng, and H. Zhang. Rethinking the service model: Scaling Ethernet to a million nodes. In Proc. HotNets.

[16] R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat. PortLand: a scalable fault-tolerant layer 2 data center network fabric. ACM SIGCOMM Computer Communication Review, 39(4):39-50.

[17] P. Oppenheimer. Top-Down Network Design. Cisco Press.

[18] R. Pang, M. Allman, M. Bennett, J. Lee, V. Paxson, and B. Tierney. A first look at modern enterprise traffic. In Proceedings of the 5th ACM SIGCOMM Conference on Internet Measurement. USENIX Association.

[19] C. Perkins, S. Alpert, and B. Woolf. Mobile IP: Design Principles and Practices. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.

[20] R. Perlman. Rbridges: transparent routing. In INFOCOM 2004, Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies, volume 2. IEEE.

[21] D. Plummer. An Ethernet address resolution protocol. Technical report, RFC 826.

[22] S. Sharma, K. Gopalan, S. Nanda, and T. Chiueh. Viking: A multi-spanning-tree Ethernet architecture for metropolitan area and cluster networks. In INFOCOM 2004, Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies, volume 4. IEEE.

[23] C. Sturdevant. Cisco debuts FabricPath. eWeek, 27(14):34.

[24] D. Thaler and C. Hopps. Multipath issues in unicast and multicast next-hop selection.

[25] J. Touch and R. Perlman. Transparent interconnection of lots of links (TRILL): Problem and applicability statement.

[26] A. Vahdat, M. Al-Fares, N. Farrington, R. Mysore, G. Porter, and S. Radhakrishnan. Scale-out networking in the data center. IEEE Micro, 30(4):29-41.
EVOLVING ENTERPRISE NETWORKS WITH SPB-M APPLICATION NOTE
EVOLVING ENTERPRISE NETWORKS WITH SPB-M APPLICATION NOTE EXECUTIVE SUMMARY Enterprise network managers are being forced to do more with less. Their networks are growing in size and complexity. They need
Brocade Solution for EMC VSPEX Server Virtualization
Reference Architecture Brocade Solution Blueprint Brocade Solution for EMC VSPEX Server Virtualization Microsoft Hyper-V for 50 & 100 Virtual Machines Enabled by Microsoft Hyper-V, Brocade ICX series switch,
Shortest Path Bridging IEEE 802.1aq Overview
Shortest Path Bridging IEEE 802.1aq Overview Don Fedyk IEEE Editor 802.1aq Alcatel-Lucent IPD Product Manager Monday, 12 July 2010 Abstract 802.1aq Shortest Path Bridging is being standardized by the IEEE
Optimizing Data Center Networks for Cloud Computing
PRAMAK 1 Optimizing Data Center Networks for Cloud Computing Data Center networks have evolved over time as the nature of computing changed. They evolved to handle the computing models based on main-frames,
Migrate from Cisco Catalyst 6500 Series Switches to Cisco Nexus 9000 Series Switches
Migration Guide Migrate from Cisco Catalyst 6500 Series Switches to Cisco Nexus 9000 Series Switches Migration Guide November 2013 2013 Cisco and/or its affiliates. All rights reserved. This document is
VXLAN Overlay Networks: Enabling Network Scalability for a Cloud Infrastructure
W h i t e p a p e r VXLAN Overlay Networks: Enabling Network Scalability for a Cloud Infrastructure Table of Contents Executive Summary.... 3 Cloud Computing Growth.... 3 Cloud Computing Infrastructure
Intel Ethernet Switch Load Balancing System Design Using Advanced Features in Intel Ethernet Switch Family
Intel Ethernet Switch Load Balancing System Design Using Advanced Features in Intel Ethernet Switch Family White Paper June, 2008 Legal INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL
Simplify Your Data Center Network to Improve Performance and Decrease Costs
Simplify Your Data Center Network to Improve Performance and Decrease Costs Summary Traditional data center networks are struggling to keep up with new computing requirements. Network architects should
Data Center Fabrics What Really Matters. Ivan Pepelnjak ([email protected]) NIL Data Communications
Data Center Fabrics What Really Matters Ivan Pepelnjak ([email protected]) NIL Data Communications Who is Ivan Pepelnjak (@ioshints) Networking engineer since 1985 Technical director, later Chief Technology
Analysis of Network Segmentation Techniques in Cloud Data Centers
64 Int'l Conf. Grid & Cloud Computing and Applications GCA'15 Analysis of Network Segmentation Techniques in Cloud Data Centers Ramaswamy Chandramouli Computer Security Division, Information Technology
Objectives. The Role of Redundancy in a Switched Network. Layer 2 Loops. Broadcast Storms. More problems with Layer 2 loops
ITE I Chapter 6 2006 Cisco Systems, Inc. All rights reserved. Cisco Public 1 Objectives Implement Spanning Tree Protocols LAN Switching and Wireless Chapter 5 Explain the role of redundancy in a converged
WHITE PAPER. Network Virtualization: A Data Plane Perspective
WHITE PAPER Network Virtualization: A Data Plane Perspective David Melman Uri Safrai Switching Architecture Marvell May 2015 Abstract Virtualization is the leading technology to provide agile and scalable
Data Center Networking Designing Today s Data Center
Data Center Networking Designing Today s Data Center There is nothing more important than our customers. Data Center Networking Designing Today s Data Center Executive Summary Demand for application availability
Juniper Networks QFabric: Scaling for the Modern Data Center
Juniper Networks QFabric: Scaling for the Modern Data Center Executive Summary The modern data center has undergone a series of changes that have significantly impacted business operations. Applications
Pre$SDN era: network trends in data centre networking
Pre$SDN era: network trends in data centre networking Zaheer Chothia 27.02.2015 Software Defined Networking: The Data Centre Perspective Outline Challenges and New Requirements History of Programmable
Juniper / Cisco Interoperability Tests. August 2014
Juniper / Cisco Interoperability Tests August 2014 Executive Summary Juniper Networks commissioned Network Test to assess interoperability, with an emphasis on data center connectivity, between Juniper
Chapter 1 Reading Organizer
Chapter 1 Reading Organizer After completion of this chapter, you should be able to: Describe convergence of data, voice and video in the context of switched networks Describe a switched network in a small
Ethernet-based Software Defined Network (SDN)
Ethernet-based Software Defined Network (SDN) Tzi-cker Chiueh Cloud Computing Research Center for Mobile Applications (CCMA), ITRI 雲 端 運 算 行 動 應 用 研 究 中 心 1 Cloud Data Center Architecture Physical Server
Peregrine: An All-Layer-2 Container Computer Network
Peregrine: An All-Layer-2 Container Computer Network Tzi-cker Chiueh Cloud Computing Research Center for Mobile Applications (CCMA) 雲 端 運 算 行 動 應 用 研 究 中 心 ICPADS 2011 1 1 Copyright 2008 ITRI 工 業 技 術 研
Chapter 3. Enterprise Campus Network Design
Chapter 3 Enterprise Campus Network Design 1 Overview The network foundation hosting these technologies for an emerging enterprise should be efficient, highly available, scalable, and manageable. This
VMware Virtual SAN 6.2 Network Design Guide
VMware Virtual SAN 6.2 Network Design Guide TECHNICAL WHITE PAPER APRIL 2016 Contents Intended Audience... 2 Overview... 2 Virtual SAN Network... 2 Physical network infrastructure... 3 Data center network...
Simplifying the Data Center Network to Reduce Complexity and Improve Performance
SOLUTION BRIEF Juniper Networks 3-2-1 Data Center Network Simplifying the Data Center Network to Reduce Complexity and Improve Performance Challenge Escalating traffic levels, increasing numbers of applications,
CHAPTER 10 LAN REDUNDANCY. Scaling Networks
CHAPTER 10 LAN REDUNDANCY Scaling Networks CHAPTER 10 10.0 Introduction 10.1 Spanning Tree Concepts 10.2 Varieties of Spanning Tree Protocols 10.3 Spanning Tree Configuration 10.4 First-Hop Redundancy
- Hubs vs. Switches vs. Routers -
1 Layered Communication - Hubs vs. Switches vs. Routers - Network communication models are generally organized into layers. The OSI model specifically consists of seven layers, with each layer representing
Lecture 7: Data Center Networks"
Lecture 7: Data Center Networks" CSE 222A: Computer Communication Networks Alex C. Snoeren Thanks: Nick Feamster Lecture 7 Overview" Project discussion Data Centers overview Fat Tree paper discussion CSE
Enterasys Data Center Fabric
TECHNOLOGY STRATEGY BRIEF Enterasys Data Center Fabric There is nothing more important than our customers. Enterasys Data Center Fabric Executive Summary Demand for application availability has changed
Brocade Data Center Fabric Architectures
WHITE PAPER Brocade Data Center Fabric Architectures Building the foundation for a cloud-optimized data center. TABLE OF CONTENTS Evolution of Data Center Architectures... 1 Data Center Networks: Building
How To Make A Vpc More Secure With A Cloud Network Overlay (Network) On A Vlan) On An Openstack Vlan On A Server On A Network On A 2D (Vlan) (Vpn) On Your Vlan
Centec s SDN Switch Built from the Ground Up to Deliver an Optimal Virtual Private Cloud Table of Contents Virtualization Fueling New Possibilities Virtual Private Cloud Offerings... 2 Current Approaches
全 新 企 業 網 路 儲 存 應 用 THE STORAGE NETWORK MATTERS FOR EMC IP STORAGE PLATFORMS
全 新 企 業 網 路 儲 存 應 用 THE STORAGE NETWORK MATTERS FOR EMC IP STORAGE PLATFORMS Enterprise External Storage Array Capacity Growth IDC s Storage Capacity Forecast = ~40% CAGR (2014/2017) Keep Driving Growth!
Extending Networking to Fit the Cloud
VXLAN Extending Networking to Fit the Cloud Kamau WangŨ H Ũ Kamau Wangũhgũ is a Consulting Architect at VMware and a member of the Global Technical Service, Center of Excellence group. Kamau s focus at
Data Center Network Topologies: FatTree
Data Center Network Topologies: FatTree Hakim Weatherspoon Assistant Professor, Dept of Computer Science CS 5413: High Performance Systems and Networking September 22, 2014 Slides used and adapted judiciously
Top-Down Network Design
Top-Down Network Design Chapter Five Designing a Network Topology Copyright 2010 Cisco Press & Priscilla Oppenheimer Topology A map of an internetwork that indicates network segments, interconnection points,
Flattening the Data Center Architecture
WHITE PAPER The Juniper Networks QFabric Architecture: A Revolution in Data Center Network Design Flattening the Data Center Architecture Copyright 2011, Juniper Networks, Inc. 1 Table of Contents Executive
Non-blocking Switching in the Cloud Computing Era
Non-blocking Switching in the Cloud Computing Era Contents 1 Foreword... 3 2 Networks Must Go With the Flow in the Cloud Computing Era... 3 3 Fat-tree Architecture Achieves a Non-blocking Data Center Network...
Simplifying Virtual Infrastructures: Ethernet Fabrics & IP Storage
Simplifying Virtual Infrastructures: Ethernet Fabrics & IP Storage David Schmeichel Global Solutions Architect May 2 nd, 2013 Legal Disclaimer All or some of the products detailed in this presentation
Introducing Brocade VCS Technology
WHITE PAPER www.brocade.com Data Center Introducing Brocade VCS Technology Brocade VCS technology is designed to revolutionize the way data center networks are architected and how they function. Not that
Testing Network Virtualization For Data Center and Cloud VERYX TECHNOLOGIES
Testing Network Virtualization For Data Center and Cloud VERYX TECHNOLOGIES Table of Contents Introduction... 1 Network Virtualization Overview... 1 Network Virtualization Key Requirements to be validated...
SummitStack in the Data Center
SummitStack in the Data Center Abstract: This white paper describes the challenges in the virtualized server environment and the solution that Extreme Networks offers a highly virtualized, centrally manageable
Outline. VL2: A Scalable and Flexible Data Center Network. Problem. Introduction 11/26/2012
VL2: A Scalable and Flexible Data Center Network 15744: Computer Networks, Fall 2012 Presented by Naveen Chekuri Outline Introduction Solution Approach Design Decisions Addressing and Routing Evaluation
Multi-site Datacenter Network Infrastructures
Multi-site Datacenter Network Infrastructures Petr Grygárek rek 1 Why Multisite Datacenters? Resiliency against large-scale site failures (geodiversity) Disaster recovery Easier handling of planned outages
Data Center Switch Fabric Competitive Analysis
Introduction Data Center Switch Fabric Competitive Analysis This paper analyzes Infinetics data center network architecture in the context of the best solutions available today from leading vendors such
Brocade Data Center Fabric Architectures
WHITE PAPER Brocade Data Center Fabric Architectures Building the foundation for a cloud-optimized data center TABLE OF CONTENTS Evolution of Data Center Architectures... 1 Data Center Networks: Building
Multitenancy Options in Brocade VCS Fabrics
WHITE PAPER DATA CENTER Multitenancy Options in Brocade VCS Fabrics As cloud environments reach mainstream adoption, achieving scalable network segmentation takes on new urgency to support multitenancy.
WHITE PAPER. Copyright 2011, Juniper Networks, Inc. 1
WHITE PAPER Network Simplification with Juniper Networks Technology Copyright 2011, Juniper Networks, Inc. 1 WHITE PAPER - Network Simplification with Juniper Networks Technology Table of Contents Executive
ConnectX -3 Pro: Solving the NVGRE Performance Challenge
WHITE PAPER October 2013 ConnectX -3 Pro: Solving the NVGRE Performance Challenge Objective...1 Background: The Need for Virtualized Overlay Networks...1 NVGRE Technology...2 NVGRE s Hidden Challenge...3
Scaling 10Gb/s Clustering at Wire-Speed
Scaling 10Gb/s Clustering at Wire-Speed InfiniBand offers cost-effective wire-speed scaling with deterministic performance Mellanox Technologies Inc. 2900 Stender Way, Santa Clara, CA 95054 Tel: 408-970-3400
The Impact of Virtualization on Cloud Networking Arista Networks Whitepaper
Virtualization takes IT by storm The Impact of Virtualization on Cloud Networking The adoption of virtualization in data centers creates the need for a new class of networking designed to support elastic
APPLICATION NOTE 210 PROVIDER BACKBONE BRIDGE WITH TRAFFIC ENGINEERING: A CARRIER ETHERNET TECHNOLOGY OVERVIEW
PROVIDER BACKBONE BRIDGE WITH TRAFFIC ENGINEERING: A CARRIER ETHERNET TECHNOLOGY OVERVIEW By Thierno Diallo, Product Specialist Originally designed as a local-area network (LAN) communication protocol,
How To Understand and Configure Your Network for IntraVUE
How To Understand and Configure Your Network for IntraVUE Summary This document attempts to standardize the methods used to configure Intrauve in situations where there is little or no understanding of
DATA CENTER. Best Practices for High Availability Deployment for the Brocade ADX Switch
DATA CENTER Best Practices for High Availability Deployment for the Brocade ADX Switch CONTENTS Contents... 2 Executive Summary... 3 Introduction... 3 Brocade ADX HA Overview... 3 Hot-Standby HA... 4 Active-Standby
Extreme Networks: Building Cloud-Scale Networks Using Open Fabric Architectures A SOLUTION WHITE PAPER
Extreme Networks: Building Cloud-Scale Networks Using Open Fabric Architectures A SOLUTION WHITE PAPER WHITE PAPER Building Cloud- Scale Networks Abstract TABLE OF CONTENTS Introduction 2 Open Fabric-Based
hp ProLiant network adapter teaming
hp networking june 2003 hp ProLiant network adapter teaming technical white paper table of contents introduction 2 executive summary 2 overview of network addressing 2 layer 2 vs. layer 3 addressing 2
Fibre Channel over Ethernet in the Data Center: An Introduction
Fibre Channel over Ethernet in the Data Center: An Introduction Introduction Fibre Channel over Ethernet (FCoE) is a newly proposed standard that is being developed by INCITS T11. The FCoE protocol specification
Avoiding Network Polarization and Increasing Visibility in Cloud Networks Using Broadcom Smart- Hash Technology
Avoiding Network Polarization and Increasing Visibility in Cloud Networks Using Broadcom Smart- Hash Technology Sujal Das Product Marketing Director Network Switching Karthik Mandakolathur Sr Product Line
Networking in the Era of Virtualization
SOLUTIONS WHITEPAPER Networking in the Era of Virtualization Compute virtualization has changed IT s expectations regarding the efficiency, cost, and provisioning speeds of new applications and services.
20. Switched Local Area Networks
20. Switched Local Area Networks n Addressing in LANs (ARP) n Spanning tree algorithm n Forwarding in switched Ethernet LANs n Virtual LANs n Layer 3 switching n Datacenter networks John DeHart Based on
CLOUD NETWORKING FOR ENTERPRISE CAMPUS APPLICATION NOTE
CLOUD NETWORKING FOR ENTERPRISE CAMPUS APPLICATION NOTE EXECUTIVE SUMMARY This application note proposes Virtual Extensible LAN (VXLAN) as a solution technology to deliver departmental segmentation, business
Expert Reference Series of White Papers. Planning for the Redeployment of Technical Personnel in the Modern Data Center
Expert Reference Series of White Papers Planning for the Redeployment of Technical Personnel in the Modern Data Center [email protected] www.globalknowledge.net Planning for the Redeployment of
Broadcom Smart-NV Technology for Cloud-Scale Network Virtualization. Sujal Das Product Marketing Director Network Switching
Broadcom Smart-NV Technology for Cloud-Scale Network Virtualization Sujal Das Product Marketing Director Network Switching April 2012 Introduction Private and public cloud applications, usage models, and
Increase Simplicity and Improve Reliability with VPLS on the MX Series Routers
SOLUTION BRIEF Enterprise Data Center Interconnectivity Increase Simplicity and Improve Reliability with VPLS on the Routers Challenge As enterprises improve business continuity by enabling resource allocation
WHITE PAPER Ethernet Fabric for the Cloud: Setting the Stage for the Next-Generation Datacenter
WHITE PAPER Ethernet Fabric for the Cloud: Setting the Stage for the Next-Generation Datacenter Sponsored by: Brocade Communications Systems Inc. Lucinda Borovick March 2011 Global Headquarters: 5 Speen
Expert Reference Series of White Papers. VMware vsphere Distributed Switches
Expert Reference Series of White Papers VMware vsphere Distributed Switches [email protected] www.globalknowledge.net VMware vsphere Distributed Switches Rebecca Fitzhugh, VCAP-DCA, VCAP-DCD, VCAP-CIA,
Cisco Data Center Network Manager Release 5.1 (LAN)
Cisco Data Center Network Manager Release 5.1 (LAN) Product Overview Modern data centers are becoming increasingly large and complex. New technology architectures such as cloud computing and virtualization
SummitStack in the Data Center
SummitStack in the Data Center Abstract: This white paper describes the challenges in the virtualized server environment and the solution Extreme Networks offers a highly virtualized, centrally manageable
Data Center Network Topologies
Data Center Network Topologies. Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 [email protected] These slides and audio/video recordings of this class lecture are at: 3-1 Overview
Addressing Scaling Challenges in the Data Center
Addressing Scaling Challenges in the Data Center DELL PowerConnect J-Series Virtual Chassis Solution A Dell Technical White Paper Dell Juniper THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY
Load Balancing Mechanisms in Data Center Networks
Load Balancing Mechanisms in Data Center Networks Santosh Mahapatra Xin Yuan Department of Computer Science, Florida State University, Tallahassee, FL 33 {mahapatr,xyuan}@cs.fsu.edu Abstract We consider
TÓPICOS AVANÇADOS EM REDES ADVANCED TOPICS IN NETWORKS
Mestrado em Engenharia de Redes de Comunicações TÓPICOS AVANÇADOS EM REDES ADVANCED TOPICS IN NETWORKS 2008-2009 Exemplos de Projecto - Network Design Examples 1 Hierarchical Network Design 2 Hierarchical
The evolution of Data Center networking technologies
0 First International Conference on Data Compression, Communications and Processing The evolution of Data Center networking technologies Antonio Scarfò Maticmind SpA Naples, Italy [email protected]
Cloud Networking: Framework and VPN Applicability. draft-bitar-datacenter-vpn-applicability-01.txt
Cloud Networking: Framework and Applicability Nabil Bitar (Verizon) Florin Balus, Marc Lasserre, and Wim Henderickx (Alcatel-Lucent) Ali Sajassi and Luyuan Fang (Cisco) Yuichi Ikejiri (NTT Communications)
Intel Ethernet Switch Converged Enhanced Ethernet (CEE) and Datacenter Bridging (DCB) Using Intel Ethernet Switch Family Switches
Intel Ethernet Switch Converged Enhanced Ethernet (CEE) and Datacenter Bridging (DCB) Using Intel Ethernet Switch Family Switches February, 2009 Legal INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION
How To Make A Network Cable Reliable And Secure
ETHERNET KEPT Provider Link State Bridging Gerard Jacobs Senior Solutions Architect Agenda > Network Visions > Carrier Ethernet > Provider Link State Bridging (PLSB) > Summary Network Visions HYBRID L1
How To Switch In Sonicos Enhanced 5.7.7 (Sonicwall) On A 2400Mmi 2400Mm2 (Solarwall Nametra) (Soulwall 2400Mm1) (Network) (
You can read the recommendations in the user, the technical or the installation for SONICWALL SWITCHING NSA 2400MX IN SONICOS ENHANCED 5.7. You'll find the answers to all your questions on the SONICWALL
Understanding Fundamental Issues with TRILL
WHITE PAPER TRILL in the Data Center: Look Before You Leap Understanding Fundamental Issues with TRILL Copyright 2011, Juniper Networks, Inc. 1 Table of Contents Executive Summary........................................................................................................
Block based, file-based, combination. Component based, solution based
The Wide Spread Role of 10-Gigabit Ethernet in Storage This paper provides an overview of SAN and NAS storage solutions, highlights the ubiquitous role of 10 Gigabit Ethernet in these solutions, and illustrates
