Cloud-Scale Data Center Network Architecture

Cheng-Chun Tu
Advisor: Tzi-cker Chiueh

September 10, 2011
Abstract

A cloud-scale data center network imposes unique requirements that differ from those of the traditional network architecture, which is based on a combination of Layer 2 Ethernet switches and Layer 3 routers. The state of the art shows that today's Layer 3 plus Layer 2 model brings significant configuration overhead and fails to meet some critical requirements of virtualized data centers. Because Ethernet offers a high performance-to-cost ratio and ease of configuration, we argue that it is desirable to build the cloud-scale data center network relying only on Ethernet technology. The ITRI (Industrial Technology Research Institute, Taiwan) container computer is a modular computer designed to be a building block for constructing cloud-scale data centers. Rather than using a traditional data center network architecture, the ITRI container computer's internal interconnection fabric, called Peregrine, is specially architected to meet the scalability, fast fail-over, and multi-tenancy requirements of these data centers. Peregrine is an all-Layer 2 network that is designed to support up to one million Layer 2 end points, provide quick recovery from any single network link/device failure, and incorporate dynamic load-balancing routing to make the best use of all physical network links. In addition, Peregrine features a unique private IP address reuse mechanism that allows virtual machines assigned the same IP address to run on it simultaneously without interfering with one another. Finally, the Peregrine architecture is implementable using only off-the-shelf commodity Ethernet switches. This report describes the design and implementation of a fully operational Peregrine prototype, which is built on a folded Clos physical network topology, and presents the results and analysis of a performance evaluation study based on measurements taken on the prototype.
Contents

1 Introduction
  1.1 Characteristics of Cloud-Scale Data Centers
    1.1.1 Scale-out model
    1.1.2 Virtualization
    1.1.3 Multi-tenancy
  1.2 Requirements on Cloud-Scale Data Center Networks
    1.2.1 Any-to-any connectivity with non-blocking fabric
    1.2.2 Virtual machine mobility
    1.2.3 Fast fail-over
    1.2.4 Support for Multi-tenancy
    1.2.5 Load balancing routing

2 Current Data Center Network Architecture
  2.1 Hybrid design: Layer 2 plus Layer 3
  2.2 Limitations of standard Ethernet
    2.2.1 Revisiting the classic Ethernet
    2.2.2 Scalability issues of Ethernet
  2.3 Mapping the L2 + L3 design to cloud-scale requirements

3 All Layer 2 Network
  3.1 Design Issues
  3.2 Standards and Industrial Solutions
    3.2.1 Link Aggregation Protocols
    3.2.2 ECMP: Equal-Cost Multi-Path
    3.2.3 TRILL and RBridge
    3.2.4 802.1aq: Shortest Path Bridging
    3.2.5 Cisco FabricPath
    3.2.6 Brocade VCS
    3.2.7 Juniper QFabric
    3.2.8 OpenFlow
  3.3 Academic Solution
    3.3.1 PortLand
    3.3.2 VL2
    3.3.3 Monsoon

4 Peregrine: An All-Layer-2 Container Computer Network Architecture
  4.1 Introduction
  4.2 ITRI Container Computer
  4.3 Two-Stage Dual-Mode Packet Forwarding
  4.4 Fast Fail-Over
  4.5 Load Balancing Routing

5 Peregrine Implementation and Performance Evaluation
  5.1 Prototype Implementation
  5.2 Network Initialization
  5.3 Effectiveness of Load Balancing Routing
  5.4 Packet Forwarding Performance
  5.5 Fail-over Latency

6 Conclusion
Chapter 1 Introduction

A cloud-scale data center is a facility in which a large number of computer systems and associated components are housed together. A cloud-scale data center provides power management, cooling, a data communication infrastructure, management interfaces, and security. In recent years, cloud-scale data centers have been built to provide environments for a variety of applications that handle the core business and critical operational data of companies. Business applications such as on-line financial transaction processing, multimedia content delivery, and computationally intensive workloads are critical to business revenue. Companies rely heavily on the data center because it provides centralized control and management for business applications. As such, the data center is a key component that needs to be carefully designed to meet growing performance requirements.

1.1 Characteristics of Cloud-Scale Data Centers

1.1.1 Scale-out model

Cloud-scale data centers require a high-performance network interconnect at the scale of tens of thousands of servers. To connect such a large number of hosts, traditional data centers form a tree-like physical topology with progressively more expensive and specialized high-end devices when moving up the network hierarchy. This model is called the scale-up design, and it suffers from limited scalability. For example, communication among hundreds of racks requires a high-capacity backplane, which is usually deployed using the highest-end IP switches/routers with large backplane capacity. The existence of a few critical components in the cloud-scale data center demands extremely high reliability, periodic upgrades, and great maintenance effort. Over the years, the scale-up model has been replaced by the scale-out model, which aggregates a large number of low-cost commodity devices to achieve the same functionality provided by a few specialized, expensive components. Optimization is no longer confined to a
few specific components but is realized as a system-wide design. The use of commoditized devices brings flexibility: a deployment can easily scale out to a large number of nodes or shrink to fit individual needs. Because of the large scale of a data center, the cost difference between using commodity devices and using non-commodity, high-end devices can amount to billions of dollars for cloud service providers [2, 26].

1.1.2 Virtualization

Virtualization has proven to be a good solution for providing low-cost hosts in the cloud-scale data center. Machines running virtualization software can easily consolidate multiple applications/OSes from a variety of vendors onto a single physical server. While this trend may seem to be confined to the server side, it has a direct influence on the underlying network. The following lists the characteristics of the cloud-scale data center network that come from virtualization.

1. Unpredictable bandwidth requirements: With a typical virtualized server usually having more than four network interface cards, the density of virtual machines per physical machine becomes higher. The adoption of virtual machine migration software, e.g., VMware vMotion, to dynamically move servers on demand across the network greatly increases the volume of network traffic. Moreover, with hypervisors deployed in the data center, software such as VMware usually requires a broader Layer 2 domain, which is incompatible with traditional data center network design. Concepts such as dynamic provisioning and virtual machine migration make it very difficult to determine where traffic is coming from and where it is going. With an increasing number of virtual machines running on and moving between different physical servers, application traffic becomes unpredictable and congestion becomes common.

2. Security policy enforcement: In the cloud-scale, virtualized data center, operators need to maintain policy changes and manage configurations across many devices. The conventional design applies physical separation to protect or isolate different data center elements. For example, the tiers of the multi-tier model - web server, application, and database - are often physically separated at different locations. Today, all of the tiers may run on a single physical machine. As a result, virtualization breaks down the multi-tier security model. Moreover, while migrating a machine from one server to another has become easy, dependent devices such as firewalls and intrusion prevention systems often need to be reconfigured as well.

3. Management overhead: The boundary between the network, storage, and security teams becomes blurred. For example, when moving virtual machines from one server to another, application bandwidth requirements and security policies need to be properly reconfigured at multiple devices such as routers,
switches, and load balancers. Moreover, the hypervisor comes with a software-implemented switch, or virtual bridge, which runs on the server side, hidden from network management. Managing these elements becomes an issue, and network management tools today are only beginning to support the concept of virtualization.

1.1.3 Multi-tenancy

One important characteristic of cloud computing is that most information technology users do not need to own their hardware and software infrastructure. They either pay for their IT infrastructure usage on demand or get it for free from the cloud service providers. In this new ecosystem, the cloud service providers, who own the cloud-scale data centers, must consolidate the physical data center infrastructure (PDC) and create multiple illusions of resources assigned to each individual tenant dynamically. The basic resource abstraction for a tenant is a virtual data center (VDC), which consists of:

- One or multiple virtual machines, each equipped with a virtual CPU, a specific amount of physical memory, and a virtual disk of a specific capacity.
- A guaranteed minimum network bandwidth to the Internet.
- A set of management policies and configurations, including firewall rules, network configuration, disk backup policy, etc.
- The binary images for the OS and applications to be deployed on the virtual machines.

Tenants either fully rely on the VDC as their IT infrastructure or partially integrate their on-premise data centers into the cloud VDC. The latter requires a seamless integration of the remote resources in the cloud and the local resources hosted by the tenant. Having identified the characteristics of and challenges for the cloud-scale data center network, in the next section we list the general requirements for the data center network architecture and map these requirements to the various solutions presented in the following chapters.

1.2 Requirements on Cloud-Scale Data Center Networks

A cloud-scale data center network architecture needs to support a wide range of application types, each with different requirements. A large number of servers and applications, increasing management complexity, and high availability and reliability requirements are all key issues a data center framework needs to address. Before embarking on the design of the network architecture, we carefully reviewed the related research literature, studied possible use cases, and came up with the following requirements:
1.2.1 Any-to-any connectivity with non-blocking fabric

Servers are typically equipped with multiple NICs to provide high aggregate throughput for applications running on VMs. With the unprecedented bandwidth requirements of today's applications, the underlying physical topology should guarantee rich connectivity between the NICs of each pair of PMs, with no or low oversubscription, where oversubscription refers to the ratio of the allocated bandwidth per user to the guaranteed bandwidth per user. In other words, 1:1 oversubscription means any arbitrary host in the data center should be able to communicate with any other host at the full bandwidth capacity of its network interface. Rich connectivity also means having multiple candidate paths between two end points from which routing protocols can pick the optimal path. This implies less chance of application performance degradation caused by network congestion and/or device failures. The ultimate goal is to create a logically single switch with a non-blocking internal fabric that is able to scale to the requirements of a cloud-scale data center: 100,000 or more ports and approaching one million virtual machines.

1.2.2 Virtual machine mobility

Virtual machine (VM) migration is one of the strategies used to dynamically reassign VMs to PMs based on the cloud resource provisioning system. To maximize efficiency, a virtual machine should be able to migrate transparently to any physical machine in the data center for load balancing or energy saving purposes. More specifically, transparent migration means the IP address of the VM and its security policies remain consistent after migration, without affecting any service running on the VM. The cloud-scale network architecture should create the illusion of a single Layer 2 network as an unrestricted VM migration domain.

1.2.3 Fast fail-over

With the trend toward commodity [2] hardware, failures in the cloud-scale data center will be common. In the conventional design, detecting failures and recovering from them depends on Layer 2 or Layer 3 routing protocols, e.g., IS-IS and OSPF, which are broadcast-based and require seconds to recover. For example, the spanning tree protocol (STP) periodically probes the network state and takes tens of seconds to recalculate the tree when links or switches fail. However, the variety of network- and data-intensive applications hosted in the data center, e.g., on-line financial transaction processing, execute in milliseconds and definitely cannot tolerate a network meltdown of several seconds. The fail-over latency includes detecting the failure event, reporting it to the management server, possibly some root cause analysis, and finally taking the recovery action. We expect the cloud-scale data center network to provide a recovery mechanism at the scale of milliseconds to minimize the impact on application performance.
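To put these time scales in perspective, the following back-of-the-envelope Python sketch compares how many millisecond-scale transactions are disrupted by a multi-second STP reconvergence versus a millisecond-scale recovery. The reconvergence time, recovery target, and transaction rate below are illustrative assumptions, not measurements from any system described in this report.

    # All numbers below are illustrative assumptions, not measurements.
    stp_reconvergence_s = 30.0     # assumed classic STP reconvergence time
    fast_recovery_s = 0.05         # assumed millisecond-scale recovery target (50 ms)
    txn_rate_per_s = 10_000        # assumed transaction arrival rate on the affected path

    print("transactions disrupted by STP reconvergence:", int(stp_reconvergence_s * txn_rate_per_s))
    print("transactions disrupted by 50 ms recovery   :", int(fast_recovery_s * txn_rate_per_s))
    # 300000 vs. 500 -- the motivation for millisecond-scale fail-over.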
1.2.4 Support for Multi-tenancy

Cloud service providers offer their resources to customers (tenants) in an on-demand fashion. Customers use cloud resources to dynamically extend their existing IT infrastructure. Multi-tenancy support for a cloud-scale data center means that the cloud infrastructure should be shareable by different customers on top of a large pool of physical resources. To be more specific, the physical resources should be able to be partitioned into multiple logical and independent network resources that can be dynamically allocated to each customer. Moreover, the cloud service provider should offer a flexible and seamless deployment approach for customers to integrate or extend their on-premise IT infrastructure into the cloud. The most straightforward way to integrate a cloud-based virtual data center with an on-premise physical data center (PDC) into a seamless data center is to ensure they share the same IP address space, i.e., IP addresses are allocated and reclaimed by a single entity, and to connect them with a pair of properly configured VPN gateways.

1.2.5 Load balancing routing

Designing effective routing strategies for a cloud-scale data center depends on an understanding of its traffic patterns. [12] collects socket-level logs from 1500 servers and analyzes the resulting one-month data set. They identify two traffic patterns, named Work-Seeks-Bandwidth and Scatter-Gather, from servers supporting MapReduce-style jobs as well as a distributed, replicated block storage layer for persistent storage. Another study [7] indicates that the variability in data center traffic is not amenable to concise summarization, and hence engineering routes for just a few giant flows from traffic metrics is unlikely to work well. While it is still too early to claim whether data center network traffic has predictable patterns, there are a few important properties that the routing protocols should consider:

1. Topology awareness: Cloud-scale data centers usually deploy a mesh-like topology with multiple redundant links interconnecting the ToR, aggregation, and core layers. The routing algorithm should understand the topology and efficiently use the rich connectivity.

2. Multipathing: The routing protocol should be able to establish multiple paths between two end hosts to increase the aggregate bandwidth and avoid link congestion at particular hot spots.

3. Many-to-one/many-to-many traffic patterns: With the massive deployment of MapReduce-like applications in the cloud, it is desirable that the routing protocols be optimized for these many-to-one/many-to-many traffic patterns.

4. Traffic engineering: Since every workload in the data center is controlled by the cloud service provider, it is practical for the cloud IT staff to measure and analyze the traffic characteristics. By classifying different
requirements, e.g., low latency or high throughput, the routing protocol can engineer the routes according to individual needs.

5. Resilience: The routing protocols should be resilient to any change in the topology and respond in time to the various events that affect the current routing decisions.
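Before moving on, the oversubscription ratio from Section 1.2.1 can be made concrete with a short calculation. The rack dimensions below (40 servers with 1 GbE NICs behind 4 x 1 GbE uplinks) are the typical numbers cited later in this report and are used here only for illustration.

    # Oversubscription at the ToR uplinks = rack downlink capacity / uplink capacity.
    servers_per_rack = 40
    server_nic_gbps = 1
    uplinks = 4
    uplink_gbps = 1

    downlink_capacity = servers_per_rack * server_nic_gbps   # 40 Gbps can try to leave the rack
    uplink_capacity = uplinks * uplink_gbps                   # 4 Gbps can actually leave
    print("oversubscription = %d:1" % (downlink_capacity / uplink_capacity))   # 10:1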
Chapter 2 Current Data Center Network Architecture

Given the above requirements, this chapter first presents the design of the conventional data center network, which is a Layer 2 plus Layer 3 design. We then map the requirements onto the conventional design and discuss the issues that arise from it and their causes.

2.1 Hybrid design: Layer 2 plus Layer 3

Ethernet has become one of the most popular LAN technologies in many environments, including enterprise networks, campus networks, and data centers. Even Internet service providers use Ethernet as their backbone network to carry traffic between multiple sites. Because of Ethernet's high performance-to-cost ratio and ease of configuration, almost every computer system today is equipped with one or more Ethernet network interface cards, and Ethernet has become the dominant networking technology. To achieve ease of configuration, or the plug-and-play property, one of the fundamental design decisions of Ethernet is to use a broadcast model for querying and locating specific services. For example, the Address Resolution Protocol (ARP) [21] uses broadcast to discover the mapping from a target IP address to its MAC address. The DHCP protocol depends on broadcast to locate the DHCP server, which then assigns an IP address configuration to the client. Although the broadcast model brings many conveniences, it restricts the network size to only hundreds of hosts [17] and is thus unscalable for a large deployment. To deal with Ethernet's limited scalability, a large network today is composed of multiple Ethernet LANs interconnected by IP routing, the so-called Layer 2 plus Layer 3 (L2+L3) solution. In this design, the size of an Ethernet LAN is usually restricted to a few hundred hosts and each LAN forms an IP subnet. An IP subnet is a subset of the network given by an IP prefix representing the network identification. Each host in the subnet is assigned a host number.
The host number combined with the IP prefix forms the host's IP address. A router typically contains many interfaces, and each interface is associated with an IP subnet. The information for determining the outgoing interface from an IP prefix is maintained in the routing table, a data structure that maps IP prefixes to outgoing interfaces.

[Figure 2.1: Layer 2 plus Layer 3 hybrid network architecture design for data centers. ToR: Top-of-Rack switch; AS: aggregation L2 switch; LB: load balancer; AR: access L3 router; CR: core L3 router.]

Figure 2.1 shows the conventional L2 + L3 network architecture for a large campus or data center network. The network is a hierarchy from the core routers at the top down to the ToR (Top of Rack) switches and servers. There are usually 20 to 40 servers per rack, and each server is equipped with multiple NICs connected to ToR switches. ToR switches connect to the aggregation switches (AS), through which server-to-server traffic that crosses racks flows. Firewalls and server load balancing are applied at this layer to optimize the network and secure applications. The bottom-left part of the figure forms a single Layer 2 domain. In order to scale to a large number of nodes, another layer of the network, Layer 3 routing, is deployed to interconnect multiple Layer 2 domains. The access routers (AR) connect to the aggregation switches (AS) downstream and to the core routers (CR) for traffic coming from and going to the Internet.
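As a minimal illustration of the routing-table lookup just described, the following Python sketch performs longest-prefix matching over a toy table; the prefixes and interface names are made up for the example.

    import ipaddress

    # Toy routing table: IP prefix -> outgoing interface (hypothetical values).
    routing_table = {
        ipaddress.ip_network("10.1.0.0/16"): "eth1",
        ipaddress.ip_network("10.1.2.0/24"): "eth2",
        ipaddress.ip_network("0.0.0.0/0"): "uplink0",   # default route
    }

    def lookup(dst_ip):
        """Return the interface of the longest (most specific) matching prefix."""
        dst = ipaddress.ip_address(dst_ip)
        matches = [net for net in routing_table if dst in net]
        best = max(matches, key=lambda net: net.prefixlen)
        return routing_table[best]

    print(lookup("10.1.2.7"))   # eth2: the /24 is more specific than the /16
    print(lookup("10.1.9.9"))   # eth1
    print(lookup("8.8.8.8"))    # uplink0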
2.2 Limitations of standard Ethernet

2.2.1 Revisiting the classic Ethernet

Ethernet, standardized as IEEE 802.3, is a family of frame-based network technologies for local area networks (LANs). A LAN consists of multiple Ethernet bridges and hosts. Each host in an Ethernet is assigned a unique 48-bit MAC (Media Access Control) address. An Ethernet bridge connects multiple hosts and bridges to form a multi-hop network and maintains a data structure called the forwarding table, a map from destination MAC addresses to outgoing ports on the bridge. When a frame arrives on a particular port, the switch automatically associates the frame's source MAC address with that port, a process named source port learning. The bridge then forwards packets by looking up the forwarding table using the packet's destination MAC address to decide the outgoing port. If the destination MAC address is not present in the table, the bridge broadcasts the packet to all ports except the receiving port, resulting in a domain-wide flood. Another cause of flooding is sending a broadcast frame using the broadcast MAC address, i.e., ff:ff:ff:ff:ff:ff. Ethernet bridges, if not properly connected, suffer from broadcast storms caused by loops in the physical topology. Unlike an IP packet, an Ethernet frame does not carry a TTL (Time To Live) field. When a broadcast frame enters an Ethernet with a loop in its topology, the frame will be repeatedly replicated and forwarded to other bridges. This generates an unbounded number of frames in the LAN and blocks all other network traffic, resulting in a network meltdown. The IEEE 802.1D STP (Spanning Tree Protocol) aims to solve the loop problem. Given an arbitrary network topology, the bridges running STP coordinate among themselves and form a unique tree. STP first automatically elects a root bridge as the root node of the tree, and the bridges collectively compute the spanning tree by calculating their distances to the root bridge. STP converges when every link is either in the forwarding state or in the blocking state, and the Ethernet bridges are then only allowed to forward frames out of ports in the forwarding state. Coupled with the broadcast-based delivery model, this design gives Ethernet one of its most promising features: plug-and-play simplicity. When Ethernet switches and hosts are connected together, Ethernet is able to discover the topology automatically and learn host addresses and locations on the network with little configuration.

2.2.2 Scalability issues of Ethernet

This enchanting plug-and-play property of Ethernet has proven successful over the past decades. Unfortunately, it does not come without a cost. We now discuss the fundamental design model of Ethernet and why it does not meet the requirements of the cloud-scale data center.
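The source-learning and flood-on-miss behaviour described above, which underlies the scalability issues discussed next, can be summarized in a minimal Python sketch (the MAC addresses and port numbers are made up for illustration):

    class LearningBridge:
        """Toy model of classic Ethernet bridging: learn source MAC -> port,
        forward on a table hit, flood on a miss or on the broadcast address."""
        BROADCAST = "ff:ff:ff:ff:ff:ff"

        def __init__(self, ports):
            self.ports = ports
            self.fdb = {}                   # forwarding table: MAC -> port

        def receive(self, in_port, src_mac, dst_mac):
            self.fdb[src_mac] = in_port     # source port learning
            if dst_mac != self.BROADCAST and dst_mac in self.fdb:
                return [self.fdb[dst_mac]]  # unicast out the learned port
            # Unknown destination or broadcast: flood to all ports except ingress.
            return [p for p in self.ports if p != in_port]

    bridge = LearningBridge(ports=[1, 2, 3, 4])
    print(bridge.receive(1, "aa:aa:aa:aa:aa:01", "ff:ff:ff:ff:ff:ff"))   # flood -> [2, 3, 4]
    print(bridge.receive(2, "aa:aa:aa:aa:aa:02", "aa:aa:aa:aa:aa:01"))   # hit   -> [1]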
Limited forwarding table size

When a frame arrives, an Ethernet switch determines the outgoing interface by looking up the forwarding table. The forwarding table in a switch contains one entry per destination MAC address. Each entry is associated with an aging time; once it expires, the entry becomes invalid and can be reused by a new entry. However, under heavy load with a large and diverse set of hosts communicating with each other, the forwarding table can fill up and become unable to hold any new entry. In this case, incoming frames whose destination MAC addresses are not present in the table are flooded to all ports and cause a traffic storm. So why not increase the size of the forwarding table? The reason the forwarding table does not grow with the size of the network is cost. Traditionally the table is stored in Content Addressable Memory (CAM), a specialized hardware device that uses the stored contents as the key for retrieving the data associated with those contents. CAM provides very fast table lookup but is expensive (4-5 times as much as a conventional RAM of the same size) and has limited storage capacity. As the data center network grows, the number of distinct MAC addresses traversing a switch explodes. In a data center, servers equipped with more than four Ethernet cards are prevalent; moreover, with servers running hypervisors, each virtual machine is associated with a globally unique MAC address. Even worse, the adoption of virtual machine migration makes it impossible for the network administrator to provision the number of entries a particular switch should maintain in its table. As a result, the number of MAC addresses can grow rapidly in a short time at unpredictable locations, degrading overall network performance.

STP as a solution to loop prevention

As discussed before, Ethernet suffers from broadcast storms caused by loops, and STP is adopted to solve this problem. STP detects and breaks loops by blocking the redundant links, resulting in an underutilized network. Figure 2.2 shows an example.

[Figure 2.2: Classic STP loop prevention in Ethernet.]

On the left, the topology contains loops, and without STP a broadcast packet could cause disastrous endless looping of frames. With STP enabled, the protocol automatically elects a root bridge, in this case
SW1, and discovers all the redundant links that need to be blocked in the topology. The gray dashed lines in the figure represent links in the blocking state, meaning no data packet can pass through them. The resulting topology is a single tree (rightmost figure) without the looping problem, but it suffers from the following drawbacks:

1. A single tree for all traffic
2. A single path for unicast and multicast
3. 50% of the bandwidth unused

Slow fail-over

When links fail in an STP-enabled network, the spanning tree needs to be rebuilt. Network traffic during this unconverged period of time will be discarded, forcing network services to pause. Broadcast storms might happen during this time and cause the network to shut down. The convergence time of STP ranges from a few seconds to several minutes, depending on the size of the network and the protocol. In a data center environment, this can cause serious problems. Although the Rapid Spanning Tree Protocol (RSTP) provides a shorter convergence time by only recalculating the affected subset of links in the tree, in a mesh-like network RSTP still blocks a large portion of the links. The authors of [14] built a simulator for RSTP and evaluated its behavior under a mesh topology. They showed that RSTP takes multiple seconds to converge on a new spanning tree and concluded that RSTP is not scalable to a large data center.

Broadcast overhead

Protocols based on broadcast introduce overhead into the data center network. Ethernet uses broadcast as the control messaging mechanism for higher-layer protocols such as DHCP and ARP (Address Resolution Protocol) [21]. In Ethernet, to find the destination MAC address of the receiving end, the source end first sends an ARP broadcast querying for the MAC address of the destination IP address. The destination node replies to the source node with the MAC address of its receiving interface card. [14] shows that ARP traffic presents a significant burden on a large network. Although each host caches the IP-to-MAC address mappings, based on their results, a million hosts in a data center would generate 239 Mbps of ARP traffic arriving at each host at peak, which might cause congestion at bottleneck links and consume frame processing time. DHCP is another broadcast-based protocol, used ubiquitously to assign IP addresses in a LAN dynamically. When a host boots up, it broadcasts DHCP discovery frames to all the nodes in the same domain. The DHCP server responds with an available IP address for the requesting node. As with ARP, every broadcast frame must be processed by every end host, and broadcast frame processing cannot be offloaded to the network interface card.
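A rough Python estimate shows why domain-wide ARP broadcast becomes a burden at this scale. The per-host ARP rate and frame size below are assumptions chosen only for illustration (they happen to land in the same order of magnitude as the 239 Mbps figure reported in [14], but this is not that paper's methodology):

    # Every broadcast ARP request is delivered to every host in the broadcast domain.
    hosts = 1_000_000              # end points in one flat layer 2 domain
    arp_per_host_per_s = 0.5       # assumed average ARP requests per host per second
    frame_bits = 64 * 8            # minimum-size Ethernet frame

    load_per_host_bps = hosts * arp_per_host_per_s * frame_bits
    print("ARP load seen by each host: %.0f Mbps" % (load_per_host_bps / 1e6))   # 256 Mbps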
2.3 Mapping the L2 + L3 design to cloud-scale requirements

While the traditional L2 + L3 design has served the Internet for decades, when applied to the cloud-scale data center it manifests limitations that fail to meet the requirements of the cloud-scale data center network. In this section, we consider using the traditional L2 + L3 design to build a cloud-scale data center and examine the problems encountered.

1. Any-to-any connectivity with non-blocking fabric: A cloud-scale fabric must connect thousands of hosts while meeting an oversubscription of 1:1. The traditional L2 + L3 design typically does not form a mesh-like network that meets the non-blocking property because (1) if the fabric is deployed using only Layer 2 switches, STP simply blocks the redundant links and only a single tree is used for forwarding, and (2) if it is deployed using Layer 3 routers, it imposes great configuration overhead and is more expensive than using commodity Layer 2 switches. As a result, the traditional hierarchical design is not able to meet the low oversubscription requirement between layers. For example, although server-to-server communication within the same rack has an oversubscription of 1:1, traffic across racks usually has a higher oversubscription ratio, and as traffic moves up through the layers of the hierarchy, the ratio increases rapidly. Uplinks from servers to ToRs are typically 1:5 to 1:20 oversubscribed; for example, 40 1G NICs may connect to a ToR switch with only four 1G uplinks. Paths that route through the core layer might suffer from oversubscription of 1:80 to 1:240 [7]. The high oversubscription ratio constrains workload placement by preventing idle servers from being assigned and thus greatly degrades the performance of data-intensive applications.

2. Virtual machine mobility: VM migration needs to be transparent to applications, which means the IP address must remain the same. More specifically, this requires that migration happen only within the same Layer 2 domain, because migrating to another Layer 2 network requires reconfiguring the IP address and subnet mask to those of the target Layer 2 network. The L2+L3 design, which connects multiple small Layer 2 domains with Layer 3 routing, greatly restricts a VM's mobility to its current Layer 2 domain. Techniques such as using a VLAN (Virtual Local Area Network) to extend a domain virtually to another physical location, or tunneling techniques such as IP-in-IP, can be used to increase VM mobility if configured properly. However, this usually requires error-prone manual labor, and the result is a high turnaround time. Moreover, misconfiguration of VLANs can cause a serious network meltdown.

3. Fast fail-over: Conventional Ethernet relies on the Spanning Tree Protocol to guarantee loop-free packet forwarding. In a network that
operates normally, if a link or switch goes down, STP automatically picks a backup path from the redundant links and the network re-converges; after a few seconds, the network operates normally again. One problem with using STP as the fail-over mechanism is that its re-convergence time is too long, e.g., several seconds. Mission-critical applications such as financial transactions that execute in milliseconds cannot wait several seconds through a network disruption. Needless to say, Layer 3 routing protocols such as link-state routing impose even higher fail-over latency.

4. Support for Multi-tenancy: Multi-tenancy support requires the cloud-scale data center to provide a group of logically isolated virtual resources, such as a VDC (Virtual Data Center), for each tenant. Under the L2 + L3 architecture, a VLAN is typically one of the options for partitioning an Ethernet network and creating multiple virtual Layer 2 domains, one per tenant. Unlike a physical LAN, a VLAN can be defined logically and allows end hosts to be grouped together even if they are physically located at separate switches. The group shares the same broadcast domain and has the same attributes as a physical LAN. This is achieved by assigning each group a VLAN ID; frames coming from the group are tagged with the ID when entering the VLAN and untagged when leaving. Although VLANs offer the flexibility to create logical LANs, they come with some limitations. First, the subnets and IP address ranges for each VLAN must be provisioned and managed. Second, the Layer 3 routers connecting the VLANs need to be properly configured, and inter-VLAN traffic must be routed through a gateway. Moreover, only a limited number of VLAN IDs, i.e., 4096, can be used. The last and most important limitation is that there is no way, short of significant effort, to provide each tenant its own 24-bit IP address space, because routing packets with the same destination IP address but destined to different hosts on the same physical network is almost impossible. Figure 2.3 illustrates the difficulty using IP-in-IP tunneling. The cloud service provider is hosting two customers, A and B, and offers each of them a 24-bit private IP address space. Both customers happen to assign IP2 as the IP address of one of their VMs, and both VMs are located in the routing domain of router R3. When IP1 from customer B and IP3 from customer A both try to communicate with their VM with address IP2, the next-hop router R2 encapsulates the IP header with an outer IP header destined to router R3. Inside the data center fabric, the packets are routed by the outer IP header. As soon as the packets arrive at R3, R3 decapsulates the IP-in-IP packets and forwards them to the outgoing interface given by its routing table. Although IP-in-IP tunneling separates the routing domains of the cloud service provider and the customers, the problem still exists at the edge router (R3) when the destination IP addresses of different customers are the same, in this case IP2. This implies that R3 must apply some non-standard technique,
e.g., VRF (Virtual Routing and Forwarding), in order to route to the two hosts belonging to different VDCs. (A short sketch of this address-collision problem appears at the end of this chapter.)

[Figure 2.3: Multi-tenancy problem caused by IP address space reuse.]

5. Load balancing routing: Although IP routing protocols can to some extent consider the load of each path and do some balancing, within a Layer 2 domain the spanning tree protocol (STP) creates a single tree for packet forwarding and all redundant links are blocked. This is, in fact, a waste of available bandwidth and increases the likelihood of unbalanced link utilization. Although configuring a per-VLAN spanning tree (PVST) can improve load balance and overall throughput [22], using it effectively and configuring it dynamically requires periodically probing the network and reassigning the spanning tree root.

In this chapter, we revisited classic Ethernet's broadcast-based service model and the spanning tree protocol. We argued that standard Ethernet does not scale to a large number of hosts because of this outdated model, and we evaluated the L2 + L3 design against the cloud-scale data center network requirements. In the next chapter we present several academic and industrial solutions that aim to build a single large Layer 2 network fabric by solving the scalability issues of Ethernet mentioned above.
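The address-collision problem from requirement 4 above can be sketched in a few lines of Python. The addresses are made up; the point is that once the outer IP-in-IP header is stripped at the edge router, a lookup keyed only on the inner destination cannot distinguish the two tenants, whereas a VRF-style lookup keyed on (tenant, destination) can:

    def encapsulate(inner_src, inner_dst, edge_router_ip):
        # The fabric routes on the outer destination (the edge router's address).
        return {"outer_dst": edge_router_ip, "inner_src": inner_src, "inner_dst": inner_dst}

    pkt_a = encapsulate("10.0.0.3", "10.0.0.2", "192.168.100.3")   # tenant A -> its VM 10.0.0.2
    pkt_b = encapsulate("10.0.0.1", "10.0.0.2", "192.168.100.3")   # tenant B -> its VM 10.0.0.2

    # After decapsulation, a table keyed only on the inner destination collides:
    edge_table = {}
    for tenant, pkt in (("A", pkt_a), ("B", pkt_b)):
        edge_table[pkt["inner_dst"]] = "port-for-tenant-" + tenant   # second entry overwrites the first
    print(edge_table)   # {'10.0.0.2': 'port-for-tenant-B'}: tenant A's VM became unreachable

    # A VRF-style table keyed on (tenant, destination) keeps the two apart:
    vrf_table = {("A", "10.0.0.2"): "port-1", ("B", "10.0.0.2"): "port-7"}
    print(vrf_table[("A", "10.0.0.2")], vrf_table[("B", "10.0.0.2")])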
Chapter 3 All Layer 2 Network

We define an all-Layer 2 network to be a scalable network architecture that carries traffic based on Ethernet technologies. The network employs only commodity Ethernet switches as dumb packet forwarding engines. Forwarding decisions are made based on the Ethernet header and its corresponding entry in the forwarding table. There is only a single subnet, and the only routers in this network are the gateways connecting to the WAN.

3.1 Design Issues

Before examining the various proposed solutions, we first list the common design issues that network architects need to address when building a single, large-scale Ethernet network.

1. Physical network topology: Since the traditional tree topology imposes high oversubscription, the solution requires a physical topology design that is non-blocking, easy to manage, and extendable to a large number of nodes (a small capacity sketch follows this list).

2. Addressing: The solution should address the limited forwarding table problem by assigning MAC addresses in a way that both delivers frames correctly and minimizes forwarding table usage. The addressing techniques should be supported by commodity switches, or require only minimal modifications.

3. Routing: If STP is disabled in order to make efficient use of all links, the all-Layer 2 network must provide its own routing techniques. This includes topology discovery, loop prevention, and load balancing.

4. Fail-over: The design should include failure detection, disseminate failure information either centrally or in a distributed fashion, and trigger a recovery mechanism to quickly bring the network back to normal operation.
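As an illustration of design issue 1, the following sketch counts hosts and switches in a k-ary fat tree built entirely from identical k-port commodity switches, a standard non-blocking construction (PortLand, discussed later in this chapter, uses such a fat tree; the Peregrine prototype described in this report uses a folded Clos, to which the same style of counting applies):

    def fat_tree_capacity(k):
        """Counts for a k-ary fat tree of identical k-port switches."""
        hosts = k ** 3 // 4          # k pods * (k/2 edge switches) * (k/2 hosts per edge switch)
        edge = k * k // 2            # k pods * k/2 edge switches
        aggregation = k * k // 2     # k pods * k/2 aggregation switches
        core = (k // 2) ** 2
        return hosts, edge, aggregation, core

    for k in (16, 48, 64):
        hosts, edge, aggr, core = fat_tree_capacity(k)
        print("k=%2d-port switches: %6d hosts, %5d edge, %5d aggregation, %5d core"
              % (k, hosts, edge, aggr, core))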
3.2 Standards and Industrial Solutions

3.2.1 Link Aggregation Protocols

LACP: Link Aggregation Control Protocol

IEEE 802.3ad, also called link aggregation, MLT (Multi-Link Trunking), or NIC bonding, is an industry standard for combining multiple parallel network links into a single logical connection. The benefits are increased aggregate throughput and redundancy in case a link fails. Using LACP, a switch learns the identity of neighboring switches capable of supporting LACP and the capability of each port. It then groups similarly configured ports into a single logical link. Packets destined to the logical link (also called a trunk) are distributed to one of the links in the group based on some distribution algorithm. If one of the links in the group fails, traffic previously carried over the failed link moves to the remaining links within the same group. LACP is commonly deployed in cloud data centers and enterprise networks. For example, a fat-tree topology has higher bandwidth demand when moving up the hierarchy. A ToR switch can group four of its 10GE ports with another four ports on its aggregation switch, creating a 40 Gb uplink within a single spanning tree (STP) domain. Without LACP, grouping multiple links between two switches results in the redundant links being blocked by STP. LACP avoids this limitation because, from STP's point of view, once the links are aggregated, the group of links is treated as a single entity, eliminating the possibility of the redundant links being blocked.

SMLT: Split Multi-Link Trunking

[Figure 3.1: SMLT: Split Multi-Link Trunking (left) and STP's view (right).]

SMLT is an enhancement to LACP that removes the limitation that all the physical links in a group must terminate on the same switch. With LACP, the physical links aggregated in a group can only connect a single pair of switches, which increases cabling complexity and limits the flexibility of the network design. SMLT combines the bandwidth of multiple Ethernet ports while splitting the links across multiple switches. This offers not only larger aggregate bandwidth but also a more reliable
network design, because if one of the neighboring switches fails, the remaining switches can still forward traffic for the group. Figure 3.1 shows an example topology and its STP view. The left side of the figure shows SW1 and SW2 first forming a single logical switch by using stacking or trunking protocols, depending on the switch vendor; state such as the forwarding table and port status is synchronized between SW1 and SW2. SW3 and SW4 each split two of their ports to connect to SW1 and SW2, forming a multi-link trunking topology. Although the topology physically contains a loop, no link is blocked. The right side of the figure shows STP's view of this topology: because the spanning tree protocol does not see the redundant links inside the trunk port, no link is blocked.

3.2.2 ECMP: Equal-Cost Multi-Path

[Figure 3.2: 4-way Equal-Cost Multi-Path setup.]

Equal-cost multi-path routing (ECMP) is a load balancing routing technique, discussed in RFC 2991 [24], that spreads traffic over multiple best paths to a single destination. ECMP applies load balancing to flows such as TCP or UDP connections, potentially increasing the bandwidth between two endpoints by spreading the traffic over multiple paths. Figure 3.2 shows a four-way ECMP design. Assume a flow is established from A1 to B1 and a packet from A1 enters the next hop (R5). The default hashing configuration hashes flows based on the Layer 3 source and destination IP addresses and the Layer 4 port numbers, and determines the outgoing interface from one of the four uplinks, say R2. The upper-layer switch R2 then deterministically forwards the packet to R6, which delivers it to B1. The resulting effect is that the traffic between R5 and R1-R4 is randomly distributed across the uplinks on a per-flow (TCP or UDP) basis. However, ECMP has some limitations: if certain elephant flows are present, or multiple flows happen to hash to the same outgoing interface, congestion and unbalanced link utilization can still occur [3]. Although per-packet load balancing over multiple paths gives the best bandwidth utilization, it is usually avoided for the following reasons. First, per-packet load balancing
is likely to increase the number of out-of-order packets, increasing the workload on the receiving hosts. For example, TCP treats out-of-order delivery as an indication of network congestion: TCP decreases its window size and the throughput of the connection drops. Second, per-packet load balancing requires process switching on all but the highest-end routers, and process switching is 8 to 12 times more processor-intensive than fast switching. Finally, randomly spreading packets across multiple links makes the network harder for IT staff to debug. Table 3.1 compares SMLT and ECMP.

  Architecture                 | SMLT                  | ECMP
  Protocol layer               | Layer 2 forwarding    | Layer 3 IP routing
  Load balance granularity     | granular, per-host    | finer, per-flow or per-packet
  Path diversity exploration   | grouping ports        | IGP, BGP routing protocols

Table 3.1: A comparison of Trunking and ECMP.

SMLT and ECMP also share some common limitations (a small flow-hashing sketch follows this list):

1. Upstream hash collision: There is usually a limited number of ports or paths in a group. With hashing applied per host or per flow, there is a high probability that multiple large or medium-sized flows are assigned to the same link, causing congestion.

2. Downstream collision: Both ECMP and SMLT balance traffic at the edge device upstream toward the upper-layer devices. Two flows coming from different hosts might end up using the same downstream link, and congestion can happen.
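The per-flow hashing behaviour of ECMP, and the collision problem above, can be illustrated with a minimal Python sketch; the hash function and addresses here are stand-ins for whatever a real switch ASIC implements:

    import hashlib

    def ecmp_uplink(src_ip, dst_ip, src_port, dst_port, proto, n_uplinks=4):
        """Pick an uplink by hashing the flow identifier, so all packets of the
        same flow take the same path."""
        key = ("%s|%s|%d|%d|%s" % (src_ip, dst_ip, src_port, dst_port, proto)).encode()
        digest = hashlib.sha256(key).digest()
        return int.from_bytes(digest[:4], "big") % n_uplinks

    # Two flows between the same pair of hosts may still hash to the same uplink,
    # which is exactly the upstream collision problem described above.
    print(ecmp_uplink("10.0.1.5", "10.0.2.9", 40001, 80, "tcp"))
    print(ecmp_uplink("10.0.1.5", "10.0.2.9", 40002, 80, "tcp"))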
3.2.3 TRILL and RBridge

TRILL (Transparent Interconnection of Lots of Links) [25], invented by Radia Perlman to remove certain deficiencies of bridged Ethernet, is an IETF protocol implemented by devices called routing bridges (RBridges) [20]. TRILL is not intended to solve the scalability problems of the modern data center, but rather to avoid the disadvantages of standard bridging. As mentioned before, IP routing does not suffer from problems such as loops in the topology or STP's inefficiency; TRILL aims to avoid these problems by adding routing capability to the Layer 2 bridge. TRILL makes no assumption about the physical network topology, and TRILL switches can coexist with standard Ethernet switches, making TRILL an incremental feature. For addressing and routing, the RBridges in the data center run a link-state routing protocol such as IS-IS. All RBridges therefore have the topology information and are able to compute shortest-path routes to each other. An RBridge learns the MAC addresses of end hosts by inspecting the packets originating on its links. This information is distributed to all other RBridges so that every RBridge knows the appropriate destination RBridge for a given MAC address; this approach is not scalable to a very large number of end hosts. When a packet enters the RBridge network, the ingress RBridge determines the RBridge nickname associated with the destination MAC address and encapsulates the outgoing packet. The RBridge header includes a hop count, the egress RBridge nickname, and the ingress RBridge nickname. At each intermediate RBridge on the way to the destination RBridge, the hop count is decremented, which prevents transient loops during convergence.

[Figure 3.3: Forwarding paradigm of TRILL (from NIL Data Communications).]

Figure 3.3 shows an example of TRILL forwarding. The ingress RBridge receives the user's MAC frame and encapsulates it in a TRILL header carrying the addresses of ingress RBridge A and egress RBridge C. The TRILL datagram gets a new outer MAC header, which is looked up and rewritten every time the packet is forwarded by an RBridge; the hop count field in the TRILL header, HopC, is decremented while the rest of the TRILL header stays unchanged. To coexist with standard switches, the RBridge can use a standard Layer 2 header with its own protocol type: the RBridge header is appended after the standard Ethernet header and followed by the payload. When the egress RBridge C receives the packet, the RBridge header is removed so that the RBridge network is transparent to the destination host S. In brief, the design can be thought of as transparently interconnecting links while avoiding the disadvantages of bridging by using routing. TRILL focuses on the issues of looping and STP inefficiency, but it does not directly solve cloud-scale data center network problems such as limited forwarding table size, nor does it meet all of the requirements.

3.2.4 802.1aq: Shortest Path Bridging

Shortest Path Bridging (SPB) is an IEEE 802.1aq draft intended to serve as both a carrier and an enterprise solution. There are two flavors of SPB multipath bridging: Shortest Path Bridging MAC-in-MAC (SPBM) and Shortest Path Bridging
VLAN (SPBV). The 802.1ad (Q-in-Q) forwarding variant is called SPBV, and the 802.1ah (Provider Backbone Bridging, or MAC-in-MAC) variant is called SPBM. For the purpose of this report, we focus on SPBM. SPB reuses the Provider Backbone Bridging (PBB) 802.1ah [4] MAC-in-MAC technique. The ingress switch takes the customer's MAC frame and encapsulates it in an 802.1ah MAC frame. The 802.1ah frame header includes a service identifier (I-SID), which abstracts the service from the network by mapping one or multiple VLANs to an I-SID. SPB automatically constructs shortest paths through the network to extend LAN connectivity. The frame is forwarded based on the destination MAC address, and throughout the forwarding process the frame remains unchanged. This alleviates the limited forwarding table size problem, since only the edge switches need to learn the users' MAC addresses. For routing, SPB uses IS-IS as the link-state routing protocol to build the network topology and selects shortest paths according to the link metrics; traffic (unicast and multicast) is then assigned to those paths. For load balancing, SPB computes 16 source-based shortest path trees.

[Figure 3.4: Forwarding paradigm of Shortest Path Bridging, SPB (from NIL Data Communications).]

Figure 3.4 shows an example of SPBM forwarding. The 802.1aq edge switch (switch U) takes the user's MAC frame and encapsulates it in an 802.1ah (MAC-in-MAC) frame. The destination MAC address in the 802.1ah header is that of the egress switch (switch C). Throughout the backbone forwarding process, the frame remains untouched and the destination MAC address is unchanged. When the frame arrives at the egress switch C, it is decapsulated and delivered to the host S. Table 3.2 compares TRILL and SPB.

  Architecture            | TRILL                                              | SPB
  Topology                | general                                            | general
  Addressing              | TRILL header                                       | 802.1ah MAC-in-MAC
  Routing                 | link-state, IS-IS                                  | link-state, IS-IS
  Load balancing          | N x transit hash based                             | 16 x ECMP, source node based trees
  Loop prevention         | hop count and RPFC (Reverse Path Forwarding Check) | RPFC
  Fail-over latency       | depends on IS-IS                                   | depends on IS-IS
  Multicast or broadcast  | single tree                                        | source node based spanning tree
  Compatibility           | new header; requires new ASICs on every RBridge    | traditional Ethernet switching with 802.1ah-capable hardware

Table 3.2: Solution strategies comparison of TRILL and SPB [10].

3.2.5 Cisco FabricPath

Cisco FabricPath [23] is a Cisco NX-OS technology that combines Layer 2 configuration simplicity and flexibility with Layer 3 convergence and scale. The idea is to create simple, scalable, and efficient Layer 2 domains that are applicable to data centers. It increases server-to-server bandwidth with multiple active
paths and creates a non-blocking architecture to improve performance. Cisco claims that FabricPath is a superset of TRILL and will support TRILL once it is standardized. Regarding topology, FabricPath makes no assumption about the physical topology. An example in the FabricPath documentation [11, 23] suggests a two-layer topology (aggregation/spine and access switches) with 16-way ECMP. When the two switch layers are combined with 16-port 10 Gbps PortChannels, FabricPath can provide a data center fabric with 2.56 Tbps of bandwidth between switches. As for routing and addressing, similar to TRILL, the control plane of FabricPath is built on top of the IS-IS (Intermediate System-to-Intermediate System) routing protocol, and the routing table is computed for multicast and unicast destinations. Frames in FabricPath are forwarded along the shortest path, which reduces the overall network traffic load and increases efficiency. Frames are always forwarded to known addresses, which means no flooding compared to standard Ethernet. Moreover, FabricPath frames include a TTL field similar to IP, which prevents loops in the bridged Layer 2 network. To address the issue of limited forwarding table size, MAC addresses in FabricPath are learned selectively only at the edge, saving MAC address table space in the aggregation switches. The detection of link or switch failures relies on the IS-IS routing protocol. Between the access and aggregation layers, FabricPath load balances the traffic using multi-way ECMP, which can use all the available links between any two devices and spread the traffic across them.

3.2.6 Brocade VCS

Brocade VCS protocols remove the need for STP and allow all equal-cost paths to be active, which results in no single point of failure.
Brocade VCS enables organizations to preserve existing network designs and cabling and to achieve active-active server connections without using the Spanning Tree Protocol (STP). The Brocade VDX design utilizes a single standard LAG consisting of multiple 10 GbE connections, allowing two switches to appear as a single logical switch to the core routers. The underlying protocol, Transparent Interconnection of Lots of Links (TRILL), provides active multipath support, so that the rack server sees only a single ToR switch.

3.2.7 Juniper QFabric

[Figure 3.5: QFabric's design: Node, Interconnect, and Director.]

Juniper's virtual Layer 2 switching architecture, called QFabric [1], aims to distribute the control and data planes to the edge. QFabric makes the network itself behave like a single switch: inside every switch is a mesh-like fabric that is completely flat and provides any-to-any connectivity between ports. The design of QFabric, shown in Figure 3.5, has three basic components:

1. QF/Node: The QFabric Node provides access into and out of the fabric. The node devices are typically line cards that reside within a chassis switch, forming a high-density edge device.

2. QF/Interconnect: This is the backplane of QFabric. The design enables any-to-any connectivity, where every device is a single hop away from any other device. The initial release of the QFabric architecture supports the interconnection of up to 128 QF/Node edge devices, creating a single fabric capable of supporting 6,000 10GbE ports.
3. QF/Director: This component provides the control and management services for the fabric. The Director has an exclusive out-of-band control plane network for carrying control traffic between the QF/Node and QF/Interconnect devices. Moreover, it allows the data center network to appear as a single switch, providing simplicity of management to network operators.

In conclusion, QFabric provides a flattened, non-blocking network fabric that supports high-speed server-to-server communication. It allows the data center network to appear as a single switch, which dramatically reduces the cost of managing multiple switches. QFabric further separates the control plane and data plane to eliminate any single point of failure in the system. By adding ports to existing switches, the QFabric architecture can scale the data center network with minimal additional management and operation overhead.

3.2.8 OpenFlow

OpenFlow [13] is a programmable protocol designed to manage and direct traffic among OpenFlow switches from different vendors. Typically, the basic job of a networking device (bridge or router) is to make forwarding/routing decisions (the control plane) and subsequently forward the data (the data plane). The control plane runs control protocols, e.g., STP, RSTP, routing protocols, and MAC address learning. These control protocols program the forwarding information into the data plane, which can be a simple lookup table in TCAM (Ternary Content Addressable Memory); packets are then forwarded to the outgoing interface by looking up the table. Typically, the control plane uses a communication protocol to program the forwarding information into the data plane. Vendors today offer various degrees of programmability and proprietary protocols on their switches and routers for this purpose. As a result, global and unified network resource management and traffic engineering are limited by the inconsistency of devices from multiple vendors. OpenFlow's goal is to provide an open and standard protocol to program the tables in switches and routers from different vendors. OpenFlow consists of three parts:

1. Flow tables installed on switches
2. An OpenFlow controller
3. The OpenFlow protocol, with which the controller creates secure channels to switches

The flow table on each switch is controlled by the OpenFlow controller via the secure channel. The controller installs flow policies into the flow table. Paths through the network can be optimized for specific characteristics, such as SLA, end-to-end latency, or throughput. When an OpenFlow switch receives a frame, it first checks whether there is a matching entry in its flow table. If not, the switch forwards the frame to the controller. The controller makes forwarding decisions based on various fields in the frame, such as the source/destination MAC addresses, IP addresses, or port numbers. This part of the logic can also be used as a firewall to block or restrict certain network flows. Once the controller decides the forwarding
policy, it programs the information into the switch's flow table via the secure channel. In conclusion, with OpenFlow, network operators can slice off a part of the network devices (switches or routers) in an OpenFlow network and create virtual networks for researchers to develop new protocols. In addition, OpenFlow offers fine-grained, flow-level forwarding control without the restrictions of IP routing or STP.

3.3 Academic Solution

3.3.1 PortLand

[Figure 3.6: PortLand: packet forwarding and Actual MAC (AMAC) to Pseudo MAC (PMAC) mapping.]

The goal of PortLand [16] is to create a scalable, easily manageable, and fault-tolerant data center Layer 2 network fabric. It proposes a three-tier hierarchical topology and a scalable Layer 2 routing and forwarding protocol. PortLand designs the data center over a three-stage (core-aggregation-edge) fat-tree topology. Edge and aggregation switches are grouped into pods, and each pod connects to every core switch. PortLand overcomes the limited forwarding table size problem by assigning a hierarchical pseudo MAC address (PMAC) to each host, encoding its location (represented as pod.position.port.vmid). The design keeps the PMAC transparent to the host: the host remains unmodified and still uses its original MAC address (AMAC). When communication starts, a host sending out an ARP request receives the PMAC of the destination host (steps 1-3 in Figure 3.6). The forwarding of subsequent packets is based on the PMAC of the destination host, which means the AMACs of hosts never consume forwarding table entries except at the edge switches. When the edge switch associated with the destination receives a frame, it needs to perform PMAC-to-
When a packet enters the PortLand network fabric, switches can forward it based on the PMAC because the location of the destination is encoded in the address. This design requires switches to discover their own locations using PortLand's Location Discovery Protocol (LDP), and an edge switch automatically learns PMAC-to-AMAC mappings by observing incoming packets. PortLand guarantees loop-free forwarding by preventing a switch from forwarding packets to an upward-facing port if the packets were received from a switch higher in the hierarchy. Switches running LDP detect switch and link failures by exchanging liveness messages. The fabric manager keeps a fault matrix for the links and updates it with new information. Once the fault matrix changes, the fabric manager informs all affected switches of the failures along with the new topology, and those switches recalculate their forwarding tables accordingly. The authors treat the multipath and load balancing problem as orthogonal to this work and suggest flow-hashing ECMP as one way to achieve flow-level load balancing.

3.3.2 VL2

Figure 3.7: VL2: packet encapsulation, decapsulation, and VLB.

The main ideas behind VL2 are to create a large virtual layer 2 domain using commodity hardware, a non-oversubscribed Clos network topology, and a load balancing mechanism based on randomization. VL2 takes the scale-out approach and designs its topology by aggregating the capacity of a large number of commodity switches. The topology follows a three-layer design: intermediate, aggregation, and ToR (Top of Rack). The links between the intermediate and aggregation switches form a Clos [5] network, providing rich path diversity and no oversubscription.
VL2 leverages two different IP address families: location-specific IP addresses (LAs) and application-specific IP addresses (AAs). LAs are hierarchically assigned to all the switches, and the switches run an IP-based link-state routing protocol. AAs are allocated to applications and remain unchanged when an application migrates to another location. The idea is to create the illusion of a single large IP subnet (the AA address space) while the underlying network routes packets by LA. Similar to PortLand, this requires a directory system to maintain the AA-to-LA mappings. To route between servers, VL2 deploys a layer-2.5 agent on each server that intercepts ARP requests and redirects them to the directory system. The directory system responds with the LA associated with the destination AA, and the agent encapsulates the outgoing packets with this LA. To distribute load equally, VL2 applies Valiant Load Balancing (VLB): it randomly chooses one intermediate switch and encapsulates the LA of that intermediate switch into the outgoing packets. In brief, as shown in Figure 3.7, an outgoing packet is encapsulated with an intermediate switch LA and a ToR switch LA. Along the forwarding path, the intermediate switch and the ToR switch decapsulate the packet and forward it to the next hop. VL2 also uses ECMP [9] to distribute traffic across equal-cost paths. The combination of VLB and ECMP prevents any network path from being persistently overloaded in the data center. In addition, VL2 uses a link-state routing protocol to detect switch/link failures and maintain the switch-level topology.
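The VLB step can be summarized in a few lines of Python: the sending agent resolves the destination AA through the directory system, picks a random intermediate switch, and wraps the packet in two outer headers (intermediate LA, then destination ToR LA). Names such as directory and the record layout below are illustrative assumptions, not part of VL2.

    import random

    # Hypothetical directory: AA -> (ToR switch LA, list of intermediate switch LAs)
    directory = {
        "20.0.0.5": {"tor_la": "10.1.2.1",
                     "intermediates": ["10.0.0.1", "10.0.0.2", "10.0.0.3"]},
    }

    def vl2_encapsulate(payload: bytes, dst_aa: str):
        """Sketch of the VL2 agent's send path: AA lookup + VLB double encapsulation."""
        entry = directory[dst_aa]                     # normally an ARP intercept + directory query
        intermediate_la = random.choice(entry["intermediates"])   # Valiant Load Balancing
        # The outermost header targets the randomly chosen intermediate switch;
        # the inner header targets the destination's ToR switch.
        return {
            "outer_dst": intermediate_la,   # removed (decapsulated) at the intermediate switch
            "inner_dst": entry["tor_la"],   # removed at the ToR switch
            "dst_aa": dst_aa,               # used for final delivery to the server
            "payload": payload,
        }

    pkt = vl2_encapsulate(b"hello", "20.0.0.5")
    assert pkt["inner_dst"] == "10.1.2.1"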
3.3.3 Monsoon

Monsoon [8] aims to create an all-layer-2, scalable, and load-balancing network architecture. Monsoon uses a three-layer approach (ToR, ingress/egress, and intermediate) to scale its layer 2 domain to 100,000 servers. Each ToR switch has two 10-Gbps uplinks, and the ingress/egress and intermediate switches provide 10-Gbps ports. The design has a 1:1 oversubscription ratio and is well suited to load-balanced routing. To route between servers, the source node needs two pieces of information: first, a list of the MAC addresses responsible for handling the destination IP address, and second, a list of the MAC addresses of the ToR switches to which each of those servers is connected. A Monsoon agent on each server replaces the user-level ARP functionality, obtains this information from a Monsoon directory server, and encapsulates every outgoing packet. The directory server maintains a mapping from a server's IP address to a list of (server MAC address, ToR switch MAC address) pairs. When communication starts, each outgoing frame is encapsulated with the MAC addresses of the ToR switch and the intermediate switch and sent out. The intermediate switches and ToR switches, which support MAC-in-MAC tunneling, decapsulate the frame, and the frame eventually arrives at the destination host unaltered. With MAC-in-MAC tunneling, the switches in Monsoon forward frames based only on switch MAC addresses, which solves the limited forwarding table size problem.

To distribute workload in the data center, Monsoon provides a mechanism called load spreading. Load spreading is achieved by creating a VIP (Virtual IP) shared by a set of servers (a server pool). When requests come in, the directory server resolves the VIP to the list of MAC addresses associated with it and uses MAC rotation to provide efficient server-to-server forwarding. The sender uses consistent hashing to select a destination host from the MAC address list. When a server fails, Monsoon leverages the existing data center health service to remove or add servers from the server pools.

Issue              | PortLand                                                 | VL2                                                                  | Monsoon
Topology           | Two-layer, multi-root                                    | Three-layer, Clos network                                            | Three-layer, multi-root
Addressing         | Hierarchical, encodes location into the MAC address      | Flat, IP-in-IP, Location Address (LA) and Application Address (AA)   | Flat, MAC-in-MAC encapsulation, source routing
Routing            | Location Discovery Protocol, shortest path, route by MAC | Link-state routing, shortest path, route by IP                       | Centralized routing based on the 4D architecture
Load balance       | Flow-hashing ECMP                                        | VLB + flow-based ECMP                                                | VLB + MAC rotation
Loop prevention    | Packets traveling down cannot travel back up             | Depends on the IP routing protocol                                   | N/A
Fail-over latency  | Milliseconds; centrally controlled and notified          | Seconds; depends on routing protocol convergence time                | Based on ECMP to detect failures
ARP/DHCP handling  | Redirect ARP at the switch                               | Disable ARP and DHCP, replace with a user-level agent                | Redirect ARP and DHCP

Table 3.3: Summary of the solution strategies of the PortLand, VL2, and Monsoon architectures.
Chapter 4

Peregrine: An All-Layer-2 Container Computer Network Architecture

4.1 Introduction

Cloud computing ushers in an era in which most information technology users do not need to own the system hardware and software infrastructure on which their day-to-day IT applications run. They either pay for their IT infrastructure usage on demand or get it for free, e.g., through subsidies from advertisers. Although the concept of decoupling the use of IT infrastructure from its ownership has started to gain traction in the enterprise space only within the last three years, it has long been common and popular in the consumer space. In this new ecology, it is the cloud service providers that build and own the IT infrastructures on which third-party or their own cloud applications run and deliver services, and along the way they get reimbursed for the value provided to their respective users. The name of the game behind most cloud computing business models is economy of scale. By consolidating IT infrastructures within an organization (private cloud) or across multiple organizations (public cloud), both the capital expense (software licensing cost, hardware acquisition cost, etc.) and the operational expense (human system administration and support cost, energy usage cost, etc.) can be significantly cut down. In addition, by exploiting statistical multiplexing, a consolidated IT infrastructure can be made more capable, flexible, and robust than the sum of the parts from which it is consolidated.

Although the IT infrastructure consolidation brought forth by cloud computing has many benefits, it also escalates the scalability issues of IT infrastructure to a new level. One such issue is the scalability of a cloud data center's network architecture. This paper describes the design, implementation, and evaluation of a data center network called Peregrine, which is specifically designed for a container computer
built at the Industrial Technology Research Institute (ITRI) in Taiwan. The ITRI container computer is designed to be a modular building block for constructing a cloud data center computer, which in general is composed of multiple container computers that are connected by a data center network, is interfaced with the public Internet through one or multiple IP routers, and is designed as an integrated system whose hardware components, such as servers and switches, are stripped of unnecessary functionalities, whose resources are centrally configured, monitored, and managed, and which encourages system-wide optimizations to reach better global design tradeoffs. A key design decision of the ITRI container computer is to use only commodity hardware, including compute servers, network switches, and storage servers, and to leave high availability and performance optimization to the system software. Another key decision is to design a new data center network architecture from the ground up to meet the unique requirements imposed by a cloud data center computer. Before embarking on the design of the network architecture for the ITRI container computer, we carefully reviewed the related research literature, studied possible use cases, and came up with the following requirements:

1. There is only one network, which supports communications among programs, data storage accesses, and interactions with the Internet.

2. The network must be buildable from mainstream commodity layer-2 switches, for lower cost and better manageability.

3. The network must be able to support up to one million end points, each of which corresponds to a virtual or physical machine.

4. The fail-over latency for any single network link/device failure is lower than 100 msec.

5. The loads on the network's physical links are balanced.

6. The network must support private IP address reuse, i.e., multiple instances of the same private IP address can co-exist simultaneously.

The first requirement dictates that the ITRI container computer should not use a separate SAN for storage data accesses, and that its network must interface seamlessly with the container computer's internet edge logic component. The second requirement mandates that only mainstream, rather than high-end enterprise-grade, Ethernet switches be used and that the modifications required on these switches be minimized. The last requirement is included specifically to support Amazon EC2-like IaaS (infrastructure as a service) cloud services, where multiple virtual data centers are multiplexed on a physical data center, and each virtual data center is given the full private IP address space 10.X.X.X, so that a customer's virtual data center can seamlessly inter-operate with its existing on-premises physical data centers without any network/system reconfiguration, such as IP address re-assignment.

A natural choice for building an all-layer-2 data center network is the standard Ethernet architecture. Unfortunately, because conventional Ethernet is based on a spanning tree architecture, it cannot satisfy the third and fourth requirements. Moreover, because the number of forwarding table entries in most mainstream Ethernet switches is limited to at most 64,000, these switches are not equipped to meet the third requirement.
Finally, IP address reuse is actually considered a run-time configuration error and is thus impossible to support in standard Ethernet networks.

Peregrine satisfies all the requirements mentioned above. It uses a two-stage dual-mode packet forwarding mechanism to support up to 1M end points using only mainstream Ethernet switches. It incorporates load-aware routing to make the best use of all the physical network links, and proactively provisions primary and backup routes to anticipate potential network failures. Peregrine supports private IP address reuse through a protected address translation mechanism similar to virtual address translation. Finally, Peregrine only requires about 100 lines of code change on mainstream Ethernet switches.

4.2 ITRI Container Computer

Figure 4.1: System architecture of the ITRI container computer and its various system components.

Figure 4.1 shows the logical system architecture of the ITRI container computer. The ITRI container computer is physically housed in an ISO-standard 20-foot (6.096 meter) shipping container, and consists of 12 server racks lined up on both sides of the container with an access aisle in the middle, where each server rack holds up to 96 current-generation X86 CPUs and 3 TB of DRAM. Twelve JBOD (Just a Bunch Of Disks) storage servers, each packed with 40 disks, are installed in the container computer. Together with the local disks directly attached to the compute server nodes, the container boasts more than 1 petabyte of usable disk space.

The ITRI container computer uses a single 380 VDC power distribution network to distribute power to all of its hardware devices, avoiding the power efficiency loss due to unnecessary conversion between AC and DC. The PDU on each server rack is capable of supporting 25 kilowatts of power.
The cooling subsystem uses a combination of air and liquid cooling technologies, and is specifically designed to achieve an annual average PUE of 1.2 in a subtropical climate such as Taiwan's, where PUE is defined as the total amount of energy consumption divided by that of the IT equipment alone. The ITRI container computer is designed to subsume all hardware functionalities seen in a typical data center, and thus includes support for all internet edge logic such as NAT (network address translation), VPN (virtual private networking), traffic shaping, and server/network load balancing, which is implemented on general-purpose server clusters rather than commercial proprietary appliances. To support lights-out management, the ITRI container computer incorporates a comprehensive SNMP-based environmental monitoring and control subsystem to protect itself, including a fire-and-smoke detection system backed up by a clean-agent gas-fire suppression subsystem, a physical security alarm subsystem, and an early earthquake detection system that proactively shuts the system down in the event of an earthquake. The container computer is designed to sustain a Richter scale 6.0 earthquake with no operational impact.

Figure 4.2: The physical topology of the ITRI container computer's network is a modified Clos network.

The ITRI container computer's network is a modified Clos network, as shown in Figure 4.2. Every rack contains 48 server nodes, each having four 1GE NICs, and includes four top-of-rack (TOR) switches, each having 48 1GE ports and four 10GE ports. There is a virtual switch inside every server node that is connected to the server node's four NICs, which in turn are connected to the four TOR switches in the same rack. The four 10GE uplinks on each TOR switch are connected to four different regional switches, each of which has 48 10GE ports. To improve the performance of storage accesses, each storage server has four 10GE NICs and is directly connected to four different regional switches. In total, five regional switches are used in the ITRI container computer. Peregrine is designed to connect multiple ITRI container computers, but doing so will require another layer of core switches to establish the necessary connectivity.
4.3 Two-Stage Dual-Mode Packet Forwarding

Because the number of addressable hosts in a single IP subnet of an enterprise network rarely exceeds 5,000, the number of forwarding table entries on a large percentage of mainstream enterprise-grade Ethernet switches is no larger than 32,000. Coupled with the fact that Ethernet switches forward packets based on their destination address, mainstream Ethernet switches cannot be used to build a network with a million end points or hosts, because they cannot afford to allocate a forwarding table entry for each and every host. Peregrine solves this problem using a two-stage forwarding scheme. Hosts in a Peregrine network are partitioned into disjoint groups, each of which is proxied by a dedicated intermediary. That is, every intermediary is capable of reaching every host in its group in one hop. To send a packet to a destination D, the source S first identifies the intermediary associated with D, uses the MAC address of D's intermediary as the packet's destination address, and embeds D's MAC address somewhere inside the packet. This process is known as MAC-in-MAC (MIM) encapsulation. When S sends this MIM packet out, it reaches D's intermediary first, and the intermediary, knowing that it is an MIM packet, takes out the embedded MAC address of D, replaces the packet's destination address with it, and dispatches the packet to the normal packet forwarding process. The intermediary is said to perform MIM decapsulation in this case. The packet eventually arrives at D because we assume D's intermediary can always reach D in one hop.

Whenever a VM moves from one physical machine to another, the VM's routing state in the network must be modified accordingly so that packets destined to this VM can still reach it after its migration. Two-stage forwarding simplifies routing state migration in a way similar to mobile IP [19]: when a VM moves to a new PM, Peregrine changes the VM's intermediary to the intermediary covering the new PM and informs all parties previously communicating with the VM of this change. During the transition period, the old intermediary sends back a host-unreachable ICMP message on behalf of the migrated VM whenever it receives packets destined to the VM.

With two-stage forwarding, an intermediary switch only needs to allocate forwarding table entries for the other intermediaries and for all the hosts in the same group as the intermediary switch. Let G denote the number of hosts in an intermediary's group. In a 1,000,000-node Peregrine network, the number of forwarding table entries needed by every intermediary switch is thus 1,000,000/G + G. For a non-intermediary switch, it only needs to allocate forwarding table entries for the other intermediaries.
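A quick calculation, sketched below, makes this table-sizing argument concrete for the two intermediary choices discussed next (TOR switch versus virtual switch); the 1,000,000-host network size and the 32,000-entry table limit are the figures used in this section.

    # Forwarding-table sizing for two-stage forwarding: 1,000,000/G + G entries
    # per intermediary switch, where G is the number of hosts per group.
    HOSTS = 1_000_000
    TABLE_LIMIT = 32_000

    def entries_per_intermediary(group_size: int) -> int:
        num_intermediaries = HOSTS // group_size       # entries for all other intermediaries
        return num_intermediaries + group_size         # plus entries for the local group

    # TOR switch as intermediary: 48 servers/rack x 10 VMs/server = 480 hosts per group.
    print(entries_per_intermediary(480), "<", TABLE_LIMIT)    # about 2,563: fits easily

    # Virtual switch as intermediary: 10 VMs per server node.
    print(entries_per_intermediary(10), ">", TABLE_LIMIT)     # about 100,010: exceeds the limit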
In the context of the ITRI container computer, there are two possible choices for the intermediary: the TOR switch or the virtual switch inside every server node. When the intermediary is a TOR switch, G is 480, because a TOR switch is connected to 48 server nodes and every server node is assumed to run 10 virtual machines, and the required number of forwarding table entries per switch is about 2,600, which is well below the 32,000-entry forwarding table size limit mentioned above. When the intermediary is a virtual switch, G becomes 10, and the required number of forwarding table entries per switch becomes about 100,000, which is higher than the limit. However, choosing TOR switches as intermediaries requires modifications to these switches, although the modification effort is relatively minor, about 100 lines of code, as most modern switches support the ability to trap certain types of packets and handle them separately. More problematically, there is a serious performance penalty associated with MIM decapsulation, because existing switches are not designed to perform this function in the data plane and therefore have to support it in the control processor, whose packet processing rate can easily be 3 to 4 orders of magnitude slower than the data plane's packet forwarding rate. As more and more commercial switches support the OpenFlow standard [13], which provides the flexibility of customized packet processing in the data plane, this performance penalty could be significantly reduced.

Although two-stage forwarding provides the generality of exchanging packets between any two nodes in a 1,000,000-node network, this generality comes with a potential performance cost. To mitigate this overhead, we propose an optimization called dual-mode forwarding, which allows a source to send packets directly to those destinations with which it communicates frequently, and indirectly to the rest. Because the number of nodes with which a given node, say X, communicates frequently is expected to be small, say 100, even in a 1,000,000-node network, one can allocate forwarding table entries for these nodes on the switches along the paths between X and them to speed up these communications. This is possible because there are many unused entries in the switch forwarding tables after two-stage forwarding is adopted. More generally, Peregrine dynamically measures the traffic volume from every host to every other host, i.e., the traffic matrix, and sorts the resulting measurements into a list in decreasing order. Starting from the head entry of this list, Peregrine allocates a forwarding table entry in every switch on the path from the source to the destination of the entry, and continues down the list until either the entry's traffic volume is too low to be worthwhile or the occupancy ratio of any of the forwarding tables on the path exceeds a certain threshold. Initially, every node is reachable indirectly via its associated intermediary. As more traffic load information becomes available, Peregrine gradually builds up direct routes between those node pairs with heavy communications. Note that for a given node X, some nodes may find it worthwhile to build a direct route to X, while others may choose to use the original indirect route.
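The greedy allocation of direct routes just described can be sketched as follows; the threshold names and data structures are illustrative assumptions, not Peregrine's actual implementation.

    # Sketch: allocate direct routes to the heaviest host pairs until tables fill up.
    def allocate_direct_routes(traffic_matrix, path_of, table_of,
                               table_limit=32_000, occupancy_threshold=0.9,
                               min_volume=1.0):
        """traffic_matrix: {(src, dst): volume}
           path_of(src, dst): list of switches on the chosen path
           table_of: {switch: set of installed destination entries}"""
        direct = set()
        # Heaviest communicating pairs first.
        for (src, dst), volume in sorted(traffic_matrix.items(),
                                         key=lambda kv: kv[1], reverse=True):
            if volume < min_volume:          # remaining pairs stay on indirect routes
                break
            path = path_of(src, dst)
            # Skip the pair if any switch on the path is already too full.
            if any(len(table_of[sw]) >= occupancy_threshold * table_limit for sw in path):
                continue
            for sw in path:
                table_of[sw].add(dst)        # one forwarding entry per switch for dst
            direct.add((src, dst))
        return direct                        # these pairs bypass the intermediary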
4.4 Fast Fail-Over

Peregrine is designed to reduce the fail-over delay of any single switch/port/link failure to under 100 msec. To achieve this aggressive goal, for a given node X, Peregrine pre-computes a primary and a backup route from every other node to X, where the primary route and the backup route are node-disjoint and link-disjoint excluding the two end points, assuming the underlying physical network connectivity provides enough redundancy. Whenever a network link or device fails, the primary routes provisioned on the failed device or link are identified, and the nodes that are using these primary routes are notified to switch to their corresponding backup routes. The fail-over delay of a network device/link failure thus consists of the time to detect the device/link failure, the time to identify the affected primary routes and the nodes currently using these routes, and the time to inform these affected nodes to switch from the primary to the backup routes. Because Peregrine uses conventional Ethernet switches, which forward packets based on their destination address, the only way to forward packets destined to a given node X along different routes is to assign multiple MAC addresses to X, each representing a distinct route to reach X. At start-up time, Peregrine installs the pre-computed primary/backup routes to every host in the switches' forwarding tables. At run time, switching from the primary to the backup route of a given host is simply a matter of using the host's backup MAC address rather than its primary MAC address.

To enable fast fail-over, for a given host X, Peregrine pre-computes two disjoint paths from each of the other hosts to X. One simple way to achieve this is to compute two disjoint spanning trees (primary and backup), each of which is rooted at X and spans all other hosts. Whenever a network failure affects the primary spanning tree of host X, all other hosts are informed to switch to the backup spanning tree of X by using X's backup MAC address to reach X. That is, X is reachable to the rest of the world either through its primary MAC address or through its backup MAC address, but never both. The main advantage of this design is that it greatly simplifies the bookkeeping of the availability state for each host. However, there are two disadvantages. First, any failure that affects (even a small portion of) a given host's primary spanning tree renders the entire spanning tree unusable; the collateral damage of this coarse-grained fail-over strategy may be too severe. For example, a NIC-to-TOR link failure may disable many spanning trees. Second, although the Clos network provides rich connectivity, ensuring that a given node's primary and backup spanning trees are completely disjoint might be difficult and greatly reduces the flexibility of load balancing routing (described below). Therefore, Peregrine adopts a fine-grained fail-over approach called node-pair path (NPP) fail-over, which requires Peregrine to keep track, for each host X, of whether each of its communicating hosts currently uses X's primary or backup MAC address to reach X. This design allows two different hosts to reach X using X's primary and backup MAC addresses simultaneously. With the NPP design, although the primary paths from all other hosts to X form a spanning tree and the backup paths form another spanning tree, these two trees are no longer required to be disjoint; only the primary and backup paths between each node pair still need to be disjoint. By removing the requirement that the primary and backup trees of a given host be disjoint, the flexibility and efficiency of Peregrine's routing algorithm are significantly increased.
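The NPP bookkeeping and the fail-over reaction can be summarized in the following sketch; the data structures (per-pair path records and an active-MAC table) are assumptions used for illustration, not the prototype's actual code.

    # Sketch of node-pair path (NPP) fail-over bookkeeping.
    class NPPState:
        def __init__(self):
            # (src, dst) -> {"primary": [links], "backup": [links]}
            self.paths = {}
            # (src, dst) -> "primary" or "backup": which MAC of dst the src currently uses
            self.active = {}

        def install(self, src, dst, primary_links, backup_links):
            self.paths[(src, dst)] = {"primary": primary_links, "backup": backup_links}
            self.active[(src, dst)] = "primary"

        def handle_link_failure(self, failed_link):
            """Return the (src, dst) pairs that must switch to dst's backup MAC address."""
            affected = []
            for (src, dst), routes in self.paths.items():
                if failed_link in routes[self.active[(src, dst)]]:
                    # flip only this pair; other sources may keep using dst's primary MAC
                    self.active[(src, dst)] = "backup"
                    affected.append((src, dst))
            return affected   # in Peregrine, the DS pushes ARP cache updates to these sources

    # Two sources reaching the same destination over different paths:
    state = NPPState()
    state.install("VM3", "VM6", primary_links=["SW2-SW3"], backup_links=["SW1-SW4"])
    state.install("VM9", "VM6", primary_links=["SW1-SW3"], backup_links=["SW2-SW4"])
    assert state.handle_link_failure("SW2-SW3") == [("VM3", "VM6")]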
4.5 Load Balancing Routing

The traditional Ethernet architecture does not support dynamic routing that could accommodate fluctuating workload patterns; only Layer-3 routers provide such support. By exploiting the capability of populating the forwarding tables on switches, Peregrine supports load-balancing packet routing, which takes into account the following factors. First, the importance of different physical links in a data center network differs, even if the physical network topology is symmetric. For example, one physical link may be more critical than another because it is used by many hosts to access a storage server. Peregrine computes the notion of link criticality [6] and uses it to avoid choosing more critical links early on, so as to eventually achieve network-wide load balance. Second, the number of hops on the route between two hosts is an important quality indicator, because it determines the network latency as well as the amount of injected load.

A simple and effective load-balancing routing scheme is to compute a large number of paths between every source and destination pair <s, d>, and to distribute the traffic from s to d equally among these paths. However, this algorithm is infeasible because it would require a large number of forwarding table entries for each host. Instead, we could run this routing algorithm statically, and use its result to steer the direction of more practical routing algorithms. More concretely, we compute up to N shortest paths for every possible source/destination pair <s, d>, equally distribute the traffic between s and d over these N paths, and then compute the link criticality of a physical link l with respect to <s, d>, denoted as θ_l(s, d), as M/N, where M is the number of paths between s and d that go through the link l. The expected load from s to d on the link l is then θ_l(s, d) × TM(s, d), where TM(s, d) represents the bandwidth demand from s to d, and the total expected load on the link l is thus θ_l = Σ_{(s,d)} θ_l(s, d) × TM(s, d). Finally, we define the cost of the link l as cost(l) = θ_l / R_l, where R_l is the residual capacity of the link l, and we avoid choosing links with higher cost as much as possible when computing routes.

Given a traffic matrix, each of whose entries represents the bandwidth demand from one host to another, Peregrine first sorts its entries in decreasing order and then computes paths for these entries in this order. That is, host pairs with higher bandwidth demands are routed earlier. To compute the primary path for a host pair <s, d>, Peregrine computes K shortest paths from s to d, filters out those paths that cross switches whose forwarding tables are already full, and picks the path whose sum of link costs is minimum. After taking out the links on the primary path from s to d, Peregrine repeats the same process to calculate the backup path. After the primary and backup paths from s to d are computed, the residual capacity of every link on these two paths is reduced by TM(s, d), and the expected load and cost of other links are adjusted accordingly. Whenever a link experiences congestion because of traffic load fluctuations, Peregrine identifies all source-destination pairs whose primary or backup path passes through this link, deducts their measured bandwidth demands from the measured costs of the links on these primary paths, and applies the same routing algorithm to compute a new primary or backup path for each of these source-destination pairs, this time using their measured bandwidth demands.
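A compact sketch of the cost model above is given below: it computes θ_l(s, d) from the candidate shortest paths, accumulates the expected load θ_l, and derives cost(l) = θ_l / R_l. The function is a simplified illustration under the stated definitions, not the route algorithm server's actual code; the candidate paths are assumed to be supplied by a separate k-shortest-path search.

    # Sketch of the link-criticality cost model used to steer route selection.
    from collections import defaultdict
    from itertools import islice

    def link_costs(candidate_paths, traffic_matrix, residual_capacity, N=4):
        """candidate_paths: {(s, d): [path, ...]} where each path is a list of links,
           ordered from shortest to longest (e.g., from a k-shortest-path search).
           traffic_matrix: {(s, d): bandwidth demand TM(s, d)}
           residual_capacity: {link: R_l}"""
        expected_load = defaultdict(float)                 # theta_l
        for (s, d), demand in traffic_matrix.items():
            paths = list(islice(candidate_paths[(s, d)], N))
            per_link_paths = defaultdict(int)              # M: paths through each link
            for path in paths:
                for link in path:
                    per_link_paths[link] += 1
            for link, M in per_link_paths.items():
                criticality = M / len(paths)               # theta_l(s, d) = M / N
                expected_load[link] += criticality * demand
        # cost(l) = theta_l / R_l; higher-cost links are avoided when picking routes.
        return {link: expected_load[link] / residual_capacity[link]
                for link in residual_capacity}

    # Example: two host pairs sharing a storage-facing link make that link more expensive.
    paths = {("A", "S"): [["A-T1", "T1-S"], ["A-T2", "T2-S"]],
             ("B", "S"): [["B-T1", "T1-S"], ["B-T2", "T2-S"]]}
    tm = {("A", "S"): 500.0, ("B", "S"): 500.0}            # Mbit/s demands
    cap = {"A-T1": 1000.0, "A-T2": 1000.0, "B-T1": 1000.0, "B-T2": 1000.0,
           "T1-S": 1000.0, "T2-S": 1000.0}
    print(link_costs(paths, tm, cap))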
Chapter 5

Peregrine Implementation and Performance Evaluation

5.1 Prototype Implementation

The current Peregrine implementation on the ITRI container computer, as shown in Figure 5.1, consists of a kernel agent that performs MIM encapsulation and is installed in the Dom0 of every physical machine running Xen, a central directory server (DS) that performs generalized IP-to-MAC address look-up, and a central route algorithm server (RAS) that constantly collects the traffic matrix, runs the load-balancing routing algorithm based on the traffic matrix, and populates the switches with the resulting routing state.

Figure 5.1: The software architecture of the current Peregrine prototype, which consists of a kernel agent installed on every physical machine, a central directory server (DS) for IP-to-MAC address look-up, and a central route algorithm server (RAS) for route computation and routing state population.

With two-stage dual-mode packet forwarding, there are up to four ways to reach a Peregrine host X:

1. Route directly to X using X's primary MAC address,
2. Route directly to X using X's backup MAC address,
3. Route to X's primary intermediary and then to X using X's primary MAC address, and
4. Route to X's backup intermediary and then to X using X's backup MAC address.

The first two possibilities exist only for those hosts that are directly reachable. Accordingly, there are four MAC addresses associated with each Peregrine host: its primary MAC address, its backup MAC address, its primary intermediary's MAC address, and its backup intermediary's MAC address. Traditionally, translating a host's IP address to its MAC address is done via the ARP protocol, which is incompatible with Peregrine's design because it is based on broadcast queries and unicast responses.
Instead, Peregrine adopts a centralized directory service (DS) architecture, as shown in Figure 5.1, in which every ARP query about an IP address A is transparently intercepted by Peregrine's kernel agent and re-directed to the DS, which responds with the four MAC addresses associated with A and the availability status of the four routes to reach A.

Peregrine does not require modifications to the header structure of Ethernet packets. To perform MIM encapsulation for an outgoing packet, the Peregrine agent puts the primary or backup intermediary's MAC address in the packet's destination address field, and the MAC addresses of the sending and receiving hosts in the packet's source address field. This means that every Peregrine host's MAC address has only 24 bits, rather than 48 bits. In addition, the MAC addresses of all VMs in a Peregrine network are centrally allocated, and every VM is assigned two MAC addresses.

The centralized IP-to-MAC address mapping architecture also enables Peregrine to support private IP address reuse, which allows multiple virtual data centers (VDCs) to run on a single Peregrine network and gives each VDC the same private IP address space (e.g., 10.x.x.x). When a VM in a VDC issues an ARP query about an IP address, Peregrine consults the DS using the IP address and the ID of the VDC, which disambiguates the same IP address simultaneously used by multiple VDCs based on their VDC IDs.
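A minimal sketch of the DS lookup path, keyed by (VDC ID, IP address) as described above, is shown below; the record layout and names are assumptions made for illustration, not the DS's real schema.

    # Sketch: directory-service lookup keyed by (VDC ID, private IP address).
    from typing import NamedTuple, Optional

    class HostRecord(NamedTuple):
        primary_mac: str
        backup_mac: str
        primary_intermediary_mac: str
        backup_intermediary_mac: str
        primary_route_up: bool = True
        backup_route_up: bool = True

    class DirectoryService:
        def __init__(self):
            self._db = {}   # (vdc_id, ip) -> HostRecord

        def register(self, vdc_id: int, ip: str, record: HostRecord):
            self._db[(vdc_id, ip)] = record

        def resolve(self, vdc_id: int, ip: str) -> Optional[HostRecord]:
            # The same 10.x.x.x address can be registered by many VDCs;
            # the VDC ID keeps their entries apart.
            return self._db.get((vdc_id, ip))

    ds = DirectoryService()
    ds.register(1, "10.0.0.5", HostRecord("02:00:00:00:00:01", "02:00:00:00:00:02",
                                          "02:00:00:00:10:01", "02:00:00:00:10:02"))
    ds.register(2, "10.0.0.5", HostRecord("02:00:00:00:00:03", "02:00:00:00:00:04",
                                          "02:00:00:00:10:03", "02:00:00:00:10:04"))
    assert ds.resolve(1, "10.0.0.5").primary_mac != ds.resolve(2, "10.0.0.5").primary_mac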
Figure 5.2: MAC address translation for two-stage dual-mode forwarding.

Figure 5.2 gives an example that illustrates the MAC address look-up for two-stage dual-mode packet forwarding. When VM3 sends out an ARP query about VM6's IP address (step 1), the Peregrine agent installed in the Dom0 of VM3's physical machine (PM1) intercepts this query and submits the resulting query to the directory server (DS) (step 2). The DS looks up its database and sends the four MAC addresses and their availability status associated with VM6 (step 3) back to the Peregrine agent on PM1, which first creates and sends a legitimate ARP reply to VM3 and also caches the reply to answer future ARP queries about VM6's IP address. Once VM3 receives VM6's MAC address, it forms the associated packet and sends it out. In Peregrine, all packets from a DomU VM pass through the Peregrine agent in the Dom0 of the corresponding physical machine. For each packet going by, the Peregrine agent consults the ARP cache with the packet's destination IP address and rewrites the packet's destination MAC address field based on the MAC address look-up result. For example, in the case of Figure 5.2, VM6 can be reached in four ways: (1) Indirect Primary: The Peregrine agent on PM1 performs MIM encapsulation with the MAC address of VM6's primary intermediary and VM6's primary MAC address, and sends the packet out (step 4). When the packet arrives at VM6's primary intermediary, i.e., SW3, the switch decapsulates the MIM packet and forwards the resulting packet to VM6 (step 5). (2) Indirect Backup: Everything works in the same way as in the Indirect Primary case, except that it is VM6's backup intermediary, SW4, that is used for packet relaying. (3) Direct Primary: The destination MAC address of the outgoing packets is VM6's primary MAC address. (4) Direct Backup: The destination MAC address of the outgoing packets is VM6's backup MAC address.
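The encapsulation step can be sketched as below: the 6-byte destination field carries the chosen intermediary's MAC address, while the 6-byte source field packs the 24-bit identifiers of the sending and receiving hosts, as described earlier in this section. The byte ordering of the two packed identifiers and the helper names are illustrative assumptions, not the kernel agent's actual layout.

    # Sketch of Peregrine's MIM encapsulation: the destination field carries the
    # intermediary's 48-bit MAC; the source field packs two 24-bit host identifiers.

    def mim_encapsulate(intermediary_mac: bytes, src_host_id: int, dst_host_id: int,
                        payload: bytes) -> bytes:
        assert len(intermediary_mac) == 6
        assert src_host_id < 2**24 and dst_host_id < 2**24   # 24-bit host MACs
        src_field = src_host_id.to_bytes(3, "big") + dst_host_id.to_bytes(3, "big")
        return intermediary_mac + src_field + payload        # DA | SA | rest of frame

    def mim_decapsulate(frame: bytes, host_mac_of) -> bytes:
        """What an intermediary does: rewrite the DA with the real destination host MAC."""
        src_field = frame[6:12]
        dst_host_id = int.from_bytes(src_field[3:6], "big")
        new_da = host_mac_of(dst_host_id)                    # full 48-bit MAC of the host
        return new_da + src_field + frame[12:]

    # Example with made-up identifiers:
    frame = mim_encapsulate(bytes.fromhex("020000000a01"), 0x000003, 0x000006, b"data")
    out = mim_decapsulate(frame, lambda hid: hid.to_bytes(3, "big") + b"\x00\x00\x00")
    assert out[:3] == b"\x00\x00\x06"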
Figure 5.3: Switching from the direct primary route to the direct backup route upon a link failure.

Figure 5.3 illustrates how Peregrine's fast fail-over mechanism works. Initially, VM6's primary and backup MAC addresses, mac1 and mac2, are pre-populated by the RAS on the switches along the two disjoint routes (step 1). The primary route to VM6 goes through SW2 and SW3, while the backup route goes through SW1 and SW4. Whenever a link along the primary path from VM3 to VM6 goes down, an SNMP trap is sent from the link's adjacent switch to the RAS (step 2), which determines the source-destination pairs that are affected by the link failure and passes this information to the DS (step 3). The DS then informs the source hosts that their associated destination hosts are reachable only via their backup MAC addresses; in this case, it sends an ARP entry update to PM1 (step 4) indicating that packets sent from VM3 to VM6 should use mac2, VM6's backup MAC address, as the destination MAC address. After that, all packets destined to VM6 from VM3 go through the backup path (step 5).

Upon a link/switch failure, the DS only needs to update those physical machines that currently cache ARP entries invalidated by the failure, because it keeps track of which physical machines cache which ARP entries. The DS performs these ARP cache updates using unicast. For a given VM, the number of physical machines caching its ARP entry is expected to be relatively small. Therefore, the DS allocates enough space to record at most M caching machines for a given ARP entry, where M is tentatively set to 50. For a very popular VM that communicates with a large number of physical machines, a special flag is set in its ARP entry, and any modification to its ARP entry triggers an ARP update to every physical machine.

5.2 Network Initialization

To run the Peregrine architecture on commodity Ethernet switches, these switches are required to block broadcast packets, disable unicast packet flooding, and turn off the IEEE 802.1D spanning tree protocol (STP). When the ITRI container computer starts up, the switches on its network are first put in the standard STP mode, and then configured to satisfy Peregrine's requirements. However, turning off STP on the switches one by one could easily lead to packet looping and thus broadcast storms. To solve this problem, Peregrine uses the following algorithm (a code sketch of this procedure follows below) to convert the network from the standard STP mode to the Peregrine mode:

1. Statically configure all the switches in such a way that the switch directly attached to the Peregrine server called the RAS, which is also responsible for route computation, is the root of the initial spanning tree when the container computer network starts up.

2. Construct a list of switches by doing a breadth-first search of all the switches in the initial spanning tree, and reverse the list.

3. Starting from the RAS, visit each switch in the reversed list, turning off broadcast packet forwarding, unicast packet flooding, and STP on it, and populating its static forwarding table with the results of the load balancing routing algorithm.

Essentially, this algorithm starts the switch reconfiguration process from the leaves of the initial spanning tree, which is rooted at the switch attached to the RAS, and ensures that all the switches that are not yet reconfigured remain reachable through the initial spanning tree.
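The sketch below captures the reverse-BFS reconfiguration order; the switch-configuration calls are placeholders for the real CLI/SNMP operations and are not taken from the prototype.

    # Sketch: convert switches from STP mode to Peregrine mode in reverse-BFS order,
    # so every not-yet-reconfigured switch stays reachable over the initial spanning tree.
    from collections import deque

    def bfs_order(tree, root):
        """tree: {switch: [child switches]} of the initial spanning tree rooted at the RAS switch."""
        order, queue = [], deque([root])
        while queue:
            sw = queue.popleft()
            order.append(sw)
            queue.extend(tree.get(sw, []))
        return order

    def convert_to_peregrine_mode(tree, root, configure_switch, install_routes):
        # Leaves first: reverse of the BFS order computed from the RAS-attached root.
        for sw in reversed(bfs_order(tree, root)):
            configure_switch(sw, broadcast=False, unicast_flooding=False, stp=False)
            install_routes(sw)     # static forwarding entries from the routing algorithm

    # Example with a toy two-level tree; the two callbacks are illustrative stubs.
    tree = {"RAS-TOR": ["REG1", "REG2"], "REG1": ["TOR1"], "REG2": ["TOR2"]}
    visited = []
    convert_to_peregrine_mode(tree, "RAS-TOR",
                              lambda sw, **flags: visited.append(sw),
                              lambda sw: None)
    assert visited[-1] == "RAS-TOR"    # the RAS-attached root switch is reconfigured last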
We used two racks of the ITRI container computer as the evaluation testbed for the Peregrine prototype. The testbed consists of four 48-port TOR switches with four 10GE uplinks each, two 48-port 10GE regional switches, and 48 physical machines. Each physical machine is equipped with eight 2.53 GHz Intel Xeon CPU cores, 40 GB of DRAM, and four GE NICs, and is installed with CentOS 5.5 and the Linux kernel it ships with. Two physical machines are used to deploy the RAS and the DS. The MIM kernel agent is installed on all other physical machines. Each physical machine is connected to four TOR switches via a separate 1GE NIC, and each TOR switch in turn is connected to four regional switches via a separate 10GE link. No firmware modifications are required on these regional or TOR switches.

5.3 Effectiveness of Load Balancing Routing

We used a simulation approach to evaluate the effectiveness of Peregrine's load-balancing routing algorithm. The test network being simulated spans 52 physical machines with 384 links. Each physical machine is connected to four TOR switches via a separate 1GE NIC, and each TOR switch in turn is connected to four regional switches via a separate 10GE link. To derive realistic input network traffic loads, we started with the packet traces collected from the Lawrence Berkeley National Lab campus network [18]. Each packet trace spans a period of 300 to 1800 seconds from different subnets, with a total of around 9000 end hosts. We assumed that each packet trace represents a VM-to-VM traffic matrix in a virtual data center, and that the VMs are assigned to PMs in a random fashion. Because the ITRI container computer is designed to support multiple virtual data centers running concurrently on it, we created multiple multi-VDC traffic matrices, each of which is constructed by randomly combining five VM-to-VM traffic matrices into one traffic matrix. In total, 17 multi-VDC 300-second traces were created and replayed on the simulated network.

Given a multi-VDC packet trace, we used the first half of the trace to derive its traffic matrix and compute routes for the communicating physical machines, and replayed the second half of the trace on the simulated network using the resulting routes. The metric used to measure the effectiveness of the routing algorithms is the congestion count, N_c, during the trace replay period.
Figure 5.4: Congestion count (left Y axis) and additional traffic load (right Y axis) comparison between the full link criticality-based routing algorithm and the random shortest-path routing algorithm, using multiple multi-VDC packet traces as inputs.

Figure 5.5: Congestion count ratio between full link criticality-based routing and random shortest-path routing under different degrees of skewedness in the input traffic matrix.
For every second of the input trace, we placed the load of every communicating host pair during that second on the links along the pair's route in the simulated network. During this replay process, whenever a host pair's load is placed on a link whose capacity (Mbits/sec) is already exceeded, the congestion count is incremented by one.

Figure 5.4 compares the congestion counts of the full link criticality-based routing (FLCR) algorithm, which is load-aware, and the random shortest-path routing (RSPR) algorithm, which is load-insensitive, using as inputs the 17 multi-VDC packet traces described above. These two algorithms represent the two extremes of Peregrine's routing algorithm (Section 4.5): FLCR corresponds to setting Z to 100, whereas RSPR corresponds to setting Z to 0, where Z is the percentage of the total traffic volume whose host pairs are routed with link criticality-based routing. As expected, FLCR out-performs RSPR in all 17 traces, because the former strives to avoid congested links through the guidance of link criticality and expected link load. In contrast, RSPR relies only on randomization to avoid congestion and is thus less effective. The price that FLCR pays for avoiding congestion is that the paths it produces tend to be longer and have a larger hop count than those produced by RSPR. As a result, the total traffic load injected by FLCR tends to be higher than that injected by RSPR. Fortunately, the percentage of additional traffic load due to longer paths is insignificant, around 0.5%.

To explain why the effectiveness difference between FLCR and RSPR varies with the input traces, we measured the concentration percentage of each input trace, which is the percentage of the top heavy-traffic host pairs that account for 90% of the total traffic volume in the input trace, and correlated this percentage with the routing effectiveness difference, as represented by the ratio of the congestion counts (N_c) of FLCR and RSPR, for all 17 input traces. As shown by the solid curve in Figure 5.5, when an input trace has a lower concentration percentage, the congestion count ratio tends to be lower, indicating that the routing effectiveness gap between FLCR and RSPR is greater. This is because a lower concentration percentage means a higher degree of skewedness in the input workload, and the advantage of FLCR over RSPR is more pronounced when the input load is more skewed.

The complexity of full link criticality-based routing is O(L × P), where L is the number of physical network links and P is the number of PM pairs. From the multi-VDC traces, we found that most of the entries in their traffic matrices are insignificantly small; e.g., the traffic loads of fewer than 5% of the host pairs account for more than 90% of the total traffic volume, as shown in Figure 5.5. The solid curve corresponds to FLCR (Z=100), whereas the dotted curve corresponds to the case of applying link criticality-based routing only to the top heavy-traffic host pairs that are responsible for 90% of the total traffic volume, i.e., Z=90. The difference between these two curves is very small, indicating that the two configurations have similar routing effectiveness, although the Z=90 case requires much less route computation time than the Z=100 case. More concretely, the number of host pairs in a 500-server network to which link criticality-based routing is applied is reduced from 250K when Z=100 to 12.5K when Z=90. In our current implementation, the route computation time for 12.5K host pairs is about 10 minutes.
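The congestion-count metric used in this replay can be expressed compactly; the sketch below assumes per-second load samples and per-link capacities in Mbit/s, mirroring the replay procedure described at the start of this section.

    # Sketch: congestion count N_c for a trace replay.
    def congestion_count(per_second_loads, route_of, capacity):
        """per_second_loads: list (one entry per second) of {(src, dst): Mbit/s}
           route_of: {(src, dst): [links on the pair's route]}
           capacity: {link: Mbit/s}"""
        n_c = 0
        for second in per_second_loads:
            used = {link: 0.0 for link in capacity}
            for pair, load in second.items():
                for link in route_of[pair]:
                    # Placing load on an already-exceeded link counts as one congestion event.
                    if used[link] > capacity[link]:
                        n_c += 1
                    used[link] += load
            # link usage is reset at every second of the replay
        return n_c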
5.4 Packet Forwarding Performance

One concern about the Peregrine architecture is the DS throughput required to handle ARP requests from up to a million hosts. One study based on a traffic collection from 2456 hosts [15] showed that on average 89 ARP queries are issued per second. A simple extrapolation suggests that around 36K ARP queries per second are expected in a data center consisting of one million hosts. This requirement is well below the measured performance of the current DS implementation, 100K ARP queries per second. Assume that a physical machine caches ARP entries for one minute and hosts 20 VMs, and that each VM communicates continuously with 100 other distinct VMs. Under this assumption, every PM generates about 2,000 ARP requests every minute, which corresponds to roughly 30 ARP queries per second being directed to the DS. Since the DS is able to handle 100K ARP queries per second, we expect the current design to support up to 3.3K physical machines.

Encapsulation and Decapsulation Overhead: Another concern about the Peregrine architecture is the overhead of packet encapsulation and decapsulation. We took two physical machines, installed the MIM agent on them, ran four virtual machines on each of them, and established four TCP connections between these four pairs of VMs. We measured the throughputs of these four TCP connections with and without MIM encapsulation and decapsulation. The measured throughputs of these TCP connections when MIM is turned on are only about 0.5% lower than those without MIM. We further measured the latency of packet encapsulation and decapsulation inside the kernel. Under the same configuration, packet decapsulation took less than 1 usec (99th percentile), and packet encapsulation took 4 usec (99th percentile), because encapsulation requires looking up the ARP cache. For comparison, we also implemented the MIM decapsulation engine in a commodity Ethernet switch, and the packet decapsulation throughput is disappointingly low, around 100 packets/sec, because MIM decapsulation takes place in the control processor. This is why we chose to perform packet decapsulation on the physical machines rather than on the switches.
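The two ARP-load estimates quoted at the beginning of this section follow from simple arithmetic, reproduced below as a sanity check; all inputs are the figures given in the text.

    # Sanity check of the ARP-load estimates in Section 5.4.

    # Extrapolation from the measured trace: 89 ARP queries/s across 2,456 hosts.
    per_host_rate = 89 / 2456
    print(round(per_host_rate * 1_000_000))        # ~36,000 queries/s for one million hosts

    # Per-PM load under the stated assumptions: 20 VMs per PM, each talking to 100
    # distinct VMs, with ARP entries cached for one minute.
    arp_per_pm_per_minute = 20 * 100               # 2,000 cache refreshes per minute
    arp_per_pm_per_second = arp_per_pm_per_minute / 60
    print(round(arp_per_pm_per_second))            # ~33 queries/s per PM

    # With a DS that sustains 100K queries/s, the supportable number of PMs is:
    print(int(100_000 // arp_per_pm_per_second))   # ~3,000 PMs; the text's 3.3K figure comes
                                                   # from rounding the per-PM rate down to 30/s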
5.5 Fail-over Latency

To measure the fail-over latency, we measured the service disruption time of a UDP connection running between two physical machines of the evaluation testbed when one of its underlying links fails. The sender of this UDP connection sends one packet per millisecond to the receiver across a TOR switch. We then counted the number of packets lost due to a link failure and further broke down the total service disruption time into the following four components: (1) A neighboring switch of the failed link detects the link failure through polling of its local interfaces and sends out an SNMP trap to the RAS. (2) The RAS processes the link failure event to identify the affected destination hosts and passes them to the DS. (3) The DS updates its ARP database for these affected hosts and sends ARP cache updates about them to those physical machines that communicate with these hosts. (4) The MIM agent on a physical machine updates its ARP cache upon receiving such an ARP cache update message.

Step | 1. Switch | 2. RAS  | 3. DS  | 4. PM  | Total
Time | 20-80 ms  | < 20 ms | < 5 ms | < 2 ms | < 77 ms

Table 5.1: The breakdown of the fail-over latency of a single link failure in the evaluation testbed.

Table 5.1 shows the average time spent in each step over twenty failure runs in which the number of affected host pairs is fewer than 10. The upshot is that the average fail-over latency of the Peregrine prototype is around 77 msec. The only time-varying step in the fail-over latency is step (3), in which the DS sends out ARP cache updates to all physical machines caching the MAC addresses of the hosts affected by the link failure.

Figure 5.6: ARP cache update latency increases linearly with the number of physical machines whose ARP cache needs to be updated.

Figure 5.6 shows that the time taken by the DS to send out ARP cache updates increases linearly with the number of physical machines caching the MAC addresses of the affected hosts, because the DS needs to send these updates out in sequence. Even when the number of physical machines whose ARP cache needs to be updated is 1000, the total fail-over latency is increased by only an additional 300 msec.

5.6 Conclusion

Recognizing that the internal fabric of a container computer does not need to be compatible with other legacy IT infrastructures, the designers of the ITRI container computer devised an innovative data center network architecture called Peregrine, which employs commodity Ethernet switches only as dumb packet forwarding engines but removes most of the control plane functionalities of the
traditional Ethernet architecture, such as spanning tree, source learning, flooding, and broadcast-based ARP queries, and centralizes the address look-up, routing, and fast fail-over intelligence on dedicated servers. We have completed a fully operational Peregrine prototype, presented its design and implementation in this paper, and demonstrated the effectiveness and efficiency of the Peregrine architecture using simulation and measurement results. We are currently working on improvements to the Peregrine prototype, including stress-testing the prototype's robustness and scalability on a fully populated container computer and on a multi-container computer set-up, extending the DS to a distributed cluster implementation to enhance its scalability and availability, and porting Peregrine (particularly its packet decapsulation logic) to switches supporting the OpenFlow standard to further increase the number of end points that a single Peregrine network can span.
Bibliography

[1] The Juniper Networks QFabric architecture: A revolution in data center network design: Flattening the data center architecture.

[2] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. In Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication. ACM.

[3] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic flow scheduling for data center networks. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation. USENIX Association.

[4] P. Bottorff and S. Haddock. IEEE 802.1ah: Provider Backbone Bridges.

[5] C. Clos. A study of non-blocking switching networks. Bell System Technical Journal, 32(2).

[6] K. Gopalan, T. Chiueh, and Y. Lin. Network-wide load balancing routing with performance guarantees. In Communications, ICC '06, IEEE International Conference on, volume 2. IEEE.

[7] A. Greenberg, J. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta. VL2: A scalable and flexible data center network. ACM SIGCOMM Computer Communication Review, 39(4):51-62.

[8] A. Greenberg, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta. Towards a next generation data center architecture: scalability and commoditization. In Proceedings of the ACM Workshop on Programmable Routers for Extensible Services of Tomorrow. ACM.

[9] C. Hopps. Analysis of an equal-cost multi-path algorithm.

[10] A. Inc. Compare and contrast SPB and TRILL.

[11] C. Inc. Scaling data centers with FabricPath and the Cisco FabricPath switching system.

[12] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken. The nature of data center traffic: measurements & analysis. In Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement. ACM.

[13] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review, 38(2):69-74.

[14] A. Myers, E. Ng, and H. Zhang. Rethinking the service model: Scaling Ethernet to a million nodes. In Proc. HotNets.

[15] A. Myers, E. Ng, and H. Zhang. Rethinking the service model: Scaling Ethernet to a million nodes. In Proc. HotNets.

[16] R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat. PortLand: a scalable fault-tolerant layer 2 data center network fabric. ACM SIGCOMM Computer Communication Review, 39(4):39-50.

[17] P. Oppenheimer. Top-Down Network Design. Cisco Press.

[18] R. Pang, M. Allman, M. Bennett, J. Lee, V. Paxson, and B. Tierney. A first look at modern enterprise traffic. In Proceedings of the 5th ACM SIGCOMM Conference on Internet Measurement. USENIX Association.

[19] C. Perkins, S. Alpert, and B. Woolf. Mobile IP: Design Principles and Practices. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.

[20] R. Perlman. Rbridges: transparent routing. In INFOCOM 2004, Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies, volume 2. IEEE.

[21] D. Plummer. An Ethernet address resolution protocol. Technical report, RFC 826.

[22] S. Sharma, K. Gopalan, S. Nanda, and T. Chiueh. Viking: A multi-spanning-tree Ethernet architecture for metropolitan area and cluster networks. In INFOCOM 2004, Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies, volume 4. IEEE.

[23] C. Sturdevant. Cisco debuts FabricPath. eWeek, 27(14):34.

[24] D. Thaler and C. Hopps. Multipath issues in unicast and multicast next-hop selection.

[25] J. Touch and R. Perlman. Transparent interconnection of lots of links (TRILL): Problem and applicability statement.

[26] A. Vahdat, M. Al-Fares, N. Farrington, R. Mysore, G. Porter, and S. Radhakrishnan. Scale-out networking in the data center. IEEE Micro, 30(4):29-41.
EVOLVING ENTERPRISE NETWORKS WITH SPB-M APPLICATION NOTE
EVOLVING ENTERPRISE NETWORKS WITH SPB-M APPLICATION NOTE EXECUTIVE SUMMARY Enterprise network managers are being forced to do more with less. Their networks are growing in size and complexity. They need
Brocade Solution for EMC VSPEX Server Virtualization
Reference Architecture Brocade Solution Blueprint Brocade Solution for EMC VSPEX Server Virtualization Microsoft Hyper-V for 50 & 100 Virtual Machines Enabled by Microsoft Hyper-V, Brocade ICX series switch,
Shortest Path Bridging IEEE 802.1aq Overview
Shortest Path Bridging IEEE 802.1aq Overview Don Fedyk IEEE Editor 802.1aq Alcatel-Lucent IPD Product Manager Monday, 12 July 2010 Abstract 802.1aq Shortest Path Bridging is being standardized by the IEEE
Optimizing Data Center Networks for Cloud Computing
PRAMAK 1 Optimizing Data Center Networks for Cloud Computing Data Center networks have evolved over time as the nature of computing changed. They evolved to handle the computing models based on main-frames,
Migrate from Cisco Catalyst 6500 Series Switches to Cisco Nexus 9000 Series Switches
Migration Guide Migrate from Cisco Catalyst 6500 Series Switches to Cisco Nexus 9000 Series Switches Migration Guide November 2013 2013 Cisco and/or its affiliates. All rights reserved. This document is
VXLAN Overlay Networks: Enabling Network Scalability for a Cloud Infrastructure
W h i t e p a p e r VXLAN Overlay Networks: Enabling Network Scalability for a Cloud Infrastructure Table of Contents Executive Summary.... 3 Cloud Computing Growth.... 3 Cloud Computing Infrastructure
Intel Ethernet Switch Load Balancing System Design Using Advanced Features in Intel Ethernet Switch Family
Intel Ethernet Switch Load Balancing System Design Using Advanced Features in Intel Ethernet Switch Family White Paper June, 2008 Legal INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL
Simplify Your Data Center Network to Improve Performance and Decrease Costs
Simplify Your Data Center Network to Improve Performance and Decrease Costs Summary Traditional data center networks are struggling to keep up with new computing requirements. Network architects should
Data Center Fabrics What Really Matters. Ivan Pepelnjak ([email protected]) NIL Data Communications
Data Center Fabrics What Really Matters Ivan Pepelnjak ([email protected]) NIL Data Communications Who is Ivan Pepelnjak (@ioshints) Networking engineer since 1985 Technical director, later Chief Technology
Analysis of Network Segmentation Techniques in Cloud Data Centers
64 Int'l Conf. Grid & Cloud Computing and Applications GCA'15 Analysis of Network Segmentation Techniques in Cloud Data Centers Ramaswamy Chandramouli Computer Security Division, Information Technology
Objectives. The Role of Redundancy in a Switched Network. Layer 2 Loops. Broadcast Storms. More problems with Layer 2 loops
ITE I Chapter 6 2006 Cisco Systems, Inc. All rights reserved. Cisco Public 1 Objectives Implement Spanning Tree Protocols LAN Switching and Wireless Chapter 5 Explain the role of redundancy in a converged
WHITE PAPER. Network Virtualization: A Data Plane Perspective
WHITE PAPER Network Virtualization: A Data Plane Perspective David Melman Uri Safrai Switching Architecture Marvell May 2015 Abstract Virtualization is the leading technology to provide agile and scalable
Data Center Networking Designing Today s Data Center
Data Center Networking Designing Today s Data Center There is nothing more important than our customers. Data Center Networking Designing Today s Data Center Executive Summary Demand for application availability
Juniper Networks QFabric: Scaling for the Modern Data Center
Juniper Networks QFabric: Scaling for the Modern Data Center Executive Summary The modern data center has undergone a series of changes that have significantly impacted business operations. Applications
Pre$SDN era: network trends in data centre networking
Pre$SDN era: network trends in data centre networking Zaheer Chothia 27.02.2015 Software Defined Networking: The Data Centre Perspective Outline Challenges and New Requirements History of Programmable
Juniper / Cisco Interoperability Tests. August 2014
Juniper / Cisco Interoperability Tests August 2014 Executive Summary Juniper Networks commissioned Network Test to assess interoperability, with an emphasis on data center connectivity, between Juniper
Chapter 1 Reading Organizer
Chapter 1 Reading Organizer After completion of this chapter, you should be able to: Describe convergence of data, voice and video in the context of switched networks Describe a switched network in a small
Ethernet-based Software Defined Network (SDN)
Ethernet-based Software Defined Network (SDN) Tzi-cker Chiueh Cloud Computing Research Center for Mobile Applications (CCMA), ITRI 雲 端 運 算 行 動 應 用 研 究 中 心 1 Cloud Data Center Architecture Physical Server
Peregrine: An All-Layer-2 Container Computer Network
Peregrine: An All-Layer-2 Container Computer Network Tzi-cker Chiueh Cloud Computing Research Center for Mobile Applications (CCMA) 雲 端 運 算 行 動 應 用 研 究 中 心 ICPADS 2011 1 1 Copyright 2008 ITRI 工 業 技 術 研
Chapter 3. Enterprise Campus Network Design
Chapter 3 Enterprise Campus Network Design 1 Overview The network foundation hosting these technologies for an emerging enterprise should be efficient, highly available, scalable, and manageable. This
VMware Virtual SAN 6.2 Network Design Guide
VMware Virtual SAN 6.2 Network Design Guide TECHNICAL WHITE PAPER APRIL 2016 Contents Intended Audience... 2 Overview... 2 Virtual SAN Network... 2 Physical network infrastructure... 3 Data center network...
Simplifying the Data Center Network to Reduce Complexity and Improve Performance
SOLUTION BRIEF Juniper Networks 3-2-1 Data Center Network Simplifying the Data Center Network to Reduce Complexity and Improve Performance Challenge Escalating traffic levels, increasing numbers of applications,
CHAPTER 10 LAN REDUNDANCY. Scaling Networks
CHAPTER 10 LAN REDUNDANCY Scaling Networks CHAPTER 10 10.0 Introduction 10.1 Spanning Tree Concepts 10.2 Varieties of Spanning Tree Protocols 10.3 Spanning Tree Configuration 10.4 First-Hop Redundancy
- Hubs vs. Switches vs. Routers -
1 Layered Communication - Hubs vs. Switches vs. Routers - Network communication models are generally organized into layers. The OSI model specifically consists of seven layers, with each layer representing
Lecture 7: Data Center Networks"
Lecture 7: Data Center Networks" CSE 222A: Computer Communication Networks Alex C. Snoeren Thanks: Nick Feamster Lecture 7 Overview" Project discussion Data Centers overview Fat Tree paper discussion CSE
Enterasys Data Center Fabric
TECHNOLOGY STRATEGY BRIEF Enterasys Data Center Fabric There is nothing more important than our customers. Enterasys Data Center Fabric Executive Summary Demand for application availability has changed
Brocade Data Center Fabric Architectures
WHITE PAPER Brocade Data Center Fabric Architectures Building the foundation for a cloud-optimized data center. TABLE OF CONTENTS Evolution of Data Center Architectures... 1 Data Center Networks: Building
How To Make A Vpc More Secure With A Cloud Network Overlay (Network) On A Vlan) On An Openstack Vlan On A Server On A Network On A 2D (Vlan) (Vpn) On Your Vlan
Centec s SDN Switch Built from the Ground Up to Deliver an Optimal Virtual Private Cloud Table of Contents Virtualization Fueling New Possibilities Virtual Private Cloud Offerings... 2 Current Approaches
全 新 企 業 網 路 儲 存 應 用 THE STORAGE NETWORK MATTERS FOR EMC IP STORAGE PLATFORMS
全 新 企 業 網 路 儲 存 應 用 THE STORAGE NETWORK MATTERS FOR EMC IP STORAGE PLATFORMS Enterprise External Storage Array Capacity Growth IDC s Storage Capacity Forecast = ~40% CAGR (2014/2017) Keep Driving Growth!
Extending Networking to Fit the Cloud
VXLAN Extending Networking to Fit the Cloud Kamau WangŨ H Ũ Kamau Wangũhgũ is a Consulting Architect at VMware and a member of the Global Technical Service, Center of Excellence group. Kamau s focus at
Data Center Network Topologies: FatTree
Data Center Network Topologies: FatTree Hakim Weatherspoon Assistant Professor, Dept of Computer Science CS 5413: High Performance Systems and Networking September 22, 2014 Slides used and adapted judiciously
Top-Down Network Design
Top-Down Network Design Chapter Five Designing a Network Topology Copyright 2010 Cisco Press & Priscilla Oppenheimer Topology A map of an internetwork that indicates network segments, interconnection points,
Flattening the Data Center Architecture
WHITE PAPER The Juniper Networks QFabric Architecture: A Revolution in Data Center Network Design Flattening the Data Center Architecture Copyright 2011, Juniper Networks, Inc. 1 Table of Contents Executive
Non-blocking Switching in the Cloud Computing Era
Non-blocking Switching in the Cloud Computing Era Contents 1 Foreword... 3 2 Networks Must Go With the Flow in the Cloud Computing Era... 3 3 Fat-tree Architecture Achieves a Non-blocking Data Center Network...
Simplifying Virtual Infrastructures: Ethernet Fabrics & IP Storage
Simplifying Virtual Infrastructures: Ethernet Fabrics & IP Storage David Schmeichel Global Solutions Architect May 2 nd, 2013 Legal Disclaimer All or some of the products detailed in this presentation
Introducing Brocade VCS Technology
WHITE PAPER www.brocade.com Data Center Introducing Brocade VCS Technology Brocade VCS technology is designed to revolutionize the way data center networks are architected and how they function. Not that
Testing Network Virtualization For Data Center and Cloud VERYX TECHNOLOGIES
Testing Network Virtualization For Data Center and Cloud VERYX TECHNOLOGIES Table of Contents Introduction... 1 Network Virtualization Overview... 1 Network Virtualization Key Requirements to be validated...
SummitStack in the Data Center
SummitStack in the Data Center Abstract: This white paper describes the challenges in the virtualized server environment and the solution that Extreme Networks offers a highly virtualized, centrally manageable
Outline. VL2: A Scalable and Flexible Data Center Network. Problem. Introduction 11/26/2012
VL2: A Scalable and Flexible Data Center Network 15744: Computer Networks, Fall 2012 Presented by Naveen Chekuri Outline Introduction Solution Approach Design Decisions Addressing and Routing Evaluation
Multi-site Datacenter Network Infrastructures
Multi-site Datacenter Network Infrastructures Petr Grygárek rek 1 Why Multisite Datacenters? Resiliency against large-scale site failures (geodiversity) Disaster recovery Easier handling of planned outages
Data Center Switch Fabric Competitive Analysis
Introduction Data Center Switch Fabric Competitive Analysis This paper analyzes Infinetics data center network architecture in the context of the best solutions available today from leading vendors such
Brocade Data Center Fabric Architectures
WHITE PAPER Brocade Data Center Fabric Architectures Building the foundation for a cloud-optimized data center TABLE OF CONTENTS Evolution of Data Center Architectures... 1 Data Center Networks: Building
Multitenancy Options in Brocade VCS Fabrics
WHITE PAPER DATA CENTER Multitenancy Options in Brocade VCS Fabrics As cloud environments reach mainstream adoption, achieving scalable network segmentation takes on new urgency to support multitenancy.
WHITE PAPER. Copyright 2011, Juniper Networks, Inc. 1
WHITE PAPER Network Simplification with Juniper Networks Technology Copyright 2011, Juniper Networks, Inc. 1 WHITE PAPER - Network Simplification with Juniper Networks Technology Table of Contents Executive
ConnectX -3 Pro: Solving the NVGRE Performance Challenge
WHITE PAPER October 2013 ConnectX -3 Pro: Solving the NVGRE Performance Challenge Objective...1 Background: The Need for Virtualized Overlay Networks...1 NVGRE Technology...2 NVGRE s Hidden Challenge...3
Scaling 10Gb/s Clustering at Wire-Speed
Scaling 10Gb/s Clustering at Wire-Speed InfiniBand offers cost-effective wire-speed scaling with deterministic performance Mellanox Technologies Inc. 2900 Stender Way, Santa Clara, CA 95054 Tel: 408-970-3400
The Impact of Virtualization on Cloud Networking Arista Networks Whitepaper
Virtualization takes IT by storm The Impact of Virtualization on Cloud Networking The adoption of virtualization in data centers creates the need for a new class of networking designed to support elastic
APPLICATION NOTE 210 PROVIDER BACKBONE BRIDGE WITH TRAFFIC ENGINEERING: A CARRIER ETHERNET TECHNOLOGY OVERVIEW
PROVIDER BACKBONE BRIDGE WITH TRAFFIC ENGINEERING: A CARRIER ETHERNET TECHNOLOGY OVERVIEW By Thierno Diallo, Product Specialist Originally designed as a local-area network (LAN) communication protocol,
How To Understand and Configure Your Network for IntraVUE
How To Understand and Configure Your Network for IntraVUE Summary This document attempts to standardize the methods used to configure Intrauve in situations where there is little or no understanding of
DATA CENTER. Best Practices for High Availability Deployment for the Brocade ADX Switch
DATA CENTER Best Practices for High Availability Deployment for the Brocade ADX Switch CONTENTS Contents... 2 Executive Summary... 3 Introduction... 3 Brocade ADX HA Overview... 3 Hot-Standby HA... 4 Active-Standby
Extreme Networks: Building Cloud-Scale Networks Using Open Fabric Architectures A SOLUTION WHITE PAPER
Extreme Networks: Building Cloud-Scale Networks Using Open Fabric Architectures A SOLUTION WHITE PAPER WHITE PAPER Building Cloud- Scale Networks Abstract TABLE OF CONTENTS Introduction 2 Open Fabric-Based
hp ProLiant network adapter teaming
hp networking june 2003 hp ProLiant network adapter teaming technical white paper table of contents introduction 2 executive summary 2 overview of network addressing 2 layer 2 vs. layer 3 addressing 2
Fibre Channel over Ethernet in the Data Center: An Introduction
Fibre Channel over Ethernet in the Data Center: An Introduction Introduction Fibre Channel over Ethernet (FCoE) is a newly proposed standard that is being developed by INCITS T11. The FCoE protocol specification
Avoiding Network Polarization and Increasing Visibility in Cloud Networks Using Broadcom Smart- Hash Technology
Avoiding Network Polarization and Increasing Visibility in Cloud Networks Using Broadcom Smart- Hash Technology Sujal Das Product Marketing Director Network Switching Karthik Mandakolathur Sr Product Line
Networking in the Era of Virtualization
SOLUTIONS WHITEPAPER Networking in the Era of Virtualization Compute virtualization has changed IT s expectations regarding the efficiency, cost, and provisioning speeds of new applications and services.
20. Switched Local Area Networks
20. Switched Local Area Networks n Addressing in LANs (ARP) n Spanning tree algorithm n Forwarding in switched Ethernet LANs n Virtual LANs n Layer 3 switching n Datacenter networks John DeHart Based on
CLOUD NETWORKING FOR ENTERPRISE CAMPUS APPLICATION NOTE
CLOUD NETWORKING FOR ENTERPRISE CAMPUS APPLICATION NOTE EXECUTIVE SUMMARY This application note proposes Virtual Extensible LAN (VXLAN) as a solution technology to deliver departmental segmentation, business
Expert Reference Series of White Papers. Planning for the Redeployment of Technical Personnel in the Modern Data Center
Expert Reference Series of White Papers Planning for the Redeployment of Technical Personnel in the Modern Data Center [email protected] www.globalknowledge.net Planning for the Redeployment of
Broadcom Smart-NV Technology for Cloud-Scale Network Virtualization. Sujal Das Product Marketing Director Network Switching
Broadcom Smart-NV Technology for Cloud-Scale Network Virtualization Sujal Das Product Marketing Director Network Switching April 2012 Introduction Private and public cloud applications, usage models, and
Increase Simplicity and Improve Reliability with VPLS on the MX Series Routers
SOLUTION BRIEF Enterprise Data Center Interconnectivity Increase Simplicity and Improve Reliability with VPLS on the Routers Challenge As enterprises improve business continuity by enabling resource allocation
WHITE PAPER Ethernet Fabric for the Cloud: Setting the Stage for the Next-Generation Datacenter
WHITE PAPER Ethernet Fabric for the Cloud: Setting the Stage for the Next-Generation Datacenter Sponsored by: Brocade Communications Systems Inc. Lucinda Borovick March 2011 Global Headquarters: 5 Speen
Expert Reference Series of White Papers. VMware vsphere Distributed Switches
Expert Reference Series of White Papers VMware vsphere Distributed Switches [email protected] www.globalknowledge.net VMware vsphere Distributed Switches Rebecca Fitzhugh, VCAP-DCA, VCAP-DCD, VCAP-CIA,
Cisco Data Center Network Manager Release 5.1 (LAN)
Cisco Data Center Network Manager Release 5.1 (LAN) Product Overview Modern data centers are becoming increasingly large and complex. New technology architectures such as cloud computing and virtualization
SummitStack in the Data Center
SummitStack in the Data Center Abstract: This white paper describes the challenges in the virtualized server environment and the solution Extreme Networks offers a highly virtualized, centrally manageable
Data Center Network Topologies
Data Center Network Topologies. Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 [email protected] These slides and audio/video recordings of this class lecture are at: 3-1 Overview
Addressing Scaling Challenges in the Data Center
Addressing Scaling Challenges in the Data Center DELL PowerConnect J-Series Virtual Chassis Solution A Dell Technical White Paper Dell Juniper THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY
Load Balancing Mechanisms in Data Center Networks
Load Balancing Mechanisms in Data Center Networks Santosh Mahapatra Xin Yuan Department of Computer Science, Florida State University, Tallahassee, FL 33 {mahapatr,xyuan}@cs.fsu.edu Abstract We consider
TÓPICOS AVANÇADOS EM REDES ADVANCED TOPICS IN NETWORKS
Mestrado em Engenharia de Redes de Comunicações TÓPICOS AVANÇADOS EM REDES ADVANCED TOPICS IN NETWORKS 2008-2009 Exemplos de Projecto - Network Design Examples 1 Hierarchical Network Design 2 Hierarchical
The evolution of Data Center networking technologies
0 First International Conference on Data Compression, Communications and Processing The evolution of Data Center networking technologies Antonio Scarfò Maticmind SpA Naples, Italy [email protected]
Cloud Networking: Framework and VPN Applicability. draft-bitar-datacenter-vpn-applicability-01.txt
Cloud Networking: Framework and Applicability Nabil Bitar (Verizon) Florin Balus, Marc Lasserre, and Wim Henderickx (Alcatel-Lucent) Ali Sajassi and Luyuan Fang (Cisco) Yuichi Ikejiri (NTT Communications)
Intel Ethernet Switch Converged Enhanced Ethernet (CEE) and Datacenter Bridging (DCB) Using Intel Ethernet Switch Family Switches
Intel Ethernet Switch Converged Enhanced Ethernet (CEE) and Datacenter Bridging (DCB) Using Intel Ethernet Switch Family Switches February, 2009 Legal INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION
How To Make A Network Cable Reliable And Secure
ETHERNET KEPT Provider Link State Bridging Gerard Jacobs Senior Solutions Architect Agenda > Network Visions > Carrier Ethernet > Provider Link State Bridging (PLSB) > Summary Network Visions HYBRID L1
How To Switch In Sonicos Enhanced 5.7.7 (Sonicwall) On A 2400Mmi 2400Mm2 (Solarwall Nametra) (Soulwall 2400Mm1) (Network) (
You can read the recommendations in the user, the technical or the installation for SONICWALL SWITCHING NSA 2400MX IN SONICOS ENHANCED 5.7. You'll find the answers to all your questions on the SONICWALL
Understanding Fundamental Issues with TRILL
WHITE PAPER TRILL in the Data Center: Look Before You Leap Understanding Fundamental Issues with TRILL Copyright 2011, Juniper Networks, Inc. 1 Table of Contents Executive Summary........................................................................................................
Block based, file-based, combination. Component based, solution based
The Wide Spread Role of 10-Gigabit Ethernet in Storage This paper provides an overview of SAN and NAS storage solutions, highlights the ubiquitous role of 10 Gigabit Ethernet in these solutions, and illustrates
