Simplify Your Data Center Network to Improve Performance and Decrease Costs

Summary

Traditional data center networks are struggling to keep up with new computing requirements. Network architects should rethink their designs and adopt simpler topologies and new control protocols to achieve better performance and operational agility, and to save as much as 50% on capital expenditures.

Overview

Key Challenges

Data center (DC) networks must support distributed, highly virtualized and dynamic workloads, which create bigger intra-data-center (east-west) traffic flows.

Data center network transit time must be predictable and independent of workload location.

In a flat budget environment, the per-port cost savings due to organic technology improvements (Moore's Law) are not sufficient to finance increased DC network capacity requirements.

Recommendations

Implement simple one-tier or two-tier physical data center network topologies to save up to 50% on capital investment.

Don't embrace a preconceived architecture; use our recommended criteria to guide network design and to select control protocols and vendors.

Evolve the network topology to support software-defined data center (SDDC) solutions, which can reduce costs by 50% compared with traditional data center architectures.

Introduction

Data center network requirements have changed substantially in the past 10 years for many enterprises, as applications have evolved from simple client/server designs to service-oriented, distributed architectures. In these environments, a three-tier hierarchical network (core, aggregation, access) is no longer adequate to support new traffic patterns and dynamic
workloads. This traditional design should be revisited when refreshing the infrastructure, and multiple alternatives exist.

After years of relative stability, data center networks are evolving from an integrated model, where devices packaged with hardware and software supported a predefined set of network functions, to a more flexible environment, where multiple options are available at each layer and software can augment capabilities over time. For example, in the past, when selecting a switch, there was no choice of software to operate it. Today, the same hardware can often run different software and be deployed in completely different architectures, so it is more important than ever to evaluate the options.

To begin, network architects should conduct a high-level topology assessment ("What should my network look like?"; "How many layers do I need?"), then consider control plane options (the protocols that run the network). These two factors are closely related, so they must be considered iteratively until they converge on a solution that best matches technical requirements and budget constraints. This document provides insight to support this process. The goal is not to develop a detailed technical design, but rather to enable a more thoughtful interaction with vendors. Clients realize they need to change and are trying to orient themselves across the many options they see in the market. We are living in transitional times for data center networking. Blindly embracing an architecture without performing an analysis can lead to suboptimal decisions, limit future options and increase costs.
Analysis

Implement Simple One-Tier or Two-Tier Physical Data Center Network Topologies to Save Up to 50% on Capital Investment

With simple client/server applications, a hierarchical three-tier network design was a good solution, because it efficiently aggregated north-south traffic flows entering and leaving the data center, with Layer 3 providing scalability through separation of broadcast domains. Today, many applications are based on the service-oriented architecture (SOA) and spread across multiple logical components, distributed over a large number of servers. Emerging big data applications have similar characteristics. In these environments, east-west (server-to-server) traffic flows predominate over north-south flows. In a traditional three-tier network design, these server-to-server flows might have to cross all tiers, up to the Layer 3 core, introducing latency and creating a performance bottleneck. In environments adopting cloud computing, seamless virtual machine (VM) mobility can also be an issue, since Layer 2 domains (virtual LANs [VLANs]) do not extend across the Layer 3 core.
Simplifying Physical Network Topology

Based on these new requirements, and considering the progress made by switching technology, we recommend simplifying the topology when refreshing the DC network. Reducing the number of tiers from three to two, or even one, reduces the number of switches, which translates not only to less capital expenditure, but also to better performance, lower latency and simplified operations. Some vendors position fabric extenders in this context.

One-Tier Network Topology

Using a big switch (actually, two, for redundancy) to connect all servers is the easiest connectivity option. Today, high-density 40 Gigabit Ethernet (GbE) switches can support more than a thousand 10GbE ports with Quad Small Form-factor Pluggable (QSFP) splitter cables. For those small to medium environments (indicatively from 200 to 2,000 ports) where growth and expandability are not the main concern, a one-tier topology provides a cost-effective, low-latency option that is easy to implement and manage. This architecture is sometimes referred to as "middle of row," as the switch pair is located in the middle of the racks, to optimize cabling. As the number of servers increases, running all cables to a central point becomes less practical (or even impossible, given distance limits). This is the reason why the Top of Rack (ToR) architecture, with smaller ToR switches installed in each rack, became popular. To scale the one-tier approach while keeping a simple network topology, some vendors (such as Dell and Alcatel-Lucent Enterprise) propose a single layer of fully meshed switches, which could be installed in each rack for small environments. For small and midsize networks, the one-tier design has technical advantages and eliminates the cost and complexity of additional layers, with their associated switches and cabling.
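To illustrate the port math behind the one-tier claim: each 40GbE QSFP port can be broken out into four 10GbE ports with a splitter cable. The sketch below uses a hypothetical chassis size; real line-card densities vary by vendor and model.

```python
# Rough sketch of one-tier port math. The 40GbE port count below is a
# hypothetical example; actual chassis densities vary by vendor and model.

def ports_10g_via_splitters(qsfp_40g_ports: int) -> int:
    """Each 40GbE QSFP port breaks out into four 10GbE ports via a splitter cable."""
    return qsfp_40g_ports * 4

# A hypothetical high-density chassis with 288 x 40GbE ports comfortably
# exceeds a thousand 10GbE server ports:
print(ports_10g_via_splitters(288))  # -> 1152
```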
For example, we have seen configurations with 100 physical servers, equating to 1,000+ virtual servers, dual-attached to the LAN and dual-attached to NAS, at less than $200 per 10GbE port, which is more than 50% less than some more traditional, multiple-tier designs.

Two-Tier Network Topology ("Spine and Leaf")

Going above one tier and adopting a modular design becomes a necessity for larger environments, indicatively over a thousand physical servers, or where great expandability is a requirement. Providing non-blocking connectivity across a large number of devices is the same challenge faced by the engineers of the first telephone exchange systems. This explains why data center networking vendors have rediscovered the Clos network concepts studied
back in the 1950s and are now proposing the spine-and-leaf physical topology. Figure 1 illustrates a simple spine-and-leaf topology. It is also called a "folded three-stage Clos network," because it can be depicted as a three-stage network (for the telephony-exchange-minded) or as a two-tier network. The topology is the same, and can efficiently support server-to-server (east-west) traffic flows. In a data center network design, the leaf often corresponds to the ToR switch, providing connectivity to servers, while the spine corresponds to the core switch. Servers can be dual-attached to provide full-path redundancy. Table 1 summarizes the benefits provided by this network design.

Table 1. Benefits of Spine-and-Leaf Design

Availability: Connectivity remains available, although with reduced performance, when any link or switch stops working. Complete redundancy is achieved at system level; fault tolerance at device level is not necessary.

Efficient use of switching capacity: Traffic is load-balanced across multiple links and switches (all paths have equal cost), so the available capacity can be fully utilized.

Horizontal scalability: The design can scale horizontally, using the same switch models. Adding more spine switches increases core capacity, while adding more leaf switches increases the number of server ports. The limit is the number of ports available for leaf-spine connections on both, which depends on the switch models used.

Deterministic and consistent latency: Every leaf is two hops away from every other leaf, so the switching fabric provides predictable latency across any server pair, resulting in consistent application performance.

Simplicity: The design is based on a few standardized building blocks. This facilitates provisioning, automation and maintenance.

There are three main factors to consider when evaluating the topology (how many switches, and how they interconnect):

1. Leaf switch features: This switch is used as the ToR, and typically is a fixed form factor (FFF) model, because it has a relatively small number of 10GbE ports (less than 100) and few 40GbE uplinks, and its cost per port strongly influences the overall cost. The number of 10GbE ports sets the upper limit on the number of servers that can be attached to each leaf. The number of 40GbE ports sets a limit on the number of spines, since the leaf must connect to all of them. Each vendor has different ToR switch models. Some have 10GbE ports dedicated to server connectivity (typically 48 or 96) and 40GbE ports (typically from four to eight) for spine connectivity. Other models have only 40GbE ports, so any port can be used for spine connectivity, but they require QSFP splitters for attaching servers at 10GbE, which adds to cabling cost and complexity.

2. Oversubscription: This is the ratio between total access port capacity and the spine connectivity bandwidth available at the leaf. For example, 96 x 10GbE access ports with 8 x 40GbE uplinks gives 960:320, or 3:1 oversubscription. Sizing the level of bandwidth contention is a key design attribute. A value of one means that all servers connected to the leaf can send/receive traffic at wire speed to the spines simultaneously; in other words, a non-blocking design.
This would be an over-engineered solution in most cases; a 3:1 ratio is adequate for common application scenarios. Server attachment options can influence oversubscription.

3. Spine switch features: The size of this switch (i.e., the number of 40GbE ports) sets a limit on the number of leaves that can be connected. The simplest network configuration can have only two spine switches. Vendors normally propose chassis-based models in this scenario, because of their expandability, but a configuration with multiple FFF switches as spines can achieve a similar level of scalability at lower cost. For example, eight FFF spines with 32 x 40GbE ports each have capacity equivalent to two chassis switches with 128 x 40GbE ports. For the actual implementation of spine switches, network architects should consider the current trend toward FFF as well as the specific requirements of a spine-and-leaf design.
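The interplay of these three factors reduces to simple arithmetic. The sketch below uses the illustrative port counts from this discussion (96 x 10GbE access ports and 8 x 40GbE uplinks per leaf; spines with 32 x 40GbE ports), not any specific product:

```python
# Sketch of spine-and-leaf sizing arithmetic, using the illustrative
# port counts discussed in the text.

def oversubscription(access_10g_ports: int, uplink_40g_ports: int) -> float:
    """Ratio of total access bandwidth to total uplink bandwidth at one leaf."""
    return (access_10g_ports * 10) / (uplink_40g_ports * 40)

def max_leaves(spine_40g_ports: int) -> int:
    """Every leaf consumes one 40GbE port on every spine, so the spine's
    port count caps the number of leaves."""
    return spine_40g_ports

def total_server_ports(leaves: int, access_10g_ports: int) -> int:
    return leaves * access_10g_ports

print(oversubscription(96, 8))                 # 960:320 -> 3.0, i.e. 3:1
print(total_server_ports(max_leaves(32), 96))  # 32 leaves x 96 ports -> 3072
```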
Combined, these three factors determine the maximum number of server ports that can be connected using a spine-and-leaf design with certain switch models. For example, a design based on 32 leaf switches with (96 x 10GbE) + (8 x 40GbE) each, in combination with eight spine switches with 32 x 40GbE ports, provides 32 x 96 = 3,072 server attach ports with 3:1 oversubscription, a configuration that would fulfill the needs of the majority of enterprises (1,500 dual-attached physical servers, or 15,000 VMs, with an average 1:10 virtualization ratio). The street price for this type of configuration, based on FFF switches, is around $250 to $300 per 10GbE port. At some point, the design limit for the utilized switch models will be reached, and no further expansion will be possible. However, considering useful equipment lifetimes and the increasing use of virtualization and cloud, which reduce server footprint, a typical enterprise should have enough visibility to evaluate and consider this approach.

After defining the network topology at a high level (i.e., number of tiers, kinds of switches and main interconnections), you must make an important decision about the control plane; in other words, the Layer 2 (L2) and Layer 3 (L3) protocols that will govern the network. Many viable alternatives are possible. Every vendor will make its preferred recommendations, although most can support multiple options, each one with advantages and disadvantages. Network architects should weigh the options based on how they assess and prioritize the following requirements:

Scale: How many physical servers must be supported?

Expected growth: Is that number stable, growing (or shrinking) predictably, or hard to forecast?

Technical factors: What are the design constraints (for example, L2 connectivity needs, path recovery times, etc.)?

Budget: Is cost containment a high priority?

Vendor independence: Regardless of cost implications, is this a necessity?
Staff: What human resources and skills are available for design and operations?

Network architects should then ponder the following technical considerations before finalizing the network design.

Layer 2 or Layer 3 Design (Bridged or Routed Network)

In an L2 design, VLANs extend across multiple leaf switches, while in an L3 design, VLANs are confined to the leaf, and IP routing takes place for every packet that goes from one leaf to another. The L2 design is simpler, but the L3 design is more scalable. Both have further options.

Layer 2 Design

Spanning Tree Protocol (STP) is a legacy L2 protocol, designed to prevent loops by blocking links when Ethernet networks had a treelike topology. In a spine-and-leaf design, multiple paths (i.e., loops) are created on purpose, to be used in parallel, so STP is no longer used to manage the
overall network's logical topology. STP is still used for specific purposes (for example, for dual-server attachment in Multi-Chassis Link Aggregation Group [MC-LAG] solutions). A number of alternatives are available to manage multiple paths at L2, including Transparent Interconnection of Lots of Links (TRILL), Shortest Path Bridging (SPB) and various MC-LAG implementations, often based on proprietary variations of these protocols. Most vendors also implement virtual chassis functionality, so that a number of physical switches (for example, all spines) can be managed like a single virtual switch. Although L2 can be simpler to implement, it is less scalable and less mature in terms of standards (such as TRILL and SPB), so it is not the preferred solution if multivendor interoperability is a requirement, or for very large implementations, because of the inherent limitations of L2 broadcast domains.

Layer 3 Design

An IP design for a spine-and-leaf physical topology requires a dynamic Internet Protocol (IP) routing protocol that supports load balancing with equal-cost multipath (ECMP). Open Shortest Path First (OSPF) is a common choice, and Intermediate System to Intermediate System (IS-IS) is possible. The choice recommended by most vendors, however, is Border Gateway Protocol (BGP), which can be implemented as internal BGP (iBGP) or external BGP (eBGP), with the latter being the preferred choice, because it natively supports load balancing with ECMP. Traditionally, BGP was associated with Internet core routers and with complex WAN designs, so it is not widely known by DC networking staff, which limits its adoption. However, BGP is highly scalable and well-proven for multivendor interoperability. This design is the best choice for very large (5,000 servers or more) multivendor environments, such as cloud service providers. The limitation of an L3 design is that standard VLANs cannot extend beyond the leaf (ToR) switch.
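To make the ECMP mechanism concrete, the sketch below models, in simplified form, how a leaf switch hashes a flow's 5-tuple to pick one of several equal-cost next hops: packets of the same flow always follow the same path (preserving packet ordering), while different flows spread across all spines. The spine names and the CRC32 hash are illustrative stand-ins, not how any particular switch implements this in hardware.

```python
import zlib

# Simplified model of ECMP next-hop selection at a leaf switch. The spine
# names and the CRC32 hash are illustrative stand-ins; real switches use
# hardware hash functions over the packet's 5-tuple.
SPINES = ["spine1", "spine2", "spine3", "spine4"]  # equal-cost next hops

def ecmp_next_hop(src_ip: str, dst_ip: str, proto: str,
                  src_port: int, dst_port: int) -> str:
    """Hash the flow's 5-tuple and map it to one of the equal-cost spines."""
    flow = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return SPINES[zlib.crc32(flow) % len(SPINES)]

# The same flow always maps to the same spine, so per-flow ordering is
# preserved, while the set of flows is distributed across all spines.
path = ecmp_next_hop("10.0.1.5", "10.0.2.9", "tcp", 40000, 443)
print(path in SPINES)  # True
```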
Virtualized environments that need L2 connectivity across servers to support VM mobility would require an overlay solution to circumvent this limitation.

Programmable Ethernet Fabrics

To simplify network deployment and operations, most vendors have introduced solutions that overcome the need to manage each switch individually and provide the opportunity to see the network as a whole. Auto-discovery mechanisms reduce the configuration task, and the overall network can be configured manually, through a graphical user interface (GUI), or programmatically, through a northbound API. A policy-driven framework on top can further abstract application requirements from individual device configuration. The combination of these additional functionalities, often packaged as a proprietary solution, transforms a network made of individual components into what is referred to as a "fabric" in vendors' marketing literature. Each vendor has one or more solutions in this space that have
evolved historically. Some are marketed as fabrics, and some as software-defined networking (SDN)-like solutions. Clients should identify the proprietary elements of the different vendor solutions and evaluate whether the benefits, in terms of simplification of recurrent operational tasks, outweigh the lock-in risks. These solutions can encompass both L2 and L3 to provide all necessary connectivity options, and might include distributed L3 functions to deliver optimized switching and routing. Some fabrics rely on Virtual Extensible LAN (VXLAN) overlays to deliver L2 connectivity over an L3 backbone.

Overlays

Creating an abstraction layer on top of the physical network can solve the problem of providing L2 connectivity across L3. Tunnels based on VXLAN or Network Virtualization using Generic Routing Encapsulation (NVGRE) can transport L2 frames over IP, extending virtual LANs wherever needed. The VXLAN tunnel endpoints (VTEPs) can be implemented in a software virtual switch (vswitch) running on servers, or in leaf switches at the edge of the network. The first option is common in software-based overlays; the second is used for connecting bare-metal servers, but also in solutions that combine a BGP control plane with Ethernet Virtual Private Network (EVPN) to transparently extend L2 domains across an IP network. Overlays can also be used to isolate parts of the network, providing additional security or supporting multiple tenants, even those with overlapping IP addresses. Overlays are not a complete network solution, though; they require an underlay network with enough capacity and reliability to operate. Overlays do not necessarily need a spine-and-leaf network topology. However, the combination of the two designs is a good match; it combines the flexibility of the overlay and the robustness of the underlay.

SDN

SDN is an architectural model for networks.
A spine-and-leaf infrastructure can support a genuine SDN deployment, with the control plane running in a central controller, decoupled from the switches. All the considerations above would still be conceptually relevant, and multiple L2/L3 designs are possible, since SDN is an architectural model and not a standardized solution. Customers who have already embraced a device-based SDN model, whether based on OpenFlow or on VMware's software-defined data center approach with VMware NSX and integrations from Arista Networks and Palo Alto Networks, can still implement a spine-and-leaf network topology.

Conclusions

Data center networks are rapidly evolving, and a large number of options are available, although at different levels of maturity. There are multiple factors that network architects and managers
must consider. The spine-and-leaf physical topology is valuable, regardless of the selected control plane. Adopting a proprietary solution for the control plane can bring short-term benefits at the price of limiting future options, but viable alternatives exist. Architectural decisions of this magnitude do not occur every year, and many organizations might not possess the necessary skill sets. In this case, consider obtaining design validation from a third party that does not benefit from the sale of the solution.