Low Latency 10 GbE Switching for Data Center, Cluster and Storage Interconnect



Introduction: High Performance Data Centers

As the data center continues to evolve to meet rapidly escalating demands for higher levels of performance and resource virtualization, three rather distinct networking requirements have emerged. As shown in Figure 1, the typical server in a high performance data center may require connection to three switching fabrics: a LAN for connecting users and general networking, an inter-processor communications (IPC) fabric for low latency message passing between compute cluster applications, and a storage fabric for access to shared storage/file resources.

Figure 1. Data center switching fabrics

While Ethernet is the de facto technology for the general purpose LAN, it has been widely considered a sub-optimal switching fabric for very high performance cluster interconnect IPC (e.g., for MPI parallel applications that require extremely low end-to-end latency). In particular, GbE's end-to-end message-passing latency in the range of 50-70 microseconds is significantly higher than the <10 microseconds of the more specialized cluster interconnects. In the absence of congestion, end-to-end latency includes two basic components: 1) sending/receiving delay in the end systems/NICs for moving the message between the application buffers and the network, and 2) network latency involved in serializing the message and switching it through network nodes to its destination.

In spite of the latency issues, the cost effectiveness of Gigabit Ethernet has resulted in its being chosen as the IPC interconnect for more than 55 percent of the cluster computers on the June 2007 Top500 list. Evidently, the higher latency of GbE does not prevent good performance on parallel benchmarks; it has, however, prevented GbE clusters from capturing any of the Top50 positions on the current Top500 list.

As a block storage fabric, Gigabit Ethernet with iSCSI has offered a good medium-performance solution for access to networked block storage. However, iSCSI has not yet posed a serious threat to Fibre Channel due to the lower bandwidth of GbE versus 2 and 4 Gbps Fibre Channel, plus Ethernet's traditionally higher CPU utilization, higher memory bandwidth consumption and higher latency.

Recent developments in 10 Gigabit Ethernet NIC hardware and low latency 10 GbE switching are positioning 10 Gigabit Ethernet to offer bandwidth and latency performance that is on a par with, or even surpasses, that of the more specialized interconnects, including Fibre Channel and InfiniBand. These developments will allow network managers to minimize the complexity of the data center by using Ethernet as the "converged" switching technology that can meet the highest performance requirements of each type of data center traffic.

Low Latency Cut-Through Ethernet Switching

With cut-through Ethernet switching, the switch delays the packet only long enough to read the Layer 2 packet header and make a forwarding decision based on the destination address and other header fields (e.g., the VLAN tag and 802.1p priority field). Switching latency is reduced because packet processing is restricted to the header rather than the entire packet.

Cut-through Ethernet switching is not new. In fact, the first Ethernet switches, the EtherSwitches introduced by Kalpana in 1990, were cut-through switches. The fundamental drawback of cut-through switches is that they cannot identify and discard all corrupted packets, because a packet is forwarded before its FCS field is received and thus no CRC check can be performed. Corrupted packets were common in the early days of Ethernet, partly as a result of collisions in the flat, shared Ethernet LANs that were in vogue at the time. In spite of this shortcoming, cut-through switching was the predominant mode of Ethernet switching until the development of Fast Ethernet in 1995-96.

The emergence of Fast Ethernet and 10/100 Mbps Ethernet switching eliminated much of the latency advantage of cut-through switching, because speed changes (i.e., switching from 10 Mbps Ethernet to 100 Mbps Ethernet) force the switch to store and forward the packet rather than cut it through. Fast Ethernet also reduced packet serialization time by a factor of 10, which further eroded the latency advantage of cut-through switching. In this earlier era of networking, the predominant networked applications (email, file transfer, and NetWare file access) were not sensitive to switch latencies in the 10-100 microsecond range. The advent of 10/100 switching, followed by the subsequent development of 10/100/1000 Layer 2/3 switching, completely eliminated cut-through switching as a viable forwarding mode for general purpose switched LANs.

Cut-through switching, however, is currently enjoying a resurgence as the switching mode for specialized 10 GbE data center interconnects serving applications that are highly sensitive to switch latencies in the microsecond range. Cut-through switching is applicable for data center interconnects that do not require speed changes and are limited enough in diameter/extent to have very low packet error rates. Low error rates mean that only a negligible amount of bandwidth is wasted on bad packets, which are dropped by the hardware engines in modern NICs rather than by the cut-through switches.

Ethernet Network Latency: Cut-Through vs. Store-and-Forward

Figure 2. Network latency for a store-and-forward vs. a cut-through switch

Figure 2 illustrates the differences in network latency between store-and-forward and cut-through switches. The store-and-forward switch has to wait for the full packet serialization by the sending NIC before it begins packet processing. The switch latency for a packet is measured as the delay between the last bit into the switch and the first bit out (LIFO) of the switch. After packet processing is complete, the switch has to re-serialize the packet to deliver it to its destination.
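The header-only forwarding decision described at the start of this section can be made concrete with a minimal sketch. The code below parses just the fields a cut-through switch needs from the first bytes of an Ethernet frame (destination MAC and, if present, the 802.1Q tag carrying the 802.1p priority) and looks up an output port. The Ethernet/802.1Q field offsets are standard, but the forwarding table, MAC addresses and port names are hypothetical illustrations, not the internals of any actual switch.

```python
# Minimal sketch: the header-only forwarding decision a cut-through switch makes.
# The forwarding table (fdb) and port names below are purely illustrative.

def cut_through_decision(frame: bytes, fdb: dict) -> tuple:
    """Return (output_port, vlan_id, priority) using only the first ~18 bytes."""
    dst_mac = frame[0:6]                        # destination address
    ethertype = int.from_bytes(frame[12:14], "big")
    vlan_id, priority = 1, 0                    # defaults for untagged frames
    if ethertype == 0x8100:                     # 802.1Q tag present
        tci = int.from_bytes(frame[14:16], "big")
        priority = tci >> 13                    # 802.1p priority (3 bits)
        vlan_id = tci & 0x0FFF                  # VLAN ID (12 bits)
    out_port = fdb.get((vlan_id, dst_mac), "flood")
    return out_port, vlan_id, priority          # forwarding can start immediately;
                                                # the FCS at the end is never checked

# Example with hypothetical MAC/port values:
fdb = {(10, bytes.fromhex("001122334455")): "te0/7"}
frame = bytes.fromhex("001122334455" "66778899aabb" "8100" "400a" "0800") + bytes(64)
print(cut_through_decision(frame, fdb))         # ('te0/7', 10, 2)
```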
Neglecting the small propagation delay over short data center cabling (~5.5 ns/meter), the network latency for a one-hop store-and-forward (SAF) switched network is therefore:

SAF Network Latency = 2 × (Serialization Delay) + LIFO Switch Latency

In the case of the cut-through switch at the bottom of Figure 2, the switch can begin forwarding the packet to the destination system as soon as the destination address is mapped to the appropriate output port. This means that the cut-through switch can overlap the serialization of the outgoing packet from the switch to the destination end system with the serialization of the incoming packet. The switch latency is measured as the delay between the first bit in and the first bit out (FIFO) of the switch. Therefore, the corresponding network latency through a one-hop cut-through (CT) switched network is:

CT Network Latency = Serialization Delay + FIFO Switch Latency
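As a quick numeric illustration of these two formulas, the sketch below computes the one-hop network latency for a 1500-byte frame on 10 GbE. The switch latency figures used (about 20 µs LIFO for a store-and-forward switch and 400 ns FIFO for a cut-through switch) are assumed example values taken from the ranges quoted later in this paper, not measurements.

```python
# One-hop network latency, per the SAF and CT formulas above.
# Switch latencies are assumed example values (SAF ~20 us LIFO, CT ~400 ns FIFO).

FRAME_BYTES = 1500
LINK_BPS = 10e9                                   # 10 GbE

serialization = FRAME_BYTES * 8 / LINK_BPS        # 1.2 microseconds
saf_latency = 2 * serialization + 20e-6           # two serializations + LIFO switch latency
ct_latency = serialization + 400e-9               # one serialization + FIFO switch latency

print(f"serialization delay: {serialization * 1e6:.2f} us")
print(f"SAF one-hop latency: {saf_latency * 1e6:.2f} us")   # ~22.4 us
print(f"CT  one-hop latency: {ct_latency * 1e6:.2f} us")    # ~1.6 us
```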

The network latency for the cut-through switch is lower for two reasons: 1) only one instance of serialization delay is encountered, which can be a significant factor for larger frame sizes, and 2) the switch latency itself is lower because of the inherent simplicity of CT switching. Typical switch latency for a 10 GbE store-and-forward switch is in the range of 5-35 microseconds, while the switch latency for a 10 GbE cut-through switch is typically only 300-500 nanoseconds.

As the diameter of the interconnect network increases, the advantage of CT switching becomes more significant. For example, for a 3-hop network, the network latencies for the two types of switches are:

SAF Network Latency = 4 × (Serialization Delay) + 3 × (LIFO Switch Latency)

CT Network Latency = Serialization Delay + 3 × (FIFO Switch Latency)

Cut-Through Switching for Cluster and Storage Interconnect

In its current incarnation, cut-through switching is generally based on a single-chip, non-blocking switch implementation. Because of the limitations of VLSI technology, the number of high speed switch ports per chip is typically in the range of 8-32, irrespective of the technology involved (Fibre Channel, InfiniBand, Myrinet, or 10 Gigabit Ethernet). For example, the Force10 S2410 cut-through switch has 24 10 GbE ports. A standalone switch can be viewed as a single-tier cut-through switching fabric and provides the advantages of cut-through switching for smaller clusters and storage networks.

The most common technique for building larger cluster and storage interconnect fabrics from smaller switching elements is to aggregate multiple switches in a multi-tiered configuration. One approach is to build a Layer 2 interconnect fabric using cut-through switches in both the aggregation and the access tiers of the network, as shown in Figure 3. Depending on the degree of over-subscription that can be tolerated in a specific application, the number of access switches may be either increased or decreased in proportion to the number of aggregation switches. In this particular example, dual redundant aggregation switches are used, which means that for any particular access VLAN, one of the aggregation switches is the primary switch and the other plays a secondary or backup role. For a particular VLAN, only the LAG to the primary aggregation switch carries traffic under normal operating conditions, and the second uplink LAG is blocked per the Rapid Spanning Tree Protocol (RSTP).

Figure 3. Two-tier cut-through switching fabric with Force10 S2410 switches

Much larger 10 GbE clusters can be built by aggregating networks similar to the one shown in Figure 3 with a higher density 10 GbE switch such as the Force10 Networks E-Series, as shown in Figure 4. In this configuration, the E-Series ports used by the access switches could be configured to provide either SAF Layer 2 switching or routing among the low latency portions of the network.

Figure 4. Aggregation of S2410 switches with an E-Series switch
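The one-hop and 3-hop formulas above follow the same pattern, so the comparison generalizes easily to an h-hop path. The sketch below is a minimal model of that pattern; the assumed switch latencies (300 ns cut-through, 20 µs store-and-forward) are illustrative values drawn from the ranges quoted above, so its output only approximates the measured figures summarized in Table 1 below.

```python
# General h-hop latency formulas implied by the text:
#   SAF: (h + 1) serializations + h LIFO switch latencies
#   CT :  1 serialization       + h FIFO switch latencies
# Switch latencies here are assumed examples, not measured S2410/E-Series values.

def network_latency_us(frame_bytes, hops, switch_latency_s, cut_through, link_bps=10e9):
    ser = frame_bytes * 8 / link_bps
    if cut_through:
        total = ser + hops * switch_latency_s
    else:
        total = (hops + 1) * ser + hops * switch_latency_s
    return total * 1e6  # microseconds

for frame in (64, 1500, 9252):
    ct = network_latency_us(frame, hops=3, switch_latency_s=300e-9, cut_through=True)
    saf = network_latency_us(frame, hops=3, switch_latency_s=20e-6, cut_through=False)
    print(f"{frame:>5}B, 3 hops: CT ~{ct:.2f} us, SAF ~{saf:.1f} us")
```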

Further expansion of the cluster network can be achieved by Layer 3 meshing of E-Series switches, with the added advantages of the superior load sharing and path recovery capabilities of equal-cost multi-path (ECMP) routing.

Table 1 presents the measured switch latencies for the Force10 S2410 cut-through switch compared to a Force10 TeraScale E-Series 10 GbE store-and-forward switch, and summarizes the network latencies that may be expected with the switch fabrics shown in Figures 3 and 4. The network latencies were calculated using the formulas presented earlier in the paper.

Table 1. Switch and network latencies for Force10 CT and SAF switches

                                              S2410 CT    E-Series SAF    Hybrid CT/SAF
  Switch Latency
    64B                                       300 ns      16 µs           N/A
    1500B                                     300 ns      18.3 µs         N/A
    9252B                                     300 ns      33.5 µs         N/A
  Network Latency, 1-Tier/1 Hop (Figure 3)
    64B                                       351 ns      N/A             N/A
    1500B                                     1.5 µs      N/A             N/A
    9252B                                     8.3 µs      N/A             N/A
  Network Latency, 2-Tier/3 Hops (Figure 4)
    64B                                       951 ns      N/A             N/A
    1500B                                     2.1 µs      N/A             N/A
    9252B                                     8.9 µs      N/A             N/A
  Network Latency, 3-Tier/5 Hops (Figure 5)
    64B                                       N/A         N/A             17 µs
    1500B                                     N/A         N/A             35.6 µs
    9252B                                     N/A         N/A             57.2 µs

The cut-through switch and network latencies compare favorably with those of InfiniBand, Fibre Channel and the other specialized interconnects. It is also clear from the table that cut-through switching significantly reduces the contribution of network latency to 10 GbE end-to-end latency, allowing latency reduction efforts to focus on the delays within the end systems. Network designs that maximize the diameter of the cut-through portions of the network will therefore exhibit significantly lower latency.

Intelligent 10 GbE NICs

The traditional Ethernet NIC relies on the host CPU to handle the TCP/IP protocol processing. With a software-based protocol stack, the host CPU is shared between the application and the network. The generally accepted rule of thumb is that each bit per second of network traffic consumes one hertz of CPU capacity. Therefore, a software protocol stack causes CPU utilization to become very high at network bandwidths in excess of 1 Gbps, with the CPU itself becoming the bottleneck that limits throughput and adds significantly to end-to-end latency.

Over the last few years, vendors of intelligent Ethernet NICs, together with the RDMA Consortium and the IETF, have been working on specifications for hardware-accelerated TCP/IP protocol stacks that can support the ever-increasing performance demands of general purpose networking, cluster IPC and storage interconnect over GbE and 10 GbE. The efforts have focused on the technologies shown in Figure 5, which provides a highly simplified overview of hardware-assisted end system protocol stacks.

Figure 5. Intelligent Ethernet NIC protocol stacks

A dedicated TCP offload engine (TOE) is incorporated in the NIC. The TOE offloads essentially all of the TCP/IP processing from the host CPU. This greatly reduces CPU utilization and also reduces latency because the protocols are executed in hardware rather than software. Tests have shown that 10 GbE TOE NICs together with cut-through 10 GbE switching are capable of end-to-end, small message latency of about 10 microseconds for MPI over sockets.
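To see why the rule of thumb cited above (roughly one hertz of CPU per bit per second of traffic) makes offload so attractive at 10 Gbps, a quick back-of-the-envelope calculation helps. The 3 GHz core speed used below is an assumed example, not a figure from this paper.

```python
# Back-of-the-envelope application of the "1 Hz of CPU per 1 bit/s" rule of thumb.
# The 3 GHz core is an assumed example value.

line_rate_bps = 10e9          # 10 GbE
cpu_hz_per_bps = 1.0          # rule of thumb cited in the text
core_speed_hz = 3e9           # hypothetical server core

cpu_hz_needed = line_rate_bps * cpu_hz_per_bps
cores_needed = cpu_hz_needed / core_speed_hz

print(f"CPU cycles needed at line rate: {cpu_hz_needed / 1e9:.0f} GHz")
print(f"Equivalent 3 GHz cores consumed: {cores_needed:.1f}")   # ~3.3 cores just for TCP/IP
```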

TOE also provides a major improvement in the throughput of 10 GbE web and NAS servers. The tests cited above also show that TOE NICs can improve Web server performance by as much as 10X versus conventional 10 GbE NICs.

Remote direct memory access (RDMA) is a mechanism for offloading data-copy operations from the CPU by allowing direct data transfers between the network and user memory space. RDMA conserves memory bandwidth by eliminating TCP/IP copy requirements (zero copy) and kernel transitions. RDMA is of particular benefit to the movement of large blocks of data, such as that required for storage interconnect. The IETF has developed a standard called iWARP (Internet Wide Area RDMA Protocol) for RDMA over TCP/IP. The iWARP specification includes TOE functionality in order to eliminate the major sources of network overhead in TCP/IP processing. Separate work is also progressing on an RDMA interface for the Network File System (NFS). With RDMA, Ethernet CPU utilization for message transfers is expected to be reduced to less than 10 percent regardless of the message throughput level.

iSCSI protocol acceleration is an implementation of the iSCSI protocols in hardware that offloads compute-intensive iSCSI operations from the CPU, improving throughput and transaction rates and reducing CPU utilization. The RDMA Consortium has developed the iSCSI Extensions for RDMA (iSER) protocol, which provides a Datamover Architecture (DA) extension that offloads iSCSI data movement and placement operations to the RDMA hardware, while the control aspects remain in software. This may turn out to be a more flexible approach than full iSCSI offload to NIC hardware.

The Sockets Direct Protocol (SDP) is part of the RDMA specification that allows unmodified sockets applications to gain direct access to RDMA-optimized data transfers. Direct access from the application to the RDMA hardware can also help to reduce latency.

Development of offload NICs for 10 GbE has proven to be a fairly significant challenge for Ethernet adapter vendors due to the complexities of the protocols involved. However, there are now a number of vendors in the marketplace with proven high performance products that exploit the various technologies described in this section.

As 10 GbE is adopted as the mainstream converged switching technology, the high performance data center of the near future will resemble that of Figure 1, with three separate 10 GbE switch fabrics. The general purpose LAN fabric would be based on store-and-forward switching, while the IPC and storage fabrics would be based on cut-through 10 GbE switching. If the highest levels of performance are required, the servers will have separate network interfaces to each fabric.

There are two possible scenarios for the evolution of data center NICs:

1) Servers may use three different types of intelligent network interfaces: a TOE NIC optimized for general networking, an iWARP RDMA NIC optimized for low-latency IPC, and an iSCSI NIC optimized for storage networking.

2) Alternatively, a "converged" NIC that supports the full offload suite may emerge as the most cost-effective solution. In this case, a single model of high performance 10 GbE NIC could be deployed throughout the data center, with its mode of offload functionality selected by the data center manager as part of the installation/configuration process.

Layer 2 Fabric Interconnection Enabled by Converged NICs

With a converged NIC it is possible to configure servers with a single 10 GbE NIC for access to both cut-through and store-and-forward LAN fabrics. One example of how this may be done is illustrated in Figure 6. The server with the converged NIC is connected to the cut-through access switch. Some of the uplink ports of each access switch are allocated to intra-fabric connectivity, with the remainder allocated to inter-fabric connectivity. The same is true for the CT aggregation switches.

In this example, separate VLANs are configured for intra-fabric and inter-fabric traffic. The direct LAG connection from each access switch to a core LAN switch is designated as the primary path (P) for inter-fabric traffic to the root switch in the SAF LAN aggregation layer, with a secondary (backup) path (S) directed through one of the CT aggregation switches. For simplicity, the P and S paths are shown only for the access switch on the left of the diagram. A configuration similar to this one has the advantage that inter-fabric LAN traffic normally bypasses the low latency fabric, eliminating the possibility of contention for low latency bandwidth. In the event of a failure in the primary inter-fabric path, traffic would fail over to the secondary path, where Layer 2 QoS could be configured to give strict priority to intra-fabric traffic over general purpose LAN traffic.

Figure 6. Example of fabric interconnection with converged NICs
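The primary/secondary behavior described for Figure 6 can be summarized as a small conceptual model. The sketch below only illustrates the path-selection logic under stated assumptions; the path names and data structure are hypothetical and are not configuration for any particular switch operating system.

```python
# Conceptual model of the Figure 6 design: inter-fabric traffic prefers the direct
# LAG to the core LAN switch and falls back through a CT aggregation switch only on
# failure. All names here are hypothetical; this is not switch configuration.

PATHS = {
    "inter-fabric": {"primary": "direct LAG to core LAN switch (SAF)",
                     "secondary": "via CT aggregation switch"},
    "intra-fabric": {"primary": "via CT aggregation switch",
                     "secondary": None},                  # stays inside the CT fabric
}

def select_path(vlan_role: str, primary_up: bool) -> str:
    """Pick the forwarding path for a VLAN: use the secondary only if the primary fails."""
    paths = PATHS[vlan_role]
    if primary_up or paths["secondary"] is None:
        return paths["primary"]
    # On failover, the shared CT links carry both traffic types, so the design
    # gives intra-fabric traffic strict priority over general purpose LAN traffic.
    return paths["secondary"]

print(select_path("inter-fabric", primary_up=True))    # direct LAG to core LAN switch (SAF)
print(select_path("inter-fabric", primary_up=False))   # via CT aggregation switch
```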

Conclusion

With the advent of 10 GbE cut-through switching and intelligent 10 GbE NICs, Ethernet is ready to challenge the specialized low latency interconnect technologies for performance supremacy in IPC cluster interconnect and storage interconnect. These developments are clearing the way for network managers to simplify the technology makeup of the data center and leverage the cost-effectiveness of Ethernet to minimize TCO without any compromise in performance. As data centers move toward virtualized applications and infrastructure, the combination of lower port prices and lower latency will be a crucial driver of the adoption of 10 Gigabit Ethernet as the converged data center switching technology.

With the performance benefits of offload NICs applicable to general purpose servers, such as the Web server front ends of the data center and NAS servers, TOE and iWARP NICs can be expected to ride a fairly steep cost-reduction curve that will benefit lower volume applications such as IPC cluster interconnect and high-end storage networking. For the typical large enterprise, probably the most significant impact of low-latency 10 GbE networking will be the large cost savings realized through deployment of high performance iSCSI SANs and clustered storage as alternatives to DAS, Fibre Channel SANs or InfiniBand SANs.

Force10 Networks, Inc.
350 Holger Way
San Jose, CA 95134 USA
www.force10networks.com
408-571-3500 PHONE
408-571-3550 FACSIMILE

© 2007 Force10 Networks, Inc. All rights reserved. Force10 Networks and E-Series are registered trademarks, and Force10, the Force10 logo, Reliable Business Networking, Force10 Reliable Networking, C-Series, P-Series, S-Series, EtherScale, TeraScale, FTOS, SFTOS, StarSupport and Hot Lock are trademarks of Force10 Networks, Inc. All other company names are trademarks of their respective holders. Information in this document is subject to change without notice. Certain features may not yet be generally available. Force10 Networks, Inc. assumes no responsibility for any errors that may appear in this document. WP16 907 v1.8