TECHNOLOGY CONNECTED: Advances with PCI Express

System Area Network Speeds Data Transfer between Servers with PCIe

A new network switch technology is targeted to answer the phenomenal demands on intercommunication transfer speeds between servers, which are becoming all too evident in the client-server architectures found in today's data processing environments.

by Joey Maitra, Magma

The proliferation of the raw processing power of computers has resulted in system architectures where processing tasks are distributed and assigned to various processing elements in the system in order to spread the load and derive better system throughput. The execution of these tasks is closely coordinated and then integrated by some central processing unit (CPU) entity to produce the desired output. The intent is to have the entire set of these elements share the processing load, thereby boosting the overall throughput of the system. Processing elements must then communicate with the central entity and/or among themselves to synchronize the execution of their respective tasks. In most scenarios, there are also volatile and non-volatile storage elements dedicated to the distributed processing elements that comprise the system. For instance, in blade centers, blade servers have their own private storage facilities and also communicate with each other over high-speed connections on the mid-plane, as well as with devices on a storage area network (SAN) through a switch module.

FIGURE 1: A typical root complex processor architecture prevalent in almost all current computer motherboards, showing PCIe endpoints (a communication controller device and a communication controller card, each with Transaction, Data Link and Physical layers) and a PCIe to PCI/PCI-X bridge.
FIGURE 2: TCP/IP packet data flow from the Application layer of the sending server, through the network, to the Application layer of the destination server. On each server motherboard, TCP/IP packets travel between the driver code and the 10 Gigabit Ethernet card embedded in PCIe packets, then cross the wire embedded in Ethernet frames through a 10 Gigabit Ethernet switch (a crossbar switch with shared buffer control providing a virtual connection between Server 1 and Server 2).

This is typically the case in mid- to high-end server environments. However, extending this paradigm to an environment made up of servers physically located in separate enclosures requires a fast interconnect mechanism. In other words, these servers must communicate among themselves via some sort of network. In such environments, there is also the need to access vast amounts of data via network attached storage (NAS) devices. This scenario is all too prevalent in datacenters and server farms, to mention a few. Today, these access mechanisms are implemented via local area networks (LANs) with technologies such as InfiniBand, 10 Gigabit Ethernet, Fibre Channel and the like. Another point to note: the phenomenal rate of deployment of the Internet has resulted in most LANs using TCP/IP in the upper layers of the communication stack. IP packets from the TCP/IP layers are essentially encapsulated within the frames of the communication protocol used to form the LAN.

The physical connections to the network fabric for servers and computers take place through either a network I/O controller card or a network controller device resident on the motherboard. These motherboards host a root complex processor as shown in Figure 1. A root complex denotes the root of an I/O hierarchy that connects the CPU/memory subsystem to I/O devices. This hierarchy consists of a root complex (RC), multiple endpoints (I/O devices), a PCI Express (PCIe) switch and a PCIe to PCI/PCI-X bridge, all interconnected via PCIe links. PCIe is a point-to-point, low-overhead, low-latency communication link maximizing application payload bandwidth and link efficiency. Inherent in the technology is a very robust communication protocol with its own set of Transaction, Data Link and Physical layers.

The network I/O controllers implement some specific communication protocol and provide the interface to the physical media constituting the LAN. The controllers interface to a PCIe endpoint of the root complex processor (RCP) of the server node participating in the network. Incidentally, this architecture is not restricted to servers, since it is common in workstations, desktops and laptops. The vast majority of the industry's prevalent communication protocols were invented before the advent of PCIe technology. These protocols have their own, almost identical infrastructure made up of Transaction, Data Link and Physical layers. As depicted in Figure 2, data originating at the Application layer is transformed into TCP/IP packets and then embedded in PCIe packets. These packets are sent to the Ethernet controller, which extracts the TCP/IP packet from the PCIe packets and re-packetizes it to be sent in Ethernet frames over the 10 Gigabit Ethernet physical media. The reverse process takes place at the destination server. It is obvious from the discussion so far that there is an awful lot of protocol duplication, as the sketch below makes concrete.
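To see the duplication in code form, consider this minimal C sketch. The struct layouts and the build_frame() helper are hypothetical simplifications written for this article (standard minimum header sizes, no options, no VLAN tags), not any real driver's definitions.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Standard minimum header sizes; layouts simplified for illustration. */
    struct eth_hdr  { uint8_t  dst[6], src[6]; uint16_t ethertype; };  /* 14 bytes */
    struct ipv4_hdr { uint8_t  ver_ihl, tos; uint16_t tot_len, id, frag_off;
                      uint8_t  ttl, proto; uint16_t csum;
                      uint32_t saddr, daddr; };                        /* 20 bytes */
    struct tcp_hdr  { uint16_t sport, dport; uint32_t seq, ack;
                      uint16_t off_flags, window, csum, urg_ptr; };    /* 20 bytes */

    /* Wrap an application payload for the wire. By this point the same
     * bytes have already crossed the sending host's PCIe link once, packed
     * into TLPs carrying their own, entirely separate, transaction, data
     * link and physical layer framing -- the duplication discussed above. */
    size_t build_frame(uint8_t *frame, const uint8_t *payload, size_t n)
    {
        size_t off = sizeof(struct eth_hdr) + sizeof(struct ipv4_hdr)
                   + sizeof(struct tcp_hdr);       /* 54 bytes of headers  */
        memcpy(frame + off, payload, n);           /* header fill omitted  */
        return off + n;                            /* on-wire frame length */
    }

The same payload bytes thus carry two independent sets of framing: the PCIe Transaction/Data Link/Physical layer framing on the host side, and the Ethernet, IP and TCP headers on the wire.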
The cost of such duplication, measured in terms of overall network throughput, becomes more pronounced when the nuances of the various communication protocols are considered as they relate to efficiency: data rate, maximum payload, packet header overhead and so on. It turns out that the duplication of the communication protocol, even when executed in hardware, imposes unnecessary software and hardware overhead that seriously impacts the overall throughput of the network infrastructure. Another important factor affecting the overall performance of a network is the bandwidth limitation of the physical media associated with the communication protocol used. This encompasses transfer rates, maximum supported distances and the connection topology, to name a few.
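As a rough illustration of the header-overhead component of that cost, the following back-of-the-envelope calculation uses the standard on-wire sizes for a full-size Ethernet frame. It is generic arithmetic, not a measurement of any product discussed here.

    #include <stdio.h>

    int main(void)
    {
        const double mtu    = 1500.0;          /* max Ethernet payload, bytes */
        const double ip_tcp = 20.0 + 20.0;     /* minimum IPv4 + TCP headers  */
        const double wire   = 8 + 14 + mtu + 4 + 12; /* preamble+SFD, MAC    */
                                                     /* header, FCS, gap     */
        const double eff = (mtu - ip_tcp) / wire;    /* payload per wire byte */

        printf("payload efficiency: %.1f%%\n", 100.0 * eff);      /* ~94.9% */
        printf("best-case TCP payload on 10 GbE: %.2f Gbit/s\n", 10.0 * eff);
        return 0;
    }

Even before any protocol-processing cost, a 10 Gigabit Ethernet link therefore delivers at most roughly 9.5 Gbit/s of TCP payload; smaller frames push the efficiency considerably lower.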
FIGURE 3: An alternate approach to sending TCP/IP packets via PCIe links with the use of a PCIe switch. On each server, the IP application talks through an IP interface driver and a Magma driver above the PCIe Transaction Layer; x8/x16 links run at 40/80 Gbit/s, or 80/160 Gbit/s full duplex, into the PCIe switch that forms a virtual Internet between servers.

FIGURE 4: The mechanism for the transfer of TCP/IP packets between server memories without any involvement on the part of the server processors. IP packets transmitted and received in each server's kernel space are mapped through non-transparent PCIe switch ports into 4 Gbyte windows of the 64-bit addressable logical space of the IP router processor, whose DMA engine performs peer-to-peer physical transfers.

For instance, with 10 Gigabit Ethernet the restriction of the data transfer rate to 10 Gbit/s is potentially a very serious limitation for many applications. Given this scenario, the ideal approach to boosting the overall performance of the network would be to use PCIe technology itself as the network fabric. Embedded in the PCIe packet is the IP datagram with the destination IP address of the server node. PCIe is a point-to-point communication protocol and consequently does not have a media access control (MAC) address. Therefore, the most natural and logical approach to routing data from one node in the network to another is to have some entity route the data based on the destination IP address. Implementing this type of routing methodology essentially makes that entity an IP router. This is where the PCIe switch comes into play, as shown in Figure 3.

All of the downstream ports of the switch connect to the servers comprising the nodes of a system area network. Intelligence tied to the upstream port of the switch has already established the correlation between each downstream port and the IP address of the server attached to it. Data flows from one server to another through the switch. Consequently, the root complex processor (RCP) tied to the upstream port of the switch must communicate with the RCP of each server. This poses the question of how best to communicate between two RCPs. Bus enumeration in the PCIe architecture, which is the same as in the PCI bus architecture, does not allow one RCP to discover devices on a bus that belongs to another RCP. However, there is a technique, pioneered by PLX Corporation during the heyday of the PCI bus, that addresses this issue: Non-Transparent Bridging (NTB). This method allows two RCPs to communicate through the use of base address registers (BARs). This interchange of information applies to memory, I/O and configuration spaces in the context of the PCI bus architecture, and is applicable to both PCI and PCIe systems.
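A minimal sketch of the idea behind NTB follows. The register names and offsets are entirely hypothetical placeholders (real NTB devices from PLX and others differ in layout and programming sequence); the point is only that one RCP programs a translation so that a BAR window on its side redirects into a chosen region of the other system's memory.

    #include <stdint.h>

    /* Hypothetical register offsets -- not a real device's interface. */
    #define NTB_BAR2_XLAT  0x18u   /* translation base: where the window lands */
    #define NTB_BAR2_LIMIT 0x20u   /* end of the translated region             */

    static inline void mmio_write64(volatile uint8_t *regs, uint32_t off,
                                    uint64_t val)
    {
        *(volatile uint64_t *)(regs + off) = val;
    }

    /* Aim the BAR2 window of one downstream NTB port at 'remote_base' in
     * the attached server's physical memory. Afterward, any read or write
     * the router's RCP performs inside the BAR2 aperture crosses the
     * non-transparent bridge and lands in the remote server's RAM. */
    void ntb_map_remote(volatile uint8_t *port_regs,
                        uint64_t remote_base, uint64_t window_size)
    {
        mmio_write64(port_regs, NTB_BAR2_XLAT,  remote_base);
        mmio_write64(port_regs, NTB_BAR2_LIMIT, remote_base + window_size);
    }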
Such NTB-based communication can only be supported if the underlying hardware of the switch provides NTB functions on the respective downstream ports. The RCP of the IP router sets up the BARs on the individual NTB switch ports attached to the respective servers and maps their system memories to corresponding windows in the logical address space of its own system memory. This gives a single entity visibility into the individual system memories of all servers in the network, and this access mechanism is used to transfer data, in this case TCP/IP packets, between the servers comprising the LAN.

This method allows memory or I/O data to be transferred between attached servers through the switch ports at the maximum data rate supported by the respective physical links. For example, with 8 lanes of PCIe Gen 2 the data transfer rate is 40 Gbit/s, and with 16 lanes it is 80 Gbit/s. PCIe incorporates full duplex communication, meaning transmit and receive can happen at the same time; that makes the full duplex bandwidth 80 Gbit/s for 8 lanes of Gen 2 and 160 Gbit/s for 16 lanes. PCIe Gen 3, which is currently being developed, will more than double these numbers.

Magma's patent-pending technology, which covers all aspects of a network based on running the TCP/IP protocol over a PCIe fabric, inclusive of the IP router, is the basis of the network switch design. It relies on the pull model for data transfer through the network switch. This leaves the processors on the sending servers totally free of, and oblivious to, how IP data is transferred to the destination server, significantly reducing the processor overhead of moving data to and from the network. This is illustrated in Figure 4.

With the NTB-based network switch technology, the maximum number of nodes on one network is 256, because the PCI configuration space supports a maximum of 256 buses. This may be construed as a limitation, but it allows for a very symmetrical topology with one RCP, that of the network switch, servicing all of the nodes as devices underneath it. No additional RCP is involved in expanding the number of nodes and, therefore, no additional memory resources are required. Consequently, adding nodes to the network simply means daisy-chaining switches, so the cost per port decreases significantly as the number of nodes in the network grows. Moreover, compared to 10 Gigabit Ethernet and other legacy networks, adding nodes to the network switch is seamless because of the plug-and-play attributes of the PCI bus architecture.

FIGURE 5: An example of how servers with disparate functions (database, mail and application servers, a RAID subsystem, an optical jukebox and a tape library) participate seamlessly, at 80 Gbit/s, in a symmetrical TCP/IP-based system area network with a connection to the Internet.

Since the servers have no direct visibility into a remote server's memory, any data transfer operation necessarily involves the root switch. For instance, when a source server needs to read or write data from or to a target server, it notifies the root switch rather than attempting to communicate with the target server directly. It is the root switch that accesses the memory of both the source and the target server; a sketch of how its address space might be laid out for this purpose follows.
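The helper below shows one hypothetical carve-up of the router RCP's 64-bit logical address space, assuming, as Figure 4 suggests, a fixed 4 Gbyte window per server. The base address and window size are illustrative assumptions, not a product specification.

    #include <stdint.h>

    #define WINDOW_BASE 0x100000000ULL  /* assumed start of the mapped region */
    #define WINDOW_SIZE 0x100000000ULL  /* 4 Gbyte window per server (Fig. 4) */
    #define MAX_NODES   256             /* PCI config space: max 256 buses    */

    /* Router-side logical address of 'offset' within server 'node's memory.
     * Through the NTB windows, every node's kernel buffers become plain
     * loads/stores (or DMA targets) for the one RCP owning this space. */
    static inline uint64_t node_addr(unsigned node, uint64_t offset)
    {
        return WINDOW_BASE + (uint64_t)node * WINDOW_SIZE + offset;
    }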
To further reduce data transfer latencies, the new switch technology uses DMA controllers built into the NTB ports of the switch. This relieves the network switch processor of moving data between servers and allows for concurrent transfers between nodes in the network. The result is peer-to-peer transfers within the switch array, contributing to a drastic reduction in data transfer latencies across the network. Based on the destination IP addresses of the individual packets in a particular server's kernel space, the RCP on the network switch sets up the DMA descriptor list and then fires the DMA engine, along the lines of the sketch below.
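Here is a generic scatter-gather descriptor ring of the kind the article describes. The field layout, ring size and doorbell mechanism are assumptions made for illustration, not the actual engine's programming interface.

    #include <stdint.h>

    struct dma_desc {
        uint64_t src;    /* router-side logical address in the source window */
        uint64_t dst;    /* router-side logical address in the dest window   */
        uint32_t len;    /* bytes to move                                    */
        uint32_t ctrl;   /* e.g. interrupt-on-completion flag                */
    };

    #define RING_SIZE 64u    /* assumed descriptor ring depth */

    /* Queue one packet move and ring the (assumed) doorbell register. The
     * per-port DMA engine then pulls the packet out of the source server's
     * kernel buffer and pushes it into the destination server's buffer --
     * neither server's CPU takes part in the copy. */
    void dma_kick(volatile uint32_t *doorbell, struct dma_desc *ring,
                  unsigned *tail, uint64_t src, uint64_t dst, uint32_t len)
    {
        ring[*tail] = (struct dma_desc){ .src = src, .dst = dst,
                                         .len = len, .ctrl = 1 };
        *tail = (*tail + 1) % RING_SIZE;
        *doorbell = *tail;   /* tell the engine new work is ready */
    }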
PCIe technology is fast becoming ubiquitous, with the result that all server and workstation manufacturers now provide a number of PCIe slots for I/O expansion. These form the PCIe endpoints of the RCP on the host computer backplane. A host adapter card takes the PCIe signals from the backplane and brings them out on fiber, or alternatively on copper, to attach to the ports of the network switch. The number of lanes operational between the server and the network switch depends on the number of lanes supported by the server hardware. PCIe provides link negotiation, whereby the two ends of a link settle on the minimum number of lanes supported by either connection point. Consequently, each port of the network switch negotiates down to the number of lanes supported by the host connection on that individual port. These ports support Gen 2 signaling and will likewise negotiate down to Gen 1 signaling to match the corresponding server connection. This makes the network switch highly scalable.

The network switch technology is based completely on PCIe standards, with no aspect of the technology being proprietary. Given the industry's commitment to PCIe, this provides a migration path to newer generations of the technology, potentially extending its life cycle. The technology also allows legacy networks to coexist with it as it goes through its adoption cycle and, moreover, those networks can serve as a fallback mechanism for mission-critical applications, allowing for fail-safe deployment. Another significant advantage is the cost per port as nodes are added to the network, since there is only one root complex processor (RCP), that of the network switch, in this topology.

Figure 5 shows an example of how servers with disparate functions participate seamlessly in a symmetrical TCP/IP-based system area network. It also shows how storage and processing servers coexist in one homogeneous network. This is facilitated by the increasingly popular use of iSCSI for communication with network attached storage devices. iSCSI is essentially the SCSI protocol, which is widely used in the industry to communicate with storage devices, embedded in TCP/IP packets. Also, connection to the Internet simply means transferring intact, via a wide area network (WAN) interface, all IP packets that are not destined for any server on the network, as the closing sketch below illustrates. The deployment of the network switch shown in Figure 5 is representative of a topology that, with different software modules, can be used for clustering, I/O virtualization and cloud computing applications. It is a highly flexible architecture.
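That forwarding decision can be sketched in a few lines of C. The table structure and lookup are illustrative assumptions about how the router's upstream intelligence might correlate learned IP addresses with downstream ports, not Magma's implementation.

    #include <stdint.h>

    #define MAX_NODES 256                /* PCI bus-number limit noted above */

    struct route_table {
        uint32_t ip[MAX_NODES];          /* IP address learned per port      */
        unsigned ports_in_use;
    };

    /* Forwarding decision: a match returns the downstream NTB port whose
     * server window the packet is DMAed into; no match means the packet
     * leaves the system area network intact via the WAN interface. */
    int route_lookup(const struct route_table *rt, uint32_t dst_ip)
    {
        for (unsigned i = 0; i < rt->ports_in_use; i++)
            if (rt->ip[i] == dst_ip)
                return (int)i;           /* local server: deliver over PCIe  */
        return -1;                       /* not local: forward to the WAN    */
    }

Magma San Diego, CA. (858) 530-2511. [www.magma.com].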