
Lustre Networking
BY PETER J. BRAAM
A WHITE PAPER FROM CLUSTER FILE SYSTEMS, INC.
APRIL 2007

Audience

Architects of HPC clusters

Abstract

This paper provides architects of HPC clusters with information about Lustre networking that they can use to make decisions about performance and scalability relevant to their deployments. We will review Lustre message passing, Lustre Network Drivers, and routing in Lustre networks, and describe how these can be used to improve cluster storage management. The final section of this paper describes some new Lustre networking features that are currently under consideration or planned for release.

Contents

Challenges in Cluster Networking
Lustre Networking - Architecture and Current Features
  Lustre Networking Architecture
  Network Types Supported in Lustre Networks
  Routers and Multiple Interfaces in Lustre Networks
Lustre Networking Applications
  Lustre Support for RDMA
  Using Lustre Networking to Implement a Site-Wide File System
  Using Lustre Routers for Load Balancing
Anticipated Features in Future Releases
  New Features For Multiple Interfaces
  Server-Driven QoS
  A Router Control Plane
  Asynchronous I/O
Conclusion

Challenges in Cluster Networking

Today's data centers present many challenges on the networking front, and few systems address them. File systems require native storage networking over different types of networks and must be able to exploit features like remote direct memory access (RDMA). In large installations, multiple networks must be able to simultaneously access all storage from all locations through routers and multiple network interfaces. Storage management nightmares, such as handling multiple copies of data as they are staged on file systems local to a cluster, can be avoided only if such features are available.

We will first describe how Lustre networking addresses almost all of these challenges today. We will also describe how Lustre networking is expected to evolve to provide further levels of load balancing, control, quality of service (QoS), and high availability in networks on a local and global scale.

Lustre Networking - Architecture and Current Features

Lustre Networking Architecture

Based on extensive research, Lustre networking has evolved into a set of protocols and APIs that support high-performance, high-availability file systems. Key features of Lustre networking are:

- RDMA, when supported by underlying networks
- Support for a number of commonly used network types, such as InfiniBand and IP
- High availability and recovery built into the Lustre networking stack
- Simultaneous availability of multiple network types, with routing between them

Figure 1 shows how these network features are implemented in a cluster deployed with Lustre.

Figure 1. A Lustre cluster

Lustre networking is implemented with layered APIs and software modules. The file system uses a remote procedure API with facilities for recovery and bulk transport. This API in turn uses the LNET™ Message Passing API, which is derived from the Sandia Portals message passing API, a well-known API in the HPC community. The LNET API has a pluggable driver architecture, similar in concept to the Portals network abstraction layer (NAL), to support multiple network types individually or simultaneously. The drivers, called Lustre Network Drivers (LNDs), are loaded into the driver stack, one for each network type that is in use. A feature that allows routing between the different network types was implemented as a result of a suggestion made early in the Lustre product cycle by a key customer, Lawrence Livermore National Laboratory (LLNL).

Figure 2 shows how the software modules and APIs are layered.

Figure 2. Modular Lustre networking implemented with layered APIs

In a Lustre network, configured interfaces are named using network identifiers (NIDs). A NID is a string of the form <address>@<type><network id>. A Lustre network is a set of configured interfaces on nodes that can send traffic directly from one interface on the network to another. Examples of NIDs are 192.168.1.1@tcp0, designating an address on the 0th Lustre TCP network, and 4@elan8, designating address 4 on the 8th Lustre Elan network.

Network Types Supported in Lustre Networks

Lustre provides LNDs to support many networks, including:

- InfiniBand: OpenFabrics, Mellanox Gold, Cisco, Voltaire, and Infinicon
- TCP: any network carrying TCP traffic, including GigE, 10GigE, and IPoIB
- Quadrics: Elan3, Elan4
- Myricom: GM, MX
- Cray: Seastar, RapidArray

The networks are supported by LNDs, which are pluggable modules for the LNET interfaces.
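
To make the NID naming scheme concrete, here is a minimal Python sketch that parses NIDs of the form <address>@<type><network id> shown above. The function and pattern names are invented for illustration; this is not part of the LNET API, and the pattern only covers the network types used as examples in this paper.

```python
import re

# Illustrative parser for the NID format described above:
# <address>@<type><network id>, e.g. "192.168.1.1@tcp0" or "4@elan8".
NID_PATTERN = re.compile(r"^(?P<address>[^@]+)@(?P<type>[a-z]+)(?P<net>\d*)$")

def parse_nid(nid: str) -> dict:
    """Split a NID string into its address, network type, and network number."""
    match = NID_PATTERN.match(nid)
    if match is None:
        raise ValueError(f"not a valid NID: {nid!r}")
    return {
        "address": match.group("address"),
        "type": match.group("type"),
        # An omitted network id conventionally means network 0 (e.g. "tcp" == "tcp0").
        "network": int(match.group("net") or 0),
    }

if __name__ == "__main__":
    print(parse_nid("192.168.1.1@tcp0"))  # {'address': '192.168.1.1', 'type': 'tcp', 'network': 0}
    print(parse_nid("4@elan8"))           # {'address': '4', 'type': 'elan', 'network': 8}
```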

Routers and Multiple Interfaces in Lustre Networks

A Lustre network consists of interfaces on nodes that are configured with NIDs and that communicate with one another without using intermediate router nodes with their own NIDs. A Lustre network is not required to be physically separated from another, although that is possible. LNET can conveniently define a Lustre network by naming the IP addresses of the interfaces that form it. When more than one Lustre network is present, LNET can route traffic between the networks using routing nodes; an example is shown in Figure 3. If multiple routers are present between a pair of networks, they offer both load balancing and high availability through redundancy.

Figure 3. Lustre networks connected through routers

If multiple interfaces of one type are available, they should be placed on different Lustre networks, unless the underlying network software for that network type supports interface bonding that results in a single address. Such interface bonding is available for IP networks and Elan4. Later in this paper, we describe features that may be developed in future releases to allow LNET itself to manage multiple network interfaces. Figure 4 shows how multiple Lustre networks can make effective use of multiple server interfaces in the presence of multiple clients.

Figure 4. A Lustre server with multiple network interfaces offering load balancing to the cluster
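
The following Python sketch illustrates, purely conceptually, how several routers between a pair of Lustre networks provide both load balancing and redundancy: traffic is spread round-robin over the live gateways, and a failed gateway is transparently skipped. The class and its methods are hypothetical and do not reflect LNET internals.

```python
import itertools

class RouterPool:
    """Conceptual model of multiple routers between two Lustre networks."""

    def __init__(self, gateway_nids):
        self.gateways = list(gateway_nids)            # e.g. ["10.0.0.1@tcp0", "10.0.0.2@tcp0"]
        self.alive = {nid: True for nid in self.gateways}
        self._cycle = itertools.cycle(self.gateways)  # round-robin load balancing

    def mark_down(self, nid):
        self.alive[nid] = False                       # router failed: stop using it

    def next_gateway(self):
        # Spread traffic over all live routers; skip any that are down (redundancy).
        for _ in range(len(self.gateways)):
            nid = next(self._cycle)
            if self.alive[nid]:
                return nid
        raise RuntimeError("no live router between these networks")

pool = RouterPool(["10.0.0.1@tcp0", "10.0.0.2@tcp0"])
pool.mark_down("10.0.0.2@tcp0")
print(pool.next_gateway())   # "10.0.0.1@tcp0" until the failed router recovers
```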

Lustre Networking Applications

Lustre Support for RDMA

With the exception of TCP, LNET provides support for RDMA on all supported network types. The LND automatically uses this feature for large message sizes. When RDMA is used, nodes can achieve almost full bandwidth with extremely low CPU utilization. This is advantageous, particularly for nodes that are busy running other software, such as Lustre server software. However, provisioning with sufficient CPU power and high-performance motherboards may justify TCP networking as a trade-off to using RDMA. On 64-bit processors, LNET can saturate several GigE interfaces with relatively low CPU utilization, and with the recently released dual-core Intel Xeon processor 5100 series ("Woodcrest"), the bandwidth on a 10 GigE network can approach a gigabyte per second. Lustre networking provides extraordinary bandwidth utilization of TCP networks. For example, end-to-end I/O over a single GigE link routinely exceeds 110 MB/sec.

Using Lustre Networking to Implement a Site-Wide File System

Site-wide file systems are typically used in HPC centers where many clusters exist on different high-speed networks. Such networks are usually not easy to extend or connect to other networks. An increasingly popular approach is to build a storage island at the center of such an installation. The storage island contains storage arrays and servers and is connected to a network such as an InfiniBand or TCP network. Multiple clusters can connect to this island through routing nodes. The routing nodes are simple Lustre systems with at least two network interfaces, one to the internal cluster network and one to the network used in the storage island. Figure 5 shows an example of a global file system.

Figure 5. A global file system implemented using Lustre networks

A global file system, also referred to as a site-wide file system, provides transparent access from all clusters to file systems located in the storage island. The benefits are not to be underestimated. Traditional data management for multiple clusters involves staging copies of data from the file system of one cluster to another. By using Lustre as a site-wide file system, multiple copies of the data are no longer needed, and substantial savings can be achieved from both a manageability and a capacity perspective.

Using Lustre Routers for Load Balancing

Lustre routers are commodity server systems and can be used in a load-balanced, redundant router configuration. For example, consider an installation with servers on a network with 10 GigE interfaces and many clients attached to a GigE network. It is possible, but typically costly, to purchase IP switching equipment that can connect to both the servers and the clients. With a Lustre network, the purchase of such costly switches can be avoided.

For a more cost-effective solution, two separate networks can be created. A smaller server network contains the servers on the fast network and a set of router nodes with sufficient aggregate throughput. A second client network with slower interfaces contains all the client nodes and is also attached to the router nodes. If this second network already exists and has sufficient free ports to add the Lustre router nodes, no changes to the client network are required. Figure 6 shows an installation with this configuration.

Figure 6. An installation combining slow and fast networks using Lustre routers

The routers provide a redundant, load-balanced path between the clients and the servers. This network configuration allows many clients together to utilize the full bandwidth of a server, even if individual clients have insufficient network bandwidth. The routers collect the throughput of the slow network on the client side and forward the data stream to the servers. Because multiple routers stream data to the server network simultaneously, the network on the server side can see data streams in excess of those seen on the client networks.
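
A rough sizing sketch, assuming only the throughput figures quoted in this paper (about 110 MB/sec of end-to-end I/O per GigE link and roughly a gigabyte per second approachable on a 10 GigE server interface), shows how many GigE-attached routers are needed to keep one fast server interface busy. The helper below is illustrative only.

```python
import math

GIGE_MB_PER_SEC = 110             # end-to-end Lustre I/O over one GigE link (from this paper)
SERVER_10GIGE_MB_PER_SEC = 1000   # bandwidth a 10 GigE server interface can approach

def routers_needed(gige_links_per_router: int = 1) -> int:
    """How many GigE-attached routers it takes to keep one 10 GigE server interface busy."""
    per_router = gige_links_per_router * GIGE_MB_PER_SEC
    return math.ceil(SERVER_10GIGE_MB_PER_SEC / per_router)

print(routers_needed())    # about 10 routers with one client-side GigE link each
print(routers_needed(2))   # about 5 routers if each aggregates two client-side GigE links
```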

Anticipated Features in Future Releases

Although Lustre networking offers many features today, more are coming in future releases. Some possible directions for the development of new features include support for multiple network interfaces, implementation of server-driven QoS guarantees, asynchronous I/O, and a control interface for routers.

New Features For Multiple Interfaces

LNET can currently exploit multiple interfaces by placing them on different Lustre networks. For example, consider a server with two network interfaces, with a subset of the clients connected to one interface and the remaining clients connected to the other. LNET can define two networks, each consisting of the clients connected to one of the server's interfaces. This configuration provides reasonable load balancing for a server with many clients. However, it is a static configuration that does not handle link-level failover or dynamic load balancing.

We plan to address these shortcomings with the following design. First, LNET can virtualize multiple interfaces and offer the aggregate as one NID to users of the LNET API. In concept, this is quite similar to the aggregation (also referred to as bonding or trunking) of Ethernet interfaces using protocols like 802.3ad Dynamic Link Aggregation. The key features that a future LNET release may offer are:

- Load balancing: all links are used, based on the availability of throughput capacity.
- Link-level high availability: if one link fails, the remaining channels transparently continue to be used for communication.

These features are shown in Figure 7.

Figure 7. Link-level load balancing and failover

From a design perspective, these load-balancing and high-availability features are similar to the features offered with LNET routing. A challenge in developing these features is providing a simple way to configure the network. Assigning and publishing NIDs for the bonded interfaces should be simple and flexible, and should work even if not all links are available at startup. We expect to use the management server protocol to resolve this issue.
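
As a purely conceptual illustration of the bonding behaviour described above, the sketch below spreads messages over the physical links with the most spare capacity and transparently skips a failed link. The class, its interface, and the capacity bookkeeping are assumptions made for this example; they are not the planned LNET design.

```python
class BondedNID:
    """Toy model of several physical links aggregated behind one virtual NID."""

    def __init__(self, virtual_nid, links):
        self.virtual_nid = virtual_nid              # the single NID seen by LNET users
        self.capacity = dict(links)                 # physical link -> capacity in MB/s
        self.in_flight = {link: 0 for link in links}
        self.up = {link: True for link in links}

    def fail(self, link):
        self.up[link] = False                       # link-level failover: stop using it

    def complete(self, link, size_mb):
        self.in_flight[link] -= size_mb             # transfer finished, capacity freed

    def pick_link(self, size_mb):
        # Load balancing: send on the live link with the most spare capacity.
        live = [l for l in self.capacity if self.up[l]]
        if not live:
            raise RuntimeError("all links in the bond are down")
        best = max(live, key=lambda l: self.capacity[l] - self.in_flight[l])
        self.in_flight[best] += size_mb
        return best

bond = BondedNID("192.168.1.10@tcp0", {"eth0": 110, "eth1": 110})
print(bond.pick_link(50))   # 'eth0' (both links idle; ties go to the first link)
bond.fail("eth0")
print(bond.pick_link(50))   # 'eth1', used transparently now that eth0 is down
```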

Server-Driven QoS

QoS is an issue in various scenarios. A prevalent one is when multiple clusters compete for bandwidth from the same storage servers. A primary QoS goal is to avoid thrashing server systems, in which conflicting demands from multiple clusters or systems result in performance degradation for all of them. Setting and enforcing policies is one way to avoid this. For example, a policy can be established that guarantees a certain minimal bandwidth to resources that must respond in real time, such as a display session of visual streaming data. Or a policy can be defined that gives systems or clusters doing mission-critical work priority for bandwidth over less important clusters or systems. Lustre's role is not to determine an appropriate set of policies, however, but to provide the site management capabilities needed to set and enforce them.

Figure 8. Using server-driven QoS to schedule video rendering and visualization

The Lustre QoS scheduler will have two components: a Local Request Scheduler (LRS) and a global Epoch Handler (EH). The LRS is responsible for receiving and queueing requests according to a local policy, as shown in Figure 8. The EH supports the concept of a virtual time slice shared among all servers. This time slice can be relatively large to avoid overhead from excessive server-to-server networking and latency; for example, a slice might be one second. Together, the LRS and EH allow a cluster of servers to execute the same policy during the same time slice. Note that the policy may subdivide the EH time slice and use the subdivision advantageously. The LRS also provides summary data to the EH so that global knowledge and adaptation can be established.
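
To illustrate the division of labour between the LRS and the EH, the following sketch queues requests locally and serves them within a shared one-second epoch according to whatever policy is in force for that epoch. The priorities, the policy, and all names are assumptions made for this example, not the actual scheduler design.

```python
import heapq
import itertools
import time

EPOCH_SECONDS = 1.0   # the paper suggests a slice on the order of one second

class LocalRequestScheduler:
    """Toy LRS: queue requests locally, serve them under an epoch-wide policy."""

    def __init__(self):
        self._queue = []
        self._seq = itertools.count()   # preserves FIFO order within equal priority

    def submit(self, client, priority, request):
        heapq.heappush(self._queue, (priority, next(self._seq), client, request))

    def run_epoch(self, policy):
        """Serve queued requests for one epoch; defer whatever the policy rejects."""
        deadline = time.monotonic() + EPOCH_SECONDS
        served, deferred = [], []
        while self._queue and time.monotonic() < deadline:
            priority, seq, client, request = heapq.heappop(self._queue)
            if policy(client, priority):                      # e.g. only high-priority work
                served.append((client, request))
            else:
                deferred.append((priority, seq, client, request))
        for item in deferred:                                 # wait for a later epoch
            heapq.heappush(self._queue, item)
        return served

lrs = LocalRequestScheduler()
lrs.submit("viz-cluster", priority=0, request="read 1 MB")
lrs.submit("batch-cluster", priority=5, request="write 4 MB")
# Epoch policy shared by all servers for this slice: serve priorities below 3 first.
print(lrs.run_epoch(policy=lambda client, prio: prio < 3))    # [('viz-cluster', 'read 1 MB')]
```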

A Router Control Plane

Lustre is expected to be used in vast worldwide file systems that traverse networks with up to hundreds of routers. To achieve wide-area QoS guarantees that cannot be achieved with static configurations, these networks must allow configuration changes to be made dynamically. To handle these situations, a rich control interface is required between the routers and outside administrative systems. This control interface is the Lustre Router Control Plane.

One example of where the Lustre Router Control Plane can be useful is when data packets are being routed from A to B and also from C to D, and for operational reasons preference needs to be given to routing the packets from C to D. The control plane would apply a policy to the routers so that packets are sent from C to D before packets are sent from A to B. The technology to be used for the control interface remains under discussion, but it could be similar to what is being discussed elsewhere for global network management.

Asynchronous I/O

In large compute clusters, significant I/O optimization is still possible. When a client writes large amounts of data, a truly asynchronous I/O mechanism would allow the client to register for RDMA the memory pages that need to be written and allow the server to transfer the data to storage without causing interrupts on the client. This makes the client CPU fully available to the application again, which is a significant benefit in some situations.

Figure 9. Network-level DMA with handshake interrupts and without handshake interrupts

LNET supports RDMA, but currently a handshake at the operating system level is required to initiate the RDMA. The handshake exchanges the network-level DMA addresses to be used. The proposed change would eliminate the handshake and include the network-level DMA addresses in the initial request to transfer data, as shown in Figure 9.
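
A toy sketch, assuming nothing beyond what is described above, contrasts the current handshake flow with the proposed asynchronous one by counting client-side interrupts. The message names and function split are invented for this illustration and do not describe the actual LNET protocol.

```python
# Current scheme: the client is interrupted to exchange DMA addresses before the
# bulk RDMA starts. Proposed scheme: the addresses travel in the initial request,
# so the transfer completes without touching the client CPU again.

def write_with_handshake(pages):
    client_interrupts = 0
    flow = ["client->server: write request"]
    flow.append("server->client: request for the client's DMA addresses")
    client_interrupts += 1                       # the client must wake up and reply
    flow.append(f"client->server: DMA addresses for {len(pages)} pages")
    flow.append("server: RDMA-read pages and write them to storage")
    return client_interrupts, flow

def write_async(pages):
    dma_addresses = [f"dma:{i}" for i in range(len(pages))]
    flow = [f"client->server: write request carrying {dma_addresses}"]
    flow.append("server: RDMA-read pages and write them to storage")
    return 0, flow                               # no client interrupt during the transfer

print(write_with_handshake(["p0", "p1"])[0])     # 1 client interrupt
print(write_async(["p0", "p1"])[0])              # 0 client interrupts
```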

Conclusion

Lustre networking provides an exceptionally flexible and innovative infrastructure. Among the many features and benefits that have been discussed, the most significant are:

- Native support for all commonly used HPC networks
- Extremely fast data rates through RDMA and unparalleled TCP throughput
- Support for site-wide file systems through routing, eliminating the staging and copying of data between clusters
- Load-balancing router support to eliminate low-speed network bottlenecks

Lustre networking will continue to evolve with features to handle link aggregation, server-driven QoS, a rich control interface for large routed networks, and asynchronous I/O without interrupts.

Legal Disclaimer

Lustre is a registered trademark of Cluster File Systems, Inc., and LNET is a trademark of Cluster File Systems, Inc. Other product names are the trademarks of their owners. Although CFS strives for accuracy, we reserve the right to change, postpone, or eliminate features at our sole discretion.