
Lustre Networking
BY PETER J. BRAAM
A WHITE PAPER FROM CLUSTER FILE SYSTEMS, INC.
APRIL 2007

Audience

Architects of HPC clusters

Abstract

This paper provides architects of HPC clusters with information about Lustre networking that they can use to make decisions about performance and scalability relevant to their deployments. We will review Lustre message passing, Lustre Network Drivers, and routing in Lustre networks, and describe how these can be used to improve cluster storage management. The final section of this paper describes some new Lustre networking features that are currently under consideration or planned for release.

Contents

Challenges in Cluster Networking
Lustre Networking - Architecture and Current Features
  Lustre Networking Architecture
  Network Types Supported in Lustre Networks
  Routers and Multiple Interfaces in Lustre Networks
Lustre Networking Applications
  Lustre Support for RDMA
  Using Lustre Networking to Implement a Site-Wide File System
  Using Lustre Routers for Load Balancing
Anticipated Features in Future Releases
  New Features For Multiple Interfaces
  Server-Driven QoS
  A Router Control Plane
  Asynchronous I/O
Conclusion

Challenges in Cluster Networking

Today's data centers present many challenges on the networking front, and few systems address them. File systems require native storage networking over different types of networks and must be able to exploit features like remote direct memory access (RDMA). In large installations, multiple networks must be able to simultaneously access all storage from all locations through routers and multiple network interfaces. Storage management nightmares, such as handling multiple copies of data as they are staged on file systems local to a cluster, can be avoided only if such features are available.

We will first describe how Lustre networking addresses almost all of these challenges today. We will also describe how Lustre networking is expected to evolve to provide further levels of load balancing, control, quality of service (QoS), and high availability in networks on a local and global scale.

Lustre Networking - Architecture and Current Features

Lustre Networking Architecture

Based on extensive research, Lustre networking has evolved into a set of protocols and APIs that support high-performance, high-availability file systems. Key features of Lustre networking are:

- RDMA, when supported by underlying networks
- Support for a number of commonly used network types, such as InfiniBand and IP
- High availability and recovery built into the Lustre networking stack
- Simultaneous availability of multiple network types, with routing between them

Figure 1 shows how these network features are implemented in a cluster deployed with Lustre.

Figure 1. A Lustre cluster

Lustre networking is implemented with layered APIs and software modules. The file system uses a remote procedure API with facilities for recovery and bulk transport. This API in turn uses the LNET™ Message Passing API, which is derived from the Sandia Portals message passing API, a well-known API in the HPC community. The LNET API has a pluggable driver architecture, similar in concept to the Portals network abstraction layer (NAL), to support multiple network types individually or simultaneously. The drivers, called Lustre Network Drivers (LNDs), are loaded into the driver stack, one for each network type that is in use. A feature that allows routing between the different network types was implemented as a result of a suggestion made early in the Lustre product cycle by a key customer, Lawrence Livermore National Laboratory (LLNL).

Figure 2 shows how the software modules and APIs are layered.

Figure 2. Modular Lustre networking implemented with layered APIs

In a Lustre network, configured interfaces are named using network identifiers (NIDs). A NID is a string of the form <address>@<type><network id>. A Lustre network is a set of configured interfaces on nodes that can send traffic directly from one interface on the network to another. Examples of NIDs are 192.168.1.1@tcp0, designating an address on the 0th Lustre TCP network, and 4@elan8, designating address 4 on the 8th Lustre Elan network.

Network Types Supported in Lustre Networks

Lustre provides LNDs to support many networks, including:

- InfiniBand: OpenFabrics, Mellanox Gold, Cisco, Voltaire, and Infinicon
- TCP: any network carrying TCP traffic, including GigE, 10GigE, and IPoIB
- Quadrics: Elan3, Elan4
- Myricom: GM, MX
- Cray: Seastar, RapidArray

The networks are supported by LNDs, which are pluggable modules for the LNET interfaces.
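
To make the NID naming scheme concrete, here is a minimal Python sketch that parses NIDs of the form <address>@<type><network id> shown above. The function and pattern names are invented for illustration; this is not part of the LNET API, and the pattern only covers the network types used as examples in this paper.

```python
import re

# Illustrative parser for the NID format described above:
# <address>@<type><network id>, e.g. "192.168.1.1@tcp0" or "4@elan8".
NID_PATTERN = re.compile(r"^(?P<address>[^@]+)@(?P<type>[a-z]+)(?P<net>\d*)$")

def parse_nid(nid: str) -> dict:
    """Split a NID string into its address, network type, and network number."""
    match = NID_PATTERN.match(nid)
    if match is None:
        raise ValueError(f"not a valid NID: {nid!r}")
    return {
        "address": match.group("address"),
        "type": match.group("type"),
        # An omitted network id conventionally means network 0 (e.g. "tcp" == "tcp0").
        "network": int(match.group("net") or 0),
    }

if __name__ == "__main__":
    print(parse_nid("192.168.1.1@tcp0"))  # {'address': '192.168.1.1', 'type': 'tcp', 'network': 0}
    print(parse_nid("4@elan8"))           # {'address': '4', 'type': 'elan', 'network': 8}
```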

Routers and Multiple Interfaces in Lustre Networks

A Lustre network consists of interfaces on nodes that are configured with NIDs and that communicate with one another without using intermediate router nodes with their own NIDs. A Lustre network is not required to be physically separated from another, although that is possible. LNET can conveniently define a Lustre network by naming the IP addresses of the interfaces that form it. When more than one Lustre network is present, LNET can route traffic between the networks using routing nodes; an example is shown in Figure 3. If multiple routers are present between a pair of networks, they offer both load balancing and high availability through redundancy.

Figure 3. Lustre networks connected through routers

If multiple interfaces of one type are available, they should be placed on different Lustre networks, unless the underlying network software for that network type supports interface bonding that results in a single address. Such interface bonding is available for IP networks and Elan4. Later in this paper, we describe features that may be developed in future releases to allow LNET itself to manage multiple network interfaces. Figure 4 shows how multiple Lustre networks can make effective use of multiple server interfaces in the presence of multiple clients.

Figure 4. A Lustre server with multiple network interfaces offering load balancing to the cluster
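
The following Python sketch illustrates, purely conceptually, how several routers between a pair of Lustre networks provide both load balancing and redundancy: traffic is spread round-robin over the live gateways, and a failed gateway is transparently skipped. The class and its methods are hypothetical and do not reflect LNET internals.

```python
import itertools

class RouterPool:
    """Conceptual model of multiple routers between two Lustre networks."""

    def __init__(self, gateway_nids):
        self.gateways = list(gateway_nids)            # e.g. ["10.0.0.1@tcp0", "10.0.0.2@tcp0"]
        self.alive = {nid: True for nid in self.gateways}
        self._cycle = itertools.cycle(self.gateways)  # round-robin load balancing

    def mark_down(self, nid):
        self.alive[nid] = False                       # router failed: stop using it

    def next_gateway(self):
        # Spread traffic over all live routers; skip any that are down (redundancy).
        for _ in range(len(self.gateways)):
            nid = next(self._cycle)
            if self.alive[nid]:
                return nid
        raise RuntimeError("no live router between these networks")

pool = RouterPool(["10.0.0.1@tcp0", "10.0.0.2@tcp0"])
pool.mark_down("10.0.0.2@tcp0")
print(pool.next_gateway())   # "10.0.0.1@tcp0" until the failed router recovers
```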

Lustre Networking Applications

Lustre Support for RDMA

With the exception of TCP, LNET provides support for RDMA on all supported network types. The LND automatically uses this feature for large message sizes. When RDMA is used, nodes can achieve almost full bandwidth with extremely low CPU utilization. This is advantageous, particularly for nodes that are busy running other software, such as Lustre server software. However, provisioning with sufficient CPU power and high-performance motherboards may justify TCP networking as a trade-off to using RDMA. On 64-bit processors, LNET can saturate several GigE interfaces with relatively low CPU utilization, and with the recently released dual-core Intel Xeon processor 5100 series ("Woodcrest"), the bandwidth on a 10 GigE network can approach a gigabyte per second. Lustre networking provides extraordinary bandwidth utilization of TCP networks. For example, end-to-end I/O over a single GigE link routinely exceeds 110 MB/sec.

Using Lustre Networking to Implement a Site-Wide File System

Site-wide file systems are typically used in HPC centers where many clusters exist on different high-speed networks. Such networks are usually not easy to extend or connect to other networks. An increasingly popular approach is to build a storage island at the center of such an installation. The storage island contains storage arrays and servers and is connected to a network such as an InfiniBand or TCP network. Multiple clusters can connect to this island through routing nodes. The routing nodes are simple Lustre systems with at least two network interfaces, one to the internal cluster network and one to the network used in the storage island. Figure 5 shows an example of a global file system.

Figure 5. A global file system implemented using Lustre networks

A global file system, also referred to as a site-wide file system, provides transparent access from all clusters to file systems located in the storage island. The benefits are not to be underestimated. Traditional data management for multiple clusters involves staging copies of data from the file system of one cluster to another. By using Lustre as a site-wide file system, multiple copies of the data are no longer needed, and substantial savings can be achieved from both a manageability and a capacity perspective.

Using Lustre Routers for Load Balancing

Lustre routers are commodity server systems and can be used in a load-balanced, redundant router configuration. For example, consider an installation with servers on a network with 10 GigE interfaces and many clients attached to a GigE network. It is possible, but typically costly, to purchase IP switching equipment that can connect to both the servers and the clients. With a Lustre network, the purchase of such costly switches can be avoided.

For a more cost-effective solution, two separate networks can be created. A smaller server network contains the servers on the fast network and a set of router nodes with sufficient aggregate throughput. A second client network with slower interfaces contains all the client nodes and is also attached to the router nodes. If this second network already exists and has sufficient free ports to add the Lustre router nodes, no changes to the client network are required. Figure 6 shows an installation with this configuration.

Figure 6. An installation combining slow and fast networks using Lustre routers

The routers provide a redundant, load-balanced path between the clients and the servers. This network configuration allows many clients together to utilize the full bandwidth of a server, even if individual clients have insufficient network bandwidth. The routers collect the throughput of the slow network on the client side and forward the data stream to the servers. Because multiple routers stream data to the server network simultaneously, the network on the server side can see data streams in excess of those seen on the client networks.
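
A rough sizing sketch, assuming only the throughput figures quoted in this paper (about 110 MB/sec of end-to-end I/O per GigE link and roughly a gigabyte per second approachable on a 10 GigE server interface), shows how many GigE-attached routers are needed to keep one fast server interface busy. The helper below is illustrative only.

```python
import math

GIGE_MB_PER_SEC = 110             # end-to-end Lustre I/O over one GigE link (from this paper)
SERVER_10GIGE_MB_PER_SEC = 1000   # bandwidth a 10 GigE server interface can approach

def routers_needed(gige_links_per_router: int = 1) -> int:
    """How many GigE-attached routers it takes to keep one 10 GigE server interface busy."""
    per_router = gige_links_per_router * GIGE_MB_PER_SEC
    return math.ceil(SERVER_10GIGE_MB_PER_SEC / per_router)

print(routers_needed())    # about 10 routers with one client-side GigE link each
print(routers_needed(2))   # about 5 routers if each aggregates two client-side GigE links
```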

Anticipated Features in Future Releases

Although Lustre networking offers many features today, more are coming in future releases. Some possible directions for the development of new features include support for multiple network interfaces, implementation of server-driven QoS guarantees, asynchronous I/O, and a control interface for routers.

New Features For Multiple Interfaces

LNET can currently exploit multiple interfaces by placing them on different Lustre networks. For example, consider a server with two network interfaces, with a subset of the clients connected to one interface and the remaining clients connected to the other. LNET can define two networks, each consisting of the clients connected to one of the server's interfaces. This configuration provides reasonable load balancing for a server with many clients. However, it is a static configuration that does not handle link-level failover or dynamic load balancing.

We plan to address these shortcomings with the following design. First, LNET can virtualize multiple interfaces and offer the aggregate as one NID to users of the LNET API. In concept, this is quite similar to the aggregation (also referred to as bonding or trunking) of Ethernet interfaces using protocols like 802.3ad Dynamic Link Aggregation. The key features that a future LNET release may offer are:

- Load balancing: all links are used, based on the availability of throughput capacity.
- Link-level high availability: if one link fails, the remaining channels transparently continue to be used for communication.

These features are shown in Figure 7.

Figure 7. Link-level load balancing and failover

From a design perspective, these load-balancing and high-availability features are similar to the features offered with LNET routing. A challenge in developing these features is providing a simple way to configure the network. Assigning and publishing NIDs for the bonded interfaces should be simple and flexible, and should work even if not all links are available at startup. We expect to use the management server protocol to resolve this issue.
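
As a purely conceptual illustration of the bonding behaviour described above, the sketch below spreads messages over the physical links with the most spare capacity and transparently skips a failed link. The class, its interface, and the capacity bookkeeping are assumptions made for this example; they are not the planned LNET design.

```python
class BondedNID:
    """Toy model of several physical links aggregated behind one virtual NID."""

    def __init__(self, virtual_nid, links):
        self.virtual_nid = virtual_nid              # the single NID seen by LNET users
        self.capacity = dict(links)                 # physical link -> capacity in MB/s
        self.in_flight = {link: 0 for link in links}
        self.up = {link: True for link in links}

    def fail(self, link):
        self.up[link] = False                       # link-level failover: stop using it

    def complete(self, link, size_mb):
        self.in_flight[link] -= size_mb             # transfer finished, capacity freed

    def pick_link(self, size_mb):
        # Load balancing: send on the live link with the most spare capacity.
        live = [l for l in self.capacity if self.up[l]]
        if not live:
            raise RuntimeError("all links in the bond are down")
        best = max(live, key=lambda l: self.capacity[l] - self.in_flight[l])
        self.in_flight[best] += size_mb
        return best

bond = BondedNID("192.168.1.10@tcp0", {"eth0": 110, "eth1": 110})
print(bond.pick_link(50))   # 'eth0' (both links idle; ties go to the first link)
bond.fail("eth0")
print(bond.pick_link(50))   # 'eth1', used transparently now that eth0 is down
```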

Server-Driven QoS

QoS is an issue in various scenarios. A prevalent one is when multiple clusters compete for bandwidth from the same storage servers. A primary QoS goal is to avoid thrashing server systems, in which conflicting demands from multiple clusters or systems result in performance degradation for all of them. Setting and enforcing policies is one way to avoid this. For example, a policy can be established that guarantees a certain minimal bandwidth to resources that must respond in real time, such as a display session of visual streaming data. Or a policy can be defined that gives systems or clusters doing mission-critical work priority for bandwidth over less important clusters or systems. Lustre's role is not to determine an appropriate set of policies, however, but to provide the site management capabilities needed to set and enforce them.

Figure 8. Using server-driven QoS to schedule video rendering and visualization

The Lustre QoS scheduler will have two components: a Local Request Scheduler (LRS) and a global Epoch Handler (EH). The LRS is responsible for receiving and queueing requests according to a local policy, as shown in Figure 8. The EH supports the concept of a virtual time slice shared among all servers. This time slice can be relatively large to avoid overhead from excessive server-to-server networking and latency; for example, a slice might be one second. Together, the LRS and EH allow a cluster of servers to execute the same policy during the same time slice. Note that the policy may subdivide the EH time slice and use the subdivision advantageously. The LRS also provides summary data to the EH so that global knowledge and adaptation can be established.
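
To illustrate the division of labour between the LRS and the EH, the following sketch queues requests locally and serves them within a shared one-second epoch according to whatever policy is in force for that epoch. The priorities, the policy, and all names are assumptions made for this example, not the actual scheduler design.

```python
import heapq
import itertools
import time

EPOCH_SECONDS = 1.0   # the paper suggests a slice on the order of one second

class LocalRequestScheduler:
    """Toy LRS: queue requests locally, serve them under an epoch-wide policy."""

    def __init__(self):
        self._queue = []
        self._seq = itertools.count()   # preserves FIFO order within equal priority

    def submit(self, client, priority, request):
        heapq.heappush(self._queue, (priority, next(self._seq), client, request))

    def run_epoch(self, policy):
        """Serve queued requests for one epoch; defer whatever the policy rejects."""
        deadline = time.monotonic() + EPOCH_SECONDS
        served, deferred = [], []
        while self._queue and time.monotonic() < deadline:
            priority, seq, client, request = heapq.heappop(self._queue)
            if policy(client, priority):                      # e.g. only high-priority work
                served.append((client, request))
            else:
                deferred.append((priority, seq, client, request))
        for item in deferred:                                 # wait for a later epoch
            heapq.heappush(self._queue, item)
        return served

lrs = LocalRequestScheduler()
lrs.submit("viz-cluster", priority=0, request="read 1 MB")
lrs.submit("batch-cluster", priority=5, request="write 4 MB")
# Epoch policy shared by all servers for this slice: serve priorities below 3 first.
print(lrs.run_epoch(policy=lambda client, prio: prio < 3))    # [('viz-cluster', 'read 1 MB')]
```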

A Router Control Plane

Lustre is expected to be used in vast worldwide file systems that traverse networks with up to hundreds of routers. To achieve wide-area QoS guarantees that cannot be achieved with static configurations, these networks must allow configuration changes to be made dynamically. To handle these situations, a rich control interface is required between the routers and outside administrative systems. This control interface is the Lustre Router Control Plane.

One example of where the Lustre Router Control Plane can be useful is when data packets are being routed from A to B and also from C to D, and for operational reasons preference needs to be given to routing the packets from C to D. The control plane would apply a policy to the routers so that packets are sent from C to D before packets are sent from A to B. The technology to be used for the control interface remains under discussion, but it could be similar to what is being discussed elsewhere for global network management.

Asynchronous I/O

In large compute clusters, significant I/O optimization is still possible. When a client writes large amounts of data, a truly asynchronous I/O mechanism would allow the client to register for RDMA the memory pages that need to be written and allow the server to transfer the data to storage without causing interrupts on the client. This makes the client CPU fully available to the application again, which is a significant benefit in some situations.

Figure 9. Network-level DMA with handshake interrupts and without handshake interrupts

LNET supports RDMA, but currently a handshake at the operating system level is required to initiate the RDMA. The handshake exchanges the network-level DMA addresses to be used. The proposed change would eliminate the handshake and include the network-level DMA addresses in the initial request to transfer data, as shown in Figure 9.
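
A toy sketch, assuming nothing beyond what is described above, contrasts the current handshake flow with the proposed asynchronous one by counting client-side interrupts. The message names and function split are invented for this illustration and do not describe the actual LNET protocol.

```python
# Current scheme: the client is interrupted to exchange DMA addresses before the
# bulk RDMA starts. Proposed scheme: the addresses travel in the initial request,
# so the transfer completes without touching the client CPU again.

def write_with_handshake(pages):
    client_interrupts = 0
    flow = ["client->server: write request"]
    flow.append("server->client: request for the client's DMA addresses")
    client_interrupts += 1                       # the client must wake up and reply
    flow.append(f"client->server: DMA addresses for {len(pages)} pages")
    flow.append("server: RDMA-read pages and write them to storage")
    return client_interrupts, flow

def write_async(pages):
    dma_addresses = [f"dma:{i}" for i in range(len(pages))]
    flow = [f"client->server: write request carrying {dma_addresses}"]
    flow.append("server: RDMA-read pages and write them to storage")
    return 0, flow                               # no client interrupt during the transfer

print(write_with_handshake(["p0", "p1"])[0])     # 1 client interrupt
print(write_async(["p0", "p1"])[0])              # 0 client interrupts
```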

Conclusion

Lustre networking provides an exceptionally flexible and innovative infrastructure. Among the many features and benefits that have been discussed, the most significant are:

- Native support for all commonly used HPC networks
- Extremely fast data rates through RDMA and unparalleled TCP throughput
- Support for site-wide file systems through routing, eliminating the staging and copying of data between clusters
- Load-balancing router support to eliminate low-speed network bottlenecks

Lustre networking will continue to evolve with features to handle link aggregation, server-driven QoS, a rich control interface for large routed networks, and asynchronous I/O without interrupts.

Legal Disclaimer

Lustre is a registered trademark of Cluster File Systems, Inc., and LNET is a trademark of Cluster File Systems, Inc. Other product names are the trademarks of their owners. Although CFS strives for accuracy, we reserve the right to change, postpone, or eliminate features at our sole discretion.