Considering Middleware Options
Considering Middleware Options in High-Performance Computing Clusters

Middleware is a critical component for the development and porting of parallel-processing applications in distributed high-performance computing (HPC) cluster infrastructures. This article describes the evolution of the Message Passing Interface (MPI) standard specification as well as the open source and commercial MPI implementations that can be used to enhance Dell HPC cluster environments.

BY RINKU GUPTA, MONICA KASHYAP, YUNG-CHIN FANG, AND SAEED IQBAL, PH.D.

High-performance computing (HPC) clusters, a popular platform for hosting distributed parallel-processing applications, comprise multiple standards-based servers connected to each other via network interconnects. A typical HPC cluster has a layered architecture, beginning at the hardware level and concluding with the application level, as shown in Figure 1. Servers reside at the lowest level of the architecture, and each server contributes computational power to the cluster. Servers are connected to each other by a network infrastructure, which may be based on standard Ethernet technologies (such as Fast Ethernet or Gigabit Ethernet1) or proprietary high-speed technologies (such as Myricom Myrinet or InfiniBand). On top of the hardware level are the operating system (OS) and the communication protocol libraries required by the specific interconnect (for example, TCP/IP for Ethernet or GM for Myrinet). This infrastructure helps provide the computational power of a supercomputer for parallel-processing applications.

To enhance this distributed infrastructure and ease the development and porting of parallel-processing applications, a layer of middleware is required. With the growth of parallel-processing application development, two programming models have evolved to provide middleware capabilities: the shared-memory programming model and the message-passing programming model.
The shared-memory programming model is based on the concept of a shared address space, in which data exchange is achieved by writing to the shared space. The message-passing programming model is based on the concept of a distributed address space, in which data exchange is achieved through explicit message passing. The Message Passing Interface (MPI) is the de facto message-passing standard today. This article focuses on the growth of the MPI standard as well as the open source and commercial implementations of MPI available for use on Dell HPC clusters.

Figure 1. HPC cluster layered architecture (layers, from top to bottom: Application, Middleware, Operating system, Communication protocol, Hardware)

Evolution of middleware libraries
As massively parallel processing (MPP) systems and clusters have gained popularity, organizations have developed middleware libraries for use with these powerful systems. Parallel Virtual Machine (PVM)2 was one of the first full-fledged middleware libraries. PVM is designed to allow a network of heterogeneous machines to appear logically to the user as a single, large parallel machine. PVM was initially developed in 1989 as a joint research effort between the University of Tennessee, Oak Ridge National Laboratory, and Emory University. In addition to providing primitives for sending and receiving messages, PVM implemented resource management, signal handling, and fault tolerance to help build a user environment for parallel processing. Because PVM was one of the first parallel-processing systems to provide portability across heterogeneous networks, the library was widely adopted by developers of parallel-processing applications. Both the popularity and the shortcomings of PVM provided great impetus for the development of the MPI specification.

Emergence of the MPI specification
The MPI standard3 specification was developed in 1993 by a diverse group of computer vendors, computer scientists, and software programmers who formed the MPI Forum. During the early 1990s, various vendors had been developing their own middleware. The MPI Forum set out to develop a practical, portable, efficient, and flexible standard for communication among nodes and for running parallel-processing applications on distributed memory architectures. MPI allows data to be moved between the nodes in a cluster by sending and receiving the data as messages, and this exchange of messages also allows the nodes in the cluster to be synchronized.

Note: The MPI specification is not a language. The specification comprises collections of subroutine application programming interfaces (APIs) that can be called by C and FORTRAN programs.

Wide acceptance of MPI has led to multiple implementations of the specification for a variety of distributed memory based clusters. For parallel-processing nodes with specialized networking hardware, native MPI implementations can enhance performance. These various implementations have allowed parallel MPI applications to be ported across a wide range of architectures.

1 This term does not connote an actual operating speed of 1 Gbps. For high-speed transmission, connection to a Gigabit Ethernet server and network infrastructure is required.
2 For more information about PVM, visit
Reprinted from Dell Power Solutions, February 2005. Copyright 2005 Dell Inc. All rights reserved.
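As the Note above states, MPI is a collection of subroutine APIs callable from C and FORTRAN. The following minimal C program is a hedged sketch of that message-passing style: rank 0 sends one integer that rank 1 receives. It assumes an installed MPI implementation (built with a compiler wrapper such as mpicc and launched with at least two processes via mpirun); the tag and payload values are illustrative only.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal sketch of MPI message passing: rank 0 sends one integer to
 * rank 1.  Build with an MPI compiler wrapper (e.g., mpicc) and launch
 * with at least two processes (e.g., mpirun -np 2 ./a.out). */
int main(int argc, char *argv[])
{
    int rank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);                 /* enter the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?       */

    if (rank == 0) {
        value = 42;                         /* illustrative payload      */
        /* Blocking send of one int to rank 1, message tag 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive of one int from rank 0, message tag 0 */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();                         /* leave the MPI environment */
    return 0;
}
```

Because MPI_Send and MPI_Recv here are blocking, the send on rank 0 and the matching receive on rank 1 also act as a simple synchronization point between the two processes.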
MPI implementations can be fine-tuned for a specific architecture and the interconnects on which they run, helping to optimize efficiency and provide high performance.

MPI 1.1 and 1.2 standard specifications
The MPI 1.1 and MPI 1.2 specifications introduced many subroutine APIs, which made applications much easier to write. These APIs included primitives for point-to-point communications and collective operations, and for creating process topologies and process groups. Point-to-point communications comprise communications between two nodes (for example, synchronous or asynchronous sending and receiving of messages). Collective operations comprise global communications between groups of nodes (for example, barriers that synchronize groups of nodes; broadcasts that send messages from one node to many nodes; and reduce, scatter, and gather operations). Figure 2 shows some of the subroutine primitives defined within the MPI 1.2 specification. A vendor can implement a subroutine as long as the primitive provided in the implementation conforms to the specification both syntactically and semantically.

MPI 2.0 standard specification
Released after MPI 1.2 had been widely accepted, MPI 2.0 made major changes to the MPI 1.2 specification. Some of the most significant enhancements offered by MPI 2.0 include the following:

Dynamic process management: Process management allows processes to be dynamically added and deleted. MPI 2.0 supports dynamic process management because many emerging message-passing applications (such as applications that require runtime assessment of the number and type of processes needed) require process control. By contrast, MPI 1.2 based applications are static; that is, no processes can be added to or deleted from an application after the application has been started.

One-sided communication operations: MPI 2.0 provides support for one-sided communication operations such as put and get. The put operation transfers data directly from the sender node's memory to the receiver, or target, node's memory; the get operation transfers data from the target node's memory to the caller node's memory.

Figure 2. Examples of MPI subroutine primitives

    Action                                        MPI command
    Send data (blocking) to a node                MPI_Send
    Receive data (blocking) from a node           MPI_Recv
    Broadcast data from one node to many nodes    MPI_Bcast

3 For more information about the MPI standard, visit www-unix.mcs.anl.gov/mpi.
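A hedged sketch of the one-sided put operation described above, using the MPI 2.0 window calls (MPI_Win_create, MPI_Win_fence, MPI_Put). It assumes an MPI-2 capable implementation and a two-process launch; the buffer name and the value 42 are illustrative only.

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch of MPI 2.0 one-sided communication: rank 0 "puts" an integer
 * directly into a window of memory exposed by rank 1, with no matching
 * receive call on rank 1.  Build with mpicc; run with two processes. */
int main(int argc, char *argv[])
{
    int rank, buf = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every rank exposes its local `buf` as a one-sided target window. */
    MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);              /* open an access epoch          */
    if (rank == 0) {
        int value = 42;                 /* illustrative payload          */
        /* Write `value` into rank 1's window at displacement 0. */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);              /* complete all pending puts     */

    if (rank == 1)
        printf("rank 1's buffer now holds %d\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Note that rank 1 never calls a receive routine: the fences bracket the epoch in which the origin process moves the data, which is exactly the one-sided contrast with the MPI 1.2 send/receive model.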
Other enhancements relate to extending collective communication operations and defining new nonblocking operations.4

Open source MPI implementations
MPICH,5 which is currently maintained by Argonne National Laboratory and Mississippi State University, is a freely available, portable implementation of MPI. The development of MPICH began in parallel with the development of the MPI specification, enabling the specification to address problems that would be faced by its implementers. Thus, a complete, portable, and efficient MPICH implementation was available when the MPI specification was formally released, allowing developers of parallel-processing applications to experiment with MPI almost immediately.

MPICH functionality
MPICH was designed with the following goals:

Maximum portability and reuse of code: In any implementation, a large amount of code is system independent. MPICH was designed to allow complex communication operations to be specified portably in terms of low-level primitives. The developers' intention was to maximize the amount of code that can be shared without compromising performance.

Fast porting to new architectures: Another design goal was to create a structure whereby MPICH could be ported to a new platform quickly and then gradually tuned for that platform by replacing parts of the shared code with platform-specific code.

Figure 3. MPICH layered architecture (layers, from top to bottom: MPI collective operations; MPI point-to-point communications; ADI; Channel interface; with multiple implementations of the ADI and of the channel interface beneath)

To achieve these goals, the MPICH implementation follows a layered architecture, as shown in Figure 3. At the top level of the hierarchy are primitives for the MPI collective operations. An example of a collective operation is broadcast (MPI_Bcast), wherein one source node can send the same data to multiple nodes within a group of nodes. These collective operations are implemented in MPICH by calling MPI point-to-point primitives such as send (MPI_Send) and receive (MPI_Recv). These point-to-point primitives call various other functions specified at lower levels of the hierarchy to carry out the actual sending and receiving using the communication protocol.

One of the lowest layers in the architecture is the abstract device interface (ADI), a mechanism designed to help achieve the goals of portability and performance. The ADI contains the communication protocol dependent code. All the MPI functions are implemented using the functions and macros defined at the ADI layer; hence, functions defined at levels higher than the ADI layer are portable. Having multiple implementations of the ADI helps provide portability and ease of implementation. Below the ADI layer is an additional low-level layer called the channel interface, which is designed to provide a mechanism for quickly porting MPICH to new environments. The channel interface comprises functions that provide the basic capability of sending data from one process to another.

MPICH thus offers an incremental approach to trading portability for performance. A vendor can start the porting process by creating a channel interface implementation, then expand that implementation to include additional, specialized ADI functionality. Moving upward in the MPICH architecture hierarchy increases the performance benefits of the implementation but decreases the portability of the same code for future implementations. The current releases of MPICH are based on the MPI 1.2 standard. MPICH2,6 which is now under development, is an all-new implementation of MPI that is intended to support research into the implementation of both MPI 1.2 and MPI 2.0.
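To make that layering concrete, the following plain-C sketch (hypothetical helper names, not MPICH source code) computes the kind of binomial-tree schedule with which a library can realize a broadcast purely from point-to-point operations: each non-root rank receives the data from exactly one parent and then forwards it to a set of children.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical sketch (not MPICH source): a binomial-tree broadcast
 * schedule of the kind an MPI library can build from point-to-point
 * sends and receives.  Rank 0 is the root; every other rank receives
 * the data exactly once, then forwards it down its own subtree. */

/* Rank from which `rank` would receive the broadcast (-1 for the root).
 * Clearing the lowest set bit of a rank yields its tree parent. */
int bcast_parent(int rank)
{
    return rank == 0 ? -1 : rank & (rank - 1);
}

/* Fills `peers` with the ranks this rank would send to, in round order,
 * and returns how many there are.  `size` is the number of ranks. */
int bcast_children(int rank, int size, int peers[])
{
    int n = 0;
    /* Children are rank + 2^k for every power of two below the rank's
     * lowest set bit (the root, rank 0, sends for every power of two). */
    int lsb = rank ? (rank & -rank) : size;
    for (int m = 1; m < lsb && rank + m < size; m <<= 1)
        peers[n++] = rank + m;
    return n;
}
```

For example, with eight ranks the root sends to ranks 1, 2, and 4, and rank 4 forwards to ranks 5 and 6; because every rank with the data keeps sending in each round, the whole broadcast completes in about log2(size) point-to-point steps, which is why a layered implementation loses little performance by building collectives on top of send and receive.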
MPICH variations
MPICH has been widely adapted by various vendors and has served as the basis for MPI-related research projects at many universities and research institutions. The following sections discuss popular MPICH variations adapted for commonly used high-speed interconnects.

4 A detailed discussion of the MPI specifications is beyond the scope of this article. For more information, refer to the MPI specifications at www-unix.mcs.anl.gov/mpi.
5 For MPICH papers and implementation details, visit www-unix.mcs.anl.gov/mpi/mpich.
6 For more information about MPICH2, visit www-unix.mcs.anl.gov/mpi/mpich2.
MPICH-GM (MPI on Myrinet). Myricom Myrinet7 is a high-speed, low-latency, high-bandwidth interconnect used in HPC clusters. The GM8 protocol is a low-level message-passing communication protocol designed for Myrinet networks. Myrinet is theoretically capable of providing unidirectional throughput of up to 2 Gbps with low latency. Low latency is critical for communication-intensive applications because less time is spent on communication overhead, leaving more time for computation. This high performance is achievable on Myrinet networks because GM is a user-level protocol, which bypasses the OS while sending and receiving messages once the initial connection has been established. MPICH-GM,9 a port of MPICH on top of GM, is the MPI implementation for the GM protocol. The port is accomplished by creating a new GM device at the ADI and channel interface levels of MPICH. In this way, MPICH-GM offers a portable, efficient implementation of MPI that applications can use to take advantage of the performance offered by the low-level Myrinet hardware and the GM protocol. MPICH-GM works on a variety of operating systems, including Linux, Solaris, FreeBSD, and Mac OS X, and on many architectures, including IA-32 and IA-64; it is fully supported by Myricom.

MVAPICH (MPI on InfiniBand). The InfiniBand architecture is a standard that defines a high-speed network for interprocess communication and storage I/O. Its low-latency, high-bandwidth capabilities and remote direct memory access (RDMA) features accelerate applications running in HPC and enterprise environments. MVAPICH is an open source MPI 1.2 implementation developed by The Ohio State University and is based on the Verbs API (VAPI) implementation by Mellanox Technologies. MVAPICH is likewise a port of MPICH, onto the VAPI layer, carried out by creating a VAPI device at the ADI level of MPICH.
Other open source MPI implementations
In addition to the MPICH implementations, other implementations of the MPI standard exist. Local Area Multicomputer (LAM)10 is an open source implementation of the MPI standard. LAM originated at the Ohio Supercomputer Center and is now maintained by the Open Systems Laboratory at Indiana University. Like other MPI implementations, LAM/MPI provides high performance on many platforms, even on heterogeneous clusters of workstations.

Commercial MPI implementations
Commercial MPI implementations are available for a wide range of hardware and are produced by many vendors. The following sections briefly discuss some of the popular commercial MPI implementations that enterprises can run on Dell HPC clusters.

Verari MPI/Pro. MPI/Pro11 is a proprietary, commercially supported MPI 1.2 implementation developed by Verari Systems. It is one of the most popular commercial implementations and is supported on both Microsoft Windows and Red Hat Linux operating systems. MPI/Pro features include low CPU overhead and thread safety. The implementation supports TCP, symmetric multiprocessing (SMP), and Myrinet and InfiniBand drivers for Windows and Linux.

Verari ChaMPIon/Pro. ChaMPIon/Pro12 is a full MPI 2.0 implementation available for Linux. ChaMPIon/Pro supports Myrinet, InfiniBand, and Quadrics network interconnects as well as TCP/IP protocols. This MPI implementation supports major MPI 2.0 enhancements, including extended collective operations, dynamic process management, and one-sided communication APIs.

Scali MPI Connect. Scali offers an MPI implementation called Scali MPI Connect.13 Scali's integrated architecture enables third-party applications to be compiled once to run on the various leading interconnect technologies.
The implementation is designed to allow binary programs that are linked with Scali MPI Connect to run, without recompilation or relinking, on any of the supported interconnects: Gigabit Ethernet, Myrinet, Dolphin Interconnect scalable coherent interface (SCI), or InfiniBand. Whether the cluster is built using one of these interconnects or a combination thereof, applications and users interact only with Scali MPI Connect.

Middleware: A key component for HPC cluster performance
Middleware implementations, both commercial and open source, are significant components in HPC cluster configurations. The widely accepted MPI standard has enabled a diverse set of implementations that are designed to enhance performance and ease the development and porting of parallel-processing applications in distributed computing infrastructures.

Rinku Gupta is a systems engineer and advisor in the Scalable Systems Group at Dell. Her current research interests are middleware libraries, parallel processing, performance, and interconnect benchmarking. Rinku has a B.E. in Computer Engineering from Mumbai University in India and an M.S. in Computer Information Science from The Ohio State University.

7, 8, 9 For more information about Myrinet, GM, and MPICH-GM, visit
10 For more information about LAM/MPI, visit
11, 12 For more information about Verari Systems MPI/Pro and ChaMPIon/Pro, visit
13 For more information about Scali MPI Connect, visit
Monica Kashyap is a senior systems engineer in the Scalable Systems Group at Dell. Her current interests and responsibilities include in-band and out-of-band cluster management, cluster computing packages, and product development. She has a B.S. in Applied Science and Computer Engineering from the University of North Carolina at Chapel Hill.

Yung-Chin Fang is a senior consultant in the Scalable Systems Group at Dell. He specializes in cyberinfrastructure resource management and high-performance computing. He also participates in open source groups and standards organizations as a Dell representative. Yung-Chin has a B.S. in Computer Science from Tamkang University and an M.S. in Computer Science from Utah State University.

Saeed Iqbal, Ph.D., is a systems engineer and advisor in the Scalable Systems Group at Dell. His current work involves evaluation of resource managers and job schedulers used for commodity clusters. Saeed is also involved in performance analysis and system design of clusters. He has a Ph.D. in Computer Engineering from The University of Texas at Austin, and an M.S. in Computer Engineering and a B.S. in Electrical Engineering from the University of Engineering and Technology in Lahore, Pakistan.
More informationMellanox Cloud and Database Acceleration Solution over Windows Server 2012 SMB Direct
Mellanox Cloud and Database Acceleration Solution over Windows Server 2012 Direct Increased Performance, Scaling and Resiliency July 2012 Motti Beck, Director, Enterprise Market Development Motti@mellanox.com
More informationA Tour of the Linux OpenFabrics Stack
A Tour of the OpenFabrics Stack Johann George, QLogic June 2006 1 Overview Beyond Sockets Provides a common interface that allows applications to take advantage of the RDMA (Remote Direct Memory Access),
More informationSolving I/O Bottlenecks to Enable Superior Cloud Efficiency
WHITE PAPER Solving I/O Bottlenecks to Enable Superior Cloud Efficiency Overview...1 Mellanox I/O Virtualization Features and Benefits...2 Summary...6 Overview We already have 8 or even 16 cores on one
More informationOpenMosix Presented by Dr. Moshe Bar and MAASK [01]
OpenMosix Presented by Dr. Moshe Bar and MAASK [01] openmosix is a kernel extension for single-system image clustering. openmosix [24] is a tool for a Unix-like kernel, such as Linux, consisting of adaptive
More informationRoCE vs. iwarp Competitive Analysis
WHITE PAPER August 21 RoCE vs. iwarp Competitive Analysis Executive Summary...1 RoCE s Advantages over iwarp...1 Performance and Benchmark Examples...3 Best Performance for Virtualization...4 Summary...
More informationWindows TCP Chimney: Network Protocol Offload for Optimal Application Scalability and Manageability
White Paper Windows TCP Chimney: Network Protocol Offload for Optimal Application Scalability and Manageability The new TCP Chimney Offload Architecture from Microsoft enables offload of the TCP protocol
More informationSupercomputing on Windows. Microsoft (Thailand) Limited
Supercomputing on Windows Microsoft (Thailand) Limited W hat D efines S upercom puting A lso called High Performance Computing (HPC) Technical Computing Cutting edge problems in science, engineering and
More informationNetwork Performance in High Performance Linux Clusters
Network Performance in High Performance Linux Clusters Ben Huang, Michael Bauer, Michael Katchabaw Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 (huang
More informationHigh Speed I/O Server Computing with InfiniBand
High Speed I/O Server Computing with InfiniBand José Luís Gonçalves Dep. Informática, Universidade do Minho 4710-057 Braga, Portugal zeluis@ipb.pt Abstract: High-speed server computing heavily relies on
More informationAchieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks
WHITE PAPER July 2014 Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks Contents Executive Summary...2 Background...3 InfiniteGraph...3 High Performance
More informationOptimizing the Virtual Data Center
Optimizing the Virtual Center The ideal virtual data center dynamically balances workloads across a computing cluster and redistributes hardware resources among clusters in response to changing needs.
More informationIntroduction to Virtual Machines
Introduction to Virtual Machines Introduction Abstraction and interfaces Virtualization Computer system architecture Process virtual machines System virtual machines 1 Abstraction Mechanism to manage complexity
More informationIntroduction to grid technologies, parallel and cloud computing. Alaa Osama Allam Saida Saad Mohamed Mohamed Ibrahim Gaber
Introduction to grid technologies, parallel and cloud computing Alaa Osama Allam Saida Saad Mohamed Mohamed Ibrahim Gaber OUTLINES Grid Computing Parallel programming technologies (MPI- Open MP-Cuda )
More informationA Survey on Availability and Scalability Requirements in Middleware Service Platform
International Journal of Computer Sciences and Engineering Open Access Survey Paper Volume-4, Issue-4 E-ISSN: 2347-2693 A Survey on Availability and Scalability Requirements in Middleware Service Platform
More informationRLX Technologies Server Blades
Jane Wright Product Report 10 July 2003 RLX Technologies Server Blades Summary RLX Technologies has designed its product line to support parallel applications with high-performance compute clusters of
More informationQUADRICS IN LINUX CLUSTERS
QUADRICS IN LINUX CLUSTERS John Taylor Motivation QLC 21/11/00 Quadrics Cluster Products Performance Case Studies Development Activities Super-Cluster Performance Landscape CPLANT ~600 GF? 128 64 32 16
More informationCluster Implementation and Management; Scheduling
Cluster Implementation and Management; Scheduling CPS343 Parallel and High Performance Computing Spring 2013 CPS343 (Parallel and HPC) Cluster Implementation and Management; Scheduling Spring 2013 1 /
More information1 Organization of Operating Systems
COMP 730 (242) Class Notes Section 10: Organization of Operating Systems 1 Organization of Operating Systems We have studied in detail the organization of Xinu. Naturally, this organization is far from
More informationAgenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC
HPC Architecture End to End Alexandre Chauvin Agenda HPC Software Stack Visualization National Scientific Center 2 Agenda HPC Software Stack Alexandre Chauvin Typical HPC Software Stack Externes LAN Typical
More informationbenchmarking Amazon EC2 for high-performance scientific computing
Edward Walker benchmarking Amazon EC2 for high-performance scientific computing Edward Walker is a Research Scientist with the Texas Advanced Computing Center at the University of Texas at Austin. He received
More informationIntroduction to High Performance Cluster Computing. Cluster Training for UCL Part 1
Introduction to High Performance Cluster Computing Cluster Training for UCL Part 1 What is HPC HPC = High Performance Computing Includes Supercomputing HPCC = High Performance Cluster Computing Note: these
More informationArchitecting Low Latency Cloud Networks
Architecting Low Latency Cloud Networks Introduction: Application Response Time is Critical in Cloud Environments As data centers transition to next generation virtualized & elastic cloud architectures,
More informationWorkshare Process of Thread Programming and MPI Model on Multicore Architecture
Vol., No. 7, 011 Workshare Process of Thread Programming and MPI Model on Multicore Architecture R. Refianti 1, A.B. Mutiara, D.T Hasta 3 Faculty of Computer Science and Information Technology, Gunadarma
More informationInformatica Ultra Messaging SMX Shared-Memory Transport
White Paper Informatica Ultra Messaging SMX Shared-Memory Transport Breaking the 100-Nanosecond Latency Barrier with Benchmark-Proven Performance This document contains Confidential, Proprietary and Trade
More informationLaPIe: Collective Communications adapted to Grid Environments
LaPIe: Collective Communications adapted to Grid Environments Luiz Angelo Barchet-Estefanel Thesis Supervisor: M Denis TRYSTRAM Co-Supervisor: M Grégory MOUNIE ID-IMAG Laboratory Grenoble - France LaPIe:
More informationAdvancing Applications Performance With InfiniBand
Advancing Applications Performance With InfiniBand Pak Lui, Application Performance Manager September 12, 2013 Mellanox Overview Ticker: MLNX Leading provider of high-throughput, low-latency server and
More informationMaking Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association
Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?
More informationLS-DYNA Scalability on Cray Supercomputers. Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp.
LS-DYNA Scalability on Cray Supercomputers Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp. WP-LS-DYNA-12213 www.cray.com Table of Contents Abstract... 3 Introduction... 3 Scalability
More informationThe EMSX Platform. A Modular, Scalable, Efficient, Adaptable Platform to Manage Multi-technology Networks. A White Paper.
The EMSX Platform A Modular, Scalable, Efficient, Adaptable Platform to Manage Multi-technology Networks A White Paper November 2002 Abstract: The EMSX Platform is a set of components that together provide
More informationWhere IT perceptions are reality. Test Report. OCe14000 Performance. Featuring Emulex OCe14102 Network Adapters Emulex XE100 Offload Engine
Where IT perceptions are reality Test Report OCe14000 Performance Featuring Emulex OCe14102 Network Adapters Emulex XE100 Offload Engine Document # TEST2014001 v9, October 2014 Copyright 2014 IT Brand
More information10Gb Ethernet: The Foundation for Low-Latency, Real-Time Financial Services Applications and Other, Latency-Sensitive Applications
10Gb Ethernet: The Foundation for Low-Latency, Real-Time Financial Services Applications and Other, Latency-Sensitive Applications Testing conducted by Solarflare and Arista Networks reveals single-digit
More informationSolid State Storage in Massive Data Environments Erik Eyberg
Solid State Storage in Massive Data Environments Erik Eyberg Senior Analyst Texas Memory Systems, Inc. Agenda Taxonomy Performance Considerations Reliability Considerations Q&A Solid State Storage Taxonomy
More informationPerformance Evaluation of InfiniBand with PCI Express
Performance Evaluation of InfiniBand with PCI Express Jiuxing Liu Server Technology Group IBM T. J. Watson Research Center Yorktown Heights, NY 1598 jl@us.ibm.com Amith Mamidala, Abhinav Vishnu, and Dhabaleswar
More informationLecture 2 Parallel Programming Platforms
Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple
More informationInfiniBand -- Industry Standard Data Center Fabric is Ready for Prime Time
White Paper InfiniBand -- Industry Standard Data Center Fabric is Ready for Prime Time December 2005 Server and storage clusters benefit today from industry-standard InfiniBand s price, performance, stability,
More informationBig data management with IBM General Parallel File System
Big data management with IBM General Parallel File System Optimize storage management and boost your return on investment Highlights Handles the explosive growth of structured and unstructured data Offers
More informationApplications of Passive Message Logging and TCP Stream Reconstruction to Provide Application-Level Fault Tolerance. Sunny Gleason COM S 717
Applications of Passive Message Logging and TCP Stream Reconstruction to Provide Application-Level Fault Tolerance Sunny Gleason COM S 717 December 17, 2001 0.1 Introduction The proliferation of large-scale
More informationSMB Advanced Networking for Fault Tolerance and Performance. Jose Barreto Principal Program Managers Microsoft Corporation
SMB Advanced Networking for Fault Tolerance and Performance Jose Barreto Principal Program Managers Microsoft Corporation Agenda SMB Remote File Storage for Server Apps SMB Direct (SMB over RDMA) SMB Multichannel
More information10G Ethernet: The Foundation for Low-Latency, Real-Time Financial Services Applications and Other, Future Cloud Applications
10G Ethernet: The Foundation for Low-Latency, Real-Time Financial Services Applications and Other, Future Cloud Applications Testing conducted by Solarflare Communications and Arista Networks shows that
More informationDistributed RAID Architectures for Cluster I/O Computing. Kai Hwang
Distributed RAID Architectures for Cluster I/O Computing Kai Hwang Internet and Cluster Computing Lab. University of Southern California 1 Presentation Outline : Scalable Cluster I/O The RAID-x Architecture
More informationPerformance Monitoring on an HPVM Cluster
Performance Monitoring on an HPVM Cluster Geetanjali Sampemane geta@csag.ucsd.edu Scott Pakin pakin@cs.uiuc.edu Department of Computer Science University of Illinois at Urbana-Champaign 1304 W Springfield
More informationIntroduction. Need for ever-increasing storage scalability. Arista and Panasas provide a unique Cloud Storage solution
Arista 10 Gigabit Ethernet Switch Lab-Tested with Panasas ActiveStor Parallel Storage System Delivers Best Results for High-Performance and Low Latency for Scale-Out Cloud Storage Applications Introduction
More informationXgrid. The simple solution for distributed computing. Features
Xgrid The simple solution for distributed computing. Features Comes built into both Mac OS X and Mac OS X Server Lets you harness the underutilized power of your own workgroup s Mac computers or volunteered
More informationHigh Performance Data-Transfers in Grid Environment using GridFTP over InfiniBand
High Performance Data-Transfers in Grid Environment using GridFTP over InfiniBand Hari Subramoni *, Ping Lai *, Raj Kettimuthu **, Dhabaleswar. K. (DK) Panda * * Computer Science and Engineering Department
More informationA Micro-benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks
A Micro-benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks Xiaoyi Lu, Md. Wasi- ur- Rahman, Nusrat Islam, and Dhabaleswar K. (DK) Panda Network- Based Compu2ng Laboratory Department
More informationImpact of Latency on Applications Performance
Impact of Latency on Applications Performance Rossen Dimitrov and Anthony Skjellum {rossen, tony}@mpi-softtech.com MPI Software Technology, Inc. 11 S. Lafayette Str,. Suite 33 Starkville, MS 39759 Tel.:
More informationIntroduction to MPIO, MCS, Trunking, and LACP
Introduction to MPIO, MCS, Trunking, and LACP Sam Lee Version 1.0 (JAN, 2010) - 1 - QSAN Technology, Inc. http://www.qsantechnology.com White Paper# QWP201002-P210C lntroduction Many users confuse the
More informationTechnical Computing Suite Job Management Software
Technical Computing Suite Job Management Software Toshiaki Mikamo Fujitsu Limited Supercomputer PRIMEHPC FX10 PRIMERGY x86 cluster Outline System Configuration and Software Stack Features The major functions
More informationBlock based, file-based, combination. Component based, solution based
The Wide Spread Role of 10-Gigabit Ethernet in Storage This paper provides an overview of SAN and NAS storage solutions, highlights the ubiquitous role of 10 Gigabit Ethernet in these solutions, and illustrates
More information