Distributed Symmetric Multi-Processing (DSMP)

Symmetric Computing
Venture Development Center
University of Massachusetts Boston
100 Morrissey Boulevard
Boston, MA 02125 USA
Introduction

Most High Performance Computing (HPC) today is done on supercomputing clusters. However, to achieve the performance that HPC clusters promise, applications must be tailored for the architecture. Over time, a library of cluster-ready HPC applications has been developed; however, many important end-user applications remain unaddressed. Entire fields of scientific endeavor that could benefit from high performance computing have yet to do so, due primarily to the programming complexity of HPC clusters.

Most scientists, researchers, engineers and analysts like to focus on their specialty, get their computations executed quickly, and avoid becoming entangled in the computer science complexities that supercomputing clusters demand. They typically develop their work on Symmetric Multi-Processing (SMP) workstations -- SMP computers align more closely with their computer skills -- and only need supercomputers for their more demanding applications and data sets. Scientists, researchers, engineers and analysts would rather not re-write their applications for a supercomputing cluster and would prefer to use SMP supercomputers. However, SMP supercomputers have been out of reach economically for many. They have been too expensive due to their reliance on costly proprietary hardware: Application Specific Integrated Circuits (ASICs) and/or Field Programmable Gate Arrays (FPGAs).

Recently, new software-based SMP supercomputers have emerged. Software-based SMP supercomputers are economical. They can be made from the same lower cost, non-proprietary, off-the-shelf server nodes that have always made non-SMP supercomputing clusters attractive. This new software makes homogeneous HPC clusters act and behave as SMP supercomputers.

The current generation of software-based SMP supercomputers built from homogeneous HPC clusters relies on hypervisor implementations -- that is, a layer of software that executes above the operating system. This added layer of software adversely affects performance, limiting its applicability to High Performance Computing. A new generation of software-based SMP supercomputers promises the performance that HPC applications demand. This next generation of software-based SMP supercomputers is implemented at the operating system level, which is inherently faster than a hypervisor. Symmetric Computing's patent-pending distributed shared memory extension to the Linux kernel delivers the economies and performance of supercomputing clusters, but with the programming simplicity of SMP. This technological leap is truly disruptive, as it brings the benefits of supercomputing to heretofore unreached fields of research, development and analysis.
Supercomputing Cluster Limitations

Computations assigned to a cluster of networked systems must be carefully structured to accommodate the limitations of the individual server nodes that make up the cluster. In many cases, highly skilled programmers are needed to modify code to accommodate the cluster's hierarchy, per-node memory limitation and messaging scheme. Lastly, once the data-sets and program are ready to run, they must first be propagated onto each and every node within the cluster. Only then can actual work begin.

Besides complexity, there are entire classes of applications and data sets that are inappropriate for supercomputing clusters. Many high performance computing applications (e.g., genomic sequencing, coupled engineering models) invoke large data sets and require large shared memory (512-GB RAM or more). The language-independent communications protocol Message Passing Interface, or MPI, is the de facto standard for supercomputing clusters. Addressing problems with big data-sets is impractical with the MPI-1 model, for it has no shared memory concept, and MPI-2/MPICH has only a limited distributed shared memory concept with a significant latency penalty (>40 µ-seconds). Hence, an MPICH cluster based upon individual server nodes, each with less than 512-GB RAM, cannot support problems with very large data sets without significant restructuring of the application and associated data sets. Even after a successful port, many programs suffer poor performance due to system hierarchy and message latency.

Data-sets in many fields (e.g., bioinformatics & life sciences, Computer-Aided Engineering (CAE), energy, earth sciences, financial analyses) are becoming too large for the Symmetric Multi-Processing (SMP) nodes that make up the cluster. In many cases, it is impractical and inefficient to distribute the large data sets. The alternatives are to:

- Restructure the problem (to fit within the memory limitations of the nodes) and suffer the resulting inefficiencies;
- Purchase a memory appliance to accommodate the data-set and suffer the increased latency (>20 µ-seconds); or,
- Wait for and purchase time on a University or National Lab's SMP supercomputer.

Each of these options has its own drawbacks, ranging from latency & performance to lengthy queues for time on a government (e.g., NSF, DoE) supercomputer. What HPC users really want is unencumbered access to an affordable, large shared-memory SMP supercomputer.
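To make the restructuring burden concrete, the sketch below contrasts the two programming models for a trivial task (summing a large array). It is an illustrative example only, not drawn from any particular application: in a shared-memory model every thread simply indexes the same array, whereas an MPI port must explicitly partition, distribute and re-gather the data.

/* Illustrative sketch (not from any DSMP application): summing a large
 * array with a shared-memory model versus message passing.
 * Compile with `cc -O2 -fopenmp sum.c`. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    long n = 1L << 24;                        /* 16M doubles (~128 MB)  */
    double *data = malloc(n * sizeof(double));
    if (!data) return 1;
    for (long i = 0; i < n; i++) data[i] = 1.0;

    /* Shared-memory model: every thread indexes the same array directly. */
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; i++)
        total += data[i];

    printf("sum = %.0f\n", total);
    free(data);
    return 0;
}

/* The equivalent MPI program must explicitly partition the array,
 * MPI_Scatter() a chunk to every node, compute partial sums and
 * MPI_Reduce() the partials back to the root -- and the whole data set
 * must fit on, or be staged through, the individual nodes. */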
Graphic Processor Units

A variant of supercomputing is the supercomputer enhanced with Graphic Processor Units (GPUs), typically in the form of GPU PCIe adapters. GPUs are ASICs designed to rapidly manipulate and alter memory to accelerate the building of images in a frame buffer intended for output to a display. They are very efficient at manipulating computer graphics, and their highly parallel structure makes them effective for algorithms where the processing of large blocks of data is done in parallel.

GPUs can add performance to a supercomputer, but only if applications are redesigned for the architecture, whether one is adding GPU adapters to an SMP computer [1] or adding a shelf of GPU adapters to a supercomputing cluster. Performance test results from independent test laboratories indicate that adding GPU adapters can actually slow performance if the application is not re-designed. Redesigning applications for a supercomputer with GPU adapters is a non-trivial task, and once redesigned, one has custom software that can carry significant ongoing software maintenance issues. GPU adapters require separate memory spaces, and data must be copied between those memory spaces across the I/O bus. Multiple GPU cards can easily tax the PCIe bus if the application is not architected properly. So, supercomputers enhanced with GPU adapters are inappropriate for HPC applications with large memory requirements.

GPUs have attractive single-precision, no-error-correction Giga-Floating Point Operations Per Second (GFLOPS) ratings. However, those ratings are meaningless for scientific, engineering and financial calculations. What are of interest are the double-precision, error-correction ratings. When one considers the cost per double-precision, error-correction GFLOPS, general purpose Central Processor Units (CPUs) like the AMD Opteron 6200 series and Intel Xeon E7 processors are comparable to current GPU ASICs like the Nvidia GF100. Comparable cost per double-precision, error-correction GFLOPS performance of CPUs and GPUs indicates that GPUs may have limited applicability for scientific, engineering and financial HPC computations, particularly when general purpose CPUs are easier to program and support.

[1] If you have a GPU-enabled application or you are committed to redesigning your application(s), GPU adapters can be added to the spare PCIe slots of Symmetric Computing supercomputers.
Distributed Symmetric Multi-Processing (DSMP)

Symmetric Computing's patent-pending Distributed Symmetric Multi-Processing (DSMP) provides affordable SMP supercomputing. DSMP software enables Distributed Shared Memory (DSM), or Distributed Global Address Space (DGAS), across an InfiniBand-connected cluster of homogeneous Symmetric Multiprocessing (SMP) nodes. Homogeneous InfiniBand-connected clusters are converted into a DSM/DGAS supercomputer which can service very large data-sets or accommodate legacy MPI/MPICH applications with increased efficiency and throughput via application-utilities (gasket interface) that support MPI over shared memory. DSMP can displace and, in some cases, obsolete message-passing as the protocol of choice for a wide range of memory-intensive applications because it can economically service a wider class of problems with greater efficiency.

DSMP creates a large, shared-memory software architecture at the operating system level. From the programmer's perspective, there is a single software image and one operating system (Linux kernel) for a DSMP-enabled homogeneous cluster. And DSMP executes on state-of-the-art, off-the-shelf (OTS) Symmetric Multiprocessing (SMP) servers. Employing a Linux kernel extension with off-the-shelf server hardware, Symmetric Computing's DSMP provides large shared-memory, multiple-core SMP computing with both economy and performance. The table below compares MPI clusters, SMP supercomputers and DSMP-enabled clusters.

                                                     MPI Cluster   SMP Supercomputer   DSMP Cluster
Off-the-Shelf Hardware                               Yes           No                  Yes
Affordability                                        $             $$$$$               $
Shared Memory                                        No            Yes                 Yes
Single Software Image                                No            Yes                 Yes
Intelligent Platform Management Interface            Yes           Yes                 Yes
Virtualization Support                               No            Yes                 No
Static Partition for Multiple Applications Support   No            Yes                 Planned
Scalable                                             Yes           No                  Yes

The key features of the DSMP architecture are:

- A DSMP Linux kernel-level enhancement that results in significantly lower latency and improved bandwidths, making a DSM/DGAS architecture both practical and possible;
- A transactional distributed shared-memory system that allows the architecture to scale to thousands of nodes with little to no impact on global memory access times;
- An intelligent, optimized InfiniBand driver that leverages the HCA's native Remote Direct Memory Access (RDMA) logic, reducing global memory page access times to under 5 µ-seconds; and,
- An application-driven, memory-page coherency scheme that simplifies porting applications and allows the programmer to focus on optimizing performance rather than spending endless hours re-writing the application to accommodate the limitations of the message-passing interface.

How DSMP Works

DSMP is an OS kernel enhancement that transforms an InfiniBand-connected cluster of homogeneous off-the-shelf SMP servers into a shared-memory supercomputer. The DSMP kernel provides six (6) enhancements that enable this transformation:

1. A transactional distributed shared-memory system
2. Optimized InfiniBand drivers
3. An application-driven, memory-page coherency scheme
4. An enhanced distributed multi-threading service
5. Support for a distributed MuTeX
6. Distributed buffered file I/O

Transactional Distributed Shared-Memory System: The centerpiece of Symmetric Computing's patent-pending DSMP Linux kernel extension is its transactional distributed shared-memory architecture, which is based on a two-tier memory-page hierarchy (Local and Global) and on tables that support the transactional memory architecture. The transactional global shared memory is concurrently accessible and available to all the processors in the system. This memory is uniformly distributed over each server in the cluster and is indexed via tables, which maintain a linear view of the global memory. Local memory contains copies of what is in global memory, acting as a memory-page cache and providing temporal locality for program and data. All processor code execution and data reads occur only from the local memory (cache).

DSMP divides physical memory into two partitions: local working memory and global shared memory. The global partitions on each node are combined to form a single global shared memory that is linearly addressable by all nodes in a consistent manner. The global memory maintains a reference copy of each 4096-byte memory page used by any program running on the system at a fixed address. The local memory contains a subset of the total memory pages used by the running program. 4096-byte memory pages are copied from global memory to local memory when needed by the executing program. Any changes made to local memory pages are written back to the global memory.
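The user-space sketch below is a conceptual model of the mechanism just described; it is not Symmetric Computing's kernel code, and the pool sizes, function names and eviction details are assumptions made for illustration. A small set of local 4096-byte page frames acts as a cache over a larger global address space: a miss evicts a least-recently-used frame (writing it back to global memory if it was modified) and copies the requested global page in.

/* Conceptual user-space model of DSMP's two-tier memory (illustration
 * only; the real mechanism lives in the Linux kernel and moves pages
 * over InfiniBand RDMA). */
#include <string.h>
#include <stdint.h>

#define PAGE_SIZE    4096
#define LOCAL_PAGES  8           /* tiny "local memory" (cache)          */
#define GLOBAL_PAGES 64          /* tiny "global shared memory"          */

static uint8_t global_mem[GLOBAL_PAGES][PAGE_SIZE]; /* reference copies  */
static uint8_t local_mem[LOCAL_PAGES][PAGE_SIZE];   /* page cache        */
static int     resident[LOCAL_PAGES];  /* which global page is cached    */
static int     dirty[LOCAL_PAGES];     /* modified since it was copied?  */
static long    last_used[LOCAL_PAGES]; /* for LRU eviction               */
static long    clock_ticks;

static void dsmp_init(void)
{
    for (int i = 0; i < LOCAL_PAGES; i++) resident[i] = -1;  /* empty */
}

/* Return a pointer to the local copy of global page `gpage`,
 * faulting it in (and evicting the LRU frame) if it is not resident. */
static uint8_t *dsmp_touch(int gpage, int will_write)
{
    int victim = 0;
    for (int i = 0; i < LOCAL_PAGES; i++) {
        if (resident[i] == gpage) {              /* hit: already local   */
            last_used[i] = ++clock_ticks;
            dirty[i] |= will_write;
            return local_mem[i];
        }
        if (last_used[i] < last_used[victim])    /* track the LRU frame  */
            victim = i;
    }
    /* Miss: write back the LRU frame if modified, then swap the page in. */
    if (dirty[victim])
        memcpy(global_mem[resident[victim]], local_mem[victim], PAGE_SIZE);
    memcpy(local_mem[victim], global_mem[gpage], PAGE_SIZE);
    resident[victim]  = gpage;
    dirty[victim]     = will_write;
    last_used[victim] = ++clock_ticks;
    return local_mem[victim];
}

int main(void)
{
    dsmp_init();
    uint8_t *p = dsmp_touch(42, 1);  /* fault global page 42 into the cache */
    p[0] = 7;                        /* all reads/writes hit the local copy */
    return 0;
}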
[Figure: data-flow view of a Trio Departmental Supercomputer -- 1.44-TB global transactional shared memory in 4096-byte pages; three nodes, each with 32-GB local memory (cache, 64-byte cache lines); nodes connected via PCIe InfiniBand links (IB 1-3) and GbE.]

Shown above is a data-flow view of a Trio Departmental Supercomputer with 192 processor cores and 1.54-TB of RAM. It is made from three homogeneous 4P servers, each with 64 cores (using 16-core AMD Opteron 6200 series processors) and 512 GB of physical memory per node. The three nodes in a Trio are connected via 40 Gbps InfiniBand PCIe adapters. [Note: A Trio is a 3-node DSMP configuration that does not need a costly InfiniBand switch for its high-speed interconnect.]

DSMP sets the size of local memory at boot time; it is typically 32 GB. When there is a page-fault in local memory, the DSMP kernel finds an appropriate least recently used (LRU) 4096-byte memory-page in local memory and swaps in the missing global-memory page; this happens in under 5 µ-seconds. The large, temporally local memory (cache) provides all the performance benefits (STREAMS, RandomAccess and Linpack) of local memory in a legacy SMP server, with the added benefit of spatial locality to a large globally shared memory that is less than 5 µ-seconds away. Not only is this architecture unique and extremely powerful, but it can scale to hundreds and even thousands of nodes with no appreciable loss in performance.

Optimized InfiniBand Drivers: DSMP is made possible by the advent of a low latency, commercial, off-the-shelf network fabric. It wasn't that long ago, with the exit of Intel from the InfiniBand consortium, that industry experts were forecasting its demise. Today InfiniBand is the fabric of choice for most supercomputing clusters due to its low latency and high bandwidth.
To squeeze every last nanosecond of performance out of the fabric, DSMP bypasses the Linux InfiniBand protocol stack with its own low-level drivers. The DSMP set of drivers leverages the native RDMA capabilities of the InfiniBand host channel adapter (HCA). This allows the HCA to service and move memory-page requests without processor intervention. Hence, RDMA eliminates the overhead of message construction and deconstruction, reducing system-wide latency.

Application-Driven, Memory Page Coherency Scheme: All proprietary supercomputers maintain memory consistency and/or coherency via a hardware extension of the host processor's coherency scheme. DSMP, being a software solution based on local and global memory resources, takes a different approach. Coherency within the local memory of each of the individual SMP servers is maintained by the AMD64 Memory Management Unit (MMU) on a cache-line basis. Memory consistency between a page in global memory and all copies of the page in local memory is maintained and controlled by an OpenMP Application Programming Interface (API) with a set of memory-page lock operations similar to those used in database transactions:

- Allow a global memory page to be locked for exclusive access;
- Release the lock; and,
- Force a local memory page to be immediately updated to global memory.

This Symmetric Computing API replaces OpenMP in its entirety and allows programs parallelized with the OpenMP API to take advantage of all the processor cores within the system without any source code modifications. This allows programs with multiple execution threads, possibly running on multiple nodes, to maintain global memory page consistency across all nodes.

If application parallelization is achieved under POSIX threads or any other method, it is the responsibility of the programmer to ensure coherency. To assist the programmer in this process, a set of DSMP-specific primitives is available. These primitives, combined with some simple, intuitive programming rules, make porting an application to a multi-node DSMP platform simple and manageable. Those rules are as follows:

- Be sensitive to the fact that memory-pages are swapped into and out of local memory (cache) from global memory in 4096-byte pages and that it takes 5 µ-seconds to complete the swap.
- Because DSMP is based on a 4096-byte cache page, it is important not to map unique or independent data structures on the same memory page. To help prevent this, a new malloc( ) function [a_malloc( )] is provided to assure alignment on a 4096-byte boundary and to help avoid false sharing.
- Because of the way local and global memory are partitioned within the physical memory, care must be taken to distribute processes/threads and associated data-sets evenly over the four processors while maintaining temporal locality to data. How you do this is a function of the target application. In short, care should be taken to not only distribute threads but also ensure some level of data locality. DSMP supports full POSIX conformance to simplify parallelization of threads.
- If there is a data structure which is modified-shared, and if that data structure is accessed by multiple processes/threads which are on an adjacent server, then it will be necessary to use a set of three new primitives to maintain memory consistency:

  1. Sync( ) synchronizes a local data structure with its parent in global memory.
  2. Lock( ) prevents any other process thread from accessing and subsequently modifying the noted data structure. Lock( ) also invalidates all other copies of the data structure (memory-pages) within the computing system. If a process thread on an adjacent computing device accesses a memory page associated with a locked data structure, execution is suspended until the structure (memory page) is released.
  3. Release( ) unlocks a previously locked data structure.

These three primitives are provided to simplify the implementation of system-wide memory consistency; a usage sketch appears at the end of this section.

NOTE: If your application is single-threaded, or is already parallelized for OpenMP and/or Portable Operating System Interface for Unix (POSIX) threads, it will run on a DSMP system and, in most cases, will be able to access all the processor resources within the cluster. If your code is not parallelized with OpenMP or POSIX threads, a multithreaded application may not be able to take full advantage of all the processor cores on the adjacent worker nodes (only the processor cores on the head-node). In the case of a Trio Departmental Supercomputer, your application will have access to all 64 processor cores on the head-node and 1.44-TB of global shared memory (1.54-TB total RAM minus 3 node partitions of 32-GB local memory) with full coherency support.

Distributed Multi-Threading: The gold standard for parallelizing C/C++ or Fortran source code is OpenMP together with the POSIX thread library, or Pthreads. To simplify porting and/or running OpenMP applications under DSMP, Symmetric Computing Engineering provides a DSMP-specific OpenMP API that addresses program issues associated with page alignment, how and when to fork( ) a thread to multiple remote nodes, and related issues.

Distributed MuTeX: A MuTeX, or MuTual exclusion, is a set of algorithms used in concurrent programming to avoid the simultaneous use of a common resource, such as a global variable or a critical section. A distributed MuTeX is nothing more than a DSMP kernel enhancement that ensures a MuTeX functions as expected within the DSMP multi-node system. From a programmer's point-of-view, there are no changes or modifications to the MuTeX; it just works.
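The fragment below sketches how the porting rules and the a_malloc( ), Lock( ), Sync( ) and Release( ) primitives described above might be applied. It is illustrative only: this paper does not give the primitives' actual prototypes, so the stand-in stubs below are assumptions used solely to make the example self-contained; on a DSMP system you would link against the real API instead.

/* Illustrative sketch of the DSMP porting rules.  The stubs below are
 * assumed stand-ins for the real DSMP primitives, included only so the
 * example compiles on an ordinary Linux host. */
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* Assumed stand-in: page-aligned allocation so independent structures
 * never share a 4096-byte page (avoids false sharing across nodes).   */
static void *a_malloc(size_t size)
{
    void *p = NULL;
    return posix_memalign(&p, 4096, size) == 0 ? p : NULL;
}

/* Assumed stand-ins for the three consistency primitives (no-ops here). */
static void Lock(void *ds)    { (void)ds; } /* exclusive access; invalidates remote copies */
static void Sync(void *ds)    { (void)ds; } /* flushes the local page(s) to global memory  */
static void Release(void *ds) { (void)ds; } /* drops the exclusive lock                    */

struct counters { long hits; long misses; };

int main(void)
{
    /* Rule: give a modified-shared structure its own 4096-byte page. */
    struct counters *c = a_malloc(sizeof *c);
    if (!c) return 1;
    c->hits = c->misses = 0;

    /* Rule: when threads on an adjacent node also touch the structure,
     * bracket each update with Lock / Sync / Release.                  */
    Lock(c);
    c->hits += 1;     /* the update lands in the local (cache) copy     */
    Sync(c);          /* push the modified page back to global memory   */
    Release(c);       /* allow other threads/nodes to access it again   */

    free(c);
    return 0;
}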
A Perfect Storm

Every so often, there are advancements in technology that impact a broad swath of society, either directly or indirectly. These advancements emerge and take hold due to three events:

1. A brilliant idea that solves a real-world problem is set upon and implemented.
2. A set of technologies has evolved to a point that enables the idea.
3. The market segment which directly benefits from the idea is ready to accept it.

The enablement of a high performance shared-memory architecture over a cluster is such an idea. As implied in the previous section, the technologies that have allowed DSMP to be realized are:

1. The shift away from ultra-fast single-core processors toward a lower-power multi-core approach.
2. The adoption of the x86-64 processor as the architecture of choice for technical computing.
3. The integration of the DRAM controller and I/O complex onto the x86-64 processor.
4. The sudden and rapid increase in DRAM density, with the corresponding drop in memory cost.
5. The commercial availability of a high bandwidth, low latency network fabric: InfiniBand.
6. The adoption of Linux as the operating system of choice for technical computing.
7. The commercial availability of high-performance, small form-factor SMP nodes.

The recent availability of all seven conditions has enabled Distributed Symmetric Multiprocessing.

DSMP Performance

Performance of a supercomputer is a function of two metrics:

1. Processor performance (computational throughput), which can be assessed with the HPCC Linpack or FFT benchmarks; and,
2. Global memory read/write performance, which can be assessed and measured with HPCC STREAMS or RandomAccess.

The extraordinary thing about DSMP is the fact that it is based on economical, off-the-shelf components. That's important because, as with MPI, DSMP performance scales with the performance of the components on which it depends. As an example, random read/write latency for DSMP improved 40% simply by moving from a 20 Gbps to a 40 Gbps InfiniBand fabric, with no changes to the DSMP software. Within this same timeframe, AMD64 processor density went from quad-core to six-core, to eight-core and to sixteen-core -- again, with only a small increase in the total cost of the system.
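For readers unfamiliar with the second metric above, the short kernel below illustrates the kind of memory-bandwidth measurement that STREAM-style benchmarks perform. It is not the official HPCC STREAM code; the array size and scalar are arbitrary choices for illustration.

/* Minimal STREAM-style "triad" kernel (illustration only, not the HPCC
 * benchmark).  Compile with `cc -O2 -fopenmp triad.c`. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1L << 25)             /* ~32M elements per array (~768 MB total) */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    #pragma omp parallel for
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for            /* triad: a = b + scalar * c */
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];
    double t1 = omp_get_wtime();

    /* Three arrays of 8-byte elements are moved (2 reads + 1 write). */
    double gbytes = 3.0 * N * sizeof(double) / 1e9;
    printf("triad bandwidth: %.2f GB/s\n", gbytes / (t1 - t0));
    free(a); free(b); free(c);
    return 0;
}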
Over time, the performance gap between a DSMP cluster and a proprietary SMP supercomputer of equivalent processor/memory density will continue to narrow and eventually disappear.

Looking back to the birth of the Beowulf project, an SMP supercomputer of equivalent processor/memory density outperformed a Beowulf cluster by a factor of 100, yet MPI clusters continued to dominate supercomputing. Why? The reasons are twofold:

1. Price/performance: MPI clusters are affordable from both a hardware and a software (Linux-based) perspective, and they can grow in small ways, by adding a few additional PC/servers, or in large ways, by adding entire racks of servers. Hence, if a researcher needs more computing power, they simply add more commodity PC/servers.
2. Accessibility: MPI clusters are inexpensive enough that almost any University or College can afford them, making them readily accessible for a wide range of applications. As a result, an enormous amount of academic resources is focused on improving MPI and the applications that run on it.

DSMP brings the same level of value, accessibility and grass-roots innovation to the same audience that embraced MPI. However, the performance gap between a DSMP cluster and a legacy shared-memory supercomputer is small, and in many applications, a DSMP cluster outperforms machines that cost 10x its price. As an example, if your application (program and data) can fit within the local memory (cache) of a Trio, then you can apply 192 processor cores concurrently to solve your problem, with no penalty for memory accesses (all memory is local for all 192 cores). Even if your algorithm does access global shared memory, it is less than 5 µ-seconds away.

Today, Symmetric Computing is offering four turn-key DSMP platforms: the three-node Trio Departmental Supercomputer line and the dual-node Duet Departmental Supercomputer line. The table below lists the TFLOPS/memory configurations available in 2011.

Model   Processor Cores   Peak Theoretical Floating Point Performance   Total System Memory   Global Shared Memory*
Duet    128               1.12 TFLOPS                                   512-GB                448-GB
Duet    128               1.12 TFLOPS                                   1-TB                  960-GB
Trio    192               1.776 TFLOPS                                  768-GB                672-GB
Trio    192               1.776 TFLOPS                                  1.54-TB               1.44-TB

* Total system memory minus a 32-GB local-memory partition per node.

Looking forward into 2012, Symmetric Computing plans to introduce a multi-node, InfiniBand-switched system delivering up to 2048 cores and 16-TB of RAM in a single 42U rack. In addition, we are working with our partners to deliver turn-key platforms optimized for application-specific missions such as database search on proteins and/or RNA (e.g., NCBI BLAST,
FASTA, HMMER3) and short-read sequence alignment on proteins and nucleotides (e.g., BFAST, Bowtie, MOSAIK).

Challenges and Opportunities

There is real value in DSMP kernel technology over MPI/MPICH. However, much of the HPC market has become complacent with MPI and looks only for incremental improvements (e.g., MPI-1 to MPI-2). In addition, as MPI nodes convert to the AMD Opteron and Intel Xeon processors' direct-connect architectures, combined with memory densities of >256-GB/node, MPI-2 with DSM support will gradually service a wider range of applications. Symmetric Computing addresses these issues first by supplying affordable, turn-key, DSMP-enabled appliances such as the Trio Departmental Supercomputer. Our platforms are ideal for solving the high performance computing challenges in key sectors such as bioinformatics and computer-aided engineering.

About Symmetric Computing

Symmetric Computing is a Boston-based software company with offices at the Venture Development Center on the campus of the University of Massachusetts Boston. We design software to accelerate the use and application of shared-memory computing systems for bioinformatics and life sciences, computer-aided engineering, energy, earth sciences, financial analyses and related fields. Symmetric Computing is dedicated to delivering standards-based, customer-focused technical computing solutions for users ranging from universities to enterprises. For more information, visit www.symmetriccomputing.com.