Distributed Symmetric Multi-Processing (DSMP)
Symmetric Computing
Venture Development Center, University of Massachusetts Boston
100 Morrissey Boulevard, Boston, MA USA
Introduction

Most High Performance Computing (HPC) today is done on supercomputing clusters. However, to achieve the performance that HPC clusters promise, applications must be tailored to the architecture. Over time, a library of cluster-ready HPC applications has been developed; however, many important end-user applications remain unaddressed. Entire fields of scientific endeavor that could benefit from high performance computing have not, primarily because of the programming complexity of HPC clusters.

Most scientists, researchers, engineers and analysts like to focus on their specialty, get their computations executed quickly, and avoid becoming entangled in the computer-science complexities that supercomputing clusters demand. They typically develop their work on Symmetric Multi-Processing (SMP) workstations -- SMP computers align more closely with their computer skills -- and only need supercomputers for their more demanding applications and data sets. Scientists, researchers, engineers and analysts would rather not re-write their applications for a supercomputing cluster and would prefer to use SMP supercomputers. However, SMP supercomputers have been out of reach economically for many. They have been too expensive due to their reliance on costly proprietary hardware: Application Specific Integrated Circuits (ASICs) and/or Field Programmable Gate Arrays (FPGAs).

Recently, new software-based SMP supercomputers have emerged. Software-based SMP supercomputers are economical: they can be made from the same lower-cost, non-proprietary, off-the-shelf server nodes that have always made non-SMP supercomputing clusters attractive. This new software makes homogeneous HPC clusters act and behave as SMP supercomputers. The current generation of software-based SMP supercomputers built from homogeneous HPC clusters relies on hypervisor implementations -- that is, an added layer of software interposed between the hardware and the operating system. This added layer adversely affects performance, limiting its High Performance Computing applicability.

A new generation of software-based SMP supercomputers promises the performance that HPC applications demand. This next generation is implemented at the operating-system level, which is inherently faster than a hypervisor. Symmetric Computing's patent-pending distributed shared memory extension to the Linux kernel delivers the economies and performance of supercomputing clusters, but with the programming simplicity of SMP. This technological leap is truly disruptive, as it brings the benefits of supercomputing to heretofore unreached fields of research, development and analysis.
Supercomputing Cluster Limitations

Computations assigned to a cluster of networked systems must be carefully structured to accommodate the limitations of the individual server nodes that make up the cluster. In many cases, highly skilled programmers are needed to modify code to accommodate the cluster's hierarchy, per-node memory limitation and messaging scheme. Lastly, once the data sets and program are ready to run, they must first be propagated onto each and every node within the cluster. Only then can actual work begin.

Besides complexity, there are entire classes of applications and data sets that are inappropriate for supercomputing clusters. Many high performance computing applications (e.g., genomic sequencing, coupled engineering models) invoke large data sets and require large shared memory (512 GB of RAM or more). The language-independent communications protocol Message Passing Interface, or MPI, is the de facto standard for supercomputing clusters. Addressing problems with big data sets is impractical with the MPI-1 model, for it has no shared-memory concept, and MPI-2/MPICH has only a limited distributed shared-memory concept with a significant latency penalty (>40 µs). Hence, an MPICH cluster built from individual server nodes, each with less than 512 GB of RAM, can't support problems with very large data sets without significant restructuring of the application and associated data sets. Even after a successful port, many programs suffer poor performance due to system hierarchy and message latency.

Data sets in many fields (e.g., bioinformatics and life sciences, Computer-Aided Engineering (CAE), energy, earth sciences, financial analyses) are becoming too large for the Symmetric Multi-Processing (SMP) nodes that make up the cluster. In many cases, it is impractical and inefficient to distribute the large data sets. The alternatives are to:

- Restructure the problem (to fit within the memory limitations of the nodes and suffer inefficiencies);
- Purchase a memory appliance to accommodate the data set and suffer the increased latency (>20 µs); or,
- Wait for and purchase time on a university or national lab's SMP supercomputer.

Each of these options has its own drawbacks, ranging from latency and performance to lengthy queues for time on a government (e.g., NSF, DoE) supercomputer. What HPC users really want is unencumbered access to an affordable, large shared-memory SMP supercomputer.
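To make the message-passing burden concrete, the sketch below is a minimal MPI program of our own (it does not appear in the whitepaper, and it assumes the data set divides evenly across ranks): even a trivial sum forces the programmer to partition the data explicitly, ship it into each node's private memory, and gather the partial results back.

/* Illustrative only: a minimal MPI reduction. Compile with mpicc, run with mpirun. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000  /* total elements; assumed divisible by the number of ranks */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int chunk = N / nprocs;
    double *full = NULL;
    double *part = malloc(chunk * sizeof(double));

    if (rank == 0) {                 /* only the root rank holds the full data set */
        full = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) full[i] = 1.0;
    }

    /* Explicitly scatter the data across the nodes' private address spaces. */
    MPI_Scatter(full, chunk, MPI_DOUBLE, part, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    double local = 0.0, total = 0.0;
    for (int i = 0; i < chunk; i++) local += part[i];

    /* Explicitly combine the partial results back on the root. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("sum = %f\n", total);

    free(part); free(full);
    MPI_Finalize();
    return 0;
}

On a shared-memory system the same sum needs no scatter or reduce calls at all; that difference in programmer effort is the gap the rest of this paper addresses.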
Graphic Processor Units

A variant of supercomputing is the supercomputer enhanced with Graphic Processor Units (GPUs), typically as GPU PCIe adapters. GPUs are ASICs designed to rapidly manipulate and alter memory to accelerate the building of images in a frame buffer intended for output to a display. They are very efficient at manipulating computer graphics, and their highly parallel structure makes them effective for algorithms where large blocks of data are processed in parallel.

GPUs can add performance to a supercomputer, but only if applications are redesigned for the architecture, whether one is adding GPU adapters to an SMP computer [1] or adding a shelf of GPU adapters to a supercomputing cluster. Performance test results from independent test laboratories indicate that adding GPU adapters can actually slow performance if the application is not redesigned. Redesigning applications for a supercomputer with GPU adapters is a non-trivial task, and once redesigned one has custom software that can carry significant ongoing software maintenance issues. GPU adapters require separate memory spaces and must copy data across the I/O bus. Multiple GPU cards can easily tax the PCIe bus if one does not architect the application(s) properly. So, supercomputers enhanced with GPU adapters are inappropriate for HPC applications with large memory requirements.

GPUs have attractive single-precision, no-error-correction Giga-Floating Point Operations Per Second (GFLOPS) ratings. However, those ratings are meaningless for scientific, engineering and financial calculations. What are of interest are the double-precision, error-correction ratings. When one considers the cost per double-precision, error-correction GFLOPS, general-purpose Central Processor Units (CPUs) like the AMD Opteron 6200 series and Intel Xeon E7 processors are comparable to current GPU ASICs like the Nvidia GF100. Comparable cost per double-precision, error-correction GFLOPS indicates that GPUs may have limited applicability for scientific, engineering and financial HPC computations, particularly when general-purpose CPUs are easier to program and support.

[1] If you have a GPU-enabled application or you are committed to redesigning your application(s), one can add GPU adapters to the spare PCIe slots of Symmetric Computing supercomputers.
Distributed Symmetric Multi-Processing (DSMP)

Symmetric Computing's patent-pending Distributed Symmetric Multi-Processing (DSMP) provides affordable SMP supercomputing. DSMP software enables Distributed Shared Memory (DSM), or Distributed Global Address Space (DGAS), across an InfiniBand-connected cluster of homogeneous Symmetric Multiprocessing (SMP) nodes. Homogeneous InfiniBand-connected clusters are converted into a DSM/DGAS supercomputer which can service very large data sets or accommodate legacy MPI/MPICH applications with increased efficiency and throughput via application utilities (gasket interface) that support MPI over shared memory. DSMP can displace and, in some cases, obsolete message passing as the protocol of choice for a wide range of memory-intensive applications because of its ability to economically service a wider class of problems with greater efficiency.

DSMP creates a large, shared-memory software architecture at the operating-system level. From the programmer's perspective, there is a single software image and one operating-system (Linux) kernel for a DSMP-enabled homogeneous cluster. And DSMP executes on state-of-the-art, off-the-shelf (OTS) Symmetric Multiprocessing (SMP) servers. Employing a Linux kernel extension with off-the-shelf server hardware, Symmetric Computing's DSMP provides large shared-memory, multiple-core SMP computing with both economy and performance. The table below compares MPI clusters, SMP supercomputers and DSMP-enabled clusters.

                                                 MPI Cluster   SMP Supercomputer   DSMP Cluster
  Off-the-Shelf Hardware                         Yes           No                  Yes
  Affordability                                  $             $$$$$               $
  Shared Memory                                  No            Yes                 Yes
  Single Software Image                          No            Yes                 Yes
  Intelligent Platform Management Interface      Yes           Yes                 Yes
  Virtualization Support                         No            Yes                 No
  Static Partition for Multiple Applications     No            Yes                 Planned
  Scalable                                       Yes           No                  Yes

The key features of the DSMP architecture are:

- A DSMP Linux kernel-level enhancement that results in significantly lower latency and improved bandwidths, making a DSM/DGAS architecture both practical and possible;
- A transactional distributed shared-memory system that allows the architecture to scale to thousands of nodes with little to no impact on global memory access times;
- An intelligent, optimized InfiniBand driver that leverages the HCA's native Remote Direct Memory Access (RDMA) logic, reducing global memory page access times to under 5 µs; and,
- An application-driven, memory-page coherency scheme that simplifies porting applications and allows the programmer to focus on optimizing performance instead of spending endless hours re-writing the application to accommodate the limitations of the message-passing interface.

How DSMP Works

DSMP is an OS kernel enhancement that transforms an InfiniBand-connected cluster of homogeneous off-the-shelf SMP servers into a shared-memory supercomputer. The DSMP kernel provides six (6) enhancements that enable this transformation:

1. A transactional distributed shared-memory system
2. Optimized InfiniBand drivers
3. An application-driven, memory-page coherency scheme
4. An enhanced distributed multi-threading service
5. Support for a distributed MuTeX
6. Distributed buffered file I/O

Transactional Distributed Shared-Memory System: The centerpiece of Symmetric Computing's patent-pending DSMP Linux kernel extension is its transactional distributed shared-memory architecture, which is based on a two-tier memory-page scheme (local and global) and tables that support the transactional memory architecture. The transactional global shared memory is concurrently accessible and available to all the processors in the system. This memory is uniformly distributed over each server in the cluster and is indexed via tables, which maintain a linear view of the global memory. Local memory contains copies of what is in global memory, acting as a memory-page cache and providing temporal locality for program and data. All processor code execution and data reads occur only from the local memory (cache).

DSMP divides physical memory into two partitions: local working memory and global shared memory. The global partitions on each node are combined to form a single global shared memory that is linearly addressable by all nodes in a consistent manner. The global memory maintains a reference copy of each 4096-byte memory page used by any program running on the system at a fixed address. The local memory contains a subset of the total memory pages used by the running program; 4096-byte memory pages are copied from global memory to local memory when needed by the executing program. Any changes made to local memory pages are written back to the global memory.
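The behavior just described, with local memory acting as a page cache in front of a linearly addressed global store, can be illustrated with a small user-space model. The sketch below is our own toy simulation (none of this code comes from the DSMP kernel): a handful of local slots cache 4096-byte pages from a larger "global" array, with least-recently-used eviction and write-back, mirroring the flow described above.

/* Toy model of a two-tier page cache: local slots in front of a global page store. */
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE    4096
#define GLOBAL_PAGES 1024   /* stands in for the distributed global memory */
#define LOCAL_SLOTS     8   /* stands in for the per-node local cache */

static char global_mem[GLOBAL_PAGES][PAGE_SIZE];   /* reference copies of all pages */

typedef struct {
    int  page;        /* which global page this slot caches; -1 if empty */
    long last_used;   /* logical clock used to pick the LRU victim */
    char data[PAGE_SIZE];
} slot_t;

static slot_t cache[LOCAL_SLOTS];
static long clock_tick = 0, hits = 0, misses = 0;

/* Return a locally cached copy of `page`, faulting it in from global memory if needed. */
static char *local_page(int page)
{
    int victim = 0;
    for (int i = 0; i < LOCAL_SLOTS; i++) {
        if (cache[i].page == page) {            /* hit: the page is already local */
            cache[i].last_used = ++clock_tick;
            hits++;
            return cache[i].data;
        }
        if (cache[i].last_used < cache[victim].last_used)
            victim = i;                         /* track the least recently used slot */
    }
    misses++;
    if (cache[victim].page >= 0)                /* write the evicted page back */
        memcpy(global_mem[cache[victim].page], cache[victim].data, PAGE_SIZE);
    memcpy(cache[victim].data, global_mem[page], PAGE_SIZE);   /* fetch the missing page */
    cache[victim].page = page;
    cache[victim].last_used = ++clock_tick;
    return cache[victim].data;
}

int main(void)
{
    for (int i = 0; i < LOCAL_SLOTS; i++) { cache[i].page = -1; cache[i].last_used = 0; }

    for (int pass = 0; pass < 3; pass++)        /* working set that fits: mostly hits */
        for (int p = 0; p < 6; p++)
            local_page(p)[0]++;
    for (int p = 0; p < 32; p++)                /* working set that does not fit: evictions */
        local_page(p)[0]++;

    printf("hits=%ld misses=%ld\n", hits, misses);
    return 0;
}

In the real system the fetch and write-back steps are RDMA transfers over InfiniBand rather than memcpy calls, but the hit/miss/evict logic is the same idea.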
[Figure omitted: data-flow view of a Trio Departmental Supercomputer. Three nodes (Node 1-3), each with a 32-GB local memory (cache), connect over PCIe and InfiniBand (IB 1-3, plus GbE) to a 1.44-TB global transactional shared memory; pages move between local and global memory in 4096-byte units, while each node's processors operate on 64-byte cache lines.]

Shown above is a data-flow view of a Trio Departmental Supercomputer with 192 processor cores and 1.54 TB of RAM. It is made with three homogeneous 4P servers, each with 64 cores (using 16-core AMD Opteron 6200 series processors) and 512 GB of physical memory per node. The three nodes in Trio are connected via 40-Gbps InfiniBand PCIe adapters. [Note: A Trio is a 3-node DSMP configuration that does not need a costly InfiniBand switch for its high-speed interconnect.]

DSMP sets the size of local memory at boot time; it is typically 32 GB. When there is a page fault in local memory, the DSMP kernel finds an appropriate least recently used (LRU) 4096-byte memory page in local memory and swaps in the missing global-memory page; this happens in under 5 µs. The large, temporally local memory (cache) provides all the performance benefits (STREAMS, RandomAccess and Linpack) of local memory in a legacy SMP server, with the added benefit of spatial locality to a large, globally shared memory that is less than 5 µs away. Not only is this architecture unique and extremely powerful, but it can scale to hundreds and even thousands of nodes with no appreciable loss in performance.

Optimized InfiniBand Drivers: DSMP is made possible by the advent of a low-latency, commercial, off-the-shelf network fabric. It wasn't that long ago, with the exit of Intel from the InfiniBand consortium, that industry experts were forecasting its demise. Today InfiniBand is the fabric of choice for most supercomputing clusters due to its low latency and high bandwidth.
To squeeze every last nanosecond of performance out of the fabric, DSMP bypasses the Linux InfiniBand protocol stack with its own low-level drivers. The DSMP set of drivers leverages the native RDMA capabilities of the InfiniBand host channel adapter (HCA). This allows the HCA to service and move memory-page requests without processor intervention. Hence, RDMA eliminates the overhead of message construction and deconstruction, reducing system-wide latency.

Application-Driven, Memory-Page Coherency Scheme: Proprietary supercomputers maintain memory consistency and/or coherency via a hardware extension of the host processor's coherency scheme. DSMP, being a software solution based on local and global memory resources, takes a different approach. Coherency within the local memory of each of the individual SMP servers is maintained by the AMD64 Memory Management Unit (MMU) on a cache-line basis. Memory consistency between a page in global memory and all copies of the page in local memory is maintained and controlled by an OpenMP Application Programming Interface (API) with a set of memory-page lock operations similar to those used in database transactions:

- Allow a global memory page to be locked for exclusive access.
- Release the lock.
- Force a local memory page to be immediately updated to global memory.

This Symmetric Computing API replaces OpenMP in its entirety and allows programs parallelized with the OpenMP API to take advantage of all the processor cores within the system without any source-code modifications. This allows programs with multiple execution threads, possibly running on multiple nodes, to maintain global memory-page consistency across all nodes.

If application parallelization is achieved under POSIX threads or any other method, it is the responsibility of the programmer to ensure coherency. To assist the programmer in this process, a set of DSMP-specific primitives is available. These primitives, combined with some simple, intuitive programming rules, make porting an application to a multi-node DSMP platform simple and manageable. Those rules are as follows (see the sketch at the end of this section):

- Be sensitive to the fact that memory pages are swapped into and out of local memory (cache) from global memory in 4096-byte pages, and that it takes about 5 µs to complete each swap.
- Because DSMP is based on a 4096-byte cache page, it is important not to map unique or independent data structures onto the same memory page. To help prevent this, a new malloc() function [a_malloc()] is provided to assure alignment on a 4096-byte boundary and to help avoid false sharing.
- Because of the way local and global memory are partitioned within the physical memory, care must be taken to distribute processes/threads and their associated data sets evenly over the four processors while maintaining temporal locality to data. How you do this is a function of the target application. In short, care should be taken not only to distribute threads but also to ensure some level of data locality. DSMP supports full POSIX conformance to simplify parallelization of threads.
- If there is a data structure which is modified-shared, and that data structure is accessed by multiple processes/threads on an adjacent server, then it is necessary to use a set of three new primitives to maintain memory consistency:
  1. Sync( ) synchronizes a local data structure with its parent in global memory.
  2. Lock( ) prevents any other process thread from accessing and subsequently modifying the noted data structure. Lock( ) also invalidates all other copies of the data structure (memory pages) within the computing system. If a process thread on an adjacent computing device accesses a memory page associated with a locked data structure, execution is suspended until the structure (memory page) is released.
  3. Release( ) unlocks a previously locked data structure.

These three primitives are provided to simplify the implementation of system-wide memory consistency.

NOTE: If your application is single-threaded, or is already parallelized for OpenMP and/or Portable Operating System Interface for Unix (POSIX) threads, it will run on a DSMP system and, in most cases, will be able to access all the processor resources within the cluster. If your code is not parallelized with OpenMP or POSIX threads, a multithreaded application may not be able to take full advantage of all the processor cores on the adjacent worker nodes (only the processor cores on the head node). In the case of a Trio Departmental Supercomputer, your application will have access to all 64 processor cores on the head node and 1.44 TB of global shared memory (1.54 TB total RAM minus three node partitions of 32-GB local memory) with full coherency support.

Distributed Multi-Threading: The gold standard for parallelizing C/C++ or Fortran source code is OpenMP together with the POSIX thread library, or Pthreads. To simplify porting and/or running OpenMP applications under DSMP, Symmetric Computing Engineering provides a DSMP-specific OpenMP API that addresses program issues associated with page alignment, how and when to fork( ) a thread to multiple remote nodes, and related issues.

Distributed MuTeX: A MuTeX, or MuTual EXclusion, is a mechanism used in concurrent programming to avoid the simultaneous use of a common resource, such as a global variable or a critical section. A distributed MuTeX is nothing more than a DSMP kernel enhancement that ensures a MuTeX functions as expected within the DSMP multi-node system. From a programmer's point of view, there are no changes or modifications to the MuTeX -- it just works.
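As a concrete illustration of the page-alignment rule above, the sketch below is our own example: it is not vendor code, and posix_memalign() merely stands in for the a_malloc() routine the whitepaper describes, which we do not have. Each thread's mutable data is padded and aligned to its own 4096-byte page, so no two writers ever share a page.

/* Illustrative sketch: page-aligned, page-padded per-thread data to avoid false sharing.
 * Compile with: cc -fopenmp example.c */
#define _POSIX_C_SOURCE 200112L
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DSMP_PAGE 4096

/* One accumulator per thread, padded out to a full 4096-byte page. */
typedef struct {
    long count;
    char pad[DSMP_PAGE - sizeof(long)];
} per_thread_t;

int main(void)
{
    int nthreads = omp_get_max_threads();
    per_thread_t *acc = NULL;

    /* posix_memalign() stands in here for the vendor's a_malloc():
     * element i starts exactly at page boundary i. */
    if (posix_memalign((void **)&acc, DSMP_PAGE,
                       (size_t)nthreads * sizeof(per_thread_t)) != 0) {
        perror("posix_memalign");
        return EXIT_FAILURE;
    }
    memset(acc, 0, (size_t)nthreads * sizeof(per_thread_t));

    /* Each thread writes only its own page, so no 4096-byte page ever has
     * two writers and no page needs to bounce between nodes. */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < 10 * 1000 * 1000; i++)
            acc[tid].count++;
    }

    long total = 0;
    for (int t = 0; t < nthreads; t++) total += acc[t].count;
    printf("threads=%d total=%ld\n", nthreads, total);

    free(acc);
    return 0;
}

The same padding idea also prevents ordinary cache-line false sharing on a single node, which is why following the rule costs little even when a program never leaves the head node.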
A Perfect Storm

Every so often, there are advancements in technology that impact a broad swath of society, either directly or indirectly. These advancements emerge and take hold due to three events:

1. A brilliant idea is seized upon and implemented that solves a real-world problem.
2. A set of technologies has evolved to a point that enables the idea.
3. The market segment which directly benefits from the idea is ready to accept it.

The enablement of a high-performance shared-memory architecture over a cluster is such an idea. As implied in the previous section, the technologies that have allowed DSMP to be realized are:

1. The shift away from ultra-fast single-core processors toward a lower-power multi-core approach.
2. The adoption of the x86-64 processor as the architecture of choice for technical computing.
3. The integration of the DRAM controller and I/O complex onto the x86-64 processor.
4. The sudden and rapid increase in DRAM density, with the corresponding drop in memory cost.
5. The commercial availability of a high-bandwidth, low-latency network fabric: InfiniBand.
6. The adoption of Linux as the operating system of choice for technical computing.
7. The commercial availability of high-performance, small form-factor SMP nodes.

The recent availability of all seven conditions has enabled Distributed Symmetric Multi-processing.

DSMP Performance

Performance of a supercomputer is a function of two metrics:

1. Processor performance (computational throughput), which can be assessed with the HPCC Linpack or FFT benchmarks; and
2. Global memory read/write performance, which can be assessed and measured with HPCC STREAMS or RandomAccess.

The extraordinary thing about DSMP is the fact that it is based on economical, off-the-shelf components. That's important because, as with MPI, DSMP performance scales with the performance of the components on which it depends. As an example, random read/write latency for DSMP improved 40% simply by moving from a 20-Gbps to a 40-Gbps InfiniBand fabric, with no changes to the DSMP software. Within this same timeframe, AMD64 processor density went from quad-core to six-core, to eight-core and to sixteen-core, again with only a small increase in the total cost of the system.
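For readers unfamiliar with the memory benchmarks named above, the sketch below is a minimal STREAM-style triad kernel of our own (it is not the HPCC code, and the array size is an arbitrary assumption): it sweeps three large arrays and reports sustained bandwidth, which is the quantity the STREAMS figure measures.

/* Minimal STREAM-style triad: a[i] = b[i] + 3.0*c[i]. Compile with: cc -fopenmp triad.c */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1L << 24)   /* ~16M doubles per array, ~128 MB each (arbitrary choice) */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) { perror("malloc"); return EXIT_FAILURE; }

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];              /* two reads and one write per element */
    double t1 = omp_get_wtime();

    double gbytes = 3.0 * N * sizeof(double) / 1e9;   /* total bytes moved */
    printf("triad: %.2f GB in %.3f s = %.2f GB/s\n",
           gbytes, t1 - t0, gbytes / (t1 - t0));

    free(a); free(b); free(c);
    return 0;
}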
Over time, the performance gap between a DSMP cluster and a proprietary SMP supercomputer of equivalent processor/memory density will continue to narrow and eventually disappear. Looking back to the birth of the Beowulf project, an SMP supercomputer of equivalent processor/memory density outperformed a Beowulf cluster by a factor of 100, yet MPI clusters came to dominate supercomputing. Why? The reasons are twofold:

1. Price/performance: MPI clusters are affordable from both a hardware and a software perspective (Linux-based), and they can grow in small ways, by adding a few additional PC/servers, or in large ways, by adding entire racks of servers. Hence, if researchers need more computing power, they simply add more commodity PC/servers.
2. Accessibility: MPI clusters are inexpensive enough that almost any university or college can afford them, making them readily accessible for a wide range of applications. As a result, an enormous amount of academic resources is focused on improving MPI and the applications that run on it.

DSMP brings the same level of value, accessibility and grass-roots innovation to the same audience that embraced MPI. However, the performance gap between a DSMP cluster and a legacy shared-memory supercomputer is small, and in many applications a DSMP cluster outperforms machines that cost 10x its price. As an example, if your application (program and data) can fit within the local memory (cache) of Trio, then you can apply 192 processors concurrently to solve your problem with no penalty for memory accesses (all memory is local for all 192 processors). Even if your algorithm does access global shared memory, it is less than 5 µs away.

Today, Symmetric Computing offers four turn-key DSMP platforms: the three-node Trio Departmental Supercomputer line and the dual-node Duet Departmental Supercomputer line. The table below lists the TFLOPS/memory configurations available:

  Model   Processor Cores   Peak Theoretical Floating Point Performance   Total System Memory   Global Shared Memory*
  Duet    --                -- TFLOPS                                     512 GB                448 GB
  Duet    --                -- TFLOPS                                     1 TB                  960 GB
  Trio    --                -- TFLOPS                                     768 GB                672 GB
  Trio    --                -- TFLOPS                                     1.54 TB               1.44 TB

Looking forward into 2012, Symmetric Computing plans to introduce a multi-node, InfiniBand-switched system delivering up to 2048 cores and 16 TB of RAM in a single 42U rack. In addition, we are working with our partners to deliver turn-key platforms optimized for application-specific missions such as database search on proteins and/or RNA (e.g., NCBI BLAST,
FASTA, HMMER3) and short-read sequence alignment on proteins and nucleotides (e.g., BFAST, Bowtie, MOSAIK).

Challenges and Opportunities

There is real value in DSMP kernel technology over MPI/MPICH. However, much of the HPC market has become complacent with MPI and looks only for incremental improvements (i.e., MPI-1 to MPI-2). In addition, as MPI nodes convert to the direct-connect architectures of the AMD Opteron and Intel Xeon processors, combined with memory densities of >256 GB/node, MPI-2 with DSM support will gradually service a wider range of applications. Symmetric Computing addresses these issues first by supplying affordable, turn-key, DSMP-enabled appliances such as the Trio Departmental Supercomputer. Our platforms are ideal for solving the high performance computing challenges in key sectors such as bioinformatics and computer-aided engineering.

About Symmetric Computing

Symmetric Computing is a Boston-based software company with offices at the Venture Development Center on the campus of the University of Massachusetts Boston. We design software to accelerate the use and application of shared-memory computing systems for bioinformatics and life sciences, computer-aided engineering, energy, earth sciences, financial analyses and related fields. Symmetric Computing is dedicated to delivering standards-based, customer-focused technical computing solutions for users ranging from universities to enterprises. For more information, visit
