High Performance Computing Trey Breckenridge Computing Systems Manager Engineering Research Center Mississippi State University
What is High Performance Computing? HPC is ill-defined and context dependent. In the late 1980s, the US Government defined supercomputers as processors capable of more than 100 MFlops. This definition is clearly obsolete, as modern desktop PCs are capable of ~5 GFlops. Another approach is to describe HPC as the fastest computers at any point in time; however, that is more a budget-dependent definition. For the intent of this presentation, we will define HPC as: Computing resources which provide at least an order of magnitude more computing power than is normally available on a desktop computer.
What does the definition really mean? That definition sounds like HPC is hardware only. Isn't the software important too? HPC covers the full range of supercomputing activities, including existing supercomputer systems, special-purpose and experimental systems, and the new generation of large-scale parallel architectures. HPC exists on a broad range of computer systems, from departmental clusters of desktop workstations to large parallel processing systems.
Why High Performance Computing? To achieve the maximum amount of computation in a minimum amount of time SPEED! To solve problems that couldn't otherwise be solved without large computer systems. Traditionally, HPC has been used in scientific and engineering fields for work with massively complex simulations. Computations are typically floating-point intensive.
Areas of HPC Use Traditional: Computational Fluid Dynamics (CFD) Climate, Weather, and Ocean Modeling and Simulation (CWO) Nuclear Modeling and Simulation Geophysical/Petroleum Modeling Emerging: Computer Graphics/Scientific Visualization Financial Modeling Database Applications Bioinformatics Biomedical
Parallel Computing A collection of processing elements that can communicate and cooperate to solve large problems more quickly than a single processing element. Simultaneous use of multiple processors to execute different parts of a program. Goal: to reduce the wall-clock time of a run. No single processor is ever again likely to match the performance of existing parallel HPC systems: HPC => Parallel
Types of Parallelism Overt: Parallelism is visible to the programmer. May be difficult to program (correctly). Large improvements in performance. Covert: Parallelism is not visible to the programmer. The compiler is responsible for parallelism. Easy to do. Small improvements in performance are typical.
Speed Up Speed up is one quantitative measure of the benefit of parallelism. Speed up is defined as S / T(N), where S = best serial time and T(N) = time required with N processors. Since S/N is the best possible parallel time, speed up typically should not exceed N. S is sometimes difficult to measure, causing many people to substitute T(1) for S.
Types of Speed Up
Efficiency Speed up does not measure how efficiently the processors are being used. Is it worth using 100 processors to get a speed up of 2? Efficiency is defined as the ratio of the speed up to the number of processors required to achieve it. The best efficiency is 1; in reality, it is between 0 and 1.
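Both metrics can be computed directly from measured run times. A minimal sketch, using the slide's own question as the example (the timings are made-up illustrative numbers):

```python
def speedup(serial_time, parallel_time):
    """S / T(N): best serial time divided by the time on N processors."""
    return serial_time / parallel_time

def efficiency(serial_time, parallel_time, n_procs):
    """Speed up divided by the number of processors; 1.0 is ideal."""
    return speedup(serial_time, parallel_time) / n_procs

# Hypothetical job: 100 s serially, 50 s on 100 processors.
s = speedup(100.0, 50.0)           # speed up of 2
e = efficiency(100.0, 50.0, 100)   # efficiency of only 0.02
print(s, e)
```

An efficiency of 0.02 makes the answer to the question concrete: 98% of the machine's capacity is wasted on that run.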
HPC Architecture and Design
Vector and Scalar Processors Vector: large rows of data are operated on simultaneously. Scalar: data is operated on in a sequential fashion. Instruction sets: Complex Instruction Set Computer (CISC), Reduced Instruction Set Computer (RISC), Post-RISC or CISC/RISC (UltraSPARC, IBM POWER4, IA-64).
Scalar vs. Vector Arithmetic

    DO 10 i = 1,n
      a(i) = b(i) + c(i)
    10 CONTINUE

Scalar: a(1) = b(1) + c(1), a(2) = b(2) + c(2), ..., a(n) = b(n) + c(n): n instructions. Vector: a = b + c: one vector instruction.
Where is Scalar better? If the vector length is small If the loop contains IF statements If partial vectorization involves large overhead If recursion is used Small budget for capital expenditures!
Architectural Classifications Published by Flynn in 1972 (Flynn's Taxonomy). Outdated, but still widely used. Categorizes machines by instruction streams and data streams: a stream of instructions (the algorithm) tells the computer what to do; a stream of data (the input) is affected by these instructions. Four categories: SISD Single Instruction, Single Data; MISD Multiple Instruction, Single Data; SIMD Single Instruction, Multiple Data; MIMD Multiple Instruction, Multiple Data.
SISD Single Instruction, Single Data Conventional single processor computers Each arithmetic instruction initiates an operation on a data item taken from a single stream of data elements. Historical supercomputers and most contemporary microprocessors are SISD
SIMD Single Instruction, Multiple Data Many simple processing elements (1000s). Each processor has its own local memory. Each processor runs the same program. Each processor processes different data streams. All processors work in lock-step (synchronously). Very efficient for array/matrix operations. Most older vector/array computers are SIMD. Example machines: Cray YMP, Thinking Machines' CM-200
MISD Multiple Instruction, Single Data Very few machines fit this category None have been commercially successful or have had any impact on computational science
MIMD Multiple Instruction, Multiple Data Most diverse of the four classifications Multiple processors Each processor either has own, or accesses shared, memory Each processor can run the same or different programs Each processor processes different data streams Processors can work synchronously or asynchronously
MIMD cont. Processors can be either tightly or loosely coupled. Examples include: Processors and memory units specifically designed to be components of a parallel architecture (e.g., Intel Paragon). Large-scale parallel machines built from off-the-shelf workstations (e.g., Beowulf cluster). Small-scale multiprocessors made by connecting multiple vector processors together (e.g., Cray T90). Wide variety of other designs as well.
SPMD Computing Not a Flynn category, per se, but instead a combination of categories. SPMD stands for Single Program, Multiple Data. The same program is run on the processors of an MIMD machine. Occasionally the processors may synchronize. Because an entire program is executed on separate data, it is possible that different branches are taken, leading to asynchronous parallelism. SPMD came about as a desire to do SIMD-like calculations on an MIMD machine. SPMD is not a hardware paradigm, but instead the software equivalent of SIMD.
Memory Classifications Organization: Shared Memory (SM-MIMD) Bus based, Interconnection network. Distributed Memory (DM-MIMD) Local, Message passing. Virtual Shared Memory (VSM-MIMD) Physically distributed, but appears as one image. Access: Uniform Memory Access (UMA) All processors take the same time to reach all memory locations. Non-Uniform Memory Access (NUMA) Access time depends on which memory location a processor references.
Memory Organization Shared Memory One common memory block between all processors. Bus based: since the bus has limited bandwidth, the number of processors which can be used is limited to a few tens of processors. Examples include typical multiprocessor PCs, SGI Challenge
Memory Organization Switch based Utilizes a (complex) interconnection network to connect processors to shared memory modules. May use multi-stage networks (NUMA). Increases bandwidth to memory over bus-based systems. Every processor still has access to global memory. Examples include Sun E10000
Memory Organization Distributed Memory Message Passing. Memory physically distributed through the machine. Each processor has private memory. Contents of private memory can only be accessed by that processor. If required by another processor, then it must be sent explicitly. In general, machines can be scaled to thousands of processors. Requires special programming techniques. Examples include Cray T3E, IBM SP
Memory Organization Virtual Shared Memory Objective is to have the scalability of distributed memory with the programmability of shared memory Global address space mapped onto physically distributed memory Data moves between processors on demand or as it is accessed
Compute Clusters Connecting multiple standalone machines via a network interconnect, utilizing software to access the combined systems as one computer. The standalone machines could be inexpensive single-processor workstations or multi-million dollar multiprocessor servers. Individual machines can be connected via numerous networking technologies using a variety of topologies. 100BaseT Ethernet: inexpensive, low performance, high latency. Myrinet (2 Gb/s): expensive, high performance, low latency. Proprietary high-speed networks. Nearly 20% of the 500 fastest supercomputers in the world are clusters.
Beowulf Clusters First developed in 1994 at NASA Goddard. Goal is to build a supercomputer utilizing a large number of inexpensive, commodity off-the-shelf (COTS) parts. Increasingly used for HPC applications due to the high cost of MPPs and the wide availability of networked workstations. Not a panacea for HPC: many applications require shared memory or vector solutions. Existing Beowulf clusters range from 2 to 4000 processors and are likely to reach 10,000 processors in the near future.
Metacomputing Metacomputing is a dynamic environment with an informal pool of nodes that can join or leave the environment whenever they desire (e.g., SETI@HOME). Why do we need metacomputing? Our computational needs are infinite; our financial resources are finite. Someday we will utilize computing cycles just like we utilize electricity from the power company, buying cycles on an as-needed basis. Commonly referred to as The Grid or Computational Grids
Job Execution Most HPC systems do not allow interactive access. Batch-style jobs are submitted to the system via a queuing mechanism. Schedulers determine the order in which jobs should be run. Factors include User priority Resource availability The goal of the Scheduler is to maximize system utilization. Scheduler optimization is an important component and is a field of study of its own.
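The queue-plus-scheduler flow can be sketched in a few lines. This is a toy priority scheduler (names and priorities are invented; real schedulers such as PBS or LSF also weigh resource availability, fair share, and backfill):

```python
import heapq

class Scheduler:
    """Toy batch scheduler: jobs leave the queue in priority order,
    with FIFO ordering among jobs of equal priority."""
    def __init__(self):
        self._queue = []
        self._counter = 0   # submission order, used as a tie-break

    def submit(self, name, priority):
        # heapq is a min-heap, so negate priority: larger number runs first.
        heapq.heappush(self._queue, (-priority, self._counter, name))
        self._counter += 1

    def next_job(self):
        return heapq.heappop(self._queue)[2]

sched = Scheduler()
sched.submit("cfd_run", priority=1)
sched.submit("climate_model", priority=5)
sched.submit("test_job", priority=1)
order = [sched.next_job() for _ in range(3)]
print(order)   # ['climate_model', 'cfd_run', 'test_job']
```

Even this sketch shows why scheduling is a field of its own: pure priority ordering can starve low-priority jobs, which is one reason production schedulers add aging and fair-share policies.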
HPC Software
Programming Languages It has been said, "I don't know what language they will be using to program high performance computers 10 years from now, but we do know it will be called FORTRAN." C and C++ are making strides in the HPC community due to their ability to create complex data structures and better I/O routines. FORTRAN 90 incorporated many of the features of C (e.g., pointers). High Performance Fortran (HPF) is FORTRAN 90 with directive-based extensions allowing for shared and distributed memory machines: clusters, traditional supercomputers, and massively parallel processors. Today, many programmers prefer to do their data structures, communications, etc. in C, while doing the computations in FORTRAN.
Compilers Compilers are an often overlooked area of HPC, but are of critical importance. Application run times are directly related to the ability of the compiler to produce highly optimized code. Poor compiler optimization could result in run times increasing by an order of magnitude. Optimization Levels None, Basic, Interprocedural analysis, Runtime profile analysis, Floating-point, Data flow analysis, Advanced
Distributed Memory Parallel Programming Message passing is a programming paradigm where one effectively writes multiple programs for parallel execution. The problem must be decomposed, typically by domain or function Each process knows only about its own local data. If data is required from a different process, it must send a message to that process asking for the data Access to remote data is much slower than to local data, so a major objective is to minimize remote communications.
Message Passing Environments PVM Parallel Virtual Machine Portable and operable across heterogeneous computers Performance sacrificed for flexibility Well defined protocol allows for interoperability between different implementations MPI Message Passing Interface Today s standard for message passing Widely adopted by most vendors Portable and operable across heterogeneous computers Good performance with reasonable efficiency No standard for interoperability between implementations
Shared Memory Parallel Programming Every processor has direct access to the memory of every other processor in the system. Not widely used at the programmer level, but widely used at the system level (even on single-processor systems via multithreading). Allows low-latency, high-bandwidth communications. Portability is poor. Easy to program (compared to message passing). Directive-controlled parallelism
Shared Memory Environments POSIX Threads (Pthreads) SHMEM OpenMP Quickly becoming the standard API for shared memory programming Emphasis on performance and scalability Allows for fine-grain or coarse-grain parallelism Some implementations are interoperable with MPI and PVM Message Passing
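The shared-address-space model behind these environments can be illustrated with ordinary threads. This is a minimal Python sketch (threading standing in for OpenMP-style shared memory; it is not an OpenMP binding), showing direct access to common data with a lock ordering the updates:

```python
import threading

total = 0
lock = threading.Lock()

def accumulate(values):
    """Every thread reads and writes the same variable directly; the lock
    serializes the updates, much as an OpenMP critical section would."""
    global total
    for v in values:
        with lock:
            total += v

threads = [threading.Thread(target=accumulate, args=([i] * 100,))
           for i in (1, 2, 3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(total)   # 600
```

Contrast this with the message-passing model: no data is ever "sent", but without the lock the concurrent updates would race, which is the classic hazard of shared-memory programming.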
Benchmarking Benchmarking is an important aspect of HPC and is used for purchase decisions, system configuration, and application tuning. Rule 1: All vendors lie about their benchmarks!! Purchase decisions should not be based on published benchmark results. If at all possible, run your code on the exact machine you are considering for purchase. LINPACK The mother of all benchmarks. Not originally designed to be a benchmark, but rather a set of high-performance library routines for linear algebra. Reports average megaflop rates by dividing the total number of floating-point operations by time. Used for the TOP500 Supercomputing Sites report
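The LINPACK rate calculation is simple arithmetic once the operation count is known: solving a dense n-by-n linear system costs roughly (2/3)n^3 + 2n^2 floating-point operations. A sketch (the 0.5 s timing is a made-up example, not a measured result):

```python
def linpack_flops(n):
    """Approximate flop count for solving a dense n-by-n linear system,
    the operation count the LINPACK benchmark is based on."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2

def mflops(n, seconds):
    """Average megaflop rate: total operations divided by run time."""
    return linpack_flops(n) / seconds / 1e6

# Hypothetical run: a 1000x1000 system solved in 0.5 s.
rate = mflops(1000, 0.5)
print(round(rate, 1))   # 1337.3 MFlops
```

Note this is an *average* rate over the whole solve, which is why LINPACK numbers flatter machines relative to real application performance.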
www.top500.org
Summary HPC is parallel computing. HPC involves a broad spectrum of components, and is only as fast as the weakest component, whether that be processor, memory, network interconnect, compiler, or software. HPC exists on a broad range of computer systems, from departmental clusters of desktop workstations to large parallel processing systems.
Additional Information Dowd, Kevin and Severance, Charles. High Performance Computing, Second Edition. O'Reilly & Associates, Inc., 1998. Dongarra, Jack. High Performance Computing: Technology, Methods and Applications. Elsevier, 1995. Buyya, Rajkumar. High Performance Cluster Computing, Volume 1. Prentice Hall PTR, 1999. Foster, Ian and Kesselman, Carl. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, Inc., 1999.