High Performance Computing Course Notes 2007-2008: HPC Fundamentals
Introduction
What is High Performance Computing (HPC)?
- Difficult to define - it is a moving target
  - In the late 1980s, a supercomputer performed around 100 MFLOPS
  - Today, a 2 GHz desktop/laptop performs a few GFLOPS
  - Today, a supercomputer performs tens of TFLOPS (Top500)
- High performance: O(1000) times more powerful than the latest desktops
- Most supercomputers are obsolete in terms of performance before the end of their physical life
Applications of HPC
HPC is driven by the demand of computation-intensive applications from various areas:
- Medicine, biology and neuroscience (e.g. simulation of brains)
- Finance (e.g. modelling the world economy)
- Military and defence (e.g. modelling the explosion of nuclear weapons)
- Engineering (e.g. simulation of a car crash or a new airplane design)
An Example of Demands in Computing Capability
Project: Blue Brain
- Aim: construct a simulated brain
- The building blocks of a brain are neocortical columns
  - A column consists of about 60,000 neurons
  - The human brain contains millions of such columns
- First stage: simulate a single column (each processor acting as one or two neurons)
- Then: simulate a small network of columns
- Ultimate goal: simulate the whole human brain
- IBM contributes the Blue Gene supercomputer
Related Technologies
HPC covers a wide range of technologies:
- Computer architecture
  - CPU, memory, VLSI
- Compilers
  - Identify inefficient implementations
  - Make use of the characteristics of the computer architecture
  - Choose a suitable compiler for a given architecture
- Algorithms (for parallel and distributed systems)
  - How to program parallel and distributed systems
- Middleware
  - From Grid computing technology
  - Application -> middleware -> operating system
  - Resource discovery and sharing
History of High Performance Computing
- 1960s: scalar processor
  - Processes one data item at a time
- 1970s: vector processor
  - Can process an array of data items in one go
  - Architecture; overhead
  - Difference between a vector processor and a scalar processor
- Late 1980s: Massively Parallel Processing (MPP)
  - Up to thousands of processors, each with its own memory and OS
  - Break a problem down into parts
  - Difference between MPP and a vector processor
- Late 1990s: cluster
  - Not a new term in itself, but renewed interest
  - Connecting stand-alone computers with a high-speed network
  - Difference between a cluster and MPP
- Late 1990s: Grid
  - Tackles collaboration among geographically distributed organisations
  - Draws an analogy with the power grid
  - Difference between the Grid and a cluster
Parallel Computing vs. Distributed Computing
Parallel computing
- Breaking the problem to be computed into parts that can be run simultaneously on different processors
- Example: an MPI program that performs matrix multiplication
- Solves tightly coupled problems
Distributed computing
- Parts of the work to be computed are computed in different places (note: this does not necessarily imply simultaneous processing)
- Example: the client/server (C/S) model
- Solves loosely coupled problems (not much communication between the parts)
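The parallel-computing case above can be sketched in a few lines. The notes mention an MPI matrix-multiplication program; as a minimal stand-in, this sketch uses Python's multiprocessing pool instead of MPI to split one matrix product into rows computed simultaneously by different processes (the matrices and the two-worker pool size are illustrative choices, not from the notes):

```python
# A minimal sketch of parallel computing: the rows of a matrix
# product are independent parts of one tightly coupled problem,
# so they can be computed simultaneously on different processors.
from multiprocessing import Pool

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

def row_times_matrix(row):
    # Compute one row of A x B: dot the row with each column of B.
    return [sum(a * b for a, b in zip(row, col)) for col in zip(*B)]

if __name__ == "__main__":
    with Pool(2) as pool:              # two worker processes
        C = pool.map(row_times_matrix, A)
    print(C)  # [[19, 22], [43, 50]]
```

A real MPI code would distribute the rows with explicit sends and receives, but the decomposition of the problem is the same.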
Architecture Types
SMP (Symmetric Multi-Processing)
- Multiple CPUs, single memory, shared I/O
- All resources in an SMP machine are equally available to each CPU
- Does not scale well to a large number of processors (typically fewer than 8)
- (Scalability is the measure of how well the system performance improves as processing elements are added; ideally the improvement is linear in their number)
NUMA (Non-Uniform Memory Access)
- Multiple CPUs
- Each CPU has fast access to its local area of the memory, but slower access to other areas
- Scales well to a large number of processors
- Complicated memory access pattern and system bus
MPP (Massively Parallel Processing)
Cluster
Illustration for Architecture Types
Shared memory (uniform memory access: SMP)
- Processors share access to a common memory space
- Implemented over a shared memory bus or communication network
- Support for critical sections is required
- A local cache is critical: without it, bus contention (or network traffic) reduces the system's efficiency; for this reason, pure shared memory systems do not scale naturally
- Caches introduce the problem of coherency (ensuring that stale cache lines are invalidated when other processors alter shared memory)
[Figure: processing elements PE_0 ... PE_n connected by an interconnect to a single shared memory]
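The need for critical sections on shared-memory machines can be shown concretely. A minimal sketch using Python threads (the counter and iteration counts are illustrative, not from the notes): two threads update a shared counter, and a lock serialises the read-modify-write so no update is lost.

```python
# A minimal sketch of a critical section on shared memory:
# without the lock, two threads could read the same value of
# `counter`, both add one, and lose an update.
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:          # critical section: one thread at a time
            counter += 1

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 20000: every increment survives
```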
Illustration for Architecture Types
Shared memory (non-uniform memory access: NUMA)
- A PE may be fetching from local or remote memory - hence non-uniform access times
cc-NUMA (cache-coherent Non-Uniform Memory Access)
- Groups of processors are connected together by a fast interconnect (SMP)
- These groups are then connected together by a high-speed interconnect
- Global address space
[Figure: SMP groups, each with its own shared memory (1 to m), joined by a NUMA interconnect; the PEs are numbered 1 to m.n]
Illustration for Architecture Types
Distributed memory (MPP, cluster)
- Each processor has its own local memory
- When processors need to exchange (or share) data, they must do so through explicit communication
  - Message passing (e.g. using MPI, the Message Passing Interface)
- Typically larger latencies between PEs (especially if they communicate over a network interconnect)
- Scalability is good if the problems can be sufficiently contained within the PEs
[Figure: PE_0 with memory M_0 through PE_n with memory M_n, connected by an interconnect]
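The explicit-communication point can be sketched without a cluster. This illustrative example mimics the message-passing model of MPI using two local Python processes joined by a pipe (the pipe, the worker function, and the data are stand-ins, not from the notes): the child cannot see the parent's memory, so data must be sent and received explicitly.

```python
# A minimal sketch of distributed memory: each process has private
# memory, and sharing data requires an explicit send/receive, as in
# an MPI program's point-to-point communication.
from multiprocessing import Process, Pipe

def worker(conn):
    # Runs in a separate process with its own address space:
    # it must receive its input and send its result explicitly.
    data = conn.recv()
    conn.send(sum(data))
    conn.close()

if __name__ == "__main__":
    parent_end, child_end = Pipe()
    p = Process(target=worker, args=(child_end,))
    p.start()
    parent_end.send([1, 2, 3, 4])   # explicit communication out ...
    print(parent_end.recv())        # ... and back: prints 10
    p.join()
```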
Goals of HPC
- Minimise the execution time of a given, fixed set of applications (strong scaling)
- Maximise the number of applications completed in a given amount of time (weak scaling)
- Identify a compromise between performance and cost
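These two goals are commonly formalised by Amdahl's law (strong scaling: fixed problem, more processors) and Gustafson's law (weak scaling: problem size grows with the processor count). A small sketch; the laws are standard results not named in the notes, and the serial fraction s = 0.05 is an illustrative value:

```python
# Amdahl's law: with serial fraction s, speedup on p processors
# is bounded, approaching 1/s as p grows (strong scaling).
def amdahl_speedup(s, p):
    return 1.0 / (s + (1.0 - s) / p)

# Gustafson's law: if the parallel part of the work grows with p,
# the scaled speedup grows almost linearly (weak scaling).
def gustafson_speedup(s, p):
    return p - s * (p - 1)

s = 0.05  # assumed serial fraction, for illustration only
for p in (1, 8, 64, 1024):
    print(p, round(amdahl_speedup(s, p), 2), round(gustafson_speedup(s, p), 2))
```

With s = 0.05, Amdahl's speedup saturates near 20 no matter how many processors are added, while Gustafson's scaled speedup keeps growing: which goal applies decides which law governs.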