High-Performance Computing: Architecture and APIs
1 High-Performance Computing: Architecture and APIs Douglas Fuller ASU Fulton High Performance Computing
2 Why HPC? Capacity computing Do similar jobs, but lots of them. Capability computing Run programs we could not before! Insufficient resources (memory, usually) Insufficient time
3 Commonly Discussed HPC Terms Supercomputer Cluster (Beowulf Cluster) Shared Memory Machine Grid Scalability
4 Intel Core 2 Duo - ~3 GFLOPS
5 Scaling von Neumann Where are the bottlenecks? (diagram: CPU, memory, I/O)
6 Vector processing SIMD approach (sound familiar?) Special register/execution units Handles large amounts of data at once Good for linear algebra / scientific computing Can be assisted by language support Can be partially leveraged by compilers
7 Cray-1 First successful vector system: 64-bit, 80 MHz 8 MB RAM 250 MFLOPS peak (136 typical) 115 kW
8 Vector processing Resurgence with high-end A/V market MMX, 3DNow!, SSE, SSE2 GPUs Game consoles iPhones Vector processing leading to gaming/HPC convergence (Cell)
9 SMP Systems MIMD approach Commodity processors connect through an interconnect to a single logical memory Demands on the interconnection bus are extremely high Sustained memory bandwidth Fetch latency Cache coherence is a real problem Programs must still have concurrency, but variables are shared
10 Digression: cache coherence Multiple processors using the same data These processors' caches must stay synchronized! This introduces considerable overhead and limits scalability
11 Programming SMPs The same as multithreaded serial programming, right? (example) Well, almost. More locality issues False sharing Toolkits to help OpenMP
12 SMP systems today Just about everything has multiple cores Intel Core, Core 2, Xeon,... AMD Opteron,... Cache strategies vary Transistor count (AMD vs. Intel) Memory bandwidth (and local IC buses)
13 Scaling the SMP Remember the critical system bus Broadcast coherence messages and shared links impose high bandwidth requirements Cores aren't the problem. Solution: serialize memory bus communication Removes the S in SMP
14 NUMA Similar to SMP, but we give up the S Reduces bus bandwidth requirements Requires interconnect design (more later) Introduces a penalty for remote memory access! Cache coherence pops up again What about for remote memory?
15 The directory Tracks which CPUs have each cache line Allows point-to-point messages for cache coherence How do you locate a remote block that's cached by another processor?
16 Programming NUMA It's just like writing for SMPs. (example) Right? Sort of. It looks the same, but there are more factors to consider. Architecture design imposes a performance impact Code still must be architecture-aware!
17 NUMA today Still exists for HPC, but expensive Custom hardware, directory units, interconnects Custom software (single system image) Commodity processors AMD Opteron (DirectConnect brings MMU onboard)
18 MPP Systems Use a large number of weaker processors Most decouple their memory subsystems - Distributed memory Relies on: Smart system, Smart compiler, or Smart programmer
19 MPP Systems Processors interconnected with custom hardware Architectures vary widely
20 Thinking Machines CM-5 Up to 16,384 32 MHz processors Largest ever built was 512 processors, 64 GF peak 16 GB main memory Where was the most famous CM-5?
21 Programming MPPs Program follows architecture (including interconnect) Many MPPs support multiple models More architecture-aware models perform better Less architecture-aware models are more portable What to choose when developing a program?
22 Interconnecting Topology choice critical; considerations include: Performance (latency and bandwidth) Conformity/uniformity Cost Scalability
23 Interconnecting Completely Connected: each processor has a direct communication link to every other processor (fully connected) Star Connected Network: the middle processor is the central processor; every other processor is connected to it (counterpart of the crossbar switch in dynamic interconnects) Linear Array and Ring Mesh Network (e.g. 2D array)
24 Interconnecting Torus: 2-d torus (the 2-d version of the ring) Hypercube Network: a multidimensional mesh of processors with exactly two processors in each dimension; a d-dimensional hypercube consists of p = 2^d processors (shown: 0-D, 1-D, 2-D, and 3-D hypercubes) Tree and Fat Trees: multiple switches; each level has the same number of links in as out; an increasing number of links at each level gives full bandwidth between the links; added latency the higher you go
25 Look familiar? Desktop systems use the same architectures Token Ring SONET FDDI Ethernet
26 Desktop systems Leverage the economics Commodity parts CPUs and memory Circuit City supercomputing Interconnect is now a commodity network.
27 Beowulf Clusters A Beowulf is a parallel computer consisting of a collection of nodes built from commodity parts Each node has its own processors, memory, and I/O Nodes communicate through an interconnection network. One node, designated master or head, is attached to both the public network and the interconnection network (diagram: compute nodes on the interconnection network, master node on the Internet or internal network)
28 Clusters Important Characteristics Commodity Components - Mass Market R&D investment keeps technology moving forward Distributed memory - your old program won't speed up Communication between processors has a cost
29 Programming clusters Multiple system images, therefore there is NO shared memory. Many models try to emulate earlier architectures. Why? By far the most popular is MPI.
30 A 10 minute introduction to MPI
31 What is MPI? The Message Passing Interface, de facto standard for message passing Unified many vendor-specific message-passing libraries in the 1990s Works with C, FORTRAN, F90 (always), C++ (usually) and more exotic things (e.g. Python) occasionally Allows programmer to explicitly send/receive messages among processes in a parallel program Supports the Data Parallel programming model
32 Data Parallel Programming One program, many copies. Each instance of the program (task) does the same instructions on different data. Each task has its own local memory The trick (for the programmer): Remember it's parallel Remember what's in what memory
33 Why the Data Parallel Model? Only one program to worry about. Easier to debug program. Easier to visualize program behavior. Naturally load balances (sometimes).
34 Introduction to MPI MPI is a standard for message passing interfaces MPI-1 covers point-to-point and collective communication Point-to-point: explicit messages (send/receive) Collective: express patterns of communication MPI-2 covers connection-based communication and I/O Typical implementations include MPICH, LAM/MPI, and Open MPI
35 MPI in Six Functions MPI_Init - start using MPI MPI_Comm_size - get the number of tasks MPI_Comm_rank - the unique index of this task MPI_Send - send a message MPI_Recv - receive a message MPI_Finalize - stop using MPI
36 Initialize and Finalize The first MPI call must be to MPI_Init. The last MPI call must be to MPI_Finalize. #include <mpi.h> int main(int argc, char **argv) { MPI_Init(&argc, &argv); /* put program here */ MPI_Finalize(); return 0; }
37 Initialize and Finalize C: int MPI_Init(int *argc, char ***argv); int MPI_Finalize(void); Fortran: MPI_INIT(ierror), MPI_FINALIZE(ierror), integer ierror C++: void MPI::Init(int& argc, char**& argv); void MPI::Finalize();
38 Size and Rank MPI_Comm_size returns the number of tasks in the job int size; MPI_Comm_size(MPI_COMM_WORLD, &size); MPI_Comm_rank returns the number of the current task (0.. size-1) int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
39 MPI Communicators Abstract structure represents a group of MPI tasks that can communicate MPI_COMM_WORLD represents all of the tasks in a given job Programmer can create new communicators to subset MPI_COMM_WORLD RANK or task number is relative to a given communicator Messages from different communicators do not interfere
40 A Simple Example #include "mpi.h" #include <stdio.h> int main(int argc, char *argv[]) { int rank, size; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank); MPI_Comm_size(MPI_COMM_WORLD, &size); printf("hello world from process %d of %d\n", rank, size); MPI_Finalize(); return 0; }
41 Send and Recv MPI_Send to send a message char sbuf[COUNT]; MPI_Send(sbuf, COUNT, MPI_CHAR, 1, 99, MPI_COMM_WORLD); MPI_Recv to receive a message char rbuf[COUNT]; MPI_Status status; MPI_Recv(rbuf, COUNT, MPI_CHAR, 1, 99, MPI_COMM_WORLD, &status);
42 Anatomy of MPI_Recv MPI_Recv(rbuf, COUNT, MPI_CHAR, 1, 99, MPI_COMM_WORLD, &status); rbuf : pointer to receive buffer COUNT : items in receive buffer MPI_CHAR : MPI datatype 1 : source task number (rank) 99 : message tag MPI_COMM_WORLD : communicator &status : pointer to status struct
43 MPI Datatypes Encodes type of data sent and received Built-in types MPI_CHAR, MPI_SHORT, MPI_INT, MPI_LONG MPI_FLOAT, MPI_DOUBLE, MPI_LONG_DOUBLE MPI_BYTE, MPI_PACKED User defined types MPI_Type_contiguous, MPI_Type_vector, MPI_Type_indexed, MPI_Type_struct MPI_Pack, MPI_Unpack
44 A Quick Send and Receive Example #include "mpi.h" #include <stdio.h> #include <string.h> int main(int argc, char *argv[]) { int numprocs, myrank, namelen, i; char processor_name[MPI_MAX_PROCESSOR_NAME]; char greeting[MPI_MAX_PROCESSOR_NAME + 80]; MPI_Status status; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &myrank); MPI_Comm_size(MPI_COMM_WORLD, &numprocs); MPI_Get_processor_name(processor_name, &namelen); sprintf(greeting, "hello world from process %d of %d on %s", myrank, numprocs, processor_name);
45 A Quick Send and Receive Example if (myrank == 0) { printf("%s\n", greeting); for (i = 1; i < numprocs; i++) { MPI_Recv(greeting, sizeof(greeting), MPI_CHAR, i, 1, MPI_COMM_WORLD, &status); printf("%s\n", greeting); } } else { MPI_Send(greeting, strlen(greeting) + 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD); } MPI_Finalize(); return 0; }
46 Collective Operations Rather than dealing with individual messages, express common patterns of communication Simpler coding Hide optimization Hide cluster topology details Called at the same time by every task in the communicator (no if/else) - true data parallel
47 Common Collectives Broadcast / Reduce Scatter / Gather Barrier All-to-all
48 Sending to 8 nodes with a for loop and MPI_Send vs. MPI_Bcast
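The comparison above can be sketched as follows (requires an MPI installation; run with e.g. mpirun -np 8 ./a.out). Both approaches deliver the same value to every rank, but MPI_Bcast lets the library use a tree-shaped communication pattern instead of the root sending size-1 messages one at a time.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Naive version: root loops over MPI_Send, everyone else receives. */
    if (rank == 0) {
        value = 42;
        for (int i = 1; i < size; i++)
            MPI_Send(&value, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    /* Collective version: one call, made by every rank in the
       communicator, with rank 0 as the root. */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d has value %d\n", rank, value);
    MPI_Finalize();
    return 0;
}
```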
49 Message Passing Cautions All messages are overhead (the non-parallel program wouldn't have them). Messages take substantial time Use them only when necessary, and group together as many as possible (long blocks of computation between communication raises performance!).
51 Parallelism in Monte Carlo Methods Monte Carlo methods are often amenable to parallelism Find an estimate about p times faster OR Reduce error of estimate by a factor of p^(1/2) The trick to parallelizing MC methods is developing independent random number generators!!!
52 Linear Congruential RNGs X_i = (a X_{i-1} + c) mod M a: multiplier, c: additive constant, M: modulus Sequence depends on choice of seed, X_0
57 Period of Linear Congruential RNG Maximum period is M For 32-bit integers maximum period is 2^32, or about 4 billion This is too small for modern computers Use a generator with at least 48 bits of precision
58 Producing Floating-Point Numbers X_i, a, c, and M are all integers X_i ranges in value from 0 to M-1 To produce floating-point numbers in range [0, 1), divide X_i by M
59 Defects of Linear Congruential RNGs Least significant bits correlated Especially when M is a power of 2 k-tuples of random numbers form a lattice Especially pronounced when k is large
60 Lagged Fibonacci RNGs X_i = X_{i-p} * X_{i-q} p and q are lags, p > q * is any binary arithmetic operation: addition modulo M, subtraction modulo M, multiplication modulo M, or bitwise exclusive or
67 Properties of Lagged Fibonacci RNGs Require p seed values Careful selection of seed values, p, and q can result in very long periods and good randomness For example, suppose M has b bits Maximum period for additive lagged Fibonacci RNG is (2^p - 1) 2^(b-1)
68 Ideal Parallel RNGs All properties of sequential RNGs No correlations among numbers in different sequences Scalability Locality
69 Parallel RNG Designs Manager-worker Leapfrog Sequence splitting Independent sequences
70 Manager-Worker Parallel RNG Manager process generates random numbers Worker processes consume them If algorithm is synchronous, may achieve goal of consistency Not scalable Does not exhibit locality
71 Leapfrog Method Process with rank 1 of 4 processes (diagram: each process takes every 4th value of the sequence, so rank 1 gets X_1, X_5, X_9, ...)
79 Properties of Leapfrog Method Easy to modify linear congruential RNG to support jumping by p Can allow parallel program to generate same tuples as sequential program Does not support dynamic creation of new random number streams
80 Sequence Splitting Process with rank 1 of 4 processes (diagram: the sequence is divided into contiguous blocks, one block per process)
88 Properties of Sequence Splitting Forces each process to move ahead to its starting point Does not support goal of reproducibility May run into long-range correlation problems Can be modified to support dynamic creation of new sequences
89 Independent Sequences Run sequential RNG on each process Start each with different seed(s) or other parameters Example: linear congruential RNGs with different additive constants Works well with lagged Fibonacci RNGs Supports goals of locality and scalability
90 Best Approach - Use an Existing Library SPRNG (Scalable Parallel Random Number Generator) from Florida State is an MPI-based library for generating random numbers independently Linear congruential generator on one node provides seed values for lagged Fibonacci generators on other nodes Ridiculously long period, good statistical properties SPRNG is simple and robust, and is highly recommended.
91 SPRNG Example See code listings: sprng_mpi.c seed_mpi.c 2streams_mpi.c pi-simple_mpi.c
92 Parting Note... If you are doing anything special (massive runs, massive storage, massive memory, meeting deadlines, non-traditional usage), please contact us and let us work with you to meet your needs. Policies are there to keep automated systems running well, they are not locked in stone.
93 How to Get More Help Online If something isn't there, fill out a service request to ask for help (same form as account request) Someone will respond next business day hpc@asu.edu Phone -- Leah Kritzer - The HPCI front desk More lectures would be fun: Short courses offered again soon CSE 494/598 (SP07) - a one-semester course in MPI and HPC
94 Grids Loosely coupled sets of HPC (and other) compute resources No centralized control Middleware moves jobs to resources A way to share resources (diagram: workstation and grid portal connected to clusters, an SMP, and a database server)
White Paper The Numascale Solution: Extreme BIG DATA Computing By: Einar Rustad ABOUT THE AUTHOR Einar Rustad is CTO of Numascale and has a background as CPU, Computer Systems and HPC Systems De-signer
More informationBenchmarking Large Scale Cloud Computing in Asia Pacific
2013 19th IEEE International Conference on Parallel and Distributed Systems ing Large Scale Cloud Computing in Asia Pacific Amalina Mohamad Sabri 1, Suresh Reuben Balakrishnan 1, Sun Veer Moolye 1, Chung
More informationSession 2: MUST. Correctness Checking
Center for Information Services and High Performance Computing (ZIH) Session 2: MUST Correctness Checking Dr. Matthias S. Müller (RWTH Aachen University) Tobias Hilbrich (Technische Universität Dresden)
More informationRevoScaleR Speed and Scalability
EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution
More informationIntroduction. Reading. Today MPI & OpenMP papers Tuesday Commutativity Analysis & HPF. CMSC 818Z - S99 (lect 5)
Introduction Reading Today MPI & OpenMP papers Tuesday Commutativity Analysis & HPF 1 Programming Assignment Notes Assume that memory is limited don t replicate the board on all nodes Need to provide load
More informationBLM 413E - Parallel Programming Lecture 3
BLM 413E - Parallel Programming Lecture 3 FSMVU Bilgisayar Mühendisliği Öğr. Gör. Musa AYDIN 14.10.2015 2015-2016 M.A. 1 Parallel Programming Models Parallel Programming Models Overview There are several
More informationOverlapping Data Transfer With Application Execution on Clusters
Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm reid@cs.toronto.edu stumm@eecg.toronto.edu Department of Computer Science Department of Electrical and Computer
More informationLecture 1: the anatomy of a supercomputer
Where a calculator on the ENIAC is equipped with 18,000 vacuum tubes and weighs 30 tons, computers of the future may have only 1,000 vacuum tubes and perhaps weigh 1½ tons. Popular Mechanics, March 1949
More informationCOMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)
COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu COMP
More informationInterconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003
Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Josef Pelikán Charles University in Prague, KSVI Department, Josef.Pelikan@mff.cuni.cz Abstract 1 Interconnect quality
More informationCellular Computing on a Linux Cluster
Cellular Computing on a Linux Cluster Alexei Agueev, Bernd Däne, Wolfgang Fengler TU Ilmenau, Department of Computer Architecture Topics 1. Cellular Computing 2. The Experiment 3. Experimental Results
More informationnumascale White Paper The Numascale Solution: Extreme BIG DATA Computing Hardware Accellerated Data Intensive Computing By: Einar Rustad ABSTRACT
numascale Hardware Accellerated Data Intensive Computing White Paper The Numascale Solution: Extreme BIG DATA Computing By: Einar Rustad www.numascale.com Supemicro delivers 108 node system with Numascale
More informationWhy the Network Matters
Week 2, Lecture 2 Copyright 2009 by W. Feng. Based on material from Matthew Sottile. So Far Overview of Multicore Systems Why Memory Matters Memory Architectures Emerging Chip Multiprocessors (CMP) Increasing
More informationwhat operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?
Inside the CPU how does the CPU work? what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored? some short, boring programs to illustrate the
More informationUnderstanding the Benefits of IBM SPSS Statistics Server
IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster
More informationDavid Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems
David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems About me David Rioja Redondo Telecommunication Engineer - Universidad de Alcalá >2 years building and managing clusters UPM
More informationParallel Processing and Software Performance. Lukáš Marek
Parallel Processing and Software Performance Lukáš Marek DISTRIBUTED SYSTEMS RESEARCH GROUP http://dsrg.mff.cuni.cz CHARLES UNIVERSITY PRAGUE Faculty of Mathematics and Physics Benchmarking in parallel
More informationBinary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
More informationCMSC 611: Advanced Computer Architecture
CMSC 611: Advanced Computer Architecture Parallel Computation Most slides adapted from David Patterson. Some from Mohomed Younis Parallel Computers Definition: A parallel computer is a collection of processing
More informationAgenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC
HPC Architecture End to End Alexandre Chauvin Agenda HPC Software Stack Visualization National Scientific Center 2 Agenda HPC Software Stack Alexandre Chauvin Typical HPC Software Stack Externes LAN Typical
More informationPrinciples and characteristics of distributed systems and environments
Principles and characteristics of distributed systems and environments Definition of a distributed system Distributed system is a collection of independent computers that appears to its users as a single
More informationRethinking SIMD Vectorization for In-Memory Databases
SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest
More informationImproved LS-DYNA Performance on Sun Servers
8 th International LS-DYNA Users Conference Computing / Code Tech (2) Improved LS-DYNA Performance on Sun Servers Youn-Seo Roh, Ph.D. And Henry H. Fong Sun Microsystems, Inc. Abstract Current Sun platforms
More informationBuilding an Inexpensive Parallel Computer
Res. Lett. Inf. Math. Sci., (2000) 1, 113-118 Available online at http://www.massey.ac.nz/~wwiims/rlims/ Building an Inexpensive Parallel Computer Lutz Grosz and Andre Barczak I.I.M.S., Massey University
More informationVirtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies
Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies Kurt Klemperer, Principal System Performance Engineer kklemperer@blackboard.com Agenda Session Length:
More informationIntroducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child
Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.
More informationIntroduction to High Performance Cluster Computing. Cluster Training for UCL Part 1
Introduction to High Performance Cluster Computing Cluster Training for UCL Part 1 What is HPC HPC = High Performance Computing Includes Supercomputing HPCC = High Performance Cluster Computing Note: these
More informationCHAPTER 4 MARIE: An Introduction to a Simple Computer
CHAPTER 4 MARIE: An Introduction to a Simple Computer 4.1 Introduction 195 4.2 CPU Basics and Organization 195 4.2.1 The Registers 196 4.2.2 The ALU 197 4.2.3 The Control Unit 197 4.3 The Bus 197 4.4 Clocks
More informationFull and Para Virtualization
Full and Para Virtualization Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF x86 Hardware Virtualization The x86 architecture offers four levels
More informationA Flexible Cluster Infrastructure for Systems Research and Software Development
Award Number: CNS-551555 Title: CRI: Acquisition of an InfiniBand Cluster with SMP Nodes Institution: Florida State University PIs: Xin Yuan, Robert van Engelen, Kartik Gopalan A Flexible Cluster Infrastructure
More informationsupercomputing. simplified.
supercomputing. simplified. INTRODUCING WINDOWS HPC SERVER 2008 R2 SUITE Windows HPC Server 2008 R2, Microsoft s third-generation HPC solution, provides a comprehensive and costeffective solution for harnessing
More informationLattice QCD Performance. on Multi core Linux Servers
Lattice QCD Performance on Multi core Linux Servers Yang Suli * Department of Physics, Peking University, Beijing, 100871 Abstract At the moment, lattice quantum chromodynamics (lattice QCD) is the most
More informationChapter 1 Computer System Overview
Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides
More informationComputers. Hardware. The Central Processing Unit (CPU) CMPT 125: Lecture 1: Understanding the Computer
Computers CMPT 125: Lecture 1: Understanding the Computer Tamara Smyth, tamaras@cs.sfu.ca School of Computing Science, Simon Fraser University January 3, 2009 A computer performs 2 basic functions: 1.
More informationLet s put together a Manual Processor
Lecture 14 Let s put together a Manual Processor Hardware Lecture 14 Slide 1 The processor Inside every computer there is at least one processor which can take an instruction, some operands and produce
More informationDepartment of Computer Sciences University of Salzburg. HPC In The Cloud? Seminar aus Informatik SS 2011/2012. July 16, 2012
Department of Computer Sciences University of Salzburg HPC In The Cloud? Seminar aus Informatik SS 2011/2012 July 16, 2012 Michael Kleber, mkleber@cosy.sbg.ac.at Contents 1 Introduction...................................
More informationDesign and Implementation of the Heterogeneous Multikernel Operating System
223 Design and Implementation of the Heterogeneous Multikernel Operating System Yauhen KLIMIANKOU Department of Computer Systems and Networks, Belarusian State University of Informatics and Radioelectronics,
More informationCPU Organisation and Operation
CPU Organisation and Operation The Fetch-Execute Cycle The operation of the CPU 1 is usually described in terms of the Fetch-Execute cycle. 2 Fetch-Execute Cycle Fetch the Instruction Increment the Program
More information