Simplest Scalable Architecture

1 Simplest Scalable Architecture
    NOW: Network Of Workstations

2 Many types of Clusters (from HP's Dr. Bruce J. Walker)
    - High Performance Clusters: Beowulf; 1000 nodes; parallel programs; MPI
    - Load-leveling Clusters: move processes around to borrow cycles (e.g. Mosix)
    - Web-Service Clusters: LVS; load-level TCP connections; Web pages and applications
    - Storage Clusters: parallel filesystems; same view of data from each node
    - Database Clusters: Oracle Parallel Server
    - High Availability Clusters: ServiceGuard, Lifekeeper, Failsafe, heartbeat, failover clusters

3 Many types of Clusters (from HP's Dr. Bruce J. Walker)
    - High Performance Clusters: Beowulf; 1000 nodes; parallel programs; MPI
    - Load-leveling Clusters: move processes around to borrow cycles (e.g. Mosix)
    - Web-Service Clusters: LVS; load-level TCP connections; Web pages and applications
    - Storage Clusters: parallel filesystems; same view of data from each node
    - Database Clusters: Oracle Parallel Server
    - High Availability Clusters: ServiceGuard, Lifekeeper, Failsafe, heartbeat, failover clusters
    NOW type architectures

4 NOW Approaches
    - Single System View
    - Shared Resources
    - Virtual Machine
    - Single Address Space

5 Shared System View
    - Load-balancing clusters
    - High availability clusters
    - High performance
    - High throughput
    - High capability

6 Berkeley NOW

7 NOW Philosophies
    Commodity is cheaper
    - RAM was $40/MB for a PC
    - $600/MB for a Cray M90

8 NOW Philosophies
    Commodity is faster
    Table columns: CPU, MPP year, WS year (surviving entries: 150 MHz Alpha; MHz i, ~91; 32 MHz SS)

9 Network RAM
    - Swapping to disk is extremely expensive: milliseconds for a page swap on disk
    - Network performance is much higher: 700 µs for a page swap over the net

10 Network RAM

11 NOW or SuperComputer?
    Machine          Time    Cost
    C-90 (16)          27    $30M
    RS6000 (256)             $4M
    +ATM             2211    $5M
    +Parallel FS      205    $5M
    +NOW protocol      21    $5M

12 The Condor System
    - Unix and NT
    - Operational since 1986
    - More than 1300 CPUs at UW-Madison
    - Available on the web
    - More than 150 clusters worldwide in academia and industry

13 What is Condor?
    - Condor converts collections of distributively owned workstations and dedicated clusters into a high-throughput computing facility.
    - Condor uses matchmaking to make sure that everyone is happy.

14 What is High-Throughput Computing?
    - High-performance: CPU cycles/second under ideal circumstances. "How fast can I run simulation X on this machine?"
    - High-throughput: CPU cycles/day (week, month, year?) under non-ideal circumstances. "How many times can I run simulation X in the next month using all available machines?"

15 What is High-Throughput Computing?
    Condor does whatever it takes to run your jobs, even if some machines:
    - Crash! (or are disconnected)
    - Run out of disk space
    - Don't have your software installed
    - Are frequently needed by others
    - Are far away and administered by someone else

16 A Submit Description File
    # Example condor_submit input file
    # (Lines beginning with # are comments)
    # NOTE: the words on the left side are not
    #       case sensitive, but filenames are!
    Universe   = vanilla
    Executable = /home/wright/condor/my_job.condor
    Input      = my_job.stdin
    Output     = my_job.stdout
    Error      = my_job.stderr
    Arguments  = -arg1 -arg2
    InitialDir = /home/wright/condor/run_1
    Queue

17 What is Matchmaking?
    Condor uses matchmaking to make sure that work gets done within the constraints of both users and owners.
    - Users (jobs) have constraints: "I need an Alpha with 256 MB RAM"
    - Owners (machines) have constraints: "Only run jobs when I am away from my desk and never run jobs owned by Bob."
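The two-sided constraint check described above can be sketched in a few lines of C. This is only an illustration of the idea, not Condor's mechanism: Condor expresses both sides as ClassAds, whereas the `machine` and `job` structs, their fields, and the functions `is_match` and `find_match` below are hypothetical names invented for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified records; real Condor describes both sides with ClassAds. */
struct machine {
    const char *arch;         /* e.g. "ALPHA" */
    int memory_mb;            /* installed RAM */
    bool owner_away;          /* true when the owner is away from the desk */
    const char *banned_user;  /* user the owner never serves, or NULL */
};

struct job {
    const char *user;           /* who submitted the job */
    const char *required_arch;  /* architecture the job needs */
    int min_memory_mb;          /* RAM the job needs */
};

/* A match requires the constraints of BOTH the job and the machine owner to hold. */
bool is_match(const struct job *j, const struct machine *m)
{
    bool job_satisfied = strcmp(m->arch, j->required_arch) == 0 &&
                         m->memory_mb >= j->min_memory_mb;
    bool owner_satisfied = m->owner_away &&
                           (m->banned_user == NULL ||
                            strcmp(m->banned_user, j->user) != 0);
    return job_satisfied && owner_satisfied;
}

/* Return the index of the first machine that matches the job, or -1 if none does. */
int find_match(const struct job *j, const struct machine *pool, int npool)
{
    for (int i = 0; i < npool; i++)
        if (is_match(j, &pool[i]))
            return i;
    return -1;
}
```

With a pool containing an idle 256 MB Alpha whose owner bans "bob", a job from another user that needs an Alpha with 256 MB is placed there, while the same job submitted by "bob" finds no match.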

18 Process Checkpointing
    - Condor's process checkpointing mechanism saves all the state of a process into a checkpoint file: memory, CPU, I/O, etc.
    - The process can then be restarted from right where it left off
    - Typically no changes to your job's source code are needed; however, your job must be relinked with Condor's Standard Universe support library

19 Remote System Calls
    - I/O system calls are trapped and sent back to the submit machine
    - Allows transparent migration across administrative domains: checkpoint on machine A, restart on B
    - No source code changes required
    - Language independent
    - Opportunities for application steering. Example: Condor tells customer process how to open files

20 MOSIX and its characteristics
    - Software that can transform a Linux cluster of x86-based workstations and servers to run almost like an SMP
    - Has the ability to distribute and redistribute the processes among the nodes

21 MOSIX
    - Dynamic migration added to the BSD kernel (now Linux)
    - Uses TCP/IP for communication between workstations
    - Requires homogeneous networks

22 MOSIX
    - All processes start their life at the user's workstation
    - Migration is transparent and preemptive
    - Migrated processes use local resources as much as possible and the resources on the home workstation otherwise

23 Process Migration in MOSIX
    Figure: A local process and a migrated process (labels: User-level, Deputy, Remote, Link Layer, Kernel)

24 MOSIX

25 Mosix Make

26 PVM
    - Task based
    - Tasks can be created at runtime
    - Tasks can be notified on the death of a parent or child
    - Tasks can be grouped

27 PVM Architecture
    - Daemon-based communication
    - User-defined host list
    - Hosts can be added and removed during execution
    - The virtual machine may be used interactively or in the background

28 Heterogeneous Computing
    - Runs processes on different architectures
    - Handles conversion between little-endian and big-endian architectures
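To make the byte-order issue concrete, here is a small C sketch of the kind of conversion a message-passing layer has to perform when tasks run on machines with different endianness. It is not PVM's implementation; the function names (`host_is_little_endian`, `swap32`, `pack_u32`, `unpack_u32`) are hypothetical, and the choice of "most significant byte first" as the wire order is an assumption made for the example.

```c
#include <stdint.h>

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Reverse the byte order of a 32-bit value. */
uint32_t swap32(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) | ((x & 0x0000FF00u) << 8) |
           ((x & 0x00FF0000u) >> 8)  | ((x & 0xFF000000u) >> 24);
}

/* Write v into out[] most significant byte first, whatever the host order is. */
void pack_u32(uint32_t v, unsigned char out[4])
{
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Read a value written by pack_u32, whatever the host order is. */
uint32_t unpack_u32(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

Because `pack_u32` and `unpack_u32` work with shifts rather than memory layout, a sender and receiver of different endianness agree on the value without either one having to know the other's architecture.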

29 PVM communication model
    - Explicit message passing
    - Has mechanisms for packing into buffers and unpacking from buffers
    - Supports asynchronous communication
    - Supports one-to-many communication: broadcast, multicast

30 The virtual machine codes
    All calls to PVM return an integer; a value less than zero indicates an error
    pvm_perror(msg);

31 PVM

32 Managing the virtual machine
    Adding a host to the virtual machine
        int info = pvm_addhosts( char **hosts, int nhost, int *infos );
    Deleting a host in the virtual machine
        int info = pvm_delhosts( char **hosts, int nhost, int *infos );
    Shutting down the virtual machine
        int info = pvm_halt( void );

33 Managing the virtual machine
    Reading the virtual machine configuration
        int info = pvm_config( int *nhost, int *narch, struct pvmhostinfo **hostp );
        struct pvmhostinfo {
            int hi_tid;
            char *hi_name;
            char *hi_arch;
            int hi_speed;
        } hostp;

34 Managing the virtual machine
    Checking the status of a node
        int mstat = pvm_mstat( char *host );
    - PvmOk: host is OK
    - PvmNoHost: host is not in the virtual machine
    - PvmHostFail: host is unreachable (and thus possibly failed)

35 Tasks
    PVM tasks can be created and killed during execution
        id = pvm_mytid();
        cnt = pvm_spawn(image, argv, flag, node, num, tids);
        pid = pvm_parent();
        pvm_kill(tids[0]);
        pvm_exit();
        int status = pvm_pstat( tid );

36 Tasks
        int info = pvm_tasks( int where, int *ntask, struct pvmtaskinfo **taskp );
        struct pvmtaskinfo {
            int ti_tid;
            int ti_ptid;
            int ti_host;
            int ti_flag;
            char *ti_a_out;
            int ti_pid;
        } taskp;

37 Managing IO
    In the newest version of PVM, output may be redirected to the parent
        int bufid = pvm_catchout( FILE *ff );

38 Asynchronous events
    Notifications on special events
        info = pvm_notify(event, tag, cnt, tids);
        info = pvm_sendsig(tid, signal);

39 Groups
    Groups allow for easy partitioning of the execution in an application
        num = pvm_joingroup("worker");
        size = pvm_gsize("worker");
        info = pvm_lvgroup("worker");
        int inum = pvm_getinst( char *group, int tid );
        int tid = pvm_gettid( char *group, int inum );

40 Buffers
    PVM applications have a default send buffer and a default receive buffer
        buf = pvm_initsend(encoding);       /* PvmDataDefault, PvmDataRaw or PvmDataInPlace */
        info = pvm_pk<type>(data, 10, 1);   /* e.g. pvm_pkint */
        info = pvm_upk<type>(data, 10, 1);  /* e.g. pvm_upkint */

41 Managing Buffers
        bufid = pvm_mkbuf(encoding);        /* PvmDataDefault, PvmDataRaw or PvmDataInPlace */
        oldbuf = pvm_setrbuf(bufid);
        oldbuf = pvm_setsbuf(bufid);
        int info = pvm_freebuf( int bufid );
        int bufid = pvm_getrbuf( void );
        int bufid = pvm_getsbuf( void );

42 Receiving messages
    Messages may be received blocking or nonblocking
        bufid = pvm_probe(tid, tag);
        bufid = pvm_recv(tid, tag);
        bufid = pvm_trecv(tid, tag, tmout);
        bufid = pvm_nrecv(tid, tag);
        info = pvm_precv(tid, tag, array, cnt, type, &atid, &atag, &acnt);

43 Sending messages
    Messages can also be sent in various ways
        info = pvm_send(tid, tag);
        info = pvm_psend(tid, tag, data, cnt, type);

44 Managing Buffers
        bufid = pvm_mkbuf(encoding);        /* PvmDataDefault, PvmDataRaw or PvmDataInPlace */
        oldbuf = pvm_setrbuf(bufid);
        oldbuf = pvm_setsbuf(bufid);
        int info = pvm_bufinfo( int bufid, int *bytes, int *msgtag, int *tid );

45 Global reductions
    Global reductions are useful for a wide array of parallel applications
        info = pvm_reduce(PvmMax, &data, cnt, type, tag, "workers", root);
        /* root is the group instance number of the member that receives the result */

46 PVM Reductions
    Global: Sum, Product, Min, Max

47 PVM Synchronization
    Barrier
        inum = pvm_joingroup("worker");
        pvm_barrier("worker", 5);

48 Broadcast
    Sends the active buffer to all members of a group
        info = pvm_bcast("worker", 42);
    NOTE: the task that issues a broadcast need not be a member of the group!

49 Multicasting
    A message can be sent to a number of tasks without the existence of a shared group
        info = pvm_mcast(list, number, 42);

50 An example
    Finite differences
    - Well-known technique for solving differential equations
    - The one-dimensional version is trivial if we don't need information on the evolution in time

51 The model

52 The example

53 First Solution
    If left neighbor exists then
        read data from left
        send data to the left
    Update points 0..n-1
    If right neighbor exists then
        read data from right
        send data to the right
    Update point n

54 Problems with Solution 1?
    - Results in serialization!
    - We must eliminate this serialization

55 Second Solution
    If left neighbor exists then
        read data from left
        send data to the left
    If right neighbor exists then
        send data to the right
        read data from right
    Update points 0..n

56 Problems with Solution 2
    - Enforces strictly synchronous execution: the slowest task dictates progress
    - All communication takes place at the same time, which stresses the communication network

57 Solution 3
    If left neighbor exists then
        send data to the left
    If right neighbor exists then
        send data to the right
    Update points 1..n-1
    If left neighbor exists then
        read data from left
        Update point 0
    If right neighbor exists then
        read data from right
        Update point n
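The ordering in Solution 3 works because the interior points depend only on local data, so they can be updated while the neighbours' messages are still in flight. The C sketch below shows that split for one task. The slides do not preserve the actual update formula, so a simple three-point average is assumed here, and the function names and the `from_left`/`from_right` parameters (standing in for the received values) are hypothetical.

```c
/* Assumed stencil (not given in the slides): next[i] = (old[i-1] + old[i+1]) / 2.
   A task owns points 0..n, i.e. arrays of length n + 1. */

/* Points 1..n-1 need only local data, so they can be updated before any receive. */
void update_interior(const double *old, double *next, int n)
{
    for (int i = 1; i < n; i++)
        next[i] = 0.5 * (old[i - 1] + old[i + 1]);
}

/* Point 0 needs the value received from the left neighbour. */
void update_left_edge(const double *old, double *next, double from_left)
{
    next[0] = 0.5 * (from_left + old[1]);
}

/* Point n needs the value received from the right neighbour. */
void update_right_edge(const double *old, double *next, int n, double from_right)
{
    next[n] = 0.5 * (old[n - 1] + from_right);
}
```

In a message-passing version, the sends would be issued first, `update_interior` would run next, and each edge update would follow the matching receive. A task without a neighbour on one side would simply keep its boundary value.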

58 Problems with Solution 3
    - Practically none!
    - The only potential improvement is to overlap communication and calculation (latency hiding)

59 Solution 4
    If left neighbor exists then
        issue_read data from left
        issue_send data to the left
    If right neighbor exists then
        issue_read data from right
        issue_send data to the right
    Update points 1..n-1
    Finish_any_read; Update corresponding point
    Finish_any_read; Update corresponding point

60 Matrix Multiplication
    Used extremely frequently in scientific applications

61 Naïve version
    /* c must be zero-initialized by the caller */
    void mxmul(REAL **c, REAL **a, REAL **b, int n)
    {
        int i, j, k;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                for (k = 0; k < n; k++)
                    c[i][j] += a[i][k] * b[k][j];
    }
    The performance of the naïve version may be improved by maintaining B in its transposed form!
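The transposed-B idea can be sketched as follows. With B stored transposed, the inner loop reads both operands along a row, which is friendlier to the cache than striding down a column of B. This is a minimal illustration under stated assumptions: `REAL` is taken to be `double`, the matrices are flat row-major arrays rather than the `REAL **` of the slide, and the names `transpose` and `mxmul_bt` are hypothetical.

```c
typedef double REAL;  /* assumption: the slides do not say what REAL is */

/* Build bt, the transpose of the n x n row-major matrix b. */
void transpose(const REAL *b, REAL *bt, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            bt[j * n + i] = b[i * n + j];
}

/* c = a * b, where bt holds b in transposed form; all matrices are n x n row-major. */
void mxmul_bt(REAL *c, const REAL *a, const REAL *bt, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            REAL sum = 0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * bt[j * n + k];  /* both operands walk along a row */
            c[i * n + j] = sum;
        }
}
```

The transpose costs O(n^2) work once, which is small next to the O(n^3) multiplication, and it can be skipped entirely if B is kept transposed throughout the application.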

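62 Blocked Sequential Version
    /* Computes block (is, js) of c; assumes bs divides n and c is zero-initialized */
    void bmul(REAL **c, REAL **a, REAL **b, int is, int js, int bs, int n)
    {
        int i, j, k;
        for (i = is*bs; i < is*bs + bs; i++)
            for (j = js*bs; j < js*bs + bs; j++)
                for (k = 0; k < n; k++)
                    c[i][j] += a[i][k] * b[k][j];
    }
    /* Visits every bs x bs block of c once */
    void mxmul(REAL **c, REAL **a, REAL **b, int bs, int n)
    {
        int is, js;
        for (is = 0; is < n/bs; is++)
            for (js = 0; js < n/bs; js++)
                bmul(c, a, b, is, js, bs, n);
    }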

63 Performance of the Basic versions

64 Recursive Version
    Matrix mxmul(Matrix A, Matrix B, int s)
    {
        Matrix C;
        if (s == 1)
            C = A * B;
        else {
            s = s / 2;
            p0 = mxmul(UL(A), UL(B), s);
            p1 = mxmul(UR(A), LL(B), s);
            p2 = mxmul(UL(A), UR(B), s);
            p3 = mxmul(UR(A), LR(B), s);
            p4 = mxmul(LL(A), UL(B), s);
            p5 = mxmul(LR(A), LL(B), s);
            p6 = mxmul(LL(A), UR(B), s);
            p7 = mxmul(LR(A), LR(B), s);
            UL(C) = p0 + p1;
            UR(C) = p2 + p3;
            LL(C) = p4 + p5;
            LR(C) = p6 + p7;
        }
        return C;
    }

65 Blocked Parallel Version
    If we have a broadcast medium then we can efficiently broadcast blocks to all workers

66 Blocked Parallel Version
    Done in W broadcasts using W workers!

67 Blocked Version in PVM
    - Each worker holds one row block and the corresponding column block
    - Worker zero broadcasts its column block first, then worker one, and so forth
    - The result is that exactly the size of B is broadcast, in W blocks

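68 Main
    /* n, size, rank, basicbsize, lastbsize, mybsize, a, b, tb and c are globals */
    int main(int argc, char **argv)
    {
        int bs;
        char msg[1024];
        n = atoi(argv[1]);
        bs = atoi(argv[2]);
        size = atoi(argv[3]);
        pvm_joingroup("workers");
        rank = pvm_getinst("workers", pvm_mytid());
        basicbsize = n / size;
        lastbsize = basicbsize + n % size;   /* the last worker takes the remainder */
        if (rank == size-1) mybsize = lastbsize;
        else mybsize = basicbsize;
        a = (REAL *)malloc(n * lastbsize * sizeof(REAL));
        /* same for b, tb and c */
        mmul(bs);
        pvm_exit();
        return 0;
    }

69 Main loop
    /* pvm_pkreal/pvm_upkreal: presumably the pvm_pk/pvm_upk routine matching REAL, e.g. pvm_pkdouble */
    mmul(int bs)
    {
        int w, i, j, k;
        int src, atag, acnt;
        REAL *t = tb;
        for (w = 0; w < size; w++) {
            pvm_initsend(pvm_com_model);
            if (rank == w) {
                /* my turn: use my own block and broadcast it */
                tb = b;
                pvm_pkreal(b, n*(w==size-1 ? lastbsize : basicbsize), 1);
                pvm_bcast("workers", 100+w);
            } else {
                /* receive worker w's block into the temporary buffer */
                pvm_recv(-1, 100+w);
                pvm_upkreal(tb, n*(w==size-1 ? lastbsize : basicbsize), 1);
            }
            for (i = 0; i < mybsize; i += bs)
                for (j = 0; j < mybsize; j += bs)
                    bmul(i, i+bs, j, j+bs);
            tb = t;
        }
    }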

70 How may this version be improved?
    Overlapping communication and calculation

71 Summary
    - PVM is similar to programming with threads, except you need message passing
    - At first, parallel programs may be very inefficient
    - More efficient programs are more complex

72 Programming NOW
    - Dynamic load balancing
    - Dynamic orchestration

73 Dynamic Load Balancing
    - Base your applications on redundant parallelism
    - Rely on the OS to balance the application over the CPUs
    - Rather few applications can be orchestrated in this way

74 Barnes Hut
    - Galaxy simulations are still quite interesting
    - Basic formula is:
    - Naïve algorithm is O(n^2)

75 Barnes Hut

76 Barnes Hut O(n log n)

77 Balancing Barnes Hut

78 Dynamic Orchestration
    - Divide your application into a job-queue
    - Spawn workers
    - Let the workers take and execute jobs from the queue
    - Not all applications can be orchestrated in this way
    - Does not scale well: the job-queue process may become a bottleneck

79 Parallel integration

80 Parallel integration
    Split the outer integral
        Jobs = range(x1, x2, interval)
        Tasks = integral with x1 = Jobs[i], x2 = Jobs[i+1]; for i in range(len(Jobs) - 1)
        Result = Sum(Execute(Tasks))
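The job-splitting scheme above can be sketched in C as follows. This is an illustration only: it integrates a one-dimensional function with the midpoint rule and runs the tasks one after another, whereas in a NOW each job would be handed to a worker. The names `job_range`, `execute` and `integrate`, the `struct job` type, and the sample integrand `square` are all hypothetical.

```c
struct job { double x1, x2; };   /* one sub-range of the integral */

/* Example integrand, chosen only for illustration. */
double square(double x) { return x * x; }

/* Job i out of njobs equal sub-ranges of [x1, x2]. */
struct job job_range(double x1, double x2, int njobs, int i)
{
    double width = (x2 - x1) / njobs;
    struct job j = { x1 + i * width, x1 + (i + 1) * width };
    return j;
}

/* One task: midpoint rule over a single job, using `steps` panels. */
double execute(double (*f)(double), struct job j, int steps)
{
    double h = (j.x2 - j.x1) / steps, sum = 0.0;
    for (int s = 0; s < steps; s++)
        sum += f(j.x1 + (s + 0.5) * h);
    return sum * h;
}

/* Result = Sum(Execute(Tasks)); the tasks are independent, so workers could run them in any order. */
double integrate(double (*f)(double), double x1, double x2, int njobs, int steps)
{
    double result = 0.0;
    for (int i = 0; i < njobs; i++)
        result += execute(f, job_range(x1, x2, njobs, i), steps);
    return result;
}
```

For example, `integrate(square, 0.0, 3.0, 4, 100)` splits [0, 3] into four jobs and approximates the exact value 9. Because no job depends on another, the only communication needed is handing out the ranges and collecting the partial sums.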

81 Genetic Algorithms
    Genetic algorithms are very well suited for NOW type architectures
    - Require much processing time
    - Little communication
    - Many independent blocks

82 Example
    - Based on Conway's Game of Life
    - We have an area with weed, bacteria, or another simple organism
    - Life in this scenario is governed by very simple rules
    - We desire an initial setup that returns the most life after exactly 100 iterations

83 Rules
    - A cell with fewer than 2 neighbors dies from loneliness
    - A cell with more than 3 neighbors dies from crowding
    - A living cell with 2 or 3 neighbors survives to the next generation
    - A dead cell with exactly 3 neighbors springs to life by reproduction
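The four rules above translate directly into one generation step. The C sketch below is a minimal version under stated assumptions: the grid is a fixed 8 x 8 flat array, cells outside the grid count as dead (no wrap-around), and the names `live_neighbors`, `life_step` and `population` are hypothetical. A fitness function for the genetic algorithm could call `life_step` 100 times and then `population`.

```c
#define ROWS 8
#define COLS 8

/* Count live cells among the up to eight neighbours of (r, c).
   Assumption: cells outside the grid are dead (no wrap-around). */
static int live_neighbors(const unsigned char *g, int r, int c)
{
    int count = 0;
    for (int dr = -1; dr <= 1; dr++)
        for (int dc = -1; dc <= 1; dc++) {
            int rr = r + dr, cc = c + dc;
            if ((dr != 0 || dc != 0) &&
                rr >= 0 && rr < ROWS && cc >= 0 && cc < COLS)
                count += g[rr * COLS + cc];
        }
    return count;
}

/* Compute the next generation from cur into next (both ROWS*COLS cells, 1 = alive). */
void life_step(const unsigned char *cur, unsigned char *next)
{
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++) {
            int n = live_neighbors(cur, r, c);
            if (cur[r * COLS + c])
                next[r * COLS + c] = (n == 2 || n == 3);  /* survives, else dies */
            else
                next[r * COLS + c] = (n == 3);            /* birth by reproduction */
        }
}

/* Number of living cells: the quantity the search tries to maximize. */
int population(const unsigned char *g)
{
    int alive = 0;
    for (int i = 0; i < ROWS * COLS; i++)
        alive += g[i];
    return alive;
}
```

Writing into a separate `next` array matters: all cells must be judged against the same generation, so updating in place would give wrong neighbour counts.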

84 Approach
    - Let the computer test various initial population sizes and various mutation rates
    - Run a parallel solution finder using the island model, where each node in a NOW runs independently from the others
    - But nodes exchange champions every once in a while