High performance computing systems. Lab 1

Dept. of Computer Architecture
Faculty of ETI
Gdansk University of Technology
Paweł Czarnul

For this exercise, study basic MPI functions such as:

1. for MPI management: MPI_Init(...), MPI_Finalize().

   Each MPI program should start with MPI_Init(...) and finish with MPI_Finalize(). Each process can fetch the number of processes in the default communicator MPI_COMM_WORLD (the application) by calling MPI_Comm_size() (see the example below). Processes in an MPI application are identified by so-called ranks ranging from 0 to n-1, where n is the number of processes returned by MPI_Comm_size(); a process obtains its own rank by calling MPI_Comm_rank(). Based on its rank, each process can perform a part of the required computations, so that all processes contribute to the final goal and together process all required data.

2. for point-to-point communication: MPI_Send(...), MPI_Recv(...).

   int MPI_Send(void *buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm)

   MPI_Send sends the data pointed to by buf to the process with rank dest. The buffer should hold count elements of data type dtype. For instance, when sending 5 doubles, count should be 5 and dtype should be MPI_DOUBLE. tag is a non-negative integer which additionally describes the message, and comm can be MPI_COMM_WORLD for the default communicator.

   int MPI_Recv(void *buf, int count, MPI_Datatype dtype, int src, int tag, MPI_Comm comm, MPI_Status *stat)

   MPI_Recv is a blocking receive which waits for a message with tag tag from the process with rank src in communicator comm. dtype and count denote the type and the maximum number of elements which are to be received and stored in buf. stat holds information about the received message.

3. for collective communication: MPI_Barrier(...), MPI_Gather(...), MPI_Scatter(...), MPI_Allgather(...).

As an example of a collective operation,

   int MPI_Reduce(void *sbuf, void *rbuf, int count, MPI_Datatype dtype, MPI_Op op, int root, MPI_Comm comm)

reduces the values supplied by all processes in communicator comm to a single value in the process with rank root, using the operation op (for example MPI_SUM). See the code below for adding numbers given by all processes to a single value in process 0.

Study the following tutorial on MPI: http://www.lam-mpi.org/tutorials/

The following example computes pi in parallel using an old method from the 17th century:

    Pi/4 = 1/1 - 1/3 + 1/5 - 1/7 + 1/9 - ...    (1)

Note that the program works for any number of processes requested. Successive elements of (1) are assigned to successive processes with ranks from 0 to (proccount-1).

For 2 processes:

    Pi/4    = 1/1 - 1/3 + 1/5 - 1/7 + 1/9 - ...
    process    0     1     0     1     0

For 3 processes:

    Pi/4    = 1/1 - 1/3 + 1/5 - 1/7 + 1/9 - 1/11 + ...
    process    0     1     2     0     1     2

etc.

This interleaved assignment is a simple load balancing technique. It matters when the cost of an element grows with its position: for example, checking if successive numbers are prime numbers might involve more time for larger numbers, and interleaving gives every process a mix of cheap and expensive elements. This strategy balances the execution time among processes quite well.

Note that in reality we only consider a predefined number of elements in (1). In general, we should make sure that the data types used for adding the numbers can store the resulting subsums.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    double precision = 1000000000;
    int myrank, proccount;
    double pi, pi_final;
    int mine, sign;
    int i; // unused

    // Initialize MPI
    MPI_Init(&argc, &argv);

    // find out my rank
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    // find out the number of processes in MPI_COMM_WORLD
    MPI_Comm_size(MPI_COMM_WORLD, &proccount);

    // now distribute the required precision
    if (precision < proccount) {
        printf("precision smaller than the number of processes - try again.\n");
        MPI_Finalize();
        return -1;
    }

    // each process performs computations on its part:
    // mine is the denominator of the current element, sign is its sign
    pi = 0;
    mine = myrank * 2 + 1;
    sign = (((mine - 1) / 2) % 2) ? -1 : 1;
    for (; mine < precision;) {
        // printf("\nprocess %d %d %d", myrank, sign, mine);
        // fflush(stdout);
        pi += sign / (double)mine;
        mine += 2 * proccount;
        sign = (((mine - 1) / 2) % 2) ? -1 : 1;
    }

    // now merge the numbers to rank 0
    MPI_Reduce(&pi, &pi_final, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    // only rank 0 holds the reduced value, so only rank 0 prints it
    if (!myrank) {
        pi_final *= 4;
        printf("pi=%f\n", pi_final);
    }

    // Shut down MPI
    MPI_Finalize();
    return 0;
}

Assuming the code was saved in file program.c, we have to:

1. compile the code:

    mpicc program.c

2. run it:

1 process:

    [klaster@n01 1]$ time mpirun -np 1 ./a.out
    real 0m9.286s
    user 0m9.244s
    sys 0m0.037s

2 processes:

    [klaster@n01 1]$ time mpirun -np 2 ./a.out
    real 0m4.706s
    user 0m9.286s
    sys 0m0.063s

4 processes:

    [klaster@n01 1]$ time mpirun -np 4 ./a.out
    real 0m2.420s
    user 0m9.380s
    sys 0m0.118s

Note that the execution time (real) decreases as more processes are used for the computations, while the total CPU time (user) stays roughly constant.

Lab 527: For this lab, you can use the default MPI implementation on the desXX computers in the lab (XX ranges from 01 to 18): Open MPI.

Compile the code:

    student@des01:~> mpicc program.c

Create a configuration for the virtual machine, in this case just 2 nodes (des01 and des02):

    student@des01:~> cat > machinefile
    des01
    des02

Then invoke the application for 1 process (running on des01):

    student@des01:~> mpirun -machinefile ./machinefile -np 1 time ./a.out
    9.25user 0.01system 0:09.27elapsed 99%CPU (0avgtext+0avgdata 13008maxresident)k
    0inputs+0outputs (0major+1009minor)pagefaults 0swaps

and for 2 processes (running on des01 and des02):

    student@des01:~> mpirun -machinefile ./machinefile -np 2 time ./a.out
    4.63user 0.01system 0:04.65elapsed 99%CPU (0avgtext+0avgdata 13072maxresident)k
    0inputs+0outputs (0major+1013minor)pagefaults 0swaps
    4.63user 0.01system 0:04.67elapsed 99%CPU (0avgtext+0avgdata 13312maxresident)k
    0inputs+0outputs (0major+1023minor)pagefaults 0swaps

You can create a larger virtual machine and test the scalability of the application.

Lab 527: You can also use mpich on desXX:

    student@des01:~> /opt/mpich/ch-p4/bin/mpicc program.c
    program.c: In function 'main':
    program.c:12:7: warning: unused variable 'i'
    student@des01:~> scp a.out des02:~
    a.out 100% 1427KB 1.4MB/s 00:00
    student@des01:~> scp a.out des03:~
    a.out 100% 1427KB 1.4MB/s 00:00
    student@des01:~> scp a.out des04:~

Now run the code:

1 process:

    student@des01:~> /opt/mpich/ch-p4/bin/mpirun -np 1 -machinefile ./machinefile ./a.out
    student@des01:~>

2 processes:

    student@des01:~> /opt/mpich/ch-p4/bin/mpirun -np 2 -machinefile ./machinefile ./a.out
    student@des01:~>

4 processes:

    student@des01:~> /opt/mpich/ch-p4/bin/mpirun -np 4 -machinefile ./machinefile ./a.out

Cluster KASK: reach the cluster by

    ssh studentX@n01.eti.pg.gda.pl

where X is a number from 1 to 18.

The following MPI implementations are available on cluster KASK (use a full path for running mpicc and mpirun):

1. MPICH: executables such as mpicc and mpirun available in /opt/mpich2/gnu/bin/
2. Open MPI: executables in /opt/sun-ct/bin/
3. MVAPICH: executables in /usr/mpi/gcc/mvapich-1.2.0/bin/

Note: the following nodes are available on the cluster:

    n01 (access node)
    compute-0-0
    compute-0-1
    compute-0-8

Bibliography

MPI Docs, http://www.mpi-forum.org/docs/docs.html