MPI Application Development Using the Analysis Tool MARMOT

Transcription

1 MPI Application Development Using the Analysis Tool MARMOT HLRS High Performance Computing Center Stuttgart Allmandring 30 D Stuttgart Höchstleistungsrechenzentrum Stuttgart

2 Overview General Problems of MPI Programming Related Work Design of MARMOT MARMOT & Applications Performance Results Conclusions Outlook Höchstleistungsrechenzentrum Stuttgart

3 Problems of MPI Programming All problems of serial programming Additional problems: Increased difficulty to verify correctness of program Increased difficulty to debug N parallel processes Portability issues between different MPI implementations and platforms, e.g. with mpich on the Crossgrid testbed: WARNING: MPI_Recv: tag= > 32767! MPI only guarantees tags up to this. THIS implementation allows tags up to versions of LAM-MPI < v 7.0 only guarantee tags up to New problems such as race conditions and deadlocks Höchstleistungsrechenzentrum Stuttgart

4 Related Work Parallel Debuggers, e.g. totalview, DDT, p2d2 Debug version of MPI Library Post-mortem analysis of trace files Special verification tools for runtime analysis MPI-CHECK Limited to Fortran Umpire First version limited to shared memory platforms Distributed memory version in preparation Not publicly available MARMOT Höchstleistungsrechenzentrum Stuttgart

5 Höchstleistungsrechenzentrum Stuttgart Design of MARMOT

6 What is MARMOT? Tool for the development of MPI applications Automatic runtime analysis of the application: Detect incorrect use of MPI Detect non-portable constructs Detect possible race conditions and deadlocks MARMOT does not require source code modifications, just relinking C and Fortran binding of MPI -1.2 is supported Höchstleistungsrechenzentrum Stuttgart

7 Design of MARMOT Application or Test Program Profiling Interface MARMOT core tool Debug Server (additional process) MPI library Höchstleistungsrechenzentrum Stuttgart

8 Examples of Server Checks: verification between the nodes, control of program Everything that requires a global view Control the execution flow, trace the MPI calls on each node throughout the whole application Signal conditions, e.g. deadlocks (with traceback on each node.) Check matching send/receive pairs for consistency Check collective calls for consistency Output of human readable log file Höchstleistungsrechenzentrum Stuttgart

9 Examples of Client Checks: verification on the local nodes Verification of proper construction and usage of MPI resources such as communicators, groups, datatypes etc., for example Verification of MPI_Request usage invalid recycling of active request invalid use of unregistered request warning if number of requests is zero warning if all requests are MPI_REQUEST_NULL Check for pending messages and active requests in MPI_Finalize Verification of all other arguments such as ranks, tags, etc Höchstleistungsrechenzentrum Stuttgart

10 Documentation/Classification of Errors, Warnings, Notes Höchstleistungsrechenzentrum Stuttgart

16 Höchstleistungsrechenzentrum Stuttgart MARMOT & Applications

17 MARMOT within CrossGrid Name Application B_Stream Aladin, DaveF MM5 STEM ANN Blood flow simulation Flood simulation Air pollution modeling High Energy Physics Höchstleistungsrechenzentrum Stuttgart Language C Fortran90 Fortran77 C # lines million MHz (40 TB/sec) 75 KHz (75 GB/sec) 5 KHz (5 GB/sec) 100 Hz level data 3 recording - PCs & offline analysis level 1 - special hardware (100 MB/sec) level 2 - embedded processors

18 Example: B_Stream (blood flow simulation, tube) Running the application without/with MARMOT: mpirun -np 3 B_Stream 500. tube mpirun -np 4 B_Stream_marmot 500. tube Application seems to run without problems Höchstleistungsrechenzentrum Stuttgart

19 Example: B_Stream (blood flow simulation, tube) 54 rank 1 performs MPI_Cart_shift 55 rank 2 performs MPI_Cart_shift 56 rank 0 performs MPI_Send 57 rank 1 performs MPI_Recv WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions! 58 rank 2 performs MPI_Recv WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions! 59 rank 0 performs MPI_Send 60 rank 1 performs MPI_Recv WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions! 61 rank 0 performs MPI_Send 62 rank 1 performs MPI_Bcast 63 rank 2 performs MPI_Recv WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions! 64 rank 0 performs MPI_Pack 65 rank 2 performs MPI_Bcast 66 rank 0 performs MPI_Pack Höchstleistungsrechenzentrum Stuttgart

20 Example: B_Stream (blood flow simulation, tube-stenosis) Without MARMOT: mpirun -np 3 B_Stream 500. tube-stenosis Application is hanging With MARMOT: mpirun -np 4 B_Stream_marmot 500. tube-stenosis Deadlock found Höchstleistungsrechenzentrum Stuttgart

21 Example: B_Stream (blood flow simulation, tube-stenosis) 9310 rank 1 performs MPI_Sendrecv 9311 rank 2 performs MPI_Sendrecv 9312 rank 0 performs MPI_Barrier 9313 rank 1 performs MPI_Barrier 9314 rank 2 performs MPI_Barrier 9315 rank 1 performs MPI_Sendrecv 9316 rank 2 performs MPI_Sendrecv 9317 rank 0 performs MPI_Sendrecv 9318 rank 1 performs MPI_Sendrecv 9319 rank 0 performs MPI_Sendrecv 9320 rank 2 performs MPI_Sendrecv 9321 rank 0 performs MPI_Barrier 9322 rank 1 performs MPI_Barrier 9323 rank 2 performs MPI_Barrier 9324 rank 1 performs MPI_Comm_rank 9325 rank 1 performs MPI_Bcast 9326 rank 2 performs MPI_Comm_rank 9327 rank 2 performs MPI_Bcast 9328 rank 0 performs MPI_Sendrecv WARNING: all clients are pending! Höchstleistungsrechenzentrum Stuttgart

22 Example: B_Stream (blood flow simulation, tube-stenosis) deadlock: traceback on node 0 timestamp= 9304: MPI_Barrier(comm = MPI_COMM_WORLD) timestamp= 9307: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status) timestamp= 9309: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status) timestamp= 9312: MPI_Barrier(comm = MPI_COMM_WORLD) timestamp= 9317: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status) timestamp= 9319: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status) timestamp= 9321: MPI_Barrier(comm = MPI_COMM_WORLD) timestamp= 9328: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status) Höchstleistungsrechenzentrum Stuttgart

23 Example: B_Stream (blood flow simulation, tube-stenosis) deadlock: traceback on node 1 timestamp= 9306: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status) timestamp= 9310: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status) timestamp= 9313: MPI_Barrier(comm = MPI_COMM_WORLD) timestamp= 9315: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status) timestamp= 9318: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status) timestamp= 9322: MPI_Barrier(comm = MPI_COMM_WORLD) timestamp= 9324: MPI_Comm_rank(comm = MPI_COMM_WORLD, *rank) timestamp= 9325: MPI_Bcast(*buffer, count = 3, datatype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD) Höchstleistungsrechenzentrum Stuttgart

24 Example: B_Stream (blood flow simulation, tube-stenosis) deadlock: traceback on node 2 timestamp= 9308: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status) timestamp= 9311: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status) timestamp= 9314: MPI_Barrier(comm = MPI_COMM_WORLD) timestamp= 9316: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status) timestamp= 9320: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status) timestamp= 9323: MPI_Barrier(comm = MPI_COMM_WORLD) timestamp= 9326: MPI_Comm_rank(comm = MPI_COMM_WORLD, *rank) timestamp= 9327: MPI_Bcast(*buffer, count = 3, datatype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD) Höchstleistungsrechenzentrum Stuttgart

25 Example: B_Stream (blood flow simulation, bifurcation) Without MARMOT: mpirun -np 3 B_Stream 500. bifurcation /usr/local/mpich /ch_shmem/bin/mpirun: line 1: 2753 Speicherzugriffsfehler /home/rusbetti/b_streamorg/bin/b_stream "500." "bifurcation" With MARMOT: mpirun -np 4 B_Stream_marmot 500. bifurcation Problem found at collective call MPI_Gather Höchstleistungsrechenzentrum Stuttgart

26 Example: B_Stream (blood flow simulation, bifurcation) 9319 rank 2 performs MPI_Sendrecv 9320 rank 1 performs MPI_Sendrecv 9321 rank 1 performs MPI_Barrier 9322 rank 2 performs MPI_Barrier 9323 rank 0 performs MPI_Barrier 9324 rank 0 performs MPI_Comm_rank 9325 rank 1 performs MPI_Comm_rank 9326 rank 2 performs MPI_Comm_rank 9327 rank 0 performs MPI_Bcast 9328 rank 1 performs MPI_Bcast 9329 rank 2 performs MPI_Bcast 9330 rank 0 performs MPI_Bcast 9331 rank 1 performs MPI_Bcast 9332 rank 2 performs MPI_Bcast 9333 rank 0 performs MPI_Gather 9334 rank 1 performs MPI_Gather 9335 rank 2 performs MPI_Gather /usr/local/mpich /ch_shmem/bin/mpirun: line 1: Speicherzugriffsfehler /home/rusbetti/b_stream/bin/b_stream_marmot "500." "bifurcation" Höchstleistungsrechenzentrum Stuttgart

27 Example: B_Stream (blood flow simulation, bifurcation) 9336 rank 1 performs MPI_Sendrecv 9337 rank 2 performs MPI_Sendrecv 9338 rank 1 performs MPI_Sendrecv WARNING: all clients are pending! Höchstleistungsrechenzentrum Stuttgart

28 Example: B_Stream (blood flow simulation, bifurcation) Last calls on node 0: timestamp= 9327: MPI_Bcast(*buffer, count = 3, datatype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD) timestamp= 9330: MPI_Bcast(*buffer, count = 3, datatype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD) timestamp= 9333: MPI_Gather(*sendbuf, sendcount = , sendtype = MPI_DOUBLE, *recvbuf, recvcount = , recvtype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD) Last calls on node 1: timestamp= 9334: MPI_Gather(*sendbuf, sendcount = , sendtype = MPI_DOUBLE, *recvbuf, recvcount = , recvtype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD) timestamp= 9336: MPI_Sendrecv(*sendbuf, sendcount = 13455, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 13455, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status) timestamp= 9338: MPI_Sendrecv(*sendbuf, sendcount = 13455, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 13455, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status) Last calls on node 2: timestamp= 9332: MPI_Bcast(*buffer, count = 3, datatype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD) timestamp= 9335: MPI_Gather(*sendbuf, sendcount = , sendtype = MPI_DOUBLE, *recvbuf, recvcount = , recvtype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD) timestamp= 9337: MPI_Sendrecv(*sendbuf, sendcount = 13455, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 13455, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status) Höchstleistungsrechenzentrum Stuttgart

29 Example: B_Stream (B_Stream.cpp) #ifdef PARALLEl MPI_Barrier (MPI_COMM_WORLD); MPI_Reduce (&nr_fluids, &tot_nr_fluids, 1, MPI_DOUBLE, MPI_SUM, 0,MPI_COMM_WORLD); #else #endif //Calculation of porosity if (ge.me == 0) { Porosity = ((double) tot_nr_fluids) / (ge.global_dim[0] * ge.global_dim[1] * ge.global_dim[2]); } Porosity = ((double) tot_nr_fluids) / (ge.global_dim[0] * ge.global_dim[1] * ge.global_dim[2]); Höchstleistungsrechenzentrum Stuttgart

30 Performance with Real Applications Höchstleistungsrechenzentrum Stuttgart

31 CrossGrid Application: WP 1.4: Air pollution modelling Air pollution modeling with STEM-II model Transport equation solved with Petrov- Crank-Nikolson-Galerkin method Chemistry and Mass transfer are integrated using semi-implicit Euler and pseudoanalytical methods lines of Fortran code 12 different MPI calls: MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Type_extent, MPI_Type_struct, MPI_Type_commit, MPI_Type_hvector, MPI_Bcast, MPI_Scatterv, MPI_Barrier, MPI_Gatherv, MPI_Finalize Höchstleistungsrechenzentrum Stuttgart

32 STEM application on an IA32 cluster with Myrinet 80 Time [s] Processors native MPI MARMOT Höchstleistungsrechenzentrum Stuttgart

33 CrossGrid Application: WP 1.3: High Energy Physics level 1 - special hardware 40 MHz (40 TB/sec) level 2 - embedded processors level 3 - PCs 75 KHz (75 GB/sec) 5 KHz (5 GB/sec) 100 Hz data recording & offline analysis (100 MB/sec) Filtering of real-time data with neural networks (ANN application) lines of C code 11 different MPI calls: MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Get_processor_name, MPI_Barrier, MPI_Gather, MPI_Recv, MPI_Send, MPI_Bcast, MPI_Reduce, MPI_Finalize Höchstleistungsrechenzentrum Stuttgart

34 HEP application on an IA32 cluster with Myrinet Time [s] native MPI MARMOT Processors Höchstleistungsrechenzentrum Stuttgart

35 CrossGrid Application: WP 1.1: Medical Application Calculation of blood flow with Lattice- Boltzmann method Stripped down application with 6500 lines of C code 14 different MPI calls: MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Pack, MPI_Bcast, MPI_Unpack, MPI_Cart_create, MPI_Cart_shift, MPI_Send, MPI_Recv, MPI_Barrier, MPI_Reduce, MPI_Sendrecv, MPI_Finalize Höchstleistungsrechenzentrum Stuttgart

36 Medical application on an IA32 cluster with Myrinet 0,6 Time per Iteration [s] 0,5 0,4 0,3 0,2 0, Processors native MPI MARMOT Höchstleistungsrechenzentrum Stuttgart

37 Message statistics with native MPI Höchstleistungsrechenzentrum Stuttgart

38 Message statistics with MARMOT Höchstleistungsrechenzentrum Stuttgart

39 Medical application on an IA32 cluster with Myrinet without barrier Time per Iteration [s] 0,6 0,5 0,4 0,3 0,2 0, Processors native MPI MARMOT MARMOT without barrier Höchstleistungsrechenzentrum Stuttgart

40 Barrier with native MPI Höchstleistungsrechenzentrum Stuttgart

41 Barrier with MARMOT Höchstleistungsrechenzentrum Stuttgart

42 CrossGrid Application: WP 1.2: Flood monitoring forecasting simulation Flood simulation Fortran applications tested with MARMOT: Aladin, DaveF, MM5 9 different MPI Calls: MPI_INIT, MPI_COMM_SIZE, MPI_COMM_RANK, MPI_GET_PROCESSOR_NAME, MPI_BCAST, MPI_SEND, MPI_RECV, MPI_BARRIER, MPI_FINALIZE Höchstleistungsrechenzentrum Stuttgart

43 Conclusion Full MPI 1.2 implemented C and Fortran binding is supported Used for several Fortran and C benchmarks (NASPB) and applications (Crossgrid Project and others) Tests on different platforms, using different compilers and MPI implementations, e.g. IA32/IA64 clusters (Intel, g++ compiler) with mpich IBM Regatta NEC SX5, SX6, Hitachi SR8000, Cray T3e, Höchstleistungsrechenzentrum Stuttgart

44 Impact and Exploitation MARMOT is mentioned in the U.S. DoD High Performance Computing Modernization Program (HPCMP) Programming Environment and Training (PET) project as one of two freely available tools (MPI-Check and MARMOT) for the development of portable MPI programs. Contact with tool developers of the other 2 known verification tools like MARMOT (MPI-Check, umpire) Contact with users outside CrossGrid MARMOT has been presented at various conferences and publications Höchstleistungsrechenzentrum Stuttgart

45 Future Work Scalability and general performance improvements Better user interface to present problems and warnings Extended documentation Extended functionality: more tests to verify collective calls (parts of) MPI-2 Hybrid programming Höchstleistungsrechenzentrum Stuttgart

46 Thanks for your attention Additional information and download: Höchstleistungsrechenzentrum Stuttgart