MPI Application Development Using the Analysis Tool MARMOT




MPI Application Development Using the Analysis Tool MARMOT
HLRS High Performance Computing Center Stuttgart, Allmandring 30, D-70550 Stuttgart
http://www.hlrs.de
24.02.2005, Höchstleistungsrechenzentrum Stuttgart

Overview
- General Problems of MPI Programming
- Related Work
- Design of MARMOT
- MARMOT & Applications
- Performance Results
- Conclusions
- Outlook

Problems of MPI Programming
- All problems of serial programming
- Additional problems:
  - Increased difficulty of verifying the correctness of the program
  - Increased difficulty of debugging N parallel processes
  - Portability issues between different MPI implementations and platforms, e.g. with mpich on the CrossGrid testbed:
      WARNING: MPI_Recv: tag= 36003 > 32767! MPI only guarantees tags up to this. THIS implementation allows tags up to 137654536
    Versions of LAM-MPI < 7.0 only guarantee tags up to 32767
- New problems such as race conditions and deadlocks
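The tag warning above reflects a genuine portability trap: the MPI standard only guarantees tags up to 32767, the minimum value an implementation may report for the MPI_TAG_UB attribute (queryable at run time with MPI_Comm_get_attr). A MARMOT-style portability check can be modelled by a small self-contained sketch (the function name is illustrative, not MARMOT's code):

```c
#include <assert.h>

/* Minimum value of MPI_TAG_UB that every MPI implementation must
 * support; larger tags may work (e.g. up to 137654536 in the mpich
 * build quoted above) but are non-portable. */
#define MPI_TAG_UB_MINIMUM 32767

/* Illustrative portability check: is this tag valid everywhere? */
static int tag_is_portable(int tag) {
    return tag >= 0 && tag <= MPI_TAG_UB_MINIMUM;
}
```

The tag 36003 from the warning fails this check even though the mpich build at hand accepts it, which is exactly why the code ran on one platform and triggered warnings on another.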

Related Work
- Parallel debuggers, e.g. TotalView, DDT, p2d2
- Debug versions of MPI libraries
- Post-mortem analysis of trace files
- Special verification tools for runtime analysis:
  - MPI-CHECK: limited to Fortran
  - Umpire: first version limited to shared memory platforms; distributed memory version in preparation; not publicly available
  - MARMOT

Design of MARMOT

What is MARMOT?
- Tool for the development of MPI applications
- Automatic runtime analysis of the application:
  - Detects incorrect use of MPI
  - Detects non-portable constructs
  - Detects possible race conditions and deadlocks
- MARMOT does not require source code modifications, just relinking
- The C and Fortran bindings of MPI-1.2 are supported

Design of MARMOT (layer diagram):
- Application or test program
- Profiling interface
- MARMOT core tool, with a debug server as an additional process
- MPI library
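The "profiling interface" layer refers to the standard MPI profiling interface (PMPI): every MPI function also exists under the name PMPI_*, so a tool can supply its own definition of each MPI call, run its checks, and forward to the real implementation. Relinking against the tool's library is then sufficient, which is why MARMOT needs no source changes. A minimal sketch of such an interception wrapper (the check shown is illustrative, not MARMOT's actual code):

```c
#include <stdio.h>
#include <mpi.h>

/* Tool-provided MPI_Send: check the arguments, then delegate to the
 * real implementation via the PMPI_ entry point. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    if (tag < 0 || tag > 32767)  /* 32767 = minimum guaranteed MPI_TAG_UB */
        fprintf(stderr, "WARNING: MPI_Send: non-portable tag %d\n", tag);
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}
```

Linking this wrapper before the MPI library intercepts every MPI_Send of the unmodified application.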

Examples of Server Checks: verification between the nodes, control of the program
- Everything that requires a global view
- Controls the execution flow, traces the MPI calls on each node throughout the whole application
- Signals conditions, e.g. deadlocks (with a traceback on each node)
- Checks matching send/receive pairs for consistency
- Checks collective calls for consistency
- Outputs a human-readable log file
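Signalling deadlocks requires exactly the global view mentioned above. One textbook way to formalise the server's job is a wait-for graph over the processes: if each blocked process waits for one other process, a deadlock is a cycle in that graph. The sketch below is a self-contained model of such a check (a simplification for illustration; MARMOT's real detection mechanism, e.g. timeout-based, may differ):

```c
/* waiting_for[i] = j  means process i is blocked waiting for j;
 * waiting_for[i] = -1 means process i is not blocked.
 * Returns 1 if the wait-for graph contains a cycle (deadlock). */
static int has_deadlock(const int *waiting_for, int n)
{
    for (int start = 0; start < n; ++start) {
        /* Floyd's cycle detection along the wait-for chain. */
        int slow = start, fast = start;
        while (fast != -1 && waiting_for[fast] != -1) {
            slow = waiting_for[slow];
            fast = waiting_for[waiting_for[fast]];
            if (slow == fast)
                return 1;
        }
    }
    return 0;
}
```

The debug server can run such a check whenever all clients report themselves blocked ("all clients are pending", as in the logs later in this talk).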

Examples of Client Checks: verification on the local nodes
- Verification of proper construction and usage of MPI resources such as communicators, groups, datatypes etc., for example verification of MPI_Request usage:
  - invalid recycling of an active request
  - invalid use of an unregistered request
  - warning if the number of requests is zero
  - warning if all requests are MPI_REQUEST_NULL
- Check for pending messages and active requests in MPI_Finalize
- Verification of all other arguments such as ranks, tags, etc.
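The MPI_Request checks listed above amount to tracking a small state machine per request on the client side. A self-contained model of two of these checks (state and function names are hypothetical, for illustration only):

```c
/* Per-request state as a MARMOT-like client check might track it. */
enum req_state { REQ_UNREGISTERED, REQ_ACTIVE, REQ_COMPLETED };

enum check_result { CHECK_OK, ERR_ACTIVE_REUSE, ERR_UNREGISTERED_USE };

/* The application starts a nonblocking operation with this request. */
static enum check_result check_start(enum req_state *st)
{
    if (*st == REQ_ACTIVE)
        return ERR_ACTIVE_REUSE;      /* invalid recycling of an active request */
    *st = REQ_ACTIVE;
    return CHECK_OK;
}

/* The application waits on / tests this request. */
static enum check_result check_wait(enum req_state *st)
{
    if (*st == REQ_UNREGISTERED)
        return ERR_UNREGISTERED_USE;  /* invalid use of an unregistered request */
    *st = REQ_COMPLETED;
    return CHECK_OK;
}
```

The check for MPI_Finalize is then a scan over this table: any request still in the active state means a pending operation was never completed.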

Documentation/Classification of Errors, Warnings, Notes


MARMOT & Applications

MARMOT within CrossGrid

  Name                 Application              Language    # lines
  B_Stream             Blood flow simulation    C           6500
  Aladin, DaveF, MM5   Flood simulation         Fortran90   1 million
  STEM                 Air pollution modeling   Fortran77   15500
  ANN                  High Energy Physics      C           11500

Example: B_Stream (blood flow simulation, tube)
Running the application without/with MARMOT (MARMOT needs one additional process for its debug server):
  mpirun -np 3 B_Stream 500. tube
  mpirun -np 4 B_Stream_marmot 500. tube
Application seems to run without problems

Example: B_Stream (blood flow simulation, tube)
  54 rank 1 performs MPI_Cart_shift
  55 rank 2 performs MPI_Cart_shift
  56 rank 0 performs MPI_Send
  57 rank 1 performs MPI_Recv
     WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions!
  58 rank 2 performs MPI_Recv
     WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions!
  59 rank 0 performs MPI_Send
  60 rank 1 performs MPI_Recv
     WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions!
  61 rank 0 performs MPI_Send
  62 rank 1 performs MPI_Bcast
  63 rank 2 performs MPI_Recv
     WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions!
  64 rank 0 performs MPI_Pack
  65 rank 2 performs MPI_Bcast
  66 rank 0 performs MPI_Pack
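The warning in the log flags the pattern where a receive with MPI_ANY_SOURCE matches whichever sender happens to arrive first, so the order in which messages are processed can differ from run to run. A reduced illustration of that pattern (this is not B_Stream code; run with at least 3 processes to see varying receive orders):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, val;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank != 0) {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        MPI_Status st;
        for (int i = 1; i < size; ++i) {
            /* Nondeterministic: which sender matches depends on timing,
             * which is exactly what MARMOT warns about above. */
            MPI_Recv(&val, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, &st);
            printf("received from rank %d\n", st.MPI_SOURCE);
        }
    }
    MPI_Finalize();
    return 0;
}
```

If the program's correctness depends on the receive order, this is a race condition; if any order is acceptable, the warning can be reviewed and dismissed.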

Example: B_Stream (blood flow simulation, tube-stenosis)
Without MARMOT:
  mpirun -np 3 B_Stream 500. tube-stenosis
  Application is hanging
With MARMOT:
  mpirun -np 4 B_Stream_marmot 500. tube-stenosis
  Deadlock found

Example: B_Stream (blood flow simulation, tube-stenosis)
  9310 rank 1 performs MPI_Sendrecv
  9311 rank 2 performs MPI_Sendrecv
  9312 rank 0 performs MPI_Barrier
  9313 rank 1 performs MPI_Barrier
  9314 rank 2 performs MPI_Barrier
  9315 rank 1 performs MPI_Sendrecv
  9316 rank 2 performs MPI_Sendrecv
  9317 rank 0 performs MPI_Sendrecv
  9318 rank 1 performs MPI_Sendrecv
  9319 rank 0 performs MPI_Sendrecv
  9320 rank 2 performs MPI_Sendrecv
  9321 rank 0 performs MPI_Barrier
  9322 rank 1 performs MPI_Barrier
  9323 rank 2 performs MPI_Barrier
  9324 rank 1 performs MPI_Comm_rank
  9325 rank 1 performs MPI_Bcast
  9326 rank 2 performs MPI_Comm_rank
  9327 rank 2 performs MPI_Bcast
  9328 rank 0 performs MPI_Sendrecv
  WARNING: all clients are pending!

Example: B_Stream (blood flow simulation, tube-stenosis)
deadlock: traceback on node 0
  timestamp= 9304: MPI_Barrier(comm = MPI_COMM_WORLD)
  timestamp= 9307: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)
  timestamp= 9309: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status)
  timestamp= 9312: MPI_Barrier(comm = MPI_COMM_WORLD)
  timestamp= 9317: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)
  timestamp= 9319: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status)
  timestamp= 9321: MPI_Barrier(comm = MPI_COMM_WORLD)
  timestamp= 9328: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)

Example: B_Stream (blood flow simulation, tube-stenosis)
deadlock: traceback on node 1
  timestamp= 9306: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status)
  timestamp= 9310: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status)
  timestamp= 9313: MPI_Barrier(comm = MPI_COMM_WORLD)
  timestamp= 9315: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status)
  timestamp= 9318: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status)
  timestamp= 9322: MPI_Barrier(comm = MPI_COMM_WORLD)
  timestamp= 9324: MPI_Comm_rank(comm = MPI_COMM_WORLD, *rank)
  timestamp= 9325: MPI_Bcast(*buffer, count = 3, datatype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD)

Example: B_Stream (blood flow simulation, tube-stenosis)
deadlock: traceback on node 2
  timestamp= 9308: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status)
  timestamp= 9311: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)
  timestamp= 9314: MPI_Barrier(comm = MPI_COMM_WORLD)
  timestamp= 9316: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status)
  timestamp= 9320: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)
  timestamp= 9323: MPI_Barrier(comm = MPI_COMM_WORLD)
  timestamp= 9326: MPI_Comm_rank(comm = MPI_COMM_WORLD, *rank)
  timestamp= 9327: MPI_Bcast(*buffer, count = 3, datatype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD)
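The three tracebacks show the shape of the deadlock: rank 0 is still blocked in an MPI_Sendrecv of the halo-exchange phase (timestamp 9328), while ranks 1 and 2 have already entered the next collective, MPI_Bcast. Rank 0 waits for a message that will never come, and the broadcast waits for rank 0. A reduced illustration of this mismatch (not actual B_Stream code; this program hangs by design, so it is only for demonstration with at least 2 processes):

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf = 0, sendv = 0, recvv = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* Rank 0 performs one extra exchange round ... */
        MPI_Status st;
        MPI_Sendrecv(&sendv, 1, MPI_INT, 1, 1,
                     &recvv, 1, MPI_INT, 1, 1,
                     MPI_COMM_WORLD, &st);   /* blocks forever */
    }
    /* ... while the other ranks are already in the next collective: */
    MPI_Bcast(&buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```

Without a tool, such a hang gives no hint where each process is stuck; the per-node tracebacks above pinpoint the diverging call sequences immediately.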

Example: B_Stream (blood flow simulation, bifurcation)
Without MARMOT:
  mpirun -np 3 B_Stream 500. bifurcation
  /usr/local/mpich-1.2.5.2/ch_shmem/bin/mpirun: line 1: 2753 Speicherzugriffsfehler (segmentation fault) /home/rusbetti/b_streamorg/bin/b_stream "500." "bifurcation"
With MARMOT:
  mpirun -np 4 B_Stream_marmot 500. bifurcation
  Problem found at collective call MPI_Gather

Example: B_Stream (blood flow simulation, bifurcation)
  9319 rank 2 performs MPI_Sendrecv
  9320 rank 1 performs MPI_Sendrecv
  9321 rank 1 performs MPI_Barrier
  9322 rank 2 performs MPI_Barrier
  9323 rank 0 performs MPI_Barrier
  9324 rank 0 performs MPI_Comm_rank
  9325 rank 1 performs MPI_Comm_rank
  9326 rank 2 performs MPI_Comm_rank
  9327 rank 0 performs MPI_Bcast
  9328 rank 1 performs MPI_Bcast
  9329 rank 2 performs MPI_Bcast
  9330 rank 0 performs MPI_Bcast
  9331 rank 1 performs MPI_Bcast
  9332 rank 2 performs MPI_Bcast
  9333 rank 0 performs MPI_Gather
  9334 rank 1 performs MPI_Gather
  9335 rank 2 performs MPI_Gather
  /usr/local/mpich-1.2.5.2/ch_shmem/bin/mpirun: line 1: 10163 Speicherzugriffsfehler (segmentation fault) /home/rusbetti/b_stream/bin/b_stream_marmot "500." "bifurcation"

Example: B_Stream (blood flow simulation, bifurcation)
  9336 rank 1 performs MPI_Sendrecv
  9337 rank 2 performs MPI_Sendrecv
  9338 rank 1 performs MPI_Sendrecv
  WARNING: all clients are pending!

Example: B_Stream (blood flow simulation, bifurcation)
Last calls on node 0:
  timestamp= 9327: MPI_Bcast(*buffer, count = 3, datatype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD)
  timestamp= 9330: MPI_Bcast(*buffer, count = 3, datatype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD)
  timestamp= 9333: MPI_Gather(*sendbuf, sendcount = 266409, sendtype = MPI_DOUBLE, *recvbuf, recvcount = 266409, recvtype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD)
Last calls on node 1:
  timestamp= 9334: MPI_Gather(*sendbuf, sendcount = 258336, sendtype = MPI_DOUBLE, *recvbuf, recvcount = 258336, recvtype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD)
  timestamp= 9336: MPI_Sendrecv(*sendbuf, sendcount = 13455, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 13455, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status)
  timestamp= 9338: MPI_Sendrecv(*sendbuf, sendcount = 13455, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 13455, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status)
Last calls on node 2:
  timestamp= 9332: MPI_Bcast(*buffer, count = 3, datatype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD)
  timestamp= 9335: MPI_Gather(*sendbuf, sendcount = 258336, sendtype = MPI_DOUBLE, *recvbuf, recvcount = 258336, recvtype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD)
  timestamp= 9337: MPI_Sendrecv(*sendbuf, sendcount = 13455, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 13455, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status)
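The tracebacks expose the inconsistency in the collective call: rank 0 passes sendcount/recvcount 266409 to MPI_Gather, while ranks 1 and 2 pass 258336. MPI requires the amount of data sent by each process to match what the root expects from it; unequal counts like this lead to buffer overruns and crashes such as the segmentation fault above. A reduced illustration of the erroneous pattern (hypothetical code, not B_Stream itself):

```c
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* ERROR: count differs between ranks, as in the tracebacks above. */
    int mycount = (rank == 0) ? 266409 : 258336;
    double *sendbuf = malloc((size_t)mycount * sizeof(double));
    double *recvbuf = NULL;
    if (rank == 0)  /* root sizes its buffer with its own (larger) count */
        recvbuf = malloc((size_t)size * (size_t)mycount * sizeof(double));

    /* For plain MPI_Gather every rank must contribute the same count;
     * with per-rank counts, MPI_Gatherv would be the correct call. */
    MPI_Gather(sendbuf, mycount, MPI_DOUBLE,
               recvbuf, mycount, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

A consistency check on collective calls, comparing the (count, datatype) signatures across ranks on the debug server, catches this before the native library crashes.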

Example: B_Stream (B_Stream.cpp)

  #ifdef PARALLEL
    MPI_Barrier (MPI_COMM_WORLD);
    MPI_Reduce (&nr_fluids, &tot_nr_fluids, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  #else
  #endif
    // Calculation of porosity
    if (ge.me == 0) {
      Porosity = ((double) tot_nr_fluids) / (ge.global_dim[0] * ge.global_dim[1] * ge.global_dim[2]);
    }
    Porosity = ((double) tot_nr_fluids) / (ge.global_dim[0] * ge.global_dim[1] * ge.global_dim[2]);

Performance with Real Applications

CrossGrid Application: WP 1.4: Air Pollution Modelling
- Air pollution modeling with the STEM-II model
- Transport equation solved with a Petrov-Crank-Nicolson-Galerkin method
- Chemistry and mass transfer are integrated using semi-implicit Euler and pseudo-analytical methods
- 15500 lines of Fortran code
- 12 different MPI calls: MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Type_extent, MPI_Type_struct, MPI_Type_commit, MPI_Type_hvector, MPI_Bcast, MPI_Scatterv, MPI_Barrier, MPI_Gatherv, MPI_Finalize

STEM application on an IA32 cluster with Myrinet
[Chart: execution time (0-80 s) vs. number of processors (1-16), comparing native MPI and MARMOT]

CrossGrid Application: WP 1.3: High Energy Physics
[Diagram: level 1 - special hardware: 40 MHz (40 TB/sec); level 2 - embedded processors: 75 KHz (75 GB/sec); level 3 - PCs: 5 KHz (5 GB/sec); data recording & offline analysis: 100 Hz (100 MB/sec)]
- Filtering of real-time data with neural networks (ANN application)
- 11500 lines of C code
- 11 different MPI calls: MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Get_processor_name, MPI_Barrier, MPI_Gather, MPI_Recv, MPI_Send, MPI_Bcast, MPI_Reduce, MPI_Finalize

HEP application on an IA32 cluster with Myrinet
[Chart: execution time (0-400 s) vs. number of processors (2-16), comparing native MPI and MARMOT]

CrossGrid Application: WP 1.1: Medical Application
- Calculation of blood flow with the Lattice-Boltzmann method
- Stripped-down application with 6500 lines of C code
- 14 different MPI calls: MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Pack, MPI_Bcast, MPI_Unpack, MPI_Cart_create, MPI_Cart_shift, MPI_Send, MPI_Recv, MPI_Barrier, MPI_Reduce, MPI_Sendrecv, MPI_Finalize

Medical application on an IA32 cluster with Myrinet
[Chart: time per iteration (0-0.6 s) vs. number of processors (1-16), comparing native MPI and MARMOT]

Message statistics with native MPI

Message statistics with MARMOT

Medical application on an IA32 cluster with Myrinet, without barrier
[Chart: time per iteration (0-0.6 s) vs. number of processors (1-16), comparing native MPI, MARMOT, and MARMOT without barrier]

Barrier with native MPI

Barrier with MARMOT

CrossGrid Application: WP 1.2: Flood Monitoring, Forecasting and Simulation
- Flood simulation
- Fortran applications tested with MARMOT: Aladin, DaveF, MM5
- 9 different MPI calls: MPI_INIT, MPI_COMM_SIZE, MPI_COMM_RANK, MPI_GET_PROCESSOR_NAME, MPI_BCAST, MPI_SEND, MPI_RECV, MPI_BARRIER, MPI_FINALIZE

Conclusion
- Full MPI-1.2 implemented; the C and Fortran bindings are supported
- Used for several Fortran and C benchmarks (NAS Parallel Benchmarks) and applications (CrossGrid project and others)
- Tested on different platforms, using different compilers and MPI implementations, e.g.:
  - IA32/IA64 clusters (Intel and g++ compilers) with mpich
  - IBM Regatta
  - NEC SX5, SX6
  - Hitachi SR8000
  - Cray T3E

Impact and Exploitation
- MARMOT is mentioned in the U.S. DoD High Performance Computing Modernization Program (HPCMP) Programming Environment and Training (PET) project as one of two freely available tools (MPI-Check and MARMOT) for the development of portable MPI programs
- Contact with the developers of the two other known verification tools comparable to MARMOT (MPI-Check, Umpire)
- Contact with users outside CrossGrid
- MARMOT has been presented at various conferences and in several publications: http://www.hlrs.de/organization/tsc/projects/marmot/pubs.html

Future Work
- Scalability and general performance improvements
- Better user interface to present problems and warnings
- Extended documentation
- Extended functionality:
  - more tests to verify collective calls
  - (parts of) MPI-2
  - hybrid programming

Thanks for your attention
Additional information and download: http://www.hlrs.de/organization/tsc/projects/marmot