RA MPI Compilers Debuggers Profiling March 25, 2009
Examples and Slides

To download examples on RA:
1. mkdir class
2. cd class
3. wget http://geco.mines.edu/workshop/class2/examples/examples.tgz
4. tar -xzf examples.tgz
5. cd stommel

Slides: http://geco.mines.edu/workshop/tools

Note: There is a summary of all scripts at the end of the slides for easy copy/paste.
Experimental MPI Versions
New MPI Compilers

Versions:
MVAPICH2 1.2
MVAPICH 1.1
OpenMPI 1.3.1

Both Intel and Portland Group compilers
Support for debuggers
Support for profiling
Need to modify your environment
Change your .tcshrc or .bashrc file
Log out, then log back in
Changes override mpi_selector settings
You may need to change your PBS script
.tcshrc settings

# choose one MPI version (if several are set, the last one wins)
setenv MPI_VERSION /lustre/home/apps/mpi/db/mvapich-1.1
setenv MPI_VERSION /lustre/home/apps/mpi/db/mvapich2-1.2
setenv MPI_VERSION /lustre/home/apps/mpi/db/openmpi1.3.1
setenv MPI_COMPILER intel
#setenv MPI_COMPILER pg
if ( $?MPI_COMPILER && $?MPI_VERSION ) then
  setenv MPI_BASE $MPI_VERSION/$MPI_COMPILER
  setenv LD_LIBRARY_PATH $MPI_BASE/lib:$LD_LIBRARY_PATH
  setenv LD_LIBRARY_PATH $MPI_BASE/lib/shared:$LD_LIBRARY_PATH
  setenv MANPATH $MPI_BASE/man:$MPI_BASE/shared/man:$MANPATH
  set path = ( $MPI_BASE/bin $path )
endif
.bashrc settings

# choose one MPI version (if several are set, the last one wins)
export MPI_VERSION=/lustre/home/apps/mpi/db/mvapich-1.1
export MPI_VERSION=/lustre/home/apps/mpi/db/mvapich2-1.2
export MPI_VERSION=/lustre/home/apps/mpi/db/openmpi1.3.1
export MPI_COMPILER=intel
#export MPI_COMPILER=pg
if [ -n "$MPI_COMPILER" ]; then
  if [ -n "$MPI_VERSION" ]; then
    export MPI_BASE=$MPI_VERSION/$MPI_COMPILER
    export LD_LIBRARY_PATH=$MPI_BASE/lib:$LD_LIBRARY_PATH
    export LD_LIBRARY_PATH=$MPI_BASE/lib/shared:$LD_LIBRARY_PATH
    export MANPATH=$MPI_BASE/man:$MPI_BASE/shared/man:$MANPATH
    export PATH=$MPI_BASE/bin:$PATH
  fi
fi
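The quotes around $MPI_COMPILER and $MPI_VERSION matter: if the variable is unset, an unquoted [ -n $VAR ] collapses to [ -n ], which is always true, so the whole block runs with an empty MPI_BASE. A minimal sketch of the pitfall (plain sh, nothing RA-specific):

```shell
# Demonstrate why the test must quote the variable.
unset MPI_COMPILER
if [ -n $MPI_COMPILER ]; then
    echo "unquoted: treated as set"    # this prints -- [ -n ] is always true
fi
if [ -n "$MPI_COMPILER" ]; then
    echo "quoted: treated as set"
else
    echo "quoted: empty"               # this prints -- the correct result
fi
```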
Base Script

#!/bin/csh
#PBS -l nodes=2:ppn=8
#PBS -l walltime=00:02:00
#PBS -N testio
#PBS -o stdout.$PBS_JOBID
#PBS -e stderr.$PBS_JOBID
#PBS -r n
#PBS -V
#-----------------------------------------------------
cd $PBS_O_WORKDIR
sort -u $PBS_NODEFILE > mynodes.$PBS_JOBID

ADD YOUR MPI RUN COMMAND HERE
MPI Run Commands

openmpi1.3.1:
  mpiexec -np 16 stc_06
mvapich2-1.2:
  mpiexec -np 16 /lustre/home/tkaiser/examples/stommel/stc_06 < st.in
mvapich-1.1:
  mpirun_rsh -hostfile $PBS_NODEFILE -np 16 stc_06 < st.in
  mpirun -machinefile $PBS_NODEFILE -np 16 stc_06 < st.in
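Rather than hard-coding -np 16, the task count can be derived from $PBS_NODEFILE, which lists each node once per requested core. A sketch (the node file and node names here are fabricated for illustration; inside a real job PBS sets $PBS_NODEFILE for you):

```shell
# Fabricate a node file the way nodes=2:ppn=8 would look:
# each node name repeated 8 times.
PBS_NODEFILE=$(mktemp)
for node in compute-9-9 compute-9-10; do
    i=0
    while [ $i -lt 8 ]; do
        echo $node >> "$PBS_NODEFILE"
        i=$((i+1))
    done
done

# One line per core, so the line count is the MPI task count.
NP=$(($(wc -l < "$PBS_NODEFILE")))
echo "mpiexec -np $NP stc_06"      # prints: mpiexec -np 16 stc_06
rm -f "$PBS_NODEFILE"
```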
Debugging with ddt
Not a big fan of debuggers
You end up debugging the debugger
Steep learning curve
Can be misleading
Difficult at large processor counts, and the problem might only show up there
My favorite debuggers are printf and write
However... I recently used ddt to find a problem for which printf did not work; it might otherwise have taken me weeks. Print statements can make the problem go away. Debuggers are useful for learning a program that you have never seen. ddt is working well on RA.
Allinea DDT debugger X-Windows based ssh -X ra An initial setup is done the first time you run Works with both Portland Group and Intel Fortran Good support for Fortran modules Syntax highlighting
Environment for ddt

Requires an MPI that supports debugging, such as those listed above.

.tcshrc:
set path = ( /lustre/home/apps/ddt2.4.1/bin $path )
setenv DMALLOCPATH /lustre/home/apps/ddt2.4.1
setenv DMALLOC
setenv LD_LIBRARY_PATH $DMALLOCPATH/lib/64:$LD_LIBRARY_PATH

.bashrc:
export PATH=/lustre/home/apps/ddt2.4.1/bin:$PATH
export DMALLOCPATH=/lustre/home/apps/ddt2.4.1
export DMALLOC=""
export LD_LIBRARY_PATH=$DMALLOCPATH/lib/64:$LD_LIBRARY_PATH
Debug Compile Line mpicc -g \ -L/lustre/home/apps/gdb-6.8/lib64 \ -liberty \ stc_06.c \ -o stc_06.g
Debug Compile Line

mpicc -g -L/lustre/home/apps/gdb-6.8/lib64 -liberty \
  stc_06.c \
  /lustre/home/apps/ddt2.4.1/lib/64/libdmalloc.a \
  -o stc_06.g

Here we link against the debug memory library. This is required if you want to track memory usage in ddt. Note the library must be last on the list.
stdin, stdout, stderr

stdin works for both Intel and Portland Group.
stdout works with the Intel compiler without modification.
The Portland Group compiler requires a special call (before MPI_Init) to be able to see stdout while the program is running. This is NOT a bug.

For Fortran:  call setvbuf3f(6,2,0)
For C:        setbuf(stdout, NULL);
Initial ddt setup

Type ddt. The first run creates a directory ~/.ddt.
Choose an MPI version.
Choose a list of nodes (default). Note the location of this file; you will need to change this list to connect to a running process.
Wait a few seconds.
Running ddt

Select Run and Debug a Program
Select the program that you will run
Set the number of processes
Most likely set threads to off
Click Run
Details to follow...
To show you... Routine required for correct stdio with Portland Group compiler Setting stdin Module support Changing values Locals / Current Line
Option: Let ddt submit a batch job

Your run script becomes a template; ddt fills in the arguments at submit time.
Tell ddt the particulars: program, input, # processors (<= 16).
ddt will watch the queue for your job to start and then connect.
Let ddt submit a batch job

Change your run line to run ddt with your program as an argument. For example,

mpiexec -n 8 stf_03.g < st.in

becomes

mpiexec -n NUM_PROCS_TAG DDTPATH_TAG/bin/ddt-debugger \
  DDT_DEBUGGER_ARGUMENTS_TAG PROGRAM_ARGUMENTS_TAG

Add (not required, but useful for attaching to already running jobs):

sort -u $PBS_NODEFILE > mynodes.$PBS_JOBID
cp mynodes.$PBS_JOBID ~/.ddt/nodes
A simple script (more later for specific versions of MPI)

#!/bin/csh
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:10:00
#PBS -N testio
#PBS -o stdout.$PBS_JOBID
#PBS -e stderr.$PBS_JOBID
#PBS -r n
#PBS -V
#-----------------------------------------------------
cd $PBS_O_WORKDIR
#save a nicely sorted list of nodes
sort -u $PBS_NODEFILE > mynodes.$PBS_JOBID
cp mynodes.$PBS_JOBID ~/.ddt/nodes
#for openmpi -- note this line is commented out
#mpiexec -n 8 stf_03.g < st.in
#for openmpi and ddt -- this one is alive
mpiexec -n NUM_PROCS_TAG DDTPATH_TAG/bin/ddt-debugger \
  DDT_DEBUGGER_ARGUMENTS_TAG PROGRAM_ARGUMENTS_TAG
Under Session - Options
Finally select Session - New Session - Run
Let ddt submit the job for you
OpenMPI Debug Script

#!/bin/csh
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:10:00
#PBS -N testio
#PBS -o stdout.$PBS_JOBID
#PBS -e stderr.$PBS_JOBID
#PBS -r n
#PBS -V
#-----------------------------------------------------
cd $PBS_O_WORKDIR
#save a nicely sorted list of nodes
sort -u $PBS_NODEFILE > mynodes.$PBS_JOBID
cp mynodes.$PBS_JOBID ~/.ddt/nodes
DDTPATH_TAG/bin/ddt-client DDT_DEBUGGER_ARGUMENTS_TAG mpiexec -np \
  NUM_PROCS_TAG EXTRA_MPI_ARGUMENTS_TAG PROGRAM_TAG \
  PROGRAM_ARGUMENTS_TAG
MVAPICH2 Debug Script

#!/bin/csh
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:10:00
#PBS -N testio
#PBS -o stdout.$PBS_JOBID
#PBS -e stderr.$PBS_JOBID
#PBS -r n
#PBS -V
#-----------------------------------------------------
cd $PBS_O_WORKDIR
#save a nicely sorted list of nodes
sort -u $PBS_NODEFILE > mynodes.$PBS_JOBID
cp mynodes.$PBS_JOBID ~/.ddt/nodes
mpiexec -n NUM_PROCS_TAG \
  DDTPATH_TAG/bin/ddt-debugger \
  DDT_DEBUGGER_ARGUMENTS_TAG PROGRAM_ARGUMENTS_TAG
MVAPICH-1.1 Debug Script

#!/bin/csh
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:15:00
#PBS -N testio
#PBS -o stdout.$PBS_JOBID
#PBS -e stderr.$PBS_JOBID
#PBS -r n
#PBS -V
cd $PBS_O_WORKDIR
#save a nicely sorted list of nodes
sort -u $PBS_NODEFILE > mynodes.$PBS_JOBID
cp mynodes.$PBS_JOBID ~/.ddt/nodes
mpirun_rsh -hostfile $PBS_NODEFILE -n \
  NUM_PROCS_TAG DDTPATH_TAG/bin/ddt-debugger \
  DDT_DEBUGGER_ARGUMENTS_TAG PROGRAM_ARGUMENTS_TAG
Attaching to a batch job

The key here is that ddt needs to know where your job is running. Add the following two lines to your script:

sort -u $PBS_NODEFILE > mynodes.$PBS_JOBID
cp mynodes.$PBS_JOBID ~/.ddt/nodes

ddt will look in ~/.ddt/nodes for nodes to search.
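With ppn=8, $PBS_NODEFILE names each node eight times (once per core); sort -u collapses that to one line per node, which is the form ddt wants in ~/.ddt/nodes. A small sketch with a fabricated node file and hypothetical node names:

```shell
# Build a ppn=8 style node file: each node listed 8 times.
PBS_NODEFILE=$(mktemp)
for node in compute-9-9 compute-9-10; do
    i=0
    while [ $i -lt 8 ]; do
        echo $node >> "$PBS_NODEFILE"
        i=$((i+1))
    done
done

# sort -u keeps one copy of each node name.
sort -u "$PBS_NODEFILE"    # prints two lines: compute-9-10, compute-9-9
# In the real script this goes to a file and then into ~/.ddt/nodes:
#   sort -u $PBS_NODEFILE > mynodes.$PBS_JOBID
#   cp mynodes.$PBS_JOBID ~/.ddt/nodes
rm -f "$PBS_NODEFILE"
```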
Attaching to a batch job
To Attach to a Running Process Session - New Session - Attach List should pop up Nodes need to be in ~/.ddt/nodes
Attaching to an interactive job
The key here is that ddt needs to know where your job is running.
ddt will look in ~/.ddt/nodes for nodes to search.
You may need to manually edit this file.
Attaching to an interactive job
Things to show... Changing MPI version Basic setup Setting break points Seeing modules Memory usage Launching a parallel job Seeing and changing variables
Profiling with IPM
Integrated Performance Monitoring (IPM)

Developed by Nick Wright of SDSC: http://www.sdsc.edu/us/tools/top/ipm/
Local limited documentation: http://geco.mines.edu/ipm/
Available on RA for the experimental versions of MVAPICH*
Normal compile - just add the IPM library
Normal MPI run
Summary of MPI stats at the end of your run to stdout
Can generate a Web page with nice pictures
Integrated Performance Monitoring (IPM) Integrated Performance Monitoring (IPM) is a tool that allows users to obtain a concise summary of the performance and communication characteristics of their codes. IPM is invoked by the user at the time a job is run. By default, a short, text-based summary of the code's performance is provided, and a more detailed Web page summary with graphs to help visualize the output can also be obtained.
Environment Additions for IPM

.tcshrc:
set path = ( $path /lustre/home/apps/pl/bin )
set path = ( $path /lustre/home/apps/ipm/bin )
setenv IPM_KEYFILE /lustre/home/apps/ipm/ipm_key

.bashrc:
export PATH=$PATH:/lustre/home/apps/pl/bin
export PATH=$PATH:/lustre/home/apps/ipm/bin
export IPM_KEYFILE=/lustre/home/apps/ipm/ipm_key
Compiling for IPM

mpif90 -g stf_03.f90 -L$MPI_BASE/ipm/lib -lipm -o stf_03.ipm

where $MPI_BASE = /lustre/home/apps/mpi/db/version

VERSION             Works?
mvapich-1.1/pg      yes
mvapich-1.1/intel   Stay tuned
mvapich2-1.2/pg     yes
mvapich2-1.2/intel  yes
openmpi/*           No - known problem
##IPMv0.923#################################################################### # # command : unknown (completed) # host : compute-9-9/x86_64_linux mpi_tasks : 8 on 1 nodes # start : 03/24/09/14:08:52 wallclock : 31.347469 sec # stop : 03/24/09/14:09:24 %comm : 1.24 # gbytes : 0.00000e+00 total gflop/sec : 0.00000e+00 total # ############################################################################## # region : * [ntasks] = 8 # # [total] <avg> min max # entries 8 1 1 1 # wallclock 250.773 31.3467 31.3465 31.3475 # user 250.589 31.3236 31.1813 31.3532 # system 0.448929 0.0561161 0.043993 0.089986 # mpi 3.10778 0.388473 0.112456 0.610158 # %comm 1.23925 0.35875 1.94643 # gflop/sec 0 0 0 0 # gbytes 0 0 0 0 # # # [time] [calls] <%mpi> <%wall> # MPI_Recv 2.60098 32032 83.69 1.04 # MPI_Reduce 0.272061 8000 8.75 0.11 # MPI_Send 0.232291 32032 7.47 0.09 # MPI_Bcast 0.00119273 96 0.04 0.00 # MPI_Comm_size 0.000790782 24 0.03 0.00 # MPI_Allreduce 0.000330307 32 0.01 0.00 # MPI_Allgather 0.000130918 16 0.00 0.00 # MPI_Comm_rank 6.7791e-06 46 0.00 0.00 ###############################################################################
Generate a web page:

ipm_parse -html tkaiser.1237925332.870435.0

This produces an HTML report ("IPM profile for unknown") with sections for Load Balance, Communication Balance, Message Buffer Sizes, Communication Topology, Switch Traffic, Memory Usage, Executable Info, Host List, Environment, and Developer Info. It includes per-event communication statistics by buffer size, call count, and min/max time (e.g. MPI_Recv, MPI_Send, MPI_Reduce), load balance by task, and HPM counter statistics.

[Screenshots of the generated report pages]
Can profile sections
The report will have a new page with the given label.

Fortran:
!turn on profiling
call mpi_pcontrol( 1,"proc_a"//char(0))
...
!turn off profiling
call mpi_pcontrol(-1,"proc_a"//char(0))

C:
/* turn on profiling */
MPI_Pcontrol( 1,"proc_a");
...
/* turn off profiling */
MPI_Pcontrol(-1,"proc_a");
What's Missing / What are we doing about it?

Timeline style program tracing
  Time in MPI routines
  Communication patterns
  Time in other routines
Memory tracking
Performance numbers
  Flops
  Cache misses
  ...
Tracing
Evaluated a commercial package and rejected it
Will be installing Tau: http://www.cs.uoregon.edu/research/tau/home.php
Large package which does preprocessing of source
Works with many analysis packages
Includes memory tracking if malloc/allocate can be seen
Performance Information

How do we get it? PAPI

Some examples:
http://www.ncsa.uiuc.edu/userinfo/resources/Software/Tools/PAPI/
http://perfsuite.ncsa.uiuc.edu/publications/lj135/x50.html
PAPI - Performance API
http://icl.cs.utk.edu/papi/
Specifies a standard application programming interface (API) for accessing hardware performance counters available on most modern microprocessors
Used by both Tau and IPM
Can show the effects of different optimizations
Problem: requires a kernel patch
Tau and PAPI are part of POINT
http://nic.uoregon.edu/point
The Productivity from Open, INtegrated Tools (POINT) project is funded as part of the NSF's Software Development for Cyberinfrastructure (SDCI) program
Goal: integrate, harden, and deploy an open, portable, robust performance tools environment
Summary The DDT debugger is available for parallel applications DDT can also track memory usage IPM is currently available for simple profiling We will be installing additional performance analysis tools Summary of scripts follows...
.tcshrc additions summary

### mpi settings ###
# choose one MPI version (if several are set, the last one wins)
setenv MPI_VERSION /lustre/home/apps/mpi/db/mvapich-1.1
setenv MPI_VERSION /lustre/home/apps/mpi/db/mvapich2-1.2
setenv MPI_VERSION /lustre/home/apps/mpi/db/openmpi1.3.1
setenv MPI_COMPILER intel
#setenv MPI_COMPILER pg
if ( $?MPI_COMPILER && $?MPI_VERSION ) then
  setenv MPI_BASE $MPI_VERSION/$MPI_COMPILER
  setenv LD_LIBRARY_PATH $MPI_BASE/lib:$LD_LIBRARY_PATH
  setenv LD_LIBRARY_PATH $MPI_BASE/lib/shared:$LD_LIBRARY_PATH
  setenv MANPATH $MPI_BASE/man:$MPI_BASE/shared/man:$MANPATH
  set path = ( $MPI_BASE/bin $path )
endif

### ddt settings ###
set path = ( /lustre/home/apps/ddt2.4.1/bin $path )
setenv DMALLOCPATH /lustre/home/apps/ddt2.4.1
setenv DMALLOC
setenv LD_LIBRARY_PATH $DMALLOCPATH/lib/64:$LD_LIBRARY_PATH

### ipm settings ###
set path = ( $path /lustre/home/apps/pl/bin )
set path = ( $path /lustre/home/apps/ipm/bin )
setenv IPM_KEYFILE /lustre/home/apps/ipm/ipm_key
.bashrc additions summary

### mpi settings ###
# choose one MPI version (if several are set, the last one wins)
export MPI_VERSION=/lustre/home/apps/mpi/db/mvapich-1.1
export MPI_VERSION=/lustre/home/apps/mpi/db/mvapich2-1.2
export MPI_VERSION=/lustre/home/apps/mpi/db/openmpi1.3.1
export MPI_COMPILER=intel
#export MPI_COMPILER=pg
if [ -n "$MPI_COMPILER" ]; then
  if [ -n "$MPI_VERSION" ]; then
    export MPI_BASE=$MPI_VERSION/$MPI_COMPILER
    export LD_LIBRARY_PATH=$MPI_BASE/lib:$LD_LIBRARY_PATH
    export LD_LIBRARY_PATH=$MPI_BASE/lib/shared:$LD_LIBRARY_PATH
    export MANPATH=$MPI_BASE/man:$MPI_BASE/shared/man:$MANPATH
    export PATH=$MPI_BASE/bin:$PATH
  fi
fi

### ddt settings ###
export PATH=/lustre/home/apps/ddt2.4.1/bin:$PATH
export DMALLOCPATH=/lustre/home/apps/ddt2.4.1
export DMALLOC=""
export LD_LIBRARY_PATH=$DMALLOCPATH/lib/64:$LD_LIBRARY_PATH

### ipm settings ###
export PATH=$PATH:/lustre/home/apps/pl/bin
export PATH=$PATH:/lustre/home/apps/ipm/bin
export IPM_KEYFILE=/lustre/home/apps/ipm/ipm_key
Compiling for IPM

mpif90 -g stf_03.f90 -L$MPI_BASE/ipm/lib -lipm -o stf_03.ipm

where $MPI_BASE = /lustre/home/apps/mpi/db/version

VERSION             Works?
mvapich-1.1/pg      yes
mvapich-1.1/intel   Stay tuned
mvapich2-1.2/pg     yes
mvapich2-1.2/intel  yes
openmpi/*           No - known problem
Debug Compile Line mpicc -g \ -L/lustre/home/apps/gdb-6.8/lib64 \ -liberty \ stc_06.c \ -o stc_06.g
Debug Compile Line

mpicc -g -L/lustre/home/apps/gdb-6.8/lib64 -liberty \
  stc_06.c \
  /lustre/home/apps/ddt2.4.1/lib/64/libdmalloc.a \
  -o stc_06.g

Here we link against the debug memory library. This is required if you want to track memory usage in ddt. Note the library must be last on the list.
OpenMPI Debug Script

#!/bin/csh
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:10:00
#PBS -N testio
#PBS -o stdout.$PBS_JOBID
#PBS -e stderr.$PBS_JOBID
#PBS -r n
#PBS -V
#-----------------------------------------------------
cd $PBS_O_WORKDIR
#save a nicely sorted list of nodes
sort -u $PBS_NODEFILE > mynodes.$PBS_JOBID
cp mynodes.$PBS_JOBID ~/.ddt/nodes
DDTPATH_TAG/bin/ddt-client DDT_DEBUGGER_ARGUMENTS_TAG mpiexec -np \
  NUM_PROCS_TAG EXTRA_MPI_ARGUMENTS_TAG PROGRAM_TAG \
  PROGRAM_ARGUMENTS_TAG
MVAPICH2 Debug Script

#!/bin/csh
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:10:00
#PBS -N testio
#PBS -o stdout.$PBS_JOBID
#PBS -e stderr.$PBS_JOBID
#PBS -r n
#PBS -V
#-----------------------------------------------------
cd $PBS_O_WORKDIR
#save a nicely sorted list of nodes
sort -u $PBS_NODEFILE > mynodes.$PBS_JOBID
cp mynodes.$PBS_JOBID ~/.ddt/nodes
mpiexec -n NUM_PROCS_TAG \
  DDTPATH_TAG/bin/ddt-debugger \
  DDT_DEBUGGER_ARGUMENTS_TAG PROGRAM_ARGUMENTS_TAG
MVAPICH-1.1 Debug Script

#!/bin/csh
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:15:00
#PBS -N testio
#PBS -o stdout.$PBS_JOBID
#PBS -e stderr.$PBS_JOBID
#PBS -r n
#PBS -V
cd $PBS_O_WORKDIR
#save a nicely sorted list of nodes
sort -u $PBS_NODEFILE > mynodes.$PBS_JOBID
cp mynodes.$PBS_JOBID ~/.ddt/nodes
mpirun_rsh -hostfile $PBS_NODEFILE -n \
  NUM_PROCS_TAG DDTPATH_TAG/bin/ddt-debugger \
  DDT_DEBUGGER_ARGUMENTS_TAG PROGRAM_ARGUMENTS_TAG