1. How to Get An Account

   CACR Accounts

2. How to Access the Machine

   Connect to the front end, zwicky.cacr.caltech.edu:

      ssh -l username zwicky.cacr.caltech.edu
   or
      ssh username@zwicky.cacr.caltech.edu

   Edits, compiles, builds, and job submissions are done on the front end.

   NOTE: Password-based authentication is not supported for connecting to the
   cluster; ssh public key authentication is the only way to connect. If you
   don't have an ssh keypair, please see the SSH key generation instructions
   (a minimal sketch also follows below).

3. Technical Summary & System Configuration Information

The MRI2 cluster, zwicky, is specifically configured to meet the needs of
applications in Caltech's Theoretical Astrophysics group, in particular the
simulation of black holes and other extreme spacetimes. The configuration,
integrated by Hewlett-Packard and CACR's technical support team, consists of
2244 Intel X5650 compute cores plus 320 Intel E5-2670 cores, connected via QDR
InfiniBand. 160 TB of parallel file system space (Panasas PAS11) is mounted
over InfiniBand. Typical applications running on zwicky are MPI based. Queuing
policies allow for development and production cycles and are flexible enough
to meet the project PIs' requirements.
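Returning to the access requirements in section 2: if you do not yet have an
ssh keypair, the sketch below shows the generic OpenSSH steps. The exact key
type, options, and the procedure for registering your public key are given in
CACR's SSH key generation instructions, which take precedence over this
sketch.

   # on your local workstation: create a keypair (protect it with a passphrase)
   ssh-keygen -t rsa

   # this produces a private key (~/.ssh/id_rsa -- keep it secret) and a
   # public key (~/.ssh/id_rsa.pub -- the part you register with CACR)

   # once the public key has been installed for your account, connect as in
   # section 2:
   ssh username@zwicky.cacr.caltech.edu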
   Architecture          Intel Westmere, RedHat Linux Cluster

   Head Node             2 Intel X5650 2.66 GHz hex-core processors - 12 cores
                         48 GB ECC SDRAM

   Compute Nodes         187 dual-processor Intel X5650 2.66 GHz, hex-core
                            2 GB ECC SDRAM/core - 24 GB/node
                         20 dual-processor Intel E5-2670 2.6 GHz, 8-core
                            64 GB/node 12800R ECC SDRAM
                         2308 compute cores, 4.6 TB memory

   Network Interconnect  11 Voltaire InfiniBand 4x QDR 36P managed switches

   Storage               40 GB local scratch/node
                         ~160 TB, supplying /home and /panfs/ds06/sxs
                         project work area 44 TB
                            o /nfs/ds0[1,3,4]/sxs
                            o /nfs/ds0[3,4]/sxs_bbhdata
                         archival area 180 TB
                            o /nfs/as[01..12]/sxs

   Batch System          Torque with Maui

   Operating System      RHEL 5.9

   Compilers             GNU, PGI, Open64, Intel

   MPI                   Open MPI

4. Available Filesystems and Descriptions/Intended Usage

4.a Filesystems available on head node, compute nodes, and archival storage
    nodes:

    /home                  - RAID5/1  backed up
    /nfs/ds01/sxs          - RAID6    not backed up
    /nfs/ds03/sxs          - RAID6    not backed up
    /nfs/ds03/sxs_bbhdata  - RAID6    not backed up
    /nfs/ds04/sxs          - RAID5    not backed up
    /nfs/ds04/sxs_bbhdata  - RAID5    not backed up
    /panfs/ds06/sxs        - RAID5/1  not backed up

4.b Filesystems available on compute nodes only:

    /scratch - new filesystem created after each job (~40 GB), local to each
    compute node

4.c Filesystems available on zwicky head and archival storage nodes
    (tier2-storage-[a,b].cacr.caltech.edu):

    /nfs/as[01..12]/sxs    - RAID6/1  not backed up
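Before staging large data sets, you can check how much space is free on any of
these areas with the standard df command; the mount points are exactly those
listed above, for example:

   df -h /home /panfs/ds06/sxs /nfs/ds01/sxs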
All the filesystems listed above, except for /nfs/*/sxs_bbhdata and /scratch,
have a per-user directory for each sxs project member.

Notes regarding archival storage nodes:

1. Currently, access to the archival storage nodes is available from the
   zwicky cluster as well as from external nodes via 'ssh' (just as for the
   zwicky head node), e.g.

      ssh USERNAME@tier2-storage-b.cacr.caltech.edu

2. Please move all data that will not be immediately needed to the archival
   storage areas. Appropriate commands are:

      cp -r /panfs/ds06/sxs/username/yourfromdirectory/. \
            /nfs/as01/sxs/username/yourtodirectory

      ( cd /panfs/ds06/sxs/username/yourfromdirectory && tar cf - . ) | \
      ( cd /nfs/as02/sxs/username/yourtodirectory && tar xf - )

   as well as variants of 'rsync' and any other Linux copy command.

3. Remote archival area: /nfs/as[01..12]/sxs/<username>. For example:
   /nfs/as04/sxs/joeuser

   Archival storage is currently accessible from the zwicky head node
   'zwicky.cacr.caltech.edu' as well as from the dedicated archival access
   nodes 'tier2-storage.cacr.caltech.edu', an alias that directs to
   tier2-storage-a.cacr.caltech.edu or tier2-storage-b.cacr.caltech.edu. This
   archival area is NOT currently available from the zwicky compute nodes.

   Because of its connectivity, file copies between /panfs/* and the
   /nfs/as[01..12] areas will be much faster when done on
   'tier2-storage.cacr.caltech.edu' than on 'zwicky.cacr.caltech.edu'. This
   area is intended for archival storage and, for performance reasons, is not
   meant to be used by MPI programs. Low-I/O MPI access from the zwicky head
   node is acceptable but should be limited.

Example 1:

   ssh tier2-storage.cacr.caltech.edu
   tier2-storage-a: pwd
   /home/joeuser
   tier2-storage-a: cd /panfs/ds06/sxs/joeuser
   tier2-storage-a: pwd
   /panfs/ds06/sxs/joeuser
   tier2-storage-a: cp filea /nfs/as01/sxs/joeuser
   tier2-storage-a: ls -l /nfs/as01/sxs/joeuser
   total 16
   -rw-r--r-- 1 joeuser joeuser 13453 Jul 27 10:23 filea

Example 2:

   ssh tier2-storage.cacr.caltech.edu
   tier2-storage-b: pwd
   /home/joeuser
   tier2-storage-b: cd /panfs/ds06/sxs/joeuser
   tier2-storage-b: ls -al
   total 64
   drwxrwsr-x  3 joeuser sxs    4096 Jul 27 10:37 ./
   drwxr-xr-x 67 root    root  12288 Jul 11 14:35 ../
   drwxr-sr-x  2 joeuser sxs    4096 Jul 27 10:36 DirectoryA/
   tier2-storage-b: ls -al /nfs/as01/sxs/joeuser
   total 24
   drwxrwxr-x  2 joeuser joeuser  4096 Jul 27 10:37 ./
   drwxrwsr-x 67 root    root     4096 Jul 25 07:53 ../
   -rw-r--r--  1 joeuser joeuser 13453 Jul 27 10:34 oldfile
   tier2-storage-b: tar cf - DirectoryA | ( cd /nfs/as01/sxs/joeuser && tar xpf - )
   tier2-storage-b: ls -al /nfs/as01/sxs/joeuser
   total 28
   drwxrwxr-x  3 joeuser joeuser  4096 Jul 27 10:37 ./
   drwxrwsr-x 67 root    root     4096 Jul 25 07:53 ../
   drwxr-sr-x  2 joeuser joeuser  4096 Jul 27 10:36 DirectoryA/
   -rw-r--r--  1 joeuser joeuser 13453 Jul 27 10:34 oldfile

5. Available compilers, debuggers, libraries and other tools

Many useful tools, besides those coming with standard RHEL, are available on
zwicky. 'module avail' will list the currently available packages; below is a
typical collection:

   Compilers:           gcc, intel, open64, pgi, nasm
   Debuggers:           inspector, intel, totalview, valgrind
   Languages:           java, python
                        python-pkg: mpi4py, pytz, numarray, scipy, gnuplot,
                        numpy, sip, ipython, pymc, sympy, matplotlib, pyqt
   Libraries:           ATLAS, HDF5, fftw3, qt, mkl
   MPI Libraries/Tools: impi, openmpi, mvapich2, platform_mpi
   Tools:               autoconf, automake, binutils, cmake, dakota, krb, m4,
                        matlab, metis, octave, papi, petsc, svn, wireshark
   Visualization:       gnuplot, grace, paraview, tecplot, visit

5.1 Using Module
The software environment on the MRI2 cluster is managed with the module
command. To get started, try 'module help' to see the available arguments.

To see all available packages:

   module avail

To add the default openmpi version:

   module add openmpi

You can see which version is the default for a package by looking at the
output of 'module avail'.

To use a specific version of openmpi:

   module add openmpi/1.4.1-gcc

The same package cannot be added twice, so when switching between different
versions of the same package you will need to remove the loaded version
('module del openmpi') before loading the new version
('module add openmpi/1.4.1-gcc'). Another way of doing this is to use the swap
argument:

   module swap openmpi openmpi/1.4.1-gcc

Some packages have dependencies and will report when a prerequisite is not met:

   module add python
   python/2.6.4(23):error:151: Module 'python/2.6.4' depends on one of the
   module(s) 'openmpi/1.4.1-pgi openmpi/1.4.1-intel openmpi/1.4.1-gcc
   openmpi/1.4.1 openmpi/1.4-pgi openmpi/1.4-intel openmpi/1.4-gcc openmpi/1.4
   openmpi/1.3.3-pgi openmpi/1.3.3-intel openmpi/1.3.3-gcc openmpi/1.3.3'
   python/2.6.4(23):error:102: Tcl command execution failed: prereq openmpi

After adding openmpi ('module add openmpi'), 'module add python' will succeed.

To see what modules are in your environment:

   module list

To clear all loaded packages from your environment:

   module clear

To get help for a module:

   module help openmpi

To display what a module will do to your environment:

   module display openmpi

Some users test package versions with the 'using' command in their Makefiles:

   MPIVERSIONREQUIRED=openmpi_1.4.1
   MPIVERSION := $(strip $(shell using | grep mpi))
   ifneq ($(MPIVERSION),$(MPIVERSIONREQUIRED))
   $(error You are using version $(MPIVERSION). \
     You need to be using $(MPIVERSIONREQUIRED) \
     for compilation to work. \
     Please put 'use $(MPIVERSIONREQUIRED)' in your .tcshrc, .cshrc, or .profile \
     [whichever is appropriate for your shell] and try again.)
   endif

With the environment module, the following can be done instead:
   MPIVERSIONREQUIRED=openmpi/1.4.1
   MPIVERSION := $(strip $(shell echo $$LOADEDMODULES | tr ':' '\n' | grep mpi))
   ifneq ($(MPIVERSION),$(MPIVERSIONREQUIRED))
   $(error You are using version $(MPIVERSION). \
     You need to be using $(MPIVERSIONREQUIRED) \
     for compilation to work. \
     Please put 'use $(MPIVERSIONREQUIRED)' in your .tcshrc, .cshrc, or .profile \
     [whichever is appropriate for your shell] and try again.)
   endif

6. Supported Debuggers and Debugging Tips

   idb (Intel Debugger)

   valgrind - To use valgrind with MPI jobs, do

      [l,h]mpirun -n X /usr/bin/valgrind --log-file=memlog a.out

   where memlog is a name of your choosing; you will get a memlog.pid file for
   each task with information about memory and pointer usage.

   TotalView
      o Documentation
      o Examples are located in
        /usr/local/totalview/toolworks/linux-x86-64/examples
      o License supports debugging of up to 32 threads or processes

   Example of running TotalView in an interactive session:
      o Compile your code with the -g option.
      o Allocate a couple of interactive nodes for the debugging session:
        'qsub -I -x -l nodes=2' will land you on a compute node, where
        TotalView can be started: totalview &
      o From the TotalView GUI, choose the program to run.
      o Select the "Arguments" tab and fill in command-line arguments.
      o Select the "Parallel" tab and select the parallel system: Open MPI.
      o Choose how many tasks and how many nodes should be used for the
        debugging session.
      o Click OK to load the code for debugging.

7. Launching, Managing and Priorities of Parallel Jobs

7.1 Launching a job

As stated above, the zwicky cluster is currently a heterogeneous cluster made
up of 207 compute nodes: 187 compute nodes have 12 cores each, and twenty
compute nodes have 16 cores each. Allocating any subset of the twenty E5-2670
nodes (each with 16 cores) requires the special "core16" tag; when requesting
a compute node, core12 nodes are allocated if no core[12,16] tag is specified.
See the runme.pbs example below for requesting a mix of X5650 and E5-2670
nodes.

It is expected that users will compile and run small verification tests of
their code on the head node, and then launch compute/memory-intensive tests
and production runs on the backend (aka compute) nodes.
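For instance, a minimal head-node compile-and-verify cycle might look like the
sketch below; hello_world.c is just a placeholder for your own MPI source, and
the openmpi module shown is the same one used for the batch example later in
this section:

   module add openmpi/1.3.3-gcc        # same MPI used for the batch example below
   mpicc -o hello_world hello_world.c  # mpicc is the Open MPI compiler wrapper
   mpirun -np 2 ./hello_world          # small smoke test only; keep head-node runs tiny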
Access to the compute nodes is obtained via the Maui queuing software. Maui
will allocate the requested nodes to the user for the requested time, based on
a priority system. A user can obtain interactive access to a node (or set of
nodes), or run a "job script" non-interactively on a node (or set of nodes).

Interactive jobs on the compute nodes are initiated from the head node via the
command:

   qsub -I -l nodes=X -l walltime=HH:MM:SS

      X        = number of nodes
      HH:MM:SS = requested time (HH: hours, MM: minutes, SS: seconds)

Non-interactive or batch jobs are submitted via the command:

   qsub -V runme.pbs

where runme.pbs is the name of the batch script to run on the nodes, and the
-V option to qsub passes all of your current environment variables to the
programs run in the runme.pbs script.

Prior to submitting the job to PBS, set up the path:

   module add openmpi/1.3.3-gcc

(openmpi/1.3.3-gcc was used to compile the MPI program hello_world that is run
in the runme.pbs script.)

A simple runme.pbs file could contain:

   #!/bin/csh -f

   # Request 2 nodes for 1 hour
   #PBS -l nodes=2
   #PBS -l walltime=01:00:00

   # Direct stdout/err to files in directories
   # below my home directory
   #PBS -o /home/my_unixid/examples/hello.out
   #PBS -e /home/my_unixid/examples/hello.err

   # Alternatively, request a mix of core16 and core12 nodes
   # in place of the node request above:
   # #PBS -l nodes=4:core16+8:core12

   # Display the libraries my program will be using
   # note: this will be written to your stdout file.
   /usr/bin/ldd $HOME/examples/hello_world

   # Write the list of nodes allocated to me to stdout
   cat $PBS_NODEFILE

   # Run "hello_world" on 24 cores using the full pathname
   # mpirun -np 24 $PBS_O_HOME/examples/hello_world

   # Run "hello_world" on 12 cores using a relative pathname
   # (note: the 'cd')
   cd ${HOME}/examples
   mpirun -np 12 ./hello_world

7.2. Commonly Used Job Scheduler Commands
   Command     Description

   canceljob   Cancel a job, e.g. canceljob <PBS_JOBID>

   diagnose    Provide diagnostic reports for various aspects of resources,
               workload and scheduling, e.g. diagnose -f

   pbsnodes    Show compute node status, e.g. pbsnodes -a

   qstat       State of any job currently queued/running: qstat -a
               Note: When monitoring jobs with qstat, look at "Elap Time"
               (elapsed time) rather than "Time Use". The "Elap Time" is the
               time since the job started, whereas "Time Use" is the CPU time
               used by the user process; this number is usually zero or close
               to it, since it counts the script that actually launches the
               MPI job, not the job itself.
               Show quick information about the server:       qstat -B
               Show all queues:                               qstat -q
               Show all jobs running on the system:           qstat -r
               Show detailed information for a specified job: qstat -f PBS_JOBID

   qdel        Another way to cancel a job, e.g. qdel <PBS_JOBID>

   showq       Jobs that are queued, running, or on hold, and how many nodes
               are in use/free.

For much more detailed documentation on the Maui scheduler and the Torque
Resource Manager, see the sections "TORQUE Resource Manager" and "Maui Cluster
Scheduler" at http://www.clusterresources.com/resources/documentation.php

7.3. VM Limits

The zwicky compute nodes now have an upper bound (approximately 66 GBytes) on
the total amount of virtual memory that can be used, cumulatively, by all
processes running on the node. Since system processes consume a couple of
GBytes, the total VM available on a node for a user job will be a bit more
than 60 GBytes. Once that limit is hit, the brk() system call (and library
routines which call it, e.g. sbrk() and malloc()) will fail and return an
error.

8. Getting Help/Communicating with Other Users/Staff

CACR technical support is available during standard business hours (M-F, 8am
to 5pm); after-hours responses are as time permits.

9. Expectations About System Down Times/News

Monday mornings are reserved for scheduled system downtime, starting no
earlier than 7:00 and lasting until no later than noon. Sometimes the downtime
will involve only the compute nodes (queues stopped, and no jobs scheduled);
occasionally the downtime will require that users be prevented from using the
head node as well. Some weeks no Monday-morning downtime will be taken. If you
would like to schedule benchmarking or test runs which require dedicated
access to the entire cluster, asking zwicky-help for a portion of the
scheduled system time is fine and encouraged.

News about operational changes (e.g. system software upgrades, file system
policy changes, dedicated runs) will be posted using news. New news items
since you last logged in to the head node will be displayed automatically.

Using news:

   news -help
   Usage:
      o news     prints all new news items
      o news X   prints news item "X"
      o news -a  prints all news items
      o news -l  prints names of all news items
      o news -h  prints this message

10. Performance

HPL performance (127.68 GFLOPs peak/node):

   Cores               GFLOPs
   12   (1 node)       1.152e+02
   96   (8 nodes)      9.187e+02
   192  (16 nodes)     1.831e+03
   384  (32 nodes)     3.660e+03
   768  (64 nodes)     7.288e+03
   1536 (128 nodes)    1.452e+04
   2244 (187 nodes)    TBD

11. Policies

All users are expected to adhere to the CACR Computing Policies.

12. Accounting and Job Priority Policies

Jobs are scheduled according to "weight." Many factors are taken into
consideration when determining a job's weight, including cpu time consumed
recently by the user, time spent waiting in the queue, runtime, and node
count, as well as a "Fairshare job Scheduling" algorithm. The intent of
Fairshare scheduling is to prevent a user from dominating compute resources;
a balance is struck between utilization of cpu resources and job throughput.

12.1 Job request limitations

   Jobs can be requested for a maximum time of 48 hours.

   Jobs lasting 12 hours or less can request at most 64 nodes.

   Jobs lasting longer than 12 hours can request at most 22 nodes.

12.2 Job Priorities

   There are 4 nodes permanently reserved for jobs running 2 hours or less.

   No more than 22 nodes on the system can be running jobs that last more than
   12 hours.
   All things being equal, a job lasting more than 12 hours will have priority
   over a job lasting 12 hours or less, but as stated above no more than 22
   nodes will be running such jobs.

   A fairshare algorithm is implemented between users.
      o A user's cpu-hrs consumed are weighted over seven 24-hour intervals,
        and an attempt is made to balance this weighted usage equally among
        all users.
      o Each interval is weighted by (0.9)^n, where n is the n'th 24-hour
        interval from the present, starting with n=0.

   Within the list of a user's jobs waiting to run, all other things being
   equal, the job queued the longest will run first.
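   As an illustrative example of the fairshare weighting above, suppose a user
   consumed 100, 50, and 200 cpu-hrs in the three most recent 24-hour
   intervals (and nothing before that). The weighted usage that the scheduler
   tries to balance against other users is then:

      100*(0.9)^0 + 50*(0.9)^1 + 200*(0.9)^2 = 100 + 45 + 162 = 307 weighted cpu-hrs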