Parallel I/O on Mira Venkat Vishwanath and Kevin Harms

Size: px

Start display at page:

Download "Parallel I/O on Mira Venkat Vishwanath and Kevin Harms"

Dana Warner
8 years ago
Views:

1 Parallel I/O on Mira Venkat Vishwanath and Kevin Harms Argonne Na*onal Laboratory

2 ALCF-2 I/O Infrastructure Mira BG/Q Compute Resource Tukey Analysis Cluster 48K Nodes 768K Cores 10 PFlops 768X 2 GB/s 1536 GB/s 384 I/O Nodes 1536 GB/s QDR IB Switch Complex 900+ ports 384 GB/s GB/s 6 TB Memory 96 Nodes 192 GPUs 128 DDN VM File Server 26 PB ~250GB/s Storage System Storage System consists of 16 DDN SFA12KE couplets with each couplet consisting of 560 x 3TB HDD, 32 x 200GB SSD, 8 Virtual Machines (VM) with two QDR IB ports per VM

3 I/O Subsystem of BG/Q- Simplified View 128 BG/Q Compute Nodes 2x2x4x4x2 2 GB/s 2 GB/s ION 4 GB/s For every 128 compute nodes, we have two nodes designated as Bridge nodes. Each Bridge node has a link (11 th ) to the I/O Node. On the ION, we have a QDR Infiniband connected on a PCI-e Gen2 Slot 3

4 Control and I/O Daemon (CIOD) I/O Forwarding mechanism for BG/P and BG/Q CIOD is the I/O forwarding infrastructure provided by IBM for the Blue Gene / P For each compute node, we have a dedicated I/O proxy process to handle the associated I/O operations On BG/Q, we have a thread based implementation instead of process-based

5 Intrepid vs Mira : I/O Comparison Intrepid BG/P Mira BG/Q Increase Flops PF 10 PF 20X Cores 160K 768K ~5X ION / CN ratio 1 / 64 1 / X IONs per Rack X Total Storage 6 PB 35 PB 6X I/O Throughput 62GB/s 240 GB/s Peak 4X User Quota? None Quotas enforced!! I/O per Flops (%) X - 5

6 I/O interfaces and Libraries on Mira Interfaces POSIX MPI- IO Higher- level Libraries HDF5 Netcdf4 PnetCDF I/O Performance on Mira Shared File Write ~70 GB/s Shared File Read ~180 GB/s 6

7 I/O Recommendations Single shared file vs file per process (MPI rank) Ideally, you should write a few large files Use MPI- IO Interfaces Use MPI Collec*ve I/O (eg. MPI_File_write_at_all()) when you are wri*ng small data sizes per rank qsub GPFS --env Block BGLOCKLESSMPIO_F_TYPE=0x size on Mira : 8 MB For non- overlapping writes pass the environment variable at job submission BGLOCKLESSMPIO_F_TYPE=0x or prefix file with bglockless:/ to disable locking in the ADIO layer Factor of two improvement in performance For POSIX, instead of lseek and write, use pwrite pwrite(int fd, const void *buf, size_t count, off_t offset); 7

8 More Tips and Tricks for I/O on BG/Q // Create and Preallocate the File On Master if ( 0 == m_rank) { retval = MPI_File_open(MPI_COMM_SELF, (char *)m_partfilename, Pre- create the files before wri*ng to them MPI_MODE_WRONLY MPI_MODE_CREATE, Increases shared file write performance by a factor MPI_INFO_NULL,&m_fileHandle); assert(retval == MPI_SUCCESS); Pre- allocate files Useful // Preallocate if you File know the exact size of data to be wrinen. MPI_File_set_size(m_fileHandle, m_partfilesize); Have a rank, say 0, create & allocate (MPI_File_set_size or MPI_File_close (&m_filehandle); pruncate) } and close. Next, have all ranks open this file for wri*ng. MPI_Barrier (MPI_COMM_WORLD); HDF5 Datatype // Open the File for Writing Instead retval = MPI_File_open(MPI_COMM_WORLD, of Na*ve, use Big Endian(eg. (char H5T_IEEE_F64BE) *)m_partfilename, MPI_MODE_WRONLY, MPI_INFO_NULL, &m_filehandle); Avoid Complex MPI Datatypes Keep assert(retval things == MPI_SUCCESS); simple in produc*on codes Data MPI_Barrier Compression (MPI_COMM_WORLD); 8

9 Tools for I/O Performance Profiling on BG/Q HPM Bob Walkup hnps:// guides/hpctw Darshan Phil Carns and Kevin Harms hnps:// guides/darshan TAU Sameer Shende hnps:// guides/tuning- and- analysis- u*li*es- tau opttrackio in TAU_OPTIONS 9

10 Data for MPI rank 0 of Profiling I/O using HPM Times and statistics from MPI_Init() to MPI_Finalize() MPI Routine #calls avg. bytes time(sec) Link applica*on with MPI_Comm_size MPI_Comm_rank /sop/perpools/hpctw/lib/libmpihpm_smp.a MPI_Isend MPI_Recv /bgsys/drivers/ppcfloor/bgpm/lib/libbgpm.a MPI_Probe MPI_Waitall MPI_Barrier MPI_Gather rw- r- - r- - MPI_Allgather 1 venkatv Performance May : mpi_profile rw- r- - r- - MPI_Allgatherv 1 venkatv Performance May : mpi_profile rw- r- - r- - MPI_Reduce 1 venkatv Performance May : mpi_profile MPI_Allreduce rw- r- - r- - 1 venkatv Performance 5505 May 20 09:29 mpi_profile Job produces the following output, among others MPI_File_close MPI_File_open MPI_File_read_at MPI_File_write_at_all total communication time = seconds. total elapsed time = seconds. total MPI-IO time = seconds. 10

11 Profiling I/O using HPM MPI_File_read_at #calls avg. bytes?me(sec) MPI_File_write_at_all Performance Issue: 15 Collective calls writing 0 bytes taking 10 secs. 11

12 Darshan An I/O Characterization Tool An open- source tool developed for sta*s*cal profiling of I/O Designed to be lightweight and low overhead Finite memory alloca*on for sta*s*cs (about 2MB) done during MPI_Init Overhead of 1-2% total to record I/O calls Varia*on of I/O is typically around 4-5% Darshan does not create detailed func*on call traces No source modifica*ons Uses PMPI interfaces to intercept MPI calls Use ld wrapping to intercept POSIX calls Can use dynamic linking with LD_PRELOAD instead Stores results in single compressed log file 12

13 fms c48l24 am3p5.x (3/27/2010) 1 of 2 Darshan on BG/Q uid: 6277 nprocs: 1944 runtime: 5382 seconds Percentage of run time Average I/O cost per process POSIX MPI-IO Read Write Metadata Other (including application compute) Ops (Total, All Processes) 4.5e+07 4e e+07 3e e+07 2e e+07 1e+07 5e+06 0 Read Write Open Stat Seek MmapFsync POSIX MPI-IO Indep. I/O Operation Counts MPI-IO Coll. logs: /projects/logs/darshan/vesta /projects/logs/darshan/mira bins: /soft/perftools/darshan/darshan/bin 3e+07 I/O Sizes 4e+07 I/O Pattern Count (Total, All Procs) 2.5e+07 2e e+07 1e+07 5e+06 Ops (Total, All Procs) 3.5e+07 3e e+07 2e e+07 1e+07 wrappers: /soft/perftools/darshan/darshan/ wrappers/<wrapper> 1K-10K 101-1K M-4M 100K-1M 10K-100K Read 1G+ 100M-1G 10M-100M 4M-10M Write Most Common Access Sizes access size count Read Total Sequential Write Consecutive export PATH=/soft/perftools/ darshan/darshan/wrappers/xl:$ {PATH} 0 5e+06./fms c48l24 am3p5.x <unknown args> 13

14 33 Early Experience Scaling I/O for HACC HACC on Mira: Early Science Test Run Hybrid/Hardware Accelerated First Large-Scale Computa*onal Simulation Cosmology on Mira code is lead by Salman 16 racks, 262,144 Habib cores at (one-third of Mira) Argonne 14 hours continuous, no checkpoint restarts Members include Katrin Simulation Parameters Heitmann, Hal 9.14 Finkel, Gpc simulation Adrian box Pope, Vitali Morozov 7 kpc force resolution 1.07 trillion particles Par*cle- in- Cell Science Motivations code Achieved ~14 Resolve PF on halos Sequoia that host and luminous red galaxies for 7 PF on Mira analysis of SDSS observations On Mira, HACC Cluster uses cosmology 1-10 Trillion par*cles Baryon acoustic oscillations in galaxy clustering Zoom-in visualization of the density field illustrating the global spatial dynamic range of the simulation -- approximately a million-to-one 14

15 Early Experience Scaling I/O for HACC Typical Checkpoint/Restart dataset is ~ TB Analysis output, wrinen more frequently, is ~1TB- 10TB Step 1: Default I/O mechanism using single shared file with MPI- IO Achieved ~15 GB/s Step 2: File per rank Too many files at 32K Nodes (262K files with 8RPN) Step 3: An I/O mechanism wri*ng 1 File per ION Topology- aware I/O to reduce network conten*on Achieves up to 160 GB/s (~10 X improvement) Wrinen and read ~10 PB of data on Mira (and coun*ng) 15

16 Questions 16

Running on Blue Gene/Q at Argonne Leadership Computing Facility (ALCF)

Running on Blue Gene/Q at Argonne Leadership Computing Facility (ALCF) ALCF Resources: Machines & Storage Mira (Production) IBM Blue Gene/Q 49,152 nodes / 786,432 cores 768 TB of memory Peak flop rate: