Parallel file I/O bottlenecks and solutions

Transcription

1 Parallel file I/O bottlenecks and solutions
  Views to Parallel I/O: Hardware, Software, Application
  Challenges at Large Scale
  Introduction to SIONlib
  Pitfalls, Darshan, I/O Strategies
  Wolfgang Frings, Jülich Supercomputing Centre (Member of the Helmholtz Association)
  13th VI-HPS Tuning Workshop, BSC, Barcelona 2014

2 Overview
  Parallel I/O from different views
    Hardware: example IBM BG/Q I/O infrastructure
    System software: IBM GPFS, I/O forwarding
    Application: parallel I/O libraries
  Pitfalls
    Small blocks, I/O to individual files, false sharing
    Tasks per shared file, portability
  SIONlib overview
  I/O characterization with Darshan
  I/O strategies
  13th VI-HPS Tuning Workshop. BSC, Barcelona 2014, Parallel I/O, W.Frings

3 IBM Blue Gene/Q (JUQUEEN) & I/O
  IBM Blue Gene/Q JUQUEEN
    IBM PowerPC A2 1.6 GHz, 16 cores per node
    28 racks (7 rows of 4 racks)
    28,672 nodes (458,752 cores)
    5D torus network
    5.9 Pflop/s peak, 5.0 Pflop/s Linpack
    Main memory: 448 TB
    I/O nodes: 248 (27x8 + 1x32)
  Network
    2x CISCO Nexus 7018 switches (connect I/O nodes)
    Total ports: GigEthernet
  [Figure label: Nexus-Switch]

4 Blue Gene/Q: I/O-node cabling (8 ION/Rack)
  [Figure labels: internal torus network, 10GigE network; IBM]

5 I/O-Network & File Server (JUST)
  [Figure labels: JUQUEEN; JUST4: 18 storage controllers (16 x DCS3700, 2 x DS3512), 20 GPFS NSD servers; x CISCO Nexus ports (10GigE); JUST4-GSS; 8 TSM Server p720]

6 Software View to Parallel I/O: GPFS Architecture and I/O Data Path
  [Figure: GPFS software stack and I/O data path, top to bottom; IBM]
    Compute nodes, O( ) ...: each runs the application and a GPFS NSD client
      GPFS NSD client: I/O size, parallelism, pagepool (streams), prefetch threads
    Network (TCP/IP or IB): network transfer size
    GPFS NSD servers: pagepool (disks), NSD workers
    SAN, O( ) ...: adapter / disk device driver (hdisk dd, adapter dd)
    Storage subsystem: GPFS server building blocks BB1, BB2, ..., BBn with their NSDs; SAN controllers Ctrl A and Ctrl B

7 Software View to Parallel I/O: GPFS on IBM Blue Gene/Q (I)
  [Figure labels: I/O forwarding, software stack; IBM]

8 Software View to Parallel I/O: GPFS on IBM Blue Gene/Q (II)
  [Figure labels: I/O forwarding; parallel application, POSIX I/O, POSIX I/O, parallel file system; IBM 2012]

9 Application View to Parallel I/O
  [Figure: software stack] Parallel application; HDF5, NETCDF, MPI-I/O, SIONlib; shared / local; POSIX I/O; parallel file system

10 Application View: Data Formats

11 Application View: Data Distribution
  [Figure labels]
    Software view: parallel application; HDF5, NETCDF; MPI/IO; POSIX I/O; parallel file system
    Data view: distributed (local view); transformation; shared (global view)
    Post-processing: convert utility

12 Parallel Task-local I/O at Large Scale
  Usage fields:
    Checkpoint files, restart files
    Result files, post-processing
    Parallel performance tools
  Data types:
    Simulation data (domain decomposition)
    Trace data (parallel performance tools)
  Bottlenecks:
    File creation
    File management
  #files: O(10^5), one per task t1, t2, ..., tn: ./checkpoint/file.0001 ... ./checkpoint/file.nnnn

13 The Showstopper for Task-local I/O: Parallel Creation of Individual Files
  Jugene + GPFS: file create+open, one file per task versus one file per I/O-node: > 33 minutes versus < 10 seconds
  [Figure labels: tasks, create file, directory i-node, entries (f), FS blocks]

14 SIONlib: Shared Files for Task-local Data
  [Figure: software stack] Parallel application; HDF5, NETCDF, MPI-I/O, SIONlib; POSIX I/O; parallel file system
  [Figure labels: application, tasks t1 ... tn, logical task-local files, SIONlib, physical multi-file, parallel file system, serial program]
  #files: O(10) (./checkpoint/file.0001 ... ./checkpoint/file.nnnn)

15 The Showstopper for Shared File I/O: Concurrent Access & Contention
  File system block locking leads to serialization: when the data of task 1 and task 2 share an FS block, each task has to take the block lock in turn
  SIONlib: logical partitioning of the shared file
    Dedicated data chunks per task
    Alignment to boundaries of file system blocks
    No contention
  [Figure labels: tasks t1, t2, ..., tn; shared file: metablock 1, chunk 1 ... chunk n (data, gaps) forming block 1, metablock 2; FS blocks]
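The alignment idea on this slide can be illustrated with a little offset arithmetic. The following is a minimal sketch, not SIONlib's actual implementation: the function names and the layout (one aligned metablock followed by one aligned chunk per task) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Round size up to the next multiple of fs_block. */
static uint64_t align_up(uint64_t size, uint64_t fs_block)
{
    assert(fs_block > 0);
    return ((size + fs_block - 1) / fs_block) * fs_block;
}

/*
 * Hypothetical layout: an aligned metablock, then one aligned chunk per task.
 * Returns the file offset at which the chunk of `task` starts.
 */
static uint64_t chunk_offset(uint64_t meta_size, uint64_t chunk_size,
                             uint64_t fs_block, uint64_t task)
{
    return align_up(meta_size, fs_block)
         + task * align_up(chunk_size, fs_block);
}
```

Because every chunk starts on a block boundary and spans whole blocks, no two tasks ever write into the same file system block. The price is the gaps at the end of chunks whose requested size is not a multiple of the block size.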

16 SIONlib: Architecture & Example
  [Figure: architecture] Application; SION OpenMP API, SION Hybrid API, SION MPI API; callbacks; parallel generic API, serial API; SIONlib; OpenMP, MPI; ANSI C or POSIX I/O
  Extension of I/O API (ANSI C or POSIX)
  C and Fortran bindings, implementation language C
  Current version: 1.4p3
  Open source license:
  Example:
      /* fopen() */
      sid = sion_paropen_mpi(filename, "bw", &numfiles, &chunksize,
                             gcom, &lcom, &fileptr, ...);
      /* fwrite(bindata, 1, nbytes, fileptr) */
      sion_fwrite(bindata, 1, nbytes, sid);
      /* fclose() */
      sion_parclose_mpi(sid);

17 SIONlib in a Nutshell: Task-local I/O
      /* Open: one file per task, named after the task number */
      sprintf(tmpfn, "%s.%06d", filename, my_nr);
      fileptr = fopen(tmpfn, "wb");
      ...
      /* Write */
      fwrite(bindata, 1, nbytes, fileptr);
      ...
      /* Close */
      fclose(fileptr);
  Original ANSI C version
  No collective operation, no shared files
  Data: stream of bytes

18 SIONlib in a Nutshell: Add SIONlib calls
      /* Collective Open */
      nfiles = 1; chunksize = nbytes;
      sid = sion_paropen_mpi(filename, "bw", &nfiles, &chunksize,
                             MPI_COMM_WORLD, &lcomm, &fileptr, ...);
      ...
      /* Write */
      fwrite(bindata, 1, nbytes, fileptr);
      ...
      /* Collective Close */
      sion_parclose_mpi(sid);
  Collective (SIONlib) open and close
  Ready to run ...
  Parallel I/O to one shared file

19 SIONlib in a Nutshell: Variable Data Size
      /* Collective Open */
      nfiles = 1; chunksize = nbytes;
      sid = sion_paropen_mpi(filename, "bw", &nfiles, &chunksize,
                             MPI_COMM_WORLD, &lcomm, &fileptr, ...);
      ...
      /* Write */
      if (sion_ensure_free_space(sid, nbytes)) {
          fwrite(bindata, 1, nbytes, fileptr);
      }
      ...
      /* Collective Close */
      sion_parclose_mpi(sid);
  Writing more data than defined at the open call
  SIONlib moves forward to the next chunk if the data is too large for the current chunk
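The "move forward to the next chunk" behaviour can be modelled with a small cursor. This is only a sketch of the idea under simplifying assumptions: struct chunk_cursor and both functions are hypothetical names, they are not part of the SIONlib API, and a real implementation also has to track file offsets and metadata.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-task write position inside a chunked shared file. */
struct chunk_cursor {
    size_t chunk_size; /* bytes reserved for this task in every block */
    size_t used;       /* bytes already written to the current chunk */
    size_t block;      /* index of the block holding the current chunk */
};

/*
 * Returns 1 if nbytes fit into the current chunk, advancing to this task's
 * chunk in the next block when the current one is too full. Returns 0 if
 * nbytes can never fit into a single chunk.
 */
static int cursor_ensure_free_space(struct chunk_cursor *c, size_t nbytes)
{
    assert(c != NULL);
    if (nbytes > c->chunk_size)
        return 0;
    if (nbytes > c->chunk_size - c->used) {
        c->block++;   /* the rest of the old chunk stays unused */
        c->used = 0;
    }
    return 1;
}

/* Account for a write of nbytes; returns 1 on success, 0 if it cannot fit. */
static int cursor_write(struct chunk_cursor *c, size_t nbytes)
{
    if (!cursor_ensure_free_space(c, nbytes))
        return 0;
    c->used += nbytes;
    return 1;
}
```

With a chunk size of 100 bytes, two consecutive 60-byte writes end up in different blocks, because the second write no longer fits behind the first one.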

20 SIONlib in a Nutshell: Wrapper function
      /* Collective Open */
      nfiles = 1; chunksize = nbytes;
      sid = sion_paropen_mpi(filename, "bw", &nfiles, &chunksize,
                             MPI_COMM_WORLD, &lcomm, &fileptr, ...);
      ...
      /* Write */
      sion_fwrite(bindata, 1, nbytes, sid);
      ...
      /* Collective Close */
      sion_parclose_mpi(sid);
  Includes check for space in current chunk
  Last parameter changes compared to fwrite: fileptr becomes sid

21 SIONlib: Applications
  Applications
    DUNE-ISTL (multigrid solver, Univ. Heidelberg)
    ITM (fusion community)
    LBM (fluid flow/mass transport, Univ. Marburg)
    PSC (particle-in-cell code)
    OSIRIS (fully explicit particle-in-cell code)
    PEPC (Pretty Efficient Parallel Coulomb Solver)
    Profasi (protein folding and aggregation simulator)
    NEST (human brain simulation)
    MP2C (mesoscopic hydrodynamics + MD): speedup and higher particle numbers through SIONlib integration
  [Plot: MP2C, time (s) versus Mio. particles; legend: k tasks, write; 16k tasks, write (SION); 1k tasks, read; 16k tasks, read (SION)]
  Tools/Projects
    Scalasca: performance analysis
      [Figure labels: instrumented application, local event traces, parallel analysis, global analysis result]
    Score-P: Scalable Performance Measurement Infrastructure for Parallel Codes
    DEEP-ER: adaptation to new platform and parallelization paradigm

22 Are there more Bottlenecks?
  Increasing #tasks further
  Bottleneck: file metadata management by the first GPFS client that opened the file
    [Figure labels: I/O client, file i-node, indirect blocks, FS blocks]
  [Plot: JUGENE, bandwidth per ION; comparison of individual files (POSIX), one file per ION (SION) and one shared file (POSIX)]
  [Figure: e.g. IBM BG/Q I/O infrastructure; processes P1 ... Pn per I/O node, I/O nodes 1 ... m, SIONlib, parallel file system]

23 SIONlib: Multiple Underlying Physical Files
  Parallelization of file metadata handling using multiple physical files
  Mapping files : tasks
    1 : n
    p : n
    n : n
  IBM Blue Gene: one file per I/O node (locality)
  [Figure: tasks t1 ... tn/2 map to shared file 1, tasks tn/2+1 ... tn map to shared file 2; each shared file has its own metablock 1 and metablock 2]
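A p : n mapping like the one in the figure can be written down in a few lines. This is a minimal sketch assuming contiguous, evenly sized groups of tasks per file; the function names are hypothetical, and SIONlib itself may choose the mapping differently, for example by I/O-node locality on Blue Gene.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the physical file used by `task` when ntasks tasks share nfiles files. */
static size_t file_of_task(size_t task, size_t ntasks, size_t nfiles)
{
    assert(nfiles > 0 && nfiles <= ntasks && task < ntasks);
    return (task * nfiles) / ntasks;
}

/* Lowest task number that is mapped to physical file `file`. */
static size_t first_task_of_file(size_t file, size_t ntasks, size_t nfiles)
{
    assert(nfiles > 0 && nfiles <= ntasks && file < nfiles);
    return (file * ntasks + nfiles - 1) / nfiles;
}
```

Setting nfiles to 1 gives the 1 : n case (one shared file), and setting nfiles equal to ntasks gives n : n (one file per task). A task's rank inside its file is its task number minus first_task_of_file() for that file.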

24 SIONlib: Scaling to Large # of Tasks
  [Plot: JUGENE, total bandwidth (write), one file per I/O node (ION), varying the number of tasks doing the I/O; label: I/O nodes]
  Preliminary tests on JUQUEEN with up to 1.8 million tasks
  [Plot: JUQUEEN, total bandwidth (write/read), one file per I/O bridge (IOB); old (Just3) vs. new (Just4) GPFS file system]

25 Other Pitfalls: Frequent Flushing on Small Blocks
  Modern file systems in HPC have large file system blocks
  A flush on a file handle forces the file system to perform all pending write operations
  If an application writes small data blocks and flushes after each one, the same file system block has to be read and written multiple times
  Performance degrades because several write calls can no longer be combined into one
  [Figure label: flush()]
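The remedy is to let small writes accumulate before they reach the file system. The sketch below uses only standard C stdio. The function name is hypothetical, and the buffer size passed in is an assumption that would normally be chosen to match the file system block size.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write n records of rec_size bytes each through a large stdio buffer,
 * without flushing in between. Returns 0 on success, -1 on error.
 */
static int write_records_buffered(const char *path, const void *rec,
                                  size_t rec_size, size_t n, size_t buf_size)
{
    assert(path != NULL && rec != NULL);
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;
    /* setvbuf must be called before any other operation on the stream. */
    if (setvbuf(fp, NULL, _IOFBF, buf_size) != 0) {
        fclose(fp);
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        if (fwrite(rec, 1, rec_size, fp) != rec_size) {
            fclose(fp);
            return -1;
        }
        /* Deliberately no fflush(fp) here: stdio combines the small writes. */
    }
    /* fclose flushes whatever is still buffered. */
    return fclose(fp) == 0 ? 0 : -1;
}
```

Calling fflush(fp) inside the loop would produce the same file, but every small record would be pushed to the file system on its own, which is the pattern this slide warns against.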

26 Other Pitfalls: Portability
  Endianness (byte order) of binary data
  Example (32 bit): [Table: byte stored at each address, little endian vs. big endian; values lost in transcription]
  Conversion of files might be necessary and expensive
  Solution: choose a portable data format (HDF5, NetCDF)

27 Darshan I/O Characterization
  Darshan: scalable HPC I/O characterization tool (ANL), version 2.2.8
  Profiling of I/O calls (POSIX, MPI-I/O, ...) during runtime
  Instrumentation
    Dynamically linked binaries: LD_PRELOAD=<path>/libdarshan.so
    Static binaries: wrapper for compiler calls
  Log files: <uid><binname><jobid><ts>.darshan.gz
    Path: set by environment variable DARSHANLOGDIR, e.g.
      mpirun -x DARSHANLOGDIR=$HOME/darshanlog
  Reports: PDF file or text files
    Extract information:
      darshan-parser <logfile> > ~/job-characterization.txt
    Generate a PDF report from the log file (output: PDF file):
      darshan-job-summary.pl <logfile>

28 Darshan on MareNostrum-III
  Installation directory:
      DARSHANDIR=/gpfs/projects/nct00/nct00001/UNITE/packages/darshan/2.2.8-intel-openmpi
  Program start:
      mpirun -x DARSHANLOGDIR=${HOME}/darshanlog \
             -x LD_PRELOAD=${DARSHANDIR}/lib/libdarshan.so
  Parser:
      $DARSHANDIR/bin/darshan-parser <logfile>
    Output format: see documentation
  Generate PDF report on a local system (needs pdflatex):
      $LOCALDARSHANDIR/bin/darshan-job-summary.pl <logfile>

29 How to choose an I/O strategy?
  Performance considerations
    Amount of data
    Frequency of reading/writing
    Scalability
  Portability
    Different HPC architectures
    Data exchange with others
    Long-term storage
  E.g. use two formats and converters:
    Internal: write/read data as-is
      Restart/checkpoint files
    External: write/read data in non-decomposed format (portable, system-independent, self-describing)
      Workflows, pre- and post-processing, data exchange, ...

30 Questions?
  Thank you!