Parallel file I/O bottlenecks and solutions




Parallel file I/O bottlenecks and solutions
Views to parallel I/O: hardware, software, application
Challenges at large scale
Introduction to SIONlib
Pitfalls, Darshan, I/O strategies
Wolfgang Frings, W.Frings@fz-juelich.de
Jülich Supercomputing Centre
13th VI-HPS Tuning Workshop, BSC, Barcelona 2014

Overview
Parallel I/O from different views:
  Hardware: example IBM BG/Q I/O infrastructure
  System software: IBM GPFS, I/O forwarding
  Application: parallel I/O libraries
Pitfalls: small blocks, I/O to individual files, false sharing,
  tasks per shared file, portability
SIONlib overview
I/O characterization with Darshan
I/O strategies

IBM Blue Gene/Q (JUQUEEN) & I/O
IBM Blue Gene/Q JUQUEEN:
  IBM PowerPC A2, 1.6 GHz, 16 cores per node
  28 racks (7 rows of 4 racks), 28,672 nodes (458,752 cores)
  5D torus network
  5.9 Pflop/s peak, 5.0 Pflop/s Linpack
  Main memory: 448 TB
  I/O nodes: 248 (27x8 + 1x32)
Network: 2x Cisco Nexus 7018 switches (connect the I/O nodes),
  512 10-Gigabit-Ethernet ports in total

Blue Gene/Q: I/O-node cabling (8 I/O nodes per rack)
[Figure: compute racks attach via the internal torus network to the I/O
nodes, which connect to the 10 GigE network; source: IBM, 2012]

I/O-Network & File Server (JUST) JUQUEEN JUST4 18 Storage Controller (16 x DCS3700, 2 x DS3512) 20 GPFS NSD-Server x3560 2 CISCO Nexus 700 512 Ports (10GigE) JUST4-GSS 8 TSM Server p720 13th VI-HPS Tuning Workshop. BSC, Barcelona 2014, Parallel I/O, W.Frings 5

Software view to parallel I/O: GPFS architecture and I/O data path
[Figure: GPFS software stack and I/O data path; source: IBM, 2012]
  O(100...1000) compute nodes: application on top of the GPFS NSD client
    (I/O size, parallelism, pagepool for streams, prefetch threads)
  Network (TCP/IP or IB): network transfer size
  O(10...100) GPFS NSD servers: pagepool for disks, NSD workers
  SAN with adapter and disk device drivers (hdisk dd, adapter dd)
  Storage subsystem: controllers A/B and disks

Software view to parallel I/O: GPFS on IBM Blue Gene/Q (I)
[Figure: I/O forwarding in the Blue Gene/Q software stack; source: IBM, 2012]

Software view to parallel I/O: GPFS on IBM Blue Gene/Q (II)
[Figure: POSIX I/O calls of the parallel application are forwarded to the
I/O nodes, which perform POSIX I/O against the parallel file system;
source: IBM, 2012]

Application view to parallel I/O
[Figure: I/O software stack seen by the application]
  Parallel application
  I/O libraries: HDF5, NetCDF, MPI-I/O, SIONlib
  POSIX I/O (shared or task-local files)
  Parallel file system
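As a minimal sketch of the MPI-I/O layer in this stack (not from the
slides; the file name "data.mpiio" and the block size are illustrative
assumptions), each task writes its own contiguous block of one shared file:

  /* Minimal MPI-I/O sketch: every rank writes NDBL doubles to a disjoint
   * region of one shared file. File name and block size are illustrative. */
  #include <mpi.h>

  #define NDBL 1024                     /* doubles per task (illustrative) */

  int main(int argc, char **argv)
  {
      int rank, i;
      double buf[NDBL];
      MPI_File fh;
      MPI_Offset offset;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      for (i = 0; i < NDBL; i++) buf[i] = (double)rank;

      /* collective open of one shared file */
      MPI_File_open(MPI_COMM_WORLD, "data.mpiio",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

      /* every task writes at its own, non-overlapping offset */
      offset = (MPI_Offset)rank * NDBL * sizeof(double);
      MPI_File_write_at_all(fh, offset, buf, NDBL, MPI_DOUBLE,
                            MPI_STATUS_IGNORE);

      MPI_File_close(&fh);
      MPI_Finalize();
      return 0;
  }

Collective calls such as MPI_File_write_at_all give the MPI library the
chance to aggregate the per-task blocks into large, well-formed file system
requests.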

Application view: data formats
[Figure: application data formats]

Application view: data distribution
[Figure: software view vs. data view]
  Software view: parallel application -> HDF5, NetCDF, MPI-I/O, POSIX I/O
    -> parallel file system
  Data view: distributed local view -> transformation -> shared global view
The transformation can be performed by the I/O library or afterwards by a
post-processing convert utility.
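As a hedged sketch of such a transformation in the software view (not from
the slides; the routine and file names are made up), an MPI-I/O file view
lets each task write its local block of a 2D decomposition directly into
the shared global layout:

  /* Sketch: write a 2D block decomposition in global (row-major) order via
   * an MPI subarray file view, so the local-to-global transformation is
   * done by the MPI-I/O layer instead of a post-processing convert step. */
  #include <mpi.h>

  void write_global_view(MPI_Comm comm, double *block,
                         int gsize[2],   /* global array size            */
                         int lsize[2],   /* size of the local block      */
                         int start[2])   /* global start of local block  */
  {
      MPI_File fh;
      MPI_Datatype filetype;

      MPI_Type_create_subarray(2, gsize, lsize, start,
                               MPI_ORDER_C, MPI_DOUBLE, &filetype);
      MPI_Type_commit(&filetype);

      MPI_File_open(comm, "field2d.dat",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
      MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native",
                        MPI_INFO_NULL);
      MPI_File_write_all(fh, block, lsize[0] * lsize[1], MPI_DOUBLE,
                         MPI_STATUS_IGNORE);
      MPI_File_close(&fh);

      MPI_Type_free(&filetype);
  }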

Parallel task-local I/O at large scale
Usage fields:
  checkpoint and restart files
  result files, post-processing
  parallel performance tools
Data types:
  simulation data (domain decomposition)
  trace data (parallel performance tools)
Bottlenecks: file creation and file management, #files: O(10^5)
  (tasks t1..tn each write ./checkpoint/file.0001 ... ./checkpoint/file.nnnn)

The showstopper for task-local I/O: parallel creation of individual files
[Figure: tasks creating their files produce directory entries and i-nodes
spread over many file system blocks]
JUGENE + GPFS, file create+open: > 33 minutes with one file per task
versus < 10 seconds with one file per I/O node.

SIONlib: shared files for task-local data
In the I/O software stack, SIONlib sits next to HDF5, NetCDF and MPI-I/O,
between the parallel application and POSIX I/O on the parallel file system.
[Figure: application tasks t1..tn write logical task-local files
(./checkpoint/file.0001 ... ./checkpoint/file.nnnn); SIONlib stores them in
a physical multi-file on the parallel file system, #files: O(10); a serial
program can access the data through the same interface]

The showstopper for shared-file I/O: concurrent access and contention
[Figure: tasks t1 and t2 writing into the same file system block trigger
file system block locking and serialize their accesses]
SIONlib: logical partitioning of the shared file
  dedicated data chunks per task
  chunks aligned to file system block boundaries -> no contention
[Figure: shared-file layout with metablock 1, per-task chunks (with gaps up
to the next block boundary), and metablock 2]
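As a rough illustration of the alignment idea (these helper functions are
illustrative, not SIONlib source code):

  /* Sketch of block-aligned chunk placement in a shared file. */
  #include <stddef.h>

  /* round a requested chunk size up to a multiple of the FS block size */
  static size_t align_up(size_t nbytes, size_t fsblksize)
  {
      return ((nbytes + fsblksize - 1) / fsblksize) * fsblksize;
  }

  /* start offset of task 'rank' if every task gets the same aligned chunk
   * and the metablock occupies the first file system block */
  static size_t chunk_offset(int rank, size_t chunksize, size_t fsblksize)
  {
      return fsblksize + (size_t)rank * align_up(chunksize, fsblksize);
  }

The unused bytes between a task's data and the next block boundary are the
gaps shown in the figure; they trade some disk space for contention-free
writes.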

SIONlib: architecture & example
[Figure: the application uses the SION OpenMP, hybrid, or MPI API; these
build on a parallel generic API and a serial API, which reach ANSI C or
POSIX I/O through callbacks]
Extension of the I/O API (ANSI C or POSIX)
C and Fortran bindings, implementation language C
Current version: 1.4p3
Open source license: http://www.fz-juelich.de/jsc/sionlib

  /* instead of fopen() */
  sid = sion_paropen_mpi(filename, "bw", &numfiles, &chunksize,
                         gcom, &lcom, &fileptr, ...);
  /* instead of fwrite(bindata, 1, nbytes, fileptr) */
  sion_fwrite(bindata, 1, nbytes, sid);
  /* instead of fclose() */
  sion_parclose_mpi(sid);

SIONlib in a nutshell: task-local I/O
Original ANSI C version: no collective operations, no shared files,
the data is a stream of bytes.

  /* Open */
  sprintf(tmpfn, "%s.%06d", filename, my_nr);
  fileptr = fopen(tmpfn, "wb");
  ...
  /* Write */
  fwrite(bindata, 1, nbytes, fileptr);
  ...
  /* Close */
  fclose(fileptr);

SIONlib in a nutshell: add SIONlib calls
Collective (SIONlib) open and close; ready to run parallel I/O to one
shared file.

  /* Collective open */
  nfiles = 1; chunksize = nbytes;
  sid = sion_paropen_mpi(filename, "bw", &nfiles, &chunksize,
                         MPI_COMM_WORLD, &lcomm, &fileptr, ...);
  ...
  /* Write */
  fwrite(bindata, 1, nbytes, fileptr);
  ...
  /* Collective close */
  sion_parclose_mpi(sid);

SIONlib in a nutshell: variable data size
Writing more data than declared at the open call: SIONlib moves on to the
next chunk if the data is too large for the current one.

  /* Collective open */
  nfiles = 1; chunksize = nbytes;
  sid = sion_paropen_mpi(filename, "bw", &nfiles, &chunksize,
                         MPI_COMM_WORLD, &lcomm, &fileptr, ...);
  ...
  /* Write */
  if (sion_ensure_free_space(sid, nbytes)) {
      fwrite(bindata, 1, nbytes, fileptr);
  }
  ...
  /* Collective close */
  sion_parclose_mpi(sid);

SIONlib in a nutshell: wrapper function
sion_fwrite includes the check for free space in the current chunk; the
fileptr parameter of fwrite is replaced by the sid.

  /* Collective open */
  nfiles = 1; chunksize = nbytes;
  sid = sion_paropen_mpi(filename, "bw", &nfiles, &chunksize,
                         MPI_COMM_WORLD, &lcomm, &fileptr, ...);
  ...
  /* Write */
  sion_fwrite(bindata, 1, nbytes, sid);
  ...
  /* Collective close */
  sion_parclose_mpi(sid);

SIONlib: applications
Applications:
  DUNE-ISTL (multigrid solver, Univ. Heidelberg)
  ITM (fusion community)
  LBM (fluid flow/mass transport, Univ. Marburg)
  PSC (particle-in-cell code), OSIRIS (fully explicit particle-in-cell code)
  PEPC (Pretty Efficient Parallel Coulomb Solver)
  ProFASi (protein folding and aggregation simulator)
  NEST (human brain simulation)
  MP2C (mesoscopic hydrodynamics + MD): speedup and higher particle numbers
    through the SIONlib integration
[Plot: MP2C I/O time in seconds versus millions of particles, with curves
for 1k tasks (write/read) and 16k tasks (write/read, SIONlib)]
Tools/projects:
  Scalasca (performance analysis): the instrumented application writes
    local event traces, the parallel analysis produces a global analysis result
  Score-P (Scalable Performance Measurement Infrastructure for Parallel Codes)
  DEEP-ER: adaptation to a new platform and parallelization paradigm

Are there more bottlenecks?
Increasing the number of tasks further, the next bottleneck is file
metadata management (file i-node, indirect blocks, FS blocks), which is
handled by the first GPFS client that opened the file.
[Plot: JUGENE bandwidth per I/O node, comparing individual files (POSIX),
one file per I/O node (SIONlib) and one shared file (POSIX)]
[Figure: with >> 100,000 tasks, SIONlib maps tasks to files along the I/O
infrastructure, e.g. the IBM BG/Q I/O nodes, each serving processes P1..Pn]

SIONlib: multiple underlying physical files
Parallelization of the file metadata handling by using multiple physical
files.
Mapping of files to tasks: 1 : n, p : n, or n : n.
IBM Blue Gene: one file per I/O node (locality).
[Figure: tasks t1..t(n/2) and t(n/2+1)..tn are mapped from one logical
shared file onto two physical shared files, each with its own metablocks]
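As a rough illustration of a p : n mapping (block mapping is just one
possibility and this is not the SIONlib mapping code), a task's physical
file could be chosen like this:

  /* Illustrative block mapping of n tasks onto p physical files,
   * e.g. one file per I/O node. */
  static int file_index(int rank, int ntasks, int nfiles)
  {
      int tasks_per_file = (ntasks + nfiles - 1) / nfiles; /* rounded up */
      return rank / tasks_per_file;
  }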

SIONlib: scaling to a large number of tasks
[Plot: JUGENE total write bandwidth, one file per I/O node (ION), varying
the number of tasks doing the I/O (96-128 I/O nodes)]
Preliminary tests on JUQUEEN with up to 1.8 million tasks.
[Plot: JUQUEEN total write/read bandwidth, one file per I/O bridge (IOB),
old (JUST3) versus new (JUST4) GPFS file system]

Other pitfalls: frequent flushing on small blocks
Modern HPC file systems have large file system blocks.
A flush on a file handle forces the file system to perform all pending
write operations.
If the application writes the same file system block in many small pieces,
that block has to be read and written multiple times.
Performance degrades because the file system cannot combine the many small
write calls into few large ones.
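A common mitigation is to collect small records in a large write buffer and
hand them to the file system in few big writes; the sketch below is an
illustration with a made-up buffer size, not from the slides (enlarging the
stdio buffer with setvbuf has a similar effect):

  /* Sketch: combine many small records into large writes and flush only
   * once at the end, not after every record. */
  #include <stdio.h>
  #include <string.h>

  #define WBUFSIZE (4u * 1024u * 1024u)   /* >= one file system block */

  static char   wbuf[WBUFSIZE];
  static size_t wfill = 0;

  static void buffered_write(FILE *fp, const void *data, size_t nbytes)
  {
      if (wfill + nbytes > WBUFSIZE) {    /* buffer full: one large write */
          fwrite(wbuf, 1, wfill, fp);
          wfill = 0;
      }
      if (nbytes >= WBUFSIZE) {           /* huge record: write directly  */
          fwrite(data, 1, nbytes, fp);
          return;
      }
      memcpy(wbuf + wfill, data, nbytes); /* small record: just collect   */
      wfill += nbytes;
  }

  static void buffered_close(FILE *fp)    /* drain the buffer and close   */
  {
      if (wfill > 0) fwrite(wbuf, 1, wfill, fp);
      wfill = 0;
      fclose(fp);
  }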

Other pitfalls: portability
Endianness (byte order) of binary data.
Example (32 bit): 2,712,847,316 = 10100001 10110010 11000011 11010100 (0xA1B2C3D4)

  Address   Little endian   Big endian
  1000      11010100        10100001
  1001      11000011        10110010
  1002      10110010        11000011
  1003      10100001        11010100

Conversion of files might be necessary and expensive.
Solution: choose a portable data format (HDF5, NetCDF).
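If raw binary data must be exchanged anyway, the byte order has to be
detected and converted by hand; a minimal sketch (portable data formats
such as HDF5 and NetCDF handle this internally):

  /* Sketch: detect the host byte order and swap 32-bit words when reading
   * data written with the other endianness. */
  #include <stdint.h>

  static int is_little_endian(void)
  {
      const uint32_t probe = 1;
      return *(const uint8_t *)&probe == 1;
  }

  static uint32_t byteswap32(uint32_t v)
  {
      return ((v & 0x000000FFu) << 24) |
             ((v & 0x0000FF00u) <<  8) |
             ((v & 0x00FF0000u) >>  8) |
             ((v & 0xFF000000u) >> 24);
  }

For the example above, 0xA1B2C3D4 written on a big-endian machine reads
back as 0xD4C3B2A1 on a little-endian machine unless it is swapped.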

Darshan I/O characterization
Darshan: scalable HPC I/O characterization tool (ANL),
http://www.mcs.anl.gov/darshan (version 2.2.8)
Profiling of I/O calls (POSIX, MPI-I/O, ...) during runtime.
Instrumentation:
  dynamically linked binaries: LD_PRELOAD=<path>/libdarshan.so
  static binaries: wrappers for the compiler calls
Log files: <uid><binname><jobid><ts>.darshan.gz
  path set by the environment variable DARSHANLOGDIR,
  e.g. mpirun -x DARSHANLOGDIR=$HOME/darshanlog
Reports (PDF or text files):
  extract information: darshan-parser <logfile> > ~/job-characterization.txt
  generate a PDF report from a log file: darshan-job-summary.pl <logfile>

Darshan on MareNostrum-III
Installation directory:
  DARSHANDIR=/gpfs/projects/nct00/nct00001/UNITE/packages/darshan/2.2.8-intel-openmpi
Program start:
  mpirun -x DARSHANLOGDIR=${HOME}/darshanlog \
         -x LD_PRELOAD=${DARSHANDIR}/lib/libdarshan.so
Parser:
  $DARSHANDIR/bin/darshan-parser <logfile>
  output format: see the documentation at
  http://www.mcs.anl.gov/research/projects/darshan/docs/darshan-util.html
Generate a PDF report on the local system (needs pdflatex):
  $LOCALDARSHANDIR/bin/darshan-job-summary.pl <logfile>

How to choose an I/O strategy?
Performance considerations:
  amount of data
  frequency of reading/writing
  scalability
Portability:
  different HPC architectures
  data exchange with others
  long-term storage
For example, use two formats and converters:
  Internal: write/read data as-is (restart/checkpoint files)
  External: write/read data in a non-decomposed format (portable,
    system-independent, self-describing) for workflows, pre- and
    post-processing, data exchange, ...

Questions?
[Figure: application tasks t1..tn write logical task-local files, which
SIONlib stores as a physical multi-file on the parallel file system; a
serial program can access the same data]
Thank you!