Parallel I/O on JUQUEEN




3 February 2015, 3rd JUQUEEN Porting and Tuning Workshop
Sebastian Lührs, Kay Thust (s.luehrs@fz-juelich.de, k.thust@fz-juelich.de)
Jülich Supercomputing Centre

Overview
- Blue Gene/Q I/O: hardware overview, cabling, I/O services
- GPFS architecture and data path
- Pitfalls
- The parallel I/O software stack
- Darshan: overview, usage, interpretation
- SIONlib
- Parallel HDF5
- I/O hints

Blue Gene/Q Packaging Hierarchy: I/O Nodes
[Figure: Blue Gene/Q packaging hierarchy (IBM 2012)]
- JUQUEEN I/O nodes: 248 (27 racks x 8 + 1 rack x 32)

Blue Gene/Q: I/O-Node Cabling (8 IONs per Rack)
[Figure: I/O-node cabling diagram (IBM 2012)]

Blue Gene/Q: I/O Services
- Function shipping of system calls to the I/O node
- Supports NFS, GPFS, Lustre and PVFS2 file systems
- Supports a ratio of up to 8192:1 compute tasks per I/O node
- Only 1 I/O proxy per compute node
- Standard communication protocol: OFED verbs
- Uses the torus DMA hardware for performance
(IBM 2012)

IBM General Parallel File System (GPFS): Architecture and I/O Data Path on BG/Q
[Figure: GPFS architecture and I/O data path on BG/Q (IBM 2012)]

Pitfall 1: Frequent Flushing on Small Blocks
- Modern HPC file systems use large file system blocks
- A flush on a file handle forces the file system to perform all pending write operations
- If an application writes small data blocks and flushes frequently, the same file system block has to be read and written multiple times
- Performance degrades because the file system can no longer combine several write calls
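To avoid this pitfall, small records can be aggregated in a user-space buffer and written out in file-system-block-sized chunks, so that each block reaches the file system only once. A minimal sketch, assuming an illustrative 4 MiB block size (the buffering scheme is not part of the slides):

    #include <stdio.h>
    #include <string.h>

    #define FS_BLOCK (4 * 1024 * 1024)   /* assumed file system block size */

    static char buf[FS_BLOCK];
    static size_t fill = 0;

    /* Collect small records; hit the file system only with full blocks. */
    static void buffered_write(FILE *fp, const void *data, size_t n)
    {
        while (n > 0) {
            size_t chunk = n < FS_BLOCK - fill ? n : FS_BLOCK - fill;
            memcpy(buf + fill, data, chunk);
            fill += chunk;
            data = (const char *)data + chunk;
            n -= chunk;
            if (fill == FS_BLOCK) {       /* one aligned write per block */
                fwrite(buf, 1, FS_BLOCK, fp);
                fill = 0;
            }
        }
    }

    /* Call once at the end instead of flushing after every record. */
    static void buffered_flush(FILE *fp)
    {
        if (fill > 0) {
            fwrite(buf, 1, fill, fp);
            fill = 0;
        }
        fflush(fp);
    }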

Pitfall 2: Parallel Creation of Individual Files
- Measured on Jugene (BG/P) with GPFS: file create+open, one file per task versus one file per I/O node
- Contention at the node performing the directory updates (directory meta-node)
- Pre-created files or an own directory per task may help performance, but do not simplify file handling
- Individual files complicate file management (e.g. archiving); shared files are mandatory

Pitfall 3: False Sharing of File System Blocks
- Occurs with parallel I/O to shared files (POSIX)
- The data blocks of individual processes do not fill up complete file system blocks, so several processes share one file system block
- Exclusive access (e.g. write) to a shared block must be serialized by locking
- The more processes have to synchronize, the more waiting time propagates
[Figure: two tasks locking the same file system block at times t1 and t2]
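One common countermeasure is to pad every task's region of a shared file to a file system block boundary, so that no two tasks ever touch the same block. A sketch under illustrative assumptions (a 4 MiB block size, a fixed-size chunk per task, MPI-IO as the access method; none of these values are JUQUEEN-specific):

    #include <mpi.h>

    #define FS_BLOCK 4194304              /* assumed file system block size */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char data[65536] = { 0 };         /* this task's payload */
        MPI_Offset chunk = sizeof(data);
        /* Pad every task's region to a block boundary: no shared blocks. */
        MPI_Offset padded = ((chunk + FS_BLOCK - 1) / FS_BLOCK) * FS_BLOCK;
        MPI_Offset offset = (MPI_Offset)rank * padded;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at(fh, offset, data, (int)chunk, MPI_BYTE,
                          MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }

The padding trades some file space for lock-free writes; SIONlib applies the same block-alignment idea automatically (see the SIONlib slide below).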

Pitfall 4: Number of Tasks per Shared File
- Meta-data wall on the file level: file meta-data management (file i-node, indirect blocks) and locking in the GPFS client on the I/O node
- Example: Blue Gene/P Jugene (72 racks), I/O forwarding nodes (IONs) with the GPFS client running on the ION
[Figure: measurements for task-to-file ratios of 512:1 (or 1:1), 4096:1 and 16384:1]
- Solution: keep the tasks : files ratio roughly constant
- SIONlib: one file per ION with an implicit task-to-file mapping

Pitfall 5: Portability
- Endianness (byte order) of binary data differs between architectures
- Example (32 bit): 2,712,847,316 = 10100001 10110010 11000011 11010100

      Address | Little endian | Big endian
      1000    | 11010100      | 10100001
      1001    | 11000011      | 10110010
      1002    | 10110010      | 11000011
      1003    | 10100001      | 11010100

- Conversion of files might be necessary and expensive
- Solution: choose a portable data format (HDF5, NetCDF)
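For illustration, a minimal C sketch that detects the host byte order and swaps the slide's 32-bit example value by hand (production codes would rather use a portable format such as HDF5 or NetCDF, as the slide recommends):

    #include <stdint.h>
    #include <stdio.h>

    /* Returns 1 on a little-endian host, 0 on a big-endian host. */
    static int is_little_endian(void)
    {
        uint32_t probe = 1;
        return *(const uint8_t *)&probe == 1;
    }

    /* Reverse the byte order of a 32-bit word. */
    static uint32_t bswap32(uint32_t v)
    {
        return (v >> 24) | ((v >> 8) & 0x0000FF00u)
             | ((v << 8) & 0x00FF0000u) | (v << 24);
    }

    int main(void)
    {
        uint32_t x = 2712847316u;   /* 0xA1B2C3D4, the slide's example */
        printf("host is %s endian\n", is_little_endian() ? "little" : "big");
        printf("0x%08X swapped: 0x%08X\n", x, bswap32(x));
        return 0;
    }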

The Parallel I/O Software Stack (top to bottom)
- Parallel application
- High-level libraries: P-HDF5, NetCDF-4, PNetCDF (shared file); SIONlib (task-local files)
- MPI-I/O
- POSIX I/O
- Parallel file system

How to Choose an I/O Strategy?
- Performance considerations: amount of data, frequency of reading/writing, scalability
- Portability: different HPC architectures, data exchange with others, long-term storage
- E.g. use two formats and converters:
  - Internal format: write/read data as-is (restart/checkpoint files)
  - External format: write/read data in a non-decomposed, portable, system-independent, self-describing format (workflows, pre-/postprocessing, data exchange, ...)

Darshan: Overview
- HPC I/O characterization tool developed at ANL: http://www.mcs.anl.gov/darshan
- Instruments I/O by linking against modified versions of the I/O libraries
- No code changes needed, only recompilation with the wrapped compiler
- Helps analyse I/O patterns and identify bottlenecks
- Analyses parallel I/O with MPI-IO as well as standard POSIX I/O
- Ready-to-use version available on JUQUEEN

Darshan: Usage Example on JUQUEEN

Load the module and recompile:

    module load darshan
    make MPICC=mpicc.darshan

Tell runjob where to save the output (in the submit script):

    runjob ... --envs DARSHAN_LOG_PATH=$HOME/darshanlogs

Analyse the output:

    darshan-job-summary.pl mylog.darshan.gz
    evince mylog.pdf

Darshan: Interpreting the Summary
- Average and statistical information on I/O patterns
- Relative time spent in I/O
- Most common access sizes
- Additional metrics: file count, I/O size histogram
- Timeline of reads/writes per task

SIONlib: Overview
- Shared-file I/O with automatic file system block alignment
- One single or few large physical files (e.g. one per ION), seen by the application as logical task-local files
- Only the open and close calls are collective
- Support for file coalescing, for data sizes much smaller than the file system block size
- http://www.fz-juelich.de/jsc/sionlib
- Drop-in replacements for the ANSI C file calls:

    /* fopen() */
    sid = sion_paropen_mpi(filename, "bw", &numfiles, &chunksize,
                           gcom, &lcom, &fileptr, ...);
    /* fwrite(bindata, 1, nbytes, fileptr) */
    sion_fwrite(bindata, 1, nbytes, sid);
    /* fclose() */
    sion_parclose_mpi(sid);
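Assembled into a complete program, the calls above might be used as in the following sketch. The full sion_paropen_mpi signature shown here follows the SIONlib documentation and should be verified against the version installed on the system; file name, payload size and the fsblksize convention are illustrative assumptions:

    #include <mpi.h>
    #include <stdio.h>
    #include "sion.h"   /* from the sionlib module */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char        bindata[1024];           /* this task's payload */
        int         numfiles   = 1;          /* physical files, e.g. one per ION */
        sion_int64  chunksize  = sizeof(bindata);
        sion_int32  fsblksize  = -1;         /* assumed: let SIONlib choose */
        int         globalrank = rank;
        FILE       *fileptr    = NULL;
        char       *newfname   = NULL;
        MPI_Comm    lcom       = MPI_COMM_WORLD;

        /* Collective open: all tasks share one physical file. */
        int sid = sion_paropen_mpi("data.sion", "bw", &numfiles,
                                   MPI_COMM_WORLD, &lcom, &chunksize,
                                   &fsblksize, &globalrank,
                                   &fileptr, &newfname);

        sion_fwrite(bindata, 1, sizeof(bindata), sid);  /* task-local write */

        sion_parclose_mpi(sid);                         /* collective close */
        MPI_Finalize();
        return 0;
    }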

Parallel HDF5: Overview
- Self-describing file format
- Hierarchical, file-system-like data layout
- Support for additional metadata on datasets
- Portable; builds on top of standard MPI-I/O or POSIX I/O
- Move between system architectures: automatic conversion of the data representation
- Move between parallel and serial applications, e.g. 4096 MPI processes -> 65,536 MPI processes, or simulation -> post-processing
- Complex data selections possible for read or write, e.g. read only a sub-array of a dataset with a stride
(slide provided by Jens Henrik Göbbert)

Parallel HDF5: I/O Hints
- Align datasets to the file system block size: H5Pset_alignment(...)
- Use hyperslabs: H5Sselect_hyperslab(...)
- Chunk datasets: H5Pset_cache(...), H5Pset_chunk_cache(...)
- Increase the transfer buffer: H5Pset_buffer(...)
- Improve metadata handling: H5Pset_istore_k(...), H5Pset_meta_block_size(...)
(slide provided by Jens Henrik Göbbert)
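A sketch of the first two hints: a file opened collectively through the MPI-IO driver, objects aligned to an assumed 4 MiB file system block, and each task writing its slice of a global dataset via a hyperslab. File name, sizes and the 1-D decomposition are illustrative; the collective transfer property is added via the standard H5Pset_dxpl_mpio call:

    #include <hdf5.h>
    #include <mpi.h>

    #define N 1024                       /* elements per task (illustrative) */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* File access: MPI-IO driver, objects aligned to 4 MiB blocks. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        H5Pset_alignment(fapl, 0, 4194304);
        hid_t file = H5Fcreate("out.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* One global 1-D dataset; each task owns a contiguous slice. */
        hsize_t gdim = (hsize_t)nprocs * N, ldim = N;
        hsize_t start = (hsize_t)rank * N;
        hid_t fspace = H5Screate_simple(1, &gdim, NULL);
        hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, fspace,
                               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* Select this task's hyperslab and write collectively. */
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &start, NULL, &ldim, NULL);
        hid_t mspace = H5Screate_simple(1, &ldim, NULL);
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

        double buf[N];
        for (int i = 0; i < N; i++) buf[i] = rank;
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);

        H5Pclose(dxpl); H5Sclose(mspace); H5Dclose(dset);
        H5Sclose(fspace); H5Fclose(file); H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }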

I/O Hints

Reduce locking for MPI-IO:

    export BGLOCKLESSMPIO_F_TYPE="0x47504653"

ROMIO hints:

    echo "IBM_largeblock_io true" > romio.hints
    echo "cb_buffer_size 33554432" >> romio.hints
    export ROMIO_HINTS=romio.hints

Note: don't forget to pass the exports to runjob via --exp-env X.

MPIX calls for splitting communicators per I/O bridge:

    FORTRAN: MPIX_PSET_SAME_COMM_CREATE(INTEGER pset_comm_same, INTEGER ierr)
    C:       #include <mpix.h>
             int MPIX_Pset_same_comm_create(MPI_Comm *pset_comm_same)

The resulting communicator can serve as the local communicator for SIONlib:

    sid = sion_paropen_mpi(filename, "bw", &numfiles, &chunksize,
                           gcom, &lcom, ...);
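As an alternative to the ROMIO_HINTS file, the same hints can be attached programmatically through an MPI_Info object when the file is opened. A minimal sketch with the two hint values from above (file name and access mode are illustrative):

    #include <mpi.h>

    /* Open a shared file with the ROMIO hints from this slide attached. */
    static MPI_File open_with_hints(const char *path)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "IBM_largeblock_io", "true");
        MPI_Info_set(info, "cb_buffer_size", "33554432");

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, path,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
        MPI_Info_free(&info);   /* hints are captured by the file handle */
        return fh;
    }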