Parallel I/O on Mira Venkat Vishwanath and Kevin Harms


Parallel I/O on Mira
Venkat Vishwanath and Kevin Harms
Argonne National Laboratory
venkat@anl.gov

ALCF-2 I/O Infrastructure

- Mira BG/Q compute resource: 48K nodes, 768K cores, 10 PFlops
- 384 I/O nodes, fed by 768 links at 2 GB/s each (1536 GB/s aggregate)
- QDR InfiniBand switch complex: 900+ ports
- Tukey analysis cluster: 96 nodes, 192 GPUs, 6 TB memory
- 128 DDN VM file servers in front of a 26 PB, ~250 GB/s storage system
- The storage system consists of 16 DDN SFA12KE couplets, each couplet with 560 x 3 TB HDDs, 32 x 200 GB SSDs, and 8 virtual machines (VMs) with two QDR IB ports per VM

[Diagram link rates: compute to IONs 768 x 2 GB/s (1536 GB/s), IONs to switch complex 1536 GB/s, switch complex to file servers 384 GB/s and 512-1024 GB/s]

I/O Subsystem of BG/Q: Simplified View

- For every 128 compute nodes (a 2x2x4x4x2 block), two nodes are designated as bridge nodes
- Each bridge node has an extra (11th) link, at 2 GB/s, to the I/O node (ION), giving 4 GB/s aggregate per ION; spread across 128 compute nodes, that is roughly 32 MB/s of I/O bandwidth per node
- On the ION, a QDR InfiniBand adapter is connected on a PCIe Gen2 slot

Control and I/O Daemon (CIOD)

- I/O forwarding mechanism for BG/P and BG/Q; CIOD is the I/O forwarding infrastructure provided by IBM for the Blue Gene/P
- For each compute node, a dedicated I/O proxy process handles the associated I/O operations
- On BG/Q, the implementation is thread-based instead of process-based

Intrepid vs Mira: I/O Comparison

                        Intrepid BG/P   Mira BG/Q         Increase
    Flops               0.557 PF        10 PF             20X
    Cores               160K            768K              ~5X
    ION/CN ratio        1/64            1/128             0.5X
    IONs per rack       16              8                 0.5X
    Total storage       6 PB            35 PB             6X
    I/O throughput      62 GB/s         240 GB/s (peak)   4X
    User quota          none            quotas enforced!  -
    I/O per flops (%)   0.01            0.0024            0.24X

I/O Interfaces and Libraries on Mira

- Interfaces: POSIX, MPI-IO
- Higher-level libraries: HDF5, NetCDF4, PnetCDF
- I/O performance on Mira: shared-file write ~70 GB/s, shared-file read ~180 GB/s

I/O Recommendations

- Single shared file vs. file per process (MPI rank): ideally, you should write a few large files
- Use MPI-IO interfaces
- Use MPI collective I/O (e.g., MPI_File_write_at_all()) when you are writing small data sizes per rank (see the sketch after this list)
- GPFS block size on Mira: 8 MB
- For non-overlapping writes, pass the environment variable BGLOCKLESSMPIO_F_TYPE=0x47504653 at job submission (qsub --env BGLOCKLESSMPIO_F_TYPE=0x47504653; 0x47504653 is ASCII for "GPFS") or prefix the file path with bglockless: (e.g., bglockless:/path/to/file) to disable locking in the ADIO layer; factor of two improvement in performance
- For POSIX, instead of lseek and write, use pwrite: pwrite(int fd, const void *buf, size_t count, off_t offset)
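Below is a minimal sketch of a non-overlapping collective shared-file write along these lines; the file name, per-rank buffer size, and offset layout are illustrative assumptions, not taken from the slides.

    /* Sketch: collective shared-file write with MPI-IO. Every rank writes
     * a contiguous, non-overlapping slice of one shared file. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_File fh;
        MPI_Offset offset;
        const int count = 131072;   /* doubles per rank (1 MiB), assumed */
        double *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        buf = malloc(count * sizeof(double));
        for (int i = 0; i < count; i++)
            buf[i] = (double)rank;

        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_WRONLY | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
        offset = (MPI_Offset)rank * count * sizeof(double);

        /* Collective variant: lets the MPI-IO layer aggregate small
         * per-rank writes into large, well-aligned requests. */
        MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }

Since the writes here are non-overlapping, the same job could also be submitted with the locking workaround above (qsub --env BGLOCKLESSMPIO_F_TYPE=0x47504653), or open bglockless:shared.dat instead.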

More Tips and Tricks for I/O on BG/Q

- Pre-create files before writing to them: increases shared-file write performance
- Pre-allocate files: useful if you know the exact size of the data to be written. Have one rank, say rank 0, create and allocate the file (MPI_File_set_size or ftruncate) and close it; then have all ranks open this file for writing (see the code below)
- HDF5 datatype: instead of native, use big endian (e.g., H5T_IEEE_F64BE); a short sketch follows after the code below
- Avoid complex MPI datatypes: keep things simple in production codes
- Data compression

Code from the slide, reconstructed:

    // Create and preallocate the file on the master rank
    if (0 == m_rank) {
        retval = MPI_File_open(MPI_COMM_SELF, (char *)m_partfilename,
                               MPI_MODE_WRONLY | MPI_MODE_CREATE,
                               MPI_INFO_NULL, &m_fileHandle);
        assert(retval == MPI_SUCCESS);
        // Preallocate if you know the exact size of data to be written
        MPI_File_set_size(m_fileHandle, m_partfilesize);
        MPI_File_close(&m_fileHandle);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    // All ranks open the pre-created file for writing
    retval = MPI_File_open(MPI_COMM_WORLD, (char *)m_partfilename,
                           MPI_MODE_WRONLY, MPI_INFO_NULL, &m_fileHandle);
    assert(retval == MPI_SUCCESS);
    MPI_Barrier(MPI_COMM_WORLD);
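The HDF5 datatype tip can be illustrated with a short sketch; the file name, dataset name, and serial (non-parallel) usage here are illustrative assumptions.

    /* Sketch: create an HDF5 dataset with an explicit big-endian on-disk
     * type (H5T_IEEE_F64BE) instead of H5T_NATIVE_DOUBLE, per the tip
     * above. File and dataset names are assumptions. */
    #include <hdf5.h>

    void write_dataset(const double *data, hsize_t n)
    {
        hid_t file, space, dset;

        file  = H5Fcreate("out.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        space = H5Screate_simple(1, &n, NULL);

        /* On-disk type is fixed big-endian; the in-memory type passed to
         * H5Dwrite stays native, and HDF5 converts if needed. */
        dset = H5Dcreate2(file, "values", H5T_IEEE_F64BE, space,
                          H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

        H5Dclose(dset);
        H5Sclose(space);
        H5Fclose(file);
    }

BG/Q is a big-endian PowerPC machine, so H5T_IEEE_F64BE matches the native layout; pinning the on-disk type makes the file layout explicit rather than dependent on where the file was written.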

Tools for I/O Performance Profiling on BG/Q

- HPM (Bob Walkup): https://www.alcf.anl.gov/user-guides/hpctw
- Darshan (Phil Carns and Kevin Harms): https://www.alcf.anl.gov/user-guides/darshan
- TAU (Sameer Shende): https://www.alcf.anl.gov/user-guides/tuning-and-analysis-utilities-tau; add -optTrackIO to TAU_OPTIONS

Profiling I/O using HPM

- Link the application with /soft/perftools/hpctw/lib/libmpihpm_smp.a and /bgsys/drivers/ppcfloor/bgpm/lib/libbgpm.a
- The job produces the following output, among others:

    -rw-r--r-- 1 venkatv Performance 28131 May 20 09:29 mpi_profile.217541.0
    -rw-r--r-- 1 venkatv Performance  5940 May 20 09:29 mpi_profile.217541.1220
    -rw-r--r-- 1 venkatv Performance  5694 May 20 09:29 mpi_profile.217541.16912
    -rw-r--r-- 1 venkatv Performance  5505 May 20 09:29 mpi_profile.217541.9172

Sample profile, data for MPI rank 0 of 65536 (times and statistics from MPI_Init() to MPI_Finalize()):

    MPI Routine              #calls   avg. bytes   time(sec)
    -----------------------------------------------------------------
    MPI_Comm_size                63          0.0       0.000
    MPI_Comm_rank                 2          0.0       0.000
    MPI_Isend                  4636     262086.0       0.031
    MPI_Recv                   4666     260650.4       0.215
    MPI_Probe                  4636          0.0       0.041
    MPI_Waitall                1161          0.0       1.591
    MPI_Barrier                1277          0.0      99.924
    MPI_Gather                   28         80.0       0.942
    MPI_Allgather               146         14.1       7.821
    MPI_Allgatherv               62         32.6       8.719
    MPI_Reduce                    6         16.7       0.001
    MPI_Allreduce               950         23.6      55.296
    MPI_File_close               11          0.0       0.231
    MPI_File_open                11          0.0      16.976
    MPI_File_read_at              1    7559376.0      37.601
    MPI_File_write_at_all        20    2426940.0      92.211
    -----------------------------------------------------------------
    total communication time = 174.581 seconds.
    total elapsed time = 589.773 seconds.
    total MPI-IO time = 147.019 seconds.

Profiling I/O using HPM (continued)

    MPI Routine              #calls   avg. bytes   time(sec)
    MPI_File_read_at              1    7559376.0      37.601
    MPI_File_write_at_all        15          0.0      10.052
                                  5    9707760.0      82.159

Performance issue: 15 collective calls writing 0 bytes take 10 seconds.
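The slides do not show the fix, but one plausible mitigation is to skip a collective round entirely when no rank has data to contribute; the helper below is a hypothetical sketch of that idea.

    /* Sketch: avoid issuing a collective write when no rank has any data.
     * A collective call must be made by every rank in the communicator,
     * so the check itself must be global. Names are illustrative. */
    #include <mpi.h>

    void maybe_write(MPI_File fh, MPI_Offset off, const void *buf, int count,
                     MPI_Datatype type, MPI_Comm comm)
    {
        int total = 0;

        /* Global check: does any rank actually have elements to write? */
        MPI_Allreduce(&count, &total, 1, MPI_INT, MPI_SUM, comm);
        if (total == 0)
            return;   /* skip the whole collective round */

        /* Ranks with count == 0 still participate, as required. */
        MPI_File_write_at_all(fh, off, buf, count, type, MPI_STATUS_IGNORE);
    }

The extra MPI_Allreduce of one int is far cheaper at this scale than the ~0.7 seconds each empty collective write cost in the profile above.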

Darshan: An I/O Characterization Tool
https://www.alcf.anl.gov/user-guides/darshan

- An open-source tool developed for statistical profiling of I/O
- Designed to be lightweight and low overhead: finite memory allocation for statistics (about 2 MB) done during MPI_Init; overhead of 1-2% total to record I/O calls (run-to-run variation of I/O is typically around 4-5%); Darshan does not create detailed function call traces
- No source modifications: uses PMPI interfaces to intercept MPI calls, and ld wrapping to intercept POSIX calls (can use dynamic linking with LD_PRELOAD instead); the mechanism is sketched below
- Stores results in a single compressed log file
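As an illustration of the PMPI mechanism (this is not Darshan's actual source), a profiling library can define an MPI function itself, record statistics, and forward to the real implementation through the PMPI_ entry point:

    /* Sketch of PMPI interception: the tool's MPI_File_write_at_all
     * wraps the real call via PMPI_File_write_at_all and reports totals
     * when the application shuts down MPI. */
    #include <mpi.h>
    #include <stdio.h>

    static double total_write_time = 0.0;
    static long   write_calls = 0;

    int MPI_File_write_at_all(MPI_File fh, MPI_Offset offset, const void *buf,
                              int count, MPI_Datatype type, MPI_Status *status)
    {
        double t0 = MPI_Wtime();
        int ret = PMPI_File_write_at_all(fh, offset, buf, count, type, status);
        total_write_time += MPI_Wtime() - t0;
        write_calls++;
        return ret;
    }

    int MPI_Finalize(void)
    {
        fprintf(stderr, "MPI_File_write_at_all: %ld calls, %.3f s\n",
                write_calls, total_write_time);
        return PMPI_Finalize();
    }

Linking this ahead of the MPI library (or loading it with LD_PRELOAD) intercepts every matching call with no source changes, which is exactly why Darshan can be enabled by default system-wide.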

Darshan on BG/Q

- Logs: /projects/logs/darshan/vesta and /projects/logs/darshan/mira
- Binaries: /soft/perftools/darshan/darshan/bin
- Wrappers: /soft/perftools/darshan/darshan/wrappers/<wrapper>, e.g. export PATH=/soft/perftools/darshan/darshan/wrappers/xl:${PATH}

[Sample darshan-job-summary report for the job "fms c48l24 am3p5.x" (3/27/2010; uid 6277, 1944 processes, 5382-second runtime): charts of average I/O cost per process (POSIX vs. MPI-IO read/write/metadata time vs. application compute), POSIX and MPI-IO operation counts (read, write, open, stat, seek, mmap, fsync; independent vs. collective MPI-IO), I/O size histograms (0-100 bytes up to 1G+), and sequential/consecutive I/O pattern counts. Most common access sizes: 65536 bytes (23066526 accesses), 800 (2378602), 304 (2172322), 400 (2029354).]

Early Experience Scaling I/O for HACC

HACC (Hybrid/Hardware Accelerated Cosmology Code):
- Large-scale computational cosmology code led by Salman Habib at Argonne; members include Katrin Heitmann, Hal Finkel, Adrian Pope, and Vitali Morozov
- Particle-in-cell code; achieved ~14 PF on Sequoia and 7 PF on Mira
- On Mira, HACC uses 1-10 trillion particles

HACC on Mira, Early Science test run:
- First large-scale simulation on Mira: 16 racks, 262,144 cores (one-third of Mira); 14 hours continuous, no checkpoint restarts
- Simulation parameters: 9.14 Gpc simulation box, 7 kpc force resolution, 1.07 trillion particles
- Science motivations: resolve halos that host luminous red galaxies for analysis of SDSS observations; cluster cosmology; baryon acoustic oscillations in galaxy clustering

[Figure: zoom-in visualization of the density field illustrating the global spatial dynamic range of the simulation, approximately a million to one]

Early Experience Scaling I/O for HACC

- Typical checkpoint/restart dataset is ~100-400 TB; analysis output, written more frequently, is ~1-10 TB
- Step 1: default I/O mechanism using a single shared file with MPI-IO; achieved ~15 GB/s
- Step 2: file per rank; too many files at 32K nodes (262K files with 8 ranks per node)
- Step 3: an I/O mechanism writing one file per ION, with topology-aware I/O to reduce network contention; achieves up to 160 GB/s (~10X improvement); a sketch of the sub-communicator approach follows
- Written and read ~10 PB of data on Mira (and counting)
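The slides do not include HACC's implementation, but the one-file-per-ION pattern can be sketched with a sub-communicator per I/O node. The rank-to-ION mapping via ranks_per_ion below is an illustrative assumption; a production code would query the BG/Q topology for the real mapping.

    /* Sketch: one shared file per I/O node. Ranks are grouped into one
     * sub-communicator per ION and each group writes its own file
     * collectively. */
    #include <mpi.h>
    #include <stdio.h>

    void write_per_ion(const void *buf, int count, MPI_Datatype type,
                       int ranks_per_ion /* e.g. 128 nodes x ranks per node */)
    {
        int world_rank, ion_rank;
        MPI_Comm ion_comm;
        MPI_File fh;
        MPI_Offset offset;
        MPI_Aint lb, extent;
        char filename[64];

        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        int color = world_rank / ranks_per_ion;   /* assumed ION id */

        MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &ion_comm);
        MPI_Comm_rank(ion_comm, &ion_rank);

        snprintf(filename, sizeof filename, "part.%05d", color);
        MPI_File_open(ion_comm, filename,
                      MPI_MODE_WRONLY | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);

        /* Each rank writes a contiguous slice of its ION's file. */
        MPI_Type_get_extent(type, &lb, &extent);
        offset = (MPI_Offset)ion_rank * count * extent;
        MPI_File_write_at_all(fh, offset, buf, count, type, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Comm_free(&ion_comm);
    }

This keeps the file count at one per ION (384 on Mira) instead of one per rank, while each file is still written collectively by the ranks that share that ION's I/O path.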

Questions