EXPLOITING SHARED MEMORY TO IMPROVE PARALLEL I/O PERFORMANCE
Andrew B. Hastings, Sun Microsystems, Inc.
Alok Choudhary, Northwestern University
September 19, 2006
This material is based on work supported by DARPA under Contract No. NBCH3039002.
Outline
> Motivation
> Previous work
> New shared memory solutions
> Performance evaluation
> Conclusion and future work
Why Shared Memory?
Because it's there!
> For Phase II of DARPA's High Productivity Computing Systems program, Sun proposed a petascale shared memory system
Opportunity to improve performance without altering applications
> Shared memory typically has lower latency and lower overhead (especially for small payloads) than messages
> Change just the library to use shared memory
Interesting research area
> Most previous work on parallel I/O focuses on clusters
A Common Parallel I/O Problem
Application accesses may be noncontiguous in memory and in the file
> If not optimized, this can result in tens of thousands of small POSIX I/O operations
> For MPI-IO, two MPI derived datatypes specify the file and memory access patterns
[Figure: Process 1 (P1) and Process 2 (P2) memory regions map to interleaved file regions; the unoptimized pattern produces 8 I/O requests, each arrow representing one request.]
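As a concrete illustration (not taken from the original slides), the sketch below shows how such a noncontiguous, interleaved pattern might be expressed in MPI-IO with two derived datatypes. The block counts, block lengths, and strides are made-up values chosen only to mimic the interleaving in the figure.

    #include <mpi.h>

    /* Hypothetical example: 4 blocks of 8 doubles each, interleaved between
     * two processes in the file.  All sizes here are illustrative only. */
    void write_interleaved(MPI_File fh, const double *buf, int rank)
    {
        MPI_Datatype memtype, filetype;

        /* Memory pattern: 4 blocks of 8 doubles, stride 16 doubles apart. */
        MPI_Type_vector(4, 8, 16, MPI_DOUBLE, &memtype);
        MPI_Type_commit(&memtype);

        /* File pattern: the same blocks, but interleaved with the other
         * process, so the stride in the file is twice as large. */
        MPI_Type_vector(4, 8, 32, MPI_DOUBLE, &filetype);
        MPI_Type_commit(&filetype);

        /* Each process sees only its own slots in the file. */
        MPI_Offset disp = (MPI_Offset)rank * 8 * sizeof(double);
        MPI_File_set_view(fh, disp, MPI_DOUBLE, filetype, "native",
                          MPI_INFO_NULL);

        /* One collective call; the MPI-IO library decides how to turn the
         * noncontiguous request into actual I/O operations. */
        MPI_File_write_all(fh, buf, 1, memtype, MPI_STATUS_IGNORE);

        MPI_Type_free(&memtype);
        MPI_Type_free(&filetype);
    }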
Previous Solutions 1
Data sieving I/O
> Each process locks and reads a contiguous block, fills in the altered data, then writes back and unlocks
[Figure: P1 and P2 each cover their noncontiguous pieces with one contiguous buffer, reducing the access to 4 I/O requests.]
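A minimal sketch of the data-sieving write path using POSIX calls (my reconstruction, not ROMIO's actual code). The struct piece type is invented for illustration, and the request list is assumed to be sorted by file offset.

    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    struct piece { off_t off; size_t len; const char *data; };

    void sieve_write(int fd, const struct piece *req, int nreq)
    {
        off_t  start = req[0].off;
        off_t  end   = req[nreq - 1].off + req[nreq - 1].len;
        size_t span  = (size_t)(end - start);
        char  *buf   = malloc(span);

        struct flock lk = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                            .l_start = start, .l_len = (off_t)span };
        fcntl(fd, F_SETLKW, &lk);            /* lock the covering extent   */

        pread(fd, buf, span, start);         /* 1 read instead of nreq     */
        for (int i = 0; i < nreq; i++)       /* fill in the altered data   */
            memcpy(buf + (req[i].off - start), req[i].data, req[i].len);
        pwrite(fd, buf, span, start);        /* 1 write instead of nreq    */

        lk.l_type = F_UNLCK;                 /* release the lock           */
        fcntl(fd, F_SETLK, &lk);
        free(buf);
    }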
Previous Solutions 2
List I/O
> Each process creates a list of memory regions and a list of file regions, then calls a new filesystem interface
[Figure: P1 and P2 each pass a (memory list, file list) pair; the access becomes 2 I/O requests.]
Datatype I/O
> Each process creates a small data structure describing repeating regions in memory and in the file, then calls a new filesystem interface
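A brief sketch of what a List I/O request looks like from the library's side. The two lists are the essential idea; fs_write_list is a hypothetical stand-in for a filesystem-specific list interface (such as the list I/O calls offered by PVFS), since the exact signature varies by filesystem. The piece counts and sizes are illustrative only.

    #include <stddef.h>
    #include <sys/types.h>

    /* Hypothetical filesystem entry point: one call describes every
     * noncontiguous piece on both the memory side and the file side. */
    extern int fs_write_list(int fd,
                             void *mem_addr[],  size_t mem_len[],  int mem_cnt,
                             off_t file_off[],  size_t file_len[], int file_cnt);

    /* Build the two lists for 4 pieces of 8 KB, spaced 16 KB apart in
     * memory and 32 KB apart in the file (illustrative sizes only). */
    int list_io_example(int fd, char *app_buf)
    {
        enum { N = 4, LEN = 8 * 1024 };
        void  *mem_addr[N];  size_t mem_len[N];
        off_t  file_off[N];  size_t file_len[N];

        for (int i = 0; i < N; i++) {
            mem_addr[i] = app_buf + (size_t)i * 16 * 1024;
            mem_len[i]  = LEN;
            file_off[i] = (off_t)i * 32 * 1024;
            file_len[i] = LEN;
        }
        /* The filesystem receives the whole noncontiguous request at once
         * instead of 2 x N separate POSIX calls. */
        return fs_write_list(fd, mem_addr, mem_len, N, file_off, file_len, N);
    }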
Previous Solutions 3
Two-phase collective
> Each process sends a round of data to each aggregator
> Aggregator(s) receive and merge the data into a buffer, then make large write call(s) to the filesystem; repeat until done
[Figure: P1 and P2 (the aggregator) send pieces; the aggregator receives, merges them into its buffer, and issues one large write to the file.]
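A minimal sketch of one round of a two-phase collective write with a single aggregator (my illustration, not ROMIO's actual code). The working-buffer size, the header exchange, and the assumption that the processes' pieces exactly cover the aggregator's file range [lo, lo+CHUNK) are simplifications; real implementations use many overlapped nonblocking transfers.

    #include <mpi.h>
    #include <string.h>

    #define CHUNK (16LL * 1024 * 1024)     /* aggregator working-buffer size */

    void two_phase_round(MPI_Comm comm, MPI_File fh, int aggregator,
                         const char *mydata, long long myoff, long long mylen,
                         long long lo)
    {
        int rank, nprocs;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nprocs);

        if (rank != aggregator) {
            /* Phase 1 (non-aggregator): ship the piece plus its placement. */
            long long hdr[2] = { myoff, mylen };
            MPI_Send(hdr, 2, MPI_LONG_LONG, aggregator, 0, comm);
            MPI_Send(mydata, (int)mylen, MPI_BYTE, aggregator, 1, comm);
        } else {
            static char buf[CHUNK];
            memcpy(buf + (myoff - lo), mydata, (size_t)mylen); /* own piece */

            /* Phase 1 (aggregator): receive and merge every other piece. */
            for (int src = 0; src < nprocs; src++) {
                if (src == aggregator) continue;
                long long hdr[2];
                MPI_Recv(hdr, 2, MPI_LONG_LONG, src, 0, comm,
                         MPI_STATUS_IGNORE);
                MPI_Recv(buf + (hdr[0] - lo), (int)hdr[1], MPI_BYTE,
                         src, 1, comm, MPI_STATUS_IGNORE);
            }

            /* Phase 2: one large contiguous write instead of many small
             * ones (assumes the pieces cover the whole chunk). */
            MPI_File_write_at(fh, lo, buf, (int)CHUNK, MPI_BYTE,
                              MPI_STATUS_IGNORE);
        }
    }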
Using Shared Memory: mmap
Each process maps the file into its address space and copies data to the appropriate location in the mapped file
> Similar to List I/O, but mostly implemented in the library
[Figure: P1 and P2 use plain loads/stores to move data from memory into the mapped file.]
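A minimal sketch of the mmap approach with standard POSIX calls (my illustration). The struct region list describing where each piece belongs in the file is an assumed input; error handling is elided.

    #include <string.h>
    #include <sys/mman.h>
    #include <sys/types.h>

    struct region { off_t file_off; const char *src; size_t len; };

    void mmap_write(int fd, off_t file_size, const struct region *r, int n)
    {
        char *map = mmap(NULL, (size_t)file_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
            return;                              /* error handling elided */

        for (int i = 0; i < n; i++)              /* plain loads/stores; the */
            memcpy(map + r[i].file_off,          /* VM system performs the  */
                   r[i].src, r[i].len);          /* actual I/O              */

        msync(map, (size_t)file_size, MS_SYNC);  /* force dirty pages out   */
        munmap(map, (size_t)file_size);
    }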
Using Shared Memory: Collectives
Collective shared data
> Each aggregator copies data between its working buffer and shared application memory
Collective shared buffer
> Each process copies data between its application memory and the aggregator's (or aggregators') shared working buffer(s)
[Figure: P1 uses loads/stores into the shared buffer of P2 (the aggregator), which then writes the buffer to the file.]
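One way the aggregator's working buffer could be made visible to every process is POSIX shared memory; the sketch below is an assumption for illustration (the name "/csb_buf", the 32 MB size, and the use of shm_open are not from the slides, which do not say how Sun's implementation creates the shared buffer). Synchronization between ranks, such as a barrier after the aggregator creates the segment, is elided.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define BUF_SIZE (32 * 1024 * 1024)

    /* Returns a pointer to the same working buffer in every process, so
     * non-aggregators can deposit their data with plain memcpy. */
    char *attach_shared_buffer(int is_aggregator)
    {
        int fd = shm_open("/csb_buf",
                          O_RDWR | (is_aggregator ? O_CREAT : 0), 0600);
        if (is_aggregator)
            ftruncate(fd, BUF_SIZE);      /* size the segment once         */
        char *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        close(fd);                        /* the mapping stays valid       */
        return buf;                       /* all ranks now share this buffer */
    }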
Datatype Iterators
Problem: the copy is driven by an (offset, length) list
> A huge list thrashes the processor cache
> List generation is expensive and delays I/O
Solution: a datatype iterator tracks the current position in the MPI datatype and returns the next (offset, length) on demand
> State fits in a handful of cache lines
> Tiny startup cost; the higher traversal cost can overlap I/O
[Figure: instead of a full (offset, length) list (entries 0 through 983,039), the datatype iterator keeps only a small datatype stack over the MPI datatype.]
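To make the idea concrete, here is a minimal sketch of an on-demand iterator for a simple strided (vector-like) pattern. This is not the actual Sun implementation, which handles arbitrarily nested MPI datatypes by keeping a small stack of per-level positions; the struct and function names are invented for illustration.

    #include <stddef.h>

    struct dt_iter {            /* fits comfortably in a cache line          */
        size_t block_len;       /* bytes per contiguous block                */
        size_t stride;          /* distance between block starts             */
        size_t count;           /* total number of blocks                    */
        size_t next;            /* index of the next block to hand out       */
    };

    /* Produce the next contiguous (offset, length) on demand;
     * returns 0 when the pattern is exhausted. */
    int dt_iter_next(struct dt_iter *it, size_t *off, size_t *len)
    {
        if (it->next >= it->count)
            return 0;
        *off = it->next * it->stride;
        *len = it->block_len;
        it->next++;
        return 1;
    }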
Overlapping I/O
Strategy: split the working buffer into sub-buffers
> After a sub-buffer is filled, initiate asynchronous I/O on it
> Before filling the next sub-buffer, wait for the previous asynchronous I/O on it to complete
> This overlaps I/O and data rearrangement!
Performance gain for collective shared buffer on the FLASH I/O benchmark:
> 60% with lists
> 90% with datatype iterators
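A sketch of the sub-buffering scheme using POSIX asynchronous I/O (my illustration; the slides do not say which async I/O interface Sun used). The sub-buffer count and size, and the fill_subbuffer helper that stands in for the data-rearrangement step, are assumptions.

    #include <aio.h>
    #include <errno.h>
    #include <string.h>
    #include <sys/types.h>

    #define NSUB     4                     /* sub-buffers per working buffer */
    #define SUB_SIZE (8 * 1024 * 1024)     /* illustrative size              */

    static char         sub[NSUB][SUB_SIZE];
    static struct aiocb cb[NSUB];

    /* Hypothetical helper: rearranges the next chunk of application data
     * into dst and returns the byte count (0 when everything is done). */
    extern size_t fill_subbuffer(char *dst, size_t cap);

    static void wait_sub(int i, int pending[])
    {
        if (!pending[i])
            return;
        const struct aiocb *list[1] = { &cb[i] };
        while (aio_error(&cb[i]) == EINPROGRESS)   /* block until the last  */
            aio_suspend(list, 1, NULL);            /* write on sub[i] ends  */
        aio_return(&cb[i]);                        /* collect its status    */
        pending[i] = 0;
    }

    void write_overlapped(int fd)
    {
        off_t file_off = 0;
        int   pending[NSUB] = { 0 };
        int   done = 0;

        while (!done) {
            for (int i = 0; i < NSUB; i++) {
                wait_sub(i, pending);          /* reuse sub[i] only after   */
                                               /* its previous write is done */
                size_t n = fill_subbuffer(sub[i], SUB_SIZE);
                if (n == 0) {                  /* no more data to rearrange */
                    done = 1;
                    break;
                }
                memset(&cb[i], 0, sizeof cb[i]);   /* start async write     */
                cb[i].aio_fildes = fd;
                cb[i].aio_buf    = sub[i];
                cb[i].aio_nbytes = n;
                cb[i].aio_offset = file_off;
                aio_write(&cb[i]);
                pending[i] = 1;
                file_off  += (off_t)n;
            }
        }
        for (int i = 0; i < NSUB; i++)         /* drain outstanding writes  */
            wait_sub(i, pending);
    }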
Performance Evaluation
Hardware: Sun Fire 6800, 24 × 1200 MHz processors, 150 MHz system bus, 96 GB memory, 4 × 1 Gb FC channels, 4 × Sun StorEdge T3 disk arrays (T3 cache disabled)
Software: LAM 7.1.1, ROMIO 1.2.4, Solaris 9, Sun StorageTek QFS 4.5 (3 data + 1 metadata), 64-bit execution model
Bandwidth to data arrays: < 300 MB/s
Caveat: buffered reads benefit from a warm buffer cache!
Sun, StorageTek, Sun Fire, Sun StorEdge, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc., in the United States and other countries.
Tile Reader Benchmark
Tiled display simulation
> File size: 7–37 MB
> From the Parallel I/O Benchmarking Consortium, Argonne
[Figure: data distribution for a 2 × 2 tile array (dimensions 768, 1024, 270, 128); shading marks the data read by one process.]
[Chart: aggregate read bandwidth (MB/s, 0–600) vs. tile array dimensions 2 × 2, 2 × 4, 3 × 4, 4 × 4, 5 × 4, 6 × 4 (number of processes = product), comparing CSB-dt (dir), CSB-list (dir), CSD (dir), mmap (buf), 2PC (dir), and 2PC (buf); List I/O (buf) and DS omitted due to poor performance.]
ROMIO 3D Block Test
600 × 600 × 600 array of ints, block-distributed to processes
> Uneven data distribution for some process counts
> Fixed file size: 824 MB
[Figure: data distribution for 8 processes; shading marks the data accessed by one process.]
ROMIO 3D Block Test Results
[Charts: aggregate write bandwidth (MB/s, 0–300) and aggregate read bandwidth (MB/s, 0–1000) vs. number of processes (4–24), comparing CSB-dt (dir), CSB-list (dir), CSD (dir), mmap (buf), 2PC (dir), and 2PC (buf); List I/O (buf) and DS omitted due to poor performance.]
FLASH I/O Benchmark
From Argonne/Northwestern
> 80 blocks per process; each element has 24 variables
> Checkpoint reorganizes the data to group values by variable
[Figure: FLASH block structure (X, Y, and Z axes, with guard cells). The memory organization keeps all 24 variables of each element together within a block, while the file organization groups by variable (Var 0, Var 1, ..., Var 23), each holding Block 0 through Block 79 for Proc 0, Proc 1, ..., Proc N.]
FLASH I/O Benchmark Results
[Charts: aggregate write bandwidth (MB/s, 0–300). Left: 22 processes, varying the number of cells along a block edge from 8 to 36 (file size 165 MB–15 GB). Right: fixed block size of 20 × 20 × 20 cells, varying the number of processes from 4 to 24 (file size 469 MB–2.8 GB). Methods compared: CSB-dt (dir), CSB-list (dir), CSD (dir), mmap (buf), List I/O (buf), 2PC (dir), and 2PC (buf); DS omitted due to poor performance.]
Conclusion
The combination of collective shared buffer, datatype iterators, and sub-buffering offered the best aggregate performance for several application I/O patterns
> Achieved 90% of available disk bandwidth
> 5× improvement over the two-phase collective
Rediscovered streaming I/O principles:
1. Reduce startup overhead (datatype iterators)
2. Overlap I/O and computation when possible (sub-buffering)
Future Work
Apply datatype iterators to MPI messages
> Direct sender-to-receiver copy if shared memory is available
Apply datatype iterators to data sieving and two-phase collective in ROMIO (currently list-based)
> Could benefit traditional clusters
Possible standardization of datatype iterators
> Required for use of datatype iterators in ROMIO if ROMIO is to remain portable across MPI implementations
Acknowledgements
Harriet Coverston and Anton Rang of Sun Microsystems also contributed to this work.
This material is based on work supported by the US Defense Advanced Research Projects Agency under Contract No. NBCH3039002.
EXPLOITING SHARED MEMORY TO IMPROVE PARALLEL I/O PERFORMANCE
Andrew B. Hastings, andrew.hastings@sun.com
Alok Choudhary, choudhar@ece.northwestern.edu
This material is based on work supported by DARPA under Contract No. NBCH3039002.
Datatype Iterators Interface
Interfaces:
> dtc_next: advance the cursor to the next contiguous block and return its (offset, length)
> dtc_size_seek / dtc_extent_seek: position the cursor to a given size or extent within the datatype
> dtc_size_tell / dtc_extent_tell: return the size or extent within the datatype corresponding to the cursor position
Simplifies implementation:
> Collective shared buffer needs 62% fewer lines of code with datatype iterators than with lists
Datatype Iterators Example
Copy (non-)contiguous application data directly to the (non-)contiguous shared working buffer:

    while (file_off + file_len <= end_off) {      // Entire file block still
                                                  // fits in current chunk
        while (file_len >= mem_len) {             // Mem block fits in file block
            src = app_buf + mem_off;
            memcpy(dest, src, mem_len);           // Copy remaining mem block
            file_off += mem_len; file_len -= mem_len; dest += mem_len;
            (mem_off, mem_len) = dtc_next(mem_dtc);    // Get next mem block
        }
        while (mem_len >= file_len) {             // File block fits in mem block
            dest = temp_buf + file_off - start_off;
            memcpy(dest, src, file_len);          // Copy remaining file block
            mem_off += file_len; mem_len -= file_len; src += file_len;
            (file_off, file_len) = dtc_next(file_dtc); // Get next file block
            if (file_off + file_len > end_off)
                break;
        }
    }
    // Elided: post-loop handling of tail end of file block
Legend
CSB-dt: collective shared buffer with datatype iterators
> 1 aggregator, 32 MB buffer, 4 sub-buffers
CSB-list: collective shared buffer with lists
> 1 aggregator, 32 MB buffer, 4 sub-buffers
CSD: collective shared data (lists)
> All processes are aggregators, 32 MB buffer, no sub-buffers
2PC: two-phase collective (lists)
> All processes are aggregators, 16 MB buffer
DS: data sieving
> 8 MB buffer