EXPLOITING SHARED MEMORY TO IMPROVE PARALLEL I/O PERFORMANCE
1 EXPLOITING SHARED MEMORY TO IMPROVE PARALLEL I/O PERFORMANCE Andrew B. Hastings Sun Microsystems, Inc. Alok Choudhary Northwestern University September 19, 2006 This material is based on work supported by DARPA under Contract No. NBCH
2 Outline
Motivation
Previous work
New shared memory solutions
Performance evaluation
Conclusion and future work
3 Why Shared Memory?
Because it's there!
> For Phase II of DARPA's High Productivity Computing Systems program, Sun proposed a petascale shared memory system
Opportunity to improve performance without altering applications
> Shared memory typically has lower latency and lower overhead (especially for small payloads) than messages
> Change just the library to use shared memory
Interesting research area
> Most previous work on parallel I/O focuses on clusters
4 A Common Parallel I/O Problem
Application accesses may be noncontiguous in memory and in the file
> If not optimized, can result in tens of thousands of small POSIX I/O operations
For MPI-IO, two MPI derived datatypes specify the file and memory access patterns
[Diagram: Process 1 (P1) and Process 2 (P2) memory regions mapping to interleaved file regions; 8 I/O requests, each arrow representing a request]
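A minimal MPI-IO sketch of the situation (not taken from the talk): a strided file pattern is described once with a derived datatype and handed to the I/O layer, instead of being issued as many small POSIX operations. The file name and sizes are illustrative.

/* Minimal sketch (not from the slides): a strided, noncontiguous file pattern
 * described with an MPI derived datatype so the MPI-IO layer sees the whole
 * access at once. Error handling omitted; sizes and "output.dat" are
 * illustrative only. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    enum { BLOCK = 256, COUNT = 64 };          /* illustrative sizes */
    int buf[BLOCK * COUNT];                    /* application data (contents omitted) */

    /* File pattern: COUNT blocks of BLOCK ints, strided by nprocs*BLOCK ints. */
    MPI_Datatype filetype;
    MPI_Type_vector(COUNT, BLOCK, BLOCK * nprocs, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* Each rank starts at its own block offset; the view hides the gaps. */
    MPI_Offset disp = (MPI_Offset)rank * BLOCK * sizeof(int);
    MPI_File_set_view(fh, disp, MPI_INT, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, BLOCK * COUNT, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}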
5 Previous Solutions 1: Data sieving I/O
Each process locks and reads a contiguous block, fills in the altered data, then writes back and unlocks
[Diagram: P1 and P2 memory data staged through per-process sieving buffers and written to the file]
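A hedged sketch of the data-sieving idea in POSIX terms, not ROMIO's actual implementation: lock and read one contiguous extent covering all of the process's small regions, patch the changed bytes in memory, write the extent back, and unlock. The region list and extent bounds are assumed inputs; error handling is omitted.

/* Sketch of data sieving only (not ROMIO's code): read-modify-write of one
 * contiguous extent under a POSIX advisory lock. fd, the extent bounds, and
 * the (off, len, src) region list are assumed inputs. */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct region { off_t off; size_t len; const char *src; };

static void sieve_write(int fd, off_t start, size_t extent,
                        const struct region *regs, int nregs)
{
    char *buf = malloc(extent);
    struct flock lk = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = start, .l_len = (off_t)extent };
    fcntl(fd, F_SETLKW, &lk);                 /* lock the extent */
    pread(fd, buf, extent, start);            /* one large read */
    for (int i = 0; i < nregs; i++)           /* patch the altered pieces */
        memcpy(buf + (regs[i].off - start), regs[i].src, regs[i].len);
    pwrite(fd, buf, extent, start);           /* one large write-back */
    lk.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &lk);                  /* unlock */
    free(buf);
}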
6 Previous Solutions 2: List I/O
Each process creates a list of memory regions and a list of file regions; calls a new filesystem interface
[Diagram: P1 and P2 each pass a (memory list, file list) pair; 2 I/O requests to the file]
Datatype I/O
Each process creates a small data structure describing repeating regions in memory and in file; calls a new filesystem interface
7 Previous Solutions 3: Two-phase collective
Each process sends a round of data to each aggregator. Aggregator(s) receive and merge into a buffer, then make large write call(s) to the filesystem; repeat until done.
[Diagram: P1 sends; P2 (aggregator) receives, merges into its buffer, and writes to the file]
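A deliberately simplified sketch of the two-phase idea, assuming a single aggregator (rank 0) and a single round: non-aggregators ship their already-ordered pieces to the aggregator, which issues one large contiguous write. The real algorithm cycles through multiple rounds and interleaves data from all ranks per round.

/* Simplified two-phase sketch: Phase 1 exchanges data to the aggregator,
 * Phase 2 performs one large write. Assumes each rank's piece is already in
 * file order and rank order matches file order. */
#include <mpi.h>
#include <stdlib.h>
#include <unistd.h>

void two_phase_write(int fd, off_t file_off, const char *piece, int piece_len,
                     MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    int *lens = NULL, *displs = NULL;
    char *agg_buf = NULL;
    if (rank == 0) {
        lens = malloc(nprocs * sizeof(int));
        displs = malloc(nprocs * sizeof(int));
    }

    /* Phase 1: exchange -- gather every rank's piece at the aggregator. */
    MPI_Gather(&piece_len, 1, MPI_INT, lens, 1, MPI_INT, 0, comm);
    int total = 0;
    if (rank == 0) {
        for (int i = 0; i < nprocs; i++) { displs[i] = total; total += lens[i]; }
        agg_buf = malloc(total);
    }
    MPI_Gatherv((void *)piece, piece_len, MPI_BYTE,
                agg_buf, lens, displs, MPI_BYTE, 0, comm);

    /* Phase 2: I/O -- one large write instead of many small ones. */
    if (rank == 0) {
        pwrite(fd, agg_buf, total, file_off);
        free(agg_buf); free(lens); free(displs);
    }
}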
8 Using Shared Memory: mmap
Each process maps the file into its address space, copies data to the appropriate location in the mapped file
> Similar to List I/O but mostly implemented in the library
[Diagram: P1 and P2 copy memory data into the mapped file with loads/stores]
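A hedged sketch of the mmap approach: each process maps the shared file and copies its pieces straight to their file offsets with loads/stores, so the noncontiguous pattern never becomes a stream of small write() calls. It assumes the file has already been sized and that regions written by different processes do not overlap.

/* Sketch of the mmap approach; the region list is an assumed input and error
 * handling is minimal. */
#include <sys/mman.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

struct region { off_t off; size_t len; const char *src; };

int mmap_write(const char *path, size_t file_size,
               const struct region *regs, int nregs)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;
    char *map = mmap(NULL, file_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { close(fd); return -1; }

    for (int i = 0; i < nregs; i++)     /* stores go straight to the file image */
        memcpy(map + regs[i].off, regs[i].src, regs[i].len);

    msync(map, file_size, MS_SYNC);     /* flush dirty pages */
    munmap(map, file_size);
    close(fd);
    return 0;
}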
9 Using Shared Memory: Collectives
Collective Shared Data: Each aggregator copies data between its working buffer and shared application memory
Collective Shared Buffer: Each process copies data between its application memory and aggregator(s)' shared working buffer(s)
[Diagram: P1 copies with loads/stores into the buffer of P2 (aggregator), which writes to the file]
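A hedged sketch of the collective-shared-buffer idea using POSIX shared memory (the talk's implementation on the Sun system differs in detail): the aggregator's working buffer lives in a shared segment, every process copies its own data into it, and after a barrier the aggregator performs one large write. The segment name "/csb_buf", the 32 MB size, and the displacement argument are illustrative.

/* Sketch only: shared working buffer via shm_open, copies by all processes,
 * one large write by the aggregator (rank 0). Synchronization is reduced to
 * two barriers. */
#include <mpi.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BUF_BYTES (32u << 20)                   /* illustrative 32 MB buffer */

void csb_write(int fd, off_t file_off, size_t total_len,
               const char *piece, size_t piece_len, size_t piece_disp,
               MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    int shm_fd = shm_open("/csb_buf", O_CREAT | O_RDWR, 0600);
    if (rank == 0) ftruncate(shm_fd, BUF_BYTES);
    MPI_Barrier(comm);                          /* buffer exists and is sized */
    char *buf = mmap(NULL, BUF_BYTES, PROT_READ | PROT_WRITE,
                     MAP_SHARED, shm_fd, 0);

    /* Every process copies its own data into the shared working buffer. */
    memcpy(buf + piece_disp, piece, piece_len);
    MPI_Barrier(comm);                          /* all copies complete */

    if (rank == 0) {                            /* aggregator does the large write */
        pwrite(fd, buf, total_len, file_off);
        shm_unlink("/csb_buf");
    }
    munmap(buf, BUF_BYTES);
    close(shm_fd);
}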
10 Datatype Iterators
Problem: copy driven by (offset, length) list:
> Huge list thrashes processor cache
> List generation expensive; delays I/O
Solution: datatype iterator tracks position in MPI datatype, returns next (offset, length) on demand
> State fits in a handful of cache lines
> Tiny startup cost; higher traversal cost can overlap I/O
[Diagram: a datatype iterator keeps a small datatype stack over the MPI datatype, in contrast to an explicit (offset, length) list spanning entries 0 ... 983,039]
11 Overlapping I/O
Strategy: Split working buffer into sub-buffers
> After a sub-buffer is filled, initiate asynchronous I/O
> Before filling the next sub-buffer, wait for the previous asynchronous I/O on it to complete
> Overlaps I/O and data rearrangement!
Performance gain for collective shared buffer on FLASH I/O benchmark:
> 60% with lists
> 90% with datatype iterators
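A hedged sketch of the sub-buffering strategy using POSIX AIO (the talk does not say which asynchronous interface was used): each filled sub-buffer is handed to aio_write, and a sub-buffer is only refilled once its previous write has completed, so data rearrangement overlaps I/O. fill_subbuf stands in for the datatype-iterator copy loop and is an assumed helper.

/* Sub-buffer pipeline sketch with POSIX AIO; error checks omitted. */
#include <aio.h>
#include <string.h>
#include <sys/types.h>

#define NSUB 4
#define SUB_BYTES (8u << 20)            /* 4 x 8 MB = 32 MB working buffer */

static char subbuf[NSUB][SUB_BYTES];
static struct aiocb cb[NSUB];
static int busy[NSUB];

/* Assumed helper: rearranges the next chunk of application data into the
 * sub-buffer and returns the number of bytes placed (0 when done). */
extern size_t fill_subbuf(char *dst, size_t cap);

void overlapped_write(int fd, off_t file_off)
{
    for (int i = 0; ; i = (i + 1) % NSUB) {
        if (busy[i]) {                           /* reclaim this sub-buffer */
            const struct aiocb *list[1] = { &cb[i] };
            aio_suspend(list, 1, NULL);
            aio_return(&cb[i]);
            busy[i] = 0;
        }
        size_t n = fill_subbuf(subbuf[i], SUB_BYTES);   /* rearrange next chunk */
        if (n == 0) break;

        memset(&cb[i], 0, sizeof cb[i]);
        cb[i].aio_fildes = fd;
        cb[i].aio_buf    = subbuf[i];
        cb[i].aio_nbytes = n;
        cb[i].aio_offset = file_off;
        aio_write(&cb[i]);                       /* I/O proceeds while we refill */
        busy[i] = 1;
        file_off += n;
    }
    for (int i = 0; i < NSUB; i++)               /* drain remaining writes */
        if (busy[i]) {
            const struct aiocb *list[1] = { &cb[i] };
            aio_suspend(list, 1, NULL);
            aio_return(&cb[i]);
        }
}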
12 Performance Evaluation
Hardware: Sun Fire 6800, MHz processors, 150 MHz system bus, 96 GB memory, 4 1-Gb FC channels, 4 Sun StorEdge T3 disk arrays (T3 cache disabled)
Software: LAM 7.1.1, ROMIO 1.2.4, Solaris 9, Sun StorageTek QFS 4.5 (3 data + 1 metadata), 64-bit execution model
Bandwidth to data arrays: < 300 MB/s
Caveat: Buffered reads benefit from a warm buffer cache!
Sun, StorageTek, Sun Fire, Sun StorEdge, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc., in the United States and other countries.
13 Tile Reader Benchmark
Tiled display simulation
> File size 7 37 MB
> From Parallel I/O Benchmarking Consortium, Argonne
Data distribution for a 2x2 tile array: data read by one process
[Chart: Aggregate Read Bandwidth (MB/s) vs. tile array dimensions (number of processes = product); series: CSB-dt (dir), CSB-list (dir), CSD (dir), mmap (buf), *List I/O (buf), 2PC (dir), 2PC (buf), *DS (dir), *DS (buf); *omitted due to poor performance]
14 ROMIO 3D Block Test
Array of ints block-distributed to processes
> Uneven data distribution for some process counts
> Fixed file size: 824 MB
Data distribution for 8 processes: data accessed by one process
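A hedged sketch, not the benchmark's actual source, of how such a 3D block distribution is commonly described to MPI-IO: one MPI_Type_create_subarray call per process defines its block within the global int array, and that datatype becomes the file view for a collective write. The dimensions are assumed inputs.

/* Sketch: describe one process's block of a 3D int array as a subarray
 * datatype for use as an MPI-IO file view. */
#include <mpi.h>

MPI_Datatype block_filetype(int gsizes[3],   /* global array dimensions */
                            int lsizes[3],   /* this process's block size */
                            int starts[3])   /* block origin in the array */
{
    MPI_Datatype filetype;
    MPI_Type_create_subarray(3, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);
    return filetype;
    /* Typical use: MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", info);
       followed by a collective MPI_File_write_all of the local block. */
}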
15 ROMIO 3D Block Test Results
[Charts: Aggregate Write Bandwidth (MB/s) and Aggregate Read Bandwidth (MB/s) vs. number of processes; series: CSB-dt (dir), CSB-list (dir), CSD (dir), mmap (buf), *List I/O (buf), 2PC (dir), 2PC (buf), *DS (dir), *DS (buf); *omitted due to poor performance]
16 FLASH I/O Benchmark
From Argonne/Northwestern
Memory organization: FLASH block structure with X, Y, and Z axes and guard cells; each element has 24 variables (Variable 0 ... Variable 23); 80 blocks per process
Checkpoint reorganizes to group values by variable
File organization: Var 0, Var 1, Var 2, ..., Var 23; within each variable, Block 0, Block 1, ..., Block 79 for Proc 0, Proc 1, ..., Proc N
17 FLASH I/O Benchmark Results
[Charts: Aggregate Write Bandwidth (MB/s) vs. number of cells along block edge at a fixed number of processes (file size 165 MB - 15 GB), and vs. number of processes at a fixed block size (file size 469 MB - 2.8 GB); series: CSB-dt (dir), CSB-list (dir), CSD (dir), mmap (buf), List I/O (buf), 2PC (dir), 2PC (buf), *DS (dir), *DS (buf); *omitted due to poor performance]
18 Conclusion
Combination of collective shared buffer, datatype iterators, and sub-buffering offered the best aggregate performance for several application I/O patterns
> Achieved 90% of available disk bandwidth
> 5x improvement over two-phase collective
Rediscovered streaming I/O principles:
1. Reduce startup overhead (datatype iterators)
2. Overlap I/O and computation when possible (sub-buffering)
19 Future Work
Apply datatype iterators to MPI messages
> Direct sender-to-receiver copy if shared memory
Apply datatype iterators to data sieving and two-phase collective in ROMIO (currently list-based)
> Could benefit traditional clusters
Possible standardization of datatype iterators
> Required for use of datatype iterators in ROMIO if ROMIO is to remain portable across MPI implementations
20 Acknowledgements Harriet Coverston and Anton Rang of Sun Microsystems also contributed to this work. This material is based on work supported by the US Defense Advanced Research Projects Agency under Contract No. NBCH
21 EXPLOITING SHARED MEMORY TO IMPROVE PARALLEL I/O PERFORMANCE Andrew B. Hastings Alok Choudhary This material is based on work supported by DARPA under Contract No. NBCH
22 Datatype Iterators Interface
Interfaces:
> dtc_next: advance cursor to next contiguous block, return (offset, length)
> dtc_size_seek/dtc_extent_seek: position cursor to a size or extent within the datatype
> dtc_size_tell/dtc_extent_tell: return the size or extent within the datatype corresponding to the cursor position
Simplifies implementation:
> Collective shared buffer required 62% fewer lines of code with datatype iterators than with lists
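A plausible set of C declarations for the interface named on this slide; the cursor type, the out-parameter style of dtc_next, and the create/free calls are assumptions, since the slide lists only the operation names and their meanings.

/* Assumed declarations; not the actual Sun/Northwestern header. */
#include <mpi.h>

typedef struct dtc_cursor dtc_cursor_t;          /* opaque iterator state */

/* Not named on the slide; added here for completeness of the sketch. */
dtc_cursor_t *dtc_create(MPI_Datatype dtype, int count);
void          dtc_free(dtc_cursor_t *dtc);

/* Advance to the next contiguous block; returns 0 at end of the datatype. */
int dtc_next(dtc_cursor_t *dtc, MPI_Offset *offset, MPI_Offset *length);

/* Reposition the cursor by size (bytes of data) or by extent (span in the
   datatype's layout), and report the current position in either measure. */
void       dtc_size_seek(dtc_cursor_t *dtc, MPI_Offset size);
void       dtc_extent_seek(dtc_cursor_t *dtc, MPI_Offset extent);
MPI_Offset dtc_size_tell(const dtc_cursor_t *dtc);
MPI_Offset dtc_extent_tell(const dtc_cursor_t *dtc);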
23 Datatype Iterators Example
Copy (non-)contiguous application data directly to (non-)contiguous shared working buffer (slide pseudocode; "(a, b) = dtc_next(c)" is shorthand for fetching the next (offset, length) pair from iterator c):

    while (file_off + file_len <= end_off) {     // Entire file block still fits in current chunk
        while (file_len >= mem_len) {            // Mem block fits in file block
            src = app_buf + mem_off;
            memcpy(dest, src, mem_len);          // Copy remaining mem block
            file_off += mem_len;  file_len -= mem_len;  dest += mem_len;
            (mem_off, mem_len) = dtc_next(mem_dtc);      // Get next mem block
        }
        while (mem_len >= file_len) {            // File block fits in mem block
            dest = temp_buf + file_off - start_off;
            memcpy(dest, src, file_len);         // Copy remaining file block
            mem_off += file_len;  mem_len -= file_len;  src += file_len;
            (file_off, file_len) = dtc_next(file_dtc);   // Get next file block
            if (file_off + file_len > end_off)
                break;
        }
    }
    // Elided: post-loop handling of tail end of file block
24 Legend
CSB-dt: collective shared buffer with datatype iterators
> 1 aggregator, 32 MB buffer, 4 sub-buffers
CSB-list: collective shared buffer with lists
> 1 aggregator, 32 MB buffer, 4 sub-buffers
CSD: collective shared data (lists)
> All processes aggregators, 32 MB buffer, no sub-buffers
2PC: two-phase collective (lists)
> All processes aggregators, 16 MB buffer
DS: data sieving
> 8 MB buffer