EXPLOITING SHARED MEMORY TO IMPROVE PARALLEL I/O PERFORMANCE



EXPLOITING SHARED MEMORY TO IMPROVE PARALLEL I/O PERFORMANCE
Andrew B. Hastings, Sun Microsystems, Inc.
Alok Choudhary, Northwestern University
September 19, 2006
This material is based on work supported by DARPA under Contract No. NBCH3039002

Outline
Motivation
Previous work
New shared memory solutions
Performance evaluation
Conclusion and future work
Page 2

Why Shared Memory?
Because it's there!
> For Phase II of DARPA's High Productivity Computer Systems program, Sun proposed a petascale shared memory system
Opportunity to improve performance without altering applications
> Shared memory typically has lower latency and lower overhead (especially for small payloads) than message passing
> Change just the library to use shared memory
Interesting research area
> Most previous work on parallel I/O focuses on clusters
Page 3

A Common Parallel I/O Problem
Application accesses may be noncontiguous in memory and in the file
> If not optimized, this can result in tens of thousands of small POSIX I/O operations
For MPI-IO, two MPI derived datatypes specify the file and memory access patterns (see the sketch below)
[Figure: noncontiguous regions of Process 1 (P1) and Process 2 (P2) memory map to interleaved regions of the file; each arrow represents one of 8 I/O requests]
Page 4
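
A minimal, self-contained sketch (not from the talk) of how an MPI-IO application describes such a noncontiguous pattern: a derived datatype becomes the file view, and a single collective write hands the whole pattern to the library instead of issuing many small POSIX calls. The file name, block sizes, and strided layout below are illustrative assumptions.

    /* Sketch: each rank owns every nprocs-th block of the file,
     * expressed as an MPI derived datatype used as the file view.
     * Compile with: mpicc -o view_demo view_demo.c */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int blocklen = 1024;      /* ints per contiguous block */
        const int nblocks  = 64;        /* blocks owned by this rank */

        /* File pattern: one block of ints, repeated every nprocs blocks. */
        MPI_Datatype filetype;
        MPI_Type_vector(nblocks, blocklen, blocklen * nprocs, MPI_INT, &filetype);
        MPI_Type_commit(&filetype);

        int *buf = malloc((size_t)nblocks * blocklen * sizeof(int));
        for (int i = 0; i < nblocks * blocklen; i++) buf[i] = rank;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "/tmp/view_demo.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* The view starts at this rank's first block; "native" keeps the
         * in-memory data representation. */
        MPI_Offset disp = (MPI_Offset)rank * blocklen * sizeof(int);
        MPI_File_set_view(fh, disp, MPI_INT, filetype, "native", MPI_INFO_NULL);

        /* One collective call describes the whole noncontiguous access,
         * letting the library optimize it. */
        MPI_File_write_all(fh, buf, nblocks * blocklen, MPI_INT, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&filetype);
        free(buf);
        MPI_Finalize();
        return 0;
    }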

Previous Solutions 1
Data sieving I/O
Each process locks and reads a contiguous block, fills in the altered data, then writes back and unlocks (see the sketch below)
[Figure: P1 and P2 memory regions gathered through a data-sieving buffer, for a total of 4 I/O requests against the file]
Page 5
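
A minimal sketch of the data-sieving idea in plain POSIX C (illustrative, not the ROMIO implementation; locking is elided, and the sieve_write helper and piece list are assumptions): read the covering extent once, patch the small pieces in memory, then write the extent back once.

    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    struct piece { off_t off; size_t len; const char *src; };

    /* Assumed helper: write the pieces that fall in [start, start+span). */
    static int sieve_write(int fd, off_t start, size_t span,
                           const struct piece *p, int npieces)
    {
        char *buf = malloc(span);
        if (!buf) return -1;

        /* 1. Read the whole covering extent (one large request). */
        if (pread(fd, buf, span, start) < 0) { free(buf); return -1; }

        /* 2. Patch each small piece into the buffer. */
        for (int i = 0; i < npieces; i++)
            memcpy(buf + (p[i].off - start), p[i].src, p[i].len);

        /* 3. Write the extent back (one large request). */
        ssize_t rc = pwrite(fd, buf, span, start);
        free(buf);
        return rc < 0 ? -1 : 0;
    }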

Previous Solutions 2
List I/O
Each process creates a list of memory regions and a list of file regions, then calls a new filesystem interface
[Figure: P1 and P2 each pass a (memory list, file list) pair, producing 2 I/O requests against the file]
Datatype I/O
Each process creates a small data structure describing repeating regions in memory and in the file, then calls a new filesystem interface
Page 6

Previous Solutions 3
Two-phase collective
Each process sends a round of data to each aggregator. The aggregator(s) receive and merge the data into a buffer, then make large write call(s) to the filesystem; repeat until done. (A sketch follows below.)
[Figure: P1 and P2 send data to P2, the aggregator, which receives, merges into its buffer, and writes to the file]
Page 7
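
A minimal two-phase sketch (illustrative, not ROMIO's code): rank 0 acts as the single aggregator; each round, every rank ships one equally sized chunk to it (exchange phase), and the aggregator issues one large write (I/O phase). The merge of noncontiguous pieces is simplified here to a contiguous gather, and the chunk size, round count, and file name are assumptions.

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    #define CHUNK  (1 << 20)   /* bytes contributed per rank per round */
    #define ROUNDS 4

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        char *mydata = malloc((size_t)ROUNDS * CHUNK);
        memset(mydata, 'a' + rank, (size_t)ROUNDS * CHUNK);

        char *agg_buf = (rank == 0) ? malloc((size_t)nprocs * CHUNK) : NULL;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "/tmp/two_phase_demo.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        for (int r = 0; r < ROUNDS; r++) {
            /* Phase 1: exchange - gather this round's chunks at the aggregator. */
            MPI_Gather(mydata + (size_t)r * CHUNK, CHUNK, MPI_BYTE,
                       agg_buf, CHUNK, MPI_BYTE, 0, MPI_COMM_WORLD);

            /* Phase 2: I/O - the aggregator writes one large contiguous block. */
            if (rank == 0) {
                MPI_Offset off = (MPI_Offset)r * nprocs * CHUNK;
                MPI_File_write_at(fh, off, agg_buf, nprocs * CHUNK,
                                  MPI_BYTE, MPI_STATUS_IGNORE);
            }
        }

        MPI_File_close(&fh);
        free(mydata);
        free(agg_buf);
        MPI_Finalize();
        return 0;
    }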

Using Shared Memory: mmap
Each process maps the file into its address space and copies data to the appropriate location in the mapped file (see the sketch below)
> Similar to List I/O but mostly implemented in the library
[Figure: P1 and P2 use loads/stores to copy memory data directly into the mapped file]
Page 8
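
A minimal POSIX sketch of the mmap approach (illustrative; the file name, size, and example offset are assumptions): each process maps the shared output file and stores its pieces directly at their file offsets.

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t file_size = 1 << 24;           /* 16 MB, pre-agreed size */
        int fd = open("/tmp/mmap_demo.dat", O_RDWR | O_CREAT, 0644);
        if (fd < 0) return 1;
        if (ftruncate(fd, (off_t)file_size) != 0) return 1;

        char *map = mmap(NULL, file_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) return 1;

        /* Copy this process's pieces straight to their file offsets; in the
         * real library the (offset, length) pairs come from the MPI datatypes
         * describing the access pattern. */
        const char piece[] = "hello from one process";
        memcpy(map + 4096, piece, sizeof piece);     /* example offset */

        msync(map, file_size, MS_SYNC);              /* force data to disk */
        munmap(map, file_size);
        close(fd);
        return 0;
    }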

Using Shared Memory: Collectives
Collective Shared Data: each aggregator copies data between its working buffer and shared application memory
Collective Shared Buffer: each process copies data between its application memory and the aggregator(s)' shared working buffer(s) (see the sketch below)
[Figure: P1 uses loads/stores to place memory data in the shared buffer of P2, the aggregator, which writes to the file]
Page 9
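
An illustrative sketch of the collective-shared-buffer idea. The original work used Sun's shared-memory system directly; this sketch substitutes MPI-3 shared-memory windows (MPI_Win_allocate_shared), which did not exist at the time, so treat it as an analogy rather than the authors' implementation. The buffer size, file name, and rank-0 aggregator choice are assumptions.

    #include <mpi.h>
    #include <string.h>

    #define BUF_BYTES (1 << 22)   /* 4 MB shared working buffer */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        MPI_Comm node;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node);
        int rank, nprocs;
        MPI_Comm_rank(node, &rank);
        MPI_Comm_size(node, &nprocs);

        /* Rank 0 (the aggregator) allocates the shared buffer; others attach. */
        char *base;
        MPI_Win win;
        MPI_Aint sz = (rank == 0) ? BUF_BYTES : 0;
        MPI_Win_allocate_shared(sz, 1, MPI_INFO_NULL, node, &base, &win);
        if (rank != 0) {
            MPI_Aint qsize; int disp;
            MPI_Win_shared_query(win, 0, &qsize, &disp, &base);
        }

        /* Each process copies its share directly into the shared buffer. */
        size_t share = BUF_BYTES / nprocs;
        memset(base + rank * share, 'a' + rank, share);
        MPI_Win_fence(0, win);    /* barrier + memory sync: stores visible to rank 0 */

        if (rank == 0) {
            MPI_File fh;
            MPI_File_open(MPI_COMM_SELF, "/tmp/csb_demo.dat",
                          MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
            MPI_File_write(fh, base, BUF_BYTES, MPI_BYTE, MPI_STATUS_IGNORE);
            MPI_File_close(&fh);
        }

        MPI_Win_free(&win);
        MPI_Comm_free(&node);
        MPI_Finalize();
        return 0;
    }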

Datatype Iterators
Problem: copy driven by an (offset, length) list
> Huge list thrashes the processor cache
> List generation is expensive and delays I/O
Solution: a datatype iterator tracks position in the MPI datatype and returns the next (offset, length) on demand
> State fits in a handful of cache lines
> Tiny startup cost; the higher traversal cost can overlap I/O
[Figure: a datatype iterator keeps a small datatype-stack cursor over the MPI datatype, in place of an explicit (offset, length) list (offsets 0 ... 983,039)]
Page 10

Overlapping I/O
Strategy: split the working buffer into sub-buffers (see the sketch below)
> After a sub-buffer is filled, initiate asynchronous I/O
> Before filling the next sub-buffer, wait for the previous asynchronous I/O on it to complete
> Overlaps I/O and data rearrangement!
Performance gain for collective shared buffer on the FLASH I/O benchmark:
> 60% with lists
> 90% with datatype iterators
Page 11
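
An illustrative double-buffering sketch with POSIX AIO showing the sub-buffering pattern described above (not the paper's code): fill one sub-buffer while previously filled sub-buffers are still being written asynchronously. The sub-buffer count and size, the fill_subbuffer() helper, and the minimal error handling are assumptions; link with -lrt on Linux.

    #include <aio.h>
    #include <signal.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NSUB     4
    #define SUB_SIZE (8 << 20)   /* 8 MB per sub-buffer */

    /* Hypothetical helper: rearrange the next chunk of application data
     * into buf; returns bytes produced, 0 when done. */
    extern size_t fill_subbuffer(char *buf, size_t cap);

    int overlapped_write(int fd)
    {
        static char bufs[NSUB][SUB_SIZE];
        struct aiocb cbs[NSUB];
        int pending[NSUB] = {0};
        off_t off = 0;

        for (int i = 0; ; i = (i + 1) % NSUB) {
            /* Wait for the previous write on this sub-buffer to finish. */
            if (pending[i]) {
                const struct aiocb *list[1] = { &cbs[i] };
                aio_suspend(list, 1, NULL);
                if (aio_return(&cbs[i]) < 0) return -1;
                pending[i] = 0;
            }

            /* Fill the sub-buffer: data rearrangement overlaps the I/O
             * still in flight on the other sub-buffers. */
            size_t n = fill_subbuffer(bufs[i], SUB_SIZE);
            if (n == 0) break;

            /* Kick off an asynchronous write and move on immediately. */
            memset(&cbs[i], 0, sizeof cbs[i]);
            cbs[i].aio_fildes = fd;
            cbs[i].aio_buf    = bufs[i];
            cbs[i].aio_nbytes = n;
            cbs[i].aio_offset = off;
            cbs[i].aio_sigevent.sigev_notify = SIGEV_NONE;
            if (aio_write(&cbs[i]) != 0) return -1;
            pending[i] = 1;
            off += (off_t)n;
        }

        /* Drain any writes still outstanding. */
        for (int i = 0; i < NSUB; i++)
            if (pending[i]) {
                const struct aiocb *list[1] = { &cbs[i] };
                aio_suspend(list, 1, NULL);
                if (aio_return(&cbs[i]) < 0) return -1;
            }
        return 0;
    }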

Performance Evaluation
Hardware: Sun Fire 6800, 24 × 1200 MHz processors, 150 MHz system bus, 96 GB memory, 4 × 1 Gb FC channels, 4 Sun StorEdge T3 disk arrays (T3 cache disabled)
Software: LAM 7.1.1, ROMIO 1.2.4, Solaris 9, Sun StorageTek QFS 4.5 (3 data + 1 metadata), 64-bit execution model
Bandwidth to data arrays: < 300 MB/s
Caveat: buffered reads benefit from a warm buffer cache!
Sun, StorageTek, Sun Fire, Sun StorEdge, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc., in the United States and other countries.
Page 12

Tile Reader Benchmark
Tiled display simulation
> File size: 7–37 MB
> From the Parallel I/O Benchmarking Consortium, Argonne
[Figure: data distribution for a 2×2 tile array, highlighting the data read by one process]
[Chart: aggregate read bandwidth (MB/s, 0–600) versus tile array dimensions (2×2 through 6×4; number of processes = product) for CSB-dt (dir), CSB-list (dir), CSD (dir), mmap (buf), 2PC (dir), and 2PC (buf); List I/O (buf), DS (dir), and DS (buf) omitted due to poor performance]
Page 13

ROMIO 3D Block Test
600 × 600 × 600 array of ints, block-distributed to processes
> Uneven data distribution for some process counts
> Fixed file size: 824 MB
[Figure: data distribution for 8 processes, highlighting the data accessed by one process]
Page 14

ROMIO 3D Block Test Results
[Charts: aggregate write bandwidth (MB/s, 0–300) and aggregate read bandwidth (MB/s, 0–1000) versus number of processes (4–24) for CSB-dt (dir), CSB-list (dir), CSD (dir), mmap (buf), 2PC (dir), and 2PC (buf); List I/O (buf), DS (dir), and DS (buf) omitted due to poor performance]
Page 15

FLASH I/O Benchmark
From Argonne/Northwestern
Checkpoint reorganizes the data to group values by variable
> 80 blocks per process
> Each element has 24 variables
[Figure: FLASH block structure (X/Y/Z axes, guard cells) and memory organization by block (Block 0 ... Block 79) versus file organization grouped by variable (Var 0 ... Var 23) across Proc 0 ... Proc N]
Page 16

FLASH I/O Benchmark Results
[Charts: aggregate write bandwidth (MB/s, 0–300). Left: 22 processes, varying cells along block edge (8–36), file size 165 MB–15 GB. Right: 20×20×20-cell blocks, varying number of processes (4–24), file size 469 MB–2.8 GB. Series: CSB-dt (dir), CSB-list (dir), CSD (dir), mmap (buf), List I/O (buf), 2PC (dir), 2PC (buf); DS (dir) and DS (buf) omitted due to poor performance]
Page 17

Conclusion
The combination of collective shared buffer, datatype iterators, and sub-buffering offered the best aggregate performance for several application I/O patterns
> Achieved 90% of available disk bandwidth
> 5× improvement over the two-phase collective
Rediscovered streaming I/O principles:
1. Reduce startup overhead (datatype iterators)
2. Overlap I/O and computation when possible (sub-buffering)
Page 18

Future Work
Apply datatype iterators to MPI messages
> Direct sender-to-receiver copy if shared memory is available
Apply datatype iterators to data sieving and the two-phase collective in ROMIO (currently list-based)
> Could benefit traditional clusters
Possible standardization of datatype iterators
> Required for use of datatype iterators in ROMIO if ROMIO is to remain portable across MPI implementations
Page 19

Acknowledgements
Harriet Coverston and Anton Rang of Sun Microsystems also contributed to this work.
This material is based on work supported by the US Defense Advanced Research Projects Agency under Contract No. NBCH3039002.
Page 20

EXPLOITING SHARED MEMORY TO IMPROVE PARALLEL I/O PERFORMANCE
Andrew B. Hastings, andrew.hastings@sun.com
Alok Choudhary, choudhar@ece.northwestern.edu
This material is based on work supported by DARPA under Contract No. NBCH3039002

Datatype Iterators Interface
Interfaces (a hypothetical header sketch follows below):
> dtc_next: advance cursor to next contiguous block, return (offset, length)
> dtc_size_seek / dtc_extent_seek: position cursor to a size or extent within the datatype
> dtc_size_tell / dtc_extent_tell: return the size or extent within the datatype corresponding to the cursor position
Simplifies implementation:
> Collective shared buffer required 62% fewer lines of code with datatype iterators than with lists
Page 22
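
A hypothetical C header sketch of this interface. The function names come from the slide; the cursor type, the dtc_init/dtc_free constructors, and all parameter types are assumptions added so the sketch is self-contained (the slide's own example on the next page uses a tuple-return pseudocode style instead of out-parameters).

    #include <mpi.h>

    typedef struct dtc_cursor dtc_cursor;   /* opaque: small stack of datatype state */

    /* Create/destroy a cursor over `count` elements of `datatype` starting at
     * displacement `disp` (assumed constructor, not named in the slides). */
    dtc_cursor *dtc_init(MPI_Datatype datatype, int count, MPI_Aint disp);
    void        dtc_free(dtc_cursor *cur);

    /* Advance the cursor to the next contiguous block and return its
     * (offset, length); returns 0 at the end of the datatype. */
    int dtc_next(dtc_cursor *cur, MPI_Aint *offset, MPI_Aint *length);

    /* Position the cursor at a given size (bytes of data traversed) or
     * extent (span of the layout) within the datatype. */
    int dtc_size_seek(dtc_cursor *cur, MPI_Aint size);
    int dtc_extent_seek(dtc_cursor *cur, MPI_Aint extent);

    /* Report the size or extent corresponding to the cursor position. */
    MPI_Aint dtc_size_tell(const dtc_cursor *cur);
    MPI_Aint dtc_extent_tell(const dtc_cursor *cur);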

Datatype Iterators Example
Copy (non-)contiguous application data directly to a (non-)contiguous shared working buffer:

    while (file_off + file_len <= end_off) {          // Entire file block still
                                                      // fits in current chunk
        while (file_len >= mem_len) {                 // Mem block fits in file block
            src = app_buf + mem_off;
            memcpy(dest, src, mem_len);               // Copy remaining mem block
            file_off += mem_len;
            file_len -= mem_len;
            dest += mem_len;
            (mem_off, mem_len) = dtc_next(mem_dtc);   // Get next mem block
        }
        while (mem_len >= file_len) {                 // File block fits in mem block
            dest = temp_buf + file_off - start_off;
            memcpy(dest, src, file_len);              // Copy remaining file block
            mem_off += file_len;
            mem_len -= file_len;
            src += file_len;
            (file_off, file_len) = dtc_next(file_dtc); // Get next file block
            if (file_off + file_len > end_off)
                break;
        }
    }
    // Elided: post-loop handling of tail end of file block

Page 23

Legend
CSB-dt: collective shared buffer with datatype iterators
> 1 aggregator, 32 MB buffer, 4 sub-buffers
CSB-list: collective shared buffer with lists
> 1 aggregator, 32 MB buffer, 4 sub-buffers
CSD: collective shared data (lists)
> All processes aggregators, 32 MB buffer, no sub-buffers
2PC: two-phase collective (lists)
> All processes aggregators, 16 MB buffer
DS: data sieving
> 8 MB buffer
Page 24