COSC 6374 Parallel Computation
Parallel I/O (I): I/O basics
Spring 2008

Concept of a cluster
[Figure: a compute node with two processors, memory, and local disks; two network cards connect it to the message-passing network and the administrative network]
I/O Problem (I)
- Every node has its own local disk; there is no globally visible file system
- Most applications require data and the executable to be locally available
  - e.g. an MPI application using 100 nodes requires the executable to be available on all nodes, in the same directory, under the same name

I/O Problem (II)
- Current processor performance: e.g. Pentium 4, 3 GHz ~ 6 GFLOPS
- Memory bandwidth: 133 MHz * 4 * 64 bit ~ 4.26 GB/s
- Current network performance:
  - Gigabit Ethernet: latency ~ 40 µs, bandwidth ~ 125 MB/s
  - InfiniBand 4x: latency ~ 5 µs, bandwidth ~ 1 GB/s
- Disk performance:
  - Latency: 7-12 ms
  - Bandwidth: ~ 20-60 MB/s
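The gap between these components is easiest to see with a simple transfer-time model, t = latency + size / bandwidth. The sketch below plugs in the figures from the slide; the disk values are midpoints of the quoted ranges, chosen here for illustration.

```python
# Rough transfer-time model: t = latency + size / bandwidth.
# Channel numbers are the slide's figures; real hardware varies.

def transfer_time(latency_s, bandwidth_bps, nbytes):
    """Seconds needed to move nbytes over one channel."""
    return latency_s + nbytes / bandwidth_bps

MB = 1 << 20
channels = {
    "Gigabit Ethernet": (40e-6, 125e6),
    "InfiniBand 4x":    (5e-6,  1e9),
    "Disk":             (10e-3, 40e6),  # midpoints of 7-12 ms, 20-60 MB/s
}
for name, (lat, bw) in channels.items():
    print(f"{name}: {transfer_time(lat, bw, MB) * 1e3:.2f} ms per MiB")
```

For a 1 MiB transfer the disk latency alone already dwarfs the network latencies, which is why overlapping and batching I/O matters so much.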
Basic characteristics of storage devices
- Capacity: amount of data a device can store
- Transfer rate or bandwidth: amount of data a device can read/write in a certain amount of time
- Access time or latency: delay before the first byte is moved

Prefix       Abbreviation   Base ten   Base two
kilo, kibi   K, Ki          10^3       2^10 = 1024
mega, mebi   M, Mi          10^6       2^20
giga, gibi   G, Gi          10^9       2^30
tera, tebi   T, Ti          10^12      2^40
peta, pebi   P, Pi          10^15      2^50

UNIX File Access Model (I)
- A file is a sequence of bytes
- When a program opens a file, the file system establishes a file pointer: an integer indicating the position in the file where the next byte will be read/written
- Multiple processes can open a file concurrently; each process has its own file pointer
- No conflicts occur when multiple processes read the same file
- If several processes write to the same location, most UNIX file systems guarantee sequential consistency: the data from one of the processes will end up in the file, but not a mixture from several processes
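The per-process file pointer can be demonstrated with POSIX calls. In the sketch below, two file descriptors on the same file stand in for two processes; the temporary file name is incidental.

```python
import os
import tempfile

# Each open() of a file yields an independent file pointer, mirroring
# what separate processes see when they open the same file concurrently.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"abcdefgh")
tmp.close()

fd1 = os.open(tmp.name, os.O_RDONLY)
fd2 = os.open(tmp.name, os.O_RDONLY)

first = os.read(fd1, 4)                # advances only fd1's pointer
second = os.read(fd2, 2)               # fd2 still starts at offset 0
pos1 = os.lseek(fd1, 0, os.SEEK_CUR)   # where fd1's pointer now sits

os.close(fd1)
os.close(fd2)
os.unlink(tmp.name)
print(first, second, pos1)
```

Reading through `fd1` does not disturb `fd2`'s position, which is exactly why concurrent readers never conflict.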
UNIX File Access Model (II)
- Disk drives read and write data in fixed-size units (disk sectors)
- File systems allocate space in blocks, each a fixed number of contiguous disk sectors
- In UNIX-based file systems, the blocks that hold a file's data are listed in an inode; an inode contains the information needed to find all the blocks that belong to a file
- If a file is too large for an inode to hold the whole list of blocks, intermediate nodes (indirect blocks) are introduced

Write operations
- Write: the file system copies bytes from the user buffer into a system buffer; when the buffer fills up, the system sends the data to disk
- System buffering:
  + allows the file system to collect full blocks of data before sending them to disk
  + the file system can send several blocks to the disk at once (delayed write or write-behind)
  - data is not actually saved in the case of a system crash
  - for very large write operations, the additional copy from user to system buffer could/should be avoided
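The layers of write buffering can be made visible from user code. In the sketch below the mapping of Python/POSIX calls onto the slide's "system buffer" is my interpretation, and the file name is illustrative.

```python
import os

# Three buffering layers on the write path (a sketch):
#   f.write()  -> user-space buffer
#   f.flush()  -> kernel buffer cache (the slide's "system buffer");
#                 the actual disk write may still be delayed
#   os.fsync() -> forces dirty blocks to disk, closing the window in
#                 which a system crash would lose "written" data
with open("out.dat", "wb") as f:
    f.write(b"x" * 4096)   # lands in the user buffer
    f.flush()              # copied to the system buffer
    os.fsync(f.fileno())   # on return, the data is on stable storage

size = os.path.getsize("out.dat")
os.unlink("out.dat")
```

Applications that need durability (databases, checkpointing codes) must call `fsync` explicitly; a successful `write` alone does not survive a crash.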
Read operations
- Read:
  - the file system determines which blocks contain the requested data
  - the blocks are read from disk into a system buffer
  - the data is copied from the system buffer into user memory
- System buffering:
  + the file system always reads a full block (file caching)
  + if the application reads data sequentially, prefetching (read-ahead) can improve performance
  - prefetching hurts performance if the application has a random access pattern

File system operations
- Caching and buffering improve performance by
  - avoiding repeated access to the same block
  - allowing the file system to smooth out I/O behavior
- Non-blocking I/O gives users control over prefetching and delayed writing:
  - initiate read/write operations as soon as possible
  - wait for the completion of the read/write operations only when absolutely necessary
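On POSIX systems an application can steer prefetching explicitly instead of relying on the kernel's guess. The sketch below uses the `posix_fadvise` hint (Unix-only, hence the guard); the file name and size are illustrative.

```python
import os

# Create a small test file to read back.
with open("data.bin", "wb") as f:
    f.write(os.urandom(1 << 16))

fd = os.open("data.bin", os.O_RDONLY)

# POSIX_FADV_SEQUENTIAL asks the kernel to read ahead aggressively;
# POSIX_FADV_RANDOM would disable read-ahead for random access patterns.
if hasattr(os, "posix_fadvise"):     # POSIX-only hint; no-op elsewhere
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)

data = os.read(fd, 1 << 16)
os.close(fd)
os.unlink("data.bin")
```

This is the same trade-off the slide describes: the hint helps a sequential scan and would actively hurt a random-access workload.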
Distributed File Systems vs. Parallel File Systems
- Distributed file systems offer access to a collection of files on remote machines
  - typically a client-server approach
  - transparent to the user
- Concurrent access to the same file from several processes is considered an unlikely event, in contrast to parallel file systems, where it is considered a standard operation
- Distributed file systems assume different numbers of processors than parallel file systems
- Distributed file systems have different security requirements than parallel file systems

NFS - Network File System
- Protocol for a remote file service
- Client-server approach
- Stateless server (v3)
- Communication based on RPC (Remote Procedure Call)
- NFS provides session semantics: changes to an open file are initially visible only to the process that modified the file
- File locking is not part of the NFS protocol (v3)
  - file locking is handled by a separate protocol/daemon
  - locking of blocks is often supported
- Client caching is not part of the NFS protocol (v3); behavior depends on the implementation, e.g. allowing cached data to be stale for up to 30 seconds
NFS in a cluster
- The front-end node hosts the file server

NFS in a cluster (II)
- All file operations are remote operations: the file server (= NFS server) becomes a bottleneck
- Extensive use of file locking is required to implement the sequential consistency of UNIX I/O
- Communication between client and server typically uses the slow communication channel of the cluster
- Do we use the several disks at all?
- Some inefficiencies in the specification, e.g. a read operation involves two RPC operations:
  - look up the file handle
  - read request
Parallel I/O
- Basic idea: disk striping
  - Stripe factor: number of disks
  - Stripe depth: size of each block

Disk striping
- Requirements for improving disk performance:
  - multiple physical disks
  - separate I/O channels to each disk
  - data transfer to all disks simultaneously
- Problems of simple disk striping:
  - a minimum stripe depth (sector size) is required for optimal disk performance; since file size is limited, the number of disks that can be used in parallel is limited as well
  - loss of a single disk makes the entire file useless, and the risk of losing a disk is proportional to the number of disks used
    => RAID (Redundant Arrays of Independent Disks, see lecture 2)
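The striping arithmetic can be written down directly. `stripe_map` below is an illustrative helper, not the API of any real file system: it maps a logical byte offset onto (disk index, offset on that disk) given the stripe depth and factor.

```python
# Map a logical file offset onto a striped set of disks (a sketch).
# stripe_depth  = bytes per stripe block
# stripe_factor = number of disks

def stripe_map(offset, stripe_depth, stripe_factor):
    block = offset // stripe_depth        # logical block number
    disk = block % stripe_factor          # round-robin disk assignment
    disk_block = block // stripe_factor   # block index on that disk
    disk_offset = disk_block * stripe_depth + offset % stripe_depth
    return disk, disk_offset

# With 4 disks and 64 KiB blocks, byte 300 000 is logical block 4,
# which lands back on disk 0 as that disk's second block:
print(stripe_map(300_000, 65_536, 4))   # (0, 103392)
```

Consecutive logical blocks land on consecutive disks, which is what lets a large sequential read engage all disks at once.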
Parallel File Systems
- Goals:
  - several processes should be able to access the same file concurrently
  - several processes should be able to access the same file efficiently
- Problems:
  - UNIX sequential consistency semantics
  - handling of file pointers
  - caching and buffering

Concurrent file access - logical view
- The number of compute nodes and I/O nodes need not match
[Figure: blocks 1-4 from the compute nodes form one logical shared file; the blocks are distributed round-robin across the I/O nodes and their disks]
Concurrent file access - opening a file
- Each I/O node has a subset of the blocks; the file system needs to look up where the file resides:
  - each I/O node maintains its own directory information, or
  - a centralized name service is used
- The file system needs to look up the striping factor (often fixed)
- When creating a new file, the file system has to choose different I/O nodes for holding the first block, to avoid contention

Concurrent write operations
- How to ensure sequential consistency?
- File locking prevents parallelism even if processes write to different locations in the same file (false sharing)
- Better: locking of individual blocks
- Parallel file systems often offer two consistency models:
  - sequential consistency
  - a relaxed consistency model: the application is responsible for preventing overlapping write operations
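Block-granular locking can be sketched with POSIX byte-range locks (`fcntl.lockf`). Everything here (`write_block`, the block size, the file name) is illustrative; the point is that writers of disjoint blocks never contend, unlike whole-file locking.

```python
import fcntl
import os

# Lock individual blocks instead of the whole file (a sketch).
BLOCK = 4096
fd = os.open("shared.dat", os.O_RDWR | os.O_CREAT, 0o644)

def write_block(fd, block_no, data):
    offset = block_no * BLOCK
    # Exclusive lock covering exactly this block's byte range.
    fcntl.lockf(fd, fcntl.LOCK_EX, BLOCK, offset, os.SEEK_SET)
    try:
        os.pwrite(fd, data, offset)
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN, BLOCK, offset, os.SEEK_SET)

write_block(fd, 0, b"a" * BLOCK)
write_block(fd, 3, b"b" * BLOCK)   # disjoint block: no contention with block 0

blk0 = os.pread(fd, BLOCK, 0)
blk3 = os.pread(fd, BLOCK, 3 * BLOCK)
os.close(fd)
os.unlink("shared.dat")
```

A second process locking block 3 would block only a concurrent writer of block 3, so writes to different blocks proceed in parallel.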
File pointers
- In UNIX, every process has a separate file pointer (individual file pointers)
- Shared file pointers are often useful (e.g. reading the next piece of work, writing a parallel log file)
  - on distributed memory machines: slow, since somebody has to coordinate the file pointer
  - can be fast on shared memory machines
- General problems: file pointer atomicity, non-blocking I/O
- Explicit file offset operations: each process tells the file system where to read/write in the file; no update to file pointers!

Buffering and caching
- Client buffering: buffering at the compute nodes
  - consistency problems (e.g. one node writes, another tries to read the same data)
- Server buffering: buffering at the I/O nodes
  - prevents concatenating several small requests into a single large one => produces lots of traffic
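Explicit-offset operations exist in POSIX as `pread`/`pwrite`: the offset travels with the call, so the file pointer is never touched and processes need no coordination on it. A small sketch:

```python
import os
import tempfile

# pread/pwrite carry the offset as an argument and leave the
# file pointer untouched -- no coordination needed between processes.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"0" * 16)
tmp.close()

fd = os.open(tmp.name, os.O_RDWR)
os.pwrite(fd, b"ABCD", 8)              # write 4 bytes at offset 8
pos = os.lseek(fd, 0, os.SEEK_CUR)     # file pointer still at 0
back = os.pread(fd, 4, 8)              # read them back from offset 8

os.close(fd)
os.unlink(tmp.name)
print(pos, back)
```

This is the model MPI-I/O's `MPI_File_read_at`/`MPI_File_write_at` routines follow as well.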
Example of a parallel file system: xfs
- Anderson et al., 1995
- Storage server: stores parts of a file
- Metadata manager: keeps track of data blocks
- Client: processes user requests
[Figure: clients, managers, and storage servers connected by a network]

xfs continued
- Communication based on active messages; uses fast networking infrastructure
- Log-based file system:
  - modifications to a file are written to a log and collectively written to disk
  - to find a data block, a separate table (imap) holds inode references to positions in the log
  - the log is distributed among several processes using RAID techniques
- Storage servers are organized in stripe groups, i.e. not all storage servers participate in all operations
  - a globally replicated table stores which server belongs to which stripe group
- Each file has a manager associated with it
  - manager map: identifies the manager for a specific file
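The log-plus-imap idea can be sketched in a few lines. Everything below (names, block size, the in-memory log) is illustrative and ignores striping, segments, and crash recovery; it shows only the core mechanism: an overwrite is an append to the log plus an imap update.

```python
# Minimal sketch of a log-structured write path: updates are appended
# to a log, and an imap records the newest position of each block.

log = bytearray()     # the append-only log (on disk in a real system)
imap = {}             # (file_id, block_no) -> offset of newest copy
BLOCK = 8             # tiny block size, for illustration only

def log_write(file_id, block_no, data):
    assert len(data) == BLOCK
    imap[(file_id, block_no)] = len(log)   # newest copy wins
    log.extend(data)                       # append, never overwrite

def log_read(file_id, block_no):
    off = imap[(file_id, block_no)]
    return bytes(log[off:off + BLOCK])

log_write(1, 0, b"version1")
log_write(1, 0, b"version2")   # "overwrite" = append + imap update
print(log_read(1, 0))          # b'version2'
print(len(log))                # 16: both versions still live in the log
```

The stale first copy stays in the log until a cleaner reclaims it, which is why log-structured systems need segment cleaning.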
xfs continued again
- Starting point: file name and data
- The directory returns a file id
- The manager map returns the metadata manager
- The metadata manager returns the exact location of the inode in the log: stripe group id, segment id, and offset within the segment
- The client computes on which server the block really is
[Figure: lookup chain - file, data -> directory -> fid -> manager map -> imap -> stripe group map]

Client caching in xfs
- xfs maintains a local block cache (based on block caching)
- A request for write permission transfers ownership
- The manager keeps track of where a file block is cached
- Collaborative caching: the manager transfers the most recent version of a data block directly from one cache into another
xfs versus NFS

Issue               NFS v3                   xfs
Design goals        Access transparency      Server-less system
Access model        Remote                   Log-based
Communication       RPC                      Active messages
Client process      Thin                     Fat
Server groups       No                       Yes
Name space          Per client               Global
Sharing semantics   Session                  UNIX
Caching unit        Implementation dep.      Block
Fault tolerance     Reliable communication   Striping

Summary
- Parallel I/O is a means to decrease file I/O access times in parallel applications
- Performance-relevant factors:
  - stripe factor
  - stripe depth
  - buffering and caching
  - non-blocking I/O
- Parallel file systems offer support for concurrent access by several processes to the same file:
  - individual file pointers
  - shared file pointers
  - explicit offsets
- Distributed file systems are a poor replacement for parallel file systems