Distributed RAID Architectures for Cluster I/O Computing
Kai Hwang, Internet and Cluster Computing Lab., University of Southern California
Presentation Outline:
- Scalable Cluster I/O
- The RAID-x Architecture
- Cooperative Disk Drivers
- Benchmark Experiments
- Security and Fault Tolerance
- Conclusions
Scalable clusters providing single-system-image (SSI) services are gradually replacing SMP, CC-NUMA, and MPP machines in servers, Web sites, and database centers.
K. Hwang, March 15, 2001, Beijing
Issues in Cluster Design
- Size scalability (physical and application)
- Enhanced availability (failure management)
- Single system image (middleware, OS extensions)
- Fast communication (networks and protocols)
- Load balancing (CPU, network, memory, disk)
- Security and encryption (clusters of clusters)
- Distributed environment (user-friendly)
- Manageability (jobs and resources)
- Programmability (simple API required)
- Applicability (cluster- and grid-awareness)
The USC Trojans Cluster Project, Internet and Cluster Computing Lab., EEB Rm. 104
- Sixteen Pentium PCs are housed in two 9-ft computer racks
- All PCs run RedHat Linux v6.0 (kernel v2.2.5)
- All nodes are connected by a 100 Mbps Fast Ethernet
- The cluster is ported with DQS, LSF, MPI, PVM, TreadMarks, Elias, the NAS benchmarks, etc.
- Scalable to a future system with 100s of PC nodes interconnected by Gigabit networks
http://andy.usc.edu/trojan/
Trojans Linux Cluster with Middleware for Security and Checkpoint Recovery
[Figure: layered architecture. Programming environments (Java, EDI, HTML, XML), a Web-window user interface, and other subsystems (database, OLTP, etc.) sit on a single-system-image and availability infrastructure and on security and checkpointing middleware, which run over Linux on Pentium PC nodes joined by a Gigabit network interconnect]
An I/O-Centric Cluster Architecture
[Figure: clients reach an entry partition over the Internet/intranet; service and database partitions are linked by Fast Ethernet, with separate service flows and data flows]
Distributed RAID Embedded in Clusters or Storage-Area Networks
- I/O bottleneck in scalable cluster computing: the gap between CPU/memory speed and disk I/O widens as the microprocessor doubles in speed every year, and cluster applications are often I/O-bound.
- Disks attached to hosts are subject to failure of the hosts themselves. A distributed RAID achieves much higher availability through fault isolation, rollback recovery, and automatic file migration.
Distributed RAID with a single I/O space embedded in a cluster
[Figure: disks attached to workstations or PCs form one distributed RAID, reachable by every node across the cluster network (SAN or LAN)]
Research Projects on Parallel and Distributed RAID

USC Trojans RAID-x
  RAID architecture/environment: orthogonal striping and mirroring in a Linux cluster
  Enabling mechanism for SIOS: cooperative device drivers in the Linux kernel
  Data consistency checking: locks at the device-driver level
  Reliability and fault tolerance: orthogonal striping and mirroring

Princeton TickerTAIP
  RAID architecture/environment: RAID-5 with multiple controllers
  Enabling mechanism for SIOS: single RAID server implementation
  Data consistency checking: sequencing of user requests
  Reliability and fault tolerance: parity checks in RAID-5

Digital Petal
  RAID architecture/environment: chained declustering in a Unix cluster
  Enabling mechanism for SIOS: Petal device drivers at the user level
  Data consistency checking: Lamport's Paxos algorithm
  Reliability and fault tolerance: chained declustering

Berkeley Tertiary Disk
  RAID architecture/environment: RAID-5 built with a PC cluster
  Enabling mechanism for SIOS: xFS storage servers at the file level
  Data consistency checking: modified DASH protocol in the xFS file system
  Reliability and fault tolerance: SCSI disks with parity in RAID-5

HP AutoRAID
  RAID architecture/environment: hierarchical, combining RAID-1 and RAID-5
  Enabling mechanism for SIOS: disk array within a single controller
  Data consistency checking: marks used to update the parity disk
  Reliability and fault tolerance: mirroring and parity checks
Four RAID architectures using different mirroring and parity-checking schemes
[Figure, four panels of block layouts across four disks: (a) striped mirroring in RAID-10, data blocks B paired with mirror blocks M on adjacent disks; (b) parity checking in RAID-5, with rotating parity blocks P; (c) orthogonal striping and mirroring (OSM) in RAID-x, with data blocks striped across all disks and the mirror images of each stripe clustered on a single different disk; (d) skewed striping in a chained-declustering RAID]
Theoretical Peak Performance of Four RAID Architectures
(n: number of disks; B: bandwidth of one disk; m: number of blocks accessed; r, w: per-block read/write times; R, W: latencies of one small read/write)

                               RAID-10    RAID-5      Chained Decl.   RAID-x
Max. I/O       Read            nB         nB          nB              nB
bandwidth      Large write     nB         (n-1)B      nB              nB
               Small write     nB         nB/2        nB              nB
Parallel       Large read      mr/n       mr/n        mr/n            mr/n
read/write     Small read      R          R           R               R
time           Large write     2mw/n      mw/(n-1)    2mw/n           mw/n + mw/(n(n-1))
               Small write     2W         R+W         2W              W
Max. fault coverage            n/2 disk   single disk n/2 disk        single disk
                               failures   failure     failures        failure
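To make the table concrete, the snippet below evaluates the parallel large-write time formulas from the table for one sample configuration. The values of n, m, and w are illustrative only, not measurements from the Trojans cluster.

```python
# Evaluate the parallel large-write time formulas from the table above.
# n: disks, m: blocks written, w: time to write one block.
# The numbers below are illustrative, not measured values.
n, m, w = 8, 1000, 0.01  # 8 disks, 1000 blocks, 10 ms per block

times = {
    "RAID-10":              2 * m * w / n,            # every block written twice
    "RAID-5":               m * w / (n - 1),          # one disk's worth of parity
    "Chained declustering": 2 * m * w / n,            # primary + chained copy
    "RAID-x":               m * w / n + m * w / (n * (n - 1)),  # deferred mirror writes
}
for arch, t in sorted(times.items(), key=lambda kv: kv[1]):
    print(f"{arch:22s} {t:.3f} s")
```

Note that mw/n + mw/(n(n-1)) simplifies algebraically to mw/(n-1), so RAID-x matches RAID-5's large-write time without computing parity, while RAID-10 and chained declustering pay the full 2mw/n for writing every block twice in the foreground.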
Distributed RAID-x Architecture
[Figure: four cluster nodes (each a processor/memory pair with a CDD) on the cluster network, each attaching three disks (D0-D11). Data blocks B0-B35 are striped across the twelve disks, and the mirror images M0-M35 of each stripe are clustered in the lower portion of a disk on a different node]
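The OSM placement shown in the figures can be captured by a simple address map. The sketch below is a hypothetical reconstruction from the layout figures, not the actual CDD code: data block i is striped onto disk i mod n, and the images of each run of n-1 consecutive blocks are clustered on the one disk that holds none of their data, so no disk ever stores a block together with its own mirror.

```python
# Sketch of RAID-x orthogonal striping and mirroring (OSM) placement,
# reconstructed from the layout figures (hypothetical, not the CDD code).
# n is the number of disks in one stripe group.

def data_disk(i: int, n: int) -> int:
    """Data block B_i is striped across the disks round-robin."""
    return i % n

def mirror_disk(i: int, n: int) -> int:
    """Images of each run of n-1 consecutive blocks are clustered on
    the single disk that holds none of those blocks' data copies."""
    group = i // (n - 1)          # which run of n-1 blocks B_i falls in
    return (n - 1 - group) % n    # the disk left out of that run

if __name__ == "__main__":
    n = 4
    for i in range(12):
        assert data_disk(i, n) != mirror_disk(i, n)  # orthogonality
        print(f"B{i}: data on disk {data_disk(i, n)}, "
              f"image on disk {mirror_disk(i, n)}")
```

With n = 4 this reproduces the OSM panel of the four-architecture figure: images M0-M2 cluster on disk 3, M3-M5 on disk 2, and so on, so a single disk failure never destroys both copies of a block.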
Single I/O space in a distributed RAID enabled by CDDs at the Linux kernel level
[Figure: (a) separate disks driven by independent disk drivers (IDDs), with a central NFS server on the interconnection network; (b) a global virtual disk with a single I/O space (SIOS) formed by cooperative disk drivers (CDDs) on every cluster node]
Remote disk access using a central NFS server versus cooperative disk drivers in the RAID-x cluster
[Figure: (a) parallel disk I/O through NFS in a client/server cluster, where a user application crosses into the kernel-level NFS client, then the NFS server, and finally a traditional device driver on the server side; (b) CDDs achieving a SIOS in a serverless cluster, where the client-side CDD talks directly to the server-side CDD and the NFS server is bypassed]
Architectural design of cooperative disk drivers
[Figure: (a) device masquerading, where each node's CDD presents remote physical disks as local virtual disks by communicating through the network; (b) the CDD architecture, comprising a data-consistency module, a storage manager, and a CDD client module]
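Device masquerading reduces to a dispatch step in the driver's request path: a block address in the single I/O space resolves to a (node, disk, local block) triple, and the CDD either services the request on a local disk or forwards it over the cluster network to the peer CDD. The following is a minimal user-level model of that dispatch with hypothetical names; the real CDD lives inside the Linux 2.2 kernel.

```python
# User-level model of a cooperative disk driver's request dispatch
# (names and the fixed disk size are hypothetical).

BLOCKS_PER_DISK = 1 << 20  # assume equal-size local disks

def resolve(global_block: int, disks_per_node: int):
    """Map a block in the single I/O space to (node, disk, local block)."""
    disk, local = divmod(global_block, BLOCKS_PER_DISK)
    node, disk_on_node = divmod(disk, disks_per_node)
    return node, disk_on_node, local

class CDD:
    def __init__(self, my_node: int, disks_per_node: int):
        self.my_node = my_node
        self.disks_per_node = disks_per_node

    def read_block(self, global_block: int) -> str:
        node, disk, local = resolve(global_block, self.disks_per_node)
        if node == self.my_node:
            return f"local read: disk {disk}, block {local}"
        # Remote disks masquerade as local ones: the request is
        # shipped to the peer CDD over the cluster network.
        return f"forward to CDD on node {node}: disk {disk}, block {local}"
```

Because every node runs the same dispatch logic, any node can open any block of the global virtual disk without going through a central file server.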
Maintaining consistency of the global directory /sios by all CDDs in the distributed RAID-x
[Figure: four cluster nodes each mount an identical view of /sios (dir1 and dir2, holding file1, file2, file3); the CDDs keep the replicated directory consistent across the cluster network]
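The comparison table earlier attributes RAID-x's data consistency to locks at the device-driver level. A toy model of how a /sios metadata update could propagate under such a lock is sketched below; it is entirely illustrative, since the real serialization happens inside the in-kernel CDDs, not in user-space Python.

```python
import threading

# Toy model of keeping the replicated /sios directory identical on all
# nodes: every metadata update is serialized by one lock and then applied
# to every node's replica. Illustrative only; the actual RAID-x locking
# is done at the device-driver level.

class SiosDirectory:
    def __init__(self, num_nodes: int):
        self.replicas = [dict() for _ in range(num_nodes)]  # /sios per node
        self.lock = threading.Lock()  # stands in for the driver-level lock

    def create_file(self, path: str, meta: dict):
        with self.lock:                    # serialize concurrent updates
            for replica in self.replicas:  # apply to every node's view
                replica[path] = dict(meta)

    def consistent(self) -> bool:
        return all(r == self.replicas[0] for r in self.replicas)
```

Serializing updates before fan-out is what lets every application see the same /sios tree regardless of which node it runs on.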
Elapsed time executing the Andrew benchmark on the USC Linux cluster
[Charts: elapsed time (s) vs. number of clients (1, 4, 8, 12, 16) for the five Andrew phases (Make Dir, Copy Files, Scan Dir, Read File, Compile); left panel: NFS results, right panel: RAID-x results]
Parallel write performance of four RAID architectures against traffic rate
[Chart: aggregate bandwidth (MB/s) vs. number of clients (1-16), with 20 MB written per client, comparing RAID-x, chained declustering, RAID-10, RAID-5, and NFS]
Parallel write performance of four RAID architectures vs. disk array size
[Chart: aggregate bandwidth (MB/s) for parallel writes vs. number of disks (2-16), comparing RAID-x, chained declustering, RAID-10, and RAID-5]
Achievable I/O Bandwidth and Improvement Factor on the Trojans Cluster
(bandwidth in MB/s; "improve" is the improvement factor from 1 to 16 clients)

              NFS                               RAID-x
Operation     1 client  16 clients  Improve     1 client  16 clients  Improve
Large read    2.58      2.30        0.89        2.59      15.63       6.03
Large write   2.11      2.77        1.31        2.92      15.29       5.24
Small write   2.47      2.81        1.34        2.35      15.10       6.43

              Chained declustering              RAID-10
Operation     1 client  16 clients  Improve     1 client  16 clients  Improve
Large read    2.46      15.80       6.42        2.37      10.76       4.54
Large write   2.62      12.63       4.82        2.31      9.96        4.31
Small write   2.31      12.54       5.43        2.27      9.98        4.39
Effects of stripe unit size on I/O bandwidth of the RAID architectures
[Chart: aggregate bandwidth (MB/s) vs. stripe unit size (16, 32, 64, 128 KB) for large writes (320 MB from 16 clients), comparing RAID-x, chained declustering, RAID-10, and RAID-5]
Bonnie benchmark results on the Trojans cluster
[Chart: file-rewrite output rate (MB/s) vs. number of disks (2-16), comparing RAID-x, chained declustering, RAID-10, and RAID-5]
Securing networks, intranets, clusters, or Grid resources with intrusion control and automatic recovery from malicious attacks
[Figure: a plane of increasing reliability (fault tolerance vs. no data protection) against increasing scalability (SMP, cluster, intranet, Grid). Systems range from a cluster with no security protection, through a gateway firewall screening traffic between networks, up to a highly secured intranet with intrusion detection and response, automatic recovery from malicious attacks, and fault tolerance with distributed storage for reliable I/O]
Distributed Micro-Firewalls (Source: Murali and Hwang, USC)
Distributed Checkpointing on the RAID-x in the Trojans Cluster
[Figure: twelve processes checkpoint in three stripe groups (Stripe 0, Stripe 1, Stripe 2), with writes staggered in time within each group; C marks the checkpointing overhead and S the synchronization overhead]
Security Component Technologies
- Firewalls and cryptography
- Cluster middleware for security
- Anti-virus and immune systems
- Intrusion detection and response
- Distributed software RAIDs
- Security and assurance policies
Distributed Intrusion Detection and Responses

Security threat -> Effectiveness of micro-firewalls
- Insider attacks -> protect hosts against attacks from insiders
- Denial-of-service attacks -> protect against denial-of-service attacks from any source
- Trojan programs -> protect hosts from trapdoors planted by any source
- IP address spoofing -> can be reconfigured to prevent IP spoofing at the client-host level
- Probes and scans -> used with an IDS to block probes and scans close to their sources
- Unauthorized external access -> can prevent unauthorized access to external networks at the source
- Attacks on intranet infrastructure -> resist both internal and external attacks and provide fine-grained access control
Checkpointing overhead on distributed RAIDs
[Chart: checkpoint overhead (s) vs. checkpoint file size (MB), comparing the NFS scheme, Vaidya's staggered scheme, and striped staggering]
Advantages and Shortcomings of Distributed Checkpointing

Simultaneous writing to central storage (the NFS scheme)
  Advantages: simple; no inconsistent state
  Shortcomings: network and I/O contention; NFS is a single point of failure
  Suitable applications: small checkpoint files, few nodes, light I/O

Staggered writing to central storage (Vaidya's scheme)
  Advantages: eliminates network and I/O contention
  Shortcomings: network bandwidth is wasted; NFS is a single point of failure
  Suitable applications: small checkpoint files, few nodes, light I/O

Striped staggered checkpointing on a distributed RAID (our scheme)
  Advantages: eliminates network and I/O contention; low checkpoint overhead; fully utilizes network bandwidth; tolerates multiple failures spread across stripe groups
  Shortcomings: cannot tolerate multiple node failures within one stripe group
  Suitable applications: large checkpoint files, many nodes, low communication, I/O-intensive applications
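The trade-off among the three schemes comes down to how checkpoint writes are scheduled. The sketch below models each scheme's completion time under a deliberately crude unit-time-per-write assumption (illustrative only, not the measured behavior): simultaneous writes are serialized by the single server anyway, full staggering serializes everything by design, and striped staggering staggers only within each stripe group while the groups write to different stripes of the distributed RAID in parallel.

```python
# Completion-time models for three distributed checkpointing schemes,
# assuming each process's checkpoint takes one time unit to write.
# Illustrative model only, not measured data.

def nfs_simultaneous(n_procs: int) -> int:
    # All processes write at once to one NFS server; the server
    # serializes them, so completion still takes n write slots.
    return n_procs

def vaidya_staggered(n_procs: int) -> int:
    # Processes take turns writing to the central server:
    # no contention, but fully serial.
    return n_procs

def striped_staggered(n_procs: int, group_size: int) -> int:
    # Processes stagger within each stripe group, but the groups write
    # to different stripes of the distributed RAID in parallel, so the
    # largest group's serial run bounds the completion time.
    return min(group_size, n_procs)
```

For the twelve processes in three stripe groups shown in the earlier checkpointing figure, striped staggering finishes in 4 write slots instead of 12, while still avoiding the contention of simultaneous writing.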
Conclusions:
- Distributed storage-area networks demand hardware or software support of a single I/O space, not only in clusters but also in pervasive information grids.
- Hierarchical checkpointing with striping and staggered mirroring helps build fault-tolerant clusters that provide continuous network services.
- Hacker-proof clusters are in great demand for securing e-business, distributed computing, and metacomputing Grid applications.
- New applications to explore include multiserver consolidation, collaborative design, and pervasive network services.
Call for Participation: IEEE Third International Conference on Cluster Computing (CLUSTER 2001), Sutton Place Hotel, Newport Beach, California, October 8-11, 2001