Dr Markus Hagenbuchner markus@uow.edu.au CSCI319 Distributed Systems CSCI319 Chapter 8 Page: 1 of 61
Fault Tolerance Study objectives: Understand the role of fault tolerance in Distributed Systems. Know and explain the various failure models. Explain fault-tolerance mechanisms such as TMR and the Byzantine agreement method. Understand the effects of client-side and server-side crashes. Explain the various commit strategies and reliable multicast strategies. Explain strategies to recover from failures. CSCI319 Chapter 8 Page: 2 of 61
Content Basic concepts Failure models Failure recovery strategies, such as agreement in faulty systems, reliable communication, server crashes, client crashes, etc. Implementation issues CSCI319 Chapter 8 Page: 3 of 61
Fault Tolerance Basic Concepts Being fault tolerant is strongly related to what is called dependable systems Dependability implies the following: 1. Availability 2. Reliability 3. Safety 4. Maintainability CSCI319 Chapter 8 Page: 4 of 61
Failure Models Some types of possible failures. CSCI319 Chapter 8 Page: 5 of 61
Failure Masking by Redundancy Example: Triple modular redundancy (TMR). CSCI319 Chapter 8 Page: 6 of 61
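A minimal sketch of the TMR idea (the replica functions and the voter are illustrative, not from the slides): three modules compute the result independently and a majority voter masks a single faulty module.

    from collections import Counter

    def tmr_vote(results):
        # Return the majority value among three replica outputs.
        # Masks a single faulty replica; raises if no majority exists.
        value, votes = Counter(results).most_common(1)[0]
        if votes >= 2:
            return value
        raise RuntimeError("no majority: more than one replica failed")

    # Example: replica B returns a corrupted value, the voter masks it.
    replicas = [lambda x: x * x, lambda x: x * x + 1, lambda x: x * x]
    print(tmr_vote([f(6) for f in replicas]))   # -> 36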
Flat Groups versus Hierarchical Groups (a) Communication in a flat group. (b) Communication in a simple hierarchical group. The system cannot recover from a failure of the coordinator in case (b), but reaching agreement is harder in case (a). CSCI319 Chapter 8 Page: 7 of 61
Agreement in Faulty Systems (1) Possible cases: 1. Synchronous versus asynchronous systems. 2. Communication delay is bounded or not. 3. Message delivery is ordered or not. 4. Message transmission is done through unicasting or multicasting. CSCI319 Chapter 8 Page: 8 of 61
Agreement in Faulty Systems (2) Circumstances under which distributed agreement can be reached. CSCI319 Chapter 8 Page: 9 of 61
Agreement in Faulty Systems (3) Example: Byzantine agreement problem. Given: N processes, which send messages possibly concurrently. The goal is to let each process construct a vector V of length N such that if process i is nonfaulty then V[i] = v_i (the value held by process i), otherwise V[i] is undefined. CSCI319 Chapter 8 Page: 10 of 61
Agreement in Faulty Systems (4) The Byzantine agreement algorithm. Goal is to obtain an agreed response among non-faulty nodes. Assumptions made: processes are synchronous; messages are sent by unicast and are ordered; delay is bounded. The algorithm can deal with up to n faulty nodes in a system containing 2n+1 non-faulty nodes. CSCI319 Chapter 8 Page: 11 of 61
Agreement in Faulty Systems (5) Byzantine agreement algorithm: 1.) Every nonfaulty node i sends its value v_i (in the example, v_i = i) to every other node. Faulty nodes may send anything. 2.) Each node k forms a vector V_k with V_k[i] = v_i, the value it received from node i. 3.) Every node passes its vector V_k to all other nodes. 4.) Each process examines the i-th element of the received vectors; if one value occurs in a majority of them it is taken as V[i], otherwise V[i] is marked unknown. A small simulation sketch follows. CSCI319 Chapter 8 Page: 12 of 61
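A small, simplified simulation of steps 2-4 as seen from one non-faulty process (the example values and the majority() helper are illustrative assumptions): each element of the result vector is decided by a majority over the vectors received in step 3, and elements without a majority remain unknown.

    from collections import Counter

    def majority(values):
        value, votes = Counter(values).most_common(1)[0]
        return value if votes > len(values) // 2 else None   # None = unknown

    def byzantine_decide(received_vectors):
        # received_vectors[k][i] = the value node k claims it got from node i
        # (a faulty k may lie arbitrarily). Step 4: per-element majority vote.
        n = len(received_vectors)
        return [majority([received_vectors[k][i] for k in range(n)])
                for i in range(n)]

    # Vectors as received by one non-faulty process; process 3 is faulty.
    vectors = [
        [1, 2, 3, 99],   # vector assembled by process 0 (faulty P3 sent 99)
        [1, 2, 3, 88],   # vector assembled by process 1
        [1, 2, 3, 77],   # vector assembled by process 2
        [9, 9, 9, 9],    # vector relayed by the faulty process 3
    ]
    print(byzantine_decide(vectors))   # -> [1, 2, 3, None]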
Agreement in Faulty Systems (6) The Byzantine agreement problem for three nonfaulty and one faulty process. (a) Each process sends their value to the others. CSCI319 Chapter 8 Page: 13 of 61
Agreement in Faulty Systems (7) The Byzantine agreement problem for three nonfaulty and one faulty process. (b) The vectors that each process assembles based on (a). (c) The vectors that each process receives in step 3. CSCI319 Chapter 8 Page: 14 of 61
Agreement in Faulty Systems (8) The same as the previous case, except now with two correct processes and one faulty process. CSCI319 Chapter 8 Page: 15 of 61
Failure Detection How to detect when a member of a group of processes has failed? E.g. pinging, timeouts, periodic gossip (see the sketch below). How to distinguish between node failure and network failure? E.g. route requests via neighbors (try an alternate communication path). What to do with a failing member? E.g. remove it from the group, or keep trying to contact the node. CSCI319 Chapter 8 Page: 16 of 61
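A sketch of a simple timeout-based failure detector along the lines of the pinging/timeout idea (the ping() probe, the threshold and the interval are assumptions): a monitor probes each member periodically and removes members that miss several probes in a row. As the slide notes, this alone cannot distinguish a node failure from a network failure.

    import time

    SUSPECT_AFTER = 3          # missed probes before a member is suspected

    def monitor(members, ping, interval=1.0):
        missed = {m: 0 for m in members}
        while True:
            for m in list(members):
                if ping(m):                     # placeholder: probe and await reply
                    missed[m] = 0
                else:
                    missed[m] += 1
                    if missed[m] >= SUSPECT_AFTER:
                        print(f"{m} suspected failed -> remove from group")
                        members.remove(m)
            time.sleep(interval)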
Reliable Client-Server Communication Specifically addresses communication failures How can reliable client-server communication be achieved? Illustrated on Remote Procedure Call (RPC) situation. Are there client-server strategies that can be taken to achieve reliable RPC? CSCI319 Chapter 8 Page: 17 of 61
RPC Semantics in the Presence of Failures Five different cases of failures that can occur in RPC systems: 1. The client is unable to locate the server. 2. The request message from the client to the server is lost. 3. The server crashes after receiving a request. 4. The reply message from the server to the client is lost. 5. The client crashes after sending a request. CSCI319 Chapter 8 Page: 18 of 61
Server Crashes (1) A server in client-server communication may crash at different times during an RPC: (a) The normal case. (b) Crash after execution (of procedure). (c) Crash before execution (of procedure). CSCI319 Chapter 8 Page: 19 of 61
Server Crashes (2) Let's work towards server-side strategies to minimize the effects of a server-side crash. Example: a client requests the server to print some text. The server is to acknowledge that the text has been printed. Three events can happen at the server: sending the completion message (M), printing the text (P), crashing (C). CSCI319 Chapter 8 Page: 20 of 61
Server Crashes (3) These events can occur in six different orderings: 1. M →P →C: A crash occurs after sending the completion message and printing the text. 2. M →C (→P): A crash happens after sending the completion message, but before the text could be printed. 3. P →M →C: A crash occurs after printing the text and sending the completion message. 4. P →C (→M): The text is printed, after which a crash occurs before the completion message could be sent. 5. C (→P →M): A crash happens before the server could do anything. 6. C (→M →P): A crash happens before the server could do anything. CSCI319 Chapter 8 Page: 21 of 61
Server Crashes (4) Different combinations of client and server strategies in the presence of server crashes. Therefore, no combination of client and server strategies guarantees exactly-once semantics; fully reliable RPC cannot be achieved in the presence of server crashes. CSCI319 Chapter 8 Page: 22 of 61
Client Crashes (1) A client makes a request but crashes before receiving the response from the server, leaving orphans behind. Problems arising from orphans: wasted CPU cycles, open file handles, locked-up resources. CSCI319 Chapter 8 Page: 23 of 61
Client Crashes (2) Strategies to handle the orphan problem: Orphan extermination: keep a log, scan for orphans, then explicitly kill the process. Very expensive! Reincarnation: after rebooting, the client broadcasts its presence (a new epoch), killing all remote computations started on its behalf (effectively removing orphans). Unreliable. Gentle reincarnation: same as before, but additional checks are performed to locate the owners of remote computations, and only ownerless ones are killed. More reliable. Expiration: each process (or RPC) is given a fixed time T and is killed if the time is exceeded; additional communication is required when more time than T is needed (see the sketch below). CSCI319 Chapter 8 Page: 24 of 61
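A minimal sketch of the expiration strategy (the Lease class and the reaper loop are illustrative assumptions): each remote computation is granted a fixed time T and is killed when its lease expires, unless the client renews it.

    import time

    class Lease:
        def __init__(self, T):
            self.T = T
            self.expires = time.time() + T
        def renew(self):                 # client asks for more time than T
            self.expires = time.time() + self.T
        def expired(self):
            return time.time() > self.expires

    def reap_orphans(computations):
        # computations: dict mapping an RPC id to its Lease
        for rpc_id, lease in list(computations.items()):
            if lease.expired():
                print(f"killing orphaned computation {rpc_id}")
                del computations[rpc_id]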
Reliable group communication Reliable Multicasting: Task: messages are to be sent to several receivers. Goal: ensure all receivers receive the messages. One solution: assign a sequence number to each message; the receiver checks for missing numbers and contacts the sender for retransmission (see the sketch below). Another solution: receivers send back an acknowledgement upon receipt of a message; the sender resends the message to receivers from which no acknowledgement was received. CSCI319 Chapter 8 Page: 25 of 61
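A sketch of the sequence-number solution on the receiver side (the request_retransmission hook is an assumption): the receiver delivers messages in order and asks the sender to retransmit any sequence numbers it finds missing.

    class Receiver:
        # Detects missing multicast messages by sequence number (sketch;
        # request_retransmission is a placeholder for contacting the sender).
        def __init__(self, request_retransmission):
            self.expected = 0
            self.request_retransmission = request_retransmission

        def on_message(self, seq, payload):
            if seq == self.expected:
                self.deliver(payload)
                self.expected += 1
            elif seq > self.expected:
                for missing in range(self.expected, seq):
                    self.request_retransmission(missing)
            # seq < expected: duplicate, ignore

        def deliver(self, payload):
            print("delivered:", payload)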
Basic Reliable-Multicasting Schemes A simple solution to reliable multicasting when all receivers are known and are assumed not to fail. (a) Message transmission. (b) Reporting feedback. CSCI319 Chapter 8 Page: 26 of 61
Scalable Reliable-Multicasting If acknowledgements are sent back to the sender, messages are re-sent to receivers that did not acknowledge. But this method is not efficient when the number of receivers is very large! Scalability of the previous approach can be improved: 1. Only negative acknowledgements (NACKs) are sent back to the sender. When multiple receivers missed a packet, only one NACK is returned (by broadcasting it to all after a random waiting time). 2. Through parallelization, i.e. by using a hierarchical approach with several coordinators in which each coordinator is responsible for a subset of receivers. CSCI319 Chapter 8 Page: 27 of 61
Nonhierarchical Feedback Control Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others. CSCI319 Chapter 8 Page: 28 of 61
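A sketch of this NACK suppression (multicast_nack and the delay bound are assumptions): a receiver that misses a message waits a random time before multicasting a NACK, and cancels its own NACK if it overhears one for the same message first.

    import random, threading

    class NackScheduler:
        def __init__(self, multicast_nack, max_delay=0.5):
            self.multicast_nack = multicast_nack   # placeholder transport call
            self.max_delay = max_delay
            self.pending = {}

        def missing(self, seq):
            # schedule our NACK for message seq after a random delay
            t = threading.Timer(random.uniform(0, self.max_delay),
                                self.multicast_nack, args=(seq,))
            self.pending[seq] = t
            t.start()

        def nack_overheard(self, seq):
            # another receiver already asked for seq: suppress our own NACK
            timer = self.pending.pop(seq, None)
            if timer:
                timer.cancel()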
Hierarchical Feedback Control The essence of hierarchical reliable multicasting. Each local coordinator forwards the message to its children and later handles retransmission requests. CSCI319 Chapter 8 Page: 29 of 61
Virtual Synchrony (1) How to ensure that all receivers receive a message in case of a sender crash? This can be addressed by considering the logical organization of a distributed system. We can distinguish between message receipt and message delivery. CSCI319 Chapter 8 Page: 30 of 61
Virtual Synchrony (2) The principle of virtually synchronous multicast when realized using views: CSCI319 Chapter 8 Page: 31 of 61
Interactive slide What is virtual synchrony? A guarantee (in case of a sender crash during a multicast) that either: a. all intended receivers receive the message, or b. none of the intended receivers receive the message. How can virtual synchrony be realized? One possible solution is to use views: each receiver is assigned to a group in which each node has a complete view of the other nodes in the group. This requires that the views are updated synchronously, which turns out to be a non-trivial task. CSCI319 Chapter 8 Page: 33 of 61
Message Ordering (1) Virtual synchrony allows a DS developer to think about the ordering of messages in a multicast. Four different orderings are distinguished: Unordered multicasts: messages may be received in any order by any receiver. FIFO-ordered multicasts: messages from the same sender must be received in the same order by all receivers. Causally-ordered multicasts: messages that are potentially causally related are received in the same order. Totally-ordered multicasts: any of the previous three, with the additional guarantee that all processes receive all messages in the same order, regardless of sender. A FIFO-ordering sketch is given below. CSCI319 Chapter 8 Page: 34 of 61
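A sketch of how FIFO ordering can be enforced at a receiver (the per-sender sequence numbers and the hold-back queue are illustrative assumptions): messages from one sender are held back until all earlier messages from that sender have been delivered.

    from collections import defaultdict

    class FifoDelivery:
        def __init__(self):
            self.next_seq = defaultdict(int)     # next expected seq per sender
            self.holdback = defaultdict(dict)    # sender -> {seq: msg}

        def on_receive(self, sender, seq, msg):
            self.holdback[sender][seq] = msg
            # deliver as many consecutive messages from this sender as possible
            while self.next_seq[sender] in self.holdback[sender]:
                m = self.holdback[sender].pop(self.next_seq[sender])
                print(f"deliver from {sender}: {m}")
                self.next_seq[sender] += 1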
Message Ordering (2) Three communicating processes in the same group. The ordering of events per process is shown along the vertical axis. CSCI319 Chapter 8 Page: 35 of 61
Interactive slide Given the example below, answer the associated question with yes or no: The example complies with the ordering scheme: Unordered multicast. yes FIFO-ordered multicast. no Causally ordered multicast. no Totally-ordered multicast. no CSCI319 Chapter 8 Page: 36 of 61
Message Ordering (3) Four processes in the same group with two different senders. Assume that the multicast is to P2 and P3 only. CSCI319 Chapter 8 Page: 38 of 61
Interactive slide Given the example below where P2 and P3 are the sole receivers, answer the associated question with yes or no: The example complies with the ordering scheme: Unordered multicast. yes FIFO-ordered multicast. yes Causally ordered multicast. yes Totally-ordered multicast. no CSCI319 Chapter 8 Page: 39 of 61
Virtual Synchrony (3) Implementation of ordered multicast is non-trivial. One solution: the sender keeps a copy of each sent message, which is marked stable only once all receivers have acknowledged receipt; otherwise the message remains labeled unstable. Only stable messages are allowed to be delivered (to the application). Unstable messages are resolved (flushed or discarded) in the event of a view change. The underlying procedure is visualized on the following slides; a small bookkeeping sketch is given below. CSCI319 Chapter 8 Page: 41 of 61
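A sketch of the stable/unstable bookkeeping on the sender side (the view handling and the multicast/ack plumbing are assumptions, not part of the slides): a message becomes stable once every member of the current view has acknowledged it.

    class SenderBuffer:
        def __init__(self, view):
            self.view = set(view)          # current group membership
            self.pending = {}              # msg_id -> members still to ack

        def send(self, msg_id, multicast):
            self.pending[msg_id] = set(self.view)
            multicast(msg_id)              # placeholder transport call

        def on_ack(self, msg_id, member):
            self.pending[msg_id].discard(member)
            if not self.pending[msg_id]:
                print(f"message {msg_id} is now stable")

        def on_view_change(self, new_view):
            # unstable messages must be resolved (flushed or discarded) here
            unstable = [m for m, waiting in self.pending.items() if waiting]
            print("unstable at view change:", unstable)
            self.view = set(new_view)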
Virtual Synchrony (4) A node in a group G may crash during a multicast. This means that: 1. Not all intended receivers may have received a message. 2. All members need to apply a view change of G. Questions to be addressed: How can a member know that a message (which it may have received) has not yet been delivered to all nodes due to a crash? How can all members agree atomically on a view change? The underlying procedure is visualized in the following: CSCI319 Chapter 8 Page: 42 of 61
Virtual Synchrony (5) Example step (a): Process 4 notices that process 7 has crashed and sends a view change. CSCI319 Chapter 8 Page: 43 of 61
Virtual Synchrony (6) Step (b): Processes exchange information about unstable messages. Example: Process 6 sends out all its unstable messages, followed by a flush message (no further unstable messages). CSCI319 Chapter 8 Page: 44 of 61
Virtual Synchrony (7) Step (c): Processes install the new view. Example: Process 6 installs the new view when it has received a flush message from everyone else. CSCI319 Chapter 8 Page: 45 of 61
Distributed Commit (1) Distributed commit is a category of algorithms that generalize the concept of virtual synchrony. It allows the realization of many data-centric consistency models among replica servers. A commit refers to an irreversible operation, e.g. execution of a distributed command, distributed read or write operations, or realization of totally-ordered multicasts. Distributed commits allow a distributed operation to occur atomically (either all processes in a group perform the commit, or none do). CSCI319 Chapter 8 Page: 46 of 61
Distributed Commit (2) A generalization of virtual synchrony: Distributed Commit. Either all processes in a group perform a commit, or none do. Uses dedicated primitives (VOTE_REQUEST, GLOBAL_ABORT, ...). Often realized by means of a coordinator. These primitives are implemented in the middleware layer and are accessible by processes in a distributed system. CSCI319 Chapter 8 Page: 47 of 61
Distributed Commit (3) Three approaches to realizing distributed commit: One-phase commit: the coordinator tells all participating processes to commit; not robust, as a participant that cannot perform the operation has no way to inform the coordinator. Two-phase commit: the most common scheme, but it cannot handle all types of coordinator failure. Three-phase commit: can handle (any type of) failure of the coordinator. CSCI319 Chapter 8 Page: 48 of 61
Two-phase Commit (1) Protocol consists of two phases, each consisting of two steps. Phase 1: Voting phase Phase 2: Decision Phase Requires that a reliable point-to-point communication strategy is in place. These phases are realized as follows: CSCI319 Chapter 8 Page: 49 of 61
Two-phase Commit (2) Phase 1: Voting phase. Step 1: The coordinator sends a VOTE_REQUEST to all participants. Step 2: Upon receiving the request, each participant replies with VOTE_COMMIT or VOTE_ABORT depending on its local situation. CSCI319 Chapter 8 Page: 50 of 61
Two-phase Commit (3) Phase 2: Decision phase. Step 1: The coordinator collects all votes, then sends either GLOBAL_COMMIT or GLOBAL_ABORT to all participants, depending on the votes received in the voting phase. Step 2: If a participant receives a GLOBAL_COMMIT message it commits the transaction locally; on GLOBAL_ABORT it aborts the transaction locally. A sketch of both phases is given below. CSCI319 Chapter 8 Page: 51 of 61
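A compact sketch of both phases, not the full protocol (send, recv and the callbacks are placeholder transport/application hooks; timeouts and crash handling are omitted; the message names follow the slides).

    def two_phase_commit(participants, send, recv):
        # Phase 1: voting
        for p in participants:
            send(p, "VOTE_REQUEST")
        votes = [recv(p) for p in participants]       # VOTE_COMMIT / VOTE_ABORT
        # Phase 2: decision
        decision = ("GLOBAL_COMMIT"
                    if all(v == "VOTE_COMMIT" for v in votes) else "GLOBAL_ABORT")
        for p in participants:
            send(p, decision)
        return decision

    def participant(coordinator, send, recv, can_commit, commit, abort):
        if recv(coordinator) == "VOTE_REQUEST":
            send(coordinator, "VOTE_COMMIT" if can_commit() else "VOTE_ABORT")
            if recv(coordinator) == "GLOBAL_COMMIT":
                commit()       # make the local changes permanent
            else:
                abort()        # roll back the local changes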
Two-Phase Commit (4) The finite state diagram of the two-phase commit protocol for (a) the coordinator, and (b) for a participant. The diagram shows the states in which a coordinator and participant can be in. CSCI319 Chapter 8 Page: 52 of 61
Two-phase Commit (5) Robustness: The two-phase commit strategy can be made robust to almost all possible faults in the coordinator or the participants. Only exception: if the coordinator is in the WAIT state and all participants are in the READY state, then no fail-safe strategy exists. This is because WAIT and READY are blocking states. It is a rare situation and can be overcome by using the three-phase commit strategy, which avoids mutually blocking processes through the introduction of a pre-commit state (see book for the algorithm). CSCI319 Chapter 8 Page: 53 of 61
Recovery (1) Recovering a failed process to a correct state is not always possible, as some operations are irreversible. Strategies depend on the type of the affected component (storage, process, etc.). Strategies include: checksums, checkpointing, logging, etc. Recovery requires that the required information has been stored safely. CSCI319 Chapter 8 Page: 54 of 61
Stable Storage (1) We differentiate between three types of storage: 1. RAM, wiped out when the machine crashes. 2. Disk storage, can be lost when a disk head crashes. 3. Stable storage, designed to survive any type of crash other than acts of God. E.g. ROM or WORM storage, fail-safe disk storage. CSCI319 Chapter 8 Page: 55 of 61
Stable Storage (2) RAID: Redundant Array of Independent Disks. RAID 0: striping across multiple disks; no redundancy. RAID 1: mirrored disks; reduces capacity by at least 50%. RAID 2: adds parity disk(s) to a striped array (Hamming code). RAID 3: byte-level striping with (one) dedicated parity disk; the parity disk is a bottleneck. RAID 4: block-level striping (instead of byte-level) with a dedicated parity disk. RAID 5: striping with parity distributed across the set of disks; reads are as fast as RAID 0; the capacity of one disk is lost to parity. RAID 6: dual parity; can deal with two concurrent disk failures. RAID-Z: extends RAID 5 to avoid write holes and performance issues. A parity sketch is given below. CSCI319 Chapter 8 Page: 56 of 61
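A small illustration of the parity idea behind RAID levels 3 to 5 (the block contents are arbitrary example bytes): the parity block is the XOR of the data blocks, so any single lost block can be reconstructed from the surviving blocks.

    def parity(blocks):
        # XOR all blocks together, byte by byte
        p = bytes(len(blocks[0]))
        for b in blocks:
            p = bytes(x ^ y for x, y in zip(p, b))
        return p

    data = [b"AAAA", b"BBBB", b"CCCC"]     # stripes on three data disks
    p = parity(data)                       # stored on the parity disk

    # Disk 1 fails: reconstruct its block from the remaining blocks + parity.
    recovered = parity([data[0], data[2], p])
    assert recovered == data[1]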
Recovery of processes E.g. realized through checkpointing. A recovery line defines the state to which a system has to be reverted in order to recover from a failed process. Note that in this example, the last checkpoint of P2 cannot serve as a recovery point in this DS. CSCI319 Chapter 8 Page: 57 of 61
Independent Checkpointing Finding a recovery line in a DS can be difficult, as it can lead to a domino effect, as illustrated here. The domino effect can be avoided by: synchronizing checkpointing across processes (difficult), or logging messages: CSCI319 Chapter 8 Page: 58 of 61
Message-Logging The domino effect can be avoided by logging (and replaying) messages rather than checkpointing process states. The difficulty here is that an incorrect replay of messages after recovery can lead to an orphan process. A replay sketch is given below. CSCI319 Chapter 8 Page: 59 of 61
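A minimal sketch of message logging and replay (the pickle-based log format and helper names are assumptions): after a crash, the process restarts from its last checkpoint and the logged messages are re-applied in their original order.

    import pickle

    def log_message(logfile, msg):
        pickle.dump(msg, logfile)      # append to a log on stable storage
        logfile.flush()

    def recover(checkpoint_state, logfile, apply_message):
        state = checkpoint_state
        logfile.seek(0)
        while True:
            try:
                msg = pickle.load(logfile)
            except EOFError:
                break
            state = apply_message(state, msg)   # deterministic replay
        return state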
Recovery oriented computing Recovery can be achieved by simply starting over again (i.e. rebooting). Problem: this may reproduce the cause of the crash, and hence a solution is never obtained. Alternatively: apply checkpointing to redundant algorithms, then recover to an algorithm which has not failed. Robust but hard to implement. CSCI319 Chapter 8 Page: 60 of 61
Summary on fault tolerance Fault tolerance Failure models Failure recovery strategies Scalability issues Reliable multicasting Virtual Synchrony and distributed commits Recovery CSCI319 Chapter 8 Page: 61 of 61