Dr Markus Hagenbuchner markus@uow.edu.au CSCI319 Distributed Systems CSCI319 Chapter 8 Page: 1 of 61
Fault Tolerance Study objectives: Understand the role of fault tolerance in Distributed Systems. Know and explain the various failure models. Explain fault-tolerance mechanisms such as TMR and the Byzantine agreement method. Understand the effects of client-side and server-side crashes. Explain the various commit strategies and reliable multicast strategies. Explain strategies to recover from failures. CSCI319 Chapter 8 Page: 2 of 61
Content Basic concepts Failure models Failure recovery strategies, such as agreement in faulty systems, reliable communication, server crashes, client crashes, etc. Implementation issues CSCI319 Chapter 8 Page: 3 of 61
Fault Tolerance Basic Concepts Being fault tolerant is strongly related to what is called dependable systems Dependability implies the following: 1. Availability 2. Reliability 3. Safety 4. Maintainability CSCI319 Chapter 8 Page: 4 of 61
Failure Models Some types of possible failures. CSCI319 Chapter 8 Page: 5 of 61
Failure Masking by Redundancy Example: Triple modular redundancy (TMR). CSCI319 Chapter 8 Page: 6 of 61
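A minimal sketch of the TMR idea (the replica functions and the voter are illustrative, not from the slides): three modules compute the result independently and a majority voter masks a single faulty module.

    from collections import Counter

    def tmr_vote(results):
        # Return the majority value among three replica outputs.
        # Masks a single faulty replica; raises if no majority exists.
        value, votes = Counter(results).most_common(1)[0]
        if votes >= 2:
            return value
        raise RuntimeError("no majority: more than one replica failed")

    # Example: replica B returns a corrupted value, the voter masks it.
    replicas = [lambda x: x * x, lambda x: x * x + 1, lambda x: x * x]
    print(tmr_vote([f(6) for f in replicas]))   # -> 36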
Flat Groups versus Hierarchical Groups (a) Communication in a flat group. (b) Communication in a simple hierarchical group. The system cannot recover from a failure of the coordinator in case (b), but reaching agreement is harder in case (a). CSCI319 Chapter 8 Page: 7 of 61
Agreement in Faulty Systems (1) Possible cases: 1. Synchronous versus asynchronous systems. 2. Communication delay is bounded or not. 3. Message delivery is ordered or not. 4. Message transmission is done through unicasting or multicasting. CSCI319 Chapter 8 Page: 8 of 61
Agreement in Faulty Systems (2) Circumstances under which distributed agreement can be reached. CSCI319 Chapter 8 Page: 9 of 61
Agreement in Faulty Systems (3) Example: Byzantine agreement problem. Given: N processes, which send messages possibly concurrently. The goal is to let each process construct a vector V of length N such that if process i is nonfaulty then V[i] = v_i (the value held by process i), otherwise V[i] is undefined. CSCI319 Chapter 8 Page: 10 of 61
Agreement in Faulty Systems (4) The Byzantine agreement algorithm. Goal is to obtain an agreed response among non-faulty nodes. Assumptions made: processes are synchronous; messages are sent by unicast and are ordered; delay is bounded. The algorithm can deal with up to n faulty nodes in a system containing 2n+1 non-faulty nodes. CSCI319 Chapter 8 Page: 11 of 61
Agreement in Faulty Systems (5) Byzantine agreement algorithm: 1.) Every nonfaulty node i sends its value v_i (in the example, v_i = i) to every other node. Faulty nodes may send anything. 2.) Each node k forms a vector V_k with V_k[i] = v_i, the value it received from node i. 3.) Every node passes its vector V_k to all other nodes. 4.) Each process examines the i-th element of the received vectors; if one value occurs in a majority of them it is taken as V[i], otherwise V[i] is marked unknown. A small simulation sketch follows. CSCI319 Chapter 8 Page: 12 of 61
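A small, simplified simulation of steps 2-4 as seen from one non-faulty process (the example values and the majority() helper are illustrative assumptions): each element of the result vector is decided by a majority over the vectors received in step 3, and elements without a majority remain unknown.

    from collections import Counter

    def majority(values):
        value, votes = Counter(values).most_common(1)[0]
        return value if votes > len(values) // 2 else None   # None = unknown

    def byzantine_decide(received_vectors):
        # received_vectors[k][i] = the value node k claims it got from node i
        # (a faulty k may lie arbitrarily). Step 4: per-element majority vote.
        n = len(received_vectors)
        return [majority([received_vectors[k][i] for k in range(n)])
                for i in range(n)]

    # Vectors as received by one non-faulty process; process 3 is faulty.
    vectors = [
        [1, 2, 3, 99],   # vector assembled by process 0 (faulty P3 sent 99)
        [1, 2, 3, 88],   # vector assembled by process 1
        [1, 2, 3, 77],   # vector assembled by process 2
        [9, 9, 9, 9],    # vector relayed by the faulty process 3
    ]
    print(byzantine_decide(vectors))   # -> [1, 2, 3, None]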
Agreement in Faulty Systems (6) The Byzantine agreement problem for three nonfaulty and one faulty process. (a) Each process sends their value to the others. CSCI319 Chapter 8 Page: 13 of 61
Agreement in Faulty Systems (7) The Byzantine agreement problem for three nonfaulty and one faulty process. (b) The vectors that each process assembles based on (a). (c) The vectors that each process receives in step 3. CSCI319 Chapter 8 Page: 14 of 61
Agreement in Faulty Systems (8) The same as the previous case, except now with two correct processes and one faulty process. CSCI319 Chapter 8 Page: 15 of 61
Failure Detection How to detect when a member of a group of processes has failed? E.g. pinging, timeouts, periodic gossip (see the sketch below). How to distinguish between node failure and network failure? E.g. route requests via neighbors (try an alternate communication path). What to do with a failing member? E.g. remove it from the group, or keep trying to contact the node. CSCI319 Chapter 8 Page: 16 of 61
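A sketch of a simple timeout-based failure detector along the lines of the pinging/timeout idea (the ping() probe, the threshold and the interval are assumptions): a monitor probes each member periodically and removes members that miss several probes in a row. As the slide notes, this alone cannot distinguish a node failure from a network failure.

    import time

    SUSPECT_AFTER = 3          # missed probes before a member is suspected

    def monitor(members, ping, interval=1.0):
        missed = {m: 0 for m in members}
        while True:
            for m in list(members):
                if ping(m):                     # placeholder: probe and await reply
                    missed[m] = 0
                else:
                    missed[m] += 1
                    if missed[m] >= SUSPECT_AFTER:
                        print(f"{m} suspected failed -> remove from group")
                        members.remove(m)
            time.sleep(interval)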
Reliable Client-Server Communication Specifically addresses communication failures How can reliable client-server communication be achieved? Illustrated on Remote Procedure Call (RPC) situation. Are there client-server strategies that can be taken to achieve reliable RPC? CSCI319 Chapter 8 Page: 17 of 61
RPC Semantics in the Presence of Failures Five different cases of failures that can occur in RPC systems: 1. The client is unable to locate the server. 2. The request message from the client to the server is lost. 3. The server crashes after receiving a request. 4. The reply message from the server to the client is lost. 5. The client crashes after sending a request. CSCI319 Chapter 8 Page: 18 of 61
Server Crashes (1) A server in client-server communication may crash at different times during an RPC: (a) The normal case. (b) Crash after execution (of procedure). (c) Crash before execution (of procedure). CSCI319 Chapter 8 Page: 19 of 61
Server Crashes (2) Let's work towards server-side strategies to minimize the effects of a server-side crash. Example: a client requests the server to print some text. The server is to acknowledge that the text has been printed. Three events can happen at the server: sending the completion message (M), printing the text (P), crashing (C). CSCI319 Chapter 8 Page: 20 of 61
Server Crashes (3) These events can occur in six different orderings: 1. M →P →C: A crash occurs after sending the completion message and printing the text. 2. M →C (→P): A crash happens after sending the completion message, but before the text could be printed. 3. P →M →C: A crash occurs after printing the text and sending the completion message. 4. P →C (→M): The text is printed, after which a crash occurs before the completion message could be sent. 5. C (→P →M): A crash happens before the server could do anything. 6. C (→M →P): A crash happens before the server could do anything. CSCI319 Chapter 8 Page: 21 of 61
Server Crashes (4) Different combinations of client and server strategies in the presence of server crashes. Therefore, no combination of client and server strategies guarantees exactly-once semantics; fully reliable RPC cannot be achieved in the presence of server crashes. CSCI319 Chapter 8 Page: 22 of 61
Client Crashes (1) A client makes a request but crashes before receiving the response from the server, leaving orphans behind. Problems arising from orphans: wasted CPU cycles, open file handles, locked-up resources. CSCI319 Chapter 8 Page: 23 of 61
Client Crashes (2) Strategies to handle the orphan problem: Orphan extermination: keep a log, scan for orphans, then explicitly kill the process. Very expensive! Reincarnation: after rebooting, the client broadcasts its presence (a new epoch), killing all remote computations started on its behalf (effectively removing orphans). Unreliable. Gentle reincarnation: same as before, but additional checks are performed to locate the owners of remote computations, and only ownerless ones are killed. More reliable. Expiration: each process (or RPC) is given a fixed time T and is killed if the time is exceeded; additional communication is required when more time than T is needed (see the sketch below). CSCI319 Chapter 8 Page: 24 of 61
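A minimal sketch of the expiration strategy (the Lease class and the reaper loop are illustrative assumptions): each remote computation is granted a fixed time T and is killed when its lease expires, unless the client renews it.

    import time

    class Lease:
        def __init__(self, T):
            self.T = T
            self.expires = time.time() + T
        def renew(self):                 # client asks for more time than T
            self.expires = time.time() + self.T
        def expired(self):
            return time.time() > self.expires

    def reap_orphans(computations):
        # computations: dict mapping an RPC id to its Lease
        for rpc_id, lease in list(computations.items()):
            if lease.expired():
                print(f"killing orphaned computation {rpc_id}")
                del computations[rpc_id]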
Reliable group communication Reliable Multicasting: Task: messages are to be sent to several receivers. Goal: ensure all receivers receive the messages. One solution: assign a sequence number to each message; the receiver checks for missing numbers and contacts the sender for retransmission (see the sketch below). Another solution: receivers send back an acknowledgement upon receipt of a message; the sender resends the message to receivers from which no acknowledgement was received. CSCI319 Chapter 8 Page: 25 of 61
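A sketch of the sequence-number solution on the receiver side (the request_retransmission hook is an assumption): the receiver delivers messages in order and asks the sender to retransmit any sequence numbers it finds missing.

    class Receiver:
        # Detects missing multicast messages by sequence number (sketch;
        # request_retransmission is a placeholder for contacting the sender).
        def __init__(self, request_retransmission):
            self.expected = 0
            self.request_retransmission = request_retransmission

        def on_message(self, seq, payload):
            if seq == self.expected:
                self.deliver(payload)
                self.expected += 1
            elif seq > self.expected:
                for missing in range(self.expected, seq):
                    self.request_retransmission(missing)
            # seq < expected: duplicate, ignore

        def deliver(self, payload):
            print("delivered:", payload)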
Basic Reliable-Multicasting Schemes A simple solution to reliable multicasting when all receivers are known and are assumed not to fail. (a) Message transmission. (b) Reporting feedback. CSCI319 Chapter 8 Page: 26 of 61
Scalable Reliable-Multicasting If acknowledgements are sent back to the sender, messages are re-sent to receivers that did not acknowledge. But this method is not efficient when the number of receivers is very large! Scalability of the previous approach can be improved: 1. Only negative acknowledgements (NACKs) are sent back to the sender. When multiple receivers missed a packet, only one NACK is returned (by broadcasting it to all after a random waiting time). 2. Through parallelization, i.e. by using a hierarchical approach with several coordinators in which each coordinator is responsible for a subset of receivers. CSCI319 Chapter 8 Page: 27 of 61
Nonhierarchical Feedback Control Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others. CSCI319 Chapter 8 Page: 28 of 61
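A sketch of this NACK suppression (multicast_nack and the delay bound are assumptions): a receiver that misses a message waits a random time before multicasting a NACK, and cancels its own NACK if it overhears one for the same message first.

    import random, threading

    class NackScheduler:
        def __init__(self, multicast_nack, max_delay=0.5):
            self.multicast_nack = multicast_nack   # placeholder transport call
            self.max_delay = max_delay
            self.pending = {}

        def missing(self, seq):
            # schedule our NACK for message seq after a random delay
            t = threading.Timer(random.uniform(0, self.max_delay),
                                self.multicast_nack, args=(seq,))
            self.pending[seq] = t
            t.start()

        def nack_overheard(self, seq):
            # another receiver already asked for seq: suppress our own NACK
            timer = self.pending.pop(seq, None)
            if timer:
                timer.cancel()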
Hierarchical Feedback Control The essence of hierarchical reliable multicasting. Each local coordinator forwards the message to its children and later handles retransmission requests. CSCI319 Chapter 8 Page: 29 of 61
Virtual Synchrony (1) How to ensure that all receivers receive a message in case of a sender crash? This can be addressed by considering the logical organization of a distributed system. We can distinguish between message receipt and message delivery. CSCI319 Chapter 8 Page: 30 of 61
Virtual Synchrony (2) The principle of virtually synchronous multicast when realized using views: CSCI319 Chapter 8 Page: 31 of 61
Interactive slide What is virtual synchrony? A guarantee (in case of a sender crash during a multicast) that either: a. all intended receivers receive the message, or b. none of the intended receivers receive the message. How can virtual synchrony be realized? One possible solution is to use views: each receiver is assigned to a group in which each node has a complete view of the other nodes in the group. This requires that the views are updated synchronously, which turns out to be a non-trivial task. CSCI319 Chapter 8 Page: 33 of 61
Message Ordering (1) Virtual synchrony allows a DS developer to think about the ordering of messages in a multicast. Four different orderings are distinguished: Unordered multicasts: messages may be received in any order by any receiver. FIFO-ordered multicasts: messages from the same sender must be received in the same order by all receivers. Causally-ordered multicasts: messages that are potentially causally related are received in the same order. Totally-ordered multicasts: any of the previous three, with the additional guarantee that all processes receive all messages in the same order, regardless of sender. A FIFO-ordering sketch is given below. CSCI319 Chapter 8 Page: 34 of 61
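A sketch of how FIFO ordering can be enforced at a receiver (the per-sender sequence numbers and the hold-back queue are illustrative assumptions): messages from one sender are held back until all earlier messages from that sender have been delivered.

    from collections import defaultdict

    class FifoDelivery:
        def __init__(self):
            self.next_seq = defaultdict(int)     # next expected seq per sender
            self.holdback = defaultdict(dict)    # sender -> {seq: msg}

        def on_receive(self, sender, seq, msg):
            self.holdback[sender][seq] = msg
            # deliver as many consecutive messages from this sender as possible
            while self.next_seq[sender] in self.holdback[sender]:
                m = self.holdback[sender].pop(self.next_seq[sender])
                print(f"deliver from {sender}: {m}")
                self.next_seq[sender] += 1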
Message Ordering (2) Three communicating processes in the same group. The ordering of events per process is shown along the vertical axis. CSCI319 Chapter 8 Page: 35 of 61
Interactive slide Given the example below, answer the associated question with yes or no: The example complies with the ordering scheme: Unordered multicast. yes FIFO-ordered multicast. no Causally ordered multicast. no Totally-ordered multicast. no CSCI319 Chapter 8 Page: 36 of 61
Message Ordering (3) Four processes in the same group with two different senders. Assume that the multicast is to P2 and P3 only. CSCI319 Chapter 8 Page: 38 of 61
Interactive slide Given the example below where P2 and P3 are the sole receivers, answer the associated question with yes or no: The example complies with the ordering scheme: Unordered multicast. yes FIFO-ordered multicast. yes Causally ordered multicast. yes Totally-ordered multicast. no CSCI319 Chapter 8 Page: 39 of 61
Virtual Synchrony (3) Implementation of ordered multicast is non-trivial. One solution: the sender keeps a copy of each sent message, which is marked stable only once all receivers have acknowledged receipt; otherwise the message remains labeled unstable. Only stable messages are allowed to be delivered (to the application). Unstable messages are resolved (flushed or discarded) in the event of a view change. The underlying procedure is visualized on the following slides; a small bookkeeping sketch is given below. CSCI319 Chapter 8 Page: 41 of 61
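A sketch of the stable/unstable bookkeeping on the sender side (the view handling and the multicast/ack plumbing are assumptions, not part of the slides): a message becomes stable once every member of the current view has acknowledged it.

    class SenderBuffer:
        def __init__(self, view):
            self.view = set(view)          # current group membership
            self.pending = {}              # msg_id -> members still to ack

        def send(self, msg_id, multicast):
            self.pending[msg_id] = set(self.view)
            multicast(msg_id)              # placeholder transport call

        def on_ack(self, msg_id, member):
            self.pending[msg_id].discard(member)
            if not self.pending[msg_id]:
                print(f"message {msg_id} is now stable")

        def on_view_change(self, new_view):
            # unstable messages must be resolved (flushed or discarded) here
            unstable = [m for m, waiting in self.pending.items() if waiting]
            print("unstable at view change:", unstable)
            self.view = set(new_view)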
Virtual Synchrony (4) A node in a group G may crash during a multicast. This means that: 1. Not all intended receivers may have received a message. 2. All members need to apply a view change of G. Questions to be addressed: How can a member know that a message (which it may have received) has not yet been delivered to all nodes due to a crash? How can all members agree atomically on a view change? The underlying procedure is visualized in the following: CSCI319 Chapter 8 Page: 42 of 61
Virtual Synchrony (5) Example step (a): Process 4 notices that process 7 has crashed and sends a view change. CSCI319 Chapter 8 Page: 43 of 61
Virtual Synchrony (6) Step (b): Processes exchange information about unstable messages. Example: Process 6 sends out all its unstable messages, followed by a flush message (no further unstable messages). CSCI319 Chapter 8 Page: 44 of 61
Virtual Synchrony (7) Step (c): Processes install the new view. Example: Process 6 installs the new view when it has received a flush message from everyone else. CSCI319 Chapter 8 Page: 45 of 61
Distributed Commit (1) Distributed commit is a category of algorithms that generalize the concept of virtual synchrony. It allows the realization of many data-centric consistency models among replica servers. A commit refers to an irreversible operation, e.g. execution of a distributed command, distributed read or write operations, or realization of totally-ordered multicasts. Distributed commits allow a distributed operation to occur atomically (either all processes in a group perform the commit, or none do). CSCI319 Chapter 8 Page: 46 of 61
Distributed Commit (2) A generalization of virtual synchrony: Distributed Commit. Either all processes in a group perform a commit, or none do. Uses dedicated primitives (VOTE_REQUEST, GLOBAL_ABORT, ...). Often realized by means of a coordinator. These primitives are implemented in the middleware layer and are accessible by processes in a distributed system. CSCI319 Chapter 8 Page: 47 of 61
Distributed Commit (3) Three approaches to realizing distributed commit: One-phase commit: the coordinator tells all participating processes to commit; not robust, as a participant that cannot perform the operation has no way to inform the coordinator. Two-phase commit: the most common scheme, but it cannot handle all types of coordinator failure. Three-phase commit: can handle (any type of) failure of the coordinator. CSCI319 Chapter 8 Page: 48 of 61
Two-phase Commit (1) Protocol consists of two phases, each consisting of two steps. Phase 1: Voting phase Phase 2: Decision Phase Requires that a reliable point-to-point communication strategy is in place. These phases are realized as follows: CSCI319 Chapter 8 Page: 49 of 61
Two-phase Commit (2) Phase 1: Voting phase. Step 1: The coordinator sends a VOTE_REQUEST to all participants. Step 2: Upon receiving the request, each participant replies with VOTE_COMMIT or VOTE_ABORT depending on its local situation. CSCI319 Chapter 8 Page: 50 of 61
Two-phase Commit (3) Phase 2: Decision phase. Step 1: The coordinator collects all votes, then sends either GLOBAL_COMMIT or GLOBAL_ABORT to all participants, depending on the votes received in the voting phase. Step 2: If a participant receives a GLOBAL_COMMIT message it commits the transaction locally; on GLOBAL_ABORT it aborts the transaction locally. A sketch of both phases is given below. CSCI319 Chapter 8 Page: 51 of 61
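A compact sketch of both phases, not the full protocol (send, recv and the callbacks are placeholder transport/application hooks; timeouts and crash handling are omitted; the message names follow the slides).

    def two_phase_commit(participants, send, recv):
        # Phase 1: voting
        for p in participants:
            send(p, "VOTE_REQUEST")
        votes = [recv(p) for p in participants]       # VOTE_COMMIT / VOTE_ABORT
        # Phase 2: decision
        decision = ("GLOBAL_COMMIT"
                    if all(v == "VOTE_COMMIT" for v in votes) else "GLOBAL_ABORT")
        for p in participants:
            send(p, decision)
        return decision

    def participant(coordinator, send, recv, can_commit, commit, abort):
        if recv(coordinator) == "VOTE_REQUEST":
            send(coordinator, "VOTE_COMMIT" if can_commit() else "VOTE_ABORT")
            if recv(coordinator) == "GLOBAL_COMMIT":
                commit()       # make the local changes permanent
            else:
                abort()        # roll back the local changes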
Two-Phase Commit (4) The finite state diagram of the two-phase commit protocol for (a) the coordinator, and (b) for a participant. The diagram shows the states in which a coordinator and participant can be in. CSCI319 Chapter 8 Page: 52 of 61
Two-phase Commit (5) Robustness: The two-phase commit strategy can be made robust to almost all possible faults in the coordinator or the participants. Only exception: if the coordinator is in the WAIT state and all participants are in the READY state, then no fail-safe strategy exists. This is because WAIT and READY are blocking states. It is a rare situation and can be overcome by using the three-phase commit strategy, which avoids mutually blocking processes through the introduction of a pre-commit state (see book for the algorithm). CSCI319 Chapter 8 Page: 53 of 61
Recovery (1) Recovering a failed process to a correct state is not always possible, as some operations are irreversible. Strategies depend on the type of the affected component (storage, process, etc.). Strategies include: checksums, checkpointing, logging, etc. Recovery requires that the required information has been stored safely. CSCI319 Chapter 8 Page: 54 of 61
Stable Storage (1) We differentiate between three types of storage: 1. RAM, wiped out when the machine crashes. 2. Disk storage, can be lost when a disk head crashes. 3. Stable storage, designed to survive any type of crash other than acts of God. E.g. ROM or WORM storage, fail-safe disk storage. CSCI319 Chapter 8 Page: 55 of 61
Stable Storage (2) RAID: Redundant Array of Independent Disks. RAID 0: striping across multiple disks; no redundancy. RAID 1: mirrored disks; reduces capacity by at least 50%. RAID 2: adds parity disk(s) to a striped array (Hamming code). RAID 3: byte-level striping with (one) dedicated parity disk; the parity disk is a bottleneck. RAID 4: block-level striping (instead of byte-level) with a dedicated parity disk. RAID 5: striping with parity distributed across the set of disks; reads are as fast as RAID 0; the capacity of one disk is lost to parity. RAID 6: dual parity; can deal with two concurrent disk failures. RAID-Z: extends RAID 5 to avoid write holes and performance issues. A parity sketch is given below. CSCI319 Chapter 8 Page: 56 of 61
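A small illustration of the parity idea behind RAID levels 3 to 5 (the block contents are arbitrary example bytes): the parity block is the XOR of the data blocks, so any single lost block can be reconstructed from the surviving blocks.

    def parity(blocks):
        # XOR all blocks together, byte by byte
        p = bytes(len(blocks[0]))
        for b in blocks:
            p = bytes(x ^ y for x, y in zip(p, b))
        return p

    data = [b"AAAA", b"BBBB", b"CCCC"]     # stripes on three data disks
    p = parity(data)                       # stored on the parity disk

    # Disk 1 fails: reconstruct its block from the remaining blocks + parity.
    recovered = parity([data[0], data[2], p])
    assert recovered == data[1]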
Recovery of processes E.g. realized through checkpointing. A recovery line defines the state to which a system has to be reverted in order to recover from a failed process. Note that in this example, the last checkpoint of P2 cannot serve as a recovery point in this DS. CSCI319 Chapter 8 Page: 57 of 61
Independent Checkpointing Finding a recovery line in a DS can be difficult, as it can lead to a domino effect, as illustrated here. The domino effect can be avoided by: synchronizing checkpointing across processes (difficult), or logging messages: CSCI319 Chapter 8 Page: 58 of 61
Message-Logging The domino effect can be avoided by logging (and replaying) messages rather than checkpointing process states. The difficulty here is that an incorrect replay of messages after recovery can lead to an orphan process. A replay sketch is given below. CSCI319 Chapter 8 Page: 59 of 61
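A minimal sketch of message logging and replay (the pickle-based log format and helper names are assumptions): after a crash, the process restarts from its last checkpoint and the logged messages are re-applied in their original order.

    import pickle

    def log_message(logfile, msg):
        pickle.dump(msg, logfile)      # append to a log on stable storage
        logfile.flush()

    def recover(checkpoint_state, logfile, apply_message):
        state = checkpoint_state
        logfile.seek(0)
        while True:
            try:
                msg = pickle.load(logfile)
            except EOFError:
                break
            state = apply_message(state, msg)   # deterministic replay
        return state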
Recovery oriented computing Recovery can be achieved by simply starting over again (i.e. rebooting). Problem: this may reproduce the cause of the crash, and hence a solution is never obtained. Alternatively: apply checkpointing to redundant algorithms, then recover to an algorithm which has not failed. Robust but hard to implement. CSCI319 Chapter 8 Page: 60 of 61
Summary on fault tolerance Fault tolerance Failure models Failure recovery strategies Scalability issues Reliable multicasting Virtual Synchrony and distributed commits Recovery CSCI319 Chapter 8 Page: 61 of 61