3. Replication

Goal:
- Avoid a single point of failure by replicating the server
- Increase scalability by sharing the load among replicas

Problems:
- Partial failures of replicas and messages
- No global ordering of messages
- Some replicas might execute operations that others did not know about
- The states of the replicas diverge
Replicated State Machine

Also known as Active Replication.

The idea:
- Every replica sees exactly the same set of messages in the same order, and will execute them in that order

Benefit:
- Immediate fail-over

Limitations:
- Waste of resources, since all replicas are doing the same work
- Requires determinism, which is not trivial to ensure (see the sketch below)

An important issue: at what level?
- Option 1: machine instruction level (virtual machine level)

Machine Instruction Level Replication

Benefits:
- Fast consistency resolution
- Transparent

Problems:
- Requires special hardware
- Usually behind the technology curve
- Resource wasteful
- Requires physical proximity
- Does not overcome software bugs / no multi-versioning
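To make the ordering and determinism requirements concrete, here is a minimal sketch in Python (all names are illustrative, not from the slides), assuming some total-order broadcast layer delivers the same command log to every replica:

```python
# Sketch of active replication: a deterministic state machine fed by
# a totally ordered command log. Same log => same state on every replica.

class ReplicatedCounter:
    def __init__(self):
        self.value = 0

    def apply(self, command):
        # Commands must be deterministic: no clocks, randomness, or
        # local I/O, or the replicas' states would diverge.
        op, arg = command
        if op == "add":
            self.value += arg
        elif op == "set":
            self.value = arg
        return self.value

def run_replica(machine, ordered_log):
    # Every replica executes the agreed-upon log in the same order.
    return [machine.apply(cmd) for cmd in ordered_log]

log = [("add", 5), ("set", 2), ("add", 1)]
# Two replicas given the same ordered log end in the same state:
assert run_replica(ReplicatedCounter(), log) == run_replica(ReplicatedCounter(), log)
```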
N-Modular Redundancy

[Figure: Node 1, Node 2, ..., Node N feed a voter; the voter's output goes to a monitoring system]

Logical State Replication

Idea:
- The important thing is the logical/semantic state of the application

Benefits:
- The negation of the shortcomings of the machine-instruction level

Problems:
- Not transparent
- Changes the programming model (although one tries to minimize this)
- Slower consistency resolution
- May result in lower throughput and higher latency
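As an aside on the N-modular-redundancy figure above, the voter can be as simple as a majority vote over the nodes' outputs; a minimal sketch (hypothetical names, Python):

```python
from collections import Counter

def vote(outputs):
    """Return the majority output of N redundant nodes; raise if no
    value reaches a strict majority (illustrative sketch only)."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: replicas disagree")
    return value

# One faulty node out of three is masked by the voter:
assert vote([42, 42, 17]) == 42
```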
Total Ordering Based Replication

Simple replication protocol:
- Clients can send requests to any replica
- All replicas utilize a black-box total ordering mechanism to apply the updates in the same order

[Figure: Client1 and Client2 submit Request1 and Request2; a total-order layer delivers them to the replicated servers, which send back Reply1 and Reply2]

Total Ordering Based Replication: Fine Details

How do clients find an alive server?
- Name servers
- Local directors
- A virtual IP that is migrated between alive servers

What happens if the server fails?
- Before servicing the request: resubmit
- After servicing the request (see the sketch below):
  - We need to avoid re-execution: sequence numbers
  - We need to return the pre-computed result: a cache of recent results
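A minimal sketch of the sequence-number-plus-cache mechanism (Python; the names, and the assumption of one outstanding request per client, are mine, not from the slides):

```python
class Replica:
    """Sketch of at-most-once execution: each client tags its requests
    with increasing sequence numbers; the replica remembers the last
    sequence number and reply per client."""

    def __init__(self, state_machine):
        self.sm = state_machine
        self.last = {}  # client_id -> (seq, cached_reply)

    def handle(self, client_id, seq, request):
        prev = self.last.get(client_id)
        if prev is not None and seq <= prev[0]:
            # Duplicate (the client resubmitted after a failover):
            # return the pre-computed result instead of re-executing.
            return prev[1]
        reply = self.sm.apply(request)
        self.last[client_id] = (seq, reply)
        return reply
```

This handler is assumed to run after the total-ordering layer, so every replica sees the same requests in the same order and the caches stay identical.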
Primary-Backup Replication

Cold backup:
- Only the primary is active
- It periodically checkpoints its state to storage that is available to the backup
  - Stable storage or shared storage (SCSI, SAN)
- When the primary fails, the backup is initiated, loads the state from storage, and continues from there
- Slow recovery:
  - The backup needs to be started (run applications, obtain resources, etc.)
  - Either the backup needs to replay the last actions from a log file, or it may miss the updates made since the most recent checkpoint
+ Resource efficient
+ Invocations need not be deterministic
- It is possible to have several backups, to survive multiple failures
- Requires consistent failure detection, e.g., by a group membership service
- It is possible to have several nodes, each primary for some services and backup for others

Primary-Backup Replication

Warm backup:
- The backup is (at least) partially alive, so the recovery phase is faster
- But it typically still requires some replaying of the last transactions, or losing the last few updates
- Typically, updates to the backup are also more frequent

Hot standby (leader/follower):
- The backup is also up, and is constantly updated about the state of the primary
+ Fast and up-to-date recovery
  - Special protocols are required to ensure truly up-to-date recovery; we will talk about such protocols later
+ More efficient than active replication
- Higher overhead than cold and warm backup
- Slower recovery than active replication
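A minimal sketch of the cold-backup checkpointing step (Python; the path and JSON encoding are placeholders for the stable/shared storage of the slide):

```python
import json, os, tempfile

CHECKPOINT = "/shared/service.ckpt"  # stands in for SAN / shared storage

def checkpoint(state, path=CHECKPOINT):
    # Write atomically: a crash mid-checkpoint leaves the previous
    # checkpoint intact rather than a torn file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def recover(path=CHECKPOINT):
    # The backup loads the last checkpoint; updates made after it are
    # lost unless a separate log is replayed (as noted above).
    with open(path) as f:
        return json.load(f)
```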
Challenges in Primary/Backup

- How to consistently agree on the primary?
- How to detect that the primary has failed?
- How to ensure that if we suspect the primary, it is indeed no longer operating on behalf of the system?
- How to enable additional backups to join the system without manual intervention and reset?

We will discuss these issues when we talk about Consensus, Failure Detection, and Membership.

Quorum Replication: Atomic Register

- Intuitively, operations should appear as if executed on a single server
  - This is a special case of linearizability
- Well suited for distributed storage and distributed shared memory

More specifically:
- A read always returns a value written either by the last write that terminated before the read started, or by a write that is concurrent with the read
- If several writes are concurrent, then a following read can return the value written by any one of them
- A read must not return a value that is older than the value returned by a previous read
Quorums

- A quorum system over a universe U is a set of subsets of U such that the intersection of every two subsets is non-empty
- Each of these subsets is called a quorum
- Example: U = {1,2,3,4}, S = {{1,2,3}, {2,3,4}, {1,4}}
- The simplest generic quorum system is majority: every majority subset intersects every other majority subset
- Another common generic quorum system: any row plus any column of a lattice

Bi-Quorums

- The quorum system is now composed of two sets of subsets of U, A and B
- Every subset from A must intersect every subset from B
- Clearly, every quorum system is also a bi-quorum system (take A = B)
- Majority is a generic bi-quorum system

Scalable generic bi-quorums

Idea (see the sketch below):
- Servers are arranged in a logical matrix
- A write quorum consists of any one row of the matrix
- A read quorum consists of any one column of the matrix

Drawback:
- Tradeoff between availability and the size of the quorums

[Figure: a grid of servers with one column marked as a read quorum and one row marked as a write quorum]
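A minimal sketch of the matrix (grid) bi-quorum construction, with a check of the intersection property (Python; names are illustrative):

```python
def grid(n_rows, n_cols):
    """Arrange server ids 0..n_rows*n_cols-1 in a logical matrix and
    return (write_quorums, read_quorums): the rows and the columns."""
    ids = [[r * n_cols + c for c in range(n_cols)] for r in range(n_rows)]
    write_quorums = [set(row) for row in ids]                            # any one row
    read_quorums = [{row[c] for row in ids} for c in range(n_cols)]      # any one column
    return write_quorums, read_quorums

writes, reads = grid(3, 4)  # 12 servers, write quorums of size 4, read quorums of size 3
# Bi-quorum property: every write quorum intersects every read quorum
# (row r and column c always share the server at position (r, c)).
assert all(w & r for w in writes for r in reads)
```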
Implementing Read/Write Registers with Quorums

Quorum replication:
- Clients read and write directly to quorums of servers
- Essentially an adaptation of the Attiya, Bar-Noy, Dolev protocol for robustly emulating distributed shared memory, but using any given bi-quorum rather than a majority
- Tradeoff between the availability of the system and its scalability (the size of the quorums)
- Cannot implement read-modify-write semantics

Implementing Quorum Replication

Each server maintains a logical timestamp for each register.

Servers' protocol:

    upon receiving an r-request(x) message do
        o := values[x]
        reply with an r-reply(x, o.val, <o.ts, o.id>) message

    upon receiving a w-request(x, v, <ts, id>) message do
        o := values[x]
        if <ts, id> is lexicographically larger than <o.ts, o.id> then
            o.ts := ts; o.id := id; o.val := v
        reply with a w-reply(x) message
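A minimal Python rendering of the servers' protocol above (the class and message names are mine; the network layer is elided):

```python
class RegisterServer:
    """Sketch of a quorum-replication server: one (ts, id, val) record
    per register, overwritten only by lexicographically larger stamps."""

    def __init__(self, server_id):
        self.id = server_id
        self.values = {}  # register name x -> (ts, writer_id, val)

    def r_request(self, x):
        ts, wid, val = self.values.get(x, (0, 0, None))
        return ("r-reply", x, val, (ts, wid))

    def w_request(self, x, v, stamp):
        ts, wid = stamp
        cur = self.values.get(x, (0, 0, None))
        # Comparing <ts, id> lexicographically breaks ties between
        # concurrent writers that chose the same timestamp.
        if (ts, wid) > (cur[0], cur[1]):
            self.values[x] = (ts, wid, v)
        return ("w-reply", x)
```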
Implementing Quorum Replication (continued)

A read operation R(x):
1. Send a read request r-request(x) to all servers
2. Wait for r-reply(x, v, <ts,id>) messages from a READ quorum of servers; let v be the value with the largest logical timestamp <ts,id>
3. Send a write request w-request(x, v, <ts,id>) to all servers
4. Wait for w-reply(x) replies from a WRITE quorum of servers
5. Return v  // the value that was selected in step 2

Why are steps 3 & 4 important? This write-back phase ensures that, once a read returns v, the value is stored at a full write quorum, so a later read cannot return an older value, even if the write that produced v completed at only part of its quorum.

Optimization: it is possible to first send requests only to the corresponding quorums, and contact more servers only if some do not reply.

Implementing Quorum Replication (continued)

A write operation W(x, v):
1. Send a read request r-request(x) to all servers
2. Wait for r-reply(x, -, <ts,id>) messages from a READ quorum of servers; let <ts,id> be the largest returned logical timestamp
3. Send a write request w-request(x, v, <ts+1, my_id>) to all servers
4. Wait for w-reply(x) messages from a WRITE quorum of servers
5. Return
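A client-side sketch combining both operations (Python; it reuses the RegisterServer sketch above and, purely for illustration, "waits for a quorum" by calling every member of one quorum directly):

```python
class QuorumClient:
    """Sketch of the ABD-style read and write over a bi-quorum system."""

    def __init__(self, my_id, servers, read_quorums, write_quorums):
        self.my_id = my_id
        self.servers = servers        # server_id -> RegisterServer
        self.read_q = read_quorums    # e.g., the columns of the grid
        self.write_q = write_quorums  # e.g., the rows of the grid

    def _read_phase(self, x, quorum):
        replies = [self.servers[s].r_request(x) for s in quorum]
        # Pick the reply carrying the largest <ts, id>.
        _, _, val, stamp = max(replies, key=lambda r: r[3])
        return val, stamp

    def read(self, x):
        # Any read quorum works; we just use the first for simplicity.
        val, stamp = self._read_phase(x, next(iter(self.read_q)))
        # Write-back phase (steps 3-4 above): propagate the value to a
        # write quorum so later reads cannot observe an older value.
        for s in next(iter(self.write_q)):
            self.servers[s].w_request(x, val, stamp)
        return val

    def write(self, x, v):
        _, (ts, _) = self._read_phase(x, next(iter(self.read_q)))
        for s in next(iter(self.write_q)):
            self.servers[s].w_request(x, v, (ts + 1, self.my_id))
```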