
1 Scalability
The ability of a system, network, or process to handle a growing amount of work in a capable manner, or its ability to be enlarged to accommodate that growth.
We can measure growth in almost any terms, but there are three particularly interesting things to look at:
- Size scalability: adding more nodes should make the system linearly faster; growing the dataset should not increase latency.
- Geographic scalability: it should be possible to use multiple data centers to reduce the time it takes to respond to user queries, while dealing with cross-data-center latency in some sensible manner.
- Administrative scalability: adding more nodes should not increase the administrative costs of the system (e.g. the administrators-to-machines ratio).
A scalable system is one that continues to meet the needs of its users as scale increases. There are two particularly relevant aspects - performance and availability - which can be measured in various ways.

2 Performance (and latency)
Characterization of the amount of useful work accomplished by a computer system compared to the time and resources used. Depending on the context, this may involve achieving one or more of the following:
- Short response time / low latency for a given piece of work
- High throughput (rate of processing work)
- Low utilization of computing resource(s)
Latency: the state of being latent; a delay, the period between the initiation of something and its occurrence.
Latent: from Latin latens, latentis, present participle of lateo ("lie hidden"). Existing or present but concealed or inactive.

3 Availability (and fault tolerance)
The proportion of time a system is in a functioning condition. If a user cannot access the system, it is said to be unavailable. As a formula: availability = uptime / (uptime + downtime).
From a technical perspective, availability is mostly about being fault tolerant. Fault tolerance is the ability of a system to behave in a well-defined manner once faults occur.

Availability | Nickname    | Downtime per year
90%          | one nine    | more than a month
99%          | two nines   | less than 4 days
99.9%        | three nines | less than 9 hours
99.99%       | four nines  | less than 1 hour
99.999%      | five nines  | about 5 minutes
99.9999%     | six nines   | about 31 seconds
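
The downtime figures in the table follow directly from the availability formula. A minimal sketch of the calculation (assuming a 365-day year; the function name is illustrative):

```python
# Downtime budget per year implied by a given availability level.
# Assumes a 365-day year and steady-state availability.

def downtime_per_year(availability: float) -> str:
    """availability is a fraction, e.g. 0.999 for 'three nines'."""
    seconds = (1.0 - availability) * 365 * 24 * 3600
    if seconds >= 86400:
        return f"{seconds / 86400:.1f} days"
    if seconds >= 3600:
        return f"{seconds / 3600:.1f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.1f} minutes"
    return f"{seconds:.0f} seconds"

for nines in range(1, 7):
    availability = 1 - 10 ** -nines
    print(f"{nines} nines ({availability:.6%}): {downtime_per_year(availability)}")
```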

4 Coordination

5 Consensus & Agreement
It is generally important that the processes within a distributed system reach some sort of agreement. Coordination among multiple parties involves agreement among those parties.
Related notions: agreement, consensus, consistency.
Agreement is difficult in a dynamic asynchronous system in which processes may fail or join/leave.

6 System Model
A system model is a set of assumptions about the environment and facilities on which a distributed system is implemented. Assumptions include:
- what capabilities the nodes have and how they may fail
- how communication links operate and how they may fail
- overall properties of the system, such as assumptions about time and order
A robust system model is one that makes the weakest assumptions.

7 Node Model
Nodes serve as hosts for computation and storage. They have:
- the ability to execute a program
- the ability to store data into volatile memory (which can be lost upon failure) and into stable state (which can be read after a failure)
- a clock (which may or may not be assumed to be accurate)
Nodes execute deterministic algorithms: the local computation, the local state after the computation, and the messages sent are determined uniquely by the message received and the local state when the message was received.

8 Node Failures
There are many possible failure models which describe the ways in which nodes can fail.
- Crash faults: nodes can only fail by crashing, and can (possibly) recover after crashing at some later point.
- Omission faults: nodes can fail by crashing or by dropping messages.
- Byzantine faults: nodes can fail by misbehaving in any arbitrary way: sending wrong messages, ignoring messages, sending useless messages, faking state, etc.

9 Link Model
Links connect individual nodes to each other. Messages may be sent in either direction. The network is unreliable and subject to message loss and delays.
A network partition occurs when the network fails while the nodes themselves remain operational. Partitioned nodes may still be accessible by some clients, and so must be treated differently from crashed nodes.

10 Link Failures
There are many possible failure models which describe the ways in which links can fail.
- Omission faults: links can fail by losing or delaying messages, transiently or permanently.
- Byzantine faults: links can misbehave in any arbitrary way: corrupting, duplicating, reordering, or fabricating messages.

11 Communication Patterns
(diagram: request/response message exchange, shown for the synchronous and the asynchronous case)

12 Timing/Ordering
Synchronous system model: processes execute in lock-step; there is a known upper bound on message transmission delay; each process has an accurate clock.
Asynchronous system model: no timing assumptions; processes execute at independent rates; there is no bound on message transmission delay; useful clocks do not exist.

13 Impossibility Theorems
Two fundamental theorems, FLP and CAP, influence system design choices.
FLP theorem (asynchronicity vs. synchronicity): consensus is impossible to implement in such a way that it both a) is always correct and b) always terminates if even one machine might fail in an asynchronous system with crash-stop failures.
CAP theorem (what happens when network partitions are included in the failure model): you can't implement consistent storage and respond to all requests if you might drop messages between processes.

14 FLP
Impossibility of Distributed Consensus with One Faulty Process, by Fischer, Lynch and Paterson (1985).
Consensus problem: we have a set of processes, each one with a private input; the processes communicate; the processes must agree on some process's input.

15 Consensus is important
With consensus we can implement anything we can imagine: leader election, mutual exclusion, transaction commitment, and much more.
In some models consensus is possible; in other models, it is not. The goal is to learn whether, for a given model, consensus is possible or not - and prove it!

16 Consensus Properties
1. Agreement: every correct process must agree on the same value.
2. Integrity: every correct process decides at most one value, and if it decides some value, then it must have been proposed by some process.
3. Termination: all correct processes eventually reach a decision.
4. Validity: if all correct processes propose the same value V, then all correct processes decide V.

17 FLP System Model
Asynchronous communication model, i.e., no upper bound on the amount of time processors may take to receive, process and respond to an incoming message.
Communication links between processors are assumed to be reliable. It is well known that given arbitrarily unreliable links, no solution for consensus could be found even in a synchronous model.
Processors are allowed to fail according to the crash fault model; this simply means that processors that fail do so by ceasing to work correctly. There are more general failure models, such as Byzantine failures, where processors fail by deviating arbitrarily from the algorithm they are executing.

18 Consequences of FLP
There is no way to solve the consensus problem under a very minimal system model in a way that cannot be delayed forever. Complete correctness is not possible in asynchronous models.
In practice, we may live with a very low probability of disagreement (give up safety), or with a very low probability of blocking (give up liveness). Two-phase commit or even three-phase commit can block forever.
This result is particularly relevant to people designing algorithms.

19 CAP
Presented as Brewer's Conjecture in 2000. Formalized and proved in "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services", by Gilbert and Lynch (2002).
Consistency, availability and partition-tolerance cannot all be achieved at the same time in a distributed system. Simply put: in an asynchronous network where messages may be lost (partition tolerance), it is impossible to implement a service providing correct data (consistency) and eventually responding to every request (availability) under every pattern of message loss.
Slides from _Downloads/Workshop/Talks/2010-HS-DIS-I_Giangreco-CAP_Theorem-Talk.pdf

20 Consistency
All the nodes in the system see the same state of the data. Formally, we speak of atomic or linearizable consistency: there exists a sequential order on all operations which is consistent with the order of invocations and responses, such that each operation looks as if it were completed at a single instant.
(diagram: all replicas hold the same value V1)

21 Availability Every request received by a non-failing node should be processed and must result in a response

22 Partition Tolerance If some nodes crash and/or some communications fail, system still performs as expected

23 CAP Theorem
It is impossible in the asynchronous network model to implement a read/write data object that guarantees the following properties:
- Availability
- Atomic consistency
in all fair executions (including those in which messages are lost).
Asynchronous means there is no clock; nodes make decisions based only on the messages received and local computation.

24 Consequences of CAP

25 Consequences of CAP
When partitions are rare (e.g., parallel systems), CAP should allow perfect C and A most of the time. In distributed systems it is not possible to avoid network partitions. It is not a matter of choosing either C or A outright; rather, it is an act of balancing between the two properties.

26 Practical Consequences of CAP Many system designs used in early distributed relational database systems did not take into account partition tolerance (e.g. they were CA designs). There is a tension between strong consistency and high availability during network partitions. A distributed system consisting of independent nodes connected by an unpredictable network cannot behave in a way that is indistinguishable from a non-distributed system. There is a tension between strong consistency and performance in normal operation. Strong consistency requires that nodes communicate and agree on every operation. This results in high latency during normal operation. If we do not want to give up availability during a network partition, then we need to explore whether consistency models other than strong consistency are workable for our purposes.

27 Consistency
A consistency model is a contract between programmer and system, wherein the system guarantees that if the programmer follows some specific rules, the results of operations on the data store will be predictable.
Strong consistency models guarantee that the apparent order and visibility of updates is equivalent to a non-replicated system:
- Linearizable consistency
- Sequential consistency
Weak consistency models (everything that is not strong):
- Client-centric consistency models
- Causal consistency: the strongest model that remains available (during partitions)
- Eventual consistency models

28 Strong Consistency Models
Linearizable consistency: under linearizable consistency, all operations appear to have executed atomically in an order that is consistent with the global real-time ordering of operations. (Herlihy & Wing, 1990)
Sequential consistency: under sequential consistency, all operations appear to have executed atomically in some order that is consistent with the order seen at individual nodes and that is equal at all nodes. (Lamport, 1979)

29 Primary Copy Replication
All updates are performed on the primary; a log of operations (or alternatively, of changes) is shipped across the network to the backup replicas.
There are asynchronous and synchronous variants; the asynchronous variant is used by MySQL and MongoDB.
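
A minimal sketch of the primary-copy scheme just described, with in-memory stand-ins for the primary and its backups (class names and the single-key operation format are illustrative assumptions; real systems ship a durable log over the network):

```python
# Primary-copy replication sketch: the primary applies each update locally,
# appends it to its log, and ships it to the backups. In the synchronous
# variant it waits for the backups before answering; in the asynchronous
# variant (MySQL/MongoDB style) the log would be shipped in the background.

class Replica:
    def __init__(self):
        self.store = {}

    def apply(self, op):
        key, value = op
        self.store[key] = value
        return "ack"

class Primary(Replica):
    def __init__(self, backups, synchronous=True):
        super().__init__()
        self.backups = backups
        self.synchronous = synchronous
        self.log = []                      # log of operations to ship

    def write(self, key, value):
        op = (key, value)
        self.apply(op)                     # update the primary copy
        self.log.append(op)
        if self.synchronous:
            for b in self.backups:         # wait for every backup's ack
                b.apply(op)
        # asynchronous variant: ship self.log to the backups in the background
        return "ok"

backups = [Replica(), Replica()]
primary = Primary(backups, synchronous=True)
primary.write("cart:42", ["book"])
print(backups[0].store)                    # {'cart:42': ['book']}
```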

30 Primary Copy Fault Tolerance
Failures can cause violations of consistency guarantees: site failures, network partitions, undelivered messages.
We need an atomic commitment protocol (ACP), commonly used in database systems (e.g., for transactions).

31 Atomic Commitment Protocols
- 1 coordinator, N participants
- The coordinator knows the names of all participants; participants know the name of the coordinator, but not necessarily each other
- Each site contains a distributed transaction log that survives failures
- Each process may cast exactly one of two votes: Yes or No
- Each process can reach exactly one of two decisions: Commit or Abort
This is a consensus problem!

32 ACP Conditions
1. All processes that reach a decision reach the same one.
2. A process cannot reverse its decision after it has reached one.
3. The Commit decision can only be reached if all processes voted Yes.
4. If there are no failures and all processes voted Yes, then the decision will be to Commit.
5. Consider any execution containing only failures that the algorithm is designed to tolerate. At any point in this execution, if all existing failures are repaired and no new failures occur for sufficiently long, then all processes will eventually reach a decision.

33 Two Phase Commit (2PC)
(diagram: voting phase - the coordinator sends "Ready?" to the participants, and each participant replies Yes or No)

34 Two Phase Commit (2PC)
(diagram: decision phase - the coordinator sends Commit or Abort to the participants)
- Terminates if there are no faults
- Satisfies conditions 1-4
- Does not satisfy condition 5 (as expected)
- 3n messages for n nodes
What happens with participant faults?
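
A minimal, failure-free sketch of the two phases, run in a single process. The Participant class and its vote/commit/abort methods are illustrative assumptions; a real protocol exchanges messages and logs decisions to stable storage, and it is exactly the failure cases discussed next that this sketch ignores.

```python
# Failure-free two-phase commit (2PC) sketch.

class Participant:
    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.decision = None

    def vote(self):                      # phase 1: answer "Ready?"
        return "Yes" if self.will_vote_yes else "No"

    def commit(self):                    # phase 2: apply the decision
        self.decision = "Commit"

    def abort(self):
        self.decision = "Abort"

def two_phase_commit(participants):
    # Phase 1: collect votes from all participants
    votes = [p.vote() for p in participants]
    # Phase 2: commit only if every participant voted Yes (ACP condition 3)
    if all(v == "Yes" for v in votes):
        for p in participants:
            p.commit()
        return "Commit"
    for p in participants:
        p.abort()
    return "Abort"

ps = [Participant("A"), Participant("B"), Participant("C", will_vote_yes=False)]
print(two_phase_commit(ps))              # Abort, because C voted No
```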

35 2PC and multiple failures
Multiple participant failures: no problem, as long as the coordinator lives.
Coordinator and participant failures: if the coordinator and one participant fail after the decision has been delivered only to that participant (e.g. an Abort sent to a participant that then crashes, while the others have voted Yes), the remaining participants cannot take a decision.

36 2PC and multiple failures
Solution: add another phase to the protocol! The new phase precedes the commit phase. The goal is to inform all participants that all are ready to commit or abort. At the end of this phase, every participant knows whether or not all participants want to commit, before any participant has actually committed or aborted!
This protocol is called the three-phase commit protocol (3PC).

37 Three Phase Commit (3PC)
(diagram: voting phase - the coordinator sends "Ready?" to the participants, and each participant replies Yes)

38 Three Phase Commit (3PC)
(diagram: prepare phase - the coordinator sends Prepare to the participants, and each participant acknowledges)

39 Three Phase Commit (3PC)
(diagram: commit phase - the coordinator sends Commit to the participants)

40 Weak Consistency Models Client-centric consistency models are consistency models that involve the notion of a client or session in some way. The eventual consistency model says that if you stop changing values, then after some undefined amount of time all replicas will agree on the same value. Since it is trivially satisfiable (liveness property only), it is useless without supplemental information.

41 Eventual Consistency
If no new updates are made to the data object, eventually all accesses will return the last updated value.
- A special form of weak consistency
- Network partitions, should they occur, are tolerated
- Requires conflict resolution mechanisms; writes are disseminated through message exchange, a.k.a. anti-entropy

42 Dynamo
A service providing persistent data storage to other services running on Amazon's distributed computing infrastructure.
"Dynamo: Amazon's Highly Available Key-value Store", SOSP 2007: the first paper discussing the implementation of a weakly consistent, highly available data replication service.

43 Amazon's Platform Architecture
(diagram: client requests flow through page rendering components to stateless aggregator services and then, via request routing, to stateful services backed by datastores such as S3 and Dynamo instances)

44 Design Issues
Replication for high availability and durability:
- replication technique: synchronous or asynchronous?
- conflict resolution: when and who?
Dynamo's goal is to be "always writable": rejecting writes may result in a poor Amazon customer experience, so the data store needs to be highly available for writes, e.g., accept writes during failures and allow write conversations without prior context.
Design choices:
- optimistic (asynchronous) replication for non-blocking writes
- conflict resolution on read operations for high write throughput
- conflict resolution by the client for user-perceived consistency

45 Design Principles
Incremental scalability: scale out (and in) one node at a time, with minimal impact on both the operators of the system and the system itself.
Symmetry: every node should have the same set of responsibilities; there are no distinguished nodes or nodes that take on a special role. Symmetry simplifies system provisioning and maintenance.
Decentralization: favor decentralized peer-to-peer techniques to achieve a simpler, more scalable, and more available system.
Heterogeneity: exploit heterogeneity in the underlying infrastructure; load distribution is proportional to the capabilities of individual servers, allowing new nodes with higher capacity to be added without upgrading all nodes.

46 Interfaces
Data objects (blobs) are identified through a hash key. There are two single-object operations, and no group operations.
Read operation: GET(K) -> [OBJECT(S), CONTEXT]
- Locate the object replicas associated with key K
- Return a single object, or multiple conflicting versions of the same object together with a context
Write operation: PUT(K, OBJECT, CONTEXT)
- Determine where the replicas should be placed based on key K
- Write the replicas to disk
Context: system metadata, opaque to the caller; includes information on the object, such as versioning.
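
A hypothetical sketch of how a client uses this interface: GET returns possibly conflicting versions plus an opaque context, and PUT passes that context back. The ToyDynamoClient below is purely an in-memory stand-in for illustration, not Amazon's actual API.

```python
# Toy stand-in for a Dynamo-style GET/PUT interface.

class ToyDynamoClient:
    def __init__(self):
        self.versions = {}          # key -> list of (possibly conflicting) versions
        self.contexts = {}          # key -> opaque metadata (e.g. vector clocks)

    def get(self, key):
        return self.versions.get(key, []), self.contexts.get(key)

    def put(self, key, context, obj):
        # A real store would use the context to version the write; here we
        # simply install obj as the single current version.
        self.versions[key] = [obj]
        self.contexts[key] = context

def add_to_cart(client, cart_key, item):
    versions, context = client.get(cart_key)
    merged = set().union(*versions) if versions else set()   # reconcile conflicts
    merged.add(item)                                          # never lose an "add to cart"
    client.put(cart_key, context, merged)

client = ToyDynamoClient()
add_to_cart(client, "cart:42", "book")
add_to_cart(client, "cart:42", "pen")
print(client.get("cart:42")[0])     # one version containing both items
```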

47 Partitioning
Consistent hashing. Provides load distribution; node arrivals and departures only impact immediate neighbors on the ring.
(diagram: nodes A-E placed on a hash ring, with each key assigned to a node)

48 Replication
Every data item is replicated at N hosts (N is configured per instance of Dynamo).
- Each key k is assigned a coordinator host that handles write requests for k; the coordinator is also in charge of replication of data items within its range.
- The coordinator stores the data item with key k locally, and also stores it at the N-1 clockwise successor nodes.
- Every node is responsible for the region of the ring between itself and its predecessor.
The preference list enumerates the nodes that are responsible for storing a key k; it contains more than N nodes to account for node failures.
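
A minimal consistent-hashing sketch of the partitioning and replication scheme above: each node hashes to a position on the ring, a key is handled by the first node clockwise from its hash, and the preference list is that coordinator plus its distinct clockwise successors. The hash function, ring size, and node names are illustrative assumptions; Dynamo additionally uses virtual nodes, which are omitted here.

```python
# Consistent hashing ring with clockwise preference lists.
# MD5 and a 32-bit ring are arbitrary illustrative choices.

import hashlib
from bisect import bisect_right

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)   # node positions
        self.hashes = [h for h, _ in self.ring]

    def preference_list(self, key: str, n_replicas: int):
        """Coordinator for key plus clockwise successors: n_replicas distinct nodes."""
        start = bisect_right(self.hashes, ring_hash(key)) % len(self.ring)
        prefs = []
        for i in range(len(self.ring)):
            node = self.ring[(start + i) % len(self.ring)][1]
            if node not in prefs:
                prefs.append(node)
            if len(prefs) == n_replicas:
                break
        return prefs

ring = Ring(["A", "B", "C", "D", "E"])
# The first node in the list is the coordinator for the key, the rest are replicas.
print(ring.preference_list("cart:42", n_replicas=3))
```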

49 Data Versioning
Replication is performed after a response is sent to the client; this is called asynchronous replication. It may result in inconsistencies under network partitions, but operations should not be lost: an "add to cart" should not be rejected, but also not forgotten. If an "add to cart" is performed when the latest version is not available, it is performed on an older version; we may therefore have different versions of a key/value pair. Once a partition heals, versions are merged; the goal is not to lose any "add to cart".
Data versioning technique: vector clocks, which capture causality between different versions of an object.

50 Vector Clocks
Each write to a key k is associated with a vector clock VC(k). VC(k) is an array (map) of integers: in theory, one entry VC(k)[i] for each node i. When node i handles a write of key k, it increments VC(k)[i]. VCs are included in the context of the put call.
In practice, VC(k) will not have many entries (only nodes from the preference list should normally have entries), and Dynamo truncates entries if there are more than a threshold (say 10).
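
A minimal sketch of the vector-clock operations just described, representing VC(k) as a map from node id to counter (node names mirror the example on the next slide; Dynamo's truncation of old entries is omitted):

```python
# Per-key vector clocks: detect causal descent vs. conflicting versions.

def increment(vc: dict, node: str) -> dict:
    """Return a copy of vc with node's entry incremented (a write handled by node)."""
    new = dict(vc)
    new[node] = new.get(node, 0) + 1
    return new

def descends(a: dict, b: dict) -> bool:
    """True if version a causally descends from (or equals) version b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a: dict, b: dict) -> bool:
    """Neither descends from the other: a conflict the client must reconcile."""
    return not descends(a, b) and not descends(b, a)

d2 = increment(increment({}, "N1"), "N1")     # D2 = [N1,2]
d3 = increment(d2, "N2")                      # D3 = [N1,2][N2,1]
d4 = increment(d2, "N3")                      # D4 = [N1,2][N3,1]
print(concurrent(d3, d4))                     # True: client-side reconciliation needed
# merge (element-wise max; here the shared entry N1 already agrees), write at N1
d5 = increment({**d3, **d4}, "N1")            # D5 = [N1,3][N2,1][N3,1]
print(descends(d5, d3) and descends(d5, d4))  # True
```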

51 Data Versioning
Example (Dx[Ny,z] denotes object version Dx with logical-clock entry z for handling node Ny):
1. Client A writes new object D; node N1 writes version D1[N1,1].
2. Client A updates object D; node N1 writes version D2[N1,2].
3. Client A updates object D; node N2 writes version D3[N1,2][N2,1].
4. Client B reads and updates object D; node N3 writes version D4[N1,2][N3,1].
5. Client C reads object D (i.e., both D3 and D4); the client performs reconciliation and node N1 writes version D5[N1,3][N2,1][N3,1].

52 Handling PUT/GET
Any node can receive a GET/PUT request for any key. This node is selected by:
- a generic load balancer, or
- a client library that goes directly to the coordinator nodes in a preference list.
If the request comes from the load balancer, a node serves the request only if it is in the preference list; otherwise, it routes the request to the first node in the preference list (each node has routing info to all other nodes).
Extended preference list: the N nodes from the preference list plus some additional nodes (following the circle) to account for failures. In the failure-free case, the nodes from the preference list are involved in get/put; in the failure case, the first N alive nodes from the extended preference list are involved.

53 R and W
R = number of nodes that need to participate in a GET; W = number of nodes that need to participate in a PUT; R + W > N (a quorum system).
Handling PUT (by the coordinator) // rough sketch
1. Generate a new VC, write the new version locally
2. Send the value and VC to N selected nodes from the preference list
3. Wait for W-1 acknowledgments (the coordinator's own write counts toward W)
Handling GET (by the coordinator) // rough sketch
1. Send the read to N selected nodes from the preference list
2. Wait for R responses
3. Select the highest versions per VC, return all such versions
4. Reconcile/merge the different versions
5. Write back the reconciled version
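
A minimal in-process sketch of the quorum interaction above with (R, W, N) = (2, 2, 3): a write succeeds once W replicas accept it, and because R + W > N any read of R replicas sees at least one copy of the latest write. Replicas here are plain objects with integer versions; real Dynamo nodes exchange messages and use vector clocks as versions.

```python
# Quorum read/write sketch with N=3, W=2, R=2 (R + W > N).

N, R, W = 3, 2, 2

class Replica:
    def __init__(self):
        self.version, self.value = 0, None

    def write(self, version, value):
        if version > self.version:
            self.version, self.value = version, value
        return "ack"

    def read(self):
        return self.version, self.value

def put(replicas, version, value):
    """Send the write to replicas in order; succeed once W have acknowledged."""
    acks = 0
    for r in replicas:
        if r.write(version, value) == "ack":
            acks += 1
        if acks >= W:
            return True                    # quorum reached; remaining replicas may lag
    return False

def get(replicas):
    """Read R replicas and return the highest version seen.
    R + W > N guarantees overlap with the latest successful write."""
    responses = [r.read() for r in replicas[-R:]]
    return max(responses, key=lambda version_value: version_value[0])

replicas = [Replica() for _ in range(N)]
put(replicas, version=1, value="v1")
print(get(replicas))                       # (1, 'v1')
```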

54 Quorum Systems
Formally, a quorum system S = {S1, ..., SN} over a universe U is a collection of quorum sets Si ⊆ U such that any two quorum sets have at least one element in common. Here U is the set of replicas, i.e., |U| = N.
For replication, we consider two kinds of quorum sets: read quorums R and write quorums W. Rules:
1. Any read quorum must overlap with any write quorum.
2. Any two write quorums must overlap.

55 Quorum Rules
Read rule: R + W > N (read and write quorums overlap).
Write rule: 2W > N (any two write quorums overlap).
The quorum sizes determine the costs of read and write operations. The minimum quorum sizes are:
min W = ⌊N/2⌋ + 1
min R = ⌈N/2⌉
Write quorums require a majority; read quorums require at least half of the nodes.
Typical Dynamo configuration: (R, W, N) = (2, 2, 3).

56 Handling Failures
The N selected nodes are the first N healthy nodes, which might change from request to request; hence these quorums are "sloppy" quorums. Sloppy vs. strict quorums: sloppy quorums allow availability under a much wider range of partitions (failures), but sacrifice consistency.
It is also important to handle failures of an entire data center (power outages, cooling failures, network failures, disasters); the preference list accounts for this (its nodes are spread across data centers).
Hinted handoff: if a node is unreachable, a hinted replica is sent to the next healthy node. Nodes receiving hinted replicas keep them in a separate database; the hinted replica is delivered to the original node when it recovers.
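
A small sketch of how the first N healthy nodes might be chosen from the extended preference list, tagging each substitute with a hint naming the node it stands in for (node names and the healthy predicate are illustrative assumptions):

```python
# Sloppy-quorum target selection with hinted handoff.

def select_targets(pref_list, healthy, n):
    """Return a list of (target, hint): hint is the intended node when a
    healthy substitute further down the extended preference list is used.
    Assumes enough healthy substitutes exist beyond the first n entries."""
    intended = pref_list[:n]
    substitutes = (node for node in pref_list[n:] if healthy(node))
    targets = []
    for node in intended:
        if healthy(node):
            targets.append((node, None))                 # normal replica
        else:
            targets.append((next(substitutes), node))    # hinted replica
    return targets

extended_pref_list = ["A", "B", "C", "D", "E"]
down = {"B"}
print(select_targets(extended_pref_list, lambda node: node not in down, n=3))
# [('A', None), ('D', 'B'), ('C', None)] - D holds a replica hinted for B
```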

57 Replica Synchronization
Using vector clocks, concurrent and out-of-date updates can be detected, and performing read repair becomes possible. In some cases (concurrent changes) we need to ask the client to pick a value.
What about nodes rejoining? Nodes rejoin the cluster after being partitioned, or when a failed node is replaced or partially recovered. Replica synchronization is used to bring nodes up to date after a failure, and to periodically synchronize replicas with each other.

58 Replica Reconciliation
A node holding data received via hinted handoff may crash before it can pass the data to the unavailable node in the preference list, so we need another way to ensure that each (k, v) pair is replicated N times: nodes nearby on the ring periodically compare the (k, v) pairs they hold, and copy any they are missing that are held by the other.

59 Merkle Trees
Idea: hierarchically summarize the (k, v) pairs a node holds, by ranges of keys.
- Leaf node: hash of one (k, v) pair
- Internal node: hash of the concatenation of its children
Compare the roots; if they match, done. If they don't match, compare the children and recur.
Benefits: minimizes the amount of data to be transferred for synchronization, and reduces the number of required disk reads.
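
A minimal Merkle-tree sketch of the idea above: hash each (key, value) pair at the leaves, hash concatenated child hashes upward, and compare roots first so that matching trees cost a single exchange. SHA-256, a fan-out of two, and the direct leaf comparison on mismatch are simplifying assumptions; a full implementation descends level by level.

```python
# Merkle tree over a node's (key, value) pairs for one key range.

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(pairs):
    """pairs: (key, value) tuples sorted by key. Returns the list of levels,
    level 0 = leaf hashes, last level = [root]."""
    level = [h(f"{k}={v}".encode()) for k, v in pairs]
    levels = [level]
    while len(level) > 1:
        level = [h(b"".join(level[i:i + 2])) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def differing_leaves(t1, t2):
    """Compare roots; if they match there is nothing to sync. Otherwise
    (simplified) compare leaf hashes directly to find the differing keys."""
    if t1[-1] == t2[-1]:
        return []
    return [i for i, (a, b) in enumerate(zip(t1[0], t2[0])) if a != b]

node1 = build_tree([("k1", "v1"), ("k2", "v2"), ("k3", "v3"), ("k4", "v4")])
node2 = build_tree([("k1", "v1"), ("k2", "old"), ("k3", "v3"), ("k4", "v4")])
print(differing_leaves(node1, node2))      # [1]: only k2's data must be exchanged
```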

60 Anti-Entropy in Dynamo
Dynamo uses Merkle trees for anti-entropy as follows: each node maintains a separate Merkle tree for each key range it hosts; two nodes exchange the roots of the Merkle trees of the key ranges that they host in common; using the synchronization mechanism described before, the nodes determine whether they have any differences and perform the appropriate synchronization.
