Distributed Data Stores
Distributed Persistent State

MapReduce addresses distributed processing of aggregation-based queries. But what about persistent state across a large number of machines? A classical distributed DBMS is a poor fit:
- High resource requirements for unnecessary components, high total cost of ownership
- No incremental (elastic) scalability
- Replication support and load balancing for >10^5 nodes?
- Very strict correctness model (ACID)
Brewer's Conjecture, a.k.a. CAP Theorem

Brewer, PODC 2000; Gilbert, Lynch: ACM SIGACT News 33(2), 2002, pp. 51-59.

Non-functional requirements for distributed data stores:
- Consistency
- Availability
- Partition-tolerance
Choose 2!
Fault Tolerance

Millions of hardware components in a cluster: disks, CPUs, memory, network adapters, network cabling, network switches. Something is always broken!
- Availability and partition-tolerance are crucial
- CAP implies: give up (strict) consistency
Eventual Consistency

Less strict than ACID correctness: without further updates, all replicas eventually settle on the same state.

Variants:
- Causal consistency
- Read-your-writes consistency
- Monotonic read consistency
- Monotonic write consistency

Example: Domain Name System (DNS)

Werner Vogels: Eventually Consistent. Commun. ACM 52(1): 40-44 (2009)
Consistent Hashing

How to allocate data items to N nodes? The naive approach uses a hash function h(o) mod N.
Problem: incremental scalability means frequent adding and removing of nodes, and rehashing all items on every change is not feasible.
Consistent hashing: partition the hash value space using indirection, so that node changes only affect neighboring key ranges.

Karger et al.: Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. STOC 1997
Consistent Hashing (illustrations)

[Figures, 3 slides: objects o1-o5 hashed onto the ring of hash values; target nodes t1-t3 placed on the same ring, each object assigned to the next node along the ring; after node positions change, only objects in the affected ranges are reassigned.]
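The ring mechanics illustrated above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation; the MD5-based hash and the 32-bit ring size are arbitrary choices for the sketch:

```python
import bisect
import hashlib

def h(key: str) -> int:
    """Hash a key to a point on the ring (MD5 truncated to 32 bits; arbitrary choice)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    """Each node owns the arc of hash values up to its position on the ring."""
    def __init__(self):
        self.positions = []   # sorted node positions on the ring
        self.node_at = {}     # position -> node name

    def add_node(self, node: str):
        pos = h(node)
        bisect.insort(self.positions, pos)
        self.node_at[pos] = node

    def remove_node(self, node: str):
        pos = h(node)
        self.positions.remove(pos)
        del self.node_at[pos]

    def lookup(self, key: str) -> str:
        # first node clockwise from the key's position, wrapping around the ring
        i = bisect.bisect(self.positions, h(key)) % len(self.positions)
        return self.node_at[self.positions[i]]
```

The key property: removing a node only remaps the keys in that node's own arc; all other keys keep their assignment.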
Vector Clocks

Mechanism to create a partial ordering of updates in distributed systems; detects causal relationships and concurrent updates.
- Use vectors of timestamps instead of simple timestamps: one timestamp per node
- Each node increments its own vector component at each local update
- Comparing two versions v1, v2:
  - If every component of v1 is <= the corresponding component of v2 (and at least one is strictly smaller): v2 resulted from v1, so v1 is obsolete
  - If some component of v1 is greater and some is smaller than the corresponding component of v2: concurrent updates occurred
- A resolved conflict uses the maximum of each component
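The comparison and merge rules can be sketched as follows. Representing a vector clock as a dict mapping node name to counter (with absent entries counting as 0) is a common convention assumed here:

```python
def compare(v1: dict, v2: dict) -> str:
    """Partial order on vector clocks (node name -> counter; absent entries = 0)."""
    nodes = set(v1) | set(v2)
    le = all(v1.get(n, 0) <= v2.get(n, 0) for n in nodes)
    ge = all(v1.get(n, 0) >= v2.get(n, 0) for n in nodes)
    if le and ge:
        return "equal"
    if le:
        return "before"       # v2 descends from v1, so v1 is obsolete
    if ge:
        return "after"
    return "concurrent"       # neither dominates: concurrent updates occurred

def merge(v1: dict, v2: dict) -> dict:
    """Clock attached to a resolved conflict: componentwise maximum."""
    return {n: max(v1.get(n, 0), v2.get(n, 0)) for n in set(v1) | set(v2)}
```

For example, {"A": 2, "B": 0} and {"A": 1, "B": 1} are concurrent (each is greater in one component), and their merged clock is {"A": 2, "B": 1}.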
Dynamo

Key-value store: put / get / (delete); keys and values are bytestrings.
Infrastructure for Amazon services (AWS S3, shopping cart, ...); >100 service calls per Amazon web page.

DeCandia et al.: Dynamo: Amazon's Highly Available Key-Value Store. SOSP 2007: 205-220
Dynamo Requirements
- Incremental scalability
- Symmetry/decentralization: no special node roles, no single points of failure
- Heterogeneity: nodes of different types (e.g. due to general technology progress)
- Always writable: never reject client updates (e.g. shopping cart additions)
- High performance: requirements apply to the 99.9th percentile, e.g. 300 ms per request at 500 requests/sec
Consistent Hashing in Dynamo

Problem with the basic scheme: load is not uniformly distributed, and node performance varies.
Solution: map each node to multiple positions on the hash ring (virtual nodes); the number of virtual nodes depends on node performance.
Effects:
- Finer granularity of key partitions (more nodes responsible for the same range)
- More load on more powerful nodes
- The effect of adding/removing a node is distributed over many remaining nodes
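A sketch of the virtual-node variant. The `weight` parameter, which scales the number of virtual nodes with node capacity, is a hypothetical simplification of Dynamo's token assignment, and the MD5-based hash is again an arbitrary choice:

```python
import bisect
import hashlib

def h(key: str) -> int:
    """Hash a key to a point on the ring (MD5 truncated to 32 bits; arbitrary choice)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

class VirtualHashRing:
    """Ring where each physical node occupies several positions (virtual nodes)."""
    def __init__(self):
        self.positions = []   # sorted virtual-node positions
        self.owner = {}       # position -> physical node

    def add_node(self, node: str, weight: int = 1):
        # 'weight' virtual nodes: more powerful nodes get more ring positions
        for i in range(weight):
            pos = h(f"{node}#vn{i}")
            bisect.insort(self.positions, pos)
            self.owner[pos] = node

    def lookup(self, key: str) -> str:
        i = bisect.bisect(self.positions, h(key)) % len(self.positions)
        return self.owner[self.positions[i]]
```

A node added with weight 100 occupies ten times as many ring positions as one with weight 10, so it attracts roughly ten times the keys.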
Replication

Fault tolerance implies replication of data. Data is replicated to N nodes taken from a preference list:
- Preference list of replication targets: the nodes following the key's range in the hash ring
- List size > N, to prepare for node failures
- Uses distinct physical nodes by skipping virtual nodes of the same physical node
- Preference list information is replicated across all nodes
Replication: Quorum-like System

Send N requests, declare success after enough replies arrive. Protocol parameters for "enough": R for reads, W for writes.

Fine-tuning:
- R + W > N gives strong consistency
- Change R, W depending on the application workload
- The slowest replica of the R/W set determines latency

Examples:
- N=2, R=1, W=2
- N=3, R=2, W=2
- N=100, R=1, W=100
- N=4, R=1, W=2
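Applying the R + W > N rule to the four example configurations; `strongly_consistent` is an illustrative helper for the sketch, not part of any Dynamo API:

```python
def strongly_consistent(n: int, r: int, w: int) -> bool:
    """R + W > N: every read quorum intersects every write quorum,
    so at least one contacted replica has seen the latest write."""
    return r + w > n

# the four example configurations from the slide
configs = [(2, 1, 2), (3, 2, 2), (100, 1, 100), (4, 1, 2)]
results = {c: strongly_consistent(*c) for c in configs}
```

The first three configurations satisfy R + W > N; the last (N=4, R=1, W=2) does not, so it trades consistency for fast, highly available reads and writes.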
Load Balancing

Any node can accept a put/get request:
- If the node is not in the key's preference list, it forwards the request to the first healthy node on the preference list
- If it is in the preference list, it coordinates the request (sends the redundant requests and replies to the client)
Sloppy Quorum / Hinted Handoff

Only the first N healthy nodes are used; non-responding/down nodes are skipped. Handing off to less preferred nodes increases availability:
- The intended recipient is attached to the request as a hint
- Hinted writes are stored in a separate store on the receiving node
- The hint is used to propagate the update later, once the original recipient is back up
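A toy model of hinted handoff following the steps above. The node class and function names are invented for this illustration and do not reflect Dynamo's actual code:

```python
class Node:
    """Toy storage node (hypothetical; for illustration only)."""
    def __init__(self, name):
        self.name = name
        self.up = True
        self.store = {}    # regular key-value data this node is responsible for
        self.hinted = []   # (intended_node, key, value) writes held for down nodes

def write(key, value, preference_list, fallback_nodes):
    """Sloppy quorum: write to healthy nodes from the preference list;
    for each down node, hand the write off to a fallback node with a hint."""
    fallbacks = iter(fallback_nodes)
    for node in preference_list:
        if node.up:
            node.store[key] = value
        else:
            helper = next(fallbacks)             # less preferred, but healthy
            helper.hinted.append((node, key, value))

def replay_hints(helper):
    """Once intended recipients are back up, push the hinted writes to them."""
    still_pending = []
    for intended, key, value in helper.hinted:
        if intended.up:
            intended.store[key] = value          # hand the write back
        else:
            still_pending.append((intended, key, value))
    helper.hinted = still_pending
```

The write succeeds even while a preferred replica is down; the hinted copy is drained back to the intended node after recovery, restoring the preferred replica placement.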
Eventual Consistency in Dynamo

Inconsistent versions may occur:
- Due to the sloppy quorum in case of node failure
- With R + W <= N by configuration

Vector clocks are used to discover inconsistencies during reads:
- Syntactic inconsistencies (one vector clock strictly greater than the other) are resolved automatically
- Remaining inconsistencies are repaired by application code, e.g. merging shopping carts
Trade-offs
- Increasing W: higher durability, less write availability, lower performance
- Increasing R: less inconsistency, less read availability, lower performance
- Additional criteria for selecting the R/W nodes, e.g. placement in different data centers