Storage Systems Autumn 2009 Chapter 6: Distributed Hash Tables and their Applications André Brinkmann
Scaling RAID architectures
- Using a traditional RAID architecture does not scale: adding a new disk implies reorganizing the whole data layout
- Re-striping requires the movement of all data blocks
- The time t_striping for the re-layout grows linearly in the capacity: t_striping = k * C_old, where k is a constant and C_old is the already stored capacity
- The newly integrated capacity C_new is always smaller than C_old
Assumptions: How expensive is re-striping?
- 36 GByte of data can be redistributed in each hour
- 100 GByte of new capacity C_new have to be added
- Already existing capacity C_old between 100 GByte and 1 PByte
[Figure: Re-striping time in hours (logarithmic scale) over the existing capacity in TBytes]
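The arithmetic behind the plot is a single division; here is a minimal sketch in Python, using only the slide's own assumptions (36 GByte/hour redistribution rate, existing capacity from 100 GByte up to 1 PByte):

```python
# Re-striping cost under the slide's assumptions: all already stored data
# C_old has to be moved, at a redistribution rate of 36 GByte per hour.
RATE_GBYTE_PER_HOUR = 36.0

def restriping_hours(c_old_gbyte):
    """Hours needed to re-stripe when all of C_old must be redistributed."""
    return c_old_gbyte / RATE_GBYTE_PER_HOUR

for c_old in (100, 1_000, 10_000, 100_000, 1_000_000):  # 100 GByte .. 1 PByte
    print("C_old = %9d GByte -> %10.1f hours" % (c_old, restriping_hours(c_old)))
```

At 1 PByte of existing capacity this already amounts to roughly 27,800 hours, i.e. more than three years of continuous re-striping.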
Introduction: Randomization
- Deterministic data placement schemes have suffered from several drawbacks for a long time: heterogeneity has been an issue, it has been costly to adapt to new storage systems, and it is difficult to support storage-on-demand concepts
- Is there an alternative to deterministic schemes?
- Yes, randomization can help to overcome these drawbacks, but new challenges might be introduced!
Balls into bins Games I
- Basic task of balls into bins games: assign a set of m balls to n bins
- Motivation: bins = hard disks, balls = data items, L = maximum number of data items on a disk
- Where should I place the next item? Idea: just take a random position!

Basic Results: Balls into bins Games II
- Assign n balls to n bins
- For every ball, choose one bin independently, uniformly at random
- The maximum load is sharply concentrated: L = (1 + o(1)) * ln n / ln ln n w.h.p., where w.h.p. abbreviates "with probability at least 1 - n^(-c)", for any fixed constant c
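A minimal simulation makes the bound tangible; this sketch (n = 100,000 is an arbitrary choice) throws n balls into n bins and compares the observed maximum load with ln n / ln ln n:

```python
import math
import random
from collections import Counter

def max_load(n, m):
    """Throw m balls into n bins independently, uniformly at random,
    and return the load of the fullest bin."""
    loads = Counter(random.randrange(n) for _ in range(m))
    return max(loads.values())

n = 100_000
print("observed maximum load (m = n):", max_load(n, n))
print("ln n / ln ln n:               ", math.log(n) / math.log(math.log(n)))
```

The observed maximum typically lands within a small constant factor of the ln n / ln ln n term, while the average load is only 1.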
This sounds terrible: Balls into bins Games III
- The maximum loaded hard disk stores Theta(ln n / ln ln n)-times more data than the average
- This seems not to be scalable, or ...
- The model assumes that only very few data items are stored inside the environment, but each disk is able to store many objects
- Let's assume that "many objects" means m >= n * ln n balls
- Perfect! Then it holds w.h.p. that the maximum load is m/n + O(sqrt(m * ln n / n)): the additional offset becomes negligible compared to the average load m/n
- See, e.g., M. Raab, A. Steger: Balls into Bins - A Simple and Tight Analysis
Distributed Hash Tables
- Randomization introduces some (well-known) challenges
- The key questions, and the key tasks of Distributed Hash Tables (DHTs), are:
- How can we retrieve a stored data item?
- How can we adapt to a changing number of disks?
- How can we handle heterogeneity?
- How can we support redundancy?
Consistent Hashing I
- Introduced in the context of web caching
- Bins are mapped by a pseudo-random hash function h: Bins -> [0,1) onto a ring (of length 1)
- Bins become responsible for their interval on the ring
- Balls are mapped by an additional hash function g: Balls -> [0,1) onto the ring
- Each bin stores the balls in its interval
- See D. Karger, E. Lehman et al.: Consistent Hashing and Random Trees: Tools for Relieving Hot Spots on the World Wide Web
Consistent Hashing II
- The average load of each bin is m/n, but the deviation from the average can be high: the maximum arc length on the ring becomes Theta(log n / n) w.h.p.
- Solution: each bin is mapped by a set of independent hash functions to multiple points on the ring
- The maximum arc length assigned to a bin can be reduced to (1 + epsilon)/n for an arbitrarily small constant epsilon > 0, if O(log n) virtual bins are used for each physical bin
- See I. Stoica, R. Morris, et al.: Chord: A Scalable Peer-To-Peer Lookup Service for Internet Applications
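The following sketch shows consistent hashing with virtual bins in Python. It is not Chord's actual protocol; the SHA-1 hash and the choice of 100 virtual bins per physical bin are illustrative assumptions:

```python
import bisect
import hashlib

def ring_hash(key):
    """Pseudo-random hash onto the unit ring [0, 1)."""
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class ConsistentHashRing:
    """Consistent hashing with virtual bins: every physical bin is placed
    at several pseudo-random points on the ring to smooth the arc lengths."""

    def __init__(self, virtual_bins=100):
        self.virtual_bins = virtual_bins  # points per physical bin
        self.points = []                  # sorted list of (position, bin) pairs

    def add_bin(self, name):
        for i in range(self.virtual_bins):
            bisect.insort(self.points, (ring_hash("%s#%d" % (name, i)), name))

    def remove_bin(self, name):
        self.points = [p for p in self.points if p[1] != name]

    def lookup(self, ball):
        """A ball is stored by the bin owning the first point clockwise of g(ball)."""
        pos = ring_hash(ball)
        i = bisect.bisect_right(self.points, (pos, ""))
        return self.points[i % len(self.points)][1]

ring = ConsistentHashRing()
for disk in ("disk-0", "disk-1", "disk-2"):
    ring.add_bin(disk)
print(ring.lookup("block-42"))  # always returns the same disk for this ball
```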
Join and Leave Operations I
- In a dynamic network, nodes can join and leave at any time
- The main goal of a DHT is the ability to locate every key in the network at (nearly) any time
- (Planned) removal of a bin changes only the length of its neighbor's interval: data has to be moved to the neighbor
- Insertion of a bin also only changes the interval length of its new neighbor
Join and Leave Operations II
- Definition of a view V: a view V is a set of bins of which a particular client is aware
- Monotonicity: a ranged hash function f is monotone if for all views V1 and V2 with V1 a subset of V2, f_V2(b) in V1 implies f_V1(b) = f_V2(b)
- Monotonicity implies that in case of a join operation of a bin i, all moved data items have destination i
- Consistent Hashing has the property of monotonicity (see the sketch below)
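The monotonicity property can be checked empirically. This self-contained sketch uses one common assignment convention (ball goes to the closest bin point in clockwise direction; names like "bin-8" are made up for illustration) and verifies that after a join, every moved ball ends up on the new bin:

```python
import hashlib

def ring_hash(key):
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big") / 2**64

def assign(ball, view):
    """The ball goes to the bin whose ring point is closest in
    clockwise direction from the ball's position."""
    pos = ring_hash(ball)
    return min(view, key=lambda b: (ring_hash(b) - pos) % 1.0)

balls = ["ball-%d" % i for i in range(10000)]
old_view = ["bin-%d" % i for i in range(8)]
new_view = old_view + ["bin-8"]  # bin-8 joins the network

moved = [b for b in balls if assign(b, old_view) != assign(b, new_view)]
# Monotonicity: every ball that moves ends up on the newly joined bin.
assert all(assign(b, new_view) == "bin-8" for b in moved)
print("%d of %d balls moved, all of them to bin-8" % (len(moved), len(balls)))
```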
Heterogeneous Bins
- Consistent Hashing is (nearly) optimally suited for homogeneous environments, where all bins (disks) have the same capacity and performance
- Heterogeneous bins can be mapped to Consistent Hashing by using a different number of virtual bins for each physical bin
- But the relative capacities of the bins constantly change as bins are added or removed, so the numbers of virtual bins have to be adjusted
- As a consequence, monotonicity (and some other properties) cannot be kept up
Why is heterogeneity an issue?
- Definition: a heterogeneous set of disks is a set of disks with different performance and capacity characteristics
- Heterogeneous sets are becoming a common configuration: replacing an old disk, adding new disks, or a cluster built from already existing (heterogeneous) components
Traditional solution
- Many systems just ignore heterogeneity: all disks are treated as equal
- The usable size of each disk is that of the smallest one; the performance of each disk is assumed to be that of the slowest one
- Implications: no performance gain is obtained (except for some implicit side effects), and not all of the potential capacity gain is obtained
- Some systems use the unused disk space to build a virtual disk
THE DATA STORAGE EVOLUTION: Has disk capacity outgrown its usefulness? by Ron Yellin (Teradata Magazine, 2006)
[Figure: Disk capacity over time]
[Figure: Disk performance over time]
[Figure: Capacity vs. performance]
Growth of storage needs
- Information point of view: increase of 30% each year (Peter Lyman and Hal R. Varian: How Much Information? 2003, School of Information Management and Systems, University of California at Berkeley)
- Manufacturers' point of view: capacity increase of 50% each year (drive manufacturers; THE DATA STORAGE EVOLUTION, by Ron Yellin, Teradata Magazine, 2006)
Share Strategy I
- The Share strategy tries to map the heterogeneous problem to a homogeneous solution
- Each bin d is assigned by a hash function g to a start point g(d) inside the [0,1)-interval
- The length l(c_d) of its interval is proportional to the capacity c_d (or performance, or another metric) of bin d
- See A. Brinkmann, K. Salzwedel, C. Scheideler: Compact, adaptive placement schemes for non-uniform distribution requirements
Share Strategy II
- How to retrieve the location of a data item x inside this heterogeneous setting?
- Use a hash function h: Balls -> [0,1) to map x to the [0,1)-interval
- Use a DHT for homogeneous bins to retrieve the location of x from all intervals cutting h(x)
Share Strategy III
- Properties: (arbitrarily close to) optimal distribution of balls among the bins, computational complexity in O(1), and a competitive ratio concerning join and leave of (1 + epsilon) for arbitrary epsilon > 0
- But Share has been optimized for usage in data center environments: Share is not monotone and only partially suited for P2P networks
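A minimal sketch of the Share lookup path, under stated assumptions: the stretch factor of 3 is purely illustrative (the published scheme chooses the stretch in O(log n) so that every ring position is covered w.h.p.), and plain consistent hashing stands in for the homogeneous sub-strategy:

```python
import hashlib

def unit_hash(key):
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big") / 2**64

class Share:
    """Each bin owns an interval on the unit ring whose length is
    proportional to its relative capacity, inflated by a stretch factor
    so the whole ring is covered; an interval of length >= 1 simply
    covers the entire ring."""

    def __init__(self, capacities, stretch=3.0):
        total = sum(capacities.values())
        self.intervals = {d: (unit_hash("start-" + d), stretch * c / total)
                          for d, c in capacities.items()}

    def lookup(self, ball):
        x = unit_hash(ball)
        # All bins whose (possibly wrapping) interval contains h(x) ...
        candidates = [d for d, (start, length) in self.intervals.items()
                      if (x - start) % 1.0 < length]
        # ... are treated as homogeneous; break the tie with plain
        # consistent hashing (closest bin point clockwise of h(x)).
        return min(candidates, key=lambda d: (unit_hash(d) - x) % 1.0)

share = Share({"small-disk": 250.0, "medium-disk": 500.0, "big-disk": 1000.0})
print(share.lookup("block-42"))
```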
V:Drive
- V:Drive is an out-of-band virtualization environment for SANs
- Each (Linux) server includes an additional block-level driver module
- A metadata appliance (MDA) ensures a consistent view on storage and servers
- The Share strategy is used as the data distribution strategy
- See A. Brinkmann, S. Effert, et al.: Influence of Adaptive Data Layouts on Performance in dynamically changing Storage Environments
Performance V:Drive - Static
[Figure: Throughput (MB/s) and average latency (ms) over the number of physical volumes (1-14), comparing V:Drive and LVM; synthetic random I/O benchmark, static configuration]
Performance V:Drive - Dynamic
[Figure: Throughput (MB/s) and average latency (ms) over the number of physical volumes (2-14), comparing V:Drive and LVM; synthetic random I/O benchmark, dynamic configuration]
V:Drive - Reconfiguration Overhead
[Figure: Throughput (MByte/s) and average latency (ms) over time (minutes) during a reconfiguration]
Randomization and Redundancy
- Randomized data distribution schemes do not include mechanisms to protect data against disk failures
- Question: how can randomization and RAID schemes be used together?
- Assumption: k copies of a data block have to be distributed over n disks
- No two copies of a data block are allowed to be stored on the same disk
Trivial Solutions
- Trivial Solution I: divide the storage system into k storage pools; distribute the first copies over the first pool, ..., the k-th copies over the k-th pool. Drawback: missing flexibility
- Trivial Solution II: the first copy is distributed over all disks, the second copy over all but the previously chosen disk, ... Drawback: not able to use the capacity efficiently
Observation
- Trivial Solution II is not able to use the capacity efficiently, because big storage systems are penalized compared to smaller devices
- Theorem: Assume a trivial replication strategy that has to distribute k copies of m balls over n > k bins. Furthermore, the biggest bin has a capacity c_max that is at least (1 + epsilon) * c_j for the next biggest bin j. In this case, the expected load of the biggest bin will be smaller than the expected load required for optimal capacity efficiency.
- See A. Brinkmann, S. Effert, et al.: Dynamic and Redundant Data Placement, ICDCS 2007
Idea
- The algorithm has to ensure that bigger bins get data items according to their capacities
- This can be ensured by an algorithm that iterates over a sorted list of bins:
1. At each iteration, the algorithm randomly decides whether or not to place a copy on the current bin
2. If one of the k copies of a ball has been placed, use the optimal strategy for (k-1) copies with the remaining bins as input
- Challenge: how to make the random decision in step 1 of each iteration (a code sketch follows the mirroring example below)
LinMirror
Example for Mirroring (k=2)
- c'_i = c_i / (c_1 + ... + c_n) denotes the relative capacity of disk i to all disks
- c''_i = c_i / (c_i + ... + c_n) denotes the relative capacity of disk i to all disks starting with index i
- k * c''_i (here: 2 * c''_i) is the weight for the random decision!
Example for Mirroring (k=2)
- If, e.g., disk 2 is chosen for the first copy of a mirror, just distribute the second copy according to Share over disks 3, 4, and 5
- Some adaptation is necessary if disk 3 is chosen, because the weight of disk 4 becomes greater than 1
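The following Python sketch implements the iterative random decision as reconstructed from the weights above; the largest-first ordering and the clamping of weights greater than 1 are assumptions of this sketch, not necessarily the exact published LinMirror scheme:

```python
import random

def place_copies(capacities, k):
    """Walk over the bins sorted by capacity (largest first) and decide
    randomly per bin whether it receives one of the k copies.  With r
    copies still to place, bin i is picked with weight r * c''_i, where
    c''_i = c_i / (c_i + ... + c_n) is its relative capacity among the
    remaining bins.  A short calculation shows that every bin then gets
    a copy with probability k * c_i / C, as long as no weight has to be
    clamped to 1 (the adaptation mentioned on the slides)."""
    order = sorted(range(len(capacities)), key=lambda i: -capacities[i])
    remaining_capacity = float(sum(capacities))
    chosen, r = [], k
    for i in order:
        if r == 0:
            break
        weight = r * capacities[i] / remaining_capacity  # r * c''_i
        if random.random() < min(weight, 1.0):
            chosen.append(i)
            r -= 1
        remaining_capacity -= capacities[i]
    return chosen

print(place_copies([1000.0, 500.0, 250.0, 250.0], k=2))  # e.g. [0, 1]
```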
Observations
- LinMirror is 4-competitive concerning insertion and deletion of a bin
- The strategy can easily be extended to arbitrary k
- The lower and upper bound is (k+1)/2 for homogeneous bins (can be improved to 1-competitive)
- The data distribution is optimal
- Redistribution of data in a dynamic environment is ln n-competitive for arbitrary k
- The computational complexity can be reduced to O(k)
Fairness of k-fold Replication
[Figure: Disk usage in % for configurations of 8, 10, 12, 10, and again 8 disks]
Adaptivity of k-fold Replication
[Figure: Competitiveness (1-6) over the number of disks (4-60), for adding a disk as the biggest and as the smallest]
Metadata Management
- The assignment of data items to disks can be solved efficiently by random data distribution schemes: very good distribution of data and requests, low computational complexity, optimal adaptivity to new infrastructures without redundancy (still acceptable with redundancy), and over-provisioning can be efficiently integrated
- But how to find the position of a data item on the disks?
- This is equal to the dictionary problem and requires O(n) entries to find the locations of n objects: it defines the bulk of the metadata
Dictionary Problem
- Extent: the smallest continuous unit that can be addressed by the virtualization solution
- Dictionary size by extent size (columns) and volume size (rows):

Volume | 4 KB   | 16 KB  | 256 KB | 4 MB   | 16 MB  | 256 MB   | 1 GB
1 GB   | 8 MB   | 2 MB   | 128 KB | 8 KB   | 2 KB   | 128 Byte | 32 Byte
64 GB  | 512 MB | 128 MB | 8 MB   | 512 KB | 128 KB | 8 KB     | 2 KB
1 TB   | 8 GB   | 2 GB   | 128 MB | 8 MB   | 2 MB   | 128 KB   | 32 KB
64 TB  | 512 GB | 128 GB | 8 GB   | 512 MB | 128 MB | 8 MB     | 2 MB
1 PB   | 8 TB   | 2 TB   | 128 GB | 8 GB   | 2 GB   | 128 MB   | 32 MB

- The dictionary easily becomes too big to be stored inside each server system for small extent sizes
- Solutions: caching, huge extent sizes, or object-based storage systems
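The table follows from a single division; the entry size of 32 Byte per extent is an assumption inferred from the table values (a 1 GByte volume with 1 GByte extents needs exactly one entry of 32 Byte):

```latex
\text{dictionary size} = \frac{\text{volume size}}{\text{extent size}} \cdot \text{entry size},
\qquad \text{e.g.}\ \frac{1\,\text{PByte}}{4\,\text{KByte}} \cdot 32\,\text{Byte} = 8\,\text{TByte}
```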
Key Value Storage
- To meet its reliability and scaling needs, Amazon has developed a number of storage technologies, e.g., the Amazon Simple Storage Service (S3)
- There are many services on Amazon's platform that only need primary-key access to a data store: best seller lists, shopping carts, customer preferences, session management, sales rank, and the product catalog
- Key value stores provide a simple primary-key-only interface to meet the requirements of these applications
- See DeCandia, et al.: Dynamo: Amazon's Highly Available Key-value Store
Dynamo
- Dynamo uses a synthesis of well-known techniques to achieve scalability and availability:
- Data is partitioned and replicated using consistent hashing
- Consistency is facilitated by object versioning
- Consistency among replicas during updates is maintained by a quorum-like technique and a decentralized replica synchronization protocol
- Gossip-based distributed failure detection and membership protocol
- Dynamo is a completely decentralized system with minimal need for manual administration
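A brief sketch of the quorum idea behind Dynamo-style stores (not Amazon's code; the N/R/W values are a commonly cited example configuration, and a scalar version counter stands in for Dynamo's vector clocks):

```python
# N replicas per key, W acknowledgements per write, R replies per read.
N, R, W = 3, 2, 2

# R + W > N guarantees that every read quorum overlaps every write
# quorum in at least one replica, so a read sees the latest write.
assert R + W > N

def newest(replies):
    """Pick the most recent version among the R read replies.  Dynamo
    actually uses vector clocks, which can additionally detect
    conflicting siblings; a scalar counter is a simplification."""
    return max(replies, key=lambda reply: reply["version"])

replies = [{"version": 4, "value": "cart-a"},
           {"version": 5, "value": "cart-b"}]
print(newest(replies))  # -> {'version': 5, 'value': 'cart-b'}
```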
Query Model: Assumptions and Requirements
- Simple read and write operations to data items that are uniquely identified by a key
- State is stored as binary objects (i.e., blobs)
- No operations span multiple data items, and there is no need for a relational schema
Assumptions and Requirements: ACID Properties
- ACID: Atomicity, Consistency, Isolation, Durability
- Experience at Amazon has shown that data stores providing ACID guarantees tend to have poor availability
- Dynamo targets applications that can operate with weaker consistency (the C in ACID) if this results in higher availability
- Dynamo does not provide any isolation guarantees and permits only single-key updates
- The environment is assumed to be non-hostile