Storage Systems Autumn 2009 Chapter 6: Distributed Hash Tables and their Applications André Brinkmann
Scaling RAID architectures
- Using a traditional RAID architecture does not scale: adding a new disk implies reorganizing the whole data layout
- Re-striping requires the movement of all data blocks
- The time t_striping for the re-layout grows linearly in the capacity: t_striping = k * C_old, where k is a constant and C_old is the already stored capacity
- The newly integrated capacity C_new is always smaller than C_old
Assumptions: How expensive is re-striping?
- 36 GByte of data can be redistributed in each hour
- 100 GByte of new capacity C_new have to be added
- Already existing capacity C_old between 100 GByte and 1 PByte
[Figure: Re-striping time in hours (logarithmic scale) over the existing capacity in TBytes]
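The arithmetic behind the plot is a single division; here is a minimal sketch in Python, using only the slide's own assumptions (36 GByte/hour redistribution rate, existing capacity from 100 GByte up to 1 PByte):

```python
# Re-striping cost under the slide's assumptions: all already stored data
# C_old has to be moved, at a redistribution rate of 36 GByte per hour.
RATE_GBYTE_PER_HOUR = 36.0

def restriping_hours(c_old_gbyte):
    """Hours needed to re-stripe when all of C_old must be redistributed."""
    return c_old_gbyte / RATE_GBYTE_PER_HOUR

for c_old in (100, 1_000, 10_000, 100_000, 1_000_000):  # 100 GByte .. 1 PByte
    print("C_old = %9d GByte -> %10.1f hours" % (c_old, restriping_hours(c_old)))
```

At 1 PByte of existing capacity this already amounts to roughly 27,800 hours, i.e. more than three years of continuous re-striping.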
Introduction: Randomization
- Deterministic data placement schemes have suffered from several drawbacks for a long time: heterogeneity has been an issue, it has been costly to adapt to new storage systems, and it is difficult to support storage-on-demand concepts
- Is there an alternative to deterministic schemes?
- Yes, randomization can help to overcome these drawbacks, but new challenges might be introduced!
Balls into bins Games I
- Basic task of balls into bins games: assign a set of m balls to n bins
- Motivation: bins = hard disks, balls = data items, L = maximum number of data items on a disk
- Where should I place the next item? Idea: just take a random position!

Basic Results: Balls into bins Games II
- Assign n balls to n bins
- For every ball, choose one bin independently, uniformly at random
- The maximum load is sharply concentrated: L = (1 + o(1)) * ln n / ln ln n w.h.p., where w.h.p. abbreviates "with probability at least 1 - n^(-c)", for any fixed constant c
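A minimal simulation makes the bound tangible; this sketch (n = 100,000 is an arbitrary choice) throws n balls into n bins and compares the observed maximum load with ln n / ln ln n:

```python
import math
import random
from collections import Counter

def max_load(n, m):
    """Throw m balls into n bins independently, uniformly at random,
    and return the load of the fullest bin."""
    loads = Counter(random.randrange(n) for _ in range(m))
    return max(loads.values())

n = 100_000
print("observed maximum load (m = n):", max_load(n, n))
print("ln n / ln ln n:               ", math.log(n) / math.log(math.log(n)))
```

The observed maximum typically lands within a small constant factor of the ln n / ln ln n term, while the average load is only 1.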
This sounds terrible: Balls into bins Games III
- The maximum loaded hard disk stores Theta(ln n / ln ln n)-times more data than the average
- This seems not to be scalable, or ...
- The model assumes that only very few data items are stored inside the environment, but each disk is able to store many objects
- Let's assume that "many objects" means m >= n * ln n balls
- Perfect! Then it holds w.h.p. that the maximum load is m/n + O(sqrt(m * ln n / n)): the additional offset becomes negligible compared to the average load m/n
- See, e.g., M. Raab, A. Steger: Balls into Bins - A Simple and Tight Analysis
Distributed Hash Tables
- Randomization introduces some (well-known) challenges
- The key questions, and the key tasks of Distributed Hash Tables (DHTs), are:
- How can we retrieve a stored data item?
- How can we adapt to a changing number of disks?
- How can we handle heterogeneity?
- How can we support redundancy?
Consistent Hashing I
- Introduced in the context of web caching
- Bins are mapped by a pseudo-random hash function h: Bins -> [0,1) onto a ring (of length 1)
- Bins become responsible for their interval on the ring
- Balls are mapped by an additional hash function g: Balls -> [0,1) onto the ring
- Each bin stores the balls in its interval
- See D. Karger, E. Lehman et al.: Consistent Hashing and Random Trees: Tools for Relieving Hot Spots on the World Wide Web
Consistent Hashing II
- The average load of each bin is m/n, but the deviation from the average can be high: the maximum arc length on the ring becomes Theta(log n / n) w.h.p.
- Solution: each bin is mapped by a set of independent hash functions to multiple points on the ring
- The maximum arc length assigned to a bin can be reduced to (1 + epsilon)/n for an arbitrarily small constant epsilon > 0, if O(log n) virtual bins are used for each physical bin
- See I. Stoica, R. Morris, et al.: Chord: A Scalable Peer-To-Peer Lookup Service for Internet Applications
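The following sketch shows consistent hashing with virtual bins in Python. It is not Chord's actual protocol; the SHA-1 hash and the choice of 100 virtual bins per physical bin are illustrative assumptions:

```python
import bisect
import hashlib

def ring_hash(key):
    """Pseudo-random hash onto the unit ring [0, 1)."""
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class ConsistentHashRing:
    """Consistent hashing with virtual bins: every physical bin is placed
    at several pseudo-random points on the ring to smooth the arc lengths."""

    def __init__(self, virtual_bins=100):
        self.virtual_bins = virtual_bins  # points per physical bin
        self.points = []                  # sorted list of (position, bin) pairs

    def add_bin(self, name):
        for i in range(self.virtual_bins):
            bisect.insort(self.points, (ring_hash("%s#%d" % (name, i)), name))

    def remove_bin(self, name):
        self.points = [p for p in self.points if p[1] != name]

    def lookup(self, ball):
        """A ball is stored by the bin owning the first point clockwise of g(ball)."""
        pos = ring_hash(ball)
        i = bisect.bisect_right(self.points, (pos, ""))
        return self.points[i % len(self.points)][1]

ring = ConsistentHashRing()
for disk in ("disk-0", "disk-1", "disk-2"):
    ring.add_bin(disk)
print(ring.lookup("block-42"))  # always returns the same disk for this ball
```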
Join and Leave Operations I
- In a dynamic network, nodes can join and leave at any time
- The main goal of a DHT is the ability to locate every key in the network at (nearly) any time
- (Planned) removal of a bin changes only the length of its neighbor's interval: data has to be moved to the neighbor
- Insertion of a bin also only changes the interval length of its new neighbor
Join and Leave Operations II
- Definition of a view V: a view V is a set of bins of which a particular client is aware
- Monotonicity: a ranged hash function f is monotone if for all views V1 and V2 with V1 a subset of V2, f_V2(b) in V1 implies f_V1(b) = f_V2(b)
- Monotonicity implies that in case of a join operation of a bin i, all moved data items have destination i
- Consistent Hashing has the property of monotonicity (see the sketch below)
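The monotonicity property can be checked empirically. This self-contained sketch uses one common assignment convention (ball goes to the closest bin point in clockwise direction; names like "bin-8" are made up for illustration) and verifies that after a join, every moved ball ends up on the new bin:

```python
import hashlib

def ring_hash(key):
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big") / 2**64

def assign(ball, view):
    """The ball goes to the bin whose ring point is closest in
    clockwise direction from the ball's position."""
    pos = ring_hash(ball)
    return min(view, key=lambda b: (ring_hash(b) - pos) % 1.0)

balls = ["ball-%d" % i for i in range(10000)]
old_view = ["bin-%d" % i for i in range(8)]
new_view = old_view + ["bin-8"]  # bin-8 joins the network

moved = [b for b in balls if assign(b, old_view) != assign(b, new_view)]
# Monotonicity: every ball that moves ends up on the newly joined bin.
assert all(assign(b, new_view) == "bin-8" for b in moved)
print("%d of %d balls moved, all of them to bin-8" % (len(moved), len(balls)))
```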
Heterogeneous Bins
- Consistent Hashing is (nearly) optimally suited for homogeneous environments, where all bins (disks) have the same capacity and performance
- Heterogeneous bins can be mapped to Consistent Hashing by using a different number of virtual bins for each physical bin
- But the relative capacities of the bins constantly change as bins are added or removed, so the numbers of virtual bins have to be adjusted
- As a consequence, monotonicity (and some other properties) cannot be kept up
Why is heterogeneity an issue?
- Definition: a heterogeneous set of disks is a set of disks with different performance and capacity characteristics
- Heterogeneous sets are becoming a common configuration: replacing an old disk, adding new disks, or a cluster built from already existing (heterogeneous) components
Traditional solution
- Many systems just ignore heterogeneity: all disks are treated as equal
- The usable size of each disk is that of the smallest one; the performance of each disk is assumed to be that of the slowest one
- Implications: no performance gain is obtained (except for some implicit side effects), and not all of the potential capacity gain is obtained
- Some systems use the unused disk space to build a virtual disk
THE DATA STORAGE EVOLUTION: Has disk capacity outgrown its usefulness? by Ron Yellin (Teradata Magazine, 2006)
[Figure: Disk capacity over time]
[Figure: Disk performance over time]
[Figure: Capacity vs. performance]
Growth of storage needs
- Information point of view: increase of 30% each year (Peter Lyman and Hal R. Varian: How Much Information? 2003, School of Information Management and Systems, University of California at Berkeley)
- Manufacturers' point of view: capacity increase of 50% each year (drive manufacturers; THE DATA STORAGE EVOLUTION, by Ron Yellin, Teradata Magazine, 2006)
Share Strategy I
- The Share strategy tries to map the heterogeneous problem to a homogeneous solution
- Each bin d is assigned by a hash function g to a start point g(d) inside the [0,1)-interval
- The length l(c_d) of its interval is proportional to the capacity c_d (or performance, or another metric) of bin d
- See A. Brinkmann, K. Salzwedel, C. Scheideler: Compact, adaptive placement schemes for non-uniform distribution requirements
Share Strategy II
- How to retrieve the location of a data item x inside this heterogeneous setting?
- Use a hash function h: Balls -> [0,1) to map x to the [0,1)-interval
- Use a DHT for homogeneous bins to retrieve the location of x from all intervals cutting h(x)
Share Strategy III
- Properties: (arbitrarily close to) optimal distribution of balls among the bins, computational complexity in O(1), and a competitive ratio concerning join and leave of (1 + epsilon) for arbitrary epsilon > 0
- But Share has been optimized for usage in data center environments: Share is not monotone and only partially suited for P2P networks
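A minimal sketch of the Share lookup path, under stated assumptions: the stretch factor of 3 is purely illustrative (the published scheme chooses the stretch in O(log n) so that every ring position is covered w.h.p.), and plain consistent hashing stands in for the homogeneous sub-strategy:

```python
import hashlib

def unit_hash(key):
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big") / 2**64

class Share:
    """Each bin owns an interval on the unit ring whose length is
    proportional to its relative capacity, inflated by a stretch factor
    so the whole ring is covered; an interval of length >= 1 simply
    covers the entire ring."""

    def __init__(self, capacities, stretch=3.0):
        total = sum(capacities.values())
        self.intervals = {d: (unit_hash("start-" + d), stretch * c / total)
                          for d, c in capacities.items()}

    def lookup(self, ball):
        x = unit_hash(ball)
        # All bins whose (possibly wrapping) interval contains h(x) ...
        candidates = [d for d, (start, length) in self.intervals.items()
                      if (x - start) % 1.0 < length]
        # ... are treated as homogeneous; break the tie with plain
        # consistent hashing (closest bin point clockwise of h(x)).
        return min(candidates, key=lambda d: (unit_hash(d) - x) % 1.0)

share = Share({"small-disk": 250.0, "medium-disk": 500.0, "big-disk": 1000.0})
print(share.lookup("block-42"))
```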
V:Drive
- V:Drive is an out-of-band virtualization environment for SANs
- Each (Linux) server includes an additional block-level driver module
- A metadata appliance (MDA) ensures a consistent view on storage and servers
- The Share strategy is used as the data distribution strategy
- See A. Brinkmann, S. Effert, et al.: Influence of Adaptive Data Layouts on Performance in dynamically changing Storage Environments
Performance V:Drive - Static
[Figure: Throughput (MB/s) and average latency (ms) over the number of physical volumes (1-14), comparing V:Drive and LVM; synthetic random I/O benchmark, static configuration]
Performance V:Drive - Dynamic
[Figure: Throughput (MB/s) and average latency (ms) over the number of physical volumes (2-14), comparing V:Drive and LVM; synthetic random I/O benchmark, dynamic configuration]
V:Drive - Reconfiguration Overhead
[Figure: Throughput (MByte/s) and average latency (ms) over time (minutes) during a reconfiguration]
Randomization and Redundancy
- Randomized data distribution schemes do not include mechanisms to protect data against disk failures
- Question: how can randomization and RAID schemes be used together?
- Assumption: k copies of a data block have to be distributed over n disks
- No two copies of a data block are allowed to be stored on the same disk
Trivial Solutions
- Trivial Solution I: divide the storage system into k storage pools; distribute the first copies over the first pool, ..., the k-th copies over the k-th pool. Drawback: missing flexibility
- Trivial Solution II: the first copy is distributed over all disks, the second copy over all but the previously chosen disk, ... Drawback: not able to use the capacity efficiently
Observation
- Trivial Solution II is not able to use the capacity efficiently, because big storage systems are penalized compared to smaller devices
- Theorem: Assume a trivial replication strategy that has to distribute k copies of m balls over n > k bins. Furthermore, the biggest bin has a capacity c_max that is at least (1 + epsilon) * c_j for the next biggest bin j. In this case, the expected load of the biggest bin will be smaller than the expected load required for optimal capacity efficiency.
- See A. Brinkmann, S. Effert, et al.: Dynamic and Redundant Data Placement, ICDCS 2007
Idea
- The algorithm has to ensure that bigger bins get data items according to their capacities
- This can be ensured by an algorithm that iterates over a sorted list of bins:
1. At each iteration, the algorithm randomly decides whether or not to place a copy on the current bin
2. If one of the k copies of a ball has been placed, use the optimal strategy for (k-1) copies with the remaining bins as input
- Challenge: how to make the random decision in step 1 of each iteration (a code sketch follows the mirroring example below)
LinMirror
Example for Mirroring (k=2)
- c'_i = c_i / (c_1 + ... + c_n) denotes the relative capacity of disk i to all disks
- c''_i = c_i / (c_i + ... + c_n) denotes the relative capacity of disk i to all disks starting with index i
- k * c''_i (here: 2 * c''_i) is the weight for the random decision!
Example for Mirroring (k=2)
- If, e.g., disk 2 is chosen for the first copy of a mirror, just distribute the second copy according to Share over disks 3, 4, and 5
- Some adaptation is necessary if disk 3 is chosen, because the weight of disk 4 becomes greater than 1
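The following Python sketch implements the iterative random decision as reconstructed from the weights above; the largest-first ordering and the clamping of weights greater than 1 are assumptions of this sketch, not necessarily the exact published LinMirror scheme:

```python
import random

def place_copies(capacities, k):
    """Walk over the bins sorted by capacity (largest first) and decide
    randomly per bin whether it receives one of the k copies.  With r
    copies still to place, bin i is picked with weight r * c''_i, where
    c''_i = c_i / (c_i + ... + c_n) is its relative capacity among the
    remaining bins.  A short calculation shows that every bin then gets
    a copy with probability k * c_i / C, as long as no weight has to be
    clamped to 1 (the adaptation mentioned on the slides)."""
    order = sorted(range(len(capacities)), key=lambda i: -capacities[i])
    remaining_capacity = float(sum(capacities))
    chosen, r = [], k
    for i in order:
        if r == 0:
            break
        weight = r * capacities[i] / remaining_capacity  # r * c''_i
        if random.random() < min(weight, 1.0):
            chosen.append(i)
            r -= 1
        remaining_capacity -= capacities[i]
    return chosen

print(place_copies([1000.0, 500.0, 250.0, 250.0], k=2))  # e.g. [0, 1]
```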
Observations
- LinMirror is 4-competitive concerning insertion and deletion of a bin
- The strategy can easily be extended to arbitrary k
- The lower and upper bound is (k+1)/2 for homogeneous bins (can be improved to 1-competitive)
- The data distribution is optimal
- Redistribution of data in a dynamic environment is ln n-competitive for arbitrary k
- The computational complexity can be reduced to O(k)
Fairness of k-fold Replication
[Figure: Disk usage in % for configurations of 8, 10, 12, 10, and again 8 disks]
Adaptivity of k-fold Replication
[Figure: Competitiveness (1-6) over the number of disks (4-60), for adding a disk as the biggest and as the smallest]
Metadata Management
- The assignment of data items to disks can be solved efficiently by random data distribution schemes: very good distribution of data and requests, low computational complexity, optimal adaptivity to new infrastructures without redundancy (still acceptable with redundancy), and over-provisioning can be efficiently integrated
- But how to find the position of a data item on the disks?
- This is equal to the dictionary problem and requires O(n) entries to find the locations of n objects: it defines the bulk of the metadata
Dictionary Problem
- Extent: the smallest continuous unit that can be addressed by the virtualization solution
- Dictionary size by extent size (columns) and volume size (rows):

Volume | 4 KB   | 16 KB  | 256 KB | 4 MB   | 16 MB  | 256 MB   | 1 GB
1 GB   | 8 MB   | 2 MB   | 128 KB | 8 KB   | 2 KB   | 128 Byte | 32 Byte
64 GB  | 512 MB | 128 MB | 8 MB   | 512 KB | 128 KB | 8 KB     | 2 KB
1 TB   | 8 GB   | 2 GB   | 128 MB | 8 MB   | 2 MB   | 128 KB   | 32 KB
64 TB  | 512 GB | 128 GB | 8 GB   | 512 MB | 128 MB | 8 MB     | 2 MB
1 PB   | 8 TB   | 2 TB   | 128 GB | 8 GB   | 2 GB   | 128 MB   | 32 MB

- The dictionary easily becomes too big to be stored inside each server system for small extent sizes
- Solutions: caching, huge extent sizes, or object-based storage systems
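The table follows from a single division; the entry size of 32 Byte per extent is an assumption inferred from the table values (a 1 GByte volume with 1 GByte extents needs exactly one entry of 32 Byte):

```latex
\text{dictionary size} = \frac{\text{volume size}}{\text{extent size}} \cdot \text{entry size},
\qquad \text{e.g.}\ \frac{1\,\text{PByte}}{4\,\text{KByte}} \cdot 32\,\text{Byte} = 8\,\text{TByte}
```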
Key Value Storage
- To meet its reliability and scaling needs, Amazon has developed a number of storage technologies, e.g., the Amazon Simple Storage Service (S3)
- There are many services on Amazon's platform that only need primary-key access to a data store: best seller lists, shopping carts, customer preferences, session management, sales rank, and the product catalog
- Key value stores provide a simple primary-key-only interface to meet the requirements of these applications
- See DeCandia, et al.: Dynamo: Amazon's Highly Available Key-value Store
Dynamo
- Dynamo uses a synthesis of well-known techniques to achieve scalability and availability:
- Data is partitioned and replicated using consistent hashing
- Consistency is facilitated by object versioning
- Consistency among replicas during updates is maintained by a quorum-like technique and a decentralized replica synchronization protocol
- Gossip-based distributed failure detection and membership protocol
- Dynamo is a completely decentralized system with minimal need for manual administration
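A brief sketch of the quorum idea behind Dynamo-style stores (not Amazon's code; the N/R/W values are a commonly cited example configuration, and a scalar version counter stands in for Dynamo's vector clocks):

```python
# N replicas per key, W acknowledgements per write, R replies per read.
N, R, W = 3, 2, 2

# R + W > N guarantees that every read quorum overlaps every write
# quorum in at least one replica, so a read sees the latest write.
assert R + W > N

def newest(replies):
    """Pick the most recent version among the R read replies.  Dynamo
    actually uses vector clocks, which can additionally detect
    conflicting siblings; a scalar counter is a simplification."""
    return max(replies, key=lambda reply: reply["version"])

replies = [{"version": 4, "value": "cart-a"},
           {"version": 5, "value": "cart-b"}]
print(newest(replies))  # -> {'version': 5, 'value': 'cart-b'}
```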
Query Model: Assumptions and Requirements
- Simple read and write operations to data items that are uniquely identified by a key
- State is stored as binary objects (i.e., blobs)
- No operations span multiple data items, and there is no need for a relational schema
Assumptions and Requirements: ACID Properties
- ACID: Atomicity, Consistency, Isolation, Durability
- Experience at Amazon has shown that data stores providing ACID guarantees tend to have poor availability
- Dynamo targets applications that can operate with weaker consistency (the C in ACID) if this results in higher availability
- Dynamo does not provide any isolation guarantees and permits only single-key updates
- The environment is assumed to be non-hostile