A Novel Data Placement Model for Highly-Available Storage Systems




A Novel Data Placement Model for Highly-Available Storage Systems
Rama, Microsoft Research
Joint work with John MacCormick, Nick Murphy, Kunal Talwar, Udi Wieder, Junfeng Yang, and Lidong Zhou

Introduction
Kinesis: a framework for placing data and replicas in a datacenter storage system.
Three design principles:
- Structured organization
- Freedom of choice
- Scattered placement

Kinesis Design => Structured Organization
Segmentation: divide the storage servers into k segments; each segment is an independent hash-based system.
Failure isolation: servers with shared components are placed in the same segment.
Each item has r replicas, each on a different segment, which reduces the impact of correlated failures.

Kinesis Design => Freedom of Choice
Balanced resource usage via the multiple-choice paradigm:
- Write: choose r out of k candidate servers.
- Read: choose 1 out of the r replicas.

Kinesis Design => Scattered Data Placement
Independent hash functions per segment, so each server stores a different set of items.
Parallel recovery: the recovery load is spread uniformly among multiple servers, so lost replicas are recovered faster.
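The three design principles above combine into one simple placement rule. The following Python sketch (an illustration of the rule as described on these slides, not the prototype's code; all names such as Server, candidates, write, and read are invented for this example) uses k independent per-segment hash functions to obtain k candidate servers per item, writes to the r least-loaded candidates, and reads from the least-loaded candidate holding a replica.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Server:
    name: str
    load: int = 0                       # simplified load metric (item count)
    items: set = field(default_factory=set)

def candidates(item: str, segments: list) -> list:
    """One candidate per segment, chosen by that segment's independent hash."""
    out = []
    for seg_id, servers in enumerate(segments):
        digest = hashlib.sha1(f"{seg_id}:{item}".encode()).hexdigest()
        out.append(servers[int(digest, 16) % len(servers)])
    return out

def write(item: str, segments: list, r: int) -> list:
    """Freedom of choice on writes: the r least-loaded of the k candidates."""
    targets = sorted(candidates(item, segments), key=lambda s: s.load)[:r]
    for s in targets:
        s.items.add(item)
        s.load += 1
    return targets                      # each target lies in a distinct segment

def read(item: str, segments: list) -> Server:
    """Freedom of choice on reads: the least-loaded candidate holding a replica."""
    holders = [s for s in candidates(item, segments) if item in s.items]
    return min(holders, key=lambda s: s.load)
```

For example, the Kinesis(7,3) configuration used later in the talk corresponds to segments being a list of 7 server groups and r = 3.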

Motivation => Data Placement Models
Hash-based:
- Data is stored on a server determined by a hash function; the server is identified by a local computation during reads.
- Low overhead.
- Limited control in placing data items and replicas.
Directory-based:
- Any server can store any data; a directory provides the server to fetch data from.
- Expensive to maintain a globally-consistent directory in a large-scale system.
- Can place data items carefully to maintain load balance, avoid correlated failures, etc.

Kinesis => Advantages
Enables hash-based storage systems to gain the advantages of directory-based systems:
- Near-optimal load balance
- Tolerance to shared-component failures
- Freedom from problems induced by correlated replica placement
- No bottlenecks induced by directory maintenance
- Avoids slow data/replica placement algorithms

Evaluation: Theory, Simulations, Experiments

Theory => The Balls and Bins Problem
Load balancing is often modeled as throwing balls (items) into bins (servers).
Single-choice paradigm: to throw m balls into n bins, pick a bin uniformly at random and insert the ball into it.
Max load: m/n + Θ(√(m ln n / n)) with high probability (for m ≫ n); the excess over the average grows with m.

Theory => Multiple-Choice Paradigm
Throw m balls into n bins: pick d bins uniformly at random (d ≥ 2) and insert the ball into the least-loaded of them.
Max load: m/n + ln ln n / ln d + O(1) with high probability.
The excess load is independent of m, the number of balls! [BCSV00]
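To make the contrast concrete, here is a small simulation (a sketch added for illustration, not from the talk) comparing single-choice and two-choice maximum loads. With n = 1,000 bins and m = 100,000 balls, the two-choice maximum typically exceeds the average m/n by only a few balls, while the single-choice excess is in the tens.

```python
import random

def max_load(m: int, n: int, d: int, seed: int = 0) -> int:
    """Throw m balls into n bins, placing each in the least-loaded of d random bins."""
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(m):
        target = min((rng.randrange(n) for _ in range(d)), key=lambda i: bins[i])
        bins[target] += 1
    return max(bins)

if __name__ == "__main__":
    m, n = 100_000, 1_000
    print("average load      :", m // n)
    print("single-choice max :", max_load(m, n, d=1))
    print("two-choice max    :", max_load(m, n, d=2))
```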

Theory vs. Practice
- Storage load vs. network load [M91]: network load is not persistent, unlike storage load.
- Non-uniform sampling [V99], [W07]: servers are chosen based on consistent (or linear) hashing.
- Replication [MMRWYZ08]: choose r out of k servers instead of 1 out of d bins.
- Heterogeneity: variable-sized servers [W07], variable-sized items [TW07].

Theory => Planned Expansions
Adding new, empty servers to the system creates sudden, large load imbalances between the empty servers and the filled servers.
- Eventual storage balance: do subsequent inserts/writes fill up the new servers?
- Eventual network load balance: are new items distributed between old and new servers? New items are often more popular than old items.

Theory => Planned Expansions
[Diagram: expanding a linear-hashing system over the address space 0 to 2^b as the number of servers N grows from 2 to 4 to 6.]
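For reference, here is a minimal sketch (a generic linear-hashing formulation, not the prototype's code) of the address map the diagram depicts: keys hash into [0, 2^b), and when N is not a power of two, the servers that have not yet been split cover twice the hash range of the others — exactly the transient imbalance a planned expansion creates.

```python
import hashlib

def linear_hash_server(key: str, n_servers: int) -> int:
    """Map a key to one of n_servers using linear hashing over [0, 2**b)."""
    b = max(1, (n_servers - 1).bit_length())   # smallest b with 2**b >= n_servers
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    bucket = h % (2 ** b)
    if bucket >= n_servers:                    # this part of the range is not split yet
        bucket -= 2 ** (b - 1)                 # fold it back onto its parent server
    return bucket

# With N = 6 (so b = 3), servers 2 and 3 each cover two hash slots because they
# have not been split, while servers 0, 1, 4, and 5 cover one slot each.
```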

Theory => Planned Expansions
Before expansion: assume there are k·n servers across k segments and storage is evenly balanced.
Expansion: add α·n new servers to each segment.
After expansion: in each segment, R = (1 - α)·n servers carry twice the load of the L = 2·α·n servers (the new servers plus the split servers).

Theory => Planned Expansions
Relative load distribution R_L: the ratio of the expected number of replicas inserted per L server to the expected number inserted per server overall.
- R_L > 1 => eventual storage balance!
- R_L < 1 => no eventual storage balance.
Theorem: R_L > 1 if k ≥ 2r [MMRWYZ08].

Evaluation: Theory, Simulations, Experiments

Simulations => Overview
Real-world traces:
Trace  | Num Files   | Total Size | Type
MSNBC  | 30,656      | 2.33 TB    | Read only
Plan-9 | 1.9 million | 173 GB     | Read/write
Compare with Chain: one segment, single-choice paradigm, chained replica placement (e.g., PAST, CFS, Boxwood, Petal).

Simulations => Load Balance
[Charts: storage imbalance (Max - Avg, GB) vs. number of objects (MSNBC trace) and number of operations (Plan-9 trace), comparing Chain(3) and Kinesis(7,3).]

Simulations => System Provisioning
[Bar chart: percentage overprovisioning required for Kinesis configurations K(5) through K(10) and for Chain; values range from about 20% to 52%.]

Simulations => User-Experienced Delays
[Chart: 90th-percentile queuing delay (sec) vs. number of servers, comparing Chain(3) and Kinesis(7,3).]

Simulations => Read Load vs. Write Load
Read load: short-term resource consumption (network bandwidth, computation, disk bandwidth); directly impacts user experience.
Write load: both short-term and long-term resource consumption (storage space).
Solution (Kinesis S+L): favor short-term balance over long-term balance; storage balance is restored whenever the transient resources are not the bottleneck.
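A minimal sketch of one plausible S+L write choice, assuming short-term (queue) load is compared first and storage only breaks ties; the exact policy and weighting of the Kinesis S+L variant may differ, and the attribute names are invented for this example.

```python
def pick_write_targets(candidates, r):
    """Kinesis S+L-style choice (sketch): rank the k candidate servers by
    short-term load first and long-term storage second, then write to the top r.

    Each candidate is assumed to expose .queued_requests (short-term load)
    and .bytes_stored (long-term load); these attribute names are illustrative.
    """
    ranked = sorted(candidates, key=lambda s: (s.queued_requests, s.bytes_stored))
    return ranked[:r]
```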

Simulations => Read Load vs. Write Load
[Chart: queuing-delay imbalance (Max - Avg, ms) vs. update-to-read ratio (%), comparing Chain, Kinesis S, Kinesis S+L, and Kinesis L.]

Simulations => Read Load vs. Write Load
[Chart: request rate (Gb/s) and storage imbalance (%) over 24 hours, comparing Kinesis S and Kinesis S+L.]

Evaluation: Theory, Simulations, Experiments

Kinesis => Prototype Implementation
Storage system for variable-sized objects with a read/write interface.
Storage servers:
- Linear hashing system
- Operations: Read, Append, Exists, Size (storage), Load (queued requests)
- Failure recovery for primary replicas; the primary for an item is the highest-numbered server holding a replica.
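As a reading aid, here is a sketch of the per-server interface the slide names; the method signatures and types are assumptions for illustration, not the prototype's actual API.

```python
from abc import ABC, abstractmethod

class StorageServer(ABC):
    """Per-server operations named on this slide (signatures are illustrative)."""

    @abstractmethod
    def read(self, key: str) -> bytes: ...

    @abstractmethod
    def append(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def exists(self, key: str) -> bool: ...

    @abstractmethod
    def size(self) -> int:
        """Bytes stored (long-term storage load)."""

    @abstractmethod
    def load(self) -> int:
        """Queued requests (short-term load)."""
```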

Kinesis => Prototype Implementation
Front end:
- Two-step read protocol: 1) query the k candidate servers; 2) fetch from the least-loaded server holding a replica.
- Versioned updates with copy-on-write semantics.
- RPC-based communication with the servers.
Not implemented:
- Failure detector
- Failure-consistent updates
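Below is a sketch of the two-step read protocol listed above. The RPC callbacks (exists, load, read_replica) are stand-ins supplied by the caller; their names and signatures are assumptions, since the prototype's real RPC layer is not described on these slides.

```python
def two_step_read(item, candidate_servers, exists, load, read_replica):
    """Front-end two-step read (sketch).

    exists(server, item) -> bool, load(server) -> int, and
    read_replica(server, item) -> bytes stand in for the prototype's RPCs.
    """
    # Step 1: ask every candidate whether it holds a replica and how loaded it is.
    holders = [(load(s), s) for s in candidate_servers if exists(s, item)]
    if not holders:
        raise KeyError(f"no replica of {item!r} found among the candidates")
    # Step 2: fetch from the least-loaded server that holds a replica.
    _, best = min(holders, key=lambda pair: pair[0])
    return read_replica(best, item)
```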

Kinesis => Experiments
- 15-node LAN test-bed: 14 storage servers and 1 front end.
- MSNBC trace of 6 hours: 5,000 files, 170 GB total size, 200,000 reads.
- A failure is induced at 3 hours.

Experiments => Kinesis vs. Chain
[Chart: per-server storage (GB) across the 14 servers; Kinesis(7,3) stays between 11.4 and 13.4 GB per server, while Chain(3) ranges from 10.2 to 17.8 GB.]
Average latency: Kinesis 1.73 sec vs. Chain 3 sec.

Experiments => Failure Recovery
[Chart: recovery activity across the 14 servers for Kinesis(7,3) and Chain(3).]
Total recovery time: Kinesis 17 min (12 servers participating) vs. Chain 44 min (5 servers participating).

Related Work
Prior work:
- Hashing: Archipelago, OceanStore, PAST, CFS
- Two-choice paradigm: a common load-balancing technique
- Parallel recovery: Chain Replication [RS04]
- Random distribution: Ceph [WBML06]
Our contributions:
- Putting these ideas together in the context of storage systems
- Extending the theoretical results to the new design
- Demonstrating the power of these simple design principles

Summary and Conclusions
Kinesis: a data/replica placement framework for LAN storage systems built on structured organization, freedom of choice, and scattered distribution.
Simple and easy to implement, yet quite powerful: it reduces infrastructure cost, tolerates correlated failures, and quickly restores lost replicas.
Soon to appear in ACM Transactions on Storage, 2008.
Download: http://research.microsoft.com/users/rama

Experiments => Kinesis vs. Chain
[Chart: average latency (sec) over the 3-hour run, comparing Chain(3) and Kinesis(7,3).]

Experiments => Kinesis vs. Chain
[Charts: total request rate and server-load imbalance (Max - Avg) over the 3-hour run, comparing Chain(3) and Kinesis(7,3).]