Algorithms and Methods for Distributed Storage Networks 8 Storage Virtualization and DHT Christian Schindelhauer




Algorithms and Methods for Distributed Storage Networks 8 Storage Virtualization and DHT Institut für Informatik Wintersemester 2007/08

Overview
- Concept of Virtualization
- Storage Area Networks: principles, optimization
- Distributed File Systems
  - without virtualization, e.g. Network File System
  - with virtualization, e.g. Google File System
- Distributed Wide Area Storage Networks
  - Distributed Hash Tables
  - Peer-to-Peer Storage

Concept of Virtualization
Principle
- A virtual storage handles all application accesses to the file system
- The virtual disk partitions files and stores the blocks over several (physical) hard disks
- Control mechanisms allow redundancy and failure repair
Control
- The virtualization server assigns data, e.g. blocks of files, to hard disks (address space remapping)
- Controls the replication and redundancy strategy
- Adds and removes storage devices
[Figure: a file on a virtual disk, whose blocks are mapped onto several hard disks]
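To make the idea of address space remapping concrete, here is a minimal sketch; the round-robin (striping) layout and the function name are illustrative assumptions, not part of the lecture.

```python
# Minimal sketch of address-space remapping (assumed round-robin striping):
# a virtual disk spreads its block address space over several physical disks.

def remap(virtual_block: int, num_disks: int) -> tuple[int, int]:
    """Map a virtual block number to (physical disk index, block on that disk)."""
    return virtual_block % num_disks, virtual_block // num_disks

# Example: with 4 physical disks, virtual block 10 ends up on disk 2 as block 2.
print(remap(10, 4))  # -> (2, 2)
```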

Storage Virtualization
Capabilities
- Replication
- Pooling
- Disk management
Advantages
- Data migration
- Higher availability
- Simple maintenance
- Scalability
Disadvantages
- Un-installing is time consuming
- Compatibility and interoperability
- Complexity of the system
Classic implementations
- Host-based: Logical Volume Management; file systems, e.g. NFS
- Storage-device-based: RAID
- Network-based: Storage Area Network
New approaches
- Distributed Wide Area Storage Networks
- Distributed Hash Tables
- Peer-to-Peer Storage

Storage Area Networks
- Virtual block devices without a file system; connects hard disks
Advantages
- simpler storage administration
- more flexible: servers can boot from the SAN
- effective disaster recovery
- allows storage replication
Disadvantage
- compatibility problems between hard disks and the virtualization server

SAN Networking
- FCP (Fibre Channel Protocol): SCSI over Fibre Channel
- iSCSI (SCSI over TCP/IP)
- HyperSCSI (SCSI over Ethernet)
- ATA over Ethernet
- Fibre Channel over Ethernet
- iSCSI over InfiniBand
- FCP over IP
http://en.wikipedia.org/wiki/storage_area_network

SAN File Systems
- File systems for concurrent read and write operations by multiple computers
- Without conventional file locking
- Concurrent direct access to blocks by the servers
Examples
- Veritas Cluster File System
- Xsan
- Global File System
- Oracle Cluster File System
- VMware VMFS
- IBM General Parallel File System

Distributed File Systems (without Virtualization)
- a.k.a. network file systems
- Support sharing of files, tapes, printers, etc.
- Allow multiple client processes on multiple hosts to read and write the same files
  - concurrency control or locking mechanisms are necessary
Examples
- Network File System (NFS)
- Server Message Block (SMB), Samba
- Apple Filing Protocol (AFP)
- Amazon Simple Storage Service (S3)

Distributed File Systems with Virtualization
Example: Google File System
- File system on top of other file systems, with built-in virtualization
- System built from cheap standard components (with high failure rates)
- Few large files
- Only operations: read, create, append, delete
  - concurrent appends and reads must be handled
- High bandwidth is important
- Replication strategy: chunk replication and master replication
[Figure 1: GFS Architecture -- the application talks to the GFS client, which sends (file name, chunk index) to the GFS master and receives (chunk handle, chunk locations); it then requests (chunk handle, byte range) directly from a GFS chunkserver and receives the chunk data; chunkservers store chunks in a local Linux file system]
[Figure 2: Write control and data flow -- the client obtains the primary replica from the master and pushes data linearly along a chain of chunkservers (primary and secondary replicas); control messages and data messages are decoupled]
(Figures from "The Google File System", Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung)
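The read path of Figure 1 can be summarized in a few lines. The following is a self-contained sketch with hypothetical class and method names (Master.lookup and Chunkserver.read are assumptions; the real GFS client library is not public). It only illustrates the control flow: the client translates a byte offset into a chunk index, asks the master for the chunk handle and replica locations, caches that answer, and then reads the data directly from a chunkserver.

```python
# Sketch of a GFS-style read path (hypothetical names, not the real GFS API).

CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses 64 MB chunks

class Chunkserver:
    """Stores chunk data, indexed by chunk handle."""
    def __init__(self):
        self.chunks = {}  # chunk handle -> bytes

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

class Master:
    """Holds only metadata: the mapping from (file, chunk index) to replicas."""
    def __init__(self):
        self.mapping = {}  # (filename, chunk_index) -> (handle, [chunkservers])

    def lookup(self, filename, chunk_index):
        return self.mapping[(filename, chunk_index)]

class GFSClient:
    def __init__(self, master):
        self.master = master
        self.cache = {}   # cached lookups keep the master off the data path

    def read(self, filename, offset, length):
        chunk_index = offset // CHUNK_SIZE            # byte offset -> chunk index
        key = (filename, chunk_index)
        if key not in self.cache:                     # ask the master only on a miss
            self.cache[key] = self.master.lookup(filename, chunk_index)
        handle, replicas = self.cache[key]
        replica = replicas[0]                         # e.g. the closest chunkserver
        return replica.read(handle, offset % CHUNK_SIZE, length)
```

Writes follow the decoupled control and data flow of Figure 2 and are not sketched here.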

Distributed Wide Area Storage Networks
Distributed Hash Tables
- Relieving hot spots in the Internet
- Caching strategies for web servers
Peer-to-Peer Networks
- Distributed file lookup and download in overlay networks
- Most (or the best) of them use a DHT

WWW Load Balancing
Web surfing:
- Web servers offer web pages
- Web clients request web pages
- Most of the time these requests are independent
- Requests use resources of the web servers: bandwidth and computation time
[Figure: clients (Christian, Stefan, Arne) requesting pages from servers such as www.apple.de, www.google.com, www.uni-freiburg.de]

Load
- Some web servers always have a high load, e.g. www.google.com
  - for permanently high load the servers must be sufficiently powerful
- Some suffer from high fluctuations, e.g. special events:
  - jpl.nasa.gov (Mars mission)
  - cnn.com (terrorist attack)
- Extending the servers for the worst case is not reasonable
- Yet serving all requests is desired
[Figure: fluctuating load curves over Monday, Tuesday, Wednesday]

Load Balancing in the WWW
- Fluctuations hit certain servers at certain times
- (Commercial) solution: service providers offer additional exchange servers
- Many requests are then distributed among these servers
- But how?
[Figure: load of servers A and B fluctuating over Monday, Tuesday, Wednesday]

Literature
- Leighton, Lewin, et al., STOC 97: "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web"
- Used by Akamai (founded 1997)

Starting Situation
Without load balancing
- Advantage: simple
- Disadvantage: servers must be designed for worst-case situations
[Figure: web clients request web pages directly from the web server]

Site Caching
- The whole web site is copied to different web caches
- Browsers send their requests to the web server
- The web server redirects the requests to a web cache
- The web cache delivers the web pages
Advantage:
- good load balancing
Disadvantages:
- the redirect is a bottleneck
- large overhead for replicating the complete web site
[Figure: the web server redirects web clients to the web caches]

Proxy Caching
- Each web page is distributed to a few web caches
- Only the first request is sent to the web server
- Links reference pages in the web cache, so the web client then surfs within the web cache
Advantage:
- no bottleneck
Disadvantages:
- load balancing is only implicit
- high requirements for the placement of pages
[Figure: 1. request to the web server, 2. redirect to the web cache, 3./4. subsequent requests follow links within the web cache]

Requirements
- Balance: fair balancing of the web pages over the caches
- Dynamics: efficient insertion and deletion of web-cache servers and files
- Views: web clients may see different sets of web caches
[Figure: caches being added and removed while clients hold different views]

Hash Functions
- Set of items: I
- Set of buckets: B
- A hash function maps items to buckets, f: I -> B
- Example (see next slide): f(i) = (3 i + 1) mod 4
[Figure: items mapped to buckets]

Ranged Hash Functions
Given:
- items I
- number of caches (buckets) C, bucket set B
- views V, i.e. subsets of B
Ranged hash function: f_V(i) assigns item i to a bucket under view V
Prerequisite: f_V(i) is an element of V, for all views V
[Figure: buckets, a view, and items]

First Idea: A Single Hash Function
Algorithm: choose a hash function, e.g. f(i) = (3 i + 1) mod 4
Balance:
- each of the n cache servers receives about a 1/n fraction of the items: very good
Dynamics:
- inserting or removing a single cache server requires a new hash function, e.g. (2 i + 2) mod 3, and a total rehashing of all items
- very expensive!
[Figure: items 5, 2, 9, 4, 3, 6 hashed to caches 0-3, then rehashed to caches 0-2 after one cache is removed]
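A quick experiment (the item set of 10,000 consecutive integers is an assumption for illustration) shows how expensive the total rehashing is: switching from the hash function for four caches to the one for three caches moves three quarters of all items.

```python
# Why total rehashing hurts: with a plain modular hash, changing the number
# of caches moves most items to a different cache.
# (The item set below is an arbitrary choice for illustration.)

items = range(10_000)

old_cache = {i: (3 * i + 1) % 4 for i in items}  # hash function for 4 caches
new_cache = {i: (2 * i + 2) % 3 for i in items}  # hash function after removing one cache

moved = sum(1 for i in items if old_cache[i] != new_cache[i])
print(f"{moved / len(items):.0%} of the items move to a different cache")  # prints 75%
```

Consistent hashing, developed in the following slides, avoids this: only the items that must move onto or off the affected cache are reassigned.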

Requirements for Ranged Hash Functions
- Monotony: after adding or removing caches (buckets), no pages (items) should be moved between the remaining caches
- Balance: all caches should carry the same load
- Spread: a page should be distributed to only a bounded number of caches
- Load: no cache should have substantially more load than the average

Monotony
- After adding or removing caches (buckets), no pages (items) should be moved between the remaining caches
- Formally: for all pages i and views V1, V2 with V1 a subset of V2: if f_{V2}(i) is in V1, then f_{V1}(i) = f_{V2}(i)
[Figure: the same pages assigned to caches under view 1 and under the larger view 2]

Balance
- For every view V the assignment f_V(i) is balanced
- Formally: for a constant c and all buckets b in V: Pr[f_V(i) = b] <= c / |V|
[Figure: pages spread evenly over the caches of each view]

Spread
- The spread σ(i) of a page i is the overall number of necessary copies of i over all views, i.e. the number of distinct caches {f_V(i) : V is a view}
[Figure: the copies of one page under views 1, 2 and 3]

Load
- The load λ(b) of a cache b is the overall number of copies stored at b over all views
- i.e. λ(b) = |∪_V f_V^{-1}(b)|, where f_V^{-1}(b) is the set of all pages assigned to bucket b in view V
[Figure: over views 1-3, cache b_1 stores 2 copies (λ(b_1) = 2) and cache b_2 stores 3 copies (λ(b_2) = 3)]

Distributed Hash Tables
Theorem
There exists a family F of ranged hash functions with the following properties:
- Each function f in F is monotone
- Balance: for every view V and bucket b in V, Pr[f_V(i) = b] = O(1/|V|)
Assumptions:
- C = number of caches (buckets)
- C/t = minimum number of caches per view
- #views / #caches = constant
- I = C (number of pages = number of caches)
- Spread: for each page i, σ(i) = O(t log C) with high probability
- Load: for each cache b, λ(b) = O(t log C) with high probability

The Design
- Two hash functions onto the real interval [0,1]:
  - r_B maps k log C copies of each cache b randomly to [0,1]
  - r_I maps each web page i randomly to [0,1]
- f_V(i) := the cache b in view V whose copy minimizes the distance |r_B(b, j) - r_I(i)|
[Figure: cache copies (buckets) and web pages (items) placed on the interval [0,1] under views 1 and 2]
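A compact sketch of this construction follows. Assumptions: k = 20 copies per cache is an arbitrary choice, MD5 stands in for the random hash functions r_B and r_I, and the lookup uses a plain linear scan; a real implementation would keep the points sorted and use binary search.

```python
# Sketch of the consistent-hashing construction above: k*log(C) copies of each
# cache and every page are hashed to [0,1); a page is stored on the cache
# whose closest copy minimizes the distance. (Parameter choices are assumptions.)

import hashlib
import math

def h(key: str) -> float:
    """Pseudo-random point in [0,1); MD5 stands in for r_B / r_I here."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) / 2**128

class ConsistentHash:
    def __init__(self, caches, k=20):
        self.points = []                       # list of (position in [0,1), cache)
        self.copies = max(1, int(k * math.log(max(len(caches), 2))))
        for cache in caches:
            self.add(cache)

    def add(self, cache):
        for j in range(self.copies):           # k*log(C) copies of this cache
            self.points.append((h(f"{cache}#{j}"), cache))

    def remove(self, cache):
        self.points = [(p, c) for (p, c) in self.points if c != cache]

    def lookup(self, page):
        """Cache whose copy is closest to h(page); linear scan for clarity."""
        x = h(page)
        return min(self.points, key=lambda pc: abs(pc[0] - x))[1]

# Usage: pages are spread over three caches.
ring = ConsistentHash(["cache-A", "cache-B", "cache-C"])
print(ring.lookup("www.uni-freiburg.de/index.html"))
```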

1. Monotony
- Recall: f_V(i) := the cache whose copy minimizes the distance |r_B(b, j) - r_I(i)|, over all b in V
- Observe: the interval between r_I(i) and its closest cache copy is empty in V_2, and hence also empty in the smaller view V_1
- Therefore a page only moves when its new closest copy belongs to an added cache (or its old closest copy belonged to a removed cache)
[Figure: the same page on the interval [0,1] under view 1 and view 2; the marked interval around it contains no other cache copy]
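The monotony argument can be checked empirically. The toy experiment below (page names, cache names and the number of copies are arbitrary assumptions) re-implements the closest-copy rule and verifies that after adding a cache, every page that moved at all moved onto the new cache.

```python
# Empirical check of monotony for the closest-copy rule (toy setup).

import hashlib

def h(key: str) -> float:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) / 2**128

def assign(pages, caches, copies=50):
    """Assign each page to the cache whose copy lies closest on [0,1)."""
    points = [(h(f"{c}#{j}"), c) for c in caches for j in range(copies)]
    return {p: min(points, key=lambda pt: abs(pt[0] - h(p)))[1] for p in pages}

pages = [f"page-{i}" for i in range(1000)]
before = assign(pages, ["A", "B", "C"])
after = assign(pages, ["A", "B", "C", "D"])       # one cache added

moved = [p for p in pages if before[p] != after[p]]
assert all(after[p] == "D" for p in moved)        # moved pages only go to the new cache
print(f"{len(moved)} of {len(pages)} pages moved, all onto the new cache D")
```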

2. Balance
- Balance: for all views V, Pr[f_V(i) = b] = O(1/|V|)
- Choose a fixed view and a web page i, and apply the hash functions r_B and r_I
- Under the assumption that the mapping is random, every cache is chosen with the same probability
[Figure: cache copies (buckets) and web pages (items) on the interval [0,1] under a fixed view]

3. Spread
- σ(i) = number of all necessary copies of page i (over all views)
- Assumptions: C = number of caches (buckets), C/t = minimum number of caches per view, #views / #caches = constant, I = C (number of pages = number of caches)
  - i.e. every user knows at least a 1/t fraction of the caches
- Claim: for every page i, σ(i) = O(t log C) with high probability
Proof sketch:
- Every view has a cache copy in the interval of length t/C around r_I(i) (with high probability)
- The number of caches in this interval gives an upper bound on the spread
[Figure: the interval [0,1] divided into segments of length t/C]

4. Load
- Load: λ(b) = number of copies over all views, where f_V^{-1}(b) := set of pages assigned to bucket b under view V
- Claim: for every cache b, λ(b) = O(t log C) with high probability
Proof sketch:
- Consider intervals of length t/C
- With high probability a cache of every view falls into each of these intervals
- The number of items in an interval gives an upper bound on the load of the caches in it
[Figure: the interval [0,1] divided into segments of length t/C]

Summary
- The Distributed Hash Table is a distributed data structure for virtualization
- with fair balance
- provides dynamic behavior
- standard data structure for dynamic distributed storage

Algorithms and Methods for Distributed Storage Networks 8 Storage Virtualization and DHT Institut für Informatik Wintersemester 2007/08