Peer-to-Peer and Grid Computing Chapter 4: Peer-to-Peer Storage
Chapter Outline Using DHTs to build more complex systems How can a DHT help? What problems do DHTs solve? What problems are left unsolved? P2P storage basics, with examples Splitting into blocks (CFS) Wide-scale replication (OceanStore) Modifiable filesystem with logs (Ivy) Future of P2P filesystems Kangasharju: Peer-to-Peer Networks 2
How to Use a DHT? Recall: a DHT maps keys to values Applications built on DHTs must need exactly this functionality Or: they must be designed so that they do! It is usually possible to design an application in several ways Keys and values are application specific For a filesystem: Value = file For email: Value = email message For a distributed DB: Value = contents of entry, etc. The application stores values in the DHT and uses them Simple, but a powerful tool Kangasharju: Peer-to-Peer Networks 3
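As a minimal sketch of this idea, assuming a generic DHT with put/get operations (the DHT class, its interface, and the SHA-1 key derivation below are illustrative assumptions, not part of any particular system):

```python
# Minimal sketch: an application using a DHT as a key-value store.
# The DHT interface (put/get) and the key derivation are assumptions.
import hashlib

class DHT:
    """Stand-in for a real DHT (Chord, Tapestry, ...); here just a local dict."""
    def __init__(self):
        self._store = {}
    def put(self, key: bytes, value: bytes):
        self._store[key] = value
    def get(self, key: bytes) -> bytes:
        return self._store[key]

def file_key(filename: str) -> bytes:
    # Application-specific key: for a filesystem, key = hash of the filename
    return hashlib.sha1(filename.encode()).digest()

dht = DHT()
dht.put(file_key("/docs/report.txt"), b"file contents go here")
print(dht.get(file_key("/docs/report.txt")))
```

The application only decides what to use as keys and values; storage, routing, and lookup are left to the DHT.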
Problems Solved by DHT DHT solves the problem of mapping keys to values in the distributed hash table Efficient storage and retrieval of values Efficient routing Robust against many failures Efficient in terms of network usage Provides hash table-like abstraction to application Kangasharju: Peer-to-Peer Networks 4
Problems NOT Solved by DHT Everything else except what is on previous slide In particular, the following problems Robustness No guarantees against big failures Threat models for DHTs not well-understood yet Availability Data not guaranteed to be available Only probabilistic guarantees (but possible to get high prob.) Consistency No support for consistency Data in DHT often highly replicated, consistency is a problem Version management No support for version management Might be possible to support this to some degree Kangasharju: Peer-to-Peer Networks 5
P2P FS: Introduction P2P filesystems (FS) or P2P storage systems were the first applications of DHTs Fundamental principle: Key = filename, Value = file contents Different kinds of systems Storage for read-only objects Read-write support Stand-alone storage systems Systems with links to standard filesystems (e.g., NFS) Kangasharju: Peer-to-Peer Networks 6
P2P FS: Current State Only examples of P2P filesystems come from research Research prototypes exist for many systems No wide-area deployment Experiments on research testbeds No examples of real deployment and real usage in wide-area After initial work, no recent advances? At least, not visible advances Three examples: Cooperative File System, CFS OceanStore Ivy Kangasharju: Peer-to-Peer Networks 7
P2P FS: Why? Why build P2P filesystems? Light-weight, reliable, wide-area storage At least in principle Distributed filesystems are not widely deployed either, although they were studied long ago Gain experience with DHTs and how DHTs could be used in real applications The DHT abstraction is powerful, but it has limitations Understanding the limitations is valuable Kangasharju: Peer-to-Peer Networks 8
P2P FS: Basic Techniques Three fundamental basic techniques for building distributed storage systems 1. Splitting files into blocks 2. Replicating files (or blocks!) 3. Using logs to allow modifications For now: Simple analysis of advantages and disadvantages and three examples Detailed performance analysis in Chapter 5 For blocks and replication Kangasharju: Peer-to-Peer Networks 9
Splitting Files into Blocks Why: Files are of different sizes and peers storing large files have to serve more data Pro: Dividing files into equal-sized blocks and storing the blocks on different peers achieves load balance If different files share blocks, we can also save storage Con: Instead of needing one peer online, all peers holding the file's blocks must be online (see below) Metadata about the blocks must be stored somewhere Granularity tradeoff: Small blocks -> good load balance, but lots of overhead, and vice versa Kangasharju: Peer-to-Peer Networks 10
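A small sketch of block splitting with content-addressed block IDs; the 8 KB block size and the helper names are assumptions for illustration, not values prescribed by CFS:

```python
# Sketch: split a file into equal-sized, content-addressed blocks.
import hashlib

BLOCK_SIZE = 8 * 1024  # granularity tradeoff: smaller blocks -> better balance, more metadata

def split_into_blocks(data: bytes):
    """Return a list of (block_id, block) pairs; block_id = SHA-1 of the contents."""
    blocks = []
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        blocks.append((hashlib.sha1(block).hexdigest(), block))
    return blocks

# The publisher stores each block under its hash and keeps the ordered list
# of block IDs as metadata so that the file can be reassembled later.
metadata = [block_id for block_id, _ in split_into_blocks(b"x" * 20000)]
```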
Replication Why: If file (or block) is stored only on one peer and that peer is offline, data is not available Replicating content to multiple peers significantly increases content availability Pro: High availability and reliability But only probabilistic guarantees Con: How to coordinate lots of replicas? Especially important if content can change Unreliable network requires high degree of replication for decent availability Wastes storage space Kangasharju: Peer-to-Peer Networks 11
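As a rough illustration of the availability tradeoff (the numbers are assumed; a detailed analysis follows in Chapter 5): if every peer is online independently with probability p, then a block with k replicas, and a file split into b such blocks, are available with probability

$$A_{\text{block}} = 1 - (1 - p)^k, \qquad A_{\text{file}} = \bigl(1 - (1 - p)^k\bigr)^{b}.$$

For example, with p = 0.5 and k = 5 a single block is available with probability about 0.97, but a file split into b = 100 such blocks is fully available only with probability about 0.97^100, roughly 0.04, which shows why block splitting increases the required degree of replication.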
Logs Why: If we want to change stored files, we need to modify every stored replica Keep a log per file (or per user, etc.) which records the latest version Pro: Changes are concentrated in one place Anyone can figure out what the latest version is Con: How to keep the log available? By replicating it? ;-) Kangasharju: Peer-to-Peer Networks 12
P2P FS: Overview Three examples of P2P filesystems CFS (blocks and replication) Basic, read-only system Based on Chord OceanStore (replication) Vision for a global storage system Based on Tapestry Ivy (logs) Read-write, provide NFS semantics Based on Chord Kangasharju: Peer-to-Peer Networks 13
CFS CFS = Cooperative File System Developed at MIT, by same people as Chord CFS based on the Chord DHT Read-only system, only 1 publisher CFS stores blocks instead of whole files Part of CFS is a generic block storage layer Features: Load balancing Performance equal to FTP Efficiency and fast recovery from failures Kangasharju: Peer-to-Peer Networks 14
CFS Properties Decentralized control Scalability (comes from Chord) Availability In absence of catastrophic failures, of course Load balance Load balanced according to peers' capabilities Persistence Data will be stored for as long as agreed Quotas Possibility of per-user quotas Efficiency As fast as common FTP in the wide area Kangasharju: Peer-to-Peer Networks 15
CFS: Layers, Clients, and Servers FS DHash DHash DHash Chord Chord Chord Client Server Server FS: provide filesystem API, interpret blocks as files DHash: Provides block storage Chord: DHT layer, slightly modified Clients access files, servers just provide storage Kangasharju: Peer-to-Peer Networks 16
Chord Layer in CFS Chord layer is slightly modified from basic Chord Instead of 1 successor, each node keeps track of r successors Finger tables as before Also, try to reduce lookup latency Nodes measure latencies to other nodes Report measured latencies to other nodes Kangasharju: Peer-to-Peer Networks 17
DHash Layer DHash stores blocks as opposed to whole files Better load balancing More network query traffic, but not really significant Each block replicated k times, with k ≤ r (r = number of successors tracked by the Chord layer) Two kinds of blocks: Content blocks, addressed by the hash of their contents Signed blocks (= root blocks), addressed by public key - A signed block is the root of a filesystem - One filesystem per publisher Blocks are cached in the network Most of the caching happens near the responsible node Blocks in the local cache are replaced least-recently-used Consistency is not a problem, blocks are addressed by content - Root blocks are different, may return old (but consistent) data Kangasharju: Peer-to-Peer Networks 18
DHash API DHash provides the following API to clients: put_h(block): store block put_s(block, pubkey): store or update signed block get(key): fetch block associated with key A filesystem starts with the root (= signed) block Stored under the publisher's public key and signed with the private key For clients, read-only; the publisher can modify the filesystem by inserting a new root block The root block has pointers to other blocks - Either pieces of a file or filesystem metadata (e.g., a directory) Data is stored for an agreed-upon finite interval Kangasharju: Peer-to-Peer Networks 19
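A hedged sketch of how a client-side wrapper for this API might look; the underlying dht object, the Python function names, and the way the signature is stored are assumptions for illustration, not CFS code:

```python
# Illustrative wrapper around the DHash API (put_h, put_s, get).
# The `dht` object with put/get is assumed; signing is only indicated.
import hashlib

def put_h(dht, block: bytes) -> bytes:
    """Store a content block under the hash of its contents; return the key."""
    key = hashlib.sha1(block).digest()
    dht.put(key, block)
    return key

def put_s(dht, block: bytes, pubkey: bytes, signature: bytes) -> bytes:
    """Store or update the signed root block under the publisher's public key."""
    key = hashlib.sha1(pubkey).digest()
    dht.put(key, (block, signature))   # readers verify the signature against pubkey
    return key

def get(dht, key: bytes):
    """Fetch the block associated with key."""
    return dht.get(key)
```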
CFS Filesystem Structure (Figure: the signed root block, identified by the publisher's public key, points via H(D) to directory block D, which points via H(F) to block F, which points via H(B1) and H(B2) to data blocks B1 and B2) Root block identified by publisher's public key Each publisher has its own filesystem Different publishers are on separate filesystems Other blocks identified based on hash of contents Other blocks can be metadata or pieces of file Kangasharju: Peer-to-Peer Networks 20
Load Balancing and Quotas Different servers have different capabilities One real server can run several virtual servers Number of virtual servers depends on the capabilities Big servers run more virtual servers CFS operates at virtual server level Virtual nodes on a single real node know each other Possible to use short cuts in routing Quotas can be used to limit storage for each node Quotas set on a per-ip address basis Better quotas require central administration Some systems implement better quotas, e.g., PAST Kangasharju: Peer-to-Peer Networks 21
Updates in CFS Only publisher can change data in filesystem CFS will store any block under its content hash Highly unlikely to find two blocks with same SHA-1 hash No explicit protection for content blocks needed Root block is signed by publisher Publisher must keep private key secret No explicit delete operation Data stored only for agreed-upon period Publisher must refresh periodically if persistence is needed Kangasharju: Peer-to-Peer Networks 22
OceanStore OceanStore developed at UC Berkeley Runs on the Tapestry DHT Supports object modification Vision of ubiquitous computing: Intelligent devices, transparently in the environment Highly dynamic, untrusted environment Question: Where does persistent information reside? OceanStore aims to be the answer OceanStore's target: 10^10 users, each with about 10,000 files, i.e., 10^14 files total Kangasharju: Peer-to-Peer Networks 23
OceanStore: Basics and Goals Users pay for the storage service Several companies can provide services together Two goals: 1. Untrusted infrastructure Everything is encrypted, infrastructure unreliable However, assume that most servers are working correctly most of the time One class of servers trusted to follow protocol (but not trusted with data) 2. Nomadic data Anytime, anywhere Introspection used to tune system at run-time Kangasharju: Peer-to-Peer Networks 24
OceanStore Applications OceanStore suitable for many kinds of applications Storage and sharing of large amounts of data Data follows users dynamically Groupware applications Concurrent updates from many people For example, calendars, contact lists, etc. In particular, email and other communication applications Streaming applications Also for sensor networks and dissemination Here we concentrate on storage Kangasharju: Peer-to-Peer Networks 25
System Overview Each object has a globally unique identifier (GUID) Objects are replicated and migrated on the fly Replicas located in two ways A probabilistic, fast algorithm is tried first A slower but deterministic algorithm is used if the first one fails Objects exist in two forms, active and archival The active form is the latest version with a handle for updates An archival form is a permanent, read-only version - Archival versions encoded with erasure codes with a lot of redundancy - Only a global disaster can make data disappear Kangasharju: Peer-to-Peer Networks 26
Naming and Access Control Object naming Self-certifying object names Hash of the object owner's key and a human-readable name Allows directories System has no root, but user can select her own root(s) Access control Restricting readers All data is encrypted, can restrict readers by not giving key Revoke read permission by re-encrypting everything Restricting writers All writes must be signed and compared against ACL Kangasharju: Peer-to-Peer Networks 27
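A small sketch of self-certifying naming, assuming the GUID is simply a SHA-1 hash over the owner's public key concatenated with the human-readable name (the exact encoding is an assumption):

```python
import hashlib

def object_guid(owner_pubkey: bytes, name: str) -> bytes:
    """GUID = hash of the owner's key and the human-readable name.
    Anyone holding the key and the name can recompute (and thus verify) the GUID."""
    return hashlib.sha1(owner_pubkey + name.encode()).digest()
```

Because the GUID can be recomputed from the key and the name, the name certifies itself without any central naming root.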
Locating Objects OceanStore uses two mechanisms for locating objects 1. Probabilistic algorithm Frequently accessed objects likely to be nearby and easily found This algorithm finds them fast Uses attenuated Bloom filters See below for more details 2. Deterministic algorithm OceanStore based on Tapestry Deterministic routing is Tapestry s routing Guaranteed to find the object See Chapter 3 for the details Kangasharju: Peer-to-Peer Networks 28
Sidenote: Bloom Filters Bloom filters are a space-efficient probabilistic data structure that is used to test whether or not an element is a member of a set False positives are possible False negatives are NOT possible Bloom filter is an array of k bits Also need m different hash functions, each maps key to a bit To insert, calculate all m hash functions and set bits to 1 To check, calculate all m hash functions and if all bits are 1, key is probably in the set If any bit is 0, then it is definitely not in Kangasharju: Peer-to-Peer Networks 29
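A minimal Bloom filter sketch following the slide's conventions (an array of k bits and m hash functions); deriving the m hash functions by double hashing two SHA-1 digests is an implementation assumption:

```python
import hashlib

class BloomFilter:
    def __init__(self, k_bits: int, m_hashes: int):
        self.bits = [0] * k_bits
        self.m = m_hashes

    def _positions(self, key: str):
        # Derive m hash positions from two SHA-1 digests (double hashing);
        # any m independent hash functions would do.
        h1 = int.from_bytes(hashlib.sha1(key.encode()).digest(), "big")
        h2 = int.from_bytes(hashlib.sha1(b"salt" + key.encode()).digest(), "big")
        return [(h1 + i * h2) % len(self.bits) for i in range(self.m)]

    def insert(self, key: str):
        for p in self._positions(key):
            self.bits[p] = 1

    def contains(self, key: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[p] for p in self._positions(key))
```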
Sidenote: Bloom Filters To insert or check for an item, Bloom filters take on average O(m) time Fixed constant! Independent of the number of entries No other data structure allows for this (a hash table comes close) An attenuated Bloom filter of depth D is an array of D normal Bloom filters The first filter is for locally stored objects at the current node The i-th Bloom filter is the union of all Bloom filters at distance i through any path from the current node There is an attenuated Bloom filter for each network edge Queries are routed on the edge where the distance to the object is shortest Kangasharju: Peer-to-Peer Networks 30
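A sketch of an attenuated Bloom filter of depth D built on the BloomFilter sketch above, together with the routing rule of forwarding on the edge whose filter promises the shortest distance; the class layout and function names are assumptions:

```python
class AttenuatedBloomFilter:
    """Array of D Bloom filters: level 0 = local objects, level i = union of
    filters reachable within i hops (as described on the slide)."""
    def __init__(self, depth: int, k_bits: int, m_hashes: int):
        self.levels = [BloomFilter(k_bits, m_hashes) for _ in range(depth)]

    def first_match(self, key: str):
        """Smallest distance i at which the key may be stored, or None."""
        for i, f in enumerate(self.levels):
            if f.contains(key):
                return i
        return None

def choose_edge(edge_filters: dict, key: str):
    """Forward the query on the edge whose attenuated filter reports the
    shortest potential distance to the object (None if no filter matches)."""
    distances = {edge: f.first_match(key) for edge, f in edge_filters.items()}
    distances = {edge: d for edge, d in distances.items() if d is not None}
    return min(distances, key=distances.get) if distances else None
```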
Probabilistic Query Process (Figure: nodes n1-n4 with their local and neighbor Bloom filters; node n3 stores object X) Node n1 wants to find object X, X hashes to bits 0, 1, 3 Node n1 does not have 0, 1, and 3 in its local filter Its neighbor filter for n2 has them, forward query to n2 Node n2 does not have them in its local filter Its filter for neighbor n3 has them, forward to n3 Node n3 has the object Kangasharju: Peer-to-Peer Networks 31
Update Model Update model in OceanStore based on conflict resolution Update semantics Each update has a list of predicates with associated actions Predicates evaluated in order Actions for first true predicate are atomically applied (commit) If no predicates are true, action aborts Update is logged in both cases Kangasharju: Peer-to-Peer Networks 32
Update Model: Predicates List of predicates is short Untrusted environment limits what predicates can do Available predicates: Compare-version (metadata comparison, easy) Compare-size (same as above) Compare-block (easy if encryption is position-dependent block cipher) Search (possible to search ciphertext, get boolean result) Kangasharju: Peer-to-Peer Networks 33
Available Operations Four operations available Replace-block Insert-block Delete-block Append If position-dependent cipher, Replace-block and Append are easy operations For Insert-block and Delete-block: Two kinds of blocks: Index and data blocks Index blocks can contain pointers to other blocks To insert a block, we replace old block with an index block which points to old block and new block - Actual blocks are appended to object May be susceptible to traffic analysis Kangasharju: Peer-to-Peer Networks 34
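A loose sketch of this predicate/action update semantics, modeled on the slide text rather than on OceanStore's implementation (the record format, object fields, and helper names are assumptions):

```python
def apply_update(obj, update):
    """update = list of (predicate, actions) pairs. Predicates are evaluated in
    order; the actions of the first true predicate are applied atomically
    (commit), otherwise the update aborts. Either way the update is logged."""
    for predicate, actions in update:
        if predicate(obj):
            for action in actions:
                action(obj)          # e.g. replace-block, insert-block, append
            obj.log.append(("commit", update))
            return True
    obj.log.append(("abort", update))
    return False

# Illustrative usage (names assumed):
# compare_version = lambda o: o.version == 41
# append_block    = lambda o: o.blocks.append(new_block)
# apply_update(obj, [(compare_version, [append_block])])
```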
Serializing Updates Replicas are divided into two tiers The primary tier is trusted to follow the protocol The secondary tier is everyone else The primary tier cooperates in a Byzantine agreement protocol The secondary tier communicates with the primary tier and with other secondary replicas via epidemic algorithms Reason for two tiers: Byzantine fault-tolerant protocols are communication-intensive and therefore feasible only with a small number of replicas The primary tier is well-connected and small Kangasharju: Peer-to-Peer Networks 35
Byzantine Generals Problem Several divisions of the Byzantine army surround an enemy city. Each division is commanded by a general. The generals communicate only through messengers Need to arrive at a common plan after observing the enemy Some of the generals may be traitors Traitors can send false messages Required: An algorithm to guarantee that 1. All loyal generals decide upon the same plan of action, irrespective of what the traitors do. 2. A small number of traitors cannot cause the loyal generals to adopt a bad plan. Solution (from L. Lamport): If no more than m generals out of n = 3m + 1 are traitors, everybody will follow the orders Kangasharju: Peer-to-Peer Networks 36
Path of an Update (Figure: a small primary tier of replicas marked 1, surrounded by many secondary-tier replicas marked 2) The update is sent to the primary tier and to random replicas in the secondary tier for that object Kangasharju: Peer-to-Peer Networks 37
Path of an Update (Figure: same primary/secondary tiers as on the previous slide) The primary tier performs Byzantine agreement The secondary tier propagates the update epidemically Kangasharju: Peer-to-Peer Networks 38
Path of an Update (Figure: same primary/secondary tiers as on the previous slides) When the primary tier has finished the agreement protocol, the update is sent over multicast to all secondary replicas Kangasharju: Peer-to-Peer Networks 39
Deep Archival Storage Archival mechanism uses erasure codes Reed-Solomon, Tornado, etc. Generate redundant data fragments Created by the primary tier If there are enough fragments and they are spread widely, then it is likely that we can retrieve the data Archival copies are created when objects are changed Every version is archived Can be tuned to be done less frequently Kangasharju: Peer-to-Peer Networks 40
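To see why widely spread fragments make loss unlikely (illustrative numbers, not from the slides): if an object is coded into n fragments of which any m suffice to reconstruct it, and each fragment survives independently with probability p, then

$$P_{\text{recover}} = \sum_{i=m}^{n} \binom{n}{i}\, p^{i} (1-p)^{n-i},$$

so with, say, m = 16, n = 64 and p = 0.9, the expected number of surviving fragments is about 58 and the object is recoverable with probability extremely close to 1.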
Ivy Ivy developed at MIT Based on Chord Provides NFS-like semantics At least for fully connected networks Any user can modify any file Ivy handles everything through logs Ivy presents a conventional filesystem interface Kangasharju: Peer-to-Peer Networks 41
Problems for Distributed Read/Write 1. Multiple distributed writers make it difficult to maintain consistent metadata 2. Unreliable participants make locking unattractive Locking could help maintain consistency 3. Participants cannot be trusted Machines may be compromised Need to be able to undo 4. Distributed filesystem may become partitioned System must remain operational during partitions Help applications repair conflicting updates made during partitions Kangasharju: Peer-to-Peer Networks 42
Solution: Logs Each participant maintains log of its changes Logs maintain filesystem data and metadata Participant has private snapshot of logs of others Logs stored in DHash (see under CFS for details) Participant writes to its own log, reads all others Log-head points to most recent log record Version vectors impose order on log records from multiple logs (Figure: a view block points to each participant's log head, which in turn points to a chain of log records) Kangasharju: Peer-to-Peer Networks 43
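A sketch of what an Ivy-style per-participant log and log record might look like; the field names and the placeholder DHash key are assumptions based on the slide, not Ivy's actual record layout:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LogRecord:
    seq: int                  # locally increasing sequence number
    version_vector: dict      # participant -> highest sequence number seen in that log
    operation: str            # e.g. "write", "create", "unlink" (illustrative)
    data: bytes
    prev: Optional[bytes]     # DHash key of the previous record (log is a chain)

@dataclass
class Log:
    owner: str
    head: Optional[bytes] = None          # log-head points to the most recent record
    records: list = field(default_factory=list)

    def append(self, record: LogRecord):
        self.records.append(record)
        self.head = f"key-of-{record.seq}".encode()   # placeholder for the DHash key
```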
Ivy: Views Each user writing to a filesystem has its own log Participants agree on a view View is a set of logs that comprise the filesystem View block is immutable (the "root") View block has log heads for all participants Kangasharju: Peer-to-Peer Networks 44
Combining Logs How to determine order of modifications from logs? Order should obey causality All participants should agree on the order Each new log record has a sequence number and a version vector Sequence number increasing (managed locally) Version vector has sequence numbers from other logs in view Version vector summarizes knowledge about log Log records ordered by comparing version vectors Vectors u and v comparable if u < v, v < u, or v = u Otherwise concurrent Simultaneous operations result in equal or concurrent vectors Ordered by public keys of participants May need special actions to return to consistency (overlapping modifications) Kangasharju: Peer-to-Peer Networks 45
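A small sketch of the version-vector ordering described above (comparing participants by their public-key strings for tie-breaking is an assumption about the representation):

```python
def compare(u: dict, v: dict) -> str:
    """Return '<', '>', '=', or 'concurrent' for version vectors u and v."""
    keys = set(u) | set(v)
    u_le_v = all(u.get(k, 0) <= v.get(k, 0) for k in keys)
    v_le_u = all(v.get(k, 0) <= u.get(k, 0) for k in keys)
    if u_le_v and v_le_u:
        return "="
    if u_le_v:
        return "<"
    if v_le_u:
        return ">"
    return "concurrent"   # simultaneous updates; order e.g. by participants' public keys

# Example: {"A": 3, "B": 1} vs {"A": 2, "B": 2} -> "concurrent"
```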
Ivy: Snapshots Private snapshots avoid traversing the whole filesystem A snapshot contains the entire state Each participant has its own snapshot Contents of the snapshots are mostly the same for all participants, so DHash will store them only once To create a snapshot, a node: Fetches all log records more recent than its current snapshot Writes the new snapshot A new user must either build from scratch or take a trusted snapshot Kangasharju: Peer-to-Peer Networks 46
P2P FS: General Problems Built on top of unreliable nodes and network How to achieve reliability? Replication gives reliability (see Chapter 5) Replication makes maintaining consistency difficult Nodes cannot be trusted in the general case Must encrypt all data Hard to do content-based conflict resolution (e.g., diff) Performance Distributed filesystems have much lower performance Kangasharju: Peer-to-Peer Networks 47
P2P FS: Future What does future hold for P2P filesystems? What is right area of application? Intranet? Trusted environment, high bandwidth Possibly easy to deploy? - Need to make a product first? Global Internet? Lots of untrusted peers, widely distributed All the problems from above Can take off as a hobby project? Nowhere? P2P filesystems are a total waste of time? Kangasharju: Peer-to-Peer Networks 48
P2P FS: Future Do we need a P2P FS to build useful applications? Yes, they allow efficient distributed storage Working storage system is the basic building block of a useful application No, DHT is enough Several examples of P2P applications directly on top of a DHT Fully reliable P2P FS would make network into one virtual computer Modulo performance issues Building P2P apps would be like building normal apps Kangasharju: Peer-to-Peer Networks 49
P2P FS: Real-World Examples Microsoft has done a lot of research in this area Even built prototypes Why no products on the market? Below some wild speculation P2P FS would compete with traditional file servers? P2P needs to be built into the OS, would kill the server market? Impossible to build a good enough P2P FS? Theoretically doable, but too slow and weak in practice? P2P filesystem can be done, but has no advantages? Possible to build a useful system, but it costs as much as a server-based system? Kangasharju: Peer-to-Peer Networks 50
Chapter Summary How to build applications with DHTs Basic technologies of distributed storage As examples, 3 P2P filesystems CFS - Read-only OceanStore - Modifications allowed, global vision Ivy - NFS-like semantics, traditional filesystem interface Discussion about the future of P2P filesystems Kangasharju: Peer-to-Peer Networks 51