P2P Storage System. Presented by: Aakash Therani, Ankit Jasuja & Manish Shah

Transcription

1 P2P Storage System Presented by: Aakash Therani, Ankit Jasuja & Manish Shah

2 What is a P2P storage system? Peer-to-peer (P2P) storage systems leverage the combined storage capacity of a network of storage devices (peers), typically contributed by autonomous end users, as a common pool of storage space to store and share content. Applications: distributed file systems; content sharing; backup and archival storage; peer data management systems.


4 Designing P2P Storage Systems. Factors to keep in mind while designing P2P storage systems: persistent storage; availability in the presence of network partitions; durability against failure and attack; security issues; access control; protection against content pollution; transactions; concurrency control; fault tolerance.

5 Cloud Storage vs. P2P Storage. When data is stored at server clusters within the Internet, this kind of data storage is referred to as cloud storage. Cloud storage products: Amazon S3 (Simple Storage Service) is a web service that offers cloud storage through a simple HTTP-based interface. Dropbox is a cloud storage provider and file synchronization tool that uses the Amazon S3 storage facility as a back end. When the members of a group store each other's data, it is called peer-to-peer (P2P) storage. P2P storage products: Wuala [37] is a commercial, distributed storage service that allows users to trade storage capacity in a P2P way.

6 Classification of Storage Products. Products can be classified based on the types of storage needs: 1) Backup: using the service as a backup facility for files stored locally on a computer (which is part of the peer network). This may involve keeping track of versions of files as they change over time. 2) File synchronization: keeping the same file tree that exists on a number of different computers in sync. When a file is changed on one computer, the copy of that file on the other computers is automatically updated. This type of functionality must deal with conflicts, e.g., when the same file is changed on multiple computers at the same time. 3) Distributed file system: the online storage capacity is used to implement a distributed file system. One or more computers access the storage in a manner that is very similar to local file systems. 4) Content sharing: parts of the file tree stored online are used to share data with other people. By providing credentials to others, the owner lets them use the storage facility to read the part of the tree they were granted access to.

7 OceanStore: An Architecture for Global-Scale Persistent Storage

8 OceanStore: A True Data Utility. Utility model: consumers pay a monthly fee in exchange for access to persistent storage. Highly available data from anywhere. Automatic replication for disaster recovery. Strong security. Providers would buy and sell capacity among themselves for mobile users. Deep archival storage: use excess of storage space to ease data management.

9 Ubiquitous Computing

10 Two Unique Goals. 1) Ability to be constructed from an untrusted infrastructure: servers may crash without warning; all information entering the infrastructure must be encrypted; servers participate in protocols for distributed consistency management. 2) Support for nomadic data: locality is of utmost importance; promiscuous caching means data can be cached anywhere, anytime; continuous introspective monitoring is used to manage caching and locality.

11 System Overview. Persistent object: the fundamental unit in OceanStore. Each object is named by a Globally Unique Identifier (GUID). Objects are replicated and stored on multiple servers. Floating replicas: replicas are independent of the server. Two mechanisms locate a replica: 1) a fast, probabilistic algorithm tries to find the object near the requesting machine; 2) if (1) fails, the object is located through a slower, deterministic algorithm.

12 Underlying Technologies. Naming; access control; data location and routing; data update; deep archival storage; introspection.

13 Naming. GUID: pseudo-random, fixed-length bit string. Decentralized and resistant to attempts by adversaries. Self-certifying path names: GUID = hash(owner's key, filename). The GUID of a server is a secure hash of its key. The GUID of a data fragment is a secure hash of the data content.
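The naming rules above can be sketched in a few lines. This is only an illustration: the slide does not name a hash function, so SHA-256, the length prefix, and the function names `object_guid` and `fragment_guid` are all assumptions.

```python
import hashlib

def object_guid(owner_public_key: bytes, filename: str) -> str:
    """Hypothetical object GUID: a secure hash over the owner's key and the name."""
    h = hashlib.sha256()  # assumption: the slide only says "hash"
    # Length-prefix the key so different (key, name) pairs cannot collide
    # simply by shifting bytes between the two fields.
    h.update(len(owner_public_key).to_bytes(4, "big"))
    h.update(owner_public_key)
    h.update(filename.encode("utf-8"))
    return h.hexdigest()

def fragment_guid(data: bytes) -> str:
    """Hypothetical fragment GUID: a secure hash of the fragment's content."""
    return hashlib.sha256(data).hexdigest()
```

Because a fragment GUID depends only on the content, anyone holding the fragment can recompute the hash and verify that the data was not altered.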

14 Access Control. OceanStore supports two primitive types of access control. 1) Reader restriction: encrypt non-public data and distribute the key to users with read access. Problem: there is no way to make a reader forget what he has read. 2) Writer restriction: through ACLs specified for each object by its owner. Each user has a signing key, and ACLs use that key for granting access. Note: reads are restricted at clients via key distribution, while writes are restricted at servers by ignoring unauthorized updates.

15 Data Location and Routing. Objects can reside on any of the OceanStore servers. Query routing is used to locate objects. Every object is identified by one or more GUIDs. Different replicas of the same object have the same GUID. OceanStore messages are labeled with a destination GUID (built on top of IP), a random number, and a small predicate.

16 Distributed Routing in OceanStore. Routing is a two-phase process. Data location and routing are combined, which avoids multiple round trips. Routing itself is two-tiered: a fast probabilistic algorithm and a slow, reliable hierarchical method.

17 Bloom Filters. Based on the idea of hill-climbing: if a query cannot be satisfied by a server, local information is used to route the query to a likely neighbor, via a modified version of a Bloom filter.

18 Attenuated Bloom Filters. An attenuated Bloom filter of depth D is an array of D normal Bloom filters. The i-th Bloom filter is the union of the Bloom filters of all the nodes at distance i. One filter per network edge.
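A minimal sketch of this structure, assuming a small bit-field Bloom filter built on SHA-256. The class `BloomFilter`, the function `first_matching_depth`, and the sizes used are hypothetical and chosen only for illustration.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter; the bit field is stored in a Python int."""

    def __init__(self, bits: int = 64, hashes: int = 3):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, item: str):
        # Derive `hashes` bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.field |= 1 << p

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True may be a false positive.
        return all((self.field >> p) & 1 for p in self._positions(item))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        merged = BloomFilter(self.bits, self.hashes)
        merged.field = self.field | other.field
        return merged

def first_matching_depth(attenuated, guid):
    """attenuated[i-1] summarizes the nodes at distance i along one edge.

    Returns the smallest distance whose filter may hold `guid`, or None.
    """
    for distance, level in enumerate(attenuated, start=1):
        if level.might_contain(guid):
            return distance
    return None
```

A node would keep one such array per outgoing edge and forward a query along the edge reporting the smallest matching distance.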

19 Attenuated Bloom Filters. [Figure: example lookup for the bit pattern 11010.]

20 The Global Algorithm: Wide-Scale Distributed Data Location. Plaxton's randomized hierarchical distributed data structure. Resolve one digit of the node ID at a time. Links form a series of random embedded trees, with each node as the root of one of these trees. Neighbor links can be used to route from anywhere to a given node. If information about the GUID (such as its location) were stored at its root, then anyone could find this information simply by following neighbor links until they reached the root node for the GUID.
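Digit-by-digit resolution can be sketched as follows. Several simplifications are assumed here: IDs are resolved from the last digit backwards, the router sees a global list of node IDs instead of per-node neighbor tables, and the IDs themselves are made up.

```python
def shared_suffix_len(a: str, b: str) -> int:
    """Number of trailing digits that two IDs have in common."""
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def route(start: str, target: str, nodes: list[str]) -> list[str]:
    """Walk toward `target`, matching more trailing digits at every hop."""
    path = [start]
    current = start
    while current != target:
        have = shared_suffix_len(current, target)
        # Any node that matches more trailing digits than we do is progress.
        candidates = [n for n in nodes if shared_suffix_len(n, target) > have]
        if not candidates:
            break  # no node is closer; in this sketch we simply stop
        # Take the smallest step, which mimics resolving one digit per hop.
        current = min(candidates, key=lambda n: shared_suffix_len(n, target))
        path.append(current)
    return path
```

For example, with the hypothetical nodes `["0325", "1238", "9098", "7598", "4598"]`, routing from `0325` to `4598` visits `1238`, `9098`, `7598` and then `4598`, fixing one more trailing digit at each hop.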

21 The Global Algorithm: Wide-Scale Distributed Data Location

22 Achieving Locality. When a replica is placed somewhere in the system, its location is published to the routing infrastructure. The publishing process works its way to the object's root and deposits a pointer at every hop along the way. Each new replica only needs to traverse O(log(n)) hops to reach the root, where n is the number of servers. When someone searches for information, they climb the tree until they run into a pointer, after which they route directly to the object.

23 Achieving Fault Tolerance. Avoid failures at roots: each root GUID is hashed with a small number of different salt values, which makes it difficult to target a single GUID for DoS attacks. If failures are detected, just jump to any node to reach the root. OceanStore continually monitors and repairs broken pointers.

24 Advantages of Distributed Information. Redundant paths to roots. Scalable with a combination of probabilistic and global algorithms. Easy to locate and recover failed components. Plaxton links form a natural substrate for admission controls and multicasts.

25 Achieving Maintenance-Free Operation. Recursive node insertion and removal. Replicated roots. Use beacons to detect faults. Time-to-live fields to update routes. Second-chance algorithm to avoid false diagnoses of failed components. Avoid the cost of recovering lost nodes. Automatic reconstruction of data for failed servers.

26 Update: Format and Semantics. An update is a list of predicates associated with actions. The predicates are evaluated in order, and the actions of the earliest true predicate are atomically applied. The update is logged whether it commits or aborts. Predicates: compare-version, compare-block, compare-size, search. Actions: replace-block, insert-block, delete-block, append.
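The predicate/action semantics can be sketched as below. The object layout (a dict with `version` and `blocks`), the version bump on commit, and the helper names are assumptions made for illustration. Only a few of the listed predicates and actions are shown, and logging is omitted.

```python
def compare_version(expected):
    """Predicate: true when the object is at the expected version."""
    return lambda obj: obj["version"] == expected

def append(block):
    """Action: add a block at the end of the object."""
    return lambda obj: obj["blocks"].append(block)

def replace_block(index, block):
    """Action: overwrite the block at `index`."""
    def action(obj):
        obj["blocks"][index] = block
    return action

def apply_update(obj, update):
    """update is an ordered list of (predicate, [actions]) pairs.

    The actions of the first true predicate are applied to a staged copy
    and then installed in one step, so a failing action leaves obj untouched.
    Returns True on commit and False on abort.
    """
    for predicate, actions in update:
        if predicate(obj):
            staged = {"version": obj["version"], "blocks": list(obj["blocks"])}
            for action in actions:
                action(staged)
            staged["version"] += 1  # assumption: each commit bumps the version
            obj.update(staged)
            return True
    return False  # no predicate held, so the update aborts
```

Guarding an update with a compare-version predicate gives optimistic concurrency: an update prepared against a stale version simply aborts.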

27 Serializing Updates in an Untrusted Infrastructure. A small primary tier of replicas serializes updates. It runs a Byzantine agreement protocol and is kept small to minimize communication. Meanwhile, a secondary tier of replicas optimistically propagates updates among themselves. The final ordering from the primary tier is multicast to the secondary replicas.

28 Update. Path of an update: a) After generating an update, a client sends it directly to the object's inner ring. b) While the inner ring performs a Byzantine agreement to commit the update, secondary nodes propagate the update among themselves. c) The result of the update is multicast down the dissemination tree to all secondary nodes.

29 The Full Update Path

30 Update Commitment. Fault tolerance is guaranteed if fewer than one third of the servers in the inner ring are malicious. Secondary nodes do not participate in the Byzantine protocol, but receive consistency information.
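The one-third bound is easy to check with integer arithmetic: a ring of n servers tolerates f malicious servers when 3f < n. The function names below are hypothetical.

```python
def tolerates(inner_ring_size: int, malicious: int) -> bool:
    """True when fewer than one third of the inner ring is malicious."""
    return 3 * malicious < inner_ring_size

def max_faulty(inner_ring_size: int) -> int:
    """Largest number of malicious servers the ring can tolerate."""
    return (inner_ring_size - 1) // 3
```

Under this bound, a ring of 4 servers tolerates 1 malicious server and a ring of 7 tolerates 2, while growing the ring from 4 to 6 servers still tolerates only 1.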

31 A Direct Path to Clients and Archival Storage. Updates flow directly from a client to the primary tier, where they are serialized and then multicast to the secondary servers down the dissemination tree. Updates are tightly coupled with archival: archival fragments are generated at serialization time, signed, encoded and distributed with updates.

32 Deep Archival Storage. Data is fragmented. Each fragment is an object. Erasure coding is used to increase reliability. Administrative domains are ranked by their reliability and trustworthiness. Avoid locations with correlated failures.

33 Erasure Codes. Erasure coding is a process that treats input data as a series of fragments (say n) and transforms these fragments into a greater number of fragments (say 2n or 4n). [Figure: an n-fragment message passes through the encoding algorithm to give cn encoded fragments; after transmission, n received fragments pass through the decoding algorithm to give back the n-fragment message.]
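The idea can be illustrated with the simplest erasure code, a single XOR parity fragment. This is a swapped-in toy scheme, not the code used by OceanStore: it produces n + 1 fragments rather than 2n or 4n and survives the loss of only one fragment. The function names are hypothetical and all fragments are assumed to have equal length.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length fragments."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(fragments):
    """Return the n data fragments followed by one parity fragment."""
    parity = reduce(xor_bytes, fragments)
    return list(fragments) + [parity]

def recover(coded):
    """coded holds n + 1 fragments with exactly one replaced by None.

    XOR-ing the surviving fragments rebuilds the missing one; the data
    fragments (everything but the parity) are returned.
    """
    missing = coded.index(None)
    survivors = [f for f in coded if f is not None]
    full = list(coded)
    full[missing] = reduce(xor_bytes, survivors)
    return full[:-1]
```

Codes with a larger expansion factor follow the same pattern but add more redundant fragments, so that more losses can be survived.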

34 Introspection. [Figure: cycle of computation, observation and optimization.] Observation modules monitor the activity of a running system and track system behavior. Optimization modules adjust the computation.

35 Introspection. Event handlers summarize local events. These summaries are stored in a database. The information in the database is periodically analyzed and necessary actions are taken. A summary is sent to other nodes.

36 Uses of Introspection. Cluster recognition: identify related files. Replica management: adjust replication factors; migrate floating replicas.

37 Introspection. If a replica becomes unavailable, clients will receive service from a more distant replica. This produces extra load on distant replicas. The introspective mechanism detects this and new replicas are created. These actions provide fault tolerance and automatic repair.

38 Applications. Groupware applications; personal information management tools; contact lists; calendars; distributed design tools.

39 Conclusion. Different from other systems: utility model; untrusted infrastructure; truly nomadic data. Use of introspection: prevention of denial-of-service attacks; rapid response to regional outages; analysis of access patterns.

40 Dynamo: Amazon's Highly Available Key-value Store

41 Motivation. Build a distributed storage system: scale; simple (key-value); highly available; guarantee Service Level Agreements (SLAs). Service Level Agreements (SLAs): for an application to deliver its functionality in a bounded time, every dependency in the platform needs to deliver its functionality with even tighter bounds. Example: a service guaranteeing that it will provide a response within 300 ms for 99.9% of its requests for a peak client load of 500 requests per second.

42 Design Consideration 1) Sacrifice strong consistency for availability 2) Conflict resolution is executed during read instead of write, i.e. always writeable. 3) Other principles: Incremental scalability. Symmetry. Decentralization. Heterogeneity.

43 Partition Algorithm. Consistent hashing: the output range of a hash function is treated as a fixed circular space or ring. Virtual nodes: each node can be responsible for more than one virtual node. Advantages of using virtual nodes: If a node becomes unavailable, the load handled by this node is evenly dispersed across the remaining available nodes. When a node becomes available again, the newly available node accepts a roughly equivalent amount of load from each of the other available nodes. The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.
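A minimal sketch of a consistent-hash ring with virtual nodes follows. The `Ring` class, the `node#i` labelling of virtual nodes and the use of SHA-256 are assumptions for illustration, not Dynamo's actual implementation.

```python
import bisect
import hashlib

def ring_position(label: str) -> int:
    # SHA-256 is an illustrative choice; the slide does not name a hash.
    return int.from_bytes(hashlib.sha256(label.encode()).digest(), "big")

class Ring:
    def __init__(self):
        self._points = []  # sorted (position, physical node) pairs

    def add_node(self, node: str, vnodes: int = 8) -> None:
        """Place `vnodes` virtual nodes for one physical node on the ring."""
        for i in range(vnodes):
            bisect.insort(self._points, (ring_position(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._points = [p for p in self._points if p[1] != node]

    def owner(self, key: str) -> str:
        """First virtual node clockwise from the key's position."""
        positions = [pos for pos, _ in self._points]
        i = bisect.bisect_right(positions, ring_position(key))
        return self._points[i % len(self._points)][1]
```

Giving a larger machine more virtual nodes (for example `add_node("big", vnodes=16)`) assigns it a proportionally larger share of the ring. Removing a node only reassigns the keys that node owned; every other key keeps its owner.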

44 Data Versioning & Vector Clock. A put() call may return to its caller before the update has been applied at all the replicas. A get() call may return many versions of the same object. Challenge: an object can have distinct version sub-histories, which the system will need to reconcile in the future. Solution: use vector clocks to capture causality between different versions of the same object. A vector clock is a list of (node, counter) pairs. Every version of every object is associated with one vector clock. If every counter in the first object's clock is less than or equal to the corresponding counter in the second clock, then the first is an ancestor of the second and can be forgotten.
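The ancestor test can be sketched with clocks represented as dicts from node name to counter. The function names and the node names in the usage note are hypothetical, and a missing entry is assumed to count as zero.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a is equal to or newer than clock b on every node."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def reconcile(versions):
    """versions is a list of (clock, value) pairs.

    Drops every version whose clock is a strict ancestor of another
    version's clock. Versions that remain are concurrent siblings that
    the client has to merge.
    """
    kept = []
    for i, (clock, value) in enumerate(versions):
        dominated = any(
            j != i and other != clock and descends(other, clock)
            for j, (other, _) in enumerate(versions)
        )
        if not dominated:
            kept.append((clock, value))
    return kept
```

For example, `{"Sx": 1}` is an ancestor of `{"Sx": 2}`, which in turn is an ancestor of both `{"Sx": 2, "Sy": 1}` and `{"Sx": 2, "Sz": 1}`. Those last two clocks are concurrent, so both versions survive reconciliation.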

45 Execution. 1) A read/write request on a key arrives at a node (the coordinator), ideally the node responsible for that key; otherwise the node forwards the request to the node responsible for the key, and that node becomes the coordinator. The first N healthy and distinct nodes following the key position are considered for the request. Quorums are used: R is the read quorum, W is the write quorum, and R + W > N. 2) Writes: the coordinator generates a new vector clock and writes locally, then forwards the write to the N nodes; if W-1 respond, the write was successful. 3) Reads: the coordinator forwards the request to the N nodes; if R-1 respond, it forwards the result to the user. Only unique responses are forwarded. The user handles merging if multiple versions exist.
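The condition R + W > N means that any read quorum shares at least one replica with any write quorum, so a read always reaches a replica that saw the latest successful write. The brute-force check below confirms this for small N; the function name is hypothetical.

```python
from itertools import combinations

def always_overlap(n: int, r: int, w: int) -> bool:
    """True if every r-subset of n replicas intersects every w-subset."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )
```

With N = 3, the choice R = 2, W = 2 always overlaps, while R = 1, W = 2 does not, because a read can land on the one replica the write missed.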

46 FreeNet: A Distributed Anonymous Information Storage and Retrieval System

47 FreeNet. Introduction: P2P network for anonymous publishing and retrieval of data. Decentralized: nodes collaborate in storage and routing. Data-centric routing: adapts to demands. Addresses privacy and availability concerns. Features: anonymity for producers and consumers; deniability for information stores; resistance to denial attacks; efficient storing and routing. Does NOT provide: permanent file storage; load balancing; anonymity for general network usage.

48 Architecture. Request fields: 1. key; 2. hops to live; 3. ID; 4. depth. Each node has a local data store and a routing table. Files are requested through location-independent keys. Routing is a chain of proxy requests, and each routing decision is local. The graph structure actively evolves over time.

49 Keys and Searching. Problems with signed-subspace keys (SSK): updating and versioning. Content-hash keys (CHK): the file is encrypted by a random encryption key, and the CHK is published together with the decryption key. CHK + SSK gives easily updateable files through a two-step process: publish the file, then publish a pointer. This results in pointers to the newer version, while older versions are accessed through their CHK. CHKs can also be used for splitting files.

50 File retrieving. [Figure: example network of nodes a to f.] Location of keys: hypertext spider; indirect files published with the KSK of search words; published bookmarks. File retrieval: a request is forwarded to the node in the routing table (RT) with the closest lexicographic match for the binary key. Request routing follows steepest-ascent hill climbing: if the first choice fails, backtrack and try the second choice. Timers and hop counts curtail request threads. Files are cached all along the retrieval path. A self-reinforcing cycle results in key expertise.
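The retrieval procedure can be sketched as a depth-first search with backtracking. This is a simplified model under several assumptions: keys are integers and closeness is the absolute difference (standing in for lexicographic matching), hops-to-live is decremented per hop along the current branch only, and the network is a plain dict. All names are hypothetical.

```python
def retrieve(network, node, key, hops_to_live, visited=None):
    """network maps a node name to {"store": {key: data}, "routes": {key: neighbor}}.

    Returns (data, path) on success or (None, [node]) on failure.
    """
    visited = visited if visited is not None else set()
    visited.add(node)
    info = network[node]
    if key in info["store"]:
        return info["store"][key], [node]
    if hops_to_live <= 0:
        return None, [node]  # hop budget exhausted on this branch
    # Rank routing-table entries by how close their key is to the request.
    ranked = sorted(info["routes"].items(), key=lambda kv: abs(kv[0] - key))
    for _, neighbor in ranked:
        if neighbor in visited:
            continue  # never revisit a node within one request
        data, path = retrieve(network, neighbor, key, hops_to_live - 1, visited)
        if data is not None:
            info["store"][key] = data       # cache along the return path
            info["routes"][key] = neighbor  # remember who answered for this key
            return data, [node] + path
        # The first choice failed, so backtrack and try the next-best neighbor.
    return None, [node]
```

Because every node on the successful path caches the file and records a route for its key, a repeated request for the same key is answered sooner, which is the self-reinforcing effect the slide describes.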

51 Data Management. Finite data stores: nodes resort to LRU. Routing table entries linger after data eviction. Outdated (or unpopular) documents disappear automatically. Bipartite eviction: short-term policy; new files replace most recent files; prevents established files being evicted by attacks.
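A finite data store with LRU eviction can be sketched with `collections.OrderedDict`. The sketch covers only the plain LRU policy, not the bipartite scheme. The `DataStore` class and the capacity measured in number of files are assumptions.

```python
from collections import OrderedDict

class DataStore:
    """Fixed-capacity store that evicts the least recently used file."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.files = OrderedDict()  # oldest entry first

    def get(self, key):
        if key not in self.files:
            return None
        self.files.move_to_end(key)  # a hit makes the file most recent
        return self.files[key]

    def put(self, key, data):
        """Store a file and return the keys evicted to make room."""
        self.files[key] = data
        self.files.move_to_end(key)
        evicted = []
        while len(self.files) > self.capacity:
            evicted.append(self.files.popitem(last=False)[0])
        return evicted
```

Files that are never requested drift to the front of the ordering and are dropped first, which is how unpopular documents disappear without any explicit delete operation.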

52 Protocol and Security. Protocol: Nodes with frequently changing IPs use ARKs. Return address specified in requests: a threat? Messages do not always terminate when hops-to-live reaches 1. Depth is initialized by the original requestor to an arbitrarily small value. Request state is maintained at each node (timers, LRU). Security: File integrity: KSK is vulnerable to dictionary attacks. DoS attacks: Hash Cash to slow them down. Attempts to displace valid files are constrained by the insert procedure.

53 Thank You..!!!