Structured P2P Overlays


Structured P2P Overlays
Project Group "A Distributed Framework for Social Networks", www.p2pframework.com
UPB SS2011, PG-Framework, Lecture 03: Structured-Overlays-Pastry.ppt
Dr.-Ing. Kalman Graffi, Email: graffi@mail.upb.de
Fachgruppe Theorie verteilter Systeme, Fakultät für Elektrotechnik, Informatik und Mathematik, Universität Paderborn
Fürstenallee 11, D-33102 Paderborn, Germany
Tel. +49 5251 606730, Fax +49 5251 606697
http://www.cs.uni-paderborn.de/fachgebiete/fg-ti.html
This slide set is based on the lecture "Communication Networks 2" of Prof. Dr.-Ing. Ralf Steinmetz at TU Darmstadt.

Some relevant books
- Monitoring and Management of Peer-to-Peer Systems, Kalman Graffi, http://tuprints.ulb.tu-darmstadt.de/2248/
- Handbook of Peer-to-Peer Networking, Xuemin Shen, Heather Yu, John Buford, Mursalin Akon
- Peer-to-Peer Systems and Applications, Ralf Steinmetz, Klaus Wehrle (editors), www.springerlink.com/content/g6h805426g7t
[Figure: DHT ring with nodes 611, 709, 1008, 1622, 2011, 2207, 2906, 3485 and addresses such as 12.5.7.31, berkeley.edu, planet-lab.org, peer-to-peer.info, 61.51.166.150, 95.7.6.10, 86.8.10.18, 7.31.10.25, resolving the query H("my data") = 3107]

Overview
1 Wrap-up: Nine Properties of Peer-to-Peer
2 Structured Overlay Networks
2.1 Major Query Types
3 Fundamentals of Hash Tables
3.1 Recall: Hash Function & Hash Table
3.2 Distributed Hash Table: Steps of Operation
3.3 Step 1: Addressing in Distributed Hash Tables
3.4 Step 2: Association of Address Space with Nodes
3.5 Step 3: Locating a Data Item
3.6 Step 4: Routing to a Data Item
3.7 Step 5: Data Retrieval and Usage of the Located Resource
3.8 Step 6: Where Is the Data Located?
3.9 Distributed Hash Table: Inserting and Deleting a Node
3.10 Properties and Components of DHTs
4 Chord
5 Pastry, FreePastry and PAST
5.1 Pastry Routing Table
5.2 Joining the Network
5.3 Key-based Routing Interface
5.4 FreePastry

1 Wrap-up: Nine Properties of Peer-to-Peer
1. Relevant resources are located at nodes ("peers") at the edges of a network
2. Peers share their resources
3. Resource locations are widely distributed and most often largely replicated
4. Variable connectivity is the norm
5. Combined client and server functionality
6. Direct interaction (provision of services, e.g. file transfer) between peers ("peer to peer")
7. Peers have significant autonomy and mostly similar rights
8. No central control or centralized usage/provisioning of a service
9. Self-organizing system

Success of P2P Networking
One of the newest buzzwords in networking is Peer-to-Peer (P2P). Is it only a hype?
- 40 million Napster users within the first 2 years
- integrated into commercial systems, e.g. the Microsoft P2P SDK and the Advanced Networking Pack for Windows XP
- open source, e.g. JXTA (Sun) with protocols & services
- strong presence at international networking conferences

P2P Traffic
- P2P traffic has been the major traffic source since at least 2003: more than ~50% of overall Internet traffic is P2P traffic
- P2P file sharing accounts for roughly 60%-80% of the traffic on backbones
- P2P generates most traffic in all regions
Source: http://www.ipoque.com/resources/internet-studies

Ipoque Internet Study 2008/2009
http://www.ipoque.com/resources/internet-studies/internet-study-2008_2009

Enabling Effects
File sharing: highly attractive and cheap content
- users share their content with other users
- attractive content: copyrights are usually not respected (a problem!)
- cheap content
Publishing: exploding amount of data
- 2 x 10^18 bytes are produced per year
- 3 x 10^12 bytes are published per year
- only 1.3 x 10^8 websites are indexed by search engines like Google
- see Gong: "JXTA: A Network Programming Environment", IEEE Internet Computing, 2001
Unused resources at the edges
- more processing power, memory, bandwidth and storage are available
- a 1 TB hard disk for letters? 100 Mbit/s for sending emails?
- new compression mechanisms (MP3, MPEG) are no problem for current CPUs
- assume e.g. a small or medium enterprise (SME) with 100 desktop computers:
  spare storage space: 100 x 1 TB = 100 TB
  spare processing power: 100 x 2 x 2 GHz x 5 ops/cycle = 2 trillion ops/sec

Client/Server Model vs. P2P Technology
Situation: 1 server, n clients
Issue: e.g., on which server is the wanted information located?
Solution: look it up on another server (or on Google, which does this for you)
Advantages: reliable, well-known behavior
Drawbacks: the server needs to provide (almost) all resources
The client/server model is not P2P: communication happens only between clients and the server, never between clients.

GRID Computing vs. P2P Technology
Similar idea, similar concept as in P2P: high-performance data processing centers are needed for scientific applications, but they are expensive to provide and often do not offer enough performance.
Solution: a GRID interconnects the existing data processing centers into a Virtual Organization and operates them as one distributed processing center (the GRID).
www.gridforum.org, www.rechenkraft.net, www.ggf.org

                         P2P                                    GRID
History                  Sharing MP3 files & illegal content    Saving costs for data processing centers
Participation            Voluntary                              By contract
Typical transfer volume  Small (MP3) to medium (video)          Huge (often terabytes)
Typical service          File sharing                           Processing sharing
Typical problems         Huge number of users causes            Transferring huge amounts of data
                         scalability issues

Cloud Computing vs. P2P Technology
Cloud and P2P have in common: access to a distributed pool of resources (storage, bandwidth, computational power).
Cloud computing:
- resource providers: companies
- controlled environment: no malicious providers, no (or minimal) churn, homogeneous devices
- selective centralized structures are OK: accounting, monitoring, a single access point, centralized updates
P2P systems:
- resource providers: user devices
- uncontrolled environment: churn, malicious providers, heterogeneous devices
- uncertainty / unpredictability
- distributed access points

P2P in the Business World
New services at the edge of the network: P2P overlay networks make it relatively easy to deploy new services.
Group collaboration: superior for business processes, which
- grow organically and are non-uniform and highly dynamic
- involve largely manual, ad-hoc, iterative and document-intensive work
- are often distributed, not centralized
- are understood by no single person/organization from beginning to end
Cost effectiveness:
- reduces centralized management resources
- optimizes computing, storage and communication resources
- rapid deployment
P2P applications/protocols are tailored to users' needs: Napster's success depended to a great extent on its ease of use.

2 Structured Overlay Networks
Unstructured P2P:
- Centralized P2P: (1) all features of peer-to-peer included; (2) a central entity is necessary to provide the service; (3) the central entity is some kind of index/group database. Example: Napster
- Pure P2P: (1) all features of peer-to-peer included; (2) any terminal entity can be removed without loss of functionality; (3) no central entities. Examples: Gnutella 0.4, Freenet
- Hybrid P2P: (1) all features of peer-to-peer included; (2) any terminal entity can be removed without loss of functionality; (3) dynamic central entities. Examples: Gnutella 0.6, FastTrack, eDonkey
Structured P2P:
- DHT-based: (1) all features of peer-to-peer included; (2) any terminal entity can be removed without loss of functionality; (3) no central entities; (4) connections in the overlay are fixed. Examples: Chord, CAN, Kademlia
- Hybrid P2P: (1) all features of peer-to-peer included; (2) peers are organized in a hierarchical manner; (3) any terminal entity can be removed without loss of functionality. Examples: RecNet, Globase.KOM
From R. Schollmeier and J. Eberspächer, TU München

P2P Overlay Networks: What is Structure?
Structure is given by two mapping functions into a common identifier space:
- mapping function F_R: R → I (resources to identifiers)
- mapping function F_P: P → I (peers to identifiers)
Legend: I := identifier space, P := set of peers, R := resources
Source: Karl Aberer, "The essence of P2P: A reference architecture for overlay networks"

Review of Principles: Unstructured Overlay Networks
Unstructured overlay networks:
- the location of a resource is known ONLY to its submitter
- peers & resources have NO SPECIAL identifiers
- each peer is responsible ONLY for the resources it submitted
- new resources are introduced at an arbitrary location
- main task: searching, i.e. finding all peers that store / are in charge of resources fitting some criteria, and afterwards communicating directly peer-to-peer with the identified peers

Principles: Structured Overlay Networks
Structured overlay networks (in contrast to the unstructured ones recapped on the previous slide):
- the location of a resource is NOT only known to its submitter
- each peer may well be responsible for resources it has NOT submitted
- new resources are introduced at a SPECIFIC location:
  - peers and resources are given (unique) identifiers
  - PeerIDs and ObjectIDs (ResourceIDs) are drawn from the SAME key set
  - each peer is responsible for a specific range of ObjectIDs (i.e. ResourceIDs)
- challenge: finding the peer(s) with a specific ID in the overlay (lookup), i.e. routing queries across the overlay network to the peer with a specific ID; no search is needed anymore

2.1 Major Query Types
Lookup: key-value lookup as known from hash tables; objects are looked up by addressing them by their unique name (cf. URLs on the web). Given: a key. Returns: a single value.
Full-text search: objects are found by searching with keywords that match an object's description. Given: a sequence of words (the search term). Returns: all entries/articles matching the search terms.
Range query: Given: a range [X, Y]. Returns: all stored entries within the range. Example: location-based search.
Matching query: Given: a logical condition, e.g. (A && B) || C. Returns: all entries fulfilling the logical condition.

Motivation for Distributed Indexing: Communication Overhead vs. Node State
- Flooding: communication overhead O(N), node state O(1); bottlenecks: communication overhead, false negatives
- Central server: communication overhead O(1), node state O(N); bottlenecks: memory, CPU, network, availability
- Distributed hash table: communication overhead O(log N), node state O(log N)
  - scalability: O(log N)
  - no false negatives, i.e. never answers "yes" if the item is not there
  - more resistant against changes: failures, attacks, short-time users
The content of this slide has been adapted from Peer-to-Peer Systems and Applications, ed. by Steinmetz, Wehrle

Mode of Operation of a Distributed Hash Table
Applications access the DHT through two operations: publish/insert(key, data) and get/lookup(key) → data.
- every object/resource has a (hash) key and is stored at the node responsible for that key
- every node stores and maintains a part of the hash table
- lookup(key) returns the responsible node or the data directly, e.g. lookup(ObjID 0x0E) tells where the object is stored and how it is identified there
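To make this interface concrete, here is a minimal sketch of my own (not from the slides; the class and method names are assumptions): seen from the application, a DHT behaves like a map whose buckets live on remote nodes.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal sketch (not from the slides): the application-facing DHT interface.
 *  A real implementation would route to a remote node; here one local map
 *  stands in for the node that is responsible for the key. */
public class DhtNode {
    private final Map<String, byte[]> localBucket = new HashMap<>();

    /** publish/insert(key, data): store data at the node responsible for key. */
    public void put(String key, byte[] data) {
        localBucket.put(key, data); // real DHT: route(key) first, then store there
    }

    /** get/lookup(key): return the data (or a pointer to it), or null. */
    public byte[] get(String key) {
        return localBucket.get(key); // real DHT: route(key), then fetch remotely
    }
}
```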

3 Fundamentals of Hash Tables
Challenges in designing DHTs:
1. Desired characteristics: flexibility, reliability, scalability
2. Equal distribution of content among nodes: crucial for efficient lookup of content
3. Permanent adaptation to faults, joins and leaves of nodes: assignment of responsibilities to new nodes; re-assignment and re-distribution of responsibilities in case of node failure or departure
The content of this slide has been adapted from Peer-to-Peer Systems and Applications, ed. by Steinmetz, Wehrle

3.1 Recall: Hash Function & Hash Table
Hash function: H(x) maps a large input domain onto a smaller target domain/range (most often a subset of the integers) such that few collisions occur, i.e. most inputs can be uniquely identified by their hash.
Hash table: a data structure that provides fast lookup of a record indexed by a key, where the domain of the key is too large for simple indexing (as would occur if an array were used). Like arrays, hash tables can provide O(1) lookup with respect to the number of records in the table.
Question:
- IF H(x) ≠ H(y), THEN x ≠ y? (Yes, this implication always holds.)
- IF H(x) = H(y), THEN x = y? (Only most probably, since collisions can occur.)

Recall: Hash Tables & Hash Functions
Hash tables are a well-known data structure: a fixed-size array whose elements are also called hash buckets.
Properties: allow insertions, deletions and finding an entry in constant (average) time.
Hash functions map keys onto positions in the array. Properties of good hash functions:
- fast to compute
- good distribution of keys over the hash table
Example: the SHA-1 algorithm (SHA = Secure Hash Algorithm)

Hash Tables: An Example
Hash function: hash(x) = x mod 10
Example: insert the numbers 0, 1, 4, 9, 16 and 25.
Resulting table (bucket → value): 0 → 0, 1 → 1, 4 → 4, 5 → 25, 6 → 16, 9 → 9; buckets 2, 3, 7 and 8 stay empty.
Property: it is easy to find out whether a given key is present in the table.
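The same example as runnable code, as a sketch of my own (not from the slides); collisions are resolved by chaining, anticipating the remark on the next slide:

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the slide's example: a fixed-size table with hash(x) = x mod 10. */
public class SimpleHashTable {
    private final List<List<Integer>> buckets = new ArrayList<>();

    public SimpleHashTable(int size) {
        for (int i = 0; i < size; i++) buckets.add(new ArrayList<>());
    }

    private int hash(int x) { return Math.floorMod(x, buckets.size()); }

    public void insert(int x) { buckets.get(hash(x)).add(x); }

    public boolean contains(int x) { return buckets.get(hash(x)).contains(x); }

    public static void main(String[] args) {
        SimpleHashTable table = new SimpleHashTable(10);
        for (int x : new int[]{0, 1, 4, 9, 16, 25}) table.insert(x);
        System.out.println(table.contains(16)); // true: found in bucket 6
        System.out.println(table.contains(7));  // false: bucket 7 is empty
    }
}
```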

Hash Tables: An Example (continued)
Drawbacks of the naive example when applied to peers:
- collisions are likely to happen, i.e. additional processing is needed (e.g. an additional list per bucket); use an appropriate hash function
- the time to search grows linearly with the number of peers
- inserting and removing a peer also scales linearly: the hash function must be adapted to the number of available peers, which is extremely time consuming
Hence an appropriate Distributed Hash Table (DHT) assigns a contiguous RANGE of the input space to each peer (instead of individual numbers).

Application-specific Object IDs
- Content-based hash: Hash(file content) = ObjectID; exact identification of files possible
- Locality-sensitive hashing, like SimHash [Charikar, 2002]: a family of functions that hash similar inputs to the same buckets with high probability; thus similar input vectors get similar hashes, unlike e.g. MD5 or SHA
- Geographical position: ObjectID = X position | Y position | random identifier; allows range queries in a multidimensional space
- Logical structure: Hash("Username:Alice Plugin:Profile Content:All") = ObjectID; allows constructing / looking up simple keys based on the application
- Distributed linked lists: distributed data structures consisting of pointer objects and payload items

3.2 Distributed Hash Table: Steps of Operation
Sequence of operations:
At the beginning, mapping of nodes and data into the same address space:
- peers and content are addressed using flat identifiers (IDs)
- common address space for data and nodes
- nodes are responsible for data in certain parts of the address space
- the association of data to nodes may change, since nodes may disappear
Later, storing / looking up data in the DHT:
- lookup for data = routing to the responsible node
- the responsible node is not necessarily known in advance
- deterministic statement about the availability of data
The content of this slide has been adapted from Peer-to-Peer Systems and Applications, ed. by Steinmetz, Wehrle

3.3 Step 1: Addressing in Distributed Hash Tables
- mapping of content/nodes into a linear space, usually 0, ..., 2^m - 1, with 2^m >> number of objects to be stored
- mapping of data and nodes onto the same address space (e.g. 0 to 2^m - 1) with a hash function, e.g. Hash(string) mod 2^m: H("my data") → 2313
- association of parts of the address space to DHT nodes
- often, the address space is viewed as a circle
[Figure: ring with node ranges 611-709, 710-1621, 1622-2010, 2011-2206, 2207-2905, 2906-3484 and 3485-610 (wrapping around); H(Node X) = 2906, H(Node Y) = 3485, data item D with H(D) = 3107]
The content of this slide has been adapted from Peer-to-Peer Systems and Applications, ed. by Steinmetz, Wehrle
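A small sketch of my own (not from the slides) of this addressing step; m = 12 is an assumption chosen only so that the IDs resemble the slide's four-digit examples:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

/** Sketch: hashing arbitrary names onto the DHT ring 0 .. 2^m - 1. */
public class RingAddressing {
    static final int M = 12; // address space size 2^12 = 4096 (assumption)

    static BigInteger ringId(String name) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(name.getBytes(StandardCharsets.UTF_8));
        // interpret the 160-bit digest as a non-negative integer, reduce mod 2^m
        return new BigInteger(1, digest).mod(BigInteger.ONE.shiftLeft(M));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(ringId("my data"));      // some ID in 0..4095
        System.out.println(ringId("134.2.11.68"));  // node IDs are derived the same way
    }
}
```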

3.4 Step 2: Association of Address Space with Nodes
- arrangement of the range of values: each node is responsible for a part of the value range, often with redundancy (overlapping parts)
- continuous adaptation
- real (underlay) and logical (overlay) topology are (mostly) uncorrelated
Example: in a Chord DHT, node 3485 is responsible for the data items in the range 2907 to 3485.
[Figure: logical view of the distributed hash table with nodes 611, 709, 1008, 1622, 2011, 2207, 2906, 3485, and its mapping onto the real topology]
The content of this slide has been adapted from Peer-to-Peer Systems and Applications, ed. by Steinmetz, Wehrle

3.5 Step 3: Locating a Data Item
Locating the data: content-based routing.
- goal: small and scalable effort
- O(1) with a centralized hash table, but managing a centralized hash table is too costly (server)
- minimal overhead with distributed hash tables:
  - O(log N) DHT hops to locate an object
  - O(log N) keys and routing information per node (N = number of nodes)
The content of this slide has been adapted from Peer-to-Peer Systems and Applications, ed. by Steinmetz, Wehrle

3.6 Step 4: Routing to a Data Item
Routing to a key/value pair:
- start the lookup at an arbitrary node of the DHT
- route to the requested data item (key), e.g. key = H("my data") = 3107
- node 3485, which manages the keys 2907-3485, answers the initial node with (3107, (ip, port))
- the value is a pointer to the location of the data OR the (small) data item itself
The content of this slide has been adapted from Peer-to-Peer Systems and Applications, ed. by Steinmetz, Wehrle

3.7 Step 5: Data Retrieval and Usage of the Located Resource
Accessing the content:
- the key/value pair is delivered to the requester
- the requester analyzes the key/value tuple (and, in case of indirect storage, downloads the data from its actual location)
Example: node 3485 sends (3107, (ip, port)) to the requester for H("my data") = 3107; knowing the actual location, the requester then calls Get_Data(ip, port) to fetch the data itself.
The content of this slide has been adapted from Peer-to-Peer Systems and Applications, ed. by Steinmetz, Wehrle

3.8 Step 6: Where Is the Data Located?
Association of data with IDs, for a data item D with H_SHA-1(D) = 3107:
- direct storage: the data item D itself is stored at the responsible node
- indirect storage: the responsible node only stores a reference, e.g. "Item D: 134.2.11.68", while D itself stays on the host 134.2.11.68
Both variants are detailed on the next two slides.
The content of this slide has been adapted from Peer-to-Peer Systems and Applications, ed. by Steinmetz, Wehrle

Association of Data with IDs: Direct Storage
How is content stored at the nodes? Example: H("my data") = 3107 is mapped onto the DHT address space.
Direct storage: the content is stored at the node responsible for H("my data") and is transferred at publication time.
- INFLEXIBLE for large amounts of content; only OK for small items (< 2 KB)
- not recommended for most applications; suitable for social information?
The content of this slide has been adapted from Peer-to-Peer Systems and Applications, ed. by Steinmetz, Wehrle

Association of Data with IDs: Indirect Storage
Indirect storage: nodes in the DHT store tuples of the form (key, value).
- key = Hash("my data") → 2313
- the value is often the real storage address of the content: (IP, port) = (134.2.11.140, 4711)
- MORE FLEXIBLE, but one additional step is needed to reach the content
The content of this slide has been adapted from Peer-to-Peer Systems and Applications, ed. by Steinmetz, Wehrle

3.9 Distributed Hash Table: Inserting and Deleting a Node
Join of a new node:
1. calculation of the node ID
2. the new node contacts the DHT via an arbitrary node
3. assignment of a particular hash range
4. copying of the key/value pairs of that hash range (usually with redundancy)
5. binding into the routing environment
[Figure: a node with IP 134.2.11.68 and new ID 3485 joins the ring]
The content of this slide has been adapted from Peer-to-Peer Systems and Applications, ed. by Steinmetz, Wehrle

Node Failure and Node Departure
Failure of a node:
- use of redundant key/value pairs (if a node fails)
- use of redundant / alternative routing paths
- a key/value pair is usually still retrievable if at least one copy remains
Departure of a node:
- partitioning of its hash range to neighbor nodes
- copying of key/value pairs to the corresponding nodes
- unbinding from the routing environment
Research challenges:
- the constant replacement of key/value pairs induces costs
- replication of key/value pairs is needed to avoid data loss
- the heterogeneity of peer capacities is often ignored

3.10 Properties and Components of DHTs
Properties:
- hash buckets are distributed over nodes
- nodes form an overlay network
- messages are routed in the overlay to find the responsible node
- the routing and ID labeling scheme in the overlay network is what differs between DHTs
DHT behavior and usage:
- a node knows an object/resource name and wants to find it (unique and known object/resource names are assumed)
- the node routes a message in the overlay to the responsible node
- the responsible node replies with the object/resource
- the semantics of the object/resource are application-defined

Core Components of Distributed Hash Tables
Hash table:
- uniform distribution
- shifted view for each node (adding a node-related offset)
Mapping function:
- node IDs and item keys share the same key space
- rules for associating keys with particular nodes
Routing tables:
- per-node routing tables that refer to other nodes
- rules for updating the tables as nodes join and leave/fail
Routing algorithms (operations on keys; see the sketch below):
- XOR-based (e.g. Kademlia)
- shift operations (e.g. D2B)
- distance-based (e.g. Chord)
- prefix-based (e.g. Pastry)
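A sketch of my own illustrating how the listed key operations differ (assumptions: 12-bit IDs and bit-level prefixes; real systems use 160-bit IDs and, in Pastry, digit-level prefixes):

```java
/** Sketch: three of the key-distance notions named above, on small m-bit IDs. */
public class KeyDistances {
    static final int M = 12;            // m-bit identifier space, 0 .. 2^m - 1
    static final int SPACE = 1 << M;

    // XOR-based (Kademlia): symmetric distance d(x, y) = x XOR y
    static int xorDistance(int x, int y) { return x ^ y; }

    // Distance-based (Chord): clockwise ring distance d(x, y) = (y - x) mod 2^m
    static int ringDistance(int x, int y) { return Math.floorMod(y - x, SPACE); }

    // Prefix-based (Pastry-style): length of the shared prefix, here in bits
    static int sharedPrefixBits(int x, int y) {
        int diff = x ^ y;
        return diff == 0 ? M : M - (32 - Integer.numberOfLeadingZeros(diff));
    }

    public static void main(String[] args) {
        System.out.println(xorDistance(3107, 3485));      // 446
        System.out.println(ringDistance(3107, 3485));     // 378 (clockwise)
        System.out.println(ringDistance(3485, 3107));     // 3718: asymmetric metric
        System.out.println(sharedPrefixBits(3107, 3485)); // 3: both start with bits 110
    }
}
```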

4 Chord
- Chord uses the secure hash algorithm SHA-1 as its hash function, resulting in a 160-bit object/node identifier
- the same hash function is used for objects and nodes: the node ID is hashed from e.g. the IP address, the object ID from the object name (object names are assumed to be known)
- Chord is organized as a ring which wraps around; nodes keep track of their predecessor and successor (system invariant for valid network operation)
- a node is responsible for the objects between its predecessor and itself
- fingers are used to enable efficient content addressing: O(log N) fingers lead to lookup operations of O(log N) length
Reference: "Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications" (2001) by Ion Stoica et al.

Chord: Network Topology
Basic ring topology with successor/predecessor links:
- circular key space; each node links to its ring successor
- peers are responsible for their own ID and for all IDs back to (but excluding) their predecessor
- SHA-1 (secure hash algorithm) maps IP addresses / object names onto 160-bit IDs
[Figure: ring with peers 611, 659, 709, 1008, 1622, 2011, 2207, 2682, 2906, 3485 and their ranges, e.g. 1622 is responsible for 1009-1622, 3485 for 2907-3485, and 611 for 3486 through 611 (wrapping around)]

Chord: Network Topology (continued)
Enhanced topology: fingers point to peers whose IDs increase exponentially; here 709 + 2^k = 710, 711, 713, ..., 965, 1221, 1733, 2757.
- the k-th finger of peer n is a shortcut pointing to the peer responsible for ObjectID (n + 2^k)
- k ranges from 0 to log(N)
- O(log N) fingers lead to lookup operations of O(log N) length
Finger table of peer 709:
Finger k   ObjectID              Responsible peer
0          709 + 1    = 710      1008
1          709 + 2    = 711      1008
2          709 + 4    = 713      1008
...
8          709 + 256  = 965      1008
9          709 + 512  = 1221     1622
10         709 + 1024 = 1733     2011
11         709 + 2048 = 2757     2906
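The finger targets can be computed directly, as in this sketch of my own (not from the slides):

```java
/** Sketch: computing the finger targets of a Chord peer.
 *  Finger k of peer n points to the peer responsible for (n + 2^k) mod 2^m,
 *  matching the table above for n = 709 and (assumed) m = 12. */
public class ChordFingers {
    public static void main(String[] args) {
        int m = 12;                  // identifier space 0 .. 2^m - 1
        int n = 709;                 // the peer whose fingers we compute
        for (int k = 0; k < m; k++) {
            int target = (n + (1 << k)) % (1 << m);
            System.out.printf("finger %2d -> ObjectID %4d%n", k, target);
        }
        // The peer stored in each entry is the first peer whose range covers
        // the target ID (e.g. 709 + 512 = 1221 is covered by peer 1622).
    }
}
```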

Chord: Join Procedure (1)
Request to join the Chord ring, for a new peer with ID 1289:
1. contact an arbitrary member of the ring
2. the join query is routed in the ring towards the position of the new ID
3. the new peer's successor is provided

Chord: Join Procedure (2)
The new peer N_Z (ID 1289) joins between predecessor N_Y (ID 1008) and successor N_X (ID 1622):
1. set the successor; 2. redistribute the indexing information (e.g. keys 1009-1289 move from N_X to N_Z)
Steps 1 & 2, notify successor: N_Z sets Successor(N_X) and notifies N_X; N_X sets its predecessor and copies the items (index) of the range to N_Z
3. update the successor pointer of the predecessor; 4. build fingers
Steps 3 & 4, stabilize: N_Y asks N_X for its predecessor, sets Successor(N_Z) and notifies N_Z; N_Z sets Predecessor(N_Y); N_X clears the moved items; all peers fix their fingers
Fingers of peer n point to the peers responsible for ObjectID n + 2^k; thus log(N) fingers are built.

Chord: Addressing Content
- a query contains the hash value of the queried content
- with each step the distance to the destination is halved (remember the fingers); without fingers there are no shortcuts and the query walks the circle
Example: node 1008 queries item 3000 and uses its fingers to locate the destination faster:
1. node 1008 forwards to the peer responsible for 1008 + 1024, i.e. node 2207
2. node 2207 forwards to the peer responsible for 2207 + 512, i.e. node 2906
3. node 2906 forwards to node 3485, which is responsible for 3000: responsible peer found
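The lookup walk can be sketched as follows (my own illustration, not the Chord implementation; for simplicity the fingers are derived on the fly from a global peer list instead of per-node state). Run on the slide's ring, it prints exactly the three hops above:

```java
import java.util.TreeSet;

/** Sketch: iterative Chord lookup; each hop follows the largest finger that
 *  does not overshoot the key, so the remaining distance roughly halves. */
public class ChordLookup {
    static final int M = 12, SPACE = 1 << M;
    static final TreeSet<Integer> peers = new TreeSet<>();

    // peer responsible for an ID: the first peer at or after it (wrapping around)
    static int responsible(int id) {
        Integer p = peers.ceiling(Math.floorMod(id, SPACE));
        return p != null ? p : peers.first();
    }

    // immediate ring successor of a peer
    static int ringSuccessor(int n) {
        Integer p = peers.higher(n);
        return p != null ? p : peers.first();
    }

    // clockwise distance on the ring
    static int dist(int from, int to) { return Math.floorMod(to - from, SPACE); }

    public static void main(String[] args) {
        for (int id : new int[]{611, 659, 709, 1008, 1622, 2011, 2207, 2682, 2906, 3485})
            peers.add(id);
        int key = 3000, current = 1008;            // the slide's example
        while (responsible(key) != current) {
            int next = ringSuccessor(current);     // fallback: one ring step
            for (int k = M - 1; k >= 0; k--) {     // try the largest finger first
                int finger = responsible(current + (1 << k));
                if (dist(current, finger) <= dist(current, key)) { next = finger; break; }
            }
            System.out.println(current + " -> " + next);
            current = next;
        }
        System.out.println("responsible peer: " + current);
    }
}
```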

Properties of Chord
Advantages:
- efficient lookup functionality: messages are routed within O(log N) steps
- low maintenance overhead
- easy to implement; intuitive concept due to the ring structure
Disadvantages:
- not churn resistant: the Chord ring is likely to fail; insufficient stabilization mechanism
- no support for heterogeneity: all peers are treated equally, so overloading of peers may happen
- no built-in security mechanisms: sensitive to malicious nodes

5 Pastry, FreePastry and PAST
Pastry: a P2P overlay based on the DHT by Plaxton et al. with prefix-based routing; developed by Microsoft and Rice University
FreePastry: a prototypical implementation of Pastry; the one most used by the scientific community; comes with a set of extensions
PAST: a replication layer on top of FreePastry; write once, read many
Prefix routing:
- equivalent to postfix routing; similar to DNS
- both peer IDs and object IDs are hashed and expressed as 32 hexadecimal digits; the base is important, here: 16
- besides the (prefix) routing table there are the leaf set (closest neighbors in the ID space) and the neighborhood set (locally close nodes)

5.1 Pastry Routing Table
Example: (partial) routing mesh for a single node with ID 4227; neighbors on higher levels match more digits.
Routing table:
- each node has a neighbor map with multiple levels; each level represents a matching prefix up to a digit position in the ID
- a given level has a number of entries equal to the base of the ID
- the i-th entry in the j-th level is the closest node whose ID starts with prefix(n, j-1) + i
- example: the 3rd entry of the 2nd level for node 4227 is the closest node whose ID begins with 43
- the row/column arithmetic is sketched below
[Routing table of 4227: some rows missing; table size 32 x 16; normally more (most) entries are filled]
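A sketch of my own of that row/column arithmetic (the helper is an assumption for illustration, not the FreePastry API):

```java
/** Sketch: locating the routing-table cell Pastry would consult for a key,
 *  given IDs as hex strings of equal length. */
public class PastryTableCell {
    // length of the shared hex-digit prefix of two IDs
    static int sharedPrefix(String a, String b) {
        int i = 0;
        while (i < a.length() && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    public static void main(String[] args) {
        String node = "4227", key = "42AD";
        int row = sharedPrefix(node, key);               // level: matched prefix length
        int col = Character.digit(key.charAt(row), 16);  // entry: the key's next digit
        // Node 4227 routes 42AD via row 2, column 0xA: the closest known
        // node whose ID starts with "42A".
        System.out.println("row " + row + ", column " + col); // row 2, column 10
    }
}
```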

Pastry, FreePastry and PAST
- ID space: [0, 2^128); IDs are randomly assigned while joining; base b (4 or 16)
- routing table: used for prefix-based routing; typical size: log_{2^b}(N) rows with 2^b - 1 entries per row; row number i contains only node IDs sharing a prefix of length i with the current node
- leaf set: the |L| closest node IDs; typical size |L| = 2^b or 2 x 2^b
- neighborhood set: |M| entries (typically |M| = 2 x 2^b); contains the node IDs and IP addresses of the locally closest nodes
[Figure: routing state of node 10233102, base 4]

Routing Example
Routing:
- route a message from 5230 to 42AD
- always route to a node closer to the target: at the n-th hop, look at the (n+1)-th level of the neighbor map, i.e. always one digit more
- (not all nodes and links are shown)
Object responsibility:
- a node is responsible for the objects that have the same ID as the node
- since it is unlikely to find such a node for every object, a node is also responsible for nearby objects (its responsibility area)

Routing Protocol
A message for key K arrives at node X; let X = 10233102, b = 4. The rules (sketched in code below):
1. Check whether K is in the scope of the leaf set. E.g. K = 10233030: direct forwarding to 10233033.
2. If not (1), use the routing table. Let l := the length of the shared prefix of K and X. E.g. K = 10320102, l = 2: check level 3, i.e. the entries with prefix 103, and forward to 10-3-23302.
3. If neither (1) applies nor a matching routing table entry exists: e.g. K = 10233300: pick the closest peer from the routing table, here 10233-2-32, as it is closer to K than 10233102.
4. If X is closer to K than any node in the leaf set (and routing table), X is responsible for K and routing ends.
[Routing state of node 10233102, base 4]
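A compact sketch of my own of these forwarding rules (assumptions: IDs as 8-digit hex strings, numeric closeness, toy leaf set and routing table; not FreePastry code):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class PastryForwarding {
    static long num(String id) { return Long.parseLong(id, 16); }

    static int sharedPrefix(String a, String b) {
        int i = 0;
        while (i < a.length() && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    /** Next hop for key k at node self; null means self is responsible. */
    static String nextHop(String self, String k, TreeSet<String> leafSet,
                          Map<Integer, Map<Character, String>> routingTable) {
        // Rule 1: k lies within the leaf set's range -> numerically closest leaf
        if (num(k) >= num(leafSet.first()) && num(k) <= num(leafSet.last()))
            return leafSet.stream()
                    .min(Comparator.comparingLong(n -> Math.abs(num(n) - num(k))))
                    .orElse(null);
        // Rule 2: routing table, row = shared prefix length, column = next digit
        int l = sharedPrefix(self, k);
        String entry = routingTable.getOrDefault(l, Map.of()).get(k.charAt(l));
        if (entry != null) return entry;
        // Rules 3/4 (fallback): any known node closer to k than self, else self
        String best = leafSet.stream()
                .min(Comparator.comparingLong(n -> Math.abs(num(n) - num(k))))
                .orElse(self);
        return Math.abs(num(best) - num(k)) < Math.abs(num(self) - num(k)) ? best : null;
    }

    public static void main(String[] args) {
        TreeSet<String> leaves = new TreeSet<>(List.of("10233001", "10233033", "10233120"));
        var table = Map.of(2, Map.of('3', "10323302"));
        String self = "10233102";
        System.out.println(nextHop(self, "10233030", leaves, table)); // 10233033 (rule 1)
        System.out.println(nextHop(self, "10320102", leaves, table)); // 10323302 (rule 2)
    }
}
```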

5.2 Joining the Network
New node: Y (node ID 6); existing nodes: H (ID 1), B (ID 2), C (ID 3), W (ID 4)
Protocol:
- let H be the locally closest node known to Y
- Y sends a join-request message addressing ID 6 to node H
- H forwards the message along the path to the responsible node W (ID 4)
- all nodes on the path answer with their own state information to node Y
Y's resulting state:
- neighborhood set: taken from H
- leaf set: taken from W
- routing table: row i is taken from node i on the path (row 1 from H, row 2 from B, ...)

Joining the Network (continued)
New node: Y (node ID 6); existing nodes: H (ID 1), B (ID 2), C (ID 3), W (ID 4)
Protocol:
- Y's final state information is sent to every node in Y's routing table, leaf set and neighborhood set
- all nodes update their tables
- Y checks the locality of the nodes in its routing table and leaf set and updates its neighborhood set

Robustness against Failures
Coping with departures and failures: nodes may leave unexpectedly (fail).
Detection: periodic checks of table entries via keep-alive messages; if a node does not answer, it is considered failed.
- failure in the leaf set: update the entry using the leaf set of the furthest leaf-set node
- failure in the routing table: ask the nodes in the same row as the failed node; if all nodes in that row have failed, ask nodes in a higher row

5.3 Key-based Routing Interface
"Towards a Common API for Structured Peer-to-Peer Overlays", Dabek, Zhao, Druschel (Pastry), Kubiatowicz, Stoica (Chord). This common interface allows exchanging the underlying DHT!
Notation: the paper marks parameters as read-only or read-and-write.
Routing API:
- void route(key K, msg M, nodehandle hint): routes M towards the node responsible for K; K or hint may be NULL
- void forward(key K, msg M, nodehandle nextHopNode): upcall at the receiving node, which may read all parameters
- deliver(key K, msg M): delivers the message to the receiving application
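Transliterated into Java as a sketch (the paper states the signatures language-neutrally, so the Java types here are assumptions):

```java
/** Sketch: the Common API routing signatures above, rendered as a Java interface. */
public interface KeyBasedRouting {
    // Opaque types as used in the API's pseudocode signatures.
    interface Key {}
    interface Msg {}
    interface NodeHandle {}

    // route: send message M towards the node responsible for key K;
    // hint, if non-null, suggests a first hop (K or hint may be null).
    void route(Key k, Msg m, NodeHandle hint);

    /** Upcalls the overlay invokes on the application. */
    interface Application {
        // forward: called at each node the message passes; the application
        // may inspect the parameters before forwarding continues.
        void forward(Key k, Msg m, NodeHandle nextHopNode);

        // deliver: called at the responsible node to hand M over.
        void deliver(Key k, Msg m);
    }
}
```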

KBR: Routing State Access
- nodehandle[] local_lookup(key K, int num, boolean safe): returns a list of up to num possible next hops for routing towards key K
- nodehandle[] neighborSet(int num): returns an unordered list of num peers from the neighborhood list
- nodehandle[] replicaSet(key k, int max_rank): returns an ordered set of at most max_rank peers on which replicas of the object with key k can be stored; these are the nodes that become roots for key k when the local node fails
- update(nodehandle n, bool joined): upcall informing the application that node n has either joined or left the local neighbor set
- boolean range(nodehandle N, rank r, key lkey, key rkey): provides information about the (inclusive) range of keys for which node N is responsible; returns false if the range could not be determined, true otherwise; can only be used for nodes in the neighbor set; lkey and rkey are modified by the method

5.4 FreePastry
- prototypical implementation of Pastry
- current version 2.1, released on 13.3.2009; Java-based (Sun JDK version 1.5.0)
- NodeID: 160 bits = 20 bytes, i.e. 40 hexadecimal digits
Replication:
- applications like DHTs use replication to ensure that stored data survives node failures
- to replicate a newly received key K r times, the application calls replicaSet(K, r) and sends a copy of the key to each returned node
- if the implementation cannot return r suitable neighbors, the application itself is responsible for determining replica locations
More details: FreePastry documentation; replication: PAST (see seminar talk)
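The replication recipe above as a hypothetical sketch (the Node interface and its store method are my assumptions in the spirit of the KBR API, not actual FreePastry classes):

```java
/** Sketch: replicating a newly received (key, value) r times via replicaSet. */
public class ReplicationSketch {
    interface Node {
        Node[] replicaSet(byte[] key, int maxRank); // candidate replica holders
        void store(byte[] key, byte[] value);       // hypothetical storage call
    }

    /** Replicate (key, value) r times, as described on the slide. */
    static void publish(Node self, byte[] key, byte[] value, int r) {
        Node[] replicas = self.replicaSet(key, r);
        for (Node n : replicas) n.store(key, value);
        if (replicas.length < r) {
            // Fewer than r suitable neighbors were returned: the application
            // itself must pick additional replica locations (e.g. retry later).
        }
    }
}
```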

Properties of Pastry / FreePastry
Advantages:
- well documented, clear APIs
- modular, extendable software
- large user base, still maintained
Disadvantages:
- no support for heterogeneity: all nodes are treated equally, although strong, long-living peers should do more
- only basic functionality: routing, DHT (key-value mapping), distributed storage
- no further built-in security mechanisms (besides resistance against DoS): sensitive to malicious nodes (however, some changes in LifeSocial)
- limited API: only a DHT; sufficient replication and additional services are also required