IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 20, NO. 8, OCTOBER 2002

SCRIBE: A large-scale and decentralized application-level multicast infrastructure

Miguel Castro, Peter Druschel, Anne-Marie Kermarrec and Antony Rowstron

Abstract: This paper presents Scribe, a scalable application-level multicast infrastructure. Scribe supports large numbers of groups, with a potentially large number of members per group. Scribe is built on top of Pastry, a generic peer-to-peer object location and routing substrate overlayed on the Internet, and leverages Pastry's reliability, self-organization, and locality properties. Pastry is used to create and manage groups and to build efficient multicast trees for the dissemination of messages to each group. Scribe provides best-effort reliability guarantees, and we outline how an application can extend Scribe to provide stronger reliability. Simulation results, based on a realistic network topology model, show that Scribe scales across a wide range of groups and group sizes. Also, it balances the load on the nodes while achieving acceptable delay and link stress when compared to IP multicast.

Keywords: group communication, application-level multicast, peer-to-peer.

I. INTRODUCTION

Network-level IP multicast was proposed over a decade ago. Subsequently, multicast protocols such as SRM (Scalable Reliable Multicast Protocol) and RMTP (Reliable Message Transport Protocol) have added reliability. However, the use of multicast in applications has been limited because of the lack of wide scale deployment and the issue of how to track group membership. As a result, application-level multicast has gained in popularity. Algorithms and systems for scalable group management and scalable, reliable propagation of messages are still active research areas. For such systems, the challenge remains to build an infrastructure that can scale to, and tolerate the failure modes of, the general Internet, while achieving low delay and effective use of network resources.
Recent work on peer-to-peer overlay networks offers a scalable, self-organizing, fault-tolerant substrate for decentralized distributed applications. In this paper we present Scribe, a large-scale, decentralized application-level multicast infrastructure built upon Pastry, a scalable, self-organizing peer-to-peer location and routing substrate with good locality properties. Scribe provides efficient application-level multicast and is capable of scaling to a large number of groups, of multicast sources, and of members per group.

Scribe and Pastry adopt a fully decentralized peer-to-peer model, where each participating node has equal responsibilities. Scribe builds a multicast tree, formed by joining the Pastry routes from each group member to a rendez-vous point associated with a group. Membership maintenance and message dissemination in Scribe leverage the robustness, self-organization, locality and reliability properties of Pastry.

(M. Castro, A-M. Kermarrec and A. Rowstron are with Microsoft Research Ltd., 7 J J Thomson Avenue, Cambridge, CB3 0FB, UK. P. Druschel is with Rice University MS-132, 6100 Main Street, Houston, TX 77005, USA.)

The rest of the paper is organized as follows. Section II gives an overview of the Pastry routing and object location infrastructure. Section III describes the basic design of Scribe. We present performance results in Section IV and discuss related work in Section V.

II. PASTRY

In this section we briefly sketch Pastry, a peer-to-peer location and routing substrate upon which Scribe was built. Pastry forms a robust, self-organizing overlay network in the Internet. Any Internet-connected host that runs the Pastry software and has proper credentials can participate in the overlay network.

Each Pastry node has a unique, 128-bit nodeid. The set of existing nodeids is uniformly distributed; this can be achieved, for instance, by basing the nodeid on a secure hash of the node's public key or IP address. Given a message and a key, Pastry reliably routes the message to the Pastry node with the nodeid that is numerically closest to the key, among all live Pastry nodes.
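As a rough illustration of the two ideas just described (deriving a uniformly distributed nodeid from a secure hash, and picking the live node numerically closest to a key), the following Python sketch may help. It is not Pastry's implementation: the function names, the truncation of SHA-1 output to 128 bits, the circular distance, and the tie-breaking rule are all assumptions made for this example.

```python
import hashlib

ID_BITS = 128                 # nodeid length stated in the text
ID_SPACE = 1 << ID_BITS       # size of the (assumed circular) id space


def node_id_from_address(ip_addr: str) -> int:
    """Hypothetical nodeid derivation: SHA-1 of the IP address,
    truncated to the first 128 bits (the truncation is an assumption)."""
    digest = hashlib.sha1(ip_addr.encode("ascii")).digest()
    return int.from_bytes(digest[: ID_BITS // 8], "big")


def circular_distance(a: int, b: int) -> int:
    """Distance between two ids on a ring of size ID_SPACE (assumption)."""
    d = abs(a - b)
    return min(d, ID_SPACE - d)


def closest_live_node(key: int, live_nodeids: list[int]) -> int:
    """Return the live nodeid numerically closest to key.
    Ties are broken in favour of the smaller nodeid (assumption)."""
    return min(live_nodeids, key=lambda n: (circular_distance(n, key), n))
```

In a real overlay no node knows the full list of live nodes; the sketch only states the delivery target that routing is meant to reach.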
Assuming a Pastry network consisting of $N$ nodes, Pastry can route to any node in less than $\lceil \log_{2^b} N \rceil$ steps on average ($b$ is a configuration parameter with typical value 4). With concurrent node failures, eventual delivery is guaranteed unless $l/2$ or more nodes with adjacent nodeids fail simultaneously ($l$ is an even integer parameter with typical value 16).

The tables required in each Pastry node have only $(2^b - 1) \cdot \lceil \log_{2^b} N \rceil + l$ entries, where each entry maps a nodeid to the associated node's IP address. Moreover, after a node failure or the arrival of a new node, the invariants in all affected routing tables can be restored by exchanging $O(\log_{2^b} N)$ messages. In the following, we briefly sketch the Pastry routing scheme. A full description and evaluation of Pastry can be found in the referenced Pastry papers.

For the purposes of routing, nodeids and keys are thought of as a sequence of digits with base $2^b$. A node's routing table is organized into $\lceil \log_{2^b} N \rceil$ rows with $2^b - 1$ entries each (see Figure 1). The entries in row $n$ of the routing table each refer to a node whose nodeid matches the present node's nodeid in the first $n$ digits, but whose $(n+1)$th digit has one of the $2^b - 1$ possible values other than the $(n+1)$th digit in the present node's id. The uniform distribution of nodeids ensures an even population of the nodeid space; thus, only $\lceil \log_{2^b} N \rceil$ levels are populated in the routing table. Each entry in the routing table refers to one of potentially many nodes whose nodeids have the appropriate prefix. Among such nodes, the one closest to the present node (according to a scalar proximity metric, such as the round trip time) is chosen.

In addition to the routing table, each node maintains IP addresses for the nodes in its leaf set, i.e., the set of nodes with the $l/2$ numerically closest larger nodeids, and the $l/2$ nodes with numerically closest smaller nodeids, relative to the present node's nodeid.

Fig. 1. Routing table of a Pastry node with nodeid 65a1x, b = 4. Digits are in base 16, x represents an arbitrary suffix. The IP address associated with each entry is not shown.

Fig. 2. Routing a message from node 65a1fc with key d46a1c. The dots depict live nodes in Pastry's circular namespace.

Figure 2 shows the path of an example message. In each routing step, the current node normally forwards the message to a node whose nodeid shares with the key a prefix that is at least one digit (or $b$ bits) longer than the prefix that the key shares with the current nodeid. If no such node is found in the routing table, the message is forwarded to a node whose nodeid shares a prefix with the key as long as the current node, but is numerically closer to the key than the current nodeid. Such a node must exist in the leaf set unless the nodeid of the current node or its immediate neighbour is numerically closest to the key, or $l/2$ adjacent nodes in the leaf set have failed concurrently.

A. Locality

Next, we discuss Pastry's locality properties, i.e., the properties of Pastry's routes with respect to the proximity metric. The proximity metric is a scalar value that reflects the distance between any pair of nodes, such as the round trip time. It is assumed that a function exists that allows each Pastry node to determine the distance between itself and a node with a given IP address. We limit our discussion to two of Pastry's locality properties that are relevant to Scribe.

The short routes property concerns the total distance, in terms of the proximity metric, that messages travel along Pastry routes. Recall that each entry in the node routing tables is chosen to refer to the nearest node, according to the proximity metric, with the appropriate nodeid prefix. As a result, in each step a message is routed to the nearest node with a longer prefix match.
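The routing step described earlier in this section (try the routing table for a longer prefix match, otherwise fall back to a leaf-set node that is numerically closer to the key) can be sketched as follows. This is a simplified illustration, not Pastry's code: nodeids are hexadecimal strings (b = 4), the routing table is modelled as a dictionary keyed by (row, digit), the wrap-around of the circular namespace is ignored, and all names and example identifiers are assumptions.

```python
from typing import Dict, List, Optional, Tuple


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits that two ids have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def next_hop(current: str, key: str,
             routing_table: Dict[Tuple[int, str], str],
             leaf_set: List[str]) -> Optional[str]:
    """Pick the next node for a message with the given key.
    Returns None when no better node is known, i.e. deliver locally."""
    if current == key:
        return None
    p = shared_prefix_len(current, key)
    # Normal case: an entry whose prefix match is at least one digit longer.
    entry = routing_table.get((p, key[p]))
    if entry is not None:
        return entry

    # Fallback: same prefix length, but numerically closer to the key.
    def dist(node: str) -> int:
        return abs(int(node, 16) - int(key, 16))

    candidates = [n for n in leaf_set
                  if shared_prefix_len(n, key) >= p and dist(n) < dist(current)]
    return min(candidates, key=dist) if candidates else None


# First hop: the routing table holds a node that starts with the key's first digit.
hop1 = next_hop("65a1fc", "d46a1c", {(0, "d"): "d13da3"}, [])
# Later hop: no routing table entry, so the leaf set supplies a closer node.
hop2 = next_hop("d462ba", "d46a1c", {}, ["d4213f", "d467c4", "d471f1"])
```

Each call either lengthens the matched prefix or reduces the numerical distance to the key, which is why the number of steps stays small and the route cannot loop.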
Simulations performed on several network topology models show that the average distance traveled by a message is between 1.59 and 2.2 times the distance between the source and destination in the underlying Internet.

The route convergence property is concerned with the distance traveled by two messages sent to the same key before their routes converge. Simulations show that, given our network topology model, the average distance traveled by each of the two messages before their routes converge is approximately equal to the distance between their respective source nodes. These properties have a strong impact on the locality properties of the Scribe multicast trees, as explained in Section III.

B. Node addition and failure

A key design issue in Pastry is how to efficiently and dynamically maintain the node state, i.e., the routing table and leaf set, in the presence of node failures, node recoveries, and new node arrivals. The protocol is described and evaluated in the referenced Pastry papers.

Briefly, an arriving node with the newly chosen nodeid X can initialize its state by contacting a nearby node A (according to the proximity metric) and asking A to route a special message using X as the key. This message is routed to the existing node Z with nodeid numerically closest to X (footnote 1). X then obtains the leaf set from Z, and the i-th row of the routing table from the i-th node encountered along the route from A to Z. One can show that using this information, X can correctly initialize its state and notify nodes that need to know of its arrival.

To handle node failures, neighboring nodes in the nodeid space (which are aware of each other by virtue of being in each other's leaf set) periodically exchange keep-alive messages. If a node is unresponsive for a period T, it is presumed failed. All members of the failed node's leaf set are then notified and they update their leaf sets. Since the leaf sets of nodes with adjacent nodeids overlap, this update is trivial. A recovering node contacts the nodes in its last known leaf set, obtains their current leaf sets, updates its own leaf set and then notifies the members of its new leaf set of its presence.
Routing table entries that refer to failed nodes are repaired lazily; the details are described in the referenced Pastry papers.

Footnote 1: In the exceedingly unlikely event that X and Z are equal, the new node must obtain a new nodeid.

C. Pastry API

In this section, we briefly describe the application programming interface (API) exported by Pastry to applications such as Scribe. The presented API is slightly simplified for clarity. Pastry exports the following operations:
nodeid = pastryinit(credentials) causes the local node to join an existing Pastry network (or start a new one) and initialize all relevant state; returns the local node's nodeid. The credentials are provided by the application and contain information needed to authenticate the local node and to securely join the Pastry network. A full discussion of Pastry's security model is given in the referenced Pastry literature.

route(msg,key) causes Pastry to route the given message to the node with nodeid numerically closest to key, among all live Pastry nodes.

send(msg,ip-addr) causes Pastry to send the given message to the node with the specified IP address, if that node is live. The message is received by that node through the deliver method.

Applications layered on top of Pastry must export the following operations:

deliver(msg,key) called by Pastry when a message is received and the local node's nodeid is numerically closest to key among all live nodes, or when a message is received that was transmitted via send, using the IP address of the local node.

forward(msg,key,nextid) called by Pastry just before a message is forwarded to the node with nodeid = nextid. The application may change the contents of the message or the value of nextid. Setting the nextid to NULL will terminate the message at the local node.

newleafs(leafset) called by Pastry whenever there is a change in the leaf set. This provides the application with an opportunity to adjust application-specific invariants based on the leaf set.

In the following section, we will describe how Scribe is layered on top of the Pastry API. Other applications built on top of Pastry include PAST, a persistent, global storage utility.

III. SCRIBE

Scribe is a scalable application-level multicast infrastructure built on top of Pastry. Any Scribe node may create a group; other nodes can then join the group, or multicast messages to all members of the group (provided they have the appropriate credentials). Scribe provides best-effort delivery of multicast messages, and specifies no particular delivery order.
However, stronger reliability guarantees and ordered delivery for a group can be built on top of Scribe, as outlined in Section III-B. Nodes can create, send messages to, and join many groups. Groups may have multiple sources of multicast messages and many members. Scribe can simultaneously support large numbers of groups with a wide range of group sizes, and a high rate of membership turnover.

Scribe offers a simple API to its applications:

create(credentials, groupid) creates a group with groupid. Throughout, the credentials are used for access control.

join(credentials, groupid, messagehandler) causes the local node to join the group with groupid. All subsequently received multicast messages for that group are passed to the specified message handler.

leave(credentials, groupid) causes the local node to leave the group with groupid.

multicast(credentials, groupid, message) causes the message to be multicast within the group with groupid.

Scribe uses Pastry to manage group creation, group joining and to build a per-group multicast tree used to disseminate the messages multicast in the group. Pastry and Scribe are fully decentralized: all decisions are based on local information, and each node has identical capabilities. Each node can act as a multicast source, a root of a multicast tree, a group member, a node within a multicast tree, and any sensible combination of the above. Much of the scalability and reliability of Scribe and Pastry derives from this peer-to-peer model.

A. Scribe Implementation

A Scribe system consists of a network of Pastry nodes, where each node runs the Scribe application software. The Scribe software on each node provides the forward and deliver methods, which are invoked by Pastry whenever a Scribe message arrives. The pseudo-code for these Scribe methods, simplified for clarity, is shown in Figure 3 and Figure 4, respectively.

Recall that the forward method is called whenever a Scribe message is routed through a node.
The deliver method is called when a Scribe message arrives at the node with nodeid numerically closest to the message's key, or when a message was addressed to the local node using the Pastry send operation. The possible message types in Scribe are JOIN, CREATE, LEAVE and MULTICAST; the roles of these messages are described in the next sections.

The following variables are used in the pseudocode: groups is the set of groups that the local node is aware of, msg.source is the nodeid of the message's source node, msg.group is the groupid of the group, and msg.type is the message type.

A.1 Group Management

Each group has a unique groupid. The Scribe node with a nodeid numerically closest to the groupid acts as the rendez-vous point for the associated group. The rendez-vous point is the root of the multicast tree created for the group.

To create a group, a Scribe node asks Pastry to route a CREATE message using the groupid as the key (e.g. route(CREATE,groupid)). Pastry delivers this message to the node with the nodeid numerically closest to groupid. The Scribe deliver method adds the group to the list of groups it already knows about (line 3 of Figure 4). It also checks the credentials to ensure that the group can be created, and stores the credentials. This Scribe node becomes the rendez-vous point for the group.

The groupid is the hash of the group's textual name concatenated with its creator's name. The hash is computed using a collision resistant hash function (e.g. SHA-1), which ensures a uniform distribution of groupids. Since Pastry nodeids are also uniformly distributed, this ensures an even distribution of groups across Pastry nodes.

Alternatively, we can make the creator of a group be the rendez-vous point for the group as follows: a Pastry nodeid can be the hash of the textual name of the node, and a groupid can be the concatenation of the nodeid of the creator and the hash of the textual name of the group.
This alternative can improve performance with a good choice of creator: link stress and delay will be lower if the creator sends to the group often, or is close in the network to other frequent senders or many group members.
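The first groupid scheme above (a collision-resistant hash of the group's textual name concatenated with the creator's name) can be sketched in a few lines of Python. The function name, the UTF-8 encoding, and the truncation of the 160-bit SHA-1 digest to 128 bits (to match the nodeid length) are assumptions of this example rather than details given in the text.

```python
import hashlib

ID_BITS = 128  # assumed to match the 128-bit Pastry nodeid length


def group_id(group_name: str, creator_name: str) -> int:
    """Hypothetical groupid: SHA-1 of the group name concatenated with
    the creator's name, truncated to 128 bits (truncation is an assumption)."""
    data = (group_name + creator_name).encode("utf-8")
    digest = hashlib.sha1(data).digest()
    return int.from_bytes(digest[: ID_BITS // 8], "big")


# Any node can compute the same groupid from the two names alone.
gid = group_id("news", "alice")
```

Because the result depends only on the two names, no separate naming service is needed to find the key under which a group's rendez-vous point is reached.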
    (1) forward(msg, key, nextid)
    (2)   switch msg.type is
    (3)     JOIN : if !(msg.group ∈ groups)
    (4)              groups = groups ∪ msg.group
    (5)              route(msg, msg.group)
    (6)            groups[msg.group].children ∪ msg.source
    (7)            nextid = null // Stop routing the original message

Fig. 3. Scribe implementation of forward.

    (1) deliver(msg, key)
    (2)   switch msg.type is
    (3)     CREATE : groups = groups ∪ msg.group
    (4)     JOIN : groups[msg.group].children ∪ msg.source
    (5)     MULTICAST : ∀ node in groups[msg.group].children
    (6)                   send(msg, node)
    (7)                 if memberof(msg.group)
    (8)                   invokemessagehandler(msg.group, msg)
    (9)     LEAVE : groups[msg.group].children = groups[msg.group].children − msg.source
    (10)            if (|groups[msg.group].children| = 0)
    (11)              send(msg, groups[msg.group].parent)

Fig. 4. Scribe implementation of deliver.

In both alternatives, a groupid can be generated by any Scribe node using only the textual name of the group and its creator, without the need for an additional naming service. Of course, proper credentials are necessary to join or multicast messages in the associated group.

A.2 Membership management

Scribe creates a multicast tree, rooted at the rendez-vous point, to disseminate the multicast messages in the group. The multicast tree is created using a scheme similar to reverse path forwarding. The tree is formed by joining the Pastry routes from each group member to the rendez-vous point. Group joining operations are managed in a decentralized manner to support large and dynamic membership.

Scribe nodes that are part of a group's multicast tree are called forwarders with respect to the group; they may or may not be members of the group. Each forwarder maintains a children table for the group containing an entry (IP address and nodeid) for each of its children in the multicast tree.

When a Scribe node wishes to join a group, it asks Pastry to route a JOIN message with the group's groupid as the key (e.g. route(JOIN,groupid)). This message is routed by Pastry towards the group's rendez-vous point.
At each node along the route, Pastry invokes Scribe's forward method. Forward (lines 3 to 7 in Figure 3) checks its list of groups to see if it is currently a forwarder; if so, it accepts the node as a child, adding it to the children table. If the node is not already a forwarder, it creates an entry for the group, and adds the source node as a child in the associated children table. It then becomes a forwarder for the group by sending a JOIN message to the next node along the route from the joining node to the rendez-vous point. The original message from the source is terminated; this is achieved by setting nextid = null, in line 7 of Figure 3.

Figure 5 illustrates the group joining mechanism. The circles represent nodes, and some of the nodes have their nodeid shown. For simplicity b = 1, so the prefix is matched one bit at a time. We assume that there is a group with groupid 1100 whose rendez-vous point is the node with the same identifier. The node with nodeid 0111 is joining this group. In this example, Pastry routes the JOIN message to node 1001; then the message from 1001 is routed to 1101; finally, the message from 1101 arrives at 1100. This route is indicated by the solid arrows in Figure 5.

Let us assume that nodes 1001 and 1101 are not already forwarders for group 1100. The joining of node 0111 causes the other two nodes along the route to become forwarders for the group, and causes them to add the preceding node in the route to their children tables. Now let us assume that node 0100 decides to join the same group. The route that its JOIN message would take is shown using dot-dash arrows. However, since node 1001 is already a forwarder, it adds node 0100 to its children table for the group, and the JOIN message is terminated.

When a Scribe node wishes to leave a group, it records locally that it left the group. If there are no other entries in the children table, it sends a LEAVE message to its parent in the multicast tree, as shown in lines 9 to 11 in Figure 4. The message proceeds recursively up the multicast tree, until a node is reached that still has entries in the children table after removing the departing child.
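The join and leave handling described above can be simulated with a small in-memory Python sketch. It is only an illustration under stated assumptions: the Pastry route is passed in as an explicit list of nodes instead of being computed, messages are replaced by direct function calls, the node identifiers are illustrative 4-bit strings, and the rule that a LEAVE stops at a forwarder that is itself still a member is an assumption of this example.

```python
class ScribeNode:
    def __init__(self, nodeid):
        self.nodeid = nodeid
        self.children = {}    # groupid -> set of child nodeids (children tables)
        self.parent = {}      # groupid -> parent nodeid in the multicast tree
        self.member = set()   # groups this node has joined as a member


def create(root, groupid):
    """CREATE delivered at the rendez-vous point: it now knows the group."""
    root.children.setdefault(groupid, set())


def join(route, groupid):
    """route[0] is the joining node, route[-1] the rendez-vous point."""
    route[0].member.add(groupid)
    child = route[0]
    for node in route[1:]:
        already_forwarder = groupid in node.children
        node.children.setdefault(groupid, set()).add(child.nodeid)
        child.parent[groupid] = node.nodeid
        if already_forwarder:
            return            # JOIN is terminated at an existing forwarder
        child = node          # the new forwarder sends its own JOIN onward


def leave(nodes, nodeid, groupid):
    """Prune upward while a node has no children left for the group."""
    node = nodes[nodeid]
    node.member.discard(groupid)
    while (not node.children.get(groupid)
           and groupid not in node.member
           and groupid in node.parent):
        parent = nodes[node.parent.pop(groupid)]
        node.children.pop(groupid, None)
        parent.children[groupid].discard(node.nodeid)
        node = parent


nodes = {i: ScribeNode(i) for i in ["0100", "0111", "1001", "1101", "1100"]}
create(nodes["1100"], "1100")
join([nodes[i] for i in ["0111", "1001", "1101", "1100"]], "1100")
join([nodes[i] for i in ["0100", "1001", "1101", "1100"]], "1100")
```

After the two joins, node 1001 holds both joining nodes in its children table, and the second JOIN never travels beyond it.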
The properties of Pastry routes ensure that this mechanism produces a tree. There are no loops because the nodeid of the next node in every hop of a Pastry route matches a longer prefix of the groupid than the previous node, or matches a prefix with the same length and is numerically closer, or is the nodeid of the root.

Fig. 5. Membership management and multicast tree creation.

The membership management mechanism is efficient for groups with a wide range of memberships, varying from one to all Scribe nodes. The list of members of a group is distributed across the nodes in the multicast tree. Pastry's randomization properties ensure that the tree is well balanced and that the forwarding load is evenly balanced across the nodes. This balance enables Scribe to support large numbers of groups and members per group. Joining requests are handled locally in a decentralized fashion. In particular, the rendez-vous point does not handle all joining requests.

The locality properties of Pastry ensure that the multicast tree can be used to disseminate messages efficiently. First, the delay to forward a message from the rendez-vous point to each group member is small because of the short routes property. Second, the route convergence property ensures that the load imposed on the physical network is small because most messages are sent by the nodes close to the leaves and the network distance traversed by these messages is short. Simulation results quantifying the locality properties of the Scribe multicast tree will be presented in Section IV.

A.3 Multicast message dissemination

Multicast sources use Pastry to locate the rendez-vous point of a group: they route to the rendez-vous point (e.g. route(MULTICAST,groupid)), and ask it to return its IP address. They cache the rendez-vous point's IP address and use it in subsequent multicasts to the group to avoid repeated routing through Pastry. If the rendez-vous point changes or fails, the source uses Pastry to find the IP address of the new rendez-vous point.

Multicast messages are disseminated from the rendez-vous point along the multicast tree in the obvious way (lines 5 and 6 of Figure 4).

There is a single multicast tree for each group and all multicast sources use the above procedure to multicast messages to the group. This allows the rendez-vous node to perform access control.

B. Reliability

Applications that use group multicast may have diverse reliability requirements. Some groups may require reliable and ordered delivery of messages, whilst others require only best-effort delivery. Therefore, Scribe provides only best-effort delivery of messages but it offers a framework for applications to implement stronger reliability guarantees.

Scribe uses TCP to disseminate messages reliably from parents to their children in the multicast tree and for flow control, and it uses Pastry to repair the multicast tree when a forwarder fails.

B.1 Repairing the multicast tree

Periodically, each non-leaf node in the tree sends a heartbeat message to its children. Multicast messages serve as an implicit heartbeat signal, avoiding the need for explicit heartbeat messages in many cases. A child suspects that its parent is faulty when it fails to receive heartbeat messages. Upon detection of the failure of its parent, a node calls Pastry to route a JOIN message to the group's identifier. Pastry will route the message to a new parent, thus repairing the multicast tree.

For example, in Figure 5, consider the failure of node 1101. Node 1001 detects the failure of 1101 and uses Pastry to route a JOIN message towards the root through an alternative route (indicated by the dashed arrows). The message reaches node 1111, which adds 1001 to its children table and, since it is not a forwarder, sends a JOIN message towards the root. This causes node 1100 to add 1111 to its children table.

Scribe can also tolerate the failure of multicast tree roots (rendez-vous points). The state associated with the rendez-vous point, which identifies the group creator and has an access control list, is replicated across the k closest nodes to the root node in the nodeid space (where a typical value of k is 5). It should be noted that these nodes are in the leaf set of the root node. If the root fails, its immediate children detect the failure and join again through Pastry.
Pastry routes the join messages to a new root (the live node with the numerically closest nodeid to the groupid), which takes over the role of the rendez-vous point. Multicast senders likewise discover the new rendez-vous point by routing via Pastry.

Children table entries are discarded unless they are periodically refreshed by an explicit message from the child, stating its desire to remain in the group.

This tree repair mechanism scales well: fault detection is done by sending messages to a small number of nodes, and recovery from faults is local; only a small number of nodes ($O(\log_{2^b} N)$) is involved.

B.2 Providing additional guarantees

By default, Scribe provides reliable, ordered delivery of multicast messages only if the TCP connections between the nodes in the multicast tree do not break. For example, if some nodes in the multicast tree fail, Scribe may fail to deliver messages or may deliver them out of order.

Scribe provides a simple mechanism to allow applications to implement stronger reliability guarantees. Applications can define the following upcall methods, which are invoked by Scribe.
forwardhandler(msg) is invoked by Scribe before the node forwards a multicast message, msg, to its children in the multicast tree. The method can modify msg before it is forwarded.

joinhandler(msg) is invoked by Scribe after a new child is added to one of the node's children tables. The argument is the JOIN message.

faulthandler(msg) is invoked by Scribe when a node suspects that its parent is faulty. The argument is the JOIN message that is sent to repair the tree. The method can modify msg to add additional information before it is sent.

For example, an application can implement ordered, reliable delivery of multicast messages by defining the upcalls as follows. The forwardhandler is defined such that the root assigns a sequence number to each message and such that recently multicast messages are buffered by the root and by each node in the multicast tree. Messages are retransmitted after the multicast tree is repaired. The faulthandler adds the last sequence number, n, delivered by the node to the JOIN message and the joinhandler retransmits buffered messages with sequence numbers above n to the new child. To ensure reliable delivery, the messages must be buffered for an amount of time that exceeds the maximal time to repair the multicast tree after a TCP connection breaks.

To tolerate root failures, the root needs to be replicated. For example, one could choose a set of replicas in the leaf set of the root and use an algorithm like Paxos to ensure strong consistency.

IV. EXPERIMENTAL EVALUATION

This section presents results of simulation experiments to evaluate the performance of a prototype Scribe implementation. These experiments compare the performance of Scribe and IP multicast along three metrics: the delay to deliver events to group members, the stress on each node, and the stress on each physical network link. We also ran our implementation in a real distributed system with a small number of nodes.

A. Experimental Setup

We developed a simple packet-level, discrete event simulator to evaluate Scribe.
The simulator models the propagation delay on the physical links but it does not model either queuing delay or packet losses because modeling these would prevent simulation of large networks. We did not model any cross traffic during the experiments.

The simulations ran on a network topology with 5050 routers, which were generated by the Georgia Tech random graph generator using the transit-stub model. The routers did not run the code to maintain the overlays and implement Scribe. Instead, this code ran on 100,000 end nodes that were randomly assigned to routers in the core with uniform probability. Each end system was directly attached by a LAN link to its assigned router (as was done in prior work).

The transit-stub model is hierarchical. There are 10 transit domains at the top level with an average of 5 routers in each. Each transit router has an average of 10 stub domains attached, and each stub has an average of 10 routers. We generated 10 different topologies using the same parameters but different random seeds. We ran all the experiments in all the topologies. The results we present are the average of the results obtained with each topology.

We used the routing policy weights generated by the Georgia Tech random graph generator to perform IP unicast routing. IP multicast routing used a shortest path tree formed by the merge of the unicast routes from the source to each recipient. This is similar to what could be obtained in our experimental setting using protocols like Distance Vector Multicast Routing Protocol (DVMRP) or PIM. But in order to provide a conservative comparison, we ignored messages required by these protocols to maintain the trees. The delay of each LAN link was set to 1ms and the average delay of core links (computed by the graph generator) was 40.7ms.

Scribe was designed to be a generic infrastructure capable of supporting multiple concurrent applications with varying requirements. Therefore, we ran experiments with a large number of groups and with a wide range of group sizes.
Since there are no obvious sources of real-world trace data to drive these experiments, we adopted a Zipf-like distribution for the group sizes. Groups are ranked by size and the size of the group with rank $r$ is given by $gsize(r) = \lfloor N r^{-1.25} + 0.5 \rfloor$, where $N$ is the total number of Scribe nodes. The number of groups was fixed at 1,500 and the number of Scribe nodes ($N$) was fixed at 100,000, which were the maximum numbers that we were able to simulate. The exponent was chosen to ensure a minimum group size of eleven; this number appears to be typical of Instant Messaging applications, which is one of the targeted multicast applications. The maximum group size is 100,000 (group rank 1) and the sum of all group sizes is 395,247. Figure 6 plots group size versus group rank.

Fig. 6. Distribution of group size versus group rank.

The members of each group were selected randomly with uniform probability from the set of Scribe nodes, and we used different random seeds for each topology. Distributions with better network locality would improve the performance of Scribe.

We also ran experiments to evaluate Scribe on a different topology, described in the referenced measurement study. This is a realistic topology with 102,639 routers that was obtained from Internet measurements. The results of these experiments were comparable with the results presented here.
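The Zipf-like group size distribution can be reproduced with a short Python sketch. It assumes the size formula gsize(r) = floor(N * r^-1.25 + 0.5) with N = 100,000 nodes and 1,500 groups; the formula is a reconstruction that agrees with the stated minimum size of eleven and maximum size of N, and the function and variable names are chosen for this example.

```python
import math


def group_size(rank: int, num_nodes: int = 100_000, exponent: float = 1.25) -> int:
    """Assumed Zipf-like size of the group with the given rank (rank starts at 1)."""
    return math.floor(num_nodes * rank ** -exponent + 0.5)


NUM_GROUPS = 1_500
sizes = [group_size(r) for r in range(1, NUM_GROUPS + 1)]

largest = sizes[0]    # group of rank 1 contains every node
smallest = sizes[-1]  # group of rank 1,500
```

Under these assumptions the rank-1 group contains all 100,000 nodes and the smallest group has eleven members, matching the bounds described in the text.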
B. Delay penalty

The first set of experiments compares the delay to multicast messages using Scribe and IP multicast. Scribe increases the delay to deliver messages relative to IP multicast. To evaluate this penalty, we measured the distribution of delays to deliver a message to each member of a group using both Scribe and IP multicast. We compute two metrics of delay penalty using these distributions: RMD is the ratio between the maximum delay using Scribe and the maximum delay using IP multicast, and RAD is the ratio between the average delay using Scribe and the average delay using IP multicast. This differs from the relative delay penalty (RDP) used in prior work, where the delay ratio was computed for each individual group member. RAD and RMD avoid the anomalies associated with RDP.

Figure 7 shows the cumulative distribution of the RAD and RMD metrics. The y-value of a point represents the number of groups with a RAD or RMD lower than or equal to the relative delay (x-value). Scribe's performance is good because it leverages Pastry's short routes property. For 50% of groups, a RAD of at most 1.68 and a RMD of at most 1.69 is observed. In the worst case, the maximum RAD is 2 and the maximum RMD is 4.26.

Fig. 7. Cumulative distribution of delay penalty relative to IP multicast per group (average standard deviation was 62 for RAD and 21 for RMD).

We also calculated the RDP for the 100,000 members of the group with rank 1. The results show that Scribe performs well: the mean RDP is 1.81, the median is 1.65, more than 80% of the group members have RDP less than 2.25, and more than 98% have RDP less than 4. IP routing does not always produce minimum delay routes because it is performed using the routing policy weights from the Georgia Tech model. As a result, Scribe was able to find routes with lower delay than IP multicast for 2.2% of the group members.

C. Node stress

In Scribe, end nodes are responsible for maintaining membership information and for forwarding and duplicating packets whereas routers perform these tasks in IP multicast. To evaluate the stress imposed by Scribe on each node, we measured the number of groups with non-empty children tables, and the number of entries in children tables in each Scribe node; the results are in Figures 8 and 9.

Fig. 8. Number of children tables per Scribe node (average standard deviation was 58).

Fig. 9. Number of children table entries per Scribe node (average standard deviation was 3.2).

Even though there are 1,500 groups, the mean number of non-empty children tables per node is only 2.4 and the median number is only 2. Figure 8 shows that the distribution has a long tail with a maximum number of children tables per node of 40. Similarly, the mean number of entries in all the children tables on any single Scribe node is only 6.2 and the median is only 3. This distribution also has a long thin tail with a maximum of 1059.

These results indicate that Scribe does a good job of partitioning and distributing forwarding load over all nodes: each node is responsible for forwarding multicast messages for a small number of groups, and it forwards multicast messages only to a small number of nodes. This is important to achieve scalability with group size and the number of groups.

D. Link stress

The final set of experiments compares the stress imposed by Scribe and IP multicast on each directed link in the network topology. We computed the stress by counting the number of packets that are sent over each link when a message is multicast to each of the 1,500 groups. Figure 10 shows the distribution of link stress for both Scribe and IP multicast with the results for zero link stress omitted.

The total number of links is 1,035,295 and the total number of messages over these links is 2,489,824 for Scribe and 758,853 for IP multicast. The mean number of messages per link in Scribe is 2.4 whilst for IP multicast it is 0.7.
The median is for both. The maximum link stress in Scribe is 431, whilst for IP multicast the maximum link stress is 9. This means that the maximum link stress induced by Scribe is about 4 times that for IP multicast on this experiment. The results are good because Scribe distributes load across nodes (as shown before) and because it leverages Pastry's route convergence property. Group members that are close in the network tend to be children of the same parent in the multicast tree that is also close to them. This reduces link stress because the parent receives a single copy of the multicast message and forwards copies to its children along short routes.

Fig. 10. Link stress for multicasting a message to each of 1, groups (average standard deviation was 1.4 for Scribe and 1.9 for IP multicast). [Axes: Link Stress vs. Number of Links; curves: Scribe, IP Multicast.]

It is interesting to compare Scribe with a naïve multicast that is implemented by performing a unicast transmission from the source to each subscriber. This naïve implementation would have a maximum link stress greater than or equal to 1, (which is the maximum group size).

Figure 10 shows the link stress for multicasting a message to each group. The link stress for joining is identical because the process we use to create the multicast tree for each group is the inverse of the process used to disseminate multicast messages.

E. Bottleneck remover

The base mechanism for building multicast trees in Scribe assumes that all nodes have equal capacity and strives to distribute load evenly across all nodes. But in several deployment scenarios some nodes may have less computational power or bandwidth available than others. Under high load, these lower capacity nodes may become bottlenecks that slow down message dissemination. Additionally, the distribution of children table entries shown in Figure 9 has a long tail. The nodes at the end of the tail may become bottlenecks under high load even if their capacity is relatively high.

This section describes a simple algorithm to remove bottlenecks when they occur. The algorithm allows nodes to bound the amount of multicast forwarding they do by off-loading children to other nodes. The bottleneck remover algorithm works as follows.
When a node detects that it is overloaded, it selects the group that consumes the most resources. Then it chooses the child in this group that is farthest away, according to the proximity metric. The parent drops the chosen child by sending it a message containing the children table for the group along with the delays between each child and the parent. When the child receives the message, it performs the following operations: (i) it measures the delay between itself and the other nodes in the children table it received from the parent; (ii) then, for each node, it computes the total delay between itself and the parent via that node; (iii) finally, it sends a join message to the node that provides the smallest combined delay. That way, it minimizes the delay to reach its parent through one of its previous siblings.

Unlike the base mechanism for building multicast trees in Scribe, the bottleneck remover may introduce routing loops. However, this happens only when there are failures, and with low probability. In particular, there are no routing loops in the experiments that we describe below. Loops are detected by having each parent propagate to its children the nodeids in the path from the root to that parent. If a node receives a path that contains its nodeid, it uses Pastry to route a JOIN message towards the group identifier using a randomized route. Additionally, if a node receives a JOIN message from a node in its path to the root, it rejects the join and informs the joining node that it should join using a randomized route.

We reran all the experiments in the previous sections to evaluate the bottleneck remover. Since we do not model bandwidth or processing at the nodes in our simulator, the cost of forwarding is the same for all children. A node is considered overloaded if the total number of children across all groups is greater than 4, and the group that consumes the most resources is the one with the largest children table. Figure 11 shows the distribution of the number of children table entries per node.
As expected, the bottleneck remover eliminates the long tail that we observed in Figure 9 and bounds the number of children per node to .

Fig. 11. Number of children table entries per Scribe node with the bottleneck remover (average standard deviation was 7). [Horizontal axis: Total Number of Children Table Entries.]

The drawback with the bottleneck remover is that it increases the link stress for joining. The average link stress increases from 2.4 to 2.7 and the maximum increases from 431 to . This does not account for the probes needed to estimate delay to other siblings; there are an average of 3 probes per join. Our experimental setup exacerbates this cost; the bottleneck remover is invoked very often because all nodes impose a low bound on the number of children table entries. We expect this cost to be low in a more realistic setting.

We do not show figures for the other results because they are almost identical to the ones presented without the bottleneck remover. In particular, the bottleneck remover achieves the goal of bounding the amount of forwarding work per node without noticeably increasing latency.
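
The two decisions made by the bottleneck remover described in this section (which child an overloaded parent drops, and which former sibling the dropped child rejoins through) can be illustrated with a minimal Python sketch. This is not the simulator code used for the experiments. The names `ScribeNode`, `pick_child_to_drop` and `choose_new_parent`, the `max_children` parameter and the `probe` callback are hypothetical and introduced only for illustration, and message passing, failures and loop detection are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class ScribeNode:
    """Hypothetical stand-in for a Scribe forwarder."""
    node_id: str
    # children[group][child_id] = delay between this node and that child
    children: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def total_children(self) -> int:
        return sum(len(table) for table in self.children.values())


def pick_child_to_drop(node: ScribeNode,
                       max_children: int) -> Optional[Tuple[str, str]]:
    """Parent side: return (group, child) to drop, or None if not overloaded."""
    if node.total_children() <= max_children:
        return None
    # As in the experiments, the group that consumes the most resources
    # is taken to be the one with the largest children table.
    group = max(node.children, key=lambda g: len(node.children[g]))
    table = node.children[group]
    # Drop the child that is farthest away under the proximity metric.
    child = max(table, key=table.get)
    return group, child


def choose_new_parent(child_id: str,
                      table: Dict[str, float],
                      probe: Callable[[str], float]) -> Optional[Tuple[str, float]]:
    """Child side: steps (i)-(iii) on the children table sent by the parent.

    `table` maps each sibling to its delay to the old parent; `probe`
    measures the delay from the dropped child to a sibling (assumed here).
    """
    combined = {sibling: probe(sibling) + to_parent
                for sibling, to_parent in table.items()
                if sibling != child_id}
    if not combined:
        return None  # no former sibling to rejoin through
    best = min(combined, key=combined.get)
    return best, combined[best]


# Illustrative values only.
parent = ScribeNode("p", {"g1": {"a": 5.0, "b": 20.0, "c": 9.0},
                          "g2": {"d": 1.0}})
decision = pick_child_to_drop(parent, max_children=3)
probes = {"a": 8.0, "c": 3.0}
new_parent = choose_new_parent("b", parent.children["g1"], probes.__getitem__)
```

In this example the parent holds four children against an assumed bound of three, so it drops `b`, the farthest child of its largest group. Node `b` then compares the combined delays via `a` (8 + 5) and via `c` (3 + 9) and would send its join message to `c`.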
F. Scalability with many small groups

We ran an additional experiment to evaluate Scribe's scalability with a large number of groups. This experiment ran in the same setup as the others except that there were , Scribe nodes, and 3, groups with 11 members each (which was the minimum group size in the distribution used in the previous experiments). This setup is representative of Instant Messaging applications.

Figures 12 and 13 show the distribution of children tables and children table entries per node, respectively. The lines labelled "scribe collapse" will be explained later. The distributions have sharp peaks before and a long thin tail, showing that Scribe scales well because it distributes children tables and children table entries evenly across the nodes.

Fig. 12. Number of children tables per Scribe node. [Horizontal axis: Number of Children Tables; curves: scribe, scribe collapse.]

Fig. 13. Number of children table entries per Scribe node. [Horizontal axis: Total Number of Children Table Entries; curves: scribe, scribe collapse.]

But the results also show that Scribe multicast trees are not as efficient for small groups. The average number of children table entries per node is 21.2, whereas the naïve multicast would achieve an average of only .. The average is higher for Scribe because it creates trees with long paths with no branching. This problem also causes higher link stress as shown in Figure 14: Scribe's average link stress is .1, IP multicast's is 1. and naïve multicast's is 2.9. (As before, one message was sent in each of the 3, groups.)

We implemented a simple algorithm to produce more efficient trees for small groups. Trees are built as before but the algorithm collapses long paths in the tree by removing nodes that are not members of a group and have only one entry in the group's children table. We reran the experiment in this section using this algorithm. The new results are shown under the label "scribe collapse" in Figures 12, 13, and 14. The algorithm is effective: it reduced the average link stress from .1 to 3.3, and the average number of children per node from 21.2 to 8..

Fig. 14. Link stress for multicasting a message to each of 3, groups. [Axes: Link Stress vs. Number of Links; curves: scribe collapse, scribe, ip mcast, naïve unicast.]

We also considered an alternative algorithm that grows trees less eagerly. As before, a joining node, n, uses Pastry to route a JOIN message towards the root of the tree. But these messages are forwarded to the first node, p, that is already in the tree. If p is not overloaded, it adds n to its children table and the previous nodes along the route do not become forwarders for the tree. Otherwise, p adds the previous node in the route to its children table, and tells this node to take n as its child. This alternative algorithm can generate shallower trees but it has two disadvantages: it can increase link stress relative to the algorithm that collapses the tree; and it reduces Scribe's ability to handle large numbers of concurrent joins when a group suddenly becomes popular.

V. RELATED WORK

Like Scribe, Overcast  and Narada  implement multicast using a self-organizing overlay network, and they assume only unicast support from the underlying network layer. Overcast builds a source-rooted multicast tree using end-to-end measurements to optimize bandwidth between the source and the various group members. Narada uses a two-step process to build the multicast tree. First, it builds a mesh per group containing all the group members. Then, it constructs a spanning tree of the mesh for each source to multicast data. The mesh is dynamically optimized by performing end-to-end latency measurements and adding and removing links to reduce multicast latency. The mesh creation and maintenance algorithms assume that all group members know about each other and, therefore, do not scale to large groups.

Scribe builds a multicast tree per group on top of a Pastry overlay, and relies on Pastry to optimize the routes from the root to each group member based on some metric (e.g. latency). The main difference is that the Pastry overlay can scale to an extremely large number of nodes because the algorithms to build and maintain the network have space and time costs of O(log N).
This enables support for extremely large groups and sharing of the Pastry network by a large number of groups.

The recent work on Bayeux  and Content Addressable Network (CAN) multicast  is the most similar to Scribe. Both Bayeux and CAN multicast are built on top of scalable peer-to-peer object location systems similar to Pastry. Bayeux is built on top of Tapestry  and CAN multicast is built on top of CAN .

Like Scribe, Bayeux supports multiple groups, and it builds a multicast tree per group on top of Tapestry. However, this tree is built quite differently. Each request to join a group is routed by Tapestry all the way to the node acting as the root. Then, the root records the identity of the new member and uses Tapestry to route another message back to the new member. Every Tapestry node along this route records the identity of the new member. Requests to leave the group are handled in a similar way.

Bayeux has two scalability problems when compared to Scribe: it requires nodes to maintain more group membership information, and it generates more traffic when handling group membership changes. In particular, the root keeps a list of all group members and all group management traffic must go through the root. Bayeux proposes a multicast tree partitioning mechanism to ameliorate these problems by splitting the root into several replicas and partitioning members across them. But this only improves scalability by a small constant factor.

In Scribe, the expected amount of group membership information kept by each node is small because this information is distributed over the nodes. Furthermore, it can be bounded by a constant independent of the number of group members by using the bottleneck removal algorithm. Additionally, group join and leave requests are handled locally. This allows Scribe to scale to extremely large groups and to deal with rapid changes in group membership efficiently.

Finally, whilst Scribe and Bayeux have similar delay characteristics, Bayeux induces a higher link stress than Scribe. Both Pastry and Tapestry (on which Bayeux is built) exploit network locality in a similar manner. With each successive hop taken within the overlay network from the source towards the destination, the message traverses an exponentially increasing distance in the proximity space.
In Bayeux, the multicast tree consists of the routes from the root to each destination, while in Scribe the tree consists of the routes from each destination to the root. As a result, messages traverse the many long links near the leaves in Bayeux, while in Scribe, messages traverse few long links near the root.

CAN multicast does not build multicast trees. Instead, it uses the routing tables maintained by CAN to flood messages to all nodes in a CAN overlay network, and it supports multiple groups by creating a separate CAN overlay per group. A node joins a group by looking up a contact node for the group in a global CAN overlay, and by using that node to join the group's overlay. Group leaves are handled by the regular CAN maintenance algorithm.

CAN multicast has two features that may be advantageous in some scenarios: group traffic is not restricted to flow through a single multicast tree, and only group members forward multicast traffic for a group. But it is significantly more expensive to build and maintain separate CAN overlays per group than Scribe multicast trees. Furthermore, the delays and link stresses reported for CAN multicast in  are significantly higher than those observed for Scribe. Taking network topology into account when building CAN overlays is likely to reduce delays and link stresses but it will increase the cost of overlay construction and maintenance further. Additionally, the group join mechanism in CAN multicast does not scale well. The node in the CAN overlay that supplies the contact node for the group and the contact node itself handle all join traffic. The authors of  suggest replicating the functionality of these nodes to avoid the problem but this only improves scalability by a constant factor.

The mechanisms for fault resilience in CAN multicast, Bayeux and Scribe are also very different. CAN multicast does not require any additional mechanism to handle faults besides what is already provided by the base CAN protocol. Bayeux and Scribe require separate mechanisms to repair multicast trees.
All the mechanisms for fault resilience proposed in Bayeux are sender-based whereas Scribe uses a receiver-based mechanism. Bayeux does not provide a mechanism to handle root failures whereas Scribe does.

VI. CONCLUSIONS

We have presented Scribe, a large-scale and fully decentralized application-level multicast infrastructure built on top of Pastry, a peer-to-peer object location and routing substrate overlayed on the Internet. Scribe is designed to scale to large numbers of groups and large group sizes, and it supports multiple multicast sources per group.

Scribe leverages the scalability, locality, fault-resilience and self-organization properties of Pastry. The Pastry routing substrate is used to maintain groups and group membership, and to build an efficient multicast tree associated with each group. Scribe's randomized placement of groups and multicast roots balances the load among participating nodes. Furthermore, Pastry's properties enable Scribe to exploit locality to build an efficient multicast tree and to handle group join operations in a decentralized manner.

Fault-tolerance in Scribe is based on Pastry's self-organizing properties. The default reliability scheme in Scribe ensures automatic repair of the multicast tree after node and network failures. Multicast message dissemination is performed on a best-effort basis. However, stronger reliability models can be easily layered on top of Scribe.

Our simulation results, based on a realistic network topology model, indicate that Scribe scales well. Scribe is able to efficiently support a large number of nodes, groups, and a wide range of group sizes. Hence Scribe can concurrently support applications with widely different characteristics. Results also show that it balances the load among participating nodes, while achieving acceptable delay and link stress, when compared to network-level (IP) multicast.

REFERENCES

S. Deering and D. Cheriton, "Multicast Routing in Datagram Internetworks and Extended LANs," ACM Transactions on Computer Systems, vol. 8, no. 2, May 199.
S. E. Deering, "Multicast Routing in a Datagram Internetwork," Ph.D. thesis, Stanford University, Dec
S. Deering, D. Estrin, D. Farinacci, V. Jacobson, C. Liu, and L. Wei, "The PIM Architecture for Wide-Area Multicast Routing," IEEE/ACM Transactions on Networking, vol. 4, no. 2, April 199.
S. Floyd, V. Jacobson, C.G. Liu, S. McCanne, and L. Zhang, "A reliable multicast framework for light-weight sessions and application level framing," IEEE/ACM Transactions on Networking, vol. , no. 4, pp , Dec
J.C. Lin and S. Paul, "A reliable multicast transport protocol," in Proc. of IEEE INFOCOM 9, 199, pp
K.P. Birman, M. Hayden, O. Ozkasap, Z. Xiao, M. Budiu, and Y. Minsky, "Bimodal multicast," ACM Transactions on Computer Systems, vol. 17, no. 2, pp , May
Patrick Eugster, Sidath Handurukande, Rachid Guerraoui, Anne-Marie Kermarrec, and Petr Kouznetsov, "Lightweight probabilistic broadcast," in Proceedings of The International Conference on Dependable Systems and Networks (DSN 21), Gothenburg, Sweden, July 21.
Luis F. Cabrera, Michael B. Jones, and Marvin Theimer, "Herald: Achieving a global event notification service," in HotOS VIII, Schloss Elmau, Germany, May 21.
Shelly Q. Zhuang, Ben Y. Zhao, Anthony D. Joseph, Randy H. Katz, and John Kubiatowicz, "Bayeux: An Architecture for Scalable and Fault-tolerant Wide-Area Data Dissemination," in Proc. of the Eleventh International Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV 21), Port Jefferson, NY, June 21.
Yang hua Chu, Sanjay G. Rao, and Hui Zhang, "A case for end system multicast," in Proc. of ACM Sigmetrics, Santa Clara, CA, June 2, pp
P.T. Eugster, P. Felber, R. Guerraoui, and A.-M. Kermarrec, "The many faces of publish/subscribe," Tech. Rep. DSC ID:214, EPFL, January 21.
Antony Rowstron and Peter Druschel, "Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems," in Proc. IFIP/ACM Middleware 21, Heidelberg, Germany, Nov. 21.
Ben Y. Zhao, John D. Kubiatowicz, and Anthony D. Joseph, "Tapestry: An infrastructure for fault-resilient wide-area location and routing," Tech. Rep. UCB//CSD , U. C. Berkeley, April 21.
I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, "Chord: A scalable peer-to-peer lookup service for Internet applications," in Proc. ACM SIGCOMM 1, San Diego, CA, Aug. 21.
S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, "A Scalable Content-Addressable Network," in Proc. of ACM SIGCOMM, Pittsburgh, PA, Aug. 21.
M. Castro, P. Druschel, Y. C. Hu, and A. Rowstron, "Exploiting network proximity in peer-to-peer overlay networks," 22, Submitted for publication (http://www.research.microsoft.com/ antr/pastry). Also available as Microsoft Research TR MSR-TR
Miguel Castro, Peter Druschel, Ayalvadi Ganesh, Antony Rowstron, and Dan S. Wallach, "Security for structured peer-to-peer overlay networks," in Proc. of the Fifth Symposium on Operating System Design and Implementation (OSDI 22), Boston, MA, December 22.
Peter Druschel and Antony Rowstron, "PAST: A persistent and anonymous store," in HotOS VIII, Schloss Elmau, Germany, May 21.
Antony Rowstron and Peter Druschel, "Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility," in Proc. ACM SOSP 1, Banff, Canada, Oct. 21.
FIPS 18-1, "Secure hash standard," Tech. Rep. Publication 18-1, Federal Information Processing Standard (FIPS), National Institute of Standards and Technology, US Department of Commerce, Washington D.C., April 199.
Yogen K. Dalal and Robert Metcalfe, "Reverse path forwarding of broadcast packets," Communications of the ACM, vol. 21, no. 12, pp , Dec
L. Lamport, "The Part-Time Parliament," Research Report 49, Digital Equipment Corporation Systems Research Center, Palo Alto, CA, Sept
E. Zegura, K. Calvert, and S. Bhattacharjee, "How to model an internetwork," in INFOCOM9, San Francisco, CA, 199.
H. Tangmunarunkit, R. Govindan, D. Estrin, and S. Shenker, "The impact of routing policy on internet paths," in Proc. 2th IEEE INFOCOM, Alaska, USA, Apr. 21.
John Jannotti, David K. Gifford, Kirk L. Johnson, M. Frans Kaashoek, and James W. O'Toole, "Overcast: Reliable Multicasting with an Overlay Network," in Proc. of the Fourth Symposium on Operating System Design and Implementation (OSDI), October 2, pp
Sylvia Ratnasamy, Mark Handley, Richard Karp, and Scott Shenker, "Application-level multicast using content-addressable networks," in Proceedings of the Third International Workshop on Networked Group Communication, London, UK, Nov. 21, pp