A Model Driven Architecture for Adaptable Overlay Networks



Dissertation approved by the Department of Computer Science (Fachbereich Informatik) of the Technische Universität Darmstadt for the award of the academic degree Doktor-Ingenieur (Dr.-Ing.)

Submitted by Stefan Behnel from Helmstedt

Referees: Prof. Alejandro P. Buchmann, Ph.D.; Prof. Geoff Coulson, Ph.D.
Date of submission: 12 December 2006
Date of oral examination: 5 February 2007

Darmstadt 2007, D17


For Gabriele - thank you. For everything.


Preface

Acknowledgements

Credit is due to many people who inspired, encouraged and supported me during my research and in the process of writing this thesis. The first to mention at this point is my advisor, Professor Alejandro P. Buchmann. He opened the door for me into an interesting and inspiring research environment and supported my work with comments, advice, patience and a remarkable degree of freedom. This work would not have started, nor would it have been possible to complete, without him.

The very next person to thank is Ludger Fiege, for numerous discussions and his priceless support as a friend, colleague, critic and (co-)author of papers.

An important part of this thesis originated from a collaboration with Geoff Coulson and Paul Grace of the Distributed Systems Group at Lancaster University. I thank both of them for a warm welcome, for fruitful discussions and for interesting insights that helped in improving and rounding off the work presented here. Their involvement opened interesting new paths for future research.

At a very personal level, I want to thank Wesley Terpstra and Dimka Karastojanova for joining me on the long way since we started. Their friendship and collaboration were invaluable to me.

My work was generously supported by a national grant of the Deutsche Forschungsgemeinschaft (DFG) as part of the graduate college 749 "System Integration for Ubiquitous Computing". The graduate college and its participants provided a friendly, inspiring environment for my research.

Finally, in the long tradition of naming the most valuable people last, I thank Gabriele, my parents and my sisters for their enduring support and encouragement during the creation of this thesis and ever before.


Scientific Curriculum Vitae

- Computer Science studies at the Braunschweig University of Technology, Germany
- Academic year of Computer Science in Limerick, Ireland
- Specialised studies on Networks and Distributed Systems in Lille, France. Final degree: Diplôme d'Études Supérieures Spécialisées in Distributed Systems
- Professional software development for the leading European online freight exchange service Téléroute in Seclin, France
- Doctorate as part of a DFG-financed Graduate College in the Databases and Distributed Systems Group at the Darmstadt University of Technology, Germany. Final degree: Dr.-Ing. in Computer Science
- Since 2006: Professional software development and architecture in the financial sector

Wissenschaftlicher Lebenslauf (German version, in translation)

- Computer Science studies at the Technische Universität Carolo-Wilhelmina in Braunschweig, Germany
- Computer Science studies in Limerick, Ireland
- Specialisation studies in Networks and Distributed Systems in Lille, France. Graduated second in the year with the Diplôme d'Études Supérieures Spécialisées in Distributed Systems
- Professional software development for the leading European online freight exchange Téléroute in Seclin, France
- Doctoral research in the Databases and Distributed Systems Group at the Technische Universität Darmstadt, Germany, with merit-based funding from the Deutsche Forschungsgemeinschaft within a graduate college. Final degree: Dr.-Ing. in Computer Science
- Since 2006: Professional software development and architecture in the financial sector

Publications

[BB05a] Stefan Behnel and Alejandro Buchmann. Models and Languages for Overlay Networks. In Proc. of the 3rd Int. VLDB Workshop on Databases, Information Systems and Peer-to-Peer Computing (DBISP2P 2005), Trondheim, Norway, August.

[BB05b] Stefan Behnel and Alejandro Buchmann. Overlay Networks - Implementation by Specification. In Proc. of the Int. Middleware Conference (Middleware2005), Grenoble, France, November.

[BBG+06] Stefan Behnel, Alejandro Buchmann, Paul Grace, Barry Porter, and Geoff Coulson. A specification-to-deployment architecture for overlay networks. In Proc. of the Int. Symposium on Distributed Objects and Applications (DOA), Montpellier, France, October.

[Beh05a] Stefan Behnel. MathDOM - A Content MathML Implementation for the Python Programming Language. sourceforge.net/,

[Beh05b] Stefan Behnel. The SLOSL Overlay Workbench for Visual Overlay Design.

[Beh05c] Stefan Behnel. Tin Topologies - Designing Overlay Networks in Databases. In Proc. of the Int. Middleware Conference (Middleware2005), Grenoble, France, November. Demonstration and Poster Presentation.

[Beh05d] Stefan Behnel. Topologien aus der Dose - Ein Datenbank-Ansatz zum Overlay-Design. In Paul Müller, Reinhard Gotzhein, and Jens B. Schmitt, editors, KiVS Kurzbeiträge und Workshop, volume 61, Kaiserslautern, Germany, March. Poster.

[Beh07] Stefan Behnel. SLOSL - a modelling language for topologies and routing in overlay networks. In Proc. of the 1st Int. Workshop on Modeling, Simulation and Optimization of Peer-to-peer environments (MSOP2P), Naples, Italy, February.

[BFM06] Stefan Behnel, Ludger Fiege, and Gero Mühl. On quality-of-service and publish-subscribe. In Proc. of the 5th Int. Workshop on Distributed Event-based Systems (DEBS'06), Lisbon, Portugal, July 2006.

[TB04] Wesley W. Terpstra and Stefan Behnel. Bit Zipper Rendezvous - Optimal data placement for general P2P queries. In International Dagstuhl Seminar on Peer-to-Peer-Systems and Applications, Schloss Dagstuhl, Germany, March.

[TBF+03] Wesley W. Terpstra, Stefan Behnel, Ludger Fiege, Andreas Zeidler, and Alejandro Buchmann. A Peer-to-Peer Approach to Content-Based Publish/Subscribe. In Proc. of the 2nd Int. Workshop on Distributed Event-based Systems (DEBS'03), San Diego, CA, USA, June.

[TBF+04] Wesley W. Terpstra, Stefan Behnel, Ludger Fiege, Jussi Kangasharju, and Alejandro Buchmann. Bit Zipper Rendezvous - Optimal Data Placement for General P2P Queries. In Proc. of the 1st Int. Workshop on Peer-to-peer Computing and Databases, Heraklion, Crete, March 2004.

Abstract

Recent years have witnessed a remarkable spread of interest in decentralised infrastructures for Internet services. Peer-to-Peer systems and overlay networks have given rise to a major paradigm shift and to novel challenges in distributed systems research. Numerous projects from several research communities and commercial organisations have started building and deploying their own systems. Adoption, however, has been restricted to sparsely selected areas, dominated by few applications.

Among the technical reasons for the limited availability of deployable systems are the complexity of their design and the incompatibility of systems and frameworks. Applications are tightly coupled to a specific overlay implementation and the framework it uses. This leaves developers with two choices: implementing their application based on the fixed combination of overlay, networking framework and programming language, or reimplementing the overlay based on the desired application framework. Both approaches have obvious drawbacks.

Implementing an overlay, even as a reimplementation, is a considerable challenge. Protocols have to be adapted, completed and partially reverse engineered from the original system to implement them correctly. Message serialisations and interfaces have to be rewritten for the new framework. State maintenance and event handling are programmed in very different ways in different environments, which typically requires their redesign. Reimplementing an overlay for a new environment is therefore not necessarily less work than designing a new one.

As for the alternative, being tied to a specific environment prevents the application designer from freely choosing the framework best suited to the specific application. Deploying different overlays in one application is only possible if they were written for the same framework, and even then, running multiple non-integrated overlays at the same time can become prohibitively resource intensive. Testing with different overlay topologies and adapting to different deployment environments is similarly hard in this scenario. The techniques currently used for overlay implementation turn out to be limiting factors in the design of overlay applications.

The approach taken by this thesis tackles these issues at design time, long before integration problems arise. It presents a modelling framework that allows overlay-specific semantics to be expressed in a platform-independent, domain-specific language called the Overlay Modelling Language, OverML. A Model Driven Architecture maps these abstract overlay specifications to concrete, framework-specific implementations.

Applications based on OverML benefit from state sharing between different overlays and from short topology implementations in the SQL-Like Overlay Specification Language SLOSL. Their conciseness allows an easy adaptation of SLOSL statements to specific quality-of-service requirements. The architecture provides a clean separation between the generated implementation and handwritten components through generic, event-driven interfaces.

To evaluate the approach, a series of example topologies is specified in OverML and/or SLOSL. In addition, a complete walk-through from the specification to the deployment of an overlay implemented in OverML is presented. The major contribution of this thesis is the first complete design methodology for the integrative, high-level development of portable, adaptable overlay networks.

Zusammenfassung (German summary, in translation)

For several years, interest in decentralised infrastructures for Internet-wide services has been growing strongly. Triggered by the massive success of Peer-to-Peer systems and overlay networks, a rethinking has set in, above all in research. How seriously these networks are perceived as an alternative to client-server systems is shown by the large number of systems that research groups and companies worldwide have designed. It is also remarkable, however, how few of these systems have found their way into real applications. Apart from programs for distributing large amounts of data (E-Donkey or BitTorrent), only few systems have reached a noteworthy level of deployment.

There are also technical reasons for this lack of applications. Chief among them are the complexity of these distributed systems and the incompatibility of the existing implementations. Applications are very tightly coupled to the underlying overlay software. Their developers currently have only two options. They can tailor their applications to existing overlay implementations and commit to the accompanying combination of software environment and programming language. Or they choose a language and environment that suits their application and are then forced to reimplement the required overlay in that environment. From a developer's point of view, neither approach can be satisfactory.

Porting an existing overlay is only rarely less effort than rewriting it. One reason is the mostly insufficient specification, which entails adapting the protocols. In part, this even requires reverse engineering the original implementation. In addition, the implementation of message handling and serialisation, component interfaces and state management depends very strongly on the software environment used, so that these parts have to be developed anew. The effort this requires is out of all proportion to the relief that overlays are supposed to bring to application developers.

The alternative is to carry the software environment used by the overlay over to the application. This only makes sense, however, if the environment and programming language also offer adequate support for developing the specific application. Even then, the problem of incompatibility between overlay implementations remains, so that a later switch, or even just tests with other overlays, involves great effort. Using several overlays together at run-time is ruled out by the lack of integration alone, since their resource consumption mostly adds up.

This thesis pursues a more comprehensive approach that avoids the described integration problems from the outset and simplifies both the development and the handling of overlay software. For the first time, it allows the platform-independent, abstract modelling of overlay implementations. They are described in a domain-specific language developed for this purpose, called OverML, the Overlay Modelling Language. Following the Model Driven Architecture approach, the abstract models are then translated automatically into concrete implementations for specific software environments.

OverML offers advantages for applications as well. By using the compact, SQL-like language SLOSL, the topology properties of overlay implementations become clear and configurable and can be adapted very easily to special requirements. To integrate several overlays, a shared state store is used, which helps to avoid duplicating the management effort. The component architecture is decoupled through generic, data-driven interfaces, which leads to a clean separation of generated and handwritten code.

For the evaluation, several overlay topologies are specified. In addition, the complete development process is described, from the specification of an overlay implementation to its use in changing application environments. This thesis thus provides a novel methodology that, for the first time, allows the integrative, abstract development of platform-independent, easily adaptable overlay implementations.


Contents

Preface Scientific Curriculum Vitae Abstract

1 Introduction New Services in Globalising Environments Requirements on Overlay Networks Integrative Design of Overlay Networks Overview - how to read on

Case Study: Pub-Sub Overlays The Publish-Subscribe Paradigm Quality-of-Service in Publish-Subscribe QoS at the Global Infrastructure Level QoS at the Notification and Subscription Level Discussion Publish-Subscribe on Overlay Networks Multicast Approaches Hermes IndiQoS Rebeca Choreca BitZipper Conclusion

Overlay Middleware Requirements on an Overlay Middleware Generic, High-Level Models Software Components Management of State and Node Data 3.1.4 Integration and Resource Sharing Overlay Frameworks SEDA JXTA ioverlay RaDP2P Macedon What is missing in current frameworks? Current Views on Overlay Software The Networking Perspective The Overlay Protocol Perspective Code Complexity - why is this not enough? Overlay MDA The Topology Perspective Topology Rules Topology Maintenance Topological Routing Topology Adaptation Topology Selection Discussion Background: The Local View of a Node The Chord Example Data about Nodes The System Model: Node Views Nodes Node Database Node Set Views View Definitions View Events Harvesters and Controllers Routers and Application Message Handlers Discussion

The SLOSL View Definition Language SLOSL Step-by-Step Grammar CREATE VIEW SELECT FROM WITH WHERE HAVING... FOREACH 5.2.7 RANKED More Examples De-Bruijn Graphs Pastry Scribe on Chord SLOSL Routing Examples Iterative and Recursive Routing SLOSL Event Types

OverML NALA - Node Attributes Type Definition Attributes SLOSL - Local Views EDGAR - Routing Decisions HIMDEL - Messages Programmatic Interface to Messages Network Serialisation of Messages EDSL - Event-Driven State-Machines States Transitions Subgraphs

Harvesters Gossip-based Node Harvesters Latency Harvesters Sieving Harvesters Distributed Harvesters

Implementation The SLOSL Overlay Workbench Specifying Node Attributes and Messages Writing SLOSL Statements Testing and Debugging SLOSL Statements Designing Protocols and Harvesters OverML Output Implementing SLOSL Transforming SLOSL to SQL Interpreting SLOSL OverML Model Transformation Transforming OverML to an Object Model Efficient SLOSL View Updates 8.3.3 Transforming the OverML Object Model to Source Code OverGrid: OverML on the Gridkit Database and SLOSL Views Control Forwarding Event Infrastructure Network Layers OverGrid Case Study: Scribe over Chord Node Attributes for Scribe/Chord SLOSL Implemented Chord Topology Chord Routing through SLOSL Views SLOSL Implemented Scribe on Chord Multicast Routing through SLOSL Views Message Specifications in HIMDEL Event Handling in EDSL From Specification to Deployment

Related Work Unstructured Overlay Networks Power-law Networks Square-root Networks Key-Based Routing (KBR) Networks Key-Based Routing LH* SDDS Ring Topologies Trees De-Bruijn Graphs Other Topologies Hierarchical Overlays and Efficient Grouping Topologies Comparison and Evaluation Message Overhead and Physical Networks Physical Mappings Message Overhead and Stability Design and Implementation of Overlays General Concepts in P2P Overlay Networks Declarative Overlay Routing Quality-of-Service in Publish-Subscribe Modelling and Generating Software UML CASE tools The OMG Model Driven Architecture Generative Programming and Domain Languages Modelling Distributed Systems and Protocols 9.9 Databases and Data-Driven Software Database Views Data-Driven Software Architectures

Conclusion and Future Work

A XML Reference Implementations A.1 OverML RelaxNG Schema A.2 XSLT: EDSL to the DOT Language A.3 XSLT: EDSL to flat EDSM graph A.4 XSLT: EDSL to merged State Graph A.5 XSLT: SLOSL to View Objects A.6 XSLT: HIMDEL to Plain Messages

Bibliography


List of Figures

2.1 Distributed communication paradigms
The publish-subscribe model
QoS, topologies and local broker decisions
Approximate code size of overlay frameworks
Overlay software from the networking perspective
Overlay software from the overlay protocol perspective
Code size of well-known overlay implementations
Overlay software from the topology perspective
The Chord example - from global to local view
The overall system architecture of Node Views
View components in the Node Views architecture
General EDGAR graph for SLOSL routing
Chord routing, implemented using SLOSL views
Scribe routing over Chord, implemented using SLOSL views
Iterative and Recursive Multi-Hop Routing
Dependencies of SLOSL view events
The OverML Languages
Example of an EDSL graph
Defining node attributes and messages in the Overlay Workbench
SLOSL statements in the Overlay Workbench
Visualising and testing SLOSL implemented topologies
EDSM definition in the Overlay Workbench
Example of an Attribute Dependency Graph for Multiple Views
Example of an Attribute Dependency Graph between SLOSL Clauses
Example of a merged Attribute Dependency Graph with an additional dependent attribute in the database
The per-node Gridkit software framework
8.9 Combined architecture of Gridkit and Node Views
EDGAR Graph for Chord Routing through SLOSL Views
EDGAR Graph for Multicast Routing through SLOSL Views
Explicit and implicit leaves in Scribe
Resilience of a Gnutella graph
Diagrams in UML
The history of UML

List of Listings

5.1 Chord Graph Implementation in SLOSL
De-Bruijn Graph Implementation in SLOSL
Pastry Graph Implementation in SLOSL
Pastry Leaf-Set Implementation in SLOSL
NALA Data Types
NALA Node Attribute Example
Chord Neighbour Table Implementation in SLOSL
SLOSL XML Representation of Chord Neighbour Table
EDGAR Example
HIMDEL Example
Python Example for Message Field Access through HIMDEL
Simple EDSM Example in EDSL
Example Scenario for SLOSL Topology Simulation
Template for Mapping SLOSL to SQL
XSL Template for Object Representation of SLOSL
Object Representation of a SLOSL Statement
A.1 OverML RelaxNG Schema
A.2 XSLT mapping from EDSL to the DOT language
A.3 XSLT mapping from EDSL to a flat EDSM graph
A.4 XSLT mapping from EDSL to a merged state graph
A.5 XSLT mapping from SLOSL to View Objects
A.6 XSLT mapping from HIMDEL to plain messages


Chapter 1

Introduction

Contents
1.1 New Services in Globalising Environments
1.2 Requirements on Overlay Networks
1.3 Integrative Design of Overlay Networks
1.4 Overview - how to read on

Since the early 1990s, participation in the Internet has been growing exponentially. Only a few years ago, the main drive came from the incremental inclusion of stationary PCs, either from individuals connecting through Internet service providers or from entire enterprise networks. Today, however, Internet-connected devices have become smaller and smaller and have thus entered our daily mobile life.

Following the history of networked computing, we can see a major trend starting from the "one computer, many users" pattern in the days of expensive and large mainframes. It shifts towards a "one computer, one user" paradigm in the home computer and PC era during the 1980s and 1990s and finally becomes "one user, many computers" in today's mobile environments. Powerful cell phones and pocket computers (like PDAs) have found wide-spread use. Wireless Internet access points are steadily increasing in density. Embedded devices like environmental sensors, surveillance cameras or stationary public phones are increasingly connected to the Internet. RFID tags are becoming more and more ubiquitous and, besides security and privacy concerns, raise questions about a suitable global infrastructure.

On the other side of the digital gap, poverty, socioeconomic background, disabilities and general knowledge deficiencies prevent major parts of the worldwide population from accessing learning resources [DOT70]. This was found to be a problem in post-industrial societies, but even more so in developing countries. Especially for the latter, there is a substantial dispute over whether the growing ubiquity of Internet access and the freely available resources in the World Wide Web can help in bridging the knowledge gap.

At the time of this writing, a number of projects world-wide are trying to provide people from rural areas with hardware and Internet connections. A common saying in this context is "One child, One laptop", which underlines the hope for a major educational impact. Some of these projects were thus initiated by governmental interest in education (such as the Simputer in India), others come from private initiatives (like the Children's Machine initiated by Nicholas Negroponte). While none of them is free of commercial interest, they all aim to give poor, under-educated people access to globally available knowledge and to enable their active participation in broader communication through cheap, portable computers. Wide-spread acceptance and increasing quantities are expected to further decrease the costs of these systems.

The aforementioned technology-centred initiatives have yet to see their break-through. However, certain results have already become visible and have proven to be rather promising. This further feeds the growing interest in connecting more and more people to knowledge resources. Global events like the World Summit on the Information Society, held by the United Nations in 2003, encourage world-wide discussion on this topic and may well lead to increasing support for sensible projects.

We currently see that a large part of the world population is still excluded from the usage of Internet resources. This simple fact suggests a non-negligible chance that today's Internet still has its major scalability issues lying ahead.

1.1 New Services in Globalising Environments

These emerging trends towards a globalising ubiquity of connectivity are and will continue to be a driving force in the increasing requirements on Internet services. The most widely used services today are e-mail (including mailing lists, news-groups and SPAM) and the world wide web. Although mail and web servers have shown high scalability in every-day situations, their scalability as centralised systems is still limited by hardware performance and financial considerations. Especially free and non-profit services, which continue to make up a major portion of the Internet, show these insufficiencies. They simply cannot afford to over-provision resources to satisfy extraordinary request peaks. The well-known results are slow and unpredictable responsiveness of mass mail servers and news-groups (SourceForge mailing lists, Yahoo-Groups, ...). Web servers have the same problems with flash crowds ("Slashdotting", reaction to software announcements, catastrophic news of mass interest, ...). This reveals the need for low-budget scalability of mass communication services.

Despite the fact that e-mail and the world wide web are still the most widely used services, the major part of the internationally generated network traffic (60% and more) is currently produced by file swapping networks such as Gnutella, KaZaA, EDonkey/Overnet or BitTorrent. Their whole purpose is to make large files (from megabytes to several gigabytes) available for efficient multi-source download. Since they came into use around the year 2000, these systems have already proven their scalability to more than a million concurrent users [SGG02], which is achieved mainly by the decentralisation of their download service. There is currently a major focus in research to see whether an equivalent decentralisation makes sense for other services in these systems, such as the search for files (as exemplified by Gnutella and Overnet), and how this can be achieved in a similarly scalable way. First attempts to decentralise e-mail [KRT03, MPR+03] and news dissemination [SMPD05] show the potential of server-free, decentralised services.

Another major part of the current network traffic comes from server-based file distribution, like software updates, CD/DVD image downloads, etc. While most of the files are still copied by FTP or HTTP to the clients or via RSync between mirrors, systems like BitTorrent show that a partially decentralised approach can achieve much higher scalability [QS04]. It uses a centralised server (or tracker) only for delegating requests for file slices to clients that have already downloaded them. The dominating work of copying these slices is done by the participating clients, which interact directly. This removes the need for expensive server replication and over-demand bandwidth provisioning on the side of the distributor. Newer versions of BitTorrent even remove the remaining bottleneck at the tracker by decentralising the client delegation. They use an overlay network similar to the one in the Overnet file sharing network to store the slice availability.

1.2 Requirements on Overlay Networks

Due to such promising achievements, interest in these decentralised overlay networks has exploded over the last years, both in research and in practically deployed applications. They are generally perceived as a comfortable building block for large-scale distributed systems. The proven mass scalability of Overnet, KaZaA or Gnutella, but also the extended BitTorrent design, show their appealing potential. They can be deployed where applications require decentralised communication over highly scalable, self-maintaining infrastructures.

Self-maintenance in particular is a crucial feature when it comes to large-scale systems. Overlay networks are commonly constructed based on distributed, self-organising algorithms that aim to achieve a high resilience against arbitrary failures. The ultimate goal is to build administration-free virtual networks that provide simple communication abstractions and allow the deployment of diverse applications on top of them.

The recently proposed systems are rather diverse in their major characteristics. An extensive overview is provided in the related work sections 9.1 and 9.2 of this thesis. One of the examples above was the Gnutella network, which deploys a power-law topology.
Despite the simplicity of its algorithms, it achieves impressively good resilience properties and a low diameter of the overall topology. Other overlays, like P-Grid [Abe01] or Chord [SMK+01], are organised as distributed trees or rings, which allows them to execute efficient lookup operations directly on top of their topology.

Having different characteristics also means that these networks solve different problems with varying effectiveness. It is rather difficult to find or build overlays that can fulfil the entire range of requirements for a given, non-trivial application. This holds especially in large systems which have to respond to the diverse interests of thousands to millions of users. Hybrid search systems like [LHSH04] are an example. They combine the characteristics of different types of overlays to achieve a similar quality-of-service for finding rare items as for the broader search of well replicated data.

The following (non-exhaustive) list tries to give an idea of the diversity of requirements that applications can pose on overlay networks. They may benefit from

- (average) single-hop delivery for reduced latency
- (provably) good resilience for high reliability and availability
- rapid reconciliation when nodes join or fail
- practical deployability (in a specific scenario)
- high scalability to millions of participants in the Internet
- optimal performance in small installations and ad-hoc networks
- low maintenance overhead - ping/ack at low rates (or none at all)
- low per-node degree and state
- high degree of freedom in optimal neighbour selection
- optimisability for high throughput or low latency
- good search properties for both needles and hay

Some of these properties are even mutually exclusive. Obviously, a combination of a low node degree and single-hop delivery will not result in a scalable topology. Therefore, each specific application that runs on top of an overlay will require a specific subset of these characteristics at the cost of others being ignorable or even impossible. This subset determines the choice of the right overlay.

As an example, overlay-multicast applications [RKCD01, ZH03] may prefer a low or a high node degree. A low degree generally leads to deeper graphs and higher end-to-end latency but allows for higher throughput at each node. A higher degree reduces the tree depth and therefore the latency and the probability of node failures along each path. It also tends to increase the resilience against failing neighbours. Varying the degree may yield the desired tradeoff between throughput on the one hand and latency and reliability on the other.

Distributed content-based publish-subscribe systems (see chapter 2) are another example. They base their routing decisions on filter matching. Filter updates, especially to new nodes, can be costly depending on the complexity of filters. In this setting, a small number of neighbours may turn out to be more desirable to keep the number of updates low.
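The degree tradeoff described above can be made concrete with a little arithmetic. The following Python sketch is only an illustration under simplifying assumptions that are not part of the original text: the multicast tree is complete and balanced, every inner node forwards to exactly `degree` children, and a node's uplink capacity is shared equally among its children. The function names `tree_depth` and `per_child_share` are hypothetical.

```python
def tree_depth(num_nodes: int, degree: int) -> int:
    """Hops from the root to the deepest node of a complete tree.

    Assumption: levels are filled completely before a new one is started,
    and every inner node has exactly `degree` children.
    """
    if num_nodes < 1 or degree < 1:
        raise ValueError("num_nodes and degree must be positive")
    depth = 0
    level_size = 1   # number of nodes on the current deepest level
    capacity = 1     # number of nodes the tree can hold up to this depth
    while capacity < num_nodes:
        level_size *= degree
        capacity += level_size
        depth += 1
    return depth


def per_child_share(uplink: float, degree: int) -> float:
    """Fraction of a node's uplink available per child (equal sharing assumed)."""
    if degree < 1:
        raise ValueError("degree must be positive")
    return uplink / degree


if __name__ == "__main__":
    # Compare a few fan-outs for one million participants.
    for degree in (2, 4, 16):
        print(degree, tree_depth(1_000_000, degree), per_child_share(1.0, degree))
```

Under these assumptions, a million nodes need 19 hops at degree 2 but only 5 hops at degree 16, while each child receives half versus one sixteenth of the parent's uplink. Single-hop delivery corresponds to the extreme case of a degree of `num_nodes - 1`, which illustrates why it cannot be combined with a low node degree.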
As with multicast, however, a higher degree may yield better performance for content forwarding, so the right tradeoff is again application specific. For relatively small overlays, single-hop delivery may be optimal and very desirable. However, the overlay may have to switch to multi-hop delivery when

the number of nodes increases dynamically or if the dynamism (or churn rate) of the network exceeds a certain threshold [RB04].

In many use cases, the members of an overlay network form subgroups at run-time, e.g. for better efficiency, resilience or security. This may include multicast groups, semantic clusters or functional clusters (like in Omicron [DMS04]). These subgroups will most likely have different requirements than the global overlay (after all, that is why they are formed in the first place). A low maintenance overhead is always desirable, while in small or well-connected intra-organisational installations other properties, like rapid reconciliation, may take precedence. A quality-of-service aware overlay must provide very different service types to applications, as different users usually have very different ideas about the QoS they need.

So far, these requirements stayed at a rather abstract level. The second chapter of this thesis aims to substantiate the diversity of the requirements posed by applications and the characteristics provided by overlay networks. It presents an extensive case study in the concrete field of publish-subscribe applications. By building up a meaningful set of quality-of-service metrics for this area, it provides a solid ground for the evaluation of overlay networks against concrete requirements. Section 2.3 uses the presented metrics for an exemplary evaluation of overlay based publish-subscribe systems. It tries to show the capabilities, weaknesses and limits of different systems.

From this analysis, it becomes obvious that no single overlay will ever satisfy the needs of all possible applications, not even in a single application domain such as publish-subscribe. Even within a specific application there may still be situations where a choice between different topologies makes sense.
Having to make this decision at implementation time (or even design time) is obviously premature, since it prevents tests with different topologies as well as an adaptation at run-time. As an approach towards the integration of similar overlay types, a common API for structured overlays was proposed by Dabek and others [DZDS03]. This allows an application to abstract from the specific details of different structured overlays and use the one at hand. The main problems, however, arise at a different level, which cannot be approached with an API. The available structured overlays continue to provide distinct implementations in different frameworks and different programming languages. Applications that want to deploy overlays can hardly support more than one framework in one language. This complicates comparisons and effectively prevents a hybrid combination of overlay systems. Even if it was possible to use them combined, the potential gain would be limited by the fact that each overlay requires its own maintenance algorithm and introduces its own overhead.

Due to the current complexity of overlay implementations, porting an overlay between frameworks basically means reimplementing it, which is a huge obstacle for application developers. In this context, the building block idea degenerates into an additional block that has to be built rather than a helpful support for the design of an application.

There were a few attempts to aid in the implementation of overlay networks, mainly JXTA [JXT, JXT03], iOverlay [LGW04] and Macedon [RKB+04]. Section 3.2 presents them in detail. However, the analysis following their description shows that they fail to make integrative, framework independent models available to overlay designers and application integrators.

1.3 Integrative Design of Overlay Networks

As we will see further on in section 3.4, implementing overlay networks is far from trivial. Implementations typically require between 10,000 and 30,000 lines of source code. As distributed networking systems, overlay implementations are often written in event-driven I/O patterns. This leads to counterintuitive code modularisation, cross-cutting interdependencies between modules and limited support for code reuse. These obstacles render the understanding of the actual system and its topological features difficult.

As noted in 3.1, a simplification in overlay design requires high-level models that capture the important characteristics of overlay networks. While the componentisation of overlay software must respect the event-driven nature of the running system, it is more important to encourage a clean, high-level design of the implementation. The goal is to make it understandable for other developers and application integrators without requiring additional, external documentation.

The need to integrate different overlays into adaptable hybrid systems poses further demands on the design process and its tool support. Different overlays must be able to share their system state.
Otherwise, the independent maintenance strategies cannot become aware of each other, which requires them to duplicate their effort. Shared state also reduces code redundancy between overlay implementations. Every participant has to store state about neighbours and maintain its connections to them. A common ground for handling neighbour state allows to move this functionality into middleware components. The main characteristics of overlays result from their topology. As we will see in chapter 3, however, the currently available models and abstractions for these systems do not capture this part of their design. If the topology is meant to become a readable (and thus understandable) part of overlay software, new models are needed to support its design. The topology perspective on overlay software, as described in section 4.1, is an approach to make the five main functionalities in overlay networks visible in

their implementation: topology rules, overlay routing, topology maintenance, topology adaptation and topology selection. Rules define the topology from the local point of view of each participating node. Routing describes how to use the rules to forward messages. Adaptation honours the fact that the rules may leave a certain degree of freedom in neighbour selection and forwarding decisions and tries to optimise the topology according to specific policies. Maintenance deals with the actions that must be taken when the local view breaks the rules. Selection finally deals with the integration of different topologies (i.e. rules, forwarding/maintenance schemes and adaptation policies) to let an application select the right topology for its current requirements.

Each overlay implementation must deal with these functionalities. This motivates the design of a domain specific language called SLOSL (described in chapter 5) for the platform independent implementation of rules and adaptation policies. It is easily extended to support routing decisions defined in the EDGAR language (6.3). To capture the entire design process of overlay implementations, and to provide high-level support for topology maintenance and selection, this thesis proposes a Model Driven Architecture based on SLOSL, the XML Overlay Modelling Language OverML (chapter 6) and the Node Views system architecture (4.3).

OverML is a set of five domain specific languages for the area of overlay design. The combination of these languages allows for far-reaching, semantically rich models of overlay systems.

NALA: The Node Attribute Language specifies data schemas for the attributes of overlay nodes. It serves as a common model for local state keeping.

SLOSL: The SQL-Like Overlay Specification Language defines local views for each node in the overlay. It describes the topology as the union of all local views and expresses topology rules and adaptation strategies.

EDGAR: The language defines Extensible Decision Graphs for Adaptive Routing. Based on the rules defined by SLOSL, it specifies the local routing decisions taken by each node to forward messages through the topology.

HIMDEL: The Hierarchical Message Description Language models layered messages based on NALA data types and SLOSL view data.

EDSL: The Event-Driven State-machine Language defines overlay protocols and connects implementation specific maintenance components with events defined by the architecture, especially by the languages SLOSL and HIMDEL.

Chapter 8, as well as the various examples in other chapters, show the broad applicability of these languages to the design of different overlays and their topologies.
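The idea of a local view, in which a topology rule decides which nodes qualify as neighbours and an adaptation policy chooses among the candidates, can be illustrated in a few lines. The sketch below is plain Python and deliberately does not attempt to reproduce SLOSL syntax. The attribute names (`alive`, `latency_ms`), the function `local_view` and the sample data are assumptions made up for this illustration.

```python
from typing import Callable, Dict, List

Node = Dict[str, object]  # attribute name -> value (hypothetical schema)


def local_view(known_nodes: List[Node],
               where: Callable[[Node], bool],
               rank_by: Callable[[Node], float],
               limit: int) -> List[Node]:
    """Select the neighbours a node keeps: filter, rank, then cut off."""
    candidates = [node for node in known_nodes if where(node)]
    return sorted(candidates, key=rank_by)[:limit]


known = [
    {"id": 1, "alive": True, "latency_ms": 80},
    {"id": 2, "alive": False, "latency_ms": 10},
    {"id": 3, "alive": True, "latency_ms": 25},
    {"id": 4, "alive": True, "latency_ms": 40},
]

# Rule: only live nodes qualify. Adaptation policy: prefer low latency.
neighbours = local_view(known,
                        where=lambda node: node["alive"],
                        rank_by=lambda node: node["latency_ms"],
                        limit=2)
print([node["id"] for node in neighbours])
```

The topology of the whole overlay would then be the union of the views that all nodes compute locally in this way.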

1.4 Overview how to read on

To give an idea of the diversity of requirements on overlay networks, the following chapter presents an extensive case study of a specific class of overlay applications: distributed publish-subscribe systems. First, it derives quality-of-service metrics that allow systems from this area to be compared. They are used in section 2.3 to show the differences between a number of well-known implementations that deploy decentralised overlay networks to support publish-subscribe at a global scale. Their different levels of quality-of-service make a case for middleware architectures that integrate different systems. Only a resource efficient integration allows applications to benefit from the diversity of features that different systems provide.

A new methodology for the integrative, platform-independent, high-level design of data-driven overlay networks is the major contribution of this thesis. Chapter 3 overviews different abstraction levels that are currently used in overlay software. It shows the deficiencies of recent frameworks when requiring a substantial simplification of the integrative, portable design of overlay networks. The main reason is that current frameworks have largely focused on networking and protocols. Section 4.1 replaces this focus by a new perspective that targets the main characteristics of overlay networks, which are determined by their topology.

Taking the topology perspective on overlays, chapter 4 derives a new architectural model for overlay design, the Node Views architecture. It underlies the domain specific OverML languages that are presented in chapter 6, with a special emphasis on SLOSL, a specification language for local decisions in topologies (chapter 5). The OverML languages form the foundation of the Model Driven Architecture proposed in this thesis. Chapter 7 finally sketches ways of implementing common communication patterns using the new architecture.
Chapter 8 presents the current implementation of an Integrated Development Environment (IDE) for overlay design, based on the OverML language, as well as a mapping of the abstract design models to executable implementations. It finishes by presenting a complete walk-through from the specification of an overlay to its deployment. Related work in different areas as well as an overview of available overlay networks is presented in chapter 9. The work is finally concluded by chapter 10.


Chapter 2

A Case Study: Publish-Subscribe Overlay Applications

Contents
2.1 The Publish-Subscribe Paradigm
2.2 Quality-of-Service in Publish-Subscribe
2.3 Publish-Subscribe on Overlay Networks

[Figure 2.1: Distributed communication paradigms. Panel labels: Request/Reply, Messaging, Anonymous Request/Reply, Publish/Subscribe. Axis labels: consumer initiated, producer initiated, full knowledge, brokered.]

Figure 2.1 shows the known paradigms for distributed communication. The two on the left are in widespread use today. Following the examples from the introduction, the e-mail service represents a messaging scheme, although the internal server-to-server interaction is based on request-reply. Web and

file servers are built on centralised request-reply interaction. In both cases, the producer of data (i.e. the server) knows all receivers that initiated a request and addresses them explicitly and directly for its reply, which is a major problem for server scalability.

Commercial web server clusters already take a step towards higher scalability through a variant of the anonymous request-reply scheme. They exploit the indirection provided by DNS and load balancers to decouple service names from servers and spread the load over a larger number of servers. Large sites like the Amazon web shop or the Google search engine are prominent examples that make tens of thousands of servers available under the same name. They can even be globally distributed over different locations, as Google shows. Addressing works in a randomised way based on implicit request semantics such as the host domain or preferred language of the client.

The original BitTorrent design uses request-reply for all services (finding sources and downloading), but the tracker randomises the delegation to download providers. Again, this shows ideas of anonymous (or delegated) request-reply, which is even more visible in the newer overlay design of the system. Current file swapping networks commonly use a delegated or anonymous form of request-reply for search and direct request-reply for download.

As the examples show, the flavours of request-reply are the most widely used paradigms. Simple variations of anonymous request-reply have been adopted fairly often in the form of DNS delegation or centralised load-balancing. Scalability is commonly achieved by delegating resource intensive tasks to a large number of nodes, while keeping a single entry point for clients. Overall, the technical trend clearly points towards reduced knowledge at the initiator.
Interesting and rather old examples are mailing lists and news groups, which deploy the fourth paradigm, publish-subscribe. Despite its appealing simplicity, this scheme is so powerful that it can provide advantages to many other Internet services, especially where scalability is an issue. The two brokered paradigms on the right virtualise and decouple the connection between consumers and producers of data. This inherently makes them very interesting for decentralised communication infrastructures.

2.1 The Publish-Subscribe Paradigm

The system model of the publish-subscribe communication paradigm is surprisingly simple. It provides three roles: publishers, subscribers and brokers. Publishers (aka producers) provide information, advertise it and publish notifications about it. Subscribers (aka consumers) specify their interest and receive

relevant information when it appears. Brokers mediate between the two by selecting the right subscribers for each published notification. In the following, the term client will additionally be used for publishers and subscribers to distinguish their roles from the broker infrastructure. A more thorough discussion of the publish-subscribe model is provided in [Fie05].

[Figure 2.2: The publish-subscribe model. Publishers connect to the broker infrastructure with advertise and publish; subscribers connect with subscribe and receive notify.]

This scheme decouples publishers and subscribers and thus simplifies their roles and interfaces considerably. The drawback is the infrastructure in the form of brokers that is necessary to provide the publish-subscribe service. In the mailing list example, the mailing list servers are brokers that mediate between authors and readers of e-mails, i.e. publishers and subscribers. Similarly, in IP-Multicast [Dee91], the IP routers take the role of brokers.

However, publish-subscribe is not limited to mailing lists, where the broker is a server that has a simple list of receivers available for each message type. This case is commonly known as centralised subject based publish-subscribe where consumers subscribe to a subject that producers explicitly assign to their notifications. In content based publish-subscribe, subscriptions express more general filters on the actual content of notifications. This scheme is both more powerful and more complex than fixed lists of recipients.

An important property of the publish-subscribe model is the level of abstraction at which publishers and subscribers communicate. They are not aware of the organisation of the infrastructure or of the size of the system. There can be a single centralised broker, a cluster of them or a distributed network of brokers.
All that participants see is their specific brokers through which communication partners are self-selecting by interest. This makes this model appealing for highly scalable systems that need to hide varying complexity and adaptable infrastructures from the participants. However, finding a suitable infrastructure organisation is not possible without further knowledge about the specific needs of the participants. If the infrastructure is supposed to provide a sufficient level of quality-of-service (QoS) to all participants independent of the current scale or infrastructural organisation, explicit QoS distinctions become vital to the system.
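The difference between subject based and content based subscriptions can be sketched with a single centralised broker. The class and method names below are hypothetical and chosen only for this illustration. A distributed broker network would additionally have to forward subscriptions and notifications between brokers, which the sketch leaves out.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Notification = Dict[str, object]
Callback = Callable[[Notification], None]
Filter = Callable[[Notification], bool]


class Broker:
    """A single centralised broker; clients never address each other."""

    def __init__(self) -> None:
        self._by_subject: Dict[str, List[Callback]] = defaultdict(list)
        self._by_filter: List[Tuple[Filter, Callback]] = []

    def subscribe_subject(self, subject: str, callback: Callback) -> None:
        """Subject based: the publisher assigns the subject explicitly."""
        self._by_subject[subject].append(callback)

    def subscribe_filter(self, matches: Filter, callback: Callback) -> None:
        """Content based: an arbitrary predicate over the notification."""
        self._by_filter.append((matches, callback))

    def publish(self, notification: Notification) -> None:
        # Subject matching is a dictionary lookup ...
        for callback in self._by_subject.get(notification.get("subject"), []):
            callback(notification)
        # ... while content matching evaluates every registered filter.
        for matches, callback in self._by_filter:
            if matches(notification):
                callback(notification)


broker = Broker()
subject_inbox: List[Notification] = []
content_inbox: List[Notification] = []
broker.subscribe_subject("weather", subject_inbox.append)
broker.subscribe_filter(lambda n: n.get("temperature", 0) > 30,
                        content_inbox.append)
broker.publish({"subject": "weather", "temperature": 21})
broker.publish({"subject": "sensors", "temperature": 35})
print(len(subject_inbox), len(content_inbox))
```

Neither publisher nor subscriber code refers to the other side, so the broker could be replaced by a cluster or a network of brokers without changing the clients.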

2.2 Quality-of-Service in Publish-Subscribe

The publish-subscribe model suggests a number of different quality-of-service (QoS) metrics, as described by the author in [BFM06]. Some are related to subscriptions and single notifications while others describe end-to-end properties of flows of notifications. The following sections describe the specific meaning of these metrics when requested by publishers or subscribers, in subscriptions, notifications or advertisements. Their context in publish-subscribe systems is briefly summarised at the beginning of each section. Their implementation in a distributed publish-subscribe system then has mainly two dimensions: the topological organisation of the brokers and the local decisions of each broker regarding filtering, scheduling, etc. Their interaction with and impact on the QoS metrics are briefly summarised in figure 2.3.

QoS at the Global Infrastructure Level

End-to-end latency, bandwidth and delivery guarantees form low-level properties of the broker infrastructure. In a centralised infrastructure in which the client connections also implement QoS, there are ways to impose hard limits on them. In any less predictable environment, especially distributed multi-hop infrastructures, they should not be understood as real-time guarantees. Here they become probabilistic options or even hints about preferences of clients. Note that even hints can be helpful to the infrastructure if it has to determine which notifications to drop from overfull queues or which broken inter-broker connection to repair first.

Latency

Subscriptions - Subscribers request a publisher with a maximum delay.

The end-to-end latency between producers and consumers depends on the number of broker hops between them, the travel time from hop to hop and the time it takes each broker to forward a notification.
In a centralised system, the travel time between broker and clients gives a hard lower bound and the additional forwarding time depends on the broker load. Even in a distributed infrastructure, measured lower bounds can give hints if a requested QoS level is achievable at all. In general, however, they do not allow to give absolute guarantees in distributed systems. A broker infrastructure can deal with latency requirements by pre-allocating fixed paths between senders and receivers. This avoids the overhead of having to establish connections on request. A further speed-up can be achieved by tagging notifications and merging them into channels [Fie05]. This approach effectively maps content-based filtering to subject-based filtering and thus simplifies the routing on each broker

QoS Metric | Topology Impact | Local Broker Decisions
Latency | end-to-end network latency and hop count, reorganisation delays, global load distribution, adaptation to physical topology, degree of freedom in routing | time complexity of filter evaluation, local routing choices
Bandwidth | end-to-end bandwidth, global load distribution, path independence, adaptation to physical topology, degree of freedom in routing | local routing choices
Message priorities | topology shortcuts | local scheduling
Delivery guarantees | reliability, reorganisation, path redundancy, trusted or reliable subgraphs | persistence and caching, routing decisions based on reliability, trust or reputation
Selectivity of subscriptions | adaptation to subscription language (e.g. subject grouping) | decidability, local load
Periodic or sporadic delivery | - | conversion
Notification order | single/multi-path delivery, consistency during reorganisation | reordering
Validity interval | deterministic delivery paths for follow-up messages | message drops after delay or on arrival of follow-ups
Source redundancy | path convergence | identity decision
Confidentiality | trusted subgraphs | encrypted filtering
Authentication and integrity | - | client authentication, access control

Figure 2.3: Relation of topologies and local broker decisions to QoS metrics: Impacts and implementation choices

along the path. Note, however, that such a mapping is not always possible as it can lead to state explosion [Müh02]. The reduced expressiveness of channel identifiers compared to content filters can also introduce excessive forwarding of unwanted information. The most efficient form is obviously a mapping to the network level through native IP-Multicast [Dee91, Hui99].

A combination of both measures may make it possible to reduce the path length by skipping those brokers that only forward a stream to one or very few connections. Still, the observed latency may vary considerably over time, as Internet paths commonly serve multiple concurrent streams. The optimal case is a physical network with native quality-of-service provisioning and IP-Multicast.

Bandwidth

Advertisements - Producers specify the minimum and/or maximum bandwidth necessary for the stream they produce.
Subscriptions - Subscribers restrict the maximum amount and size of notifications they want to receive per time unit.

The overall bandwidth used by the system depends on the throughput per broker and the size of each notification. Today's Internet-level connections tend to be sufficiently dimensioned to allow many concurrent high-traffic streams, but they usually do not provide physical quality-of-service and bandwidth provisioning. This holds especially when streams cross the borders of autonomous systems.

Therefore, bandwidth requirements should rather be regarded at a per-broker level. If each broker knows the bandwidth that it can make locally available to the infrastructure, this gives an upper bound for the throughput of a path. Although not necessarily accurate, such an upper bound allows notifications to be routed based on the highest free bandwidth on the neighbouring brokers. It can be used to avoid high-traffic paths and to do local traffic optimisation.
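The per-broker view of bandwidth can be sketched as follows. The broker names and bandwidth figures are made up, and both helper functions are hypothetical. The sketch also assumes that every broker can report a current value for its free bandwidth, which the text notes is not necessarily accurate.

```python
from typing import Dict, List


def path_bandwidth_bound(path: List[str],
                         free_bandwidth: Dict[str, float]) -> float:
    """The slowest broker on a path bounds the throughput of the whole path."""
    return min(free_bandwidth[broker] for broker in path)


def pick_next_hop(candidates: List[str],
                  free_bandwidth: Dict[str, float]) -> str:
    """Greedy local choice: forward to the neighbour with the most free bandwidth."""
    return max(candidates, key=lambda broker: free_bandwidth[broker])


free = {"b1": 40.0, "b2": 5.0, "b3": 25.0}  # Mbit/s, made-up values

print(path_bandwidth_bound(["b1", "b2", "b3"], free))  # bounded by b2
print(pick_next_hop(["b2", "b3"], free))               # avoids the loaded broker
```

The greedy choice only uses local knowledge about neighbours, so it gives no end-to-end guarantee. It merely steers traffic away from brokers that are known to be loaded.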
If channel merging is applied (as described for latency), the channels can be tested for their capability of providing the required bandwidth. However, the observed bandwidth in general purpose networks may show high variations over time that cannot be foreseen.

Message priorities

Notifications - Producers specify the relative priority between notifications that they produce themselves or their absolute priority compared to other (foreign) notifications.
Subscriptions - Subscribers specify the relative priorities between their subscriptions or the absolute importance of a subscription.

As with latency and bandwidth, message priorities have both a per-broker side and an end-to-end side to them. They are, however, easier to establish in a distributed broker network than latency bounds, as their conception does not aim to provide absolute or real-time guarantees.

Priorities between notifications can be used to control the local queues of each broker which will eventually lead to their end-to-end application along a path. Again, channels can be used to shorten the path for high-priority notifications, the extreme case being to send them directly from the system entry point to the recipients. More commonly, however, high-priority notifications will be allowed to overtake those with a lower priority during the forwarding and filtering process at each broker, subject to a weighted scheduling policy. This increases the importance of their per-broker part.

Absolute subscription priorities can be merged on their delivery paths. Only the maximum priority is required at the publisher end. Brokers further down the path can store the more exact priorities for their forward scheduling to local subscribers and neighbouring brokers.

Delivery guarantees

Subscriptions - Subscribers specify which notifications they must receive, which are less important, and where duplication matters.
Notifications, Advertisements - Producers can specify if receivers should be guaranteed to receive their notifications.

These guarantees can be as simple as a tag for notifications stating if they may be dropped on the delivery path or not. It is even straightforward to merge this tag from multiple subscriptions along the delivery path with a simple boolean and. This approach is especially applicable in combination with message priorities, where the least important message is the first to be dropped. More demanding guarantees regard the completeness and duplication of the delivery.
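The merging rules for priorities and drop tags, and the overtaking of low-priority notifications in a broker queue, can be sketched in a few lines. All names are hypothetical. For brevity, the queue below uses strict priority ordering rather than the weighted scheduling policy mentioned in the text, so it should be read as a simplification.

```python
import heapq
from typing import List, Tuple


def merge_priorities(priorities: List[int]) -> int:
    """Upstream brokers only need the highest priority on the delivery path."""
    return max(priorities)


def merge_droppable(flags: List[bool]) -> bool:
    """A notification may be dropped only if every subscription allows it."""
    return all(flags)


class ForwardingQueue:
    """Strict priority queue, first-in first-out among equal priorities."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._counter = 0  # arrival order breaks ties between equal priorities

    def push(self, priority: int, notification: str) -> None:
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, self._counter, notification))
        self._counter += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]


queue = ForwardingQueue()
queue.push(1, "low-a")
queue.push(5, "urgent")
queue.push(1, "low-b")
order = [queue.pop() for _ in range(3)]
print(order)
print(merge_priorities([2, 7, 4]), merge_droppable([True, False]))
```

A weighted policy would additionally reserve a share of the forwarding capacity for lower priorities so that they cannot be starved by a continuous stream of urgent notifications.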
Notifications can be delivered at least once, at most once or exactly once to a subscriber, the latter being the combination of the first two. If the infrastructure is not reliable itself, the first requirement can be achieved by meshing which increases the delivery probability at the cost of generating duplicates and thus increasing the message overhead. A request for at most one delivery encourages either single path delivery or duplicate filtering before the arrival at the subscribers. In infrastructures with unreliable brokers, reputation systems or availability histories may allow to route notifications through more reliable brokers first. This lowers the chance of brokers failing during delivery or maliciously suppressing notifications. However, this should be taken as a hint and by no means as a guarantee since a history does not allow a reliable prediction of the future availability or behaviour of network and brokers.
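Duplicate filtering in front of a subscriber, as needed for at-most-once delivery over a meshed infrastructure, can be sketched as follows. The sketch assumes that every notification carries a unique identifier, which is an assumption of this example and not something the text prescribes. The class name is hypothetical.

```python
from typing import Hashable, Set


class DuplicateFilter:
    """Passes each notification identifier once and suppresses later copies."""

    def __init__(self) -> None:
        self._seen: Set[Hashable] = set()

    def accept(self, notification_id: Hashable) -> bool:
        if notification_id in self._seen:
            return False  # a copy already arrived over another path
        self._seen.add(notification_id)
        return True


# Meshing delivers copies over several paths; the filter keeps one of each.
arrivals = ["n1", "n2", "n1", "n3", "n2"]
dedup = DuplicateFilter()
delivered = [n for n in arrivals if dedup.accept(n)]
print(delivered)
```

Combined with meshing for at-least-once delivery, such a filter approximates exactly-once delivery. The set of seen identifiers grows without bound in this sketch, so a practical broker would have to expire old entries.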

Another problem with delivery guarantees regards temporarily disconnected subscribers [CFH+03], e.g. in a mobile environment. The broker infrastructure must store notifications that were guaranteed to be delivered at least once until the subscribers become available again. Note that this may be after an arbitrarily long time or never, so in practice, the infrastructure may choose to store notifications only for a reasonable time interval.

Finally, it is possible to define a quorum, i.e. to specify the minimum or maximum number of receivers that must receive a notification or the minimum or exact number of senders that a subscriber wants to subscribe to. In a decoupled, brokered environment like publish-subscribe systems, where publishers and subscribers are not supposed to know anything about each other, this criterion should only be available at an administrative level, e.g. within scopes [FMG03].

Note that the requests of subscribers override the advertisements of publishers. If a publisher requests guaranteed delivery and all subscribers agree that its notifications may be dropped, the infrastructure may ignore the request of the publisher. Subscriber requests are generally more important than publisher requests, which become not much more than a hint or default to the infrastructure.

QoS at the Notification and Subscription Level

A number of QoS properties touch the semantics of subscriptions and notifications. They are periodic or sporadic delivery, the order in which notifications arrive, priorities between them, the duration of their validity and redundancy of producers. Security issues like authentication or confidentiality also fall into this scheme. A very important factor is the selectivity of subscriptions.

Expressiveness and Selectivity of Subscriptions

Subscriptions - Subscribers define their subscriptions in a specific language.
Subscriptions can be expressed in different classes of languages, thus allowing different levels of expressiveness. Simple subscriptions can contain a subject for identity matches, whereas more complex ones can be augmented by further attribute restrictions. The most complex subscriptions possible are filters implemented as Turing complete computer programs. All of these languages have a specific level of complexity with respect to filter identity tests and merging, distributed matching, global message overhead and false positives on delivery. The filter language therefore represents a tradeoff between the local overhead at brokers or subscribers and the distributed overhead inside the broker infrastructure. In general, more complex

languages can support a higher expressiveness and therefore a higher selectivity of subscriptions. This allows them to reduce the number of false positives (or uninteresting notifications), but tends to make filter optimisations and distributed matching more difficult.

The language is not necessarily the same within the entire infrastructure. Scoped subgraphs [FMG03] may decide to provide languages internally that have a different expressiveness than the outside world, and then convert between the two at the interfaces. Similarly, a broker may decide to accept complex subscriptions from its local subscribers and forward simplified versions that remote brokers understand or that are more suitable for the topology. If it then filters incoming notifications based on the more expressive subscriptions, it can increase the satisfaction level of the local subscribers.

Periodic or sporadic delivery

Advertisements - Producers advertise their way of publishing.
Subscriptions - Subscribers specify which notifications they want to receive sporadically and which they need periodically.

This captures the difference between an interest in changes of information and the information itself. Periodically published notifications become a data flow that represents the status of requested information at the moment of each publication. Sporadically published notifications occur only when this information changes. One way of looking at them is as prefiltered events that pass when the data change exceeds a certain threshold. This is an interesting option for reducing the amount of data sent by accepting a certain inaccuracy.

The infrastructure may either match subscriptions to corresponding publishers or try to emulate the requested delivery mode by itself. Periodic notifications may be emulated by storing and repeating the latest sporadic notification.
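Both directions of this emulation can be sketched for numeric values. The function names, the threshold and the sample readings are assumptions of this example. Representing time as a list with one slot per period is a simplification that replaces real timers.

```python
from typing import Iterable, Iterator, List, Optional


def to_sporadic(values: Iterable[float], threshold: float) -> Iterator[float]:
    """Forward a periodic value only if it differs from the last forwarded
    value by more than the threshold."""
    last: Optional[float] = None
    for value in values:
        if last is None or abs(value - last) > threshold:
            last = value
            yield value


def to_periodic(slots: List[Optional[float]]) -> List[Optional[float]]:
    """One slot per period, None meaning 'nothing was published'.
    Each empty slot is filled by repeating the latest known value."""
    result: List[Optional[float]] = []
    latest: Optional[float] = None
    for value in slots:
        if value is not None:
            latest = value
        result.append(latest)
    return result


readings = [20.0, 20.1, 20.2, 21.5, 21.6]
print(list(to_sporadic(readings, threshold=0.5)))
print(to_periodic([None, 3.0, None, None, 4.0]))
```

The threshold trades accuracy for message volume: in this run, two of five readings are forwarded, and the subscriber's view is off by at most the threshold.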
Sporadic delivery can be based on a configurable threshold that blocks the delivery of periodically published static values [MUHW04]. Note that in a content-based system with arbitrary filters, the necessary comparison of notifications may be restricted to an identity test or may not be possible at all.

Notification order

Advertisements - Producers advertise for which of their notifications the ordering matters.
Subscriptions - Subscribers specify in their subscriptions which notifications they want to receive in order.

The order in which notifications arrive may or may not be relevant. In many cases, ordering is easy to achieve by either using centralised ordering, ordered

transports (ATM, TCP) or by letting the producer (or its broker) impose an explicit order and sorting the notifications on delivery. Special care must be taken if the broker topology is allowed to change during the delivery process or if priorities are used, as both may impact the order in which notifications arrive and may even result in message loss. Since ordered delivery does not imply guaranteed delivery, a very efficient solution is to deliberately drop notifications if a successor has already been delivered. If this is not acceptable, notifications must be reordered before delivery, which may introduce arbitrarily long delays.

The distributed ordering of events coming from different sources is another problem. For content-based subscriptions, it is even hard to define a meaningful ordering in this case. It is therefore largely dependent on the subscription language and the application if such an order is applicable. A generic (and very simple) approach is the deployment of a dedicated, central broker to enforce a global ordering, which can in turn limit the scalability of the overall infrastructure. Note, however, that this would only impact notifications for which ordering was requested.

Validity interval

Notifications, Advertisements - Producers advertise or specify a timeout for their notifications, or a successor message that renders them irrelevant.

It is important for the infrastructure to know how long a notification stays valid, either specified in terms of time or inferred by the arrival of other (follow-up) messages. A validity based on a hop count (as used in many low-level transport protocols) would be meaningless in the decoupled publish-subscribe model. If only the most recent event is of interest, the validity specification by follow-up messages is a particularly efficient approach. It allows the infrastructure to reorder and shorten its queues in high traffic situations.
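Queue shortening by follow-up messages can be sketched with a queue in which a follow-up replaces its queued predecessor. The class name and the use of a key (for example a sensor identifier) to recognise follow-ups are assumptions of this example. How follow-ups are identified in practice depends on the subscription language.

```python
from collections import OrderedDict
from typing import Hashable


class SupersedingQueue:
    """Outgoing queue in which a follow-up replaces its queued predecessor."""

    def __init__(self) -> None:
        self._pending: "OrderedDict[Hashable, object]" = OrderedDict()

    def push(self, key: Hashable, notification: object) -> None:
        # Assigning to an existing key keeps its position in the queue,
        # so frequent updates cannot push a key back indefinitely.
        self._pending[key] = notification

    def pop(self) -> object:
        _key, notification = self._pending.popitem(last=False)  # oldest key first
        return notification

    def __len__(self) -> int:
        return len(self._pending)


queue = SupersedingQueue()
queue.push("room1", 20)
queue.push("room2", 18)
queue.push("room1", 21)   # follow-up: the value 20 is no longer of interest
print(len(queue))
first, second = queue.pop(), queue.pop()
print(first, second)
```

Under load, the queue length is then bounded by the number of distinct keys rather than by the publication rate.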
Source redundancy
Subscriptions - Subscribers request redundant sources for the same event.

An example of redundant sources is a set of temperature sensors inside a room that publish more or less the same value. They generate semantically similar notifications, which allows subscribers to double-check events. In some cases, the increased fault tolerance may be sufficient to replace at-least-once guaranteed delivery. Note that a request for redundancy requires the notifications or advertisements of publishers to be comparable based on a subscription. Rice's Theorem³ implies that complex subscription languages may hinder or prevent this. In this case, explicit hints such as subjects or type identifiers are required to assure the comparability of notifications.

Confidentiality
Subscriptions - Subscribers encrypt their subscriptions or send them only to trusted brokers.
Notifications - Publishers connect only to trusted brokers or send partially or completely encrypted notifications.

Confidentiality can obviously be achieved in (sub-)networks that contain only trusted brokers [FZB+04]. Similarly, any message can be hidden from untrusted brokers on the delivery path by using standard encryption mechanisms between trusted brokers. Untrusted brokers that cannot read and evaluate the content are then forced to broadcast to all of their neighbours. Apart from higher load at those brokers, this also introduces potentially large numbers of duplicates in the system.

Confidentiality becomes a difficult problem if untrusted brokers must be allowed to participate in the matching process. This is a minor problem in subject-based publish-subscribe, where the low expressiveness of subscriptions makes the actual content of notifications opaque to the matching process. It can just as well be encrypted, which may be exploited within untrusted broker networks. A reduction of content filters to channels can achieve the same effect. In both cases, the information gain of untrusted brokers depends on the amount of information revealed by the subject or the channel ID. The selectivity of subjects can therefore be seen as a tradeoff between message overhead and revealed information. Untrusted brokers can thus be prevented from overhearing the information, but malicious brokers can still suppress messages in this scenario.

Content-based matching requires much higher insight into the content of notifications.
A number of recent publications from the database area show ways of matching encrypted data against certain types of queries [AKSX04, DAK00, AKD03] without revealing either of the two. Publish-subscribe systems could apply similar schemes to allow blind matching on untrusted brokers. However, solving this problem for arbitrary queries and data without revealing any information about them is likely impossible.

Authentication and Integrity
Subscriptions - Subscribers authenticate their subscriptions.
Notifications - Publishers authenticate their notifications.

³ theorem&oldid=

If publishers want to enforce access control mechanisms [BEP+03], it becomes necessary for subscribers to authenticate their subscriptions using techniques like digital signatures. On the other side, publishers can sign or watermark their notifications to assure data integrity. The evaluation can then either be done end-to-end or within the broker infrastructure.

Discussion

The analysis presented in this chapter provides extensive insights into the diversity of requirements in publish-subscribe systems with respect to the broker implementation and the topological organisation of distributed broker networks. It must be noted that very few quality-of-service levels can be enforced locally on each single broker, in which case they are mainly related to the filtering process, to scheduling concerns or to local delivery. On the other hand, almost all metrics are impacted by the topology of the broker infrastructure.

One of the major limiting factors is the overall message overhead. Message multiplication in multi-hop infrastructures reduces the bandwidth that is available for each unique message. At the same time, it increases the per-broker load and therefore the end-to-end latency. It can even impact delivery guarantees if brokers become overloaded and are forced to drop messages.

Two major factors impact the global message overhead of a publish-subscribe architecture: the expressiveness of the subscription language and, again, the topology of the broker network. While subscription languages are an interesting topic for further investigation, they are also out of scope for this thesis. We will therefore continue with a strong focus on those QoS constraints for which the topological infrastructure is the dominating factor.

The topology of the broker infrastructure has been a major playground for research in recent years. Overlay networks were widely adopted as the basic building block for publish-subscribe systems.
The current state of the art provides a system designer with a considerable body of choices amongst the variety of different overlay implementations. The next section gives an overview of how overlay networks have been used to implement scalable publish-subscribe systems and evaluates the QoS capabilities they provide.

2.3 Publish-Subscribe on Overlay Networks

An important development in recent years was the introduction of overlay networks as building blocks for the distributed broker infrastructure of publish-subscribe systems [PB02, TBF+04]. This allows new approaches to distributed filtering as well as a design simplification of distributed publish-subscribe systems. The first attempts were targeted at replacing application-level multicast, while later approaches aimed to support content-based publish-subscribe filtering. The following sections describe different systems and mention the specific quality-of-service properties that they provide for broker infrastructures.

Multicast Approaches

While many autonomous systems in the Internet have IP-level multicast enabled, a successful implementation across borders is still not available. It is even unpredictable whether the switch to IPv6 will bring substantial advances for the support of native multicast at the physical layer, as its availability to customers will continue to depend on commercial incentives, network management and security concerns of Internet service providers. This is why many distributed applications have long used application-level multicast for their needs. Overlay networks are expected to simplify the design of these applications by providing generic infrastructures as building blocks.

While rather limited from a publish-subscribe point of view (low expressiveness of subscriptions), multicast has the advantage of being easily and efficiently implemented on any key-based routing (KBR) network (see 9.2).

Selectivity
The selectivity of subjects (or multicast groups) is relatively low. This usually means that subscribers are left to do their own filtering. On the other hand, if expensive subscription languages render matching and optimisations difficult, it may be worth considering a combination with subjects to provide a fallback through subject matches. This may prevent broadcasting in many cases, which can considerably reduce the overhead of messages and filtering.

Multicast was one of the major applications proposed for various KBR networks.
There are two different ways of implementing multicast on top of them: overlay internal and overlay external.

Overlay external multicast

These implementations use a global overlay to index groups and then simply create a separate broadcast overlay for each multicast group. This was exemplified by the CAN overlay [RHKS01], but is obviously possible with any other overlay that supports broadcasting. A second example is CAM-Chord [ZCLC05], which uses Chord for capacity-adapted multicast. The major advantage of external multicast is that each group overlay only has to scale with the size of a group. A disadvantage is the increased overhead of maintaining multiple independent overlays simultaneously. If the overlay is as resource consuming as CAN, external multicast may become very expensive [CJK+03].

Delivery guarantees
All participants in the group overlay are known to be interested in all messages. This diminishes the cost of deliberate duplications, which can be exploited to increase the probability of delivery, even under topological reorganisation. The redundancy that is already available in a resilient overlay can directly be exploited to assure the broadcast delivery to all live parties as long as the network stays connected. On the other hand, most KBR broadcast schemes can efficiently avoid message duplication as long as there are no reorganisations during the delivery process. If a subset of the group members caches the received notifications, a node that missed notifications during a temporary failure can ask any of the other members for past notifications, possibly using a random walk or an attenuated broadcast scheme. Similar ideas were presented in [CFH+03].

Latency
The group overlay has the same size as the group. This minimises the end-to-end hop count, which most likely becomes smaller than in the global overlay. Obviously, this reduces the number of required routing decisions and reduces the end-to-end latency for the given topology. Note that the group overlays are independent of the global overlay and may even use a different topology to optimise for their specific group size. Overlay broadcast commonly features very good load balancing.

Confidentiality
This scheme effectively prevents non-subscribed brokers from seeing the notifications. If some form of access control is used in the subscription process, untrusted brokers can be prevented from joining the multicast network. This enables distribution networks of trusted brokers.

Selectivity
Apart from the matching process executed for subscriptions, the distribution scheme is agnostic to the content of notifications.
This allows the deployment of arbitrary delivery algorithms and additional subscription languages within the group. Systems like Choreca (2.3.5), which are built on top of broadcast delivery, can run on top of the external multicast scheme to extend its capabilities to content-based filtering. However, this requires a multi-step subscription process: a subject-based subscription to find the multicast group and a content-based subscription within the group. Depending on the application, this can be an improvement if used to reduce the number of brokers that must touch a content-based subscription. In other content-based filtering scenarios that do not benefit from subjects, the doubled subscription overhead may become a bottleneck for the overall system performance.
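The multi-step subscription process can be pictured with a small Python sketch. It assumes that the global overlay behaves like a plain index from subjects to group overlays and that content-based subscriptions are simple predicates. The names `GlobalIndex` and `GroupOverlay` are hypothetical and chosen for illustration only, and no actual network communication is modelled.

```python
class GroupOverlay:
    """Stands in for the separate broadcast overlay of one multicast group."""

    def __init__(self, subject):
        self.subject = subject
        self.filters = {}            # subscriber -> predicate over notifications

    def publish(self, notification):
        # Content-based filters are evaluated inside the group only.
        return sorted(s for s, accepts in self.filters.items() if accepts(notification))


class GlobalIndex:
    """Stands in for the global overlay that indexes groups by subject."""

    def __init__(self):
        self.groups = {}

    def subscribe(self, subscriber, subject, predicate=lambda notification: True):
        # Step 1: subject-based subscription locates (or creates) the group.
        group = self.groups.setdefault(subject, GroupOverlay(subject))
        # Step 2: content-based subscription is installed within the group.
        group.filters[subscriber] = predicate
        return group

    def publish(self, subject, notification):
        group = self.groups.get(subject)
        return [] if group is None else group.publish(notification)


index = GlobalIndex()
index.subscribe("alice", "temperature", lambda n: n["value"] > 25)
index.subscribe("bob", "temperature")
index.subscribe("carol", "humidity")
hot = index.publish("temperature", {"value": 30})
mild = index.publish("temperature", {"value": 20})
```

Only members of the temperature group ever evaluate a content filter for these notifications. This is the reduction in touched brokers mentioned above, bought with one extra subscription step per subscriber.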

Overlay internal multicast

Overlay internal multicast uses features of the respective overlay to emulate or establish groups within the overlay itself. Since DHTs provide efficient intrinsics for lookup operations, the obvious approach is to have notifications and subscriptions meet at the node that a lookup yields for the multicast address, the rendez-vous node. Examples are Scribe [RKCD01] or Hermes [PB02] on Pastry [RD01], Bayeux [ZZJ+01] on Tapestry [ZKJ01] and a number of proposals for Chord [SMK+01], like [KR04].

These systems use two different schemes for building the multicast tree: Scribe exemplifies reverse-path forwarding (RPF) and Bayeux uses forward-path forwarding (FPF). Scribe's RPF uses the reverse path from the subscriber to the rendez-vous node to send notifications back, while FPF sends a routing message from the rendez-vous node back to the subscriber to find a new forward path. The latter tends to be more efficient in the beginning of the path as overlays commonly optimise routes in this direction. However, due to the topological organisation of the underlying DHTs, it is less efficient than RPF at the end of the path. Both schemes are described and compared in [ZH03], where a hybrid scheme is proposed to overcome their respective disadvantages.

All of these systems use distribution trees (commonly one per group) that are an integral part of the overlay topology. This usually leads to a tree depth of O(log N) where N is the number of participants in the system. The common disadvantage of these systems is that intermediate nodes must forward messages that they are not interested in. Above all, this concerns the root node of the respective tree, which even represents a single point of failure that must see all notifications and subscriptions for the group to assure correctness and completeness of delivery.

Latency
The global multicast network is potentially much larger than each of its groups.
This generally increases the end-to-end latency experienced in each group. Also, close to the rendez-vous node, the total number of subscriptions of a node's neighbours is likely to be higher than elsewhere in the network. This further increases the load of those nodes and therefore the overall latency.

Delivery guarantees
Again, due to higher load close to the rendez-vous node, the delivery guarantees may be impacted. Message duplication is more likely than in the external scheme, since the probability of topological reorganisation increases with the number of participants. On the other hand, a higher number of participants provides higher redundancy in the overall network that can be exploited to re-enhance the probability of delivery. However, achieving this is at least as hard as in the dedicated broadcast networks of external multicast.
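The reverse-path forwarding scheme of overlay internal multicast can be sketched in a few lines of Python. The sketch assumes a toy identifier ring of 16 nodes that are all present, a Chord-like greedy routing step, and a group key that is used directly as the identifier of the rendez-vous node. The names `next_hop` and `Group` are hypothetical. It is an illustration of the idea, not the Scribe protocol itself.

```python
M = 4                      # identifier bits of the toy ring
SIZE = 2 ** M              # all 16 identifiers are assumed to be live nodes


def next_hop(node, key):
    """Greedy step towards key: jump by the largest power of two that fits."""
    distance = (key - node) % SIZE
    if distance == 0:
        return node
    step = 1 << (distance.bit_length() - 1)
    return (node + step) % SIZE


class Group:
    def __init__(self, key):
        self.rendezvous = key % SIZE   # node that a lookup yields for the group
        self.children = {}             # node -> set of downstream tree nodes
        self.subscribers = set()

    def subscribe(self, subscriber):
        """Route towards the rendez-vous node and record the reverse path."""
        self.subscribers.add(subscriber)
        node = subscriber
        while node != self.rendezvous:
            parent = next_hop(node, self.rendezvous)
            in_tree = parent in self.children or parent == self.rendezvous
            self.children.setdefault(parent, set()).add(node)
            if in_tree:
                break                  # the rest of the path already exists
            node = parent

    def publish(self):
        """Push a notification down the tree; return receivers and forward count."""
        delivered, forwards = [], 0
        stack = [self.rendezvous]
        while stack:
            node = stack.pop()
            if node in self.subscribers:
                delivered.append(node)
            for child in self.children.get(node, ()):
                forwards += 1
                stack.append(child)
        return sorted(delivered), forwards


group = Group(key=0)
for member in (5, 7, 13):
    group.subscribe(member)
receivers, forwards = group.publish()
```

In this run, node 15 ends up as an inner tree node although it never subscribed, which illustrates the forwarding burden on uninterested nodes that is described above. Every notification also passes through the rendez-vous node 0.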

Selectivity
Due to the use of explicitly allocated delivery paths, this multicast scheme is easily extended with more advanced filtering techniques. Hermes (2.3.2) is a good example that matches on subjects and attribute-value pairs, thus providing different grades of content-based filtering with little overhead.

Discussion

It is not surprising that external multicast appears to outperform the internal scheme in terms of quality-of-service support. It simply uses a partial replica of the global overlay, so it immediately gains the same general properties as the internal multicast without requiring additional overhead at uninterested nodes. Still, the evaluation in [CJK+03] suggests that the internal scheme has the advantage of avoiding additional maintenance overhead compared to multiple overlays in the external scheme. It also notes that this is paid for by longer end-to-end paths.

Intermediate schemes as proposed for Scribe [RKCD01] can reduce the path length by skipping intermediate nodes and merging their children. Since this deviates from the original overlay topology, additional maintenance is necessary to handle these optimisations. Consequently, this approach can also be seen as the construction of a new overlay where the bootstrap overhead is reduced by reusing known nodes. We can therefore suspect the transition between the two methods to be rather smooth.

When deciding on the right tradeoffs, it must also be noted that current evaluations only consider a single overlay topology for the global and all group overlays in external multicast. It would be interesting to see comparisons where the overlay topology is chosen specifically for the size, selectivity and other quality-of-service requirements of the respective multicast group. Especially the support for arbitrary subscription languages inside of multicast groups could tip the balance in some applications.
The Node Views approach, as presented in this thesis, makes the implementation and integration of different overlay topologies simple and thus enables this kind of comparison. As CAM-Chord [ZCLC05] shows, an inherent advantage of overlay external multicast is the simpler design of overlay topologies and their implementation. Their specialised simplicity helps when introducing new features.

Hermes

Hermes [PB02] was the first system to exceed the limitations of overlay multicast by implementing a form of content-based publish-subscribe (named type and attribute based filtering) on top of overlay networks. It uses the same multicast scheme as Scribe, but installs content-based filters along the delivery path that filter on additional attributes of the notifications. Filter merging is possible along the path to reduce the state overhead per node. Apart from these features, Hermes has the same characteristics as Scribe and other overlay internal multicast systems.

Selectivity
Filtering on arbitrary attributes along the multicast paths substantially increases the selectivity of subscriptions and therefore reduces the number of false positives. The downside is an increased message overhead for filter updates compared to multicast systems.

Confidentiality
Compared to multicast, the brokers in content-based publish-subscribe systems require higher insight into the notifications. Hermes uses the overlay internal multicast scheme. This means that it can only make the tradeoff between confidentiality and message overhead by letting untrusted brokers either broadcast completely encrypted messages or forward them exclusively based on their subject without revealing further content.

IndiQoS

IndiQoS [CAR05] reuses the design of Hermes and augments it with quality-of-service awareness, as first presented in [AR02]. The authors distinguish between a content profile (such as the precision of a sensor) and a QoS profile, which allows subscribers to request periodic delivery or latency bounds. The IndiQoS description also refers to bandwidth as an additional metric. QoS requirements are expressed as additional attributes of subscriptions and advertisements that are evaluated during forwarding. As a further improvement, IndiQoS replicates the rendez-vous node and lets the broker infrastructure select an appropriate one based on the requirements of subscribers. This is obviously paid for by a multiplication of the message overhead per notification.

Selectivity
IndiQoS inherits the selectivity properties of Hermes.

Latency
Subscribers can explicitly specify a maximum latency.
Brokers forward subscriptions according to latencies known from prior advertisements. The rendez-vous replication then enables a lower average latency.

Rebeca

The Rebeca system [MFB02, Reb] is somewhat of an outsider in this list since it was initially developed as a static broker network, where joining and leaving was limited to the leaves of the delivery tree. It supports content-based publish-subscribe and was intended as a proof-of-concept system for filter merging and administration in publish-subscribe systems.

Bandwidth, Latency
The major disadvantages result from the static distribution tree that currently lacks support for self-maintenance. If applicable, however, such a static environment allows a higher predictability of available resources than a continuously changing and adapting overlay. Where a dynamic overlay must reallocate resources when delivery paths change, a static installation does this only when new resources are requested or old ones deallocated. If nodes are reliable and resource allocation or rerouting is expensive, this can be an interesting alternative to dynamic overlays.

Selectivity
Rebeca supports arbitrary filters as subscriptions, which can provide a high selectivity and therefore a low rate of false positives. On the downside, highly expressive filters may prove opaque to filter merging.

Choreca

Choreca [TBF+03] is an attempt to map the ideas of Rebeca onto a dynamic overlay. It routes content-based subscriptions and notifications over a Chord broadcast tree [EAABH03] of depth O(log N). The main feature is the usage of redundant trees, one rooted in each publisher. The overlap of these trees makes it possible to exploit filter similarities to reduce the filter forwarding overhead. Compared to the multicast-based systems, Choreca avoids any single point of failure.

Delivery guarantees
The path redundancy in Choreca's Chord graph lies within O(log N). To assure the eventual delivery of a notification in the case of a lower number of link failures, it is sufficient to reroute it through a different neighbour than the failed next hop. This allows Choreca to provide a very high probability of delivery.
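The rerouting idea can be illustrated with a simplified Python sketch of greedy routing on a Chord-like ring. It assumes a fully populated ring of 32 identifiers in which every node knows the neighbours at distances 1, 2, 4, 8 and 16. The function names are hypothetical, and the sketch covers point-to-point forwarding only, not Choreca's broadcast trees or filter handling.

```python
M = 5                      # identifier bits of the toy ring
SIZE = 2 ** M              # all 32 identifiers are assumed to be nodes


def fingers(node):
    """Neighbours at distances 1, 2, 4, ... as in a Chord finger table."""
    return [(node + (1 << i)) % SIZE for i in range(M)]


def route(source, target, failed=frozenset()):
    """Return the hop list from source to target, or None if routing gets stuck."""
    path = [source]
    node = source
    while node != target:
        remaining = (target - node) % SIZE
        # Live neighbours that do not overshoot the target, best candidate first.
        candidates = sorted(
            (f for f in fingers(node)
             if (f - node) % SIZE <= remaining and f not in failed),
            key=lambda f: (target - f) % SIZE,
        )
        if not candidates:
            return None
        node = candidates[0]   # falls back to another neighbour if the best one failed
        path.append(node)
    return path


normal = route(0, 13)
detour = route(0, 13, failed={8})
```

With node 8 failed, the message leaves node 0 through neighbour 4 instead and still reaches node 13 in three hops in this example. Each node has O(log N) such alternatives, which is the path redundancy referred to above.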
Selectivity
Choreca inherits the filter characteristics of Rebeca.

BitZipper

The BitZipper Rendezvous [TBF+04] is a generic approach to the decentralised evaluation of data and queries. It can be implemented over prefix-routed overlay networks, including most key-based routing networks, like Chord or Pastry (see 9.2). Its overhead is within O(√N) messages, but it is independent of any specific feature of data or queries as opposed to the restrictions posed by subject-based multicast or the NP-completeness of filter merging [Müh02]. General content-based publish-subscribe filtering can be implemented with the same message overhead when using the BitZipper.

Delivery guarantees
As a general-purpose solution, the BitZipper Rendezvous protocol has a high resource profile. However, the implementation on top of structured overlays provides it with global load distribution and an intrinsic forwarding redundancy that is also within O(√N). Exploiting this redundancy for the matching process can dramatically increase the reliability in the face of arbitrary node failures.

Selectivity
The BitZipper Rendezvous is agnostic to the subscription language. Even Turing-complete subscriptions can be supported without impacting the message overhead or the number of brokers required for filtering.

Conclusion

The diverse characteristics of the presented systems leave the deployer with a large grey area of tradeoffs. Each has its own advantages and disadvantages in specific situations. In the case of multicast, the best possible solution is IP-Multicast, but only if it is available in a scalable fashion. Especially a global mapping of reasonably expressive subjects to IP addresses exposes considerable administrative barriers and may quickly lead to address exhaustion. In this case, overlay multicast accepts an increased overhead to provide a higher-level approach that circumvents these problems.

Overlay internal multicast provides average O(log N) overhead and tree depth, but the single rendez-vous node may become a problem in high-traffic scenarios. Replication helps in balancing the load, but at the expense of even higher overhead in common scenarios. More general approaches, like Choreca or the BitZipper, remove the single point of failure. They combine high selectivity with global load balancing at the expense of generally increasing the overall overhead.
On the other hand, the Hermes system provides similar characteristics and problems as overlay multicast systems, but increases the per-broker state to push the expressiveness of subscriptions towards the more general solutions. Obviously, different applications have different requirements. Where one application may find a single rendezvous point acceptable as it cares more about latency and message overhead, a second one may benefit well enough from the high redundancy of Choreca to accept the additional overhead. Designing overlays that work optimally in all situations is nearly impossible and will most likely lead to extremely complex systems that handle many special cases. Understanding and evaluating the behaviour of such a system will be at least as hard as their initial design.

A more promising (and more common) approach is the development of systems that work well in well defined cases. Their characteristics can be proven or evaluated. By relying on explicitly known quality-of-service capabilities of these systems, applications can select the best candidate system for a specific problem in a specific environment. Only the integration of specialised overlays, each with its own simple design, will lead to highly configurable and adaptable systems that keep the high predictability of their components.

Chapter 3
Middleware Abstractions for Overlay Software

Contents
3.1 Requirements on an Overlay Middleware
3.2 Overlay Frameworks
3.3 Current Views on Overlay Software
3.4 Code Complexity - why is this not enough?

As we have seen so far, the support for different levels of quality-of-service in large-scale overlay applications relies on an integration of different overlay networks. This requires support in frameworks and design tools to make overlay implementations available in different run-time environments. Platform independence and high-level design support are crucial for the implementation of overlay applications that are expected to adapt to diverse requirements in highly scalable systems.

This chapter makes a case for new abstractions in the design of overlay software based on the deficiencies of current approaches. Based on the examples in the introductory chapters, the first section presents general requirements on overlay middleware systems. Section 3.2 then reviews current frameworks and matches them with the requirements. Section 3.3 generalises the current approaches into two main perspectives on overlay software: the networking perspective and the overlay protocol perspective. Both approaches support developers in the implementation of overlay networks. As section 3.4 shows, however, neither of them provides an abstraction level that substantially reduces the design effort by capturing the specific characteristics of overlay networks. This outcome motivates the development of higher-level abstractions, as presented in the next chapter.

3.1 Requirements on an Overlay Middleware

To provide helpful design abstractions, overlay middleware must handle, simplify and pre-implement common tasks that help in maintaining overlays in general. It can then leave the implementation of overlay specific components to the developers. This section describes general requirements on overlay middleware systems. An overview of their incarnations in different frameworks is provided later in this chapter.

Generic, High-Level Models

First of all, overlay middleware must provide simplified, domain specific models for overlay development. This allows the overlay designer to move away from reinventing the wheel in low-level implementations and to focus on the main features of the specific overlay. These models must support the implementation of overlay topologies and algorithms for routing and maintenance.

An important factor is programming language independence, including the choice of a suitable execution environment. If the designer can start with abstract specifications of the overlay, it becomes possible to implement the final system in different environments without major redesigns. This is vital for distributed systems that are expected to run on different architectures, like large servers, standard PCs and mobile devices.

Language independence is also vital for the development process. The later the decision about the deployment environment and language can be taken, the easier it becomes to base the decision on those performance aspects that ultimately prove to be most relevant. Platform independence further allows using different environments for different steps of the development process. Environments for rapid prototyping may look very different from those enabling high-performance execution. High-level, language independent models and the resulting high-level design are crucial to support this choice of environments.
The Node Views architecture presented in this thesis aims to provide such platform-independent models. It combines them with a Model Driven Architecture for overlay implementation.

Software Components

To connect the high-level design with executable code, a middleware requires a component model that enables the integration of language specific implementation parts.

Reusable components

Most importantly, software components should be reusable in different implementations to reduce the necessary amount of hand-written source code. Code reusability requires well-defined interfaces between high-level models and specialised components. If components are written solely against the generic models provided by the middleware, they can become part of the middleware itself. This provides a common base of pluggable components and further reduces the effort necessary to design new overlays.

This obviously also applies to the lower networking levels. Serialising messages, sending and receiving them, is a basic feature of any networking middleware. While a middleware may support a diversity of serialisation formats (XML, XDR, IIOP, custom binary, ...) as well as different point-to-point networking protocols (such as TCP/IP, RTP or VPNs), it should hide their deployment behind simple interfaces to make their use a matter of selection rather than programming.

One framework that supports this layering of components is Gridkit. It provides a dynamic component architecture and layered networking interfaces for overlay networks. Section 8.4 describes an implementation of the Node Views architecture on top of Gridkit.

Reactive components

As in any networking software, the components of an overlay middleware are naturally reactive. They respond to events such as incoming or locally generated messages, time-outs and changes to the local model. The middleware must therefore manage these events and the coordination between different components.

This feature is often implemented by means of Event-driven State Machines, finite automata that interconnect processing states by event-triggered transitions. Their event model is simple: messages are locally received or produced and time-outs are triggered. The system then dispatches these events to states according to the available transitions.
Some systems, like SEDA [WCB01] (see 3.2.1), support processing chains that forward data objects from one processing state to the next. The output of such an object from a state becomes an additional event for the system. This approach reduces the complexity of each state and moves more of the control logic into the middleware architecture. More expressive event models can push this even further. By moving more of the complexity into the model that underlies the middleware, components can further reduce their implementation dependency on specific environments and middleware implementations. EDSL (see 6.5) is an abstract EDSM language that supports modelling the event processing of overlay implementations in a framework independent way.
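A minimal Python sketch may help to picture such an event-driven state machine. It assumes a single machine with a transition table keyed by state and event type, and handlers whose output is fed back into the event queue, as in the processing chains described above. All names are hypothetical. The sketch mirrors neither the SEDA API nor the EDSL language.

```python
from collections import deque


class EventDrivenStateMachine:
    def __init__(self, initial_state):
        self.state = initial_state
        self._transitions = {}     # (state, event type) -> (handler, next state)
        self._queue = deque()

    def add_transition(self, state, event_type, handler, next_state):
        self._transitions[(state, event_type)] = (handler, next_state)

    def post(self, event_type, data=None):
        self._queue.append((event_type, data))

    def run(self):
        """Dispatch queued events until the queue is empty; return a trace."""
        trace = []
        while self._queue:
            event_type, data = self._queue.popleft()
            entry = self._transitions.get((self.state, event_type))
            if entry is None:
                continue           # no transition available: the event is ignored
            handler, next_state = entry
            output = handler(data)
            trace.append((self.state, event_type, next_state))
            self.state = next_state
            if output is not None:
                self._queue.append(output)   # handler output becomes a new event
        return trace


neighbours = {"n1", "n2"}


def send_ping(node):
    return None                    # a real component would hand a message to the network


def on_timeout(node):
    neighbours.discard(node)       # the neighbour did not answer in time
    return ("neighbour_lost", node)


def on_neighbour_lost(node):
    return None                    # a real component would look for a replacement


machine = EventDrivenStateMachine("idle")
machine.add_transition("idle", "probe", send_ping, "waiting")
machine.add_transition("waiting", "timeout", on_timeout, "repairing")
machine.add_transition("repairing", "neighbour_lost", on_neighbour_lost, "idle")
machine.post("probe", "n1")
machine.post("timeout", "n1")
trace = machine.run()
```

The handlers contain no dispatch logic of their own. Which handler runs next is decided entirely by the transition table, which is the shift of control logic from the components into the middleware that the text describes.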

Management of State and Node Data

An important part of overlay software deals with the relation of the local node to remote nodes. Overlay middleware must obviously provide support for managing this relation. This comprises handling the connections to neighbours in the topology, adding them to the local view and removing them, but also keeping fall-back candidates for the maintenance case. Nodes often have to store further state about their neighbours, such as running time-outs, measured latencies or active subscriptions. The ubiquity of state management throughout the system calls for support in middleware. The database provided by the Node Views architecture (see 4.3) aims to become such a central, integrative point of state management.

A further step towards a high-level middleware is to manage nodes not only on a connection level but to provide support for selecting neighbours and communication partners based on various criteria. This provides an abstract base for the respective local decisions that are currently implemented by hand, outside of frameworks. SLOSL, as presented in chapter 5, is a domain specific language that provides such a middleware abstraction level.

Integration and Resource Sharing

The last major feature of a common middleware is the integration of different overlays. In an integrative middleware, multiple applications can run in the same environment and each application can deploy multiple overlays. This requires the middleware to provide ways for reducing their aggregated resource usage. This encourages connection sharing between different topologies and applications as well as ways to minimise the general maintenance overhead that is generated, for example, by pings¹. The Node Views approach, which this thesis proposes in chapter 4, aims to provide an integrative data layer for local decisions based on locally consistent views.
3.2 Frameworks and Middleware for Overlays

There have been a number of recent proposals for overlay frameworks and middleware. Macedon, iOverlay and RaDP2P are under development and evaluation in the corresponding projects. Other frameworks, like SEDA or JXTA, have also been used for overlay implementations. Figure 3.1 compares the size² of some of these frameworks (as far as they are publicly available). As we will see in section 3.4, the code size of a framework bears no relation to the size of overlay implementations written on top of it. It is clearly the abstraction they provide that makes the difference.

¹ PlanetLab applications (http://planet-lab.org/), as an extreme example, were found to generate a total of up to 1GB of ping traffic per day in 2003 [NPB03].
² Note that these numbers depend on which modules are counted as major (i.e. relevant) parts of the frameworks.

Framework   Type                     Size
JXTA        Protocol framework       1.5 MB jar (+XML+...)
SEDA        EDSM framework           15-30,000 lines of Java
Macedon     EDSM/overlay framework   15,000 lines of C++

Figure 3.1: Approximate code size of overlay frameworks

SEDA

The Staged Event-Driven Architecture (SEDA [WCB01]) is a framework for scalable Internet servers. It provides an event-driven state machine (EDSM) for request processing that is used in OceanStore [KBC+00] and (in a modified form) in Bamboo [RGRK04]. While designed as a generic architecture, the only available implementation is written in Java. SEDA handles network I/O and builds an abstraction layer for message processing in reactive state components. It does not provide any higher level overlay models, language independence or support for node management. State transitions are specified by class types and dispatched in source code, which makes state implementations hard to reuse.

JXTA

The JXTA project was started by Sun Microsystems. It builds a protocol framework for peer-to-peer applications. Its main focus is on unstructured broadcast networks, and it provides means for connecting and grouping nodes, discovering services and dispatching messages. There is no support for node management or for topologies and their design. Its roots in a broadcast model make writing structured overlays like HyperCUP [SSDN02] tedious work (see comparison in figure 3.4, page 40).

iOverlay

iOverlay [LGW04] essentially provides a message switch abstraction for the design of the local routing algorithm. The neighbours of a node are instantiated as local I/O queues. The overlay implementation then forwards incoming messages between them. This approach provides an abstraction level that simplifies the design of overlay algorithms by hiding the lower networking levels. It also has the advantage of being an integrative approach that allows resource sharing. On the down side, the connection based model provides only basic support for node management and no higher-level model for the implementation of topologies. Furthermore, the router is a monolithic component in the middleware model, which leaves the concern of code reusability entirely to the developer.

RaDP2P

RaDP2P [HCW04] is described as a policy framework for adapting structured overlays to applications. User defined policy components derive node IDs from application semantics and reassign them dynamically to nodes. Although this is an interesting idea, there is currently neither a working implementation nor any publicly available information describing what the policies look like or how they are implemented. This makes it hard to estimate the support for the aforementioned list of middleware requirements. The idea behind RaDP2P provides an extremely high-level model that abstracts from specific overlays. However, as it stands, RaDP2P does not provide support for implementing the overlays themselves. It should therefore rather be considered an application level framework.

Macedon

Macedon [RKB+04] constitutes the most interesting approach so far. It is essentially a state machine compiler for overlay protocol design. Event-driven state machines (EDSMs) have been used for decades in protocol design and specification. Macedon extends this approach to an overlay specific language based on C++ from which it generates source code for overlay maintenance and routing. The Macedon EDSM provides a high-level, domain specific model for reactive overlay implementations. On the down side, Macedon's support for node management is as limited as in iOverlay. It does not tackle the problem of overlay integration and, like the frameworks above, it does not provide any further models for topology implementation. Its language dependence on C++ is certainly a further drawback. Still, the general approach could be adapted to generate source code for different environments, e.g. for debuggers and simulators.

What is missing in current frameworks?

Overlays are expected to operate autonomously. This means that they must configure themselves and automatically adapt to a changing environment.

However, this is not only a matter of designing a routing protocol, as the models in current frameworks seem to suggest. Each node in an overlay needs to take local decisions. The sum of these local decisions is the distributed algorithm that maintains the overlay. What are these local decisions based on?

iOverlay bases them on the currently available connections and establishes a switch that routes messages between them. It does not provide means to select the right connections for specific requirements or to distinguish between different types of connections. It does not support any node management that could simplify these decisions.

Macedon provides an event-driven state machine abstraction. In a number of different proof-of-concept overlay implementations, this was shown to be very useful and efficient for implementing and testing algorithms for routing and maintenance. However, like iOverlay, it does not support any node management for selecting connections. Topologies only arise accidentally from the source code level implementation of message handling decisions. There are no higher-level models to support the decisions that are necessary to establish them.

This overview shows that the implementation of message protocols is the dominating model in current frameworks. Event-driven state machines form the state-of-the-art in the design of overlay software. They support the implementation of reactive components that handle messages. They support neither the implementation or integration of different topologies nor the node management that is necessary for the local decisions to become meaningful.

3.3 Current Views on Overlay Software

Overlay software has been implemented on different abstraction levels, ranging from low-level sockets to high-level protocol models.
Taking a look at the currently available systems, we can find two basic abstractions in their design: the low-level networking layer and the overlay routing layer. Let us now look at each of them to see what overlay developers have to face when building their software on top of these abstractions.

The Networking Perspective

[Figure 3.2: Overlay software from the networking perspective. Layers from top to bottom: Overlay Application; Overlay Implementation; Message Processing (flow graphs, EDSMs, ...); Message Passing (serialisation, CORBA, ONC-RPC, SOAP, ...); Network I/O (sockets, TCP/UDP, ...)]

At the lowest level, we find the most ubiquitous abstraction for networking software: sockets. They provide ways to send raw blocks or streams of bytes between application end-points on different machines. There can hardly be an operating system that does not provide them as a programming primitive. Their use has been largely standardised through Unix-flavoured operating systems and POSIX.1g. While they provide the most portable abstraction, their help for overlay development is extremely limited. Still, there are overlay systems that were implemented directly on top of sockets, like the original Chord implementation [SMK+01].

The message passing abstraction allows software to send well-defined data items between end-points in a physical network. A large number of distributed applications build on top of this abstraction. Consequently, there is a remarkable set of tools, ranging from libraries for remote procedure calls (RPC, such as Sun/ONC-RPC or SOAP) over language support for serialisation as in Java or Python, up to messaging frameworks as in CORBA or JMS. Most of these have been ported to a large number of platforms and therefore form an acceptable programming level that hides data issues and platform heterogeneity. For the specific requirements of overlay network development, they still provide a very low level.

A number of overlay networks, such as Bamboo [RGRK04], are implemented on top of generic event-driven state machines for Internet servers. They facilitate local message processing by connecting components to request processing chains and triggering their execution. Examples are SEDA [WCB01] or the Twisted framework. While these frameworks do not offer support for overlay specific tasks, they form a reasonable abstraction level for message-driven networking software in general.

The Overlay Protocol Perspective

A more interesting abstraction level for overlay development is based on the actual requirements that overlays have. They must route messages between nodes, maintain their topological structure and adapt to application needs.
During the year 2004, two interesting frameworks appeared that follow this approach, namely Macedon [RKB+04] and iOverlay [LGW04] (see 3.2). They visibly raised the abstraction level that was available at the time and thus show the advantage of designing frameworks specifically for overlay implementation.

[Figure 3.3: Overlay software from the overlay protocol perspective. Layers shown: Overlay Application, Overlay Adaptation, Overlay Maintenance, Overlay Routing, Message Processing; frameworks shown: RaDP2P, iOverlay, Macedon]

As the major functionality in this abstraction level, overlay routing protocols implement the local routing decisions for scalable end-to-end message forwarding. They are distributed algorithms, executed at each member node, with the purpose of forwarding messages at the overlay level from senders to receivers. Both Macedon and iOverlay take useful approaches here. iOverlay provides a simple message switch abstraction that largely reduces the programming overhead when writing a router. Macedon takes a more sophisticated approach. Its EDSM is controlled by an overlay specific language that supports the complete protocol implementation.

Apart from the forwarding feature that is obviously crucial for overlay networks, routing protocols also have a number of implicit semantics. As a requirement for their algorithm, they usually define the valid neighbours for each node that assure the correctness of the routing. This leads us to Overlay Maintenance, the perpetual process of finding these neighbours and replacing failed ones. Usually, maintenance and routing are tightly integrated through their interdependency.

The same holds for the third functionality, Overlay Adaptation. As we have seen in 3.2, the approach of RaDP2P is targeted at applications rather than at overlay design. This means that there is currently no actual framework support for the implementation of adaptable overlays, although most of the existing overlay systems provide some form of adaptation. It is usually implemented as an optimisation of the overlay topology based on a specific metric, most commonly the hop-by-hop latency. Whenever the choice of neighbours leaves a certain degree of freedom (which is the common case in large-scale systems), a specific adaptation algorithm is executed to find the best candidate.

3.4 Code Complexity - why is this not enough?
Figure 3.4 compares a number of recent overlay implementations with respect to their code size. All of these were written by hand in a general purpose programming language. It shows that the expected code size that results from this approach goes beyond 20,000 lines of code. It also shows that the use of inappropriate frameworks can result in considerably more code being written. Especially JXTA appears to fail entirely in its goal of simplifying the design of overlay applications.

            Environment            Lines of code
Overlay     Framework   Language   by hand   Macedon
Chord       sockets     C++        8,
Pastry      sockets     Java       16,
Bamboo      SEDA        Java       20,000    -
Tapestry    SEDA        Java       23,000    -
HyperCUP    JXTA        Java       30,000    -

Figure 3.4: Code size of well-known overlay implementations

As suggested in section 3.3.1, all of these implementations are very similar in their main components, so one can expect positive synergy effects when they are implemented on top of a common framework. This is one of the reasons why Macedon's implementations are considerably shorter. Note, however, that they are not equivalent in terms of features and that the small number of comparable overlays does not yet support any extrapolation. It would therefore be wrong to generally expect an order-of-magnitude reduction here.

Macedon's main advantage is the higher abstraction level as compared to the networking layer. On the down side, however, it replaces the challenge of writing low-level networking code by the new challenge of writing an entire overlay implementation as an event-driven state machine. This has several drawbacks. While non-blocking state-machine programs are generally considered more scalable than their major competitors, threaded programs, they are also much harder to write and understand. State machines require a fundamentally different and very specific approach to code modularisation. They split code into processing stages based on intermediate I/O instead of semantic vicinity. Macedon supports neither visual development nor any higher-level encapsulation of semantically close code. So the gain of a factor of 10 in code size is paid for with a considerable increase in complexity and a reduced readability of this code.

Another problem with Macedon is its tight coupling to the C++ language.
Porting a Macedon implemented overlay to a new programming language basically involves reimplementing both the overlay and Macedon itself. A more general problem is the event model of current event-driven state machines. It is commonly based on messages as monolithic blocks of data that event handlers must handle in their entirety. There is no support for more fine-grained subscriptions that could allow more generic, reusable event handlers.
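To make the missing feature concrete, the following sketch shows what a more fine-grained subscription could look like. It does not describe any existing framework: `FieldDispatcher` and its keyword-based `subscribe` method are hypothetical, and messages are modelled as plain dictionaries for brevity.

```python
class FieldDispatcher:
    """Dispatches messages to handlers that subscribed to specific field values."""

    def __init__(self):
        self._subscriptions = []   # list of (required field values, handler)

    def subscribe(self, handler, **required):
        """Register a handler for messages whose fields match all given values."""
        self._subscriptions.append((required, handler))

    def dispatch(self, message):
        """Call every matching handler and return how many were called."""
        called = 0
        for required, handler in self._subscriptions:
            if all(message.get(key) == value for key, value in required.items()):
                handler(message)
                called += 1
        return called


log = []
dispatcher = FieldDispatcher()
# A handler that only cares about the message type ...
dispatcher.subscribe(lambda m: log.append(("join", m["sender"])), type="join")
# ... and a generic one that only cares about the priority field.
dispatcher.subscribe(lambda m: log.append(("urgent", m["type"])), priority="high")

dispatcher.dispatch({"type": "join", "sender": "n1", "priority": "low"})
dispatcher.dispatch({"type": "lookup", "sender": "n2", "priority": "high"})
both = dispatcher.dispatch({"type": "join", "sender": "n3", "priority": "high"})
print(both, log)
```

The second handler never inspects the message type, so it could be reused across overlays that share a priority field, which is the kind of reuse that monolithic message handling prevents.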

It should also be stressed that support for overlay software functionality is not currently complete. Writing overlay software in an adaptable way must still be done by hand. Even Macedon does not provide any help here, despite the fact that adaptation is a very important feature for overlay applications, especially if they need to be quality-of-service aware.

As we have already seen, the overlay protocol approach enforces the plain source code implementation of a number of implicit semantics that largely impact the entire overlay. This regards routing protocols, maintenance algorithms and adaptation. It therefore becomes hard to understand this kind of overlay software without further documentation external to the implementation. This is especially true for the topology, the most important characteristic of the overlay. It is not explicitly described in any single software component, but scattered throughout the implementation.

EDSM implementations are the extreme variant, as they structure their code entirely based on overlay specific I/O. This makes it hard to reuse parts of the overlay software for different overlays. It also hinders the integration of different overlays and their combined usage in a single application. This currently makes overlay implementations a rather monolithic piece of software.


Chapter 4

A Model Driven Architecture for Overlay Implementation

Contents
4.1 The Topology Perspective
4.2 Background: The Local View of a Node
4.3 The System Model: Node Views

The Model Driven Architecture [MM03] is a software design approach proposed by the Object Management Group (OMG). It aims at enforcing the power of abstract models over platform specific implementations. MDA applications are designed in platform independent models (PIM) that are then transformed into platform specific models (PSM) and finally augmented with platform specific code to provide a complete implementation of a system. It is therefore a much more high-level and comprehensive technique than a middleware design approach. The OMG's MDA is based on the Unified Modelling Language (UML [Obj01, Fow04]) and thus largely targeted at business logic and enterprise-level software design.

This chapter will derive and present models that enable a Model Driven Architecture approach in software design for overlay networks. The next section tries to straighten the perspective on overlay software by introducing the topology itself as a design target. This leads to a novel approach, named Node Views, that is presented in section 4.3. Its major advantages are the decoupling of generic maintenance components and the inherent support for explicit topology specifications as the central part of the implementation.

4.1 The Topology Perspective

[Figure 4.1: Overlay software from the topology perspective. Elements shown: Overlay Applications, Topology Selection, Topology Adaptation, Overlay Message-handling/routing, Topology Maintenance, Message Processing, Topology Rules, Node Views]

While current frameworks focus on message forwarding and the protocol design part of overlay software, it should have become clear that these abstractions stay at a relatively low level. Even the EDSM approach forces the developer to design code in a very specific way that is tightly coupled to a framework and leaves source code too monolithic for reuse. We will now raise the perspective to the design of topologies that establish the global characteristics of overlay networks. From this perspective, we can establish five major functional levels in overlay software, as shown in figure 4.1.

Topology Rules

Local topology rules play the most important role in overlay software, which makes them a very interesting abstraction level. The global topology of an overlay is established by a distributed algorithm that each member node executes. The topology rules on each node form the part that actually implements this algorithm by accepting neighbour candidates or objecting to them.

There are two sides to topology rules. Node selection allows an application to show interest in certain nodes and ignore others based on their status, attributes and capabilities. Generally, applications are only interested in nodes that they know (or assume) to be alive, usually based on the information when the last message from them arrived. But not even all locally known live nodes are interesting to the application, which can select nodes for communication based on quality-of-service requirements.
Furthermore, if a heterogeneous application uses multiple overlays, its participants do not necessarily support all running protocols. Each node must see the others only in overlays that they support.
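The three selection criteria named above (assumed liveness, quality-of-service requirements and supported overlays) can be written down as a simple filter. This is a minimal sketch, not the SLOSL language of chapter 5. The `Node` attributes, the function `select_nodes` and its thresholds are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    address: str
    last_heard: float        # time at which the last message arrived
    latency_ms: float        # measured or estimated latency
    protocols: frozenset = field(default_factory=frozenset)  # supported overlays


def select_nodes(nodes, now, overlay, max_silence=60.0, max_latency_ms=200.0):
    """Return the nodes that are visible to the given overlay."""
    return [
        node for node in nodes
        if now - node.last_heard <= max_silence   # assumed to be alive
        and node.latency_ms <= max_latency_ms     # meets the QoS requirement
        and overlay in node.protocols             # supports this overlay
    ]


nodes = [
    Node("10.0.0.1:4000", 95.0, 20.0, frozenset({"chord", "gossip"})),
    Node("10.0.0.2:4000", 10.0, 15.0, frozenset({"chord"})),   # silent too long
    Node("10.0.0.3:4000", 99.0, 450.0, frozenset({"chord"})),  # too slow
    Node("10.0.0.4:4000", 98.0, 30.0, frozenset({"gossip"})),  # no chord support
]

chord_view = [n.address for n in select_nodes(nodes, now=100.0, overlay="chord")]
gossip_view = [n.address for n in select_nodes(nodes, now=100.0, overlay="gossip")]
print(chord_view, gossip_view)
```

The same set of known nodes yields a different selection per overlay, so each node is seen only in the overlays that it supports.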

Node categorisation is the second task. Where selection is the black-and-white decision of seeing a node or not, categorisation determines how nodes are seen. Nearly all overlay networks know different kinds of neighbours: close and far ones, fast and slow ones, parents and children, super-nodes and peers, or nodes that store data of type A and nodes that store data of type B. Node categorisation lets a node sort other nodes into different buckets to distinguish different types of equivalent nodes. Common overlay tasks are then implemented on top of the node categorisation.

It is a hard problem but also an interesting question to what extent the process of inferring the global guarantees provided by a topology from the local rules can be automated. In current structured overlay networks, topology rules are stated apart from the implementation as a local invariant whose global properties are either proven by hand or found in experiments. The Node Views architecture proposed in this thesis is the first to move these rules into machine readable overlay models. This makes them available for automated model transformation and analysis.

Topology Maintenance

Topology maintenance is the perpetual process of repairing the topology whenever it breaks the rules. Above all, this means integrating new nodes (i.e. selecting and categorising them) and replacing failed ones. The detection of a situation that breaks the rules is obviously an event that must be extracted from the topology rules. Support for this functionality is very limited amongst current overlay frameworks, despite its obvious importance for the required self-maintenance in these systems.

Topological Routing

Routing captures the local decision to which of the neighbours a message should be forwarded. The goal is to deliver it to the final destination or to bring it at least one step closer. The routing algorithm exploits the topology rules to determine the best next recipient.
The complete process of taking a forwarding decision is typically executed in multiple inter-dependent steps. A part of this is the decision whether the local node is the destination of an incoming message. The message is then either locally delivered to a message receiver component or further treated to determine the responsible neighbour.

Topology Adaptation

Topology adaptation is the ability of a given overlay topology to adapt to specific requirements. As opposed to the error correction done by topology maintenance, adaptation handles the freedom of choice allowed by the topology rules. The rules therefore draw the line between maintenance and adaptation. An example is Pastry, where evaluations have shown that redundant entries in the routing table can be exploited for adaptation to achieve better resilience and lower latency [ZHD03].

Topology adaptation usually defines some kind of metric for choosing new edges out of a valid set of candidates. Building the right sub-groups of nodes in hierarchical topologies (or embedded groups as in [KR04]) also fits into this scheme. Most current overlays are designed with some kind of adaptation in mind, whereas the available frameworks do not provide support for its implementation. What is needed here is a ranking mechanism for connection candidates.

Overlays usually aim to provide an efficient topology. The term efficiency, however, is always based on a specific choice of relevant metrics, such as end-to-end hop-count or edge latency, but possibly also the node degree or some expected quality of query results. The respective metric determines the node ranking, which in turn parametrises the global properties of the topology.

Topology Selection

Topology selection is the choice of different topologies that an overlay application can build on. Supporting multiple topologies obviously makes sense for debugging and testing at design-time. However, it is just as useful at run-time if an application has to adapt to diverse quality-of-service requirements, such as different preferences regarding reliability, throughput and latency. A given topology may excel in one or the other and this specialisation allows it to provide high performance while keeping a simple design.
Topology selection allows an application to provide optimised solutions for different cases.

Discussion

While overlay networking stays an important part of the implementation, the topology perspective allows developers to take a broader and more abstract view on the design. It focuses on the major characteristic of overlay networks: their topology. This provides a cleaner view on the intentions of overlay implementations.

Topology adaptation and selection play the most important role for QoS support in overlays. However, selection obviously relies on the integration of different overlay implementations to make their topologies available to a single application. This is especially necessary to avoid duplication in effort when maintaining multiple topologies and switching between them. It is not efficient to have an application maintain several overlays if each of them independently sends pings to determine the availability of nodes. An integrative approach is needed to avoid this kind of overhead.
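Node categorisation and the ranking mechanism for connection candidates, as described for topology rules and topology adaptation in this section, can be sketched in a few lines. The function names, the dictionary-based node records and the distance threshold are assumptions chosen for illustration. The sketch only shows how the choice of metric changes which candidate is preferred.

```python
def categorise(nodes, local_id, near_distance):
    """Sort nodes into 'near' and 'far' buckets by identifier distance."""
    buckets = {"near": [], "far": []}
    for node in nodes:
        distance = abs(node["id"] - local_id)
        buckets["near" if distance <= near_distance else "far"].append(node)
    return buckets


def best_candidate(candidates, metric):
    """Rank the valid candidates by a metric (lower is better), keep the best."""
    return min(candidates, key=metric) if candidates else None


known = [
    {"id": 12, "latency_ms": 80.0, "degree": 3},
    {"id": 15, "latency_ms": 25.0, "degree": 9},
    {"id": 200, "latency_ms": 5.0, "degree": 1},
]

buckets = categorise(known, local_id=10, near_distance=16)

# The topology rules decide who is a valid 'near' neighbour; the metric
# decides which of the valid candidates is preferred.
by_latency = best_candidate(buckets["near"], metric=lambda n: n["latency_ms"])
by_degree = best_candidate(buckets["near"], metric=lambda n: n["degree"])
print(by_latency["id"], by_degree["id"])
```

The node with identifier 200 has the lowest latency but is never chosen as a near neighbour, because adaptation only operates within the freedom that the rules leave.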

The following section motivates the background for Node Views, an integrative, high-level approach to overlay software design that is based on the topology perspective.

4.2 Background: The Local View of a Node

In scalable, distributed overlays, the global topology and all properties of the overlay result from the sum of local actions of its participants. All actions must be based on local decisions that each single node takes depending on its local view. The local view is a node's combined knowledge about the other nodes in the system, above all (but not limited to) its neighbours in the topology.

The Chord Example

We take Chord [SMK+01] as an example, a structured overlay network based on a ring topology (see 9.2). The top node of the ring in figure 4.2 sees the nodes within the shaded area. The local view of this node is 1) the successor link, i.e. the direct clockwise neighbour, plus 2) the finger table, a list of nodes further away along the ring that are chosen according to an overlay specific rule. For better resilience, Chord actually keeps a list of successors.

[Figure 4.2: The Chord example - from global to local view. Labels shown: successor, finger table]

Two other structured overlays, Pastry [RD01] and Tapestry [ZKJ01], assemble their local view in a similar fashion. They keep a list of direct neighbours and a routing table with members that are further away. The rule for choosing these nodes is similar in both systems, but differs from the rule used in Chord.

Unstructured networks also keep a set of neighbours that are chosen based on attributes known about them. These may include network latency, uptime, node degree, available resources, data references or categories about content stored on them. Super-Peer networks [YG03] form a variant that keeps two distinct sets of peers and super-peers.

In all of these overlay networks, each participant sees a number of other members through specific data structures that represent these neighbour lists, successor lists, routing tables, leafsets, multicast groups, etc. Abstracting from specific implementations, we can now see that each of these data structures provides a node with a partial view on the network. What we previously called the local view of a node is therefore the union of all its partial views on remote nodes.

Data about Nodes

To establish a local view, each node has to keep data about other nodes. Examples are addresses and identifiers, measured or estimated latencies and references to data stored on these nodes. Furthermore, it is generally of interest when a node was last contacted (time-stamps or history), whether a node is considered alive or whether a ping was recently issued to check if it still is. The information that this was necessary may be very valuable to, say, a routing component that can take the decision to temporarily route around that questionable node in order to keep the hop-by-hop loss rate low. Locally available data about remote nodes is crucial to all components of the overlay software.

Data about remote nodes is gathered from diverse sources. Some data can be determined locally (IP address, ping latency, ...), while other information is received in dedicated messages - either directly from the node it describes or indirectly via hearsay of intermediate nodes. There often is more than one way of finding equivalent data. Section 7.2 describes the example of latencies being measured (ping) or estimated. As another example, a node A knows that a node B is alive if A recently succeeded in pinging B, if A received a message from B, if other nodes told A about B (gossip), etc.
Note that both overhead and certainty decrease in this order. Different quality-of-service levels in an overlay application can trade load against certainty by selecting different sources.

Topology rules, maintenance, adaptation and selection mainly deal with managing data about nodes. The topology rules put constraints on the data about possible neighbour nodes. Maintenance needs to keep data about fall-back candidates that may currently not be neighbours. It also deals with gathering data about nodes that joined or finding inconsistencies between local and remote views. Adaptation does a ranking between candidate nodes before it decides about the instantiation as neighbours or fall-backs. Topology selection then switches between different views, i.e., ranking metrics and sets of neighbours.

A data abstraction is obviously a good way of dealing with this diversity of sources, data characteristics and data management tasks. It allows an overlay to lift dependencies on specific algorithms and to take advantage of the different characteristics of different implementations as the need arises. The Node Views approach provides such an abstraction.

4.3 The System Model: Node Views

The Node Views approach instantiates the local view of a node as a local database that keeps all locally known data about remote nodes. This makes it possible to apply common approaches from active database management systems in order to model and implement the data management part that is necessary to maintain and adapt topologies. The rest of the architecture, which is shown in figure 4.3, results from following the well-known Model-View-Controller pattern [BMR+96]. It aims to decouple software components by separating the roles for state and data storage (model), data presentation (views) and data manipulation (controllers, named harvesters in the Node Views context).

[Figure 4.3: The overall system architecture of Node Views. Elements shown: overlay implementation, applications, harvesters, local node database, handler, router, node set views, view definitions]

Nodes

A node is the local representation of a remote member of the topology and the atomic unit for overlay networks. It has a number of attributes locally assigned to it. Nodes and their attributes become available to the local view through physical links, messages or via hearsay of other members. There are physical and logical attributes.

Logical attributes include the node identifier used in structured networks. Similarly, related or precomputed data like distances between identifiers can be stored in attributes. Besides that, an application may add semantic information about data stored on that node or references to content filters associated with it.

The most important physical attribute is the network address of the node, but others, like measured or estimated latency and throughput, may also be of interest. Note that this model abstracts from the source of such attributes and thus unifies estimates and measurements. This makes it possible to exploit both in view definitions and to make them exchangeable as needed, without any impact on the implementation of components that use this information.

When using multiple (logical and physical) identifiers at runtime, they can potentially become inconsistent. This may happen if two different nodes have taken the same logical identifier by accident or if a node failed and a new one took over its physical address with a different logical identifier. These problems have a number of possible solutions, which are normally implemented as part of the overlay protocol. The Chord overlay, for example, tests for duplicated logical identifiers in its join procedure and repeats the join until an unused identifier is found. In general, the right solution is very overlay specific. Therefore, this kind of inconsistency cannot be prevented or resolved automatically by an overlay-agnostic middleware layer and sometimes not even locally between neighbours. The runtime system should therefore simply trigger a database event if any inconsistency is detected and otherwise leave the resolution to the overlay implementation.

Further physical attributes regard the communication status of a node. As described before, it is generally of interest whether a node is currently considered alive and when it was last contacted (time-stamps or history).
A component that pings nodes if they do not respond to messages within a certain time interval can set a boolean node attribute when starting this action. A routing component can then take the decision to temporarily route around that questionable node in order to proactively keep the hop-by-hop loss rate low. Note that this kind of information is independent of any specific overlay and can therefore be shared between different running overlays to minimise the amount of outgoing pings.

Node Database

The active node database represents the complete local view of a node. It stores the attributes of all locally known nodes, which makes it a locally consistent storage point. All components in the system can easily contribute to it and benefit from it. Note that the database is independent of specific topologies and overlay implementations and that it abstracts from the way nodes and their attributes are found.
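The role of the active node database as a shared, event-generating store can be illustrated with a small Python sketch. It is not taken from the Node Views implementation: the class name NodeDatabase, its methods, the attribute names alive and ping_pending and the example address are assumptions made purely for illustration.

```python
class NodeDatabase:
    """Toy local node database: maps a node key to its attribute dictionary."""

    def __init__(self):
        self._nodes = {}
        self._listeners = []

    def subscribe(self, callback):
        """Register callback(node_key, attribute, old_value, new_value)."""
        self._listeners.append(callback)

    def update(self, node_key, **attributes):
        """Store attributes and notify listeners about every actual change."""
        node = self._nodes.setdefault(node_key, {})
        for name, value in attributes.items():
            if name in node and node[name] == value:
                continue  # unchanged values do not trigger events
            old_value = node.get(name)
            node[name] = value
            for callback in self._listeners:
                callback(node_key, name, old_value, value)

    def get(self, node_key, name, default=None):
        return self._nodes.get(node_key, {}).get(name, default)


def is_usable(db, node_key):
    """Router-side check: avoid nodes that are dead or currently being pinged."""
    return bool(db.get(node_key, "alive", False)
                and not db.get(node_key, "ping_pending", False))


db = NodeDatabase()
events = []
db.subscribe(lambda key, name, old, new: events.append((key, name, old, new)))

# A (hypothetical) ping component first records the node as responsive ...
db.update("10.0.0.7:4000", alive=True, ping_pending=False)
# ... and later marks it as questionable when it starts pinging it.
db.update("10.0.0.7:4000", ping_pending=True)
```

In this sketch the ping component and the router never call each other. They only share the attribute stored in the database, which is the kind of decoupling the text describes.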

[Figure 4.4: View components in the Node Views architecture. Components shown: local node database, overlay application, view definitions, controllers, overlay routing, node set views and messages.]

When an overlay application terminates and is restarted, it has to reconnect to the overlay. If the local node database is made persistent, it can directly be used to find good candidates for joining. A (partially) persistent node database provides better support for this than a simple, semantic-free list of network addresses as currently used in most overlay implementations.

The high value of a consistent local view for making local decisions makes nodes and the node database central components of an overlay middleware. They store information that is relevant to all other components and act as a knowledge base for the local decisions of the overlay implementation and the running applications.

Node Set Views

Node set views represent the views that are specific to a topology [footnote 2]. They are active views of the node database and form the most important abstraction for overlay implementations. They can also be seen as filters for the database. The software components that use a view are restricted to see only a relevant subset of nodes and their attributes.

We can extract node set views from current overlays in many ways. Unstructured overlays usually do not characterise their neighbours and keep only a single set of nodes based on a pre-defined ranking. In a similar way, Super-Peer networks [YG03] use two distinct sets of peer nodes and super-peer nodes. Most structured overlays are based on two or more distinguished node sets. As shown for the Chord example in 4.2.1, the topologies of Pastry [RD01] and Tapestry [ZKJ01] also use two node sets: leaf-set and routing table. Structured overlays based on de-bruijn graphs (9.2.5) have only one node set. It contains the direct neighbours in the graph.

Footnote 2: Note that a single topology can have multiple views, such as the leaf-set and routing table found in Pastry [RD01], or different hierarchical levels as in super-node networks.

View Definitions

The subset of nodes that are available through a view is determined by a view definition. As generally known from database views, this is a function that constrains the node data taken from the database or recursively from other views. The interaction between database, views and their definition is illustrated in figure 4.4. The view definition language SLOSL, which was developed for this architecture, is presented in chapter 5.

View Events

As in any networked system, overlay components are naturally reactive and respond to changes in the local view. View events represent these changes. A notification about them is fired whenever nodes enter or leave a view, or when visible node attributes change. The node database can support this with a technique that has long been known in the active database area, Event-Condition-Action rules [DBM88, Pat99]. They trigger reactive components of the software when data changes occur. Views generate and filter notifications, and software components receive only those from the views they see.

Event-triggered components are a common idea in software design, where it is best known as the observer pattern [BMR+96]. It is also related to publish-subscribe systems (see 2.1). The view events supported by the SLOSL view definition language are described in section 5.5.

Harvesters and Controllers

Harvesters are a major part of the overlay maintenance implementation. They aim to update the database according to the view definitions. This includes updating single attributes of nodes as well as searching for new nodes that match the current view definitions. Note that harvesters do not aim to provide a global view. They continuously update and repair the restricted and possibly globally inconsistent local view.
The node database effectively decouples them from other parts of the overlay software, which enables their implementation as generic components in frameworks. Harvesters are further discussed in chapter 7.

From an implementation point of view, harvesters are mainly a design concept. MVC terminology uses the term controller (as in figure 4.4), which is any component that modifies the system state. A harvester in the Node Views architecture is a special controller or a complex composition of controllers that targets a specific functionality in an overlay network. The term harvester will be used wherever complex actions such as replacing nodes on failure or repairing views are involved.

Routers and Application Message Handlers

Other overlay components like message handlers or routers depend on node set views. They need node data for their decisions when accepting, routing, or otherwise handling messages. Structured overlays base this decision mainly on the ID assigned to a node. However, there is a substantial amount of research dealing with better decisions for more efficient overlays. These decisions are based on latency measurements and estimates, path redundancy, node histories, node capabilities, etc. Database and views support these decisions by providing locally consistent views on the nodes and pluggable, generic harvesters that maintain them. They decouple message handlers from maintenance components and reduce their implementation to very specialised components that become pluggable themselves.

Discussion

The architecture uses the Model-View-Controller pattern (database, node set views, harvesters) to split the previously monolithic implementations of overlay software. The resulting System Model simplifies and decouples large parts of the overlay implementation. Since data is stored in a single place, software components no longer have to care about any data management themselves. They benefit from a locally consistent data store and from notifications about changes.

As known from database views for server applications, node set views provide simplified, decoupled layers and a common interface for overlay software components. This makes components reusable and allows their generic implementation to become part of frameworks. Even if a specialised component has to be written from scratch, it benefits from consistent, active node set views and the data-driven decoupling from other components.
The most important feature of this abstraction, however, is the inherent support for topology rules and adaptation. The view definition becomes the central point of control for the characteristics of the overlay implementation. The language that we use for this purpose is described in the following chapter. Its high expressiveness for ranking, selecting and categorising nodes allows for a broad range of overlay design decisions while at the same time separating and hiding them from the implementation of overlay components.


Chapter 5

The SLOSL View Definition Language for Overlay Topologies

Contents
5.1 SLOSL Step-by-Step
5.2 Grammar
5.3 More Examples
5.4 SLOSL Routing
5.5 SLOSL Event Types

This chapter presents SLOSL, the SQL-Like Overlay Specification Language. It was specifically designed for topology view definitions and extends the SQL database query language to support the special semantics of topology rules and topology adaptation. The additional semantics also form the basis of overlay routing, as section 5.4 explains. The clear design goals of SLOSL were a high abstraction level and short, data-driven expressions.

Section 5.1 exemplifies how the different clauses of SLOSL interact for creating views and selecting nodes, based on the Node Views architecture described in the previous chapter. The subsequent section 5.2 describes the grammar in detail. Section 5.3 provides some more examples for SLOSL statements that implement different topologies. The next section (5.4) then outlines how SLOSL views are used to automate routing decisions. The final section 5.5 explains the different event types that the active SLOSL views provide, and that the Node Views architecture exploits to trigger controllers.

5.1 SLOSL Step-by-Step

This section describes the different clauses of SLOSL by evaluating an example step-by-step. It aims at providing a clear idea of the general execution semantics of SLOSL statements. It is important to note beforehand that this scheme is only one way of evaluating these statements. As a declarative language, SLOSL does not provide any guarantees about the order or time of evaluation of its clauses or expressions in execution environments. See section 8.3 for an introduction to possible implementations and optimisations.

Here, our running example is a SLOSL specification of the Chord graph as presented in the original publication [SMK+01]. See for a brief description of the Chord topology or 9.2 for the general concepts behind this kind of overlay.

    1 CREATE VIEW chord_fingertable
    2 AS SELECT node.id
    3 FROM node_db
    4 WITH log_k = log(K)
    5 WHERE node.supports_chord = true AND node.alive = true
    6 HAVING ring_dist(local.id, node.id) in [2^i, 2^(i+1))
    7 FOREACH i IN [0, log_k)
    8 RANKED highest(1, ring_dist(local.id, node.id))

As known from SQL, the first code lines do the following:

1. Define a node set view to be known under the name chord_fingertable.
2. The interface of this view presents the id attribute of its nodes to the components that use it. No other attributes are presented, although others may be used in the view specification itself.
3. This line states the super-views of the newly created one. In this case, it is a direct sub-view of the local node database.

The next thing to note is the WITH clause in line 4. It defines variables or options of this view. They can be set at instantiation time and changed at runtime. In the example, the variable log_k is given a default value, the logarithm of the size of the key space. It is used by Chord to determine the length of node identifiers.
For the Chord topology, this is a global constant that must be fixed at deployment time. The following figure presents the local view on the topology that we are constructing in this view. It results from the partial evaluation of the SLOSL clauses up to this point.

[Figure: local view after evaluating the clauses CREATE VIEW, AS SELECT and FROM]

As can be seen in the bottom-left corner of the graph, there is currently only one node missing in the local database. This may mean that the local node has not yet heard of it or has already removed it from the database due to space constraints. In large overlay networks, the database will usually only contain a small subset of all nodes, most notably those that are visible in the locally active views. Previously known nodes that have gone out of sight may be removed based on an implementation specific garbage collection or caching policy (see 7.3).

[Figure: evaluation step WHERE node.supports_chord AND node.alive]

The next step is to evaluate the WHERE clause. It constrains which nodes are considered valid candidates for this view (node selection). Here we exploit an attribute named supports_chord that is true for all nodes that know about the Chord protocol. The second attribute, alive, is true for nodes that the responsible harvesters deemed alive. As the figure shows, there are two nodes that are rejected by the boolean expression because the local node assumes them to be dead.

[Figure: evaluation step FOREACH i IN [0, log_k)]

We can now evaluate the FOREACH clause. Its purpose is to describe a set of buckets, the node categories or node equivalence classes. They are enumerated by the values of the run variable i in the example. If more than one FOREACH clause is given, each combination of values references one of the buckets.

[Figure: evaluation step HAVING ring_dist(local.id, node.id) in [2^i, 2^(i+1))]

Having determined the available buckets, we can start evaluating the HAVING expression for each node and each bucket. If it evaluates to true, we store the node in the bucket. In the example, the HAVING clause states that the result of the Chord distance function must lie within the given half-open interval (excluding the highest value) that depends on the bucket variable i. Here, the name ring_dist refers to a user-provided function. However, the distance is locally a constant which could just as well be stored in a node attribute.
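The evaluation steps described so far (node selection with WHERE, bucket creation with FOREACH, node categorisation with HAVING) can be mirrored in a few lines of Python. This is only a sketch of one possible evaluation order, not the SLOSL runtime. The tiny key space of 2^8, the dictionary representation of nodes, the clockwise definition of ring_dist and the function name chord_buckets are assumptions chosen for this illustration.

```python
import math

KEY_SPACE = 2 ** 8                  # assumed toy key space K
LOG_K = int(math.log2(KEY_SPACE))   # corresponds to WITH log_k = log(K)


def ring_dist(a, b):
    """Clockwise distance from identifier a to identifier b (an assumption)."""
    return (b - a) % KEY_SPACE


def chord_buckets(local_id, node_db):
    # WHERE: node selection
    candidates = [n for n in node_db if n["supports_chord"] and n["alive"]]
    # FOREACH i IN [0, log_k): one bucket per value of the run variable i
    buckets = {i: [] for i in range(LOG_K)}
    # HAVING: node categorisation into the half-open interval [2^i, 2^(i+1))
    for node in candidates:
        dist = ring_dist(local_id, node["id"])
        for i in buckets:
            if 2 ** i <= dist < 2 ** (i + 1):
                buckets[i].append(node["id"])
    return buckets


node_db = [
    {"id": 1, "supports_chord": True, "alive": True},
    {"id": 3, "supports_chord": True, "alive": True},
    {"id": 5, "supports_chord": True, "alive": True},
    {"id": 6, "supports_chord": True, "alive": False},   # rejected by WHERE
    {"id": 7, "supports_chord": True, "alive": True},
    {"id": 100, "supports_chord": True, "alive": True},
    {"id": 200, "supports_chord": False, "alive": True},  # rejected by WHERE
]
buckets = chord_buckets(0, node_db)
```

For the local identifier 0, the nodes 5 and 7 end up in the same bucket (i = 2), so a later ranking step has to choose between them, while the buckets for i = 3, 4, 5 and 7 stay empty.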

[Figure: evaluation step RANKED highest(1, ring_dist(local.id, node.id))]

Finally, the ranking function highest chooses from each bucket the top 1 node(s) that provide the highest value for the term ring_dist(local.id, node.id). The ranking function allows us to do topology adaptation. It selects nodes from each category that match specific criteria best.

5.2 Grammar

This section describes the different clauses of SLOSL in more detail. It defines the grammar in a simple, EBNF-like fashion. A few conventions are used:

- somename-commalist denotes a comma-separated list of somename elements.
- expression denotes more or less arbitrary arithmetic expressions, while bool-expression refers to an expression that returns a boolean value (true or false), based on comparisons and the common operators AND, OR, NOT.
- somename-identifier denotes an identifier and gives it the name somename. This name is not used in the grammar but indicates its meaning. In any case, identifiers are lower-case alpha-numeric names (including underscores) that start with a letter.

As can be seen from the example, the complete statement of a SLOSL definition consists of seven parts:

    statement ::= create attr-select parent-select [options] \
                  [where] [bucket-select] [ranking] ;

Comments up to the end of the line are allowed everywhere (outside tokens) and begin with a double hyphen (--).

CREATE VIEW

    create ::= CREATE VIEW view-identifier [ AS ]

This clause creates and identifies a view.

SELECT

    attr-select   ::= SELECT attribute-def-commalist
    attribute-def ::= attribute-identifier = expression
    attribute-def ::= node. attribute-identifier

This clause defines the interface of nodes in this view. Only the attributes listed here can be seen by software components that use the view. However, all attributes found in the parent views can be used for the view definition. The expression in the last line is a common short form for attrib = node.attrib, where the new attribute name is identical with the original one.

It is quite a capable feature that attributes can be assigned their value by arbitrary expressions that may or may not depend on the attributes with the same name in the parent views. This can be used for renaming attributes, e.g. to trick a black-box component into using other attributes, to convert between different attribute representations (time, value ranges, ...) or to prepare data for transmission in message protocols. If no assignment is supplied, the attribute is presented unchanged as found in the parent views.

FROM

    parent-select ::= FROM parent-identifier-commalist

Each view takes its nodes from its parent view(s). The set of candidate nodes is the union of all nodes in its parents. All parent views must provide at least those attributes for their nodes that are referenced in the SLOSL statement.

WITH

    options           ::= WITH option-assignment-commalist
    option-assignment ::= option-identifier [ = expression ]

The WITH clause names configurable options of the defined view. They can be referenced in the clauses SELECT, RANKED, WHERE and HAVING ... FOREACH. Options can be assigned a default value in the definition, set at view instantiation time and changed at run-time. Special care must be taken if they are used in the FOREACH list (as shown in the Chord example).
Changing their value at run-time can mutate the structure of the view buckets. This may break components connected to the view if they expect a static bucket structure.

WHERE

    where ::= WHERE bool-expression

The boolean expression puts constraints on the attributes of nodes that are considered candidates for this view. This clause implements the node selection part of the topology rules.

HAVING ... FOREACH

    bucket-select ::= FOREACH BUCKET
    bucket-select ::= [ HAVING bool-expression ] ( bucket-def )+
    bucket-def    ::= FOREACH bucket-variable-identifier \
                      IN (range | constants-commalist | \
                          sequence-identifier)
    range         ::= integer : integer

This clause aggregates nodes into buckets. It implements the node categorisation of the topology rules.

In its simplest incarnation, the FOREACH BUCKET clause copies the buckets from the parent views. View definitions can thus ignore the bucket structure of parents and still provide it as their own interface. This allows template-like view definitions that change only the interface of nodes (SELECT) or that select a subset of nodes (WHERE) from the buckets. In the absence of any FOREACH clause, the view will only have a single flat bucket containing all visible nodes.

In the second and more common form, the HAVING ... FOREACH clause defines either a single bucket of nodes, or a list, matrix, cube, etc. of buckets. The structure is imposed by the occurrence of zero or more FOREACH ... IN clauses, where each clause adds a dimension. Nodes are then selected into these buckets by the optional HAVING expression (which defaults to true). A node can appear in multiple buckets of the same view if the HAVING expression allows it. If bucket variables are used in the view attribute definition (SELECT), the same node can carry different attributes in different buckets of the created view (see the Pastry example in 5.3).

The bucket abstraction is enough to implement graphs like Chord, Pastry, Kademlia or de-bruijn and should be just as useful for a large number of other cases. It is not limited to numbers and ranges: buckets can be defined on any sequence of values.
Numbers are commonly used in structured overlays (which are based on numeric identifiers), while strings can for example be used for topic-clustering in unstructured networks. Since SLOSL is working on a database, the values in the FOREACH list can naturally be taken from a database table. However, the warning about variable usage (WITH) applies in this case, too.
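How FOREACH clauses span a bucket structure, and how the HAVING expression fills it, can be sketched generically in Python. The function build_buckets, the dictionary-based node representation and the attribute names topics and region are hypothetical and only serve this illustration. The sketch represents each bucket by the tuple of its bucket variable values.

```python
from itertools import product


def build_buckets(nodes, dimensions, having=lambda node, **bucket_vars: True):
    """Create one bucket per combination of FOREACH values and fill it.

    dimensions maps a bucket variable name to its sequence of values.
    having is called with the node and the bucket variables as keywords.
    """
    names = list(dimensions)
    buckets = {}
    for values in product(*(dimensions[name] for name in names)):
        bucket_vars = dict(zip(names, values))
        buckets[values] = [n["id"] for n in nodes if having(n, **bucket_vars)]
    return buckets


nodes = [
    {"id": "a", "topics": {"music"}, "region": "eu"},
    {"id": "b", "topics": {"music", "video"}, "region": "us"},
]

# One FOREACH clause over strings: a list of buckets (topic clustering).
by_topic = build_buckets(
    nodes, {"topic": ["music", "video"]},
    having=lambda node, topic: topic in node["topics"])

# Two FOREACH clauses: a matrix of buckets, one per value combination.
by_topic_and_region = build_buckets(
    nodes, {"topic": ["music", "video"], "region": ["eu", "us"]},
    having=lambda node, topic, region:
        topic in node["topics"] and node["region"] == region)

# No FOREACH clause and the default HAVING: a single flat bucket.
flat = build_buckets(nodes, {})
```

Node b appears in both topic buckets because the HAVING predicate holds for it in each of them. With an empty set of dimensions the sketch yields exactly one bucket that contains all nodes.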

RANKED

    ranking  ::= RANKED function ( expression-commalist )
    function ::= lowest | highest | closest | furthest

This clause describes a filter (the ranking function) that selects a list of nodes from a list of candidate nodes. As motivated in chapter 3, ranking nodes is a major task in overlay adaptation. SLOSL moves it from the implementation into the view definition and makes it configurable and adaptable through view options and bucket variables.

An important feature of this approach is that an application can use multiple specifications and/or rankings to instantiate on-demand the views that currently fit best. In the Chord example, nodes are ranked by highest ID, but one could also use the lowest latency, highest uptime, age of last contact, or some reliability or reputation measure. It is obvious that switching between different ranking mechanisms (RANKED) or node selections (WHERE) does not have any impact on the implementation of routers, harvesters or similarly connected overlay software components, as long as the bucket structure of the view (FOREACH) stays unchanged. It does, however, have a major impact on the global topology and its performance characteristics.

One interesting property of the Chord example is that the ranking function is independent of the bucket where it is evaluated. It only depends on the attributes of the node itself and can be calculated in advance. While this is not the case in all topologies, it is an obvious optimisation where available. See section for a discussion of possible optimisations.

5.3 More Examples

This section briefly presents a number of topology implementations in SLOSL. The diversity of the examples motivates the broad applicability of SLOSL to the design of various different types of overlays.

De-Bruijn Graphs

De-Bruijn graphs [db46] combine a 1-dimensional view with some calculations that yield at most one node per bucket.
Note that due to the structure of these graphs, a functional overlay implementation will need more than one view here to compensate for empty buckets if the ID space is not completely filled up with nodes. Such approaches are described in [LKRG03, DMS04].

    CREATE VIEW de_bruijn
    AS SELECT node.id
    FROM node_db
    WITH max_id=2^64, max_digits=8
    WHERE node.alive = true
    HAVING node.id = (local.id * max_digits) % max_id + digit
    FOREACH digit IN (0 : max_digits)

Pastry

Pastry [RD01] uses a 2-dimensional view and two user-provided functions: digit_at_pos for the digit at a given prefix position in the ID and prefix_len, the length of the longest common prefix of two IDs. As in the Chord example, the ring distance is a simple function that is left out for readability reasons.

    CREATE VIEW pastry_routing_table
    AS SELECT node.hex_id, node.id
    FROM node_db
    WITH max_digit=16, max_prefix=40
    WHERE node.alive = true
    HAVING digit_at_pos(node.hex_id, prefix) = digit
       AND prefix_len(local.hex_id, node.hex_id) = prefix
    FOREACH digit IN (0 : max_digit)
    FOREACH prefix IN (0 : max_prefix)
    RANKED highest(1, ring_dist(local.id, node.id))

Pastry uses a second view for what it calls a leaf-set. It contains the direct neighbours in the ring, both on the left and the right side of the local node. There are different ways of specifying the leaf-set in SLOSL, one being the use of two views for left and right neighbours. Here, we present a second (and a little more complex) possibility that uses two buckets for this purpose. Note also the use of a new node attribute side to memorise where the neighbour was found. The ncount option makes the number of neighbours configurable at run-time.

    CREATE VIEW circle_neighbours
    AS SELECT node.id, side=sign
    FROM node_db
    WITH ncount=10, max_id=
    WHERE node.alive = true
    HAVING abs(node.id - local.id) <= max_id / 2
           AND sign * (node.id - local.id) < 0
        OR abs(node.id - local.id) > max_id / 2
           AND sign * (node.id - local.id) > 0
    FOREACH sign IN (-1, 1)
    RANKED lowest(ncount, ring_dist(local.id, node.id))
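The idea behind the leaf-set view, two buckets for the two sides of the ring, each reduced to the ncount closest neighbours, can be sketched in Python as follows. The sketch does not transcribe the HAVING expression of the listing literally. The helper names side_of and lowest, the symmetric ring distance, and the convention that -1 stands for the counter-clockwise side and 1 for the clockwise side are assumptions made for this illustration.

```python
def ring_dist(a, b, max_id):
    """Shortest distance between two identifiers on a ring of size max_id."""
    d = abs(a - b) % max_id
    return min(d, max_id - d)


def side_of(local_id, node_id, max_id):
    """Return 1 if node_id is reached faster clockwise, otherwise -1."""
    clockwise = (node_id - local_id) % max_id
    return 1 if clockwise <= max_id // 2 else -1


def lowest(count, nodes, key):
    """Ranking function: the count nodes with the lowest key value."""
    return sorted(nodes, key=key)[:count]


def circle_neighbours(local_id, node_ids, ncount, max_id):
    buckets = {-1: [], 1: []}              # FOREACH sign IN (-1, 1)
    for node_id in node_ids:
        if node_id != local_id:            # the local node is no neighbour
            buckets[side_of(local_id, node_id, max_id)].append(node_id)
    return {                               # RANKED lowest(ncount, ring_dist)
        side: lowest(ncount, members,
                     key=lambda n: ring_dist(local_id, n, max_id))
        for side, members in buckets.items()
    }


leaf_set = circle_neighbours(local_id=1, node_ids=[0, 2, 3, 8, 14, 15],
                             ncount=2, max_id=16)
```

On this ring of 16 identifiers, the local node 1 keeps the nodes 0 and 15 on one side and the nodes 2 and 3 on the other. Changing ncount only changes how many candidates the ranking step keeps per bucket, not the bucket structure.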

Scribe on Chord

Scribe [RKCD01], as presented in section 2.3.1, is a multicast scheme that was initially implemented for the Pastry overlay. It is, however, generally applicable to key-based routing overlays (see 9.2), which allows its implementation also on top of Chord.

Scribe essentially exploits the key mapping provided by the overlay network to determine a rendezvous node for each multicast group. It then sends group messages towards the rendezvous, which serves as the root of a multicast tree. Subscriptions build up forwarding state along the way and publications follow them backwards.

Subscriptions and publications simply follow the Chord routing policy towards the rendezvous node. Once a match has been found, however, the publications must be forwarded according to rules that are specific to Scribe. SLOSL models subscription state as a set of group identifiers that it keeps for each neighbour. It then builds up one view for each group topology that the local node participates in. The view selects only those Chord fingers that are subscribed for messages of this group. Publications are simply broadcast to all nodes in the view to forward them along the multicast tree.

    CREATE VIEW scribe_subscribed_fingers
    AS SELECT node.id
    FROM chord_fingertable
    WITH group
    WHERE group in node.subscribed_groups

5.4 SLOSL Routing

The SLOSL view definitions allow for a simple extension towards a complete routing strategy. As the Pastry specification in 5.3 exemplifies, most overlay systems base their routing decisions on more than one view. They commonly use different views for semantically close and far neighbours, such as a neighbour table and a routing table. When a router looks for the next hop for a given destination, all it has to do is test the relevant views in a sensible order to see if such a node exists.
This process exploits both the node categorisation and ranking features of SLOSL as follows.

1. The SLOSL clauses HAVING ... FOREACH are evaluated against the (partially) known attributes of the destination node to find the corresponding bucket. If that fails, the complete process fails for this view.
2. If at least one bucket is found, its RANKED clause is executed to determine the best target.
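The two steps above can be sketched in Python for a view shaped like the Chord finger table. The sketch follows the two steps literally and does not attempt to reproduce the complete Chord routing algorithm. The representation of a view as a dictionary from bucket index to node identifiers, the clockwise ring_dist and the function names find_bucket and next_hop are assumptions made for this illustration.

```python
def ring_dist(a, b, key_space):
    """Clockwise distance from identifier a to identifier b (an assumption)."""
    return (b - a) % key_space


def find_bucket(local_id, dest_id, key_space):
    """Step 1: evaluate the HAVING/FOREACH condition for the destination.

    Returns the index i with 2^i <= ring_dist < 2^(i+1), or None if the
    destination does not fall into any bucket (distance 0).
    """
    dist = ring_dist(local_id, dest_id, key_space)
    if dist == 0:
        return None
    return dist.bit_length() - 1


def next_hop(local_id, dest_id, view, key_space):
    """Step 2: apply the ranking (here: highest ring distance) to the bucket."""
    bucket = find_bucket(local_id, dest_id, key_space)
    if bucket is None or not view.get(bucket):
        return None  # routing through this view fails
    return max(view[bucket], key=lambda n: ring_dist(local_id, n, key_space))


# Hypothetical view content: bucket index -> identifiers of the nodes in it.
view = {0: [1], 2: [5, 7], 6: [70, 100]}
```

Note that only the destination identifier is needed. The destination node itself does not have to be present in the view or in the database.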

Note that the WHERE clause is not required to be evaluated. In fact, there may not even be enough local information about the destination node to do node selection. The node may only be known from the destination field of the currently forwarded message and thus may not appear in the view or even in the database. To find the next hop that corresponds to the destination, however, it is sufficient to find the category that it would normally be in. The category determines the local bucket of equivalent nodes.

Furthermore, there is nothing that prevents the router from first querying the database for a node that matches the destination exactly. If it is found, its physical address can be used to send the message directly. This can be used to reduce the latency towards certain nodes that the local node is not normally connected to in the overlay topology.

So far, we only considered unicast, i.e. forwarding the message to exactly one neighbour. Some protocols will require broadcast or multicast. In SLOSL overlays, the unicast, multicast and broadcast schemes turn out to be identical, as SLOSL already selects a set of nodes. Multicasting to a subset of the neighbours is the same as broadcasting to a view that selects them. Broadcasting to a view that selects a single neighbour from the only corresponding bucket is the same as unicasting to that neighbour. SLOSL routing therefore incorporates all three types of message forwarding in a general broadcast to all targets that result from the evaluation of a view.

[Figure 5.1: General EDGAR graph for SLOSL routing. Elements shown: first match, fork and exclude last edges; the predicates "for me?" and "dest in DB?"; views 1 to 3; the actions handle locally, send directly, fall back and handle error or drop.]

Figure 5.1 shows a general decision graph for SLOSL routing. In general, a graph of this kind is all that is required to fully describe routing components.
The language used to express this graph is called EDGAR (see 6.3), Extensible Decision Graphs for Adaptive Routing. Within the graph, messages arrive from the left and are routed as follows.

The first match edges simply traverse their children in top-down order. If a target is found, the routing process terminates and the message is forwarded to the target. This corresponds to an ordered XOR evaluation. The first child at the top-left has a predicate associated with it that tests if the message is to be accepted locally. Such a predicate is external to the graph and must be provided by the overlay implementation. Depending on the topology, however, the same decision may be available as a generic fall-back if the attempts to route through the views fail. This is shown in the lower part of the figure.

The fork edge splits up the routing process and continues it in all of its child branches independently, thus yielding an OR evaluation. In the case shown in the figure this means that the message is broadcast to both views 2 and 3 if no matching target is found during the evaluation of view 1. If none of the three views yields a target, routing continues with the two fall-backs at the bottom.

The exclude last edge is used to tag a sub-tree. It prevents the node that last forwarded the message from appearing in any of the target sets that are found further down the decision tree. This is commonly used to avoid duplicates in broadcast or multicast forwarding, since the last hop has already received the message. The feature obviously requires the last hop to be identifiable, which is the case in most physical networks. If it is not available from the physical layer, this information can always be explicitly provided within higher-level message headers. When used in combination with a first match edge, the last hop is always discarded before testing the target set for success.

Examples

Specific routing strategies typically have simpler graphs than the one above, as they do not exploit all possibilities. The Chord routing algorithm, for example, is specified as in figure 5.2.

[Figure 5.2: Chord routing, implemented using SLOSL views. Elements shown: first match edge, the predicate "for me?", handle locally, finger table, neighbour set, handle error or drop.]

However, even more complex routing strategies, like the routing of multicast messages over a Chord graph (similar to the Scribe approach), become easily understandable when expressed using SLOSL routing graphs. Figure 5.3 shows the complete implementation for the forwarding of publications. The example of implementing Scribe on top of the Chord overlay based on the Node Views architecture is discussed in more detail in section 8.4.
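The semantics of the three edge types described for these decision graphs can be sketched with a few Python combinators. This is not EDGAR syntax and not the implementation used in this work. The function names, the message representation as a dictionary with the keys dest and last_hop, and the example graph are assumptions made for this illustration. Each graph element maps a message to a set of targets.

```python
def view(select):
    """Leaf: wraps a function that maps a message to a set of target nodes."""
    def route(msg, excluded=frozenset()):
        return set(select(msg)) - excluded
    return route


def first_match(*children):
    """Try the children in order and return the first non-empty target set."""
    def route(msg, excluded=frozenset()):
        for child in children:
            targets = child(msg, excluded)
            if targets:
                return targets
        return set()
    return route


def fork(*children):
    """Continue in all children independently and merge their target sets."""
    def route(msg, excluded=frozenset()):
        targets = set()
        for child in children:
            targets |= child(msg, excluded)
        return targets
    return route


def exclude_last(child):
    """Tag a sub-tree: the last hop never appears in target sets below it."""
    def route(msg, excluded=frozenset()):
        return child(msg, excluded | {msg["last_hop"]})
    return route


# Hypothetical example graph: accept locally, else view 1, else views 2 and 3.
for_me = view(lambda msg: {"local"} if msg["dest"] == "me" else set())
view1 = view(lambda msg: {"a"} if msg["dest"] == "a" else set())
view2 = view(lambda msg: {"b", "c"})
view3 = view(lambda msg: {"c", "d"})
graph = first_match(for_me, exclude_last(first_match(view1, fork(view2, view3))))
```

Because the excluded set is passed down to the leaves, the last hop is removed before a first match node tests a target set for success, as required in the description of the exclude last edge.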

91 5.4. SLOSL ROUTING 67 fork [ am I subscribed? ] handle locally first match [ am I rendezvous? ] subscribed fingers exclude last fork forward (Chord) Figure 5.3: Scribe routing over Chord, implemented using SLOSL views Iterative and Recursive Routing As stated before, implementing EDGAR in a recursive routing scheme is straight forward. Each node receives the message, takes its decisions and forwards it to the next hop. However, it is also possible to implement EDGAR in iterative routing schemes. Here, the source node repeatedly asks the next hop for its successor, until it finds the destination node and sends the message directly. Both schemes are displayed in figure 5.4. query 1st hop response Source iterative query response 2nd hop query response forward Receiver Source recursive forward 1st hop forward ACK 2nd hop forward Receiver Figure 5.4: Iterative and Recursive Multi-Hop Routing The two routing schemes have different quality-of-service characteristics. In the iterative scheme, the source knows all hops and can deploy arbitrary policies to decide if it trusts them and what it sends them. As opposed to recursive routing, it has immediate feedback about the routing delay and about node failures along the path. Obviously, the message overhead of the sender is substantially higher. In recursive routing, the message overhead is the same for each intermediate node (unless recursive ACKs are used) while the sender is relieved from

the main work. If intermediate node failures occur, the delivery can only be guaranteed if additional ACKs are in place. This also implies that multi-hop forwarding can introduce arbitrary forwarding delays that are not controllable by sender or receiver. An advantage for overlay networks is that the forwarding path follows the topology, which can be optimised for that purpose (topology adaptation).

Given the different characteristics of both schemes, it can be interesting to configure the forwarding scheme as part of the implementation. This provides an additional level of adaptation to application requirements.

The evaluation of EDGAR results in a set of nodes. However, the node that takes the routing decision is not required to have the complete message available. It only needs to have the node data about the destination that is required by the graph. The sender can create an incomplete node representation and send it to a remote node to request its EDGAR evaluation. The remote node responds with the resulting set of nodes and thus allows the sender to continue its traversal of the topology.

5.5 Event Types of SLOSL Views

The Node Views architecture is largely event-driven. The events of incoming and outgoing messages are handled by harvesters, controllers and routers. For maintenance, however, the most important event sources are SLOSL views. They generate events for nodes appearing in or disappearing from views, or for views or buckets that turn empty. This section gives an overview of the different types of SLOSL events that emanate from nodes, buckets and views.

Nodes. The simplest form of event is a node event. Nodes can enter or leave views, and their attributes can be updated. The latter regards either the attributes stored in the database or those calculated in the SELECT clause of SLOSL statements.
Note that the update of a node attribute can trigger enter or leave events in dependent views after their re-evaluation.

Buckets. At the next level, buckets can turn empty when their last node leaves or can become non empty when a node enters a previously empty bucket. For views with a RANKED clause, buckets can also produce full and not full events respectively, when the number of nodes reaches the size of the bucket or falls below it.

Views. Being a set of buckets, views support the same event types. They can become empty when the last node from the last non empty bucket leaves or non empty when a node enters a previously empty view. Additionally, views become full if a node enters the last empty bucket, or not full when all buckets of the view were non empty and one of them turns empty. Note that a full view does not require all of its buckets to be full. The

full event therefore means that the view no longer contains any empty buckets. The case where all buckets of a view get filled up triggers the buckets full event.

The descriptions above make it clear that there are dependencies between the event types. They are displayed in figure 5.5. Note that the graph shows different ways of generating view (not) empty events, either from bucket events or from node events (dashed lines). Implementations can decide which one is more efficient for them, and may let this depend on the events that an overlay implementation actually uses. If no bucket events are used, it is likely more efficient to generate these events from node events directly.

There is an edge case where generating these events is not obvious from the graph. As buckets can have a bounded size through the RANKED clause, a node update can replace a node in a bucket. This will generate enter and leave events for the impacted nodes. However, if the size of the bucket is unchanged after the update propagation, the operation must not generate bucket events or view events. From the point of view of event generation, each view update should therefore be considered an atomic operation. Although this is rarely hard to achieve in practice, implementors must be aware of it.
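The atomicity requirement can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function name, the event names and the representation of a view as a mapping from bucket names to sets of node identifiers are assumptions made for illustration only. The sketch compares the view content before and after one complete update, so that bucket and view events are derived from the two end states and never from intermediate steps.

```python
def view_update_events(old, new, capacity):
    """Derive the events of ONE atomic view update.

    old/new map bucket names to sets of node ids (hypothetical layout);
    capacity is the bucket size given by a RANKED clause.
    """
    names = sorted(old.keys() | new.keys())
    events = []
    for name in names:
        before = old.get(name, set())
        after = new.get(name, set())
        # Node events are generated for every impacted node.
        events += [("node_leaves", name, node) for node in sorted(before - after)]
        events += [("node_enters", name, node) for node in sorted(after - before)]
        # Bucket events depend only on the state before and after the update.
        if bool(before) != bool(after):
            events.append(("bucket_non_empty" if after else "bucket_empty", name))
        was_full, is_full = len(before) >= capacity, len(after) >= capacity
        if was_full != is_full:
            events.append(("bucket_full" if is_full else "bucket_not_full", name))

    def summary(snapshot):
        sizes = [len(snapshot.get(name, ())) for name in names]
        # (view non empty, no empty bucket left, all buckets full)
        return any(sizes), all(sizes), all(size >= capacity for size in sizes)

    was_non_empty, had_no_gaps, was_packed = summary(old)
    is_non_empty, has_no_gaps, is_packed = summary(new)
    if was_non_empty != is_non_empty:
        events.append(("view_non_empty" if is_non_empty else "view_empty",))
    if had_no_gaps != has_no_gaps:
        events.append(("view_full" if has_no_gaps else "view_not_full",))
    if is_packed and not was_packed:
        events.append(("view_buckets_full",))
    return events


# A node update replaces node "a" by node "c" in a full bucket of size 1:
replaced = view_update_events({"-1": {"a"}, "1": {"b"}},
                              {"-1": {"c"}, "1": {"b"}}, capacity=1)
# A node enters the last empty bucket of a view with buckets of size 2:
filled = view_update_events({"-1": set(), "1": {"b"}},
                            {"-1": {"a"}, "1": {"b"}}, capacity=2)
```

In the first call only the two node events are produced, because the bucket size is unchanged. In the second call the bucket becomes non empty and the view becomes full, although none of its buckets is full.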

[Figure 5.5: Dependencies of SLOSL view events. The graph links database events (node inserted, node attribute updated, node deleted) to view node events (node enters view, node updated, node leaves view, including replaced and replacement nodes), to bucket events (bucket non empty, bucket full, bucket not full, bucket empty) and to view events (view non empty, view full, all view buckets full, view not full, view empty).]

Chapter 6

OverML

Contents
6.1 NALA - Node Attributes
6.2 SLOSL - Local Views
6.3 EDGAR - Routing Decisions
6.4 HIMDEL - Messages
6.5 EDSL - Event-Driven State-Machines

The Overlay Modelling Language OverML is a set of XML languages for overlay specifications. It comprises the five necessary parts as described in the previous chapters: node attributes, messages, view definitions, routers and EDSM graphs. The following sections describe the syntax and semantics of each language. A formal RelaxNG schema [CM01] is provided in appendix A.1.

Figure 6.1 illustrates the interaction between the five languages NALA (the Node Attribute Language), SLOSL (the SQL-Like Overlay Specification Language), EDGAR (Extensible Decision Graphs for Adaptive Routing), HIMDEL (the Hierarchical Message Description Language) and EDSL (the Event-Driven State-machine Language).

[Figure 6.1: The OverML Languages]

The main advantage of having an integrated set of languages is their closure, i.e. their self-contained interaction as the figure shows. SLOSL requires NALA for a definition of the underlying database schema. HIMDEL relies on NALA and SLOSL to define the semantics of message fields and view containers. EDGAR uses SLOSL to make routing decisions about HIMDEL defined messages. EDSL uses HIMDEL and SLOSL to infer the available events for messages and views.

6.1 NALA - The Node Attribute Language

NALA serves two purposes. First, it provides data types for OverML that can be restricted or composed into more complex types. Its main purpose, however, is the definition of node attributes that SLOSL works on. Attributes can currently use a subset of the data types defined for SQL [SQL92], but the XML Schema data types [XSD04] are an alternative from the XML world. A mapping between the two is provided as part 14 of the SQL 2003 standard [SQL03b]. In both cases, NALA allows the definition of custom data types based on the available standard types.

Type Definition

The nala:types section in OverML defines the data types available to the application. A possible list containing all standard types follows.
    <nala:types xmlns:nala="..."
                xmlns:sql="...">
      <sql:bigint name="bigint"/>            <!-- 64 bit integer -->
      <sql:boolean name="boolean"/>          <!-- true/false -->
      <sql:bytea name="bytea"/>              <!-- binary byte array -->
      <sql:char name="char"/>                <!-- fixed length unicode string -->
      <sql:date name="date"/>                <!-- calendar date -->
      <sql:decimal name="decimal"/>          <!-- arbitrary length integer -->
      <sql:double name="double"/>            <!-- 64 bit float -->
      <sql:inet name="inet"/>                <!-- IPv4/IPv6 address -->
      <sql:integer name="integer"/>          <!-- 32 bit integer -->
      <sql:interval name="interval"/>        <!-- relative time span -->
      <sql:macaddr name="macaddr"/>          <!-- network MAC address -->
      <sql:money name="money"/>              <!-- fixed point currency -->
      <sql:real name="real"/>                <!-- 32 bit float -->
      <sql:smallint name="smallint"/>        <!-- 16 bit integer -->
      <sql:text name="text"/>                <!-- variable length unicode string -->
      <sql:time name="time"/>                <!-- time of day -->
      <sql:timetz name="timetz"/>            <!-- time plus timezone -->
      <sql:timestamp name="timestamp"/>      <!-- date and time -->
      <sql:timestamptz name="timestamptz"/>  <!-- timestamp plus timezone -->

      <sql:composite name="tcp_address">     <!-- composite data type -->
        <sql:inet name="address"/>
        <sql:shortint name="port"/>
      </sql:composite>

      <sql:decimal name="id128" bits="128"/> <!-- restricted decimals -->
      <sql:decimal name="id256" bits="256"/>

      <sql:array name="integers">            <!-- typed sequence -->
        <sql:int/>
      </sql:array>
      <sql:set name="IPs">                   <!-- typed set -->
        <sql:inet/>
      </sql:set>
    </nala:types>

Note how sql:decimal appears multiple times. The first (named decimal) is the normal SQL data type while the others are custom types. They were given different names and restricted to a fixed bit size. As the names suggest, they can be used for node IDs.

As this example also shows, the sql:composite data type allows composing multiple simple types into a single structured type. Similarly, the array and set types define sequences and sets that contain items of a specific type. Besides their wide-spread support in current programming languages, most relational databases also provide an array type, whereas they would represent the set type as a table.

Attributes

Any of the defined types can be used to specify node attributes. Each attribute has a unique name. To further specify the semantics of the attribute, the following flags can be used.

identifier  If set, the attribute can be assumed to uniquely identify the node. If a node carries multiple identifiers, each one is treated independently as a unique identifier. This allows different levels of identification, most notably physical and logical addresses. Note that multiple types (like IP address and port) can be composed into a single new type, which can then be used as a single identifier.

static  If set to true, the attribute is assumed to be static once it is known about a node. All identifiers are implicitly static, but not all static attributes fulfil the uniqueness requirement of an identifier.
Again, inconsistencies must trigger events.

transferable  Specifies whether it makes sense to include the attribute in messages. Some attributes (like the network latency to a neighbour or other locally calculated distances) only make sense locally, so they should be marked non-transferable.

selected  Activates the attribute for use in the database schema. Unselected attributes can be defined in the specification without actually being part of the data schema during execution. This is a pure convenience option for developers.

There is one special type of attribute: a dependent attribute. Dependent attributes are calculated based on other attributes. They must be updated whenever the underlying attribute is modified. An example is the constant ring distance between the identifier of a remote node and the local one in a Chord overlay. It only depends on the identifier attributes and can be calculated once and then stored in the database. Other examples are aggregated values, like the average latency over a sliding window. The lower change rate of averaged attribute values is helpful when trading optimal topology adaptation against the cost of reconfiguration, such as opening connections and exchanging state with new neighbours.

Simple dependent attributes can be expressed using MathML expressions (type math), while more complex ones reference an external function (type external) that must be provided at compile time or deployment time. An interesting future extension to OverML could support portable implementations of more elaborate functions. However, it is not easy to provide such implementations in a programming language independent way, while at the same time coming close enough to the expressiveness of a general programming language. It is currently left to further research to find a suitable tradeoff.

There is another important application of dependent attributes.
When multiple overlay specifications from different authors are integrated into a single application, they may not obey the same naming scheme for attributes or use different data type representations for the same node attributes. Developers can deal with this by modifying the OverML specifications to use the same data schema, or they can introduce dependent attributes that match the schema of other specifications. If the overlap of different schemas is high, this can be a fast and easy approach for an integration, as the overall number of attributes in a NALA data schema tends to be small. NALA makes the database aware of these dependent attributes. Threaded implementations can therefore update them in a synchronised or transactional manner to prevent premature update events. After calculating all dependent attributes, a collective event can be triggered that avoids unnecessary view updates or inconsistent attributes. Another possibility is to use lazy updating and only mark dependent attributes as outdated. They can then be re-calculated the next time they are accessed.
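The lazy updating strategy can be sketched in a few lines of Python. This is an illustrative sketch only, not the database implementation of the Node Views architecture: the class NodeRecord, its methods and the 8-bit identifier space are hypothetical, and the distance function is the usual clockwise Chord ring distance. Only direct dependencies are tracked; chains of dependent attributes would need an additional propagation step.

```python
class NodeRecord:
    """One node entry with lazily updated dependent attributes (hypothetical)."""

    def __init__(self, **attributes):
        self._values = dict(attributes)
        self._dependent = {}    # attribute name -> (source names, function)
        self._outdated = set()  # dependent attributes awaiting re-calculation

    def define_dependent(self, name, sources, function):
        self._dependent[name] = (tuple(sources), function)
        self._outdated.add(name)

    def set(self, name, value):
        self._values[name] = value
        # Do not re-calculate now, only mark direct dependents as outdated.
        for dependent_name, (sources, _) in self._dependent.items():
            if name in sources:
                self._outdated.add(dependent_name)

    def get(self, name):
        if name in self._outdated:
            sources, function = self._dependent[name]
            self._values[name] = function(*(self.get(s) for s in sources))
            self._outdated.discard(name)
        return self._values[name]


LOCAL_ID = 5       # assumed identifier of the local node
ID_SPACE = 2 ** 8  # assumed (tiny) identifier space for the example
calls = []         # records how often the external function really runs

def calculate_chord_distance(node_id):
    calls.append(node_id)
    return (node_id - LOCAL_ID) % ID_SPACE

node = NodeRecord(id=250)
node.define_dependent("chord_distance", ["id"], calculate_chord_distance)
first = node.get("chord_distance")   # calculated on first access
again = node.get("chord_distance")   # served from the stored value
node.set("id", 3)                    # marks chord_distance as outdated
second = node.get("chord_distance")  # re-calculated on next access
```

The function is evaluated once per change of the underlying attribute and only if the dependent value is actually read, which is the trade-off that lazy updating aims for.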

An XML example specification of different attributes follows. Note that (except for selected) the flags are represented as elements to simplify future extensions.

    <nala:attributes xmlns:nala="...">
      <nala:attribute name="id" type_name="id256" selected="true">
        <nala:static/>
        <nala:transferable/>
        <nala:identifier/>
      </nala:attribute>
      <nala:attribute name="knows_pastry" type_name="boolean"
                      selected="true">
        <nala:static/>
        <nala:transferable/>
      </nala:attribute>
      <nala:attribute name="knows_chord" type_name="boolean"
                      selected="true">
        <nala:static/>
        <nala:transferable/>
      </nala:attribute>
      <nala:attribute name="chord_distance" type_name="id"
                      selected="true">
        <nala:static/>
        <nala:depends type="external">
          <nala:attribute-ref name="id"/>
          <nala:call name="calculate_chord_distance"/>
        </nala:depends>
      </nala:attribute>
      <nala:attribute name="latency" type_name="interval"
                      selected="true"/>
    </nala:attributes>

6.2 SLOSL - The SQL-Like Overlay Specification Language

The SLOSL language for view definitions is presented in chapter 5. It describes the topology characteristics of the resulting implementation. It is based on node attributes as defined by NALA and provides view definitions that can be referenced by the event subscriptions of EDSL. Similarly, HIMDEL references SLOSL views to include their content in messages, and EDGAR uses them for routing decisions.

As SLOSL has its own chapter in this thesis (chapter 5), this section only describes the XML representation of SLOSL as part of the OverML languages. The format is presented for the following example, taken from section 5.3.

    CREATE VIEW chord_neighbours
    AS SELECT id=node.id, local_dist=node.local_dist, side=sign
    RANKED lowest(10, node.local_dist)
    FROM db
    WITH log_k=160, max_id=2^log_k - 1
    WHERE node.knows_chord == true and node.alive == true
    HAVING abs(node.id - local.id) <= max_id / 2
           AND sign*(node.id - local.id) > 0
        OR abs(node.id - local.id) > max_id / 2
           AND sign*(node.id - local.id) < 0
    FOREACH sign IN (-1,1)

The XML representation follows. Note that mathematical expressions are written in Content MathML [Mat03], which is a rather verbose language. It is, however, also the best choice for representing the semantics of mathematical expressions within XML languages.

    <slosl:statements
        xmlns:slosl="..."
        xmlns:m="...">
      <slosl:statement name="chord_neighbours" selected="true">
        <slosl:select name="id">node.id</slosl:select>
        <slosl:select name="local_dist">node.local_dist</slosl:select>
        <slosl:select name="side"
                      type_name="smallint">sign</slosl:select>

        <slosl:parent>db</slosl:parent>

        <slosl:with name="log_k">160</slosl:with>
        <slosl:with name="max_id">
          <m:math>
            <m:apply>
              <m:minus/>
              <m:apply>
                <m:power/>
                <m:cn type="integer">2</m:cn>
                <m:ci>log_k</m:ci>
              </m:apply>
              <m:cn type="integer">1</m:cn>
            </m:apply>
          </m:math>
        </slosl:with>

        <slosl:where>
          <m:math>
            <m:apply>
              <m:and/>
              <m:apply>
                <m:eq/>
                <m:ci>node.knows_chord</m:ci>
                <m:true/>
              </m:apply>
              <m:apply>
                <m:eq/>
                <m:ci>node.alive</m:ci>
                <m:true/>
              </m:apply>
            </m:apply>
          </m:math>
        </slosl:where>

        <slosl:having>
          <m:math>
            <m:apply>
              <m:or/>
              <m:apply>
                <m:and/>
                <m:apply>
                  <m:leq/>
                  <m:apply>
                    <m:abs/>
                    <m:apply>
                      <m:minus/>
                      <m:ci>node.id</m:ci>
                      <m:ci>local.id</m:ci>
                    </m:apply>
                  </m:apply>
                  <m:apply>
                    <m:divide/>
                    <m:ci>max_id</m:ci>
                    <m:cn type="integer">2</m:cn>
                  </m:apply>
                </m:apply>
                <m:apply>
                  <m:gt/>
                  <m:apply>
                    <m:times/>
                    <m:ci>sign</m:ci>
                    <m:apply>
                      <m:minus/>
                      <m:ci>node.id</m:ci>
                      <m:ci>local.id</m:ci>
                    </m:apply>
                  </m:apply>
                  <m:cn type="integer">0</m:cn>
                </m:apply>
              </m:apply>
              <m:apply>
                <m:and/>
                <m:apply>
                  <m:gt/>
                  <m:apply>
                    <m:abs/>
                    <m:apply>
                      <m:minus/>
                      <m:ci>node.id</m:ci>
                      <m:ci>local.id</m:ci>
                    </m:apply>
                  </m:apply>
                  <m:apply>
                    <m:divide/>
                    <m:ci>max_id</m:ci>
                    <m:cn type="integer">2</m:cn>
                  </m:apply>
                </m:apply>
                <m:apply>
                  <m:lt/>
                  <m:apply>
                    <m:times/>
                    <m:ci>sign</m:ci>
                    <m:apply>
                      <m:minus/>
                      <m:ci>node.id</m:ci>
                      <m:ci>local.id</m:ci>
                    </m:apply>
                  </m:apply>
                  <m:cn type="integer">0</m:cn>
                </m:apply>
              </m:apply>
            </m:apply>
          </m:math>
        </slosl:having>

        <slosl:buckets>
          <slosl:foreach name="sign">
            <m:math>
              <m:list>
                <m:cn type="integer">-1</m:cn>
                <m:cn type="integer">1</m:cn>
              </m:list>
            </m:math>
          </slosl:foreach>
        </slosl:buckets>

        <slosl:ranked function="lowest">
          <slosl:parameter>10</slosl:parameter>
          <slosl:parameter>node.local_dist</slosl:parameter>
        </slosl:ranked>
      </slosl:statement>
    </slosl:statements>

The XML representation is rather straightforward, although verbose due to the lengthy MathML expressions. The main advantage of a structured language like Content MathML over a string representation is the clearer, well-defined semantics. It becomes trivial, for example, to determine the node

attribute dependencies of an expression via a simple XPath expression like .//math:ci. These dependencies are helpful for optimisations during execution or model transformation and source code generation (see 8.3).

Note the type_name attribute in line 8 (on the select element named side), which refers to a NALA type definition. It allows specifying a type for a calculated view node attribute. This overrides the type that would otherwise have been inferred from the expression.

6.3 EDGAR - Extensible Decision Graphs for Adaptive Routing

The simplicity of the routing decision graphs as described in 5.4 directly translates into a simple XML structure for EDGAR. It is a straightforward hierarchical representation of the directed graph. The example below is the XML representation of the general EDGAR graph in figure 5.1.

All nodes in the EDGAR graphs are represented as XML elements. The obvious execution plan is a depth-first tree traversal. Predicates also become elements that surround their target. A predicate element blocks traversal if the predicate fails. The exclude last element is represented as a more general tag element that sets a tag on the tree traversal process. Tags are evaluated before termination.

There are three ways to terminate the routing process. The exit elements always terminate and jump to a specific target. The tryview elements evaluate their target view for possible candidates. If this succeeds, the process terminates. The third way is to terminate the traversal without having hit a matching target. In this case, the behaviour is platform-specific. Platforms may raise exceptions or call global error handlers to handle this. The XML code below prevents this case by terminating on an explicit error handler.
    <routers xmlns="...">
      <router name="examplary_router">
        <firstmatch>
          <predicate name="for_me">
            <exit target="local_handler"/>
          </predicate>
          <predicate name="db_lookup">
            <exit target="forward"/>
          </predicate>
          <firstmatch>
            <tag type="exclude_last">
              <tryview target="View1"/>
            </tag>
            <fork>
              <tryview target="View2"/>
              <tryview target="View3"/>
            </fork>
          </firstmatch>
          <exit target="error_handler"/>
        </firstmatch>
      </router>
    </routers>

6.4 HIMDEL - The Hierarchical Message Description Language

Messages combine attributes and other data into well defined data units for transmission. They have two sides: a software interface for reading and writing data fields and a serialisation format for the wire. HIMDEL is a hierarchical message specification language that defines a generic software interface to messages and allows for mappings to different serialisation formats such as binary data or XML.

As usual, messages are encapsulated in headers which are in turn encapsulated in network protocols. This makes them conceptually hierarchical. In HIMDEL, message definitions consist of three top-level parts: protocols, top-level headers and pre-defined containers. The rest of the specification follows a hierarchy rooted in a top-level header, followed by encapsulated headers and finally a sequence of data fields as content. Being an XML language, HIMDEL presents this hierarchy in a natural way.

    <msg:message_hierarchy xmlns:msg="..."
                           xmlns:sql="...">
      <msg:container type_name="ids">
        <msg:attribute access_name="source" type_name="id"/>
        <msg:attribute access_name="dest" type_name="id"/>
      </msg:container>
      <msg:header access_name="main_header">
        <msg:container-ref access_name="addresses"
                           type_name="ids"/>
        <msg:message type_name="join_request"/>      <!-- 1st message -->
        <msg:message type_name="view_message">       <!-- 2nd message -->
          <msg:viewdata structured="true" access_name="finger_table"
                        type_name="chord_finger_table"/>
        </msg:message>
        <msg:header>
          <msg:content access_name="type" type_name="sql:smallint"/>
          <msg:message type_name="typed_message">    <!-- 3rd message -->
            <msg:content access_name="data" type_name="sql:text"/>
          </msg:message>
        </msg:header>
      </msg:header>
      <msg:protocol access_name="tcp" type_name="tcp">
        <msg:message-ref type_name="view_message"/>
        <msg:message-ref type_name="typed_message"/>
      </msg:protocol>
      <msg:protocol access_name="udp" type_name="udp">
        <msg:message-ref type_name="join_request"/>
      </msg:protocol>
    </msg:message_hierarchy>

In this representation, messages become a path through the hierarchy that describes the ordered data fields (i.e. content and attribute elements) that are ultimately sent over the wire. Message data is encapsulated in the header hierarchy that precedes it on the path. Headers and their messages are finally encapsulated in a network protocol, which is specified separately from them. This makes it possible to send the same message through multiple protocol channels and to decide the best protocol at runtime.

As the example shows, multiple messages can be defined within the same header, which makes them independent messages using this header. When following a message path, other messages and headers are completely ignored. For example, the join_request message in the example is not part of the typed_message, although it precedes it on the path.

To assure the uniqueness of these paths, all children of a parent element must have distinct names. Additionally, fields and containers must be uniquely named along each path, meaning that no field or container has the same name as any of its ancestors. However, only the names of top-level containers and headers, and of all messages and protocols must be globally unique in the specification.

The tag order on the message path is also important. It describes the field order when serialising data, but it also defines the data fields that are actually contained in a message.
If a header is extended by content or container elements after the definition of a message, the preceding messages will not contain these later fields, which are not on their paths. In the example, the view_message will not contain the content field named type. This field is, however, available in the typed_message and in all messages that are defined later under the same header tag.

As shown in the example, container elements can also be used at the highest level inside the message_hierarchy tag. However, their definition is not part of the message hierarchy itself. They only predefine container modules for replicated use in headers and messages, where they are referenced by the type_name attribute of container-ref elements.
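The effect of the tag order can be made concrete with a short Python sketch. It is a simplified illustration and not part of the reference implementation: the function message_fields is hypothetical, namespace prefixes are omitted, and the underscore and hyphen spelling of element and attribute names is an assumption. The function walks the header hierarchy in document order and records, for each message, the fields that precede it on its path plus its own fields.

```python
import xml.etree.ElementTree as ET

# Simplified copy of the example hierarchy (no namespaces, no protocols).
SPEC = """
<message_hierarchy>
  <header access_name="main_header">
    <container-ref access_name="addresses" type_name="ids"/>
    <message type_name="join_request"/>
    <message type_name="view_message">
      <viewdata access_name="finger_table"/>
    </message>
    <header>
      <content access_name="type"/>
      <message type_name="typed_message">
        <content access_name="data"/>
      </message>
    </header>
  </header>
</message_hierarchy>
"""

FIELD_TAGS = {"content", "attribute", "viewdata", "container-ref"}


def message_fields(spec_xml):
    """Map each message name to the ordered field names on its path."""
    result = {}

    def walk(element, inherited):
        seen = list(inherited)  # fields defined so far on this path
        for child in element:
            if child.tag in FIELD_TAGS:
                seen.append(child.get("access_name"))
            elif child.tag == "message":
                own = [field.get("access_name") for field in child
                       if field.tag in FIELD_TAGS]
                # Fields that follow later are not on this message's path.
                result[child.get("type_name")] = seen + own
            elif child.tag == "header":
                walk(child, seen)  # sibling messages are ignored

    walk(ET.fromstring(spec_xml), [])
    return result


fields = message_fields(SPEC)
```

For the example, only the typed_message receives the type field, while all three messages share the addresses container of the main header.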

Programmatic Interface to Messages

The programmatic access to messages and their fields is defined using the access_name attribute. Note that this follows the hierarchy only for containers and message content. Headers are accessed directly, as is protocol data. Accessing the fields of a message from an object oriented language should look like the following Python snippet.

    def receive_view_message(view_message):
        net_address = (view_message.tcp.ip, view_message.tcp.port)
        main_header = view_message.main_header
        source_id = main_header.addresses.source
        finger_nodes = view_message.finger_table

The rules for building the access path are as follows. They allow for a relatively concise, but nevertheless structured and well defined access path to each element. The reference implementation provides this algorithm as an XSL transformation [XSL99] of HIMDEL (see appendix A.6).

1. As the basic unit of network traffic, a message is always the top-level element.
2. Every child of a message is kept as a second-level element, referenced by its access name.
3. Entries within containers are referenced recursively, namespaced by the access name of their parent.
4. Following the path from the message back to the root header, all headers and also the protocol become additional second-level elements, referenced by their access name. Their child fields and container elements are referenced recursively as before. Children of nameless headers become elements of the parent header.

Software components can then subscribe to message names or header names in a hierarchical way and thus define the part of the message that is actually of interest to them. A simple subset of the XPath language [XPa99] naturally lends itself to defining these subscriptions. Note that even expensive abbreviations like // can be resolved at compile time or deployment time based on the message specifications.
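The four rules can be mimicked by a small Python sketch. It is not the XSL transformation of the reference implementation: the function build_access_paths, its argument layout and the sample values are assumptions chosen for illustration, and the root header is assumed to carry a name.

```python
from types import SimpleNamespace


def namespace(fields):
    """Rule 3: container entries become nested attribute namespaces."""
    return SimpleNamespace(**{
        name: namespace(value) if isinstance(value, dict) else value
        for name, value in fields.items()
    })


def build_access_paths(content, headers, protocol):
    """Build the access object of one message.

    content:  fields of the message itself
    headers:  (access name or None, fields) pairs, root header first
    protocol: (access name, fields) pair
    """
    message = namespace(content)                 # rules 1 and 2
    protocol_name, protocol_fields = protocol
    setattr(message, protocol_name, namespace(protocol_fields))  # rule 4
    parent = None
    for header_name, header_fields in headers:   # rule 4
        if header_name is None:
            # Children of nameless headers join the parent header.
            vars(parent).update(vars(namespace(header_fields)))
        else:
            parent = namespace(header_fields)
            setattr(message, header_name, parent)
    return message


# The typed_message of the example specification, with made-up values:
typed_message = build_access_paths(
    content={"data": "hello"},
    headers=[("main_header", {"addresses": {"source": 17, "dest": 42}}),
             (None, {"type": 3})],
    protocol=("tcp", {"ip": "192.0.2.1", "port": 4000}),
)
```

With this object, typed_message.main_header.addresses.source and typed_message.tcp.port follow the same access pattern as the snippet above, and the type field of the nameless inner header is reachable as typed_message.main_header.type.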
The programmatic interface additionally defines two message properties: last_hop (a node if the last hop of a message is known) and next_hop (a node if the next hop was already determined by a router).

Network Serialisation of Messages

There are a number of possible network representations for messages. Their choice depends on frameworks and languages and is therefore outside the scope

of the platform independent model provided by OverML. The following describes two possible mappings that model transformations can follow.

The SQL/XML standard [SQL03b], as published in 2003, provides a well defined mapping of SQL data types to XML Schema. This should be considered the preferred method for message serialisation to XML. In this case, the hierarchical structure of the message specifications can directly be mapped to an XML serialisation format.

Another, currently more common serialisation is the XDR data model [Sun87, Sri95], originally developed by Sun Microsystems for their ONC-RPC [Sun88]. The mapping from the message specification to a flat serialisation is straightforward when laying out the data top-down along the message paths.

6.5 EDSL - The Event-Driven State-machine Language

EDSL is a graph language for describing event-driven state machines. It is based on states and transitions, like any state machine description. States represent stages of code execution, while transitions connect processing chains and describe events that trigger new states.

Being a graph language, EDSL has a straightforward transformation to the well-known DOT language of graphviz. An implementation is provided in appendix A.2. It allows for easy visualisation of EDSL graphs. Figure 6.2 illustrates an example of such a graph.

[Figure 6.2: Example of an EDSL graph. States: start, Not Joined, all done, and a Message Handling Subgraph containing entry, handle message and exit. Transition labels: Join Message received, Timeout.]

The main advantage of EDSL over other languages for state machines and graphs (like the DOT language) is its awareness of SLOSL and HIMDEL. EDSL allows transition triggers to be expressed based on data fields that occur in

incoming messages as defined by HIMDEL, or based on events that occur in views defined by SLOSL. It therefore becomes possible to raise the level of framework independence by specifying semantically rich transitions based on OverML instead of hand-written code.

The envisioned execution model behind EDSL is a state machine that supports multiple active states at the same time. In literature, such concurrent states are commonly referred to as "and" states, as in state charts [Har87, HP98]. They are the common case in event-driven state machines, which keep multiple pending states and select between them based on I/O events.

The execution model also allows for long-running states. These are states that stay active even if one of their transitions has fired. This is helpful for generic message dispatcher states that keep receiving messages and forward them through their transitions. It also allows for concurrency in output chains where a state outputs several data items, one after the other, that are then handled by the successor states concurrently. Exiting such a state and controlling its runtime behaviour has to be done programmatically, though, as it is not part of the graph description that EDSL provides.
The following listing shows the EDSL description of the graph in figure 6.2:

    <edsl:edsm xmlns:edsl="...">
      <edsl:states>
        <edsl:state name="start" id="...">
          <edsl:readablename>start</edsl:readablename>
        </edsl:state>
        <edsl:state name="state" id="...">
          <edsl:readablename>not Joined</edsl:readablename>
        </edsl:state>
        <edsl:subgraph name="subgraph" id="..."
                       entry_state="..." exit_state="...">
          <edsl:readablename>message Handling Subgraph</edsl:readablename>
          <edsl:states>
            <edsl:state name="entry" id="...">
              <edsl:readablename>entry</edsl:readablename>
            </edsl:state>
            <edsl:state name="exit" id="...">
              <edsl:readablename>exit</edsl:readablename>
            </edsl:state>
            <edsl:state name="state" id="...">
              <edsl:readablename>handle message</edsl:readablename>
            </edsl:state>
          </edsl:states>
          <edsl:transitions>
            <edsl:transition type="outputchain">
              <edsl:from_state ref="..."/>
              <edsl:to_state ref="..."/>
            </edsl:transition>
            <edsl:transition type="outputchain">
              <edsl:from_state ref="..."/>
              <edsl:to_state ref="..."/>
            </edsl:transition>
          </edsl:transitions>
        </edsl:subgraph>
        <edsl:state name="state" id="...">
          <edsl:readablename>all done</edsl:readablename>
        </edsl:state>
      </edsl:states>
      <edsl:transitions>
        <edsl:transition type="message">
          <edsl:from_state ref="..."/>
          <edsl:to_state ref="..."/>
          <edsl:readablename>join Message received</edsl:readablename>
          <edsl:messagetype>/join_message</edsl:messagetype>
        </edsl:transition>
        <edsl:transition type="timer">
          <edsl:from_state ref="..."/>
          <edsl:to_state ref="..."/>
          <edsl:timerdelay>120000</edsl:timerdelay>
          <edsl:readablename>timeout</edsl:readablename>
        </edsl:transition>
        <edsl:transition type="transition">
          <edsl:from_state ref="..."/>
          <edsl:to_state ref="..."/>
        </edsl:transition>
        <edsl:transition type="transition">
          <edsl:from_state ref="..."/>
          <edsl:to_state ref="..."/>
        </edsl:transition>
      </edsl:transitions>
    </edsl:edsm>

Comparing the verbosity of state machine languages like EDSL with the clean and readable graph in figure 6.2 makes a convincing case for the visual design of event-driven state machines. This argument also holds against the implicit implementation of these graphs in source code, as required by most current EDSM frameworks. It is therefore a substantial improvement for developers to design protocols visually (as with the SLOSL Overlay Workbench, see 8.1.4) and to simply generate machine readable EDSL specifications and source code implementations from these graphs.

States

States in EDSL are points of execution. There is one special state called start at the top level that is activated at startup. All other states are activated through subsequent transitions. As for terminology, the transitions of a state designate the transitions that lead away from that state towards other states.
States are components of the EDSM that follow a simple, but very generic model of component interaction. It was exemplified by the Axon framework⁵, which was in turn inspired by Unix pipes. The Axon model provides components with named pipes for input and output. Components react to objects being passed in through any of their input pipelines and respond by writing objects to any of their output pipelines. The model provides a very generic view on component interaction, similar to subject-based publish-subscribe. EDSL inherits the simplicity of this model. It extends it to a complete graph description that makes transitions explicit and augments the semantics of events.

States can currently have two additional boolean attributes that are described further down: long_running and inherit_context. Both default to false.

While concrete frameworks may encourage diverging ways of implementing states, the abstract behaviour of a state is as follows.

1. On activation, the state is instantiated and receives the status waiting.
2. Depending on the framework specific scheduler, the state eventually becomes running and is executed within the EDSM.
3. It can then produce output or request a modification of its status.
4. The execution of a state is atomic with respect to its transitions. This means that none of the transitions can fire away from a state during the execution of that state, which is the common case in EDSM frameworks. Note that atomicity only addresses the transitions of the state itself. Frameworks can execute as many states in parallel as they support.
5. Short running states (having long_running=false) receive the status terminated when they produce output, request to be terminated, or have finished execution and one of their transitions has fired. Since execution is atomic, there is always a first event that fires after execution, so the transition that terminates the state is well defined.
6. Long running states are only terminated when they explicitly request to be terminated. This implies that any number of transitions can fire on them after execution has finished. Note that it is up to concrete frameworks to provide ways to let long running states determine whether their execution should be restarted after having finished or whether they should stay idle but active (sometimes called a zombie status).
7. Depending on the framework, active states (long running or not) may be able to temporarily interrupt their execution by actively yielding control back to the framework. The framework can then decide when to continue the execution. This is commonly used in single threaded EDSM frameworks that cannot afford to miss I/O events during long calculations. This feature does not impact the atomicity constraint and leaves the state running.

⁵ (Aug. 30, 2006)

The inherit_context flag of a state is meant to simplify the programming of individual states. Very often, processing chains in EDSMs are interrupted by I/O activity like sending messages and having to wait for an answer. The exact meaning of this flag is framework specific. However, setting it on a state asks the framework to propagate an execution context from the previous state to this one during the transition. It therefore provides a simple way of propagating variables, configuration or other forms of processing status within a chain of execution.

Transitions

Transitions connect states and define the flow of execution. They are triggered by events, which EDSL defines based on the following types:

- Messages: The arrival of a message. Subscriptions to messages are expressed in XPath based on the message content paths. In figure 6.2, the message transition is subscribed to the reception of any join message, without further restrictions.
- View events: Events that occur in views defined by SLOSL (see 5.5).
- Timers: Delay triggered events. The delay is expressed in milliseconds.
- Output chains: All output of the source state is handed on to the target without modification. Frameworks may force the target state to become long running if the source state is.
- Direct advances: Immediately activate the target state when entering the source state, thus essentially merging the two states. This special transition does not require any events or output of the state to be fired. Since it is only fired on state activation, even long running states can deploy it. A possible use case is the design decision to split a single I/O state into semantically different parts.

Subgraphs

EDSL supports subgraphs within a state machine.
These are complete state machines themselves that may have further subgraphs besides their states and transitions. Subgraphs have two predefined states, entry and exit, that connect with the parent graph by reproducing their incoming events.

The main use cases for subgraphs are harvesters and other complex control components. They can be plugged into the graph and are globally triggered and configured as part of the EDSM. Their integration and interconnection through the EDSL component model simplifies their reuse in different overlays.

Subgraphs are only used at design time to reduce the complexity of EDSM designs by semantic modularisation. States are always referenced by internal identifiers that are unique to the complete graph. It is therefore a straightforward transformation to flatten subgraphs by merging them into their parent graph. Flat graphs can then be mapped to EDSM frameworks more easily.
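The flattening step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout with the keys "states", "transitions" and "subgraphs", the state identifiers and the function name flatten are hypothetical and not part of EDSL or the workbench. The sketch relies solely on the property stated in the text, namely that state identifiers are unique within the complete graph.

```python
def flatten(graph):
    """Merge all nested subgraphs of `graph` into one flat graph.

    Assumed (hypothetical) layout: a dict with the keys "states"
    (id -> readable name), "transitions" (list of (from_id, to_id, type)
    tuples) and "subgraphs" (list of graphs of the same shape).
    """
    states = dict(graph.get("states", {}))
    transitions = list(graph.get("transitions", []))
    for sub in graph.get("subgraphs", []):
        flat_sub = flatten(sub)  # subgraphs may contain further subgraphs
        overlap = states.keys() & flat_sub["states"].keys()
        if overlap:
            # Flattening is only a plain merge if ids are graph-wide unique.
            raise ValueError(f"state ids are not unique: {sorted(overlap)}")
        states.update(flat_sub["states"])
        transitions.extend(flat_sub["transitions"])
    return {"states": states, "transitions": transitions, "subgraphs": []}


# A graph loosely modelled on the listing for figure 6.2 (ids are made up).
edsm = {
    "states": {"s0": "start", "s1": "all done"},
    "transitions": [("s0", "sub_entry", "message"),
                    ("sub_exit", "s1", "transition")],
    "subgraphs": [{
        "states": {"sub_entry": "entry",
                   "sub_handle": "handle message",
                   "sub_exit": "exit"},
        "transitions": [("sub_entry", "sub_handle", "outputchain"),
                        ("sub_handle", "sub_exit", "outputchain")],
        "subgraphs": [],
    }],
}
flat = flatten(edsm)
```

Because transitions refer to states by identifier only, the transitions that cross the subgraph border need no rewriting in this sketch; they simply end up in the same list as the inner ones.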

Chapter 7
Harvesters

Contents
7.1 Gossip-based Node Harvesters
7.2 Latency Harvesters
7.3 Sieving Harvesters
7.4 Distributed Harvesters

As companions of database and views, harvesters are a very important part of the middleware. Their main purpose is to maintain the local view of a node by updating the node database with data relevant to the currently available view definitions. This means finding new nodes, as well as determining and maintaining their attributes, but also deleting nodes from the database. There are a number of possible schemes for these tasks out of which the designer of an overlay system can select the most appropriate ones. Having a choice of harvester components available allows overlays to provide very diverse characteristics.

Harvesters work on views, nodes and node attributes. They can be simple controllers, but are more commonly complex EDSL compositions of controllers, i.e. EDSL subgraphs. Where controllers are more of an atomic implementation detail of a protocol, harvesters are functional components that implement a specific subtask in the maintenance of an overlay. As seen in Gridkit (8.4), they are often independent of specific overlays and can be provided as middleware components.

For example, the harvester task of finding new nodes can be implemented in various different ways based on broadcast, active lookup, overhearing routed messages, central discovery services, etc. Gossip (or epidemic communication [VvRB03, JGKvS04, RGRK04]) has a straightforward implementation in the Node Views architecture that exchanges views between overlay members.

The attributes of nodes are either measured locally or requested from the nodes. It is up to the specific harvester implementation how this is done and up to its configuration how often this happens. Also, when nodes have locally disappeared from all views, the application must decide if (and how long) they should remain in the database.

The following sections discuss four exemplary kinds of harvesters that implement this functionality. Note that much of what is said here can be applied in similar ways to other harvester types and implementations.

7.1 Gossip-based Node Harvesters

The idea behind epidemic communication is to spread knowledge to all members of a system similar to an epidemic, i.e. by periodically infecting randomly chosen members. On each contact, a member communicates data to another member who can then start to infect others. This leads to an exponential data distribution while keeping the load at each member low and the communication cost configurable.

As an example, we can look at a Pastry derivative called Bamboo [RGRK04]. It uses multiple epidemic maintainers for its local data structures. The leafset maintainer periodically exchanges this set with one of the nodes therein. Two routing table maintainers query member nodes and try to fill empty entries. Note that in Bamboo and Pastry, most of the nodes in the leafset will also be in the routing table. If an attribute of a local node representation states that a node has not been queried by another component or otherwise responded for a while, it may be a good candidate for a ping, while a node that is already frequently queried by the leafset maintainer should not be additionally queried by the routing table maintainer. The local consistency provided by the node database naturally supports the coordination between these maintenance components and simplifies their implementation.

Jelasity et al. [JGKvS04] introduced the Peer Sampling Service (PSP). It is an overlay service for choosing good candidates for epidemic contact. Their analysis is focused on communicating the availability of nodes in unstructured networks by exchanging address data. This allows them to abstract from the actual way in which data is exchanged. However, if structured networks, multiple overlays or adaptable topologies have to be maintained, the information that must be exchanged is usually more complex.

Node set views directly support epidemic communication. Obviously, the exchange of data about well-selected nodes is an exchange of node set views. Views are created using a SLOSL statement which can be seen as a query on the node database. Once a view is defined, its data is also precisely defined and can be sent to any node by a simple call to the framework. The view definition even makes it possible to infer an implicit message format description.

In this model, an exchange of views means the symmetric evaluation of a query in two different databases. The two results are then exchanged and used to update the other database. The PSP distinguishes three cases: sending, receiving and exchanging views. These cases determine in which database(s) the query is executed. Rhea et al. describe them in a similar way for the Bamboo system [RGRK04] and show cases where the symmetric exchange is necessary to assure eventual consistency between neighbours.

Current systems implicitly code the query into the overlay software. This resembles materialised views and stored procedures in databases. They are commonly used when performance is essential and changes are rare. In some cases, however, they are premature optimisations that come at the cost of lower adaptability. Their use makes software harder to configure and adapt at run-time.

The Node Views model makes these queries explicit and shows that it is perfectly valid to ship the query (or a parametrisation of it) as part of the view exchange. This may be used for adapting the overlay topology to the capabilities of its members. For example, a high performance node may want to augment its view by sending broader queries. This can decrease the end-to-end latency it experiences by allowing single hops towards a larger number of nodes. Similarly, a less powerful node may decide to restrict its view by sending queries with higher selectivity.

One of the problems in gossip overlays is how to handle dead nodes. Most commonly, dead nodes are simply removed from the local view and will therefore not appear in gossip messages [RGRK04]. This forces each node to find out about such failures on its own, which is redundant work. In the Node Views approach, the node database can simply keep data about dead nodes without adding any overhead to the software components.
Node selection prevents dead nodes from appearing in local views. Remote nodes, however, can send queries for dead nodes as well as live nodes. Messages and database can both use timestamps to constrain the relevance of such data.

SLOSL and node set views make it trivial to write overlays based on gossip or other ways of exchanging node data. At the same time, they decouple the data acquisition part of the maintenance algorithm and allow other (non-gossip) harvesters to take over if different characteristics are needed.

7.2 Latency Harvesters

Among the node attributes, the latency between overlay members is the most important parameter for overlay adaptation. It can be used as a criterion whenever multiple candidates for choosing neighbours or forwarding messages exist. A latency harvester can measure or estimate this latency. While measuring usually involves pings or ACKs, there are a number of recent proposals that allow ping-free, resource efficient estimates for the physical latency between nodes. Some use virtual coordinates [PCW+03, DCKM04], others query the routing infrastructure (BGP) of the physical network [NPB03]. Though these are very different approaches, they all follow one common goal: determine the end-to-end latency between the local node and a remote node.

The node database and its views provide an intermediate layer that decouples the harvesters from other components like routers. They hide how the latency information is obtained. SLOSL then provides direct support for converting the values of different sources when showing them in views. This allows the overlay to switch between different approaches based on accuracy and load requirements without affecting components that use this information.

Virtual coordinate systems also gain from overlay integration. Only a single harvester instance that works at the database level is needed to determine the coordinates of nodes in different overlays. This broadens the base of nodes and leads to more accurate models even for smaller overlays.

7.3 Sieving Harvesters

So far, we have only regarded the process of adding new nodes and updating their attributes. However, when harvesters keep adding nodes, there must also be a mechanism that removes nodes that are no longer in scope of any view. If nodes are not visible in views, their attributes will tend to become outdated more easily than those of active communication partners. On the other hand, keeping track of too many nodes can lead to an excessive message load. Collecting the data of too many nodes in the local database can also reduce the run-time performance of the system. Overlay nodes must therefore trade their resource consumption against the value of locally available data.
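One way to picture this trade-off is a sieve that never drops a node while it is visible in some view and evicts invisible nodes in least-recently-used order once a configurable budget is exceeded. The following Python sketch is only an illustration: the class LruSieve, its callback names and the max_invisible parameter are hypothetical and do not correspond to an interface defined in this thesis. It assumes that the framework reports when a node enters or leaves a view.

```python
from collections import OrderedDict


class LruSieve:
    """Hypothetical sieving harvester that bounds the invisible nodes."""

    def __init__(self, max_invisible):
        self.max_invisible = max_invisible
        self.visible = {}               # node id -> number of views showing it
        self.invisible = OrderedDict()  # node ids, least recently visible first

    def node_entered_view(self, node_id):
        # A node that becomes visible again is no longer an eviction candidate.
        self.invisible.pop(node_id, None)
        self.visible[node_id] = self.visible.get(node_id, 0) + 1

    def node_left_view(self, node_id):
        """Return the ids that should now be deleted from the node database."""
        count = self.visible.get(node_id, 0) - 1
        if count > 0:
            self.visible[node_id] = count  # still visible in another view
            return []
        self.visible.pop(node_id, None)
        self.invisible[node_id] = None
        self.invisible.move_to_end(node_id)
        evicted = []
        while len(self.invisible) > self.max_invisible:
            oldest, _ = self.invisible.popitem(last=False)
            evicted.append(oldest)
        return evicted


sieve = LruSieve(max_invisible=2)
for node_id in (1, 2, 3):
    sieve.node_entered_view(node_id)
sieve.node_entered_view(1)             # node 1 is now visible in two views
first = sieve.node_left_view(1)        # still visible elsewhere
second = sieve.node_left_view(2)
third = sieve.node_left_view(3)
fourth = sieve.node_left_view(1)       # third invisible node exceeds the budget
```

With max_invisible set to 0, the sketch degenerates to deleting every node as soon as it is no longer visible in any view; larger budgets keep more fallback knowledge at the cost of memory and maintenance traffic.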
In the Node Views architecture, the problem of keeping nodes available in the database is reduced to a caching problem. Nodes that are visible in views must be kept in the database. Nodes that are not visible in views may be kept in the database. Which invisible nodes should be removed?

In many overlay implementations today, only the topology neighbours are kept in the local view. This is equivalent to removing each node from the database that is discarded from a view and no longer visible in any other. A harvester connected to the enter and leave events can easily achieve this. However, some systems have found additional knowledge to be a good thing, e.g. for neighbour redundancy [ZHD03] or fallback candidates [MCKS03].

Systems based on the Node Views architecture can trivially benefit from additional locally available knowledge, as they already select subsets of nodes from a larger database. More knowledge allows them to make more decisions locally, without necessarily having to send messages. These systems should prefer a garbage collection scheme where nodes are removed from the database based on common caching policies. Due to the generic interfaces of views and the specific events they generate, generic harvesters can implement basically any kind of caching scheme.

It is an important property of this approach that the chosen policies are independent of overlay implementations and may be freely changed at run-time. Usually, this can even be done locally on a per node basis without remote interaction, since it only impacts the local representation of nodes that are not currently chosen as communication partners. This allows each node to determine and follow its locally optimal strategy.

7.4 Distributed Harvesters

Darlagiannis and others recently presented Omicron [DMS04], a design study for structured overlay networks. The idea is to split the algorithm that each overlay member traditionally executes into a number of simpler services that have different requirements. These services are then distributed over a cluster of nodes. This has two main advantages. Firstly, cluster nodes can provide a service that matches their capabilities instead of struggling to execute all services needed by the overlay. Secondly, it allows for replication within the cluster to increase the reliability of specific services. However, this approach also comes at the cost of additional overhead inside each cluster.

Omicron identifies four basic services in a DHT overlay: maintenance, indexing, routing and caching. Members of the same service interconnect between clusters. Therefore, each of the four services is provided by a different overlay while the cluster itself represents a fifth type of overlay.

The maintainers have a special role. At the inter-cluster level, they exchange data about the nodes in their clusters. Inside their own cluster, they provide the other members with interconnection candidates from other clusters.
Node set views provide straightforward support for this exchange of views between cluster maintainers as well as between members of a cluster. The maintainer only needs to know the view definitions of each cluster member and can then send specific updates for their views. Similarly, when it communicates with the maintainers of other clusters, it can exchange its local view with them. In this model, the approach taken by Omicron mainly becomes a distributed, hierarchical database.

There is a possibility of extending this scheme into the architectural design. A distributed implementation of the Node Views architecture could reduce the overhead of the lower participants even further. Such a system would implement the complete architecture at the maintainers and replicate only the materialisation of views at the lower cluster participants. Their updates and view events would then be generated remotely by the cluster maintainer. Such a design allows routers to use static routing tables and to update them very specifically without the overhead of event generation from a local database.

The tradeoff between fast, static tables at the lower participants and the availability of the more capable general architecture for taking their own local decisions is a design choice for the implementation. It mainly depends on the expected capabilities of the cluster participants. The Model Driven Architecture approach makes it possible to configure these kinds of implementation details at the model transformation and code generation level.

Chapter 8
Implementation

Contents
8.1 The SLOSL Overlay Workbench
8.2 Implementing SLOSL
8.3 OverML Model Transformation
8.4 OverGrid: OverML on the Gridkit
8.5 OverGrid Case Study: Scribe over Chord

The concepts developed in this thesis were validated through a proof-of-concept implementation. It currently comprises three parts:

1. The SLOSL Overlay Workbench¹ [Beh05b], a graphical overlay design front-end for OverML specifications. It is written in the Python language and uses the Qt toolkit for its graphical user interface.
2. A preliminary, interpreted OverML run-time environment. It is written in the Python language and uses the Twisted EDSM framework.
3. Exemplary model transformers for SLOSL and OverML that show how to build source code generators for different programming languages.

The Workbench allows the development of language neutral and platform independent overlay models (PIM) in OverML that can be transformed to language specific EDSM environments via the Model Driven Architecture approach.

¹ (Aug. 30, 2006)

8.1 The SLOSL Overlay Workbench

In the Model Driven Architecture of Node Views, the abstract specification of overlay software is developed in five steps by defining node attributes, messages, SLOSL views, routing decision graphs and an EDSM graph to connect the controller states. Being an OverML front-end, the workbench directly operates on an XML model. This lets its implementation build on common XML technologies like XPath, XSLT, RelaxNG and XInclude, which allow for model modularisation, validation and transformation.

Specifying Node Attributes and Messages

The first step in the design of an overlay network is the specification of node attributes. The user interface for this is shown on the left side of figure 8.1.

Figure 8.1: Defining node attributes and messages in the Overlay Workbench

In the bottom left corner, the developer specifies and restricts data types, as supported by NALA. Based on these types, node attribute specifications are added to the list in the centre of the window. The attribute flags identifier, static and transferable are set directly below the names of the attributes.

The rightmost part of the figure is used to integrate the attributes into HIMDEL messages. Messages are data structures that build on a basic network protocol. They contain embedded header and content parts in a hierarchical fashion. This hierarchy is directly reflected in the tree structure on the right side of the GUI. The messages are defined below the entry Messages. Above the actual messages, the workbench allows the definition of predefined containers. They are modules that can be reused in different parts of the message hierarchy. Container references are added to messages via drag-and-drop.

Writing SLOSL Statements

The next step in overlay design is the definition of the local views. This is done with SLOSL, as shown in figure 8.2.

Figure 8.2: SLOSL statements in the Overlay Workbench

The currently defined SLOSL statements are listed on the left and can be selected by mouse click. The right part of the window shows the editor for defining the different SLOSL clauses.

Testing and Debugging SLOSL Statements

Once the SLOSL statements are defined, the topology simulator can be used to test and simulate their effect by executing them in different scenarios. The SLOSL language makes the simulator a powerful tool for this purpose. The developer implements the scenarios in source code, but can completely abstract from the networking nature of the simulated topology. This makes the scenario implementation both simple and short.

Figure 8.3: Visualising and testing SLOSL implemented topologies

The idea follows directly from the Node Views architecture. The setting deploys a number of nodes in the topology that have a certain knowledge about the attributes of the other nodes. Their local view is then used to establish the SLOSL views that form the topology graph. The example in figure 8.3 shows the following source code for the chord fingertable view (see 5.1). Note that the code has global access to the options defined in the SLOSL WITH clause.

     1  NODES = 10
     2  MAX_ID = max_id  # use the option value from the SLOSL statement
     3
     4  for n in range(0, MAX_ID, MAX_ID // NODES):
     5      buildnode(id=n)
     6
     7  def make_foreign(local_node, attributes):
     8      """Do something with the node attribute dictionary,
     9      then return it. Returning None or an empty dict
    10      will discard the node."""
    11      attributes['knows_chord'] = True  # set trivial attributes
    12      attributes['alive'] = True
    13
    14      dist = attributes['id'] - local_node.id  # calculate ring distance
    15      if dist < 0:
    16          dist = MAX_ID + dist
    17      attributes['local_chord_dist'] = dist
    18      return attributes  # return node attributes

In lines 4-5, the script sets up the nodes that participate in the simulation. The nodes are created and given a specific value for their node attribute id. This basically defines the global view of the system.

The function make_foreign is the place where the local view of each node is constructed. At each node, it is called once for each node that is made known locally. It can then modify the attributes of that foreign node before it is added to the local database. It can even decide to ignore the node completely.

The example above only calculates the exact mapping from the global view to the local view of each node, which provides each node with a complete local view. However, by simply varying the node attributes within this function, the developer can implement arbitrary scenarios of nodes knowing each other or not, nodes having failed but still being considered alive by others, or different update propagation states of node attributes. The resulting topology will then be calculated and visualised based on SLOSL evaluation against the globally inconsistent local database of each node.

Designing Protocols and Harvesters

As the final design step, the developer can now build the general structure of the overlay protocol by defining an EDSM graph that interconnects states of execution by events. The user interface for this design phase is shown in figure 8.4. While simple harvesters can be represented as a single state, more complex ones can be expressed as subgraphs, which makes the overall design modular.

Figure 8.4: EDSM definition in the Overlay Workbench

The graph at the right of figure 8.4 visualises the entire EDSM graph. Subgraphs are merged into the graph and displayed within a rectangular border. The left part of the window shows the design area. Here, subgraphs are represented as icons, just like states. Their innards are accessible through the tab bar above the design area.

OverML Output

At any time, the current OverML specification can be written to a file that the user can feed into an execution environment or a source code generator. The result obeys the schema in appendix A.1. The workbench also supports a transformed output format that translates the HIMDEL hierarchy into separate messages and flattens the EDSL graph. This simplifies the output and makes it more suitable for subsequent transformations into executable code.

8.2 Implementing SLOSL

SLOSL is the most important language within OverML. The other four languages have either straightforward implementations (NALA and EDGAR) or can deploy well-known EDSM and messaging frameworks for their implementation (EDSL and HIMDEL). This section therefore describes a simple SLOSL interpreter that was written as a proof-of-concept implementation, as well as a transformation from SLOSL to the Structured Query Language (SQL).

Transforming SLOSL to SQL

Since SLOSL is an SQL-like language, it can be mapped to nested SQL statements, although in a rather expensive way. Some possible optimisations and tradeoffs are mentioned in this section, but were not further investigated in the course of this work. Due to the expensive evaluation of the resulting SQL queries, it is unlikely that frameworks will deploy such a mapping for real-time execution in deployment environments. Therefore, this SQL implementation is only provided as a proof of concept and for general interest.

Mapping the clauses that were borrowed from SQL is trivial. The general idea is to create the node database as a table with node attributes. This supports a direct mapping of these SLOSL clauses to the equivalent SQL clauses. The WHERE clause is evaluated directly on this table or on the union of the parent views, depending on the FROM clause. For simplicity, the result will be referred to as the node table from now on. The mapping of the remaining clauses WITH, HAVING-FOREACH and RANKED, which are specific to SLOSL, is worth further explanation.

The WITH clause can trivially be mapped to tables holding the values of the declared names. This implies joining them with the node table on query evaluation. Another possibility is to replace options by their constant value within the statement. This would require views to be re-instantiated when options are modified. Obviously, this is the most efficient solution for design-time and deployment-time options that are never (or rarely) changed at run-time.

The next step is to instantiate a table for the values of each bucket variable declared by a FOREACH statement. These tables are joined with the node table based on the HAVING expression. The result is a table which relates node attributes to matching variable values. Note that the attributes of a single node may end up in multiple rows of the result.

The next step is the evaluation of the RANKED clause. This means evaluating the ranking expression for each row. Obviously, if the ranking expression does not depend on the variables, it may be more efficient to evaluate it only against the node table before running the joins. In some cases, it may even make sense to store its precomputed result directly in the database to avoid further evaluations. This is the usual tradeoff between space and time, which parametrises the mapping between SLOSL and SQL.

The second part of the RANKED evaluation can only be done after the HAVING joins. The ranking and elimination step implies grouping the rows by distinct sets of variable values, sorting the rows in each group by their ranking, counting the rows and eliminating the surplus ones. The following listing presents a possible template for generating the corresponding SQL statement.

    CREATE VIEW [view_name] AS
    SELECT noderank, db_node_id,
           [variables], [select_attribute_calculation]
    FROM (
      SELECT node.db_node_id, [variables], [select_attributes], (
          SELECT COUNT(*)
          FROM (SELECT *
                FROM [parents] AS node,
                WHERE ([where_expression]) AND ([having_expression])
               ) AS node
          WHERE ([node_rank_expr]) [rank_cmp_op] ([rank_expr])
        ) AS noderank
      FROM [local_node_table] AS local,
           [variable_source],
           (SELECT node.*
            FROM [parents] AS node,
                 [local_node_table] AS local
            WHERE ([where_expression])
           ) AS node
      WHERE ([having_expression])
    ) AS node
    WHERE (noderank <= [rank_count])
    ORDER BY [variables], noderank

The HAVING joins and the RANKED sorts show where the costs of this mapping come from. Their expense depends on the number of values per FOREACH variable and on the total number of nodes available in the node table.

The declarative nature of the separate SLOSL clauses helps in making tradeoffs. The visible dependencies between attributes and clauses make it possible to parametrise and optimise the evaluation process, especially if multiple SLOSL views are in use.
Common Subexpression Elimination [Muc97], a well-known technique in compiler optimisation, can be deployed to extract and precompute common node attributes, expressions or sub-views of different views. Also, note that the above template is unnecessarily redundant due to the intent of providing a single statement. One redundant subexpression inside the statement itself is the evaluation of the WHERE and HAVING expressions against the source tables, which occurs a second time during RANKED evaluation. Future work could investigate various optimisation strategies in this context.
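To make the ranking and elimination step more tangible, the following Python sketch runs a strongly simplified variant of such a query on an in-memory SQLite database. It is an illustration under stated assumptions, not the generated statement: the table names node and bucket, the precomputed interval columns lo and hi and the single attribute dist are hypothetical, and the WHERE and WITH clauses are left out. Each bucket row stands for one value of a FOREACH variable i, the join condition plays the role of the HAVING expression, and the correlated COUNT subquery computes the rank of a node within its bucket.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, dist INTEGER);
    CREATE TABLE bucket (i INTEGER PRIMARY KEY, lo INTEGER, hi INTEGER);
""")
# Seven known nodes; for simplicity the ring distance equals the node id.
db.executemany("INSERT INTO node VALUES (?, ?)",
               [(d, d) for d in (1, 2, 3, 5, 6, 9, 12)])
# One row per bucket variable value i, covering distances [2^i, 2^(i+1)).
db.executemany("INSERT INTO bucket VALUES (?, ?, ?)",
               [(i, 2 ** i, 2 ** (i + 1)) for i in range(4)])

QUERY = """
SELECT b.i, n.id
FROM bucket AS b
JOIN node AS n ON n.dist >= b.lo AND n.dist < b.hi   -- HAVING join
WHERE (SELECT COUNT(*)                                -- rank inside the bucket
       FROM node AS m
       WHERE m.dist >= b.lo AND m.dist < b.hi
         AND m.dist <= n.dist) <= :rank_count         -- RANKED elimination
ORDER BY b.i, n.dist
"""

closest = db.execute(QUERY, {"rank_count": 1}).fetchall()
two_best = db.execute(QUERY, {"rank_count": 2}).fetchall()
print(closest)
```

Even in this reduced form, the HAVING condition is evaluated twice, once in the join and once inside the counting subquery, which mirrors the redundancy noted above for the single-statement template.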

Interpreting SLOSL

Where the SQL mapping leads to a rather heavyweight implementation, a more promising approach for a SLOSL implementation is a dedicated interpreter, which was also developed as part of this thesis. It is currently used in the SLOSL Overlay Workbench to visualise topologies (see 8.1.3).

The interpreter is written in the Python language. It uses a generic sequential stream evaluation engine based on chained Python generators. The evaluation process follows the step-by-step scheme presented in section 5.1. For simplicity, the database is represented as a simple set of hash tables (or dictionaries in Python terms) that map node identifiers to node objects. The entire generic SLOSL infrastructure of database, views and node implementations is implemented in about 700 lines of code, of which the evaluation engine itself occupies less than 200 lines.

The MathDOM XML library [Beh05a] is used to transform arithmetic expressions and boolean expressions of the different SLOSL clauses from platform independent MathML into Python expressions. Their evaluation then deploys the Python interpreter directly. MathDOM also originated from the work on this thesis. It already supports a variety of target languages.

On a 1.6 GHz AMD64 machine, the simulator builds a static Chord network of 128 nodes in a key space of 2^8 IDs in about 3.7 seconds. This includes the database setup and involves 1,024 (8 × 128) node evaluations in each of the 128 views, i.e. 131,072 in total. According to profiling data, a single view evaluation takes less than 2 milliseconds in these tests; a single one of the 8 buckets can therefore be evaluated in about 0.25 milliseconds. These numbers are already in an acceptable range for a deployed system such as a current peer-to-peer overlay. Still, the interpreter uses a generic evaluation engine without any further query optimisation techniques.
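The idea of a stream evaluation engine built from chained generators can be illustrated with a small sketch. It is not the interpreter's actual code: the function names where, having_foreach and ranked, the attribute names and the lowest-value-first ranking are assumptions made for this example. Each stage consumes the output of the previous one, so a view is evaluated by composing the stages over the node dictionary.

```python
def where(nodes, predicate):
    """WHERE stage: pass on only the nodes that satisfy the predicate."""
    for node in nodes:
        if predicate(node):
            yield node


def having_foreach(nodes, values, matches):
    """HAVING-FOREACH stage: yield (value, node) for every matching bucket."""
    for node in nodes:
        for value in values:
            if matches(node, value):
                yield value, node


def ranked(pairs, count, key):
    """RANKED stage: keep the `count` lowest-ranked nodes of each bucket.

    Ranking needs all candidates of a bucket, so this stage buffers its
    input before it yields anything.
    """
    buckets = {}
    for value, node in pairs:
        buckets.setdefault(value, []).append(node)
    for value in sorted(buckets):
        for node in sorted(buckets[value], key=key)[:count]:
            yield value, node


# Node database as a dictionary from node id to attribute dictionary.
database = {n: {"id": n, "alive": n != 6, "dist": n}
            for n in (1, 2, 3, 5, 6, 9, 12)}

pipeline = ranked(
    having_foreach(
        where(database.values(), lambda node: node["alive"]),
        range(4),
        lambda node, i: 2 ** i <= node["dist"] < 2 ** (i + 1)),
    count=1,
    key=lambda node: node["dist"])

view = [(i, node["id"]) for i, node in pipeline]
print(view)
```

Because generators are lazy, nothing is evaluated until the view is iterated, which keeps the composition of clauses cheap to set up and easy to rearrange.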
Future implementations of OverML will generate efficient overlay implementations from specifications through a model transformation process. This will allow the integration of SLOSL optimisers that tune the resulting code for the specific SLOSL statements used.

8.3 OverML Model Transformation

The interpreter described above is helpful for testing and debugging environments. In the long term, however, model transformation and source code generation will become the preferred way of generating deployable overlay implementations from OverML specifications.

In the context of model driven engineering (see also the section on related work), model transformation denotes the transformation of a higher-level platform independent model (PIM) into lower-level platform independent or platform specific models (PSM). The generation of a PSM usually involves some form of source code generation for a (general purpose) programming language. See also the related work chapter.

This section describes a transformation of OverML to object oriented languages in multiple steps. First, the OverML specification is transformed into a platform independent XML object model by a language agnostic XSL transformation. Then, the active database and its event-driven query execution paths are constructed. In the last step, a language specific generator transforms this representation into source code for a programming language.

This section describes the complete transformation process. However, the current implementation does not cover it in its entirety. The platform independent model transformations are available, but no SLOSL optimisers were implemented and the platform specific source code generators remain at a very preliminary level. The goal was not to provide a reference system, but rather to make the architecture itself visible and accessible. It is the nature of a Model Driven Architecture that different target environments require (or encourage) very specialised implementations. A single reference implementation would be of limited help in this context.

Transforming OverML to an Object Model

In the first step, the four languages are transformed into a representation suitable for creating object oriented source code.

NALA types are used during the transformation process to provide data type mappings into the target language. NALA attributes are combined into a node class for the database.

SLOSL views become view objects with corresponding view node classes. The clauses stay mainly unchanged, as their transformation depends mostly on platform specific parameters such as the database implementation.

HIMDEL is expanded into separate message representations containing object hierarchies of containers and headers. This step deploys the same transformation as for building the field access paths (see section 6.4.1). The XSL stylesheet for this transformation is provided in appendix A.6.

EDSL already represents state objects but still requires some transformation for simplification. First, the EDSM graph is flattened to remove the subgraph structures. Then, the transitions are moved into the corresponding source states to honour their normal execution from within the active states.

EDGAR requires no further transformation, as it has a straightforward mapping to conditional statements in procedural code.

The transformations are done independently for each language. The only exception is SLOSL, which depends on NALA for type inference. This transformation of SLOSL into view objects is a little more complex than for the other languages. Appendix A.5 has the complete XSL template. The main part of it follows:

Listing 8.3: Main XSL Template for Object Representation of SLOSL

    <xsl:template match="slosl:statement">
      <xsl:variable name="classname">
        <xsl:call-template name="capitalize">
          <xsl:with-param name="name" select="..."/>
        </xsl:call-template>
      </xsl:variable>

      <class access="public" name="{concat($classname, 'View')}" inherits="View">
        <xsl:if test="slosl:buckets[@inherit = 'true']">
          <xsl:attribute name="inheritbuckets">true</xsl:attribute>
        </xsl:if>

        <constructor>
          <param name="views"/>
          <xsl:for-each select="slosl:parent">
            <parent><xsl:value-of select="string()"/></parent>
          </xsl:for-each>
          <xsl:for-each select="slosl:with[math:*]">
            <assign field="...">
              <xsl:apply-templates select="math:*" mode="copy_expression"/>
            </assign>
          </xsl:for-each>
        </constructor>

        <xsl:for-each select="slosl:with">
          <xsl:variable name="option_name">
            <xsl:call-template name="capitalize">
              <xsl:with-param name="name" select="..."/>
            </xsl:call-template>
          </xsl:variable>
          <viewoption name="...">
            <xsl:apply-templates select="math:*" mode="copy_expression"/>
          </viewoption>
        </xsl:for-each>

        <class access="private" name="ViewNode" extends="Node">
          <constructor>
            <param name="node"/>
            <xsl:for-each select="slosl:buckets/slosl:foreach">
              <xsl:sort select="string()"/>
              <param name="{string()}"/>
            </xsl:for-each>
            <xsl:apply-templates select="slosl:select" mode="classgenconstructor"/>
          </constructor>
          <xsl:apply-templates select="slosl:select" mode="classgen"/>
        </class>

        <method name="select_nodes">
          <xsl:apply-templates select="slosl:where[math:*]" mode="copy_expression"/>
          <xsl:apply-templates select="slosl:buckets[@inherit != 'true' and slosl:foreach]" mode="copy_expression"/>
          <xsl:apply-templates select="slosl:having[math:*]" mode="copy_expression"/>
          <xsl:apply-templates select="slosl:ranked[string(...)]" mode="copy_expression"/>
        </method>
      </class>
    </xsl:template>

Note the select_nodes method at the end that basically copies the main SLOSL clauses to the result. There is no further transformation done at this point as implementations may have very different ways of handling the clauses. Also, there are two distinct use cases of these clauses that frameworks will likely implement differently. One is the computation of the complete view content and the second is about efficient update propagation, as explained in the section on efficient event-driven SLOSL view updates.

When we apply the above template to the SLOSL statement that was presented for the Chord fingertable view in chapter 5, it outputs an XML result like the following. For readability, some of the longer MathML expressions were replaced by their shorter infix term equivalents.

Listing 8.4: Example for the Object Representation of a SLOSL Statement

    <classes xmlns="..." xmlns:slosl="..." xmlns:m="...">
      <class access="public" name="ChordFingertableView" inherits="View">
        <constructor>
          <param name="views"/>
          <parent>db</parent>
          <assign field="log_k">
            <expression>
              <m:math><m:cn type="integer">5</m:cn></m:math>
            </expression>
          </assign>
        </constructor>

        <viewoption name="log_k">
          <expression>
            <m:math><m:cn type="integer">5</m:cn></m:math>
          </expression>
        </viewoption>

        <class access="private" name="ViewNode" extends="Node">
          <constructor>
            <param name="node"/>
            <param name="l"/>
            <assign field="id">
              <expression>
                <depends type="attribute" nalatype="id">id</depends>
                <m:math><m:ci>node.id</m:ci></m:math>
              </expression>
            </assign>
            <assign field="local_chord_dist">
              <expression>
                <depends type="attribute" nalatype="id">local_chord_dist</depends>
                <m:math><m:ci>node.local_chord_dist</m:ci></m:math>
              </expression>
            </assign>
          </constructor>
          <nodeproperty name="id" nalatype="id"/>
          <nodeproperty name="local_chord_dist" nalatype="id"/>
        </class>

        <node_selection>
          <slosl:where>
            <expression>
              <depends type="attribute" nalatype="boolean">alive</depends>
              <depends type="attribute" nalatype="boolean">knows_chord</depends>
              <!-- MathML: node.knows_chord and node.alive -->
            </expression>
          </slosl:where>
          <slosl:buckets inherit="false">
            <slosl:foreach name="l">
              <expression>
                <depends type="identifier">log_k</depends>
                <m:math>
                  <m:interval closure="closed-open">
                    <m:cn type="integer">0</m:cn>
                    <m:ci>log_k</m:ci>
                  </m:interval>
                </m:math>
              </expression>
            </slosl:foreach>
          </slosl:buckets>
          <slosl:having>
            <expression>
              <depends type="identifier">l</depends>
              <depends type="attribute" nalatype="id">local_chord_dist</depends>
              <!-- MathML: node.local_chord_dist in [2^i, 2^(i+1)) -->
            </expression>
          </slosl:having>
          <slosl:ranked function="lowest">
            <slosl:parameter>
              <expression>
                <m:math><m:cn type="integer">1</m:cn></m:math>
              </expression>
            </slosl:parameter>
            <slosl:parameter>
              <expression>
                <depends type="attribute" nalatype="id">local_chord_dist</depends>
                <m:math><m:ci>node.local_chord_dist</m:ci></m:math>
              </expression>
            </slosl:parameter>
          </slosl:ranked>
        </node_selection>
      </class>
    </classes>

Once again, this extract is rather verbose (mainly due to the usage of MathML), but it is only used as an intermediate result of the transformation process and thus handed from one translator program to another. The listing shows an object-oriented, but platform-independent representation of the SLOSL view implementation. The expressions were annotated with dependency information.
This helps in type inference and simplifies the implementation of language specific generators and optimisers.

As the example shows, each of the generated view objects has a specific ViewNode class associated with it. It represents the selection and calculation of attributes specified by the SLOSL SELECT clause. Normally, node objects of views are read-only. The reason is that node attributes of views can be calculated in SELECT expressions on their way from the database. There is not necessarily an inverse to that expression (or function), so attribute writes cannot always be propagated to the database. Attributes must therefore be updated directly in the database.

Efficient Event-driven SLOSL View Updates

One of the major goals of SLOSL and the Node Views architecture is to support the efficient integration of multiple overlays. The shared database provides the main foundation for this integration through locally consistent views. However, the local evaluation of these views allows for further integration that can substantially increase the run-time performance.

The execution of SLOSL statements is naturally event-driven. When harvesters update attributes and add or discard nodes, the architecture must determine which views are affected and reevaluate them, both to provide consistent views to controllers and routers, and to generate the related view events. The naive way of independently evaluating all of them can quickly become expensive, as it depends on the number of locally known nodes, the deployed views and their bucket size.

SLOSL, however, is a declarative language that provides high-level semantics. Its evaluation can be implemented in different ways, depending on the requirements. Chapter 5 already presented a number of example statements and the reader may have noticed that there is a certain overlap between them. They all rely on an id attribute, for example, but otherwise differ in their specific dependencies. They all deploy a similar WHERE clause. These redundancies and particularities can be exploited to further integrate their evaluation and to minimise the impact of changes in the database.
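Determining which views are affected by an attribute update, rather than re-evaluating all of them, amounts to a lookup in an inverted index. The following is a minimal sketch under stated assumptions: the view names, the attribute names and the helper functions `build_index` and `affected_views` are hypothetical and do not appear in the OverML implementation.

```python
from collections import defaultdict

# Hypothetical views and the node attributes that their clauses read.
VIEW_DEPENDENCIES = {
    "view_1": {"attribute_1", "attribute_2"},
    "view_2": {"attribute_2", "attribute_3"},
    "view_3": {"attribute_3", "attribute_4"},
}


def build_index(dependencies):
    """Invert the view -> attributes mapping into attribute -> views."""
    index = defaultdict(set)
    for view, attributes in dependencies.items():
        for attribute in attributes:
            index[attribute].add(view)
    return index


def affected_views(index, changed_attributes):
    """Return the views that must be re-evaluated after an update."""
    views = set()
    for attribute in changed_attributes:
        views |= index.get(attribute, set())
    return views


index = build_index(VIEW_DEPENDENCIES)
to_update = affected_views(index, {"attribute_1"})
print(sorted(to_update))  # ['view_1']
```

The index is built once, so the cost of an update depends on the number of dependent views instead of the number of deployed views.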
SLOSL provides two main features that help in integrating views. First of all, it provides separate clauses for semantically different parts of a specification. This allows overlapping expressions that follow the same semantics to be extracted and pre-calculated for use in different views. Secondly, it makes it easy to determine the dependencies between node attributes and single clauses of views. They follow from the attributes and bucket variables used in the expressions of SLOSL statements and allow a minimum set of views (and clauses) to be selected for evaluation when attributes change. Note also that some of these attributes are usually declared by NALA as identifiers or otherwise static data, which allows their update dependencies to be dropped from the implementation.

As a result, the run-time performance of the locally running software in a SLOSL implemented system is directly linked to the preparation of an efficient multi-statement query execution plan at compile time. A SLOSL optimiser will therefore start by building a dependency graph like in figure 8.5. It states which attributes each view depends on. Only the dependent views have to be re-evaluated on updates.

Figure 8.5: Example of an Attribute Dependency Graph for Multiple Views

The next step is to open up the view declarations and to split them into their different clauses. We can then build an extended dependency graph for the clauses of each view, as in figure 8.6. As in the previous examples, different clauses do not necessarily depend on the same attributes. If an attribute changes, it can therefore not affect independent clauses. However, the clauses may also provide their own dependencies. While the WHERE clause can only depend on the parents of the view (FROM) and on view options (WITH), the RANKED expression may depend on bucket variables, i.e. the FOREACH and HAVING clauses.

Figure 8.6: Example of an Attribute Dependency Graph between SLOSL Clauses

The dependency graph is therefore needed to determine an efficient execution plan. For example, a change to attribute 1 would enforce a reevaluation of the WHERE clause for the respective node. If that succeeds, evaluating the HAVING and FOREACH clauses will tell us into which buckets the respective node belongs so that we can re-run the ranking only for these. Storing the information whether a node is already part of a view or caching the previous result of the boolean WHERE expression can even prevent the further view evaluation entirely if the result of the WHERE clause stays unchanged. The same applies to the second attribute in the figure. Its update enforces the evaluation of the RANKED expression only if the WHERE clause succeeds or changes.

Figure 8.7: Example of a merged Attribute Dependency Graph with an additional dependent attribute in the database

As mentioned before, independent expressions (even partial expressions) become candidates for pre-evaluation. Their results can be stored in the database as dependent attributes (see 6.1). The previously presented examples all allow for pre-evaluation of the ranking expression, as the additional attribute in figure 8.7 illustrates.

Another possible pre-evaluation can occur when updating multiple views. Common clauses, subexpressions or evaluation steps can be extracted from the specifications, merged and executed in a different order. This reduces the redundancy between views and therefore the cost of updating them. Figure 8.7 shows an example where the evaluation of clauses and subexpressions was partially merged between views.

In SLOSL implemented overlay software, the level of detail of the dependency graph and the application of possible optimisation techniques are the two main factors for the efficient execution of local decisions. The best strategies can be determined at compile time or deployment time, so that the cost of finding them does not affect the run-time performance. SLOSL optimisers will therefore be able to generate optimal evaluation plans for each specific update event used in a model and source code generators can build efficient execution paths from each of them.

The semantics of SLOSL allow for various other approaches. Optimisers may decide to deploy indexes on certain subexpressions.
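The clause-wise execution plan discussed above, including the cached WHERE result, can be sketched as follows. This is one possible reading and not generated OverML code: the class `ClauseWiseView`, the dependency sets and the `reranked` log (which merely records the buckets whose ranking would be re-run) are assumptions for illustration.

```python
WHERE_DEPS = {"alive", "supports_chord"}  # attributes read by WHERE
BUCKET_DEPS = {"ring_dist"}               # attributes read by HAVING/RANKED


class ClauseWiseView:
    def __init__(self, log_k):
        self.log_k = log_k
        self.passes_where = {}  # node id -> cached WHERE result
        self.members = {}       # node id -> bucket the node belongs to
        self.reranked = []      # buckets whose ranking was re-run

    def _where(self, node):
        return node["alive"] and node["supports_chord"]

    def _bucket(self, node):
        for i in range(self.log_k):
            if 2 ** i <= node["ring_dist"] < 2 ** (i + 1):
                return i
        return None

    def update(self, node, changed):
        node_id = node["id"]
        old_where = self.passes_where.get(node_id, False)
        new_where = old_where
        if changed & WHERE_DEPS:
            new_where = self.passes_where[node_id] = self._where(node)
        if not old_where and not new_where:
            return  # node was and stays outside the view
        if old_where == new_where and not changed & BUCKET_DEPS:
            return  # no clause depends on the changed attributes
        old_bucket = self.members.pop(node_id, None)
        new_bucket = self._bucket(node) if new_where else None
        if new_bucket is not None:
            self.members[node_id] = new_bucket
        for bucket in sorted({old_bucket, new_bucket} - {None}):
            self.reranked.append(bucket)  # stands in for re-running RANKED


view = ClauseWiseView(log_k=4)
node = {"id": 7, "alive": False, "supports_chord": True, "ring_dist": 5}
view.update(node, {"alive", "supports_chord", "ring_dist"})  # WHERE fails
node["ring_dist"] = 9
view.update(node, {"ring_dist"})  # skipped: cached WHERE result is False
node["alive"] = True
view.update(node, {"alive"})      # node enters bucket 3
node["ring_dist"] = 2
view.update(node, {"ring_dist"})  # node moves from bucket 3 to bucket 1
print(view.reranked)  # [3, 1, 3]
```

Only the first and third updates evaluate the WHERE expression, and the ranking is touched only for buckets that a node enters or leaves.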
The bucket structure can be laid out and tested for holes and overlaps. This makes it possible to see whether nodes are uniquely mapped to buckets, in which case hashing or indexing become very efficient evaluation strategies for the HAVING and FOREACH clauses. Different frameworks can take their place within the range of optimisations between additional storage for pre-calculated results, hashing, indexing and linear scans. They can trade the compile time overhead against the complexity and performance of the resulting implementation. It is an important achievement of the SLOSL language to move these tradeoffs into configurations and parametrisations of frameworks and model transformations, away from source code implementations of each single overlay system. Profiling one overlay implementation can thus yield new optimisations that all other SLOSL implemented overlays can immediately benefit from.

Transforming the OverML Object Model to Source Code

The generation of source code is obviously a very platform specific transformation. It completely depends on the specific framework and programming language. However, the object representation that was generated so far allows for a rather straightforward mapping to object oriented languages like Java, C++ or Python. The following describes the generic infrastructure that frameworks must provide and names possible mappings from the generated object representation of the OverML languages to framework specific implementations.

HIMDEL requires framework support for message serialization and a mapping to language specific data types. Both were briefly discussed earlier.

EDSL requires the implementation of a generic event-driven state machine. It must support message based networking (see HIMDEL above) and event passing between state components. This infrastructure is often implemented on top of variants of the select or poll system calls [Ste03] for Unix operating systems. Several programming languages and virtual machines provide similar capabilities, for example the NIO package for the Java virtual machine [R+02]. Event passing between states is commonly implemented via an indirection through the framework. This is needed to support the non-blocking I/O of the above system calls. A convenient side effect is the decoupling of states, which, in combination with EDSL generated glue code, leads to decoupled components.

Many higher-level frameworks already exist that implement the majority of features required by EDSL. Examples are the Staged Event Driven Architecture for the Java virtual machine (see 3.2.1) or the Twisted framework for the Python language. The remaining layer needed to support EDSL source code generation for these frameworks is rather thin, as EDSL was designed with very simple, generic abstractions.

Another interesting target environment for EDSL is the Gridkit [GCB+04, GCBP05] (see 8.4). Its component environment, OpenCOM [CBG+04], is designed for run-time reconfiguration. The state interfaces defined by EDSL can be mapped to glue code for OpenCOM components. Such an environment lifts the restrictions of static design-time protocols and allows for run-time software adaptation at the component layer.
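The indirection through the framework that EDSL relies on for event passing between states can be illustrated with a small dispatcher. This is a sketch only: it leaves out the non-blocking network I/O entirely, and the class `EventLoop`, the state names and the event names are hypothetical.

```python
from collections import deque


class EventLoop:
    """Passes events between states through a queue, so that states
    never call each other directly."""

    def __init__(self):
        self.queue = deque()
        self.handlers = {}  # (state, event type) -> handler function

    def on(self, state, event_type, handler):
        self.handlers[(state, event_type)] = handler

    def post(self, state, event_type, payload=None):
        self.queue.append((state, event_type, payload))

    def run(self):
        while self.queue:
            state, event_type, payload = self.queue.popleft()
            handler = self.handlers.get((state, event_type))
            if handler is not None:
                handler(self, payload)


log = []


def on_join_request(loop, payload):
    log.append(("joining", payload))
    loop.post("joined", "join_ack", payload)  # hand over via the framework


def on_join_ack(loop, payload):
    log.append(("joined", payload))


loop = EventLoop()
loop.on("joining", "join_request", on_join_request)
loop.on("joined", "join_ack", on_join_ack)
loop.post("joining", "join_request", "node-42")
loop.run()
print(log)  # [('joining', 'node-42'), ('joined', 'node-42')]
```

Because the handlers only know the loop and the names of events, either state can be replaced without touching the other, which is the decoupling effect mentioned above.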

EDGAR is a rather simple language. It is easily transformed into conditional statements in any general-purpose programming language. Only the exclude last element requires additional information about node identifiers defined in NALA.

NALA defines a data schema for a local data storage point, which may be represented as a database or a local object store or in any other form that seems appropriate. This requires a platform specific component that is configurable by NALA or a transformed representation (e.g. an SQL table creation statement). The second requirement is a mapping from NALA types to platform specific data types. Today, most platforms can handle SQL data types directly, but more specialised mappings will likely yield better performance. Explicitly typed languages, like C or Java, use type declarations and casts in the generated source code. NALA provides the required type information only for the node database. Model transformers must then infer the types of node attributes in views from their dependencies on expressions in the respective SLOSL statements. This step involves common algorithms known from compilers and query engines. However, it depends partially on language specific semantics of expression evaluation, which makes it a platform specific transformation.

SLOSL describes the evaluation of views and their updates. There are different ways to implement views, depending on the database infrastructure. This can be as simple as a linear scan through a list of nodes followed by a sorting step to rank the nodes, or it can deploy materialised views and incremental updates. The possible implementations are as diverse as the target platforms.
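The two ends of this range, a linear scan with a sorting step and a materialised view with incremental updates, can be contrasted in a few lines. The sketch below is illustrative only: the function `scan_and_rank`, the class `MaterialisedView` and the node attributes `alive` and `dist` are assumptions, not part of any OverML framework.

```python
import bisect


def scan_and_rank(nodes, count):
    """Full evaluation: linear scan over all nodes, then sort to rank."""
    alive = [n for n in nodes.values() if n["alive"]]
    ranked = sorted(alive, key=lambda n: (n["dist"], n["id"]))
    return [n["id"] for n in ranked[:count]]


class MaterialisedView:
    """Keeps (dist, id) pairs sorted so that an update avoids a rescan."""

    def __init__(self, nodes):
        self.entries = sorted(
            (n["dist"], n["id"]) for n in nodes.values() if n["alive"])

    def update(self, old, new):
        if old is not None and old["alive"]:
            key = (old["dist"], old["id"])
            del self.entries[bisect.bisect_left(self.entries, key)]
        if new["alive"]:
            bisect.insort(self.entries, (new["dist"], new["id"]))

    def lowest(self, count):
        return [node_id for _, node_id in self.entries[:count]]


nodes = {i: {"id": i, "alive": True, "dist": d}
         for i, d in [(1, 40), (2, 10), (3, 25)]}
view = MaterialisedView(nodes)
before = view.lowest(2)

old = dict(nodes[1])          # remember the previous attribute values
nodes[1]["dist"] = 5          # a harvester-style attribute update
view.update(old, nodes[1])
after = view.lowest(2)
print(before, after)  # [2, 3] [1, 2]
```

Both strategies return the same ranking; they differ in where the cost is paid, which is the kind of tradeoff that a platform specific generator can decide per target environment.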
The first implementation of a SLOSL evaluator is part of the interpreter described in the section on interpreting SLOSL. First steps towards a SLOSL optimiser were described in the section on efficient event-driven SLOSL view updates. Future implementations of OverML in different target environments will help to identify and improve platform independent optimisations that can then become part of the model transformation process. One promising candidate for such a target environment is the Gridkit overlay framework.

8.4 OverGrid: OverML on the Gridkit

In joint work with Paul Grace at the University of Lancaster, the author developed a scheme [BBG+06] for implementing OverML support in the Gridkit. Gridkit [GCB+04, GCBP05] is a cross-cutting middleware for overlay deployment. Based on the dynamic OpenCOM component model [CBG+04], it provides abstractions and interfaces for layering various kinds of overlays and interaction paradigms on top of each other, as illustrated in figure 8.8. Its main intention is to provide layered networking and service components that allow the middleware to support any environment from sensor or ad-hoc networks to Internet-scale applications through simple reconfiguration at the component level.

Figure 8.8: The per-node Gridkit software framework

The Gridkit architecture is based on two forms of layering: vertical layers for network protocols, overlays and middleware services, and a horizontal separation of concerns between forward, control and state components within each overlay layer. The control element cooperates with its peers on other hosts to build and maintain the virtual network topology. The forwarding element appropriately routes messages over this topology. The state element then keeps per-overlay node state such as a next-neighbours list. This allows for a sufficiently fine-grained decomposition to freely stack the resulting implementations into runnable systems.

In Gridkit, the state component is specific to the overlay, and communicates through an overlay specific state interface within its horizontal layer. The control and forwarding components implement the common interfaces IControl and IForward listed below. They allow for vertical stacking of the overlay layers. The control operations are: Create a specified overlay, Join an overlay or Leave an overlay. The forward operations are: Send forwards messages to an identified destination, Receive blocks awaiting messages from the overlay, and EventReceive does the same in a non-blocking style.
    interface IControl {
        ResultCode Create(String netid, Object params);
        ResultCode Join(String netid, Object params);
        ResultCode Leave(String netid);
    }

    interface IForward {
        public byte[] Send(String destid, byte[] msg, int param);
        public byte[] Receive(String netid);
        public void EventReceive(String netid, IDeliver evhandler);
    }

According to the developers of Gridkit, this clean split is hard to achieve in practice for source code implementations. There are certain cross-cutting concerns, like the routing algorithm, that create hard interdependencies between the components. Note also that the state component is overlay specific and therefore closely tied to the implementation.

Table 8.1, provided by Paul Grace, sums up the lines of code and man days required for implementing various systems and layers, while trying to adhere to the necessary separation of concerns. It becomes clear that even within an infrastructure like Gridkit, and even for well-known overlays, the source code implementation of these systems is hard work, including reverse engineering of existing systems and source level debugging of the new implementation.

    Overlay Type                                    LoC    Man Days
    Chord Key-based Routing [SMK+01]
    Chord-based Distributed Hash Table [SMK+01]
    Scribe [RKCD01]
    Tree-Building Control Protocol [MCH01]
    SCAMP [GKM01]
    Minimum Spanning Tree
    Gossip-based Failure Monitor [vrmh98]

Table 8.1: Source code implementation of well-known overlays in Gridkit: lines of code and effort involved

The work presented in this thesis was found to be very much complementary to the Gridkit. The Node Views approach aims to simplify the design and implementation of overlays through high-level, platform-independent models. It further helps in decoupling generic components within overlay implementations and provides an integrative data layer as their basis. The languages EDSL and EDGAR and the generally Model-View-Controller based design allow for an easy mapping to the state-forward-control separation of the Gridkit.

Figure 8.9: Combined architecture of Gridkit and Node Views

Figure 8.9 illustrates a combined architecture of OverML and Gridkit, named OverGrid. The vertical interfaces and the basic controllers are provided by Gridkit, the underlying reconfiguration architecture deploys OpenCOM, and the horizontally arranged components in the overlay stack are modelled, integrated and configured using the Node Views architecture and the OverML languages. The introduction of a database and views simplifies the separation of the control and forwarding components, which was previously found hard to achieve in Gridkit.

The main idea is to use OverML for the platform-independent design of overlay topologies, routing strategies etc., and then generate very specialised Gridkit components and OpenCOM glue code from the model. We will now briefly give an overview of the major parts of the infrastructure that OverML requires in Gridkit: database and views, control and forwarding components, and event handling.

Database and SLOSL Views

OverGrid replaces the individual overlay state components in Gridkit overlays with a generic database exposing views to the control and forwarder components. Hence, the number of executing components is reduced, and state is more easily shared across overlay implementations. This obviously requires a platform specific SLOSL infrastructure in Gridkit, namely a database and a view evaluator. Both of them are parametrised by the OverML models. The node attribute language (NALA) describes the data schema and SLOSL describes the evaluation of nodes and node attributes into views (or sets of nodes).

Depending on the capabilities of the target environment, the database can be anything from a simple hash table of nodes to an object-relational database. This allows for a tradeoff between the run-time performance, the complexity of the static infrastructure and that of the code generation process. Gridkit can easily support different implementations and select between them at deployment time.

Control

Control components are generated mainly from the SLOSL and EDSL models. The resulting event graph implementation is wrapped in an OpenCOM component that implements the IControl interface and interacts with the database through the modelled views.

Currently, Gridkit provides general, re-usable overlay control components (termed generic controllers in Gridkit or harvesters in Node Views) that provide repair and backup strategies for overlays [PC06]. These can be remodelled using EDSL to make their internal structure visible at the model level and subject to adaptation in OverGrid. A number of other possible harvester schemes are presented in chapter 7.
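The division of labour described in this section, a shared database whose views are read by the other components, can be pictured with a small stand-in. None of the following is Gridkit or OpenCOM code: the classes `NodeDatabase`, `View`, `Harvester` and `Forwarder` and their methods are hypothetical names chosen for this sketch.

```python
class NodeDatabase(dict):
    """Shared per-host store that maps node IDs to attribute dictionaries."""


class View:
    """Read-only selection that is recomputed from the shared database."""

    def __init__(self, database, predicate):
        self._database = database
        self._predicate = predicate

    def __iter__(self):
        return (node_id for node_id, node in sorted(self._database.items())
                if self._predicate(node))


class Harvester:
    """Stands in for a generic controller: it writes node state."""

    def __init__(self, database):
        self.database = database

    def node_seen(self, node_id, **attributes):
        self.database.setdefault(node_id, {"alive": True}).update(attributes)

    def node_failed(self, node_id):
        self.database[node_id]["alive"] = False


class Forwarder:
    """Stands in for a forwarding component: it only reads a view."""

    def __init__(self, view):
        self.view = view

    def next_hops(self):
        return list(self.view)


db = NodeDatabase()
alive_view = View(db, lambda node: node["alive"])
harvester, forwarder = Harvester(db), Forwarder(alive_view)

harvester.node_seen(3)
harvester.node_seen(8)
harvester.node_failed(3)
hops = forwarder.next_hops()
print(hops)  # [8]
```

The forwarder holds no overlay specific state of its own, so a second overlay could bind another view to the same database without duplicating node data.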

Forwarding

Forwarding is entirely driven by code generation. EDGAR is easily mapped to conditions in source code. SLOSL views, however, require the decision code to be executed in a SLOSL infrastructure or by code generated for SLOSL. As described in section 5.4, this is mainly identical to normal SLOSL evaluation and thus integrates with the database implementation. OverGrid wraps the generated code within an individual OpenCOM component that implements the IForward interface, and binds to the required views.

Event Infrastructure

EDSL describes the component interaction in terms of events. Implementations will commonly use a generic event-driven state machine engine. It can either interpret the EDSL graph directly or can be specialised by OverML code generators. Such a specialisation can be the generation of unique event IDs that speed up the dispatching process. It can also mean the generation of specific event objects that are forwarded between states, or of event specific handler code. The exact implementation depends on the constraints of the target environment and the effort required for the specialisation of the code generators.

Network Layers

As the FROM clause in SLOSL supports views of views, it can describe a layering of topologies that maps directly to layers in Gridkit as in figure 8.9. In combination with EDSL event flow graphs (and its component subgraphs), this nicely describes the event flow through the network layers and the dependencies of local decisions in one overlay layer on overlays in lower layers.

8.5 OverGrid Case Study: Scribe over Chord

This section describes a case study of designing a complex overlay implementation with OverGrid. It starts by specifying the overlay using the SLOSL Overlay Workbench and then maps the resulting OverML specification to Gridkit components. This simple walk-through does not honour the fact that design usually evolves incrementally. The real-world design process would normally follow edit-compile-test and edit-compile-deploy cycles. For clarity, this section only presents the design steps in summaries. Note, however, that the support for incremental design is a major advantage of the high-level Node Views architecture.

Node Attributes for Scribe/Chord

The first step in the design process is to provide a database schema, expressed as node attributes in NALA. Chord nodes require logical IDs, which are essentially large integers. We define them as having 256 bits. We will see further down that the SLOSL views for Chord and Scribe require the following additional attributes:

- ring_dist is a dependent attribute based on the id attribute. It contains the locally calculated ring distance towards a node based on the Chord metric.
- supports_chord is a boolean flag that states whether a node is known to support the Chord protocol.
- alive is a boolean flag that is only true for live nodes.
- triggered is a boolean attribute. It is set to true when the node did not respond to a message and is considered to be possibly no longer alive. Another way of specifying this would be a bounded counter for failed communication attempts (i.e. outgoing messages).
- subscribed_groups is a set of group identifiers (256 bit IDs) that a node is known to be subscribed to in Scribe.

SLOSL Implemented Chord Topology

Chord [SMK+01] deploys two different views: the finger table defines the major performance characteristics of its topology and the neighbour set keeps the predecessor and successor along the ring to assure correctness. As seen in section 5.1, SLOSL implements the finger table as follows.

    CREATE VIEW chord_fingertable
    AS SELECT node.id, node.ring_dist, chord_bucket=i
    FROM node_db
    WITH log_k = log(K)
    WHERE node.supports_chord = true AND node.alive = true
    HAVING node.ring_dist in [2^i, 2^(i+1))
    FOREACH i IN [0, log_k)
    RANKED lowest(1, node.ring_dist)

The neighbour set implementation contains the node with the lowest node ID further along the ring (the successor) and the node with the highest node ID backwards on the ring (the predecessor).
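The finger table statement and the successor/predecessor selection described above can be checked on a toy ring. The sketch assumes the usual clockwise Chord distance, a key space of only 2^8 IDs instead of 256 bit IDs, a base 2 logarithm for log_k, and a hypothetical set of live node IDs; `ring_dist` here is a plain function rather than a dependent database attribute.

```python
K = 2 ** 8      # size of the toy key space (assumption, not 256 bit IDs)
LOG_K = 8       # log_k = log2(K)
local_id = 10
node_ids = [12, 13, 20, 75, 140, 200, 250]  # hypothetical live Chord nodes


def ring_dist(node_id):
    """Clockwise distance from the local node along the ring."""
    return (node_id - local_id) % K


finger_table = {}
for i in range(LOG_K):                                   # FOREACH i
    bucket = [n for n in node_ids
              if 2 ** i <= ring_dist(n) < 2 ** (i + 1)]  # HAVING
    if bucket:
        finger_table[i] = min(bucket, key=ring_dist)     # RANKED lowest(1, ...)

successor = min(node_ids, key=ring_dist)    # closest node further along
predecessor = max(node_ids, key=ring_dist)  # closest node backwards

print(finger_table)            # {1: 12, 3: 20, 6: 75, 7: 140}
print(successor, predecessor)  # 12 250
```

Buckets without a matching node simply stay empty, which is why the finger table of a sparsely known ring has fewer than log_k entries.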
For resilience reasons, the view specified below stores a larger number of nodes, as encouraged by the original Chord paper.

    CREATE VIEW circle_neighbours
    AS SELECT node.id, side=sign
    FROM node_db
    WITH ncount=10, max_id=K-1
    WHERE node.alive = true
    HAVING abs(node.id - local.id) <= max_id / 2
           AND sign(node.id - local.id) < 0
        OR abs(node.id - local.id) > max_id / 2
           AND sign(node.id - local.id) > 0
    FOREACH sign IN {-1,1}
    RANKED lowest(ncount, node.ring_dist)

Chord Routing through SLOSL Views

Figure 8.10: EDGAR Graph for Chord Routing through SLOSL Views (graph elements: first match; [for me?] handle locally; finger table; neighbour set; handle error or drop)

The routing decision graph for Chord is shown in figure 8.10. As generally described for EDGAR in section 6.3, messages are pushed through the tree from the left and traverse it in depth-first pre-order. Chord generally decides responsibility for messages through the finger table view, but it can always fall back to using the circle neighbours if that fails.

SLOSL Implemented Scribe on Chord

The Scribe [RKCD01] multicast scheme requires an additional view for routing publications. It forwards them towards the rendezvous node of the respective group and, at each hop along the path, broadcasts them to all subscribed children. Forwarding towards the rendezvous simply deploys Chord routing, but the broadcast requires an additional view on top of the Chord topology that selects only subscribed neighbours. The selection is based on the active subscriptions of nodes in the finger table. All locally active subscriptions are given by the well defined set containing the subscriptions of the local node or its children towards the parents. It includes the group IDs for which the local node is the rendezvous node and for which children or the local node are subscribed. The Scribe implementation requires a view for each of the locally active groups as follows.

    CREATE VIEW scribe_subscribed_children
    AS SELECT node.id
    FROM chord_fingertable
    WITH sub
    WHERE sub in node.subscribed_groups

Multicast Routing through SLOSL Views

Figure 8.11: EDGAR Graph for Multicast Routing through SLOSL Views (graph elements: fork; [am I subscribed?] handle locally; first match; [am I rendezvous?]; subscribed fingers; exclude last; fork; forward (Chord))

The routing decision graph for Scribe publications on Chord is shown in figure 8.11. Along the path, the execution is forked into different branches, which treat the message independently for local subscriptions, neighbour subscriptions, and forwarding to the rendezvous node using the lower-level router (Chord in this case). The exclude last property prevents the last hop of a received message from appearing amongst the selection of next hops further down the tree.

Message Specifications in HIMDEL

Instead of using the verbose XML representation of HIMDEL, the specification is presented in a shorter form here. The names in brackets are the access names used for accessing fields and structure of messages. Remember that the programmatic interface of messages also defines a last_hop field (if it is known) and a next_hop field (if it was determined by a router). For transmission, OverGrid uses a binary format that is described elsewhere in this work.

    Header [chord]
    Container [ids]
    id [sender]
    id [receiver]
    Message [chord_joined]
    Message [chord_find_successor]
    Message [chord_find_predecessor]
    Message [chord_notify]
    Message [chord_update_fingertable]
    View-Data [finger_table_bucket] chord_fingertable/bucket
    Header [scribe]
    Message [scribe_create_rendezvous]
    Message [scribe_join]
    Message [scribe_leave]
    Message [scribe_publish]
    Data [event]

Event Handling in EDSL

This is the main part where Gridkit integrates with the code generation process. Gridkit components implement the controllers that are connected by the EDSL graph specification. Again for clarity, this section does not describe the entire protocol graph of Scribe on Chord that implements the IControl component. Instead, it presents an example event processing cycle that shows how the system responds to events in a non-trivial way. The leave process in Scribe, which implements the propagation of unsubscriptions from groups, is a good example for this purpose. The complete process is presented in figure 8.12. The vertices are EDSL states (Gridkit implemented controllers), solid lines represent EDSL transitions and dashed lines represent programmatic actions of controllers that are not covered by EDSL.

Figure 8.12: Event processing for explicit and implicit leaves in Scribe (graph elements: incoming messages; component that sends message; scribe_leave[ids.receiver]; timeout (3 sec); ACK; handle leave; local leave; ACK handler to check triggered/alive; ping; Got ACK (ignore); remove ID from msg.last_hop.subscribed_groups; remove group from local.subscribed_groups; switch alive; switch triggered; database; unsubscribe; subscription view empty)

There are three cases in Scribe that trigger a leave. The first one is the local leave that unsubscribes the local node. The second one is the explicit leave where a neighbour sends an unsubscribe message for a group. The third one is the implicit leave where the local node loses the connection to a subscribed neighbour. The first two cases are mainly identical in handling; the implicit one requires additional logic to decide that a leave must be triggered. In this case, any component that sends messages and expects some kind of acknowledgement for them has two transitions coming out of it: one for receiving the ACK and one for a timeout. Only one of them will ever be triggered, so whatever comes first will determine the further execution path. If the ACK is received first, all is fine. If, however, the timeout comes first, it triggers the destination state of the timeout, which is a generic ACK handler controller. This controller looks up the next hop in the respective message and switches its triggered attribute through the database API. If it becomes true, the controller triggers the node by sending it a ping and then terminates. The same controller will handle the timeout of this ping. If, however, the attribute was true already and becomes false now, the controller sets the alive attribute of that node to false.

These changes will trigger events from the database. In our example, the update event of the triggered attribute is not used, but all SLOSL views must be updated for the alive event. For simplicity, we will assume that the node was only contained in the finger table view and has open subscriptions in the Scribe subscriptions view. Containment can be decided in different ways depending on the database and view implementations. If materialised views are used (which may be the simplest implementation anyway), it is easy to check if a node is visible in a view. Otherwise, the view has to be evaluated for the node, once with the original attributes and once with the modified attributes, to determine a change. If the node update did not impact the view content, no view update is needed and no events are generated. Note that the SLOSL statements in an OverML specification provide all semantics necessary to determine efficient evaluation plans at compilation time.

In our example, we assumed that the node was contained in the finger table. It will therefore disappear from the view after the alive update. This triggers the event that the node left the finger table view.
The same will happen for the scribe_subscribed_children views that inherit from the finger table. Our current Chord implementation can ignore these events, as its routers only use the consistently updated views and therefore do not require any notifications about updates. Similarly, the Scribe implementation can ignore them as long as there are nodes left in all subscription views. Therefore, in the simplest case, event handling ends just here.

In the case where the failed node was the only subscriber for a group, however, Scribe has to unsubscribe from that group. This is done through the view empty event that is triggered whenever a view update leaves the view empty. Note that the same event is triggered when a controller receives an explicit unsubscription from the last subscribed neighbour in a group and deletes the group ID from its subscribed_groups attribute. This will delete the last node from the subscription view of that group and thus trigger the event. The view empty event of subscription views is therefore connected to an unsubscribe state in the EDSL implementation of Scribe. For each event, it sends out an unsubscribe message towards the rendezvous node of the respective group by using the underlying Chord router.
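The implicit leave described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: `NodeDB`, `on_ack_timeout` and the callback list are hypothetical names, not part of the Gridkit, EDSL or OverGrid interfaces, and the subscription views are emulated by a simple filter instead of generated SLOSL views.

```python
class NodeDB:
    """Toy node database holding the attributes used by the leave process."""

    def __init__(self):
        self.nodes = {}            # node id -> attribute dict
        self.on_view_empty = []    # callbacks that receive a group id

    def add(self, node_id, groups):
        self.nodes[node_id] = {
            "alive": True,
            "triggered": False,
            "subscribed_groups": set(groups),
        }

    def subscribers(self, group):
        """Stand-in for one subscription view: live nodes subscribed to group."""
        return {
            node_id for node_id, attrs in self.nodes.items()
            if attrs["alive"] and group in attrs["subscribed_groups"]
        }

    def set_alive(self, node_id, value):
        """Update alive and fire 'view empty' for views that just became empty."""
        groups = sorted(self.nodes[node_id]["subscribed_groups"])
        non_empty_before = [g for g in groups if self.subscribers(g)]
        self.nodes[node_id]["alive"] = value
        for group in non_empty_before:
            if not self.subscribers(group):
                for callback in self.on_view_empty:
                    callback(group)


def on_ack_timeout(db, next_hop, send_ping):
    """Generic ACK handler: switch 'triggered', then ping or mark the node dead."""
    attrs = db.nodes[next_hop]
    attrs["triggered"] = not attrs["triggered"]
    if attrs["triggered"]:
        send_ping(next_hop)            # first timeout: probe the node
    else:
        db.set_alive(next_hop, False)  # the ping timed out as well


if __name__ == "__main__":
    db = NodeDB()
    db.add(7, ["g1"])
    db.add(8, ["g1", "g2"])
    pings, unsubscribes = [], []
    db.on_view_empty.append(unsubscribes.append)  # plays the unsubscribe state

    on_ack_timeout(db, 8, pings.append)  # message to node 8 timed out
    on_ack_timeout(db, 8, pings.append)  # the ping to node 8 timed out too
    print(pings, unsubscribes)
```

In this sketch, only the group whose last live subscriber disappeared reaches the unsubscribe callback, while groups that still have subscribers produce no further handling, which corresponds to the simplest case described in the text.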

From Specification to Deployment

Once the major characteristics of the overlay stack are defined, we can make the system runnable step by step. We start by running tests within the workbench. The SLOSL visualiser (see 8.1.3) allows the developer to play with simple scenarios. This helps in finding topological problems in the specification before starting to work on the controllers.

Since the router components are generated completely based on SLOSL and EDGAR, the main portion that remains to be implemented is the logic behind the IControl interface. It is internally structured by the EDSL graph. At the beginning of the coding step, it is helpful to use dummies for controllers that are not yet implemented. Their interfaces are defined by their EDSL interaction, so this is simple to do in code generators. Also, as figure 8.12 suggests, controllers generally tend to be very simple and small, which allows for predefined tool sets of components and quick-and-dirty stub implementations to get the system working. From the EDSL graph, it is immediately clear which components are required to make a specific portion of the protocols work.

The following steps obviously depend on the available tool support. Environments for testing, debugging and simulating overlay implementations are not yet available for the OverGrid architecture. Its main achievements in this area, however, are the rich semantics that become available to debuggers and the possibility to write these tools for generic OverML models and the generic Gridkit/OpenCOM component architecture. Once available, these developer tools will not require any adaptation to the specific overlay implementations. Current overlay simulators are very much focused on the requirements of the specific overlays they were written for. Overlays implemented with OverGrid, on the other hand, will seamlessly move between different OverGrid compatible simulators, visualisers, debuggers and deployment environments without major redesign or rewrites. Throughout the testing phase, the designer will be free to go back to the design phase, modify the models and regenerate the overlay implementation.

The main difference between environments for testing, simulation and different deployment scenarios is the Gridkit configuration. Different networking stacks (starting with the physical protocols), different IControl components, different subgraphs in the EDSL model and different controller implementations can be used to adapt the system to the current environment at an arbitrary granularity. Due to the OpenCOM component model, all of this can be configured at deployment time.

Chapter 9

Related Work

Contents
9.1 Unstructured Overlay Networks
Key-Based Routing (KBR) Networks
Topologies Comparison and Evaluation
Message Overhead and Physical Networks
Design and Implementation of Overlays
Quality-of-Service in Publish-Subscribe
Modelling and Generating Software
Modelling Distributed Systems and Protocols
Databases and Data-Driven Software

Related work exists in a number of areas. First of all, there is a considerable number of peer-to-peer systems and overlay systems that were proposed and/or implemented. Some of these were already mentioned in the previous chapters, especially in 2.3. A broader overview is given in 9.1 and 9.2. Section 9.3 refers to evaluation studies that compare these systems.

There are two major books that provide a broad overview of the area of overlay networks and peer-to-peer systems. The popular book edited by Andrew Oram in 2001 [Ora01] provides a collection of articles written by a variety of well-known people, both from research and industry. The second book, edited by Steinmetz and Wehrle in 2005 [SW05], contains a newer collection of more research oriented articles. The overall breadth of the collection provides a readable overview of various research directions and an excellent foundation for future work.

One of the intentions behind this thesis was the integration of different overlay networks. Such an integrative approach must prevent exhaustive resource usage when maintaining multiple overlays on top of a single physical network. Related work in this area is presented in section 9.4. Section 3.2 provided an overview of previous approaches in the area of frameworks and middleware systems for the design of these overlay networks. Further systems and other areas of network and topology design are presented in section 9.5. The general Software Engineering approaches that underlie the Model Driven Architecture and CASE tools are briefly overviewed later in this chapter.

9.1 Unstructured Overlay Networks

The field of overlay networks is commonly divided into unstructured and structured networks. There is, however, a substantial number of hybrid approaches and those that reduce the unstructuredness to achieve certain characteristics, so the differences have been steadily diminishing since the distinction was first made. Structured or Key-based Routing networks, as described in 9.2, explicitly target a well defined topology that is actively maintained by each participant. Unstructured networks, on the other hand, do not actively maintain a specific topology. Their properties rather result from the use patterns in their deployment.

Power-law Networks

Most unstructured overlay networks form Power-law topologies, also known as Small-World Graphs [Mil67, WS98]. The oldest and most well-known of them was the original Gnutella. Their random nature provides relatively high resilience combined with a small network diameter, which has also been popularised as the six degrees of separation. The currently deployed overlays are commonly used for file sharing and decentralised keyword search. Their main advantage is that potentially large documents are statically stored on the source nodes, so that only the small queries need to traverse the network. However, due to their size, complete coverage is extremely expensive.
Hence, these systems do not provide any guarantees regarding consistency or availability of data, since queries may not be forwarded to all potential sources and nodes may fail arbitrarily.

Evaluations of Gnutella have shown that its random nature makes it highly resilient against random failures [SGG02]. This was found by crawling a live Gnutella network of about 1800 nodes. If the simulated failures are targeted, however, the authors find that it is sufficient to destroy only the most highly connected 4% of the nodes to leave the graph disconnected. A visualisation of these findings is presented in figure 9.1.

Figure 9.1: Resilience of a Gnutella graph: topology as found by crawler (left), connected after 30% random failures (middle), disconnected after 4% targeted failures (right). Source: [SGG02] (authorised)

This fraction obviously depends on the total number of nodes and the distribution of out degrees, so the resilience can be expected to grow with the network. However, this still shows that it is worth putting consideration into the selection of neighbours and the maintenance of the overlay, even if the general topology is randomised.

There were a number of proposals for semantic grouping in unstructured networks, including [Jos02, NS02, BLZK04]. Their advantage is the topological vicinity of related documents, which simplifies searches. However, one of the problems here is the definition of semantic vicinity. These systems require the availability of meta data, which in turn may require global knowledge of document properties or larger data exchanges between random nodes. This can be a substantial barrier for newly joining nodes.

A number of other approaches were proposed to improve the search characteristics of early unstructured networks. Chawathe et al. [CRB+03] evaluate a number of proposals and finally describe Gia, a system based on Gnutella that combines the most promising tweaks. Such an approach is substantially more promising than the implementation of keyword search on top of Distributed Hash Tables as in [LLH+03] (see 9.2). Due to their hash table nature, these systems must rely on distributed joins of potentially large data sets as well as global knowledge about the available documents, e.g. to compute the Inverse Document Frequency (IDF) as known from Information Retrieval. This intuition was also expressed in [CRB+03]. A hybrid solution is presented in [LHSH04].
The authors combine the rather exhaustive search capabilities of unstructured networks with the lookup semantics of Distributed Hash Tables. This enables broad searches for well replicated data as well as lookups of rare (and therefore indexable) data.

Square-root Networks

While power-law topologies mimic a common pattern found in various forms of networks (including social networking), networks and replication schemes

Motivation for peer-to-peer

Motivation for peer-to-peer Peer-to-peer systems INF 5040 autumn 2007 lecturer: Roman Vitenberg INF5040, Frank Eliassen & Roman Vitenberg 1 Motivation for peer-to-peer Inherent restrictions of the standard client/server model Centralised

More information

Event-based middleware services

Event-based middleware services 3 Event-based middleware services The term event service has different definitions. In general, an event service connects producers of information and interested consumers. The service acquires events

More information

Open Text Social Media. Actual Status, Strategy and Roadmap

Open Text Social Media. Actual Status, Strategy and Roadmap Open Text Social Media Actual Status, Strategy and Roadmap Lars Onasch (Product Marketing) Bernfried Howe (Product Management) Martin Schwanke (Global Service) February 23, 2010 Slide 1 Copyright Open

More information


Produktfamilienentwicklung Produktfamilienentwicklung Bericht über die ITEA-Projekte ESAPS, CAFÉ und Families Günter Böckle Siemens CT SE 3 Motivation Drei große ITEA Projekte über Produktfamilien- Engineering: ESAPS (1.7.99 30.6.01),

More information

Embedded Software Development and Test in 2011 using a mini- HIL approach

Embedded Software Development and Test in 2011 using a mini- HIL approach Primoz Alic, isystem, Slovenia Erol Simsek, isystem, Munich Embedded Software Development and Test in 2011 using a mini- HIL approach Kurzfassung Dieser Artikel beschreibt den grundsätzlichen Aufbau des

More information

Search Engines Chapter 2 Architecture. 14.4.2011 Felix Naumann

Search Engines Chapter 2 Architecture. 14.4.2011 Felix Naumann Search Engines Chapter 2 Architecture 14.4.2011 Felix Naumann Overview 2 Basic Building Blocks Indexing Text Acquisition Text Transformation Index Creation Querying User Interaction Ranking Evaluation

More information

3 rd Young Researcher s Day 2013

3 rd Young Researcher s Day 2013 Einladung zum 3 rd Young Researcher s Day 2013 Nach zwei erfolgreichen Young Researcher s Days starten wir kurz vor dem Sommer in Runde drei. Frau Ingrid Schaumüller-Bichl und Herr Edgar Weippl laden ganz

More information

Classic Grid Architecture

Classic Grid Architecture Peer-to to-peer Grids Classic Grid Architecture Resources Database Database Netsolve Collaboration Composition Content Access Computing Security Middle Tier Brokers Service Providers Middle Tier becomes

More information

Peer-to-Peer: an Enabling Technology for Next-Generation E-learning

Peer-to-Peer: an Enabling Technology for Next-Generation E-learning Peer-to-Peer: an Enabling Technology for Next-Generation E-learning Aleksander Bu lkowski 1, Edward Nawarecki 1, and Andrzej Duda 2 1 AGH University of Science and Technology, Dept. Of Computer Science,

More information

Berufsakademie Mannheim University of Co-operative Education Department of Information Technology (International)

Berufsakademie Mannheim University of Co-operative Education Department of Information Technology (International) Berufsakademie Mannheim University of Co-operative Education Department of Information Technology (International) Guidelines for the Conduct of Independent (Research) Projects 5th/6th Semester 1.) Objective:

More information

Designing and Deploying Messaging Solutions with Microsoft Exchange Server 2010 MOC 10233

Designing and Deploying Messaging Solutions with Microsoft Exchange Server 2010 MOC 10233 Designing and Deploying Messaging Solutions with Microsoft Exchange Server MOC 10233 In dieser Schulung erhalten Sie das nötige Wissen für das Design und die Bereitstellung von Messaging-Lösungen mit Microsoft

More information

8 Conclusion and Future Work

8 Conclusion and Future Work 8 Conclusion and Future Work This chapter concludes this thesis and provides an outlook on future work in the area of mobile ad hoc networks and peer-to-peer overlay networks 8.1 Conclusion Due to the

More information

Pavlo Baron. Big Data and CDN

Pavlo Baron. Big Data and CDN Pavlo Baron Big Data and CDN Pavlo Baron www.pbit.org pb@pbit.org @pavlobaron What is Big Data Big Data describes datasets that grow so large that they become awkward to work with using on-hand database

More information

Upgrading Your Skills to MCSA Windows Server 2012 MOC 20417

Upgrading Your Skills to MCSA Windows Server 2012 MOC 20417 Upgrading Your Skills to MCSA Windows Server 2012 MOC 20417 In dieser Schulung lernen Sie neue Features und Funktionalitäten in Windows Server 2012 in Bezug auf das Management, die Netzwerkinfrastruktur,

More information

The Role and uses of Peer-to-Peer in file-sharing. Computer Communication & Distributed Systems EDA 390

The Role and uses of Peer-to-Peer in file-sharing. Computer Communication & Distributed Systems EDA 390 The Role and uses of Peer-to-Peer in file-sharing Computer Communication & Distributed Systems EDA 390 Jenny Bengtsson Prarthanaa Khokar jenben@dtek.chalmers.se prarthan@dtek.chalmers.se Gothenburg, May

More information

Multicast vs. P2P for content distribution

Multicast vs. P2P for content distribution Multicast vs. P2P for content distribution Abstract Many different service architectures, ranging from centralized client-server to fully distributed are available in today s world for Content Distribution

More information


RESEARCH ISSUES IN PEER-TO-PEER DATA MANAGEMENT RESEARCH ISSUES IN PEER-TO-PEER DATA MANAGEMENT Bilkent University 1 OUTLINE P2P computing systems Representative P2P systems P2P data management Incentive mechanisms Concluding remarks Bilkent University

More information

zen Platform technical white paper

zen Platform technical white paper zen Platform technical white paper The zen Platform as Strategic Business Platform The increasing use of application servers as standard paradigm for the development of business critical applications meant

More information

JoramMQ, a distributed MQTT broker for the Internet of Things

JoramMQ, a distributed MQTT broker for the Internet of Things JoramMQ, a distributed broker for the Internet of Things White paper and performance evaluation v1.2 September 214 mqtt.jorammq.com www.scalagent.com 1 1 Overview Message Queue Telemetry Transport () is

More information

CSE 5306 Distributed Systems. Architectures

CSE 5306 Distributed Systems. Architectures CSE 5306 Distributed Systems Architectures 1 Architecture Software architecture How software components are organized, How software components interact System architecture Instantiation and placement of

More information

White Paper. How Streaming Data Analytics Enables Real-Time Decisions

White Paper. How Streaming Data Analytics Enables Real-Time Decisions White Paper How Streaming Data Analytics Enables Real-Time Decisions Contents Introduction... 1 What Is Streaming Analytics?... 1 How Does SAS Event Stream Processing Work?... 2 Overview...2 Event Stream

More information

anon.next: A Framework for Privacy in the Next Generation Internet

anon.next: A Framework for Privacy in the Next Generation Internet anon.next: A Framework for Privacy in the Next Generation Internet Matthew Wright Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington, TX, USA, mwright@uta.edu,

More information

A Brief Analysis on Architecture and Reliability of Cloud Based Data Storage

A Brief Analysis on Architecture and Reliability of Cloud Based Data Storage Volume 2, No.4, July August 2013 International Journal of Information Systems and Computer Sciences ISSN 2319 7595 Tejaswini S L Jayanthy et al., Available International Online Journal at http://warse.org/pdfs/ijiscs03242013.pdf

More information

EDOS Distribution System: a P2P architecture for open-source content dissemination

EDOS Distribution System: a P2P architecture for open-source content dissemination EDOS Distribution System: a P2P architecture for open-source content Serge Abiteboul 1, Itay Dar 2, Radu Pop 3, Gabriel Vasile 1 and Dan Vodislav 4 1. INRIA Futurs, France {firstname.lastname}@inria.fr

More information

A Survey Study on Monitoring Service for Grid

A Survey Study on Monitoring Service for Grid A Survey Study on Monitoring Service for Grid Erkang You erkyou@indiana.edu ABSTRACT Grid is a distributed system that integrates heterogeneous systems into a single transparent computer, aiming to provide

More information

Is Cloud relevant for SOA? 2014-06-12 - Corsin Decurtins

Is Cloud relevant for SOA? 2014-06-12 - Corsin Decurtins Is Cloud relevant for SOA? 2014-06-12 - Corsin Decurtins Abstract SOA (Service-Orientierte Architektur) war vor einigen Jahren ein absolutes Hype- Thema in Unternehmen. Mittlerweile ist es aber sehr viel

More information

IAC-BOX Network Integration. IAC-BOX Network Integration IACBOX.COM. Version 2.0.1 English 24.07.2014

IAC-BOX Network Integration. IAC-BOX Network Integration IACBOX.COM. Version 2.0.1 English 24.07.2014 IAC-BOX Network Integration Version 2.0.1 English 24.07.2014 In this HOWTO the basic network infrastructure of the IAC-BOX is described. IAC-BOX Network Integration TITLE Contents Contents... 1 1. Hints...

More information

Exploiting peer group concept for adaptive and highly available services

Exploiting peer group concept for adaptive and highly available services Exploiting peer group concept for adaptive and highly available services Muhammad Asif Jan Centre for European Nuclear Research (CERN) Switzerland Fahd Ali Zahid, Mohammad Moazam Fraz Foundation University,

More information

An Alternative Web Search Strategy? Abstract

An Alternative Web Search Strategy? Abstract An Alternative Web Search Strategy? V.-H. Winterer, Rechenzentrum Universität Freiburg (Dated: November 2007) Abstract We propose an alternative Web search strategy taking advantage of the knowledge on

More information

Microsoft Azure for AWS Experts MOC 40390

Microsoft Azure for AWS Experts MOC 40390 Microsoft Azure for AWS Experts MOC 40390 In diesem Kurs lernen Sie durch eine tiefgehen Diskussion und praktische Hands-on Übungen die Microsoft Azure Infrastruktur Services (IaaS) einschließlich der

More information

The Changing Global Egg Industry

The Changing Global Egg Industry Vol. 46 (2), Oct. 2011, Page 3 The Changing Global Egg Industry - The new role of less developed and threshold countries in global egg production and trade 1 - Hans-Wilhelm Windhorst, Vechta, Germany Introduction

More information

Scalable Source Routing

Scalable Source Routing Scalable Source Routing January 2010 Thomas Fuhrmann Department of Informatics, Self-Organizing Systems Group, Technical University Munich, Germany Routing in Networks You re there. I m here. Scalable

More information

A Framework for End-to-End Proactive Network Management

A Framework for End-to-End Proactive Network Management A Framework for End-to-End Proactive Network Management S. Hariri, Y. Kim, P. Varshney, Department of Electrical Engineering and Computer Science Syracuse University, Syracuse, NY 13244 {hariri, yhkim,varshey}@cat.syr.edu

More information

Praktikum Wissenschaftliches Rechnen (Performance-optimized optimized Programming)

Praktikum Wissenschaftliches Rechnen (Performance-optimized optimized Programming) Praktikum Wissenschaftliches Rechnen (Performance-optimized optimized Programming) Dynamic Load Balancing Dr. Ralf-Peter Mundani Center for Simulation Technology in Engineering Technische Universität München

More information

Contents. 1010 Huntcliff, Suite 1350, Atlanta, Georgia, 30350, USA http://www.nevatech.com

Contents. 1010 Huntcliff, Suite 1350, Atlanta, Georgia, 30350, USA http://www.nevatech.com Sentinet Overview Contents Overview... 3 Architecture... 3 Technology Stack... 4 Features Summary... 6 Repository... 6 Runtime Management... 6 Services Virtualization and Mediation... 9 Communication and

More information

Computer Networking Networks

Computer Networking Networks Page 1 of 8 Computer Networking Networks 9.1 Local area network A local area network (LAN) is a network that connects computers and devices in a limited geographical area such as a home, school, office

More information

There are a number of factors that increase the risk of performance problems in complex computer and software systems, such as e-commerce systems.

There are a number of factors that increase the risk of performance problems in complex computer and software systems, such as e-commerce systems. ASSURING PERFORMANCE IN E-COMMERCE SYSTEMS Dr. John Murphy Abstract Performance Assurance is a methodology that, when applied during the design and development cycle, will greatly increase the chances

More information

Agenda. Distributed System Structures. Why Distributed Systems? Motivation

Agenda. Distributed System Structures. Why Distributed Systems? Motivation Agenda Distributed System Structures CSCI 444/544 Operating Systems Fall 2008 Motivation Network structure Fundamental network services Sockets and ports Client/server model Remote Procedure Call (RPC)

More information

Trace Driven Analysis of the Long Term Evolution of Gnutella Peer-to-Peer Traffic

Trace Driven Analysis of the Long Term Evolution of Gnutella Peer-to-Peer Traffic Trace Driven Analysis of the Long Term Evolution of Gnutella Peer-to-Peer Traffic William Acosta and Surendar Chandra University of Notre Dame, Notre Dame IN, 46556, USA {wacosta,surendar}@cse.nd.edu Abstract.

More information

New possibilities for the provision of value-added services in SIP-based peer-to-peer networks

New possibilities for the provision of value-added services in SIP-based peer-to-peer networks New possibilities for the provision of value-added services in -based peer-to-peer networks A.Lehmann 1,2, W.Fuhrmann 3, U.Trick 1, B.Ghita 2 1 Research Group for Telecommunication Networks, University

More information

Project Group NODES - Offering Dynamics to Emerging Structures www.peerfact.org

PeerfactSim: A Simulation Framework for Peer-to-Peer Systems (and more) UPB SS2011 PG-Nodes Lecture-06 P2P-Simulations.ppt

Optimizing Service Levels in Public Cloud Deployments

WHITE PAPER OCTOBER 2014 Keys to Effective Service Management ca.com Table of


CHAPTER 1 INTRODUCTION Internet has revolutionized the world. There seems to be no limit to the imagination of how computers can be used to help mankind. Enterprises are typically comprised of hundreds

Living in the Present: On-the-fly Information Processing in Scalable Web Architectures

Department of Computing David Eyers, Tobias Freudenreich, Alessandro Margara, Sebastian Frischbier, Peter Pietzuch,

Seminar: Security Metrics in Cloud Computing (20-00-0577-se)

Technische Universität Darmstadt Dependable, Embedded Systems and Software Group (DEEDS) Hochschulstr. 10 64289 Darmstadt Topics Descriptions

Adapting Distributed Hash Tables for Mobile Ad Hoc Networks

University of Tübingen Chair for Computer Networks and Internet Tobias Heer, Stefan Götz, Simon Rieche, Klaus Wehrle Protocol Engineering and

Tree Based Decentralized Search & Routing For Content Based Publish/Subscribe System

S.Saranya, Research Scholar [1] Tiruppur Kumaran College For Women, Tiruppur S.Radhimeenakshi, Assistant Professor [2]

QoS Integration in Web Services

M. Tian Freie Universität Berlin, Institut für Informatik Takustr. 9, D-14195 Berlin, Germany tian@inf.fu-berlin.de Abstract: With the growing popularity of Web services,

An Active Packet can be classified as

Mobile Agents for Active Network Management By Rumeel Kazi and Patricia Morreale Stevens Institute of Technology Contact: rkazi,pat@ati.stevens-tech.edu Abstract-Traditionally, network management systems

Architectural Design

Software Engineering Software architecture The design process for identifying the sub-systems making up a system and the framework for sub-system control and communication is architectural

Peer-to-Peer Networks. Chapter 6: P2P Content Distribution

Chapter Outline Content distribution overview Why P2P content distribution? Network coding Peer-to-peer multicast Kangasharju: Peer-to-Peer Networks

A: Ein ganz normaler Prozess B: Best Practices in BPMN 1.x. ITAB / IT Architekturbüro Rüdiger Molle März 2009

Lessons learned: The description of a business process (GP) by the business leaves questions of

µfup: A Software Development Process for Embedded Systems

Leif Geiger, Jörg Siedhof, Albert Zündorf University of Kassel, Software Engineering Research Group, Department of Computer Science and Electrical

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 349 ISSN 2229-5518

Load Balancing Heterogeneous Request in DHT-based P2P Systems Mrs. Yogita A. Dalvi Dr. R. Shankar Mr. Atesh

Evolution der Dienste im zukünftigen Internet

Prof. Dr. P. Tran-Gia www3.informatik.uni-wuerzburg.de Tools to find the Future Internet Impossible to see the future is. (Master Yoda) IP & Post-IP future

Kapitel 2 Unternehmensarchitektur III

Software Architecture, Quality, and Testing FS 2015 Prof. Dr. Jana Köhler jana.koehler@hslu.ch IT strategy development "Foundation for Execution" "Because experts

Performance Evaluation of AODV, OLSR Routing Protocol in VOIP Over Ad Hoc

(International Journal of Computer Science & Management Studies) Vol. 17, Issue 01 Dr. Khalid Hamid Bilal Khartoum, Sudan dr.khalidbilal@hotmail.com

A Peer-to-Peer Approach to Content Dissemination and Search in Collaborative Networks

Ismail Bhana and David Johnson Advanced Computing and Emerging Technologies Centre, School of Systems Engineering,

Matrix: Adaptive Middleware for Distributed Multiplayer Games. Goal: Clean Separation. Design Criteria. Matrix Architecture

Rajesh Balan (Carnegie Mellon University) Maria Ebling (IBM Watson) Paul Castro (IBM Watson) Archan Misra (IBM Watson) Motivation: Massively

Topic Communities in P2P Networks

Joint work with A. Löser (IBM), C. Tempich (AIFB) SNA@ESWC 2006 Budva, Montenegro, June 12, 2006 Two opposite challenges when considering Social Networks Analysis Nodes/Agents

Setting Up an AS4 System

INT0697_150625 Setting up an AS4 system V1r0. Version 1r0. ENTSOG AISBL; Av. de Cortenbergh 100, 1000-Brussels; Tel: +32 2 894 5100; Fax: +32 2 894 5101; info@entsog.eu, www.entsog.eu,

Global Server Load Balancing

White Paper Overview Many enterprises attempt to scale Web and network capacity by deploying additional servers and increased infrastructure at a single location, but centralized architectures are subject




Service Mediation. The Role of an Enterprise Service Bus in an SOA

TABLE OF CONTENTS 1 The Road to Web Services and ESBs 2 Enterprise-Class Requirements for an ESB 3 Additional Evaluation Criteria

Middleware and Distributed Systems. System Models. Dr. Martin v. Löwis. Friday, 14 October 11

System Models (Coulouris et al.) Architectural models of distributed systems placement of parts and relationships between them e.g.


A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

Simulating a File-Sharing P2P Network

Mario T. Schlosser, Tyson E. Condie, and Sepandar D. Kamvar Department of Computer Science Stanford University, Stanford, CA 94305, USA Abstract. Assessing the performance

Lustre Networking BY PETER J. BRAAM

A WHITE PAPER FROM CLUSTER FILE SYSTEMS, INC. APRIL 2007 Audience Architects of HPC clusters Abstract This paper provides architects of HPC clusters with information


PEER TO PEER FILE SHARING USING NETWORK CODING Ajay Choudhary 1, Nilesh Akhade 2, Aditya Narke 3, Ajit Deshmane 4 Department of Computer Engineering, University of Pune Imperial College of Engineering

Creating a Future Internet Network Architecture with a Programmable Optical Layer

Abstract: The collective transformational research agenda pursued under the FIND program on clean-slate architectural design

Lightweight Service-Based Software Architecture

Mikko Polojärvi and Jukka Riekki Intelligent Systems Group and Infotech Oulu University of Oulu, Oulu, Finland {mikko.polojarvi,jukka.riekki}@ee.oulu.fi

All-Flash Arrays Weren't Built for Dynamic Environments. Here's Why... This whitepaper is based on content originally posted at www.frankdenneman.

WHITE PAPER This whitepaper is based on content originally posted at www.frankdenneman.nl Monolithic shared storage architectures

OpenFlow Based Load Balancing

Hardeep Uppal and Dane Brandon University of Washington CSE561: Networking Project Report Abstract: In today's high-traffic internet, it is often desirable to have multiple

Designing and Implementing a Server Infrastructure MOC 20413

In this 5-day training course you will gain the knowledge required for a physical and logical Windows Server 2012 Active


AN EFFICIENT LOAD BALANCING ALGORITHM FOR A DISTRIBUTED COMPUTER SYSTEM Dr. T.Ravichandran, B.E (ECE), M.E(CSE), Ph.D., MISTE., K.Kungumaraj, M.Sc., B.L.I.S., M.Phil., Research Scholar, Principal, Karpagam University, Hindusthan Institute of Technology, Coimbatore

Service Layer Components for Resource Management in Distributed Applications

Corporate Technology Fabian Stäber Siemens Corporate Technology, Information and Communications Copyright Siemens AG 2007. All

Future Internet Architecture

Test of the PP presentation & Cloud Computing: A Service-Oriented approach. How can the slides be improved? Paul Mueller FIA Panel: Future Internet Architectures Poznan

An Evaluation of Architectures for IMS Based Video Conferencing

Richard Spiers, Neco Ventura University of Cape Town Rondebosch South Africa Abstract The IP Multimedia Subsystem is an architectural framework

What can DDS do for You? Learn how dynamic publish-subscribe messaging can improve the flexibility and scalability of your applications.

Contents: Abstract What does DDS do The Strengths of DDS

Client/server and peer-to-peer models: basic concepts

Dmitri Moltchanov Department of Communications Engineering Tampere University of Technology moltchan@cs.tut.fi September 04, 2013 Slides provided by

An Incrementally Trainable Statistical Approach to Information Extraction Based on Token Classification and Rich Context Models

Dissertation (Ph.D. Thesis) Christian Siefkes Disputationen: 16th February


CHAPTER 6 CROSS LAYER BASED MULTIPATH ROUTING FOR LOAD BALANCING 6.1 INTRODUCTION The technical challenges in WMNs are load balancing, optimal routing, fairness, network auto-configuration and mobility

Wireless Sensor Networks Chapter 3: Network architecture

António Grilo Courtesy: Holger Karl, UPB Goals of this chapter Having looked at the individual nodes in the previous chapter, we look at general


APPLICATION NOTE GLOBAL SERVER LOAD BALANCING WITH SERVERIRON Growing Global Simply by connecting to the Internet, local businesses transform themselves into global ebusiness enterprises that span the

Extending the Internet of Things to IPv6 with Software Defined Networking

Abstract [WHITE PAPER] Pedro Martinez-Julia, Antonio F. Skarmeta {pedromj,skarmeta}@um.es The flexibility and general programmability

diversifeye Application Note

Test Performance of IGMP based Multicast Services with emulated IPTV STBs Shenick Network Systems

3 common cloud challenges eradicated with hybrid cloud

Cloud storage may provide flexibility and capacity-on-demand benefits but it also poses some difficult challenges that have limited its widespread adoption. Consequently,

Optimizing Data Center Networks for Cloud Computing

PRAMAK Data Center networks have evolved over time as the nature of computing changed. They evolved to handle the computing models based on mainframes,

XMPP A Perfect Protocol for the New Era of Volunteer Cloud Computing

International Journal of Computational Engineering Research Vol, 03 Issue, 10 Kamlesh Lakhwani 1, Ruchika Saini 1 1 (Dept. of Computer




Network Architecture and Topology

1. Introduction 2. Fundamentals and design principles 3. Network architecture and topology 4. Network control and signalling 5. Network components 5.1 links 5.2 switches and routers 6. End systems 7. End-to-end

Mobile RFID solutions

A TAKE Solutions White Paper small smart solutions Introduction Mobile RFID enables unique RFID use-cases not possible with fixed readers. Mobile data collection devices such as scanners

Anonymous Communication in Peer-to-Peer Networks for Providing more Privacy and Security

Ehsan Saboori and Shahriar Mohammadi Abstract One of the most important issues in peer-to-peer networks is anonymity.

Chapter 2 Architectures. Layered Architecture

Software architecture logical organization of the software components Architectural styles: layered, object based, event-based, shared data space System architecture the instantiation

System Models for Distributed and Cloud Computing

Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Classification of Distributed Computing Systems

VXLAN: Scaling Data Center Capacity. White Paper

Virtual Extensible LAN (VXLAN) Overview This document provides an overview of how VXLAN works. It also provides criteria to help determine when and where

CiteSeerX in the Cloud

Published in the 2nd USENIX Workshop on Hot Topics in Cloud Computing 2010 Pradeep B. Teregowda Pennsylvania State University C. Lee Giles Pennsylvania State University Bhuvan Urgaonkar

SemCast: Semantic Multicast for Content-based Data Dissemination

Olga Papaemmanouil Brown University Uğur Çetintemel Brown University Wide Area Stream Dissemination Clients Data Sources Applications Network