A Monitoring Tool to Manage the Dynamic Resource Requirements of a Grid Data Sharing Service


A Monitoring Tool to Manage the Dynamic Resource Requirements of a Grid Data Sharing Service

Voichiţa Almăşan (Voichita.Almasan@irisa.fr)
Supervisors: Gabriel Antoniu, Luc Bougé ({Gabriel.Antoniu,Luc.Bouge}@irisa.fr)
IRISA, Projet PARIS, February to June 2006
Research Master in Computer Science Internship Report, IFSIC, Université de Rennes I.

This report corresponds to the research internship of the Research Master in Computer Science of IFSIC (Institut de Formation Supérieure en Informatique et Communication). The internship took place within the PARIS project (Programming parallel and distributed systems for large-scale numerical simulation applications) at IRISA. IRISA (Institut de Recherche en Informatique et Systèmes Aléatoires) is a joint research unit whose partners are INRIA (Institut National de Recherche en Informatique et Automatique), the CNRS (Centre National de Recherche Scientifique), the Université de Rennes 1 and the INSA (Institut National des Sciences Appliquées) of Rennes. The Antenne de Bretagne of the ENS Cachan (École Normale Supérieure) is a partner of IRISA within the PARIS project. This internship's title is "A Monitoring Tool to Manage the Dynamic Resource Requirements of a Grid Data Sharing Service", and it took place under the responsibility of Gabriel Antoniu, researcher at INRIA, Luc Bougé, Professor at the Antenne de Bretagne of the ENS Cachan, and Loïc Cudennec, PhD student. I would like to thank my supervisors, and also Mathieu Jan and Sébastien Monnet, for their help during my internship.

Contents

1 Introduction
2 The Interaction of Distributed Applications with the Grid and the Grid Middleware
  2.1 Managing the Resources of a Computational Grid
  2.2 Distributed Applications
  2.3 The Deployment of Distributed Applications
  2.4 Discussion on the Limitations of Static Application Deployment
3 A Grid Data Sharing Service: JuxMem
  3.1 The Philosophy and Motivation of JuxMem
  3.2 The Existing JuxMem Grid Data Sharing Service
  3.3 Discussion on the Limitations of the Existing JuxMem
4 Contribution: A Monitoring Tool for JuxMem's Dynamic Resource Requirements
  4.1 The General Idea
  4.2 Usage Scenario
    4.2.1 Scenario Description
    4.2.2 The Existing JuxMem's Behaviour
    4.2.3 How to Improve JuxMem's Behaviour
  4.3 Proposed Architecture: Introducing the JuxMem Monitor
    4.3.1 The Communication Interfaces
    4.3.2 The Node Reservation Algorithm
    4.3.3 The Simultaneous Allocation (Co-Allocation) of Nodes on Different Clusters
  4.4 The Additional Deployment Feature: using ADAGE
  4.5 Discussion
5 Architecture and Implementation
  5.1 The Monitoring Tool: A Modular Software Architecture
    5.1.1 The JuxMem Communication Module
    5.1.2 The Coordination Module
    5.1.3 The Infrastructure Interaction Modules
  5.2 Technologies Used by this Implementation
    5.2.1 RPC Communication Technique
    5.2.2 POSIX Threads and Thread Concurrency
    5.2.3 XML Document Managing and Parsing
    5.2.4 Replaceable Plugins
  5.3 Discussion
6 Preliminary Evaluation and Discussion
  6.1 Describing the Experimental Platform
  6.2 Adding New JuxMem Provider Peers Bound to an Existing JuxMem Manager Peer
    6.2.1 Motivation and Objective of this Experiment
    6.2.2 Technical Description of the Experiment
    6.2.3 Practical Results
    6.2.4 Experimental Results
    6.2.5 Discussion
  6.3 Adding Extra Clusters to the Existing JuxMem Topology
    6.3.1 Motivation and Objective of this Experiment
    6.3.2 Technical Description of the Experiment
    6.3.3 Practical Results
    6.3.4 Experimental Results
    6.3.5 Discussion
7 Conclusion
Bibliography

Chapter 1

Introduction

Large-scale distributed systems are currently of very high interest to scientists. The domain has emerged because of the ever-growing need for computational power, and because of the need to efficiently correlate and use the resources of an ever larger number of machines, connected via various networks. This is how the grid was born: a collection of nodes, grouped into clusters which are internally connected via a usually very high-speed network. Because of the way in which it was conceived, the grid does not generate many failures compared to other distributed systems (such as peer-to-peer systems), and so it provides a relatively safe, though still volatile, physical platform. Volatility occurs when some nodes are shut down, for maintenance reasons for example, or when a node is overloaded because it serves too many applications.

The grid is widely used in scientific domains, to efficiently perform heavy, sometimes critical, computations. Nevertheless, the grid can support a larger variety of applications, not just scientific ones. It is a matter of applying the right distributed protocols so that the grid efficiently supports all kinds of distributed applications. To achieve this in an efficient manner, both the grid management tools and the applications running on top of the grid have to be aware of each other's existence, and they should also be able to communicate via a predefined interface. None of these steps has been fully achieved yet, although the scientific community is very active in the field of applications that aim at efficiently exploiting the grid. This should be one of the next important steps in the development of grids, together with the widening of the range of applications and distributed protocols that could benefit from the features of the grid.

The problems of resource management and scheduling have to be considered very carefully. At the very beginning, scheduling in distributed systems was static and could not keep up with the dynamicity of volatile resources [7]. In time, it has become clear that one can never be fully aware of the global state of any distributed system, so probabilistic scheduling was suggested as a solution [7]. However, it is difficult to efficiently co-schedule jobs on distributed resources, especially if we want to take into account the process scheduling performed on each node and the scheduling policy that is used. This becomes even more complicated when the volatility aspect is also considered.

The Problem. When it comes to grids and other distributed systems that provide a volatile infrastructure, the difficult part is to adapt the applications that run on top of the grid middleware to the volatility of the infrastructure nodes. The applications should be able to integrate new grid nodes, and even clusters, on demand. To do this, special mechanisms have to be designed and implemented, which receive requests from the application and interact with the grid's middleware in order to reserve new grid nodes on which to instantiate new parts of the requesting application. The way to achieve such dynamicity for the distributed application is to enable it to interact with the grid resource management system and with a deployment tool that knows how to deploy parts of that specific distributed application.

Data Sharing on Top of Grids: JuxMem. In what follows, we will focus our attention on solving the problem described above for grid data sharing services. Grid data sharing is an important aspect of grid computing, since most of the applications designed for the grid, namely the scientific ones, produce and need to share a lot of data. Peer-to-peer networks are a good candidate for distributed data sharing; in fact, this is what has made them popular. When sharing resources on top of such a network, one cannot neglect the high volatility of each network peer. In order to share data belonging to critical applications, which cannot afford to lose any piece of information, special layers dedicated to data-sharing services should be deployed beforehand. An example of such a service is JuxMem [3]. It enables sensitive data sharing on top of the grid by integrating two special types of peers: the JuxMem Managers and the JuxMem Providers.

A Scenario for JuxMem Usage. As explained in the previous paragraph, JuxMem is a grid data sharing service. This means that it offers an infrastructure for data sharing, to be used by distributed applications which run on top of the grid. These distributed applications can be seen as clients of JuxMem. Consider a situation in which an application needs to allocate space where to store pieces of data it will need later, throughout its execution. When requesting data allocation, JuxMem's clients can pass several parameters to JuxMem, which reflect the degree and modality of replication for that piece of data. In the current version, the clients can specify on how many clusters they want their piece of data to be stored and how many copies should be kept on each of these clusters. Consequently, JuxMem will attempt to reserve as many JuxMem storage units as needed for the size of the data, on the number of nodes specified by the client, for each of the clusters required by the client.

Our contribution is to study the ways JuxMem could interact with grid resource management systems and deployment tools. Chapter 2 will give an overview of grids and grid middleware, of distributed applications and their needs, and of the deployment of distributed applications. Chapter 3 will describe in more detail the JuxMem grid data sharing service and its architecture. Chapter 4 will describe a set of mechanisms that allow JuxMem to interact with grid resource management systems and deployment tools. Chapter 5 will give some details about the implementation of the mechanisms described in Chapter 4. Chapter 6 will show some experimental results and a preliminary evaluation of these mechanisms. Chapter 7 will conclude the report and discuss some future development directions in the field of making distributed applications, such as JuxMem, aware of their underlying infrastructure and able to interact with it dynamically when needed.

Chapter 2

The Interaction of Distributed Applications with the Grid and the Grid Middleware

2.1 Managing the Resources of a Computational Grid

What is a Computational Grid? The term grid was first used in the field of Computer Science in the 1990s, and its aim was to interconnect high-speed computer networks belonging to different universities [10, 20]. The underlying idea is that aggregated computational power and storage can be efficiently used remotely, and also that the power of ordinary, thus cheaper, computers is equivalent to the power of an expensive, more powerful but rarer, super-computer. Grids have been designed to perform complicated scientific computations or other complex transactions that require heavy computations, and which can hardly be achieved with a usual workstation. An analogy with the power grid was made when choosing the name grid for this type of distributed system. The power grid is the distributed resource of electricity, and one can transparently plug electric appliances into the power grid and unplug them. The transparency comes from the fact that one can never know exactly where the electrical power has been produced when using the power grid. Similarly to the power grid, a computational grid is seen as a shared resource, geographically distributed, that enables users to plug their applications into it and unplug them, and thus exploit its resources. Of course, for the efficient use of grids, specialized distributed protocols and applications had to be developed to enable large-scale resource sharing in grid-like systems [12]. Grids are a special kind of distributed system, in which nodes generally have equal computational power and can be managed as a whole, by means of grid resource managers.

In the context of grids, another important matter to consider is the services they can offer and how one can learn what these services are [9]. Since grid nodes can be volatile, because they could be shut down for maintenance or be overloaded serving other applications, services may appear and disappear, and so one has to be very careful when designing an application that is supposed to offer a permanent service on top of the grid [2].

Consequently, it would be best to design a grid layer that can take into account and assure a certain level of QoS, needed by applications sensitive to receiving a constant service from the grid [13].

Why Manage Grid Resources? As in any real-world case, grid resources are limited in time and space. In order to maintain their degree of usability within an acceptable range, grid resources have to be managed. The main resources that a grid has to offer are its nodes. The number of grid nodes determines the grid's granularity. Nodes are an important factor in some cases, since there are applications that perform special services and might wish to reserve more than one node. An example of such an application is a middleware service using data replication for the sake of fault tolerance. It is nodes that provide both the computational power and the data storage support, which are the two main requirements addressed by applications to grids. The existence of multiple nodes within the clusters that make up the grid enables parallelism, required by certain applications. Parallelism can be exploited both at the computational-power level and at the data-storage level, so it is actually a sub-feature of both of them. An important factor, which should not be forgotten, is the network connection between two different nodes. The network connection is characterized by its physical topology, its bandwidth and its communication latency. The communication latency is variable and depends on the chosen pair of nodes. It is generally small between nodes within the same cluster and significantly larger for nodes belonging to different, distant sites. All of these grid characteristics (number of nodes, computational power and data storage, parallelism, network topology, bandwidth and network latency) can be quantified, and their quantification should be used by applications which require a certain quality of the service offered by the grid. There already exist prototype grid managers that support the specification of QoS parameters by applications; one of them is Faucets, described in [13]. Still, there is a long way to go until a usual grid manager will provide client applications with the possibility to specify the needed level of QoS, and will also be able to guarantee it.

A Few Examples of Resource Management Systems for Grids. The management of grid resources is done by grid managers and by cluster managers. Several approaches exist, and at this point we are most interested in how each of these approaches is able to schedule resources. The PBS cluster manager [26] and OAR [25], a more specialized cluster manager based on PBS, are basic tools for submitting jobs on top of a cluster and for obtaining information about the cluster. In the field of grid managers, a well-known architecture is the one proposed by the Globus Toolkit [20]. The services it offers in order to manage the resources of the grid are information services, to obtain information about the physical grid (MDS, Monitoring and Discovery System), and job submission and management services (GRAM, Grid Resource Allocation and Management). Regarding network bandwidth and communication latency, there are measurement and control tools, such as NWS (Network Weather System).
Another grid manager, KOALA [24], is able to allocate resources on multiple clusters at the same time, but it is not QoS-enabled either.

However, there exist grid managers that provide QoS guarantees, like Faucets [13]. As already stated, this QoS-enabled grid manager is for now just in its prototype phase, and it enables the applications that really need a specified level of QoS to obtain this guarantee. Of course, the application must also know how to quantify its needs, in order to request the necessary level of QoS from the grid manager [18]. An application can ask Faucets to guarantee a certain QoS by specifying how many nodes it needs, for how much time, and what computational power those nodes must have. These quality requirements can also be submitted to resource managers such as PBS and OAR, but they cannot really assure and guarantee them the way Faucets does. The most important QoS aspect that Faucets provides is, nevertheless, QoS at the network connection level, which the application can specify as a constraint on the communication latency between two given nodes.

In analogy with resource management on a desktop machine, which is performed by the OS, resource management on the grid is done by observing the state of the whole system and by scheduling. The scheduling happens in a way similar to the one performed by the OS: by partitioning the execution of different applications in time and by allocating resources to applications for a certain period of time. In the context of grids, too, global execution time is a critical factor, among others such as availability. Grid and cluster managers offer services on top of an evolving topology, the grid or the cluster, which can be in constant change due to its nodes. By means of the managers, one can obtain specific information about the status of the physical infrastructure. Examples of such pieces of information are: the list of jobs that have been submitted to the grid, the list of active or waiting jobs, and the state of particular specified nodes. This first function of the resource manager can be viewed as its passive function, since it only observes the grid and does not interact with it. The more interesting, active function of the grid manager consists of the possibility to submit jobs to be scheduled and executed by the grid, to stop already running jobs, to reserve certain nodes for a given time, and to access those reserved nodes.
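To make this active function concrete, here is a minimal sketch of how a program could reserve nodes through a cluster manager such as OAR, by invoking its oarsub command; the exact oarsub options and the job script name are assumptions that depend on the OAR version and the local setup, not prescriptions.

    #include <stdio.h>

    /* Minimal sketch: ask OAR for two nodes for one hour by running oarsub,
     * then echo its output, which contains the identifier of the reservation.
     * The command-line syntax shown here is an assumption and may vary
     * across OAR versions. */
    int main(void)
    {
        FILE *out = popen("oarsub -l nodes=2,walltime=1:00:00 ./job.sh", "r");
        char line[256];

        if (out == NULL)
            return 1;
        while (fgets(line, sizeof line, out) != NULL)
            fputs(line, stdout);        /* e.g. the OAR job identifier */
        return pclose(out);
    }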
Volatility of Nodes. Volatility happens when, for various reasons, a node is no longer able to provide a service in the system. In this case, the services it offers can no longer be qualified as sure, since its availability is not certain. It might be the duty of the grid resource scheduler to make this aspect of uncertain services transparent to the client application. This raises many additional problems in the context of grids, related to scheduling, to data replication, to the migration of jobs and to the availability of services. Volatility is the main cause for the appearance of fault-tolerance techniques and of techniques that aim at ensuring the availability of services. In the case of volatile nodes, there are two aspects: the first is that at any moment nodes can leave the network; the other is that nodes can join the network at any time, which is the positive side of volatility. Physical volatility happens when a node connects to or disconnects from the network, or when it suddenly shuts down. This kind of volatility has hardware causes. The node might fail without prior warning, and this is where the fault-tolerance mechanism is invoked.

For the case of a sudden reconnection of a node temporarily down, there must exist a join protocol, known both by the new node and by the resource manager. The join protocol might be a simple announcement made by the node to the manager, saying that the node is up and running again.

Logical volatility is due to software reasons, and it somewhat resembles physical volatility. The general cause of logical volatility is the fact that, at a certain moment, a node can be overloaded by the jobs it is already running and, in order to fulfill their requirements, it knows it cannot accept any other jobs. In the logical volatility case, a node generally knows that it will no longer be available, so it can announce this, or refuse the jobs that the resource manager wants to submit to it, during a certain amount of time. After becoming available again, the node can announce its availability to the resource manager and simply accept the new jobs submitted to it.

At this point, it is useful to mention that a grid environment is not very dynamic compared, for instance, to large peer-to-peer systems. Hence, the problem of volatile nodes can be managed more easily in grids than in most other distributed systems.

Up to this point, we have examined the grid infrastructure and the middleware that runs on top of the grid. We now focus our attention on the applications that might run on top of this infrastructure and middleware, namely distributed applications. We do this in order to eventually evaluate what the grid can offer with respect to the needs of distributed applications.

2.2 Distributed Applications

Distributed applications are basically applications that run on more than one machine. This is why distributed applications are composed of multiple parts, with at least one separate part running on each separate machine. Usually, the parts composing a distributed application collaborate and communicate in order to achieve their common goal and act as a whole.

What Kinds of Distributed Applications Exist? A very important aspect concerning distributed applications is the resources they need and the ways of ensuring that these resources are available at the time the distributed application needs them. If we classify distributed applications according to their resource needs, there exist static distributed applications and dynamic distributed applications.

Static distributed applications are applications that use the same resources throughout their execution: their resource needs do not change throughout their whole execution cycle. This is why, from the very beginning of the application's execution, one is aware of its resource needs, so that one can and should statically reserve them before launching the application.

The reservation duration should be larger than the estimated worst-case execution time of the static distributed application.

Dynamic distributed applications, in opposition to static ones, do not need the same quantity of resources at every moment of their execution cycle. Their resource needs evolve in time, usually in a way that is not predictable at the beginning of the application's execution. These evolving needs are a result of the interaction with the application's environment, that is, usually with other applications. As a consequence, a dynamic distributed application must either reserve from the very beginning the maximum amount of resources it might need, or have some means to interact with its underlying infrastructure, dynamically request new resources the moment it needs them, and release them afterwards.

What Are the Needs of Distributed Applications? The set of needs of a distributed application is a superset of the needs of a generic, common application. Like every type of application, distributed applications need resources, execution time and scheduling.

Resources are the underlying support of each application, and are basically composed of the hardware infrastructure on which the application runs. They consist of volatile or non-volatile memory, processor power and network latency. The basic characteristic of resources is that they are always limited and have to be managed.

Execution time is another important factor that determines the application's characteristics and performance. It is managed by the operating system, and it refers to the time an application gets to be served by a certain resource, most usually by the processor, but also by certain peripherals. If it had no execution time allocated to it, any application would remain pending forever, with no output or final result.

Scheduling is the concept that relates resources and execution time, by allocating a required resource to an application for a certain fixed time interval. It relates to only one of the many parts of the distributed application: that part of the application is allowed to continue its execution and use the needed resource for a certain execution time, decided by the scheduler. In order to get good scheduling results, one should use appropriate scheduling policies within the system.

Besides these basic application needs, distributed applications have their specific, more sophisticated needs, which derive from the generic application needs. Two very important things required by most distributed applications are co-scheduling and communication between the different distributed parts of the given application.

Co-scheduling derives from scheduling and consists of the need to coordinate all the scheduling performed for each separate part of the distributed application. The purpose of this coordination is to make the separate parts function as a whole. There are also optimisation reasons that support the co-scheduling necessity when it comes to distributed applications.

Communication is a specific need of applications composed of multiple parts, and so it is also a need of distributed applications. It is used when different parts of the application need to make other parts aware of their status or of their needs. Communication is also useful when one part of the distributed application needs to request some service from another part of the same application. For efficient and reliable communication, communication protocols must be designed that best fit the application's communication needs as well as the characteristics of the underlying communication infrastructure.

2.3 The Deployment of Distributed Applications

Up to this point, we have only analysed the possible underlying physical or logical infrastructure for distributed applications, namely the grid. It is time to stress the fact that these infrastructures have been designed so that distributed applications can run on top of them. A preliminary step in launching a distributed application on a given infrastructure is to describe the application and its requirements. They can consist of the required computational power and of storage support for the data that the application needs to handle. Of course, there is a strong connection between these application requirements and the physical resources that the underlying infrastructure has to offer, that is, the nodes that will be allocated to the application and the network connections between them. The process of actually mapping the application onto the infrastructure is called deploying the application on that infrastructure. The current approach to deploying distributed applications is to design and use static deployment tools, which do not further interact with the application after having deployed it. The static aspect consists of the fact that the nodes on top of which the application is deployed are selected statically, during the deployment phase. This is why, if the application needs some nodes only at a certain point of its execution, it has to actually be deployed and launched on top of them from the early deployment phase. The disadvantage of static deployment is obvious in this scenario.

Why Do Distributed Applications Need Deployment? Distributed applications need to be deployed because this is their nature: they cannot run on a single machine. But manual deployment of distributed applications is somewhat difficult. It requires, for example, manually launching remote shell consoles on different nodes in order to start applications, and this may take a lot of time. This is why deployment tools have emerged to assist deployment. We only need to provide the chosen deployer with the name of our application, followed either by the explicit list of nodes and their description, or simply by the number of nodes on which we wish to deploy the application. As illustrated in Figure 2.1, the ultimate objective is to get a fair balance between 3 different aspects: the application, the resources and the deployment tool. Efficiency can only be obtained if all of these 3 aspects match. The application can exhibit a certain degree of dynamicity: the set of required resources may remain fixed, which is the case for the origin point of the application axis, or may evolve dynamically in time. Similarly, the resources provided by the infrastructure might remain the same or could vary; that is, they can be non-volatile or volatile resources.

Figure 2.1: Deploying applications: a 3D view. The application axis goes from static to dynamic applications, the resources axis from non-volatile to volatile resources, and the deployment axis from specific deployment (e.g. JDF) to generic deployment (e.g. ADAGE); the ideal deployment tool lies away from all three origins.

Non-volatile resources are placed at the origin of the resources axis. Finally, the deployment tools can be specifically tailored to an application, for tools placed at the origin of the deployment axis, or they can handle a large class of distributed applications. These 3 aspects can be visualized through a 3D representation, as seen in Figure 2.1. The origin of the axes stands for static applications, non-volatile resources and specific deployment, with respect to all 3 axes. A couple of known deployment tools, JDF and ADAGE, can then be located within this framework. Neither of them fulfils the objective stated above: the place of the ideal deployment tool is figured by the disk at the center of the figure, and this place should be in the foreground plane, not in the background plane.

How Can One Deploy a Distributed Application?

Specific Application Deployment. As its name states, a specific application deployment tool only knows how to deploy a specific type of distributed application, namely the one it has been designed for. It does this by instantiating each component of the application to be deployed, or by starting each specified service contained within the application on a node. The set of nodes corresponding to a distributed application must be specified a priori, by the programmer, in the configuration file of the tool. After doing all this, the deployer no longer supervises the application, since its only job was to deploy it. An example of a specialized tool, which only knows how to deploy peer-to-peer overlays written using JXTA, is JDF [22]. JDF has been designed specifically for JXTA peer-to-peer technology.

In such network infrastructures, nodes are expected to be highly volatile. As a consequence, there is no easy means to guarantee whether and when a node will be available. This is why, from some point of view, it is good that the JDF tool does not try to wait for a node specified by the programmer to become available. Also, from the volatility point of view, it is useful to have human input when determining the map of nodes where the application should be deployed: one could never select at random the nodes required by the JXTA application and succeed in finding available nodes. JDF does its deployment by using a file which contains the static description of what part of the application should be started on each of the selected nodes. Also, before deploying an application in this way, the network nodes to be used have to be checked for availability and then reserved, all this by means external to the deployment tool. If a node specified in the JDF configuration file is not present at deployment time, then the deployment of the entire application will fail.

Generic Application Deployment. When deploying generic applications, the tool should be able to provide a technology-independent interface to the underlying infrastructure, and multiple technology-specific interfaces, so that any application's specific requirements can be understood. There is an important gain in favour of tools deploying generic applications, since we do not have to know what kind of applications we will deploy before selecting a tool. The only problem with generic application deployers could be that we might not take the optimal decision as far as the specific current application is concerned, because we cannot know exactly what sort of application we are currently deploying. On the other hand, this is not such a big problem, since deployment happens just once. After the application has been fully deployed and is running, it will perform its specific duties in the same way, even if it has been deployed with a generic deployment tool rather than a specific one. This means that the performance of applications is not altered, and this is what actually matters. An implementation of such a generic application deployer is ADAGE (Automatic Deployment of Applications in a Grid Environment), described in [16] and further refined in [14]. ADAGE is based on the specific application deployer described in [15]. ADAGE has been designed as a tool for deploying applications on computational grids. It has a robust, very intuitive architecture, structured into layers, as can be seen in Figure 2.2.

Figure 2.2: The layered structure and the plugins of ADAGE. CCM, MPI and JXTA application descriptions are converted into a generic application description, which is passed to the deployment planner to produce a deployment plan and its execution.

It manages to deploy different kinds of applications by converting them to a generic format, which is then submitted to the resource managers for scheduling and execution. It also has an internal planner that determines which of the applications to be deployed is to be served next. As an interface with the application that wishes to be deployed, it has several converters from application-specific formats to the generic format. The generic application format is the internal language that ADAGE uses to uniformly submit applications to the resource manager. The most intelligent and innovative side of this deployer is its converters from application-specific constraints to generic application constraints, which are then communicated to the resource manager. We can see that even if it is a generic application deployer, ADAGE must have previous knowledge of the language and purpose the application was written for, as it must contain the converter that knows how to handle that given application. This is not an important impediment, primarily because there are not so many languages and manners of writing distributed applications, and secondly because ADAGE could easily adapt to support a new kind of technology which might arise, since it already supports deploying multiple kinds of applications. ADAGE must simply integrate the converter before being asked to convert that specific kind of application.

2.4 Discussion on the Limitations of Static Application Deployment

Static application deployment has its drawbacks. Tools from this category will fail to deploy distributed applications which do not have their requirements fulfilled at the moment of deployment. It is a serious drawback, because it does not allow the application to wait for the specified unavailable resource to return, or to have alternatives to the resources it selects. Not having the resource requirements fulfilled is very likely, since we are dealing with volatile nodes. Another problem is the fact that we specify the exact nodes in our deployment map. Consequently, there is no chance to deploy the application if any one of the nodes from the input node list is not present, even if there exist several perfect candidates to replace it within the network. If there existed a manager of the distributed network, as in the case of grids and grid managers, it would be a centralized instance that supervises the entire activity of the network. As a global supervisor of the grid, the manager would be able to announce to the deployment tool that there could be a candidate to replace a missing node. Of course, a centralized manager is not a common feature of all distributed systems, but where it exists, it would be very useful for the deployer to know how to interact with it. A possible solution would be to first launch the distributed application on a minimal number of nodes, maybe even on just one node. The application should be designed so as to be able to auto-extend when it needs to, and to request and integrate new resources from the grid on demand. To achieve this, a special mechanism should be designed and integrated within the distributed application; a proposal for such a mechanism is shown in the next sections. The specific mechanism that will be presented has been integrated within the JuxMem grid data sharing service. To better understand JuxMem's particular needs, the following chapter describes its philosophy and architecture.

Chapter 3

A Grid Data Sharing Service: JuxMem

In the previous chapter we have seen the problem of grids and of distributed applications that need to interact with the grid middleware throughout their execution. In this chapter we shall focus on JuxMem, a grid data sharing service exhibiting all the adaptability needs of a distributed application, as described in the previous chapter.

3.1 The Philosophy and Motivation of JuxMem

JuxMem has appeared because of the need for a data sharing service for the grid. In the grid computing context, the target architecture is a federation of clusters, while the targeted applications are scientific applications, such as distributed numerical simulations or other scientific computations. The problem that all the distributed applications mentioned above must face is that most of them generate huge amounts of data that have to be stored. The approaches before JuxMem used to store the data produced by these distributed applications in such a manner that the distributed application had to explicitly find the location where the data would be stored. This was done by using services such as GridFTP. The problem with this explicit data management approach is that it does not scale well, as it rapidly becomes very complex as the number of nodes grows. Another problem with the explicit data management approach is that it does not address at all the problem of consistency for possibly replicated data; it is thus again the distributed application's job to take care of this aspect. It becomes clear that in order to handle the data storage and data sharing problem correctly, one needs a dedicated distributed storage system that guarantees the consistency needs of the application, as well as fault tolerance and failure recovery in case data is lost. This data storage system should also allow data to be stored, read and modified by its client applications. One of the well-known systems that matches the description of the data sharing service above is the DSM (Distributed Shared Memory). DSMs are distributed systems characterised by high control over the resources they manage. These resources are homogeneous in most cases and are not very likely to be dynamic. DSM is a classical approach for sharing mutable data in a small-scale distributed environment.

DSMs address the needs of transparent access to and localisation of data, so the client distributed application no longer has the explicit data management problem described above. Another feature of DSMs is that they are able to guarantee the consistency of the various replicas of stored data, by using consistency models and protocols. The disadvantage of DSMs is that they make the hypothesis of a static infrastructure, which can almost never be met in practice; it is just a theoretical model. This is why DSMs do not address the problem of fault tolerance, and it is also one of the main reasons why DSMs do not scale well. The opposite of DSM systems, from the dynamicity and scalability point of view, are P2P (peer-to-peer) systems. These distributed systems can get very large because of the robust and simple protocols they are based on, and can have a very heterogeneous set of interconnected nodes as their infrastructure. The nature of peer-to-peer protocols allows them to handle failures very well, since in peer-to-peer systems failure is not an exceptional event but a regular one. All of these are reasons why there exist many data storage systems based on peer-to-peer overlay networks; however, none of them allows the stored data to be modified.

In this context, the challenge addressed by JuxMem is that of combining the positive aspects of both DSM systems and peer-to-peer data storage systems in order to achieve a better data sharing service for the grid. The grid is an infrastructure whose volatility is not null, but it can still be kept within certain acceptable, controlled ranges. This kind of infrastructure lies between the completely static infrastructure of DSM systems and the highly dynamic infrastructure of peer-to-peer systems. Naturally, one can suppose that with such an infrastructure, the positive aspects of DSM systems and peer-to-peer data storage systems can be combined, while minimising the negative aspects of both of these distributed storage systems. Indeed, this attempt to combine DSM systems and peer-to-peer systems has materialised in the JuxMem grid data sharing service. JuxMem is based on a peer-to-peer network, implemented using the JXTA library, which has been adapted to support both fault tolerance and data consistency. Unlike common peer-to-peer data storage systems, JuxMem allows the data it stores to be modified by the client applications. JuxMem also ensures the transparent localisation of stored data and its automatic redistribution, if needed.

3.2 The Existing JuxMem Grid Data Sharing Service

As explained earlier, JuxMem runs on top of a federation of clusters. For each cluster, there is a peer that coordinates the activity and is responsible for the other peers belonging to that cluster. In JuxMem terminology, the coordinating peer is called the JuxMem Manager peer, while the coordinated peers are called JuxMem Provider peers. The JuxMem Managers and JuxMem Providers form the data storage and sharing infrastructure. There also exists another type of peer in the JuxMem world: the JuxMem Client peer. All of these 3 functionalities (manager, provider, client) are defined within the JuxMem API, and so each JuxMem peer could exhibit any of them. In particular, a JuxMem peer can be a manager, a provider and a client during the same lifecycle. Nevertheless, at a given moment in the execution of JuxMem, each peer can only have one of the 3 functionalities.

For the sake of clarity, in the sequel we will suppose that each peer has only one of the 3 functionalities described, throughout its lifecycle.

JuxMem Manager Peers. JuxMem Manager peers are the coordinating peers at cluster level. They are rendez-vous peers from the JXTA point of view, which basically means that other peers can connect to them in order to be able to communicate. They keep a cache of the other JuxMem peers connected to them, in particular of JuxMem Provider peers. It is the JuxMem Manager peer that receives data storage requests from JuxMem Client peers, together with the degree of replication for the data. Consequently, the Manager searches the available Provider peers that could store the required piece of data and eventually allocates them to store it. If the JuxMem topology integrates multiple clusters, there is a JuxMem Manager running on each cluster. Via the rendez-vous peer advertisement, which is specific to the JXTA world, all the JuxMem Manager peers discover each other after a certain stabilisation interval. After this, the Managers can communicate with the other Managers they know, essentially for the purpose of storing data on multiple clusters when the Client specifies that this is needed.

JuxMem Provider Peers. JuxMem Provider peers are the very center of the JuxMem world, since they are the ones that offer the physical storage. Each Provider is connected to the JuxMem Manager responsible for its cluster and, from the JXTA point of view, Providers are edge peers, as opposed to the Managers, which are rendez-vous peers. At startup, a Provider is initialised with the maximum number of data blocks it can store, and it advertises this via JXTA advertisements. Each time a new piece of data is stored on the Provider, the advertised number of available blocks is decreased. For each stored piece of data, two groups are created at allocation time: the Local Data Group (LDG) and the Global Data Group (GDG). The LDG is composed of the JuxMem Providers that hold a copy of the data, at a given cluster's level. This means that for each piece of data there is an LDG corresponding to each cluster on which that piece of data is stored. GDGs are similar to LDGs, but they group the JuxMem Providers at the entire grid's level. Each LDG and GDG has a group leader, which coordinates the activity and the self-organisation of the group. By grouping into Local Data Groups, at cluster level, and Global Data Groups, at grid level, the JuxMem Providers manage to make the piece of data they have in common persistent over time, helping it survive possible peer failures. Since data is replicated within its global and local data groups, mechanisms exist to ensure data consistency in agreement with a given consistency model.

JuxMem Client Peers. JuxMem Client peers are the JuxMem peers that know how to use the JuxMem data storage service, in the sense that they implement the part of the JuxMem API that requests the allocation of new pieces of data from JuxMem Managers. The parameters of a JuxMem allocation request are the size of the data to be stored, the number of clusters it should be stored on, and the number of peers it should be stored on within each cluster. The number of clusters the data should be stored on and the number of nodes it should be stored on within each cluster represent the hierarchical replication degree in JuxMem. The Client can additionally specify the consistency model to be used for the replicas of the given data. It is important to mention that the JuxMem Client has no idea about the physical location of the peers allocated to store a piece of data, but it has the guarantee that its data will be consistent and will survive failures, which is all that matters from the higher-level application's point of view.
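To fix ideas, the following sketch illustrates such an allocation request. The function name juxmem_alloc, its signature and the identifier format are invented for this illustration; they mirror the parameters described above but do not claim to reproduce the actual JuxMem API.

    #include <stdio.h>
    #include <stddef.h>

    enum { JUXMEM_ID_MAX = 128 };

    /* Hypothetical stand-in for the allocation primitive: data size, number
     * of clusters (GDG width) and replicas per cluster (LDG size); returns 0
     * on success and fills in the unique identifier of the stored data.
     * Stubbed here only so the sketch is self-contained. */
    int juxmem_alloc(size_t size, int nb_clusters, int nb_replicas_per_cluster,
                     char id_out[JUXMEM_ID_MAX])
    {
        (void)size; (void)nb_clusters; (void)nb_replicas_per_cluster;
        snprintf(id_out, JUXMEM_ID_MAX, "juxmem://example-id");
        return 0;
    }

    int main(void)
    {
        char data_id[JUXMEM_ID_MAX];

        /* Store 1 MB on 2 clusters with 3 copies per cluster: JuxMem would
         * build one GDG spanning two LDGs of 3 Providers each. */
        if (juxmem_alloc(1024 * 1024, 2, 3, data_id) == 0)
            printf("data stored under identifier %s\n", data_id);
        return 0;
    }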

3.3 Discussion on the Limitations of the Existing JuxMem

So far we have only discussed the static topology and functionalities that JuxMem can offer. We have not talked about how this topology can evolve in time, nor about the mechanisms that would extend the JuxMem topology when the client applications that use JuxMem to store and share their data need it. The reason is that these extension mechanisms did not exist before this internship started. Up to now, JuxMem has been used as a static grid data sharing service. The first step was to describe a static JuxMem topology, considered to be sufficient for the JuxMem Clients' needs regarding data size. The second step was to pass this topology description to a deployment tool, which would eventually deploy it on the grid. The last step was to let the JuxMem platform stabilise and then launch the client applications that would use this platform to store and share their data. The problem with this approach is that one can never estimate the client applications' needs correctly. Should a client need more storage space than the whole deployed JuxMem grid data sharing service has to offer, its request will be rejected. On the other hand, even if one were able to correctly estimate the maximum needs of all the JuxMem client applications, the usage of the reserved resources on which the JuxMem topology would have been deployed would be inefficient. The inefficiency derives from the fact that if the platform is used only at a low rate for a long time, that is, if only a small part of the JuxMem Providers and even of the JuxMem Managers are used, the resources they occupy are blocked for no reason during that same time. Thus, such a usage is neither optimal nor rational, and a more adaptive version of JuxMem should be designed, with the ability to easily integrate new resources and new peers the moment they are needed.

Chapter 4

Contribution: A Monitoring Tool for JuxMem's Dynamic Resource Requirements

4.1 The General Idea

The idea of a tool to help JuxMem extend and integrate new resources has arisen from the need of JuxMem to adapt to the needs of its possible client applications and to integrate new storage space on demand. In the case of JuxMem, storage space means grid nodes and clusters, and these are resources managed by the grid resource management system. Hence, the only way for JuxMem to get its extra storage space from the grid is to be able to request those resources from the grid resource management system. This interaction can be achieved by designing and integrating an intermediate layer between JuxMem and the grid middleware, namely the grid resource management system. By designing this intermediate layer, we take into account the fact that JuxMem is a special kind of dynamic distributed application, as described in Chapter 2, whose resource needs evolve in time. These resource needs are addressed by the newly designed component, which will make JuxMem more flexible and more efficient when it comes to resource usage.

4.2 Usage Scenario

In this section we describe how to use JuxMem to store a piece of data, and what outcome one gets from the existing JuxMem and from the improved JuxMem, that is, the JuxMem that integrates the tool which helps it extend on demand.

Figure 4.1: A simple extension of the JuxMem topology (a JuxMem Client sends a data allocation request to the JuxMem Manager of a 1-cluster topology, which uses a Provider connected to it via a rendez-vous peer connection).

4.2.1 Scenario Description

A simple JuxMem usage scenario involves a JuxMem Manager peer, a JuxMem Provider peer connected to the given Manager, and a JuxMem Client peer. The JuxMem Manager peer and the JuxMem Provider peer are the only components of a predeployed JuxMem infrastructure. The three requests, sketched in code after this list, are the following.

1. We suppose that the first request of the JuxMem Client peer is to store a piece of data on a single cluster, with a degree of replication of 1 on that cluster. This does not make much sense from the fault-tolerance point of view; we choose it for the sake of simplicity. One could easily extrapolate this example to include more JuxMem Providers and require a higher degree of replication within a cluster, and maybe more clusters.

2. Subsequently, the JuxMem Client makes another request for another piece of data, after having received the answer to its first allocation request. In this second request it requires a degree of replication of 2 on the involved cluster.

3. The third, and last, request the JuxMem Client makes is for a piece of data with a degree of replication of 1 at cluster level, which should be stored on 2 different clusters.

We suppose that the JuxMem Provider peer can store a number of data blocks greater than the sum of the sizes of the first, second and third pieces of data that the JuxMem Client wishes to allocate.
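Expressed with the hypothetical juxmem_alloc call sketched in Chapter 3, the three requests of this scenario would look as follows; the data sizes are arbitrary and, like the call itself, purely illustrative.

    /* The three scenario requests, using the invented juxmem_alloc sketch:
     * (number of clusters, replicas per cluster) = (1,1), (1,2), (2,1). */
    static void run_scenario(void)
    {
        char id1[JUXMEM_ID_MAX], id2[JUXMEM_ID_MAX], id3[JUXMEM_ID_MAX];

        juxmem_alloc(4096, 1, 1, id1);  /* request 1: one copy on one cluster */
        juxmem_alloc(4096, 1, 2, id2);  /* request 2: two copies on the same cluster */
        juxmem_alloc(4096, 2, 1, id3);  /* request 3: one copy on each of two clusters */
    }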

4.2.2 The Existing JuxMem's Behaviour

In this section, we examine in more detail what happens when one tries to execute the scenario described earlier with the existing JuxMem. This behaviour is also illustrated in Figure 4.1.

1. We start with the first request. The Client makes the request to the JuxMem Manager it knows, by using the JuxMem allocation API. The Manager receives the request, finds the one required JuxMem Provider that is connected to it and is able to store the data, and allocates the required storage space on that JuxMem Provider. After doing all this, it returns a success answer to the JuxMem Client and also provides it with the data's unique identifier within the JuxMem world.

2. In the case of the second request, the Client makes the allocation request to the same JuxMem Manager, in a manner similar to the first case. The Manager receives the request and looks for the required 2 Providers in the cache of Providers that are connected to it. It only finds one Provider, and it cannot launch a new one. The reason it cannot launch a new Provider is that it does not know how to do this correctly and also, more importantly, it does not know how to allocate the grid resources required for that Provider. Finally, it returns failure to the Client, and it is not able to store its data with a degree of replication of 2 on a single cluster.

3. In the case of the third request, the Client makes the allocation request to the same JuxMem Manager, in a manner similar to the first two cases. The Manager receives the request and tries to store the data on the required 2 clusters. It looks on its cluster and finds one Provider that could store the data. But then, when it tries to find a second cluster, it fails to find another JuxMem Manager within the JuxMem topology, which implicitly means that there is no other cluster contained in the topology. The Manager does not try to integrate another cluster in the topology: it does not know how to do this correctly and, more importantly, it does not know how to allocate the grid resources required by the new JuxMem Manager and the new JuxMem Provider that should be launched on a new cluster. Finally, it returns failure to the Client, and it is not able to store its data with a degree of replication of 1 on 2 clusters.

4.2.3 How to Improve JuxMem's Behaviour

In this section, we examine in more detail what happens when one tries to execute the scenario described earlier with the improved, dynamic JuxMem.

1. We start with the first request. The Client makes the request to the JuxMem Manager it knows, by using the JuxMem allocation API. The Manager receives the request, finds the one required JuxMem Provider that is connected to it and is able to store the data, and allocates the required storage space on that JuxMem Provider. After doing all this, it returns a success answer to the JuxMem Client and also provides it with the data's unique identifier within the JuxMem world. All this behaviour is shown in Figure 4.1. So far, nothing is different from the previous case. We now examine the next two situations, to see whether the same failures occur as in the case of JuxMem with no improvement.

2. In the case of the second request, the Client makes the allocation request to the same JuxMem Manager, in a manner similar to the first case.

Figure 4.2: A JuxMem extension by adding one Provider (the JuxMem Client's data allocation request leads the Manager of the 1-cluster topology to use a second Provider, connected via rendez-vous peer connections).

The Manager receives the request and looks for the required 2 Providers in the cache of Providers that are connected to it. It is only able to find one Provider, so it tries to dynamically extend the JuxMem topology in order to integrate a new JuxMem Provider on the same cluster. The JuxMem Manager does this by making a request to the intermediate layer, which we call the Monitoring Tool, to add a new JuxMem Provider within the same cluster. The Monitoring Tool does its job and interacts with the grid resource management system to allocate resources for a new JuxMem Provider. After that, the Monitor deploys a JuxMem Provider on the allocated grid resources. It also provides this JuxMem Provider with the information it needs to discover the JuxMem Manager that has launched it. When the JuxMem Manager is notified that the Monitoring Tool has finished extending the infrastructure, it returns a success answer to the JuxMem Client and also provides it with the data's unique identifier within the JuxMem world. All this behaviour is shown in Figure 4.2.

3. In the case of the third request, the Client makes the allocation request to the same JuxMem Manager, in a manner similar to the first two cases. The Manager receives the request and tries to store the data on the required 2 clusters, as also presented in Figure 4.3. It looks on its cluster and finds one Provider that could store the data. But when it tries to find a second cluster, it fails to find another JuxMem Manager within the JuxMem topology, which implicitly means that there is no other cluster contained in the topology. In this case, it tries to dynamically extend the JuxMem topology in order to integrate a new cluster. The JuxMem Manager does this by making a request to the intermediate layer, the Monitoring Tool, to add a new JuxMem Manager and a JuxMem Provider connected to it, both situated on another cluster. The Monitoring Tool does its job and interacts with the grid resource management system to allocate resources for a new JuxMem Manager and a new JuxMem Provider on a cluster different from the one on which the requesting JuxMem Manager is launched. After that, the Monitor deploys a JuxMem Manager and a JuxMem Provider connected to it on the allocated grid resources. It also provides the new JuxMem Manager with the information it needs to discover the JuxMem Manager that has launched it.

Figure 4.3: A JuxMem extension by adding a new cluster with one manager and one Provider.

One can notice that in the case of the improved, dynamically adapting JuxMem, there is a much smaller chance of failure than in the case of the currently existing JuxMem. In fact, if the grid has enough resources to offer, there will never be a failure because of insufficient resources in the improved case, while in the basic case the number of JuxMem allocation failures depends on the configuration of the original topology and on the number of requests for storage.

4.3 Proposed Architecture: Introducing the JuxMem Monitor

From the scenario described above, the improvement of JuxMem's functionality can easily be noticed. It is obtained thanks to a new layer, added between the JuxMem peer-to-peer overlay and the grid middleware. This new layer can also be seen in Figure 4.4. In what follows, this new layer will be called the JuxMem Monitoring Tool.

4.3.1 The Communication Interfaces

The Monitor has two communication interfaces with its two adjacent logical layers: JuxMem and the infrastructure.

Figure 4.4: Overview of the position of JuxMem and the Monitor, with respect to the application and the infrastructure.

Figure 4.5: Generic overview of the communication interfaces and the layered structure of the Monitor.

The Monitor's upper communication interface is used to exchange request-reply messages with the JuxMem peer-to-peer network, namely with the JuxMem Manager peers. This interface is called the JuxMem communication interface. The Monitor's lower communication interfaces are the one it uses to request services from the grid middleware and the one used to request services from the deployment tool. These lower-level communication interfaces of the Monitor are grouped under the name of the infrastructure communication interfaces. These interfaces and layers of the Monitor can be seen in Figure 4.5.

The JuxMem Communication Interface

The communication between the JuxMem Manager and the Monitor uses an additional deployment request and an additional deployment reply primitive. The two entities communicate in a client-server-like dialogue. The server side is integrated within the Monitor, while the client side is integrated in the JuxMem Manager.

The Additionally Deploy Request

    AdditionallyDeploy(boolean global_policy,
                       int nb_extra_clusters,
                       int persistence_nb_hours,
                       struct depl_map *deployment_map);

    struct depl_map {
        char *cluster_name;
        int nb_providers;
        int uptime_providers_and_managers;
        int local_policy;
        struct reservation *OAR_reservation;
        void *specific_application_params;
    };

The additionally deploy request is performed by the JuxMem Manager, which requests the integration of new JuxMem Providers per cluster and possibly of new clusters into the existing JuxMem topology. If the number of new clusters is 0, then the Monitor will just have to reserve grid resources on the requesting Manager's cluster and deploy the requested number of JuxMem Providers. On the other hand, if the number of new clusters is not 0, the Monitor will have to reserve grid resources on the requested number of new clusters and deploy on each new cluster a new JuxMem Manager and the requested number of JuxMem Providers.

In addition to the number of new JuxMem Providers and the number of new clusters, the JuxMem Manager can also specify a resource reservation global policy to be used by the Monitor when integrating extra clusters into the JuxMem topology. The resource reservation global policy parameter can take two values: mandatory reservation and best-effort reservation. In the case of global mandatory resource reservation, the constraint on the requested number of new clusters has to be respected. This means that if the Monitor cannot reserve nodes on a number of new clusters equal to the requested one, it will have to return a failure message to the requesting JuxMem Manager.
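To make the calling convention concrete, here is a minimal sketch of how a Manager might fill in such a request. The policy constants and the convention that a NULL cluster_name means "any free cluster" are assumptions made for illustration; the report does not specify these encodings.

    /* Hypothetical encodings of the two policies; the report only names
     * them, it does not fix their values. */
    #define POLICY_MANDATORY   0
    #define POLICY_BEST_EFFORT 1

    /* Request one extra cluster carrying two Providers, kept alive for
     * 24 hours. */
    struct depl_map map;
    map.cluster_name = NULL;               /* assumed: any free cluster */
    map.nb_providers = 2;
    map.uptime_providers_and_managers = 24;
    map.local_policy = POLICY_BEST_EFFORT;
    map.OAR_reservation = NULL;            /* filled in by the Monitor */
    map.specific_application_params = NULL;

    AdditionallyDeploy(POLICY_MANDATORY,   /* global policy */
                       1,                  /* nb_extra_clusters */
                       24,                 /* persistence_nb_hours */
                       &map);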

In the case of global best-effort resource reservation, the Monitor has to reserve nodes on as many clusters as it can, in the range from 1 to the requested number of clusters. In the best-effort case, only if the Monitor cannot reserve nodes on any cluster will it return a failure message to the requesting JuxMem Manager.

Besides the resource reservation global policy, which operates at the grid level, the JuxMem Manager can further refine its resource reservation policy, since the communication interface also allows it to specify separate resource reservation policies for each new cluster that should be integrated into the topology. These cluster-level reservation policies are called resource reservation local policies. Like the global policy, the local policies may have two values: mandatory and best-effort. The semantics are similar to those of the values of the global policy. If local mandatory is specified, resources for the exact number of new Providers have to be reserved; otherwise the Monitor will return 0 resources (nodes) reserved for that cluster, and it depends on the global policy whether this translates into a failure message to the JuxMem Manager or not. If local best-effort is specified, the Monitor has to reserve as many resources as possible on that cluster, in the range from 1 to the requested number of Providers per cluster.

The global policy and the local policies are used in conjunction, in order to establish an accepted minimum and a desired maximum for the number of resources to be reserved. In particular, if all of these policies have the value mandatory, all of the requested new clusters and new JuxMem Providers have to be integrated in the JuxMem topology. Optional parameters the Manager can specify when it requests the extension of the JuxMem topology are the exact names or identifiers of the new clusters it wishes to integrate. Last but not least, the JuxMem Manager has to specify to the Monitor the maximum number of data blocks that each of the newly integrated JuxMem Providers would have to be able to store.

The Additionally Deploy Reply

    AdditionallyDeployReply(int effective_nb_extra_clusters,
                            struct depl_map *effective_deployment_map);

The additionally deploy reply is performed by the JuxMem Monitor, after having received and processed a request coming from the JuxMem Manager. The structure of this communication primitive is much simpler than that of the additionally deploy request. It contains a return value for the execution of the additional resource integration and deployment, which can be success or failure. In case of a successful reply, the number of new clusters and the number of newly deployed JuxMem Providers per cluster are specified by the Monitor in the additionally deploy reply.

It is essential for the new JuxMem to be able to correctly express its requests to the Monitor, in order for the JuxMem topology to extend in the direction it really needs to.
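The combined effect of the global and local policies can be summarised by a small decision function. The following is a minimal sketch of the rules stated above, reusing the hypothetical POLICY_* constants from the earlier sketch; it is an illustration of the semantics, not the Monitor's actual code.

    /* Decide whether a set of per-cluster reservation outcomes satisfies
     * the global and local policies. Returns 1 on success, 0 on failure. */
    int policies_satisfied(int global_policy, int nb_clusters,
                           const int *requested,  /* nodes asked per cluster */
                           const int *reserved,   /* nodes obtained per cluster */
                           const int *local_policy)
    {
        int usable = 0;
        for (int i = 0; i < nb_clusters; i++) {
            if (reserved[i] == requested[i])
                usable++;                             /* fully satisfied */
            else if (local_policy[i] == POLICY_BEST_EFFORT && reserved[i] > 0)
                usable++;                             /* partial, but acceptable */
            else if (global_policy == POLICY_MANDATORY)
                return 0;   /* one cluster fell short and the grid-level
                               policy forbids dropping it */
        }
        return usable > 0;  /* global best-effort: at least one cluster */
    }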

Figure 4.6: Complete overview of the communication interfaces and the layered structure of the Monitor.

This is why the communication interface between the JuxMem Manager and the Monitor is very important and had to be designed very carefully. This interface would also have to handle new needs that could arise in JuxMem, such as the anticipated allocation of data, which can be translated into a request to additionally deploy an estimated quantity of resources. Indeed, for such an anticipated resource reservation and topology deployment request, this interface has introduced the resource reservation policy parameters, which make it possible to specify separately which resources should be integrated for real JuxMem Client allocation requests and which resources should be integrated just for the topology extensions needed by possible future JuxMem Client allocation requests.

The Infrastructure Communication Interfaces

As already explained, the infrastructure communication interfaces are the grid resource management system communication interface and the deployment tool communication interface; they can also be seen in Figure 4.6. These two interfaces are less complex than the JuxMem communication interface and are used to communicate with lower-level applications. Unlike the JuxMem communication case, where the Monitor acts as the server part of the communication, in the case of both infrastructure communication interfaces the Monitor is just a client that requests services from the infrastructure components. This is why the Monitor has to know and implement the communication interfaces that are presented by both the grid resource management system and by the deployment tool.

The grid resource management system communication interface is used to reserve resources on the grid. The communication interface with this part of the grid middleware implies that the number of reserved nodes, the reservation start time and the reservation end time be specified by the client application. The Monitor obtains these parameters by refining the parameters sent by the JuxMem Manager in its additionally deploy request.
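As an illustration of what such a reservation request looks like on the Monitor's side, here is a sketch of submitting an advance reservation to OAR from C by shelling out to oarsub. The flags follow OAR's usual command-line conventions (-r for an advance reservation, -l for the resource description), but the exact syntax and the format of oarsub's output should be checked against the OAR version installed on the target cluster.

    #include <stdio.h>
    #include <string.h>

    /* Reserve nb_nodes nodes starting at start_time for nb_hours.
     * Returns OAR's job identifier, or -1 on failure. */
    long oar_reserve(const char *start_time, int nb_nodes, int nb_hours)
    {
        char cmd[256], line[512];
        long job_id = -1;

        snprintf(cmd, sizeof(cmd),
                 "oarsub -r \"%s\" -l nodes=%d,walltime=%d:00:00",
                 start_time, nb_nodes, nb_hours);

        FILE *out = popen(cmd, "r");
        if (out == NULL)
            return -1;

        /* oarsub prints the identifier of the created job; scan for it.
         * The exact output format varies between OAR versions. */
        while (fgets(line, sizeof(line), out) != NULL) {
            char *p = strstr(line, "OAR_JOB_ID");
            if (p != NULL)
                sscanf(p, "OAR_JOB_ID%*[ =]%ld", &job_id);
        }
        pclose(out);
        return job_id;
    }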

The deployment tool communication interface is used to deploy new JuxMem Providers and possibly new JuxMem Managers on top of the resources reserved via the grid resource management system communication interface. This is the way to actually extend the existing JuxMem topology. The communication interface with any deployment tool that knows how to deploy JuxMem implies the specification of a set of resources, on one hand, and of a set of JuxMem peers that are connected to form a peer-to-peer overlay, on the other hand. The exact set of resources is obtained from the grid resource management system, while the JuxMem topology is obtained from the JuxMem Manager's request, by knowing that each Provider has to be connected to the Manager of its cluster and that all the Managers on all clusters have to be connected to all pre-existing Managers.

4.3.2 The Node Reservation Algorithm

The node reservation algorithm is the intelligent part of the Monitor that knows how to extract the information from the JuxMem Manager's additionally deploy request and translate it into reservation requests, which it submits to the grid resource management system.

The algorithm iterates on the set of new clusters to be integrated into the JuxMem topology. Some of these clusters' exact names are specified; these are put in a set called the set of specified reservations. The other clusters are put in another set, called the set of random reservations. After this, a set of free clusters, not yet used within the existing JuxMem topology, is constructed. The free cluster set will be used to perform the requested reservations that are in the set of random reservations. The specified reservations and the random reservations are handled separately, since they are performed in a different manner. In fact, since the specified reservations can only be performed on the clusters specified by the JuxMem Manager, they are somewhat easier to perform, as no choice among the free unused clusters has to be made.

To perform a specified reservation (automaton shown in Figure 4.7), the Monitor first tests the connection to the specified cluster. Then, if it can connect to that cluster, it tries to reserve the requested number of nodes. If it was able to reserve the requested number of nodes, it can successfully go to the next reservation. If it was unable to reserve the requested number of nodes, but it has reserved some nodes, and the local reservation policy is best-effort, it can also successfully proceed to the next reservation. For all the other cases, including the case where the Monitor cannot connect to the specified cluster, if the global reservation policy is best-effort it can successfully go to the next reservation; if the global reservation policy is mandatory, the algorithm finishes and the Monitor will cancel all the reservations it has done so far and return a failure message to the requesting JuxMem Manager.

Performing a random reservation is a bit more complicated. We will analyse two cases separately: when the local reservation policy is mandatory (automaton shown in Figure 4.8) and when the local reservation policy is best-effort (automaton shown in Figure 4.9).

To perform a random reservation when the local reservation policy is mandatory, the Monitor chooses the first cluster unused in the current JuxMem topology from the free cluster set. It tests the connection to that cluster. Then, if it can connect to that cluster, it tries to reserve the requested number of nodes. If it was able to reserve the requested number of nodes, it removes the cluster on which it performed the reservation from the free cluster set and it can successfully go to the next reservation.

Figure 4.7: The automaton of a specified reservation.

Figure 4.8: The automaton of a random reservation (mandatory).

Figure 4.9: The automaton of a random reservation (best-effort).

If it was not able to reserve the requested number of nodes, it cancels the reservation and tries the next cluster from the free cluster set. If there are no more clusters in the free cluster set, the algorithm will analyse the global reservation policy, since it was not able to reserve the requested number of nodes on one of the clusters. If the global reservation policy is best-effort, the algorithm will proceed to the next reservation to be performed; otherwise, the algorithm finishes and the Monitor will cancel all the reservations it has done so far and return a failure message to the requesting JuxMem Manager.

To perform a random reservation when the local reservation policy is best-effort, the Monitor chooses, one by one, clusters unused in the current JuxMem topology from the free cluster set. For each of them it tries to connect and to reserve nodes. After having reserved nodes on each cluster from the free cluster set, the algorithm computes the maximum among the numbers of nodes it has reserved. It will keep the cluster on which it has reserved the most nodes and will cancel the reservations on the other clusters. If the computed maximum is greater than 0, the algorithm removes the corresponding cluster from the free cluster set and it can successfully go to the next reservation. If the maximum is 0 and the global reservation policy is best-effort, the algorithm will proceed to the next reservation to be performed; otherwise, the algorithm finishes and the Monitor will cancel all the reservations it has done so far and return a failure message to the requesting JuxMem Manager.
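To make the control flow concrete, here is a minimal sketch of the simplest of these automata, the specified reservation of Figure 4.7. The helper functions can_connect(), try_reserve() and cancel_reservation() are illustrative names, not the Monitor's actual internal API, and the POLICY_* constants are the hypothetical encodings introduced earlier.

    /* One specified reservation step. Returns 1 to proceed to the next
     * reservation, 0 to abort the whole request (the caller then cancels
     * all reservations made so far and reports failure to the Manager). */
    int do_specified_reservation(const char *cluster, int requested,
                                 int local_policy, int global_policy)
    {
        if (can_connect(cluster)) {
            int reserved = try_reserve(cluster, requested);
            if (reserved == requested)
                return 1;                        /* fully satisfied */
            if (local_policy == POLICY_BEST_EFFORT && reserved > 0)
                return 1;                        /* partial reservation kept */
            if (reserved > 0)
                cancel_reservation(cluster);     /* give back unusable nodes */
        }
        /* The cluster could not be used as requested; the global policy
         * decides whether the whole request fails. */
        return global_policy == POLICY_BEST_EFFORT;
    }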

Of course, for the determination of the cluster on which the maximum number of nodes can be reserved, optimisations have been performed. For example, if at a certain moment we can reserve exactly the requested number of nodes, we stop the search, keep that cluster and cancel all other reservations that we have performed in search of the maximum. Another optimisation is to avoid keeping too many reservations active on other clusters while searching for the maximum: only the reservation on the cluster with the current maximum number of nodes is retained, and each time the maximum changes, the reservation corresponding to the previous maximum is cancelled.

At the end of the iteration on both the specified reservation set and the random reservation set, we are sure that we have either returned a failure message to the requesting JuxMem Manager, or reserved a collection of grid nodes on different clusters that agrees with the value of the global resource reservation policy and with the individual local resource reservation policies, specific to each cluster.

The node reservation algorithm is the core of the JuxMem Monitoring Tool, as it is the coordination part that allows the JuxMem request parameters to be translated into a collection of additional physical resources on top of which the existing JuxMem topology can extend. It is this algorithm that makes the extension of JuxMem possible and allows it to handle allocation requests that would otherwise not be supported by the existing JuxMem topology.

4.3.3 The Simultaneous Allocation (Co-Allocation) of Nodes on Different Clusters

In the previous section we have examined the node reservation algorithm in detail. We have reached the conclusion that, by using such an algorithm to reserve a collection of resources on multiple clusters, by the end of the algorithm we are sure to have all the needed resources reserved for us. This means that the algorithm contains some kind of barrier within itself, which ensures that all the reservation attempts end by the end of the algorithm and no reservation attempt remains pending after the algorithm has finished. This behaviour leads to the simultaneous allocation, or co-allocation, of resources on different clusters, which is usually difficult to achieve and is not integrated in all grid resource management systems that offer the possibility to reserve nodes on different clusters with a single request. Co-allocation is very important when we need to have all our reserved resources available and usable at any moment after performing the reservation. This kind of guarantee is currently not offered by OARGrid [21], which is the grid-level reservation tool used on the French grid, Grid 5000 [21]. Nevertheless, co-allocation can be achieved on Grid 5000 by using OAR [25], the cluster-level reservation tool of Grid 5000, in conjunction with the algorithm described above.

Figure 4.10: Additional deployment with ADAGE.

4.4 The Additional Deployment Feature: using ADAGE

As described in chapter 2, ADAGE is a generic deployment tool that supports plugins specific to different types of distributed applications, in order to make a correct mapping of the application's description onto the corresponding resources selected from the infrastructure. Another feature of ADAGE, which has not been discussed up to now, is its ability to additionally deploy a topology and to connect it to the already existing topology of the same distributed application.

In particular, in the case of JuxMem, ADAGE can both deploy and additionally deploy JuxMem topologies. For the deployment case, it just needs to know the description of the JuxMem topology, i.e. how the JuxMem peers are connected among themselves, and the set of resources that have been reserved for the deployment of JuxMem. For the additional deployment case, ADAGE needs to know the existing JuxMem topology, the resources the existing topology is mapped on, the new JuxMem topology, containing the additional JuxMem peers, and the grid resources allocated for this new JuxMem topology, which should include the resources the existing topology is running on. From this input information, using its JXTA plugin, which knows how to connect the additional JuxMem topology to the existing one, ADAGE manages to extend the JuxMem topology, as can also be seen in Figure 4.10.

Without this additional deployment feature of ADAGE, the binding of the old JuxMem topology with the new, additional JuxMem peers would not be possible. This is why the Monitor also uses ADAGE to deploy the additional JuxMem topology on top of the resources it has reserved using the node allocation algorithm and the OAR cluster resource reservation system.

4.5 Discussion

Throughout this chapter we have observed the improved functionality of the new JuxMem, which integrates the JuxMem Monitoring Tool, and also some details about the key aspects that make the idea of a dynamic JuxMem work: the communication interfaces of the Monitor, and especially the communication interface with JuxMem; the node reservation algorithm, which is able to perform simultaneous allocation of nodes on different clusters; and the usage of the ADAGE deployment tool, which also knows how to perform additional deployment to an already deployed JuxMem topology.

By combining the above-mentioned key aspects, the Monitor manages to offer to JuxMem a layer that can mediate between the needs of JuxMem and the underlying grid middleware, from which the Monitor requests the reservation of additional resources. The interaction with the Monitor, and the support it offers to JuxMem during the JuxMem Manager's data allocation algorithm, results in a dynamic JuxMem grid data sharing service, which can adapt to the increasing resource needs of the client application. This dynamic behaviour was not possible before the design and integration of the Monitor within the JuxMem architecture.

Even so, the version of the Monitor described here is just a first step in the process of making JuxMem interact with its underlying infrastructure. Nevertheless, it offers the most important functionality that a dynamic JuxMem needs: the integration of new resources. In the future, one can imagine that the Monitor will also release no-longer-needed resources and make them available to other grid users that might need them.

Chapter 5

Architecture and Implementation

The implementation of the JuxMem Monitoring Tool was done in C and C++ on a Linux platform running on a 32-bit x86 architecture. The current version gathers about 3000 lines of code. External tools used by the implementation are OAR [25], the cluster-level reservation system for Grid 5000, and ADAGE [14], a generic deployment tool that can deploy and additionally deploy JuxMem topologies, introduced in Chapter 2.

5.1 The Monitoring Tool: A Modular Software Architecture

As shown in Figure 5.1, the Monitor's software architecture contains 4 modules:

- the JuxMem communication module, which contains a client-side interface (JuxMem Manager Monitor Client) and a server-side interface (JuxMem Manager Monitor Server);
- the coordination module, which is the intelligent component that binds together and coordinates the functionality of all the other modules;
- the OAR interaction module, which is the grid resource management system interaction part;
- the ADAGE interaction module, which is the deployment tool interaction part of the Monitor.

The last two modules, the OAR module and the ADAGE module, are grouped under the name of infrastructure interaction modules.

The modular design of the Monitor ensures high cohesion within the modules, since each module has its own well-defined functionality, and also low coupling between the modules, because the modules interact with one another only through well-defined interfaces. Changes in one module will not affect the other modules as long as the communication interfaces between the modules remain unchanged. In what follows, each of these modules will be described in more detail.

Figure 5.1: The Monitor's software architecture.

5.1.1 The JuxMem Communication Module

The JuxMem communication module is the part of the Monitor that is used to pass messages from the JuxMem Manager to the Monitor and vice versa. Its implementation relies on the client-server communication model. In the current version this interface uses TCP socket communication, on top of which a basic RPC (Remote Procedure Call) mechanism is simulated.

The JuxMem communication module implements the JuxMem communication interface, or the JuxMem-Monitor API. As already described in more detail in Chapter 4, the JuxMem communication interface is used by the JuxMem Manager peers to request the additional deployment of new pieces of infrastructure. These requests are converted by the JuxMem communication module into an internal data format that will later be used by the Monitor. The JuxMem communication module can be seen as the frontier between the JuxMem Manager and the Monitor, and has been designed to delimit the two while allowing communication to take place between them.

5.1.2 The Coordination Module

The coordination module is the core of the Monitor, and stands between the JuxMem communication module and the infrastructure interaction (or communication) modules. This module contains the node allocation algorithm, whose functionality has been described in Chapter 4. The node allocation algorithm interacts with the JuxMem communication module and in turn with the OAR interaction module. After running the node allocation algorithm, the coordination module interacts with the ADAGE interaction module.

For each JuxMem additionally deploy request that arrives, the coordination module calls the node reservation algorithm. If the node reservation algorithm is successful and all the necessary nodes have been reserved via the OAR interaction, the coordination module finally calls the ADAGE interaction module to additionally deploy the new part of the JuxMem topology.
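The request-handling flow just described can be summarised as follows. This is a minimal sketch: the function names run_node_reservation_algorithm(), adage_additional_deploy() and cancel_all_reservations() are illustrative, not the coordination module's actual entry points.

    /* Handle one additionally-deploy request: reserve, then deploy.
     * Returns 0 on success (a success reply is sent to the Manager),
     * -1 on failure (a failure reply is sent instead). */
    int handle_additionally_deploy(struct depl_map *req, int nb_extra_clusters,
                                   int global_policy)
    {
        if (run_node_reservation_algorithm(req, nb_extra_clusters,
                                           global_policy) != 0)
            return -1;                  /* not enough resources */

        if (adage_additional_deploy(req, nb_extra_clusters) != 0) {
            /* Assumed cleanup path: reserved nodes are released if the
             * deployment itself fails. */
            cancel_all_reservations(req, nb_extra_clusters);
            return -1;
        }
        return 0;
    }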

Figure 5.2: The data structures used within the Monitor.

Used Data Structures and Data Structure Management

The coordination module also contains a part that manages the whole set of reservations that have been performed for JuxMem. It is this part of the coordination module that ensures thread-safe and process-safe operations on the data structures that correspond to the image the Monitor has of the underlying grid and of the JuxMem topology that runs on top of the grid.

There are 3 data structures in use, which can be visualised in Figure 5.2: the reservation list structure, the JuxMem topology integrated cluster list structure and the JuxMem-usable cluster list structure. They are stored in the main memory of the node running the Monitor, and are also written to disk after each new successful extension of the JuxMem topology.

The JuxMem-usable cluster list contains generic information about each cluster the Monitor can reserve nodes on. For each cluster, it contains the cluster name and the number of JuxMem Managers that have been launched within that cluster. If, for a cluster, the number of launched JuxMem Managers is 0, it means that the cluster has not yet been integrated within the JuxMem topology.

The JuxMem topology integrated cluster list contains information about the clusters on which JuxMem topology has already been deployed. For each cluster, it contains a reference to the cluster's corresponding item within the JuxMem-usable cluster list and the number of reservations that have been performed on that cluster.

The reservation list structure contains information about each reservation the Monitor has performed. Its component parts are: a reference to the cluster the reservation belongs to, that is, a reference to the corresponding item within the JuxMem topology integrated cluster list; the OAR unique reservation identifier; the reservation's start time; the reservation's end time; the number of reserved nodes; and a list of the names of the nodes that belong to this OAR reservation.
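A possible C rendering of these three structures is sketched below. The field names are reconstructions of the description above and of Figure 5.2, not the Monitor's actual declarations.

    #include <time.h>

    struct usable_cluster {            /* JuxMem-usable cluster list entry */
        char *cluster_name;
        int nb_managers;               /* 0: not yet in the JuxMem topology */
        struct usable_cluster *next;
    };

    struct topology_cluster {          /* cluster already running JuxMem peers */
        struct usable_cluster *cluster;
        int nb_reservations;
        struct topology_cluster *next;
    };

    struct reservation {               /* one OAR reservation */
        struct topology_cluster *cluster;
        long oar_job_id;               /* OAR's unique reservation identifier */
        time_t start_time;
        time_t end_time;
        int nb_nodes;
        char **node_names;             /* names of the reserved nodes */
        struct reservation *next;
    };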

For each new reservation performed by the Monitor, the data in the structures described above is updated. If a reservation expires, a background thread erases it and updates the data structures accordingly.

5.1.3 The Infrastructure Interaction Modules

The infrastructure interaction modules have in common the fact that they both hide the physical topology and the behaviour of the underlying infrastructure from the higher-level modules of the Monitor and, eventually, from JuxMem. At the same time, they are able to make the requests to the infrastructure's middleware that are necessary in order to ensure the correct functioning of the upper-level modules and of JuxMem.

The OAR interaction module knows how to make OAR reservations, how to cancel OAR reservations and how to connect to an OAR reservation it has performed, usually in order to verify whether the nodes that have been reserved are still running.

The ADAGE interaction module knows how to construct the input description files for ADAGE, that is, the JuxMem application description and the physical resource description, and how to use these descriptions to deploy or additionally deploy the requested JuxMem topology with ADAGE. The ADAGE interaction module also knows how to call the ADAGE cleanup functions, if for some reason part of the deployed JuxMem infrastructure will no longer be used.

5.2 Technologies Used by this Implementation

5.2.1 RPC Communication Technique

The RPC communication technique is used in the communication between the JuxMem Manager and the Monitor, via the JuxMem communication interface. The remote procedure called by JuxMem is the one contained within the coordination module of the Monitor, which reserves resources and then deploys JuxMem topology on top of those resources. For each new JuxMem Manager request, a new thread is created to treat the request and invoke the resource allocation and application deployment procedure. The procedure call is blocking, and the JuxMem Manager, which is the calling application, will block until it gets the result of the remote procedure's execution.
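The following sketch illustrates the pattern described here and in the next subsection: a TCP server that spawns one POSIX thread per incoming Manager request, with a mutex protecting the Monitor's shared state. It is a simplified illustration, not the Monitor's actual code; error handling is minimal and the request decoding and reply are left as comments.

    #include <netinet/in.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Shared Monitor state (reservation lists, topology image) must be
     * protected, since every request runs in its own thread. */
    static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *handle_request(void *arg)
    {
        int client = *(int *)arg;
        free(arg);
        /* 1. read and decode the additionally-deploy request;
         * 2. run the node reservation algorithm and call ADAGE;
         * 3. write the reply back (the Manager blocks until this point). */
        pthread_mutex_lock(&state_lock);
        /* ... update the reservation and topology structures ... */
        pthread_mutex_unlock(&state_lock);
        close(client);
        return NULL;
    }

    int serve(int port)
    {
        int srv = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = INADDR_ANY;
        addr.sin_port = htons((unsigned short)port);
        if (srv < 0 || bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(srv, 16) < 0)
            return -1;

        for (;;) {                        /* one thread per Manager request */
            int *client = (int *)malloc(sizeof(int));
            *client = accept(srv, NULL, NULL);
            pthread_t tid;
            pthread_create(&tid, NULL, handle_request, client);
            pthread_detach(tid);          /* requests are independent */
        }
    }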

5.2.2 POSIX Threads and Thread Concurrency

POSIX threads are used throughout this implementation in order to enable the concurrent treatment of simultaneous JuxMem Manager requests. Because there exist global shared data structures that are common to all the Monitor's threads, operations upon that data have to be thread-safe, and so this data has to be protected by locks.

5.2.3 XML Document Managing and Parsing

As mentioned before, in order to deploy with ADAGE one has to provide a description of the JuxMem topology to be deployed and also a description of the resources on which the deployment is performed. These descriptions are kept in XML files. In order to manage the XML files efficiently, from the point of view of execution time, the Monitor uses XML DOM (Document Object Model) and loads the XML tree into memory. The DOM implementation used is XERCES-C, which is a C++ object-oriented implementation. After loading the document that contains the old topology or resource description, one has to search among the DOM nodes and modify some of them, usually by adding new child nodes. Efficient searches on XML documents are performed by using XPath queries. An XPath query is based on a regular grammar and thus is very similar to regular expressions.
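The DOM pattern just described can be sketched as follows with the Xerces-C API: load a description file into an in-memory tree, locate an element, and append a new child to it. The element names ("cluster", "node") and the file name are illustrative, not ADAGE's actual schema; for brevity the example locates the element with getElementsByTagName rather than an XPath query.

    #include <xercesc/dom/DOM.hpp>
    #include <xercesc/parsers/XercesDOMParser.hpp>
    #include <xercesc/util/PlatformUtils.hpp>
    #include <xercesc/util/XMLString.hpp>

    using namespace xercesc;

    int main()
    {
        XMLPlatformUtils::Initialize();
        {
            XercesDOMParser parser;
            parser.parse("resources.xml");        // load the XML tree in memory
            DOMDocument *doc = parser.getDocument();

            // Find the first "cluster" element and append a "node" child,
            // as the Monitor does when adding newly reserved resources.
            XMLCh *tag = XMLString::transcode("cluster");
            DOMNodeList *clusters = doc->getElementsByTagName(tag);
            XMLString::release(&tag);

            if (clusters->getLength() > 0) {
                XMLCh *nodeTag = XMLString::transcode("node");
                DOMElement *child = doc->createElement(nodeTag);
                clusters->item(0)->appendChild(child);
                XMLString::release(&nodeTag);
            }
            // (Serialising the modified tree back to disk is omitted.)
        }
        XMLPlatformUtils::Terminate();
        return 0;
    }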

5.3 Replaceable Plugins

In the previous sections of this chapter we have described the modules which compose the Monitor. We have stated that the advantage of a modular software architecture is the low coupling and high cohesion of the code. In this section we argue for another benefit of the Monitor's modular architecture, namely the fact that its communication modules can be viewed as replaceable plugins, which could be changed if the infrastructure, or even the higher-level application (currently JuxMem), were to change. These replaceable plugins also improve the maintainability of the Monitor's code and allow the Monitor to adapt to conditions different from the ones it has been designed for. Generally, for a module to be easily replaceable, its interaction interface with the other modules of the system has to be designed very carefully and must be as generic as possible.

Grid Resource Manager Interaction Plugin

The grid resource manager interaction plugin is currently the OAR interaction module. The communication interface with the Monitor's coordination layer is designed to request the reservation of a given number of nodes for a certain time on a chosen cluster. Other requests can cancel a given reservation, specified by its unique cluster-level identifier, or connect to such a reservation to examine its status. One can see that these are very generic tasks for node reservation and management, and so the OAR interaction module can be viewed as a plugin that could be replaced by a module written for any other grid resource management system that accepts a subset of the reservation management parameters used for OAR, such as the number of nodes to reserve and the walltime of the reservation. This means that one could adapt the Monitor to run on grid infrastructures that use a reservation system different from OAR.

Deployment Tool Interaction Plugin

The deployment tool interaction plugin is currently the ADAGE interaction module. The communication interface with the Monitor's coordination layer is designed to request the deployment of a certain JuxMem infrastructure, to be mapped onto some specified resources. This is a very generic task for a deployment tool that knows how to deploy JuxMem, and so the ADAGE interaction module can be viewed as a plugin that could be replaced by a module written for any other deployment tool that knows how to deploy and additionally deploy JuxMem. This means that one could make the Monitor deploy using any other deployment tool that is able to deploy and also extend JuxMem topologies by additional deployment.

5.4 Discussion

This chapter presented details about the implementation of the JuxMem Monitoring Tool. It has a flexible architecture which, as we have seen, can adapt to different middleware changes via its replaceable plugins. This means that, by using the Monitor and with minimal changes, JuxMem could be adapted to run in a dynamic manner on any grid infrastructure.

Another remark that needs to be made about the Monitor's software design is that, because of its client-server interface that communicates with the upper-layer application (which is JuxMem), the Monitor could very easily be integrated into other distributed applications, in order to make them behave more dynamically with respect to their resource needs. The integration with other applications having an architecture and functionality similar to JuxMem's is possible by integrating the client part of the JuxMem communication interface in the new application. After this, by making the correct requests to the Monitor with respect to the distributed application's needs, and with minimal effort, the new distributed application could also become adaptive and dynamic, as is the improved JuxMem. This is another argument in favour of the fact that the Monitor's code has been written to be easily reusable and that the Monitor's software architecture can adapt to a set of functionality changes.

Chapter 6

Preliminary Evaluation and Discussion

The objective of this chapter is to show that the JuxMem Monitoring Tool is a functional software component that performs well when tested on a real test-bed.

6.1 Describing the Experimental Platform

The development and evaluation of the JuxMem Monitoring Tool has been done within the Grid 5000 French testbed. The Grid 5000 project's objective is to construct an experimental platform containing 5000 processors, distributed over nine geographical sites: Bordeaux, Grenoble, Lille, Lyon, Nancy, Orsay, Rennes, Sophia and Toulouse. These sites are interconnected by Renater, the French national Gigabit network dedicated to teaching and research. Grid 5000 offers a set of heterogeneous hardware resources, as well as a highly configurable software environment.

6.2 Adding New JuxMem Provider Peers Bound to an Existing JuxMem Manager Peer

6.2.1 Motivation and Objective of this Experiment

The purpose of this experiment is to show that the JuxMem Monitoring Tool can successfully extend the JuxMem topology within the same cluster. It is also important to see how well the JuxMem Monitoring Tool performs and what its response time to a JuxMem Manager is. More precisely, we are interested in the response time perceived by the Monitor's client side of the JuxMem-Monitor communication interface, since this is the time that the JuxMem Manager will also perceive when making an extension request to the Monitor. This time interval is also the increment of waiting time that might be perceived by the JuxMem Clients that request data allocation from the JuxMem Manager, which needs topology extension.

If we succeed in showing that the JuxMem Monitor has a reasonable response time and that it does indeed extend the JuxMem topology within one given cluster, we have a preliminary validation of the Monitor's functionality and behaviour. In this experiment, we incrementally and sequentially add various numbers of JuxMem Providers to the JuxMem topology and we wish to see how the response time of the Monitor evolves in such conditions, that is, how the response time varies with the change of topology and with the number of JuxMem Providers to be added at each step. For this experiment, since the current JuxMem does not know how to deal with and how to request best-effort resource reservations, the global reservation policy and all of the local reservation policies have been set to mandatory.

6.2.2 Technical Description of the Experiment

The experiment consists in (1) launching the Monitor, which is a daemon program, and then (2) launching a script that sends JuxMem Monitor client requests every 15 seconds. The 15 s interval has been chosen to be sufficiently large, so that different topology extension requests do not overlap. The initial JuxMem topology is a simple one, containing only one JuxMem Manager. The cluster on which this experiment has been run is the paraci cluster at Rennes. This cluster was chosen because it is part of the Rennes site, so tests were easier to perform on it, and also because at the moment the experiment had to be launched it was the only unoccupied cluster we could find.

The experiment consists of three parts. In the first part, each request asks for a fixed number of JuxMem Providers to be added to the topology; for simplicity we have chosen this number to be 1. This type of request is repeated 20 times. In the second part of the experiment, each request asks for twice as many JuxMem Providers to be deployed on that cluster as the previous JuxMem Monitor client request. The first request is to additionally deploy just 1 new JuxMem Provider and the last request is to additionally deploy 32 new JuxMem Providers, since the maximum number of nodes on the cluster was 64. This means that there are 6 such requests. As in the first part of the experiment, requests are performed every 15 seconds. In the third part of the experiment, every 15 seconds there is a request to additionally deploy a random number of JuxMem Providers, ranging from 1 to 10. This type of request is repeated 10 times.

In parallel with these tests, we also measured the response times for manually reserving nodes with OAR and the response times for manually deploying and additionally deploying the required increment of JuxMem topology with ADAGE. These application response times corresponding to OAR and to ADAGE have been subtracted from the time obtained at the end of each step of the experiment, in order to get just the response time of the JuxMem Monitoring Tool, which is the piece of software we are evaluating.

We chose to make these experiments to see the variation of the response time of the Monitor with respect to (1) the dimension of the JuxMem topology on one cluster, (2) the number of nodes to be integrated at each step, and (3) the pattern of the sequence of numbers of nodes to be integrated at each step.
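The timing methodology can be sketched as follows: measure the round-trip time of each request at the client side, then subtract the separately measured OAR and ADAGE times. This is a minimal illustration; monitor_request() stands in for the client-side call of the JuxMem communication interface and is not an actual function of the implementation.

    #include <stdio.h>
    #include <sys/time.h>

    /* Client-side blocking additionally-deploy call (illustrative name). */
    extern void monitor_request(void);

    static double elapsed_s(struct timeval a, struct timeval b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
    }

    /* Measure one extension request and isolate the Monitor's own time. */
    double monitor_time(double oar_time, double adage_time)
    {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        monitor_request();              /* blocks until the reply arrives */
        gettimeofday(&t1, NULL);
        return elapsed_s(t0, t1) - oar_time - adage_time;
    }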

Figure 6.1: The response time of OAR.

6.2.3 Practical Results

The results we have obtained clearly show that the execution time of the Monitor does not vary much with respect to the topology or with respect to the number of nodes integrated into the infrastructure at one experiment step. The value of the Monitor's response time varies from a minimum of 0.80 s to a maximum of 1.80 s. The problem is that we cannot accurately establish the exact execution time of the JuxMem Monitoring Tool, because the measured time is the sum of the Monitor's own execution time, OAR's execution time and ADAGE's execution time. While ADAGE's execution time is quite predictable and stable, with a value of approximately 0.88 s for both deployment and additional deployment, OAR's execution time is not at all predictable. The reason is that while these tests were being done, OAR was also being used by other external users, whose reservation requests might have influenced the measured total execution time of the JuxMem Monitoring Tool. In separate manual measurements of the same reservation requests, OAR's response time varied between a minimum of s and a maximum of s. Because of the strong dependence between OAR and the JuxMem Monitoring Tool viewed as a whole, we show and describe the curves for the JuxMem Monitoring Tool and for OAR in parallel, for each case of the experiment.

Experiment A: Requesting 1 JuxMem Provider Each 15 Seconds

The curves resulting from this experiment can be seen in Figure 6.2. For comparison, the curves for OAR's response time can be seen in Figure 6.1. One can notice that the Monitor's execution time is relatively constant and does not depend on the dimension of the deployed JuxMem topology when a constant number of cluster nodes is integrated at each step.

Figure 6.2: Requesting 1 JuxMem Provider each 15s.

Figure 6.3: Requesting a doubled number of JuxMem Providers each 15s.

Experiment B: Requesting a Doubled Number of JuxMem Providers Each 15 Seconds

The curves resulting from this experiment can be seen in Figure 6.3. One can notice that the Monitor's execution time does not depend on the number of nodes that should be reserved and additionally deployed on just one cluster, when a doubled number of nodes is integrated at each step.

Experiment C: Requesting a Random Number of JuxMem Providers Each 15 Seconds

The curves resulting from this experiment can be seen in Figure 6.4. As in the previous experiment, one can notice that the Monitor's execution time does not depend on the number of nodes that should be reserved and additionally deployed on just one cluster, when a number of nodes chosen at random is integrated at each step.

Figure 6.4: Requesting a random number of JuxMem Providers each 15s.

6.2.4 Experimental Results Discussion

As a remark on the experimental results obtained above, one can notice that the response time for the JuxMem Monitoring Tool viewed as a whole does not increase with a growing topology; on the contrary, it sometimes decreases! Likewise, the same response time does not increase with the number of requested JuxMem Providers; in this case too, the total response time of the Monitoring Tool may even decrease as the number of requested JuxMem Providers increases. These anomalies are actually due to the fact that the Monitor is not alone on the cluster, and that others use the cluster resource management system too.

Another reason for the strange variation of the response time, besides the shared usage of OAR, is the network communication performed by the Monitor. The network too is a shared resource and introduces communication delays that do not vary uniformly with the dimension of the JuxMem topology or the number of additional JuxMem Providers to be added. The messages that are passed for any JuxMem topology and for any additional number of Providers within the same cluster have the same size, and thus they should not influence the network communication latency.

Even with the anomalies described above, one can still notice that the most frequently observed response time for the Monitor is around 1 s, which is very good, even negligible, when compared to the response time of OAR. One can also notice that the Monitor's execution time is comparable to ADAGE's execution time.

6.3 Adding Extra Clusters to the Existing JuxMem Topology

6.3.1 Motivation and Objective of this Experiment

The purpose of this experiment is to show that the JuxMem Monitoring Tool can successfully extend the JuxMem topology by integrating new clusters into it. As in the case of the experiment using a single cluster, it is also important to see how well the JuxMem Monitoring Tool performs and what its response time to a JuxMem Manager is. More precisely, following the model of the previous experiment, we are interested in the response time perceived by the Monitor's client side of the JuxMem-Monitor communication interface, since this is the time that the JuxMem Manager will also perceive when making an extension request to the Monitor (and this time is also the increment of waiting time that might be perceived by the JuxMem Clients that request data allocation from JuxMem, which consequently needs topology extension).

If we succeed in showing that the JuxMem Monitor has a reasonable response time and that it does indeed extend the JuxMem topology by integrating new clusters, we have a preliminary validation of the Monitor's functionality and behaviour. We will try to incrementally add various clusters to the JuxMem topology and we wish to see how the response time of the Monitor evolves in such conditions, that is, how the response time varies with the change of topology and with the number of clusters to be added at each step. For this experiment, since the current JuxMem does not know how to deal with and how to request best-effort resource reservations, the global reservation policy and all of the local reservation policies have been set to mandatory.

6.3.2 Technical Description of the Experiment

As in the case of the first experiment, this experiment consists in first launching the Monitor, which is a daemon program, and then launching a script that sends JuxMem Monitor client requests every 15 seconds. The initial JuxMem topology is a simple one, containing only one JuxMem Manager. The Monitor is launched on the paraci cluster at Rennes. The set of clusters on which this experiment has been run is: the paraci cluster at Rennes, the parasol cluster at Rennes, the Lyon cluster, the Grenoble idpot cluster and the Bordeaux cluster. The reason only these clusters were used is that, at the time the experiment was performed, they were the only clusters on which we could reserve nodes using OAR.

The experiment consists of two parts. In the first part of the experiment, each arriving request tries to add a new cluster to the JuxMem topology. For simplicity reasons, the addition of each new cluster to the JuxMem topology means deploying just 1 Provider and, of course, a Manager on that cluster. Since we only had 4 clusters available, there were only 3 steps in the experiment. In the second part of the experiment, we want to examine how the response time of a given additionally deploy request varies with the increase of the number of clusters to be integrated at once. Again, since only 4 clusters were available, there were only 3 steps in the experiment. After each experiment step the Monitor was stopped and the JuxMem topology destroyed; after this cleaning was done, the Monitor and the initial JuxMem topology were launched again.

In parallel with these tests, we measured the response times for manually reserving nodes with OAR and the response times for manually deploying and additionally deploying the required increment of JuxMem topology with ADAGE. These application response times corresponding to OAR and to ADAGE have been subtracted from the time obtained at the end of each step of the experiment, in order to get just the response time of the JuxMem Monitoring Tool, which is the piece of software we are evaluating.

Figure 6.5: Requesting the integration of 1 new cluster each 15s.

Figure 6.6: Requesting the integration of an increasing number of clusters at each step.

6.3.3 Practical Results

Experiment D: Requesting the Integration of 1 New Cluster Each 15 Seconds

The curves resulting from this experiment can be seen in Figure 6.5. The results we have obtained suggest that the cost of integrating each new cluster is relatively constant and that it does not depend on the currently deployed JuxMem topology. This time is around 12 s if we do not include the OAR reservation time, and about 23 s if we take into account the time needed by OAR to perform a reservation.


More information

SwanLink: Mobile P2P Environment for Graphical Content Management System

SwanLink: Mobile P2P Environment for Graphical Content Management System SwanLink: Mobile P2P Environment for Graphical Content Management System Popovic, Jovan; Bosnjakovic, Andrija; Minic, Predrag; Korolija, Nenad; and Milutinovic, Veljko Abstract This document describes

More information

Mirror File System for Cloud Computing

Mirror File System for Cloud Computing Mirror File System for Cloud Computing Twin Peaks Software Abstract The idea of the Mirror File System (MFS) is simple. When a user creates or updates a file, MFS creates or updates it in real time on

More information

Performance Prediction, Sizing and Capacity Planning for Distributed E-Commerce Applications

Performance Prediction, Sizing and Capacity Planning for Distributed E-Commerce Applications Performance Prediction, Sizing and Capacity Planning for Distributed E-Commerce Applications by Samuel D. Kounev (skounev@ito.tu-darmstadt.de) Information Technology Transfer Office Abstract Modern e-commerce

More information

A High Performance Computing Scheduling and Resource Management Primer

A High Performance Computing Scheduling and Resource Management Primer LLNL-TR-652476 A High Performance Computing Scheduling and Resource Management Primer D. H. Ahn, J. E. Garlick, M. A. Grondona, D. A. Lipari, R. R. Springmeyer March 31, 2014 Disclaimer This document was

More information

A Survey Study on Monitoring Service for Grid

A Survey Study on Monitoring Service for Grid A Survey Study on Monitoring Service for Grid Erkang You erkyou@indiana.edu ABSTRACT Grid is a distributed system that integrates heterogeneous systems into a single transparent computer, aiming to provide

More information

Fair Scheduling Algorithm with Dynamic Load Balancing Using In Grid Computing

Fair Scheduling Algorithm with Dynamic Load Balancing Using In Grid Computing Research Inventy: International Journal Of Engineering And Science Vol.2, Issue 10 (April 2013), Pp 53-57 Issn(e): 2278-4721, Issn(p):2319-6483, Www.Researchinventy.Com Fair Scheduling Algorithm with Dynamic

More information

How To Understand The Concept Of A Distributed System

How To Understand The Concept Of A Distributed System Distributed Operating Systems Introduction Ewa Niewiadomska-Szynkiewicz and Adam Kozakiewicz ens@ia.pw.edu.pl, akozakie@ia.pw.edu.pl Institute of Control and Computation Engineering Warsaw University of

More information

Cisco Active Network Abstraction Gateway High Availability Solution

Cisco Active Network Abstraction Gateway High Availability Solution . Cisco Active Network Abstraction Gateway High Availability Solution White Paper This white paper describes the Cisco Active Network Abstraction (ANA) Gateway High Availability solution developed and

More information

MIGRATING DESKTOP AND ROAMING ACCESS. Migrating Desktop and Roaming Access Whitepaper

MIGRATING DESKTOP AND ROAMING ACCESS. Migrating Desktop and Roaming Access Whitepaper Migrating Desktop and Roaming Access Whitepaper Poznan Supercomputing and Networking Center Noskowskiego 12/14 61-704 Poznan, POLAND 2004, April white-paper-md-ras.doc 1/11 1 Product overview In this whitepaper

More information

Software Life-Cycle Management

Software Life-Cycle Management Ingo Arnold Department Computer Science University of Basel Theory Software Life-Cycle Management Architecture Styles Overview An Architecture Style expresses a fundamental structural organization schema

More information

An Evaluation of Economy-based Resource Trading and Scheduling on Computational Power Grids for Parameter Sweep Applications

An Evaluation of Economy-based Resource Trading and Scheduling on Computational Power Grids for Parameter Sweep Applications An Evaluation of Economy-based Resource Trading and Scheduling on Computational Power Grids for Parameter Sweep Applications Rajkumar Buyya, Jonathan Giddy, and David Abramson School of Computer Science

More information

Grid Computing Vs. Cloud Computing

Grid Computing Vs. Cloud Computing International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 6 (2013), pp. 577-582 International Research Publications House http://www. irphouse.com /ijict.htm Grid

More information

Base One's Rich Client Architecture

Base One's Rich Client Architecture Base One's Rich Client Architecture Base One provides a unique approach for developing Internet-enabled applications, combining both efficiency and ease of programming through its "Rich Client" architecture.

More information

Web Service Based Data Management for Grid Applications

Web Service Based Data Management for Grid Applications Web Service Based Data Management for Grid Applications T. Boehm Zuse-Institute Berlin (ZIB), Berlin, Germany Abstract Web Services play an important role in providing an interface between end user applications

More information

A Brief Analysis on Architecture and Reliability of Cloud Based Data Storage

A Brief Analysis on Architecture and Reliability of Cloud Based Data Storage Volume 2, No.4, July August 2013 International Journal of Information Systems and Computer Sciences ISSN 2319 7595 Tejaswini S L Jayanthy et al., Available International Online Journal at http://warse.org/pdfs/ijiscs03242013.pdf

More information

MEng, BSc Applied Computer Science

MEng, BSc Applied Computer Science School of Computing FACULTY OF ENGINEERING MEng, BSc Applied Computer Science Year 1 COMP1212 Computer Processor Effective programming depends on understanding not only how to give a machine instructions

More information

PARIS*: Programming parallel and distributed systems for large scale numerical simulation applications. Christine Morin IRISA/INRIA

PARIS*: Programming parallel and distributed systems for large scale numerical simulation applications. Christine Morin IRISA/INRIA PARIS*: Programming parallel and distributed systems for large scale numerical simulation applications Kerrighed, Vigne Christine Morin IRISA/INRIA * Common project with CNRS, ENS-Cachan, INRIA, INSA,

More information

DFSgc. Distributed File System for Multipurpose Grid Applications and Cloud Computing

DFSgc. Distributed File System for Multipurpose Grid Applications and Cloud Computing DFSgc Distributed File System for Multipurpose Grid Applications and Cloud Computing Introduction to DFSgc. Motivation: Grid Computing currently needs support for managing huge quantities of storage. Lacks

More information

Distributed Dynamic Load Balancing for Iterative-Stencil Applications

Distributed Dynamic Load Balancing for Iterative-Stencil Applications Distributed Dynamic Load Balancing for Iterative-Stencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,

More information

Introduction to Service Oriented Architectures (SOA)

Introduction to Service Oriented Architectures (SOA) Introduction to Service Oriented Architectures (SOA) Responsible Institutions: ETHZ (Concept) ETHZ (Overall) ETHZ (Revision) http://www.eu-orchestra.org - Version from: 26.10.2007 1 Content 1. Introduction

More information

JXTA Overview. Mathieu Jan IRISA Rennes Projet Paris

JXTA Overview. Mathieu Jan IRISA Rennes Projet Paris JXTA Overview Mathieu Jan IRISA Rennes Projet Paris Plan Why peer-to-peer (P2P)? Introduction to JXTA Goals Basic concepts Protocols JXTA 2.0 Loosely-Consistent DHT Conclusion 2 Why peer-to-peer (P2P)?

More information

P U B L I C A T I O N I N T E R N E 1779 EXTENDING THE ENTRY CONSISTENCY MODEL TO ENABLE EFFICIENT VISUALIZATION FOR CODE-COUPLING GRID APPLICATIONS

P U B L I C A T I O N I N T E R N E 1779 EXTENDING THE ENTRY CONSISTENCY MODEL TO ENABLE EFFICIENT VISUALIZATION FOR CODE-COUPLING GRID APPLICATIONS I R I P U B L I C A T I O N I N T E R N E 1779 N o S INSTITUT DE RECHERCHE EN INFORMATIQUE ET SYSTÈMES ALÉATOIRES A EXTENDING THE ENTRY CONSISTENCY MODEL TO ENABLE EFFICIENT VISUALIZATION FOR CODE-COUPLING

More information

Load balancing model for Cloud Data Center ABSTRACT:

Load balancing model for Cloud Data Center ABSTRACT: Load balancing model for Cloud Data Center ABSTRACT: Cloud data center management is a key problem due to the numerous and heterogeneous strategies that can be applied, ranging from the VM placement to

More information

Resource Monitoring in GRID computing

Resource Monitoring in GRID computing Seminar May 16, 2003 Resource Monitoring in GRID computing Augusto Ciuffoletti Dipartimento di Informatica - Univ. di Pisa next: Network Monitoring Architecture Network Monitoring Architecture controls

More information

Chapter 18: Database System Architectures. Centralized Systems

Chapter 18: Database System Architectures. Centralized Systems Chapter 18: Database System Architectures! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types 18.1 Centralized Systems! Run on a single computer system and

More information

Cisco and EMC Solutions for Application Acceleration and Branch Office Infrastructure Consolidation

Cisco and EMC Solutions for Application Acceleration and Branch Office Infrastructure Consolidation Solution Overview Cisco and EMC Solutions for Application Acceleration and Branch Office Infrastructure Consolidation IT organizations face challenges in consolidating costly and difficult-to-manage branch-office

More information

DISTRIBUTED SYSTEMS AND CLOUD COMPUTING. A Comparative Study

DISTRIBUTED SYSTEMS AND CLOUD COMPUTING. A Comparative Study DISTRIBUTED SYSTEMS AND CLOUD COMPUTING A Comparative Study Geographically distributed resources, such as storage devices, data sources, and computing power, are interconnected as a single, unified resource

More information

Load balancing in SOAJA (Service Oriented Java Adaptive Applications)

Load balancing in SOAJA (Service Oriented Java Adaptive Applications) Load balancing in SOAJA (Service Oriented Java Adaptive Applications) Richard Olejnik Université des Sciences et Technologies de Lille Laboratoire d Informatique Fondamentale de Lille (LIFL UMR CNRS 8022)

More information

The Lattice Project: A Multi-Model Grid Computing System. Center for Bioinformatics and Computational Biology University of Maryland

The Lattice Project: A Multi-Model Grid Computing System. Center for Bioinformatics and Computational Biology University of Maryland The Lattice Project: A Multi-Model Grid Computing System Center for Bioinformatics and Computational Biology University of Maryland Parallel Computing PARALLEL COMPUTING a form of computation in which

More information

Extending the Internet of Things to IPv6 with Software Defined Networking

Extending the Internet of Things to IPv6 with Software Defined Networking Extending the Internet of Things to IPv6 with Software Defined Networking Abstract [WHITE PAPER] Pedro Martinez-Julia, Antonio F. Skarmeta {pedromj,skarmeta}@um.es The flexibility and general programmability

More information

Chapter 1 - Web Server Management and Cluster Topology

Chapter 1 - Web Server Management and Cluster Topology Objectives At the end of this chapter, participants will be able to understand: Web server management options provided by Network Deployment Clustered Application Servers Cluster creation and management

More information

High Performance Cluster Support for NLB on Window

High Performance Cluster Support for NLB on Window High Performance Cluster Support for NLB on Window [1]Arvind Rathi, [2] Kirti, [3] Neelam [1]M.Tech Student, Department of CSE, GITM, Gurgaon Haryana (India) arvindrathi88@gmail.com [2]Asst. Professor,

More information

Advanced Peer to Peer Discovery and Interaction Framework

Advanced Peer to Peer Discovery and Interaction Framework Advanced Peer to Peer Discovery and Interaction Framework Peeyush Tugnawat J.D. Edwards and Company One, Technology Way, Denver, CO 80237 peeyush_tugnawat@jdedwards.com Mohamed E. Fayad Computer Engineering

More information

SCALABILITY AND AVAILABILITY

SCALABILITY AND AVAILABILITY SCALABILITY AND AVAILABILITY Real Systems must be Scalable fast enough to handle the expected load and grow easily when the load grows Available available enough of the time Scalable Scale-up increase

More information

IaaS Cloud Architectures: Virtualized Data Centers to Federated Cloud Infrastructures

IaaS Cloud Architectures: Virtualized Data Centers to Federated Cloud Infrastructures IaaS Cloud Architectures: Virtualized Data Centers to Federated Cloud Infrastructures Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Introduction

More information

The EMSX Platform. A Modular, Scalable, Efficient, Adaptable Platform to Manage Multi-technology Networks. A White Paper.

The EMSX Platform. A Modular, Scalable, Efficient, Adaptable Platform to Manage Multi-technology Networks. A White Paper. The EMSX Platform A Modular, Scalable, Efficient, Adaptable Platform to Manage Multi-technology Networks A White Paper November 2002 Abstract: The EMSX Platform is a set of components that together provide

More information

The Advantages and Disadvantages of a Standard Data Storage System

The Advantages and Disadvantages of a Standard Data Storage System Databases: towards performance and scalability Bibliographical study Silviu-Marius Moldovan mariusmoldovan@gmail.com Supervisors: Gabriel Antoniu, Luc Bougé {Gabriel.Antoniu,Luc.Bouge}@irisa.fr INSA, IFSIC,

More information

BSC vision on Big Data and extreme scale computing

BSC vision on Big Data and extreme scale computing BSC vision on Big Data and extreme scale computing Jesus Labarta, Eduard Ayguade,, Fabrizio Gagliardi, Rosa M. Badia, Toni Cortes, Jordi Torres, Adrian Cristal, Osman Unsal, David Carrera, Yolanda Becerra,

More information

Building Reliable, Scalable AR System Solutions. High-Availability. White Paper

Building Reliable, Scalable AR System Solutions. High-Availability. White Paper Building Reliable, Scalable Solutions High-Availability White Paper Introduction This paper will discuss the products, tools and strategies available for building reliable and scalable Action Request System

More information

LaPIe: Collective Communications adapted to Grid Environments

LaPIe: Collective Communications adapted to Grid Environments LaPIe: Collective Communications adapted to Grid Environments Luiz Angelo Barchet-Estefanel Thesis Supervisor: M Denis TRYSTRAM Co-Supervisor: M Grégory MOUNIE ID-IMAG Laboratory Grenoble - France LaPIe:

More information

Copyright www.agileload.com 1

Copyright www.agileload.com 1 Copyright www.agileload.com 1 INTRODUCTION Performance testing is a complex activity where dozens of factors contribute to its success and effective usage of all those factors is necessary to get the accurate

More information

STUDY AND SIMULATION OF A DISTRIBUTED REAL-TIME FAULT-TOLERANCE WEB MONITORING SYSTEM

STUDY AND SIMULATION OF A DISTRIBUTED REAL-TIME FAULT-TOLERANCE WEB MONITORING SYSTEM STUDY AND SIMULATION OF A DISTRIBUTED REAL-TIME FAULT-TOLERANCE WEB MONITORING SYSTEM Albert M. K. Cheng, Shaohong Fang Department of Computer Science University of Houston Houston, TX, 77204, USA http://www.cs.uh.edu

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

Client/Server and Distributed Computing

Client/Server and Distributed Computing Adapted from:operating Systems: Internals and Design Principles, 6/E William Stallings CS571 Fall 2010 Client/Server and Distributed Computing Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Traditional

More information

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals High Performance Computing Course Notes 2007-2008 2008 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs

More information

MEng, BSc Computer Science with Artificial Intelligence

MEng, BSc Computer Science with Artificial Intelligence School of Computing FACULTY OF ENGINEERING MEng, BSc Computer Science with Artificial Intelligence Year 1 COMP1212 Computer Processor Effective programming depends on understanding not only how to give

More information

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications White Paper Table of Contents Overview...3 Replication Types Supported...3 Set-up &

More information

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02) Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we

More information

A Grid Architecture for Manufacturing Database System

A Grid Architecture for Manufacturing Database System Database Systems Journal vol. II, no. 2/2011 23 A Grid Architecture for Manufacturing Database System Laurentiu CIOVICĂ, Constantin Daniel AVRAM Economic Informatics Department, Academy of Economic Studies

More information

Classic Grid Architecture

Classic Grid Architecture Peer-to to-peer Grids Classic Grid Architecture Resources Database Database Netsolve Collaboration Composition Content Access Computing Security Middle Tier Brokers Service Providers Middle Tier becomes

More information

CommuniGate Pro White Paper. Dynamic Clustering Solution. For Reliable and Scalable. Messaging

CommuniGate Pro White Paper. Dynamic Clustering Solution. For Reliable and Scalable. Messaging CommuniGate Pro White Paper Dynamic Clustering Solution For Reliable and Scalable Messaging Date April 2002 Modern E-Mail Systems: Achieving Speed, Stability and Growth E-mail becomes more important each

More information

The Role and uses of Peer-to-Peer in file-sharing. Computer Communication & Distributed Systems EDA 390

The Role and uses of Peer-to-Peer in file-sharing. Computer Communication & Distributed Systems EDA 390 The Role and uses of Peer-to-Peer in file-sharing Computer Communication & Distributed Systems EDA 390 Jenny Bengtsson Prarthanaa Khokar jenben@dtek.chalmers.se prarthan@dtek.chalmers.se Gothenburg, May

More information

Provisioning and Resource Management at Large Scale (Kadeploy and OAR)

Provisioning and Resource Management at Large Scale (Kadeploy and OAR) Provisioning and Resource Management at Large Scale (Kadeploy and OAR) Olivier Richard Laboratoire d Informatique de Grenoble (LIG) Projet INRIA Mescal 31 octobre 2007 Olivier Richard ( Laboratoire d Informatique

More information

The Sierra Clustered Database Engine, the technology at the heart of

The Sierra Clustered Database Engine, the technology at the heart of A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel

More information

Designing a Cloud Storage System

Designing a Cloud Storage System Designing a Cloud Storage System End to End Cloud Storage When designing a cloud storage system, there is value in decoupling the system s archival capacity (its ability to persistently store large volumes

More information

Request Routing, Load-Balancing and Fault- Tolerance Solution - MediaDNS

Request Routing, Load-Balancing and Fault- Tolerance Solution - MediaDNS White paper Request Routing, Load-Balancing and Fault- Tolerance Solution - MediaDNS June 2001 Response in Global Environment Simply by connecting to the Internet, local businesses transform themselves

More information

A Scalable Network Monitoring and Bandwidth Throttling System for Cloud Computing

A Scalable Network Monitoring and Bandwidth Throttling System for Cloud Computing A Scalable Network Monitoring and Bandwidth Throttling System for Cloud Computing N.F. Huysamen and A.E. Krzesinski Department of Mathematical Sciences University of Stellenbosch 7600 Stellenbosch, South

More information

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:

More information

- Behind The Cloud -

- Behind The Cloud - - Behind The Cloud - Infrastructure and Technologies used for Cloud Computing Alexander Huemer, 0025380 Johann Taferl, 0320039 Florian Landolt, 0420673 Seminar aus Informatik, University of Salzburg Overview

More information

Concepts and Architecture of the Grid. Summary of Grid 2, Chapter 4

Concepts and Architecture of the Grid. Summary of Grid 2, Chapter 4 Concepts and Architecture of the Grid Summary of Grid 2, Chapter 4 Concepts of Grid Mantra: Coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations Allows

More information

Peer to Peer Search Engine and Collaboration Platform Based on JXTA Protocol

Peer to Peer Search Engine and Collaboration Platform Based on JXTA Protocol Peer to Peer Search Engine and Collaboration Platform Based on JXTA Protocol Andraž Jere, Marko Meža, Boštjan Marušič, Štefan Dobravec, Tomaž Finkšt, Jurij F. Tasič Faculty of Electrical Engineering Tržaška

More information

Intelligent Content Delivery Network (CDN) The New Generation of High-Quality Network

Intelligent Content Delivery Network (CDN) The New Generation of High-Quality Network White paper Intelligent Content Delivery Network (CDN) The New Generation of High-Quality Network July 2001 Executive Summary Rich media content like audio and video streaming over the Internet is becoming

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

COMP5426 Parallel and Distributed Computing. Distributed Systems: Client/Server and Clusters

COMP5426 Parallel and Distributed Computing. Distributed Systems: Client/Server and Clusters COMP5426 Parallel and Distributed Computing Distributed Systems: Client/Server and Clusters Client/Server Computing Client Client machines are generally single-user workstations providing a user-friendly

More information

A Peer-to-Peer Approach to Content Dissemination and Search in Collaborative Networks

A Peer-to-Peer Approach to Content Dissemination and Search in Collaborative Networks A Peer-to-Peer Approach to Content Dissemination and Search in Collaborative Networks Ismail Bhana and David Johnson Advanced Computing and Emerging Technologies Centre, School of Systems Engineering,

More information

Collaborative & Integrated Network & Systems Management: Management Using Grid Technologies

Collaborative & Integrated Network & Systems Management: Management Using Grid Technologies 2011 International Conference on Computer Communication and Management Proc.of CSIT vol.5 (2011) (2011) IACSIT Press, Singapore Collaborative & Integrated Network & Systems Management: Management Using

More information

Centralized Systems. A Centralized Computer System. Chapter 18: Database System Architectures

Centralized Systems. A Centralized Computer System. Chapter 18: Database System Architectures Chapter 18: Database System Architectures Centralized Systems! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types! Run on a single computer system and do

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Project Proposal. Data Storage / Retrieval with Access Control, Security and Pre-Fetching

Project Proposal. Data Storage / Retrieval with Access Control, Security and Pre-Fetching 1 Project Proposal Data Storage / Retrieval with Access Control, Security and Pre- Presented By: Shashank Newadkar Aditya Dev Sarvesh Sharma Advisor: Prof. Ming-Hwa Wang COEN 241 - Cloud Computing Page

More information

A Framework for Highly Available Services Based on Group Communication

A Framework for Highly Available Services Based on Group Communication A Framework for Highly Available Services Based on Group Communication Alan Fekete fekete@cs.usyd.edu.au http://www.cs.usyd.edu.au/ fekete Department of Computer Science F09 University of Sydney 2006,

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

TIBCO ActiveSpaces Use Cases How in-memory computing supercharges your infrastructure

TIBCO ActiveSpaces Use Cases How in-memory computing supercharges your infrastructure TIBCO Use Cases How in-memory computing supercharges your infrastructure is a great solution for lifting the burden of big data, reducing reliance on costly transactional systems, and building highly scalable,

More information

IBM Deep Computing Visualization Offering

IBM Deep Computing Visualization Offering P - 271 IBM Deep Computing Visualization Offering Parijat Sharma, Infrastructure Solution Architect, IBM India Pvt Ltd. email: parijatsharma@in.ibm.com Summary Deep Computing Visualization in Oil & Gas

More information

Software design (Cont.)

Software design (Cont.) Package diagrams Architectural styles Software design (Cont.) Design modelling technique: Package Diagrams Package: A module containing any number of classes Packages can be nested arbitrarily E.g.: Java

More information

Optimizing Shared Resource Contention in HPC Clusters

Optimizing Shared Resource Contention in HPC Clusters Optimizing Shared Resource Contention in HPC Clusters Sergey Blagodurov Simon Fraser University Alexandra Fedorova Simon Fraser University Abstract Contention for shared resources in HPC clusters occurs

More information