STATE ENGINEERING UNIVERSITY OF ARMENIA
Department of Computer Systems and Informatics

Artem T. HARUTYUNYAN

Development of Resource Sharing System Components for AliEn Grid Infrastructure

CERN-THESIS /05/2010

Ph.D. Thesis

Scientific adviser: Prof. Ara A. GRIGORYAN

Yerevan-2010

CONTENTS

INTRODUCTION
ACKNOWLEDGEMENTS
CHAPTER 1. GRID COMPUTING AND CLOUD COMPUTING
  Grid computing concepts
  The architecture of the Grid
    End systems
    Clusters
    Intranets
    Internets
    Core Grid services
  Implementations of Grid infrastructures and projects
    Grid projects worldwide
    National Grid initiative in Armenia
  Cloud computing concepts
    Infrastructure as a Service (IaaS)
    Platform as a Service (PaaS)
    Software as a Service (SaaS)
  Summary of Chapter 1
CHAPTER 2. ALIEN - GRID INFRASTRUCTURE OF CERN ALICE EXPERIMENT
  CERN ALICE experiment
  Distributed computing architecture of ALICE
  The architecture of AliEn
  AliEn file and metadata catalogue
  AliEn monitoring system
  AliEn Workload Management System (WMS)
  Problem definition
  Summary of Chapter 2
CHAPTER 3. DESIGN AND DEVELOPMENT OF GRID BANKING SERVICE MODEL FOR JOB SCHEDULING IN ALIEN
  Development of the Grid Banking Service (GBS) model for AliEn WMS
  Discrete-event system model for the simulation of the work of AliEn WMS and GBS
  The simulation toolkit details
  Evaluation of the GBS model with the use of the simulation toolkit
  Integration of the banking service with AliEn WMS
  Summary of Chapter 3
CHAPTER 4. DEVELOPMENT OF TWO MODELS OF INTEGRATION OF CLOUD COMPUTING RESOURCES WITH ALIEN
  CernVM - a virtual appliance for LHC applications
  Nimbus - a toolkit for building IaaS computing clouds
  Development of the Classic model for integration of cloud computing resources with AliEn
  Development of the Co-Pilot model for integration of cloud computing resources with AliEn
  Development of the Co-Pilot Agent - Co-Pilot Adapter communication protocol
  Comparison of the Classic and Co-Pilot models. Measurement of their timing characteristics
  Summary of Chapter 4
CHAPTER 5. DEVELOPMENT OF SASL BASED SECURITY SYSTEM AND DEMONSTRATION OF THE PORTABILITY OF ALIEN CLIENT PART TO WINDOWS
  Development of SASL based authentication and authorization system in AliEn
  Demonstration of the portability of the client part of AliEn to Windows
  Summary of Chapter 5
BIBLIOGRAPHY
Appendix A. Glossary of acronyms
Appendix B. AliEn site description file for deploying dynamic virtual sites on Nimbus IaaS cloud (Classic model)
Appendix C. Implementation certificate (YerPhI)
Appendix D. Implementation certificate (CERN)
Appendix E. Implementation certificate (University of Chicago and Argonne National Laboratory)

INTRODUCTION

The problem of resource provision, sharing, accounting and use is a principal issue in contemporary scientific cyberinfrastructures. For example, collaborations in physics, astrophysics, Earth science, biology and medicine need to store huge amounts of data (of the order of several petabytes; 1 PB = 2^50 bytes) as well as to conduct highly intensive computations. The required computing and storage capacities cannot be provided by any single (even very large) research center. The modern approach to the solution of this problem is to exploit the computational and data storage facilities of the centers participating in the collaborations. The most advanced implementation of this approach is based on Grid technologies, which enable effective work of the members of collaborations regardless of their geographical location. Currently there are several tens of Grid infrastructures deployed all over the world. The Grid infrastructures of the CERN Large Hadron Collider experiments - ALICE, ATLAS, CMS, and LHCb - which are exploited by specialists from five inhabited continents, are among the largest ones.

A decade of extensive exploitation of Grid resources by various scientific communities has revealed the following problems:

- the need for appropriate coordination of resource usage and for accounting of the resources;
- the need to increase the computing and storage capacity of the Grid by a seamless integration of external resources;
- the need to minimize the work of resource administrators on the maintenance and support of the specific application software required by different scientific communities;
- the need for secure access to resources on the basis of different authentication mechanisms.

This dissertation is devoted to the solution of the aforementioned problems within the Grid infrastructure of the ALICE experiment at the CERN Large Hadron Collider (LHC), called AliEn (ALICE Environment on the Grid). AliEn is a set of Grid middleware and application tools and services which are exploited by the ALICE collaboration to store and analyze the experimental data, as well as to perform Monte-Carlo simulations. AliEn uses computing and data storage facilities of member institutions from Europe, Asia, the Americas and Africa, about 100 centers overall [1]. Yerevan Physics Institute is a participant in the ALICE collaboration.

[1] AliEn is also exploited by other physics collaborations: Panda and CBM at GSI (Gesellschaft für Schwerionenforschung, ion research laboratory at Darmstadt, Germany), as well as by MammoGrid, the European Federated Mammogram Database project.

The objectives of the work are:

- design and development of a model for the coordination and accounting of the use of resources in the AliEn Workload Management System;
- design and development of a model for the seamless integration of external resources provided using Cloud Computing technologies;
- design and development of an authentication and authorization framework for access to the AliEn Grid resources;
- demonstration of the portability of the AliEn code to different operating systems.

The main results of the work are:

- Development and implementation of a model of the Grid Banking Service for job scheduling in the Workload Management System of AliEn. The service provides a flexible job scheduling scheme based on the collaborative use of resources, in which users pay for the execution of their jobs from their bank accounts to the sites where the jobs were executed.

- Simulations have been performed for evaluating the Grid Banking Service model and studying its efficiency. The analysis of the simulation results has shown that with the use of the Grid Banking Service the waiting time of users' jobs can be significantly decreased.

- Development and implementation of the Classic and Co-Pilot models. The models enable one-click dynamic integration of external resources, provided using Cloud Computing technologies, into AliEn. i) In the Classic model the integration is based on the dynamic deployment and configuration of Grid site services and application software on the computing cloud. ii) In the Co-Pilot model only the application software is deployed on the cloud, whereas the functionality of the Grid site services is provided by the specially developed Co-Pilot Agent and Co-Pilot Adapter services, as well as by the Co-Pilot Agent - Co-Pilot Adapter communication protocol. A comparative analysis of the deployment and performance timing characteristics of the Classic and Co-Pilot models has been performed.

- Development and implementation of a flexible modular framework for authentication and authorization in AliEn, in accordance with the SASL (Simple Authentication and Security Layer, RFC 2222) standard.

- Demonstration of the portability of the AliEn client part, which had been created for Linux, to the Microsoft Windows operating system. Development of the installation package for Windows 2000 and XP.

Practical significance

The Grid Banking Service gives an additional degree of freedom for the refinement of the Grid resource sharing system by introducing a flexible job scheduling scheme which provides users with control over the priorities of their jobs. A requirement to raise the priority may arise, for example, when a user or a group of users carries out intensive calculations for the urgent submission of a conference paper. On the other hand, there may be a need to lower the priority of jobs (such as those submitted by the Monte-Carlo data production operators) in order to execute them without hindering the work of regular Grid users.

The toolkit for the simulation of the work of the AliEn Workload Management System and the Grid Banking Service gives Grid administrators the possibility to model the work of these two services in order to increase the effectiveness of the resource use.

The Classic and Co-Pilot models enable two paradigms of one-click dynamic integration of computing resources, available from academic and commercial Cloud Computing providers, into the resource pool accessible to Grid users. This not only allows increasing the computing and storage capacity of the AliEn Grid, but also minimizes the work of resource administrators in supporting specific applications from various scientific domains.

The modular architecture of the authentication and authorization framework allows Grid participants to authenticate with different mechanisms (X.509 digital certificates, RSA keys, randomly generated string tokens).

The ported version of the AliEn client part makes the AliEn infrastructure accessible to Microsoft Windows users.

Practical implementation

The results presented in the dissertation have been obtained in the framework of the collaboration of the YerPhI/ALICE group from the Yerevan Physics Institute with CERN, the University of Chicago and Argonne National Laboratory. Their implementation details are presented below:

- In 2007, the authentication and authorization framework was integrated into the AliEn central and site services, as well as into the AliEn client part.

- In 2008, the Grid Banking Service was integrated into the central and site services of the AliEn Grid infrastructure.

- The implementation of the Classic and Co-Pilot models for the AliEn Grid has been performed on the Nimbus Scientific Computing Cloud of the University of Chicago/Argonne National Laboratory. The work has been conducted in collaboration between YerPhI, CERN, the University of Chicago and Argonne National Laboratory.

Implementation in YerPhI

It is necessary to underline first of all that the rights to exploit the intellectual products (including computing tools) created within the ALICE collaboration are shared by all collaboration members, so any contribution to ALICE is to be considered a contribution to YerPhI as well. The details of the implementation of the results of the dissertation in YerPhI are the following:

- In 2004, the client part of AliEn ported to the Windows OS was implemented in the YerPhI/ALICE group.

- In 2007, the modular framework for authentication and authorization based on the SASL standard was integrated with the services of the YerPhI/ALICE group AliEn Grid site.

- In 2008, the Grid Banking Service for job scheduling in AliEn was integrated with the services of the YerPhI/ALICE group AliEn Grid site.

The implementation of AliEn in YerPhI makes the high performance capacities of the Worldwide LHC Computing Grid infrastructure, as well as the data produced by the ALICE experiment, accessible to Armenian specialists. The corresponding implementation certificates are appended to the dissertation.

Implication of the obtained results for ArmGrid

Currently ArmGrid has several sites/resources functioning in different national scientific institutions. The Grid Banking Service model developed in the dissertation can help to account for and optimize the use of these resources. Currently only scientific applications which are available under Linux can be executed within ArmGrid. On the other hand, a significant part of the Armenian scientific community runs application software created for the Windows OS. Deployment of Cloud Computing software/middleware (e.g. the open source Nimbus cloudkit) on the ArmGrid resources would turn them into IaaS clouds and would therefore make these resources available to Windows users too. The clouds could then be integrated into the ArmGrid infrastructure by applying the Classic and Co-Pilot models developed in the dissertation.

ACKNOWLEDGEMENTS

The realization of this work would not have been possible without the continuous professional guidance and moral support of my supervisor, Prof. Ara Grigoryan, and I am extremely grateful to him for that.

I would like to thank the two official reviewers of the thesis, Prof. Armenak Palyan from the State Engineering University of Armenia and Dr. Federico Carminati from CERN, for their comprehensive review of my dissertation and their valuable comments.

I enjoyed very much the great working environment of the Yerevan Physics Institute, and in particular that of the YerPhI ALICE group. I would like to thank Armenuhi Abrahamyan, Marine Atayan, Natella Grigoryan, Hrant Gulkanyan, Hayk Haroyan, Arsen Hayrapetyan, Vanik Kakoyan, Zhanna Karamyan, Narine Manukyan, and Vardanush Papikyan for their professional support and friendship.

My colleagues and friends from CERN helped me a lot in understanding various aspects of computer science and in solving complex technical problems. I appreciate the pleasant and fruitful collaboration with Carlos Aguado Sanchez, Latchezar Betev, Jakob Blomer, Predrag Buncic, Catalin Cirstoiu, Alberto Cola, Alina Gabriela Grigoras, Costin Grigoras, Raffaele Grosso, Andreas Joachim Peters, Pablo Saiz, and Matevz Tadel.

The work with Tim Freeman and Kate Keahey, from the University of Chicago and Argonne National Laboratory, was significant for deepening my knowledge in the areas of virtualization and cloud computing.

The accomplishment of this work would not have been possible without the continuous support of my parents, grandparents, my wife, my sister, other members of the family, and, of course, my friends.

During the work on the dissertation I received financial aid from the Swiss Fonds Kidagan, the Calouste Gulbenkian Foundation, the CERN ALICE Offline and PH/SFT groups, as well as from the Google Summer of Code 2008 program.

CHAPTER 1. GRID COMPUTING AND CLOUD COMPUTING

1.1 Grid computing concepts

Nowadays scientific collaborations in physics, astrophysics, Earth science, biology and medicine make extensive use of computing and data storage infrastructures for modeling (simulations) and for the analysis of experimental data. Because of the complexity of the calculations and the huge (of the order of tens of petabytes) amounts of data the experiments collect, it is not always possible to gather all the necessary computing power and storage capacity in one geographical location. That is the main reason why these infrastructures are usually distributed across different countries and even continents. To build such infrastructures, computers and data storage units are organized into so-called computational and data Grids. Grids allow scientists to seamlessly access data and exploit computing resources which are distributed across many institutions and universities in the world.

I. Foster, C. Kesselman and S. Tuecke, in their paper The Anatomy of the Grid [1], define the Grid as an infrastructure which enables flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources. The term infrastructure is very important because a computational Grid is concerned with large-scale pooling of resources, which can include computation cycles, experimental data, scientific instruments, or people. Such pooling requires significant hardware resources to achieve the necessary interconnections, and software infrastructure to monitor, control and provide access to the resulting ensemble. The term sharing in this context is not limited to a simple file exchange but implies direct access to computers, software applications, experimental data and tools, which is vital for a range of collaborative problem solving and resource brokering strategies exploited in industry, science, and engineering. Within this sharing scheme resource providers and consumers explicitly define what resources are being shared, who is allowed to share and access the resources, and also the conditions under which the sharing takes place. A set of individuals and/or institutions defined by such sharing rules forms what is called a Virtual Organization (VO) [2-4].

The authors of the computational Grid concept were inspired by the electric power grid [5]. The status of computation before Grid computing emerged was somewhat analogous to that of electricity at the beginning of the 20th century. At that time electric power generation was possible, and new devices were being invented that exploited electric power, but the need for each user to build and operate a new generator made their use very impractical. The truly revolutionary development was not, in fact, electricity, but the electric power grid and the associated transmission and distribution technologies. Together, these developments provided reliable, low-cost access to a standardized power supply service. The result was that power which before was accessible only in crude and not especially portable forms (human effort, horses, water power, steam engines, candles) became universally accessible. By allowing both individuals and industries to take for granted the availability of cheap, reliable power, the electric power grid made possible both new devices and the new industries that manufactured them. By analogy, the term computational grid has been adopted for the e-infrastructure that absorbed the growing demand for computation from different scientific fields and provided the means for satisfying that demand.

The concept of exploiting distributed resources cooperatively working towards the solution of a single scientific problem was introduced several decades ago [6]. For example, in the late seventies researchers were working on networked operating systems [7-10], and in the late eighties distributed operating systems [11-17] were in their research focus. In the nineties heterogeneous computing [18-22] emerged, which later transformed into what was called metacomputing [23-26]. The next phase of development brought Grid computing into play.

Computational Grids, however, have some distinct characteristics [6] which make them different from analogous systems which existed in the past:

- Grids are user oriented. This is perhaps the most important, and yet the most subtle, difference. Previous systems were developed for and by the resource owner in order to maximize utilization and throughput. In Grid computing, users have the possibility to choose the specific machines that will be exploited to execute an application and store the application output. This allows maximizing the performance of the application, regardless of the effect on the system as a whole.

- Grids involve more resources than just computers and networks. Grid computing today is as much about the data as it is about the computational power and storage capacity. Grids are able to manage the physical location of the data, store petabyte data sets, replicate the needed pieces of data and provide access to them through the network. The data produced by specialized scientific instruments is also being made accessible through the Grid. Examples of such instruments are particle accelerators [27, 28] and telescopes [29].

- The underlying hardware and software infrastructure of the Grid is heterogeneous. Making every institution or organization use the same software and hardware is not possible in practice. That is why Grids attempt to provide common protocols and abstraction layers so that resources with different hardware and software configurations can be consolidated within a big system and made available to the end user in a coherent way.

- Grids respect the autonomy of resource providers (or sites). Sites entering the Grid have local control over their resources; they can define usage policies and keep an account of how, when and by whom the resources they provide are used.

1.2 The architecture of the Grid

As already mentioned, computational Grids are created to serve different communities with widely varying characteristics and requirements. According to I. Foster and C. Kesselman (Chapter 2 of [5]), this is the reason why there is no single Grid architecture; nevertheless, there is a set of basic services that most Grids provide. Different Grids, however, adopt different approaches to the implementation of these services. The decisions about the techniques which are used to implement Grid services are taken depending on the scale at which the given service is going to be exploited. The authors of Chapter 2 of [5] note that Grid infrastructures, like other computational infrastructures, are fractal, or self-similar at different scales. For example, there are interconnections between countries, organizations, clusters, and computers; between components of a computer; and even within a single component. However, at different scales these interconnections use different protocols and operate in different physical modes. The authors of [5] make a reasonable suggestion to classify Grid infrastructure constituents according to their scale and complexity. Below an example of such a classification is presented. Each layer is built upon the services and features provided by the lower layer.

End systems

Individual end systems (computers, storage systems, sensors, and other devices) are of a relatively small scale and a high degree of homogeneity and integration. There are typically just a few tens of hardware components (processors, disks, etc.), which comply with established hardware design standards and thus interoperate without problems. The operating systems which control them provide management tools and allow maximizing their performance without much effort from the user's side. One has to mention, however, that in some cases specialized instruments may be much more complex and could in principle have thousands of components. Such end systems represent the simplest, and most intensively studied, environment in which basic services are provided.

Today's end systems use well known software architectures [31, 32]. Basic services such as memory or network access, input/output operations etc. are provided by an operating system, which manages the hardware resources of the computer. The operating system handles authentication and provides interfaces for users to acquire resources, access files, and so on. Because of the integrated nature of operating systems and conventional hardware it becomes possible to provide high-performance implementations of important functions such as physical memory access and input/output. Programmers develop applications for these end systems by using a variety of high-level languages and tools.

During the last decade much effort has been put into adding to conventional operating systems the features necessary for integration into clusters and networked environments. For example, a lot of work was carried out both by software and hardware specialists to reduce the overheads of network communication and increase communication rates [33-35], which resulted in an increase of the communication rates of conventional network equipment by several orders of magnitude (from the order of several Mbit/s to the order of Gbit/s). As another example one can mention so-called sandbox features, which allow creating secure and isolated environments for the execution of different processes on the same end system. Sandboxes can be provided using special software [36] or in the form of virtual machines [37, 38].

Clusters

Clusters are interconnected groups of end systems working together as an integrated computing or data processing unit. Computers which form a cluster are usually connected by a high-speed local area network. Clusters have similarities to individual end systems: as a whole they are seen as homogeneous entities, and the difference between their components is in the configuration but not in the basic architecture. Another similarity is that clusters have centralized management: they are managed either by a system administrator or, in the case of clusters with several hundred machines, by a single group of system administrators. The manager of the cluster has full control over each of its components.

Clusters, however, introduce a bigger physical scale compared to individual end systems. They may combine from several tens to several thousand processor cores and disks, totaling teraflops of computing power and petabytes of storage space [39-40], and thus they require different approaches to the monitoring, control and allocation of the resources they provide. In addition, clusters are less integrated, because they are normally constructed from commodity parts interconnected by means of a local area network, so their communication performance is reduced. As already mentioned, the increased scale and reduced integration of clusters require new services which are not needed in an individual end system; they also make the functions which provide basic services like communication or input/output more complex (e.g. network communication vs. a shared address space between processes which run within a single operating system), which results in a degradation of the performance of applications. A variety of tools which provide such services in cluster environments exist [41-49]. However, developing performance-critical applications with these tools requires much skill and considerable additional effort.

Intranets

The next class of systems that we consider is intranets, constituents of the Grid which are built from a potentially large number of resources that belong to the same institution. Like clusters, intranets can also have centralized control and their resources may be well coordinated. However, in intranets the level of heterogeneity is higher compared to that of cluster environments: the end systems and networks used in intranets are of different types and have different capabilities, and because of that one cannot assume that all end systems use the same type of operating system and software tools.

Another characteristic that raises the level of heterogeneity and brings the requirement of negotiating potentially conflicting resource usage policies is the separate administration of individual resources. All these factors result in a lack of global knowledge: because of the increased number of end systems it is not possible, in general, for a single entity to have precise global knowledge about the state of the system.

Because of the bigger heterogeneity and scale of intranets, their operation requires services which are not needed in cluster environments. For example, services which provide information (characteristics, location, etc.) about the resources currently available on the intranet may be needed. The demand for a unified authentication and authorization structure in intranets also dictates the need for sophisticated security systems like Kerberos [50]. Unlike in cluster systems, the software commonly exploited in intranets does not provide resource allocation or process creation features. The type and architecture of the underlying hardware is hidden from the communicating parties, and resources are provided in the form of well defined computational services. Communicating parties interact using special systems (e.g. DCOM [51], CORBA [52, 53], Java RMI [54]) which provide remote procedure call or remote method invocation features. In general, intranets provide a poorly integrated set of services which focus more on data sharing (e.g. Web services, databases, distributed file systems) or on ensuring access to specialized services, rather than on providing support for the coordinated use of multiple end system resources.

Internets

The most challenging environments for network computing are internets, which span multiple geographically distributed institutions and are thus very big and heterogeneous. There is a wide variation in policy and quality because of the absence of a central body which can enforce policies and ensure quality in internets. Network performance characteristics have substantial differences compared to those of local area networks or intranets because of the geographical distribution. This distribution also leads to international issues like embargoes on the use of certain security technologies. The heterogeneity and scale of internets bring in the requirement of introducing new services and bodies which were not needed in intranets or clusters. An example of such an institution is IGTF [55], an international body which works towards the harmonization of the policies of different national academic Certification Authorities [56]. Certification Authorities act as trusted third parties and enable the establishment of trust relationships (something which is implied in the case of intranets and clusters) between the institutions which provide resources and the users. Another example of an additional service is the so-called meta-scheduler. In the case of intranets and clusters all resources are managed by a single scheduler (like the Portable Batch Scheduler [57] or Condor [58]), whereas in internets one needs a service which is able to act according to the various scheduling policies that apply on different resources and to communicate with different schedulers.

Core Grid services

The software components which provide the following core services are essential for the operation of a Grid infrastructure:

- Authentication and authorization. Authentication is the initial step in any computation which involves shared resources. It establishes the identity of the user, whereas authorization determines the privileges which an authenticated user has and the operations which the user is permitted to perform.

- Resource discovery and allocation. This service manages the resources available within a Grid infrastructure according to a given resource allocation policy and makes possible the acquisition of those resources by the users who have been authorized for their exploitation. The service also ensures that the computation of one user does not interfere with the computations of another.

- Data management. This service makes possible the transfer of data between different components of the system (e.g. from the physical data storage to the machine of the user).

- Monitoring and accounting. These services keep the record of the available resources and the statistics of their usage. They can optionally provide means to correlate resource consumption with some common currency (see Chapter 3 for more details).

1.3 Implementations of Grid infrastructures and projects

Grid projects worldwide

As has already been mentioned, Grid technologies serve the needs of various scientific communities. Below we present some examples of existing Grid projects and infrastructures with a description of each of them. We also present an overview of several national Grid infrastructures/initiatives.

Enabling Grids for E-sciencE (EGEE)

Enabling Grids for E-sciencE [59] represents the world's largest multidisciplinary Grid infrastructure today. This Grid includes resources from more than 250 computer centers from 48 countries, providing a very large number of CPUs and several petabytes of storage. The infrastructure serves the needs of about 5000 users, who form some 200 VOs, and runs a very large number of Grid computing jobs per day. EGEE users come from disciplines as diverse as archaeology, astronomy, astrophysics, computational chemistry, Earth science, finance, fusion, geophysics, high-energy physics, life sciences, material sciences, etc. The EGEE infrastructure consists of a set of services and testbeds (a certification testbed, a preproduction service, and a production service), support structures (e.g. regional operation centers, the Grid Security Vulnerability Group, Global Grid User Support), as well as bodies which elaborate policies and carry out the overall project management.

EGEE is built using the gLite middleware [60], which contains components from several related projects (e.g. the Globus Toolkit [61], Condor [62]) as well as services developed by EGEE community members. gLite provides users with high-level services for scheduling and running computational jobs, accessing and moving data, and obtaining information about the Grid infrastructure as well as about Grid applications. The EGEE project collaborates with several other Grid infrastructure projects: BalticGrid, the E-Science Grid facility for Europe and Latin America (EELA), EUChinaGrid, EUIndiaGrid, EUMedGrid and the South-East European Grid (SEE-GRID). The infrastructure provided by these projects is built using gLite and covers many parts of the world.

National Virtual Observatory (NVO)

The aim of this project is the collection of digital astronomical data. The initial goal of the project is to facilitate access to existing sky surveys and to provide standard services for manipulating catalogs and image collections. The long-range goal is to support analyses of entire sky surveys and to enable applications that examine multiple collections. NVO is based on web services, data analysis pipelines, and Grid software to support astronomical research. The NVO infrastructure enables users to access digital archives, data discovery and access services, programming interfaces and computational services, and provides metadata management tools [63].

Earth System Grid (ESG)

The goal of ESG is to greatly improve the utility of shared community climate model datasets by enhancing the scientist's ability to discover and access datasets of interest, as well as by enabling increased access to the tools required to analyze and interpret the data. The ESG project has created a virtual collaborative environment linking distributed centers, users, models, and data. Participants in ESG include several US national laboratories and research institutes. The major components of the ESG infrastructure are 1) database, storage, and application servers (these include the computational and storage resources, as well as application servers for hosting the ESG portal services); 2) Globus/Grid services (these provide remote, authenticated access to shared ESG data and computational resources, as well as tools for the submission and management of Grid jobs); 3) high-level and ESG-specific services (these provide functionality such as site-to-site data movement, distributed and reliable metadata access, data aggregation and filtering); 4) ESG applications (these include the web-based interface to ESG services, user-level tools for data publication, as well as assorted clients for data analysis and visualization) [64].

MammoGrid

The MammoGrid project is aimed at the development of a pan-European database of mammograms, which will be used to investigate a set of very important healthcare applications as well as to evaluate the potential of the Grid to support effective collaboration between healthcare specialists from across the European Union. Medical conditions such as breast cancer, and mammograms as images, are extremely complex, and thus the variation of different parameters across the population is significant. MammoGrid enables scientists to access a very large database with statistically significant numbers of examples of conditions and provides tools for analyzing that data (e.g. tools to automatically extract tissue information that can be used to perform clinical studies, or to automatically extract image information that can be used to perform quality controls on the acquisition process of the participating centers) [65]. At each hospital participating in the project Gridboxes (secure hardware units) were deployed. They run the Grid middleware developed by the CERN AliEn project (see Chapter 2 for a detailed description of AliEn) and provide a single point of entry into the MammoGrid.

TeraGrid

TeraGrid serves about 4000 users from nearly 200 academic institutions with computing and storage resources housed at 11 resource provider sites. The TeraGrid user community is very diverse: specialists exploiting TeraGrid come from physics, astronomical sciences, chemistry, materials research, atmospheric sciences, Earth sciences, biological and critical systems, ocean sciences, neuroscience, computer and computation research, environmental biology, etc. The computational and storage resources provided by the resource providers are integrated using a service oriented architecture approach: each resource provides a service which implements a specific operation and has a well defined access interface. Individual resources are integrated into a single Grid environment using a set of software packages called "Coordinated TeraGrid Software and Services" (CTSS). CTSS provides a standardized environment for the users on all TeraGrid systems, allowing scientists to easily port their applications and code from one system to another. CTSS also provides functions such as single sign-on, remote job submission, workflow support, data movement tools, etc. CTSS is built on the basis of the Globus Toolkit [61], Condor [58], distributed accounting and account management software, verification and validation software, as well as standard programming tools (e.g. compilers). For more details on TeraGrid please refer to [66].

Access Grid

Unlike the Grid infrastructures from the previous examples, Access Grid is not meant for uniting computational and storage resources into a single infrastructure. Instead, it is an ensemble of resources which include multimedia large-format displays, presentation and videoconferencing devices, interactive environments, as well as interfaces to Grid middleware and to visualization environments. These resources are used to support group-to-group interactions across the Grid. Access Grid is used as an advanced type of videoconferencing facility that allows participants from multiple locations on the Internet to interact in real time. However, the Access Grid offers many more features than conventional videoconferencing tools provide: it provides mechanisms to share data, to collaborate using a variety of shared applications (such as sharing presentation material), to utilize large-format displays, and it can employ multiple video sources to allow room-to-room conferencing capabilities. Core components of the Access Grid software include the Robust Audio Tool (RAT), an open-source audio conferencing and streaming application developed by University College London; ViC, the Video Conferencing tool developed by the Network Working Group at Lawrence Berkeley National Laboratory in collaboration with the University of California, Berkeley; and the Access Grid Venue Client. There are currently about 300 Access Grid nodes deployed all over the world [67].

United Kingdom National Grid Service (NGS)

The National Grid Service (NGS) has the mission to provide coherent electronic access for UK researchers to all computational and data based resources and facilities required to carry out their research, independent of resource or researcher location [68]. The service, which is funded by two governmental bodies, the Engineering and Physical Sciences Research Council (EPSRC) and the Joint Information Systems Committee (JISC), provides Grid computing resources and additional services for over 500 users from the United Kingdom scientific and academic communities. NGS combines computing and storage resources from 22 sites across the United Kingdom, and provides access to them through a common set of services which use the Globus Toolkit [61] for job submission and management, and the Storage Resource Broker (SRB) [69] for data management. NGS resources host a number of scientific software packages, such as SIESTA and GAUSSIAN. Apart from providing access to computational and storage resources, the NGS also offers training (through the National e-Science Centre in Edinburgh) and Grid support to all UK academics and researchers in Grid computing.

Open Science Grid

The Open Science Grid (OSG) [70] provides a distributed computational infrastructure for science research. In OSG, computing and storage resources from across the United States, owned by members of an open consortium, are united within a single Grid infrastructure using the OSG middleware. The OSG provides support, operational security, common software, and other facility services such as accounting, monitoring, and resource information on which all OSG members rely. The majority of OSG users come from large physics collaborations: ATLAS, CDF, CMS, D0 and LIGO. There are more than forty active computational sites in the OSG. The infrastructure supports a job throughput of more than a hundred thousand CPU hours a day and supports several hundred users. Another important activity of the OSG is the provision and support of a common, integrated, supported set of software components, which are available for installation and use on many platforms for both resource providers and users of the infrastructure. The OSG middleware is based on the Condor [58] and Globus [61] toolkit software packages, which are used for the provision of base Grid services, as well as on about thirty additional software components, including components provided by other computer science groups (e.g. EGEE [60]), US national laboratories (Brookhaven, Fermilab and LBNL) and user communities.

D-Grid: the German Grid initiative

The D-Grid project has the goal of designing, building and operating a network of distributed, high-performance computing resources and related services to enable the processing of large amounts of scientific data and information. The development and operation of this Grid infrastructure is proceeding in several overlapping stages. In the first stage a Grid services infrastructure for scientists has been built. It has been tested and used by so-called Community Grids in the areas of high-energy physics, astrophysics, alternative energy, medicine, climate research, engineering, and scientific libraries. The short-term goal, which was building a core Grid infrastructure for the German scientific community, has been achieved at this point. In the subsequent stage the range of supported communities has been broadened to include industry and businesses by the adoption of applications from the construction industry, finance, aerospace and automotive, enterprise information and resource planning systems [71]. The D-Grid infrastructure is built on the basis of the UNICORE [72], gLite [60] and Globus [61] middleware packages.

National Grid initiative in Armenia

The work on the study of Grid middleware in Armenia began in 2001, with the installation of the Globus Grid middleware toolkit [61] (version 1.2) in the Yerevan Physics Institute (YerPhI). In the course of general studies of the Grid functionality, the Globus toolkits 1.4, 2.0 and 4.0 were deployed in 2002, 2003 and 2005, respectively. The functionality of the last version was evaluated in collaboration with the Oxford e-Science Center and a report was sent to the Globus team. In 2003, the site and client middleware of AliEn (the ALICE experiment environment on the Grid) [73], the Grid infrastructure of the CERN ALICE experiment [74], was installed, which meant the establishment of the first international Grid site in Armenia and in the vast geographical region including the South Caucasus, Turkey and Iran [75, 76]. This allowed the members of the YerPhI/ALICE group to use the resources of AliEn for the Monte-Carlo simulations of ultrarelativistic heavy ion collisions and for the ALICE detector performance studies. In 2006, a User Interface server was configured in YerPhI, allowing YerPhI users to exploit the Grid resources of DESY (Deutsches Elektronen-Synchrotron, the biggest German research center for particle physics) by submitting massive Monte-Carlo production jobs to the DESY VOs (like H1 and HERMES) and to perform the analysis of the accumulated data. In 2007, the middleware of the EGEE (Enabling Grids for E-sciencE) project [59] was deployed in YerPhI, allowing the integration of YerPhI into the world's largest Grid/e-Science infrastructure and providing users with access to it. After the deployment stage, the site passed through a standard EGEE/WLCG [77] site certification process which consisted of the validation of the continuous operation of the site services. During this certification period, different tests were performed by the CERN ROC (Regional Operations Centre) personnel to ensure that all services of the YerPhI site were working properly. On March 20, 2006, YerPhI was officially certified as a production site of EGEE/WLCG [78].

In 2008, representatives of the biggest Armenian scientific institutions and the government signed a memorandum on forming the Armenian National Grid Initiative (ArmNGI). ArmNGI is responsible for the definition and implementation of the national Grid development policy, which is aimed at the establishment of a sustainable national Grid infrastructure in Armenia. An important mission of ArmNGI is the dissemination of Grid technologies within the Armenian academic and scientific community. ArmNGI is a collaboration between the government of the Republic of Armenia and leading Armenian scientific and educational institutions. The members of ArmNGI are the State Scientific Committee of the Ministry of Education and Science of the Republic of Armenia, the National Academy of Sciences of the Republic of Armenia, the State Engineering University of Armenia, Yerevan State University, the Yerevan Physics Institute after A. I. Alikhanian, the Institute for Informatics and Automation Problems of the National Academy of Sciences of the Republic of Armenia, and the Armenian e-Science Foundation. The Grid infrastructure deployed by ArmNGI, called ArmGrid [78], unites distributed computing and storage resources from five computing centers in Armenia, and provides access to them using a common set of core Grid services which are based on the gLite middleware toolkit [60]. ArmGrid supports applications from the areas of High Energy Physics (HEP), astrophysics, biology, seismology and Earth science.

1.4 Cloud computing concepts

Currently there is no widely accepted definition of the term cloud computing. L. Vaquero et al. in their paper A Break in the Clouds: Towards a Cloud Definition [80] present about 20 definitions of the term by different authors and propose the following definition themselves: Clouds are a large pool of easily usable and accessible virtualized resources (such as hardware, development platforms and/or services). These resources can be dynamically reconfigured to adjust to a variable load (scale), allowing also for an optimum resource utilization. This pool of resources is typically exploited by a pay-per-use model in which guarantees are offered by the Infrastructure Provider by means of customized SLAs (SLA stands for Service Level Agreement).

Cloud computing most commonly refers to applications, software platforms or computing infrastructures which are delivered in the form of services over the Internet. There are so-called public computing clouds, i.e. infrastructures which provide services on a commercial basis, or free clouds which are deployed and used by the members of scientific and academic communities, and private computing clouds, i.e. internal computer centers of enterprises or other organizations whose services are not publicly accessible. Sometimes botnets [81], which are groups of ordinary consumers' personal computers connected to the Internet, infected by malicious software and exploited by computer criminal groups to perform illegal actions like distributed denial of service attacks, are referred to as dark clouds.

From the computational and storage points of view cloud computing offers users the following benefits [83]:

- A seemingly unlimited amount of computing and storage resources which can be made available on demand. This feature frees cloud computing users from the necessity to plan far ahead for resource provisioning.

- The elimination of an up-front commitment by cloud users about future resource usage. This feature allows small companies or groups of scientists to start their work or research on relatively small hardware resources and to increase them only when there is a real need (e.g. an increased load on the services which are provided using those resources).

- The ability to lease computing and storage resources on a short-term basis on demand, to pay for the actual use of them (e.g. processors by the hour and storage by the day), and to release them when the resources are not needed anymore (a small illustrative sketch of this pay-per-use model follows below). This feature motivates resource consumers to let machines and storage go when they are no longer useful and makes possible the conservation of unused resources by resource providers.

The classification of different computing clouds is usually done based on the type of services they provide and also on the way those services are provided. The three major types of computing clouds are Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS) clouds. In the following three sections an overview of IaaS, PaaS and SaaS computing clouds is presented.
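To make the pay-per-use idea above concrete, the short sketch below computes an on-demand bill from metered usage. It is only an illustration: the hourly CPU rate, the daily storage rate and the usage figures are hypothetical placeholders and do not correspond to the prices of any real cloud provider.

    # Illustrative pay-per-use billing sketch (Python). The rates below are
    # hypothetical placeholders, not the prices of any actual IaaS provider.
    CPU_HOUR_RATE = 0.10       # currency units per CPU-hour (assumed)
    STORAGE_DAY_RATE = 0.01    # currency units per GB-day of storage (assumed)

    def on_demand_bill(cpu_hours, gb_days):
        """Charge only for the resources that were actually consumed."""
        return cpu_hours * CPU_HOUR_RATE + gb_days * STORAGE_DAY_RATE

    # A short simulation campaign: 200 CPUs for 48 hours, 500 GB kept for 30 days.
    print("Bill: %.2f currency units" % on_demand_bill(200 * 48, 500 * 30))
    # Once the campaign ends the resources are released and billing stops;
    # there is no up-front hardware investment to amortize.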

1.4.1 Infrastructure as a Service (IaaS)

Infrastructure as a Service (IaaS) is a service which allows users to lease remotely accessible computing resources. The term computing resources in this context refers to CPU and storage resources as well as to network traffic. The resource lease occurs without human intervention from the service provider's side. The amount of resources leased can be easily changed with a few mouse clicks (or remote procedure calls). The important aspect of IaaS is that the resources are leased in the form of virtual machines, giving the user full liberty to exploit the leased resources the way he/she deems necessary (e.g. to deploy any operating system).

The provision of such services has become possible only with recent developments in the area of virtualization technologies, in both hardware and software. The major CPU vendors Intel and AMD have started to include virtualization features in their CPUs: Intel VT [84] and AMD-V [85]. These features significantly ease the development of virtualization software and allow virtualized environments to have performance very close to that of the hardware on which they run. Virtualization technologies, along with the widespread deployment of multi-core CPUs, triggered the emergence of software tools which enable virtualization. In the past such tools were expensive and were provided mostly by commercial software vendors, whereas nowadays there are free and open source projects (XEN [86], KVM [87] and Sun VirtualBox [88]) which provide high quality and efficient virtual machine hypervisors. It is important to mention that the performance of applications running in virtual machines is close to that of applications running in native (non-virtualized) environments. Due to the hardware virtualization features of modern CPUs, the performance penalty incurred by the virtualization is within acceptable limits [89].
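As a small practical aside, on a Linux machine one can check whether the CPU exposes the hardware virtualization extensions mentioned above by inspecting /proc/cpuinfo: Intel VT is advertised by the vmx flag and AMD-V by the svm flag. The sketch below assumes a Linux host; note that the flag may also be hidden if virtualization support is disabled in the firmware.

    # Check /proc/cpuinfo for hardware virtualization support (Linux only).
    # Intel VT shows up as the "vmx" flag, AMD-V as the "svm" flag.
    def virtualization_flags(cpuinfo_path="/proc/cpuinfo"):
        found = set()
        with open(cpuinfo_path) as cpuinfo:
            for line in cpuinfo:
                if line.startswith("flags"):
                    flags = line.split(":", 1)[1].split()
                    found.update(f for f in ("vmx", "svm") if f in flags)
        return found

    if __name__ == "__main__":
        flags = virtualization_flags()
        if flags:
            print("Hardware virtualization extensions found:", ", ".join(sorted(flags)))
        else:
            print("No vmx/svm flag found; a hypervisor would have to fall back "
                  "to slower software virtualization techniques.")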

There are currently several commercial companies which provide IaaS services. The biggest of them is Amazon, with its Amazon Elastic Compute Cloud (Amazon EC2) [90] and Amazon Simple Storage Service (Amazon S3) [91]. Other companies which provide similar services are GoGrid [92], RightScale [93], 3tera [94], etc. Scientific and academic institutions build and deploy IaaS environments (sometimes referred to as science clouds) using open source software solutions like Nimbus [95], Eucalyptus [96] or OpenNebula [97]. More details on science clouds (in particular on Nimbus) will be given in Chapter 4.

1.4.2 Platform as a Service (PaaS)

Platform as a Service (PaaS) is the delivery of a platform and tools for web-based software development as a service, i.e. in PaaS the development tools themselves are hosted on the web and are made accessible through the browser. PaaS providers give users the possibility to implement the complete lifecycle of the build and delivery process of web-based applications without the need to obtain and manage the underlying hardware and software systems. They offer tools for cooperative application design and development, software versioning, testing and deployment, and they facilitate database integration, the implementation of security features, and scalability. All these services are usually provisioned as an integrated solution over the web. PaaS provides tools for the fast creation of user interfaces in a way that is easy for the application developer. These tools are either based on HTML and JavaScript or on rich internet application technologies such as Adobe Flash, Adobe Flex, Adobe AIR or Microsoft Silverlight.

One of the biggest PaaS providers is Google with its App Engine service [98]. The service supports application development in the Java, Python, Ruby and JavaScript programming languages. App Engine provides the developer with persistent data storage with queries, sorting and transactions, APIs for authenticating users and sending email using Google Accounts, scheduled tasks for triggering events at specified times and regular intervals, as well as a fully featured local development environment that simulates Google App Engine on the application developer's computer. Applications developed with Google App Engine are hosted on Google's infrastructure, which provides automatic scaling and load balancing.

Another big player on the PaaS market is Microsoft Corporation with its Windows Azure platform [99]. The platform consists of Windows Azure, an operating system as a service; SQL Azure, a fully relational database running in the cloud; and .NET Services, consumable web-based services that provide both secure connectivity and federated access control for applications.

1.4.3 Software as a Service (SaaS)

Software as a Service (SaaS) is a software distribution model in which applications are hosted by a service provider and are made available to users over the Internet. Unlike conventional software applications, SaaS applications do not require downloading, installing and configuring on the user's machine. SaaS applications are managed centrally and users access them with web browsers. With the use of the latest web development technologies, such as AJAX, Adobe Flash, Adobe Flex, Adobe AIR or Microsoft Silverlight, SaaS providers are creating web-based applications with intuitive user interfaces and rich functionality.

An example of a SaaS provider with a diverse list of offered applications is Zoho [100]. Zoho features a set of productivity and collaboration applications, examples of which are: Writer (an online word processor), Sheet (online spreadsheets), Show (an online presentation tool), Planner (an online organizer). It also provides business applications: CRM (an online Customer Relationship Management tool), Meeting (online web conferencing and remote support), Invoice (an online invoice creation tool), Projects (online project management software), etc.
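To make the division of labour in the PaaS model of Section 1.4.2 more tangible, the sketch below shows the kind of small, self-contained web application such platforms host. It is a generic WSGI application in Python and is deliberately not tied to any particular provider's API: the developer supplies only the request-handling code, while on a real PaaS the provider supplies the HTTP serving, scaling and load balancing (the local wsgiref server used here merely stands in for that layer during development).

    # A minimal WSGI web application: the unit of code a PaaS platform hosts.
    # The wsgiref server below is only a local stand-in for the provider-managed
    # serving layer and is not part of any specific PaaS offering.
    from wsgiref.simple_server import make_server

    def application(environ, start_response):
        """Answer every request with a short plain-text greeting."""
        body = b"Hello from a platform-hosted application\n"
        start_response("200 OK", [("Content-Type", "text/plain"),
                                  ("Content-Length", str(len(body)))])
        return [body]

    if __name__ == "__main__":
        with make_server("", 8080, application) as httpd:
            print("Serving on http://localhost:8080 (local test server only)")
            httpd.serve_forever()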

1.5 Summary of Chapter 1

Chapter 1 is dedicated to an overview of Grid and Cloud Computing. The architecture of computing Grids is described, and several multidisciplinary international and foreign national Grids, like EGEE (Enabling Grids for E-sciencE) and the Open Science Grid (the largest Grid project of the USA), as well as ArmGrid (the Armenian Grid infrastructure), are presented. Cloud Computing technologies are overviewed, and the major types of computing clouds, namely Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS) clouds, are discussed.

CHAPTER 2. ALIEN - GRID INFRASTRUCTURE OF CERN ALICE EXPERIMENT

2.1 CERN ALICE experiment

ALICE (A Large Ion Collider Experiment) is a general-purpose heavy-ion detector [74] at the CERN LHC [27] which is designed to address the physics of strongly interacting matter and the quark-gluon plasma at extreme values of energy density and temperature in nucleus-nucleus collisions. The ALICE detector has been built by a collaboration currently including over 1000 physicists and engineers from 105 institutes in 30 countries. Its overall dimensions are 16x16x26 m^3. The experiment consists of 18 different detector systems, each with its own specific technology choice and design constraints, driven both by the physics requirements and by the experimental conditions expected at the LHC.

Figure 2.1. ALICE detector

It is planned that during each Standard Data Taking Year ALICE will run for about 10^6 effective seconds (one month) at a data collection rate of 1.25 GB/s for lead (Pb) ion runs, and for about 10^7 seconds at a data collection rate of 100 MB/s for proton-proton (p-p) runs. This will result in about 2.25 PB of data each year [101] (1.25 GB/s x 10^6 s gives about 1.25 PB for the Pb runs, and 100 MB/s x 10^7 s gives about 1 PB for the p-p runs). Apart from the data recorded by the detector, the scientists participating in the experiment also use simulations to produce data and compare the results of the simulation with the experimental data. The results of the analyses of simulated data are extremely important for understanding the performance of the detector, as well as for fine-tuning and bug-fixing the hardware and software of the experiment.

2.2 Distributed computing architecture of ALICE

Since the early days of the design of the LHC experimental program it was clear that the necessary computing and storage resources for data processing could not be consolidated at a single computing centre. Considering the huge financial investment required and the need for expert human resources, the natural choice was made to distribute those resources throughout the computing centers of the institutes and universities involved in the experiments. The technical design of the decentralized offline computing has been outlined in the so-called MONARC (Models of Networked Analysis at Regional Centers for LHC Experiments) model [102] (Figure 2.2).

Figure 2.2. Schematic view of ALICE offline computing tasks in the framework of the MONARC model (CERN acts as Tier 0, storing the master copy of the RAW data and performing reconstruction and prompt analysis; Tier 1 centers such as CNAF, CCIN2P3, FZK, RAL, NIKHEF and NDGF keep a copy of the RAW data and perform reconstruction and analysis; Tier 2 centers such as Torino and Subatech Nantes perform Monte-Carlo production, keep partial copies of reconstructed data and perform analysis)

The model describes an infrastructure based on distributed computing and data storage resources grouped in a hierarchy of centers called Tiers, where Tier 0 is CERN. Tier 1 centers are major regional computer centers serving a large country or a geographic region. They have a big computing capacity (of the order of several thousands of CPUs) and provide a wide range of services and, more importantly, persistent data storage facilities. Tier 2 centers are smaller regional computer centers which provide a reduced set of services and serve either a part of a country or a small geographic region. Tier 3 centers are even smaller university department computer centers which have limited computing capabilities and focus on the solution of very specific scientific problems. Tier 4 centers are the workstations of the scientists. The major difference between the first three Tiers is the reliability and the Quality of Service (QoS) they provide. The highest QoS is offered by Tier 0 and Tier 1 centers. Within the MONARC model computing centers located at different Tiers have clearly defined purposes. At Tier 0 the master copy of the experimental data is stored and reconstructed, and prompt data analyses are performed. External Tier 1 centers (i.e. not CERN) store the replicas of the experimental data (thus providing a backup of the master copy of the data stored at CERN), perform the reconstruction of the data, and are used to perform end-user analyses. Tier 2 centers are used to perform Monte-Carlo

37 simulation of the experimental data, keep partial copies of reconstructed data and are used by end-users to perform data analyses. MONARC model has proven to be viable during numerous Physics Data Challenge (PDC) exercises [103, 104], which are series of global infrastructure and software system tests using different data processing schemes. PDCs are aimed at the validation of the computing model, data storage model, software used for experimental data simulation and analyses, as well as ensuring the correctness of technical choices which were made. However with the evolvement of Grid technologies the MONARC model has been progressively replaced by a more symmetric model, in which the only distinctive feature of Tier 1 centers apart from the size and service level is the commitment to provide reliable and persistent data storage facilities. Within the current model ALICE [101] uses the common Grid services deployed by Worldwide LHC Computing Grid (LCG) [77] project and adds the necessary ALICE-specific services from the AliEn (ALICE Environment on the Grid) system [73, ]. In the following sections of this chapter the detailed description of AliEn system will be presented. 2.3 The architecture of AliEn ALICE collaboration started the development of the AliEn framework in year 2000 with the aim of providing the ALICE user community with a single interface which will allow to access transparently heterogeneous computing and storage resources distributed all over the world. Already in 2001 the first version of AliEn was deployed for distributed experimental data simulation (Monte-Carlo production) at several computing sites. The fast development cycle continued, adding more functionality to the system. During physics and data challenge exercises in the years more than 400,000 ALICE production and analysis jobs have been run on an infrastructure created using AliEn framework. AliEn was used to unite computing and storage resources located in about 40 computing sites distributed worldwide producing more than 40 TB of data. Since 2005 AliEn has been used both for Monte-Carlo data production and for end-user data 37

analyses. From the early days of AliEn its developers have managed to fulfill their primary goal - to hide from the end users the complexity of rapidly evolving Grid services and the underlying heterogeneous infrastructure. AliEn is built in accordance with the web services model [109] using standard network protocols and Open Source components. Most of the AliEn code (about 95%) is imported from Open Source packages and modules. The native AliEn code is written mostly in Perl. Web services are one of the most important enabling technologies for the functioning of the AliEn infrastructure. The AliEn user interface interacts with the AliEn services using SOAP (Simple Object Access Protocol) [110]. Apart from that, the services constantly exchange SOAP messages between themselves, creating a web of collaborating services. The key components and services of the AliEn framework are the following:

Workload Management System (WMS)
File and metadata catalogue
Data management tools and services for data transfer, replication and storage
Authentication and Authorization
Monitoring
Interfaces to other Grid infrastructures (ARC [111], glite [60], OSG [70])
Interface for working with AliEn from the ROOT [112] object-oriented data analysis framework
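As an illustration of the SOAP-based communication mentioned above, the following is a minimal Perl sketch using the SOAP::Lite module. The namespace, endpoint and method name are hypothetical placeholders and do not correspond to the actual AliEn service interface.

use strict;
use warnings;
use SOAP::Lite;

# Hypothetical namespace and endpoint, used here only to show the call pattern.
my $service = SOAP::Lite
    ->uri('http://example.org/AliEn/Service/JobInfo')
    ->proxy('http://alien-central.example.org:8084');

# Invoke a (hypothetical) remote method and unpack the SOAP response.
my $result = $service->getJobStatus(12345);
die 'SOAP fault: ' . $result->faultstring if $result->fault;
print 'Job status: ' . $result->result . "\n";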

The schematic representation of the AliEn services and components as well as their interaction with the computing and storage services available at computing centers is given in Figure 2.3.

Figure 2.3. Schematic view of the AliEn basic components and deployment principles (central services such as authentication, file catalogue, workload management, job submission, configuration, monitoring, job task queue, accounting and storage volume management, and site services such as the Computing Element, Storage Element, File Transfer Daemon, Package Manager and Cluster Monitor on top of the existing local batch and storage systems)

AliEn WMS uses the so-called pull approach. In the central database there is a common task queue which contains the list of all jobs of the ALICE Virtual Organization. Each institution providing resources for ALICE runs a service called Computing Element (CE) which provides access to the local

40 computing resources that can range from a couple of machines dedicated to running a specific tasks to the huge computer centers with thousands of machines and hundreds of terabytes of disk space. AliEn provides very important and interesting features of interfacing other Grid infrastructures which are being deployed all over the world. Using that feature AliEn Computing Element can be configured to serve as a gateway to an entire foreign Grid infrastructure. Currently interfaces for ARC, glite and OSG have been implemented. Jobs submitted by end-users or operators responsible for running Monte- Carlo production on the Grid end up in the task queue. Job management and brokering services analyze the job requirements such as the needed input datasets, or required application software packages and mark them as eligible for execution on one or more AliEn sites. The jobs waiting in the queue are sorted according to job execution quotas defined for the user who has submitted the job as well as the price of the job (the detailed description of this process is given in Chapter 3). The Computing Elements of the sites where free worker nodes are available get the jobs from the central queue and deliver them to local site queues. Input and output files associated with the jobs are registered in the AliEn File Catalogue (FC), a virtual file system with the directories structure similar to the standard UNIX file system. The difference between AliEn file system and standard UNIX file system is that the former does not actually own the files, but instead keeps the mapping between the Logical File Names (LFN), that are seen by the users and jobs, and Physical File Names (PFN) which reflect the location of the file on the mass storage system. AliEn File system also provides a feature of association of metadata with the files. Detailed description of the AliEn File System will be provided in section 2.2.1). AliEn has a very powerful and rich in features monitoring system based on MonALISA [113] open source monitoring framework (see Section for details). The description of AliEn authentication and authorization system will be provided in Section

41 2.3.1 AliEn file and metadata catalogue One of the key components of the AliEn system is the file and metadata catalogue. The logical structure of the catalogue is very similar to the standard UNIX file system. Files and directories are organized into the hierarchical structure. In addition to common file system functions the catalogue also provides tools which allow users to associate metainformation with each entry. Unlike real file systems it does not own the files but keeps the mapping information between the names of the files in catalogue as they are visible by users and jobs (these are called Logical File Names or LFNs) and the information (called Physical File Name or PFN) about physical location of those files on mass storage systems. One LFN can have an association with one or more PFN. The existence of multiple PFNs associated with a single LFN is the indication of the replication of the physical file on different storage elements. Each PFN entry contains the address of the mass storage system where the file is physically located as well as local path of that file. LFNs can be manipulated by users, several LFNs can be associated with the same PFN. To prevent entry duplication each LFN is associated to a Globally Unique Identifier (GUID) entry. Nearly all components of AliEn system make use of the catalogue in one or other way. The catalogue keeps the information on all the data and application software packages used for Monte-Carlo simulation as well as analyses of the experimental data. The catalogue is also used to keep the input and register the output of the jobs which run in AliEn. Many files in AliEn catalogue have the metadata entries associated with them. These metadata along with various triggers for automatic data management are also kept in AliEn catalogue. The catalogue is build using a set of relational databases. The design of the catalogue allows splitting to the different databases at the directory level (i.e. the information about each directory can be kept in a different 41

database). These databases can be stored on different machines, which makes it easy to scale the catalogue in case of high load. However, from the user and job point of view the catalogue is seen as a single entity. The internal structure of the catalogue is divided between LFN and GUID tables, which are kept in different databases (Figure 2.4).

Figure 2.4. Structure of the AliEn File and Metadata Catalogue (the LFN catalogue maps entries such as /alice/bin to GUIDs and metadata, while the GUID catalogue maps GUIDs to PFNs)

Every database which keeps LFN information has an index table. The index is used to locate the database which contains the entire information about a given directory or file. The next set of tables provides information about the ownership, group membership, date, size, GUID, etc. of the LFN. In addition there can be other tables which contain the metadata information for any entry in the catalogue. The GUID part of the catalogue has a similar structure. The first table is an index which points to the database and the table where the information about a given GUID is contained. The index is calculated based on the creation time of the GUID, which makes it very probable that GUIDs constructed at the same time will appear in the same database. For each GUID two lists of storage elements can be defined: the first is the list of storage elements which have the GUID and can construct a PFN based on the GUID name, and the second is the list of storage elements that cannot construct

the PFN themselves, in which case the PFN is also stored in the central catalogue. The LFN to PFN translation is done in four steps:

1. The database containing the LFN is found via the LFN index table
2. The LFN is translated to a GUID
3. The database and the table containing the GUID are found via the GUID index table
4. All PFNs associated with the given GUID are obtained

One of the advantages of such a two-step translation is the possibility to work directly with GUIDs instead of LFNs, which is the natural namespace for data management. Apart from that, if a GUID is replicated there is no need to change anything in the LFN database, since all the LFNs which point to that GUID will be aware of this replica, and conversely, if the user changes the LFN (e.g. renames the file in the catalogue) there is no need to update the associated GUID entries. The drawback of this approach is that in some cases one might encounter so-called orphaned GUIDs - GUID entries which have no associated LFNs. A possible solution to this problem could be keeping a reference counter for each GUID - a number which shows how many LFNs point to a given GUID. In that case orphaned GUIDs would be easily detectable because their reference count would be zero. However, keeping references would introduce additional time overhead (e.g. directory copying would become slower) because in that case an LFN modification operation might require a change of the reference count. In AliEn this problem is partially solved by delaying the updates so they can be performed in bulk operations. One of the interesting features the catalogue provides is the file collection: a user-defined list of catalogue entries. Each entry can be an LFN, a GUID or another collection. The number of entries in a single file collection as well

as the number of collections containing the same entry are not limited. From the user point of view the collections appear as usual files in the catalogue, and AliEn provides commands to display or modify (add or remove files) the collections. A command executed on a collection results in the execution of the command on all entries which are part of that collection; for example, the replication of the collection to another Storage Element results in the replication of all individual files contained in the collection to that Storage Element. This feature of grouping several files together within a single catalogue entity is extremely useful for users who submit jobs requiring multiple input files: instead of writing all the input file names in the job description file they can create a collection of files and specify the name of the collection as the input requirement, thus keeping their job descriptions short and manageable. The collections are also used for splitting jobs into sub-jobs. Another useful feature of the file catalogue is the trigger system. Triggers allow users to define actions which are performed automatically when an LFN in the catalogue is updated. To activate a trigger the user needs to register in the catalogue a file which performs the desired action, which, for example, can be a little script which sends an e-mail. Then the user has to associate the script with a directory in the file catalogue and select the type of modification (insert, update or delete) which will trigger the execution of the script. For the time being ALICE is using the trigger mechanism to schedule the replication of the experimental data to the storage systems located at Tier 1 sites: whenever a new file is added to the directory where the experimental data is recorded, the transfer of that file to some Tier 1 site is scheduled.

2.3.2 AliEn monitoring system

The core of the AliEn monitoring system is the MonALISA (MONitoring Agents using a Large Integrated Services Architecture) framework [113]. MonALISA provides extended functionality for service, operating system and network performance monitoring on distributed remote computing sites. It uses a dynamic set of protocols to deliver the monitoring information to different kinds of clients, like a web repository which keeps the history of the

collected data or a graphical application which can be executed using Java Web Start technology [114]. MonALISA also provides an Application Programming Interface (API) called ApMon [115] which can be used to instrument any application to send monitoring information to the monitoring agents. The MonALISA API is available for a wide variety of modern programming languages. The AliEn monitoring system has a hierarchical infrastructure where the information is collected at the lower levels (e.g. on the machines where the jobs run) and then selectively aggregated and sent to an upper-level service (e.g. to a special node on the site where the monitoring service runs); see Figure 2.5 for details. The aggregation allows a significant reduction of the overall volume of the network traffic generated by the different components of the monitoring system, while preserving the important details.

Figure 2.5. Information flow from the monitored entities (ApMon-instrumented Job Agents, Computing Elements and Storage Elements report to the Cluster Monitor and site MonALISA services, which forward aggregated data to the MonALISA repository and its long-history database)

46 All AliEn services and clients are equipped with ApMon, which sends general job and host monitoring information as well as allows services to send specific monitoring parameters. The monitoring system components collect extensively information about CPU and memory usage, consumed and wall CPU time, open file descriptors and network traffic. Each site of the AliEn Grid runs a dedicated site monitoring service, which the monitoring agents running on that site report information to. This service also monitors the status of other AliEn services running on that site through periodic functional tests. The site monitoring service does the filtering and aggregation of the collected monitoring data and periodically sends it to the central MonALISA repository of AliEn Grid. The central repository provides a global view of the entire AliEn Grid infrastructure. The repository collects the data from the monitoring services which run on AliEn sites and stores it into a PostgreSQL database. It provides close to real-time and history reports of the AliEn Grid status. The reports can range from general overviews (Figure 2.6) to very detailed views of individual user jobs. The data stream is in average about 3000 values per minute with the current database size of several hundred Gigabytes. Along with the presenting the monitoring information to the users, site and VO administrators the repository also provides features of taking automated decisions based on the monitoring information received. For example it may try to restart the non-responsive site service or send the notification to site administrator in case the problem is persistent. 46
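The ApMon instrumentation described above can be sketched in a few lines of Perl. The destination, cluster and parameter names are illustrative assumptions, and the constructor and sendParameters() usage follows the ApMon Perl examples distributed with MonALISA; it should be checked against the documentation of the deployed ApMon version.

use strict;
use warnings;
use ApMon;    # Perl module distributed with MonALISA

# Hypothetical MonALISA destination (host:port) for the site monitoring service.
my $apm = new ApMon('monalisa-site.example.org:8884');

# Report a couple of illustrative parameters for one worker node.
$apm->sendParameters('AliEn_Site_Example', 'wn042.example.org',
                     'jobs_running', 12,
                     'cpu_usage',    37.5);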

Figure 2.6. General overview of the AliEn infrastructure

It can also be used to automatically submit Monte-Carlo simulation jobs when the task queue is about to become empty, or to dynamically update the DNS service records of the central AliEn services to ensure proper load balancing.

2.3.3 AliEn Workload Management System (WMS)

Users describe their jobs using the Job Description Language (JDL) which is based on the Condor ClassAd language [116]. In the JDL specification of the job the only mandatory parameter is the name of the executable that has to be run (the AliEn Job Manager central service will fill in the other necessary fields when the job is submitted to the system). Optionally the user may also specify in the JDL the arguments which have to be passed to the executable, the requirements that the worker nodes have to fulfil (e.g. available disk space), the input data file names which the job needs, the names of the output files the job is going to produce, the software application packages which the job

needs, as well as the name of the validation script which must be used to validate the output of the job. There are two models of distributing the jobs on the Grid. In the first model (called "push") there is a central service which queries the site services from time to time and collects information about the sites (e.g. the number of free worker nodes available or the list of installed application packages). This information is then used to choose a suitable site for scheduling a given waiting job. Keeping the status of the whole Grid up to date is not a trivial operation, because there are thousands of elements whose status can change very often, and thus there is no guarantee that the information which is used to take the job scheduling decision will always be accurate.

Figure 2.7. Push Grid job submission model (the central services collect site information and push jobs to sites A, B and C)

In the pull model there is no need for a service which keeps track of all Grid resources. Instead, the sites which have free resources for running jobs periodically send their information (e.g. the number of free worker nodes available or the list of installed application packages) to the central resource brokering service, which checks whether there is a match between the requirements of the jobs waiting in the task queue and the resources

provided by the site which is asking for a job. If a match occurs then the service immediately schedules the matched job for execution on the site, otherwise the brokering service reports to the site that there is no match and, after waiting some time, the site repeats the request.

Figure 2.8. Pull Grid job submission model (sites A, B and C send their information to the central services and pull matching jobs)

AliEn uses the pull approach. Every institution which provides resources to the AliEn Grid runs a service called Computing Element (CE). The CE serves as an interface between the local resources available on the site and the central services (as already mentioned in Section 2.3, there are cases when the CE is used as an interface to some other Grid infrastructure). Whenever there is a free resource for running a job on the site the CE sends the description of its capabilities to the AliEn Job Broker central service. This description is also based on the JDL and contains the name of the CE, the list of available software application packages, the name of the Storage Element (SE) closest to that CE, etc. The Job Broker performs the matching of the CE resources with the jobs waiting in the Task Queue and if there is a match it instructs the CE to submit to the local batch system the script which

after arriving at the WN will start the AliEn service called Job Agent (JA). However, it is important to mention that at this point the job is not yet assigned to the matched CE, and if another CE sends a request which matches the same job the Job Broker will instruct that CE to start a JA as well. When the JA starts on the worker node it executes a set of sanity checks and, if the checks are successful, it sends a job request to the Job Broker. The request sent by the JA is more detailed than the one sent by the CE. It contains, among other things, the available disk space, memory, and information about the platform and the OS of the WN. If the description satisfies the requirements of a job waiting in the Task Queue the Job Broker assigns the job to the JA. The JA analyzes the JDL of the job, downloads the necessary input files, prepares the environment for execution and executes the job. Once the user job terminates the JA initiates the transfer of the output files to the SE and registers the files in the AliEn file catalogue. After that the JA tries to get another job for execution, and if there are no more jobs to execute the JA exits. The use of JAs provides multiple benefits. First of all, JAs eliminate the possibility of job failure due to problems with the local batch system or problems with the WNs. They also reduce the time between the submission and execution of a user job, since several sites can submit a JA for the same job, which at the end will be assigned to the fastest JA (the one which first manages to request the job from the Job Broker). Finally, the use of JAs reduces the load on the local batch system, since a single JA can execute multiple jobs. At the same time JAs significantly increase the load on the Job Broker, since in the architecture without JAs only the CE communicates with the Job Broker and multiple jobs can be assigned to the same CE within the same response to a request. This is not possible with JAs since they act asynchronously and independently from each other, and each of them has to communicate with the Job Broker.
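The Job Agent life cycle just described can be summarized with the following Perl sketch. Every subroutine is a trivial stand-in for the corresponding AliEn functionality; only the overall pull-loop structure is illustrated.

use strict;
use warnings;

# Stand-in for the central Task Queue, drained by the loop below.
my @task_queue = ( { id => 101 }, { id => 102 } );

sub run_sanity_checks         { return 1 }       # disk space, memory, platform checks
sub describe_worker_node      { return { platform => 'x86_64-linux', disk_mb => 10_000 } }
sub request_job_from_broker   { my ($wn) = @_; return shift @task_queue }
sub download_input_files      { print "downloading input for job $_[0]{id}\n" }
sub prepare_environment       { print "preparing environment for job $_[0]{id}\n" }
sub execute_job               { print "running job $_[0]{id}\n"; return 'DONE' }
sub store_and_register_output { print "storing and registering output of job $_[0]{id}\n" }

exit 1 unless run_sanity_checks();

# Keep pulling jobs from the (simulated) Job Broker until none are left, then exit.
while ( my $job = request_job_from_broker( describe_worker_node() ) ) {
    download_input_files($job);
    prepare_environment($job);
    my $status = execute_job($job);
    store_and_register_output($job) if $status eq 'DONE';
}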

51 However one has to mention that additional functionality provided by the JAs outweighs the drawbacks and in the current operation model the Job Broker was capable of serving more than JAs running in parallel. 2.4 Problem definition Different Grid middleware toolkits (e.g. AliEn [73], glite [60], Unicore [72], ARC [111]) employ different techniques for the implementation of their resource sharing systems. The state of the art, however, is that they all are designed following the general principles presented in Chapter 1, and thus their resource sharing systems have the common set of problems: 1) The tasks of the users sent to the Grid for execution (also referred to as jobs) have different importance for the members of the collaboration and so it is necessary to provide mechanisms for controlling the jobs execution in accordance with their importance, whereas currently the jobs of the users are served by central services on FIFO (First In First Out) basis. So regardless of the size of the Grid and the amount of resources available to Grid users, there exists a problem of the coordination of the use and accounting of those resources between the members of the virtual organization. This requirement brings in the need of having a flexible job scheduling scheme where the priorities of the jobs can be varied. A requirement to augment the priority may arise, for example, when a user or a group of users carries out intensive calculations for urgent submission of a conference paper. On the other hand, there may be a need to lower the priority of jobs (such as data challenge jobs) to execute them without hindering the work of Grid users. 2) The need of scientists in Grid resources varies over the time: during the data taking periods Grid resources are not actively exploited by the users, however once the experimental data is collected and the data analysis 51

52 starts, the number of active users, and thus the required computing capacity, increases significantly. Currently the integration of external resources to the existing Grid infrastructure involves the installation of experiment specific tools and services on those resources, which is a time consuming process requiring significant efforts (due to complexity of installation and configuration, this process may take up to several weeks). So there exists a problem of dynamically increasing the computing and storage capacity of Grid by a seamless integration of external resources (e.g. resources available from academic and commercial Cloud Computing providers) to satisfy the time-varying needs of scientists. The integration must be done in such a way that no change is visible to the user, i.e. the users do not need to change the ways in which they use the system. 3) In addition to maintaining the Grid middleware resource providers have to also maintain application software stacks required by the communities they support. This process requires significant efforts since as the rule the software packages required by the scientists are complex, have a lot of external dependencies, as well as require frequent reinstallation since they are in the phase of active development. Most of the Grid resource providers support several VOs/scientific communities, so, there exists a problem of minimization of the work of resource administrators on the maintenance and support of specific application software required by different scientific communities. 4) Grid infrastructures are comprised of hundreds of resource provider sites. The versions of applications deployed and their configuration varies from site to site, and thus the users have no possibility to test their programs in the environment which will be identical to that of the worker node of the site where their job will be executed. This is very important for the debugging of those programs since the behavior of the same program which was run in different environments is likely to vary. 52

So there exists a problem of ensuring that the job execution environment remains consistent across multiple executions on the resources of different sites, while at the same time allowing users to customize the environment in which their jobs are executed.

5) One of the key issues in Grid infrastructures is the authentication and authorization of Grid participants required for accessing Grid resources. The authentication in Grids is performed using X.509 digital certificates [117]. There are, however, cases when the use of digital certificates is not appropriate (e.g. some jobs need to frequently access the file catalogue, whereas the authentication with digital certificates is slow), so there exists a problem of having a flexible authentication and authorization framework which will allow Grid participants to authenticate with different mechanisms.

6) Grid middleware is written for the Linux/Unix operating systems. However, most of the users use Windows OS in their everyday work, so it is important to provide those users with ported versions of the client parts of Grid middleware so they can exploit Grid resources without changing their working habits.

This dissertation is devoted to the solution of the aforementioned problems within the resource sharing system of the AliEn Grid.

2.5 Summary of Chapter 2

Chapter 2 briefly describes the ALICE experiment and its purposes, presents the Grid infrastructure of the ALICE experiment (AliEn), and gives an overview of the problems existing in the resource sharing system of AliEn. Different Grid middleware toolkits (e.g. AliEn, glite, Unicore, ARC) employ different techniques for the implementation of their resource sharing systems. The state of the art, however, is that they all are designed following the general principles presented in Chapter 1, and thus their resource sharing systems have a common set of problems, which are analyzed and presented in detail in Section 2.4. The dissertation is devoted

54 to the solution of these problems within the resource sharing system of AliEn. 54

CHAPTER 3. DESIGN AND DEVELOPMENT OF GRID BANKING SERVICE MODEL FOR JOB SCHEDULING IN ALIEN

The developed Grid Banking Service model for job scheduling, which is described in the following sections of this chapter, addresses problem 1) defined in Section 2.4.

3.1 Development of the Grid Banking Service (GBS) model for AliEn WMS

The following banking service model for AliEn has been elaborated. The administrator of the Virtual Organization (VO) defines a price, called the nominal price, for a unit of CPU resource. The number of units of CPU resource consumed by the jobs in AliEn is calculated based on SPECint2000 [118] specifications 2. For each job, the user declares his/her bid - the price, in units of the nominal price, which he/she is ready to pay for the unit of CPU resource which the job is going to consume. The Job Manager service ranks the jobs according to the so-called effective priority, which is calculated in the following way. First, for each user who has waiting jobs the value of the computed priority is calculated according to the following formula (Figure 3.1):

computedpriority = 1, if runningjobs > maxparalleljobs
computedpriority = nominalpriority * (2 - runningjobs / maxparalleljobs) * priority, if runningjobs <= maxparalleljobs

Figure 3.1. Formula for calculating computed priority

2 SPECint2000 is a set of benchmarks designed to test the CPU performance of a modern computer system. It contains two benchmark suites: CINT2000 for measuring and comparing compute-intensive integer performance and CFP2000 for measuring and comparing compute-intensive floating-point performance. SPEC defines a base runtime for each of the 12 benchmark programs. The timed test is run on the system, the time of the test system is compared to the reference time, and a ratio is computed. That ratio multiplied by 100 becomes the SPECint2000 score for that test. As an example for SPECint2000, consider a processor which can run the benchmark program called 256.bzip2 in 437 seconds. The time it takes the reference machine to run the same program is 1500 seconds. Thus the ratio multiplied by 100 is equal to (1500/437)*100, i.e. about 343. The ratio is computed for all of the benchmark programs provided by SPEC, and then the geometric mean of those ratios is computed to produce an overall value.

Where runningjobs is the number of currently running jobs submitted by the user, maxparalleljobs is a constant defined by the database administrator which denotes the number of jobs which a given user is allowed to run simultaneously, and priority is a constant which is currently set to 1. According to the formula, the value of the computedpriority will be set to 1 if at the moment of calculation the user is running more jobs than he/she is allowed; otherwise it will be linearly decreasing depending on the ratio of the currently running to the maximum allowed jobs, as well as on the values of the nominalpriority (constant defined per VO) and priority (constant defined per user) variables. For each waiting job the effective priority is calculated by multiplying the value of the computed priority of the user who has submitted the job with the bid which the user has declared. Thus the jobs with higher bids get a higher effective priority and are sent to execution faster. The sites in their turn set the site price (again in units of the nominal price) for the unit of CPU resource they provide and accept for execution only the jobs which have bids higher than or equal to the site price. When the job is finished, the amount of consumed CPU units is calculated and the user is charged: a payment is made from his/her account to the account of the site where the job was executed. The amount of CPU units consumed by the job is calculated on the basis of the job's CPU time and wall clock time (the real running time of a job) and then is converted to SPECint2000 units. In our model, an institution entering the VO earns symbolic money by providing Grid resources to the participants of the VO. The earned money is accumulated in the bank account of the corresponding site. The payment for the executed jobs is made from so-called group accounts. Group accounts are filled up from the funds accumulated in site accounts according to the policy defined by the VO administrator.
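A minimal Perl sketch of the priority calculation just described, assuming the piecewise formula reconstructed in Figure 3.1; the variable names mirror the text and this is not the actual AliEn code.

use strict;
use warnings;

# Computed priority as given in Figure 3.1 (sketch only).
sub computed_priority {
    my ($running_jobs, $max_parallel_jobs, $nominal_priority, $priority) = @_;
    return 1 if $running_jobs > $max_parallel_jobs;
    return $nominal_priority * (2 - $running_jobs / $max_parallel_jobs) * $priority;
}

# Effective priority of a waiting job: the computed priority of its owner times the bid.
sub effective_priority {
    my ($user, $bid) = @_;
    return computed_priority(@{$user}{qw(running max_parallel nominal priority)}) * $bid;
}

my $user = { running => 3, max_parallel => 10, nominal => 1, priority => 1 };
printf "effective priority: %.2f\n", effective_priority($user, 2);   # a bid of 2 gives 3.40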

57 The VO administrator can also impose taxes on the incomes of the sites. The taxes will be transferred to a special VO-owned tax account. The money from this account can further be used, for example, to credit users who ran out of money. Within this model the participating institutions are motivated to provide more resources with better quality and availability, since the better the resources of the site are the higher their price can be and the more money can be gained. Besides, the model motivates users to be more conscientious in the usage of Grid resources. Grid Banking Service software has been developed and integrated into AliEn [119]. Section 3.5 describes the role of the banking service in AliEn job submission and charging process. 3.2 Discrete-event system model for the simulation of the work of AliEn WMS and GBS Functioning of AliEn infrastructure is vital for ALICE collaboration so introducing changes to the software or the configuration always requires thorough analysis. Before a change can be deployed on the infrastructure which is used by ALICE collaboration one has to make sure that it will not hinder the work of AliEn users, Monte-Carlo data production operators as well functioning of AliEn sites. The banking service is not an exception, so every time a VO admin deems necessary changing the configuration parameter (e.g. job nominal price, maximum number of jobs a user is allowed to run in parallel) or site admin decides to change the site configuration parameter (e.g. price of the CPU resource provided by the site) first it is necessary to make sure that the change will not negatively affect any AliEn user or site and second it is very desirable to be able anticipate the effect of the change before it is deployed on the real infrastructure. To be able to study the effects which changes of AliEn banking configuration parameters will produce a simulation model of AliEn Workload Management System (WMS) has been developed. An implementation of the developed 57

58 model a simulation program, takes as an input so-called workload, which is a list containing information about jobs (e.g. when is the job arrival time, who is the user who has submitted the job), as well as a list containing information about sites (e.g. number of worker nodes available on the site, SI2k conversion ratio of site worker nodes) and a list containing information about users (e.g. maximum number of jobs a user is allowed to run in parallel, amount of money available to user). The program then mimics the work of AliEn job management system by assigning jobs to the sites similar to the way it is done in real system and records important parameters (e.g. start time of the jobs, completion time of the jobs) in the database. The simulation finishes when all the jobs get processed by the program. By modifying the simulation parameters one can study the produced effects by analyzing the data which is collected in the database during the simulation. The number of basic elements which constitute AliEn WMS is very big (tens of thousands of computers, hundreds of services, etc.) and their interconnection (network links between the machines on the sites, network links between different sites, etc.) is extremely complex, so overall it is a very sophisticated system. That is why in the developed model the basic constituents of the system are united into interacting subsystems with appropriate high level of abstraction and the whole system is observed as a unification of those high level components - so called entities of the system. The model allows describing and simulating the work of the AliEn WMS as a discrete event-system [120]. In discrete-event simulation a system is modelled in terms of 1) the entities that represent the constituents of the system, 2) its state at each point in time and 3) the activities and events that cause the system state changes. Below we describe what the entities of the simulated system are, how the state of the system is defined and what are the events and activities which cause the changes of the system state. 58

In discrete-event simulation systems entities are defined as objects of interest in the simulated system. The entities in our simulated system are AliEn Grid jobs, sites and users with corresponding group accounts in the banking service. The properties of entities which are used to describe them are called attributes. In case of AliEn sites attributes are parameters which represent the site configuration (i.e. the number of worker nodes, SI2k ratio), in case of jobs they represent job parameters (e.g. job duration), and in case of users the attributes are parameters associated with the users (e.g. the name of group account in the banking service to which the user belongs). Job, site and user attributes used in the model are presented below. The attributes using which the jobs are described are following:

Job_Id - Identification number of the job
Job_User - User who submitted the job
Job_Site - Site where the job has been sent for execution
Job_Price - Bid of the user who submitted the job
Job_SI2k - Number of SI2k units the job is going to consume
Job_Arrival_Time - Job arrival time
Job_Start_Time - Time when the job was sent for execution
Job_Run_Time - Job running time
Job_End_Time - Time when the job finished
Job_Status - Job status
Job_Priority - Effective priority of the job

Sites are described using the following attributes:

Site_Name - Name of the site
Site_SI2k_Ratio - SI2k conversion ratio of the site worker nodes
Site_Price - Price of the unit of CPU resource the site provides
Site_Start_Time - Time when the site starts the operation

Site_Max_Jobs - Maximum number of jobs the site can concurrently run
Site_Job_Request_Rate - Site job request rate (seconds)
Site_Running_Jobs - Number of jobs running on the site
Site_Account_Name - Name of the site account in the banking service
Site_Account_Balance - The amount of money available on the site banking account

The attributes which are used to describe the users are:

User_Name - Name of the user
User_Account_Name - Name of the group account in the banking service to which a given user is assigned
User_Account_Balance - The amount of money available on the user's banking account
User_Running_Jobs - Number of running jobs belonging to a user
User_Waiting_Jobs - Number of waiting jobs belonging to a user
User_Max_Parallel_Jobs - Maximum number of jobs a user is allowed to run in parallel
User_Priority - The value of the priority variable which is used during the computed priority calculation process (Section 3.1)

In discrete-event simulation systems the state of the system is defined by the collection of variables which are necessary to describe the system at any time. In our case the system state variables are the attributes of the entities plus the following:

System_Waiting_Jobs - Overall number of waiting jobs
System_Running_Jobs - Overall number of running jobs

The activities and events which cause the system state change are:

Event_Job_New - Arrival of a new job
Event_Job_Request - Request of a job by a site

61 Event_Job_End - Completion of a running job Even_Priority_Calculate (Re)calculation of the priorities of the jobs waiting in the queue The mechanism for advancing simulation time and guaranteeing that all events occur in correct chronological order is based on Event Scheduling/Time Advance (ESTA) Algorithm [120]. The algorithm uses socalled Future Event Lists (FEL) which contains the list of event notices for events which have been scheduled to occur at a future time. Scheduling a future event means that at an instant an activity begins (e.g. a job is sent for execution to the site) its duration is computed (e.g. job duration is computed based on the number of SI2k units a job needs and site SI2k conversion ratio) and that the corresponding end-activity event (e.g. job completion) together with its event time is placed on the future event list. To begin the simulation it is necessary to define the input data: configuration of the sites participating in the simulation, settings associated with the users participating in the simulation as well the parameters of the jobs which simulation program will need to process. For each site participating in the simulation values of the following attributes have to be defined: Site_Name Site_SI2k_Ratio Site_Price Site_Start_Time Site_Max_Jobs Site_Account_Name Site_Account_Balance For each user participating in the simulation values of the following attributes have to be defined: 61

62 User_Name User_Account_Name User_Account_Balance User_Max_Parallel_Jobs User_Priority The jobs have to be defined using the following attributes: Job_Id Job_User Job_Price Job_SI2k Job_Arrival_Time Once the values of input parameters are given to the simulation can be started. It is based on the three interacting components: Simulation Site Manager (SSM), Simulation Job and Banking Manager (SJBM) and Simulation Data Manager (SDM). The operational details of these components and the role of each of them in the simulation process are described in the following paragraphs of this section. SSM is the key component of the simulation which is responsible for advancing simulation clock as well as managing the objects which mimic the work of AliEn sites (so-called site objects). Upon start-up it sets the simulation clock to zero and creates the list of site objects. Each site object represents one AliEn site and is initialized using the site configuration data provided as the input data. With every site object there is an associated Future Event List (FEL) which can contain two types of events: either job request (Event_Job_Request) or job completion (Event_Job_End). Initially each site object s FEL contains one job request event scheduled at the time equal to the start time of a site. At each step of the iteration the site manager does the following: Increments the value of simulation clock by one 62

63 Reports to the SJBM the value of simulation clock Starts checking the FELs of the active site objects (the site object is considered active if the value of its Site_Start_Time attribute is greater or equal to the current value of simulation clock). If it finds the event which is scheduled to happen at the time equal to the current value of the simulation clock it takes the appropriate action: conveys either job request or job completion report from site object to SJBM Conveys the job start messages received from SJBM (if any) to the site object to which the message is addressed and instructs it to update its FEL, i.e. to calculate the duration of the job (job duration = Job_SI2k/Site_SI2k_Ratio) and schedule Event_Job_End event at the time which is equal to the sum of current simulation clock value and received job duration. Conveys no job messages received from SJBM (if any) to the site object to which the message is addressed and instructs it to update its FEL, i.e. to schedule Event_Job_Request at the time which is equal to the sum of current simulation clock value and the value of Site_Job_Request_Rate attribute of the site. Reports about all events (e.g. job assignment to the site) which have taken place during that iteration to the SDM Recalculates the values of Site_Running_Jobs attribute for all site objects and reports to SDM SJBM mimics the work of AliEn central job management services (including the banking service). It is responsible for maintaining the list of jobs and distributing them to the simulated sites as well for maintaining users and sites bank accounts. It maintains its own FEL which can contain two types of 63

64 events: Event_Job_New which indicates the arrival of the new job and Event_Priority_Calculate which indicates the need to (re)calculate the priorities of the jobs which are waiting for execution. During the initialization the SJBM loads the input data which concerns users and jobs, marks all jobs as inactive, adds to FEL Event_Priority_Calculate at time zero as well as iterates through the list of jobs and schedules Event_Job_New at the time equal to the value of appropriate Job_Arrival_Time attribute. After initialization it starts waiting for the messages from SSM. It can receive three types of messages from SSM: The first type is simulation clock value update message in which case SJBM checks the FEL and if there are events scheduled at the time equal to the received value of simulation clock takes appropriate action according to the type of event. Event_Job_New causes SJBM to mark the jobs whose arrival time is equal to the received value of simulation clock as active (i.e. set the value of their Job_Status attribute to WAITING ), update User_Waiting_Jobs attribute of the concerned users as well update the value of System_Waiting_Jobs variable. Event_Priority_Calculate event causes the SJBM to recalculate the effective priority of the jobs which have status WAITING (according to the algorithm described in Section 3.1) as well as to schedule the next Event_Priority_Calculate event. The second type is the job request message in which case SSJB sorts the jobs which have status WAITING according to their effective priorities and tries to find a job matching the site from which the job request originated. SSJB makes a reasonable assumption (since in AliEn the package installation is done automatically) that all the sites have the necessary packages installed and satisfy the hardware requirements of the jobs, so the only parameter which maters during the matching process is the price of the site which has to be less or equal to the bid of the job. Once it finds a job it sends it to the matched site for execution. After that it updates Job_Status (sets to 64

RUNNING) and Job_Site (set to the name of the matched site) attributes of the job, and updates the values of the User_Waiting_Jobs and User_Running_Jobs attributes of the user who has submitted the job as well as the values of the System_Running_Jobs and System_Waiting_Jobs variables. Otherwise (in case it does not find a matching job) it sends a no job message to the SSM. The third type of message which the SJBM expects is the job completion report message. This message causes the update of the Job_End_Time (set to the current value of the clock), Job_Run_Time (Job_End_Time - Job_Start_Time) and Job_Status (set to DONE) attributes. The messages of this type also cause the SJBM to update the values of the attributes associated with the banking service by imitating money transfers, i.e. by changing the values of the User_Account_Balance (decreasing by Job_Price*Job_SI2k) and Site_Account_Balance (increasing by Job_Price*Job_SI2k) attributes. The SJBM reports the information about all events and entities whose attribute values have been changed to the SDM.
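The event-scheduling/time-advance mechanism used by the SSM and SJBM can be illustrated with a small self-contained Perl sketch of a future event list; the event names mirror the model and the handlers are reduced to print statements, so this is an illustration rather than the simulation toolkit itself.

use strict;
use warnings;

# Minimal future event list (FEL): the clock is incremented by one at each step,
# as the SSM does, and the events scheduled for that instant are processed.
my @fel;
sub schedule { push @fel, { @_ } }

schedule(time => 1, type => 'Event_Job_Request');
schedule(time => 4, type => 'Event_Job_End');

for my $clock (0 .. 5) {
    my @due = grep { $_->{time} == $clock } @fel;          # events scheduled for "now"
    @fel    = grep { $_->{time} != $clock } @fel;
    printf "t=%d  %s\n", $clock, $_->{type} for @due;      # the real handlers would run here
}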

Figure 3.2. Components of the discrete-event system simulation: the Simulation Site Manager (SSM), which manages the internal simulation clock and maintains the list of site objects; the Simulation Job and Banking Manager (SJBM), which maintains the list of jobs and the bank accounts; and the Simulation Data Manager (SDM), which records the simulation log

The SDM service collects from the SSM and SJBM information about the state of the system. For each value of the simulation clock it creates a snapshot of the system state by recording all the system state variables. The simulation stops when all the jobs which were provided as the input are processed. After the simulation stops the SDM contains comprehensive data about the simulated system.

3.3 The simulation toolkit details

The toolkit is written in the Perl programming language with the use of the so-called event-driven programming technique. It has a main loop which is clearly divided into two sections: the first is event selection (or event detection), and the second is event handling. For the implementation of event selection and handling the POE (Perl Object Environment) framework [121] is used. The architecture of POE consists of a set of layers with a component called POE::Kernel at the bottom. POE::Kernel represents

the events layer which operates as the main loop of each running POE instance. The first call is to the "event dispatcher" - POE::Kernel. An example scenario of the use of the simulation program is the case when the VO administrator wants to find out how a change of the tax rate and/or the nominal price of the jobs would affect the earnings of AliEn sites. AliEn records all the information about the jobs which were run in the system in its database. So the administrator can extract from the database the information about the jobs which were run during, say, the previous three months and feed it to the simulation program as an input. Another thing the simulation program needs is the configuration of the sites and information about the system users, which also can be extracted from the real system. The administrator can then run the simulation with modified values of the nominal price and tax rate and, once the simulation is over, the data from the SDM, which will contain the information about the earnings of the sites during the simulated three months, can be compared with the real data.

3.4 Evaluation of the GBS model with the use of simulation toolkit

The toolkit has been used to simulate the work of GBS and WMS for evaluating the GBS model and studying its efficiency. In particular, we wanted to find out how the use of GBS affects the average waiting time of the jobs 3. As an input for the simulations a log containing 11 days of activity collected from multiple nodes that make up the Worldwide LHC Computing Grid (WLCG) has been used. The log (available from the Grid Workloads Archive [122, 123] project) contained information about jobs submitted within the ALICE VO. Four cycles of simulations have been performed. In the first cycle the execution of jobs has been simulated using the FIFO (First In First Out)

3 The average waiting time of the jobs of a given type is equal to t = (sum of the waiting times of the individual jobs of the given type) / (overall number of jobs).

approach; in the rest of the cycles the execution of jobs has been simulated using the developed GBS approach. In all four cases, 9986 (~70%) of the overall jobs were assumed to be the ones submitted by Monte-Carlo production operators (hereafter we refer to them as Monte-Carlo jobs), and the rest were assumed to be the jobs submitted by regular user(s) (hereafter we refer to them as user's jobs).

Figure 3.3. Results of the simulation: average waiting time of the Monte-Carlo jobs, the user's jobs and the "rich" user's jobs for the four simulated cases (FIFO; equal prices; user's price 2 times higher; competing users, with the "rich" user's price the highest)

For the facilitation of the perception of the simulation results, in Figure 3.3 the average waiting time of the Monte-Carlo jobs executed using the FIFO approach is taken as the unit, and all other values are expressed in that unit. The results of the simulation are as follows:

1) As expected, in the first case (FIFO) the average waiting time ratio corresponds to the ratio of the numbers of Monte-Carlo and user's jobs (~70% Monte-Carlo jobs, ~30% user's jobs), so the system does not give preference to either type of jobs.

2) In this case the jobs were scheduled using the GBS and their price was set to be equal. The simulation has shown that the average waiting time of the user's jobs is decreased by about 1.65 times (compared to FIFO). It is explained by

69 the fact that the priority depends on the number of currently running jobs, and in the workload the number of Monte-Carlo jobs is more than 2 times bigger, so odds are very high that at any given moment the number of running Monte-Carlo jobs will be higher. This means that the user s jobs get a higher priority and are more likely to be scheduled for execution. 3) In this case the jobs were again scheduled using the GBS, and the price of user s jobs was set to be 2 times higher than that of the Monte-Carlo jobs. It is seen from the plot that the average waiting time again was decreased for about 1.2 times compared to the previous case (more than 2 times compared to the FIFO case), and the waiting time of the Monte-Carlo jobs increased for about 1.7 times. This was also expected, since with higher price the chances of user s job to be scheduled get even bigger. 4) In the last case the jobs were split into three parts: a) 9924 (~70% of overall jobs) jobs were Monte-Carlo jobs with the price set to 1, 2230 (~15% of overall) were the regular user s jobs, with the price set to 2, and the rest were the jobs of rich user, with the price set to 3. It is seen from the plot that the average waiting time of the rich user is ~1.6 times lower than that of the normal user (more than 2.7 times compared to the FIFO case). Again, these results were expected, since the higher price of the jobs suggests that their waiting time should be lower than that of the Monte- Carlo jobs, from the other side users jobs are also competing between themselves and because of the higher price, jobs of rich user get scheduled faster (hence their waiting time becomes less). It is seen from the results that the GBS model indeed provides a flexible job scheduling scheme which solves the problem of the coordination and accounting of the resource use (see, please, Section 2.4 for the definition of the problem). 3.5 Integration of the banking service with AliEn WMS The component of the Job Optimizer service a daemon called Priority, periodically contacts the banking service and retrieves the information about 69

the available credits for each user account. It then assigns the priority, according to the procedure described in Section 3.1 (effective priority = computed priority * job bid), to the jobs of the users whose accounts have a positive balance in the banking service. The jobs of the users whose account has a negative balance get the minimal priority (i.e. 1). The service also puts a mark in the database entries of those jobs so that after their completion the users who have submitted them are not charged by the banking service.

Figure 3.4. The Priority optimizer consults the banking service and assigns the priorities to the jobs (it retrieves the user account information from the banking service and updates the job database)

The CEs of the sites which have free worker nodes for job execution send a job execution request to the Job Broker service. The requests contain site configuration information as well as the price of the CPU resource the site provides. According to the model described in Section 3.1, the sites accept for execution only the jobs which have bids higher than or equal to the site price.

Figure 3.5. Sites which have free worker nodes request jobs for execution (the Job Broker retrieves matching jobs from the job database)

After receiving a job request from a site, the Job Broker service starts the matching process: it sorts the list of waiting jobs according to their effective priorities and iterates through the sorted list. At each iteration step it checks whether the requirements of a given job can be satisfied by the site which has requested a job for execution. Once the Job Broker finds such a job it instructs the CE of the matched site to start a job agent (see Section for details). Once the job execution is over, the information about that job (including the amount of consumed CPU resources in SPECint2000 [118] units) is sent from the site to the Job Manager, which stores that information in the database. The list of successfully completed jobs (with final status DONE) is periodically retrieved from the database by another component of the Job Optimizer service, a daemon called Charge. This daemon calculates the price of each job (job price = number of consumed SI2K units * job bid * nominal unit price) and requests the BS to make a money transaction from the group account of the user who submitted the job to the account of the site where the job was executed. In case a tax rate is defined by the VO administrator, part of the money earned by the site is transferred to a special tax accumulation account.

Figure 3.6. Charging process
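The two bookkeeping rules used by the Priority and Charge daemons can be summarized in a short sketch. This is only a simplified illustration of the formulas given above, not the actual AliEn code; the function and field names are hypothetical.

# Simplified sketch of the Priority and Charge optimizer logic.
# Function and field names are hypothetical, not the actual AliEn implementation.

MINIMAL_PRIORITY = 1

def assign_effective_priority(job, account_balance):
    """Priority daemon: effective priority = computed priority * job bid,
    but only for users whose banking account balance is positive."""
    if account_balance <= 0:
        job["effective_priority"] = MINIMAL_PRIORITY
        job["charge_on_completion"] = False   # marked so the user is not charged later
    else:
        job["effective_priority"] = job["computed_priority"] * job["bid"]
        job["charge_on_completion"] = True
    return job

def charge_for_job(job, nominal_unit_price, tax_rate=0.0):
    """Charge daemon: job price = consumed SI2K units * job bid * nominal unit price.
    Returns the amounts to transfer to the site account and to the tax account."""
    price = job["consumed_si2k"] * job["bid"] * nominal_unit_price
    tax = price * tax_rate
    return {"from": job["user_group_account"],
            "to_site": price - tax,
            "to_tax_account": tax}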

Figure 3.7 demonstrates the overall process of job submission, execution and charging.

Figure 3.7. Overall process of job submission, execution and charging

It is important to mention that the model is implemented as an add-on to AliEn: the user can decide not to use the features provided by the banking service, in which case the user's job will be charged at the nominal price. The banking service allows VO and site administrators to define a flexible policy according to which the money earned by the sites is periodically distributed between the group accounts. The distribution policy is a set of rules, each of which defines a money transfer to be made

between a particular site account and group accounts. Rules contain information about the accounts between which the transfer must be made, the amount of money which must be transferred and the time when the transfer must be made. The amount can be defined either as a natural number or as a fraction of the sum accumulated on the site account at the moment of the transfer. The transfers can be scheduled monthly (e.g. on the fifth day of every month), yearly (e.g. on the first of April), weekly (e.g. every Monday) or on a certain date (e.g. on the first of May). The rules of the distribution policy can be viewed and modified via the web interface (Figure 3.8).

Figure 3.8. AliEn Banking Service distribution policy web interface

Figure 3.8 shows a screenshot of the web page where the distribution policy rules can be seen and modified. Let us examine this screenshot: according to the first rule, all the money accumulated on the CERN site account should be transferred to the CERN users group account on the fifth day of every month. The second and third rules define that all the money accumulated on the JINR (Joint Institute of Nuclear Research) site account should be transferred to the JINR users group account on the first days of March and September. The fourth rule defines that all the money accumulated on the account of the Yerevan site must be transferred to the Yerevan users group account every Monday. The last two rules define that on the 10th of March the money accumulated on the account of the FZK (Forschungszentrum Karlsruhe) site must be equally

distributed between the University of Frankfurt and University of Heidelberg users group accounts. Every rule can be deleted by marking the corresponding checkbox on the left and pressing the Delete button. To define a new rule the site or VO administrator has to choose the site account, the group account, the fraction of the accumulated money, and the time when the transfer must be made. The choices are made using the drop-down menus at the end of the page, and a new rule is added by pressing the Add button. The banking service periodically reports the status of site and group accounts to the AliEn monitoring system. This information is collected in the MonALISA repository and is made accessible via the AliEn monitoring web page. The system allows viewing that information in two ways: either in the form of a table where the current balance of each account is displayed (Figure 3.9) or in the form of a chart which shows the account balance as a function of time (Figure 3.10).

Figure 3.9. A web page (part of the central MonALISA repository of AliEn) displaying the current balance of site accounts in the banking system

Charts can be customized to display the balance information for one or many accounts over a given time period. Figure 3.10 shows a screenshot of a chart containing information for the CERN and FZK site accounts for the period from November 3rd to November 23rd of the year 2009.

Figure 3.10. A web page (part of the central MonALISA repository of AliEn) displaying the balance of the CERN and FZK site accounts in the banking system for the 20-day period starting from November 3rd 2009

It can be seen from the screenshot that from the 3rd to the 13th of November the balance of the sites almost did not change; however, from November 13th onwards the amount of money accumulated on the site accounts started to rise quickly. This is explained by the fact that during the first period the sites were running a very small number of jobs, because the Monte-Carlo data production was stopped during that period. Starting from November 13th the data production operators restarted it, the sites started running many jobs, and the amount of money earned by them started to rise. The banking service also reports to the MonALISA repository the numbers, bids and priorities of the jobs which are waiting in the central task queue or

running on the AliEn grid. This information is then made accessible on the monitoring pages of AliEn (Figure 3.11).

Figure 3.11. A table containing the numbers, bids and priorities of the jobs which are waiting in the central task queue or running on the AliEn grid

The account column contains the usernames of the users who currently have running or waiting jobs, the columns under Jobs show the number of waiting and running jobs belonging to the user, the columns under Price show the average, minimum and maximum values of the bids of the jobs belonging to the user, and the columns under Priority show the average, minimum and maximum values of the effective priorities of the waiting jobs. The data presented in the table is updated every two minutes, so AliEn users and Monte-Carlo production operators can use it to decide the price of the jobs they are going to submit. For example, if the job submitter wants the job to be sent for execution relatively quickly, he/she can set the bid of the job higher than the average. The banking service runs inside the web server and clients communicate with it using the SOAP [110] protocol. Clients authenticate to the service with X.509 [117] certificates. The bank configuration information (e.g. names of the group accounts, prices for the consumed CPU unit proposed by sites, VO tax rate, etc.) is stored in an LDAP server. Banking operations (e.g. account management, transactions) are implemented using the Gold allocation manager [124].
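As an illustration, a client could talk to such a SOAP endpoint roughly as follows. This is a hypothetical sketch only: the WSDL location, certificate paths and the operation name are placeholders and are not part of the actual AliEn banking service interface.

# Hypothetical sketch of a SOAP client authenticating with an X.509 certificate.
# WSDL URL, certificate paths and the operation name are placeholders (assumptions).
from requests import Session
from zeep import Client
from zeep.transports import Transport

session = Session()
# Client-side X.509 authentication: user certificate and private key in PEM format.
session.cert = ("/path/to/usercert.pem", "/path/to/userkey.pem")

client = Client("https://bank.example.org/AliEnBank?wsdl",
                transport=Transport(session=session))

# A hypothetical operation; the real service exposes its own set of SOAP calls.
balance = client.service.getAccountBalance("cern_users")
print(balance)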

3.6 Summary of Chapter 3

The model of the Grid Banking Service (GBS) for job scheduling in the AliEn Grid has been designed, and a flexible job scheduling scheme which solves problem 1) outlined in Section 2.4 has been implemented. The model addresses the problem of giving AliEn users and Monte-Carlo data production operators control over the priorities of the jobs they submit. The possibility to alter the priorities of the jobs sent to the AliEn Grid for execution is necessary because the research priorities of the collaboration vary over time, and different Grid jobs have different importance for the members of the collaboration. The model describes a mechanism which allows implementing automatic control of the job execution order in accordance with their importance. To anticipate the effects of changes of the configuration parameters related to the banking service, a discrete-event system model for the AliEn WMS has been designed, and a toolkit for the simulation of the work of the AliEn WMS has been developed. The simulation of the system is done using the Event Scheduling/Time Advance (ESTA) algorithm (a generic sketch of such a scheduling loop is given at the end of this summary). The toolkit has been used to simulate the work of the GBS and WMS for evaluating the GBS model and studying its efficiency. As an input for the simulations, a log containing 11 days of activity collected from multiple nodes that make up the Worldwide LHC Computing Grid (WLCG) has been used. The log (available from the Grid Workloads Archive project) contained information about jobs submitted within the ALICE VO. The analysis of the simulation results has shown that the GBS allows decreasing the waiting time of the user's jobs by 1.6 to 2.7 times. The simulation toolkit can also be used by the AliEn WMS and banking service administrators to simulate the work of the GBS and recommend the

optimal configuration parameters to AliEn users and Virtual Organization administrators. The GBS has been developed, tested and integrated into the Workload Management System (WMS) of the AliEn Grid. The monitoring system of AliEn has been equipped with tools which allow users to get information about the banking service from the AliEn central monitoring repository web page. With the use of the GBS a job scheduling scheme has been developed in which the priority of a job depends on the price the owner of the job is ready to pay. The price of each job is defined by the user who submits it. The virtual money used to pay for a job is accumulated on the bank account of the site where the job was executed. The GBS also provides a web interface for administrators to define a flexible policy according to which the virtual money earned by the sites is periodically distributed between the accounts of the users. The amount of virtual money allocated to a user (or a group of users) depends on the work he/she is doing and reflects the research interests of the collaboration.
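The Event Scheduling/Time Advance approach mentioned above can be illustrated with a generic skeleton. This is only a minimal sketch of such a scheduling loop, not the developed simulation toolkit; the event representation and handler interface are hypothetical.

# Generic Event Scheduling / Time Advance (ESTA) skeleton: a minimal sketch,
# not the actual AliEn WMS simulation toolkit. Event structure is hypothetical.
import heapq
from itertools import count

def run_simulation(initial_events, handlers, until):
    """initial_events: iterable of (timestamp, event) pairs;
    handlers: dict mapping event['type'] to a function (clock, event) -> new (timestamp, event) pairs."""
    seq = count()  # tie-breaker so that equal timestamps never compare event dicts
    queue = [(ts, next(seq), ev) for ts, ev in initial_events]
    heapq.heapify(queue)  # future event list ordered by timestamp
    clock = 0.0
    while queue:
        clock, _, event = heapq.heappop(queue)  # advance the clock to the next scheduled event
        if clock > until:
            break
        for ts, new_event in handlers[event["type"]](clock, event):
            heapq.heappush(queue, (ts, next(seq), new_event))
    return clock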

CHAPTER 4. DEVELOPMENT OF TWO MODELS OF INTEGRATION OF CLOUD COMPUTING RESOURCES WITH ALIEN

For the solution of problems 2), 3) and 4) outlined in Section 2.4, two models of the integration of computing and storage resources offered by IaaS providers with the AliEn Grid have been developed. The first model (called the Classic model) assumes the dynamic deployment of an AliEn virtual site on IaaS resources: one first deploys VMs supporting the site services, then starts Worker Nodes (WNs) on the same cloud which, using the site services, get jobs for execution from the central AliEn task queue (detailed information about the Classic model is given in Section 4.3). The second model (called the Co-Pilot model) does not require the deployment of site services on a cloud; instead it is based on a specially developed protocol by which lightweight agents running on the cloud get Grid jobs by communicating with a centrally deployed adapter service (detailed information about the Co-Pilot model is given in Section 4.4). The implementations of both models have been tested [125, 126] using the virtual machine images provided by the CernVM project [127] on the University of Chicago's Nimbus scientific computing cloud [95]. The description of CernVM as well as the architecture of the Nimbus toolkit is given in Sections 4.1 and 4.2. The description of the developed models, their implementation details and their comparison is given in Sections 4.3, 4.4 and 4.5.

4.1 CernVM - a virtual appliance for LHC applications

CernVM is a virtual software appliance [127, 128] designed and developed for running physics applications from the LHC [27] experiments at CERN. It provides images for virtual machines which contain just enough Operating System components to run the experiment software applications. The virtual machine appliances provide an environment for developing and running LHC data analysis programs on an end-user laptop or desktop as well

as on the Grid. This environment is portable and independent of both the operating system and the hardware platform. CernVM makes it possible to use the experiment application software without installing it and also reduces the number of platforms (compiler-OS combinations) on which the experiment software needs to be ported and tested, which in turn significantly reduces the cost of application software maintenance. CernVM virtual machine appliances are built using the rBuilder tool from rPath [129]. This tool provides a very convenient way to describe, develop, build and distribute virtual machine appliances which can be run on both 32-bit and 64-bit platforms. rBuilder provides features for building appliances for all the major free (Xen [86], KVM [87], Sun VirtualBox [88]) and commercial (VMware [130], Parallels [131]) hypervisors. The appliance is built from a so-called group recipe - a collection of simple scripts written in the Python programming language which describe the individual software application packages. These packages are built using the Conary package manager, which, unlike other popular package managers, can analyze the code of the packages and automatically detect and build all their run-time and build-time dependencies. The resulting virtual machine image contains the applications as well as just enough of the Operating System required to run them. This approach has a clear advantage: the fewer components are put in the image, the smaller the effort of maintaining them and keeping them up to date.

Figure 4.1 Basic building blocks of the CernVM software appliance

The appliances produced by rBuilder are based on rPath Linux - a distribution binary-compatible with Scientific Linux CERN (SLC) [132] (the Linux distribution used by all LHC experiments), which allows reusing the software packages (e.g. experiment application software) already built and tested in the SLC environment and running them without modifications under CernVM. The appliance is configured to run the rPath Appliance Agent daemon, which provides two interfaces for configuring the appliance according to the needs of the user. The first is a Web interface for easy interaction with end-users, and the second is based on XML-RPC (XML Remote Procedure Call) [133] and is meant for interaction through a programming language interface to ease bulk or repetitive operations (e.g. configuring many instances of CernVM running on a remote computing cloud). The basic components of the CernVM appliance are shown in Figure 4.1. The CernVM appliance, which fits in a compressed file of around 130 MB, contains only as much of the operating system as is sufficient to deliver the entire set of experiment software application packages from a dedicated network file system to the end-user's workstation or laptop on demand. Such an approach allows CernVM developers to maintain a single virtual software appliance which serves as a common platform for all LHC experiments. At the same time it enables each of the experiments to customize the appliance

according to their needs and to deliver the software applications to the users in an efficient way. The end-users in their turn get an easy to install and portable environment which always contains up-to-date software frameworks of the LHC experiments.

CVMFS - CernVM File System

The experiment application software, which is built independently from the CernVM appliance, is made available to the user on demand using a network file system called CVMFS (CernVM File System) [134]. The procedure for building, installing and validating the experiment software application releases is controlled by the specialists of each experiment. CernVM provides tools for synchronizing the built and configured experiment software with the central distribution repository, from which it is made accessible to CernVM users by means of the network file system. In the scenario employed by CernVM, the virtual machine downloads only the necessary application software libraries and binary files as they get referenced for the first time. By doing that, the amount of software that has to be downloaded in order to run typical experiment tasks in the virtual machine is reduced by an order of magnitude. The distribution of the experiment application software to many remote instances of CernVM running around the world has been solved by means of a dedicated file system, GROW-FS [135]. It has been developed to make a directory tree stored on a web server accessible over the wide area network, with aggressive caching and end-to-end integrity checks. It is designed to maximize the cacheability of all components of the file system. Both files and file metadata can be cached on local disk without remote consistency checks, as well as cached on shared proxy servers, allowing the file system to scale to a very large number of clients.

Figure 4.2 Software release process

The GROW-FS file system is created in the following way. A catalogue of all files in a directory tree is created and exported via a web server. This is done by a script that creates a .catalogue file which contains a complete list of files and directories along with a checksum of all the data. This structure is automatically loaded into the memory of the client by the special file system driver when it first tries to access a file in GROW-FS. All metadata requests (e.g. file permissions) and directory lookups (e.g. file search operations) are handled using this data structure. The integrity of the catalogue is ensured by fetching its checksum over HTTPS, a secure version of the HTTP protocol. If there is an inconsistency between the master checksum and the catalogue, the driver automatically reloads them from the distribution repository. The initial GROW-FS file system used by CernVM was embedded in the Parrot [136] framework and was meant to run completely in user space. The use of Parrot, however, introduced a considerable performance penalty (because it intercepts system calls using the strace interface) and occasional problems with certain applications. That is why it has been replaced with a version of CVMFS which uses code from both Parrot and GROW-FS and has been implemented to run on top of the FUSE [137] kernel module.
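The catalogue-based consistency check described above can be illustrated with a short sketch. This is only an illustration under assumptions: the repository URLs, file names and the use of SHA-1 are placeholders, and the real GROW-FS/CVMFS client is not implemented this way in Python.

# Illustrative sketch of a GROW-FS-style catalogue integrity check.
# Repository URL, file names and the choice of SHA-1 are assumptions for the example.
import hashlib
import urllib.request

REPO = "http://cvmfs.example.org/repo"          # plain HTTP for the bulk data
SECURE_REPO = "https://cvmfs.example.org/repo"  # HTTPS only for the small master checksum

def fetch(url):
    with urllib.request.urlopen(url) as response:
        return response.read()

def load_catalogue():
    """Download the file catalogue and verify it against the checksum fetched over HTTPS."""
    catalogue = fetch(REPO + "/.catalogue")
    master_checksum = fetch(SECURE_REPO + "/.catalogue.sha1").decode().strip()
    if hashlib.sha1(catalogue).hexdigest() != master_checksum:
        # Inconsistency: reload both from the distribution repository (simplified: retry once).
        catalogue = fetch(REPO + "/.catalogue")
        master_checksum = fetch(SECURE_REPO + "/.catalogue.sha1").decode().strip()
        if hashlib.sha1(catalogue).hexdigest() != master_checksum:
            raise RuntimeError("catalogue integrity check failed")
    return catalogue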

This new version adds features which improve scalability and performance, and simplify the deployment of applications in the case when different versions compiled for different platforms are available. The principal new features of CVMFS compared to GROW-FS are:
- Use of the FUSE kernel module, allowing for in-kernel caching of file attributes.
- Capability to work in offline mode, provided that all required files are cached.
- Possibility to use multiple (a hierarchy of) file catalogues on the server side.
- Transparent file compression below a given size threshold.
- Dynamic expansion of environment variables embedded in symbolic links.
At present CernVM supports all four LHC experiments by making all their application software (about 100 GB) accessible via CVMFS. It not only solves the problems of virtual machine image size (about 130 MB) and application software updating (updates of the application software are performed in the central repository and are transparent to the end-user: CVMFS automatically loads the newest version available), but also requires minimal changes to the working habits of end-users.

CVMFS Supporting Infrastructure

The operation of CVMFS is critical for the work of remote virtual machines running CernVM appliances. The central infrastructure which supports CVMFS is designed so that it can easily be scaled to support a higher load by just adding more front-end servers which run a reverse proxy service and operate in DNS load-balancing mode. These servers forward incoming requests to the back-end servers, which are deployed inside virtual machines. These virtual machines are connected to a shared network file system which runs on a highly available hardware configuration and allows live migration of the virtual machines between physical nodes.

Figure 4.3 Supporting infrastructure for CernVM central services

The infrastructure is built from three blocks: computing, storage and networking services. The computing services run all the virtual machines and CVMFS services (e.g. web servers). Those services are deployed as virtual machines running on a pool of VMware ESX [138] servers in such a way that the CPU time and system memory are aggregated into a common pool and can be allocated to one or more virtual machines. Such an approach provides optimal utilization of the hardware resources and simplifies the effort of providing highly available and scalable services. The storage system provides reliable and resilient storage for the whole infrastructure. It is conceived as a multi-tiered storage solution in which two servers, deployed on hardware with a highly available configuration, provide the bulk storage. The system provides advanced management features such as dynamic allocation, dynamic growth, snapshot taking and data replication, as well as the capability to perform regular backups. The network infrastructure is split into two separate network domains and three different IP segments. Public access services are decoupled from internal services by different physical channels which have different QoS

(Quality of Service) configuration and use IEEE 802.3ad [139] and 9K Jumbo Frames [140] for efficient data input/output and for the live migration of virtual machines.

4.2 Nimbus - a toolkit for building IaaS computing clouds

The Nimbus project [ ] provides an open source, extensible IaaS implementation supporting the Web Service Resource Framework (WSRF) [144] as well as the Amazon EC2 interfaces [90]. It allows a client to lease remote computing and storage resources by deploying virtual machines on them. The service allows authenticated clients to dynamically deploy and manage workspaces in the form of virtual machines. To deploy a workspace a user has to provide the information about the virtual machine image which has to be used to start the virtual machine(s), as well as a configuration in the form of an XML file which describes the deployment (e.g. the period for which he/she wants to lease the resources and the number of virtual machines to be deployed). Once the virtual machines are deployed, the client gets the relevant information about them (e.g. the IP addresses) as well as the so-called endpoint reference (EPR) [144], which can be used to interact with the service and manage the deployment (e.g. increase the lease time, or pause, resume or terminate the virtual machines). The authentication and authorization part of the toolkit gives clients the possibility to be authorized based on the VO role information contained in their credentials obtained from the Virtual Organization Membership Service (VOMS) [145]. The service provides several flexible options for configuring the networking of deployed virtual machines (allocating a new network address, bridging an existing address, etc.). A client, for example, can request that virtual machines be started with several network interface cards to which addresses from different pools (e.g. a private IP address pool and a public IP address pool) must be assigned.
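Because Nimbus also exposes the EC2 interfaces, a client can lease a virtual machine programmatically. The following is only a rough sketch assuming the boto library and an EC2-style query endpoint; the host name, port, credentials and image identifier are placeholders, and the WSRF workspace client remains the interface offering the full functionality.

# Rough sketch of leasing a VM through the EC2-compatible interface of a Nimbus cloud.
# Host name, port, credentials and image identifier are placeholders (assumptions).
from boto.ec2.connection import EC2Connection
from boto.ec2.regioninfo import RegionInfo

region = RegionInfo(name="nimbus", endpoint="nimbus.example.org")
conn = EC2Connection(aws_access_key_id="ACCESS_KEY",
                     aws_secret_access_key="SECRET_KEY",
                     is_secure=True, port=8444, path="/",
                     region=region)

# Ask for one instance of a registered CernVM-based image for the duration of the lease.
reservation = conn.run_instances("emi-cernvm-alice", min_count=1, max_count=1)
for instance in reservation.instances:
    print(instance.id, instance.state)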

The toolkit consists of several components which are schematically presented in Figure 4.4:

Figure 4.4 Schematic representation of the Nimbus toolkit components

The Workspace service allows a remote client to request the deployment of one or several virtual machines and provides an interface for managing them. Its main constituent is a Web Service (WS) front-end to a virtual-machine-based resource manager deployed on a site. Users can communicate with the workspace service either by means of WSRF [144], or alternatively it can be contacted using Amazon's EC2 [90] Web Service Description Language (WSDL) specification.

The Workspace resource manager implements the deployment of virtual machines on a resource provider's infrastructure.

The Workspace pilot is used for the deployment of virtual machines using a site's local resource manager, such as the Portable Batch System (PBS) [146] or Sun Grid Engine (SGE) [147]. The use of this service allows resource providers to deploy the Nimbus toolkit without significant alteration of their current site setup.

The Workspace control components are used to bring the virtual machine image specified by the user from the storage service to the machine where it has to be deployed. They are used to start, stop and suspend the virtual machine, as well as to configure the virtual machine networking settings and deliver contextualization information to it. Currently workspace control works with the Xen [86] and KVM [87] virtual machine hypervisors.

The IaaS gateway serves as an interface between authenticated clients and other IaaS infrastructures. It is currently capable of mapping the X.509 credentials of Nimbus users to Amazon EC2 accounts. Using this component, Nimbus users can deploy their virtual machines on the Amazon EC2 infrastructure.

The Context broker allows a client to deploy a functioning cluster of virtual machines (as opposed to a set of unconnected virtual machines) with a single request. It acts as an orchestrator for the group of virtual machines deployed within the same request. It assigns different types to the virtual machines of the same deployment according to the configuration provided by the user who submitted the deployment request and helps newly started virtual machines to discover each other. It also delivers the configuration information needed by the services running inside the virtual machines according to the virtual machine type defined by the user. The features provided by this component are essential for the deployment of one-click dynamic virtual grid sites on an IaaS computing cloud.

The Nimbus storage service acts as a secure manager of cloud disk space. For each user it maintains a repository of available virtual machine images. Users can add images to or remove images from their repositories. It is based on the Globus GridFTP [148] service, which allows supporting a rich set of network file systems.

The Workspace client, having a rich set of possible configuration options, provides full access to the Nimbus functionality. However, it is relatively complex to use and is typically used in conjunction with user-friendly wrapper scripts developed by the scientific communities.

The Cloud client is a popular end-user tool for interacting with Nimbus services. It has reduced functionality but is very user-friendly and easy to use.

These components are lightweight and self-contained; one can select several of them and set up a service in a variety of ways. For example, the initial configuration of Nimbus included the workspace service, workspace resource manager, workspace control, Nimbus storage service, and cloud client. By adding the context broker to this set of services it has become possible to deploy one-click virtual clusters. Another example is the IaaS cloud at the University of Victoria, where the resource manager is replaced with the workspace pilot, resulting in a cloud with different leasing semantics. The combination of the context broker and the IaaS gateway allows the deployment of one-click virtual clusters on the Amazon EC2 infrastructure. Currently there are several scientific IaaS clouds deployed using Nimbus at several universities:
- Nimbus at the University of Chicago
- Stratus at the University of Florida
- Wispy at Purdue University
- Kupa at Masaryk University

4.3 Development of Classic model for integration of cloud computing resources with AliEn

The Classic model implies the deployment of the necessary AliEn site services on the cloud. After the services are deployed and configured, virtual machines which run AliEn Job Agents (these machines are referred to as Worker Nodes) are started. The Worker Nodes communicate with the site services, which connect to the central AliEn services and request jobs for execution. The

whole ensemble represents a normal AliEn site, all the services of which run inside virtual machines (Figure 4.5). The configuration of the services and the Job Agents is done automatically when the nodes are started. Such a site can be started or stopped with a single command. To start a virtual site on the cloud the following AliEn site services must be deployed and configured:
- ClusterMonitor - routes messages from the AliEn site services to the central services. All site services communicate with the central services and the configuration database through the site ClusterMonitor (runs on the services node).
- PackMan (Package Manager) - installs the application packages (e.g. ROOT, AliRoot) required by the jobs on the site (runs on the services node).
- MonALISA - monitors the site services (runs on the services node).
- JobAgent (JA) - delivers the input data required by the job, starts the job and registers the output of the job in the AliEn file catalogue (runs on the WNs).
There is no need to run a Computing Element (CE) service, which serves as an interface to the local batch system and is typically used to start the JAs on the WNs. The model implies that JAs are automatically started on the WNs whenever the WNs are booted. The Storage Element (SE), a service for accessing local mass storage systems, is also not used, because the whole site running on a cloud is supposed to be started and stopped very often, and thus the data which could potentially be kept in the SE would not be available most of the time. That is why the data produced on the virtual site is kept on the SEs of other sites.

Figure 4.5 Site on the IaaS cloud. Classic model

To deploy an AliEn site on an IaaS computing cloud one has to perform the following steps:
1. Deploy the virtual machines: one machine which will run the ClusterMonitor, PackMan and MonALISA (hereinafter referred to as the services node) and the required number of WNs, which will run the AliEn JA.
2. Transfer the credentials (X.509 certificate and private key) for AliEn authentication to the services node.
3. Transfer the site configuration information (e.g. the name of the SE where the data produced by the site should be kept) to the virtual machines.
4. Make the necessary configurations on the services node (e.g. start the NFS server, which must make the application software installation directory accessible from the WNs).
5. Make the necessary configurations on the WNs (e.g. mount the directory exported by the services node).
6. Start the ClusterMonitor, MonALISA and PackMan services.
7. Start the JA service on the WNs.
Using the features of the Nimbus toolkit, a set of programs has been developed which automates all these steps and allows deploying a functional AliEn site with a single command. The command takes as input the deployment configuration file, which is formatted in XML and contains the description of the site to be deployed as well as various configuration parameters. The description includes the number and types of the virtual

machines which need to be started. In the case of an AliEn site there are two types of nodes: service (where the site services run) and WN (where the AliEn JAs run). For each of the node types there is a section in the deployment file which contains the information regarding that node type. For example, the section which defines the services node is presented in Figure 4.6 (the full deployment file for a virtual AliEn site consisting of one services node and five worker nodes is given in Appendix B). Let us examine its contents:
<name> - defines the name of the node type. In our case the type is called AliEn-Services-Node.
<image> - defines the name of the virtual machine image which must be used to start the virtual machine. The image should be registered in the Nimbus storage service (see Section 4.2 for the details). In our case the image produced by CernVM and configured for the ALICE virtual organization has been used; the name of the image is cernvm-bootstraped-alice-0.01.
<quantity> - defines how many virtual machines of this node type should be started. In the example setup all the AliEn site services run inside the same virtual machine, therefore the quantity is set to 1.

<workspace>
  <name>alien-services-node</name>
  <image>cernvm-bootstraped-alice-0.01</image>
  <quantity>1</quantity>
  <nic wantlogin="true">public</nic>
  <ctx>
    <provides>
      <identity />
      <role>alien-services</role>
      <role>nfsserver</role>
    </provides>
    <requires>
      <identity />
      <role name="nfsclient" />
      <role name="alien-wn" hostname="true" pubkey="true" />
      <data name="usercert">

        <![CDATA[ IICrjCCAhegAwIBAgICAa...]]>
      </data>
      <data name="userkey">
        <![CDATA[SO9ZVk7w0RdASasg/Kp... ]]>
      </data>
      <data name="alice_conf"><![CDATA[
        CE_FULLNAME alice::cloud::nimbus...
        PACKMAN_INSTALLDIR /opt/exp/alice/packages
      ]]>
      </data>
    </requires>
  </ctx>
</workspace>

Figure 4.6 Fragment of the AliEn site deployment configuration XML file containing the services node description

<nic wantlogin> - this attribute specifies that the virtual machine should be assigned a public IP address (i.e. one which will be accessible from outside the cloud) and also that SSH login must be enabled on the virtual machine (in case of problems with the AliEn site services the administrator might want to log in to the machine to find out their cause).
<ctx> - this element defines a sub-section which contains the information which will be used by the Nimbus Context Broker.
<provides> - defines a subsection which contains the information about the capabilities which this node provides. This information is needed by the Nimbus Context Broker and, if required, can be passed to the nodes of a different type (e.g. the worker nodes in our case).
<identity/> - instructs the Nimbus Context Broker to put the hostname and IP address of this node into the /etc/hosts file of all other nodes within the same deployment.
<role> - defines the roles which this node type provides. The value of this element is used by the context broker to call the appropriate configuration script after the node is booted.

<requires> - defines a subsection which contains the information about the data describing other nodes from the same deployment that is required by the services node.
<role> - lists the roles (defined in the <provides> sections of other node types) about which information needs to be passed to this node. The idea is that the Nimbus Context Broker reports to the program running inside a virtual machine everything about the other nodes that it is "required" to know about.
<data> - these elements describe the configuration data which must be passed to this node.
When the site deployment command is executed, Nimbus launches the VMs and 'contextualizes' them according to the types specified in the cluster description file. Contextualization is done in the following way: upon launching the VMs, a lightweight agent running on each VM contacts the Nimbus Context Broker, gets the contextualization information (according to the <requires> field of the site deployment type) and sends to the context broker the information which might be required by other nodes (according to the <provides> field of the site deployment type). This information is passed to a set of scripts (so-called contextualization scripts), which are launched to configure the appropriate services inside the VMs and start them. Once the deployment and contextualization are completed, the AliEn JAs contact the Job Broker central service, which fetches jobs from the ALICE task queue and sends them to the JAs for execution. The operation of the site running on the IaaS cloud at the University of Chicago can be seen in Figure 4.7.

Figure 4.7. AliEn site running on the University of Chicago's IaaS cloud

4.4 Development of Co-Pilot model for integration of cloud computing resources with AliEn

In the Co-Pilot model the AliEn Grid services are not deployed on a cloud. In this model the virtual machines deployed on the cloud are assumed to have the preconfigured environment as well as the experiment application software necessary for running Grid jobs. When deployed, these virtual machines start a lightweight program (hereafter referred to as the Agent) which makes job execution requests to a service (or a set of services) running outside the cloud and providing the functionality previously provided by the AliEn site services. This service (hereafter referred to as the Adapter) provides an interface between the Agents and the AliEn central services (Figure 4.8). Upon receiving a request from the Agent, the Adapter service contacts the AliEn Job Broker central service and requests a job for

execution. After receiving the job, it fetches the necessary input files from the AliEn file catalogue and instructs the Agent to start the job execution.

Figure 4.8 Co-Pilot Agents running on the IaaS cloud. Co-Pilot model

For the implementation of the Co-Pilot model for AliEn, the CernVM Co-Pilot framework has been developed. The framework consists of the Co-Pilot Agent, a lightweight program which runs inside the virtual machines deployed on the cloud, and of two Adapter services: the Co-Pilot Job Adapter and the Co-Pilot Storage Adapter. The Agent and the Adapters communicate via a Jabber/XMPP [149] messaging network using a specially developed protocol (its detailed description is given in Section 4.4.1). The use of Jabber/XMPP allows scaling the system in case of high load on the Adapter services by simply deploying new Adapter instances and adding them to the existing messaging network. As in the case of a site deployed following the Classic model, a Co-Pilot site can also be deployed on a Nimbus IaaS cloud with a single command. The deployment configuration file, which is again passed to the command as an option, is much simpler in this case (Figure 4.9). The required configuration parameters which need to be defined are the name of the workspace, the name of the image which must be used for starting the virtual machines, their quantity and the virtual machine network interface configuration information.

<cluster xmlns=" ">
  <workspace>
    <name>alien-copilot-agent-node</name>
    <image>cernvm-bootstraped-alice-xmpp-1.01.ext3</image>
    <quantity>5</quantity>
    <nic wantlogin="true">public</nic>
    <ctx>
      <provides></provides>
      <requires></requires>
    </ctx>
  </workspace>
</cluster>

Figure 4.9 Deployment configuration XML for the Co-Pilot model

The virtual machine image is preconfigured with all the necessary parameters (e.g. the name of the XMPP/Jabber server) for joining the messaging network and requesting a job for execution. One of the interesting features of this model is that its implementation can be used to integrate with the AliEn Grid infrastructure not only IaaS cloud resources, but also resources provided by so-called volunteer computing systems like BOINC (Berkeley Open Infrastructure for Network Computing) [150]. BOINC provides many teraflops of contributed processor power to a wide range of scientific and technical projects, allowing hundreds of institutes or individual researchers to access large amounts of computing power otherwise unavailable to them. The details of the prototype system which enabled the execution of AliEn Grid jobs on the BOINC volunteer computing system using the Co-Pilot model implementation can be found in [151].

4.4.1 Development of Co-Pilot Agent - Co-Pilot Adapter communication protocol

Agents communicate with the Adapter service over the Jabber/XMPP [149] instant messaging protocol using XML-formatted messages. The messages are enclosed in a <message> tag, which has two attributes: from and to (sender's

and receiver's Jabber IDs), and a <body> tag. All the values inside the <body> tag are encoded using the Base64 [152] algorithm. Each message body contains an <info> tag. The <info> tag has the command attribute, which contains the command that needs to be performed by the peer (e.g. the Agent sends the 'jobdone' command when it has finished executing the job and uploading files), as well as other attributes necessary to perform the command. The list of commands which the Agent sends, together with the corresponding response commands which it expects from the Adapter service, is presented below. The first command which is sent from the Agent to the Adapter is the job request command getjob. The request contains the information about the host where the agent runs (e.g. available disk space and memory) formatted in JDL [116], as well as the hostname of the machine where the agent runs. An example message containing the getjob command is given in Figure 4.10.

<message to='jmreal@cvmappi21.cern.ch' from='agent@cvmappi21.cern.ch'>
  <body>
    <info agenthost='_BASE64:Y3ZtYXBwaTI0LmNlcm4uY2g='
          jdl='_BASE64:ciagicbbciagicagicagumvxdw..i4yni44ltaumS5zbXAuZ2NjMy40Lng4Ni5pNjg2IjsgCiAgICAgICAgRnJlZU1lbW9yeSA9IDY4MzsgCiAgICAgICAgU3dhcca9ideynzsgciagICAgICAgRnJlZVN3YXAgPSAxMjcKICAgIF0='
          command='_BASE64:Z2V0Sm9i'/>
  </body>
</message>

Figure 4.10 Co-Pilot Agent - Co-Pilot Adapter communication protocol message containing the getjob command

Upon receiving the job request from the Agent, the Adapter contacts the AliEn Job Broker and requests a job for execution. After receiving the job JDL from the AliEn Job Broker, the Adapter gets the job input data files from the AliEn file catalogue and copies them to a directory from where the Agent can download them by means of the Chirp file system [153].
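To make the encoding convention concrete, the following sketch shows how such a getjob message could be assembled and decoded. It is only an illustration of the message format (Base64-encoded attribute values inside an <info> element); the Jabber IDs are taken from the example above, and the helper names are hypothetical, not part of the Co-Pilot code base.

# Illustration of the Co-Pilot message format: Base64-encoded attribute values
# inside an <info> element wrapped in an XMPP <message>. Helper names are hypothetical.
import base64
import xml.etree.ElementTree as ET

def b64(value):
    return "_BASE64:" + base64.b64encode(value.encode()).decode()

def unb64(value):
    return base64.b64decode(value.split(":", 1)[1]).decode()

def build_getjob(sender, receiver, agent_host, jdl):
    msg = ET.Element("message", {"to": receiver, "from": sender})
    body = ET.SubElement(msg, "body")
    ET.SubElement(body, "info", {"agenthost": b64(agent_host),
                                 "jdl": b64(jdl),
                                 "command": b64("getJob")})
    return ET.tostring(msg, encoding="unicode")

# The receiving peer decodes the attributes before acting on the command:
stanza = build_getjob("agent@cvmappi21.cern.ch", "jmreal@cvmappi21.cern.ch",
                      "cvmappi24.cern.ch", "[ Requirements = ... ]")
info = ET.fromstring(stanza).find("body/info")
assert unb64(info.get("command")) == "getJob"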

After that the Adapter sends to the Agent a message containing the runjob command as well as a <job> tag containing the following attributes:
- id - Job ID
- chirpurl - Address of the Chirp server from which to get the job input
- inputdir - Directory on the Chirp server from which to get the input files of the job
- inputfiles - List of job input files
- environment - Environment variables which must be set before job execution
- packages - List of application software packages needed by the job
- command - Job execution command
- arguments - Command line arguments of the job execution command
- validation - Name of the script for validating the job after it is finished (optional)
An example message containing the runjob command is presented in Figure 4.11.

<message to='agent@cvmappi21.cern.ch' from='jmreal@cvmappi21.cern.ch'>
  <body>
    <info command='_BASE64:cnVuSm9i'>
      <job id='_BASE64:MzA3NTczOTQ='
           chirpurl='_BASE64:Y3ZtYXBwaTIxLmNlcm4uY2g6OTA5NA=='
           inputdir='_BASE64:l2fsawvulwpvy...mduwntyxm2jhmdq='
           inputfiles='_BASE64:Y29tbWFuZA=='
           environment='_BASE64:iefmsuvox...sk9cx1rps0vopsd1du1ouwnsoww5'
           packages='_BASE64:'
           command='_BASE64:Y29tbWFuZA=='
           arguments='_BASE64:' />
    </info>
  </body>
</message>

Figure 4.11 Co-Pilot Agent - Co-Pilot Adapter communication protocol message containing the runjob command

When the Agent finishes the job execution it sends to the Adapter a request to provide an output directory for the job output data. That request contains the getjoboutputdir command, the hostname of the agent as well as the ID of the job. After receiving the output directory request from the Agent, the Adapter creates the output directory for the job on the Chirp server and makes it writeable for the Agent. A message with the output directory information contains the storejoboutputdir command as well as the following attributes:

- outputchirpurl - Address of the Chirp server to which the job output must be uploaded
- outputdir - Directory on the Chirp server into which the job output files must be put
- jobid - Job ID
When the output files are uploaded, the Agent sends a message to the Adapter which contains the jobdone command, the job execution exit code as well as the ID of the job. After receiving the jobdone command, the Adapter uploads the job output files to the Storage Element specified in the job JDL, registers the files in the AliEn file catalogue and changes the job status to DONE. Possible errors during the operation of the Adapter or the Agent can be reported to the corresponding peer using a message containing the joberror command with errorcode and errormessage attributes. The protocol supports redirection of messages, i.e. a Co-Pilot Adapter can redirect a request from an Agent to another Adapter. This feature allows implementing different kinds of Adapters performing different tasks while at the same time giving an agent only a single communication address. For example, one can set up an adapter which is used to retrieve job details and input files (Job Request Adapter) and another adapter which is used to upload job execution results and set the final job status (Job Completion Adapter). The message redirection command is called redirect, the Jabber ID of the service to which the message must be redirected is passed using the attribute called referral, and the message itself is enclosed in an <info> tag. An example redirection message sent from the Adapter to the Agent is given in Figure 4.12.

<message to='agent@cvmappi21.cern.ch' from='jmreal@cvmappi21.cern.ch'>
  <body>
    <info referral='_BASE64:c3RvcmFnZXJlYWxAY3ZtYXBwaTIxLmNlcm4uY2g='
          command='_BASE64:cmVkaXJlY3Q='>

      <info exitcode='_BASE64:MA=='
            jobid='_BASE64:MzA4MDE3NDE='
            command='_BASE64:am9iRG9uZQ==' />
    </info>
  </body>
</message>

Figure 4.12 Co-Pilot Agent - Co-Pilot Adapter communication protocol message containing the redirect command

After receiving this message the Agent will decode the value of the referral attribute to get the Jabber ID to which the message must be sent (in the given example the string '_BASE64:c3RvcmFnZXJlYWxAY3ZtYXBwaTIxLmNlcm4uY2g=' will be decoded to storagereal@cvmappi21.cern.ch), will generate a new message using the contents of the inner <info> tag of the original message, and will send it:

<message to='storagereal@cvmappi21.cern.ch' from='agent@cvmappi21.cern.ch'>
  <body>
    <info exitcode='_BASE64:MA=='
          jobid='_BASE64:MzA4MDE3NDE='
          command='_BASE64:am9iRG9uZQ==' />
  </body>
</message>

Figure 4.13 Co-Pilot Agent - Co-Pilot Adapter communication protocol message containing the redirect command

The protocol also supports a secure mode of operation, in which case all messages exchanged by Agents and Adapters are encrypted using the AES symmetric encryption algorithm [154]. Encryption is done using a 256-bit key, which is generated by the Agent and is sent along with the request to the Adapter (using the session_key attribute). The session key itself is encrypted using the RSA [155] algorithm with the Adapter's public key (so it can be decrypted only with the corresponding private key). The Adapter encrypts the replies which it sends to the agent using the key it got during the request.
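The hybrid encryption scheme described here can be sketched as follows. This is only an illustrative sketch under assumptions: the thesis does not specify the AES mode or the RSA padding, so CBC with a random IV and PKCS#1 v1.5 padding are used purely as placeholders, and the function names are hypothetical.

# Illustrative sketch of the secure-mode key exchange: a 256-bit AES session key
# generated by the Agent and encrypted with the Adapter's RSA public key.
# AES-CBC with a random IV and PKCS#1 v1.5 padding are assumptions for the example.
import base64, os
from cryptography.hazmat.primitives import serialization, padding as sym_padding
from cryptography.hazmat.primitives.asymmetric import padding as rsa_padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_request(plaintext_body, adapter_public_key_pem):
    session_key = os.urandom(32)   # 256-bit AES key generated by the Agent
    iv = os.urandom(16)

    padder = sym_padding.PKCS7(128).padder()
    padded = padder.update(plaintext_body.encode()) + padder.finalize()
    encryptor = Cipher(algorithms.AES(session_key), modes.CBC(iv)).encryptor()
    encrypted_body = iv + encryptor.update(padded) + encryptor.finalize()

    public_key = serialization.load_pem_public_key(adapter_public_key_pem)
    encrypted_key = public_key.encrypt(session_key, rsa_padding.PKCS1v15())

    # Both values travel as Base64-encoded attributes of the <message> element.
    return (base64.b64encode(encrypted_body).decode(),
            base64.b64encode(encrypted_key).decode())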

An example message exchanged in secure mode is given in Figure 4.14.

<message to='storagereal@cvmappi21.cern.ch' from='agent@cvmappi21.cern.ch'>
  <body info='_BASE64:vtjgc2rhvmj...mfawakc3v2him3j6vjqkcdrirgo='
        session_key='_BASE64:ylcrwxluhjw...ummppngfmsmvsb1dZdz09Cg==' />
</message>

Figure 4.14 Co-Pilot Agent - Co-Pilot Adapter communication protocol message exchanged in secure mode

In secure mode Agents are required to authenticate to the Adapters. Authentication is done using a special authentication ticket, which the Agent retrieves from a dedicated service called the Key Manager. The ticket specifies with which Adapters the Agent is allowed to communicate. The ticket has a limited lifetime and is signed using the RSA algorithm with the private key of the Key Manager. To get the ticket the Agent sends to the Key Manager a message containing the getticket command as well as the credential attribute, the value of which is used by the Key Manager to authenticate the Agent. After successful authentication of the Agent, the Key Manager sends to the agent a message containing the storeticket command and the ticket, which is sent as the value of the attribute called ticket. Once the ticket is obtained it is included in all messages sent by the Agent as the value of the serviceauthenticationticket attribute of the <info> element.

4.5 Comparison of Classic and Co-Pilot models. Measurement of their timing characteristics

The major advantage of the Classic model is that it is very easy to implement, because its implementation does not require modification of the code of either AliEn or the Nimbus toolkit. The drawback of the model is that one needs a separate virtual machine (and in case of a high load, several of them) to run the site services. The deployment of these virtual machines is time consuming, and by excluding them from the setup one could potentially utilize more virtual machines for deploying worker nodes and running more jobs. The Classic model assumes that the application software is brought to the cloud by the PackMan service and is made available to the worker nodes through a server running the Network File System (NFS). This is not optimal, because

the CernVM image already provides the application software, and it would be better if the worker nodes, instead of waiting for the installation of the software by the PackMan service and accessing it through NFS, could directly use the application packages available via CVMFS. Such an approach would allow eliminating the PackMan service from the deployment; however, it requires the modification of the code of the AliEn Job Agent service, which is not feasible in our case. The implementation of the Co-Pilot model does not require the deployment of service nodes and thus potentially allows running more jobs on the same number of virtual machines (since some of them can be used as worker nodes rather than service nodes). Besides, it uses the application software available from CernVM and does not require the existence of a Grid package management service such as AliEn PackMan. The current implementation of the Co-Pilot Adapter can be used to execute on the cloud only jobs from the AliEn Grid. However, it can be extended to communicate with any pilot job framework, e.g. PanDA, the distributed production and analysis system [156] used by the CERN ATLAS [157] experiment, or the DIRAC [158] Grid solution used by the CERN LHCb experiment [159]. The current implementation of the Co-Pilot Agent does not have anything AliEn-specific and is written in such a way that running jobs fetched from other frameworks should not require extra development. For the implementations of both models we have measured the time which elapses between a) launching the site deployment command and the start-up of the node(s) on the cloud, b) launching the site deployment command and the requesting of job(s) by the worker nodes, and c) the requesting of job(s) by the worker nodes and the assignment of jobs to them by the AliEn Job Broker. To perform the measurements we have deployed sites with different numbers of worker nodes: from 1 to 10. For each number of worker nodes, 3 deployments have been performed, so 30 deployments were done for the Classic model (launching 165 worker nodes overall) and 30 deployments for the Co-Pilot model (launching 165 worker nodes overall). For each deployment the timings have been measured, and afterwards the mean values of the recorded times have been

calculated. The deployment commands have been launched from a machine located at CERN to the Nimbus Science Cloud of the University of Chicago. The plot in Figure 4.15 represents the mean values of the time which elapses between:
- the launch of the site deployment command and the start-up of the node(s) on the cloud (light green bars for the Classic model implementation and light orange bars for the Co-Pilot model implementation)
- the launch of the site deployment command and the request of job(s) by the worker nodes (dark green bars for the Classic model implementation and dark orange bars for the Co-Pilot model implementation)

Figure 4.15 Virtual site deployment timings

It is seen from the plot (Figure 4.15) that:
- The time of worker node deployment on the cloud is proportional to the number of virtual machines being launched.
- For the same number of worker nodes, the start-up duration, that is the time period from the issuance of the site deployment command to the booting of the OS of the nodes, is longer in the Classic model: this is because in the case of a deployment following the Classic model one launches an additional node for running the AliEn site services.

- The time interval between the start-up and the first job request practically does not depend on the number of worker nodes (Figure 4.15, dark green and dark orange bars). It is about 400 seconds in the case of the Classic model and about 15 seconds in the case of the Co-Pilot model. In the Classic model this interval is needed for starting the site and Job Agent services of AliEn, while in the Co-Pilot model it is needed for starting the Co-Pilot Agent.
The plot in Figure 4.16 shows the mean values of the time which elapses between the requesting of job(s) by the worker nodes and the assignment of jobs to them by AliEn (dark green bars for the Classic sites and dark orange bars for the Co-Pilot sites). In the case of a Classic site the mean time does not exceed 5 seconds, while in the case of a Co-Pilot site it varies from 3 to 30 seconds. The reason for this is that the current implementation of the Co-Pilot Adapter serves requests sequentially, whereas the AliEn services process several requests simultaneously.

Figure 4.16 Job request and arrival timings

Table 4.1 shows the minimum, maximum, mean and standard deviation of the measured time values for all 165 worker nodes launched during the measurements.


Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms Volume 1, Issue 1 ISSN: 2320-5288 International Journal of Engineering Technology & Management Research Journal homepage: www.ijetmr.org Analysis and Research of Cloud Computing System to Comparison of

More information

RevoScaleR Speed and Scalability

RevoScaleR Speed and Scalability EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

More information

Introduction to grid technologies, parallel and cloud computing. Alaa Osama Allam Saida Saad Mohamed Mohamed Ibrahim Gaber

Introduction to grid technologies, parallel and cloud computing. Alaa Osama Allam Saida Saad Mohamed Mohamed Ibrahim Gaber Introduction to grid technologies, parallel and cloud computing Alaa Osama Allam Saida Saad Mohamed Mohamed Ibrahim Gaber OUTLINES Grid Computing Parallel programming technologies (MPI- Open MP-Cuda )

More information

Betriebssystem-Virtualisierung auf einem Rechencluster am SCC mit heterogenem Anwendungsprofil

Betriebssystem-Virtualisierung auf einem Rechencluster am SCC mit heterogenem Anwendungsprofil Betriebssystem-Virtualisierung auf einem Rechencluster am SCC mit heterogenem Anwendungsprofil Volker Büge 1, Marcel Kunze 2, OIiver Oberst 1,2, Günter Quast 1, Armin Scheurer 1 1) Institut für Experimentelle

More information

Computing at the HL-LHC

Computing at the HL-LHC Computing at the HL-LHC Predrag Buncic on behalf of the Trigger/DAQ/Offline/Computing Preparatory Group ALICE: Pierre Vande Vyvre, Thorsten Kollegger, Predrag Buncic; ATLAS: David Rousseau, Benedetto Gorini,

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION CHAPTER 1 INTRODUCTION 1.1 Background The command over cloud computing infrastructure is increasing with the growing demands of IT infrastructure during the changed business scenario of the 21 st Century.

More information

Cloud Computing with Red Hat Solutions. Sivaram Shunmugam Red Hat Asia Pacific Pte Ltd. sivaram@redhat.com

Cloud Computing with Red Hat Solutions. Sivaram Shunmugam Red Hat Asia Pacific Pte Ltd. sivaram@redhat.com Cloud Computing with Red Hat Solutions Sivaram Shunmugam Red Hat Asia Pacific Pte Ltd sivaram@redhat.com Linux Automation Details Red Hat's Linux Automation strategy for next-generation IT infrastructure

More information

An Integrated CyberSecurity Approach for HEP Grids. Workshop Report. http://hpcrd.lbl.gov/hepcybersecurity/

An Integrated CyberSecurity Approach for HEP Grids. Workshop Report. http://hpcrd.lbl.gov/hepcybersecurity/ An Integrated CyberSecurity Approach for HEP Grids Workshop Report http://hpcrd.lbl.gov/hepcybersecurity/ 1. Introduction The CMS and ATLAS experiments at the Large Hadron Collider (LHC) being built at

More information

Virtualization, Grid, Cloud: Integration Paths for Scientific Computing

Virtualization, Grid, Cloud: Integration Paths for Scientific Computing Virtualization, Grid, Cloud: Integration Paths for Scientific Computing Or, where and how will my efficient large-scale computing applications be executed? D. Salomoni, INFN Tier-1 Computing Manager Davide.Salomoni@cnaf.infn.it

More information

A Study on Analysis and Implementation of a Cloud Computing Framework for Multimedia Convergence Services

A Study on Analysis and Implementation of a Cloud Computing Framework for Multimedia Convergence Services A Study on Analysis and Implementation of a Cloud Computing Framework for Multimedia Convergence Services Ronnie D. Caytiles and Byungjoo Park * Department of Multimedia Engineering, Hannam University

More information

Cloud Computing for Science

Cloud Computing for Science Cloud Computing for Science June 2009 21st International Conference on Scientific and Statistical Database Management Kate Keahey keahey@mcs.anl.gov Nimbus project lead University of Chicago Argonne National

More information

Invenio: A Modern Digital Library for Grey Literature

Invenio: A Modern Digital Library for Grey Literature Invenio: A Modern Digital Library for Grey Literature Jérôme Caffaro, CERN Samuele Kaplun, CERN November 25, 2010 Abstract Grey literature has historically played a key role for researchers in the field

More information

Clouds for research computing

Clouds for research computing Clouds for research computing Randall Sobie Institute of Particle Physics University of Victoria Collaboration UVIC, NRC (Ottawa), NRC-HIA (Victoria) Randall Sobie IPP/University of Victoria 1 Research

More information

How To Understand Cloud Computing

How To Understand Cloud Computing Overview of Cloud Computing (ENCS 691K Chapter 1) Roch Glitho, PhD Associate Professor and Canada Research Chair My URL - http://users.encs.concordia.ca/~glitho/ Overview of Cloud Computing Towards a definition

More information

Concepts and Architecture of Grid Computing. Advanced Topics Spring 2008 Prof. Robert van Engelen

Concepts and Architecture of Grid Computing. Advanced Topics Spring 2008 Prof. Robert van Engelen Concepts and Architecture of Grid Computing Advanced Topics Spring 2008 Prof. Robert van Engelen Overview Grid users: who are they? Concept of the Grid Challenges for the Grid Evolution of Grid systems

More information

An Efficient Use of Virtualization in Grid/Cloud Environments. Supervised by: Elisa Heymann Miquel A. Senar

An Efficient Use of Virtualization in Grid/Cloud Environments. Supervised by: Elisa Heymann Miquel A. Senar An Efficient Use of Virtualization in Grid/Cloud Environments. Arindam Choudhury Supervised by: Elisa Heymann Miquel A. Senar Index Introduction Motivation Objective State of Art Proposed Solution Experimentations

More information

What Is It? Business Architecture Research Challenges Bibliography. Cloud Computing. Research Challenges Overview. Carlos Eduardo Moreira dos Santos

What Is It? Business Architecture Research Challenges Bibliography. Cloud Computing. Research Challenges Overview. Carlos Eduardo Moreira dos Santos Research Challenges Overview May 3, 2010 Table of Contents I 1 What Is It? Related Technologies Grid Computing Virtualization Utility Computing Autonomic Computing Is It New? Definition 2 Business Business

More information

IaaS Cloud Architectures: Virtualized Data Centers to Federated Cloud Infrastructures

IaaS Cloud Architectures: Virtualized Data Centers to Federated Cloud Infrastructures IaaS Cloud Architectures: Virtualized Data Centers to Federated Cloud Infrastructures Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Introduction

More information

ATLAS job monitoring in the Dashboard Framework

ATLAS job monitoring in the Dashboard Framework ATLAS job monitoring in the Dashboard Framework J Andreeva 1, S Campana 1, E Karavakis 1, L Kokoszkiewicz 1, P Saiz 1, L Sargsyan 2, J Schovancova 3, D Tuckett 1 on behalf of the ATLAS Collaboration 1

More information

How To Make A Virtual Machine Aware Of A Network On A Physical Server

How To Make A Virtual Machine Aware Of A Network On A Physical Server VMready Virtual Machine-Aware Networking White Paper Table of Contents Executive Summary... 2 Current Server Virtualization Environments... 3 Hypervisors... 3 Virtual Switches... 3 Leading Server Virtualization

More information

Data Management using irods

Data Management using irods Data Management using irods Fundamentals of Data Management September 2014 Albert Heyrovsky Applications Developer, EPCC a.heyrovsky@epcc.ed.ac.uk 2 Course outline Why talk about irods? What is irods?

More information

DISTRIBUTED SYSTEMS AND CLOUD COMPUTING. A Comparative Study

DISTRIBUTED SYSTEMS AND CLOUD COMPUTING. A Comparative Study DISTRIBUTED SYSTEMS AND CLOUD COMPUTING A Comparative Study Geographically distributed resources, such as storage devices, data sources, and computing power, are interconnected as a single, unified resource

More information

Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies

Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies Kurt Klemperer, Principal System Performance Engineer kklemperer@blackboard.com Agenda Session Length:

More information

How To Build A Cloud Computing System With Nimbus

How To Build A Cloud Computing System With Nimbus Nimbus: Open Source Infrastructure-as-a-Service Cloud Computing Software Workshop on adapting applications and computing services to multi-core and virtualization CERN, June 2009 Kate Keahey keahey@mcs.anl.gov

More information

globus online Cloud-based services for (reproducible) science Ian Foster Computation Institute University of Chicago and Argonne National Laboratory

globus online Cloud-based services for (reproducible) science Ian Foster Computation Institute University of Chicago and Argonne National Laboratory globus online Cloud-based services for (reproducible) science Ian Foster Computation Institute University of Chicago and Argonne National Laboratory Computation Institute (CI) Apply to challenging problems

More information

Managing the Cloud as an Incremental Step Forward

Managing the Cloud as an Incremental Step Forward WP Managing the Cloud as an Incremental Step Forward How brings cloud services into your IT infrastructure in a natural, manageable way white paper INFO@SERVICE-NOW.COM Table of Contents Accepting the

More information

Business applications:

Business applications: Consorzio COMETA - Progetto PI2S2 UNIONE EUROPEA Business applications: the COMETA approach Prof. Antonio Puliafito University of Messina Open Grid Forum (OGF25) Catania, 2-6.03.2009 www.consorzio-cometa.it

More information

Status and Evolution of ATLAS Workload Management System PanDA

Status and Evolution of ATLAS Workload Management System PanDA Status and Evolution of ATLAS Workload Management System PanDA Univ. of Texas at Arlington GRID 2012, Dubna Outline Overview PanDA design PanDA performance Recent Improvements Future Plans Why PanDA The

More information

Deploying a distributed data storage system on the UK National Grid Service using federated SRB

Deploying a distributed data storage system on the UK National Grid Service using federated SRB Deploying a distributed data storage system on the UK National Grid Service using federated SRB Manandhar A.S., Kleese K., Berrisford P., Brown G.D. CCLRC e-science Center Abstract As Grid enabled applications

More information

Science Clouds: Early Experiences in Cloud Computing for Scientific Applications Kate Keahey and Tim Freeman

Science Clouds: Early Experiences in Cloud Computing for Scientific Applications Kate Keahey and Tim Freeman Science Clouds: Early Experiences in Cloud Computing for Scientific Applications Kate Keahey and Tim Freeman About this document The Science Clouds provide EC2-style cycles to scientific projects. This

More information

Distributed Systems and Recent Innovations: Challenges and Benefits

Distributed Systems and Recent Innovations: Challenges and Benefits Distributed Systems and Recent Innovations: Challenges and Benefits 1. Introduction Krishna Nadiminti, Marcos Dias de Assunção, and Rajkumar Buyya Grid Computing and Distributed Systems Laboratory Department

More information

Data Requirements from NERSC Requirements Reviews

Data Requirements from NERSC Requirements Reviews Data Requirements from NERSC Requirements Reviews Richard Gerber and Katherine Yelick Lawrence Berkeley National Laboratory Summary Department of Energy Scientists represented by the NERSC user community

More information

Lecture 02a Cloud Computing I

Lecture 02a Cloud Computing I Mobile Cloud Computing Lecture 02a Cloud Computing I 吳 秀 陽 Shiow-yang Wu What is Cloud Computing? Computing with cloud? Mobile Cloud Computing Cloud Computing I 2 Note 1 What is Cloud Computing? Walking

More information

III Level Course, 2011 Free Software. Dott. Bertoldo Silvano Ing. Terzo Olivier

III Level Course, 2011 Free Software. Dott. Bertoldo Silvano Ing. Terzo Olivier III Level Course, 2011 Free Software Dott. Bertoldo Silvano Ing. Terzo Olivier 1 1. Introduction to Grid and Cloud Computing 2. Open Source Software in Grid and Cloud Computing 2.1 Hypervisor 2.2 Cloud

More information

Star System. 2004 Deitel & Associates, Inc. All rights reserved.

Star System. 2004 Deitel & Associates, Inc. All rights reserved. Star System Apple Macintosh 1984 First commercial OS GUI Chapter 1 Introduction to Operating Systems Outline 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 1.10 1.11 1.12 Introduction What Is an Operating System?

More information

Grid Computing Perspectives for IBM

Grid Computing Perspectives for IBM Grid Computing Perspectives for IBM Atelier Internet et Grilles de Calcul en Afrique Jean-Pierre Prost IBM France jpprost@fr.ibm.com Agenda Grid Computing Initiatives within IBM World Community Grid Decrypthon

More information

for my computation? Stefano Cozzini Which infrastructure Which infrastructure Democrito and SISSA/eLAB - Trieste

for my computation? Stefano Cozzini Which infrastructure Which infrastructure Democrito and SISSA/eLAB - Trieste Which infrastructure Which infrastructure for my computation? Stefano Cozzini Democrito and SISSA/eLAB - Trieste Agenda Introduction:! E-infrastructure and computing infrastructures! What is available

More information

Grid Scheduling Architectures with Globus GridWay and Sun Grid Engine

Grid Scheduling Architectures with Globus GridWay and Sun Grid Engine Grid Scheduling Architectures with and Sun Grid Engine Sun Grid Engine Workshop 2007 Regensburg, Germany September 11, 2007 Ignacio Martin Llorente Javier Fontán Muiños Distributed Systems Architecture

More information

New resource provision paradigms for Grid Infrastructures: Virtualization and Cloud

New resource provision paradigms for Grid Infrastructures: Virtualization and Cloud CISCO NerdLunch Series November 7, 2008 San Jose, CA New resource provision paradigms for Grid Infrastructures: Virtualization and Cloud Ruben Santiago Montero Distributed Systems Architecture Research

More information

globus online Cloud Based Services for Science Steve Tuecke Computation Institute University of Chicago and Argonne National Laboratory

globus online Cloud Based Services for Science Steve Tuecke Computation Institute University of Chicago and Argonne National Laboratory globus online Cloud Based Services for Science Steve Tuecke Computation Institute University of Chicago and Argonne National Laboratory Cloud Layers Software-as-a-Service (SaaS) Platform-as-a-Service (PaaS)

More information

Software Development around a Millisecond

Software Development around a Millisecond Introduction Software Development around a Millisecond Geoffrey Fox In this column we consider software development methodologies with some emphasis on those relevant for large scale scientific computing.

More information

The Massachusetts Open Cloud (MOC)

The Massachusetts Open Cloud (MOC) The Massachusetts Open Cloud (MOC) October 11, 2012 Abstract The Massachusetts open cloud is a new non-profit open public cloud that will be hosted (primarily) at the MGHPCC data center. Its mission is

More information

Planning the Migration of Enterprise Applications to the Cloud

Planning the Migration of Enterprise Applications to the Cloud Planning the Migration of Enterprise Applications to the Cloud A Guide to Your Migration Options: Private and Public Clouds, Application Evaluation Criteria, and Application Migration Best Practices Introduction

More information

ServerCentral Cloud Services Reliable. Adaptable. Robust.

ServerCentral Cloud Services Reliable. Adaptable. Robust. ServerCentral Cloud Services Reliable. Adaptable. Robust. UNTHINK how you think about the cloud Everybody has their own definition for Cloud. But is it yours? You know what s best for your company, and

More information

IAAS CLOUD EXCHANGE WHITEPAPER

IAAS CLOUD EXCHANGE WHITEPAPER IAAS CLOUD EXCHANGE WHITEPAPER Whitepaper, July 2013 TABLE OF CONTENTS Abstract... 2 Introduction... 2 Challenges... 2 Decoupled architecture... 3 Support for different consumer business models... 3 Support

More information

Cloud Computing for SCADA

Cloud Computing for SCADA Cloud Computing for SCADA Moving all or part of SCADA applications to the cloud can cut costs significantly while dramatically increasing reliability and scalability. A White Paper from InduSoft Larry

More information

A Study on Service Oriented Network Virtualization convergence of Cloud Computing

A Study on Service Oriented Network Virtualization convergence of Cloud Computing A Study on Service Oriented Network Virtualization convergence of Cloud Computing 1 Kajjam Vinay Kumar, 2 SANTHOSH BODDUPALLI 1 Scholar(M.Tech),Department of Computer Science Engineering, Brilliant Institute

More information

Web Service Based Data Management for Grid Applications

Web Service Based Data Management for Grid Applications Web Service Based Data Management for Grid Applications T. Boehm Zuse-Institute Berlin (ZIB), Berlin, Germany Abstract Web Services play an important role in providing an interface between end user applications

More information

IT Society in Armenia ISTC as a part of it

IT Society in Armenia ISTC as a part of it International Science and Technology Center Nonproliferation Through Science Cooperation IT Society in Armenia ISTC as a part of it Hamlet Navasardyan Head of the Armenian Branch Office November 2007 General

More information

SC12 Cloud Compu,ng for Science Tutorial: Introduc,on to Infrastructure Clouds

SC12 Cloud Compu,ng for Science Tutorial: Introduc,on to Infrastructure Clouds SC12 Cloud Compu,ng for Science Tutorial: Introduc,on to Infrastructure Clouds John Breshnahan, Patrick Armstrong, Kate Keahey, Pierre Riteau Argonne National Laboratory Computation Institute, University

More information

Aneka: A Software Platform for.net-based Cloud Computing

Aneka: A Software Platform for.net-based Cloud Computing Aneka: A Software Platform for.net-based Cloud Computing Christian VECCHIOLA a, Xingchen CHU a,b, and Rajkumar BUYYA a,b,1 a Grid Computing and Distributed Systems (GRIDS) Laboratory Department of Computer

More information

Data sharing and Big Data in the physical sciences. 2 October 2015

Data sharing and Big Data in the physical sciences. 2 October 2015 Data sharing and Big Data in the physical sciences 2 October 2015 Content Digital curation: Data and metadata Why consider the physical sciences? Astronomy: Video Physics: LHC for example. Video The Research

More information

Toward a Unified Ontology of Cloud Computing

Toward a Unified Ontology of Cloud Computing Toward a Unified Ontology of Cloud Computing Lamia Youseff University of California, Santa Barbara Maria Butrico, Dilma Da Silva IBM T.J. Watson Research Center 1 In the Cloud Several Public Cloud Computing

More information

Elastic Cloud Computing in the Open Cirrus Testbed implemented via Eucalyptus

Elastic Cloud Computing in the Open Cirrus Testbed implemented via Eucalyptus Elastic Cloud Computing in the Open Cirrus Testbed implemented via Eucalyptus International Symposium on Grid Computing 2009 (Taipei) Christian Baun The cooperation of and Universität Karlsruhe (TH) Agenda

More information

OSG All Hands Meeting

OSG All Hands Meeting Nimbus Update March 2010 OSG All Hands Meeting Kate Keahey keahey@mcs.anl.gov Nimbus Project University of Chicago Argonne National Laboratory Nimbus: Cloud Computing for Science Allow providers to build

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Security & Trust in the Cloud

Security & Trust in the Cloud Security & Trust in the Cloud Ray Trygstad Director of Information Technology, IIT School of Applied Technology Associate Director, Information Technology & Management Degree Programs Cloud Computing Primer

More information

The Challenge of Handling Large Data Sets within your Measurement System

The Challenge of Handling Large Data Sets within your Measurement System The Challenge of Handling Large Data Sets within your Measurement System The Often Overlooked Big Data Aaron Edgcumbe Marketing Engineer Northern Europe, Automated Test National Instruments Introduction

More information

An objective comparison test of workload management systems

An objective comparison test of workload management systems An objective comparison test of workload management systems Igor Sfiligoi 1 and Burt Holzman 1 1 Fermi National Accelerator Laboratory, Batavia, IL 60510, USA E-mail: sfiligoi@fnal.gov Abstract. The Grid

More information

Virtual Machines. www.viplavkambli.com

Virtual Machines. www.viplavkambli.com 1 Virtual Machines A virtual machine (VM) is a "completely isolated guest operating system installation within a normal host operating system". Modern virtual machines are implemented with either software

More information

Cloud Models and Platforms

Cloud Models and Platforms Cloud Models and Platforms Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF A Working Definition of Cloud Computing Cloud computing is a model

More information

THE EUCALYPTUS OPEN-SOURCE PRIVATE CLOUD

THE EUCALYPTUS OPEN-SOURCE PRIVATE CLOUD THE EUCALYPTUS OPEN-SOURCE PRIVATE CLOUD By Yohan Wadia ucalyptus is a Linux-based opensource software architecture that implements efficiencyenhancing private and hybrid clouds within an enterprise s

More information

White Paper. Requirements of Network Virtualization

White Paper. Requirements of Network Virtualization White Paper on Requirements of Network Virtualization INDEX 1. Introduction 2. Architecture of Network Virtualization 3. Requirements for Network virtualization 3.1. Isolation 3.2. Network abstraction

More information

Geoff Raines Cloud Engineer

Geoff Raines Cloud Engineer Geoff Raines Cloud Engineer Approved for Public Release; Distribution Unlimited. 13-2170 2013 The MITRE Corporation. All rights reserved. Why are P & I important for DoD cloud services? Improves the end-to-end

More information

E-mail: guido.negri@cern.ch, shank@bu.edu, dario.barberis@cern.ch, kors.bos@cern.ch, alexei.klimentov@cern.ch, massimo.lamanna@cern.

E-mail: guido.negri@cern.ch, shank@bu.edu, dario.barberis@cern.ch, kors.bos@cern.ch, alexei.klimentov@cern.ch, massimo.lamanna@cern. *a, J. Shank b, D. Barberis c, K. Bos d, A. Klimentov e and M. Lamanna a a CERN Switzerland b Boston University c Università & INFN Genova d NIKHEF Amsterdam e BNL Brookhaven National Laboratories E-mail:

More information

HPC and Grid Concepts

HPC and Grid Concepts HPC and Grid Concepts Divya MG (divyam@cdac.in) CDAC Knowledge Park, Bangalore 16 th Feb 2012 GBC@PRL Ahmedabad 1 Presentation Overview What is HPC Need for HPC HPC Tools Grid Concepts GARUDA Overview

More information

Interoperability in Grid Computing

Interoperability in Grid Computing Anette Weisbecker, Fraunhofer IAO, Stuttgart 18 th April 2007 Special Interest Session III Outline: Interoperability in Grid Computing Grid Computing for Medicine and Life Science Interoperability Architecture

More information

Context-aware cloud computing for HEP

Context-aware cloud computing for HEP Department of Physics and Astronomy, University of Victoria, Victoria, British Columbia, Canada V8W 2Y2 E-mail: rsobie@uvic.ca The use of cloud computing is increasing in the field of high-energy physics

More information

Big Data Analytics. for the Exploitation of the CERN Accelerator Complex. Antonio Romero Marín

Big Data Analytics. for the Exploitation of the CERN Accelerator Complex. Antonio Romero Marín Big Data Analytics for the Exploitation of the CERN Accelerator Complex Antonio Romero Marín Milan 11/03/2015 Oracle Big Data and Analytics @ Work 1 What is CERN CERN - European Laboratory for Particle

More information

The CMS analysis chain in a distributed environment

The CMS analysis chain in a distributed environment The CMS analysis chain in a distributed environment on behalf of the CMS collaboration DESY, Zeuthen,, Germany 22 nd 27 th May, 2005 1 The CMS experiment 2 The CMS Computing Model (1) The CMS collaboration

More information

Globus Striped GridFTP Framework and Server. Raj Kettimuthu, ANL and U. Chicago

Globus Striped GridFTP Framework and Server. Raj Kettimuthu, ANL and U. Chicago Globus Striped GridFTP Framework and Server Raj Kettimuthu, ANL and U. Chicago Outline Introduction Features Motivation Architecture Globus XIO Experimental Results 3 August 2005 The Ohio State University

More information

Retrieval on the Grid Results from the European Project GRACE (Grid Search and Categorization Engine)

Retrieval on the Grid Results from the European Project GRACE (Grid Search and Categorization Engine) Retrieval on the Grid Results from the European Project GRACE (Grid Search and Categorization Engine) FRANK SCHOLZE, WERNER STEPHAN Stuttgart University Library Main parts of this paper are based on the

More information

Virtualization Infrastructure at Karlsruhe

Virtualization Infrastructure at Karlsruhe Virtualization Infrastructure at Karlsruhe HEPiX Fall 2007 Volker Buege 1),2), Ariel Garcia 1), Marcus Hardt 1), Fabian Kulla 1),Marcel Kunze 1), Oliver Oberst 1),2), Günter Quast 2), Christophe Saout

More information

Software Licensing in Virtual Environments. Managing the Terms of Software Use in Virtualized Systems

Software Licensing in Virtual Environments. Managing the Terms of Software Use in Virtualized Systems Software Licensing in Virtual Environments Managing the Terms of Software Use in Virtualized Systems Introduction While virtualization has numerous IT infrastructure benefits, it can be a concern for software

More information

9/26/2011. What is Virtualization? What are the different types of virtualization.

9/26/2011. What is Virtualization? What are the different types of virtualization. CSE 501 Monday, September 26, 2011 Kevin Cleary kpcleary@buffalo.edu What is Virtualization? What are the different types of virtualization. Practical Uses Popular virtualization products Demo Question,

More information

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft. Global Grid User Support - GGUS - within the LCG & EGEE environment

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft. Global Grid User Support - GGUS - within the LCG & EGEE environment Global Grid User Support - GGUS - within the LCG & EGEE environment Abstract: For very large projects like the LHC Computing Grid Project (LCG) involving some 8,000 scientists from universities and laboratories

More information

Cloud Computing with Nimbus

Cloud Computing with Nimbus Cloud Computing with Nimbus February 2009 Kate Keahey (keahey@mcs.anl.gov) University of Chicago Argonne National Laboratory Cloud Computing elasticity computing on demand capital expense operational expense

More information

Grid Computing: A Ten Years Look Back. María S. Pérez Facultad de Informática Universidad Politécnica de Madrid mperez@fi.upm.es

Grid Computing: A Ten Years Look Back. María S. Pérez Facultad de Informática Universidad Politécnica de Madrid mperez@fi.upm.es Grid Computing: A Ten Years Look Back María S. Pérez Facultad de Informática Universidad Politécnica de Madrid mperez@fi.upm.es Outline Challenges not yet solved in computing The parents of grid Computing

More information

Hadoop in the Hybrid Cloud

Hadoop in the Hybrid Cloud Presented by Hortonworks and Microsoft Introduction An increasing number of enterprises are either currently using or are planning to use cloud deployment models to expand their IT infrastructure. Big

More information

Chapter 1: Introduction

Chapter 1: Introduction Chapter 1 Introduction 1 Chapter 1: Introduction 1.1 Inspiration Cloud Computing Inspired by the cloud computing characteristics like pay per use, rapid elasticity, scalable, on demand self service, secure

More information

Moving Beyond the Web, a Look at the Potential Benefits of Grid Computing for Future Power Networks

Moving Beyond the Web, a Look at the Potential Benefits of Grid Computing for Future Power Networks Moving Beyond the Web, a Look at the Potential Benefits of Grid Computing for Future Power Networks by Malcolm Irving, Gareth Taylor, and Peter Hobson 1999 ARTVILLE, LLC. THE WORD GRID IN GRID-COMPUTING

More information