SESSION 1: INFRASTRUCTURE SESSIONS
Experience on running the Ibergrid infrastructure within EGI

I. Campos 1, E. Fernandez 1, A. Lopez 1, J. Marco 1, P. Orviz 1, G. Borges 2, J. Gomes 2, H. Gomes 2, M. David 2, J. Pina 2, J. Martins 2, J. López-Cacheiro 3, I. Diaz 3, A. Feijoo 3, C. Fernandez 3, E. Freire 3, and A. Simon 3

1 Instituto de Fisica de Cantabria, CSIC, Santander, Spain [email protected]
2 Laboratório de Instrumentação e Física Experimental de Partículas, Lisboa, Portugal
3 Fundacion Centro de Supercomputacion de Galicia, Santiago de Compostela, Spain

Abstract. The National Grid Initiatives in Europe have been operating for about one year under the coordination of the European Grid Initiative (EGI). During this time the transition to a country-based infrastructure has taken place. This paper summarizes the experience of the Spanish and Portuguese NGIs regarding the operations and developments needed to move towards the EGI model. An overview of the transition from the ROC-based model, of the operational procedures and tools, and of user support at the Ibergrid level is given, together with a perspective on the evolution of the Grid infrastructure in Europe.

1 Introduction

For many years the grid computing efforts in Europe were focused on developing the technologies and demonstrating them through pilot infrastructures. In time those infrastructures have evolved to become production-oriented services that support scientific communities. The last step towards a sustainable Grid infrastructure for the European research community has been taken with the foundation of the European Grid Initiative (EGI), in charge of coordinating the efforts of the infrastructure providers organized at the country level in National Grid Initiatives.

The first year of running the European Grid Infrastructure under the organizational model based on National Grid Initiatives has passed. During this year the transition from the model based on Regional Operation Centres (ROC) has taken place. Each particular country or region is responsible for the operation of the infrastructure within its geographical limits, plus a contribution to the running of the global operations of EGI, the so-called International Tasks of the NGIs.

The Iberian operation is performed and coordinated under the umbrella of Ibergrid. It has substituted the former South West Federation (SWE) of EGEE, formed by Spain and Portugal, and has inherited the experience of common operation developed in the previous six years. It provides an environment for regional operation of the Iberian Grid infrastructure, at the same time interoperable with the rest of the European Grid Infrastructure.
Over the last year the infrastructure has grown to more than 12,500 cores and 11 Petabytes of online storage (see the NGI IBERGRID wiki for details). The usage has increased from 39 million CPU hours to more than 76 million (normalized to KSI2K), mainly due to the start-up of the LHC operations. From the accounting portal maintained at CESGA the usage of the infrastructure, a very important piece of information both for the infrastructure and for the funding agencies, can be extracted. The usage data is plotted in Figure 1, where the dominance of the LHC start-up can be clearly observed.

Fig. 1. CPU hours normalized to KSI2K in the Ibergrid infrastructure in the period from 2010-Q2 to 2011-Q1.

The infrastructure of IBERGRID is expected to continue growing both in size and in diversity as user communities across Europe adopt the Grid as a paradigm for sharing computing resources at the European level in the framework of ESFRI projects, international collaborations, and at the Iberian level, organized by scientific macro-areas as described later in this review.

2 The National Grid Initiatives

A very special relation has been established with the Spanish NGI to build the Iberian distributed computing infrastructure (IBERGRID). The operation and
access to grid resources in Portugal and Spain is carried out in an integrated manner within IBERGRID, since many procedures and services are common to both initiatives.

2.1 INGRID

The Portuguese NGI maintains a distributed computing infrastructure (INGRID [1]) for scientific applications, integrating computing resources belonging to different academic research organizations. It enables the consolidation of isolated local infrastructures in research centers and universities, allowing optimization of their use in favor of scientific growth. The initiative also aims to ensure the development of skills and capacities of strategic importance for the evolution of distributed computing in Portugal, while helping to solve complex problems with high processing and data storage requirements. The participation of Portuguese researchers in international projects is ensured through the national contribution to the European Grid Initiative, and through resource sharing with projects like the Worldwide LHC Computing Grid (WLCG).

The largest INGRID offer is based on high throughput computing resources whose access is virtualized through the gLite middleware. Nevertheless, a small fraction of high performance computing nodes is also available for those communities executing applications with strong latency requirements. The generic infrastructure architecture is based on a two-level model. The first level consists of a minimum set of processing and storage resources, and of staff to ensure the integration of all the distributed resources. This is considered the core of the infrastructure, located in the central node of grid computing, and providing the services which glue resources together. A second level is composed of resources provided by other institutions with demonstrated experience. The National Grid Initiative is coordinated by UMIC [2] and its technical coordination is handled by LIP [3].

2.2 ES-NGI

The Spanish National Grid Initiative (ES-NGI [4]) is a distributed computing infrastructure which interconnects the resources of a number of computing centers using Grid technology. Currently there are 16 resource centers distributed across the country. Those computing resources are used to support the involvement of Spanish scientists in large scientific collaborations such as the analysis of LHC data, the analysis of data coming from astronomy and astrophysics missions like Auger or Planck, nuclear fusion, or computational chemistry. Those centers also provide basic user support to the scientists working in Spain and willing to perform their numerical work on this infrastructure. The mission of ES-NGI can be described as follows:

- Operate the service platform for Grid computing at the national level
- Offer optimized services for R&D users in Spain, applying scientific criteria for the allocation of those resources
- Provide the necessary services to operate together with the European Grid Initiative
- Support the computational projects of international collaborations requiring Grid technology
- Coordinate with the activities of the Spanish Network for e-Science for user outreach
- Advise the Ministry of Science and Innovation in the area of Grid computing, and participate in the national and international activities as required by the Ministry.

The Spanish NGI is coordinated by the Spanish National Research Council (CSIC) from the Institute of Physics of Cantabria (IFCA) in Santander.

3 Transition from the ROC model to the NGI model

The transition from the EGEE ROC to the NGI model has been executed following a two-step procedure:

1. Creation of the NGI IBERGRID Operational Centre
2. Decommissioning of the SWE ROC

The creation of the NGI IBERGRID Operational Centre followed the EGI Operation Centre Creation Process [5]. Since Portugal and Spain are represented within the EGI Council, the political approval required to initiate the process was obtained automatically. Another pre-requisite to start the technical validation was the creation of several mailing lists (to contact the management; to contact the people responsible for monitoring and supporting the Operations Centre infrastructure (ROD); and to contact the responsible security officer). The management personal contacts (e-mail addresses and telephones) also had to be officially provided.

The full technical NGI IBERGRID validation process consisted of an extensive set of tasks and procedures to be executed by many project bodies (see figure 2). The whole process flow was followed through GGUS [6], with associated child tickets to track the execution of the associated activities. Among all the steps, some of the most important are:

- Creation of the NGI IBERGRID entity in GOCDB: All operational tools fetch information from GOCDB and/or validate information based on GOCDB inputs. Therefore, it is essential that the inserted information is properly verified, especially the roles and permissions attributed to the regional staff personnel.
- Creation of the NGI IBERGRID view in the Operations Dashboard: This action allows ROD staff to access and monitor the regional production infrastructure through the Operations Dashboard. The ROD task was started even without a certified r-nagios service, since during the transition from EGEE to EGI the Nagios Team was providing an instance to monitor regional resources.
Fig. 2. NGI IBERGRID validation process.

- Creation of the NGI IBERGRID view in SAMAP: This is a very important step, especially for WLCG sites, since WLCG availability and reliability metrics continue to be collected using GridView and SAM results.
- Migration of sites from the SWE ROC to the NGI IBERGRID entity in GOCDB: Through this step, sites are linked to a new entity, and one has to be sure that important relations are not lost during the migration process.
- Validation of the regional SAM instance: This was one of the last, and most problematic, steps. According to the r-nagios validation procedure [7], the regional instance is only considered validated once it presents the same results as the r-nagios service put in place by the Nagios Team to guarantee monitoring during the transition period. Moreover, the SWE r-nagios instance was directly moved to NGI IBERGRID, and some of the global activities foreseen in the EGI Operation Centre Creation Process [5] could be bypassed. Once r-nagios was considered validated [8], the Operations Dashboard started to collect and process the messages from the NGI IBERGRID regional Nagios, and the ROD work started to be done based on those results.

Once all the steps presented in figure 2 were completed successfully and validated, NGI IBERGRID entered into operation, and today it is working smoothly, integrated in EGI.

The decommissioning of the SWE ROC was performed according to the EGI Operation Centre Decommission Procedure [9] and tracked by a GGUS ticket [10]. It
followed an inverse workflow as compared to the one presented in figure 2. During the procedure, the following issues were found:

- The SWE ROC had the particularly interesting case of hosting a Moroccan site. During the whole transition period, the Moroccan site managers were warned that they should trigger their own political processes to be able to participate in EGI. At the time SWE was decommissioned, the Moroccan site was performing badly, and no evolution of the Moroccan EGI strategy had been seen for several months. Therefore, the closure of the site was agreed between NGI IBERGRID and the site managers.
- A question was raised about whether the SWE accounting data would be lost with its decommissioning. It turns out that accounting data is associated to sites, and therefore it is always possible to reconstruct the accounting data of a ROC, knowing the sites associated to that ROC at the time it was decommissioned.

4 Operations Coordination

Ibergrid operations are coordinated through weekly meetings and the operations mailing lists (see Table 1). Operations meetings take place every Monday at 11:30 CET and are an important forum where NGI operators and administrators can discuss the latest developments.

Table 1. Operations support mailing lists

  Description            List (@listas.cesga.es)   Purpose
  General Coordination   ibergrid-management       Coordination of the infrastructure and communication with EGI operations
  ROD                    ibergrid-rod              Follow-up of the infrastructure status
  Support to operators   ibergrid-ops              All members of the operations teams
  User support teams     ibergrid-support          All members of the user support teams

The collaboration software selected for Ibergrid meetings was the Enabling Virtual Organizations (EVO) system [11], widely used by the High Energy Physics community and maintained as a joint effort of CERN and DOE. The EVO client is based on Java and supports standard videoconferencing protocols such as H.323 used by Polycom, playback and recording functions to store or review a session, private or group chat, and an integrated telephone gateway. Currently the EVO Ibergrid operations meetings are coordinated and hosted by CESGA staff.

An important element in taking care of the Ibergrid infrastructure is the Regional Operations on Duty (ROD). The Ibergrid ROD team is a rotational task among the Portuguese (LIP) and Spanish (IFCA and CESGA) sites, which rotate their shifts on a
weekly basis. The ROD team in Ibergrid is also the first-line support, and it is in charge of these actions:

- Access the Ibergrid operational tools (see next section), such as the Regional Nagios [15], GSTAT [14] or GOCDB [12], to check the sanity of Ibergrid sites.
- Check the status of Ibergrid grid services and handle ticket creation.
- Support Ibergrid sites in solving issues.
- Coordinate the Ibergrid infrastructure within the EGI project.

ROD meetings are held on the first Wednesday of each month and involve LIP, IFCA and CESGA grid operators. These meetings are used to discuss the infrastructure status and the new actions to be assigned to the ROD team. The normal procedure is to review the status of the list of actions from the last meeting. The review of the infrastructure gives a good view of its status and of which actions are needed to solve specific issues. After the meeting, the new actions and minutes are submitted to the Ibergrid ROD mailing list.

Each Friday at 12:00 CET, the Ibergrid ROD sends an e-mail to the Ibergrid Operations mailing list. This is a status report about the Ibergrid sites; the report includes the last alarms raised in the operational tools and a list of open tickets for the federation. The aim is to get the feedback of the regional sites in the Monday Ibergrid Operations meeting. CESGA prepares the agenda for the meeting when this report is received, and submits the meeting agenda to the Ibergrid operations mailing list. Usually the points of the Operations meeting (which is scheduled every Monday at 11:30 CET) are:

1. Ibergrid sites operational status.
   - Summary of the week: This point is used to comment on important news and events from the last week.
   - Reports from sites with open issues (ROD contribution): A ROD member comments on the GGUS tickets open in the Ibergrid NGI. The site managers connected must give an update on their ticket status.
   - Technical questions from the sites (technical problems, issues, etc.): Site managers can resolve their technical doubts, ask about the installation of a specific service or about issues that they are having, or simply ask about the procedure to add or remove services.
2. EGI project tasks: This point is added to comment on tasks that are being done in the EGI project, and therefore to inform the Ibergrid site managers about EGI requirements.
3. Action List Review: The action lists created in previous meetings are reviewed. Sites that have assigned actions give an update on the status of each one.
4. AoB and Parking Lot: Sites can ask about some extra issue which was not commented on during the meeting, or propose a new point for the next meeting.

After the meeting, CESGA writes the minutes and the list of new actions (in case new actions appeared). The agenda and minutes are stored on the Ibergrid Wiki [17] maintained by LIP.
4.1 Operational Tools

To ensure the proper functioning of the Ibergrid infrastructure, the operations team uses different tools to facilitate its work. The Ibergrid Regional Nagios [15] is one of those monitoring tools; it is installed and maintained by CESGA. This powerful tool allows site administrators to test their sites and detect possible issues early. Nagios results are stored in a local database and submitted via the messaging system to EGI to produce the availability/reliability site reports (see figure 3).

Fig. 3. Nagios Messaging System Integration.

The Regional Nagios is protected by user X.509 certificates; to get access, new users must belong to the Ibergrid Regional Staff of NGI IBERGRID. The Regional Staff role is requested by site admins in the EGI GOCDB tool. The installation of the Nagios grid monitoring system is based on the YAIM gLite configuration tool and is completely transparent from the site administrators' point of view. The Regional Nagios gathers the complete list of grid services and machines using LDAP queries to the central information system. Once a new site is added, Nagios detects the new endpoint (by running a cron script called the Nagios Configuration Generator (NCG) every 3 hours) and submits predefined probes for each site service. If Nagios detects any failure, a notification message is submitted to a central broker and published in the regional Dashboard (see figure 3).

The Ibergrid Operational Dashboard [16] was installed at LIP and represents another important piece of the monitoring of the regional infrastructure. The original component is developed and maintained by the IN2P3 team, and its main functionalities are executed on top of two functional layers: the Lavoisier
Web Service and the web interface. Lavoisier is the entity responsible for collecting information from several other operational tools. The web interface consolidates that information and provides a graphical user interface dedicated to the daily work of first-line supporters. First-line supporters review this information and the notifications sent to the sites due to Nagios probe failures. The regional dashboard gathers information from:

- GOCDB: The static information about sites and nodes is taken from the GOCDB (site administrator lists, scheduled downtimes, etc.).
- GGUS system: The ROD team can open GGUS tickets to sites directly from the Dashboard. The Dashboard automatically fills the raised ticket with the necessary information about the cause of the problem, to assist in solving the issue.
- Regional Nagios: The monitoring system is configured so that failing tests are automatically reported to the Dashboard by the messaging system.

One of the main assignments of the ROD is the ROD shift. Each week, one of the members participating in the Ibergrid ROD team does the ROD shift; this procedure can be summarized in the following points:

1. Access the Regional Dashboard to check whether new alarms have been created. An alarm appears in the Dashboard whenever there is an incident, for example when a site test changes its Nagios status from OK to ERROR.
2. If an alarm for a site is less than 24 hours old, a notification is submitted to the site, with copy to the Ibergrid ROD mailing list, to inform about the new alarm (machine, failed test and alarm age). This is done through a dashboard tool called the notepad, and its purpose is to inform the sites before raising any ticket.
3. If the alarm is less than 24 hours old and returns to OK status, the alarm should be closed. If a site has a scheduled downtime registered in GOCDB and alarms appear for the site, these alarms are not taken into account.
4. If an alarm for a site is more than 24 hours old and continues in CRITICAL status, a GGUS ticket is opened to the site through the dashboard portal with a deadline of three days. If no response is received or the problem has been unattended, the ticket can be escalated and the site may be suspended.
5. A ROD handover is done each Monday (CET) through the Dashboard Handover Tool, with an explicit mail sent to the Ibergrid ROD mailing list with the subject IBERGRID ROD Handover, providing some useful information for the people in charge of doing the next ROD shift, such as:
   - Current dashboard alarms.
   - GGUS tickets opened through the dashboard which will continue into the next week.
   - List of issues found last week.

The dashboard portal registers the duration of each ROD shift and the site that performed it. When this period of time expires and the handover is sent, the procedure is repeated again for the new team in charge.
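The alarm handling rules of the ROD shift can be condensed into a small decision procedure. The sketch below is only illustrative: the function name and its inputs are assumptions made for this example, since the real dashboard exposes this information through its web interface rather than through a command-line API.

    #!/bin/bash
    # Illustrative condensation of the ROD shift rules described above.
    # The inputs (alarm age in hours, Nagios status, downtime flag) are
    # assumed to be read off the dashboard by the shifter; there is no
    # real command-line interface behind these names.
    triage_alarm() {
        local age_hours=$1 status=$2 in_downtime=$3
        if [ "$in_downtime" = "yes" ]; then
            echo "ignore: the site has a scheduled downtime registered in GOCDB"
        elif [ "$status" = "OK" ]; then
            echo "close the alarm: the test returned to OK"
        elif [ "$age_hours" -lt 24 ]; then
            echo "notify the site through the dashboard notepad (copy to ibergrid-rod)"
        else
            echo "open a GGUS ticket through the dashboard with a three-day deadline"
        fi
    }

    triage_alarm 30 CRITICAL no   # prints: open a GGUS ticket through the dashboard ...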
5 Support to users

User support is a core activity currently under development at the Ibergrid sites. The operation teams have deployed a number of utilities and procedures to deal with user requirements and support. In this section we briefly describe those operational tools, namely the VOMS services, the Nagios service for automatic monitoring of the user VOs, and the user support schema based on team shifts.

A total of 10 groups at different Ibergrid sites are participating in the tasks related to user support. User support is done jointly, through participation in user support shifts. A shift consists of the following tasks:

- Get the status summary of the generic VOs of Ibergrid (see the list of VOs below): salient tickets and ongoing issues from the previous shift.
- Check the VO support tests once per day and inform the ROD in case of problems.
- Answer the tickets within 24 hours and update the ticket status at least once every 72 hours.
- Solve user-generated tickets to the best of the team's knowledge, or contact another user support team in case the expertise required does not exist in the shift team. To this end a table describing the user support capabilities has been produced (see Figure 4). This table needs to be constantly updated as more information regarding possible items for user support is encountered.

Fig. 4. Snapshot of the table describing group capacities for user support.

After two weeks, the group on shift performs the handover to the next group. The handover of the shift takes place on Monday at 10 am and consists of
the following steps (to be completed by the team leaving the shift and the team entering the shift):

- Produce a short report with the issues dealt with during the shift: tickets opened, solutions, etc.
- Participate in the operations meeting that Monday to present this summary report. The group entering the shift should participate in this operations meeting as well, in case there are questions to solve regarding the current support issues.

Furthermore, the Request Tracker will send a summary of the current ticket status to the supporters mailing list: [email protected]

5.1 Guidelines for the teams on shift

The team on shift looks every day at the output of the VO support monitoring system (see below). If a failure is observed, it will repeat the test by hand. If the error persists, the shift team will contact the ROD team at [email protected]. It is the responsibility of the ROD team to start the actions to solve this sort of problem.

Users communicate with the support teams via e-mail. The address set up for this purpose is [email protected]. When an e-mail arrives at this address, a ticket is automatically created by the RT server. From this moment on, the ticket is the responsibility of the support team on shift. Here follows a list of possible actions to be undertaken:

- As a general rule, if a user reports a problem related to the infrastructure, the shift team should try to reproduce it before contacting the operation teams. This will avoid unnecessary noise coming from misunderstandings about how the infrastructure is supposed to work, or from command-line errors.
- Problems with user commands in general should be solved by the shift team directly.
- Application porting problems should be redirected to the teams reporting experience in this field, according to the table of capacities.
- Problems like "I cannot execute any job at this site" should be forwarded to that site. The team on shift will solve the problem in collaboration with the site having the problem.

It is acknowledged that it is not obvious on which side the problem is (user or infrastructure) and there is often a fine line between both worlds. The support team needs to bridge between operators and users. This list of instructions will likely be extended as experience develops.

5.2 Automatization of VO monitoring

While EGI operation teams test and monitor the status of resources through generic tests, these can be considered insufficient for user support. In particular, a
site can have, according to the monitoring system, all the services responding properly, and still jobs fail, data are not transferred, or proxies do not start, for a variety of reasons. The only way to detect these failures is by setting up an automated probing system. The automation of this probing greatly increases the reliability of the infrastructure in the eyes of the users, who will not need to report very basic problems to the site admins by themselves.

The current operations model requires that each NGI deploy and operate its own Service Availability Monitoring (SAM) system to monitor the fraction of the EGI production infrastructure under its scope (see https://wiki.egi.eu/wiki/SAM_Instances for a list of SAM instances across the EGI infrastructure). In particular this implies that the monitoring of the VO support on the Ibergrid infrastructure needs to be handled by the operational team, using the utilities provided by the Service Availability Monitoring (SAM). Currently SAM includes the following components:

- a test execution framework (probes) based on the open source monitoring framework Nagios, and the Nagios Configuration Generator (NCG)
- the Aggregated Topology Provider (ATP), the Metrics Description Database (MDDB), and the Metrics Results Database (MRDB)
- the message bus to publish results and a programmatic interface
- the visualization portal (MyEGI)

The main idea is to profit from the SAM service in use for EGI operations and adapt it to monitor the resources of a VO, or of multiple VOs, using the same SAM instance. Currently the monitoring of the Ibergrid VOs is done at the following level: each SAM instance triggers the execution of probes in the grid sites under its scope. The present list of probes for VOs includes:

- Job submission testing via CE probes/metrics: the full job submission chain is exercised - job submission, state monitoring, output sandbox retrieval, ...
- Data management testing via SRM probes/metrics: get the full SRM endpoint(s) and storage areas from the BDII, copy a local file to the SRM into the default space area(s), ...
- WN testing via WN probes/metrics: replica management tests (WN - SE communication), ...
- WMS testing via WMS probes/metrics, using submissions to predefined CEs.
- LFC testing via LFC probes/metrics: read and update catalogue entries, ...

One of the most obvious advantages of this service is that a VO can then develop and integrate its own probes. The current state of the support for phys.vo.ibergrid.eu can be seen in the MyEGI portal of the regional SAM instance.
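To give a concrete idea of what these probes exercise, the commands below reproduce rough manual equivalents using the gLite client tools mentioned in this paper. This is only a sketch under stated assumptions: the host names, the hello.jdl job description and the LFN path are placeholders and not the actual probe implementation running in the Ibergrid SAM instance; a gLite user interface with a valid VOMS proxy is assumed.

    #!/bin/bash
    # Rough manual equivalents of the VO probes listed above.
    # All host names, files and paths below are placeholders.

    # Membership of the VO is proven through a VOMS proxy
    voms-proxy-init --voms phys.vo.ibergrid.eu

    # CE/WMS: exercise the job submission chain (submit, monitor, retrieve output)
    glite-wms-job-submit -a -o job.id hello.jdl
    glite-wms-job-status -i job.id
    glite-wms-job-output -i job.id --dir ./job-output   # once the job is Done

    # SRM: discover storage endpoints in the BDII and copy a local file in
    ldapsearch -x -H ldap://topbdii.example.org:2170 -b o=grid \
        '(objectClass=GlueSE)' GlueSEUniqueID
    export LFC_HOST=lfc.example.org
    lcg-cr --vo phys.vo.ibergrid.eu -d se.example.org \
        -l lfn:/grid/phys.vo.ibergrid.eu/tests/probe-file file:/tmp/probe-file

    # LFC: read catalogue entries
    lfc-ls /grid/phys.vo.ibergrid.eu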
5.3 Virtual Organizations in Ibergrid

The Ibergrid infrastructure currently supports several types of VOs. There are international VOs such as atlas, cms, LHCb, alice, magic, biomed, auger, compchem, fusion, planck, ... Such VOs are managed independently by their respective user communities. User affiliation and structure depend on schemas defined in the framework of the international collaboration that governs the activity of each of them.

Ibergrid support VOs exist as well, to cover the monitoring necessities related to the running of the infrastructure. These are the VOs dedicated to supporting the operational activity of Ibergrid (ops.vo.ibergrid.eu) and to user training (tut.vo.ibergrid.eu). In this category also falls the generic VO dedicated to testing (iber.vo.ibergrid.eu).

A new set of VOs is dedicated to scientific disciplines having a traditional relation with the e-science domains. These are:

- Physics and Space Sciences: phys.vo.ibergrid.eu
- Chemistry and Materials Science: chem.vo.ibergrid.eu
- Life Sciences: life.vo.ibergrid.eu
- Information and Communication Technologies: ict.vo.ibergrid.eu
- Earth Sciences: earth.vo.ibergrid.eu
- Engineering: eng.vo.ibergrid.eu
- Social Sciences: social.vo.ibergrid.eu

The scope and reach of these VOs is described on the EGI CIC operations portal. The management services for VO users and proxy generation are handled by the VOMS machines. The central one is located at LIP (voms01.ncg.ingrid.pt). Since the VOMS is a single point of failure for the usage of the infrastructure, it has been decided to have two backups as well, at CESGA and IFCA.

6 Conclusions

The IBERGRID infrastructure has transitioned over the last year to a model based on two NGIs, operating together in the framework of the European Grid Infrastructure. Over 12,000 execution cores and more than 10 Petabytes of online storage are available in the Iberian peninsula to support projects as large as WLCG, but also to fit the necessities of research groups in both countries willing to use the Grid to share resources and data capacity.

Looking at the statistics of availability and reliability, we have reasons to be satisfied with the operational coordination and management in place. Over the last months IBERGRID has ranked among the top countries of its size regarding the metrics of quality in operational support (see the availability and reliability monthly statistics for details). There is still quite some work to do, especially in the context of operational tools, to provide further automation of the operations. In the framework of the
project EGI-InSPIRE, the team of CESGA is working on the deployment of the MyEGI portal, and continues the support and enhancement of the accounting portal. Metrics are also a critical piece of information for the evaluation of the infrastructure usage. The CESGA team is developing the new EGI Metrics Portal; the main objective is to have a set of metrics that can help activity leaders to measure project performance and keep track of its evolution. The first Metrics Portal version will be released in August and it will assist EGI NGI operators in generating their quarterly reports automatically. The enhancement of operational tools will still take about one year to complete before the deployment takes place in the production infrastructure.

The joint teams of Spain and Portugal have become important actors in the collaborative effort undertaken by the whole EGI collaboration. The teams of IBERGRID are performing a major role in operations coordination, middleware rollout, software validation and certification, and deployment of VO services for user community support. Several contributions in this proceedings book are dedicated to describing the accomplishments of the IBERGRID teams in those areas.

7 Acknowledgments

The authors wish to thank the Spanish Ministry for Science and Innovation for its support under project numbers FPA and AIC10-A. This work makes use of results produced with the support of the Portuguese National Grid Initiative; more information is available from the Portuguese NGI web portal [1]. For the support to the work done in the context of the EGI Collaboration we wish to thank the European Commission for its support via the project EGI-InSPIRE (Grant number RI ).

References

1. Portuguese National Grid Initiative Web portal
2. UMIC web portal
3. LIP Web portal
4. Spanish National Grid Initiative Web portal
5. EGI Operation Centre Creation Process
6. GGUS reference for the NGI IBERGRID Creation; ticket_info.php?ticket=57990
7. Regional Nagios Validation Process; EGEE/ValidateROCNagios#Validation_Process
8. GGUS reference for the IBERGRID r-nagios validation process; ticket_info.php?ticket=57066
9. EGI Operation Centre Decommission Process
10. GGUS reference for the SWE ROC Decommission; ticket_info.php?ticket=64997
11. Enabling Virtual Organizations (EVO) web page
12. GOCDB portal
13. GGUS web page
14. Gstat 2.0 portal
15. The Ibergrid Regional Nagios web page
16. The Ibergrid Operations Dashboard web page
17. The Ibergrid wiki web page
Fostering multi-scientific usage in the Iberian production infrastructure

G. Borges 1, M. David 1, H. Gomes 1, J. Gomes 1, J. Martins 1, J. Pina 1, I. Blanquer 2, M. Caballer 2, D. Arce 2, I. Campos 3, A. Lopez 3, J. Marco 3, P. Orviz 3, A. Simon 4

1 Laboratório de Instrumentação e Física Experimental de Partículas, Portugal
2 Universitat Politècnica de València
3 Instituto de Fisica de Cantabria, CSIC, Santander, Spain
4 Instituto de Física Corpuscular, CSIC/University of Valencia, Spain
[email protected]

Abstract. In this article we present the strategies foreseen to foster the usage of the Iberian production infrastructure by regional scientific communities. The main focus is placed on describing the user support mechanisms implemented through a cooperative effort of the Portuguese and Spanish user support teams, and on the services and tools offered to the regional user communities for their use and customization, to foster VO production user activity within the Iberian region.

1 Introduction

The evolution of the European Grid Infrastructure (EGI) [1] and of the National Grid Initiatives (NGIs) should be driven by users. Their overall degree of satisfaction is a key aspect for the sustainable growth of any Distributed Computing Infrastructure (DCI), and NGIs have to be ready for the challenge of implementing a user support model able to properly answer user demands. The distributed nature of an e-science infrastructure adds one more layer of complexity to a multi-science universe of geographically distributed users, belonging to a wide spectrum of Virtual Organizations (VOs), and with a large range of applications. To summarize, users expect certain levels of service, independently of where they are and of where the problem is experienced. NGI user support teams must be up to the challenge.

Another way to look at user support activities is from a VO management perspective. VOs should be empowered with the proper tools to promote their user activities and guarantee a reliable infrastructure from their users' point of view. The availability of such tools and services changes the way scientific research takes place and fosters user satisfaction. Moreover, the deployment of standard tools and services enables the establishment of Virtual Research Communities (VRCs), representing disperse groups of researchers using the same e-infrastructure (services) to produce scientific results at a much lower cost.

This paper presents how the Iberian region is facing those challenges through its regional model to support Portuguese and Spanish users. The user enrollment
process, the user management activities and the strategies implemented to integrate regional user support operations within the global project mechanisms are explained in detail, including how requests and issues exposed by regional users are forwarded to central operations and technological providers, and how regional requirements for enhancements are followed up by project boards. Finally, the paper concludes with a survey of the evaluated services and tools provided to the regional user communities to foster their activities within the region.

2 User enrollment and management

A wide spectrum of user activities is taking place in Ibergrid. International VOs make strong use of regional resources from Resource Infrastructure Providers (RIPs) participating in European and world-wide projects like WLCG [2], ITER [3] or AUGER [4]. The enrollment of regional users in such international communities is outside the scope of the Iberian NGIs, since those VOs establish their own procedures and workflows. On the other hand, national and regional user activities are increasing with the establishment of formal e-science networks enabled by NGIs, connecting universities and research institutions. To cover those emerging necessities, the Portuguese and Spanish NGIs offer support to a set of VOs dedicated to scientific disciplines with a traditional relation with e-science domains:

- VO phys.vo.ibergrid.eu: Physics and Space Sciences
- VO chem.vo.ibergrid.eu: Chemistry and Materials Science
- VO life.vo.ibergrid.eu: Life Sciences
- VO ict.vo.ibergrid.eu: Information and Communication Technologies
- VO earth.vo.ibergrid.eu: Earth Sciences
- VO eng.vo.ibergrid.eu: Engineering
- VO social.vo.ibergrid.eu: Social Sciences

Additionally, three more VOs exist: ops.vo.ibergrid.eu, aimed at regional operations; iber.vo.ibergrid.eu, for general and transient activities; and tut.vo.ibergrid.eu, for tutorial sessions.

Each regional VO was created with two different groups representing the two participating countries, and the user communities are aggregated as VO subgroups according to their applications (see figure 1). The main advantage of this method is that it limits the VO growth and minimizes the effort involved in configuring VOs. At the same time, this mechanism is general enough to encapsulate the user activity under a national scope or, alternatively, to establish an Iberian context for international collaborations between the two countries. However, the establishment of VO subgroups per application, although more flexible, presents an additional challenge. In the current schema, the VO administrator does not know who has permission to be registered in a given VO group. To overcome this obstacle, the user communities must assume the responsibility for authorizing or denying registrations under the VO application subgroup. The user management process for Ibergrid regional VOs is therefore represented in figure 2, and can be briefly summarized in the following steps:

1. A user submits a VO registration request.
2. After evaluation, the VO admin rejects or accepts the request and the user is notified of the decision. In case of acceptance, the VO privileges are set for the user, and the request is forwarded to the application group manager.
3. After evaluation, the group manager rejects or accepts the request. In case of acceptance, the user is notified and the VO group privileges are set for the user. In case of rejection, the user and the VO admin are notified of the decision, and the VO admin should re-evaluate the context of the VO registration. An automatic rejection from the VO is not executed, to allow a failover mechanism regarding incorrect group assignments.

Fig. 1. Regional VO group hierarchy.

The initial step for user enrollment in an Ibergrid VO is to provide sufficient information about activities, applications and resource consumption. To properly guide the user (or the user community representative), the NGIs propose a lightweight application form [5] (available online) to present the scientific problem under study, the application and software dependencies in use, operative details (hardware where the application is normally executed, parallel and latency requirements, operating system, ...) and execution details (storage needs, frequency of runs, requirements on privacy and confidentiality, ...). Through this process, the NGI user support teams get a notion of how many resources this community will need and which software and hardware necessities are foreseen, redirect the request to the proper VO support, and find the experts to integrate the user community activities in the Iberian production infrastructure.

3 Regional User Support Model

User communities are the main driver for the development of standards. Previous experience showed that a direct communication channel between users, technological providers and infrastructure operators is a complex link to maintain. The lack of a common taxonomy prevents reaching a fast understanding between the involved parties, delaying the resolution of bugs and technology enhancements. It is in this interface area that NGI user support teams play an important role,
receiving, processing, validating and translating user requirements to the proper regional and global bodies. A clear procedure should be in place to optimize all user community requests and to deal with issues exposed by communities.

Fig. 2. Regional user VO management process.

Figure 3 presents the established model for the Iberian community. It copes with two typical use case scenarios: the submission of issues (e.g. clarification of procedures, middleware support, bugs, ...) and the submission of requests (support on application porting, resource bidding, etc.). The workflow for the two use cases is presented hereafter.

Submission of issues: A user, while porting an application and interacting with the infrastructure, may face several issues, and may not know in advance the cause of the problem: either they are making incorrect use of the client middleware tools, or some of the infrastructure services may be unavailable, or they could be facing a software bug. To distinguish between the different possibilities, the following steps are foreseen:

1. The user can report the issue, either by e-mail or via the local helpdesk, to the NGI user support shifters. More experienced users may submit the issue directly to the project/global helpdesk if they have clearly identified that the problem solution is outside the regional scope.
2. NGI user support shifters analyze the problem, interacting with the user and with NGI user support experts if necessary. The regional support capabilities
are well identified [6], so that the expertise to find a solution to a specific problem is easily found. The NGI user support teams will be able to track the source of the problem and redirect it if it is out of their scope. Technological providers, global operations and other NGI staff are reachable via the global project helpdesk.
3. Important operational problems can be brought up for discussion by the NGI user support team on shift in the weekly Ibergrid operations meeting, where all the regional operation bodies are represented.

Submission of requests: As examples of requests, a user may need assistance to port an application to the Iberian grid infrastructure. Alternatively, a user community may need more resources to run their applications or store their data. The established workflow to address this kind of request is detailed hereafter:

1. Requests can be issued via e-mail or via the local helpdesk. NGI user support experts analyze the request, and involve the user community representatives through the validation process in case further clarifications are needed.
2. If the fulfillment of a user request depends on a third party, the request can be forwarded as an issue, either to regional operations (via e-mail or the local helpdesk) or to technological providers or other project bodies (via GGUS).
3. If found relevant, NGI user support experts may add the user request for discussion during the weekly Ibergrid operations meeting.

3.1 Requirement gathering process

When a user request or issue cannot be solved due to missing functionalities at the operations or technological level, there is the possibility to transform it into a requirement that can be issued to the User Community Support Team (UCST). Requirements from regional bodies can emerge at any stage of the user support model (see figure 3); they:

- can be submitted directly by the user community to the UCST;
- can be submitted by the NGI user support teams after evaluating regional user community requests and/or issues;
- can be submitted by the regional operations staff after evaluating NGI user support requests and/or issues.

Several channels [7] are available to communicate requirements to the UCST. The preferred way to do it is using the EGI RT [8, 9]. Nevertheless, there are also other mechanisms to submit requirements, such as online forms and UCST questionnaires available at major EGI events. Once the UCST has received those requirements, a normalization process follows, checking for correctness and format, and grouping requirements under different categories. The next step is to search for a possible solution involving other members of the EGI.eu organization and the EGI-InSPIRE project, and user support teams from NGIs and VRCs, to provide a solution if they can. The remaining unresolved
requirements are prioritized and endorsed by the User Community Board (UCB) to ensure that the user communities' interests are correctly represented at the Technology Coordination Board (TCB). Once incorporated into the UMD (Unified Middleware Distribution) roadmap, external technology providers develop new enhanced technologies and EGI operators validate and install new tools on the infrastructure.

Fig. 3. User Support Model.

4 Regional Training activities

Once a year, the Portuguese and Spanish NGIs organize a major training session attached to the biggest regional event on Distributed Computing Infrastructures - the IBERGRID conference. This is the traditional meeting point for regional DCI users, and we profit from the wide spectrum of users to promote the regional infrastructure and provide tutorial sessions on applications, grid computing and high performance computing. Other user trainings may be scheduled on demand, when regional user communities require them. The trainers are recruited among experienced regional staff, according to the session objectives.

A dedicated VO for tutorials has been set up (tut.vo.ibergrid.eu), and Portugal and Spain support worthless Certification Authorities (CAs) which are used in live tutorials running on top of the production infrastructure. The worthless CAs must be installed on the site Computing Element, Storage Element and Worker Nodes,
to allow the authentication of the users under training. Although not integrated in the internationally recognized distributions, they are managed with the same kind of mechanisms and procedures as the official CAs, in order to minimize any security concerns raised by the sites that wish to install them. Nevertheless, the participation in these activities is not mandatory, and sites may be excluded from the training infrastructure if they wish.

Fig. 4. Regional requirements workflow (from regions to technological providers).

5 Services and Tools

Operating a VO is a complex task that requires an important effort to ensure a high quality of service. Many tools are available in EGI that rely on the VO information and, sometimes, procedures are neither easily available nor complete. LIP and UPV are Ibergrid members that share a common responsibility in EGI (under the TNA3.4 VO Services activity [10]) to evaluate and provide access to tools and services aiming to support VOs in the whole process of start-up, management and operation. The activity consists of pointing to documentation and procedural guidelines to maximize the usage of the resources, and/or providing some services to VOs, if necessary. Since two major Ibergrid players are involved in this EGI global task, the Ibergrid region can be seen as a nursery for the study and evaluation of VO services. Under this activity, different classes of tools have been evaluated:

- Job submission and monitoring oriented tools
- VO infrastructure monitoring oriented tools
5.1 Job submission and monitoring oriented tools

The Ibergrid VO Service team has investigated tools and services available within the EGI community that could be adopted by regional VO users. The final goal is to ease the access to those tools, aiming to foster production-quality usage by the users of regional VOs. Among the evaluated candidates, we propose GANGA [11] and DIANE [12] as job submission frameworks, and the CERN Mini-dashboards [13] to monitor usage and submissions. Relevant information about those tools is aggregated in [14].

GANGA: GANGA aims to be an easy tool for job submission and management. It is built on Python and provides client command tools, a Graphical User Interface (GUI), and a WebGUI. A job in GANGA is constructed from a set of building blocks. All jobs must specify the software to be run (application) and the processing system (backend) to be used. Pragmatically, this means that GANGA can be used to submit jobs to the localhost where it is installed, to a local farm or to a computing grid such as LCG/EGI, as long as the appropriate client command tools are available to GANGA. Among its valuable key features we emphasize its easy installation, how easily it can be extended to several middlewares, its easy command-line tools (if you are used to Python) and an easy GUI with the capacity to re-use jobs and job templates. GANGA is a valuable tool aiming to decrease the steep learning curve of newly registered users of the infrastructure. It can be installed by the users themselves, provided as a service by the site administrator to all the supported VOs, or even provided as a service by the VO to all the VO users. This last approach is better if the VO would like to enable job monitoring frameworks to assess, at each period in time, their users' job production rate. GANGA is presently used by ATLAS and LHCb users, among other collaborations.

DIANE: DIANE is a lightweight job execution control framework for parallel scientific applications, aiming to improve the reliability and efficiency of job execution by providing automatic load balancing, fine-grained scheduling and failure recovery. The backbone of the DIANE communication model is based on a master-worker architecture. This approach is also known as agent-based computing or pilot jobs, in which a set of worker agents controls the resources. The resource allocation is independent from the application execution control and therefore may be easily adapted to various use cases. DIANE uses GANGA to allocate resources by sending worker agent jobs, hence the system supports a large set of computing backends: LSF, PBS, SGE, Condor, LCG/EGI Grid. Among its major key features, we emphasize its easy installation and the possibility to increase the reliability and success rate of job management. DIANE is a proper tool for VOs which need to run large production campaigns with high success rates. If the VO wants to monitor its production activity, it is better that the VO offers this tool as a service to its users.

Mini-Dashboards: The Mini-Dashboards are part of the introductory package offered by CERN to EGI users. They provide a framework to monitor GANGA and
DIANE jobs, consisting of a web-based interface where users may easily keep track of their jobs. It runs a MySQL database at the backend and implements the same web interface technologies as the dashboards implemented for the HEP communities. With some customization it should be possible to offer an aggregated graphical view of individual usage at a given time.

5.2 VO infrastructure oriented tools

The EGI operations model requires that each NGI deploy and operate its own Service Availability Monitoring (SAM) system to monitor the fraction of the EGI production infrastructure under its scope. The SAM instance triggers the execution of probes in grid sites to fully exercise job- and data-oriented activities, and raises critical operational alarms in case of failures. The problem is that it is not extensible to the monitoring of VOs other than the operations one. Moreover, monitoring seems constrained to NGI regions.

Ibergrid members have developed a customized recipe so that SAM can be used to monitor resources under the scope of a VO, or of multiple VOs [15] (see figure 5). The benefit of the service is clear, since VO representatives and users have an automatic way to check how available and reliable their infrastructure is. In order to accomplish this multi-VO monitoring role, the SAM service was adapted in the following way:

- the topology generation was changed so that the resources to be tested are properly configured. The difference with respect to the service used in operations is that VO resources may not be restricted to a single region, and may be spread across the whole EGI infrastructure.
- the services which interoperate with the SAM services (the WMS which is used to submit jobs, the default SRM used to replicate files, ...) have to be properly set up to support the relevant VOs.

This experience has been documented so that VRCs (or VOs associated to a VRC) can assume the operation of the service and customize it according to their own needs. One of the clear advantages of a VO installing and deploying its dedicated SAM is that the VO can then develop and integrate its own probes. While EGI operation teams test and monitor the status of resources through generic tests, these can be considered insufficient for certain communities. This approach allows those communities to define custom test suites and insert them in their SAM system. Presently, the service does not allow an easy integration of VO-specific probes, since this would imply the definition of a new customized profile per VO. However, after the Ibergrid experience, the SAM developers have agreed to insert in future releases a generic VO profile that could be further enhanced by different VOs.

In case a VO is unable to provide its own instances of those services, the Ibergrid partners may offer a VO SAM as a service on a temporary basis. The scalability of the service (How many VOs? How many sites? How many resources? How many probes?) is still under study, in order to understand the VO SAM limits so that it continues to deliver a reliable and performant service. The Ibergrid regional
VOs are already being monitored through a dedicated VO SAM, and its input will be used by the regional NGI user support shifters to start trouble tickets on critical sites.

Fig. 5. Evolution from a regional SAM to a VO SAM.

6 Conclusion

After the first year integrated in EGI operations, the Portuguese and Spanish NGIs have extended their close operational relationship also to the user support area. User enrollment procedures are currently in place, mapping regional user communities to a flexible scientific VO hierarchical schema enabling collaboration at the national and/or international level. The user support model serving Portuguese and Spanish users is prepared to handle regional issues and requests, and to forward problems whose source is clearly located outside the Iberian region. A training infrastructure is in place for user and site administrator tutorials. Finally, the Ibergrid region is profiting from the fact that two of its major members are contributing to the VO Services global tasks, and has become a nursery for the services and tools found useful for Ibergrid VOs.

Acknowledgments

This work makes use of results produced with the support of the Portuguese National Grid Initiative; more information is available from the Portuguese NGI web portal. The authors wish to thank the Spanish Ministry for Science and Innovation for its support under project numbers FPA and AIC10-A.
This work is partially funded by EGI-InSPIRE. EGI-InSPIRE (European Grid Initiative: Integrated Sustainable Pan-European Infrastructure for Researchers in Europe) is a project co-funded by the European Commission (contract number INFSO-RI ) as an Integrated Infrastructure Initiative within the 7th Framework Programme. EGI-InSPIRE began in May 2010 and will run for 4 years. Full information is available at the EGI web portal [1].

References

1. EGI Web portal
2. WLCG Web portal
3. ITER Web portal
4. Pierre Auger Laboratory Web portal
5. Ibergrid application form
6. Ibergrid User Support Capability Table
7. EGI requirements gathering process
8. EGI request tracker
9. EGI requirements manual
10. VO Services Wiki
11. GANGA Web portal
12. DIANE Web portal
13. EGI Introductory package
14. VO Services and Tools Portfolio
15. VO Service Availability Monitoring wiki
Towards Green Computing

A. Simón 1, C. Fernández 1, E. Freire 1, I. Díaz 1, P. Rey 1, A. Feijóo 1 and J. López Cacheiro 1

1 Fundación Centro de Supercomputación de Galicia, Santiago de Compostela, Spain
[email protected]

Abstract. Green computing combines a new mentality and a holistic set of techniques to ensure that data centers and computation infrastructures such as Grids can do their work with less power consumption and less impact on the environment. Implementing this means considering a multitude of conflicting factors. In this paper we show that new processors like the Opteron Magny-Cours are a step in the right direction towards a carbon-neutral computing paradigm.

1 Introduction

Green Computing is an extremely broad and ramified term for the measures and initiatives in many areas and fields to reduce the ecological impact of computing. Computers were historically not associated with pollution because, unlike cars, they do not directly emit pollutants; they do so only as an indirect consequence of their electricity consumption and the emissions associated with producing it. Another factor is that their manufacture and disposal also cause ecological problems. Recently, there has been increasing interest in this topic, as reported by Kurp [1], Murugesan [2] and Wang [3]. As these works point out, there is no single solution, but rather a new mind-set and approaches on various fronts. For example, there are software solutions that scale back power consumption by deactivating computing nodes based on a distributed automatic decision system, such as that of Das et al. [4], which uses multi-agents, or the solution based on grid quorums of Ishikawa et al. [5]. Other approaches try to leverage economic factors. For example, the work of Le et al. [6] presents an adaptive scheme that can offload work to other data centers based on local electricity costs, the carbon offset market and the need to honor SLAs. Brebner et al. [7] present a model of these same aspects, while also considering the impact of the interaction between SOA architectures and virtualization on power consumption. Virtualization also interacts strongly with Cloud Computing, so works like Liu et al. [8] try to address their impact on the environment, in this case with a solution to monitor the load and power consumption of VMs so that they can be migrated accordingly. As part of the Grid 5000 project, Orgerie et al. [9] and Da Costa et al. [10] present a grid-based solution which deactivates unused nodes and aggregates work to minimize power cycling. In a more hardware-based approach, Berl et al. [11] and Kant [12] describe numerous advances in processor power management, including units dedicated exclusively to this task. PowerNap [13] uses part of this functionality to implement a sleep mode with very quick sleep and wake-up times, which can be used in real time. This, coupled with a mechanism to share redundant PSUs, allegedly brings power savings in the order of 70%.

Since Grid computing is largely based on data centers, much of the literature about data center power saving is directly applicable. The trends in power consumption of data centers, shown by Koomey [14] and Koomey et al. [15], are of rapidly increasing power consumption which, if unchecked, could mean that power and cooling costs come to dominate over infrastructure and maintenance. Of course, this has been a concern for years, with a plethora of different approaches, as can be appreciated in See [16]. Some of these solutions include changing the electric distribution to DC current, with conflicting accounts between supporters like Pratt et al. [17] and detractors like Rasmussen [18]. Other physically based solutions, like that of Bash and Forman, include assigning jobs to machines on cool spots of the data center, but these would probably clash with normal job allocation. Another solution, presented in Greenberg et al. [19], would be to distribute small non-redundant data centers next to power sources; in this scenario, redundancy would be handled at data-center granularity. This solution has multiple social and political implications that make it unlikely.

This paper shows the benefits in performance and power consumption offered by the nodes recently incorporated into CESGA's grid infrastructure. New processor technology, more power proportionality, better PSUs and the spatial distribution of worker nodes mean that more processing power can be offered without more stress on the environment. This paper is organized in the following sections: first, Section 2 presents performance benchmarks of these new nodes compared with the old ones, and Section 3 does the same for power consumption. Section 4 then shows the trends for HPC computing, and in Section 5 we present conclusions.

2 Benchmarks

One of the main concerns about green computing is maintaining high performance with minimum energy consumption. CESGA acquired HP ProLiant SL165z G7 servers in early 2011 to follow the green computing paradigm. To quantify how energy consumption has decreased with the new purchase, several benchmarks have been performed using two types of nodes which are very representative of the CESGA configuration. These benchmarks also describe the evolution in time of the performance and energy consumption of these nodes. The SPEC CPU2006 [20] benchmark was selected to carry out this task, since it is the official CPU performance metric to be used by WLCG sites since 1 April. The main results of these benchmarks are presented in this section. In our performance evaluation, a system composed of one DELL 750 and one HP ProLiant SL165z G7 was used, with the following characteristics respectively:
DELL 750:
- Intel(R) Pentium(R) 4 CPU 3.20GHz, 1 processor.
- 2GB DDR333 RAM, 157GB HDD.
- CentOS release 4.0 (i386), Linux kernel EL.

HP ProLiant SL165z G7:
- 2 x AMD Opteron(TM) CPU 2.2GHz, 12 cores per processor x 2 CPUs.
- 34GB DDR RAM, 900GB HDD.
- Scientific Linux 6 (x86_64), Linux kernel el6.

SPEC CPU2006 is a benchmark for measuring CPU performance based on the SPEC(R) CPU2006 benchmark suite. It has been designed to provide performance measurements that can be used to compare compute-intensive workloads on different computer systems. SPEC(R) CPU2006 consists of two benchmark suites: CINT2006 (SPECint) for measuring and comparing compute-intensive integer performance, and CFP2006 (SPECfp) for measuring and comparing compute-intensive floating point performance. These two components are formed by twelve independent benchmarks. As SPEC(R) CPU2006 is distributed as source code, the benchmark results depend on the compilation flags. SPEC defines two reference points, base and peak, where base has a stricter set of compilation rules than peak. The benchmark programs used to obtain the results shown in this work were compiled using the base optimization.

The standard way to execute the SPEC CPU2006 benchmark is to run, for the same test, different parallel processes so that all the logical CPUs are used, and then add each result to give the final value. To obtain the results published in this work, the same procedure was followed on each node to run the benchmark. The procedure is described below; the following examples were used for the HP ProLiant SL165z G7 testing. The first step is to install the SPEC CPU2006 benchmark using the install.sh script from the SPEC CPU2006 benchmark package:

./install.sh -d /installation_directory

Then it is necessary to get the configuration files used to build the appropriate set of executables to run the benchmark on 32-bit or 64-bit machines. CESGA downloaded the configuration files developed by the HEPiX Benchmarking Working Group [21]:

hepspec-systeminfo.sh
runspec.sh
simple.sh
linux32-gcc_cern.cfg
linux64-gcc_cern.cfg
redhat32-gcc43_cern.cfg
redhat64-gcc43_cern.cfg
README.TXT
Once these files have been downloaded, they must be copied into the $SPEC/config subfolder, where $SPEC is the directory in which the SPEC CPU2006 benchmark suite has been installed. Next, it is necessary to change to the $SPEC directory and load the variables needed to run SPEC CPU2006 by sourcing the $SPEC/shrc script. The SPEC CPU2006 benchmark suite is distributed as source and must first be compiled from the command line:

runspec --config=linux64-gcc_cern.cfg --action=scrub all_cpp
runspec --config=linux64-gcc_cern.cfg --action=build all_cpp

The scrub option cleans the files from the last build and creates the new directories and executables for the next benchmarks, while the build option compiles and builds the new benchmarks using the configuration file. Finally, one copy of the benchmark is executed per logical CPU (core, or hyper-threaded processor) seen by the operating system. CESGA used exactly the same BIOS settings that are also used in production mode:

n_cpus=$(grep -c "^processor" /proc/cpuinfo)
for i in $(seq $n_cpus); do
  runspec --config linux64-gcc_cern --nobuild --noreportable all_cpp &
done

Once all tasks of the benchmark have finished, several result files named CFP2006.###.ref.txt and CINT2006.###.ref.txt can be found in the $SPEC/result folder. The benchmark script runs three iterations of each benchmark, and the median of the three rounds is selected to be part of the overall metric. In the result files named above, the median values are marked with an asterisk:

for i in $(seq -w ); do
  echo "result/*.$i.ref.txt"
  grep "* *$" result/*.$i.ref.txt | sort -u
done
result/*.003.ref.txt
result/cfp ref.txt:444.namd  *
result/cfp ref.txt:447.dealii  *
result/cfp ref.txt:450.soplex  *
result/cfp ref.txt:453.povray  *
result/cint ref.txt:471.omnetpp  *
result/cint ref.txt:473.astar  *
result/cint ref.txt:483.xalancbmk  *
[...]

The number of output files depends on the number of cores available in each system. To calculate the value for each core, the geometric mean of the seven C++ benchmarks must be computed, for example:

echo "scale=2;\
e((l(11.0)+l(14.8)+l(7.81)+l(14.9)+l(5.68)+l(7.66)+l(9.92))/7)" | bc -l
9.58
[...]
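The same aggregation can also be scripted once the per-core medians have been extracted. The sketch below is only illustrative and assumes the seven C++ results of each core have already been parsed into lists; the sample figures are the ones quoted above for a single core, not a complete measurement:

import math

def hepspec06_per_core(results):
    # Geometric mean of the seven C++ SPEC results of one core.
    return math.exp(sum(math.log(r) for r in results) / len(results))

def hepspec06_per_node(per_core_results):
    # HEP-SPEC06 of a node: sum of the per-core geometric means.
    return sum(hepspec06_per_core(core) for core in per_core_results)

# Example with the per-core figures quoted in the text (one core only).
core0 = [11.0, 14.8, 7.81, 14.9, 5.68, 7.66, 9.92]
print(round(hepspec06_per_core(core0), 2))
print(round(hepspec06_per_node([core0]), 2))

In practice the parsing of the result/*.ref.txt files would feed this function with one list per logical CPU of the node.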
Performance for an HP ProLiant SL165z G7 server, in HEP-SPEC06, is obtained by adding each core value. This value is valid for one node; to calculate the cluster value, for a homogeneous cluster this result should be multiplied by the total number of nodes. CESGA has the SVG cluster, formed by 46 HP ProLiant SL165z G7 servers, hence the total HEP-SPEC06 score provided by the SVG cluster is the per-node value multiplied by 46. Finally, Table 1 shows the HEP-SPEC06 results of the performance evaluation with the nodes mentioned at the beginning of this section. As the main conclusion, the compute power of two HP ProLiant SL165z G7 servers is equivalent to that of 70 DELL 750 machines.

                               | DELL 750 (3.20GHz) | HP SL165z G7 (2.2GHz)
Year of installation at CESGA  | 2004               | 2011
CPUs (logical)                 | 1                  | 24
Execution time                 | 6h 48m             | 4h 58m
HEP-SPEC06 per node            | 6.55               |
Nodes per cluster              | 80                 | 46
Cluster HEP-SPEC06             | 80 * 6.55 = 524    | 46 * (per-node value)

Table 1. HEP-SPEC06 benchmark results by type of node, cluster and year of installation at CESGA, from 2004 to 2011.

3 Power Consumption

Moore's Law still stands: the number of transistors doubles every 24 months, and the limit for the next 10 years is set by the size of atoms. Obviously, computing power increases each year, but the next important question is what happens with power consumption, and whether it increases or decreases with the new models. To answer this question, several tests were executed to compare the power consumption between CESGA's new and old nodes (Table 2) during benchmark execution. These measurements were made using a digital clamp multimeter to obtain electric current values (I). To obtain the most accurate values possible, an adapted wire connected directly to the AC power supply was used. This adapted wire was used to connect the multimeter to the single-phase wire and display the instant power consumption in amperes. Combining Ohm's law and Joule's law:

P = V I   (1)

the power in watts (P) is calculated by multiplying V, the voltage across the wire (230 V), by the current I measured with the multimeter. The nodes used for this test were an Intel(R) Pentium(R) 4 CPU 3.20GHz with 1 core from the old cluster and a new AMD Opteron(TM) CPU 2.2GHz with 24 cores, in 3 different states. The first state is measured when the nodes are switched off; in this case only the power supplies are consuming energy. The second measurement is obtained when the nodes are in idle state, without any external load, only running the operating system. Finally, a measurement was taken when the nodes were at maximum load, using all available cores. The power values are shown in Table 2.

                | Intel(R) Pentium(R) 4 (3.20GHz) | AMD Opteron(TM) 6174 (2.2GHz)
Switched off    | 6.9 watts                       |
Idle            | 87.4 watts                      |
Maximum load    |                                 |

Table 2. Power consumption comparison between Intel(R) Pentium(R) 4 and AMD Opteron(TM) nodes (cluster nodes).

These results are summarized in figure 1. The new Opteron nodes consume more energy switched off than the DELL 750 servers based on the Pentium 4 architecture (10 watts more); a possible reason for this difference is the power consumption of the iLO management board included in the new servers. DELL 750 servers do not include this feature and consume 7 watts just for being connected, since the power supply always draws a residual amount of energy. When both nodes are in idle state, the Opteron 6174 server consumes more energy than the DELL 750, in this case because the ProLiant SL165z G7 servers include two processors instead of only one. Finally, at maximum load the Opteron processor consumes more energy than a single Pentium 4 CPU, but we must take into account that for this test the HP ProLiant SL165z G7 servers were running two Opteron processors with 12 cores each at 100% CPU load. These values show that 35 DELL 750 servers are needed to reach the performance of a single ProLiant SL165z (see Table 1); in other words, a DELL 750 cluster would consume 17 times more watts to deliver the same result as a single ProLiant SL165z server. This is a clear example of how modern processors are more energy efficient and how a cluster update can save energy and money for an HPC institution.

To see how modern energy management works with multi-core CPUs, another kind of test was performed using the new cluster nodes. For this test, several HEP-SPEC06 benchmarks were executed, increasing the number of cores used each time (from 1 to 24), while the instant current measurement was registered. As a result of this test, a graph of power consumption versus number of cores was generated (see figure 2), which shows the trend of power consumption as the number of used cores increases. The power consumption trend is linear: each core increases the power used by approximately 9 watts. The energy efficiency of each architecture is determined by drawing a chart with the ratio of the HEP-SPEC result obtained over the energy used (see figure 3). This figure is a good example with which to explain the evolution of power consumption from Pentium 4 processors (higher results represent better power performance) to current multi-core chips.
Fig. 1. Power comparison for different states.

Using the same amount of energy, we obtain higher benchmark results using a single Opteron core. Additionally, this difference increases with the number of cores used. In conclusion, new multi-core processors are more energy efficient when the greatest possible number of cores is used. This statement is valid for executing sequential jobs or virtual machines, where the machine performance is based on independent calculations.

4 A Multicore Future

In recent years we have seen a new trend towards increasing the number of cores per processor. Multi-core processors are not only used for HPC computing; they are now also included in new home desktop CPUs due to the reduction of prices and production costs. Manufacturers such as Intel and AMD have realized that performance is not the only factor in deciding on a server. Performance is important, but now there is also a new factor to consider: the next goal is to reduce power consumption while increasing the number of cores per processor. New technology based on 32nm CPUs allows increasing the number of cores on the same chip with less power consumption. New Intel processors like the Westmere Xeon EP offer a six-core CPU in 248 mm2 with a power consumption between 40W and 130W (for CPU frequencies from 1.86GHz to 3.46GHz).

Fig. 2. Power consumption per core running Hep-Spec in Opteron(TM).

On the other hand, AMD has recently gone beyond the barrier of the hexa-core CPUs: the AMD Magny-Cours processor, part of the 10h processor family, includes the largest number of cores per CPU at the moment. 12 cores per CPU are included in the new AMD Opteron 6174 family, in a two-die chip of 346mm2 (two 6-core Istanbul CPUs bolted together) at 2.2GHz, which consumes only 80 watts (when cores are not used). These processors also include a 24-link HyperTransport pipe, virtualization support and ultra-fine-grained power management. To improve energy efficiency, CPU frequencies are managed as in laptop computers, increasing or decreasing the core clock frequency on demand; this architecture also allows cores to be disabled or enabled depending on CPU usage. One such feature, Enhanced Halt State (C1E), reduces the clock speed, is triggered by the idle state, and is entered only after longer periods of inactivity.

The future will undoubtedly be multi-core, but who the winner will be is still unknown. Intel and AMD are continuously adding more cores to their CPUs, and in order to increase the number of cores they must solve several issues to keep these CPUs scaling. One of these issues is the overhead of cache coherency messages. These messages and cache misses can increase the latency and absorb a lot of bandwidth; the problem is accentuated in multi-core CPUs because those cores require a higher bandwidth. At this point the memory subsystem and its latency play an important role. The latest AMD and Intel processors have decreased their cache latencies. Intel processors are faster, with latencies ranging between 81ns and 87ns for the Intel Xeon X5570/X5670, whereas the AMD Opteron 2435/6174 series reports latencies between 98ns and 113ns.

Fig. 3. Comparison of the ratio Hep-Spec per Watt.

Comparing benchmark results, prices and power consumption, there is no clear winner. For data mining, Opteron processors offer 20% better performance at 20% lower prices; however, the Intel Xeon processor is about 20% faster than the Opteron in virtualized servers with very high Virtual Machine (VM) counts. This difference is smaller for a virtualised server with a few very heavy VMs. The final choice depends only on the final use of the AMD or Intel multi-core processors.

5 Conclusions and Future Work

Decreasing the carbon footprint involves different choices. The first important choice is which hardware solution to purchase, depending on which best meets our computing needs and our energy limitations. As we have seen, this issue is partially solved with the new processors: less power consumption per core is now a priority for the main manufacturers. But this is only the first step. Based on the benchmarks and power consumption tests of CESGA's new nodes, the most energy-efficient solution is to have all the CPU cores busy for as long as possible. This option was studied in previous works; due to the low power consumption per core of the new
CPUs, it is not worth having cores without load or in an idle state. To address this in a grid-oriented cluster, the option is to completely fill each node with jobs. In this case, the queue scheduling algorithm must be changed to fill the nodes one by one, avoiding the typical round-robin scheduling policy. This can be achieved by configuring CESGA's Grid Engine scheduler to use a different scheduling mechanism. The next step for this approach is to start or stop nodes on demand; this can also be done by running a cron script or an external daemon that detects unused nodes and stops them. This mechanism allows machines to be started or stopped on demand: if the cron script detects that the queues are full, the system can automatically start a stopped machine to begin executing new jobs. But this is not enough: due to the iLO management boards, the new servers still consume energy when switched off. To avoid this continuous energy loss, it is possible to use power distribution units (PDUs) with remote power monitoring to exercise control down to the individual plug level. This procedure is valid for sites with variable loads; using a correct scheduling policy and nodes on demand can save large amounts of energy.

Green computing not only saves energy, it is also more ecological in a world becoming increasingly dependent on energy. Efficient energy management also saves money: CESGA spends 113,000 € to maintain 80 DELL 750 servers over 4 years. This budget includes the power usage effectiveness (PUE) and the servers' power consumption. On the other hand, 4 HP ProLiant SL165z G7 servers have more computing power than 80 DELL 750 nodes, with much lower energy consumption. Using the same calculations, maintaining 4 HP ProLiant SL165z G7 servers over 4 years (one HP ProLiant SL165z G7 consumes 314 watts at maximum load) costs far less than the previous cluster. Therefore, it can be deduced that, with the arrival of the HP ProLiant SL165z G7 servers, the DELL 750 servers could be shut down in order to save money on energy consumption.

References

1. Kurp, P.: Green computing. Commun. ACM 51 (2008)
2. Murugesan, S.: Harnessing Green IT: Principles and practices. IT Professional 10 (2008)
3. Wang, D.: Meeting green computing challenges. In: High Density Packaging and Microsystem Integration, HDP 07. International Symposium on. (2007)
4. Das, R., Kephart, J.O., Lefurgy, C., Tesauro, G., Levine, D.W., Chan, H.: Autonomic multi-agent management of power and performance in data centers. In: Proceedings of the 7th international joint conference on Autonomous agents and multiagent systems: industrial track. AAMAS 08, Richland, SC, International Foundation for Autonomous Agents and Multiagent Systems (2008)
5. Ishikawa, M., Hasebe, K., Sugiki, A., Kato, K.: Dynamic grid quorum: a reconfigurable grid quorum and its power optimization algorithm. Service Oriented Computing and Applications 4 (2010)
6. Le, K., Bilgir, O., Bianchini, R., Martonosi, M., Nguyen, T.D.: Managing the cost, energy consumption, and carbon footprint of internet services. In: Proceedings of the ACM SIGMETRICS international conference on Measurement and modeling of computer systems. SIGMETRICS 10, New York, NY, USA, ACM (2010)
38 38 IBERGRID 7. Brebner, P., O Brien, L., Gray, J.: Performance modelling power consumption and carbon emissions for server virtualization of Service Oriented Architectures (SOAs). In: Enterprise Distributed Object Computing Conference Workshops, EDOCW th. (2009) Liu, L., Wang, H., Liu, X., Jin, X., He, W.B., Wang, Q.B., Chen, Y.: Greencloud: a new architecture for green data center. In: Proceedings of the 6th international conference industry session on Autonomic computing and communications industry session. ICAC-INDST 09, New York, NY, USA, ACM (2009) Orgerie, A.C., Lefevre, L., Gelas, J.P.: Save watts in your grid: Green strategies for energy-aware framework in large scale distributed systems. Parallel and Distributed Systems, International Conference on Parallel and Distributed Systems (2008) Costa, G.D., Gelas, J.P., Georgiou, Y., Lefevre, L., Orgerie, A.C., Pierson, J.M., Richard, O., Sharma, K.: The GREEN-NET framework: Energy efficiency in large scale distributed systems. Parallel and Distributed Processing Symposium, International 0 (2009) Berl, A., Gelenbe, E., Di Girolamo, M., Giuliani, G., De Meer, H., Dang, M.Q., Pentikousis, K.: Energy-efficient cloud computing. The Computer Journal 53 (2010) Kant, K.: Data center evolution: A tutorial on state of the art, issues, and challenges. Computer Networks 53 (2009) Virtualized Data Centers. 13. Meisner, D., Gold, B.T., Wenisch, T.F.: PowerNap: eliminating server idle power. In: Proceeding of the 14th international conference on Architectural support for programming languages and operating systems. ASPLOS 09, New York, NY, USA, ACM (2009) Koomey, J.G.: Worldwide electricity used in data centers. Environment Research Letters (2008) 15. Koomey, J.G., Belady, C., Patterson, M., Santos, A.: Assessing trends over time in performance, costs, and energy use for servers (2009) 16. See, S.: Is there a pathway to a green grid? (presentation) (2008) 17. Pratt, A., Kumar, P., Aldridge, T.: Evaluation of 400V DC distribution in telco and data centers to improve energy efficiency. In: Telecommunications Energy Conference, INTELEC th International. (2007) Rasmussen, N.: AC vs. DC power distribution for data centers. Technical report 63, APC (2007) 19. Greenberg, A., Hamilton, J., Maltz, D.A., Patel, P.: The cost of a cloud: research problems in data center networks. SIGCOMM Comput. Commun. Rev. 39 (2008) SPEC - Standard Performance Evaluation Corporation HEP-SPEC06 Benchmark.
EnergySaving Cluster experience in CETA-CIEMAT

Manuel F. Dolz 1, Juan C. Fernández 1, Sergio Iserte 1, Rafael Mayo 1, Enrique S. Quintana-Ortí 1, Manuel E. Cotallo 2, Guillermo Díaz 2

1 Depto. de Ingeniería y Ciencia de los Computadores, Universitat Jaume I, 12071, Castellón, Spain.
{dolzm,jfernand,sisterte,mayo,quintana}@uji.es
2 Centro Extremeño de Tecnologías Avanzadas, Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas, 10200, Trujillo, Spain.
{manuelenrique.cotallo,guillermo.diaz}@ciemat.es

Abstract. With the increasing number of nodes in high performance clusters, as well as the high frequency of the processors in these systems, energy consumption has become a key factor limiting their large-scale deployment. Most of the grid infrastructure is composed of clusters. EnergySaving is a middleware that implements a power-on/power-off policy in order to save energy in a cluster setup. In this paper we present the first experiences and tests of using this software in a real environment. EnergySaving is currently under evaluation at the CETA-CIEMAT Grid Computing Center. In order to obtain an estimate of the potential savings, a simulation tool has been developed. We use this tool with experimental power consumption data from the cluster nodes to estimate the savings in the IT facilities of the cluster, in case the middleware were fully installed in a production environment. Finally, a discussion of the combination of this tool with the gLite grid environment is presented.

1 Introduction

High Performance Computing (HPC) clusters have been widely adopted by companies and research institutions for their data processing centers because of their parallel performance, high scalability, and low acquisition cost. On the other hand, further deployment of HPC clusters is limited by their high maintenance costs in terms of energy consumption, required both by the system hardware and by the air cooling equipment. In particular, some large-scale data processing centers consume as much energy as 40,000 homes, and studies by the U.S. Environmental Protection Agency show that, in 2007, the power consumption of data centers in the United States was around 70 billion kWh, representing 5,000 million euros and the emission of 50 million tonnes of CO2 [1].

CETA-CIEMAT is a specialised grid computing resource center aiming to support the establishment and development of e-Science research communities with computationally intensive needs. CETA-CIEMAT also participates as an e-Infrastructure provider in multiple international projects (CERN's ALICE [2], GISELA [3], EDGI [4], and the Spanish/Portuguese grid initiative IBERGRID [5]). Nevertheless, having a large amount of grid resources at the disposal of researchers may sometimes imply that they are underused, and may also induce noticeable costs for keeping hardware nodes always active, just in case they are needed. This situation creates unbearable operational cost overheads in terms of higher electricity bills and more frequent hardware failures, apart from the implied environmental concerns.

In this context a well-known energy management technique is DVFS (Dynamic Voltage and Frequency Scaling). DVFS entails reducing the system energy consumption by decreasing the CPU supply voltage and the clock frequency (CPU speed) simultaneously. This technique has had a great impact on work aimed at reducing consumption in this research context [6-8]. The authors in [9] present an energy-aware method to partition the workload and reduce energy consumption in multiprocessor systems with support for DVFS. Freeh et al. [10] analyze the energy-time trade-off of a wide range of applications using high-performance cluster nodes that have several power-performance states to lower energy and power, so that the energy-time trade-off can be dynamically adjusted. In [11], the authors use economic and energy criteria to dispatch jobs to a small set of active servers, while other servers are transitioned to a low energy state. Alternative strategies to limit the power consumption and required cooling of HPC clusters are based on switching nodes on and off according to the needs of the users' applications. An algorithm that makes load balancing and unbalancing decisions by considering both the total load imposed on the cluster and the power and performance implications of turning nodes off is described in [12]. Several policies combining dynamic voltage scaling and turning nodes on/off to reduce the aggregate power consumption of a server cluster during periods of reduced workload are presented in [13]. Rocks-solid [14] and PowerSaving [15] are prototype examples of this strategy which provide little functionality or are still under development. In the grid infrastructure context there is no software that provides this strategy, but there exist other support tools for Sun Grid Engine: the Service Domain Manager/Hedeby resource provider, in combination with IPMI, could be used to implement energy-aware policies in future projects.

In this paper we present the application of EnergySaving Cluster [16] to the CETA-CIEMAT infrastructure. EnergySaving Cluster (ESC) is a middleware prototype designed and developed by the HPC&A research group at Universitat Jaume I with the goal of saving energy in the IT infrastructure of a datacenter. It has been designed for HPC datacenters where the computational load is normally scheduled by a job queue system (PBS, SGE, ...). Working with data on future computational load, extracted from the job queue system, ESC applies a power-on/off policy in order to switch off unnecessary computing nodes, and thus attain a substantial energy saving. The article is organized as follows: Section 2 presents the ESC middleware. Section 3 describes the experience of installing and tuning ESC for the CETA-CIEMAT infrastructure.
In Section 4 we describe the functionality tests of ESC that have been carried out at CETA-CIEMAT. Section 5 is a discussion of the benefits, problems and pending work needed to integrate the ESC tool into the Grid infrastructure. In Section 6, results of several simulations that show the benefits of the tool are presented; finally, the summary and conclusions follow in Section 7.

2 Description of the EnergySaving Cluster middleware

The target hardware platform for ESC is an HPC cluster running Linux, equipped with a front-end node that is responsible for the queue system (in this case SGE) and the energy saving module. The module queries the SGE queue system to collect information on the current jobs, nodes and queues (qstat, qmod and qhost commands). This information is then used to compile the necessary statistics and apply the power saving policy defined by the system administrator. The module also runs several daemons implemented in Python [17]. These daemons maintain a MySQL database that contains all the information and statistics, such as information on the nodes (e.g., their MAC addresses) needed to remotely power them on using WakeOnLAN (WOL) [18]. The nodes of the cluster have their BIOS configured with WOL (WakeUp events).

2.1 Implementation of the EnergySaving Cluster

The energy saving module includes the following major components:

- Three daemons in charge of managing the database, collecting statistics, and executing the commands that power on and shut down the nodes.
- The database that stores all the information necessary to make decisions.
- The website interface to configure and administer user groups, as well as to set the threshold triggers that define the power saving policy.

ESC has a modular design, mapping the main functions of the system to daemons (control of the queue system, collection of statistics, and application of node activation and deactivation policies). It uses a database to ease data mining via SQL, and the user interface is web-oriented to facilitate remote access and administration.

2.2 Daemons

Daemon for epilogue requests. A node of the cluster runs an epilogue script provided by the SGE queue system when a job completes its execution and therefore leaves the queue. This script receives parameters from the SGE executor daemon which are essential for monitoring the cluster and, therefore, for implementing the energy saving policy. As the database is located on the front-end node, it is necessary to send this set of parameters through the network. For this purpose, the node that executes the epilogue script opens a connection via a TCP socket to the epilogue daemon that runs on the front-end node and passes the necessary information.
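As an illustration of this epilogue-to-daemon hand-off, a minimal sketch of the client side is shown below. The host name, port, message format and environment variable choices are assumptions for the example only; the paper does not describe the actual ESC wire protocol.

import json
import os
import socket

# Hypothetical values: the front-end host/port used here are illustrative.
FRONTEND_HOST = "frontend.example.org"
EPILOGUE_PORT = 5005

def send_job_record():
    # SGE exposes some job metadata to prologue/epilogue scripts through
    # environment variables; missing ones simply default to "".
    record = {
        "job_id": os.environ.get("JOB_ID", ""),
        "user": os.environ.get("SGE_O_LOGNAME", ""),
        "queue": os.environ.get("QUEUE", ""),
        "host": socket.gethostname(),
    }
    with socket.create_connection((FRONTEND_HOST, EPILOGUE_PORT), timeout=10) as s:
        s.sendall(json.dumps(record).encode("utf-8"))

if __name__ == "__main__":
    send_job_record()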
42 42 IBERGRID The epilogue daemon employs this information to perform a series of updates in the energy saving module database, extracting data from the accounting file maintained by the queue system. Updated data comprise the number of jobs for the user responsible of this job, this user s average execution time, and the queue average waiting and execution times. Daemon for the queues, users and nodes. This daemon is responsible for ensuring that all information on users, nodes and queues that are actually in operation in the SGE queue system is correctly reflected by the database. To achieve this, the daemon queries the queue system about queues and nodes (with the qhost command), and the OS about users (checking the file /etc/passwd). With these data, it ensures that the database is consistent. The daemon is also in charge of enabling nodes that were marked as disabled. Daemon for the activation/deactivation actions and statistics. This daemon, the most important of the module, activates and deactivates the nodes according to the needs of the queue system s user. The daemon compares the threshold parameters set by the system administrator and the current values of these parameters from the database to test if any of the activation or deactivation conditions is satisfied. If certain nodes have to be shut down, it suspends all queues to prevent the execution of new jobs, performs the transaction, updates the database and log file, and finally resumes the queues. The threshold conditions are not necessarily equal for all users, as the system administrator can create multiple user groups with different values for the threshold parameters. Thus, a priority system can be defined for groups of users. The daemon is divided into several functions, and the administrator can specify the order in which activation and deactivation conditions are checked. It is therefore very simple to enable or disable one or several of these conditions. This daemon also updates the waiting time (both per user and per queue) of all en-queued jobs and ensures that the current database size does not exceed the specified limit. 2.3 Activation and deactivation conditions Node activation. This operation is performed using the ether-wake command [19] which sends the magic packet WOL. Nodes can be turned on if any of the following conditions are met: There are not enough appropriate active resources to run a job. That is, as soon as the system detects that a job does not have enough resources because all the nodes that contain the appropriate type of resource are turned off, nodes are powered on to serve the request. The average waiting time of an enqueued job exceeds a given threshold. The administrator must define a maximum average waiting time in queue for the jobs of each group. When the average waiting time of an enqueued job exceeds the maximum value assigned to the corresponding user s group, the system
43 IBERGRID 43 will turn on nodes which contain resources of the same type as those usually requested by the same user. The number of enqueued jobs for a user exceeds the maximum value for its group. In this case, the daemon selects and switches on nodes which feature the properties required by most of the enqueued jobs. When the magic packet is sent, the daemon for activation/deactivation actions starts a timer. If this daemon does not detect that the node is active after the timer expires, the node is automatically marked as unavailable. The system administrator can also use the following options to select the (candidate) nodes that will be activated: Ordered: The list of candidate nodes is sorted in alphabetical order using the name of the node (hostname). Randomize: The list of candidate nodes is sorted randomly. Balanced: The list of candidate nodes is sorted according to the period that the nodes were active during the last t hours (with t sets by the administrator). The nodes that are selected to be powered on are among those which have been inactive a longer period. Prioritized: The list of candidate nodes is ordered using a priority assigned by the system administrator. This priority can be defined, e.g., according to the location of the node with respect to the flow of cool air [20]. In the context of the SGE queue system, the slots of a queue instance for a given a node indicate the maximum number of jobs that can be executed concurrently in that node. When an exclusive execution is required (as, e.g., is usual in HPC clusters), the number of slots equals the number of processors. The daemon can also specify a strict threshold to power on nodes to serve job requirements: No strict: The nodes are turned on to serve job requests if there are not enough free slots on current active nodes. This option yields low queue waiting times but saves little energy. Strict: Nodes are only turned on when the current active nodes do not provide enough slots (free or occupied at the moment) to serve the requirements of the new job. This option produces longer queue waiting times than the previous policy but may provide fair energy saving. Strict and sequential: Nodes are only turned on to serve the job requests when all current active nodes have their slots in free state. This option simulates a sequential execution of currently enqueued jobs, likely delivering the longest queue waiting times and attaining high energy savings. Node deactivation. Nodes are shut down using the shutdown command. The following parameters define when a node is turned off: The time that a node has been idle. If this time is greater than a threshold set by the administrator, the node is turned off to save power.
44 44 IBERGRID The average time waiting for users jobs is less than a threshold set by the administrator. The administrator must define a minimum value for the queue waiting time of the jobs of each group of users. In case the average waiting time of a user s job is lower than the threshold assigned to its group, the daemon turns off a node (among those which exhibit the properties that were more rarely requested in the near past). Current jobs can be served by a smaller number of active nodes. The administrator can enable this condition to run the enqueued jobs using a smaller number of nodes than are switched on at a specified moment. In such case, the system turns off one of the nodes to save energy. Although this condition can significantly increase the average waiting time of the user s jobs, it may also reduce power consumption significantly. When one of these three conditions is satisfied, the daemon executes the command shutdown -h now in a remote ssh session in the nodes that were selected to be shut down and suspends all associated queues to prevent the execution of new jobs on those nodes. 3 Experiences on installing ESC on CETA-CIEMAT s test cluster CETA-CIEMAT has been the first grid-computing center where ESC software has been installed and tested outside the laboratory of the HPC&A group at Universitat Jaume I. Some interesting feedback was generated in the process of installing under more close to real state-of-the-art data centers. Installing and tuning ESC at the CETA-CIEMAT test cluster was not a technically complicated issue, but nevertheless some aspects were taken into account when deploying it: Compatibility issues. Some compatibility issues were checked, prior to installing the software, regarding both computing and communication equipments. Computing nodes hardware issues. Computing nodes must have WOL capability present and enabled. Moreover, before deploying ESC at the CETA- CIEMAT test cluster, a revision of network chips and their WOL support was done. Selected hardware supports g capability for WOL, so Ethernet magic frames are able to trigger the startup process of a node while it is in power-off state. Network concerns. Computing nodes in the cluster must be contained in the same layer 2 subnetwork, as well as node hosting ESC daemons. Besides, all communication equipments must be set properly to allow WOL frames. Datacenter architecture, SGE s own configuration and usage. ESC was initially thought to work in small systems where it could be feasible to accept the following requirements:
- Web frontend, MySQL database backend and daemons run on the same node. This could be easily overcome in a production scenario by configuring database clients to use TCP sockets instead of UNIX socket connections.
- ESC daemons run on the SGE master node. In a real environment this could be seen as inappropriate, as the SGE administrator would like to keep master nodes clean of additional software. To overcome this drawback, the daemons can run outside the SGE master node if that node is able to connect to SGE for issuing qstat commands. Access to SGE's accounting logs is also needed on this node, so that the daemons can keep track of SGE's latest activities. However, some further work must be done if the final goal is to run the ESC daemons outside of the front-end machine.
- There is no notion of isolated cluster queues with dedicated computing resources. Strategies for lowering power consumption involve the whole SGE batch queue system. This is probably difficult to accommodate in a data center where there exists some kind of grouping of resources, perhaps for physical reasons (particular hardware specs) or for logical ones (particular installed software). This kind of scenario is not currently supported by ESC.

Taking the aspects detailed above into account, the actual deployment of ESC made in the CETA-CIEMAT testing environment fits the following schema:

- Node A: web frontend and ESC daemons, having the SGE client command line and access to the SGE logs; it must have an Ethernet port in the same VLAN as the computing nodes to allow Ethernet magic packets.
- Node B: MySQL database backend.
- Node C: SGE master node, exporting (by NFS) the log directory.

4 User tests

In order to verify that a correct ESC deployment was made in CETA-CIEMAT's testing environment, the following test plan was established:

- Minimal functional tests, in order to make sure that all subsystems were reliably configured.
- Stress tests, to check that the system could withstand tough workload conditions arriving at the cluster without unexpected side effects.
- Performance tests, with the purpose of taking approximate measurements of power consumption and energy savings.

The testing environment was set up as follows:

- Web frontend / ESC daemons machine / MySQL database machine: all hosted in the same VMware virtual machine (1 GB RAM, 1 virtual processor), running CentOS 5.3.
- Subclusters A and B: 5 machines each, Bull Novascale (double Quad-Core, Intel 5230, 16GB RAM), running Scientific Linux 5.3, x86_64.
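The functional and stress tests described next were driven by simple submission loops. The sketch below is only illustrative of such a driver: the job names, sleep durations, burst sizes and pacing are assumptions for the example, not the exact values used in the tests.

import subprocess
import time

def submit_sleep_burst(n_jobs, sleep_seconds=30, pause=0.0):
    # Submit a burst of trivial SGE jobs that only sleep, to exercise ESC.
    # `qsub -b y` submits the `sleep` binary directly; resource requests
    # would be adapted to the local SGE configuration.
    for i in range(n_jobs):
        subprocess.run(
            ["qsub", "-b", "y", "-N", f"esc-test-{i}", "sleep", str(sleep_seconds)],
            check=True,
        )
        if pause:
            time.sleep(pause)

if __name__ == "__main__":
    # e.g. a burst of 50 short sequential jobs, one submission per second
    submit_sleep_burst(50, sleep_seconds=30, pause=1.0)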
46 46 IBERGRID The ESC database schema was also modified to include a table gathering instant power records. The Bull Novascale chassis allows the access to some sensors monitoring allocated electrical power (in Watts) for blade machines, through its BMC controller. The set of tests that were finally run, comprised the following: Minimal functional tests. Loop simulating arrival of sequential short jobs (30 seconds) with no processing (just sleeping). Loop simulating arrival of parallel jobs (requiring more than 1 slot) with no processing. These tests showed the system was functional and operational as expected. It was also observed that needed time to shut down and wake up nodes was significant, and the inertia of the start up process could lead nodes to be brought up when they were no longer needed. Stress tests. One of the Bull Novascale blade-chassis, mentioned above, was divided into two subclusters: one being monitored by ESC, and the other kept apart. The same kind of workload was applied to both clusters. Bursts of ultra-short jobs (1 s), CPU intensive (99%). Once more we observed the inertia effect, no additional nodes to the initial set was brought up in cluster A, as the latency times for start up and shut down were determined and configured. Bursts of short jobs (1 h), CPU intensive (99%), with a period of one hour between bursts. In this trial it was observed that up to three nodes were brought up when needed. Workload was not equally shared among three nodes, as one node was elected always as first option, and only if this node could not serve the full workload, then the second and third nodes were elected. Performance tests. Some performance data were gathered during stress tests, which were recorded and taken into account for simulation purposes, and for further comparative analysis. 5 Open questions on the integration of ESC within glite s middleware framework Given that many of computing datacenters that work with SGE batch queue servers have part or the whole of their resources devoted to Grid computation, and due to the ample use of glite middleware in grid computing within Western Europe, it is natural to think how an schema of reduction of energy usage could be included into glite environments. Its usefulness, difficulty of being applied, and impact on grid global services and statistics (availability, accounting, information systems,... ) must be also studied in detail. The question about the usefulness of ESC in the context of a glite grid environment is self-answered from a energy saving point of view. Most resources are
47 IBERGRID 47 part of grid-based projects and, consequently, significant electrical costs are due to the participation in those projects. Some reduction of electrical billings could be obtained by applying ESC in stand-alone clusters (those not related to grid), but in order to keep a lower energy consumption, Grid clusters usage of resources must also be optimized. An important issue that arises is whether ESC could seamlessly integrate within grid computing resource centers that belong to transnational e-infrastructure federations (like Spanish and Portuguese IBERGRID Grid Initiative). Technically, from the point of view of grid federations Information Systems, the answer is negative as some work must be made at the middleware level to keep information systems aware of energy saving solutions. Those having experience in dealing with glite will rapidly notice that, if ESC is implemented in a grid cluster, computing resources will appear to be extremely volatile in Information Systems; the number of available and total CPUs will be flapping from one number to another as nodes are activated or deactivated according to the system s load and ESC policies. Tools like lcg-infosites will show, for the same site, a different amount of CPU cores or slots depending on the time the query was made, and the status of CPU resources. Although, in glite Information Systems (that is in the underlying GLUE schema), there is no corresponding key value meaning CPU offline due to energy saving, ESC could still be implemented if the following concerns are assumed. Calculations of availability of sites (agreed CPUs or cores versus measured) and global resource usage during a period of time might not be accurate since grid information systems would not be aware of the avaliable number of CPU/cores as a dynamic value, rather than a static one declared when the cluster was configured. Workload Management Systems, which apply selection policies based on data provided by Information Systems, would tend to skip ESC-enabled sites, as from their point of view, they would not be offering enough resources to cope with punctual users requirements. As a consequence, we strongly believe that the design of current architecture of glite middleware and agents collecting information from the cluster and publishing into the Information Systems (known as Generic Information Providers, GIP), should be revised in order to consider more accurate status data. For instance, the GIP provider for a SGE should be rewritten to count the total CPU online resources and the dormant ones. Unfortunately, as for SGE v6.2 u2, there seems not to be a SGE status to reflect this situation. From the user s perspective it seems to be more polite to be informed whether their jobs will be submitted to slower resources (because the latency of bringing them up should not be neglected) or they will be submitted to regular resources (all online, they experience no such latency). However, this may not be such an important issue, as latencies are inherent to the architecture of grid systems. 6 Estimation of energy savings In order to have a estimation of the potential savings in case the tool is deployed in a production environment, a variety of simulations have been done using the power
consumption parameters of the nodes in CETA-CIEMAT. Figure 1 shows the power consumption evolution of a node in the CETA-CIEMAT infrastructure. The power consumption data have been obtained from the board management card of the node. From that information, a series of simulations has been run using the following system parameters:

- Number of nodes: 16.
- Cores per node: 8.
- Time for a shutdown: 480 s.
- Energy consumption during the shutdown process: Wh.
- Time for a power-on: 555 s.
- Energy consumption during a power-on process: Wh.
- Power consumption of the system in the WOL waiting state: 2 W.
- Power consumption of the system running without computational load: 150 W.
- Power consumption of the system running with computational load: 230 W.

Fig. 1. Power consumption of a node in the CETA-CIEMAT.

We have used a pair of synthetic workloads from the Parallel Workloads Archive:

- NASA: NASA Ames iPSC/860, a set of 42,264 jobs.
- OSC: OSC Linux Cluster, a workload composed of 80,714 jobs.

From our simulation we have obtained the information presented in Table 1, which only displays the best energy savings of our simulations.

Workload         | Time (days, hours, minutes, seconds) | Energy (MWh)
NASA without ESC | 92 d, 0 h, 3 m, 43 s                 | 6.72 MWh
NASA with ESC    | 92 d, 0 h, 12 m, 59 s                | 4.79 MWh
OSC without ESC  | 677 d, 2 h, 55 m, 51 s               | MWh
OSC with ESC     | 868 d, 20 h, 50 m, 39 s              | MWh

Table 1. Information on the simulation of the NASA and OSC workloads.
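The kind of estimate reported in Table 1 can be reproduced in simplified form from the parameters listed above. The sketch below is not the authors' simulator; it is a rough illustration that charges idle power, loaded power and power-cycling overheads against aggregated node-hours, under the stated (assumed-constant) power figures. The node-hour and cycle counts in the example call are arbitrary placeholders, not measured values.

def simulate_energy_kwh(busy_node_hours, idle_node_hours, power_cycles=0,
                        p_idle_w=150.0, p_load_w=230.0, p_wol_w=2.0,
                        off_node_hours=0.0,
                        e_shutdown_wh=0.0, e_poweron_wh=0.0):
    # Very simplified energy model for a power-on/off policy.
    # busy/idle/off_node_hours: aggregated node-hours spent in each state.
    # power_cycles: number of shutdown + power-on pairs performed.
    # Per-cycle energies default to 0.0 because the exact Wh figures are
    # not given in the text; plug in measured values where available.
    energy_wh = (busy_node_hours * p_load_w
                 + idle_node_hours * p_idle_w
                 + off_node_hours * p_wol_w
                 + power_cycles * (e_shutdown_wh + e_poweron_wh))
    return energy_wh / 1000.0

# Without ESC all non-busy time is idle; with ESC most of it is spent in
# the 2 W WOL state, at the cost of some power cycles and extra runtime.
baseline = simulate_energy_kwh(busy_node_hours=20000, idle_node_hours=15000)
with_esc = simulate_energy_kwh(busy_node_hours=20000, idle_node_hours=2000,
                               off_node_hours=13500, power_cycles=400)
print(baseline, with_esc)

A real estimate would be driven by the NASA or OSC job traces mentioned above rather than by fixed node-hour totals.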
The information in the table reveals that it is possible to obtain an important level of energy savings using the power-on/off policy implemented in ESC. Depending on the load, the throughput of the system can be lowered, as shown for the OSC load. In this case, while the energy consumption with ESC is 0.32 times the energy consumption without ESC, the time needed to process all the jobs is increased by a factor of about 1.28. On the other hand, with the NASA workload the time is increased by less than 0.01%, which means that there is no penalty due to ESC, while the energy consumption is only 0.71 times that without ESC.

7 Summary and Conclusions

In this paper, we have presented a description of the first experiences using an energy saving middleware for cluster environments. This middleware implements a power-on/power-off policy so that, at any moment, only the necessary computational resources are active, and those that are not needed remain powered off. It has a modular design which enables an easy integration with different queue systems, e.g., Sun Grid Engine, Portable Batch System/Torque or SLURM. The algorithm that takes the power-on/power-off decisions employs a number of configuration parameters, such as the lack of resources for a particular job, the average waiting time of the jobs, the number of enqueued jobs, the idle time of a node, etc. In addition, there are also options to select candidate nodes to be powered on, and strict levels which can produce considerable energy savings.

In order to evaluate the potential savings that the EnergySaving tool could yield in a production environment, we have also developed a simulation tool. This tool uses as inputs the power consumption parameters of the system and a description of the computational load. The result is data about how the energy saving module affects the productivity and performance of the system, together with the potential energy savings. We have also discussed the integration of the middleware with the gLite grid environment and the SGE queue system, which will be a must if the energy saving module is to be put into production in the Grid infrastructure.

Acknowledgments

This work was partially supported by project P1-1B of the Fundación Caixa-Castelló/Bancaixa, Universitat Jaume I of Castelló, CICYT TIN C04-01 and FEDER. CETA-CIEMAT acknowledges the support of the European Regional Development Fund (ERDF/FEDER).

References

1. U.S. Environmental Protection Agency, ENERGY STAR Program. Report to Congress on server and data center energy efficiency, public law. August
51 IBERGRID 51 Focusing on an integrated computing infrastructure: the IFCA experience Álvaro López García, Pablo Orviz Fernández, Ibán Cabrillo Bartolomé Advanced Computing and e-science Group Instituto de Física de Cantabria, CSIC - UC, Spain [email protected], [email protected], [email protected] Abstract. Bringing quality assurance to Information Technology (IT) clusters, through the improvement of the both human and automated practices related to service management and building solutions to strengthen and ease the interaction with the infrastructure, must be considered as a continuous work at IT organizations. IFCA datacenter has assumed this challenge more intensively over the last year starting from a redefinition of the procedures for service operation and the design and implementation of new customer and user-oriented services. 1 Introduction Moving towards a well-defined service management strategy focused on customer and user satisfaction within the computing infrastructure leads to a continual improvement and definition of best practices to be applied in service operation that sits underneath the user s perspective of an IT organization. By fulfilling the clear objective of this continuous work (namely, the customer and user needs) the IT infrastructure has the opportunity to evolve into a more mature implementation. Benefits to both the final users profiting from an integrated environment powerful but easy to work with and the administration staff granting integration, stability, reliability and availability in the services provided are obtained. This paper presents the steps made within the IFCA [8] Advanced Computing and e-science group [5] in order to consolidate its infrastructure; and how they naturally emanated from the users and customers 1 requirements either implicit or explicit. Some of these topics are in compliance with the IT Infrastructure Library (ITIL), but they are clearly outside the scope of the current work. 1 It should be noted that in the context of this work, a customer does not necessarily buy goods or services. This work focuses on the person or group defining a given Service/Operation Level Agreement (SLA/OLA) [12], as happens with the scientific collaborations.
2 Behind the scenes
Technical activities enclosed inside the IFCA datacenter include the physical environment related tasks for facilities management, as well as the logical part, concerned with the applications required to satisfy the IT services functionality. The following subsections describe the new set of enhancements implemented in the aforementioned areas. Such topics are covered from the lowest part of the implementation, usually transparent to final users, up to the more visible features.
2.1 Virtualization
Virtualization benefits have been widely discussed in the last few years, since it became a trending topic within IT. The IFCA datacenter was an early adopter of this technology, progressively virtualizing every node in the datacenter and eventually reaching a fully virtual infrastructure. By doing so, the management overhead of the different servers has been considerably diminished, with a positive impact on the overall stability and functionality. Two different virtualization pools were defined. One pool, aiming to satisfy a high level of availability, is dedicated to hosting local and NGI services; a second pool, hosting the batch system computing nodes, was designed with ease of management and high performance as the main goals. In both cases, Xen [13] is used as the hypervisor technology.
Service virtualization. Service functionality commitments are tightly related to the criticality and impact of each service within the infrastructure in which it is integrated. By formalizing them with the relevant customers through SLAs [11], datacenters are formally challenged to provide solutions that accomplish the agreed availability level. IFCA datacenter resources host both cluster services (covering the local computing farm and the IFCA-LCG2 and CMS Tier-2 Grid instances) and central services (required for NGI IBERGRID operations). Proper provisioning of these machines is guaranteed through the use of a High-Availability (HA) solution [9], satisfying the duties assumed in the agreements and/or contracts with the relevant scientific communities.
Computing Nodes Virtualization. Computing farm virtualization has well-known benefits at the system management level, which lead to an enhancement of infrastructure stability and usability, directly related to end-user profitability. Moreover, several studies have stated that virtualization does not introduce a negative performance impact [1, 14, 4]. Here HA concerns are no longer a must, since failures in these components compromise neither the infrastructure nor the service. Administrative strategies also differ when compared with the dedicated services discussed in Section 2.1, as it is now required to deal with multiple cluster instances with similar assignments, in the form of execution nodes or worker nodes (WN) using Grid terminology.
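For illustration only, the day-to-day handling of such virtualized pools happens at the hypervisor level; a minimal sketch of listing which guests are active on the Xen hosts of each pool, using the libvirt Python bindings, could look as follows (the pool and host names and the connection URI are assumptions, not IFCA's actual tooling):

    # Illustrative only: list the active Xen guests on each pool member with
    # the libvirt Python bindings. Host names and URI are hypothetical.
    import libvirt

    SERVICE_POOL = ["xen-svc01", "xen-svc02"]   # HA pool for local/NGI services (names assumed)
    COMPUTE_POOL = ["xen-wn01", "xen-wn02"]     # pool hosting virtual worker nodes (names assumed)

    def running_guests(host):
        """Return the names of the active domUs on a remote Xen dom0."""
        conn = libvirt.openReadOnly("xen+ssh://%s/" % host)
        try:
            return [dom.name() for dom in conn.listAllDomains() if dom.isActive()]
        finally:
            conn.close()

    if __name__ == "__main__":
        for pool, hosts in (("service", SERVICE_POOL), ("compute", COMPUTE_POOL)):
            for host in hosts:
                print(pool, host, running_guests(host))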
Continuous effort on improving monotonous and repetitive daily administration techniques and on dom0/domU fine-tuning (so as to improve performance) has led to an active refinement of the IFCA computing cluster over the last year.
Synchronization mechanism. Maintenance tasks can be carried out with policy- or promise-based frameworks that provide an automated update and preservation of the system's key properties. By preventing deviations from the system's expected functionality, these tools have a positive influence on cluster performance. However, now and then, system administration also involves other tasks that, unfortunately, need human-supervised realization, which could end up being tedious work if applied to each and every one of the computing nodes in a production environment. The obvious procedure would consist of a manual modification or installation of the given piece of software on a single server and its further distribution to the rest of the nodes in the cluster. Several approaches fit here, and one of them is prototype-based image propagation. The design of this strategy relies on the virtualization of the whole WN farm. In such a scenario, the remote prototype image content is synchronized with the local domU installation whenever a new revision is committed on the former device. The definition of a single point of modification and testing, through the concept of a prototype, provides stability, since changes are only propagated to production nodes once it has been verified that they cause no malfunction; the prototype replication, in turn, ensures that the computing environment is exactly the same for every node in the cluster. The application of the new revision on a given WN is then handled by a configuration management engine running on the physical host. This solution uses an iSCSI target as the storage backend for the prototype and CFengine [3] for the update application logic. The action plan follows the schema ruled by the flow chart in Figure 1: CFengine, either automatically or when triggered by an operator, runs on the dom0 and compares the prototype and running versions. If the node is outdated, the dom0-hosted WN is set to draining in the batch system. Once it has been properly drained, the WN is halted and the prototype's iSCSI target is synchronized with the local image. When the process has finished, the WN is started again and submission to it is re-enabled. A minimal illustrative sketch of this loop is given below.
Fig. 1. Computing node upgrade process flow chart.
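The sketch below restates the loop of Figure 1 in plain Python. It is illustrative only and is not IFCA's CFengine policy: the revision file, the sync-prototype-to-node helper and the use of SGE's qmod/qstat and Xen's xm are placeholders standing in for the site's real tooling.

    # Illustrative restatement of the drain-and-upgrade loop of Figure 1.
    # NOT IFCA's actual CFengine policy; commands and file names are placeholders.
    import subprocess
    import time

    def run(cmd):
        """Run a shell command and return its standard output."""
        return subprocess.run(cmd, shell=True, check=True,
                              capture_output=True, text=True).stdout

    def node_is_empty(wn):
        # Placeholder check: count running jobs whose queue instance is on this node.
        out = run("qstat -s r | grep -c '@%s ' || true" % wn)
        return int(out.strip() or 0) == 0

    def upgrade_if_outdated(wn, prototype_rev):
        deployed_rev = run("ssh %s cat /etc/prototype-revision" % wn).strip()
        if deployed_rev == prototype_rev:
            return                                    # version up to date: stop
        run("qmod -d '*@%s'" % wn)                    # drain: disable queues on the node
        while not node_is_empty(wn):
            time.sleep(60)                            # wait for the node to drain
        run("ssh %s shutdown -h now" % wn)            # halt the domU
        run("sync-prototype-to-node %s" % wn)         # sync the iSCSI prototype image (placeholder)
        run("xm create /etc/xen/%s.cfg" % wn)         # boot the upgraded worker node
        run("qmod -e '*@%s'" % wn)                    # re-enable submission to it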
Resource utilization. The computing nodes were virtualized so that they take up the physical machine's resources entirely. In order to maximize the available resources, the memory footprint of the host system was reduced to 200MB. All the CPUs are pinned to the single virtual machine that is executed on each dom0. These nodes are configured with an InfiniBand interconnect, devoted to the communications of the parallel (namely, MPI) jobs that will be executed. To ensure that proper performance is achieved, thus eliminating possible overheads in this kind of communication, the InfiniBand card is attached directly to the virtual computing node, which therefore has full control over it.
2.2 Resource balancing
A computing centre giving resources and support to research communities will surely face a constantly growing number of users. The login nodes (also known as bastion hosts, or User Interfaces, UIs, in Grid terminology) that give access to a larger infrastructure can turn into a bottleneck if they have to handle a high number of sessions. In such cases, it is necessary to introduce a load balancing method in the system (as other common IT services, such as web servers, already do), distributing the load and the connections among all the nodes available at a given moment. IFCA has deployed a self-contained and self-monitored load balancing infrastructure based on the Linux Virtual Server [10] project for its login machines. The users that connect to these nodes using Secure Shell (SSH) are balanced transparently: they do not know the particular machine on which they will land, nor the number of available machines, since it varies over time.
2.3 Distributed storage
The clustered storage solution deployed at the IFCA datacenter uses IBM's General Parallel File System (GPFS) as the global distributed filesystem. The filesystem is therefore shared among the nodes hosted in the infrastructure; this way, concurrent file access is guaranteed, with satisfactory I/O rates for applications running on multiple nodes.
Fig. 2. Distributed storage implementation.
The current hardware, network links and GPFS servers schema is displayed in Figure 2. For security and reliability reasons, the GPFS installation uses a private LAN for the clients to access its data in a multi-redundant connection setup. Since intensive I/O dramatically affects job performance, a lot of effort has been focused on its optimization [2] to reach high-speed file access rates. The achieved improvements paved the way for the creation of shared storage areas, each one in a separate filesystem, making local access possible to users within the infrastructure.
2.4 Authentication and Authorization
Since IFCA users no longer access a single machine directly, and they profit from distributed and shared storage, it is necessary to introduce proper authentication (AuthN) and authorization (AuthZ) methods, ensuring that a given user is mapped to a unique User ID (UID) across the whole cluster. The most basic approach is to distribute the credential files (i.e. /etc/{passwd,shadow,group}) to all the machines the user can log into. This, however, introduces several penalties and extra work: among others, the need to synchronize the files whenever a user or group is added, modified or removed from the system (any outdated file will cause trouble), and the fact that the user is requested to change the password manually on every machine where access has been granted, thus reducing the transparency of the system. This setup is obviously error-prone for systems with more than a few machines.
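To make the fragility of this file-distribution approach concrete, it essentially amounts to a script like the illustrative sketch below (host names are hypothetical, and this is precisely the pattern the text argues against):

    # Illustrative sketch of the naive credential-distribution approach
    # discussed above, NOT what IFCA deploys. Host names are hypothetical.
    import subprocess

    NODES = ["wn%03d" % i for i in range(1, 151)]         # hypothetical node list
    FILES = ["/etc/passwd", "/etc/shadow", "/etc/group"]

    failures = []
    for node in NODES:
        for path in FILES:
            # One scp per file and per node: any transient error leaves that
            # node with an outdated copy, which is exactly the inconsistency
            # problem described in the text.
            result = subprocess.run(["scp", "-q", path, "root@%s:%s" % (node, path)])
            if result.returncode != 0:
                failures.append((node, path))

    print("%d copies failed; those nodes now hold stale credentials" % len(failures))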
A more powerful and robust approach is to introduce a central authentication and authorization service. By doing so, the credential distribution is no longer necessary, since any component of the cluster willing to access the credential database no longer queries its local files but the authn/authz server instead. This eliminates the problems mentioned above, since the system administrators manage the users, groups and their related ACLs in one single point, without the need to distribute anything to the servers. Also, users can manage their accounts without bothering to spread their changes to all the machines they have access to. In the case of the IFCA computing facilities, this setup consists of two authn/authz servers working redundantly: one in read-write mode and a second one in read-only mode that takes over when the first server is not available. These two servers are executed on top of the system described in Section 2.1, so a high level of redundancy and availability is ensured.
2.5 Batch system fine-tuning
Unifying the computing resources (namely, the worker nodes) under a single farm gives the advantage that the total number of slots available for either locally submitted or Grid jobs is bigger than it would otherwise be. However, if proper scheduling mechanisms and algorithms are not enforced, the potential benefit would be overshadowed by long waiting times and even job starvation. The batch system deployed at the IFCA datacenter is the open source version of the former Sun Grid Engine, now Oracle Grid Engine [6].
Secure X11 forwarding. One popular demand among scientists is graphical (i.e., X11-based) programs, which are normally used for the final stages of their work; that is, data visualization and data mining, neural networks, statistical fits, and so on. Nevertheless, having this kind of program executed directly on the login nodes is not a solution, since they are CPU-intensive and might result in an overloaded system. The ideal scenario is to execute them on the computing farm using an interactive, high-priority session, then forwarding the X window display to the login machine (from which it is eventually redirected to the user's workstation). Using the X11 protocol directly, in which traffic is exchanged without any kind of encryption between the worker nodes and the login machines, is not an option in a shared environment, since privacy would be highly compromised (a simple sniffer is enough to intercept the traffic). Therefore, a method to secure the X11 protocol is needed, the most common being to forward the traffic through an SSH tunnel. However, the nodes have to be configured in such a way that no user can access them outside the GE scheme. This is achieved by:
1. disabling plain user access in the SSH configuration (using the AllowUsers and DenyUsers directives);
2. creating a special SSH configuration file /etc/ssh/sshd_config.execd without that restriction;
3. configuring Grid Engine to invoke the proper commands for qlogin, rlogin and rsh:
     qlogin_command   /gpfs/utils/gridengine/util/login_wrapper
     qlogin_daemon    /usr/sbin/sshd -i -f /etc/ssh/sshd_config.execd
     rlogin_command   /gpfs/utils/gridengine/util/login_wrapper
     rlogin_daemon    /usr/sbin/sshd -i -f /etc/ssh/sshd_config.execd
     rsh_command      /gpfs/utils/gridengine/util/login_wrapper
     rsh_daemon       /usr/sbin/sshd -i -f /etc/ssh/sshd_config.execd
4. enabling the SSH and GE tight integration by either patching OpenSSH and recompiling Grid Engine, or using the PAM module available in [7].
The steps detailed above assure that no user is able to access the node directly using SSH under any circumstances, but only through the proper submission of an interactive job to the batch system, hence assuring a correct partitioning and fair usage of the resources (i.e., avoiding overcommitment of CPU and/or memory). Also, thanks to the tight integration, the job's actual resource utilization is properly reported in the accounting records.
3 Reaping the harvest
The service operation and design enhancements, both for quality and for customer orientation, seen in the last section resulted in the creation of a robust and integrated workspace for the user. Transparency is highlighted here, since most of the underlying technical measures applied do not affect the daily work of the users of IFCA's computing infrastructure. This section outlines how those amendments evolved into the current status of the infrastructure from a user's standpoint.
3.1 Single Sign On
Access to IFCA's cluster is exclusively granted through a single SSH entry point that routes the user to the least loaded login machine available at a given time. These machines are not meant for heavy computation (that purpose is covered by the computing nodes, see Section 3.3) but for common tasks in a user's workflow, such as editing files, job submission and description, data visualization and analysis, etc., limiting the user's process consumption so as not to affect others' work. Customer-oriented services commonly require authentication in order to profit from their functionality. A single login to gain access to every application available across the infrastructure is now guaranteed at the IFCA cluster through a centralized authentication implementation. Consolidating infrastructure users around a single set of credentials reduces the fatigue of maintaining several username and password combinations, thus considerably lowering the security risks. It also must be noted that the account creation and registration overheads are eliminated, since access control becomes a smooth step transparent to the user.
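The paper does not state which directory technology backs the central authn/authz service; assuming an LDAP-style server purely for illustration, resolving a user to the single cluster-wide UID boils down to a lookup like the sketch below (server URI, base DN and attribute names are assumptions):

    # Illustrative LDAP lookup against a central authn/authz service.
    # The directory technology, URI, base DN and schema are assumptions.
    import ldap   # python-ldap

    LDAP_URI = "ldap://auth.example.org"        # hypothetical service endpoint
    BASE_DN = "ou=People,dc=example,dc=org"     # hypothetical directory layout

    def uid_number(username):
        """Resolve a username to the unique UID used across the whole cluster."""
        conn = ldap.initialize(LDAP_URI)
        conn.simple_bind_s()                    # anonymous bind, for illustration only
        try:
            results = conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE,
                                    "(uid=%s)" % username, ["uidNumber"])
        finally:
            conn.unbind_s()
        if not results:
            raise KeyError(username)
        _dn, attrs = results[0]
        return int(attrs["uidNumber"][0])

    print(uid_number("alice"))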
3.2 Storage resources
When a new project instance is to be supported within IFCA's infrastructure, SLAs are carefully managed and fulfilled. Based on the contract's requirements, storage capacity is limited both for a common area devoted to the project's collaborative data and for the individual space of the project's defined users. Based on the setup described in Section 2.3, user communities have direct POSIX access at their disposal throughout the infrastructure, independently of where they are working, including both login and computing resources.
3.3 Computing resources
Local Resource Management Systems (LRMS) represent the front door for accessing computing resources. Making the most of the computing capabilities, and avoiding crashes due to unhandled, out-of-LRMS-control node access, gives this service the main role in an efficient and safe exploitation of the resources. However, successful submissions rely mostly on the users' good knowledge of the tool they are working with, as well as of the requirements their applications need and the performance expectations. Selecting the most suitable form of submission (batch, text or X11 interactive sessions), identifying complex application resource limits and, whenever possible, profiting from an InfiniBand-networked parallel computing environment results in the best-fit accommodation of users' jobs within the infrastructure.
3.4 Additional services
Even if not strictly related to the computing facilities, some other applications are integrated within the infrastructure. Services such as a documentation portal, a helpdesk ticket tracking system, infrastructure status monitoring, plotted/visual accounting data displays or software revision/version control systems give users a helpful channel of both information and support provisioning. Despite increasing the learning curve of a successful interaction with the overall infrastructure, experience shows that, once users become familiar with them, they notice a reduction in the time spent reporting issues or even solving them, since they take advantage of the knowledge base generated, achieve a better cluster usage as a result of the status and consumption statistics feeds, and, mainly, profit from a proper documentation of the advanced scheduler features to make the most of their applications.
4 Conclusions
Seeking quality and integration in the IT services at the IFCA datacenter has provided benefits at the infrastructure's external interaction and management levels. Motivations based on customers' requirements in the form of SLAs, service administration needs and infrastructure consistency, together with ease of utilization, raised the need for a redefinition of the service operation in three main stages (see Figure 3):
- Revisions of the procedures for maintenance, configuration and deployment mechanisms have promoted automation to cope with repetitive and inefficient system operation tasks, leading to predictability in the system's performance and leaving under human control the risky and delicate practices that often need decision making.
- Strong constraints on service functionality have resulted in the deployment of new applications and systems to provide reliability, concerning the use of HA, monitoring or configuration technologies.
- Sketching user experiences, and with a clear picture of an integrated workspace in mind, some services have been implemented to add value to the users' daily work.
All of these goals have been accomplished following an integrate-with-what-is-already-deployed philosophy that permits a smooth transition of deployed services within the operational environment, as well as avoiding non-integrated designs that can end up producing fragmentation rather than consolidation.
Fig. 3. Outlook of the main achievements at IFCA Datacenter.
Acknowledgments
The authors would like to acknowledge the financial support from the Spanish National Research Council (CSIC) by means of the GRID-CSIC project.
References
1. Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In Proceedings of the nineteenth ACM Symposium on Operating Systems Principles, SOSP '03, New York, NY, USA. ACM.
2. Iban Cabrillo Bartolome and Ana Y. Rodriguez Marrero. Storage System Optimization, Improving CPU Efficiency in I/O Bounded Jobs. In 4th IBERGRID PROCEEDINGS. NETBIBLO.
3. CFengine. CFengine web site.
4. Ludmila Cherkasova and Rob Gardner. Measuring CPU overhead for I/O processing in the Xen virtual machine monitor. In USENIX 2005 Annual Technical Conference.
5. IFCA Advanced Computing and e-Science Group. Web site.
6. GridEngine.org. Open source grid engine community.
7. Andreas Haupt. SSH tight integration for Grid Engine (source RPM package). ahaupt/downloads/sge-sshd-control src.rpm.
8. IFCA. IFCA web site.
9. Alvaro Lopez Garcia and Pablo Orviz Fernandez. Virtualization and network mirroring to deliver High Availability to Grid services. In 4th IBERGRID PROCEEDINGS. NETBIBLO.
10. Linux Virtual Server. LVS web site.
11. ITU-T Study Group 12. ITU-T E-800: Definitions of terms related to quality of service. ITU Recommendations, September.
12. Sharon Taylor, David Cannon, and David Wheeldon. ITIL Service Operation.
13. Xen. Xen web site.
14. L. Youseff, R. Wolski, B. Gorda, and C. Krintz. Evaluating the performance impact of Xen on MPI and process execution for HPC systems. In Virtualization Technology in Distributed Computing, VTDC 2006, 2006.
61 IBERGRID 61 Orchestrating Services on a Public and Private Cloud Federation Jorge Ejarque 1, Javier Álvarez1, Henar Muñoz 3, Raül Sirvent 1 and Rosa M. Badia 1,2 1 Grid Computing and Clusters Group - Barcelona Supercomputing Center (BSC-CNS) 2 Artificial Intelligence Research Institute (IIIA) - Spanish National Research Council (CSIC) 3 Telefónica Research and Development (TID) {jorge.ejarque,javier.alvarez,raul.sirvent,rosa.m.badia}@bsc.es, [email protected] Abstract. During the last years, the Cloud Computing technology has emerged as a new way to obtain computing resources on demand in a very dynamic fashion and only paying for what you consume. Nowadays, there are several hosting providers which follow this approach, offering resources with different capabilities, prices and SLAs. Therefore, depending on the users preferences and the application requirements, a resource provider can fit better with them than another one. In this paper, we present an architecture for federating clouds, aggregating resources from different providers, deciding which resources and providers are the best for the users interests, and coordinating the application deployment in the selected resources giving to the user the impression that it is using a single cloud. 1 Introduction Cloud Computing [1] has caused a big impact in the way a traditional data center was managed: whenever users wanted a machine to compute some of their processes, they had to negotiate in person for a good price with a data center in order to use the machines. When this contract was established, users could then deploy their services in the data center. Nowadays, the dynamism of the cloud enables the users to make this negotiation on-line, checking the price of the different infrastructure providers available, and selecting the most appropriated one for their needs. Once selected, virtualization technologies enable an easy deployment of the software that users want to run in the cloud. Many IaaS providers exist nowadays since the appearance of Amazon EC2 [2], which was one of the first providers to adopt the utility computing paradigm. There are some commercial products such as Flexiscale [3] or ElasticHosts [4], and some other cloud middleware coming from the academia side, such as Eucalyptus [5], OpenNebula [6] or EmotiveCloud [7]. One of the problem that arises from such a big set of infrastructure offers in order to deploy a service is the lack of interoperability between the different platforms: users must decide in advance to which provider they are willing to deploy their services, and thus use its API in order to do so. This has the clear
62 62 IBERGRID limitation that only a single infrastructure provider can be used in a deployment. Although some initiatives are gaining importance in order to solve this problem (i.e. OCCI [8]), many infrastructure providers are reluctant to change their defined APIs. Moreover, it is not only a problem about having different interfaces in the different providers, since they also use different data models to represent their resources, making the interoperability even more difficult. From what has been explained, it is clearly seen that an intermediate layer is needed in order to shield users from such a variety of cloud offers, and not only for this, but also for aggregating the resources of the different providers in a single entity. This is what is known as cloud federation, the ability of interacting with all these different technologies in order to deploy a service. The federation of clouds is a hot topic nowadays, and is one of the objectives of the NUBA project. The NUBA project [9], is a spanish funded R&D project whose main objective is to develop a framework which makes the deployment of business services in an automatic and easy way, allowing them to dynamically scale taking into account performance and BLOs. More specifically, in the NUBA architecture there is a layer in charge of the federation of the underlying cloud infrastructures, which is able to handle the full service deployment lifecycle among different private and public cloud providers. This layer is the main work described in this paper. NUBA also provides a low-level infrastructure to build a private cloud, and a upper layer which facilitates the service creation. The rest of the paper is structured as follows: Section 2 presents the state of the art regarding federation in cloud infrastructures. Section 3 describes the designed architecture in order to achieve federation of private and public clouds, analyzing each of the components of the architecture, whose implementation is described in Section 4. Section 5 draws conclusions of this paper, and shows future directions of this research. 2 Related Work Cloud computing focuses on delivering infrastructure (IaaS), platform (PaaS), and software (SaaS) as services. In the particular case of IaaS, we also distinguish between private Clouds (a Cloud made with the private machines of a company) and public Clouds (those offering computing resources to the general public i.e. Amazon EC2). In this paper we focus on the federation of both public and private Clouds. There exist some projects already tackling the federation of clouds, being RESERVOIR [10] one of the first European projects focusing on this topic. In RESERVOIR, the federation deals with problems such as having different providers, each of them with its own API. Besides, the federation model follows an outsourcing pattern between Infrastructure Providers (IPs): when an IP runs out of computing capacity, it outsources this demand to another IP which whom it already has an agreement to do so. RESERVOIR does not have an aggregator figure for integrating different infrastructures, such as the Federated Cloud Orchestrator in our paper, and the outsourcing cannot be done to public clouds, as we do.
63 IBERGRID 63 Another important project dealing with cloud federation is OPTIMIS [13]. The OPTIMIS goal is to create a toolkit for scalable and dependable service platforms and architectures that enable flexible and dynamic provisioning of cloud services. The innovations behind the toolkit are aimed at optimizing the whole service life cycle, including service construction, deployment, and operation, on a basis of aspects such as trust, risk, eco-efficiency and cost. Despite the similarities with our work, OPTIMIS is at its early stages and no toolkit is available by now. In InterCloud [14], cloud federation is achieved by means of three main components: a Cloud Broker (acting on behalf of users), a Cloud Coordinator (handling the private cloud), and a Cloud Exchange (aggregating the information to match user requests with service providers). In their simulation experiments, a Cloud Coordinator decides if a private cloud has capacity to attend a request. If this is not the case, the service is outsourced to Amazon EC2. Our policy is to aggregate the information of the private and public clouds available, and decide what is the best distribution of the service considering this aggregated information. Thus, public clouds are included into the equation from the beginning, and not only in case of lack of resources (which is also known as cloud bursting). The fact that InterCloud is in a very preliminary stage of design is proved because nothing is mentioned in their paper about how to deal with the integration of data coming from different data models. In addition, in terms of API interoperability between clouds, there have been several proposals for standardization, such as vcloud [11], OCCI [8] and TCloud [12]. In the NUBA architecture, both OCCI and TCloud are considered. Our work also contributes in the data conversion between different data models from different providers by using semantics. Previous work such as [15], [16] and [17] has focused on the automatic semantic annotation of XML and XML schemes to RDF and OWL. In our translation, we use these results for creating ontologies from the providers schema. Moreover, we complement this work providing a set of translation rules, which translate the concepts from one generated ontology to another. 3 Architecture The main goal of the federation layer is to coordinate and maintain a cloud infrastructure composed by several public and private clouds and offered to users in a uniformed way as if they were using a single cloud. Our proposal consists on a set of components which interact to each other in order to deploy the users services in the different managed cloud providers. Figure 1 depicts the overall architecture showing the components of the system. The Federated Cloud Orchestrator (FCO) is the main component of the architecture. It is the entry point of the users providing an abstraction layer between them and cloud providers. It also coordinates the actions required to deploy a service. The FCO is supported by the Deployment and Optimization (D&O) Manager, which decides what is the best provider for each service VM. All the required information to manage the cloud federation is stored in the Common Database
64 64 IBERGRID Fig. 1. Federated Cloud Architecture (CDB). The D&O Manager queries this component to know which providers are available, where services have been deployed, etc. Regarding the interaction with cloud providers, it is done by means of the Open Cloud Computing Interface (OCCI) because most of cloud middleware for setting private clouds (OpenNebula, EmotiveCloud, etc) already offer an OCCI implementation. However, public clouds offer their own interfaces for interacting with them and their own schemes for describing data. For these reasons, two additional components have been added to the architecture: the Resource Mapper (RM), which is used to translate data between the different cloud providers models and the Interoperability component, which serves as a bridge between the FCO and the different infrastructure providers interfaces. The following sections provide a detailed description of each one of the architecture components. 3.1 Federated Data Model The different components of the federation layer store and exchange messages containing information about the services, resources, providers, etc. A common data model has been defined for facilitating the communication between the components and having a common understanding of each of the parts of the federation. The Federated Data Model (FDM) is divided in three layers as it can be seen in Figure 2. The lowest one uses the Common Information Model (CIM) [19] for describing different types of computational resources and the management services offered by the cloud providers. The middle layer is in charge of describing the service deployment by means of the Open Virtualization Format (OVF) [20]. OVF models a service as a set of sections which describe the computational resources (Virtual Systems), shared disks and interconnection networks required by a service. It uses CIM concepts for describing these computational requirements facilitating the matching between the service requirements and the resource capabilities. The upper layer provides the concepts for the abstraction layer offered by the FCO
65 IBERGRID 65 Fig. 2. Federated Data Model which are defined by the TCloud[12] interface. With this interface, the user can define a Virtual Data Center(VDC) to describe a virtual infrastructure, Virtual Appliances (VApp) which are software system used to provide services, and Virtual Networks for interconnecting the different VApps. The TCloud data model is also related with the lower parts parts of the figure. It uses OVF and CIM terms for describing the VApps and Networks. 3.2 Federated Cloud Orchestrator The function of the Federated Cloud Orchestrator (FCO) is to provide an abstraction layer to allow service deployment management in the different cloud infrastructures. The FCO is the core of the architecture, as it is connected with the other components and coordinates them in order to achieve the orchestration of service deployments. Figure 3 depicts these relationships as well as the internal design of the component. The FCO has TCloud on its top, which defines a REST interface to build a virtual structure and create a service environment without the need of keeping in mind the real infrastructure underneath. Furthermore, within that virtual structure, service providers are able to specify the connectivity between the VMs they deploy, which is maintained by the FCO regardless of the cloud providers where the VMs are finally located. Regarding service management, the FCO provides the functionalities to deploy, undeploy and redeploy a service orchestrating all the required actions to guarantee the correct deployment and undeployment of VMs in the different providers (deployment order, connection conditions, fault tolerance, etc). When a new service descriptor is received, the FCO obtains from the D&O the best Infrastructure Providers which can host each service VM. Afterwards, the FCO selects the most appropriate providers to guarantee the connexion conditions, deciding which VMs can share the same private network and thus be deployed in the same provider, or if a Virtual Private Network (VPN) between different providers is needed. Once the providers have been selected, the FCO creates each of the mentioned VMs through the Interoperability component following the order specified in the whole
66 IBERGRID service description. Moreover, it has to guarantee that all the VMs are deployed correctly or none of them is. If some of the VMs of a service have been already deployed when the FCO realizes that one of the conditions above cannot be satisfied, it undeploys all of them returning to the initial state of the process and throwing an error. Besides, the FCO uses the CDB to store all the important information related to the managed services. In the case that an undeployment request is submitted to the FCO, it checks in which provider the VMs of the service have been deployed and then ensures their correct removal through the Interoperability component. If one or more VM undeployments cannot be completed after several tries, the FCO informs the user, which can try the undeployment again in the future. After the process has been completed, the FCO updates the information stored in the CDB. Finally, a redeployment occurs when the FCO receives an optimization request over an already deployed service for deploying a new VM or undeploying existing VMs. When this happens, the FCO must search for the providers which can host this new VM taking into account the connectivity issues as it is performed in the deployment case. Furthermore, the FCO guarantees that these actions are accomplished or none of them is. 3.3 Common Database The Common Database (CDB) serves as a storage system for all the components of the architecture. It provides tools to manage the storage of service descriptions as well as other information related to the infrastructures available in the federation. The FCO uses the CDB to store TCloud entities (e.g. VDCs) and other data required to manage services. On the other hand, the RM uses the CDB to store its translation rules and data models. Finally, the D&O uses the CDB information about the available infrastructure providers and their resources. Thus, it is Fig. 3. Federated Cloud Orchestrator s Design
67 IBERGRID 67 Fig. 4. Deployment and Optimization Manager Design important that the CDB provides an efficient XML management system, because all the information that the other components need to store is represented either in XML or plain text. 3.4 Deployment and Optimization Manager All the business logic in the federation layer is carried out by the Deployment and Optimization (D&O) Manager. The D&O Manager is in charge of taking placement decisions, so that, selecting the most suitable Cloud providers according to Service Providers (SP) and Infrastructure Providers (IP) policies and technical constrains. These decisions are taken when a new VM has to be deployed in the federated cloud (a deployment requested by the FCO) or proactively when it detects a situation where the current deployment of a VM can be improved. Figure 4 shows the high level design of the component. The central module is a rule engine, which is in charge of evaluating a set of rules (from the knowledge base) over a facts base. The rule evaluation infers what is the most suitable deployment and invokes optimization actions when optimization facts are found. The Fact s base is periodically synchronized with the CDB by means of a sensor which queries it to obtain new available providers and the resources provided by them. The knowledge base (composed by a set of rules) is provided by the entities involved in the federation to model the enterprise, user o service policies. These rules have been classified in three types according to their role in the federation. On one hand, the SP rules model the SP preferences on the resource selection such as the location, energy efficiency or preferred and forbidden providers (black list). On the other hand, there are IP rules, which model the IP preferences for selecting their customers. Finally, the federation rules model the common federation policies, such as rules for discarding resources and providers which do not fulfill the VM requirements
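The D&O Manager itself relies on a production rule engine (Drools, as described later in Section 4.3). The sketch below is not Drools; it is a minimal Python illustration of the same selection idea, in which federation rules discard providers that cannot satisfy the VM requirements and SP preference rules score the survivors (provider names, prices and weights are invented for the example):

    # Minimal, illustrative restatement of the D&O placement logic in plain
    # Python. The real component uses a Drools rule base; data and weights
    # here are invented.
    from dataclasses import dataclass

    @dataclass
    class Provider:
        name: str
        free_cpus: int
        free_memory_gb: int
        location: str
        price_per_hour: float

    def eligible(provider, vm_req):
        # Federation rule: discard providers that cannot fulfil the VM requirements.
        return (provider.free_cpus >= vm_req["cpus"]
                and provider.free_memory_gb >= vm_req["memory_gb"])

    def score(provider, sp_prefs):
        # SP rules: black-listed providers are excluded; preferred locations and
        # lower prices raise the score.
        if provider.name in sp_prefs.get("blacklist", []):
            return float("-inf")
        s = -provider.price_per_hour
        if provider.location in sp_prefs.get("preferred_locations", []):
            s += 1.0
        return s

    def select_provider(providers, vm_req, sp_prefs):
        candidates = [p for p in providers if eligible(p, vm_req)]
        return max(candidates, key=lambda p: score(p, sp_prefs)) if candidates else None

    providers = [Provider("private-cloud", 16, 64, "ES", 0.0),
                 Provider("amazon-ec2", 1000, 4000, "EU", 0.085),
                 Provider("flexiscale", 200, 800, "UK", 0.09)]
    print(select_provider(providers, {"cpus": 4, "memory_gb": 8},
                          {"preferred_locations": ["ES"], "blacklist": []}))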
68 68 IBERGRID Fig. 5. Resource Mapper Design 3.5 Resource Mapper The Resource Mapper (RM) is in charge of performing a semantic translation between the models used by the different cloud providers and the FDM. The semantic translation methodology used by the RM consist on applying a set of mapping rules over semantic annotated descriptions. Figure 5 shows an overview of this methodology. From the different cloud providers schemes, the RM creates an ontology where each concept used by the provider is mapped to an ontology class. Additionally, some semantic mappings are included in the RM in order to model the equivalences between the different cloud providers concepts and the FDM. Once the ontology and the semantic mappings have been defined, the RM translates descriptions from these providers to the FDM and vice versa. When a description translation is requested, the RM automatically annotates it according to the generated provider s ontology. The annotated description together with the mapping rules are loaded into a rule engine, which evaluates the mapping rules over the annotated descriptions. As a result of this evaluation, the rule engine infers an equivalent description following the federated model. The same process is used for the reverse case (from FDM to provider s schema). The description in the federated format is automatically annotated according to the FDM concepts and the rule engine applies the reverse mapping rules. In both cases, once the translated description has been obtained, it is serialized in the corresponding format (XML in the case of the FDM and XML or JSON object for the providers schemes) 3.6 Interoperability Public providers offer resources to their users by means of Web Service interfaces. Unfortunately, there is not still a clear standard for these interfaces and currently, each provider implements their own interface. For this reason, there is a need to unify them in a common interface which is used by the FCO to deploy the service VMs. The Interoperability component tries to cover this need. It transforms the OCCI methods for managing VM instances, storage and network resources into the equivalent calls to the providers methods. To achieve this functionality, the
69 IBERGRID 69 Fig. 6. Interoperability Component Design Interoperability component is supported by the RM for translating the input and output data between the FDM and the providers schemes and by a set of plugins which implement the equivalent execution of the OCCI methods using the providers interfaces. Figure 6 gives a detailed view of the Interoperability component. When the FCO invokes an OCCI method of the Interoperability component, the OCCI server extracts the input data, the invoked method and the requested provider. Then, the input data is sent to the RM, which translates it into the provider s schema. Once the input data has been translated, it is used to execute the corresponding provider s plug-in, which executes the required method using the provider s interface. Finally the output data is translated into the FDM using again the RM. 4 Implementation This section describes the implementation of the architecture, which has been done in Java. Every component has been developed as a RESTful Web Service to keep uniformity in all the interactions. Their interfaces have been implemented using Jersey [21], which has been also used to develop clients for the different components to facilitate the interaction between them. 4.1 Federated Cloud Orchestrator The FCO has been implemented following the TCloud interface specification. TCloud was chosen over OCCI or a custom made interface because the former was considered to be too low-level oriented and the latter would have moved the component further away from standards. The FCO has to deal with two kinds of XML data: TCloud entities and service descriptors (OVF). To make easier the management of the former, the TCloud XML schema has been bound to a set of Java classes using JAXB [22].On the other hand, to deal with OVF files, the FCO makes use of an OVF manager API [27], which is based in JAXB and provides methods to split service descriptors into groups of VMs within the same private network or to easily extract service information.
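From the service provider's point of view, interacting with this REST front end reduces to standard HTTP calls. The sketch below shows an illustrative client-side deployment request in Python; the endpoint path and the way the OVF descriptor is sent are simplified placeholders and do not reproduce the exact TCloud resource hierarchy:

    # Illustrative client of a TCloud-style REST front end such as the FCO.
    # Endpoint path and payload handling are simplified placeholders.
    import requests

    FCO_URL = "http://fco.example.org/api"               # hypothetical FCO endpoint

    with open("service.ovf", "rb") as f:                 # OVF Envelope describing the service
        ovf_descriptor = f.read()

    # Ask the orchestrator to deploy the service; it chooses the providers,
    # translates the descriptor through the Resource Mapper and drives the
    # Interoperability plug-ins on the client's behalf.
    response = requests.post(FCO_URL + "/vdc/my-vdc/vapp",
                             data=ovf_descriptor,
                             headers={"Content-Type": "application/xml"})
    response.raise_for_status()
    print("Service accepted, task:", response.headers.get("Location"))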
4.2 Common Database
An open source XML repository called eXist [23] acts as the CDB, as it covers all the requirements needed. eXist can be installed as a standalone server or deployed in any servlet container as an efficient XML management system which provides a comfortable REST interface. eXist also supports the definition of queries in the XQuery language [24], which can be invoked through the REST interface to easily obtain the stored information. Some of these queries have been written to allow the rest of the components to retrieve datasets or specific information from the CDB. For instance, the D&O can obtain the providers' descriptors satisfying certain conditions, ignoring the rest of them.
4.3 Deployment and Optimization Manager
The D&O Manager is mainly composed of the rule engine. For the federated cloud, the JBoss Expert rule engine [28] has been used. This rule engine uses Drools as the rule language. In this sense, all the rules which arrive from the service manifest description or directly from the providers are translated into the Drools language by the D&O Manager. The facts base is composed of a set of Java objects which are instances from the FDM, whose values are taken from the Common Database thanks to the Sensor module. Finally, there is the Descriptor Inspector, which is the D&O Manager REST API (extending the TCloud specification), and the Action Executor, a set of plug-in actions that execute the rule actions when the condition thresholds have been satisfied.
4.4 Resource Mapper
The Resource Mapper has been developed with a custom-made REST interface that provides methods to manage translation rules and data models, storing them in the CDB, and to translate data applying these rules. The translation part has been done using Jena [25], which provides an engine to evaluate rules over RDF graphs. For validating this concept, some rules have been written to translate between the FDM and two of the most significant public cloud providers: Amazon and Flexiscale. An example rule to convert an Amazon small instance to the FDM is shown below.

    [Rule1: (?ec2 rdf:type ec2:aws_small)
      ->
      (?cs rdf:type cim:computersystem)
      (?cs fdm:hasmemory ?mem)
      (?cs fdm:hasdisk ?disk)
      (?cs fdm:hasprocessor ?cpu)
      (?mem rdf:type cim:cim_memory)
      (?mem cim:blocksize 1024^^xsd:float)
      (?mem cim:consumableblocks ^^xsd:float)
      (?d_mem rdf:type cim:cim_memory)
      (?d_mem cim:blocksize 1024^^xsd:float)
      (?d_mem cim:consumableblocks ^^xsd:float)
      (?disk rdf:type cim:cim_diskdrive)
      (?disk cim:hasmediapresent ?d_mem)
      (?cpu rdf:type cim:cim_processor)
      (?cpu cim:maxclockspeed 1000^^xsd:float)]
71 IBERGRID 71 The RM first converts the XML to RDF using an XSLT document based on [18]. Then, it applies the set of rules for the requested provider, and finally turns back the resulting RDF into XML using another XSLT file, which has been written from scratch. Those XSLT transformations are managed with JAXP [26]. 4.5 Interoperability The Interoperability front end has been implemented following the OCCI v1.0 specification with OVF as VM descriptor, because it maintains the uniformity within the system. In its back end, the Interoperability is composed by a set of plug-ins that are able to interact with the different public cloud infrastructure providers. This plug-ins use the RM to translate the OVF descriptor to the input data expected by the cloud provider. Initially, we have focused on implementing plug-ins for Amazon and Flexiscale, but there is the possibility to develop plug-ins for other providers extending the capabilities of the whole system. 5 Conclusions and Future Work In this paper, we have presented the architecture of a cloud federation layer designed in the context of the NUBA project. This layer is composed of different components in charge of the set of functionalities to achieve the federation of private and public clouds. The FCO shields the SP about having to deploy its services in different cloud providers using different interfaces for each case. The CDB stores all the information needed to aggregate the information of the cloud providers. The D&O Manager is able to decide which is the best provider to deploy a service, taking into consideration rules both from the SP and the IP. Finally, the RM and Interoperability components provide the functionalities to interact with public clouds translating the different data models used by IPs to the FDM, which is the data model used by all our components in order to represent the information in an homogeneous way. All these components have been implemented as RESTful Web Services, using Java, and taking into consideration current standard proposals, such as OCCI, TCloud or OVF. We have shown we go a step beyond current work in cloud federation, aggregating available resources in order to consider both public and private clouds in the first step of the deployment (not only to achieve cloud bursting), and also considering the conversion of data between the different data models of the cloud providers. Besides, we have an available implementation of the described architecture. Our future work includes enhancing the functionalities supported by the federation layer (more specifically by the FCO) in order to cover the full lifecycle of a service, and the development of new plug-ins for other public cloud providers in the Interoperability module. Acknowledgements This work is supported by the Ministry of Science and Technology of Spain and the European Union under contract TIN (FEDER funds), the Ministry
72 72 IBERGRID of Industry of Spain under contract TSI (Avanza NUBA project) and Generalitat de Catalunya under contract 2009-SGR-980. References 1. L. Vaquero, L. Rodero-Merino, J. Caceres and M. Linder, A Break in the Clouds: Towards a Cloud Definition, in ACM SIGCOMM Computer Communications Review, 39(1) 50-55, Amazon Elastic Compute Cloud, Flexiscale Public Cloud, Elastic Hosts, Eucalyptus Community, Open Nebula, EMOTIVE Cloud, Open Cloud Computing Interface Working Grop, NUBA project, B. Rochwerger, et. al. The reservoir model and architecture for open federated cloud computing,ibm Journal Res. Dev. vol 53(4), , vcloud API Specification v1.0, TCloud API Specification v0.9, API Spec v0.9.pdf Ana J. Ferrer, et. al. OPTIMIS: a Holistic Approach to Cloud Service Provisioning, in 1st International Conference on Utility and Cloud Computing, R. Buyya, R. Ranjan and R. N. Calheiros, InterCloud: Utility-Oriented Federation of Cloud Computing Environments for Scaling of Application Services, Algorithms and Architectures for Parallel Processing, LNCS vol 6081, 13-31, S. Battle, Round-tripping between XML and RDF, in 3rd International Semantic Web Conference, M. Ferdinand, C. Zirpins and D. Trastour, Lifting XML Schema to OWL,in 4th International Conference on Web Engineering, H. Bohring, S. Auer, Mapping XML to OWL Ontologies, in 13th Leipziger Informatik- Tage, AstroGrid-D XML2RDF, Distributed Management Task Force Inc., Common Information Model Infrastructure Specification v2.6, DMTF Standard DSP0004, Distributed Management Task Force Inc., Open Virtualization Format Specification v1.1, DMTF Standard DSP0243, Jersey: JAX-RS implementation, Java API for XML Binding, exist Open Source Native XML Database, W3C Consortium, XQuery 1.0: An XML Query Language, Jena 2 Semantic Web Framework, Java API for XML Processing, Claudia s OVF Manager Manager Drools, Last Acces to web links: March 10, 2011
73 IBERGRID 73 COMPSs in the VENUS-C Platform: enabling e-science applications on the Cloud Daniele Lezzi 1,2, Roger Rafanell 1, Francesc Lordan 1, Enric Tejedor 1,2, Rosa M. Badia 1,3 1 Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS) 2 Universitat Politècnica de Catalunya (UPC) 3 Artificial Intelligence Research Institute (IIIA), Spanish Council for Scientific Research (CSIC) [email protected], [email protected], [email protected], [email protected], [email protected] Abstract. COMP Superscalar (COMPSs) is a programming framework that aims to provide an easy-to-use programming model and a runtime to enable the development of applications for distributed environments. Thanks to its modular architecture COMPSs can use a wide range of computational infrastructures providing a uniform interface for job submission and file transfer operations through adapters for different middlewares. This paper presents the developments on the COMPSs framework in the context of the VENUS-C project for the design of a programming model enactment service that allows researcher to transparently port and execute scientific applications in the Cloud. 1 Introduction The rise of virtualized and distributed infrastructures and the emergence of multicore and GPU processing capabilities have led to a new challenge to accomplish the effective use of compute resources through the design and orchestration of distributed applications. As legacy, monolithic applications are replaced with service-oriented applications, questions arise about the key steps to be taken in architecture and design phases to maximize the utility of the infrastructures and to give to the users tools that help ease the development and design of distributed applications. Writing an application that uses the resources of a distributed environment is not as easy as writing a sequential application, and requires the programmer to deal with a number of technological concerns, like resource management or job scheduling and submission. Furthermore, programming frameworks are not currently aligned to highly scalable applications and thus do not exploit the capabilities of clouds. The design of a framework that allows the porting and execution of scientific applications on top of virtualized infrastructures, is currently a common topic in the distributed computing community. Cloud technologies are corresponding author
74 74 IBERGRID mainly based on virtualisation and service orientation combined to provide elastic computing service and storage in a pay-per-use model. One of the most common approaches to execute applications in the Cloud is MapReduce [1], a framework for data intensive distributed computing inspired by functional programming. The advantage of MapReduce is that it allows the distributed processing of the map and reduce operations. Provided that each map operation is independent of the other, all maps can be performed in parallel, though in practice it is limited by the data source and/or the number of CPUs close to those data; the other limitation is that the developer has to write the mapping and reduction functions thus forcing the researcher to explicitly adapt his application to the framework. The Microsoft Generic Worker pattern for the Windows Azure platform supports the invocation of.net code and Trident[2] workflows. It consists of a Worker process (a Windows Service) which runs 24x7 on a virtual machine in the Cloud. The pattern allows to run multiple machines with generic Workers concurrently, so that new work items can be distributed and scaled across a broad set of nodes. This framework is limited by the fact that generic binaries cannot be executed; instead, the.net environment has to be used to prepare a package to be deployed in Azure compute nodes. These limitations and other improvements are being addressed in the VENUS-C project. This paper presents the developments for extending COMPSs [3] in the context of the VENUS-C project, a recently European funded initiative whose aim is to support researchers to leverage modern Cloud computing for their existing e-science applications. The rest of the paper is structured as follows: section 2 briefly describes COMPSs, section 3 gives an overview of the VENUS-C project with special emphasis on the job management middleware, section 4 contains the description of the effort to enable COMPSs in the VENUS-C platform and section 6 concludes the paper. 2 The COMP Superscalar Framework COMP Superscalar (COMPSs) is a programming framework whose main objective is to ease the development of applications for distributed environments. COMPSs offers a programming model and an execution runtime. On the one hand, the programming model aims to keep the programmer unaware of the execution environment and parallelization details. In this programming model, the programmer is only required to select the parts of the application that will be executed remotely (the so-called tasks) specifying which methods called from the application code must be executed remotely. This selection is done by simply providing an annotated interface which declares these methods, along with metadata about them and their parameters. On the other hand, the runtime of COMPSs is in charge of optimizing the performance of the application by exploiting its inherent concurrency. The runtime receives the invocation to the tasks from the application, checks the dependencies between them and decides which ones can be executed at every moment and where, considering task constraints and performance aspects. For each task, the runtime creates a node on a task
75 IBERGRID 75 dependency graph and finds the data dependencies between the current task and all previous ones. Such dependencies are represented by edges of the graph, and must be respected when running the tasks. Tasks with no dependencies pass to the next step: the scheduling. In this phase, a task is assigned to one of the available resources. This decision is made according to a scheduling algorithm that takes into account data locality and task constraints. Next, the input data for the scheduled task are sent to the chosen host, and after that the task can be submitted for remote execution. Once a task finishes, the task dependency graph is updated, possibly resulting in newly dependency-free tasks that can be scheduled. The applications programmed and executed with COMPSs can use a variety of Grid middlewares. Besides, some developments are currently being performed for the COMPSs framework to operate on Cloud environments as well. Such developments mainly consist in the communication of the COMPSs runtime with a Cloud scheduler, which can request the creation and deletion of virtual machines in a dynamic way. Thus, depending on the number of tasks generated by the application, COMPSs can grow and shrink the pool of virtual resources which run the tasks. 3 The VENUS-C Platform VENUS-C develops and deploys an industrial-quality service-oriented Cloud computing platform based on virtualisation technologies, to serve research and industrial user communities by taking advantage of previous experiences and knowledge on Grids and Supercomputing. The ultimate goal is to eliminate the obstacles to the wider adoption of Cloud computing technologies by designing and developing a shared data and computing resource infrastructure that is less challenging to implement and less costly to operate. The programming models are a major contribution of the VENUS-C project to the scientific community. In conjunction with the data access mechanisms, these programming models provide researchers with a suitable abstraction for scientific computing on top of plain virtual machines that enable them with a scientific Platform-as-a-Service. One of the requirements of VENUS-C architecture is to define a way to support multiple programming models at the same time. In order to shield the researcher from the intricacies of the concrete implementation of different programming models, each one is enacted behind a job submission service, where researchers can submit jobs and manage their scientific workload. For this purpose, VENUS-C supports the Open Grid Forums Basic Execution Service (OGF BES) [4] and Job Submission Description Language (OGF JSDL) [5]. Each VENUS-C programming model enactment service exposes its functionality through an OGF BES/JSDLcompliant web service interface. This enactment service takes care of the concrete enactment of a job in a specific Cloud Provider. Figure 1 depicts a high level view of the VENUS-C job management middleware architecture. In VENUS-C, an e-science application is separated in two parts: the core algorithmic part is ported to the Cloud through the programming models, leaving
76 76 IBERGRID Fig. 1: The architecture of the VENUS-C Job Management Middleware at the user the interaction with the client side part, usually a graphical user interface (GUI), to prepare and modify data, visualize results, and start the scientific computation. The multiplexer component shown in Figure 1 acts as a unique front-end for multiple programming models, so that a client application can interact with a single OGF BES/JSDL endpoint, instead of maintaining lists and configurations of different enactment endpoints. At the current stage of the project two programming models are being enacted, COMPSs and a VENUS-C tailored implementation of the Microsoft Generic Worker [6] for the Windows Azure platform. Both COMPSs and the Generic Worker allow batch-style invocations. In the case of the Generic Worker, the command-line executable can be directly called as-is, thanks to the Generic Worker that starts a dedicated process inside the operating system. In the case of COMPSs, the application is executed using a pool of virtual machines properly allocated by the runtime or can be enqueued in a cluster through a batch system that assigns the required number of nodes. Each enactment service enables a specific instance of an application on the underlying computational infrastructure that includes Windows Azure virtual machines and Unix virtual resources made available through several open source Cloud middlewares, specifically OpenNebula [7] and EMOTIVE Cloud [8]. Both EMOTIVE and OpenNebula exposes OCCI [9] interfaces, so that the VENUS-C programming models can programmatically manage the virtual-machine life-cycle in these Cloud infrastructures, providing the requests in the Open Virtualization Format (OVF) [10]. An application repository makes all the code packages available to the programming model enactment service, which deploys the necessary binaries, based on the specific requirements of an incoming job request.
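To make the enactment interface more concrete, the following is a minimal sketch, in Python, of how a client could assemble the JSDL document submitted to one of these BES endpoints. The element layout follows the JSDL 1.0 schema in a simplified form, and the application name and data URI are illustrative placeholders rather than values prescribed by VENUS-C.

# Minimal sketch of a JSDL job description for a BES-style enactment service.
# Element names follow the JSDL 1.0 schema, simplified for illustration; the
# application name and the data URI below are hypothetical placeholders.
import xml.etree.ElementTree as ET

JSDL_NS = "http://schemas.ggf.org/jsdl/2005/11/jsdl"

def build_jsdl(app_name, job_name, input_uri):
    ET.register_namespace("jsdl", JSDL_NS)
    def q(tag):  # qualify a tag with the JSDL namespace
        return "{%s}%s" % (JSDL_NS, tag)

    job_def = ET.Element(q("JobDefinition"))
    job_desc = ET.SubElement(job_def, q("JobDescription"))

    ident = ET.SubElement(job_desc, q("JobIdentification"))
    ET.SubElement(ident, q("JobName")).text = job_name

    app = ET.SubElement(job_desc, q("Application"))
    ET.SubElement(app, q("ApplicationName")).text = app_name

    # Reference to input data kept in the Cloud storage
    staging = ET.SubElement(job_desc, q("DataStaging"))
    ET.SubElement(staging, q("FileName")).text = "input.dat"
    source = ET.SubElement(staging, q("Source"))
    ET.SubElement(source, q("URI")).text = input_uri

    return ET.tostring(job_def, encoding="unicode")

if __name__ == "__main__":
    print(build_jsdl("Hmmpfam", "hmmpfam-run-01",
                     "cdmi://storage.example.org/container/input.dat"))

The multiplexer in Figure 1 would receive a document of this kind and forward it, unchanged, to the enactment service of the programming model selected for the application.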
77 IBERGRID 77 The data has to be accessed from the Cloud-side part of the application; for this reason a Cloud SDK is made available in the project allowing the user to store the input data in the Cloud storage, to retrieve the results to the local disk and to remove data in the Cloud. The VENUS-C data management SDK supports the Cloud Data Management Interface (CDMI) [11] specification, pushed forward by the Storage Networking Industry Association (SNIA). This interface includes both a deployable web service which exposes the CDMI interface, and support libraries for different language bindings, that ease the call of the CDMI Service. 4 The VENUS-C COMPSs Enactment Service The porting of the COMPSs framework to the VENUS-C platform includes the development of several components depicted in Figure 2. A BES compliant enactment service has been developed to allow researchers to execute COMPSs applications on the VENUS-C infrastructure. The researcher uses a client to contact the enactment service in order to submit the execution of a COMPSs application. This request is expressed through a JSDL document containing the application name, the input parameters and data references. The same client allows the user to interact with the VENUS-C application repository to upload a packaged version of his code; such package contains the COMPSs binaries and configuration files needed by the application. One of the advanced features of the COMPSs runtime in this platform is the ability to schedule the tasks to dynamically created virtual machines. The runtime turns to the Cloud to increase the resources pool when it detects that workload of its Workers is too high or when no Worker fulfills the constraints of a task. This increase in the number of resources is done in parallel with the submission of tasks. The requests are made using a connector to the BSC EMOTIVE Cloud middleware that has been developed implementing an OCCI client that is also used for interoperability with other providers like OpenNebula. The features of this new resource may vary depending on the need of resources that the runtime has. OVF is the chosen format to specify which constraints must accomplish the response of the Cloud Provider. Once the request has been fulfilled and the virtual machines can be accessed, all the binaries needed by the application are deployed into it and the new resource is added to the available resources pool. Symmetrically, the runtime is able to shrink this pool by releasing some of the VMs obtained from the Cloud. This happens only when a low level of workload is detected on the Workers or when the application ends. Before releasing them, the runtime must ensure that no tasks are running in these virtual machines and there exists at least another copy of the data stored on the local disk. One of the aims of VENUS-C is to demonstrate that existing distributed infrastructures, both data centers and supercomputing centers, can be integrated into modern Cloud environments; COMPSs helps in this vision of HPC in the Cloud, being able to dispatch workloads also into a supercomputing infrastructure processing MPI workloads. This allows traditional supercomputing infrastructures
78 78 IBERGRID to be provided as a service to the VENUS-C consumer. Fig. 2: The COMPSs Enactment Service architecture 4.1 The Basic Execution Service The Basic Execution Service specification defines the service operations used to submit and manage jobs within a job management system. BES defines two WSDL port types: BES-Management for managing the BES itself, and BES-Factory for creating, monitoring and managing sets of activities. A BES manages activities, not just jobs, but the activity in the context of the enactment service is an e- Science code developed for instance as a COMPSs application. The request for an application execution is expressed in a JSDL document. The implementation of the COMPSs BES includes a Job Manager and a Job Dispatcher that actually execute the application and manage its life cycle. The implementation of the BES-Factory port type for the COMPSs enactment services supplies the following operations: CreateActivity: This operation is used to submit a job. It takes an ActivityDocument (JSDL document) describing the application to run, and returns an identifier of a WS-Addressing EndpointReference (EPR). This EPR can be used in subsequent operations. The JSDL document is forwarded to the Job Manager that, through the StartJob method, creates an enactment service job
79 IBERGRID 79 object assigning it a Universally Unique Identifier (UUID); a SAGA [12] job is created and enqueued in the Job Dispatcher which is responsible for dealing with execution as is described in the following section. GetActivityStatuses: This operation accepts a list of EPRs (previously returned from CreateActivity), and returns the state of each COMPSs job. The EPR is parsed in order to get the UUID of the job. The Job object stored in the Job Manager is requested, and its status retrieved through the GetJob- Status method. The state model includes a basic set of states being: Pending, Running (Staging-In, Executing, Staging-Out) and Finished for normal execution stages and Failed or Cancelled for abnormal execution situations. A controller for expired jobs takes care of FAILED or FINISHED jobs removing them from the list of managed jobs after a maximum allowed time using two configurable timeouts. The jobs with a CANCELLED end state are immediately removed from the Job Manager list. GetActivityDocuments: This operation accepts a list of EPRs and return an ActivityDocument for each EPR requested. The ActivityDocument just wraps a JSDL document containing the job description; it has been implemented for specification compatibility reasons. TerminateActivities: Takes a list of EPRs of COMPSs jobs to be terminated and returns true or false for each job depending on whether the operation was successful or not. The UUID is extracted from the EPR and the Terminate- Job method is called to change the status of the Job object stored in the Job Manager to CANCELLED; the Job Dispatcher is notified so that the job can be terminated if running or not started if still in the queue. GetFactoryAttributesDocument: This operation returns various attributes of the BES back to the client. These attributes include information about the enactment service itself, such as if it is accepting new jobs. It also contains information about the resources that the enactment service has accessed when scheduling jobs, and returns a list of Activity EPRs of the jobs controlled by the Job Manager. A BasicFilter extension can be passed to filter the results returned. 4.2 The Job Dispatcher As seen in the previous section, the Job Manager delegates the whole control of the execution to the Job Dispatcher (see Figure 3), which maintains a userresizable pool of threads that is responsible for picking jobs from a Job Dispatcher queue filled by the Job Manager. First a virtual machine is requested to the Cloud Provider in order to deploy the COMPSs runtime that schedule the tasks on the remote nodes. A SAGA SSH persistent connection is established with this machine where the Job Dispatcher deploys the application requesting its package to the VENUS-C application repository. This package includes the binary, the scripts for
setting the required environment variables, and other configuration files needed both by the COMPSs runtime and by the application itself.

Fig. 3: The COMPSs Job Dispatcher

Once the runtime machine is deployed, the COMPSs application is remotely started using the SAGA GAT adaptor. In turn, the COMPSs runtime schedules the tasks on a set of machines created on demand. This phase includes the staging of input files from remote locations as specified in the JSDL document. The format and protocol of these location references depend on the storage provider that hosts the data; in the case of the VENUS-C platform the Cloud storage is accessed in two ways, either through an SDK tailored to the Cloud storage or through a data access proxy service which can wrap multiple Cloud storage systems. In the latter case a CDMI-compliant web service is made available to the clients and to the programming models. At the time of writing, the CDMI service is not yet available and the COMPSs enactment service uses an SCP adaptor to handle the files. When the job is completed, the output data is moved back to the storage according to the user's JSDL request, then the Master VM is shut down and the information on the activity remains available in the Job Manager for a time specified by the expired jobs controller.
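The dispatching logic described above can be summarized with a conceptual sketch. This is not the actual COMPSs implementation, only an illustration of a user-resizable pool of threads picking jobs from a queue, the Pending/Running/Finished/Failed/Cancelled state model, and the expired-jobs controller.

# Conceptual sketch (not the actual COMPSs code) of a dispatcher that picks
# queued jobs with a pool of worker threads and tracks the state model
# described above: PENDING -> RUNNING -> FINISHED/FAILED/CANCELLED.
import queue
import threading
import time
import uuid

class Job:
    def __init__(self, jsdl):
        self.uuid = str(uuid.uuid4())
        self.jsdl = jsdl
        self.state = "PENDING"
        self.end_time = None

class JobDispatcher:
    def __init__(self, pool_size=4, expiry_seconds=3600):
        self.jobs = {}                  # UUID -> Job, as kept by the Job Manager
        self.pending = queue.Queue()    # queue filled by the Job Manager
        self.expiry = expiry_seconds
        for _ in range(pool_size):      # user-resizable pool of threads
            threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, jsdl):
        job = Job(jsdl)
        self.jobs[job.uuid] = job
        self.pending.put(job)
        return job.uuid                 # returned to the client as part of the EPR

    def _worker(self):
        while True:
            job = self.pending.get()
            if job.state == "CANCELLED":
                continue                # terminated while still in the queue
            job.state = "RUNNING"
            try:
                self._execute(job)      # deploy runtime VM, stage data, run
                job.state = "FINISHED"
            except Exception:
                job.state = "FAILED"
            job.end_time = time.time()

    def _execute(self, job):
        time.sleep(1)                   # placeholder for the real execution

    def purge_expired(self):
        # controller for expired jobs: drop FINISHED/FAILED entries after a timeout
        now = time.time()
        for jid, job in list(self.jobs.items()):
            if job.state in ("FINISHED", "FAILED") and now - job.end_time > self.expiry:
                del self.jobs[jid]

if __name__ == "__main__":
    d = JobDispatcher(pool_size=2, expiry_seconds=60)
    jid = d.submit("<jsdl .../>")
    time.sleep(2)
    print(jid, d.jobs[jid].state)       # FINISHED after the placeholder execution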
81 IBERGRID 81 5 Evaluation of an e-science application In order to validate the described framework a bioinformatics application has been adapted to run in a Cloud environment through the COMPSs enactment service. The aim is twofold: first, evaluating the complexity of porting the application to COMPSs in the VENUS-C platform; second, comparing the performance of the Cloud and Grid scenarios. Hmmpfam is a widely-used bioinformatics tool included in the HMMER [15] analysis suite. It compares protein sequences with databases containing protein families, searching for significantly similar sequence matches. The work performed by hmmpfam is computationally intensive and embarassingly parallel, which makes it a good candidate to benefit from COMPSs. A sequential Java application which wraps the hmmpfam binary was implemented in order to be run with COMPSs [16]. The main code of the application is divided in three components: Segmentation: the query sequences file, the database file or both are split in the main program. Execution: for each pair of sequence-database fragments, a task which invokes hmmpfam is run. Reduction: the partial outputs for each pair of sequence-database fragments are merged into a final result file by means of reduction tasks. The input data to the workflow is the SMART database, whose size is approximately 20 MB, and a file containing 4096 protein sequences, which is 870 KB in size. The latter is partitioned in several fragments depending on the number of worker cores used in each series of tests to achieve parallelism. The output is a 320 MB file containing the scoring results. 5.1 Porting of the application In order to port an application to COMPSs, the programmer is only required to select which methods called from the application will be executed remotely. This is done by means of an interface where the user has to specify some metadata for each method, namely: the class that implements the method (Java applications) and, for each method parameter, its type (primitive, file,...) and direction (in, out, in/out). The user can also express constraints on the capabilities that a resource must have to run a certain method (CPU type, memory, etc). Finally, for the Hmmpfam application to run in the VENUS-C platform, a package composed by two files was created: first, a jar archive containing the code to manage the three phases of the application (Segmentation, Execution and Reduction); second, the hmmpfam binary which is used in the Execution phase to compare each sequence fragment with the protein sequences extracted from the database. These two files are packaged in a hmmer.tar.gz package that the user has to upload to the application repository through the specific client. The effort required is small compared to the case of a Grid testbed, where the user has to manually specify information about the configuration of the execution testbed. In the Cloud case, the computing infrastructure is transparent to the researcher, who
82 82 IBERGRID only has to deal with his own code ported to COMPSs and to upload the package to the application repository. 5.2 Performances Two tests have been run, one on a grid testbed composed by two quad-core nodes and the second one on a Cloud testbed using EMOTIVE as Cloud provider with the same number of available cores. (a) (b) Fig. 4: Performance results. The objective of these tests is to validate the elasticity behavior of the COMPSs runtime in the Cloud compared to a Grid testbed and the scalability of the runtime when virtual resources are used. Figure 4a depicts the evolution of the number of VMs (limited to 8) used during an Hmmpfam execution. Hmmpfam tasks are more computing intensive than the merge ones, which are shorter and data intensive. This information, provided by the user in the application interface, is used by the COMPSs runtime to ask the provider for VMs whose size fits the computing requirements of the tasks also adjusting the execution cost to the real needs of the application. COMPSs also detects the load produced by the tasks and adjust the number of required VMs able to compute them as highlighted by this example with hmmpfam tasks. As depicted in the figure, all the merge tasks can be executed on the same VM since this kind of tasks does not generate enough load to need more resources. The second test aims to measure the overhead of using cloud resources. Three configurations are used: the first one uses one physical node as cloud resource, the second one uses two nodes as cloud physical resources and the third one uses both machines as grid nodes. Figure 4b depicts the speedup of the Hmmpfam application depending on the maximum number of workers used during each execution. Due to the VM creation time the cloud execution times are always higher than the grid one using the same amount of cores. Another important aspect shown by
this test is how the number of physical nodes influences the execution time. The first limitation is the number of requests that the cloud provider used in this test can handle at the same time; for a fixed number of physical machines, the time difference between the settings grows proportionally to the number of VMs used. If the provider could deal with all the requests at the same time, independently of the number of physical nodes, the difference between grid and cloud execution times would be constant. The second limitation is the number of available cores. For computing-intensive tasks, the execution time no longer decreases once the number of workers exceeds the number of cores, as depicted in the single-node scenario with 8 workers.

6 Conclusions and future work

This paper presented the latest efforts around the COMPSs framework to make it available in the VENUS-C platform. The main contribution is the design and development of a programming model enactment service that makes the deployment and execution of scientific applications transparent to the user. This enactment service allows the easy porting of scientific applications to a cloud infrastructure; the effort required is minimal and consists of registering the COMPSs application and its dependencies in a Cloud Application Repository and of using a client-side utility to invoke the remote operations. The actual execution of the application is left to the COMPSs runtime that, using a cloud-enabled adaptor for job management, schedules the tasks onto virtual machines requested from different cloud providers; the requests are formatted using OVF for the description of the resources and an OCCI adaptor is used to communicate with the providers. The data management of COMPSs has been adapted to support the use of a cloud storage both for the input data and to store the results of the computation. The development of a BES-compliant interface for the enactment service makes COMPSs usable in the context of the VENUS-C project and, at the same time, interoperable with other frameworks that implement the same interface. The entire framework has been validated through the porting of an e-science application; the results show that, despite a few limitations introduced by the specific Cloud infrastructure, COMPSs preserves the scalability of the application and the overall performance of its runtime, while offering the researcher useful cloud features such as optimized usage of resources and an easy programming and execution framework. Future work includes support for security, to authenticate and authorize users in order to provide compartmentalization and application security and to provide accounting information. Another improvement is full support for the HPC Basic Profile [14] (not needed in the VENUS-C platform) and the execution of interoperability tests with other implementations to validate this interface. A specific adaptor for the Generic Worker Role will be developed in order to provide COMPSs with the capability of executing tasks on the Azure Platform, and scheduling policies will be developed in order to optimize the selection of the
84 84 IBERGRID resources. In the same way, scaling and elasticity mechanisms will be adopted to enhance the programming model with capabilities for scaling up or down the number of resources based on user-defined or policy driven criteria. Acknowledgements This work has been supported by the Spanish Ministry of Science and Innovation (contracts TIN , CSD and CAC ), by Generalitat de Catalunya (contract 2009-SGR-980), Universitat Politècnica de Catalunya (UPC Recerca predoctoral grant) and the European Commission (VENUS-C project, Grant Agreement Number: ) References 1. J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Available from: 2. The Trident framework, 3. E. Tejedor and R. M. Badia. COMP Superscalar: Bringing GRID superscalar and GCM Together. In 8th IEEE International Symposium on Cluster Computing and the Grid, May Foster I et. al. OGSA Basic Execution Service Version 1.0. Grid Forum Document GFD-RP /8/ Savva A (Editor). Job Submission Description Language (JSDL) Specification, Version 1.0. Grid Forum Document GFD-R November Simmhan Y. and Ingen v. C. (2010) Bridging the Gap between Desktop and the Cloud for escience Applications. Available from: Microsoft Research, U.S. 7. Open Nebula, 8. I. Goiri, J. Guitart, and J. Torres. (2009) Elastic Management of Tasks in Virtualized Environments. 9. Open Cloud Computing Interface Working Group, Distributed Management Task Force Inc., Open Virtualization Format Specification v1.1, DMTF Standard DSP0243, SNIA CDMI: activities/standards/curr standards/cdmi/ 12. Tom Goodale and Shantenu Jha (2011) A Simple API for Grid Applications (SAGA) Available from: EMOTIVE Cloud, The HPC Basic profile specification HMMER Analysis Suite, Enric Tejedor, Rosa M. Badia, Romina Royo, Josep L. Gelpi, Enabling HMMER for the Grid with COMP Superscalar, in proceedings of the International Conference on Computational Science 2010, ICCS2010.
Merging on-demand HPC resources from Amazon EC2 with the grid: a case study of a Xmipp application

Alejandro Lorca 1,2, Javier Martín-Caro 2, Rafael Núñez-Ramírez 3 and Javier Martínez-Salazar 3
1 Instituto de Física de Cantabria. CSIC
2 Secretaría General Adjunta de Informática. CSIC
3 Departamento de Física Macromolecular, Instituto de Estructura de la Materia. CSIC

Abstract. We present an infrastructure in which HPC resources from the Amazon Web Services public cloud are combined with specific grid resources at Ibergrid. The integration is done transparently for the GridWay users through a daemon which permanently monitors the pool of available resources and submitted jobs, managing virtual instances to satisfy the demand under budget restrictions. The study has been demonstrated with a specific application from the Xmipp package, which performs image processing of electron microscopy data, requires heavy high-throughput computing and offers parallelization capabilities. The application was ported successfully to such a hybrid framework with the help of the MPI-Start package. Some preliminary results from test runs are presented for a controlled sample of a thousand input images. Some inconveniences and troublesome aspects of the deployment are also reported.

1 Introduction

There is common agreement that cloud computing has reached the top of the peak of expectations in Gartner's hype cycle of technologies [1]. This differs from what has already happened to grid computing, which is getting closer to the plateau of productivity, as required by the Large Hadron Collider experiments and supported by other scientific communities. Simultaneously, advances in processor architecture have brought multi-core servers to the market at very affordable prices, even for small-to-medium research units. Not only the speed, but also the parallelization, virtualization and energy-saving capabilities of the hardware offer compelling reasons to upgrade the equipment. The resources offered by the institutions taking part in grid computing are slowly moving in this direction. Looking at the figures offered by the Ibergrid
infrastructure in February 2011 [2], 12,257 declared logical CPUs (cores) are sustained by 3,359 physical CPUs (processors), averaging 3.65 cores/processor, far above the single-core architecture. Similar increases have occurred in memory and network speed since the early days of grid computing. Therefore, it seems natural to port high performance computing (HPC) applications, or at least the so-called many task computing (MTC) ones, to the grid. Merging in resources coming from the cloud also appears to be a step forward, introducing an important degree of flexibility into a relatively rigid infrastructure. To this end, the key role of Amazon EC2 and other providers of the Infrastructure-as-a-Service utility model has made the use of on-demand virtual instances affordable for the general public. A service based on a static infrastructure which can be enlarged and reduced dynamically is still a challenge, and there are remarkable ongoing European projects devoted to the subject, focusing on site provisioning (StratusLab), virtual environments (VENUS-C) and e-infrastructure enrichment (SIENA Initiative).

In this paper we present a simple model of a hybrid grid-cloud infrastructure in Sect. 2, with emphasis on a transparent experience for the user. An application used for molecular imaging is described in Sect. 3, since it was used whilst developing the framework. The porting of the application to the hybrid infrastructure is detailed in Sect. 4. Some preliminary results, of both technical and scientific interest, are shown in Sect. 5, with comments on some issues which appeared during the implementation of the infrastructure. We summarize the study in Sect. 6 with some recommendations and future work.

2 The grid-cloud model

2.1 Architecture

In our model, a heterogeneous grid composed of Ibergrid nodes is considered under the following restrictions:
1. glite 3.1 lcg-type computing elements (CEs),
2. support of any of the Ibergrid VOs or CSIC VOs,
3. capability of handling MPI jobs.
The available CEs are shown in Table 1. Additionally, we consider a number of High-CPU Extra Large instances of Amazon EC2, described in Table 2. The virtual machines are launched from a customized Amazon Machine Image (AMI) and are capable of handling MPI jobs through the Open MPI suite and secure shell. The AMI has been stored in an Amazon S3 bucket for the duration of the infrastructure's usage. The last element required in order to submit jobs is the user interface. It is a dedicated machine physically located at the SGAI-CSIC institute and accessible to the users.
87 IBERGRID 87 ARCH CPU MEM Cores LRMS Endpoint Site (bits) (MHz) (MB) lcgsge ce01.up.pt UPorto sge egeece01.ifca.es IFCA-LCG sge egeece02.ifca.es IFCA-LCG sge egeece03.ifca.es IFCA-LCG lcgpbs ce.iaa.csic.es IAA-CSIC lcgpbs ce.cp.di.uminho.pt UMinho-CP Table 1. Resources from LCG-CEs in Ibergrid with MPI support. ARCH CPU MEM Cores LRMS API name Region Price (bits) (EC2 CU) (MB) ($/hour) none c1.xlarge US-EAST 0.68 Table 2. Resource type used from Amazon EC Middleware The user interface. It hosts the middleware handling the submission of jobs, and thus needs to provide the different services which are required by the lcg- CE computing elements: creation of proxies, verification from a VO, file transfer mechanism, command execution, etc. An installation of the glite User Interface 5 and the Globus Toolkit 6 suffices for these purposes. On top of that, a job scheduler different from the glite WMS is required, since the power coming from the cloud has to be locally managed. The GridWay metascheduler [3] is a tool designed for the submission, scheduling, control and monitor of jobs from a single access point. In the last available version 7 it includes a ssh plugin, gaining remote access to cloud instances. The GW EC2 service manager. A deeper analysis on the many grid-cloud enabling mechanisms has been discussed elsewhere [4], proposing also a general framework based on GridWay where a Service Manager would take care of the interoperability with cloud resources. So far we have no evidence of any implementation of such service, being our contribution a novel attempt which validates the framework. It consists of four modules: 1. Amazon Web Services account, where the system administrator (GridWay administrator) has access to in order to launch instances and be subsequently billed. The module can read the certificates off for issuing such operations. 2. Budget policy, regarding the limits to start and finish instances. Here we propose a simple budget rate scheme where the rate limit is input in terms of amount of money per time unit. 5 glite UI v Globus Toolkit v GridWay v
88 88 IBERGRID Fig. 1. Scheme of the middleware design. The enclosing light-gray box represents the user interface for job submission where the local components are installed. The left solid block stands for the GridWay package situated above the execution plugins to interact with the resources. The novel contribution (GW EC2) is placed on the right showing the four existing modules. 3. Provider database, making the system is aware of the different machine types offered by the provider and its pricing. This information could be properly updated at running time. 4. A daemon, monitoring both the job and the resource pools from GridWay and communicating with the provider. According to some configuration parameters and the other modules, it decides when the pending jobs deserve dedicated machines for their execution and how many to launch. It does also the opposite action; to keep an eye if the job queue gets empty and shutdown the instances. The daemon uses the Amazon EC2 API Tools 8 for this purpose. The interaction of the different components is depicted in Fig HPC-enabler Due to the heterogeneous environment found in the grid, the necessity of a single interface to process parallel jobs arises. MPI-Start [5] proposes an interface to hide the implementation details for the submitted jobs. It is composed by a set of scripts which process all the life cycle of the job, including MPI implementation and the additional features needed before and after the parallel execution. Because it can be installed also on single machines without a local resource manager, the implementation favors the usage of the Amazon EC2 multi-core instances. For the study we prepared a private AMI based on the Amazon Linux AMI, with Open MPI and MPI-Start. A specific set of users were created on it with passwordless ssh-access from the user interface through public keys. 8
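As an illustration of how the budget policy, provider database and daemon modules fit together, the following Python sketch reproduces the kind of decision loop of the GW EC2 service manager daemon described above. It is not the actual code: the pool interface (pending_jobs, cloud_instances, launch_instance, terminate_instance) is a hypothetical stand-in for the real interaction with GridWay and the Amazon EC2 API tools.

# Minimal sketch of the GW EC2 daemon's decision loop, under the stated
# assumptions; the "pool" object and its methods are hypothetical hooks
# around GridWay and the Amazon EC2 API tools.
import math
import time

def daemon_loop(pool, budget_rate, instance_price,
                cores_per_instance=8, step=60):
    # budget policy: never keep more simultaneous instances than the
    # hourly budget rate allows, given the provider's price per hour
    max_instances = math.floor(budget_rate / instance_price)
    while True:
        pending = pool.pending_jobs()      # jobs queued in GridWay
        running = pool.cloud_instances()   # instances already enrolled
        if pending and len(running) < max_instances:
            # estimate how many machines the queued jobs need (each job is
            # assumed to expose its requested core count)
            needed = math.ceil(sum(j.cores for j in pending) / cores_per_instance)
            for _ in range(min(needed, max_instances - len(running))):
                pool.launch_instance("c1.xlarge")
        elif not pending:
            for vm in running:
                if vm.idle() and vm.close_to_billed_hour():
                    # keep idle instances until the paid hour is almost over,
                    # in case new jobs arrive, then shut them down
                    pool.terminate_instance(vm)
        time.sleep(step)                   # configuration step time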
89 IBERGRID 89 3 Scientific usage 3.1 Specific problem As an example of a scientific problem which required high computational resources we have used image processing of electron microscopy images of biological macromolecules. This technique allows the visualization and characterization of the structure of large macromolecular complexes. Due to the requirement of a low electron dose to minimize radiation damage during image recording the images typically suffer from low signal to noise ratio. In the last few years several image processing algorithms have been developed in order to improve the signal to noise ratio of these images [6]. Basically, these algorithms consist of 2D translational and rotational alignments of the images which yield an average image with an improved signal to noise ratio. Such averaging is only possible if the experimental images correspond to the same orientation of the same macromolecule. However, it is very common for electron microscopy data set to contain more than one different 2D structure. For that reason several classification methods have been developed which curiously require that the images are aligned beforehand. This paradoxical situation has been solved with the developing of algorithms which combine 2D alignment with classification iteratively, known as multi-reference alignment. The outputs of these algorithms are averaged images with improved signal to noise ratio for each of the subgroups in the original data set (Fig. 2). 3.2 Xmipp The Xmipp package is a suite of electron microscopy image processing programs [7]. Among other, Xmipp include an algorithm for 2D alignment and classification widely used by the electron microscopy community, named as maximum likelihood multi-reference refinement (ML2D). The details and mathematical background of ML2D are explained in [8]. Briefly, the set of images are compared to a predefined number of reference images, which are assumed to represent the structural diversity of the data. Each experimental image is compared and aligned with respect to all references and a probability is assigned to each combination of alignment and reference. Since each experimental image is compared to all references in all possible rotations and translations the searching space became huge. This makes ML2D computationally expensive and high performance computing is advantageous. In this work we have tested the availability of grid computing for ML2D. We have used as experimental data electron microscopy images of GroEL, a large macromolecular complex from Escherichia coli bacteria. 4 Application porting The aforementioned Xmipp application was ported from cluster computing to the grid-cloud infrastructure by creating a self-contained pack of scripts, input files and
90 90 IBERGRID Fig. 2. Schematic representation of multi-reference alignment procedure. The 3D structure of the macromolecular complex is observed under the electron microscope as very noisy 2D images. Thousands of these images are processed by multireference alignment procedures to obtain average images with improved signal to noise ratio. the Xmipp package. The strategy was to perform a dedicated remote compilation of the xmipp mpi ml align2d program using, whenever possible, as many parallel threads as were asked for. The user submits a template like this: NAME=ml2d_1000_8 EXECUTABLE=mpi-start-wrapper.sh ARGUMENTS=ml2d OPENMPI 8 TYPE="single" NP=8 INPUT_FILES=mpi-start-wrapper.sh, mpi-hooks.sh, ml2d, images-1000.tgz, Xmipp-2.4-src.tar.gz OUTPUT_FILES=ml2d.tgz STDOUT_FILE=ml2d.out STDERR_FILE=ml2d.err altogether with a set of files: Two mpi scripts: the first one is the wrapper executable and it is almost the same for every job (mpi-start-wrapper.sh) accepting as arguments the file to run in parallel (ml2d), the MPI taste to use (OPENMPI) and the parallelization degree for the case of ssh execution (8), otherwise the NP parameter will be used. The other script (mpi-hooks.sh) deals with additional features before and after the MPI execution. The script for run (ml2d) sets the library paths and calls the application with the configuration parameters. The compressed files, including the input images (images-1000.tgz) for processing and the package distribution (Xmipp-2.4-src.tar.gz).
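As a sketch of how such templates can be produced and submitted in an automated way, the snippet below writes the template shown above for several parallelization degrees and hands each one to GridWay. It assumes that the standard gwsubmit command accepts a template file through the -t option; the file names are our own choice.

# Sketch only: generate GridWay job templates for different parallelization
# degrees and submit them, assuming "gwsubmit -t <template>" is available.
import subprocess

TEMPLATE = """NAME=ml2d_1000_{np}
EXECUTABLE=mpi-start-wrapper.sh
ARGUMENTS=ml2d OPENMPI {np}
TYPE="single"
NP={np}
INPUT_FILES=mpi-start-wrapper.sh, mpi-hooks.sh, ml2d, images-1000.tgz, Xmipp-2.4-src.tar.gz
OUTPUT_FILES=ml2d.tgz
STDOUT_FILE=ml2d.out
STDERR_FILE=ml2d.err
"""

def submit(np):
    fname = "ml2d_1000_%d.jt" % np
    with open(fname, "w") as f:
        f.write(TEMPLATE.format(np=np))
    subprocess.run(["gwsubmit", "-t", fname], check=True)   # hand over to GridWay

if __name__ == "__main__":
    for np in (1, 2, 4, 8):   # the parallelization degrees used in the tests
        submit(np)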
Fig. 3. The MPI-Start plays a wrapper role between the middleware and hardware stacks.

The submission and job control is then handled by GridWay automatically, according to its own configuration. Once the job is submitted and accepted by a resource, MPI-Start takes care of the correct settings, prepares the execution in the pre-hook phase and, after the parallel processing, performs a similar post-hook action in order to retrieve the output. The flow control is sketched in Fig. 3.

5 Results and discussion

5.1 Application profile

The application was run on many occasions, obtaining more averaged images and testing the underlying infrastructure. A very nice feature observed was the scaling of the speed-up due to parallelization. In Fig. 4 equivalent test runs of the application were considered, taking 1000 input images and 4 reference output images. In comparison to the single-core run, the 2-core run takes 54% of the time, which gets quite close to the theoretical half-time limit. Equally, the 4-core run takes 57% of the 2-core time, and the 8-core job takes 57% of the time employed by the 4-core task. We observed a standard deviation of about 10% in the total time taken by the same jobs. Note that the pre-execution time is also reduced due to
92 92 IBERGRID Fig. 4. Scaling of the application for an input sample of 1000 images averaged to 4 reference output images. The columns indicate the total elapsed time for each job, consisting of input transfer (prolog), pre-execution (pre-hooks), execution, postexecution (post-hooks) and output transfer (epilog). The four jobs were exclusively run in the same Amazon instance at different times. the multi-threaded option at compile for the Xmipp package. Runs at the lcg-ces showed a much larger deviation (20%) with high failure rate, depending on the hardware and cluster availability. They also suffer from the undesirable waiting time for free slots, but in general one could expect jobs being done at IFCA-CSIC on a bit more time than the invested in Amazon EC2. For the other sites we were unable to run successfully the jobs. A more detailed summary of job runs is given in Table 3. During execution, all the cores were used to a very large degree consuming all the available CPU cycles. On the contrary, memory was not a limitation for the kind of hardware tested. Access to disk became relevant uncompressing the package at the pre-execution stage, lowering the performance of the application in those systems with slow I/O access. Networked shared file-systems without low-latency, showed this performance degradation as well, added up to the MPI communication layer saturation. 5.2 Daemon behavior The GW EC2 being still experimental delivered a very smooth behavior. It monitors the pool of jobs and resources known to GridWay at each configuration step time (one minute by default). From there on, if there are pending jobs in the queue, it launches a determined amount of Amazon instances in order to satisfy the jobs conditions. The daemon stops launching new instances if there are no more jobs
93 IBERGRID 93 Job SITE Cores Prolog Pre-hook Execution Post-hook Epilog Total [hh:mm:ss] ml2d AWS EC2 1 2:46 3:13 2:26:38 0:01 1:01 2:33:49 ml2d AWS EC2 2 2:30 1:54 1:17:26 0:01 1:42 1:23:33 ml2d AWS EC2 4 3:01 1:26 0:41:25 0:01 1:15 0:47:08 ml2d AWS EC2 8 2:39 0:57 0:22:07 0:01 1:15 0:26:59 ml2d IFCA-CSIC 4 0:04 3:30 0:48:41 0:05 0:02 0:52:22 ml2d IFCA-CSIC 8 0:04 4:57 0:29:26 0:05 0:02 0:34:34 ml2d IFCA-CSIC 16 0:04 3:33 0:16:11 0:02 0:02 0:19:52 Table 3. Detailed elapsed time at the different phases for each job. For the IFCA- CSIC the pending time on the remote queue has not been taken into account. waiting on the local system or if the policy budget rate has been reached, leading to a maximum amount of enrolled machines. We deployed an infrastructure with a running costs limited to $10/h, using at most 14 simultaneous instances as shown in Table 4. The enrollment of the instances into the GridWay pool of resources happens as soon as the GridWay daemon is scheduled to do a discovery of new hosts. Because this is a costly action, the daemon realizes about new resources every DISCOVERY INTERVAL and we set that value to 60s. When there are no more jobs in a given cloud resource nor pending in the local queue, then the instance is marked for shutdown but kept as long as it approaches the billed hour, just for the case of new entering jobs. The daemon keep a historic log of the machines but currently cost savings are only spared and not used at a later moment. HID OS ARCH MHZ %CPU MEM(F/T ) DISK(F/T) N(U/F/T) LRMS HOSTNAME 0 LinuxUnknown x86_ / / /8/8 jobmanager-ssh ec compute-1.amazonaws.com 1 LinuxUnknown x86_ / / /8/8 jobmanager-ssh ec compute-1.amazonaws.com 2 LinuxUnknown x86_ / / /8/8 jobmanager-ssh ec compute-1.amazonaws.com 3 ScientificCERNS x86_ / /0 0/1456/1456 jobmanager-sge egeece01.ifca.es 4 ScientificCERNS x86_ / /0 7/1456/1456 jobmanager-sge egeece03.ifca.es 5 ScientificCERNS i /512 0/0 0/12/12 jobmanager-lcgpbs ce.cp.di.uminho.pt 6 LinuxUnknown x86_ / / /8/8 jobmanager-ssh ec compute-1.amazonaws.com 7 LinuxUnknown x86_ / / /8/8 jobmanager-ssh ec compute-1.amazonaws.com 8 ScientificCERNS x86_ /2048 0/0 3/406/416 jobmanager-lcgpbs ce.iaa.csic.es 9 LinuxUnknown x86_ / / /8/8 jobmanager-ssh ec compute-1.amazonaws.com 10 LinuxUnknown x86_ / / /8/8 jobmanager-ssh ec compute-1.amazonaws.com 11 LinuxUnknown x86_ / / /8/8 jobmanager-ssh ec compute-1.amazonaws.com 12 LinuxUnknown x86_ / / /8/8 jobmanager-ssh ec compute-1.amazonaws.com 13 LinuxUnknown x86_ / / /8/8 jobmanager-ssh ec compute-1.amazonaws.com 14 ScientificSLSL x86_ /4058 0/0 0/36/36 jobmanager-lcgsge ce01.up.pt 15 LinuxUnknown x86_ / / /8/8 jobmanager-ssh ec compute-1.amazonaws.com 16 LinuxUnknown x86_ / / /8/8 jobmanager-ssh ec compute-1.amazonaws.com 17 LinuxUnknown x86_ / / /8/8 jobmanager-ssh ec compute-1.amazonaws.com 18 LinuxUnknown x86_ / / /8/8 jobmanager-ssh ec compute-1.amazonaws.com Table 4. Output of the gwhost command from the user interface showing the hybrid grid-cloud resources. A set of 14 Amazon instances are enrolled in the pool with the hosts of the lcg-ces with MPI-Support in Ibergrid.
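A quick check of the limit quoted above, using the c1.xlarge price from Table 2 and the $10/h budget rate:

# Worked check of the deployment limit: with c1.xlarge at $0.68/hour and a
# budget rate of $10/hour, at most 14 instances can run at the same time.
import math

budget_rate = 10.0      # $/hour allowed by the budget policy
instance_price = 0.68   # $/hour for a c1.xlarge instance (US-EAST)

max_instances = math.floor(budget_rate / instance_price)
print(max_instances)                            # 14
print(max_instances * instance_price)           # 9.52 $/hour actually committed
print((max_instances + 1) * instance_price)     # 10.20, which would exceed the budget rate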
5.3 Image classification and averaging

After the executions of the ML2D algorithm on the grid, the output data are transferred to a local machine for further analysis. The algorithm produces averaged images and doc files for each iteration. The doc files provide detailed information about the classification and alignment process, for example, the proportion of experimental images belonging to each average, the rotations and translations of each image to be aligned, etc. The averaged images must be visualized with specific graphical applications available in the Xmipp package. Fig. 5 shows an example of these results. Whereas the original experimental images are noisy and featureless, the ML2D output averages display an increased signal to noise ratio which allows the description of structural details of the macromolecular complex. Several parameters of the macromolecule, such as size, shape, geometry, number of subunits, etc., can be inferred from the averaged images. This information is extremely useful for molecular biologists, since structure and function are strongly related in biological macromolecules. Thus, the analysis of the averaged images of a macromolecular complex could shed light on how the complex works. In our example, the ML2D classification and alignment of 1000 experimental images yielded two types of averaged images, one of them circular with seven main masses and the other one square with four parallel strips. Based on this information we are able to state that GroEL is a macromolecule organized as a cylinder with a seven-fold symmetry axis, as has been observed in numerous previous works.

Fig. 5. ML2D classification of 1000 electron microscopy experimental images into 4 final improved images after a typical run of the application.

5.4 Troubleshooting

During the deployment of the research infrastructure, some technical issues arose, hindering the successful interoperability of the grid-cloud platform. Let us summarize the most relevant aspects:
GridWay ssh MAD. We had some trouble when using it in a multi-user environment. Some log files were written by one user into /tmp/ssh.log and /tmp/ssh tm.log but could not be overwritten by other users, rendering the MAD usable by a single user only.

GRAM2 and MPI-Start versioning at sites. Because each site deploys a different version, some of the jobs described through the rsl job submission format used by GridWay did not run properly. At the IAA-CSIC and UMinho-CP sites, instead of receiving NP processors, a single processor was given, narrowing the availability of resources. Curiously enough, those sites did not show any problem whatsoever running parallel jobs with the same version of MPI-Start and the glite JDL.

Firewall. The fact that CSIC uses a firewall for controlling the access and the behavior of network traffic makes distributed computing more difficult. We dealt with unexpected time-outs due to inactive sessions which were still waiting for updates on job status, causing frequent errors. We found out that inactive connections are killed shortly after one hour, and for this reason we could not keep ssh connections alive any longer without patching the GridWay ssh MAD.

Memory and network restrictions in the user interface. The GridWay metascheduler required a lot of memory to keep track of an enlarged pool with jobs connecting through ssh to cloud resources. We experienced instabilities coming from saturated ethernet connections and large memory usage, causing the kernel to kill the gwd daemon due to out-of-memory conditions during a stress test. The user interface had 2GB of RAM and 1GB of swap, but regardless of the hardware specifications, a tuning of the configuration parameters is required to ensure full stability and permanent handling of the jobs.

6 Summary

A working model of a service manager in which grid resources are combined with HPC resources from Amazon Web Services has been presented in this paper. The restrictions regarding MPI and VO support were high, leading to a shortage of available resources from the grid. To provide more power, Amazon instances have been customized and used on-demand. A novel component which interfaces with both the Amazon EC2 cloud and GridWay allows for utility usage after the saturation of grid resources, but it is limited by the budget rate and system stability. The approach established has been a simple one, and some work can be done in order to allow for more complex policies, such as prioritizing jobs or using adaptive frameworks for minimizing expenditure. It is also mandatory to consider more machine types and providers. The use of a single type of instance (c1.xlarge) could be overcome with the recently announced Amazon Cluster Compute Instances, capable of joining into a single customizable larger cluster. A real HPC application coming from the Xmipp package has been used in this context. The application performs image processing of microscopy data and has been ported to the grid-cloud environment, taking advantage of the parallel processing
thanks to the MPI-Start suite. We compared a deterministic run of a thousand images depending on the number of cores used and on the site. The results make some interesting conclusions noticeable: the application parallelization is quite good, as the execution time decreases geometrically with the number of cores used. However, typical real scenarios would use larger input data sets, one order of magnitude larger, which require, in turn, solving the inactive connection problems and better management of input storage.

Acknowledgements

This research was funded by the AWS in Education Research Grants program by Amazon LTD., the GRID-CSIC project funded by the Spanish National Research Council-CSIC and by the Grant MAT from the CICYT (Comisión Interministerial de Ciencia y Tecnología, Spain). We are grateful to Javier Fontán for his advice on using the ssh plugin for GridWay and for his help with debugging. We would also like to thank the site administrators of the different Ibergrid sites: Tiago Sá from UMinho, Rui Ramos from UP, José Ramón Rodón from IAA-CSIC, and Álvaro Fernández from IFIC-CSIC (whose lcg-CE ce02.ific.uv.es was upgraded to CREAM during the research period), regarding the support offered for the MPI jobs and VO duties. We are also indebted to the IFCA-CSIC staff and in particular to Enol Fernández, without whose support with the MPI-Start package this study could not have been carried out.

References
1. Gartner's Hype Cycle Special Report for 2010. Gartner, Inc (5 August 2010).
2. Availability and Reliability monthly statistics, EGI document 402, February 2011.
3. E. Huedo, R.S. Montero, I.M. Llorente. Software - Practice and Experience 34 (7) (2004)
4. C. Vázquez, E. Huedo, R.S. Montero, I.M. Llorente. Future Generation Computer Systems 27 (5), (2011). doi: /j.future
5. K. Dichev, S. Stork, R. Keller, and E. Fernández. Computing and Informatics, 27(3), (2008)
6. J. Frank. New York: Oxford University Press (2006)
7. C.O.S. Sorzano, R. Marabini, J. Velazquez-Muriel, J.R. Bilbao-Castro, S.H.W. Scheres, J.M. Carazo, A. Pascual-Montano. J. Struct. Biol. 148(2), (2004). doi: /j.jsb
8. S.H.W. Scheres, M. Valle, R. Nuñez, R. Marabini, C.O.S. Sorzano, G.T. Herman, J.M. Carazo. J. Mol. Biol. 348(1), (2005). doi: /j.jmb
97 IBERGRID 97 An Automated Cluster/Grid Task and Data Management System Luís Miranda, Tiago Sá, António Pina, and Vítor Oliveira Department of Informatics, University of Minho, Portugal Abstract. CROSS-Fire is a research project focused on wildfire management related topics, that defines an architecture integrating three main components: the CROSS-Fire Platform, implemented as an OGS-WS compatible Web Processing System layer; a Spatial Data Infrastructure platform; a Distributed Computing Infrastructure, supporting both cluster and grid environments. The main objective is to exploit Cluster/Grid middleware in order to: i) allow the concurrent execution of multiple and large simulations, ii) access and manage large data input/output files, iii) create and handle a database of past simulations and iv) allow remote and interactive monitoring of the fire simulation growth. The developed tools manage the tasks and control the data flow between components, providing the advantage of fast implementation and easy testing, on cluster and grid environments. Key words: Civil Protection, Cluster, Ganga, Grid 1 Introduction The CROSS-Fire research project 1, focused on Civil Protection topics, particularly wildfire management, defined an architecture that integrates three main components: i) the CROSS-Fire Platform, ii) a Spatial Data Infrastructure (SDI) and iii) a Distributed Computing Infrastructure. The platform is implemented as an OGC-WS compatible Web Processing Service (WPS) layer, that deals with the functionalities of its three components: Business Logic, Geospatial services and Cluster/Grid services. The Business Logic is configured to handle the wildfire management algorithms of the platform, namely, forest fire propagation and wind field calculation. We describe a fast implementation and easy testing platform that uses either a cluster or the EGEE/EGI and the interaction between the WPS, a data storage medium and an application execution environment. The CROSS-Fire platform executes fire and wind simulation applications, requiring powerful computational resources and produces a considerable amount of data that needs to be accessible everywhere. The main objective for exploiting Cluster/Grid middleware is: i) to allow the concurrent execution of multiple and 1 CROSS-Fire project homepage:
large simulations, ii) to access and manage large data input/output files, iii) to create and handle a database of past simulations and iv) to allow remote and interactive monitoring of the fire simulation growth.

Running a simple job on the grid requires many interactions between the different underlying services. This adds extra overhead to the process and increases the time users have to wait for the results. During the application's development phase, the code has to be repeatedly tested for debugging, which makes debugging grid applications more time consuming. In order to test our work in a controlled environment with a quicker development cycle, we resorted to the job management tool Ganga[5]. This allows the user to easily swap the endpoint where the job will run, without major changes to the submission process. In this way we increased the flexibility of our system and introduced code to support our cluster infrastructure in addition to the grid. Also, to simplify some tasks in the CROSS-Fire project, we took advantage of the flexibility offered by the operating system and the cluster/grid middleware and made use of scripting to implement an automated task and data management system.

This article describes the developed system, starting with an overview of the distributed computing environment and the associated tools (section 2). Section 3 introduces part of the previous work, which is used for file monitoring and wind/fire simulation. The design considerations and implementation of the system are described in section 4. The document ends with future work considerations, followed by some final conclusions.

2 Distributed Computing environment

The developed platform makes use of cluster and grid resources to run its jobs - a job being a computer program, or a set of programs, scheduled according to some specific rules in a batch system. The Ganga job management tool offers an abstraction of the cluster, grid or other supported back-end. One of the advantages of Ganga is the possibility of using Python scripts to create and submit jobs in a fully automated way. Another advantage of Ganga is that, due to the abstraction of the back-end, one only needs to describe jobs once. This tool was deployed on the system, increasing the efficiency of the computational resource management task.

Tools On the grid side, the main tool, currently being used as a base of our software, is the glite middleware (recently updated to version 3.2), an integrated set of components designed to enable resource sharing on the grid. Each job is described using a high level language, detailing its characteristics and constraints. More information about this subject can be found in the glite User Guide[8].
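As an illustration of this back-end abstraction, the sketch below shows the kind of Ganga script used to describe a job once and submit it to either the grid or the cluster. It is meant to be run inside the ganga interpreter, which provides the Job, Executable, File, LCG and PBS classes; the executable and file names are placeholders of our own, not part of the CROSS-Fire code.

# Sketch of a Ganga submission script; run through the ganga interpreter,
# which injects Job, Executable, File, LCG and PBS into the namespace.
def make_job(backend):
    j = Job()
    j.name = "crossfire-simulation"
    j.application = Executable(exe="run_simulation.sh", args=["execution-id-42"])
    j.inputsandbox = [File("run_simulation.sh")]
    j.outputsandbox = ["results.tgz"]
    # Swapping the execution endpoint only means changing the backend object;
    # the rest of the job description stays the same.
    j.backend = LCG() if backend == "LCG" else PBS()
    return j

job = make_job("PBS")   # or "LCG" for a grid execution
job.submit()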
The grid is composed of a large set of sites, which share their resources with the users of specific Virtual Organizations (VOs). To ease the task of installing, configuring and maintaining a grid site, a specific software module, named egeeroll, was created within the CROSS-Fire project scope[2,4]. This tool is currently being used to manage the UMinho-CP grid site, supporting several VOs from different international projects, such as EELA, IBERGRID and ENMR.

On the cluster side, we used the Portable Batch System (PBS), whose primary function is to provide job scheduling and allocation of computational tasks among the available computing resources. PBS has a way to specify the files that are to be transferred to the job execution host and the files that are to be obtained upon job completion, using the stagein and stageout options.

3 Previous Work

The platform presented here is supported by a set of tools previously developed within the project scope. The most relevant building blocks are:

Web Processing Service (WPS) - a set of algorithms that deal with the functionalities of the different components of the CROSS-Fire platform[6].
Console FireStation (CFS) - an interface, based on gvSIG, that provides a rich interaction to the user, including visualization, metadata queries and spatial data queries[6].
Watchdog - an application that allows checking the changes that occur in the content of a file. This tool is useful in the grid context, since it provides access to the results of long jobs while they are still running, by sending those changes to a repository, thus opening the possibility of catching the data being produced in near real time[7].
FireStation - a desktop environment specially designed for the simulation of wildfires, based on wind field and topography models[1]. The fire propagation module was initially gridified to allow its execution on the grid. Afterwards, a parallel version of the sequential fire spread algorithm was developed that runs in MPI to accelerate the execution[3].
Simulation applications - Nuatmos and Canyon are two external applications, developed in Fortran, that implement the mathematical wind models with the same names. These models simulate a wind field over some topography. The Nuatmos model's range of application is limited to relatively smooth topographies. However, the solutions are realistic in most cases. The application code is not very time consuming and is numerically stable[1]. The second model, Canyon, is a 3D Navier-Stokes solver for a generalized coordinate system[1].
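A conceptual sketch of the file-monitoring idea behind the Watchdog tool described above (not the tool's actual code): it periodically reads whatever has been appended to a growing output file and pushes it to a repository, so that partial results can be inspected while the job is still running. The push_to_repo callback is a placeholder for the real upload mechanism.

# Conceptual sketch only, not the project's Watchdog implementation.
import os
import time

def watch(path, push_to_repo, interval=10.0):
    offset = 0
    while True:
        if os.path.exists(path) and os.path.getsize(path) > offset:
            with open(path, "rb") as f:
                f.seek(offset)          # read only what was appended
                chunk = f.read()
            offset += len(chunk)
            push_to_repo(chunk)         # e.g. an upload towards the WPS side
        time.sleep(interval)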
100 100 IBERGRID 4 The Management System The CROSS-Fire management system consists of a set of modules (see fig.1), each one performing specific functions: Configuration - configures the application to adapt it to the running environment Simulation Launcher - it launches the execution of a job in the cluster or grid Simulation Folder Management - responsible for data management and storage Simulation Execution Control - the application that runs tasks and controls its execution Fig. 1. System Modules 4.1 Job Submission Overview In the CROSS-Fire platform context, a job is an execution of a certain model. Bellow, we describe how the process works (see figure 2): 1. The Console FireStation (CFS) makes a request to the Web Processing Service (WPS); 2. The WPS issues an execution order to the Simulation Folder Management (SFM), passing information about the simulation and the execution, which was received from the CFS. A back-end to execute the job should be chosen, which can be LCG for a grid execution or PBS for a cluster execution; 3. The SFM sends a data download request to the GeoServer; 4. The GeoServer answers the request with the appropriate data; 5. The SFM creates a directory structure, locally and remotely, according to the back-end - in a Storage Element (SE) on grid, or in a disk server on the cluster. Then it carries the necessary data conversions and copies the data into the corresponding locations; 6. The SFM passes the identification of the execution to the Simulation Launcher (SL), that will create a task description;
101 IBERGRID The SL uses Ganga to start a task on the chosen back-end. The application to be executed by the task is the Simulation Execution Control (SEC); 8. When the SEC is scheduled for execution in the grid back-end, it obtains the input data and software sources from the SE, while in a cluster execution, PBS manages to pass the input (stagein) data to the execution machine; 9. At the end, the data files and produced meta-data are uploaded to the SE, or grabbed by the PBS, which transfers it (stageout) to the disk-server; 10. For the duration of the simulation execution, the Watchdog monitoring tool is continuously sending portions of the produced data to the WPS; 11. The WPS copies the produced data to the AMGA database; Fig. 2. Job launch description 4.2 Configuration This module defines a set of variables containing frequently used information about the execution environment. The tables 1 and 2 show the different options available for the two back-ends, cluster and grid execution, respectively.
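As an illustration of how such an environment description might look in practice, the sketch below renders the options of tables 1 and 2 as Python dictionaries. It is hypothetical: the option names are taken from the tables, but their exact spelling and the values shown are assumptions.

    # Hypothetical rendering of the configuration variables of tables 1 and 2.
    # Names follow the tables; all values are placeholders.
    CLUSTER_CONFIG = {
        "CF_PBS_SE_HOST": "diskserver.example.org",   # disk server
        "CF_USER":        "crossfire",                # user on the disk server
        "PBS_OPTS":       "-l walltime=12:00:00",     # batch system extra options
        "PBS_CF_ROOT":    "/data/crossfire",          # root path in the remote machine
        "GANGA_PATH":     "/opt/ganga",               # local Ganga installation
    }

    GRID_CONFIG = {
        "CROSSFIRE_SE":  "se01.example.org",          # SE host
        "CROSSFIRE_VO":  "crossfire.example.org",     # grid virtual organization
        "LFC_HOST":      "lfc.example.org",           # File Catalog host
        "CE":            "ce01.example.org",          # computing element
        "LCG_CF_ROOT":   "/grid/crossfire",           # root path in the grid SE
        "GANGA_PATH":    "/opt/ganga",                # local Ganga installation
    }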
102 102 IBERGRID Table 1. Configuration options for cluster executions Option Description CF PBS SE HOST disk server CF USER username to access the disk server PBS OPTS batch system extra options PBS CF ROOT root path in the remote machine GANGA PATH path to the local Ganga installation job files Table 2. Configuration options for grid executions Option Description CROSSFIRE SE SE host CROSSFIRE VO grid virtual organization LFC HOST File Catalog host CE computing element LCG CF ROOT root path in the grid SE GANGA PATH path to the local Ganga installation job files 4.3 Simulation Folder Management (SFM) A simulation folder is a place where to store the data needed by the executions. This data is the physical representation of the meta-data contained in the AMGA database. The objective of this replication is not only to preserve data for postmortem analysis, but also to accelerate the access to the data available from past executions[6]. The location of this folder depends on the job execution s back-end system. If the executions are to run on the grid, the data files have to be stored in a grid SE. If the executions are to run in a cluster, the data files have to be stored in a remote disk server. Data Storage and Upload The SFM consists of a Bash script and a Java application, working together to manage the data files related to simulations and executions, thus replicating the metadata database. The management of this data comprises three phases: gathering, conversion and storage. The first phase starts when the SFM obtains the input data. Data descriptions are received from the WPS in a XML data-structure, containing links to topography related data, fetched from the GeoServer, and other data related to the execution, that is to be copied to files on the storage system. Then, some of the data in XML format needs to be converted to a format suitable to the software that will use it. The fuel description, barrier information, ignition points and meteorology information are also described in the XML data
103 IBERGRID 103 structure that needs to be written in a proper format, to a dedicated file, using a Java Library. Data storage is made in two different ways, depending on the storage system being used. In the grid, the data is copied from the local machine to the SE using the LCG Data Management tools. In the cluster, the data is copied from the local machine to the remote storage machine using the Secure Copy (SCP) tools. Data organization Data is organized in a hierarchical structure, separated by simulations, each one having a distinctive identification. Each simulation folder contains: terrain data, fuel description data, fuel distribution data, a user id, simulation date and description of the simulation area. Each simulation folder may also contain several execution folders, where the executions that belong to the same simulation are stored. Each execution has also a unique identification. Each execution folder contains the grid or cluster job identification, job info, job options, job stdout and stderr, job area, start date, end date and other information related to existent execution algorithms currently supported: fire spread and wind field calculation. In case of a fire spread execution, the folder also contains the ignition points, the barrier points, FireStation control parameters and the identification of the wind field already computed. In case of a wind execution, the execution folder will have precision information, the wind execution model, Nuatmos and Canyon are currently supported, and a file with information about wind conditions, provided by the weather stations existent in the area of simulation. 4.4 Simulation Launcher (SL) The Simulation Launcher is a Python script used to submit jobs both in the supported back-ends, which provides a simple way of benefit from Ganga features. The type of algorithm to launch can be easily specified through a set of options, described in table 3. Option [model] <backend> <sim-id> <model-execution-id> Table 3. Simulation Launcher options Function model(s) to execute in the task back-end where the job will run (PBS for cluster, LCG for grid) simulation id number model execution id number <set [model-execution-id]> defines a set of jobs, which belong to the same simulation, that can be launched together. The main job is called a parametric job <depend> job dependence definition in PBS back-end jobs
104 104 IBERGRID Using these options, it is possible to define the kind of job to be created by Ganga. An example of a command used by the WPS to run an execution is: submitjob.py backend=lcg wind fire sim-id=237 fire-execution-id=55 windexecution-id=54. The creation of a job via Ganga consists of the definition of the executable to run and the arguments to be passed to that executable, according to the corresponding Job Description Language (JDL). Optionally, the Computing Element (CE) where one wants to run the job may also be specified to allow execution on a controlled environment. The arguments that are passed to the application are listed bellow: Execution type: type of execution, or executions, to run on the job. Currently, it can be: fire, wind, or both Path to execution: path to the execution folder in the storage system Path to simulation: path to the simulation folder in the storage system Path to source files: path to the source files folder in the storage system Simulation id: simulation identification number Execution id: execution identification number In the case of the grid back-end, we add to the previous list: a) Virtual organization; b) Storage element; c) File Catalog hostname. While some of these arguments are passed as an argument in the SL command, other arguments are read from the configuration file. To create a job in the LCG back-end, it is only necessary to specify the executable and the respective arguments, while with the PBS cluster back-end, job creation is more complex. In this case, it is also necessary to define the files that are to be copied to the user home (stagein) and the files that must be copied from the user home, back to the storage system, when the job terminates (stageout). Another feature of the SL is the capacity to launch parametric jobs. A parametric job causes a set of jobs to be generated from one single Job Description Language (JDL) file. This is useful in cases whenever many similar jobs must be run with different input parameters. Using this feature, Ganga can launch several jobs at the same time, saving time during job submission. Currently, Ganga doesn t support parametric jobs, so, to add this feature to the SL, we used the argument splitter of Ganga. Through this mechanism, one can split the application arguments string field in the job description. So, while a normal job receives a set of arguments, a parametric jobs receives a set of argument sets, where each element of the initial set will be assigned to a sub job application. However, this feature has some limitations. There can be no data dependencies between the jobs that belong to a parametric job, because, currently, there is no way to specify the time execution order of the jobs. 4.5 Simulation Execution Control (SEC) The Simulation Execution Control is a Bash Script application that controls the execution of a job. It makes decisions, based on the received arguments, to determine which application source code to obtain, compile and execute (figure 3). In
105 IBERGRID 105 addition, it downloads the data files that will be used as input and initiates the Watchdog monitoring tool. This application may be used both in the cluster and in the grid, since the back-end option, specified on the SL, is passed to the SEC. In the case of a grid execution, the LCG Data Management tools are used to obtain the source code and the input data for the application to run. In the case of a cluster execution, the SEC does not need to obtain any data from any remote machine. The data is obtained through the PBS stagein option, which makes the data locally accessible to the user before the job starts so that the SEC may copy it to the execution folder. The available simulation programs were built in a way that they are continuously writing the produced values in a file, as soon as they are computed, so one can catch the produced values in real-time. In the CROSS-Fire platform, this is done using the Watchdog, which sends the produced values to the WPS. The Watchdog is also used to determine if the job execution terminated without errors. Whenever the execution of the job does not end correctly, the Watchdog sends a signal to the WPS with the word incomplete. Otherwise, it sends to the WPS the word completed. At the end of the job execution, the produced data and meta-data must be saved in the storage system. Again, on a grid execution, that information is uploaded to the SE using the LCG Data Management tools. On a cluster execution, data is copied to a user folder. Afterwards, the PBS function stageout gathers the output files and copies them to the remote disk server machine. SEC has the capacity of executing several simulation applications in the same job. This means that one can launch a wind simulation, followed by a fire simulation that uses the data produced by the earlier. This feature is useful for two reasons: i) because grid jobs usually take a lot of time to start, one doesn t have to wait for the conclusion of the first job to launch the second one. ii) Ganga, doesn t support job dependencies. A simple alternative is to be able to submit several executions within the same job. Fig. 3. Grid Simulation Execution Control workflow
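The argument-splitter workaround for parametric jobs described in section 4.4 can be sketched as follows. The snippet is meant to run inside a Ganga session; the wrapper name and argument values are invented, and ArgSplitter's exact interface may differ between Ganga versions.

    # Sketch of the argument-splitter workaround for parametric jobs (section 4.4).
    # Job, Executable, LCG and ArgSplitter are provided by the Ganga session.
    j = Job()
    j.application = Executable(exe='sec.sh')      # hypothetical SEC wrapper
    j.backend = LCG()

    # One sub-job per argument set; each set would normally point to a
    # different execution folder of the same simulation.
    j.splitter = ArgSplitter(args=[
        ['wind', 'sim-237', 'exec-54'],
        ['wind', 'sim-237', 'exec-56'],
        ['wind', 'sim-237', 'exec-58'],
    ])
    j.submit()                                    # submits all sub-jobs at once

Since sub-jobs created this way are independent, the dependency limitation noted in section 4.4 still holds; chaining a wind run and a fire run inside a single SEC job, as described above, remains the practical way to express ordering.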
106 106 IBERGRID 5 Future Work The current platform needs to be more flexible to be able to integrate new modules, according to the specifications of newer and more sophisticated civil protection requirements. In order to fulfill this objective, the platform is being enhanced according to the following ideas: 1. Each software module should be delivered in packages, where each package would correspond to an execution model that can be executed in a job. The package would contain not only the source code of the application, but also a description of: i) how to compile the software, ii) the list of the files to download from the storage system before the execution and iii) the list of the files to upload to the storage system after the execution. This could be called Execution Model Packages (EMP). 2. The SEC, could be transformed in a way it could execute any type of module. Currently, the list of files to obtain is described in the SEC. This information could become a part of the EMP. 3. There should exist a SEC for each back-end. Currently, the SEC can be executed both in grid or cluster environments. The separation of this component would ease the development and the addition of new back-ends. It would be the SL s responsibility to choose which SEC to execute in a job. 4. During cluster executions, the input files are copied to the execution folder by the PBS stagein and the execution results are collected by the PBS stageout. Since the SL is the one that configures the files that are transferred by the PBS stagein and stageout, the SL must have a way to access the list of input and output files in the EMP. The ultimate goal is to develop an architecture where one can add new software packages in a simple way. 6 Conclusion We reported the work we carried out to automate the management of the data and executables within the CROSS-Fire platform, by making use of the tools made available by the cluster/grid middleware. Each described module started as an independent application. However, as the development advanced, it was necessary to use functions of one module in another module. This raised the need to integrate everything in a single application. As an example of integration, the download of input files in a grid execution is done by the SEC application, but on a cluster execution it is done by the PBS stagein and it is the SL responsibility to specify which files to transfer to the execution folder. Looking back, there are some loose ends that could be changed. The reason for this is that, during the development phase, the CROSS-Fire team gained a greater knowledge of the used tools. This knowledge can be applied in future iterations of this project.
107 IBERGRID 107 Acknowledgements This research was funded by the Portuguese FCT through the CROSS-Fire project. It also benefited from UMinho s participation on EC FP7 E-science grid facility for Europe and Latin America (EELA2) and, more recently, in the EC FP7 Grid Initiatives for e-science virtual communities in Europe and Latin America (GISELA). References 1. A. Lopes, M. Cruz, D. V. An integrated software system for the numerical simulation of fire spread on complex topography. Environmental Modelling Software, Volume 17, Issue 3, 2002, p A. Pina, B. Oliveira, A. Serrano, V. Oliveira. EGEE Site Deployment & Management Using the Rocks toolkit. In Ibergrid: 2nd Iberian Grid Infrastructure Conference Procedings (2008), Silva, F and Barreira, G and Ribeiro, L, Ed., pp nd Iberian Grid Infrastructure Conference (Ibergrid 2008), Porto, Portugal, May 12-14, A. Pina, R. Marques, B.Oliveira. FireStation: From Sequential to EGEE-Grid. Proceedings of the first EELA-2 Conference, Bogotá, Colombia, CIEMAT, February B. Oliveira, A. Pina,A. Proenca. EGEE site administration made easy. In Ibergrid: 4th Iberian Grid Infrastructure Conference Proceedings (2010), Proenca, A and Pina, A and Tobio, JG and Ribeiro, L, Ed., pp th Iberian Grid Infrastructure Conference (Ibergrid 2010), Braga, Portugal, May 24-27, F. Brochu, J. Ebke, U. Egede, J. Elmsheuser, K. Harrison, H. C. Lee, D. Liko, A. Maier, A. Muraru, G. N. Patrick, K. Pajchel, W. Reece, B. H. Samset, M. W. Slater, A. Soroko, C. L. Tan, D. C. Vanderster, M. Williams. Ganga: a tool for computational-task management and easy access to Grid resources. Computer Physics Communications, Volume 180, Issue 11, p Pina, A., Oliveira, B., Puga, J., Esteves, A., and Proenca, A. A platform to support Civil Protection applications on the GRID. In Ibergrid: 4th Iberian Grid Infrastructure Conference Proceedings (2010), Proenca, A and Pina, A and Tobio, JG and Ribeiro, L, Ed., Netbiblo, pp Braga, Portugal, May 24-27, R. Bruno, B. Barbera, E. Ingrà. Watchdog: A job monitoring solution inside the EELA-2 Infrastructure. 8. S. Burke, S. Campana, E. Lanciotti, P. Lorenzo, V. Miccio, C. Nater, R. Santinelli, A. Sciabà. Glite 3.1 User Guide.
SESSION 2: APPLICATIONS SESSIONS
DISET protocol for DIRAC

A. Casajús 1 and R. Graciani 2

1 Institut de Ciències del Cosmos (ICC), Universitat de Barcelona [email protected]
2 [email protected]
Corresponding author: [email protected]

Abstract. DIRAC, Distributed Infrastructure with Remote Agent Control, is a software framework that allows a user community to manage computing activities in a distributed environment. DIRAC started as a project to take care of distributed Monte Carlo simulation and it now manages all LHCb computing activities. DIRAC is currently being tested by Belle II, ILC, CTA, EELA and GISELA. DIRAC has been designed as a collection of collaborating distributed components communicating with each other using the DISET protocol. Distributed systems rely heavily on secure and reliable communications. DISET was designed to provide RPC and transfer mechanisms over an SSL communication layer with authorization capabilities. Recent developments in DIRAC require extra functionality in DISET. Apart from other modifications, fixes derived from 3 years of experience and an update to a newer OpenSSL version, the new DISET version adds two new functionalities: a persistent communication mechanism and keep-alive signaling to prevent long queries from failing due to timeouts. This paper presents a review of DISET's main features.

1 Introduction

The DIRAC [1] project started in 2002 as a software tool to manage LHCb (see [2] and [3]) Monte Carlo simulation jobs in an efficient manner. The LHCb experiment is the Large Hadron Collider Beauty experiment at CERN, primarily intended for precise measurements of CP violation and rare decays in b-physics. Once DIRAC properly handled Monte Carlo simulation, it started to grow to cover all the LHCb computing and data activities. DIRAC was completely redesigned in 2007 to be able to cope with the challenge. LHCb's DIRAC instance handled an average of 7k concurrently running jobs, with peaks over 30k running jobs, on more than 120 sites around the globe. DIRAC was designed as a group of collaborating components. There are two main classes of components: Services and Agents. A Service is a passive component that processes the queries it receives. Agents are components that periodically execute a task. For instance, an agent can keep track of the status of running jobs, check that data transfers are finishing properly or send more job pilots if required. Typically, agents check the status of a resource either by requesting information
from services or by checking resources using the required grid middleware. If some condition is met, they react accordingly.

Initially DIRAC used XML-RPC [4] to connect services and agents. XML-RPC is a Remote Procedure Call (RPC) protocol that uses XML to serialize the requests and HTTP to transport them. XML-RPC works by sending an HTTP request to a server that understands the protocol. The request can have zero or more parameters and a method name. The server then executes the specified method with the passed parameters and returns the result as the HTTP response. XML-RPC presented several limitations. No binary data could be sent without being converted to base64, and big chunks of data had to be sent in one go, thus severely limiting the possibility of transmitting files. For small requests, a large overhead was added by the XML encapsulation. On top of that, plain XML-RPC did not provide any authentication or authorization mechanisms.

To solve some of these problems a temporary solution was adopted. A Python [5] module called pygsi was created to provide OpenSSL [6] bindings for Python that allow grid proxies to be authenticated through a specialized authentication callback. Using this module, some modifications were made to the Python XML-RPC library to allow authorization based on the authentication done by OpenSSL. OpenSSL is able to authenticate X509 certificates [7] but not certificate proxies compatible with Globus [8] Toolkit version 2. pygsi provides Python with access to OpenSSL functionality and adds an authentication callback so that OpenSSL can accept grid proxies. Based on this authentication, each method can have an authorization rule that defines who can invoke it. The second version of DIRAC used this XML-RPC + SSL solution and was deployed in production.

For the third version of DIRAC, a more integrated and capable solution was needed. Sending and receiving binary files of arbitrary size was a strong requirement. DISET was designed to overcome these limitations. DISET dropped the use of XML-RPC in favor of an in-house protocol that provides RPC and file transfer capabilities. This third version was deployed in production. LHCb (as the project originator) has the biggest production instance of DIRAC. Due to the growth in the number of jobs and in the CPU requirements of each job, a redesign of the DIRAC WMS (Workload Management System) is underway. This redesign tries to reduce idle times when processing jobs. To do so, a faster response time is needed between services and agents. A new version of DISET has been developed to provide a stable connection mechanism that allows asynchronous communication between clients and services.

This document is organized in the following way. Section 2 describes DISET's functionality; it also describes how data is serialized to minimize the impact serialization has on the communication layer. Section 3 details the new features that DISET will provide to DIRAC and how they will affect DIRAC's request processing throughput. An overview of DIRAC's performance over the last year is given in section 4. Finally, section 5 summarizes the information presented in this document.
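Before moving on, the XML-RPC pattern described above can be illustrated with Python's standard library alone. The echo service and port number below are a toy example, not DIRAC code; the Python 3 module names are used, the Python 2 equivalents being xmlrpclib and SimpleXMLRPCServer.

    # Toy XML-RPC echo service and client using only the standard library.
    from xmlrpc.server import SimpleXMLRPCServer
    from xmlrpc.client import ServerProxy
    import threading, time

    def run_server():
        server = SimpleXMLRPCServer(("localhost", 8000),
                                    logRequests=False, allow_none=True)
        # Any registered function becomes remotely callable by name.
        server.register_function(lambda text: text, "echo")
        server.serve_forever()

    threading.Thread(target=run_server, daemon=True).start()
    time.sleep(0.5)                      # give the toy server time to start

    proxy = ServerProxy("http://localhost:8000")
    print(proxy.echo("ping"))            # the XML/HTTP round trip is transparent

DISET's RPCClient keeps this call-a-remote-method-as-if-local convenience (see section 2.2) while replacing the XML/HTTP transport with its own encoding and an SSL layer that understands grid proxies.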
113 IBERGRID DISET functionality DISET provides DIRAC developers an easy way to code distributed, scalable, secure and redundant services and the clients that make use of those services. All communication is encrypted by default. Clients and services have to present valid certificates to the other end. Once the connection is stablished, an SSL handshake takes place to ensure that both sides validate the other one. Once both ends are authenticated the query is sent. DIRAC services have to process requests in a concurrent way to be able to scale (see section 4). To do so, DISET services are all multi-threaded automatically. Each request is processed in a different thread. To minimize memory clashes, for each request a new handler object is created so the request processing is sandboxed. DISET can also limit the amount of threads and the maximum concurrent call to a single method. Developers can explicitly create objects shared between threads outside the request sandbox. To create these objects there are initialization methods that are called at the start up of the service, before any request is processed. 2.1 Data serialization DISET uses Dencode as the encoding system to serialize data. Dencode is an in-house modification of Bencode [9] suited to DISET to serialize the requests. Bencode was chosen because: It can contain binary data. It is not considered a human-readable encoding but many encodings can be decoded manually. Adds a very small overhead to data compared to other serialization methods. This overhead reduction leads to less data streamed through the network; that means less time spent doing I/O. And less CPU used to serialize/deserialize and to crypt/decrypt, leaving more CPU free to actually process the request. It s really simple Dencode is a superset of Bencode. Pure Bencode was not directly used because it does not allow to encode some data types such as floating point values, booleans... that Dencode does. Dencode is able to serialize all basic Python data types, and some others that are heavily used in DIRAC, such as Python s date time type. Dencode can be used to encode binary files thus allowing DISET to send and receive binary data. 2.2 RPC DIRAC started using XML-RPC for communicating services and agents. XML has been replaced by Dencode but RPC is still the main protocol of communication within DIRAC. XML-RPC works by wrapping in a XML stub the request (method to call and arguments to the method), sending them over HTTP to the server, wait for it to send back the answer via HTTP response and unwrap the XML answer (see
114 114 IBERGRID fig. 1). Python s XML-RPC library provided a nice interface for creating clients and issuing XML-RPC requests. To create an XML-RPC client, an object had to be created passing as an argument to the constructor the URL of the service to be contacted. Once the object had been created, the remote method could be invoked as if it was a local method. By calling an object s method with the arguments, the XML-RPC client object serialized the request with the arguments, sent it to the server, waited for the response, deserialized it and returned as if it was the return value of the local method. Maintaining the ease of use of the RPC capabilities was a priority when designing DISET. DISET provides a client object that s used the same way the XML-RPC was. This way DISET hides all the complexity of the protocol to the developers of DIRAC components. DISET provides an easy framework for developers to code RPC services. Developers only need to create a handler object inheriting from the base DISET handler. Any method belonging to the handler object that starts with a special prefix will be callable from a client. DISET also provides a mechanism to ensure that the parameters received belong to the proper set of data types. Developmers only need to define a list of possible data types valid for each argument and DISET will check that the received parameters match one of those specified. If any of these requirements fail or if the number of parameters doesn t match the client will receive the corresponding error. Fig. 1. Remote Procedure Call To invoke this method, a client only needs to create an RPCClient object, point it to the right server and just call the method as if it was a local method of the RPCClient object. Developers only need to specify the name 3 of the service to which RPCClient has to connect. All components in DIRAC belong to a given system. Systems are components grouped to achieve a more complex functionality. For instance, the DIRAC WMS (Workload Management System) contains several services such as the JobManager. To invoke the JobManager, a RPCClient would have to be initialized with WorkloadManagement/JobManager. LHCb s DIRAC instance is running jobs in more than 120 sites currently. In any distributed system there more spread out the components are, the more network glitches it will suffer. LHCb s DIRAC instance has to cope with the network 3 By specifying the logical name of a service, DISET will resolve the URL and connect to it. For instance service name Configuration/Server can resolve to dips://srv1.dirac.com:9135/configuration/server
115 IBERGRID 115 instabilities of all the sites it can run jobs on. To protect from network glitches, server crashes, etc. a timeout can be defined when instantiating a new RPCClient. Whenever a RPC is requested with that object, if there s no response in the specified time the connection is dropped and the client receives an error. Due to the variable nature of queries, a given timeout can be suitable for a set of arguments when calling a method, but it might not be for a different set of arguments when calling the same method. The only solution we had was to set big timeouts to prevent killing the slower queries, but that made fast queries to react slowly if there was any problem. If a host is overloaded with requests. A connection request might be blocked by the system. The new revision of DISET includes a mechanism to automatically retry if a connection failed before the request reached the service to minimize these interruptions. It also includes keep alive signals to maintain the connection open if the request is still being processed (see section 3.2) to solve the timeout problem for requests that may vary heavily in their execution time. 2.3 File transfer One of the DISET s basic design goals was the ability to transfer files. Services can provide methods for users to upload and download either bundles or single files to services. To enable file transfer on a service, a developer only needs to include a special method for each transfer direction and type of transfer (single file or bundle of files) that the service will support. When uploading or downloading from a server, service developers must use a FileHelper object. This object will transmit or receive bytes from the network from/to a defined data source or data sink. When sending a bundle of files DISET automatically creates a compressed tarfile containing them and sends it to the server. The server can choose to receive it as the tarfile or uncompress it easily. Clients can also choose if they want to compress the tarfile. To send or receive a file clients have to create a Transfer- Client object the same way the RPCClient can be created and call the sendfile, receivefile, sendbundle or receivebundle methods. These methods expect as parameters the file path or path list of files to be sent, and the file identification the service will receive for future reference. DISET file transfer was originally designed to transfer small files across DIRAC components. But over time, the need of transferring more and bigger files has become stronger. To transfer files, DISET opens a network connection and sends or receives data through it. In order to increment the number of bytes a DISET service can receive, DISET has to be able to transfer a file in parallel through more than one network connection. Currently DISET file transfer does not allow multi stream transfers, but it is a planned feature that will come in a future revision of DISET. 2.4 Authorization DIRAC is all about a community managing it s computing resources. To do so, DIRAC has a set of authorization rules to define what each user can do. To ease
116 116 IBERGRID defining the authorization rules, users are organized in groups. Each group has a set of properties that allows to execute a set of tasks. That way users that have a similar profile can be grouped so maintaining the privileges of users is simpler. DIRAC authorization is based on properties. The authorization rules don t define directly what each group can do, but define what properties are required. Each remote method provided by DIRAC services can grant execution permission to a set of properties. These properties are assigned to user groups and hosts. Properties are assigned to methods and groups/hosts in the DIRAC CS (Configuration Service) (see [10]). The DIRAC CS holds a global configuration that contains information of all the resources, components, users, groups... known to each instance of DIRAC. It is structured as a tree. The Authorization section for each service in the Configuration Service defines rules for RPC calls. Each option under the Authorization section defines a rule for an RPC method. For instance Authorization/echo = prop1, prop2 would allow anyone with prop1 or prop2 to query RPC method echo. For file transfers authorization is done based on the direction of the transfer instead of the name of the method called. There are three special keywords that can be used in the authorization section of a service: Default as a RPC method name in the Authorization section matches any RPC method that does not have an authorization rule in the section. authenticated as a property name allows any authenticated client. all as a property name allows any client to query that method. DISET can also work using normal sockets and this property allows unauthenticated clients to connect if the server is working in non-ssl mode. Server methods can provide default values for their authorization rule. If a method does not have a explicit authorization rule in the configuration, the default properties defined will be used for the authorization. This schema allows a flexible and powerful definition of authorization rules. Developers can assign sensible defaults for each method, but instance operators can overwrite them to suit the authorization to the Virtual Organization schema. The client identity is also provided to the invoked method to allow for a finer grade authorization. An example of finer authorization would be downloading sandboxes. All users can download their own sandboxes, but operators can download any sandbox to debug a job if needed. That method would allow two properties to call it, but once called, refines what can be downloaded based on the client group. 3 New DISET functionalities DISET initially provided RPC (Remote Procedure Call) and file transfer capabilities over an SSL channel. Each connection is authenticated using X509 certificates [11] or grid proxies [7]. Once authenticated, the request is checked against the
authorization rules built in. Every request requires opening a connection, authenticating the user, authorizing the request, executing it and returning the response, and closing the connection. On average, the SSL processing is the most CPU-demanding part of the whole request. One of the main goals when designing this new revision of DISET was to reduce the CPU overhead dedicated to communication. To do so, DISET has to provide a way to reduce the number of SSL connections and handshakes. Instead of opening a connection every time the client has to send a request to the server, for clients that need to communicate constantly with a service it is more efficient to open a connection, perform the SSL handshake once, and from then on send and receive requests asynchronously.

3.1 Asynchronous requests

LHCb ran more than 9 million jobs last year, as seen in figure 2. On average, a new job was started every 3 seconds. Due to the growth of the LHCb instance of DIRAC, a redesign of the DIRAC WMS (see [12]) is underway.

Fig. 2. Jobs executed by LHCb in the last year, grouped by type

One of the design goals of the new DIRAC WMS is to minimize idle times when processing jobs. In the current WMS there are a set of agents constantly
118 118 IBERGRID polling the job database checking for jobs in a certain state. Instead of constantly polling the database, the new DIRAC WMS has to push changes to the required components as soon as they happen to avoid polling to the databases. Instead of agents polling the database, agents will subscribe to a distributed callback system that will alert them whenever a certain condition happens. This distributed worker pool schema will provide DIRAC with a tool to scale much further by being able to add and remove workers from the pool at any time. To do so, servers need to be able to initiate communication to clients asyncronously. With the previous version of DISET, clients had to poll servers to get the data, but that lead to a lot of CPU wasted in SSL connections. with the new DISET, clients can stablish a connection to servers, maintain it and send and receive requests asyncronously as shown in figure 3. Fig. 3. Messages in a stable connection Once the connection has been established and the client has been authenticated, messages can be sent in both directions asyncronously. Services can also define authorization rules for each message. Authorization rules for messages are defined and processed with the same mechanism the RPC capabilities uses. Messages can be handled the same way from the service developer point of view. Developers only need to code a callback for each message they want to accept, as they would do for processing RPC requests. Message clients also have methods to define callbacks for processing messages. By defining callbacks, the client defines which messages it accepts. Connection drops and timeouts are properly detected. Should the connection be stopped, both ends will be notified via a special callback to handle the disconnect gracefully. Clients can be configured to auto reconnect if the connection drops. 3.2 Keep alive signaling RPC queries and transfers also pose another problem. Clients don t know a priori how much time a request will take and can lead to many connections kept open if a server fails or gets stuck. A possible solution is to timeout the requests. But if a particular request takes a long time to be completed, clients will start to timeout the request even if the request is being processed. To solve this, DISET now implements a keep alive mechanism. Clients will send a keep alive signal to servers that servers will reply as shown in figure 4. If
119 IBERGRID 11 Fig. 4. Keep alives persist the connection the keep alive is not received, the other end of the connection is supposed to be stuck or dead and the request will be aborted. 4 Performance DIRAC was designed to be extremely modular to be able to easily distribute components. This modularity also allows DIRAC to overcome the limitations of Python s thread implementation. Although applications written in Python can be multi threaded, only one thread is actually executing Python code. That is because Python threads need to acquire a global lock to be able to access the memory to prevent memory corruption. This global lock limits the maximum rate a Python service can manage. By running different services the rate of requests processed can be higher. DISET has been used by DIRAC for almost 4 years now. It handles all DIRAC communications and has been proved to scale. It was able to sustain in a single host an average of 50 queries/second for months with 80 queries/second peaks for a couple days as shown in figure 5 in the LHCb production environment. Some heavily used services reached 30 queries/second by themselves for months. The limit is currently not in DISET but on the amount of computing resources available and the number of concurrent activities ongoing. The most heavily used service (top dark green) was split in two and then in four to improve performance. Fig. 5. Queries/second served in one host On test environments DISET has been able to sustain almost 200 queries/second in a single service answering only ping requests. These two measurements are differ-
120 120 IBERGRID ent because in the production setup the whole system is measured. Not only DISET but DIRAC itself processing requests, matching jobs, receiving new ones, updating states... Whereas the second one is just a DIRAC service that has one ping RPC method. That method only returns it s argument so only DISET throughput is measured. Each color in figure 5 is one service running in the host. DIRAC is designed to be extremely modular. Each service in the previous figure could run on a different host if necessary. In contrast with other grid middlewares such as glite, DIRAC has no requirements on where to run the services. For a small installation all services, agents and databases can run on the same host. 5 Summary The DIRAC project started in 2002 using plain XML-RPC for it s communication needs. At that poing authenticating users and authorizing requests became a necessity and SSL capabilities were added to the DIRAC communication layer. XML-RPC wasn t designed to cope with transferring files and XML added too much overhead on the serialized request. So a new protocol called DISET was designed and deployed. DISET is able to provide RPC and file transfer capabilities to DIRAC while minimizing the overhead added to the request. DISET has proved to be a scalable solution to meet DIRAC s needs. RPC and transfer mechanisms allowed DIRAC to manage all LHCb computing needs up until now. Although DISET was designed as the base layer for DIRAC, it can be used to build any distributed system. Distributed systems built with DISET will get scalabity, authentication, authorization and redundancy for free. But to scale DIRAC further RPC is not enough. The new version of DISET will include a mechanism to establish a connection and send and receive messages through it asynchronously. This mechanism will enable DIRAC to provide new features to users and to make DIRAC more resilient to errors and network glitches. Acknowledgements The presented work has been financed by Comisión Interministerial de Ciencia y Tecnología (CICYT) (projects FPA C02-01, FPA C02-01 and CPAN CSD from Programa Consolider-Ingenio 2010 ), and by Generalitat de Catalunya (AGAUR 2009SGR01268). References 1. Tsaregorodtsev A et al DIRAC: A community grid solution Computing in High-Energy Physics and Nuclear Physics Amato S et al CERN-LHCC Antunes-Nobrega R et al. (LHCb) CERN-LHCC Xml-rpc URL 5. Python 2.5 reference manual URL
121 IBERGRID Open source ssl toolkit URL 7. Tuecke S, Welch V, Engert D, Pearlman L and Thompson M 2004 [RFC3820] Internet X.509 Public Key Infrastructure (PKI) Proxy Certificate Profile 8. Globus alliance URL 9. Bencode URL Casajus A and Graciani R 2009 DIRAC Distributed Secure Framework Computing in High-Energy Physics and Nuclear Physics Housley R, Polk W, Ford W and Solo D 2002 [RFC3280] Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile 12. Paterson S K and Closier J 2009 Performance of Combined Production And Analysis WMS in DIRAC Computing in High-Energy Physics and Nuclear Physics 2009
122 122 IBERGRID LHCb Grid resource usage in 2010 and beyond R. Graciani 1 and A. Casajus 2 1 Institut de Ciències del Cosmos (ICC), Universitat de Barcelona [email protected] 2 [email protected] Abstract. LHCb is one of the four main experiments taking data at the Large Hadron Collider accelerator at CERN (Geneva). Although the accelerator produced some collisions at the end of 2009, it has not been until the end of 2010 when the rate of proton-proton interactions at the experiments started to approach the nominal design values. All four LHC experiments have produced in the past estimates of the computing resources necessary to correctly process and analyze the collected data. For the first time in 2010, this estimates have been confronted with the reality. LHC nominal values will not be reached in the next years. Although the conditions are not yet nominal, the analysis of the computing resource usage during 2010 is an essential ingredient to refine the estimates for the coming years of LHC operation. This contribution describes the analysis of the Word wide LHC Computing Grid, WLCG, resource usage by the LHCb experiment in A comparison with similar information provided by Grid Accounting tools is presented. Using the information of this analysis, the estimated needs for 2011 and 2012 data taking periods have been reevaluated after different updates to the computing model using a newly developed tool for the simulation of computing models. The changes on model and the underlying assumptions, based on 2010 experience, and the new estimates are also presented in detail. 1 Introduction LHCb (see [1] and [2]) is one of the four main detectors at the Large Hadron Collider, LHC [3], at CERN (Geneva). Much before the LHC accelerator was foreseen to start operation it was clear that a distributed solution was the only viable option to address the problem of the processing and later analysis of the data to be delivered by the detectors. Back in 2005, LHCb prepared a description of its Computing Model [4], CM. In parallel, LHCb developed the DIRAC software for distributed computing (see [5] and references within) to handle all activities described in the CM. Given the large amount of computing resources involved, LHC experiments are requested to produce estimates of their future needs and to report back on the usage of the requested resources. Both the requests and usage reports are reviewed and approved, if appropriated, by different committees since they are the basis of corresponding author: [email protected]
123 IBERGRID 123 of the WLCG Memorandum of Understanding, MoU, in what resource provision refers. This MoU is then the basis of funding requests from the resource providing centers to the national funding agencies in order to provision the necessary capacity. The LHCb model, with all activities conducted by a central DIRAC system, is specially well suited for this situation. The detailed accounting information recorded by DIRAC allows to produce a complete and exact picture of the usage of resources, much more precise that the general purpose tools provided by the general Grid infrastructure. The LHC operation, in what respect proton-proton collisions at LHCb, has strongly deviated from the expected nominal parameters during 2010, and it is now expected to continue operation in a similar manner during the next couple of years. The most relevant change with respect to the computing model is the increased number of visible collision per bunch crossing, µ. In nominal mode, LHC was expected to operate with almost 3000 colliding proton bunches on each beam with an average µ of 0.4 that maximizes the probability of having a single interaction in the crossing. Due to technical problems, the number of colliding bunches was at most 400 during The only way to achieve high data rates, necessary for the physics programs of the experiments, is to squeeze the beams, improve the focalization and thus increase the average µ. During most part of 2010 the average µ at LHC has been around 1.5, reaching values as high a 2.5 at the end of the year. Figure 1 shows the delivered luminosity and the peak number of visible proton-proton interactions as a function of the LHC fill number 3. This increase Fig. 1. Integrated luminosity delivered by LHC and collected by LHCb (left), and peak µ (right) as function of the LHC Fill during has as direct consequence an increase on the average event size, thus increasing the storage requirements. At the same time, higher collision probability means an increased particle multiplicity in the collected data and thus higher computing requirements for the processing of the data. Thus increasing the CPU requirements. As will be discussed in section 2, this has caused important operational problems during 2010 and implies new working assumptions for 2011 and beyond. 3 A fill refers to each time LHC injects new proton beams.
124 124 IBERGRID The different LHCb data types relevant for this paper are: RAW: the data as it comes out of the detector. SDTS: is the result of processing the RAW data to obtain physically relevant magnitudes. DST: combination of RAW and SDST for a selected sample of the collected events. mdst: same as DST but only for a small fraction of the physical magnitudes of the selected events. The present document is organized in the following way: section 2 summarizes the 2010 operations and reports on the usage of computing resources by LHCb during this year 4 ; taking into account the experience of 2010, section 3 details the changes that have been introduced in the LHCb computing model for a more accurate re-assessment of 2011 and 2012 needs that are presented in section 4; finally section 5 wraps up and present some consideration for the coming years. 2 LHCb Grid computing resource usage in 2010 LHCb data taking conditions in 2010 have been rather unusual. On the one hand, there has been a rather long period in which the luminosity delivered by LHC has been rather limited. Over 90 % of the collected luminosity corresponds to fills after the end of September (fills from 1350 on figure 1). And on the other hand, collected events have been much larger and complicated than it was expected. In any case, during most of the period the detector has been collecting data with a nominal 2 khz trigger rate. This means that even if the luminosity delivered is small the amount of collected RAW data is not proportionally smaller. The amount of collected RAW data during 2010 was 181 TB, of which 155 TB corresponds to physics data distributed to Tier1s. The rest corresponds to detector calibration data that remains at CERN. The distribution of RAW data between the different Storage Elements is shown in table 1. The average size of RAW files has been 1 GB during the whole 2010 datataking period. Under nominal conditions with a µ = 0.4 and 2 khz of Trigger rate, the expected RAW size for a full LHC year, with data-taking seconds, would have been 250 TB. This is not too far from the current situation but with an integrated luminosity over one order of magnitude smaller than the nominal. Using the CESGA Accounting portal [6], the average normalized CPU power consumed in is summarized in table 2 6, together with the measured average efficiency in the usage of the CPU resources 7. 4 For the purpose of this document a given year refers to the interval between April 1 and March 31 of the next year. This is like that since this dates coincide with the time at which resource providing sites are expected to have the pledged capacity installed. 5 CESGA Accounting data is available by natural months. All data presented for 2010 corresponds to the period between April 2010 and January 2011, both included. 6 At least one of the sites publishes unnormalized elapse time. 7 When fractions are given for individual Tier1s or Tier2s, they are with respect to the sum of all Sites of the same Tier level and not the total; i.e., the fractions for all Tier1s add up to 100% and the same for all Tier2s.
125 IBERGRID 125 LHCb 2010 RAW data SE Size (TB) # of Files CERN-RAW (T1D0) CERN-RDST (T1D1) CERN CNAF-RAW (T1D0) GRIDKA-RAW (T1D0) IN2P3-RAW (T1D0) NIKHEF-RAW (T1D0) PIC-RAW (T1D0) RAL-RAW (T1D0) Tier1s Table 1. RAW data collected by LHCb during 2010 and its distribution between the different SEs. Only physics data (155 TB in total) is replicated to Tier1s. Using the LHCb DIRAC Accounting portal [7] more detailed information is available. Figure 2 shows the number of simultaneously running jobs from April 2010 to January 2011 classified by the type of the jobs. As can be seen, during most of the year the CPU usage has been not very high. The expected number of available slots, from the pledges of the different sites is around Only in the last part of the year has this number been reached. Table 3 shows the total amount of CPU and wall clock time consumed by the different job types, as well as the corresponding efficiency. When all job types are added together an overall efficiency of 83.4% is found. Taking the measured CPU requirement of 2.6 khs06s/evt from a large Monte Carlo simulation production (in particular the high momentum Di-muon samples with 24 million simulated events) an absolute normalization has been determined for the raw CPU consumed by all LHCb jobs at each site. The results are shown in table 4. The difference between this number, 39.5 khs06, and the WLCG estimation, 49.5 khs06, is due to the different normalization procedures. LHCb is defining a single procedure and applying it for all sites, while WLCG allows some freedom for each site to define its own normalization. Just for the Tier1s the WLCG value is over 40% higher in a direct comparison, or 30% if we use CERN as reference to correct for a possible overall scale factor. 3 Changes to LHCb Computing Model The main changes introduced in the LHCb computing model are due to the increase in the average number of visible proton-proton collisions per bunch crossing, µ. LHCb has proved with 2010 data that it is able to fulfill its physics cases under these new conditions, although at the cost of additional CPU requirements to handle the extra complexity of the events and extra storage requirements to cope with increased event sizes. For 2011 and 2012 it is expected that LHC will be able to provide data at the LHCb interaction point with similar conditions. Therefore,
126 12 IBERGRID Site CPU (HS06) Fraction Elapse (HS06) CPU Eff CERN-Tier % % IT-Tier % % DE-Tier % % FR-Tier % % NL-Tier % % ES-Tier % % UK-Tier % % Tier1s % % FR-Tier % % DE-Tier % % IT-Tier % % RO-Tier % % RU-Tier % % ES-Tier % % CH-Tier % % UK-Tier % % Tier2s % % All % % Table 2. Average CPU power used, relative fractions and CPU efficiency as reported by CESGA Accounting portal. Fig. 2. Concurrently running jobs as a function of time for the different LHCb jobs types.
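One plausible reading of the normalization procedure behind table 4 (an interpretation based on the description in section 2, not a formula quoted from the paper): if a site i ran a fraction f_i of the reference Monte Carlo production, whose total cost is quoted as 660 kHS06 day, and its reference jobs consumed C_i^ref raw CPU days at that site, then

    k_i = \frac{f_i \times 660\ \mathrm{kHS06\,day}}{C_i^{\mathrm{ref}}},
    \qquad
    W_i = k_i \, C_i^{\mathrm{all}},

where C_i^all is the raw CPU time (in days) of all LHCb jobs at the site and W_i its normalized work; dividing W_i by the length of the accounting period would then give the average power column of table 4.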
127 IBERGRID 127 Reco. Stripping Simulation Merge Sam User CPU 232,930 1,318 1,071,925 2,143 1, ,644 Wall Time 257,744 2,836 1,192,443 31,125 6, ,685 Efficiency 90.4% 46.5% 89.9% 6.9% 28.0% 69.1% Fraction 13.2% 0.1% 60.9% 1.6% 0.3% 23.9% Table 3. Amount of CPU days, wall clock time days and CPU efficiency for the different LHCb job types in the new average of µ = 2.0 has been taken to determine the per event CPU and storage needs for the different applications and data types. The updated values, compared to the old nominal ones are presented in table 5. After the experience from 2010, these variations have driven to the following changes in the computing model: The number of replicas kept for the last and previous passes of the reconstruction (DST/MDST) of a given data sample is reduced from 7 (one at each Tier0/1) to 4. These 4 replicas are distributed in the following way: 2 master replicas on T1D1 8 (one at CERN and the other distributed among the Tier1s) and the other 2 extra replicas are on T0D1 (distributed among the Tier1s). For the simulation, most of the resources will be dedicated to signal events and related backgrounds (beauty/charm samples) while combinatorial backgrounds will largely be estimated from the data itself. Other global changes derived from 2010 experience are the following: CPU efficiency for Monte Carlo simulation, including human errors and intrinsic efficiency of the system has been increased from 70 to 80%. 2 archival replicas on T1D0 have been added to the model for DST/MDST (for real and simulated data). Experience has shown that it is not possible in practice to migrate data from T1D1 to T1D0 as final step in the life time of a given sample as foreseen. Therefore, archival replicas are created at the same time the new data becomes available. Concerning older versions of reconstructed data, the 2 most recent versions are kept as described above. The number of master and extra replicas is reduced by half for the next older version (if existing), and replicas for even older versions (if existing) are completely removed. While at the moment this is implemented exactly in this manner, it is foreseen that during 2011 new tools will be developed to use the space for extra replicas in a more dynamic manner based on some popularity measurement, allowing more than 2 replicas for popular data samples and reducing the below 2 replicas less used data. Low luminosity data taken during 2010 has shown the big potential of LHCb for charm physics. This has driven the experiment to study the possibility to increase the nominal trigger rate from 2 to 3 khz, dedicating the extra bandwidth for this purpose. 8 TXDY refers to a StorageElement providing X copies on tape and Y copies on disk for any given file.
128 128 IBERGRID Ref.Jobs Frac. Ref.CPU Norm All CPU All CPU Frac. Av.Power % day HS06 day khs06 d % HS06 Tier DE-T ES-T FR-T IT-T NL-T UK-T Tier1s CH-T DE-T ES-T FR-T IT-T PL-T RO-T RU-T UK-T Tier2s Others Total Table 4. Raw CPU work (in days), Normalized CPU work (in khs06 day), and average CPU power in HS06 for all LHCb Jobs. Normalization is calculated based on the known requirements of a Reference Monte Carlo simulation producing 24 million events requiring 660 khs06 day. 4 Re-assessment of 2011 needs and first evaluation of 2012 The nominal 2011 data-taking is assumed to include a total of seconds of LHC collisions over 35 weeks that at a nominal trigger rate of 2 khz will produce a total of events, and 500 TB of RAW data. After reconstruction it will produce 400 TB of SDST (reduced DST format). After stripping with an average retention of 10% 130 TB of DST per full pass will be produced. The new 1 khz of charm data corresponds to events, 250 TB of RAW data and 65 TB of MDST data for each reconstruction. The full Tier2 pledged CPU power dedicated to simulation would allow to produce 750 M Monte Carlo events and with size of 300 TB. The data from the detector is fully reconstructed and stripped quasi-online, following distribution of the RAW data to the Tier1s. The following re-processing and re-stripping passes are foreseen: Partial re-processing and re-stripping at the beginning of June of the data taken in the Startup and Ramp up periods, once this data has been used to re-optimize the reconstruction code for the 2011 detector and accelerator conditions.
129 IBERGRID 129 Process CPU (HS06s/evt) Data Type Storage (kb/evt) New Old New Old Data Taking RAW Reconstruction SDST Stripping DST MDST 13 Simulation DST Table 5. New and old per event CPU and Storage needs for the different LHCb applications and data types relevant for this document. Partial re-stripping at the beginning of September of all data taken so far. Full re-reprocessing and re-stripping of 2011 sample at the end of the datataking period. In order to be in time for winter conferences this pass must be completed within 2 months. Before the start of the 2012 data-taking period a full re-stripping of the whole 2011 data sample is foreseen. This will be the final pass for physics analysis based on 2011 data. Figures 3 show the expected CPU, disk and Tape resource usage for the different LHCb computing activities (quasi-online reconstruction and stripping, further re-processing passes, Monte Carlo simulations and physics analysis by users) and the different data types. The profiles 9 extend further into 2012 and For 2012 a similar LHC schedule is foreseen and, thus, the assumptions entering the model are the same as for 2011 including in the full re-processing at the end of the data-taking period the data for the whole period. For 2013, LHC foresees a shutdown in order to repair faulty connections in the supper-conducting magnets. Therefore the model includes extra contribution from the Tier0/1s for Monte Carlo simulations and another full re-processing pass of the data sample accumulated in the previous years. These estimates confirm that 2011 LHCb computing activities will be covered by the present 2011 pledges from the WLCG sites (see [8]) in what respect integrated CPU work. For the peak power necessary for the full re-processing of the data at the end of the data-taking period, a shortfall is foreseen, both in the 2011 and The exact magnitude of the shortfall will depend on the details of the LHC operations during the year. If the number of colliding bunches can be increased, the average µ will go down with respect to the µ = 2.0 used for the estimates. This will mean a reduction in CPU requirement for the reconstruction of the average event and, consequently, will reduce the shortfall or even remove it completely. A similar thing occurs with the estimates of Disk needs. They are slightly above the pledges in [8] even after having reduced the number of of replicas for the 9 The profiles have been produced using a tool for simulation of computing models that can be found at
130 130 IBERGRID active data samples. Little reduction is possible in the Tape storage, which turns out to be the resource exceeding the current pledges by the largest fraction. This will have to be addressed in the course of 2011. 5 Summary and Outlook The usage of Grid computing and storage resources by LHCb during 2010 has been analyzed. Due to the low-profile activity of the LHC, the usage of resources has, on average, been below the estimates made in the past. At the same time, the analysis shows that, when necessary for a full reprocessing or a large Monte Carlo simulation campaign, the LHCb tools for distributed computing are able to use the foreseen amount of resources (or even more) in a very efficient manner. The analysis has also allowed realistic new parameters to be determined for the coming data-taking periods. Using these new parameters and a new simulation code, the needs for 2011 have been re-assessed. The resources required for 2012 and 2013 have been estimated for the first time; these estimates, after review by the corresponding LHC and WLCG panels, will drive the requests to the resource-providing sites. Acknowledgements The presented work has been financed by Comisión Interministerial de Ciencia y Tecnología (CICYT) (projects FPA C02-01, FPA C02-01 and CPAN CSD from Programa Consolider-Ingenio 2010), and by Generalitat de Catalunya (AGAUR 2009SGR01268).
131 IBERGRID 131 Fig. 3. Profile of the CPU power (top), Disk (middle) and Tape (bottom) usage for the different LHCb activities and data types.
132 132 IBERGRID References 1. Amato S et al LHCb technical proposal. Tech. rep. LHCb CERN-LHCC LHCC-P-4 2. Antunes-Nobrega R et al. (LHCb) 2003 LHCb technical design report: Reoptimized detector design and performance. Tech. rep. LHCb CERN-LHCC Evans (ed ) L and Bryant (ed ) P 2008 JINST 3 S Antunes-Nobrega R et al. (LHCb) 2005 LHCb TDR computing technical design report. Tech. rep. LHCb CERN-LHCC Tsaregorodtsev A, Bargiotti M, Brook N, Casajus Ramo A, Castellani G, Charpentier P, Cioffi C, Closier J, Graciani Diaz R, Kuznetsov G, Li Y Y, Nandakumar R, Paterson S, Santinelli R, Smith A C, Miguelez M S and Jimenez S G 2008 Journal of Physics: Conference Series URL 6. CESGA WLCG Accounting Portal. Last access on February 25, URL view.php 7. DIRAC Accounting Portal. Last access on February 25, URL Production/visitor/systems/accountingPlots/job WLCG Resource Pledges. Last access on February 25, URL 15DEC2010.pdf
133 IBERGRID 133 Aggregated monitoring and automatic site exclusion of the ATLAS computing activities: the ATLAS Site Status Board Carlos Borrego 1, Alessandro Di Girolamo 2, Xavier Espinal 3, Lorenzo Rinaldi 4, Jaroslava Schovancova 5, for the ATLAS Collaboration, Julia Andreeva 2, Michal Maciej Nowotka 2, Pablo Saiz 2 1 Institut de Física d'Altes Energies (IFAE), Universitat Autònoma de Barcelona, Edifici Cn, ES Bellaterra (Barcelona), Spain 2 CERN, European Organization for Nuclear Research, Geneva, Switzerland 3 Port d'Informació Científica (PIC), Universitat Autònoma de Barcelona, Edifici D, ES Bellaterra (Barcelona), Spain 4 Istituto Nazionale di Fisica Nucleare, Viale Berti Pichat 6/2, Bologna I-40127, Italy 5 Institute of Physics, Academy of Sciences of the Czech Republic, Na Slovance 2, CZ Praha 8, Czech Republic corresponding author: [email protected]. Also at Física d'Altes Energies, IFAE, Edifici Cn, Universitat Autònoma de Barcelona, ES Bellaterra (Barcelona), Spain. Abstract. In the context of the Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN), ATLAS (A Toroidal LHC ApparatuS) is one of the six particle detectors constructed at the accelerator. The ATLAS experiment generates large amounts of raw data which are analysed by physicists and physics groups at tens of sites all around the world. Various monitoring tools are spread around the many sites to check the status of the different activities. The ATLAS Site Status Board (SSB) is a framework to monitor the overall status of the ATLAS distributed computing activities at the sites. On the other hand, with this monitoring information we have created an infrastructure to automatically exclude and re-include sites in the different activities on the basis of dynamic policies. In this paper we present the infrastructure architecture, implementation details and lessons learned. Keywords: grid computing, monitoring, high energy physics 1 Introduction The Large Hadron Collider (LHC) experiments rely on the Worldwide LHC Computing Grid (WLCG) to store and analyse the data they produce. The WLCG is a distributed computing infrastructure consisting of more than 150 sites all around the world that allows physicists, regardless of their physical location, to participate
134 134 IBERGRID in the analysis of the experiment data. Each experiment needs to perform many different computing activities to accomplish the challenge of analysing the many petabytes of data produced each year by the LHC, as explained in [1]. The WLCG sites are organized in levels of tiers, the so-called Tier-0 (CERN), Tier-1, Tier-2 and Tier-3; the tier level describes the site's computing activities and responsibilities. The ATLAS experiment recorded approximately 2 petabytes of RAW data in 2010; it is currently running up to 90k jobs in parallel and has reached peaks of 8 GB/s of data distribution over the sites. To achieve the challenging task of ultimately producing important physics discoveries, the ATLAS experiment has organized its computing infrastructure in so-called clouds: each cloud consists of one Tier-1, of the order of 10 Tier-2s, and possibly some Tier-3s. The Tier-0, 1 and 2 sites have agreed with ATLAS on a Memorandum of Understanding, in which those sites commit themselves to provide a specific service level for the different activities that the ATLAS experiment needs to perform at each site. The Tier-3s offer opportunistic resources that can be used when available. The ATLAS experiment has organized a structured support system, 24x7 and 365 days per year, to allow efficient data taking and data analysis across the computing centres: the support people are groups of scientists belonging to the ATLAS collaboration who continuously monitor all the detector components and the experiment activities. The computing activities do not stop even when the LHC is not in data-taking mode, since Monte Carlo simulation and data analysis can be performed on the data recorded up to that moment. In the last year about 2400 incidents concerning ATLAS site malfunctions have been reported by those scientists during their shifts. Efficient monitoring is fundamental for the ATLAS experiment and for the WLCG infrastructure in general: it allows the site performance to be measured and it allows the experiments' computing operations teams to quickly spot site and central-service failures. Thanks to effective monitoring it is possible to minimize the effect of site problems on the ATLAS activities by re-directing the workload to sites that are working properly at that moment. Various monitoring tools are adopted within ATLAS to monitor the different activities. In this paper we present how the ATLAS experiment aggregates the diverse monitoring information into the Site Status Board [5], a tool developed at CERN by the Dashboard team, which takes care of providing common monitoring solutions to the WLCG experiments. The Site Status Board aims to provide all the information needed to follow the status of a site: in the SSB, metrics collected from the different monitoring tools are aggregated so that a comprehensive view of the site is shown without the need to go through the many monitoring frameworks where the information originates. The Site Status Board consists of three main components: the data collectors, the data repository and the web server. The data collectors access the external applications and insert data into the data repository. The repository is implemented as an ORACLE database. The database tables contain the description of the various monitoring metrics, including their source, criticality and frequency of update, as well as the measurements of the metrics. Both the latest values and the historical data are preserved.
The web server exposes monitoring data to the users in the
135 IBERGRID 135 form of web pages and in a machine-readable format, so that it can be consumed by other applications. This paper is organized as follows: Section 2 introduces an architecture for aggregating monitoring information from different monitoring sources. Section 3 describes the Site Status Board, the core infrastructure in which monitoring information is stored and displayed. Section 4 describes the site exclusion policies we have defined to exclude and re-include sites from the different ATLAS activities, as well as implementation details. Section 5 presents the feedback obtained from the users' experience with the infrastructure. Finally, section 6 presents the conclusions we have come to as well as our future plans for the infrastructure. In figure 1 the ATLAS SSB architecture is depicted. Fig. 1. ATLAS SSB architecture 2 An architecture for Aggregating ATLAS monitoring information The Site Status Board (SSB) is a framework which allows data from various information sources to be aggregated and the presentation of this data to be customised according to the needs of a particular user community. The monitoring metrics collected and presented by the SSB describe the health and behaviour of a computing site. Metrics are defined by the user community; they can be numeric or qualitative, and can have different update frequencies and different levels of criticality. All these attributes of the metrics are defined by the users. The way information is presented can also be customised. Several metrics can be organized in a view, which is typically shown on the web page as a table consisting of several columns. Each column contains one monitoring metric. A metric can be defined with a status, which is shown with a particular colour and/or with a numeric value. This structure is depicted in figure 2. An example of a view is depicted in figure 3.
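The metric attributes just listed (source, criticality, update frequency, status shown as a colour and/or numeric value, grouping into views) can be summarised in a small data structure. The sketch below is illustrative only; the field names are not the actual SSB schema.

```python
# Illustrative data structure for SSB-style metrics and views; the field names
# are assumptions for this sketch, not the real SSB database schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Metric:
    name: str                  # column name shown in the view
    source_url: str            # where the collector fetches the data
    critical: bool = False     # critical metrics contribute to the overall status
    update_minutes: int = 60   # collection frequency
    value: Optional[float] = None
    status: Optional[str] = None   # e.g. "online", "offline", "active"
    colour: Optional[str] = None   # colour code shown on the web page

@dataclass
class View:
    name: str
    metrics: List[Metric] = field(default_factory=list)  # one column per metric

# Example: a view with one critical and one informational metric
shifter_view = View("Shifter", [
    Metric("data_transfer", "http://example.org/ddm.csv",
           critical=True, update_minutes=10),
    Metric("downtime", "http://example.org/downtime.csv", critical=False),
])
```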
136 136 IBERGRID The computing sites can be named according to different conventions. The name with which a site is shown in SSB is the one published in GOCDB [3] or the OSG Group Resource Name [17]. Information about all the sites used by the ATLAS VO is stored in the ATLAS Grid Information System (AGIS) [8]. This information includes the mapping between the various aliases of the sites. External applications like the SSB can use the AGIS API in order to obtain this alias mapping. Currently the ATLAS instance of SSB keeps track only of Tier-0, Tier-1 and Tier-2 sites. In SSB there are two modes in which the views can be presented: the Index view [14] and the Expanded view [15]. Fig. 2. Sensors, data sources, metrics and views Fig. 3. Shifter View In the Index View, as seen in figure 4, the sites and their overall status are displayed. The overall status is given by the combination of the critical sensors and is represented by a graphic symbol: a green tick indicates that the site status is good, a yellow tick indicates that at least one sensor is in warning, a red circle indicates that at least one sensor is failing, and the work-in-progress symbol is used when the site is under maintenance. In the Index View the sites are grouped by clouds, but it is also possible to group them by tier level. The Expanded View has a table format. The sites are listed in the first column and the other columns show the sensor values. It is also possible to display a one-row expanded view, with a single site.
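The roll-up of critical sensors into the Index View symbol described above can be expressed compactly; the following sketch uses illustrative status labels rather than the SSB internals.

```python
# Sketch of the Index View roll-up described above: the overall site status is
# derived from the critical sensors only. Status labels are illustrative.

def overall_status(critical_sensors, in_maintenance=False):
    """critical_sensors: iterable of per-sensor states ('ok', 'warning' or 'error')."""
    states = list(critical_sensors)
    if in_maintenance:
        return "maintenance"        # work-in-progress symbol
    if any(s == "error" for s in states):
        return "error"              # red circle: at least one sensor failing
    if any(s == "warning" for s in states):
        return "warning"            # yellow tick: at least one sensor in warning
    return "ok"                     # green tick

assert overall_status(["ok", "ok"]) == "ok"
assert overall_status(["ok", "warning"]) == "warning"
assert overall_status(["warning", "error"]) == "error"
assert overall_status(["error"], in_maintenance=True) == "maintenance"
```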
137 IBERGRID 137 The sensors gather the status of the site from the primary information source (for example DDM Dashboard [9], the site downtime calendar [12], the Panda Monitor [11], etc.). The primary information sources publish monitoring metrics via web server. Information is published in the ASCII file containing records in the csv (comma separated values) format and stored on a web server. Each record of the file is associated to a site. Each line is formed by a time stamp, the site name, the site status (online, offline, active) or a number, a color code and the URL of the service. In the ATLAS instance of the SSB, the metrics are produced by the collectors, which are shell and python scripts running automatically and periodically on a dedicated server. Furthermore, in the SSB framework, it is possible to have a combined metric, obtained by grouping together more sensors. An important feature of the Site Status Board is that the historical evolution of the sensor in a given time period can be displayed, and thus allowing to follow the availability of the site according to the monitored activity. In figure 5 for example, a historical presentation of a given view and a given site for a week period is depicted. In the Site Status Board, several attributes are associated to each sensor: the column name, description, URL and data source; the critical attribute makes the column contribute to the overall status of the site in the Index View. Another important attribute is the update time, that allow to set how frequently the metric should be updated in the Site Status Board. Fig. 4. The Index View 3 Using the SSB infrastructure The SSB Core collects all defined metrics from the URL with frequency specified in metrics definition and publishes them on the View pages. The collecting frequency ranges from once per hour to once per 10 minutes, the particular frequency values
138 138 IBERGRID Fig. 5. Historical view for a week period are properly tuned so that the time window is of just the right size. In the current implementation of the SSB Core the metrics update cannot be forced from the SSB front-end due to security reasons, but it can be performed on demand internally by the SSB developers. The majority of the recent bug reports or feature requests concern the frontend part of the SSB. For what concerns the back-end, scalability issues have been addressed improving the tool in terms of reliability and efficiency. The development of the front-end is driven in a collaborative mode. It is coordinated with the representatives of all LHC experiments in order to ensure that the applied solutions satisfy all user communities. General communication with the SSB Core developers is done via the CERN Dashboard Savannah Project tracker [18]. The ATLAS SSB metrics developers and the SSB Core developers hold regular short weekly meetings to discuss progress on recent bug reports, and addition of new features to the SSB Core. 4 Excluding sites from ATLAS activities. Criteria and implementation In ATLAS distributed computing model[2] there are several activities which use different resources from the different sites from the tier of ATLAS [4]. These activities can be grouped in: Data Analysis Events retrieved from the detector and from Monte Carlo samples will be reconstructed to produce different data such as Event Summary Data (ESD), Physics Analysis Object Data (AOD) and Event tags, short event summaries primarily for event selection (TAG) using the production process. Data Transfer. This activity uses the different grid data management services such as FTS, SRM and LFC to transfer data from site to site based on a subscription model. Files are registered in local catalogs hosted at the different Tier-1s.
139 IBERGRID 139 Data Processing / Reconstruction. RAW data from the detector is processed to obtain the Event Summary Data (ESD). First-pass processing is the responsibility of the CERN Tier-0, although Tier-1s can eventually provide their facilities. Alignment and Calibration. This activity is in charge of generating the non-event data needed for the reconstruction of ATLAS event data, as well as for the trigger/event filter system, prompt reconstruction and subsequent later reconstruction passes. Simulation process. Simulated data is created mainly in the Tier-2 infrastructure and stored at the Tier-1s. We use different monitoring sources as input for our exclusion/recovery criteria algorithm. Since the Data Source, the information required by the Site Status Board to define a metric as explained in section 2, is really flexible, we are able to collect any data from any monitoring system. This gives us great flexibility in defining our criteria to exclude or re-include a site from an activity. In the case of Data Analysis and Data Transfer, the monitoring sources taken into account include: Panda ATLAS Functional Tests (AFT PANDA) [6]. Tests performed by the gangarobot team with the WLCG back-end, which check the DATADISK space tokens for the different sites. This space token contains the AODs and data on demand. Distributed Data Management Functional Tests (DDMFT) [7]. Tests provided by the ATLAS dashboard infrastructure [9], which check the transfers between sites. Storage Resource Manager Functional Tests (SRMFT) [10]. Provided by the Service Availability Monitoring (SAM), a set of probes regularly executed to test the different services all around the WLCG. 4.1 Criteria We have created an infrastructure ready to exclude and recover sites from any kind of ATLAS activity described above. As a first step we focused on the deployment of the exclusion for the first two activities: Data Analysis and Data Transfer. Once the criteria to exclude/recover sites from the rest of the activities are defined, implementing them will be straightforward. In figure 6 we present the criteria to exclude and to recover a site for the different activities. This table shows the metrics and their corresponding thresholds that trigger a site exclusion, for both Data Analysis and Data Transfer. It is worth noting that failing for a long time, depending on the metric, can trigger a site exclusion from all of the activities. In order to clarify to the site administrator the reason why a site has been excluded or re-included, a special view with the metrics involved is defined [13].
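In essence, the exclusion/recovery logic reduces to comparing how long each metric has been failing (or OK again) against the thresholds of figure 6. The sketch below illustrates that comparison; the numeric thresholds are placeholders, not the real policy.

```python
# Sketch of the exclusion/recovery decision for one site and one activity. The
# threshold values (hours of continuous failure per metric, recovery window)
# are placeholders standing in for the table of figure 6.
HOURS_TO_EXCLUDE = {"AFT_PANDA": 6, "DDMFT": 8, "SRMFT": 8}   # assumed
HOURS_TO_RECOVER = 2                                          # assumed

def decide(site, hours_failing, hours_ok, currently_excluded):
    """hours_failing / hours_ok: dict metric -> continuous hours in that state."""
    if not currently_excluded:
        for metric, limit in HOURS_TO_EXCLUDE.items():
            if hours_failing.get(metric, 0) >= limit:
                return "exclude", f"{site}: {metric} failing for {hours_failing[metric]}h"
        return "no_action", ""
    if all(hours_ok.get(m, 0) >= HOURS_TO_RECOVER for m in HOURS_TO_EXCLUDE):
        return "recover", f"{site}: all metrics OK again, re-including"
    return "no_action", ""

action, message = decide("SITE_X",
                         {"AFT_PANDA": 7, "DDMFT": 0, "SRMFT": 0},
                         {"AFT_PANDA": 0, "DDMFT": 5, "SRMFT": 5},
                         currently_excluded=False)
print(action, message)
```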
140 140 IBERGRID Fig. 6. Exclusion/Recovery criteria for the Data Analysis and Data Transfer activities 4.2 Alert system Having an infrastructure with the state of many metrics for different sites allows us to generate alarms to inform the ATLAS support structure about site exclusions. These alarms are triggered using composite expressions built from different metrics. A task in the ATLAS Site Status Board Core is executed every hour to check whether any site should be excluded from or re-included in the different activities following the criteria described above. This task triggers the alerts, which are sent by e-mail to an e-group [16] to which the different site administrators subscribe in order to follow the exclusion/re-inclusion activity. An alert example is the one depicted in figure 7: Fig. 7. Alert A metric defining the state of the site concerning exclusion for the different activities is also defined. This allows the user to check, using the very same infrastructure, historical values for the exclusion actions.
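The alert itself is little more than a message assembled from the metric states and delivered to the e-group. A minimal sketch follows; the SMTP server, sender, e-group address and message wording are placeholders, not the actual ATLAS alert template.

```python
# Minimal sketch of composing and dispatching an exclusion alert by e-mail.
# Server, sender and e-group address are placeholders for illustration only.
import smtplib
from email.message import EmailMessage

EGROUP = "atlas-ssb-alerts@example.org"          # placeholder e-group address

def send_alert(site, activity, action, reason, smtp_host="localhost"):
    msg = EmailMessage()
    msg["Subject"] = f"[SSB] {site} {action} for {activity}"
    msg["From"] = "ssb-noreply@example.org"      # placeholder sender
    msg["To"] = EGROUP
    msg.set_content(f"Site {site} has been {action} for {activity}.\n"
                    f"Reason: {reason}\n")
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)

# send_alert("SITE_X", "Data Analysis", "excluded", "AFT_PANDA failing for 7h")
```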
141 IBERGRID 141 5 Shifters' experiences The ATLAS Site Status Board is used by the shifter community to monitor the site status and activities. A portal called GGUS [19] keeps track of site incidents. In figure 8 the number of GGUS tickets opened per week since 1st January 2010 is depicted. Many views have been set up to allow different types of monitoring. Fig. 8. Number of GGUS tickets opened per week since 1st January 2010 The Shifter view collects the data transfer, data analysis and data processing sensors and the site exclusion columns. In this view, the shifter can easily spot problematic sites. The Site Status View collects the metrics displaying the site exclusion status in the data transfer and processing services. The shifter can understand whether a site has been excluded from an activity and, using the historical information, can display the availability of the site according to a particular service. The Alert View is used by the shifters to take decisions about the exclusion/inclusion of the site in the computing activities. The proposal for the near future is to automate the exclusion procedures. The Sonar View allows monitoring of all the cross-transfers among Tier-2s in the new ATLAS data distribution model, according to which some Tier-2s need a direct connection to the other sites for a more balanced workload and data distribution among the ATLAS grid sites. The Site Status Board has been tested by the shifter community from the British, Italian and Spanish clouds, and by the ADCoS shifters. We have received very positive feedback, the ATLAS Site Status Board being considered a reliable tool and a starting point for the monitoring of the ATLAS grid sites.
142 142 IBERGRID 6 Conclusions and future plans We have described the key features of the ATLAS Site Status Board (ATLAS SSB) and the use cases for the ATLAS SSB in the previous sections of this paper. Future development of the ATLAS SSB will be driven by needs of the ATLAS Distributed Computing Operations. The ultimate goal of the ATLAS Distributed Computing is to create a resilient computing system, which is robust enough to sustain most kinds of outages of its functional parts. When a service is down, it is necessary to exclude it from the production grid services for ATLAS, and notify its administrators so that they can investigate and fix the issue. ATLAS SSB sensors and alarms together with other automatic notification services of ATLAS Distributed Computing proved to be very useful source of information for the ATLAS support structure. One of the future automation plans is to introduce the full exclusion of a failing service, which is currently done only in some of the ATLAS Distributed Computing activities. The ATLAS SSB is a natural part of the automation system of the ATLAS Distributed Computing Operations. The web application provides up-to-date status information of ATLAS Grid sites and the services provided by these sites, which is useful for ATLAS Distributed Computing Operations Shifters, operations experts, site administrators, and for ATLAS grid users. The alert application notifies ATLAS cloud squads, site administrators, and operations experts about status of their respective services, resulting in faster response to an ongoing issue, hence improvement of reliability and availability of the ATLAS grid services. Acknowledgements Carlos Borrego gratefully acknowledges the support from MICINN, Spain. Jaroslava Schovancova gratefully appreciates support from the Academy of Sciences of the Czech Republic, and from the ATLAS experiment. The Port d Informació Científica (PIC) is maintained through a collaboration between the Generalitat de Catalunya, CIEMAT, IFAE and the Universitat Autònoma de Barcelona. This work was supported in part by grant FPA C02-00 and FPA C03-02 from the Ministerio de Educacin y Ciencia, Spain. References 1. S. Campana Experience Commissioning the ATLAS Distributed Data Management system on top of the WLCG Service 7th International Conference on Computing in High Energy and Nuclear Physics 2. D. Adams, D. Barberis, C. P. Bee, R. Hawkings, S. Jarp, R. Jones, D. Malon, L. Poggioli, G. Poulard, D. Quarrie, T. Wenaus The ATLAS Computing Model (2004) 3. Gilles Mathieu, Dr Andrew Richards, Dr John Gordon, Cristina Del Cano Novales, Peter Colclough and Matthew Viljoen GOCDB, a topology repository for a worldwide grid infrastructure Journal of Physics: Conference Series Volume 219 Part 6, 2010
143 IBERGRID 143 4. Tier of Atlas (TiersOfATLAS - ATLAS TOA) /GROUPS/DATABASE/project/ddm/releases/TiersOfATLASCache.py 5. J. Andreeva, S. Belforte, M. Boehm, A. Casajus, J. Flix, B. Gaidioz, C. Grigoras, L. Kokoszkiewicz, E. Lanciotti, R. Rocha, P. Saiz, R. Santinelli, I. Sidorova, A. Sciabà, A. Tsaregorodtsev, Dashboard applications to monitor experiment activities at sites, Journal of Physics: Conference Series, v. 219, n. 6, 6. Atlas Functional Tests 7. DDM Dashboard - CERN - the European Organization for Nuclear Research 8. Anisenkov, A; Klimentov, A; Kuskov, R; Krivashin, D; Senchenko, A; Titov, M; Wenaus, T, ATLAS Grid Information System, ATLAS internal note 9. ATLAS Dashboard 10. Service Availability Monitoring Visualization 11. Panda Monitor 12. Atlas Grid Downtime 13. Exclusion Input View 14. Index View 15. Expanded View 16. Simba at CERN 17. OSG Resource Name 18. Y. Perrin, F. Orellana, M. Roy, D. Feichtinger, The LCG Savannah software development portal, Computing in High Energy Physics and Nuclear Physics 2004, Interlaken, Switzerland, 27 Sep - 1 Oct 2004, pp. 19. GGUS, a Global Grid User Support
144 144 IBERGRID A Geographical Information System for wild fire management António Pina, António Esteves, Joel Puga, and Vítor Oliveira Department of Informatics, University of Minho, Portugal {pina,esteves,joel,vspo}@di.uminho.pt Abstract. The CROSS-Fire project focus on developing a grid-based framework for wild fire management using FireStation (FS) as a standalone application that simulates the fire spread over complex topography. The overall software development is made of several components: client applications, which request geo-referenced data and fire spread simulation, Spatial Data Infrastructures (SDI), which provide geo-referenced data, and the GRID, which gives support to the computational and data storage requirements. Herein we present the central Web Processing System (WPS) layer developed to support the interaction between all components of the architecture. This OGC-WS compatible layer provides the mechanism to access the Grid facilities for processing and data management and including all the algorithms, calculations, or models that operate on spatially referenced data, also mediating communication with the FS console. We also describe the work that has been done to provide FS with dynamic fuel maps, by using an OGC-WCS suite of services and satellite data. This task complements the previous integration of dynamic data from meteorological stations using OGC-SWE services. Key words: OGC-WS, Wild Fire Management, Grid Computing, Satellite Imagery 1 Introduction The CROSS-Fire project[1] aims to develop a grid-based framework as a risk management decision support system for the civil protection authorities, using forest fires as the main case study and FireStation (FS) [2] as the standalone application that simulates the fire spread over complex topography. The approach is based in an architecture that includes: information models, encodings, and metadata that represent the scientific knowledge associated to FS execution models and standards to enable the discovery and access of Web services, data repository, in-situ and satellite sensors, and data processing facilities. The overall software development is made of several components: client applications, which request geo-referenced data and fire spread simulation, Spatial Data Infrastructures (SDI), which provide geo-referenced data, and the GRID, which gives support to the computational and data storage requirements. CROSS-Fire
145 IBERGRID 145 uses the EGI/EGEE distributed computing infrastructure to provide raw technological capability provision, including data management and storage, access to meta-data databases and HPC. On another hand, taking advantage of Open Geospatial Consortium (OGC) proposals for open standards for geospatial interchange formats, a standard-based SDI layer centered on GeoServer is being exploited to provide FS with static data, and to publish data for further processing, while provisions to dynamic data is make available by using an OGC-SWE compatible layer. The dynamic data includes meteorological information and satellite images, coming from sensors in weather stations (such as DAVIS Vantage Pro2) and sensors aboard satellites (such as Terra/Aqua MODIS). The console of FireStation (CFS) is based on gvsig, a full feature Open Source GIS desktop solution, funded by EC, which conforms to INSPIRE for managing geospatial information. 1.1 Open Geospatial Consortium Open Geospatial Consortium (OGC) [3] provides several open standards to regulate geospatial content and services, GIS data processing and data sharing. Next, we describe its use in the context of CROSS-Fire. Web Map Service (WMS) provides georeferenced images (maps) with no other data associated to them, used as background maps for simulation. Web Coverage Service (WCS) is similar to WMS, but it allows the association of data to the image (one or multiple values per cell), used for example, to provide altitude maps, where the value of a cell represents the height at that point, as is the case of raster type data (terrain, fuel maps and DTMs). Web Feature Service (WFS) is used to provide representations of real life entities (roads, buildings, rivers, monuments, etc.) that have a wide array of data associated with them, such as: wind field and fire spread simulations results. Web Processing Service (WPS) is used to request execution of a process that works with geospatial data. It is the core layer of the CROSS-Fire platform. Catalog Service Web (CSW) provides a catalogue of meta-data especially designed to organize geospatial data, used to catalogue and organize geospatial data, and speed up the search for the sources of simulation data. Sensor Observation Service (SOS) provides access to geo-referenced sensors and their readings. We use it to get the readings of our meteorological stations. 1.2 CROSS-Fire Platform The architecture (fig. 1) integrates the WPS core layer of CROSS-Fire and three infrastructures: i) a Distributed Computing infrastructure built on top of the EGI/Grid middleware, ii) a Spatial Data Infrastructure conforming to OGC-WS and SWE-SOS standards, and iii) a (third party) knowledge database organized according to Portuguese CP organization. In addition, a Human Interface layer provides all the functionalities needed for the interaction with the platform.
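All of the OGC services listed in section 1.1 are reached through plain HTTP requests. As an illustration (the server URL and layer name are placeholders), a WMS GetMap call for a background map can be built as a key-value-pair request:

```python
# Illustration of an OGC WMS 1.1.1 GetMap request built as key-value pairs.
# The server URL and layer name are placeholders; any OGC-compliant server
# such as GeoServer accepts this form of request.
import urllib.parse
import urllib.request

def wms_getmap(base_url, layer, bbox, width=512, height=512,
               srs="EPSG:4326", fmt="image/png"):
    params = {
        "SERVICE": "WMS", "VERSION": "1.1.1", "REQUEST": "GetMap",
        "LAYERS": layer, "STYLES": "",
        "SRS": srs, "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width, "HEIGHT": height, "FORMAT": fmt,
    }
    url = base_url + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read()          # PNG bytes to use as a background map

# png = wms_getmap("http://example.org/geoserver/wms", "crossfire:terrain",
#                  bbox=(-8.7, 41.4, -7.9, 42.0))
```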
146 146 IBERGRID Fig. 1. CROSS-Fire Platform Architecture The core of the platform is a 52North WPS layer [4] that deals with most of the functionalities of the dependent layers: Business Logic (BL), Computational Services and Geospatial Services, Information Service and Human Interface (fig. 2). The BL is an abstract layer containing all the algorithms that provide the application-specific functionality, such as fire spread and wind field simulations. 1.3 FireStation The FS is an integrated system aimed at the simulation of fire growth over complex topography, which integrates wind field generation and the computation of the Fire Weather Index. FS requires three different types of input data to simulate fire propagation: (i) a description of the terrain features (e.g., morphology), (ii) meteorological data, namely the wind conditions affecting a selected region, and (iii) the initial conditions and some control parameters. The description of the topography and fuel characteristics of the terrain is registered in two files that store the altitude and the fuel type for each cell. The wind field module receives as inputs a modified version of the terrain map, called Digital Terrain Map (DTM), and the wind readings (direction and velocity) provided by the meteorological stations existing in the area of interest, returning as output a map of the wind behaviour (direction and velocity) within the terrain limits. The fire-spread module receives as inputs the wind field simulation results, a fuel map and the respective fuel index (describing the characteristics of the different kinds of fuel), the terrain map, and several inputs and parameters of the simulation (ignition points, fire barriers, stop conditions, etc.).
147 IBERGRID 147 For each burned cell the simulation returns the values of: fire intensity in the line of fire, flame size, local fire velocity, flame life span, energy for unit of area and time of cell ignition. Fig. 2. WPS Layer 1.4 Meta-Data Management In CROSS-Fire one expects not only to be able to run simulations but also to access and manipulate the results produced in the past. To accomplish these requirements, the platform must be able to register in a convenient way all input/output data used in the simulation. To achieve those goals, all the simulations input/output data, along with the meta-data that characterize the executions, such as the date, the execution time or the resources used, is recorded on a database server with GIS extensions. In order to have this data available on the Grid and, at the same time, to have access to the replication, redundancy and security facilities offered by the Grid, the database is interfaced via AMGA[5] data-base. 1.5 Spatial Data Infrastructure The SDI of CROSS-Fire is composed by two mains components: a PostgreSQL database and the geographical data server GeoServer[6]. The database incorporates a PostGIS [7] extension with several functionalities for geospatial data, compatible with GeoServer. All the input/output used/produced by FS is imported/made available in the database. As the original FS uses as input data files in a proprietary format, we needed to convert them to standard formats. The terrain and fuels maps are natively available in ARCgrid, an ASCII raster format widely used and compatible with most GIS software, so no conversion was needed. The modified DTMs are converted to this format; since they have only one value associated to
148 148 IBERGRID each cell. Other input/output data with more than one value associated to each elementary cell such as: wind field results, meteorological readings and fire spread results, are converted to SQL file format, to be stored in a PostgreSQL database. Both types of data may be added to the geospatial server for post retrieve through the WFS and WCS standards. Whenever there is a need to get data from the GeoServer, several algorithms we developed are used to reverse the standard formats into the original format data used by the FS. 1.6 Related Work Virtual Fire is a Web GIS platform to predict, simulate and visualize wildfires in real time [8]. EuroGEOSS is a large scale European integrated project for Environment Earth Observation that follows INSPIRE directive and builds an operating capacity for three areas: drought, forestry and biodiversity [9]. PREVIEW is an UN sponsored project to easily integrate and share spatial data on global risk from natural hazards [10]. It provides interactive and almost full access to data sets on 9 types of hazards at the global scale. The EUMETSAT delivers weather and climate-related satellite data all days of the year. Instituto de Meteorologia, one of its members, operates the Land Surface Analysis Satellite Applications Facility (LSA SAF) whose objective is to increase benefit from EUMETSAT satellites. GreenView, a Europe Grid based environment, offers tools for vegetation classification, satellite measurements calibration or temperature analysis and prediction [11]. ImazonGeo is a Web GIS based on a SDI, oriented to monitor Amazon forest, especially to avoid deforestation [12]. 2 CROSS-Fire WorkFlow The WPS layer comprises several modules or algorithms (the official term). It was developed in JAVA using the 52North framework that works in conjunction with Apache Tomcat. A CROSS-Fire specific XML data structure has been created to support most of the interactions between clients and algorithms, and the algorithms themselves. This structure, which is encapsulated on the standard XML used for the WPS request, comprises several fields used to register all the information associated with the inputs/outputs of FS, that are progressively filled by each of the different algorithms used for the launching and execution of the simulations. In what follows, we present the CROSS-Fire platform workflow from the WPS point of view. The implemented algorithms are identified by names of gods and titans from Greek mythology. 2.1 Data Requests Atlas - This algorithm is used to discover and return the geographical data provided through the WMS, WCS and WFS standards, that represent terrain and fuel maps.
149 IBERGRID 149 Fig. 3. Phase 1 - Data Requests It takes as an input (1) a specific geographical area provided by the client through the XML data structure (see figure 3). It then (2) queries the GeoServer for all the maps within the specified input using a JAVA library that provides methods to search data, conforming to any of the OGC standards supported by the GeoServer (WMS, WCS and WFS). Each of these methods request the XML capabilities documents of each standard, that includes the bounding box of each layers, and then it looks for specific defined tags included in the map names to return (3) the correspondent maps types. In the case of WCS it is also necessary to search in the XML document returned by the DescribeCoverage request. Finally, the algorithm returns through the XML data structure (4) the URL of each of the maps that are within the input area. Eolus - This algorithm (see figure 3) return the meteorological readings provided by the SOS servers existent in a specified geographical area and in the time interval provided as inputs in a XML data structure (5). First, a method is used to discover (6) all the weather stations in the selected area. Then, a second method returns (7) for each founded weather station all the observations taken within the bounds specified by the time interval. If the first date specified in the interval is a future date another method will be used to obtain the latest read from each weather station. For each station, the read values are inserted in the input data structure to be returned (8) to the client, by associating them with a generated identification (an hash value based on its name), its name, and its location. 2.2 Simulation Data Infrastructure Ares - The algorithm configures the geospatial database, the GeoServer and the meta-data database to receive the results of the simulations (see figure 4). It then launches the jobs on the grid, through the following actions: (i) creation of the identifications of the different components required for the simulation, (ii) creation of a table for the simulation results, (iii) creation of a meta-data table for the
150 150 IBERGRID simulation meta-data, and (iv) copy the input parameters of the simulation to the grid and launching the correspondent jobs. Fig. 4. Phase 2 - Simulation Data Infrastructure As the previous algorithms, it receives (1) as input the XML data structure, which should contain all the data returned from the previous two algorithms, already customized by the client, to conform with each specific simulation requirements, and updated with other simulation parameters also provided by the client, such as stop conditions, ignition points, barriers, and so on. Since this is a complex algorithm, in what follows the description of its functionality is divided in several phases: Identifications generation: First, Ares verifies if the elements provided in the XML data structure are new, i.e. no previous simulation has used them, by checking if their values are equal to -1. If this is the case (2), the algorithm registers the new elements on the meta-data database by creating the elements in the database, and also defining their identifications, which are automatically generated by the PostGIS database used to store the meta-data. Next, the new identifications are returned (3) and inserted in the XML data structure. Space for the results: The identifications are then used to generate the tables (4) on the geospatial database that will contain the results of the simulations, through a JDBC connection. Each table identifies the type of the simulation and it s identification. Next, the tables are made available through WFS (5) using the GeoServer Representational State Transfer (REST) interface. Finally, the library returns (6) the location (URL) of the results that is inserted (7) in the meta-data database and the XML data structure is returned to the client (8).
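Step (5), publishing a newly created results table through the GeoServer REST interface, amounts to a single HTTP POST. In the sketch below the workspace, datastore and credentials are placeholders; the endpoint layout follows the GeoServer REST conventions.

```python
# Sketch of publishing a results table as a WFS layer via the GeoServer REST
# interface, as in step (5) above. Workspace, datastore and credentials are
# placeholders for illustration only.
import base64
import urllib.request

def publish_feature_type(geoserver, workspace, datastore, table, user, password):
    url = f"{geoserver}/rest/workspaces/{workspace}/datastores/{datastore}/featuretypes"
    body = f"<featureType><name>{table}</name></featureType>".encode()
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("Content-Type", "text/xml")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.status          # 201 Created when the layer is published

# publish_feature_type("http://example.org/geoserver", "crossfire", "postgis",
#                      "firespread_42", "admin", "secret")
```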
151 IBERGRID 151 2.3 Simulation Launching This phase is divided into two steps implemented using bash scripts that: (i) copy the input data from the GeoServer to the grid and (ii) launch the jobs (see figure 5). Ares - In the first phase, the algorithm obtains the data used by the scripts from the XML data structure, namely the simulation identifications and the URLs of the input data. Then, invoking a simulation folder script (SFM), it creates a folder structure in the Grid (1), which is identified by the identifiers, and copies the required data from the GeoServer into the Grid. It is at this point that the reversers are used to convert the data from a standard format to the format used by the FS executables. Once the copy of the input data has finished, the job may be submitted (3) using a simulation launcher (SL) script. All the information related to the job returned by the Grid WMS is registered in the XML data structure. Hades - This algorithm is used to gather the output from the simulations and insert it into the table created by Ares, updating at the same time the related meta-data. Unlike the other algorithms, it does not receive the XML data structure as input, but a series of files, each one containing the simulation identifiers as a header, followed by a series of lines, each one representing the simulation results for a specific cell in the map. Fig. 5. Phase 3 - Simulation Launching These files are provided by the Watchdog [13] script, launched in conjunction with the simulation, which runs in parallel to detect the availability of new results in the Grid and to send them back to Hades (3). The algorithm may then insert the new results into the corresponding table (4), previously created by Ares, using the simulation identifier. This is implemented by a Java library that inserts a batch of data into the geospatial database through a JDBC connection. An internal meta-data algorithm, invoked by Hades, also updates (5) a simulation meta-data field in the AMGA database that states the time of the last
152 152 IBERGRID data received or the time of the simulation termination. This field may be used to determine if there is new data available to download (6) or the simulation is already terminated. The data results are available through the GeoServer in the URLs (7) previously provided by Ares. This data results may be used, for example, to visualize the progression of simulations. 3 Integration of Satellite Dynamic Data on FireStation Given the potential of data that may be acquired from the satellite imagery we decided to complement the integration of dynamic information from weather stations with the dynamic data/imagery obtained from Terra and Aqua satellites, in particular, data from MODIS [14] and ASTER [15] sensors. Since the MODIS set of products is broader and more freely available than the ASTER s one we opted for MODIS data. The objective is to automatically improve and update the available static fuel maps using the following satellite data: (i) updated vegetation, (ii) recently burned areas, and (iii) land cover type. The implementation is based on the access to geo-referenced images using the OGC-WCS standard [16] and its Web Coverage Processing Service extension (WCPS) [17][18]. The WCS standard defines basic extraction operations, such as spatial, temporal and band sub-setting, scaling, reprojection, and format encoding. Although WCS does not offer a powerful processing capability, it is declarative (describes the results rather than the algorithms), it has a safe evaluation (impossibility of a request to keep the server busy infinitely), it is optimizable (capability of the server to rearrange the request to produce the same result faster), and its explicit semantics allows machine-machine communication and reasoning. The OGC WCS-Transactional (WCS-T) is a WCS 2.0 extension that allows inserting and updating coverages stored on a WCS-type coverage server. WCS-T specifies an additional Transaction operation. WCPS is a WCS 2.0 extension that provides an additional ProcessCoverages operation. WCPS defines a flexible interface for the navigation, extraction, and ad-hoc analysis of large, multi-dimensional raster coverages, that allows combining several coverages in one evaluation. Each coverage is optionally checked first for fulfilling some predicate, and it contributes to the result list only if the predicate evaluates to true. 3.1 MODIS Data for Dynamic Fuel Maps In what follows we describe three MODIS land products used to compute updated and improved fuel maps. Burned Areas - Produced from both Terra and Aqua MODIS daily surface reflectance, MCD45A1 is a monthly L3 gridded 500 m product that approximates the date of burning, and maps the spatial extent of recent fires. The most relevant MCD45A1 layer, among the 8 available ones, is the burn date. This field approximates the Julian day of burning (1-366) from eight days before the beginning of the month to eight days after the end of the month, or a code indicating especial cases.
153 IBERGRID 153 Land Cover Type - These MODIS products containing multiple classification schemes describe land cover properties derived from one year of Terra and Aqua observations. The MODIS Terra and Aqua land cover type yearly L3 global 500 m sinusoidal grid product (MCD12Q1) incorporates five different land cover classification schemes. As an example, the IGBP scheme identifies 17 land cover classes containing: 11 natural vegetation classes, 3 developed and 3 mosaicked classes. Vegetation Indices - Using MODIS blue, red, and near-infrared reflectances, and masking water, clouds, heavy aerosols, and cloud shadows, vegetation indices are computed daily. Normalized Difference Vegetation Index (NDVI) and Enhanced Vegetation Index (EVI) are among these indices. For example, the global MOD13A1 product is provided every 16 days at 500 m spatial resolution as a gridded L3 product in the sinusoidal projection. Vegetation indices are used for global monitoring of vegetation conditions and are used in products displaying land cover and land cover changes. 3.2 Dynamic Fuel Map Computation Next we summarize the steps needed to compute a fuel map with the data presented in 3.1. First, the data is downloaded from Nasa/MODIS database, as tile-based Hierarchical Data Format (HDF) files (figure 6), using a specifically developed FTP downloader application. HDF is a data format designed to facilitate sharing of scientific data in a heterogeneous computing environment [19]. Fig. 6. Illustration of the computation of dynamic fuel map with data from MODIS.
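The combination performed later in this section merges the three products above cell by cell. The sketch below illustrates the idea only: the land-cover-to-fuel mapping, the burned-area code and the NDVI threshold are invented for the example and are not the CROSS-Fire rules.

```python
# Deliberately simplified sketch of a cell-by-cell fuel-map combination of
# land cover class, NDVI and burn date. The mapping and thresholds below are
# illustrative assumptions, not the actual CROSS-Fire processing rules.
import numpy as np

LANDCOVER_TO_FUEL = {1: 4, 2: 5, 8: 2, 10: 1, 16: 0}   # assumed IGBP class -> fuel code
BURNED_FUEL = 9                                         # assumed code for burnt cells
SPARSE_NDVI = 0.2                                       # assumed "little vegetation"

def fuel_map(landcover, ndvi, burn_date):
    """All inputs are 2-D arrays on the same grid; burn_date == 0 means unburnt."""
    fuel = np.vectorize(lambda c: LANDCOVER_TO_FUEL.get(int(c), 0))(landcover)
    fuel = np.where(ndvi < SPARSE_NDVI, 0, fuel)        # sparse vegetation -> no fuel
    fuel = np.where(burn_date > 0, BURNED_FUEL, fuel)   # recently burned areas
    return fuel.astype(np.int16)

lc   = np.array([[1, 2], [10, 16]])
ndvi = np.array([[0.6, 0.1], [0.5, 0.3]])
burn = np.array([[0, 0], [120, 0]])
print(fuel_map(lc, ndvi, burn))
```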
154 154 IBERGRID Then, the layers of interest are extracted from each file and merged to form a single raster image for a multiple tile region. This image and its metadata are inserted in the WCS server database via the WCS-T Transaction operation. The implementation is based on the WCS 2.0 available in the rasdaman project [20] that also includes a database management system (DBMS) layer that supports multidimensional arrays of any size and dimension and over freely definable cell types. At the bottom of the rasdaman architecture, arrays are decomposed into smaller units, which are maintained in a conventional relational DBMS (postgresql). To obtain a fuel map with the data stored on WCS, we may use both WCS and WCPS requests: (i) to apply filters (sub-setting) to the coverages for getting the region of interest, (ii) to convert the resolution, for equalizing the resolution of the various types of information needed to calculate the fuel map or generating the fuel map with the resolution required by FireStation, (iii) to combine the land cover type coverage layer, with the vegetation index layer and the burned area layer, and to produce a new fuel map, and finally (iv) it is necessary to return the fuel map in ArcInfo ASCII grid format, to be imported into FireStation. 4 Work in Progress Authentication - Only some preliminary research was been done on the available authentication and credential delegation software to access both the SDI and the GRID using X509 certificates. Gridsite [21] is one of the candidates to implement this functionality, because it allows the delegation of Grid credentials, and also the use of these certificates to control the access to applications served by Apache Tomcat (GeoServer, WPS, etc.) Catalog System - We have a working installation of CSW standard implementation using Degree CSW [22], natively compatible with Apache Tomcat that we plan to integrate in the WPS s Atlas algorithm. Portal - A simple Web portal was developed using JavaScript, HTML, and Openlayers [23], to provide an OGC conformant Web GIS client. The portal allows the user to visualize the results of previous simulations on top of a background terrain/street map. Dynamic Fuel Maps - Rasdaman is an evolving project for which we are contributing by adding to WCS suite the support for multi-layer coverages and the associated file formats. Acknowledgements This research was mainly funded by the Portuguese FCT through the CROSS- Fire[1], also benefit from UMinho participation on EC FP7 E-science grid facility for Europe and Latin America (EELA2) and more recently in the EC FP7 Grid Initiatives for e-science virtual communities in Europe and Latin America (GISELA).
155 IBERGRID 155 References 1. CROSS-Fire, Collaborative Resources Online to Support Simulations on Forest Fires, FCT GRID/GRI/81795/ Lopes, A., Cruz, M., and Viegas, D. FireStation - an integrated software system for the numerical simulation of fire spread on complex topography,environmental Modelling and Software, 17(3):269285, Open Geospatial Consortium, 4. Simonis I., Wytzisk A., Echterhoff J. Sensor Web Enablement: The 52North Suite, Proc. of the Free And OSF for Geoinformatics, Lausanne pp , (2006). 5. Santos N., Koblitz B. Metadata services on the Grid,Proceedings of the X International Workshop on Advanced Computing and Analysis Techniques in Physics Research, GeoServer, 7. PostGIS, 8. Kalabokidis K., Kallos G. Web GIS platform for forest fire management, Virtual Fire Final Review and PR Event, Mytilene, Greece, Vaccari L., Nativi S., Santoro M. Deliverable D Report on requirements for multidisciplinary interoperability, Project EuroGEOSS, a European approach to GEOSS, Giuliani G., Peduzz P. The PREVIEW Global Risk Data Platform: a geoportal to serve and share global data on risk to natural hazards. Natural Hazards and Earth System Sciences, Volume 11, pp , Mihon D., Bacu V., Gorgan D., Mros D., Gelyb G. Practical considerations on the GreenView application development and execution over SEE-GRID. Earth Science Informatics, Volume 3, Number 4, pp , Elsevier, Souza C., Pereira K., Lins V., Haiashy S., Souza D. Web-oriented GIS system for monitoring, conservation and law enforcement of the Brazilian Amazon. Earth Science Information, Volume 2, pp , Springer, Bruno R. A watchdog utility to keep track of job execution, INFN, ct.infn.it/twiki/bin/view/pi2s2/watchdogutility 14. United States Geological Survey (USGS). MODIS Overview, October overview. 15. Abrams M., Hook S., Ramachandran B. ASTER User Handbook, Version 2, Jet Propulsion Laboratory, California Institute of Technology, August Whiteside A., Evans J. Web Coverage Service (WCS) Implementation Standard, Open Geospatial Consortium Document r5, March Baumann P. Web Coverage Service (WCS) - ProcessCoverages Extension, Open Geospatial Consortium Document r3, March Baumann P. The OGC Web Coverage Processing Service (WCPS) Standard. Geoinformatica, July Nasa Goddard Space Flight Center (GSFC). Science Data Production (SDP) System Toolkit, April Baumann P. (rasdaman GmbH). The Rasdaman Project Documentation, GridSite, Deegree CSW, deegree_csw_2.0.2_documentation_en.html 23. openlayers,
156 156 IBERGRID Extending a desktop endoscopic capsule video analysis tool used by doctors with advanced computing resources Ilídio C. Oliveira 1,2, Eduardo Dias 1, Luís Alves 1, João Barros 3, Jorge A. Silva 3,4, Miguel P Monteiro 3,4, António Sousa Pereira 1,2, José Maria Fernandes 1,2, Aurélio Campilho 4, João Paulo Silva Cunha 1,2 1 Instituto de Engenharia Electrónica e Telemática de Aveiro (IEETA), Aveiro, Portugal 2 Dep. de Electrónica, Telecom. e Informática, Universidade de Aveiro, Portugal 3 INEB - Instituto de Engenharia Biomédica, Porto, Portugal 4 Faculdade de Engenharia, Universidade do Porto, Portugal {ico; edu; luisfalves}@ua.pt; [email protected]; [email protected]; [email protected]; {asp; jernan}@ua.pt; [email protected]; [email protected] Abstract. The CapView software supports the visualization and annotation of wireless endoscopic capsule videos in regular desktops. Computer aided event detection algorithms are available to integrate with CapView, but imposing high processing times. In this work we integrate advanced computing infrastructures, namely a production Grid and a Cluster, into the desktop workflow, to assess whether remote, on-demand analysis of exams could be offered to clinical routine. The achieved CapView/Advanced computing integration provides a seamless approach in support of clinical users with faster execution times. Keywords: High-performance computing, endoscopic capsule, grid computing, cluster computing, image processing. 1. Introduction The Wireless Endoscopic Capsule (WEC) is a recent non-invasive medical diagnosis technology allowing recording a video of the inner body as the capsule travels through the gastrointestinal tract [1]. This procedure consists in the ingestion of a capsule containing a micro-camera capable of recording video for the total of the gastrointestinal tract. Although the relevance of WEC for medical practice is well established (e.g.: [2]), it has the practical drawback of producing a video with 6 to 8 hours long, taking about 2 hours to fully review an exam by an experienced clinician. The use of automated image processing methods can help clinicians and reduce the review and classification time as demonstrated in [3]. In this context, we have previously developed an Automated Topographic Segmentation (ATS) algorithm able to locate and discriminate the four main topographic regions (entrance, stomach, small intestine, large intestine)
and also calculate the capsule transit times [3]. Without computer support, this task takes about 15 minutes when performed by a medical expert in routine clinical practice, time that can be saved with the introduction of automated computer-aided detection methods. The ATS algorithm has been included in the CapView annotation software suite, a desktop application previously developed by our group to assist clinicians in the endoscopic capsule video review and annotation process. In previous related work, we have shown that the image classification stage of the ATS algorithm could be successfully ported to a Grid execution environment without modifications, with acceptable results [4]. Based on these results, in this work we focus on the integration of advanced computing infrastructures, such as Grid computing and dedicated cluster resources, into the clinical workflow.

2. Methods

2.1 CapView: desktop-based clinical annotation software

The CapView Annotation Software provides a full endoscopic capsule exam reviewing workflow (Fig. 1). It was created with the active participation of clinicians to support user-friendly and feature-rich reviewing of endoscopic capsule exams and report generation. The main features include a frame strip displaying up to 128 images simultaneously, a report generator, an exam browser with reviewing status, and the ability to share and discuss events or cases of interest in the online endoscopic capsule community. The application is compatible with the most common wireless endoscopic capsule data format available in the market.

CapView also includes computer-assisted analysis of videos. Using computer vision algorithms, an image similarity discriminator is used to filter the data set (several hours of video) to display only images that are actually different, effectively reducing the reviewing time. It also provides automatic calculation of the gastric emptying time and of the small bowel transit time, which gives an indication of the capsule movement within the gastrointestinal tract. Another feature that uses vision components is the dominant colour bar available at the bottom of the application, which allows a quick visual assessment of the location of the capsule images. The CapView Annotation Software is already in use in normal clinical routine at hospitals' gastroenterology services.

The previous discussion reveals that WEC analysis raises multiple opportunities for computer methods in image processing.
Fig. 1: The graphical user interface of the CapView Annotation Software.

2.2 Computer-aided detection of anatomic sections in CapView

The CapView Annotation Software makes use of computer-aided detection (CADe) to perform an automated topographic segmentation of the human gastrointestinal tract. The topographic segmentation allows the discrimination of the four main anatomic sections of interest (entrance, stomach, small intestine and large intestine). The automatic segmentation applies a computer vision algorithm that uses Support Vector Machine (SVM) classification applied to MPEG-7 descriptor vectors (the detailed algorithm is described in [3]).

The Automated Topographic Segmentation (ATS) algorithm execution cycle comprises three stages: training, single image classification and segmentation (Fig. 2). The training stage is a preparatory step, executed once to obtain the SVM classifiers that are later loaded into the single image classification stage. The latter iterates over each frame and classifies it based on the trained model using MPEG-7 scalable colour features. Each frame is coded by its zone number (1, 2, 3 or 4); an average WEC exam comprises about 60,000 frames. Finally, the segmentation stage applies a global model-fitting approach to estimate the positions of the esophagus-gastric junction, the pylorus and the ileo-cecal valve based on zone transitions. The locations are then written to an XML file and are ready to be opened with the CapView annotation software. The algorithm was implemented in C++, using the compatible FFMPEG and MPEG-7 libraries. The elaboration on the ATS algorithm can be found in [3] and thus will be omitted here.
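To make the three-stage cycle more concrete, the sketch below illustrates the single image classification stage in Java. It is not the authors' implementation (the real ATS is written in C++ over the FFMPEG and MPEG-7 libraries, see [3]); FrameSource and ZoneClassifier are hypothetical placeholders for the video decoder and the trained SVM model, and firstTransition is only a crude stand-in for the global model-fitting segmentation.

import java.util.ArrayList;
import java.util.List;

// Hypothetical placeholder: wraps the decoded WEC video stream.
interface FrameSource {
    boolean hasNext();
    double[] nextScalableColorDescriptor();  // MPEG-7 scalable colour features of the next frame
}

// Hypothetical placeholder: the SVM model produced by the training stage.
interface ZoneClassifier {
    int classify(double[] descriptor);       // returns zone 1..4 (entrance, stomach, small/large intestine)
}

public class SingleImageClassification {

    // Classifies every frame independently; this is the stage that is parallelised on the Grid/cluster.
    public static List<Integer> classifyExam(FrameSource video, ZoneClassifier svm) {
        List<Integer> zonePerFrame = new ArrayList<>();
        while (video.hasNext()) {
            zonePerFrame.add(svm.classify(video.nextScalableColorDescriptor()));
        }
        return zonePerFrame;                  // ~60,000 labels for an average exam
    }

    // Crude stand-in for the segmentation stage: first index where one zone gives way to the next.
    public static int firstTransition(List<Integer> zones, int fromZone, int toZone) {
        for (int i = 1; i < zones.size(); i++) {
            if (zones.get(i - 1) == fromZone && zones.get(i) == toZone) return i;
        }
        return -1;                            // transition not found
    }
}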
Fig. 2: Three-stage Automated Topographic Segmentation (ATS) algorithm.

2.3 Accessing advanced computing infrastructures from CapView

CapView runs the ATS algorithm on the desktop computer where the application is in use. This segmentation, however, is computing intensive and takes a long time to complete, depending on the machine's resources. A current desktop with a single CPU can take ~16 minutes to complete the ATS per 100 MB of data; since the exam size usually varies between 200 and 600 MB, a full desktop analysis therefore takes roughly from 32 to 96 minutes. The analysis of a single frame could take between 8 and 15 seconds (varying with the frame's information content). Note that the existing ATS implementation within the CapView application does not handle concurrency.

The availability of advanced computing infrastructures for e-science [5] motivated us to investigate whether they could provide a feasible solution to move the computing-intensive analysis tasks to remote infrastructures, with no extra knowledge of advanced computing architectures required from the domain users. We use the existing desktop-based CapView application (Fig. 1) as a familiar front end to submit the exam for remote analysis in a high-end infrastructure, in which the single image classification stage of the ATS algorithm is run. This stage is the most costly part, representing ~90% of the overall analysis time. The user just selects the corresponding menu option in CapView (Auto > Submit for online topography detection). The submission parameters (e.g. endpoints) and the partitioning strategy are defined in configuration files of the CapView application, so the submission is just a single step for the end-user. The submission form provides relevant feedback to the user on the status of each exam (e.g. completion status, start time, finish time as well as upload time) (Fig. 3).
Fig. 3: Monitoring the status of remote analysis from CapView.

Each user submission, besides the input data, carries an XML manifest document describing how the job should be handled remotely. The XML schema is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema attributeFormDefault="unqualified" elementFormDefault="qualified"
            version="1.0" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="jobDescription">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="job">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="file" type="xsd:string" />
              <xsd:element name="infrastructure" type="xsd:string" />
              <xsd:element name="totalParts" type="xsd:int" />
              <xsd:element name="part" type="xsd:int" />
              <xsd:element name="completed" type="xsd:string" />
            </xsd:sequence>
            <xsd:attribute name="type" type="xsd:string" />
            <xsd:attribute name="segmentation" type="xsd:string" />
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
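As an illustration only, the following sketch shows how a client could emit a manifest instance conforming to the schema above using the standard Java DOM API. The element and attribute names are taken from the schema as reconstructed here; the concrete values (e.g. the job type string) are invented for the example.

import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ManifestWriter {

    // Builds an XML manifest for one partition of an exam and returns it as a string.
    public static String buildManifest(String examFile, String infrastructure,
                                       int totalParts, int part) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

        Element jobDescription = doc.createElement("jobDescription");
        doc.appendChild(jobDescription);

        Element job = doc.createElement("job");
        job.setAttribute("type", "ATS-SIC");        // illustrative value
        job.setAttribute("segmentation", "true");   // illustrative value
        jobDescription.appendChild(job);

        appendText(doc, job, "file", examFile);
        appendText(doc, job, "infrastructure", infrastructure);  // e.g. "grid" or "cluster"
        appendText(doc, job, "totalParts", Integer.toString(totalParts));
        appendText(doc, job, "part", Integer.toString(part));
        appendText(doc, job, "completed", "false");

        StringWriter out = new StringWriter();
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.INDENT, "yes");
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    private static void appendText(Document doc, Element parent, String name, String value) {
        Element e = doc.createElement(name);
        e.setTextContent(value);
        parent.appendChild(e);
    }
}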
To validate this approach, we performed a quantitative analysis of the end-to-end times; end-to-end time means the elapsed time from the moment the user submits an exam for remote analysis in CapView until the exam segmentation results are obtained and ready for visualization. The overall online analysis workflow has the following steps (Fig. 4):

1. The user submits an exam for remote analysis in CapView.
2. The exam and an XML job description file are uploaded to a staging location, a secure FTP server.
3. If a new unprocessed exam is detected (by polling the FTP area), the exam is picked up and submitted to a production infrastructure.
4. Once the analysis is completed, the set of results is stored in the same staging area.
5. The CapView application monitors the staging area for results at a configurable frequency and, if the analysis is ready, it retrieves the results.
6. Finally, the global fitting model (or segmentation) results (1 KB in size) are read by the CapView application and the visual segmentation is presented to the user.

The above-mentioned secure FTP area acts as a staging space. The exams still need to be submitted to a workload management system for actual scheduling and processing. The choice of FTP as a shared collaborative space makes it easier to integrate with the advanced computing infrastructures, for the sake of validating the approach.

Fig. 4: Online topography detection workflow (CapView-managed workflow).

2.4 Grid-enabled image classification workflow

Grid infrastructures gather massive processing power and storage resources, geographically distributed and abstracted by a Grid middleware layer [6]. Grid infrastructures are being successfully used in large-scale medical imaging analysis [7]. Application developers need to define a data distribution and job parallelization strategy compatible with the Grid execution environment, usually leading to a workflow that defines dependencies between subtasks and data. In our case, a simple bag-of-tasks approach with domain partition is being tested, since the analysis of each video frame does not depend on the previous or following frames. The original dataset can be divided into smaller segments and distributed to several worker nodes, all running the same operator (single image classification) in parallel. During the processing stage there is no need for communication between tasks, but, afterwards, partial results need to be collected, as the topographic segmentation uses the results from each video frame.

Grid access was implemented over the IEETA Grid Framework (IGF), a Grid interfacing SDK developed in our group. IGF allows interaction with the glite middleware, exposing a developer-friendly Java API. In this work, IGF was used to integrate with the Ibergrid infrastructure, specifically with the IEETA site and the main central node in Portugal, running glite middleware. This site runs a glite 3.1 LCG Computing Element (CE) composed of a TORQUE 2.3 Resource Manager and a Maui 3.2 Cluster Scheduler.
For the purpose of this work, we created a helper application on top of the IGF framework to provide an automatic job submission system to the Grid infrastructure, the Grid Automated Submission (GAS) application. GAS acts as a watchdog: it monitors the FTP staging area for unprocessed exams with a given periodicity and, when a new exam is available, it starts a Grid proxy, transfers the video data to be analyzed and creates the required job specifications in the Job Description Language (JDL), matching the logic and partition strategy configured for the automatic analysis. Next, the data set is uploaded to a Grid Storage Element (SE) and registered in the LCG File Catalog (LFC); only then are the jobs submitted and the job monitoring phase started.

Each processing node runs the ATS single image classification (ATS-SIC) on a different video section and produces a binary file as output, typically 150 KB in size. When the GAS application detects that a job has finished in the Grid, it fetches the corresponding results. GAS must wait for the last job to complete in order to aggregate all job results and produce the all-video analysis output. This aggregation is achieved by a simple algorithm that joins the results in the correct order into a single file and runs the segmentation phase of the ATS algorithm, returning the final segmentation results (typically taking less than 10 seconds to execute). Finally, the results are placed in the designated FTP area (Fig. 5).

The overall Grid-supported analysis workflow may be summarized in the following steps (a sketch of the GAS watchdog cycle is given after the list):

1. The GAS application detects whether there are unprocessed exams in the staging area; if so, it uploads the video file to the Grid storage element and prepares the job submission.
2. The jobs are submitted to the Grid workload manager to be run in parallel. The number of jobs depends on the partition strategy.
3. Each node runs the ATS-SIC on its assigned video section.
4. The progress of the jobs is monitored by the GAS application and, when all jobs are successfully completed, the partial output results are aggregated and the segmentation phase is executed.
5. The segmentation results are uploaded to the shared staging area.
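The sketch below summarizes the GAS watchdog cycle described above. Since the IGF API is not reproduced in the paper, GridClient and StagingArea are hypothetical placeholder interfaces standing in for IGF and the FTP staging area; the real GAS additionally handles proxy creation, JDL generation and SE/LFC registration, which are omitted here.

import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper over the FTP staging area.
interface StagingArea {
    List<String> unprocessedExams();
    byte[] download(String exam);
    void uploadResult(String exam, byte[] segmentationXml);
}

// Hypothetical stand-in for the IGF Java API used to submit and monitor Grid jobs.
interface GridClient {
    String submitPart(byte[] videoPart, int part, int totalParts);  // returns a job id
    boolean isFinished(String jobId);
    byte[] fetchOutput(String jobId);                               // per-part binary classification output
}

public class GridAutomatedSubmission {

    public void watch(StagingArea staging, GridClient grid, int totalParts,
                      long pollMillis) throws InterruptedException {
        while (true) {
            for (String exam : staging.unprocessedExams()) {
                byte[] video = staging.download(exam);
                String[] jobs = new String[totalParts];
                for (int p = 0; p < totalParts; p++) {              // one job per video section
                    jobs[p] = grid.submitPart(slice(video, p, totalParts), p, totalParts);
                }
                byte[][] partial = new byte[totalParts][];
                for (int p = 0; p < totalParts; p++) {              // wait for every part, in order
                    while (!grid.isFinished(jobs[p])) Thread.sleep(pollMillis);
                    partial[p] = grid.fetchOutput(jobs[p]);
                }
                staging.uploadResult(exam, segment(partial));       // aggregation + ATS segmentation phase
            }
            Thread.sleep(pollMillis);                               // watchdog periodicity
        }
    }

    // Domain partition: contiguous byte range for one part of the exam.
    private byte[] slice(byte[] video, int part, int total) {
        int start = (int) ((long) video.length * part / total);
        int end = (int) ((long) video.length * (part + 1) / total);
        return Arrays.copyOfRange(video, start, end);
    }

    // Placeholder for joining partial results and running the global model fitting.
    private byte[] segment(byte[][] partialResults) {
        return new byte[0];
    }
}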
Fig. 5: Grid interaction schematics. (A - Data sets upload to FTP server; B - Data sets download to Grid UI; C - Data sets upload to Grid SE; D - Job submission to the Grid; E - Data sets download to Grid WN; F - Job status and results upload; G - Results upload to FTP server; H - Results retrieval by the CapView application.)

2.5 Cluster-enabled image classification workflow

In this work, we also tested the remote exam analysis in a dedicated cluster, an IBM BladeCenter JS21 (Dual Core 2.3 GHz 64-bit PowerPC 970 processors), available at the University of Porto and running the TORQUE 2.3 scheduler.

As for the Grid scenario, we ported the single image classification of the ATS algorithm to the cluster without considering specific optimizations. Although some work has been started on optimizing the single image classification of the ATS algorithm for a cluster environment, in this test we are just using domain partition, assigning a video section to each node. The cluster analysis workflow is similar to the Grid one:

1. It starts by detecting any unprocessed exams in the staging area and, if so, uploads them to the cluster infrastructure.
2. The cluster creates a number of pre-defined jobs depending on the partitioning strategy and runs the single image classification of the ATS algorithm on its available blades.
3. When all parts have been completed, the results of each part are collected and aggregated (see the sketch after this list).
4. The segmentation step is executed to produce an XML report file listing the estimated topographic locations.
5. The final step consists in uploading the results to the staging area.
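The aggregation mentioned in step 3 can be as simple as concatenating the per-part outputs in order, as in the following sketch; the result.partN file naming is an assumption made for illustration, not the naming used by the authors.

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class ResultAggregator {

    // Concatenates result.part0 .. result.part(N-1) into a single file, preserving frame order.
    public static void join(Path dir, int totalParts, Path merged) throws IOException {
        try (OutputStream out = Files.newOutputStream(merged)) {
            for (int p = 0; p < totalParts; p++) {
                out.write(Files.readAllBytes(dir.resolve("result.part" + p)));
            }
        }
        // The segmentation phase of the ATS is then run on 'merged' (typically under 10 seconds).
    }
}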
3. Results

3.1 Experimental setup

The conducted experiments try to validate whether a seamless integration of remote processing in CapView is feasible and can lead to better execution times. For these experiments, 10 anonymous sample exams were picked from the IEETACapDB database of 190 WEC exams (Figure 6), representative of the actual file size distribution, with small, average and large files included. In clinical routine, in which WEC exams identify concrete patients, privacy issues need to be handled; this topic is not addressed in this paper, as we are using anonymous data and assuming that routine exams can be submitted for remote analysis without patient identification. Nevertheless, several solutions have been proposed to secure the use of clinical data on the Grid (e.g. [8]).

Figure 6: Selected WEC data set for the experiment.

Our initial experiment consisted in uploading the whole video exam for analysis and equally dividing the analysis task between nodes, testing three different data partitions: 4, 8 and 16 parts. The ATS-SIC is executed at each node, on different sections of the exam, in parallel, using the same number of nodes as the number of parts. For example, in the 4-split run, we split the video into 4 equal parts and have them analyzed on 4 nodes, each node analyzing 1/4 of the video (0%-25%, 25%-50%, 50%-75% and 75%-100%); a sketch of this equal-split partitioning is given below. The GAS module ensures the proper data partition, generates the matching job specifications and handles the submission. Once all jobs are completed, the results are aggregated by the application which initially submitted the job and the segmentation step is executed. Finally, the segmentation results are uploaded to the staging area.

This experiment was submitted to the Cluster and to the Ibergrid infrastructure. It is important to stress that the cluster infrastructure was idle, with full resources available for the entire duration of our tests, while the Ibergrid infrastructure was used in production, with a competing workload. For reference purposes, all exams in the dataset were also analyzed on a modern desktop computer (Intel Core 2 Duo; although the desktop computer is multi-core, the ATS was run without concurrency).
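For illustration, the equal-split partitioning used in the 4-, 8- and 16-part runs can be expressed as follows; this is a minimal sketch, and the frame count is only indicative of an average exam.

public class Partitioner {

    // Returns {start, end} (end exclusive) of part 'p' out of 'totalParts' over 'length' items.
    public static long[] range(long length, int p, int totalParts) {
        long start = length * p / totalParts;
        long end = length * (p + 1) / totalParts;
        return new long[] { start, end };
    }

    public static void main(String[] args) {
        // A 60,000-frame exam split in 4: [0,15000), [15000,30000), [30000,45000), [45000,60000)
        for (int p = 0; p < 4; p++) {
            long[] r = range(60_000, p, 4);
            System.out.println("part " + p + ": frames " + r[0] + " to " + r[1]);
        }
    }
}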
One of the drawbacks encountered with the initial Grid submission approach was that the entire exam was being fully uploaded to the corresponding worker node, even when the node was to process only a segment. This would imply data transfer costs if the nodes do not share a close storage element. The subsequent step was to change the Grid workflow to address this bottleneck, by uploading only the part of the exam that would be required for analysis at each node.

The Grid tests were run under the bing.vo.ibergrid.eu Virtual Organization (a VO created to serve the Portuguese Brain Imaging Research Community, counting for the moment on the resources of two Portuguese Grid sites, IEETA and LIP-Lisbon), supported by the Ibergrid infrastructure [9]; the cluster tests were executed at the University of Porto.

3.2 Results discussion

The experiments demonstrated that, through the CapView remote analysis submission form, we were able to process exams and retrieve results from the remote ATS execution on advanced computing resources.

In the desktop environment, the larger the video file, the longer the analysis, as expected (Fig. 7), with an average-sized exam taking about one hour to process.

Fig. 7: ATS analysis times in hh:mm:ss when executed on a modern desktop computer.

The results for the cluster infrastructure show a linear improvement: as the exam is further partitioned (reducing the data to process at each node), the processing time for each part is reduced (Fig. 8). In this experiment, we can see that cutting the data segments by half (4, 8, 16 splits) also reduces the processing time by half (this is only true for the algorithm processing times; other operations are independent and common to all jobs).
Fig. 8: ATS analysis times in hh:mm:ss from the Cluster runs (statistics compare the duration of parallel jobs).

The Grid tests (Fig. 9) follow the general cluster trend, with the 16-split strategy getting the best results, followed by the 8-split and, last, the 4-split; but here we do not have a dedicated and homogeneous environment, as in the cluster experiment. Note that the Grid infrastructure used is a production one with varying workloads (no dedicated nodes were used). The standard deviation values (comparing the variability of duration within the parallel jobs of an exam) reveal the less predictable nature of the Grid when compared to the cluster example.

Fig. 9: ATS analysis times in hh:mm:ss from the Grid runs (statistics compare the duration of parallel jobs).

A direct comparison between end-to-end times on the desktop, the cluster and the Grid is not rigorous, as no significant effort was made to ensure controlled and comparable test conditions; in addition, no optimization was implemented taking the specific target infrastructure into consideration; moreover, it is known that the heterogeneity of Grid resources impacts the overall performance [10]. All these factors can significantly affect the results. With these constraints in mind, it is still useful to compare the processing times to obtain an overall assessment. This is depicted in Fig. 10, which shows the average time it takes to process 100 MB of data in the different experiments. The available results show that integrating advanced computing remote submission capabilities in CapView is feasible and can deliver faster results to the domain end-user, usually a medical doctor.
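For clarity, the normalized metric shown in Fig. 10 can be obtained by scaling each measured end-to-end time to a 100 MB equivalent, as in this small sketch; the numbers in the example are illustrative, not the paper's measurements.

import java.time.Duration;

public class Normalizer {

    // Scales an end-to-end time to the equivalent time for 100 MB of exam data.
    public static Duration per100MB(Duration endToEnd, double examSizeMB) {
        return Duration.ofMillis(Math.round(endToEnd.toMillis() * 100.0 / examSizeMB));
    }

    public static void main(String[] args) {
        // e.g. a 400 MB exam processed end-to-end in 24 minutes corresponds to 6 minutes per 100 MB
        System.out.println(per100MB(Duration.ofMinutes(24), 400.0));
    }
}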
Fig. 10: Average processing time for 100 MB of data in hh:mm:ss.

4. Conclusions

In this study we were able to integrate advanced computing infrastructures into a clinical tool that has been used by medical doctors for several years in routine hospital care, enabling it to invoke the WEC video analysis on a production Grid and on an academic cluster.

Grid infrastructures provide extensive amounts of resources but still require significant knowledge of their architecture and operations to develop applications. The use of a familiar desktop application as a Grid front end can be an important step towards making the use of Grids practical for end-users. This was demonstrated in our experiment: access to remote advanced computing facilities is seamlessly integrated into the CapView annotation software that doctors already use in routine practice.

The results obtained demonstrate that advanced computing infrastructures can be seamlessly integrated into an end-user medical imaging application to perform complex data analysis in a more efficient way. Moreover, both Grid and cluster technologies were integrated and can be used from the same clinical interface. This sort of integration of advanced computing capabilities with end-user applications can provide solutions to the ever-growing CPU demand of modern medical imaging analysis.

Acknowledgements

The authors would like to thank S. Lima and D. Pacheco for their previous contributions. This work was partly supported by FCT (Portuguese Science and Technology Agency) and the FEDER EU program under grants CapView (PTDC/EEA-ELC/72418/2006), BING (GRID/GRI/81819/2006) and GeresMed (GRID/GRI/81833/2006).
168 168 IBERGRID References [1] W. A. Qureshi, "Current and future applications of the capsule camera," Nature Reviews Drug Discovery, vol. 3, pp , May [2] S. L. Triester, et al., "A meta-analysis of the yield of capsule endoscopy compared to other diagnostic modalities in patients with obscure gastrointestinal bleeding," American Journal of Gastroenterology, vol. 100, pp , Nov [3] J. P. S. Cunha, et al., "Automated topographic segmentation and transit time estimation in endoscopic capsule exams," IEEE Transactions on Medical Imaging, vol. 27, pp , Jan [4] I. C. Oliveira, et al., "Automated endoscopic capsule analysis using a Grid computing environment.," presented at the IberGrid 2010, Braga, Portugal, [5] F. Gagliardi, "The EGEE European grid infrastructure project," High Performance Computing for Computational Science - Vecpar 2004, vol. 3402, pp , [6] I. Foster and C. Kesselman, The Grid 2: Blueprint for a New Computing Infrastructure, 2nd ed. Amsterdam ; Boston: Morgan Kaufmann, [7] P. Bonetto, et al., "Integrating medical Imaging into a Grid based computing infrastructure," Computational Science and Its Applications - Iccsa 2004, Pt 2, vol. 3044, pp , [8] M. Olive, et al., "SHARE the journey: A European Healthgrid Roadmap," [9] J. P. S. Cunha, et al., "BING: The Portuguese Brain Imaging Network GRID," presented at the IberGrid 2007, Santiago de Compostela, Spain, [10] I. Kouvakis and F. Georgatos, "A report on the effect of heterogeneity of the grid environment on a grid job," Large-Scale Scientific Computing, vol. 4818, pp , 2008.
169 IBERGRID 169 Exchanging Data for Breast Cancer Diagnosis on Heterogeneous Grid Platforms Damià Segrelles 1, José Miguel Franco Valiente 3, Rosana Medina 2, Ignacio Blanquer 1, José Salavert 1, Vicente Hernández 1, Luis Martí 2, Guillermo Díaz Herrero 3, Raúl Ramos Pollán 3, Miguel Ángel Guevara López4, Naymi González de Posada 4, Joana Loureiro 5, Isabel Ramos 5 1 Instituto de Instrumentación para Imagen Molecular (I3M), Universitat Politècnica de València, València, Spain {dquilis}{iblanque}{vhernand}@dsic.upv.es, [email protected] 2 Universitary Hospital Doctor Peset, Valencia, Spain [email protected], [email protected] 3 CETA-CIEMAT, Centro Extremeo de Tecnologías Avanzadas, Trujillo, Spain {josemiguel.franco}{guillermo.diaz}{raul.ramos}@ciemat.es 4 Faculty of Engineering, University of Porto, Porto, Portugal {mguevaral}{nlgezposada}@inegi.up.pt 5 Faculty of Medicine, University of Porto, Porto, Portugal [email protected], [email protected] Abstract. This article describes the process of defining and implementing new components to exchange data between two real GRID-based platforms for breast cancer diagnosis. This highly collaborative work in development phase pretends to allow communication between middleware, namely TRENCADIS and DRI, in different virtual organizations. On the one hand, TRENCADIS is a Service-Oriented Architecture in which the usage of resources is represented with Grid services based on the Open Grid Service Architecture specification (OGSA).On the other hand, DRI is a software platform aimed at reducing the cost of hosting digital repositories of arbitrary nature on Grid infrastructures. TRENCADIS has been deployed in the Dr. Peset Hospital (Valencia, Spain) and DRI has been deployed in the São João Hospital (Porto, Portugal). The final objective of this work in progress is to share medical images and its associated metadata among geographically distributed research institutions, while maintaining confidentiality and privacy of data. 1 Introduction Nowadays, radiologists use computer technologies to improve the image diagnosis process. Image diagnosis involves medical images and reports with the description of the findings in the image. There are two main research lines in this field: the storage and processing of images; and the integration of diagnostic reports through semantic interoperability. With respect to the processing and storage of digital images, approaches concentrate on the use of Peer to Peer (P2P) and Grid technologies to allow the federation of distributed data storages and computing resources.
170 170 IBERGRID Talking about data sources, PACS (Picture Archiving and Communication Systems) and RIS (Radiology Information Systems) organise data according to patient-centric information, as they are oriented to healthcare delivery. Also, its security systems do not support multi-domain structures. Regarding Data Grid technologies, SRB (Storage Resource Broker [1]) and glite [2] provide general-purpose tools for storing large amounts of data in distributed environments. On the one hand, several projects like BIRN (Biomedical Informatics Research Network) [3], a North American initiative that aims at building a virtual community of shared resources in the field of brain degenerative diseases, are based on SRB. On the other hand, the NeuroLOG project [4], which is a French middleware aimed at sharing and processing brain disease images, uses MDM [5] (Medical Data Manager) to manage information. MDM uses components from glite (AMGA, LFC, Hydra...). Moreover, the NeuGRID project [6] is a European initiative whose goal is to develop public electronic infrastructures needed by the European neuroscience community. It is composed by several services that follow the SOA (Service Oriented Architecture) philosophy and it also uses glite components (gliteui, AMGA, CE, SE, BDII, DPM and WN). Other projects have developed their own components based on Grid technology. CaBiG [7] is creating a network that will connect the entire cancer community. This project has developed a Grid Layer namely CaGRID [8]. With respect to the second research branch (semantic interoperability for diagnostic reports integration), imaging repositories have strongly evolved, providing tools for sharing and processing studies of images and its associated metadata, and allowing data-mining techniques that improve diagnosis and therapy. In this sense, evidence-based medicine is a case-based methodology that relies on high quality and organized medical knowledge. Data annotation is often performed by centrally storing metadata (in the case of MDM on centralized AMGA databases). Some projects, like BIRN, define their own lexicon in order to manage complex data representations. There are also collaborative data collection projects that define diagnostic report templates, Health-e-Child [9] focuses on creating a common database of pediatric disorders with the support of many medical centers. Other approaches consider also structured reports to enable context-based retrieval (such as Mammogrid [10] or NeuroBase [11]). In those approaches, metadata required for context-based searching is also located on central repositories that store the annotation of the medical images. In both research branches, developments based on the DICOM standard [12] (Digital Imaging and Communications in Medicine) have demonstrated to be a convenient and widespread solution among the medical community. In fact, the majority of hospitals and vendors choose the DICOM standard to store and exchange digital images (CT, MRI, X-Ray ). Our work aims at developing an Iberian collaborative network on breast cancer diagnosis by sharing medical images and their associated reports metadata among geographically distributed research institutions from Spain and Portugal, while maintaining the confidentiality and privacy of data. To achieve this goal, we collaborate to allow communication between two different middleware in different virtual organizations, namely TRENCADIS [13] and DRI [14] [15]. This results in the definition and implementation of a platform that
171 IBERGRID 171 allows medical data to be shared between geographically distant infrastructures. Both middleware use glite-compliant [2] storage resources as backend (such as EGI [16] or NGI [17]). Two real deployments are presented and the specification of new middleware compatibility components for DICOM data sharing. The first deployment is based on TRENCADIS and has been performed in the Dr. Peset Hospital (Valencia, Spain). The second deployment is based on DRI and has been performed in the São João Hospital (Porto, Portugal). Both deployments aim at the management of breast cancer mammography reports. Section 2 shows the concrete objectives of this article. Section 3 summarizes the DRI technology and its deployment in the São João Hospital. Section 4 describes the TRENCADIS technology and its deployment in the Dr. Peset Hospital. Section 5 specifies the steps to follow in order to define the components needed for data exchange between the platforms, detailing the steps already completed. Finally, expected benefits and conclusions are presented. 2 Objectives In this paper, the main objective is to set a procedure to develop data exchange components for TRENCADIS and DRI. These components must allow information exchange among both technologies, while improving the diagnosis of the breast cancer process. For the attainment of this general objective, the following targets have been defined: * To study two real infrastructures, TRENCADIS and DRI, in order to allow connectivity between them. These infrastructures have been already deployed in Dr. Peset Hospital and São João Hospital respectively, leaving the door opened for the incorporation of new medical institutions. * To identify and specify the TRENCADIS and DRI data fields needed to translate the information in a bidirectional way. This information is referred to as diagnosis of breast cancer mammography explorations. * To design an interface in both middleware accommodated to the identified data fields. The new components will use this interface to allow compatibility. * The remaining steps will be completed in future works. 3 DRI Technology The Digital Repository Infrastructure (DRI) is a software platform developed by CETA-CIEMAT aimed at reducing the cost of hosting digital repositories of arbitrary nature on Grid infrastructures, providing both users and repository providers a set of graphical and conceptual tools that easily define repositories and manage content. The digital repository presented here is composed of a set of units of digitalized content annotated with metadata [18] described through an entity-relationship
172 172 IBERGRID model. With DRI, a repository provider describes his data model in an XML file and has immediate access to a set of standard graphical user interfaces for browsing and managing repository content stored on a Grid infrastructure. On top of that, he could also develop custom tools to provide specific functionality for his repository (for content viewing, data analysis, etc.). This way, a repository of mammography studies is composed of digital content (mammograms) and metadata (patient info, diagnoses, etc.). A deep description of the DRI architecture and services can be found in [14] [15]. Fig. 1. General diagram of DRI architecture 3.1 Deployment of Infrastructure The current deployment of infrastructure involves three centres: The Faculty of Medicine and the INEGI, from Porto, and the CETA-CIEMAT, located in Trujillo, Spain. There is a full deployment of the DRI platform located in the São João Hospital, in Porto, where the Faculty of Medicine is located. The platform is configured in a standalone mode, storing the repository metadata in a local MySQL database and the digital content in the local file system of the deployment machine. The Mammography Image Workstation for Analysis and Diagnosis (MIWAD) is a rich DRI client doctors use as a frontend to interact with the repository
platform. As a rich client, MIWAD implements image processing operations that are used for supervised mammography segmentation and classification.

Both INEGI and CETA-CIEMAT host other instances of the DRI platform, used as replicas of the repository data. The INEGI DRI configuration stores metadata in a MySQL database and digital content on an FTP server. The CETA-CIEMAT DRI instance also stores metadata in a MySQL database, but digital content is stored in a glite storage element.

3.2 Deployment objectives

The aim of the previously mentioned deployment is to validate DRI usage in two scenarios: (1) managing large collections of federated mammogram studies and (2) building CAD systems that help doctors with mammogram analysis and diagnosis. A sample deployment scenario and a description of the CAD system were presented in [19].

4 TRENCADIS Technology

TRENCADIS technology defines a horizontal architecture that organizes virtual repositories of DICOM objects. TRENCADIS is a Service-Oriented Architecture (SOA) in which the usage of resources is represented as Grid services based on the Open Grid Service Architecture (OGSA) specification. TRENCADIS is structured into several layers that provide developers with different abstraction levels. Figure 2 shows a diagram of the TRENCADIS infrastructure. A description of the TRENCADIS architecture and its services can be found in [13].

Fig. 2. General diagram of TRENCADIS architecture
174 174 IBERGRID 4.1 Deployment of Infrastructure The deployment of infrastructure involves two centres; these are the Polytechnic University of Valencia (UPV) and the Universitary Hospital Dr. Peset. The TRENCADIS Grid services have been distributed as follows: * VOMS Server: The VOMS Service manages user memberships, groups and roles in the VO. * Ontologies Server: This Grid Service contains the federated report templates. * Storage Broker: This Grid Service offers the infrastructure needed for indexing DICOM objects. * Index Information Service: This Grid Service keeps information about the Grid services deployed in the infrastructure. Fig. 3. Scheme of the TRENCADIS infrastructure deployment The services located in Core Middleware have been deployed in the Universitary Hospital Dr. Peset: * Storage DICOM : This Grid Service offers the infrastructure needed for sharing DICOM objects by using federated report templates. * AMGA Server: AMGA manages the metadata from files stored in the Grid (mainly DICOM images and DICOM-SR). A more detailed description of all these Grid Services was presented in [13]. Also, a web application prototype has been developed and hosted in the UPV. The application is acceded from the Mammography Department of Universitary Hospital Dr. Peset and uses TRENCADIS middleware components to access the Grid services deployed in the infrastructure.
175 IBERGRID Deployment objectives The two main objectives of this deployment are to federate structured report schemes among different centers and to provide a framework for sharing semantic annotations of breast cancer images. At the end, a comprehensive view of images and diagnostic reports is achieved along the whole infrastructure. 5 Data exchange components for DRI and TRENCADIS As it was mentioned in previous sections, this article describes the current process in which involved institutions are defining components to exchange data between the deployments from Porto and Valencia, in order to improve breast cancer diagnosis. This is a highly collaborative work that involves the DRI and TRENCADIS development teams and the doctors from both hospitals. The design of data exchange components between the deployed instances in São João Hospital and Dr. Peset Hospital must deal with two main issues: * Both systems were not designed to interoperate from the beginning. * Data confidentiality and privacy must be preserved between them. In order to deal with the first issue of the data exchange between both systems because of their different architectures, we chose to define a common data exchange schema, XML based, so that the data exchange components will be developed as client modules of both infrastructures that will be able to import and export data from and to the defined data exchange schema. On the other hand, in order to preserve confidentiality and privacy of data to exchange, a set of rules was defined in collaboration with doctors from both hospitals to mark all sensitive data from patients so that data exchange components will ignore it when exporting data. Once all data to exchange is converted to follow the data exchange schema, it will be properly encrypted and transmitted through a secure transmission channel. The connection between components will be point to point. Therefore, several steps have been established to afford the development of the data exchange components. These steps are the following: 1. Defining a data exchange schema. This schema should include all data fields from both systems to be exchanged, stating the mapping from similar meaning fields and including the rest of data as new fields. 2. Modifying both data models to support non-existing fields. 3. Designing and developing components to export system data to the data exchange format and to import data received in that format. 4. Determining data security and integrity between systems. 5. Defining data replication policies. Next sections describe the progress of these steps.
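As an illustration of the design just described (and not of the actual components, which are still under development), the sketch below shows an export step that applies mapping and exclusion rules before serialization to the common XML exchange schema; all field names used here are invented for the example.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class StudyExporter {

    // Exclusion rules agreed with the doctors: sensitive patient data is never exported.
    private static final Set<String> SENSITIVE = Set.of("patientName", "patientId", "birthDate");

    // Mapping rules from a local data model to the common exchange schema (hypothetical names).
    private static final Map<String, String> FIELD_MAP = Map.of(
            "birads", "Lesion/biRadsCategory",
            "lesionType", "Lesion/type",
            "acquisitionDate", "Study/date");

    // Returns only the exportable fields, renamed to the exchange-schema names.
    public static Map<String, String> toExchangeFields(Map<String, String> localRecord) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : localRecord.entrySet()) {
            if (SENSITIVE.contains(e.getKey())) continue;            // exclusion rule
            out.put(FIELD_MAP.getOrDefault(e.getKey(), e.getKey()),  // mapping rule
                    e.getValue());
        }
        return out;  // would then be serialized to the XML schema and sent over a secure channel
    }
}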
176 176 IBERGRID 5.1 Definition of the data exchange schema The first step to complete in the process of developing data exchange components between DRI and TRENCADIS deployments is the definition of a data exchange schema, as both systems were not designed to interoperate at the beginning. Both middleware have different components and are based on different technologies. Moreover, even though glite storage elements are used to store data, the organization of data is quite different. The data exchange schema should allow data exchange between both systems without losing information that would be relevant to the diagnosis process. After analysing the data models of each system, it was decided to design a data exchange schema based on XML. The election of XML as data exchange format was because it is an open standard, highly extensible and widely used in web services interoperability. The flexibility offered by XML allows any future adaptation of the data schema while minimizing changes impact. Doctors from both hospitals have played an important role in the definition of the data exchange schema, because, without their expertise, it would be a really hard work to define the mapping rules between the fields of each system data model and the fields of the data exchange schema. These rules will be used by the exchange components in the data importing and exporting process. Also, the exclusion rules that will be applied over sensitive data from patients during the exporting process have been defined. A diagram of the XML based data exchange schema is shown in Fig. 4 (just the top elements of the schema hierarchy are shown). This process is being deeply influenced by the feedback of the doctors from both hospitals. Data models are continuously changing in order to fit expert needs and, therefore, the development speed is not as quick as expected. 5.2 Update of both systems data models Data exchange between the two systems requires the modification of the data models in order to accommodate the unique fields of each system. Otherwise, there would be a loss of data relevant to the diagnosis. Currently, the changes made to data models have not modified the relations of previously existing models; they have simply added fields to existing entities, such as Study or Lesion. At the moment, just the first two steps established to develop the data exchange components have been completed. Data models will continue changing to meet doctors needs, so future changes are expected. 5.3 Current status of the development of the data exchange components At this point, the development of data exchange components is in progress. Early versions of these components can import and export information in the format defined by the data exchange schema. As mentioned before, the design determines that these components will be implemented as platform clients that interact with APIs provided by middleware,
177 IBERGRID 177 Fig. 4. Diagram of the data exchange schema abstracting the underlying storage layer used. For example, in the DRI case, the same data component can be used to import and export data from the São João Hospital DRI deployment (local file system based), the INEGI replica (FTP based) and the CETA-CIEMAT replica (glite storage element based). According to data integrity and security between both systems, the use of an XML based data exchange schema allows data to be transmitted through any secure protocol such as SCP, SFTP, GSIFTP, SOAP + WS-Security, etc. Data replication policies are not defined yet. 6 Conclusions Interoperability is a key feature in a Grid middleware. However, the integration of different Grid systems with different internal representation models is not straightforward and requires an important effort. Despite of this effort, the integration of different environments is a target that presents large benefits. In this collaboration, it is envisaged that the annotated mammographic database deployed in Portugal, using the DRI technology, will be linked to the annotated mammographic database deployed at the Valencian Hospital Dr. Peset, using TRENCADIS technology. In this sense, this work has allowed to define a data exchange schema. This schema includes all data fields from both systems (DRI and TRENCADIS) to be exchanged, modifying both data models to support non-existing fields.
178 178 IBERGRID This will end up with larger deployments, enriched databases and increased processing tools, without strongly compromising the autonomy of the centres in terms of structure and policies. Acknowledgements The authors wish to thank the financial support received from The Spanish Ministry of Education and Science to develop the project nggrid - New Generation Com-ponents for the Efficient Exploitation of escience Infrastructures, with reference TIN Prof. Guevara acknowledges POPH - QREN-Tipologia 4.2 Promotion of scientific employment funded by the ESF and MCTES, Portugal. CETA-CIEMAT acknowledges the funds granted by the ERDF (European Regional Development Fund). References 1. Moore, R.W, Wan, M., Rajasekar, A. Storage resource broker; generic software infra-structure for managing globally distributed data. Local to Global Data Interoperability - Challenges and Technologies, ISBN: glite. Lightweight Middleware for Grid Computing. Last visited on January Jeffrey S. Grethe, Chaitain Baru, Amarnath Gupta et. al. Biomedical Informatics Research Network: Building a national collaboratory to hasten the derivation of new understanding and treatment of disease. Studies in health technology and informatics, Vol. 112, pp , Johan Montagnat, Alban Gaignard, Diane Lingrand, Javier Rojas Balderrama, Philippe Collet, Philippe Lahire. NeuroLOG: A community-driven middleware design. HealthGrid Johan Montagnat, kos Frohner, Daniel Jouvenot et al. A Secure Grid Medical Data Manager Interfaced to the glite Middleware. Journal of Grid Computing 6(1):45-59, March University of the West of England, Bristol UK. Neugrid deliverable: D6.1b Distributed Medical Services Provision (Design Strategy). Technical Report, CaBIG. Cancer Biomedical Informatics Grid. Last visited on January Infrastructure Overview Page. Last visited on January Jrg Freund, Dorin Comaniciu, Yannis Ioannidis, Peiya Liu, Richard McClatchey, Edwin Morley- Fletcher, Xavier Pennec, Giacomo Pongiglione and Xiang Zhou. Healthe-Child: An Integrated Biomedical Platform for Grid-Based Paediatrics. HealthGrid S.R. Amendolia, F. Estrella, W. Hassan, T. Hauer, D. Manset, R. McClatchey, D. Rogulin, T. Solomonides. MammoGrid: A Service Oriented Architecture based Medical Grid Application. Lecture Notes in Computer Science. ISBN Volume 3251, pp (2004). 11. C. Barillot, R. Valabregue, J-P. Matsumot, F. Aubry, H. Benali, Y. Cointepas et al. NeuroBase : Management of Distributed and Heterogeneous Information Sources in Neuroimaging. DiDaMIC-2004, a satellite workshop of Miccai-2004.
179 IBERGRID Digital Imaging and Communications in Medicine (DICOM) Part 10: Media Storage and File Format for Media Interchange. National Electrical Manufacturers Association, 1300 N. 17th Street, Rosslyn, Virginia USA. 13. I. Blanquer, V. Hernndez, J.E. Meseguer, D. Segrelles. Content-Based Organisation of Virtual Repositories of DICOM Objects. Future Generations Computer Systems. FGCS Journal. ISSN X. DOI: /j.future A. Calanducci, J.M. Martín, R. Ramos, M. Rubio, D. Tcaci, glibrary/dri: A gridbased platform to host multiple repositories for digital content. Proceedings of the Third Conference of the EELA Project, 3-5 December 2007 Catania, Italy. 15. Roberto Barbera et alt. The glibrary/dri platform: new features added to the User Interface and the Business Layer. First EELA-2 Conference, Bogota (Colombia), Proceedings of the First EELA-2 Conference ISBN: , CIEMAT European Grid Infrastructure (EGI). Towards a sustainable grid infrastructure. Last visited March National Grid Inicitive home page. Last visited March W.Y. Arms. Key concepts in the Architecture of the Digital Library. DLib Magazine, July R.R. Pollán et al. Exploiting E-infrastructures for Medical Image Storage and Analysis: A Grid Application for Mammography CAD. BIOMED 2010 Innsbruck, Austria. (Proceedings) Biomedical Engineering (Vol. 1 and Vol. 2) ISBN: I: / II:
Analyzing Image Retrieval on Grids

Oscar D. Robles 1, Pablo Toharia 1, José L. Bosque 2, and Ángel Rodríguez 3

1 Dept. de Arquitectura y Tecnología de Computadores y Ciencia de la Computación e Inteligencia Artificial. U. Rey Juan Carlos (URJC)
2 Dept. de Electrónica y Computadores. U. Cantabria (UC)
3 Dept. de Tecnología Fotónica, Universidad Politécnica de Madrid (UPM)

Abstract. This paper presents a grid implementation analysis of a large-scale Content-Based Image Retrieval system. This solution offers a good cost/performance ratio due to its excellent flexibility, scalability and fault tolerance. Experimental performance results are collected in a real-world situation in order to show the feasibility of applying this solution to different Virtual Organizations. The experiments take into account several grid setups with heterogeneous grid nodes, such as different types of workstations and clusters, including MPI implementations for clusters in order to obtain optimal performance from the available resources.

1 Introduction

The level of maturity achieved in different areas such as signal processing, computer vision, databases or man-machine interaction, along with the impressive evolution of communication systems, has brought about the spread of information systems whose objective is the efficient storage and management of large amounts of multimedia data [10]. This type of system focuses on searching for data as a function of the actual data content, giving rise to the so-called CBIR (Content-Based Information Retrieval) systems [19]. The complexity of this task depends heavily on the number of items stored in the system. In this way, large volumes of data must be considered when dealing with image or video databases. Therefore, alternative strategies to the conventional centralized server must be sought in order to manage the storage and processing of data in an efficient and flexible way. From the point of view of computational complexity, CBIR systems are potentially expensive, with user response times that grow with the ever-increasing sizes of the databases associated with them. One of the most common approaches followed to reach acceptable price/performance ratios has been to exploit the algorithms' inherent parallelism at implementation time [18]. Grid computing is nowadays becoming a feasible solution for computer applications with high demands of computational power [3, 7]. CBIR implementations based on grid solutions are currently arising as a new proposal to increase the order of magnitude of the information available to people and, because of both the high flexibility and the availability offered by this computation paradigm, it makes it
easier to cooperate and share resources among institutions, named Virtual Organizations (VOs) [1, 13]. The Grid implementation of a CBIR system makes it possible to achieve two different goals: quantitatively, to gain speed or to tackle significantly larger volumes of data in a short period of time; qualitatively, to improve resource quality and availability.

This paper presents a grid implementation of a large-scale Content-Based Image Retrieval system. This solution offers a good cost/performance ratio due to its excellent flexibility, scalability, security and fault tolerance. This flexibility allows nodes to be dynamically added to or removed from the grid between two user queries, achieving reconfigurability, scalability and an appreciable degree of fault tolerance. This approach also allows a dynamic management of very large specific databases that can be incorporated into or removed from the CBIR system depending on the desired user query. Experimental performance results have been collected in a real-world situation involving 5 Virtual Organizations. To show the feasibility of this solution, several grid setups with heterogeneous grid nodes, such as different types of workstations and clusters, have been taken into account. This solution contributes decisive advantages from two alternative points of view. From the application perspective, grids provide transparent but, at the same time, exhaustive resource management if required, and very high levels of security, accessibility and simple web interfaces, among others, can also be achieved. The capacity of adding new remote databases improves the system's retrieval ability, since from a computational point of view the system provides better performance, scalability and storage capacity, and is also enriched with heterogeneity.

To our knowledge, this is the first fully operative CBIR system implemented on a heterogeneous real-world grid offering users resources from different Virtual Organizations in a secure and transparent way. The maturity level of the implementation allows an easy migration to more complex systems. In addition, MPI implementations with load balancing mechanisms have been programmed for clusters in order to obtain an optimal performance of the available resources and ensure application portability [2].

Some papers have been published on this topic. For example, Sun et al. [20] perform semantic retrieval of Remote Sensing Images (RSI) using grid technologies and ontologies. Precision and recall values are reported in the paper, but no results are provided about computational performance. The work of Ohkawa et al. [15] is focused on the construction of a theoretical model for logical cluster construction inside a grid environment where proteins are searched using 3D features extracted from the protein surface. Experimental results validate the proposed model. Ding and Lu have implemented Trellis-SDP, a programming system for data-intensive applications [4]. They present a CBIR system as a domain application of Trellis-SDP over a cluster of workstations, with 60,000 images stored in the database. The results are focused on testing cluster performance and the overhead introduced by Trellis-SDP. CBIR systems and grid computing have mixed well in the past for retrieving medical images, but most of the papers describing those systems lack real experimentation and only present a general description
182 182 IBERGRID
of the system architecture [8, 5]. Only Oliveira et al. present precision vs. recall figures for the CBIR system and provide global execution times of the image registration implementation, comparing both sequential and grid versions [16]. Recently, Town and Harrison have tested the Ganga framework for job submission and management in a CBIR system composed of 18 million images distributed over 7 sites [21]. As can be noticed, previous works present very limited experimentation in terms of both grid setup and performance. They are limited to a grid simulation over a cluster of nodes connected by a LAN. Furthermore, no different VOs following a widespread geographical layout are involved. The work herein presented aims to provide a grid implementation tested in a real-world situation while also quantitatively extending previously performed experiments. This implementation manages heterogeneous real-world CBIR environments, involving different VOs placed far from each other. This approach allows a dynamic management of the resources available in the grid for each query. This setup also requires the introduction of security mechanisms to guarantee the privacy of the retrieved images.
2 CBIR system operation
Figure 1 depicts the main operations involved in a CBIR system.
Fig. 1. CBIR system operation.
This process has an off-line stage which distributes the database items among the nodes and computes the signature of each image in the database. Then the system is ready to process interactive queries following these steps: 1. Input/query image introduction. The user first selects an image to be used as a query reference and the system computes its signature [17]. 2. Comparison and sorting of the query and DB image signatures. The signature obtained in the previous stage is compared with all the DB image signatures and the identifiers of the p most similar images are extracted. Although each comparison is not a costly operation, the volume of computations to be performed is very high.
183 IBERGRID 183
3. Results display and query image update. The system provides the user with a dataset containing the p images considered the most similar to the query one. If the result does not satisfy the user, she/he can choose a new image.
Upon observing the operations involved, it is possible to notice that the comparison and sorting stage involves a much larger computational load than the others (it is, in fact, O(np log(p))). Luckily, the lack of dependencies in combination with very low communication requirements makes it possible to exploit data parallelism by dividing the workload among q independent nodes. Grids make it possible to update the database configuration by adding or removing specific nodes or databases depending on the type and contents of the queries at a given moment. From that moment onwards, each node can compare the signature of the query image against all the available signatures. Storage capacity also becomes a problem when dealing with large-scale CBIR systems. The most efficient approach to this issue, which additionally relaxes per-node storage demands, is to distribute signatures, images and computations over an ensemble of q processing and storage nodes [12]. From a functional point of view, the main features of a grid system are flexibility, versatility, security, usability, performance and scalability, allowing a dynamic addition or removal of resources at any time according to the user's interests or to the available resources. This setting frequently brings about changes in performance or load over time [14].
3 Grid implementation
The CBIR application programmed in the grid can be decomposed into the following modules (Fig. 2): User interface, CBIR and grid management, and Local search per grid node. All these modules have been programmed using Globus Toolkit version 4 [6, 9] with different services installed, in particular Data Management (GridFTP and RFT), Information Services (MDS + Ganglia Cluster Toolkit) and Execution Management (WS-GRAM).
Fig. 2. Software module decomposition of the CBIR implementation in the grid.
The following sections describe each one of the CBIR modules.
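Since the paper includes no source code, the following minimal Python sketch only illustrates the kind of computation performed by the Local search per grid node module and by the final gathering step: each node compares the query signature against its local signature database and keeps the p best matches, and the per-node partial results are later merged into a global ranking. The Euclidean metric, the data layout and all function names are assumptions made for illustration, not the authors' implementation (the real system lets the user choose among several signatures and metrics).

import heapq
import math

def compare_signatures(query, candidate):
    # Assumed metric: Euclidean distance between fixed-length signatures.
    return math.sqrt(sum((q - c) ** 2 for q, c in zip(query, candidate)))

def local_search(query_sig, local_db, p):
    # Return the p most similar images of the local database.
    # local_db is assumed to be an iterable of (image_id, signature) pairs.
    best = []  # min-heap on negated distance, i.e. a bounded max-heap
    for image_id, sig in local_db:
        dist = compare_signatures(query_sig, sig)
        if len(best) < p:
            heapq.heappush(best, (-dist, image_id))
        elif -dist > best[0][0]:
            heapq.heapreplace(best, (-dist, image_id))
    # Sorted ascending by distance, ready to be written as a partial result file.
    return sorted((-d, img) for d, img in best)

def merge_partial_results(partial_lists, p):
    # Gathering step: merge the per-node top-p lists and keep the global best p.
    return heapq.nsmallest(p, heapq.merge(*partial_lists))

With such a decomposition, the costly comparison-and-sorting stage is split among the q nodes and only q short lists of p entries each have to travel back over the WAN, which is consistent with the low communication requirements mentioned above.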
184 184 IBERGRID
3.1 User interface
The application allows the user to specify a whole environment at each system execution. This way users can update the working configuration by adding or removing specific nodes or databases depending on the type and contents of the queries at a given moment. The system presents the user with an intuitive web interface that asks for a password to log into the grid. If the login is successful, the proxy checks the user certificates and grants access to the grid resources where the user has permission. Then, a set of parameters can be defined. These parameters have two clearly different orientations: grid resources and CBIR features.
Grid resources. Grid parameters cover aspects such as: Number of grid nodes where the query will be run. Selection of the type of grid node; in this implementation, three types of grid nodes are considered: (1) monoprocessors, with only one processor and one hard disk; (2) multiprocessors or CMPs, which allow several query processes as well as multiple I/O operations to run simultaneously when RAIDs or SANs are connected to the parallel computer; and (3) clusters, in which the system is able to launch a parallel application programmed using MPI libraries; this approach allows optimal management of all cluster resources and offers a good degree of portability among parallel platforms. Number of available CPUs when a multiprocessor node is selected. Number of processing nodes considered when a cluster node is selected; it must be noted that a cluster node can also be a multiprocessor computer itself. Computational power of the system nodes; this is an optional parameter: if the user has some a priori knowledge about the performance of the processing nodes, it can be provided to the application. In any case, the application collects statistics about the performance of each node in order to make a better assignment of the workload in the future.
CBIR features. Among the programmable CBIR parameters, we can mention: Name of the process that will perform the search and sort the signatures in each node. Search criterion, including color, shape and a combination of both features; for each criterion, several possibilities exist: energies, color histograms, multiresolution color primitives, Hu and Zernike invariants, etc. [17]. Metric used in the similarity computation stage between the computed input image signature and the precomputed signatures stored in the database. File name of the query image. Name of the selected database where the query is performed. Number of images presented to the user for query refinement.
Figure 3 shows a snapshot of the GUI. On the left side, the user can fix the grid resources selected from the VOs as well as the CBIR parameters for the query. On the right side, the results retrieved for a specific query are presented. When the user moves the mouse
185 IBERGRID 185
Fig. 3. Snapshot of the GUI.
over the grid node icons, detailed information about their features is presented in order to help the user select the required computational resources and available databases.
3.2 CBIR GRID process
This process is in charge of collecting all the parameters specified by the users as well as starting the execution in the distributed environment. It is also in charge of maintaining the fault tolerance of the whole system. This process can be decomposed into the following stages, summarized in the pseudocode of Alg. 1: Read the system information provided by the users and initialize the data structures. Setup of the testbed: the grid nodes are fixed, their availability is checked and, finally, the permissions for executing remote jobs on them are verified. Selection of the remote jobs to be executed in each node of the grid: a sequential job corresponds to a monoprocessor system, while a parallel job is assigned to a shared-memory multiprocessor; finally, a distributed job based on a message-passing implementation like MPI is run on a cluster.
186 186 IBERGRID Algorithm 1 Pseudocode of the CBIR GRID process. 1: Read system parameters provided by the user. 2: Initialize local and remote data structures. 3: Setup the grid system. {Fix the nodes of the grid and check their availability and permissions to execute remote jobs.} 4: Classify jobs as sequential, parallel or distributed. 5: Dispatch jobs among the available and most suitable grid nodes considering previous classification. 6: Compute the input signature. 7: Execute a globus-url-copy to broadcast this signature in the grid. 8: Launch local searches with globus-job-submit and collect job status with CBIR GRID process using the command globus-job-status. 9: Collect partial results with the gridftp service. 10: Gather all partial results and select the best N. 11: Pick up the images from the corresponding local nodes and present them to the user. Compute the signature of the input image, as described in [17]. Send the previously computed input image signature to all the nodes that compose the grid. Once the signature is distributed, job execution must be launched in each of the grid nodes, searching over their own local databases. Then, the CBIR GRID process performs a loop that controls the state of each job launched. When a job ends, it must collect all partial results generated by the corresponding grid node. This way, a set of files with partial results is generated, one for each node of the grid. The last step in this process is to gather all partial results and select the best p. Then, the process picks up the p images from the corresponding local nodes and presents them to the user in a sorted mosaic view inside a user window. 3.3 LOCAL RETRIEVAL process This process is responsible for performing the local retrieval in each node of the grid. Several versions of this process have been implemented taking into account the type of grid node where it will be executed: Sequential version: it is executed over one single node of the grid that is a monoprocessor node (PC or workstation). Parallel version: it is launched in a multiprocessor system that has several processors but only one storage system. Distributed version: it is a distributed application implemented with MPI libraries [2]. It is launched in a cluster with a distributed database among the cluster nodes. In this case, each cluster node has a piece of database stored in its local partition. This implementation includes a dynamic, distributed and highly scalable load balancing algorithm for these architectures. Load balancing operations between pairs of nodes take place whenever a node finishes its job, resulting in a receptor-triggered scheme which minimizes the system s
187 IBERGRID 187
communication overhead. The algorithm automatically turns itself off in global overloading or under-loading situations [2].
4 Experimental results
A number of experiments have been executed in order to test the behavior of the grid implementation of the CBIR system with the following goals: to verify the feasibility of applying a distributed solution based on a real-world grid, mixing multiprocessor and monoprocessor grid nodes; to study the system's behavior with faulty nodes as well as grid nodes that are dynamically added to or removed from the grid; to estimate the overhead introduced by the Globus software; and, finally, to analyze the grid response in order to optimize the distribution of CBIR data among the grid nodes and achieve better performance and scalability.
The testbed used in the experiments is composed of the aggregation of resources from five VOs with a high degree of heterogeneity. These resources belong to the GRIDIMadrid Consortium [22]. Figure 4 depicts a schematic representation of this grid; dashed lines group the computational resources available in each VO.
Fig. 4. Scheme of the grid used in the experiments.
Table 1 collects the mean value and the standard deviation of the user response time per VO considering a database size of signatures each. This table also shows the values obtained for both the Globus overhead and the data communication time. All time values are measured in seconds. Finally, the efficiency achieved for each VO is shown in the last row. A wide range of values can be observed because of the heterogeneous nature of the available processing nodes.
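The paper does not spell out how the efficiency row of Tables 1 and 2 is obtained from the other measured quantities, so the short helper below only illustrates one plausible bookkeeping, assuming the rows named in the tables: the response time perceived by the user is decomposed into the compute time of the slowest processing node (the computational critical path), the Globus overhead and the data-transfer time, and efficiency is taken as the fraction of the user time actually spent computing on that critical path. Both the field names and the definition are assumptions made for illustration, not the authors' formulas.

from dataclasses import dataclass

@dataclass
class QueryTiming:
    user_time: float        # wall-clock response time seen by the user (s)
    slowest_node: float     # compute time of the slowest grid node (s)
    globus_overhead: float  # job submission / monitoring overhead (s)
    data_transfer: float    # signature broadcast + partial-result gathering (s)

    def efficiency(self) -> float:
        # Assumed definition: useful compute time on the critical path
        # divided by the total time perceived by the user.
        return self.slowest_node / self.user_time

    def overhead_fraction(self) -> float:
        return (self.globus_overhead + self.data_transfer) / self.user_time

# Hypothetical numbers, not taken from the paper:
t = QueryTiming(user_time=12.0, slowest_node=10.8, globus_overhead=0.9, data_transfer=0.3)
print(f"efficiency = {t.efficiency():.2f}, overhead = {t.overhead_fraction():.2f}")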
188 188 IBERGRID Table 1. Statistics collected in each VO for a query considering a database with images per processing node: mean value (MV) and standard deviation (SD). VO URJC UPM CIEMAT UC3M UCM MV SD MV SD MV SD MV SD MV SD User time Slowest node Globus overhead Data transfer Efficiency Table 2. User response time in the grid for a query considering different database sizes: mean value (MV) and standard deviation (SD). User time Slowest node Globus overhead Data transfer Efficiency Database size MV SD MV SD MV SD MV SD MV SD Table 2 shows the response user time of the grid (in seconds), the overhead and its efficiency considering different database sizes. In this case, the number of images in the database varies considering different database sizes, as detailed in the first column of Table 2. Images are homogeneous and statically distributed among the VOs. As can be noticed, grid efficiency increases as the size of the database grows up. This fact is explained by the almost constant overhead introduced by the grid implementation and the lack of data dependencies in the most demanding stage of the proposed CBIR application that allows a fully parallelization of both signature comparison and sorting. These values show how grid behavior is only limited by the computational power of the processing nodes, but not by communication bottlenecks. Overhead values confirm the feasibility of extending the dimensions of the proposed grid implementation of a CBIR system to large scale grids. The grid herein presented allows a dynamic management of the CBIR system since Globus provides a set of services focused on infrastructure management. Specifically, GRAM (Grid Resource Allocation Management) supports the control of the available resources in the grid at a given moment. Experiments have proven the robustness of the CBIR system when faulty nodes appear in the grid or specific databases are dynamically added or removed from the grid. To simulate these situations, VOs can be incorporated to the grid taking into account the most significant parameters that have influence in the response time: computational power and Globus overhead. Tables 3 5 collect the results of adding new VOs to the grid with this criteria: decreasing computational power (Table 3), increasing computational power (Table 4) and increasing Globus overhead (Table 5). In these cases, the number of images managed per processing node is , so the total amount of images managed in the grid is These results show the excellent scalability of this system. Grid efficiency remains almost constant for all setups, achieving a linear isoefficiency function with respect the problem size, which is the optimum case for parallel systems [11]. Analyzing overhead values,
189 IBERGRID 189
Table 3. Statistics collected in the grid adding new nodes going from the slowest (URJC) to the fastest VO (UCM): mean value (MV) and standard deviation (SD). GRID URJC+UC3M +CIEMAT +UPM +UCM MV SD MV SD MV SD MV SD User time Slowest node Globus overhead Data transfer Efficiency
Table 4. Statistics collected in the grid adding new nodes going from the fastest (UCM) to the slowest VO (URJC): mean value (MV) and standard deviation (SD). GRID UCM+UPM +CIEMAT +UC3M +URJC MV SD MV SD MV SD MV SD User time Slowest node Globus overhead Data transfer Efficiency
Table 5. Statistics collected in the grid adding new nodes going from the lowest Globus overhead (UC3M) to the highest one (URJC): mean value (MV) and standard deviation (SD). GRID UC3M+CIEMAT +UPM +UCM +URJC MV SD MV SD MV SD MV SD User time Slowest node Globus overhead Data transfer Efficiency
the only case where the overhead grows is when the URJC VO is added to the grid. This is due to the slowest node of the URJC VO, which is one order of magnitude slower than the others from both points of view: computational power and data communication bandwidth. This fact slows down the control and data communication tasks, resulting in an increase of the grid overhead. In short, the grid presents outstanding scalability, but the introduction of very slow processing nodes may degrade the grid performance.
5 Conclusions and future work
This paper has focused on analyzing the feasibility of applying a grid solution to CBIR systems shared among different Virtual Organizations. In this work we have measured the performance reached by the grid in a real environment with very heterogeneous nodes. The experience gained with the development of the image retrieval system over a real-world grid has been very positive, but two remarks must be pointed out: More powerful debugging tools must be developed to assist programmers in implementing new applications over a grid. Without these tools, the future widespread adoption of grids as an effective solution for migrating existing applications with ever-growing resource requirements becomes very difficult.
190 190 IBERGRID
Another problem found was the management and maintenance of the grid, not from a technical point of view, but from the point of view of establishing contacts with all the people involved. The number of people from different organizations involved in this task may complicate supposedly trivial formalities, such as keeping fluid contact with the computer administrators, requesting the inclusion of new processing nodes or new users, or keeping certificates updated.
In any case, apart from the previous remarks, grids provide exhaustive, versatile and dynamic resource management that is transparent to end users, and grant secure access through a simple web interface. The performance of the implementation has been very satisfactory, achieving low user response times. This stems from the lack of data or algorithmic dependencies and from the low communication overhead. Thanks to the heterogeneity of the system, the communication overhead overlaps with the execution time in other grid nodes. These features also result in very good performance figures for large databases, where efficiency values higher than 90% have been achieved. The experiments presented here show that the amount of overhead introduced by this implementation remains almost constant, so the system is scalable with respect to the database size. In fact, the improvements introduced to deal with grid heterogeneity also mitigate this overhead. To conclude, remarkable advantages of the grid implementation are its good price/performance ratio, robustness and system scalability, which allow the number of nodes in the grid to be increased to larger scales without degrading performance. Finally, further work will be devoted to incorporating load balancing mechanisms to dynamically redistribute the workload corresponding to the sorting stage and increase the global performance of the grid.
Acknowledgements This work has been funded by the Spanish Ministry of Education, Science and Innovation (grants TIN , TIN C02-02, Consolider CSD and the Cajal Blue Brain project).
References 1. Fran Berman, Geoffrey Fox, and Anthony J.G. Hey, editors. Grid Computing: Making the Global Infrastructure a Reality. John Wiley & Sons, April ISBN José Luis Bosque, Óscar D. Robles, Luis Pastor, and Angel Rodríguez. Parallel CBIR implementations with load balancing algorithms. Journal of Parallel and Distributed Computing, 66(8): , August Jose C. Cunha and Omer F. Rana, editors. Grid Computing: Software Environments and Tools. Springer, Meng Ding and Paul Lu. Trellis-SDP: A simple data-parallel programming interface. In 3rd Workshop on Compile and Runtime Techniques for Parallel Computing (CRTPC), pages , Quebec, Canadá, August R. Bellotti et al. Distributed medical images analysis on a grid infrastructure. Future Gener. Comput. Syst., 23(3): , I. Foster. Globus Toolkit Version 4: Software for Service-Oriented Systems. In IFIP International Conference on Network and Parallel Computing, volume 3779 of Lecture Notes in Computer Science, pages Springer Verlag, 2005.
191 IBERGRID Ian Foster and Carl Kesselman, editors. The Grid 2: Blueprint for a New Computing Infrastructure. Computer Architecture and Design. Morgan Kaufmann, 2 edition, November C. Germain, V. Breton, P. Clarysse, Y. Gaudeau, T. Glatard, E. Jeannot, Y. Legre, C. Loomis, J. Montagnat, J.-M. Moureaux, A. Osorio, X. Pennec, and R. Texier. Grid-enabling medical image analysis. In CCGRID 05: Proceedings of the Fifth IEEE International Symposium on Cluster Computing and the Grid (CCGrid 05), volume 1, pages , Washington, DC, USA, IEEE Computer Society. 9. Globus Alliance: University of Chicago and others. Globus. Web, Retrieved March 14, 2011, from source, S. Khoshafian and A. Brad Baker. Multimedia and Imaging Databases. Morgan Kauffman, Vipin Kumar and Anshul Gupta. Analysis of scalability of parallel algorithms and architectures: a survey. In ICS 91: Proceedings of the 5th international conference on Supercomputing, pages , New York, NY, EEUU, ACM Press. 12. I. Kunttu, L. Lepistö, and J. Rauhamaa & A. Visa. Grid-based clustering in the content-based organization of large image databases. In Proceedings of the 5th International Workshop on Image Analysis for Multimedia Interactive Services, WIAMIS 2004, Lisboa, Portugal, April Zsolt Nemeth and V. Sunderam. Characterizing grids: Attributes, definitions, and formalisms. Journal of Grid Computing, 1(1):9 23, Zsolt Németh and V. Sunderam. Virtualization in grids: A semantical approach. In Jose C. Cunha and Omer F. Rana, editors, Grid Computing: Software Environments and Tools. Springer Verlag, Takenao Ohkawa, Yusuke Nonomura, and Kenji Inoue. Logical cluster construction in a grid environment for similar protein retrieval. In Proceedings of the 2004 International Symposium on Applications and the Internet Workshops (SAINTW 04), pages IEEE, IEEE Computer Press, January ISBN: / Marcelo Costa Oliveira, Walfredo Cirne, and Paulo M. de Azevedo Marques. Towards applying content-based image retrieval in the clinical routine. Future Generation Computer Systems, 23(3): , Óscar D. Robles, Pablo Toharia, Angel Rodríguez, and Luis Pastor. Towards a content-based video retrieval system using wavelet-based signatures. In M. H. Hamza, editor, 7th IASTED Int. Conference on Computer Graphics and Imaging - CGIM 2004, pages , Kauai, Hawaii, EEUU, August IASTED, ACTA Press. ISBN: , ISSN: Timothy K. Shih, editor. Distributed multimedia databases: techniques & applications. Idea Group Publishing, Hershey, PA, EEUU, Arnold W. M. Smeulders, Marcel Worring, Simone Santini, Amarnath Gupta, and Ramesh Jain. Content-based image retrieval at the end of the early years. IEEE Transactions on PAMI, 22(12): , December H. Sun, S. Li, W. Li, Z. Ming, and S. Cai. Semantic-based retrieval of remote sensing images in a grid environment. IEEE Geoscience and Remote Sensing Letters, 2(4): , October Chris Town and Karl Harrison. Large-scale grid computing for content-based image retrieval. In ISKO UK, Universidad Autónoma de Madrid (UAM), Universidad Complutense de Madrid (UCM), Universidad Rey Juan Carlos (URJC), and Universidad Politécnica de Madrid (UPM) and Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas (CIEMAT). GRIDIMadrid. Web, Retrieved september 25, 2009, from source,
192 192 IBERGRID The CHAIN Project Federico Ruggieri 1,2, Angelines Alberto 3, Giuseppe Andronico 1, Roberto Barbera 1,4, Jorge Blanco-Yagüe 3, Gang Chen 5, Shri P.S. Dhekne 6, Guillermo Díaz 3, Salma Jalife 7, Kostas Koumantaros 8, Aleš Krenek 9, Alberto Masoni 1, Luděk Matyska 9, Rafael Mayo 3, Tiwonge Msulira-Banda 10, Raquel Muñoz 3, Sinha Neeraj 6, Maragaret Ngwira 10, Luis Núñez 7, Marco Paganoni 1,11, Ognjen Prnjat 8, Manuel Rodríguez-Pascual 3, Antonio J. Rubio-Montero 3 and Stefano Troiani 3 (On behalf of the CHAIN project) 1 Istituto Nazionale di Fisica Nucleare (Italy) 2 Università degli Studi Roma Tre (Italy) 3 Centro de Investigaciones Energéticas Medioambientales y Tecnológicas (Spain) 4 Università di Catania (Italy) 5 Institute of High Energy Physics Chinese Academy of Sciences (CHINA) 6 Office of the Principal Scientific Adviser to the Government of India (India) 7 Cooperación Latino Americana de Redes Avanzadas (Uruguay) 8 Greek Research and Technology Network (Greece) 9 CESNET (Czech Republic) 10 UbuntuNet Alliance (Malawi) 11 Università degli Studi di Milano-Bicocca (Italy) Abstract. The European Commission has focused in the latest years its efforts in Science development on promoting the research on computational Science. This is so because of the increasing importance of simulating phenomena in almost every field of interest. In particular, a huge investment has been done in Distributed Computational Infrastructures in Europe, but also in other continents such as Latin America, Asia and Africa, mainly in Grid. Thus, different phases for infrastructures projects in all these continents from 2006 to 2011 have produced excellent results and have promoted the use and research on Grid computing. The CHAIN project aims to coordinate and leverage these efforts and their results with a vision of a harmonised and optimised interaction model for e-infrastructure and specifically Grid interfaces between Europe and the rest of the world. 1 Introduction Over the past 6 years, the European Commission has invested to extend the European e-infrastructure technology and European e-infrastructure (and particularly Grid) operational and organisational principles to a number of regions in the world, and reinforcing the close collaboration and exchange of know-how with similar technologies in other areas. [email protected]
193 IBERGRID 193 Thus, although big steps have been made to extend the European Grid principles to other regions, the results obtained so far have to be leveraged and customised so as to provide an overall model for sustainable interoperation between the European Grid Initiative/Infrastructure and external e-infrastructures. Furthermore, following the evolution path of Grid collaborations between Europe and the rest of the world, it is deemed beneficial to also analyse the issues and models related to coordination between European and external regional e-infrastructures taking into account not only Grids but also HPC, Clouds, Voluntary computing infrastructures and networking issues On the technology level, the coordination of different world-wide Grid efforts, listed below, has been restricted to basic operational, organisation and technology know-how transfer/exchange. Moreover, the co-ordination of Grids and HPC has been to some extent neglected, although there is currently a pressure to arrive to some agreement and to a more organised coexistence of the two approaches to scientific computing. Finally, the recent upsurge of other paradigms should be evaluated in the light of a large world-wide e-infrastructure; these would be the cases of Virtualisation, Voluntary Computing or Cloud Computing. The extension of European e-infrastructure to other regions of the world developed on three main tracks: Research and Education Networks (i.e. GEANT, ALICE, TEIN, EUMED- CONNECT, etc.); Grid Computing (i.e. EGEE, BalticGrid, EELA, EUAsiaGrid, EUChinaGRID, EUIndiaGrid, EUMEDGRID, SEEGRID); and, Virtual Research Communities (Health, Biomedical, HEP, Earth Sciences, etc.). Very little has been done until now to link e-infrastructures, at intercontinental level, due to their specific requirements. As aforementioned, a number of different collaboration models have then been established between Europe and the rest of the world, while the projects implementing these collaborations have had impacts typically focused on their regions. The Co-ordination & Harmonisation of Advanced e-infrastructures project (CHAIN, a Coordination and Support Action with a budget around 1.5 Me, now receives this legacy and intends to coordinate the work carried out by these regional projects as well as optimise an interaction model for e-infrastructures and specifically Grid interfaces between Europe and the rest of the world. The project is working on elaborating a strategy and defining the instruments in order to ensure coordination and interoperation of the European Grid Infrastructures with other external e-infrastructures. This way, the CHAIN consortium, coordinated by INFN and consisting of leading organisations in all the regions addressed by the project, will ensure global coverage, European leadership and most efficient leveraging of results with respect to preceding regional initiatives. In concrete, CHAIN counts on 4 European institutions (INFN, CESNET, CIEMAT and GRNET) and other 4 are from countries in the list of International Co-operation Partner Countries (CLARA, IHEP, PSA, UBUNTUNET).
194 194 IBERGRID The different steps to be taken are as follows. First, the project will define a coherent operational and organisational model, where a number of EU countries/regions will possibly act, in collaboration with EGI ( as bridges/gateways to other Regions/Continents. Further, the project will validate this model by supporting the extension and consolidation of worldwide virtual communities, which increasingly require distributed facilities (large instruments, distributed data and databases, digital repositories, etc.) across the regions for trans-continental research. Finally, the project will act as a worldwide policy-watch and coordination instrument, by exploring and proposing concrete steps for the coordination with other initiatives and studying the evolution of e-infrastructures. To do so, it is of outmost importance the experience gained through the past by the CHAIN partners, who have participated in GRID project such as the different EGEE phases or the regional associated initiatives in Latin America, Asia and or Africa. At the same time, all the European partners also collaborates in the European Grid and HPC Initiatives and its related National Initiatives, so they are up-to-date of the different steps that e-science is performing. 2 WP1 Project Management As usual, this WP performs a mandatory activity for ensuring the correct and timely execution of the project. The CHAIN project appears to be challenging from the management point of view due to the large number of regional partners and third parties involved. Moreover, the project has to organize and develop several high level activities and monitor their impact. All these aspects will be addressed, at the management level, by a correctly articulated structure which will avoid overloads and unnecessary bureaucracy. Moreover, the large geographical coverage of the project poses several challenges, while the work-plan has activity lines, derived from the project s objectives, which span all the regions involved. The managerial structure has thus been designed with a traditional technical management that develops horizontally across the regions and an executive management that vertically addresses the different regions taking into consideration their specificities. As stated above, INFN leads the project and all the activities which are strictly related to the overall technical and administrative management of the project have been distributed in several tasks. Managerial activities which are related to the other work-packages have been included in the respective work plans. This way, the governance structure is then composed by the following roles and boards: A Project Director (PD) nominated by the Coordinator (INFN) to run the project; A Project Management Board (PMB): one member for each partner plus a deputy, to specifically address strategic and contractual decisions concerning the project; A Regional Executive Board (REB): one deputy Project Director for each Region, to act as an executive board that will be entitled to take urgent decisions between the PMB Meetings. The members of the REB will act as Deputy PDs for the region concerned;
195 IBERGRID 195 A Project Office (PO): 3 persons, to deal with administrative and managerial activities and support the PD; A Technical Board (TB): made up by the managers of all the work-packages, to take care of the technical management of the project; A Technical Manager (TM) to chair the TB and deal with the day-to-day technical discussions ensuring the coherence of all the technical actions in line with the project s objectives. WP1 is divided in two Tasks: T1.1 Administrative management for dealing with the administrative, financial and overall management of the consortium; and, T1.2 Technical management, whose aim is to coordinate the overall technical activities, for the achievement of milestones and the delivery of the deliverables according to the foreseen work plan. Typical milestones of this WP are the organization of the project Kick-off meeting and the annual reviews at the European Commission. 3 WP2 Consolidation of existing state of the art The maturity of non-european e-infrastructures varies across the world regions, where 3 typical scenarios could be identified when CHAIN proposal was submitted (2010): Completely green-field regions which need to be supported from scratch (e.g. sub-saharan Africa with the exception of South Africa) Stable (if small) regional infrastructure established and interoperational with Europe (e.g. Mediterranean, some LA & Caribbean countries, SEE) Advanced countries/regions which have autonomously invested in several e- Infrastructures and are willing to interoperate with European ones (e.g. China, India, New Zealand, USA, Japan, some LA countries, etc.). A set of EU-funded activities have been so far investigating different modes of interoperation and interoperability between Europe and the rest of the world. While strong information exchange has been taking place, there is still a diversity of solutions in place with a wealth of know-how available. This work-package is working on systemising the current know-how and producing a coherent and structured consolidated picture of the state of the art in the different regions, highlighting commonalities and differences. On the basis of this analysis, a set of guidelines will be produced to guide the interoperability and interoperation among different types of regions. To fulfil this goal, reference will be made to the recommendations developed from international standardization forum as Grid Interoperation Now, part of Open Grid Forum, which is working extensively on this point. WP2 is leaded by GRNET and its first action has been to elaborate, jointly with the rest of Work packages, a questionnaire with two different versions depending if it is delivered to Regional or National contacts. In both versions, questions related to Infrastructure, Operational and User support services, User communities, EGI
196 196 IBERGRID interoperation approaches, Middleware, Network, Related technologies (cloud, for example) and Sustainability appear. WP2 is structured in two Tasks. T2.1 State of the art analysis deals with the in-depth analysis of the current state of the art across the regions and is the main responsible for the questionnaire, which will produce data collection and will gather and systemise the data. The activity of the Task is devoted to gather the accumulated experience and the current best practices in the existing e-infrastructures with specific focus on the basic and advanced services provided, the scientific communities addressed, the security policies, the sustainability opportunities and plans and the current interoperation level with the European Grid Infrastructure. Task T2.2 Interoperation and interoperability guidelines will analyse the data collected in T2.1, on the basis of which a set of guidelines for interoperability and interoperation will be produced. The data will be analysed under the various topics and a comparison among the different approaches will be done focusing, on one hand, on the areas of possible synergies and harmonisation and, on the other hand, on the specific region-dependant solutions and differences evaluating their impact on the level of interoperations and interoperability with other e-infrastructures. Lately, Task T2.3 Long-term sustainability support will disseminate the accumulated know-how on NGI and regional sustainability issues and will carry out actions on strengthening National Grid Initiatives and regional structures. The task will examine the current and emerging plans of sustainability, enumerate the opportunities that could influence these and future plans and eventually propose measures of improvement based on the current best practices and new opportunities. The milestones reflected in this activity are focused on providing the questionnaire to find out what the different regions need as well as produce valid guidelines with the obtained results. 4 WP3 Present and emerging needs of trans-continental scientific communities There are several communities which are trans-continental by constituency, such as LHC experiments ( European Organisation for Astronomical Research ( Earth Science, Biomedical applications for neglected and emerging diseases, etc. An inclusive landscape of the virtual research communities accessing or willing to access trans-continental common developments, distributed data and facilities constitute the main motivation for coordination, interoperation and interoperability of e- Infrastructures across several regions, i.e. it also means the CHAIN objective. Some institutions participating in the CHAIN project are already involved in several of the cited communities and are obviously interested in continuing to support them, but this work-package is aiming to go beyond the legacy communities and intends to address new fields that have not completely exploited yet the opportunities of such large intercontinental e-infrastructures. Thus, the work-package aims at
197 IBERGRID 197 continuing to provide (limited) support to the existing and well experienced Virtual Research Communities such as LHC, Biomedicine, Earth Science, etc.; providing support for communities that have already approached the Grid technology but are interested to widen their activities collaborating with other continents (they could be the previous ones, but with new applications); and, discovering and attracting new scientific communities or merging similar ones actually operating within separate e-infrastructures worldwide. Under these premises, WP3 has a main objective: to involve at least a couple of reference communities that will be selected and involved in the project to validate the proposed model jointly developed by the whole project. To the date, some of these communities could be some of the ones mentioned above, but in any case, the CHAIN project is collaborating with several specific ones: We-NMR initiative ( in the Biomedical field; DECIDE initiative ( in the Biomedical field; Phylogenetics applications ( developed at Universidad de Vigo in the Biomedical field; INDICATE project ( in the Digital Cultural Heritage field; and, WRF4G ( in the Earth Science one. With regard to this, both initiatives already appear in the WP2 questionnaire inside the National version so CHAIN is looking for specific contacts of people working on these topics, which will be contacted lately jointly with the We-NMR and WRF4G representatives. Working this way, WP3 expects to fulfill its own milestones, i.e. obtain a shortlist of reference communities from a first call and the questionnaire that will conduct to a study on their required services as well as some Memoranda of Understanding in order to collaborate closely. WP3 is coordinated by CIEMAT and is divided in T3.1 Scientific communities across the continents, which is on charge of performing a large spectrum investigation on the existing and potential trans-continental communities, so the requirements, commonalities, challenges and possible synergies could be gathered and analysed, and T3.2 Proposed road-map of services for communities to be deployed on the e-infrastructures, the aim of which is to produce a study, updated every 6 months, of the necessary steps for e-infrastructures to fulfil the requirement of present and emerging virtual communities with trans-continental span. To do so, information will be collected from the existing plans of relevant organisations and/or committees (e.g. ESFRI, e-irg), the preliminary results of T3.1, the feedbacks received during the workshops and high-level conferences that CHAIN could attend or organise and the collaboration with other relevant projects.
198 198 IBERGRID 5 WP4 Modelling the cooperation of European e-infrastructures with non-european ones On the basis of the results of WP2 and WP3, and in continuous cooperation with them, a model for cooperation and interoperation among European and non- European e-infrastructures should be studied and proposed; this is the task of WP4, leaded by CESNET. In addition, a strong collaboration with EGI and EGI- Inspire ( is a must, so the Work Package makes its work fully transparent to EGI. WP4 is focusing on having a strong organizational study characterization with regional customisation applied to a shared model for sustainability. A collection of feedbacks will produce a final version with a road-map for the follow-up of potential extensions to the European Grid Infrastructure. The activities will be based on a series of cyclic processes that will involve in sequence the WP2 results, the meetings with EGI and that eventually will culminate with final reports produced as deliverables. The timing of those relevant documents (March and September 2012) has been chosen in order to have, on one side, the EGI plans already stabilised and, on the other side, receive the results of WP2 work. The process is not just EGI driven, even when, of course, it will be cyclic and will imply brainstorming Meetings jointly organised with EGI as well as a series of steps that will be put in place in order to maximise the involvement of EGI-Inspire in the definition of the model that will be proposed by the organisational study. To the date, CHAIN is strongly fostering the development of Africa ROC ( which will set the Continent much closer to the Grid deployment that already exists in other regions. As a consequence, Africa ROC GOCDB ( is properly working and, beyond it, xgus regional support system ( has been adapted to CHAIN characteristics and requirements. This work has been of importance in order to shorten the digital divide of Africa with the better established Grid Initiatives in Latin America and Asia. WP4 main activities are the following. T4.1 Organisational study interacts mainly with WP2 and WP3 to study an organisational model that will allow the European Grid Infrastructure to interoperate and cooperate with external infrastructures. It also cooperates with EGI organisation and other Regional infrastructure in order to evaluate the opportunities offered by the evolution of the European Grid Infrastructure, assess the issues that will affect the interoperation with other infrastructures worldwide and evaluate the appropriate measures that will allow a smooth interoperability in the near future. The second Task is T4.2 Regional Operations Studies, which deal with the specificities of the various regional operations and will address the specific issues related to interoperability with different middleware and migration to standards, as the ones proposed by GIN/OGF, in cooperation with the EGI planning. The activity involves a couple of different regional situations, i.e. developing regions with European based Middleware and well established regional e-infrastructure
199 IBERGRID 199 with different middleware. The Task will, in a second phase, deploy a pilot to demonstrate the applicability of the preliminary results of T4.1. T4.3 Road-Map ought to prepare a roadmap for the follow-up of the project which will address the timeline of implementation, based on the work of T4.1 and T4.2. The road-map will be based on the following pillars: a detailed analysis of the changes that will have taken place occurred in the European Grid Infrastructure with the new organisational structure; the characteristics of the non-european e-infrastructures, highlighting commonalities and differences including, but not limited to, organisation, HPC, Middleware, Security, Accounting, User Support; the opportunities offered by the present and emerging standards with recommendations on the support of new possible standards; and, the emerging paradigms of Virtualisation, Voluntary Computing and Cloud Computing. The outcome of this work will be disseminated not only in Europe but also in the regions where CHAIN is focusing on and beyond. The main milestone of this project is to perform a organisational model about e-infrastructures to be presented at the CHAIN workshops and any other events of interest. This is so since CHAIN will not provide an own computing infrastructure but those of its associated projects and partners (CLARA, EUIndiaGrid-2, Ubuntunet Alliance, etc.). 6 WP5 Dissemination and Outreach The CHAIN project, as many other EC funded projects, proposes an intense series of dissemination activities which have as focal point a series of workshops and conferences that will address, respectively, the virtual research communities and the policy makers. It is coordinated by INFN, which is an asset due to its implication in many Regional Grid projects. Nevertheless, its first step has been to develop and deploy its web-site, which has been migrated from its first Joomla! ( version to Liferay ( The web site was thus designed to achieve all the objectives originally foreseen in the Project s Description of Work and it makes use of the already available tools such as Agenda, Document Repository, Video Conference, etc. The Liferay Technology allows to plug-in new tools profiled as portlets that can be easily re-used in the web and better addresses the needs and expectations of the Grid communities and offers opportunities to better integrate grid tools. This will of course not only facilitate the maintenance and upgrading of the web site, but also will allow benefiting of portlets developed by third parties (e.g. other projects). The ever-evolving website can be found at CHAIN plans to attend and organise thematic workshops and high-level conferences to better address the Virtual Research Communities of relevance and exploit possible synergies with other international events, which is also its milestones. To
200 200 IBERGRID the date, it s been present in EGI events, both Technical and User Forums, in Open Grid Forum jointly organised with the International Symposium on Grid Computing and has supported the Conference on Role of e-infrastructures for Climate Change Research. This way, CHAIN project also supports the Climate Change community. The main activities inside WP5 are T5.1 Involvement of scientific & technical communities, whose aim is to promote the usage of e-infrastructures also in cooperation with other projects, T5.2 Solicitation of high-level policy awareness, that intends to support the regional communities towards the governments, stakeholders and policy makers getting consensus on the relevant aspects of the analysis of the issues and proposed solutions, and T5.3 Dissemination and outreach. 7 Conclusions The CHAIN project is a Coordination and Support Action which aims to coordinate and harmonise the Grid efforts currently developed world-wide as well as consolidate the Virtual Research Communities deployments in a wide area. Thus, it intends to address to main impact challenges. The regions addressed by the CHAIN project are all considered strategic by the European Union, so the collaboration and cooperation with these non-eu International Co-operation Partner Countries (ICPC) is considered a high priority. The cooperation with other countries, which are not in the ICPC list, is also very relevant when a worldwide coordination is required to address common issues such as interoperation, interoperability and standardisation. Moreover, such a wider geographical coverage is mandatory to address the needs of international research communities which have, among their requirements, the transparent access to worldwide resources, laboratories, facilities and data repositories. e-infrastructures are indeed widely considered key enablers of scientific and social development. Their widespread usage is rapidly changing the landscape of science and represents one of the most effective answers to problems such as the digital divide and the brain drain. The creation and support of common worldwide e-infrastructures devoted to research is thus doubly strategic regarding this aspect: It aims to speed-up the catch-up process of less developed countries giving them the opportunities to use cutting-edge European technologies and stateof-the-art computing and storage resources; and, It allows the extension of the European Research Area to other countries with the opportunities of assessing unique facilities and a large number of talented scientists from Europe as well as outside Europe who will actively and productively collaborate in challenging e-science research activities. The second relevant impact that the project aims to achieve is in the field of interoperations and interoperability. CHAIN is uniquely positioned for the large number of key players around the same table, not only for agreeing on the time of adoption of the present and emerging standards, but, more importantly, to promote new relevant ones based on shared policies and best practices. The above
201 IBERGRID 201 impacts directly correspond with those stated within the Work Programme Capacities (Work Programme 2010, Capacities, Part 1, Research Infrastructures, (European Commission C(2009)5905, July 29 th 2009), Par. 1.3, p. 20) Acknowledgements CHAIN project would like to thank Dr. Alexandre M.J.J Bonvin (Universiteit Utrecht, We-NMR) and Dr. Antonio Cofiño (Universidad de Cantabria, WRF4G) for their support. CHAIN project is funded by the European Commission under its Seventh Framework Programme (FP7-INFRASTRUCTURES , contract number ).
202 202 IBERGRID Stellarator Optimization Using the Grid Antonio Gómez-Iglesias 1, Francisco Castejón 1, and Miguel A. Vega-Rodríguez 2 1 National Fusion Laboratory, CIEMAT, Madrid, Spain, {antonio.gomez, francisco.castejon}@ciemat.es WWW home page: 2 Dep. of Technologies of Computers and Communications, University of Extremadura, Cáceres, Spain [email protected] Abstract. The design of enhanced fusion devices constitutes a key element for the development of fusion as a commercial source of energy. Stellarator optimization presents high computational requirements because of the complexity of the numerical methods needed as well as the size of the solution space regarding all the possible configurations satisfying the characteristics of a feasible reactor. The size of the solution space makes not possible to explore every single feasible configuration. Hence, a metaheuristic approach is used to achieve optimized configurations without evaluating the whole solution space. In this paper we present a distributed algorithm that mimics the foraging behaviour of bees. This behaviour has manifested its efficiency in dealing with complex problems. 1 Introduction Nuclear fusion research presents open problems which must be solved in order to be able to design commercial reactors. Some of the limitations of the fusion research are due to the knowledge required to carry out some experiments. But other limitations are related to the computational requirements of the problems being solved. Moreover, also the techniques required for some issues may be challenging. In the case of optimization problems, the main issues are related to the exploration of the solution space and the exploitation of approximated solutions for the problem being solved. The use of distributed metaheuristics is an excellent approach for these optimizations. Swarm intelligence-based algorithms [1], as for example those based on bees, can be implemented using the capabilities of large computational infrastructures like the grid to carry out large-scale optimization problems [2]. Large-scale problems present various peculiarities such as long execution times or large requirements in terms of storage. In the case of the problem here presented, the execution time needed for a single evaluation makes the optimization process to be challenging. [email protected] [email protected] [email protected]
203 IBERGRID 203
Honey bees forage for nectar in a changing environment. The discovery of new food sources may mean abandoning others previously known. Bees are able to adapt their behaviour to these changes to maximize their productivity. Bees constitute a hierarchical society in which the role of each individual is defined. The division of labour improves the efficiency of the colony [3]. Bees do not usually switch among tasks, being specialized in a given task. Scouts search for new sources and provide information to the rest of the colony. This is performed by means of a decentralized system without any global decision-making. Individuals select the source with the best ratio of gain to cost from all of the available nectar sources. The decisions bees make regarding their movements are based on the communication among bees [4]. Individuals can move in groups in which one bee plays a leader role. This bee recruits more individuals in the colony during the waggle dance. In this dance, the leader exchanges information about food sources with other bees waiting in the colony. This information is related to the location and quality of the sources. Bees can reach an agreement and follow the leader. However, the recruits will not have all the information about the source. Hence, they will follow the leader but will introduce small changes in their paths. This helps the exploration of the space and also prevents over-harvesting.
The rest of this paper is organized as follows: section 2 summarizes our previous efforts and introduces a novel algorithm based on the foraging behaviour of bees, while section 3 details the grid-based implementation of the algorithm. Section 4 explains the problem being optimized, whereas the results of this optimization are shown in section 5. Finally, section 6 concludes the paper, summarizes the main achievements, and proposes some future lines of work.
2 Optimization Algorithm
2.1 Previous Efforts
Previously, we focused on the adaptation and use of well-known optimization techniques [5]. Genetic Algorithms (GAs) and the Scatter Search (SS) algorithm were considered.
Genetic Algorithms. We implemented three different GAs, differing in the operator used to create new offspring and in the replacement mechanism. The three implementations achieved optimal results in terms of optimization, but the usage of the computational resources was far from optimal. The reason is that GAs generate a large number of candidate solutions and, only when all of these solutions have been evaluated, a new set of solutions is generated. This model leads to bottlenecks, since the algorithm can be waiting for a long time for a single solution to finish without performing any other computation. The GAs proved easy to implement on a heterogeneous and distributed infrastructure following a master-slave model.
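Purely as an illustration of the synchronization problem described above (and not the authors' code), the following Python sketch shows the shape of a generational master-slave loop: no individual of generation g+1 can be created until every evaluation of generation g has returned, so a single slow or stuck grid job leaves the remaining workers idle. The thread pool merely stands in for remote job submission; all names are assumptions.

from concurrent.futures import ThreadPoolExecutor

def generational_ga(evaluate, init_population, make_offspring, generations, workers=32):
    # Each call to evaluate() stands in for one remote grid job.
    population = init_population()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(generations):
            # Synchronization barrier: map() only returns once *all*
            # candidate evaluations have finished, however heterogeneous
            # or unreliable the underlying resources are.
            fitnesses = list(pool.map(evaluate, population))
            population = make_offspring(population, fitnesses)
    return population

The asynchronous DAB algorithm introduced in section 2.2 removes this barrier: whenever any single evaluation returns, a new candidate is generated immediately from the current state of the search.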
204 204 IBERGRID Scatter Search The SS algorithm was developed following the implementation proposed by Martí [6]. This implementation was adapted to the grid and achieved a better optimization than those obtained with the GAs. The number of candidate solutions being handled at any time is smaller in this case, making the relevance of bottlenecks smaller. However, due to the iterative model of the algorithm and the number of improvement processes required, the number of bottlenecks and dependencies among jobs increased. Thus, the usage of resources was also poor. 2.2 Distributed and Asynchronous Algorithm Due to the issues explained, we decided to implement a new algorithm, specifically designed for being executed on distributed platforms, taking advantage of the characteristics of these infrastructures. The synchronization required by the previous algorithms is replaced by an asynchronous model. In this new model, the creation of new solution does not require the evaluation of another candidate solution. This generation is performed based on the status of the exploration of the solution space and the exploitation of optimal solutions. The algorithm is called Distributed and Asynchronous Bees (DAB) algorithm since it is based on the foraging behaviour of bees and follows an asynchronous and decentralized model. It has been designed to run on large and distributed computational platforms. The algorithm has proved to be an optimal and efficient system to carry out large-scale optimization problems in distributed environments [7]. Four different types of bees, or processes, can be found in the algorithm: Two levels of scouts (devoted to the exploration of the solution space): 1. Rovers: they use diversification methods to explore the solution space. 2. Cubs (associated to a rover bee): random exploration changing variables based on a good solution, in terms of dispersion, found by a rover. Two levels of employed (devoted to the exploitation of known solutions): 1. Elites: perform a wide search using an approximated solution previously found by a scout. 2. Workers (associated to an elite bee): by using a local search procedure [8], they explore in-depth the best solution found by elite bees. The pseudocode of the resulting DAB algorithm is as shown in Algorithm 1. As can be seen, not all of the bees start foraging simultaneously. The creation of bees depends on the evolution of the optimisation process. Furthermore (and not included in Alg. 1), the DAB algorithm checks whether the bees (jobs) are performing their tasks. In case of failure on the grid infrastructure, the number of bees or jobs can be automatically readjusted. Moreover, if the user changes the configuration of the algorithm during its execution, the number of processes can be also modified to adapt the algorithm to the new configuration.
205 IBERGRID 205 Algorithm 1: The DAB Algorithm Pseudocode 1 Initialise the population of solutions and create bees; 2 while Stop criterion not reached do 3 foreach Evaluated Individual x by bee B do 4 Obtain the probability of the x to be selected; 5 if x is the global best then 6 if B is an elite then 7 Create new workers; 8 else if Idle Elites then 9 Create elite; 10 else if B is a Rover then 11 if Idle Cubs then 12 Create new solutions with an in-depth local search; 13 Send B to explore new solutions; 14 else if Idle Elites then 15 Create elite; 16 Create new solution for the bee B based on its type; Exploration In order to perform an optimal exploration of the solution space, the resources devoted to this task try to explore the most diverse areas of the space. Thus, they generate a set of possible solutions and select the one with the highest distance to the previously evaluated solutions. This distance is calculated after performing a normalization of the variables involved in the optimization. The selected individual will be then evaluated. During the generation of new solutions the algorithm must ensure that the value of each variable is within the limits of that variable. If the value is out of the limits, the algorithm introduces a new value using a boundary factor. This helps enhancing the exploration of the solution space. Exploitation The exploitation is based on approximated solutions previously found by means of the exploration processes. In the tests here presented, the elite bees use a local search over the improved solutions considering a wide modification of the variables defining the solution. The workers introduce smaller modifications to perform a more focused enhancement. Parameters of the Algorithm Taking into account the previous characteristics of the algorithm, and also considering the requirements of the problem being solved, the parameters of the algorithm are as shown in Table 1. While the parameters regarding the number of resources devoted to the optimization will exist for any problem, the other parameters have been introduced based on the problem being solved in this work.
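Before the parameter table on the next page, here is a concrete illustration of the exploration step just described: a batch of candidate solutions is generated, the variables are normalized, the candidate farthest from everything already evaluated is kept, and out-of-range values are pulled back inside the limits with a boundary factor. This is only a minimal NumPy sketch of the idea, not the authors' code; the batch size, the uniform sampling and the function name are assumptions made for illustration.

import numpy as np

def explore(evaluated, lower, upper, n_candidates=50, boundary_factor=0.02, rng=None):
    """Pick the most 'diverse' new candidate with respect to already evaluated ones.

    evaluated : (m, d) array of previously evaluated solutions (may be empty).
    lower, upper : (d,) arrays with the limits of each variable.
    boundary_factor : constant used to pull out-of-range values back inside the limits.
    """
    if rng is None:
        rng = np.random.default_rng()
    span = upper - lower

    # Generate a set of possible solutions (here: uniform sampling inside the limits).
    candidates = lower + rng.random((n_candidates, lower.size)) * span

    # Enforce the limits with the boundary factor (relevant when candidates are
    # produced by perturbing existing solutions rather than by uniform sampling).
    low_mask = candidates < lower
    high_mask = candidates > upper
    candidates[low_mask] = (lower + boundary_factor * span)[np.where(low_mask)[1]]
    candidates[high_mask] = (upper - boundary_factor * span)[np.where(high_mask)[1]]

    if len(evaluated) == 0:
        return candidates[0]

    # Normalize both sets so every variable contributes comparably to the distance.
    norm_cand = (candidates - lower) / span
    norm_eval = (np.asarray(evaluated) - lower) / span

    # Distance of each candidate to its nearest already-evaluated solution ...
    dists = np.linalg.norm(norm_cand[:, None, :] - norm_eval[None, :, :], axis=2)
    nearest = dists.min(axis=1)

    # ... and keep the candidate that is farthest away from everything seen so far.
    return candidates[np.argmax(nearest)]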
206 206 IBERGRID Table 1. Parameters of the algorithm.

Parameter           Symbol  Description
Grid Related
Elites Number       EN      Maximum number of elites
Workers Number      WN      Maximum number of workers per elite
Rovers Number       RN      Maximum number of rovers
Cubs Number         CN      Maximum number of cubs per rover
Optimization Related
Modification Rate   MR      Probability to modify a variable when a candidate solution is generated
Elite Solutions     ES      Maximum number of elite solutions considered
Candidate Factor    τ       Constant to modify the variables of new solutions
Boundaries Factor   ϕ       Constant to modify the variables of new solutions to fix them within their boundaries
Workers Factor      ψ       Constant to modify the variables of new solutions when a local search is performed

Specialization While in our previous efforts the exploitation of the known approximated solutions was based on a mutation process using the standard deviation of the variables and on local searches, in the current research we only consider different levels of local searches. Taking into account the division of labour and the specialization, as well as the fact that bees do not switch among tasks, the final design of the algorithm uses two different types of processes exploiting optimized solutions and two other procedures exploring the solution space. Dynamic Reconfiguration The algorithm can be reconfigured at any time just by modifying an XML file. Any of the configuration parameters can be changed, including the maximum number of resources being used, the local search modification factor, or those related to the grid infrastructure itself. For example, the preferred CEs or SEs can be specified, as well as the rank criteria used to order the CEs. Moreover, the algorithm itself is able to change the configuration related to the grid infrastructure based on the quality of service being provided by the sites of the VO. 3 Grid Implementation The algorithm has been developed in Python. It depends on the NumPy library and on the lightweight DOM (Document Object Model) implementation library. Depending on the version of the Python interpreter installed on the User Interface (UI) of the grid infrastructure, it might be necessary to download and compile those libraries, since they are not included in older versions. The algorithm makes use of a hash table to store the explored solutions in order to avoid evaluating the same configuration twice. A binary search tree is included in the table for collision resolution.
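The explored-solutions table mentioned in the last paragraph can be pictured as follows: a fixed array of buckets indexed by a hash of the configuration, with a small binary search tree inside each bucket to resolve collisions. The sketch below is an illustrative reconstruction of that data structure (bucket count, key layout and method names are assumptions), not the code used by the authors.

class _Node:
    __slots__ = ("key", "value", "left", "right")
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None

class SolutionTable:
    """Hash table of explored configurations; each bucket is a binary search tree."""

    def __init__(self, n_buckets=4096):
        self.buckets = [None] * n_buckets

    def _bucket(self, config):
        # A configuration is a tuple of variable values; tuples hash and compare natively.
        return hash(config) % len(self.buckets)

    def get(self, config):
        node = self.buckets[self._bucket(config)]
        while node is not None:
            if config == node.key:
                return node.value          # already evaluated: reuse the stored fitness
            node = node.left if config < node.key else node.right
        return None

    def put(self, config, fitness):
        b = self._bucket(config)
        if self.buckets[b] is None:
            self.buckets[b] = _Node(config, fitness)
            return
        node = self.buckets[b]
        while True:
            if config == node.key:
                node.value = fitness       # same configuration seen again: keep latest value
                return
            side = "left" if config < node.key else "right"
            child = getattr(node, side)
            if child is None:
                setattr(node, side, _Node(config, fitness))
                return
            node = child

Before a bee submits a configuration for evaluation, a get() call that returns a stored fitness value means the configuration has already been explored and the expensive workflow can be skipped.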
207 IBERGRID 3.1 Long-time Jobs In order to reduce the impact of the time spent waiting in queues on the final execution time, we have introduced the idea of long-time jobs in the design. These jobs poll the SE waiting for input and, as soon as this input is available, they process the information and store the result in the SE. Another process will use this information and produce new input data. The jobs keep running for as long as the security constraints of the grid infrastructure allow. Only two types of processes (rovers and elites) use this feature. The other processes are short jobs which just evaluate a single input and finish. Using long-time jobs for all submitted tasks may lead to problems with the LFC (Logical File Catalogue). 3.2 Communication WN-UI As mentioned, the long-time jobs regularly require new input. They ask a master process for this input. This master process runs on the UI of the grid. The implementation of this communication uses the LFC: the remote process updates a file at a location known by both the local and the remote tasks, the master receives this update, downloads the data generated by the remote element, generates new input, and stores the input using a given path of the LFC. In order to reduce the number of checks that the local process has to perform, all the remote processes update the same file. Hence, a barrier system has been developed to prevent two or more tasks from updating the file at the same time and overwriting the information stored in it. Thus, the LFC and the SEs of the infrastructure are used as a shared-memory component of the infrastructure. 4 Stellarator Optimization The problem being solved is the enhancement of the magnetic confinement of plasma in a fusion device [9][10]. By improving this confinement, the plasma may become more stable and the probability of obtaining fusion reactions increases. Therefore, the efficiency of the device increases. To perform this optimization, we consider the improvement of the Fourier modes describing the plasma boundary. 4.1 Costs of the Optimization Process The fitness function of this problem requires the execution of an application workflow (shown in Fig. 1) [11]. The required execution time of the workflow depends on the characteristics being modelled. For optimized configurations, this workflow takes 40,458,792 Millions of Instructions (MI). This value changes depending on the input configuration, because the numerical grid used by the main code involved in the process can contain more data. This code (called VMEC, Variational Moments Equilibrium Code [12]) may require more iterations to find the equilibrium, and its output file can contain more data, so the objective function can take longer. Thus, the execution time may vary from several minutes to hours, showing a large variability.
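Returning to the long-time jobs of Sect. 3.1 and the WN-UI communication of Sect. 3.2, their general shape can be sketched as below: poll a well-known LFC location for new input, evaluate it, store the result on the SE, update the single shared status file under a barrier, and stop before the proxy or queue limits are reached. The helpers download_from_lfc, evaluate_configuration, upload_to_lfc, acquire_barrier and append_status are hypothetical stand-ins for the lcg-util/LFC operations and for the barrier described in the text; only the control flow is taken from the paper.

import time

PROXY_SAFETY_MARGIN = 600          # stop before the proxy/queue limits are hit (seconds)

def long_time_job(bee_id, input_lfn, result_lfn, status_lfn, proxy_seconds_left):
    """Skeleton of a rover/elite long-time job (illustrative only)."""
    deadline = time.time() + proxy_seconds_left - PROXY_SAFETY_MARGIN

    while time.time() < deadline:
        # 1. Poll the LFC/SE for new input produced by the master process on the UI.
        work = download_from_lfc(input_lfn)          # hypothetical lcg-cp wrapper
        if work is None:
            time.sleep(30)                           # nothing yet: wait and retry
            continue

        # 2. Evaluate the configuration (runs the fitness workflow locally on the WN).
        result = evaluate_configuration(work)        # hypothetical

        # 3. Store the result back on the SE so the master can pick it up.
        upload_to_lfc(result, result_lfn)            # hypothetical lcg-cr wrapper

        # 4. Update the single shared status file; the barrier prevents two bees
        #    from overwriting each other's update (Sect. 3.2).
        with acquire_barrier(status_lfn):            # hypothetical lock on the LFC entry
            append_status(status_lfn, bee_id, result_lfn)   # hypothetical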
208 208 IBERGRID Fig. 1. Workflow to measure the quality of the confinement The workflow only requires one CPU, although some of the components of the workflow are suitable for a parallel implementation. Since 32 and 64-bit machines may be found in the production grid infrastructures used, the applications involved in the optimization process must deal with these different architectures. Distinct versions of the libraries required by the applications are included in the optimization process. Bandwidth Load balancing is critical in other optimization algorithms running on the grid [13]. For the DAB algorithm this load balancing is not crucial, as synchronization is not required among the different processes involved in the optimization. The amount of information transmitted over the network is, however, high in this algorithm. Every bee must not only send the small files used to communicate with the colony, but must also store all its results in the SE. The size of a single result is 5.5 MB. Since thousands of results are generated during an optimization process, several GB are transferred from the WNs to the SE. All this data has been previously analysed, which also shows the necessity of distributed environments for this problem. CPU As previously noted, the workflow required to calculate the fitness function takes a different amount of time depending on the configuration being evaluated. One of the applications (VMEC) implements a Levenberg-Marquardt algorithm. After VMEC, the workflow executes the fitness function, which is given by Eq. 1. This function uses the magnetic surfaces (i) of the confined plasma and the intensity of the magnetic field (B) at each of these surfaces. The values involved in the expression are extracted from the output of VMEC. This function provides a value used for measuring the quality of the magnetic confinement of plasma in the fusion device. The physics involved in this expression, as well as the explanation of why this function is relevant, can also be found in [11].

F_{\mathrm{fitness}} = \sum_{i=1}^{N} \frac{\lvert \mathbf{B} \times \nabla B \rvert_i}{B_i^{3}}    (1)

Once the fitness value has been calculated, the workflow analyses the Mercier [14] and ballooning stabilities of the configuration [15]. If both criteria are satisfied, the configuration is valid.
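A minimal sketch of how the per-surface quantities extracted from the VMEC output would be combined into the value of Eq. 1 is given below. Note that the exact numerator of Eq. 1 is an assumption in this reconstruction (a field-gradient term divided by the cube of the field intensity on each of the N magnetic surfaces); the parsing of the VMEC output is not shown.

def fitness(surface_terms, surface_b):
    """Sum the per-surface contributions of Eq. 1.

    surface_terms : iterable with the numerator of Eq. 1 evaluated on each of the
                    N magnetic surfaces (assumed here to be a |B x grad B|-like term).
    surface_b     : iterable with the magnetic field intensity B_i on each surface.
    """
    return sum(t / b**3 for t, b in zip(surface_terms, surface_b))

The decreasing values reported in Sect. 5.2 (from 156,106,868 for the first configuration down to 1,717,129 for the best one) suggest that this quantity is being minimized.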
209 IBERGRID 5 Results The fusion VO has been used to carry out this optimization process. It is a production infrastructure running glite, with resources distributed across several locations in Europe. Table 2 shows the configuration used for the optimization being carried out.

Table 2. Parameters of the algorithm.
Grid Related                 Optimization Related
Parameter        Value       Parameter   Value
Elite Number     50          MR          0.20
Worker Number    8           ES          200
Rover Number     5           τ
Cub Number       4           ϕ           0.02
                             ψ

5.1 Computational Results The aggregated execution time of all the evaluations was 5,287 hours, with a wall time of 120 hours, which implies a speed-up of approximately 44. During the execution of one of the tests performed (shown in Fig. 2), the number of computational resources was changed to show the automatic reconfiguration of the algorithm. It can be seen how the number of elite bees is increased up to 100 and then reduced again to 50. The variability in the number of running tasks was caused by a problem with the WMS (Workload Management System). As can be seen, the system reconfigured the number of resources and continued the execution automatically. It waited until all the workers had finished, without creating new tasks of this type. This reduced the load on the WMS. 5.2 Optimization Results The first configuration found during the optimization process had a value of 156,106,868 for the expression in Eq. 1, while the best configuration attained had a value of 1,717,129. Table 3 shows the best values found for different tests, revealing the reproducibility of the experiments. The different speed-ups are due to the different evolution of each test, leading to a different number of jobs being submitted, and also to the usage level of the grid infrastructure. Fig. 3 shows the cross-section of the confined plasma for the best configuration found. Each of the lines in the figure represents a magnetic surface, which corresponds to the values of i in Eq. 1.
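Assuming that the speed-up quoted in Sect. 5.1 is defined, as the columns of Table 3 suggest, as the ratio of the aggregated execution time of all evaluations to the wall-clock time of the run, the figure follows directly from the numbers in the text:

\[ S = \frac{T_{\text{aggregated}}}{T_{\text{wall}}} = \frac{5287\ \text{h}}{120\ \text{h}} \approx 44 \]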
210 210 IBERGRID Fig. 2. Evolution of the number of bees Table 3. Tests performed Best Result Wall Time Execution Time Speed-up Test 1 1,739, :08:41 5,296:12: Test 2 1,799, :00:37 5,356:23: Test 3 1,769, :01:05 5,005:58: Test 4 1,794, :11:10 5,311:00: Test 5 1,717, :03:31 5,287:10: Conclusions As shown, the use of metaheuristics with the grid permits the achievement of optimized results in a reasonable time. However, the design of the optimization technique needs to consider the special features of the computational platform in which the enhancement process will take place. Large-scale optimization problems present some characteristics that make challenging to design and develop automated optimization systems. The optimization achieved demonstrates how the combination of large-scale computational platforms and distributed metaheuristics represents an optimal approach to carry out optimizations of large-scale problems. As future work we plan on performing further optimizations and adding more target functions. This will lead to configurations improved considering additional relevant physics characteristics. The complexity of the optimizations will also in-
211 IBERGRID 211 Fig. 3. Cross-section of the optimized configuration crease. A new version of the algorithm adapted to use CREAM-CE is being developed [16]. Direct submission to the CE should avoid situations of problems with the load of the WMS. Acknowledgements The research leading to these results has received funding from the European Community s Seventh Framework Programme (FP7/ ) under grant agreement number RI (EGI-InsPIRE). References 1. E. Bonabeau, M. Dorigo, and G. Theraulaz, Swarm Intelligence: From Natural to Artificial Systems. Oxford, I. Foster, and C. Kesselman. The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann Publishers,1998.
212 212 IBERGRID 3. G. E. Robinson, Regulation of division of labor in insect societies, Annual Review of Entomology, vol. 37, no. 1, pp , I. D. Couzin, J. Krause, N. R. Franks, and S. A. Levin, Effective leadership and decision-making in animal groups on the move, Nature, vol. 433, no. 7025, pp , feb A. Gómez-Iglesias, M. A. Vega-Rodríguez, F. Castejón, M. Cárdenas-Montes, E. Morales-Ramos, and J. M. Reynolds. Grid-based metaheuristics to improve a nuclear fusion device. Concurrency and Computation: Practice and Experience 22(11), 1476 (2009). 6. Laguna, M. and Martí, R. (2003). Scatter Search. Methodology and Implementations. Kluwer Academic Publishers, Cambridge. 7. A. Gómez-Iglesias, M. A. Vega-Rodríguez, F. Castejón, and M. Cárdenas-Montes, Distributed and asynchronous bees algorithm: an efficient model for large scale problems optimizations, in DCAI. Advances in Intelligent and Soft Computing, Springer, 2010, pp H. H. Hoos, T. Stützle, Stochastic Local Search : Foundations & Applications. Morgan Kaufman Publishers, P. M. Bellan, Fundamentals of Plasma Physics. Cambridge: Cambridge University Press, K. Miyamoto, Fundamentals of Plasma Physics and Controlled Fusion. Tokyo: Iwanami Book Service Center, A. Gómez-Iglesias, M. A. Vega-Rodríguez, F. Castejón-Magaña, M. Rubio-Solar, and M. Cárdenas-Montes, Grid computing in order to implement a three-dimensional magnetohydrodynamic equilibrium solver for plasma confinement, in PDP. IEEE Computer Society, 2008, pp S. P. Hirshman and G. H. Neilson, External inductance of an axisymmetric plasma, Physics of Fluids, vol. 29, no. 3, pp , F. Luna, A. J. Nebro, and E. Alba, Observations in using grid-enabled technologies for solving multi-objective optimization problems, Parallel Computing, vol. 32, no. 5-6, pp , C. C. Hegna and N. Nakajima, On the stability of mercier and ballooning modes in stellarator configurations, Physics of Plasmas, vol. 5, no. 5, pp , R. Sanchez, S. P. Hirshman, J. C. Whitson, and A. S. Ware, Cobra: An optimized code for fast analysis of ideal ballooning stability of three-dimensional magnetic equilibria, Journal of Computational Physics, vol. 161, no. 2, pp , C. Aiftimiei et al., Design and implementation of the glite CREAM job management service, Future Generation Computer Systems,vol. 26, no. 4, pp , 2010.
213 IBERGRID 213 OptiWeb: An optimization application for steel cut industries ported to the Grid in the framework of the PireGrid project Jaime Ibar 1 Gonzalo Ruiz 1 Alfonso Tarancon 1 Ruben Valles 1 BIFI: Institute for Biocomputation and Physics of Complex Systems of the University of Zaragoza Abstract. PireGrid [1] (Project Number EFA35/08) is an INTERREG IV A project which has two main objectives: the deployment of a production Grid Computing infrastructure in the regions of Aragon, Aquitaine and Midi-Pyrenees, and the achievement of successful cases of execution of applications from small and medium-sized companies of those regions, in order to demonstrate the usefulness of these emerging technologies to the companies. In the framework of this project we present OptiWeb, the first successful porting of an application from the Schnell Software [2] company, and the process followed to adapt it to a web interface and to its transparent execution on a glite grid infrastructure. 1 Introduction In any building work, taking the rebar element lists (like beams, piles and floor structures) as a starting point, a computerized procedure is performed in order to take the data, process it, cut the elements and classify them. The optimization of the steel segment cutting is the most demanding part in terms of computing resources. In fact, Schnell Software and the University of Zaragoza have been collaborating for 9 years in the development of internationally competitive software based on techniques used to study complex systems. Optimo is the result of this close collaboration, a product commercialized in more than 20 countries. Last year, some facts made us think about changing some important aspects of the application. On the one hand, the product needed to enter more demanding markets in terms of steel and computation, like Obra Publica (public works) or other cutting systems from Latin America. On the other hand, BIFI was involved in GRID technology, which could offer performance improvements to this kind of application, executed at that moment on the clients' computers. Thus, we decided to turn it into a distributed and more efficient one. This very ambitious aim, called OptiWeb, would allow Schnell clients to optimize their steel cut on a distributed
214 214 IBERGRID environment, with no CPU and storage limits like it happened before, and with no need for these clients to buy dedicated high power computers. This would reduce their costs, and also it would permit them to get better results in order to save steel, money and energy, reducing the environmental impact of this process. This project was the first successful case of an Aragonese company using GRID technologies and was awarded with the first prize of the 2011 research transfer contest organized by the University of Zaragoza. Currently, Schnell Software has granted access to this application to its more important clients in order to show them how powerful it is, and in the coming future is planing to charge them for using it selling licenses. In order to satisfy the company needs and give it an available product as soon as possible, this project was divided into two phases: 1. Adapting local Optimo to a web environment, Optiweb. This would permit the company clients to run their optimizations from anywhere, using any device which had a browser and an Internet connection (like desktops, notebooks, netbooks, smartphones, and so on). This would remove the computing load that their computers had before, because it would be supported by the application server. This server had no power enough to deal with a lot of clients running their optimizations at the same time, and that was the reason why we planned another phase to introduce GRID technology on it. 2. Porting the optimization process to a working GRID. For this purpose, the PireGRID platform was chosen. It has hundreds of cores availables for running optimizations. This would allow Schnell Software clients to perform better optimizations using less time than in their own dedicated computers. 2 Phase I: the web application The original application is a command line C program developed by the researcher Alfonso Tarancon. Thus, the first study done was to look for the best way of adapting it to a web based application. Due to Schnell Software company mainly works with Microsoft environments, we tried to make a first approach using ASP technology, but it was soon ruled out due to the program complexity and resource consumption. As BIFI has deep experience in Linux and free software, these could speed up the development process, so the technologies finally used are described below. Once the platform where the application would be executed had been chosen, different technologies were considered for the final implementation of the solution. Although a UNIX based operating system would be used, we tried to use multiplatform technologies, because it may be necessary in the future to migrate it to a Windows Server belonging to Schnell Software. The three main possibilities which arose were Java, PHP and Python[6]. The last one was finally selected because its easiness of integrating it with other programming languages, the existence of an API to access the grid resources and the BIFI previous experience in the development of all kind of applications with this technology.
215 IBERGRID 215 With this topic decided, the next step was to face the implementation of the web interface, including all the features that Schnell Software required during the analysis. The used server is Pylons[7], which is one of the most extended servers for Python development. The system was designed from scratch to be compliant with all the browsers of the different platforms, included mobile ones. It has also been used JSON[5], which is an increasing client-server technology. It allows to speed up the communications and make lighter systems without being necessary to reload entire web pages with each operation. With this objective it was also selected the jquery[8] framework, which facilitates all the previous tasks mentioned. In order to store all the data, a MySQL[?] database was developed ad hoc for the project. The main features grouped by sections are the following: Protection: a user can access the system with the user and password or access the registration web Register: a user can create an account to use the system introducing different data, some of them are compulsory to trace the usage of the system by means of Schnell Software. Main: a user can configure and submit optimizations of the cut process. It has the following subsections: Input: where language can be selected (now available in Spanish and English although Italian, German, even Japanese will be soon available) and the metric system in which data is entered. Data: the user adds the input data, that is, the steel bars needed to form the different elements of rebar. They can be introduced both manually and loading a file. Machine: where the user configures the cutting machine, the optimization parameters and the stock of bars in the store. Standard cut: where some advanced options can be added to optimize a special cut. Optimization: where the execution is released locally or GRID and in which the progress of the optimization with partial results of best solution found so far. Results: where final results are shown and where the user can print or download files that go directly to the cutting machines. My account: the user can edit data as well as a historical list with all optimizations, see the way they ended and download input, configuration and result files. Admin: this section is only approachable for system administrators, that is, Schnell Software staff and BIFI. It is possible to get a list of all users and change their permissions in a way that a user can be disabled in case of misuse of the application, change the privileges only to local version, give permissions to execute in GRID or make the user administrator. There is also a list with all the system activity as well as the optimizations done by all the users. The next step was the integration of the binary program in charge of the optimization process. Some modifications were necessary in order to allow the communication between the program and the we application, and this way being able to
216 216 IBERGRID trace the status of the execution. Each optimization runs in an independent background thread, allowing concurrent optimizations from different users at the same time. The major problem of this so-called local version is the performance degradation if the number of optimizations/users increases considerably. The input and output data are stored in different directories depending on the execution, in order to avoid interference between optimizations and also to keep the results available when the machine operator needs them. Fig. 1. In this image we can see a real optimization executed in GRID. The system shows the progress of the simulation, as well as the best solution found so far. The amount of managed data is quite big, because the solution found comprises 3500 steel bars in a casting of 60 Tm. 3 Phase II: Porting to the GRID This phase consisted in the adaptation of the application to the GRID infrastructure and its integration in the web site previously developed. The application performs a stochastic optimization, which means that the results of each execution may vary even when the same input parameters are used. Hence, the larger the number of executions, the higher the probability of finding an optimal solution. This is the key advantage of using a grid infrastructure, because it allows
217 IBERGRID 217 you to run hundreds of optimizations simultaneously, obtaining a better solution in less time compared with a single computing machine. Fig. 2. Scheme of the system running with all its actors. To achieve the porting, the first step was to test the application in a standalone version. This was possible by making some modifications to the source code, recompiling it, submitting it as batch jobs and checking the integrity of the obtained results. The next step was to develop a Python script which would deal with all the complexity of authentication, communication, checks, etc. with the grid. In order to deploy it and have full control of the job submission, the glite APIs [3][4] have been used, executed from a User Interface, which is the machine in charge of finally sending the optimization jobs to the grid. This script is, in fact, the mediator between the web application and the supercomputing platform, which in our case is a glite grid. From the web interface, in a way transparent to the user, the input parameters and the configuration for the job submission are sent to the script, together with the number of optimization jobs that we want to run in parallel. The Python script is in charge of submitting the jobs to the PireGrid WMS, which selects the final execution environment depending on the workload of each node. The script checks the status of the jobs, warns in case of any failure, and is also responsible for automatically collecting the data of all the successful jobs. Once it has gathered all the outputs, it starts a local program that selects the best solution among all the ones computed in the grid, sends it back to the web server and shows it to the final user. In this way, the final operator does not notice any difference between selecting the local or the GRID process. It is just a matter of configuring the input data via the web and pressing a button to launch the optimization; the system deals with everything needed to obtain the final result and show it to the user, and the GRID results will be better in most cases because the space of solutions explored is much bigger.
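The essence of this mediation can be summarized in a few lines: the same input is fanned out to N independent stochastic runs on the grid, the outputs of the jobs that succeeded are gathered, and the best one (the solution with the lowest percentage of remains) is returned to the web server. The sketch below is only an illustration of that logic; submit_jobs, wait_and_collect and remains_of are hypothetical stand-ins for the glite submission and retrieval calls and for the result parser, not part of the real OptiWeb code.

def run_grid_optimization(input_file, n_jobs):
    """Submit n_jobs identical stochastic optimizations and return the best output."""
    # 1. Submit the same input n_jobs times; each job is an independent stochastic run.
    job_ids = submit_jobs(input_file, n_jobs)        # hypothetical glite wrapper

    # 2. Collect the outputs of the jobs that finished successfully.
    outputs = wait_and_collect(job_ids)              # hypothetical

    # 3. Keep the best solution: the one with the smallest percentage of remains.
    best = min(outputs, key=remains_of)              # remains_of: hypothetical parser
    return best

Because every run is independent, a failed job merely shrinks the sample of candidate solutions instead of invalidating the whole optimization, which is one reason the final operator never has to care about individual grid failures.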
218 218 IBERGRID Fig. 3. Scheme of the application running in grid mode. Obviously, the security in the approach of communication has been handled with secure protocols, always sending the information encrypted, which is one of the main requirements of data used by companies. 3.1 How the Web Server works The web server is the mediator between the operator using the application and the user interface. There are three main steps in the use case of the application. The job submission: The user inputs the data filling up the web forms or uploads an input file with all the initial parameters for the application. Then the web server prepares a tar file which is sent via ssh to the User Interface that contains the Python script, which is invoked and started. The execution: During the execution of the jobs in the grid the user needs to know how his application is progressing, but all the complexity of the computing platform is hidden for him. This way, he is not the one who asks how everything is going on, but it is the web server which reads an status file from the User Interface to know if the execution is yet in progress and how many jobs have already finished to update the status showing the percentage of the optimization task that has been completed. The results: When the status bar reaches 100% means all grid jobs have finished properly and the final output has been copied from the UI to the Web Server, so the final results can be formatted and be shown in the web page to the operator. 3.2 How the Python script works First of all, it checks whether a valid voms proxy exists, in order to be able to get authenticated in every operation with the services of the grid infrastructure. Only in case it detects there is no valid proxy enabled it creates a new one.
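A minimal sketch of that first step is shown below, using the standard VOMS client commands invoked through subprocess. It is an illustration rather than the actual OptiWeb script: the lifetime threshold is an assumption, and in a real deployment the new proxy would normally be obtained from a stored credential (for example a MyProxy server) rather than from an interactive voms-proxy-init prompt.

import subprocess

def ensure_voms_proxy(vo, min_seconds=3600):
    """Create a new VOMS proxy only if no valid one is found (illustrative sketch)."""
    try:
        out = subprocess.run(["voms-proxy-info", "--timeleft"],
                             capture_output=True, text=True, check=True)
        if int(out.stdout.strip() or 0) > min_seconds:
            return                      # a valid proxy with enough lifetime already exists
    except (subprocess.CalledProcessError, ValueError, FileNotFoundError):
        pass                            # no proxy (or no client tools found): fall through

    # No valid proxy detected: create a new one for the given VO.
    subprocess.run(["voms-proxy-init", "--voms", vo], check=True)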
219 IBERGRID 219 Fig. 4. Communication work flow client computer, web server and User Interface starting the optimization Fig. 5. Communication work flow client computer, web server and Python script while the jobs are running in the grid Fig. 6. Finished status: Retrieving the output results from the grid to the web server
220 220 IBERGRID After that, the jdl file with the description of the job is generated, according to the specification of the parameters that the web server has sent to the script as input parameters. In the case of this application, also an optimization of the way the jobs are sent to the grid has been performed. It consists in generating a jdl file of type collection, which lowers the load of the WMS service, because it only receives one big job, composed of several ones and the input data is only transfered once. This approach is possible because, once the parameters are fixed in the web server interface, these are the same for all the jobs sent to the grid. Having created the jdl file there are two options depending on the InputSAndbox: If there are no files in the InputSandbox, only a jobstart is required, so the job will be directly submitted to execution. In our case, there are input files, so there is a previous step to be done. First we have to do a jobregister to obtain the id of the job collection and the url where the input files will be stored in the WMS (of the type gsiftp:///path to input files) This URL is then used to upload the input file to the WMS using the lcg util API. As they are quite small and as using the job collection approach they are transmitted only once, the usage of the Storage Element Service is not required. When the input files have been properly upload to the WMS, the jobstart can be done in order to submit the job to the WMS. The later is obviously the one in charge of deciding in base of his algorithms to decide the most suitable site to execute the jobs. The jobstart command returns the list of the unique ids of the jobs which conform the collection. This list is constantly used in the main loop of the script, because once they have been submitted they are treated as independent jobs and not as a collection. The main sequence of tracing the job execution is checking the status of all the job ids. In case of its termination, the output is retrieved from the WMS and it is deleted from the list to avoid future checks. At the same time, in each iteration the status file, where the execution profile is stored, is updated with the correct number of jobs sent, running, done, etc... When the number of finished jobs equals the input parameter received from the server, the execution of the Python script finishes and the control is delegated to the web server. One of the most interesting advantages that we have using the Python API is the abstraction of the command line, the simplicity and speed to develop this kind of script, but above all the complete control that you have at low level to interact with the different services and steps of the workflow of submitting jobs and handling their data. 4 Analysis of results Below we show two graphics where the improvement in the performance of the implemented system solution can be observed. In the first one we can see the
221 IBERGRID 221 execution of 1000 optimization jobs in the grid for a casting of 60 Tm. On the Y axis we observe the percentage of remains, and on the X axis the number of jobs. The lower the point, the better the solution. This example would be equivalent to 1000 local executions, which would last for a very long period. If we executed only one local run, it would most likely be very close to the mean, represented by the blue line. Thanks to the power of the grid we can observe a 2.5% improvement, which in our example would mean a saving of 2 Tm of steel, storage, transport, time, etc. Fig. 7. Scheme of the communication of the Python script with the different services of the grid infrastructure and the web server. In the second graphic, we can observe the improvement of the solutions as the number of jobs grows. If the client executed it only on his PC, the obtained solution would correspond to point 1 on the abscissa and would have a little more than 7.4% of remains. With the grid approach, executing 100 jobs, the remains are reduced to 5.8%. Acknowledgements Special thanks to the Schnell Software colleagues (Miguel Caro and Mari Mar Lahuerta) who have collaborated on all the technical issues related to the analysis of requirements of the web application. Thanks also to all the PireGrid partners and all the people collaborating in deploying and monitoring the PireGrid infrastructure.
222 222 IBERGRID Fig. 8. Execution in GRID of 1000 optimization for a casting of 60 Tm. Fig. 9. Representation of the improvement while the number of jobs/optimizations increases.
223 IBERGRID 223 References
1. PireGrid web page:
2. Schnell Software web page:
3. Python API web page: wiki/3.1/html/python/wmproxyapipython.html
4. Python API web page (alternative): i.infn.it/egee-jra1-wm/api doc/wmproxy api python.html
5. JSON web page:
6. Python web page:
7. Pylons web page:
8. jQuery web page:
9. MySQL web page:
224 224 IBERGRID DataLight: data transfer and logging of large output applications in Grid environments Paulo Abreu 1, Ricardo Fonseca 1,2, Luís O. Silva 1 1 GoLP, Instituto de Plasmas e Fusão Nuclear, Instituto Superior Técnico, Lisboa, Portugal [email protected] 2 Departamento de Ciências e Tecnologias da Informação, Instituto Superior de Ciências do Trabalho e da Empresa, Lisboa, Portugal [email protected] Abstract. In this paper we present a powerful and lightweight library called DataLight that handles the transfer of large data file sets from the local storage of the Worker Node to an glite Storage Element. The library is very easy to integrate into non-grid/legacy code and it is flexible enough to allow for overlapping of several data transfers and computation. It has the ability to send messages to a Web Portal while the application is running, so that a user can login to the Portal and receive immediate updates. In particular the user can start post-processing the produced data while the application is still running. We present the application details both of the library and the Portal and discuss some relevant security issues. Keywords: web portal, large data sets, Grid migration, data transfer. 1 Introduction The need for a large-scale distributed data storage and management system has been rising in the last forty years at least as fast as the need for computational power. In fact, the recently formed European Grid Infrastructure (EGI) has its roots partially in the European Union funded DataGrid project [1], aimed at sharing large-scale databases, from hundreds of TeraBytes to PetaBytes, across widely distributed scientific communities. This makes the EGI highly optimized for dealing with both data intensive and computational intensive applications, since two of its main purposes are to offer high computing power to applications and high storage and replication facilities for data. However, this leads to a certain dominance of Grid applications that require large amount of data as input (like the ATLAS or CMS experiments), against Grid applications that produce large amount of data as output. A very common type of high-performance computing (HPC) algorithm is one that produces very large data files as output. An example is the class of algorithms used in physics simulations, where the motion and interaction of several thousands or millions of particles are simulated. Codes like this (e.g., PIC [2]) require a relatively small amount of data as input (e.g., 1 kb) but can produce a huge amount
225 IBERGRID 225 of output (e.g., up to 1 TB). In our particular case, we develop, maintain and deploy two massively parallel plasma simulation codes (Osiris [3] and dhybrid [4]), which have been tuned for HPC systems ranging from one to several thousands of processors. These codes, besides being good candidates for application deployment in the EGI Grid, also challenge the current migration infrastructure, since they usually produce large data sets (several hundreds gigabytes of data) as output for post-processing (e.g., visualization). Hence it is important to make these large amount of data produced by such codes available as quickly as possible, ideally still at runtime, so that the user can have almost immediate access to preliminary results. It is also likely that if the application is writing the results locally on the Worker Node (WN), this node might not have enough storage for the complete output data, specially when it is in the order of several hundreds of GB. In this paper, we present an improvement and expansion of a previous tool that we have developed [5], which we now call DataLight. In its current form, DataLight is a library that eases the deployment on the EGI of applications that output large amounts of data, specially in the case where that data should be made available as soon as it is produced. It allows for multi-threaded transfers of data, thus taking advantage of current fat nodes (some cores in a computing node can be used for the transfer threads while the rest is doing the calculations and generating the data), and it automatically updates the status of the program by sending messages to a Web portal. This allows for an immediate feedback to the user, that can start post-processing the data while the program is still running. This paper is organized as follows: in Section 2 we present an overview of previous work that has been done in this area, in particular in the available glite tools for data transfer and in the previous tool that we have developed; in Section 3 we present a general description of the algorithm developed for the library and give an overview of the solutions we found for the user interface; in Section 4 we present the Web portal implementation and the security issues that had to be solved; finally, in Section 5 we conclude and point to some directions for future development. 2 Previous work A great amount of work has been done on the ability to replicate and transfer large data sets on the Grid [6 8]. In fact, one of the main features of the EGI middleware (glite) is its emphasis on data handling, namely on its replication and accessibility. However producing large data sets as result of simulations running on the EGI poses an interesting problem for application development. On one hand, glite offers two high-level APIs for data management (GFAL and lcg util [9]) which are well suited for the reliable transfer, replication and management of data on the Grid, and one low-level API for data transfer submissions (FTS [10]), which is suited for point to point movement of physical files. On the other hand, such data management operations should occur still during simulation time, not just after it, such that the generated data is made generally available as soon as possible and the Worker Nodes (WN) are never filled with too much output results. This
226 226 IBERGRID requirement can lead to performance degradation (either at the application level or at the WN level) due to the overhead and slowness of network transfers when compared to local storage access. In [5] we explained in detail the reason for our choice of the lcg util API. In overview, we need to deal with complete files (unlike GFAL, which uses POSIX to create, write to, and read from files on the Grid), we need a higher-level abstraction of data management (unlike FTS), and we would like to base our library on a stable and powerful data transfer layer like GridFTP [11, 12]. Although other tools have been proposed for file transfers on the Grid (like [13]), we found that a GridFTP-based tool had all the requirements of stability, reliability and performance that we needed. Also, GridFTP is one of the most widely used middleware components for data transfer on Grid systems, so building our library on top of a software stack based on GridFTP is a first guarantee of portability. 3 DataLight Implementation Figure 1 shows a simplified fluxogram of the algorithm in DataLight. It is based on [5], but with several important additions that will be explained in the following sections. Our general goal was to develop a tool that would ease the deployment on the Grid of applications that output large amounts of data, especially in the case where that data should be made available as soon as it is produced. To this end, we have improved on a previous implementation (described in [5]), adding several configuration capabilities and an interaction with a Web portal. The top left of Figure 1 represents the main application that uses DataLight. It mainly consists of a simulation cycle followed by a write of the result data to local storage; these two steps are repeated until the complete simulation finishes. Each local write operation is followed by a call to the DataLight function (write_remote), which adds the file to a queue of files waiting to be transferred. The first time the main application calls this function, it also launches a queue manager thread in DataLight (right hand side of Figure 1), which is responsible for managing the waiting queue. This queue manager dispatches the files on the waiting queue, in FIFO order, to a transfer queue. It checks whether there are enough transfer slots available (these are bandwidth and system dependent and can be defined by the user) and, for each file that is moved from the waiting queue to the transfer queue, it initiates a transfer thread (bottom left of Figure 1). Each transfer thread uses the simple algorithm represented on the bottom left of Figure 1. It is up to each transfer thread to remove the file reference from the transfer queue and to remove the local file as soon as the transfer finishes. Finally, when the complete simulation finishes, the application sends a signal to the queue manager thread in DataLight by calling the function write_finished. This forces the queue manager to wait until all queues are empty and then quit, thus allowing the application to finish.
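A compact Python rendering of the two-queue scheme just described is sketched below. It only illustrates the control flow of Figure 1 and is not the library's actual implementation (DataLight is meant to be linked into non-Grid/legacy codes); transfer_to_se is a hypothetical stand-in for the lcg-util call that copies the file to the Storage Element and registers it.

import os, queue, threading

class DataLightSketch:
    def __init__(self, max_slots=2):
        self.waiting = queue.Queue()                    # files waiting to be dispatched (FIFO)
        self.slots = threading.Semaphore(max_slots)     # limits concurrent transfer threads
        self.active = []                                # transfer threads currently running
        self.finished = threading.Event()
        self.manager = None

    def write_remote(self, path):
        """Called by the application right after each local write."""
        self.waiting.put(path)
        if self.manager is None:                        # first call: launch the queue manager thread
            self.manager = threading.Thread(target=self._manage, daemon=True)
            self.manager.start()

    def write_finished(self):
        """Called once the simulation is over; blocks until every transfer is done."""
        self.finished.set()
        if self.manager is not None:
            self.manager.join()

    def _manage(self):
        # Dispatch waiting files to transfer threads until told to finish and drained.
        while not (self.finished.is_set() and self.waiting.empty()):
            try:
                path = self.waiting.get(timeout=1)
            except queue.Empty:
                continue
            self.slots.acquire()                        # wait for a free transfer slot
            t = threading.Thread(target=self._transfer, args=(path,))
            self.active.append(t)
            t.start()
        for t in self.active:                           # wait for ongoing transfers before quitting
            t.join()

    def _transfer(self, path):
        try:
            transfer_to_se(path)                        # hypothetical wrapper around lcg-cr
            os.remove(path)                             # free local WN storage once transferred
        finally:
            self.slots.release()

The semaphore plays the role of the transfer slots: write_remote returns immediately after queueing the file, so computation and transfers overlap, and write_finished blocks only at the very end until both queues have drained.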
227 IBERGRID 227 Fig. 1. A simplified fluxogram of the DataLight library (left and bottom right), together with the application that might use it (top left).
228 228 IBERGRID 3.1 User interface The two functions referred to in the previous paragraphs (write_remote and write_finished) are the only exported functions of the DataLight library. This allows for a minimal code intervention in the application that wishes to write files to the Grid. This was one of the main goals of the library, since it is expected to be deployed with applications that were built without the Grid in mind. This saves the cumbersome and error prone change of (possibly complicated) output routines. Several aspects of the behavior of DataLight can be controlled by environment variables. We consider this approach to be more flexible than passing parameters directly to the called function. On one side, it keeps the API minimal and hence avoids extensive legacy application code editing. On the other side, environment variables can be simply set in the JDL file the define the Grid job, thus allowing for a general configuration of the application and not of a particular function call. All DataLight variables start with the prefix DL_ and the library tries to use sensible values if they are not set. Here is the complete list, together with a description and the default value: DL_DEBUG It is a numerical value that sets the verbosity level of the library. At the moment it uses two values: 0 for no extra messages (this is the default behavior if this variable is not set) and 1 for extra logging. Any other integer value is handled as a 1. Any non integer value is handled as a 0. DL_VO It is a string the contains the Virtual Organization this job belongs to. It is used to create a correct Logical File Name for the each file that is transferred. If not set, the environment variable LCG_GFAL_VO is used. If this variable is not set, an error message is send and this value is empty. DL_JOBID A string that uniquely identifies this job. This will be used in the portal to collect all messages from the same job. Normally, it should be not set, since DataLight uses the value from GLITE_WMS_JOBID. If this variable is not set, it uses EDG_WL_JOBID. Finally, if neither variables are set, DataLight issues a warning and uses the date() function. In this case, it is worth noting that this value will stay the same during all the time that the application is running. DL_SE This is a string that contains the FQDN of the Storage Element where the files should be transferred to. If it is not set, than its value is copied from the environment variable VO_${DL_VO}_DEFAULT_SE (an automatic conversion of DL_VO to upper case is done). If this variable is not set, an error message is send and the program continues, but without data transfer. DL_NTHREADS This variable is a numerical value that sets the number of simultaneous file transfer threads (represented in the bottom left of Figure 1). Default value is 1. Negative values or non integer values act like a 0, and no file transfer takes place. This is useful if one wishes to use just the Portal capabilities of DataLight to have an updated situation of the running application, without the need to actually transfer any data.
229 IBERGRID 229 DL_NB This is equivalent to the nbstreams parameter of the lcg_cr command. It specifies the number of parallel streams per file transfer thread. The default value is 1. DL_USER This is a string variable that should be set to the user name in the UI that runs the Portal (more on this later). There is no default value. If the variable is not set, it is most likely that DataLight will not be able to contact the Portal. DL_ROOT This variable is a string that defines the root path for creating the correct Logical File Name for the each file that is transferred. Usually, it should be set to the user s CN of the X509 certificate. In the LFN, the local path of the file, as passed to the function write_remote, is reproduced. As a complete example (also using the variable DL_VO), let us assume that files are locally written in the directory output, that the VO is test and that DL_ROOT is set to paulo_abreu. In that case, each write_remote function call will have the format write_remote( output/... ) and the complete LFN for each transferred file will be: /grid/test/paulo_abreu/output/... where... represents the name of each file. There is no default value. DL_PORTAL This string has the FQDN of the Web portal page that is to be contacted with messages from the DataLight library. The next section has more information on this issue. Most of this variables do not need to be set, since the default values do what is to be expected. The only recommended variables to be set are DL_ROOT, so that LFN entries are locally kept in directories related to the user, and DL_PORTAL, which points to the Web portal to be updated while the job is running. Without this last variable, the user will only be able to access the transferred files by consulting the stdout file, after the job has finished, instead of having an immediate access to them during the job, through the Web portal. 4 Portal Implementation Another important feature that we have implemented in this version of DataLight is the ability to send HTTP POST messages while it is running. In the current version of the library, we have implemented messages related with file transfer (success, failure, retry,... ), but any kind of message can easily be added. These messages are sent to a Web page URL defined in the environment variable DL_PORTAL. Obligatory fields for each message is its identification (called ID), the user this messages belongs to (called Owner), and some text (called Log). The ID is the environment variable DL_JOBID, the Owner corresponds to the DL_USER variable and should be the name of a user registered in the system the Portal is running on (usually, the UI), and the Log is any text that the library finds
230 230 IBERGRID necessary to send to the user. For example, each time a file is written to the SE and removed from the local WN, DataLight sends a message with the LFN of the transferred file in the Log. The Portal itself is written mainly in PHP and interfaces with a MySQL database. Each message is added as an entry to the database, and the current date and time are added to it. At the moment, the main use of the Portal is to register the transfer of files from the WN to the SE using DataLight. By logging into the Portal, the user can check the status of the job and has access to the LFN of each file that was transferred. Then, with the usual glite tools (for example, lcg-cp) the user can get the file from the SE to the UI for post-processing, while the simulation is still running. An important issue to note is that each WN must be allowed to have an outbound network connection. This can be achieved by setting the GlueHostNetworkAdapterOutboundIP attribute to true. 4.1 Security issues We have identified two points where the security of the Portal can be an issue. The first point is in receiving the HTTP POST messages: without proper care, it is easy to flood the Portal with false messages (DoS attack). The second point is in accessing the Portal to receive information about jobs: only the owner of a job should be allowed to view its messages. To handle both issues, the Portal in its current form is deployed on the UI of a site and only allows secure connections from users registered on that UI. When a user authenticates with the Portal, he or she only has access to the messages received that have the Owner field set to the correct user name. All other entries are not accessible. Also, the Portal silently drops any message that has an Owner field that does not correspond to a registered user. This limits the possibilities of misuse of the Portal. 5 Conclusions and future work One of the main strengths of DataLight is its simplicity. It allows for the overlapping of computation and data transfers with minimal effort from the developer of the non-Grid application. By implementing the corresponding Portal, the generated files are readily available to the users while the application is running. For applications that require long computational times (several hours or days) and that write hundreds of files, this is another major feature of the library. However, we consider the DataLight Portal in its current form to be at an initial stage. An important next step will be to offer a user interface in the Portal to control the launching of jobs: after login, the user specifies the input files and parameters, and the Portal automatically generates the JDL file and launches the job. This allows for increased security, since the Portal has access to the WMS job ID from the beginning, and can be configured to only accept remote messages
231 IBERGRID 231 from running jobs with valid IDs. Furthermore, this also lifts the necessity of hosting the Portal in the UI: any user with valid Grid credentials can apply for a login to the Portal. The final step will be to implement some post-processing capabilities in the Portal. Getting the files from the SE to the UI is trivial, if the user has the corresponding LFN. But it is also possible to implement in-situ post-processing (with the files staying in the SE) using GFAL and a visualization tool with a browser interface (for example, Vtk [14] or VisIt [15] with Java bindings). Acknowledgements This work was partially supported by grant GRID/GRI/81800/2006 from FCT, Portugal. References 1. The CERN DataGrid Project C. K. Birdsall and Langdon. Plasma Physics via Computer Simulation (Series on Plasma Physics). Taylor & Francis, January R. A. Fonseca, L. O. Silva, F. S. Tsung, V. K. Decyk, W. Lu, C. Ren, W. B. Mori, S. Deng, S. Lee, T. Katsouleas, and J. C. Adam. OSIRIS: A three-dimensional, fully relativistic particle in cell code for modeling plasma based accelerators. In Sloot, P. and Tan, C. J. K. and Dongarra, J. J. and Hoekstra, A. G., editor, Computational Science-ICCS 2002, PT III, Proceedings, volume 2331 of Lecture Notes in Computer Science, pages Springer-Verlag Berlin, L. Gargaté, R. Bingham, R. A. Fonseca, and L. O. Silva. dhybrid: A massively parallel code for hybrid simulations of space plasmas. Computer Physics Communications, 176(6): , P. Abreu, R. A. Fonseca, and L. O. Silva. Migrating large output applications to Grid environments: a simple library for threaded transfers with glite. In Ibergrid nd Iberian Grid Infrastructure Conference Proceedings. Netbiblo, Mehnaz Hafeez, Asad Samar, and Heinz Stockinger. A Data Grid Prototype for Distributed Data Production in CMS. In 7th International Workshop on Advanced Computing and Analysis Techniques in Physics Research, Mehmet Balman and Tevfik Kosar. Data scheduling for large scale distributed applications. In Proceedings of the 5th ICEIS Doctoral Consortium, In conjunction with the International Conference on Enterprise Information Systems (ICEIS?07, Caitriana Mairi and Macleod Nicholson. File Management for HEP Data Grids. PhD thesis, University of Glasgow, S. Burke, S. Campana, A. Peris, F. Donno, P. Lorenzo, R. Santinelli, and A. Sciabà. glite 3 User Guide, Peter Kunszt. File Transfer Service User Guide, document/591792/. 11. The Globus Project. GridFTP: Universal Data Transfer for the Grid. White Paper, September C2WPdraft3.pdf. 12. B. Radic, V. Kajic, and E. Imamagic. Optimization of Data Transfer for Grid Using GridFTP. In 29th International Conference on Information Technology Interfaces, 2007., pages , June 2007.
232 232 IBERGRID
13. Jiazeng Wang and Linpeng Huang. Intelligent file transfer protocol for grid environment. In Wu Zhang, Weiqin Tong, Zhangxin Chen, and Roland Glowinski, editors, Current Trends in High Performance Computing and Its Applications. Springer Berlin Heidelberg.
14. The Visualization Toolkit.
15. VisIt: Software that Delivers Parallel Interactive Visualization. gov/codes/visit/.
233 SESSION 3: SOFTWARE DEVELOPMENT AND QUALITY SESSIONS
234
235 IBERGRID 235 Software Provision Process for EGI Mário David 1, Gonçalo Borges 1, Jorge Gomes 1, João Pina 1, Isabel Campos 2, Enol Fernandez 2, Alvaro Lopez 2, Pablo Orviz 2, Javier López Cacheiro 3, Carlos Fernandez 3, Alvaro Simon 3 and Alvaro Fernandez 4 1 Laboratório de Instrumentação e Física Experimental de Partículas Lisboa, Portugal 2 Instituto de Física de Cantabria, CSIC, Santander, Spain 3 Fundacion Centro de Supercomputacion de Galicia, Santiago de Compostela, Spain 4 Instituto de Física Corpuscular CSIC/University Valencia, Spain Abstract. The European Grid Initiative (EGI) provides a sustainable pan-European Grid computing infrastructure for e-science based on a network of regional and national Grids. The middleware driving this production infrastructure is constantly adapted to the changing needs of the EGI Community by deploying new features and phasing out other features and components that are no longer needed. Unlike previous e-infrastructure projects, EGI does not develop its own middleware solution, but instead sources the required components from Technology Providers and integrates them in the Unified Middleware Distribution. In order to guarantee high quality and reliable operation of the infrastructure, all UMD software must go through a release process that covers the definition of the functional, performance and quality requirements, the verification of those requirements and testing in production environments. 1 Introduction The European Grid Initiative (EGI) [1,2] provides a computing infrastructure formed by a federation of cooperating resource providers from Regional and National Grid Initiatives (NGIs) all across Europe. EGI does not develop the software deployed in the grid infrastructure; all upgrades and new components are provisioned in partnership with Technology Providers. Technology Providers are organizations or projects collaborating with EGI that develop or deliver software for use within the production infrastructure. Two types of software are deployed in the EGI Infrastructure: the middleware delivered by external Technology Providers such as the European Middleware Initiative (EMI) [4] or the Initiative for Globus in Europe (IGE) [5], which is integrated by EGI into the Unified Middleware Distribution (UMD); and a set of Operational Tools providing ticketing systems for user and operational support, accounting and monitoring, which support the running of the production environment. EMI is the technology provider that will distribute the ARC, glite and UNICORE middleware stacks in an integrated release, while IGE is the technology provider that will distribute and support the Globus middleware stack. E-mail of corresponding author: [email protected]
236 236 IBERGRID Most Operational Tools are developed within the framework of the EGI-InSPIRE [3] project, acting as an internal Technology Provider. Each software component submitted by the Technology Providers must go through the software provision process before its General Availability for deployment in the production infrastructure. This well-defined workflow assures the quality and reliability of the software by assessing it against a set of defined criteria. The workflow consists of two main phases: software verification and staged rollout. This paper is organized as follows. Section 2 gives an overall description of the full Software Release Process. Section 3 describes the Software Verification phase including the Quality Criteria. The Staged Rollout workflow, the past year experience and Early Adopters are described in section 4, and section 5 gives a brief overview of the upcoming EMI 1.0 release due at the end of April 2011. Conclusions and some prospects are drawn in section 6. 2 Software Provisioning Workflow The Software Provisioning activity in EGI [6] governs two processes to ensure software quality and its correct integration into the EGI UMD repository. These main activities are the continuous revision of the process to adapt to any change required by the EGI community (including new features and phasing out features and components that are no longer needed), and the maintenance and improvement of the process quality as defined by EGI. The Software Provisioning activity acts as a filter to accept or to reject new products into EGI's UMD repository. The Software Provisioning workflow, depicted in Fig. 1, starts when a new software component release is available to be included in the EGI production infrastructure. The Technology Provider creates a ticket in the Global Grid User Support system (GGUS) [7]. The GGUS ticket is filled with complete information about the new software release, such as: release notes, documentation, installation and configuration notes and the list of packages contained in the release. The ticket has to be assigned to the EGI Technology Unit (SA2), triggering the creation of a ticket in the EGI RT system [8], in a dedicated queue named sw-rel. The state diagram of this queue is shown in Fig. 2, and it is the main tool to handle the Software Provisioning workflow. Upon the RT ticket creation, the Software Delivery phase (implemented by an external RT module called Bouncer) takes place and the new Technology Provider release is processed automatically. The end result of the Bouncer is a list of product release descriptions specific for each platform and architecture (Product-Platform-Architecture, PPA), creating one child ticket (in the same RT queue) per combination of Product-Platform-Architecture. From EGI's point of view a Product is a solution delivered by Technology Providers which offers the functionality for one or more capabilities as one single and indivisible unit. This concept is important due to the intrinsic architecture of the infrastructure, in particular for the resource infrastructure administrators and the EGI Operations Unit. The previously mentioned parties have driven the requirement for the EGI repository structure [10] to be Product-Platform-Architecture oriented, i.e. there is one repository per Product-Platform-Architecture combination.
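To make the fan-out into child tickets more concrete, the following is a minimal sketch, assuming an invented product name and an invented set of platforms and architectures, of how one item per Product-Platform-Architecture combination could be enumerated; it does not correspond to any actual EGI tool.

#!/bin/bash
# Purely illustrative: enumerate one child item per Product-Platform-Architecture
# combination. Product, platform and architecture names are invented examples.
PRODUCT="example-product"
for PLATFORM in sl5 sl6; do
    for ARCH in i386 x86_64; do
        echo "child ticket: ${PRODUCT} / ${PLATFORM} / ${ARCH}"
    done
done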
237 IBERGRID 237 Fig. 1. Software provisioning workflow, showing also the tools used in the whole process. Each Product-Platform-Architecture combination is tracked in its own child ticket during the Verification and Staged Rollout phases. In this way, it is possible to have several Product-Platform-Architecture combinations being processed in parallel and independently from each other. The RT ticket state changes from Unverified to InVerification when the Verification process starts (see Fig. 2). If the Verification phase has an outcome of accepted, the RT ticket state changes from InVerification to StagedRollOut (Fig. 2). This state change triggers the creation of a child ticket in another specialized RT queue named stagedrollout. The full Staged Rollout phase is tracked in the tickets belonging to this queue. The details will be given below. This phase is complete when the ticket is set to status resolved with a given outcome. If the outcome of the Staged Rollout is accepted, a new release is prepared for General Availability to the EGI infrastructure. Several actions are performed: preparation of a new UMD release, transfer of software product packages to the production repositories, and broadcast of the new release to all resource infrastructure administrators and other involved parties. The Product-Platform-Architecture sw-rel child ticket is set to resolved with outcome accepted.
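As an illustration of the release actions listed above, the sketch below promotes the packages of one verified Product-Platform-Architecture combination from a staging area to a production repository; the paths and repository layout are invented for the example and do not describe the actual EGI repository tooling.

#!/bin/bash
# Illustrative only: move verified packages of one PPA combination into production
# and regenerate the repository metadata. All paths are invented placeholders.
STAGING=/repo/staging/example-product/sl5/x86_64
PRODUCTION=/repo/production/example-product/sl5/x86_64
rsync -av "${STAGING}/" "${PRODUCTION}/"
createrepo "${PRODUCTION}"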
238 238 IBERGRID Fig. 2. Software release workflow implemented in the EGI RT and repository portal [9]. It should be noted that in any of the phases previously described, an outcome of rejected is set if problems or bugs are discovered. In this case the release is rejected and the sw-rel child ticket is set to resolved. Whatever the outcome, setting the RT ticket to resolved triggers the update of the GGUS ticket with a report, and its status is changed to Resolved. This is the final state of the workflow. Presently, the full Software Provision workflow is mature and largely implemented, both in terms of the process itself as well as the supporting tools. 3 Software Verification process The EGI TSA2.3 task handles the Software Verification process. The objective of this task is to verify the software quality before the Staged Rollout phase. Software might work, but it must also follow the quality criteria to pass into production. Over the last project year the verification process was started, based on the Unified Middleware Distribution Quality Criteria capabilities. The release of the first version of the Quality Criteria has allowed the creation of the verification templates that are publicly available in the EGI DocumentDB [11]. The Verification phase of the workflow is now described.
239 IBERGRID 239 When the sw-rel RT ticket changes into the state InVerification, the Verification phase starts (Fig. 2). The actions that the verifier has to perform are:
1. Check several documents, such as: release notes, changelogs and which bugs or issues are solved, existence of updated documentation, and the certification reports provided by the Technology Providers.
2. Check if there are security vulnerabilities that need special treatment and a closer look.
3. Determine if all required capabilities are present.
4. In certain cases, do a deployment and configuration test of the component.
5. Produce a verification report (using the report templates mentioned above) and set an outcome for the release.
Depending on the outcome of the Verification phase, if the release is rejected, this is automatically communicated back to the original GGUS ticket together with the reasons for the rejection. If the outcome is accepted, the workflow proceeds to the Staged Rollout phase. The Verification task is handled by the Ibergrid EGI partners; it is foreseen to involve other EGI partners in the case of Products where other skills and experience are needed. This is the case for ARC and UNICORE components. 3.1 Quality Criteria The EGI task TSA2.2 handles the Software Quality Criteria specification. This is a continuous process that is formally reviewed every 6 months, producing new or updated versions of the Quality Criteria documents, with input from all involved parties: user community, operations, Technology Providers and the EGI Technology Unit (SA2). There are two major types of Unified Middleware Distribution Quality Criteria documents: generic and specific. The generic Quality Criteria [12] are applicable to all Unified Middleware Distribution components and Operational Tools. The main sections are given below:
Documentation (user guides and software documentation).
Software Releases (component license, source code readability, etc.).
Service Criteria (service logs, management and monitoring).
Security (file access policy).
Thus, the new software must be well documented, it should include a license that permits the deployment of that software in the EGI infrastructure, publicly available open source code, etc. On the other hand, software capabilities depend on the product used; as such, specific Quality Criteria are needed, targeting the specific components. Due to the high heterogeneity of Unified Middleware Distribution software capabilities and sources, all the specific Quality Criteria are grouped in six major groups:
Compute Capabilities: Products with Compute Capabilities QC are aimed at job execution (parallel and sequential) and job scheduling.
240 240 IBERGRID
Data Capabilities: Product capabilities aimed at data management and data or metadata catalogs.
Information Capabilities: Grid information model schema and service discovery capabilities for products used in EGI.
Operations Capabilities: covers monitoring and accounting capabilities.
Security Capabilities: Grid authentication and Virtual Organizations management capabilities.
Storage Capabilities: Products that target data transfer and storage capabilities.
The activity in the first year of the EGI project has focused on the definition of all Quality Criteria documents, where input from the Technology Providers was essential, with the aim of having the first version of those documents finalized for the first EMI release. 3.2 Verification of the Quality Criteria: results During the first year of EGI, some software components have been put through the workflow. These first uses of the workflow process were aimed at testing the system and the tools. They also served to fine tune the tools and check for additional customization that needed to be implemented. As such, the following products were used for the tests:
Trust Anchors updates: this component is internal to EGI and contains the root certificates of all Certification Authorities. An update of this component was the first piece of software to go through the workflow. Due to the importance of the Certification Authority updates, Quality Criteria were specifically defined in the egi.eu wiki to ensure the correctness and validity of the certificates included in the package. The new CA version was released to production without problems. After the CA update verification, a new version of the Trust Anchors (v1.38-1) was also verified following the same procedure. In this case the release was performed after the final release of the Quality Criteria documents.
SAM monitoring updates: this component is part of the Operational Tools and the Technology Providers are internal to the EGI project. Update 6 of the SAM monitoring probes was the first Operational Tools component that went through the verification process. Verification started with the development and definition of Quality Criteria prior to the actual release, with the collaboration of monitoring experts from the Operations Community. The criteria, as with the Trust Anchors, were initially made available in the egi.eu wiki [13] and later as part of the first Quality Criteria release. The Verification team has, up to the present, verified updates 6 to 9.
Presently it is considered that all tools used to support the Software Provisioning workflow, as well as the procedures, are ready to process all other software components.
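For illustration, a check of a Trust Anchors update could look like the minimal sketch below, which installs the EGI trust anchor meta-package and inspects the subject and validity dates of one distributed CA certificate; the repository configuration is assumed to be already in place, and the certificate hash used is a placeholder.

#!/bin/bash
# Minimal sketch of a trust anchor check (assumes the EGI trustanchors
# repository is already configured; the certificate hash below is a placeholder).
yum install -y ca-policy-egi-core
ls /etc/grid-security/certificates/ | head
# Inspect the subject and validity period of one of the installed CA certificates
openssl x509 -in /etc/grid-security/certificates/aabbccdd.0 -noout -subject -dates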
241 IBERGRID 241 4 Staged Rollout The EGI Staged Rollout is the procedure through which newly verified software releases are first deployed and tested by Early Adopter sites before General Availability to all sites in the production infrastructure. In the general case, the Staged Rollout test is performed on a production service. As such, the new version of the software is exposed to a more heterogeneous and chaotic environment, with different users and applications, compared to the certification and verification phases. During this process, if issues or problems arise, then workarounds must be added to the release notes or, in more extreme cases, the update may be rejected. An initial description of the process can be found in [14,15], thus in this paper we will concentrate on the actions taken and the implementation of the process during the past year. As previously described, the Staged Rollout phase is triggered when the state of the RT ticket in the queue sw-rel changes from InVerification to StagedRollOut, which also triggers the movement of the packages into the respective repository [10]. This means that the software component has passed the verification process, and a child ticket is created in the staged-rollout queue. At this point, the following actions are performed:
1. The newly created child ticket imports the relevant information from its parent ticket: release notes, bugs or issues fixed, documentation.
2. The staged rollout manager assigns the ticket to all Early Adopter teams responsible for the test.
3. Each Early Adopter team deploys the new software in their corresponding production service.
4. Any problem found is either reported in the ticket or, if it is a serious bug, through GGUS.
5. The new version of the software component is exposed to the production load (environment) and users. This period may last between 5 and 7 days, but may be extended depending on the components.
6. Each Early Adopter team must fill in a report of the test.
7. The staged rollout manager collects all reports, produces a summary with an outcome and makes it publicly available through the EGI Document DB. The identifier of the reports is inserted into the ticket and the ticket is closed with an outcome of <Accept|Reject>.
When the component passes the staged rollout phase, the child ticket is set to resolved, and the outcome and relevant information about the Staged Rollout reports are communicated back to the parent ticket, ending this part of the workflow. 4.1 Past year experience The Staged Rollout procedure is being coordinated within EGI through the TSA1.3 task. The transition process has had a smooth evolution from the one implemented
242 242 IBERGRID Fig. 3. EGI portal for Early Adopters management. Hosted at
during the Enabling Grids for E-sciencE project (EGEE) [14]. It involved the change of the EGEE procedures into the new software release workflow, using the tools provided within EGI and deprecating the ones previously used. One of the important points during this transition was to have a gradual and smooth change in the procedures that the Early Adopter sites have to follow. On the other hand, the transition has had a larger impact on the coordination and management of the whole process, e.g. partially using the tools and processes inherited from EGEE together with the tools and processes/workflow devised in EGI. In the first iterations with all involved parties (EGI SA1 and the Technology Providers), it was agreed that Technology Providers should not interact with the EGI RT. As such, the initial workflow was re-designed so that the process starts with a GGUS ticket that is opened by the external Technology Provider, followed by all the phases previously described. Therefore all communications between EGI and the Technology Providers are carried out through GGUS ticket(s), providing a single, well-determined point for the exchange of information, public availability and traceability, and easier extraction of metrics. This well-determined communication channel does not exclude other means of more informal communication, such as e-mail. Although some software components have already gone through the EGI Software Provision workflow, the glite middleware components still go through the old process, i.e. using the CERN Savannah patches to track the verification and staged rollout phases. Regarding glite components, it is not foreseen to adopt the new procedure. Only the up-
243 IBERGRID 243 coming EMI releases (as well as IGE releases) are foreseen to undergo the new workflow. Fig. 4 shows the number of staged rollout tests undertaken per NGI during a period of 10 months. It can be seen that a few NGIs are performing a large fraction of this task, and it is perceived that the workload has to be spread among the several NGIs. Fig. 4. Number of staged rollout tests per NGI. Period between May 2010 and March 2011. 4.2 Early Adopters The number of Early Adopter sites has increased since the beginning of the project, from an initial set of around 25 teams to around 42 at the present moment. Consequently most glite software components are covered for testing by at least one Early Adopter team, contrary to the situation in the beginning. Furthermore, ARC and UNICORE components are also covered, spanning all components that are part of the first EMI release. The aim is to have as many Early Adopter teams as possible to cover different deployment and heterogeneous scenarios as well as redundancy to perform the
244 244 IBERGRID test (when a given team or teams are unavailable). A portal, as shown in Fig. 5, was made available to monitor the number of Early Adopter teams available per software component. The work performed by the Early Adopter teams was briefly described above. One common issue with new teams is to find all the information needed to perform the staged rollout test and produce the final <Approve|Reject> outcome. Part of this issue stems from the use of repositories with unintuitive names, or structures that change quite often. With the new EGI repositories, starting with the operational tools releases and later including future EMI releases, these problems should be gradually solved. Furthermore, the initial information for installation and configuration provided by the Technology Provider has improved considerably during the course of the past year. Another aspect to bring more Early Adopter teams into the process was the possibility to declare a service as being under Staged Rollout test in the Grid Operations Centre database, so that the infrastructure operations teams monitoring the status of the production services are aware of those services and take that fact into account in the reliability and availability service metrics. Presently the major challenge is to move the Early Adopters of glite 3.2 components to the corresponding EMI 1.0 ones, where major changes and backward-incompatible new versions are expected, while keeping a stable production infrastructure. Fig. 5. Number of Early Adopters covering each software component. Hosted at
245 IBERGRID 245 5 Near Future: EMI 1.0 release The EMI project gathers components from three major middleware stacks: glite, ARC and UNICORE. The first major release, called EMI 1.0, is expected by the end of April 2011. All EMI 1.0 components will undergo the previously described Software Provision workflow; the ones accepted will be gathered in the UMD release. The workflow has been designed so that each component can go through it independently from the others and in parallel. The workflow described earlier will be tested soon with some agreed components in order to exercise the procedure both in terms of tools and human players. Currently the EGI infrastructure is largely based on glite components, consequently all operational and monitoring tools have been targeted at this middleware stack. There is also a significant fraction of sites deploying ARC components, particularly in the Nordic countries. Some level of integration between ARC components and the operational tools had already been achieved. During the first year of the project a large effort has been made to achieve a seamless integration of both ARC and UNICORE with these operational tools. The first major release will imply a large effort from all parties: Quality Control teams, Verification teams, Early Adopters and Staged Rollout managers. The main points are the following:
It is the first major release and consequently the first real use of the full workflow: tests with the full chain of procedures are under way and may result in some more fine tuning.
It is known to EGI that this first release of EMI is backward incompatible [4] in terms of deployment and configuration, but not in terms of interfaces and interaction with current production services. This will imply additional effort from the Early Adopter teams.
The latter point is considered the most problematic, because the current Early Adopter teams volunteering for a given glite component will have to move to the corresponding EMI one. Since the staged rollout test should primarily be done on production nodes, it will imply a complete re-installation and configuration of production services. One way to work around this, at least for some services, is to deploy new instances of the service in parallel with the production ones. In any case it will imply additional effort with respect to the currently deployed components. Going a step further: when the new release gains General Availability, it is not expected that a large fraction of the production infrastructure will perform a major deployment of the new release, partly due to the points made above for the Early Adopter sites, but also in order not to disrupt the services being provided to the Virtual Organizations and users. A further point is that the support schedule for glite, as well as for each major EMI release, is already set. As such, all production resource infrastructure managers and users should be aware of it and plan their schedules ahead to upgrade the production instances. glite has an end-of-support date of April 2012, coinciding with the EMI 2.0 major release.
246 246 IBERGRID 6 Conclusions and prospects The Software Provision Process for EGI faces its greatest challenge in the months following April 2011, with the first major release of EMI. This means that the full chain of the workflow will have to be fully implemented and working; more importantly, first the Early Adopter teams and then the rest of the production sites will have to be convinced to move to the new versions of the software components. It is perceived that the deployment of new software versions into the production infrastructure will happen over a long time span, and unsupported versions of components or services may be found in the production infrastructure. This has happened in the past and still happens today, although one very important point is that a clear support schedule for any given piece of software is now known well in advance, as is the case with both the glite and EMI major releases. 7 Acknowledgments This work is partially funded by EGI-InSPIRE (European Grid Initiative: Integrated Sustainable Pan-European Infrastructure for Researchers in Europe), a project co-funded by the European Commission (contract number INFSO-RI) as an Integrated Infrastructure Initiative within the 7th Framework Programme. EGI-InSPIRE began in May 2010 and will run for 4 years. Full information is available at:
References
1. EGI Design Study, EGI Blueprint;
2. EGI Web portal;
3. European Grid Initiative: Integrated Sustainable Pan-European Infrastructure for Researchers in Europe, EGI-InSPIRE proposal
4. European Middleware Initiative (EMI);
5. Initiative for Globus in Europe (IGE);
6. Fernandez C, MS503: Software Provisioning Process; public/showdocument?docid=68
7. Global Grid User Support;
8. EGI Request Tracker (RT);
9. EGI SA2 and SA1.3, New Software Release Workflow; wiki/nsrw_implementation_rt
10. Angelou E et al, MS504: EGI Software Repository Architecture and Plans; https://documents.egi.eu/public/showdocument?docid=
11. The EGI Document Database
12. Fernandez E et al, UMD Quality Criteria;
13. The EGI wiki pages;
14. David M et al, Provisioning of Grid Middleware for EGI in the framework of EGI-InSPIRE, 4th Iberian Grid Infrastructure Conference Proceedings, Braga, Portugal, May 24-27, 2010.
15. David M, MS402: Deploying Software into the EGI production infrastructure;
247 IBERGRID 247 Support to MPI applications on the Grid Enol Fernández del Castillo Instituto de Física de Cantabria, CSIC-UC Santander, Spain Abstract. The current middleware stacks provide varying support for the Message Passing Interface (MPI) programming paradigm. Users face a complex and heterogeneous environment where too many low level details have to be specified to execute even the simplest parallel jobs. MPI-Start is a tool that provides an interoperable MPI execution framework across the different middleware implementations, abstracting the user interfaces from the underlying middleware and allowing users to execute parallel applications in a uniform way, thus bridging the gap between HPC and HTC. In this work we present the latest developments in MPI-Start and how it can be integrated in the different middleware stacks available as part of EMI, providing a unified user experience for MPI jobs. Keywords: Grid, MPI, parallel jobs. 1 Introduction Execution of parallel applications in grid environments requires the cooperation of several middleware tools and services. Two main phases can be identified in the submission of such applications: the allocation of nodes where the user job will run, and the execution of the application using those allocated nodes. The middleware support for MPI applications is usually limited to the possibility of allocating a set of nodes. The user still needs to deal with low level details related to the actual execution of the jobs, which makes the task non trivial. Furthermore, the heterogeneity of resources available in Grid infrastructures aggravates the complexity that users must face to run their applications. The European Middleware Initiative (EMI) [8] project is the main developer of grid middleware in Europe, supporting the development and integration of three different middleware stacks for the execution of jobs: ARC [5], glite [17], and UNICORE [6]. All of them have some support for the execution of parallel applications. However, all of them have different approaches that prevent users from easily moving from one stack to the other. MPI-Start [4] is also being developed in the context of the EMI project. MPI-Start is a unique layer that hides the details of the resources and application frameworks from the user and the upper layers of the middleware. By using a modular and pluggable architecture it manages the details of several elements for the user: from the Local Resource Management System (LRMS) to the specific syntax needed to start an application for a given MPI implementation.
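The portability problem can be sketched as follows: launching the same binary directly requires knowing the launcher syntax of each MPI implementation, whereas MPI-Start hides that choice behind a single set of variables. The commands below are only an illustrative sketch; the binary name is a placeholder and I2G_MPI_START is assumed to be defined by the site environment, as in the glite example shown later.

#!/bin/sh
# Direct launch: the command line depends on the MPI implementation at the site.
mpirun -np 4 ./hello_mpi.exe       # e.g. Open MPI
mpiexec -n 4 ./hello_mpi.exe       # e.g. MPICH2

# With MPI-Start the user only states the application and the MPI flavour;
# MPI-Start builds the appropriate command line for the local implementation.
export I2G_MPI_APPLICATION=hello_mpi.exe
export I2G_MPI_TYPE=openmpi
$I2G_MPI_START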
248 248 IBERGRID In this paper we describe the current support for MPI jobs in the different EMI middleware stacks and how MPI-Start may be integrated with all of them in order to provide a unified user experience across the different stacks. In Section 2, a description of the middleware support for the allocation of nodes is given. Section 3 describes MPI-Start and its integration with the middleware stacks. In Section 4, we describe the monitoring probes for MPI in the EGI [7] Infrastructure. Finally, in Section 5 we give some conclusions and an outlook of future work. 2 Job Submission Prior to the execution itself, the parallel application must be submitted to a grid middleware that will create a work item on a Local Resource Management System (LRMS). This middleware is usually referred to as the Computing Element (CE). EMI provides implementations of such a CE in three different middleware stacks: ARC, glite and UNICORE. All of them provide some support for parallel applications, although the level of control for the job definition varies from one stack to another. ARC provides the ARC-CE, which uses the Extended Resource Specification Language (xRSL) [9] for defining the jobs. In order to submit a parallel application, the count attribute is used to specify the number of slots that must be allocated for the application. ARC also provides Runtime Environments (RTEs), which allow the site administrator to define an environment for the execution of specific applications. The usual way of supporting an MPI implementation in ARC is by defining an RTE for a specific MPI implementation. The user must then write a script that uses a set of predefined variables to start the application. Listing 1.1 shows an example of a 4 process MPI application that is submitted to an ARC-CE using the OPENMPI-1.3 Runtime Environment; the script that the user should provide for its execution is shown in Listing 1.2. Note that the user builds the command line required to start the job, therefore the user must know the specific syntax for the MPI implementation used.

Listing 1.1. ARC parallel job description
&(executable="runopenmpi.sh")
 (executables=("hello_ompi.exe" "runopenmpi.sh"))
 (count=16)
 (inputfiles=("hello_ompi.exe" "runopenmpi.sh"))
 (stdout="std.out")
 (stderr="std.err")
 (runtimeenvironment="ENV/MPI/OPENMPI-1.3/GCC64")

Listing 1.2. ARC parallel job script
#!/bin/sh
$MPIRUN -np $NSLOTS ./hello_ompi.exe

glite provides CREAM [1] as Computing Element. The language used to describe jobs in CREAM is the Job Description Language (JDL) [19]. The user has
249 IBERGRID 249 several ways of defining a parallel job. The most basic case is using the CPUNumber attribute, which determines the total number of slots to be allocated by the CE. Advanced placement of the processes on the physical hosts can also be requested with the following attributes:
SMPGranularity: this value determines the number of cores that any host involved in the allocation has to dedicate to the application.
WholeNode: indicates whether whole nodes should be used exclusively or not.
NodeNumber: this integer value indicates the number of nodes the user wishes to obtain.
CREAM does not provide any additional support for the job execution. However, MPI-Start is usually available in glite sites to start parallel jobs. Details on MPI-Start are given in the next Section. Listing 1.3 shows an example of an MPI application with 4 processes submitted to a CREAM.

Listing 1.3. glite parallel job description
JobType = "Normal";
CPUNumber = 16;
Executable = "starter.sh";
InputSandbox = {"starter.sh", "hello_ompi.exe"};
StdOutput = "std.out";
StdError = "std.err";
OutputSandbox = {"std.out", "std.err"};

The starter.sh script invokes MPI-Start after setting some variables that determine which application the user wants to start. An example script is shown in Listing 1.4.

Listing 1.4. glite parallel job script
#!/bin/sh
export I2G_MPI_APPLICATION=hello_ompi.exe
export I2G_MPI_TYPE=openmpi
$I2G_MPI_START

In the case of UNICORE jobs, users can describe their jobs using a JSON representation of the Job Submission Description Language (JSDL) [2] format. Parallel jobs are described using the Resources attribute. This is a complex attribute that may contain the number of requested slots with the CPUs attribute, or advanced placement of the processes with the CPUsPerNode attribute, which indicates the number of cores to use on each host involved in the allocation, and the Nodes attribute, which indicates the total number of nodes the user wishes to use. The execution of the applications is handled with Execution Environments. The site administrator defines as many Execution Environments as needed, with the specific details of each kind of job the user may execute at the site. A template-based language is used to describe these Execution Environments. Listing 1.5 depicts the description
250 250 IBERGRID of an MPI job with 4 processes submitted to UNICORE. Note that there is no need to specify any user script; the Execution Environment takes care of all the details.

Listing 1.5. UNICORE parallel job description
{
  Executable: "./hello_ompi.exe",
  Imports: [
    {From: "/myfiles/hello.mpi", To: "hello_ompi.exe"},
  ],
  Resources: {
    CPUs: 16,
  },
  Execution environment: {
    Name: OpenMPI,
    Arguments: {
      Processes: 4,
    },
  },
}

3 MPI-Start As shown in the previous Section, users are able to submit and execute parallel jobs using the current middleware stacks. However, each of them provides a different level of support and abstraction, which makes the migration from one implementation to another hard for most users. MPI-Start provides an abstraction layer that simplifies the execution of the jobs in the heterogeneous systems available in grid environments. MPI-Start takes care of the following details for the user in an automatic way:
Local Resource Management System (LRMS). Each system has particular ways to manage and interact with the nodes of the cluster. MPI-Start automatically detects and prepares the list of machines for the SGE [11], PBS/Torque [3], LSF [21] and Condor [18] batch systems.
File distribution. The execution of a parallel application requires the distribution of binaries and input files to the different nodes involved in the execution. Collecting the output is a similar problem. MPI-Start has a file distribution hook that detects if a shared filesystem is available. If one is not available, MPI-Start distributes binaries and input files to the execution hosts using the most appropriate method.
Application compilation. In order to obtain good performance and to assure that the binaries will fit the available resources, MPI jobs may need to be compiled with the local MPI implementation at each site. MPI-Start checks the compilation flags in the system and assures that users are able to compile their applications.
251 IBERGRID 251 Application execution. Each parallel library or framework has different ways of starting the application. Moreover, for a given framework there may be differences depending on the LRMS or file distribution method used in the execution environment. In the case of MPI, the different vendors use mpirun and mpiexec in a non-portable and non-standardized way. MPI-Start builds the command line for commonly available MPI implementations such as Open MPI [10], MPICH [12] (including MPICH-G2 [15]), MPICH2 [13], LAM-MPI [20] and PACX-MPI [16].
The latest developments of MPI-Start have introduced a new architecture for extensions, the ability to define the way the user's logical processes are mapped onto the physical resources, and a complete review of the LRMS and Application Execution support. These developments will be available as part of the EMI-1 release, due in May 2011. 3.1 Hybrid MPI/OpenMP Applications Parallel applications using the shared memory paradigm are becoming more popular with the advent of multi-core architectures. The MPI-Start default behavior is to start a process for each of the slots allocated for an execution. However, this is not suitable for applications using a hybrid architecture where several threads access a common shared memory area in each of the nodes. In order to support more use cases, the latest release of MPI-Start includes support for better control of how the processes are started, allowing the following behaviors:
Define the total number of processes to be started, independently of the number of allocated slots.
Start a single process per host. This is the usual use case for hybrid jobs with OpenMP and MPI applications. MPI-Start prepares the environment to start as many threads as slots available in the host.
Define the number of processes to be started in each host, independently of the number of allocated slots at each host.
3.2 Integration with ARC and UNICORE MPI-Start was originally developed for the integration with the glite middleware, although the design and architecture of MPI-Start are completely independent of this middleware. For the EMI-1 release, we have integrated the tool with the ARC and UNICORE middlewares by providing new Runtime Environments and Execution Environments that can be easily configured by the site administrators. In the case of ARC, the definition of a Runtime Environment consists in the creation of a shell script that is invoked three times for any given execution: before the job is submitted, before the execution of the job itself, and after the job has finished. Listing 1.6 shows the code for this RTE. Site administrators only need to define any special configurations that their site may have for MPI-Start. In the given example, by defining the variable MPI_START_SHARED_HOME to yes, the site admin is indicating to MPI-Start that it should not try to detect which kind of
252 252 IBERGRID filesystem is available and assume that a shared file system will be used. Users of this site would only need to require the mpi-start Runtime Environment in their job description and use MPI-Start as in a glite site.

Listing 1.6. ARC Runtime Environment
#!/bin/bash
parallel_env_name="mpi-start"
case $1 in
  0)
    # local LRMS specific settings, no action
    ;;
  1)
    # user environment setup
    export I2G_MPI_START=/usr/bin/mpi-start
    export MPI_START_SHARED_HOME=yes
    ;;
  2)
    # no post action needed
    ;;
  *)
    # everything else is an error
    return 1
esac

The definition of UNICORE Execution Environments is done using an XML file where the options, and the rendering of these options when used, are described. In order to use MPI-Start in such a way, we introduced the possibility of setting the parameters via command line arguments instead of environment variables. Listing 1.7 partially shows one example definition of an Execution Environment for MPI-Start. In the example the user can define the MPI implementation to use, the total number of processes, additional MPI-Start variables, and enable the verbose output. A complete Execution Environment would provide additional options for controlling all the MPI-Start features.

Listing 1.7. UNICORE Execution Environment
<ExecutionEnvironment>
  <Name>mpi-start</Name>
  <Description>run parallel applications</Description>
  <ExecutableName>/usr/bin/mpi-start</ExecutableName>
  <Argument>
    <Name>mpi-type</Name>
    <IncarnatedValue>-t </IncarnatedValue>
    <ArgumentMetadata>
      <Description>MPI implementation to use</Description>
      <Type>string</Type>
    </ArgumentMetadata>
  </Argument>
  <Argument>
    <Name>Number of Processes</Name>
    <IncarnatedValue>-np </IncarnatedValue>
    <ArgumentMetadata>
      <Description>the number of processes</Description>
      <Type>int</Type>
253 IBERGRID 253
    </ArgumentMetadata>
  </Argument>
  <Argument>
    <Name>MPI-Start Variable</Name>
    <IncarnatedValue>-d </IncarnatedValue>
    <ArgumentMetadata>
      <Description>Define a MPI-Start variable (e.g., I2G_MPI_START_VERBOSE=1)</Description>
      <Type>string</Type>
    </ArgumentMetadata>
  </Argument>
  <Argument>
    <Name>Verbose</Name>
    <IncarnatedValue>-v</IncarnatedValue>
    <OptionMetadata>
      <Description>be verbose</Description>
    </OptionMetadata>
  </Argument>
</ExecutionEnvironment>

With the integration of MPI-Start into the ARC and UNICORE approaches for job execution, users are provided with a unified user experience. They only need to specify the correct parameters to MPI-Start and can easily move from one middleware to another, or from one MPI implementation to another, without worrying about the details of each of them. For example, a hybrid application that uses OpenMP and Open MPI for execution, and that needs to be compiled at the site using the MPI-Start hooks mechanism, could be defined for the three middlewares as shown in Listings 1.8, 1.9 and 1.10. In the example the user defines the variable MPI_USE_OMP to activate the OpenMP support, requires the execution of only one MPI process per host with the -pnode option, and includes the hook myhook.sh for compilation before the actual execution. Note that the only differences are due to the description language of each middleware.

Listing 1.8. ARC MPI-Start example
(Arguments="-t openmpi -d MPI_USE_OMP=1 -pnode -pre myhook.sh myapp")

Listing 1.9. glite MPI-Start example
Arguments = "-t openmpi -d MPI_USE_OMP=1 -pnode -pre myhook.sh myapp";

Listing 1.10. UNICORE MPI-Start example
Arguments:
254 254 IBERGRID
{
  mpi-type: "openmpi",
  pre: "myhook.sh",
  Per node: 1,
  MPI-Start Variable: "MPI_USE_OMP=1",
},

4 Infrastructure Monitoring The execution of parallel applications not only requires middleware support for such jobs, it also needs a correct configuration of the infrastructure where the jobs are actually run. Grid infrastructures are mainly used for the execution of collections of sequential jobs [14]. Hence the support for parallel applications was not a priority, although most infrastructures are composed of clusters where the execution of parallel applications is possible. In order to assure the correct execution of these applications and, therefore, attract more users to the infrastructure, monitoring probes that check the proper support for such jobs have been introduced. The monitoring probes are executed at all the sites that announce support for MPI-Start and they consist of the following steps:
1. Assure that MPI-Start is actually available.
2. Check the information published by the site. This step inspects the MPI flavors announced as supported and selects the probes that will be run in the next steps.
3. For each of the supported MPI flavors, submit a job to the site requesting 2 processes that is compiled from source using the MPI-Start hooks. The probe checks that the number of processes used by the application was really the requested number.
Although the probes request a low number of slots (2), the existence of such probes allows both infrastructure operators and users to easily detect problems. These probes are flagged as critical, thus any failure may cause the site to be suspended from the infrastructure. The introduction of these probes over the last year has significantly improved the quality of the MPI support, thanks to the commitment of the site administrators to ensure no failures in the tests. 5 Conclusions The execution of parallel applications in grid environments is a challenging problem that requires the cooperation of several middleware tools and services. The support from the middleware is constantly improving and the three computing middleware stacks of EMI provide ways to execute MPI jobs. However, the support varies from one implementation to another and users still need to care about too many details. With the use of MPI-Start, users do not need to worry about all the low level aspects of starting MPI applications in a heterogeneous infrastructure such as the grid. The latest developments in MPI-Start have introduced better
255 IBERGRID 255 control of job execution and the integration with ARC, glite and UNICORE. Users are fully abstracted from these details by the single interface of MPI-Start and can easily migrate their application from one middleware provider to another. The EGI Infrastructure is committed to the support of parallel applications and provides monitoring probes that allow the early detection of any problems that may arise at the sites. The capability of executing MPI jobs and having a single interface for all kinds of resources and MPI implementations creates an attractive infrastructure for users from different scientific communities with specific computational needs. The usage of parallel applications will grow with the availability of multi-core machines; the latest developments of MPI-Start provide better control over the location and number of processes, and those features will continue to be improved in future releases. The use of advanced topologies, FPGAs, GPGPUs, and massively multi-node jobs will be investigated for use on high-end resource types. Acknowledgements The authors acknowledge support of the European Commission FP7 program, under contract number, through the project EMI ( eu/).
References
1. C. Aiftimiei et al. Design and implementation of the glite CREAM job management service. Future Generation Computer Systems, v. 26 (2010).
2. A. Anjomshoaa et al. Job Submission Description Language (JSDL) Specification, Version 1.0. GFD-R.056 (2005).
3. A. Bayucan, R. L. Henderson, C. Lesiak, B. Mann, T. Proett, and D. Tweten. Portable batch system: External reference specification. Technical report, MRJ Technology Solutions (1999).
4. K. Dichev et al. MPI Support on the Grid. Computing and Informatics, v. 27 (2008).
5. M. Ellert et al. Advanced Resource Connector middleware for lightweight computational Grids. Future Generation Computer Systems, v. 23 (2007).
6. D. Erwin. UNICORE - A Grid Computing Environment. Lecture Notes in Computer Science (2001).
7. European Grid Initiative (EGI)
8. European Middleware Initiative (EMI)
9. Extended Resource Specification Language. NORDUGRID-MANUAL-4 (2011).
10. E. Gabriel et al. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. Lecture Notes in Computer Science (2004).
11. W. Gentzsch. Sun Grid Engine: towards creating a compute power grid. Proceedings of the first IEEE/ACM International Symposium on Cluster Computing and the Grid (2001).
12. W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, v. 22(6) (1996).
256 256 IBERGRID
13. W. Gropp. MPICH2: A new start for MPI implementations. Proceedings of the 9th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface (2002).
14. A. Iosup et al. The Grid Workloads Archive. Future Generation Computer Systems, v. 24(7) (2008).
15. N. T. Karonis et al. MPICH-G2: A grid-enabled implementation of the message passing interface. J. Parallel Distrib. Comput., v. 63(5) (2003).
16. R. Keller, E. Gabriel, B. Krammer, M. S. Müller, and M. M. Resch. Towards efficient execution of MPI applications on the grid: Porting and optimization issues. Journal of Grid Computing, v. 1(2) (2003).
17. E. Laure et al. Programming the Grid using glite. EGEE-PUB (2006).
18. M. Litzkow, M. Livny, and M. Mutka. Condor - a hunter of idle workstations. Proceedings of the 8th International Conference on Distributed Computing Systems (1988).
19. F. Pacini and A. Maraschini. Job Description Language (JDL) attributes specification. Technical Report, EGEE Consortium (2006).
20. J. M. Squyres. A component architecture for LAM/MPI. Proceedings of the Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (2003).
21. S. Zhou. LSF: Load sharing in large-scale heterogeneous distributed systems. Proceedings of the Workshop on Cluster Computing (2002).
257 IBERGRID 257 EGI-InSPIRE Software Quality Criteria A. Simón 1, C. Fernández 1, I. Diaz 1, J. López Cacheiro 1, and E. Fernández 2 1 Fundacion Centro de Supercomputacion de Galicia, Santiago de Compostela, Spain [email protected] 2 Instituto de Fisica de Cantabria, CSIC, Santander, Spain [email protected] Abstract. Ibergrid is responsible within EGI-InSPIRE for the definition of the Unified Middleware Distribution (UMD) quality criteria. This activity (TSA2.2) develops generic component acceptance criteria that evolve with the needs of the new components on the UMD Roadmap. Ibergrid is also responsible for the verification of conformance criteria; this activity (TSA2.3) validates the components contributed to the repository against the generic and component-specific conformance criteria. This article describes the complete EGI-InSPIRE workflow for the Quality Criteria definition and verification processes and how these activities are maintained and coordinated by an Ibergrid expert group. 1 Introduction Verification processes are always a priority for software development (commercial or not) to improve the middleware, and the EGI-InSPIRE [2] project (Integrated Sustainable Pan-European Infrastructure for Researchers in Europe) is not an exception. Middleware certification and testing is not a new concept for European grid projects. The past Enabling Grids for E-sciencE (EGEE) project [3] has exploited this idea. In the past the middleware lifecycle, from development to production, was supported by three main activities. The first activity was the middleware development, integrated into the JRA1 task. The EGEE development team was in charge of deploying new open source solutions and services for the EGEE infrastructure. New middleware components were never released directly to production; instead, new components were integrated and tested by the integration and certification teams (EGEE SA3 task). This process could take a week, after which the new middleware reached the Pre-Production service. Pre-Production was a step in the middleware lifecycle where the new certified middleware was tested. After a week or two, if no issues were found, the middleware was deployed to production to be used by grid users and Virtual Organizations. This workflow has now changed in the EGI-InSPIRE era. One of the main differences between the EGEE and EGI projects is that the EGEE middleware was based on the Lightweight Middleware for Grid Computing (glite) project [8], part of the EGEE project. EGI is technology independent, which means that any technology that satisfies the project conditions could theoretically be included into the EGI middleware repository. EGI can also collaborate with external Technology Providers (TP) to ensure that their solutions meet the needs of the operational and user community
258 258 IBERGRID in terms of reliability, scalability and functionality. EGI middleware selection is not restricted to a single solution, and of course, this new policy also affects the software certification and verification processes. This work with external technology providers will define new or improved services based on existing grid services, assessing the quality of the newly delivered services, followed by their deployment into the production infrastructure. Currently, part of this effort is coordinated by work package SA2 [4] within EGI and, due to the new process complexity, the verification and quality assurance have completely changed since the EGEE era. The new Quality Criteria definition (assigned to task TSA2.2) and verification process (TSA2.3) are the responsibility of the NGI Ibergrid, through the CESGA and IFCA sites. The next sections will describe the complete process and the work developed by these sites within EGI, and how this work will improve the middleware performance and reliability during the next project years. This paper is structured as follows: in section 2, the Unified Middleware Distribution (UMD) Quality Criteria are described, including which products are currently covered and a summary of specifications and documents. Section 3 explains the current Technology Providers involved within the project and their relationship with SA2. Section 4 describes the current Verification workflow. Section 5 provides a detailed description of task TSA2.3 and the verification of conformance criteria. Finally, sections 6 and 7 discuss future work and conclusions. 2 UMD Quality Criteria The middleware deployed in the EGI production infrastructure is described in the Unified Middleware Distribution (UMD) roadmap [5]. This roadmap describes the features that are necessary to operate a distributed and federated computing infrastructure. It defines the capabilities of the software and how the functionality within each capability will evolve over time in response to the requirements coming in from the community. Special attention is given to the applicable standards that may be used for each capability. It is through software components strictly adhering to standards that the free choice of implementation is enabled, thus fostering development and competition among providers. The Quality Criteria definition task (TSA2.2) formalises the software requirements of the capabilities listed in the roadmap into a set of criteria. Each defined criterion describes what is expected from the software product, and how to assess it during the software verification phase. This criteria definition process is a continuous activity driven by several sources: requirements from the User Community, requirements from the Operations Community (especially requirements originated from software incidents found in production systems), recommendations and issues covered by the Software Vulnerability Group (SVG), and analysis of reference implementations of the capabilities defined in the UMD roadmap. As a result of this definition process, a set of Quality Criteria documents was created and made available for the whole community. In order to guarantee a stable verification of software, fixed release dates for the documents were established. A major release of the documents is planned every 6 months. Between these major releases, two minor releases are made available after a lightweight
259 IBERGRID 259 review (one minor release every 2 months) within the main Technology Providers and the EGI community. All the Quality Criteria document versions are publicly available at the EGI Document Server [6]. In order to clearly distinguish between the different versions that are produced of each of the available documents, 3 different possible states for the documents are defined:
FINAL, meaning that the documents are actively used for verification of software products.
DRAFT, for documents that are in preparation and will be used in the future for verification.
DEPRECATED, for documents that are no longer used in verification.
At any given time there is only one Final version of the criteria documents, which is the one used for verification. This 6 month release schedule with minor review releases allows both the verification team and the Technology Providers to plan their testing efforts and to maintain stable results for the verifications. These final documents include some release notes with the main differences with respect to the previous versions and a detailed revision log for each criterion with the particular changes made to it. Pointers to the source requirement that originated the definition are also included in order to allow a complete tracking of the criteria activity. The Quality Criteria documents are classified according to the following types:
Generic Criteria that hold for any product in UMD (e.g. interoperability, extensibility, availability on a specified minimal set of platforms, availability of an SDK, security, requirements on documentation, etc.).
Specific Criteria valid for a concrete product only (e.g. requirements on throughput or stability).
The Specific Criteria are further classified according to the UMD roadmap capabilities: Information, Compute, Storage, Data, Instrumentation, Virtualisation, Operational, Security and Client. A first major release of the Quality Criteria was made available in February of 2011 [7]. This release covers most of the capabilities described in the UMD Roadmap using the reference implementations currently in production. Table 1 shows the covered capabilities and the reference implementations used. 3 Technology Providers The Technology Providers (TP) not only have to develop and certify quality software with new functionalities but also need to do it following the Quality Criteria defined in TSA2.2. In order to do that, the Technology Providers are not only required to carefully study the Quality Criteria documents but also to assure that the software they produce follows these criteria. This quality level would be impossible to reach if the TP did not release previously certified software. If they did not, the verification would fail and the release would not go into production, unless there were a high pressure or demand for it and low involvement in the process by the higher decision bodies, but in that case low quality software
260 260 IBERGRID
Capability | Reference Implementation
Authentication | X.509 proxy certificates and HTTPS
Attribute Authority | VOMS
Credential Management | MyProxy
Authorization | Argus
Information Model | GlueSchema v1 and v2
Information Discovery | BDII
File Encryption/Decryption | glite Hydra
File Access | POSIX file access
File Transfer | gridftp and WebDAV
File Transfer Scheduling | glite FTS
Data Access | OGSA-DAI
Metadata Catalogue | AMGA and LFC
Job Execution | CREAM, ARC-CE and UNICORE
Parallel Job | mpi-start
Job Scheduling | glite WMS
Monitoring | SAM and MyEGI
Accounting | glite APEL
Table 1. Covered capabilities in the first Quality Criteria release
If the TP performs this prior certification successfully and in advance, the SA2 Verification Team should be able to fill in the QC Verification templates quickly by reviewing the tests previously executed by the Technology Provider. For example, to verify that a service provides the start, status and stop commands and implements this functionality correctly, the TP could provide a log showing the outcome of each of these commands. The TP should provide a link to these results, and the QC Verification team will then determine whether this is enough or whether more information is needed. To summarize:
- The Technology Providers have to be aware of the current EGI Quality Criteria (available at the EGI Document Database [7]).
- The Technology Providers have to know the QC documentation and enroll properly in the Verification process.
- Technology Providers have to verify in advance the quality of their software against the QC applicable to each component. It would not be realistic to expect the software to pass the quality criteria if the TP does not verify it first.
- As a result of the previous requirement, the TP have to provide all the tests run and their outcomes to the verifier, completely filling in the QC Verification Templates [9].
- The TP have to provide any additional information about this process.
EGI should work in close collaboration with the Technology Providers to provide such resources, but this is, in principle, out of the scope of the SA2 activity. EGI is currently collaborating with two external Technology Providers, the European Middleware Initiative (EMI) [11] and the Initiative for Globus in Europe (IGE)
261 IBERGRID 261 [12]. EMI is the result of the collaboration of three major middleware providers, ARC, glite and UNICORE, and other specialized software providers such as dCache. The relationship between EMI and EGI is close due to previous interactions with glite during EGEE. SA2 and EMI members have held regular meetings in the last weeks. These meetings are an important starting point for sharing suggestions and gathering TP feedback for a continuous improvement process; their outcome is the review of the UMD Quality Criteria and of the verification process. Until now (March 2011), no meeting has been scheduled with the IGE members. 4 Verification Workflow The Quality Criteria verification is integrated into EGI's Software Provisioning workflow (see figure 1). The Software Provisioning activity governs and performs these main processes:
- Establish agreements with key software providers.
- Maintain the UMD Roadmap.
- Define general and component-specific quality criteria to be applied to software components.
- Verify the software components against these criteria.
- Provide a repository for the software components within UMD and the related support tools.
- Provide a distributed support unit within the EGI Helpdesk infrastructure with expertise on the deployed middleware in production use.
Fig. 1. EGI Software Provisioning Workflow
The workflow starts with the decisions of the Technology Coordination Board (TCB). TCB members belong to the executive management of EGI.eu and represent the EGI User Community, the EGI Operations Community and the EGI Software Provisioning activity. The objective of the TCB is to control the overall process of technology evolution through formal agreements with Technology Providers.
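The main stages a new software release goes through in this workflow (verification, staged rollout, production) can be pictured as a simple state progression. The following Java sketch is purely illustrative and is not part of any EGI tool; the state names loosely mirror the RolloutProgress values used later in the RT tracking.

// Illustrative model of the software provisioning stages (not an EGI tool).
public class ProvisioningWorkflow {

    // States roughly mirroring the RolloutProgress values used in RT tracking.
    enum Stage { UNVERIFIED, IN_VERIFICATION, STAGED_ROLLOUT, PRODUCTION, REJECTED }

    private Stage stage = Stage.UNVERIFIED;

    // A release only advances when the checks of the current stage pass.
    public void advance(boolean checksPassed) {
        switch (stage) {
            case UNVERIFIED:      stage = checksPassed ? Stage.IN_VERIFICATION : Stage.REJECTED; break;
            case IN_VERIFICATION: stage = checksPassed ? Stage.STAGED_ROLLOUT  : Stage.REJECTED; break;
            case STAGED_ROLLOUT:  stage = checksPassed ? Stage.PRODUCTION      : Stage.REJECTED; break;
            default: break; // PRODUCTION and REJECTED are terminal here
        }
    }

    public Stage stage() { return stage; }

    public static void main(String[] args) {
        ProvisioningWorkflow release = new ProvisioningWorkflow();
        release.advance(true);  // verification against the Quality Criteria passed
        release.advance(true);  // Staged Rollout by the Early Adopters passed
        System.out.println("Final stage: " + release.stage()); // PRODUCTION
    }
}

Issues found in production feed back to the TCB, which restarts the cycle, as described next.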
262 262 IBERGRID The TCB decisions are taken into account to define and review the Quality Criteria. Technology Provider feedback is also important for the Quality Criteria review. The next step, after the Quality Criteria definition, is to include the new TP products in the EGI Software Repository. This part of the workflow (Criteria Verification and Staged Rollout) is maintained by the Ibergrid SA2 team. The verification workflow starts when a new software product is released by the Technology Providers; if the new software meets the Quality Criteria, the process continues and it is submitted to Staged Rollout. The Staged Rollout is coordinated by LIP, which deploys a production infrastructure to test the new software. This production infrastructure is maintained by several sites (Software Early Adopters) which test the new middleware before it is sent to production. If no issues are found in the Staged Rollout by the Early Adopters, the new middleware is finally released to the production infrastructure through the EGI repository. Due to the spiral lifecycle model followed by the development, the process does not end here. The behaviour of the new software included in production is reviewed by the User Community Board (UCB) and the Operations Management Board (OMB). The main tasks assigned to these groups are to identify, prioritize and resolve issues related to users or Virtual Organizations, and to identify current and future problem areas and propose corrective actions. If any issue is found in production, the UCB and OMB submit their feedback so that appropriate remedial actions can be taken. These actions are assigned to the TCB and the lifecycle starts over again. 5 Verification of conformance criteria Technology Providers are basically in charge of developing and certifying their software. When a new product is available and the software is correctly uploaded to the repository, the release enters the verification phase. This whole process is registered and tracked using the EGI RT ticket system (see figure 2): when a new product is available a new ticket is created in RT, and all the members of the verification team receive a notification. Technology Providers must supply all the information the verification team needs (documentation, test results, etc.) for the verification process. If this information is accessible, the verification team can assess whether the Technology Provider has tested the quality of the software in advance. Depending on the release type, different actions will be taken by the verifiers. The verification team checks that the pre-conditions of the verification phase are met, which means that:
- The RT ticket is in state Open.
- The RT RolloutProgress field is set to Unverified.
- The CommunicationStatus field is set to OK.
- The Ticket owner is set to nobody.
If all these conditions are met, the verification process can be started and the RolloutProgress field is changed to In Verification. Verifiers should use a specific template report for each middleware product or service; these checklists are the Quality Criteria Verification templates and
263 IBERGRID 263 are available at the EGI Document Database [9].
Fig. 2. RT for Verification process
Verification Templates are generated based on the current UMD Quality Criteria and the Quality Criteria service mapping [10]. Both documents are used to write a template per product with a complete list of tests and requirements to be verified. Due to the strong dependency between Verification templates and Quality Criteria documents, if the Quality Criteria are modified the Verification templates must change accordingly. Verification templates are divided into two parts (see figure 3). The first section shows the name of the component to verify (in this case the AMGA service), the number of the new release and the number of the RT ticket used to track the verification process. The names of the software provider and of the validator team are also included in this section. The next section stores the list of tests to check; tests are divided into quality criteria subsections depending on the product to verify. The first column shows the name of the criterion to verify, the second and third columns are filled in by the verification team (they state whether the criterion is accepted or does not apply). Finally, the last column is reserved for the validation team's comments, issues found, etc. When the report is completed, the verification team writes an Executive Summary and submits the results to the EGI document database, and the RT ticket is updated to include this information. Finally, if the new product satisfies the Quality Criteria, the ticket status is changed from In Verification to StageRollout, and it is released to the Staged Rollout production infrastructure. 6 Future Work The EMI-1 major release is scheduled for the end of April 2011 and the EGI Software Provisioning activity must be prepared to receive it. The Quality Criteria and Verification processes should be finished and up to date in April. One of the biggest challenges for the new EMI era is the deployment of an SA2 verification testbed. CESGA is in charge of creating the new testbed and of installing and testing the new incoming software from the TPs. The main aim of the verification testbed is to help the TSA2.3 team to check the new components before submitting the software
264 264 IBERGRID to production.
Fig. 3. Verification Template
The new testbed will act as a filter: if the TSA2.3 team detects any issue while installing the new middleware, the Staged Rollout team and the Technology Providers will be notified to take corrective actions. Installing and configuring a complete testbed is not a trivial task, since more than thirty products will be included in the same EMI major release. Each of these products should be installed and tested during the verification process before the Staged Rollout, so this testing should be done as quickly as possible. The most feasible solution in this case is to use virtualisation. CESGA is deploying a complete cloud computing infrastructure based on OpenNebula technology [13], and is now installing and configuring a new cluster which includes support for virtual machines. The new, fully virtualised testbed will be started on demand, sharing the new CESGA cluster with local users. Table 2 shows the hardware currently being used to run OpenNebula instances.
Processor | 1200 cores at 2.2 GHz
Memory | 2400 GB
Storage | 56 TB
Networking | GigaE
Performance | TFlops
Table 2. OpenNebula cluster hardware
This mechanism will avoid a bottleneck in the verification workflow. Virtual machine instances (with pre-installed software and configured repositories) will be ready to be launched in a short period of time using a pool of network IPs. These machines are to be started on demand to install and test the new middleware, and
265 IBERGRID 265 since they will work as an isolated testbed, so if a new service is needed it can be launched and configured in a few minutes. 7 Conclusions EGI's Software Provisioning activity is the ideal place to test new grid applications coming from Technology Providers. This activity can also help to discover bugs in the middleware before it is phased into production. The verification process is included in this activity, and the TSA2.2 and TSA2.3 Ibergrid teams are in charge of acting as an intermediary between the Technology Providers and the EGI production infrastructure. This intermediation is a filter which could help to improve grid middleware quality and to find potential issues. The main issue at this moment is to reach an agreement between the Technology Providers and the SA2 team about Quality Criteria assessment. To maintain and improve the verification process in the future, Technology Provider feedback is essential. Changes in the criteria documents are also triggered by other sources, such as the User Community, the Operations Community or the Staged Rollout, but the Technology Providers' review and analysis is particularly critical. The success of this collaboration remains to be seen in the coming months. 8 Acknowledgments The development of this paper would not have been possible without the work of many people. The authors wish to thank the EGI SA2 team for making software provisioning and quality assurance possible. This work also makes use of results produced by the EGI-InSPIRE project, a project co-funded by the European Commission (under contract number INFSO-RI ), which brings together more than 50 institutions in over 40 countries to establish a sustainable European Grid Infrastructure. Full information is available at References 1. European Grid Initiative: Integrated Sustainable Pan-European Infrastructure for Researchers in Europe, EGI-InSPIRE proposal 2. The European Grid Infrastructure, Last visit: 3. The Enabling Grids for E-sciencE project, Last visit: 4. Provisioning the Software Infrastructure (SA2), _Provisioning_the_Software_Infrastructure_\%28SA2\%29 Last visit: 5. Dreschder, M., D5.2 UMD Roadmap, Last visit: 6. EGI Document Server, Last visit:
266 266 IBERGRID 7. Fernandez, E. et al, UMD Quality Criteria, Last visit: Lightweight Middleware for Grid Computing, Last visit: Quality Criteria Verification Templates, Last visit: EMI Quality Criteria Service Mapping, Last visit: European Middleware Initiative (EMI), Last visit: Initiative for Globus in Europe (IGE), Last visit: OpenNebula Toolkit for Cloud Computing, Last visit:
267 IBERGRID 267 An Aspect-Oriented Approach to Fault-Tolerance in Grid Platforms Bruno Medeiros, João Luís Sobral Departamento de Informática, Universidade do Minho, Braga, Portugal [email protected] Abstract. Migrating traditional scientific applications to computational Grids requires programming tools that can help programmers to adapt application behaviour to this kind of platform. Computational Grids are particularly suited for long-running scientific applications, but they are also more prone to faults than desktop machines. The AspectGrid framework aims to develop methodologies and tools that help to Grid-enable scientific applications, particularly focusing on techniques based on aspect-oriented programming. In this paper we present the aspect-oriented approach taken in the AspectGrid framework to address faults in computational Grids. In the proposed approach, scientific applications are enhanced with fault-tolerance capabilities by plugging in additional modules. The proposed technique is portable across operating systems and minimises the changes required to base applications. 1 Introduction Running scientific applications on computational Grids requires mechanisms that allow them to cope with resource faults. This is especially important for long-running applications, to avoid losing work when a fault occurs due to the need to restart the application from the beginning. One effective technique to tolerate faults is to periodically checkpoint the application to disk, in order to restart the execution from the last checkpoint when a fault occurs. System Level Checkpointing (SLC) takes a snapshot of the program and all of its memory. This kind of checkpoint has to store all the information of the program, including the stack and pointers, so that it can restart the program later. While some tools that do this are able to checkpoint a program without having to halt it (e.g. Berkeley Lab's Checkpoint/Restart [1]), the program has to be linked against a certain library at compile time. Because of its nature, the time to take an SLC snapshot of the program is longer than with other approaches and the checkpoint is usually larger. Some tools also support parallel programs built with MPI (e.g. BLCR). SLC approaches require support from the underlying middleware, and the checkpoint data is intrinsically non-portable across machines, since it is saved in a machine-dependent format. Application Level Checkpointing (ALC) adds new code to the base application that limits the areas to be checkpointed. This approach is smarter than SLC because it exploits knowledge of what needs to be checkpointed, causing fewer problems when working with MPI and/or OpenMP parallel applications. Having to add code to applications is one of its greatest disadvantages. Application-level checkpointing
268 268 IBERGRID mechanisms for MPI were proposed in [2, 3]. Both approaches are based on a compiler that assists the programmer to identify the state and places in the program where checkpoint can be performed. Application-level Checkpointing mechanisms for OpenMP were proposed in [4]. In Grid systems it is important to provide portable checkpoint mechanisms. Portability should be two-fold: 1) by implementing checkpoint without requiring changes to the current Grid middleware and 2) by saving checkpoint data in a portable format. Saving checkpoint data in a portable format brings the additional benefit of making it possible to restart applications on a different set of resources. This is suitable for computational Grids since available resources could change during the application run time. The approach taken in the AspectGrid framework addresses the previous issues by relying on application level-checkpoint mechanisms. In the proposed approach, described in this paper, scientific applications are enhanced with checkpointing capabilities by plugging additional modules implemented with Aspect Oriented Programing (AOP) techniques [5]. Portability is addressed by being a Java-based approach, where application and data are independent of specific platforms. Moreover, provided application level mechanisms avoid changes to the current Grid middleware and the checkpoint data is also portable, supporting the migration of checkpoint data across platforms. The AspectGrid approach differs from previous works by providing portability in Grid platforms. The framework is fully based on pluggable AOP modules that allow a uniform approach to checkpoint sequential, thread-based and MPI based applications. Pluggable AOP modules combined with a Java based approach add the possibility to take snapshots and to restart applications in different sets of Grid resources and in any of these execution modes (e.g., sequential, thread-based and MPI based applications). The remainder of this paper is organised as follows. The next section introduces aspect oriented programming techniques and section 3 introduces the AspectGrid approach to checkpoint. Section 4 provides a performance evaluation and section 5 concludes the paper. 2 Overview of Aspect Oriented Programing Aspect Oriented Programming was proposed to address the problem of crosscutting concern in software systems. These concerns are normally transversal to the application base functionality and are not effectively managed with traditional modularisation techniques. One typical example is the logging functionality, whose implementation with traditional mechanism entails changing the implementation of each function to log. AOP address this kind of functionality by introducing a new unit of modularity: the aspect. An aspect can intercept a well-defined set of events in the base program (a.k.a., join points) and attaches aspect specific behavior to intercepted events. Additional behavior can be, for instance, to print the name of the intercepted method call. A point-cut specifies a set of events to intercept and point-cut designators can be used to gather information specific to each intercepted event.
269 IBERGRID 269 AspectJ [6] is an extension to Java that includes mechanisms for AOP. In AspectJ it is possible to capture various kinds of events, including object creation, method calls or accesses to instance fields. Objects and primitive values specific to the context of the captured event are obtained through the point-cut designators this, target and args. Fig. 1 shows the example of a logging aspect applied to a class Point. In this example, a message is printed on the screen on every call to the methods moveX or moveY. The wildcard in the pointcut expression is used to specify a pattern for the call's signature to intercept.
public aspect Logging {
  void around(Point obj, int disp) :
      call(void Point.move*(int)) && target(obj) && args(disp) {
    System.out.println("Move called: target object = " + obj + " Displacement " + disp);
    proceed(obj, disp); // proceed with the original call
  }
}
Fig. 1. Example of an aspect for logging
The important AspectJ characteristic is that it allows plugging additional functionality into base applications in a non-invasive manner. In the previous example the base program does not need to be changed to include the logging functionality. Moreover, the logging aspect is "pluggable" in the sense that it can be included in the program only when logging functionality is required. 3 Aspect Oriented Checkpointing in the AspectGrid Framework This section describes extensions made to the AspectGrid framework [7] by providing AspectJ modules that help to include checkpointing capabilities in scientific applications, minimising the amount of changes required to base programs. The provided approach is completely implemented at application level, avoiding the need to change the current Grid middleware. Moreover, it also saves checkpointing data in a portable manner, allowing the application to restart on a different set of resources. Portability is also extended to parallel applications developed with AspectGrid tools [8], which include applications that use Java thread-based parallelism and MPI-based parallelism. Application-level checkpointing requires saving application data into permanent storage. Application data includes the data structures used by the application as well as the call stack, which specifies the particular point in execution where the checkpoint was taken. Application-level mechanisms also rely on a set of pre-defined points in execution where a checkpoint can be taken. This set is required since application-level techniques require cooperation from the programmer/compiler to define the checkpoint frequency and the corresponding places in the execution flow.
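As a rough illustration of what "saving application data in a portable manner" can mean in a Java setting, the following sketch serialises a user-selected data structure to disk and reloads it on restart. It is a generic example under the assumption that the checkpointed state is Serializable; it is not the AspectGrid framework's actual implementation, and the file name is hypothetical.

import java.io.*;

// Minimal, generic application-level checkpoint helper (illustrative only).
public class CheckpointStore {

    // Save the selected application state; Java serialisation keeps the data
    // readable on any JVM, which is what makes the checkpoint portable.
    public static void save(Serializable state, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(state);
        }
    }

    // Reload the state saved by a previous run, or return null if no
    // checkpoint exists (i.e. the previous execution finished normally).
    public static Object load(File file) throws IOException, ClassNotFoundException {
        if (!file.exists()) return null;
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return in.readObject();
        }
    }
}

In the AspectGrid approach, the decisions about what to save and when are expressed through pointcuts rather than scattered through the application code, as described next.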
270 20 IBERGRID Fig. 2. AspectGrid checkpointing phases The AspectGrid approach to checkpoint is based on the indication of a set of application data fields (object allocations) to be saved into the checkpoint and a set of safe points that provide points in execution where checkpoint can be taken. Both are specified through AspectJ pointcuts. Checkpointed applications execute as follows (Fig. 2): 1) at application start-up, the DetectActive aspect verifies if the last execution was concluded without failures; by intercepting the execution of the ÒmainÓ method and checking the existence of checkpoint data; 2) if no failure occurred in the last execution the application runs normally and the Allocations aspect keeps track of the address of data that must be saved; 3) when a safe point in execution arises the SafePoints aspect increments the number of executed safe points and 4) when a predefined number of safe points is executed the data in addresses gathered by the Allocations aspect is saved into a file, along with the number of executed safe points. Application restart in the case of a failure relies on a set of ignorable methods that can be skipped during restart (also specified by means of a pointcut). Application restart proceeds as follows (Fig. 3): 1) at application start-up, the DetectActivate aspect identifies a failure in the last execution activating the replay mode; 2) the IgnorableMethods aspect skips the execution of methods that can be safely ignored. 3) the SafePoints aspect increments the number of executed safe points and 4) when the number of safe points saved in the checkpoint file is accomplished the checkpoint data is loaded and execution proceeds normally from that point. Notice that this process rebuilds the calling stack by replaying the original application, ignoring a set of method calls specified by the programmer. Thus, a highly portable solution is attained, since all mechanisms are implemented at application level. To summarise, in the AspectGrid framework, the programmer has to write three pointcuts: 1) data allocations; 2) safe points and 3) ignorable methods. The AspectGrid framework provides the required additional code to take application snapshots and to restart the application. Moreover, the framework provides a profiling tool that helps the programmer to find and write those pointcuts.
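To make phase 1 above concrete, the start-up check can be pictured as an aspect that intercepts main and looks for previously saved checkpoint data. The sketch below is a simplified illustration in AspectJ; the aspect and file names are hypothetical and do not reproduce the framework's own code.

import java.io.File;

// Illustrative start-up detection aspect (hypothetical names, not AspectGrid code).
public aspect DetectPreviousFailure {

    // Set to true when checkpoint data from an unfinished run is found.
    public static boolean replayMode = false;

    // Intercept the application entry point.
    before() : execution(public static void *.main(String[])) {
        // A leftover checkpoint file means the previous run did not finish.
        replayMode = new File("checkpoint.dat").exists();
    }

    // When the run completes normally, remove the checkpoint so the next
    // execution starts from scratch.
    after() returning : execution(public static void *.main(String[])) {
        new File("checkpoint.dat").delete();
    }
}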
271 IBERGRID 271 Fig. 3. AspectGrid restart phases
Safe points and ignorable methods allow an effective checkpointing strategy. During normal execution, the aspect counts the number of safe points executed. During restart, the application is replayed, ignoring the specified methods, until the same safe point is reached. The selection of the set of safe points is a trade-off between checkpointing overhead and the computation lost when a failure occurs. Note that a checkpoint might be taken only after a set of safe points. The AspectGrid approach provides two important benefits: 1) the base code (domain-specific code) remains unchanged, following the philosophy of the framework, by providing an additional set of aspects that localise fault-tolerance related issues; and 2) the framework automatically provides mechanisms to perform checkpointing in shared and distributed memory systems. Checkpointing in shared memory systems is performed as follows. When a checkpoint is to be taken (i.e., on a safe point) we introduce a barrier before and another after the safe point. When all threads have reached the first barrier, the master thread saves the specified data and the number of safe points executed. Restart is performed by replaying the application as in a sequential execution, but thread-creation constructs are still executed to rebuild the number of threads and their corresponding call stacks. A barrier is introduced after the safe point where the checkpoint was taken. The master thread reads the saved data when reaching that safe point and then releases the other threads waiting at the barrier. Checkpointing in distributed memory systems is performed as follows. We perform the checkpoint on each process as in the sequential case, but special care must be taken to ensure that every process takes the snapshot at the same safe point. We provide two implementation alternatives to save data fields. In the first case, each process
272 272 IBERGRID takes a local snapshot. In that case we need to introduce two global barriers, as in the case of shared memory. In the second alternative we collect the partitioned data on the master node, which avoids the need for barriers (this is possible in our programming model, since we know how the data is partitioned among processes). Collecting the data and taking the snapshot at the master process has the advantage of making it possible to restart the application in any of the supported execution modes: 1) sequential execution; 2) parallel execution in shared memory systems; and 3) parallel execution in distributed memory systems. This is possible since the checkpointed data is the same in all environments. Thus, adaptation can be performed by saving the checkpointing data and restarting with a different configuration. An additional benefit of this approach is that the framework can also checkpoint a hybrid shared/distributed memory parallelisation. 3.1 Illustrative Example This subsection illustrates the proposed approach by showing how to introduce checkpointing capabilities into a typical scientific application: a Successive Over-Relaxation (SOR) kernel that computes the solution to a linear system of equations. This version uses the red-black variation of the algorithm to enable parallelism. This benchmark is a typical scientific application, where a five-point stencil is successively applied to a matrix. Fig. 4 presents a code snippet of the benchmark (this code is based on the version provided by the Java Grande Forum [9]). The doiterations method iteratively calls the iteration method on red and black matrix elements, alternately. The iteration method calls updaterow on each row, which applies the stencil to all elements in the row.
public class Sor {
  static double[][] G;
  static int Mm1, Nm1;
  static double of, omf;

  static final void doiterations(int num_iterations) {
    Mm1 = ...
    for(int p=0; p<num_iterations; p++) {
      iteration(0); // iteration on "red" elements
      iteration(1); // iteration on "black" elements
    }
  }

  static final void iteration(int is_red) {
    for(int row=1; row<Mm1; row++)
      updaterow(row, (row+is_red)%2+1);
  }
273 IBERGRID 273
  static final void updaterow(int row, int start_elem) {
    double[] Gi = G[row];
    double[] Gim1 = G[row-1];
    double[] Gip1 = G[row+1];
    for(int j=start_elem; j<Nm1; j+=2) {
      Gi[j] = of*(Gim1[j]+Gip1[j]+Gi[j-1]+Gi[j+1]) + omf*Gi[j];
    }
  }
}
Fig. 4. Base code for the SOR benchmark
The first step to introduce checkpointing capabilities is to identify potential safe points. This can be done using the profiling tool provided by AspectGrid. In this case there are three potential points in execution where a safe point could be introduced: 1) doiterations; 2) iteration; and 3) updaterow. Selecting the best place for safe points involves a trade-off between checkpoint frequency and overhead. In this case, doiterations is called only once during program execution, the iteration method is called 200 times with an interval of approximately 2 seconds, and updaterow is called with an execution time of a few milliseconds. Thus, in this case, the AspectGrid profiling tool suggests placing safe points on calls to the iteration method. After selecting the safe points, the programmer needs to define the application data structures that must be saved at those safe points. Those correspond to the data that is changed between two consecutive executions of safe points. In this case the AspectGrid tool indicates the matrix G. The last step is the identification of ignorable methods. In this case, the tool suggests that the execution of the code inside safe points can be ignored. The programmer can also indicate other methods that can be ignored. The three pointcuts generated for this case study are provided in figure 5.
pointcut safepoints() : call(void iteration(..));
pointcut allocations() : call(double[][] new(..));
pointcut ignorablemethods() : call(void iteration(..));
Fig. 5. Pointcut definitions to introduce checkpointing in the SOR benchmark
3.2 Implementation Overview The checkpointing mechanism is based on a set of safe points, ignorable methods and safe data fields. The implemented behaviour is different when the application is running normally and when the application is restarting after a failure. Fig. 6 presents a sketch of the implementation. In normal operation the implementation counts the number of safe points and takes the snapshot when requested (lines 07-12). In replay mode the implementation ignores the specified method calls (lines 22-26) while replaying the application and reloads the data when the number of safe points defined in the checkpoint is attained (lines 13-17).
274 274 IBERGRID
01 aspect checkpointing {
02
03   pointcut safepoints();
04   pointcut ignorablemethods();
05   Boolean replay;
06
07   void around(): safepoints() {
08     numberofsafepoints++;
09     if (!replay) {
10       if (takesnapshot)
11         ; // save data fields
12     }
13     else
14       if (numberofsafepoints==chksafepoints) {
15         // get saved data fields
16         replay = false;
17       }
18
19     proceed(); // execute original call
20   }
21
22   void around(): ignorablemethods() {
23     if (replay) ; // ignore the method call
24     else
25       proceed();
26   }
27 }
Fig. 6. Code for checkpointing
4 Performance Evaluation This section presents an evaluation of the proposed checkpoint mechanism by measuring the overheads relative to hand-written versions. These results were collected on a cluster with two machines with dual Opteron 6174 processors per node (i.e., 24 cores per machine). The presented results are the median of 20 executions. Performance results were obtained on a typical scientific application: the Successive Over-Relaxation (SOR) presented in the previous section. The first test measures the overhead of introducing the checkpointing code when 0 or 1 checkpoints are taken. Fig. 7 shows the execution time of: 1) the "original" benchmark; 2) the version where checkpointing is introduced using classic "invasive" techniques; and 3) the version where checkpointing is introduced through AOP. The presented results include sequential execution (seq), execution with 2 to 16 threads (T) and with 2 to 32 MPI processes (P). These results show that: 1) the overhead of checkpointing is very low, as expected, since the overhead is the time required to count safe points, which is less than 1% in most cases; 2) AOP does not impose any additional overhead when compared to traditional invasive programming techniques; and 3) there is a relevant overhead required to save the checkpointing data that is directly connected to the amount of saved data.
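Before moving to the measurements, it may help to picture the shared-memory coordination described in Section 3 (a barrier before and after each safe point, with one thread writing the checkpoint) using standard Java concurrency primitives. The sketch below is only an illustration of that coordination pattern under stated assumptions; the class name and the saveAction callback are hypothetical and it is not the framework's code.

import java.util.concurrent.CyclicBarrier;

// Illustrative barrier-based checkpoint coordination for a shared-memory run.
public class SharedMemoryCheckpoint {
    private final CyclicBarrier before;
    private final CyclicBarrier after;

    public SharedMemoryCheckpoint(int threads, Runnable saveAction) {
        // The barrier action runs exactly once, in one thread, after all
        // threads have arrived: this plays the role of the master thread
        // saving the selected data fields and the safe-point counter.
        this.before = new CyclicBarrier(threads, saveAction);
        this.after = new CyclicBarrier(threads);
    }

    // Each worker calls this when it reaches a safe point at which a
    // checkpoint has been requested.
    public void atSafePoint() throws Exception {
        before.await(); // wait for everyone, then the barrier action saves the state
        after.await();  // resume computation only after the checkpoint is written
    }
}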
275 IBERGRID 275
Fig. 7. Overhead of checkpointing (execution time in seconds for the Original, Invasive and AOP versions with 0 and 1 checkpoints; runs: sequential, 2 to 16 threads (T), 2 to 32 MPI processes (P))
One important point of the proposed approach is the ability to replay the application in a different environment. Figure 8 illustrates such a case by showing the time per SOR iteration. In this case the application started with 2 processes and on iteration 26 it was restarted on 8 processors, shortening the overall application execution time by more than half.
Fig. 8. Application restart increasing the assigned resources (time per iteration in ms over iteration intervals [1,5] to [96,100])
276 276 IBERGRID 5 Conclusion This paper presented an aspect-oriented approach to checkpointing in computational Grid systems. The approach is based on the ability to plug checkpointing modules into scientific applications. The paper showed the feasibility of the approach and showed that the performance penalty can be very low when compared with similar hand-written versions. The current implementation of this approach relies on external tools to determine the optimal set of resources to be used by applications. A natural evolution is to incorporate mechanisms that find opportunities for self-adaptation to improve execution time, by monitoring the application and the system state. References 1. P. Hargrove, J. Duell, Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters, In Proceedings of SciDAC 2006, June 2. R. Fernandes, K. Pingali and P. Stodghill, Mobile MPI Programs in Computational Grids, ACM Symposium on Principles and Practices of Parallel Programming (PPoPP), 3. G. Rodríguez, M. Martín, P. González, J. Touriño, R. Doallo, CPPC: a compiler-assisted tool for portable checkpointing of message-passing applications, Concurrency and Computation: Practice & Experience, Volume 22, Issue 6, April 4. G. Bronevetsky, K. Pingali, P. Stodghill, Experimental Evaluation of Application-Level Checkpointing for OpenMP Programs, ICS'06, Australia. 5. G. Kiczales, E. Hilsdale, J. Hugunin, M. Kersten, J. Palm, W. Griswold, An Overview of AspectJ, ECOOP 2001, Budapest, Hungary, June 6. G. Kiczales, E. Hilsdale, J. Hugunin, M. Kersten, J. Palm, W. Griswold, Getting Started with AspectJ, Communications of the ACM, 44(10), October 7. E. Sousa, R. Gonçalves, D. Neves, J. Sobral, Non-Invasive Gridification through an Aspect-Oriented Approach, 2nd Iberian Grid Infrastructure Conference (Ibergrid 2008), Porto, Portugal, May 8. J. Pinho, M. Almeida, M. Rocha, J. Sobral, Parallelization Service in the AspectGrid Framework, 4th Iberian Grid Infrastructure Conference, Braga, May 9. J. Smith, J. Bull, J. Obdržálek, A Parallel Java Grande Benchmark Suite, Supercomputing Conference (SC 2001), Denver, Nov
277 IBERGRID 277 A SLA-based Meta-Scheduling in Advance System to Provide QoS in Grid Environments F. Javier Conejero, Luis Tomás, Blanca Caminero, and Carmen Carrión Dept. of Computing Systems. The University of Castilla La Mancha. Spain. {FJavier.Conejero,Luis.Tomas,MariaBlanca.Caminero,Carmen.Carrion}@uclm.es Abstract. The establishment of agreements between users and the entities which manage the Grid resources is still a challenging task. On the one hand, an entity in charge of dealing with the communication with the users is needed, with the aim of signing resource usage contracts and also implementing some renegotiation techniques, among others. On the other hand, some mechanisms should be implemented which decide if the QoS requested could be achieved and, in such case, ensuring that the QoS agreement is provided. One way of increasing the probability of achieving the agreed QoS is by performing meta-scheduling of jobs in advance, that is, jobs are scheduled some time before they are actually executed. In this way, it becomes more likely that the appropriate resources are available to run the jobs when needed. So, this paper presents a framework built on top of Globus and the GridWay meta-scheduler to provide QoS by means of performing metascheduling in advance. Thanks to this, QoS requirements of jobs are met (i.e. jobs are finished within a deadline). Apart from that, the mechanisms needed to manage the communication between the users and the system are presented and implemented through SLA contracts based on the WS- Agreement specification. Key words: Grid meta-scheduling, QoS, SLAs, WS-Agreement 1 Introduction In highly variable and heterogeneous systems, like Grid environments, where resources may be scattered across multiple domains and under different access policies, it is extremely difficult to provide QoS to users. Hence, the Grid infrastructure must provide the needed services for automatic resource brokerage which take care of the resource selection and negotiation process [1]. This infrastructure is named meta-scheduler [2] and hides this process from the user. However, the scheduling process is complicated due to several facts, like the heterogeneous and distributed nature of resources, the different characteristics of the applications, and specially due to the fact that the meta-scheduler entity typically lacks total control and even complete knowledge of the system resources. This means that is not always possible to reserve the resources selected for executing the jobs which finally imply that is not possible to ensure the execution of a job into a resource within the requested time. of corresponding author: [email protected]
278 278 IBERGRID Thus, some mechanisms to provide QoS in such kind of environments have to be developed. As reservations in advance are not always feasible, the idea is to try to ensure that a specific resource is available when a job requires it through metascheduling in advance of resources. This meta-scheduling in advance algorithm can be defined as the first step of the reservations in advance algorithm, in which the resources and the time periods to execute the jobs are selected (and the system keeps track of the decisions already made and the usage of resources) but making no physical reservation. This way, the system needs to estimate the future resource status (when the jobs to schedule will be executed) and how long their execution will be. To this end, some predictions techniques are implemented. On the other hand, and taking into consideration that the user s experience of the Grid is determined by the functionality and performance of this meta-scheduler system, it is needed to develop some mechanism to deal with the interaction among users and the meta-scheduling system. To this end, Service Level Agreements (SLAs [3]) contracts are the main mechanism to achieve this goal. Nowadays economy is moving from product oriented economy to a service oriented economy in computational environments. This trend requires new mechanisms to manage and enforce the use of computational resources in an efficient way for an optimized exploitation of them. But this exploitation is strongly managed by economical and business motivations, where mechanisms to enforce and negotiate legal statements are needed [4]. Due to this fact, many efforts have been done to tackle this problem within Grid environments, resulting in the Grid Resource Allocation Agreement Protocol (WS-Agreement) specification [5] for SLAs. Thenceforth, there are many Grid projects interested on the implantation and use of SLAs (e.g. AssessGrid [6], Brein [7], and SLA@SOI [8] among others). For our purpose, SLAs represent a formalization of the job submission process for the Grid. Furthermore, they are a mechanism for a formal representation of the temporal restrictions that correspond to the associated job, which are used in the meta-scheduling in advance process, with the objective of a QoS improvement. To sum up, the main contribution of this paper is a framework built on top of the GridWay meta-scheduler [9] to provide QoS by means of performing metascheduling in advance, with a SLA-based user interface. The usage of this framework allows jobs to be executed within their deadlines thanks to some implemented heuristics which estimate the future status of resources and how long a job execution will be. The paper is organized as follows. Related work is presented in Section 2. In Section 3 the framework to perform SLA-based meta-scheduling in advance is outlined. Section 4 shows the methodology to carry out the communication process between the users and the system through SLA contracts. Finally, the conclusions obtained and the suggested guidelines for future work are outlined in Section 5. 2 Related Work The provision of QoS in Grids is still an open issue which has been explored by several research projects, based on advanced reservation, such as GARA [10], Grid Capacity Planning [11], or VIOLA [12], among others. All these techniques have the same main drawback, namely not all the resources can be reserved for several
279 IBERGRID 279 reasons (e.g. not all the resources provide this functionality). Due to this limitation, our work aims at performing scheduling in advance rather than reservations of resources in advance. Meta-scheduling in advance needs to perform predictions about the future resource status and about job duration into resources. A survey of some prediction techniques can be found in [13]. Examples include applying statistical models to previous executions [14] and heuristics based on job and resource characteristics [15]. In [14], it is shown that although load exhibits complex properties, it is still consistently predictable from past behaviour. In [15], an evaluation of various linear time series models for prediction of CPU loads in the future is presented. In our work, a technique based on historical data is used, since it has been demonstrated to provide better results compared to linear functions [16]. This kind of scheduling needs to have a suitable data structure to be able to manage all the information efficiently. There are several structures for managing this information needed by the scheduler, for instance Grid Advanced Reservation Queue [17] (GarQ), and a survey can be found in [17]. But in this work, red black trees are used since they provide us with efficient access to the information about resource usage, as it has been demonstrated in [18]. On the other hand, SLAs are a hot topic nowadays. Many efforts have been done on several fields, like their management [19], QoS implications [20], semantic and virtualization exploitation [21] and specially on their standardization. The most important improvement within SLAs has been the WS-Agreement specification [5], which is considered the de-facto standard. The structure and mechanisms to deploy SLAs over a system are described from a global point of view, and thanks to the recent revision of the WS-Agreement specification [22], a new negotiation protocol has been defined, introducing the renegotiation concept as a multiple message interaction between user and service provider to achieve better agreements. But WS-Agreement is not the only available specification. SLAng [23] and WSLA [24] are alternatives to it, but due to their lack of support they are not recommended. Due to the Service Level Agreements importance, many projects are interested on its implementation [25]. Most of them implement the WS-Agreement, like SLA@SOI [8], AssessGrid [6] and Brein [7]. The first one is focused on the implantation of SLAs into Service Oriented Infrastructures (SOIs [8]) from a generic point of view. AssessGrid and Brein have a common purpose, which is to promote Grid computational environments into business environments and society. However, AssessGrid is focused on risk assessment for trustable Grids while Brein is focused on an efficient handling and management of Grid computing based on artificial intelligence, web semantic and intelligent systems. Another important project within this matter is WSAG4J(WS-AGreement for Java [26]), which is a generic implementation of the WS-Agreement specification developed by the Fraunhofer SCAI Institute as a development framework. It is designed for a quick development and debug of services and applications based on WS-Agreement. It should be noted that not all projects implement WS-Agreement for their SLA management. An example is NextGrid [27], which is focused on business Grid exploitation.
280 280 IBERGRID 3 Scheduling in Advance framework (SA Layer) In a real Grid environment, many resources cannot be reserved, due to the fact that not all the local resource management systems permit them. Apart from that, there are other types of resources such as bandwidth, which are shared among several administrative domains making their reservation more difficult or even impossible. This is the reason to perform meta-scheduling in advance rather than reservations in advance to provide QoS in Grids. This means that, the system keeps track of the meta-scheduling decisions already made to take future decisions and with the aim of not overlapping executions. However, no physical reservations are done. So, our scheduling in advance process follows the next steps (see Figure 1): 1) A user sends a request to the meta-scheduler at his local administrative domain through the SLA manager (see Section 4). Every SLA contract (job execution request) must provide a tuple with information on the application and the input QoS parameters: (in file, app, t s, d). in file stands for the input files required to execute the application, app. In this approach the input QoS parameters are just specified by the start time, t s (earliest time jobs can start to be executed), and the deadline, d (time by which jobs must have been executed). 2) The meta-scheduler communicates with the Gap Management entity to obtain both the resource and the time interval to be assigned for the execution of the job. The heuristic algorithms presented here take into account the predicted state of the resource (both for computational resources and interconnection networks), the jobs that have already been scheduled and the QoS requirements of the job. 3) If it is not possible to fulfill the user s QoS requirements using the resources of its own domain, a communication with meta-scheduler of other domains starts. In order to perform the inter-domain communications efficiently, techniques based on P2P systems (as proposed by [28, 29], among others) can be used. This way, the meta-scheduler at each domain knows some of the meta-schedulers at other domains, and can forward jobs to them when necessary. 4) If it is still not possible to fulfill the QoS requirements (not even in other domains), a renegotiation process is started between the user and the SLA manager in order to define achievable QoS requirements. Recall that, this renegotiation, as well as the overall interaction with users, is conducted by means of Service Level Agreements (SLA). A scheme for advancing and managing QoS attributes contained in Grid SLAs contracts is implemented and detailed in Section 4. As Figure 1 depicts, there may be more than one meta-schedulers in each local administrative domain (subdomains of a Virtual Organization (VO)), albeit they have to communicate with the same Gap Management entity. The Gap Management entity has the information about future usage of the resources of its domain and could also be replicated to avoid the single point of failure problem. Even the resources may be splitted into several subdomains in case of a huge number of them, making it quite scalable. This represents an idealistic scenario where all the jobs are submitted through the Gap Management entities in charge of the resources usage. However, this is not the rule into a real Grid environment, where resources usually are shared among users and VOs. 
For this reason, the system needs to estimate the future resources status for taking into account the resources load which is not submitted through the meta-scheduling in advance process. So,
281 IBERGRID 281 Fig. 1. Meta-Scheduling in Advance Process all the load not submitted through our system, as resource owners load or the rest of jobs submitted by using other meta-schedulers of the VO, is considered as a load into the resource that must be predicted. This meta-scheduling in advance functionality has been implemented as a layer on top of the GridWay meta-scheduler [2], called Scheduler in Advance Layer (SA layer) [16], as Figure 2 depicts. The SA layer uses functionality provided by Grid- Way in terms of resource discovery and monitoring, job submission and execution monitoring, etc.. Also, the information concerning previous jobs executions and the status of resources and network over time are stored in DB Executions and DB Resources, respectively. The usage of the resources is divided into time intervals, named slots. So, the system has to schedule the future usage of resources by allocating the jobs into the resources at one specific time (taking one or more time slots). Therefore, data structures (represented by Data Structure in Figure 2) to keep a trace of the slots usage are needed. In this work the red black trees [18] are used as a data structure with the objective of developing techniques to efficiently identify feasible idle periods, without having to examine all idle periods. The reason for choosing this kind of structure is its property which enforce that the longest path from the root to any leaf is no more than twice as long as the shortest path from the root to any other leaf in that tree. So, the tree is roughly balanced, and as a result of that, inserting, deleting and finding values require worst case time proportional to the height of the tree (O(log n)). The idea of using red black trees was firstly proposed by Castillo et al. [18]. Nevertheless, their proposal does not take into account the performance fluctuation. Moreover, authors of [18] assume that users have prior knowledge on the duration of jobs, which is not necessarily true in a real grid. Our work does not depend on such assumption, so there is a necessity of developing algorithms for estimating job durations into resources (Predictor in Figure 2), and consequently, to infer how many slots a job will need to be executed in a certain resource. 3.1 Job completion time predictions The different performance of Grid resources makes rather difficult to obtain predictions about jobs durations into resources. What is more, the job performance
282 282 IBERGRID characteristics may vary for different applications and from time to time.
Fig. 2. The Scheduler in Advance Layer (SA layer).
Due to these facts, it is necessary to estimate the future status of the resources and, taking it into account, to estimate the time needed to complete the job in a resource within the target time interval. With the objective of making those predictions as accurate as possible, they are calculated by estimating the execution time of the job and the time needed to complete the transfers separately. To do that, the system takes into account the characteristics of the jobs, the power and usage of the CPUs of the resources, and the network status. To this end, our system implements a technique based on an exponential smoothing function which calculates the future status of the resource CPUs and the future status of the network links. For more information about this function see [30]. Taking into account this information about the Grid status, an estimation of the execution time is calculated using information from previous executions, as depicted in Algorithm 1. This algorithm uses all the execution time records in the database (stored in DB Executions) for the application app in a resource R_i to calculate the mean execution time for app in R_i, which includes execution and queueing times (line 8). After that, the prediction of the future CPU status of each resource is calculated by means of an exponential smoothing function (line 9). Finally, the mean execution time is tuned by using the prediction about the future CPU status of each resource (line 10). The way of calculating the transfer times is quite similar. The mean bandwidth predicted for the time period between the job start time and its deadline is calculated through an exponential smoothing function. Then, using this information along with the total number of bytes to transfer, the time needed to complete the transfers is estimated. Finally, the predictions obtained are weighted taking into account the trust in the chosen resources. This implementation is explained in Algorithm 2. With the estimations of execution and transfer times and the information about the trust in resource R_i, labelled RT(R_i), the execution time is tuned (line 12) and an estimation of the total completion time of the job, JT_{R_i}, is calculated (line 14). The trust in resources is computed as Equation 1 denotes:
RT(R_i) = \frac{\sum_{j=n-N}^{n} \left( Estimated_{(j,i)} - Real_{(j,i)} \right)}{N}   (1)
where Estimated_{(j,i)} is the job completion time estimation made for the j-th execution in the resource R_i, and Real_{(j,i)} is the real completion time of job j in the resource R_i. The output of this function is the mean of the errors made in those N predictions and it is used to tune the prediction made for the job execution times
283 IBERGRID 283 in that resource.
Algorithm 1 Estimation of execution time (ExecT_Estimation)
1: Let R = set of resources known to GridWay {R_1, R_2, ..., R_n}
2: Let app be the job to be executed
3: Let initt be the start time of the job
4: Let d be the deadline for the job
5: Let ExecutionTime(app, R_i)_j be the j-th execution time for the application app in the resource R_i
6: Let ES_cpu(DB_Resources_{R_i}, initt, d) be the exponential smoothing function that calculates the percentage of free CPU in resource R_i between time initt and d
7: Let CPU_free(R_i, initt, d) be the percentage of free CPU in the resource R_i from time initt to time d
8: ExecutionTime = (\sum_{j=1}^{n} ExecutionTime(app, R_i)_j) / n
9: CPU_free(R_i, initt, d) = ES_cpu(DB_Resources_{R_i}, initt, d)
10: ExecutionTime = ExecutionTime * (2 - CPU_free(R_i, initt, d))
11: return ExecutionTime
Algorithm 2 Job Completion Time Estimation
1: Let R_i = a resource
2: Let app = the job to be executed
3: Let initt = the start time of the job
4: Let d = the deadline for the job
5: Let size_IN = the number of input bytes to be transferred
6: Let size_OUT = the number of output bytes to be transferred
7: for each R_i having a gap do
8:   Prolog = TransT_Estimation(R_i, initt, d, size_IN)
9:   Epilog = TransT_Estimation(R_i, initt, d, size_OUT)
10:  ExecT = ExecT_Estimation(R_i, app)
11:  if RT(R_i) < 0 then
12:    ExecT = ExecT + RT(R_i)
13:  end if
14:  JT_{R_i} = Prolog + ExecT + Epilog
15: end for
As a result, the confidence in the estimations depends on how trustworthy the resource where the job will run is. The benefits of tuning the obtained prediction by using this trust factor were evaluated in [16], highlighting the usefulness of this approach. We now combine the estimation techniques presented in [16] and [30] to obtain a more accurate prediction. It is important to highlight that predictions are only calculated when a suitable gap has been found in the host. In this way there is no need to calculate the completion times for all the hosts in the system, which would be quite inefficient. On the other hand, when a resource suddenly quits the system (e.g. the resource fails or is shut down), the jobs scheduled on it (including currently running jobs) have to be reallocated to other hosts. The way jobs are rescheduled is the same as when they were first submitted to the system. This feature is very important in Grids since resources may join and leave the Grid at any time, and
284 284 IBERGRID failures of resources are the rule rather than the exception. This task is performed by the Job Rescheduler module (see Figure 2). Finally, the jobs that are able to manage this layer are simple jobs. But dealing with workflows and pilots jobs is about our future work. 4 Service Level Agreements (SLAs) Once the execution of the job may be ensured by the system with enough accuracy, the next step to address is the communication with the user in order to reach agreements for executing his jobs. This process is carried out through Service Level Agreements (SLAs). The SLA concept within Grid computing, is defined as a contract between user and service provider. On this contract, the expectations, obligations and legal implications are explicitly defined [3]. So, it can be said that the QoS that the user expects to receive is represented on each SLA. Furthermore, SLAs are the main mechanism to improve the commercial expansion of Grid computing due to their support for pay per use models and the fact of being a legal statement [31]. Nowadays, this points are very important because of the business interest on exploiting Grid computing. Formally, the use of SLAs enforces the relationship between user and service provider in two ways: as a legal statement that must be accomplished and as an agreement that can be negotiated. Negotiation implies that the service provider has the opportunity to decide in advance if the user requirements can be fulfilled and if possible, negotiate with it to reach a better agreement. Moreover, the use of SLAs improve the interoperability among Grids and from users to manage multiple Grids. But this can only be a reality if a robust and realistic standard is applied. Nowadays, the most important and widely used standard within SLAs is the WS-Agreement. The last version of its specification was released in March 2007 [5], and with it, all the aspects related to the creation, structure and SLA management were defined. WS-Agreement defines a basic scheme for the agreements as Figure 3 illustrates. Each agreement has a name identifier and a context. In the context, all the information about the document is defined, like service provider information. The terms block consists on two subblocks: the first one, known as service terms, has the information relative to the services/resources that are going to be provided (e.g. CPU count, CPU architecture, RAM amount, etc.); and the second one, known as guarantee terms, has the service level that must be guaranteed for each service/resource of the service terms (e.g. 2 (CPU count), x86 64 (CPU architecture), 2 GB (RAM), etc.). Finally, the creation constraints block is used for setting limitations on a negotiation and this block can only be defined on the template. On the negotiation process it is not used. In January 2011 a new revision of the WS-Agreement was released [22]. In this revision, an extension of the negotiation protocol defined on the first release is presented. The negotiation protocol previously defined on WS-Agreement only contemplates a simple negotiation workflow, where the user requests one template or more, fills it with the requested QoS and sends it back to the service provider, which accepts or rejects the SLA. But with the recent extension (see Figure 4), renegotiation is available through a loop between the user and service provider
Fig. 3. SLA structure

Fig. 4. WS-Agreement negotiation protocol

The WS-Agreement specification does not define the terms to be used in each SLA; their definition is left to the service provider, who is in charge of specifying the terms for its own needs. This flexibility lets the service provider define terms related to the hardware needed, to time restrictions or to job-related restrictions. These terms can be numerous and very different from each other, but several of them typically appear: terms related to the hardware needed (such as the number of CPUs or the amount of RAM, among others) and, more importantly, terms related to time restrictions. These restrictions often appear as the start time and duration (or deadline) of a job. It is also possible to define new terms to improve the knowledge about the jobs and exploit it in the meta-scheduling process.

For this purpose, each SLA submitted to this framework should follow the WS-Agreement specification. The service terms specified in each SLA are the job and execution parameters needed for the meta-scheduling process (see Figure 3). These terms are mainly: job (app, in_file), start-time (t_s) and deadline (d). The name block only specifies the agreement name for easier human identification, while the context block contains two main parameters: template-id for internal identification and service provider for identifying the service provider by name. This structure is open to future extensions of the terms and context parameters. Finally, creation constraints are not expected.

This framework implements the WS-Agreement specification and it is possible to interact with it through a web portal (see Figure 5). The portal offers the main fields of a template to be filled in. Once submitted, the information is converted into an offer and sent to the SLA manager. The result of the request is shown to the user through the portal and, if the submission has been successful, the EPR (endpoint reference) is returned.
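To make the structure above concrete, the following Python sketch assembles a simplified offer containing the blocks just described (name, context with template-id and service provider, service terms with job, start-time and deadline, and guarantee terms). The element names are illustrative only and do not reproduce the normative WS-Agreement schema; the values are invented for the example.

import xml.etree.ElementTree as ET

def build_sla_offer(name, provider, template_id, app, in_file, t_s, d):
    """Assemble a simplified SLA offer with the blocks described in the text.
    Element names are illustrative, not the normative WS-Agreement schema."""
    offer = ET.Element("AgreementOffer")
    ET.SubElement(offer, "Name").text = name
    context = ET.SubElement(offer, "Context")
    ET.SubElement(context, "TemplateId").text = template_id
    ET.SubElement(context, "ServiceProvider").text = provider
    terms = ET.SubElement(offer, "Terms")
    service = ET.SubElement(terms, "ServiceTerms")
    ET.SubElement(service, "Job", executable=app, input=in_file)
    ET.SubElement(service, "StartTime").text = t_s
    ET.SubElement(service, "Deadline").text = d
    guarantee = ET.SubElement(terms, "GuaranteeTerms")
    ET.SubElement(guarantee, "CompletionBefore").text = d
    return ET.tostring(offer, encoding="unicode")

print(build_sla_offer("job-42-sla", "ibergrid-provider", "tmpl-001",
                      "simulate.bin", "params.dat",
                      "2011-06-01T10:00:00Z", "2011-06-01T18:00:00Z"))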
Fig. 5. SLAs Web portal

SLA monitoring and the inclusion of the negotiation extension represent the next milestones of our work. Several advantages emerge from the use of SLAs in our system and, more specifically, from the adoption of the WS-Agreement specification. First of all, it represents a formalization of the job submission process. Moreover, SLAs are a mechanism for the formal representation of the temporal restrictions that the user sets and that our Grid system has to respect. Finally, SLAs are XML messages (as specified in WS-Agreement), so they can be easily handled within Web environments. Therefore, technologies such as Gridsphere [32] can be exploited for the development of the Web environment. This way, the complexity of the system is hidden from the user, who can interact with the Grid through a Web portal by filling in the jobs to be executed and their requirements. This information is then translated into an SLA and sent to the job submission process in an easy and systematic way.

5 Conclusions and future work

Several research works aim at providing QoS in Grids by means of advance reservations, although making reservations of resources is not always possible in this kind of scenario. Thus, this paper proposes an SLA-based framework to perform meta-scheduling in advance (the first step of the reservation-in-advance process) in order to provide QoS to Grid users. Nonetheless, this type of scheduling requires estimating whether a given application can be executed before the deadline specified by the user. This, in turn, requires tackling many challenges, such as predicting the completion time of jobs on the resources. For this reason, the system is concerned with the dynamic behaviour of the Grid resources, their usage, and the characteristics of the jobs. Furthermore, the system takes into account the accuracy of the recent predictions for each resource in order to calculate a resource trust value.
287 IBERGRID 287 Moreover, a SLA manager is implemented to deal with the user interaction and to enable QoS agreements between both of them. This module manages the communication between the system, by interacting with SA-Layer, and the users and makes possible to provide QoS to the users in a contractual way (through SLAs). Furthermore, each SLA can specify more job related information that can be used in the meta-scheduling process than usual job submission. One interesting guideline for future research is the development of techniques to perform better estimations for the transfer times. For this reason, it is a good point to try to reserve network bandwidth when and where this could be possible. Moreover, work on developing algorithms to schedule data as another resource is also considered for future research. Finally, another issue that can be addressed is the improvement of the SLA manager to make more efficient the scheduling by taking into account the associated costs, such as reducing the wasted energy. Acknowledgements This work was supported by the Spanish MEC and MICINN, as well as European Commission FEDER funds, under Grants CSD , TIN C04 and through a FPI scholarship asociated to TIN C04-03 project. It was also partly supported by JCCM under Grant PII1C References 1. U. Schwiegelshohn and et al. Perspectives on grid computing. Future Generation Computer Systems, 26(8): , E. Huedo, R. S. Montero, and I. M. Llorente. A modular meta-scheduling architecture for interfacing with pre-ws and WS Grid resource management services. Future Generation Computing Systems, 23(2): , J. Padgett, K. Djemame, and P. Dew. Grid-Based SLA Management. In Proc. of the European Grid Conference (EGC), Amsterdam, The Netherlands, V. Stantchev and C. Schröpfer. Negotiating and Enforcing QoS and SLAs in Grid and Cloud Computing. In Proc. of the 4th Intl. Conference on Advances in Grid and Pervasive Computing (GPC), Geneva, Switzerland, A. Andrieux and et al. Web Services Agreement Specification (WS-Agreement). Technical report, AssessGrid. Web page at Accessed: 15th March, EU-Brein. Web page at Accessed: 15th March, SLA at SOI. Web page at Accessed: 15th March, C. Vázquez, E. Huedo, R. S. Montero, and I. M. Llorente. Federation of teragrid, egee and osg infrastructures through a metascheduler. Future Generation Computer Systems, 26(7): , A. Roy and V. Sander. Grid Resource Management, chapter GARA: A Uniform Quality of Service Architecture, pages Kluwer Academic Publishers, M. Siddiqui, A. Villazón, and T. Fahringer. Grid capacity planning with negotiationbased advance reservation for optimized QoS. In Proc. of the 2006 Conference on Supercomputing (SC), Tampa, USA, O. Waldrich, Ph. Wieder, and W. Ziegler. A meta-scheduling service for co-allocating arbitrary types of resources. In Proc. of the 6th Intl. Conference on Parallel Processing and Applied Mathematics (PPAM), Poznan, Poland, 2005.
288 288 IBERGRID 13. M. Dobber, R. van der Mei, and G. Koole. A prediction method for job runtimes on shared processors: Survey, statistical analysis and new avenues. Performance Evaluation, 64(7-8): , P. A. Dinda. The statistical properties of host load. Scientific Programming, 7(3-4): , H. Jin, X. Shi, W. Qiang, and D. Zou. An adaptive meta-scheduler for data-intensive applications. Intl. Journal of Grid and Utility Computing, 1(1):32 37, L. Tomás, A. C. Caminero, C. Carrión, and B. Caminero. Network-aware metascheduling in advance with autonomous self-tuning system. Future Generation Computer Systems, 27(5): , A. Sulistio, U. Cibej, S. K. Prasad, and R. Buyya. GarQ: An efficient scheduling data structure for advance reservations of grid resources. Int. Journal of Parallel Emergent and Distributed Systems, 24(1):1 19, C. Castillo, G. N. Rouskas, and K. Harfoush. On the design of online scheduling algorithms for advance reservations and QoS in grids. In Proc. of the Intl. Parallel and Distributed Processing Symposium (IPDPS), Los Alamitos, USA, W. Theilmann and L. Baresi. Multi-level SLAs for Harmonized Management in the Future Internet, chapter Towards the Future Internet, pages IOS Press, I. Brandic and et al. Advanced QoS Methods for Grid Workflows Based on Meta- Negotiations and SLA-Mappings. In Proc. of the 3rd Workshop on Work ows in Support of Large-Scale Science, Austin, USA, J. Ejarque and et al. Exploiting semantics and virtualization for SLA-driven resource allocation in service providers. Concurrency and Computation: Practice and Experience, 22(5): , O. Waeldrich and et al. WS-Agreement Negotiation Ver Technical report, D. Davide Lamanna, J. Skene, and W. Emmerich. Slang: A language for defining service level agreements. In Proc. of the Intl. Workshop of Future Trends of Distributed Computing Systems, Los Alamitos, USA, WSLA: Web Service Level Agreements. Web page at com/wsla/, Accessed: 15th March, M. Parkin, R. M. Badia, and J. Martrat. A comparison of sla use in six of the european commissions fp6 projects. Technical Report TR-0129, coregrid.net/mambo/images/stories/technicalreports/tr-0129.p%df. 26. WSAG4J - WS-Agreement framework for Java. Web page at scai.fraunhofer.de/wsag4j/, Accessed: 15th March, NextGrid - Architecture for Next Generation Grid projects. Web page at http: // Accessed: 15th March, A. Caminero, O. Rana, B. Caminero, and C. Carrión. Network-aware heuristics for inter-domain meta-scheduling in grids. Journal of Computer and System Sciences, 77(2): , A. Di Stefano, G. Morana, and D. Zito. A P2P strategy for QoS discovery and SLA negotiation in Grid environment. Future Generation Computer Systems, 25(8): , L. Tomás, A. Caminero, C. Carrión, and B. Caminero. Exponential Smoothing for network-aware meta-scheduler in advance in Grids. In Proc. of the 6th Intl. Workshop on Scheduling and Resource Management on Parallel and Distributed Systems (SRMPDS), San Diego,USA, D. Armstrong and K. Djemame. Towards Quality of Service in the Cloud. In Proc. of the 25th UK Performance Engineering Workshop, Leeds, UK., Gridsphere. Web page at Accessed:15th March, 2011.
Vulnerability Assessment Enhancement for Middleware

Jairo Serrano 1, Elisa Heymann 1, Eduardo Cesar 1, Barton Miller 2
1 Universitat Autònoma de Barcelona, Spain
2 University of Wisconsin-Madison, USA
[email protected], [email protected], [email protected], [email protected]

Abstract. Security in Grid computing is often an afterthought. However, assessing the security of middleware systems is of the utmost importance, because they manage critical resources owned by different organizations. To fulfill this objective we use First Principles Vulnerability Assessment (FPVA), an innovative analyst-centric (manual) methodology that goes beyond current automated vulnerability tools. FPVA involves several stages for characterizing the analyzed system and its components. Based on the evaluation of several middleware systems, we have found that there is a gap between the initial and the final stages of FPVA, which is filled by the security practitioner's expertise. We claim that this expertise can be systematically codified in order to automatically indicate which components should be assessed, and why. In this paper we introduce the key elements of our approach: vulnerability graphs, the Vulnerability Graph Analyzer, and a knowledge base of security configurations.

Keywords: grid, middleware, security, vulnerability assessment, vulnerability graph

This research has been supported by the MEC-MICINN Spain under contract TIN

1 Introduction

Vulnerability assessment is a security task that is insufficiently addressed in most existing grid and cloud projects, and even in SCADA systems. Such projects use middleware software that often manages many critical resources, making them an attractive target for attackers and terrorist activities. Supercomputing middleware usually bases its security on mechanisms such as authentication, authorization, and delegation to protect passwords, credentials, user files, databases, system access, storage, and so on. These mechanisms have been studied in depth and effectively control key resources, but they are not enough to assure that all of an application's resources are safe. However, middleware systems usually do not undergo a thorough vulnerability assessment during their life cycle or after deployment, so security flaws may be overlooked. One possible solution would be to use existing automated tools such as Coverity Prevent [2] or
Fortify Source Code Analyzer (SCA) [5], but even the best of these tools find only a small percentage of the serious vulnerabilities [13]. A thorough vulnerability assessment requires a systematic approach that focuses on the key resources to be protected and allows for a detailed analysis of those parts of the code related to those resources and their trust relationships. First Principles Vulnerability Assessment (FPVA) [14] meets these requirements. FPVA has been successfully applied to several large and widely used middleware systems, such as the Condor high-throughput scheduling system [1], the Storage Resource Broker (SRB) data grid management system [9], and CrossBroker, a grid resource manager for interactive and parallel applications [12], among others [6].

FPVA starts with an architectural analysis, identifying the key components of a middleware system. It then identifies the resources associated with each component and how the privilege level of each component is delegated. The results of these steps are documented in clear diagrams that provide a roadmap for the last stage of the analysis, the manual code inspection. This top-down, architecture-driven analysis can also help to identify more complex vulnerabilities that are based on the interaction of multiple system components and are not amenable to local code analysis.

For all these systems, the analysts noticed that there is a gap between the three initial steps and the manual code inspection. The analyst has to provide certain expertise about the kind of security problems the systems may present. For example, depending on the language used, the analyst should look for different kinds of vulnerabilities. We have realized that this knowledge is similar to that recorded in several available vulnerability classifications, such as CWE [4], PLOVER [10], and McGraw et al. [15], and that it can be codified in the form of rules to be applied automatically.

In particular, we used the vulnerability assessment carried out on CrossBroker, which is based on the glite middleware, to sketch our initial ideas [16, 17]. We showed that FPVA clearly outperforms the best automatic tools currently available, and we proposed an approach for systematically determining how the analyst's expertise is used to decide which middleware components are critical, based on the FPVA artifacts. In addition, we also proposed a suitable representation for the information gathered in the initial steps of FPVA.

The major contributions of this paper are a vulnerability graph definition, which is the first stage of the approach we are developing, a middleware and vulnerability taxonomy characterization, and the Vulnerability Graph Analyzer, as well as a case study.

The remainder of this paper is structured as follows. In Section 2 we briefly describe the FPVA methodology. Section 3 introduces vulnerability graphs. Section 4 describes the Vulnerability Graph Analyzer approach. Section 5 discusses an example of VGA working on a vulnerability graph. Related work is introduced in Section 6. Finally, conclusions and the work still ahead before VGA can be applied are discussed in Section 7.

2 First Principles Vulnerability Assessment

FPVA proceeds in five stages: architectural, resource, privilege, and component analysis, and result dissemination. We provide a brief description of the first four
291 IBERGRID 291 FPVA methodology stages, because the vulnerability graph is derived from the information gathered in these stages. Architectural Analysis: This step identifies the major structural components of the system, including modules, threads, processes, and hosts. For each of these components, FPVA identifies the way they interact, both with each other and with users. Interactions are particularly important as they can provide a basis for understanding how trust is delegated through the system. The artifact produced at this stage is a document that diagrams the structure of the system and the interactions amongst the different components, and with the end users. Resources Analysis: The second step identifies the key resources accessed by each component, and the operations supported on those resources. Resources include hosts, files, databases, logs, and devices. These resources are often the target of an exploit. For each resource, FPVA describes its value as an end target or as an intermediate target. The artifact produced at this stage is an annotation of the architectural diagrams with resources. Privilege Analysis: The third step identifies the trust assumptions about each component, answering such questions as how are they protected and who can access them? The privilege level controls the extent of access for each component and, in the case of exploitation, the extent of damage that it can accomplish directly. A complex but crucial part of trust and privilege analysis is evaluating trust delegation. By combining the information from the first two steps, we determine what operations a component will execute on behalf of another component. The artifact produced at this stage is a further labeling of the basic diagrams with trust levels and labeling of interactions with delegation information. Component Analysis: The fourth step examines each component in depth. For large systems, a line-by-line manual examination of the code is unworkable. In this step, FPVA is guided by information obtained in the first three steps, helping to prioritize the work so that the code relating to high value assets is evaluated first. The work in this step can be accelerated by automated scanning tools. While these tools can provide valuable information, they are subject to false positives, and even when they indicate real flaws, they often cannot tell whether the flaw is exploitable and, even if it is exploitable, the tools can not tell if it will allow serious damage. The artifacts produced by this step are vulnerability reports, which are provided to the software developers. It can be seen that FPVA is focused in analysing the data and control flows among the system components looking for unsecure features. This orientation has led to the definition of the following concepts: Attack Surface as the set of coordinates from which an attack might start, indeed it tells security practitioners where to start looking for the attacker s initial behaviour. Impact Surface as the set of coordinates where exploits or vulnerabilities might be possible. Attack Vector as the sequence of transformations that allows controlflow to go from a point in the attack surface to a point in the impact surface.
3 Vulnerability Graphs

With the objective of reducing the gap between the first stages of FPVA and the component analysis one, we have defined a structure called vulnerability graph for representing the results of these initial steps. Vulnerability graphs are aimed at finding possible malicious patterns between middleware components and/or resources, following the control flow through their relationships.

Fig. 1. First-approach vulnerability graph

There are several elements in vulnerability graphs. Figure 1 shows a small vulnerability graph example. Here we can intuitively assume that Component 1 is part of the attack surface, and that Resource 1 might be a point on the impact surface. Based on the information present in Figure 1, we can potentially derive two different attack vectors: the first one includes Component 1, Component i, and Resource 1; the second includes Component 1, Component j, and Resource 1. Formally, a vulnerability graph is defined as follows:

Definition 1. A vulnerability graph G = (V, E) is a tuple where V represents the vulnerability graph nodes, a nonempty set of middleware components and resources, and E represents the vulnerability graph edges, a nonempty set of actions that associate vulnerability graph nodes.

Definition 2. In a security context, vulnerability graph nodes representing components or resources that do not satisfy safety attributes, properties, or characteristics during the vulnerability assessment may be considered vulnerable. Vulnerability graph edges can associate vulnerable nodes, through actions, with non-vulnerable nodes, which in turn may become vulnerable or exploitable.
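As an illustration of Definitions 1 and 2, the following minimal Python sketch represents a vulnerability graph as a directed graph with labelled edges and enumerates candidate attack vectors as control-flow paths from an attack-surface node to an impact-surface node. The action labels in the example are invented placeholders; only the node layout follows Figure 1.

from collections import defaultdict

class VulnerabilityGraph:
    """Minimal directed-graph sketch: nodes are middleware components or
    resources, edges are labelled actions (Definition 1)."""
    def __init__(self):
        self.edges = defaultdict(list)   # node -> [(action, target), ...]

    def add_edge(self, source, action, target):
        self.edges[source].append((action, target))

    def attack_vectors(self, node, impact_node, path=None):
        """Enumerate candidate attack vectors: control-flow paths from a
        node on the attack surface to a node on the impact surface."""
        path = (path or []) + [node]
        if node == impact_node:
            yield path
            return
        for action, target in self.edges[node]:
            if target not in path:   # avoid cycles
                yield from self.attack_vectors(target, impact_node,
                                               path + [action])

# The small example of Figure 1: two vectors from Component 1 to Resource 1.
g = VulnerabilityGraph()
g.add_edge("Component 1", "action a", "Component i")
g.add_edge("Component 1", "action b", "Component j")
g.add_edge("Component i", "action c", "Resource 1")
g.add_edge("Component j", "action d", "Resource 1")
for vector in g.attack_vectors("Component 1", "Resource 1"):
    print(" -> ".join(vector))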
The characterization proposal for middleware components, resources, and critical actions is based on the information that FPVA artifacts, developer teams, and documentation can provide. This characterization step is based on several FPVA artifacts from six different middleware systems: Condor, SRB, MyProxy, glexec, VOMS-Admin, and CrossBroker. Table 1 shows the most relevant elements of this first middleware characterization approach for the vulnerability graph. The first column contains the characterization items, and the second one a description of each item. The table is divided into three sections: the first is the components characterization, the second is the resources characterization, and the last one is the interactions characterization.

Table 1. Middleware characterization proposal.

Components:
c_id: an identifier for the component
c_host: the component hostname, where the component is actually running
c_suid: is the sticky bit set on the component?
c_priv: the component privileges, Unix style
c_cons: the component constraints, related to data, time, users, privileges, and other restrictions
c_rel: the components and/or resources directly related to the component

Resources:
r_id: an identifier for the resource
r_host: the resource hostname, where the resource is actually installed or shared
r_suid: is the sticky bit set on the resource?
r_priv: the resource privileges, Unix style
r_cons: the resource constraints, related to data, time, users, privileges, and other restrictions

Interactions:
i_id: an identifier for the interaction
i_host: the interaction host, specifying whether the interaction happens on a single host or on several
i_stat: the interaction state, describing whether the interaction between components and/or resources is active or passive
i_type: the interaction type, indicating a critical action such as read, write, open, execute, query, etc.
i_priv: the interaction privileges, specifying whether the interaction type runs as a privileged user
i_cons: the interaction constraints, related to data, time, users, privileges, and other restrictions

In our approach, we are going to use model checking techniques to analyze the safety attributes, properties, or characteristics in the vulnerability graph, along with the control-flow steps that allow going from a point in the attack surface to a point in the impact surface.
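A possible in-memory encoding of Table 1, useful when feeding the characterized elements to the analyzer, is sketched below in Python. The field names follow Table 1; the types, default values and the example instances (an LB-like component writing into a MySQL resource) are assumptions made for illustration.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Component:
    """Characterization of a middleware component (Table 1); types assumed."""
    c_id: str
    c_host: str
    c_suid: bool                                       # sticky/setuid bit set?
    c_priv: str                                        # Unix-style privileges
    c_cons: List[str] = field(default_factory=list)    # data/time/user constraints
    c_rel: List[str] = field(default_factory=list)     # directly related nodes

@dataclass
class Resource:
    r_id: str
    r_host: str
    r_suid: bool
    r_priv: str
    r_cons: List[str] = field(default_factory=list)

@dataclass
class Interaction:
    i_id: str
    i_host: str            # single host, or spanning several hosts
    i_stat: str            # "active" or "passive"
    i_type: str            # critical action: read, write, open, execute, query...
    i_priv: str            # whether the action runs as a privileged user
    i_cons: List[str] = field(default_factory=list)

# Example (hypothetical values): the LB component writing job state into MySQL.
lb = Component("LB", "lb.example.org", False, "glite", ["job-status data"], ["mysql"])
db = Resource("mysql", "lb.example.org", False, "mysql")
qry = Interaction("sql_query", "lb.example.org", "active", "write", "non-privileged")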
4 Vulnerability Graph Analyzer

A manual vulnerability assessment following FPVA proceeds initially with the architectural, resource, and privilege analyses, and then with a component analysis based on their results (i.e. the artifacts). However, which vulnerabilities are going to be searched for in the selected components depends on the implementation details of each component and on the analyst's expertise. Consequently, there is a gap between the artifacts generated in the first FPVA steps and the component analysis step, which currently must be filled with knowledge from an external source. We claim that this knowledge can be found in several existing vulnerability classifications and that, in consequence, it can be systematically codified in order to automatically indicate which components should be analyzed and why. To reach this objective we have defined the Vulnerability Graph Analyzer (VGA).

VGA traverses a vulnerability graph following the control flow, with the aim of finding potential malicious patterns or attack vectors that may lead analysts to determine where to search for a vulnerability. Most of the generated FPVA artifacts describe a particular operation of the middleware, such as submitting a job in CrossBroker, so the starting and ending nodes belonging to the attack and impact surfaces can be clearly identified. In addition, the order in which the graph should be traversed is also quite clear, because every edge is labeled with a number indicating when the interaction represented by the edge takes place. Finally, a characterization of a vulnerability taxonomy is required in order to build a knowledge base where security configurations describing possible malicious patterns are stored. Ultimately, VGA outcomes are presented as security alerts, because we are analysing neither the components' code nor the actual control flow.

4.1 VGA sketch

A visual representation of the vulnerability graph analyzer is shown in Figure 2. It contains the FPVA artifacts, the characterization proposal, the knowledge base, the graph analyzer engine, and the security alerts.

Fig. 2. Vulnerability graph analyzer

The main component of VGA is the graph analyzer engine, which receives two inputs and then calculates the possible attack vectors to be analyzed. The first input is the vulnerability graph, which includes the set of components, resources, and critical actions from the FPVA artifacts, translated according to our characterization proposal. The second input is a knowledge base of potential and generic attack vectors. VGA basically consists of an instantiation process between the specific vulnerability graph representation and the generic attack vectors. This process generates a security alert each time a generic attack vector can be instantiated with the information in the vulnerability graph.
4.2 Vulnerability Taxonomy Characterization

A vulnerability taxonomy characterization provides the vulnerability graph analyzer with the knowledge about the different existing vulnerabilities that the security practitioner applies when performing the component analysis. This knowledge is in turn used by the graph analyzer engine in order to know how the vulnerabilities may be related to middleware elements and attributes during the instantiation process. We started by classifying 51 vulnerabilities found using FPVA, publicly listed in [6], with two different taxonomies; in addition, we introduce the CWE taxonomy. The 51 vulnerabilities belong to six different middleware systems: Condor, SRB, MyProxy, glexec, CrossBroker, and VOMS-Admin.

The seven kingdoms taxonomy is the vulnerability classification from McGraw et al. [15], which has been supported by the Fortify Software Security Research Group. The taxonomy includes seven general categories: 1) Input Validation and Representation, 2) API abuse, 3) Security features, 4) Time and State, 5) Error handling, 6) Code quality, 7) Encapsulation, in addition to an extra category called Environment. The whole taxonomy includes 86 different vulnerabilities. In this case, the classification has shown that using McGraw's taxonomy is neither easy nor clear enough to properly fit the 51 vulnerabilities: we found that nine vulnerabilities belong to two different categories, two vulnerabilities belong to more than two different categories, and 35 vulnerabilities can be classified without ambiguity. Also, no vulnerabilities fit the last two categories, Encapsulation and Environment, which are related to specific languages or programming frameworks (e.g. J2EE, ASP.NET).

PLOVER is the preliminary list of vulnerability examples for researchers, from the MITRE Corporation [7]. Table 2 shows the PLOVER taxonomy. The PLOVER taxonomy includes around 300 vulnerabilities categorized into 28 classes, so the likelihood of properly fitting the 51 vulnerabilities increases considerably. The classification of our vulnerabilities with PLOVER showed that 32 vulnerabilities belong to two different classes, three vulnerabilities belong to more than two classes, and 14 vulnerabilities can be classified without ambiguity. With PLOVER the 51 vulnerabilities fit into almost 50% of the whole taxonomy, because it includes a large and detailed classification structure drawn from a diverse set of sources, including McGraw.

The Common Weakness Enumeration (CWE) is an enhanced and improved effort for organizing vulnerability data that combines different perspectives (e.g. the seven kingdoms, PLOVER, and other efforts) in a hierarchical fashion. CWE supports multiple stakeholders with multiple views that serve different purposes and audiences. We are going to move to the research view of the Common Weakness Enumeration (CWE-1000), because it is organized according to abstractions of software behaviors and the resources that are manipulated in those behaviors.
Table 2. PLOVER taxonomy: Buffer overflows, format strings, etc.; Structure and Validity Problems; Special Elements (Characters or Reserved Words); Common Special Element Manipulations; Technology-Specific Special Elements; Path Traversal and Equivalence Errors; Channel and Path Errors; Cleansing, Canonicalization, and Comparison Errors; Information Management Errors; Race Conditions; Permissions, Privileges, and ACLs; Handler Errors; User Interface Errors; Interaction Errors; Initialization and Cleanup Errors; Resource Management Errors; Numeric Errors; Authentication Error; Cryptographic errors; Randomness and Predictability; Code Evaluation and Injection; Error Conditions, Return Values, Status Codes; Insufficient Verification of Data; Modification of Assumed-Immutable Data; Product-Embedded Malicious Code; Common Attack Mitigation Failures; Containment errors; Miscellaneous WIFFs.

4.3 Knowledge Base

In our approach, we translate the combination of both the middleware characterization and the vulnerability taxonomy characterization into a set of generic security configurations. Having previously defined the key elements of VGA, we proceed to define a basic structure for the knowledge base (KB). A security configuration can be built as follows:

Definition 4. Consider a set C of security configurations; then a configuration c ∈ C can be:
c = m_i(a_j) → t(a_j), simple;
c = m_1(a_1) ∧ m_2(a_2) ∧ ... ∧ m_i(a_j) → t(a_1, a_2, ..., a_j), compound;
where m_i ∈ G : {m_i ∈ V ∨ m_i ∈ E}, a_j is some attribute, property, or characteristic of m_i, and t is a vulnerability class (belonging to some known taxonomy T) that can be present in the system if c can be set.
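The following Python sketch shows one possible encoding of such a configuration: a named conjunction of predicates over characterized graph elements that, when fully satisfiable, raises an alert tagged with a vulnerability class. The rule, the attribute names and the element values are illustrative assumptions, loosely anticipating the case study of Section 5; they are not part of the actual knowledge base.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class SecurityConfiguration:
    """A compound configuration per Definition 4: a conjunction of predicates
    m_i(a_j) over graph elements that, if all satisfiable, points to a
    vulnerability class t from some taxonomy."""
    name: str
    predicates: List[Callable[[Dict[str, dict]], bool]]
    vulnerability_class: str        # e.g. a CWE/PLOVER class name

    def instantiate(self, elements: Dict[str, dict]):
        """Return a security alert if every predicate holds for the
        characterized elements of a candidate attack vector."""
        if all(pred(elements) for pred in self.predicates):
            return f"ALERT [{self.vulnerability_class}]: {self.name}"
        return None

# Illustrative rule: unchecked message size flowing into a persistent query.
rule = SecurityConfiguration(
    name="unvalidated input size over a persistent connection",
    predicates=[
        lambda e: e["SUBMIT"]["allows_big_messages"],
        lambda e: e["query"]["state"] == "persistent",
        lambda e: not e["LB"]["validates_size"],
    ],
    vulnerability_class="Numeric Errors / Insufficient Verification of Data",
)

elements = {"SUBMIT": {"allows_big_messages": True},
            "query": {"state": "persistent"},
            "LB": {"validates_size": False}}
print(rule.instantiate(elements))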
5 Case Study: Through an Integer Overflow to a Denial of Service

This case study demonstrates that the VGA concept, and its associated definitions, can be used to guide an analyst performing a source code inspection in finding vulnerabilities. In this case study, based on CrossBroker, we assume that the vulnerability graph and the knowledge base have already been built. Let us proceed to analyze an attack vector from the CrossBroker vulnerability graph (Figure 3).

Fig. 3. CrossBroker vulnerability graph

The set V of components and resources in the vulnerability graph is {SUBMIT, UAM, input_q, SA, RS, output_q, sandbox_dir, AL, CONDOR_G, LB, mysql, BDII, LRMS, CONDOR_STARTD, JOB}, and the set E of actions in the vulnerability graph is {connect, globus-url-copy, enqueue, dequeue, matchmaking, ldap_query, enqueue, dequeue, query, query, query, sql_query, condor_submit, claim_worker_node, allocating, globus-url-copy, start_job}, in accordance with the FPVA artifacts.

First, specific coordinates should be chosen from the middleware attack and impact surfaces; hence the input and impact nodes are selected, in this case the SUBMIT and MySQL nodes. Second, having defined the input and impact nodes, the attack vector composition must be clearly depicted and recognized through the nodes and edges involved (Figure 4); in this case the SUBMIT, LB, and MySQL nodes, and the query and sql_query edges, compose the possible attack vector. Third, since the nodes and edges were previously characterized according to our proposal, we try to instantiate their attributes, properties, and/or characteristics according to the security configurations (generic attack vectors) described in the knowledge base.

Fig. 4. CrossBroker attack vector
Instantiation process for the CrossBroker attack vector:

A) SUBMIT.[constraint] → [configuration]: Are big messages allowed?
B) query.[state] → [configuration]: Is it a persistent connection?
C) LB.[constraint] → [configuration]: Is the data in the correct format and size?
D) sql_query.[state] → [configuration]: Is it a persistent connection?
E) MySQL.[error handling] → [configuration]: Are the error codes returned properly?

A ∧ B = SUBMIT.[constraint] ∧ query.[state] → [configuration]: Has the user requested a timeout period to try to finish and release the connection even if the message has not been transmitted?
B ∧ C = query.[state] ∧ LB.[constraint] → [configuration]: Was the data transmitted correctly and completely within the right time?
C ∧ D = LB.[constraint] ∧ sql_query.[state] → [configuration]: Has the component requested a timeout period to try to finish and release the connection even if the message has not been transmitted?
D ∧ E = sql_query.[state] ∧ MySQL.[error handling] → [configuration]: Are the code and the query correctly returned and properly handled within the right time?

Fourth, the graph analyzer engine should return the set of alerts corresponding to the security configurations that were instantiated by the current values of the attributes, properties, and/or characteristics. In this case, when a submit request happens on CrossBroker, it is possible that SUBMIT.[constraint] allows either a big message or a big attachment; then query.[state] becomes persistent and the data starts being transmitted to the LB component to save information about the job status, but LB.[constraint] trusts that the data is being properly transmitted, based on the message header previously received. The LB component will try to register the job status in the MySQL component, but the database returns an unexpected error due to the incorrect size of the data transmitted at the very beginning; hence sql_query.[state] remains established and MySQL.[error handling] contains an unexpected code because the LB component is still trying to write to the database, in addition to blocking the next incoming requests by not releasing the link. This finally results in a denial of service, caused by an integer overflow in the message size difference and by the improper handling of the unexpected errors.

6 Related Work

Vulnerability assessment of middleware systems is a field that has attracted the interest of both the research and the commercial communities, due to the increasingly rapid growth of the use of distributed and high performance computing, as well as the increasingly rapid growth of threats. Our VGA approach is related to the Open Vulnerability and Assessment Language [8] project and to the vulnerability cause graphs [11].
299 IBERGRID 299 The Open Vulnerabilities and Assessment Language (OVAL) is an international, information security, community standard to promote open and publicly available security content, and to standardize the transfer of this information across the spectrum of security tools and services. OVAL includes a language used to encode system details, and an assortment of content repositories held throughout the community. The language standardizes the three main steps of an assessment process: 1) representing configuration information of systems for testing; 2) analyzing the system for the presence of the specified machine state (vulnerability, configuration, patch state, etc.); 3) and reporting the results of this assessment. The repositories are collections of publicly available and open content that utilize the language. OVAL is based primarily on known vulnerabilities identified in Common Vulnerabilities and Exposures (CVE) [3], a dictionary of standardized names and descriptions for publicly known information security vulnerabilities and exposures developed by the MITRE Corporation. In contrast to OVAL, our effort is not based on the specific CVE vulnerabilities, instead we claim that VGA approach works with CWE classification and with nonspecific software vulnerabilities, also VGA approach is based on FPVA stages, thereby gathering more meaningful information about the assessment process. Vulnerability Cause Graphs is based on a thorough analysis of vulnerabilities and their causes, similar to root cause analysis. The results are represented as a graph, which Byers et al. [11] called vulnerability cause graph. Vulnerability cause graphs provide the basis for improving software development best practices in a structured manner. The structure of the vulnerability cause graph and the analysis of each individual cause are used to determine which activities need to be present in the development process in order to prevent specific vulnerabilities. In a vulnerability cause graph, vertices with no successors are known as vulnerabilities, and represent classes of potential vulnerabilities in software being developed (analysis may start with specific instances of known vulnerabilities). Vertices with successors are known as causes, and represent conditions or events that may lead to vulnerabilities being present in the software being developed. In our case, the most noticeable difference is that we want to know whether a vulnerability may exist and why, instead Byers work knows the vulnerabilities and looks for their causes. 7 Future Work & Conclusions In this paper we have described the vulnerability graph structure and the vulnerability graph analyzer to guide security practitioners during a source code assessment to identify effectively where and why vulnerabilities might be possible. There is a lot of tasks which have to be done before vulnerability graphs and VGA can be applied as effectively as we claim to grid security. The most relevant tasks we have noticed are the following: Graph Representation: A vulnerability graph must be able to depict the middleware composition in a suitable and easy way. Vulnerabilities Characterization: A
complete characterization of a set of vulnerabilities is required in order to check whether the knowledge base is good enough to provide the vulnerability graph analyzer with the proper security configurations. Attack Vectors: with the improvements on the vulnerability graph and the middleware characterization, the vulnerability graph analyzer engine has to be able to construct meaningful attack vectors with a well-defined algorithm. Instantiation process: in the graph analyzer engine the instantiation is the most important process; it has to be clear and easy to deploy, and it must be based on some kind of weighted value for the middleware elements, because not all the security configurations (the knowledge base) can be applied to all middleware elements in the same way.

In addition to the vulnerability graph and VGA definitions, we have proposed a middleware characterization along with a formal definition of a knowledge base of security configurations, improving our previous work with a more meaningful approach. Finally, a case study has been introduced in which all the definitions and elements have been applied, showing that it is possible to reduce the gap between the first stages of FPVA and the component analysis stage.

References

1. Condor Project.
2. Coverity Prevent.
3. CVE - Common Vulnerabilities and Exposures. The MITRE Corporation.
4. CWE - Common Weakness Enumeration. The MITRE Corporation.
5. Fortify Source Code Analyzer.
6. MIST Group: Middleware security and testing web site.
7. Mitre - The MITRE Corporation.
8. OVAL - Open Vulnerability and Assessment Language.
9. SRB - Storage Resource Broker.
10. PLOVER - Preliminary list of vulnerability examples for researchers. March.
11. D. Byers, S. Ardi, N. Shahmehri, and C. Duma. Modeling software vulnerabilities with vulnerability cause graphs. In Proc. of the IEEE International Conference on Software Maintenance (ICSM).
12. E. Fernandez del Castillo. Scheduling for Interactive and Parallel Applications on Grid. PhD thesis, Universitat Autònoma de Barcelona.
13. J. Kupsch and B. Miller. Manual vs. automated vulnerability assessment: A case study. International Workshop on Managing Insider Security Threats, 469:83-97, June.
14. J. Kupsch, B. Miller, E. Heymann, and E. Cesar. First principles vulnerability assessment, MIST project. Technical report, UAB & UW, September.
15. G. McGraw, K. Tsipenyuk, and B. Chess. Seven pernicious kingdoms: A taxonomy of software security errors. IEEE Security and Privacy, 3:81-84.
16. J. D. Serrano Latorre, E. Heymann, and E. Cesar. Developing new automatic vulnerability strategies for HPC systems. In Latinamerican Conference on High Performance Computing (CLCAR), August.
17. J. D. Serrano Latorre, E. Heymann, and E. Cesar. Manual vs automated vulnerability assessment on grid middleware. In Actas de las XXI Jornadas de Paralelismo (JP2010), held within the III Congreso Español de Informática (CEDI), Sep 2010.
301 IBERGRID 301 Web interface for generic grid jobs, Web4Grid Antònia Tugores 1, Pere Colet 2 Instituto de Física Interdisciplinar y Sistemas Complejos, IFISC(UIB-CSIC) 1,2 [email protected] 1 [email protected] 2 Abstract. Scientific grid has been proven to be a useful tool in some very computationally demanding fields as for example in analysis of particle physics or astrophysics data. While it has been extended to other fields such as plasma research it is still viewed as a tool associated to large projects. However the capability of grid computing of processing a large number of jobs simultaneously is, by itself, not restricted to large projects. There are many research areas in which small teams or even individual researchers may need to run many jobs in an efficient manner. An example would be the exploration of the dynamics for different parameter values. Another example could be to evaluate statistical quantities on systems where dynamics is subject to noise fluctuations. Complex systems would be a prototypical area where such calculations are performed. While this high throughput computational needs are very much suitable for what grid was intended for, very few users take advantage of it because the access is cumbersome and requires a learning period that many researchers, mainly in small groups, can not afford. To facilitate the use, some user-friendly web applications have been developed for single applications. An example could be e-nmr interface for biomolecular nuclear magnetic resonance and structural biology. Other graphical environments that allow users to run their own applications such as Migrating Desktop, Ganga or P-GRADE Portal are too complex to be used without prior training. To popularize grid it is required to have user friendly interfaces where simple programs can be easily uploaded and executed. Those interfaces should not aim at replacing sophisticated interfaces developed for specific applications. Neither it is required they allow all the options in submitting a job or in monitoring the grid system. On the contrary they should cover the most basic aspects and include suitable default options for the others. Web4Grid interface is intended to be a user-friendly web where grid users can submit their jobs and recover the results on an automatic way. Besides the interface provides the capability to monitor the existing jobs and to check the (local) grid status (load, number of free cores available,...). Web4Grid interface does not require specific grid usage training nor any knowledge about the middleware behind it. Keywords: non classic user communities, grid, web-interface, glite, EGI
302 302 IBERGRID 1 Introduction Enabling Grid for E-sciencE (EGEE) [1] was a series of projects that began in 2004 to construct a multi-science computing grid infrastructure for the European Research Area, allowing researchers to share computing resources. The aims of this projects were to build a secure, reliable and robust grid infrastructure, to develop a middleware solution specifically intended to be used by many different scientific disciplines and to attract a wide range of users. They produced the glite [2], [3] middleware, engineered using components from some sources like the Globus Toolkit [4]. glite services have been adopted by a large number of computing centers around the world. Since 2010, the pan-european Grid Infrastructure (EGI) [5] in collaboration with the National Grid Initiatives (NGIs) guarantees the long term availability of the generic e-infrastructure created by the EGEE project. User communities that use glite are grouped into Virtual Organizations, and before accessing the grid, users must have a X509 certificate and be a member of a Virtual Organization. To access the grid users have first to log into a User Interface (UI). The UI is a computer with a X509 certification which runs specific glite components, and in which the user certificate is installed. Once it has log in the user can have access to the grid functionalities. To run a job, first the user has to create a proxy certificate consisting of a new certificate signed by the end user and a new private-public key pair. Then a file specifying the job characteristics such as the executable program, the parameters, input and output files, etc. has to be created. This file is written using the Job Description Language (JDL). Once the JDL descriptor file has been created, the user can submit the job to the available Workload Management System (step 3 in Fig. 1) by using specific glite commands at the User Interface. The set of available WMS is determined by the Virtual Organization. Once the job has been accepted by the WMS, the WMS assigns it to the most suitable Computing Element (step 4 in Fig. 1). A Computer Element is a set of computing resources localized at a site like a cluster or a computing farm. Finally the job is executed in an idle computing resource or Worker Node (step 5 in Fig. 1) and users can query the status from the UI to the WMS. The certifications allow for this process to take place without the need for the user to have an account in any computer besides the UI. glite has two different ways to handle the input and output files. Small input files (whose size is smaller than 100MB) can be just referenced when submitting the job. They are uploaded to the WMS and made available to the Worker Nodes. Larger files require a more cumbersome procedure consisting in uploading the files to a Storage Element through a LCG File Catalog (LFC) which acts as an interface providing human readable identifiers (step 2 in Fig. 1). Similarly, when the execution is completed, small outputs are sent directly to the WMS while large files have to be updated to the Storage Element through the LFC (step 7 in Fig. 1). Finally the user, depending on the outputs size, has to download the outputs from the WMS (step 10 in Fig. 1) or from the SE (step 12 in Fig. 1) to the UI using specific commands that require the job identifier or the name of the output file.
303 IBERGRID 303 Fig. 1. glite job workflow. 1. The user logs into the User Interface. 2. The user uploads large input files to an SE. 3. The user submits the job to the WMS. 4. The WMS sends the job to the most suitable CE. 5. The job is redirected to an idle Worker Node. 6. The WN and executes the job (if needed, the user application download the inputs from the SE). 7. If needed, the user application uploads the outputs to the SE. 8. The WN notifies its CE that the job has finished. 9. The CE notifies the end of the job to the WMS. 10. The user queries the WMS through the UI and notices that the job status is DONE. 11. The user retrieves the outputs from the WMS through the UI. 12. The user retrieves the outputs from the SE through the UI. The need to learn JDL language, the artificial separation between data storage and job execution, and the cumbersome and sometimes hard to memorize glite commands are some of the nuisances scientists find when using glite user interface. Using portal technology in sciencific applications such as described in [6] and [?] would solve this issues and users would have a web point of access to the available computable resources. The objective of this paper is to present a user friendly environment to use grid we have developed. The main component of this environment is a web application that the user can use to upload the programs, monitor its evolution and download the results. The web application makes use of several scripts. Both the web application and the high level command line interface are executed in the User Interface where the user has its files as well as the certificates. The paper is organized as follows Section 2 summarizes existing graphical grid interfaces. Section 3 describes the main script. Auxiliary scripts are described in section 4. Section 5 is devoted to
304 304 IBERGRID the web application. Section 6 analyses the usage and finally, concluding remarks are given in section 7. 2 Current interfaces Different user interfaces have been developed to allow easy access to grid. We sumarize here the characteristics of most important ones. Within the e-nmr project, [8], some of the nuclear magnetic resonance applications have been adapted to the grid. Besides, some user friendly web portals, specific for each migrated application, have been also created. To use these portals, users with a personal X509 certificate loaded on his web browser and registered with the enmr.eu Virtual Organization are enabled to register with username/password to the application portal and execute jobs. The most important issue of this kind of portals is the bijection between portals and applications. That does not allow researchers to use their own applications with the grid without technical help. A more flexible portal is Ganga, [9]. It was first developed to meet the needs of the ATLAS and LHCb for a grid user interface. As a result of its modularity, it has been extended and customised to meet the needs of different user communities like Geant4 regression tests and image classification for web-based searches. The number of commands to submit jobs and retrieve results has been reduced, but the relation between portal and application is still an issue. All jobs must specify the software to be run (application), the processing system (backend) to be used and the input and output datasets. Optionally, a job may also define functions (splitters and mergers) for dividing a job into subjobs that can be processed in parallel, and for combining the resultant outputs. Although there is a graphical interface related to Ganga, the command line interface is more used. The most complete graphical user interface is Migrating Desktop, [10]. It is similar to a window-based operating system that hides the complexity of grid middleware and makes access to grid resources transparent. Migrating Desktop supports batch and interactive jobs handling sequential and parallel applications, but the interface is cumbersome and requires training. Finally, the Parallel Grid Run-time and Application Development Environment Portal (P-GRADE Portal), [11] and Kepler [12] are graphical environment that cover every stage of grid application. They are oriented to directed acyclic graphs (DAG) where each node has a computing resource and a job to be launched on that resource. This interface allows users to use jobs outputs as inputs for other jobs without human interaction. This sophisticated interface requires prior training on DAGs definition. Besides it is not suitable to simple job execution. 3 Improved command line interface. Main command. Unlike usual command line interfaces, a user friendly one should not have a large number of commands nor too many parameters, and simplicity is a must. This starting point has led to a single command call invoked at the beginning to manage
the whole job workflow, even when the Storage Element is used. This aim is achieved with a straightforward finite state machine (FSM), a database to safely keep all the data up to date, a daemon in charge of detecting the end of jobs, a unified home directory between the User Interface and the users' desktops, and the well-known flexibility of Python.

The main script, rungrid, can be easily invoked by the user with just the name of the executable and the input files. rungrid performs the tasks at the level of the User Interface in a transparent way, requiring no software installation on any other grid computer. We assume that: a) a researcher typically uses the same Virtual Organization; b) the jobs to be run are neither parametric nor directed acyclic graphs (DAGs); c) all the files generated by the job are expected to be returned to the user; d) it is not known how much time the job will take.

The command rungrid is typically invoked in the form

rungrid -a application -p param1 param2 -i *.dat,dir/inputfile

where application stands for the name of the executable file, param for the parameters to be passed to the executable (optional) and, finally, the names of the input data files are given after -i (wildcards are accepted). We make use of long-term proxies to allow for the easy creation and update of the mandatory short-term proxies in the UI before they expire, so that long programs can be executed without user intervention.

The FSM associated with rungrid comprises several states. The first state, INIT, is triggered by the call to rungrid. In this state a new entry is added to the database in charge of the jobs, the command line parameters are checked and a long-term proxy is created. Then the FSM moves from the INIT state to the UPLOADING state, where a compressed file containing all the input files is created and uploaded to a suitable Storage Element. Once the upload has finished, the FSM enters the READY state, in which some auxiliary files are created: a) a wrapper to manage the application inputs and outputs within the Worker Node (WN); and b) the Job Description Language (JDL) file, in which the wrapper plays the role of the executable and in which the standard output and error are redirected to specific files. At this point, the job is ready to be submitted to the grid. Some checks, such as the availability of Computing Elements for the Virtual Organization, are done to ensure everything is in order, and finally the job is submitted. When the Computing Element job status is Running (the job has been sent to an idle Worker Node), the state changes from SUBMITTED to RUNNING and remains in this state until the job execution has finished.

In the Worker Node the application wrapper script is executed. It initially downloads the inputs from the Storage Element and, after uncompressing them, it launches the user application. When the application finishes, the wrapper compresses the outputs and uploads them to the Storage Element. Finally, the exit code returned to the WMS is the same one the user application returned. When the daemon notices the job has finished, the FSM moves to EXECUTED and the outputs are ready to be downloaded. In the DOWNLOADING state, both the WMS outputs and the file stored in the Storage Element are downloaded and uncompressed to a specific directory within the folder the job was submitted from.
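A minimal Python sketch of this state machine is shown below: the state names are taken from the text, the forward transitions follow the regular workflow just described (the remaining CLEANING and DONE steps are covered next), and a small helper builds a JDL file in which the wrapper acts as the executable. The db.update helper and the exact JDL contents are assumptions made for illustration, not the actual rungrid implementation.

from enum import Enum, auto

class State(Enum):
    INIT = auto(); UPLOADING = auto(); READY = auto()
    SUBMITTED = auto(); RUNNING = auto(); EXECUTED = auto()
    DOWNLOADING = auto(); CLEANING = auto(); DONE = auto()
    ERROR = auto(); CANCELING = auto()

# Regular forward path of the FSM; ERROR and CANCELING are entered from
# any state when something fails or the user cancels the job.
NEXT = {State.INIT: State.UPLOADING, State.UPLOADING: State.READY,
        State.READY: State.SUBMITTED, State.SUBMITTED: State.RUNNING,
        State.RUNNING: State.EXECUTED, State.EXECUTED: State.DOWNLOADING,
        State.DOWNLOADING: State.CLEANING, State.CLEANING: State.DONE}

def generate_jdl(wrapper="wrapper.sh", stdout="std.out", stderr="std.err"):
    """Sketch of the JDL file built in the READY state: the wrapper plays
    the role of the executable and stdout/stderr go to specific files."""
    return (f'Executable = "{wrapper}";\n'
            f'StdOutput = "{stdout}";\n'
            f'StdError = "{stderr}";\n'
            f'OutputSandbox = {{"{stdout}", "{stderr}"}};\n')

def advance(state, db, job_id):
    """Move one step along the regular path and record it in the jobs
    database; db.update is a hypothetical helper."""
    state = NEXT[state]
    db.update(job_id, state.name)
    return state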
Then the auxiliary files are cleaned (CLEANING), and finally the FSM moves to the final DONE state. So far we have described the regular workflow, but other states such as ERROR or CANCELING exist. When a job has not finished with the correct exit code, or some unpredictable issue has occurred, the status changes to ERROR and the workflow continues, but keeping the auxiliary files and increasing the level of detail in the logs.

Fig. 2. The first version of rungrid (1) involved just one script to control job management; it updated the job status in the local database and, while the job was running, it periodically checked the status via the WMS to detect the job end. The improved version (2) involves a daemon that periodically checks all the jobs of each user. In this new implementation the main script stops after submitting the job to the grid, and the daemon wakes it up when the job has finished, allowing the main script to download the results and clean the data.

The workflow of the first version of the main script can be seen in Fig. 2 (1). An instance of the script kept running until the results had been downloaded to the User Interface (UI). In the SUBMITTED and RUNNING states the number of connections to the WMS was too high, and we had to restrict the number of parallel connections in order to avoid saturating both the WMS and the UI. To solve this issue, we explored several solutions. The glite Logging and Bookkeeping service (LB) allows the user to be notified on every job status change, so in principle we could use it to avoid queries altogether. We did not get this service to work properly, and we had to continue querying the WMS to check the job status. Thus, a second version was developed (Fig. 2 (2)). In this case the main script stops when the job is submitted, and a single daemon (checkjobs) periodically checks the status of all the running jobs with a single query per user to the WMS. This way, the number of queries to the WMS has been drastically reduced. Finally, when the daemon detects that the job status is Done, it launches rungrid again to continue downloading the outputs and cleaning any auxiliary data. Advanced users can use additional command line parameters, such as Worker Node memory requirements, a specific Virtual Organization or a particular output directory, among others, and can configure their own default values.
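A possible shape for the checkjobs daemon loop is sketched below. The WMS query, the database API and the rungrid resume flag are hypothetical placeholders: the sketch only illustrates the single-query-per-user polling scheme and the hand-over back to rungrid described above.

import subprocess
import time

POLL_INTERVAL = 120   # seconds between WMS queries; an assumed value

def query_wms_status(user):
    """Hypothetical helper: issue one bulk status query to the WMS for all
    of the user's jobs and return {job_id: status}.  The actual glite
    command and its output parsing are omitted here."""
    return {}   # placeholder result

def checkjobs_loop(db):
    """Single daemon loop: one query per user per cycle, mark finished jobs
    as EXECUTED and relaunch rungrid so it can download and clean up."""
    while True:
        for user in db.users_with_active_jobs():          # hypothetical API
            for job_id, status in query_wms_status(user).items():
                if status == "Done":
                    db.update(job_id, "EXECUTED")
                    # rungrid resumes the FSM from EXECUTED onwards;
                    # the --resume option is an invented placeholder.
                    subprocess.Popen(["rungrid", "--resume", str(job_id)])
        time.sleep(POLL_INTERVAL)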
Fig. 3. States of the rungrid command and of the checkjobs daemon as described in the text. We also indicate the glite commands associated with each state. Fig. 4. In the Worker Node the user application is replaced by the application wrapper, which downloads the inputs from the Storage Element, executes the user application, compresses the outputs and uploads the resulting file to the Storage Element. Finally, the exit code returned to the WMS is the one returned by the user application. 4 Auxiliary commands Job submission is not the only important action to be performed on the grid. Users also need to know the status of their jobs and to be able to cancel them or look up job details. For these secondary, but no less important, functionalities some auxiliary scripts have been created.
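Before turning to those scripts, the behaviour of the application wrapper summarized in Fig. 4 can be sketched as follows. The sketch is only an illustration: it assumes the gLite data management clients lcg-cp and lcg-cr are available on the Worker Node, and the Virtual Organization, Storage Element, logical file names and application name are placeholders rather than the ones actually used by rungrid.

    import os, subprocess, sys, tarfile

    VO = "example.vo"                                     # placeholder Virtual Organization
    SE = "se.example.org"                                 # placeholder Storage Element
    IN_LFN = "lfn:/grid/example.vo/user/job_in.tar.gz"    # placeholder logical file names
    OUT_LFN = "lfn:/grid/example.vo/user/job_out.tar.gz"

    def local(name):
        return "file:" + os.path.join(os.getcwd(), name)

    # 1. Download the compressed inputs from the Storage Element and unpack them.
    subprocess.check_call(["lcg-cp", "--vo", VO, IN_LFN, local("inputs.tar.gz")])
    with tarfile.open("inputs.tar.gz") as tar:
        tar.extractall()

    # 2. Run the user application, redirecting standard output and error to files.
    with open("std.out", "w") as out, open("std.err", "w") as err:
        rc = subprocess.call(["./application", "param1", "param2"],
                             stdout=out, stderr=err)

    # 3. Compress the outputs, upload and register them, and propagate the exit code.
    with tarfile.open("outputs.tar.gz", "w:gz") as tar:
        tar.add("std.out")
        tar.add("std.err")
    subprocess.check_call(["lcg-cr", "--vo", VO, "-d", SE, "-l", OUT_LFN,
                           local("outputs.tar.gz")])
    sys.exit(rc)

The exit code of the user application is propagated through sys.exit so that the WMS sees the same value, as described in the text.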
The above-mentioned checkjobs daemon has been essential to reduce the load on the User Interface and the Workload Management System, and more effort will be dedicated to solving the issues with the LB service so that job status notifications are received instead of polling the system. This daemon queries the WMS in order to detect job finalization and, when detected, updates the FSM state to EXECUTED and launches the main script to continue the job workflow. Another basic functionality apart from submitting jobs is canceling them. For that purpose, the canceljobs script allows users to delete all or some of their jobs. When canceling a job, the FSM moves to CANCELING to cancel the submission if needed, and then continues with the CLEANING state, removing any local or remote auxiliary data related to the job. Last but not least, users should be able to check the status of their jobs: the status script presents a table containing basic or detailed information (depending on the parameters) about the jobs owned by the user and launched from that User Interface. Fig. 5. (a) User Interface layered software. The basic layers include the operating system and the glite-ui packages which provide the low-level command line interface. On top of them are the scripts described in Sect. 4; Web4Grid runs at the uppermost level. (b) Interdependencies among the components: the command line scripts rungrid, checkjobs and canceljobs read and write to the database and access the UI services. Status and Django just read information from the database. Finally, Web4Grid interacts with the scripts to launch and cancel jobs and displays the status using the information provided by Django. 5 Web4Grid Web4Grid is a web interface that runs on top of the scripts (Fig. 5 (a)), allowing users to submit jobs from any computer with a web browser without logging into a User Interface. This user-friendly web interface has been developed with one of the best-known open source web application frameworks, Django [13]. Django's goal is the creation of complex database-driven websites. Besides, Django provides a large degree of modularity which has turned out to be essential for our purposes; Web4Grid comprises the following modules: authentication, grid management, sessions and
309 IBERGRID 309 pagination. The authentication module currently used is LDAP [14]. We are considering the possibility to implement another authentication module to use the certificate stored in the browser as web and grid authentication and do not require researchers to physically store the certificate in a particular User Interface. This way, researchers from anywhere could access to any User Interface. After authenticating, already submitted jobs are monitored through the web interface allowing the user, with just one click, to view a detailed description of each job, submit a new job, cancel one or more jobs and check the general grid status (busy and free cores) allocated to each Virtual Organization the user belongs to, (see Fig. 6 (a)). Job submission options fit rungrid basic command line: users just select the application to be run and the related input files contained in their home folder. Finally they set the command line parameters (Fig. 6 (b)) and the job is ready to be submitted. As all the actions are launched through the web interface and are managed by the improved command line interface described above, all other essential information (VO, SE, CE,... ) is defined in the user profile and can be edited and updated any time the user is logged in. As Web4Grid is working on the top of the command line scripts, outputs are stored in the application directory to allow users to manage parameterization results easily, but if they prefer to check the results locally, they can download them through the web interface. 6 Implementation and user s experience We have implemented rungrid and Web4Grid at the IFISC User Interface. IFISC users are physicists working on a broad range of topics that within the context of complex systems include quantum transport, quantum and nonlinear optics, photonics, fluid dynamics, biological physics, nonlinear phenomena in ecology and physiology, complex networks and collective phenomena in social systems. Numerical simulations are heavily used to model the dynamics of these systems. Users typically wrote down their own codes which use different algorithms ranging from pseudoespectral methods for spatio-temporal dynamics to correlations calculations and neighbourhood determination in on-line social networks. These programs were executed on a computer cluster running MOSIX [15] which was a quite user-friendly environment where users could submit their jobs to be distributed to the cluster. Home directories are fully integrated so that users have the same home directory in desktops and in the cluster. This allows for users to edit the program in the desktop, compile and run in the cluster and analyse and visualize the data in the desktop or in a multiprocessor server. When IFISC joined Grid-CSIC, we created rungrid with the aim to provide an environment which was user friendly and flexible enough so that it suits to all the users programs. To facilitate the migration to grid, a seminar explaining basic ideas on grid computing and how to use the rungrid script was given. The run- Grid script was initially deployed on June Following the users suggestions we made several improvements in the script such as allowing for the use wildcards to list the input files or returning all the output files to the same directory
Fig. 6. (a) Job status and job details can be easily checked. (b) Job submission through Web4Grid. After selecting the folder the application is placed in, the user is asked to select the application, then the input files, and finally to set the command line parameters. Fig. 7. The first panel summarizes the total number of jobs run with the high-level command line interface; the second panel shows how long the jobs lasted. where rungrid is invoked rather than to a subdirectory. We also increased the robustness of the script and fixed some initial bugs.
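Because all the submission details are hidden behind a single command, users can drive parameter sweeps from very small scripts of their own. The snippet below is a purely hypothetical usage example: the executable name, parameter values and input file are placeholders.

    import subprocess

    # Hypothetical sweep over one control parameter of a user simulation code.
    for coupling in (0.1, 0.2, 0.5, 1.0):
        subprocess.check_call([
            "rungrid",
            "-a", "simulate",              # placeholder executable name
            "-p", str(coupling), "1000",   # parameters passed to the executable
            "-i", "initial.dat",           # input files; wildcards are also accepted
        ])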
311 IBERGRID 311 Although users could use both raw glite command line interface and the high level one, all of them used rungrid. A total of 70 different applications have been executed and the total number of jobs submitted using the scripts since June has been approximately 4500 (last months data is shown in Fig. 7). As can be seen in Fig. 7, in one-third of this jobs, long term proxies were a must because jobs lasted more than 12 hours. The amount of applications executed and the large difference between them lead to different requirements. While some programs needed a few MB of memory, others expect at least 16GB. And some applications were submitted to the grid just once, but others more than one thousand of times. Despite the variability of programs, we have not detected any major issues or handicaps using the high level command line interface. 7 Conclusions and further work We have developed a web application that would allow researchers to run generic grid jobs and monitor the grid status from any computer or mobile device provided it has a web browser. We aim at promote the use of grid in scientific environments well beyond the traditional ones. Besides we have also developed a high level script that allows for a easy job submission from the command line. The rungrid script is particularly useful for users that want to submit many programs using scripts. While there is some provision for job parametrization within glite, we have not incorporated it in the rungrid script. The practice shows that different users have different needs for job parametrization and implementing all of that in a single script would lead to a cumbersome interface. Therefore it is more flexible that each user constructs scripts suitable to its needs. Still, rungrid provides a simple command to be called from these user scripts which takes care of all the burden of the grid submission and output files downloading allowing for the user script to be simple and clean. We are currently working on an implementation of Web4Grid that would allow for these user scripts to be uploaded and executed from the web interface. Users that need more advanced requirements can use more elaborated interfaces such as already existing ones. To spread the use of this kind of portals, next steps should probably be the use of this kind of applications in large research centers or even having some portals per National Grid Initiative users with certificate could log in and submit their jobs. Acknowledgements: Financial support from CSIC through project GRID-CSIC (Ref E494) is acknowledged.
312 312 IBERGRID References 1. Enabling Grids for E-sciencE glite Grid Computing Middleware Kunszt, P. et al.: Data storage, access and catalogs in glite. Local To Global Data Interoperability - Challenges And Technologies, 2005, pp I Foster: Globus Toolkit version 4: Software for service-oriented systems. Network and Parallel Computing Proceedings, Vol. 3779, 2005, pp European Grid Infrastructure Wege, C.: Portal server technology. Journal of IEEE Internet Computing, Vol. 6, 2002, No. 3, pp Thomas, MP. et al.: Grid portal architectures for scientific applications. Journal Of Physics Conference Series, Vol. 16, 2005, pp Loureiro-Ferreira, Nuno et al.: e-nmr glite grid enabled infrastructure. IBERGRID: 4th Iberian Grid Infrastructure Conference Proceedings. 2010, pp Moscicki, J. T. et al.: GANGA: A tool for computational-task management and easy access to Grid resources. Journal of Computer Physics Communications, Vol. 180, 2009, No. 11, pp Kupczyk, M et al.: Applications on demand as the exploitation of the Migrating Desktop. Journal of Future Generation Computer Systems, Vol. 21, 2005, No. 1, pp Kacsuk, Peter: P-GRADE portal family for grid infrastructures. Journal of Concurrency And Computation-Practice & Experience, Vol. 23, 2011, No. 3, Sp. Iss. SI, pp Ludascher, Bertram et al.: Scientific workflow management and the Kepler system. Journal of Concurrency And Computation-Practice & Experience, Vol. 18, 2006, No. 10, pp The Django Book Koutsonikola, V and Vakali, A: LDAP: Framework, practices, and trends. Journal of IEEE Internet Computing, Vol. 8, 2004, No. 5, pp Barak A. and La adan O.: The MOSIX Multicomputer Operating System for High Performance Cluster Computing. Journal of Future Generation Computer Systems, Vol. 13, 1998, No. 4-5, pp
POSTER SESSION
315 IBERGRID 315 Population-Based Incremental Learning Algorithm to Search for Magic Squares Miguel Cárdenas-Montes 1, José Miguel Franco Valiente 2, Álvaro Cortés Fácila 2, Miguel Ángel Díaz Corchero2, Carolina Gómez-Tostón Gutiérrez 2, Adrián Martínez Ramírez 2, César Suárez Ortega 2 and Juan Antonio Gómez Pulido 3 1 CIEMAT, Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas, Avda. Complutense 22, 28040, Madrid, Spain. [email protected], WWW home page: 2 Student of Master of Grid and Parallelism, University of Extremadura, Spain WWW home page: 3 Head of Master of Grid and Parallelism, ARCO Research Group, Dept. Technologies of Computers and Communications, University of Extremadura, Escuela Politécnica, Cáceres, Spain. [email protected] WWW home page: Abstract. This article describes how the combination of Population-Based Incremental Learning algorithm (PBIL) and Grid Computing is used to search for magic squares. In machine learning, PBIL algorithm is an optimizing algorithm and belonging to the category of Estimation of Distribution Algorithm (EDA). This is a type of evolutionary algorithm where the genotype of an entire population (probability vector) is evolved rather than individual members. In recreational mathematics, a magic square is a matrix where the sum of any of the rows or any of the columns is identical. The creation of magic squares is well documented whereas the composition of magic squares by only using Prime Numbers is not applicable to any existing formula. The search space for this problem is all the matrices of size 3x3 composed by Prime Numbers lower than 1,000,000. This corresponds to candidates. This high CPU-time consuming task was accelerated by the parallelization of the production over Grid Computing resources. The creation of magic squares is a combinatorial problem related with Knapsack Problem and Cutting Stock Problem, being impact over economic issues as planning or logistics of goods. Finally, this experiment was developed as a practice of the subject Adaptation of Grid Applications for Image Processing of the Master in Grid Computing and Parallelism at University of Extremadura. Keywords: Combinatorial Problems, PBIL, Recreational Mathematics. of corresponding author: [email protected]
316 316 IBERGRID 1 Introduction By definition, a magic square is an arrangement of numbers in which the sum of the elements of any row or the sum of the elements of any column is identical. The creation of magic squares is simple enough and has a well-established methodology. On the other hand, a magic square of primes is a magic square composed exclusively by prime numbers (also including the number 1). Obviously the formation of magic squares composed only by prime numbers [19] is a scientific challenge more complex and difficult that the formation of simple magic squares, not existing any formula for this kind of composition. For these complex problems, meta-heuristics techniques have usually produced optimal or sub-optimal solutions. Within these techniques, Genetic Algorithm [16] [9] (GA) has proved to produce good solutions in similar conceptual problems. The Population-Based Incremental Learning algorithm (PBIL) [2] is an Estimation of Distribution Algorithm (EDA) [14] suitable for optimizing combinatorial problems. As an evolutionary algorithm, it avoids systematic searches in the overall space but, at the same time, the genotype of an entire population (probability vector) is evolved rather than individual members towards the optimal solutions. The search for the optimal solution or alternatively high-quality sub-optimal solutions is a high time-consuming task sometimes not affordable through the use of a powerful workstation. Fortunately, Grid Computing [11] has revealed itself as a solution while dealing with high CPU-time tasks, especially when the objective is increasing the number of executions per time unit. Grid Computing allows computers to be connected via special software called middleware. Scientists are provided a standard layer by the middleware which exports and handle all the computer resources and where experiments can be run. A grid training infrastructure with glite [21] middleware called GILDA [13] was used in this work. Beyond of the recreational aspect, the solution of the problem of magic squares is relevant of industrial and economics activities, such as: packing problems, scheduling of good, elaboration of time-tables or complex planning. Therefore, the issue proposed in the teaching activity is very pertinent as induction topic to other more close to real industrial problems. As introduced in the abstract, this experiment was developed as a practice of the subject Adaptation of Grid Applications for Image Processing by the students of the Master in Grid Computing and Parallelism at University of Extremadura (Spain). In the next section the problem to solve is described, as well as the main features of PBIL algorithm and Magic Squares. In section III the PBIL implementation is presented and production results are shown in section IV. This article finishes with the conclusion and future work. 2 The problem The students of the Master in Grid Computing and Parallelism were presented the following problem: the evaluation of the Population-Based Incremental Learning algorithm (PBIL) to search for magic squares.
317 IBERGRID 317 As a restriction, magic squares should be composed exclusively by primes (including the number 1) between 0 and 1,000,000. Repetitions are not allowed in order to avoid symmetrical solutions. A magic square is a matrix of numbers, the initial size was set to 3x3. This involves an assessment of around candidates. 2.1 Magic Square A magic square is defined as an arrangement of numbers in a square or matrix such that the sum of the elements of any of their rows or the sum of the elements of any column is identical. In the following, some examples of magic squares composed by primes are presented: The first magic square presented is the first magic square of 3x3 dimension formed by the smaller prime numbers (including number 1), and was discovered in 1917 by Dudeney [8] The second square is the smallest magic square of 3x3 size composed only of the smaller numbers based on a magic constant [15]. It should be noted that this magic square not only matches the sums of rows and columns, but also match the sums on diagonals The third magic square is the smallest magic square composed of prime numbers in arithmetic progression. In this case, the arithmetic constant is 210 [15] Common magic squares exist for all orders n >= 1 except n = 2, although the case n = 1 is trivial, consisting of a single cell containing the number 1. The generation of magic squares is quite simple and has a well-documented methodology. However, creating magic squares composed of prime numbers only is a complex problem that does not correspond to any known formula. That is the reason why this problem involves the evaluation of candidates within a huge search space. In science and in art there are numerous examples of magic squares, such as that included in Albrecht Dürer picture Melancholia I (see Fig. 1) or the one that can be found in the Cathedral of the Holy Family of the architect Antonio Gaudí (see Fig. 2).
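As a concrete reference for the definitions above, the sketch below builds the pool of allowed entries (the number 1 plus the primes below 1,000,000, i.e. the 78,499 candidate values mentioned in the text) with a simple sieve and tests whether a 3x3 candidate is a magic square in the sense used here (distinct entries, identical row and column sums). It is a plain reference check, not the fitness function used later by the PBIL implementation.

    def allowed_entries(limit=1_000_000):
        """The number 1 plus the primes below `limit` (78,499 entries for the default limit)."""
        sieve = bytearray([1]) * limit
        sieve[0] = sieve[1] = 0
        for p in range(2, int(limit ** 0.5) + 1):
            if sieve[p]:
                sieve[p * p::p] = bytearray(len(sieve[p * p::p]))
        return [1] + [n for n in range(2, limit) if sieve[n]]

    def is_magic(square):
        """square: 3x3 list of lists; distinct entries and equal row and column sums."""
        flat = [x for row in square for x in row]
        if len(set(flat)) != 9:
            return False                  # repetitions are not allowed
        sums = [sum(row) for row in square] + \
               [sum(square[r][c] for r in range(3)) for c in range(3)]
        return len(set(sums)) == 1        # all six sums must be identical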
318 318 IBERGRID Fig. 1. Magic square in Dürer Melancholia I. Restrictions on the construction of this mathematical puzzle can be as complicated as desired, for example, stating that the sum of the diagonal is identical to the sum of the rows or the sum of the columns or using only consecutive numbers to compose the matrix. Fig. 2. Magic square of the Cathedral of the Holy Family of the architect Antonio Gaudí. 2.2 The PBIL algorithm According to [22], in machine learning, Population-Based Incremental Learning algorithm (PBIL) is an optimization algorithm in the category of Estimation of Distribution Algorithm. PBIL is a type of evolutionary algorithm where the genotype of an entire population (probability vector) is evolved rather than individual members. This algorithm was first proposed by Shumeet Baluja in 1994 [2]. The
algorithm is simpler than a standard genetic algorithm and in many cases leads to better results. In PBIL, genes are represented as real values in the range [0,1], indicating the probability that any particular allele appears in that gene. The PBIL algorithm is as follows: 1. A population is generated from the probability vector. Usually, in the first generation all probability values are initialized to 0.5. 2. The fitness of each individual is evaluated. Later the individuals are ranked based on their fitness values. 3. The probability vector is updated based on the individuals. The simplest way to perform the update is to find a good candidate solution (based on the best-fitness individuals) and then increase the probability of each of the values of its alleles in the distribution (positive learning). The reverse can be done with a bad candidate solution, with the probabilities of its values being reduced (negative learning). 4. Mutation is applied. Mutation is often used with PBIL to help widen the explored search space, much as with GA. Various schemes to implement mutation exist; two common approaches are either to vary the value frequencies by some amount with low probability or, alternatively, to apply mutation with a low probability to generated population members before they are evaluated. 5. If the end condition is not met, go back to step 2. Variations of the PBIL algorithm include a negative learning ability [2] as well as duality [20]. The PBIL algorithm has been widely used in many applications: Chen and Petroianu [6] used the PBIL algorithm for optimizing Power System Stabilizer (PSS) tuning, Al-Sharhan et al. [1] applied it to the design of a high-speed backbone network, Gosling et al. [12] made a comparison between GA and PBIL applied to playing the Iterated Prisoner's Dilemma (IPD) and Bekker and Olivier [5] applied an integration of the PBIL algorithm and computer simulation to combinatorial optimization problems. According to [2] [3] [4], Baluja provides a detailed comparison between the GA and PBIL on a range of problems such as the j x m job-shop scheduling problem (scheduling j jobs on m machines, an NP-complete problem [7]), a 50-city travelling salesperson problem [17], as well as the bin-packing problem [18]. 2.3 Dimension of the problem In order to estimate the size of the search space, the number of matrices of size 3x3 that can be formed with prime numbers less than 1,000,000 can be calculated. This set of primes has a total of 78,499 items. According to [1], the total number of individuals to assess corresponds to the formula for variations with repetition:

VR_n^m = n^m    (1)

where m is the number of elements in each matrix, in this case 9 (3x3), and n is the number of primes up to one million, that is 78,499. Applying Eq. (1), the total number of matrices that can be formed is, following Eq. (2),

VR_78499^9 = 78,499^9 ≈ 1.13 x 10^44    (2)

The size of the resulting search space does not permit the systematic exploration of individuals. Therefore, the use of evolutionary computation algorithms is highly recommended. 3 PBIL algorithm implementation After the evaluation of an implementation of the classic genetic algorithm carried out in previous work, an implementation of the PBIL algorithm was developed to compare the results. This algorithm has been applied to many combinatorial problems, such as the Travelling Salesman Problem and the Knapsack Problem, obtaining successful results. Unlike the classic genetic algorithm, a typical crossover operator (one-point, two-point or uniform crossover) is not applied. In the PBIL algorithm, the genotype that is evolved is the vector of probabilities, which is recalculated in every iteration based on the individuals with better fitness. The standard deviation of the sums of the rows and columns is used as fitness function and is calculated as shown in Eq. (3). For a candidate square with elements a_{i,j}:

S_f1 = a_{1,1} + a_{1,2} + a_{1,3}    S_c1 = a_{1,1} + a_{2,1} + a_{3,1}
S_f2 = a_{2,1} + a_{2,2} + a_{2,3}    S_c2 = a_{1,2} + a_{2,2} + a_{3,2}
S_f3 = a_{3,1} + a_{3,2} + a_{3,3}    S_c3 = a_{1,3} + a_{2,3} + a_{3,3}
M = (S_f1 + S_f2 + S_f3 + S_c1 + S_c2 + S_c3) / 6
S_f = (M - S_f1)^2 + (M - S_f2)^2 + (M - S_f3)^2
S_c = (M - S_c1)^2 + (M - S_c2)^2 + (M - S_c3)^2
DS = S_f + S_c    (3)

In addition, a parameter that determines the probability of mutation during each iteration and a second parameter that determines the impact of the mutation on the genotype are included. The application of this impact is random. Also, if there is a matrix with fitness 0 that is not a magic square, mutation is applied in the next iteration (mutation probability of 100%). All algorithm parameters are listed as follows:
321 IBERGRID NUM PRIMES: Selectable number of primes for the experiment (including the number 1). 2. POP SIZE: Population size to generate at each iteration of the algorithm. 3. SQUARE SIZE: Magic square matrix size. 4. ITERATIONS: Number of iterations that the algorithm will perform. Equivalent to the number of generations that will be generated. 5. MUT PROB: Mutation probability at each iteration. 6. MUT IMPACT: Probability to apply when a mutation occurs. 7. BEST CANDIDATES: Number of eligible candidates with the best fitness used to update the genotype (probability vector). 4 Adaptation of the algorithm to the GRID After the algorithm implementation, the next decision to make was determining if this problem is better suited to a High Performance Computing (HPC) paradigm or a High Throughput Computing (HTC) paradigm. In HPC, the main objective is reducing the execution time of an application, increasing the number of operations per time unit. HPC tasks are characterized as needing large amounts of computing power for short periods of time. This paradigm better suites to time-critical applications, and it is used on climate prediction, fusion, etc. On the other hand, the main objective of HTC is increasing the number of executions per time unit, no matter the time of a single execution itself. As HPC, HTC tasks also require large amounts of computing, but for much longer times (months and years, rather than hours and days). HTC paradigm is used in High Energy Physics, Biocomputing, etc. HPC applications use shared or distributed memory computational resources connected with low latency networks (supercomputers, homogeneous clusters or GPGPUS) to minimize the process communications overhead. HTC applications use grids and clouds because the impact of process communication is less significant. As mentioned before, the problem to solve involved the evaluation of candidates from a huge search space. This problem fits better to a HTC paradigm because the expected overcome is the evaluation of the highest number of candidates at the same time. Besides, if the evaluation a set of candidates is a considered as a task, there is no dependency between two parallel tasks (a evaluation of a set of candidates does not depend on the evaluation of another set of candidates previously). Therefore, the problem could be considered Embarrassingly Parallel [10]. This kind of problems, which little or not effort is required to separate the problem into a number of parallel task are highly suitable to Grids, because of the heterogeneity of the computational and network resources. The adaptation of the algorithm to the grid was made by generating scripts to facilitate interaction with the middleware. The algorithm implementation was parametrized as described in the previous section so it was not modified. Also several scripts were generated to process the algorithm results. The GILDA training infrastructure based on glite GRID middleware was used to perform the production.
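Before presenting the production runs, the algorithm of Sect. 2.2 together with the fitness function of Eq. (3) can be summarized in a short sketch. How candidates are encoded against the probability vector is not detailed in the text, so the representation used below (an independent probability distribution over the prime index of each of the nine cells) is an assumption, as are the parameter values; the sketch is meant only to make the update step concrete, not to reproduce the implementation evaluated here.

    import random

    POP_SIZE = 100          # placeholder values; the productions below use their own settings
    BEST_CANDIDATES = 10
    LEARN_RATE = 0.1

    def fitness(sq):
        """DS of Eq. (3): squared deviations of the six row/column sums from their mean."""
        rows = [sum(r) for r in sq]
        cols = [sum(sq[r][c] for r in range(3)) for c in range(3)]
        m = sum(rows + cols) / 6.0
        return sum((m - s) ** 2 for s in rows + cols)

    def sample(prob, primes):
        """Draw a 3x3 candidate; prob[k] is a distribution over the prime index of cell k."""
        idx = [random.choices(range(len(primes)), weights=p)[0] for p in prob]
        return [[primes[idx[3 * r + c]] for c in range(3)] for r in range(3)]

    def pbil_step(prob, primes):
        """Generate a population, rank it by fitness and pull the probability
        vector towards the best candidates (positive learning only)."""
        population = sorted((sample(prob, primes) for _ in range(POP_SIZE)), key=fitness)
        for best in population[:BEST_CANDIDATES]:
            flat = [x for row in best for x in row]
            for k, value in enumerate(flat):
                j = primes.index(value)
                prob[k] = [(1.0 - LEARN_RATE) * p for p in prob[k]]
                prob[k][j] += LEARN_RATE
        return prob, population[0]

Starting from uniform distributions, pbil_step would be iterated until a candidate with DS = 0 and nine distinct entries appears or the iteration limit is reached.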
Several tests were executed, and it was measured that an algorithm execution configured with a matrix size of 3x3, the first 1,000 prime numbers, 10,000 individuals and 20,000 iterations as stop limit could take from a few seconds up to 1 hour when a magic square was found. At the beginning one job was launched per algorithm execution but, to maximize the CPU time of the allocated worker node, the scripts were modified to perform 10 algorithm executions per job. The glite job collection feature was used to launch the jobs and monitor their status. 5 Production and results As usual, a first production was performed to calibrate the algorithm. This configuration used the first 1,000 primes; therefore, by applying Eq. (1), the search space is of approximately 10^27 candidates. Table 1 shows the different configurations used and the results obtained from this production. In this production the mutation was deactivated in order to understand the steering capacity of this operator. Table 1. Results of the simpler production without mutation: 20 jobs, 1,000 primes and 100 best candidates. Population Square size Iterations Magic Squares The results obtained with the PBIL algorithm are very good: 159 magic squares were found, some of them duplicated. No best fitness is displayed because in all cases it is 0 (a magic square was found). In order to better tune the parameters of the algorithm, a second production was executed. Each run of this production was composed of 1,000 jobs. The results are shown in Table 2. The configuration column pattern is: NUM PRIMES, POP SIZE, SQUARE SIZE, ITERATIONS, MUT PROB, MUT IMPACT and BEST CANDIDATES. Analysing the results, the optimal configuration of the algorithm for this problem cannot be determined. Despite this, 7,175 magic squares were found, 3,979 of them unique. These results are satisfactory because all the magic squares found were constructed using non-repeated elements. In order to face the initial problem, a third production was executed, but this time the first 100,000 prime numbers were used, that is to say, from number 1 to 1,299,689. The rest of the configuration was similar to the previous production. The results are shown in Table 3.
323 IBERGRID 323 Table 2. Production executed with first 1,000 prime numbers. Jobs Configuration Magic Squares Table 3. Production executed with first 100,000 prime numbers. Jobs Configuration Magic Squares
324 324 IBERGRID For this production, 305 magic squares were found, 169 of them unique. To finish with 3x3 matrices, a final production was executed. This time all primes among 1 and 10,000,000 were included (664,580 prime numbers in total), what means that the search space size grew to an evaluation of candidates. The results are shown in Table 4. Table 4. Production executed with prime numbers between 1 and 10,000,000. Jobs Configuration Magic Squares In this production 770 magic squares were found, 714 of them unique. To sum up, after the execution of the previous mentioned productions (1,000, 100,000 and 664,580 prime numbers), 8,251 magic squares were found, 4,861 of them unique. In addition, some tests were performed with a magic square of size 4x4. Configurations were similar to previous ones, but only changing the square size. In this case, 592 jobs were executed fine, obtaining 148 magic squares, this time 97 of them unique. 6 Conclusion and future work This paper shows that, compared with the past results, the implementation of Population-Based Incremental Learning algorithm performs very well, producing a profusion of solutions. The generation of a population at each iteration based on the vector of probabilities, as opposed to mechanisms of mutation of the classical genetic algorithms, of higher random component, fits better to the problem to solve. Besides, the implemented solution can be easily adapted to other problems with a similar space search size. The profusion of satisfactory results makes us confidence on the usage of PBIL and magic square to be applied to real problems such as logistics.
325 IBERGRID 325 Finally, the following future work is proposed: Reducing the implementation complexity, optimizing the search for the best candidates and reducing the number of loops in order to speed up the implementation. Testing the performance of this algorithm under different fitness functions and parameter combinations. Verifying the efficiency of other algorithms, such as Hill Climbing with Learning (HCwL), Tabu Search, Particle Swarm Optimization (PSO), etc. All in all, it should be noted that the resolution of the problem this paper has dealt with which was part of a practice in the subject of Adaptation of Grid Applications for Image Processing of the Master in Grid Computing and Parallelism at University of Extremadura has proven itself useful not only for the field of research on magic squares, but also for the development of students abilities first, to face a previously unknown problem, secondly, to analyse it with the aim of determining the existing computing paradigm that better fits it and finally, for the enhancement of group work. Acknowledgements This work is part of the activities of the Master in Grid Computing and Parallelism from University of Extremadura. The authors wish to thank GILDA for providing resources and support without which this work could not have taken place. References 1. Al-Sharhan, S., Karray, F. and Gueaieb, W. Approach of optimizing computer networks using soft computing techniques. Proceedings of the International Conference on Software, Telecommunications and Computer Networks (SOFTCOM 01), (2001) 2. Baluja, S. Population-Based Incremental Learning: A Method for Integrating Genetic Search Based Function Optimization and Competitive Learning Technical Report (Pittsburgh, PA: Carnegie Mellon University) (CMU-CS ) (1994) 3. Baluja, S. An Empirical Comparison of Seven Iterative and Evolutionary Function Optimization Heuristics Technical Report (Pittsburgh, PA: Carnegie Mellon University) (CMU-CS ) (1995) 4. Baluja, S. and Caruana, R. Removing the Genetics from the Standard Genetic Algorithm, Morgan Kaufmann Publishers, pp (1995) 5. Bekker, J. and Olivier, Y. Using the Population-Based Incremental Learning Algorithm with computer simulation: some applications South African Journal of Industrial Engineering (2008) 6. Chen, L. and Petroianu, A. Application of PBIL to the optimization of PSS tuning International Conference on Power System Technology Proceedings, 2, (1998) 7. Cook, S. A. The complexity of theorem proving procedures, Proceedings, Third Annual ACM Symposium on the Theory of Computing, ACM, New York, pp (1971)
326 326 IBERGRID 8. Dudeney, E. Amusements in Mathematics, Dover, New York, Eberhart R.C. and Morgan Y.S. Computational Intelligence: Concepts to Implementations, Kaufmann Publishers, 467 (2007) 10. Foster, I. Designing and Building Parallel Programs, Addison-Wesley (1995). 11. Foster, I. and Kesselman, C. The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufman Publishers (2003) 12. Gosling, T., Jin, N. and Tsang, E. Population-based incremental learning versus genetic algorithms: Iterated prisoners dilemma. Technical Report, CSM-401. University of Essex, Essex, England (2003) 13. G. Andronico, R. Barbera et al, GILDA: The Grid INFN Virtual Laboratory for Dissemination Activities, First International Conference on Testbeds and Research Infrastructures for the DEvelopment of NeTworks and COMmunities, p (2005) 14. Larrañaga, P. and Lozano, J.A. Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers (2001) 15. Madachy, J. S. Magic and Antimagic Squares. Ch. 4 in Madachy s Mathematical Recreations, Dover, New York, pp , Michalewicz, Z. Genetic Algorithms + Data Structures = Evolution Programs, Springer (1999) 17. Weisstein, Eric W. Traveling Salesman Problem. From MathWorld A Wolfram Web Resource. (2011) 18. Weisstein, Eric W. Knapsack Problem. From MathWorld A Wolfram Web Resource. (2011) 19. Wells, D. Prime numbers: the Most Mysterious Figures in Math, Wiley, p. 16, USA (2005) 20. Yang, S. and Yao, X. Experimental study on population-based incremental learning algorithms for dynamic optimization problems Soft Computing - A Fusion of Foundations, Methodologies and Applications, 9(11), pp (2005) 21. glite - Lightweight Middleware for Grid Computing. (2011) 22. Several authors. Population-based incremental learning From Wikipedia. incremental learning (2011)
327 IBERGRID 327 WRF4G: simplifying atmospheric modeling experiment design in distributed environments V. Fernández-Quiruelas 1, L. Fita 1, J. Fernández 1, A.S. Cofiño 1 and J.M. Gutiérrez 2 1 Dept. of Appl. Math. & Computer Scie., Universidad de Cantabria, 2 Instituto de Física de Cantabria, (CSIC-UC) Abstract. Atmospheric science, and in particular the atmospheric modeling community, is a high computer-demanding research area. Grid Computing provides a powerful computing infrastructure, that is an alternative to supercomputing. European and National Grid infrastructures are ready to be used by the scientific community and, today, several disciplines such as High Energy Physics and Biomedicine are taking advantage of this technology. However, due to the complex workflow and the high computational requirements of climate models, the atmospheric science community is not currently taking a real benefit of the power of Grid infrastructures. WRF4G (WRF for Grid) is a flexible framework designed to manage the WRF workflow covering a wide range of experiments. It is a user friendly framework, that solves the complexity of the execution of the model workflow in local clusters and Grid environments. In this paper we review the computational requirements that atmospheric models pose when running in a Grid infrastructure. The improvements required by both WRF4G and production Grid infrastructures in order to attract the atmospheric science community to the Grid are analyzed. 1 Introduction Earth science applications and, in particular, climate and meteorological models pose a great challenge to the Grid in terms of computing and storage capacity. Climate models resolve numerically the governing equations of the atmosphere. The integration of these equations is CPU intensive and usually requires long wall-times and produce large amounts of data. Several application areas such as high energy physics or bio-applications have benefited for years from Grid technologies. Nowadays, computing power offered by the European and National Grid infrastructures provides an enormous amount of storage and computing resources difficult to reach by a single institution. This allows the research community to face new challenges that could not be achieved with traditional computing paradigms. At the same time, using a Grid infrastructure frees the researcher from administering, maintaining and updating his own resources. [email protected]
328 328 IBERGRID Weather Research and Forecasting model [WRF 1 SKD + 08] is a finite differences limited area model. It is one of the most popular models among the atmospheric science community and it is used all around the world. Its development is a collaborative effort of different US federal institutions. Regardless the infrastructure chosen to run WRF (single computer, cluster or Grid), the execution of the model is not a simple task. Before executing a simulation, some programs and processes have to be run. Depending on the kind of experiment to be performed, there can appear complex dependencies among these processes. This complexity makes really hard preparing a working WRF model ready to perform real experiments in any computing infrastructure. Thus, preparation of automatic management of WRF workflow is usually done for a single kind of experiment on a given cluster or supercomputer. Usually, once researchers succeed on configuring WRF for a specific purpose in a computer infrastructure, they find very hard migrate to other resources. WRF4G (WRF for Grid) is a framework that allows researchers to perform easily a wide spectrum of experiments with WRF taking advantage of all the computing resources available (Clusters, supercomputers or Grid infrastructures). It separates scientific tasks from infrastructure configuration. Unlike other initiatives, like [DS10], where the migration of WRF to Grid has been designed to perform only one kind of experiment, WRF4G allows running any kind of experiment. WRF4G provides an easy way to develop research activity on Grid infrastructures with the WRF model. Thus,teams using WRF4G, will benefit from both, the potential capacities of the WRF model and the powerful computational resources provided by the Grid technology. A given example of a community that could benefit from WRF4G, is the Red Iberica MM5 research network. This network is a Spanish coordinated activity that joins different institutions that work with the MM5 model (initial base of the WRF model). WRF4G would be an easy way to drive the research activities of these teams to the WRF model. Although WRF4G is used in production for running any kind of experiment in Cluster o supercomputers, due to some limitations in current Grid infrastructures (wall-times, disk quotas, MPI enabled sites, etc ), it is not spreadly used in Grid environments. Detecting and solving these issues will make easier the approachment of atmospheric science community to Grid infrastructures. This work presents the requirements that a Grid infrastructure should fulfill in order to properly run WRF. 2 Atmospheric models A limited area model model (LAM) is a numerical approximation of the atmosphere. It solves the dynamics of the atmosphere in a small region integrating the discretized version of the fluid dynamics equations. It requires an initial state of the atmosphere in order to start the simulation. Moreover it requires some boundary conditions in order to interact with the atmospheric changes that occurred outside the simulation domain. At the same time, a 3D representation of the simulation 1
329 IBERGRID 329 domain is also required. All these data is generated in the preprocess step. This preprocess is achieved by the consecutively execution of different programs (see figure 1 with a schematic representation of the WRF workflow). Fig. 1. Representation of the generic WRF workflow In general, atmospheric models are high demanding computational applications. They require high computational power since equations solved by the model are quite complex and are integrated simultaniously for a long period of time on a large amount of places (grid points). At the same time, they involve the management of a large amount of input and output data. The data used for the definition of the integration domain comes from global geographical databases, while initial and boundary atmospheric conditions are retrieved from spatial portions of a set of variables of the Global Circulation Models (GCMs). Resultant output simulations are entire 4-dimensional representations of the atmosphere and generated output have a large amount of variables (usually more than 30). The experiment domain size (number of Grid points) determine the memory requirements for a given experiment. Depending on the kind of experiment, some experiments may need the use of MPI (Message Passing Interface) to make a faster simulation and to be able to handle all the memory required. Due to the flexibility of the WRF model, users community is very heterogeneous, ranging from national weather services to small research groups. 3 WRF4G framework Due to the complexity of the WRF workflow, management of WRF experiments is a complex task often done by hand. Automatize this process involves running scientific software which often requires manage complex execution and data flows. Performing these tasks requires a deep knowledge of software engineering and system administration. As previously stated, usually, once researchers succeed on configuring WRF in a computing infrastructure for running a kind of experiment, they find very hard migrate to other resources. WRF workflow can be substantially changed depending on the experiment characteristics. These experiments specifications would need the execution of workflows of different nature such as: production, complex preprocess, complex execution and complex postprocess. Some experiments could even require a combination of some of them. WRF4G (WRF for Grid) is a WRF framework which tries to automate the workflow execution of WRF for a wide variety of experiments taking advantage at
330 330 IBERGRID the same time of several computational resources, including local machines, clusters, supercomputers and Grid infrastructures. It has the following characteristics: It manages seamlessly the WRF workflow overcoming all the possible failures, being able to re-initialized interrupted experiments in different computing resources. Users can also activate some features that facilitate experiment execution. For example, on the RCM experiments, where a single long-time simulation is run, a series of restart files are written on a given frequency. Simulation can be restarted in case of failures from these files. It provides a monitoring environment that allow users to track their experiments status. All the monitoring information is stored in a database that is fed by means of two monitors that run together with the model. It manages all the data flow. It allows to perform the following kind of experiments: forecast (single high resolution simulation), ensemble forecast (multiple simulations), Regional climate modeling (RCM, long, hundred years, single low resolution simulation), re-hindcast (multiple high resolution simulations), sensitivity tests (multiple simulations modifying some aspects of them), data assimilation (adding observations to the simulations) and coupling WRF model to other models (such as ocean, vegetation, chemistry) WRF4G has been designed as a collaborative effort between software developers and atmospheric science researchers. Thus, WRF4G satisfy requirements of both communities: it is modular and robustly designed and it encompasses a large variety of useful experiments for research activity on the atmospheric science, in contrast to other ports of WRF to Grid that are focused on a particular application of the model. Additionally, users can adapt the WRF4G workflow to their particular needs. A series of waypoints have been introduced in the general workflow of the model, where users can introduce their self-designed applications that fulfill the requirements of their experiments (see figure 2). Thus, WRF4G usability is not limited in any way. Fig. 2. Representation of the user interaction with the generic WRF4G workflow Although WRF4G is used in production status for running any kind of experiment in local machines, cluster and supercomputers, due to some limitations in
current Grid infrastructures (wall-times, disk quotas, MPI-enabled sites, etc.), it cannot yet be used in a satisfactory way in Grid environments. Given the huge community of WRF users, it would be very interesting that users could run their experiments on the European and National Grid infrastructures with the same facilities they have in cluster or supercomputer environments. This way, communities such as the Spanish Red Iberica MM5 research network could benefit from their countries' Grid initiatives. In Spain, this initiative is called IberGrid and integrates 25 centers/clusters with a total of cores and 5 PB of storage. The next section shows the issues found running WRF4G in the European and National Grid infrastructures based on glite and how some of these issues have been solved. 4 WRF4G in Grid environments The complex execution workflow of the WRF model, together with its high computational requirements, makes it impossible to run WRF experiments in a Grid infrastructure without a framework in charge of overcoming all the possible failures. In particular, limits in disk, memory and CPU usage often cause an abnormal termination of the simulations in the computing resources. In order to detect these issues, WRF4G contains agents that prepare the running environment and check the resource characteristics. If a resource does not meet the experiment requirements (it has low disk quotas or low memory limits), it is discarded. Furthermore, taking advantage of the restart capabilities of the WRF model, when a simulation crashes WRF4G is able to restart the simulation from the last restart data. WRF4G also provides a monitoring system that keeps track of every experiment and shows its status. In order to obtain the experiment status, two monitors (one for managing the data and the other for the execution) are run together with the model. One of the main issues not solved yet in WRF4G is the management of MPI jobs. Although some efforts are being made to improve the usability of MPI jobs in Grid with the MPI-Start scripts, WRF4G is not yet able to use them successfully. The distributed architecture inherent to the Grid also poses several challenges in terms of data management. While in supercomputers or clusters data management is performed through high-capacity networks, in the Grid all the data has to be transferred through the Internet. In some scientific disciplines, such as biomedicine, transferring the job's input and output data through the Internet is not a problem, but in atmospheric science data management is crucial. Common experiments may need gigabytes of input data and can generate gigabytes of output data that have to be transferred in a reasonable period of time (it would not make sense to spend more time downloading and uploading data than running the model). Efficient data management requires the use of replica catalogs that allow the data to be distributed intelligently among the SEs of a given infrastructure. In the first WRF4G prototype, this task was done through the LFC glite Catalog, but it proved not to be robust and reliable enough for
332 332 IBERGRID the atmospheric models requirements. For this reason, WRF4G provides a selfdeveloped data catalog that was designed with the aim of optimizing the data transfers. Apart from the issues mentioned before, the main operational problem found in glite based infrastructures is the configuration of the VOs in the resources. Although Grid infrastructures are huge, some VOs are only configured in a few sites. Moreover, although operational tests are run periodically in order to test the sites configuration, these tests do not check the VO configuration and often several sites present failures for some VO. In order to promote Grid among the atmospheric science community it is necessary to provide a tool that allows them to run experiments isn a Grid environment the same way they do in a local computer. This task has been achieved with less demanding experiments using other models using the same framework used in WRF4G [FQFn + 11]. Achieving with WRF4G the success obtained with other applications involves improving on one hand. the WRF4G framework (in some aspects such as the MPI capabilities) and on the other the Grid production infrastructures (in others aspects such as the VO configurations and the information systems, giving detailed information about the resources usage limits and quotas). Thus, the atmospheric science community will take Grid infrastructures into consideration when analyzing the possible computing solutions to perform their experiments. Bibliography [DS10] D. Davidovic and K. Skala. Implementation of the WRF-ARW prognostic model on the Grid. In MIPRO, 2010 Proceedings of the 33rd International Convention, pages IEEE, [FQFn + 11] V. Fernández-Quiruelas, J. Fernández, A.S. Cofi no, L. Fita, and J.M. Gutiérrez. Benefits and requirements of grid computing for climate applications. an example with the community atmospheric model. Environ. Modell. Softw., page accepted, [SKD + 08] W. C. Skamarock, J. B. Klemp, J. Dudhia, D. O. Gill, D. M. Barker M. G. Duda, X.-Y. Huang, W. Wang, and J. G. Powers. A description of the advanced research wrf version 3. NCAR TECHNICAL NOTE, 475:NCAR/TN475+STR, 2008.
333 IBERGRID 333 GSG Tool: General Scripts on Grid Creating generic scripts for problems on Astrophysics J.R.Rodón 1 A.D.Benitez 1 M.Passas 2 J.Ruedas 1 J.Ortega 3 1 Grupo Grid. Departamento de Centro de Cálculo. Instituto de Astrofísica de Andalucía. Granada. Spain 2 Departamento de Sistema Solar. Instituto de Astrofísica de Andalucía. Granada. Spain 3 Departamento de Arquitectura y Tecnologa de Computadores. E.T.S. de Ingenieras Informtica y de Telecomunicacin. Universidad de Granada. Granada. Spain Abstract. The work of Grid Group of the Institute of Astrophysics of Andalucía (IAA) is mainly focused in software simulations, data analysis observatories or image filtering. We have noticed that Grid users think that this infrastructure is very difficult to use and it is not 100% safe because of its heterogeneity. Users must know many command lines and how to use them all the time. They will use them if they want to submit jobs, test the job status, handle data or monitor the Grid infrastructure. A great effort is necessary in order to learn and to implement each command. To solve these problems we have created a set of tools that synthesize the main commands used in the glite middleware. We have analyzed the use cases, designing, implementing and testing a set of shell scripting language to make the sending, monitoring and data collecting of large batches of data from the Grid infrastructure easier. These powerful general modules can be modified for specific cases. The work we have developed throughout the last two years can be summarized in the creation of a set of general tools for data managing, jobs sending, status monitoring and results collecting on the Grid infrastructure. As extra work, our group has modified these tools to adjust them to specific astrophysics cases, adapting these modules to port all the astrophysical applications to the Grid Infrastructure. This effort was rewarded when astrophysical scientists improved their expectations to use grid. So, there are two examples of solved problems using ours modules. The first one is HEALPIX (Hierarchical Equal Area isolatitude Pixelization of a sphere) and the second one is DDSCAT (The Discrete Dipole Approximation for Scattering and Absorption of Light by Irregular Particles). The general modules and their descriptions are: prepare creates an input data structure containing all necessary files for sending a group of jobs to the Grid platform. run sends Grid infrastructure jobs created by the module prepare; verify checks the status of the submitted jobs and if they are in a wrong state (CANCELLED or ABORTED) resubmit them, thus the user does not have to worry about corrupted work; cancel cancels the group of jobs created by prepare module. This module is not a mandatory of corresponding author:[email protected]
334 334 IBERGRID but it is very useful. A cron module is necessary to take advantage of the former ones more efficiently. So, the end user should not be worry about the state of the work nor the location of the final files. Cron file is created by the module prepare, it is launched by the module run and it runs the module verify very often. If the job fails, the job will be submitted again. If the job is DONE, the module verify will be submitted and the results will be put in their proper directory. 1 Introduction Scientific projects early this century approach large ambitious objectives. For this reason, there are computing and storage needs in order to solve complex numerical problems. Grid technology provides a distributed computing that combines the power of many machines in order to obtain an unlimited capacity of both computation and storage. Grid Computing allows sharing geographically distributed resources for solving large scale problems in a uniform, secure, transparent, efficient and reliable way and provides services to many research fields such as astrophysics, among others. Therefore, scientists have adopted the technology as a possible solution to their needs. Services provided by current Grid middleware are usually presented to users via command line (Command Line Interface, CLI). This is a non-trivial for the users because they have to develop the entire process of submitting jobs to the infrastructure. There are dozens of different commands with many options and rigid sequences, and even the job description has an specific language (Job Description Language, JDL). All this has a negative effect on the use of infrastructure because that discourages many potential users due to the large amount of time they have to invest to use it. The toolkit GSG (General Scripts on Grid) arises from the need for users to use the Grid infrastructure without having to waste time learning and using of this technology. This tool promotes the use of Grid in communities that until now have not dared to use it for lack of time to use this technology. The portability of applications to the Grid environment and GSG tools adaptation to each of the applications is derived from the Grid site administrator or support team for each of the sites, leaving free of charge to scientists who can devote his time to the investigation. The motivation behind the creation of these tools is to facilitate the use of grid technology to users, thus avoiding the cost of learning time with this environment. A faster job submission and minimize errors sending jobs to the Grid platform are achieved with a simple, easily manageable and user-friendly tools. GSG tool has been developed to achieve the following objectives: To facilitate the use and access to different Grid infrastructures scientific communities in CLI environments.
335 IBERGRID 335 To reduce the learning, development, submit and collection work time. This reduction is remarkable because the tool has all the knowledge of the use of infrastructure and the user does not need to write commands constantly. 2 GSG Tool This tool consists of a set of modules developed under shell script language. Thus it is very easy to maintain and expand its functions. The tool has five basic modules: 1. prepare. This module prepares the working area where the input and output file will be placed as described in section 4. The script creates the JDL file that defines the work to be submitted for each job. This module will be modified for use in each application while maintaining a similar overall structure. The input parameters of this module are divided into two groups depending on their general or specific character: General Input Parameter: The parameter working area contains the name of the directory in where is contained all input files and settings necessary for the execution of scientific application. Module prepare creates a directory system whose base directory is named as designated by the parameter working area.the optional parameter flag compile indicates the module that the application executable to be run in the Grid contained in the Computing Element (CE) destination. Thus, the module will send the file to the Grid application binaries for subsequent local build in each of the Worked Nodes (WN). This process will make a slower total run time on each WN, so it is recommended to use CE only where it is known that the executable is not pre-installed. Specific input parameters: Depending on the nature of the application to be run, certain information will be needed to know such as file name, input parameters, etc. The number of such parameters and its usefulness depends on each treated application. Moreover, the module also needs some input files, as parameters, differs in two groups depending on their general or specific character: General input files. The module will receive a configuration file called config.def containing the configuration of the GSG tool. A module with the command is also needed to execute remote machine within each Grid infrastructure. Its name is start. This module must be written and provided by each investigator and is different for each application. Specific input files. The input files depend on the characteristics of the applications.
336 336 IBERGRID The main task of this module is the creation of JDL file for each job. It tests the existence of a working directory with the same name that is passed in the input parameters. If this is the case, the module alerts the user indicating that the directory has to be renamed. The module will split the general problem into pieces have to be processed by a job submitted to the Grid. A specific directory will be created for each job. The input files and the necessary implementation (start module and JDL file) will be copied in it. This will send every piece of work as an independent job. Finally, some files will be packaged and uploaded to the SE using this module. These files are large input data files and source code files in order to compile the application if it is necessary. 2. run. This module submit all jobs prepared by the prepare module to the Grid. This module is identical for all the applications. The input parameter is the working area directory where all the jobs that the users want to submit are placed. The tasks of this module are to test if there is a working area directory which a job is executed. By means of the command submit, the module will send the jobs to the Grid environment and also creates a set of logs with the information in these posts: submitted jobs and submitted count. The module will create a job id file to make it possible that each job submitted could be used by the rest of modules. It will also include a row in the tested job.log file to periodically test the jobs (see Section 6). 3. verify. This module tests the status of the job submitted to Grid. If the jobs have been aborted they will be cancelled and submitted again. If they are DONE the module picks up the results and submits them to the working area directory in the User Interface (UI) machine. This module will be modified for any given application because the output data of each application can different, although the general structure is similar.the input parameter is the working area name. The working area directory contains the jobs to be submitted. This module tests each job according to the situation: Existence of the working area directory. If this directory does not exist, the module will report the user about the situation and will not continue executing. Existence of the job id directory. This file is necessary to each job submitted to Grid. If this file does not exist it will be sent again and a new job id file will be created. Previously finished job. Job testing carried out by this module is periodic thanks to the cron described in section 6. If a job has finished and the verify module has processed them, the module will just report the user that the job has been processed correctly before (skipping). Job finished with DONE. At first, the module tests if the output files size is less than the written in the configuration file, or checks if the file does
337 IBERGRID 337 not exist. In this case the module will understand that the job has failed and it will be resubmitted. If none of these circumstances have taken place the result will be included in its output directory, either by sandbox or by transferences from the SE. Job finished with ABORT or DONE (Failed). In these cases the module will cancel the submitted jobs and they will be resubmitted again. Job in READY, SCHEDULED or WAITING status. A request counter will be used, and if the status remains the same for N times (described in the config.def file), the job will be cancelled and resubmitted again. This is done so that the jobs change their status whenever a Workload Management System (WMS) or connection fail happen. Job in RUNNING status. The module will do nothing in this case. The module creates a set of information files for each working status. It will contain the jobs with that status and the total number of them. These log files will be stored in the logs directory in each working area. 4. cancel. This module cancels all the jobs associated to a Worked Node (WN). It is very useful if an error configuration or corrupted input files are detected once all the jobs are submitted. This module is the same for any other application. The input parameter is the working area name. This working area directory contains all the jobs to be sent. The module executes the cancel command described in the configure file. The module removes the command line in the tested job.jdl file which will use the cron file (see Section 6). The module will remove the structure created by the prepare module too. 5. start. This module is executed in each WN launching the user application. It must be passed as an input parameter in the JDL file. This was done by the prepare module. This module will be modified to each application use because the output data of each application are different, although the general structure is similar. The Input parameters of this module will be test name, input name and n job, and they will be used to name correctly each job output files. The Input and output files necessary to execute the module will be the typical of any user application. These files have to be passed to the WNs, either through Sandbox system or SE files system. The task carried out by this module is the input files preparing. So, necessary application input files are downloaded from the SE, they are extracted if they are compressed, and finally they are placed in the right location. Whenever it is necessary to compile binary files, the module will compile the source code previously. Once the module execution is finished, the module will compact all the generated files by the application to upload them to the SE. Finally the module will delete the temporary files generated by the application.
338 338 IBERGRID 6. config.def. This is the name of the GSG tool configure file. This file is located in each working area directory. The user can modify and locate it in the input directory in the base directory, so the prepare module can place it properly. The file defines the next variables: local dir is the directory where all modules are found. lfc host is the host address where Grid files catalog is found. lfc folder is the file catalog directory. se host is either the SE or the Group of SEs that are going to be used. If it is empty, no SE will be selected. vo is the Virtual Organization that will be used in the work sending. working home is the directory where all the application files are found. max attempts is the maximum number of job status attempts that returns the same status. If this number of attempts is overflown, the job will be cancelled and submitted again. Depending on the middleware, other variables have to be defined to complete the typical Grid commands: submission command is the command to submit jobs. get output command is the command to pick up the jobs. status command is the command to know the job status. cancel command is the command to cancel the jobs. logging info command is the command to display each jobs logs. 3 How to 1. Execution of prepare This point sets the steps to be followed by the user to use the GSG tool adapted to the application. Firstly we go to the base directory. The user must create a directory named as the working area in the inputs subdirectory. The user will store all the input files necessary to the application execution and include a file called config.def which contains all the necessary parameters to the grid jobs execution (see Section 2). The Input directories must be adequately modified by the user. The other directories will be used to execute the modules or to store other information (logs). Once all the necessary files are located in the right place, the user will use the prepare module from the scripts directory in the base directory. This tool will chop the problem and each piece to be sent as a job. A JDL file will be created for each job, a directory structure will be also created (see Section 4) and the GSG tool will be managed with, logs and job output results files included. There will be also included a job execution module called start in each jobs directory.
339 IBERGRID 339 Fig. 1. GSG Application working mode 2. Execution of run Once the structure of directories is prepared, the user will send the Jobs Collection to the Grid infrastructure through the corresponding command (depending on the middleware you are using). To do this, the user will use start the script and the input files to the application. This will generate a job id file for each sent job and be stored in the working area structure. This file is necessary to run the following modules. Finally, the tool will add a new input in the tested jobs.log file, located in the user s home directory. This file is used by the cron task to check periodically its status and to know the active job collection. 3. Execution of verify Once the user s jobs are submitted, the module verify can check the job status. Depending on the job status, the tool will execute a set of instructions so the user does not need to do anything if the job is corrupted, because this module resubmits the job automatically. The user does not need to use the Grid commands to download the results. The tool will locate correctly all output files and logs in the working area structure (see Section 4). This module
340 340 IBERGRID shows the status of all the working area jobs too and it will be periodically called by the cron task, making transparent the users testing if they want to. This tool generates logs and shows the job status, by grouping them according to their status: RUNNING, ABORTED, WAITING, SCHEDULED, READY or DONE. When all the working area jobs have finished, this tool will send an to the user s address located in the configuration file, so the user knows that the job has finished. If the long proxy (myproxy) expires and the jobs have not finished successfully, the tool will send an to the user noticing which jobs have finished and which ones have not. 4. Execution fo cancel This module cancels all the jobs associated to a working area avoiding the user to cancel them one by one. It also removes the necessary line to the working area jobs periodic test to the specific cron task (see Section 6). 4 Directories structure The general structure of directories starts of a base directory called Base Directory. This directory has all the GSG tool structure to use it in the Grid platform. The Base Directory has three basic subdirectories (inputs, scripts and logs) and one subdirectory associate to each jobs collection called working area. 1. Inputs This directory has many subdirectories as the number of working area directories exist. The directory name is the same as working area generated by the prepare and the Users of the application must modify only the corresponding inputs directory. Each subdirectory has all the necessary input files to use the application on Grid (Input data, parametric files, configure files, and so on). This directory could have a compressed file which contains application binary files. If the Grid node does not have the pre-installed application, the binary files will be compiled to use the application. To use the binary files, the prepare module have to indicate wether or not it is necessary a pre-compilation by means of an optional flag called /compile. Finally, the directory has a configure file to use the GSG tool (see Section 2). 2. Scripts This directory contains all modules which will be used by the job collection management. The modules are invoked from this directory. 3. Logs This directory contains the general logs generated by the GSG tool cron. 4. working area This directory is automatically generated by prepare module. It contains all files necessary to jobs submit, logs and output files used by the GSG tool. The directory has a log subdirectory with all general logs (see Section 5). In the other hand, the directory has as subdirectories as jobs submitted to Grid. They are called job target N. The number of directories depends of
341 IBERGRID 341 the number of parts into which the general problem has been divided. Fig. 2. Directories Structure 5 Logs With the GSG tool, users and site administrators have better control over the jobs submitted to the Grid and over the status of the modules. This logging system is a good complement to the Grid service logs generated by the Grid middleware commands. The GSG tool log system has three layers: The first layer contains the log generated by the GSG cron module (see Section 6). It is called cron verify.log and holds information about the cron status and the active working areas. The second layer contains the general logs of each working area, generated by the verify module or by the cron module. The last layer is the most specific one: it contains the particular logs of each job.
342 342 IBERGRID Fig. 3. Logs scheme 6 Cron The cron verify module is a useful application that periodically checks the status of the submitted jobs. Thanks to this cron module, users do not have to check their job status manually with the verify module. The cron checks are scheduled with a random offset, and a testing period of half an hour has been estimated to optimize the checking. The cron system behaves as follows: It periodically invokes the verify module, passing as an argument each line (working area) contained in the tested job.log file; the verify module is then executed, creating the corresponding files and logs. When run is executed, the working area path is added as a new line to the tested job.log file, which the cron task queries. If ALL the jobs of a working area have finished, the corresponding line is deleted and the cron module stops checking that working area.
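As an illustration, the periodic check could be wired to a standard crontab entry such as the one below. This is only a minimal sketch: the paths, the script names (cron_verify), the ALL_DONE marker and the exact layout of tested_job.log are assumptions for the example and not the actual GSG file layout.

    # hypothetical crontab entry, matching the half-hour period mentioned above
    # */30 * * * * $HOME/gsg/scripts/cron_verify >> $HOME/gsg/logs/cron_verify.log 2>&1

    #!/bin/bash
    # cron_verify: run verify for every active working area listed in tested_job.log
    LIST="$HOME/gsg/logs/tested_job.log"
    [ -f "$LIST" ] || exit 0
    while read -r working_area; do
        "$HOME/gsg/scripts/verify" "$working_area"
        # drop the line once every job of the working area has finished
        if [ -f "$working_area/logs/ALL_DONE" ]; then
            grep -vF "$working_area" "$LIST" > "$LIST.tmp"
            mv "$LIST.tmp" "$LIST"
        fi
    done < "$LIST"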
343 IBERGRID 343 Fig. 4. Cron file scheme 7 Impact The impact of this work covers all the areas for which these scripts have been adapted. The research departments involved are Extragalactic Astronomy, Stellar Physics, Radioastronomy and Galactic Structure, and Solar System of the Instituto de Astrofísica de Andalucía. The work has had a relevant impact in the areas of astrophysics where it has been tested. In general terms, the astrophysical applications launched an average of 1000 jobs per run. By using the Grid and parallel jobs, the improvement in time ranges between 40% and 80% depending on the degree of parallelization; this improvement comes from the use of the Grid itself and not from the scripts. With respect to the number of commands, however, the improvement does come from the implemented scripts. Consider, for example, a group of one hundred standard jobs handled without the scripts: the user would need about 1000 command lines to prepare the structure, 100 to submit the jobs to the Grid infrastructure, 200 to check the jobs and resubmit the failed ones, 500 to collect the results and place them in the correct output structure, and 100 to remove the groups of jobs. In total, the user would have to type about 1900 command lines without the scripts, whereas with the scripts only five commands are needed, a reduction of more than 95%, in addition to the improved usability of the Grid infrastructure for large groups of jobs. Without the cron task the user would also have to review the job status continually; the time spent on this task is significantly reduced using this tool. 8 Conclusions and future work The usefulness of scripts that support scientists in their work has become evident. The scripts are not only a tool for massive use of the Grid;
344 344 IBERGRID they are also a tool for monitoring and fixing bugs. Using these scripts, users save about 95% of the time spent when jobs are sent to the Grid infrastructure, and they do not have to train in or learn the commands needed to use the Grid. These scripts are ideal for launching groups of jobs with large computing requirements. Scientists using the scripts developed by our group find that sending jobs to the Grid infrastructure has become much easier, since they do not need to know the basic Grid commands. As future work, we suggest adapting all the applications that use our node to the GSG toolkit, and modifying the tools so that they also support parallel work with MPI. 9 Acknowledgements The authors would like to thank Luis Bellot and Dominika Dabrowska (both from the Solar System Department, IAA-CSIC) and Mattia Fornassa (from the Extragalactic Astronomy Department, IAA-CSIC) for providing the astrophysical applications for gridification. This work was supported by the GRID-CSIC Project (200450E494). References 1. IAA Grid-CSIC web 2. HEALPix: 3. DDSCAT: draine/ddscat.6.1.html
345 IBERGRID 345 Simulation of batch scheduling using real production-ready software tools Alejandro Lucero 1 Barcelona Supercomputing Center Abstract. Batch scheduling for high-performance cluster installations has two main goals: 1) to keep the whole machine working at full capacity at all times, and 2) to respect priorities, avoiding lower priority jobs jeopardizing higher priority ones. Batch schedulers usually support different policies, each with several tunable variables. Other features, such as special job requests, reservations or job preemption, increase the complexity of achieving a fine-tuned algorithm. A local decision for a specific job can change the scheduling of a large number of other jobs, and what seems reasonable in the short term may make no sense over a long trace measured in weeks or months. Although it is possible to extract the algorithms from batch scheduling software and simulate large job traces with them, this is not the ideal approach: scheduling is not an isolated part of this kind of tool, and replicating the same environment requires a substantial effort plus a high maintenance cost. We present a method for obtaining a special mode of operation of a real, production-ready scheduler, SLURM, in which we can simulate the execution of real job traces to evaluate the impact of scheduling policies and policy tuning. 1 Introduction The Simple Linux Utility for Resource Management (SLURM) [1] has been in use at the Barcelona Supercomputing Center (BSC) [2] for several years. Initially designed as a resource manager for job execution on large Linux cluster installations, today it also offers extended job scheduling features. This software is a key tool for the successful management of a machine as large as MareNostrum, a cluster of some 2,500 nodes installed at BSC. The first SLURM versions supported a very simple scheduler, a First Come First Served algorithm. Recent versions improve scheduling with a main algorithm based on job priority, which allows different quality-of-service queues to be defined, with users granted access to specific ones. A second, configurable scheduling algorithm is available in SLURM: the backfilling algorithm. It can work together with the main scheduler, and its goal is to let lower priority jobs use free resources as long as no higher priority job is delayed. SLURM offers other features such as reservations, special job requests and job preemption. It is worth mentioning that SLURM scheduling is fully dependent on how resources are managed. For a better understanding of this relationship, the scheduling should be thought of as a two-step process: [email protected]
346 346 IBERGRID 1. A first step selects a job whose requirements can be granted, based purely on the scheduling algorithm. At this point a job has been selected and there are enough resources to execute it. 2. A second step selects the best nodes for job execution, trying to use resources in such a way that the job requirements are satisfied and no more resources than strictly needed are wasted. For example, jobs can ask for an amount of memory per node, so resource management needs to know which nodes have that amount of memory and, if other jobs are already running on those nodes, how much memory is currently available. Resource management currently covers CPUs/cores/threads and memory, but it would be possible to manage other resources such as disk, network bandwidth or memory bandwidth. (A simplified sketch of this two-step selection is given at the end of this section.) Taking SLURM as the batch scheduler on which the simulation software will be based, it would be possible to extract the SLURM code related to scheduling and work with it. However, as we have seen above, scheduling is tied to resource management, so the resource management code would have to be extracted as well if we really want to work with a real scheduler. Even assuming this task is feasible without a large effort, other components would be needed to use this code: job submission, support for special job requests, reservations, node control, control of job execution, job preemption, and so on. For a valid simulation all of these possibilities should be contemplated, therefore extracting just the scheduler code from SLURM is not enough. It would be possible to build a simulation bench based on SLURM code, but there are two main problems with this approach: 1. It would require an important effort to implement all the components needed. 2. It would have a high maintenance cost for every new SLURM version. We present a solution for job trace execution that uses SLURM itself as the simulation tool, with minor SLURM code changes. Our design goal is to hide the simulation mode from SLURM developers and to have everything SLURM supports available in simulation mode. We think the effort should be done once; new SLURM features in future versions will then be ready for simulation purposes with minor effort, making simulation maintenance easier. The rest of the paper covers related work, design, implementation and preliminary results.
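The two-step selection described above can be pictured with the following conceptual sketch. This is not SLURM code: the data structures, field names and the "smallest sufficient node first" placement rule are illustrative assumptions used only to make the separation between job selection and node selection concrete.

    #include <stdbool.h>

    /* Illustrative job and node descriptions (not SLURM's internal structures). */
    typedef struct { int priority; int nodes_wanted; long mem_per_node; bool runnable; } job_t;
    typedef struct { long mem_free; bool idle; } node_t;

    /* Step 1: pick the highest-priority runnable job whose request can be granted. */
    static int pick_job(const job_t *jobs, int njobs, const node_t *nodes, int nnodes)
    {
        int best = -1;
        for (int j = 0; j < njobs; j++) {
            if (!jobs[j].runnable) continue;
            int fitting = 0;
            for (int n = 0; n < nnodes; n++)
                if (nodes[n].idle && nodes[n].mem_free >= jobs[j].mem_per_node) fitting++;
            if (fitting >= jobs[j].nodes_wanted &&
                (best < 0 || jobs[j].priority > jobs[best].priority))
                best = j;
        }
        return best;
    }

    /* Step 2: choose concrete nodes for that job, wasting as little memory as
     * possible (trivially: the smallest node that still satisfies the request). */
    static int place_job(const job_t *job, node_t *nodes, int nnodes, int *chosen)
    {
        int picked = 0;
        while (picked < job->nodes_wanted) {
            int best = -1;
            for (int n = 0; n < nnodes; n++)
                if (nodes[n].idle && nodes[n].mem_free >= job->mem_per_node &&
                    (best < 0 || nodes[n].mem_free < nodes[best].mem_free))
                    best = n;
            if (best < 0) return -1;   /* should not happen after step 1 */
            nodes[best].idle = false;
            chosen[picked++] = best;
        }
        return 0;
    }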
347 IBERGRID Related Work Process scheduling is an interesting part of operating systems and a focus of research papers for decades. It is well known there is no perfect scheduler covering whatever the workload and in specific systems like those with real time requirements scheduling is tuned and configured with detail. Batch scheduling for supercomputers clusters shares same complexity and problems as process or I/O scheduling presented at operating systems. Previous work trying to test scheduling algorithms and scheduling configurations has been done as in [3] where process scheduling simulation is done at user space copying scheduler code from Linux kernel. That work suffers from one of the problems listed above: maintenance is needed to keep same scheduler code at user space. Another work more related to batch scheduling for supercomputers is presented at [4] where it is well-described the importance of having a way of testing scheduling policies. But this work also faces problems since simulator code replaces real scheduler and wrappers for resource management software are specifically built, so simulation is done assuming such an enviroment being realistic enough. Other works [5] use CSIM [6] modelling batch scheduling algorithms with discrete events. Although this is broadly used for pure algorithm research it requires an important effort and it is always difficult to reproduce the same environment what can lead to understimate other parts of real scheduler systems like resource management. Research using a real batch scheduler system has been done [7] but without a time domain for simulation: assumption is made for dividing per 1000 time events obtained from a job trace. In this case batch scheduling software runs as usual but every event take 1/1000 of its real time. This approach has some drawbacks as for example what happens with jobs taking less than 1000 seconds to execute. Clearly it is not possible to simulate a real trace and it is important to notice small changes of job scheduling can have a huge impact on global scheduling results. It is worth to mention Moab [8] cluster software which is currently used at BSC along with SLURM. Moab takes care of scheduling with help from SLURM to know which are the nodes and running jobs state. Moab has had support for a simulator mode since initial versions although it has not always worked as marketing said. In fact, this work was started to overcome the problem trying to use this Moab feature in the past. Being Moab a commercial software there is not documentation about how this simulation mode was implemented. However, we got some information from Moab engineers commenting how high was the cost of maintaining this feature with new versions. We have not found any work using the preloading technique with dynamic libraries for creating the simulation time domain and neither using this approach with such a multithread and distributed program as SLURM.
348 348 IBERGRID 3 Simulator Design SLURM is a distributed, multithreaded software package with two main components: a slurm controller (slurmctl) and a slurm daemon (slurmd). There is just one slurmctl per cluster installation and as many slurmd instances as computing nodes. The slurmctl tasks are resource management, job scheduling and job execution, although in some installations SLURM is used together with an external scheduler such as MOAB. Once a job is scheduled for execution, the slurmctl communicates with the slurmd instances belonging to the nodes selected for the job. Behind the scenes there is a lot of processing to allow job execution on the nodes, such as preserving the user environment at submission time. Scalability was a main goal of the SLURM design, so it supports thousands of jobs and thousands of nodes/cores per cluster. When slurmctl needs to send a message to the slurmds it uses agents (communication threads) to support heavy peaks of use, for example when signaling a multinode job to terminate, getting node status or launching several jobs at the same time. Besides these threads, created for a specific purpose and with a short life, there are other main threads such as the main controller thread, the power management thread, the backup thread, the remote procedure call (RPC) thread, and the backfilling thread. In the case of slurmd there is just one main thread waiting for messages from the slurmctl, the equivalent of the slurmctl RPC thread. On job termination slurmd sends a message to the slurmctl with the job id and the job execution result code. Figure 1 shows the SLURM system components. Along with slurmctl and slurmd there are several programs for interacting with slurmctl: sbatch for submitting jobs, squeue and sinfo for getting information about jobs and cluster state, scancel for cancelling jobs, and some other programs not shown in the picture. Fig. 1. Slurm Architecture From a simulation point of view the task done by slurmd is simpler, since jobs do not need to be executed; we only need to know how long they will take, information we get from the job trace. A simple modification of slurmd for simulation mode supports this easily, just sending a message when a job finishes, based on how long ago it was launched. The key is how to know the current time, since in simulation mode we want to execute a trace of days or months in just hours or minutes. Another issue is supporting simulation of thousands of nodes: it is not necessary to have one slurmd per node, but just one super-slurmd keeping track of the jobs launched
349 IBERGRID 349 and job duration. No change is needed for this super-slurmd since SLURM has an option called FRONTEND mode which allows this behaviour although for other purposes. Although slurmd by design can be in a different machine than slurmctl, for simulation this is not needed. For slurmctl no changes are needed. However it is a good idea to avoid some threads execution related to power management, backup, slurm global state save and signal handling, which make no sense under simulation mode. Also, during a normal execution, SLURM takes control of all computational nodes running a slurmd. Periodically the slurm controller pings the nodes and ask for current jobs being executed. This way SLURM can react to nodes or jobs problems. Under simulation we work within a controlled environment so no problem is expected, therefore that funcionality is not needed. As in the case of slurmd, slurmctl execution is based on time elapsed: both main scheduler algorithm and backfilling are infintite loops executed within a speficic period; a job priority is calculated based on fixed parameters dependent on user/group/qos, but usually priority increases proportionally to job s time queued; jobs have a wall clock limit fixing maximum execution time allowed for the job; reservations are made for future job execution. In simulation mode we want to speed up time therefore execution of long traces in a short time can be done for studying best scheduling policy to apply. Our design is based on the LD PRELOAD [9] functionality available with shared libraries in UNIX systems. Using this feature it is possible to capture specific function calls exported by shared libraries and to execute code whatever the necessity. As SLURM is based on shared libraries we can make use of LD PRELOAD to control simulation time, so calls to time() function are captured and current simulated time is returned instead of real time. Along with time() there are other functions time related used by SLURM which are wrappered: gettimeofday() and sleep(). Although there are other functions time related, like usleep() and select(), SLURM uses them for communications what is not going to affect simulation. Finally, as SLURM design is based on threads, and calls to time functions are made indistincly by all of them, some POSIX threads related functions are wrappered as well. Figure 2, shows how simulator design works: a specific program external to SLURM is created, sim mgr, along with a shared library, sim lib, containing wrappers for time and thread functions. During initialization phase, sim mgr waits for programs to control: slurmctl and slurmd. Once those programs are registered (executing init function related to sim lib, 0a and 0b in the figure), every time pthread create is called, wrapper code at sim lib registers the new thread(1a and 4b1). In the figure, lines with arrows represent simulation related actions, with first number at the identifier representing the simulation cycle when that action happens. The letter at the identifier is just to differenciate the two threads shown,
350 350 IBERGRID and the second number indicates the sequence of actions within the same cycle. Thus the slurmctl thread shown was created at simulation cycle 1 and the slurmd thread at simulation cycle 4. Fig. 2. Simulator Design Inside the pthread wrappers there is a call to the real pthread functions; this is not true for the time wrappers, which fully replace the real time functions. Registration links a thread to a POSIX semaphore used for simulation synchronization, and it also links the thread to a structure containing the sleep value for that thread and other fields used by the sim mgr. A simulation clock cycle is equivalent to one second of real time, and in each cycle sim mgr goes through an internal thread array working with the registered threads. There are two classes of threads from the sim mgr point of view: those that make sleep calls and run through several simulation cycles (and some through the whole simulation); sim mgr detects such a sleepy thread and keeps a counter to wake it up when necessary. those that are created, do something and exit within the same simulation cycle.
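The shared state that registration creates for each thread can be pictured with the sketch below. It is only an assumed layout consistent with the description in the text (per-thread sleep counter, a pair of POSIX semaphores, a 64-slot limit and a global simulated clock); the real sim_mgr/sim_lib structures may differ.

    #include <pthread.h>
    #include <semaphore.h>
    #include <sys/types.h>
    #include <time.h>

    #define SIM_MAX_THREADS 64   /* current sim_mgr limit mentioned in the text */

    typedef struct {
        pthread_t real_id;       /* value returned by the real pthread_create()   */
        pid_t     pid;           /* owning process (slurmctl or slurmd)           */
        int       in_use;        /* slot allocated                                */
        int       sleeping;      /* thread is currently parked in sleep()         */
        int       sleep_left;    /* simulated seconds of sleep still pending      */
        sem_t     wake;          /* sim_mgr posts here to wake the thread up      */
        sem_t     back;          /* thread posts here when it sleeps again/exits  */
    } sim_thread_slot;           /* semaphores created with sem_init(pshared=1)   */

    typedef struct {
        volatile time_t sim_time;            /* simulated clock read by the time wrappers */
        int             proto_threads;       /* pthread_create() calls waiting for a slot */
        sim_thread_slot slot[SIM_MAX_THREADS];
    } sim_shared_area;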
351 IBERGRID 351 Back to figure 2, thread A does a sleep call at simulation cycle 3 (3a) and sim mgr wakes up the thread at simulation cycle 4 (4a). So we know it was a sleep call for just one second. At that point sim mgr waits till that thread does another sleep call or maybe a call to pthread exit. The main threads in slurmctl have this behavior: infinite loops with a frequency defined by thread. The other type of threads is shown with the B thread in the figure. This thread is created at simulation cycle 4 (4b1) and at the same cycle it calls to pthread exit (4b2). When sim mgr needs to act with that thread there are two situations: the thread did not finish yet or it did finish. In the first case sim mgr waits till the thread calls to pthread exit. In the second one sim mgr just sets the array thread slot free for new threads. 4 Simulator Implementation A main goal for the simulator is to avoid SLURM modifications or, if necessary, to make those changes trasparent for SLURM developers. The ideal situation would be simulator feature going along future SLURM versions smothly just with minor maintenance. If this turns out to be an impossible task a discussion should be opened at SLURM mailing list exposing the problem and studying the cost. Just if the simulator becomes a key piece for SLURM, could it make sense major changes of core SLURM functionality. Simulator design implies to implement two pieces of code outside SLURM code tree: the simulation manager and the simulation shared library. The simulation manager, sim mgr, keeps control of threads and simulation time, waiting for threads to terminate or waking up threads when sleep timeouts requested by those threads expire. The simulation lib captures time related calls thanks to LD PRELOAD functionality and synchronize with sim mgr for sleep calls or getting simulation time. This design is quite simple for a single process or thread but it gets more complex for the multithread and distributed SLURM software. Sim mgr needs a way to identify threads and to control each one independently, therefore sim lib implements wrappers for some pthreads functions like pthread create, pthread exit and pthread join, which give information to sim mgr when they are called. A first implementation with sockets inside those pthread wrappers was made but it made sim mgr less intuitive to understand and debug. A second implementation was done using POSIX shared memory containing global variables and per thread variables. One of the structures inside the shared memory is an array of POSIX semaphores used for synchronizarion between sim mgr and slurm threads. A limitation of this design using POSIX mechanisms is slurmd needs to be at the same machine than slurmctl and sim mgr, although we have commented above that this is not a problem for simulation purposes. When sim mgr is launched shared memory and semaphores initialization is done. After that, a new thread is registered when pthread create wrapper is called. Both, slurmctl and slurmd can create threads so it is necessary registration to be atomic for getting a simulation identifier by thread. Linked to this identifier
352 352 IBERGRID are two main fields: a sleep counter and a semaphore index. For tracking which thread has which identifier a table is created using pthread t value returned by real pthread create function. A thread calling pthread exit, pthread join or sleep functions will use that table and value of pthread self() to finding out which semaphore to use for synchronization with the sim mgr. When a thread calls time function, sim lib wrapper for time is executed. Inside this wrapper there is just a read of a global simulation address using the shared memory space containing the current simulation time. This is the same for gettimeofday with a difference related to microseconds value returned by this function. Initially we do not need to have a simulation with such a fine grain precission so microseconds is a fake value wich increments every time gettimeofday is called inside same simulation cycle. When a thread calls sleep function the sim lib wrapper is executed as follows: 1. wrapper code gets which thread identifier that thread has inside simulation. For this a search through a table linking pthread t values coming from pthread create and simulation thread identifiers is made. A thread can know which is its pthread t value with a simple call to pthread self which is part of standard pthread library. A potential problem with this approach is the possibility of getting two threads with same pthread t value. This is not possible for a single Linux process but SLURM is a distributed program with at least two processes, a slurmctl and a slurmd, so it could be possible to have two threads with same pthread t value. Linux uses the stack address of the created pthread for that pthread t value so it is quite seldom for two different programs to get same pthread t values. Nonetheless a table using pthread t along with process identifier (pid) would be safer. 2. Using the thread identifier as an array index to thread info structure, the thread sleep field is updated to the value passed as a parameter to sleep function. 3. Next step is synchonization with sim mgr. There is a POSIX semaphore by pthread by design but this is not enough for getting synchronization right so there are two semaphores by thread. First semaphore is for thread waiting to be woken up by the sim mgr meanwhile the other thread is for sim mgr to wait thread coming back to sleep or exiting. 4. If the thread calls sleep again the loop goes back to 1). If pthread exit is called by the thread a sem post(sem back[thread id]) is done inside pthread exit wrapper. We have seen what wrappers do except for pthread join. This wrapper is quite simple, adding a new valid thread state, joining, for simulation. This is necessary for avoiding interlocks when SLURM agents are used.
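A minimal sketch of how such LD_PRELOAD wrappers can look is given below. It assumes the shared layout sketched earlier is available through a hypothetical header sim_shared.h, together with hypothetical helpers sim_shm() and sim_self_slot(); it is not the actual sim_lib code, and error handling and gettimeofday() are omitted.

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <pthread.h>
    #include <semaphore.h>
    #include <time.h>
    #include <unistd.h>
    #include "sim_shared.h"   /* hypothetical: defines sim_shared_area, sim_thread_slot */

    extern sim_shared_area *sim_shm(void);        /* assumed: maps/returns shared area   */
    extern sim_thread_slot *sim_self_slot(void);  /* assumed: slot for pthread_self()    */

    /* time(): return the simulated clock instead of the real one. */
    time_t time(time_t *t)
    {
        time_t now = sim_shm()->sim_time;
        if (t) *t = now;
        return now;
    }

    /* sleep(): record the requested delay and hand control to sim_mgr;
     * the real sleep() is never called. */
    unsigned int sleep(unsigned int seconds)
    {
        sim_thread_slot *s = sim_self_slot();
        s->sleep_left = (int)seconds;
        s->sleeping = 1;
        sem_post(&s->back);    /* tell sim_mgr this thread is parked            */
        sem_wait(&s->wake);    /* block until sim_mgr advances the clock enough */
        s->sleeping = 0;
        return 0;
    }

    /* pthread_create(): register the new thread, then call the real function. */
    typedef int (*create_fn)(pthread_t *, const pthread_attr_t *,
                             void *(*)(void *), void *);
    int pthread_create(pthread_t *th, const pthread_attr_t *attr,
                       void *(*fn)(void *), void *arg)
    {
        static create_fn real_create;
        if (!real_create)
            real_create = (create_fn)dlsym(RTLD_NEXT, "pthread_create");
        /* ...atomically allocate a slot in sim_shm() for the new thread here... */
        return real_create(th, attr, fn, arg);
    }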
353 IBERGRID 353 Simulation manager Wrappers implemented inside sim lib library are useful for connecting slurm threads with the simulation manager, sim mgr. The global picture of how simulation works can be clarified with a detailed view of what sim mgr does in a simulation cycle: 1. for each registered thread being neither a never ending thread nor a new thread: (a) it waits till that thread sleeps, exits or joins (b) if thread is sleeping and sleep counter is equal to 0 i. it does a sem post(sem[thread id]) ii. it does a sem wait(sem back[thread id]) The term never ending thread needs some explanation. What sim mgr does is to allow just one slurm thread to execute at thread same time but allowing each thread to execute within a simulation cycle. However, some specific threads need less control for avoiding an interlock between threads. This is the case of threads getting remote requests at both slurmctl and slurmd. Usually those threads call to pthread create when a new message arrive and the new thread will be under sim mgr control as usual. 2. Using the trace information, if there are new jobs to be submitted at this simulation cycle, sbatch, usual slurm program for job submission, is executed with the right parameters for the job. 3. For each new thread sim mgr waits till that thread sleeps, exits or joins. As slurm can have some specific situations with a burst of new threads, sim mgr needs a way of controlling how threads are created avoiding to have more than 64 threads at the same time, since this is the current limit sim mgr supports. So this last part of the simulation cycle manages thread creation delaying some pthread create calls and identifying what we have named proto-threads, this is almost created threads. This is loop managing this possible situation: (a) For each new thread, if the thread is still alive, wait till it does a call to sleep, exit or join. In other case, it does mean this thread did its task quickly and called pthread exit. (b) When all new threads have been managed, all threads which did a call to pthread exit during this simulation cycle can release their slot so new threads can be created. If there were not any free slot it could be possible a pthread create call is waiting for getting a simulation thread identifier. A global counter shows how many proto-threads are waiting for free slots. If this counter is greater than zero then sim mgr leaves some time for those pthread create calls to finish then goes back to previous step looking again for new threads. Otherwise simulation cycle goes on. 4. At this point any thread alive or any thread created through this simulation cycle has had a chance to execute. So if the thread is still alive it is waiting on a semaphore. Sim mrg can now decrement sleep value for all of those threads. 5. All threads slots belonging to threads which called to pthread exit during this cycle are released. 6. Last step is to increment the simulation time is one second. 7. Go back to 1
354 354 IBERGRID 5 Slurm code changes Our main design goal was to use the SLURM programs for simulation without major changes. The design section described which changes are needed in slurmd and slurmctl in order to support the simulation mode. With the implementation done, the changes can be shown in detail: Our design requires threads to call pthread exit explicitly. Usually this is not needed, since the operating system and the pthread library know what to do when a thread function returns. We have therefore added pthread exit calls to each thread routine, which is not an important change to the SLURM code. Under simulation we build a controlled environment: we know when jobs will be submitted, when a node will have problems (if the simulation supports this feature) and how long a job will run. There will be no unexpected job or node failures, so we can skip the code that monitors job and node state when simulation mode is on. As jobs are not actually executed, it is not necessary to send packets to each node for job cancelling, job signaling, job modification, and so on. Removing this part of SLURM simplifies the simulator work significantly, since slurmctl uses agents for these communications. In clusters with thousands of nodes there can be jobs spanning thousands of nodes, so sending messages to each node is not negligible at all. The SLURM design supports these communication bursts with agents, which take care of sending messages from slurmctl to the slurmds. The problem with agents is that they are strongly based on threads: agents are threads themselves, there are pools of available agents with a static maximum limit, and watchdog threads are specifically created to monitor agent threads. Although this works fine for SLURM and makes it highly scalable, such a mesh of threads is too complex for our simulation design, since interlocks between threads appear easily. As said before, slurmd is simpler under simulation since no job needs to be really executed. On the other hand, an extra task needs to be done: knowing how long a job will run and sending a job completion message when the simulation time reaches that point. A new thread is created, a simulator helper, which every second checks for running jobs to be completed, taking into account the current time and the job duration, and sends a message to slurmctl if necessary. Finally, slurmctl works as in normal SLURM execution. Only some changes for sending messages without using agents have been made, which makes sim mgr simpler and facilitates the synchronization needed to obtain determinism across simulations of the same job trace.
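The first change listed above, ending every thread routine with an explicit pthread exit, can be illustrated as follows. The routine itself and the shutting_down() helper are hypothetical placeholders; the point is only the exit pattern, which lets the preloaded wrapper notify sim_mgr that the thread has finished.

    #include <pthread.h>
    #include <unistd.h>

    extern int shutting_down(void *arg);   /* hypothetical termination check */

    /* Hypothetical slurmctl background routine (periodic loop). */
    static void *background_thread(void *arg)
    {
        for (;;) {
            /* ... periodic work ... */
            if (shutting_down(arg))
                break;
            sleep(30);
        }
        pthread_exit(NULL);   /* explicit exit instead of a plain return */
        return NULL;          /* not reached; silences some compilers    */
    }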
355 IBERGRID 355 6 Preliminary Results A first simulator implementation is working and executing traces with jobs from different users and queues. A regression test bench has been built using synthetic traces, useful for testing exceptional situations such as a burst of jobs starting or completing in the same cycle. Several simulations of the same synthetic trace are executed to validate the design, expecting identical scheduling results, which implies that we achieve determinism under simulation. After some simulator executions we discovered a problem that is hard to overcome since it is linked to normal SLURM behavior. As we have seen, sim mgr controls the simulation by letting threads run in sequential order, one by one. This is just the opposite of what threads were designed for, but it allows traces to be executed deterministically. Leaving the threads uncontrolled would mean small differences, such as a job starting one second later; although this looks harmless, it could mean other jobs being scheduled differently, with such a change spreading through the whole simulation and ending in a totally different trace execution. So sim mgr lets a thread execute and waits until that thread calls the sleep function again. There is a slurmctl thread for backfilling which is executed twice per minute by default, and the problem is that the task done by this backfilling thread can easily take more than a second: the more jobs queued, the more time it takes. It also depends on the amount of resources; in our simulations with 1000 jobs, with job durations from 30 seconds to an hour, it is normal for the backfilling thread to take up to 5 seconds to complete. Therefore the very goal of the simulation, to speed up job execution, is compromised by this natural SLURM behaviour. We hope the SLURM simulator can help to overcome this problem in real SLURM, so that the simulator itself can then benefit by executing long traces faster. The next table shows results using an Intel Core Duo at 2.53 GHz and synthetic traces of 3000 jobs executed on a simulated cluster: Trace number Executions Sim Time Exec Time Speedup The time results per trace number are the mean of 5 executions. The simulation of four days of job execution takes around one hour and a half. Extrapolating these results, a one-month trace simulation would last a bit more than 10 hours. 7 Conclusions and Future Work A solution is presented for using real batch scheduling software, SLURM, for job trace simulation. Although the SLURM design is strongly based on threads, it is
356 356 IBERGRID possible to keep the original SLURM functionality with minor changes. For this some calls done to standard shared libraries are captured and specific code for simulation purposes executed. Implementation complexity has been left outside SLURM code so main SLURM developers will not to be aware of this simulation mode for future SLURM enhancements and simulator maintenance will be easier. Next step will be to present this work to SLURM core developers and users and if it is welcomed to integrate the code with SLURM code project. References 1. M. Jette, M. Grondona, SLURM: Simple Linux Utility for Resource Management, Proceedings of ClusterWorld Conference and Expo, San Jose, California, June Barcelona SuperComputing Center, 3. LinSched: The Linux Scheduler Simulator, John M. Calandrino, Dan P. Baumberger, Tong Li, Jessica C. Young, and Scott Hahn. jmc/linsched/ 4. Impact of Reservations on Production job Scheduling Martin W. Margo, Kenneth Yoshimoto, Patricia Kovatch, Phil Andrews. Proceedings of the 13th international conference on Job scheduling strategies for parallel processing A measurement-based simulation study of processor co-allocation in multicluster systems. S. Banen, A. I. D. Bucur, and D. H. J. Epema In Job Scheduling Strategies for Parallel Processing, Parallel Job Scheduling Under Dynamic Workloads (2003). Eitan Frachtenberg, Dror G. Feitelson, Juan Fernandez, Fabrizio Petrini. In Job Scheduling Strategies for Parallel Processing, ld-linux.so(8), Linux Programmer s Manual Administration and Privileged Commands
357 IBERGRID 357 Analysis of Xen efficiency in Grid environments for scientific computing Antònia Tugores 1, Pere Colet 2 Instituto de Física Interdisciplinar y Sistemas Complejos, IFISC(UIB-CSIC) 1,2 [email protected] 1 [email protected] 2 Abstract. Grid and cloud computing aim at providing access to a large amount of computing resources in a distributed way, are scalable, resources can be shared among large pool of users, and both computing paradigms involve multitenancy and multitask, meaning that many customers can perform different tasks, accessing a single or multiple application instances. While grid computing is typically used to execute CPU intensive jobs in physical computers for scientific data analysis or modeling, cloud computing is mostly used to access installed applications on virtual computers or manage data-centers and servers. Cloud computing offers the possibility to scale up to massive capacities in an instant without having to invest in new infrastructure and helps managing peak load capacity without incurring the higher cost of building larger data centers. Therefore while cloud strongly relies on virtualization, this technique plays no role in grid computing. Virtualization is becoming one of the main solutions in Information Technologies for maximizing the use of resources while minimizing the maintenance costs. Still because of its traditional lack of efficiency, right now it is not commonly used in scientific environments that need high CPU performance. However nowadays CPUs include specific hardware to increase virtualization performance and thus it is appropriate to explore its potential for intensive CPU tasks in scientific environments. One of the solutions to integrate cloud computing with grid computing in scientific environments can be to upload virtual machines with a configuration specifically prepared for high computing. In this case, efficiency of virtual computers is essential. This study we present shows that virtualization is good enough for parallel CPU intensive applications, but the most significant result is related to applications that compete for computer resources. In this case, performance when several instances are executed in single core virtual computers can be larger than when executed in the physical computers. Keywords: Xen, virtualization, efficiency, cloud 1 Introduction The paradigm of cloud computing, that is currently used in enterprises, is not consolidated in scientific computing. There are several virtualization softwares for
358 358 IBERGRID Linux environments which make use of different virtualization techniques. OS-level virtualization overhead is the best, but is not as flexible as other virtualization approaches since it cannot host a guest operating system different from the host one, or a different guest kernel. Among others, the only technique that uses a thin layer to interface the hardware to all operating systems (host and guests) is paravirtualization. Its flexibility and the low overhead makes paravirtualization one of the most efficient techniques. A key component of this technique is the hypervisor, a thin layer of software that presents to the guest operating systems a virtual operating platform and monitors the execution of the guest operating systems. Hypervisors are installed on server hardware which are used exclusively to run guest operating systems. Examples of hypervisors are Xen [1], VMware ESX [2] or Microsoft s Hyper-V technology [3]. Right now, cloud computing is mostly used to access installed applications on virtual computers. The most important cloud computing platform is Amazon Web Services (AWS, [4]), with its central part Amazon s Elastic Compute Cloud (EC2, [5]). It allows users to deploy virtual computers containing any software desired, on which to run their own computer applications. As being the open source industry standard for virtualization with hypervisors, Xen is the EC2 virtualization engine. Xen systems have a structure with the Xen hypervisor as the lowest and most privileged layer. Above this layer come one or more guest operating systems, which the hypervisor schedules across the physical CPUs. The first guest operating system, called in Xen terminology domain 0 (dom0), boots automatically when the hypervisor boots and receives special management privileges and direct access to all physical hardware by default. The layer between the hardware and the virtualized computer is getting thiner. This helps virtualized computers to be used like physical computers. Apart from that, performance of Xen virtualized computers is essential to introduce and consolidate virtualization and cloud in scientific computing specifically prepared for CPU intensive jobs. The study of the efficiency of Xen virtual machines is one of the first steps for cloud integration in scientific environments and the aim of this paper. 2 Experiment To evaluate the performance of Xen virtualized computers as compared to physical computers we used virtual machines with identical software to the physical one running on a dedicated host and we focused on 64-bit applications. The operating system we have used for all the tests is Scientific Linux 5.4 with kernel version and Xen 3.3 is the virtualization software we have tested. The physical computer had two quad-core Intel Xeon L5520 processors running at 2.27 GHz and with 8MB cache. Each processor had 8 GB of DDR memory. As for the virtual machines, the scenarios tested were: a) a single virtual machine with eight cores and 12GB of RAM; b) eight single-core virtual machines with 1.5GB of RAM per instance. Those two scenarios were tested with both enabled and disabled hiperthreading in the host computer. And finally c) a single virtual machine with sixteen cores and 12GB of RAM. Notice that not all the RAM
359 IBERGRID 359 could be used for the virtual computers because the host, required 4GB for the hypervisor to properly manage all the guests. For some benchmarks we consider an additional scenario: sixteen single core virtual machines the 750MB of RAM per instance. We have considered several benchmarks with different memory requirements: a) The first benchmark was WhetStone [6]. This synthetic benchmark was first written in Algol 60 [7] and then it was adapted to Fortran [8]. With these change, it became the first general purpose benchmark that set industry standards of computer system performance. The version used in this experiment was written in C [9]. The Whetstone benchmark measures computing power in Millions of Whetstone Instructions Per Second (MWIPS). This benchmark was chosen instead of Dhrystone [10] because while Dhrystone measures integer arithmetic, Whetstone measures the floating-point arithmetic performance, more in line with scientific activity. b) The second benchmark was Linpack [11]. It measures how fast a computer solves a dense N by N system of linear equations Ax = b, which is a common task in scientific and engineering computing programs. The solution is obtained by Gaussian elimination with partial pivoting. It makes heavily use of double precision floting point operations and the test result is reported in GFLOPS (one GFLOP is 10 9 floating point operations per second). This test, writen in FORTRAN, has became a standard in high performance computing and it is used to classify the fastest computers in the world [12]. Here we considered matrices of order 7000 which is about the maximum it fits in the memory of the virtual computers. c) The third test was the computation of a matrix-vector product (MatVec). The chosen matrix order was 1024 and the operation was repeated 2000 times. This program was written in C. This program results are not relevant and run time was the element to be tested in this experiment. d) Finally, the last test was the integration of the complex Ginzburg-Landau equation (CGL 2d). This is a prototypical Partial Differential Equation (PDE) widely used to model oscillatory instabilities in physical or chemical spatially extended systems [13]. Here we consider a 2-dimensional space with periodic boundary conditions. The CGLE is integrated using a pseudospectral method as described in Ref. [14]. The program was written in Fortran and for this test we have used a system with 1024x1024 grid points. The results by itself are not relevant for the purpose of this article, so in this test we focus on the run time as measurement of the performance. The program was written in Fortran and the results are not relevant, so run time was the element to be tested in this experiment. Programs were compiled using Intel Fortran and C compilers version 11.1 with the options -fast, we have used the Intel Math Kernel Library for the Linpack and CGL 2d tests and -parallel for MatVec. In the scenarios where the computer had more than one core, programs were executed using a single core in order to facilitate the comparison with the single core virtual computers. In particular we have limited the number of threads in the Intel Math Kernel Library setting the environment variable OMP NUM THREADS to 1. The programs were tested running one, two, four, six or eight simultaneous instances on the physical computer and on the virtual computer scenarios. Similarly, when hiperthreading was enabled,
3 Results and discussion

Fig. 1. Benchmark results on the physical computer (panels: Linpack in GFlops, Whetstone in MIPS, MatVec and CGL 2d run times in seconds, plotted against the number of parallel executions, with and without hyperthreading). MatVec and CGL 2d run times increase with the number of parallel executions. Linpack GFlops do not change with the number of parallel executions until hyperthreading is needed; then the values drop. Whetstone MIPS results are similar to the Linpack ones.

The applications we have considered turn out to show a very different behavior when many instances are executed simultaneously. For the Linpack and Whetstone benchmarks the instances practically do not compete among themselves, while for MatVec and CGL 2d the instances compete strongly for resources. Fig. 1 shows the results on the physical computer when running multiple copies of the same application simultaneously. For Linpack, when the number of simultaneous instances is increased, the performance of each of them remains practically constant until the number of instances reaches 8, the number of physical cores. Activating hyperthreading has practically no influence in this regime (although 4 instances run slightly faster with hyperthreading). For Whetstone the performance remains mostly constant up to 8 copies, although small oscillations, with a minimum at 4
copies, are observed. Thus neither the Linpack nor the Whetstone benchmark shows competition for resources, provided the number of copies does not exceed the number of physical cores. In both benchmarks, when the number of instances is larger than the number of real cores, the performance degrades linearly with the number of copies. The degradation is heavy for Linpack: 16 simultaneous copies take almost 20% more time than the same 16 instances executed 8 at a time. The Whetstone degradation is less pronounced, reaching a performance loss of about 30% for 16 instances with respect to 8. For MatVec and CGL 2d the run time increases linearly with the number of simultaneous executions; while the MatVec results are unaffected by hyperthreading, the CGL 2d ones are slightly better when hyperthreading is enabled. The other two benchmarks do not show such a strong competition between the different instances.

Fig. 2. Physical computer performance with hyperthreading enabled and disabled (results for several simultaneous instances of Whetstone, Linpack, MatVec and CGL 2d, normalized to a single execution, plotted against the number of parallel executions).

Fig. 2 shows the results obtained on the physical computer when executing several instances simultaneously, normalized to the result of a single execution. For Linpack and Whetstone the ratio remains close to 1 from 1 to 8 simultaneous copies, while it clearly decreases for CGL 2d and MatVec.

We now consider the simultaneous execution of several instances in virtual computers. Fig. 3 shows the efficiency, that is, the results obtained in the virtual computers normalized to the results obtained when executing the same number of instances on the physical computer. We consider a single virtual computer with 8 cores as well as 8 single-core virtual computers, both with hyperthreading activated and deactivated. We also consider a single virtual computer with 16 cores and, for Whetstone and CGL 2d, 16 single-core virtual computers. In all cases, when a single instance is executed the efficiency is reduced, sometimes by more than 20%. This is because, for load-balancing reasons, the operating system migrates the single thread between cores located in different physical CPUs. This inter-CPU migration can be avoided by pinning the virtual CPUs to physical cores by hand; we have checked that, if we do so, the efficiency for single-threaded applications becomes close to 1. However, for the results presented here we leave the virtual CPUs unassigned.
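Written explicitly (a natural reading of the normalization described above, not a formula given in the original text), the efficiency for n simultaneous instances is the ratio of the virtual result to the physical one for the throughput benchmarks, and presumably the inverted ratio of run times for the time-based tests, so that values above 1 always mean better than the physical machine:

    E_perf(n) = R_VM(n) / R_phys(n)        (Whetstone, Linpack)
    E_time(n) = t_phys(n) / t_VM(n)        (MatVec, CGL 2d)

where R denotes the reported MWIPS or GFLOPS and t the measured run time.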
Fig. 3. Virtual computer performance: efficiency of (a) Linpack, (b) Whetstone, (c) MatVec and (d) CGL 2d as a function of the number of parallel executions, for a single VM with 8 cores (with and without HT), 8 single-core VMs (with and without HT), a single VM with 16 cores (HT) and 16 single-core VMs (HT).

Applications that do not compete for resources show an efficiency close to 1 (Fig. 3 (a) and (b)). We notice that Linpack performance is reduced in all the configurations using hyperthreading, and that Whetstone is strongly reduced for 4 executions and for the 16 single-core virtual machines. For applications with interactions, the most significant results are obtained when running in the 8 single-core virtual machines with hyperthreading deactivated and, for CGL 2d, in the 16 single-core virtual machines. The virtual machines prevent collisions between the multiple copies, and the programs run more efficiently than in the physical computer (efficiency up to 1.1 for 8 copies of CGL 2d and up to 1.33 for MatVec). As in the physical computer results, the efficiency of the multicore virtual computers decreases as the number of parallel executions increases.

We can conclude that Xen virtualization provides good results for CPU-intensive concurrent applications if the number of threads is not larger than the number of physical CPUs. For the parallel execution of multiple copies of an application, virtualization is at least as efficient as the physical machine and, by running the copies in different virtual machines, collisions are minimized, which leads to a significant improvement over the physical machine. Therefore cloud environments running Xen virtual computers can be quite suitable for scientific computing. All the benchmarks indicate that, when a single instance is executed, virtual machine performance is not as good as that of the physical computer, but single-instance execution is not the target scenario of this experiment.
4 Conclusions

We can conclude that Xen virtualization is good enough for parallel, CPU-intensive applications when more than one processor is used. In general the virtualization efficiency turns out to be close to one, indicating that the overhead generated by virtualization is small. Moreover, single-core virtualized computers have a very positive effect on applications that compete for computer resources, since they limit the competition. Therefore, when several instances of such applications are executed simultaneously in virtual computers, the performance can be higher than when they are executed in physical computers. Thus, in high-throughput environments where many CPU-intensive independent computations are executed simultaneously, implementing virtualization can indeed improve the performance. In these cloud environments, making use of Xen hypervisors hosting single-core virtual computers can be quite a convenient option to manage large amounts of distributed resources shared among a large pool of users.

Acknowledgements: Financial support from CSIC through project GRID-CSIC (Ref E94) is acknowledged.

References

1. Xen hypervisor website.
2. VMware ESX.
3. Microsoft's Hyper-V technology.
4. Amazon Web Services.
5. Amazon EC2 website.
6. Whetstone benchmark.
7. ALGOrithmic Language (Algol).
8. Fortran.
9. C language.
10. Dhrystone.
11. Linpack benchmark.
12. Top500 list.
13. M.C. Cross and P.C. Hohenberg, Rev. Mod. Phys. 65, 891 (1993); D. Walgraef, Spatio-Temporal Pattern Formation, Springer (1996).
14. D. Gomila, A. Jacobo, M.A. Matías, P. Colet, Phys. Rev. E (2007).
Datacenter infrastructures remote management: A practical approach

Enrique de Andrés, Antonio Fuentes and Tomás de Miguel

RedIRIS, E.P.E. Red.es, Plaza Manuel Gómez Moreno s/n, Edificio Bronce, E Madrid, Spain
[email protected]

Abstract. Demand by the scientific community for computing services, networking, storage and applications is increasing day by day, and so are the availability requirements. This has increased the infrastructure needed to meet this demand and, therefore, housing costs. One way to reduce housing costs is to leverage economies of scale, but this involves moving these infrastructures to remote rented datacenters and, therefore, the need for remote management tools. Remote management of datacenter infrastructures allows managing them almost as if we were in our own datacenter room. In addition, well-designed datacenters can reduce electrical and cooling costs and improve the reliability and availability of the services we provide. In many facilities, three problems need to be solved immediately: limited power, increasing cooling demands, and space constraints. This paper shows a practical example of how RedIRIS operates its remote infrastructure and the tools available for it.

1 Introduction

When we talk about systems administration in the area of Information and Communication Technology (ICT), it is usual to focus our attention on the servers, computing resources, storage equipment and network devices that we host in our datacenter, and in particular on the operating systems of that equipment. In this sense, we could say that we consider our equipment as logical entities according to the function, role or service it performs for us, blocking out the underlying physical reality: the infrastructure that supports it. This infrastructure includes the facilities where our equipment is housed, its cooling systems and power supply, as well as the hardware of our equipment, regardless of the operating system it may have installed. A minimal control of these three elements allows taking not only reactive but also proactive measures when anomalous situations are detected. On the other hand, system administration is usually done in-band; that is, access to the systems and the administration tasks are carried out using the very elements we are administering. In case of a network failure or an operating system misconfiguration, we run the risk of not being able to access our systems. Our systems would be cut off, so we could not solve the problem or, at least, minimize its impact until it can be solved. In order to avoid cutting off our systems, it is a good idea to have mechanisms, devices and network architectures that allow an alternative access method.
This method should not use the same elements we are trying to access, so we refer to out-of-band access and administration.

Fig. 1. Datacenter infrastructure elements

The paper is structured as follows: Section 2 describes the RedIRIS need for a model for remote management of equipment. Section 3 introduces the aspects of datacenters that must be taken into consideration. Section 4 describes the mechanisms and devices for out-of-band management. Section 5 describes how to deploy out-of-band access, and the conclusions are given in Section 6.

2 RedIRIS scenario

RedIRIS [1] is the Spanish academic and research network that provides advanced communication services to the scientific community and national universities. Scientific advances and changes in the way of doing science have turned the network infrastructure into a critical element for scientific development. The RedIRIS network has 18 communications nodes distributed across Spain, although a new communications network, RedIRIS-NOVA [3], is now being deployed and will increase that number to more than 60. Communication nodes are operated 7x24x365, and many of them are hosted in institutions where, in some cases, out-of-hours access is complex. In addition, RedIRIS does not have local staff located at the nodes, so, in case of an equipment failure, either a technician has to travel there to analyze the problem or the local institution's staff has to help us solve it. The deployment of mechanisms for out-of-band access and administration represents an alternative way to access the equipment housed in RedIRIS nodes, regardless of the status of the equipment and the communications network. This makes it possible to carry out maintenance tasks and solve problems remotely without having to leave the office.
3 Datacenter management

The three pillars on which the ICT infrastructure rests, apart from the equipment itself, are the facilities where it is located, the cooling and the power supply. Depending on the magnitude of the datacenter, the management of some of the elements involved in its infrastructure may be delegated to a third party, either because a service provider is in charge of these tasks in our datacenter, or because the datacenter is provided by an external company; on the other hand, many of these elements are part of conception, design and execution decisions, so they are beyond our control. Cooling, power supply and redundancy are usually delegated, which frees us from a thorough control of them. Nevertheless, it is interesting to conduct a minimal supervision, which lets us know the status and health of our infrastructure in the datacenter. This makes it possible to detect anomalous situations, to carry out measurements, to forecast resources according to trends, etc. For this purpose we rely on devices such as video surveillance cameras, environmental sensors and power measuring devices, as well as on software for centralized remote management.

Temperature in the datacenter [5] is an important issue because it directly affects the performance of the equipment and the reliability and lifetime of its components. Temperature monitoring supports decisions about cooling and equipment placement. Temperature is not distributed homogeneously in the datacenter room, so it is useful to have a set of distributed temperature sensors, usually at three levels: room, rack and equipment. Room-level temperature sensors are usually independent devices strategically located along the room; they provide a general idea of the room's temperature, and it is interesting to place them in the most unfavourable locations. Rack-level temperature sensors are usually satellite elements connected to other devices installed in the racks (although they could be independent devices too); these devices are usually smart power distribution units with auxiliary ports for connecting sensors. Equipment-level sensors are embedded in the equipment itself; although they can be controlled by the equipment's operating system if the appropriate software is installed, they are usually managed by service processors (dealt with later on). Distributing temperature sensors at these three levels allows detecting general cooling problems as well as hot spots [6].

Not only temperature affects system performance; humidity can also cause rapid deterioration of systems. Monitoring humidity in a datacenter is as important as monitoring temperature. While too low humidity may result in electrostatic discharge (ESD), causing immediate permanent equipment damage, too high humidity can create corrosion on components, a slow and often irreversible process.
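As an illustration of this kind of minimal supervision, the following sketch polls a rack-level temperature sensor over SNMP and flags threshold violations. It is only an example of the idea, not RedIRIS tooling: the sensor host, community string, OID and thresholds are placeholders that would have to be taken from the actual PDU or sensor vendor's MIB.

    # Illustrative sketch: poll a temperature sensor via SNMP and check thresholds.
    # Host, community, OID and limits are placeholders, not real RedIRIS values.
    import subprocess

    SENSOR_HOST = "pdu-rack01.example.org"      # hypothetical smart PDU
    COMMUNITY   = "public"                      # placeholder community string
    TEMP_OID    = "1.3.6.1.4.1.99999.1.1.0"     # placeholder OID from the vendor MIB
    WARN_C, CRIT_C = 27.0, 32.0                 # example thresholds in Celsius

    def read_temperature():
        """Return the sensor value in Celsius using the net-snmp snmpget tool."""
        out = subprocess.check_output(
            ["snmpget", "-v2c", "-c", COMMUNITY, "-Ovq", SENSOR_HOST, TEMP_OID])
        return float(out.decode().strip())

    if __name__ == "__main__":
        t = read_temperature()
        if t >= CRIT_C:
            print("CRITICAL: %.1f C at %s" % (t, SENSOR_HOST))
        elif t >= WARN_C:
            print("WARNING: %.1f C at %s" % (t, SENSOR_HOST))
        else:
            print("OK: %.1f C" % t)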
Power control [4] makes it possible to measure and monitor power consumption, as well as to determine costs. Smart Power Distribution Units (PDUs) have among their features the ability to collect information about power consumption. These devices, combined with centralized management software, provide complete reports and trends. Smart PDUs also show instantaneous power consumption and alert when thresholds are exceeded. This makes it possible to distribute equipment among power lines without overloading them. They also allow balancing loads between phases on a three-phase system where the dominant loads are single-phase, such as three-phase UPS systems.

With regard to facilities, many of their issues are part of the decisions taken during datacenter conception, design and execution, so they are outside the scope of remote management: cabling topology, equipment zones (communications, servers, patching elements, ...), rack and equipment distribution, labeling, etc. Using video surveillance cameras is part of remote datacenter management and helps to check the status of the datacenter; it also helps to supervise work carried out in it. In order to cover all the areas of the datacenter, it is recommended to install one camera in each corridor.

4 Out-of-band administration

Systems administration is usually performed remotely via telnet, ssh, VNC, remote desktop, etc. This kind of administration is in-band administration, because it uses the very elements we are trying to administer. For example, if we access a piece of equipment via ssh, we are using the operating system, the network service and the sshd daemon of that same equipment. In-band administration [2] assumes that all the components involved are working correctly, so if one of them fails we lose remote access to the systems, and with it the possibility of solving the problem. If the datacenter is near the location of the system administrators in charge of the equipment, they can go there and take the appropriate measures. If the datacenter is far away, going there is not the best option and other actions should be taken. Even if the datacenter is near, its temperature, noise level and other environmental conditions are very harsh, which makes it an undesirable place for humans to work in. It is not necessary to go to the datacenter if we have the appropriate mechanisms. If we have alternative administration methods, in-situ interventions are minimized. These are commonly called out-of-band administration methods, because they do not use the same resources as the systems we are administering. Out-of-band administration mechanisms make it possible to work with our equipment almost as if we were in front of it. Nevertheless, in the worst case, if these mechanisms fail too and we are not close to the datacenter, sometimes we have a pair of remote hands and we can give them suitable instructions. The humans behind remote hands are very heterogeneous, from qualified staff who know perfectly the sort of equipment we are working with, to plain remote-hands providers. In the first case, giving suitable instructions is usually easy. In the second one, which is not necessarily bad, we have to give detailed instructions about the operations we want the remote hands to carry out. Therefore, we need detailed information about our equipment and how it is physically installed (button layout, network interfaces, power supplies, network and power cables, labels, etc.). For this purpose, not only the documentation of the equipment helps, but also other information sources such as photographs taken during the installation and the video surveillance cameras available in the room.
Until now we have talked about operating system maintenance, that is, administration once our equipment is installed and has a minimal configuration. We must not forget another important topic: operating system installation. Although we can do it during the physical installation of the equipment, many times we want to leave the equipment operational (mounted in the rack, with the power and network connections made) even though we have not taken several decisions yet, or we simply want to avoid staying at the datacenter too long, and we postpone the installation of the operating system to a later moment. In these situations, as well as when, for some reason, we want to reinstall the operating system, it is worth having a way of working remotely. In these cases, out-of-band administration methods are useful too.

4.1 External devices

One of the options for out-of-band administration is using additional equipment [7], external to the equipment we want to manage. Its main goal is enabling management as if we were in front of (or behind) the managed equipment, that is, as if we were administering it locally. Local administration mechanisms depend on the equipment type: communications equipment usually has serial console ports, and servers have keyboard, mouse and video connectors, although, depending on the brand and model, there are servers with serial console ports too.

For remote access to serial console ports, a console server or terminal server is required. These are usually small devices, one rack unit in size, with a number of ports ranging from 4 for basic models to 48 for more advanced ones. They are appliances without hard drives; they use flash memory instead, to increase reliability. Depending on the brand or the model, terminal servers provide additional features such as the following:
- Dual Ethernet ports for redundancy.
- Built-in modem for out-of-band access.
- PC Card and USB device support (e.g. optical fiber network adapters, WiFi adapters, wireless modems, additional storage).
- Configurable pin-outs for serial ports.
- Internal temperature sensor for self-monitoring.
- Notification of fault conditions.
- Event notification.
- Data log management.
- Web access.

For remote access to the keyboard, mouse and video connectors we can use KVM (Keyboard-Video-Mouse) servers, which add network access to traditional KVM switches. Apart from this main functionality, KVM servers are appliances with features similar to those of terminal servers, and there are even devices that provide both functionalities, KVM and console access.
One especially interesting feature of some KVM servers is virtual media. This feature enables USB media such as CD-ROMs, flash memory and external drives to be virtually attached to a remote server's USB port. On the other hand, using terminal servers and KVM servers is not enough for full out-of-band administration. When an operating system locks up we need to reset the equipment, and the previous devices do not make that possible. Locally we would press the power on/off button or disconnect and re-connect the power cables, but remotely it is not possible unless we ask remote hands to do it. Smart power distribution units enable selective remote control of their outlets, so, if our systems are powered through these devices, we can easily reset them. Equipment managed by terminal servers or KVM servers in large infrastructures is connected to them through UTP cables, so it is necessary to use specific adapters: RJ45, DB9, DB25, etc. to RJ45 (with the right pin-outs) for terminal servers, and VGA and PS2/USB to RJ45 for KVM servers. This implies a high deployment cost, and it is worth exploring other options.

Fig. 2. Out-of-band and sideband access to server service processors

4.2 Integrated devices

In addition to the previous devices, nowadays many IT equipment manufacturers ship servers with service processors as part of their equipment. A service processor (SP) is a separate, dedicated internal processor located on the motherboard of a server, on a PCI card or on the chassis of a blade server or a telecommunications platform, which runs independently from the server's main processor and operating system. It provides remote access to power control and sensor readings and, in some cases, also to server configuration, monitoring and control, even when the server is down or the CPU or operating system is locked up or inaccessible. Service processors [7, 8] are accessible either from dedicated management interfaces (out-of-band), independent of the service interface, or from shared interfaces (sideband), which are used to access both the service processor and the data network.
The sideband approach reduces the number of connections to the switch but, on the other hand, it does not allow separate networks for SP and data connections unless VLAN tagging is supported.

The features provided by service processors depend on their type, although the usual important features are the following:
- Remote power control: servers can be remotely powered off, on or power-cycled.
- Graceful shutdown support: it is possible to send a signal to the server operating system in order to shut it down in a controlled way.
- Remote Serial over LAN (SoL) access: the server console can be accessed through the service processor interface, via telnet or ssh connections, as if we were locally connected to the serial port.
- Virtual KVM: the server keyboard, mouse and video can be accessed remotely.
- Virtual media: it allows the server to access storage media connected to the service processor client, such as CDs, DVDs and USB flash drives, as if they were directly attached to that server.
- Health monitoring: service processors gather information from the server sensors (fan speed monitors, voltage meters, temperature readers, etc.) in order to detect anomalous situations.
- Platform Event Traps (PETs): it is possible to define thresholds on the previous information and send out SNMP traps according to them.
- System Event Log (SEL): service processors can store information about events related to the server hardware, such as chassis opening and closing, hard drive alarms, RAM test errors, etc.

The main service processor types are listed below:
- Intelligent Platform Management Interface (IPMI).
- HP Integrated Lights-Out (iLO).
- IBM Remote Supervisor Adapter (RSA).
- Dell Remote Assistant Card (DRAC).
- Sun Advanced Lights Out Management (ALOM).
- Sun Embedded Lights Out Management (ELOM).
- Sun Integrated Lights Out Management (ILOM).

IPMI deserves special attention because it is an open standard management interface specification. It was originally driven by Intel (its main proponent), Dell, HP and NEC, but it has now been adopted by more than 150 other companies. IPMI defines a communications protocol between embedded management subsystems. IPMI information is exchanged through baseboard management controllers, which are located on IPMI-compliant hardware components. To enable out-of-band administration, service processors usually host these controllers, but it is also possible to access IPMI information through the server's operating system.
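As an example of how these features are typically exercised from a management host, the sketch below wraps the standard ipmitool client to query power status, list the System Event Log and read temperature sensors of an IPMI-compliant service processor. It is only an illustration of the interface: the BMC address and credentials are placeholders, and a real deployment would keep credentials out of the script.

    # Illustrative use of the standard ipmitool client against a service processor.
    # BMC address and credentials are placeholders, not real RedIRIS values.
    import subprocess

    BMC_HOST = "sp-node01.example.org"   # hypothetical service processor address
    BMC_USER = "admin"                   # placeholder credentials
    BMC_PASS = "secret"

    def ipmi(*args):
        """Run an ipmitool command over the LAN interface and return its output."""
        cmd = ["ipmitool", "-I", "lanplus",
               "-H", BMC_HOST, "-U", BMC_USER, "-P", BMC_PASS] + list(args)
        return subprocess.check_output(cmd).decode()

    if __name__ == "__main__":
        print(ipmi("chassis", "power", "status"))   # remote power control
        # ipmi("chassis", "power", "soft")          # graceful OS shutdown
        # ipmi("chassis", "power", "cycle")         # hard power cycle
        print(ipmi("sel", "list"))                  # System Event Log
        print(ipmi("sdr", "type", "Temperature"))   # health monitoring: temperature sensors

The Serial over LAN console, being interactive, would typically be opened directly with ipmitool's sol activate subcommand rather than captured from a script.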
5 Out-of-band access

Operation and control of our equipment is usually carried out through a dedicated, independent network, called the management network. Management networks provide access to the devices described previously (smart PDUs, terminal servers, KVM servers and SPs), but they also provide access to dedicated in-band management interfaces of the equipment (or to common interfaces used for management tasks) and to other equipment such as that used for monitoring and control. Some of the equipment accessed through a management network is the communications equipment that provides access to that network itself, so if one of these devices fails we lose access to the network. And it does not take an equipment failure: firmware or software updates may require putting such devices into an inactive state, causing the loss of the management connection. In these situations the most important issue is not the loss of general access to the elements that the management network interconnects, but the loss of access to the out-of-band administration mechanism provided through the equipment that is causing the loss of connectivity.

An out-of-band access deployment provides an alternative access method to management networks. It complements out-of-band administration methods by providing an alternative way to reach them, and it is especially critical when we are managing the equipment that interconnects the management networks. A simple but effective design for out-of-band access is shown in Fig. 3. It is based on an ADSL line with a static IP address provided by an external telecommunications operator. This design is made up of the following elements:
- An ADSL router with switching capacity and a single level-2 VLAN.
- A server for controlling out-of-band accesses and management tasks.

To minimize points of failure, the switching equipment should be as simple as possible. For this purpose we could use only the ADSL router, but if we need a large number of connections it is very likely that this is not enough and an extra switch has to be connected. We must avoid the possibility that critical elements connected to the management network become isolated, since these are the elements, such as terminal servers, that make it possible to solve a connectivity problem in the management network. In this sense, it is desirable to connect them directly to the ADSL router or, if that is not possible, to an unmanaged switch, which avoids misconfigurations and, therefore, the isolation of the connected devices.

Fig. 3. Out-of-band access example design

6 Conclusions

Datacenter control mechanisms give us knowledge of the datacenter's status, allow carrying out measurements and forecasting resources according to trends, and also help us to anticipate unwanted situations and take the appropriate measures. Out-of-band administration and access mechanisms make remote equipment management possible almost as if we were working locally with the equipment. When datacenters are a long way from the administrators' location, this improves problem solving and reduces resolution times.
It also minimizes the number of trips to the datacenter and the use of remote hands. If we have the appropriate tools, remote datacenter management is not a problem, and we can consider taking advantage of economies of scale by centralizing our ICT equipment in rented, shared datacenters.

References

1. RedIRIS. The Spanish Academic and Research Network.
2. Antonio Fuentes, Tomás de Miguel. Diseño y despliegue de la red fuera de banda de RedIRIS. Boletín de la red nacional de I+D, RedIRIS, n. 81. Diciembre.
3. RedIRIS-NOVA: Una red fotónica para la investigación y educación en España.
4. Antonio Fuentes. Diseño de la instalación de infraestructuras de sistemas y comunicaciones en un datacenter. Grupos de Trabajo de RedIRIS.
5. Esteban Domínguez Glez-Seco. Implantación de instalaciones de climatización y electricidad acoplados al paradigma Green IT. Grupos de Trabajo de RedIRIS.
6. Javier Álvarez Cutillas, Enrique José García, Luis Morell, Francisco. Implantación de las tecnologías de información y comunicaciones en un hospital del siglo XXI. Grupos de Trabajo de RedIRIS.
7. Avocent. Unleashing the Power of iLO, IPMI and Other Service Processors: A Guide to Secure, Consolidated Remote Server Management. Avocent / Emerson Network Power White Paper.
8. Sun Microsystems. Embedded Lights Out Manager Administration Guide. Oracle Servers Documentation.