ATLAS computing operations within the GridKa Cloud



This article was published as 2010 J. Phys.: Conf. Ser. 219 072039 (http://iopscience.iop.org/1742-6596/219/7/072039). © 2010 IOP Publishing Ltd.

ATLAS Computing Operations within the GridKa Cloud (1)

J. Kennedy (2), C. Serfon, G. Duckeck, R. Walker (LMU Munich), A. Olszewski (Institute of Nuclear Physics, Krakow), S. Nderitu (University of Bonn), and the ATLAS GridKa operations team

(1) Throughout this paper the term Cloud refers to a federation of resources.
(2) At present at the Rechenzentrum Garching of the Max Planck Society and the IPP.

Abstract. The organisation and operations model of the ATLAS T1-T2 federation/cloud associated with the GridKa T1 in Karlsruhe is described. Attention is paid to Cloud-level services and the experience gained during the last years of operation. The ATLAS GridKa Cloud is large and diverse, spanning five countries and two ROCs, and currently comprises 13 core sites. A well defined and tested operations model in such a Cloud is of the utmost importance. We have defined the core Cloud services required by the ATLAS experiment and ensured that they are performed in a managed and sustainable manner. Services such as Distributed Data Management (involving data replication, deletion and consistency checks), Monte Carlo production, software installation and data reprocessing are described in greater detail. In addition to providing these central services we have undertaken several Cloud-level stress tests and developed monitoring tools to aid with Cloud diagnostics. Furthermore we have established good channels of communication between ATLAS, the T1 and the T2s and benefit from pro-active contributions from the T2 manpower. A brief introduction to the GridKa Cloud is given, followed by a more detailed discussion of the operations model and the ATLAS services within the Cloud.

1. Introduction
This document describes the ATLAS T1-T2 federation associated with the GridKa T1 in Karlsruhe, Germany, and the operational model which has been deployed to ensure that ATLAS services within this Cloud run smoothly. A brief introduction to the GridKa Cloud is provided, followed by a more detailed discussion of the ATLAS services within the Cloud.

2. Sites and Infrastructure
The ATLAS GridKa Cloud is formed from the GridKa T1 center at Karlsruhe in Germany and several associated T2 centers within Germany, Poland, the Czech Republic, Switzerland and Austria (see figure 1):

The GridKa T1 at FZK - DE
A federated T2 from LMU/LRZ and MPI/RZG in Munich - DE
A federated T2 from Wuppertal and Freiburg - DE
A federated T2 between DESY Hamburg, DESY Zeuthen and Goettingen - DE
FZU Prague - CZ
CYF Krakow - PL
CSCS Manno - CH
Innsbruck - AU

In addition to the T1 and T2 centers several T3 centers are also active. The Cloud is unusually diverse, covering several countries (DE, CH, CZ, PL, AU) as well as two ROC regions (DECH, Central Europe). This diversity requires an extra effort to ensure good communication.

Figure 1. A view of the GridKa Cloud with the data movement (FTS) channels shown.

3. Operational Approach and Organisation
In our experience the development of a good operational and organisational model is the key to a successful Cloud. We have placed emphasis on forming groups to perform the necessary tasks and services within the Cloud, and on coordination at both the global and the subgroup level. This section details the steps we have taken to provide organisation and develop the needed subgroups.

3.1. Cloud Coordinator
During the initial phase of the Cloud setup a single person was identified to perform the central role of coordination in the Cloud. The role of this so-called Cloud Coordinator was to provide technical and operational coordination during the initial Cloud startup. Some of the responsibilities of this role were:

(i) Ensure good contact exists between the sites within the Cloud
(ii) Maintain contact with ATLAS computing for both operational and development issues
(iii) Have a good understanding of the ATLAS services running within the Cloud, their dependencies, development and relation to each other
(iv) Provide planning for tests, both internal to the Cloud and ATLAS wide
(v) Organise meetings and keep information flowing

As the Cloud evolved, sub-coordination roles were filled in areas such as DDM and MC production, and the level of communication between these areas was enhanced. This sharing of responsibilities, coupled with regular communication between the distinct areas, led to a reduced need for a single coordinator. After an initial setup period of 2-3 years the Cloud Coordinator role became less essential, as several groups now covered the responsibilities. The Cloud now operates as an organised group rather than a top-down hierarchy.

3.2. Site Deployment and Contacts
The startup phase of the Cloud saw the deployment of gLite middleware at the sites and the identification of site contacts. Each site (T1, T2) identified a list of people who act as site contacts. Ideally this contact list contains an ATLAS-aware person and a system administrator for the relevant site.

3.3. Identifying Services
The focus within the Cloud during the last 12-18 months has moved on from the initial site deployment and has become more operations centered. The major ATLAS-level services and tasks which we believe should and must operate within our Cloud were defined, with manpower being invested in each area. The main service areas are:

(i) Monte-Carlo Production
(ii) Distributed Data Management
(iii) ATLAS Software Installation
(iv) Distributed Analysis
(v) Data Reprocessing

These areas are described in more detail in section 4.

3.4. Organisation
The main organisational areas addressed are planning, communication and documentation.

Planning: Functional tests, both central ATLAS tests and tests internal to the Cloud, need to be planned and executed.

Communication: Ensure information flow between ATLAS and the working groups and sites within the Cloud. A mailing list has been set up and monthly video conferences are held. A weekly phone conference is held between the ATLAS technical contacts at GridKa and the Cloud coordinator.

Documentation: A Cloud wiki page has been set up where the Cloud is described and information about functional tests etc. is recorded.

3.5. T1 Contact
The Tier-1 center at GridKa is of special importance to the whole Cloud since services it provides, such as LFC and FTS, are central to the Cloud. Good contact with the Tier-1 administration team and a high level of information exchange and planning are required to ensure smooth operations within the Cloud as a whole. ATLAS contacts are stationed at the T1 center at GridKa for several days each week, where they attend the T1 middleware/services meetings, both providing information from the experimental community and gaining information from the site administration. In this way we ensure that a solid base for planning activities and tests is provided and also that problems are quickly identified and solved.

To ensure a high level of service is maintained during daily operations an extensive monitoring framework is provided by GridKa. The ATLAS T1 contact can use this monitoring information to identify problems as well as to gauge the impact of specific ATLAS activities on the GridKa system.

Figure 2. Two views of the GridKa monitoring web pages. High-level views are provided which allow users to gain an impression of the health of the overall system, allowing problems to be quickly identified. The ability to drill down to more informative low-level views is available, enabling users to identify the root cause of a problem.

4. Services
4.1. Distributed Data Management (DDM)
The Distributed Data Management (DDM) system is based on Grid software packages which automatically manage the data as well as providing a transparent and unique view of the connected resources. The ATLAS framework for DDM is called dq2. DDM operations are performed by two teams: one central team at CERN and one local Cloud team which manages DDM issues for GridKa and the associated Tier-2s [1]. These two teams work in close collaboration and perform the following tasks:

Analysis data distribution: The local Cloud team helps define the distribution plan for data within the Cloud and additionally monitors its distribution. This requires liaison with the sites as well as with central ATLAS DDM to ensure that data is replicated in accord with the computing model while making intelligent use of the resources within the Cloud.

T0 tests and T1-T2 functional tests: Throughout the last years several scaling and functional tests have been performed to ensure that the data management within the Cloud is functional and well understood. Figures 3 and 4 show the results of throughput tests in 2007 during which nominal rates to the T1 site and the accompanying T2s were reached.

Integrity checks: Checks are performed on a regular basis to ensure that the files on the storage systems and the information registered in the LFC about these replicas are consistent. It has been observed that files may be registered in the LFC yet be unavailable at a site, and vice versa. The former results in job failures since the data is unavailable, while the latter leads to a data leak (often referred to as dark data) in which resources are used but the files are effectively lost to the grid. The GridKa ATLAS DDM team developed scripts which may be used to cure these problems and these are run on a regular basis.
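As an illustration of such a consistency check, the comparison between catalogue and storage reduces to a set difference in both directions. The following is a minimal sketch rather than the actual GridKa scripts (which are not listed in this paper); the dump file names and the load_paths helper are hypothetical placeholders for an LFC dump and a storage namespace dump.

def find_inconsistencies(lfc_replicas, storage_files):
    """Compare catalogue entries against the files actually present on the storage element."""
    lost_files = lfc_replicas - storage_files   # registered in the LFC but missing on disk: jobs will fail
    dark_data = storage_files - lfc_replicas    # on disk but not registered: space is used, files unusable
    return lost_files, dark_data

def load_paths(dump_file):
    """Placeholder: read one file path per line from a catalogue or namespace dump."""
    with open(dump_file) as f:
        return {line.strip() for line in f if line.strip()}

if __name__ == "__main__":
    # In a real check these dumps would come from the LFC and from the SE namespace.
    lost, dark = find_inconsistencies(load_paths("lfc_dump.txt"),
                                      load_paths("se_dump.txt"))
    print(f"{len(lost)} lost files: clean the catalogue entries or re-replicate the data")
    print(f"{len(dark)} dark files: candidates for deletion from the storage element")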

Figure 3. Throughput (MB/s) to FZK during T0 tests from the 15th of May to the 15th of June. The nominal throughput is about 90 MB/s.

Figure 4. Throughput (MB/s) to the FZK Tier-2s during T0 tests.

Cleaning of datasets (in the event of data loss): In the event of data loss, through the failure of a storage pool or the accidental deletion of a file, actions must be taken to ensure the integrity of the metadata regarding the associated datasets. If files are lost completely, i.e. no other replica exists, the LFC and central DDM catalogues require cleaning to remove the file entries. If, on the other hand, a replica exists, it may simply be copied back to the site which suffered the data loss.

Dataset and FTS monitoring: The transfer of data to the sites is monitored using central DDM and local Cloud monitoring tools. Members of the DDM team ensure not only that data is being transferred to sites at the nominal rate but also that dataset completion is achieved. Figure 5 shows an example distribution of file arrival times after a dataset is subscribed to a site.

4.2. Software Installation
The ATLAS software framework ATHENA evolves at a fast pace, with new major versions being released regularly. Each new major version requires installation at all the Tier-1 and Tier-2 sites within the GridKa Cloud, and these installations are subsequently validated. This large-scale deployment enables the MC production system, which needs the new major ATHENA versions, to use all of the available ATLAS computing power.

New releases of the ATHENA software are prepared as so-called grid installation kits. These installation kits are run by the ATLAS installation team via grid jobs at the specific site. The grid job installs the corresponding ATHENA version and validates the installation. Each time a new version is released a centrally managed script starts an automatic installation at all known ATLAS Tier-1 and Tier-2 sites. If the automatic installation fails, a member of the ATLAS installation team, normally the software manager assigned to the Cloud to which the site belongs, tries a manual installation. If the manual installation also fails, the software manager contacts the responsible site administrator to better understand and solve the problem. In addition the installation team ensures that older, outdated ATHENA releases are removed from sites and on occasion installs minor ATHENA versions at sites when specific versions are requested by users.
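The escalation logic of this installation procedure can be summarised in a short sketch. The install_kit and open_site_ticket helpers, the release string and the site list are hypothetical placeholders for the real installation kits and ticketing channels described above, not the production tooling itself.

def install_kit(release, site, mode):
    """Placeholder: run the grid installation kit at a site and validate the result."""
    print(f"{mode} installation of {release} at {site}")
    return True                                   # pretend the installation and validation succeeded

def open_site_ticket(release, site):
    """Placeholder: contact the responsible site administrator about a failed installation."""
    print(f"escalating failed installation of {release} at {site}")

def install_release(release, sites):
    failed_sites = []
    for site in sites:
        if install_kit(release, site, mode="automatic"):
            continue                              # centrally triggered installation worked
        if install_kit(release, site, mode="manual"):
            continue                              # manual attempt by the Cloud software manager worked
        open_site_ticket(release, site)           # both attempts failed: involve the site administrator
        failed_sites.append(site)
    return failed_sites

install_release("ATHENA-X.Y.Z", ["FZK", "LRZ", "DESY-HH", "FZU"])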

Figure 5. Evolution of the number of files of the dataset trig1_misal1_mc12.005200.t1_mcatnlo_jimmy.recon.aod.v12000601_tid005997 in FZKDISK versus time, from the subscription to the completion.

4.3. Monte-Carlo Production
The ATLAS Monte-Carlo production is performed by the production system team. A central database is maintained at CERN in which so-called tasks are defined, where a task is a collection of jobs of a particular type which are defined together. Tasks are assigned to a Cloud and the output data associated with a task is aggregated via the DDM system on the Tier-1 of the assigned Cloud. The distribution of jobs to a Cloud has varied over the last years; the current model uses pilot jobs which are submitted to the Cloud sites and which then fetch a real job for the node once the pilot starts running. This method circumvents many problems which can occur in the job submission phase and as such increases the efficiency of the production system. A central monitoring framework is provided and a shift team follows the progress of tasks, raising trouble tickets when a task, site or Cloud experiences problems.

Initially MC production was performed in a stress-test manner, in which a large amount of data was produced over a period of several months and the system was subsequently assessed and evaluated after the production run. During the last years, however, the production system has operated continuously.

Within the Cloud two instances of the pilot submission servers, known as pilot factories, are supported. This ensures that the Cloud production team have a high level of influence over the job distribution within the Cloud and also that more information is available to them for debugging purposes. The number of jobs which may be run within the Cloud has increased dramatically and it is now possible to run over 7500 jobs simultaneously on the GridKa Cloud resources. Figure 6 shows the number of running production jobs for the period May 2008 - May 2009 as a function of Cloud.
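The pilot idea can be summarised with a minimal sketch: the pilot only pulls a real production job once it is already running on a worker node, so problems in the submission phase never consume a payload. The request_payload and run_payload helpers below are hypothetical stand-ins for the dispatcher and job-execution machinery, which this paper does not detail.

import time

def request_payload(cloud):
    """Placeholder: ask the central production dispatcher for a job assigned to this Cloud."""
    return {"job_id": 42, "task": "example.simulation.task"}   # or None if nothing is queued

def run_payload(job):
    """Placeholder: set up the required ATHENA release and execute the job on this node."""
    print(f"running job {job['job_id']} from task {job['task']}")
    return 0

def pilot(cloud="DE", idle_wait=60):
    """One pilot life cycle: only a pilot already running on a healthy node pulls real work."""
    job = request_payload(cloud)
    if job is None:
        time.sleep(idle_wait)    # nothing to do: wait briefly and terminate
        return
    status = run_payload(job)
    print(f"job {job['job_id']} finished with exit status {status}")

pilot()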

Figure 6. Monte Carlo production during the period May 2008 - May 2009 as a function of Cloud. The GridKa (DE) Cloud is shown in light green.

4.4. ADCOS Shifts
Over the last years we have taken an increasingly active role in the production shifts group. The production shifts are designed to ensure that at any point in time a group of people is responsible for managing the production and for tracking and reporting problems and bugs. The expertise gained in running production and in identifying and treating problems within the grid has helped greatly during the last year. By aiding with the shifts and also helping to produce the tools required for the system we have helped improve the efficiency of our Cloud and of ATLAS production as a whole.

4.5. Distributed Analysis
As the LHC gears up towards turn-on and data taking comes ever closer, we have seen attention turning increasingly towards distributed analysis. It is extremely important that a stable and simple user analysis framework be put in place to allow physics users to exploit the grid resources, while ensuring that these same resources are not adversely impacted by the chaotic analysis patterns which are expected. A great deal of effort has been put into the development of user analysis tools and the evaluation of their usage on the Cloud's resources.

One of the user analysis tools, Ganga [2], has been used to extensively test the resources in several Clouds including the GridKa Cloud. Figure 7 shows the results of a test run of the so-called Ganga Robot against the resources in the GridKa Cloud. Here a high-level view of the jobs, split by site and status, is shown. The test suite, however, provides a more detailed level of information which allows fine-grained analysis of the user analysis at individual sites. An example of this is shown in figure 8, where the cpu/walltime ratio is shown for a sample of analysis jobs which were run at a site. The left plot shows a clear dominance of walltime, indicating a problem when performing analysis at the site. Upon investigation it was found that a poor distribution of data at the site led to a bottleneck when many jobs attempted to access this data. A redistribution of the data led to the much improved cpu/walltime distribution seen in the right plot. Through continued regular tests of the analysis framework the GridKa Cloud has become ever more ready for the start of the more aggressive user analysis period which is expected once LHC data taking starts.
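The cpu/walltime check used to spot such data-access bottlenecks amounts to a simple per-site ratio. The sketch below is illustrative only: the job records and the 0.5 threshold are placeholders, not actual Ganga Robot output.

def site_efficiency(jobs):
    """Mean cputime/walltime ratio of a list of finished test jobs."""
    ratios = [j["cputime"] / j["walltime"] for j in jobs if j["walltime"] > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0

def flag_slow_sites(jobs_by_site, threshold=0.5):
    """Flag sites whose analysis test jobs are dominated by walltime (e.g. waiting for data)."""
    return [site for site, jobs in jobs_by_site.items()
            if site_efficiency(jobs) < threshold]

# Illustrative records; real numbers would come from the analysis test job summaries.
jobs_by_site = {
    "SITE_A": [{"cputime": 900.0,  "walltime": 3600.0}],   # I/O bound: ratio 0.25
    "SITE_B": [{"cputime": 3000.0, "walltime": 3600.0}],   # healthy: ratio ~0.83
}
print(flag_slow_sites(jobs_by_site))    # -> ['SITE_A']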

Figure 7. Ganga Robot tests run against the GridKa Cloud.

Figure 8. The CPU/walltime ratio for distributed analysis test jobs before (left) and after (right) re-distribution of data at a site to enhance access rates.

The analysis tests have also been folded into the SAM framework such that ATLAS-specific tests are run on a regular basis and sites can ensure that they are able to support not only the grid functionality but also the experimental requirements.

4.6. Cloud Monitoring
Several central monitoring tools exist to monitor data management, MC production, site availability etc. These tools are well developed and provide a great deal of information. Despite the existence of these tools, two monitoring projects were undertaken within the GridKa Cloud.

Firstly, a meta-monitoring project was undertaken to gather information from several sources and provide a global overview of the status of the Cloud, as shown in figure 9. Here a global overview can be easily gained, while a large amount of detail is nevertheless contained within this single view. The Cloud monitoring page allows users to quickly identify possible problems, and links are provided to the original monitoring sources, allowing a more in-depth analysis of the problem to be undertaken.
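In essence, such a meta-monitoring page polls several existing sources, reduces each to a per-site status and keeps a link back to the original page for drill-down. The sketch below assumes hypothetical JSON status feeds at example.org and is not the actual GridKa monitoring code.

import json
from urllib.request import urlopen

# Hypothetical status feeds; the real page aggregates existing ATLAS and WLCG monitoring sources.
SOURCES = {
    "transfers":  "http://example.org/ddm/status.json",
    "production": "http://example.org/prod/status.json",
    "sam_tests":  "http://example.org/sam/status.json",
}

def collect_status(sites):
    """Reduce every source to a per-site status and keep the drill-down link to the original page."""
    overview = {site: {} for site in sites}
    for name, url in SOURCES.items():
        try:
            data = json.load(urlopen(url, timeout=30))   # assumed layout: {"SITE": "ok"/"warning"/"error"}
        except Exception:
            data = {}                                    # source unreachable: leave the status unknown
        for site in sites:
            overview[site][name] = {"status": data.get(site, "unknown"), "link": url}
    return overview

print(collect_status(["FZK", "LRZ", "FZU", "CYF"]))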

Figure 9. The GridKa Cloud monitoring web page. A wide range of information is gathered from several monitoring sources and presented in a single global view.

Secondly, a monitoring project was undertaken at a lower level to allow the analysis of data transfers between centers. Data may be moved between the sites in several ways, which have slightly differing dependencies. By performing and monitoring several data transfer methods we can (a) gain a better understanding of the data movement in the Cloud and (b) identify problems more precisely. For instance, a situation in which data transfer via the managed FTS system slows down while direct transfers remain constant would indicate a problem within the FTS system and not with the site or network. Figure 10 shows the data transfer monitoring; transfers can be seen from the T1 to a T2 site with several different transfer mechanisms being deployed.

4.7. Reprocessing
A reprocessing of the RAW ATLAS data will take place at regular intervals to allow the application of improved calibration and alignment data and improved algorithms. The RAW data is stored on tape at the T1 site, and as such a staging plan is needed to ensure that data is intelligently and quickly staged from tape for processing and later removed from the disk cache to allow further reprocessing to take place. The pre-staging task, involving massive recalls from tape, is inherently difficult. Several large-scale tests have been performed to evaluate the system and identify problems. This requires good contact between the experimental and site working groups and an in-depth understanding of both the physical system deployed at the T1 site and the experimental framework for reprocessing. Several recall tests have been performed with increasing levels of success, with sustained recall rates of up to 190 MB/s observed. Figure 11 shows the results from such a recall test: numerous physics datasets were recalled from tape, with an initially high rate being observed followed by a slowdown in which a large time interval passes until the final files are staged.
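A pre-staging exercise of this kind can be sketched as a bulk stage request followed by periodic polling of how many files have reached the disk buffer, which is essentially what figure 11 plots. The request_stage and is_staged helpers and the file list below are hypothetical stand-ins for the storage system's staging interface, not the tooling actually used at the T1.

import time

def request_stage(files):
    """Placeholder: issue a bulk bring-online (staging) request to the tape system."""
    print(f"requested staging of {len(files)} files")

def is_staged(path):
    """Placeholder: ask the storage system whether the file is already on the disk buffer."""
    return True

def prestage(files, filesize_gb=2.0, poll_interval=300):
    """Request all files, then poll until everything is on disk, reporting the average recall rate."""
    request_stage(files)
    start, pending = time.time(), set(files)
    while pending:
        pending = {f for f in pending if not is_staged(f)}
        staged = len(files) - len(pending)
        elapsed = max(time.time() - start, 1.0)
        print(f"{staged}/{len(files)} files on disk, "
              f"average rate {staged * filesize_gb * 1024 / elapsed:.0f} MB/s")
        if pending:
            time.sleep(poll_interval)

prestage([f"/example/raw/data.{i:04d}.RAW" for i in range(4)])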

Figure 10. An example of the data transfer monitoring deployed in the GridKa Cloud. Transfers from the T1 to a T2 site are shown via several possible transfer mechanisms.

Figure 11. Data retrieval from tape at FZK.

5. Conclusion
A successful operational model has evolved within the GridKa Cloud over several years. Each major aspect of the Cloud operations is covered by a working group, and good channels of communication are established between the groups and the sites. Although we face many challenges as ATLAS data taking starts, we are confident that we will see a well functioning and successful Cloud with every opportunity for great physics results.

6. Acknowledgements
I would like to thank the many people who contributed to the GridKa Cloud operations for their time and effort, for making this venture work and for making the last few years so enjoyable. I would like to specifically thank Gen Kawamura and the Goettingen group for their work on the Cloud monitoring.

References
[1] Serfon C et al., Data Management tools and operational procedures in ATLAS, CHEP'09 proceedings
[2] Elmsheuser J et al., Distributed Analysis in ATLAS using GANGA, CHEP'09 proceedings