Interoperating Cloud-based Virtual Farms

Stefano Bagnasco, Domenico Elia, Grazia Luparello, Stefano Piano, Sara Vallero, Massimo Venaruzzo
for the STOA-LHC Project
The STOA-LHC project (1)
- Improve the robustness and usability of the existing LHC Italian infrastructure
- Funded as an Italian PRIN (Research Project of Relevant National Interest); see the summary poster in Poster Session B
- A common effort to ease data and resource access for the LHC community
- This talk focuses on the ALICE-related activity:
  - Parallel and interactive analysis solutions (the Virtual Analysis Facility)
  - Standard access to interactive resources in different local deployments (e.g. a centralised authentication system)
  - Federation among single analysis facilities to optimise distribution of and access to remote data
The STOA-LHC project (2)
- Improve the robustness and usability of the existing LHC Italian infrastructure
- Funded as an Italian PRIN (Research Project of Relevant National Interest); see the summary poster in Poster Session B
- Build a uniform environment for the last mile of analysis:
  - Use familiar interfaces
  - Exploit existing tools
  - Benefit from cloud computing technologies locally (isolate applications, elasticity)
  - Use high-level tools for federation (no cloud federation or bursting)
- Extend the model to allow users outside high-energy physics to re-use the tools and exploit the computing infrastructures
the infrastructure
- Torino: production cloud, OpenNebula, 1.3k cores, 1.6 PB, 10 Gbps WAN
- Bari: PRISMA testbed, OpenStack, 600 cores, 110 TB, 10 Gbps WAN
- Padova-Legnaro: test deployment, OpenStack, 100 cores, 5 TB, 10 Gbps WAN
- Trieste: test deployment, OpenStack, 24 cores, 1.2 TB, 3 Gbps WAN
- Coming soon: Catania and Cagliari
the strategy
- Don't write new tools! Use existing tools and features
- Exploit the good GARR networking between sites
- Explore cloud computing technologies
- Workload management: the Virtual Analysis Facility
  - Presented at CHEP2013 (see next slide)
  - Based on PROOF for interactive analysis
- Data access: use XRootD's available federation tools
key component: the VAF
The Virtual Analysis Facility: PROOF + PoD, CernVM, HTCondor, elastiq
What is the VAF?
- A cluster of CernVM virtual machines: one head node, many workers
- Running the HTCondor job scheduler
- Capable of growing and shrinking based on usage, with elastiq
- Configured via a web interface: cernvm-online.cern.ch
- The entire cluster is launched with a single command; the user interacts only by submitting jobs
- Elastic Cluster as a Service: elasticity is embedded, no external tools
- PoD and dynamic workers: run PROOF on top of it as a special case
(from Dario Berzano's talk at CHEP2013, "A grounds-up approach to High-Throughput Cloud Computing in High-Energy Physics")
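The grow-and-shrink behaviour that elastiq provides can be sketched as a simple decision rule: start workers when jobs are queued, drain workers that sit idle. This is a minimal illustration of the idea only, not elastiq's actual code; the function name and thresholds are made up for the example:

```python
def scale_decision(queued_jobs, idle_vms, running_vms,
                   min_vms=1, max_vms=30):
    """Return how many worker VMs to start (+n) or stop (-n).

    Sketch of an elastiq-style policy: scale up when jobs are
    waiting, scale down when workers are idle, always staying
    between min_vms and max_vms. Thresholds are illustrative.
    """
    total = idle_vms + running_vms
    if queued_jobs > 0 and total < max_vms:
        # start enough workers for the waiting jobs, within the quota
        return min(queued_jobs, max_vms - total)
    if queued_jobs == 0 and idle_vms > 0 and total > min_vms:
        # drain idle workers down to the minimum cluster size
        return -min(idle_vms, total - min_vms)
    return 0
```

In the real system the inputs would come from the HTCondor queue and the decision would translate into cloud API calls; here the point is only that the policy itself is a few lines of logic embedded in the cluster.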
ongoing activity summary
Activities:
- Benchmarking at all sites, with a common analysis task and data set
- Tests on local data storage access (Trieste)
- Application monitoring with the Elasticsearch ecosystem (Torino, Padova); see Sara Vallero's talk on Monday
- Production use at the Torino site, in operation since November 2013:
  - 60 TB of dedicated storage (GlusterFS, XRootD)
  - up to ~100 workers
  - mainly analysis on ntuples (TSelector)
- Data federation (Bari and all sites); see the poster in Poster Session A
worker deployment time
- If new VMs need to be instantiated, worker deployment time ranges from 2.5 min to 3.5 min
- If VMs are already available, deployment time ranges from 16 s to 3 min
- The golden number of 30 workers (see later) is reached in 2.5 min in the first case and in 25 s in the second
(plot: optimal number of workers)
wall-time for different analysis steps
Two analyses compared:
- QAMultistrange analysis: event selection + re-vertexing
- Simple pt spectrum analysis
Data sample: LHC10h (PbPb), run 139510, 226k events
Results:
- For this type of analysis and number of events, 30 workers is the optimal number
- Wall-time is comparable for low and high CPU-intensive analyses
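The existence of an optimal worker count follows from diminishing returns: past a point, the serial part of the analysis dominates and extra workers barely reduce the wall-time. An Amdahl-style toy model makes this concrete; the serial fraction and single-worker time below are assumed round numbers for illustration, not values fitted to the QAMultistrange measurements:

```python
def wall_time(n_workers, serial_frac=0.05, t_one=3600.0):
    """Amdahl-style wall-time model: a fixed serial fraction plus a
    perfectly parallel remainder. Parameters are illustrative, not
    fitted to the measured analyses."""
    return t_one * (serial_frac + (1.0 - serial_frac) / n_workers)

def marginal_gain(n):
    """Relative wall-time reduction from adding one more worker."""
    return (wall_time(n) - wall_time(n + 1)) / wall_time(n)
```

With these parameters the marginal gain per added worker falls steadily with n, so beyond a few tens of workers the deployment cost of extra VMs outweighs the speed-up, which is the qualitative behaviour behind the measured optimum of 30.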
the storage federation blueprint (diagram)
the storage federation blueprint: work in progress
- Bari meta-manager deployed
- Ongoing tests on a subset of sites
distributed storage and data federation
Distribute and share data using a unique Italian XRootD redirector (XRootD-IT)
Two steps of a test analysis:
1. 75% I/O intensive, 25% CPU intensive
2. 17% I/O intensive, 83% CPU intensive
(plot: ratio between the wall time of jobs accessing files via XRootD-IT and locally; 1 = I/O intensive analysis, 2 = CPU intensive analysis)
Results (this is an ongoing task!):
- Difference within 10-20% at most, even for I/O intensive jobs
- Encouraging to further develop the VAF data federation using this XRootD option
- Still to investigate: scalability, stability
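The client-side view of the federation is simple: a job first tries its site's local redirector and, if the file is not there, falls back to the Italian meta-manager, which redirects to whichever federated site holds the file. A sketch of the URL construction follows; the hostnames are placeholders, not the real STOA-LHC endpoints:

```python
def candidate_urls(lfn,
                   local_redirector="xrootd.local.example",
                   meta_manager="xrootd-it.example"):
    """Build the ordered list of XRootD URLs a client would try:
    local storage first, then the Italian meta-manager fallback.
    Hostnames are hypothetical placeholders."""
    return [
        "root://%s/%s" % (local_redirector, lfn),
        "root://%s/%s" % (meta_manager, lfn),
    ]
```

Because the fallback happens at the XRootD protocol level, the analysis code is unchanged whether the data turns out to be local or remote, which is why the 10-20% wall-time penalty above is the only visible difference.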
VAF monitoring with the ELK stack
(diagram: the VAF and INFN Grid services feed monitoring data, via TProofMon/SenderSQL and HTTP, into dedicated MySQL DB tables and the ELK stack; also used for accounting)
- Collect monitoring and accounting data from both the IaaS and the application
- Investigation of the ELK stack to handle heterogeneous and unstructured data sources
- A possible solution for Monitoring-as-a-Service, providing a uniform, extendable monitoring platform to applications
See Sara Vallero's talk on Monday
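Heterogeneous records like these are typically shipped to Elasticsearch through its bulk API, which takes a newline-delimited body of action/document pairs. A minimal sketch of building such a payload, with a made-up index name (the real STOA-LHC index layout is not shown here):

```python
import json

def bulk_payload(records, index="vaf-monitoring"):
    """Serialise monitoring records into an Elasticsearch bulk-API
    body: one action line plus one document line per record, with a
    trailing newline as the API requires. The index name is a
    hypothetical example."""
    lines = []
    for rec in records:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(rec))
    return "\n".join(lines) + "\n"
```

Since each document is free-form JSON, the same pipeline can carry IaaS-level and application-level records side by side, which is what makes the stack attractive for unstructured monitoring data.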
provisional conclusions and outlook
- The VAF model works well and can be easily adapted to different use cases
  - Just need to package an end-to-end toolkit suited to different communities, e.g. without PROOF, PoD or other experiment-specific tools
  - This needs to include a working accounting system
- The ELK stack can be used to build a flexible system providing accounting information and Monitoring-as-a-Service for applications
- The data federation model is also feasible
  - Small performance penalty, balanced by flexibility and deduplication
  - Scalability and stability still under investigation
thanks!
The present work is supported by the Istituto Nazionale di Fisica Nucleare (INFN) of Italy and is partially funded under contract 20108T4XTM of the "Programmi di Ricerca Scientifica di Rilevante Interesse Nazionale" (PRIN, Italy).