Batch and Cloud overview
Andrew McNab
University of Manchester
GridPP and LHCb
Overview
- Assumptions
- Batch systems
- The Grid
- Pilot Frameworks
- DIRAC
- Virtual Machines
- Vac
- Vcycle
- Tier-2 Evolution
- Containers
- Event servers
Basic assumptions
- You have a series of pieces of computation that need doing
  - From more than one person in your collaboration?
  - With different priorities?
  - With different requirements?
- You want to execute these pieces of computation on remote resources
  - Remote: not the machine you edit files on
  - Accessing remote resources means you can use what is available
  - May allow you to scale your computation up by orders of magnitude
- These pieces of computation need to read or write data
  - Where they run? Somewhere remote from that too?
- "Piece of computation" sounds like a job, but not necessarily
Batch systems
- Traditionally all this has been done with batch queuing systems
- Long and noble history going back to the first mainframes, when operators physically loaded people's stacks of punched cards into readers
- Many systems descended from or influenced by NASA's NQS have the command qsub to submit a file that defines the job to be run
- Simplest way of running remote batch jobs is to do something like:
    scp job.sh me@somewhere.ac.uk: ; ssh me@somewhere.ac.uk qsub job.sh
- In the 1990s CERN built a system called SHIFT that did just that
- If you have access to one big system somewhere else, you may be doing that yourself today
- However, it's very unscalable, as it relies on setting up ssh access and on knowing that somewhere.ac.uk has resources you can run on
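The scp + ssh + qsub pattern above can be sketched in a few lines. This is a minimal illustration, not a real submission tool; the hostname and username are the placeholders from the slide.

```python
import subprocess

def build_submit_commands(job_script, user, host):
    """Return the two commands that copy a job script to a remote
    machine and then submit it there with qsub (the NQS-style
    submission command mentioned above)."""
    remote = f"{user}@{host}"
    copy_cmd = ["scp", job_script, f"{remote}:"]
    submit_cmd = ["ssh", remote, "qsub", job_script]
    return copy_cmd, submit_cmd

def submit(job_script, user, host):
    """Run both steps; subprocess raises if either fails."""
    for cmd in build_submit_commands(job_script, user, host):
        subprocess.run(cmd, check=True)
```

The fragility the slide points out is visible here: the remote hostname and the ssh trust relationship are hard-coded knowledge, which is exactly what does not scale.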
The Grid
(Here "The Grid" is the infrastructure of the European Grid Infrastructure (EGI) and the Worldwide LHC Computing Grid (WLCG))
- The Grid gives access to remote batch systems in a much more scalable fashion
- Uses X.509 and VOMS certificates for identifying users and services and for virtual organization (VO) membership
- An information system to discover suitable resources
- Lots of operational support (tickets, mailing lists, monitoring)
- Makes it possible to automate the submission of jobs
  - Originally via a Broker or Workload Management System
  - Now via the client submitting the jobs directly
  - (That sounds like a backward step, but it's not)
The Grid
[Diagram: user and production jobs flow from a job factory through central agents & services, originally via the (now deprecated) WMS Broker, to a Grid site's CREAM or ARC CE & batch queues]
- Underlies the infrastructure run by EGI/WLCG
- Originally proposed by EU DataGrid in 2000
The Grid with Pilot Jobs
[Diagram: a Director (pilot factory) submits pilot jobs to the Grid site's CREAM or ARC CE & batch queues; each pilot runs a Job Agent that requests real jobs from the central Matcher & Task Queue holding the user and production jobs]
- The Grid + pilot jobs is the dominant model for running HEP jobs. Well established, and gives access to resources around the world.
- This late-binding model allows the virtual organization to decide priorities itself.
- It sidelines the batch system's scheduling of time slots: pilot jobs of a given type are all the same, irrespective of their eventual contents.
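The late-binding loop that a pilot runs once it lands on a worker node can be sketched as below. This is an illustrative skeleton only: the `task_queue.match()` and `job.run()` interfaces are assumptions standing in for a real pilot framework's client.

```python
import time

def run_pilot(task_queue, node_info, poll_interval=60, max_idle_polls=3):
    """Keep matching and running payloads against this node's
    capabilities; exit after several empty polls so the batch
    system can reclaim the slot."""
    idle_polls = 0
    while idle_polls < max_idle_polls:
        job = task_queue.match(node_info)  # late binding happens here
        if job is None:
            idle_polls += 1
            time.sleep(poll_interval)
        else:
            idle_polls = 0
            job.run()
```

The key point from the slide is visible in the code: which payload runs is decided by the central matcher at the moment the pilot asks, not when the pilot was submitted to the batch system.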
Pilot frameworks
- The pilot job concept was introduced by LHCb's DIRAC project in 2003
- DIRAC has now been spun off as a separate project, is used by other experiments, and is provided by infrastructures like EGI and GridPP
  - The GridPP DIRAC service is intended to support experiments and VOs without their own pilot framework
- ATLAS and CMS also use pilot frameworks, and some work has been done on using ATLAS's BigPanda
- The pilot framework becomes a higher-level batch system
  - Can be used to hide the details of the underlying batch system, or even whether jobs are running inside VMs at cloud sites
  - The virtual organization can monitor and manage all of the resources available to it centrally
Virtual grid sites
- Cloud systems like OpenStack allow virtualization of the fabric
- What's new is that they expose (de facto?) standard APIs for the creation and management of virtual machines by external users
  - Bringing up a machine can be delegated to user communities
  - Rather than using PXE + Kickstart + Puppet on bare metal
- Zeroth-order virtualization is just to provide a "Grid with Pilot Jobs" site on VMs
  - Potential to use staff more efficiently: one big pool of hardware and hypervisors managed all together, rather than separate clusters
- However, we can take this a step further and have the experiment manage the cloud resources as a virtual grid site
Virtual "Grid with Pilot Jobs" site
[Diagram: a gatekeeper (ARC? Condor?) or a VM factory creates batch VMs at the cloud site; each VM runs a Job Agent that requests real jobs from the central Matcher & Task Queue via the pilot factory and central agents & services]
- The experiment creates its own conventional grid site on the cloud resources
- Transparent to existing central services and to user/production job submitters
- CloudScheduler (ATLAS) and elastiq (ALICE) are implementations of this model
Experiment creates VMs directly?
[Diagram: the experiment's VM factory creates VMs at the cloud site; each VM runs a Job Agent that requests real jobs from the central Matcher & Task Queue]
- The experiment creates VMs instead of pilot jobs; the Job Agent or pilot client runs as normal inside
- CMS's glideinWMS works this way for cloud sites: it looks at demand and creates VMs to join the Condor pool
Further simplification: the Vacuum model
- Following the CHEP 2013 paper: "The Vacuum model can be defined as a scenario in which virtual machines are created and contextualized for experiments by the resource provider itself. The contextualization procedures are supplied in advance by the experiments and launch clients within the virtual machines to obtain work from the experiments' central queue of tasks." ("Running jobs in the vacuum", A McNab et al 2014 J. Phys.: Conf. Ser. 513 032065)
- A loosely coupled, late-binding approach in the spirit of pilot frameworks
- For the experiments, VMs appear by "spontaneous production in the vacuum"
  - Like virtual particles in the physical vacuum: they appear, potentially interact, and then disappear
- Three implementations so far: Vac, Vcycle, HTCondor Vacuum
Vac - the first Vacuum system
[Diagram: each physical host at the vacuum site creates Pilot VMs itself; each VM runs a Job Agent that requests real jobs from the central Matcher & Task Queue]
- Since we have the pilot framework, we can do something really simple
- Strip the system right down and have each physical host at the site create the VMs itself
- Instead of being created by the experiments, the virtual machines appear spontaneously "out of the vacuum" at sites
- Uses the same VMs as on cloud sites
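In a vacuum-style system the site itself decides which experiment's VM to start next. The sketch below is not the real Vac implementation, but illustrates one plausible policy under assumed inputs: site-configured target shares per experiment, compared with what is currently running.

```python
def choose_experiment(target_shares, running):
    """Pick which experiment's VM a host should create next.
    target_shares: {experiment: desired fraction of the site};
    running: {experiment: number of VMs currently running}.
    Returns the experiment furthest below its target share, or
    None if every experiment is at or above target."""
    total = sum(running.values()) or 1  # avoid division by zero
    best, best_deficit = None, 0.0
    for experiment, share in target_shares.items():
        deficit = share - running.get(experiment, 0) / total
        if deficit > best_deficit:
            best, best_deficit = experiment, deficit
    return best
```

Once an experiment is chosen, the host boots a VM from that experiment's centrally supplied image and contextualization, as described on the Pilot VMs slide.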
Vcycle
[Diagram: Vcycle VM factories, run by the site itself, by the experiment centrally, or by a third party, create Pilot VMs at the cloud site; each VM runs a Job Agent that requests real jobs from the central Matcher & Task Queue]
- Applies the Vac ideas to clouds: Vcycle implements an external VM factory that manages VMs
- Can be run centrally by the experiment, by the site itself, or by a third party
- VMs are started and monitored by Vcycle, but not managed in detail ("black boxes")
- No direct communication between Vcycle and the task queue
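One pass of a Vcycle-style factory might look like the sketch below. This is illustrative only: the `cloud` object and its methods are assumed stand-ins for a real cloud API client (e.g. an OpenStack wrapper), and the image name and user_data URL are placeholders.

```python
def one_cycle(cloud, image, user_data_url, max_vms):
    """Treat VMs as black boxes: reap the ones that have shut
    themselves down, then create new ones up to the site's quota.
    Returns the number of VMs created this cycle."""
    vms = cloud.list_vms()
    for vm in vms:
        if vm.state == "SHUTOFF":      # VM ran "shutdown -h" inside
            cloud.delete_vm(vm)
    active = sum(1 for vm in vms if vm.state != "SHUTOFF")
    created = 0
    while active + created < max_vms:
        cloud.create_vm(image=image, user_data=user_data_url)
        created += 1
    return created
```

Note that, as the slide says, nothing here talks to the experiment's task queue: Vcycle only observes VM states and keeps the quota topped up.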
Pilot VMs
- Vac, Vcycle, and HTCondor Vacuum assume the VMs have a defined lifecycle
  - Need a boot image and a user_data file with the contextualisation
    - Provided centrally by the experiment from an HTTPS web server
  - Virtual disks and boot media are defined and the VM started
  - The VM runs and its state is monitored
  - The VM executes "shutdown -h" when finished or if no more work is available
    - May also update a heartbeat file, so stalled or overrunning VMs are killed
    - A shutdown_message file can be used to say why the VM shut down
- Experiments' VMs are a lot simpler for the site to handle than running worker nodes
  - LHC experiments and GridPP use the CernVM-FS wide-area filesystem to provide VMs with operating-system files and experiment software
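The heartbeat and lifetime checks in the lifecycle above amount to a simple classification, sketched here under assumed defaults (the thresholds are illustrative, not Vac's or Vcycle's actual values).

```python
def vm_status(started, last_heartbeat, now,
              max_lifetime=86400, max_silence=600):
    """Classify a running VM from its timestamps (Unix seconds):
    'overrunning' if it has exceeded its allowed wall-clock
    lifetime, 'stalled' if the heartbeat file has not been updated
    within max_silence seconds, otherwise 'ok'. VMs in the first
    two states are candidates to be killed by the factory."""
    if now - started > max_lifetime:
        return "overrunning"
    if now - last_heartbeat > max_silence:
        return "stalled"
    return "ok"
```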
Some implications of virtualization
- Decouples the operating-system versioning requirements of experiments and sites
  - Jobs see an identical environment across sites
- Simplifies many aspects of the security model (escalation is much harder)
- Many opportunities for other simplifications (eg the Vacuum model)
- For the sites, VMs make WLCG systems seem a lot more normal
  - Especially if unusual middleware is only inside the VMs
- When the Grid started, several sites had problems (or in fact failed) trying to run shared clusters with local IT services
  - eg a shared PBS cluster for HEP and other sciences, with the same WN configuration for everyone
- Virtualization gives us a second chance to try this where appropriate
GridPP's Tier-2 Evolution project
- In the plan for GridPP5, we proposed radically simplifying what's needed to run a university grid site (a "Tier-2" site) using VMs
- We're working through all the requirements to be able to run purely VM-based sites
  - Getting VMs signed off by experiments for all Tier-2 workloads
  - Making sure monitoring, accounting, ticketing etc all support VM-based sites and the software we're using to manage them
  - (Also trying to simplify the storage side)
- This is partly in response to reduced staff funding
- But it also dramatically lowers the effort required to create and run a Tier-2 style site to run jobs in GridPP DIRAC VMs and the other VM flavours we support
Containers
- Linux containers can provide an alternative (often a drop-in alternative) in most of the places where I've talked about VMs
- Containers can be used to provide a protected virtual environment of libraries, files etc to payload jobs
  - "Container as VM"
  - So we should talk about "Logical Machines" to be general
- Other, more lightweight models are possible, more focussed on the environment seen by particular processes or sets of processes
- The CernVM base image we're using is being adapted to work as a Docker container
  - Again using CernVM-FS to get OS and experiment files
  - Support for this will be added to Vac
Micro jobs and event servers
- So far we've talked about jobs lasting hours or days
  - The push model of the early Grid became the pull model of pilot jobs
- But much finer-grained pull models also exist
  - Notably the ATLAS experiment's Event Service
- Still have something that looks like a job that runs for hours or days
  - But it now takes smaller packages of work from a central service
  - Every few minutes?
- Makes it much easier to solve the "Masonry Problem" of packing work into the available time and space provided by a fixed-lifetime job
- But the cost is potentially a lot more network connections and opportunities for bottlenecks
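The fine-grained pull loop can be sketched as below. This is an illustration in the spirit of the ATLAS Event Service, not its actual client code: the `server` object, its methods, and the timing parameters are all assumptions.

```python
def run_event_loop(server, process, deadline, margin, now):
    """Pull small packages of work while enough time remains before
    the job slot's hard deadline; 'margin' is the time budgeted for
    one more package, so we never start work we cannot finish and
    upload. Returns the number of packages completed."""
    completed = 0
    while now() + margin < deadline:
        package = server.get_event_range()
        if package is None:
            break  # central queue is drained
        server.upload(process(package))
        completed += 1
    return completed
```

Because each package is small relative to the slot length, the wasted time at the end of the slot is at most one package's worth, which is what makes the masonry problem easy here.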
Summary
- Broad trends towards automation and decoupling jobs from the details of the sites they run at
- Automation gives opportunities for accessing a lot more resources
  - As easy to run at 300 sites once you can automatically pick 1 of 3
  - But running at 300 sites is more complicated when things go wrong
- Pilot job frameworks abstract the specifics of how the job got there
  - The GridPP DIRAC service is intended for experiments and VOs without their own pilot framework
- A lot of activity in using virtualized and now containerized resources
  - GridPP aims to enable sites to run purely using VMs if they choose
- Event servers and micro jobs offer new opportunities to make more efficient use of job slots
Extra slides
The Masonry Problem...
The Masonry Problem
[Diagram: job slots running up to a hard deadline for jobs to finish. A maximum-length job starting just before the advance notice is given can still complete; once the advance notice starts, only short or shorter jobs fit into the time remaining, and the gaps left before the deadline are wasted resources]
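A toy version of the packing shown in the diagram: given the hours left before the hard deadline and the job lengths available, greedily start the longest job that still fits and see how much time is wasted. The greedy policy and the numbers are illustrative, not a real scheduler.

```python
def fill_slot(hours_left, job_types):
    """Repeatedly start the longest job type that still fits in the
    time remaining before the hard deadline.
    Returns (jobs_started, wasted_hours)."""
    started = []
    for length in sorted(job_types, reverse=True):
        while length <= hours_left:
            started.append(length)
            hours_left -= length
    return started, hours_left
```

With only maximum-length jobs available, the slot ends with a wasted gap; adding shorter jobs (or the micro jobs of an event server) shrinks that gap, which is the point of the diagram.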
OpenStack cloud site architecture