HTCondor within the European Grid & in the Cloud




HTCondor within the European Grid & in the Cloud
Andrew Lahiff, STFC Rutherford Appleton Laboratory
HEPiX 2015 Spring Workshop, Oxford

The Grid

Introduction
Computing element requirements:
- Job submission from LHC VOs
  - AliEn: ALICE
  - HTCondor-G: ATLAS, CMS
  - DIRAC: LHCb
- EMI WMS job submission
  - Still used by some non-LHC VOs; usage likely to decrease
- Information system
  - Information about jobs & worker nodes needs to go into the BDII

CREAM CE
- Does not currently work out-of-the-box with HTCondor
  - Support for Condor used to exist but was dropped; some functionality still exists (e.g. BLAH)
- A new set of scripts to be added to the official CREAM release is under development
  - Including info-providers, APEL parser, etc.
- Support is available for any sites wishing to install CREAM & HTCondor starting from the existing scripts

CREAM CE
- Problems with existing scripts:
  - No YAIM function for configuring the CE to use HTCondor
  - Scripts to publish dynamic information about the state of jobs are missing
  - The YAIM function for configuring BLAH doesn't support HTCondor (see the sketch below)
- RAL solution:
  - Made use of scripts from very old versions of CREAM which did support HTCondor
  - Updated these to work with the current EMI-3 CREAM CE; they needed modernizing, e.g. to support partitionable slots
  - Wrote our own APEL parser
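
As an illustration of what the missing BLAH configuration amounts to, a blah.config fragment enabling the HTCondor backend might look like the sketch below; the key names are assumptions based on typical EMI-3 installations, not the actual RAL scripts.

# /etc/blah.config fragment (sketch; key names assumed, check your BLAH version)
supported_lrms=condor
condor_binpath=/usr/bin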

CREAM CE
- RAL has been running 2 CREAM CEs with HTCondor in production for ~1.5 years
- Was used by ALICE, LHCb and non-LHC VOs
- Now only used for ALICE SUM tests; will be decommissioned soon
[Plot: running jobs over the past year, CREAM CEs @ RAL]

ARC CE
- A NorduGrid product
- Features:
  - Simpler than the CREAM CE
  - Can send APEL accounting data directly to the central broker
  - File staging: can download & cache input files; upload output to an SE
- Configuration:
  - Single config file: /etc/arc.conf (a minimal sketch follows)
  - No YAIM required
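
A minimal sketch of what such a config file can look like with an HTCondor backend is shown below; the hostname, paths and queue name are placeholders, and a production CE needs many more options (infosys, accounting, authorisation).

# /etc/arc.conf (sketch, ARC 4.x/5.x style)
[common]
hostname="arc-ce.example.org"
lrms="condor"
x509_user_cert="/etc/grid-security/hostcert.pem"
x509_user_key="/etc/grid-security/hostkey.pem"

[grid-manager]
controldir="/var/spool/arc/jobstatus"
sessiondir="/var/spool/arc/grid"

[queue/grid]
name="grid"
comment="Single queue mapped onto the HTCondor pool"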

ARC CE
Can the LHC VOs submit to ARC?
- ATLAS: uses HTCondor-G for job submission, able to submit to ARC; also uses the ARC Control Tower for job submission
- CMS: uses HTCondor-G for job submission
- LHCb: last year added to DIRAC the ability to submit to ARC
- ALICE: recently regained the capability to submit to ARC
- Submission via EMI WMS: uses HTCondor-G for job submission
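
For illustration, a direct HTCondor-G submission to an ARC CE uses the nordugrid grid type; a minimal sketch is below (the hostname and executable are placeholders, and the NorduGrid client libraries must be installed on the submit host).

# HTCondor-G grid-universe job targeting an ARC CE (sketch)
universe      = grid
grid_resource = nordugrid arc-ce.example.org
executable    = /bin/hostname
output        = arc-test.$(cluster).out
error         = arc-test.$(cluster).err
log           = arc-test.$(cluster).log
queue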

ARC CE
- 4.1.0: contains many patches provided by RAL for the HTCondor backend scripts
- 4.2.0: bug fix for the memory limit of multi-core jobs (HTCondor)
- 5.0.0 (upcoming):
  - Able to make use of per-job HTCondor history files (already used in production at RAL for a long time)
  - Bug fix for the CPU time limit of multi-core jobs
- Repository: available in the EMI, UMD, EPEL and NorduGrid repositories; at RAL we use NorduGrid

ARC CE
- At RAL, we're doing things the HTCondor way
  - ARC CE configured to have a single queue
  - VOs need to request resources: number of cores, memory, wall time, CPU time, etc. (see the xRSL sketch below)
  - No problem for ATLAS and CMS
- It's of course also possible to set up queues
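
Because there is only one queue, each job description carries its own resource requests. A minimal xRSL sketch for an 8-core job is shown below; the values are placeholders, walltime and cputime are intended as minutes, and how memory is interpreted (per core or per job) depends on the ARC version.

&( executable = "run.sh" )
 ( count = "8" )
 ( memory = "2000" )
 ( walltime = "1440" )
 ( cputime = "11520" )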

ARC CE
Information system limitations:
- Doesn't publish per-VO running/idle jobs by default
- With HTCondor as the backend LRMS, doesn't publish max time limits
  - HTCondor has no MaxWalltime parameter like other batch systems; instead, a policy expression is configured on the CE (or on the worker nodes), as sketched below
  - Not easy (or even possible?) to extract a MaxWalltime in general
- We've made some small changes to /usr/share/arc/glue-generator.pl to overcome this
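
For example, a wall-time limit is typically expressed as a policy rather than a single parameter. A schedd-side sketch that removes jobs running longer than 72 hours is below; the limit, and the choice of SYSTEM_PERIODIC_REMOVE rather than a startd PREEMPT policy, are only illustrative.

# condor_config fragment (sketch): remove jobs after 72 hours of wall time
SYSTEM_PERIODIC_REMOVE = (JobStatus == 2 && (time() - EnteredCurrentStatus) > 72 * 3600)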

ARC CE
- ARC CEs are now the only way all 4 LHC VOs (& other VOs) run work at RAL
[Plot: RAL ARC CE usage, running jobs over the past year, ARC CEs @ RAL]

Accounting & scaling factors
- Most sites have heterogeneous farms, so CPU & wall time need to be scaled for correct accounting
- Torque is able to scale CPU & wall time: cpumult & wallmult in the pbs_mom config file
- HTCondor doesn't have this ability. At RAL:
  - Machine ClassAds contain scaling factors
  - Schedds are configured to insert the appropriate scaling factors into the job ClassAds (see the sketch below)
  - For ARC CEs, we use an auth plugin to scale CPU & wall time before accounting records are generated
  - For CREAM CEs, our APEL parser scales CPU & wall time appropriately
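
A minimal sketch of this pattern is shown below; the attribute name ScalingFactor and its value are invented for illustration, as the slide does not name the attribute RAL actually uses.

# Worker node: advertise a scaling factor in the machine ClassAd
ScalingFactor = 2.67
STARTD_ATTRS = $(STARTD_ATTRS) ScalingFactor

# Schedd: record the matched machine's value in the job ClassAd
# (it appears in the job ad as MachineAttrScalingFactor0)
SYSTEM_JOB_MACHINE_ATTRS = $(SYSTEM_JOB_MACHINE_ATTRS) ScalingFactor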

HTCondor-CE
- OSG product: https://twiki.grid.iu.edu/bin/view/documentation/release3/installhtcondorce
- A special configuration of HTCondor
- Can work at sites with PBS, GE, etc. as the batch system, not just HTCondor
- Most interesting for sites with HTCondor as the batch system

The Cloud

HTCondor & the cloud
There are 2 things that can be done:
- HTCondor can manage VMs in cloud resources, using the grid universe (EC2, GCE)
- HTCondor can use cloud resources: dynamic worker nodes

Managing VMs in the cloud
Example submit description file:

universe = grid
grid_resource = ec2 http://cloud.mydomain:4567
executable = $(ec2_ami_id)
log = $(executable).$(cluster).$(process).log
ec2_access_key_id = /home/alahiff/accesskeyid
ec2_secret_access_key = /home/alahiff/secretaccesskey
ec2_ami_id = ami-00000067
ec2_instance_type = m1.large
ec2_keypair_file = $(executable).$(cluster).$(process).pem
ec2_user_data_file = /home/alahiff/my_user_data
queue 5

Managing VMs in the cloud
Submit jobs in the usual way:

-bash-3.2$ condor_submit vms.sub
Submitting job(s)...
5 job(s) submitted to cluster 86.

Managing VMs in the cloud
Check the status of jobs (VMs):

-bash-3.2$ condor_q

-- Submitter: lcggwms01.gridpp.rl.ac.uk : <130.246.180.119:45554> : lcggwms01.gridpp.rl.ac.uk
 ID      OWNER    SUBMITTED    RUN_TIME   ST PRI SIZE CMD
 86.0    alahiff  3/24 09:08   0+00:01:17 R  0   0.0  ami-00000012
 86.1    alahiff  3/24 09:08   0+00:01:17 R  0   0.0  ami-00000012
 86.2    alahiff  3/24 09:08   0+00:01:22 R  0   0.0  ami-00000012
 86.3    alahiff  3/24 09:08   0+00:01:17 R  0   0.0  ami-00000012
 86.4    alahiff  3/24 09:08   0+00:01:17 R  0   0.0  ami-00000012

-bash-3.2$ condor_q -af EC2InstanceName EC2RemoteVirtualMachineName
i-00003557 130.246.223.217
i-00003559 130.246.223.243
i-00003555 130.246.223.208
i-00003558 130.246.223.219
i-00003556 130.246.223.211

Dynamic resources
- Most batch systems require a hardwired list of worker nodes
  - E.g. Torque: /var/torque/server_priv/nodes
  - Difficult to handle resources appearing & disappearing
- HTCondor startds send updates to the collector
  - The collector maintains details about the resources
  - Can configure how quickly startds disappear from the collector after they stop sending updates (see the sketch below)
- This makes HTCondor an ideal choice when there are dynamic resources
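
The relevant knobs, with illustrative values rather than RAL's actual settings:

# Startd (worker node / VM): how often to send updates to the collector, in seconds
UPDATE_INTERVAL = 300

# Collector (central manager): discard machine ads not refreshed within 15 minutes,
# so terminated VMs drop out of the pool reasonably quickly
CLASSAD_LIFETIME = 900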

Provisioning resources
There are a number of ways that resources can be provisioned as they are needed:
- Vcycle
- Cloud Scheduler
- Cloud auto-scaling: OpenStack Heat, OpenNebula OneFlow
- µCernVM: has HTCondor & elastiq built in
  - elastiq monitors the queue: requests VMs if there are idle jobs (scales up), shuts down idle VMs (scales down)

Provisioning resources
Can also use existing HTCondor functionality: condor_rooster
- Designed to wake up hibernating bare-metal worker nodes
- Can specify an arbitrary command to run to wake up worker nodes; configure it to create VMs instead! (see the sketch below)
- Can provision different types of VMs as needed: single-core or multi-core, VO-specific VMs
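
A sketch of a condor_rooster configuration repurposed to boot VMs; the wake-up script name is hypothetical, and a real setup also needs offline machine ads describing the VM types that can be created.

# Central manager: run the rooster daemon
DAEMON_LIST = $(DAEMON_LIST) ROOSTER

# How often to consider "waking up" (i.e. creating) resources, in seconds
ROOSTER_INTERVAL = 300

# Which offline ads are eligible to be woken
ROOSTER_UNHIBERNATE = Offline && Unhibernate

# Instead of condor_power, run a site script that instantiates a VM
ROOSTER_WAKEUP_CMD = "/usr/local/bin/start-virtual-workernode.sh"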

Summary
- There are no blocking issues preventing European sites from migrating to HTCondor
  - Different choices of grid middleware are available
  - An official release of CREAM supporting HTCondor will be available eventually
- HTCondor & clouds
  - Can manage resources on clouds
  - Can use dynamic resources easily
  - Many ways of automatically provisioning resources on clouds