HTCondor within the European Grid & in the Cloud
1 HTCondor within the European Grid & in the Cloud
Andrew Lahiff, STFC Rutherford Appleton Laboratory
HEPiX 2015 Spring Workshop, Oxford
2 The Grid
3 Introduction
- Computing element requirements:
  - Job submission from the LHC VOs: AliEn (ALICE), HTCondor-G (ATLAS, CMS), DIRAC (LHCb)
  - EMI WMS job submission: still used by some non-LHC VOs; usage will probably decrease
  - Information system: information about jobs & worker nodes needs to go into the BDII
4 CREAM CE
- Does not currently work out-of-the-box with HTCondor
- Support for Condor used to exist but was dropped; some functionality still exists (e.g. BLAH)
- A new set of scripts to be added to the official CREAM release is under development, including info-providers, an APEL parser, ...
- Support is available for any sites wishing to install CREAM & HTCondor, starting from the existing scripts
5 CREAM CE
- Problems with the existing scripts:
  - No YAIM function for configuring the CE to use HTCondor
  - Scripts to publish dynamic information about the state of jobs are missing
  - The YAIM function for configuring BLAH doesn't support HTCondor
- RAL solution:
  - Made use of scripts from very old versions of CREAM which did support HTCondor
  - Updated these to work with the current EMI-3 CREAM CE; they needed modernizing, e.g. to support partitionable slots
  - Wrote our own APEL parser
6 CREAM CE
- RAL has been running 2 CREAM CEs with HTCondor in production for ~1.5 years
- Was used by ALICE, LHCb and non-LHC VOs
- Now only used for ALICE SUM tests; will be decommissioned soon
[Plot: running jobs over the past year on the RAL CREAM CEs]
7 ARC CE
- NorduGrid product
- Features:
  - Simpler than the CREAM CE
  - Can send APEL accounting data directly to the central broker
  - File staging: can download & cache input files; upload output to an SE
- Configuration:
  - Single config file, /etc/arc.conf (minimal sketch below)
  - No YAIM required
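For illustration, a minimal sketch of what such an arc.conf might contain for an HTCondor backend; the hostname, directories, APEL broker URL and queue name are placeholders rather than RAL's production settings:

  [common]
  hostname="arc-ce.example.org"
  lrms="condor"

  [grid-manager]
  controldir="/var/spool/arc/jobstatus"
  sessiondir="/var/spool/arc/session"
  # send APEL accounting records directly to the central broker (placeholder URL)
  jobreport="APEL:https://apel-broker.example.org:6162"

  [queue/grid]
  name="grid"
  comment="single queue; resources are requested per job"

A real CE also needs [gridftpd] and [infosys] blocks, but the point is that one file configures the whole service.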
8 ARC CE
- Can the LHC VOs submit to ARC?
  - ATLAS: uses HTCondor-G for job submission and is able to submit to ARC; also uses the ARC Control Tower for job submission
  - CMS: uses HTCondor-G for job submission
  - LHCb: last year added to DIRAC the ability to submit to ARC
  - ALICE: recently regained the capability to submit to ARC
  - Submission via EMI WMS: uses HTCondor-G for job submission (HTCondor-G submission to ARC is sketched below)
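As a hedged sketch of what HTCondor-G submission to an ARC CE looks like (the hostname, executable and resource values below are illustrative, not any VO's actual configuration):

  universe      = grid
  # "nordugrid" is the HTCondor-G grid type used for ARC CEs
  grid_resource = nordugrid arc-ce.example.org
  executable    = pilot.sh
  output        = pilot.$(cluster).$(process).out
  error         = pilot.$(cluster).$(process).err
  log           = pilot.$(cluster).$(process).log
  # resource requirements are passed to the CE as xRSL attributes
  nordugrid_rsl = (count=8)(memory=2000)(walltime=2880)
  queue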
9 ARC CE
- Contains many patches provided by RAL for the HTCondor backend scripts:
  - Bug fix for the memory limit of multi-core jobs (HTCondor)
  - (upcoming) Able to make use of per-job HTCondor history files
  - Used in production at RAL for a long time now
  - Bug fix for the CPU time limit of multi-core jobs
- Repository:
  - Available in the EMI, UMD, EPEL and NorduGrid repositories
  - At RAL we use NorduGrid
10 ARC CE
- At RAL, we're doing things the HTCondor way:
  - The ARC CE is configured to have a single queue
  - VOs need to request resources: number of cores, memory, wall time, CPU time, ... (example request below)
  - This is no problem for ATLAS and CMS
- It's of course also possible to set up queues
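An illustrative xRSL job description making such a request explicitly (the executable name and values are made up); ARC translates these attributes into the corresponding HTCondor job attributes such as RequestCpus and RequestMemory:

  &(executable="run_pilot.sh")
   (jobname="multicore-pilot")
   (count=8)
   (countpernode=8)
   (memory=2000)
   (walltime="48 hours")
   (cputime="384 hours")
   (queue="grid")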
11 ARC CE
- Information system limitations:
  - Doesn't publish per-VO running/idle jobs by default
  - With HTCondor as the backend LRMS, doesn't publish max time limits:
    - HTCondor has no MaxWalltime parameter like other batch systems; instead, a policy expression is configured on the CE (or on the worker nodes), as sketched below
    - So it is not easy (or even possible?) to extract a MaxWalltime in general
- We've made some small changes to /usr/share/arc/glue-generator.pl to overcome this
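One possible form of such a policy expression, as a sketch rather than RAL's actual configuration (the MAX_WALL_SECONDS macro name is an assumption); placed in the schedd configuration on the CE, it removes any job that has been running for more than 96 hours:

  # illustrative limit: 96 hours of wall time
  MAX_WALL_SECONDS = 345600
  # remove running jobs (JobStatus == 2) that have exceeded the limit
  SYSTEM_PERIODIC_REMOVE = (JobStatus == 2) && \
      ((time() - JobCurrentStartDate) > $(MAX_WALL_SECONDS))

Because the limit lives in an arbitrary expression like this rather than in a single parameter, an information provider cannot simply read off a MaxWalltime value.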
12 ARC CE
- ARC CEs are now the only route by which all 4 LHC VOs (& other VOs) run work at RAL
[Plot: RAL ARC CE usage, running jobs over the past year]
13 Accounting & scaling factors
- Most sites have heterogeneous farms, so CPU & wall time need to be scaled for correct accounting
- Torque is able to scale CPU & wall time (cpumult & wallmult in the pbs_mom config file); HTCondor doesn't have this ability
- At RAL:
  - Machine ClassAds contain scaling factors
  - Schedds are configured to insert the appropriate scaling factors into the job ClassAds (sketch below)
  - For ARC CEs, we use an auth plugin to scale CPU & wall time before accounting records are generated
  - For CREAM CEs, our APEL parser scales CPU & wall time appropriately
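A minimal sketch of one way to do this with standard HTCondor knobs (the ScalingFactor attribute name and its value are assumptions, not RAL's exact configuration):

  # worker-node config: advertise the node's scaling factor in its machine ClassAd
  ScalingFactor = 1.21
  STARTD_ATTRS = $(STARTD_ATTRS) ScalingFactor
  # schedd (CE) config: copy the machine attribute into the job ClassAd, where it
  # appears as MachineAttrScalingFactor0 and can be used when building accounting records
  SYSTEM_JOB_MACHINE_ATTRS = $(SYSTEM_JOB_MACHINE_ATTRS) ScalingFactor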
14 HTCondor-CE
- OSG product (installation docs: e3/installhtcondorce)
- A special configuration of HTCondor (sketch below)
- Can work at sites with PBS, GE, etc. as the batch system, not just HTCondor
- Most interesting for sites with HTCondor as the batch system
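The special configuration is essentially a second HTCondor instance whose JobRouter hands incoming grid jobs to the local batch system; for an HTCondor batch system a route might look roughly like this sketch (illustrative, not a tested setup):

  # e.g. in /etc/condor-ce/config.d/: route incoming jobs to the local HTCondor pool
  JOB_ROUTER_ENTRIES = \
     [ \
       name = "Local_Condor"; \
       TargetUniverse = 5; \
     ]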
15 The Cloud
16 HTCondor & the cloud
- There are 2 things that can be done:
  - HTCondor can manage VMs on cloud resources, using the grid universe (EC2, GCE)
  - HTCondor can use cloud resources: dynamic worker nodes
17 Managing VMs in the cloud
Example submit description file:
universe = grid
grid_resource = ec2
executable = $(ec2_ami_id)
log = $(executable).$(cluster).$(process).log
ec2_access_key_id = /home/alahiff/accesskeyid
ec2_secret_access_key = /home/alahiff/secretaccesskey
ec2_ami_id = ami
ec2_instance_type = m1.large
ec2_keypair_file = $(executable).$(cluster).$(process).pem
ec2_user_data_file = /home/alahiff/my_user_data
queue 5
18 Managing VMs in the cloud
Submit the jobs in the usual way:
-bash-3.2$ condor_submit vms.sub
Submitting job(s)...
5 job(s) submitted to cluster 86.
19 Managing VMs in the cloud
Check the status of the jobs (VMs):
-bash-3.2$ condor_q
-- Submitter: lcggwms01.gridpp.rl.ac.uk : <...:45554> : lcggwms01.gridpp.rl.ac.uk
 ID    OWNER    SUBMITTED   RUN_TIME  ST PRI SIZE CMD
 86.0  alahiff  3/24 09:..  ..:01:17  R  ... ...  ami
 86.1  alahiff  3/24 09:..  ..:01:17  R  ... ...  ami
 86.2  alahiff  3/24 09:..  ..:01:22  R  ... ...  ami
 86.3  alahiff  3/24 09:..  ..:01:17  R  ... ...  ami
 86.4  alahiff  3/24 09:..  ..:01:17  R  ... ...  ami
-bash-3.2$ condor_q -af EC2InstanceName EC2RemoteVirtualMachineName
 i-... ...
 i-... ...
 i-... ...
 i-... ...
 i-... ...
20 Dynamic resources
- Most batch systems require a hardwired list of worker nodes (e.g. Torque's /var/torque/server_priv/nodes), which makes it difficult to handle resources appearing & disappearing
- HTCondor startds send updates to the collector, and the collector maintains the details about the resources
- You can configure how quickly a startd disappears from the collector after it stops sending updates (knobs sketched below)
- This makes HTCondor an ideal choice for situations where there are dynamic resources
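The relevant configuration knobs, with illustrative values:

  # on each (dynamic) worker node: how often the startd re-advertises itself
  UPDATE_INTERVAL = 300
  # on the central manager: how long the collector keeps an ad that is no longer
  # being refreshed before the resource disappears from the pool
  CLASSAD_LIFETIME = 600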
21 Provisioning resources
- There are a number of ways that resources can be provisioned as they are needed:
  - Vcycle
  - Cloud Scheduler
  - Cloud auto-scaling: OpenStack Heat, OpenNebula OneFlow
  - µCernVM: has HTCondor & elastiq built in; elastiq monitors the queue, requests VMs if there are idle jobs (scales up) and shuts down idle VMs (scales down)
22 Provisioning resources
- Can also use existing HTCondor functionality: condor_rooster
  - Designed to wake up hibernating bare-metal worker nodes
  - Can specify an arbitrary command to run to wake up worker nodes: configure it to create VMs instead! (sketch below)
  - Can provision different types of VMs as needed: single-core or multi-core, VO-specific VMs
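A sketch of the idea (the provisioning script and the Offline-ad condition are assumptions, not RAL's production setup):

  # run condor_rooster on the central manager
  DAEMON_LIST = $(DAEMON_LIST) ROOSTER
  # how often to consider "waking" (i.e. provisioning) more resources
  ROOSTER_INTERVAL = 300
  # which offline machine ads are candidates to be woken up
  ROOSTER_UNHIBERNATE = Offline =?= True
  # arbitrary wake-up command: here a script that creates a VM of the required
  # type instead of waking a physical node (hypothetical script)
  ROOSTER_WAKEUP_CMD = "/usr/local/bin/create_vm.sh"

One way to drive this is to place offline ClassAds in the collector for each VM type that could be created, so that condor_rooster can decide which type to provision based on the idle jobs.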
23 Summary
- There are no blocking issues preventing European sites from migrating to HTCondor:
  - Different choices of grid middleware are available
  - An official release of CREAM supporting HTCondor will be available eventually
- HTCondor & clouds:
  - Can manage resources on clouds
  - Can use dynamic resources easily
  - Many ways of automatically provisioning resources on clouds