Dynamic Resource Provisioning with HTCondor in the Cloud
Ryan Taylor, Frank Berghaus
Overview
- Review of the HTCondor + Cloud Scheduler system
- Condor job slot configuration
- Dynamic slot creation
- Automatic slot defragmentation
- Target shares
- Flexible use of cloud resources
Be Dynamic
Do the right thing...
- On any type of cloud or hypervisor: Nimbus, OpenStack, KVM, Xen
- Anywhere in the world: data federations, squid federator (Shoal)
- For any kind of job: SCORE (the old requirement), MCORE and HIMEM (the new requirements)
...using just one image, and minimizing static configuration
Cloud Scheduler
- A Python package for managing VMs on IaaS clouds
- Users submit HTCondor jobs; optional attributes specify VM properties
- Dynamically manages the quantity and type of VMs in response to user demand
- Easily connects to many IaaS clouds, aggregates their resources, and presents them as an HTCondor batch system
- Used by ATLAS, Belle II, CANFAR, BaBar
- https://github.com/hep-gc/cloud-scheduler
- Publication in CHEP 2010 proceedings
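Users drive Cloud Scheduler entirely through ordinary HTCondor submit files. Below is a hedged sketch of what that looks like: the custom VM attribute +VMType (and +TargetClouds, which appears on a later slide) is illustrative and may differ between Cloud Scheduler versions, and the executable is a placeholder.

universe = vanilla
executable = run_job.sh            # placeholder
request_cpus = 1
request_memory = 2000              # MB
# Custom attributes read by Cloud Scheduler to decide which VM image
# to boot and on which cloud (names and values are illustrative)
+VMType = "ucernvm-batch"
+TargetClouds = "CERN"
queue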
- VMs boot and join the Condor pool, contextualized with an x509 credential for authentication
- VMs run jobs continuously
- Cloud Scheduler automatically retires unneeded VMs
[Diagram: user jobs and pilot jobs flowing to the VMs]
VM Image
- µCernVM: OS loaded on demand via CVMFS
- 10 MB image (kernel + CVMFS client)
- Contextualization: cloud-init and Puppet
  - Configures Condor, CVMFS, Shoal, the grid environment, etc.
  - https://github.com/berghaus/atlasgce-modules
  - Developed by Frank Berghaus and Henric Öhman
- Instrumented with Ganglia monitoring: agm.cern.ch
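To make the contextualization step concrete, here is a hedged sketch of the kind of worker-side Condor configuration the Puppet modules lay down on the VM; the host name and knob values are placeholders, not taken from atlasgce-modules.

# Illustrative worker-node condor_config fragment (values are placeholders)
CONDOR_HOST = condor.heprc.uvic.ca
DAEMON_LIST = MASTER, STARTD
# Advertise the whole VM as one partitionable slot (see the slot slides below)
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = True
# Start jobs as soon as the VM joins the pool
START = True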
ATLAS Production in the Cloud (Jan-Sep 2014)
- Over 1.2M ATLAS jobs completed in 2014
- Single-core and multi-core production
- Also running 10% of Belle II jobs in the cloud with DIRAC
[Plot: completed jobs per cloud, Jan 2014 to Sep 2014; clouds include Helix Nebula, Google, CERN, North America, Australia, UK. Credit: Frank Berghaus]
Condor Slot Configuration

Static slots: 8 x 1-core slots (or 1 x 8-core slot), the old way:
NUM_SLOTS = 8
SLOT1_USER = slot01
SLOT2_USER = slot02
...

Dynamic (partitionable) slots, the new way:
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = True
SLOT1_1_USER = slot01
SLOT1_2_USER = slot02
...
Dynamic Slot Creation
Jobs specify their resource requirements (a complete submit-file sketch follows below):

request_cpus = 2
request_memory = 8000       # MB
request_disk = 40000000     # KB

Condor dynamically creates a slot of the requested size.
- Problem: VMs get fragmented into smaller slots, so larger jobs can't run
- Solution: the DEFRAG daemon coalesces slots back together
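A minimal submit file using these requests; the executable name is a placeholder:

universe = vanilla
executable = analysis_job.sh       # placeholder
request_cpus   = 2
request_memory = 8000              # MB
request_disk   = 40000000          # KB
queue

The startd carves a dynamic slot of exactly this size out of the VM's partitionable slot; whatever is left over remains available for further jobs.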
Automatic Slot Defragmentation
Currently being tested and tuned. Thanks to Andrew Lahiff (RAL) for the sample configuration.

DAEMON_LIST = $(DAEMON_LIST), DEFRAG
DEFRAG_INTERVAL = 600
DEFRAG_DRAINING_MACHINES_PER_HOUR = 10.0
DEFRAG_MAX_CONCURRENT_DRAINING = 30
DEFRAG_MAX_WHOLE_MACHINES = 50
DEFRAG_SCHEDULE = graceful
DEFRAG.SETTABLE_ATTRS_ADMINISTRATOR = DEFRAG_MAX_CONCURRENT_DRAINING, DEFRAG_DRAINING_MACHINES_PER_HOUR, DEFRAG_MAX_WHOLE_MACHINES
ENABLE_RUNTIME_CONFIG = TRUE
DEFRAG_RANK = ifThenElse(Cpus >= 8, -10, (TotalCpus - Cpus) / (8.0 - Cpus))
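The sample above relies on the defaults for deciding which machines to drain and when a machine counts as whole again. A hedged sketch of the companion knobs a site might set, assuming the same 8-core MCORE target (these expressions are illustrative, not from the slide):

# Only consider partitionable slots that cannot yet fit an 8-core job
DEFRAG_REQUIREMENTS = PartitionableSlot && Cpus < 8
# A machine counts as "whole" once 8 cores are free again
DEFRAG_WHOLE_MACHINE_EXPR = PartitionableSlot && Cpus >= 8
# Stop draining a machine as soon as it becomes whole
DEFRAG_CANCEL_REQUIREMENTS = $(DEFRAG_WHOLE_MACHINE_EXPR)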
Target Shares
Use Condor accounting groups to prioritize job types:

GROUP_NAMES = group_analysis, group_production
GROUP_QUOTA_DYNAMIC_group_analysis = 0.05
GROUP_QUOTA_DYNAMIC_group_production = 0.95
GROUP_ACCEPT_SURPLUS = True

In the job definition add:
AccountingGroup = "group_production"
or
AccountingGroup = "group_analysis"

Thanks to Joanna Huang and Sean Crosby from the Australian ATLAS group.
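In a submit description file the group is attached as a custom job attribute; a minimal sketch, where the leading "+" is submit-file syntax and the executable name is a placeholder:

universe = vanilla
executable = production_job.sh     # placeholder
request_cpus = 1
# charge this job to the production group's 95% target share
+AccountingGroup = "group_production"
queue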
Flexible Cloud Resources
[Diagram: the CERN Cloud Scheduler instance (cloudscheduler.cern.ch) dynamically provisions slots from both the GridPP cloud group and the CERN cloud group.]
Flexible Cloud Resources

GridPP cloud group:
  GridPP_CLOUD:          request_cpus = 1,  TargetClouds = GridPP
  GridPP_CLOUD_HIMEM:    request_memory = 8000,  TargetClouds = GridPP

CERN cloud group:
  CERN-Prod_CLOUD:       request_cpus = 1,  TargetClouds = CERN
  CERN-Prod_CLOUD_MCORE: request_cpus = 8,  TargetClouds = CERN

To run jobs with arbitrary resource requirements: just submit them. Automatic, with no dedicated nodes.
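For example, a HIMEM job aimed at the GridPP cloud group just declares its needs. A hedged sketch, where the executable is a placeholder and +TargetClouds is the Cloud Scheduler hint shown above:

universe = vanilla
executable = himem_job.sh          # placeholder
request_cpus   = 1
request_memory = 8000              # MB
# Cloud Scheduler hint: run this job on the GridPP clouds
+TargetClouds = "GridPP"
queue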
Dynamic Resource Provisioning
a.k.a. how to provision MCORE resources using IaaS:
1) Connect the IaaS cloud to Cloud Scheduler
2) Create a PanDA queue
3) Compose the JDL (see the sketch below)
4) Configure the pilot factory
5) Profit
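Putting the pieces together, step 3 might look like the hedged MCORE pilot JDL below; the executable, memory request, and proxy path are illustrative assumptions, while the core count, target cloud, and accounting group echo the earlier slides.

universe = vanilla
executable = runpilot_mcore.sh     # placeholder pilot wrapper
request_cpus   = 8
request_memory = 16000             # MB, illustrative
# route the pilot to the CERN cloud group and charge the production group
+TargetClouds = "CERN"
+AccountingGroup = "group_production"
x509userproxy = /path/to/proxy     # placeholder
queue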
Questions

Bonus Material
Credential Transport
The user's credentials are securely delegated to the VMs, and VMs joining the Condor pool are authenticated.
Security Configuration

Worker:
GSI_DAEMON_DIRECTORY = /etc/grid-security
GSI_DAEMON_CERT = /etc/grid-security/hostcert.pem
GSI_DAEMON_KEY = /etc/grid-security/hostkey.pem

Central Manager:
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = GSI
SEC_DEFAULT_ENCRYPTION = REQUIRED
SEC_DEFAULT_ENCRYPTION_METHODS = 3DES
SEC_DEFAULT_INTEGRITY = REQUIRED
GRIDMAP = /etc/grid-security/grid-mapfile
GSI_DAEMON_CERT = /etc/grid-security/condor/condorcert.pem
GSI_DAEMON_KEY = /etc/grid-security/condor/condorkey.pem
GSI_DAEMON_TRUSTED_CA_DIR = /etc/grid-security/certificates/
GSI_DELEGATION_KEYBITS = 1024
GSI_DAEMON_NAME = /C=CA/O=Grid/CN=condor/condor.heprc.uvic.ca, /C=CA/O=Grid/CN=condorworker/condor.heprc.uvic.ca
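The GRIDMAP file referenced above maps certificate DNs to local accounts; a hedged sketch of its contents, reusing the DNs from this slide (the local account names are assumptions):

"/C=CA/O=Grid/CN=condor/condor.heprc.uvic.ca" condor
"/C=CA/O=Grid/CN=condorworker/condor.heprc.uvic.ca" condorworker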
[Architecture diagram: the user submits jobs to the Condor central manager; Cloud Scheduler reads the scheduler status and communicates with cloud interfaces (OpenStack, GCE, ...) to boot virtual machines that join the pool.]
Shoal: Squid Cache Federator (Jun. 2014 S&C Week)
- Shoal tracks squids: discovers new ones, removes missing ones
- Workers find the best squids
- Builds a configuration-less mesh instead of statically defined, site-specific mappings
- CHEP 2013 poster; github.com/hep-gc/shoal
- WLCG Proxy Discovery TF
(from Ryan Taylor, ATLAS Cloud Computing R&D, 2014-06-05)
Dynamic Federation (Sep. 2014 S&C Workshop)
- High performance: aggressive metadata caching in RAM, maximal concurrency, scalable (~10^6 hits/s per core); 24 GB of RAM for the metadata cache is enough for ~100 PB of data in endpoints
- Well designed: stateless, no persistency; standard components and protocols, not HEP-specific; simple and lightweight; trivial to add endpoints, with no site action needed
- Dynamic: automatically detects and avoids offline endpoints; federating is done on the fly
(from Ryan Taylor, ATLAS S&C Week, 2014-10-27)