Clusters in the Cloud

Clusters in the Cloud
Dr. Paul Coddington, Deputy Director
Dr. Shunde Zhang, Computing Specialist
eResearch SA, October 2014

Use Cases
- Make the cloud easier to use for compute jobs, particularly for users familiar with HPC clusters
- Personal, on-demand cluster in the cloud
- Cluster in the cloud: a private cluster available only to a research group, or a shared Node-managed cluster; preferably dynamic/elastic
- Cloudbursting for HPC: dynamically (and transparently) add extra compute nodes from the cloud to an existing HPC cluster

Cluster infrastructure (layered stack)
- Software layer: Configuration Management System, Monitoring, Shared File System, Application Distribution
- Local Resource Management System / queueing system
- Hardware, Network

Traditional Static Cluster
Hardware/Network:
- Dedicated hardware; long process to get new hardware
- Static, not elastic
Software:
- Assumes a fairly static environment (IPs etc.); not cloud-friendly
- Some systems need a restart if the cluster is changed; not adaptable to changes

Cluster in the Cloud
Hardware/Network:
- Provisioned by the cloud (OpenStack)
- New resources in minutes; resources removed in minutes
- Elastic/scalable on demand
Software:
- Dynamic: nodes can easily be added or removed as needed

Possible solutions
- Condor for high-throughput computing: Cloud Scheduler is working for a CERN LCG node, and recent versions of Condor support cloud execution
- Torque/PBS static cluster in the cloud: works, but painful to set up and maintain
- Dynamic Torque/PBS cluster: no existing dynamic/elastic solution
- StarCluster for a personal cluster: automates setup of VMs in the cloud, including the cluster, and worker nodes can be added or removed manually; supports only Amazon, SGE and Condor, not PBS/Torque

Our work
- Condor for high-throughput computing: Cloud Scheduler for an Australian CERN LCG node
- Torque/PBS static cluster in the cloud: set up a large cluster in the cloud for CoEPP, with scripts to automate setup and monitoring
- Dynamic Torque/PBS cluster: created the Dynamic Torque system for OpenStack
- StarCluster for a personal cluster: ported to OpenStack, added a Torque plugin, and built add-ons to make it easier to use for eRSA users

Application software
- Want familiar HPC applications to be available to cloud VMs
- Don't want to install and maintain software twice, in HPC and in the cloud
- But there is a limit on the size of VM images in the cloud, and we want to avoid making lots of custom images
- We use CVMFS: a read-only, HTTP-based distributed file system, used by the CERN LHC Grid for distributing software
- One VM image, with the CVMFS client, which downloads and caches software from the HPC cluster
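The CVMFS client on the VM image needs little more than a repository list and a proxy. A minimal sketch of such a client configuration follows; the repository name and proxy host are hypothetical placeholders, not the actual eRSA setup:

```ini
# /etc/cvmfs/default.local -- CVMFS client configuration (sketch)
# Repository name and proxy URL are hypothetical placeholders.
CVMFS_REPOSITORIES=software.ersa.example.org
CVMFS_HTTP_PROXY="http://cvmfs-proxy.example.org:3128"
CVMFS_QUOTA_LIMIT=10000    # local cache size in MB
```

The client fetches files over HTTP on first access and caches them locally, so a single VM image can serve every application without baking software into the image.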

HTC Cluster in the Cloud
- NeCTAR eResearch Tools project for high-throughput computing in the cloud
- The ARC Centre of Excellence in Experimental Particle Physics (CoEPP) needed a large cluster for CERN ATLAS data analysis and simulation
- Tier 2 (global) and Tier 3 (local) jobs
- Augments existing small physical clusters running Torque at multiple sites

CERN ATLAS experiment


Static Cluster in the Cloud
- Built a large Torque cluster using cloud VMs: a challenging exercise!
- Reliability issues; needed a lot of scripts to automate setup, monitoring, recovery, etc.
- Some types of usage are bursty, but the cluster resources were static
- Didn't take advantage of the elasticity of the cloud

Dynamic Torque
- Static and dynamic worker nodes: static nodes stay up all the time; dynamic nodes come up and go down according to workload
- Independent of Torque/MAUI: runs as a separate process and only adds/removes worker nodes
- Queries Torque and the MAUI scheduler periodically
- It is still up to the MAUI scheduler to decide where to run a job
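The add/remove decision such a separate process makes each polling cycle can be sketched as pure logic. The policy, names, and one-job-per-node simplification below are illustrative assumptions, not Dynamic Torque's actual algorithm:

```python
# Sketch of the scale-up/scale-down decision a Dynamic Torque-style
# daemon might make each polling cycle. Policy and names are
# illustrative assumptions, not the real implementation.

def nodes_to_change(queued_jobs, idle_nodes, busy_nodes,
                    min_static, max_nodes):
    """Return +n to boot n dynamic workers, -n to delete n idle ones.

    Assumes (for illustration) one job per node. Static nodes
    (min_static) are never deleted; max_nodes caps the allocation.
    """
    total = idle_nodes + busy_nodes
    if queued_jobs > idle_nodes:
        # backlog: boot enough nodes to cover it, within the cap
        return max(min(queued_jobs - idle_nodes, max_nodes - total), 0)
    if queued_jobs == 0 and idle_nodes > 0:
        # queue drained: shrink back, but never below the static core
        return -min(idle_nodes, total - min_static)
    return 0
```

In the real system the inputs would come from periodically querying Torque and MAUI (e.g. parsing qstat/pbsnodes output), and the returned delta would translate into OpenStack boot/delete calls; MAUI still decides where each job actually runs.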

Dynamic Torque

Dynamic Torque for CoEPP (architecture diagram)
- Services: Ganglia, Nagios, CVMFS, Torque/MAUI, LDAP and NFS, Puppet, Dynamic Torque
- Worker nodes in SA, Monash and Melbourne; interactive nodes in Melbourne


CoEPP Outcomes
- Three large clusters in use for over a year, with hundreds of cores in each
- Condor and Cloud Scheduler for ATLAS Tier 2; Dynamic Torque for ATLAS Tier 3 and Belle
- LHC ATLAS experiment at CERN: 530,000 Tier 2 jobs and 325,000 CPU hours for Tier 3 jobs
- Belle experiment in Japan: 150,000 jobs

Private Clusters
- Good for building a shared cluster for a large research group with good IT support that can set up and manage a Torque cluster
- What about the many individual researchers or small groups who also want a private cluster using their cloud allocation, but have no dedicated IT staff and only very basic Unix skills?
- Is there a simple DIY solution?

StarCluster
Generic setup:
- Create a security group for the cluster
- Launch VMs (master, node01, node02 ...)
- Set up public keys for password-less SSH
- Install NFS on the master and share scratch space to all nodeXX; EBS (Cinder) volumes can be used as scratch space
Queuing system setup via plugins: Condor, SGE, Hadoop, and your own plugin!

StarCluster for OpenStack (architecture diagram)
- eRSA App Repository (CVMFS server) with a CVMFS proxy
- Head node (NFS, Torque server, MAUI), worker nodes (Torque MOM) and a volume, launched by StarCluster through the OpenStack EC2 API

StarCluster - configuration
- Availability zone
- Image (optionally a separate image for the master)
- Flavor (optionally a separate flavor for the master)
- Number of nodes
- Volume
- Username, user ID, group ID, user shell
- Plugins
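Mapped onto StarCluster's config file, those settings might look roughly like the following sketch; all IDs, names, and sizes are hypothetical placeholders:

```ini
; ~/.starcluster/config -- sketch of a cluster template
; (IDs, key names, and sizes are hypothetical placeholders)
[cluster mycluster]
KEYNAME = mykey
CLUSTER_SIZE = 4                 ; number of nodes
CLUSTER_USER = researcher        ; username on the cluster
NODE_IMAGE_ID = ami-00000000     ; image for all nodes
NODE_INSTANCE_TYPE = m1.medium   ; flavor
AVAILABILITY_ZONE = sa
VOLUMES = scratch
PLUGINS = torque                 ; queueing-system plugin

[volume scratch]
VOLUME_ID = vol-00000000
MOUNT_PATH = /scratch
```

The plugin named here stands in for whichever queueing-system plugin is installed (e.g. the Torque plugin mentioned above for the OpenStack port).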

Start a cluster with StarCluster
# fire up a new cluster (from your desktop)
$ starcluster start mycluster
# log in to the head node (master) to submit jobs
$ starcluster sshmaster mycluster
# copy files to/from the cluster
$ starcluster put mycluster /path/to/local/file/or/dir /remote/path/
$ starcluster get mycluster /path/to/remote/file/or/dir /local/path/
# add compute nodes to the cluster
$ starcluster addnode -n 2 mycluster
# terminate it after use
$ starcluster terminate mycluster

Other options for a Personal Cluster
- Elasticluster: Python code to provision VMs, Ansible to configure them, with Ansible playbooks for Torque/SGE/..., NFS/PVFS/...
- Heat: everything in a HOT template; earlier versions had limitations that made it hard to implement everything; may revisit in the future

Private Cluster in the Cloud
- You can use your personal or project cloud allocation to start up your own personal cluster in the cloud
- No need to share! Except among your group.
- Can use the standard PBS/Torque queueing system to submit jobs (or not); only your jobs are in the queue
- But you have to set up and manage the cluster: straightforward if you have good Unix skills (unless things go wrong ...)
- Several groups are now using this, but eRSA does the support when things go wrong

Emu Cluster in the Cloud
- Emu is an eRSA cluster that runs in the cloud
- Aimed to be like an old cluster (Corvus): 8-core compute nodes
- But a bit different: dynamically created VMs in the cloud, optional private compute nodes, and different-size compute nodes if you want

Emu
- eRSA-managed dynamic cluster in the cloud, shared by multiple cloud tenants and eRSA users
- All nodes are in the SA zone; the eRSA cloud allocation contributes 128 cores
- Users can bring their own cloud allocation to launch their worker nodes in Emu, so they don't need to build and look after their own personal cluster
- Users' Cinder volume storage can also be mounted on their own worker nodes via NFS
- Set up so researchers use their eRSA accounts

Using your own cloud allocation
- Users add our sysadmins to their tenant so we can launch VMs on their behalf (we will look at Trusts in Icehouse)
- Add some configuration to Dynamic Torque: number of static/dynamic nodes, size of nodes, etc.
- Add a group of user accounts allowed to use it
- Create a reservation for the users' worker nodes in MAUI
- A special account string must be put in the job to match users' jobs to their group's reserved nodes
- A qsub filter checks that the account string is valid, so you can't submit a job using another group's allocation
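The core of such a qsub filter can be sketched as follows. The mechanism is real (Torque pipes the job script through a submit filter and rejects the job on a non-zero exit), but the group-to-account mapping and function names below are hypothetical:

```python
# Sketch of the check inside a Torque qsub submit filter.
# Torque pipes the job script through the filter on stdin and
# rejects the job if the filter exits non-zero. The mapping of
# Unix groups to allowed account strings is hypothetical.
import re

ALLOWED = {
    "coepp": {"coepp-tier3"},      # hypothetical group -> accounts
    "biolab": {"biolab-alloc"},
}

def account_ok(script_text, user_groups, allowed=ALLOWED):
    """True if every '#PBS -A <account>' line in the job script
    names an account string one of the user's groups may charge."""
    accounts = re.findall(r"^#PBS\s+-A\s+(\S+)", script_text,
                          re.MULTILINE)
    entitled = set()
    for group in user_groups:
        entitled |= allowed.get(group, set())
    return all(account in entitled for account in accounts)
```

A wrapper script registered as the submit filter in torque.cfg would read the job script from stdin, look up the submitting user's groups, and exit 1 with an error message when the check fails, so a job cannot be submitted against another group's reservation.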

Emu (architecture diagram)
- Services: Sensu, CVMFS, Torque/MAUI, LDAP and NFS, Salt, Dynamic Torque
- Worker nodes of Tenant 1, worker nodes of Tenant 2, and shared worker nodes (eRSA-donated), with NFS mounts

Static cluster vs dynamic cluster

                   Static Cluster       Dynamic Cluster
Hardware           Physical machines    Virtual machines
LRMS               Torque               Torque with Dynamic Torque
CMS                Puppet               Salt Stack
Monitoring         Nagios, Ganglia      Sensu, Graphite, Logstash
App distribution   NFS mount            CVMFS
Shared FS          NFS                  NFS

Future Work
- Better reporting and usage graphs; more monitoring checks
- Queueing system: multi-node jobs don't work because a new node is not trusted by existing nodes, and the trust list is only updated when the PBS server is started; could hack the Torque source code, or maybe use SLURM or SGE
- A better way to share user credentials: Trusts in Icehouse?
- National and/or other regional services

Future Work
- Spot instance queue
- Distributed file system: NFS is the static component; storage cannot be added to NFS without stopping it, and new nodes cannot be added to the allow list dynamically, so iptables must be used and updated instead; investigating a dynamic and distributed FS, one FS for all tenants
- Alternatives to StarCluster: Heat or Elasticluster

Resources
- Cloud Scheduler: http://cloudscheduler.org/
- StarCluster: http://star.mit.edu/cluster
- StarCluster OpenStack version: https://github.com/shundezhang/starcluster/
- Dynamic Torque: https://github.com/shundezhang/dynamictorque

Nagios vs Sensu
Nagios:
- First designed last century, for static environments
- Local configuration must be updated and the service restarted if a remote server is added or removed
- The server performs all checks, so it does not scale
Sensu:
- Modern design with AMQP as the communication layer
- A local agent runs the checks
- Weak coupling between clients and server, so it scales

Imagefactory
- Template in XML: packages, commands, files
- An image can be backed up in GitHub
- Automatic, no user interaction required
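An Imagefactory template of this shape (a TDL document, with packages, commands, and files sections) might look roughly like the following sketch; the OS, mirror URL, package names, and file contents are hypothetical placeholders:

```xml
<!-- Sketch of an Imagefactory/Oz TDL template; all names and URLs
     below are hypothetical placeholders. -->
<template>
  <name>emu-worker</name>
  <os>
    <name>CentOS-6</name>
    <version>6</version>
    <arch>x86_64</arch>
    <install type="url">
      <url>http://mirror.example.org/centos/6/os/x86_64</url>
    </install>
  </os>
  <packages>
    <package name="torque-mom"/>
    <package name="cvmfs"/>
  </packages>
  <commands>
    <command name="enable-services">chkconfig pbs_mom on</command>
  </commands>
  <files>
    <file name="/etc/motd">Emu worker node (sketch)</file>
  </files>
</template>
```

Because the template is plain XML, it can be version-controlled (e.g. in GitHub) and rebuilt automatically with no user interaction, as the slide notes.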

Emu Monitoring
- Sensu: runs health checks
- Logstash: collects PBS server/MOM/accounting logs
- Collectd: collects metrics for CPU, memory, disk, network, etc.

Salt Stack vs Puppet

                   Salt Stack           Puppet
Architecture       Server-client        Server-client
Working model      Push                 Pull
Communication      ZeroMQ + msgpack     HTTP + text
Language           Python               Ruby
Remote execution   Yes                  No