Cloud Bursting with SLURM and Bright Cluster Manager. Martijn de Vries CTO





Architecture

[Figure: architecture diagram, labeled CMDaemon]

Management Interfaces

Graphical User Interface (GUI)
  - Offers administrator full cluster control
  - Standalone desktop application
  - Manages multiple clusters simultaneously
  - Runs on Linux & Windows
  - Built on top of Mozilla XUL engine

Cluster Management Shell (CMSH)
  - All GUI functionality also available through Cluster Management Shell
  - Interactive and scriptable in batch mode

Workload Manager Integration

Integration with workload manager:
  - All popular workload managers supported
  - SLURM default choice during installation
  - Automatic installation

Points of integration:
  - Automatic node and queue configuration
  - Automatic high availability configuration
  - Monitoring workload management metrics
  - Health checking
  - Job monitoring and control

Cloud Bursting Scenario I

[Figure: head node with node001, node002, node003]

Mixing Local and Cloud Resources

Cloud does not work well for all HPC workloads:
  - Sensitive data/computations
  - Problems getting huge amounts of data in/out
  - Workload may depend on low latency / high bandwidth
  - Workload may depend on non-standard compute resources
  - Workload may depend on advanced shared storage (e.g. Lustre)

Not everyone will replace HPC cluster with EC2 account

Allow local cluster to be extended with cloud resources to give best of both worlds:
  - Allow workload suitable for cloud to be off-loaded
  - Allow traditional HPC users to try out and migrate to cloud

Cloud Bursting Scenario II

[Figure: node004, node005, node006, node007]

Cloud Network Map

[Figure: cloud network map]

Uniformity

Cloud nodes behave the same way as local nodes:
  - Same method of provisioning
  - Same software image and user environment
  - Same workload management set-up
  - Same management interface for controlling the cluster
  - Same monitoring & health checking

Everything can talk to everything:
  - Accomplished using VPN, routing, network mapping
  - VPN set-up is automated and does not require firewall set-up (requires just outgoing access on 1194/udp)
  - Single global DNS namespace

Running Cloud Nodes

Cloud Director has a number of responsibilities:
  - Gateway between local and cloud nodes
  - Provision software image to cloud nodes
  - Serve shared storage for cloud nodes
  - Mirror network services for the cloud nodes (e.g. LDAP, DNS)

Cloud node booting process:
  - Instances are created with a 1 GB EBS disk and an n GB ephemeral/EBS disk
  - Bright Node Installer AMI goes on the EBS disk
  - Node Installer continues with normal procedure to bring up node
  - Software image gets provisioned onto second disk

SLURM & Bright Cloud Bursting

Common setup: one SLURM partition per cloud region

Example:

    [root@sc11-demo ~]# sinfo
    PARTITION  AVAIL TIMELIMIT NODES STATE NODELIST
    defq*      up    infinite  1     idle  node001
    california up    infinite  4     idle  cnode[001-004]
    oregon     up    infinite  4     idle  cnode[005-008]

  - Jobs that may run in the cloud should be submitted to one of the cloud partitions
  - SLURM will schedule jobs onto cloud nodes the same way as on local nodes

Current situation:
  - /cm/shared mirrored and exported by cloud director
  - /home mounted over VPN
  - Works great, but /home is too slow
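The partition layout in the example lends itself to a small routing helper. What follows is a minimal Python sketch, not part of the original slides. It assumes the `sinfo` output has the column layout shown above and that every partition other than the default one (marked with `*`) is a cloud region. The function names and the script name `job.sh` are hypothetical.

```python
import shlex

# The sinfo output from the slide, used here as sample input.
SINFO_OUTPUT = """\
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 1 idle node001
california up infinite 4 idle cnode[001-004]
oregon up infinite 4 idle cnode[005-008]
"""


def parse_sinfo(text):
    """Parse default-format sinfo output into a list of dicts, one per row."""
    lines = [ln.split() for ln in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    keys = [h.lower() for h in header]
    return [dict(zip(keys, row)) for row in rows]


def idle_cloud_partitions(rows):
    """Return names of non-default partitions that are up and have idle nodes.

    Assumption: the default partition (suffix '*') holds the local nodes,
    and every other partition is a cloud region.
    """
    return [
        r["partition"]
        for r in rows
        if not r["partition"].endswith("*")
        and r["avail"] == "up"
        and r["state"] == "idle"
        and int(r["nodes"]) > 0
    ]


def sbatch_command(partition, script):
    """Build (but do not run) an sbatch command targeting one partition."""
    return "sbatch --partition={} {}".format(
        shlex.quote(partition), shlex.quote(script)
    )


rows = parse_sinfo(SINFO_OUTPUT)
cloud = idle_cloud_partitions(rows)
print(cloud)
print(sbatch_command(cloud[0], "job.sh"))
```

The sketch only builds the command string. Submitting to a cloud partition is then an ordinary `sbatch --partition=<name>` call, which matches the point that SLURM schedules cloud nodes the same way as local nodes.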

Data Locality Problem

  - Jobs usually require input data and produce output data
  - Input and/or output data may require significant transfer time
  - Resources charged by the hour, so input/output data should be transferred while resources are not yet allocated
  - Solution to data locality problem should ideally be hidden from users as much as possible
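The cost argument can be made concrete with a little arithmetic. The Python sketch below is illustrative only: the data size, link speed, node count and price are hypothetical numbers, and the billing model (every started hour charged in full) is an assumption based on the "charged by the hour" remark.

```python
import math


def transfer_seconds(size_gb, bandwidth_mbit_s):
    """Time to move size_gb gigabytes over a link of bandwidth_mbit_s megabit/s."""
    size_mbit = size_gb * 8 * 1000  # 1 GB = 8000 megabit (decimal units)
    return size_mbit / bandwidth_mbit_s


def billed_hours(seconds):
    """Assumed hourly billing: every started hour is charged in full."""
    return math.ceil(seconds / 3600)


def cost(compute_s, transfer_s, nodes, price_per_node_hour, stage_while_allocated):
    """Cost of a job, with or without the transfer counted as allocated time."""
    busy = compute_s + (transfer_s if stage_while_allocated else 0)
    return billed_hours(busy) * nodes * price_per_node_hour


# Hypothetical job: 50 GB of input over a 100 Mbit/s link,
# 2 hours of compute on 4 nodes at 0.50 per node-hour.
t = transfer_seconds(50, 100)
naive = cost(7200, t, nodes=4, price_per_node_hour=0.5, stage_while_allocated=True)
prestaged = cost(7200, t, nodes=4, price_per_node_hour=0.5, stage_while_allocated=False)
print(t, naive, prestaged)  # 4000.0 8.0 4.0
```

With these made-up numbers, transferring the input while the nodes are already allocated doubles the bill, which is why the transfer should happen before allocation.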

Data Aware Workload Management

  - SLURM needs to be made aware of job data dependencies
  - Jobs should not be scheduled until data is present on the cloud director
  - As part of the job script, copy input data into a special input directory and output data into a special output directory
  - Workload management environment takes care of transferring input and output directories

Options:
  - Option A) let SLURM take care of copying data (e.g. using job dependencies)
  - Option B) transfer data using separate daemon and set SLURM job attributes to allow/disallow job start
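Both options can be approximated with standard SLURM commands. The Python sketch below is one possible reading, not the implementation described in the talk. It assumes stage-in and stage-out run as ordinary jobs on the local partition (Option A, chained with `--dependency=afterok`), or that the compute job is submitted held and later released by a transfer daemon (Option B, `sbatch --hold` plus `scontrol release`). The function names, script names and the `submit` hook are hypothetical.

```python
import subprocess


def sbatch(args):
    """Submit a job with sbatch and return its job ID as a string.

    --parsable makes sbatch print just the job ID (optionally followed by
    ';cluster'), which is easier to parse than the default message.
    """
    out = subprocess.run(
        ["sbatch", "--parsable"] + args,
        check=True, capture_output=True, text=True,
    ).stdout
    return out.strip().split(";")[0]


def submit_with_staging(stage_in, compute, stage_out, cloud_partition,
                        local_partition="defq", submit=sbatch):
    """Option A: let SLURM order the data transfers via job dependencies.

    Assumption: the staging scripts run on the local partition. The compute
    job stays pending, without holding cloud nodes, until stage-in succeeds.
    """
    in_id = submit(["--partition=" + local_partition, stage_in])
    run_id = submit(["--partition=" + cloud_partition,
                     "--dependency=afterok:" + in_id, compute])
    out_id = submit(["--partition=" + local_partition,
                     "--dependency=afterok:" + run_id, stage_out])
    return in_id, run_id, out_id


def submit_held(compute, cloud_partition, submit=sbatch):
    """Option B, step 1: submit the job in held state so it cannot start."""
    return submit(["--hold", "--partition=" + cloud_partition, compute])


def release_command(job_id):
    """Option B, step 2: command a transfer daemon would run once data is in place."""
    return ["scontrol", "release", job_id]
```

On a cluster, `submit_with_staging("stage_in.sh", "compute.sh", "stage_out.sh", "oregon")` would submit three chained jobs. The `submit` parameter exists so the logic can be exercised without a running SLURM installation.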

Questions?