Migrating Production HPC to AWS




A Story of Early Adoption & Lessons Learned
Lewis Foti, Mentation Solutions

Common Computing Service (CCS)
The Common Computing Service (CCS) is the HPC (grid computing) environment at a major commodity trader.
It is a custom software layer providing map-reduce and memoization functions.
It schedules client jobs across multiple compute nodes that execute models provided by the quant teams.
Jobs are closures, in that their input contains all the data necessary for evaluation.
A REST interface provides isolation from the underlying platform, Microsoft HPC Server.
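Because each job carries all of its input data, a result can in principle be stored and reused whenever an identical job is submitted again. The following is a minimal sketch of that memoization idea, not the CCS implementation: the function names, the in-memory dictionary and the JSON-based key are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Any, Callable, Dict

# Hypothetical in-memory result store; a real service would persist this.
_results: Dict[str, Any] = {}


def job_key(model_name: str, inputs: dict) -> str:
    """Derive a stable key from the model name and the complete job input."""
    payload = json.dumps({"model": model_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_job(model_name: str, model: Callable[[dict], Any], inputs: dict) -> Any:
    """Evaluate a job, reusing the stored result if the same job was seen before."""
    key = job_key(model_name, inputs)
    if key not in _results:
        _results[key] = model(inputs)
    return _results[key]


# Illustrative model: records each real evaluation so reuse can be observed.
calls = []


def price(inputs: dict) -> int:
    calls.append(inputs)
    return inputs["notional"] * inputs["rate"]


first = run_job("price", price, {"notional": 100, "rate": 2})
second = run_job("price", price, {"rate": 2, "notional": 100})  # same job, different key order
```

Sorting the keys before hashing means two submissions with the same content map to the same cache entry, so the model is evaluated only once here.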

CCS Architecture
[Architecture diagram. Components shown: Trading Systems (Openlink, DealBus, Murex) and End Users (Excel, Excel Web) exchanging client job execution requests and responses with the CCS Client Interface; the CCS Service, CCS Job Scheduler and CCS Model Store, with associated components, hosted on clustered servers running MS HPC Server and SQL Server; CCS task execution requests and responses exchanged with the Compute Node Grid, where each CCS Compute Node runs a CCS Agent and an Application Model on dedicated servers and (potentially) scavenged / virtual / cloud machines.]
Four environments in total: Production, OAT/DR, Test and Development.
Models are deployed to the compute nodes on demand.

CCS in Q1 2013
CCS entered service in Q1, providing a shared grid computing environment as planned.
It was used by multiple business units and applications.
As is usual with such systems, load was quite volatile.
Average utilisation, measured 24/7, was under 20%.
Peak utilisation was 100% for the four-hour EoD batch.
CCS had to be provisioned to support peak demand.

Predicted Growth
After go-live there was a capacity uplift of 25% to accommodate demand from the US business.
Empirical evidence from other Financial Services organisations was that grid demand grew between 10 and 100 fold over 5 years.
If replicated in this case, annual operational costs would rise to consume up to 20% of the division's operating budget.
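To put the 10 to 100 fold figure in yearly terms, the short calculation below converts a five-year multiple into the equivalent year-on-year multiplier. It assumes constant compound growth, which is a simplification introduced here and not something the source states.

```python
def annual_growth_factor(total_growth: float, years: int) -> float:
    """Constant yearly multiplier that compounds to total_growth over the given years."""
    return total_growth ** (1.0 / years)


low = annual_growth_factor(10, 5)    # 10 fold over 5 years
high = annual_growth_factor(100, 5)  # 100 fold over 5 years

print(f"10 fold over 5 years  is about x{low:.2f} per year")
print(f"100 fold over 5 years is about x{high:.2f} per year")
```

Under that assumption, demand grows by roughly 1.6 to 2.5 times every year, which is why provisioning fixed hardware for peak load looked unsustainable.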

Need to Control Costs
The possible growth in operating costs was such that alternatives had to be considered.
The low average utilisation showed there was an opportunity to do this.
An alternative that could scale capacity to meet demand was very attractive.
So in Q3 2013 the decision was taken to investigate the feasibility of a Cloud-based solution.

Which Cloud to Use?
CCS is based on Windows HPC Server, so our first thought was to use Azure.
However, there was no contract in place with Azure.
There was one for AWS.
It had taken two years to negotiate.

Feasibility
The first step was to show that CCS would run in AWS.
We adopted a "change nothing", lift-and-shift approach.
The first manual build took about a week, which included learning how to use AWS.
By the end of October 2013 we knew the project was technically feasible.
The next step was to get approval to proceed with the migration of all CCS environments to AWS.

Quite A Few Stakeholders
The Business
Quants
Digital Security
Compliance, Control & Legal
Central Accounting
Operational Integrity
Internal & External Networks
Infrastructure
Cloud Team
Operations

Digital Security
Worked extensively with Digital Security to show that migrating CCS to the Cloud would not introduce unacceptable risks.
Demonstrated that CCS was equivalent to several of the SaaS products already in use.
Once submitted, CCS jobs did not require access to internal data.
All communications could be initiated internally.
There was no need for AWS machines to access in-house resources.
Modified CCS to encrypt all business data.

Central Accounting
No mechanism to pay AWS!
Worked with the central accounting function to design a new process.
AWS provides consolidated billing at the business unit level.
This needed to be recharged to the individual projects and profit centres.

System Build
The CCS environment is reasonably complicated, with a strict sequence of steps required to build a new instance.
Doing this by hand is time consuming and error prone, so we decided to automate the process.
This was achieved using a combination of Chef and PowerShell to give fine-grained control.
The end result was that a new environment could be built in 90 minutes.

Development Migration
To ensure that the system would function correctly in AWS, elements of the development environment were migrated.
Build and Unit Tests were executed in-house by TFS.
When a clean build was available it was automatically deployed to AWS.
Then the set of Acceptance Tests would run in AWS.
The results were returned to TFS.

SLA for a Scalable System
What is the SLA for a scalable environment?
At times demand will exceed the current scale.
After some discussion it was agreed that the appropriate measure was the maximum queue time from job submission to the start of execution.
This was adopted as the system SLA, with different values dependent on time of day and end user.
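An SLA of this shape can be represented as a lookup from time of day and end user to a maximum queue time. The sketch below is illustrative only: the time windows, the user classes and every limit in the table are invented values, since the source does not give the actual figures.

```python
from datetime import time
from typing import List, Tuple

# Hypothetical SLA table: (window start, window end, user class, max queue seconds).
# More specific windows are listed first so that they win over the all-day default.
SLA_TABLE: List[Tuple[time, time, str, int]] = [
    (time(17, 0), time(21, 0), "batch", 60),
    (time(17, 0), time(21, 0), "interactive", 30),
    (time(0, 0), time(23, 59, 59), "batch", 600),
    (time(0, 0), time(23, 59, 59), "interactive", 120),
]


def max_queue_seconds(submitted_at: time, user_class: str) -> int:
    """Return the agreed maximum queue time for a job submitted at this time."""
    for start, end, cls, limit in SLA_TABLE:
        if cls == user_class and start <= submitted_at <= end:
            return limit
    raise KeyError(f"no SLA defined for user class {user_class!r}")


def sla_met(submitted_at: time, user_class: str, queued_seconds: float) -> bool:
    """True if the job started executing within its maximum queue time."""
    return queued_seconds <= max_queue_seconds(submitted_at, user_class)
```

For example, `sla_met(time(18, 0), "batch", 45)` checks a job that queued for 45 seconds during the invented evening window against its 60-second limit.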

Scaling to Meet Demand
Produced a model that predicted the amount of time a job would queue once submitted.
It is based on estimating the time taken to complete the currently executing jobs and the jobs already queued.
The challenge was that it took 15 minutes from requesting a new node to it being operational.
This was addressed by creating a fleet of halted nodes, which the Resource Manager could start in 60 seconds.
[Diagram: a new job is submitted to the CCS Job Scheduler, which holds the queued jobs; running nodes are shown with estimated remaining times (Est 15 sec, Est 28 sec, Est 5 sec) alongside an available node and a pool of halted nodes.]
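The queue-time model can be sketched as a small simulation: walk the queue in order, always handing the next job to whichever node frees up first, and report when a node would become free for the new job. This is an illustration under stated assumptions (FIFO order, one job per node, known duration estimates), not the model used in CCS; the function names and the sample numbers are hypothetical.

```python
import heapq
from typing import List


def predicted_queue_seconds(node_free_in: List[float], queued_estimates: List[float]) -> float:
    """Estimated wait for a newly submitted job.

    node_free_in: seconds until each node finishes its current job (0 if idle).
    queued_estimates: estimated run times of jobs already queued, in FIFO order.
    """
    if not node_free_in:
        raise ValueError("at least one node is required")
    free_at = list(node_free_in)
    heapq.heapify(free_at)
    for estimate in queued_estimates:
        start = heapq.heappop(free_at)          # earliest node to become free
        heapq.heappush(free_at, start + estimate)
    return free_at[0]                           # when the new job could start


def nodes_to_start(node_free_in: List[float], queued_estimates: List[float],
                   sla_seconds: float, halted_available: int,
                   startup_seconds: float = 60.0) -> int:
    """Smallest number of halted nodes to start so the predicted wait meets the SLA.

    A started node is modelled as becoming free after startup_seconds.
    If the SLA cannot be met, all available halted nodes are started.
    """
    for extra in range(halted_available + 1):
        fleet = list(node_free_in) + [startup_seconds] * extra
        if fleet and predicted_queue_seconds(fleet, queued_estimates) <= sla_seconds:
            return extra
    return halted_available


wait = predicted_queue_seconds([15, 28, 5], [10, 10])
extra_nodes = nodes_to_start([300, 400], [200, 200], sla_seconds=120, halted_available=5)
```

With three nodes due to free up in 15, 28 and 5 seconds and two 10-second jobs already queued, the predicted wait is 15 seconds. In the second call two busy nodes cannot meet a 120-second limit, so the sketch asks for three halted nodes to be started.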

Automated Scaling
The Resource Manager scales the running compute node fleet to meet demand.
Compute nodes are started as load rises and halted as it falls.
Once started, a node always runs for at least 60 minutes, as this is the minimum time AWS charges for.
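The 60-minute rule can be expressed as a simple eligibility check applied before any node is halted. The sketch below is an assumption-laden illustration rather than the Resource Manager's logic: the node dictionaries, field names and the oldest-first ordering are all hypothetical.

```python
from typing import Dict, List

MIN_RUN_SECONDS = 60 * 60  # minimum charged period stated in the talk


def may_halt(started_at: float, now: float, is_idle: bool,
             min_run_seconds: float = MIN_RUN_SECONDS) -> bool:
    """A node may be halted only if it is idle and has run for the minimum period."""
    return is_idle and (now - started_at) >= min_run_seconds


def select_nodes_to_halt(nodes: List[Dict], now: float, surplus: int) -> List[str]:
    """Pick up to `surplus` eligible nodes, longest-running first."""
    eligible = [n for n in nodes if may_halt(n["started_at"], now, n["idle"])]
    eligible.sort(key=lambda n: n["started_at"])
    return [n["name"] for n in eligible[:surplus]]


# Hypothetical fleet; times are in seconds on an arbitrary clock.
fleet = [
    {"name": "a", "started_at": 0, "idle": True},
    {"name": "b", "started_at": 1000, "idle": True},
    {"name": "c", "started_at": 0, "idle": False},
]
```

Calling `select_nodes_to_halt(fleet, now=3600, surplus=2)` returns only node `a`: node `b` has not yet run for an hour and node `c` is still busy.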

Reliability
Production management components run in one AZ, DR in another.
Compute nodes are spread across all available AZs.
AWS ELB is used to provide well-known IP addresses.
[Diagram: Production Clients, Prod ELB, CCS Production with its Prod Nodes, DR ELB, CCS DR with its DR Nodes, and a Fail Over Manager; connections are labelled "ELB Status" and "Heartbeat".]

DR - Automated Fail Over
Extended the use of the ELB components to automate fail over and fail back.
A production failure is detected in 60 seconds.
Fail back is automated once the production system has recovered.
[Diagram: Production Clients, Prod ELB, CCS Production with its Prod Nodes, DR ELB, CCS DR with its DR Nodes, and a Fail Over Manager; connections are labelled "ELB Status & Control" and "Heartbeat".]
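The fail over and fail back behaviour can be illustrated with a small heartbeat-driven state holder. This is a sketch under assumptions, not the Fail Over Manager from the talk: it treats the 60-second detection time as a heartbeat timeout, fails back as soon as a fresh heartbeat is seen, and does not call any AWS API. The class and method names are hypothetical.

```python
class FailOverManager:
    """Tracks which environment should receive client traffic."""

    def __init__(self, now: float, timeout_seconds: float = 60.0) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = now      # assume production is healthy at start-up
        self.active = "production"

    def record_heartbeat(self, now: float) -> None:
        """Called whenever the production system reports that it is alive."""
        self.last_heartbeat = now

    def evaluate(self, now: float) -> str:
        """Fail over if production has been silent for the timeout, else fail back."""
        healthy = (now - self.last_heartbeat) < self.timeout_seconds
        self.active = "production" if healthy else "dr"
        return self.active


manager = FailOverManager(now=0)
state_at_30 = manager.evaluate(30)    # heartbeat still fresh
state_at_60 = manager.evaluate(60)    # silent for the full timeout
manager.record_heartbeat(90)          # production recovers
state_at_100 = manager.evaluate(100)
```

A production-grade version would probably wait for several consecutive heartbeats before failing back, to avoid flapping between environments; that refinement is left out to keep the sketch short.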

Test At Production Scale
Tested with production workloads for production timescales.
Measured the performance of the system and of individual components.
This revealed a number of bottlenecks, which were addressed prior to go-live.

What Was Delivered
Fully automated, reliable and repeatable deployments.
Pay for usage: Opex reduced by 40%.
No more hardware purchases: an end to Capex shocks.
Ability to meet unusual business demand.
Automated failover.

Lessons Learned
Find and engage all the stakeholders.
Right-size the architecture; experiment with alternative platform configurations.
Dynamic environments are not as stable as dedicated hardware, so a strategy is needed to cope.
Build automation is a must in order to achieve the required levels of agility.
Production-scale testing is a must to identify and remediate bottlenecks.
Disaster Recovery: as it is possible to rebuild the system in 90 minutes, is there a cheaper approach?

Next Steps?
Recharge to the business line.
Distributed Data Assets to remove repeated data transmissions.
Use of AWS Spot to reduce costs.

Q & A
lewis.foti@mentation.com