Migrating Production HPC to AWS
A Story of Early Adoption & Lessons Learned
Lewis Foti, Mentation Solutions
Common Computing Service (CCS)
- The Common Computing Service (CCS) is the HPC (grid computing) environment at a major commodity trader
- A custom software layer providing map-reduce and memoization functions
- Schedules client jobs across multiple compute nodes that execute models provided by the quant teams
- Jobs are closures: their input contains all the data necessary for evaluation (see the sketch below)
- A REST interface provides isolation from the underlying platform, Microsoft HPC Server
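To make the "jobs are closures" point concrete, the sketch below shows how a client might submit a self-contained job over a REST interface of this kind. The endpoint, payload fields and polling helper are illustrative assumptions, not the actual CCS API.

```python
# Minimal sketch of closure-style job submission over REST.
# The endpoint and payload fields are hypothetical, not the real CCS API.
import time
import requests

CCS_URL = "https://ccs.example.internal/api/jobs"  # placeholder address

def submit_job(model, inputs, priority="normal"):
    """Submit a job whose payload carries everything needed to evaluate it."""
    payload = {
        "model": model,      # name/version of the quant model to run
        "inputs": inputs,    # all market data, trades, parameters, etc.
        "priority": priority,
    }
    resp = requests.post(CCS_URL, json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()["jobId"]

def wait_for_result(job_id, poll_seconds=5):
    """Poll until the scheduler reports the job complete, then return its result."""
    while True:
        resp = requests.get(f"{CCS_URL}/{job_id}", timeout=30)
        resp.raise_for_status()
        job = resp.json()
        if job["status"] == "completed":
            return job["result"]
        time.sleep(poll_seconds)
```

Because the payload carries everything the model needs, a job like this can execute on any compute node, in-house or in AWS, without reaching back into internal systems.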
CCS Architecture
[Architecture diagram: trading systems (Openlink, Murex, DealBus) and end users (Excel, web) send client job execution requests/responses to clustered servers hosting the CCS service and associated components (CCS client interface, job scheduler, model store, MS HPC Server and SQL Server); the scheduler sends CCS task execution requests/responses to a fleet of CCS compute nodes, each running a CCS agent that hosts application models; nodes may be dedicated servers and (potentially) scavenged / virtual / cloud capacity]
- Four environments in total: Production, OAT/DR, Test and Development
- Grid models deployed to compute nodes on demand
CCS in Q1 2013
- CCS entered service in Q1 2013, providing a shared grid computing environment as planned
- Used by multiple business units and applications
- As is usual with such systems, load was quite volatile
- Average 24/7 utilisation of under 20%
- Peak of 100% for the four-hour EoD batch
- CCS had to be provisioned to support peak demand
Predicted Growth
- After go-live there was a capacity uplift of 25% to accommodate demand from the US business
- Empirical evidence from other financial services organisations was that grid demand grew by between 10- and 100-fold over 5 years
- If replicated in this case, annual operational costs would rise to consume up to 20% of the division's operating budget
Need to Control Costs
- The possible growth in operating costs was such that alternatives had to be considered
- The low average utilisation showed there was an opportunity to do this
- An alternative that could scale capacity to meet demand was very attractive
- So in Q3 2013 the decision was taken to investigate the feasibility of a cloud-based solution
Which Cloud to Use?
- CCS is based on Windows HPC Server, so our first thought was to use Azure
- However, there was no contract in place with Azure
- There was one for AWS
- It had taken two years to negotiate
Feasibility
- The first step was to show CCS would run in AWS
- Adopted a "change nothing", lift-and-shift approach
- The first manual build took about a week, which included learning how to use AWS
- By the end of October 2013 we knew the project was technically feasible
- The next step was to get approval to proceed with migration of all CCS environments to AWS
Quite a Few Stakeholders
- The Business
- Quants
- Digital Security
- Compliance, Control & Legal
- Central Accounting
- Operational Integrity
- Internal & External Networks
- Infrastructure
- Cloud Team
- Operations
Digital Security
- Worked extensively with Digital Security to show that migrating CCS to the cloud would not introduce unacceptable risks
- Demonstrated that CCS was equivalent to several of the SaaS products already in use
- Once submitted, CCS jobs did not require access to internal data
- All communications could be initiated internally
- No need for AWS machines to access in-house resources
- Modified CCS to encrypt all business data (a minimal sketch of payload encryption follows)
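As an illustration of that last bullet, here is a minimal sketch of encrypting job data before it is handed to the grid. The choice of Python's cryptography package and the key handling shown are assumptions; the deck does not describe how CCS actually implemented this or how keys are distributed.

```python
# Minimal sketch of encrypting business data in a job payload.
# The cryptography package and the key handling below are illustrative
# assumptions, not the actual CCS implementation.
import json
from cryptography.fernet import Fernet

def encrypt_payload(payload: dict, key: bytes) -> bytes:
    """Serialise and encrypt a job payload so business data never travels in the clear."""
    return Fernet(key).encrypt(json.dumps(payload).encode("utf-8"))

def decrypt_payload(token: bytes, key: bytes) -> dict:
    """Reverse operation, run wherever the data is legitimately needed
    (e.g. by a component holding the key); key distribution is out of scope here."""
    return json.loads(Fernet(key).decrypt(token).decode("utf-8"))

# Example: only encrypted blobs are transmitted or stored outside trusted components.
key = Fernet.generate_key()
blob = encrypt_payload({"model": "swap_pricer", "inputs": {"notional": 1e7}}, key)
assert decrypt_payload(blob, key)["model"] == "swap_pricer"
```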
Central Accounting
- No mechanism to pay AWS!
- Worked with the central accounting function to design a new process
- AWS provides consolidated billing at the business-unit level
- This needed to be recharged to the individual projects and profit centres (see the recharge sketch below)
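One possible shape for that recharge process is to sum billing line items by a cost-allocation tag. The CSV layout assumed below (ProjectTag / UnblendedCost columns) is purely a hypothetical example, not the format the team actually used.

```python
# Minimal sketch of recharging a consolidated AWS bill to projects by tag.
# The CSV column names are hypothetical examples.
import csv
from collections import defaultdict

def recharge_by_project(billing_csv: str) -> dict:
    """Sum billed cost per project tag so each project/profit centre can be recharged."""
    totals = defaultdict(float)
    with open(billing_csv, newline="") as f:
        for row in csv.DictReader(f):
            project = row.get("ProjectTag") or "untagged"
            totals[project] += float(row.get("UnblendedCost", 0) or 0)
    return dict(totals)

# Example usage:
# for project, cost in recharge_by_project("consolidated_bill.csv").items():
#     print(f"{project}: ${cost:,.2f}")
```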
System Build
- The CCS environment is reasonably complicated, with a strict sequence of steps required to build a new instance
- Time-consuming and error-prone to do this by hand, so we decided to automate the process
- Achieved using a combination of Chef and PowerShell to give fine-grained control (a sketch of the idea follows)
- The end result was that a new environment could be built in 90 minutes
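The real automation was written as Chef recipes plus PowerShell scripts; purely to illustrate the "strict sequence, fail fast" idea, here is a minimal sketch with hypothetical step names.

```python
# Illustrative only: the real build used Chef recipes plus PowerShell scripts.
# This sketch shows the idea of a strict, fail-fast build sequence; every
# script name below is hypothetical.
import subprocess

BUILD_STEPS = [
    "provision_instances.ps1",     # create the EC2 instances
    "install_hpc_server.ps1",      # install MS HPC Server components
    "configure_sql_server.ps1",    # set up the SQL Server backing store
    "deploy_ccs_service.ps1",      # deploy the CCS service and agents
    "run_smoke_tests.ps1",         # verify the environment end to end
]

def build_environment():
    """Run each step in order, stopping immediately if any step fails."""
    for step in BUILD_STEPS:
        print(f"Running {step} ...")
        subprocess.run(["powershell.exe", "-File", step], check=True)

if __name__ == "__main__":
    build_environment()
```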
Development Migration
- To ensure that the system would function correctly in AWS, elements of the development environment were migrated
- Builds and unit tests executed in-house by TFS
- When a clean build was available it was automatically deployed to AWS
- Then the set of acceptance tests would run in AWS
- And the results were returned to TFS
SLA for a Scalable System
- What is the SLA for a scalable environment?
- At times demand will exceed the current scale
- After some discussion it was agreed that the appropriate measure was the maximum queue time between job submission and start of execution
- This was adopted as the system SLA, with different values dependent on time of day and end user (see the sketch below)
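A minimal sketch of how such a queue-time SLA could be checked is shown below; the user groups, the EoD window and the threshold values are hypothetical, not the figures agreed with the business.

```python
# Minimal sketch of the queue-time SLA: the measure is the time between job
# submission and start of execution. All thresholds below are hypothetical.
from datetime import datetime, time

# Hypothetical SLA table: (user group, in EoD window?) -> max queue time in seconds
SLA_SECONDS = {
    ("trading", True): 60,      # tight SLA during the end-of-day batch window
    ("trading", False): 300,
    ("research", True): 900,
    ("research", False): 1800,
}

def in_eod_window(t: time) -> bool:
    """End-of-day batch window assumed to be 17:00-21:00 for this sketch."""
    return time(17, 0) <= t <= time(21, 0)

def meets_sla(user_group: str, submitted: datetime, started: datetime) -> bool:
    """True if the job started executing within the allowed queue time."""
    queue_seconds = (started - submitted).total_seconds()
    limit = SLA_SECONDS[(user_group, in_eod_window(submitted.time()))]
    return queue_seconds <= limit
```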
Scaling to Meet Demand
- Produced a model that predicted the amount of time a job would queue once submitted
- Based on estimating the time taken to complete currently executing jobs and the jobs already queued
- One challenge was that it took 15 minutes from requesting a new node to it being operational
- Addressed by creating a fleet of halted nodes which the Resource Manager could start in 60 seconds (a minimal sketch follows)
[Diagram: a newly submitted job joins the queued jobs at the CCS job scheduler, which holds estimated completion times for running work (e.g. "est 15 sec", "est 28 sec", "est 5 sec" on an available node) across the running and halted node fleets]
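The sketch below illustrates the kind of estimate involved: total outstanding work spread across the running nodes gives an approximate wait, and halted nodes are started until the predicted wait (plus the 60-second start time) falls back under the SLA. It is a simplification under assumed parameters, not a reconstruction of the production model.

```python
# Minimal sketch of the queue-time prediction used to drive scaling decisions.
# The estimation is deliberately simplified (jobs assumed interchangeable
# across nodes) and the parameters are illustrative.

def predicted_queue_seconds(running_est, queued_est, node_count):
    """Rough estimate of how long a newly submitted job would wait.

    running_est: estimated seconds remaining for each currently executing job
    queued_est:  estimated run time of each job already waiting in the queue
    node_count:  number of running compute nodes
    """
    # Total outstanding work divided across the nodes approximates the wait.
    outstanding = sum(running_est) + sum(queued_est)
    return outstanding / max(node_count, 1)

def nodes_to_start(running_est, queued_est, node_count, sla_seconds,
                   halted_available, node_start_seconds=60):
    """How many halted nodes to start so the predicted wait drops back under the SLA."""
    extra = 0
    while (predicted_queue_seconds(running_est, queued_est, node_count + extra)
           + node_start_seconds > sla_seconds and extra < halted_available):
        extra += 1
    return extra
```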
Automated Scaling
- The Resource Manager scales the running compute-node fleet to meet demand
- Compute nodes are started as load rises and halted as it falls
- But nodes always run for at least 60 minutes, as this is the minimum time AWS charges for (see the sketch below)
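Because EC2 was billed by the whole hour at the time, the scale-down side only pays off if idle nodes are halted near the end of a billed hour rather than immediately. A minimal sketch of that rule, using boto3 and an assumed 55-minute threshold:

```python
# Minimal sketch of the scale-down rule: an idle node is only halted as it
# approaches the end of a billed hour. The 55-minute threshold, region and
# instance selection are illustrative assumptions.
from datetime import datetime, timezone
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # example region

def minutes_into_billed_hour(launch_time: datetime) -> float:
    """Minutes elapsed in the instance's current (partially used) billing hour."""
    elapsed = (datetime.now(timezone.utc) - launch_time).total_seconds() / 60
    return elapsed % 60

def halt_if_hour_nearly_used(instance_id: str, launch_time: datetime, idle: bool):
    """Stop an idle node only when most of the already-paid hour has been consumed."""
    if idle and minutes_into_billed_hour(launch_time) >= 55:
        ec2.stop_instances(InstanceIds=[instance_id])
```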
Reliability
- Production management components in one AZ, DR in another
- Compute nodes spread across all available AZs
- Use AWS ELB to provide well-known IP addresses
[Diagram ("ELB Status"): production clients connect via the Prod ELB to CCS Production and its prod nodes; a Fail Over Manager monitors heartbeats through both the Prod ELB and the DR ELB, which fronts CCS DR and its smaller fleet of DR nodes]
DR - Automated Fail Over
- Extended the use of the ELB components to automate fail over and fail back
- Production failure detected in 60 seconds
- Fail back automated once the production system recovered (a minimal sketch follows)
[Diagram ("ELB Status & Control"): the Fail Over Manager watches the production heartbeat through the Prod ELB and, on failure, redirects clients to CCS DR behind the DR ELB, reversing the switch once production recovers]
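A minimal sketch of the heartbeat loop behind those numbers: with a 10-second poll, six consecutive misses gives the 60-second detection time. The poll interval, health URL and the switch_traffic_to() placeholder are assumptions; the deck does not describe the actual redirection mechanism.

```python
# Minimal sketch of a heartbeat-driven fail over / fail back loop.
# The URLs, intervals and switch_traffic_to() are illustrative placeholders.
import time
import requests

PROD_HEALTH_URL = "https://prod-elb.example.internal/ccs/heartbeat"  # placeholder

def production_healthy() -> bool:
    try:
        return requests.get(PROD_HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        return False

def switch_traffic_to(target: str):
    """Placeholder for the real redirection mechanism (not described in the deck)."""
    print(f"Switching client traffic to {target}")

def failover_manager(poll_seconds=10, misses_for_failure=6):
    """Fail over after ~60s of missed heartbeats, fail back when production recovers."""
    misses, active = 0, "production"
    while True:
        if production_healthy():
            misses = 0
            if active == "dr":
                active = "production"
                switch_traffic_to("production")   # automated fail back
        else:
            misses += 1
            if active == "production" and misses >= misses_for_failure:
                active = "dr"
                switch_traffic_to("dr")           # automated fail over
        time.sleep(poll_seconds)
```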
Test at Production Scale
- Tested with production workloads for production timescales
- Measured the performance of the system and individual components
- Revealed a number of bottlenecks, which were addressed prior to go-live
What Was Delivered
- Fully automated, reliable and repeatable deployments
- Pay for usage: Opex reduced by 40%
- No more hardware purchases, end of Capex shocks
- Ability to meet unusual business demand
- Automated failover
Lessons Learned
- Find and engage all the stakeholders
- Right-size the architecture; experiment with alternative platform configurations
- Dynamic environments are not as stable as dedicated hardware, so you need a strategy to cope
- Build automation is a must in order to achieve the required levels of agility
- Production-scale testing is a must to identify and remediate bottlenecks
- Disaster recovery: as it is possible to rebuild the system in 90 minutes, is there a cheaper approach?
Next Steps?
- Recharge to the business line
- Distributed Data Assets to remove repeated data transmissions
- Use of AWS Spot to reduce costs
Q & A
lewis.foti@mentation.com