IBM Platform Computing Cloud Service
Ready-to-use Platform LSF & Symphony clusters in the SoftLayer cloud
February 25, 2014

Agenda
- Mapping client needs to cloud technologies
- Addressing your pain points
- Introducing IBM Platform Computing Cloud Service
- Product features and benefits
- Use cases
- Performance benchmarks

HPC cloud characteristics and economics differ from general-purpose computing
- High-end hardware and special-purpose devices (e.g. GPUs) are typically used to supply the needed processing, memory, network, and storage capabilities
- The performance requirements of technical computing and service-oriented workloads mean that performance may suffer in a virtualized cloud environment, especially when latency or I/O is a constraint
- HPC cluster/grid utilization is usually in the 70-90% range, which removes a major potential advantage of a public cloud service provider for stable workload volumes
[Figure: Primary HPC Workloads, divided into "HPC Workloads Recommended for Private Cloud" and "HPC Workloads with Best Potential for Virtualized Public & Hybrid Cloud"]

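The utilization point can be illustrated with simple arithmetic. The sketch below assumes a hypothetical owned node that costs a fixed amount per hour whether or not it is busy, and a hypothetical cloud node that is billed only for the hours used. The function names and the dollar figures are illustrative assumptions, not values from this deck.

```python
def owned_cost_per_used_hour(owned_hourly_cost: float, utilization: float) -> float:
    """Effective cost of one *used* hour on an owned node.

    owned_hourly_cost: amortized cost per wall-clock hour (hypothetical input).
    utilization: fraction of hours the node is actually busy, in (0, 1].
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return owned_hourly_cost / utilization


def break_even_utilization(owned_hourly_cost: float, cloud_hourly_rate: float) -> float:
    """Utilization above which the owned node is cheaper per used hour."""
    return owned_hourly_cost / cloud_hourly_rate


# Illustrative numbers only: owned node at $1.00/hour, cloud node at $2.00/hour.
threshold = break_even_utilization(1.00, 2.00)        # 0.5
cost_at_80_percent = owned_cost_per_used_hour(1.00, 0.80)  # 1.25
```

With these assumed prices, an owned node that is busy 80% of the time costs less per used hour than the cloud node, which is why a steady, highly utilized workload gains little from moving entirely to a public cloud.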
IBM's HPC cloud strategy provides a flexible approach to address a variety of client needs
- Private Clouds: Evolve existing infrastructure to HPC Cloud to enhance responsiveness, flexibility, and cost effectiveness.
- Hybrid Clouds: Enable integrated approach to improve HPC cost and capability
- Public Clouds: Access additional HPC capacity with variable cost model
[Callout: 60%]
Based on HPC Cloud's potential impact, organizations are evolving their infrastructures to enable private cloud deployments, exploring hybrid clouds, and considering public clouds.

Are you experiencing any of these pain points?
- Unable to meet business objectives (delay to market, etc.)
- Existing resources insufficient to meet peak compute demand
- Long run times on existing cluster or grid
- No access to local technical computing resources (workstation users)
- Technical resources expensive and time consuming to acquire
- The skills and staff needed to architect and manage a technical computing infrastructure can be difficult to acquire
[Figure: two compute demand charts, "Planned Daily Cycle (24 x 365)" by hour of day and "Planned Project" for April, May, and June; chart labels: Financial Services, Life Sciences]

IBM Platform Computing Cloud Service
Making the cloud work for you
Build
- Complete, ready-to-run clusters in the cloud
- Add capacity in hours instead of months
Manage
- Seamless workload management, on-premise and in the cloud
- Transparent user experience
Support
- 24x7 cloud operation support
- Access to technical computing expertise when you need it
Protect
- Data encryption, dedicated physical machines and network
- Security through physical isolation
Complete, end-to-end dynamic cloud solution

Ready-to-use Platform LSF & Platform Symphony clusters in the cloud
[Figure: solution stack, top to bottom]
- Client and ISV Applications
- IBM Platform Computing Cloud Service (SaaS): IBM Platform LSF, IBM Platform Symphony
- SoftLayer, an IBM Company: Infrastructure
- 24x7 CloudOps Support

Dedicated physical and virtual machine infrastructure as a service
- 13+ data centers
- 17 network PoPs
- Global private network
- Bare metal and virtual machines
- 190,000+ servers
- 21,000+ customers
- 22,000,000+ domains

Ready-to-use Platform LSF & Platform Symphony clusters in the cloud
Providers rated: AWS, RAX, IBM
Differentiator: Workload I/O intensity
- Rating scale: low intensity workloads to high intensity workloads
- IBM advantages: SoftLayer's architecture outperforms equivalent AWS instances by more than 50% for high I/O workloads
Differentiator: Control (APIs, hardware / network configurability)
- Rating scale: low degree of control and customization to high degree of control and customization
- IBM advantages: SoftLayer offers hundreds of hardware configurations vs. 14 for AWS; ~2,000 APIs for SoftLayer vs. ~60 for AWS and none for RAX
Differentiator: Integrated platform of multiple architectures
- Rating scale: single platform to seamless integration
- IBM advantages: Unified integration & control panel for multiple cloud architectures; RAX requires paid bridge, different control interfaces

Non-shared physical machines for added security and performance
Dedicated and isolated compute environment
- All machine instances are dedicated to the client
- Each cluster is isolated on a VLAN
- Only the VPN gateway has an addressable interface
- All customer data at rest is encrypted on shared file systems
- When machine instances are decommissioned, the disks are scrubbed using DoD-approved methods

Optimal performance for technical computing apps
- EDA Benchmark (IBM-MESA)
- Industrial Manufacturing Benchmark: Structural Mechanics
Note: Benchmark results were obtained by IBM and have not yet been externally audited or validated.

Run and supported by a dedicated, 24x7 HPC Cloud Operations Team
CloudOps functions
- Pre-provisioning: provide guidance to the client on how to enable VPN, multi-cluster settings, and security settings in the client's on-premise environment
- One-time setup testing: extensive testing of the cluster prior to release to the client
- Extensive testing of the cluster on every flex-up event prior to release to the client
- Email alerts prior to flex-down and cluster shutdown operations
- Email alerts in case of any overage (compute hours, download bandwidth)
- Billing details of monthly usage, including overage details
- Support under IBM SLA by experts highly experienced in Platform Computing products
Value: quality, peace of mind, and minimum disruption to business
- Extensive quality checks ensure minimum loss of usage hours and minimum disruption
- Proactive alerts ensure that in-progress critical jobs are not killed during flex-downs, cluster shutdowns, and overages
- Highly trained and experienced support staff ensure smooth onboarding and minimize disruptions

Industry-leading workload management
- 20 years managing distributed scale-out systems with 2000+ customers in many industries
- High performance workload management combined with an intelligent resource scheduling engine
- Unmatched scalability (small clusters to global grids) and production-proven reliability
- Heterogeneous: manages System x and Power plus third-party systems, virtual and bare metal, accelerators / GPU, cloud, etc.
- Shared services for both compute and data intensive workloads
- Integrated solutions with vertical reference architectures
Key figures
- 23 of 30 largest commercial enterprises
- 60% of top financial services companies
- Over 5M CPUs under management

IBM Platform LSF Overview
Powerful workload management for demanding, distributed and mission-critical high performance computing environments.
Key Capabilities
Powerful
- Policy and resource-aware scheduling
- Resource consolidation for optimal performance
- Advanced self-management
Flexible
- Heterogeneous platform support
- Policy-driven automation
- CLI, web services, APIs
Scalable
- Thousands of concurrent users and jobs
- Virtualized pool of shared resources
- Flexible control, multiple policies
Client Benefits
- Optimal utilization: reduced infrastructure cost
- Robust capabilities: improved productivity
- High throughput: faster time to results

IBM Platform Symphony Overview
Low-latency grid management platform for distributed computing and analytics with sophisticated resource sharing
Key Capabilities
- Accelerates service-oriented applications
- Extreme application scalability and throughput with very low latency
- Compute and data-intensive applications on a single platform
- Sophisticated, hierarchical resource sharing
- Open and flexible: choice of OS, frameworks and languages
Client Benefits
- Increase performance and analytic result quality
- Reduce IT costs: increase utilization, simplify application onboarding, reduce administration costs
Highlights
- Low latency / high throughput: sub-millisecond, 17,000 tasks per second
- Large scale: 10k cores per application, 40k cores per grid
- Efficient shared services
- Heterogeneous & open: Linux, Windows, AIX, C/C++, C#, Java, Excel, Python, R

Use case 1: hybrid cluster
The problem
- Existing resources cannot meet peak demand
- Resources are expensive and time consuming to acquire
- Skills to architect and manage clusters are difficult to find
- Fixed or reduced budgets
- On-premise constraints in space, cooling and power
The solution
- Fully functioning IBM Platform LSF or Symphony clusters are provisioned on the SoftLayer cloud and connected to the on-premise cluster, expanding capacity as needed
- Leverage MultiCluster capability for managed forwarding of jobs from the on-premise cluster to the off-premise cluster
The value
- Access to additional compute capacity on a temporary basis as needed
- Near-zero wait times
- Reduce costs by paying for only what is used
- Pay for additional capacity as an operating expense
- Fully supported, end-to-end solution, from the on-premise to the on-cloud clusters
- Expected and reliable performance from running technical computing workloads on physical machines
- Transparent access to cloud resources: the end user experience does not change

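A rough way to size the cloud side of a hybrid cluster is to look at how much demand exceeds on-premise capacity in each hour. The sketch below is only arithmetic on a hypothetical demand profile; it does not model Platform LSF MultiCluster scheduling, and the function names and numbers are illustrative assumptions.

```python
from typing import Sequence


def overflow_core_hours(hourly_demand: Sequence[float], on_premise_cores: float) -> float:
    """Core-hours of demand that exceed on-premise capacity.

    hourly_demand: cores requested in each one-hour interval (hypothetical data).
    on_premise_cores: cores available in the on-premise cluster.
    """
    return sum(max(0.0, demand - on_premise_cores) for demand in hourly_demand)


def peak_cloud_cores(hourly_demand: Sequence[float], on_premise_cores: float) -> float:
    """Largest number of cloud cores needed in any single hour."""
    return max((max(0.0, demand - on_premise_cores) for demand in hourly_demand), default=0.0)


# Hypothetical six-hour demand profile against a 1000-core on-premise cluster.
demand = [800, 900, 1200, 1600, 1400, 700]
burst_core_hours = overflow_core_hours(demand, 1000)  # 200 + 600 + 400 = 1200
burst_peak = peak_cloud_cores(demand, 1000)           # 600
```

In this made-up profile only three of the six hours need cloud capacity, which is the situation where paying only for what is used is attractive compared with buying 600 more cores outright.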
Use case 2: stand-alone cluster in the cloud
The problem
- New and emerging need for technical computing
- Skills to architect and manage clusters are difficult to find
- Resources are expensive and time consuming to acquire
- Inconsistent demand does not justify the investment
The solution
- Fully functioning Platform LSF and Symphony clusters are provisioned on the SoftLayer cloud, providing resources as needed
The value
- Market-leading Platform LSF and Platform Symphony software
- Access to technical computing resources on a temporary basis without the need to acquire, install and configure the infrastructure and cluster software
- Keep costs low by paying for only what is used
- Pay for capacity as an operating expense
- Fully supported solution
- Expected and reliable performance from running workloads on physical machines

Is IBM Platform Computing Cloud Service a good fit for you?
Business pain points
- Are you experiencing lost profit due to missed deadlines?
- Do you experience pressure to convert your compute environment capital expense to operational expense?
- Have you ever missed a deadline or delayed a project because technical computing resource procurement took too long?
Technology pain points
- Do your users ever scale back their analyses to lower fidelity or less accuracy in order to fit them into the local compute environment or a time window?
- Do you regularly, occasionally, or permanently have fewer resources (CPUs, disk, memory, etc.) than you would like to have to service your users' compute demand?
- Do you experience a large variance in compute resource utilization?
- Have you reached, or will you reach, the capacity of your datacenter(s), and do you need a plan to grow beyond that capacity?
- Are your customers asking you for cloud licenses for Platform LSF or Platform Symphony?

IBM Platform Computing Cloud Service
Making the Cloud Work for You
IBM Hybrid Cloud
- On Premise
- On SmartCloud
powered by Software & Systems
- Unmatched Capabilities
- Cloud Leadership
- Policy-driven Workload Management
- Expertise from Client Engagements
- Unmatched Expertise
- Analytics, Technical Computing, Software, Services and ISV Partnerships
- Consolidation
- Supporting heterogeneous IBM and non-IBM infrastructure

Thank You

SoftLayer and Amazon EC2
Products tested

Name         IaaS provider            CPU cores  Memory (GB)  Disk space (GB)  Physical / Virtual  Hourly rate (USD)
SL PM        SoftLayer                16         64           1000[1]          Physical            $1.85[2]
SL VM        SoftLayer                8          8            500[3]           Virtual             $0.88
SL PM (ded)  SoftLayer                16         64           1000[1]          Physical            $3.83[5]
EC2 CC2      Amazon EC2 (CC2)         32         60.5         3360             Virtual             $2.40[4]
EC2 2XL      Amazon EC2 (c1.xlarge)   8          7            840              Virtual             $0.58

Machine                          CPU
SL Physical Machine              Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
SL Physical Machine (dedicated)  Intel Xeon CPU E5-2690 0 @ 2.90GHz
SL Virtual Machine               Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Amazon CCI2                      Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Amazon 2XL                       Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz

Memory Bandwidth
[Chart: STREAM memory bandwidth (higher is better); series COPY, SCALE, ADD, TRIAD; systems SL PM, SL VM, EC2 CCI2, EC2 2XL, SL PM (ded)]
[Chart: STREAM Price Performance (higher is better); series COPY, SCALE, ADD, TRIAD; systems SL PM, SL VM, EC2 CCI2, EC2 2XL, SL PM (ded)]

CPU Performance
[Chart: SuperPI elapsed time (lower is better); systems SL PM, SL VM, EC2 CCI2, EC2 2XL, SL PM (ded)]
[Chart: SuperPI Price-Performance, throughput per dollar (higher is better); systems SL PM, SL VM, EC2 CCI2, EC2 2XL, SL PM (ded)]

Network Bandwidth
[Chart: openmpi bandwidth (Mbits/s) versus message size (bytes), both axes in powers of ten; systems SL VM, EC2 2XL, EC2 CCI2, SL PM, SL PM Dedicated]

Network Latency
[Chart: openmpi latency (lower is better), MPI on 2 nodes; systems SL VM, EC2 2XL, EC2 CCI2, SL PM, SL PM (ded)]

Input / Output Performance
[Chart: I/O Bandwidth - WRITE (higher is better), kb/sec versus I/O file size (factor of memory size); systems SL VM, EC2 2XL, EC2 CCI2, SL PM, SL PM Ded]
[Chart: I/O Bandwidth - READ (higher is better), kb/sec versus I/O file size (factor of memory size); systems SL VM, EC2 CCI2, EC2 2XL, SL PM, SL PM Ded]

Software Compilation
[Chart: Software Compile Performance, elapsed time in seconds (lower is better); systems SL VM, SL PM, EC2 2XL, EC2 CCI, SL PM Ded]
[Chart: Software Compile Price-Performance, runs / $ (higher is better); systems SL VM, SL PM, EC2 2XL, EC2 CCI, SL PM Ded]

Life Science (BWA)
[Chart: Life Sciences Benchmark (BWA), elapsed time in seconds (lower is better)]
[Chart: Life Sciences Benchmark (BWA) Price Performance, $ / run (lower is better)]

System       Elapsed time (sec)  $ / run
SL PM (ded)  20846.481           22.21
SL PM        26509.368           7.79
SL VM        25897.44            6.33
EC2 CCI2     22442.7             14.96
EC2 2XL      37491               6.04

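The $ / run figures can be related to the hourly rates listed on the "Products tested" slide. The sketch below assumes that cost per run is elapsed hours multiplied by the hourly rate, and that "EC2 CCI2" here is the same system as "EC2 CC2" in the products table. Under those assumptions the three unfootnoted rates reproduce the reported values for SL VM, EC2 CCI2, and EC2 2XL. The two SL PM systems are left out because their rates carry footnotes whose text is not part of this deck. The function and variable names are illustrative.

```python
# Hourly rates (USD) from the "Products tested" slide, unfootnoted rows only.
HOURLY_RATE_USD = {"SL VM": 0.88, "EC2 CCI2": 2.40, "EC2 2XL": 0.58}

# BWA elapsed times (seconds) from the Life Sciences benchmark slide.
BWA_ELAPSED_SEC = {"SL VM": 25897.44, "EC2 CCI2": 22442.7, "EC2 2XL": 37491.0}


def cost_per_run(elapsed_sec: float, hourly_rate_usd: float) -> float:
    """Assumed model: dollars per run = elapsed hours * hourly rate."""
    return elapsed_sec / 3600.0 * hourly_rate_usd


def runs_per_dollar(elapsed_sec: float, hourly_rate_usd: float) -> float:
    """Inverse metric, the form used on the 'Runs / $' charts."""
    return 1.0 / cost_per_run(elapsed_sec, hourly_rate_usd)


bwa_cost = {
    system: round(cost_per_run(BWA_ELAPSED_SEC[system], HOURLY_RATE_USD[system]), 2)
    for system in HOURLY_RATE_USD
}
# bwa_cost == {"SL VM": 6.33, "EC2 CCI2": 14.96, "EC2 2XL": 6.04}
```

The same two functions cover both conventions used in the benchmark slides: $ / run is lower-is-better, while runs / $ is higher-is-better.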
EDA Benchmark (IBM-MESA)
[Chart: EDA - IBM Mesa, elapsed time in seconds (lower is better); systems SL PM (ded), SL PM, SL VM, EC2 2XL, EC2 CCI2]
[Chart: EDA - IBM Mesa - Price-Performance, runs / $ (higher is better); systems SL PM (ded), SL PM, SL VM, EC2 2XL, EC2 CCI2]

Provisioning Time
[Chart: provisioning time in seconds (lower is better), axis in powers of ten from 1 to 100000; systems SL PM, SL VM, EC2 CCI2, EC2 2XL, SL PM Ded]

Industrial Manufacturing: Structural Mechanics
[Charts: four panels, "One Node - S4D", "Two Nodes - S4D", "One Node - S6", "Two Nodes - S6"; each plots speedup (relative to EC2 2XL) versus number of CPUs (up to 16 on one node, up to 32 on two nodes); systems SL PM, EC2 CCI2, SL VM, EC2 2XL, SL PM (ded)]

Industrial Manufacturing: CFD
[Chart: OpenFoam Speedup Backplane (higher is better), speedup (relative to EC2 2XL) versus # cores, 1 to 16; systems SL PM (ded), SL PM, SL VM, EC2 CCI2, EC2 2XL]
[Chart: OpenFoam Speedup Ethernet (higher is better), speedup (relative to EC2 2XL) versus # cores, 1 to 32; systems SL PM (ded), SL PM, SL VM, EC2 CCI2, EC2 2XL]