Building Platform as a Service for Scientific Applications


Moustafa AbdelBaky
moustafa@cac.rutgers.edu
Rutgers Discovery Informatics Institute (RDI2)
The NSF Cloud and Autonomic Computing Center
Department of Electrical & Computer Engineering
Rutgers, The State University of New Jersey, USA

Current Challenges

Development and building of new applications
- Algorithms, applications, or experiments are constrained to the available resources
- Development and deployment cycles are tightly coupled to each other, as well as to the underlying resource capabilities and availability

Deployment and runtime management
- Allocating/adjusting resources and/or using queues
- Manually running jobs or writing complex scripts
- Complex workflows may require different classes of resources, adding more complexity to running such workflows
- Rigid resource constraints (e.g. size, flexibility, no elasticity)

PaaS Advantages for Scientific Applications

Development benefits: decouple development and deployment cycles, so that the application is driven by the science, not by the amount or type of available resources
- e.g. by exposing elastic resources at the application level, CDS&E developers are no longer constrained by available resources when developing their applications or algorithms

Deployment benefits: enable ease of use and access, and increase productivity
- e.g. facilitate and automate the deployment of applications
- e.g. allocate resources dynamically to match dynamic application behavior at runtime (useful when the workload varies or when resource utilization cannot be estimated)

New application formulation: hide resource allocation and provide more meaningful abstractions to developers
- e.g. enable elasticity at the development level in terms of domain-specific values, which the platform can translate to resource requirements (i.e. increasing accuracy, a faster convergence rate, etc. instead of increasing/decreasing resources)

Export entire applications, application patterns and kernels, optimized libraries, and/or specialized middleware as a service that can be used to build other applications
- e.g. expose a platform as a service to build and run new elastic data assimilation applications

New Usage Modes of a CDS&E PaaS

- Enable federation of the proper resource types and allocations to match application requirements, without the need to learn how to administer different systems, write complex scripts, etc.
- Introduce more appropriate runtime policies (cost to science, time to science, energy to science)
- Understanding/expressing application behavior during development leads to better optimization at runtime
  - e.g. a domain-specific language enables the developer to write more optimized code (code that can be compiled more effectively due to the predefined context of the language); a PaaS plays an analogous role in scientific computing, but targeted at optimizing the runtime
  - For example, given advance knowledge of the workload, the system can more effectively allocate appropriate amounts and types of resources, which depend on the problem/workflow
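The runtime policies named above (cost to science, time to science) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `ResourceOption` type, the `choose` function, and the numbers are hypothetical and are not taken from the platform described in these slides. The sketch filters candidate resource classes by an optional deadline and budget, then picks the fastest or the cheapest of the remaining ones.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ResourceOption:
    """Hypothetical estimate for running one workflow on one resource class."""
    name: str
    hours_to_finish: float   # estimated "time to science"
    cost_per_hour: float     # price in arbitrary units

    @property
    def total_cost(self) -> float:
        # Estimated "cost to science" for the whole run.
        return self.hours_to_finish * self.cost_per_hour


def choose(options: List[ResourceOption], policy: str,
           deadline_hours: Optional[float] = None,
           budget: Optional[float] = None) -> Optional[ResourceOption]:
    """Pick an option under a 'time' or 'cost' policy, honoring constraints."""
    feasible = [
        o for o in options
        if (deadline_hours is None or o.hours_to_finish <= deadline_hours)
        and (budget is None or o.total_cost <= budget)
    ]
    if not feasible:
        return None  # no resource class satisfies the constraints
    if policy == "time":
        return min(feasible, key=lambda o: o.hours_to_finish)
    if policy == "cost":
        return min(feasible, key=lambda o: o.total_cost)
    raise ValueError(f"unknown policy: {policy!r}")


# Illustrative numbers only.
OPTIONS = [
    ResourceOption("cloud", hours_to_finish=10.0, cost_per_hour=2.0),
    ResourceOption("cluster", hours_to_finish=6.0, cost_per_hour=5.0),
    ResourceOption("supercomputer", hours_to_finish=1.5, cost_per_hour=40.0),
]
```

With these numbers, a cost policy with an 8-hour deadline selects the cluster, because the cloud option misses the deadline and the supercomputer is more expensive. An energy-to-science policy could be added the same way with one more estimated field.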

Federation Management (figure)
Legend: M = Master, W = Secure Worker, IW = Unsecure Worker, P = Proxy, R = Request Handler

CDS&E PaaS Layers (figure)
- Actors: Scientific Developer, Scientific User
- Development tools and APIs: App APIs, Cloud APIs, App kernels, Cloud Agents
- Deployment tools: QoS policy (e.g. #cores, memory), workflow expression & input, runtime policy (e.g. cost, deadline)
- Scalable PaaS middleware
- Uniform IaaS APIs
- Resources: Cloud, Clusters, Supercomputers

CDS&E PaaS Layers

Development tools and APIs
- Development tools and APIs mask the underlying hardware and enable building new applications. Such APIs provide application-specific libraries, and also expose cloud abstractions such as elasticity and federation to the developer in domain-specific terms. Underneath such development tools and APIs, there are efficient application kernels/libraries, which are optimized for different types of resources
- Application APIs used to build new applications utilize optimized kernels and libraries underneath
- Application kernels and libraries are optimized for every resource, e.g. GROMACS, NAMD, CHARMM
- Cloud APIs expose cloud abstractions (i.e. elasticity) in domain-specific terms (i.e. convergence)
- Cloud agents convert domain-specific abstractions (e.g. accuracy) to resources, which can be used at runtime by the PaaS middleware to allocate/de-allocate resources. Different agents are used for different domains

Deployment tools
- Deployment tools enable users to create, express, and execute scalable workflows easily. Such tools must be able to express scalability, federation, as well as complex workflows (e.g. loops and conditional feedbacks)
- QoS policies: tools to express QoS requirements and execution parameters such as number of cores, environment variables, execution modes (e.g. MPI, single core, etc.)
- Workflow specifications: tools to express specific workflows (varies depending on the application class, e.g. data assimilation workflow, replica exchange workflow)

Scalable PaaS middleware
- Scalable PaaS middleware that utilizes such APIs to run applications and meet user requirements, which can be expressed by user-friendly policies
- The PaaS middleware uses the workflow expression and input to execute the workflow while taking into consideration the QoS policy, the runtime policy, and the cloud agent requirements
- The middleware uses the uniform IaaS APIs to allocate or de-allocate resources, and execute applications

Uniform IaaS APIs
- IaaS APIs can communicate with different resources or resource classes, while hiding different interactions and providing a uniform view to the PaaS
- Provide elastic, unlimited, and on-demand HPC resources, supporting allocating, de-allocating, scaling up/down/out, running jobs, and dealing with usually long queues

Resources: Cloud, Clusters, Supercomputers
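As a rough illustration of how a cloud agent and a uniform IaaS layer could fit together, the sketch below maps a domain-level target to a core count and then allocates those cores across backends behind one interface. All names (`IaaSBackend`, `InMemoryBackend`, `EnsembleAgent`, `provision`) are hypothetical, and the number of ensemble members is used as a stand-in for a domain-specific notion of accuracy; none of this is the actual API of the platform described in these slides.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class IaaSBackend(ABC):
    """Hypothetical uniform view of one resource class (cloud, cluster, ...)."""
    name: str

    @abstractmethod
    def allocate(self, cores: int) -> int:
        """Try to allocate `cores`; return the number actually granted."""

    @abstractmethod
    def release(self, cores: int) -> int:
        """Release up to `cores`; return the number actually freed."""


class InMemoryBackend(IaaSBackend):
    """Toy backend with a fixed capacity, standing in for a real provider."""

    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self.capacity = capacity
        self.in_use = 0

    def allocate(self, cores: int) -> int:
        granted = min(cores, self.capacity - self.in_use)
        self.in_use += granted
        return granted

    def release(self, cores: int) -> int:
        freed = min(cores, self.in_use)
        self.in_use -= freed
        return freed


class EnsembleAgent:
    """Toy cloud agent: converts a domain target into a resource request."""

    def __init__(self, cores_per_member: int) -> None:
        self.cores_per_member = cores_per_member

    def cores_for(self, ensemble_members: int) -> int:
        # Assumption: each ensemble member needs a fixed number of cores.
        return ensemble_members * self.cores_per_member


def provision(backends: List[IaaSBackend], cores_needed: int) -> Dict[str, int]:
    """Fill the request from the backends in order; report cores per backend."""
    granted: Dict[str, int] = {}
    remaining = cores_needed
    for backend in backends:
        if remaining == 0:
            break
        got = backend.allocate(remaining)
        if got:
            granted[backend.name] = got
            remaining -= got
    return granted
```

The point of the split is that the developer only states the domain target, the agent turns it into cores, and the middleware never needs to know which provider sits behind `allocate`.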

Federation Requirements

- Scalability and extended capacity: scale across geographically distributed resources to satisfy the computing demand of scientific applications
- Interoperability: interact with heterogeneous resources (supercomputers, MPI and MapReduce clusters, massively parallel and shared memory supercomputers, and clouds)
- Capability: optimize the resource allocation based on the particular characteristics of each resource
- Self-discovery: discovery mechanisms to provide a realistic view of the federation (dynamic availability of resources and capabilities)
- Elasticity and on-demand access: create an abstraction on top of the resources to provide on-demand access and the ability to scale up, down, or out as needed
- Democratization: provide users with access to a larger number of resources or to specific ones, to enable new scientific challenges
- Correct abstractions: provide users with balanced application and cloud abstractions

Federated Asynchronous Replica Exchange

- Replica exchange simulations require a large amount of HPC resources, which are expensive and not always available
  - This is because replica exchange molecular dynamics simulations are very static in terms of execution model: the simulations go from start to finish irrespective of whether replicas are progressing towards correct folding
- The ability to select trajectories based on progress towards folding would represent a new direction for MD simulations
- Run multiple trajectories on traditional resources, and export the trajectories that are progressing quickly to HPC resources
  - Conservation of high-end resources: replica exchange trajectories are initially run on commodity resources
  - Accelerating the application: when converging replicas are detected, they are migrated to the high-end resources to accelerate the computation
  - Enabling larger-scale problems: monitoring and killing/spawning replicas can accelerate protein folding events and allow scientists to explore MD simulations at a larger scale
- The developer expresses the progress rate and the thresholds for spawning to HPC resources in metadata associated with the application

Algorithmic Approach

- Run multiple trajectories on traditional resources
- Monitor the progress of a protein structure using two methods:
  1) Secondary structure prediction: a measure of how closely the secondary structures of the simulated protein match the actual ones
  2) Radius of gyration tracking: a measure of whether the radius of gyration is converging towards the known radius of gyration value
- Kill diverging replicas and restart quickly progressing replicas on an HPC resource, resulting in the acceleration of the protein folding simulations
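The radius of gyration part of this approach can be sketched as follows. The radius of gyration uses the standard mass-weighted definition, Rg^2 = sum(m_i * |r_i - r_cm|^2) / sum(m_i). The triage rule is only an assumption for illustration: the `classify` function, its thresholds, and the three labels are hypothetical, and the secondary structure check is left out.

```python
import math
from typing import List, Optional, Sequence

Point = Sequence[float]  # (x, y, z) coordinates of one atom


def radius_of_gyration(coords: List[Point],
                       masses: Optional[List[float]] = None) -> float:
    """Mass-weighted radius of gyration of a set of atoms."""
    if masses is None:
        masses = [1.0] * len(coords)  # assumption: equal masses if none given
    total = sum(masses)
    # Center of mass, one component at a time.
    com = [sum(m * c[k] for m, c in zip(masses, coords)) / total
           for k in range(3)]
    # Mass-weighted mean squared distance from the center of mass.
    weighted_sq = sum(
        m * sum((c[k] - com[k]) ** 2 for k in range(3))
        for m, c in zip(masses, coords)
    )
    return math.sqrt(weighted_sq / total)


def classify(rg_history: List[float], target_rg: float,
             promote_progress: float) -> str:
    """Hypothetical triage of one replica from its recent Rg samples.

    'promote': the gap to the known Rg shrank by at least `promote_progress`,
               so the replica would be restarted on an HPC resource.
    'kill':    the gap grew, so the replica is treated as diverging.
    'keep':    anything in between stays on commodity resources.
    """
    first_gap = abs(rg_history[0] - target_rg)
    last_gap = abs(rg_history[-1] - target_rg)
    progress = first_gap - last_gap
    if progress >= promote_progress:
        return "promote"
    if progress < 0:
        return "kill"
    return "keep"
```

In the workflow described here, the progress rate and thresholds would come from the metadata the developer attaches to the application rather than being hard-coded.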

Alpha Helix and Beta Sheet secondary structures for Ubiquitin

Radius of Gyration Comparison (a) Large Radius of Gyration (b) Small Radius of Gyration

Resulting CometCloud Replica Exchange Workflow (figure)

- User input
  - High-level workflow description (e.g. executable locations, input/output location)
  - Application data
  - Progress rate and threshold criteria
  - QoS policy/requirements (e.g. time to completion, cost, etc.)
- Replica Exchange Platform as a Service
  - Translates the workflow description to a runtime workflow
  - Autonomic execution of such a workflow on currently available resources, while optimizing user-specified metrics
  - Component labels in the figure: autonomic manager, workflow manager, runtime estimator, autonomic scheduler, adaptivity manager (monitor, analysis, adaptation), application/infrastructure adaptivity, resource view
- CometCloud
  - Federates on different resource types, while monitoring progress rate
  - Terminates diverging trajectories, restarting fast-converging ones on HPC resources
  - Component labels in the figure: CometCloud Master, Grid Agent, Cloud Agent, Cluster Agent
- Federation of resources (HPC grid, cloud, cluster; workers marked W)
  - Exposed to the user/application as a uniform, elastic, on-demand service
- Application executable
  - Optimized kernels and libraries for specific resources
  - Converging replica trajectories and multiple replica trajectories run on the different resources

IEEE SCALE Challenge 2012 (figure)

- iPad GUI: reports secondary structure and radius of gyration progress
- Autonomic Master (Amazon EC2): based on replica progress, the Autonomic Master stops the commodity trajectory and starts a replica set on high-performance resources
- Replica Set (TACC): 2048 cores, 4 ensembles, 64 cores/replica
- Replica Set (FutureGrid): 128 cores, 4 ensembles, 4 cores/replica
- Replica Set (Rutgers Cluster): 256 cores, 4 ensembles, 8 cores/replica
- Replica Set (Amazon EC2): 128 cores, 2 ensembles, 8 cores/replica
- Resource labels in the figure: HPC, Commodity
- Note: 8 temperatures = 1 ensemble
- Note: could run multiple replicas per temperature to improve the likelihood of asynchronous exchange on heterogeneous hardware
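The core counts on this slide are consistent with a simple product, assuming one replica per temperature: cores = ensembles x 8 temperatures x cores per replica. The short check below works through that arithmetic. The dictionary layout and the function name are illustrative only; the numbers are the ones listed on the slide.

```python
TEMPERATURES_PER_ENSEMBLE = 8  # from the slide: 8 temperatures = 1 ensemble

# Figures as listed on the slide for each replica set.
REPLICA_SETS = {
    "TACC": {"cores": 2048, "ensembles": 4, "cores_per_replica": 64},
    "FutureGrid": {"cores": 128, "ensembles": 4, "cores_per_replica": 4},
    "Rutgers Cluster": {"cores": 256, "ensembles": 4, "cores_per_replica": 8},
    "Amazon EC2": {"cores": 128, "ensembles": 2, "cores_per_replica": 8},
}


def cores_required(ensembles: int, cores_per_replica: int,
                   replicas_per_temperature: int = 1) -> int:
    """Cores needed if every temperature runs the given number of replicas."""
    replicas = ensembles * TEMPERATURES_PER_ENSEMBLE * replicas_per_temperature
    return replicas * cores_per_replica


# True for a site when the listed core count matches the product.
CONSISTENT = {
    site: cores_required(s["ensembles"], s["cores_per_replica"]) == s["cores"]
    for site, s in REPLICA_SETS.items()
}

TOTAL_CORES = sum(s["cores"] for s in REPLICA_SETS.values())
TOTAL_REPLICAS = sum(
    s["ensembles"] * TEMPERATURES_PER_ENSEMBLE for s in REPLICA_SETS.values()
)
```

Under that assumption all four sites match, for 2560 cores and 112 replicas in total. Running more than one replica per temperature, as the slide's note suggests, would scale the requirement by `replicas_per_temperature`.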

CometCloud

- CometCloud is an autonomic computing engine that enables dynamic and on-demand federation of advanced cyber-infrastructures
- Supports highly heterogeneous and dynamic cloud, grid, and HPC infrastructures
- Resource/data coordination based on a shared-space model (peer-to-peer lookup)
- Application/programming layer autonomics: dynamic workflows; policy-based component/service adaptations and compositions
- Autonomics layer: resource provisioning based on user objectives; estimate resource requirements initially, monitor application performance, and adjust resource provisioning
- Service layer autonomics: robust monitoring and proactive self-management; dynamic application/system/context-sensitive adaptations
- Infrastructure layer autonomics: on-demand scale-out; resilient to failure and data loss; handles dynamic joins/departures; supports trust boundaries
- Red box (in the figure) denotes open source: http://cometcloud.org
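The shared-space coordination idea can be illustrated with a small in-process sketch: a master places tasks in a shared space, and workers pull tasks whenever they are free and post results back. This is not CometCloud's API. It uses only Python's standard `queue` and `threading` modules as a stand-in for a distributed space, and the function name `run_master_worker` is hypothetical.

```python
import queue
import threading
from typing import Callable, Dict, Hashable, Iterable


def run_master_worker(tasks: Iterable[Hashable], n_workers: int,
                      work: Callable[[Hashable], object]) -> Dict[Hashable, object]:
    """Run `work` on every task using workers that pull from a shared space."""
    space: "queue.Queue" = queue.Queue()     # stand-in for the shared task space
    results: "queue.Queue" = queue.Queue()   # where workers post results

    # Master side: insert all tasks into the space before workers start.
    for task in tasks:
        space.put(task)

    def worker() -> None:
        # Pull model: a worker takes the next task only when it is free,
        # so faster workers naturally process more tasks.
        while True:
            try:
                task = space.get_nowait()
            except queue.Empty:
                return  # no tasks left, worker leaves
            results.put((task, work(task)))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # Master side: collect the results after all workers have finished.
    collected: Dict[Hashable, object] = {}
    while not results.empty():
        task, value = results.get()
        collected[task] = value
    return collected
```

Because workers only ever interact with the space, adding or removing a worker does not require the master to change, which is the property that makes this style a reasonable fit for resources that join and leave dynamically.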

Conclusion & Challenges

- Demonstrated how cloud abstractions can be effectively used to support ensemble geo-system management applications or replica exchange molecular dynamics on a geographically distributed federation of cloud and supercomputing systems, using a pervasive portal
- The application formulation provides adaptivity and elasticity at the application level
- This framework can be reused for other applications
- Need to provide abstractions that are meaningful to specific domains and different workflows
- Need the right balance between such abstractions so as not to limit development

Thank You