Performance, Availability and Power Analysis for IaaS Cloud
Kishor Trivedi (kst@ee.duke.edu, www.ee.duke.edu/~kst)
Dept. of ECE, Duke University, Durham, NC 27708
Università di Napoli, September 23, 2011
Duke University is located in North Carolina, USA, in the Research Triangle area anchored by Duke, UNC-Chapel Hill and NC State, adjacent to Research Triangle Park (RTP). (Map omitted.)
Trivedi's Research Triangle
Software packages: HARP (NASA), SAVE (IBM), IRAP (Boeing), SHARPE, SPNP, SREPT
Books: Blue, Red, White
Theory - stochastic modeling methods and numerical solution methods:
- Large fault trees, stochastic Petri nets
- Large/stiff Markov and non-Markov models
- Fluid stochastic models
- Performability and Markov reward models
- Software aging and rejuvenation
- Attack countermeasure trees
Applications - reliability/availability/performance of:
- Avionics, space, power systems, transportation systems, automobile systems
- Computer systems (hardware/software)
- Telco systems and computer networks
- Virtualized data centers and cloud computing
Talk outline
- Overview of Reliability and Availability Quantification
- Overview of Cloud Computing
- Performance Quantification for IaaS Cloud (PRDC 2010)
- Availability Quantification for IaaS Cloud (DSN 2011)
- Power Quantification for IaaS Cloud (DSN workshop 2011)
- Future Research
An Overview of Reliability and Availability Quantification Methods
- Software + hardware in operation
- Dynamic as opposed to static behavior
Reliability and Availability Quantification
- Measurement-based: more accurate, but expensive due to the many parameters and configurations, and not always possible during system design
- Model-based
- Combined approach: measurements are made at the subsystem level and models are built to derive system-level measures
Reliability and Availability Evaluation Methods
Quantitative evaluation is measurement-based, model-based, or hybrid. Model-based evaluation uses discrete-event simulation or analytic models; analytic models are solved in closed form or numerically via a tool.
Numerical solution of analytic models is not as well utilized; there is unnecessarily excessive use of simulation.
Analytic Modeling Taxonomy
Within model-based quantitative dependability evaluation, analytic models divide into:
- Non-state-space models
- State-space models
- Hierarchical composition
- Fixed-point iterative models
Non-state-space models
- Reliability block diagrams (RBDs), reliability graphs (relgraphs) and fault trees (FTs) are easy to use and efficient to solve for system reliability, system availability and system mean time to failure (MTTF)
- Product-form queueing networks are used for performance analysis
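To make the "efficient to solve" claim concrete, here is a minimal sketch (Python; the series-parallel structure and the component reliabilities are invented for illustration, not taken from the talk) that evaluates a small RBD directly from component reliabilities:

```python
# Minimal sketch: solving a series-parallel reliability block diagram (RBD).
# Component reliabilities are illustrative values, not measured data.

def series(*rs):
    """A series block works only if every component works."""
    p = 1.0
    for r in rs:
        p *= r
    return p

def parallel(*rs):
    """A parallel block fails only if every component fails."""
    q = 1.0
    for r in rs:
        q *= (1.0 - r)
    return 1.0 - q

# Two redundant servers in parallel, in series with a network link.
r_server, r_link = 0.99, 0.999
r_system = series(parallel(r_server, r_server), r_link)
print(f"system reliability = {r_system:.6f}")  # -> 0.998900
```

Because the solution is a direct combinatorial computation over the diagram, no global state space is ever enumerated, which is what keeps these models cheap.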
Example: Reliability Analysis of Boeing 787 Current Return Network, Modeled as a Reliability Graph (Relgraph)
Reliability Analysis of Boeing 787 (cont'd)
- This real problem has too many minpaths: non-state-space models also face the largeness problem
- (Table of the number of paths from source to target omitted)
Reliability Analysis of Boeing 787 (cont'd)
Our approach: a new efficient algorithm for computing (un)reliability bounds, developed and incorporated into SHARPE (Symbolic Hierarchical Automated Reliability and Performance Evaluator).
Non-State-Space Models
- Failure/repair dependencies are often present; RBDs, relgraphs and fault trees cannot easily handle them (e.g., shared repair, warm/cold spares, imperfect coverage, non-zero switching time, travel time of the repair person, reliability with repair)
- Product form often does not hold when modeling real-life aspects such as simultaneous resource possession, priorities, retries, etc.
State-space models: Markov chains
- To model complex interactions between components, use Markov chains or, more generally, state-space models
- Many examples of dependencies among system components have been observed in practice and captured by continuous-time Markov chains (CTMCs)
- The extension to Markov reward models makes computation of measures of interest relatively easy
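As a hedged illustration of the CTMC-plus-reward machinery, the sketch below (Python/NumPy; the two-state model and its failure/repair rates are assumed, not from the talk) solves the steady-state balance equations πQ = 0, Σπ = 1 and then evaluates a reward-based measure:

```python
# Minimal sketch: steady-state solution of a CTMC plus a Markov reward measure.
# Rates (failure rate lam, repair rate mu) are illustrative, not from the talk.
import numpy as np

lam, mu = 1 / 1000.0, 1 / 2.0   # MTTF = 1000 h, MTTR = 2 h (assumed)
# Generator matrix Q over states {0: up, 1: down}; rows sum to zero.
Q = np.array([[-lam, lam],
              [ mu,  -mu]])

# Solve pi Q = 0 with sum(pi) = 1 by replacing one balance equation
# with the normalization condition.
A = np.vstack([Q.T[:-1], np.ones(2)])
b = np.array([0.0, 1.0])
pi = np.linalg.solve(A, b)

reward = np.array([1.0, 0.0])    # reward rate 1 when up, 0 when down
availability = pi @ reward
print(f"steady-state availability = {availability:.6f}")
```

Swapping the reward vector changes the measure: a 0/1 reward gives availability, while performance-valued rewards give performability-style measures.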
Markov availability model of WebSphere Application Server
- Failure detection: by WLM, by Node Agent, or manual detection
- Recovery: Node Agent auto process restart; manual recovery via process restart, node reboot, or repair
- Application server and proxy server modeled with escalated levels of recovery
- Delay and imperfect coverage in each step of recovery modeled
Analytic Modeling Taxonomy
Analytic models: non-state-space models | state-space (Markov) models
Should I Use Markov Models?
+ Model fault tolerance and recovery/repair
+ Model dependencies
+ Model contention for resources and concurrency
+ Generalize to Markov reward models for degradable systems
+ Can relax the exponential assumption
+ Performance, availability and performability modeling possible
- Large state space
State Space Explosion
State space explosion can be avoided by using hierarchical model composition: use state-space models for those parts of a system that require them, and non-state-space models for the more well-behaved parts of the system. A minimal sketch of this composition follows.
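A hedged sketch on a made-up system: a two-state CTMC sub-model yields each server's steady-state availability, and a non-state-space 2-out-of-3 top-level model combines the sub-model outputs instead of one large Markov chain:

```python
# Minimal sketch of hierarchical composition (all rates/structure assumed):
# bottom level: CTMC availability sub-model per server;
# top level: non-state-space 2-out-of-3 combination of the sub-model outputs.
import math

def ctmc_availability(mttf_h, mttr_h):
    """Steady-state availability of a 2-state up/down CTMC sub-model."""
    return mttf_h / (mttf_h + mttr_h)

def k_out_of_n(k, n, a):
    """Top-level non-state-space model: k-of-n identical components."""
    return sum(math.comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1))

a_server = ctmc_availability(mttf_h=2000.0, mttr_h=4.0)  # assumed rates
a_system = k_out_of_n(2, 3, a_server)
print(f"server availability = {a_server:.6f}, system availability = {a_system:.8f}")
```

The top level never sees the sub-model's states, only its output measure, which is what keeps the composed model small.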
Analytic Modeling Taxonomy
- Non-state-space models: efficiency, simplicity
- State-space models: dependency capture
- Hierarchical composition: to avoid largeness; we compose sub-model solutions together
Example: Architecture of SIP on IBM WebSphere Application Server (WAS)
Replication domain -> Nodes:
1 -> A, D | 2 -> A, E | 3 -> B, F | 4 -> B, D | 5 -> C, E | 6 -> C, F
Hierarchical composition
Top level: a system failure model in which the system fails if fewer than k of the 12 application servers (AS1-AS12) or the proxy servers (PX1, PX2) are up; each application server and proxy is represented by a lower-level Markov availability sub-model. (Composed model diagram omitted.)
This model was responsible for the actual sale of the system by IBM to their telco customer.
Fixed-Point Iteration
- Input parameters of sub-models can be functions of outputs of other sub-models
- If the import graph is not acyclic, we solve using fixed-point iteration
Taxonomy recap - analytic models:
- Non-state-space models: efficiency, simplicity
- State-space models: dependency capture
- Hierarchical composition: to avoid largeness
- Fixed-point iteration: to deal with interdependent sub-models
An Overview of Cloud Computing
NIST definition of cloud computing
Definition by the National Institute of Standards and Technology (NIST): "Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction."
Source: P. Mell and T. Grance, The NIST Definition of Cloud Computing, October 7, 2009
Key characteristics
- On-demand self-service: provisioning of computing capabilities without human intervention
- Resource pooling: shared physical and virtualized environment
- Rapid elasticity: quick scaling through standardization and automation
- Metered service: pay-as-you-go model of computing
Many of these characteristics are borrowed from the Cloud's predecessors!
Source: P. Mell and T. Grance, The NIST Definition of Cloud Computing, October 7, 2009
Evolution of cloud computing
Timeline: early 80s - cluster computing; early 90s - grid computing; around 2000 - utility computing; around 2005-06 - cloud computing.
Source: http://seekingalpha.com/article/167764-tipping-point-gartner-annoints-cloud-computing-top-strategic-technology
Cloud service models
- Infrastructure-as-a-Service (IaaS) cloud. Examples: Amazon EC2, IBM Smart Business Development and Test Cloud
- Platform-as-a-Service (PaaS) cloud. Examples: Microsoft Windows Azure, Google AppEngine
- Software-as-a-Service (SaaS) cloud. Examples: Gmail, Google Docs
Deployment models
Private cloud:
- Cloud infrastructure solely for an organization
- Managed by the organization or a third party
- May exist on-premises or off-premises
Public cloud:
- Cloud infrastructure available for use by general users
- Owned by an organization providing cloud services
Hybrid cloud:
- Composition of two or more clouds (private or public)
Key Challenges
Three critical metrics for a cloud:
- Service (un)availability
- Performance (response time) unpredictability
- Power consumption
A large number of parameters can affect performance, availability and power:
- Workload parameters
- Failure/recovery characteristics
- Types of physical infrastructure
- Characteristics of virtualization infrastructures
- Large scale: thousands of servers
Performance, availability and power quantification are difficult!
Our goals in the IBM Cloud project
- Develop a comprehensive analytic modeling approach that is high fidelity yet scalable and tractable
- Apply these models to cloud capacity planning
Our approach and the motivations behind it
- Difficulty with the measurement-based approach: expensive experimentation for each workload and system configuration
- A monolithic analytic model suffers from largeness and hence is not scalable
Our approach:
- The overall system model consists of a set of sub-models
- Sub-model solutions are composed via an interacting Markov chain approach
- Scalable and tractable: low cost of model solution while covering a large parameter space
Duke/IBM project on cloud computing
Joint work with Rahul Ghosh and Dong Seong Kim (Duke), Francesco Longo (Univ. of Messina), and Vijay Naik, Murthy Devarakonda and Daniel Dias (IBM T. J. Watson Research Center)
Performance Quantification for IaaS Cloud [paper in Proc. IEEE PRDC 2010]
System model
Current assumptions [to be relaxed soon]:
- Homogeneous requests
- All physical machines (PMs) are identical
- To minimize power consumption, PMs are divided into three pools:
  - Hot pool: fast provisioning but high power usage
  - Warm pool: slower provisioning but lower power usage
  - Cold pool: slowest provisioning but lowest power usage
Life-cycle of a job inside an IaaS cloud
Arrival -> queuing -> provisioning decision (Resource Provisioning Decision Engine) -> VM deployment and instantiation -> run-time execution -> out. The provisioning response delay spans queuing through instantiation; jobs can be rejected because the buffer is full or because there is insufficient capacity.
Provisioning and servicing steps: (i) resource provisioning decision, (ii) VM provisioning and (iii) run-time execution
Resource provisioning decision engine (RPDE): the first stage of the job life-cycle above, where rejections due to a full buffer or insufficient capacity occur. (Life-cycle diagram repeated on the slide.)
Resource provisioning decision engine (RPDE). (Flow-chart omitted.)
CTMC model for RPDE
States (i, s): i = number of jobs in the queue, s = pool currently being searched (hot, warm or cold). Jobs arrive at rate λ (blocked when i = N-1); the search of pool s completes at rate δ_s and succeeds with probability P_s. A failed hot-pool search falls through to the warm pool (rate δ_h(1-P_h)), a failed warm-pool search to the cold pool (rate δ_w(1-P_w)), and a failed cold-pool search drops the job (rate δ_c(1-P_c)). (State-transition diagram omitted.)
RPDE model: parameters and measures
Input parameters:
- λ: arrival rate (data collected from the cloud)
- 1/δ_h, 1/δ_w, 1/δ_c: mean search delays of the resource provisioning decision engine (from searching algorithms or measurements)
- P_h, P_w, P_c: probabilities of being able to provision (computed from the VM provisioning models)
- N: maximum number of jobs in the RPDE (from system/server specification)
Output measures:
- Job rejection probability due to buffer full (P_block)
- Job rejection probability due to insufficient capacity (P_drop)
- Mean decision delay for an accepted job (E[T_decision])
- Mean queuing delay for an accepted job (E[T_q_dec])
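The sketch below reconstructs the RPDE CTMC from the slide's description and computes two of the output measures. The structure (hot -> warm -> cold fall-through, drop on a cold miss) follows the slide, while all numeric rates, probabilities and the buffer size are illustrative assumptions:

```python
# Hedged sketch of the RPDE CTMC (structure reconstructed from the slide;
# all numeric values are illustrative). States are (i, s): i jobs queued
# behind the one in decision, s in {h, w, c} = pool being searched, plus an
# empty state. A failed search falls through hot -> warm -> cold; a
# cold-pool miss drops the job.
import numpy as np

lam = 20.0                                          # arrival rate (assumed)
delta = {'h': 1200.0, 'w': 1200.0, 'c': 1200.0}     # 1/mean search delay (assumed)
P = {'h': 0.7, 'w': 0.2, 'c': 0.1}                  # provisioning success (assumed)
N = 5                                               # buffer size (assumed)

states = [('empty',)] + [(i, s) for s in 'hwc' for i in range(N)]
idx = {st: n for n, st in enumerate(states)}
Q = np.zeros((len(states), len(states)))

def add(src, dst, rate):
    Q[idx[src], idx[dst]] += rate

add(('empty',), (0, 'h'), lam)                      # first arrival starts at hot
for s in 'hwc':
    nxt = {'h': 'w', 'w': 'c'}.get(s)               # fall-through pool
    for i in range(N):
        if i < N - 1:
            add((i, s), (i + 1, s), lam)            # arrival joins the queue
        done = (i - 1, 'h') if i > 0 else ('empty',)
        add((i, s), done, delta[s] * P[s])          # provisioned: next job -> hot
        if nxt:
            add((i, s), (i, nxt), delta[s] * (1 - P[s]))   # search next pool
        else:
            add((i, s), done, delta[s] * (1 - P[s]))       # cold miss: job dropped
np.fill_diagonal(Q, -Q.sum(axis=1))

A = np.vstack([Q.T[:-1], np.ones(len(states))])
pi = np.linalg.lstsq(A, np.r_[np.zeros(len(states) - 1), 1.0], rcond=None)[0]

P_block = sum(pi[idx[(N - 1, s)]] for s in 'hwc')   # buffer full on arrival
drop_rate = sum(pi[idx[(i, 'c')]] for i in range(N)) * delta['c'] * (1 - P['c'])
print(f"P_block ~ {P_block:.4g}, drop rate ~ {drop_rate:.4g} jobs/hr")
```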
VM provisioning: the second stage of the job life-cycle (VM deployment and instantiation). (Life-cycle diagram repeated on the slide.)
VM provisioning model
Accepted jobs leave the Resource Provisioning Decision Engine for the hot PM pool, warm pool or cold pool, where VMs are deployed on idle resources (on hot, warm or cold machines) and run until service completion. (Block diagram omitted.)
VM provisioning model for each hot PM
A CTMC with states (i, j, k): i = number of jobs in the queue, j = number of VMs being provisioned, k = number of VMs running. L_h is the buffer size and m is the maximum number of VMs that can run simultaneously on a PM. Jobs arrive at rate λ_h, VM provisioning completes at rate β_h, and with k VMs running service completions occur at rate kµ. (State-transition diagram omitted.)
VM provisioning model (for each hot PM)
Input parameters:
- λ_h = λ(1 - P_block)/n_h: effective arrival rate per hot PM, where P_block is obtained from the resource provisioning decision model
- 1/β_h: mean VM provisioning delay, which can be measured experimentally
- 1/µ: mean service time, obtained from the lower-level run-time model
The hot pool model is the set of n_h independent hot PM models.
Output measure: P_h = probability that a job is accepted in the hot pool
P_h = 1 - ( Σ_{i=0}^{m-1} φ^{(h)}(L_h, 1, i) + φ^{(h)}(L_h, 0, m) )^{n_h}
where Σ_{i=0}^{m-1} φ^{(h)}(L_h, 1, i) + φ^{(h)}(L_h, 0, m) is the steady-state probability that a PM cannot accept a job for provisioning, from the solution of the Markov model of a hot PM on the previous slide.
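A hedged one-line implementation of the pool-level formula: since the pool is a set of independent, identical PMs, the pool accepts a job unless every PM is blocked. The per-PM blocking probability and pool size below are assumed values:

```python
# Hedged sketch of the pool-level acceptance probability from the slide's
# formula. pi_block_pm is the single-PM blocking probability (the sum of the
# full states of the per-PM CTMC); the numbers here are illustrative.

def pool_acceptance(pi_block_pm: float, n_pms: int) -> float:
    """P_pool = 1 - (per-PM blocking probability) ** n_pms."""
    return 1.0 - pi_block_pm ** n_pms

P_h = pool_acceptance(pi_block_pm=0.15, n_pms=4)   # assumed values
print(f"P_h = {P_h:.6f}")                          # 1 - 0.15**4 ~ 0.999494
```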
VM provisioning model for each warm PM
Same (i, j, k) structure as the hot-PM CTMC, with buffer size L_w and effective arrival rate λ_w, plus additional start-up states (marked 1* and 1**) for the first job: before serving its first job a warm PM must complete a start-up phase at rate γ_w. (State-transition diagram omitted.)
VM provisioning model for each cold PM
Same structure as the warm-PM CTMC, with buffer size L_c, effective arrival rate λ_c and start-up rate γ_c for the first job. (State-transition diagram omitted.)
VM provisioning models: summary
A warm/cold PM model is similar to the hot PM model, except:
(i) the effective job arrival rate differs
(ii) for the first job, a warm/cold PM requires additional start-up time
(iii) the mean provisioning delay of a VM for the first job is longer
Outputs of the hot, warm and cold pool models: probabilities (P_h, P_w, P_c) that at least one PM in the hot/warm/cold pool can accept a job
Import graph for the performance models
The RPDE model exports P_block to the hot, warm and cold pool (VM provisioning) models; the pool models export P_h, P_w and P_c back to the RPDE model. Overall outputs: job rejection probability and mean response delay.
Fixed-point iteration
- To solve the hot, warm and cold PM models, we need P_block from the resource provisioning decision model
- To solve the provisioning decision model, we need P_h, P_w, P_c from the hot, warm and cold pool models respectively
- This leads to a cyclic dependency between the resource provisioning decision model and the VM provisioning models (hot, warm, cold)
- We resolve this dependency via fixed-point iteration
- Observe that our fixed-point variable is P_block and the corresponding fixed-point equation is of the form P_block = f(P_block)
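A hedged sketch of the fixed-point loop using successive substitution. The two inner functions are simple monotone surrogates standing in for the actual CTMC solutions (the RPDE model and the per-pool VM provisioning models), so the sketch runs end-to-end:

```python
# Hedged sketch of the fixed-point loop P_block = f(P_block). The inner
# "solve" functions are surrogate formulas, not the real CTMC solutions.

def solve_pool_models(p_block):
    # Stand-in: higher blocking -> lower effective arrival rate -> pools
    # accept more easily. Returns (P_h, P_w, P_c); surrogate formulas only.
    eff = 1.0 - p_block
    return 0.9 - 0.3 * eff, 0.6 - 0.2 * eff, 0.4 - 0.1 * eff

def solve_rpde_model(p_h, p_w, p_c):
    # Stand-in for the RPDE CTMC: better pool acceptance -> less blocking.
    return 0.5 * (1.0 - p_h) * (1.0 - p_w) * (1.0 - p_c)

p_block = 0.0                                  # initial guess
for it in range(100):
    p_h, p_w, p_c = solve_pool_models(p_block)
    new_p_block = solve_rpde_model(p_h, p_w, p_c)
    if abs(new_p_block - p_block) < 1e-10:     # convergence check
        break
    p_block = new_p_block
print(f"converged after {it} iterations: P_block = {p_block:.6f}")
```

In the real models, each iteration re-solves the pool CTMCs with the current P_block and the RPDE CTMC with the resulting P_h, P_w, P_c, stopping when P_block stabilizes.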
Performance measures: comparison with the monolithic model
(1 PM per pool and 1 VM per PM; ISP = interacting sub-models)

Jobs/hr | Mean RPDE queue length (ISP) | (monolithic) | Rejection probability (ISP) | (monolithic)
      1 | 9.0332e-07 | 9.2321e-07 | 9.8899e-06 | 1.1221e-03
      5 | 4.1622e-05 | 4.3364e-05 | 4.2334e-02 | 8.0500e-02
     10 | 2.3731e-04 | 2.4225e-04 | 2.3496e-01 | 2.6587e-01
     15 | 6.3539e-04 | 6.4377e-04 | 3.9860e-01 | 4.1493e-01
     20 | 1.2526e-03 | 1.2655e-03 | 5.1069e-01 | 5.1969e-01
     25 | 2.0990e-03 | 2.1179e-03 | 5.8915e-01 | 5.9449e-01
     30 | 3.1826e-03 | 3.2091e-03 | 6.4648e-01 | 6.4985e-01
     35 | 4.5106e-03 | 4.5462e-03 | 6.8999e-01 | 6.9223e-01

The error is between 1e-07 and 1e-03 for all results. The monolithic model has 912 states while the ISP model has 21.
Availability Quantification for IaaS Cloud [paper in Proc. IEEE/IFIP DSN 2011]
Assumptions
- We consider the net effect of different failures and repairs of PMs
- The MTTF of each hot PM is 1/λ_h and that of each warm PM is 1/λ_w, with 1/λ_h < 1/λ_w
- The MTTF of each cold PM is 1/λ_c, with 1/λ_h << 1/λ_c
- Each pool has repair facilities, and a shared repair policy is assumed
- PMs can migrate from one pool to another upon a failure and repair
Monolithic availability model. (Model diagram omitted.)
Interacting sub-models: SRN sub-models for the hot pool, warm pool and cold pool. (SRN diagrams omitted.)
Import graph and model outputs
Model outputs:
- Mean number of PMs in each pool (E[#P_h], E[#P_w] and E[#P_c])
- Availability of the cloud when at least k PMs (with 1 <= k <= n_h + n_w + n_c) are available across all the pools
- Downtime
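For reference, downtime follows directly from steady-state availability; a tiny sketch (the availability value is illustrative, not a result from the talk):

```python
# Hedged sketch: converting a steady-state availability into annual downtime,
# as used for the downtime tables that follow. The availability is assumed.
MINUTES_PER_YEAR = 365 * 24 * 60            # 525600

availability = 0.99995                      # assumed steady-state availability
downtime_min = (1.0 - availability) * MINUTES_PER_YEAR
print(f"downtime ~ {downtime_min:.1f} minutes/year")   # ~26.3
```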
Monolithic vs. interacting sub-models: number of model states and non-zero entries

#PMs per pool | Monolithic states | Sub-model states | Monolithic non-zeros | Sub-model non-zeros
            5 |              7056 |               56 |                44520 |                 210
           10 |            207636 |              286 |              1535490 |                1320
           15 |           1775616 |              136 |             13948160 |                 480
           17 |           3508920 |              171 |             27976968 |                 612
           19 |           6468000 |              210 |             52189200 |                 760
           20 |   Memory overflow |              231 |      Memory overflow |                 840
           50 |                 - |             1326 |                    - |                5100
          100 |                 - |             5151 |                    - |               20200
          150 |                 - |            11476 |                    - |               45300
Monolithic vs. interacting sub-models: average number of PMs in each pool

#PMs per pool to start with | Monolithic (hot / warm / cold) | Interacting sub-models (hot / warm / cold)
                          5 |  4.99 /  4.98 /  4.99          |  5.00 /  4.98 /  4.99
                         10 | 10.00 /  9.96 /  9.98          | 10.00 /  9.96 /  9.98
                         15 | 14.99 / 14.95 / 14.97          | 15.00 / 14.95 / 14.97
                         17 | 16.99 / 16.94 / 16.97          | 17.00 / 16.94 / 16.97
                         19 | 18.99 / 18.93 / 18.97          | 19.00 / 18.93 / 18.97
Monolithic vs. interacting sub-models: downtime
Comparison of downtime with 10 PMs in each pool to start with. The cloud is available when at least k PMs are up; n_r is the maximum number of PMs that can be repaired in parallel.

 k | n_r | Downtime, monolithic (min/yr) | Downtime, interacting sub-models (min/yr)
30 |  1  | 23185.793                     | 23178.956
30 |  2  | 22904.919                     | 22898.454
30 |  3  | 22903.681                     | 22897.219
29 |  1  |   792.475                     |   798.651
29 |  2  |   499.081                     |   505.258
29 |  3  |   497.787                     |   503.964
28 |  1  |    24.722                     |    25.336
28 |  2  |     8.412                     |     8.691
28 |  3  |     7.118                     |     7.396
Monolithic vs. interacting sub-models: comparison of solution times

#PMs per pool to start with | Monolithic model (sec) | Sub-models (sec)
                          5 |                  0.627 |            0.406
                         10 |                 18.670 |            0.517
                         15 |                373.822 |            0.278
                         17 |               1004.494 |            0.279
                         19 |               2459.553 |            0.280
                         20 |        Memory overflow |            0.281
                         50 |                      - |            0.296
                        100 |                      - |            0.377
                        150 |                      - |            0.564
                        200 |                      - |            0.948
Solution time for a large IaaS cloud
We use closed-form solutions of the sub-models.

#PMs per pool to start with | Solution time (sec)
                        500 |               0.251
                       1000 |               0.592
                       1500 |               0.911
                       2000 |               1.715
                       3000 |               2.483
                       4000 |               2.651
Resiliency Quantification for IaaS Cloud [paper in Proc. IEEE SRDS RACOS workshop 2010]
Resiliency Quantification: Definitions
Past research mostly interpreted resiliency as the fault-tolerance capability of the system. We use the following definition: resiliency is the persistence of service delivery that is predictable and can be trusted to perform when subjected to changes.*
Changes of interest in the context of an IaaS cloud:
- Increase/decrease in workload
- Increase/decrease in system capacity
- Increase/decrease in faultload
- Security attacks
- Accidents or disasters
*[1] J. Laprie, From Dependability to Resiliency, DSN 2008; [2] L. Simoncini, Resilient Computing: An Engineering Discipline, IPDPS 2009
General steps for resiliency quantification
(1) Construct a stochastic analytic model of the given system for the measure(s) of interest; this can be a performance or availability model of the system.
(2) Determine the steady-state behavior of the model developed in step (1); we compute steady-state values of the performance and/or availability measures. Note the analogy with phased-mission system reliability analysis.
(3) Apply the change(s) to the system by increasing (or decreasing) the value(s) of input parameter(s) of the model; examples of such changes are variations in call arrival rates or failure rates.
(4) Analyze the transient behavior of the system model to compute the transient measures after applying the change(s). Initial probabilities for this transient analysis are the steady-state probabilities computed in step (2). The transient response of the performance/availability measures quantifies the resiliency of the system.
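A hedged end-to-end sketch of steps (1)-(4) on a toy two-state availability model (all rates assumed): the steady state under the original parameters supplies the initial probabilities, and a transient solve under the changed parameters tracks the measure as it settles:

```python
# Hedged sketch of resiliency quantification steps (1)-(4) on a toy model.
# All rates are illustrative assumptions.
import numpy as np
from scipy.linalg import expm

def generator(lam, mu):
    """Two-state up/down CTMC generator (state 0 = up, state 1 = down)."""
    return np.array([[-lam, lam], [mu, -mu]])

def steady_state(Q):
    A = np.vstack([Q.T[:-1], np.ones(Q.shape[0])])
    return np.linalg.solve(A, np.array([0.0, 1.0]))

pi0 = steady_state(generator(lam=0.001, mu=0.5))     # steps (1)-(2): before change
Q_new = generator(lam=0.004, mu=0.5)                 # step (3): faultload x4

for t in [0.0, 1.0, 5.0, 20.0, 100.0]:               # step (4): hours after change
    pi_t = pi0 @ expm(Q_new * t)                     # transient solution pi(t)
    print(f"t = {t:6.1f} h: availability = {pi_t[0]:.6f}")
```

The settling time t_set introduced on the next slide is read off exactly this kind of transient response.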
IaaS cloud resiliency w.r.t. a change in arrival rate: t_set is the settling time, one of the metrics used to quantify resiliency. (Transient response plot omitted.)
Power Quantification for IaaS Cloud [paper in Proc. IEEE/IFIP DSN workshop DCDV 2011]
Power Consumption from the Hot PM Model
When no VM is running, a hot PM consumes an idle power of h_l. The power consumption of a VM with average resource utilization is assumed to be v_a. For each state (i, j, k) of the hot-PM CTMC we assign a power reward rate:
r(i, j, k) = h_l + k v_a
(State-transition diagram omitted.)
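A hedged sketch of the reward computation: each state's power reward is r(i, j, k) = h_l + k·v_a, and the expected power of a hot PM is the reward-weighted sum over the steady-state distribution. The power values and the tiny state/probability list are illustrative stand-ins for the real model:

```python
# Hedged sketch of the power reward assignment from the slide. The numeric
# power values and the toy steady-state distribution are assumptions.
h_l = 140.0    # idle power of a hot PM, watts (assumed)
v_a = 25.0     # average power per running VM, watts (assumed)

# (state, steady-state probability) pairs; state = (i, j, k), k = running VMs
pi = {(0, 0, 0): 0.30, (0, 1, 0): 0.10, (0, 0, 1): 0.35, (0, 0, 2): 0.25}

expected_power = sum(p * (h_l + k * v_a) for (i, j, k), p in pi.items())
print(f"E[power per hot PM] = {expected_power:.1f} W")
```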
Power Consumption from the Warm PM Model: each warm-PM CTMC state is assigned one of the reward rates w_l1, w_l2, w_l3 or h_l. (State-to-reward table omitted.)
Power Consumption from the Cold PM Model: each cold-PM CTMC state is assigned one of the reward rates c_l1, c_l2, c_l3 or h_l. (State-to-reward table omitted.)
Net power consumption is the sum of the power consumption in the hot, warm and cold pools.
Power-performance trade-offs
(i, j, k) denotes the number of PMs in the hot, warm and cold pools respectively; there is a region where intuition-based grouping is bad. Optimization problem: what is the optimal number of PMs per pool that minimizes total power consumption but does not violate the SLA (an upper bound on mean response delay)? (Trade-off plot omitted; a brute-force sketch follows.)
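A hedged sketch of this optimization as a brute-force search; both evaluation functions below are simple surrogates (invented capacity and power coefficients) standing in for the full performance and power models:

```python
# Hedged sketch of the stated optimization: search over pool sizes (i, j, k)
# for the configuration that minimizes total power while keeping mean
# response delay under the SLA bound. Surrogate models only.

def mean_response_delay(i, j, k, load=30.0):
    capacity = 2.0 * i + 1.0 * j + 0.5 * k          # surrogate service capacity
    return float('inf') if capacity <= load else 1.0 / (capacity - load)

def total_power(i, j, k):
    return 200.0 * i + 80.0 * j + 20.0 * k          # surrogate watts per pool

SLA_DELAY = 0.25                                     # assumed SLA upper bound
best = None
for i in range(0, 31):
    for j in range(0, 31):
        for k in range(0, 31):
            if mean_response_delay(i, j, k) <= SLA_DELAY:
                cand = (total_power(i, j, k), (i, j, k))
                best = min(best, cand) if best else cand
print(f"optimal pools (hot, warm, cold) = {best[1]}, power = {best[0]:.0f} W")
```

With the real models, each candidate (i, j, k) would be evaluated by solving the performance model for mean response delay and the reward models for power.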
Future Research
Cost analysis
Providers have two key costs for providing cloud-based services: (i) capital expenditure (CapEx) and (ii) operational expenditure (OpEx).
- CapEx: examples include infrastructure cost and software licensing cost; CapEx is usually fixed over time
- OpEx: examples include power usage cost, costs or penalties due to violation of different SLA metrics, and management costs; OpEx is more interesting since it varies with time depending on factors such as system configuration, management strategy and workload arrivals
SLA-driven capacity planning
What is the optimal number of PMs so that total cost is minimized and the SLA is upheld? Challenges: a large-sized cloud, large variability, and a fixed number of configurations.
Proposed Extensions to Current Models
- More detailed workload model: different workload arrival processes (e.g., bursty), different types of service time distributions, heterogeneous requests, requests with different priorities
- More detailed availability model: different types of failure and repair time distributions
- Model validation
- Application of existing models to different cloud services/systems
- Cost analysis
- Capacity planning
Conclusions
Conclusions
- Analytic models are powerful for the construction and numerical solution of reliability, availability, performance and resiliency [behavior under changes in workload, faultload, configuration] models
- Not only exponential but also non-exponential distributions can be accommodated in such models
- For very complex systems such as clouds, hierarchical, fixed-point iterative and approximate solutions are needed; performance, availability, resiliency and power-consumption analysis can all be done with this approach
- Simulative and hybrid models/solutions should be used only when absolutely necessary
- Models can then be used in capacity planning and in a feedback-control setting for adapting to changes
Quantifying the Unquantifiable?
Thanks!