Performance, Availability and Power Analysis for IaaS Cloud
Kishor Trivedi (kst@ee.duke.edu, www.ee.duke.edu/~kst)
Dept. of ECE, Duke University, Durham, NC 27708
Università di Napoli, September 23, 2011
Duke University is located in North Carolina, USA, in the Research Triangle area anchored by Duke, UNC-Chapel Hill and NC State, adjacent to Research Triangle Park (RTP). (Map omitted.)
Trivedi's Research Triangle
Software packages: HARP (NASA), SAVE (IBM), IRAP (Boeing), SHARPE, SPNP, SREPT
Books: Blue, Red, White
Theory - stochastic modeling methods and numerical solution methods:
- Large fault trees, stochastic Petri nets
- Large/stiff Markov and non-Markov models
- Fluid stochastic models
- Performability and Markov reward models
- Software aging and rejuvenation
- Attack countermeasure trees
Applications - reliability/availability/performance of:
- Avionics, space, power systems, transportation systems, automobile systems
- Computer systems (hardware/software)
- Telco systems and computer networks
- Virtualized data centers and cloud computing
Talk outline
- Overview of Reliability and Availability Quantification
- Overview of Cloud Computing
- Performance Quantification for IaaS Cloud (PRDC 2010)
- Availability Quantification for IaaS Cloud (DSN 2011)
- Power Quantification for IaaS Cloud (DSN workshop 2011)
- Future Research
An Overview of Reliability and Availability Quantification Methods
- Software + hardware in operation
- Dynamic as opposed to static behavior
Reliability and Availability Quantification
- Measurement-based: more accurate, but expensive due to the many parameters and configurations, and not always possible during system design
- Model-based
- Combined approach: measurements are made at the subsystem level and models are built to derive system-level measures
Reliability and Availability Evaluation Methods
Quantitative evaluation is measurement-based, model-based, or hybrid. Model-based evaluation uses discrete-event simulation or analytic models; analytic models are solved in closed form or numerically via a tool.
Numerical solution of analytic models is not as well utilized; there is unnecessarily excessive use of simulation.
Analytic Modeling Taxonomy
Within model-based quantitative dependability evaluation, analytic models divide into:
- Non-state-space models
- State-space models
- Hierarchical composition
- Fixed-point iterative models
Non-state-space models
- Reliability block diagrams (RBDs), reliability graphs (relgraphs) and fault trees (FTs) are easy to use and efficient to solve for system reliability, system availability and system mean time to failure (MTTF)
- Product-form queueing networks are used for performance analysis
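To make the "efficient to solve" claim concrete, here is a minimal sketch (Python; the series-parallel structure and the component reliabilities are invented for illustration, not taken from the talk) that evaluates a small RBD directly from component reliabilities:

```python
# Minimal sketch: solving a series-parallel reliability block diagram (RBD).
# Component reliabilities are illustrative values, not measured data.

def series(*rs):
    """A series block works only if every component works."""
    p = 1.0
    for r in rs:
        p *= r
    return p

def parallel(*rs):
    """A parallel block fails only if every component fails."""
    q = 1.0
    for r in rs:
        q *= (1.0 - r)
    return 1.0 - q

# Two redundant servers in parallel, in series with a network link.
r_server, r_link = 0.99, 0.999
r_system = series(parallel(r_server, r_server), r_link)
print(f"system reliability = {r_system:.6f}")  # -> 0.998900
```

Because the solution is a direct combinatorial computation over the diagram, no global state space is ever enumerated, which is what keeps these models cheap.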
Example: Reliability Analysis of Boeing 787 Current Return Network, Modeled as a Reliability Graph (Relgraph)
Reliability Analysis of Boeing 787 (cont'd)
- This real problem has too many minpaths: non-state-space models also face the largeness problem
- (Table of the number of paths from source to target omitted)
Reliability Analysis of Boeing 787 (cont'd)
Our approach: a new efficient algorithm for computing (un)reliability bounds, developed and incorporated into SHARPE (Symbolic Hierarchical Automated Reliability and Performance Evaluator).
Non-State-Space Models
- Failure/repair dependencies are often present; RBDs, relgraphs and fault trees cannot easily handle them (e.g., shared repair, warm/cold spares, imperfect coverage, non-zero switching time, travel time of the repair person, reliability with repair)
- Product form often does not hold when modeling real-life aspects such as simultaneous resource possession, priorities, retries, etc.
State-space models: Markov chains
- To model complex interactions between components, use Markov chains or, more generally, state-space models
- Many examples of dependencies among system components have been observed in practice and captured by continuous-time Markov chains (CTMCs)
- The extension to Markov reward models makes computation of measures of interest relatively easy
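As a hedged illustration of the CTMC-plus-reward machinery, the sketch below (Python/NumPy; the two-state model and its failure/repair rates are assumed, not from the talk) solves the steady-state balance equations πQ = 0, Σπ = 1 and then evaluates a reward-based measure:

```python
# Minimal sketch: steady-state solution of a CTMC plus a Markov reward measure.
# Rates (failure rate lam, repair rate mu) are illustrative, not from the talk.
import numpy as np

lam, mu = 1 / 1000.0, 1 / 2.0   # MTTF = 1000 h, MTTR = 2 h (assumed)
# Generator matrix Q over states {0: up, 1: down}; rows sum to zero.
Q = np.array([[-lam, lam],
              [ mu,  -mu]])

# Solve pi Q = 0 with sum(pi) = 1 by replacing one balance equation
# with the normalization condition.
A = np.vstack([Q.T[:-1], np.ones(2)])
b = np.array([0.0, 1.0])
pi = np.linalg.solve(A, b)

reward = np.array([1.0, 0.0])    # reward rate 1 when up, 0 when down
availability = pi @ reward
print(f"steady-state availability = {availability:.6f}")
```

Swapping the reward vector changes the measure: a 0/1 reward gives availability, while performance-valued rewards give performability-style measures.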
Markov availability model of WebSphere Application Server
- Failure detection: by WLM, by Node Agent, or manual detection
- Recovery: Node Agent auto process restart; manual recovery via process restart, node reboot, or repair
- Application server and proxy server modeled with escalated levels of recovery
- Delay and imperfect coverage in each step of recovery modeled
Analytic Modeling Taxonomy
Analytic models: non-state-space models | state-space (Markov) models
Should I Use Markov Models?
+ Model fault tolerance and recovery/repair
+ Model dependencies
+ Model contention for resources and concurrency
+ Generalize to Markov reward models for degradable systems
+ Can relax the exponential assumption
+ Performance, availability and performability modeling possible
- Large state space
State Space Explosion
State space explosion can be avoided by using hierarchical model composition: use state-space models for those parts of a system that require them, and non-state-space models for the more well-behaved parts of the system. A minimal sketch of this composition follows.
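A hedged sketch on a made-up system: a two-state CTMC sub-model yields each server's steady-state availability, and a non-state-space 2-out-of-3 top-level model combines the sub-model outputs instead of one large Markov chain:

```python
# Minimal sketch of hierarchical composition (all rates/structure assumed):
# bottom level: CTMC availability sub-model per server;
# top level: non-state-space 2-out-of-3 combination of the sub-model outputs.
import math

def ctmc_availability(mttf_h, mttr_h):
    """Steady-state availability of a 2-state up/down CTMC sub-model."""
    return mttf_h / (mttf_h + mttr_h)

def k_out_of_n(k, n, a):
    """Top-level non-state-space model: k-of-n identical components."""
    return sum(math.comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1))

a_server = ctmc_availability(mttf_h=2000.0, mttr_h=4.0)  # assumed rates
a_system = k_out_of_n(2, 3, a_server)
print(f"server availability = {a_server:.6f}, system availability = {a_system:.8f}")
```

The top level never sees the sub-model's states, only its output measure, which is what keeps the composed model small.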
Analytic Modeling Taxonomy
- Non-state-space models: efficiency, simplicity
- State-space models: dependency capture
- Hierarchical composition: to avoid largeness; we compose sub-model solutions together
Example: Architecture of SIP on IBM WebSphere Application Server (WAS)
Replication domain -> Nodes:
1 -> A, D | 2 -> A, E | 3 -> B, F | 4 -> B, D | 5 -> C, E | 6 -> C, F
Hierarchical composition
Top level: a system failure model in which the system fails if fewer than k of the 12 application servers (AS1-AS12) or the proxy servers (PX1, PX2) are up; each application server and proxy is represented by a lower-level Markov availability sub-model. (Composed model diagram omitted.)
This model was responsible for the actual sale of the system by IBM to their telco customer.
Fixed-Point Iteration
- Input parameters of sub-models can be functions of outputs of other sub-models
- If the import graph is not acyclic, we solve using fixed-point iteration
Taxonomy recap - analytic models:
- Non-state-space models: efficiency, simplicity
- State-space models: dependency capture
- Hierarchical composition: to avoid largeness
- Fixed-point iteration: to deal with interdependent sub-models
An Overview of Cloud Computing
NIST definition of cloud computing
Definition by the National Institute of Standards and Technology (NIST): "Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction."
Source: P. Mell and T. Grance, The NIST Definition of Cloud Computing, October 7, 2009
Key characteristics
- On-demand self-service: provisioning of computing capabilities without human intervention
- Resource pooling: shared physical and virtualized environment
- Rapid elasticity: quick scaling through standardization and automation
- Metered service: pay-as-you-go model of computing
Many of these characteristics are borrowed from the Cloud's predecessors!
Source: P. Mell and T. Grance, The NIST Definition of Cloud Computing, October 7, 2009
Evolution of cloud computing
Timeline: early 80s - cluster computing; early 90s - grid computing; around 2000 - utility computing; around 2005-06 - cloud computing.
Source: http://seekingalpha.com/article/167764-tipping-point-gartner-annoints-cloud-computing-top-strategic-technology
Cloud service models
- Infrastructure-as-a-Service (IaaS) cloud. Examples: Amazon EC2, IBM Smart Business Development and Test Cloud
- Platform-as-a-Service (PaaS) cloud. Examples: Microsoft Windows Azure, Google AppEngine
- Software-as-a-Service (SaaS) cloud. Examples: Gmail, Google Docs
Deployment models
Private cloud:
- Cloud infrastructure solely for an organization
- Managed by the organization or a third party
- May exist on-premises or off-premises
Public cloud:
- Cloud infrastructure available for use by general users
- Owned by an organization providing cloud services
Hybrid cloud:
- Composition of two or more clouds (private or public)
Key Challenges
Three critical metrics for a cloud:
- Service (un)availability
- Performance (response time) unpredictability
- Power consumption
A large number of parameters can affect performance, availability and power:
- Workload parameters
- Failure/recovery characteristics
- Types of physical infrastructure
- Characteristics of virtualization infrastructures
- Large scale: thousands of servers
Performance, availability and power quantification are difficult!
Our goals in the IBM Cloud project
- Develop a comprehensive analytic modeling approach that is high fidelity yet scalable and tractable
- Apply these models to cloud capacity planning
Our approach and the motivations behind it
- Difficulty with the measurement-based approach: expensive experimentation for each workload and system configuration
- A monolithic analytic model suffers from largeness and hence is not scalable
Our approach:
- The overall system model consists of a set of sub-models
- Sub-model solutions are composed via an interacting Markov chain approach
- Scalable and tractable: low cost of model solution while covering a large parameter space
Duke/IBM project on cloud computing
Joint work with Rahul Ghosh and Dong Seong Kim (Duke), Francesco Longo (Univ. of Messina), and Vijay Naik, Murthy Devarakonda and Daniel Dias (IBM T. J. Watson Research Center)
Performance Quantification for IaaS Cloud [paper in Proc. IEEE PRDC 2010]
System model
Current assumptions [to be relaxed soon]:
- Homogeneous requests
- All physical machines (PMs) are identical
- To minimize power consumption, PMs are divided into three pools:
  - Hot pool: fast provisioning but high power usage
  - Warm pool: slower provisioning but lower power usage
  - Cold pool: slowest provisioning but lowest power usage
Life-cycle of a job inside an IaaS cloud
Arrival -> queuing -> provisioning decision (Resource Provisioning Decision Engine) -> VM deployment and instantiation -> run-time execution -> out. The provisioning response delay spans queuing through instantiation; jobs can be rejected because the buffer is full or because there is insufficient capacity.
Provisioning and servicing steps: (i) resource provisioning decision, (ii) VM provisioning and (iii) run-time execution
Resource provisioning decision engine (RPDE): the first stage of the job life-cycle above, where rejections due to a full buffer or insufficient capacity occur. (Life-cycle diagram repeated on the slide.)
Resource provisioning decision engine (RPDE). (Flow-chart omitted.)
CTMC model for RPDE
States (i, s): i = number of jobs in the queue, s = pool currently being searched (hot, warm or cold). Jobs arrive at rate λ (blocked when i = N-1); the search of pool s completes at rate δ_s and succeeds with probability P_s. A failed hot-pool search falls through to the warm pool (rate δ_h(1-P_h)), a failed warm-pool search to the cold pool (rate δ_w(1-P_w)), and a failed cold-pool search drops the job (rate δ_c(1-P_c)). (State-transition diagram omitted.)
RPDE model: parameters and measures
Input parameters:
- λ: arrival rate (data collected from the cloud)
- 1/δ_h, 1/δ_w, 1/δ_c: mean search delays of the resource provisioning decision engine (from searching algorithms or measurements)
- P_h, P_w, P_c: probabilities of being able to provision (computed from the VM provisioning models)
- N: maximum number of jobs in the RPDE (from system/server specification)
Output measures:
- Job rejection probability due to buffer full (P_block)
- Job rejection probability due to insufficient capacity (P_drop)
- Mean decision delay for an accepted job (E[T_decision])
- Mean queuing delay for an accepted job (E[T_q_dec])
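The sketch below reconstructs the RPDE CTMC from the slide's description and computes two of the output measures. The structure (hot -> warm -> cold fall-through, drop on a cold miss) follows the slide, while all numeric rates, probabilities and the buffer size are illustrative assumptions:

```python
# Hedged sketch of the RPDE CTMC (structure reconstructed from the slide;
# all numeric values are illustrative). States are (i, s): i jobs queued
# behind the one in decision, s in {h, w, c} = pool being searched, plus an
# empty state. A failed search falls through hot -> warm -> cold; a
# cold-pool miss drops the job.
import numpy as np

lam = 20.0                                          # arrival rate (assumed)
delta = {'h': 1200.0, 'w': 1200.0, 'c': 1200.0}     # 1/mean search delay (assumed)
P = {'h': 0.7, 'w': 0.2, 'c': 0.1}                  # provisioning success (assumed)
N = 5                                               # buffer size (assumed)

states = [('empty',)] + [(i, s) for s in 'hwc' for i in range(N)]
idx = {st: n for n, st in enumerate(states)}
Q = np.zeros((len(states), len(states)))

def add(src, dst, rate):
    Q[idx[src], idx[dst]] += rate

add(('empty',), (0, 'h'), lam)                      # first arrival starts at hot
for s in 'hwc':
    nxt = {'h': 'w', 'w': 'c'}.get(s)               # fall-through pool
    for i in range(N):
        if i < N - 1:
            add((i, s), (i + 1, s), lam)            # arrival joins the queue
        done = (i - 1, 'h') if i > 0 else ('empty',)
        add((i, s), done, delta[s] * P[s])          # provisioned: next job -> hot
        if nxt:
            add((i, s), (i, nxt), delta[s] * (1 - P[s]))   # search next pool
        else:
            add((i, s), done, delta[s] * (1 - P[s]))       # cold miss: job dropped
np.fill_diagonal(Q, -Q.sum(axis=1))

A = np.vstack([Q.T[:-1], np.ones(len(states))])
pi = np.linalg.lstsq(A, np.r_[np.zeros(len(states) - 1), 1.0], rcond=None)[0]

P_block = sum(pi[idx[(N - 1, s)]] for s in 'hwc')   # buffer full on arrival
drop_rate = sum(pi[idx[(i, 'c')]] for i in range(N)) * delta['c'] * (1 - P['c'])
print(f"P_block ~ {P_block:.4g}, drop rate ~ {drop_rate:.4g} jobs/hr")
```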
VM provisioning: the second stage of the job life-cycle (VM deployment and instantiation). (Life-cycle diagram repeated on the slide.)
VM provisioning model
Accepted jobs leave the Resource Provisioning Decision Engine for the hot PM pool, warm pool or cold pool, where VMs are deployed on idle resources (on hot, warm or cold machines) and run until service completion. (Block diagram omitted.)
VM provisioning model for each hot PM
A CTMC with states (i, j, k): i = number of jobs in the queue, j = number of VMs being provisioned, k = number of VMs running. L_h is the buffer size and m is the maximum number of VMs that can run simultaneously on a PM. Jobs arrive at rate λ_h, VM provisioning completes at rate β_h, and with k VMs running service completions occur at rate kµ. (State-transition diagram omitted.)
VM provisioning model (for each hot PM)
Input parameters:
- λ_h = λ(1 - P_block)/n_h: effective arrival rate per hot PM, where P_block is obtained from the resource provisioning decision model
- 1/β_h: mean VM provisioning delay, which can be measured experimentally
- 1/µ: mean service time, obtained from the lower-level run-time model
The hot pool model is the set of n_h independent hot PM models.
Output measure: P_h = probability that a job is accepted in the hot pool
P_h = 1 - ( Σ_{i=0}^{m-1} φ^{(h)}(L_h, 1, i) + φ^{(h)}(L_h, 0, m) )^{n_h}
where Σ_{i=0}^{m-1} φ^{(h)}(L_h, 1, i) + φ^{(h)}(L_h, 0, m) is the steady-state probability that a PM cannot accept a job for provisioning, from the solution of the Markov model of a hot PM on the previous slide.
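A hedged one-line implementation of the pool-level formula: since the pool is a set of independent, identical PMs, the pool accepts a job unless every PM is blocked. The per-PM blocking probability and pool size below are assumed values:

```python
# Hedged sketch of the pool-level acceptance probability from the slide's
# formula. pi_block_pm is the single-PM blocking probability (the sum of the
# full states of the per-PM CTMC); the numbers here are illustrative.

def pool_acceptance(pi_block_pm: float, n_pms: int) -> float:
    """P_pool = 1 - (per-PM blocking probability) ** n_pms."""
    return 1.0 - pi_block_pm ** n_pms

P_h = pool_acceptance(pi_block_pm=0.15, n_pms=4)   # assumed values
print(f"P_h = {P_h:.6f}")                          # 1 - 0.15**4 ~ 0.999494
```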
VM provisioning model for each warm PM
Same (i, j, k) structure as the hot-PM CTMC, with buffer size L_w and effective arrival rate λ_w, plus additional start-up states (marked 1* and 1**) for the first job: before serving its first job a warm PM must complete a start-up phase at rate γ_w. (State-transition diagram omitted.)
VM provisioning model for each cold PM
Same structure as the warm-PM CTMC, with buffer size L_c, effective arrival rate λ_c and start-up rate γ_c for the first job. (State-transition diagram omitted.)
VM provisioning models: summary
A warm/cold PM model is similar to the hot PM model, except:
(i) the effective job arrival rate differs
(ii) for the first job, a warm/cold PM requires additional start-up time
(iii) the mean provisioning delay of a VM for the first job is longer
Outputs of the hot, warm and cold pool models: probabilities (P_h, P_w, P_c) that at least one PM in the hot/warm/cold pool can accept a job
Import graph for the performance models
The RPDE model exports P_block to the hot, warm and cold pool (VM provisioning) models; the pool models export P_h, P_w and P_c back to the RPDE model. Overall outputs: job rejection probability and mean response delay.
Fixed-point iteration
- To solve the hot, warm and cold PM models, we need P_block from the resource provisioning decision model
- To solve the provisioning decision model, we need P_h, P_w, P_c from the hot, warm and cold pool models respectively
- This leads to a cyclic dependency between the resource provisioning decision model and the VM provisioning models (hot, warm, cold)
- We resolve this dependency via fixed-point iteration
- Observe that our fixed-point variable is P_block and the corresponding fixed-point equation is of the form P_block = f(P_block)
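A hedged sketch of the fixed-point loop using successive substitution. The two inner functions are simple monotone surrogates standing in for the actual CTMC solutions (the RPDE model and the per-pool VM provisioning models), so the sketch runs end-to-end:

```python
# Hedged sketch of the fixed-point loop P_block = f(P_block). The inner
# "solve" functions are surrogate formulas, not the real CTMC solutions.

def solve_pool_models(p_block):
    # Stand-in: higher blocking -> lower effective arrival rate -> pools
    # accept more easily. Returns (P_h, P_w, P_c); surrogate formulas only.
    eff = 1.0 - p_block
    return 0.9 - 0.3 * eff, 0.6 - 0.2 * eff, 0.4 - 0.1 * eff

def solve_rpde_model(p_h, p_w, p_c):
    # Stand-in for the RPDE CTMC: better pool acceptance -> less blocking.
    return 0.5 * (1.0 - p_h) * (1.0 - p_w) * (1.0 - p_c)

p_block = 0.0                                  # initial guess
for it in range(100):
    p_h, p_w, p_c = solve_pool_models(p_block)
    new_p_block = solve_rpde_model(p_h, p_w, p_c)
    if abs(new_p_block - p_block) < 1e-10:     # convergence check
        break
    p_block = new_p_block
print(f"converged after {it} iterations: P_block = {p_block:.6f}")
```

In the real models, each iteration re-solves the pool CTMCs with the current P_block and the RPDE CTMC with the resulting P_h, P_w, P_c, stopping when P_block stabilizes.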
Performance measures: comparison with the monolithic model
(1 PM per pool and 1 VM per PM; ISP = interacting sub-models)

Jobs/hr | Mean RPDE queue length (ISP) | (monolithic) | Rejection probability (ISP) | (monolithic)
      1 | 9.0332e-07 | 9.2321e-07 | 9.8899e-06 | 1.1221e-03
      5 | 4.1622e-05 | 4.3364e-05 | 4.2334e-02 | 8.0500e-02
     10 | 2.3731e-04 | 2.4225e-04 | 2.3496e-01 | 2.6587e-01
     15 | 6.3539e-04 | 6.4377e-04 | 3.9860e-01 | 4.1493e-01
     20 | 1.2526e-03 | 1.2655e-03 | 5.1069e-01 | 5.1969e-01
     25 | 2.0990e-03 | 2.1179e-03 | 5.8915e-01 | 5.9449e-01
     30 | 3.1826e-03 | 3.2091e-03 | 6.4648e-01 | 6.4985e-01
     35 | 4.5106e-03 | 4.5462e-03 | 6.8999e-01 | 6.9223e-01

The error is between 1e-07 and 1e-03 for all results. The monolithic model has 912 states while the ISP model has 21.
Availability Quantification for IaaS Cloud [paper in Proc. IEEE/IFIP DSN 2011]
Assumptions
- We consider the net effect of different failures and repairs of PMs
- The MTTF of each hot PM is 1/λ_h and that of each warm PM is 1/λ_w, with 1/λ_h < 1/λ_w
- The MTTF of each cold PM is 1/λ_c, with 1/λ_h << 1/λ_c
- Each pool has repair facilities, and a shared repair policy is assumed
- PMs can migrate from one pool to another upon a failure and repair
Monolithic availability model. (Model diagram omitted.)
Interacting sub-models: SRN sub-models for the hot pool, warm pool and cold pool. (SRN diagrams omitted.)
Import graph and model outputs
Model outputs:
- Mean number of PMs in each pool (E[#P_h], E[#P_w] and E[#P_c])
- Availability of the cloud when at least k PMs (with 1 <= k <= n_h + n_w + n_c) are available across all the pools
- Downtime
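For reference, downtime follows directly from steady-state availability; a tiny sketch (the availability value is illustrative, not a result from the talk):

```python
# Hedged sketch: converting a steady-state availability into annual downtime,
# as used for the downtime tables that follow. The availability is assumed.
MINUTES_PER_YEAR = 365 * 24 * 60            # 525600

availability = 0.99995                      # assumed steady-state availability
downtime_min = (1.0 - availability) * MINUTES_PER_YEAR
print(f"downtime ~ {downtime_min:.1f} minutes/year")   # ~26.3
```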
Monolithic vs. interacting sub-models: number of model states and non-zero entries

#PMs per pool | Monolithic states | Sub-model states | Monolithic non-zeros | Sub-model non-zeros
            5 |              7056 |               56 |                44520 |                 210
           10 |            207636 |              286 |              1535490 |                1320
           15 |           1775616 |              136 |             13948160 |                 480
           17 |           3508920 |              171 |             27976968 |                 612
           19 |           6468000 |              210 |             52189200 |                 760
           20 |   Memory overflow |              231 |      Memory overflow |                 840
           50 |                 - |             1326 |                    - |                5100
          100 |                 - |             5151 |                    - |               20200
          150 |                 - |            11476 |                    - |               45300
Monolithic vs. interacting sub-models: average number of PMs in each pool

#PMs per pool to start with | Monolithic (hot / warm / cold) | Interacting sub-models (hot / warm / cold)
                          5 |  4.99 /  4.98 /  4.99          |  5.00 /  4.98 /  4.99
                         10 | 10.00 /  9.96 /  9.98          | 10.00 /  9.96 /  9.98
                         15 | 14.99 / 14.95 / 14.97          | 15.00 / 14.95 / 14.97
                         17 | 16.99 / 16.94 / 16.97          | 17.00 / 16.94 / 16.97
                         19 | 18.99 / 18.93 / 18.97          | 19.00 / 18.93 / 18.97
Monolithic vs. interacting sub-models: downtime
Comparison of downtime with 10 PMs in each pool to start with. The cloud is available when at least k PMs are up; n_r is the maximum number of PMs that can be repaired in parallel.

 k | n_r | Downtime, monolithic (min/yr) | Downtime, interacting sub-models (min/yr)
30 |  1  | 23185.793                     | 23178.956
30 |  2  | 22904.919                     | 22898.454
30 |  3  | 22903.681                     | 22897.219
29 |  1  |   792.475                     |   798.651
29 |  2  |   499.081                     |   505.258
29 |  3  |   497.787                     |   503.964
28 |  1  |    24.722                     |    25.336
28 |  2  |     8.412                     |     8.691
28 |  3  |     7.118                     |     7.396
Monolithic vs. interacting sub-models: comparison of solution times

#PMs per pool to start with | Monolithic model (sec) | Sub-models (sec)
                          5 |                  0.627 |            0.406
                         10 |                 18.670 |            0.517
                         15 |                373.822 |            0.278
                         17 |               1004.494 |            0.279
                         19 |               2459.553 |            0.280
                         20 |        Memory overflow |            0.281
                         50 |                      - |            0.296
                        100 |                      - |            0.377
                        150 |                      - |            0.564
                        200 |                      - |            0.948
Solution time for a large IaaS cloud
We use closed-form solutions of the sub-models.

#PMs per pool to start with | Solution time (sec)
                        500 |               0.251
                       1000 |               0.592
                       1500 |               0.911
                       2000 |               1.715
                       3000 |               2.483
                       4000 |               2.651
Resiliency Quantification for IaaS Cloud [paper in Proc. IEEE SRDS RACOS workshop 2010]
Resiliency Quantification: Definitions
Past research mostly interpreted resiliency as the fault-tolerance capability of the system. We use the following definition: resiliency is the persistence of service delivery that is predictable and can be trusted to perform when subjected to changes.*
Changes of interest in the context of an IaaS cloud:
- Increase/decrease in workload
- Increase/decrease in system capacity
- Increase/decrease in faultload
- Security attacks
- Accidents or disasters
*[1] J. Laprie, From Dependability to Resiliency, DSN 2008; [2] L. Simoncini, Resilient Computing: An Engineering Discipline, IPDPS 2009
General steps for resiliency quantification
(1) Construct a stochastic analytic model of the given system for the measure(s) of interest; this can be a performance or availability model of the system.
(2) Determine the steady-state behavior of the model developed in step (1); we compute steady-state values of the performance and/or availability measures. Note the analogy with phased-mission system reliability analysis.
(3) Apply the change(s) to the system by increasing (or decreasing) the value(s) of input parameter(s) of the model; examples of such changes are variations in call arrival rates or failure rates.
(4) Analyze the transient behavior of the system model to compute the transient measures after applying the change(s). Initial probabilities for this transient analysis are the steady-state probabilities computed in step (2). The transient response of the performance/availability measures quantifies the resiliency of the system.
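A hedged end-to-end sketch of steps (1)-(4) on a toy two-state availability model (all rates assumed): the steady state under the original parameters supplies the initial probabilities, and a transient solve under the changed parameters tracks the measure as it settles:

```python
# Hedged sketch of resiliency quantification steps (1)-(4) on a toy model.
# All rates are illustrative assumptions.
import numpy as np
from scipy.linalg import expm

def generator(lam, mu):
    """Two-state up/down CTMC generator (state 0 = up, state 1 = down)."""
    return np.array([[-lam, lam], [mu, -mu]])

def steady_state(Q):
    A = np.vstack([Q.T[:-1], np.ones(Q.shape[0])])
    return np.linalg.solve(A, np.array([0.0, 1.0]))

pi0 = steady_state(generator(lam=0.001, mu=0.5))     # steps (1)-(2): before change
Q_new = generator(lam=0.004, mu=0.5)                 # step (3): faultload x4

for t in [0.0, 1.0, 5.0, 20.0, 100.0]:               # step (4): hours after change
    pi_t = pi0 @ expm(Q_new * t)                     # transient solution pi(t)
    print(f"t = {t:6.1f} h: availability = {pi_t[0]:.6f}")
```

The settling time t_set introduced on the next slide is read off exactly this kind of transient response.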
IaaS cloud resiliency w.r.t. a change in arrival rate: t_set is the settling time, one of the metrics used to quantify resiliency. (Transient response plot omitted.)
Power Quantification for IaaS Cloud [paper in Proc. IEEE/IFIP DSN workshop DCDV 2011]
Power Consumption from the Hot PM Model
When no VM is running, a hot PM consumes an idle power of h_l. The power consumption of a VM with average resource utilization is assumed to be v_a. For each state (i, j, k) of the hot-PM CTMC we assign a power reward rate:
r(i, j, k) = h_l + k v_a
(State-transition diagram omitted.)
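A hedged sketch of the reward computation: each state's power reward is r(i, j, k) = h_l + k·v_a, and the expected power of a hot PM is the reward-weighted sum over the steady-state distribution. The power values and the tiny state/probability list are illustrative stand-ins for the real model:

```python
# Hedged sketch of the power reward assignment from the slide. The numeric
# power values and the toy steady-state distribution are assumptions.
h_l = 140.0    # idle power of a hot PM, watts (assumed)
v_a = 25.0     # average power per running VM, watts (assumed)

# (state, steady-state probability) pairs; state = (i, j, k), k = running VMs
pi = {(0, 0, 0): 0.30, (0, 1, 0): 0.10, (0, 0, 1): 0.35, (0, 0, 2): 0.25}

expected_power = sum(p * (h_l + k * v_a) for (i, j, k), p in pi.items())
print(f"E[power per hot PM] = {expected_power:.1f} W")
```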
Power Consumption from the Warm PM Model: each warm-PM CTMC state is assigned one of the reward rates w_l1, w_l2, w_l3 or h_l. (State-to-reward table omitted.)
Power Consumption from the Cold PM Model: each cold-PM CTMC state is assigned one of the reward rates c_l1, c_l2, c_l3 or h_l. (State-to-reward table omitted.)
Net power consumption is the sum of the power consumption in the hot, warm and cold pools.
Power-performance trade-offs
(i, j, k) denotes the number of PMs in the hot, warm and cold pools respectively; there is a region where intuition-based grouping is bad. Optimization problem: what is the optimal number of PMs per pool that minimizes total power consumption but does not violate the SLA (an upper bound on mean response delay)? (Trade-off plot omitted; a brute-force sketch follows.)
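A hedged sketch of this optimization as a brute-force search; both evaluation functions below are simple surrogates (invented capacity and power coefficients) standing in for the full performance and power models:

```python
# Hedged sketch of the stated optimization: search over pool sizes (i, j, k)
# for the configuration that minimizes total power while keeping mean
# response delay under the SLA bound. Surrogate models only.

def mean_response_delay(i, j, k, load=30.0):
    capacity = 2.0 * i + 1.0 * j + 0.5 * k          # surrogate service capacity
    return float('inf') if capacity <= load else 1.0 / (capacity - load)

def total_power(i, j, k):
    return 200.0 * i + 80.0 * j + 20.0 * k          # surrogate watts per pool

SLA_DELAY = 0.25                                     # assumed SLA upper bound
best = None
for i in range(0, 31):
    for j in range(0, 31):
        for k in range(0, 31):
            if mean_response_delay(i, j, k) <= SLA_DELAY:
                cand = (total_power(i, j, k), (i, j, k))
                best = min(best, cand) if best else cand
print(f"optimal pools (hot, warm, cold) = {best[1]}, power = {best[0]:.0f} W")
```

With the real models, each candidate (i, j, k) would be evaluated by solving the performance model for mean response delay and the reward models for power.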
Future Research
Cost analysis
Providers have two key costs for providing cloud-based services: (i) capital expenditure (CapEx) and (ii) operational expenditure (OpEx).
- CapEx: examples include infrastructure cost and software licensing cost; CapEx is usually fixed over time
- OpEx: examples include power usage cost, costs or penalties due to violation of different SLA metrics, and management costs; OpEx is more interesting since it varies with time depending on factors such as system configuration, management strategy and workload arrivals
SLA-driven capacity planning
What is the optimal number of PMs so that total cost is minimized and the SLA is upheld? Challenges: a large-sized cloud, large variability, and a fixed number of configurations.
Proposed Extensions to Current Models
- More detailed workload model: different workload arrival processes (e.g., bursty), different types of service time distributions, heterogeneous requests, requests with different priorities
- More detailed availability model: different types of failure and repair time distributions
- Model validation
- Application of existing models to different cloud services/systems
- Cost analysis
- Capacity planning
Conclusions
Conclusions
- Analytic models are powerful for the construction and numerical solution of reliability, availability, performance and resiliency [behavior under changes in workload, faultload, configuration] models
- Not only exponential but also non-exponential distributions can be accommodated in such models
- For very complex systems such as clouds, hierarchical, fixed-point iterative and approximate solutions are needed; performance, availability, resiliency and power-consumption analysis can all be done with this approach
- Simulative and hybrid models/solutions should be used only when absolutely necessary
- Models can then be used in capacity planning and in a feedback-control setting for adapting to changes
Quantifying the Unquantifiable?
Thanks!