EXAM IN COURSE [EKSAMEN I EMNE] TTM4110 Dependability and Performance with Discrete event Simulation [Pålitelighet og ytelse med simulering]

Norwegian University of Science and Technology, Department of Telematics

Contact during exam [Faglig kontakt under eksamen]: Poul E. Heegaard (94321 / 99286858)

Wednesday [Onsdag] 2010-12-08, 09:00-13:00

The English version starts on page 2. The Bokmål version starts on page 10.

Permitted aids [Hjelpemidler]: C - Graham Birtwistle: DEMOS - A system for Discrete Event Modelling on Simula. The formula sheet for TTM4110 Dependability and Performance with Discrete Event Simulation is attached. Predefined simple calculator.

Grading [Sensur]: 2011, week 2

English version [1]

GreenCloud is a small or medium-sized enterprise (SME) that provides a service A on a server park with three servers, C1, C2 and C3. They have some problems with stability and performance and are considering consolidating their servers into a private cloud, possibly adding computing resources from a public cloud provider. The solution under consideration is to let the three servers form a private cloud and to buy additional computing resources from public cloud providers 1 and 2, routed via network 1 or 2. This hybrid cloud solution is illustrated in Figure 1.

[Figure 1: Hybrid Cloud Computing System. Labels: Service A; Private cloud (C1, C2, C3); Network 1; Network 2; Public cloud 1; Public cloud 2.]

Let $p_i$ be the probability that a successful service A request is handled by cloud $i$ ($i = A, A1, A2$). The service times, $S_i$, are deterministic. See Table 1 for numerical values.

Table 1: Service times in the hybrid cloud

i     Cloud      $E(S_i)$   $p_i$
A     Private    1/10       0.80
A1    Public 1   1/5        0.15
A2    Public 2   1/2        0.05

a) Plot the probability mass function for the service times in the hybrid cloud. Determine the expected service time, $E(S)$, and the variance, $Var(S)$. What is the expected service time given that the request is served by one of the public clouds?

[1] In case of divergence between the English and the Norwegian version, the English version prevails.

[Plot: probability mass function of the service time; vertical axis from 0 to 0.8, masses at service times 1/10, 1/5 and 1/2.]

$E(S) = 0.80 \cdot 1/10 + 0.15 \cdot 1/5 + 0.05 \cdot 1/2 = 0.135$

$Var(S) = 0.80\,(1/10 - 0.135)^2 + 0.15\,(1/5 - 0.135)^2 + 0.05\,(1/2 - 0.135)^2 = 0.008275$

$E(S \mid \text{public}) = (0.15 \cdot 1/5 + 0.05 \cdot 1/2)/(0.15 + 0.05) = 0.275$

In the following, assume that the service time in the private cloud is negative exponentially distributed with expectation $E(S_A) = 1/\mu_A$, while in public clouds 1 and 2 it is negative exponentially distributed with expectations $E(S_{A1}) = 1/\mu_{A1}$ and $E(S_{A2}) = 1/\mu_{A2}$, respectively.

b) Describe a generator of random variates of service times in the hybrid cloud. List the requirements for a random number generator.

1. Sample $U_1 \sim U(0, 1)$ and $U_2 \sim U(0, 1)$.
2. If $U_1 < p_A$ then $S = -\ln(U_2)/\mu_A$.
3. Else if $U_1 < p_A + p_{A1}$ then $S = -\ln(U_2)/\mu_{A1}$.
4. Else $S = -\ln(U_2)/\mu_{A2}$.
5. Go to 1.

Requirements for a random number generator (page 110 in the textbook): fast, portable, long cycles, reproducible, good statistical properties.

Figure 2 shows the empirical cumulative distribution function (CDF) after 1000 samples from the random variate generator above.

c) Study the CDF and explain whether this distribution is symmetric or non-symmetric around the mean value. Define a quantile of a distribution. What is the 90% quantile in the CDF above (read approximately from Figure 2)?
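The generator described in b) can be sketched in Python as follows. This is a minimal illustration, not part of the exam solution: the function names are hypothetical, and the rates are an assumption obtained from Table 1 by setting $\mu_i = 1/E(S_i)$.

```python
import math
import random

# Branch probabilities p_A and p_A1 from Table 1 (p_A2 is the remainder).
P_A, P_A1 = 0.80, 0.15
# Assumption: rates chosen as mu_i = 1 / E(S_i) with E(S_i) from Table 1.
MU_A, MU_A1, MU_A2 = 10.0, 5.0, 2.0


def service_time(rng):
    """Draw one service time following steps 1-4 of the generator in b)."""
    u1 = rng.random()
    # 1 - random() lies in (0, 1], so the logarithm is always defined.
    u2 = 1.0 - rng.random()
    if u1 < P_A:
        mu = MU_A
    elif u1 < P_A + P_A1:
        mu = MU_A1
    else:
        mu = MU_A2
    return -math.log(u2) / mu


def sample_mean(n, seed=1):
    """Average of n generated service times, using a seeded generator."""
    rng = random.Random(seed)
    return sum(service_time(rng) for _ in range(n)) / n
```

With these rates the sample mean should approach $E(S) = 0.135$ as the number of samples grows, since the mixture has the same expectation as the deterministic case in a).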

[Figure 2: Empirical cumulative distribution function (CDF) for service times in the hybrid cloud. Horizontal axis: service time (0.1 to 0.5); vertical axis: empirical CDF (0.2 to 1.0).]

1. Not symmetric: the median differs from the mean value, and the curve has no S-shape.
2. $X$ is an $\alpha$-quantile of $f(x)$ if $P(x \le X) = \alpha$.
3. The $\alpha = 0.9$ quantile in the given CDF: answers in the range 0.30-0.31 are acceptable (the exact value from the data set, not available to the students, is 0.303628).

Now, assume that in the private cloud each server can handle only one request for service A at a time. If all three servers are busy, the service request is rejected. The requests for service A are generated from an infinite population according to a Poisson process with intensity $\lambda_A = 9$. The GreenCloud company has trouble with rejected requests and would like to consider the following two options:

1. Buy an additional server for their own private cloud. The operational cost is $c_A = 1$ per server (ignore the capital costs in this case).
2. Buy extra capacity from public cloud providers. The cost is $c_{A1} = 5$ and $c_{A2} = 2$ for public cloud 1 and 2, respectively. You pay only when the computer is in use.

First consider the initial setup with only three servers in a private cloud.

d) Make a Markov model of this private cloud and determine the average number of servers in use and the probability of rejected service requests.

[State diagram: states 0, 1, 2, 3 (number of busy servers); arrival rate $\lambda_A$ from state $i$ to $i+1$; service rates $\mu_A$, $2\mu_A$, $3\mu_A$ out of states 1, 2, 3.]

Steady state equations:

$\lambda_A p_0 = \mu_A p_1$
$\lambda_A p_1 = 2\mu_A p_2$
$\lambda_A p_2 = 3\mu_A p_3$
$p_0 + p_1 + p_2 + p_3 = 1$

$p_i = \frac{A^i/i!}{\sum_{\nu=0}^{3} A^\nu/\nu!}$, where $A = \lambda_A/\mu_A$ is the offered traffic.

For numerical values use $\lambda_A = 9$ (this information was missing in the Norwegian edition) and $\mu_A = 10$ (from Table 1), so that $A = 0.9$:

$p = \{p_0, p_1, p_2, p_3\} = \{0.412, 0.371, 0.167, 0.050\}$

Expected number of servers in use: $E[X] = \sum_{i=0}^{3} i\,p_i = 0.855$

Probability of rejection: $\lambda_A p_3 / \sum_{\nu=0}^{3} \lambda_A p_\nu = p_3$ (This can be recognized as Erlang's loss model, and hence the call congestion and the time congestion are the same.)

Now, extend the private cloud with one server.

e) Extend the Markov model from the previous task with one extra server. Determine the probability of rejected service requests with four servers instead of three. Show how you can apply the recursive Erlang's B-formula to obtain this.

[State diagram: states 0, 1, 2, 3, 4 (number of busy servers); arrival rate $\lambda_A$ from state $i$ to $i+1$; service rates $\mu_A$, $2\mu_A$, $3\mu_A$, $4\mu_A$ out of states 1, 2, 3, 4.]

Request rejection (use the recursive Erlang's B-formula): $E_3(A) = p_3$ from the previous point. With $n = 4$:

$E_4(A) = \frac{A\,E_3(A)}{n + A\,E_3(A)} = \frac{0.9 \cdot 0.050}{4 + 0.9 \cdot 0.050} = 0.011$

The alternative is to extend the capacity of the private cloud configuration with three servers by buying capacity from public cloud providers. Through a service level agreement (SLA), GreenCloud is allowed to run one single process at a time on each of the two public clouds. If all three servers in the private cloud are busy, the service request is routed to the public cloud, and if any of the private cloud servers becomes idle, the process running on the public cloud is immediately moved to the idle server in the private cloud.

f) Extend the Markov model from d) with two public cloud servers. Define explicitly the state variable/vector in your model. Assume that public cloud 1 is selected if both public clouds are available and that processes are NOT moved from one public cloud to another. Given that the steady state probabilities are known (you should not obtain these), how can you determine the server utilization in the three different clouds expressed by the steady state probabilities?
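The numerical results in d) and e) can be reproduced with a short Python sketch. The function names are hypothetical and chosen for illustration; the recursion starts from $E_0(A) = 1$, which is the standard starting point of the recursive Erlang's B-formula.

```python
from math import factorial


def state_probabilities(a, n):
    """Steady state probabilities p_0..p_n of a loss system with offered traffic a."""
    weights = [a**i / factorial(i) for i in range(n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def erlang_b(a, n):
    """Blocking probability E_n(a) via E_k = a E_{k-1} / (k + a E_{k-1})."""
    e = 1.0  # E_0(a) = 1: with no servers every request is rejected
    for k in range(1, n + 1):
        e = a * e / (k + a * e)
    return e


A = 9 / 10  # offered traffic lambda_A / mu_A
p = state_probabilities(A, 3)
mean_busy = sum(i * p_i for i, p_i in enumerate(p))

print([round(x, 3) for x in p])
print(round(mean_busy, 3))
print(round(erlang_b(A, 3), 3), round(erlang_b(A, 4), 3))
```

The direct formula and the recursion give the same value for three servers, which is a useful consistency check before applying the recursion once more for the fourth server.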
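[State diagram for f): states (0,0,0), (1,0,0), (2,0,0), (3,0,0), (3,1,0), (3,0,1) and (3,1,1); arrivals with rate $\lambda_A$; departure rates $\mu_A$, $2\mu_A$, $3\mu_A$, $3\mu_A + \mu_{A1}$, $3\mu_A + \mu_{A2}$ and $\mu_{A1}$ as labelled in the original figure.]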

State: $(i, j, k)$, where $i = 0, \ldots, 3$ is the number of occupied servers in the private cloud, and $j = 0, 1$ and $k = 0, 1$ are the numbers of occupied servers in public cloud 1 and 2, respectively. If the $p_{ijk}$ are known, the utilization in the private cloud is $1 - p_{000}$, while the utilization in public cloud 1 is $p_{310} + p_{311}$ and in public cloud 2 it is $p_{301} + p_{311}$.

The GreenCloud company pays only for the time the server in the public cloud is busy. Table 2 gives the steady state probabilities $p_{i_1 i_2}$ that public cloud $k$ is in state $i_k = 0, 1$ (the number of busy servers in public cloud $k = 1, 2$). The probabilities $p'_{i_1 i_2}$ are for the case where cloud 2 is selected over cloud 1 when both public clouds are available.

Table 2: Steady state probabilities of public clouds

$i_1$   $i_2$   $p_{i_1 i_2}$   $p'_{i_1 i_2}$
0       0       0.98422         0.98284
0       1       0.00015         0.01342
1       0       0.01254         0.00038
1       1       0.00309         0.00336

g) What are the operational costs for the configurations in e) and f)? Compare the two alternative configurations for the use of public clouds, i.e., selecting public cloud 1 over cloud 2 and vice versa. Discuss whether, and why, you prefer one of the configurations over the other.

Cost of the four-server configuration: $4 c_A = 4$ (you pay for your own computers as if they were 100% in use).

Cost of public 1 over 2: $3 c_A + c_{A1}(p_{1,0} + p_{1,1}) + c_{A2}(p_{0,1} + p_{1,1}) = 3 + 0.085 = 3.085$

Cost of public 2 over 1: $3 c_A + c_{A1}(p'_{1,0} + p'_{1,1}) + c_{A2}(p'_{0,1} + p'_{1,1}) = 3 + 0.052 = 3.052$

The latter configuration is slightly cheaper, but comes with an 8.7% increase in rejection probability ($p'_{311}$ versus $p_{311}$). With the given load profile, renting from a public cloud is better than adding a new server in the private cloud.

In order to improve the robustness, the GreenCloud company would like to have an agreement with at least two providers. At least one server must be working to provide service A. Servers in the private and public clouds may fail according to a Poisson process with intensity $\lambda_S$.
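Similarly, the networks fail according to a Poisson process with intensity $\lambda_N$.

h) Establish a reliability block diagram to determine the reliability function of service A. What is the MTFF in the private cloud?

[Reliability block diagram: private servers P1, P2, P3 in parallel (block P); networks N1, N2 in parallel (block N) in series with public clouds PC1, PC2 in parallel (block PC), together forming block NPC; blocks P and NPC in parallel.]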
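Returning briefly to the cost comparison in g): the figures can be re-derived from Table 2 with a few lines of Python. This is only a sketch; the dictionary and function names are hypothetical, and the probabilities are keyed by $(i_1, i_2)$.

```python
# Table 2 probabilities keyed by (i1, i2): busy servers in public cloud 1 and 2.
P_CLOUD1_FIRST = {(0, 0): 0.98422, (0, 1): 0.00015, (1, 0): 0.01254, (1, 1): 0.00309}
P_CLOUD2_FIRST = {(0, 0): 0.98284, (0, 1): 0.01342, (1, 0): 0.00038, (1, 1): 0.00336}

# Operational costs per server: private cloud, public cloud 1, public cloud 2.
C_A, C_A1, C_A2 = 1.0, 5.0, 2.0


def hybrid_cost(p):
    """Cost of three private servers plus pay-per-use public capacity."""
    busy1 = p[(1, 0)] + p[(1, 1)]  # probability that public cloud 1 is busy
    busy2 = p[(0, 1)] + p[(1, 1)]  # probability that public cloud 2 is busy
    return 3 * C_A + C_A1 * busy1 + C_A2 * busy2


cost_1_over_2 = hybrid_cost(P_CLOUD1_FIRST)
cost_2_over_1 = hybrid_cost(P_CLOUD2_FIRST)
# Relative increase in rejection probability when cloud 2 is preferred.
rejection_increase = P_CLOUD2_FIRST[(1, 1)] / P_CLOUD1_FIRST[(1, 1)] - 1

print(round(cost_1_over_2, 3), round(cost_2_over_1, 3), round(rejection_increase, 3))
```

Both hybrid costs stay below the cost of 4 for the four-server private configuration.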

Reliability function:

$R(t) = 1 - (1 - R_P(t))(1 - R_{NPC}(t))$

where

$R_P(t) = 1 - (1 - R_{P1}(t))(1 - R_{P2}(t))(1 - R_{P3}(t))$

and

$R_{NPC}(t) = R_N(t)\,R_{PC}(t)$

where

$R_N(t) = 1 - (1 - R_{N1}(t))(1 - R_{N2}(t))$ and $R_{PC}(t) = 1 - (1 - R_{PC1}(t))(1 - R_{PC2}(t))$.

MTFF in a 3-node parallel structure (from the formula sheet, Eq. (48)):

$MTFF_{parallel} = \frac{1}{\lambda_S} \sum_{i=1}^{3} \frac{1}{i} = (1 + 1/2 + 1/3)/\lambda_S = 11/(6\lambda_S)$

The servers in the private cloud are repaired by a single, shared repairman. The repair time is negative exponentially distributed with intensity $\mu_S$.

i) Determine symbolically the steady state availability of the private cloud. How will it affect your approach and solution if you assume multiple, independent repairmen instead of a single, shared repairman?

[Figure: (a) Single, shared repairman: Markov model with states 0, 1, 2, 3 (number of failed servers), failure rates $3\lambda_S$, $2\lambda_S$, $\lambda_S$ and repair rate $\mu_S$. (b) Multiple, independent repairmen: parallel block diagram of P1, P2, P3, each with availability $A_S = \frac{\mu_S}{\mu_S + \lambda_S}$.]

Single, shared repairman: use a Markov model, because the repairs of the servers are not independent of each other:

$A^{(SR)} = 1 - p_3 = 1 - \frac{6(\lambda_S/\mu_S)^3}{1 + 3(\lambda_S/\mu_S) + 6(\lambda_S/\mu_S)^2 + 6(\lambda_S/\mu_S)^3}$

Multiple, independent repairmen: use a block diagram, since both failures and repairs are independent of each other:

$A^{(MR)} = 1 - (1 - A_S)^3 = 1 - \left(1 - \frac{\mu_S}{\lambda_S + \mu_S}\right)^3 = 1 - \frac{\lambda_S^3}{(\mu_S + \lambda_S)^3}$

Finally, GreenCloud would like to add another service B. Due to privacy, this service can only be executed on the private cloud. Service B will have non-preemptive priority over service A. Service A can still be executed on both the private and public clouds. The objective is to calculate the cost of providing services A and B and the corresponding rejection probabilities.

j) Make a simulation model of this system.

1. What is the system state and what are the corresponding events in your model?
2. Which system components are entities and resources in your model?
3. Describe the dynamics and interaction in the simulation model by activity diagrams.

4. How do you collect statistics to estimate the performance attributes in your objectives?

System state: the number of servers in A, A1 and A2 that are occupied, and the number of servers in A, A1 and A2 that are operational.
NOTE: It is acceptable to assume that servers are always operational; however, it is important to be consistent with your assumptions throughout the whole task.

Events: arrival of a request, service completion, server failure and repair.
NOTE: Rejections are by definition not events because they do not change the system state.

Entities: service A, service B, generator A, generator B, failure.

Resources: server A (3), A1 (1), A2 (1).

Statistics: observe cluster utilization from the RES report, and count the number of requests and the number of lost requests, as indicated in the activity diagram below.

Minimum: assume no failures, see the figure below.
NOTE: Since requests are rejected if all resources are occupied, there will be no queuing and, hence, non-preemptive priority cannot be taken into account.

To consider failures there are (at least) two options:

1. Separate failure process that takes a resource in a non-preemptive way: the failure will be queued until the resource becomes available. This is acceptable when the service times are short relative to the repair times.
2. Add interrupts: send an interrupt to the active service process (in serv A or serv B) that holds a server that has failed (complicated).

The figure below describes the two options above. You need one per server pool (A, A1, A2).
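The statistics collection in j), counting requests and lost requests, can be illustrated with a small event-driven sketch. It is written in plain Python rather than DEMOS, and it deliberately covers only the minimum case under strong simplifying assumptions: service A only, the three private servers only, no public clouds, no service B and no failures. The function name and parameters are hypothetical.

```python
import heapq
import random


def simulate_loss_system(lam, mu, servers, arrivals, seed=1):
    """Estimate the rejection probability by counting requests and lost requests."""
    rng = random.Random(seed)
    now = 0.0
    departures = []  # completion times of the requests currently in service
    lost = 0
    for _ in range(arrivals):
        now += rng.expovariate(lam)  # advance the clock to the next arrival
        while departures and departures[0] <= now:
            heapq.heappop(departures)  # completed services free their servers
        if len(departures) < servers:
            # A server is idle: seize it and schedule the service completion.
            heapq.heappush(departures, now + rng.expovariate(mu))
        else:
            lost += 1  # all servers busy: the request is rejected
    return lost / arrivals


estimate = simulate_loss_system(lam=9.0, mu=10.0, servers=3, arrivals=200000)
print(round(estimate, 3))
```

Under these assumptions the estimate should lie close to the analytical rejection probability $p_3 = 0.050$ from d), which gives a simple sanity check before the model is extended with public clouds, service B or failures.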