Is Five 9's Availability Required for Internet Services? Stephen Smaldone


smaldone@cs.rutgers.edu

Abstract

Recently, there has been substantial research in the area of reliability and availability for Internet services. Typically, availability is measured according to the "number of 9's" metric, which expresses the percentage of time a system is available over the course of a year. This paper explores the necessity of high availability for Internet services given the relative availability of end-user systems and attempts to answer the question: Is five 9's availability required for Internet services, given that end-user systems exhibit no better than three 9's availability? The position that this paper ultimately arrives at is that four 9's is enough to ensure minimal degradation of perceived end-user availability; therefore, five 9's is not required. Simulation results are presented and discussed to support this claim.

1. Introduction

Availability in computer services has been an important research area for many years. As we have become more and more reliant on computer services in industrial and personal endeavors, the importance of availability has grown. Today, the realization of this concept has a direct impact on our lives. At work, we rely upon computer systems being highly available in order to complete our daily work-related tasks. At home, we trust the services we use, provided by our banks, stores, etc., to be available whenever we desire to access them. It is clear that our reliance upon computer services will not diminish, but will continue to grow as time progresses.

Internet services are a particularly interesting subset of computer services, in that they rely upon three distinct and independent components to deliver their services to the end user. The first component, the servers, is typically controlled and managed by the service provider.
For example, a provider such as Amazon [1] or Google [2] has direct access to and control over the systems upon which it provides its services. The second component, the inter-network, is typically not controlled by any one service provider. Control is distributed among various parties and may differ substantially among users. Finally, the third component is the client system. This component is usually under the direct control of the end user. There are notable cases, such as Internet cafés [3], where this is not so, but for the sake of this discussion they are ignored.

Given this three-component model of Internet services, we can define a user's perceived availability for a service to be dependent upon the availability of these three components. Therefore, any given user's perception of the service's availability should be dominated by the lowest-availability component among the three.

Service availability is typically measured as the amount or percentage of downtime per year. This technique leads directly to the "number of 9's" representation for availability metrics. For example, a service that is said to be five 9's available will be available for 99.999% of the time in a year. Figure 1 shows the approximate amount of downtime allowed for a service to achieve a certain level of 9's availability.

Level of Availability    Downtime per year
Two 9's (99%)            315360 s (or 87.6 h)
Three 9's (99.9%)        31536 s (or 8.76 h)
Four 9's (99.99%)        3153.6 s (or 52.56 min)
Five 9's (99.999%)       315.36 s (or 5.256 min)

Figure 1: Downtime per year to maintain a certain level of availability.

It is important to note that to meet a certain level within the number of 9's metric, a service cannot exceed the amount of downtime per year for that level. For example, a service that has 1 hour of downtime per year is considered to be only three 9's available, because 1 hour exceeds the 52.56 minutes allowed for four 9's. Most typical Internet services strive for five 9's availability.
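The downtime budgets in Figure 1 follow directly from the length of a year. The short Python sketch below reproduces them under the assumption of a 365-day year (31,536,000 seconds), which matches the figures quoted in the table; the function names `max_downtime_seconds` and `nines_achieved` are illustrative and do not come from the paper.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 s, assuming a 365-day year


def max_downtime_seconds(nines: int) -> float:
    """Downtime budget per year for a given number of 9's (e.g. 5 -> 99.999%)."""
    return SECONDS_PER_YEAR / 10 ** nines


def nines_achieved(downtime_seconds: float, max_nines: int = 9) -> int:
    """Highest number of 9's whose yearly downtime budget is not exceeded."""
    level = 0
    while level < max_nines and downtime_seconds <= max_downtime_seconds(level + 1):
        level += 1
    return level


if __name__ == "__main__":
    for n in range(2, 6):
        budget = max_downtime_seconds(n)
        print(f"{n} nines: {budget:.2f} s ({budget / 60:.3f} min)")
    # One hour of downtime exceeds the four 9's budget of 3153.6 s.
    print("1 hour of downtime ->", nines_achieved(3600), "nines")
```

Classifying one hour of downtime with `nines_achieved(3600)` gives three 9's, in line with the example in the text.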
If we assume that client systems accessing the service meet only three 9's availability, then the perceived availability at the client will be on the order of three 9's. The goal of this paper is to attempt to answer the following question: Are Internet services required to provide five 9's availability, given the lower perceived availability at any given client?

The next section of the paper describes the experimental methodology used to evaluate client-perceived availability. Section 3 describes the experiments conducted. Section 4 presents the results, and a discussion of the results is provided in Section 5. The paper is concluded in Section 6.

2. Methodology

In the previous section, client-perceived availability was described as being composed of the availability of the three components involved in accessing a given service. For the purposes of this paper, we assume that the second component (inter-network) is combined with the first (service system) and third (client system) components. Rather than being ignored completely, it is considered to be subsumed by both the first and third components. In other words, we attribute a portion of the inter-networking component to the client and the remainder to the service. The assumption is then that clients operate within three 9's availability and services within five 9's in the presence of network failures. Although an argument against this assumption can be made, there is little point in evaluating either client-side or service-side availability in the presence of an inter-network that negatively affects the availability of either side. Such an examination is considered beyond the scope of this paper.

To answer the central question of this paper, I provide a series of experimental results generated through simulation of both client and service failures. The main idea is to show, through simulation, the aggregated perception of clients with regard to service availability in the presence of both client-side and service-side failures.
Failure rates are chosen for both client and service for each simulation, and the resulting failures are then plotted to demonstrate the effects of reducing service-side availability on the overall perceived availability of clients accessing the system. The following section describes the simulation method and the individual experiments performed.

3. Evaluation

This section first describes the assumptions and methodology used for the simulation. It then briefly describes each of the experiments performed.

3.1. Simulation

The following assumptions hold for each simulation run:

(i) Client-side failures occur at a uniformly random rate. The number of failures that occur per year is determined by the level of availability for each client. In all cases, clients experience the maximum number of failures allowed to still achieve three 9's availability.

(ii) Service-side failures occur at a uniformly random rate. The number of failures that occur per year is determined by the level of availability for the service. In all cases, the service experiences the maximum number of failures allowed to still achieve the specified level of 9's availability.

(iii) Each simulation approximates 1 year of time. Rather than performing a continuous simulation over time, the year is subdivided into fixed-size time slots. For any given time slot, client-side and service-side failures can occur. If a failure occurs in a time slot, it is assumed to last for the entire time slot. The size of a time slot is chosen to be the maximum allowable downtime, in seconds, for a service to maintain five 9's availability (315.36 s). Each time slot is therefore approximately 5 minutes in duration, and a year is subdivided into 100,000 discrete time slots.

(iv) Finally, it is assumed that any given client (or service) can fail only once within a given time slot and that the failure lasts for the complete time slot.

Given these assumptions, the simulation methodology is as follows.
For each experiment, a single year is simulated. Client-side failures are plotted as bars in the graph. In experiments that include multiple clients, failures of two or more clients during the same time slot are aggregated in that time slot. A service-side failure is represented as an aggregated failure in the graph. It is equivalent to all clients experiencing a failure within exactly the same time slot, so its magnitude is weighted by the number of clients in the experiment. Should a given client experience a client-side failure that coincides with a service-side failure, the magnitude of the service-side failure is reduced by one. This represents the case in which the client-side failure masks a service-side failure.

3.2. Experiments

The results provided in Section 4 are generated by six separate experiments. In each experiment, either the number of clients or the service-side failure rate is varied.

(i) Varying the number of clients illustrates the multiplicative effect that the number of clients has on service-side failures. Furthermore, a direct comparison can be drawn between the average magnitudes of aggregate client-side failures and the magnitudes of service-side failures.

(ii) Varying the service-side failure rate illustrates the effect on client perception as more frequent service-side failures are introduced. It also attempts to demonstrate to what extent client-side failures mask service-side failures.

4. Results

This section presents and discusses the results of the simulations.

4.1. Single Client Failure Rate

Figure 2 shows the results of simulating one client with a three 9's failure rate over the course of one year. The graph shows that the client's failures are spread uniformly over the year-long period of the simulation. Also, the number of failures is consistent with the client's level of availability. This graph is presented to validate the simulation of a single client under the previously stated assumptions.

[Figure 2]

4.2. 10 Client Aggregated Failures

Figure 3 shows the results of simulating 10 clients, each with a three 9's failure rate, over the course of one year. Again, the graph shows that the clients' aggregated failures are spread uniformly over the year-long period of the simulation. In addition, these results show the level of aggregation that occurs among clients as their individual failures coincide within the time slots.

[Figure 3]

4.3. 100,000 Client Aggregated Failures

Figure 4 shows the results of simulating 100,000 clients, each with a three 9's failure rate, over the course of one year. As in the previous two graphs, these results show that the clients' aggregated failures are spread uniformly over the year-long period of the simulation. These results also show the level of aggregation that occurs among clients as their individual failures coincide within the time slots for a larger set of clients.

[Figure 4]

4.4. Five 9's Service Effects

The following two sets of results present the aggregation of client-side perceived availability in the presence of a service that exhibits five 9's availability. The addition of service-side failures affects client perception of availability by introducing one additional (time-slot-sized) failure during the simulation period. The magnitude of this effect is increased by the number of clients in the simulation and decreased by the number of clients that experience client-side failures during the time slot in which the service-side failure occurs.
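The simulation procedure of Section 3.1, together with the masking rule just described, can be sketched in a few lines of Python. This is a minimal illustration written from the stated assumptions, not the author's original simulator: the names `failed_slots_allowed` and `simulate` are hypothetical, and failed slots are drawn uniformly at random without replacement so that each client (or the service) fails at most once per slot.

```python
import random
from collections import Counter

NUM_SLOTS = 100_000  # one year split into slots of about 315.36 s each


def failed_slots_allowed(nines: int) -> int:
    """Maximum number of failed slots per year at a given number of 9's."""
    return NUM_SLOTS // 10 ** nines


def simulate(num_clients: int, service_nines: int, client_nines: int = 3, seed: int = 0):
    """Simulate one year; return (client failures per slot, service failure magnitudes)."""
    rng = random.Random(seed)
    client_failures = Counter()  # slot index -> number of clients failing in that slot
    per_client = failed_slots_allowed(client_nines)
    for _ in range(num_clients):
        # Each client fails in distinct, uniformly chosen slots.
        client_failures.update(rng.sample(range(NUM_SLOTS), per_client))

    service_slots = rng.sample(range(NUM_SLOTS), failed_slots_allowed(service_nines))
    # A service failure affects every client, minus those already down (masking).
    service_magnitude = {
        slot: num_clients - client_failures[slot] for slot in service_slots
    }
    return client_failures, service_magnitude


if __name__ == "__main__":
    clients, service = simulate(num_clients=10, service_nines=5)
    print("client failure events:", sum(clients.values()))
    print("largest client-side aggregate:", max(clients.values()))
    print("service-side failure magnitudes:", sorted(service.values()))
```

With 10 clients the run is instantaneous; a run with 100,000 clients draws ten million failure slots and takes noticeably longer in pure Python.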

4.4.1. 10 Clients

There are two interesting points to note regarding Figure 5. First, there are relatively few overlapping client failures, so the majority of failures are of magnitude 1 (representing a single failing client). Second, even though there is only one service-side failure, the figure clearly shows the difference in magnitude (10 clients affected).

[Figure 5]

4.4.2. 100,000 Clients

Similar to Figure 5 (above), Figure 6 shows the relative difference between the magnitudes of service-side failures when compared to the average aggregate client-side failure magnitudes. For any given time slot, the average number of failing clients is approximately 100, but for the service-side failure, the magnitude is 100,000 (since all clients are affected). (NOTE: the graph's y-axis range only goes to 5000 so that the client-side failure data are visible in the graph.)

[Figure 6]
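The figure of roughly 100 failing clients per time slot follows from the simulation assumptions. As a worked check, let $N$ be the number of clients, $f_c$ the number of failed slots per client per year at three 9's, and $S$ the number of slots per year (these symbols are introduced here for illustration and do not appear in the original paper):

```latex
\[
f_c = \frac{31536\ \text{s}}{315.36\ \text{s}} = 100,
\qquad
\mathbb{E}[\text{failing clients per slot}]
  = \frac{N \cdot f_c}{S}
  = \frac{100{,}000 \times 100}{100{,}000}
  = 100 .
\]
```

By contrast, a service-side failure in a slot where $k$ clients are already down has magnitude $N - k$, which is on the order of 100,000 here.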

4.5. Four 9's Service Effects

The next set of results presents the aggregation of client-side perceived availability in the presence of a service that exhibits four 9's availability. The addition of service-side failures affects client perception of availability by introducing 10 (time-slot-sized) failures during the simulation period, 9 more than in the five 9's case. (Note: only the 10 client experiment is shown for four 9's service effects, as the 100,000 client data does not provide any further insight.)

4.5.1. 10 Clients

Figure 7 presents results similar to those of the five 9's case with regard to the magnitudes of client-side failures when compared to service-side failures. As the results show, there are a few more service-side failures (10 in total).

[Figure 7]

4.6. Three 9's Service Effects

The next set of results presents the aggregation of client-side perceived availability in the presence of a service that exhibits three 9's availability. The addition of service-side failures affects client perception of availability by introducing 100 (time-slot-sized) failures during the simulation period, 99 more than in the five 9's case. (Note: only the 10 client experiment is shown for three 9's service effects, as the 100,000 client data does not provide any further insight.)

4.6.1. 10 Clients

Figure 8 presents quite a different picture when compared to the four 9's and five 9's cases. At a service-side availability of three 9's, there are a substantial number of failures over the course of the year. Furthermore, considering the difference in magnitudes (especially with regard to 100,000 clients or more), the impact of reducing the service-side availability will most certainly be felt by the clients.

[Figure 8]

Consider the single client case. In the presence of non-overlapping failures, the client would perceive a doubling of its failure rate, since the service fails as often as the client itself.

5. Argument and Counter

This section presents the main position of this paper and describes potential counter positions.

5.1. Argument

It is evident, through the examination of Figures 5, 7, and 8, that the gap between five 9's and four 9's differs substantially from the gap between four 9's and three 9's. When moving from five 9's to four 9's, a client will encounter up to 9 additional failures; this is the worst case, and it represents a 9% degradation in perceived service. It is likely, though, that some of these failures will be masked, thereby reducing the level of degradation. Moving from five 9's to three 9's represents a doubling in the number of failures, in the worst case, from the client's perspective. In the average case, a substantial number of these service-side failures should be masked.
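The worst-case figures in this argument can be reproduced with a short calculation. The Python sketch below assumes that no service-side failure is masked and takes "degradation" to mean the relative increase in failed time slots seen by a single three 9's client compared with the five 9's baseline; that reading of the 9% figure, and the function names, are assumptions made here for illustration.

```python
NUM_SLOTS = 100_000  # time slots per simulated year


def failed_slots_allowed(nines: int) -> int:
    """Maximum number of failed slots per year at a given number of 9's."""
    return NUM_SLOTS // 10 ** nines


def worst_case_failures(service_nines: int, client_nines: int = 3) -> int:
    """Failed slots perceived by one client when no failures overlap."""
    return failed_slots_allowed(client_nines) + failed_slots_allowed(service_nines)


if __name__ == "__main__":
    baseline = worst_case_failures(5)  # client failures plus one service failure
    for nines in (4, 3):
        total = worst_case_failures(nines)
        extra = total - baseline
        print(
            f"service at {nines} nines: {total} failed slots, "
            f"{extra} more than baseline ({100 * extra / baseline:.1f}% increase)"
        )
```

Under these assumptions the four 9's case adds 9 failed slots to a baseline of 101, and the three 9's case adds 99, which is close to a doubling.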
It is unlikely, given a uniformly random distribution of failures, that more than half will be masked in this fashion on average. Therefore, we would expect the average degradation of perception for any given client to be 50% for a service that is running at three 9's availability. Based upon this analysis, it is the position of this paper that four 9's is adequate for Internet services. Five 9's is not required.

One thing to note is that this paper takes a pessimistic view of client access times. The assumption is that requests are made from each client during each time slot. The results of this paper are likely to be affected by which model of client requests is chosen. For a uniformly random client request model, we would expect the average case results to be similar to the findings of this paper. This may not be the case if some other model of client request times were chosen (e.g., Poisson, trace-driven, etc.). It is likely, though, that provided the failures remain uniformly distributed, the magnitude of degradation would remain proportional to the findings of this paper.

5.2. Counter Argument

This paper also illustrates the typical reason given in support of five 9's availability for Internet services. For example, Figure 6 illustrates the difference between the typical client-side failure (magnitude of 1) and the typical service-side failure (magnitude of 100,000). For any service that generates revenue, this magnitude is directly proportional to the potentially lost revenue in the event of a particular type of failure. As such, it is in the best interests of the service to minimize the number and duration of service-side failures. Therefore, five 9's is required.

Although this is a compelling argument, there are two things to consider. First, there is a cost element associated with maintenance of a service. By reducing availability requirements to four 9's, the reduction in maintenance cost will have a mitigating effect against the lost revenue, although this may not completely offset the losses. Second, the position of this paper is with respect to client-side perception. Although the findings are counter to the argument presented in this section, the two are not mutually exclusive.
In some cases, where potential risks of lost revenue are substantially larger than the additional maintenance costs, common sense would dictate an approach to service-side maintenance that minimizes failures regardless of client-side perception. In cases where client perception is the most important factor, a relaxation of the availability requirements to four 9's is acceptable.

6. Conclusion

This paper explored, through the use of simple simulation, the necessity of five 9's availability for Internet services, given three 9's availability in end-user systems. Based upon the results presented in Section 4 of this paper, five 9's availability is not required for services, and four 9's is adequate to minimize degradation of client perception. Although, as described in Section 5, there are potentially substantial effects of service-side failures with respect to lost revenues, a portion of these losses would be mitigated by the reduced cost of maintaining four 9's availability rather than five 9's.

Finally, there is likely no one answer that is best for all services. Some particularly important services may require five 9's availability in order to protect against various potentially harmful effects other than client perception, e.g., bad press, marketing pressures, and competition in the market. Although this may be true for some services, this paper chose to ignore these special cases and considers only the average Internet service.

References

[1] Amazon.com Online Shopping Portal. www.amazon.com
[2] Google Search Engine. www.google.com
[3] Internet Café, from Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/internet_cafe