Designing Fault-Tolerant Power Infrastructure: System Analysis and New Standards Enable Maximization of Operating Dependability for Data Centers & Other Mission Critical Facilities

J.F. Christin (Author)
Enterprise & Systems Solutions
APC-MGE Critical Power & Cooling Services
Montbonnot, France
jean-francois.christin@mgeups.com

Jordi Sallent and Angel Perez (Authors)
Business Development Manager and Technical Director
APC-MGE Power & Cooling Services
Barcelona, Spain and Madrid, Spain
jordi.sallent@mgeups.com / angel.perez@mgeups.com

Abstract: In today's mission critical computing environments, achieving a fault-tolerant infrastructure is essential to smooth operation. Unfortunately, many users don't completely grasp the full meaning of fault tolerance as a comprehensive solution with a defined architecture to maximize operating dependability. All too often, it is viewed as a collection of fault-mitigating products that can end up being more of a Band-Aid than a truly effective strategy for process improvement.

Keywords: fault tolerance, dependability, reliability, data center infrastructure.

I. INTRODUCTION
Although many view it as an art, operating dependability is actually a generic term relating to the science of failure analysis and typically encompasses four key areas:
- reliability
- availability
- maintainability
- safety
The study of operating dependability aims to model system behavior in order to define a system's probable capacity to fulfill its function at a given moment in time. As a "science" based on probability and statistical calculations, it allows users to apply established data to device, unit and system populations in accordance with the Law of Large Numbers, the fundamental statistical result that the average of a random sample from a large population will likely be close to the average of the whole population.
Understanding operating dependability provides many advantages to those responsible for critical systems, including:
- the ability to forecast the time over which a system will function correctly
- selection of the optimum architecture by comparing the reliability or availability of systems sharing the same function
- contributions to improved product quality
- improved productivity resulting from maximized uptime
Using probability calculations involves predicting that an event will happen, assuming that the chance of the event happening can be quantified; this can only be verified by carrying out a large number of tests. A statistical approach consists of measuring the periodicity or recurrence of events on a sample population of devices. The statistical view obtained through in-the-field measurements accounts for the number of samples in order to determine confidence intervals for these measurements. Each measurement will always have a predicted value and a measured value.
Fortunately, the recent release of the ANSI/TIA-942 Standard developed by the Telecommunications Industry Association has helped greatly in providing guidelines to achieve the various levels of fault tolerance. While it is still useful to have a grasp of the underlying factors that impact
operating dependability, this standard can greatly simplify planning and implementation. This paper has been developed to assist users in achieving optimized operating dependability by removing the mystery surrounding its concepts and techniques.

II. FOUNDATIONS OF AVAILABILITY

A. What is Availability?
Availability is closely related to maintainability, the reliability of the equipment, and overall safety. It requires a thorough understanding of the electrical architecture along with a dependability study. Too often, users are satisfied with simply installing an uninterruptible power supply (UPS) recommended by a contractor, thinking that they have addressed their availability requirements. While a UPS is an essential ingredient of any uptime strategy, it's only one component in the overall architecture.

B. Mission of the UPS
Just as a roof is vital to the structural integrity of your home, a UPS is required to assure the integrity of a data or telecommunications network. Not only does it address a number of power quality needs, it supplies sufficient time to assure the start of a motor generator in the event of an extended power outage and can provide a short period of redundancy with the genset for more common short-term power anomalies.

III. DEPENDABILITY STUDIES

A. Why study dependability?
Studying dependability in a systematic manner serves as a decision-making tool that:
- provides the ability to assess needs and identify risks
- allows comparison of functions and values of various architectures
- enables justification of decisions based on rational, proven data rather than advertising hype
- facilitates optimized design costs by rating equipment and architecture
Once systems are in place, or for existing installations, evaluation of dependability is typically used to optimize infrastructure in order to:
- decrease the number and duration of failures
- better target maintenance requirements
- enhance overall system efficiency
- anticipate required backup resources such as spare parts [1]

B. What is dependability?
According to the International Electrotechnical Commission (IEC), operating dependability is made up of four key components [2]:
Reliability: the probability that an entity is able to accomplish a required function in given conditions and during a given time interval [t₁, t₂]. It is written as R(t₁, t₂). In general t₁ = 0 and the reliability is written as R(t). The concept of failure rate completes the notion of reliability.
Failure rate: to simplify, it is the probability that an entity, in given conditions and at a given instant in time, is subject to a failure in the next time interval. It is expressed as an hourly rate, proportional to the inverse of time. It is written as λ(t) and can be defined as:

λ(t) = (number of failures per unit of time) / (number of devices operating at time t)

For example, for 1000 installed units with 10 failures observed in a period of 1 month (1 month = 720 h), the failure rate is 10 / (1000 × 720 h) = 1.39 × 10⁻⁵ h⁻¹. For a constant failure rate, reliability is written as:

R(t) = e^(−λt)

The failure rate of a system is not constant and changes over time. For all components, the failure rate at the start of service life is high but quickly decreases. This is the period in which early failure problems of components appear; it can be shortened by accelerated aging of components, referred to as "burn-in". For an electronic component, the failure rate then remains somewhat constant during its useful service life, after which the failure rate increases again. For a mechanical device, after early failure problems, the failure rate increases as wear and tear takes its toll (i.e., Weibull's law).
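To make the failure-rate arithmetic above concrete, here is a minimal Python sketch of the 1000-unit example; the one-year survival horizon is an assumed illustration, not a figure from the paper:

```python
import math

# Field data from the example above: 1000 installed units,
# 10 failures observed over one month (720 h).
units = 1000
failures = 10
hours = 720.0

# Failure rate: failures per unit of time, per operating device.
lam = failures / (units * hours)           # ~1.39e-5 per hour
print(f"lambda = {lam:.3e} /h")

# Constant-failure-rate reliability R(t) = exp(-lambda * t),
# e.g. the probability a unit survives one year of continuous operation.
t = 8760.0                                 # hours in a year (assumed horizon)
print(f"R(1 year) = {math.exp(-lam * t):.4f}")   # ~0.885
```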
Maintainability: the probability that a given active maintenance operation can be accomplished within a time interval [t₁, t₂]. It is written as M(t₁, t₂). In general t₁ = 0 and so maintainability is written as M(t). The concept of repair rate completes the notion of maintainability.
Safety: the probability of avoiding a catastrophic event. It is linked to the notion of risk and is application dependent.
Availability: the probability that an entity is in a state to accomplish a required function, in given conditions
and at a given instant in time t. It is written as D(t). As this probability is always very close to 1, it is the notion of unavailability I(t) = 1 − D(t) that is used in practice. This notion is not the same as reliability, which concerns a time period rather than a particular instant in time.

C. Methodologies and tools
1) Databases: Most major components, or component types, are catalogued in databases (the best known being Military Handbook 217) that maintain records of specifications, product life, failure rates and other useful information. These can be used for comparison and evaluation of different products. Calculations using this type of data allow assumed constant failure rates of components to be derived according to their conditions of use. Such evaluations can be combined with supplier data, if reliable, and weighted by experience with the devices in question.
2) Analysis methods: The PHA method (Preliminary Hazard Analysis) aims to identify the potential hazards of an industrial installation and to evaluate the seriousness of their consequences. It is especially of interest at the start of a project. It presupposes the failure of each element (without taking failure modes into account) and translates the consequences for the system. The FMECA method (Failure Modes, Effects and Criticality Analysis) reveals the influence that component failures have on the system. Although exhaustive, it must be complemented in order to combine effects and evaluate reliability; this is the subject of the methods detailed below.
3) Reliability diagrams: Reliability diagrams enable assessment of systems using both formulas and visual analysis of components. [3]
Series systems: Two elements A and B are said to be in series if both have to operate for the system to function. As both have to function, reliability is written as:

R(t) = R_A(t) · R_B(t)

and availability:

D(t) = D_A(t) · D_B(t)

Parallel systems: Two elements are said to be in parallel if at least one has to operate for the system to function. Reliability is then written as:

R(t) = R_A(t) + R_B(t) − R_A(t) · R_B(t)

and availability becomes:

D(t) = D_A(t) + D_B(t) − D_A(t) · D_B(t)

Complex systems: These systems can be depicted for the general case of N elements in series or in parallel, but the general formulae are cumbersome and difficult to handle. Some systems cannot be immediately transformed into a series/parallel model; complex systems should therefore be simplified for analysis by breaking elements into simple (series/parallel) subsystems whenever possible. Such systems can be described as a series/parallel model using minimum cut logic.
4) Fault trees: A tree is derived from the system analysis and selection of the most likely undesirable events, such as a break in UPS output voltage. This method identifies the minimum combinations of failures that lead to the event and enables creation of a system reliability diagram. A quantitative analysis can then be carried out according to the probabilities of the basic events occurring, which in turn allows calculation of the probability of the undesirable event. In the first instance, only non-spreadable subassembly failures are taken into consideration, i.e. failures that only affect the function concerned and do not prevent the others from functioning. Spreadable subassembly failures, essentially short-circuits, are then taken into consideration. When this type of fault occurs, it prevents the other functions from operating; hence, a rectifier short-circuit prevents operation via the battery. All inverter faults lead to the loss of output voltage, even if the fault does not spread; this is the reason it appears simply as "inverter fault" in the tree.

Figure 1. Example of a fault tree applied to a single UPS without a standby supply (loss of output voltage results from an inverter fault, OR from loss of the DC voltage through the combination of battery and rectifier/utility-input faults).
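As a sanity check on the series/parallel formulas in the reliability diagrams above, a minimal Python sketch; the availability figures for the two elements are assumed purely for illustration:

```python
def series(*probs):
    """Series elements: all must work, so probabilities multiply."""
    out = 1.0
    for p in probs:
        out *= p
    return out

def parallel(p_a, p_b):
    """Two parallel elements: at least one must work."""
    return p_a + p_b - p_a * p_b

# Assumed illustrative availabilities for two subassemblies A and B.
d_a, d_b = 0.999, 0.995

print(f"series   D = {series(d_a, d_b):.6f}")    # 0.994005, worse than either element
print(f"parallel D = {parallel(d_a, d_b):.6f}")  # 0.999995, better than either element
```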
5) State or Markov chains: These chains represent all the system states and the possible transitions between them. The transitions correspond to events that typically represent failures or repairs. For a system to be Markovian, the transition laws must in particular follow exponential laws with constant failure rate λ and constant repair rate µ.
6) Petri nets: In this case, the system is represented by positions, transitions and tokens. A token that crosses a transition corresponds to a possible system event. The value of this method lies in the fact that the transition laws can be of any probabilistic type, unlike Markov chains. The disadvantage is that simulations are required to perform the calculations, making this type of analysis too complex and expensive for most applications.

D. Don't underestimate experience
While calculations and analysis tools are important in optimizing system dependability, experience in providing dependability solutions is a factor that cannot be underestimated. It is vital that users identify providers with an extensive background in designing large system infrastructure architectures. The fact that a vendor makes a solid product does not necessarily qualify it to provide a total solution that will truly optimize dependability. Total solution providers such as Schneider Electric and MGE offer decades of real-world experience with the most sensitive large-scale global applications, including nuclear, military, finance, co-location and large data centers.

IV. GUIDELINES FOR DESIGNING ELECTRICAL ARCHITECTURE
As you can see by now, site availability and dependability represent far more than simple mean time between failures (MTBF) calculations. According to The Uptime Institute, availability reflects a combination of infrastructure design topology and sustainability. They warn that simple component counts or MTBF analysis fail to include sustainability factors that represent as much as 70% of infrastructure failures. [4]
The Uptime Institute defines four levels of data center infrastructure:
- Tier I: Basic Site, where any capacity component or distribution path failure will impact computer systems and planned work will require shutdown of computer systems.
- Tier II: Redundant Capacity Component Site, which incorporates redundant capacity components and a single, non-redundant distribution path, where a component failure may impact operation and a distribution path failure will result in equipment shutdown.
- Tier III: Concurrently Maintainable Site, with redundant components and multiple distribution paths, where unplanned activities may still disrupt operations and planned maintenance elevates risk.
- Tier IV: Fault-Tolerant Site, with redundant capacity systems and multiple distribution paths simultaneously serving the computer equipment. More specifically:
  - A single worst-case component or distribution failure will not impact operation.
  - Each component and element of the distribution path can be removed from service on a planned basis without requiring system shutdown.
  - All IT equipment must be dual-powered and installed properly to be consistent with the site's architecture.
  - Complementary systems and distribution paths must be physically separated to prevent any single event from impacting both simultaneously. [5]
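The tier definitions are qualitative, but the arithmetic behind them is the two-state Markov model of Section III: for a repairable element with constant failure rate λ and repair rate µ, steady-state availability is µ/(λ + µ), equivalently MTBF/(MTBF + MTTR). A minimal sketch, with MTBF and MTTR figures assumed for illustration:

```python
# Two-state Markov model of a repairable element:
# state UP --lambda--> DOWN --mu--> UP.
# Steady-state availability is mu / (lambda + mu),
# i.e. MTBF / (MTBF + MTTR).

mtbf = 100_000.0   # hours, assumed for illustration
mttr = 8.0         # hours, assumed for illustration

lam = 1.0 / mtbf   # failure rate
mu = 1.0 / mttr    # repair rate

avail = mu / (lam + mu)
print(f"availability   = {avail:.6f}")               # ~0.999920
print(f"unavailability = {1 - avail:.2e}")           # ~8.0e-05
print(f"downtime/year  = {(1 - avail) * 8760:.2f} h")  # ~0.70 h
```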
Recognizing that at least 25% of information downtime results from interactions between computer hardware and its physical environment, [6] typical large data centers supporting mission critical applications target Tier IV compliance and in many cases are required to certify this compliance. These installations follow the Tier IV rules by:
- Implementing redundant sources to the critical applications, for safer maintenance operations.
- Segmenting protection and distribution paths to avoid any critical points in the installation.
- Installing UPS systems in a dedicated room, away from the gensets.
- Designing the installation for comfortable operation, recognizing that a majority of power outages are the result of human error.
- Favoring equipment with few components, recognizing that it will offer a lower probability of failure, lower maintenance effort and a faster mean time to repair (MTTR).
- Implementing comprehensive power management at all levels of the electrical architecture, including final distribution to the loads (e.g., rackable and networkable power distribution units).
- Assuring computer-grade power and minimizing harmonic distortion for all sensitive IT loads.

Figure 2. Example of a Tier IV Fault-Tolerant Site configuration with redundant (N+1) capacity systems, including N+1 gensets, and multiple distribution paths.
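To illustrate why the N+1 arrangements shown in Figure 2 raise availability, a rough sketch computing the probability that at least N of N+1 identical modules are up. Module availability and failure independence are assumptions here; real designs must also address the common-cause events that the physical-separation rules above target:

```python
from math import comb

def avail_n_plus_1(n, p):
    """Availability of an N+1 arrangement: the system works if at
    least n of the n+1 identical modules work, assuming independent
    module failures with per-module availability p."""
    m = n + 1
    return sum(comb(m, k) * p**k * (1 - p)**(m - k) for k in range(n, m + 1))

p = 0.999  # assumed availability of one module (genset or UPS)
for n in (1, 2, 4):
    print(f"N={n}: single module = {p:.6f}   N+1 system = {avail_n_plus_1(n, p):.9f}")
```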
V. STANDARDS SIMPLIFY IMPLEMENTATION
Faced with all of these issues of dependability, reliability and fault tolerance, the Telecommunications Industry Association (TIA) defined guidelines for planning and building data centers, concentrating especially on network design and cabling systems. In mid-2005, this specification completed the ANSI approval process and is now available as ANSI/TIA/EIA-942 Telecommunications Infrastructure Standard for Data Centers (TIA-942). [7] This Standard specifies minimum requirements for the telecommunications infrastructure of data centers and computer rooms and proposes topologies intended to be applicable to any size of data center. Recognizing considerations of cost, risk tolerance and best practices, TIA-942 identifies a range of site infrastructure characteristics that go beyond the Tier classification to include minimum requirements for:
- Infrastructure architecture
- Electrical design
- Environmental and mechanical design (HVAC)
- System redundancy and infrastructure Tier
- Cabling systems
- Access control and security
- Environmental control
- Power management
- Protection against physical hazards (fire, flood, windstorm)

A. Tiering overview
The tier rating used in the standard corresponds to the industry data center tier ratings as defined by the Uptime Institute. [4] It is important to note that while a data center may have different tier ratings for different portions of its infrastructure, the overall tier rating is equal to the lowest rating across all areas of its infrastructure. In addition, the standard points out that mechanical and electrical system capacity must be maintained at the correct tier level as the data center load increases over time. See Appendix A for a Tiering Reference Guide.

B. General electrical requirements
The general electrical requirements cover several areas, summarized below:
Standby generation: Generators should be designed to supply the harmonic current imposed by the UPS system or computer equipment loads. Interactions between the UPS and generator may cause problems unless the generator is specified properly. Where a generator is provided, standby power should be provided to all air-conditioning equipment.
Uninterruptible power supply: UPS systems can be of static, rotary or hybrid type and can be online, offline or line-interactive, but static UPS systems have been used almost exclusively in the United States and are the only systems described in detail in the TIA standard. The UPS system selection should be based on a UPS kW rating with a minimum 20% allowance for peak loads and modest system expansion.
Batteries: Individual battery systems or multiple battery strings may be provided for each module. However, a single battery system shared by several UPS modules is not recommended due to the very low expected reliability. If a generator is installed, the batteries should be specified for a minimum of 5 to 30 minutes of capacity. If no generator is installed, sufficient battery capacity will range from 30 minutes to 8 hours.
Battery monitoring: Consideration should be given to a battery monitoring system capable of recording and trending individual battery cell voltage and impedance or resistance. Such battery monitoring systems are strongly recommended where a single, non-redundant battery system is provided and for Tier IV sites.
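A small sketch of the sizing guidance just quoted; the 80 kW design load and the generator assumption are illustrative, not figures from the standard:

```python
# TIA-942 UPS sizing guidance quoted above: rate the UPS with at least
# a 20% allowance over the design load for peaks and modest expansion.
design_load_kw = 80.0                 # assumed example load
allowance = 0.20                      # minimum 20% per the standard
min_ups_kw = design_load_kw * (1 + allowance)
print(f"minimum UPS rating: {min_ups_kw:.0f} kW")   # 96 kW

# Battery autonomy band from the standard: 5-30 min with a generator,
# 30 min to 8 h without one.
has_generator = True                  # assumed site configuration
lo, hi = (5, 30) if has_generator else (30, 8 * 60)
print(f"battery autonomy target: {lo}-{hi} minutes")
```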
Computer Power Distribution: Power Distribution Units (PDUs) should be considered for distribution to critical electronic equipment in any data center installation. PDUs should be provided complete with an isolation transformer, transient voltage surge suppression (TVSS), output panels and power monitoring.
Rack Power Distribution: In rack installations, power strips should be a single 20 A / 120 V or 16 A / 230 V device, or two if redundancy is required, and should incorporate:
- a power strip with an indicating on/off switch/breaker
- a power strip label with the PDU/panel identifier
Electrical control equipment: Electrical control equipment, power distribution and conditioner systems, and UPS up to 100 kVA are permitted in the computer room unless they include flooded-cell batteries. UPS exceeding 100 kVA, and any UPS containing internal flooded-cell batteries, should be located in a separate room.
Environmental operating conditions:
- Dry bulb temperature: 20 °C to 25 °C
- Relative humidity: 40% to 55%
- Maximum rate of change: 5 °C per hour

VI. STATIC UPS: KEY ELEMENT OF FAULT-TOLERANT ARCHITECTURE
While this paper presents a holistic view of obtaining optimum operational dependability and stresses the fact that no single product is a solution in itself, it is important to note that the first building block of a truly fault-tolerant system will typically be a static, double-conversion UPS. After all, the architecture, genset, energy storage, cabling, monitoring and other factors can all be compromised if this key area is not addressed properly. [8] The static double-conversion UPS system, discussed in depth in other papers in this series, is the most desirable cornerstone for power infrastructure for a number of reasons. Its low component count and lack of moving parts assure the
highest MTBF ratings while keeping the system easy to maintain, with the lowest MTTR ratings. Separate rating of individual modules allows segmented architectures and also facilitates expansion for future growth. In addition, such systems tolerate wide variations in load levels and types (including non-linear loads) with the best power quality, harmonic distortion limited to approximately 3%, and optimized efficiency to assure the lowest total cost of operation. This level of performance is unmatched by any other backup power technology, including so-called dynamic UPS systems, which may be appropriate for less critical or other specialized applications.

VII. CONCLUSION
As security, safety, cost of downtime and total cost of operation become more important in IT operations, the concept of operating dependability becomes ever more critical. Knowing that it can be controlled and calculated gives users the ability to assure maximum compliance with defined standards when designing architectures and selecting system hardware and components. Whether using the tiered specifications referenced here or other qualifications, requirements for certification of fault tolerance or adherence to defined operating specifications are becoming increasingly common. Fortunately, established computational methods and analytic tools permit the systematic study of dependability during system design phases and facilitate quality assurance monitoring throughout system lifetimes. An understanding of the factors involved, combined with exact or approximate calculations, allows comparison of various configurations while enabling evaluation of risk factors and total cost of ownership. At the same time, the ANSI/TIA-942 Standard has captured the best practices in a manner that simplifies implementation of systems with appropriate levels of fault tolerance.

REFERENCES
[1] S. Logiaco, "Electrical installation dependability studies," Merlin Gerin, Grenoble, 1997.
[2] P. Bonnefoi, "Introduction to dependability design," Merlin Gerin, Grenoble, 1990.
[3] G. Gatine, "High availability electrical power distribution," Merlin Gerin, Grenoble, 1991.
[4] W. P. Turner, J. H. Seader, and K. G. Brill, "Tier Classifications Define Site Infrastructure Performance," The Uptime Institute, Santa Fe, NM, 2006.
[5] The Uptime Institute, "Site Uptime Procedures and Guidelines for Safely Performing Work in an Active Data Center," The Uptime Institute, Santa Fe, NM, 2005.
[6] K. G. Brill, E. Orchowski, and L. Strong, "Fault-Tolerant Power Certification is Essential when Buying Products for High Availability," The Uptime Institute, Santa Fe, NM, 2006.
[7] "Telecommunications Infrastructure Standard for Data Centers," ANSI/TIA/EIA-942, Telecommunications Industry Association, 2005, p. 148.
[8] O. Bouju, "Dependability and LV switchboards," Merlin Gerin, Grenoble, 1997.

APPENDIX A. TIERING REFERENCE GUIDE [5]

A. Generalities
- Number of delivery paths: Tier I, only 1; Tier II, only 1; Tier III, 1 active and 1 passive; Tier IV, 2 active.
- Utility entrance: Tier I, single feed; Tier II, single feed; Tier III, dual feed (400 V or higher); Tier IV, dual feed (400 V or higher) from different utility substations, with a concurrently maintainable system.
- IT & telco equipment power cords: Tier I, single cord feed with 100% capacity; Tiers II to IV, dual cord feed with 100% capacity on each cord.
- Generator fuel capacity (at full load): Tier I, 8 hours (no generator required if the UPS has 8 minutes of backup time); Tier II, 24 hours; Tier III, 72 hours; Tier IV, 96 hours.

B. Battery Features
- Battery configuration: one battery string per module.
- Minimum full-load standby time: Tier I, 5 minutes; Tier II, 10 minutes; Tiers III and IV, 15 minutes.
- Battery type: flooded-type batteries with wrapped plates.
- Mounting: Tiers I and II, racks or cabinets; Tiers III and IV, open racks.
- Battery full-load testing: Tiers I to III, every 2 years; Tier IV, every 2 years or annually.
C. UPS & Battery Rooms
- Aisle widths for maintenance, repair or equipment removal: Tiers I and II, no requirement; Tier III, required (not less than 1 m clear); Tier IV, required (not less than 1.2 m clear).
- Proximity to computer room: Tiers I and II, no requirement; Tiers III and IV, immediately adjacent.
- Fire separation from computer room and other areas: Tiers I and II, no requirement; Tier III, required (not less than 1 hour); Tier IV, required (not less than 2 hours).
- Battery rooms: separate from UPS rooms, with individual battery strings isolated from each other.
- Battery monitoring system: Tiers I to III, UPS self-monitoring; Tier IV, centralized automated systems to check each cell for temperature, voltage and impedance.

D. Other general requirements
Security access control / monitoring:
- Generators: industrial grade lock.
- UPS, telephone & MEP rooms: industrial grade lock at the lower tiers; card access at the higher tiers.
- CCTV monitoring of UPS, telephone & MEP rooms: Tiers I and II, no requirement.
System monitoring: central Power and Environmental Monitoring and Control System (PEMCS), with an interface to the building management system (BMS).