International Standards for Data Centre Electrical Design Developed from The Uptime Institute & TIA942 concepts
Benchmarking Data-Centre Quality
There has long been a need to measure the quality of a critical facility.
The quality is usually expressed as the Availability of the IT functionality of the facility, in terms of the number of nines
- e.g. Three Nines = 99.9% Availability
- Note that several engineered and human systems have to contribute to the whole facility and its IT functionality, including the IT hardware and software itself
At this top level, nines are usually applied over 5-10 years
- e.g. 99.99% over 5 years = one failure event lasting ~4 hours
- It should never be assumed to cover multiple failure events
- It should never be assumed to span only one year
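The "99.99% over 5 years = ~4 hours" figure can be checked with a few lines of arithmetic. This is a minimal sketch; the function name and the 8,760-hour year are assumptions made for illustration, not part of the original slide.

```python
HOURS_PER_YEAR = 8760  # assumption: 365-day year

def allowed_downtime_hours(availability_pct: float, years: float) -> float:
    """Total downtime (hours) implied by an availability figure over a period."""
    return (1 - availability_pct / 100) * years * HOURS_PER_YEAR

# Four nines measured over five years: roughly the slide's "~4 hours".
four_nines_5y = allowed_downtime_hours(99.99, 5)
print(f"99.99% over 5 years allows {four_nines_5y:.2f} h of downtime")
```

The same function applied over a single year gives under one hour, which is why the period over which the nines are quoted matters.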
How good is 99.9%?
- 44 minutes of unsafe drinking water per month
- 3 crash-landings per week at Heathrow
- 3,000 letters lost by The Post Office, every hour
- 2,000 surgical mistakes in the NHS, every week
- 9,000 incorrect banking debits per hour
- 32,000 missed heartbeats, per person, per year (not all in one go, please)
UK numbers
Availability Nines: A measure of quality?
MTBF           10 years    1 month     1 day
MDT            1 hour      30 seconds  1 second
Availability   99.99885%   99.99885%   99.99884%
Four-Nines = OK? But do you really want a failure every day?
In reality it's worse. Assuming the system recovery time is 6 hours:
MDT            6h + 1h     6h + 30s    6h + 1s
Availability   99.992%     99.17%      74.99%
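The slide does not state the formula behind these figures. They appear consistent with A = 1 - MDT/MTBF, taking a month as 730 hours, so that is what the sketch below assumes. The more common form MTBF/(MTBF + MDT) gives nearly the same result whenever MDT is small compared with MTBF, and differs noticeably only in the one-day case with a 6-hour recovery.

```python
# Assumed formula: A = 1 - MDT/MTBF (not stated on the slide).
MTBF_H = {"10 years": 10 * 8760, "1 month": 730, "1 day": 24}      # hours
OUTAGE_H = {"10 years": 1.0, "1 month": 30 / 3600, "1 day": 1 / 3600}
RECOVERY_H = 6  # system recovery time added to every outage

def availability_pct(mtbf_h: float, mdt_h: float) -> float:
    """Availability in percent for a given MTBF and mean down time (hours)."""
    return (1 - mdt_h / mtbf_h) * 100

for label, mtbf in MTBF_H.items():
    bare = availability_pct(mtbf, OUTAGE_H[label])
    with_recovery = availability_pct(mtbf, OUTAGE_H[label] + RECOVERY_H)
    print(f"{label:>8}: {bare:.5f}% without recovery, {with_recovery:.3f}% with")
```

The point of the comparison is that three very different failure patterns share almost the same availability until recovery time is included.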
20ms power events in 12 months?
How many computer crashes will you accept?
Availability   Nines  MDT (per year)  20ms failures
99.0%          2      87.6 hrs        15,768,000
99.9%          3      8.76 hrs        1,576,800
99.99%         4      53 min          157,680
99.999%        5      5.3 min         15,768
99.9999%       6      31.5 sec        1,577
99.99999%      7      3.15 sec        158
99.999999%     8      315 ms          15
99.9999999%    9      31.5 ms         2
The Nines cannot be applied to power over a single year!
Better to use MTBF/MDT for one failure event
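The table follows from dividing the yearly downtime budget into 20 ms events. A minimal sketch, assuming an 8,760-hour year; the function names are illustrative only.

```python
SECONDS_PER_YEAR = 8760 * 3600  # assumption: 365-day year
EVENT_S = 0.020                 # one 20 ms power event

def yearly_downtime_s(nines: int) -> float:
    """Downtime budget per year (seconds) for a given number of nines."""
    return SECONDS_PER_YEAR * 10 ** (-nines)

def events_20ms(nines: int) -> float:
    """How many 20 ms events fit in that yearly downtime budget."""
    return yearly_downtime_s(nines) / EVENT_S

for n in range(2, 10):
    print(f"{n} nines: {yearly_downtime_s(n):12.4f} s/year, "
          f"{events_20ms(n):14.1f} events of 20 ms")
```

Even at seven nines the budget still allows well over a hundred 20 ms events per year, each of which may crash unprotected IT equipment.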
Site/IT functionality and Availability
Your mission critical hardware can only deliver its maximum potential if the whole facility works
- Connectivity
- Power
- Cooling
- Fire detection, alarm and suppression
- EPO
- Maintenance and emergency intervention
- Security, internal and external, physical and software attack
- Human Error, Systems Training & Facility Management
- External disasters: earthquake, hurricane, flood, fire... air-crash
The Uptime Institute
The Uptime Institute [1] has, for more than 10 years, sponsored research and practical studies into data centre design, operation and resultant resilience, and has developed a Tier Classification to describe and differentiate facilities from an availability standpoint.
A White Paper [2] from the Institute (authors of which include the originator of dual power supplies in IT equipment and the Tier system itself) is the basis of this review of the facility and operational concepts.
The Uptime Institute is a commercial organisation and the guidelines it created are not in the form of a technical standard. However, many of the principles and details have been incorporated in TIA-942 (see next slide).
www.uptimeinstitute.org
[1] The Uptime Institute, Building 100, 2904 Rodeo Park Drive East, Santa Fe, NM 87505, USA
[2] Turner, Seader & Brill, "Industry Standard Tier Classifications Define Site Infrastructure Performance", 2001-2005, The Uptime Institute, Inc
American ANSI/TIA Standard
In the absence of any more definitive standards:
ANSI/TIA-942-2005 - Telecommunications Infrastructure Standard for Data Centers
Telecommunications Industry Association
- Standards and Technology Dept, 2500 Wilson Boulevard, Arlington, VA 22201, USA
- www.tiaonline.org/standards/search_n_order.cfm
Follows the same Tier I-IV format and draws heavily on The Uptime Institute publications but extends the detail, especially in connectivity
Entirely a USA-centric ANSI specification, so can only be used as a guide in other territories
- EN/BS etc
Specifically for telecom-related data-centre environments <2700 W/m²
Tier Classification: Tier I to IV
The classification system takes into account that at least 16 major M&E systems contribute to the overall IT availability (such as fire alarms, EPO etc)
- Tier IV represents 99.99% site availability (measured over five years) with the critical systems loaded to a maximum of 90%
Each and every sub-system has to meet this table:
Site Availability Vs System Availability 16 major sub-systems contribute to TUI Tier Classification To reach a Tier Classification requires all 16 to achieve... Interesting to note: 5xNines UPS = Tier III
Tier IV: the ultimate in resilience?
Fault Tolerant: A site that can sustain at least one unplanned worst-case infrastructure failure with no critical load impact
Concurrently Maintainable: A site that is able to perform planned maintenance activity without shutting down the critical load
- Note that it is acceptable that the fault tolerance level will be reduced during maintenance or after the first fault
Tier IV Classification only applies to dual power supply loads where complete functionality is obtained with either power supply fed and where the two inputs, in normal operation, share the power demand, as defined by The Uptime Institute's own specification [1]
A technical and philosophical argument rages about Static Transfer Switches for single-cord loads in Tier IV designs
- Is that Tier III.5 or IV.5?
[1] Fault Tolerant Power Compliance Specifications, v2.0, see www.uptimeinstitute.org
Electrical Single Line Diagrams
There is no compulsion on the designer to strictly follow the designs derived from the Tier Classifications. In many cases compromises will have to be made
- The benchmarking function of the Tier system then provides a useful yardstick to measure a system against
In the rest of this presentation we only refer to the Electrical systems, just one of the 16+ engineered systems that are required to achieve a Classification rating
A particular facility's Tier rating is the lowest of all its system Tier Classifications
- Tier IV power + Tier III all other + Tier II cooling = Tier II Facility
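The "lowest of all its systems" rule is simply a minimum over the per-system ratings. A minimal sketch using the slide's own example; the dictionary keys and function name are illustrative assumptions.

```python
def facility_tier(system_tiers: dict) -> int:
    """A facility rates no higher than its weakest engineered system."""
    return min(system_tiers.values())

# The slide's example: Tier IV power, Tier III elsewhere, Tier II cooling.
example = {"power": 4, "all other systems": 3, "cooling": 2}
print(f"Facility rating: Tier {facility_tier(example)}")
```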
Tier I = majority of critical power systems
A basic single-bus critical power system suitable for single-corded IT loads
There is no specific redundancy called for, although it can be argued that the standby generator set is redundant for the grid supply
Although only N is specified, the designer should avoid multiple components in power-parallel configuration as it drastically reduces the potential Availability, i.e. N=1 is best
Maintenance generally involves supplying the load with non-UPS power and an annual load shut-down
Availability of Power at load typically 99.95%*
*Over 5 years operation
Tier II: increasing levels of redundancy
A single-bus power system suitable for single-corded loads
Redundancy is called for in the standby generator installation to reduce the chance of failure-to-start, but not the mains supply
N+1 is specified for the UPS so a high degree of maintenance can be concurrent
Load bank connections are mandatory
Availability at load typically 99.98%*
*Over 5 years operation
Tier II with dual-cord loads
A single-bus power system suitable for both single and dual-corded loads
Redundancy is called for in the standby generator installation to reduce the chance of failure-to-start, but not the mains supply
N+1 is specified for the UPS so a high degree of maintenance can be concurrent
Load bank connections are mandatory
Dual-corded loads (expected minority) should be fed by separate A+B PDUs whilst only the single-corded loads should be fed via STSs (performing a maintenance function rather than Availability enhancement)
Note the option of a B UPS, practical when dual-cord loads are few
Availability at load typically 99.98%*
*Over 5 years operation
Tier III: more redundancy + segregation
A dual-bus power system suitable for both single and dual-corded loads
Redundancy is called for in the mains supply and the standby generator sets. These must be compartmentalised for lower common mode failure, fire etc
N+1 is specified for the UPS so a high degree of maintenance can be concurrent
Dual-corded loads should be fed by separate A+B PDUs whilst only the single-corded loads should be fed via STSs (performing a maintenance function rather than Availability enhancement)
Note the ability of a rapid upgrade to a B UPS and Tier IV (but don't forget the other systems)
An important extra here is the Load Bus Synchronisation. When the STSs can have UPS power on one input and the generator supply on the other it is essential (for the load) to have the two supplies within 30°
Availability of Power at the load typically 99.99%
Tier IV: the Uptime purists' configuration
[Diagram: Segregation]
Load isolation breaker and N+?
To be able to run the load via the bypass and test the UPS system as a parallel group is a very attractive and useful operational/maintenance feature
- The load isolation breaker enables that function
Generally that means that between the PDU and the output bus of the UPS system there are at least two MCCBs or ACBs in series
- Typical MTBF published at 250,000h (28.5y) with maintenance
- Two in series = 125,000h MTBF
This negates the advantage of applying any reliability enhancement strategy using N+(more than 1)
Distribution limits the UPS Availability
[Diagram: Utility/Generator Feed, Input Switchboard, Maintenance Bypass, Output Switchboard]
Typically 250,000h MTBF each
Two in series = 125,000h MTBF
N+2 (or higher) UPS does not improve things
Bus-voltage Availability depends upon these two switches
Single-bus maximum MTBF = 125,000h (14 years), 8h MDT, A = 99.99%
Dual-bus maximum MTBF = 110,000 years, A = 8xNines
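The slide does not show its working, but the figures can be approximated with textbook reliability formulas. The sketch below assumes constant failure rates, independent paths, A = 1 - MDT/MTBF, and the common approximation MTBF²/(2 x MDT) for two redundant repairable paths. None of these is stated in the original, and all function names are illustrative.

```python
HOURS_PER_YEAR = 8760

def series_mtbf(*mtbfs_h: float) -> float:
    """Failure rates add for components in series (constant-rate assumption)."""
    return 1 / sum(1 / m for m in mtbfs_h)

def availability(mtbf_h: float, mdt_h: float) -> float:
    """Approximation valid when MDT is much smaller than MTBF."""
    return 1 - mdt_h / mtbf_h

def dual_redundant_mtbf(mtbf_h: float, mdt_h: float) -> float:
    """Two identical independent paths, either one sufficient (approximation)."""
    return mtbf_h ** 2 / (2 * mdt_h)

single_bus_h = series_mtbf(250_000, 250_000)           # two breakers in series
single_bus_a = availability(single_bus_h, 8)           # 8 h MDT from the slide
dual_bus_years = dual_redundant_mtbf(single_bus_h, 8) / HOURS_PER_YEAR
dual_bus_unavailability = (1 - single_bus_a) ** 2      # both paths down at once

print(f"Single bus: {single_bus_h:,.0f} h MTBF, A = {single_bus_a:.6%}")
print(f"Dual bus: about {dual_bus_years:,.0f} years MTBF, "
      f"unavailability {dual_bus_unavailability:.2e}")
```

Under these assumptions the dual-bus result lands close to the slide's 110,000 years and eight nines, which suggests a similar method was used.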
N+1 redundant UPS architecture: N?
600kVA Load
1+1                          2+1                          3+1
100% Redundancy              50% Redundancy               25% Redundancy
2x 600kVA modules            3x 300kVA modules            4x 200kVA modules
R = 10*                      R = 7                        R = 5
Day One only                 Day One to Two               Day One to Three
Highest UPS CapEx            Scope for load shrink        High scope for load shrink
High risk of partial load    Medium risk of partial load  Low risk of partial load
High load step               Medium load step             Low load step
1200kVA of batteries         900kVA of batteries          800kVA of batteries
                             25% space saving             33% space saving
Lower battery CapEx etc
*Based on Reliability (R) of a single module = 1
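The module and battery figures in the comparison follow from simple sizing arithmetic. A minimal sketch, assuming each module is rated at load/N and carries a battery sized to its own rating, with the space saving taken relative to the 1+1 case; the function names are illustrative.

```python
def n_plus_1_sizing(load_kva: float, n: int) -> tuple:
    """Return (module rating, total installed rating) for an N+1 system."""
    module_kva = load_kva / n
    installed_kva = module_kva * (n + 1)
    return module_kva, installed_kva

def battery_saving_pct(load_kva: float, n: int) -> float:
    """Installed battery rating saved versus a 1+1 system (assumed baseline)."""
    baseline = n_plus_1_sizing(load_kva, 1)[1]
    return (1 - n_plus_1_sizing(load_kva, n)[1] / baseline) * 100

for n in (1, 2, 3):
    module, installed = n_plus_1_sizing(600, n)
    print(f"{n}+1: {n + 1} x {module:.0f} kVA modules, "
          f"{installed:.0f} kVA of batteries, "
          f"{battery_saving_pct(600, n):.0f}% saving")
```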
Limitation of N in N+1 systems
As N grows the potential MTBF of the system decreases (see graph)
A 5+1 limit is sensible
- Potential MTBF x 0.333r
- Doesn't fall too far during N operation
- With module of 35,000h and mains of 100h MTBF, A=7xNines at bus which equates to 5xNines at UPS output *
Tier Classification is more than just power
To truly achieve a Tier Classification means ticking the box in 16 subgroups, and one of the most important is timely, skilled and proper maintenance capabilities on site
- Human error remains the largest contributor group to mission failure, most often when responding to alarms in complex systems
The level of cover and skills in site personnel is a major hurdle
- 24x7 staffing, factory trained in every product on site, an effective BMS alarm response plan backed up by a 4 hour site response with parts and service engineer to ensure a very high first-time-fix rate
For the power system the best (and only cost effective?) solution is to use 24x7 remote monitoring with trained service personnel
- Detect and respond before the site personnel
- Diagnose alarm and set in motion the right engineer with the right parts
Any combination of MTBF/MTTR = Answer
Tier I & II can wait for a service engineer
Tier III & IV can't
Tier IV: The Uptime Institute, original version
Complete physical segregation of the two power supplies from the grid to the dual-corded load: a true Dual-Bus system
- 2x(N+1) in every system, maximum 90% load
- Concurrent maintenance possible without load shut down and without losing N+1 redundancy
- Needs two grid sub-stations (they will be on the same MV-ring or diverse MV-radials) and diverse cable routes into the site
- Two mechanical load power switchboards in dual-bus
- Note that many engineers question having N+1 on both A & B buses
ONLY dual-corded loads
- No STSs, no common point of failure except the grid and the load
- Simple to operate (idiot proof), fault tolerant, hence reliable
With care in design, installation, operation and maintenance, 99.9999% power Availability possible
Not all loads are dual-corded, <30%?
Not all loads are dual-corded
- Power transparent switching via STSs is a great maintenance tool
- Feeding dual-corded loads via STSs reduces Availability to that of the STS itself and negates the principle of dual-bus segregation
Classic Tier IV but with STSs for single-corded loads
- Essential to have Load Bus Synchronisation
Three PDUs in the data-room
- A fed from UPS-A for one feed of the dual-cord loads
- B fed from UPS-B for the second feed of the dual-cord loads
- A/B with STS fed from UPS-A & B for single-cord loads
Tier III.5 or IV.5? That is the question!
Tier IV requires uninterruptible cooling
Even though the TIA-942 specification limits itself to 2700 W/m² and TUI Tier IV refers to 1560 W/m² as the limit across a large space, they call for uninterruptible cooling for Tier IV
The trend for ever-higher IT cabinet loads is well known and single hot-spots as high as 20-30 kW/m² are no longer rare events, making uninterruptible cooling essential
E.g. when a 13kW loaded IT cabinet loses all cooling supply the ambient temperature rises from 22°C to 35°C in under 20s (0.65 K/s)
- Interesting to note the specified rate-of-change-of-temperature limit in TIA-942 = 5 K/hour (0.0014 K/s)
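The two rate-of-change figures on this slide can be compared directly. A minimal sketch of the arithmetic, using only the slide's numbers; the variable and function names are illustrative.

```python
def rate_k_per_s(t_start_c: float, t_end_c: float, seconds: float) -> float:
    """Average rate of temperature change in kelvin per second."""
    return (t_end_c - t_start_c) / seconds

cabinet_rate = rate_k_per_s(22, 35, 20)   # 13 kW cabinet losing all cooling
tia_limit = 5 / 3600                      # 5 K/hour expressed in K/s
ratio = cabinet_rate / tia_limit

print(f"Cabinet: {cabinet_rate:.2f} K/s, TIA-942 limit: {tia_limit:.4f} K/s")
print(f"The cabinet heats roughly {ratio:.0f} times faster than the limit")
```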
The only solution to high W/m² = UPS?
Three steps to achieve continuous cooling
- Keep the air moving: server fans are often sufficient; obtain generator power after 10-15 seconds and, preferably, have high floor-to-ceiling heights
- Keep the fluids moving via UPS driven redundant pumping and, wherever possible, apply Chilled-Water-Storage (CWS)
- If CWS is not practical then power the compressors and heat rejection plant with UPS, retaining 100% cooling capacity on a continuous basis
The power required for the cooling system is typically 40% of the kW IT load (10% pumps, 30% compressors)
Most engineers would prefer to keep the IT and mechanical loads separate, so separate UPS systems
UPS driven cooling: alternative solution
The mechanical cooling load is predominantly motors and variable speed drives, not requiring the high-fidelity voltage and frequency control normally provided by UPS
Generic computer grade series-on-line UPS has energy efficiency of 93% to 94%
Optionally, Eco-Mode can be selected and the UPS system will operate at >98%
- ready to switch back into series-on-line mode in <0.5ms
The 4-5% delta (with no degradation in power for the mechanical load) will save ~2% of the data-centre kWh and carbon emissions, at no additional capital expenditure
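The "~2%" saving can be reproduced roughly from the figures quoted on this and the previous slide. This is a sketch under stated assumptions: a hypothetical 1,000 kW IT load, cooling at 40% of IT load, the midpoint 93.5% for series-on-line efficiency, exactly 98% for Eco-Mode, and the saving expressed relative to IT load energy. The slide does not say which base it uses.

```python
def ups_input_kw(load_kw: float, efficiency: float) -> float:
    """Input power a UPS draws to deliver load_kw at a given efficiency."""
    return load_kw / efficiency

IT_LOAD_KW = 1000.0        # hypothetical site size, for illustration only
COOLING_FRACTION = 0.40    # cooling power as a share of IT load (previous slide)
EFF_ONLINE = 0.935         # assumed midpoint of the quoted 93-94%
EFF_ECO = 0.98             # slide quotes ">98%"; 98% used as a floor

cooling_kw = IT_LOAD_KW * COOLING_FRACTION
saved_kw = ups_input_kw(cooling_kw, EFF_ONLINE) - ups_input_kw(cooling_kw, EFF_ECO)
saving_vs_it_load = saved_kw / IT_LOAD_KW

print(f"Eco-Mode saves about {saved_kw:.1f} kW, "
      f"or {saving_vs_it_load:.1%} of the IT load")
```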
Eco-Mode = 100% CapEx payback in 2 years
System Load Vs Bus-A and Bus-B Load
Total load will probably peak at 80% capacity (Tier IV = 90%)
Typically 30% single-cord loads will be present
- Worst-case balance 1/3 to 2/3 on A/B system
Typical Bus loads of a fully loaded system are then 36% and 44% of rated capacity (for 99.95% per year)
N+1 topology: The higher the N, the higher the module load
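The 36% and 44% bus loads follow from splitting the dual-cord load equally and the single-cord load in the worst-case 1/3 to 2/3 ratio. A minimal sketch, assuming each bus is rated for the full design load and all percentages refer to that rating; the function name is illustrative.

```python
def bus_loads(total_pct: float, single_cord_share: float,
              worst_split=(1 / 3, 2 / 3)) -> tuple:
    """Return (bus A, bus B) load as a percentage of one bus's rating."""
    dual_cord = total_pct * (1 - single_cord_share)   # shared equally by A and B
    single_cord = total_pct * single_cord_share       # split unevenly via STSs
    bus_a = dual_cord / 2 + single_cord * worst_split[0]
    bus_b = dual_cord / 2 + single_cord * worst_split[1]
    return bus_a, bus_b

bus_a, bus_b = bus_loads(total_pct=80, single_cord_share=0.30)
print(f"Bus A: {bus_a:.0f}%, Bus B: {bus_b:.0f}% of rated capacity")
```

Each bus therefore normally runs well below half load, which is why partial-load efficiency matters so much in dual-bus designs.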
Partial load efficiency becomes crucial 25-30% load efficiency point is critical in Tier IV Above example: At 25% load = 8% efficiency delta
New energy storage developments Vs Tiers?
Flywheels, as a battery substitute, always reduce power efficiency
- Autonomy 5-15 seconds of flywheel Vs 10-15 minutes of battery
- Smaller footprint although <2% of stored energy
- Higher capital cost, typically 3-8 times that of an equivalent power battery
- UPS system is 100% dependent upon the diesel-engine starting reliability
- N+1 generators will need special treatment on paralleling-time
Low speed flywheels (steel rotor, bearing load relief via magnetics)
- Standby losses x20 that of battery float power (+10kW higher losses per MVA)
Medium speed (steel rotor, bearing load relief via magnetics)
- Routine bearing changes largely offset battery replacement costs
- Standby losses x15 that of battery float power (+8kW higher losses per MVA)
High speed (steel or composite rotor, active magnetic bearings)
- Standby losses x2 that of battery float power (+1kW higher losses per MVA)
- Complex, hence lower potential reliability (not predictability) than a battery
- Low power module ratings make high-power data-centre application uneconomic
Other contenders in the green debate?
Compressed Air Storage
- Takes up 200% more floor area than an equivalent VRLA battery
- High CapEx, US$1m/MW, x10 cost of equivalent VRLA
- Higher maintenance costs than VRLA
- High standby losses, 35MWh/year higher than battery float power
Hydrogen Fuel Cells
- Are they a replacement for the generator rather than the battery?
- Typically -48V output, needs an energy-bridge to cover starting time
- Green? 50% thermal efficiency, but what is the source of the fuel?
- High CapEx, US$2m/MW, x10 cost of diesel genset
- Low power ratings for data-centre (but well proven at 10kW)
- Embryonic technology for UPS systems, either H-gas or Methanol-Water
Secure Power, Always Ian F Bitterlin PhD BSc(Hons) DipDesInn MCIBSE MIET International Sales Director Contact details Tel: +44 (0) 7717 467 579 E mail: ian.bitterlin@chloridepower.com Web: www.chloridepower.com
Unique to TIA-942 - in the detail
Tier IV has to have impedance-based battery monitoring systems
TIA-942 says that when a system (A or B) is shut down for routine maintenance then the maintenance bypass should be energised by a UPS supply
- Not to rely on the dual-corded loads to operate with one feed dead?
- TIA-942, Page 123, RH column
UPS Maintenance Bypass Arrangement
A third UPS (C) system? Space hungry, 0.05% utilisation and a poor return on investment
- Chloride solution (red-line on diagram)
Cross-feed the output of each UPS system to the maintenance bypass of the alternate system
Manual control, padlocked and interlocked isolators, break-before-make, no hot-transfer, no point of common coupling in an auto-mode, sync-check blocking relays across breakers = safe
Tier IV+STS s + bypass detail from TIA-942