White Paper

Operational Sustainability and Its Impact on Data Center Uptime Performance, Investment Value

By Vince Renaud, P.E., John H. Seader, P.E., and W. Pitt Turner IV, P.E.
Executive Summary

Site resiliency of a data center is the combination of Design Topology and Operational Sustainability. This white paper defines Operational Sustainability and provides clarity on the many critical management trade-offs that profoundly influence the ability of a given site to consistently deliver high levels of predictable uptime performance over long periods of time. Operational Sustainability incorporates all expected operational conditions, including physical security issues and natural and man-made disasters.

The usual focus of executive decisions is on Design Topology because it significantly influences data center construction costs. In reality, decisions about Design Topology account for only 30% of all site failures. The other 70% are caused by Operational Sustainability issues, which are often almost completely overlooked during data center design, especially by senior-level executives.

This paper breaks Operational Sustainability down into five distinct categories. These categories (along with their detailed subdivisions) position the site owner with a comprehensive set of conceptual tools to extract maximum value and availability from the site and data center investment. The content of this white paper was jointly developed by the Uptime Institute and ComputerSite Engineering.

Purpose of this White Paper

This white paper:
- Defines data center site resiliency as the combined result of Tier Topology and Operational Sustainability.
- Defines Operational Sustainability as the collection of design and operating decisions that affect site infrastructure performance, effectiveness, and long-term value.
- Breaks Operational Sustainability into five separate and distinct categories (Site Selection, Building Characteristics, Fitness-for-Use, Investment Effectiveness, and Management and Operations). Each category is described and further subdivided into individual management decisions.
- Provides an evaluation methodology for ranking and grading Operational Sustainability factors and categories.

Background & Definitions

The term resiliency is often used in discussions of mission critical reliability. It describes a data center's ability to continuously maintain critical communications and computing operations. Continuous operation equates to sustained computer processing function despite regional and local utility failures, despite natural or man-made disasters, and despite site failures, errors, and unexpected operational manifestations or permutations. In addition, resiliency also includes the ability to quickly recover from downtime events when they do occur. Unfortunately, common definitions of resiliency are so broad and ambiguous as to be unquantifiable and therefore unmanageable. The Uptime Institute (Institute) has developed a more useful definition.

The Operational Sustainability concepts provided in this document are complementary to and integrate directly with the Institute's Tier concepts. By treating the two concepts separately and distinctly, the Institute overcomes the confusion and lack of rigor currently associated with the term resiliency. The Institute defines site resiliency as the combination of a Design Topology rating with an Operational Sustainability rating.

The Institute's white paper, Tier Classifications Define Site Infrastructure Performance, has been accepted as an international standard. The Tiers range from Tier I (basic data center with no redundancy) to Tier IV (Fault Tolerant with multiple simultaneous distribution paths and robust redundancy). The Institute Certifies compliance to this standard. Only sites listed at www.uptimeinstitute.org/certifiedsites are Uptime Institute Tiers Certified.

Operational Sustainability addresses the five additional categories (beyond Design Topology) that also have a major impact on the overall success of a site's long-term operation and uptime performance.
In the long run, Operational Sustainability makes a greater contribution to long-term site availability and resilience than Design Topology (Tier). The extended uptime performance relationship between Design Topology and Operational Sustainability is very similar to that of a computer. Every computer has an internal architecture for how a processor, memory, storage, and other functions are sized and linked together to perform useful work. The computer's operating system (Linux, Windows, Solaris, etc.) allows each of these hardware functions to work together optimally and with high availability. In data centers, Operational Sustainability is like a computer's operating system, linking together many disparate functions, the most important of which is management and human factors, which accounts for 70% or more of site infrastructure failures. Focusing only on the Design Topology is like focusing only on the processor and ignoring how storage and network contribute to optimizing the performance of the system as a whole.

To avoid confusion, the Institute has chosen the words Operational Sustainability to differentiate site selection, fitness-for-use, lifecycle value, and human factors from ecological sustainability. The Institute includes site infrastructure ecological sustainability considerations within the Operational Sustainability category of lifecycle value. Operational Sustainability should not be confused with corporate sustainability or, more precisely, ecological sustainability. These latter terms are often used to address the ecological aspects of business or society. The concern there is to avoid depletion of the world's limited natural resources or to protect the environment through practices such as conservation, recycling, reduction of the carbon footprint, and so on. Data centers are very energy-intensive, so decisions about ecological sustainability are inherently a part of any site infrastructure design.
The Pursuit of Availability

Availability is the uptime delivered by the built environment. Site infrastructure topology is the design and configuration of the built environment. (It may or may not be Certified to a Tier level.) The expectation that availability will be present is defined by the site's reliability. Completing predictable and preventive maintenance, for example, increases reliability, which in turn improves availability. These factors are all woven together and contribute to what the Institute calls Operational Sustainability.

Operational Sustainability considerations touch on design, operations, and management decisions in virtually every one of the sixteen subsystems that comprise a data center site. These decisions are best made by individual owner teams because they are heavily dependent on the owner's tolerance for risk, future availability requirements, and emerging business expectations, as well as other Concurrent Maintenance, Fault Tolerance, and Operational Sustainability factors not addressed by traditional code and standards bodies, which tend toward minimum requirements.

For clients who are serious about availability, it simply does not make sense to limit thinking to the topology decisions. Sites with exceptional focus on the operational issues are able to achieve substantially better performance for their data centers than similar sites with matching topologies but without the operational focus. It only makes sense, then, to begin to think about the Operational Sustainability factors at the same time as basic topology decisions are made.

Five Categories of Operational Sustainability

The five categories are introduced here along with common attributes associated with each one. Illustrative examples are included in a later section.

Site Selection
- Regional natural disaster risks (seismic, lightning, snow/ice, tornado, flooding, etc.).
- Regional and site-specific natural and man-made disaster risks (adjacent land use, transportation corridors, aircraft flight paths, sink holes, abandoned mines, etc.).
- Dedicated vs. shared-use buildings and their inherent exposures (building a data center in a multistory office building, etc.).
- Utility (electric, water, and fiber) availability, diversity, and distribution path routing.
- Economics (utility consumption rates and demand charges, ability to exploit free cooling, etc.) and financial incentives (property and sales tax, rebates, reduced permit fees, low-interest bonding, etc.).
- Parcel size vs. building footprint (providing for future growth, protection from unknown future neighbors).
- Site physical security (perimeter definition, setback from uncontrolled areas, controlled access for people, vehicles, and packages).
- Skilled workforce or qualified support vendor availability and costs.
- Suitability for construction of a data center (grades, subsurface conditions, zoning, land use restrictions, etc.).

Building Characteristics
- Consistency of the data center's location within the building, and the placement of the building on the site, with the long-term business mission.
- Local codes, standards, and requirements and how they will impact construction costs and site use.
- Understanding natural and man-made disaster risks and mitigating them by designing a building that is fully functional during and after the disaster.
- Power, cooling, and associated subsystem capacities coordinated for common exhaust points. (Examples include chillers coordinated with UPS capacity, and thermal storage discharge capacity coordinated with UPS battery discharge duration; this enables a coordinated capital upgrade plan to be developed.)
- Support spaces (storage, shop, staging, etc.) to keep unnecessary activity and materials out of the computer room.
- Comprehensive commissioning of the completed construction; verification that the design intent was delivered.
- Use of incremental investment to maximize future available capacity (e.g., instead of using a 3,000-amp bus to match the immediate requirement, use a 4,000-amp bus to gain more ultimate capacity for a fractional increase in cost).

Fitness for Use
- Flexibility of the site and building design: the ability to accommodate reasonable but unknown future demands such as increased power and cooling capacity or functionality, enhanced security equipment, or additional security layers (plug-and-play capacity expansions, pads for future equipment, ability to implement continuous cooling in the future, etc.).
- Robustness and redundancy of the design (use of a 90% redline rating, redundancy level [N, N+1, N+2], suitability of the specified equipment, ability to maintain a level of redundancy or Fault Tolerance during maintenance, technology choices, etc.).
- Consistency of the design philosophy across all 16 site subsystems (e.g., not having a robust electrical system supported by a bare-bones mechanical system).
- Ease of maintenance and capacity expansion.
- Adaptive use of proven technology; avoiding unproven emerging edge technology for a mission critical site.
Investment Effectiveness
- The ability to meet the business mission initially and to meet an evolving business mission over time without obsoleting previous investment or disrupting computer operations to make changes.
- Ease of responding to changing business requirements: Tactical vs. Strategic solutions and the life of the site.
- Actual realized IT yield, or the ability of the site to support the maximum number of servers and other IT devices.
- Energy efficiency and effectiveness (energy efficiency ratios); considering how operating set points and practices impact operating expense.
- Some investment effectiveness factors can only be influenced during site selection; see Site Selection for those items.
- In the event of future resale, how valuable would a high-end buyer consider this data center?

Management and Operations
- The existence and success of the Integrated Critical Environments (ICE) Team and the interdepartmental processes and interactions. (The Institute recommends that an ICE team include strategic and tactical members from both IT and Facilities who work together in an interactive and collaborative way to manage critical sites.)
- Departmental effectiveness and the competency to perform their respective tasks.
- Use of meaningful key performance metrics and dashboards.
- Staffing levels, training programs, tools, and other operating expense items.

Flexibility for Change

The five Operational Sustainability categories are listed above in order of decreasing constraints on responding to changing business requirements. Some Operational Sustainability factors are set when the land is selected and are virtually impossible to change thereafter. Other Operational Sustainability factors can be significantly influenced by management attitude and annual budget levels. Each of the Operational Sustainability categories should be fully understood before venturing into a new data center construction project. The site is not flexible once selected.
When the site is selected, it is fixed for the life of the investment. Decisions to put a data center into an office building, to select a site on the beach in Miami, or to build in a basement next to a river will limit the data center site resiliency forever. Building characteristics can sometimes be upgraded, but only at great capital cost and typically only if sufficient land is available. Fitness-for-use may provide the ability to increase capacity easily in the future, but only if this capability is anticipated in the initial design. A site designed only for a UPS redundancy level of N, for example, will have a difficult time reconfiguring to N+1 and may require extended downtime to make the change.

Investment effectiveness should match the expected life of the data center. For Tactical sites, the expected life may be as short as five years. For Strategic sites, the expected life should be substantially longer, perhaps even indefinite.

Unlike the site, building, and other tangible physical ("bricks and mortar") assets, management and operations direction can be changed relatively quickly. Dramatic enhancements can be made to site management without expending significant capital. Often leadership, processes and procedures, and training can be addressed completely with operating expense funding. Improvements in actual operational processes, however, can take several years to achieve.

The Rating System

The rating system for Operational Sustainability is similar to the financial rating system used for corporate bonds: A, B, and C.

A: Excellent level of Operational Sustainability demonstrated at the site. The Institute's Operational Sustainability Certification interval will be five years before a Recertification is required. A site with a grade of A would have many noteworthy practices and would be held up as a model for other sites to emulate. High levels of availability can be expected for long periods of time.
B: Acceptable level of Operational Sustainability demonstrated at the site. The Institute's Operational Sustainability Certification interval will be three years before a Recertification is required. A site with a grade of B would have some noteworthy practices but also a number of missed opportunities. High levels of availability can be expected, but for shorter periods of time.

C: Minimal level of Operational Sustainability demonstrated at the site. The Institute's Operational Sustainability Certification interval will be one year before a Recertification is required. A site with a grade of C would have few noteworthy practices on site and numerous opportunities to improve. High levels of availability may be achieved, but are not likely to be sustained over long periods of time.

As with corporate bonds, the A rating is best, and there is common industry understanding that the C rating indicates a higher-risk investment. The Operational Sustainability rating, when determined by the Institute, becomes a suffix to the Tier Certification. For example, a Tier III Concurrently Maintainable site with an excellent level of Operational Sustainability would be indicated by a composite rating of Tier III A.
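The grade-to-recertification mapping and the composite-rating suffix described above can be sketched as a short lookup. This is illustrative only: the grade letters, intervals, and "Tier III A" format come from this paper, but the dictionary and function names are not from the Institute.

```python
# Sketch of the A/B/C Operational Sustainability rating scheme.
# Grades and recertification intervals are from the white paper;
# the names and structure here are illustrative assumptions.

RECERTIFICATION_YEARS = {
    "A": 5,  # Excellent: many noteworthy practices, a model site
    "B": 3,  # Acceptable: some noteworthy practices, missed opportunities
    "C": 1,  # Minimal: few noteworthy practices, many gaps
}

def composite_rating(tier: str, grade: str) -> str:
    """Append the Operational Sustainability grade as a suffix to the
    Tier Certification, e.g. a Concurrently Maintainable site rated
    Excellent reads 'Tier III A'."""
    if grade not in RECERTIFICATION_YEARS:
        raise ValueError(f"unknown grade: {grade}")
    return f"Tier {tier} {grade}"

print(composite_rating("III", "A"))  # Tier III A
```

A grade of C thus triggers annual reassessment, while an A site is revisited only every five years.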
As with the Tiers of Design Topology, the Institute reserves the exclusive right to rate data centers according to the Operational Sustainability criteria. In performing such evaluations, the Institute uses a much more comprehensive set of criteria than indicated in this white paper.

Illustrative Examples of Operational Sustainability

As illustrations of the importance and content of the five Operational Sustainability categories, several discussions are included in the remainder of this white paper. The discussions are by no means complete but are included to indicate the wide range of factors that impact Operational Sustainability.

Site Selection

The selection of a site is a major factor in the Operational Sustainability of the site. A perfectly designed data center on a poorly selected site can be repeatedly defeated by natural events or human risks, some seemingly quite minor. If the mission of a data center is high availability for long periods of time, the selection of the site should be an integral part of the planning. It is not an act of God to have a data center damaged when a hurricane comes ashore in Miami; the damage is the consequence of choosing to locate the data center in Miami. The risks of a poorly selected site may never be fully mitigated by design features.

There are many other examples where a well-designed and implemented topology is defeated by a site risk that was not properly identified or evaluated during site selection. Exposure to the consequences of a train derailment or an overturned chemical tanker on a highway must always be considered. For example, one well-designed data center is located at the intersection of an interstate highway and a major railway line, which leaves the site permanently exposed to evacuation orders due to a nearby transportation accident. Within a site, examples can include locating a data center beneath an underground parking garage, under a kitchen, in a mixed-use building, and so on.
These locations are not conducive to maintaining long-term availability. Routing and robustness of off-site support elements for a data center, such as power, water, and fiber, must be considered in site selection. Security, business continuity, and other governmental regulations, such as blast radius protection for international financial firms, can render many new data centers instantly obsolete because the site simply cannot meet the new criteria.

The long-term viability of the site can be dramatically affected by land parcel size. The blast radius criteria were mentioned earlier, but there are other benefits to a site that is substantially larger than minimally necessary. Several major sites with obsolete central plants have avoided the costly need to move their data center because they had excess land. These sites were able to upgrade the mechanical and electrical systems for the existing raised-floor space for a fraction of what it would have cost if the data center had to be relocated. The size of the land parcel can be a long-term strategic benefit exceeding its apparent cost.

Building Characteristics

Support Spaces (Staging and Storage)

One common error in the programming stage of design for many data centers is the failure to include sufficient storage or staging space. This failure leads directly to the unacceptable practice of having materials stored on the raised floor next to operational computers or in critical infrastructure equipment rooms. This practice introduces fire, contamination, accident, and security risks where they simply should not be. This programming opportunity is not related to the topology of the power and cooling systems, but it can lead to significant operational disruptions that impact site availability. Owners who are serious about maintaining availability will include sufficient support spaces. Owners with a shorter view of long-term availability generally will not.
Many data center project team debates occur over the number of gross square feet in a data center project. Data centers are not office buildings, and there is only a minor correlation between data center space and cost. Because of the heavy mechanical and electrical costs (often 75-80% of a project), reductions in overall space without changing the mechanical or electrical capacity and Tier will have inconsequential impact on cost. For additional discussion on the key cost contributors and data center cost structure, please refer to the Institute white paper Cost Model: Dollars per kW plus Dollars per ft² of Computer Floor.

Local Codes, Standards, and Requirements

Some managers may want to find an outside or independent standard and follow it blindly, feeling that if they adhere to a published standard, they will be beyond the reach of criticism. When these managers understand the concepts behind many standards, they may be rudely surprised. Codes and standards are generally minimum requirements that are driven by continuous compromise within the standards body and are not responsive to the specific needs of a particular business. The building code, for example, is a minimum life-safety standard intended to maintain the structural integrity of the building long enough after an event (tornado, fire, hurricane, high wind, earthquake, etc.) to allow the human occupants to safely exit the building. The basic building code does not address or necessarily intend to have typical structures economically functional after the occupants are evacuated. Thus, the concept of putting a data center into an ordinary office building raises a question about the survivability of the data center function. That is a major Operational Sustainability consideration.
There are additional and more restrictive criteria identified in most building codes for structures that must remain functional after a disaster. Called essential structures in the Uniform Building Code, these buildings are designed to higher structural standards than an ordinary office building. As building codes are life-safety driven, these higher standards apply to buildings where occupants are not ambulatory (hospitals and jails), buildings with very high occupancies (such as large auditoriums), and dispatch buildings for police and fire that need to be functional after the event. Data centers are not listed as essential structures, so the owner must take the initiative to establish a higher structural standard for mission critical sites.

There are several versions of data center standards available that purport to list all the components required for a given Tier of site design. These checklists may be helpful, as basic education, to managers new to the industry. However, they may not help an owner achieve a desired level of functionality or an appropriate topology. Every data center project executive must demonstrate prudent use of capital investment that is appropriate for their business. Standards that dictate closed-circuit video monitoring or biometric security measures, for example, may be well beyond the level of security necessary for most data centers. Similarly, standards that dictate component counts often completely miss the overall functional outcome. New sites built on component counts are often found to be deficient when compared to the outcome-based Concurrent Maintainability (Tier III) standard as defined by the Institute.

Fitness for Use

Continuous Cooling

UPS systems are generally provided to ensure continuous power to the IT equipment.
These systems provide a power quality component (eliminating sags and surges) as well as ride-through time for the engine generators to start and pick up the load following the loss of the power utility. Battery discharge durations, often 15 minutes, provide the site staff reaction time when the engine-generator plant fails to start, or starts but fails to pick up the critical load. This is a surprisingly common occurrence. This reaction time is reduced with shorter battery discharge durations and effectively eliminated with flywheel (no battery) technology choices.

During the interval between loss of the power utility and the engine-generator plant picking up the load, the UPS continues to deliver power to the IT equipment, but in most sites there is no equivalent cooling capability. The result is that the computer room temperature increases. These thermal excursions may be 25°F or more in 5 minutes or less, depending on the power density of the room and of individual racks. Computer room thermal excursions can damage IT equipment, void factory service warranties, and cause unstable operation for long periods of time. A client may not have a need to solve this problem today, but should plan for it in the future. For additional information, reference the Institute white paper Continuous Cooling Is Required for Continuous Availability, which goes in depth into the magnitude of the thermal excursion potential in a data center.

Robustness and Redundancy

In order to meet the acid test in the Institute's Tier definition, a Tier IV Fault Tolerant site must have a configuration or topology that can experience a worst-case unplanned event and not disrupt the IT equipment. For a UPS system, a System+System configuration is a common solution. This has two UPS systems simultaneously supporting a common IT load.
A System+System topology can be provided with a number of different configurations that are commonly encountered:
- Redundant parallel modules on each side [(N+1)+(N+1)]
- Parallel modules on each side [N+N]
- A single module on each side [Module+Module]
- Isolated Redundant (module or system)
- Catcher system (3 to make 2, 4 to make 3, etc.)

Each of these configurations technically meets the acid test. However, each presents different levels of flexibility, protection during maintenance, and availability over time. These choices are the root of the Operational Sustainability concept. Each option must be evaluated by an owner and their project team to determine the appropriate level of long-term availability and whether that will meet the long-term business requirements. The presence of redundant modules in the (N+1)+(N+1) topology, for example, provides greater resiliency when one system is out of service for maintenance or following a failure. Isolated Redundant configurations require a UPS system or module to successfully handle a 100% load step in addition to the successful operation of a static transfer switch. Configurations such as 3 to make 2 or 4 to make 3 are commonly used to reduce the first cost of an installation but present more complex organizational and tracking requirements, which some sites may not be able to provide over time.

There are many other redundancy topics beyond UPS topologies. Owners should understand that the long-term flexibility of a site is determined, or at least constrained, by the initial Design Topology. Owners who determine they only need an N+N UPS topology initially, for example, would be well served to design and construct one that can be expanded to (N+1)+(N+1) as future business requirements may dictate. It is very difficult to modify a site to a higher level of redundancy when it is constrained by an N+N design.
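The maintenance argument above can be made concrete with simple counting. The sketch below uses hypothetical load and module sizes (800 kW load, 500 kW modules, so N = 2) to show why (N+1)+(N+1) retains a spare module even with one entire system out of service, while N+N does not; the function names are illustrative, not Institute terminology.

```python
# Illustrative comparison of (N+1)+(N+1) vs N+N System+System UPS
# topologies. All numbers are hypothetical; the point is the spare
# count on the surviving system when one whole system is down.
import math

def modules_needed(load_kw: float, module_kw: float) -> int:
    """Smallest number of modules (N) that can carry the load."""
    return math.ceil(load_kw / module_kw)

def spares_during_maintenance(load_kw: float, module_kw: float,
                              modules_per_system: int, systems: int = 2) -> int:
    """Spare modules remaining on the surviving system(s) when one
    complete system is taken out of service for maintenance."""
    n = modules_needed(load_kw, module_kw)
    remaining = (systems - 1) * modules_per_system
    return remaining - n

load_kw, module_kw = 800.0, 500.0  # N = 2 modules carry the load
print(spares_during_maintenance(load_kw, module_kw, modules_per_system=3))  # (N+1)+(N+1): 1
print(spares_during_maintenance(load_kw, module_kw, modules_per_system=2))  # N+N: 0
```

With N+N, taking one system down leaves exactly N modules and no redundancy; with (N+1)+(N+1), the surviving system still has a redundant module, which is the resiliency difference the paper describes.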
Another area of discussion is the adequate number of redundant capacity units. Discussions often center on N+1, meaning there is one more unit of capacity than required to support the design load. In many cases, that may be sufficient. However, in other cases, having only one redundant unit may not be enough. With major equipment such as chillers or engine generators, some repairs or service activities may take weeks or months. Determining that there should still be one unit of redundancy during these activities is a strategic decision.

Suitability of the Specified Equipment

Not all site infrastructure equipment (UPS and batteries, cooling units, etc.) is equal. Specifications, which are created by the design professionals for each project, drive the quality, capacity, and features of the systems and components within the project. There are some substantial but subtle differences that impact overall site Operational Sustainability. Many owners are not aware of the range of choices and their consequences for operations. Two examples are provided here.

UL 891 Switchboard and UL 1558 Switchgear

Switchboard construction is lighter weight and better matched to office building uses. Switchgear construction is heavier, fully compartmented to provide physical isolation between adjacent sections, and better matched to high-availability applications. The heavier construction of switchgear will provide better survivability from faults or other electrical failures that may occur over time. This is an excellent example of the trade-off between lower first cost (switchboard) and better long-term Operational Sustainability (switchgear).

Standby and Continuous Engine-Generator Ratings

Most procurement processes use metrics such as dollars per kW to measure purchasing effectiveness. This approach misses the question of operational effectiveness and Operational Sustainability.
For example, the warranty on an engine generator with a standby duty rating has limits on the run time that can be expected for predictable performance. The warranty for a continuous duty rating has no run-time limitation. Use of a procurement metric such as $/kW will favor the standby-rated unit, even though the continuous-duty unit may be a better response to the corporation's intended use in a data center. Project managers and owners must address the intended purpose of each system, such as engine generators, and determine what is in the best interest of the data center. A standby-rated engine generator must be de-rated by 20% to approximate the same usefulness as a continuous-duty engine generator. Most engine-generator run-hours are not related to the availability of the local power utility. Extended operation of an engine generator during UPS maintenance, site recovery, or other situations is common at high-end data centers. In the final analysis, lower specifications or metrics such as $/kW may not be in the best interests of a strategic data center seeking long-term Operational Sustainability.

90% Redline Rating

Use of 100% of the advertised capacity of UPS systems, chillers, pumps, etc., is a common design approach. However, as the load nears the nominal capacity, operational issues begin to appear. These issues may be caused by fouling in a cooling unit, dirty filters just before they are changed, or a control calibration that has simply drifted out of specification. Everything must be optimal to achieve 100% of the nominal capacity, something that rarely actually happens in the operational world. A more sustainable practice is to limit the normal maximum load to 90% (a redline rating) of the nameplate capacity. This approach creates a 10% reserve to account for the capacity constraints and anomalies that appear as the load approaches the nameplate. This redline rating should be applied to every active system, including cooling and UPS systems.
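The two capacity adjustments above, the 20% standby derate and the 90% redline, can be combined in a small worked example. The prices and ratings below are hypothetical; the sketch only illustrates how a cheaper standby-rated generator can cost more per usable kW than a continuous-duty unit once both adjustments are applied.

```python
# Hypothetical $/kW comparison after applying the two adjustments
# discussed above: standby rating x 0.80 (20% derate) to approximate
# continuous usefulness, then x 0.90 (the 90% redline on nameplate).
STANDBY_DERATE = 0.80
REDLINE = 0.90

def usable_kw(nameplate_kw: float, standby_rated: bool) -> float:
    """Usable capacity after derating (if standby-rated) and the redline."""
    kw = nameplate_kw * (STANDBY_DERATE if standby_rated else 1.0)
    return kw * REDLINE

def cost_per_usable_kw(price: float, nameplate_kw: float, standby_rated: bool) -> float:
    return price / usable_kw(nameplate_kw, standby_rated)

# Illustrative 2,000 kW units: the standby set is cheaper on nameplate
# $/kW but more expensive per kW the data center can actually rely on.
standby = cost_per_usable_kw(400_000, 2000, standby_rated=True)     # ~$277.78/kW
continuous = cost_per_usable_kw(450_000, 2000, standby_rated=False) # $250.00/kW
print(round(standby, 2), round(continuous, 2))
```

This is the procurement trap the paper warns about: a nameplate $/kW metric picks the standby unit, while a usable-capacity metric favors the continuous-duty unit.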
Another aspect of capacity is how much capacity the site will have at its ultimate development. Tactical sites use lower power densities than Strategic sites. As power densities continue to increase, the ultimate capacity of a site will be one determinant of how long the site can accommodate additional IT equipment. Higher ultimate density targets equal longer life.

Technology Choices

This area addresses the different technology solutions used in creating systems for data centers. The example used here is battery technology. Other discussions might include static vs. rotary UPS, double-conversion vs. hybrid UPS, etc. The Institute of Electrical and Electronics Engineers (IEEE) defines two types of batteries for UPS applications. The first is wet or flooded cells, technically called vented lead acid (VLA), which are addressed by IEEE Guide 450. The second type is maintenance-free or sealed batteries, called valve regulated lead acid (VRLA); IEEE Guide 1188 covers VRLA batteries.

In the large North American mission critical data center environment, most sites select VLA batteries to support the UPS. These have a higher first cost, require more floor space, and introduce requirements for a vigorous maintenance program, acid containment plans, ventilation, and so on. These more stringent requirements are accepted because of the perceived reliability benefits realized with VLA technology. Other sites may focus on a lower first cost or a reduced space requirement and select VRLA batteries. Once this decision is made, there is little opportunity to upgrade the site to VLA batteries because they require more physical space.
VRLA batteries have a number of failure modes, which often appear quickly and without notice, even with suitable monitoring systems. VLA batteries develop failure modes more slowly, and those modes are more likely to be indicated in advance with proper monitoring. Often, reliability considerations are not selection criteria when VRLA batteries are picked. The differences between VLA and VRLA batteries can be somewhat equalized by a proactive VRLA replacement program every three to four years. This approach replaces the VRLA batteries before they begin to demonstrate common failure modes. Given the life-cycle costs and intrusive nature of the constant VRLA changes, the Institute believes that VLA batteries are the better Operational Sustainability choice.

Investment Effectiveness

Strategic vs. Tactical

Operational Sustainability is the capability of a site to meet uptime objectives over extended periods of time. There are two diametrically opposed approaches to the design and operation of data centers: Strategic and Tactical. Strategic sites are more sustainable over time because reasonably expected unknown factors are anticipated in the design, construction, and operation of the site. One objective is to effectively eliminate the need to relocate the IT operation in the future. The migration investment to relocate a data center (provisioning, network, swing servers, etc.) can be a substantial fraction of the cost to build the new site. Additional objectives for a Strategic site are to minimize the risk and complexity of future construction work for capacity expansion or replacement of components and systems while making operation of the site as simple as practical. Providing locations and connection provisions for future pieces of equipment is one way to reduce future risk. These methods are sometimes called providing hooks for future equipment.
Combined, the increased Operational Sustainability and flexibility of a Strategic site effectively provide an indefinite life during which the desired availability is more stable and predictable. Tactical sites, on the other hand, tend to focus on the initial requirements (capacity, cost, or completion dates, for example) at the expense of long-term flexibility and availability. Attention to future capacity expansion may be left to the next project team rather than taking the time to resolve those issues initially. These sites are fundamentally less able to meet changing needs because the flexibility has not been designed into them. As a result, Tactical sites have shorter useful life expectancies than Strategic sites. Tactical sites often have a credible life of less than five years: often only one cycle of hardware replacement. Additionally, Tactical sites have less ability to meet the availability needs of the owner over time. A premature write-off of a short-lived Tactical investment will result in a greater annual charge than a longer-lasting Strategic site, despite the Strategic site's greater first cost. Strategic sites have a much greater ability to adapt and respond to changing (and increasing) business requirements over longer terms than Tactical sites. A well-developed Strategic site will effectively have an indefinite life. At some point in the future, the power and cooling capacity of the site may be exhausted. That does not indicate the life of the site has ended, only that there is not sufficient capacity to meet all of the corporate needs in the current facility. A planned-for facility expansion at the existing site can minimize investment in new public utilities, data network providers, and relocation costs. There are appropriate applications for Tactical sites. New business start-ups need to confirm the viability of the business before worrying about long-term flexibility. A data center that supports a new product development effort is another example.
If the product prospers, expanded facilities will be required to support the production and operational environments for the successful product. In these instances, a Tactical solution makes sense. It must be clearly understood that a Tactical solution may be economically thrown away in five years or even less. Established or long-term businesses are better served by Strategic data centers.

Operating Set Points

The operating set points used at an active data center have a significant impact on the operating costs for the site. For example, a data center with excellent management of raised-floor air distribution should be able to increase the cooling unit discharge temperature set point to approximately 68°F instead of the more traditional 55-58°F. (Note: this discusses cooling unit discharge air temperature, not the traditional 75°F return air temperature.) This will reduce the cost of producing the cooling required to satisfy the IT devices. (Note: this temperature change CANNOT be implemented without FIRST doing an excellent job of managing air distribution. Please refer to the Institute white paper How to Meet 24 by Forever Cooling Demands of Your Data Center.)

Management and Operations

Interdepartmental Interactions

In the best-run data centers, there is a close alignment between the IT team and the site infrastructure facilities team that operates the power and cooling equipment. This alignment eliminates the isolated organizational silos that often result in duplicated work, create gaps in essential work processes, and generally create stress and frustration. The Institute refers to this close alignment as an Integrated Critical Environments (ICE) team. Improved processes result in easier and faster reaction to evolving business needs, sustainable availability,
better utilization of the data center investment, and other beneficial performance.

Staffing Levels and Training Programs

The size of the site infrastructure staff, their hours of coverage, and their skill sets can make a significant difference in the long-term Operational Sustainability of the site. Information from over 100 data centers tracked by the Institute's Site Uptime Network has demonstrated that site uptime is a continuous process. Failures and events are uniformly distributed across days of the week and hours of the day; they do not occur only during the day shift. The gap between this continuous process and less robust staff coverage is a major weakness at many sites. Having single-shift coverage for the site facilities team leaves other shifts uncovered. Many sites have 24 by forever coverage on the IT operations desk but are not similarly supported by the site facility staff. This leaves IT operations vulnerable to site infrastructure disruptions. For example, having a 15-minute UPS battery discharge life (which provides time to respond to and potentially remedy a problem) but relying on a 45-minute response time to get a remote vendor on-site is fundamentally inconsistent with the perceived mission of a high-availability data center. A site staff with poorly developed skill sets will not be able to take the timely and appropriate actions that may mitigate or prevent a potential site failure. The benefit of having on-site staff is lost if they are not able to execute these kinds of interventions. A Tier IV site, for example, should have two individuals who have been trained on the specifics of that site on every shift. Supervision and engineering technical support are additional requirements. Training is not left to the individual. It should be planned and budgeted, organized, and effectively delivered. Course material should be consistent from one session to another.
Attendance should be recorded, and every module should require a demonstration of skill mastery. Please refer to Critical Environment Staffing Considerations.

Interdepartmental Practices and Procedures

A well-trained staff with good operational practices, procedures, and policies can exploit the potential value from any Design Topology. A site with a poorly trained staff that lacks good operating practices and maintenance programs will likely fail to reach the availability levels anticipated during the design of a data center. If a site is truly mission critical, one would expect written site configuration change procedures so that the same steps are used each and every time a configuration change occurs. For example, well-managed sites have tested, written procedures used each time the UPS is placed into the maintenance configuration. This preoccupation with knowing the site well, being well trained, and being prepared to intercede manually is a major differentiator for sites that are serious about availability. There are numerous other processes and procedures that should be developed for data centers. These may include access control and escort policies, standards for safe work practices in computer rooms, server installation processes, processes to accommodate IT devices that fail to meet the Institute's Fault Tolerant Power Specification, Version 2.0, and facility management participation in IT change management as well as IT participation in facility management change control processes. Every piece of critical site infrastructure equipment should have step-by-step operational procedures directing how to configure systems in normal and abnormal situations and following an emergency condition. For years, the Institute has offered a guideline that can be used as a basis for developing site-specific computer room work practices. This guideline is contained in the white paper Site Uptime Procedures for Safely Performing Work in a Live Data Center.
Please also see Critical Environment Governance. This white paper provides recommendations regarding global Critical Environment management protocols, processes, and standards, along with a recommended suite of management programs, work-flow processes, site configuration procedures, and emergency response procedures.

More than Design Topology and Tiers

Long-term availability is not assured by Design Topology alone. A host of other factors has significant influence on the ability of a given site to provide availability over time. These factors are encompassed by the Institute's term: Operational Sustainability. Operational Sustainability must be addressed by the owner and the owner's project team during design, construction, and operation in order to provide the best basis for long-term availability. These Operational Sustainability factors have significant impacts on site availability over extended periods of time once the basic topology is established. Commissioning of the completed construction, ICE team mandate, accountability, budget, leadership, and staffing are as important, if not more important, to the ultimate performance of the investment than the Tier level. The Institute recommends that the Operational Sustainability concepts contained in this white paper receive the same level of executive attention as the Tier level and other project factors. The long-term viability of the site will benefit greatly from including Operational Sustainability in the delivery and operational planning for the site. Site management teams who incorporate the principles of Operational Sustainability into their project design and site operations practices will have notably better availability results than those that do not.
Referenced White Papers

The following white papers are referenced herein and can be found at uptimeinstitute.org/whitepapers:

Tier Classifications Define Site Infrastructure Performance
Continuous Cooling Is Required for Continuous Availability
Cost Model: Dollars per kW plus Dollars per ft2 of Computer Floor
Fault Tolerant Power Specification, Version 2.0
How to Meet 24 by Forever Cooling Demands of Your Data Center
Site Uptime Procedures for Safely Performing Work in a Live Data Center

Referenced and Related Publications

Further information can be found at computersiteengineering.com/whitepapers:

Critical Environment Staffing Considerations
Critical Environment Governance
Life Expectancy of Facilities Infrastructure
Tier Technical Users Guide

The Uptime Institute's publications are protected by international copyright law. The Institute requires written requests whenever the Institute's literature or portions of the Institute's literature are reproduced or used. The Institute copyright extends to all media (paper, electronic, and video content) and includes use in other publications, internal company distribution, company websites, and handouts for seminars and courses. For more information, please visit www.uptimeinstitute.org/resources to download a Copyright Reprint Permission Request form.

About the Authors

Vince Renaud, P.E. is a Distinguished Fellow and Certification Authority for the Uptime Institute and a Principal of ComputerSite Engineering. Mr. Renaud has provided leadership and strategic direction to maintain the highest levels of infrastructure availability. In varying roles, from data center owner and operator to consultant, Mr. Renaud has provided planning, design, construction, operation, and maintenance of mission critical facilities for the Department of Defense and Fortune 100 companies on a worldwide basis. He is a co-author of the Institute's Tier white paper, among others.

John H. Seader, P.E.
is a Distinguished Fellow and Certification Authority for the Uptime Institute and a Principal of ComputerSite Engineering. Mr. Seader's career in critical facilities spans 14 years and includes responsibilities ranging from planning, engineering, design, and construction to start-up and operation for clients such as the Department of Defense, Sabre, and Williams Communication. Prior to joining ComputerSite Engineering, Mr. Seader was a Senior Technology Manager for Deloitte Consulting Outsourcing, LLC. He is a co-author of the Institute's Tier white paper, among others.

W. Pitt Turner IV, P.E. is a Distinguished Fellow and Senior Certification Authority for the Uptime Institute and a faculty member for the Institute's Site Uptime Network. As a Principal of ComputerSite Engineering, Mr. Turner has personally guided billions of dollars in client site infrastructure investments. Prior to joining ComputerSite Engineering in 1993, Mr. Turner was Senior Project Manager for Pacific Bell's Fairfield Data Center, where he was responsible for concept development, design, construction, and start-up of a 200,000-ft2 facility. His work included benchmarking other data centers to help establish business-process improvements. He is a co-author of the Institute's Tier white paper, among others.

Additional editorial guidance was provided by Terry Altom, P.E., Ken Brill, and Rick Schuknecht. Mr. Altom is a Consultant for ComputerSite Engineering who specializes in site infrastructure audits and custom strategic-level consulting engagements. Mr. Brill is the founder of the Institute, its executive director, and a Principal of ComputerSite Engineering, Inc., which he also founded. Mr. Schuknecht is a Senior Consultant with ComputerSite Engineering who specializes in data center management and operations.
About the Uptime Institute

The Uptime Institute, Inc. is a pioneer in creating and operating knowledge communities for improving uptime effectiveness in data center Facilities and Information Technology organizations. The 100 members of the Institute's Site Uptime Network are committed to achieving the highest levels of availability, and many are Fortune 100 companies. Members learn interactively from each other as well as from Institute-sponsored meetings, site tours, benchmarking, best practices, uptime effectiveness metrics, and abnormal incident collection and trend analysis. From this interaction and from client consulting work, the Institute prepares white papers documenting Best Practices for use by Network members and for the broader uninterruptible uptime industry. For the industry as a whole, the Institute publishes white papers and offers a Site Uptime Seminar Series, a Site Uptime Symposium, and a Data Center Design Charrette Series on critical uptime-related topics. The Institute also conducts sponsored research and product certifications for industry manufacturers. For users, the Institute certifies data center Tier level and site resiliency.

About ComputerSite Engineering

The content of this white paper was jointly developed by ComputerSite Engineering. ComputerSite Engineering, Inc. is a data center engineering and management consulting firm working in close collaboration with the Uptime Institute to address technical aspects of contemporary data center issues. Independent of any Engineer-of-Record or manufacturer affiliation, ComputerSite Engineering's consulting teams help clients develop and execute solutions that are responsive to their unique business needs. Since 1985, ComputerSite Engineering has guided and justified data center investments for major organizations that require high levels of continuous availability to conduct business.
ComputerSite Engineering's mission is to work with clients to ensure data centers are managed for uninterruptible uptime over sustained periods.

Building 125, 2904 Rodeo Park Drive East, Santa Fe, NM 87505
505.982.8300 Fax 505.982.8484
info@computersiteengineering.com
computersiteengineering.com

Building 100, 2904 Rodeo Park Drive East, Santa Fe, NM 87505
505.986.3900 Fax 505.982.8484
uptimeinstitute.org

© 2008 Uptime Institute, Inc. TUI3027A