Copyright © 2012 S1NED Limited. All rights reserved.

Data Center Infrastructure Management - A Special Report

It's not all about cost

June 2012
How do I know if my Data Center is about to fail?

The simple answer is that you probably don't, which ought to be quite disturbing given that you are likely to have spent $2,300 per square foot on the design and construction of your data center facility. Research shows that the major causes of catastrophic data center failure are rarely critical facility design and build issues; they are far more likely to be human error, exacerbated by inadequate process and poor metrics.[i] Your wonderful, expensive and highly resilient data center facility, operated ineffectively, will eventually fail; it's all just a matter of time and luck. In the USA, between 4% and 5% of data centers fail to some extent every year.[ii]

Data centers are highly complex combinations of facilities and IT systems that do not naturally integrate or work well together. Global best practice demands proactive, real-time monitoring and management of both domains as an integrated whole. Data Center Infrastructure Management platforms such as Rackwise™ DCiM™ provide this integrated view, enabling proactive management of the whole environment. Complex interactions and hidden problems can conspire together to deliver unexpected and sometimes catastrophic failures.

Figure 1: The Formula for a Silent Running Data Center Estate

How do I know if I am getting 100% utilization from my Data Center investment?

Ever walked around a data center and seen all of the empty racks, yet been told that the facility is full to capacity? Ever tried to explain to the CFO that you need another $20M to build a data center because the current one is full? Ever been faced with squeezing another rack of equipment into your already full data center? These are all too common experiences for data center operators. Let's ask ourselves: what does "full" actually mean from a data center perspective?

- Is there physical space in the selected rack?
- Will the rack support the weight of the new equipment as well as what is already in there?
- Is there enough power in this rack, on both streams, to support this equipment as well as what is already installed?
- Is there enough cold airflow to this rack to support the additional heat load?
- Is there enough network capacity for the SAN and LAN installed in the rack?

Figure 2: Some of the key decision-making steps in selecting a location for new IT equipment in our Data Center

Copyright © 2012 Rackwise Inc. All rights reserved.
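The placement questions above amount to a constraint filter over the rack inventory. The sketch below illustrates the idea; the field names, units and thresholds are assumptions for illustration, not part of any particular DCIM product.

```python
from dataclasses import dataclass

@dataclass
class Rack:
    # Illustrative fields a DCIM inventory might track per rack.
    free_u: int                   # free rack units (physical space)
    weight_headroom_kg: float     # remaining floor/rack weight allowance
    power_headroom_a_kva: float   # remaining capacity on the A power stream
    power_headroom_b_kva: float   # remaining capacity on the B power stream
    cooling_headroom_kw: float    # extra heat load the cold airflow can absorb
    free_network_ports: int       # spare SAN/LAN ports in the rack

def can_host(rack: Rack, u: int, weight_kg: float, load_kva: float,
             heat_kw: float, ports: int) -> bool:
    """Apply the five 'is the rack really full?' checks from Figure 2."""
    return (rack.free_u >= u
            and rack.weight_headroom_kg >= weight_kg
            and rack.power_headroom_a_kva >= load_kva   # both streams must be able
            and rack.power_headroom_b_kva >= load_kva   # to carry the load alone
            and rack.cooling_headroom_kw >= heat_kw
            and rack.free_network_ports >= ports)

rack = Rack(free_u=4, weight_headroom_kg=120, power_headroom_a_kva=0.9,
            power_headroom_b_kva=0.4, cooling_headroom_kw=1.0, free_network_ports=2)
# A 2U, 30kg, 0.5KVA server is rejected: the B stream lacks headroom,
# even though the rack has visible empty space.
print(can_host(rack, u=2, weight_kg=30, load_kva=0.5, heat_kw=0.5, ports=2))
```

A rack with empty U-space can still fail the power or cooling checks, which is exactly why "full" cannot be judged by walking the floor.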
The myth of the full data center

Data centers are considered full when they run out of one or more of their constrained resources. In the main, full data centers are hampered by a single constraint, usually either power or cooling. In many cases, this constraint can be released, offering us the opportunity to improve our data center ROI significantly. Real-life experience shows that it would be disappointing to see less than 10% and surprising to get more than 40% additional capacity ($230-$920 per square foot of value).

Figure 3: Simplified Power Distribution Scheme - building power (910KVA) feeding 3 x 300KVA UPS (power stream resilience omitted, 140 racks omitted for clarity)

The diagram shows a data center with 150 racks connected to 15 PDUs, connected to 3 UPS, connected to the main building power. At each stage, the data center and rack planners have added safety margin, misstated the real load, introduced rounding errors and have been left with stranded capacity. In a modern data center, with heavy virtualization combined with adaptive CPU and cooling systems, the actual load changes in real time as CPUs scale back under low compute load and fan speeds vary with the ambient temperature at the server air inlets. Boilerplate loads are almost always higher than the true load under real data center conditions. Given real-time monitoring and recording of power draw at rack, PDU and UPS level, it is possible to simultaneously optimize and protect the data center.

The table below is based on Figure 3 and needs some explanation. It shows the scale of the potential prize when applying best practice with properly designed tools such as Rackwise DCiM. By managing the excess capacity down as we bring power loading risk under control, we could find capacity for another 21 racks' worth of equipment just by re-planning the equipment layout (105KVA of spare capacity across the rows).
Following up, we could physically install another 15 racks, or possibly reconfigure the PDUs to deliver more power per rack, gaining an additional 6 or 13 racks' worth of IT equipment (this would involve installing additional electrical wiring and control gear). We might even consider fitting a few additional, partially populated PDUs and as many as 15 more racks to deliver a further 7 racks' worth from the partially loaded UPS equipment. In summary, we could deliver an additional 50 racks' worth of equipment (30% more) from our "full to capacity" data center.

Component | Boilerplate or calculated load | Actual load | Stranded & safety margin (design - boilerplate) | Design capacity | Available for expansion per component (design - actual) | Total available for expansion (all components)
Rack (x150) | 4.8KVA | 4.3KVA | 0.2KVA | 5.0KVA | 0.7KVA | 105KVA
PDU (x15) | 48KVA (10 racks) | 43KVA | 7KVA | 55KVA | 12KVA | 180KVA
UPS (x3) | 275KVA (5 PDUs) | 215KVA | 25KVA | 300KVA | 85KVA | 255KVA
Building | 825KVA (3 UPS) | 645KVA | 10KVA | 910KVA | 265KVA | 265KVA (plus stranded)
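The arithmetic behind the table is simple: at each level of the power chain, headroom per component is design capacity minus actual load, and the binding constraint is the level with the least total headroom. A minimal sketch using the worked example's figures (in practice the "actual" numbers would come from real-time DCIM power readings, not static estimates):

```python
# Headroom arithmetic for the 150-rack / 15-PDU / 3-UPS example above.
levels = {
    # name: (count, actual load in KVA each, design capacity in KVA each)
    "rack":     (150, 4.3, 5.0),
    "pdu":      (15, 43.0, 55.0),
    "ups":      (3, 215.0, 300.0),
    "building": (1, 645.0, 910.0),
}

for name, (count, actual, design) in levels.items():
    per_unit = design - actual      # available for expansion per component
    total = per_unit * count        # total available at this level
    print(f"{name}: {per_unit:.1f}KVA per unit, {total:.1f}KVA total")

# The level with the least total headroom caps expansion; dividing by the
# 5.0KVA per-rack design load gives the extra racks supportable today.
binding = min((design - actual) * count
              for count, actual, design in levels.values())
print(f"extra racks supportable: {int(binding // 5.0)}")  # -> 21
```

Here the rack level (105KVA total) binds before the PDUs, UPS or building feed, which reproduces the "21 racks from re-planning alone" figure in the text.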
The Genesis of a Data Center Failure

Let's take a real-life example: we have two Power Distribution Units (PDUs) resiliently supporting a row of racks with an A and a B power feed. All of the equipment in the racks is plugged into both power feeds, with a design objective of surviving either a PDU failure / single power feed circuit breaker trip, or of allowing one PDU to be taken out of operation for routine maintenance. This is good, practical critical facilities design and implementation.

Figure 4: Before PDU-A fails - PDU-A supplies half the load and provides backup to PDU-B; PDU-B supplies the other half of the load and provides backup to PDU-A. The row of racks is supplied resiliently from both PDU-A and PDU-B.

Figure 5: After PDU-A fails - PDU-A has failed or been taken out of production, and PDU-B now supplies the full load. The row of racks is supplied non-resiliently from PDU-B.

This is where process, metrics and attention to detail come into play. The humans in our rack planning and implementation teams were under pressure to deliver fast; as a result they did not pay enough attention in planning and, when adding a new server into one of the racks, they miscalculated the additional load. As a result, PDU-A is now running at 56% capacity and PDU-B at 54%. Everything goes well for months (maybe even years), as the PDUs are quite happy running at this level of loading. Then the inevitable happens: either an electrical failure in PDU-A causes an unplanned shutdown, or we need to perform maintenance on PDU-A and shut it down manually as part of the maintenance plan. Our data center is resilient and designed (at great expense) to deal with failure and planned maintenance, yet we have a catastrophic and unexpected event. We trigger a cascade failure as the additional load coming across from PDU-A takes the demand on PDU-B beyond its design capability; it shuts down, tripping its main breakers. The whole row of racks is now without power and we have a catastrophe.
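The failure condition in this story is easy to state and easy to check continuously: in an A/B pair, the surviving PDU must carry the combined load after a failover, so the pair is only safe while the two loads sum to at most one PDU's capacity. A minimal sketch of that check (the function name and percentage representation are illustrative):

```python
def failover_safe(load_a_pct: float, load_b_pct: float) -> bool:
    """In an A/B PDU pair, either unit must be able to absorb the other's
    load. With loads expressed as a percentage of one PDU's capacity, the
    survivor carries load_a + load_b after a failover, so the pair is safe
    only while the combined load stays at or below 100%."""
    return load_a_pct + load_b_pct <= 100.0

# The intended design point: each PDU under 50%, so failover is survivable.
print(failover_safe(48.0, 47.0))   # True
# The scenario from the text: 56% + 54% = 110%, so PDU-B trips on failover.
print(failover_safe(56.0, 54.0))   # False
```

Note that at 56% and 54% each PDU looks comfortably healthy in isolation; only the pairwise check, run against real-time power readings, reveals that resilience has already been silently lost.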
On another day, perhaps we don't get the load calculations wrong, but our new guy in the rack implementation team manages to plug both of the new server's power cords into the A power stream. It works quite happily like that for months or years, until the inevitable happens: the power from PDU-A is interrupted and our server fails.

Resilient processes must support resilient data center designs, and these must constantly monitor and report on misconfiguration before misconfigurations can cause critical failures. Real-time monitoring of the power draw would have highlighted the problem before it became a disaster.
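This particular misconfiguration is detectable from power draw alone: a correctly dual-corded server draws on both feeds, so a dual-corded server showing (near) zero draw on one feed is a standing failure waiting to happen. A sketch of such a check, assuming hypothetical outlet-level readings from metered PDU outlets:

```python
# Hypothetical per-outlet draw readings in watts, keyed by (server, feed).
# A real DCIM platform would collect these continuously from metered PDUs.
readings = {
    ("web-01", "A"): 210.0, ("web-01", "B"): 205.0,   # correctly dual-corded
    ("db-02",  "A"): 415.0, ("db-02",  "B"): 0.0,     # both cords on stream A
}

def miscabled(readings, threshold_w: float = 5.0):
    """Flag dual-corded servers drawing power on only one feed."""
    servers = {srv for srv, _ in readings}
    return sorted(
        srv for srv in servers
        if min(readings.get((srv, "A"), 0.0),
               readings.get((srv, "B"), 0.0)) < threshold_w
    )

print(miscabled(readings))   # ['db-02']
```

Run on every reading cycle, a check like this surfaces the "new guy plugged both cords into A" mistake the day it happens, rather than on the day PDU-A fails.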
If you don't (or can't) measure it, you can't expect to manage or protect it

The key to effective data center operation is a tightly managed blend of trained people, operating mature and tested processes, supported by real-time metrics and delivered on resilient critical facilities technology. Only by combining these can we prevent catastrophic data center outages. Spending money on the facility and leaving out the people and processes is a recipe for disaster. Getting the people and processes right without accurate, underpinning real-time data is simultaneously ineffective and risky.

This brings me neatly onto Data Center Infrastructure Management: the coalescence and integration of IT and critical facilities management, reporting and planning. Data Center Infrastructure Management systems are designed to co-ordinate facility configurations with real-time information in order to deliver an essential dashboard of data center health, efficiency, capacity and performance.

Data Center Compromises

Data center planners are forced to make compromise decisions between maximizing utilization and protecting availability because they lack the accurate data points they need to maintain optimal configurations. Availability figures strongly in the compromise, and as a result a large number of data centers are significantly underutilized, compromising the expected Return on Investment (ROI).

(Diagram: the compromise between Efficiency, Availability and ROI)

Data Center Infrastructure Management

We have seen that understanding the configuration and dependencies of the power feeds in our data center is critically important. We also know that static and theoretical calculations are prone to human error.
What Rackwise DCiM offers is a dynamic, real-time environment that delivers a number of business advantages:

- Enables proactive protection of critical systems
- Reduces the risk of cascade failures
- Reduces operating costs
- Improves data center capital ROI

Figure 6: The Rackwise DCiM prizes - improve return on capital, reduce risk, reduce operating costs, improve utilization
About Rackwise DCiM

Rackwise, Inc. (OTCBB: RACK.OB) is a leader in software development and marketing within the growing markets for IT infrastructure tracking, monitoring, modeling, and management. The Company's suite of product and service offerings enables clients to effectively manage today's high-density computing configurations and virtualized data centers while mitigating the risks associated with cascading faults within the infrastructure. Rackwise solutions integrate device-level, real-time monitoring with advanced modeling, reporting, analytics and other critical features and functionalities, providing users with robust, state-of-the-art platforms to responsibly optimize operations, maximize the cost efficiencies of their IT infrastructures, and progress toward achieving sustainable green data centers. Its branded Rackwise products, services and solutions are used by over 150 companies worldwide. For more information, visit http://www.rackwise.com or call 888.818.2385.

About Steve O'Donnell and The Hot Aisle

Steve O'Donnell has over 30 years of global data center experience and is globally recognized as a specialist in data centers and sustainability. He is an engineer and holds a number of patents in the data center cooling area. He works as a speaker, analyst and commentator, publishing on his blog http://www.thehotaisle.com and tweeting as @stephenodonnell. He was a member of the advisory board for Fusion-io, and he currently chairs the advisory boards of both Violin Memory and Rackwise. He serves on the Advisory Board for DC Professional Development, a division of Data Center Dynamics. He is founder and CEO at Chalet Tech, a database security business, and Chairman at Preventia, a UK-based security systems reseller. Steve brings real-life experience to his analysis of data centers, with a career IT budget of $30B. He was CEO at MEEZA, the leading data center and cloud services company based in Doha, Qatar and serving the MENA region.
He was a Senior Analyst at Enterprise Strategy Group, where he specialized in data center power & cooling, data center strategy and best practice. Steve ran IT infrastructure internationally for First Data Corp, the largest card-processing company in the world. He was global head of data centers at BT, running the largest data center estate and the biggest IT operation in Europe. He has a worldwide reputation as a thought leader in Green IT, having won six industry awards for his 21st Century Data Centre vision and design. Working with Harqs Singh, a member of the Rackwise Advisory Board, he developed The Green Grid Data Center Maturity Model (DCMM) for categorizing data centre efficiency.

[i] Source: Uptime Institute - 70% of data center failures are caused by human error.
[ii] Source: Data Center Dynamics Data Center Survey (2012).