Increasing Data Center Resilience While Lowering PUE
Nandini Mouli, Ph.D., President/Founder, esai LLC
mouli.nandini@gmail.com
www.esai.technology
Introduction: esai LLC
- esai LLC is a disadvantaged, woman-owned minority business focused on providing energy management solutions for federal and state government agencies.
- Core competencies and technologies:
  - Technical/business feasibility studies
  - Energy audits, commissioning
  - Dynamic pricing, demand response
  - Energy conservation measures
  - Distributed energy services, combined heat and power
  - Evaluation, validation and measurement
  - Microgrid integration
  - Utility, federal and state grants
  - Building management systems
- Experience in consulting on and implementing clean energy programs that meet DOE, EPA and FEMP policies and programs.
- Currently leading multiple projects to bring resiliency and energy conservation to federal agencies and private corporations.
Topics for Discussion
- What is resilience?
- Why is resilience critical for data centers?
- Dynamics of treating resilience
- Challenges to achieving data center resilience
- Some tools for achieving resilience
- What is DCIM?
- How is DCIM a resilience platform for:
  - Planning and implementation
  - Monitoring
  - Data collection
  - Dashboard visualization
- Getting the most out of DCIM tools
- Key take-aways
What is Resilience?
- TechTarget's definition of resilience: the ability of a server, network, storage system, or an entire data center to recover quickly and continue operating even when there has been an equipment failure, power outage or other disruption.
- In the context of cyber security: resilience is the ability of a system to resist illegitimate activity and to effect a speedy recovery.
Why Is Resilience Critical for Data Centers?
- Forrester Research: resilience is the #2 priority for facility directors:
  - Carrier availability and density: 82%
  - Availability, resilience: 80%
  - Control over facility: 78%
  - Access to cloud and other partners: 75%
- Lack of resilience is costly:
  - IBM Reputational Risk and IT Study: system outage is one of the top two IT risks that can harm an organization's reputation.
  - 91% of data centers have experienced an unplanned outage in the past 24 months.
  - The average cost per minute of data center downtime increased 38%, from $7,908 in 2013 to $11,000 in 2015.
  - Organizations that improve from "Laggard" to "Industry Average" levels of downtime can reduce losses by about $3 million per year.
Dynamics in Treating Resilience
- Achieving resilience used to mean redundancy: two (or more) of everything, including servers, power supplies, generators, and even whole data centers.
- But most of this duplicate equipment was never utilized. Wasted space and energy = increased PUE.
- The trend now is to increase resilience without the waste, by selecting software instead of hardware:
  - Fault tolerance built right into software
  - Resilience improved through load balancing, virtualization, prediction and other techniques
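To make the "waste = increased PUE" point concrete, here is a minimal sketch based on the standard definition of PUE (total facility energy divided by IT equipment energy). The function name and all energy figures are hypothetical, chosen only for illustration. The example assumes the waste appears as non-IT overhead, such as lightly loaded duplicate power and cooling equipment.

```python
def pue(it_energy_kwh: float, overhead_energy_kwh: float) -> float:
    """Power Usage Effectiveness = total facility energy / IT equipment energy."""
    if it_energy_kwh <= 0:
        raise ValueError("IT energy must be positive")
    return (it_energy_kwh + overhead_energy_kwh) / it_energy_kwh


# Hypothetical figures, for illustration only (not from the presentation).
# "before": overhead includes lightly loaded duplicate power and cooling gear.
before = pue(it_energy_kwh=1000.0, overhead_energy_kwh=800.0)
# "after": same IT load, with part of that overhead eliminated.
after = pue(it_energy_kwh=1000.0, overhead_energy_kwh=500.0)

print(f"PUE before: {before:.2f}, after: {after:.2f}")
```

A PUE of 1.0 would mean every unit of energy reaches the IT equipment, so any overhead removed for the same IT load moves the ratio toward 1.0.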
Challenges to Achieving Data Center Resilience
Measuring how vulnerable the data center is to failure, and fixing the potential problems, leads to increased uptime. However, several trends make this harder:
- Increase in the number of applications to be managed and backed up
- Organizations getting larger and more geographically dispersed
- More complex infrastructural ecosystems
- Decreasing hardware costs, which encourage organizations to keep backup and recovery in house, where it can be incompatible with other network software used to mitigate problems
- Increasing use of virtualization
- Increasing frequency and intensity of natural disasters
- Increasing risks
What Are Some Traditional Ways of Achieving Resilience?
Current methodologies, by failure type:
- Design failure: competent design firm, integration firm, construction companies and commissioning team
- Catastrophic failure: comprehensive maintenance and operation program
- Compounding failure: closer attention to the details of each and every possible failure mode
- Human-error failure: experienced staff, with training for everyone responsible; continuous training and execution with a pilot/co-pilot approach to operation
A conventional data center relies on a manual response plan and human teams.
Modern Tools for Achieving Resilience
- A modern data center needs the II dashboard.
- Because operations are so complex, IT and facility management cannot rely on the human component alone to combat failures that arise from a combination of two or three faults.
- IT and facility management have to align on predictive methods of disaster mitigation: DCIM.
What is Data Center Infrastructure Management (DCIM)?
- It is a software platform that helps operators safely manage the physical infrastructure and controls, with:
  - Higher visibility and transparency of IT and facilities operations
  - Quick identification and resolution of problems before they happen
- It maximizes the efficient use of power, cooling, and space capacities, now and in the future.
- Two core building blocks:
  - Asset management
  - Monitoring
DCIM - A Resiliency Platform: Physical Infrastructure/Controls
From device-level monitoring in a traditional data center system to context-aware monitoring, so that actions can be performed to mitigate a risk!
DCIM - Planning and Implementation Platform
Planning tools and functions:
- Display the impact of pending moves on power capacity and cooling distribution
- Show graphical representations of IT equipment and its location in the rack
- Proactively manage within rack and floor tile weight limits
- Correlate data between CRAC units, PDUs, and UPSs, so that the entire chain is monitored
- Simulate the consequences of power and cooling device failure on IT equipment through "what if?" scenarios
- Generate recommended installation locations for rack-mount IT equipment, selected on the basis of available power, cooling, space capacity, and network ports
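The last planning function, recommending an installation location from available power, cooling, space, and network ports, can be sketched roughly as follows. This is an illustrative sketch only, not how any particular DCIM product works. The `Rack` and `Device` types, their field names, the sample values, and the ranking rule (most power headroom left after the install) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Rack:
    name: str
    free_power_kw: float    # unallocated power capacity
    free_cooling_kw: float  # unallocated cooling capacity
    free_units: int         # free rack units (U)
    free_ports: int         # free network ports


@dataclass
class Device:
    power_kw: float
    heat_kw: float
    units: int
    ports: int


def fits(rack: Rack, device: Device) -> bool:
    """True if the rack has enough of all four resources for the device."""
    return (rack.free_power_kw >= device.power_kw
            and rack.free_cooling_kw >= device.heat_kw
            and rack.free_units >= device.units
            and rack.free_ports >= device.ports)


def recommend(racks: list[Rack], device: Device) -> list[Rack]:
    """Return feasible racks, ranked by power headroom left after the install."""
    candidates = [r for r in racks if fits(r, device)]
    return sorted(candidates,
                  key=lambda r: r.free_power_kw - device.power_kw,
                  reverse=True)


# Hypothetical inventory, for illustration only.
racks = [
    Rack("A", free_power_kw=2.0, free_cooling_kw=2.5, free_units=4, free_ports=2),
    Rack("B", free_power_kw=5.0, free_cooling_kw=4.0, free_units=10, free_ports=8),
    Rack("C", free_power_kw=6.0, free_cooling_kw=1.0, free_units=20, free_ports=12),
]
server = Device(power_kw=1.5, heat_kw=1.5, units=2, ports=2)

ranked = [r.name for r in recommend(racks, server)]
print(ranked)
```

Rack C has the most free power but is excluded because its cooling headroom is too small, which shows why all four constraints are checked before ranking.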
DCIM Monitoring and Automation Platform
- Alarming/notification: DCIM sends out an alarm from the rack before a breaker trips, giving the operator the opportunity to make adjustments before shut-down.
- Status: notes are generated on minimum, maximum, and average usage over time for each rack.
- Control: if a rack gets close to an overcapacity threshold, a predictive simulation can be triggered to determine the best way to alleviate the situation. Reports and graphs are generated to help diagnose the problem.
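The alarming and status functions above can be illustrated with a small sketch. The function names, the 80% warning fraction, and the sample readings are assumptions made for illustration. A real DCIM tool would take its thresholds from site policy and the breaker specifications.

```python
from statistics import mean


def rack_status(readings_amps: list[float]) -> dict[str, float]:
    """Summarize minimum, maximum, and average usage over time for one rack."""
    return {
        "min": min(readings_amps),
        "max": max(readings_amps),
        "avg": mean(readings_amps),
    }


def check_alarm(current_amps: float, breaker_rating_amps: float,
                warn_fraction: float = 0.8) -> str:
    """Classify the rack load relative to its breaker rating.

    warn_fraction is an assumed early-warning threshold: the alarm is raised
    before the breaker rating is reached, so the operator has time to react.
    """
    load = current_amps / breaker_rating_amps
    if load >= 1.0:
        return "critical"
    if load >= warn_fraction:
        return "warning"
    return "ok"


# Hypothetical readings for one rack on a 20 A breaker.
readings = [10.0, 12.5, 17.0]
status = rack_status(readings)
alarm = check_alarm(readings[-1], breaker_rating_amps=20.0)

print(status)
print(alarm)
```

With the latest reading at 17 A on a 20 A breaker, the load is 85% of the rating, so the sketch reports a warning while there is still headroom to rebalance.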
DCIM Monitoring and Automation Platform (contd.)
Comparison of primary and secondary functions
- Different DCIM applications treat a given data center feature as either a primary or a secondary function.
- Depending on the facility and its needs, care must be taken to select the right applications to include in the integrated suite.
DCIM - Data Collection Platform
- The data collection subset comprises devices such as meters, power protection devices, embedded cards, programmable logic controllers (PLCs), sensors and other such devices.
- These devices perform the fundamental function of gathering data and forwarding it to the management software for processing.
DCIM - Dashboard Platform
- With DCIM, key performance indicators are at the operator's fingertips:
  - When will I run out of power, and what is the current cooling capacity?
  - What is my current server utilization?
  - Do I have any servers that can be retired, and if so, which ones?
- The dashboard is the centerpiece for aggregating actionable data that can be shared quickly with decision-makers.
- The sample dashboard collects data across OT subsets and centralizes the information for access anytime, anywhere, and from any user interface: mobile, laptop, or PC.
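One simple way a dashboard could answer "When will I run out of power?" is to extrapolate a linear trend through recent peak readings. The sketch below assumes that approach. The function name, the monthly sampling interval, and the sample data are hypothetical, and a production tool would likely use a richer forecasting model.

```python
from typing import Optional


def months_until_capacity(monthly_peak_kw: list[float],
                          capacity_kw: float) -> Optional[float]:
    """Fit a least-squares line to monthly peak loads and extrapolate.

    Returns the number of months after the last sample at which the fitted
    line reaches capacity_kw, or None if the trend is flat or decreasing.
    """
    n = len(monthly_peak_kw)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(monthly_peak_kw) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean)
              for x, y in enumerate(monthly_peak_kw))
    slope = sxy / sxx
    if slope <= 0:
        return None  # no growth, so the line never reaches capacity
    intercept = y_mean - slope * x_mean
    month_at_capacity = (capacity_kw - intercept) / slope
    return max(0.0, month_at_capacity - (n - 1))


# Hypothetical peak loads (kW) for the last four months; 140 kW capacity.
remaining = months_until_capacity([100.0, 104.0, 108.0, 112.0], capacity_kw=140.0)
print(remaining)
```

Here the load grows by 4 kW per month from 112 kW, so the fitted line reaches 140 kW seven months after the last sample.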
DCIM- Dash Board Platform (Contd.) Another view
DCIM - Energy and Power Saving Platform
- DCIM provides an overview of facility energy use and cost, with a complete kW breakdown for each device.
- Cost savings are realized at every level: server, rack, row, room, building and beyond.
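The per-device breakdown and the server-to-room hierarchy can be illustrated by rolling device-level energy readings up into cost totals at each level. This is a minimal sketch. The record layout, the location names, and the flat price per kWh are assumptions made for illustration.

```python
from collections import defaultdict


def rollup_cost(readings, price_per_kwh):
    """Aggregate device energy cost at room, row, rack, and device level.

    readings: iterable of (room, row, rack, device, kwh) tuples.
    Returns a dict keyed by location path, for example ("R1", "A") for row A
    in room R1, with the total cost at that level as the value.
    """
    totals = defaultdict(float)
    for room, row, rack, device, kwh in readings:
        cost = kwh * price_per_kwh
        path = (room, row, rack, device)
        # Add the device cost to every ancestor level and to the device itself.
        for depth in range(1, len(path) + 1):
            totals[path[:depth]] += cost
    return dict(totals)


# Hypothetical monthly readings and tariff, for illustration only.
readings = [
    ("R1", "A", "A01", "srv1", 100.0),
    ("R1", "A", "A01", "srv2", 50.0),
    ("R1", "B", "B01", "srv3", 250.0),
]
costs = rollup_cost(readings, price_per_kwh=0.10)

print(round(costs[("R1",)], 2))              # whole room
print(round(costs[("R1", "A", "A01")], 2))   # one rack
```

Keying the totals by location path keeps the device, rack, row, and room views consistent, because they all derive from the same readings.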
DCIM Communication Platform
An Example of DCIM Integration
DCIM Offer in the Market: Suite and Non-Suite Providers
Comparison of Various DCIM Products
DCIM Market Trends
- The market is growing: from $240 million in 2011 to $1.2 billion in 2016.
- Growth in the data center segment is very high as facilities and IT come together to think about the business.
- Inhibitors to adoption:
  - Cost and functionality issues
  - Difficulty of creating and maintaining asset databases
  - A blind belief that a data center can be managed without software solutions
- Energy savings from well-managed data centers: operating expenses reduced by 20%.
Source: the 451 Group
How to Get the Highest Benefit from DCIM?
- There is a wide variety of options, so care must be taken to ensure the best fit.
- Look for a scalable, modular, standardized, pre-engineered, open communication architecture with a strong vendor support structure.
- Reach agreement between facilities, IT, and management on operating parameters, metrics, and goals for the data center power and cooling systems and their management.
- Review existing processes and compare them with DCIM requirements.
- Formally define new processes, commit resources, and assign specific owners.
Case Study Conclusions
- Data centers are complex systems that change constantly over time.
- Monitoring and measurement of capacity are not enough.
- Much lost capacity can be reclaimed using predictive modeling and state-of-the-art tools, supported by DCIM measurements.
Key Take-Aways - DCIM Benefits
- DCIM provides higher visibility, more control and improved automation:
  - Decision support and information management
  - Asset planning and implementation
  - Monitoring, measuring and alerting
  - Management and control
  - Fault-tolerant (fail-over) software services
- Final outcome: a more reliable and efficient data center, with higher resilience and decreased PUE.
THANK YOU!
Contact: Nandini Mouli, Ph.D., President/Founder, esai LLC
www.esai.technology
mouli.nandini@gmail.com
(443) 691 7664