Top 10 Mistakes in Data Center Operations
Eric Gallant, Regional Director, Lee Technologies
AGENDA
  Lee's Top 10
  Working Group Breakouts
    Group 1 - Feedback of Top 10
    Group 2 - What Can We Do?
  Conclusion
BIG MISTAKE #1: NOT INCLUDING THE OPERATIONS TEAM IN FACILITY DESIGN
  Why it's a mistake:
    The Operations Team can provide valuable input to avoid:
      Un-maintainable Systems
      Unsupportable IT loads
      Inefficient Space Planning
      Unmanageable Complexity
    The Operations Team can identify design elements that drive up OpEx
  An early partnership between design and operations will lead to better design, resulting in:
    Decreased TCO
    Increased Reliability
    Increased Operability
BIG MISTAKE #2: RELYING TOO MUCH ON DATA CENTER DESIGN
  A robust design with a high level of redundancy does not justify the lack of a quality Operations and Maintenance (O&M) program.
  According to the Uptime Institute's new paper on Operational Sustainability, higher Tier facilities require more operational support, and "The installed infrastructure alone cannot ensure the long-term viability of the site unless Operational Sustainability behaviors are addressed."
  Very, very few organizations can successfully operate with a substandard or (god forbid!) break/fix maintenance scheme.
BIG MISTAKE #2: WHAT CAN BE RELIED ON?
  An operational mindset that places a priority on:
    Performance - Operational continuity is a core business requirement
    Availability - 100% uptime without any plant shutdowns
    System Complexity - Redundant systems, failover automation and emergency recovery procedures
    Accountability - Process documentation, change control and auditable records
  An operational methodology that includes a solid foundation in:
    Layered Quality Control
    Formalized Processes and Procedures
    Auditable Documentation
    On-going Training
    Qualified Personnel
BIG MISTAKE #3: FAILURE TO CORRECTLY ADDRESS STAFFING REQUIREMENTS
  Many companies base data center facilities staffing on office building requirements (M-F, 0800-1700).
  Many more depend on their IT staff to manage and supervise core infrastructure O&M.
  Dedicated, trained facility operations staff are vital to operational sustainability.
BIG MISTAKE #3: HOW ARE STAFFING NEEDS CORRECTLY ADDRESSED?
  Staffing levels should be based on:
    Risk Profile - Cost of downtime. When there's an infrastructure problem, is there time to have staff drive in?
    Business Requirements - Global footprint? 24x7 operations?
    Infrastructure Complexity and Facility Size - The hours required for proper maintenance add up quickly
    Operating Budget
  Hire, develop and keep the right people.
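To make the 24x7 staffing point concrete, here is a minimal arithmetic sketch of how many people it takes to keep a position filled around the clock. The function name and both inputs are hypothetical planning parameters chosen for illustration, not figures from the presentation; real staffing models also weigh risk profile, site complexity and budget.

```python
import math

HOURS_PER_WEEK = 24 * 7  # 168 hours in a week


def fte_for_continuous_coverage(posts, productive_hours_per_week):
    """Return the head count needed to keep `posts` positions filled 24x7.

    `posts` is the number of positions that must be staffed at all times.
    `productive_hours_per_week` is the average number of hours one person
    is actually on shift per week (an assumed planning input).
    """
    if posts < 0 or productive_hours_per_week <= 0:
        raise ValueError("posts must be >= 0 and productive hours must be > 0")
    # Total coverage hours divided by what one person can supply, rounded up
    # because a fraction of a person cannot be hired.
    return math.ceil(posts * HOURS_PER_WEEK / productive_hours_per_week)


if __name__ == "__main__":
    print(fte_for_continuous_coverage(posts=1, productive_hours_per_week=40))
```

Under these assumptions, a single post staffed by people working 40 hours a week needs five people (168 / 40 = 4.2, rounded up) before any allowance for leave, training or sick time. An office-hours staffing model misses this multiplier.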
BIG MISTAKE #4: FAILURE TO TRAIN AND DEVELOP YOUR TALENT
  Many companies find it difficult to justify the expense and the time required to develop and implement a quality training plan.
  Many companies rely on vendor- or contractor-provided component training or startup training.
  On-the-job (OTJ) training can lead to shortcuts and poor methodologies.
  Lack of training and support can lead to low job satisfaction and high staff turnover.
  Staff turnover is costly in terms of:
    Vulnerability while understaffed
    Knowledge loss
    Training costs
    Hiring costs
BIG MISTAKE #4: BENEFITS OF A STAFF TRAINING PLAN
  Timely and correct operational activities leading to maximized uptime
  Improved SAFETY
  Costs and time to implement are offset by increased uptime, lower maintenance costs and increased retention
  Training programs need to be viewed as an investment in the overall business
BIG MISTAKE #5: FAILURE TO CONSISTENTLY DRILL AND TEST SKILLS
  Any professional who needs to respond quickly and correctly in the event of an emergency should have an aggressive program of drills and tests.
    Sailors, Firefighters, EMTs, Data Center Operators, Regional Sales Directors
  Emergency responses should be second nature
  A higher level of readiness from an aggressive drill and test plan results in operational efficiency and financial benefits for the corporation
  SAFETY!
  Continuously verify training through testing
BIG MISTAKE #5: DRILL AND TEST CURRICULUM
  Drills for emergency procedures
  Develop theory of operation for major systems
  O&M training modules
  Take advantage of preventive maintenance activities to simulate equipment failures
  Procedure walkthroughs prior to complex maintenance activities
  Exams for multiple levels
  Note: A data center can be a challenging environment to conduct drills in.
BIG MISTAKE #6: FAILURE TO OVERLAY YOUR PROGRAM WITH DOCUMENTED PROCESSES AND PROCEDURES
  You can't manage what you don't measure
  You can't improve performance without benchmarking
  Document library forms the foundation for corrective actions
  Maintenance of document library promotes continuous improvement
BIG MISTAKE #6: FAILURE TO OVERLAY YOUR PROGRAM WITH DOCUMENTED PROCESSES AND PROCEDURES
  Examples:
    Equipment Lists
    As-built Drawings
    Commissioning Documents
    Change Control Documents
    Walk-through Reports
    Preventive Maintenance Reports
    Corrective Maintenance Reports
    Maintenance Scopes of Work
BIG MISTAKE #7: FAILURE TO IMPLEMENT APPROPRIATE PROCESSES AND PROCEDURES
  Examples of necessary procedural documents:
    Change Control Documents
    Standard Operating Procedure (SOP) - functional or administrative
    Method of Procedure (MOP) - detailed, step-by-step when working on or around critical load
    Emergency Operating Procedure (EOP) - get to safe condition, restore redundancy and isolate trouble
  Where these documents help:
    Vendor management - minimize unnecessary risk
    Emergency response - mitigate damage, implement a lessons-learned exchange
BIG MISTAKE #8: FAILURE TO IMPLEMENT AND DEVELOP QUALITY SYSTEMS
  Even once-proven processes can be fallible
  Changes to one system may affect multiple systems
  Quality Assurance (QA) - ensures errors are not introduced
  Quality Control (QC) - proactively identifies potential issues
  Iterative process. Fine tuning is essential to program success
BIG MISTAKE #9: FAILURE TO USE SOFTWARE MANAGEMENT TOOLS
  Spreadsheets and poor documents introduce risk
  External audits and evaluations?
  Computerized Maintenance Management System (CMMS) - scheduling, assignment, tracking
  Document Management System (DMS) - electronic storage and retrieval
    MOPs, SOPs, One-lines, ERP, Maintenance Schedule
TYPICAL QUARTERLY SITE ACTIVITIES 50,000 SQ FT FACILITY
BIG MISTAKE #10: THINKING YOU CAN BUILD A BEST OF BREED PROGRAM AS QUICKLY AS THE DATA CENTER
  Building one from scratch
    Time, resources, internal expertise?
    Insource/Outsource
  Areas Requiring Significant Investment:
    Personnel
    Training
    Software Management System
    Procedural Development
    Process Integration
BIG MISTAKE #10: WHAT ARE THE COMPONENTS OF A BEST OF BREED PROGRAM?
  Personnel
  Training
  Documentation
  Processes and Procedures
  Emergency Response
  Quality Control
  CMMS
  DMS
  Regulatory Conformance
  Process Integration
BREAKOUT
  Group 1 - Top 10 Review
    What would your Top X be?
    In what sequence would you list your Top X? Why?
    What are the leading constraints to avoiding these mistakes?
  Group 2 - What Can We, As a Professional Association, Do?
    Training?
    Information sharing?
    Industry groups?
Group 1 - Feedback of Top 10
Group 2 - What Can We Do?
Conclusion